361 Commits

Author SHA1 Message Date
Yaxun (Sam) Liu
d7e1932f85
[HIP] Fix comdat of template kernel handle (#66283)
Currently, clang emits LLVM IR that fails verifier for the following
code:

```
template<typename T>
__global__ void foo(T x);

void bar() {
  foo<<<1, 1>>>(0);
}
```
This is due to clang putting the kernel handle for foo into comdat,
which is not allowed, since the kernel handle is a declaration.

The siutation is similar to calling a declaration-only template
function. The callee will be a declaration in LLVM IR and won't be put
into comdat. This is in contrast to calling a template function with
body, which will be put into comdat.

Fixes: SWDEV-419769
2023-09-14 15:56:02 -04:00
Joseph Huber
49ff6a96a7
[Clang] Define AMDGPU ABI when referenced in CodeGen for ABI "none" (#66162)
Summary:
We use the `llvm.amgcn.abi.version` varaible to control code generation.
This is emitted in every module now to indicate what should be used when
compiling. Previously, the logic caused us to emit an external reference
to this variable when creating the code for the `none` type. This would
then cause us not to emit the actual definition. This patch refines the
logic to create the external reference, and then update it if it is
found unset by the time we emit the global. I had to remove the
reference to `GetOrCreateLLVmGlobal` because it did not accept the
proper address space.
2023-09-13 08:31:31 -05:00
Saiyedul Islam
466a8149b3
Revert "[AMDGPU] Make default AMDHSA Code Object Version to be 5 (#65410)" (#66060)
This reverts commit 0a8d17e79b02a92814a2a788d79df1f54d70ec3e.
2023-09-12 15:13:59 +05:30
Saiyedul Islam
0a8d17e79b
[AMDGPU] Make default AMDHSA Code Object Version to be 5 (#65410)
Also update LIT tests and docs.
For more details, see
https://llvm.org/docs/AMDGPUUsage.html#code-object-v5-metadata

Reviewed By: arsenm, jhuber6

Github PR: #65410

Differential Revision: https://reviews.llvm.org/D129818
2023-09-12 13:53:31 +05:30
Yaxun (Sam) Liu
f2a1331a01
[CUDA][HIP] Do not mark extern shared var (#65990)
Fixes: https://github.com/llvm/llvm-project/issues/65806

Currently clang put extern shared var ODR-used by host device functions
in global var __clang_gpu_used_external. This behavior was due to
https://reviews.llvm.org/D123441. However, clang should not do that for
extern shared vars since their addresses are per warp, therefore cannot
be accessed by host code.
2023-09-11 17:04:55 -04:00
Juan Manuel MARTINEZ CAAMAÑO
d60c47476d [Clang] Propagate target-features if compatible when using mlink-builtin-bitcode
Buitlins from AMD's device-libs are compiled without specifying a
target-cpu, which results in builtins without the target-features
attribute set.

Before this patch, when linking this builtins with -mlink-builtin-bitcode
the target-features were not propagated in the incoming builtins.

With this patch, the default target features are propagated
if they are compatible with the target-features in the incoming builtin.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D159206
2023-09-08 11:20:16 +02:00
Yaxun (Sam) Liu
9b7763821a
Reland "[CUDA][HIP] Fix overloading resolution in global var init" (#65606)
https://reviews.llvm.org/D158247 caused regressions for HIP on Windows
and was reverted.

A reduced test case is:

```
typedef void (__stdcall* funcTy)();
void invoke(funcTy f);

static void __stdcall callee() noexcept {
}

void foo() {
   invoke(callee);
}
```

It is due to clang missing handling host/device attributes for calling
convention at a few places

This patch fixes that.
2023-09-07 23:18:30 -04:00
Yaxun (Sam) Liu
27313b68ef Revert "[CUDA][HIP] Fix overloading resolution in global variable initializer"
This reverts commit de0df639724b10001ea9a74539381ea494296be9.

It was reverted due to regression in HIP unit test on Windows:

 In file included from C:\hip-tests\catch\unit\graph\hipGraphClone.cc:37:

 In file included from C:\hip-tests\catch\.\include\hip_test_common.hh:24:

 In file included from C:\hip-tests\catch\.\include/hip_test_context.hh:24:

 In file included from C:/install/native/Release/x64/hip/include\hip/hip_runtime.h:54:

 C:/dk/win\vc\14.31.31107\include\thread:76:70: error: cannot initialize a parameter of type '_beginthreadex_proc_type' (aka 'unsigned int (*)(void *) __attribute__((stdcall))') with an lvalue of type 'const unsigned int (*)(void *) noexcept __attribute__((stdcall))': different exception specifications

    76 |             reinterpret_cast<void*>(_CSTD _beginthreadex(nullptr, 0, _Invoker_proc, _Decay_copied.get(), 0, &_Thr._Id));

       |                                                                      ^~~~~~~~~~~~~

 C:\hip-tests\catch\unit\graph\hipGraphClone.cc:290:21) &>' requested here

    90 |         _Start(_STD forward<_Fn>(_Fx), _STD forward<_Args>(_Ax)...);

       |         ^

 C:\hip-tests\catch\unit\graph\hipGraphClone.cc:290:21) &, 0>' requested here

   311 |     std::thread t(lambdaFunc);

       |                 ^

 C:/dk/win\ms_wdk\e22621\Include\10.0.22621.0\ucrt\process.h:99:40: note: passing argument to parameter '_StartAddress' here

    99 |     _In_      _beginthreadex_proc_type _StartAddress,

       |                                        ^

 1 error generated when compiling for gfx1030.
2023-08-31 09:02:49 -04:00
Yaxun (Sam) Liu
de0df63972 [CUDA][HIP] Fix overloading resolution in global variable initializer
Currently, clang does not resolve certain overloaded functions correctly in the initializer
of global variables, e.g.

template<typename T1, typename U>
T1 mypow(T1, U);

__attribute__((device)) double mypow(double, int);

double t_extent = mypow(1.0, 2);

In the above example, mypow is supposed to resolve to the host version
but clang resolves it to the device version instead, and emits an error
(https://godbolt.org/z/17xxzaa67).

However, if the variable is assigned in a host function, there is no error.
The discrepancy in overloading resolution inside and outside of
a function is due to clang not accounting for the host/device target
when resolving functions called in the initializer of a global variable.

This patch introduces a global host/device target context for CUDA/HIP
for functions called outside of functions. For global variable initialization,
it is determined by the host/device attribute of the variable. For other
situations, a default value of host_device is sufficient.

Reviewed by: Artem Belevich

Differential Revision: https://reviews.llvm.org/D158247

Fixes: SWDEV-416731
2023-08-29 10:17:24 -04:00
Saiyedul Islam
f616c3eeb4
[OpenMP][DeviceRTL][AMDGPU] Support code object version 5
Update DeviceRTL and the AMDGPU plugin to support code
object version 5. Default is code object version 4.

CodeGen for __builtin_amdgpu_workgroup_size generates code
for cov4 as well as cov5 if -mcode-object-version=none
is specified. DeviceRTL compilation passes this argument
via Xclang option to generate abi-agnostic code.

Generated code for the above builtin uses a clang
control constant "llvm.amdgcn.abi.version" to branch on
the abi version, which is available during linking of
user's OpenMP code. Load of this constant gets eliminated
during linking.

AMDGPU plugin queries the ELF for code object version
and then prepares various implicitargs accordingly.

Differential Revision: https://reviews.llvm.org/D139730

Reviewed By: jhuber6, yaxunl
2023-08-29 06:35:44 -05:00
Artem Belevich
8f8df788ae Added missing test constraints. 2023-08-18 11:39:11 -07:00
Artem Belevich
72757343fa [CUDA/NVPTX] Improve handling of memcpy for -Os compilations.
We had some instances when LLVM would not inline fixed-count memcpy and ended up
attempting to lower it a a libcall, which would not work on NVPTX as there's no
standard library to call.

The patch relaxes the threshold used for -Os compilation so we're always allowed
to inline memory copy functions.

Differential Revision: https://reviews.llvm.org/D158226
2023-08-18 11:27:36 -07:00
Changpeng Fang
d77c62053c [clang][AMDGPU]: Don't use byval for struct arguments in function ABI
Summary:
  Byval requires allocating additional stack space, and always requires an implicit copy to be inserted in codegen,
where it can be difficult to optimize. In this work, we use byref/IndirectAliased promotion method instead of
byval with the implicit copy semantics.

Reviewers:
  arsenm

Differential Revision:
  https://reviews.llvm.org/D155986
2023-08-11 16:37:42 -07:00
Matt Arsenault
9e3d9c9eae clang: Add __builtin_elementwise_sqrt
This will be used in the opencl builtin headers to provide direct
intrinsic access with proper !fpmath metadata.

https://reviews.llvm.org/D156737
2023-08-11 19:32:39 -04:00
Yaxun (Sam) Liu
247cc265e7 [CUDA][HIP] Fix overloading resolution of delete operator
Currently clang does not consider host/device preference
when resolving delete operator in the file scope, which
causes device operator delete selected for class member
initialization.

Reviewed by: Artem Belevich

Differential Revision: https://reviews.llvm.org/D156795
2023-08-08 09:50:24 -04:00
Joseph Huber
0ba9aec38f [Clang][NVPTX] Permit use of the alias attribute for NVPTX targets
The patch in D155211 added basic support for the `.alias` keyword in
PTX. This means we should be able to permit use of this in clang.

Reviewed By: tra

Differential Revision: https://reviews.llvm.org/D156014
2023-08-07 14:08:04 -05:00
Yaxun (Sam) Liu
ac72531043 [Driver] Add -f[no-]offload-uniform-block
By default, clang assumes HIP kernels are launched with uniform block size,
which is the case for kernels launched through triple chevron or
hipLaunchKernelGGL. Clang adds uniform-work-group-size function attribute
to HIP kernels to allow the backend to do optimizations on that.

However, in some rare cases, HIP kernels can be launched
through hipExtModuleLaunchKernel where global work size is specified,
which may result in non-uniform block size.

To be able to support non-uniform block size for HIP kernels,
an option `-f[no-]offload-uniform-block is added. This option
is generic for offloading languages. Its default value is on for
CUDA/HIP and off otherwise.

Make -cl-uniform-work-group-size an alias to -foffload-uniform-block.

Reviewed by: Siu Chi Chan, Matt Arsenault, Fangrui Song, Johannes Doerfert

Differential Revision: https://reviews.llvm.org/D155213

Fixes: SWDEV-406592
2023-07-27 16:36:02 -04:00
Yaxun (Sam) Liu
8ef04488d1 Reland "[CUDA][HIP] Use the same default language std as C++""
Reland after fixing regression in lldb.

Differential Revision: https://reviews.llvm.org/D155539
2023-07-20 12:02:33 -04:00
Sylvestre Ledru
443f490b6a Revert "[CUDA][HIP] Use the same default language std as C++"
This reverts commit 2d1d07152bd26b001dedec3400b4b01d3bb11622.
2023-07-20 10:54:54 +02:00
Yaxun (Sam) Liu
2d1d07152b [CUDA][HIP] Use the same default language std as C++
Currently CUDA/HIP defines their own language standards in
LanguageStandards.def but they are redundant. They are the same as stdc++14.
The fact that CUDA/HIP uses c++* in option -std= indicates that they have the
same language standards as C++. The CUDA/HIP specific language features are
conveyed through language options, not language standards features. It makes
sense to let CUDA/HIP uses the same default language standard as C++.

Reviewed by: Siu Chi Chan, Artem Belevich

Differential Revision: https://reviews.llvm.org/D155539

Fixes: SWDEV-407685
2023-07-19 23:54:58 -04:00
Yaxun (Sam) Liu
f0a955d3fa [HIP] Rename predefined macros
Rename HIP_API_PER_THREAD_DEFAULT_STREAM and __HIP_NO_IMAGE_SUPPORT
so that they follow the convention with prefix and postfix __.

Reviewed by: Artem Belevich

Differential Revision: https://reviews.llvm.org/D155480
2023-07-17 15:45:49 -04:00
Matt Arsenault
bac2a07540 clang: Attach !fpmath metadata to __builtin_sqrt based on language flags
OpenCL and HIP have -cl-fp32-correctly-rounded-divide-sqrt and
-fno-hip-correctly-rounded-divide-sqrt. The corresponding fpmath metadata
was only set on fdiv, and not sqrt. The backend is currently underutilizing
sqrt lowering options, and the responsibility is split between the libraries
and backend and this metadata is needed.

CUDA/NVCC has -prec-div and -prev-sqrt but clang doesn't appear to be
aiming for compatibility with those. Don't know if OpenMP has a similar
control.
2023-07-14 18:46:18 -04:00
boxu.zhang
f05b58a946 [clang] Support '-fgpu-default-stream=per-thread' for NVIDIA CUDA
I'm using clang to compile CUDA code. And just found that clang doesn't support the per-thread stream option for NV CUDA. I don't know if there is another solution.

Reviewed By: tra

Differential Revision: https://reviews.llvm.org/D154822
2023-07-13 16:54:57 -07:00
root
250f2bb2c6 adding bf16 support to NVPTX
Currently, bf16 has been scatteredly added to the PTX codegen. This patch aims to complete the set of instructions and code path required to support bf16 data type.

Reviewed By: tra

Differential Revision: https://reviews.llvm.org/D144911

Co-authored-by: Artem Belevich <tra@google.com>
2023-06-28 11:57:13 -07:00
pvanhout
23431b5246 [clang][CodeGen] Fix GPU-specific attributes being dropped by bitcode linking
Device libs make use of patterns like this:
```
__attribute__((target("gfx11-insts")))
static unsigned do_intrin_stuff(void)
{
  return __builtin_amdgcn_s_sendmsg_rtnl(0x0);
}
```
For functions that are assumed to be eliminated if the currennt GPU target doesn't support them.
At O0 such functions aren't eliminated by common optimizations but often by AMDGPURemoveIncompatibleFunctions instead, which sees the "+gfx11-insts" attribute on, say, GFX9 and knows it's not valid, so it removes the function.

D142907 accidentally made it so such attributes were dropped during bitcode linking, making it impossible for RemoveIncompatibleFunctions to catch the functions and causing ISel to catch fire eventually.

This fixes the issue and adds a new test to ensure we don't accidentally fall into this trap again.

Fixes SWDEV-403642

Reviewed By: arsenm, yaxunl

Differential Revision: https://reviews.llvm.org/D152251
2023-06-07 15:51:52 +02:00
Yaxun (Sam) Liu
f2677afe91 [CUDA][HIP] Externalize device var in anonymous namespace
Device variables in an anonymous namespace may be
referenced by host code, therefore they need to
be externalized in a similar way as a static device
variables or kernels in an anonymous namespace.

Fixes: https://github.com/ROCm-Developer-Tools/HIP/issues/3246

Reviewed by: Artem Belevich

Differential Revision: https://reviews.llvm.org/D152164
2023-06-06 12:03:48 -04:00
Artem Belevich
dc90f42ea7 Coalesce 16-bit FP types to use integer register classes.
i16/f16/bf16 will use the same .b16 registers and
i32/v2f16 and v2bf16 will share .b32 registers.

The changes are mostly mechanical, intended to remove unnecessary register
classes which tend to produce redundant register moves.

Differential Revision: https://reviews.llvm.org/D151601

v2f16 regtype conversion to i32
2023-06-05 12:21:52 -07:00
Yaxun (Sam) Liu
00448a548c [clang] Allow fp in atomic fetch max/min builtins
LLVM IR already allows floating point type in atomicrmw.
Update clang atomic fetch max/min builtins to accept
floating point type like we did for fetch add/sub.

Reviewed by: Artem Belevich

Differential Revision: https://reviews.llvm.org/D150985

Fixes: SWDEV-401056
2023-05-31 15:19:31 -04:00
Luke Drummond
e3fbede7f3 [HIP] Add missing __hip_atomic_fetch_sub support
The rest of the fetch/op intrinsics were added in e13246a2ec3 but sub
was conspicuous by its absence.

Reviewed By: yaxunl

Differential Revision: https://reviews.llvm.org/D151701
2023-05-30 22:22:43 +01:00
M. Zeeshan Siddiqui
e621757365 [Clang][BFloat16] Upgrade __bf16 to arithmetic type, change mangling, and extend excess precision support
Pursuant to discussions at
https://discourse.llvm.org/t/rfc-c-23-p1467r9-extended-floating-point-types-and-standard-names/70033/22,
this commit enhances the handling of the __bf16 type in Clang.
- Firstly, it upgrades __bf16 from a storage-only type to an arithmetic
  type.
- Secondly, it changes the mangling of __bf16 to DF16b on all
  architectures except ARM. This change has been made in
  accordance with the finalization of the mangling for the
  std::bfloat16_t type, as discussed at
  https://github.com/itanium-cxx-abi/cxx-abi/pull/147.
- Finally, this commit extends the existing excess precision support to
  the __bf16 type. This applies to hardware architectures that do not
  natively support bfloat16 arithmetic.
Appropriate tests have been added to verify the effects of these
changes and ensure no regressions in other areas of the compiler.

Reviewed By: rjmccall, pengfei, zahiraam

Differential Revision: https://reviews.llvm.org/D150913
2023-05-27 13:33:50 +08:00
Artem Belevich
25708b3df6 [NVPTX, CUDA] barrier intrinsics and builtins for sm_90
Differential Revision: https://reviews.llvm.org/D151363
2023-05-25 11:57:57 -07:00
Artem Belevich
0a0bae1e9f [CUDA] plumb through new sm_90-specific builtins.
Differential Revision: https://reviews.llvm.org/D151168
2023-05-25 11:57:56 -07:00
Sergei Barannikov
f46b0e6d75 [clang] Convert a few tests to opaque pointers
Reviewed By: nikic

Differential Revision: https://reviews.llvm.org/D150520
2023-05-14 21:00:15 +03:00
Matt Arsenault
bc37be1855 LangRef: Add "dynamic" option to "denormal-fp-math"
This is stricter than the default "ieee", and should probably be the
default. This patch leaves the default alone. I can change this in a
future patch.

There are non-reversible transforms I would like to perform which are
legal under IEEE denormal handling, but illegal with flushing zero
behavior. Namely, conversions between llvm.is.fpclass and fcmp with
zeroes.

Under "ieee" handling, it is legal to translate between
llvm.is.fpclass(x, fcZero) and fcmp x, 0.

Under "preserve-sign" handling, it is legal to translate between
llvm.is.fpclass(x, fcSubnormal|fcZero) and fcmp x, 0.

I would like to compile and distribute some math library functions in
a mode where it's callable from code with and without denormals
enabled, which requires not changing the compares with denormals or
zeroes.

If an IEEE function transforms an llvm.is.fpclass call into an fcmp 0,
it is no longer possible to call the function from code with denormals
enabled, or write an optimization to move the function into a denormal
flushing mode. For the original function, if x was a denormal, the
class would evaluate to false. If the function compiled with denormal
handling was converted to or called from a preserve-sign function, the
fcmp now evaluates to true.

This could also be of use for strictfp handling, where code may be
changing the denormal mode.

Alternative name could be "unknown".

Replaces the old AMDGPU custom inlining logic with more conservative
logic which tries to permit inlining for callees with dynamic handling
and avoids inlining other mismatched modes.
2023-04-29 08:44:59 -04:00
Artem Belevich
2aa90da012 [CUDA] Update cached kernel handle when the function instance changes.
Fixes clang crash caused by a stale function pointer.

The bug has been present for a pretty long time, but we were lucky not to
trigger it until  D140663.

Differential Revision: https://reviews.llvm.org/D146448
2023-03-21 15:36:12 -07:00
Nikita Popov
06621ecdaf [Clang] Convert some tests to opaque pointers (NFC) 2023-02-17 15:08:50 +01:00
Matt Arsenault
7cbbbc0a04 clang: Rename misleading test name
This is a test for attribute propagation, not metadata
2023-02-15 08:32:57 -04:00
Yaxun (Sam) Liu
f4d8b8781d [AMDGPU ASAN] Remove reference to asan bitcode library
The asan functions are now attributed as used
in the device library, no need to keep the
declaration of asan device preserve function.

Patch by: Praveen Velliengiri

Reviewed by: Yaxun Liu

Differential Revision: https://reviews.llvm.org/D143495
2023-02-14 11:52:41 -05:00
Daniele Castagna
32c26e27b6 CUDA/HIP: Use kernel name to map to symbol
Currently CGCUDANV uses an llvm::Function as a key to map kernels to a
symbol in host code.  HIP adds one level of indirection and uses the
llvm::Function to map to a global variable that will be initialized to
the kernel stub ptr.

Unfortunately there is no garantee that the llvm::Function created
by GetOrCreateLLVMFunction will be the same.  In fact, the first
time we encounter GetOrCrateLLVMFunction for a kernel, the type
might not be completed yet, and the type of llvm::Function will be
a generic {}, since the complete type is not required to get a symbol
to a function.  In this case we end up creating two global variables,
one for the llvm::Function with the incomplete type and one for the
function with the complete type. The first global variable will be
declared by not defined, resulting in a linking error.

This change uses the mangled name of the llvm::Function as key in the
KernelHandles map, in this way the same llvm::Function will be
associated to the same kernel handle even if they types are different.

Reviewed By: yaxunl

Differential Revision: https://reviews.llvm.org/D140663
2023-01-19 15:02:14 -08:00
Nikita Popov
b164b047f2 [Clang] Convert test to opaque pointers (NFC) 2023-01-17 10:08:57 +01:00
Nikita Popov
02856565ac [Clang] Emit noundef metadata next to range metadata
To preserve the previous semantics after D141386, adjust places
that currently emit !range metadata to also emit !noundef metadata.
This retains range violation as immediate undefined behavior,
rather than just poison.

Differential Revision: https://reviews.llvm.org/D141494
2023-01-12 10:03:05 +01:00
Matt Arsenault
2e9c663ab4 clang/AMDGPU: Add missing tests for some builtin
These were tested under opencl but need hip testing for the potential
addrspacecasts.
2023-01-10 13:07:01 -05:00
Pierre van Houtryve
678d8946ba [AMDGPU] Add bf16 storage support
- [Clang] Declare AMDGPU target as supporting BF16 for storage-only purposes on amdgcn
  - Add Sema & CodeGen tests cases.
  - Also add cases that D138651 would have covered as this patch replaces it.
- [AMDGPU] Add BF16 storage-only support
  - Support legalization/dealing with bf16 operations in DAGIsel.
  - bf16 as a type remains illegal and is represented as i16 for storage purposes.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D139398
2022-12-13 10:34:26 -05:00
Nikita Popov
0419465fa4 [Clang] Update some CUDA tests to opaque pointers (NFC) 2022-12-13 11:50:08 +01:00
Nikita Popov
9466b49171 [Clang] Convert various tests to opaque pointers (NFC)
These were all tests where no manual fixup was required.
2022-12-12 17:11:46 +01:00
Yaxun (Sam) Liu
262c3034bb Revert "[AMDGPU] Disable bool range metadata to workaround backend issue"
This reverts commit 107ee2613063183cb643cef97f0fad403508c9f0 to facilitate
investigating and fixing the root cause.

Differential Revision: https://reviews.llvm.org/D135269
2022-12-08 17:35:10 -05:00
Ron Lieberman
ca856fff1c Revert "enable code-object-version=5"
very sorry wrong repo.

This reverts commit d882ba7aeac4b496dccd1b10cb58bd691786b691.
2022-11-29 15:21:09 -06:00
Ron Lieberman
d882ba7aea enable code-object-version=5 2022-11-29 15:11:57 -06:00
Pierre van Houtryve
c05f1639f7 [clang][cuda/hip] Allow __noinline__ lambdas
D124866 seem to have had an unintended side effect: __noinline__ on lambdas was no longer accepted.

This fixes the regression and adds a test case for it.

Reviewed By: aaron.ballman

Differential Revision: https://reviews.llvm.org/D137251
2022-11-04 07:33:31 +00:00
Artem Belevich
0e8a414ab3 [CUDA, NVPTX] Added basic __bf16 support for NVPTX.
Recent Clang changes expose _bf16 types for SSE2-enabled host compilations and
that makes those types visible furing GPU-side compilation, where it currently
fails with Sema complaining that __bf16 is not supported.

Considering that __bf16 is a storage-only type, enabling it for NVPTX if it's
enabled on the host should pose no issues, correctness-wise.

Recent NVIDIA GPUs have introduced bf16 support, so we'll likely grow better
support for __bf16 on NVPTX going forward.

Differential Revision: https://reviews.llvm.org/D136311
2022-10-25 11:08:06 -07:00