By default, clang assumes HIP kernels are launched with uniform block size,
which is the case for kernels launched through triple chevron or
hipLaunchKernelGGL. Clang adds the uniform-work-group-size function attribute
to HIP kernels so that the backend can optimize based on it.
However, in some rare cases, HIP kernels can be launched
through hipExtModuleLaunchKernel where global work size is specified,
which may result in non-uniform block size.
To support a non-uniform block size for HIP kernels,
an option `-f[no-]offload-uniform-block` is added. The option
is generic across offloading languages; its default value is on for
CUDA/HIP and off otherwise.
Make -cl-uniform-work-group-size an alias to -foffload-uniform-block.
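For illustration, here is a minimal sketch (not part of the patch; the code-object
and kernel names are made up) of a launch through hipExtModuleLaunchKernel where the
global work size is not a multiple of the local work size, so the last block is
partial and the kernel cannot assume a uniform block size:
```
#include <hip/hip_runtime.h>
#include <hip/hip_ext.h>

int main() {
  hipModule_t mod;
  hipFunction_t fun;
  hipModuleLoad(&mod, "kernel.co");            // hypothetical code object
  hipModuleGetFunction(&fun, mod, "vec_inc");  // hypothetical argument-less kernel

  // Global work size 1000 with local work size 256: the last work-group has
  // only 1000 - 3*256 = 232 work-items, i.e. a non-uniform block size.
  hipExtModuleLaunchKernel(fun,
                           /*globalWorkSizeX=*/1000, 1, 1,
                           /*localWorkSizeX=*/256, 1, 1,
                           /*sharedMemBytes=*/0, /*stream=*/nullptr,
                           /*kernelParams=*/nullptr, /*extra=*/nullptr,
                           /*startEvent=*/nullptr, /*stopEvent=*/nullptr,
                           /*flags=*/0);
  hipDeviceSynchronize();
  return 0;
}
```
A translation unit whose kernels are launched this way would be built with
-fno-offload-uniform-block.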
Reviewed by: Siu Chi Chan, Matt Arsenault, Fangrui Song, Johannes Doerfert
Differential Revision: https://reviews.llvm.org/D155213
Fixes: SWDEV-406592
Currently CUDA/HIP define their own language standards in
LanguageStandards.def, but they are redundant: they are the same as C++14.
The fact that CUDA/HIP accept c++* values for the -std= option indicates that
they have the same language standards as C++. The CUDA/HIP-specific language
features are conveyed through language options, not language-standard features.
It makes sense to let CUDA/HIP use the same default language standard as C++.
Reviewed by: Siu Chi Chan, Artem Belevich
Differential Revision: https://reviews.llvm.org/D155539
Fixes: SWDEV-407685
Rename HIP_API_PER_THREAD_DEFAULT_STREAM and __HIP_NO_IMAGE_SUPPORT
so that they follow the convention of having both a leading and a trailing __.
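Assuming the renamed spellings are the same identifiers with leading and trailing
double underscores, a guard would now look like this sketch:
```
#include <hip/hip_runtime.h>

__global__ void sample(float *out, float *in) {
#ifdef __HIP_NO_IMAGE_SUPPORT__   // assumed post-rename spelling
  out[0] = in[0];                 // buffer-based fallback path
#else
  out[0] = in[0];                 // image/texture path would go here
#endif
}
```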
Reviewed by: Artem Belevich
Differential Revision: https://reviews.llvm.org/D155480
OpenCL and HIP have -cl-fp32-correctly-rounded-divide-sqrt and
-fno-hip-fp32-correctly-rounded-divide-sqrt. The corresponding fpmath metadata
was only set on fdiv, not on sqrt. The backend currently underutilizes its
sqrt lowering options; the responsibility is split between the libraries and
the backend, and this metadata is needed for that.
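As an illustration (not taken from the patch), both operations below are expected
to carry !fpmath metadata once the correctly-rounded requirement is relaxed,
whereas previously only the division did:
```
#include <hip/hip_runtime.h>

__global__ void div_sqrt(float *out, const float *a, const float *b) {
  out[0] = a[0] / b[0];     // fdiv: already received !fpmath metadata
  out[1] = sqrtf(a[1]);     // sqrt: now also receives !fpmath metadata
}
```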
CUDA/NVCC has -prec-div and -prec-sqrt, but clang doesn't appear to be
aiming for compatibility with those. It is unclear whether OpenMP has a
similar control.
I'm using clang to compile CUDA code and found that clang doesn't support the per-thread default stream option for NVIDIA CUDA. I don't know whether there is another solution.
Reviewed By: tra
Differential Revision: https://reviews.llvm.org/D154822
Currently, bf16 support has been added to the PTX codegen in a scattered fashion. This patch aims to complete the set of instructions and code paths required to support the bf16 data type.
Reviewed By: tra
Differential Revision: https://reviews.llvm.org/D144911
Co-authored-by: Artem Belevich <tra@google.com>
Device libs make use of patterns like this:
```
__attribute__((target("gfx11-insts")))
static unsigned do_intrin_stuff(void)
{
  return __builtin_amdgcn_s_sendmsg_rtnl(0x0);
}
```
Such functions are assumed to be eliminated if the current GPU target doesn't support them.
At O0 such functions aren't eliminated by the common optimizations, but often by AMDGPURemoveIncompatibleFunctions instead, which sees the "+gfx11-insts" attribute on, say, GFX9, knows it's not valid there, and removes the function.
D142907 accidentally caused such attributes to be dropped during bitcode linking, making it impossible for AMDGPURemoveIncompatibleFunctions to catch these functions and eventually causing ISel to catch fire.
This fixes the issue and adds a new test to ensure we don't accidentally fall into this trap again.
Fixes SWDEV-403642
Reviewed By: arsenm, yaxunl
Differential Revision: https://reviews.llvm.org/D152251
i16/f16/bf16 will use the same .b16 registers, and
i32/v2f16/v2bf16 will share .b32 registers.
The changes are mostly mechanical, intended to remove unnecessary register
classes which tend to produce redundant register moves.
Differential Revision: https://reviews.llvm.org/D151601
v2f16 regtype conversion to i32
LLVM IR already allows floating point type in atomicrmw.
Update clang atomic fetch max/min builtins to accept
floating point type like we did for fetch add/sub.
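A minimal sketch of what is now accepted (using the GNU-style builtin spelling; the
HIP-specific atomic builtins follow the same pattern):
```
float fetch_max(float *addr, float val) {
  // Previously only integer operands were accepted for fetch max/min.
  return __atomic_fetch_max(addr, val, __ATOMIC_RELAXED);
}
```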
Reviewed by: Artem Belevich
Differential Revision: https://reviews.llvm.org/D150985
Fixes: SWDEV-401056
The rest of the fetch/op intrinsics were added in e13246a2ec3 but sub
was conspicuous by its absence.
Reviewed By: yaxunl
Differential Revision: https://reviews.llvm.org/D151701
Pursuant to discussions at
https://discourse.llvm.org/t/rfc-c-23-p1467r9-extended-floating-point-types-and-standard-names/70033/22,
this commit enhances the handling of the __bf16 type in Clang.
- Firstly, it upgrades __bf16 from a storage-only type to an arithmetic
type.
- Secondly, it changes the mangling of __bf16 to DF16b on all
architectures except ARM. This change has been made in
accordance with the finalization of the mangling for the
std::bfloat16_t type, as discussed at
https://github.com/itanium-cxx-abi/cxx-abi/pull/147.
- Finally, this commit extends the existing excess precision support to
the __bf16 type. This applies to hardware architectures that do not
natively support bfloat16 arithmetic.
Appropriate tests have been added to verify the effects of these
changes and ensure no regressions in other areas of the compiler.
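For example, code like the following hypothetical sketch is accepted once __bf16 is
an arithmetic type, without bouncing through float explicitly:
```
__bf16 fma_bf16(__bf16 x, __bf16 y, __bf16 z) {
  return x * y + z;   // direct __bf16 arithmetic; may use excess precision
}
```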
Reviewed By: rjmccall, pengfei, zahiraam
Differential Revision: https://reviews.llvm.org/D150913
This is stricter than the default "ieee", and should probably be the
default. This patch leaves the default alone. I can change this in a
future patch.
There are non-reversible transforms I would like to perform which are
legal under IEEE denormal handling, but illegal with flushing zero
behavior. Namely, conversions between llvm.is.fpclass and fcmp with
zeroes.
Under "ieee" handling, it is legal to translate between
llvm.is.fpclass(x, fcZero) and fcmp x, 0.
Under "preserve-sign" handling, it is legal to translate between
llvm.is.fpclass(x, fcSubnormal|fcZero) and fcmp x, 0.
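An illustrative C++ analogue (not from the patch) of why the two foldings differ:
for a subnormal input, the class check stays false while the compare's result
depends on the denormal mode.
```
#include <cmath>

// Analogue of fcmp oeq x, 0.0: under "ieee" denormal handling this is true
// only for +/-0; with "preserve-sign" (subnormal inputs flushed to zero) it
// is also true for subnormal x.
bool compare_is_zero(float x) { return x == 0.0f; }

// Analogue of llvm.is.fpclass(x, fcZero): true only for +/-0.
bool class_is_zero(float x) { return std::fpclassify(x) == FP_ZERO; }

// With x = 1e-39f (a subnormal), class_is_zero(x) is false, but
// compare_is_zero(x) becomes true when subnormal inputs are flushed to zero.
```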
I would like to compile and distribute some math library functions in
a mode where they are callable from code with and without denormals
enabled, which requires not changing the compares with denormals or
zeroes.
If an IEEE function transforms an llvm.is.fpclass call into an fcmp 0,
it is no longer possible to call the function from code with denormals
enabled, or write an optimization to move the function into a denormal
flushing mode. For the original function, if x was a denormal, the
class would evaluate to false. If the function compiled with denormal
handling was converted to or called from a preserve-sign function, the
fcmp now evaluates to true.
This could also be of use for strictfp handling, where code may be
changing the denormal mode.
An alternative name could be "unknown".
Replaces the old AMDGPU custom inlining logic with more conservative
logic which tries to permit inlining for callees with dynamic handling
and avoids inlining other mismatched modes.
Fixes clang crash caused by a stale function pointer.
The bug has been present for a pretty long time, but we were lucky not to
trigger it until D140663.
Differential Revision: https://reviews.llvm.org/D146448
The ASan functions are now attributed as used
in the device library, so there is no need to keep the
declaration of the ASan device-preserve function.
Patch by: Praveen Velliengiri
Reviewed by: Yaxun Liu
Differential Revision: https://reviews.llvm.org/D143495
Currently CGCUDANV uses an llvm::Function as a key to map kernels to a
symbol in host code. HIP adds one level of indirection and uses the
llvm::Function to map to a global variable that will be initialized to
the kernel stub ptr.
Unfortunately there is no guarantee that the llvm::Function created
by GetOrCreateLLVMFunction will be the same each time. In fact, the first
time we encounter GetOrCreateLLVMFunction for a kernel, the type
might not be complete yet, and the type of the llvm::Function will be
a generic {}, since the complete type is not required to get a symbol
to a function. In this case we end up creating two global variables:
one for the llvm::Function with the incomplete type and one for the
function with the complete type. The first global variable is
declared but not defined, resulting in a linking error.
This change uses the mangled name of the llvm::Function as the key in the
KernelHandles map; this way the same kernel is associated with the same
kernel handle even if the llvm::Function types differ.
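A hypothetical sketch of a situation where a kernel is referenced (and its handle is
needed) before its definition is seen, so CodeGen first materializes an llvm::Function
from the declaration alone:
```
#include <hip/hip_runtime.h>

__global__ void kern(int *p);   // first reference: only the declaration is visible

// Host code takes the kernel's address (i.e. needs its handle) here.
const void *handle = reinterpret_cast<const void *>(&kern);

__global__ void kern(int *p) { *p = 1; }
```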
Reviewed By: yaxunl
Differential Revision: https://reviews.llvm.org/D140663
To preserve the previous semantics after D141386, adjust places
that currently emit !range metadata to also emit !noundef metadata.
This retains range violation as immediate undefined behavior,
rather than just poison.
Differential Revision: https://reviews.llvm.org/D141494
- [Clang] Declare AMDGPU target as supporting BF16 for storage-only purposes on amdgcn
- Add Sema & CodeGen tests cases.
- Also add cases that D138651 would have covered as this patch replaces it.
- [AMDGPU] Add BF16 storage-only support
- Support legalization and handling of bf16 operations in DAGISel.
- bf16 as a type remains illegal and is represented as i16 for storage purposes.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D139398
This reverts commit 107ee2613063183cb643cef97f0fad403508c9f0 to facilitate
investigating and fixing the root cause.
Differential Revision: https://reviews.llvm.org/D135269
D124866 seems to have had an unintended side effect: __noinline__ on lambdas was no longer accepted.
This fixes the regression and adds a test case for it.
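A sketch of the construct in question (assumed shape; the exact test added may
differ), with the attribute placed after the capture list, where CUDA/HIP lambda
attributes go:
```
#include <hip/hip_runtime.h>

__device__ int g;

__device__ void call_noinline_lambda() {
  auto fn = [&] __noinline__ () { g = 42; };  // __noinline__ on a lambda
  fn();
}
```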
Reviewed By: aaron.ballman
Differential Revision: https://reviews.llvm.org/D137251
Recent Clang changes expose __bf16 types for SSE2-enabled host compilations, and
that makes those types visible during GPU-side compilation, where it currently
fails with Sema complaining that __bf16 is not supported.
Considering that __bf16 is a storage-only type, enabling it for NVPTX if it's
enabled on the host should pose no issues, correctness-wise.
Recent NVIDIA GPUs have introduced bf16 support, so we'll likely grow better
support for __bf16 on NVPTX going forward.
Differential Revision: https://reviews.llvm.org/D136311
The extra NUL does not impact the functionality of the generated code, but it
confuses various NVIDIA tools used to examine embedded GPU binaries.
Differential Revision: https://reviews.llvm.org/D135832
There are currently two options that are used to tell the compiler to perform
unsafe floating-point optimizations:
'-ffast-math' and '-funsafe-math-optimizations'.
'-ffast-math' enables '-funsafe-math-optimizations' by default and automatically
enables the driver option '-menable-unsafe-fp-math'.
Below is a table illustrating the special operations enabled automatically by
'-ffast-math', '-funsafe-math-optimizations' and '-menable-unsafe-fp-math'
respectively.
Special Operations    -ffast-math    -funsafe-math-optimizations    -menable-unsafe-fp-math
MathErrno             0              1                              1
FiniteMathOnly        1              0                              0
AllowFPReassoc        1              1                              1
NoSignedZero          1              1                              1
AllowRecip            1              1                              1
ApproxFunc            1              1                              1
RoundingMath          0              0                              0
UnsafeFPMath          1              0                              1
FPContract            fast           on                             on
'-ffast-math' enables '-fno-math-errno', '-ffinite-math-only',
'-funsafe-math-optimizations' and sets 'FPContract' to 'fast'. The driver option
'-menable-unsafe-fp-math' enables the same special operations as
'-funsafe-math-optimizations'. This is redundant.
We propose to remove the driver option '-menable-unsafe-fp-math' and instead use
the settings of these special operations to set the function attribute
'unsafe-fp-math'. The attribute will be enabled only if those special
operations are enabled and if 'FPContract' is either 'fast' or set to its
default value.
Differential Revision: https://reviews.llvm.org/D135097
Promoting kernel argument pointers to the global address space is only
available with a registered amdgcn target.
Fix the test so that it does not require a registered amdgcn target.
Currently there is a middle-end or backend issue,
https://github.com/llvm/llvm-project/issues/58176,
which causes values loaded through bool pointers to be incorrect when
bool range metadata is emitted. Temporarily
disable bool range metadata until the backend issue
is fixed.
Reviewed by: Artem Belevich
Differential Revision: https://reviews.llvm.org/D135269
Fixes: SWDEV-344137
A copy-paste error caused UB in the definition of the unsigned long long
versions of the shfl intrinsics. Reported and diagnosed by @trws.
Differential Revision: https://reviews.llvm.org/D129536
The new driver primarily allows us to support RDC-mode compilations with
proper linking. This is not needed for non-RDC mode compilation, but we
still would like the new driver to be able to handle this mode so we can
transition away from the old driver in the future. This patch adds the
necessary code to support creating a fatbinary for CUDA code generation
as well as removing old assumptions and errors about RDC-mode with the
new driver.
Reviewed By: tra
Differential Revision: https://reviews.llvm.org/D129655
This patch adds the small change required to output offloading entries
for HIP instead of CUDA. These need to be placed in different sections
because they must be kept distinct by the offloading toolchain; otherwise
we'd have HIP trying to register CUDA kernels, or vice versa. This patch will
precede support for HIP in the linker wrapper.
Reviewed By: yaxunl, tra
Differential Revision: https://reviews.llvm.org/D128850
This removes creation of udiv/sdiv/urem/srem constant expressions,
in preparation for their removal. I've added a
ConstantExpr::isDesirableBinOp() predicate to determine whether
an expression should be created for a certain operator.
With this patch, div/rem expressions can still be created through
explicit IR/bitcode, forbidding them entirely will be the next step.
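A sketch of how a caller might guard expression creation under the new predicate
(LLVM C++ API; the helper name here is made up):
```
#include "llvm/IR/Constants.h"

using namespace llvm;

// Returns a constant expression only for operators that are still desirable;
// for udiv/sdiv/urem/srem, callers now fall back to emitting an instruction
// (or leave the operands unfolded).
static Constant *tryBinOpExpr(unsigned Opcode, Constant *LHS, Constant *RHS) {
  if (!ConstantExpr::isDesirableBinOp(Opcode))
    return nullptr;
  return ConstantExpr::get(Opcode, LHS, RHS);
}
```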
Differential Revision: https://reviews.llvm.org/D128820
Add option -fhip-kernel-arg-name to emit kernel argument
name metadata, which is needed for certain HIP applications.
Reviewed by: Artem Belevich, Fangrui Song, Brian Sumner
Differential Revision: https://reviews.llvm.org/D128022
For the amdgpu target, the long double type is the same as the double type.
The width and alignment of the long double type were incorrectly
overridden when copying aux target properties, which
caused an assertion in codegen when emitting global
variables of long double type.
This patch fixes that by saving and restoring the width
and alignment of the long double type.
Reviewed by: Artem Belevich
Differential Revision: https://reviews.llvm.org/D127771
Fixes: SWDEV-335515
This enables opaque pointers by default in LLVM. The effect of this
is twofold:
* If IR that contains *neither* explicit ptr nor %T* types is passed
to tools, we will now use opaque pointer mode, unless
-opaque-pointers=0 has been explicitly passed.
* Users of LLVM as a library will now default to opaque pointers.
It is possible to opt out by calling setOpaquePointers(false) on the
LLVMContext, as sketched below.
A cmake option to toggle this default will not be provided. Frontends
or other tools that want to (temporarily) keep using typed pointers
should disable opaque pointers via LLVMContext.
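A minimal sketch of the library-level opt-out mentioned above:
```
#include "llvm/IR/LLVMContext.h"

int main() {
  llvm::LLVMContext Ctx;
  // Keep using typed pointers in this context; the default is now opaque.
  Ctx.setOpaquePointers(false);
  // ... create or parse modules within Ctx as before ...
  return 0;
}
```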
Differential Revision: https://reviews.llvm.org/D126689
CUDA requires that static variables be visible to the host when
offloading. However, the standard semantics of a static variable dictate
that it should not be visible outside of the current file. In order to
access it from the host we need to perform "externalization" on the
static variable on the device. This requires generating a semi-unique
name that can be affixed to the variable as to not cause linker errors.
This is currently done using the CUID functionality, an MD5 hash value
set up by the clang driver. This gives us a mostly unique ID, one that is
unique even between multiple compilations of the same file.
However, the CUID is not always available. Instead, this patch uses the
unique ID from the file to generate a unique symbol name. This will
create a unique name that is consistent between the host and device side
compilations without requiring the CUID to be entered by the driver. The
one downside to this is that we are no longer stable under multiple
compilations of the same file. However, this is a very niche use-case
and is not supported by NVIDIA's CUDA compiler, so it is likely good
enough.
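A hypothetical example of the pattern that needs externalization: the host looks
the variable up by symbol, but `static` gives it internal linkage, so device codegen
appends a unique suffix derived from the file's unique ID (the exact mangling is an
implementation detail).
```
#include <cuda_runtime.h>

static __device__ int counter;   // internal linkage, yet the host must see it

__global__ void bump() { ++counter; }

void read_counter(int *host_val) {
  // Host-side lookup by symbol only works if the externalized (uniquely
  // suffixed) device symbol exists in the device image.
  cudaMemcpyFromSymbol(host_val, counter, sizeof(int));
}
```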
Reviewed By: tra
Differential Revision: https://reviews.llvm.org/D125904