Disable generic DAG combines for AMDGPU at -O0 via
disableGenericCombines() to preserve instructions that users may want to
set breakpoints on during debugging.
Assisted-by: Cursor / Claude Opus 4.6
This change adds two builtins for AMDGPU:
- `__builtin_amdgcn_processor_is`, which is similar in observable
behaviour with `__builtin_cpu_is`, except that it is never "evaluated"
at run time;
- `__builtin_amdgcn_is_invocable`, which is behaviourally similar with
`__has_builtin`, except that it is not a macro (i.e. not evaluated at
preprocessing time).
Neither of these are `constexpr`, even though when compiling for
concrete (i.e. `gfxXXX` / `gfxXXX-generic`) targets they get evaluated
in Clang, so they shouldn't tear the AST too badly / at all for
multi-pass compilation cases like HIP. They can only be used in specific
contexts (as args to control structures).
The motivation for adding these is two-fold:
- as a nice to have, it provides an AST-visible way to incorporate
architecture specific code, rather than having to rely on macros and the
preprocessor, which burn in the choice quite early;
- as a must have, it allows featureful AMDGCN flavoured SPIR-V to be
produced, where target specific capability is guarded and chosen or
discarded when finalising compilation for a concrete target; this is
built atop the Speciali\ation Constant concept which is described in the
SPIR-V specification under section [2.12
Specialization](https://registry.khronos.org/SPIR-V/specs/unified1/SPIRV.html#_specialization_2)
I've tried to keep the overall footprint of the change small. The
changes to Sema are a bit unpleasant, but there was a strong desire to
have Clang validate these, and to constrain their uses, and this was the
most compact solution I could come up with (suggestions welcome).
---------
Co-authored-by: Juan Manuel Martinez Caamaño <jmartinezcaamao@gmail.com>
Co-authored-by: Voicu <avoicu@amd.com>
This PR upstreams the `mangling.cu` test from the ClangIR incubator.
Building on the feedback from my previous upstreaming PR, I have
expanded the verification for this file to include:
1. **CIR checks:** Verifying the ClangIR generated functions.
2. **LLVM checks:** Verifying the LLVM IR generated via the ClangIR
pipeline.
3. **OGCG checks:** Verifying against the original CodeGen pipeline to
ensure name-mangling parity.
I have also moved the `Inputs/cuda.h` mock header to the upstream
`Inputs` directory to support this and future CUDA tests.
If this multi-stage verification approach looks correct to the
maintainers, I plan to follow up by upstreaming the other currently
passing CUDA test, `simple-nvptx-triple.cu`, using the same standard.
Verified locally with `llvm-lit`. Partially addresses #156747.
These were assuming uniform work group sizes. Emit the v4 and v5
sequences to take the remainder group for the nonuniform case.
Currently the device libs uses this builtin on the legacy ABI path with
the same sequence to calculate the remainder, and fully implements the
v5 path. If you perform a franken-build of the library with the updated
builtin, the result is worse. The duplicate sequence does not fully fold out.
However, it does not appear to be wrong. The relevant conformance tests still
pass.
Use the DataLayout-aware TargetFolder instead of ConstantFolder in
Clang's CGBuilder. The primary impact of this change is that GEP
constant expressions are now emitted in canonical `getelementptr i8`
form. This is in preparation for the migration to ptradd, which requires
this form.
Part of the test updates were performed by Claude Code and reviewed by
me.
Summary:
This PR simply changes the behavior of the `wchar_size` flag. Currently,
we emit this in all cases for all targets. This causes problems during
LLVM-IR linking, specifically because this would vary between Linux and
Windows in unintuitive ways. Now we have an llvm::Triple helper to
determine the size from the known values. The module flag will only be
emitted if these do not match (indicating a non-standard environment).
In addition to fixing AMDGCN bitcode linking, this also means we don't
need to bloat *every* IR module compiled by clang with this flag. The
changed tests reflects this, one less unnecessary piece of metadata.
The "uniform-work-group-size" function attribute previously took a
string value of "true" or "false". Since presence alone can convey the
"true" semantics and absence can convey "false", the value is
unnecessary.
This patch converts it to a valueless string attribute: presence
indicates true, absence indicates false. For backward compatibility,
auto-upgrade logic is added in both UpgradeAttributes (bitcode) and
UpgradeFunctionAttributes: if the old value is "true", the attribute is
kept without a value; if "false", the attribute is removed.
In standard C++, const variables at namespace scope have internal
linkage. For __device__ const variables, this makes them invisible to
runtime symbol lookup APIs (cudaGetSymbolAddress/hipGetSymbolAddress).
Reading a __device__ const variable from host code is a valid usage
pattern — the host may need to know the value at runtime. This is
also needed by libcudacxx's cuda::get_device_address.
This patch extends the existing CUDADeviceVarODRUsedByHost tracking
to cover __device__ const variables. When host code references such a
variable, it gets externalized (same mechanism used for static device
vars). Variables only used in device code keep internal linkage and
can still be constant-folded.
The fix is in SemaExpr: __device__ const variables are classified as
CVT_Both (due to an implicit CUDAConstantAttr), so the ODR-use
tracking is extended to include CVT_Both variables with an explicit
CUDADeviceAttr, distinguishing them from plain const variables.
Summary:
From the Language reference:
> By default, global initializers are optimized by assuming that global
variables defined within the module are not modified from their initial
values before the start of the global initializer. This is true even for
variables potentially accessible from outside the module, including
those with external linkage or appearing in @llvm.used or dllexported
variables. This assumption may be suppressed by marking the variable
with externally_initialized.
This is intended because device programs can be modified beyond the
normal lifetime expected by the optimization pipeline. However, for
constant variables we should be able to safely assume that these are
truly constant within the module. In the vast majority of cases these
will not get externally visible symbols, but even `extern const` uses we
should assert that the user should not be writing them if they are
marked const.
Previous commit message:
>Previous commit message:
>
>> Original commit message:
>>
>>>When users explicitly specify a PTX version via -mattr=+ptxNN that's
insufficient for their target SM, we now emit a fatal error. Previously,
we silently upgraded the PTX version to the minimum required for the
target SM.
>>>
>>>When no SM or PTX version is specified, we now use PTX 3.2 (the
minimum for the default SM 3.0) instead of PTX 6.0.
>>
>>The following commits should fix the failures that arose when I
previously tried to land this commit:
>>
>>
>>9fc5fd0ad6
should address the llvm-nvptx*-nvidia-* build failures:
https://github.com/llvm/llvm-project/pull/174834#issuecomment-3742242651
>>
>>
>>600514a637
should address the MLIR failures
>
>The previous commit was reverted with
d23cb79ba4
because the
[mlir-nvidia](https://lab.llvm.org/buildbot/#/builders/138/builds/24797)
and
[mlir-nvidia-gcc7](https://lab.llvm.org/buildbot/#/builders/116/builds/23929)
Buildbots were failing.
>
>Those tests failed because MLIR's default SM was 5.0, which caused
NVPTX
to target PTX ISA v4.0, which did not support the intrinsics used in the
failing tests.
>
>243f011577
should address this by bumping MLIR's default SM to 7.5. Now, using
MLIR's new default SM, NVPTX
targets the PTX ISA v6.3, which supports the intrinsics used in the
failing tests.
---
The previous commit was reverted with
e9b578a4d77025e18318efedd0f3f3764338d859
[because](https://github.com/llvm/llvm-project/pull/179304#issuecomment-3856301333)
the clang driver set the default PTX ISA version to v4.2 when no CUDA
installation is found. However, given our patch, we should not set a
default; instead, let the LLVM backend select the appropriate PTX ISA
version for the target SM.
Previous commit message:
> Original commit message:
>
>>When users explicitly specify a PTX version via -mattr=+ptxNN that's
insufficient for their target SM, we now emit a fatal error. Previously,
we silently upgraded the PTX version to the minimum required for the
target SM.
>>
>>When no SM or PTX version is specified, we now use PTX 3.2 (the
minimum for the default SM 3.0) instead of PTX 6.0.
>
>The following commits should fix the failures that arose when I
previously tried to land this commit:
>
>9fc5fd0ad6
should address the llvm-nvptx*-nvidia-* build failures:
https://github.com/llvm/llvm-project/pull/174834#issuecomment-3742242651
>
>600514a637
should address the MLIR failures
---
The previous commit was reverted with
d23cb79ba497281de050ef609cb91b91058bf323 because the
[mlir-nvidia](https://lab.llvm.org/buildbot/#/builders/138/builds/24797)
and
[mlir-nvidia-gcc7](https://lab.llvm.org/buildbot/#/builders/116/builds/23929)
Buildbots were failing.
Those tests failed because MLIR's default SM was 5.0, which caused NVPTX
to target PTX ISA v4.0, which did not support the intrinsics used in the
failing tests.
243f011577193c99358ccc4142b296d4fa80ea11 should address this by bumping
MLIR's default SM to 7.5. Now, using MLIR's new default SM, NVPTX
targets the PTX ISA v6.3, which supports the intrinsics used in the
failing tests.
Convert "denormal-fp-math" and "denormal-fp-math-f32" into a first
class denormal_fpenv attribute. Previously the query for the effective
denormal mode involved two string attribute queries with parsing. I'm
introducing more uses of this, so it makes sense to convert this
to a more efficient encoding. The old representation was also awkward
since it was split across two separate attributes. The new encoding
just stores the default and float modes as bitfields, largely avoiding
the need to consider if the other mode is set.
The syntax in the common cases looks like this:
`denormal_fpenv(preservesign,preservesign)`
`denormal_fpenv(float: preservesign,preservesign)`
`denormal_fpenv(dynamic,dynamic float: preservesign,preservesign)`
I wasn't sure about reusing the float type name instead of adding a
new keyword. It's parsed as a type but only accepts float. I'm also
debating switching the name to subnormal to match the current
preferred IEEE terminology (also used by nofpclass and other
contexts).
This has a behavior change when using the command flag debug
options to set the denormal mode. The behavior of the flag
ignored functions with an explicit attribute set, per
the default and f32 version. Now that these are one attribute,
the flag logic can't distinguish which of the two components
were explicitly set on the function. Only one test appeared to
rely on this behavior, so I just avoided using the flags in it.
This also does not perform all the code cleanups this enables.
In particular the attributor handling could be cleaned up.
I also guessed at how to support this in MLIR. I followed
MemoryEffects as a reference; it appears bitfields are expanded
into arguments to attributes, so the representation there is
a bit uglier with the 2 2-element fields flattened into 4 arguments.
This just added unnecessary work to the IR, since they are only used for
load and store, which just causes some IR noise. Tests updated by UTC
script to remove the extra lines.
Reverts llvm/llvm-project#177459
`mlir-nvidia` and `mlir-nvidia-gcc7` Buildbots are failing.
The blamelist is small and likely because of my change. Preemptively
reverting.
Original commit message:
> When users explicitly specify a PTX version via -mattr=+ptxNN that's
insufficient for their target SM, we now emit a fatal error. Previously,
we silently upgraded the PTX version to the minimum required for the
target SM.
>
>When no SM or PTX version is specified, we now use PTX 3.2 (the minimum
for the default SM 3.0) instead of PTX 6.0.
---
The following commits should fix the failures that arose when I
previously tried to land this commit:
- 9fc5fd0ad689eed94f65b1d6d10f9c5642935e68 should address the
`llvm-nvptx*-nvidia-*` build failures:
https://github.com/llvm/llvm-project/pull/174834#issuecomment-3742242651
- 600514a63760c6730e4cd970d2fcead9c5a897b3 should address the MLIR
failures
When users explicitly specify a PTX version via `-mattr=+ptxNN` that's
insufficient for their target SM, we now emit a fatal error. Previously,
we silently upgraded the PTX version to the minimum required for the
target SM.
When no SM or PTX version is specified, we now use PTX 3.2 (the minimum
for the default SM 3.0) instead of PTX 6.0.
At the moment AMDGCN flavoured SPIRV uses the SPIRV ABI with some tweaks
revolving around passing aggregates as direct. This is problematic in
multiple ways:
- it leads to divergence from code compiled for a concrete target, which
makes it difficult to debug;
- it incurs a run time cost, when dealing with larger aggregates;
- it incurs a compile time cost, when dealing with larger aggregates.
This patch switches over AMDGCN flavoured SPIRV to implement the AMDGPU
ABI (except for dealing with variadic functions, which will be added in
the future). One additional complication (and the primary motivation
behind the current less than ideal state of affairs) stems from `byref`,
which AMDGPU uses, not being expressible in SPIR-V. We deal with this by
CodeGen-ing for `byref`, lowering it to the `FuncParamAttr ByVal` in
SPIR-V, and restoring it when doing reverse translation from AMDGCN
flavoured SPIR-V.
- CUDA's dynamic parallelism extension allows device-side kernel
launches, which share the identical syntax to host-side launches, e.g.,
kernel<<<Dg, Db, Ns, S>>>(arguments);
but differ from the code generation. That device-side kernel launches is
eventually translated into the following sequence
config = cudaGetParameterBuffer(alignment, size);
// setup arguments by copying them into `config`.
cudaLaunchDevice(func, config, Dg, Db, Ns, S);
- To support the device-side kernel launch, 'CUDAKernelCallExpr' is
reused but its config expr is set to a call to 'cudaLaunchDevice'.
During the code generation, 'CUDAKernelCallExpr' is expanded into the
sequence aforementioned.
- As the device-side kernel launch requires the source to be compiled as
relocatable device code and linked with '-lcudadevrt'. Linkers are
changed to pass relevant link options to 'nvlink'.
This PR adds checks for when emitting weak aliases in: `void
CodeGenModule::EmitGlobal(GlobalDecl GD)`, before for device compilation
for OpenMP, HIP and Cuda, clang would look for the aliasee even if it
was never marked for device compilation.
For OpenMP the following case now works:
> Failed before when compiling with device, ie: `clang -fopenmp
-fopenmp-targets=amdgcn-amd-amdhsa`
> ```
> int __Two(void) { return 2; }
> int Two(void) __attribute__ ((weak, alias("__Two")));
> ```
For HIP / Cuda:
>
> ```
> int __HostFunc(void) { return 42; }
> int HostFunc(void) __attribute__ ((weak, alias("__HostFunc")));
> ```
For HIP:
>Failed before on HIP, Cuda fails due to: `NVPTX aliasee must not be
'.weak'` error
> ```
> __device__ int __One(void) { return 2; }
> __device__ int One(void) __attribute__ ((weak, alias("__One")));
> ```
Included are Codegen LIT tests for the above cases, and also cases for
weak alias cases that currently work in clang.
Fixes https://github.com/llvm/llvm-project/issues/117369
This PR adds basic frontend support for `__cluster_dims__` and
`__no_cluster__` attribute.
In CUDA/HIP programming, the ``__cluster_dims__`` attribute can be
applied to a kernel function to set the dimensions of a thread block
cluster. The ``__no_cluster__`` attribute can be applied to a kernel
function to indicate that the thread block cluster feature will not be
enabled at both compile time and kernel launch time. Note that
`__no_cluster__` is a LLVM/Clang only attribute.
Co-authored-by: Yaxun (Sam) Liu <yaxun.liu@amd.com>
Co-authored-by: Jay Foad <jay.foad@amd.com>
Upgrade the !"grid_constant" !nvvm.annotation to a "nvvm.grid_constant"
attribute. This attribute is much simpler for front-ends to apply and
faster and simpler to query.
This change consolidates and cleans up various NVPTXISD target-specific
nodes in order to simplify SDAG ISel. While there are some whitespace
changes in the emitted PTX it is otherwise a non-functional change.
NVPTXISD::Wrapper - This node was used to wrap external-symbol and
global-address nodes. It is redundant and has been removed. Instead we
use the non-target versions of these nodes and convert them
appropriately during ISel.
NVPTXISD::CALL - Much of the family of nodes used to represent a PTX
call instruction have been replaced by this new single node. It
corresponds to a single instruction and is therefore much simpler to
create and lower.
In most cases, the type information attached to load and store
instructions is meaningless and inconsistently applied. We can usually
use ".b" loads and avoid the complexity of trying to assign the correct
type. The one expectation is sign-extending load, which will continue to
use ".s" to ensure the sign extension into a larger register is done
correctly.
In a lambda function, a call of a function may
resolve to host and device functions with different
signatures. Especially, a constexpr local variable may
be passed by value by the device function and
passed by reference by the host function, which
will cause the constexpr variable captured by
the lambda function in host compilation but
not in the device compilation. The discrepancy
in the lambda captures will violate ODR and
causes UB for kernels using these lambdas.
This PR fixes the issue by identifying
discrepancy of ODR/non-ODR usages of constexpr
local variables passed to host/device functions
and conservatively capture them.
Fixes: https://github.com/llvm/llvm-project/issues/132068
Following are the changes:
1. Make OffloadKind enum values to be powers of two so we can use them
like a bitfield
2. Include OFK_SYCL enum value
3. Modify ActiveOffloadKinds support in clang-linker-wrapper to use
bitfields instead of a vector.
Thanks
---------
Signed-off-by: Arvind Sudarsanam <arvind.sudarsanam@intel.com>
Summary:
When we were first porting to COV5, this lead to some ABI issues due to
a change in how we looked up the work group size. Bitcode libraries
relied on the builtins to emit code, but this was changed between
versions. This prevented the bitcode libraries, like OpenMP or libc,
from being used for both COV4 and COV5. The solution was to have this
'none' functionality which effectively emitted code that branched off of
a global to resolve to either version.
This isn't a great solution because it forced every TU to have this
variable in it. The patch in
https://github.com/llvm/llvm-project/pull/131033 removed support for
COV4 from OpenMP, which was the only consumer of this functionality.
Other users like HIP and OpenCL did not use this because they linked the
ROCm Device Library directly which has its own handling (The name was
borrowed from it after all).
So, now that we don't need to worry about backward compatibility with
COV4, we can remove this special handling. Users can still emit COV4
code, this simply removes the special handling used to make the OpenMP
device runtime bitcode version agnostic.
Summary:
This attribute is mostly borrowed from OpenCL, but is useful in general
for accessing the LLVM vector types. Previously the only way to use it
was through typedefs. This patch changes that to allow use as a regular
type attribute, similar to address spaces.
Add option and statement attribute for controlling emitting of
target-specific
metadata to atomicrmw instructions in IR.
The RFC for this attribute and option is
https://discourse.llvm.org/t/rfc-add-clang-atomic-control-options-and-pragmas/80641,
Originally a pragma was proposed, then it was changed to clang
attribute.
This attribute allows users to specify one, two, or all three options
and must be applied
to a compound statement. The attribute can also be nested, with inner
attributes
overriding the options specified by outer attributes or the target's
default
options. These options will then determine the target-specific metadata
added to atomic
instructions in the IR.
In addition to the attribute, three new compiler options are introduced:
`-f[no-]atomic-remote-memory`, `-f[no-]atomic-fine-grained-memory`,
`-f[no-]atomic-ignore-denormal-mode`.
These compiler options allow users to override the default options
through the
Clang driver and front end. `-m[no-]unsafe-fp-atomics` is aliased to
`-f[no-]ignore-denormal-mode`.
In terms of implementation, the atomic attribute is represented in the
AST by the
existing AttributedStmt, with minimal changes to AST and Sema.
During code generation in Clang, the CodeGenModule maintains the current
atomic options,
which are used to emit the relevant metadata for atomic instructions.
RAII is used
to manage the saving and restoring of atomic options when entering
and exiting nested AttributedStmt.
Replace some more nvvm.annotations with function attributes,
auto-upgrading the annotations as needed. These new attributes will be
more idiomatic and compile-time efficient than the annotations.
- !"maxntid[xyz]" -> "nvvm.maxntid"
- !"reqntid[xyz]" -> "nvvm.reqntid"
- !"cluster_dim_[xyz]" -> "nvvm.cluster_dim"
Currently, the clang frontend incorrectly emits the callee instead of
the thunk for the callee in the VTable. This is the case because the
thunk index is not incremented when their callees cannot be emitted.
This patch fixes the bug.
gfx940 and gfx941 are no longer supported. This is one of a series of
PRs to remove them from the code base.
This PR removes all occurrences of gfx940/gfx941 from clang that can be
removed without changes in the llvm directory. The
target-invalid-cpu-note/amdgcn.c test is not included here since it
tests a list of targets that is defined in
llvm/lib/TargetParser/TargetParser.cpp.
For SWDEV-512631
Replace some more nvvm.annotations with function attributes,
auto-upgrading the annotations as needed. These new attributes will be
more idiomatic and compile-time efficient than the annotations.
- !"maxclusterrank" / !"cluster_max_blocks" -> "nvvm.maxclusterrank"
- !"minctasm" -> "nvvm.minctasm"
- !"maxnreg" -> "nvvm.maxnreg"
Summary:
This patch unifies the existing offloading entires into a single section
called `llvm_offload_entires`. This lets us use a more unified
offloading infrastructure so that all targets share the same handling.
The effect is that people in the runtimes now need to check if the kind
is what they expect, but the expectation is that you can combine
multiple potential providers into a compile job. Doesn't fully work
yet because of other runtime issues, but some day. Mostly this helps the
future of liboffload where we want to handle different languages than
OpenMP.
This PR removes the old `nocapture` attribute, replacing it with the new
`captures` attribute introduced in #116990. This change is
intended to be essentially NFC, replacing existing uses of `nocapture`
with `captures(none)` without adding any new analysis capabilities.
Making use of non-`none` values is left for a followup.
Some notes:
* `nocapture` will be upgraded to `captures(none)` by the bitcode
reader.
* `nocapture` will also be upgraded by the textual IR reader. This is to
make it easier to use old IR files and somewhat reduce the test churn in
this PR.
* Helper APIs like `doesNotCapture()` will check for `captures(none)`.
* MLIR import will convert `captures(none)` into an `llvm.nocapture`
attribute. The representation in the LLVM IR dialect should be updated
separately.
Summary:
The previous offloading entry type did not fit the current use-cases
very well. This widens it and adds a version to prevent further
annoyances. It also includes the kind to better sort who's using it.
The first 64-bytes are reserved as zero so the OpenMP runtime can detect
the old format for binary compatibilitry.
Summary:
Previously, managed variables didn't work in rdc mode using the new
driver because we just didn't register them. This was previously ignored
because we didn't have enough space in the current struct format. This
patch amends that by just emitting a struct pair for the two variables
and using the single pointer.
In the future, a more extensible entry format would be nice, but that
can be done later.