This change adds two builtins for AMDGPU:
- `__builtin_amdgcn_processor_is`, which is similar in observable
behaviour with `__builtin_cpu_is`, except that it is never "evaluated"
at run time;
- `__builtin_amdgcn_is_invocable`, which is behaviourally similar with
`__has_builtin`, except that it is not a macro (i.e. not evaluated at
preprocessing time).
Neither of these are `constexpr`, even though when compiling for
concrete (i.e. `gfxXXX` / `gfxXXX-generic`) targets they get evaluated
in Clang, so they shouldn't tear the AST too badly / at all for
multi-pass compilation cases like HIP. They can only be used in specific
contexts (as args to control structures).
The motivation for adding these is two-fold:
- as a nice to have, it provides an AST-visible way to incorporate
architecture specific code, rather than having to rely on macros and the
preprocessor, which burn in the choice quite early;
- as a must have, it allows featureful AMDGCN flavoured SPIR-V to be
produced, where target specific capability is guarded and chosen or
discarded when finalising compilation for a concrete target; this is
built atop the Speciali\ation Constant concept which is described in the
SPIR-V specification under section [2.12
Specialization](https://registry.khronos.org/SPIR-V/specs/unified1/SPIRV.html#_specialization_2)
I've tried to keep the overall footprint of the change small. The
changes to Sema are a bit unpleasant, but there was a strong desire to
have Clang validate these, and to constrain their uses, and this was the
most compact solution I could come up with (suggestions welcome).
---------
Co-authored-by: Juan Manuel Martinez Caamaño <jmartinezcaamao@gmail.com>
Co-authored-by: Voicu <avoicu@amd.com>
Adding new clang builtins for AMDGPU raw/struct buffer format load/store
intrinsics. Clang currently has `__builtin_amdgcn_raw_buffer_load_b*`
and `__builtin_amdgcn_raw_buffer_store_b*` builtins, but is missing
builtins for the format variants. These format intrinsics are currently
used by device-libs via manually written IR wrappers in
[buffer-intrinsics.ll](https://github.com/ROCm/llvm-project/blob/amd-staging/amd/device-libs/ockl/src/buffer-intrinsics.ll).
This patch adds support for specifying all target memory locations using
a
single IR spellings such as:
```
memory(target_mem: read)
```
This form is not supported in TableGen, but it is now accepted by the IR
parser.
When the parser encounters target_mem, it expands it to all
target-memory
locations (e.g., target_mem0, target_mem1, …).
Printing behavior
When all target-memory locations share the same ModRef value, the
printer
now collapses them into a single entry:
```
memory(target_mem: read)
```
Otherwise, each target memory location is printed separately.
Rejected IR:
```
memory(target_mem0: write, target_mem: read)
```
This is invalid because the default access kind for the target memory
group
must appear first.
These were assuming uniform work group sizes. Emit the v4 and v5
sequences to take the remainder group for the nonuniform case.
Currently the device libs uses this builtin on the legacy ABI path with
the same sequence to calculate the remainder, and fully implements the
v5 path. If you perform a franken-build of the library with the updated
builtin, the result is worse. The duplicate sequence does not fully fold out.
However, it does not appear to be wrong. The relevant conformance tests still
pass.
Summary:
This PR simply changes the behavior of the `wchar_size` flag. Currently,
we emit this in all cases for all targets. This causes problems during
LLVM-IR linking, specifically because this would vary between Linux and
Windows in unintuitive ways. Now we have an llvm::Triple helper to
determine the size from the known values. The module flag will only be
emitted if these do not match (indicating a non-standard environment).
In addition to fixing AMDGCN bitcode linking, this also means we don't
need to bloat *every* IR module compiled by clang with this flag. The
changed tests reflects this, one less unnecessary piece of metadata.
Rename TENSOR_LOAD_TO_LDS to TENSOR_LOAD_TO_LDS_d4;
Rename TENSOR_STORE_FROM_LDS to TENSOR_STORE_FROM_LDS_d4;
Also rename function names in a couple of tests to reflect this change.
Summary:
LLVM IR should support these for all cases except for compare-exchange.
Currently the code goes through an integer indirection for these cases.
This PR changes the behavior to use atomics directly to the target
memory type.
The "uniform-work-group-size" function attribute previously took a
string value of "true" or "false". Since presence alone can convey the
"true" semantics and absence can convey "false", the value is
unnecessary.
This patch converts it to a valueless string attribute: presence
indicates true, absence indicates false. For backward compatibility,
auto-upgrade logic is added in both UpgradeAttributes (bitcode) and
UpgradeFunctionAttributes: if the old value is "true", the attribute is
kept without a value; if "false", the attribute is removed.
The intrinsic has five arguments for the tensor descriptor (D#), while the fifth one is reserved for future targets, and it will be silently ignored in codegen for gfx1250.
For tensor up to 2D, only the first two D# groups are meaningful and the rest should be zero-initialized.
Introduce two new subtarget features:
- WMMA256bInsts for GFX11 WMMA instructions and
- WMMA128bInsts for GFX1170 and GFX12 WMMA and SWMMAC instructions
Some WMMA instructions have changed from GFX 11.0 to GFX 11.7 so new
Real versions were added with "_gfx1170" suffix. For consistency all
WMMA and SWMMAC GFX11.7 instructions use this suffix.
To resolve decoding issues between different formats for some WMMA
instructions between GFX 11 and GFX 11.7, new decoding tables were
added.
Asynchronous operations are memory transfers (usually between the global
memory and LDS) that are completed independently at an unspecified
scope. A thread that requests one or more asynchronous transfers can use
async marks to track their completion. The thread waits for each mark to
be completed, which indicates that requests initiated in program order
before this mark have also completed.
For now, we implement asyncmark/wait operations on pre-GFX12
architectures that support "LDS DMA" operations. Future work will extend
support to GFX12Plus architectures that support "true" async operations.
This is part of a stack split out from #173259
- #180467
- #180466
Co-authored-by: Ryan Mitchell ryan.mitchell@amd.com
Fixes: SWDEV-521121
The existing "LDS DMA" builtins/intrinsics copy data from global/buffer
pointer to LDS. These are now augmented with their ".async" version,
where the compiler does not automatically track completion. The
completion is now tracked using explicit mark/wait intrinsics, which
must be inserted by the user. This makes it possible to write programs
with efficient waits in software pipeline loops. The program can now
wait for only the oldest outstanding operations to finish, while
launching more operations for later use.
This change only contains the new names of the builtins/intrinsics,
which continue to behave exactly like their non-async counterparts. A
later change will implement the actual mark/wait semantics in
SIInsertWaitcnts.
This is part of a stack split out from #173259:
- #180467
- #180466
Fixes: SWDEV-521121
Load monitor operations make more sense as atomic operations, as
non-atomic operations cannot be used for inter-thread communication w/o
additional synchronization.
The previous built-in made it work because one could just override the
CPol bits, but that bypasses the memory model and forces the user to learn
about ISA bits encoding.
Making load monitor an atomic operation has a couple of advantages.
First, the memory model foundation for it is stronger. We just lean on the
existing rules for atomic operations. Second, the CPol bits are abstracted away
from the user, which avoids leaking ISA details into the API.
This patch also adds supporting memory model and intrinsics
documentation to AMDGPUUsage.
Solves SWDEV-516398.
Convert "denormal-fp-math" and "denormal-fp-math-f32" into a first
class denormal_fpenv attribute. Previously the query for the effective
denormal mode involved two string attribute queries with parsing. I'm
introducing more uses of this, so it makes sense to convert this
to a more efficient encoding. The old representation was also awkward
since it was split across two separate attributes. The new encoding
just stores the default and float modes as bitfields, largely avoiding
the need to consider if the other mode is set.
The syntax in the common cases looks like this:
`denormal_fpenv(preservesign,preservesign)`
`denormal_fpenv(float: preservesign,preservesign)`
`denormal_fpenv(dynamic,dynamic float: preservesign,preservesign)`
I wasn't sure about reusing the float type name instead of adding a
new keyword. It's parsed as a type but only accepts float. I'm also
debating switching the name to subnormal to match the current
preferred IEEE terminology (also used by nofpclass and other
contexts).
This has a behavior change when using the command flag debug
options to set the denormal mode. The behavior of the flag
ignored functions with an explicit attribute set, per
the default and f32 version. Now that these are one attribute,
the flag logic can't distinguish which of the two components
were explicitly set on the function. Only one test appeared to
rely on this behavior, so I just avoided using the flags in it.
This also does not perform all the code cleanups this enables.
In particular the attributor handling could be cleaned up.
I also guessed at how to support this in MLIR. I followed
MemoryEffects as a reference; it appears bitfields are expanded
into arguments to attributes, so the representation there is
a bit uglier with the 2 2-element fields flattened into 4 arguments.
This just added unnecessary work to the IR, since they are only used for
load and store, which just causes some IR noise. Tests updated by UTC
script to remove the extra lines.
OpenCL spec restricts that variable in local address space can only be
declared at kernel function scope.
Add a Clang internal extension __cl_clang_function_scope_local_variables
to lift the restriction.
To expose static local allocations at kernel scope, targets can either
force-inline non-kernel functions that declare local memory or pass a
kernel-allocated local buffer to those functions via an implicit argument.
Motivation: support local memory allocation in libclc's implementation
of work-group collective built-ins, see example at:
https://github.com/intel/llvm/blob/41455e305117/libclc/libspirv/lib/amdgcn-amdhsa/group/collectives_helpers.llhttps://github.com/intel/llvm/blob/41455e305117/libclc/libspirv/lib/amdgcn-amdhsa/group/collectives.cl#L182
Right now this is a Clang-only OpenCL extension intended for compiling
OpenCL libraries with Clang. It could be proposed as a standard OpenCL
extension in the future.
Currently, AMDGPU functions have `target-features` attribute populated
with all default features for the target GPU. This is redundant because
the backend can derive these defaults from the `target-cpu` attribute
via `AMDGPUTargetMachine::getFeatureString()`.
In this PR, for AMDGPU targets only:
- Functions without explicit target attributes no longer emit
`target-features`
- Functions with `__attribute__((target(...)))` or `-target-feature`
emit only features that differ from the target's defaults (delta)
The backend already handles missing `target-features` correctly by
falling back to the TargetMachine's defaults.
A new cc1 flag `-famdgpu-emit-full-target-features` is added to emit
full features when needed.
Example:
Before:
```llvm
attributes #0 = { "target-cpu"="gfx90a" "target-features"="+16-bit-insts,+atomic-buffer-global-pk-add-f16-insts,+atomic-fadd-rtn-insts,+ci-insts,+dl-insts,+dot1-insts,+dot2-insts,..." }
```
After (default):
```llvm
attributes #0 = { "target-cpu"="gfx90a" }
```
After (with explicit `+wavefrontsize32` override):
```llvm
attributes #0 = { "target-cpu"="gfx90a" "target-features"="+wavefrontsize32" }
```
We should not allow `-wavefrontsize32` and `-wavefrontsize64` to be
specified at the same time. We should also not allow `-wavefrontsize32`
on a target that only supports `wavefrontsize32`, and the vice versa.
Before this change, constrained fptrunc for __builtin_store_half/halff
always used round.tonearest, ignoring the active pragma STDC FENV_ROUND.
This PR guards builtin emission with CGFPOptionsRAII so the current
rounding mode is propagated to the generated constrained intrinsic.
This reverts commit 2c376ffeca490a5732e4fd6e98e5351fcf6d692a because it
breaks assembler.
```
$ llvm-mc -triple=amdgcn -mcpu=gfx1250 -show-encoding <<< "v_wmma_i32_16x16x64_iu8 v[16:23], v[0:7], v[8:15], v[16:23] matrix_b_reuse"
v_wmma_i32_16x16x64_iu8 v[16:23], v[0:7], v[8:15], v[16:23] clamp ; encoding: [0x10,0x80,0x72,0xcc,0x00,0x11,0x42,0x1c]
```
We have a fundamental issue in the clamp support in VOP3P instructions,
which will need more changes.