Load monitor operations make more sense as atomic operations, as
non-atomic operations cannot be used for inter-thread communication w/o
additional synchronization.
The previous built-in made it work because one could just override the
CPol bits, but that bypasses the memory model and forces the user to learn
about ISA bits encoding.
Making load monitor an atomic operation has a couple of advantages.
First, the memory model foundation for it is stronger. We just lean on the
existing rules for atomic operations. Second, the CPol bits are abstracted away
from the user, which avoids leaking ISA details into the API.
This patch also adds supporting memory model and intrinsics
documentation to AMDGPUUsage.
Solves SWDEV-516398.
Convert "denormal-fp-math" and "denormal-fp-math-f32" into a first
class denormal_fpenv attribute. Previously the query for the effective
denormal mode involved two string attribute queries with parsing. I'm
introducing more uses of this, so it makes sense to convert this
to a more efficient encoding. The old representation was also awkward
since it was split across two separate attributes. The new encoding
just stores the default and float modes as bitfields, largely avoiding
the need to consider if the other mode is set.
The syntax in the common cases looks like this:
`denormal_fpenv(preservesign,preservesign)`
`denormal_fpenv(float: preservesign,preservesign)`
`denormal_fpenv(dynamic,dynamic float: preservesign,preservesign)`
I wasn't sure about reusing the float type name instead of adding a
new keyword. It's parsed as a type but only accepts float. I'm also
debating switching the name to subnormal to match the current
preferred IEEE terminology (also used by nofpclass and other
contexts).
This has a behavior change when using the command flag debug
options to set the denormal mode. The behavior of the flag
ignored functions with an explicit attribute set, per
the default and f32 version. Now that these are one attribute,
the flag logic can't distinguish which of the two components
were explicitly set on the function. Only one test appeared to
rely on this behavior, so I just avoided using the flags in it.
This also does not perform all the code cleanups this enables.
In particular the attributor handling could be cleaned up.
I also guessed at how to support this in MLIR. I followed
MemoryEffects as a reference; it appears bitfields are expanded
into arguments to attributes, so the representation there is
a bit uglier with the 2 2-element fields flattened into 4 arguments.
This just added unnecessary work to the IR, since they are only used for
load and store, which just causes some IR noise. Tests updated by UTC
script to remove the extra lines.
OpenCL spec restricts that variable in local address space can only be
declared at kernel function scope.
Add a Clang internal extension __cl_clang_function_scope_local_variables
to lift the restriction.
To expose static local allocations at kernel scope, targets can either
force-inline non-kernel functions that declare local memory or pass a
kernel-allocated local buffer to those functions via an implicit argument.
Motivation: support local memory allocation in libclc's implementation
of work-group collective built-ins, see example at:
https://github.com/intel/llvm/blob/41455e305117/libclc/libspirv/lib/amdgcn-amdhsa/group/collectives_helpers.llhttps://github.com/intel/llvm/blob/41455e305117/libclc/libspirv/lib/amdgcn-amdhsa/group/collectives.cl#L182
Right now this is a Clang-only OpenCL extension intended for compiling
OpenCL libraries with Clang. It could be proposed as a standard OpenCL
extension in the future.
Currently, AMDGPU functions have `target-features` attribute populated
with all default features for the target GPU. This is redundant because
the backend can derive these defaults from the `target-cpu` attribute
via `AMDGPUTargetMachine::getFeatureString()`.
In this PR, for AMDGPU targets only:
- Functions without explicit target attributes no longer emit
`target-features`
- Functions with `__attribute__((target(...)))` or `-target-feature`
emit only features that differ from the target's defaults (delta)
The backend already handles missing `target-features` correctly by
falling back to the TargetMachine's defaults.
A new cc1 flag `-famdgpu-emit-full-target-features` is added to emit
full features when needed.
Example:
Before:
```llvm
attributes #0 = { "target-cpu"="gfx90a" "target-features"="+16-bit-insts,+atomic-buffer-global-pk-add-f16-insts,+atomic-fadd-rtn-insts,+ci-insts,+dl-insts,+dot1-insts,+dot2-insts,..." }
```
After (default):
```llvm
attributes #0 = { "target-cpu"="gfx90a" }
```
After (with explicit `+wavefrontsize32` override):
```llvm
attributes #0 = { "target-cpu"="gfx90a" "target-features"="+wavefrontsize32" }
```
We should not allow `-wavefrontsize32` and `-wavefrontsize64` to be
specified at the same time. We should also not allow `-wavefrontsize32`
on a target that only supports `wavefrontsize32`, and the vice versa.
Before this change, constrained fptrunc for __builtin_store_half/halff
always used round.tonearest, ignoring the active pragma STDC FENV_ROUND.
This PR guards builtin emission with CGFPOptionsRAII so the current
rounding mode is propagated to the generated constrained intrinsic.
This reverts commit 2c376ffeca490a5732e4fd6e98e5351fcf6d692a because it
breaks assembler.
```
$ llvm-mc -triple=amdgcn -mcpu=gfx1250 -show-encoding <<< "v_wmma_i32_16x16x64_iu8 v[16:23], v[0:7], v[8:15], v[16:23] matrix_b_reuse"
v_wmma_i32_16x16x64_iu8 v[16:23], v[0:7], v[8:15], v[16:23] clamp ; encoding: [0x10,0x80,0x72,0xcc,0x00,0x11,0x42,0x1c]
```
We have a fundamental issue in the clamp support in VOP3P instructions,
which will need more changes.
This commit goes through IntrinsicsAMDGPU.td and adds
`nocreateundeforpoison` to intrinsics that (to my knowledge) perform
arithmetic operations that are defined everywhere (so no bitfield
extracts and such since those can have invalid inputs, and similarly for permutations).
Fixes#166989
- Adds a clamp immediate operand to the AMDGPU WMMA iu8 intrinsic and
threads it through LLVM IR, MIR lowering, Clang builtins/tests, and MLIR
ROCDL dialect so all layers agree on the new operand
- Updates AMDGPUWmmaIntrinsicModsAB so the clamp attribute is emitted,
teaches VOP3P encoding to accept the immediate, and adjusts Clang
codegen/builtin headers plus MLIR op definitions and tests to match
- Documents what the WMMA clamp operand do
- Implement bitcode AutoUpgrade for source compatibility on WMMA IU8
Intrinsic op
Possible future enhancements:
- infer clamping as an optimization fold based on the use context
---------
Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
Allows for type checking depending on the built-in signature.
This introduces some subtle changes in code generation: before, since
the signature was meaningless, we would accept any pointer type without
casting. After this change, the pointer of the `atomicrmw` matches the
flat address space.
Fixes#154772
We previously set `ptx_kernel` for all kernels. But it's incorrect to
add `ptx_kernel` to the stub version of kernel introduced in #115821.
This patch copies the workaround of AMDGPU.
This patch introduces preliminary support for additional memory
locations.
They are: target_mem0 and target_mem1 and they model memory locations
that cannot be represented with existing memory locations.
It was a solution suggested in :
https://discourse.llvm.org/t/rfc-improving-fpmr-handling-for-fp8-intrinsics-in-llvm/86868/6
Currently, these locations are not yet target-specific. The goal is to
enable the compiler to express read/write effects on these resources.
Also add a corresponding intrinsic property that can be used to mark
intrinsics that do not introduce poison, for example simple arithmetic
intrinsics that propagate poison just like a simple arithmetic
instruction.
As a smoke test this patch adds the new property to
llvm.amdgcn.fmul.legacy.
This reverts commit 78bf682cb9033cf6a5bbc733e062c7b7d825fdaf.
Original PR: #157463
Revert PR: #158566
The relevant buildbots have been updated to a ROCm version that does not
use the macros anymore to avoid the failures.
Implements SWDEV-522062.
Currently Clang usually leaves padding bits uninitialized, which means
they are undef at the moment.
When expanding stores of vector types to include padding, the padding
lanes will be poison, hence the padding bits will be poison.
This interacts badly with coercion of arguments and return values, where
3 x float vectors will be loaded as i128 integer; poisoning the padding
bits will make the whole value poison.
Not sure if there's a better way, but I think we have a number of places
that currently rely on the padding being undef, not poison.
PR: https://github.com/llvm/llvm-project/pull/164821
Let Clang emit `llvm.tbaa.errno` metadata in order to let LLVM
carry out optimizations around errno-writing libcalls to, as
long as it is proved the involved memory location does not alias
`errno`.
Previous discussion: https://discourse.llvm.org/t/rfc-modelling-errno-memory-effects/82972.
LLVM models ConstantPointerNull as all-zero, but some GPUs (e.g. AMDGPU
and our downstream GPU target) use a non-zero sentinel for null in
private / local address spaces.
SPIR-V is a supported input for our GPU target. This PR preserves a
canonical zero form in the generic AS while allowing later lowering to
substitute the target’s real sentinel.
AMDGCN flavoured SPIR-V allows AMDGCN specific builtins, including those
for scoped fences and some specific RMWs. However, at present we don't
map syncscopes to their SPIR-V equivalents, but rather use the AMDGCN
ones. This ends up pessimising the resulting code as system scope is
used instead of device (agent) or subgroup (wavefront), so we correct
the behaviour, to ensure that we do the right thing during reverse
translation.
On new targets like `gfx1250`, the buffer resource (V#) now uses this
format:
```
base (57-bit): resource[56:0]
num_records (45-bit): resource[101:57]
reserved (6-bit): resource[107:102]
stride (14-bit): resource[121:108]
```
This PR changes the type of `num_records` from `i32` to `i64` in both
builtin and intrinsic, and also adds the support for lowering the new
format.
Fixes SWDEV-554034.
---------
Co-authored-by: Krzysztof Drewniak <Krzysztof.Drewniak@amd.com>