Load monitor operations make more sense as atomic operations, as
non-atomic operations cannot be used for inter-thread communication without
additional synchronization.
The previous built-in made it work because one could just override the
CPol bits, but that bypasses the memory model and forces the user to learn
about ISA bit encodings.
Making load monitor an atomic operation has a couple of advantages.
First, the memory model foundation for it is stronger. We just lean on the
existing rules for atomic operations. Second, the CPol bits are abstracted away
from the user, which avoids leaking ISA details into the API.
This patch also adds supporting memory model and intrinsics
documentation to AMDGPUUsage.
Solves SWDEV-516398.
In some of our use cases, the GPU runtime stores some data at the top of
the stack. It figures out where it's safe to store it by using the PAL
metadata generated by the backend, which includes the total stack size.
However, the metadata does not include the space reserved at the bottom
of the stack for the trap handler when CWSR is enabled in dynamic VGPR
mode. This space is reserved dynamically based on whether or not the
code is running on the compute queue. Therefore, the runtime needs a way
to take that into account.
Add support for `llvm.sponentry`, which should return the base of the
stack, skipping over any reserved areas. This allows us to keep this
computation in one place rather than duplicate it between the backend
and the runtime.
The implementation for functions that set up their own stack uses a
pseudo that is expanded to the same code sequence as that used in the
prolog to set up the stack in the first place.
In callable functions, we generate a fixed stack object and use that
instead, similar to the Arm/AArch64 approach. This wastes some stack
space but that's not a problem for now because we're not planning to use
this in callable functions yet.
Certain graphics APIs explicitly want the semantics of saturated
conversions, particularly w.r.t. edge cases like NaN. The underlying
hardware instructions (v_cvt_*) provide the expected behaviour so
llvm.fptosi.sat and llvm.fptoui.sat can be implemented directly.
Limitations:
- conversion to i64 is not handled (default expansion is used)
- v_cvt_u16_f16 and v_cvt_i16_f16 are not utilized (future work)
- scalar float is untested/unoptimized (future work)
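For reference, the saturating semantics that `llvm.fptosi.sat` and `llvm.fptoui.sat` require can be modelled on the host as below. This is only a sketch of the intrinsic semantics for the f32 to i32 case (NaN maps to 0, out-of-range values clamp); the function names are hypothetical and it is not the backend lowering.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// Hypothetical host-side model of llvm.fptosi.sat.i32.f32.
int32_t fptosi_sat_i32(float x) {
  if (std::isnan(x))
    return 0; // NaN converts to zero
  // 2^31 is exactly representable as a float; INT32_MAX is not.
  if (x >= 2147483648.0f)
    return std::numeric_limits<int32_t>::max();
  if (x <= -2147483648.0f)
    return std::numeric_limits<int32_t>::min();
  return static_cast<int32_t>(x); // in range: truncate toward zero
}

// Hypothetical host-side model of llvm.fptoui.sat.i32.f32.
uint32_t fptoui_sat_i32(float x) {
  if (std::isnan(x) || x <= 0.0f)
    return 0; // NaN and negative inputs saturate to zero
  if (x >= 4294967296.0f)
    return std::numeric_limits<uint32_t>::max();
  return static_cast<uint32_t>(x);
}
```

The explicit range checks matter because a plain C++ cast of an out-of-range or NaN float is undefined behaviour.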
Per CDNA4 ISA:
V_FFBH_I32
Count the number of leading bits that are the same as the sign bit of a
vector input and store the result into a vector register. Store -1 if
all input bits are the same.
which matches CTLS semantics.
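A host-side model of the quoted instruction description might look like the following sketch. The function name is hypothetical, and counting the sign bit itself as the first leading bit is an assumption based on one reading of the quoted text.

```cpp
#include <cstdint>

// Hypothetical reference model of the quoted V_FFBH_I32 description.
// Assumption: the count includes the sign bit itself.
int32_t ffbh_i32_model(int32_t value) {
  const uint32_t bits = static_cast<uint32_t>(value);
  const uint32_t sign = bits >> 31;
  for (int i = 1; i <= 31; ++i) {
    // Walk down from bit 30; stop at the first bit that differs from the sign.
    if (((bits >> (31 - i)) & 1u) != sign)
      return i;
  }
  return -1; // all 32 bits are identical (input is 0 or -1)
}
```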
Addresses: https://github.com/llvm/llvm-project/issues/177635
Returns uint64_t to simplify callers. The goal is to eventually replace
getValueType with this query, which should return the known minimum
reference-able size, as provided (instead of a Type) during create.
Additionally the common isSized query would be replaced with an
isExactKnownSize query to test if that size is an exact definition.
## Problem Summary
PyTorch's `test_warp_softmax_64bit_indexing` is failing with a numerical
precision error where `log(1.1422761679)` is computed with 54% higher error
than expected (9.042e-09 vs 5.859e-09), causing gradient computations to
exceed tolerance thresholds. This precision degradation was reproducible
across all AMD GPU architectures (gfx1100, gfx1200, gfx90a, gfx950). I
tracked down the problem to the commit **4703f8b6610a** (March 6, 2025)
which changed HIP math headers to call `__builtin_logf()` directly
instead of `__ocml_log_f32()`:
```diff
- float logf(float __x) { return __FAST_OR_SLOW(__logf, __ocml_log_f32)(__x); }
+ float logf(float __x) { return __FAST_OR_SLOW(__logf, __builtin_logf)(__x); }
```
This change exposed a problem in the AMDGCN back-end as described below:
## Key Findings
**1. Contract flag propagation:** When `-ffp-contract=fast` is enabled
(default for HIP), Clang's CodeGen adds the `contract` flag to all
`CallInst` instructions within the scope of `CGFPOptionsRAII`, including
calls to LLVM intrinsics like `llvm.log.f32`.
**2. Behavior change from OCML to builtin path:**
- **Old path** (via `__ocml_log_f32`): The preprocessed IR showed the
call to the OCML library function had the contract flag, but the OCML
implementation internally dropped the contract flag when calling the
`llvm.log.f32` intrinsic.
```llvm
; Function Attrs: alwaysinline convergent mustprogress nounwind
define internal noundef float @_ZL4logff(float noundef %__x) #6 {
entry:
%retval = alloca float, align 4, addrspace(5)
%__x.addr = alloca float, align 4, addrspace(5)
%retval.ascast = addrspacecast ptr addrspace(5) %retval to ptr
%__x.addr.ascast = addrspacecast ptr addrspace(5) %__x.addr to ptr
store float %__x, ptr %__x.addr.ascast, align 4, !tbaa !23
%0 = load float, ptr %__x.addr.ascast, align 4, !tbaa !23
%call = call contract float @__ocml_log_f32(float noundef %0) #23
ret float %call
}
; Function Attrs: convergent mustprogress nofree norecurse nosync nounwind willreturn memory(none)
define internal noundef float @__ocml_log_f32(float noundef %0) #7 {
%2 = tail call float @llvm.log.f32(float %0)
ret float %2
}
```
- **New path** (via `__builtin_logf`): The call goes directly to
`llvm.log.f32` intrinsic with the contract flag preserved, causing the
backend to apply FMA contraction during polynomial expansion.
```llvm
; Function Attrs: alwaysinline convergent mustprogress nounwind
define internal noundef float @_ZL4logff(float noundef %__x) #6 {
entry:
%retval = alloca float, align 4, addrspace(5)
%__x.addr = alloca float, align 4, addrspace(5)
%retval.ascast = addrspacecast ptr addrspace(5) %retval to ptr
%__x.addr.ascast = addrspacecast ptr addrspace(5) %__x.addr to ptr
store float %__x, ptr %__x.addr.ascast, align 4, !tbaa !24
%0 = load float, ptr %__x.addr.ascast, align 4, !tbaa !24
%1 = call contract float @llvm.log.f32(float %0)
ret float %1
}
```
**3. Why contract breaks log:** Our AMDGCN target back end implements
the natural logarithm by taking the result of the hardware log, then
multiplying that by `ln(2)`, and applying some rounding error correction
to that multiplication. This results in something like:
```c
r = y * c1; // y is result of v_log_ instruction, c1 = ln(2)
r = r + fma(y, c2, fma(y, c1, -r)); // c2 is another error-correcting constant
```
```asm
v_log_f32_e32 v1, v1
s_mov_b32 s2, 0x3f317217
v_mul_f32_e32 v3, 0x3f317217, v1
v_fma_f32 v4, v1, s2, -v3
v_fmac_f32_e32 v4, 0x3377d1cf, v1
v_add_f32_e32 v3, v3, v4
```
With the presence of the `contract` flag, the back-end fuses the add
(`r + Z`, where `Z` is the error compensation term) with the multiply,
believing this to be legal, and thereby eliminates the
intermediate rounding. The error compensation term, which was calculated
based on the rounded product, is now being added to the full-precision
result from the FMA, leading to incorrect error correction and degraded
accuracy. The corresponding contracted operations become the following:
```c
r = y * c1;
r = fma(y, c1, fma(y, c2, fma(y, c1, -r)));
```
```asm
v_log_f32_e32 v1, v1
s_mov_b32 s2, 0x3f317217
v_mul_f32_e32 v3, 0x3f317217, v1
v_fma_f32 v3, v1, s2, -v3
v_fmac_f32_e32 v3, 0x3377d1cf, v1
v_fmac_f32_e32 v3, 0x3f317217, v1
```
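The two sequences above can be reproduced on the host with the following sketch. The function names are hypothetical, `y` is assumed to stand in for the `v_log_f32` result (that is, `log2(x)`), and the constants are the bit patterns from the assembly listings. The `volatile` store is only there to keep the host compiler from contracting the product itself.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Reinterpret a 32-bit pattern as a float.
static float from_bits(uint32_t bits) {
  float f;
  std::memcpy(&f, &bits, sizeof f);
  return f;
}

// Uncontracted sequence: the final add uses the rounded product r.
float ln_uncontracted(float y) {
  const float c1 = from_bits(0x3f317217u); // ln(2), high part
  const float c2 = from_bits(0x3377d1cfu); // error-correcting constant
  volatile float product = y * c1;         // force rounding to float
  const float r = product;
  const float z = std::fma(y, c2, std::fma(y, c1, -r));
  return r + z;
}

// Contracted sequence: the final add is fused with a fresh, unrounded y * c1,
// so z compensates for a rounding error that is no longer present.
float ln_contracted(float y) {
  const float c1 = from_bits(0x3f317217u);
  const float c2 = from_bits(0x3377d1cfu);
  volatile float product = y * c1;
  const float r = product;
  const float z = std::fma(y, c2, std::fma(y, c1, -r));
  return std::fma(y, c1, z);
}
```

Comparing both results against a double-precision `std::log` for inputs such as `1.1422761679f` is one way to observe the difference in error; the exact figures depend on the host's `log2` implementation, so none are claimed here.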
## Solution and Proposed Fix
Based on our implementation of `llvm.log` and `llvm.log10`, it should be
illegal for the back-end to propagate the `contract` flag when it is
present on the intrinsic call because it uses error-correcting
summation. My proposed fix is to modify the instruction selection passes
(both global-isel and sdag) to drop the `contract` flag when lowering
`llvm.log`. That way, when instruction selection performs the
contraction optimization, it will not fuse the multiply and add.
Note: I had originally implemented this fix in the FE by removing the
`contract` flag when lowering the llvm.log builtin (PR #168770). I have
since closed that PR.
The test changes are mostly GlobalISel specific regressions.
GlobalISel is still relying on isUniformMMO, but it doesn't really
have an excuse for doing so. These should be avoidable with new
regbankselect.
There is an additional regression for addrspacecast for cov4. We
probably ought to be using a separate PseudoSourceValue for the
access of the queue pointer.
This isn't quite a constant pool, but probably close enough for this
purpose. We just need some known invariant value address. The aliasing
queries against the real kernarg base pointer will falsely report
no aliasing, but for invariant memory it probably doesn't matter.
If the high bits are assumed 0 for the cast, use zext. Previously
we would emit a build_vector and a bitcast with the high element
as 0. The zext is more easily optimized. I'm less convinced this is
good for globalisel, since you still need to have the inttoptr back
to the original pointer type.
The default value is 0, though I'm not sure if this is meaningful
in the real world. The real uses might always override the high
bit value with the attribute.
Add GlobalISel lowering of G_FMINIMUM and G_FMAXIMUM following the same
logic as in SDag's expandFMINIMUM_FMAXIMUM.
Update AMDGPU legalization rules: pre-GFX12 targets now use the new
lowering method, and G_FMINNUM_IEEE and G_FMAXNUM_IEEE are made legal to
match SDag.
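As a reminder of what the lowering has to produce, here is a host-side sketch of the minimum/maximum semantics (NaN propagates, and -0.0 orders below +0.0). The function names are hypothetical, and this models the result only, not the step sequence used in expandFMINIMUM_FMAXIMUM.

```cpp
#include <cmath>
#include <limits>

// Hypothetical model of G_FMINIMUM semantics for f32.
float fminimum_model(float a, float b) {
  if (std::isnan(a) || std::isnan(b))
    return std::numeric_limits<float>::quiet_NaN(); // NaN propagates
  if (a == 0.0f && b == 0.0f)
    return std::signbit(a) ? a : b; // prefer -0.0 over +0.0
  return a < b ? a : b;
}

// Hypothetical model of G_FMAXIMUM semantics for f32.
float fmaximum_model(float a, float b) {
  if (std::isnan(a) || std::isnan(b))
    return std::numeric_limits<float>::quiet_NaN();
  if (a == 0.0f && b == 0.0f)
    return std::signbit(a) ? b : a; // prefer +0.0 over -0.0
  return a > b ? a : b;
}
```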
On new targets like `gfx1250`, the buffer resource (V#) now uses this
format:
```
base (57-bit): resource[56:0]
num_records (45-bit): resource[101:57]
reserved (6-bit): resource[107:102]
stride (14-bit): resource[121:108]
```
This PR changes the type of `num_records` from `i32` to `i64` in both
builtin and intrinsic, and also adds the support for lowering the new
format.
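The layout above explains why `num_records` no longer fits in an `i32`: it is 45 bits wide and straddles the 64-bit boundary of the descriptor. A minimal packing sketch, assuming the descriptor is viewed as two 64-bit words and leaving the reserved field and bits above 121 as zero, could look like this (all names are hypothetical):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical view of the 128-bit V#: lo = bits [63:0], hi = bits [127:64].
struct BufferResourceWords {
  uint64_t lo;
  uint64_t hi;
};

BufferResourceWords pack_fields(uint64_t base, uint64_t num_records,
                                uint32_t stride) {
  assert(base < (1ull << 57));
  assert(num_records < (1ull << 45));
  assert(stride < (1u << 14));
  BufferResourceWords w;
  // base in [56:0]; the low 7 bits of num_records land in [63:57].
  w.lo = base | (num_records << 57);
  // The upper 38 bits of num_records land in [101:64]; stride in [121:108].
  w.hi = (num_records >> 7) | (static_cast<uint64_t>(stride) << (108 - 64));
  return w;
}

uint64_t extract_base(const BufferResourceWords &w) {
  return w.lo & ((1ull << 57) - 1);
}

uint64_t extract_num_records(const BufferResourceWords &w) {
  return (w.lo >> 57) | ((w.hi & ((1ull << 38) - 1)) << 7);
}

uint32_t extract_stride(const BufferResourceWords &w) {
  return static_cast<uint32_t>((w.hi >> 44) & 0x3fffu);
}
```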
Fixes SWDEV-554034.
Co-authored-by: Krzysztof Drewniak <Krzysztof.Drewniak@amd.com>
Since much of the code is interconnected, this also changes how the workgroup ID is lowered.
Co-authored-by: Jay Foad <jay.foad@amd.com>
Co-authored-by: Ivan Kosarev <ivan.kosarev@amd.com>
- Add clang built-ins + sema/codegen
- Add IR Intrinsic + verifier
- Add DAG/GlobalISel codegen for the intrinsics
- Add lowering in SIMemoryLegalizer using a MMO flag.
GlobalISel lowering for AMDGPU previously always narrowed truncating
stores to i32, regardless of memory size or scalar size. This caused
issues with types like i65, which is first extended to i128 and then
stored as i64 + i8 to an i128 location. Narrowing only when storing to a
power-of-2 memory size ensures that narrowing to the memory size happens
only near the end of legalization.
This LLVM defect was identified via the AMD Fuzzing project.
This concerns offset computations for kernargs and
RegBankLegalizeHelper::splitLoad, which should all be within the bounds of a
memory object. See #150392 for the motivation for introducing the
buildObjectPtrOffset function.
For SWDEV-516125.
The hardware min/max follow the IR rules with IEEE mode disabled,
so we can avoid canonicalizing the inputs. We lose the quieting
of a signaling nan if both inputs are nans, but we only require that
with strictfp.
The latest ASICs support the v_cvt_pk_f16_f32 instruction. However, the
current implementation of vector fptrunc lowering fully scalarizes the
vectors, and the scalar conversions may not always be combined to
generate the packed one.
We made v2f32 -> v2f16 legal in
https://github.com/llvm/llvm-project/pull/139956. This work is an
extension to handle wider vectors. Instead of full scalarization, we
split the vector into packs (v2f32 -> v2f16) to ensure the packed
conversion can always be generated.
This annotates the `Twine` passed to the constructors of the various
DiagnosticInfo subclasses with `[[clang::lifetimebound]]`, which causes
us to warn when we would try to print the twine after it had already
been destructed.
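The pattern can be illustrated with a small stand-in class. This is a sketch only: `Diagnostic` and the `SKETCH_LIFETIMEBOUND` macro are hypothetical and use `std::string` in place of `Twine`; the attribute is guarded so that other compilers simply ignore it.

```cpp
#include <string>

#if defined(__clang__)
#define SKETCH_LIFETIMEBOUND [[clang::lifetimebound]]
#else
#define SKETCH_LIFETIMEBOUND
#endif

// Hypothetical stand-in for a DiagnosticInfo subclass holding a reference.
class Diagnostic {
  const std::string &Msg;

public:
  explicit Diagnostic(const std::string &M SKETCH_LIFETIMEBOUND) : Msg(M) {}
  const std::string &message() const { return Msg; }
};

// With the annotation, clang is expected to diagnose a case like this, where
// the temporary dies at the end of the full-expression:
//   Diagnostic Bad(std::string("temporary"));
```

Passing a named string that outlives the `Diagnostic` object remains fine and produces no warning.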
We also update `DiagnosticInfoUnsupported` to hold a `const Twine &`
like all of the other DiagnosticInfo classes, since this warning allows
us to clean up all of the places where it was being used incorrectly.
Fix for a bug found by the AMD fuzzing project.
The legaliser would originally try to widen a small vector such as `<4 x
i1>` to a single `i16` during the legalisation of bitshifts, as it was
not originally written with consideration for vector operands. This
patch simply adds a guard to prohibit this transformation and allow
other legalisation transformations to step in.
This is the bare minimum to get the intrinsic to compile for AMDGPU,
and it's not optimal. We need to follow along closer with the existing
G_FMINNUM/G_FMAXNUM with custom lowering to handle the IEEE=0 case
better.
Just re-use the existing lowering for the old semantics for
G_FMINNUM/G_FMAXNUM. This does not change G_FMINNUM/G_FMAXNUM's
treatment, nor try to handle the general expansion without an underlying
min/max variant (or with G_FMINIMUM/G_FMAXIMUM).
Legalize the amdgcn.dead intrinsic to work with types other than i32. It
still generates IMPLICIT_DEFs.
Remove some of the previous code for selecting/reg bank mapping it for
32-bit types, since everything is done in the legalizer now.