Certain graphics APIs explicitly want the semantics of saturated
conversions, particularly w.r.t. edge cases like NaN. The underlying
hardware instructions (v_cvt_*) provide the expected behaviour so
llvm.fptosi.sat and llvm.fptoui.sat can be implemented directly.
Limitations:
- conversion to i64 is not handled (default expansion is used)
- v_cvt_u16_f16 and v_cvt_i16_f16 are not utilized (future work)
- scalar float is untested/unoptimized (future work)
One of global flags in `resetTargetOptions`, users should use `nsz`
instead.
`fneg_fadd_0_f64` from `AMDGPU/fneg-combines.new.ll` will have
regression when `fadd` is annotated with `nsz`.
Per CDNA4 ISA:
V_FFBH_I32
Count the number of leading bits that are the same as the sign bit of a
vector input and store the result into a vector register. Store -1 if
all input bits are the same.
which matches CTLS semantics.
Addresses: https://github.com/llvm/llvm-project/issues/177635
This treats it as free on targets without legal f16. This
matches the existing logic in fneg, and they should be the same.
The test changes are mostly neutral with a few improvements.
Returns uint64_t to simplify callers. The goal is eventually replace
getValueType with this query, which should return the known minimum
reference-able size, as provided (instead of a Type) during create.
Additionally the common isSized query would be replaced with an
isExactKnownSize query to test if that size is an exact definition.
## Problem Summary
PyTorch's `test_warp_softmax_64bit_indexing` is failing with a numerical
precision error where `log(1.1422761679)` computed with 54% higher error
than expected (9.042e-09 vs 5.859e-09), causing gradient computations to
exceed tolerance thresholds. This precision degradation was reproducible
across all AMD GPU architectures (gfx1100, gfx1200, gfx90a, gfx950). I
tracked down the problem to the commit **4703f8b6610a** (March 6, 2025)
which changed HIP math headers to call `__builtin_logf()` directly
instead of `__ocml_log_f32()`:
```diff
- float logf(float __x) { return __FAST_OR_SLOW(__logf, __ocml_log_f32)(__x); }
+ float logf(float __x) { return __FAST_OR_SLOW(__logf, __builtin_logf)(__x); }
```
This change exposed a problem in the AMDGCN back-end as described below:
## Key Findings
**1. Contract flag propagation:** When `-ffp-contract=fast` is enabled
(default for HIP), Clang's CodeGen adds the `contract` flag to all
`CallInst` instructions within the scope of `CGFPOptionsRAII`, including
calls to LLVM intrinsics like `llvm.log.f32`.
**2. Behavior change from OCML to builtin path:**
- **Old path** (via `__ocml_log_f32`): The preprocessed IR showed the
call to the OCML library function had the contract flag, but the OCML
implementation internally dropped the contract flag when calling the
`llvm.log.f32` intrinsic.
```llvm
; Function Attrs: alwaysinline convergent mustprogress nounwind
define internal noundef float @_ZL4logff(float noundef %__x) #6 {
entry:
%retval = alloca float, align 4, addrspace(5)
%__x.addr = alloca float, align 4, addrspace(5)
%retval.ascast = addrspacecast ptr addrspace(5) %retval to ptr
%__x.addr.ascast = addrspacecast ptr addrspace(5) %__x.addr to ptr
store float %__x, ptr %__x.addr.ascast, align 4, !tbaa !23
%0 = load float, ptr %__x.addr.ascast, align 4, !tbaa !23
%call = call contract float @__ocml_log_f32(float noundef %0) #23
ret float %call
}
; Function Attrs: convergent mustprogress nofree norecurse nosync nounwind willreturn memory(none)
define internal noundef float @__ocml_log_f32(float noundef %0) #7 {
%2 = tail call float @llvm.log.f32(float %0)
ret float %2
}
```
- **New path** (via `__builtin_logf`): The call goes directly to
`llvm.log.f32` intrinsic with the contract flag preserved, causing the
backend to apply FMA contraction during polynomial expansion.
```llvm
; Function Attrs: alwaysinline convergent mustprogress nounwind
define internal noundef float @_ZL4logff(float noundef %__x) #6 {
entry:
%retval = alloca float, align 4, addrspace(5)
%__x.addr = alloca float, align 4, addrspace(5)
%retval.ascast = addrspacecast ptr addrspace(5) %retval to ptr
%__x.addr.ascast = addrspacecast ptr addrspace(5) %__x.addr to ptr
store float %__x, ptr %__x.addr.ascast, align 4, !tbaa !24
%0 = load float, ptr %__x.addr.ascast, align 4, !tbaa !24
%1 = call contract float @llvm.log.f32(float %0)
ret float %1
}
```
**3. Why contract breaks log:** Our AMDGCM target back end implements
the natural logarithm by taking the result of the hardware log, then
multiplying that by `ln(2)`, and applying some rounding error correction
to that multiplication. This results in something like:
```c
r = y * c1; // y is result of v_log_ instruction, c1 = ln(2)
r = r + fma(y, c2, fma(y, c1, -r)) // c2 is another error-correcting constant
```
```asm
v_log_f32_e32 v1, v1
s_mov_b32 s2, 0x3f317217
v_mul_f32_e32 v3, 0x3f317217, v1
v_fma_f32 v4, v1, s2, -v3
v_fmac_f32_e32 v4, 0x3377d1cf, v1
v_add_f32_e32 v3, v3, v4
```
With the presence of the `contract` flag, the back-end fuses the add (`r
+ Z`) with the multiply thinking that it is legal, thus eliminating the
intermediate rounding. The error compensation term, which was calculated
based on the rounded product, is now being added to the full-precision
result from the FMA, leading to incorrect error correction and degraded
accuracy. The corresponding contracted operations become the following:
```c
r = y * c1;
r = fma(y, c1, fma(y, c2, fma(y, c1, -r)));
```
```asm
v_log_f32_e32 v1, v1
s_mov_b32 s2, 0x3f317217
v_mul_f32_e32 v3, 0x3f317217, v1
v_fma_f32 v3, v1, s2, -v3
v_fmac_f32_e32 v3, 0x3377d1cf, v1
v_fmac_f32_e32 v3, 0x3f317217, v1
```
## Solution and Proposed Fix
Based on our implementation of `llvm.log` and `llvm.log10`, it should be
illegal for the back-end to propagate the `contract` flag when it is
present on the intrinsic call because it uses error-correcting
summation. My proposed fix is to modify the instruction selection passes
(both global-isel and sdag) to drop the `contract` flag when lowering
llvm.log. That way, when the instruction selection performs the
contraction optimization, it will not fuse the multiply and add.
Note: I had originally implemented this fix in the FE by removing the
`contract` flag when lowering the llvm.log builtin (PR #168770). I have
since closed that PR.
This was calling the exp handling, so multiplying by the wrong
constant.
GlobalISel is still broken, but missing the fast exp10 path.
This is tracked in https://github.com/llvm/llvm-project/issues/170576
Fixes a bug in `AMDGPUISelLowering` where alias analysis info is not
propagated to split loads and stores.
This is required for #161375
---------
Co-authored-by: Leon Clark <leoclark@amd.com>
Currently LibcallLoweringInfo is defined inside of TargetLowering,
which is owned by the subtarget. Pass in the subtarget so we can
construct LibcallLoweringInfo with the subtarget. This is a temporary
step that should be revertable in the future, after LibcallLoweringInfo
is moved out of TargetLowering.
This allows SDNodes to be validated against their expected type profiles
and reduces the number of changes required to add a new node.
Autogenerated node names start with "AMDGPUISD::", hence the changes in
the tests.
The few nodes defined in R600.td are *not* imported because TableGen
processes AMDGPU.td that doesn't include R600.td. Ideally, we would have
two sets of nodes, but that would require careful reorganization of td
files since some nodes are shared between AMDGPU/R600. Not sure if it
something worth looking into.
Some nodes fail validation, those are listed in
`AMDGPUSelectionDAGInfo::verifyTargetNode()`.
Part of #119709.
Pull Request: https://github.com/llvm/llvm-project/pull/168248
- Enable s_or_b64/s_and_b64/s_xor_b64 for v2i32. Add various additional
combines to make use of these newly legalised instructions.
- Update several tests and separate legacy r600 tests where necessary.
Support tail calls to whole wave functions (trivial) and from whole wave
functions (slightly more involved because we need a new pseudo for the
tail call return, that patches up the EXEC mask).
Move the expansion of whole wave function return pseudos (regular and
tail call returns) to prolog epilog insertion, since that's where we
patch up the EXEC mask.
This patch implements a correctly rounded expansion of the frem
instruction in LLVM IR. This is useful for target architectures for
which such an expansion is too involved to be implement in ISel
Lowering. The expansion is based on the code from the AMD device libs
and has been tested successfully against the OpenCL conformance tests on
amdgpu. The expansion is implemented in the preexisting "expand-fp"
pass. It replaces the expansion of "frem" in ISel for the amdgpu target;
it is enabled for targets which do not directly support "frem" and for
which no matching "fmod" LibCall is available.
---------
Co-authored-by: Matt Arsenault <Matthew.Arsenault@amd.com>
[recommit https://github.com/llvm/llvm-project/pull/151763 after fixing
https://github.com/llvm/llvm-project/issues/152150]
We already had corresponding f32 and i32 vector types for these sizes.
Also add VTs v[567]i8 and v[567]i16: these are needed by the Hexagon
backend which for each i1 vector types want to query information about
the corresponding i8 and i16 types in
HexagonTargetLowering::getPreferredHvxVectorAction.
Add AMDGPUTargetLowering::canCreateUndefOrPoisonForTargetNode handler
and tag BFE_I32/U32 nodes as they can only propagate poison, not create
poison/undef.
Fighting some of the remaining regressions in #152107
Compare sinking is selectable based on the result of
hasMultipleConditionRegisters. This function is too coarse grained by
not taking into account the differences between scalar and vector
compares. This PR extends the interface to take an EVT to allow finer
control.
The new interface is used by AArch64 to disable sinking of scalable
vector compares, but with isProfitableToSinkOperands updated to maintain
the cases that are specifically tested.
We already had corresponding f32 and i32 vector types for these sizes.
Also add VTs v[567]i8 and v[567]i16: these are needed by the Hexagon
backend which for each i1 vector types want to query information about
the corresponding i8 and i16 types in
HexagonTargetLowering::getPreferredHvxVectorAction.
These float operations were expanded for scalar f32/f64/f128, but not
for f16 and more problematically, not for vectors. A small subset of
them was separately set to expand for vectors.
Change these to always expand by default, and adjust targets to mark
these as legal where necessary instead.
This is a much safer default, and avoids unnecessary legalization
failures because a target failed to manually mark them as expand.
Fixes https://github.com/llvm/llvm-project/issues/110753.
Fixes https://github.com/llvm/llvm-project/issues/121390.
Whole wave functions are functions that will run with a full EXEC mask.
They will not be invoked directly, but instead will be launched by way
of a new intrinsic, `llvm.amdgcn.call.whole.wave` (to be added in
a future patch). These functions are meant as an alternative to the
`llvm.amdgcn.init.whole.wave` or `llvm.amdgcn.strict.wwm` intrinsics.
Whole wave functions will set EXEC to -1 in the prologue and restore the
original value of EXEC in the epilogue. They must have a special first
argument, `i1 %active`, that is going to be mapped to EXEC. They may
have either the default calling convention or amdgpu_gfx. The inactive
lanes need to be preserved for all registers used, active lanes only for
the CSRs.
At the IR level, arguments to a whole wave function (other than
`%active`) contain poison in their inactive lanes. Likewise, the return
value for the inactive lanes is poison.
This patch contains the following work:
* 2 new pseudos, SI_SETUP_WHOLE_WAVE_FUNC and SI_WHOLE_WAVE_FUNC_RETURN
used for managing the EXEC mask. SI_SETUP_WHOLE_WAVE_FUNC will return
a SReg_1 representing `%active`, which needs to be passed into
SI_WHOLE_WAVE_FUNC_RETURN.
* SelectionDAG support for generating these 2 new pseudos and the
special handling of %active. Since the return may be in a different
basic block, it's difficult to add the virtual reg for %active to
SI_WHOLE_WAVE_FUNC_RETURN, so we initially generate an IMPLICIT_DEF
which is later replaced via a custom inserter.
* Expansion of the 2 pseudos during prolog/epilog insertion. PEI also
marks any used VGPRs as WWM registers, which are then spilled and
restored with the usual logic.
Future patches will include the `llvm.amdgcn.call.whole.wave` intrinsic
and a lot of optimization work (especially in order to reduce spills
around function calls).
---------
Co-authored-by: Matt Arsenault <Matthew.Arsenault@amd.com>
Co-authored-by: Shilei Tian <i@tianshilei.me>