llvm-project

Author	SHA1	Message	Date
paperchalice	25315f287c	[AMDGPU] Remove `NoNaNsFPMath` uses (#180469 ) Should use `nnan` flag only.	2026-02-09 17:29:21 +08:00
Carl Ritson	447f1e43bb	[AMDGPU] Implement llvm.fptosi.sat and llvm.fptoui.sat (#174726 ) Certain graphics APIs explicitly want the semantics of saturated conversions, particularly w.r.t. edge cases like NaN. The underlying hardware instructions (v_cvt_*) provide the expected behaviour so llvm.fptosi.sat and llvm.fptoui.sat can be implemented directly. Limitations: - conversion to i64 is not handled (default expansion is used) - v_cvt_u16_f16 and v_cvt_i16_f16 are not utilized (future work) - scalar float is untested/unoptimized (future work)	2026-01-30 17:07:40 +09:00
Kewen Meng	120b482375	Revert "[AMDGPU] Replace AMDGPUISD::FFBH_I32 with ISD::CTLS" (#178837 ) Revert to unblock buildbot: https://lab.llvm.org/buildbot/#/builders/206/builds/12769	2026-01-29 21:19:15 -08:00
paperchalice	62aa40a4dd	[AMDGPU] Remove `NoSignedZerosFPMath` uses (#178343 ) One of the global flags in `resetTargetOptions`; users should use `nsz` instead. `fneg_fadd_0_f64` from `AMDGPU/fneg-combines.new.ll` will have a regression when `fadd` is annotated with `nsz`.	2026-01-30 09:18:40 +08:00
Dmitry Sidorov	65925b0405	[AMDGPU] Replace AMDGPUISD::FFBH_I32 with ISD::CTLS (#178420 ) Per CDNA4 ISA: V_FFBH_I32 Count the number of leading bits that are the same as the sign bit of a vector input and store the result into a vector register. Store -1 if all input bits are the same. which matches CTLS semantics. Addresses: https://github.com/llvm/llvm-project/issues/177635	2026-01-30 01:36:28 +01:00
Matt Arsenault	df3aa0d16b	AMDGPU: Use generic legality checks instead of checking subtarget feature (#177656 )	2026-01-23 20:43:44 +01:00
Matt Arsenault	852649d3ee	AMDGPU: Remove an unnecessary lookup of the AMDGPUSubtarget (#177646 )	2026-01-23 20:33:57 +01:00
Matt Arsenault	c45684f037	AMDGPU: Ignore type legality in isFAbsFree (#177630 ) This treats it as free on targets without legal f16. This matches the existing logic in fneg, and they should be the same. The test changes are mostly neutral with a few improvements.	2026-01-23 19:43:23 +01:00
Matt Arsenault	98b55bcdec	AMDGPU: Move f16 legality configuration to SITargetLowering (#177629 ) f16 is never legal for R600 so this should not be in the common base class.	2026-01-23 18:36:26 +00:00
Matt Arsenault	9f3d143d96	AMDGPU: Remove dead code configuring f16 is_fpclass (#177626 ) isTypeLegal can never be true here. The register classes are registered at the end of the target lowering constructor, and in the subclasses.	2026-01-23 19:24:24 +01:00
Matt Arsenault	d5545db0b0	AMDGPU: Mark strict_fp16_to_fp as expand (#177417 ) This prevents a regression in a future change.	2026-01-22 20:16:34 +01:00
Jameson Nash	d10b2b566a	[NFCI] replace getValueType with new getGlobalSize query (#177186 ) Returns uint64_t to simplify callers. The goal is to eventually replace getValueType with this query, which should return the known minimum reference-able size, as provided (instead of a Type) during create. Additionally, the common isSized query would be replaced with an isExactKnownSize query to test if that size is an exact definition.	2026-01-22 13:55:53 -05:00
Matt Arsenault	a470e708be	AMDGPU: Teach lowering that sqrt never returns subnormal (#174838 )	2026-01-08 12:05:29 +01:00
Islam Imad	7ceecfad40	[CodeGen] Fix EVT::changeVectorElementType assertion on simple-to-extended fallback (#173413 ) Fixes #171608	2025-12-28 18:51:18 +00:00
Matt Arsenault	0f572c1053	AMDGPU: Teach lowering that exp and log intrinsics cannot return denormals (#172296 )	2025-12-23 10:51:11 +01:00
Matt Arsenault	786498b281	AMDGPU: Fix truncstore from v6f32 to v6f16 (#171212 ) The v6bf16 cases work, but that's likely because v6bf16 isn't currently an MVT. Fixes: SWDEV-570985	2025-12-08 22:46:36 +00:00
Jay Foad	7a59ab0e1a	[AMDGPU] Common up some unsafe fexp lowering. NFC. (#170841 )	2025-12-08 09:50:45 +00:00
Jay Foad	b36f89faed	[AMDGPU] Make rotr illegal (#166558 ) fshr is already legal and is strictly more powerful than rotr, so we should only need selection patterns for fshr.	2025-12-05 12:39:50 +00:00
Matt Arsenault	63e9d60786	AMDGPU: Improve exp10 lowering for f16 (#170771 )	2025-12-05 11:58:13 +01:00
Adel Ejjeh	3c2e5d50ca	[AMDGPU] Update log lowering to remove contract for AMDGCN backend (#168916 )

## Problem Summary

PyTorch's `test_warp_softmax_64bit_indexing` is failing with a numerical precision error where `log(1.1422761679)` is computed with 54% higher error than expected (9.042e-09 vs 5.859e-09), causing gradient computations to exceed tolerance thresholds. This precision degradation was reproducible across all AMD GPU architectures (gfx1100, gfx1200, gfx90a, gfx950).

I tracked down the problem to the commit 4703f8b6610a (March 6, 2025) which changed HIP math headers to call `__builtin_logf()` directly instead of `__ocml_log_f32()`:

```diff
- float logf(float __x) { return __FAST_OR_SLOW(__logf, __ocml_log_f32)(__x); }
+ float logf(float __x) { return __FAST_OR_SLOW(__logf, __builtin_logf)(__x); }
```

This change exposed a problem in the AMDGCN back-end as described below:

## Key Findings

1. Contract flag propagation: When `-ffp-contract=fast` is enabled (default for HIP), Clang's CodeGen adds the `contract` flag to all `CallInst` instructions within the scope of `CGFPOptionsRAII`, including calls to LLVM intrinsics like `llvm.log.f32`.

2. Behavior change from OCML to builtin path:

- Old path (via `__ocml_log_f32`): The preprocessed IR showed the call to the OCML library function had the contract flag, but the OCML implementation internally dropped the contract flag when calling the `llvm.log.f32` intrinsic.

```llvm
; Function Attrs: alwaysinline convergent mustprogress nounwind
define internal noundef float @_ZL4logff(float noundef %__x) #6 {
entry:
  %retval = alloca float, align 4, addrspace(5)
  %__x.addr = alloca float, align 4, addrspace(5)
  %retval.ascast = addrspacecast ptr addrspace(5) %retval to ptr
  %__x.addr.ascast = addrspacecast ptr addrspace(5) %__x.addr to ptr
  store float %__x, ptr %__x.addr.ascast, align 4, !tbaa !23
  %0 = load float, ptr %__x.addr.ascast, align 4, !tbaa !23
  %call = call contract float @__ocml_log_f32(float noundef %0) #23
  ret float %call
}

; Function Attrs: convergent mustprogress nofree norecurse nosync nounwind willreturn memory(none)
define internal noundef float @__ocml_log_f32(float noundef %0) #7 {
  %2 = tail call float @llvm.log.f32(float %0)
  ret float %2
}
```

- New path (via `__builtin_logf`): The call goes directly to the `llvm.log.f32` intrinsic with the contract flag preserved, causing the backend to apply FMA contraction during polynomial expansion.

```llvm
; Function Attrs: alwaysinline convergent mustprogress nounwind
define internal noundef float @_ZL4logff(float noundef %__x) #6 {
entry:
  %retval = alloca float, align 4, addrspace(5)
  %__x.addr = alloca float, align 4, addrspace(5)
  %retval.ascast = addrspacecast ptr addrspace(5) %retval to ptr
  %__x.addr.ascast = addrspacecast ptr addrspace(5) %__x.addr to ptr
  store float %__x, ptr %__x.addr.ascast, align 4, !tbaa !24
  %0 = load float, ptr %__x.addr.ascast, align 4, !tbaa !24
  %1 = call contract float @llvm.log.f32(float %0)
  ret float %1
}
```

3. Why contract breaks log: Our AMDGCN target back end implements the natural logarithm by taking the result of the hardware log, then multiplying that by `ln(2)`, and applying some rounding error correction to that multiplication. This results in something like:

```c
r = y * c1; // y is result of v_log_ instruction, c1 = ln(2)
r = r + fma(y, c2, fma(y, c1, -r)); // c2 is another error-correcting constant
```

```asm
v_log_f32_e32 v1, v1
s_mov_b32 s2, 0x3f317217
v_mul_f32_e32 v3, 0x3f317217, v1
v_fma_f32 v4, v1, s2, -v3
v_fmac_f32_e32 v4, 0x3377d1cf, v1
v_add_f32_e32 v3, v3, v4
```

With the presence of the `contract` flag, the back-end fuses the add (`r + Z`, where `Z` is the `fma` correction term) with the multiply, thinking that it is legal, thus eliminating the intermediate rounding. The error compensation term, which was calculated based on the rounded product, is now being added to the full-precision result from the FMA, leading to incorrect error correction and degraded accuracy. The corresponding contracted operations become the following:

```c
r = y * c1;
r = fma(y, c1, fma(y, c2, fma(y, c1, -r)));
```

```asm
v_log_f32_e32 v1, v1
s_mov_b32 s2, 0x3f317217
v_mul_f32_e32 v3, 0x3f317217, v1
v_fma_f32 v3, v1, s2, -v3
v_fmac_f32_e32 v3, 0x3377d1cf, v1
v_fmac_f32_e32 v3, 0x3f317217, v1
```

## Solution and Proposed Fix

Based on our implementation of `llvm.log` and `llvm.log10`, it should be illegal for the back-end to propagate the `contract` flag when it is present on the intrinsic call because it uses error-correcting summation. My proposed fix is to modify the instruction selection passes (both global-isel and sdag) to drop the `contract` flag when lowering llvm.log. That way, when the instruction selection performs the contraction optimization, it will not fuse the multiply and add.

Note: I had originally implemented this fix in the FE by removing the `contract` flag when lowering the llvm.log builtin (PR #168770). I have since closed that PR.	2025-12-04 21:55:13 +01:00
Matt Arsenault	5c659c917a	AMDGPU: Create a dummy call sequence when emitting call error (#170656 ) At least one special case call lowering tries to parse the call sequence and asserts when it can't find a callseq_end.	2025-12-04 17:37:57 +00:00
Matt Arsenault	b8b7eda7fe	AMDGPU: Use correct chain when emitting error on a call (#170645 ) Return the input chain at the callsite, not the entry node chain. Presumably this could cause issues somewhere.	2025-12-04 14:26:42 +01:00
Matt Arsenault	52113cf14f	AMDGPU: Fix broken exp10 lowering for f16 (#170582 ) This was calling the exp handling, so multiplying by the wrong constant. GlobalISel is still broken, but missing the fast exp10 path. This is tracked in https://github.com/llvm/llvm-project/issues/170576	2025-12-04 10:47:33 +01:00
Leon Clark	ee4f6478ba	[AMDGPU] Propagate AA info in vector load/store splitting. (#168871 ) Fixes a bug in `AMDGPUISelLowering` where alias analysis info is not propagated to split loads and stores. This is required for #161375 --------- Co-authored-by: Leon Clark <leoclark@amd.com>	2025-11-24 03:30:35 +00:00
Matt Arsenault	a757c4e74e	CodeGen: Add subtarget to TargetLoweringBase constructor (#168620 ) Currently LibcallLoweringInfo is defined inside of TargetLowering, which is owned by the subtarget. Pass in the subtarget so we can construct LibcallLoweringInfo with the subtarget. This is a temporary step that should be revertable in the future, after LibcallLoweringInfo is moved out of TargetLowering.	2025-11-19 19:18:13 +00:00
Sergei Barannikov	900c517919	[AMDGPU] TableGen-erate SDNode descriptions (#168248 ) This allows SDNodes to be validated against their expected type profiles and reduces the number of changes required to add a new node. Autogenerated node names start with "AMDGPUISD::", hence the changes in the tests. The few nodes defined in R600.td are not imported because TableGen processes AMDGPU.td that doesn't include R600.td. Ideally, we would have two sets of nodes, but that would require careful reorganization of td files since some nodes are shared between AMDGPU/R600. Not sure if it is something worth looking into. Some nodes fail validation, those are listed in `AMDGPUSelectionDAGInfo::verifyTargetNode()`. Part of #119709. Pull Request: https://github.com/llvm/llvm-project/pull/168248	2025-11-17 00:10:07 +00:00
Jay Foad	72c69aefba	[AMDGPU] Make use of getFunction and getMF. NFC. (#167872 )	2025-11-14 11:00:57 +00:00
Sergei Barannikov	71927ddb63	[CodeGen] Delete two ComputeValueVTs overloads (NFC) (#166758 ) Those have only a few uses.	2025-11-06 19:45:29 +03:00
LU-JOHN	f899893c19	[AMDGPU][NFC] Cleanly make 32-bit abs legal (#164837 ) Cleanly make 32-bit abs legal only in SIISelLowering.cpp Signed-off-by: John Lu <John.Lu@amd.com>	2025-10-23 15:21:30 -05:00
LU-JOHN	f7c9618d57	[AMDGPU] 32-bit ABS is a legal DAG node (#163907 ) 32-bit ABS can be lowered legally. --------- Signed-off-by: John Lu <John.Lu@amd.com>	2025-10-17 16:11:59 -05:00
paperchalice	14a42e64cf	[AMDGPU] Remove NoInfsFPMath uses (#163028 ) Only `ninf` should be used.	2025-10-13 19:15:49 +08:00
Janek van Oirschot	a584bd98a1	Revert "[AMDGPU] Elide bitcast fold i64 imm to build_vector" (#160325 ) Reverts llvm/llvm-project#154115 Co-authored-by: ronlieb <ron.lieberman@amd.com>	2025-09-23 17:36:02 +01:00
Janek van Oirschot	341cdbc970	[AMDGPU] Elide bitcast fold i64 imm to build_vector (#154115 ) Elide bitcast combine to build_vector in case of i64 immediate that can be materialized through 64b mov	2025-09-16 16:44:51 +01:00
Chris Jackson	5e6564b098	[AMDGPU][SDAG] Legalise v2i32 or/xor/and instructions to make use of 64-bit wide instructions (#140694 ) - Enable s_or_b64/s_and_b64/s_xor_b64 for v2i32. Add various additional combines to make use of these newly legalised instructions. - Update several tests and separate legacy r600 tests where necessary.	2025-09-11 13:32:44 +01:00
Diana Picus	018dc1b397	[AMDGPU] Tail call support for whole wave functions (#145860 ) Support tail calls to whole wave functions (trivial) and from whole wave functions (slightly more involved because we need a new pseudo for the tail call return, that patches up the EXEC mask). Move the expansion of whole wave function return pseudos (regular and tail call returns) to prolog epilog insertion, since that's where we patch up the EXEC mask.	2025-09-04 10:34:43 +02:00
Frederik Harwath	47793f9a73	[AMDGPU] Implement IR expansion for frem instruction (#130988 ) This patch implements a correctly rounded expansion of the frem instruction in LLVM IR. This is useful for target architectures for which such an expansion is too involved to be implemented in ISel Lowering. The expansion is based on the code from the AMD device libs and has been tested successfully against the OpenCL conformance tests on amdgpu. The expansion is implemented in the preexisting "expand-fp" pass. It replaces the expansion of "frem" in ISel for the amdgpu target; it is enabled for targets which do not directly support "frem" and for which no matching "fmod" LibCall is available. --------- Co-authored-by: Matt Arsenault <Matthew.Arsenault@amd.com>	2025-09-03 16:27:15 +02:00
Adam Nemet	9b5502292d	[CG] Add VTs for v[567]i1 and v[567]f16 (#156523 ) [recommit https://github.com/llvm/llvm-project/pull/151763 after fixing https://github.com/llvm/llvm-project/issues/152150] We already had corresponding f32 and i32 vector types for these sizes. Also add VTs v[567]i8 and v[567]i16: these are needed by the Hexagon backend which, for each i1 vector type, wants to query information about the corresponding i8 and i16 types in HexagonTargetLowering::getPreferredHvxVectorAction.	2025-09-02 17:40:43 -07:00
paperchalice	595573d1ed	[AMDGPU] Remove `ApproxFuncFPMath` uses (#155578 ) One of the options in `resetTargetOptions`; this removes `ApproxFuncFPMath` in the AMDGPU part.	2025-08-28 11:09:01 +08:00
Simon Pilgrim	2a79ef66eb	[AMDGPU] canCreateUndefOrPoisonForTargetNode - BFE_I32/U32 can't create poison/undef (#154932 ) Add AMDGPUTargetLowering::canCreateUndefOrPoisonForTargetNode handler and tag BFE_I32/U32 nodes as they can only propagate poison, not create poison/undef. Fighting some of the remaining regressions in #152107	2025-08-22 12:14:45 +00:00
Matt Arsenault	1f0af171cd	AMDGPU: Fix using illegal extract_subvector indexes (#154098 )	2025-08-20 00:35:09 +00:00
Gang Chen	ef68d1587d	[AMDGPU] upstream barrier count reporting part1 (#154409 )	2025-08-19 16:42:31 -07:00
Stanislav Mekhanoshin	e7c2c80fa1	[AMDGPU] Combine prng(undef) -> undef (#154160 )	2025-08-18 12:13:16 -07:00
Adam Nemet	124722bfe5	Revert "[CG] Add VTs for v[567]i1 and v[567]f16" (#152217 ) Reverts llvm/llvm-project#151763 It caused: https://github.com/llvm/llvm-project/issues/152150	2025-08-05 16:47:50 -07:00
Paul Walker	94d374ab6c	[LLVM][CGP] Allow finer control for sinking compares. (#151366 ) Compare sinking is selectable based on the result of hasMultipleConditionRegisters. This function is too coarse grained by not taking into account the differences between scalar and vector compares. This PR extends the interface to take an EVT to allow finer control. The new interface is used by AArch64 to disable sinking of scalable vector compares, but with isProfitableToSinkOperands updated to maintain the cases that are specifically tested.	2025-08-05 11:43:41 +01:00
Adam Nemet	300e41d72f	[CG] Add VTs for v[567]i1 and v[567]f16 (#151763 ) We already had corresponding f32 and i32 vector types for these sizes. Also add VTs v[567]i8 and v[567]i16: these are needed by the Hexagon backend which, for each i1 vector type, wants to query information about the corresponding i8 and i16 types in HexagonTargetLowering::getPreferredHvxVectorAction.	2025-08-02 09:00:31 -07:00
paperchalice	8bacfb2538	[AMDGPU] Remove `UnsafeFPMath` uses (#151079 ) Remove `UnsafeFPMath` in the AMDGPU part; it blocks some bugfixes related to clang, and the ultimate goal is to remove the `resetTargetOptions` method in `TargetMachine`, see FIXME in `resetTargetOptions`. See also https://discourse.llvm.org/t/rfc-honor-pragmas-with-ffp-contract-fast https://discourse.llvm.org/t/allowfpopfusion-vs-sdnodeflags-hasallowcontract --------- Co-authored-by: Matt Arsenault <arsenm2@gmail.com>	2025-07-31 17:36:57 +08:00
Daniil Fukalov	e650c4b9ef	[NFC][AMDGPU] Move cmp+select arguments optimization to SIISelLowering. (#150929 ) As requested in #148740.	2025-07-28 22:11:36 +02:00
Nikita Popov	fe0dbe0f29	[CodeGen] More consistently expand float ops by default (#150597 ) These float operations were expanded for scalar f32/f64/f128, but not for f16 and more problematically, not for vectors. A small subset of them was separately set to expand for vectors. Change these to always expand by default, and adjust targets to mark these as legal where necessary instead. This is a much safer default, and avoids unnecessary legalization failures because a target failed to manually mark them as expand. Fixes https://github.com/llvm/llvm-project/issues/110753. Fixes https://github.com/llvm/llvm-project/issues/121390.	2025-07-28 09:46:00 +02:00
Jay Foad	28b85502eb	[AMDGPU] Remove some duplicated lines. NFC. (#128029 )	2025-07-21 17:28:31 +01:00
Diana Picus	20d8398825	[AMDGPU] ISel & PEI for whole wave functions (#145858 )

Whole wave functions are functions that will run with a full EXEC mask. They will not be invoked directly, but instead will be launched by way of a new intrinsic, `llvm.amdgcn.call.whole.wave` (to be added in a future patch). These functions are meant as an alternative to the `llvm.amdgcn.init.whole.wave` or `llvm.amdgcn.strict.wwm` intrinsics.

Whole wave functions will set EXEC to -1 in the prologue and restore the original value of EXEC in the epilogue. They must have a special first argument, `i1 %active`, that is going to be mapped to EXEC. They may have either the default calling convention or amdgpu_gfx. The inactive lanes need to be preserved for all registers used, active lanes only for the CSRs.

At the IR level, arguments to a whole wave function (other than `%active`) contain poison in their inactive lanes. Likewise, the return value for the inactive lanes is poison.

This patch contains the following work:
* 2 new pseudos, SI_SETUP_WHOLE_WAVE_FUNC and SI_WHOLE_WAVE_FUNC_RETURN used for managing the EXEC mask. SI_SETUP_WHOLE_WAVE_FUNC will return a SReg_1 representing `%active`, which needs to be passed into SI_WHOLE_WAVE_FUNC_RETURN.
* SelectionDAG support for generating these 2 new pseudos and the special handling of %active. Since the return may be in a different basic block, it's difficult to add the virtual reg for %active to SI_WHOLE_WAVE_FUNC_RETURN, so we initially generate an IMPLICIT_DEF which is later replaced via a custom inserter.
* Expansion of the 2 pseudos during prolog/epilog insertion. PEI also marks any used VGPRs as WWM registers, which are then spilled and restored with the usual logic.

Future patches will include the `llvm.amdgcn.call.whole.wave` intrinsic and a lot of optimization work (especially in order to reduce spills around function calls).

---------

Co-authored-by: Matt Arsenault <Matthew.Arsenault@amd.com>
Co-authored-by: Shilei Tian <i@tianshilei.me>	2025-07-21 10:39:09 +02:00


710 Commits