llvm-project

Author	SHA1	Message	Date
paperchalice	25315f287c	[AMDGPU] Remove `NoNaNsFPMath` uses (#180469 ) Should use `nnan` flag only.	2026-02-09 17:29:21 +08:00
Pierre van Houtryve	b79ba02479	[AMDGPU][GFX12.5] Reimplement monitor load as an atomic operation (#177343 ) Load monitor operations make more sense as atomic operations, as non-atomic operations cannot be used for inter-thread communication w/o additional synchronization. The previous built-in made it work because one could just override the CPol bits, but that bypasses the memory model and forces the user to learn about ISA bits encoding. Making load monitor an atomic operation has a couple of advantages. First, the memory model foundation for it is stronger. We just lean on the existing rules for atomic operations. Second, the CPol bits are abstracted away from the user, which avoids leaking ISA details into the API. This patch also adds supporting memory model and intrinsics documentation to AMDGPUUsage. Solves SWDEV-516398.	2026-02-09 09:57:27 +01:00
Alex Wang	a947599991	[AMDGPU][GlobalISel] Add lowering for G_FMODF (#180152 ) Add generic expansion for G_FMODF matching the SelectionDAG implementation. Enable G_FMODF lowering for AMDGPU with tests. Related: #179434	2026-02-07 18:43:55 +00:00
Diana Picus	9022f47ca4	[AMDGPU] Implement llvm.sponentry (#176357 ) In some of our use cases, the GPU runtime stores some data at the top of the stack. It figures out where it's safe to store it by using the PAL metadata generated by the backend, which includes the total stack size. However, the metadata does not include the space reserved at the bottom of the stack for the trap handler when CWSR is enabled in dynamic VGPR mode. This space is reserved dynamically based on whether or not the code is running on the compute queue. Therefore, the runtime needs a way to take that into account. Add support for `llvm.sponentry`, which should return the base of the stack, skipping over any reserved areas. This allows us to keep this computation in one place rather than duplicate it between the backend and the runtime. The implementation for functions that set up their own stack uses a pseudo that is expanded to the same code sequence as that used in the prolog to set up the stack in the first place. In callable functions, we generate a fixed stack object and use that instead, similar to the Arm/AArch64 approach. This wastes some stack space but that's not a problem for now because we're not planning to use this in callable functions yet.	2026-02-03 15:02:07 +01:00
Carl Ritson	447f1e43bb	[AMDGPU] Implement llvm.fptosi.sat and llvm.fptoui.sat (#174726 ) Certain graphics APIs explicitly want the semantics of saturated conversions, particularly w.r.t. edge cases like NaN. The underlying hardware instructions (v_cvt_*) provide the expected behaviour so llvm.fptosi.sat and llvm.fptoui.sat can be implemented directly. Limitations: - conversion to i64 is not handled (default expansion is used) - v_cvt_u16_f16 and v_cvt_i16_f16 are not utilized (future work) - scalar float is untested/unoptimized (future work)	2026-01-30 17:07:40 +09:00
Kewen Meng	120b482375	Revert "[AMDGPU] Replace AMDGPUISD::FFBH_I32 with ISD::CTLS" (#178837 ) Revert to unblock buildbot: https://lab.llvm.org/buildbot/#/builders/206/builds/12769	2026-01-29 21:19:15 -08:00
Dmitry Sidorov	65925b0405	[AMDGPU] Replace AMDGPUISD::FFBH_I32 with ISD::CTLS (#178420 ) Per the CDNA4 ISA, V_FFBH_I32 is defined as: "Count the number of leading bits that are the same as the sign bit of a vector input and store the result into a vector register. Store -1 if all input bits are the same." This matches CTLS semantics. Addresses: https://github.com/llvm/llvm-project/issues/177635	2026-01-30 01:36:28 +01:00
Mariusz Sikora	3c0f5045e1	[AMDGPU] Add FeatureGFX13 and SMEM encoding for gfx13 (#177567 ) For now, the list of features is based on gfx12 and gfx1250. --------- Co-authored-by: Jay Foad <jay.foad@amd.com>	2026-01-26 14:16:36 +01:00
Jameson Nash	d10b2b566a	[NFCI] replace getValueType with new getGlobalSize query (#177186 ) Returns uint64_t to simplify callers. The goal is to eventually replace getValueType with this query, which should return the known minimum reference-able size, as provided (instead of a Type) during create. Additionally, the common isSized query would be replaced with an isExactKnownSize query to test whether that size is an exact definition.	2026-01-22 13:55:53 -05:00
Shilei Tian	4b1cfc5d7c	[NFCI][AMDGPU] Final touch before moving to `GET_SUBTARGETINFO_MACRO` (#177401 )	2026-01-22 17:33:17 +00:00
Shilei Tian	02d34a76f7	[NFCI][AMDGPU] Remove more redundant code from `GCNSubtarget.h` (#177297 ) With this cleanup, we are getting pretty close to using `GET_SUBTARGETINFO_MACRO` in the header.	2026-01-22 09:07:15 -05:00
Matt Arsenault	a470e708be	AMDGPU: Teach lowering that sqrt never returns subnormal (#174838 )	2026-01-08 12:05:29 +01:00
Victor Chernyakin	c438773432	[LLVM][ADT] Migrate users of `make_scope_exit` to CTAD (#174030 ) This is a followup to #173131, which introduced the CTAD functionality.	2026-01-02 20:42:56 -08:00
Matt Arsenault	0f572c1053	AMDGPU: Teach lowering that exp and log intrinsics cannot return denormals (#172296 )	2025-12-23 10:51:11 +01:00
Petar Avramovic	448ac1fb00	AMDGPU/GlobalISel: Fix broken exp10 lowering for f16 (#170708 )	2025-12-08 10:35:40 +01:00
anjenner	740a3ad1f7	AMDGPU: Add codegen for atomicrmw operations usub_cond and usub_sat (#141068 ) Split off from https://github.com/llvm/llvm-project/pull/105553 as per discussion there.	2025-12-05 12:37:33 +00:00
Adel Ejjeh	3c2e5d50ca	[AMDGPU] Update log lowering to remove contract for AMDGCN backend (#168916 )

## Problem Summary

PyTorch's `test_warp_softmax_64bit_indexing` is failing with a numerical precision error where `log(1.1422761679)` is computed with 54% higher error than expected (9.042e-09 vs 5.859e-09), causing gradient computations to exceed tolerance thresholds. This precision degradation was reproducible across all AMD GPU architectures (gfx1100, gfx1200, gfx90a, gfx950).

I tracked down the problem to the commit 4703f8b6610a (March 6, 2025), which changed HIP math headers to call `__builtin_logf()` directly instead of `__ocml_log_f32()`:

```diff
- float logf(float __x) { return __FAST_OR_SLOW(__logf, __ocml_log_f32)(__x); }
+ float logf(float __x) { return __FAST_OR_SLOW(__logf, __builtin_logf)(__x); }
```

This change exposed a problem in the AMDGCN back-end as described below:

## Key Findings

1. Contract flag propagation: When `-ffp-contract=fast` is enabled (default for HIP), Clang's CodeGen adds the `contract` flag to all `CallInst` instructions within the scope of `CGFPOptionsRAII`, including calls to LLVM intrinsics like `llvm.log.f32`.

2. Behavior change from OCML to builtin path:

- Old path (via `__ocml_log_f32`): The preprocessed IR showed the call to the OCML library function had the contract flag, but the OCML implementation internally dropped the contract flag when calling the `llvm.log.f32` intrinsic.

```llvm
; Function Attrs: alwaysinline convergent mustprogress nounwind
define internal noundef float @_ZL4logff(float noundef %__x) #6 {
entry:
  %retval = alloca float, align 4, addrspace(5)
  %__x.addr = alloca float, align 4, addrspace(5)
  %retval.ascast = addrspacecast ptr addrspace(5) %retval to ptr
  %__x.addr.ascast = addrspacecast ptr addrspace(5) %__x.addr to ptr
  store float %__x, ptr %__x.addr.ascast, align 4, !tbaa !23
  %0 = load float, ptr %__x.addr.ascast, align 4, !tbaa !23
  %call = call contract float @__ocml_log_f32(float noundef %0) #23
  ret float %call
}

; Function Attrs: convergent mustprogress nofree norecurse nosync nounwind willreturn memory(none)
define internal noundef float @__ocml_log_f32(float noundef %0) #7 {
  %2 = tail call float @llvm.log.f32(float %0)
  ret float %2
}
```

- New path (via `__builtin_logf`): The call goes directly to the `llvm.log.f32` intrinsic with the contract flag preserved, causing the backend to apply FMA contraction during polynomial expansion.

```llvm
; Function Attrs: alwaysinline convergent mustprogress nounwind
define internal noundef float @_ZL4logff(float noundef %__x) #6 {
entry:
  %retval = alloca float, align 4, addrspace(5)
  %__x.addr = alloca float, align 4, addrspace(5)
  %retval.ascast = addrspacecast ptr addrspace(5) %retval to ptr
  %__x.addr.ascast = addrspacecast ptr addrspace(5) %__x.addr to ptr
  store float %__x, ptr %__x.addr.ascast, align 4, !tbaa !24
  %0 = load float, ptr %__x.addr.ascast, align 4, !tbaa !24
  %1 = call contract float @llvm.log.f32(float %0)
  ret float %1
}
```

3. Why contract breaks log: Our AMDGCN target back end implements the natural logarithm by taking the result of the hardware log, then multiplying that by `ln(2)`, and applying some rounding error correction to that multiplication. This results in something like:

```c
r = y * c1; // y is result of v_log_ instruction, c1 = ln(2)
r = r + fma(y, c2, fma(y, c1, -r)) // c2 is another error-correcting constant
```

```asm
v_log_f32_e32 v1, v1
s_mov_b32 s2, 0x3f317217
v_mul_f32_e32 v3, 0x3f317217, v1
v_fma_f32 v4, v1, s2, -v3
v_fmac_f32_e32 v4, 0x3377d1cf, v1
v_add_f32_e32 v3, v3, v4
```

With the presence of the `contract` flag, the back-end fuses the add (`r + Z`) with the multiply, thinking that it is legal, thus eliminating the intermediate rounding. The error compensation term, which was calculated based on the rounded product, is now being added to the full-precision result from the FMA, leading to incorrect error correction and degraded accuracy. The corresponding contracted operations become the following:

```c
r = y * c1;
r = fma(y, c1, fma(y, c2, fma(y, c1, -r)));
```

```asm
v_log_f32_e32 v1, v1
s_mov_b32 s2, 0x3f317217
v_mul_f32_e32 v3, 0x3f317217, v1
v_fma_f32 v3, v1, s2, -v3
v_fmac_f32_e32 v3, 0x3377d1cf, v1
v_fmac_f32_e32 v3, 0x3f317217, v1
```

## Solution and Proposed Fix

Based on our implementation of `llvm.log` and `llvm.log10`, it should be illegal for the back-end to propagate the `contract` flag when it is present on the intrinsic call because it uses error-correcting summation. My proposed fix is to modify the instruction selection passes (both global-isel and sdag) to drop the `contract` flag when lowering `llvm.log`. That way, when the instruction selection performs the contraction optimization, it will not fuse the multiply and add.

Note: I had originally implemented this fix in the FE by removing the `contract` flag when lowering the llvm.log builtin (PR #168770). I have since closed that PR.	2025-12-04 21:55:13 +01:00
Matt Arsenault	0d853aefec	AMDGPU: Fix treating unknown mem operands as uniform (#170309 ) The test changes are mostly GlobalISel specific regressions. GlobalISel is still relying on isUniformMMO, but it doesn't really have an excuse for doing so. These should be avoidable with new regbankselect. There is an additional regression for addrspacecast for cov4. We probably ought to be using a separate PseudoSourceValue for the access of the queue pointer.	2025-12-02 16:19:46 +00:00
Matt Arsenault	23e6dbf864	AMDGPU: Use ConstantPool as source value for DAG lowered kernarg loads (#168917 ) This isn't quite a constant pool, but probably close enough for this purpose. We just need some known invariant value address. The aliasing queries against the real kernarg base pointer will falsely report no aliasing, but for invariant memory it probably doesn't matter.	2025-12-02 15:48:02 +00:00
Matt Arsenault	6883d4a236	AMDGPU: Try to use zext to implement constant-32-bit addrspacecast (#168977 ) If the high bits are assumed 0 for the cast, use zext. Previously we would emit a build_vector and a bitcast with the high element as 0. The zext is more easily optimized. I'm less convinced this is good for globalisel, since you still need to have the inttoptr back to the original pointer type. The default value is 0, though I'm not sure if this is meaningful in the real world. The real uses might always override the high bit value with the attribute.	2025-12-01 18:40:26 -05:00
Jay Foad	1c3b10f2e2	[AMDGPU] Remove isKernelLDS, add isKernel(const Function &). NFC. (#167300 ) Since #142598 isKernelLDS has been a pointless wrapper around isKernel.	2025-11-25 15:43:18 +00:00
Mirko Brkušanin	fe5f49942e	[AMDGPU][GlobalISel] Lower G_FMINIMUM and G_FMAXIMUM (#151122 ) Add GlobalISel lowering of G_FMINIMUM and G_FMAXIMUM following the same logic as in SDag's expandFMINIMUM_FMAXIMUM. Update AMDGPU legalization rules: pre-GFX12 targets now use the new lowering method, and G_FMINNUM_IEEE and G_FMAXNUM_IEEE are made legal to match SDag.	2025-10-24 14:48:27 +02:00
paperchalice	14a42e64cf	[AMDGPU] Remove NoInfsFPMath uses (#163028 ) Only `ninf` should be used.	2025-10-13 19:15:49 +08:00
Shilei Tian	2195fe7e01	[AMDGPU] Add the support for 45-bit buffer resource (#159702 ) On new targets like `gfx1250`, the buffer resource (V#) now uses this format:
```
base (57-bit): resource[56:0]
num_records (45-bit): resource[101:57]
reserved (6-bit): resource[107:102]
stride (14-bit): resource[121:108]
```
This PR changes the type of `num_records` from `i32` to `i64` in both builtin and intrinsic, and also adds the support for lowering the new format. Fixes SWDEV-554034. --------- Co-authored-by: Krzysztof Drewniak <Krzysztof.Drewniak@amd.com>	2025-09-24 11:12:02 -04:00
Stanislav Mekhanoshin	76efbc068a	[AMDGPU] Fix codegen to emit COPY instead of S_MOV_B64 for aperture regs (#158754 )	2025-09-16 02:26:32 -07:00
Shilei Tian	1180c2ced0	[AMDGPU] Support lowering of cluster related intrinsics (#157978 ) Since much of the code is connected, this also changes how workgroup id is lowered. Co-authored-by: Jay Foad <jay.foad@amd.com> Co-authored-by: Ivan Kosarev <ivan.kosarev@amd.com>	2025-09-12 21:11:17 -04:00
Anshil Gandhi	c6899193ed	[AMDGPU][Legalizer] Avoid pack/unpack for G_FSHR (#156796 ) Scalarize G_FSHR only if the subtarget does not support V2S16 type.	2025-09-04 17:12:57 -06:00
Pierre van Houtryve	e2bd10cf16	[AMDGPU][gfx1250] Add 128B cooperative atomics (#156418 ) - Add clang built-ins + sema/codegen - Add IR Intrinsic + verifier - Add DAG/GlobalISel codegen for the intrinsics - Add lowering in SIMemoryLegalizer using a MMO flag.	2025-09-04 09:19:25 +00:00
paperchalice	595573d1ed	[AMDGPU] Remove `ApproxFuncFPMath` uses (#155578 ) `ApproxFuncFPMath` is one of the options in `resetTargetOptions`; this removes its uses in the AMDGPU part.	2025-08-28 11:09:01 +08:00
Tiger Ding	4ab14685a0	[AMDGPU] Narrow only on store to pow of 2 mem location (#150093 ) Lowering in GlobalISel for AMDGPU previously always narrows to i32 on truncating store regardless of mem size or scalar size, causing issues with types like i65 which is first extended to i128 then stored as i64 + i8 to i128 locations. Narrowing only on store to pow of 2 mem location ensures only narrowing to mem size near end of legalization. This LLVM defect was identified via the AMD Fuzzing project.	2025-08-19 00:04:27 +09:00
Stanislav Mekhanoshin	ea14834966	[AMDGPU] Per-subtarget DPP instruction classification (#153096 ) This is NFCI at this point.	2025-08-11 15:41:02 -07:00
Stanislav Mekhanoshin	abc22f771e	[AMDGPU] Fix buffer addressing mode matching (#152584 ) Starting in gfx1250, voffset and immoffset are zero-extended from 32 bits to 45 bits before being added together.	2025-08-07 14:23:41 -07:00
Stanislav Mekhanoshin	b8eb61adc9	[AMDGPU] Implement addrspacecast from flat <-> private on gfx1250 (#152218 )	2025-08-05 16:25:23 -07:00
paperchalice	8bacfb2538	[AMDGPU] Remove `UnsafeFPMath` uses (#151079 ) Remove `UnsafeFPMath` in AMDGPU part, it blocks some bugfixes related to clang and the ultimate goal is to remove `resetTargetOptions` method in `TargetMachine`, see FIXME in `resetTargetOptions`. See also https://discourse.llvm.org/t/rfc-honor-pragmas-with-ffp-contract-fast https://discourse.llvm.org/t/allowfpopfusion-vs-sdnodeflags-hasallowcontract --------- Co-authored-by: Matt Arsenault <arsenm2@gmail.com>	2025-07-31 17:36:57 +08:00
Fabian Ritter	957ae8ad46	[AMDGPU][GISel] Use buildObjectPtrOffset instead of buildPtrAdd (#150899 ) This concerns offset computations for kernargs and RegBankLegalizeHelper::splitLoad, which should all be within the bounds of a memory object. See #150392 for the motivation for introducing the buildObjectPtrOffset function. For SWDEV-516125.	2025-07-30 08:30:27 +02:00
Stanislav Mekhanoshin	3dfd939a16	[AMDGPU] gfx1250 V_{MIN|MAX}_{I|U}64 opcodes (#151256 )	2025-07-29 19:13:51 -07:00
Changpeng Fang	6184ef1c2f	[AMDGPU] Support f64 atomics on gfx1250 (#151172 ) - BUF/FLAT/GLOBAL_ADD/MIN/MAX_F64 - DS_ADD_F64 Co-authored-by: Konstantin Zhuravlyov <Konstantin Zhuravlyov@amd.com>	2025-07-29 09:41:00 -07:00
Stanislav Mekhanoshin	2346968807	[AMDGPU] Add V_ADD|SUB|MUL_U64 gfx1250 opcodes (#150291 )	2025-07-23 13:17:56 -07:00
Stanislav Mekhanoshin	2d6534b7da	[AMDGPU] gfx1250 64-bit relocations and fixups (#148951 )	2025-07-15 17:13:42 -07:00
Changpeng Fang	868793fa8e	AMDGPU: Support intrinsic selection for gfx1250 wmma instructions (#148957 ) Co-authored-by: Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com> Co-authored-by: Shilei Tian <Shilei.Tian@amd.com>	2025-07-15 15:25:05 -07:00
Stanislav Mekhanoshin	d0a4af725e	[AMDGPU] Add FeatureIEEEMinimumMaximumInsts. NFCI. (#147594 ) Co-authored-by: Mirko Brkušanin <Mirko.Brkusanin@amd.com>	2025-07-08 14:32:44 -07:00
Matt Arsenault	c80282d333	AMDGPU: Directly select minimumnum/maximumnum with ieee_mode=0 (#141903 ) The hardware min/max follow the IR rules with IEEE mode disabled, so we can avoid the canonicalizes of the input. We lose the quieting of a signaling nan if both inputs are nans, but we only require that with strictfp.	2025-06-18 00:27:41 +09:00
Kazu Hirata	03f616eb3a	[llvm] Compare std::optional<T> to values directly (NFC) (#143340 ) This patch transforms: X && *X == Y to: X == Y where X is of std::optional<T>, and Y is of T or similar.	2025-06-08 22:37:59 -07:00
Changpeng Fang	70e78be7dc	AMDGPU: Custom lower fptrunc vectors for f32 -> f16 (#141883 ) The latest ASICs support the v_cvt_pk_f16_f32 instruction. However, the current implementation of vector fptrunc lowering fully scalarizes the vectors, and the scalar conversions may not always be combined to generate the packed one. We made v2f32 -> v2f16 legal in https://github.com/llvm/llvm-project/pull/139956. This work is an extension to handle wider vectors. Instead of full scalarization, we split the vector into packs (v2f32 -> v2f16) to ensure the packed conversion can always be generated.	2025-06-06 15:15:24 -07:00
Justin Bogner	b7bb256703	Warn on misuse of DiagnosticInfo classes that hold Twines (#137397 ) This annotates the `Twine` passed to the constructors of the various DiagnosticInfo subclasses with `[[clang::lifetimebound]]`, which causes us to warn when we would try to print the twine after it had already been destructed. We also update `DiagnosticInfoUnsupported` to hold a `const Twine &` like all of the other DiagnosticInfo classes, since this warning allows us to clean up all of the places where it was being used incorrectly.	2025-05-28 12:26:39 -07:00
zGoldthorpe	bb7e559740	[AMDGPU] Correct bitshift legality transformation for small vectors (#140940 ) Fix for a bug found by the AMD fuzzing project. The legaliser would originally try to widen a small vector such as `<4 x i1>` to a single `i16` during the legalisation of bitshifts, as it was not originally written with consideration for vector operands. This patch simply adds a guard to prohibit this transformation and allow other legalisation transformations to step in.	2025-05-23 10:56:21 +02:00
Matt Arsenault	2e2bbcacf8	AMDGPU/GlobalISel: Start legalizing minimumnum and maximumnum (#140900 ) This is the bare minimum to get the intrinsic to compile for AMDGPU, and it's not optimal. We need to follow along closer with the existing G_FMINNUM/G_FMAXNUM with custom lowering to handle the IEEE=0 case better. Just re-use the existing lowering for the old semantics for G_FMINNUM/G_FMAXNUM. This does not change G_FMINNUM/G_FMAXNUM's treatment, nor try to handle the general expansion without an underlying min/max variant (or with G_FMINIMUM/G_FMAXIMUM).	2025-05-21 17:00:45 +02:00
Chinmay Deshpande	3a5af231fd	[GlobalISel][AMDGPU] Fix handling of v2i128 type for AND, OR, XOR (#138574 ) Current behavior crashes the compiler. This bug was found using the AMDGPU Fuzzing project. Fixes SWDEV-508816.	2025-05-08 19:31:28 +02:00
Pierre van Houtryve	0d0eed419f	[AMDGPU][Legalizer] Widen i16 G_SEXT_INREG (#131308 ) It's better to widen them to avoid it being lowered into a G_ASHR + G_SHL. With this change we just extend to i32 then trunc the result.	2025-05-07 10:22:15 +02:00
Diana Picus	45d96df797	[AMDGPU] Support arbitrary types in amdgcn.dead (#134841 ) Legalize the amdgcn.dead intrinsic to work with types other than i32. It still generates IMPLICIT_DEFs. Remove some of the previous code for selecting/reg bank mapping it for 32-bit types, since everything is done in the legalizer now.	2025-05-05 14:08:00 +02:00
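The error-compensation argument in the log-lowering commit (#168916) can be tried out on the host. The following is a minimal C++ sketch under stated assumptions, not the backend's code: it takes the two constants from the assembly listing in that commit message (0x3f317217 and 0x3377d1cf), and the caller is expected to supply `y`, for example from `std::log2` as a stand-in for the hardware `v_log_f32` result. The helper names `fromBits`, `lnCompensated`, and `lnContracted` are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Reinterpret a 32-bit pattern as a float (hypothetical helper).
inline float fromBits(std::uint32_t bits) {
  float f;
  std::memcpy(&f, &bits, sizeof f);
  return f;
}

// Constants copied from the assembly in the commit message.
inline float c1() { return fromBits(0x3f317217u); } // leading part of ln(2)
inline float c2() { return fromBits(0x3377d1cfu); } // small correction term

// Uncontracted form: the product is rounded to float first, and the inner
// fma measures the error of exactly that rounded product.
inline float lnCompensated(float y) {
  assert(std::isfinite(y));
  volatile float r = y * c1(); // volatile: keep the host compiler from fusing
  float rounded = r;
  float tail = std::fma(y, c2(), std::fma(y, c1(), -rounded));
  return rounded + tail;
}

// Contracted form: the final add is fused with a fresh, unrounded y * c1,
// so the error term computed against the rounded product no longer matches.
inline float lnContracted(float y) {
  assert(std::isfinite(y));
  volatile float r = y * c1();
  float rounded = r;
  float tail = std::fma(y, c2(), std::fma(y, c1(), -rounded));
  return std::fma(y, c1(), tail);
}
```

One way to observe the gap the commit describes is to compare both results against a double-precision `y * ln(2)` over a range of inputs; how large the difference is depends on the input.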
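The V# field layout listed in the 45-bit buffer resource commit (#159702) can be illustrated with a small packing routine. This is a sketch only: it assumes the 128-bit resource is held as two 64-bit words with bit 0 as the least significant bit, it leaves the reserved field zero, and the names `BufferResource`, `packResource`, `unpackNumRecords`, and `unpackStride` are hypothetical rather than taken from LLVM.

```cpp
#include <cassert>
#include <cstdint>

// Two 64-bit words holding resource[63:0] and resource[127:64] (assumed layout).
struct BufferResource {
  std::uint64_t lo;
  std::uint64_t hi;
};

// Mask with the low n bits set, for n < 64.
constexpr std::uint64_t maskBits(unsigned n) {
  return (std::uint64_t{1} << n) - 1;
}

// base:       resource[56:0]
// numRecords: resource[101:57]
// reserved:   resource[107:102] (left as zero here)
// stride:     resource[121:108]
inline BufferResource packResource(std::uint64_t base,
                                   std::uint64_t numRecords,
                                   std::uint32_t stride) {
  assert(base <= maskBits(57));
  assert(numRecords <= maskBits(45));
  assert(stride <= maskBits(14));
  // The low 7 bits of numRecords occupy bits [63:57] of the low word.
  std::uint64_t lo = base | (numRecords << 57);
  // The remaining 38 bits start at resource bit 64, i.e. bit 0 of the high
  // word; stride starts at resource bit 108, i.e. bit 44 of the high word.
  std::uint64_t hi = (numRecords >> 7) |
                     (static_cast<std::uint64_t>(stride) << 44);
  return {lo, hi};
}

inline std::uint64_t unpackNumRecords(const BufferResource &r) {
  return (r.lo >> 57) | ((r.hi & maskBits(38)) << 7);
}

inline std::uint32_t unpackStride(const BufferResource &r) {
  return static_cast<std::uint32_t>((r.hi >> 44) & maskBits(14));
}
```

Because `num_records` straddles the boundary between the two words, a 32-bit value can no longer describe it, which is consistent with the commit widening the type from `i32` to `i64`.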
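The rewrite described in the `std::optional` commit (#143340), from `X && *X == Y` to `X == Y`, relies on the standard comparison between an optional and a plain value being false when the optional is empty. The sketch below shows the two forms side by side; the function names are hypothetical and exist only for illustration.

```cpp
#include <cassert>
#include <optional>

// Long form: test for a value, then compare the contained value.
template <typename T>
bool equalsLongForm(const std::optional<T> &x, const T &y) {
  return x && *x == y;
}

// Short form: comparing an optional against a value yields false when the
// optional is empty, so no explicit has-value check is needed.
template <typename T>
bool equalsDirect(const std::optional<T> &x, const T &y) {
  return x == y;
}

// Check that both forms agree for a given pair of operands.
template <typename T>
void assertFormsAgree(const std::optional<T> &x, const T &y) {
  assert(equalsLongForm(x, y) == equalsDirect(x, y));
}
```

Calling `assertFormsAgree` with an empty optional, a matching value, and a non-matching value covers the three cases the transformation has to preserve.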

779 Commits