llvm-project

Author	SHA1	Message	Date
Dmitry Sidorov	76f88063b6	[AMDGPU] Remove AMDGPUISD::FFBH_I32 and add ISD::CTLS lowering (#187694 ) This is a continuation of the previously reverted https://github.com/llvm/llvm-project/pull/178420 The patch removes the custom AMDGPUISD::FFBH_I32 SelectionDAG node. Call sites that need raw hardware semantics (LowerINT_TO_FP32, legalizeITOFP) now use the amdgcn_sffbh intrinsic directly. ISD::CTLS is added as a Custom operation for i32. The previous attempt had an issue: the hardware v_ffbh_i32 instruction (v_cls_i32 on newer targets) has different semantics than ISD::CTLS: -sffbh returns [1, BitWidth-1] for normal values, -1 for all-same-bits -CTLS returns [0, BitWidth-2] for normal values, BitWidth-1 for all-same-bits Now LowerCTLS handles this by: sffbh -> umin(sffbh, BitWidth) -> sub 1. The current patch also adds a DAG combine to recognize the common CTLS idiom: sub(ctlz(xor(x, sra(x, BitWidth-1))), 1) -> ctls(x) and an optimization in performMinMaxCombine to fold away the umin when the input is not all-same-bits. Partially addresses #177635	2026-03-26 16:14:34 +01:00
Mehdi Amini	6a045c29a9	Revert "[GlobalISel][LLT] Introduce FPInfo for LLT (Enable bfloat, ppc128float and others in GlobalISel) (#155107 )" (#188344 ) This reverts commit b1aa6a45060bb9f89efded9e694503d6b4626a4a and commit ce44d63e0d14039f1e8f68e6b7c4672457cabd4e. This fails the build with some older gcc: llvm/include/llvm/CodeGenTypes/LowLevelType.h:501:35: error: call to non-constexpr function ‘static llvm::LLT llvm::LLT::integer(unsigned int)’ return integer(getSizeInBits()); ^	2026-03-24 21:40:36 +00:00
Denis.G	b1aa6a4506	[GlobalISel][LLT] Introduce FPInfo for LLT (Enable bfloat, ppc128float and others in GlobalISel) (#155107 ) Added extra information in LLT to support ambiguous fp types during GlobalISel. Original idea by @tgymnich Main differences from https://github.com/llvm/llvm-project/pull/122503 are: * Do not deprecate LLT::scalar * Allow targets to enable/disable IR translation with extended LLT via `TargetOption::EnableGlobalISelExtendedLLT` (disabled by default) * `IRTranslator` uses `TargetLoweringInfo` for appropriate `LLT` generation. * For this reason added a flag in `GlobalISelMatchTable` to allow switching between legacy and new extended LLT names * Revert using stubs like `LLT::float32` for float types as they are real now. Added `TODO` for such cases. Also MIRParser now may parse new type identifiers. --------- Co-authored-by: Tim Gymnich <tim@gymni.ch> Co-authored-by: Ryan Cowan <ryan.cowan@arm.com>	2026-03-24 08:40:39 -04:00
Jay Foad	0b49adc32c	[AMDGPU] Rename AMDGPUMachineFunction to AMDGPUMachineFunctionInfo. NFC. (#187276 ) This is derived from MachineFunctionInfo not MachineFunction.	2026-03-18 20:29:47 +00:00
Igor Wodiany	55cee50e6b	[AMDGPU] Use native instructions for f16 to u16/i16 saturated conversion (#186769 ) This addresses one of the limitations of #174726 by directly selecting `v_cvt_[u16/i16]_f16` instructions for conversion between 16-bit types, as they already handle saturation internally.	2026-03-18 16:21:58 +00:00
Jay Foad	79d1a2c418	[AMDGPU] Standardize on using AMDGPU::getNullPointerValue. NFC. (#187037 ) AMDGPUTargetMachine also had a static method which did the same thing. Remove it so that we have a single source of truth.	2026-03-17 17:08:16 +00:00
vangthao95	53956753fc	AMDGPU/GlobalISel: Lower G_EXTRACT/INSERT in legalizer (#181036 ) Lower G_EXTRACT/INSERT in legalizer by using custom lowering for simple 32-bit aligned cases and calling generic extract/insert lowering for all other cases.	2026-03-12 10:03:44 -07:00
Matt Arsenault	4832c33f66	AMDGPU: Implement expansion for f64 exp (#182539 ) I asked AI to port the device libs reference implementation. It mostly worked, though it got the compares wrong and also missed a fold that happened in compiler. With that fixed I get identical DAG output, and almost the same globalisel output (differing by an inverted compare and select). Also adjusted some stylistic choices.	2026-02-25 21:06:13 +01:00
Matt Arsenault	86f167b9df	AMDGPU: Use promotion to f32 path for log/log10 for f16 by default (#180240 )	2026-02-16 20:25:45 +01:00
Stanislav Mekhanoshin	7487c7581e	[AMDGPU] Change 9 SWMMAC builtins to use 64-bit index (#181246 ) These 9 gfx1250 instructions have a 64-bit packed index: - v_swmmac_f16_16x16x128_bf8_bf8 - v_swmmac_f16_16x16x128_bf8_fp8 - v_swmmac_f16_16x16x128_fp8_bf8 - v_swmmac_f16_16x16x128_fp8_fp8 - v_swmmac_f32_16x16x128_bf8_bf8 - v_swmmac_f32_16x16x128_bf8_fp8 - v_swmmac_f32_16x16x128_fp8_bf8 - v_swmmac_f32_16x16x128_fp8_fp8 - v_swmmac_i32_16x16x128_iu8 Intrinsics accept anyint, but builtins are defined with an i32 argument. Fixes: SWDEV-579843	2026-02-12 15:31:50 -08:00
paperchalice	25315f287c	[AMDGPU] Remove `NoNaNsFPMath` uses (#180469 ) Should use `nnan` flag only.	2026-02-09 17:29:21 +08:00
Pierre van Houtryve	b79ba02479	[AMDGPU][GFX12.5] Reimplement monitor load as an atomic operation (#177343 ) Load monitor operations make more sense as atomic operations, as non-atomic operations cannot be used for inter-thread communication w/o additional synchronization. The previous built-in made it work because one could just override the CPol bits, but that bypasses the memory model and forces the user to learn about ISA bits encoding. Making load monitor an atomic operation has a couple of advantages. First, the memory model foundation for it is stronger. We just lean on the existing rules for atomic operations. Second, the CPol bits are abstracted away from the user, which avoids leaking ISA details into the API. This patch also adds supporting memory model and intrinsics documentation to AMDGPUUsage. Solves SWDEV-516398.	2026-02-09 09:57:27 +01:00
Alex Wang	a947599991	[AMDGPU][GlobalISel] Add lowering for G_FMODF (#180152 ) Add generic expansion for G_FMODF matching the SelectionDAG implementation. Enable G_FMODF lowering for AMDGPU with tests. Related: #179434	2026-02-07 18:43:55 +00:00
Diana Picus	9022f47ca4	[AMDGPU] Implement llvm.sponentry (#176357 ) In some of our use cases, the GPU runtime stores some data at the top of the stack. It figures out where it's safe to store it by using the PAL metadata generated by the backend, which includes the total stack size. However, the metadata does not include the space reserved at the bottom of the stack for the trap handler when CWSR is enabled in dynamic VGPR mode. This space is reserved dynamically based on whether or not the code is running on the compute queue. Therefore, the runtime needs a way to take that into account. Add support for `llvm.sponentry`, which should return the base of the stack, skipping over any reserved areas. This allows us to keep this computation in one place rather than duplicate it between the backend and the runtime. The implementation for functions that set up their own stack uses a pseudo that is expanded to the same code sequence as that used in the prolog to set up the stack in the first place. In callable functions, we generate a fixed stack object and use that instead, similar to the Arm/AArch64 approach. This wastes some stack space but that's not a problem for now because we're not planning to use this in callable functions yet.	2026-02-03 15:02:07 +01:00
Carl Ritson	447f1e43bb	[AMDGPU] Implement llvm.fptosi.sat and llvm.fptoui.sat (#174726 ) Certain graphics APIs explicitly want the semantics of saturated conversions, particularly w.r.t. edge cases like NaN. The underlying hardware instructions (v_cvt_*) provide the expected behaviour so llvm.fptosi.sat and llvm.fptoui.sat can be implemented directly. Limitations: - conversion to i64 is not handled (default expansion is used) - v_cvt_u16_f16 and v_cvt_i16_f16 are not utilized (future work) - scalar float is untested/unoptimized (future work)	2026-01-30 17:07:40 +09:00
Kewen Meng	120b482375	Revert "[AMDGPU] Replace AMDGPUISD::FFBH_I32 with ISD::CTLS" (#178837 ) Revert to unblock buildbot: https://lab.llvm.org/buildbot/#/builders/206/builds/12769	2026-01-29 21:19:15 -08:00
Dmitry Sidorov	65925b0405	[AMDGPU] Replace AMDGPUISD::FFBH_I32 with ISD::CTLS (#178420 ) Per CDNA4 ISA: V_FFBH_I32 Count the number of leading bits that are the same as the sign bit of a vector input and store the result into a vector register. Store -1 if all input bits are the same. which matches CTLS semantics. Addresses: https://github.com/llvm/llvm-project/issues/177635	2026-01-30 01:36:28 +01:00
Mariusz Sikora	3c0f5045e1	[AMDGPU] Add FeatureGFX13 and SMEM encoding for gfx13 (#177567 ) For now list of features is based on gfx12 and gfx1250 --------- Co-authored-by: Jay Foad <jay.foad@amd.com>	2026-01-26 14:16:36 +01:00
Jameson Nash	d10b2b566a	[NFCI] replace getValueType with new getGlobalSize query (#177186 ) Returns uint64_t to simplify callers. The goal is to eventually replace getValueType with this query, which should return the known minimum reference-able size, as provided (instead of a Type) during create. Additionally the common isSized query would be replaced with an isExactKnownSize query to test if that size is an exact definition.	2026-01-22 13:55:53 -05:00
Shilei Tian	4b1cfc5d7c	[NFCI][AMDGPU] Final touch before moving to `GET_SUBTARGETINFO_MACRO` (#177401 )	2026-01-22 17:33:17 +00:00
Shilei Tian	02d34a76f7	[NFCI][AMDGPU] Remove more redundant code from `GCNSubtarget.h` (#177297 ) We are getting pretty close to using `GET_SUBTARGETINFO_MACRO` in the header with this cleanup.	2026-01-22 09:07:15 -05:00
Matt Arsenault	a470e708be	AMDGPU: Teach lowering that sqrt never returns subnormal (#174838 )	2026-01-08 12:05:29 +01:00
Victor Chernyakin	c438773432	[LLVM][ADT] Migrate users of `make_scope_exit` to CTAD (#174030 ) This is a followup to #173131, which introduced the CTAD functionality.	2026-01-02 20:42:56 -08:00
Matt Arsenault	0f572c1053	AMDGPU: Teach lowering that exp and log intrinsics cannot return denormals (#172296 )	2025-12-23 10:51:11 +01:00
Petar Avramovic	448ac1fb00	AMDGPU/GlobalISel: Fix broken exp10 lowering for f16 (#170708 )	2025-12-08 10:35:40 +01:00
anjenner	740a3ad1f7	AMDGPU: Add codegen for atomicrmw operations usub_cond and usub_sat (#141068 ) Split off from https://github.com/llvm/llvm-project/pull/105553 as per discussion there.	2025-12-05 12:37:33 +00:00
Adel Ejjeh	3c2e5d50ca	[AMDGPU] Update log lowering to remove contract for AMDGCN backend (#168916 )

## Problem Summary

PyTorch's `test_warp_softmax_64bit_indexing` is failing with a numerical precision error where `log(1.1422761679)` was computed with 54% higher error than expected (9.042e-09 vs 5.859e-09), causing gradient computations to exceed tolerance thresholds. This precision degradation was reproducible across all AMD GPU architectures (gfx1100, gfx1200, gfx90a, gfx950). I tracked down the problem to the commit 4703f8b6610a (March 6, 2025) which changed HIP math headers to call `__builtin_logf()` directly instead of `__ocml_log_f32()`:

```diff
- float logf(float __x) { return __FAST_OR_SLOW(__logf, __ocml_log_f32)(__x); }
+ float logf(float __x) { return __FAST_OR_SLOW(__logf, __builtin_logf)(__x); }
```

This change exposed a problem in the AMDGCN back-end as described below:

## Key Findings

1. Contract flag propagation: When `-ffp-contract=fast` is enabled (default for HIP), Clang's CodeGen adds the `contract` flag to all `CallInst` instructions within the scope of `CGFPOptionsRAII`, including calls to LLVM intrinsics like `llvm.log.f32`.

2. Behavior change from OCML to builtin path:

- Old path (via `__ocml_log_f32`): The preprocessed IR showed the call to the OCML library function had the contract flag, but the OCML implementation internally dropped the contract flag when calling the `llvm.log.f32` intrinsic.

```llvm
; Function Attrs: alwaysinline convergent mustprogress nounwind
define internal noundef float @_ZL4logff(float noundef %__x) #6 {
entry:
  %retval = alloca float, align 4, addrspace(5)
  %__x.addr = alloca float, align 4, addrspace(5)
  %retval.ascast = addrspacecast ptr addrspace(5) %retval to ptr
  %__x.addr.ascast = addrspacecast ptr addrspace(5) %__x.addr to ptr
  store float %__x, ptr %__x.addr.ascast, align 4, !tbaa !23
  %0 = load float, ptr %__x.addr.ascast, align 4, !tbaa !23
  %call = call contract float @__ocml_log_f32(float noundef %0) #23
  ret float %call
}

; Function Attrs: convergent mustprogress nofree norecurse nosync nounwind willreturn memory(none)
define internal noundef float @__ocml_log_f32(float noundef %0) #7 {
  %2 = tail call float @llvm.log.f32(float %0)
  ret float %2
}
```

- New path (via `__builtin_logf`): The call goes directly to the `llvm.log.f32` intrinsic with the contract flag preserved, causing the backend to apply FMA contraction during polynomial expansion.

```llvm
; Function Attrs: alwaysinline convergent mustprogress nounwind
define internal noundef float @_ZL4logff(float noundef %__x) #6 {
entry:
  %retval = alloca float, align 4, addrspace(5)
  %__x.addr = alloca float, align 4, addrspace(5)
  %retval.ascast = addrspacecast ptr addrspace(5) %retval to ptr
  %__x.addr.ascast = addrspacecast ptr addrspace(5) %__x.addr to ptr
  store float %__x, ptr %__x.addr.ascast, align 4, !tbaa !24
  %0 = load float, ptr %__x.addr.ascast, align 4, !tbaa !24
  %1 = call contract float @llvm.log.f32(float %0)
  ret float %1
}
```

3. Why contract breaks log: Our AMDGCN target back end implements the natural logarithm by taking the result of the hardware log, then multiplying that by `ln(2)`, and applying some rounding error correction to that multiplication. This results in something like:

```c
r = y * c1; // y is result of v_log_ instruction, c1 = ln(2)
r = r + fma(y, c2, fma(y, c1, -r)); // c2 is another error-correcting constant
```

```asm
v_log_f32_e32 v1, v1
s_mov_b32 s2, 0x3f317217
v_mul_f32_e32 v3, 0x3f317217, v1
v_fma_f32 v4, v1, s2, -v3
v_fmac_f32_e32 v4, 0x3377d1cf, v1
v_add_f32_e32 v3, v3, v4
```

With the presence of the `contract` flag, the back-end fuses the add (`r + Z`) with the multiply thinking that it is legal, thus eliminating the intermediate rounding. The error compensation term, which was calculated based on the rounded product, is now being added to the full-precision result from the FMA, leading to incorrect error correction and degraded accuracy. The corresponding contracted operations become the following:

```c
r = y * c1;
r = fma(y, c1, fma(y, c2, fma(y, c1, -r)));
```

```asm
v_log_f32_e32 v1, v1
s_mov_b32 s2, 0x3f317217
v_mul_f32_e32 v3, 0x3f317217, v1
v_fma_f32 v3, v1, s2, -v3
v_fmac_f32_e32 v3, 0x3377d1cf, v1
v_fmac_f32_e32 v3, 0x3f317217, v1
```

## Solution and Proposed Fix

Based on our implementation of `llvm.log` and `llvm.log10`, it should be illegal for the back-end to propagate the `contract` flag when it is present on the intrinsic call because it uses error-correcting summation. My proposed fix is to modify the instruction selection passes (both global-isel and sdag) to drop the `contract` flag when lowering llvm.log. That way, when the instruction selection performs the contraction optimization, it will not fuse the multiply and add.

Note: I had originally implemented this fix in the FE by removing the `contract` flag when lowering the llvm.log builtin (PR #168770). I have since closed that PR.	2025-12-04 21:55:13 +01:00
Matt Arsenault	0d853aefec	AMDGPU: Fix treating unknown mem operands as uniform (#170309 ) The test changes are mostly GlobalISel specific regressions. GlobalISel is still relying on isUniformMMO, but it doesn't really have an excuse for doing so. These should be avoidable with new regbankselect. There is an additional regression for addrspacecast for cov4. We probably ought to be using a separate PseudoSourceValue for the access of the queue pointer.	2025-12-02 16:19:46 +00:00
Matt Arsenault	23e6dbf864	AMDGPU: Use ConstantPool as source value for DAG lowered kernarg loads (#168917 ) This isn't quite a constant pool, but probably close enough for this purpose. We just need some known invariant value address. The aliasing queries against the real kernarg base pointer will falsely report no aliasing, but for invariant memory it probably doesn't matter.	2025-12-02 15:48:02 +00:00
Matt Arsenault	6883d4a236	AMDGPU: Try to use zext to implement constant-32-bit addrspacecast (#168977 ) If the high bits are assumed 0 for the cast, use zext. Previously we would emit a build_vector and a bitcast with the high element as 0. The zext is more easily optimized. I'm less convinced this is good for globalisel, since you still need to have the inttoptr back to the original pointer type. The default value is 0, though I'm not sure if this is meaningful in the real world. The real uses might always override the high bit value with the attribute.	2025-12-01 18:40:26 -05:00
Jay Foad	1c3b10f2e2	[AMDGPU] Remove isKernelLDS, add isKernel(const Function &). NFC. (#167300 ) Since #142598 isKernelLDS has been a pointless wrapper around isKernel.	2025-11-25 15:43:18 +00:00
Mirko Brkušanin	fe5f49942e	[AMDGPU][GlobalISel] Lower G_FMINIMUM and G_FMAXIMUM (#151122 ) Add GlobalISel lowering of G_FMINIMUM and G_FMAXIMUM following the same logic as in SDag's expandFMINIMUM_FMAXIMUM. Update AMDGPU legalization rules: Pre GFX12 now uses new lowering method and make G_FMINNUM_IEEE and G_FMAXNUM_IEEE legal to match SDag.	2025-10-24 14:48:27 +02:00
paperchalice	14a42e64cf	[AMDGPU] Remove NoInfsFPMath uses (#163028 ) Only `ninf` should be used.	2025-10-13 19:15:49 +08:00
Shilei Tian	2195fe7e01	[AMDGPU] Add the support for 45-bit buffer resource (#159702 ) On new targets like `gfx1250`, the buffer resource (V#) now uses this format: ``` base (57-bit): resource[56:0] num_records (45-bit): resource[101:57] reserved (6-bit): resource[107:102] stride (14-bit): resource[121:108] ``` This PR changes the type of `num_records` from `i32` to `i64` in both builtin and intrinsic, and also adds the support for lowering the new format. Fixes SWDEV-554034. --------- Co-authored-by: Krzysztof Drewniak <Krzysztof.Drewniak@amd.com>	2025-09-24 11:12:02 -04:00
Stanislav Mekhanoshin	76efbc068a	[AMDGPU] Fix codegen to emit COPY instead of S_MOV_B64 for aperture regs (#158754 )	2025-09-16 02:26:32 -07:00
Shilei Tian	1180c2ced0	[AMDGPU] Support lowering of cluster related intrinsics (#157978 ) Since much of the code is connected, this also changes how workgroup id is lowered. Co-authored-by: Jay Foad <jay.foad@amd.com> Co-authored-by: Ivan Kosarev <ivan.kosarev@amd.com>	2025-09-12 21:11:17 -04:00
Anshil Gandhi	c6899193ed	[AMDGPU][Legalizer] Avoid pack/unpack for G_FSHR (#156796 ) Scalarize G_FSHR only if the subtarget does not support V2S16 type.	2025-09-04 17:12:57 -06:00
Pierre van Houtryve	e2bd10cf16	[AMDGPU][gfx1250] Add 128B cooperative atomics (#156418 ) - Add clang built-ins + sema/codegen - Add IR Intrinsic + verifier - Add DAG/GlobalISel codegen for the intrinsics - Add lowering in SIMemoryLegalizer using a MMO flag.	2025-09-04 09:19:25 +00:00
paperchalice	595573d1ed	[AMDGPU] Remove `ApproxFuncFPMath` uses (#155578 ) `ApproxFuncFPMath` is one of the options in `resetTargetOptions`; this removes its uses in the AMDGPU part.	2025-08-28 11:09:01 +08:00
Tiger Ding	4ab14685a0	[AMDGPU] Narrow only on store to pow of 2 mem location (#150093 ) Lowering in GlobalISel for AMDGPU previously always narrows to i32 on truncating store regardless of mem size or scalar size, causing issues with types like i65 which is first extended to i128 then stored as i64 + i8 to i128 locations. Narrowing only on store to pow of 2 mem location ensures only narrowing to mem size near end of legalization. This LLVM defect was identified via the AMD Fuzzing project.	2025-08-19 00:04:27 +09:00
Stanislav Mekhanoshin	ea14834966	[AMDGPU] Per-subtarget DPP instruction classification (#153096 ) This is NFCI at this point.	2025-08-11 15:41:02 -07:00
Stanislav Mekhanoshin	abc22f771e	[AMDGPU] Fix buffer addressing mode matching (#152584 ) Starting in gfx1250, voffset and immoffset are zero-extended from 32 bits to 45 bits before being added together.	2025-08-07 14:23:41 -07:00
Stanislav Mekhanoshin	b8eb61adc9	[AMDGPU] Implement addrspacecast from flat <-> private on gfx1250 (#152218 )	2025-08-05 16:25:23 -07:00
paperchalice	8bacfb2538	[AMDGPU] Remove `UnsafeFPMath` uses (#151079 ) Remove `UnsafeFPMath` in AMDGPU part, it blocks some bugfixes related to clang and the ultimate goal is to remove `resetTargetOptions` method in `TargetMachine`, see FIXME in `resetTargetOptions`. See also https://discourse.llvm.org/t/rfc-honor-pragmas-with-ffp-contract-fast https://discourse.llvm.org/t/allowfpopfusion-vs-sdnodeflags-hasallowcontract --------- Co-authored-by: Matt Arsenault <arsenm2@gmail.com>	2025-07-31 17:36:57 +08:00
Fabian Ritter	957ae8ad46	[AMDGPU][GISel] Use buildObjectPtrOffset instead of buildPtrAdd (#150899 ) This concerns offset computations for kernargs and RegBankLegalizeHelper::splitLoad, which should all be within the bounds of a memory object. See #150392 for the motivation for introducing the buildObjectPtrOffset function. For SWDEV-516125.	2025-07-30 08:30:27 +02:00
Stanislav Mekhanoshin	3dfd939a16	[AMDGPU] gfx1250 V_{MIN|MAX}_{I|U}64 opcodes (#151256 )	2025-07-29 19:13:51 -07:00
Changpeng Fang	6184ef1c2f	[AMDGPU] Support f64 atomics on gfx1250 (#151172 ) - BUF/FLAT/GLOBAL_ADD/MIN/MAX_F64 - DS_ADD_F64 Co-authored-by: Konstantin Zhuravlyov <Konstantin Zhuravlyov@amd.com>	2025-07-29 09:41:00 -07:00
Stanislav Mekhanoshin	2346968807	[AMDGPU] Add V_ADD|SUB|MUL_U64 gfx1250 opcodes (#150291 )	2025-07-23 13:17:56 -07:00
Stanislav Mekhanoshin	2d6534b7da	[AMDGPU] gfx1250 64-bit relocations and fixups (#148951 )	2025-07-15 17:13:42 -07:00
Changpeng Fang	868793fa8e	AMDGPU: Support intrinsic selection for gfx1250 wmma instructions (#148957 ) Co-authored-by: Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com> Co-authored-by: Shilei Tian <Shilei.Tian@amd.com>	2025-07-15 15:25:05 -07:00
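
The CTLS lowering described in commit 76f88063b6 can be checked with a small model. The following Python is only an illustrative sketch, not the backend code: the function names are hypothetical, and `sffbh` models the hardware result purely from the ranges quoted in the commit message ([1, BitWidth-1] for normal values, -1 when all bits are the same). It compares the `umin(sffbh, BitWidth) - 1` lowering against the `sub(ctlz(xor(x, sra(x, BitWidth-1))), 1)` idiom.

```python
BITS = 32
MASK = (1 << BITS) - 1


def sffbh(x: int) -> int:
    """Hypothetical model of the hardware op: count of leading bits equal to
    the sign bit, or -1 (returned here as unsigned 0xFFFFFFFF) if all bits match."""
    x &= MASK
    if x in (0, MASK):
        return MASK  # -1 reinterpreted as an unsigned 32-bit value
    sign = x >> (BITS - 1)
    count = 0
    for i in range(BITS - 1, -1, -1):
        if (x >> i) & 1 != sign:
            break
        count += 1
    return count


def ctls_via_sffbh(x: int) -> int:
    """Lowering from the commit message: sffbh -> umin(sffbh, BitWidth) -> sub 1."""
    return min(sffbh(x), BITS) - 1


def ctlz(x: int) -> int:
    """Count leading zeros of a 32-bit value; ctlz(0) is BitWidth."""
    x &= MASK
    return BITS - x.bit_length()


def ctls_via_idiom(x: int) -> int:
    """Idiom from the commit message: sub(ctlz(xor(x, sra(x, BitWidth-1))), 1)."""
    x &= MASK
    sra = MASK if x >> (BITS - 1) else 0  # arithmetic shift right by BitWidth-1
    return ctlz(x ^ sra) - 1


if __name__ == "__main__":
    for value in (0, -1, 1, -2, 0x40000000, 0x80000000, 0x12345678):
        print(hex(value & MASK), ctls_via_sffbh(value), ctls_via_idiom(value))
```

Under this model the unsigned minimum is what maps the all-same-bits result (-1) to BitWidth, so the subtraction yields BitWidth-1; for every other input the umin is a no-op, which is consistent with the fold mentioned for performMinMaxCombine.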


789 Commits
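
The buffer resource (V#) layout quoted in commit 2195fe7e01 can be sketched as plain bit-field packing. This is a minimal Python illustration under stated assumptions: the helper names are hypothetical, only the four fields listed in the commit message are modeled, and bits above 121 are left as zero because the message does not describe them.

```python
# Field name -> (least significant bit, width), taken from the commit message:
#   base        resource[56:0]
#   num_records resource[101:57]
#   reserved    resource[107:102]
#   stride      resource[121:108]
FIELDS = {
    "base": (0, 57),
    "num_records": (57, 45),
    "reserved": (102, 6),
    "stride": (108, 14),
}


def pack_resource(**values: int) -> int:
    """Pack the named fields into one integer (hypothetical helper)."""
    word = 0
    for name, value in values.items():
        lsb, width = FIELDS[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word |= value << lsb
    return word


def unpack_resource(word: int) -> dict:
    """Extract every modeled field from a packed integer (hypothetical helper)."""
    return {
        name: (word >> lsb) & ((1 << width) - 1)
        for name, (lsb, width) in FIELDS.items()
    }


if __name__ == "__main__":
    # A num_records value above 2**32 fits the 45-bit field but not an i32.
    packed = pack_resource(base=0x1000, num_records=1 << 40, stride=16)
    print(unpack_resource(packed))
```

A 45-bit `num_records` field can hold values that overflow 32 bits, which matches the commit's change of the `num_records` type from `i32` to `i64`.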