llvm-project

Author	SHA1	Message	Date
Sergio Afonso	2cff995e91	[AMDGPU] Fix crash with dead frame indices in debug values (#183297 ) When spill slots are eliminated (VGPR-to-AGPR, SGPR-to-VGPR lanes), debug values referencing these frame indices were not always properly cleaned up. This caused an assertion failure in getObjectOffset() when PrologEpilogInserter tried to access the offset of a dead frame object. The existing debug fixup code in SIFrameLowering and SILowerSGPRSpills had two limitations: 1. It only checked one operand position, but DBG_VALUE_LIST instructions can have multiple debug operands with frame indices. 2. It didn't handle all types of dead frame indices uniformly. Fix by centralizing debug info cleanup in removeDeadFrameIndices(), which already knows all frame indices being removed. This iterates over all debug operands using MI.debug_operands(). Assisted-by: Claude Code.	2026-04-01 13:41:53 +01:00
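The point of the fix above, visiting every debug operand instead of a single fixed position, can be sketched with toy types. `DemoDebugOperand`, `DemoDebugInstr`, and `clearDeadFrameIndices` are hypothetical stand-ins written for this illustration; they are not LLVM's `MachineOperand` or `MachineInstr` API, and the real cleanup in `removeDeadFrameIndices()` may differ in detail.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Toy stand-in for a debug operand; not LLVM's MachineOperand.
struct DemoDebugOperand {
  bool IsFrameIndex = false;
  int FrameIndex = -1;
};

// Toy stand-in for a DBG_VALUE or DBG_VALUE_LIST instruction.
struct DemoDebugInstr {
  std::vector<DemoDebugOperand> DebugOperands;
};

// Clear every reference to a dead frame index, whatever its position.
// Returns the number of operands that were cleared.
inline int clearDeadFrameIndices(DemoDebugInstr &MI,
                                 const std::set<int> &DeadFIs) {
  int Cleared = 0;
  for (DemoDebugOperand &Op : MI.DebugOperands) {
    if (Op.IsFrameIndex && DeadFIs.count(Op.FrameIndex)) {
      // Models replacing the operand with "no location".
      Op.IsFrameIndex = false;
      Op.FrameIndex = -1;
      ++Cleared;
    }
  }
  return Cleared;
}
```

A check that only looked at one operand position would miss a second dead frame index in a multi-operand instruction; the loop above does not depend on position.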
Diana Picus	f2e8e2faff	[AMDGPU] Make chain functions receive a stack pointer (#184616 ) Currently, chain functions are free to set up a stack pointer if they need one, and they assume they can start at scratch offset 0. This is not correct if CWSR and dynamic VGPRs are both enabled, since in that case we need to reserve an area at offset 0 for the trap handler, but only when running on a compute queue (which we determine at runtime). Rather than duplicate in every chain function the code sequence for determining if/how much scratch space needs to be reserved, this patch changes the ABI of chain functions so that they receive a stack pointer from their caller. Since chain functions can no longer use plain offsets to access their own stack, we'll also need to allocate a frame pointer more often (and sometimes also a base pointer). For simplicity, we use the same registers that `amdgpu_gfx` functions do (s32, s33, s34). This may change in the future. Chain functions never return to their caller and thus don't need to preserve the frame or base pointer. Another consequence is that now we might need to realign the stack in some cases (since it no longer starts at the infinitely aligned 0).	2026-03-06 11:01:42 +01:00
Sirish Pande	67fcdc9016	[AMDGPU] Efficient way to get NumArchVGPRs. (#182537) No functional change. Clean up how the number of VGPRs is obtained for different AMDGPU targets based on their features.	2026-02-20 14:52:17 -06:00
Diana Picus	9022f47ca4	[AMDGPU] Implement llvm.sponentry (#176357 ) In some of our use cases, the GPU runtime stores some data at the top of the stack. It figures out where it's safe to store it by using the PAL metadata generated by the backend, which includes the total stack size. However, the metadata does not include the space reserved at the bottom of the stack for the trap handler when CWSR is enabled in dynamic VGPR mode. This space is reserved dynamically based on whether or not the code is running on the compute queue. Therefore, the runtime needs a way to take that into account. Add support for `llvm.sponentry`, which should return the base of the stack, skipping over any reserved areas. This allows us to keep this computation in one place rather than duplicate it between the backend and the runtime. The implementation for functions that set up their own stack uses a pseudo that is expanded to the same code sequence as that used in the prolog to set up the stack in the first place. In callable functions, we generate a fixed stack object and use that instead, similar to the Arm/AArch64 approach. This wastes some stack space but that's not a problem for now because we're not planning to use this in callable functions yet.	2026-02-03 15:02:07 +01:00
Shilei Tian	4b1cfc5d7c	[NFCI][AMDGPU] Final touch before moving to `GET_SUBTARGETINFO_MACRO` (#177401 )	2026-01-22 17:33:17 +00:00
Shilei Tian	1843a7fe9f	[NFCI][AMDGPU] Use X-macro to reduce boilerplate in `GCNSubtarget.h` (#176844 ) `GCNSubtarget.h` contained a large amount of repetitive code following the pattern `bool HasXXX = false;` for member declarations and `bool hasXXX() const { return HasXXX; }` for getters. This boilerplate made the file unnecessarily long and harder to maintain. This patch introduces an X-macro pattern `GCN_SUBTARGET_HAS_FEATURE` that consolidates 135 simple subtarget features into a single list. The macro is expanded twice: once in the protected section to generate member variable declarations, and once in the public section to generate the corresponding getter methods. This reduces the file by approximately 600 lines while preserving the exact same API and functionality. Features with complex getter logic or inconsistent naming conventions are left as manual implementations for future improvement. Ideally, these could be generated by TableGen using `GET_SUBTARGETINFO_MACRO`, similar to the X86 backend. However, `AMDGPU.td` has several issues that prevent direct adoption: duplicate field names (e.g., `DumpCode` is set by both `FeatureDumpCode` and `FeatureDumpCodeLower`), and inconsistent naming conventions where many features don't have the `Has` prefix (e.g., `FlatAddressSpace`, `GFX10Insts`, `FP64`). Fixing these issues would require renaming fields in `AMDGPU.td` and updating all references, which is left for future work.	2026-01-21 15:29:09 -05:00
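The X-macro pattern described in the commit above can be shown with a minimal standalone sketch. The macro `DEMO_SUBTARGET_HAS_FEATURE`, the class `DemoSubtarget`, and the three feature names are hypothetical and chosen only for illustration; they are not the actual contents of `GCNSubtarget.h`.

```cpp
#include <cassert>

// Hypothetical feature list; each entry names one boolean feature.
#define DEMO_SUBTARGET_HAS_FEATURE(X) \
  X(FastFMA) \
  X(ScalarStores) \
  X(Packed16)

class DemoSubtarget {
protected:
  // First expansion: one `bool HasXXX = false;` member per feature.
#define DECL_MEMBER(Name) bool Has##Name = false;
  DEMO_SUBTARGET_HAS_FEATURE(DECL_MEMBER)
#undef DECL_MEMBER

public:
  // Second expansion: one `bool hasXXX() const` getter per feature.
#define DECL_GETTER(Name) \
  bool has##Name() const { return Has##Name; }
  DEMO_SUBTARGET_HAS_FEATURE(DECL_GETTER)
#undef DECL_GETTER

  // Demo-only helper that turns one feature on.
  void enableScalarStores() { HasScalarStores = true; }
};
```

Adding a feature then means adding one `X(...)` line to the list; both the member and its getter follow from the two expansions.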
Philip Reames	49a7772be5	[CodeGen] Replace (Min,Max)CSFrameIndex with flag on frame object [NFCI] (#170905 ) This removes the tracking of the MinCSFrameIndex, and MaxCSFrameIndex markers, simplifying the target API. This brings the tracking for callee save spill slots in line with how we handle other properties of stack locations. A couple notes: 1) This requires doing scans of the entire object range, but we have other such instances in the code already, so I doubt this will matter in practice. 2) This removes the requirement that callee saved spill slots be contiguous in the frame index identified space. I marked this as NFCI because if prior code violated the contiguous range assumption - I can't find a case where we did - then this change might adjust frame layout in some edge cases. The motivation for this is mostly code readability, but I might use this as a primitive for something in an upcoming patch series around shrink wrapping. Haven't decided yet.	2025-12-06 14:00:55 -08:00
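The trade-off noted in the commit above (a scan over all objects in exchange for dropping the contiguity requirement) can be sketched as follows. `DemoFrameObject` and `calleeSavedSlots` are hypothetical names for this illustration and are not the `MachineFrameInfo` API.

```cpp
#include <cassert>
#include <vector>

// Toy frame object carrying a per-object flag instead of relying on a
// [MinCSFrameIndex, MaxCSFrameIndex] range.
struct DemoFrameObject {
  int Size = 0;
  bool IsCalleeSaved = false;
};

// Collect callee-save slot indices by scanning every object. The slots do
// not need to be contiguous in the index space.
inline std::vector<int>
calleeSavedSlots(const std::vector<DemoFrameObject> &Objs) {
  std::vector<int> Result;
  for (int I = 0, E = static_cast<int>(Objs.size()); I != E; ++I)
    if (Objs[I].IsCalleeSaved)
      Result.push_back(I);
  return Result;
}
```

With objects 0 and 2 flagged and object 1 not, the scan returns indices 0 and 2, a layout that a single min/max range could not describe without also covering object 1.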
Christudasan Devadasan	a2dc4e02e7	[AMDGPU] Enable multi-group xnack replay in hardware (GFX1250) (#169016 ) This patch enables the multi-group xnack replay mode by configuring the hardware MODE register at kernel entry. This aligns the hardware behavior with the compiler's existing multi-group s_wait_xcnt insertion logic.	2025-11-21 19:42:17 +05:30
Robert Imschweiler	c02bdd466a	[AMDGPU] Fix handling of FP in cs.chain functions (#161194) If there is a dynamic alloca, or an alloca that is not in the entry block, cs.chain functions do not set up an FP but are reported to need one. This results in a failed assertion in `SIFrameLowering::emitPrologue()` (Assertion `(!HasFP || FPSaved) && "Needed to save FP but didn't save it anywhere"' failed.) This commit changes `hasFPImpl` so that the need for an SP in a cs.chain function no longer directly implies the need for an FP. This LLVM defect was identified via the AMD Fuzzing project.	2025-11-04 10:22:13 +01:00
Diana Picus	3027b4a55b	[AMDGPU] Fix iterator invalidation during frame lowering (#163952 ) I was a bit too eager to remove the SI_WHOLE_WAVE_FUNC_SETUP instruction during prolog emission. Erasing it invalidates MBBI, which in some cases is still needed outside of `emitCSRSpillStores`. Do the erasing at the end of prolog insertion instead.	2025-10-20 13:17:56 +02:00
Matt Arsenault	e3d23f8573	AMDGPU: Fix trying to constrain physical registers in spill handling (#161793 ) It's nonsensical to call constrainRegClass on a physical register, and we should not see virtual registers here.	2025-10-03 21:19:48 +09:00
Benjamin Maxwell	9f5abd38dd	[Codegen] Add a separate stack ID for scalable predicates (#142390 ) This splits out "ScalablePredicateVector" from the "ScalableVector" StackID this is primarily to allow easy differentiation between vectors and predicates (without inspecting instructions). This new stack ID is not used in many places yet, but will be used in a later patch to mark stack slots that are known to contain predicates. Co-authored-by: Kerry McLaughlin <kerry.mclaughlin@arm.com>	2025-10-02 14:43:07 +01:00
Carl Ritson	fdb06d9792	[AMDGPU] Refactor out common exec mask opcode patterns (NFCI) (#154718 ) Create utility mechanism for finding wave size dependent opcodes used to manipulate exec/lane masks.	2025-09-16 03:22:14 +00:00
Diana Picus	018dc1b397	[AMDGPU] Tail call support for whole wave functions (#145860 ) Support tail calls to whole wave functions (trivial) and from whole wave functions (slightly more involved because we need a new pseudo for the tail call return, that patches up the EXEC mask). Move the expansion of whole wave function return pseudos (regular and tail call returns) to prolog epilog insertion, since that's where we patch up the EXEC mask.	2025-09-04 10:34:43 +02:00
Stanislav Mekhanoshin	6aebbb0a85	[AMDGPU] Define 1024 VGPRs on gfx1250 (#156765) This is baseline support; it is not usable yet.	2025-09-03 16:25:18 -07:00
macurtis-amd	d28bb1fd78	[AMDGPU] Ensure non-reserved CSR spilled regs are live-in (#146427 ) Fixes: ``` * Bad machine code: Using an undefined physical register * - function: widget - basic block: %bb.0 bb (0x564092cbe140) - instruction: $vgpr63 = V_ACCVGPR_READ_B32_e64 killed $agpr13, implicit $exec - operand 1: killed $agpr13 LLVM ERROR: Found 1 machine code errors. ``` The detailed sequence of events that led to this assert: 1. MachineVerifier fails because `$agpr13` is not defined on line 19 below: ``` 1: bb.0.bb: 2: successors: %bb.1(0x80000000); %bb.1(100.00%) 3: liveins: $agpr14, $agpr15, $sgpr12, $sgpr13, $sgpr14, \ 4: $sgpr15, $sgpr30, $sgpr31, $sgpr34, $sgpr35, \ 5: $sgpr36, $sgpr37, $sgpr38, $sgpr39, $sgpr48, \ 6: $sgpr49, $sgpr50, $sgpr51, $sgpr52, $sgpr53, \ 7: $sgpr54, $sgpr55, $sgpr64, $sgpr65, $sgpr66, \ 8: $sgpr67, $sgpr68, $sgpr69, $sgpr70, $sgpr71, \ 9: $sgpr80, $sgpr81, $sgpr82, $sgpr83, $sgpr84, \ 10: $sgpr85, $sgpr86, $sgpr87, $sgpr96, $sgpr97, \ 11: $sgpr98, $sgpr99, $vgpr0, $vgpr31, $vgpr40, $vgpr41, \ 12: $sgpr4_sgpr5, $sgpr6_sgpr7, $sgpr8_sgpr9, \ 13: $sgpr10_sgpr11 14: $sgpr16 = COPY $sgpr33 15: $sgpr33 = frame-setup COPY $sgpr32 16: $sgpr18_sgpr19 = S_XOR_SAVEEXEC_B64 -1, \ 17: implicit-def $exec, implicit-def dead $scc, \ 18: implicit $exec 19: $vgpr63 = V_ACCVGPR_READ_B32_e64 killed $agpr13, \ 20: implicit $exec 21: BUFFER_STORE_DWORD_OFFSET killed $vgpr63, \ 22: $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr33, 0, 0, 0, \ 23: implicit $exec :: (store (s32) into %stack.38, \ 24: addrspace 5) 25: ... 26: $vgpr43 = IMPLICIT_DEF 27: $vgpr43 = SI_SPILL_S32_TO_VGPR $sgpr15, 0, \ 28: killed $vgpr43(tied-def 0) 29: $vgpr43 = SI_SPILL_S32_TO_VGPR $sgpr14, 1, \ 30: killed $vgpr43(tied-def 0) 31: $sgpr100_sgpr101 = S_OR_SAVEEXEC_B64 -1, \ 32: implicit-def $exec, implicit-def dead $scc, \ 33: implicit $exec 34: renamable $agpr13 = COPY killed $vgpr43, implicit $exec ``` 2. 
That instruction is created by [`emitCSRSpillStores`](`d599bdeaa4/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp (L977)`) (called by [`SIFrameLowering::emitPrologue`](`d599bdeaa4/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp (L1122)`)) because `$agpr13` is in `WWMSpills`. See lines 982, 993, and 998 below. ``` 977: // Spill Whole-Wave Mode VGPRs. Save only the inactive lanes of the scratch 978: // registers. However, save all lanes of callee-saved VGPRs. Due to this, we 979: // might end up flipping the EXEC bits twice. 980: Register ScratchExecCopy; 981: SmallVector<std::pair<Register, int>, 2> WWMCalleeSavedRegs, WWMScratchRegs; 982: FuncInfo->splitWWMSpillRegisters(MF, WWMCalleeSavedRegs, WWMScratchRegs); 983: if (!WWMScratchRegs.empty()) 984: ScratchExecCopy = 985: buildScratchExecCopy(LiveUnits, MF, MBB, MBBI, DL, 986: /*IsProlog*/ true, /*EnableInactiveLanes*/ true); 987: 988: auto StoreWWMRegisters = 989: [&](SmallVectorImpl<std::pair<Register, int>> &WWMRegs) { 990: for (const auto &Reg : WWMRegs) { 991: Register VGPR = Reg.first; 992: int FI = Reg.second; 993: buildPrologSpill(ST, TRI, *FuncInfo, LiveUnits, MF, MBB, MBBI, DL, 994: VGPR, FI, FrameReg); 995: } 996: }; 997: 998: StoreWWMRegisters(WWMScratchRegs); ``` 3. `$agpr13` got added to `WWMSpills` by [`SILowerWWMCopies::run`](`59a7185dd9/llvm/lib/Target/AMDGPU/SILowerWWMCopies.cpp (L137)`) as it processed the `WWM_COPY` on line 3 below (corresponds to line 34 above in point #1): ``` 1: %45:vgpr_32 = SI_SPILL_S32_TO_VGPR $sgpr15, 0, %45:vgpr_32(tied-def 0) 2: %45:vgpr_32 = SI_SPILL_S32_TO_VGPR $sgpr14, 1, %45:vgpr_32(tied-def 0) 3: %44:av_32 = WWM_COPY %45:vgpr_32 ```	2025-08-01 11:54:25 -05:00
Diana Picus	20d8398825	[AMDGPU] ISel & PEI for whole wave functions (#145858 ) Whole wave functions are functions that will run with a full EXEC mask. They will not be invoked directly, but instead will be launched by way of a new intrinsic, `llvm.amdgcn.call.whole.wave` (to be added in a future patch). These functions are meant as an alternative to the `llvm.amdgcn.init.whole.wave` or `llvm.amdgcn.strict.wwm` intrinsics. Whole wave functions will set EXEC to -1 in the prologue and restore the original value of EXEC in the epilogue. They must have a special first argument, `i1 %active`, that is going to be mapped to EXEC. They may have either the default calling convention or amdgpu_gfx. The inactive lanes need to be preserved for all registers used, active lanes only for the CSRs. At the IR level, arguments to a whole wave function (other than `%active`) contain poison in their inactive lanes. Likewise, the return value for the inactive lanes is poison. This patch contains the following work: * 2 new pseudos, SI_SETUP_WHOLE_WAVE_FUNC and SI_WHOLE_WAVE_FUNC_RETURN used for managing the EXEC mask. SI_SETUP_WHOLE_WAVE_FUNC will return a SReg_1 representing `%active`, which needs to be passed into SI_WHOLE_WAVE_FUNC_RETURN. * SelectionDAG support for generating these 2 new pseudos and the special handling of %active. Since the return may be in a different basic block, it's difficult to add the virtual reg for %active to SI_WHOLE_WAVE_FUNC_RETURN, so we initially generate an IMPLICIT_DEF which is later replaced via a custom inserter. * Expansion of the 2 pseudos during prolog/epilog insertion. PEI also marks any used VGPRs as WWM registers, which are then spilled and restored with the usual logic. Future patches will include the `llvm.amdgcn.call.whole.wave` intrinsic and a lot of optimization work (especially in order to reduce spills around function calls). 
--------- Co-authored-by: Matt Arsenault <Matthew.Arsenault@amd.com> Co-authored-by: Shilei Tian <i@tianshilei.me>	2025-07-21 10:39:09 +02:00
Diana Picus	a201f8872a	[AMDGPU] Replace dynamic VGPR feature with attribute (#133444 ) Use a function attribute (amdgpu-dynamic-vgpr) instead of a subtarget feature, as requested in #130030.	2025-06-24 11:09:36 +02:00
Kazu Hirata	d1cd68881a	[llvm] Use llvm::is_sorted (NFC) (#140399 )	2025-05-17 14:29:35 -07:00
Brox Chen	72bc0525d8	[AMDGPU][True16][CodeGen] update wwm reg sorting check condition (#135053) We currently only need to shift down 32-bit wwm registers. The previous check condition mistakenly selected 16-bit registers in true16 mode. Update the check condition to skip 16-bit registers in wwm reg sorting.	2025-04-27 14:30:34 -04:00
Diana Picus	5bad5d84a1	Reland [AMDGPU] Support block load/store for CSR #130013 (#137169 ) Add support for using the existing SCRATCH_STORE_BLOCK and SCRATCH_LOAD_BLOCK instructions for saving and restoring callee-saved VGPRs. This is controlled by a new subtarget feature, block-vgpr-csr. It does not include WWM registers - those will be saved and restored individually, just like before. This patch does not change the ABI. Use of this feature may lead to slightly increased stack usage, because the memory is not compacted if certain registers don't have to be transferred (this will happen in practice for calling conventions where the callee and caller saved registers are interleaved in groups of 8). However, if the registers at the end of the block of 32 don't have to be transferred, we don't need to use a whole 128-byte stack slot - we can trim some space off the end of the range. In order to implement this feature, we need to rely less on the target-independent code in the PrologEpilogInserter, so we override several new methods in SIFrameLowering. We also add new pseudos, SI_BLOCK_SPILL_V1024_SAVE/RESTORE. One peculiarity is that both the SI_BLOCK_V1024_RESTORE pseudo and the SCRATCH_LOAD_BLOCK instructions will have all the registers that are not transferred added as implicit uses. This is done in order to inform LiveRegUnits that those registers are not available before the restore (since we're not really restoring them - so we can't afford to scavenge them). Unfortunately, this trick doesn't work with the save, so before the save all the registers in the block will be unavailable (see the unit test). This was reverted due to failures in the builds with expensive checks on, now fixed by always updating LiveIntervals and SlotIndexes in SILowerSGPRSpills.	2025-04-25 11:29:27 +02:00
Diana Picus	6bb2f90557	Revert "[AMDGPU] Support block load/store for CSR" (#136846 ) Reverts llvm/llvm-project#130013 due to failures with expensive checks on.	2025-04-23 14:01:00 +02:00
Diana Picus	4a58071d87	[AMDGPU] Support block load/store for CSR (#130013 ) Add support for using the existing `SCRATCH_STORE_BLOCK` and `SCRATCH_LOAD_BLOCK` instructions for saving and restoring callee-saved VGPRs. This is controlled by a new subtarget feature, `block-vgpr-csr`. It does not include WWM registers - those will be saved and restored individually, just like before. This patch does not change the ABI. Use of this feature may lead to slightly increased stack usage, because the memory is not compacted if certain registers don't have to be transferred (this will happen in practice for calling conventions where the callee and caller saved registers are interleaved in groups of 8). However, if the registers at the end of the block of 32 don't have to be transferred, we don't need to use a whole 128-byte stack slot - we can trim some space off the end of the range. In order to implement this feature, we need to rely less on the target-independent code in the PrologEpilogInserter, so we override several new methods in `SIFrameLowering`. We also add new pseudos, `SI_BLOCK_SPILL_V1024_SAVE/RESTORE`. One peculiarity is that both the SI_BLOCK_V1024_RESTORE pseudo and the SCRATCH_LOAD_BLOCK instructions will have all the registers that are not transferred added as implicit uses. This is done in order to inform LiveRegUnits that those registers are not available before the restore (since we're not really restoring them - so we can't afford to scavenge them). Unfortunately, this trick doesn't work with the save, so before the save all the registers in the block will be unavailable (see the unit test).	2025-04-23 10:33:36 +02:00
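The stack-slot trimming described in the commit above can be illustrated with a small sketch. It assumes a 32-bit mask in which bit I marks register I of the block as transferred, and 4 bytes per register per lane, so that a full block of 32 fills the 128-byte slot mentioned in the message. The function name `blockSlotBytes` and the mask encoding are hypothetical and may not match what `SIFrameLowering` actually does.

```cpp
#include <cassert>
#include <cstdint>

// Mask bit I set means register I of the 32-register block is transferred.
// Memory is not compacted, so gaps below the highest transferred register
// still occupy space; only the unused tail of the block is trimmed.
// Returns the per-lane slot size in bytes under the assumptions above.
inline uint32_t blockSlotBytes(uint32_t Mask) {
  uint32_t Used = 0; // registers up to and including the highest set bit
  while (Mask != 0) {
    ++Used;
    Mask >>= 1;
  }
  return Used * 4;
}
```

For a mask of `0x00FF00FF` (two groups of 8 transferred, interleaved with two groups that are not), the highest set bit is 23, so the slot covers 24 registers, or 96 bytes, instead of the full 128.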
Diana Picus	72c3c30452	[AMDGPU] Allocate scratch space for dVGPRs for CWSR (#130055 ) The CWSR trap handler needs to save and restore the VGPRs. When dynamic VGPRs are in use, the fixed function hardware will only allocate enough space for one VGPR block. The rest will have to be stored in scratch, at offset 0. This patch allocates the necessary space by: - generating a prologue that checks at runtime if we're on a compute queue (since CWSR only works on compute queues); for this we will have to check the ME_ID bits of the ID_HW_ID2 register - if that is non-zero, we can assume we're on a compute queue and initialize the SP and FP with enough room for the dynamic VGPRs - forcing all compute entry functions to use a FP so they can access their locals/spills correctly (this isn't ideal but it's the quickest to implement) Note that at the moment we allocate enough space for the theoretical maximum number of VGPRs that can be allocated dynamically (for blocks of 16 registers, this will be 128, of which we subtract the first 16, which are already allocated by the fixed function hardware). Future patches may decide to allocate less if they can prove the shader never allocates that many blocks. Also note that this should not affect any reported stack sizes (e.g. PAL backend_stack_size etc).	2025-03-19 13:49:19 +01:00
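The arithmetic in the commit above can be written out as a short sketch. The function names are hypothetical, the 4-bytes-per-lane figure only reflects that a VGPR holds 32 bits per lane, and `MeId` stands for the already extracted ME_ID field; how that field is read from the hardware register is not modeled here.

```cpp
#include <cassert>
#include <cstdint>

// VGPRs the trap handler may need to save in scratch: the theoretical
// maximum minus the first block, which the fixed function hardware
// already covers.
constexpr uint32_t scratchVGPRs(uint32_t BlockSize, uint32_t MaxVGPRs) {
  return MaxVGPRs - BlockSize;
}

// Per-lane bytes needed for that many 32-bit VGPRs.
constexpr uint32_t scratchBytesPerLane(uint32_t NumVGPRs) {
  return NumVGPRs * 4;
}

// Initial stack offset: reserve room only when ME_ID is non-zero, which the
// commit message treats as "running on a compute queue".
constexpr uint32_t initialStackOffset(uint32_t MeId, uint32_t ReservedBytes) {
  return MeId != 0 ? ReservedBytes : 0;
}

// The example from the message: blocks of 16 registers, maximum of 128.
static_assert(scratchVGPRs(16, 128) == 112, "128 - 16 VGPRs go to scratch");
```

With blocks of 16 and a maximum of 128 VGPRs, 112 registers are reserved; a wave with a zero ME_ID keeps starting at offset 0.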
Pierre van Houtryve	5231736329	[AMDGPU] Do not allow M0 as v_readfirstlane_b32 dst (#128851 ) M0 can only be written to by the SALU, so `v_readfirstlane_b32 m0` is effectively useless. Represent this by restricting the dest RC of that instruction to `SReg_32_XM0` which excludes M0. There is a lot of test changes due to the register class changing, but most changes are trivial. In some cases, an extra register and `s_mov_b32` is needed. Fixes SWDEV-513269	2025-02-26 13:14:03 +01:00
Craig Topper	af64f0a6c2	[FrameLowering] Use MCRegister instead of Register in CalleeSavedInfo. NFC (#128095) Callee saved registers should always be physical registers. They are often passed directly to other functions that take MCRegister like getMinimalPhysRegClass or TargetRegisterClass::contains. Unfortunately, sometimes the MCRegister is compared to a Register which gave an ambiguous comparison error when the MCRegister is on the LHS. Adding a MCRegister==Register comparison operator created more ambiguous comparison errors elsewhere. These cases were usually comparing against a base or frame pointer register that is a physical register in a Register. For those I added an explicit conversion of Register to MCRegister to fix the error.	2025-02-20 23:44:05 -08:00
Aaditya	11b0401926	[AMDGPU] Restore SP from saved-FP or saved-BP (#124007 ) Currently, the AMDGPU backend bumps the Stack Pointer by fixed size offsets in the prolog of device functions, and restores it by the same amount in the epilog. Prolog: sp += frameSize Epilog: sp -= frameSize If a function has dynamic stack realignment, Prolog: sp += frameSize + max_alignment Epilog: sp -= frameSize + max_alignment These calculations are not optimal in case of dynamic stack realignment, and completely fail in case of dynamic stack readjustment. This patch uses the saved Frame Pointer to restore SP. Prolog: fp = sp sp += frameSize Epilog: sp = fp In case of dynamic stack realignment, SP is restored from the saved Base Pointer. Prolog: fp = sp + (max_alignment - 1) fp = fp & (-max_alignment) bp = sp sp += frameSize + max_alignment Epilog: sp = bp (Note: The presence of BP has been enforced in case of any dynamic stack realignment.) --------- Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com> Co-authored-by: Matt Arsenault <arsenm2@gmail.com>	2025-01-24 19:13:40 +05:30
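The prolog and epilog sequences from the commit above can be traced with a toy model. `Regs`, `prologRealign`, `epilogRealign`, and `epilogSubtract` are hypothetical names for this sketch; the values are plain byte offsets and ignore any target-specific scaling or the saving of the caller's FP/BP.

```cpp
#include <cassert>
#include <cstdint>

// Toy register file; values are byte offsets, not real hardware state.
struct Regs {
  uint32_t SP = 0, FP = 0, BP = 0;
};

// Prolog with dynamic realignment, following the sequence in the commit
// message. MaxAlign must be a power of two.
inline void prologRealign(Regs &R, uint32_t FrameSize, uint32_t MaxAlign) {
  assert(MaxAlign != 0 && (MaxAlign & (MaxAlign - 1)) == 0);
  R.FP = R.SP + (MaxAlign - 1);
  R.FP &= ~(MaxAlign - 1); // same as fp & (-max_alignment)
  R.BP = R.SP;             // remember the incoming SP
  R.SP += FrameSize + MaxAlign;
}

// New epilog: restore SP from the saved BP.
inline void epilogRealign(Regs &R) { R.SP = R.BP; }

// Old epilog, for comparison: subtract the constant added in the prolog.
inline void epilogSubtract(Regs &R, uint32_t FrameSize, uint32_t MaxAlign) {
  R.SP -= FrameSize + MaxAlign;
}
```

Starting from SP = 100 with a 64-byte frame and 32-byte alignment, the prolog yields FP = 128, BP = 100, SP = 196. If the body then bumps SP by another 48 bytes (a dynamic adjustment), the old subtract-based epilog leaves SP at 148, while restoring from BP returns it to 100.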
Guy David	1a935d7a17	[llvm] Mark scavenging spill-slots as spilled stack objects. (#122673 ) This seems like an oversight when copying code from other backends.	2025-01-14 10:18:31 +02:00
Diana Picus	09c41246ed	[AMDGPU] Fix restores in chain functions (#116193 ) When spilling a VGPR in `emitPrologue`, chain functions prefer to use offsets to access the stack instead of the SP. This patch fixes `emitEpilogue` to do the same. It also brings back some test coverage that was lost in #93526, when WWM registers started being shifted to the lowest available range (which meant that tests that were originally spilling v8 would shift to spill v0, which is a scratch register for chain functions and didn't get spilled). Change-Id: Icb07fccd859b563cd45f74c25ae578ecb38bdeeb	2024-11-20 10:43:59 +01:00
Alex Rønne Petersen	ad4a582fd9	[llvm] Consistently respect `naked` fn attribute in `TargetFrameLowering::hasFP()` (#106014 ) Some targets (e.g. PPC and Hexagon) already did this. I think it's best to do this consistently so that frontend authors don't run into inconsistent results when they emit `naked` functions. For example, in Zig, we had to change our emit code to also set `frame-pointer=none` to get reliable results across targets. Note: I don't have commit access.	2024-10-18 09:35:42 +04:00
Jay Foad	8d13e7b8c3	[AMDGPU] Qualify auto. NFC. (#110878 ) Generated automatically with: $ clang-tidy -fix -checks=-*,llvm-qualified-auto $(find lib/Target/AMDGPU/ -type f)	2024-10-03 13:07:54 +01:00
Christudasan Devadasan	ac0f64f06d	[AMDGPU] Split vgpr regalloc pipeline (#93526 ) Allocating wwm-registers and per-thread VGPR operands together imposes many challenges in the way the registers are reused during allocation. There are times when regalloc reuses the registers of regular VGPRs operations for wwm-operations in a small range leading to unwantedly clobbering their inactive lanes causing correctness issues that are hard to trace. This patch splits the VGPR allocation pipeline further to allocate wwm-registers first and the regular VGPR operands in a separate pipeline. The splitting would ensure that the physical registers used for wwm allocations won't take part in the next allocation pipeline to avoid any such clobbering.	2024-09-30 19:55:42 +05:30
Christudasan Devadasan	23487be490	[AMDGPU] Merge the conditions used for deciding CS spills for amdgpu_cs_chain[_preserve] (#109911 ) Multiple conditions exist to decide whether callee save spills/restores are required for amdgpu_cs_chain or amdgpu_cs_chain_preserve calling conventions. This patch consolidates them all and moves to a single place.	2024-09-26 10:50:00 +05:30
Pravin Jagtap	3659aa8079	[AMDGPU] Fix handling of DBG_VALUE_LIST while fixing the dead frame indices. (#109685 ) Both SGPR->VGPR and VGPR->AGPR spilling code give a fixup to the spill frame indices referred in debug instructions so that they can be entirely removed. The stack argument is present at 0th index in DBG_VALUE and at 2nd index for DBG_VALUE_LIST. Fixes: SWDEV-484156	2024-09-24 14:41:45 +05:30
Nikita Popov	cee0bf9626	[AMDGPU] Use Lo_32 and Hi_32 helpers (NFC) (#109413 )	2024-09-20 14:35:38 +02:00
Diana Picus	3356208531	Reland "[amdgpu] Add llvm.amdgcn.init.whole.wave intrinsic" (#108512 ) This reverts commit `7792b4ae79`. The problem was a conflict with `e55d6f5ea2` "[AMDGPU] Simplify and improve codegen for llvm.amdgcn.set.inactive (https://github.com/llvm/llvm-project/pull/107889)" which changed the syntax of V_SET_INACTIVE (and thus made my MIR test crash). ...if only we had a merge queue.	2024-09-13 11:54:30 +02:00
Diana Picus	7792b4ae79	Revert "Reland "[amdgpu] Add llvm.amdgcn.init.whole.wave intrinsic" (#108054 )"" (#108341 ) Reverts llvm/llvm-project#108173 si-init-whole-wave.mir crashes on some buildbots (although it passed both locally with sanitizers enabled and in pre-merge tests). Investigating.	2024-09-12 10:12:09 +02:00
Diana Picus	703ebca869	Reland "[amdgpu] Add llvm.amdgcn.init.whole.wave intrinsic" (#108054 )" (#108173 ) This reverts commit `c7a7767fca`. The buildbots failed because I removed a MI from its parent before updating LIS. This PR should fix that.	2024-09-12 09:11:41 +02:00
Vitaly Buka	c7a7767fca	Revert "[amdgpu] Add llvm.amdgcn.init.whole.wave intrinsic" (#108054 ) Breaks bots, see #105822. Reverts llvm/llvm-project#105822	2024-09-10 09:51:43 -07:00
Diana Picus	44556e64f2	[amdgpu] Add llvm.amdgcn.init.whole.wave intrinsic (#105822 ) This intrinsic is meant to be used in functions that have a "tail" that needs to be run with all the lanes enabled. The "tail" may contain complex control flow that makes it unsuitable for the use of the existing WWM intrinsics. Instead, we will pretend that the function starts with all the lanes enabled, then branches into the actual body of the function for the lanes that were meant to run it, and then finally all the lanes will rejoin and run the tail. As such, the intrinsic will return the EXEC mask for the body of the function, and is meant to be used only as part of a very limited pattern (for now only in amdgpu_cs_chain functions): ``` entry: %func_exec = call i1 @llvm.amdgcn.init.whole.wave() br i1 %func_exec, label %func, label %tail func: ; ... stuff that should run with the actual EXEC mask br label %tail tail: ; ... stuff that runs with all the lanes enabled; ; can contain more than one basic block ``` It's an error to use the result of this intrinsic for anything other than a branch (but unfortunately checking that in the verifier is non-trivial because SIAnnotateControlFlow will introduce an amdgcn.if between the intrinsic and the branch). The intrinsic is lowered to a SI_INIT_WHOLE_WAVE pseudo, which for now is expanded in si-wqm (which is where SI_INIT_EXEC is handled too); however the information that the function was conceptually started in whole wave mode is stored in the machine function info (hasInitWholeWave). This will be useful in prolog epilog insertion, where we can skip saving the inactive lanes for CSRs (since if the function started with all the lanes active, then there are no inactive lanes to preserve).	2024-09-10 13:24:53 +02:00
Matt Arsenault	b1bcb7ca46	Reapply "AMDGPU: Move attributor into optimization pipeline (#83131 )" and follow up commit "clang/AMDGPU: Defeat attribute optimization in attribute test" (#98851 ) This reverts commit adaff46d087799072438dd744b038e6fd50a2d78. Drop the -O3 checks from default-attributes.hip. I don't know why they are different on some bots but reverting this is far too disruptive.	2024-07-15 11:51:44 +04:00
dyung	adaff46d08	Revert "AMDGPU: Move attributor into optimization pipeline (#83131 )" and follow up commit "clang/AMDGPU: Defeat attribute optimization in attribute test" (#98851 ) This reverts commits 677cc15e0ff2e0e6aa30538eb187990a6a8f53c0 and 78bc1b64a6dc3fb6191355a5e1b502be8b3668e7. The test CodeGenHIP/default-attributes.hip is failing on multiple bots even after the attempted fix including the following: - https://lab.llvm.org/buildbot/#/builders/3/builds/1473 - https://lab.llvm.org/buildbot/#/builders/65/builds/1380 - https://lab.llvm.org/buildbot/#/builders/161/builds/595 - https://lab.llvm.org/buildbot/#/builders/154/builds/1372 - https://lab.llvm.org/buildbot/#/builders/133/builds/1547 - https://lab.llvm.org/buildbot/#/builders/81/builds/755 - https://lab.llvm.org/buildbot/#/builders/40/builds/570 - https://lab.llvm.org/buildbot/#/builders/13/builds/748 - https://lab.llvm.org/buildbot/#/builders/12/builds/1845 - https://lab.llvm.org/buildbot/#/builders/11/builds/1695 - https://lab.llvm.org/buildbot/#/builders/190/builds/1829 - https://lab.llvm.org/buildbot/#/builders/193/builds/962 - https://lab.llvm.org/buildbot/#/builders/23/builds/991 - https://lab.llvm.org/buildbot/#/builders/144/builds/2256 - https://lab.llvm.org/buildbot/#/builders/46/builds/1614 These bots have been broken for a day, so reverting to get everything back to green.	2024-07-14 18:48:54 -07:00
Matt Arsenault	78bc1b64a6	AMDGPU: Move attributor into optimization pipeline (#83131 ) Removing it from the codegen pipeline induces a lot of test churn because llc is no longer optimizing out implicit arguments to kernels. Mostly mechanical, but there are some creative test updates. I preferred to take the changes as-is in tests where the ABI isn't relevant. In cases where it's more relevant, or the optimize out logic was too ingrained in the test, I pre-run the optimization. Some cases manually add attributes to disable inputs.	2024-07-14 08:36:33 +04:00
Pankaj Dwivedi	dc8da7ddea	[AMDGPU] Reserved private memory register during PEI (#93536 ) - Reserved newly selected private memory registers in entry Function Prologue generation. - Added assertion patch in eliminateFrameIndex to ensure register is reserved. Co-authored-by: PankajDwivedi-25 <pankajkumar.divedi@amd.com>	2024-05-29 15:10:44 +05:30
Gang Chen	167427f5db	[AMDGPU] change order of fp and sp in kernel prologue (#90626) Change the order of FP and SP in the kernel prologue, and update the related codegen tests, to make it easier to merge code into our downstream branches. Signed-off-by: gangc <gangc@amd.com>	2024-05-01 08:16:55 -07:00
Ivan Kosarev	dfa1d9b027	[AMDGPU][NFC] Have helpers to deal with encoding fields. (#82772 ) These are hoped to provide more convenient and less error prone facilities to encode and decode fields than manually defined constants and functions.	2024-02-23 17:34:55 +00:00
Jan Patrick Lehr	f661057865	Revert "[AMDGPU] Compiler should synthesize private buffer resource descriptor from flat_scratch_init" (#81234) Reverts llvm/llvm-project#79586 This broke the AMDGPU OpenMP Offload buildbot. The typical error message was that the GPU attempted to read beyond the largest legal address. Error message: AMDGPU fatal error 1: Received error in queue 0x7f8363f22000: HSA_STATUS_ERROR_MEMORY_APERTURE_VIOLATION: The agent attempted to access memory beyond the largest legal address.	2024-02-09 09:57:38 +01:00
alex-t	88e52511ca	[AMDGPU] Compiler should synthesize private buffer resource descriptor from flat_scratch_init (#79586 ) This change implements synthesizing the private buffer resource descriptor in the kernel prolog instead of using the preloaded kernel argument.	2024-02-08 20:27:36 +01:00
Christudasan Devadasan	89ec940b4a	[AMDGPU] Insert spill codes for the SGPRs used for EXEC copy (#79428 ) The SGPR registers used for preserving EXEC mask while lowering the whole-wave register spills and copies should be preserved at the prolog and epilog if they are in the CSR range. It isn't happening when there is only wwm-copy lowered and there are no wwm-spills. This patch addresses that problem.	2024-02-05 18:32:23 +05:30
Christudasan Devadasan	230c13d59d	[AMDGPU] Pick available high VGPR for CSR SGPR spilling (#78669) CSR SGPR spilling currently uses the earliest available physical VGPRs. This imposes high register pressure when trying to allocate large VGPR tuples within the default register budget. This patch changes the spilling strategy to pick VGPRs in reverse order, highest available VGPR first, and to shift them back to the lowest available range after regalloc. With that, the initial VGPRs remain available for allocation, which makes it more likely that a large number of contiguous registers can be found.	2024-01-24 07:08:43 +05:30

254 Commits