llvm-project

Author	SHA1	Message	Date
Jay Foad	0b49adc32c	[AMDGPU] Rename AMDGPUMachineFunction to AMDGPUMachineFunctionInfo. NFC. (#187276 ) This is derived from MachineFunctionInfo not MachineFunction.	2026-03-18 20:29:47 +00:00
Diana Picus	f2e8e2faff	[AMDGPU] Make chain functions receive a stack pointer (#184616 ) Currently, chain functions are free to set up a stack pointer if they need one, and they assume they can start at scratch offset 0. This is not correct if CWSR and dynamic VGPRs are both enabled, since in that case we need to reserve an area at offset 0 for the trap handler, but only when running on a compute queue (which we determine at runtime). Rather than duplicate in every chain function the code sequence for determining if/how much scratch space needs to be reserved, this patch changes the ABI of chain functions so that they receive a stack pointer from their caller. Since chain functions can no longer use plain offsets to access their own stack, we'll also need to allocate a frame pointer more often (and sometimes also a base pointer). For simplicity, we use the same registers that `amdgpu_gfx` functions do (s32, s33, s34). This may change in the future. Chain functions never return to their caller and thus don't need to preserve the frame or base pointer. Another consequence is that now we might need to realign the stack in some cases (since it no longer starts at the infinitely aligned 0).	2026-03-06 11:01:42 +01:00
Dark Steve	9e6a6be8a8	[AMDGPU] Remove AMDGPUArgumentUsageInfo pass (#182490 ) `AMDGPUArgumentUsageInfo` provided a per-function map that `lowerFormalArguments` would write each function's implicit argument register layout into, and `passSpecialInputs` would read back when lowering calls to look up the callee's layout. This per-function map is redundant for all non-entry callees, which already use the same `FixedABIFunctionInfo` register layout. GlobalISel already used `FixedABIFunctionInfo` unconditionally. This change makes SelectionDAG do the same.	2026-02-23 18:47:01 +05:30
serge-sans-paille	85919fbfa4	[perf] Replace copy-assign by move-assign in llvm/lib/Target/AMDGPU/ (#179460 )	2026-02-03 14:24:31 +00:00
paperchalice	62aa40a4dd	[AMDGPU] Remove `NoSignedZerosFPMath` uses (#178343 ) This is one of the global flags in `resetTargetOptions`; users should use `nsz` instead. `fneg_fadd_0_f64` from `AMDGPU/fneg-combines.new.ll` will regress when `fadd` is annotated with `nsz`.	2026-01-30 09:18:40 +08:00
Shilei Tian	4b1cfc5d7c	[NFCI][AMDGPU] Final touch before moving to `GET_SUBTARGETINFO_MACRO` (#177401 )	2026-01-22 17:33:17 +00:00
Matt Arsenault	9568772187	AMDGPU: Select VGPR MFMAs by default (#159493 ) AGPRs are undesirable since they are only usable by a handful of instructions like loads, stores and mfmas and everything else requires copies to/from VGPRs. Using the AGPR form should be a measure of last resort if we must use more than 256 VGPRs.	2026-01-22 13:41:25 +00:00
tyb0807	29d1e1857d	[AMDGPU] Enable serializing of allocated preload kernarg SGPRs info (#168374 ) - Support serialization of the number of allocated preload kernarg SGPRs - Support serialization of the first preload kernarg SGPR allocated Together they enable correctly reconstructing MIR with preload kernarg SGPRs.	2025-11-22 14:03:14 -08:00
Matt Arsenault	476a6ea957	AMDGPU: Track minNumAGPRs in MFI instead of mayUseAGPRs (#161996 ) Fix mfma agpr allocation failures with -O0. Previously we were getting lucky on cases that can use AV registers with the normal optimization pipeline. This logic needs to be consistent with getMaxNumVectorRegs, as that is what getReservedRegs uses to determine the AGPR budget. In the future we should directly check the minimum AGPR budget, and individual selection patterns need to know the minimum budget required for them. Start accounting for the number of AGPRs required to perform the allocation. Refine the selection predicates to check this number is available, and default to selecting the VGPR case if there aren't enough. This also avoids register allocation failures for the largest MFMAs with the default register budget.	2025-10-07 08:48:45 +09:00
Shilei Tian	8122ccdca9	[AMDGPU] Set TGID_EN_X/Y/Z when cluster ID intrinsics are used (#159120 ) Hardware initializes a single value in ttmp9 which is either the workgroup ID X or cluster ID X. Most of this patch is a refactoring to use a single `PreloadedValue` enumerator for this value, instead of two enumerators `WORKGROUP_ID_X` and `CLUSTER_ID_X` referring to the same value. This makes it simpler to have a single attribute `amdgpu-no-workgroup-id-x` indicating that this value is not used, which in turn sets the TGID_EN_X bit appropriately to tell the hardware whether to initialize it. All of the above applies to Y and Z similarly. Fixes: LWPSCGFX13-568 Co-authored-by: Jay Foad <jay.foad@amd.com>	2025-09-16 15:37:01 -04:00
Shilei Tian	1180c2ced0	[AMDGPU] Support lowering of cluster related intrinsics (#157978 ) Since much of the code is interconnected, this also changes how workgroup id is lowered. Co-authored-by: Jay Foad <jay.foad@amd.com> Co-authored-by: Ivan Kosarev <ivan.kosarev@amd.com>	2025-09-12 21:11:17 -04:00
Stanislav Mekhanoshin	d267fac3bc	[AMDGPU] Use subtarget call to determine number of VGPRs (#157927 ) Since the register file was increased, it is no longer valid to call VGPR_32RegClass.getNumRegs() to get the total number of arch registers available on a subtarget. Fixes: SWDEV-550425	2025-09-11 00:39:56 -07:00
Matt Arsenault	0a0f077b94	AMDGPU: Add missing static to cl::opt (#152747 )	2025-08-09 08:33:09 +09:00
Diana Picus	a910a6a8b5	[AMDGPU] AsmPrinter: Unify arg handling (#151672 ) When computing the number of registers required by entry functions, the `AMDGPUAsmPrinter` needs to take into account both the register usage computed by the `AMDGPUResourceUsageAnalysis` pass, and the number of registers initialized by the hardware. At the moment, the way it computes the latter is different for graphics vs compute, due to differences in the implementation. For kernels, all the information needed is available in the `SIMachineFunctionInfo`, but for graphics shaders we would iterate over the `Function` arguments in the `AMDGPUAsmPrinter`. This pretty much repeats some of the logic from instruction selection. This patch introduces 2 new members to `SIMachineFunctionInfo`, one for SGPRs and one for VGPRs. Both will be computed during instruction selection and then used during `AMDGPUAsmPrinter`, removing the need to refer to the `Function` when printing assembly. This patch is NFC except for the fact that we now add the extra SGPRs (VCC, XNACK etc) to the number of SGPRs computed for graphics entry points. I'm not sure why these weren't included before. It would be nice if someone could confirm if that was just an oversight or if we have some docs somewhere that I haven't managed to find. Only one test is affected (its SGPR usage increases because we now take into account the XNACK registers).	2025-08-08 12:00:37 +02:00
Matt Arsenault	2b1ce25e21	AMDGPU: Fix -amdgpu-mfma-vgpr-form flag on gfx908 (#150599 ) This should be ignored since there are no VGPR forms. This makes it possible to flip the default for the flag to true.	2025-07-25 19:49:56 +09:00
Diana Picus	20d8398825	[AMDGPU] ISel & PEI for whole wave functions (#145858 ) Whole wave functions are functions that will run with a full EXEC mask. They will not be invoked directly, but instead will be launched by way of a new intrinsic, `llvm.amdgcn.call.whole.wave` (to be added in a future patch). These functions are meant as an alternative to the `llvm.amdgcn.init.whole.wave` or `llvm.amdgcn.strict.wwm` intrinsics. Whole wave functions will set EXEC to -1 in the prologue and restore the original value of EXEC in the epilogue. They must have a special first argument, `i1 %active`, that is going to be mapped to EXEC. They may have either the default calling convention or amdgpu_gfx. The inactive lanes need to be preserved for all registers used, active lanes only for the CSRs. At the IR level, arguments to a whole wave function (other than `%active`) contain poison in their inactive lanes. Likewise, the return value for the inactive lanes is poison. This patch contains the following work: * 2 new pseudos, SI_SETUP_WHOLE_WAVE_FUNC and SI_WHOLE_WAVE_FUNC_RETURN used for managing the EXEC mask. SI_SETUP_WHOLE_WAVE_FUNC will return a SReg_1 representing `%active`, which needs to be passed into SI_WHOLE_WAVE_FUNC_RETURN. * SelectionDAG support for generating these 2 new pseudos and the special handling of %active. Since the return may be in a different basic block, it's difficult to add the virtual reg for %active to SI_WHOLE_WAVE_FUNC_RETURN, so we initially generate an IMPLICIT_DEF which is later replaced via a custom inserter. * Expansion of the 2 pseudos during prolog/epilog insertion. PEI also marks any used VGPRs as WWM registers, which are then spilled and restored with the usual logic. Future patches will include the `llvm.amdgcn.call.whole.wave` intrinsic and a lot of optimization work (especially in order to reduce spills around function calls). 
--------- Co-authored-by: Matt Arsenault <Matthew.Arsenault@amd.com> Co-authored-by: Shilei Tian <i@tianshilei.me>	2025-07-21 10:39:09 +02:00
Jeffrey Byrnes	695660cdfd	[AMDGPU] Provide control to force VGPR MFMA form (#148079 ) This gives an override to the user to force select VGPR form of MFMA. Eventually we will drop this in favor of compiler making better decisions, but this provides a mechanism for users to address the cases where MayNeedAGPRs favors the AGPR form and performance is degraded due to poor RA.	2025-07-18 13:53:17 -07:00
Kazu Hirata	7da8f7394f	[AMDGPU] Remove an unnecessary cast (NFC) (#148868 ) STI is already of const GCNSubtarget *.	2025-07-15 20:47:31 -07:00
Diana Picus	a201f8872a	[AMDGPU] Replace dynamic VGPR feature with attribute (#133444 ) Use a function attribute (amdgpu-dynamic-vgpr) instead of a subtarget feature, as requested in #130030.	2025-06-24 11:09:36 +02:00
Kazu Hirata	1380a8259e	[AMDGPU] Use llvm::find and llvm::find_if (NFC) (#135582 )	2025-04-13 23:46:57 -07:00
Diana Picus	72c3c30452	[AMDGPU] Allocate scratch space for dVGPRs for CWSR (#130055 ) The CWSR trap handler needs to save and restore the VGPRs. When dynamic VGPRs are in use, the fixed function hardware will only allocate enough space for one VGPR block. The rest will have to be stored in scratch, at offset 0. This patch allocates the necessary space by: - generating a prologue that checks at runtime if we're on a compute queue (since CWSR only works on compute queues); for this we will have to check the ME_ID bits of the ID_HW_ID2 register - if that is non-zero, we can assume we're on a compute queue and initialize the SP and FP with enough room for the dynamic VGPRs - forcing all compute entry functions to use a FP so they can access their locals/spills correctly (this isn't ideal but it's the quickest to implement) Note that at the moment we allocate enough space for the theoretical maximum number of VGPRs that can be allocated dynamically (for blocks of 16 registers, this will be 128, of which we subtract the first 16, which are already allocated by the fixed function hardware). Future patches may decide to allocate less if they can prove the shader never allocates that many blocks. Also note that this should not affect any reported stack sizes (e.g. PAL backend_stack_size etc).	2025-03-19 13:49:19 +01:00
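The scratch reservation described in the commit above comes down to simple arithmetic. The following is a minimal Python sketch of that arithmetic only, not the backend's implementation. The block size of 16 and the maximum of 128 VGPRs come from the commit message; the 32-lane wave and the 4 bytes per lane are assumptions made for illustration, and all names are hypothetical.

```python
# Values taken from the commit message.
DVGPR_BLOCK_SIZE = 16      # VGPRs per dynamically allocated block
MAX_DYNAMIC_VGPRS = 128    # theoretical maximum for blocks of 16 registers

# Assumptions for illustration only (not stated in the commit message).
LANES_PER_WAVE = 32        # assumed wave size
BYTES_PER_LANE = 4         # assumed size of one 32-bit VGPR lane


def vgprs_saved_in_scratch():
    """VGPRs the trap handler must save in scratch.

    The first block is already handled by the fixed function hardware,
    so it is subtracted from the theoretical maximum.
    """
    return MAX_DYNAMIC_VGPRS - DVGPR_BLOCK_SIZE


def scratch_bytes_per_wave():
    """Bytes reserved at scratch offset 0 under the assumptions above."""
    return vgprs_saved_in_scratch() * LANES_PER_WAVE * BYTES_PER_LANE


if __name__ == "__main__":
    print(vgprs_saved_in_scratch())   # 112
    print(scratch_bytes_per_wave())   # 14336
```

Under these assumptions the reservation covers 112 VGPRs. The byte count changes if the wave size or block size differs.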
Kazu Hirata	aead088f02	[AMDGPU] Avoid repeated hash lookups (NFC) (#131419 )	2025-03-14 23:54:49 -07:00
Matt Arsenault	a216358ce7	AMDGPU: Replace amdgpu-no-agpr with amdgpu-agpr-alloc (#129893 ) This performs the minimal replacement of amdgpu-no-agpr to amdgpu-agpr-alloc=0. Most of the test diffs are due to the new attribute sorting later alphabetically. We could do better by trying to perform range merging in the attributor, and trying to pick non-0 values.	2025-03-06 09:17:51 +07:00
Matt Arsenault	ccad5e7744	AMDGPU: Respect amdgpu-no-agpr in functions and with calls (#128147 ) Remove the MIR scan to detect whether AGPRs are used or not, and the special case for callable functions. This behavior was confusing, and not overridable. The amdgpu-no-agpr attribute was intended to avoid this imprecise heuristic for how many AGPRs to allocate. It was also too confusing to make this interact with the pending amdgpu-num-agpr replacement for amdgpu-no-agpr. Also adds an xfail-ish test where the register allocator asserts after allocation fails which I ran into. Future work should reintroduce a more refined MIR scan to estimate AGPR pressure for how to split AGPRs and VGPRs.	2025-02-23 09:00:37 +07:00
Lucas Ramirez	6206f5444f	[AMDGPU] Occupancy w.r.t. workgroup size range is also a range (#123748 ) Occupancy (i.e., the number of waves per EU) depends, in addition to register usage, on per-workgroup LDS usage as well as on the range of possible workgroup sizes. Mirroring the latter, occupancy should therefore be expressed as a range since different group sizes generally yield different achievable occupancies. `getOccupancyWithLocalMemSize` currently returns a scalar occupancy based on the maximum workgroup size and LDS usage. With respect to the workgroup size range, this scalar can be the minimum, the maximum, or neither of the two of the range of achievable occupancies. This commit fixes the function by making it compute and return the range of achievable occupancies w.r.t. workgroup size and LDS usage; it also renames it to `getOccupancyWithWorkGroupSizes` since it is the range of workgroup sizes that produces the range of achievable occupancies. Computing the achievable occupancy range is surprisingly involved. Minimum/maximum workgroup sizes do not necessarily yield maximum/minimum occupancies i.e., sometimes workgroup sizes inside the range yield the occupancy bounds. The implementation finds these sizes in constant time; heavy documentation explains the rationale behind the sometimes relatively obscure calculations. As a justifying example, consider a target with 10 waves / EU, 4 EUs/CU, 64-wide waves. Also consider a function with no LDS usage and a flat workgroup size range of [513,1024]. - A group of 513 items requires 9 waves per group. Only 4 groups made up of 9 waves each can fit fully on a CU at any given time, for a total of 36 waves on the CU, or 9 per EU. However, filling as much as possible the remaining 40-36=4 wave slots without decreasing the number of groups reveals that a larger group of 640 items yields 40 waves on the CU, or 10 per EU. - Similarly, a group of 1024 items requires 16 waves per group. 
Only 2 groups made up of 16 waves each can fit fully on a CU at any given time, for a total of 32 waves on the CU, or 8 per EU. However, removing as many waves as possible from the groups without being able to fit another equal-sized group on the CU reveals that a smaller group of 896 items yields 28 waves on the CU, or 7 per EU. Therefore the achievable occupancy range for this function is not [8,9] as the group size bounds directly yield, but [7,10]. Naturally this change causes a lot of test churn as instruction scheduling is driven by achievable occupancy estimates. In most unit tests the flat workgroup size range is the default [1,1024] which, ignoring potential LDS limitations, would previously produce a scalar occupancy of 8 (derived from 1024) on a lot of targets, whereas we now consider the maximum occupancy to be 10 in such cases. Most tests are updated automatically and checked manually for sanity. I also manually changed some non-automatically generated assertions when necessary. Fixes #118220.	2025-01-23 16:07:57 +01:00
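The worked example in the commit message above can be checked by brute force. The following is a minimal Python sketch, not the constant-time algorithm the commit implements. It assumes only the parameters stated in the example (10 waves per EU, 4 EUs per CU, 64-wide waves, no LDS usage), ignores any hardware cap on workgroups per CU, and rounds waves per EU up. All function names are hypothetical.

```python
# Target parameters from the example in the commit message.
WAVE_SIZE = 64
EUS_PER_CU = 4
MAX_WAVES_PER_EU = 10


def ceil_div(a, b):
    """Integer division rounding up."""
    return -(-a // b)


def occupancy_for_size(group_size):
    """Waves per EU when the CU is filled with whole groups of this size."""
    waves_per_group = ceil_div(group_size, WAVE_SIZE)
    wave_slots = EUS_PER_CU * MAX_WAVES_PER_EU
    # Only whole groups fit on the CU (assumption: no other limit applies).
    groups_per_cu = wave_slots // waves_per_group
    waves_on_cu = groups_per_cu * waves_per_group
    return ceil_div(waves_on_cu, EUS_PER_CU)


def occupancy_range(min_size, max_size):
    """Brute-force (min, max) occupancy over every size in the range."""
    occupancies = [occupancy_for_size(s) for s in range(min_size, max_size + 1)]
    return min(occupancies), max(occupancies)


if __name__ == "__main__":
    print(occupancy_for_size(513))     # 9
    print(occupancy_for_size(1024))    # 8
    print(occupancy_range(513, 1024))  # (7, 10)
```

The range bounds give 9 and 8, while sizes inside the range (640 and 896 items) give the true bounds of 10 and 7, matching the [7,10] range derived in the commit message.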
Stanislav Mekhanoshin	21704a685d	[AMDGPU] Fix printing hasInitWholeWave in mir (#123232 )	2025-01-17 03:00:02 -08:00
Austin Kerbow	2e5c298281	[AMDGPU] Add backward compatibility layer for kernarg preloading (#119167 ) Add a prologue to the kernel entry to handle cases where code designed for kernarg preloading is executed on hardware equipped with incompatible firmware. If hardware has compatible firmware the 256 bytes at the start of the kernel entry will be skipped. This skipping is done automatically by hardware that supports the feature. A pass is added which is intended to be run at the very end of the pipeline to avoid any optimizations that would assume the prologue is a real predecessor block to the actual code start. In reality we have two possible entry points for the function. 1. The optimized path that supports kernarg preloading which begins at an offset of 256 bytes. 2. The backwards compatible entry point which starts at offset 0.	2025-01-10 11:39:02 -08:00
Ruiling, Song	67c55b1ffc	[AMDGPU] Make max dwords of memory cluster configurable (#119342 ) We find it helpful to increase the value for graphics workload. Make it configurable so we can experiment with a different value.	2024-12-18 14:17:27 +08:00
dyung	bc7e099aa8	Revert "[AMDGPU][MIR] Serialize NumPhysicalVGPRSpillLanes" (#115353 ) Reverts llvm/llvm-project#115291 Reverting due to test failures on many bots including https://lab.llvm.org/buildbot/#/builders/174/builds/8049	2024-11-07 13:02:51 -05:00
Akshat Oke	21835ee28d	[AMDGPU][MIR] Serialize NumPhysicalVGPRSpillLanes (#115291 )	2024-11-07 20:08:36 +05:30
Akshat Oke	3495d04560	[AMDGPU][MIR] Serialize SpillPhysVGPRs (#113129 )	2024-11-05 13:17:25 +05:30
Jay Foad	8d13e7b8c3	[AMDGPU] Qualify auto. NFC. (#110878 ) Generated automatically with: $ clang-tidy -fix -checks=-*,llvm-qualified-auto $(find lib/Target/AMDGPU/ -type f)	2024-10-03 13:07:54 +01:00
Christudasan Devadasan	ac0f64f06d	[AMDGPU] Split vgpr regalloc pipeline (#93526 ) Allocating wwm-registers and per-thread VGPR operands together imposes many challenges in the way the registers are reused during allocation. There are times when regalloc reuses the registers of regular VGPRs operations for wwm-operations in a small range leading to unwantedly clobbering their inactive lanes causing correctness issues that are hard to trace. This patch splits the VGPR allocation pipeline further to allocate wwm-registers first and the regular VGPR operands in a separate pipeline. The splitting would ensure that the physical registers used for wwm allocations won't take part in the next allocation pipeline to avoid any such clobbering.	2024-09-30 19:55:42 +05:30
Christudasan Devadasan	23487be490	[AMDGPU] Merge the conditions used for deciding CS spills for amdgpu_cs_chain[_preserve] (#109911 ) Multiple conditions exist to decide whether callee save spills/restores are required for amdgpu_cs_chain or amdgpu_cs_chain_preserve calling conventions. This patch consolidates them all and moves to a single place.	2024-09-26 10:50:00 +05:30
Jay Foad	e03f427196	[LLVM] Use {} instead of std::nullopt to initialize empty ArrayRef (#109133 ) It is almost always simpler to use {} instead of std::nullopt to initialize an empty ArrayRef. This patch changes all occurrences I could find in LLVM itself. In future the ArrayRef(std::nullopt_t) constructor could be deprecated or removed.	2024-09-19 16:16:38 +01:00
Christudasan Devadasan	a566635915	[AMDGPU] Move AMDGPUCodeGenPassBuilder into AMDGPUTargetMachine(NFC) (#103720 ) This will allow us to reuse the existing flags and the static functions while building the pipeline for new pass manager.	2024-08-19 20:32:55 +05:30
Jay Foad	63fae3ed65	[AMDGPU] clang-tidy: no else after return etc. NFC. (#99298 )	2024-07-17 21:11:00 +01:00
Jay Foad	c7309dadbf	[AMDGPU] Use range-based for loops. NFC. (#99047 )	2024-07-17 10:18:03 +01:00
Jay Foad	5e338f1f4a	[AMDGPU] clang-tidy: use emplace_back instead of push_back. NFC.	2024-07-17 08:27:35 +01:00
Jay Foad	0b43d573f5	[AMDGPU] clang-tidy: replace macro with enum. NFC.	2024-07-16 16:37:34 +01:00
Nicolai Hähnle	7e9b49f6b8	AMDGPU: Add plumbing for private segment size argument (#96445 ) The actual size of scratch/private is determined at dispatch time, so add more plumbing to request it. Will be used in subsequent change.	2024-06-25 16:20:51 +02:00
Nicolai Hähnle	d6c7410262	AMDGPU: Remove an outdated TODO (#96446 ) We have a fixed calling convention for stack pointer and frame pointer, we shouldn't try to shift anything around.	2024-06-25 16:20:22 +02:00
Jay Foad	5b18775145	[AMDGPU] Fix typo in #89773 Fixes #90281	2024-04-29 11:57:06 +01:00
Jay Foad	46163688e1	[AMDGPU] Allow WorkgroupID intrinsics in amdgpu_gfx functions (#89773 ) With GFX12 architected SGPRs the workgroup ids are trivially available in any function called from a compute entrypoint.	2024-04-24 09:35:40 +01:00
Matt Arsenault	b6b703b2df	AMDGPU: Infer no-agpr usage in AMDGPUAttributor (#85948 ) SIMachineFunctionInfo has a scan of the function body for inline asm which may use AGPRs, or callees in SIMachineFunctionInfo. Move this into the attributor, so it actually works interprocedurally. Could probably avoid most of the test churn if this bothered to avoid adding this on subtargets without AGPRs. We should also probably try to delete the MIR scan in usesAGPRs but it seems to be trickier to eliminate.	2024-03-21 14:24:06 +05:30
Jun Wang	c4e517f59c	[AMDGPU] Adding the amdgpu_num_work_groups function attribute (#79035 ) A new function attribute named amdgpu_num_work_groups is added. This attribute, which consists of three integers, allows programmers to let the compiler know the number of workgroups to be launched in each of the three dimensions and do optimizations based on that information. --------- Co-authored-by: Jun Wang <jun.wang7@amd.com>	2024-03-12 10:30:39 -07:00
Diana Picus	bc6955f18c	[AMDGPU] Don't fix the scavenge slot at offset 0 (#79136 ) At the moment, the emergency spill slot is a fixed object for entry functions and chain functions, and a regular stack object otherwise. This patch adopts the latter behaviour for entry/chain functions too. It seems this was always the intention [1] and it will also save us a bit of stack space in cases where the first stack object has a large alignment. [1] `34c8b835b1`	2024-02-09 09:20:25 +01:00
Christudasan Devadasan	230c13d59d	[AMDGPU] Pick available high VGPR for CSR SGPR spilling (#78669 ) CSR SGPR spilling currently uses the earliest available physical VGPRs. This imposes high register pressure when trying to allocate large VGPR tuples within the default register budget. This patch changes the spilling strategy by picking the VGPRs in reverse order, highest available VGPR first, and shifting them back to the lowest available range after regalloc. With that, the initial VGPRs are available for allocation, and the chance of finding a large number of contiguous registers increases.	2024-01-24 07:08:43 +05:30
Carl Ritson	5139299618	[AMDGPU] Track physical VGPRs used for SGPR spills (#75573 ) Physical VGPRs used for SGPR spills need to be tracked independent of WWM reserved registers. The WWM reserved set contains extra registers allocated during WWM pre-allocation pass. This causes SGPR spills allocated after WWM pre-allocation to overlap with WWM register usage, e.g. if frame pointer is spilt during prologue/epilog insertion.	2023-12-17 16:44:16 +09:00
Piotr Sobczak	fac093dd08	[AMDGPU] Update IEEE and DX10_CLAMP for GFX12 (#75030 ) Co-authored-by: Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com>	2023-12-13 13:52:40 +01:00

239 Commits