239 Commits

Author SHA1 Message Date
Jay Foad
0b49adc32c
[AMDGPU] Rename AMDGPUMachineFunction to AMDGPUMachineFunctionInfo. NFC. (#187276)
This is derived from MachineFunctionInfo not MachineFunction.
2026-03-18 20:29:47 +00:00
Diana Picus
f2e8e2faff
[AMDGPU] Make chain functions receive a stack pointer (#184616)
Currently, chain functions are free to set up a stack pointer if they
need one, and they assume they can start at scratch offset 0. This is
not correct if CWSR and dynamic VGPRs are both enabled, since in that
case we need to reserve an area at offset 0 for the trap handler, but
only when running on a compute queue (which we determine at runtime).
Rather than duplicate in every chain function the code sequence for
determining if/how much scratch space needs to be reserved, this patch
changes the ABI of chain functions so that they receive a stack pointer
from their caller.

Since chain functions can no longer use plain offsets to access their
own stack, we'll also need to allocate a frame pointer more often (and
sometimes also a base pointer). For simplicity, we use the same
registers that `amdgpu_gfx` functions do (s32, s33, s34). This may
change in the future. Chain functions never return to their caller and
thus don't need to preserve the frame or base pointer.

Another consequence is that now we might need to realign the stack in
some cases (since it no longer starts at the infinitely aligned 0).
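
The realignment point can be illustrated with a minimal sketch. The helper
name below is hypothetical and this is not the code the backend emits; it
only shows why an arbitrary incoming stack pointer may need rounding up,
whereas offset 0 never did.

```cpp
#include <cstdint>

// Hypothetical helper: round an incoming scratch offset up to a power-of-two
// alignment. Offset 0 satisfies every alignment, so no rounding was needed
// while chain functions could assume their stack started there.
constexpr uint64_t alignScratchOffset(uint64_t IncomingSP, uint64_t Alignment) {
  return (IncomingSP + Alignment - 1) & ~(Alignment - 1);
}
```

For example, an incoming offset of 448 with a 256-byte aligned object would
be rounded up to 512, while an incoming offset of 0 stays at 0.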
2026-03-06 11:01:42 +01:00
Dark Steve
9e6a6be8a8
[AMDGPU] Remove AMDGPUArgumentUsageInfo pass (#182490)
`AMDGPUArgumentUsageInfo` provided a per-function map that
`lowerFormalArguments` would write each function's implicit argument
register layout into, and `passSpecialInputs` would read back when
lowering calls to look up the callee's layout. This per-function map is
redundant for all non-entry callees, which already use the same
`FixedABIFunctionInfo` register layout.

GlobalISel already used `FixedABIFunctionInfo` unconditionally. This
change makes SelectionDAG do the same.
2026-02-23 18:47:01 +05:30
serge-sans-paille
85919fbfa4
[perf] Replace copy-assign by move-assign in llvm/lib/Target/AMDGPU/ (#179460) 2026-02-03 14:24:31 +00:00
paperchalice
62aa40a4dd
[AMDGPU] Remove NoSignedZerosFPMath uses (#178343)
`NoSignedZerosFPMath` is one of the global flags in `resetTargetOptions`;
users should use `nsz` instead.

`fneg_fadd_0_f64` from `AMDGPU/fneg-combines.new.ll` regresses when
`fadd` is annotated with `nsz`.
2026-01-30 09:18:40 +08:00
Shilei Tian
4b1cfc5d7c
[NFCI][AMDGPU] Final touch before moving to GET_SUBTARGETINFO_MACRO (#177401) 2026-01-22 17:33:17 +00:00
Matt Arsenault
9568772187
AMDGPU: Select VGPR MFMAs by default (#159493)
AGPRs are undesirable since they are only usable by a
handful of instructions like loads, stores and mfmas and everything
else requires copies to/from VGPRs. Using the AGPR form should be
a measure of last resort if we must use more than 256 VGPRs.
2026-01-22 13:41:25 +00:00
tyb0807
29d1e1857d
[AMDGPU] Enable serializing of allocated preload kernarg SGPRs info (#168374)
- Support serialization of the number of allocated preload kernarg SGPRs
- Support serialization of the first preload kernarg SGPR allocated

Together they enable correctly reconstructing MIR with preload kernarg
SGPRs.
2025-11-22 14:03:14 -08:00
Matt Arsenault
476a6ea957
AMDGPU: Track minNumAGPRs in MFI instead of mayUseAGPRs (#161996)
Fix mfma agpr allocation failures with -O0. Previously we were getting
lucky on cases that can use AV registers with the normal optimization
pipeline.

This logic needs to be consistent with getMaxNumVectorRegs,
as that is what getReservedRegs uses to determine the AGPR budget. In the
future we should directly check the minimum AGPR budget, and individual
selection patterns need to know the minimum budget required for them.

Start accounting for the number of AGPRs required to perform the
allocation. Refine the selection predicates to check this number is
available, and default to selecting the VGPR case if there aren't
enough. This also avoids register allocation failures for the largest
MFMAs with the default register budget.
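
As a rough sketch of the refined predicate described above (the enum and
function names are hypothetical, not the names used in the patch): the AGPR
form is only chosen when the function's minimum AGPR budget covers what the
instruction needs, and the VGPR form is the fallback.

```cpp
// Hypothetical model of the two MFMA selection forms.
enum class MFMAForm { VGPR, AGPR };

// Hypothetical predicate: select the AGPR form only if the tracked minimum
// number of AGPRs is enough for this instruction, otherwise fall back to the
// VGPR form.
constexpr MFMAForm selectMFMAForm(unsigned MinNumAGPRs, unsigned RequiredAGPRs) {
  return MinNumAGPRs >= RequiredAGPRs ? MFMAForm::AGPR : MFMAForm::VGPR;
}
```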
2025-10-07 08:48:45 +09:00
Shilei Tian
8122ccdca9
[AMDGPU] Set TGID_EN_X/Y/Z when cluster ID intrinsics are used (#159120)
Hardware initializes a single value in ttmp9 which is either the
workgroup ID X or cluster ID X. Most of this patch is a refactoring to
use a single `PreloadedValue` enumerator for this value, instead of two
enumerators `WORKGROUP_ID_X` and `CLUSTER_ID_X` referring to the same
value.

This makes it simpler to have a single attribute
`amdgpu-no-workgroup-id-x` indicating that this value is not used, which
in turn sets the TGID_EN_X bit appropriately to tell the hardware
whether to initialize it.

All of the above applies to Y and Z similarly.

Fixes: LWPSCGFX13-568

Co-authored-by: Jay Foad <jay.foad@amd.com>
2025-09-16 15:37:01 -04:00
Shilei Tian
1180c2ced0
[AMDGPU] Support lowering of cluster related intrinsics (#157978)
Since much of the code is connected, this also changes how workgroup id is lowered.

Co-authored-by: Jay Foad <jay.foad@amd.com>
Co-authored-by: Ivan Kosarev <ivan.kosarev@amd.com>
2025-09-12 21:11:17 -04:00
Stanislav Mekhanoshin
d267fac3bc
[AMDGPU] Use subtarget call to determine number of VGPRs (#157927)
Since the register file was increased, it is no longer valid to
call VGPR_32RegClass.getNumRegs() to get the total number of arch
registers available on a subtarget.

Fixes: SWDEV-550425
2025-09-11 00:39:56 -07:00
Matt Arsenault
0a0f077b94
AMDGPU: Add missing static to cl::opt (#152747) 2025-08-09 08:33:09 +09:00
Diana Picus
a910a6a8b5
[AMDGPU] AsmPrinter: Unify arg handling (#151672)
When computing the number of registers required by entry functions, the
`AMDGPUAsmPrinter` needs to take into account both the register usage
computed by the `AMDGPUResourceUsageAnalysis` pass, and the number
of registers initialized by the hardware. At the moment, the way it
computes the latter is different for graphics vs compute, due to differences in
the implementation. For kernels, all the information needed is available in
the `SIMachineFunctionInfo`, but for graphics shaders we would iterate over
the `Function` arguments in the `AMDGPUAsmPrinter`. This pretty much
repeats some of the logic from instruction selection.

This patch introduces 2 new members to `SIMachineFunctionInfo`, one
for SGPRs and one for VGPRs. Both will be computed during instruction
selection and then used during `AMDGPUAsmPrinter`, removing the need
to refer to the `Function` when printing assembly.

This patch is NFC except for the fact that we now add the extra SGPRs
(VCC, XNACK etc) to the number of SGPRs computed for graphics entry points.
I'm not sure why these weren't included before. It would be nice if
someone could confirm if that was just an oversight or if we have some docs
somewhere that I haven't managed to find. Only one test is affected (its SGPR
usage increases because we now take into account the XNACK registers).
2025-08-08 12:00:37 +02:00
Matt Arsenault
2b1ce25e21
AMDGPU: Fix -amdgpu-mfma-vgpr-form flag on gfx908 (#150599)
This should be ignored since there are no VGPR forms. This
makes it possible to flip the default for the flag to true.
2025-07-25 19:49:56 +09:00
Diana Picus
20d8398825
[AMDGPU] ISel & PEI for whole wave functions (#145858)
Whole wave functions are functions that will run with a full EXEC mask.
They will not be invoked directly, but instead will be launched by way
of a new intrinsic, `llvm.amdgcn.call.whole.wave` (to be added in
a future patch). These functions are meant as an alternative to the
`llvm.amdgcn.init.whole.wave` or `llvm.amdgcn.strict.wwm` intrinsics.

Whole wave functions will set EXEC to -1 in the prologue and restore the
original value of EXEC in the epilogue. They must have a special first
argument, `i1 %active`, that is going to be mapped to EXEC. They may
have either the default calling convention or amdgpu_gfx. The inactive
lanes need to be preserved for all registers used, active lanes only for
the CSRs.

At the IR level, arguments to a whole wave function (other than
`%active`) contain poison in their inactive lanes. Likewise, the return
value for the inactive lanes is poison.

This patch contains the following work:
* 2 new pseudos, SI_SETUP_WHOLE_WAVE_FUNC and SI_WHOLE_WAVE_FUNC_RETURN
  used for managing the EXEC mask. SI_SETUP_WHOLE_WAVE_FUNC will return
  a SReg_1 representing `%active`, which needs to be passed into
  SI_WHOLE_WAVE_FUNC_RETURN.
* SelectionDAG support for generating these 2 new pseudos and the
  special handling of %active. Since the return may be in a different
  basic block, it's difficult to add the virtual reg for %active to
  SI_WHOLE_WAVE_FUNC_RETURN, so we initially generate an IMPLICIT_DEF
  which is later replaced via a custom inserter.
* Expansion of the 2 pseudos during prolog/epilog insertion. PEI also
  marks any used VGPRs as WWM registers, which are then spilled and
  restored with the usual logic.

Future patches will include the `llvm.amdgcn.call.whole.wave` intrinsic
and a lot of optimization work (especially in order to reduce spills
around function calls).
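
The EXEC handling described above can be modelled with a small sketch. The
`Wave` struct and both function names are hypothetical stand-ins for the
two pseudos; this only mirrors the save/set/restore behaviour, not the
actual expansion done in PEI.

```cpp
#include <cstdint>

// Hypothetical model of a wave's execution mask (one bit per lane).
struct Wave {
  uint64_t Exec;
};

// Models SI_SETUP_WHOLE_WAVE_FUNC: returns the original EXEC (the value that
// `%active` is mapped to) and enables all lanes.
inline uint64_t setupWholeWave(Wave &W) {
  uint64_t Active = W.Exec;
  W.Exec = ~uint64_t{0}; // EXEC = -1
  return Active;
}

// Models SI_WHOLE_WAVE_FUNC_RETURN: restores the EXEC value saved in the
// prologue, which has to be passed in explicitly.
inline void wholeWaveReturn(Wave &W, uint64_t Active) { W.Exec = Active; }
```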

---------

Co-authored-by: Matt Arsenault <Matthew.Arsenault@amd.com>
Co-authored-by: Shilei Tian <i@tianshilei.me>
2025-07-21 10:39:09 +02:00
Jeffrey Byrnes
695660cdfd
[AMDGPU] Provide control to force VGPR MFMA form (#148079)
This gives the user an override to force selection of the VGPR form of MFMA.
Eventually we will drop this in favor of the compiler making better
decisions, but this provides a mechanism for users to address the cases
where MayNeedAGPRs favors the AGPR form and performance is degraded due
to poor RA.
2025-07-18 13:53:17 -07:00
Kazu Hirata
7da8f7394f
[AMDGPU] Remove an unnecessary cast (NFC) (#148868)
STI is already of const GCNSubtarget *.
2025-07-15 20:47:31 -07:00
Diana Picus
a201f8872a
[AMDGPU] Replace dynamic VGPR feature with attribute (#133444)
Use a function attribute (amdgpu-dynamic-vgpr) instead of a subtarget
feature, as requested in #130030.
2025-06-24 11:09:36 +02:00
Kazu Hirata
1380a8259e
[AMDGPU] Use llvm::find and llvm::find_if (NFC) (#135582) 2025-04-13 23:46:57 -07:00
Diana Picus
72c3c30452
[AMDGPU] Allocate scratch space for dVGPRs for CWSR (#130055)
The CWSR trap handler needs to save and restore the VGPRs. When dynamic
VGPRs are in use, the fixed function hardware will only allocate enough
space for one VGPR block. The rest will have to be stored in scratch, at
offset 0.

This patch allocates the necessary space by:
- generating a prologue that checks at runtime if we're on a compute
queue (since CWSR only works on compute queues); for this we will have
to check the ME_ID bits of the ID_HW_ID2 register - if that is non-zero,
we can assume we're on a compute queue and initialize the SP and FP with
enough room for the dynamic VGPRs
- forcing all compute entry functions to use a FP so they can access
their locals/spills correctly (this isn't ideal but it's the quickest to
implement)

Note that at the moment we allocate enough space for the theoretical
maximum number of VGPRs that can be allocated dynamically (for blocks of
16 registers, this will be 128, of which we subtract the first 16, which
are already allocated by the fixed function hardware). Future patches
may decide to allocate less if they can prove the shader never allocates
that many blocks.
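
The arithmetic in the note above can be written out as a short sketch. The
function names are hypothetical, the maximum of 8 blocks is an assumption
derived from 128 / 16, and the 4 bytes per register per lane reflects a
32-bit VGPR; the real allocation may account for further details.

```cpp
// Number of VGPRs the trap handler has to store in scratch: the theoretical
// maximum minus the first block, which the fixed function hardware already
// allocates.
constexpr unsigned dynamicVGPRsToSave(unsigned BlockSize, unsigned MaxBlocks) {
  return BlockSize * MaxBlocks - BlockSize;
}

// Scratch bytes needed per lane, assuming each VGPR holds 32 bits per lane.
constexpr unsigned scratchBytesPerLane(unsigned BlockSize, unsigned MaxBlocks) {
  return dynamicVGPRsToSave(BlockSize, MaxBlocks) * 4;
}
```

With blocks of 16 registers and 8 blocks this gives 128 - 16 = 112 VGPRs,
i.e. 448 bytes per lane.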

Also note that this should not affect any reported stack sizes (e.g. PAL
backend_stack_size etc).
2025-03-19 13:49:19 +01:00
Kazu Hirata
aead088f02
[AMDGPU] Avoid repeated hash lookups (NFC) (#131419) 2025-03-14 23:54:49 -07:00
Matt Arsenault
a216358ce7
AMDGPU: Replace amdgpu-no-agpr with amdgpu-agpr-alloc (#129893)
This performs the minimal replacement of amdgpu-no-agpr to
amdgpu-agpr-alloc=0. Most of the test diffs are due to the new
attribute sorting later alphabetically.

We could do better by trying to perform range merging in the attributor,
and trying to pick non-0 values.
2025-03-06 09:17:51 +07:00
Matt Arsenault
ccad5e7744
AMDGPU: Respect amdgpu-no-agpr in functions and with calls (#128147)
Remove the MIR scan to detect whether AGPRs are used or not,
and the special case for callable functions. This behavior was
confusing, and not overridable. The amdgpu-no-agpr attribute was
intended to avoid this imprecise heuristic for how many AGPRs to
allocate. It was also too confusing to make this interact with
the pending amdgpu-num-agpr replacement for amdgpu-no-agpr.

Also adds an xfail-ish test where the register allocator asserts
after allocation fails which I ran into.

Future work should reintroduce a more refined MIR scan to estimate
AGPR pressure for how to split AGPRs and VGPRs.
2025-02-23 09:00:37 +07:00
Lucas Ramirez
6206f5444f
[AMDGPU] Occupancy w.r.t. workgroup size range is also a range (#123748)
Occupancy (i.e., the number of waves per EU) depends, in addition to
register usage, on per-workgroup LDS usage as well as on the range of
possible workgroup sizes. Mirroring the latter, occupancy should
therefore be expressed as a range since different group sizes generally
yield different achievable occupancies.

`getOccupancyWithLocalMemSize` currently returns a scalar occupancy
based on the maximum workgroup size and LDS usage. With respect to the
workgroup size range, this scalar can be the minimum, the maximum, or
neither of the two of the range of achievable occupancies. This commit
fixes the function by making it compute and return the range of
achievable occupancies w.r.t. workgroup size and LDS usage; it also
renames it to `getOccupancyWithWorkGroupSizes` since it is the range of
workgroup sizes that produces the range of achievable occupancies.

Computing the achievable occupancy range is surprisingly involved.
Minimum/maximum workgroup sizes do not necessarily yield maximum/minimum
occupancies i.e., sometimes workgroup sizes inside the range yield the
occupancy bounds. The implementation finds these sizes in constant time;
heavy documentation explains the rationale behind the sometimes
relatively obscure calculations.

As a justifying example, consider a target with 10 waves / EU, 4 EUs/CU,
64-wide waves. Also consider a function with no LDS usage and a flat
workgroup size range of [513,1024].

- A group of 513 items requires 9 waves per group. Only 4 groups made up
of 9 waves each can fit fully on a CU at any given time, for a total of
36 waves on the CU, or 9 per EU. However, filling as much as possible
the remaining 40-36=4 wave slots without decreasing the number of groups
reveals that a larger group of 640 items yields 40 waves on the CU, or
10 per EU.
- Similarly, a group of 1024 items requires 16 waves per group. Only 2
groups made up of 16 waves each can fit fully on a CU at any given time,
for a total of 32 waves on the CU, or 8 per EU. However, removing as
many waves as possible from the groups without being able to fit another
equal-sized group on the CU reveals that a smaller group of 896 items
yields 28 waves on the CU, or 7 per EU.

Therefore the achievable occupancy range for this function is not [8,9]
as the group size bounds directly yield, but [7,10].
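
The example can be checked with a brute-force sketch. This is not the
constant-time computation the commit implements; it simply scans every
group size in the range. All names are hypothetical, LDS limits are
ignored, and rounding waves per EU up is an assumption (rounding down
gives the same bounds for this range).

```cpp
#include <algorithm>
#include <utility>

// Target parameters taken from the example above.
constexpr unsigned WavesPerEU = 10, EUsPerCU = 4, WaveSize = 64;

constexpr unsigned divideCeil(unsigned A, unsigned B) { return (A + B - 1) / B; }

// Waves per EU achieved when every group on the CU has GroupSize work-items.
constexpr unsigned occupancyForGroupSize(unsigned GroupSize) {
  unsigned WavesPerGroup = divideCeil(GroupSize, WaveSize);
  // Only whole groups fit on the CU.
  unsigned GroupsPerCU = (WavesPerEU * EUsPerCU) / WavesPerGroup;
  return divideCeil(GroupsPerCU * WavesPerGroup, EUsPerCU);
}

// Scan every size in [MinSize, MaxSize] and return {min, max} occupancy.
inline std::pair<unsigned, unsigned> occupancyRange(unsigned MinSize,
                                                    unsigned MaxSize) {
  unsigned Lo = WavesPerEU, Hi = 0;
  for (unsigned Size = MinSize; Size <= MaxSize; ++Size) {
    unsigned Occ = occupancyForGroupSize(Size);
    Lo = std::min(Lo, Occ);
    Hi = std::max(Hi, Occ);
  }
  return {Lo, Hi};
}
```

Under these assumptions, `occupancyRange(513, 1024)` yields the pair (7, 10),
with the bounds reached at 896 and 640 items respectively.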

Naturally this change causes a lot of test churn as instruction
scheduling is driven by achievable occupancy estimates. In most unit
tests the flat workgroup size range is the default [1,1024] which,
ignoring potential LDS limitations, would previously produce a scalar
occupancy of 8 (derived from 1024) on a lot of targets, whereas we now
consider the maximum occupancy to be 10 in such cases. Most tests are
updated automatically and checked manually for sanity. I also manually
changed some non-automatically generated assertions when necessary.

Fixes #118220.
2025-01-23 16:07:57 +01:00
Stanislav Mekhanoshin
21704a685d
[AMDGPU] Fix printing hasInitWholeWave in mir (#123232) 2025-01-17 03:00:02 -08:00
Austin Kerbow
2e5c298281
[AMDGPU] Add backward compatibility layer for kernarg preloading (#119167)
Add a prologue to the kernel entry to handle cases where code designed
for kernarg preloading is executed on hardware equipped with
incompatible firmware. If hardware has compatible firmware the 256 bytes
at the start of the kernel entry will be skipped. This skipping is done
automatically by hardware that supports the feature.

A pass is added which is intended to be run at the very end of the
pipeline to avoid any optimizations that would assume the prologue is a
real predecessor block to the actual code start. In reality we have two
possible entry points for the function. 1. The optimized path that
supports kernarg preloading which begins at an offset of 256 bytes. 2.
The backwards compatible entry point which starts at offset 0.
2025-01-10 11:39:02 -08:00
Ruiling, Song
67c55b1ffc
[AMDGPU] Make max dwords of memory cluster configurable (#119342)
We find it helpful to increase the value for graphics workloads. Make it
configurable so we can experiment with a different value.
2024-12-18 14:17:27 +08:00
dyung
bc7e099aa8
Revert "[AMDGPU][MIR] Serialize NumPhysicalVGPRSpillLanes" (#115353)
Reverts llvm/llvm-project#115291

Reverting due to test failures on many bots including
https://lab.llvm.org/buildbot/#/builders/174/builds/8049
2024-11-07 13:02:51 -05:00
Akshat Oke
21835ee28d
[AMDGPU][MIR] Serialize NumPhysicalVGPRSpillLanes (#115291) 2024-11-07 20:08:36 +05:30
Akshat Oke
3495d04560
[AMDGPU][MIR] Serialize SpillPhysVGPRs (#113129) 2024-11-05 13:17:25 +05:30
Jay Foad
8d13e7b8c3
[AMDGPU] Qualify auto. NFC. (#110878)
Generated automatically with:
$ clang-tidy -fix -checks=-*,llvm-qualified-auto $(find
lib/Target/AMDGPU/ -type f)
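
For illustration, the kind of rewrite this check performs looks roughly
like the following. The functions are made up for the example and do not
come from the AMDGPU sources.

```cpp
// Before: plain `auto` hides that P is a pointer to const.
inline int firstBefore(const int *Data) {
  auto P = Data;
  return *P;
}

// After: the pointer and const qualification are spelled out, while the
// pointee type is still deduced.
inline int firstAfter(const int *Data) {
  const auto *P = Data;
  return *P;
}
```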
2024-10-03 13:07:54 +01:00
Christudasan Devadasan
ac0f64f06d
[AMDGPU] Split vgpr regalloc pipeline (#93526)
Allocating wwm-registers and per-thread VGPR operands
together imposes many challenges in the way the
registers are reused during allocation. There are
times when regalloc reuses the registers of regular
VGPR operations for wwm-operations in a small range,
unintentionally clobbering their inactive lanes and
causing correctness issues that are hard to trace.

This patch splits the VGPR allocation pipeline further
to allocate wwm-registers first and the regular VGPR
operands in a separate pipeline. The splitting would
ensure that the physical registers used for wwm
allocations won't take part in the next allocation
pipeline to avoid any such clobbering.
2024-09-30 19:55:42 +05:30
Christudasan Devadasan
23487be490
[AMDGPU] Merge the conditions used for deciding CS spills for amdgpu_cs_chain[_preserve] (#109911)
Multiple conditions exist to decide whether callee save spills/restores
are required for amdgpu_cs_chain or amdgpu_cs_chain_preserve calling
conventions. This patch consolidates them all and moves to a single
place.
2024-09-26 10:50:00 +05:30
Jay Foad
e03f427196
[LLVM] Use {} instead of std::nullopt to initialize empty ArrayRef (#109133)
It is almost always simpler to use {} instead of std::nullopt to
initialize an empty ArrayRef. This patch changes all occurrences I could
find in LLVM itself. In future the ArrayRef(std::nullopt_t) constructor
could be deprecated or removed.
2024-09-19 16:16:38 +01:00
Christudasan Devadasan
a566635915
[AMDGPU] Move AMDGPUCodeGenPassBuilder into AMDGPUTargetMachine(NFC) (#103720)
This will allow us to reuse the existing flags and the static
functions while building the pipeline for new pass manager.
2024-08-19 20:32:55 +05:30
Jay Foad
63fae3ed65
[AMDGPU] clang-tidy: no else after return etc. NFC. (#99298) 2024-07-17 21:11:00 +01:00
Jay Foad
c7309dadbf
[AMDGPU] Use range-based for loops. NFC. (#99047) 2024-07-17 10:18:03 +01:00
Jay Foad
5e338f1f4a [AMDGPU] clang-tidy: use emplace_back instead of push_back. NFC. 2024-07-17 08:27:35 +01:00
Jay Foad
0b43d573f5 [AMDGPU] clang-tidy: replace macro with enum. NFC. 2024-07-16 16:37:34 +01:00
Nicolai Hähnle
7e9b49f6b8
AMDGPU: Add plumbing for private segment size argument (#96445)
The actual size of scratch/private is determined at dispatch time, so
add more plumbing to request it. Will be used in subsequent change.
2024-06-25 16:20:51 +02:00
Nicolai Hähnle
d6c7410262
AMDGPU: Remove an outdated TODO (#96446)
We have a fixed calling convention for stack pointer and frame pointer,
we shouldn't try to shift anything around.
2024-06-25 16:20:22 +02:00
Jay Foad
5b18775145 [AMDGPU] Fix typo in #89773
Fixes #90281
2024-04-29 11:57:06 +01:00
Jay Foad
46163688e1
[AMDGPU] Allow WorkgroupID intrinsics in amdgpu_gfx functions (#89773)
With GFX12 architected SGPRs the workgroup ids are trivially available
in any function called from a compute entrypoint.
2024-04-24 09:35:40 +01:00
Matt Arsenault
b6b703b2df
AMDGPU: Infer no-agpr usage in AMDGPUAttributor (#85948)
SIMachineFunctionInfo has a scan of the function body for inline asm
which may use AGPRs, or callees in SIMachineFunctionInfo. Move this
into the attributor, so it actually works interprocedurally.
    
Could probably avoid most of the test churn if this bothered to avoid
adding this on subtargets without AGPRs. We should also probably
try to delete the MIR scan in usesAGPRs but it seems to be trickier
to eliminate.
2024-03-21 14:24:06 +05:30
Jun Wang
c4e517f59c
[AMDGPU] Adding the amdgpu_num_work_groups function attribute (#79035)
A new function attribute named amdgpu_num_work_groups is added. This
attribute, which consists of three integers, allows programmers to let
the compiler know the number of workgroups to be launched in each of the
three dimensions and do optimizations based on that information.

---------

Co-authored-by: Jun Wang <jun.wang7@amd.com>
2024-03-12 10:30:39 -07:00
Diana Picus
bc6955f18c
[AMDGPU] Don't fix the scavenge slot at offset 0 (#79136)
At the moment, the emergency spill slot is a fixed object for entry
functions and chain functions, and a regular stack object otherwise.
This patch adopts the latter behaviour for entry/chain functions too. It
seems this was always the intention [1] and it will also save us a bit
of stack space in cases where the first stack object has a large
alignment.

[1]
34c8b835b1
2024-02-09 09:20:25 +01:00
Christudasan Devadasan
230c13d59d
[AMDGPU] Pick available high VGPR for CSR SGPR spilling (#78669)
CSR SGPR spilling currently uses the early available physical VGPRs. It
currently imposes a high register pressure while trying to allocate
large VGPR tuples within the default register budget.

This patch changes the spilling strategy by picking the VGPRs in
reverse order, highest available VGPR first, and shifting them back to
the lowest available range after regalloc. That leaves the initial
VGPRs available for allocation and makes it more likely that a large
number of contiguous registers can be found.
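
A minimal sketch of the "highest available first" choice, using a
hypothetical helper over a plain in-use bitmap rather than the backend's
register data structures. The later step of shifting the chosen registers
back down after regalloc is not modelled here.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical helper: return the index of the highest register that is not
// in use, scanning from the top of the file downwards, or std::nullopt if
// every register is taken.
inline std::optional<std::size_t>
pickHighestFreeVGPR(const std::vector<bool> &InUse) {
  for (std::size_t I = InUse.size(); I-- > 0;)
    if (!InUse[I])
      return I;
  return std::nullopt;
}
```

Picking from the top keeps the low, contiguous part of the register file
free for large VGPR tuples during allocation.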
2024-01-24 07:08:43 +05:30
Carl Ritson
5139299618
[AMDGPU] Track physical VGPRs used for SGPR spills (#75573)
Physical VGPRs used for SGPR spills need to be tracked independent of
WWM reserved registers. The WWM reserved set contains extra registers
allocated during WWM pre-allocation pass.

This causes SGPR spills allocated after WWM pre-allocation to overlap
with WWM register usage, e.g. if frame pointer is spilt during
prologue/epilog insertion.
2023-12-17 16:44:16 +09:00
Piotr Sobczak
fac093dd08
[AMDGPU] Update IEEE and DX10_CLAMP for GFX12 (#75030)
Co-authored-by: Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com>
2023-12-13 13:52:40 +01:00