239 Commits

Author SHA1 Message Date
macurtis-amd
d28bb1fd78
[AMDGPU] Ensure non-reserved CSR spilled regs are live-in (#146427)
Fixes:
```
*** Bad machine code: Using an undefined physical register ***
- function:    widget
- basic block: %bb.0 bb (0x564092cbe140)
- instruction: $vgpr63 = V_ACCVGPR_READ_B32_e64 killed $agpr13, implicit $exec
- operand 1:   killed $agpr13
LLVM ERROR: Found 1 machine code errors.
```

The detailed sequence of events that led to this assert:

1. MachineVerifier fails because `$agpr13` is not defined on line 19
below:

    ```
    1: bb.0.bb:
    2:   successors: %bb.1(0x80000000); %bb.1(100.00%)
    3:   liveins: $agpr14, $agpr15, $sgpr12, $sgpr13, $sgpr14, \
    4:       $sgpr15, $sgpr30, $sgpr31, $sgpr34, $sgpr35, \
    5:       $sgpr36, $sgpr37, $sgpr38, $sgpr39, $sgpr48, \
    6:       $sgpr49, $sgpr50, $sgpr51, $sgpr52, $sgpr53, \
    7:       $sgpr54, $sgpr55, $sgpr64, $sgpr65, $sgpr66, \
    8:       $sgpr67, $sgpr68, $sgpr69, $sgpr70, $sgpr71, \
    9:       $sgpr80, $sgpr81, $sgpr82, $sgpr83, $sgpr84, \
   10:       $sgpr85, $sgpr86, $sgpr87, $sgpr96, $sgpr97, \
   11:       $sgpr98, $sgpr99, $vgpr0, $vgpr31, $vgpr40, $vgpr41, \
   12:       $sgpr4_sgpr5, $sgpr6_sgpr7, $sgpr8_sgpr9, \
   13:       $sgpr10_sgpr11
   14:   $sgpr16 = COPY $sgpr33
   15:   $sgpr33 = frame-setup COPY $sgpr32
   16:   $sgpr18_sgpr19 = S_XOR_SAVEEXEC_B64 -1, \
   17:       implicit-def $exec, implicit-def dead $scc, \
   18:       implicit $exec
   19:   $vgpr63 = V_ACCVGPR_READ_B32_e64 killed $agpr13, \
   20:       implicit $exec
   21:   BUFFER_STORE_DWORD_OFFSET killed $vgpr63, \
   22:       $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr33, 0, 0, 0, \
   23:       implicit $exec :: (store (s32) into %stack.38, \
   24:       addrspace 5)
   25:   ...
   26:   $vgpr43 = IMPLICIT_DEF
   27:   $vgpr43 = SI_SPILL_S32_TO_VGPR $sgpr15, 0, \
   28:       killed $vgpr43(tied-def 0)
   29:   $vgpr43 = SI_SPILL_S32_TO_VGPR $sgpr14, 1, \
   30:       killed $vgpr43(tied-def 0)
   31:   $sgpr100_sgpr101 = S_OR_SAVEEXEC_B64 -1, \
   32:       implicit-def $exec, implicit-def dead $scc, \
   33:       implicit $exec
   34:   renamable $agpr13 = COPY killed $vgpr43, implicit $exec
   ```

2. That instruction is created by
[`emitCSRSpillStores`](d599bdeaa4/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp (L977))
(called by
[`SIFrameLowering::emitPrologue`](d599bdeaa4/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp (L1122)))
because `$agpr13` is in `WWMSpills`.

   See lines 982, 998, and 993 below.

    ```
    977:   // Spill Whole-Wave Mode VGPRs. Save only the inactive lanes of the scratch
    978:   // registers. However, save all lanes of callee-saved VGPRs. Due to this, we
    979:   // might end up flipping the EXEC bits twice.
    980:   Register ScratchExecCopy;
    981:   SmallVector<std::pair<Register, int>, 2> WWMCalleeSavedRegs, WWMScratchRegs;
    982:   FuncInfo->splitWWMSpillRegisters(MF, WWMCalleeSavedRegs, WWMScratchRegs);
    983:   if (!WWMScratchRegs.empty())
    984:     ScratchExecCopy =
    985:         buildScratchExecCopy(LiveUnits, MF, MBB, MBBI, DL,
    986:                              /*IsProlog*/ true, /*EnableInactiveLanes*/ true);
    987:
    988:   auto StoreWWMRegisters =
    989:       [&](SmallVectorImpl<std::pair<Register, int>> &WWMRegs) {
    990:         for (const auto &Reg : WWMRegs) {
    991:           Register VGPR = Reg.first;
    992:           int FI = Reg.second;
    993:           buildPrologSpill(ST, TRI, *FuncInfo, LiveUnits, MF, MBB, MBBI, DL,
    994:                            VGPR, FI, FrameReg);
    995:         }
    996:       };
    997:
    998:   StoreWWMRegisters(WWMScratchRegs);
    ```

3. `$agpr13` got added to `WWMSpills` by
[`SILowerWWMCopies::run`](59a7185dd9/llvm/lib/Target/AMDGPU/SILowerWWMCopies.cpp (L137))
as it processed the `WWM_COPY` on line 3 below (corresponding to line 34
above in point 1):

    ```
    1: %45:vgpr_32 = SI_SPILL_S32_TO_VGPR $sgpr15, 0, %45:vgpr_32(tied-def 0)
    2: %45:vgpr_32 = SI_SPILL_S32_TO_VGPR $sgpr14, 1, %45:vgpr_32(tied-def 0)
    3: %44:av_32 = WWM_COPY %45:vgpr_32
    ```
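
A minimal sketch of the invariant the fix enforces (not the actual diff; `EntryMBB` and the loop placement are illustrative): a non-reserved CSR that the prologue spills must be a live-in of the entry block, otherwise the verifier sees an undefined physical register use.

```cpp
// Illustrative sketch: make each spilled, non-reserved CSR a live-in of
// the entry block so the prologue's reads of it are defined.
for (const CalleeSavedInfo &CSI : MF.getFrameInfo().getCalleeSavedInfo()) {
  MCRegister Reg = CSI.getReg();
  if (!MF.getRegInfo().isReserved(Reg) && !EntryMBB.isLiveIn(Reg))
    EntryMBB.addLiveIn(Reg);
}
```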
2025-08-01 11:54:25 -05:00
Diana Picus
20d8398825
[AMDGPU] ISel & PEI for whole wave functions (#145858)
Whole wave functions are functions that will run with a full EXEC mask.
They will not be invoked directly, but instead will be launched by way
of a new intrinsic, `llvm.amdgcn.call.whole.wave` (to be added in
a future patch). These functions are meant as an alternative to the
`llvm.amdgcn.init.whole.wave` or `llvm.amdgcn.strict.wwm` intrinsics.

Whole wave functions will set EXEC to -1 in the prologue and restore the
original value of EXEC in the epilogue. They must have a special first
argument, `i1 %active`, that is going to be mapped to EXEC. They may
have either the default calling convention or amdgpu_gfx. The inactive
lanes need to be preserved for all registers used, active lanes only for
the CSRs.

At the IR level, arguments to a whole wave function (other than
`%active`) contain poison in their inactive lanes. Likewise, the return
value for the inactive lanes is poison.

This patch contains the following work:
* 2 new pseudos, SI_SETUP_WHOLE_WAVE_FUNC and SI_WHOLE_WAVE_FUNC_RETURN
  used for managing the EXEC mask. SI_SETUP_WHOLE_WAVE_FUNC will return
  a SReg_1 representing `%active`, which needs to be passed into
  SI_WHOLE_WAVE_FUNC_RETURN.
* SelectionDAG support for generating these 2 new pseudos and the
  special handling of %active. Since the return may be in a different
  basic block, it's difficult to add the virtual reg for %active to
  SI_WHOLE_WAVE_FUNC_RETURN, so we initially generate an IMPLICIT_DEF
  which is later replaced via a custom inserter.
* Expansion of the 2 pseudos during prolog/epilog insertion. PEI also
  marks any used VGPRs as WWM registers, which are then spilled and
  restored with the usual logic.

Future patches will include the `llvm.amdgcn.call.whole.wave` intrinsic
and a lot of optimization work (especially in order to reduce spills
around function calls).
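
For illustration only (not the exact expansion of the new pseudos), the prologue/epilogue EXEC handling amounts to something like the following sketch, with `OrigExec` standing in for the register that carries `%active`:

```cpp
// Sketch of the wave64 case: save the original EXEC while enabling all
// lanes, then restore it on return.
// Prologue: OrigExec = EXEC; EXEC = -1
BuildMI(EntryMBB, EntryMBB.begin(), DL, TII->get(AMDGPU::S_OR_SAVEEXEC_B64),
        OrigExec)
    .addImm(-1);
// Epilogue: EXEC = OrigExec
BuildMI(ExitMBB, ExitMBB.getFirstTerminator(), DL,
        TII->get(AMDGPU::S_MOV_B64), AMDGPU::EXEC)
    .addReg(OrigExec, RegState::Kill);
```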

---------

Co-authored-by: Matt Arsenault <Matthew.Arsenault@amd.com>
Co-authored-by: Shilei Tian <i@tianshilei.me>
2025-07-21 10:39:09 +02:00
Diana Picus
a201f8872a
[AMDGPU] Replace dynamic VGPR feature with attribute (#133444)
Use a function attribute (amdgpu-dynamic-vgpr) instead of a subtarget
feature, as requested in #130030.
2025-06-24 11:09:36 +02:00
Kazu Hirata
d1cd68881a
[llvm] Use llvm::is_sorted (NFC) (#140399) 2025-05-17 14:29:35 -07:00
Brox Chen
72bc0525d8
[AMDGPU][True16][CodeGen] update wwm reg sorting check condition (#135053)
We currently only need to shift down 32-bit WWM registers.

The previous check condition mistakenly selected 16-bit registers in
true16 mode. Update the check condition to skip 16-bit registers during
WWM register sorting.
2025-04-27 14:30:34 -04:00
Diana Picus
5bad5d84a1
Reland [AMDGPU] Support block load/store for CSR #130013 (#137169)
Add support for using the existing SCRATCH_STORE_BLOCK and
SCRATCH_LOAD_BLOCK instructions for saving and restoring callee-saved
VGPRs. This is controlled by a new subtarget feature, block-vgpr-csr. It
does not include WWM registers - those will be saved and restored
individually, just like before. This patch does not change the ABI.

Use of this feature may lead to slightly increased stack usage, because
the memory is not compacted if certain registers don't have to be
transferred (this will happen in practice for calling conventions where
the callee and caller saved registers are interleaved in groups of 8).
However, if the registers at the end of the block of 32 don't have to be
transferred, we don't need to use a whole 128-byte stack slot - we can
trim some space off the end of the range.

In order to implement this feature, we need to rely less on the
target-independent code in the PrologEpilogInserter, so we override
several new methods in SIFrameLowering. We also add new pseudos,
SI_BLOCK_SPILL_V1024_SAVE/RESTORE.

One peculiarity is that both the SI_BLOCK_V1024_RESTORE pseudo and the
SCRATCH_LOAD_BLOCK instructions will have all the registers that are not
transferred added as implicit uses. This is done in order to inform
LiveRegUnits that those registers are not available before the restore
(since we're not really restoring them - so we can't afford to scavenge
them). Unfortunately, this trick doesn't work with the save, so before
the save all the registers in the block will be unavailable (see the
unit test).

This was reverted due to failures in builds with expensive checks
enabled, now fixed by always updating LiveIntervals and SlotIndexes in
SILowerSGPRSpills.
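
As a sketch of the trimming logic described above (assuming a 32-bit mask of transferred registers and 4 bytes per register; the helper name is hypothetical, not from the patch):

```cpp
#include "llvm/ADT/bit.h"

// Hypothetical helper: bytes needed for a block-spill slot, given a mask of
// which of the 32 registers are actually transferred. Untransferred
// registers at the end of the block let us shrink the 128-byte slot.
static unsigned blockSlotSizeInBytes(uint32_t Mask) {
  if (!Mask)
    return 0;
  unsigned HighestTransferred = 31 - llvm::countl_zero(Mask);
  return (HighestTransferred + 1) * 4;
}
```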
2025-04-25 11:29:27 +02:00
Diana Picus
6bb2f90557
Revert "[AMDGPU] Support block load/store for CSR" (#136846)
Reverts llvm/llvm-project#130013 due to failures with expensive checks
on.
2025-04-23 14:01:00 +02:00
Diana Picus
4a58071d87
[AMDGPU] Support block load/store for CSR (#130013)
Add support for using the existing `SCRATCH_STORE_BLOCK` and
`SCRATCH_LOAD_BLOCK` instructions for saving and restoring callee-saved
VGPRs. This is controlled by a new subtarget feature, `block-vgpr-csr`.
It does not include WWM registers - those will be saved and restored
individually, just like before. This patch does not change the ABI.

Use of this feature may lead to slightly increased stack usage, because
the memory is not compacted if certain registers don't have to be
transferred (this will happen in practice for calling conventions where
the callee and caller saved registers are interleaved in groups of 8).
However, if the registers at the end of the block of 32 don't have to be
transferred, we don't need to use a whole 128-byte stack slot - we can
trim some space off the end of the range.

In order to implement this feature, we need to rely less on the
target-independent code in the PrologEpilogInserter, so we override
several new methods in `SIFrameLowering`. We also add new pseudos,
`SI_BLOCK_SPILL_V1024_SAVE/RESTORE`.

One peculiarity is that both the SI_BLOCK_V1024_RESTORE pseudo and the
SCRATCH_LOAD_BLOCK instructions will have all the registers that are not
transferred added as implicit uses. This is done in order to inform
LiveRegUnits that those registers are not available before the restore
(since we're not really restoring them - so we can't afford to scavenge
them). Unfortunately, this trick doesn't work with the save, so before
the save all the registers in the block will be unavailable (see the
unit test).
2025-04-23 10:33:36 +02:00
Diana Picus
72c3c30452
[AMDGPU] Allocate scratch space for dVGPRs for CWSR (#130055)
The CWSR trap handler needs to save and restore the VGPRs. When dynamic
VGPRs are in use, the fixed function hardware will only allocate enough
space for one VGPR block. The rest will have to be stored in scratch, at
offset 0.

This patch allocates the necessary space by:
- generating a prologue that checks at runtime if we're on a compute
queue (since CWSR only works on compute queues); for this we will have
to check the ME_ID bits of the ID_HW_ID2 register - if that is non-zero,
we can assume we're on a compute queue and initialize the SP and FP with
enough room for the dynamic VGPRs
- forcing all compute entry functions to use a FP so they can access
their locals/spills correctly (this isn't ideal but it's the quickest to
implement)

Note that at the moment we allocate enough space for the theoretical
maximum number of VGPRs that can be allocated dynamically (for blocks of
16 registers, this will be 128, of which we subtract the first 16, which
are already allocated by the fixed function hardware). Future patches
may decide to allocate less if they can prove the shader never allocates
that many blocks.

Also note that this should not affect any reported stack sizes (e.g. PAL
backend_stack_size etc).
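
The arithmetic behind that note, as a sketch using the values from the message (the per-wave scaling by wave size is my assumption about how per-lane bytes translate to scratch):

```cpp
// Sketch of the allocation bound described above: blocks of 16 VGPRs, a
// theoretical maximum of 128, minus the first block that the
// fixed-function hardware already allocates.
constexpr unsigned MaxDynVGPRs = 128;
constexpr unsigned BlockSize = 16;
constexpr unsigned VGPRsNeedingScratch = MaxDynVGPRs - BlockSize; // 112
constexpr unsigned BytesPerLane = VGPRsNeedingScratch * 4;        // 448
// Per-wave scratch (assumption): scaled by the wave size, e.g. 64 lanes.
constexpr unsigned ScratchPerWave64 = BytesPerLane * 64;          // 28672
```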
2025-03-19 13:49:19 +01:00
Pierre van Houtryve
5231736329
[AMDGPU] Do not allow M0 as v_readfirstlane_b32 dst (#128851)
M0 can only be written to by the SALU, so `v_readfirstlane_b32 m0` is
effectively useless. Represent this by restricting the dest RC of that
instruction to `SReg_32_XM0` which excludes M0.

There is a lot of test changes due to the register class changing, but
most changes are trivial. In some cases, an extra register and
`s_mov_b32` is needed.

Fixes SWDEV-513269
2025-02-26 13:14:03 +01:00
Craig Topper
af64f0a6c2
[FrameLowering] Use MCRegister instead of Register in CalleeSavedInfo. NFC (#128095)
Callee saved registers should always be physical registers. They are
often passed directly to other functions that take MCRegister like
getMinimalPhysRegClass or TargetRegisterClass::contains.

Unfortunately, sometimes the MCRegister is compared to a Register which
gave an ambiguous comparison error when the MCRegister is on the LHS.
Adding a MCRegister==Register comparison operator created more ambiguous
comparison errors elsewhere. These cases were usually comparing against
a base or frame pointer register that is a physical register in a
Register. For those I added an explicit conversion of Register to
MCRegister to fix the error.
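
An illustrative instance of the ambiguity and the fix (the surrounding names are hypothetical):

```cpp
// FrameReg holds a physical register but has type Register.
Register FrameReg = TRI->getFrameRegister(MF);
MCRegister CSReg = CSI.getReg(); // MCRegister after this change
// `CSReg == FrameReg` was an ambiguous comparison; converting the
// Register explicitly resolves it.
if (CSReg == FrameReg.asMCReg()) {
  // ...
}
```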
2025-02-20 23:44:05 -08:00
Aaditya
11b0401926
[AMDGPU] Restore SP from saved-FP or saved-BP (#124007)
Currently, the AMDGPU backend bumps the Stack Pointer by fixed-size
offsets in the prolog of device functions and restores it by the same
amount in the epilog:
```
Prolog:
  sp += frameSize

Epilog:
  sp -= frameSize
```

If a function has dynamic stack realignment:
```
Prolog:
  sp += frameSize + max_alignment

Epilog:
  sp -= frameSize + max_alignment
```

These calculations are not optimal for dynamic stack realignment, and
completely fail for dynamic stack readjustment. This patch uses the
saved Frame Pointer to restore SP:
```
Prolog:
  fp = sp
  sp += frameSize

Epilog:
  sp = fp
```

In case of dynamic stack realignment, SP is restored from the saved
Base Pointer:
```
Prolog:
  fp = sp + (max_alignment - 1)
  fp = fp & (-max_alignment)
  bp = sp
  sp += frameSize + max_alignment

Epilog:
  sp = bp
```

(Note: the presence of a BP is enforced whenever there is any dynamic
stack realignment.)

---------

Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
2025-01-24 19:13:40 +05:30
Guy David
1a935d7a17
[llvm] Mark scavenging spill-slots as *spilled* stack objects. (#122673)
This seems like an oversight when copying code from other backends.
2025-01-14 10:18:31 +02:00
Diana Picus
09c41246ed
[AMDGPU] Fix restores in chain functions (#116193)
When spilling a VGPR in `emitPrologue`, chain functions prefer to use
offsets to access the stack instead of the SP.

This patch fixes `emitEpilogue` to do the same. It also brings back some
test coverage that was lost in #93526, when WWM registers started being
shifted to the lowest available range (which meant that tests that were
originally spilling v8 would shift to spill v0, which is a scratch
register for chain functions and didn't get spilled).

Change-Id: Icb07fccd859b563cd45f74c25ae578ecb38bdeeb
2024-11-20 10:43:59 +01:00
Alex Rønne Petersen
ad4a582fd9
[llvm] Consistently respect naked fn attribute in TargetFrameLowering::hasFP() (#106014)
Some targets (e.g. PPC and Hexagon) already did this. I think it's best
to do this consistently so that frontend authors don't run into
inconsistent results when they emit `naked` functions. For example, in
Zig, we had to change our emit code to also set `frame-pointer=none` to
get reliable results across targets.

Note: I don't have commit access.
2024-10-18 09:35:42 +04:00
Jay Foad
8d13e7b8c3
[AMDGPU] Qualify auto. NFC. (#110878)
Generated automatically with:
$ clang-tidy -fix -checks=-*,llvm-qualified-auto $(find
lib/Target/AMDGPU/ -type f)
2024-10-03 13:07:54 +01:00
Christudasan Devadasan
ac0f64f06d
[AMDGPU] Split vgpr regalloc pipeline (#93526)
Allocating wwm-registers and per-thread VGPR operands
together imposes many challenges in the way the
registers are reused during allocation. There are
times when regalloc reuses the registers of regular
VGPRs operations for wwm-operations in a small range
leading to unwantedly clobbering their inactive lanes
causing correctness issues that are hard to trace.

This patch splits the VGPR allocation pipeline further
to allocate wwm-registers first and the regular VGPR
operands in a separate pipeline. The splitting would
ensure that the physical registers used for wwm
allocations won't take part in the next allocation
pipeline to avoid any such clobbering.
2024-09-30 19:55:42 +05:30
Christudasan Devadasan
23487be490
[AMDGPU] Merge the conditions used for deciding CS spills for amdgpu_cs_chain[_preserve] (#109911)
Multiple conditions exist to decide whether callee save spills/restores
are required for amdgpu_cs_chain or amdgpu_cs_chain_preserve calling
conventions. This patch consolidates them all and moves to a single
place.
2024-09-26 10:50:00 +05:30
Pravin Jagtap
3659aa8079
[AMDGPU] Fix handling of DBG_VALUE_LIST while fixing the dead frame indices. (#109685)
Both the SGPR->VGPR and VGPR->AGPR spilling code apply a fixup to the
spill frame indices referenced in debug instructions so that those
instructions can be entirely removed. The stack argument is at operand
index 0 in DBG_VALUE and at operand index 2 in DBG_VALUE_LIST.
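
A sketch of what the fixup has to account for (`DeadFI`/`NewFI` are hypothetical names for illustration):

```cpp
// The frame-index operand sits at a different position depending on the
// debug instruction flavor: operand 0 for DBG_VALUE, operand 2 for
// DBG_VALUE_LIST.
unsigned OpIdx = MI.isDebugValueList() ? 2 : 0;
MachineOperand &MO = MI.getOperand(OpIdx);
if (MO.isFI() && MO.getIndex() == DeadFI)
  MO.setIndex(NewFI);
```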

Fixes: SWDEV-484156
2024-09-24 14:41:45 +05:30
Nikita Popov
cee0bf9626
[AMDGPU] Use Lo_32 and Hi_32 helpers (NFC) (#109413) 2024-09-20 14:35:38 +02:00
Diana Picus
3356208531
Reland "[amdgpu] Add llvm.amdgcn.init.whole.wave intrinsic" (#108512)
This reverts commit
7792b4ae79.

The problem was a conflict with
e55d6f5ea2
"[AMDGPU] Simplify and improve codegen for llvm.amdgcn.set.inactive
(https://github.com/llvm/llvm-project/pull/107889)"
which changed the syntax of V_SET_INACTIVE (and thus made my MIR test
crash).

...if only we had a merge queue.
2024-09-13 11:54:30 +02:00
Diana Picus
7792b4ae79
Revert "Reland "[amdgpu] Add llvm.amdgcn.init.whole.wave intrinsic" (#108054)"" (#108341)
Reverts llvm/llvm-project#108173

si-init-whole-wave.mir crashes on some buildbots (although it passed
both locally with sanitizers enabled and in pre-merge tests).
Investigating.
2024-09-12 10:12:09 +02:00
Diana Picus
703ebca869
Reland "[amdgpu] Add llvm.amdgcn.init.whole.wave intrinsic" (#108054)" (#108173)
This reverts commit
c7a7767fca.

The buildbots failed because I removed a MI from its parent before
updating LIS. This PR should fix that.
2024-09-12 09:11:41 +02:00
Vitaly Buka
c7a7767fca
Revert "[amdgpu] Add llvm.amdgcn.init.whole.wave intrinsic" (#108054)
Breaks bots, see #105822.

Reverts llvm/llvm-project#105822
2024-09-10 09:51:43 -07:00
Diana Picus
44556e64f2
[amdgpu] Add llvm.amdgcn.init.whole.wave intrinsic (#105822)
This intrinsic is meant to be used in functions that have a "tail" that
needs to be run with all the lanes enabled. The "tail" may contain
complex control flow that makes it unsuitable for the use of the
existing WWM intrinsics. Instead, we will pretend that the function
starts with all the lanes enabled, then branches into the actual body of
the function for the lanes that were meant to run it, and then finally
all the lanes will rejoin and run the tail.

As such, the intrinsic will return the EXEC mask for the body of the
function, and is meant to be used only as part of a very limited pattern
(for now only in amdgpu_cs_chain functions):

```
entry:
  %func_exec = call i1 @llvm.amdgcn.init.whole.wave()
  br i1 %func_exec, label %func, label %tail

func:
  ; ... stuff that should run with the actual EXEC mask
  br label %tail

tail:
  ; ... stuff that runs with all the lanes enabled;
  ; can contain more than one basic block
```

It's an error to use the result of this intrinsic for anything
other than a branch (but unfortunately checking that in the verifier is
non-trivial because SIAnnotateControlFlow will introduce an amdgcn.if
between the intrinsic and the branch).

The intrinsic is lowered to a SI_INIT_WHOLE_WAVE pseudo, which for now
is expanded in si-wqm (which is where SI_INIT_EXEC is handled too);
however the information that the function was conceptually started in
whole wave mode is stored in the machine function info
(hasInitWholeWave). This will be useful in prolog epilog insertion,
where we can skip saving the inactive lanes for CSRs (since if the
function started with all the lanes active, then there are no inactive
lanes to preserve).
2024-09-10 13:24:53 +02:00
Matt Arsenault
b1bcb7ca46 Reapply "AMDGPU: Move attributor into optimization pipeline (#83131)" and follow up commit "clang/AMDGPU: Defeat attribute optimization in attribute test" (#98851)
This reverts commit adaff46d087799072438dd744b038e6fd50a2d78.

Drop the -O3 checks from default-attributes.hip. I don't know why they
are different on some bots but reverting this is far too disruptive.
2024-07-15 11:51:44 +04:00
dyung
adaff46d08
Revert "AMDGPU: Move attributor into optimization pipeline (#83131)" and follow up commit "clang/AMDGPU: Defeat attribute optimization in attribute test" (#98851)
This reverts commits 677cc15e0ff2e0e6aa30538eb187990a6a8f53c0 and
78bc1b64a6dc3fb6191355a5e1b502be8b3668e7.

The test CodeGenHIP/default-attributes.hip is failing on multiple bots
even after the attempted fix including the following:
- https://lab.llvm.org/buildbot/#/builders/3/builds/1473
- https://lab.llvm.org/buildbot/#/builders/65/builds/1380
- https://lab.llvm.org/buildbot/#/builders/161/builds/595
- https://lab.llvm.org/buildbot/#/builders/154/builds/1372
- https://lab.llvm.org/buildbot/#/builders/133/builds/1547
- https://lab.llvm.org/buildbot/#/builders/81/builds/755
- https://lab.llvm.org/buildbot/#/builders/40/builds/570
- https://lab.llvm.org/buildbot/#/builders/13/builds/748
- https://lab.llvm.org/buildbot/#/builders/12/builds/1845
- https://lab.llvm.org/buildbot/#/builders/11/builds/1695
- https://lab.llvm.org/buildbot/#/builders/190/builds/1829
- https://lab.llvm.org/buildbot/#/builders/193/builds/962
- https://lab.llvm.org/buildbot/#/builders/23/builds/991
- https://lab.llvm.org/buildbot/#/builders/144/builds/2256
- https://lab.llvm.org/buildbot/#/builders/46/builds/1614

These bots have been broken for a day, so reverting to get everything
back to green.
2024-07-14 18:48:54 -07:00
Matt Arsenault
78bc1b64a6
AMDGPU: Move attributor into optimization pipeline (#83131)
Removing it from the codegen pipeline induces a lot of test churn
because llc is no longer optimizing out implicit arguments to kernels.

Mostly mechanical, but there are some creative test updates. I preferred
to take the changes as-is in tests where the ABI isn't relevant. In
cases where it's more relevant, or the optimize out logic was too
ingrained in the test, I pre-run the optimization. Some cases manually
add attributes to disable inputs.
2024-07-14 08:36:33 +04:00
Pankaj Dwivedi
dc8da7ddea
[AMDGPU] Reserved private memory register during PEI (#93536)
- Reserve the newly selected private memory registers during entry
function prologue generation.
- Add an assertion in eliminateFrameIndex to ensure the register is
reserved.

Co-authored-by: PankajDwivedi-25 <pankajkumar.divedi@amd.com>
2024-05-29 15:10:44 +05:30
Gang Chen
167427f5db
[AMDGPU] change order of fp and sp in kernel prologue (#90626)
Change the order of FP and SP in the kernel prologue, and update the
related codegen tests, to make it easier to merge code into our
downstream branches.

Signed-off-by: gangc <gangc@amd.com>
2024-05-01 08:16:55 -07:00
Ivan Kosarev
dfa1d9b027
[AMDGPU][NFC] Have helpers to deal with encoding fields. (#82772)
These should provide more convenient and less error-prone facilities
for encoding and decoding fields than manually defined constants and
functions.
2024-02-23 17:34:55 +00:00
Jan Patrick Lehr
f661057865
Revert "[AMDGPU] Compiler should synthesize private buffer resource descriptor from flat_scratch_init" (#81234)
Reverts llvm/llvm-project#79586

This broke the AMDGPU OpenMP Offload buildbot.
The typical error message was that the GPU attempted to read beyond the
largest legal address.

Error message:
AMDGPU fatal error 1: Received error in queue 0x7f8363f22000:
HSA_STATUS_ERROR_MEMORY_APERTURE_VIOLATION: The agent attempted to
access memory beyond the largest legal address.
2024-02-09 09:57:38 +01:00
alex-t
88e52511ca
[AMDGPU] Compiler should synthesize private buffer resource descriptor from flat_scratch_init (#79586)
This change implements synthesizing the private buffer resource
descriptor in the kernel prolog instead of using the preloaded kernel
argument.
2024-02-08 20:27:36 +01:00
Christudasan Devadasan
89ec940b4a
[AMDGPU] Insert spill codes for the SGPRs used for EXEC copy (#79428)
The SGPRs used to preserve the EXEC mask while lowering whole-wave
register spills and copies should be saved and restored in the prolog
and epilog if they are in the CSR range. That wasn't happening when
only a wwm-copy was lowered and there were no wwm-spills. This patch
addresses that problem.
2024-02-05 18:32:23 +05:30
Christudasan Devadasan
230c13d59d
[AMDGPU] Pick available high VGPR for CSR SGPR spilling (#78669)
CSR SGPR spilling currently uses the first available physical VGPRs.
This imposes high register pressure when trying to allocate large VGPR
tuples within the default register budget.

This patch changes the spilling strategy to pick VGPRs in reverse
order: the highest available VGPR first, shifted back down to the
lowest available range after regalloc. With that, the lower VGPRs
remain available for allocation, improving the odds of finding a large
number of contiguous registers.
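
A minimal sketch of the reversed scan (the helper's shape is illustrative, not the patch's actual code):

```cpp
// Pick the highest available VGPR for a CSR SGPR spill lane; after
// regalloc these are shifted back down to the lowest free range.
static MCPhysReg pickHighestFreeVGPR(ArrayRef<MCPhysReg> AllVGPRs,
                                     const MachineRegisterInfo &MRI,
                                     const LiveRegUnits &LiveUnits) {
  for (MCPhysReg Reg : llvm::reverse(AllVGPRs))
    if (!MRI.isReserved(Reg) && LiveUnits.available(Reg))
      return Reg;
  return AMDGPU::NoRegister;
}
```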
2024-01-24 07:08:43 +05:30
Jay Foad
9ca36932b5
[AMDGPU] Work around s_getpc_b64 zero extending on GFX12 (#78186) 2024-01-18 10:23:27 +00:00
Mirko Brkušanin
5879162f7f
[AMDGPU] CodeGen for GFX12 VBUFFER instructions (#75492) 2023-12-15 13:45:03 +01:00
Diana
eb3c02fdc2
[AMDGPU] Use immediates for stack accesses in chain funcs (#71913)
Switch to using immediate offsets instead of the SP register to access
objects on the current stack frame in chain functions. This means we no
longer need to reserve an SP register just for accessing stack objects, and
it also allows us to set the SP (when one is actually needed) to the
stack size from the very beginning.

This only works if we use a FixedObject for the ScavengeFI, which is
what we do for entry functions anyway (and we generally want to keep
chain functions close to amdgpu_cs behaviour where we don't have a good
reason to diverge).
2023-11-14 13:17:46 +01:00
Jie Fu
fdf99b21f3 [AMDGPU] Fix -Wunused-variable in SIFrameLowering.cpp (NFC)
/llvm-project/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp:1829:8: error: unused variable 'IsChainFunction' [-Werror,-Wunused-variable]
  bool IsChainFunction = MF.getInfo<SIMachineFunctionInfo>()->isChainFunction();
       ^
1 error generated.
2023-11-08 20:48:44 +08:00
Jay Foad
d5f3b3b3b1
[RegScavenger] Simplify state tracking for backwards scavenging (#71202)
Track the live register state immediately before, instead of after,
MBBI. This makes it simple to track the state at the start or end of a
basic block without a separate (and poorly named) Tracking flag.

This changes the API of the backward(MachineBasicBlock::iterator I)
method, which now recedes to the state just before, instead of just
after, *I. Some clients are simplified by this change.

There is one small functional change shown in the lit tests where
multiple spilled registers all need to be reloaded before the same
instruction. The reloads will now be inserted in the opposite order.
This should not affect correctness.
2023-11-08 09:49:07 +00:00
Diana Picus
39830fea28 [AMDGPU][PEI] Set up SP for chain functions
Initialize the SP to 0 in the prologue of functions with the
`amdgpu_cs_chain` or `amdgpu_cs_chain_preserve` calling conventions, but
only if they need one (i.e. if they contain calls to `amdgpu_gfx`
functions or if they have stack objects).

Also make sure we don't try to realign the stack (since 0 is aligned
enough).

Differential Revision: https://reviews.llvm.org/D156413
2023-11-08 09:27:34 +01:00
Diana
1fa58c7790
[AMDGPU] Callee saves for amdgpu_cs_chain[_preserve] (#71526)
Teach prolog epilog insertion how to handle functions with the
amdgpu_cs_chain or amdgpu_cs_chain_preserve calling conventions.

For amdgpu_cs_chain functions, we only need to preserve the inactive
lanes of VGPRs above v8, and only in the presence of calls via
@llvm.amdgcn.cs.chain.

For amdgpu_cs_chain_preserve functions, we will also need to preserve
the active lanes for registers above the last argument VGPR. AFAICT
there's no direct way to find out what the last argument VGPR is, so
instead the patch uses the fact that chain calls from
amdgpu_cs_chain_preserve functions can't use more VGPRs than the
caller's VGPR arguments. In other words, it removes the operands of
SI_CS_CHAIN_TC instructions from the list of callee saved registers.

For both calling conventions, registers v0-v7 never need to be saved and
restored, so we should never add them as WWM spills.
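
A sketch of the removal step (container and names are illustrative):

```cpp
// Registers used as operands of the chain call can't hold values the
// callee must preserve, so drop them from the to-be-saved set.
for (const MachineOperand &MO : ChainCallMI.operands())
  if (MO.isReg() && MO.getReg().isPhysical())
    SavedVGPRs.reset(MO.getReg());
```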

Differential Revision: https://reviews.llvm.org/D156412
2023-11-08 08:28:15 +01:00
Christudasan Devadasan
f9cd789658
[AMDGPU] Add pseudo instructions for SGPR spill to VGPR (#69923)
For a future patch, it is important that the lowered SGPR spills remain
recognizable as spill instructions during regalloc. Lowering them
directly into V_WRITELANE/V_READLANE would not let us attach the SPILL
flag to those instructions.

This patch introduces the pseudo instructions with the SGPRSpill
flag set in their Desc. They will get lowered to equivalent
instructions later during post RA pseudo expansion.
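
For illustration, the post-RA expansion can be as simple as swapping the opcode descriptor (a sketch inside an expandPostRAPseudo-style switch; the real expansion handles more cases):

```cpp
// Once regalloc is done, the spill pseudo becomes the real lane-access
// instruction.
switch (MI.getOpcode()) {
case AMDGPU::SI_SPILL_S32_TO_VGPR:
  MI.setDesc(get(AMDGPU::V_WRITELANE_B32));
  return true;
case AMDGPU::SI_RESTORE_S32_FROM_VGPR:
  MI.setDesc(get(AMDGPU::V_READLANE_B32));
  return true;
}
```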
2023-10-27 17:24:10 +05:30
Pranav Taneja
6d9b96313d
[AMDGPU] [SIFrameLowering] Use LiveRegUnits instead of LivePhysRegs. 2023-09-21 11:49:46 +05:30
Austin Kerbow
343be5132e [AMDGPU] Add utilities to track number of user SGPRs. NFC.
Factor out and unify some common code that calculates and tracks the
number of user SGPRs.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D159439
2023-09-12 08:52:30 -07:00
Matt Arsenault
4d42e8b5d1 Reapply "[CodeGen]Allow targets to use target specific COPY instructions for live range splitting"
This reverts commit a496c8be6e638ae58bb45f13113dbe3a4b7b23fd.

The workaround in c26dfc81e254c78dc23579cf3d1336f77249e1f6 should work
around the underlying problem with SUBREG_TO_REG.
2023-07-31 20:15:45 -04:00
Vitaly Buka
a496c8be6e Revert "[CodeGen]Allow targets to use target specific COPY instructions for live range splitting"
And dependent commits.

Details in D150388.

This reverts commit 825b7f0ca5f2211ec3c93139f98d1e24048c225c.
This reverts commit 7a98f084c4d121244ef7286bc6503b6a181d446e.
This reverts commit b4a62b1fa546312d882fa12dfdcd015177d66826.
This reverts commit b7836d856206ec39509d42529f958c920368166b.

No conflicts in the code; a few tests had conflicts in autogenerated CHECKs:
llvm/test/CodeGen/Thumb2/mve-float32regloops.ll
llvm/test/CodeGen/AMDGPU/fix-frame-reg-in-custom-csr-spills.ll

Reviewed By: alexfh

Differential Revision: https://reviews.llvm.org/D156381
2023-07-26 22:13:32 -07:00
Christudasan Devadasan
7a98f084c4 [AMDGPU][SILowerSGPRSpills] Spill SGPRs to virtual VGPRs
Currently, the custom SGPR spill lowering pass spills SGPRs into
physical VGPR lanes, and the remaining VGPRs are used by regalloc for
vector regclass allocation. This imposes many restrictions: SGPR
spilling could fail when there weren't enough VGPRs, forcing us to
spill the leftover into memory during PEI. The custom spill handling
during PEI has many edge cases and breaks the compiler from time to
time.

This patch implements spilling SGPRs into virtual VGPR
lanes. Since we now split the register allocation for
SGPRs and VGPRs, the virtual registers introduced for
the spill lanes would get allocated automatically in
the subsequent regalloc invocation for VGPRs.

Spilling to virtual registers will always succeed, even in
high-pressure situations, and hence avoids most of the edge cases
during PEI. We are now left with only the custom SGPR spills during PEI
for special registers like the frame pointer, which is an unproblematic
case.
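
As a sketch of the central idea (a fragment; `SpilledSGPR` and `Lane` are illustrative):

```cpp
// Spill an SGPR into a lane of a *virtual* VGPR; the subsequent VGPR
// regalloc run assigns the physical register, so the spill can't fail.
Register LaneVGPR = MRI.createVirtualRegister(&AMDGPU::VGPR_32RegClass);
BuildMI(MBB, MI, DL, TII->get(AMDGPU::V_WRITELANE_B32), LaneVGPR)
    .addReg(SpilledSGPR) // the SGPR value being spilled
    .addImm(Lane)        // lane index within the VGPR
    .addReg(LaneVGPR, RegState::Undef); // tied previous lane contents
```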

Differential Revision: https://reviews.llvm.org/D124196
2023-07-07 23:14:32 +05:30
Christudasan Devadasan
b78b36e1a2 [AMDGPU] Implement whole wave register spill
To reduce register pressure during allocation, when the allocator
spills a virtual register that corresponds to a whole wave mode
operation, the spill loads and restores should be activated for all
lanes by temporarily flipping all bits in the exec register to one just
before the spills. This is not implemented in the compiler today, and
this patch adds the necessary support.

This is a pre-patch for the SGPR spill to virtual VGPR lanes, which
would eventually cause whole wave register spills during allocation.

Reviewed By: arsenm, cdevadas

Differential Revision: https://reviews.llvm.org/D143759
2023-07-07 22:51:45 +05:30
Brendon Cahoon
853b2a84cb [AMDGPU] Reserve SGPR pair when long branches are present
Branch relaxation requires 2 additional SGPRs for AMDGPU to handle the
case when an indirect branch target is too far away. The register
scavenger may not find available registers, which causes a “did not
find scavenging index” assert in assignRegToScavengingIndex.

In this patch, we estimate before register allocation whether an
indirect branch is likely to be needed, and reserve 2 SGPRs if the
branch distance is found to be above a threshold. The distance threshold
is an approximation as the exact code size and branch distance are
unknown prior to register allocation.
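
A sketch of the estimate (the threshold and helper names are placeholders, not the patch's actual code):

```cpp
// Approximate the function's code size before regalloc; if a branch
// could end up out of range, reserve the SGPR pair now.
uint64_t EstimatedSize = 0;
for (const MachineBasicBlock &MBB : MF)
  for (const MachineInstr &MI : MBB)
    EstimatedSize += TII->getInstSizeInBytes(MI);
if (EstimatedSize > LongBranchThreshold) // hypothetical threshold
  reserveLongBranchRegister(MF);         // hypothetical helper
```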

Patch by Corbin Robeck. Thanks!

Differential Revision: https://reviews.llvm.org/D149775
2023-06-29 16:50:46 -05:00