1082 Commits

Author SHA1 Message Date
Stanislav Mekhanoshin
1a9c61f004
[AMDGPU] Non convergent instruction does not depend on EXEC. NFCI. (#179821) 2026-02-09 23:24:45 -08:00
Shilei Tian
65b4099219
[AMDGPU] Fix instruction size for 64-bit literal constant operands (#180387)
`getLit64Encoding` uses a different approach to determine whether 64-bit
literal encoding is used, which caused a size mismatch between the
`MachineInstr` and the `MCInst`.

`!isValid32BitLiteral` is effectively `!(isInt<32>(Val) ||
isUInt<32>(Val))`, i.e. `!isInt<32>(Val) && !isUInt<32>(Val)`, but
`getLit64Encoding` checks `!isInt<32>(Val) || !isUInt<32>(Val)`.
2026-02-09 14:31:52 +00:00
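The mismatch described in #180387 comes down to a misapplied De Morgan negation. A minimal sketch, assuming standalone helpers `fitsInt32` and `fitsUInt32` as stand-ins for `isInt<32>` and `isUInt<32>` (every name below is hypothetical and not taken from the LLVM sources):

```cpp
#include <cassert>
#include <cstdint>

// Stand-ins for llvm::isInt<32> / llvm::isUInt<32>, written out so the
// sketch compiles without LLVM headers.
constexpr bool fitsInt32(int64_t V) {
  return V >= INT32_MIN && V <= INT32_MAX;
}
constexpr bool fitsUInt32(int64_t V) {
  return V >= 0 && V <= static_cast<int64_t>(UINT32_MAX);
}

// Intended predicate: a 64-bit literal is needed only when the value fits
// neither as a signed nor as an unsigned 32-bit literal.
constexpr bool needsLit64(int64_t V) {
  return !(fitsInt32(V) || fitsUInt32(V));
}

// Mismatched predicate: true as soon as either 32-bit form does not fit.
constexpr bool needsLit64Mismatched(int64_t V) {
  return !fitsInt32(V) || !fitsUInt32(V);
}

// Hypothetical self-check showing where the two predicates diverge.
inline void checkLiteralPredicates() {
  // Fits both ways: the predicates agree.
  assert(!needsLit64(1) && !needsLit64Mismatched(1));
  // Fits only as signed: the predicates disagree.
  assert(!needsLit64(-1) && needsLit64Mismatched(-1));
  // Fits only as unsigned: the predicates disagree.
  assert(!needsLit64(0xFFFFFFFFLL) && needsLit64Mismatched(0xFFFFFFFFLL));
  // Fits neither way: the predicates agree again.
  assert(needsLit64(1LL << 40) && needsLit64Mismatched(1LL << 40));
}
```

The two predicates disagree exactly for values that fit one 32-bit form but not the other, which is where a size mismatch between the two computations would show up.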
Petr Kurapov
27a8ab09fa
[AMDGPU] Fix V_INDIRECT_REG_READ_GPR_IDX expansion with immediate index (#179699)
The definition of the SSrc_b32 operand of V_INDIRECT_REG_READ_GPR_IDX_B32_V*
allows immediates, but the expansion logic currently handles only
registers. This can result in expansion failures when, e.g.,
llvm.amdgcn.wave.reduce.umin.i32 is folded into a constant and then used
as an insertelement index.
2026-02-09 11:33:30 +01:00
Vladimir Vereschaka
19d681177f
Revert "[MC][TableGen] Expand Opcode field of MCInstrDesc" (#180321)
Reverts llvm/llvm-project#179652

This PR causes out-of-memory build failures on many Windows builders.
2026-02-06 21:58:50 -08:00
sstipano
13d8870d45
[MC][TableGen] Expand Opcode field of MCInstrDesc (#179652)
Increase width of Opcode to `int` from `short` to allow more capacity.
2026-02-06 20:21:48 +01:00
Pierre van Houtryve
b738491d2f
[AMDGPU][GFX12.5] Add support for emitting memory operations with nv bit set (#179413)
- Add `MONonVolatile` MachineMemOperand flag.
- Set nv=1 on memory operations on GFX12.5 if the operation accesses a
  constant address space, is an invariant load, or has the `MONonVolatile`
  flag set.
2026-02-06 11:35:46 +01:00
Diana Picus
9022f47ca4
[AMDGPU] Implement llvm.sponentry (#176357)
In some of our use cases, the GPU runtime stores some data at the top of
the stack. It figures out where it's safe to store it by using the PAL
metadata generated by the backend, which includes the total stack size.
However, the metadata does not include the space reserved at the bottom
of the stack for the trap handler when CWSR is enabled in dynamic VGPR
mode. This space is reserved dynamically based on whether or not the
code is running on the compute queue. Therefore, the runtime needs a way
to take that into account.

Add support for `llvm.sponentry`, which should return the base of the
stack, skipping over any reserved areas. This allows us to keep this
computation in one place rather than duplicate it between the backend and
the runtime.

The implementation for functions that set up their own stack uses a
pseudo that is expanded to the same code sequence as that used in the
prolog to set up the stack in the first place.

In callable functions, we generate a fixed stack object and use that
instead, similar to the Arm/AArch64 approach. This wastes some stack space
but that's not a problem for now because we're not planning to use this in
callable functions yet.
2026-02-03 15:02:07 +01:00
vporpo
1658456ccf
[AMDGPU] Introduce custom MIR formatting for s_wait_alu (#176316)
This patch implements a custom printer/parser for the immediate operand
of s_wait_alu that prints/parses the decoded counter values.

Format:
```
 .<counter1>_<value1>_<counter2>_<value2>
```

Example:
 `s_wait_alu .VaVdst_1_VmVsrc_1`
 ; Which is equivalent to this:
 `s_wait_alu 8167`

Features:
- If a counter is at its maximum value it won't get printed.
- The parser will error out if a counter is greater than or equal to its
max value.
- If all counters are disabled we can use 'AllOff'.
- For now we also accept numeric values for backwards compatibility with
older MIR.

Note: This is similar to https://github.com/llvm/llvm-project/pull/96004
but for `s_wait_alu`.
2026-01-31 10:46:59 -08:00
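The printing rule in #176316 (skip counters at their maximum, join the rest as `<counter>_<value>`) can be sketched as below. This is only an illustration: the field table lists just three fields, with bit positions inferred from the examples in this log (`8167` and `0xfffd`), the name `VaVcc` is a guess, and the exact spelling of the all-off form is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical field description: printed name, bit shift, bit width.
struct DepCtrField {
  const char *Name;
  unsigned Shift;
  unsigned Width;
};

// Assumed layout, inferred from the examples in this log. Real hardware has
// more fields; they are omitted here on purpose.
constexpr DepCtrField Fields[] = {
    {"VaVdst", 12, 4},
    {"VmVsrc", 2, 3},
    {"VaVcc", 1, 1}, // name is an assumption
};

// Formats the immediate using the ".<counter>_<value>_..." scheme, omitting
// every counter that sits at its maximum value.
std::string formatWaitAlu(uint16_t Imm) {
  std::string Out;
  for (const DepCtrField &F : Fields) {
    unsigned Max = (1u << F.Width) - 1;
    unsigned Val = (Imm >> F.Shift) & Max;
    if (Val == Max)
      continue; // counter disabled, not printed
    Out += Out.empty() ? "." : "_";
    Out += F.Name;
    Out += "_";
    Out += std::to_string(Val);
  }
  // Spelling of the all-disabled form is assumed, not taken from the patch.
  return Out.empty() ? "AllOff" : Out;
}

// Hypothetical self-check against the example given in the commit message.
inline void checkFormatWaitAlu() {
  assert(formatWaitAlu(8167) == ".VaVdst_1_VmVsrc_1");
}
```

With this assumed layout, the immediate `8167` from the commit message decodes to `.VaVdst_1_VmVsrc_1`, matching the example there.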
Janek van Oirschot
d1e2ddf997
[AMDGPU] Emit b32 movs if (a)v_mov_b64_pseudo dest vgprs are misaligned (#160547)
#154115 exposed that the destination of a v_mov_b64 can be misaligned.

Relaxes v_mov_b64_pseudo register class constraint (which matches
av_mov_b64_pseudo's register class).
2026-01-30 15:01:14 +00:00
Jay Foad
dbd4240130
[AMDGPU] Fix DEALLOC_VGPRS in the presence of spills to scratch (#178461) 2026-01-29 20:57:16 +01:00
Jay Foad
3f1386b986
[AMDGPU] Add braces around a switch case. NFC. (#178637) 2026-01-29 12:10:03 +00:00
vporpo
21dad8e5cc
[AMDGPU] Improve crash message when S_WAITCNT_DEPCTR is missing its operand (#177065)
The code in the test is causing a crash in `SIInstrInfo.cpp`
`fixImplicitOperands()` in `MI.implicit_operands()`:
```
  for (auto &Op : MI.implicit_operands()) {
```
MachineInstr.h:
```
  mop_range implicit_operands() {
=>  return operands_impl().drop_front(getNumExplicitOperands());
  }
```
We are trying to drop 1 operand from MI's operand list, which is empty.

By early returning we are no longer crashing at that point and we are
getting a more meaningful error message:

```
*** Bad machine code: Too few operands ***
- function:    missing_operand_crash
- basic block: %bb.0  (0x5a9d30ced988)
- instruction: S_WAITCNT_DEPCTR
1 operands expected, but 0 given.
```

The code is still crashing at a different location, but at least we are
getting an error message.
2026-01-26 08:35:30 -08:00
Jay Foad
017f2bc181
[AMDGPU] Simplify legalization of PHI operands (#177352)
In practice when legalizeOperands is called on a PHI node, the result is
never an SGPR class and the operands are never subregs. Simplify the
code accordingly by using the result regclass for all the inputs. This
includes using an AV class where previously we picked either an AGPR or
VGPR class.
2026-01-26 15:39:13 +00:00
Mariusz Sikora
3c0f5045e1
[AMDGPU] Add FeatureGFX13 and SMEM encoding for gfx13 (#177567)
For now, the list of features is based on gfx12 and gfx1250.

---------

Co-authored-by: Jay Foad <jay.foad@amd.com>
2026-01-26 14:16:36 +01:00
Ryan Mitchell
13b20e7aea
[AMDGPU][SILoadStoreOptimizer] Fix lds address operand offset (#176816)
The offset operand in GLOBAL_LOAD_ASYNC_TO_LDS_B128, for instance, is
added to both the lds and global address, but SILoadStoreOptimizer is
currently unaware of that. This PR inserts an add to counteract the
offset meant for the global address. This one add is better than not
doing the optimization at all, and having to insert 2 adds for each
global address calculation (with no offset).

```
; ENABLE-LABEL: name: promote_async_load_offset
; ENABLE: liveins: $ttmp7, $vgpr0, $sgpr0_sgpr1
; ENABLE-NEXT: {{  $}}
; ENABLE-NEXT: renamable $vgpr1 = V_LSHLREV_B32_e32 8, $vgpr0, implicit $exec
; ENABLE-NEXT: renamable $vgpr2, renamable $vcc_lo = V_ADD_CO_U32_e64 $vgpr0, 512, 0, implicit $exec
; ENABLE-NEXT: renamable $vgpr3, dead $sgpr_null = V_ADDC_U32_e64 0, killed $vgpr0, killed $vcc_lo, 0, implicit $exec
; ENABLE-NEXT: renamable $vgpr1 = disjoint V_OR_B32_e32 0, killed $vgpr1, implicit $exec
; ENABLE-NEXT: renamable $vgpr0 = V_ADD_U32_e32 256, $vgpr1, implicit $exec
; ENABLE-NEXT: GLOBAL_LOAD_ASYNC_TO_LDS_B128 killed $vgpr0, $vgpr2_vgpr3, -256, 0, implicit-def $asynccnt, implicit $exec, implicit $asynccnt :: (load store (s128), align 1, addrspace 3)
; ENABLE-NEXT: GLOBAL_LOAD_ASYNC_TO_LDS_B128 killed $vgpr1, killed $vgpr2_vgpr3, 0, 0, implicit-def $asynccnt, implicit $exec, implicit $asynccnt :: (load store (s128), align 1, addrspace 3)

; DISABLE-LABEL: name: promote_async_load_offset
; DISABLE: liveins: $ttmp7, $vgpr0, $sgpr0_sgpr1
; DISABLE-NEXT: {{  $}}
; DISABLE-NEXT: renamable $vgpr1 = V_LSHLREV_B32_e32 8, $vgpr0, implicit $exec
; DISABLE-NEXT: renamable $vgpr2, renamable $vcc_lo = V_ADD_CO_U32_e64 256, $vgpr0, 0, implicit $exec
; DISABLE-NEXT: renamable $vgpr3, $sgpr_null = V_ADDC_U32_e64 0, $vgpr0, killed $vcc_lo, 0, implicit $exec
; DISABLE-NEXT: renamable $vgpr1 = disjoint V_OR_B32_e32 0, killed $vgpr1, implicit $exec
; DISABLE-NEXT: GLOBAL_LOAD_ASYNC_TO_LDS_B128 $vgpr1, killed $vgpr2_vgpr3, 0, 0, implicit-def $asynccnt, implicit $exec, implicit $asynccnt :: (load store (s128), align 1, addrspace 3)
; DISABLE-NEXT: renamable $vgpr2, renamable $vcc_lo = V_ADD_CO_U32_e64 512, $vgpr0, 0, implicit $exec
; DISABLE-NEXT: renamable $vgpr3, $sgpr_null = V_ADDC_U32_e64 0, killed $vgpr0, killed $vcc_lo, 0, implicit $exec
; DISABLE-NEXT: GLOBAL_LOAD_ASYNC_TO_LDS_B128 killed $vgpr1, killed $vgpr2_vgpr3, 0, 0, implicit-def $asynccnt, implicit $exec, implicit $asynccnt :: (load store (s128), align 1, addrspace 3)
```

This PR also promotes the global address to an offset when the offset is
calculated with V_ADD_U64 on applicable gfx versions (and inversely adds
the LDS offset), whereas previously the optimization opportunity was
missed entirely.
2026-01-26 09:23:17 +01:00
LU-JOHN
8d55fa2853
[AMDGPU] Remove redundant s_cmp_* after add X, 1 (#176962)
Convert:

```
s_add_u32 X, Y, 1
s_cmp_lg_i32 X, 0

```
to:

```
s_add_u32 X, Y, 1
<invert scc uses>
```
Also delete the compare when it is s_cmp_eq_i32 X, 0; in that case
inverting scc uses is not necessary.

---------

Signed-off-by: John Lu <John.Lu@amd.com>
2026-01-23 07:51:36 -06:00
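The reasoning behind #176962 is that `s_add_u32` sets SCC to the unsigned carry-out, and `Y + 1` carries exactly when the result wraps to zero. A small model of that equivalence, with hypothetical helper names and plain integers standing in for the scalar registers:

```cpp
#include <cassert>
#include <cstdint>

// Result of the modelled add: the sum and the SCC bit it leaves behind.
struct AddResult {
  uint32_t X;
  bool Scc;
};

// Models `s_add_u32 X, Y, 1`: SCC is the unsigned carry-out of the add.
inline AddResult addOne(uint32_t Y) {
  uint32_t X = Y + 1u;
  return {X, X < Y}; // the sum wrapped around iff it is smaller than Y
}

// Models the SCC written by `s_cmp_lg_i32 X, 0` and `s_cmp_eq_i32 X, 0`.
inline bool cmpLgZero(uint32_t X) { return X != 0; }
inline bool cmpEqZero(uint32_t X) { return X == 0; }

// Checks both equivalences on a few boundary values.
inline void checkSccEquivalence() {
  const uint32_t Samples[] = {0u, 1u, 0x7FFFFFFFu, 0xFFFFFFFEu, 0xFFFFFFFFu};
  for (uint32_t Y : Samples) {
    AddResult R = addOne(Y);
    // s_cmp_eq_i32 X, 0 reproduces the carry, so the compare can go as-is.
    assert(cmpEqZero(R.X) == R.Scc);
    // s_cmp_lg_i32 X, 0 is the inverse, so SCC users must be inverted.
    assert(cmpLgZero(R.X) == !R.Scc);
  }
}
```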
Sam Elliott
7184229fea
[NFC][MI] Tidy Up RegState enum use (2/2) (#177090)
This change makes `RegState` into an enum class with bitwise operators.
It also:
- Updates declarations of flag variables/arguments/returns from
`unsigned` to `RegState`.
- Updates empty RegState initializers from 0 to `{}`.

If this is causing problems in downstream code:
- Adopt the `RegState getXXXRegState(bool)` functions instead of using a
ternary operator such as `bool ? RegState::XXX : 0`.
- Adopt the `bool hasRegState(RegState, RegState)` function instead of
using a bitwise check of the flags.
2026-01-23 00:19:03 -08:00
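For downstream code affected by #177090, the migration pattern can be sketched with a stand-in enum. Everything here lives in a hypothetical `sketch` namespace: the flag values are illustrative rather than LLVM's actual encoding, and the exact semantics of `hasRegState` are an assumption.

```cpp
#include <cassert>
#include <cstdint>

namespace sketch {

// Stand-in for the enum class; values are illustrative only.
enum class RegState : uint32_t {
  NoFlags = 0,
  Define = 1u << 1,
  Kill = 1u << 3,
};

constexpr RegState operator|(RegState A, RegState B) {
  return static_cast<RegState>(static_cast<uint32_t>(A) |
                               static_cast<uint32_t>(B));
}
constexpr RegState operator&(RegState A, RegState B) {
  return static_cast<RegState>(static_cast<uint32_t>(A) &
                               static_cast<uint32_t>(B));
}

// Replaces the ternary `IsKill ? RegState::Kill : 0`, which no longer
// compiles once the enum is scoped.
constexpr RegState getKillRegState(bool IsKill) {
  return IsKill ? RegState::Kill : RegState::NoFlags;
}

// Replaces a raw bitwise test such as `Flags & RegState::Kill`.
// Assumed semantics: all bits of Test are set in Value.
constexpr bool hasRegState(RegState Value, RegState Test) {
  return (Value & Test) == Test;
}

} // namespace sketch

// Hypothetical self-check exercising the pattern.
inline void checkRegStateSketch() {
  using namespace sketch;
  RegState Flags = RegState::Define | getKillRegState(true);
  assert(hasRegState(Flags, RegState::Kill));
  RegState Empty{}; // empty initializer instead of 0
  assert(!hasRegState(Empty, RegState::Define));
}
```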
Shilei Tian
02d34a76f7
[NFCI][AMDGPU] Remove more redundant code from GCNSubtarget.h (#177297)
With this cleanup we are getting pretty close to using
`GET_SUBTARGETINFO_MACRO` in the header.
2026-01-22 09:07:15 -05:00
Dark Steve
9429a1e809
[AMDGPU] Fix insertSimulatedTrap to return correct continuation block (#174774)
`insertSimulatedTrap` was returning `HaltLoopBB` when the trap was in a
block with no successors and was the last instruction. Since
`HaltLoopBB` gets appended to the end of the function, `FinalizeISel`
would jump there and skip any intermediate blocks, leaving their pseudos
unexpanded.

Fix by returning `MBB.getNextNode()` unconditionally:
- After `splitAt()`: `getNextNode()` returns the split-off block
(`ContBB`)
- No split, `MBB` in middle: `getNextNode()` returns the next original
block
- No split, `MBB` was last: `getNextNode()` returns `HaltLoopBB` (just
pushed)

Since we always `push_back(HaltLoopBB)` before returning,
`getNextNode()` can never be `nullptr`: if `MBB` was the last block,
`HaltLoopBB` is now after it.

Fixes: SWDEV-572407
2026-01-21 11:52:38 +05:30
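The three cases listed in #174774 can be walked through with a toy model in which a function is just an ordered list of block names. The model and all names in it are hypothetical; it only mirrors the "next block after `MBB`" argument and does not use the LLVM API.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy model: a function is an ordered list of basic block names.
using BlockList = std::vector<std::string>;

// Models the fixed behaviour: optionally split the block, always append the
// halt loop block, then return whatever now follows the original block.
inline std::string insertSimulatedTrapModel(BlockList &F, std::size_t MBBIdx,
                                            bool NeedsSplit) {
  if (NeedsSplit) // models splitAt(): the tail becomes a new block after MBB
    F.insert(F.begin() + static_cast<std::ptrdiff_t>(MBBIdx) + 1,
             std::string("ContBB"));
  F.push_back("HaltLoopBB"); // always appended at the end of the function
  return F[MBBIdx + 1];      // models MBB.getNextNode(); never out of range
}

// Checks the three cases from the commit message.
inline void checkContinuationBlock() {
  BlockList Split = {"bb.0", "bb.1"};
  assert(insertSimulatedTrapModel(Split, 0, true) == "ContBB");

  BlockList Middle = {"bb.0", "bb.1"};
  assert(insertSimulatedTrapModel(Middle, 0, false) == "bb.1");

  BlockList Last = {"bb.0", "bb.1"};
  assert(insertSimulatedTrapModel(Last, 1, false) == "HaltLoopBB");
}
```

Because the halt loop block is appended before the lookup, the index after the original block is always valid, which is the same argument the commit makes for `getNextNode()` never returning `nullptr`.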
Shilei Tian
c253b9f9ca
[AMDGPU] Fix inline constant encoding for v_pk_fmac_f16 (#176659)
This PR handles `v_pk_fmac_f16` inline constant encoding/decoding
differences between pre-GFX11 and GFX11+ hardware.

- Pre-GFX11: fp16 inline constants produce `(f16, 0)` - value in low 16
bits, zero in high.
- GFX11+: fp16 inline constants are duplicated to both halves `(f16,
f16)`.

Fixes #94116.
2026-01-20 19:14:59 -05:00
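The difference described in #176659 can be illustrated on raw bit patterns. A minimal sketch, assuming a hypothetical helper that expands a 16-bit fp16 inline constant into the 32-bit packed operand value:

```cpp
#include <cassert>
#include <cstdint>

// Expands an fp16 inline-constant bit pattern to the 32-bit packed value
// seen by v_pk_fmac_f16, following the two behaviours in the commit message.
inline uint32_t expandPkFmacInlineConst(uint16_t F16Bits, bool IsGFX11Plus) {
  uint32_t Lo = F16Bits;
  // GFX11+: duplicated into both halves. Pre-GFX11: low half only.
  return IsGFX11Plus ? ((Lo << 16) | Lo) : Lo;
}

// 0x3C00 is the IEEE half-precision bit pattern for 1.0.
inline void checkPkFmacInlineConst() {
  assert(expandPkFmacInlineConst(0x3C00, /*IsGFX11Plus=*/false) == 0x00003C00u);
  assert(expandPkFmacInlineConst(0x3C00, /*IsGFX11Plus=*/true) == 0x3C003C00u);
}
```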
Stanislav Mekhanoshin
0f739e7581
[AMDGPU] Use lambda in fmaak/fmamk f16 folding. NFC (#176258) 2026-01-16 16:01:52 -08:00
Sam Elliott
2042887709
Reland "[NFC][MI] Tidy Up RegState enum use (1/2)" (#176277)
This change prepares to make `RegState` into an enum class. It:
- Updates documentation to match the order in the code.
- Brings the `get<>RegState` functions together and makes them
`constexpr`.
- Adopts the `get<>RegState` where RegStates were being chosen with
ternary operators in backend code.
- Introduces `hasRegState` to make querying RegState easier once it is
an enum class.
- Adopts `hasRegState` where equivalent was done with bitwise
arithmetic.
- Introduces `RegState::NoFlags`, which will be used for the lack of
flags.
- Documents that `0x1` is a reserved flag value used to detect if
someone is passing `true` instead of flags (due to implicit bool to
unsigned conversions).
- Updates two calls to `MachineInstrBuilder::addReg` which were passing
`false` to the flags operand, to no longer pass a value.
- Documents that `getRegState` seems to have forgotten a call to
`getEarlyClobberRegState`.

This PR relands llvm/llvm-project#176091 (commit
1d616cdca3aba9d22f120888bb6b09b75ca90b92) which was reverted in
llvm/llvm-project#176190 (commit
6309cd8668fc2ae589f156b23f86821f4ce5b7ea).
2026-01-16 13:05:06 -08:00
Stanislav Mekhanoshin
b501f666c5
[AMDGPU] Fix expensive checks in fmaak/fmamk f16 folding (#176238)
Register classes of the sources also have to be constrained to lo128.
There are a few regressions with register coalescing in true16 mode,
but otherwise it fails verification.
2026-01-15 14:03:07 -08:00
Stanislav Mekhanoshin
5546ce99d8
[AMDGPU] Allow 16-bit imm folding in real true16 (#173318) 2026-01-15 11:15:12 -08:00
Stanislav Mekhanoshin
fa3ef64011
[AMDGPU] Create V_FMAAK_F16/V_FMAMK_F16 in true16 with imm folding (#173317)
This does not cover real true16 with tests, the next patch will.
2026-01-15 11:06:34 -08:00
Sam Elliott
6309cd8668
Revert "[NFC][MI] Tidy Up RegState enum use (1/2)" (#176190)
Reverts llvm/llvm-project#176091

Reverting because some compilers were erroring on the call to
`Reg.isReg()` (which is not `constexpr`) in a `constexpr` function.
2026-01-15 07:58:05 -08:00
Sam Elliott
1d616cdca3
[NFC][MI] Tidy Up RegState enum use (1/2) (#176091)
This change prepares to make `RegState` into an enum class. It:
- Updates documentation to match the order in the code.
- Brings the `get<>RegState` functions together and makes them
`constexpr`.
- Adopts the `get<>RegState` where RegStates were being chosen with
ternary operators in backend code.
- Introduces `hasRegState` to make querying RegState easier once it is
an enum class.
- Adopts `hasRegState` where equivalent was done with bitwise
arithmetic.
- Introduces `RegState::NoFlags`, which will be used for the lack of
flags.
- Documents that `0x1` is a reserved flag value used to detect if
someone is passing `true` instead of flags (due to implicit bool to
unsigned conversions).
- Updates two calls to `MachineInstrBuilder::addReg` which were passing
`false` to the flags operand, to no longer pass a value.
- Documents that `getRegState` seems to have forgotten a call to
`getEarlyClobberRegState`.
2026-01-15 07:47:05 -08:00
sstipano
cc1e10d50b
[AMDGPU] Disable s_add_pc_i64 instruction (#175644)
s_add_pc_i64 instruction is broken on gfx1250. Disable it by default.
2026-01-14 23:01:43 +01:00
LU-JOHN
cf237465b3
[AMDGPU] Invert scc uses to delete s_cmp_eq* (#167382)
Delete s_cmp_eq* instructions by inverting instructions that use scc.

---------

Signed-off-by: John Lu <John.Lu@amd.com>
2026-01-14 10:24:24 -06:00
Christudasan Devadasan
9e1606026c
[CodeGen][InlineSpiller] Add SubReg argument to loadRegFromStackSlot for subreg-reload (#175581)
This preparatory patch introduces an additional argument to the target hook
loadRegFromStackSlot. This is essential for targets to handle subregister-specific
reloads in the future. See PR #175002 for how this is used for the AMDGPU target.
2026-01-13 08:21:58 +05:30
Christudasan Devadasan
e486a26b9c
[AMDGPU] Add liverange split instructions into BB Prolog (#117544)
The COPY inserted for a liverange split during the sgpr-regalloc
pipeline currently breaks the BB prolog during the subsequent
vgpr-regalloc phase while spilling and/or splitting the vector
liveranges. This patch fixes it by correctly including the
LR split instructions inserted during the sgpr-regalloc and
wwm-regalloc pipelines in the BB prolog.
2026-01-09 21:25:14 +05:30
LU-JOHN
49381c3000
[NFC][AMDGPU] Declare variables initialized with getDebugLoc as const ref (#174434)
Declare variables initialized with getDebugLoc as a const reference.

Signed-off-by: John Lu <John.Lu@amd.com>
2026-01-05 12:37:47 -06:00
Matt Arsenault
9ad39dd116
AMDGPU: Avoid crashing on statepoint-like pseudoinstructions (#170657)
At the moment the MIR tests are somewhat redundant. The waitcnt
one is needed to ensure we actually have a load, given we are
currently just emitting an error on ExternalSymbol. The asm printer
one is more redundant for the moment, since it's stressed by the IR
test. However I am planning to change the error path for the IR test,
so it will soon not be redundant.
2025-12-29 19:08:08 +01:00
Jay Foad
515c3bdda0
[AMDGPU] Stop handling soft waitcnts in pseudoToMCOpcode. NFC. (#172278)
Since #87539 all soft waitcnts should have been promoted by
SIInsertWaitcnts.
2025-12-15 11:33:55 +00:00
Juan Manuel Martinez Caamaño
55c0e2e20f
[AMDGPU] Add missing cases for V_INDIRECT_REG_{READ/WRITE}_GPR_IDX and V/S_INDIRECT_REG_WRITE_MOVREL (#171835)
A buildbot failure in https://github.com/llvm/llvm-project/pull/170323
when expensive checks were used highlighted that some of these patterns
were missing.

This patch adds `V_INDIRECT_REG_{READ/WRITE}_GPR_IDX` and
`V/S_INDIRECT_REG_WRITE_MOVREL` for `V6` and `V7` vector sizes.
2025-12-12 15:45:34 +00:00
Stanislav Mekhanoshin
bdea6a2dc2
[AMDGPU] Add verifier for flat_scr_base_hi read hazard (#170550) 2025-12-04 15:22:05 -08:00
Pierre van Houtryve
8feb6762ba
[AMDGPU] Take BUF instructions into account in mayAccessScratchThroughFlat (#170274)
BUF instructions can access the scratch address space, so
SIInsertWaitcnts needs to be able to track the SCRATCH_WRITE_ACCESS event
for such BUF instructions.

The release-vgprs.mir test had to be updated because BUF instructions
without an MMO are now tracked as a SCRATCH_WRITE_ACCESS. I added an MMO
that touches global to keep the test result unchanged. I also added a
couple of test cases with no MMO to test the corrected behavior.
2025-12-03 10:37:58 +01:00
Stanislav Mekhanoshin
83ab875b83
[AMDGPU] Handle phys regs in flat_scratch_base_hi operand check (#170395) 2025-12-02 17:22:07 -08:00
Stanislav Mekhanoshin
9dd3346589
[AMDGPU] Prevent folding of flat_scr_base_hi into a 64-bit SALU (#170373)
Fixes: SWDEV-563886
2025-12-02 16:08:00 -08:00
Prasoon Mishra
1cea4a0841
[AMDGPU][NPM] Fix CFG invalidation detection in insertSimulatedTrap (#169290)
When SIMULATED_TRAP is at the end of a block with no successors,
insertSimulatedTrap incorrectly returns the original MBB despite adding
HaltLoopBB to the CFG.

EmitInstrWithCustomInserter detects CFG changes by comparing the
returned MBB with the original. When they match, it assumes no
modification occurred and skips MachineLoopInfo invalidation. This
causes stale loop information in subsequent passes, particularly when
using the NPM which relies on accurate invalidation signals.

Fix: Return HaltLoopBB to properly signal the CFG modification.
2025-11-28 13:45:46 +05:30
Jay Foad
d748c81218
[AMDGPU] Change the immediate operand of s_waitcnt_depctr / s_wait_alu (#169378)
The 16-bit immediate operand of s_waitcnt_depctr / s_wait_alu has some
unused bits. Previously codegen would set these bits to 1, but setting
them to 0 matches the SP3 assembler behaviour better, which in turn
means that we can print them using the human readable SP3 syntax:

s_wait_alu 0xfffd ; unused bits set to 1
s_wait_alu 0xff9d ; unused bits set to 0
s_wait_alu depctr_va_vcc(0) ; unused bits set to 0, human readable

Note that the set of unused bits changed between GFX10.1 and GFX10.3.
2025-11-25 11:55:26 +00:00
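The two encodings shown in #169378 differ only in the unused bits. A minimal sketch of that relationship, where the mask `0x0060` is inferred solely from the `0xfffd` / `0xff9d` pair above and is therefore an assumption that holds for that example only (the commit notes that the set of unused bits differs between GFX10.1 and GFX10.3):

```cpp
#include <cassert>
#include <cstdint>

// Assumed mask of unused bits (bits 6:5), derived from 0xfffd ^ 0xff9d.
// Not valid for every subtarget.
constexpr uint16_t UnusedBitsMask = 0x0060;

// Clears the unused bits, turning the old all-ones style encoding into the
// form that matches the SP3 assembler.
constexpr uint16_t clearUnusedBits(uint16_t Imm) {
  return static_cast<uint16_t>(Imm & ~UnusedBitsMask);
}

// Checks the example from the commit message.
inline void checkClearUnusedBits() {
  assert(clearUnusedBits(0xfffd) == 0xff9d);
  // Already-clean encodings are left unchanged.
  assert(clearUnusedBits(0xff9d) == 0xff9d);
}
```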
Nicolai Hähnle
f581d8ad8f
AMDGPU: Fix a comment (#169403)
This verifier check will complain if there aren't enough implicit
operands -- so it doesn't *allow* those operands, it *requires* them.
2025-11-24 20:54:53 +00:00
Nathan Corbyn
4511c355c3
Revert "[AMDGPU] Remove leftover implicit operands from SI_SPILL/SI_RESTORE." (#169068)
PR causes build failures with expensive checks enabled

Reverts llvm/llvm-project#168546
2025-11-21 17:52:08 +00:00
Nicolai Hähnle
ac55d7859f
AMDGPU: Don't duplicate implicit operands in 3-address conversion (#168426)
We previously got a duplicate implicit $exec operand. It didn't really
hurt anything (other than being a slight drag on compile-time
performance). Still, let's keep things clean.
2025-11-20 16:25:47 -08:00
LU-JOHN
b79a665f71
[AMDGPU] Remove leftover implicit operands from SI_SPILL/SI_RESTORE. (#168546)
Remove leftover implicit operands from SI_SPILL/SI_RESTORE.

---------

Signed-off-by: John Lu <John.Lu@amd.com>
2025-11-19 09:02:03 -06:00
LU-JOHN
9fa15ef916
[AMDGPU] When shrinking and/or to bitset*, remove implicit scc def (#168128)
When shrinking and/or to bitset*, remove the leftover implicit scc def.
bitset* instructions do not set scc.

Signed-off-by: John Lu <John.Lu@amd.com>
2025-11-15 09:21:43 -06:00
Matt Arsenault
b2f12331ab
AMDGPU: Fix verifier error when waterfall call target is in AV register (#168017) 2025-11-14 09:49:40 -08:00
Jay Foad
72c69aefba
[AMDGPU] Make use of getFunction and getMF. NFC. (#167872) 2025-11-14 11:00:57 +00:00
Mariusz Sikora
4cd836181f
[AMDGPU] Lower S_ABSDIFF_I32 to VALU instructions (#167691)
Added support for lowering the scalar S_ABSDIFF_I32 instruction to
equivalent VALU operations.
2025-11-13 14:35:44 +01:00
Nicolai Hähnle
66366599a9
CodeGen/AMDGPU: Allow 3-address conversion of bundled instructions (#166213)
This is in preparation for future changes in AMDGPU that will make more
substantial use of bundles pre-RA. For now, simply test this with
degenerate (single-instruction) bundles.
2025-11-12 22:04:46 +00:00