llvm-project

Author	SHA1	Message	Date
Stanislav Mekhanoshin	1a9c61f004	[AMDGPU] Non convergent instruction does not depend on EXEC. NFCI. (#179821 )	2026-02-09 23:24:45 -08:00
Shilei Tian	65b4099219	[AMDGPU] Fix instruction size for 64-bit literal constant operands (#180387 ) `getLit64Encoding` uses a different approach to determine whether 64-bit literal encoding is used, which caused a size mismatch between the `MachineInstr` and the `MCInst`. For `!isValid32BitLiteral`, it is effectively `!(isInt<32>(Val) || isUInt<32>(Val))`, which is `!isInt<32>(Val) && !isUInt<32>(Val)`, but in `getLit64Encoding`, it is `!isInt<32>(Val) || !isUInt<32>(Val)`.	2026-02-09 14:31:52 +00:00
Petr Kurapov	27a8ab09fa	[AMDGPU] Fix V_INDIRECT_REG_READ_GPR_IDX expansion with immediate index (#179699 ) The definition for V_INDIRECT_REG_READ_GPR_IDX_B32_V*'s SSrc_b32 operand allows immediates, but the expansion logic handles only register cases now. This can result in expansion failures when e.g. llvm.amdgcn.wave.reduce.umin.i32 is folded into a constant and then used as an insertelement idx.	2026-02-09 11:33:30 +01:00
Vladimir Vereschaka	19d681177f	Revert "[MC][TableGen] Expand Opcode field of MCInstrDesc" (#180321 ) Reverts llvm/llvm-project#179652 This PR causes the out-of-memory build failures on many Windows builders.	2026-02-06 21:58:50 -08:00
sstipano	13d8870d45	[MC][TableGen] Expand Opcode field of MCInstrDesc (#179652 ) Increase width of Opcode to `int` from `short` to allow more capacity.	2026-02-06 20:21:48 +01:00
Pierre van Houtryve	b738491d2f	[AMDGPU][GFX12.5] Add support for emitting memory operations with nv bit set (#179413 ) - Add `MONonVolatile` MachineMemOperand flag. - Set nv=1 on memory operations on GFX12.5 if the operation accesses a constant address space, is an invariant load, or has the `MONonVolatile` flag set.	2026-02-06 11:35:46 +01:00
Diana Picus	9022f47ca4	[AMDGPU] Implement llvm.sponentry (#176357 ) In some of our use cases, the GPU runtime stores some data at the top of the stack. It figures out where it's safe to store it by using the PAL metadata generated by the backend, which includes the total stack size. However, the metadata does not include the space reserved at the bottom of the stack for the trap handler when CWSR is enabled in dynamic VGPR mode. This space is reserved dynamically based on whether or not the code is running on the compute queue. Therefore, the runtime needs a way to take that into account. Add support for `llvm.sponentry`, which should return the base of the stack, skipping over any reserved areas. This allows us to keep this computation in one place rather than duplicate it between the backend and the runtime. The implementation for functions that set up their own stack uses a pseudo that is expanded to the same code sequence as that used in the prolog to set up the stack in the first place. In callable functions, we generate a fixed stack object and use that instead, similar to the Arm/AArch64 approach. This wastes some stack space but that's not a problem for now because we're not planning to use this in callable functions yet.	2026-02-03 15:02:07 +01:00
vporpo	1658456ccf	[AMDGPU] Introduce custom MIR formatting for s_wait_alu (#176316 ) This patch implements a custom printer/parser for the immediate operand of s_wait_alu that prints/parses the decoded counter values. Format: ``` .<counter1>_<value1>_<counter2>_<value2> ``` Example: `s_wait_alu .VaVdst_1_VmVsrc_1` ; Which is equivalent to this: `s_wait_alu 8167` Features: - If a counter is at its maximum value it won't get printed. - The parser will error out if a counter is greater or equal to its max value. - If all counters are disabled we can use 'AllOff'. - For now we also accept numeric values for backwards compatibility with older MIR. Note: This is similar to https://github.com/llvm/llvm-project/pull/96004 but for `s_wait_alu`.	2026-01-31 10:46:59 -08:00
Janek van Oirschot	d1e2ddf997	[AMDGPU] Emit b32 movs if (a)v_mov_b64_pseudo dest vgprs are misaligned (#160547 ) #154115 exposed a possible misaligned v_mov_b64 destination. This relaxes the v_mov_b64_pseudo register class constraint (which matches av_mov_b64_pseudo's register class).	2026-01-30 15:01:14 +00:00
Jay Foad	dbd4240130	[AMDGPU] Fix DEALLOC_VGPRS in the presence of spills to scratch (#178461 )	2026-01-29 20:57:16 +01:00
Jay Foad	3f1386b986	[AMDGPU] Add braces around a switch case. NFC. (#178637 )	2026-01-29 12:10:03 +00:00
vporpo	21dad8e5cc	[AMDGPU] Improve crash message when S_WAITCNT_DEPCTR is missing its operand (#177065 ) The code in the test is causing a crash in `SIInstrInfo.cpp` `fixImplicitOperands()` in `MI.implicit_operands()`: ``` for (auto &Op : MI.implicit_operands()) { ``` MachineInstr.h: ``` mop_range implicit_operands() { => return operands_impl().drop_front(getNumExplicitOperands()); } ``` We are trying to drop 1 operand from the operands of MI which are 0. By early returning we are no longer crashing at that point and we are getting a more meaningful error message: ``` * Bad machine code: Too few operands * - function: missing_operand_crash - basic block: %bb.0 (0x5a9d30ced988) - instruction: S_WAITCNT_DEPCTR 1 operands expected, but 0 given. ``` The code is still crashing at a different location, but at least we are getting an error message.	2026-01-26 08:35:30 -08:00
Jay Foad	017f2bc181	[AMDGPU] Simplify legalization of PHI operands (#177352 ) In practice when legalizeOperands is called on a PHI node, the result is never an SGPR class and the operands are never subregs. Simplify the code accordingly by using the result regclass for all the inputs. This includes using an AV class where previously we picked either an AGPR or VGPR class.	2026-01-26 15:39:13 +00:00
Mariusz Sikora	3c0f5045e1	[AMDGPU] Add FeatureGFX13 and SMEM encoding for gfx13 (#177567 ) For now list of features is based on gfx12 and gfx1250 --------- Co-authored-by: Jay Foad <jay.foad@amd.com>	2026-01-26 14:16:36 +01:00
Ryan Mitchell	13b20e7aea	[AMDGPU][SILoadStoreOptimizer] Fix lds address operand offset (#176816 ) The offset operand in GLOBAL_LOAD_ASYNC_TO_LDS_B128, for instance, is added to both the lds and global address, but SILoadStoreOptimizer is currently unaware of that. This PR inserts an add to counteract the offset meant for the global address. This one add is better than not doing the optimization at all, and having to insert 2 adds for each global address calculation (with no offset). ``` ; ENABLE-LABEL: name: promote_async_load_offset ; ENABLE: liveins: $ttmp7, $vgpr0, $sgpr0_sgpr1 ; ENABLE-NEXT: {{ $}} ; ENABLE-NEXT: renamable $vgpr1 = V_LSHLREV_B32_e32 8, $vgpr0, implicit $exec ; ENABLE-NEXT: renamable $vgpr2, renamable $vcc_lo = V_ADD_CO_U32_e64 $vgpr0, 512, 0, implicit $exec ; ENABLE-NEXT: renamable $vgpr3, dead $sgpr_null = V_ADDC_U32_e64 0, killed $vgpr0, killed $vcc_lo, 0, implicit $exec ; ENABLE-NEXT: renamable $vgpr1 = disjoint V_OR_B32_e32 0, killed $vgpr1, implicit $exec ; ENABLE-NEXT: renamable $vgpr0 = V_ADD_U32_e32 256, $vgpr1, implicit $exec ; ENABLE-NEXT: GLOBAL_LOAD_ASYNC_TO_LDS_B128 killed $vgpr0, $vgpr2_vgpr3, -256, 0, implicit-def $asynccnt, implicit $exec, implicit $asynccnt :: (load store (s128), align 1, addrspace 3) ; ENABLE-NEXT: GLOBAL_LOAD_ASYNC_TO_LDS_B128 killed $vgpr1, killed $vgpr2_vgpr3, 0, 0, implicit-def $asynccnt, implicit $exec, implicit $asynccnt :: (load store (s128), align 1, addrspace 3) ; DISABLE-LABEL: name: promote_async_load_offset ; DISABLE: liveins: $ttmp7, $vgpr0, $sgpr0_sgpr1 ; DISABLE-NEXT: {{ $}} ; DISABLE-NEXT: renamable $vgpr1 = V_LSHLREV_B32_e32 8, $vgpr0, implicit $exec ; DISABLE-NEXT: renamable $vgpr2, renamable $vcc_lo = V_ADD_CO_U32_e64 256, $vgpr0, 0, implicit $exec ; DISABLE-NEXT: renamable $vgpr3, $sgpr_null = V_ADDC_U32_e64 0, $vgpr0, killed $vcc_lo, 0, implicit $exec ; DISABLE-NEXT: renamable $vgpr1 = disjoint V_OR_B32_e32 0, killed $vgpr1, implicit $exec ; DISABLE-NEXT: GLOBAL_LOAD_ASYNC_TO_LDS_B128 $vgpr1, killed $vgpr2_vgpr3, 0, 0, implicit-def $asynccnt, implicit $exec, implicit $asynccnt :: (load store (s128), align 1, addrspace 3) ; DISABLE-NEXT: renamable $vgpr2, renamable $vcc_lo = V_ADD_CO_U32_e64 512, $vgpr0, 0, implicit $exec ; DISABLE-NEXT: renamable $vgpr3, $sgpr_null = V_ADDC_U32_e64 0, killed $vgpr0, killed $vcc_lo, 0, implicit $exec ; DISABLE-NEXT: GLOBAL_LOAD_ASYNC_TO_LDS_B128 killed $vgpr1, killed $vgpr2_vgpr3, 0, 0, implicit-def $asynccnt, implicit $exec, implicit $asynccnt :: (load store (s128), align 1, addrspace 3) ``` This PR also promotes the global address to an offset when the offset is calculated with V_ADD_U64 on applicable gfx versions, (and inversely adds the LDS offset), whereas previously the optimization opportunity was missed entirely.	2026-01-26 09:23:17 +01:00
LU-JOHN	8d55fa2853	[AMDGPU] Remove redundant s_cmp_* after add X, 1 (#176962 ) Convert: ``` s_add_u32 X, Y, 1 s_cmp_lg_i32 X, 0 ``` to: ``` s_add_u32 X, Y, 1 <invert scc uses> ``` Also delete with s_cmp_eq_i32 X, 0, but inverting scc uses is not necessary. --------- Signed-off-by: John Lu <John.Lu@amd.com>	2026-01-23 07:51:36 -06:00
Sam Elliott	7184229fea	[NFC][MI] Tidy Up RegState enum use (2/2) (#177090 ) This Change makes `RegState` into an enum class, with bitwise operators. It also: - Updates declarations of flag variables/arguments/returns from `unsigned` to `RegState`. - Updates empty RegState initializers from 0 to `{}`. If this is causing problems in downstream code: - Adopt the `RegState getXXXRegState(bool)` functions instead of using a ternary operator such as `bool ? RegState::XXX : 0`. - Adopt the `bool hasRegState(RegState, RegState)` function instead of using a bitwise check of the flags.	2026-01-23 00:19:03 -08:00
Shilei Tian	02d34a76f7	[NFCI][AMDGPU] Remove more redundant code from `GCNSubtarget.h` (#177297 ) We are getting pretty close to use `GET_SUBTARGETINFO_MACRO` in the header with this cleanup.	2026-01-22 09:07:15 -05:00
Dark Steve	9429a1e809	[AMDGPU] Fix insertSimulatedTrap to return correct continuation block (#174774 ) `insertSimulatedTrap` was returning `HaltLoopBB` when the trap was in a block with no successors and was the last instruction. Since `HaltLoopBB` gets appended to the end of the function, `FinalizeISel` would jump there and skip any intermediate blocks, leaving their pseudos unexpanded. Fix by returning `MBB.getNextNode()` unconditionally: - After `splitAt()`: `getNextNode()` returns the split-off block (`ContBB`) - No split, `MBB` in middle: `getNextNode()` returns the next original block - No split, `MBB` was last: `getNextNode()` returns `HaltLoopBB` (just pushed) Since we always `push_back(HaltLoopBB)` before returning, `getNextNode()` can never be `nullptr`: if `MBB` was the last block, `HaltLoopBB` is now after it. Fixes: SWDEV-572407	2026-01-21 11:52:38 +05:30
Shilei Tian	c253b9f9ca	[AMDGPU] Fix inline constant encoding for `v_pk_fmac_f16` (#176659 ) This PR handles `v_pk_fmac_f16` inline constant encoding/decoding differences between pre-GFX11 and GFX11+ hardware. - Pre-GFX11: fp16 inline constants produce `(f16, 0)` - value in low 16 bits, zero in high. - GFX11+: fp16 inline constants are duplicated to both halves `(f16, f16)`. Fixes #94116.	2026-01-20 19:14:59 -05:00
Stanislav Mekhanoshin	0f739e7581	[AMDGPU] Use lambda in fmaak/fmamk f16 folding. NFC (#176258 )	2026-01-16 16:01:52 -08:00
Sam Elliott	2042887709	Reland "[NFC][MI] Tidy Up RegState enum use (1/2)" (#176277 ) This Change is to prepare to make RegState into an enum class. It: - Updates documentation to match the order in the code. - Brings the `get<>RegState` functions together and makes them `constexpr`. - Adopts the `get<>RegState` where RegStates were being chosen with ternary operators in backend code. - Introduces `hasRegState` to make querying RegState easier once it is an enum class. - Adopts `hasRegState` where equivalent was done with bitwise arithmetic. - Introduces `RegState::NoFlags`, which will be used for the lack of flags. - Documents that `0x1` is a reserved flag value used to detect if someone is passing `true` instead of flags (due to implicit bool to unsigned conversions). - Updates two calls to `MachineInstrBuilder::addReg` which were passing `false` to the flags operand, to no longer pass a value. - Documents that `getRegState` seems to have forgotten a call to `getEarlyClobberRegState`. This PR relands llvm/llvm-project#176091 (commit 1d616cdca3aba9d22f120888bb6b09b75ca90b92) which was reverted in llvm/llvm-project#176190 (commit 6309cd8668fc2ae589f156b23f86821f4ce5b7ea).	2026-01-16 13:05:06 -08:00
Stanislav Mekhanoshin	b501f666c5	[AMDGPU] Fix expensive checks in fmaak/fmamk f16 folding (#176238 ) Register classes of sources also have to be constrained to lo128. There are a few regressions with register coalescing in true16 mode, but otherwise it fails verification.	2026-01-15 14:03:07 -08:00
Stanislav Mekhanoshin	5546ce99d8	[AMDGPU] Allow 16-bit imm folding in real true16 (#173318 )	2026-01-15 11:15:12 -08:00
Stanislav Mekhanoshin	fa3ef64011	[AMDGPU] Create V_FMAAK_F16/V_FMAMK_F16 in true16 with imm folding (#173317 ) This does not cover real true16 with tests, the next patch will.	2026-01-15 11:06:34 -08:00
Sam Elliott	6309cd8668	Revert "[NFC][MI] Tidy Up RegState enum use (1/2)" (#176190 ) Reverts llvm/llvm-project#176091 Reverting because some compilers were erroring on the call to `Reg.isReg()` (which is not `constexpr`) in a `constexpr` function.	2026-01-15 07:58:05 -08:00
Sam Elliott	1d616cdca3	[NFC][MI] Tidy Up RegState enum use (1/2) (#176091 ) This Change is to prepare to make RegState into an enum class. It: - Updates documentation to match the order in the code. - Brings the `get<>RegState` functions together and makes them `constexpr`. - Adopts the `get<>RegState` where RegStates were being chosen with ternary operators in backend code. - Introduces `hasRegState` to make querying RegState easier once it is an enum class. - Adopts `hasRegState` where equivalent was done with bitwise arithmetic. - Introduces `RegState::NoFlags`, which will be used for the lack of flags. - Documents that `0x1` is a reserved flag value used to detect if someone is passing `true` instead of flags (due to implicit bool to unsigned conversions). - Updates two calls to `MachineInstrBuilder::addReg` which were passing `false` to the flags operand, to no longer pass a value. - Documents that `getRegState` seems to have forgotten a call to `getEarlyClobberRegState`.	2026-01-15 07:47:05 -08:00
sstipano	cc1e10d50b	[AMDGPU] Disable s_add_pc_i64 instruction (#175644 ) s_add_pc_i64 instruction is broken on gfx1250. Disable it by default.	2026-01-14 23:01:43 +01:00
LU-JOHN	cf237465b3	[AMDGPU] Invert scc uses to delete s_cmp_eq* (#167382 ) Delete s_cmp_eq* instructions by inverting instructions that use scc. --------- Signed-off-by: John Lu <John.Lu@amd.com>	2026-01-14 10:24:24 -06:00
Christudasan Devadasan	9e1606026c	[CodeGen][InlineSpiller] Add SubReg argument to loadRegFromStackSlot for subreg-reload (#175581 ) This preparatory patch introduces an additional argument to the target hook loadRegFromStackSlot. This is essential for targets to handle subregister-specific reload in the future. See how this is used for AMDGPU target with PR #175002.	2026-01-13 08:21:58 +05:30
Christudasan Devadasan	e486a26b9c	[AMDGPU] Add liverange split instructions into BB Prolog (#117544 ) The COPY inserted for liverange split during sgpr-regalloc pipeline currently breaks the BB prolog during the subsequent vgpr-regalloc phase while spilling and/or splitting the vector liveranges. This patch fixes it by correctly including the LR split instructions during sgpr-regalloc and wwm-regalloc pipelines into the BB prolog.	2026-01-09 21:25:14 +05:30
LU-JOHN	49381c3000	[NFC][AMDGPU] Declare variables initialized with getDebugLoc as const ref (#174434 ) Declare variables initialized with getDebugLoc as a const reference. Signed-off-by: John Lu <John.Lu@amd.com>	2026-01-05 12:37:47 -06:00
Matt Arsenault	9ad39dd116	AMDGPU: Avoid crashing on statepoint-like pseudoinstructions (#170657 ) At the moment the MIR tests are somewhat redundant. The waitcnt one is needed to ensure we actually have a load, given we are currently just emitting an error on ExternalSymbol. The asm printer one is more redundant for the moment, since it's stressed by the IR test. However I am planning to change the error path for the IR test, so it will soon not be redundant.	2025-12-29 19:08:08 +01:00
Jay Foad	515c3bdda0	[AMDGPU] Stop handling soft waitcnts in pseudoToMCOpcode. NFC. (#172278 ) Since #87539 all soft waitcnts should have been promoted by SIInsertWaitcnts.	2025-12-15 11:33:55 +00:00
Juan Manuel Martinez Caamaño	55c0e2e20f	[AMDGPU] Add missing cases for V_INDIRECT_REG_{READ/WRITE}_GPR_IDX and V/S_INDIRECT_REG_WRITE_MOVREL (#171835 ) A buildbot failure in https://github.com/llvm/llvm-project/pull/170323 when expensive checks were used highlighted that some of these patterns were missing. This patch adds `V_INDIRECT_REG_{READ/WRITE}_GPR_IDX` and `V/S_INDIRECT_REG_WRITE_MOVREL` for `V6` and `V7` vector sizes.	2025-12-12 15:45:34 +00:00
Stanislav Mekhanoshin	bdea6a2dc2	[AMDGPU] Add verifier for flat_scr_base_hi read hazard (#170550 )	2025-12-04 15:22:05 -08:00
Pierre van Houtryve	8feb6762ba	[AMDGPU] Take BUF instructions into account in mayAccessScratchThroughFlat (#170274 ) BUF instructions can access the scratch address space, so SIInsertWaitCnt needs to be able to track the SCRATCH_WRITE_ACCESS event for such BUF instructions. The release-vgprs.mir test had to be updated because BUF instructions w/o a MMO are now tracked as a SCRATCH_WRITE_ACCESS. I added a MMO that touches global to keep the test result unchanged. I also added a couple of testcases with no MMO to test the corrected behavior.	2025-12-03 10:37:58 +01:00
Stanislav Mekhanoshin	83ab875b83	[AMDGPU] Handle phys regs in flat_scratch_base_hi operand check (#170395 )	2025-12-02 17:22:07 -08:00
Stanislav Mekhanoshin	9dd3346589	[AMDGPU] Prevent folding of flat_scr_base_hi into a 64-bit SALU (#170373 ) Fixes: SWDEV-563886	2025-12-02 16:08:00 -08:00
Prasoon Mishra	1cea4a0841	[AMDGPU][NPM] Fix CFG invalidation detection in insertSimulatedTrap (#169290 ) When SIMULATED_TRAP is at the end of a block with no successors, insertSimulatedTrap incorrectly returns the original MBB despite adding HaltLoopBB to the CFG. EmitInstrWithCustomInserter detects CFG changes by comparing the returned MBB with the original. When they match, it assumes no modification occurred and skips MachineLoopInfo invalidation. This causes stale loop information in subsequent passes, particularly when using the NPM which relies on accurate invalidation signals. Fix: Return HaltLoopBB to properly signal the CFG modification.	2025-11-28 13:45:46 +05:30
Jay Foad	d748c81218	[AMDGPU] Change the immediate operand of s_waitcnt_depctr / s_wait_alu (#169378 ) The 16-bit immediate operand of s_waitcnt_depctr / s_wait_alu has some unused bits. Previously codegen would set these bits to 1, but setting them to 0 matches the SP3 assembler behaviour better, which in turn means that we can print them using the human readable SP3 syntax: s_wait_alu 0xfffd ; unused bits set to 1 s_wait_alu 0xff9d ; unused bits set to 0 s_wait_alu depctr_va_vcc(0) ; unused bits set to 0, human readable Note that the set of unused bits changed between GFX10.1 and GFX10.3.	2025-11-25 11:55:26 +00:00
Nicolai Hähnle	f581d8ad8f	AMDGPU: Fix a comment (#169403 ) This verifier check will complain if there aren't enough implicit operands -- so it doesn't allow those operands, it requires them.	2025-11-24 20:54:53 +00:00
Nathan Corbyn	4511c355c3	Revert "[AMDGPU] Remove leftover implicit operands from SI_SPILL/SI_RESTORE." (#169068 ) PR causes build failures with expensive checks enabled Reverts llvm/llvm-project#168546	2025-11-21 17:52:08 +00:00
Nicolai Hähnle	ac55d7859f	AMDGPU: Don't duplicate implicit operands in 3-address conversion (#168426 ) We previously got a duplicate implicit $exec operand. It didn't really hurt anything (other than being a slight drag on compile-time performance). Still, let's keep things clean.	2025-11-20 16:25:47 -08:00
LU-JOHN	b79a665f71	[AMDGPU] Remove leftover implicit operands from SI_SPILL/SI_RESTORE. (#168546 ) Remove leftover implicit operands from SI_SPILL/SI_RESTORE. --------- Signed-off-by: John Lu <John.Lu@amd.com>	2025-11-19 09:02:03 -06:00
LU-JOHN	9fa15ef916	[AMDGPU] When shrinking and/or to bitset, remove implicit scc def (#168128 ) When shrinking and/or to bitset remove leftover implicit scc def. bitset* instructions do not set scc. Signed-off-by: John Lu <John.Lu@amd.com>	2025-11-15 09:21:43 -06:00
Matt Arsenault	b2f12331ab	AMDGPU: Fix verifier error when waterfall call target is in AV register (#168017 )	2025-11-14 09:49:40 -08:00
Jay Foad	72c69aefba	[AMDGPU] Make use of getFunction and getMF. NFC. (#167872 )	2025-11-14 11:00:57 +00:00
Mariusz Sikora	4cd836181f	[AMDGPU] Lower S_ABSDIFF_I32 to VALU instructions (#167691 ) Added support for lowering the scalar S_ABSDIFF_I32 instruction to equivalent VALU operations.	2025-11-13 14:35:44 +01:00
Nicolai Hähnle	66366599a9	CodeGen/AMDGPU: Allow 3-address conversion of bundled instructions (#166213 ) This is in preparation for future changes in AMDGPU that will make more substantial use of bundles pre-RA. For now, simply test this with degenerate (single-instruction) bundles.	2025-11-12 22:04:46 +00:00
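
Note on 65b4099219: the commit message describes a De Morgan mismatch between two literal-size checks. A minimal sketch of that difference follows. `fitsInt32`, `fitsUInt32`, `needs64BitLiteral` and `mismatchedCheck` are hypothetical stand-ins written for this illustration; they are not the LLVM functions named in the commit.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical stand-ins for the isInt<32> / isUInt<32> checks named in the
// commit message; not the LLVM implementations.
constexpr bool fitsInt32(int64_t V) {
  return V >= std::numeric_limits<int32_t>::min() &&
         V <= std::numeric_limits<int32_t>::max();
}
constexpr bool fitsUInt32(int64_t V) {
  return V >= 0 &&
         V <= static_cast<int64_t>(std::numeric_limits<uint32_t>::max());
}

// !(A || B) == !A && !B: a 64-bit literal is needed only when the value fits
// neither 32-bit interpretation.
constexpr bool needs64BitLiteral(int64_t V) {
  return !fitsInt32(V) && !fitsUInt32(V);
}

// The mismatching form, !A || !B: true as soon as one interpretation fails.
constexpr bool mismatchedCheck(int64_t V) {
  return !fitsInt32(V) || !fitsUInt32(V);
}

// -1 fits as a signed 32-bit value, so the two checks disagree.
static_assert(!needs64BitLiteral(-1) && mismatchedCheck(-1),
              "checks disagree for -1");
// 0xFFFFFFFF fits as an unsigned 32-bit value, so they disagree again.
static_assert(!needs64BitLiteral(0xFFFFFFFFLL) && mismatchedCheck(0xFFFFFFFFLL),
              "checks disagree for 0xFFFFFFFF");
// A value outside both ranges is flagged by both checks.
static_assert(needs64BitLiteral(1LL << 40) && mismatchedCheck(1LL << 40),
              "checks agree for 1 << 40");
```

The two checks differ exactly on values that fit one 32-bit interpretation but not the other, which is where a size mismatch of the kind the commit describes could arise.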
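
Note on 8d55fa2853: the rewrite in that commit can be checked with a small model. The sketch below assumes that `s_add_u32` sets SCC to the unsigned carry-out and that the compares set SCC to the comparison result. The helper names (`sAddU32`, `sCmpLgI32`, `sCmpEqI32`, `checkAddOne`) are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// Result of a modeled scalar instruction: the destination value and SCC.
struct ScalarResult {
  uint32_t Value;
  bool SCC;
};

// Model of s_add_u32, assuming SCC holds the unsigned carry-out.
constexpr ScalarResult sAddU32(uint32_t A, uint32_t B) {
  return {static_cast<uint32_t>(static_cast<uint64_t>(A) + B),
          ((static_cast<uint64_t>(A) + B) >> 32) != 0};
}

// Models of the compares, assuming SCC holds the comparison result.
constexpr bool sCmpLgI32(uint32_t A, uint32_t B) { return A != B; }
constexpr bool sCmpEqI32(uint32_t A, uint32_t B) { return A == B; }

// After X = Y + 1, the carry is set exactly when X wrapped to 0. So
// "X == 0" equals the add's SCC, and "X != 0" is its inverse.
constexpr bool checkAddOne(uint32_t Y) {
  return sCmpEqI32(sAddU32(Y, 1).Value, 0) == sAddU32(Y, 1).SCC &&
         sCmpLgI32(sAddU32(Y, 1).Value, 0) == !sAddU32(Y, 1).SCC;
}

static_assert(checkAddOne(0), "no wrap: SCC clear, X != 0");
static_assert(checkAddOne(0xFFFFFFFFu), "wrap: SCC set, X == 0");
```

Under these assumptions the `s_cmp_eq_i32 X, 0` case needs no inversion, while the `s_cmp_lg_i32 X, 0` case needs SCC uses inverted, matching the commit message.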
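
Note on 7184229fea: the migration advice in that commit (use `getXXXRegState(bool)` instead of a ternary with `0`, and `hasRegState` instead of a bitwise test) can be sketched on a toy scoped enum. Everything below is a hypothetical stand-in: the flag values, the choice of `getKillRegState` as the example helper, and the exact semantics of `hasRegState` are assumptions, and the real LLVM definitions may differ.

```cpp
#include <cassert>

// Toy scoped enum with hypothetical flag values (0x1 is left unused here,
// since the commit log says that value is reserved).
enum class RegState : unsigned {
  NoFlags = 0,
  Define = 0x2,
  Implicit = 0x4,
  Kill = 0x8,
};

// A scoped enum has no implicit integer conversions, so the bitwise
// operators must be declared explicitly.
constexpr RegState operator|(RegState A, RegState B) {
  return static_cast<RegState>(static_cast<unsigned>(A) |
                               static_cast<unsigned>(B));
}
constexpr RegState operator&(RegState A, RegState B) {
  return static_cast<RegState>(static_cast<unsigned>(A) &
                               static_cast<unsigned>(B));
}

// Assumed semantics for this sketch: true when every bit of Test is set.
constexpr bool hasRegState(RegState Value, RegState Test) {
  return (Value & Test) == Test;
}

// Replaces `B ? RegState::Kill : 0`, which no longer compiles because 0 does
// not convert to a scoped enum.
constexpr RegState getKillRegState(bool B) {
  return B ? RegState::Kill : RegState{};
}

static_assert(hasRegState(getKillRegState(true) | RegState::Implicit,
                          RegState::Kill),
              "kill flag is set");
static_assert(!hasRegState(getKillRegState(false) | RegState::Implicit,
                           RegState::Kill),
              "kill flag is clear");
static_assert(RegState{} == RegState::NoFlags, "empty initializer is no flags");
```

The point of the scoped enum is that passing `true`, `false` or `0` where flags are expected becomes a compile error rather than a silent conversion.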


1082 Commits