303 Commits

Author SHA1 Message Date
Manuel Carrasco
ab4b689258
[AMDGPU][SIFoldOperands] Fix OR -1 fold (#189655)
In SIFoldOperands, folding `or x, -1` to `v_mov_b32 -1` removed
`Src1Idx`, which is incorrect because `-1` is in `Src0Idx` (after
canonicalization).

Closes https://github.com/llvm/llvm-project/issues/189677.
2026-04-01 13:37:37 +01:00
Stanislav Mekhanoshin
06a903e938
[AMDGPU] Clear no convergence flag on operand folding. NFCI (#179438)
Clear the flag when folding operands. Verification fails if it is set,
because only convergent operations may carry the NoConvergent flag.
NFCI for now, because this does not currently happen.
2026-02-03 10:46:26 -08:00
paperchalice
62aa40a4dd
[AMDGPU] Remove NoSignedZerosFPMath uses (#178343)
This is one of the global flags in `resetTargetOptions`; users should
use `nsz` instead.

`fneg_fadd_0_f64` from `AMDGPU/fneg-combines.new.ll` regresses when
`fadd` is annotated with `nsz`.
2026-01-30 09:18:40 +08:00
Ryan Mitchell
13b20e7aea
[AMDGPU][SILoadStoreOptimizer] Fix lds address operand offset (#176816)
The offset operand in GLOBAL_LOAD_ASYNC_TO_LDS_B128, for instance, is
added to both the lds and global address, but SILoadStoreOptimizer is
currently unaware of that. This PR inserts an add to counteract the
offset meant for the global address. This one add is better than not
doing the optimization at all, and having to insert 2 adds for each
global address calculation (with no offset).
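
The compensation can be sketched as plain address arithmetic. This is a toy model, assuming only what the message states (the immediate offset is added to both the LDS and the global address); the struct and function names are hypothetical and are not taken from SILoadStoreOptimizer, and the 32-bit offset range is a simplification.

```cpp
#include <cassert>
#include <cstdint>

// Toy model of one async load. Assumption: the immediate offset is applied
// to BOTH the LDS address and the global address.
struct AsyncLoad {
  uint32_t LdsAddr;     // value held in the LDS address register
  uint64_t GlobalAddr;  // value held in the global address register pair
  int32_t Offset;       // immediate offset field
};

struct EffectiveAddrs {
  uint32_t Lds;
  uint64_t Global;
};

inline EffectiveAddrs effective(const AsyncLoad &L) {
  return {L.LdsAddr + static_cast<uint32_t>(L.Offset),
          L.GlobalAddr + static_cast<uint64_t>(static_cast<int64_t>(L.Offset))};
}

// Rewrite the load to use NewBase as its global base. The difference moves
// into the immediate, and the LDS register is adjusted by the opposite
// amount (the single extra add) so the effective LDS address is unchanged.
inline AsyncLoad rebaseGlobal(const AsyncLoad &L, uint64_t NewBase) {
  EffectiveAddrs E = effective(L);
  int64_t NewOffset = static_cast<int64_t>(E.Global - NewBase);
  assert(NewOffset >= INT32_MIN && NewOffset <= INT32_MAX &&
         "offset must fit the (simplified) immediate field");
  AsyncLoad R;
  R.GlobalAddr = NewBase;
  R.Offset = static_cast<int32_t>(NewOffset);
  R.LdsAddr = E.Lds - static_cast<uint32_t>(R.Offset);
  return R;
}
```

With a load at base+256 rebased onto base+512, this model yields an immediate of -256 and an LDS register increased by 256, which is the same shape as the `V_ADD_U32_e32 256` and `-256` offset in the test for this change.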

```
; ENABLE-LABEL: name: promote_async_load_offset
; ENABLE: liveins: $ttmp7, $vgpr0, $sgpr0_sgpr1
; ENABLE-NEXT: {{  $}}
; ENABLE-NEXT: renamable $vgpr1 = V_LSHLREV_B32_e32 8, $vgpr0, implicit $exec
; ENABLE-NEXT: renamable $vgpr2, renamable $vcc_lo = V_ADD_CO_U32_e64 $vgpr0, 512, 0, implicit $exec
; ENABLE-NEXT: renamable $vgpr3, dead $sgpr_null = V_ADDC_U32_e64 0, killed $vgpr0, killed $vcc_lo, 0, implicit $exec
; ENABLE-NEXT: renamable $vgpr1 = disjoint V_OR_B32_e32 0, killed $vgpr1, implicit $exec
; ENABLE-NEXT: renamable $vgpr0 = V_ADD_U32_e32 256, $vgpr1, implicit $exec
; ENABLE-NEXT: GLOBAL_LOAD_ASYNC_TO_LDS_B128 killed $vgpr0, $vgpr2_vgpr3, -256, 0, implicit-def $asynccnt, implicit $exec, implicit $asynccnt :: (load store (s128), align 1, addrspace 3)
; ENABLE-NEXT: GLOBAL_LOAD_ASYNC_TO_LDS_B128 killed $vgpr1, killed $vgpr2_vgpr3, 0, 0, implicit-def $asynccnt, implicit $exec, implicit $asynccnt :: (load store (s128), align 1, addrspace 3)

; DISABLE-LABEL: name: promote_async_load_offset
; DISABLE: liveins: $ttmp7, $vgpr0, $sgpr0_sgpr1
; DISABLE-NEXT: {{  $}}
; DISABLE-NEXT: renamable $vgpr1 = V_LSHLREV_B32_e32 8, $vgpr0, implicit $exec
; DISABLE-NEXT: renamable $vgpr2, renamable $vcc_lo = V_ADD_CO_U32_e64 256, $vgpr0, 0, implicit $exec
; DISABLE-NEXT: renamable $vgpr3, $sgpr_null = V_ADDC_U32_e64 0, $vgpr0, killed $vcc_lo, 0, implicit $exec
; DISABLE-NEXT: renamable $vgpr1 = disjoint V_OR_B32_e32 0, killed $vgpr1, implicit $exec
; DISABLE-NEXT: GLOBAL_LOAD_ASYNC_TO_LDS_B128 $vgpr1, killed $vgpr2_vgpr3, 0, 0, implicit-def $asynccnt, implicit $exec, implicit $asynccnt :: (load store (s128), align 1, addrspace 3)
; DISABLE-NEXT: renamable $vgpr2, renamable $vcc_lo = V_ADD_CO_U32_e64 512, $vgpr0, 0, implicit $exec
; DISABLE-NEXT: renamable $vgpr3, $sgpr_null = V_ADDC_U32_e64 0, killed $vgpr0, killed $vcc_lo, 0, implicit $exec
; DISABLE-NEXT: GLOBAL_LOAD_ASYNC_TO_LDS_B128 killed $vgpr1, killed $vgpr2_vgpr3, 0, 0, implicit-def $asynccnt, implicit $exec, implicit $asynccnt :: (load store (s128), align 1, addrspace 3)
```

This PR also promotes the global address to an offset when the offset is
calculated with V_ADD_U64 on applicable gfx versions (and inversely
adds the LDS offset), whereas previously the optimization opportunity
was missed entirely.
2026-01-26 09:23:17 +01:00
Sam Elliott
7184229fea
[NFC][MI] Tidy Up RegState enum use (2/2) (#177090)
This change makes `RegState` into an enum class, with bitwise operators.
It also:
- Updates declarations of flag variables/arguments/returns from
`unsigned` to `RegState`.
- Updates empty RegState initializers from 0 to `{}`.

If this is causing problems in downstream code:
- Adopt the `RegState getXXXRegState(bool)` functions instead of using a
ternary operator such as `bool ? RegState::XXX : 0`.
- Adopt the `bool hasRegState(RegState, RegState)` function instead of
using a bitwise check of the flags.
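
A standalone sketch of what the migration might look like for downstream code. The enumerators and their values here are placeholders, and `getKillRegState` / `hasRegState` are reimplemented locally to mirror the helper shapes named above; the "any bit set" meaning of `hasRegState` is an assumption, so check the real header before relying on it.

```cpp
#include <cassert>
#include <type_traits>

// Placeholder flags for illustration; not the real LLVM enumerator values.
enum class RegState : unsigned {
  Define = 1u << 0,
  Kill = 1u << 1,
  Dead = 1u << 2,
};

constexpr RegState operator|(RegState A, RegState B) {
  using U = std::underlying_type_t<RegState>;
  return static_cast<RegState>(static_cast<U>(A) | static_cast<U>(B));
}

constexpr RegState operator&(RegState A, RegState B) {
  using U = std::underlying_type_t<RegState>;
  return static_cast<RegState>(static_cast<U>(A) & static_cast<U>(B));
}

// Stands in for `B ? RegState::Kill : 0`, which stops compiling once
// RegState is an enum class because 0 is not a RegState.
constexpr RegState getKillRegState(bool B) {
  return B ? RegState::Kill : RegState{};
}

// Stands in for a raw `(Flags & RegState::Kill) != 0` check.
// Assumption: true when any bit of Test is present in Flags.
constexpr bool hasRegState(RegState Flags, RegState Test) {
  return (Flags & Test) != RegState{};
}
```

The empty initializer `RegState{}` plays the role that `0` played when the flags were a plain `unsigned`.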
2026-01-23 00:19:03 -08:00
Shilei Tian
02d34a76f7
[NFCI][AMDGPU] Remove more redundant code from GCNSubtarget.h (#177297)
We are getting pretty close to using `GET_SUBTARGETINFO_MACRO` in the
header with this cleanup.
2026-01-22 09:07:15 -05:00
Shilei Tian
b4aa3d3ae3
[NFC] Check operand type instead of opcode (#168641)
A follow-up to #168458.
2025-11-18 21:37:56 -05:00
Shilei Tian
6665642ce4
[AMDGPU] Don't fold an i64 immediate value if it can't be replicated from its lower 32-bit (#168458)
On some targets, a packed f32 instruction can only read 32 bits from a
scalar operand (SGPR or literal) and replicates those bits to both
channels. In this case, we should not fold an immediate value if it
can't be replicated from its lower 32 bits.
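
A minimal sketch of the replication condition described above, assuming the check reduces to comparing the two 32-bit halves of the immediate. The function name is hypothetical and is not the name used in the pass.

```cpp
#include <cassert>
#include <cstdint>

// True if both 32-bit halves of Imm hold the same bit pattern, so an
// instruction that reads only the low 32 bits and replicates them to
// both channels reproduces the full 64-bit value.
inline bool isReplicatedFromLow32(uint64_t Imm) {
  uint32_t Lo = static_cast<uint32_t>(Imm);
  uint32_t Hi = static_cast<uint32_t>(Imm >> 32);
  return Lo == Hi;
}
```

For example, `0x3f8000003f800000` (two f32 `1.0` lanes) passes, while `0x400000003f800000` (f32 `1.0` and `2.0`) does not.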

Fixes SWDEV-567139.
2025-11-18 17:11:10 -05:00
LU-JOHN
9fa15ef916
[AMDGPU] When shrinking and/or to bitset*, remove implicit scc def (#168128)
When shrinking and/or to bitset*, remove the leftover implicit scc def.
bitset* instructions do not set scc.

Signed-off-by: John Lu <John.Lu@amd.com>
2025-11-15 09:21:43 -06:00
Ivan Kosarev
71eaf14094
[TableGen] Split *GenRegisterInfo.inc. (#167700)
Reduces memory usage compiling backend sources, most notably for
AMDGPU by ~98 MB per source on average.

AMDGPUGenRegisterInfo.inc is tens of megabytes in size now, and
is even larger downstream. At the same time, it is included in
nearly all backend sources, typically just for a small portion of
its content, resulting in compilation being unnecessarily
memory-hungry, which in turn stresses buildbots and wastes their
resources.

Splitting .inc files also helps avoid extra ccache misses
where changes in .td files don't cause changes in all parts of
what previously was a single .inc file.

It is thought that rather than building on top of the current
single-output-file design of TableGen, e.g., using `split-file`,
it would be preferable to recognise the need for multi-file
outputs and give it proper first-class support directly in
TableGen.
2025-11-14 16:30:51 +00:00
Jay Foad
72c69aefba
[AMDGPU] Make use of getFunction and getMF. NFC. (#167872) 2025-11-14 11:00:57 +00:00
Matt Arsenault
e3a9ac5e24
AMDGPU: Remove wrapper around TRI::getRegClass (#159885)
This shadows the member in the base class, but differs slightly
in behavior. The base method doesn't check for the invalid case.
2025-11-11 15:31:52 -08:00
Matt Arsenault
55422e804b
CodeGen: Remove TRI argument from getRegClass (#158225)
TargetInstrInfo now directly holds a reference to TargetRegisterInfo
and does not need TRI passed in anywhere.
2025-11-10 15:43:55 -08:00
Abhay Kanhere
b4b57adb89
[AMDGPU][MachineVerifier] test failures in SIFoldOperands (#166600)
After PR https://github.com/llvm/llvm-project/pull/151421 merged, the
following failures showed up in SIFoldOperands:

LLVM :: CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.mfma.gfx90a.ll
LLVM :: CodeGen/AMDGPU/llvm.amdgcn.mfma.gfx90a.ll
LLVM :: CodeGen/AMDGPU/llvm.amdgcn.mfma.ll
LLVM :: CodeGen/AMDGPU/mfma-loop.ll
LLVM :: CodeGen/AMDGPU/rewrite-vgpr-mfma-to-agpr.ll

In the folding code, if the folded operand is a register, ensure
earlyClobber is set.

---------

Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
Co-authored-by: Shilei Tian <i@tianshilei.me>
2025-11-07 21:12:19 -08:00
Matt Arsenault
67b6fd04dd
AMDGPU: Delete redundant recursive copy handling code (#157032)
This fixes a regression exposed after
445415219708f9539801018e03282049ca33e0e2.
This introduces a few small regressions for true16. There are more cases
where the value can propagate through subregister extracts which need
new handling. They're also small enough that perhaps there's a way to
avoid needing to deal with this case in the first place.
2025-11-05 18:01:12 -08:00
Matt Arsenault
1a5494ca4a
AMDGPU: Use RegClassByHwMode to manage operand VGPR operand constraints (#158272)
This removes special case processing in TargetInstrInfo::getRegClass to
fix up register operands which, depending on the subtarget, support AGPRs
or require even-aligned registers.

This regresses assembler diagnostics, which currently work by hackily
accepting invalid cases and then post-rejecting a validly parsed
instruction.
On the plus side this now emits a comment when disassembling unaligned
registers for targets with the alignment requirement.
2025-10-08 11:19:54 +09:00
Brox Chen
b8127cc8d0
[AMDGPU][True16][CodeGen] fix v_mov_b16_t16 index in folding pass (#161764)
In true16 mode, v_mov_b16_t16 is added as a new foldable copy
instruction, but its src operand is at a different index.

Use the correct src index for v_mov_b16_t16.
2025-10-03 17:34:42 -04:00
Matt Arsenault
80fd3eda25
AMDGPU: Fix constrain register logic for physregs (#161794)
We do not need to reconstrain physical registers. Enables an
additional fold for constant physregs.
2025-10-03 21:20:36 +09:00
Matt Arsenault
597f93d36b
AMDGPU: Check if immediate is legal for av_mov_b32_imm_pseudo (#160819)
This is primarily to avoid folding a frame index materialized
into an SGPR into the pseudo; this would end up looking like:
  %sreg = s_mov_b32 %stack.0
  %av_32 = av_mov_b32_imm_pseudo %sreg

Which is not useful.

Match the check used for the b64 case. This is limited to the
pseudo to avoid regression due to gfx908's special case - it
is expecting to pass here with v_accvgpr_write_b32 for illegal
cases, and stay in the intermediate state with an sgpr input.

This avoids regressions in a future patch.
2025-09-27 08:24:20 +09:00
Josh Hutton
de59bc42ed
[AMDGPU] Avoid constraining RC based on folded into operand (NFC) (#160743)
The RC of the folded operand does not need to be constrained based on
the RC of the current operand we are folding into.

The purpose of this PR is to facilitate this PR:
https://github.com/llvm/llvm-project/pull/151033
2025-09-26 05:08:09 +00:00
Stanislav Mekhanoshin
f0090bacc1
[AMDGPU] Fold copies of constant physical registers into their uses (#154410)
Co-authored-by: Jay Foad <Jay.Foad@amd.com>
2025-09-17 10:49:34 -07:00
Matt Arsenault
ea9acc97f1
CodeGen: Surface shouldRewriteCopySrc utility function (#158524)
Change shouldRewriteCopySrc to return the common register
class and expose it as a utility function. I've found myself
reproducing essentially the same logic in multiple places. The
purpose of this function is to just work through the API constraints
of which combination of register class and subreg indexes you have.

i.e. you need to use a different function if you have 0, 1, or 2
subregister indexes involved in a pair of copy-like operations.
2025-09-16 14:53:49 +09:00
Matt Arsenault
7289f2cd0c
CodeGen: Remove MachineFunction argument from getRegClass (#158188)
This is a low level utility to parse the MCInstrInfo and should
not depend on the state of the function.
2025-09-12 19:22:02 +09:00
Matt Arsenault
dd5eb46690
AMDGPU: Fold 64-bit immediate into copy to AV class (#155615)
This is in preparation for patches which will introduce more
copies to av registers.
2025-09-03 09:29:59 +09:00
Matt Arsenault
3a7d14acce
AMDGPU: Avoid using exact class check in reg_sequence AGPR fold (#156135)
This does better in cases which mix align2 and non-align2 classes.
2025-09-03 09:05:48 +09:00
Matt Arsenault
96e4caadb4
AMDGPU: Stop special casing aligned VGPR targets in operand folding (#155559)
Perform a register class constraint check when performing the fold
2025-09-02 16:15:25 +00:00
Matt Arsenault
a0c472d50f
AMDGPU: Remove special case of SGPR_LO class in imm folding (#155518)
The previous change accidentally broke this, which shows it's not
doing anything.
2025-08-27 06:08:38 +00:00
Matt Arsenault
9091108c66
AMDGPU: Fold mov imm to copy to av_32 class (#155428)
Previously we had special case folding into copies to AGPR_32,
ignoring AV_32. Try folding into the pseudos.

Not sure why the true16 case regressed.
2025-08-27 02:13:14 +00:00
Matt Arsenault
4454152197
AMDGPU: Replace copy-to-mov-imm folding logic with class compat checks (#154501)
This strengthens the check to ensure the new mov's source class
is compatible with the source register. This avoids using the register
sized based checks in getMovOpcode, which don't quite understand
AV superclasses correctly. As a side effect it also enables more folds
into true16 movs.

getMovOpcode should probably be deleted, or at least replaced
with class check based logic. In this particular case other
legality checks need to be mixed in with attempted IR changes,
so I didn't try to push all of that into the opcode selection.
2025-08-26 23:41:35 +09:00
Stanislav Mekhanoshin
3ef3b30c3c
Revert "[AMDGPU] Fold copies of constant physical registers into their uses (#154183)" (#154219)
This reverts commit 3395676a18ab580f21ebcd4324feaf1294a8b6d9.

Fails
libc/test/src/string/libc.test.src.string.memmove_test.__hermetic__
2025-08-18 16:22:47 -07:00
Stanislav Mekhanoshin
3395676a18
[AMDGPU] Fold copies of constant physical registers into their uses (#154183)
With current codegen this only affects src_flat_scratch_base_lo/hi.

Co-authored-by: Jay Foad <Jay.Foad@amd.com>
2025-08-18 13:07:36 -07:00
Stanislav Mekhanoshin
d09dbdabb9
[AMDGPU] bf16 clamp folding (#152573) 2025-08-07 12:59:50 -07:00
Ivan Kosarev
2b20cf7291
[AMDGPU] Fold into uses of splat REG_SEQUENCEs through COPYs. (#145691) 2025-08-04 16:18:33 +01:00
Stanislav Mekhanoshin
ce40863209
[AMDGPU] Add v_cvt_sr|pk_bf8|fp8_f16 gfx1250 instructions (#151415) 2025-07-30 17:24:45 -07:00
Matt Arsenault
5f3eea7ef2
AMDGPU: Fix not folding splat immediate into VGPR MFMA src2 (#150628) 2025-07-26 13:54:49 +09:00
Stanislav Mekhanoshin
006858cd4d
[AMDGPU] Prevent folding of FI with scale_offset on gfx1250 (#149894)
SS forms of SCRATCH_LOAD_DWORD do not support SCALE_OFFSET,
so if this bit is used SCRATCH_LOAD_DWORD_SADDR cannot be formed.
This should not generally happen because FI is not supposed to
be scaled, but add this as a precaution.
2025-07-21 15:05:43 -07:00
macurtis-amd
402b989693
AMDGPU: Fix assert when multi operands to update after folding imm (#148205)
In the original motivating test case,
[FoldList](d8a2141ff9/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp (L1764))
had entries:
```
  #0: UseMI: %224:sreg_32 = S_OR_B32 %219.sub0:sreg_64, %219.sub1:sreg_64, implicit-def dead $scc
      UseOpNo: 1

  #1: UseMI: %224:sreg_32 = S_OR_B32 %219.sub0:sreg_64, %219.sub1:sreg_64, implicit-def dead $scc
      UseOpNo: 2
```
After calling
[updateOperand(#0)](d8a2141ff9/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp (L1773)),
[tryConstantFoldOp(#0.UseMI)](d8a2141ff9/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp (L1786))
removed operand 1, and entry #1.UseOpNo was no longer valid,
resulting in an
[assert](4a35214bdd/llvm/include/llvm/ADT/ArrayRef.h (L452)).

This change defers constant folding until all operands have been updated
so that UseOpNo values remain stable.
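
The hazard and the fix can be illustrated with a toy model (C++17). Everything here is hypothetical: `Operand`, `FoldEntry`, `tryConstantFoldOr`, and `applyFoldsDeferred` are stand-ins invented for this sketch and do not reproduce the real pass, only the ordering problem of erasing an operand while later indices are still pending.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Toy operand: a known immediate, or std::nullopt for an unknown register.
using Operand = std::optional<long>;

// One pending fold: write Imm into operand number UseOpNo.
struct FoldEntry {
  std::size_t UseOpNo;
  long Imm;
};

// Toy constant folder for an OR with layout {dst, src0, src1}.
// Folding erases an operand, so operand indices taken beforehand can
// become stale afterwards.
inline bool tryConstantFoldOr(std::vector<Operand> &Ops) {
  if (Ops.size() != 3)
    return false;
  if (Ops[1] && Ops[2]) {  // both sources known: fold to one immediate
    Ops[1] = *Ops[1] | *Ops[2];
    Ops.pop_back();
    return true;
  }
  if (Ops[1] && *Ops[1] == 0) {  // or 0, y  ->  copy y
    Ops.erase(Ops.begin() + 1);
    return true;
  }
  return false;
}

// Deferred scheme: apply every operand update first, then fold once, so
// every UseOpNo is still in range when it is used.
inline void applyFoldsDeferred(std::vector<Operand> &Ops,
                               const std::vector<FoldEntry> &FoldList) {
  for (const FoldEntry &F : FoldList) {
    assert(F.UseOpNo < Ops.size() && "operand index must still be valid");
    Ops[F.UseOpNo] = F.Imm;
  }
  tryConstantFoldOr(Ops);
}
```

In this model, folding eagerly after the first update shrinks the operand list to two entries, so a second entry with `UseOpNo == 2` would index out of range; the deferred version never folds until both updates have landed.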
2025-07-16 06:37:08 -05:00
Matt Arsenault
e1f224b99a
AMDGPU: Handle folding vector splats of inline split f64 inline immediates (#140878)
Recognize a reg_sequence with 32-bit elements that produce a 64-bit
splat value. This enables folding f64 constants into mfma operands.
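
A rough sketch of the recognition step (C++17), assuming the reg_sequence inputs are already known 32-bit immediates listed lowest subregister first. The helper name is hypothetical, and the legality check for the resulting inline immediate is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Returns the 64-bit value if the 32-bit elements form repeated, identical
// (lo, hi) pairs, i.e. a splat of one 64-bit constant; otherwise nullopt.
inline std::optional<uint64_t>
getSplat64(const std::vector<uint32_t> &Elems) {
  if (Elems.size() < 2 || Elems.size() % 2 != 0)
    return std::nullopt;
  for (std::size_t I = 2; I < Elems.size(); I += 2)
    if (Elems[I] != Elems[0] || Elems[I + 1] != Elems[1])
      return std::nullopt;
  return (static_cast<uint64_t>(Elems[1]) << 32) | Elems[0];
}
```

For f64 `1.0` (`0x3FF0000000000000`), the elements alternate between `0` and `0x3FF00000`, so every other input is zero.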
2025-06-26 07:45:49 +09:00
Matt Arsenault
472c9141f9
AMDGPU: Fix tracking subreg defs when folding through reg_sequence (#140608)
We weren't fully respecting the type of a def of an immediate vs.
the type at the use point. Refactor the folding logic to track the
value to fold, as well as a subregister to apply to the underlying
value. This is similar to how PeepholeOpt tracks subregisters (though
only for pure copy-like instructions, no constants).

Fixes #139317
2025-06-26 07:42:55 +09:00
Matt Arsenault
80064b6e32
AMDGPU: Try constant fold after folding immediate (#141862)
This helps avoid some regressions in a future patch. The or 0
pattern appears in the division tests because reducing a 64-bit
bit operation to a 32-bit one, when half of the constant is the
identity value, is only implemented for constants. We could fix that by
using computeKnownBits.
Additionally the pattern disappears if I optimize the IR division
expansion, so that IR should probably be emitted more optimally in
the first place.
2025-06-10 11:44:44 +09:00
Daniil Fukalov
5208f722d8
[AMDGPU] Fix SIFoldOperandsImpl::canUseImmWithOpSel() for VOP3 packed [B]F16 imms. (#142142)
VOP3 instructions ignore opsel source modifiers, so a constant that
contains two different [B]F16 imms cannot be encoded into an
instruction with a src opsel.

E.g. without the fix the following instructions

`s_mov_b32 s0, 0x40003c00 // <half 1.0, half 2.0>`
`v_cvt_scalef32_pk_fp8_f16 v0, s0, v2`

lose the `2.0` imm and are folded into

`v_cvt_scalef32_pk_fp8_f16 v1, 1.0, 1.0`
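
To make the example concrete, the constant can be split into its two 16-bit halves. This is only an illustration of why `0x40003c00` cannot be represented by a single 16-bit immediate; the helper names are hypothetical and this is not the logic of `canUseImmWithOpSel()` itself.

```cpp
#include <cassert>
#include <cstdint>

struct PackedHalves {
  uint16_t Lo;  // bits [15:0]
  uint16_t Hi;  // bits [31:16]
};

inline PackedHalves splitPacked16(uint32_t Imm) {
  return {static_cast<uint16_t>(Imm & 0xffffu),
          static_cast<uint16_t>(Imm >> 16)};
}

// With no opsel available to pick a half, one 16-bit immediate can only
// stand in for the packed constant when both halves are identical.
inline bool halvesMatch(uint32_t Imm) {
  PackedHalves H = splitPacked16(Imm);
  return H.Lo == H.Hi;
}
```

Here `0x3c00` is half `1.0` and `0x4000` is half `2.0`, so `0x40003c00` has mismatched halves, whereas `0x3c003c00` would be a true splat of `1.0`.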

Fixes SWDEV-531672
2025-05-30 16:38:07 +02:00
Matt Arsenault
65b90c59ce
AMDGPU: Remove redundant operand folding checks (#140587)
This was pre-filtering out a specific situation from being
added to the fold candidate list. The operand legality will
ultimately be checked with isOperandLegal before the fold is
performed, so I don't see the plus in pre-filtering this one
case.
2025-05-29 19:38:45 +02:00
Matt Arsenault
1b07c589b2
AMDGPU: Delete seemingly dead s_fmaak_f32/s_fmamk_f32 folding code (#140580)
No tests fail with this. I'm not sure I understand the comment;
there can't be any folding into an operand that had to already
be a constant. I tried different combinations of immediates to these
instructions but never hit the condition.
2025-05-29 19:36:05 +02:00
Fabian Ritter
fb27867bd5
[AMDGPU] SIFoldOperands: Delay foldCopyToVGPROfScalarAddOfFrameIndex (#141558)
foldCopyToVGPROfScalarAddOfFrameIndex transforms s_adds whose results are copied
to vector registers into v_adds. We don't want to do that if foldInstOperand
(which so far runs later) can fold the sreg->vreg copy away.
This patch therefore delays foldCopyToVGPROfScalarAddOfFrameIndex until after
foldInstOperand.

This avoids unnecessary movs in the flat-scratch-svs.ll test and also avoids
regressions in an upcoming patch to enable ISD::PTRADD nodes.
2025-05-27 11:30:51 +02:00
Rahul Joshi
52c2e45c11
[NFC][CodeGen] Adopt MachineFunctionProperties convenience accessors (#141101) 2025-05-23 08:30:29 -07:00
Matt Arsenault
36018494fd
AMDGPU: Check for subreg match when folding through reg_sequence (#140582)
We need to consider the use instruction's interpretation of the bits,
not the defined immediate without use context. This will regress
some cases where we previously could match f64 inline constants. We
can restore them by either using pseudo instructions to materialize f64
constants, or recognizing reg_sequence decomposed into 32-bit pieces for them
(which essentially means recognizing every other input is a 0).

Fixes #139908
2025-05-19 21:44:44 +02:00
Matt Arsenault
4ddab1252f
AMDGPU: Move reg_sequence splat handling (#140313)
This code clunkily tried to find a splat reg_sequence by
looking at every use of the reg_sequence, and then looking
back at the reg_sequence to see if it's a splat. Extract this
into a separate helper function to help clean this up. This now
parses whether the reg_sequence forms a splat once, and defers the
legal inline immediate check to the use check (which is really use
context dependent).

The one regression is in globalisel, which has an extra
copy that should have been separately folded out. It was getting
dealt with by the handling of foldable copies in tryToFoldACImm.

This is preparation for #139908 and #139317
2025-05-17 08:18:01 +02:00
Ivan Kosarev
c290f48a45
[AMDGPU][NFC] Remove unused operand types. (#139062) 2025-05-08 12:48:25 +01:00
Akhilesh Moorthy
9c9013f703
[AMDGPU] Handle MachineOperandType global address in SIFoldOperands. (#135424)
This patch handles the global operand type properly, fixing the bug:
Assertion `(isFI() || isCPI() || isTargetIndex() || isJTI()) && "Wrong MachineOperand accessor"` failed.

Fixes SWDEV-504645

---------

Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
2025-05-05 18:12:35 +02:00
Kazu Hirata
d144c13ae5
[Target] Remove unused local variables (NFC) (#138443) 2025-05-04 07:56:38 -07:00