200 Commits

Author SHA1 Message Date
Austin Kerbow
89503bda38
[AMDGPU] Add structural stall heuristic to scheduling strategies (#169617)
Implements a structural stall heuristic that considers both resource
hazards and latency constraints when selecting instructions. In coexec,
this changes the pending queue from a binary “not ready to issue”
distinction into part of a unified candidate comparison. Pending
instructions still identify structural stalls in the current cycle, but
they are now evaluated directly against available instructions by stall
cost, making the heuristics both more intuitive and more expressive.

- Add getStructuralStallCycles() to GCNSchedStrategy that computes the
number of cycles an instruction must wait due to:
  - Resource conflicts on unbuffered resources (from the SchedModel)
  - Sequence-dependent hazards (from GCNHazardRecognizer)

- Add getHazardWaitStates() to GCNHazardRecognizer that returns the
number of wait states until all hazards for an instruction are resolved,
providing cycle-accurate hazard information for scheduling heuristics.
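
A minimal sketch of how the two stall sources could feed a single candidate comparison, assuming the stall cost is simply the larger of the two waits. `StallInfo`, `structuralStallCycles`, and `preferLowerStall` are hypothetical names for illustration, not the functions added by the patch:

```cpp
// Hypothetical inputs: cycles until an unbuffered resource frees up, and
// wait states until all sequence-dependent hazards are resolved.
struct StallInfo {
  unsigned ResourceWait;
  unsigned HazardWait;
};

// Assumption: an instruction can issue only once both constraints are met,
// so the structural stall is the maximum of the two waits.
constexpr unsigned structuralStallCycles(const StallInfo &S) {
  return S.ResourceWait > S.HazardWait ? S.ResourceWait : S.HazardWait;
}

// Returns true if candidate A should be preferred over candidate B when
// comparing on stall cost alone (lower cost wins).
constexpr bool preferLowerStall(const StallInfo &A, const StallInfo &B) {
  return structuralStallCycles(A) < structuralStallCycles(B);
}
```

Under this model a pending instruction with a stall cost of zero compares equal to an available one, so no separate "not ready" category is needed.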
2026-03-23 11:33:43 -07:00
Austin Kerbow
7f77ca0dbd
[AMDGPU] Include TRANS instructions in WMMA coexecution hazard checking (#186269) 2026-03-16 16:10:58 -07:00
Jay Foad
673a71f018
[CodeGen] Make ShouldPreferAnother const. NFC. (#185606) 2026-03-10 15:03:56 +00:00
Jay Foad
d2c0937c5a
[AMDGPU] Make GCNHazardRecognizer "check" functions const. NFC. (#185416) 2026-03-09 14:15:54 +00:00
Jay Foad
fff2f0ba78
[AMDGPU] Handle GFX1250 hazards between WMMA and VOPD (#183573)
Hazards between WMMA and VALU were handled in #149865 but this only
worked for regular VOP* VALU encodings, not for VOPD.

Fixes: #183546
2026-02-27 19:51:53 +00:00
Dark Steve
254cb2a326
[AMDGPU] Hoist WMMA coexecution hazard V_NOPs from loops to preheaders (#176895)
On GFX1250, V_NOPs inserted for WMMA coexecution hazards are placed at
the use-site. When the hazard-consuming instruction is inside a loop and
the WMMA is outside, these NOPs execute every iteration even though the
hazard only needs to be covered once.

This patch hoists the V_NOPs to the loop preheader, reducing executions
from N iterations to 1.

```
Example (assuming a hazard requiring K V_NOPs):
  Before:
    bb.0 (preheader): WMMA writes vgpr0
    bb.1 (loop):      V_NOP xK, VALU reads vgpr0, branch bb.1
                      -> K NOPs executed per iteration

  After:
    bb.0 (preheader): WMMA writes vgpr0, V_NOP xK
    bb.1 (loop):      VALU reads vgpr0, branch bb.1
                      -> K NOPs executed once
```

For nested loops, V_NOPs are hoisted to the outermost preheader where no
WMMA hazard exists within the loop.
Hoisting is restricted to strict preheaders (not any single predecessor)
to avoid introducing V_NOPs on unrelated control flow paths.

The optimization is controlled by `-amdgpu-wmma-vnop-hoisting` (default:
on).

Fixes: SWDEV-573407
2026-02-26 17:19:00 +05:30
vporpo
5a756c8a3a
[AMDGPU][SIInsertWaitcnts][NFC] Make Waitcnt members private (#180772)
This patch makes Waitcnt member variables private and replaces their
accesses with calls to set() or get(). This will help us change the
implementation to an array in the follow-up patch.
2026-02-23 11:44:19 -08:00
Mariusz Sikora
3c0f5045e1
[AMDGPU] Add FeatureGFX13 and SMEM encoding for gfx13 (#177567)
For now, the list of features is based on gfx12 and gfx1250.

---------

Co-authored-by: Jay Foad <jay.foad@amd.com>
2026-01-26 14:16:36 +01:00
Sam Elliott
7184229fea
[NFC][MI] Tidy Up RegState enum use (2/2) (#177090)
This Change makes `RegState` into an enum class, with bitwise operators.
It also:
- Updates declarations of flag variables/arguments/returns from
`unsigned` to `RegState`.
- Updates empty RegState initializers from 0 to `{}`.

If this is causing problems in downstream code:
- Adopt the `RegState getXXXRegState(bool)` functions instead of using a
ternary operator such as `bool ? RegState::XXX : 0`.
- Adopt the `bool hasRegState(RegState, RegState)` function instead of
using a bitwise check of the flags.
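
The migration advice above can be illustrated with a small, self-contained sketch. `Flags`, `getKillFlag`, and `hasFlag` are hypothetical stand-ins for `RegState`, `getXXXRegState(bool)`, and `hasRegState(RegState, RegState)`; the flag values and the exact semantics of the query helper are assumptions, not the LLVM definitions:

```cpp
#include <cstdint>

// Hypothetical stand-in for an enum class of register flags.
enum class Flags : std::uint32_t {
  Define = 1u << 1,
  Kill = 1u << 2,
};

// Bitwise operators must be provided explicitly for an enum class.
constexpr Flags operator|(Flags A, Flags B) {
  return static_cast<Flags>(static_cast<std::uint32_t>(A) |
                            static_cast<std::uint32_t>(B));
}
constexpr Flags operator&(Flags A, Flags B) {
  return static_cast<Flags>(static_cast<std::uint32_t>(A) &
                            static_cast<std::uint32_t>(B));
}

// Replaces `B ? Flags::Kill : 0`, which no longer compiles because 0 does
// not implicitly convert to an enum class. `Flags{}` is the empty value.
constexpr Flags getKillFlag(bool B) { return B ? Flags::Kill : Flags{}; }

// Replaces an open-coded `(Value & Test) != 0` style check. Assumption:
// the helper returns true when every bit of Test is set in Value.
constexpr bool hasFlag(Flags Value, Flags Test) {
  return (Value & Test) == Test;
}
```

The scoped enum rejects accidental `bool` or integer arguments at compile time, which is the class of mistake the reserved `0x1` value was previously used to detect.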
2026-01-23 00:19:03 -08:00
Sam Elliott
2042887709
Reland "[NFC][MI] Tidy Up RegState enum use (1/2)" (#176277)
This Change is to prepare to make RegState into an enum class. It:
- Updates documentation to match the order in the code.
- Brings the `get<>RegState` functions together and makes them
`constexpr`.
- Adopts the `get<>RegState` where RegStates were being chosen with
ternary operators in backend code.
- Introduces `hasRegState` to make querying RegState easier once it is
an enum class.
- Adopts `hasRegState` where equivalent was done with bitwise
arithmetic.
- Introduces `RegState::NoFlags`, which will be used for the lack of
flags.
- Documents that `0x1` is a reserved flag value used to detect if
someone is passing `true` instead of flags (due to implicit bool to
unsigned conversions).
- Updates two calls to `MachineInstrBuilder::addReg` which were passing
`false` to the flags operand, to no longer pass a value.
- Documents that `getRegState` seems to have forgotten a call to
`getEarlyClobberRegState`.

This PR relands llvm/llvm-project#176091 (commit
1d616cdca3aba9d22f120888bb6b09b75ca90b92) which was reverted in
llvm/llvm-project#176190 (commit
6309cd8668fc2ae589f156b23f86821f4ce5b7ea).
2026-01-16 13:05:06 -08:00
Sam Elliott
6309cd8668
Revert "[NFC][MI] Tidy Up RegState enum use (1/2)" (#176190)
Reverts llvm/llvm-project#176091

Reverting because some compilers were erroring on the call to
`Reg.isReg()` (which is not `constexpr`) in a `constexpr` function.
2026-01-15 07:58:05 -08:00
Sam Elliott
1d616cdca3
[NFC][MI] Tidy Up RegState enum use (1/2) (#176091)
This Change is to prepare to make RegState into an enum class. It:
- Updates documentation to match the order in the code.
- Brings the `get<>RegState` functions together and makes them
`constexpr`.
- Adopts the `get<>RegState` where RegStates were being chosen with
ternary operators in backend code.
- Introduces `hasRegState` to make querying RegState easier once it is
an enum class.
- Adopts `hasRegState` where equivalent was done with bitwise
arithmetic.
- Introduces `RegState::NoFlags`, which will be used for the lack of
flags.
- Documents that `0x1` is a reserved flag value used to detect if
someone is passing `true` instead of flags (due to implicit bool to
unsigned conversions).
- Updates two calls to `MachineInstrBuilder::addReg` which were passing
`false` to the flags operand, to no longer pass a value.
- Documents that `getRegState` seems to have forgotten a call to
`getEarlyClobberRegState`.
2026-01-15 07:47:05 -08:00
LU-JOHN
49381c3000
[NFC][AMDGPU] Declare variables initialized with getDebugLoc as const ref (#174434)
Declare variables initialized with getDebugLoc as a const reference.

Signed-off-by: John Lu <John.Lu@amd.com>
2026-01-05 12:37:47 -06:00
LU-JOHN
7e2b79b049
[AMDGPU] Generate more efficient code to avoid shift64 hazard (#171871)
Generate more efficient code to avoid shift64 hazard when dst!=src1.
Transform:

   dst = shiftrev64 amt, src1

to:
  
   dst.sub0 = amt
   dst = shiftrev64 dst.sub0, src1

---------

Signed-off-by: John Lu <John.Lu@amd.com>
2026-01-05 09:19:15 -06:00
Stephen Thomas
7c328d8a0a
[AMDGPU][GCNHazardRecognizer] Remove instances of hardcoded S_WAITCNT_DEPCTR operand values (#171811)
Two S_WAITCNT_DEPCTR instructions are constructed with hardcoded operand
values. Replace these with appropriate calls to
AMDGPU::DepCtr::encodeFieldVmVsrc().

NFC, except that the original code was setting reserved operand bits
that should-be-zero, and this is now corrected.
2025-12-11 13:26:54 +00:00
Stanislav Mekhanoshin
fffe9bcbc7
[AMDGPU] Allow hazard checks for WMMA co-exec (#168805)
Currently we are just inserting V_NOP instructions; try to schedule
something into the shadow instead.

It is still somewhat imprecise, for example AdvanceCycle() will
use TII.getNumWaitStates() anyway, but in a scheduling mode
we are not required to be precise. We must be finally precise
in the hazard recognizer mode. Then EmittedInstrs buffer is also
limited to MaxLookAhead even though VALU only hazards may actually
never expire and require an endless buffer. But that's OK, we can
at least mitigate what the buffer can hold. The buffer is also
currently much bigger than any of VALU hazards may need.

That said, the rest of the 'fix*' functions here that use V_NOPs can be
changed the same way. This one is just the worst because it may require
up to 9 nops.
2025-12-01 11:46:30 -08:00
Stanislav Mekhanoshin
e6ae2462bd
[AMDGPU] Refactor hazard recognizer for VALU-pipeline hazards. NFCI. (#168801)
This is in preparation of handling these in scheduler. I do not expect
any changes to the produced code here, it is just an infrastructure.
Our current problem with the VALU pipeline hazards is that we only
insert V_NOP instructions in the hazard recognizer mode, but ignore
it during scheduling. This patch is meant to create a mechanism to
actually account for that during scheduling.
2025-12-01 10:59:52 -08:00
Jay Foad
d748c81218
[AMDGPU] Change the immediate operand of s_waitcnt_depctr / s_wait_alu (#169378)
The 16-bit immediate operand of s_waitcnt_depctr / s_wait_alu has some
unused bits. Previously codegen would set these bits to 1, but setting
them to 0 matches the SP3 assembler behaviour better, which in turn
means that we can print them using the human readable SP3 syntax:

s_wait_alu 0xfffd ; unused bits set to 1
s_wait_alu 0xff9d ; unused bits set to 0
s_wait_alu depctr_va_vcc(0) ; unused bits set to 0, human readable

Note that the set of unused bits changed between GFX10.1 and GFX10.3.
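
The arithmetic behind the first two encodings above can be checked with a short C++ sketch. The mask is inferred only from the two example immediates (0xfffd and 0xff9d) and is an assumption for illustration; `ExampleUnusedBits` and `clearUnusedBits` are hypothetical names, and the real set of unused bits depends on the target:

```cpp
#include <cstdint>

// Assumption: the bits that differ between the two example encodings are
// exactly the unused bits for that example (0xfffd ^ 0xff9d).
constexpr std::uint16_t ExampleUnusedBits = 0xfffd ^ 0xff9d;

// Clears the bits named by UnusedMask, leaving all other fields intact.
constexpr std::uint16_t clearUnusedBits(std::uint16_t Imm,
                                        std::uint16_t UnusedMask) {
  return static_cast<std::uint16_t>(Imm & ~UnusedMask);
}
```

Applying `clearUnusedBits` to the old-style immediate with this mask yields the new-style immediate from the example.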
2025-11-25 11:55:26 +00:00
Robert Imschweiler
0b82415c59
[AMDGPU] Consider FLAT instructions for VMEM hazard detection (#137170)
In general, "Flat instructions look at the per-workitem address and
determine for each work item if the target memory address is in global,
private or scratch memory." (RDNA2 ISA) That means that FLAT
instructions need to be considered for VMEM hazards even without
"specific segment". Also, LDS DMA should be considered for LDS hazard
detection.

See also #137148
2025-11-18 18:41:04 +01:00
Sergei Barannikov
86d712cda4
[AMDGPU] Use MCRegUnit, insert explicit casts to/from unsigned (NFC) (#167889)
The casts are currently no-ops because `MCRegUnit` is typedef'ed to
`unsigned`, but this will change soon enough and explicit casts will be
required.
2025-11-13 21:39:02 +03:00
Kazu Hirata
50faea28fb
[llvm] Use conventional enum declarations (NFC) (#166318)
This patch replaces:

  using Foo = enum { A, B, C };

with the more conventional:

  enum Foo { A, B, C };

These two enum declaration styles are not identical, but their
difference does not matter in these .cpp files.  With the "using Foo"
style, the enum is unnamed and cannot be forward-declared, whereas the
conventional style creates a named enum that can be.  Since these
changes are confined to .cpp files, this distinction has no practical
impact here.
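
A small sketch of the distinction, using hypothetical names (`Color`, `Shape`, `toInt`). Note that in standard C++ an unscoped enum can only be declared ahead of its definition when it has a fixed underlying type, so the example gives it one:

```cpp
// Conventional style: the enum has a name, so it can be declared before
// it is defined (here with a fixed underlying type of int).
enum Color : int;
constexpr int toInt(Color C); // usable in declarations already

enum Color : int { Red, Green, Blue };
constexpr int toInt(Color C) { return static_cast<int>(C); }

// Alias style: the enum itself is unnamed and only the alias names it,
// so there is no enum name to forward-declare.
using Shape = enum { Circle, Square };
```

Within a single .cpp file both forms behave the same for ordinary use of the enumerators.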
2025-11-04 07:12:53 -08:00
Carl Ritson
385c12134a
[AMDGPU] Rework GFX11 VALU Mask Write Hazard (#138663)
Apply additional counter waits to address VALU writes to SGPRs. Rework
expiry detection and apply wait coalescing to mitigate some of the
additional waits.
2025-10-28 16:09:28 +09:00
Matt Arsenault
1a5494ca4a
AMDGPU: Use RegClassByHwMode to manage operand VGPR operand constraints (#158272)
This removes special-case processing in TargetInstrInfo::getRegClass to
fix up register operands which, depending on the subtarget, support AGPRs
or require even-aligned registers.

This regresses assembler diagnostics, which currently work by hackily
accepting invalid cases and then post-rejecting a validly parsed
instruction.
On the plus side this now emits a comment when disassembling unaligned
registers for targets with the alignment requirement.
2025-10-08 11:19:54 +09:00
Carl Ritson
e60ca86621
[AMDGPU] Refine GCNHazardRecognizer hasHazard() (#138841)
Remove recursion to avoid stack overflow on large CFGs.
Avoid worklist for hazard search within single MachineBasicBlock.
Ensure predecessors are visited for all state combinations.
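
The general technique of replacing recursion with an explicit worklist can be sketched as follows. This is a simplified illustration under stated assumptions, not the patch's algorithm: the CFG is a hypothetical adjacency list of predecessor indices, and the per-path state that the real search tracks is omitted, so a plain visited set is enough here:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Walks predecessors of Start iteratively and reports whether any reachable
// block (including Start) is marked as containing a hazard source.
// Preds[B] lists the predecessor block indices of block B (hypothetical CFG).
bool hazardReachableBackwards(
    const std::vector<std::vector<std::size_t>> &Preds,
    const std::vector<bool> &HasHazardSource, std::size_t Start) {
  assert(Start < Preds.size() && Preds.size() == HasHazardSource.size());

  std::vector<bool> Visited(Preds.size(), false);
  std::vector<std::size_t> Worklist{Start};
  Visited[Start] = true;

  // The explicit stack bounds memory use by the number of blocks rather
  // than by call depth, so large CFGs cannot overflow the native stack.
  while (!Worklist.empty()) {
    std::size_t B = Worklist.back();
    Worklist.pop_back();
    if (HasHazardSource[B])
      return true;
    for (std::size_t P : Preds[B]) {
      if (!Visited[P]) {
        Visited[P] = true;
        Worklist.push_back(P);
      }
    }
  }
  return false;
}
```

When the search also carries state along each path, a block may need to be revisited once per distinct state, which is why a single visited bit per block is only valid for this simplified version.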
2025-09-24 18:42:11 +09:00
Stanislav Mekhanoshin
32c2393ca5
[AMDGPU] Handle S_GETREG_B32_const in the hazard recognizer. NFCI (#160364) 2025-09-23 14:30:24 -07:00
Jay Foad
f15c6ff6cb
[AMDGPU] Make use of SIInstrInfo::isWaitcnt. NFC. (#154087) 2025-08-18 16:18:46 +01:00
Stanislav Mekhanoshin
4198649c19
[AMDGPU] Use encodeFieldVaVdst in hazard recognizer. NFCI. (#153881)
Co-authored-by: Stephen Thomas <Stephen.Thomas@amd.com>
2025-08-15 17:50:27 -07:00
Stanislav Mekhanoshin
b7ec10ca6c
[AMDGPU] Update GCNHazardRecognizer's understanding of gfx12 waitcount instructions (#153880)
This simply updates the pass's cognizance of these instructions, and for
the most part the hazards where they might be encountered do not exist
for gfx12. Nonetheless, encountering them has to be checked for, as doing
so would indicate a compiler error.

Co-authored-by: Stephen Thomas <Stephen.Thomas@amd.com>
2025-08-15 17:18:41 -07:00
Stanislav Mekhanoshin
4f34c740ab
[AMDGPU] w/a for s_setreg_b32 gfx1250 hazard with MODE register (#153879) 2025-08-15 16:08:13 -07:00
Stanislav Mekhanoshin
f1fc50748a
[AMDGPU] w/a hazard with writing s102/103 and reading FLAT_SCRATCH_BASE (#153878) 2025-08-15 15:23:06 -07:00
Stanislav Mekhanoshin
1f25c4883e
[AMDGPU] Mitigate DS_ATOMIC_ASYNC_BARRIER_ARRIVE_B64 bug (#153872)
DS_ATOMIC_ASYNC_BARRIER_ARRIVE_B64 shall not be claused (we already do
not clause DS instructions) and needs waits before and after.
2025-08-15 14:17:54 -07:00
Stanislav Mekhanoshin
29976f2e58
[AMDGPU] Handle S_GETREG_B32 hazard on gfx1250 (#153848)
GFX1250 SPG says: S_GETREG_B32 does not wait for idle before executing.
The user must S_WAIT_ALU 0 before S_GETREG_B32 on:
STATUS, STATE_PRIV, EXCP_FLAG_PRIV, or EXCP_FLAG_USER.
2025-08-15 11:38:22 -07:00
Stanislav Mekhanoshin
33abf05af4
[AMDGPU] gfx1250 v_permlane_* instructions (#151749) 2025-08-01 16:14:19 -07:00
Changpeng Fang
e47d5eb454
[AMDGPU] Hazard handling for gfx1250 wmma instructions (#149865)
If both instructions are xdl WMMA, hazard exists when the first WMMA
writes a register (D0) and the second WMMA reads it (A1/B1/Index1).

If the first instruction is a xdl WMMA, and the second one is a VALU,
three kinds of hazards exist:
  WMMA writes (D0), VALU reads (Use1);
  WMMA writes (D0), VALU writes (D1);
  WMMA reads (A0/B0/Index0), VALU writes (D1).

The actual number of hazard slots depends on the categories of the first
xdl WMMA as well as whether the second instruction is a xdl WMMA or
VALU. If there are not enough unrelated VALUs between the two
instructions, the appropriate number of V_NOPs (to cover the shortfall)
will be inserted to satisfy the hazard handling requirements.
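
The padding rule in the last paragraph amounts to a saturating subtraction. A minimal sketch, where `nopsNeeded` is a hypothetical helper and the slot counts passed to it are placeholders rather than real GFX1250 values:

```cpp
// RequiredSlots: hazard slots that must separate the two instructions.
// InterveningVALUs: unrelated VALU instructions already between them.
// Returns how many V_NOPs are still needed to cover the remaining slots.
constexpr unsigned nopsNeeded(unsigned RequiredSlots,
                              unsigned InterveningVALUs) {
  return RequiredSlots > InterveningVALUs ? RequiredSlots - InterveningVALUs
                                          : 0;
}
```

For example, a hazard needing 4 slots with 1 unrelated VALU already in between would need 3 V_NOPs, and none once enough VALUs are present.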
2025-07-21 13:24:10 -07:00
Diana Picus
20d8398825
[AMDGPU] ISel & PEI for whole wave functions (#145858)
Whole wave functions are functions that will run with a full EXEC mask.
They will not be invoked directly, but instead will be launched by way
of a new intrinsic, `llvm.amdgcn.call.whole.wave` (to be added in
a future patch). These functions are meant as an alternative to the
`llvm.amdgcn.init.whole.wave` or `llvm.amdgcn.strict.wwm` intrinsics.

Whole wave functions will set EXEC to -1 in the prologue and restore the
original value of EXEC in the epilogue. They must have a special first
argument, `i1 %active`, that is going to be mapped to EXEC. They may
have either the default calling convention or amdgpu_gfx. The inactive
lanes need to be preserved for all registers used, active lanes only for
the CSRs.

At the IR level, arguments to a whole wave function (other than
`%active`) contain poison in their inactive lanes. Likewise, the return
value for the inactive lanes is poison.

This patch contains the following work:
* 2 new pseudos, SI_SETUP_WHOLE_WAVE_FUNC and SI_WHOLE_WAVE_FUNC_RETURN
  used for managing the EXEC mask. SI_SETUP_WHOLE_WAVE_FUNC will return
  a SReg_1 representing `%active`, which needs to be passed into
  SI_WHOLE_WAVE_FUNC_RETURN.
* SelectionDAG support for generating these 2 new pseudos and the
  special handling of %active. Since the return may be in a different
  basic block, it's difficult to add the virtual reg for %active to
  SI_WHOLE_WAVE_FUNC_RETURN, so we initially generate an IMPLICIT_DEF
  which is later replaced via a custom inserter.
* Expansion of the 2 pseudos during prolog/epilog insertion. PEI also
  marks any used VGPRs as WWM registers, which are then spilled and
  restored with the usual logic.

Future patches will include the `llvm.amdgcn.call.whole.wave` intrinsic
and a lot of optimization work (especially in order to reduce spills
around function calls).

---------

Co-authored-by: Matt Arsenault <Matthew.Arsenault@amd.com>
Co-authored-by: Shilei Tian <i@tianshilei.me>
2025-07-21 10:39:09 +02:00
Changpeng Fang
560e7df689
AMDGPU: Handle the co-execution hazards for TRANS for gfx1250 (#149024)
For the co-execution of the TRANS ops, the requirement is: 1 independent
op or V_NOP (since TRANS takes 2 cycles) after the trans op before its
sources can be overwritten or the output of the trans op can be used.
2025-07-16 10:58:54 -07:00
Shilei Tian
c0e9084b1c
[AMDGPU] Add a debug option -amdgpu-snop-padding for GCNHazardRecognizer (#146587)
This can help identify whether there are potential hazards.

Co-authored-by: Byrnes, Jeffrey <Jeffrey.Byrnes@amd.com>
2025-07-02 08:16:38 -04:00
Harrison Hao
b2379bd5d5
[AMDGPU] Support bottom-up postRA scheduling. (#135295)
Solely relying on top‑down scheduling can underutilize hardware, since
long‑latency instructions often end up scheduled too late and their
latency isn’t well hidden. Adding bottom‑up post‑RA scheduling lets us
move those instructions earlier, which improves latency hiding and
yields roughly a 2% performance gain on key benchmarks.
2025-06-05 22:07:06 +08:00
Robert Imschweiler
e55172f139
[AMDGPU] Classify FLAT instructions as VMEM (#137148)
Also adapt hazard and wait handling.
2025-05-07 09:20:52 +02:00
Brox Chen
cd54d581b5
[AMDGPU][True16][CodeGen] add v_cndmask_t16 to hazardmask (#128912)
add v_cndmask_t16 to hazardmask
2025-03-14 12:31:57 -04:00
sstipano
531c48546d
[AMDGPU][NFC] Move isXDL and isDGEMM to SIInstrInfo. (#129103) 2025-02-28 03:14:51 +01:00
Fabian Ritter
8615f9aaff
[AMDGPU] Replace gfx940 and gfx941 with gfx942 in llvm (#126763)
gfx940 and gfx941 are no longer supported. This is one of a series of
PRs to remove them from the code base.

This PR removes all non-documentation occurrences of gfx940/gfx941 from
the llvm directory, and the remaining occurrences in clang.

Documentation changes will follow.

For SWDEV-512631
2025-02-19 10:20:48 +01:00
Rahul Joshi
bee9664970
[TableGen] Emit OpName as an enum class instead of a namespace (#125313)
- Change InstrInfoEmitter to emit OpName as an enum class
  instead of an anonymous enum in the OpName namespace.
- This will help clearly distinguish between values that are 
  OpNames vs just operand indices and should help avoid
  bugs due to confusion between the two.
- Rename OpName::OPERAND_LAST to NUM_OPERAND_NAMES.
- Emit declaration of getOperandIdx() along with the OpName
  enum so it doesn't have to be repeated in various headers.
- Also updated AMDGPU, RISCV, and WebAssembly backends
  to conform to the new definition of OpName (mostly
  mechanical changes).
2025-02-12 08:19:30 -08:00
Vigneshwar Jayakumar
1188b1ff7b
AMDGPU: Handle gfx950 XDL Write-VGPR-VALU-WAW wait state change (#126132)
There are additional wait states for XDL write VALU WAW hazard in gfx950
compared to gfx940.
2025-02-12 01:32:23 +07:00
Vigneshwar Jayakumar
a2263eba4d
AMDGPU: Handle gfx950 XDL-write-VGPR-VALU-Mem-Exp wait state change (#126727) 2025-02-12 01:30:53 +07:00
Vigneshwar Jayakumar
c837f57286
AMDGPU: Handle gfx950 XDL-write-VGPR-Overlap-Src-AB wait state (#126732)
gfx950 needs more wait states than gfx940.
2025-02-11 22:30:16 +07:00
Carl Ritson
a3a3e6997b
[AMDGPU] Rewrite GFX12 SGPR hazard handling to dedicated pass (#118750)
- Algorithm operates over whole IR to attempt to minimize waits.
- Add support for VALU->VALU SGPR hazards via VA_SDST/VA_VCC.
2025-01-30 11:21:11 +09:00
Chinmay Deshpande
9ca1323de1
[AMDGPU] Fix crash due to missing check for FLAT instructions that don't use vector registers when computing VALU hazard (#123627) 2025-01-21 05:50:58 -08:00
Brox Chen
8a0c2e7567
[AMDGPU][True16][MC][CodeGen] true16 for v_cndmask_b16 (#119736)
Support true16 format for v_cndmask_b16 in MC and CodeGen in true16 and
fake16 flow.

Since we are replacing `v_cndmask_b16` with `v_cndmask_b16_t16/fake16`, we
have to at least update the fake16 codegen to get the codegen tests passing.
In this case, we have to update true16 and fake16 together; otherwise
some of the true16 tests will fail.
2025-01-16 17:18:28 -05:00
Pravin Jagtap
5e007afa9d
[AMDGPU] Handle hazard in v_scalef32_sr_fp4_* conversions (#118589)
Presently, the compiler selectively adds a nop when opsel != 0, i.e. only
when partially writing to high bytes.
Experiments in SWDEV-499733 and SWDEV-501347 suggest that we need a nop
for the above cases irrespective of opsel values.

Note: We might need to add a few others to the same table.
2024-12-11 18:38:10 +05:30