147 Commits

Author SHA1 Message Date
Lucas Ramirez
dfbf4eb88e
[AMDGPU][Scheduler] Fix incorrect region index in EXPENSIVE_CHECKS (#179461)
#177206 exposed a pre-existing typo in EXPENSIVE_CHECKS in the
scheduler's rematerialization stage. When rematerializing a register
that depends on other registers, the dependent registers should be live
at the live-ins of the rematerialized register's using region, but not
necessarily at the live-ins of its defining region, as it is written
right now.

This fixes that and also hoists the check from the loop on regions where
the rematerialized register is live, since it's only supposed to run
once for the using region anyway. This fixes the following unit tests
which were failing when building with EXPENSIVE_CHECKS.
- `CodeGen/AMDGPU/copy-hoist-no-spills.ll`
- `CodeGen/AMDGPU/sched_mfma_rewrite_copies.mir`
2026-02-03 17:51:36 +01:00
Lucas Ramirez
9dc6b4904a
Re-apply "[AMDGPU][Scheduler] Scoring system for rematerializations (#175050)" (#177206)
This re-applies commit f21e3593371c049380f056a539a1601a843df558 along
with the compile fix failure introduced in
8ab79377740789f6a34fc6f04ee321a39ab73724 before the initial patch was
reverted. It also fixes for the previously observed assert failure.

We were hitting the assert in the HIP Blender due to a combination of
two issues that could happen when rematerializations are being rolled
back.

1. Small changes in slots indices (while preserving instruction order)
compared to the pre-re-scheduling state means that we have to re-compute
live ranges for all register operands of rolled back rematerializations.
This was not being done before.
2. Re-scheduling can move registers that were rematerialized at
arbitrary positions in their respective regions while their opcode is
set to DBG_VALUE, even before their read operands are defined. This
makes re-scheduling reverts mandatory before rolling back
rematerializations, as otherwise def-use chains may be broken. The
original patch did not guarantee that, but previous refactoring of the
rollback/revert logic for the rematerialization stage now ensures that
reverts always precede rollbacks.
2026-02-02 12:34:16 +01:00
Lucas Ramirez
cf60af88b4
[AMDGPU][Scheduler] Revert all regions when remat fails to increase occ. (#177205)
When the rematerialization stage fails to increase occupancy in all
regions, the current implementation only reverts the effect of
re-scheduling in regions in which the increased occupancy target could
not be achieved. However, given that re-scheduling with a higher
occupancy target puts more pressure on the scheduler to achieve lower
maximum RP at the cost of potentially lower ILP as well, region
schedules made with higher occupancy targets are generally less
desirable if the whole function is not able to meet that target.
Therefore, if at least one region cannot reach its target, it makes
sense to revert re-scheduling in all affected regions to go back to a
schedule that was made with a lower occupancy target.

This implements such logic for the rematerialization stage, and adds a
test to showcase that re-scheduling is indeed interrupted/reverted as
soon as a re-scheduled region that does not meet the increased target
occupancy is encountered.

As a minor improvement, this also sets higher occupancy targets for
re-scheduling at the end of stage initialization in some cases. In cases
where rematerializations alone are not able to achieve the target, this
can push the scheduler to be more aggressive in reducing RP and achieve
the target.
2026-02-02 00:33:40 +00:00
Lucas Ramirez
7274ae970d
[AMDGPU][Scheduler] Simplify scheduling revert logic (#177203)
When scheduling must be reverted for a region, the current
implementation re-orders non-debug instructions and debug instructions
separately; the former in a first pass and the latter in a second pass
handled by a generic machine scheduler helper whose state is tied to the
current region being scheduled, in turns limiting the revert logic to
only work on the active scheduling region.

This makes the revert logic work in a single pass for all MIs, and
removes the restriction that it works exclusively on the active
scheduling region. The latter enables future use cases such as reverting
scheduling of multiple regions at once.
2026-02-01 17:55:12 +01:00
Lucas Ramirez
8029699b8d
[AMDGPU][Scheduler] Make finalizeGCNRegion an overridable hook (NFC) (#177199)
This allows individual stages to make decisions after re-scheduling
individual regions.
2026-01-30 17:35:23 +01:00
LU-JOHN
90b5ec1226
[AMDGPU] revertScheduling must behave the same with or without debug (#177483)
When revertScheduling reorders instructions only update LiveIntervals if
non-debug instructions have been re-ordered. Otherwise code-generation
can change with debug info.

---------

Signed-off-by: John Lu <John.Lu@amd.com>
2026-01-27 14:33:03 -06:00
Tony Linthicum
fc94ef9576
AMDGPU: Disable scheduler mfma rewrite stage by default for now (#177624)
Currently generating an excess number of copies. Turning it off to avoid
churn for other developers.
2026-01-23 23:39:03 +01:00
Sam Elliott
7184229fea
[NFC][MI] Tidy Up RegState enum use (2/2) (#177090)
This Change makes `RegState` into an enum class, with bitwise operators.
It also:
- Updates declarations of flag variables/arguments/returns from
`unsigned` to `RegState`.
- Updates empty RegState initializers from 0 to `{}`.

If this is causing problems in downstream code:
- Adopt the `RegState getXXXRegState(bool)` functions instead of using a
ternary operator such as `bool ? RegState::XXX : 0`.
- Adopt the `bool hasRegState(RegState, RegState)` function instead of
using a bitwise check of the flags.
2026-01-23 00:19:03 -08:00
Tony Linthicum
9bc0e62617
[AMDGPU] Add scheduling stage to rewrite MFMA from VGPR to AGPR (#170335)
This pull request is an update of [Jeff Byrne's
PR](https://github.com/llvm/llvm-project/pull/149367). Additionally, all
unresolved comments from the original PR have been addressed.

Changes to make the MachineScheduler dependent upon
MachineBlockFrequencyInfo have been pulled out into a new PR,
[176172](https://github.com/llvm/llvm-project/pull/176172), upon which
this PR now depends.

---------

Co-authored-by: Jeffrey Byrnes <Jeffrey.Byrnes@amd.com>
2026-01-22 12:22:39 +01:00
Juan Manuel Martinez Caamaño
c30c2f4f3e
[AMDGPU] Rematerialize VGPR candidates when SGPR spills results in VGPR Excess (#168079)
Before, when selecting candidates to rematerialize, we would only
consider SGPR candidates when there was an excess of SGPR registers.

Failing to eliminate the excess would result in spills to VGPRs.
This is normally not an issue, unless spilling to VGPRs results in
excess VGPRs.

This patch does 2 things:
* It relaxes the GCNRPTarget success criteria: now we accept regions
  where we spill SGPRs to VGPRs, as long as this does not end up in
  excess VGPRs.
* It changes isSaveBeneficial to consider the excess VGPRs (which
  includes the SGPRs that would be spilled to VGPR).

With these changes, the compiler rematerializes VGPRs when the excess
SGPRs would result in VGPR excess.

This has some unaddressed flaws: we should attempt to rematerialize
SGPRs
first in order to eliminate the SGPR excess that results in VGPR excess.

Related to SWDEV-549940
2026-01-16 09:08:55 +01:00
Lucas Ramirez
7b699cc5de
Revert "[AMDGPU][Scheduler] Scoring system for rematerializations (#175050)" (#175813)
This reverts 8ab79377740789f6a34fc6f04ee321a39ab73724 and
f21e3593371c049380f056a539a1601a843df558 which are causing a HIP failure
in a Blender test.
2026-01-13 19:29:50 +00:00
Thurston Dang
458a2e88cc
[AMDGPU][Scheduler] Fix use-after-poison by printing before deleting (#175807)
A buildbot was failing with a use-after-poison
(https://lab.llvm.org/buildbot/#/builders/24/builds/16530) after
https://github.com/llvm/llvm-project/pull/175050:
```
==llc==1532559==ERROR: AddressSanitizer: use-after-poison on address 0xe26e74e12368 at pc 0xb36d41bd74dc bp 0xffffed72a450 sp 0xffffed72a448
READ of size 8 at 0xe26e74e12368 thread T0
    #0 0xb36d41bd74d8 in llvm::MachineInstr::print(llvm::raw_ostream&, bool, bool, bool, bool, llvm::TargetInstrInfo const*) const /home/b/sanitizer-aarch64-linux-bootstrap-asan/build/llvm-project/llvm/lib/CodeGen/MachineInstr.cpp:1796:35
    #1 0xb36d3e221b08 in operator<< /home/b/sanitizer-aarch64-linux-bootstrap-asan/build/llvm-project/llvm/include/llvm/CodeGen/MachineInstr.h:2150:6
    #2 0xb36d3e221b08 in llvm::PreRARematStage::rollback(llvm::PreRARematStage::RollbackInfo const&, llvm::BitVector&) const /home/b/sanitizer-aarch64-linux-bootstrap-asan/build/llvm-project/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp:2363:57
...
```

This is because it was printing an instruction that had already been
deleted. This patch fixes this by reversing the order.
2026-01-13 10:14:32 -08:00
Lucas Ramirez
125d24af76
[AMDGPU][Scheduler] Fix compile failure due to const/sort interaction (#175755)
On some configurations sorting `ScoredRemat` objects which contains
const members causes a compile failure due to impossibility of
swapping/moving objects. The problem was introduced in #175050.

This removes const from those fields to address the issue. The design
will soon change anyway to not rely on sorting objects of this type, and
consts were only here for semantic clarity.
2026-01-13 14:11:47 +01:00
Lucas Ramirez
6aaa7fd7fb
[AMDGPU][Scheduler] Scoring system for rematerializations (#175050)
This is a significant refactoring of the scheduler's rematerialization
stage meant to improve rematerialization capabilities and lay strong
foundations for future improvements.

As before, the stage identifies scheduling regions in which RP must be
reduced (so-called "target regions"), then rematerializes registers to
try and achieve the desired reduction. All regions affected by
rematerializations are re-scheduled, and, if the MIR is deemed worse
than before, rematerializations are rolled back to leave the MIR in its
pre-stage state.

The core contribution is a scoring system to estimate the benefit of
each rematerialization candidate. This score favors rematerializing
candidates which, in order, would

1. (if the function is spilling) reduce RP in highest-frequency target
regions,
2. be rematerialized to lowest-frequency target regions, and
3. reduce RP in the highest number of target regions.

All rematerialization opportunities are initially scored and
rematerialized in decreasing score order until RP objectives are met or
pre-computed scores diverge from reality; in the latter case remaining
candidates are re-scored and the process repeats. New tests in
`machine-scheduler-rematerialization-scoring.mir` showcase how the
scoring system dictates which rematerialization are the most beneficial
and therefore performed first

A minor contribution included in this PR following previous feedback is
that rollback now happens in-place i.e., without having to re-create the
rematerialized MI. This leaves original slot indices and registers
untouched. We achieve this by temporarily switching the opcode of
rollback-able instructions to a debug opcode during re-scheduling so
that they are ignored.
2026-01-13 12:24:16 +01:00
Dhruva Chakrabarti
94e4ee38aa
[AMDGPU] Fixed crash in getLastMIForRegion when the region is empty. (#168653)
PreRARematStage builds region live-outs if GCN trackers are enabled. If
rematerialization leads to empty regions, this can cause a crash because
of dereference of an invalid iterator in getLastMIForRegion. The fix is
to skip calling getLastMIForRegion for empty regions.

This patch fixes another bug in the same code region. getLastMIForRegion
calls skipDebugInstructionsBackward which may immediately return the
RegionEnd if it is not the begin instruction and it is a non-debug
instruction. That would imply considering an instruction that is outside
the relevant region. The fix is to always pass the previous of RegionEnd
to skipDebugInstructionsBackward.

This bug was found while using GCN trackers on the existing LIT test
machine-scheduler-sink-trivial-remats.mir. Here's the assertion failure.

llvm-project/llvm/include/llvm/ADT/ilist_iterator.h:168:
llvm::ilist_iterator<OptionsT, IsReverse, IsConst>::reference
llvm::ilist_iterator<OptionsT, IsReverse, IsConst>::operator*() const
[with OptionsT = llvm::ilist_detail::node_options<llvm::MachineInstr,
true, true, void, false, void>; bool IsReverse = false; bool IsConst =
false; llvm::ilist_iterator<OptionsT, IsReverse, IsConst>::reference =
llvm::MachineInstr&]: Assertion `!NodePtr->isKnownSentinel()' failed.
2025-11-19 16:19:20 -08:00
Dhruva Chakrabarti
58ac95db60
[AMDGPU] Avoid changing minOccupancy if unclustered schedule was not run for any region. (#162025)
During init of unclustered schedule stage, minOccupancy may be
temporarily increased. But subsequently, if none of the regions are
scheduled because they don't meet the conditions of initGCNRegion,
minOccupancy remains incorrectly set. This patch avoids this
incorrectness by delaying the change of minOccupancy until a region is
about to be scheduled.
2025-11-12 10:11:22 -08:00
Matt Arsenault
a88fa64e25
CodeGen: Remove TRI argument from reMaterialize (#158229) 2025-11-10 16:23:36 -08:00
Austin Kerbow
d4b1ab77c1
[AMDGPU] Examine instructions in pending queues during scheduling (#147653)
Examine instructions in the pending queue when scheduling. This makes
instructions visible to scheduling heuristics even when they aren't
immediately issuable due to hardware resource constraints.

The scheduler has two hardware resource modeling modes: an in-order mode
where instructions must be ready to issue before scheduling, and
out-of-order models where instructions are always visible to heuristics.
Special handling exists for unbuffered processor resources in
out-of-order models. These resources can cause pipeline stalls when used
back-to-back, so they're typically avoided. However, for AMDGPU targets,
managing register pressure and reducing spilling is critical enough to
justify exceptions to this approach.

This change enables examination of instructions that can't be
immediately issued because they use an already occupied unbuffered
resource. By making these instructions visible to scheduling heuristics
anyway, we gain more flexibility in scheduling decisions, potentially
allowing better register pressure and hardware resource management.
2025-10-16 08:46:29 -07:00
Valery Pykhtin
8823efe77d
[AMDGPU] Add register usage debug printing the point of maximum register pressure. (#161850)
Basically this allows to analyze "why so many VGPRs used?".

It prints all live registers at the point of maximum register pressure
and for each register its defs/uses are dumped.

Currently can be run before and after the scheduler but would be nice if
it can be ran inbetween any passes (not sure this is possible with
legacy pass-manager). Requires debug or built with asserts compiler.

Highly recommended to run with debug info to have debug locations for
instructions.

Example output:

```
*** Register pressure info (VGPRs) for _ZN7ck_tile6ken.... ***
Max pressure is 256 VGPRs at 41780e@BB.18 (LoopHdr BB.16, Depth 1): %9858:vreg_512_align2 = contract V_MFMA_F32_32X32X16_BF16_mac_vgprcd_e64 %10137:vreg_128_align2, %10141:vreg_128_align2, %9858:vreg_512_align2(tied-def 0), 0, 0, 0, implicit $mode, implicit $exec

Live registers with single definition (123 VGPRs):
  %10126:VReg_128_Align2, LiveMask 00000000000000FF (4 VGPRs)
    def 41600r@BB.18 (LoopHdr BB.16, Depth 1): undef %10126.sub0_sub1:vreg_128_align2 = DS_READ_B64_TR_B16 %478:vgpr_32, 15232, 0, implicit $exec :: (load (s64) from %ir.sunkaddr1314, !noalias !60, addrspace 3)
    def 41608r@BB.18 (LoopHdr BB.16, Depth 1): %10126.sub2_sub3:vreg_128_align2 = DS_READ_B64_TR_B16 %478:vgpr_32, 16320, 0, implicit $exec :: (load (s64) from %ir.sunkaddr1315, !noalias !60, addrspace 3)
    use 41848r@BB.18 (LoopHdr BB.16, Depth 1): %9856:vreg_512_align2 = contract V_MFMA_F32_32X32X16_BF16_mac_vgprcd_e64 %10126:vreg_128_align2, %10138:vreg_128_align2, %9856:vreg_512_align2(tied-def 0), 0, 0, 0, implicit $mode, implicit $exec
  %10136:VReg_128_Align2, LiveMask 00000000000000FF (4 VGPRs)
    def 41264r@BB.18 (LoopHdr BB.16, Depth 1): undef %10136.sub0_sub1:vreg_128_align2 = DS_READ_B64_TR_B16 %478:vgpr_32, 2176, 0, implicit $exec :: (load (s64) from %ir.sunkaddr1294, !noalias !60, addrspace 3)
    def 41272r@BB.18 (LoopHdr BB.16, Depth 1): %10136.sub2_sub3:vreg_128_align2 = DS_READ_B64_TR_B16 %478:vgpr_32, 3264, 0, implicit $exec :: (load (s64) from %ir.sunkaddr1295, !noalias !60, addrspace 3)
    use 41788r@BB.18 (LoopHdr BB.16, Depth 1): %9858:vreg_512_align2 = contract V_MFMA_F32_32X32X16_BF16_mac_vgprcd_e64 %10136:vreg_128_align2, %10140:vreg_128_align2, %9858:vreg_512_align2(tied-def 0), 0, 0, 0, implicit $mode, implicit $exec
  %10129:VReg_128_Align2, LiveMask 00000000000000FF (4 VGPRs)
...
Live registers with multiple definitions (133 VGPRs):
  %9856:VReg_512_Align2, LiveMask 00000000FFFFFFFF (16 VGPRs)
    def 16544r@BB.8: INLINEASM &"v_pk_mul_f32 $0, $1, $2" [sideeffect] [isconvergent] [attdialect], $0:[regdef:VReg_64_Align2], def undef %9856.sub0_sub1:vreg_512_align2, $1:[reguse:VReg_64_Align2], %4069:vreg_64_align2, $2:[reguse:VReg_64_Align2], %10159:vreg_64_align2, !52
    def 16592r@BB.8: INLINEASM &"v_pk_mul_f32 $0, $1, $2" [sideeffect] [isconvergent] [attdialect], $0:[regdef:VReg_64_Align2], def %9856.sub2_sub3:vreg_512_align2, $1:[reguse:VReg_64_Align2], %4069:vreg_64_align2, $2:[reguse:VReg_64_Align2], %10159:vreg_64_align2, !52
    def 16608r@BB.8: INLINEASM &"v_pk_mul_f32 $0, $1, $2" [sideeffect] [isconvergent] [attdialect], $0:[regdef:VReg_64_Align2], def %9856.sub4_sub5:vreg_512_align2, $1:[reguse:VReg_64_Align2], %4069:vreg_64_align2, $2:[reguse:VReg_64_Align2], %10159:vreg_64_align2, !52
    def 16656r@BB.8: INLINEASM &"v_pk_mul_f32 $0, $1, $2" [sideeffect] [isconvergent] [attdialect], $0:[regdef:VReg_64_Align2], def %9856.sub6_sub7:vreg_512_align2, $1:[reguse:VReg_64_Align2], %4069:vreg_64_align2, $2:[reguse:VReg_64_Align2], %10159:vreg_64_align2, !52
    def 16672r@BB.8: INLINEASM &"v_pk_mul_f32 $0, $1, $2" [sideeffect] [isconvergent] [attdialect], $0:[regdef:VReg_64_Align2], def %9856.sub8_sub9:vreg_512_align2, $1:[reguse:VReg_64_Align2], %4069:vreg_64_align2, $2:[reguse:VReg_64_Align2], %10159:vreg_64_align2, !52
    def 16720r@BB.8: INLINEASM &"v_pk_mul_f32 $0, $1, $2" [sideeffect] [isconvergent] [attdialect], $0:[regdef:VReg_64_Align2], def %9856.sub10_sub11:vreg_512_align2, $1:[reguse:VReg_64_Align2], %4069:vreg_64_align2, $2:[reguse:VReg_64_Align2], %10159:vreg_64_align2, !52
    def 16736r@BB.8: INLINEASM &"v_pk_mul_f32 $0, $1, $2" [sideeffect] [isconvergent] [attdialect], $0:[regdef:VReg_64_Align2], def %9856.sub12_sub13:vreg_512_align2, $1:[reguse:VReg_64_Align2], %4069:vreg_64_align2, $2:[reguse:VReg_64_Align2], %10159:vreg_64_align2, !52
    def 16784r@BB.8: INLINEASM &"v_pk_mul_f32 $0, $1, $2" [sideeffect] [isconvergent] [attdialect], $0:[regdef:VReg_64_Align2], def %9856.sub14_sub15:vreg_512_align2, $1:[reguse:VReg_64_Align2], %4069:vreg_64_align2, $2:[reguse:VReg_64_Align2], %10159:vreg_64_align2, !52
    def use 41828r@BB.18 (LoopHdr BB.16, Depth 1): %9856:vreg_512_align2 = contract V_MFMA_F32_...
...
********** INTERVALS **********
...
********** MACHINEINSTRS **********
# Machine code for function _ZN7ck_tile6kentr...
```
2025-10-13 16:17:11 +02:00
Philip Reames
58c959b4f7
[AMDGPU] Use common allUsesAvailableAt implementation [nfc] (#161418)
Replace the target specific copy with a call to the generic routine. I
don't spot any differences by eye, and there's nothing in the original
review discussion (#124327) which makes it clear why this was
duplicated.
2025-10-01 07:22:13 -07:00
Philip Reames
ea721e2fa1
[TII] Split isTrivialReMaterializable into two versions [nfc] (#160377)
This change builds on https://github.com/llvm/llvm-project/pull/160319
which tries to clarify which *callers* (not backends) assume that the
result is actually trivial.

This change itself should be NFC. Essentially, I'm just renaming the
existing isTrivialRematerializable to the non-trivial version and then
adding a new trivial version (with the same name as the prior function)
and simplifying a few callers which want that semantic.

This change does *not* enable non-trivial remat any more broadly than
was already done for our targets which were lying through the old APIs;
that will come separately. The goal here is simply to make the code
easier to follow in terms of what assumptions are being made where.

---------

Co-authored-by: Luke Lau <luke_lau@icloud.com>
2025-09-24 18:52:17 -07:00
Lucas Ramirez
83c308f014
[AMDGPU][Scheduler] Consistent occupancy calculation during rematerialization (#149224)
The `RPTarget`'s way of determining whether VGPRs are beneficial to save
and whether the target has been reached w.r.t. VGPR usage currently
assumes, if `CombinedVGPRSavings` is true, that free slots in one VGPR
RC can always be used for the other. Implicitly, this makes the
rematerialization stage (only current user of `RPTarget`) follow a
different occupancy calculation than the "regular one" that the
scheduler uses, one that assumes that ArchVGPR/AGPR usage can be
balanced perfectly and at no cost, which is untrue in general. This
ultimately yields suboptimal rematerialization decisions that require
cross-VGPR-RC copies unnecessarily.

This fixes that, making the `RPTarget`'s internal model of occupancy
consistent with the regular one. The `CombinedVGPRSavings` flag is
removed, and a form of cross-VGPR-RC saving implemented only for unified
RFs, which is where it makes the most sense. Only when the amount of
free VGPRs in a given VGPR RC (ArchVPGR or AGPR) is lower than the
excess VGPR usage in the other VGPR RC does the `RPTarget` consider that
a pressure reduction in the former will be beneficial to the latter.
2025-08-08 14:26:04 +02:00
Lucas Ramirez
96f3872dc9
[AMDGPU][Scheduler] Delete RegionsWithMinOcc bitvector from scheduler (NFC) (#142361)
The `GCNScheduleDAGMILive`'s `RegionsWithMinOcc` bitvector is only used
by the `UnclusteredHighRPStage`. Its presence in the scheduler's state
forces us to maintain its value throughout scheduling even though it is
of no use to the iterative scheduling process itself. At any point
during scheduling it is possible to cheaply compute the occupancy
induced by a particular register pressure. Furthermore, the field
doesn't appear to be updated correctly throughout scheduling i.e., bits
corresponding to regions at minimum occupancy are not always set in the
vector.

This removes the bitvector from `GCNScheduleDAGMILive`.
`UnclusteredHighRPStage::initGCNRegion` now directly computes the
occupancy of possibly reschedulable regions instead of querying the
vector. Since it is the most expensive check, it is done last in the
list.
2025-08-01 13:20:05 +02:00
Ruiling, Song
451912a24a
[MachineScheduler] Make cluster check more efficient (#150884) 2025-08-01 16:00:42 +08:00
Jeffrey Byrnes
d3a9cde7b8
[AMDGPU] Don't skip regions in getRegionLiveInMap (#151423)
Currently, this skips any region that is not the first region in a
block. This is because the only user of it only cares about the LiveIns
per-block. However, as named, this is supposed to compute the per-region
LiveIns.

This doesn't have any effect on scheduling / CodeGen currently (aside
from computing LiveIns for all regions) since only the per-block LiveIns
are needed. However, I'm working on something that will use this.

Intended User:

https://github.com/llvm/llvm-project/pull/149367/


c62a2f127c/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp (L1351)
2025-07-31 15:05:18 -07:00
Lucas Ramirez
c77a2a2403
[AMDGPU][Scheduler] Use AMDGPU::NoSubRegister instead of 0 (NFC) (#150610) 2025-07-25 14:25:13 +02:00
Lucas Ramirez
e38f98f535
[AMDGPU][Scheduler] Fix usage of TII.reMaterialize (NFC) (#150259)
Any non-zero `SubIdx` passed to the method is composed with the
rematerialized instruction's first operand's subregister to determine
the new register's subregister. In our case we want the new register to
have the same subregister as the old one, so we should pass 0.
2025-07-25 12:51:21 +02:00
Jeffrey Byrnes
283a62fa5b
[AMDGPU] NFC: Decouple getRealRegPressure from current region (#149219)
We're already accepting a RegionIdx for the LiveIns, also use this for
the instruction iterators.

Enables querying RP for other regions -- useful for function wide
transformations (e.g. rematerialization, rewriting, etc).
2025-07-16 19:59:38 -07:00
Lucas Ramirez
6307b496f8
[AMDGPU] Add GCNRPTarget to track register pressure against a target (#145765)
This adds the `GCNRPTarget` class which models a register pressure
target (i.e., maximum number of SGPRs/VGPRS) that one can track register
savings against. The only current use of this class is in the
scheduler's rematerialization stage. It replaces the more ad-hoc (and
now deleted) `ExcessRP` class which used to serve the same purpose.

This is only NFC~ish because `GCNRPTarget` tracks VGPR usage more
accurately than `ExcessRP` used to. To estimate required combined VGPR
savings we now additionally take into account the number of available
VGPRs in both banks (ArchVGPR and AGPR) at the time where the RP target
is created, whereas we used to only consider explicit savings made from
the starting RP. This makes VGPR savings estimations more accurate in
cases where we allow for savings in one VGPR bank to help towards
reducing pressure in another VGPR bank (see
`GCNRPTarget::CombineVGPRSavings`). This is the cause for unit test
changes.
2025-06-26 13:11:20 +02:00
Lucas Ramirez
c74ed8a0d3
[AMDGPU][Scheduler] Support for rematerializing SGPRs and AGPRs (#140036)
This adds the ability to rematerialize SGPRs and AGPRs to the
scheduler's `PreRARematStage`, which can currently only rematerialize
ArchVGPRs. This also fixes a small potential issue in the stage where,
in case of spilling, the target occupancy could be set to a lower than
expected value when the function had either one of the "amdgpu-num-sgpr"
or "amdgpu-num-vgpr" attributes set.
2025-06-24 19:30:27 +02:00
Diana Picus
a201f8872a
[AMDGPU] Replace dynamic VGPR feature with attribute (#133444)
Use a function attribute (amdgpu-dynamic-vgpr) instead of a subtarget
feature, as requested in #130030.
2025-06-24 11:09:36 +02:00
Mikael Holmen
77062244ed Fix two instances of -Wparentheses warnings [NFC]
Add parentheses around the assert conditions.

Without this gcc warned like
 ../lib/Target/AMDGPU/GCNSchedStrategy.cpp:2250: warning: suggest parentheses around '&&' within '||' [-Wparentheses]
  2250 |          NewMI != RegionBounds.second && "cannot remove at region end");
and
 ../../clang/lib/Sema/SemaOverload.cpp:11326:39: warning: suggest parentheses around '&&' within '||' [-Wparentheses]
 11326 |          DeferredCandidatesCount == 0 &&
       |          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~
 11327 |              "Unexpected deferred template candidates");
       |              ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2025-06-12 09:43:30 +02:00
Ruiling, Song
0487db1f13
MachineScheduler: Improve instruction clustering (#137784)
The existing way of managing clustered nodes was done through adding
weak edges between the neighbouring cluster nodes, which is a sort of
ordered queue. And this will be later recorded as `NextClusterPred` or
`NextClusterSucc` in `ScheduleDAGMI`.

But actually the instruction may be picked not in the exact order of the
queue. For example, we have a queue of cluster nodes A B C. But during
scheduling, node B might be picked first, then it will be very likely
that we only cluster B and C for Top-Down scheduling (leaving A alone).

Another issue is:
```
   if (!ReorderWhileClustering && SUa->NodeNum > SUb->NodeNum)
      std::swap(SUa, SUb);
   if (!DAG->addEdge(SUb, SDep(SUa, SDep::Cluster)))
```
may break the cluster queue.

For example, we want to cluster nodes (order as in `MemOpRecords`): 1 3
2. 1(SUa) will be pred of 3(SUb) normally. But when it comes to (3, 2),
As 3(SUa) > 2(SUb), we would reorder the two nodes, which makes 2 be
pred of 3. This makes both 1 and 2 become preds of 3, but there is no
edge between 1 and 2. Thus we get a broken cluster chain.

To fix both issues, we introduce an unordered set in the change. This
could help improve clustering in some hard case.

One key reason the change causes so many test check changes is: As the
cluster candidates are not ordered now, the candidates might be picked
in different order from before.

The most affected targets are: AMDGPU, AArch64, RISCV.

For RISCV, it seems to me most are just minor instruction reorder, don't
see obvious regression.

For AArch64, there were some combining of ldr into ldp being affected.
With two cases being regressed and two being improved. This has more
deeper reason that machine scheduler cannot cluster them well both
before and after the change, and the load combine algorithm later is
also not smart enough.

For AMDGPU, some cases have more v_dual instructions used while some are
regressed. It seems less critical. Seems like test `v_vselect_v32bf16`
gets more buffer_load being claused.
2025-06-05 15:28:04 +08:00
Lucas Ramirez
3e18216474
[AMDGPU][Scheduler] Delete RescheduleRegions bitvector from scheduler (NFC) (#141595)
The `GCNScheduleDAGMILive`'s `RescheduleRegions` bitvector is only used
by the rematerialization stage (`PreRARematStage`). Its presence in the
scheduler's state forces us to maintain its value throughout scheduling
even though it is of no use to the iterative scheduling process itself,
which instead relies on each stage's `initGCNRegion` hook to determine
whether the current region should be rescheduled.

This moves the bitvector to the `PreRARematStage`, which uses it to
store the set of regions that must be rescheduled between stage
initialization and region initialization.

This NFC also swaps a call to `GCNRegPressure::getArchVGPRNum(false)`
for a call to `GCNRegPressure::getArchVGPRNum()`---which is equivalent
but simpler in the context---and makes
`GCNSchedStage::finalizeGCNRegion` use its own API to advance to the
next region.
2025-05-27 17:10:28 +02:00
Lucas Ramirez
6456ee056f
Reapply "[AMDGPU][Scheduler] Refactor ArchVGPR rematerialization during scheduling (#125885)" (#139548)
This reapplies 067caaa and 382a085 (reverting b35f6e2) with fixes to
issues detected by the address sanitizer (MIs have to be removed from
live intervals before being removed from their parent MBB).

Original commit description below.

AMDGPU scheduler's `PreRARematStage` attempts to increase function
occupancy w.r.t. ArchVGPR usage by rematerializing trivial
ArchVGPR-defining instruction next to their single use. It first
collects all eligible trivially rematerializable instructions in the
function, then sinks them one-by-one while recomputing occupancy in all
affected regions each time to determine if and when it has managed to
increase overall occupancy. If it does, changes are committed to the
scheduler's state; otherwise modifications to the IR are reverted and
the scheduling stage gives up.

In both cases, this scheduling stage currently involves repeated queries
for up-to-date occupancy estimates and some state copying to enable
reversal of sinking decisions when occupancy is revealed not to
increase. The current implementation also does not accurately track
register pressure changes in all regions affected by sinking decisions.

This commit refactors this scheduling stage, improving RP tracking and
splitting the stage into two distinct steps to avoid repeated occupancy
queries and IR/state rollbacks.

- Analysis and collection (`canIncreaseOccupancyOrReduceSpill`). The
number of ArchVGPRs to save to reduce spilling or increase function
occupancy by 1 (when there is no spilling) is computed. Then,
instructions eligible for rematerialization are collected, stopping as
soon as enough have been identified to be able to achieve our goal
(according to slightly optimistic heuristics). If there aren't enough of
such instructions, the scheduling stage stops here.
- Rematerialization (`rematerialize`). Instructions collected in the
first step are rematerialized one-by-one. Now we are able to directly
update the scheduler's state since we have already done the occupancy
analysis and know we won't have to rollback any state. Register
pressures for impacted regions are recomputed only once, as opposed to
at every sinking decision.

In the case where the stage attempted to increase occupancy, and if both
rematerializations alone and rescheduling after were unable to improve
occupancy, then all rematerializations are rollbacked.
2025-05-13 11:11:00 +02:00
Vitaly Buka
b35f6e26a5
Revert "[AMDGPU][Scheduler] Refactor ArchVGPR rematerialization during scheduling (#125885)" (#139341)
And related "[AMDGPU] Regenerate mfma-loop.ll test"

Introduce memory error detected by Asan #125885.

This reverts commit 382a085a95b0abeac77b150b7b644b372bd08e78.
This reverts commit 067caaafb58a156d0d77229422607782a639f5b5.
2025-05-09 17:51:46 -07:00
Lucas Ramirez
067caaafb5
[AMDGPU][Scheduler] Refactor ArchVGPR rematerialization during scheduling (#125885)
AMDGPU scheduler's `PreRARematStage` attempts to increase function
occupancy w.r.t. ArchVGPR usage by rematerializing trivial
ArchVGPR-defining instruction next to their single use. It first
collects all eligible trivially rematerializable instructions in the
function, then sinks them one-by-one while recomputing occupancy in all
affected regions each time to determine if and when it has managed to
increase overall occupancy. If it does, changes are committed to the
scheduler's state; otherwise modifications to the IR are reverted and
the scheduling stage gives up.

In both cases, this scheduling stage currently involves repeated queries
for up-to-date occupancy estimates and some state copying to enable
reversal of sinking decisions when occupancy is revealed not to
increase. The current implementation also does not accurately track
register pressure changes in all regions affected by sinking decisions.

This commit refactors this scheduling stage, improving RP tracking and
splitting the stage into two distinct steps to avoid repeated occupancy
queries and IR/state rollbacks.

- Analysis and collection (`canIncreaseOccupancyOrReduceSpill`). The
number of ArchVGPRs to save to reduce spilling or increase function
occupancy by 1 (when there is no spilling) is computed. Then,
instructions eligible for rematerialization are collected, stopping as
soon as enough have been identified to be able to achieve our goal
(according to slightly optimistic heuristics). If there aren't enough of
such instructions, the scheduling stage stops here.
- Rematerialization (`rematerialize`). Instructions collected in the
first step are rematerialized one-by-one. Now we are able to directly
update the scheduler's state since we have already done the occupancy
analysis and know we won't have to rollback any state. Register
pressures for impacted regions are recomputed only once, as opposed to
at every sinking decision.

In the case where the stage attempted to increase occupancy, and if both
rematerializations alone and rescheduling after were unable to improve
occupancy, then all rematerializations are rollbacked.
2025-05-08 12:51:06 +02:00
Jeffrey Byrnes
6e7fe85247
[AMDGPU] Teach iterative schedulers about IGLP (#134953)
This adds IGLP mutation to the iterative schedulers
(`gcn-iterative-max-occupancy-experimental`, `gcn-iterative-minreg`, and
`gcn-iterative-ilp`).

The `gcn-iterative-minreg` and `gcn-iterative-ilp` schedulers never
actually applied the mutations added, so this also has the effect of
teaching them about mutations in general. The
`gcn-iterative-max-occupancy-experimental` scheduler has calls to
`ScheduleDAGMILive::schedule()`, so, before this, mutations were applied
at this point. Now this is done during calls to `BuildDAG`, with IGLP
superseding other mutations (similar to the other schedulers). We may
end up scheduling regions multiple times, with mutations being applied
each time, so we need to track for
`AMDGPU::SchedulingPhase::PreRAReentry`
2025-04-11 15:34:49 -07:00
Kazu Hirata
71935281e0
[Target] Use *Set::insert_range (NFC) (#132140)
DenseSet, SmallPtrSet, SmallSet, SetVector, and StringSet recently
gained C++23-style insert_range.  This patch replaces:

  Dest.insert(Src.begin(), Src.end());

with:

  Dest.insert_range(Src);

This patch does not touch custom begin like succ_begin for now.
2025-03-20 09:09:30 -07:00
Emma Pilkington
3eddb992d0
[AMDGPU] Fix a crash by skipping DBG instrs at start of sched region (#131167)
Fixes SWDEV-514946
2025-03-19 09:31:54 -04:00
Diana Picus
1f84495255
[AMDGPU] Update target helpers & GCNSchedStrategy for dynamic VGPRs (#130047)
In dynamic VGPR mode, we can allocate up to 8 blocks of either 16 or 32
VGPRs (based on a chip-wide setting which we can model with a Subtarget
feature). Update some of the subtarget helpers to reflect this.

In particular:
- getVGPRAllocGranule is set to the block size
- getAddresableNumVGPR will limit itself to 8 * size of a block

We also try to be more careful about how many VGPR blocks we allocate.
Therefore, when deciding if we should revert scheduling after a given
stage, we check that we haven't increased the number of VGPR blocks that
need to be allocated.

---------

Co-authored-by: Jannik Silvanus <jannik.silvanus@amd.com>
2025-03-19 10:29:38 +01:00
Julian Brown
84909d7977 [AMDGCN] Allow unscheduling of bundled insns
This is a patch arising from AMD's fuzzing project.

In the test case, the scheduling algorithm decides to undo an attempted
schedule, but is unprepared to handle bundled instructions at that
point -- and those can arise via the expansion of intrinsics earlier
in compilation.  The fix is to use the splice method instead of
remove/insert, since that can handle bundles properly.
2025-03-18 11:56:51 -05:00
Lucas Ramirez
03677f63a7
[MachineScheduler] Optional scheduling of single-MI regions (#129704)
Following 15e295d the machine scheduler no longer filters-out single-MI
regions when emitting regions to schedule. While this has no functional
impact at the moment, it generally has a negative compile-time impact
(see #128739).

Since all targets but AMDGPU do not care for this behavior, this
introduces an off-by-default flag to `ScheduleDAGInstrs` to control
whether such regions are going to be scheduled, effectively reverting
15e295d for all targets but AMDGPU (currently the only target enabling
this flag).
2025-03-04 17:46:44 +01:00
Jeffrey Byrnes
16f7e961c6
[AMDGPU] Allow rematerialization of instructions with virtual register uses (#124327)
Remove the restriction that scheduling rematerialization candidates
cannot have virtual reg uses.

Currently, this only allows for virtual reg uses which are already live
at the rematerialization point, so bring in allUsesAvailableAt to check
for this condition. Because of this condition, the uses of the remats
will already be live in to the region, so the remat won't increase
live-in pressure.

Add an expensive check to check this condition.
2025-02-06 10:16:28 -08:00
Jeffrey Byrnes
e77d428e46
[AMDGPU] Do not remat instructions with PhysReg uses (#124366)
This blocks rematerialization during scheduling if the instruction has a
non accepted PhysReg use.

Currently, there aren't any checks like this in place, and we may create
invalid code: https://godbolt.org/z/xjPjdcorf
2025-01-27 10:50:06 -08:00
Lucas Ramirez
6206f5444f
[AMDGPU] Occupancy w.r.t. workgroup size range is also a range (#123748)
Occupancy (i.e., the number of waves per EU) depends, in addition to
register usage, on per-workgroup LDS usage as well as on the range of
possible workgroup sizes. Mirroring the latter, occupancy should
therefore be expressed as a range since different group sizes generally
yield different achievable occupancies.

`getOccupancyWithLocalMemSize` currently returns a scalar occupancy
based on the maximum workgroup size and LDS usage. With respect to the
workgroup size range, this scalar can be the minimum, the maximum, or
neither of the two of the range of achievable occupancies. This commit
fixes the function by making it compute and return the range of
achievable occupancies w.r.t. workgroup size and LDS usage; it also
renames it to `getOccupancyWithWorkGroupSizes` since it is the range of
workgroup sizes that produces the range of achievable occupancies.

Computing the achievable occupancy range is surprisingly involved.
Minimum/maximum workgroup sizes do not necessarily yield maximum/minimum
occupancies i.e., sometimes workgroup sizes inside the range yield the
occupancy bounds. The implementation finds these sizes in constant time;
heavy documentation explains the rationale behind the sometimes
relatively obscure calculations.

As a justifying example, consider a target with 10 waves / EU, 4 EUs/CU,
64-wide waves. Also consider a function with no LDS usage and a flat
workgroup size range of [513,1024].

- A group of 513 items requires 9 waves per group. Only 4 groups made up
of 9 waves each can fit fully on a CU at any given time, for a total of
36 waves on the CU, or 9 per EU. However, filling as much as possible
the remaining 40-36=4 wave slots without decreasing the number of groups
reveals that a larger group of 640 items yields 40 waves on the CU, or
10 per EU.
- Similarly, a group of 1024 items requires 16 waves per group. Only 2
groups made up of 16 waves each can fit fully on a CU ay any given time,
for a total of 32 waves on the CU, or 8 per EU. However, removing as
many waves as possible from the groups without being able to fit another
equal-sized group on the CU reveals that a smaller group of 896 items
yields 28 waves on the CU, or 7 per EU.

Therefore the achievable occupancy range for this function is not [8,9]
as the group size bounds directly yield, but [7,10].

Naturally this change causes a lot of test churn as instruction
scheduling is driven by achievable occupancy estimates. In most unit
tests the flat workgroup size range is the default [1,1024] which,
ignoring potential LDS limitations, would previously produce a scalar
occupancy of 8 (derived from 1024) on a lot of targets, whereas we now
consider the maximum occupancy to be 10 in such cases. Most tests are
updated automatically and checked manually for sanity. I also manually
changed some non-automatically generated assertions when necessary.

Fixes #118220.
2025-01-23 16:07:57 +01:00
Austin Kerbow
657fb4433e
[AMDGPU] Add target hook to isGlobalMemoryObject (#112781)
We want special handing for IGLP instructions in the scheduler but they
should still be treated like they have side effects by other passes. Add
a target hook to the ScheduleDAGInstrs DAG builder so that we have more
control over this.
2025-01-11 09:57:57 -08:00
Ruiling, Song
b33c807b39
[AMDGPU] Add MaxMemoryClauseSchedStrategy (#114957)
Also expose an option to choose custom scheduler strategy:
amdgpu-sched-strategy={max-ilp|max-memory-clause}
This can be set through either function attribute or command line option.

The major behaviors of the max memory clause schedule strategy includes:
1. Try to cluster memory instructions more aggressively.
2. Try to schedule long latency load earlier than short latency
   instruction.

I tested locally against about 470 real shaders and got the perf
changes (only count perf changes over +/-10%):
About 15 shaders improved 10%~40%.
Only 3 shaders drops ~10%.

(This was tested together with another change which increases the
maximum clustered dword from 8 to 32).
I will make another change to make that threshold configurable.
2024-12-09 10:07:27 +08:00
Jeffrey Byrnes
3a08551a03 [AMDGPU] Fix expensive check
Change-Id: I0b26d5db6d3da8936ab25ee2b1e9002840b9853e
2024-10-09 10:24:06 -07:00
Jeffrey Byrnes
17bc959961
[AMDGPU] Optionally Use GCNRPTrackers during scheduling (#93090)
This adds the ability to use the GCNRPTrackers during scheduling. These trackers have several advantages over the generic trackers: 1. global live-thru trackers, 2. subregister based RP deltas, and 3. flexible vreg -> PressureSet mappings.

This feature is off-by-default to ease with the roll-out process. In particular, when using the optional trackers, the scheduler will still maintain the generic trackers leading to unnecessary compile time.
2024-10-09 09:54:11 -07:00