During init of unclustered schedule stage, minOccupancy may be
temporarily increased. But subsequently, if none of the regions are
scheduled because they don't meet the conditions of initGCNRegion,
minOccupancy remains incorrectly set. This patch avoids this
incorrectness by delaying the change of minOccupancy until a region is
about to be scheduled.
Examine instructions in the pending queue when scheduling. This makes
instructions visible to scheduling heuristics even when they aren't
immediately issuable due to hardware resource constraints.
The scheduler has two hardware resource modeling modes: an in-order mode
where instructions must be ready to issue before scheduling, and
out-of-order models where instructions are always visible to heuristics.
Special handling exists for unbuffered processor resources in
out-of-order models. These resources can cause pipeline stalls when used
back-to-back, so they're typically avoided. However, for AMDGPU targets,
managing register pressure and reducing spilling is critical enough to
justify exceptions to this approach.
This change enables examination of instructions that can't be
immediately issued because they use an already occupied unbuffered
resource. By making these instructions visible to scheduling heuristics
anyway, we gain more flexibility in scheduling decisions, potentially
allowing better register pressure and hardware resource management.
Replace the target specific copy with a call to the generic routine. I
don't spot any differences by eye, and there's nothing in the original
review discussion (#124327) which makes it clear why this was
duplicated.
This change builds on https://github.com/llvm/llvm-project/pull/160319
which tries to clarify which *callers* (not backends) assume that the
result is actually trivial.
This change itself should be NFC. Essentially, I'm just renaming the
existing isTrivialRematerializable to the non-trivial version and then
adding a new trivial version (with the same name as the prior function)
and simplifying a few callers which want that semantic.
This change does *not* enable non-trivial remat any more broadly than
was already done for our targets which were lying through the old APIs;
that will come separately. The goal here is simply to make the code
easier to follow in terms of what assumptions are being made where.
---------
Co-authored-by: Luke Lau <luke_lau@icloud.com>
The `RPTarget`'s way of determining whether VGPRs are beneficial to save
and whether the target has been reached w.r.t. VGPR usage currently
assumes, if `CombinedVGPRSavings` is true, that free slots in one VGPR
RC can always be used for the other. Implicitly, this makes the
rematerialization stage (only current user of `RPTarget`) follow a
different occupancy calculation than the "regular one" that the
scheduler uses, one that assumes that ArchVGPR/AGPR usage can be
balanced perfectly and at no cost, which is untrue in general. This
ultimately yields suboptimal rematerialization decisions that require
cross-VGPR-RC copies unnecessarily.
This fixes that, making the `RPTarget`'s internal model of occupancy
consistent with the regular one. The `CombinedVGPRSavings` flag is
removed, and a form of cross-VGPR-RC saving implemented only for unified
RFs, which is where it makes the most sense. Only when the amount of
free VGPRs in a given VGPR RC (ArchVPGR or AGPR) is lower than the
excess VGPR usage in the other VGPR RC does the `RPTarget` consider that
a pressure reduction in the former will be beneficial to the latter.
The `GCNScheduleDAGMILive`'s `RegionsWithMinOcc` bitvector is only used
by the `UnclusteredHighRPStage`. Its presence in the scheduler's state
forces us to maintain its value throughout scheduling even though it is
of no use to the iterative scheduling process itself. At any point
during scheduling it is possible to cheaply compute the occupancy
induced by a particular register pressure. Furthermore, the field
doesn't appear to be updated correctly throughout scheduling i.e., bits
corresponding to regions at minimum occupancy are not always set in the
vector.
This removes the bitvector from `GCNScheduleDAGMILive`.
`UnclusteredHighRPStage::initGCNRegion` now directly computes the
occupancy of possibly reschedulable regions instead of querying the
vector. Since it is the most expensive check, it is done last in the
list.
Currently, this skips any region that is not the first region in a
block. This is because the only user of it only cares about the LiveIns
per-block. However, as named, this is supposed to compute the per-region
LiveIns.
This doesn't have any effect on scheduling / CodeGen currently (aside
from computing LiveIns for all regions) since only the per-block LiveIns
are needed. However, I'm working on something that will use this.
Intended User:
https://github.com/llvm/llvm-project/pull/149367/c62a2f127c/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp (L1351)
Any non-zero `SubIdx` passed to the method is composed with the
rematerialized instruction's first operand's subregister to determine
the new register's subregister. In our case we want the new register to
have the same subregister as the old one, so we should pass 0.
We're already accepting a RegionIdx for the LiveIns, also use this for
the instruction iterators.
Enables querying RP for other regions -- useful for function wide
transformations (e.g. rematerialization, rewriting, etc).
This adds the `GCNRPTarget` class which models a register pressure
target (i.e., maximum number of SGPRs/VGPRS) that one can track register
savings against. The only current use of this class is in the
scheduler's rematerialization stage. It replaces the more ad-hoc (and
now deleted) `ExcessRP` class which used to serve the same purpose.
This is only NFC~ish because `GCNRPTarget` tracks VGPR usage more
accurately than `ExcessRP` used to. To estimate required combined VGPR
savings we now additionally take into account the number of available
VGPRs in both banks (ArchVGPR and AGPR) at the time where the RP target
is created, whereas we used to only consider explicit savings made from
the starting RP. This makes VGPR savings estimations more accurate in
cases where we allow for savings in one VGPR bank to help towards
reducing pressure in another VGPR bank (see
`GCNRPTarget::CombineVGPRSavings`). This is the cause for unit test
changes.
This adds the ability to rematerialize SGPRs and AGPRs to the
scheduler's `PreRARematStage`, which can currently only rematerialize
ArchVGPRs. This also fixes a small potential issue in the stage where,
in case of spilling, the target occupancy could be set to a lower than
expected value when the function had either one of the "amdgpu-num-sgpr"
or "amdgpu-num-vgpr" attributes set.
Add parentheses around the assert conditions.
Without this gcc warned like
../lib/Target/AMDGPU/GCNSchedStrategy.cpp:2250: warning: suggest parentheses around '&&' within '||' [-Wparentheses]
2250 | NewMI != RegionBounds.second && "cannot remove at region end");
and
../../clang/lib/Sema/SemaOverload.cpp:11326:39: warning: suggest parentheses around '&&' within '||' [-Wparentheses]
11326 | DeferredCandidatesCount == 0 &&
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~
11327 | "Unexpected deferred template candidates");
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The existing way of managing clustered nodes was done through adding
weak edges between the neighbouring cluster nodes, which is a sort of
ordered queue. And this will be later recorded as `NextClusterPred` or
`NextClusterSucc` in `ScheduleDAGMI`.
But actually the instruction may be picked not in the exact order of the
queue. For example, we have a queue of cluster nodes A B C. But during
scheduling, node B might be picked first, then it will be very likely
that we only cluster B and C for Top-Down scheduling (leaving A alone).
Another issue is:
```
if (!ReorderWhileClustering && SUa->NodeNum > SUb->NodeNum)
std::swap(SUa, SUb);
if (!DAG->addEdge(SUb, SDep(SUa, SDep::Cluster)))
```
may break the cluster queue.
For example, we want to cluster nodes (order as in `MemOpRecords`): 1 3
2. 1(SUa) will be pred of 3(SUb) normally. But when it comes to (3, 2),
As 3(SUa) > 2(SUb), we would reorder the two nodes, which makes 2 be
pred of 3. This makes both 1 and 2 become preds of 3, but there is no
edge between 1 and 2. Thus we get a broken cluster chain.
To fix both issues, we introduce an unordered set in the change. This
could help improve clustering in some hard case.
One key reason the change causes so many test check changes is: As the
cluster candidates are not ordered now, the candidates might be picked
in different order from before.
The most affected targets are: AMDGPU, AArch64, RISCV.
For RISCV, it seems to me most are just minor instruction reorder, don't
see obvious regression.
For AArch64, there were some combining of ldr into ldp being affected.
With two cases being regressed and two being improved. This has more
deeper reason that machine scheduler cannot cluster them well both
before and after the change, and the load combine algorithm later is
also not smart enough.
For AMDGPU, some cases have more v_dual instructions used while some are
regressed. It seems less critical. Seems like test `v_vselect_v32bf16`
gets more buffer_load being claused.
The `GCNScheduleDAGMILive`'s `RescheduleRegions` bitvector is only used
by the rematerialization stage (`PreRARematStage`). Its presence in the
scheduler's state forces us to maintain its value throughout scheduling
even though it is of no use to the iterative scheduling process itself,
which instead relies on each stage's `initGCNRegion` hook to determine
whether the current region should be rescheduled.
This moves the bitvector to the `PreRARematStage`, which uses it to
store the set of regions that must be rescheduled between stage
initialization and region initialization.
This NFC also swaps a call to `GCNRegPressure::getArchVGPRNum(false)`
for a call to `GCNRegPressure::getArchVGPRNum()`---which is equivalent
but simpler in the context---and makes
`GCNSchedStage::finalizeGCNRegion` use its own API to advance to the
next region.
This reapplies 067caaa and 382a085 (reverting b35f6e2) with fixes to
issues detected by the address sanitizer (MIs have to be removed from
live intervals before being removed from their parent MBB).
Original commit description below.
AMDGPU scheduler's `PreRARematStage` attempts to increase function
occupancy w.r.t. ArchVGPR usage by rematerializing trivial
ArchVGPR-defining instruction next to their single use. It first
collects all eligible trivially rematerializable instructions in the
function, then sinks them one-by-one while recomputing occupancy in all
affected regions each time to determine if and when it has managed to
increase overall occupancy. If it does, changes are committed to the
scheduler's state; otherwise modifications to the IR are reverted and
the scheduling stage gives up.
In both cases, this scheduling stage currently involves repeated queries
for up-to-date occupancy estimates and some state copying to enable
reversal of sinking decisions when occupancy is revealed not to
increase. The current implementation also does not accurately track
register pressure changes in all regions affected by sinking decisions.
This commit refactors this scheduling stage, improving RP tracking and
splitting the stage into two distinct steps to avoid repeated occupancy
queries and IR/state rollbacks.
- Analysis and collection (`canIncreaseOccupancyOrReduceSpill`). The
number of ArchVGPRs to save to reduce spilling or increase function
occupancy by 1 (when there is no spilling) is computed. Then,
instructions eligible for rematerialization are collected, stopping as
soon as enough have been identified to be able to achieve our goal
(according to slightly optimistic heuristics). If there aren't enough of
such instructions, the scheduling stage stops here.
- Rematerialization (`rematerialize`). Instructions collected in the
first step are rematerialized one-by-one. Now we are able to directly
update the scheduler's state since we have already done the occupancy
analysis and know we won't have to rollback any state. Register
pressures for impacted regions are recomputed only once, as opposed to
at every sinking decision.
In the case where the stage attempted to increase occupancy, and if both
rematerializations alone and rescheduling after were unable to improve
occupancy, then all rematerializations are rollbacked.
And related "[AMDGPU] Regenerate mfma-loop.ll test"
Introduce memory error detected by Asan #125885.
This reverts commit 382a085a95b0abeac77b150b7b644b372bd08e78.
This reverts commit 067caaafb58a156d0d77229422607782a639f5b5.
AMDGPU scheduler's `PreRARematStage` attempts to increase function
occupancy w.r.t. ArchVGPR usage by rematerializing trivial
ArchVGPR-defining instruction next to their single use. It first
collects all eligible trivially rematerializable instructions in the
function, then sinks them one-by-one while recomputing occupancy in all
affected regions each time to determine if and when it has managed to
increase overall occupancy. If it does, changes are committed to the
scheduler's state; otherwise modifications to the IR are reverted and
the scheduling stage gives up.
In both cases, this scheduling stage currently involves repeated queries
for up-to-date occupancy estimates and some state copying to enable
reversal of sinking decisions when occupancy is revealed not to
increase. The current implementation also does not accurately track
register pressure changes in all regions affected by sinking decisions.
This commit refactors this scheduling stage, improving RP tracking and
splitting the stage into two distinct steps to avoid repeated occupancy
queries and IR/state rollbacks.
- Analysis and collection (`canIncreaseOccupancyOrReduceSpill`). The
number of ArchVGPRs to save to reduce spilling or increase function
occupancy by 1 (when there is no spilling) is computed. Then,
instructions eligible for rematerialization are collected, stopping as
soon as enough have been identified to be able to achieve our goal
(according to slightly optimistic heuristics). If there aren't enough of
such instructions, the scheduling stage stops here.
- Rematerialization (`rematerialize`). Instructions collected in the
first step are rematerialized one-by-one. Now we are able to directly
update the scheduler's state since we have already done the occupancy
analysis and know we won't have to rollback any state. Register
pressures for impacted regions are recomputed only once, as opposed to
at every sinking decision.
In the case where the stage attempted to increase occupancy, and if both
rematerializations alone and rescheduling after were unable to improve
occupancy, then all rematerializations are rollbacked.
This adds IGLP mutation to the iterative schedulers
(`gcn-iterative-max-occupancy-experimental`, `gcn-iterative-minreg`, and
`gcn-iterative-ilp`).
The `gcn-iterative-minreg` and `gcn-iterative-ilp` schedulers never
actually applied the mutations added, so this also has the effect of
teaching them about mutations in general. The
`gcn-iterative-max-occupancy-experimental` scheduler has calls to
`ScheduleDAGMILive::schedule()`, so, before this, mutations were applied
at this point. Now this is done during calls to `BuildDAG`, with IGLP
superseding other mutations (similar to the other schedulers). We may
end up scheduling regions multiple times, with mutations being applied
each time, so we need to track for
`AMDGPU::SchedulingPhase::PreRAReentry`
DenseSet, SmallPtrSet, SmallSet, SetVector, and StringSet recently
gained C++23-style insert_range. This patch replaces:
Dest.insert(Src.begin(), Src.end());
with:
Dest.insert_range(Src);
This patch does not touch custom begin like succ_begin for now.
In dynamic VGPR mode, we can allocate up to 8 blocks of either 16 or 32
VGPRs (based on a chip-wide setting which we can model with a Subtarget
feature). Update some of the subtarget helpers to reflect this.
In particular:
- getVGPRAllocGranule is set to the block size
- getAddresableNumVGPR will limit itself to 8 * size of a block
We also try to be more careful about how many VGPR blocks we allocate.
Therefore, when deciding if we should revert scheduling after a given
stage, we check that we haven't increased the number of VGPR blocks that
need to be allocated.
---------
Co-authored-by: Jannik Silvanus <jannik.silvanus@amd.com>
This is a patch arising from AMD's fuzzing project.
In the test case, the scheduling algorithm decides to undo an attempted
schedule, but is unprepared to handle bundled instructions at that
point -- and those can arise via the expansion of intrinsics earlier
in compilation. The fix is to use the splice method instead of
remove/insert, since that can handle bundles properly.
Following 15e295d the machine scheduler no longer filters-out single-MI
regions when emitting regions to schedule. While this has no functional
impact at the moment, it generally has a negative compile-time impact
(see #128739).
Since all targets but AMDGPU do not care for this behavior, this
introduces an off-by-default flag to `ScheduleDAGInstrs` to control
whether such regions are going to be scheduled, effectively reverting
15e295d for all targets but AMDGPU (currently the only target enabling
this flag).
Remove the restriction that scheduling rematerialization candidates
cannot have virtual reg uses.
Currently, this only allows for virtual reg uses which are already live
at the rematerialization point, so bring in allUsesAvailableAt to check
for this condition. Because of this condition, the uses of the remats
will already be live in to the region, so the remat won't increase
live-in pressure.
Add an expensive check to check this condition.
This blocks rematerialization during scheduling if the instruction has a
non accepted PhysReg use.
Currently, there aren't any checks like this in place, and we may create
invalid code: https://godbolt.org/z/xjPjdcorf
Occupancy (i.e., the number of waves per EU) depends, in addition to
register usage, on per-workgroup LDS usage as well as on the range of
possible workgroup sizes. Mirroring the latter, occupancy should
therefore be expressed as a range since different group sizes generally
yield different achievable occupancies.
`getOccupancyWithLocalMemSize` currently returns a scalar occupancy
based on the maximum workgroup size and LDS usage. With respect to the
workgroup size range, this scalar can be the minimum, the maximum, or
neither of the two of the range of achievable occupancies. This commit
fixes the function by making it compute and return the range of
achievable occupancies w.r.t. workgroup size and LDS usage; it also
renames it to `getOccupancyWithWorkGroupSizes` since it is the range of
workgroup sizes that produces the range of achievable occupancies.
Computing the achievable occupancy range is surprisingly involved.
Minimum/maximum workgroup sizes do not necessarily yield maximum/minimum
occupancies i.e., sometimes workgroup sizes inside the range yield the
occupancy bounds. The implementation finds these sizes in constant time;
heavy documentation explains the rationale behind the sometimes
relatively obscure calculations.
As a justifying example, consider a target with 10 waves / EU, 4 EUs/CU,
64-wide waves. Also consider a function with no LDS usage and a flat
workgroup size range of [513,1024].
- A group of 513 items requires 9 waves per group. Only 4 groups made up
of 9 waves each can fit fully on a CU at any given time, for a total of
36 waves on the CU, or 9 per EU. However, filling as much as possible
the remaining 40-36=4 wave slots without decreasing the number of groups
reveals that a larger group of 640 items yields 40 waves on the CU, or
10 per EU.
- Similarly, a group of 1024 items requires 16 waves per group. Only 2
groups made up of 16 waves each can fit fully on a CU ay any given time,
for a total of 32 waves on the CU, or 8 per EU. However, removing as
many waves as possible from the groups without being able to fit another
equal-sized group on the CU reveals that a smaller group of 896 items
yields 28 waves on the CU, or 7 per EU.
Therefore the achievable occupancy range for this function is not [8,9]
as the group size bounds directly yield, but [7,10].
Naturally this change causes a lot of test churn as instruction
scheduling is driven by achievable occupancy estimates. In most unit
tests the flat workgroup size range is the default [1,1024] which,
ignoring potential LDS limitations, would previously produce a scalar
occupancy of 8 (derived from 1024) on a lot of targets, whereas we now
consider the maximum occupancy to be 10 in such cases. Most tests are
updated automatically and checked manually for sanity. I also manually
changed some non-automatically generated assertions when necessary.
Fixes#118220.
We want special handing for IGLP instructions in the scheduler but they
should still be treated like they have side effects by other passes. Add
a target hook to the ScheduleDAGInstrs DAG builder so that we have more
control over this.
Also expose an option to choose custom scheduler strategy:
amdgpu-sched-strategy={max-ilp|max-memory-clause}
This can be set through either function attribute or command line option.
The major behaviors of the max memory clause schedule strategy includes:
1. Try to cluster memory instructions more aggressively.
2. Try to schedule long latency load earlier than short latency
instruction.
I tested locally against about 470 real shaders and got the perf
changes (only count perf changes over +/-10%):
About 15 shaders improved 10%~40%.
Only 3 shaders drops ~10%.
(This was tested together with another change which increases the
maximum clustered dword from 8 to 32).
I will make another change to make that threshold configurable.
This adds the ability to use the GCNRPTrackers during scheduling. These trackers have several advantages over the generic trackers: 1. global live-thru trackers, 2. subregister based RP deltas, and 3. flexible vreg -> PressureSet mappings.
This feature is off-by-default to ease with the roll-out process. In particular, when using the optional trackers, the scheduler will still maintain the generic trackers leading to unnecessary compile time.
PressureDiff is reliable most of the time, and it's pretty much free
compared to RPTracker. We can use it whenever there is no subregister
definitions, or physregs invovled. No subregs because PDiff doesn't take
into account lane liveness, and no Physreg because it seems to get
PhysReg liveness completely wrong. Sometimes it adds a diff, sometimes
itt doesn't - I didn't look at that one for long so maybe there is
something we can eventually do to make it better.
This allows us to save a ton of calls to RPTracker and LIS too. On a
huge IR module (100+MB), it went from about 20M calls to RPTracker in this
function down to 3.4, with the rest being PressureDiffs.
I also added an expensive check to verify correctness of PressureDiff.
`ST.getMaxNumVGPRs(MF)` lowers to `AMDGPUBaseInfo.cpp:getTotalNumVGPRs`
which returns 512 for gfx90a. This is subsequently limited by
`AMDGPUBaseInfo:getAddressableNumVGPRs()`, which also returns 512 for
gfx90a. The ISA states we can have a total of 512 registers, but a
maximum of only 256 of each of AGPR and VGPR (gfx90a 3.6.4).
Therefore, in unified register file case, `ST.getMaxNumVGPRs(MF)`
calculates the maximum number of combined VGPR + AGPR. But, it is
currently used as the limit for accvgpr and as the limit for archvgpr.
This patch uses it as the combined limit, and accounts for the maximum addressable arch/acc VGPRs when calculating the per RegClass limits.
It is not unreasonable to think other clients of getTotalNumVGPRs are
using it in the wrong way.
IGLP itself will be in SavedMutations via mutations added during
Scheduler creation, thus falling back results in reapplying IGLP.
In PostRA scheduling, if we have multiple regions with IGLP
instructions, then we may have infinite loop.
Disable the feature for now.
This implements the basic pipelining structure of exp/mfma interleaving
for better extensibility. While it does have improved extensibility,
there are controls which only enable it for DAGs with certain
characteristics (matching the DAGs it has been designed against).
We can also re-enter IGLP mutation via later `SchedStage`s in the
`GCNMaxOccupancySchedStrategy` . This is sort of NFC in that there is no
changed behavior for the only current client of `IsReentry`
This will help users analyze whether high register usage is coming from
inability of scheduler to reduce RP, or from sacrificing good RP to
improve ILP.
Save the DSW counters from PreRA scheduling. While this avoids recalculation in the postRA pass, that isn't the main purpose.
This is required because of physical register dependencies in PostRA scheduling -- they alter the DAG s.t. our counters may become incorrect -- which alters the layout of the pipeline. By preserving the values from PreRA, we can be sure that we accurately construct the pipeline.
Additionally, remove a bad assert in SharesPredWithPrevNthGroup -- it is possible that we will have an empty cache if OtherGroup has no elements which have a V_PERM pred (possible if the V_PERM SG is empty).
Default scheduling behavior for these types of kernels is to chase high occupancy goals with scheduling heuristics, but allow occupancy drops if we are unable to reach the target.
This (experimental, off-by-default) feature relaxes occupancy target from the beginning, which enables scheduler to produce better ILP schedules.
Differential Revision: https://reviews.llvm.org/D153925
Change-Id: I112833214e2db869704591f4df3c4574d0fcbb1b
Even if optimizing for ILP, it is still useful to track RP to avoid spilling. Given that, we need to maintin consistent liveness state with the RP tracker. This patch makes RP tracking consistent by updating for liveins.
Otherwise, we should completely eliminate RP tracking for this scheduler (checkScheduling, initCandidate).
Differential Revision: https://reviews.llvm.org/D149358
We reuse live registers after tracking one MBB as live-ins to the successor MBB
if the successor is only one but we don't check if the successor has other predecessors.
`A B`
` \ /`
` C`
A and B have one successor but C has live-ins defined by A and B and therefore should be
initialized using LIS.
This fixes 83 lit tests out if 420 with EXPENSIVE_CHECK enabled.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D136918