We can also re-enter IGLP mutation via later `SchedStage`s in the
`GCNMaxOccupancySchedStrategy` . This is sort of NFC in that there is no
changed behavior for the only current client of `IsReentry`
This will help users analyze whether high register usage is coming from
inability of scheduler to reduce RP, or from sacrificing good RP to
improve ILP.
Save the DSW counters from PreRA scheduling. While this avoids recalculation in the postRA pass, that isn't the main purpose.
This is required because of physical register dependencies in PostRA scheduling -- they alter the DAG s.t. our counters may become incorrect -- which alters the layout of the pipeline. By preserving the values from PreRA, we can be sure that we accurately construct the pipeline.
Additionally, remove a bad assert in SharesPredWithPrevNthGroup -- it is possible that we will have an empty cache if OtherGroup has no elements which have a V_PERM pred (possible if the V_PERM SG is empty).
Default scheduling behavior for these types of kernels is to chase high occupancy goals with scheduling heuristics, but allow occupancy drops if we are unable to reach the target.
This (experimental, off-by-default) feature relaxes occupancy target from the beginning, which enables scheduler to produce better ILP schedules.
Differential Revision: https://reviews.llvm.org/D153925
Change-Id: I112833214e2db869704591f4df3c4574d0fcbb1b
Even if optimizing for ILP, it is still useful to track RP to avoid spilling. Given that, we need to maintin consistent liveness state with the RP tracker. This patch makes RP tracking consistent by updating for liveins.
Otherwise, we should completely eliminate RP tracking for this scheduler (checkScheduling, initCandidate).
Differential Revision: https://reviews.llvm.org/D149358
We reuse live registers after tracking one MBB as live-ins to the successor MBB
if the successor is only one but we don't check if the successor has other predecessors.
`A B`
` \ /`
` C`
A and B have one successor but C has live-ins defined by A and B and therefore should be
initialized using LIS.
This fixes 83 lit tests out if 420 with EXPENSIVE_CHECK enabled.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D136918
D139710 has added a metric to increase schedule's ILP while
staying within the same occupancy. Do not bother to apply this
metric to a region which is known to have spilling, it may result
in spilling to reappear after the previous stage and will do no
good if we already spilling anyway. It may also reduce compile
time a bit for such regions.
Fixes: SWDEV-377300
Differential Revision: https://reviews.llvm.org/D143934
Occupancy is expressed as waves per SIMD. This means that we need to
take into account the number of SIMDs per "CU" or, to be more precise,
the number of SIMDs over which a workgroup may be distributed.
getOccupancyWithLocalMemSize was wrong because it didn't take SIMDs
into account at all.
At the same time, we need to take into account that WGP mode offers
access to a larger total amount of LDS, since this can affect how
non-power-of-two LDS allocations are rounded. To make this work
consistently, we distinguish between (available) local memory size and
addressable local memory size (which is always limited by 64kB on
gfx10+, even with WGP mode).
This change results in a massive amount of test churn. A lot of it is
caused by the fact that the default work group size is 1024, which means
that (due to rounding effects) the default occupancy on older hardware
is 8 instead of 10, which affects scheduling via register pressure
estimates. I've adjusted most tests by just running the UTC tools, but
in some cases I manually changed the work group size to 32 or 64 to make
sure that work group size chunkiness has no effect.
Differential Revision: https://reviews.llvm.org/D139468
Unlike older ASICs GFX10+ have a lot of VGPRs. Therefore, it is possible
to achieve high occupancy even with all or almost all addressable VGPRs
used. Our scheduler was never tuned for this scenario. The VGPR Critical
Limit threshold always comes very high, even if maximum occupancy is
targeted. For example on gfx1100 it is set to 192 registers even with
the requested occupancy 16. As a result scheduler starts prioritizing
register pressure reduction very late and we easily end up spilling.
This patch makes VGPR critical limit similar to what we would have on
pre-gfx10 targets with much more limited VGPR budget while still trying
to maintain occupancy as it does now.
Pre-gfx10 ASICs shall not be affected as the limit shall be the same
as before, and on gfx10+ it shall only affect regions where we have
to spill.
Fixes: SWDEV-377300
Differential Revision: https://reviews.llvm.org/D141876
Current implementation abuses ErrorMargin to apply an additional
bias to VGPR and SGPR limits under a high register pressure. The
ErrorMargin exists to account for inaccuracies of the RP tracker
and not to tackle an excess pressure. Introduce separate bias for
this purpose and also make it different for SGPRs and VGPRs as we
may want to use different values in the future.
This is supposed to be NFC, however there is a subtle difference
when subtracting a margin overflows the limit. Doing two subtractions
makes it less probable, although manifests only in mir tests with
an artificially small register budget.
Differential Revision: https://reviews.llvm.org/D142051
Since the divergence-driven ISel was fully enabled we have more VGPRs available.
MachineScheduler trying to take advantage of that bumps up the occupancy sacrificing
the hiding of memory access latency. This really spoils the initially good schedule.
A new metric that reflects the latency hiding quality of the schedule has been created
to make it to balance between occupancy and latency. The metric is based on the latency
model which computes the bubble to working cycles ratio. Then we use this ratio to decide
if the higher occupancy schedule is profitable as follows:
Profit = NewOccupancy/OldOccupancy * OldMetric/NewMetric
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D139710
C++17 allows us to call constructors pair and tuple instead of helper
functions make_pair and make_tuple.
Differential Revision: https://reviews.llvm.org/D139828
This one-linear fix improves compilation time for about ~40% on ASAN enabled code.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D136069
First boundary of a region wasn't updated when a sinked instruction was added first into the region.
Reviewed By: vangthao
Differential Revision: https://reviews.llvm.org/D138256
The problem with GCNDownwardRPTracker::advanceBeforeNext is that it doesn't allow to get register pressure
after the last instruction in a MBB.
However when we track RP through the boundary of a MBB we need the state that is after the last instruction
of the MBB and before the first instruction of the successor MBB. Currently we stop traking RP in the state
'at' the last instruction of the MBB which is incorrect.
This patch fixes 27 lit tests with EXPENSIVE_CHECKS enabled.
Reviewed By: rampitec, arsenm
Differential Revision: https://reviews.llvm.org/D136927
Adds a builtin that serves as an optimization hint to apply specific optimized
DAG mutations during scheduling. This also disables any other mutations or
clustering that may interfere with the desired pipeline. The first optimization
strategy that is added here is designed to improve the performance of small gemm
kernels on gfx90a.
Reviewed By: jrbyrnes
Differential Revision: https://reviews.llvm.org/D132079
Creates a new scheduling strategy that attempts to maximize ILP for a single
wave.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D130869
When register pressure tracking is disabled, the scheduler attempts to load
pressures at SReg_32 and VGPR_32. This causes an index out of bounds error.
This patch fixes this issue by disabling the initialization of RPTracker
when not needed. NFC
Reviewed By: rampitec, kerbowa, arsenm
Differential Revision: https://reviews.llvm.org/D129322
By not clustering loads and adjusting heuristics to more aggressively reduce
register pressure we may be able to increase occupancy for the function if it
was dropped in a first pass scheduling.
Similarly, try to reduce spilling if register usage exceeds lower bound
occupancy.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D130329
Tries to make the different scheduling stages a bit more self contained and
modifiable. Intended to be NFC. Preface to other changes.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D130147
This was stored in LiveIntervals, but not actually used for anything
related to LiveIntervals. It was only used in one check for if a load
instruction is rematerializable. I also don't think this was entirely
correct, since it was implicitly assuming constant loads are also
dereferenceable.
Remove this and rely only on the invariant+dereferenceable flags in
the memory operand. Set the flag based on the AA query upfront. This
should have the same net benefit, but has the possible disadvantage of
making this AA query nonlazy.
Preserve the behavior of assuming pointsToConstantMemory implying
dereferenceable for now, but maybe this should be changed.
Enable the PreRARematerialize pass when there are multiple high RP scheduling
regions present. Require the occupancy in all high RP regions be improved
before finalizing sinking. If any high RP region did not improve in occupancy
then un-do all sinking and restore the state to before the pass.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D122501
Ignore all debug uses when collecting trivially rematerializable defs. This fixes an issue with difference in codegen when enabling debug info.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D123048
When collecting trivially rematerializable defs, skip any subreg defs. We do not want to sink these.
Differential Revision: https://reviews.llvm.org/D121874
This reverts commit 7f230feeeac8a67b335f52bd2e900a05c6098f20.
Breaks CodeGenCUDA/link-device-bitcode.cu in check-clang,
and many LLVM tests, see comments on https://reviews.llvm.org/D121169
Add a new pass in the pre-ra AMDGPU scheduler to check if sinking trivially rematerializable defs that only has one use outside of the defining block will increase occupancy. If we can determine that occupancy can be increased, then rematerialize only the minimum amount of defs required to increase occupancy. Also re-schedule all regions that had occupancy matching the previous min occupancy using the new occupancy.
This is based off of the discussion in https://reviews.llvm.org/D117562.
The logic to determine the defs we should collect and determining if sinking would be beneficial is mostly the same. Main differences is that we are no longer limiting it to immediate defs and the def and use does not have to be part of a loop.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D119475
Debug position data is cleared after ScheduleDAGMILive::schedule() due to it also calling placeDebugValues(). Make it so the data is not cleared after initial call to placeDebugValues since we will call it again after reverting a schedule.
Secondly, since we skip debug instructions when reverting the schedule on AMDGPU, all debug instructions are now moved to the end of the scheduling region. RegionEnd points to the beginning of this chunk of debug instructions since it was not incremented when a debug instruction was skipped. RegionBegin may also point to the same debug instruction if Unsched.front() is a debug instruction thus shrinking the region to 1. Fix RegionBegin and RegionEnd so that they point to the current beginning and ending before calling placeDebugValues() since both vars will be used as reference points to move debug instructions back.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D119022
GCNDownwardRPTracker RPTracker.reset() skips debug instructions for NextMI so RPTracker.getNext() will never give the beginning of a sched region if it is a debug value. In this case we will never set the live-ins for that block.
Add check to see if getNext also equals the MI after skipping debug instructions.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D118853
The scheduler should set critical/excess register usage thresholds that
are guided by the maximum possible occupancy for the function. This
change is focused on setting proper lower bounds on register usage which
we would typically only see when a specific number of maximum waves is
requested with the "waves-per-eu" attribute, or by setting
"amdgpu-num-vgpr|sgpr" directly. This was broken previously. I have a
follow-on patch that will address issues with the scheduler not
targeting correct upper bounds on register usage which is typical with
launch bounds and min "waves-per-eu".
Changes by this patch:
Set the initial critical register usage thresholds to minimum values
that are determined by the maximum possible occupancy for the function,
or the number of allocatable registers, whichever is lower.
Avoid unisgned overflow if register limits are lower than the register
tracking "ErrorMargin", I.e. when using stress-regalloc=2.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D112373
If a region was not constrained by a high register pressure
and was not rescheduled without clustering we can skip
rescheduling it ClusteredLowOccupancyReschedule stage.
This improves scheduling speed by 25% on some kernels.
Differential Revision: https://reviews.llvm.org/D97506
We are attempting rescheduling without load store clustering
if occupancy limits were not met with clustering. Skip this
for regions which do not have any loads or stores at all.
In a set of kernels I am experimenting with this improves
scheduling time by ~30%.
Differential Revision: https://reviews.llvm.org/D97342
Regions are sometimes skipped which should be rescheduled without memory op
clustering. RegionIdx is not incremented when iterating over regions that
are flagged to be skipped, causing the index to be incorrect.
Thanks to Vang Thao for discovering this bug!
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D85498
Summary:
pickNodeBidirectional tried to compare the best top candidate and the
best bottom candidate by examining TopCand.Reason and BotCand.Reason.
This is unsound because, after calling pickNodeFromQueue, Cand.Reason
does not reflect the most important reason why Cand was chosen. Rather
it reflects the most recent reason why it beat some other potential
candidate, which could have been for some low priority tie breaker
reason.
I have seen this cause problems where TopCand is a good candidate, but
because TopCand.Reason is ORDER (which is very low priority) it is
repeatedly ignored in favour of a mediocre BotCand. This is not how
bidirectional scheduling is supposed to work.
To fix this I changed the code to always compare TopCand and BotCand
directly, like the generic implementation of pickNodeBidirectional does.
This removes some uncommented AMDGPU-specific logic; if this logic turns
out to be important then perhaps it could be moved into an override of
tryCandidate instead.
Graphics shader benchmarking on gfx10 shows a lot more positive than
negative effects from this change.
Reviewers: arsenm, tstellar, rampitec, kzhuravl, vpykhtin, dstuttard, tpr, atrick, MatzeB
Subscribers: jvesely, wdng, nhaehnle, yaxunl, t-tye, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D68338
We want to have more load/store clustering but we also want
to maintain low register pressure which are oposit targets.
Allow scheduler to reschedule regions without mutations
applied if we hit a register limit.
Differential Revision: https://reviews.llvm.org/D73386
We can revert region schedule if new schedule decreases occupancy.
However, if we already have only one wave we would accept any new
schedule even if it blows up register pressure. Such schedule may
result in quite heavy spilling which can be avoided if we reject
this new schedule.
Differential Revision: https://reviews.llvm.org/D72181