llvm-project

Author	SHA1	Message	Date
Jeffrey Byrnes	6d8b44a506	[AMDGPU] [IGLP]: Fix assert (#73710) We can also re-enter IGLP mutation via later `SchedStage`s in the `GCNMaxOccupancySchedStrategy`. This is sort of NFC in that there is no changed behavior for the only current client of `IsReentry`	2023-12-07 17:10:10 -08:00
Jeffrey Byrnes	5b2fee8418	[AMDGPU] NFC: Add flag to disable clustered low occupancy phase (#73025) This will help users analyze whether high register usage is coming from inability of scheduler to reduce RP, or from sacrificing good RP to improve ILP.	2023-11-21 12:43:12 -08:00
Jeffrey Byrnes	6afceba510	[AMDGPU][IGLP] SingleWaveOpt: Cache DSW Counters from PreRA (#67759) Save the DSW counters from PreRA scheduling. While this avoids recalculation in the postRA pass, that isn't the main purpose. This is required because of physical register dependencies in PostRA scheduling -- they alter the DAG s.t. our counters may become incorrect -- which alters the layout of the pipeline. By preserving the values from PreRA, we can be sure that we accurately construct the pipeline. Additionally, remove a bad assert in SharesPredWithPrevNthGroup -- it is possible that we will have an empty cache if OtherGroup has no elements which have a V_PERM pred (possible if the V_PERM SG is empty).	2023-10-06 17:34:14 -07:00
Jay Foad	c3939eb827	[AMDGPU] Fix typo in scheduler option name (#67661) Fix: -amdgpu-disable-unclustred-high-rp-reschedule Now: -amdgpu-disable-unclustered-high-rp-reschedule	2023-09-28 20:54:57 +01:00
Jeffrey Byrnes	be92848ea4	[AMDGPU] NFC: Add schedule-relaxed-occupancy to relax occupancy targets for wave-limited/membound kernels Default scheduling behavior for these types of kernels is to chase high occupancy goals with scheduling heuristics, but allow occupancy drops if we are unable to reach the target. This (experimental, off-by-default) feature relaxes occupancy target from the beginning, which enables scheduler to produce better ILP schedules. Differential Revision: https://reviews.llvm.org/D153925 Change-Id: I112833214e2db869704591f4df3c4574d0fcbb1b	2023-06-28 08:12:31 -07:00
Jay Foad	3030c03988	[AMDGPU] Make use of MachineInstr::all_defs and all_uses. NFCI.	2023-06-05 10:32:33 +01:00
Jeffrey Byrnes	7f0a881e6c	[AMDGPU] Track liveins for max-ilp-sched-strategy Even if optimizing for ILP, it is still useful to track RP to avoid spilling. Given that, we need to maintain consistent liveness state with the RP tracker. This patch makes RP tracking consistent by updating for liveins. Otherwise, we should completely eliminate RP tracking for this scheduler (checkScheduling, initCandidate). Differential Revision: https://reviews.llvm.org/D149358	2023-04-27 16:45:45 -07:00
Valery Pykhtin	a999669982	[AMDGPU] Scheduler: fix RP calculation for a MBB with one successor We reuse live registers after tracking one MBB as live-ins to the successor MBB if it is the only successor, but we don't check whether the successor has other predecessors. `A B` ` \ /` ` C` A and B have one successor but C has live-ins defined by A and B and therefore should be initialized using LIS. This fixes 83 lit tests out of 420 with EXPENSIVE_CHECKS enabled. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D136918	2023-03-08 12:20:03 +01:00
Stanislav Mekhanoshin	12b4f9e2af	[AMDGPU] Do not apply schedule metric for regions with spilling D139710 has added a metric to increase schedule's ILP while staying within the same occupancy. Do not bother to apply this metric to a region which is known to have spilling: it may cause spilling to reappear after the previous stage and will do no good if we are already spilling anyway. It may also reduce compile time a bit for such regions. Fixes: SWDEV-377300 Differential Revision: https://reviews.llvm.org/D143934	2023-02-14 12:16:46 -08:00
Nicolai Hähnle	10cef708a7	AMDGPU: Clean up LDS-related occupancy calculations Occupancy is expressed as waves per SIMD. This means that we need to take into account the number of SIMDs per "CU" or, to be more precise, the number of SIMDs over which a workgroup may be distributed. getOccupancyWithLocalMemSize was wrong because it didn't take SIMDs into account at all. At the same time, we need to take into account that WGP mode offers access to a larger total amount of LDS, since this can affect how non-power-of-two LDS allocations are rounded. To make this work consistently, we distinguish between (available) local memory size and addressable local memory size (which is always limited by 64kB on gfx10+, even with WGP mode). This change results in a massive amount of test churn. A lot of it is caused by the fact that the default work group size is 1024, which means that (due to rounding effects) the default occupancy on older hardware is 8 instead of 10, which affects scheduling via register pressure estimates. I've adjusted most tests by just running the UTC tools, but in some cases I manually changed the work group size to 32 or 64 to make sure that work group size chunkiness has no effect. Differential Revision: https://reviews.llvm.org/D139468	2023-01-23 21:43:06 +01:00
Stanislav Mekhanoshin	7d0145cc47	[AMDGPU] Use more consistent way to avoid overflow in the scheduler Use more consistent way to avoid overflow when calculating SGPR and VGPR pressure limits. Differential Revision: https://reviews.llvm.org/D142262	2023-01-23 10:59:46 -08:00
Stanislav Mekhanoshin	d1c0febeab	[AMDGPU] Tune scheduler on GFX10 and GFX11 for regions with spilling Unlike older ASICs GFX10+ have a lot of VGPRs. Therefore, it is possible to achieve high occupancy even with all or almost all addressable VGPRs used. Our scheduler was never tuned for this scenario. The VGPR Critical Limit threshold always comes very high, even if maximum occupancy is targeted. For example on gfx1100 it is set to 192 registers even with the requested occupancy 16. As a result scheduler starts prioritizing register pressure reduction very late and we easily end up spilling. This patch makes VGPR critical limit similar to what we would have on pre-gfx10 targets with much more limited VGPR budget while still trying to maintain occupancy as it does now. Pre-gfx10 ASICs shall not be affected as the limit shall be the same as before, and on gfx10+ it shall only affect regions where we have to spill. Fixes: SWDEV-377300 Differential Revision: https://reviews.llvm.org/D141876	2023-01-23 10:42:26 -08:00
Stanislav Mekhanoshin	e7f080b359	[AMDGPU] Introduce separate register limit bias in scheduler Current implementation abuses ErrorMargin to apply an additional bias to VGPR and SGPR limits under a high register pressure. The ErrorMargin exists to account for inaccuracies of the RP tracker and not to tackle an excess pressure. Introduce separate bias for this purpose and also make it different for SGPRs and VGPRs as we may want to use different values in the future. This is supposed to be NFC, however there is a subtle difference when subtracting a margin overflows the limit. Doing two subtractions makes it less probable, although manifests only in mir tests with an artificially small register budget. Differential Revision: https://reviews.llvm.org/D142051	2023-01-19 10:51:40 -08:00
Alexander Timofeev	6daa983c9d	[AMDGPU] MachineScheduler: schedule execution metric added for the UnclusteredHighRPStage Since the divergence-driven ISel was fully enabled we have more VGPRs available. MachineScheduler trying to take advantage of that bumps up the occupancy sacrificing the hiding of memory access latency. This really spoils the initially good schedule. A new metric that reflects the latency hiding quality of the schedule has been created to make it to balance between occupancy and latency. The metric is based on the latency model which computes the bubble to working cycles ratio. Then we use this ratio to decide if the higher occupancy schedule is profitable as follows: Profit = NewOccupancy/OldOccupancy * OldMetric/NewMetric Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D139710	2023-01-05 21:10:56 +01:00
Jay Foad	6443c0ee02	[AMDGPU] Stop using make_pair and make_tuple. NFC. C++17 allows us to call constructors pair and tuple instead of helper functions make_pair and make_tuple. Differential Revision: https://reviews.llvm.org/D139828	2022-12-14 13:22:26 +00:00
Valery Pykhtin	5ce3273ebf	[AMDGPU] Scheduler: Don't revert the schedule if the register pressure isn't changed for a region This one-line fix improves compilation time by about 40% on ASAN-enabled code. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D136069	2022-12-02 15:59:38 +01:00
Valery Pykhtin	a35ba2a256	[AMDGPU] Fix PreRARematStage::sinkTriviallyRematInsts region boundary update after sinking. The first boundary of a region wasn't updated when a sunk instruction was added first into the region. Reviewed By: vangthao Differential Revision: https://reviews.llvm.org/D138256	2022-11-18 12:13:14 +01:00
Valery Pykhtin	5144133f6f	[AMDGPU] Fix GCNDownwardRPTracker::advanceBeforeNext at the end of MBB The problem with GCNDownwardRPTracker::advanceBeforeNext is that it doesn't allow getting the register pressure after the last instruction in an MBB. However, when we track RP through the boundary of an MBB we need the state that is after the last instruction of the MBB and before the first instruction of the successor MBB. Currently we stop tracking RP in the state 'at' the last instruction of the MBB, which is incorrect. This patch fixes 27 lit tests with EXPENSIVE_CHECKS enabled. Reviewed By: rampitec, arsenm Differential Revision: https://reviews.llvm.org/D136927	2022-11-03 11:52:56 +01:00
Valery Pykhtin	4ae88a8d42	[AMDGPU] Refactor debug printing routines for GCNRPTracker Use Printable to enhance syntax, remove duplication, unify. Reviewed By: arsenm, rampitec Differential Revision: https://reviews.llvm.org/D136704	2022-10-28 04:22:46 +02:00
Austin Kerbow	b0f4678b90	[AMDGPU] Add iglp_opt builtin and MFMA GEMM Opt strategy Adds a builtin that serves as an optimization hint to apply specific optimized DAG mutations during scheduling. This also disables any other mutations or clustering that may interfere with the desired pipeline. The first optimization strategy that is added here is designed to improve the performance of small gemm kernels on gfx90a. Reviewed By: jrbyrnes Differential Revision: https://reviews.llvm.org/D132079	2022-08-19 15:38:36 -07:00
Austin Kerbow	40eec27618	[AMDGPU] Add llvm_unreachable to switch statement added in d7100b398.	2022-08-02 13:45:38 -07:00
Austin Kerbow	d7100b398b	[AMDGPU] Add GCNMaxILPSchedStrategy Creates a new scheduling strategy that attempts to maximize ILP for a single wave. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D130869	2022-08-02 13:21:24 -07:00
Anshil Gandhi	5c38056431	[AMDGPU][Scheduler] Avoid initializing Register pressure tracker when tracking is disabled When register pressure tracking is disabled, the scheduler attempts to load pressures at SReg_32 and VGPR_32. This causes an index out of bounds error. This patch fixes this issue by disabling the initialization of RPTracker when not needed. NFC Reviewed By: rampitec, kerbowa, arsenm Differential Revision: https://reviews.llvm.org/D129322	2022-07-28 15:39:28 -06:00
Austin Kerbow	ba0d079c7a	[AMDGPU] Aggressively schedule to reduce RP in occupancy limited regions By not clustering loads and adjusting heuristics to more aggressively reduce register pressure we may be able to increase occupancy for the function if it was dropped in a first pass scheduling. Similarly, try to reduce spilling if register usage exceeds lower bound occupancy. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D130329	2022-07-27 22:34:37 -07:00
Austin Kerbow	7ca9e471fe	[AMDGPU] Start refactoring GCNSchedStrategy Tries to make the different scheduling stages a bit more self contained and modifiable. Intended to be NFC. Preface to other changes. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D130147	2022-07-26 08:55:19 -07:00
Matt Arsenault	8d0383eb69	CodeGen: Remove AliasAnalysis from regalloc This was stored in LiveIntervals, but not actually used for anything related to LiveIntervals. It was only used in one check for if a load instruction is rematerializable. I also don't think this was entirely correct, since it was implicitly assuming constant loads are also dereferenceable. Remove this and rely only on the invariant+dereferenceable flags in the memory operand. Set the flag based on the AA query upfront. This should have the same net benefit, but has the possible disadvantage of making this AA query nonlazy. Preserve the behavior of assuming pointsToConstantMemory implying dereferenceable for now, but maybe this should be changed.	2022-07-18 17:23:41 -04:00
Vang Thao	311edc6b5b	[AMDGPU] Enable PreRARematerialize scheduling pass with multiple high RP regions Enable the PreRARematerialize pass when there are multiple high RP scheduling regions present. Require the occupancy in all high RP regions be improved before finalizing sinking. If any high RP region did not improve in occupancy then un-do all sinking and restore the state to before the pass. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D122501	2022-04-08 13:08:32 -07:00
Vang Thao	cd1071171c	[AMDGPU] Fix inline asm causing assert during PreRARematerialize stage in scheduler pass Reviewed By: foad Differential Revision: https://reviews.llvm.org/D123348	2022-04-08 09:22:32 -07:00
Vang Thao	45c2371c0d	[AMDGPU] Ignore debug use during PreRARematerialize stage in scheduling pass Ignore all debug uses when collecting trivially rematerializable defs. This fixes an issue with difference in codegen when enabling debug info. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D123048	2022-04-04 11:15:06 -07:00
Vang Thao	27e1931508	[AMDGPU] Fix PreRARematerialize scheduler pass sinking subreg defs When collecting trivially rematerializable defs, skip any subreg defs. We do not want to sink these. Differential Revision: https://reviews.llvm.org/D121874	2022-03-17 11:38:53 -07:00
serge-sans-paille	989f1c72e0	Cleanup codegen includes This is a (fixed) recommit of https://reviews.llvm.org/D121169 after: 1061034926 before: 1063332844 Discourse thread: https://discourse.llvm.org/t/include-what-you-use-include-cleanup Differential Revision: https://reviews.llvm.org/D121681	2022-03-16 08:43:00 +01:00
Nico Weber	a278250b0f	Revert "Cleanup codegen includes" This reverts commit 7f230feeeac8a67b335f52bd2e900a05c6098f20. Breaks CodeGenCUDA/link-device-bitcode.cu in check-clang, and many LLVM tests, see comments on https://reviews.llvm.org/D121169	2022-03-10 07:59:22 -05:00
serge-sans-paille	7f230feeea	Cleanup codegen includes after: 1061034926 before: 1063332844 Differential Revision: https://reviews.llvm.org/D121169	2022-03-10 10:00:30 +01:00
Vang Thao	28322c2514	[AMDGPU] Add scheduler pass to rematerialize trivial defs Add a new pass in the pre-ra AMDGPU scheduler to check if sinking trivially rematerializable defs that only have one use outside of the defining block will increase occupancy. If we can determine that occupancy can be increased, then rematerialize only the minimum number of defs required to increase occupancy. Also re-schedule all regions that had occupancy matching the previous min occupancy using the new occupancy. This is based off of the discussion in https://reviews.llvm.org/D117562. The logic to determine the defs we should collect and to determine if sinking would be beneficial is mostly the same. The main differences are that we are no longer limiting it to immediate defs and the def and use do not have to be part of a loop. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D119475	2022-03-09 09:34:33 -08:00
Vang Thao	570471199b	[AMDGPU] Fix debug values in scheduler not placed correctly when reverting Debug position data is cleared after ScheduleDAGMILive::schedule() due to it also calling placeDebugValues(). Make it so the data is not cleared after initial call to placeDebugValues since we will call it again after reverting a schedule. Secondly, since we skip debug instructions when reverting the schedule on AMDGPU, all debug instructions are now moved to the end of the scheduling region. RegionEnd points to the beginning of this chunk of debug instructions since it was not incremented when a debug instruction was skipped. RegionBegin may also point to the same debug instruction if Unsched.front() is a debug instruction thus shrinking the region to 1. Fix RegionBegin and RegionEnd so that they point to the current beginning and ending before calling placeDebugValues() since both vars will be used as reference points to move debug instructions back. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D119022	2022-02-07 11:01:13 -08:00
Vang Thao	2ca194ff55	[AMDGPU] Fix scheduler live-ins with debug inst at start of block GCNDownwardRPTracker RPTracker.reset() skips debug instructions for NextMI so RPTracker.getNext() will never give the beginning of a sched region if it is a debug value. In this case we will never set the live-ins for that block. Add check to see if getNext also equals the MI after skipping debug instructions. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D118853	2022-02-03 12:41:32 -08:00
Neubauer, Sebastian	d1f45ed58f	[AMDGPU][NFC] Fix typos Differential Revision: https://reviews.llvm.org/D113672	2021-11-12 11:37:21 +01:00
Austin Kerbow	02e60f2e77	[AMDGPU] Use max waves for scheduler's initial occupancy target The scheduler should set critical/excess register usage thresholds that are guided by the maximum possible occupancy for the function. This change is focused on setting proper lower bounds on register usage which we would typically only see when a specific number of maximum waves is requested with the "waves-per-eu" attribute, or by setting "amdgpu-num-vgpr|sgpr" directly. This was broken previously. I have a follow-on patch that will address issues with the scheduler not targeting correct upper bounds on register usage which is typical with launch bounds and min "waves-per-eu". Changes by this patch: Set the initial critical register usage thresholds to minimum values that are determined by the maximum possible occupancy for the function, or the number of allocatable registers, whichever is lower. Avoid unsigned overflow if register limits are lower than the register tracking "ErrorMargin", i.e. when using stress-regalloc=2. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D112373	2021-10-26 15:30:26 -07:00
Stanislav Mekhanoshin	799c50fe93	[AMDGPU] Avoid second rescheduling for some regions If a region was not constrained by a high register pressure and was not rescheduled without clustering we can skip rescheduling it in the ClusteredLowOccupancyReschedule stage. This improves scheduling speed by 25% on some kernels. Differential Revision: https://reviews.llvm.org/D97506	2021-02-26 12:29:37 -08:00
Stanislav Mekhanoshin	635993f07b	[AMDGPU] Skip unclustered rescheduling w/o ld/st We are attempting rescheduling without load store clustering if occupancy limits were not met with clustering. Skip this for regions which do not have any loads or stores at all. In a set of kernels I am experimenting with this improves scheduling time by ~30%. Differential Revision: https://reviews.llvm.org/D97342	2021-02-26 12:29:03 -08:00
Stanislav Mekhanoshin	bb16efe280	[AMDGPU] Move RPT::getLiveRegs() check under EXPENSIVE_CHECKS This is too expensive even for debug builds. It doubles scheduling time if enabled. Differential Revision: https://reviews.llvm.org/D97232	2021-02-22 15:21:59 -08:00
Stanislav Mekhanoshin	a8d9d50762	[AMDGPU] gfx90a support Differential Revision: https://reviews.llvm.org/D96906	2021-02-17 16:01:32 -08:00
dfukalov	560d7e0411	[NFC][AMDGPU] Split AMDGPUSubtarget.h to R600 and GCN subtargets ... to reduce headers dependency. Reviewed By: rampitec, arsenm Differential Revision: https://reviews.llvm.org/D95036	2021-01-20 22:22:45 +03:00
dfukalov	6a87e9b08b	[NFC][AMDGPU] Reduce include files dependency. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D93813	2021-01-07 22:22:05 +03:00
Vang Thao	04bd5b5286	[AMDGPU] Fix not rescheduling without clustering Regions are sometimes skipped which should be rescheduled without memory op clustering. RegionIdx is not incremented when iterating over regions that are flagged to be skipped, causing the index to be incorrect. Thanks to Vang Thao for discovering this bug! Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D85498	2020-08-07 11:15:58 -07:00
Jay Foad	43830790d7	[AMDGPU] Remove dubious logic in bidirectional list scheduler Summary: pickNodeBidirectional tried to compare the best top candidate and the best bottom candidate by examining TopCand.Reason and BotCand.Reason. This is unsound because, after calling pickNodeFromQueue, Cand.Reason does not reflect the most important reason why Cand was chosen. Rather it reflects the most recent reason why it beat some other potential candidate, which could have been for some low priority tie breaker reason. I have seen this cause problems where TopCand is a good candidate, but because TopCand.Reason is ORDER (which is very low priority) it is repeatedly ignored in favour of a mediocre BotCand. This is not how bidirectional scheduling is supposed to work. To fix this I changed the code to always compare TopCand and BotCand directly, like the generic implementation of pickNodeBidirectional does. This removes some uncommented AMDGPU-specific logic; if this logic turns out to be important then perhaps it could be moved into an override of tryCandidate instead. Graphics shader benchmarking on gfx10 shows a lot more positive than negative effects from this change. Reviewers: arsenm, tstellar, rampitec, kzhuravl, vpykhtin, dstuttard, tpr, atrick, MatzeB Subscribers: jvesely, wdng, nhaehnle, yaxunl, t-tye, hiraditya, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D68338	2020-02-28 21:35:34 +00:00
Stanislav Mekhanoshin	dd4766451e	[AMDGPU] Use generated RegisterPressureSets enum Differential Revision: https://reviews.llvm.org/D74671	2020-02-18 10:34:03 -08:00
Stanislav Mekhanoshin	53eb0f8c07	[AMDGPU] Attempt to reschedule without clustering We want to have more load/store clustering but we also want to maintain low register pressure, which are opposing goals. Allow scheduler to reschedule regions without mutations applied if we hit a register limit. Differential Revision: https://reviews.llvm.org/D73386	2020-01-27 10:27:16 -08:00
Stanislav Mekhanoshin	4aa7fb7752	[AMDGPU] Revert scheduling to reduce spilling We can revert region schedule if new schedule decreases occupancy. However, if we already have only one wave we would accept any new schedule even if it blows up register pressure. Such schedule may result in quite heavy spilling which can be avoided if we reject this new schedule. Differential Revision: https://reviews.llvm.org/D72181	2020-01-03 15:20:21 -08:00
Jay Foad	e536800022	[AMDGPU] Add VerifyScheduling support. Summary: This is cut and pasted from the corresponding GenericScheduler functions. Reviewers: arsenm, atrick, tstellar, vpykhtin Subscribers: MatzeB, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, javed.absar, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D68264 llvm-svn: 373346	2019-10-01 15:45:47 +00:00


90 Commits
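
Commit 6daa983c9d above gives the profitability test for a higher-occupancy schedule as Profit = NewOccupancy/OldOccupancy * OldMetric/NewMetric, where the metric is the ratio of bubble cycles to working cycles. The following is a minimal Python sketch of that arithmetic only, not the LLVM implementation. The function names, the floating-point representation, and the acceptance threshold of 1.0 are assumptions made for illustration; the commit message does not state how the result is compared.

```python
def bubble_ratio(bubble_cycles: int, working_cycles: int) -> float:
    """Schedule metric: bubble cycles divided by working cycles (lower is better)."""
    if working_cycles <= 0:
        raise ValueError("working_cycles must be positive")
    return bubble_cycles / working_cycles


def schedule_profit(old_occupancy: int, new_occupancy: int,
                    old_metric: float, new_metric: float) -> float:
    """Profit = NewOccupancy/OldOccupancy * OldMetric/NewMetric."""
    # A zero metric (a schedule with no bubbles) would need special handling;
    # this sketch simply rejects non-positive inputs.
    if old_occupancy <= 0 or new_occupancy <= 0:
        raise ValueError("occupancies must be positive")
    if old_metric <= 0 or new_metric <= 0:
        raise ValueError("metrics must be positive")
    return (new_occupancy / old_occupancy) * (old_metric / new_metric)


def is_profitable(old_occupancy: int, new_occupancy: int,
                  old_metric: float, new_metric: float,
                  threshold: float = 1.0) -> bool:
    """Hypothetical decision rule: accept the new schedule if Profit >= threshold."""
    profit = schedule_profit(old_occupancy, new_occupancy, old_metric, new_metric)
    return profit >= threshold
```

Under these assumptions, raising occupancy from 4 to 5 while the bubble ratio worsens from 0.2 to 0.3 gives a profit below 1, so the occupancy gain would not pay for the lost latency hiding; doubling occupancy from 4 to 8 with the same metrics gives a profit above 1.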