llvm-project

Author	SHA1	Message	Date
Dhruva Chakrabarti	58ac95db60	[AMDGPU] Avoid changing minOccupancy if unclustered schedule was not run for any region. (#162025 ) During init of unclustered schedule stage, minOccupancy may be temporarily increased. But subsequently, if none of the regions are scheduled because they don't meet the conditions of initGCNRegion, minOccupancy remains incorrectly set. This patch avoids this incorrectness by delaying the change of minOccupancy until a region is about to be scheduled.	2025-11-12 10:11:22 -08:00
Matt Arsenault	a88fa64e25	CodeGen: Remove TRI argument from reMaterialize (#158229 )	2025-11-10 16:23:36 -08:00
Austin Kerbow	d4b1ab77c1	[AMDGPU] Examine instructions in pending queues during scheduling (#147653 ) Examine instructions in the pending queue when scheduling. This makes instructions visible to scheduling heuristics even when they aren't immediately issuable due to hardware resource constraints. The scheduler has two hardware resource modeling modes: an in-order mode where instructions must be ready to issue before scheduling, and out-of-order models where instructions are always visible to heuristics. Special handling exists for unbuffered processor resources in out-of-order models. These resources can cause pipeline stalls when used back-to-back, so they're typically avoided. However, for AMDGPU targets, managing register pressure and reducing spilling is critical enough to justify exceptions to this approach. This change enables examination of instructions that can't be immediately issued because they use an already occupied unbuffered resource. By making these instructions visible to scheduling heuristics anyway, we gain more flexibility in scheduling decisions, potentially allowing better register pressure and hardware resource management.	2025-10-16 08:46:29 -07:00
Valery Pykhtin	8823efe77d	[AMDGPU] Add register usage debug printing the point of maximum register pressure. (#161850 ) Basically this allows to analyze "why so many VGPRs used?". It prints all live registers at the point of maximum register pressure and for each register its defs/uses are dumped. Currently can be run before and after the scheduler but would be nice if it can be ran inbetween any passes (not sure this is possible with legacy pass-manager). Requires debug or built with asserts compiler. Highly recommended to run with debug info to have debug locations for instructions. Example output: ``` * Register pressure info (VGPRs) for _ZN7ck_tile6ken.... * Max pressure is 256 VGPRs at 41780e@BB.18 (LoopHdr BB.16, Depth 1): %9858:vreg_512_align2 = contract V_MFMA_F32_32X32X16_BF16_mac_vgprcd_e64 %10137:vreg_128_align2, %10141:vreg_128_align2, %9858:vreg_512_align2(tied-def 0), 0, 0, 0, implicit $mode, implicit $exec Live registers with single definition (123 VGPRs): %10126:VReg_128_Align2, LiveMask 00000000000000FF (4 VGPRs) def 41600r@BB.18 (LoopHdr BB.16, Depth 1): undef %10126.sub0_sub1:vreg_128_align2 = DS_READ_B64_TR_B16 %478:vgpr_32, 15232, 0, implicit $exec :: (load (s64) from %ir.sunkaddr1314, !noalias !60, addrspace 3) def 41608r@BB.18 (LoopHdr BB.16, Depth 1): %10126.sub2_sub3:vreg_128_align2 = DS_READ_B64_TR_B16 %478:vgpr_32, 16320, 0, implicit $exec :: (load (s64) from %ir.sunkaddr1315, !noalias !60, addrspace 3) use 41848r@BB.18 (LoopHdr BB.16, Depth 1): %9856:vreg_512_align2 = contract V_MFMA_F32_32X32X16_BF16_mac_vgprcd_e64 %10126:vreg_128_align2, %10138:vreg_128_align2, %9856:vreg_512_align2(tied-def 0), 0, 0, 0, implicit $mode, implicit $exec %10136:VReg_128_Align2, LiveMask 00000000000000FF (4 VGPRs) def 41264r@BB.18 (LoopHdr BB.16, Depth 1): undef %10136.sub0_sub1:vreg_128_align2 = DS_READ_B64_TR_B16 %478:vgpr_32, 2176, 0, implicit $exec :: (load (s64) from %ir.sunkaddr1294, !noalias !60, addrspace 3) def 41272r@BB.18 (LoopHdr BB.16, Depth 1): %10136.sub2_sub3:vreg_128_align2 = DS_READ_B64_TR_B16 %478:vgpr_32, 3264, 0, implicit $exec :: (load (s64) from %ir.sunkaddr1295, !noalias !60, addrspace 3) use 41788r@BB.18 (LoopHdr BB.16, Depth 1): %9858:vreg_512_align2 = contract V_MFMA_F32_32X32X16_BF16_mac_vgprcd_e64 %10136:vreg_128_align2, %10140:vreg_128_align2, %9858:vreg_512_align2(tied-def 0), 0, 0, 0, implicit $mode, implicit $exec %10129:VReg_128_Align2, LiveMask 00000000000000FF (4 VGPRs) ... Live registers with multiple definitions (133 VGPRs): %9856:VReg_512_Align2, LiveMask 00000000FFFFFFFF (16 VGPRs) def 16544r@BB.8: INLINEASM &"v_pk_mul_f32 $0, $1, $2" [sideeffect] [isconvergent] [attdialect], $0:[regdef:VReg_64_Align2], def undef %9856.sub0_sub1:vreg_512_align2, $1:[reguse:VReg_64_Align2], %4069:vreg_64_align2, $2:[reguse:VReg_64_Align2], %10159:vreg_64_align2, !52 def 16592r@BB.8: INLINEASM &"v_pk_mul_f32 $0, $1, $2" [sideeffect] [isconvergent] [attdialect], $0:[regdef:VReg_64_Align2], def %9856.sub2_sub3:vreg_512_align2, $1:[reguse:VReg_64_Align2], %4069:vreg_64_align2, $2:[reguse:VReg_64_Align2], %10159:vreg_64_align2, !52 def 16608r@BB.8: INLINEASM &"v_pk_mul_f32 $0, $1, $2" [sideeffect] [isconvergent] [attdialect], $0:[regdef:VReg_64_Align2], def %9856.sub4_sub5:vreg_512_align2, $1:[reguse:VReg_64_Align2], %4069:vreg_64_align2, $2:[reguse:VReg_64_Align2], %10159:vreg_64_align2, !52 def 16656r@BB.8: INLINEASM &"v_pk_mul_f32 $0, $1, $2" [sideeffect] [isconvergent] [attdialect], $0:[regdef:VReg_64_Align2], def %9856.sub6_sub7:vreg_512_align2, $1:[reguse:VReg_64_Align2], %4069:vreg_64_align2, $2:[reguse:VReg_64_Align2], %10159:vreg_64_align2, !52 def 16672r@BB.8: INLINEASM &"v_pk_mul_f32 $0, $1, $2" [sideeffect] [isconvergent] [attdialect], $0:[regdef:VReg_64_Align2], def %9856.sub8_sub9:vreg_512_align2, $1:[reguse:VReg_64_Align2], %4069:vreg_64_align2, $2:[reguse:VReg_64_Align2], %10159:vreg_64_align2, !52 def 16720r@BB.8: INLINEASM &"v_pk_mul_f32 $0, $1, $2" [sideeffect] [isconvergent] [attdialect], $0:[regdef:VReg_64_Align2], def %9856.sub10_sub11:vreg_512_align2, $1:[reguse:VReg_64_Align2], %4069:vreg_64_align2, $2:[reguse:VReg_64_Align2], %10159:vreg_64_align2, !52 def 16736r@BB.8: INLINEASM &"v_pk_mul_f32 $0, $1, $2" [sideeffect] [isconvergent] [attdialect], $0:[regdef:VReg_64_Align2], def %9856.sub12_sub13:vreg_512_align2, $1:[reguse:VReg_64_Align2], %4069:vreg_64_align2, $2:[reguse:VReg_64_Align2], %10159:vreg_64_align2, !52 def 16784r@BB.8: INLINEASM &"v_pk_mul_f32 $0, $1, $2" [sideeffect] [isconvergent] [attdialect], $0:[regdef:VReg_64_Align2], def %9856.sub14_sub15:vreg_512_align2, $1:[reguse:VReg_64_Align2], %4069:vreg_64_align2, $2:[reguse:VReg_64_Align2], %10159:vreg_64_align2, !52 def use 41828r@BB.18 (LoopHdr BB.16, Depth 1): %9856:vreg_512_align2 = contract V_MFMA_F32_... ... ******** INTERVALS ****** ... ****** MACHINEINSTRS ******** # Machine code for function _ZN7ck_tile6kentr... ```	2025-10-13 16:17:11 +02:00
Philip Reames	58c959b4f7	[AMDGPU] Use common allUsesAvailableAt implementation [nfc] (#161418 ) Replace the target specific copy with a call to the generic routine. I don't spot any differences by eye, and there's nothing in the original review discussion (#124327) which makes it clear why this was duplicated.	2025-10-01 07:22:13 -07:00
Philip Reames	ea721e2fa1	[TII] Split isTrivialReMaterializable into two versions [nfc] (#160377 ) This change builds on https://github.com/llvm/llvm-project/pull/160319 which tries to clarify which callers (not backends) assume that the result is actually trivial. This change itself should be NFC. Essentially, I'm just renaming the existing isTrivialRematerializable to the non-trivial version and then adding a new trivial version (with the same name as the prior function) and simplifying a few callers which want that semantic. This change does not enable non-trivial remat any more broadly than was already done for our targets which were lying through the old APIs; that will come separately. The goal here is simply to make the code easier to follow in terms of what assumptions are being made where. --------- Co-authored-by: Luke Lau <luke_lau@icloud.com>	2025-09-24 18:52:17 -07:00
Lucas Ramirez	83c308f014	[AMDGPU][Scheduler] Consistent occupancy calculation during rematerialization (#149224 ) The `RPTarget`'s way of determining whether VGPRs are beneficial to save and whether the target has been reached w.r.t. VGPR usage currently assumes, if `CombinedVGPRSavings` is true, that free slots in one VGPR RC can always be used for the other. Implicitly, this makes the rematerialization stage (only current user of `RPTarget`) follow a different occupancy calculation than the "regular one" that the scheduler uses, one that assumes that ArchVGPR/AGPR usage can be balanced perfectly and at no cost, which is untrue in general. This ultimately yields suboptimal rematerialization decisions that require cross-VGPR-RC copies unnecessarily. This fixes that, making the `RPTarget`'s internal model of occupancy consistent with the regular one. The `CombinedVGPRSavings` flag is removed, and a form of cross-VGPR-RC saving implemented only for unified RFs, which is where it makes the most sense. Only when the amount of free VGPRs in a given VGPR RC (ArchVPGR or AGPR) is lower than the excess VGPR usage in the other VGPR RC does the `RPTarget` consider that a pressure reduction in the former will be beneficial to the latter.	2025-08-08 14:26:04 +02:00
Lucas Ramirez	96f3872dc9	[AMDGPU][Scheduler] Delete RegionsWithMinOcc bitvector from scheduler (NFC) (#142361 ) The `GCNScheduleDAGMILive`'s `RegionsWithMinOcc` bitvector is only used by the `UnclusteredHighRPStage`. Its presence in the scheduler's state forces us to maintain its value throughout scheduling even though it is of no use to the iterative scheduling process itself. At any point during scheduling it is possible to cheaply compute the occupancy induced by a particular register pressure. Furthermore, the field doesn't appear to be updated correctly throughout scheduling i.e., bits corresponding to regions at minimum occupancy are not always set in the vector. This removes the bitvector from `GCNScheduleDAGMILive`. `UnclusteredHighRPStage::initGCNRegion` now directly computes the occupancy of possibly reschedulable regions instead of querying the vector. Since it is the most expensive check, it is done last in the list.	2025-08-01 13:20:05 +02:00
Ruiling, Song	451912a24a	[MachineScheduler] Make cluster check more efficient (#150884 )	2025-08-01 16:00:42 +08:00
Jeffrey Byrnes	d3a9cde7b8	[AMDGPU] Don't skip regions in getRegionLiveInMap (#151423 ) Currently, this skips any region that is not the first region in a block. This is because the only user of it only cares about the LiveIns per-block. However, as named, this is supposed to compute the per-region LiveIns. This doesn't have any effect on scheduling / CodeGen currently (aside from computing LiveIns for all regions) since only the per-block LiveIns are needed. However, I'm working on something that will use this. Intended User: https://github.com/llvm/llvm-project/pull/149367/ `c62a2f127c/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp (L1351)`	2025-07-31 15:05:18 -07:00
Lucas Ramirez	c77a2a2403	[AMDGPU][Scheduler] Use `AMDGPU::NoSubRegister` instead of 0 (NFC) (#150610 )	2025-07-25 14:25:13 +02:00
Lucas Ramirez	e38f98f535	[AMDGPU][Scheduler] Fix usage of `TII.reMaterialize` (NFC) (#150259 ) Any non-zero `SubIdx` passed to the method is composed with the rematerialized instruction's first operand's subregister to determine the new register's subregister. In our case we want the new register to have the same subregister as the old one, so we should pass 0.	2025-07-25 12:51:21 +02:00
Jeffrey Byrnes	283a62fa5b	[AMDGPU] NFC: Decouple getRealRegPressure from current region (#149219 ) We're already accepting a RegionIdx for the LiveIns, also use this for the instruction iterators. Enables querying RP for other regions -- useful for function wide transformations (e.g. rematerialization, rewriting, etc).	2025-07-16 19:59:38 -07:00
Lucas Ramirez	6307b496f8	[AMDGPU] Add `GCNRPTarget` to track register pressure against a target (#145765 ) This adds the `GCNRPTarget` class which models a register pressure target (i.e., maximum number of SGPRs/VGPRS) that one can track register savings against. The only current use of this class is in the scheduler's rematerialization stage. It replaces the more ad-hoc (and now deleted) `ExcessRP` class which used to serve the same purpose. This is only NFC~ish because `GCNRPTarget` tracks VGPR usage more accurately than `ExcessRP` used to. To estimate required combined VGPR savings we now additionally take into account the number of available VGPRs in both banks (ArchVGPR and AGPR) at the time where the RP target is created, whereas we used to only consider explicit savings made from the starting RP. This makes VGPR savings estimations more accurate in cases where we allow for savings in one VGPR bank to help towards reducing pressure in another VGPR bank (see `GCNRPTarget::CombineVGPRSavings`). This is the cause for unit test changes.	2025-06-26 13:11:20 +02:00
Lucas Ramirez	c74ed8a0d3	[AMDGPU][Scheduler] Support for rematerializing SGPRs and AGPRs (#140036 ) This adds the ability to rematerialize SGPRs and AGPRs to the scheduler's `PreRARematStage`, which can currently only rematerialize ArchVGPRs. This also fixes a small potential issue in the stage where, in case of spilling, the target occupancy could be set to a lower than expected value when the function had either one of the "amdgpu-num-sgpr" or "amdgpu-num-vgpr" attributes set.	2025-06-24 19:30:27 +02:00
Diana Picus	a201f8872a	[AMDGPU] Replace dynamic VGPR feature with attribute (#133444 ) Use a function attribute (amdgpu-dynamic-vgpr) instead of a subtarget feature, as requested in #130030.	2025-06-24 11:09:36 +02:00
Mikael Holmen	77062244ed	Fix two instances of -Wparentheses warnings [NFC] Add parentheses around the assert conditions. Without this gcc warned like ../lib/Target/AMDGPU/GCNSchedStrategy.cpp:2250: warning: suggest parentheses around '&&' within '\|\|' [-Wparentheses] 2250 \| NewMI != RegionBounds.second && "cannot remove at region end"); and ../../clang/lib/Sema/SemaOverload.cpp:11326:39: warning: suggest parentheses around '&&' within '\|\|' [-Wparentheses] 11326 \| DeferredCandidatesCount == 0 && \| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~ 11327 \| "Unexpected deferred template candidates"); \| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~	2025-06-12 09:43:30 +02:00
Ruiling, Song	0487db1f13	MachineScheduler: Improve instruction clustering (#137784 ) The existing way of managing clustered nodes was done through adding weak edges between the neighbouring cluster nodes, which is a sort of ordered queue. And this will be later recorded as `NextClusterPred` or `NextClusterSucc` in `ScheduleDAGMI`. But actually the instruction may be picked not in the exact order of the queue. For example, we have a queue of cluster nodes A B C. But during scheduling, node B might be picked first, then it will be very likely that we only cluster B and C for Top-Down scheduling (leaving A alone). Another issue is: ``` if (!ReorderWhileClustering && SUa->NodeNum > SUb->NodeNum) std::swap(SUa, SUb); if (!DAG->addEdge(SUb, SDep(SUa, SDep::Cluster))) ``` may break the cluster queue. For example, we want to cluster nodes (order as in `MemOpRecords`): 1 3 2. 1(SUa) will be pred of 3(SUb) normally. But when it comes to (3, 2), As 3(SUa) > 2(SUb), we would reorder the two nodes, which makes 2 be pred of 3. This makes both 1 and 2 become preds of 3, but there is no edge between 1 and 2. Thus we get a broken cluster chain. To fix both issues, we introduce an unordered set in the change. This could help improve clustering in some hard case. One key reason the change causes so many test check changes is: As the cluster candidates are not ordered now, the candidates might be picked in different order from before. The most affected targets are: AMDGPU, AArch64, RISCV. For RISCV, it seems to me most are just minor instruction reorder, don't see obvious regression. For AArch64, there were some combining of ldr into ldp being affected. With two cases being regressed and two being improved. This has more deeper reason that machine scheduler cannot cluster them well both before and after the change, and the load combine algorithm later is also not smart enough. For AMDGPU, some cases have more v_dual instructions used while some are regressed. It seems less critical. Seems like test `v_vselect_v32bf16` gets more buffer_load being claused.	2025-06-05 15:28:04 +08:00
Lucas Ramirez	3e18216474	[AMDGPU][Scheduler] Delete RescheduleRegions bitvector from scheduler (NFC) (#141595 ) The `GCNScheduleDAGMILive`'s `RescheduleRegions` bitvector is only used by the rematerialization stage (`PreRARematStage`). Its presence in the scheduler's state forces us to maintain its value throughout scheduling even though it is of no use to the iterative scheduling process itself, which instead relies on each stage's `initGCNRegion` hook to determine whether the current region should be rescheduled. This moves the bitvector to the `PreRARematStage`, which uses it to store the set of regions that must be rescheduled between stage initialization and region initialization. This NFC also swaps a call to `GCNRegPressure::getArchVGPRNum(false)` for a call to `GCNRegPressure::getArchVGPRNum()`---which is equivalent but simpler in the context---and makes `GCNSchedStage::finalizeGCNRegion` use its own API to advance to the next region.	2025-05-27 17:10:28 +02:00
Lucas Ramirez	6456ee056f	Reapply "[AMDGPU][Scheduler] Refactor ArchVGPR rematerialization during scheduling (#125885 )" (#139548 ) This reapplies 067caaa and 382a085 (reverting b35f6e2) with fixes to issues detected by the address sanitizer (MIs have to be removed from live intervals before being removed from their parent MBB). Original commit description below. AMDGPU scheduler's `PreRARematStage` attempts to increase function occupancy w.r.t. ArchVGPR usage by rematerializing trivial ArchVGPR-defining instruction next to their single use. It first collects all eligible trivially rematerializable instructions in the function, then sinks them one-by-one while recomputing occupancy in all affected regions each time to determine if and when it has managed to increase overall occupancy. If it does, changes are committed to the scheduler's state; otherwise modifications to the IR are reverted and the scheduling stage gives up. In both cases, this scheduling stage currently involves repeated queries for up-to-date occupancy estimates and some state copying to enable reversal of sinking decisions when occupancy is revealed not to increase. The current implementation also does not accurately track register pressure changes in all regions affected by sinking decisions. This commit refactors this scheduling stage, improving RP tracking and splitting the stage into two distinct steps to avoid repeated occupancy queries and IR/state rollbacks. - Analysis and collection (`canIncreaseOccupancyOrReduceSpill`). The number of ArchVGPRs to save to reduce spilling or increase function occupancy by 1 (when there is no spilling) is computed. Then, instructions eligible for rematerialization are collected, stopping as soon as enough have been identified to be able to achieve our goal (according to slightly optimistic heuristics). If there aren't enough of such instructions, the scheduling stage stops here. - Rematerialization (`rematerialize`). Instructions collected in the first step are rematerialized one-by-one. Now we are able to directly update the scheduler's state since we have already done the occupancy analysis and know we won't have to rollback any state. Register pressures for impacted regions are recomputed only once, as opposed to at every sinking decision. In the case where the stage attempted to increase occupancy, and if both rematerializations alone and rescheduling after were unable to improve occupancy, then all rematerializations are rollbacked.	2025-05-13 11:11:00 +02:00
Vitaly Buka	b35f6e26a5	Revert "[AMDGPU][Scheduler] Refactor ArchVGPR rematerialization during scheduling (#125885 )" (#139341 ) And related "[AMDGPU] Regenerate mfma-loop.ll test" Introduce memory error detected by Asan #125885. This reverts commit 382a085a95b0abeac77b150b7b644b372bd08e78. This reverts commit 067caaafb58a156d0d77229422607782a639f5b5.	2025-05-09 17:51:46 -07:00
Lucas Ramirez	067caaafb5	[AMDGPU][Scheduler] Refactor ArchVGPR rematerialization during scheduling (#125885 ) AMDGPU scheduler's `PreRARematStage` attempts to increase function occupancy w.r.t. ArchVGPR usage by rematerializing trivial ArchVGPR-defining instruction next to their single use. It first collects all eligible trivially rematerializable instructions in the function, then sinks them one-by-one while recomputing occupancy in all affected regions each time to determine if and when it has managed to increase overall occupancy. If it does, changes are committed to the scheduler's state; otherwise modifications to the IR are reverted and the scheduling stage gives up. In both cases, this scheduling stage currently involves repeated queries for up-to-date occupancy estimates and some state copying to enable reversal of sinking decisions when occupancy is revealed not to increase. The current implementation also does not accurately track register pressure changes in all regions affected by sinking decisions. This commit refactors this scheduling stage, improving RP tracking and splitting the stage into two distinct steps to avoid repeated occupancy queries and IR/state rollbacks. - Analysis and collection (`canIncreaseOccupancyOrReduceSpill`). The number of ArchVGPRs to save to reduce spilling or increase function occupancy by 1 (when there is no spilling) is computed. Then, instructions eligible for rematerialization are collected, stopping as soon as enough have been identified to be able to achieve our goal (according to slightly optimistic heuristics). If there aren't enough of such instructions, the scheduling stage stops here. - Rematerialization (`rematerialize`). Instructions collected in the first step are rematerialized one-by-one. Now we are able to directly update the scheduler's state since we have already done the occupancy analysis and know we won't have to rollback any state. Register pressures for impacted regions are recomputed only once, as opposed to at every sinking decision. In the case where the stage attempted to increase occupancy, and if both rematerializations alone and rescheduling after were unable to improve occupancy, then all rematerializations are rollbacked.	2025-05-08 12:51:06 +02:00
Jeffrey Byrnes	6e7fe85247	[AMDGPU] Teach iterative schedulers about IGLP (#134953 ) This adds IGLP mutation to the iterative schedulers (`gcn-iterative-max-occupancy-experimental`, `gcn-iterative-minreg`, and `gcn-iterative-ilp`). The `gcn-iterative-minreg` and `gcn-iterative-ilp` schedulers never actually applied the mutations added, so this also has the effect of teaching them about mutations in general. The `gcn-iterative-max-occupancy-experimental` scheduler has calls to `ScheduleDAGMILive::schedule()`, so, before this, mutations were applied at this point. Now this is done during calls to `BuildDAG`, with IGLP superseding other mutations (similar to the other schedulers). We may end up scheduling regions multiple times, with mutations being applied each time, so we need to track for `AMDGPU::SchedulingPhase::PreRAReentry`	2025-04-11 15:34:49 -07:00
Kazu Hirata	71935281e0	[Target] Use *Set::insert_range (NFC) (#132140 ) DenseSet, SmallPtrSet, SmallSet, SetVector, and StringSet recently gained C++23-style insert_range. This patch replaces: Dest.insert(Src.begin(), Src.end()); with: Dest.insert_range(Src); This patch does not touch custom begin like succ_begin for now.	2025-03-20 09:09:30 -07:00
Emma Pilkington	3eddb992d0	[AMDGPU] Fix a crash by skipping DBG instrs at start of sched region (#131167 ) Fixes SWDEV-514946	2025-03-19 09:31:54 -04:00
Diana Picus	1f84495255	[AMDGPU] Update target helpers & GCNSchedStrategy for dynamic VGPRs (#130047 ) In dynamic VGPR mode, we can allocate up to 8 blocks of either 16 or 32 VGPRs (based on a chip-wide setting which we can model with a Subtarget feature). Update some of the subtarget helpers to reflect this. In particular: - getVGPRAllocGranule is set to the block size - getAddresableNumVGPR will limit itself to 8 * size of a block We also try to be more careful about how many VGPR blocks we allocate. Therefore, when deciding if we should revert scheduling after a given stage, we check that we haven't increased the number of VGPR blocks that need to be allocated. --------- Co-authored-by: Jannik Silvanus <jannik.silvanus@amd.com>	2025-03-19 10:29:38 +01:00
Julian Brown	84909d7977	[AMDGCN] Allow unscheduling of bundled insns This is a patch arising from AMD's fuzzing project. In the test case, the scheduling algorithm decides to undo an attempted schedule, but is unprepared to handle bundled instructions at that point -- and those can arise via the expansion of intrinsics earlier in compilation. The fix is to use the splice method instead of remove/insert, since that can handle bundles properly.	2025-03-18 11:56:51 -05:00
Lucas Ramirez	03677f63a7	[MachineScheduler] Optional scheduling of single-MI regions (#129704 ) Following 15e295d the machine scheduler no longer filters-out single-MI regions when emitting regions to schedule. While this has no functional impact at the moment, it generally has a negative compile-time impact (see #128739). Since all targets but AMDGPU do not care for this behavior, this introduces an off-by-default flag to `ScheduleDAGInstrs` to control whether such regions are going to be scheduled, effectively reverting 15e295d for all targets but AMDGPU (currently the only target enabling this flag).	2025-03-04 17:46:44 +01:00
Jeffrey Byrnes	16f7e961c6	[AMDGPU] Allow rematerialization of instructions with virtual register uses (#124327 ) Remove the restriction that scheduling rematerialization candidates cannot have virtual reg uses. Currently, this only allows for virtual reg uses which are already live at the rematerialization point, so bring in allUsesAvailableAt to check for this condition. Because of this condition, the uses of the remats will already be live in to the region, so the remat won't increase live-in pressure. Add an expensive check to check this condition.	2025-02-06 10:16:28 -08:00
Jeffrey Byrnes	e77d428e46	[AMDGPU] Do not remat instructions with PhysReg uses (#124366 ) This blocks rematerialization during scheduling if the instruction has a non accepted PhysReg use. Currently, there aren't any checks like this in place, and we may create invalid code: https://godbolt.org/z/xjPjdcorf	2025-01-27 10:50:06 -08:00
Lucas Ramirez	6206f5444f	[AMDGPU] Occupancy w.r.t. workgroup size range is also a range (#123748 ) Occupancy (i.e., the number of waves per EU) depends, in addition to register usage, on per-workgroup LDS usage as well as on the range of possible workgroup sizes. Mirroring the latter, occupancy should therefore be expressed as a range since different group sizes generally yield different achievable occupancies. `getOccupancyWithLocalMemSize` currently returns a scalar occupancy based on the maximum workgroup size and LDS usage. With respect to the workgroup size range, this scalar can be the minimum, the maximum, or neither of the two of the range of achievable occupancies. This commit fixes the function by making it compute and return the range of achievable occupancies w.r.t. workgroup size and LDS usage; it also renames it to `getOccupancyWithWorkGroupSizes` since it is the range of workgroup sizes that produces the range of achievable occupancies. Computing the achievable occupancy range is surprisingly involved. Minimum/maximum workgroup sizes do not necessarily yield maximum/minimum occupancies i.e., sometimes workgroup sizes inside the range yield the occupancy bounds. The implementation finds these sizes in constant time; heavy documentation explains the rationale behind the sometimes relatively obscure calculations. As a justifying example, consider a target with 10 waves / EU, 4 EUs/CU, 64-wide waves. Also consider a function with no LDS usage and a flat workgroup size range of [513,1024]. - A group of 513 items requires 9 waves per group. Only 4 groups made up of 9 waves each can fit fully on a CU at any given time, for a total of 36 waves on the CU, or 9 per EU. However, filling as much as possible the remaining 40-36=4 wave slots without decreasing the number of groups reveals that a larger group of 640 items yields 40 waves on the CU, or 10 per EU. - Similarly, a group of 1024 items requires 16 waves per group. Only 2 groups made up of 16 waves each can fit fully on a CU ay any given time, for a total of 32 waves on the CU, or 8 per EU. However, removing as many waves as possible from the groups without being able to fit another equal-sized group on the CU reveals that a smaller group of 896 items yields 28 waves on the CU, or 7 per EU. Therefore the achievable occupancy range for this function is not [8,9] as the group size bounds directly yield, but [7,10]. Naturally this change causes a lot of test churn as instruction scheduling is driven by achievable occupancy estimates. In most unit tests the flat workgroup size range is the default [1,1024] which, ignoring potential LDS limitations, would previously produce a scalar occupancy of 8 (derived from 1024) on a lot of targets, whereas we now consider the maximum occupancy to be 10 in such cases. Most tests are updated automatically and checked manually for sanity. I also manually changed some non-automatically generated assertions when necessary. Fixes #118220.	2025-01-23 16:07:57 +01:00
Austin Kerbow	657fb4433e	[AMDGPU] Add target hook to isGlobalMemoryObject (#112781 ) We want special handing for IGLP instructions in the scheduler but they should still be treated like they have side effects by other passes. Add a target hook to the ScheduleDAGInstrs DAG builder so that we have more control over this.	2025-01-11 09:57:57 -08:00
Ruiling, Song	b33c807b39	[AMDGPU] Add MaxMemoryClauseSchedStrategy (#114957 ) Also expose an option to choose custom scheduler strategy: amdgpu-sched-strategy={max-ilp\|max-memory-clause} This can be set through either function attribute or command line option. The major behaviors of the max memory clause schedule strategy includes: 1. Try to cluster memory instructions more aggressively. 2. Try to schedule long latency load earlier than short latency instruction. I tested locally against about 470 real shaders and got the perf changes (only count perf changes over +/-10%): About 15 shaders improved 10%~40%. Only 3 shaders drops ~10%. (This was tested together with another change which increases the maximum clustered dword from 8 to 32). I will make another change to make that threshold configurable.	2024-12-09 10:07:27 +08:00
Jeffrey Byrnes	3a08551a03	[AMDGPU] Fix expensive check Change-Id: I0b26d5db6d3da8936ab25ee2b1e9002840b9853e	2024-10-09 10:24:06 -07:00
Jeffrey Byrnes	17bc959961	[AMDGPU] Optionally Use GCNRPTrackers during scheduling (#93090 ) This adds the ability to use the GCNRPTrackers during scheduling. These trackers have several advantages over the generic trackers: 1. global live-thru trackers, 2. subregister based RP deltas, and 3. flexible vreg -> PressureSet mappings. This feature is off-by-default to ease with the roll-out process. In particular, when using the optional trackers, the scheduler will still maintain the generic trackers leading to unnecessary compile time.	2024-10-09 09:54:11 -07:00
Kazu Hirata	3b9f183958	[AMDGPU] Use llvm::any_of, llvm::all_of, and llvm::none_of (NFC) (#103007 )	2024-08-13 00:07:54 -07:00
Pierre van Houtryve	d890dda16b	[NFCI][AMDGPU] Try to use PressureDiff to Calculate RegPressure. (#94221 ) PressureDiff is reliable most of the time, and it's pretty much free compared to RPTracker. We can use it whenever there is no subregister definitions, or physregs invovled. No subregs because PDiff doesn't take into account lane liveness, and no Physreg because it seems to get PhysReg liveness completely wrong. Sometimes it adds a diff, sometimes itt doesn't - I didn't look at that one for long so maybe there is something we can eventually do to make it better. This allows us to save a ton of calls to RPTracker and LIS too. On a huge IR module (100+MB), it went from about 20M calls to RPTracker in this function down to 3.4, with the rest being PressureDiffs. I also added an expensive check to verify correctness of PressureDiff.	2024-06-14 08:34:53 +02:00
Jeffrey Byrnes	b761137049	[AMDGPU] Use correct VGPR threshold for flagging ExcessRP regions in unified register file case (#85860 ) `ST.getMaxNumVGPRs(MF)` lowers to `AMDGPUBaseInfo.cpp:getTotalNumVGPRs` which returns 512 for gfx90a. This is subsequently limited by `AMDGPUBaseInfo:getAddressableNumVGPRs()`, which also returns 512 for gfx90a. The ISA states we can have a total of 512 registers, but a maximum of only 256 of each of AGPR and VGPR (gfx90a 3.6.4). Therefore, in unified register file case, `ST.getMaxNumVGPRs(MF)` calculates the maximum number of combined VGPR + AGPR. But, it is currently used as the limit for accvgpr and as the limit for archvgpr. This patch uses it as the combined limit, and accounts for the maximum addressable arch/acc VGPRs when calculating the per RegClass limits. It is not unreasonable to think other clients of getTotalNumVGPRs are using it in the wrong way.	2024-03-25 13:11:58 -07:00
Jeffrey Byrnes	cf1c97b2d2	[AMDGPU] Do not attempt to fallback to default mutations (#83208 ) IGLP itself will be in SavedMutations via mutations added during Scheduler creation, thus falling back results in reapplying IGLP. In PostRA scheduling, if we have multiple regions with IGLP instructions, then we may have infinite loop. Disable the feature for now.	2024-02-27 18:04:59 -08:00
Jeffrey Byrnes	113052b2b0	[AMDGPU] Prefer lower total register usage in regions with spilling Change-Id: Ia5c434b0945bdcbc357c5e06c3164118fc91df25	2024-02-26 12:19:52 -08:00
Jeffrey Byrnes	8f2bd8ae68	[AMDGPU] Introduce iglp_opt(2): Generalized exp/mfma interleaving for select kernels (#81342 ) This implements the basic pipelining structure of exp/mfma interleaving for better extensibility. While it does have improved extensibility, there are controls which only enable it for DAGs with certain characteristics (matching the DAGs it has been designed against).	2024-02-23 17:13:20 -08:00
Diana Picus	46dd8acf36	[AMDGPU] Fix typos. NFC	2024-01-26 12:04:58 +01:00
Jeffrey Byrnes	6d8b44a506	[AMDGPU] [IGLP]: Fix assert (#73710 ) We can also re-enter IGLP mutation via later `SchedStage`s in the `GCNMaxOccupancySchedStrategy` . This is sort of NFC in that there is no changed behavior for the only current client of `IsReentry`	2023-12-07 17:10:10 -08:00
Jeffrey Byrnes	5b2fee8418	[AMDGPU] NFC: Add flag to disable clustered low occupancy phase (#73025 ) This will help users analyze whether high register usage is coming from inability of scheduler to reduce RP, or from sacrificing good RP to improve ILP.	2023-11-21 12:43:12 -08:00
Jeffrey Byrnes	6afceba510	[AMDGPU][IGLP] SingleWaveOpt: Cache DSW Counters from PreRA (#67759 ) Save the DSW counters from PreRA scheduling. While this avoids recalculation in the postRA pass, that isn't the main purpose. This is required because of physical register dependencies in PostRA scheduling -- they alter the DAG s.t. our counters may become incorrect -- which alters the layout of the pipeline. By preserving the values from PreRA, we can be sure that we accurately construct the pipeline. Additionally, remove a bad assert in SharesPredWithPrevNthGroup -- it is possible that we will have an empty cache if OtherGroup has no elements which have a V_PERM pred (possible if the V_PERM SG is empty).	2023-10-06 17:34:14 -07:00
Jay Foad	c3939eb827	[AMDGPU] Fix typo in scheduler option name (#67661 ) Fix: -amdgpu-disable-unclustred-high-rp-reschedule Now: -amdgpu-disable-unclustered-high-rp-reschedule	2023-09-28 20:54:57 +01:00
Jeffrey Byrnes	be92848ea4	[AMDGPU] NFC: Add schedule-relaxed-occupancy to relax occupancy targets for wave-limited/membound kernels Default scheduling behavior for these types of kernels is to chase high occupancy goals with scheduling heuristics, but allow occupancy drops if we are unable to reach the target. This (experimental, off-by-default) feature relaxes occupancy target from the beginning, which enables scheduler to produce better ILP schedules. Differential Revision: https://reviews.llvm.org/D153925 Change-Id: I112833214e2db869704591f4df3c4574d0fcbb1b	2023-06-28 08:12:31 -07:00
Jay Foad	3030c03988	[AMDGPU] Make use of MachineInstr::all_defs and all_uses. NFCI.	2023-06-05 10:32:33 +01:00
Jeffrey Byrnes	7f0a881e6c	[AMDGPU] Track liveins for max-ilp-sched-strategy Even if optimizing for ILP, it is still useful to track RP to avoid spilling. Given that, we need to maintin consistent liveness state with the RP tracker. This patch makes RP tracking consistent by updating for liveins. Otherwise, we should completely eliminate RP tracking for this scheduler (checkScheduling, initCandidate). Differential Revision: https://reviews.llvm.org/D149358	2023-04-27 16:45:45 -07:00
Valery Pykhtin	a999669982	[AMDGPU] Scheduler: fix RP calculation for a MBB with one successor We reuse live registers after tracking one MBB as live-ins to the successor MBB if the successor is only one but we don't check if the successor has other predecessors. `A B` ` \ /` ` C` A and B have one successor but C has live-ins defined by A and B and therefore should be initialized using LIS. This fixes 83 lit tests out if 420 with EXPENSIVE_CHECK enabled. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D136918	2023-03-08 12:20:03 +01:00

1 2 3

132 Commits