llvm-project

Author	SHA1	Message	Date
Lucas Ramirez	dfbf4eb88e	[AMDGPU][Scheduler] Fix incorrect region index in EXPENSIVE_CHECKS (#179461 ) #177206 exposed a pre-existing typo in EXPENSIVE_CHECKS in the scheduler's rematerialization stage. When rematerializing a register that depends on other registers, the dependent registers should be live at the live-ins of the rematerialized register's using region, but not necessarily at the live-ins of its defining region, as it is written right now. This fixes that and also hoists the check from the loop on regions where the rematerialized register is live, since it's only supposed to run once for the using region anyway. This fixes the following unit tests which were failing when building with EXPENSIVE_CHECKS. - `CodeGen/AMDGPU/copy-hoist-no-spills.ll` - `CodeGen/AMDGPU/sched_mfma_rewrite_copies.mir`	2026-02-03 17:51:36 +01:00
Lucas Ramirez	9dc6b4904a	Re-apply "[AMDGPU][Scheduler] Scoring system for rematerializations (#175050 )" (#177206 ) This re-applies commit f21e3593371c049380f056a539a1601a843df558 along with the compile failure fix introduced in 8ab79377740789f6a34fc6f04ee321a39ab73724 before the initial patch was reverted. It also fixes the previously observed assert failure. We were hitting the assert in the HIP Blender test due to a combination of two issues that could happen when rematerializations are being rolled back. 1. Small changes in slot indices (while preserving instruction order) compared to the pre-re-scheduling state mean that we have to re-compute live ranges for all register operands of rolled back rematerializations. This was not being done before. 2. Re-scheduling can move registers that were rematerialized at arbitrary positions in their respective regions while their opcode is set to DBG_VALUE, even before their read operands are defined. This makes re-scheduling reverts mandatory before rolling back rematerializations, as otherwise def-use chains may be broken. The original patch did not guarantee that, but previous refactoring of the rollback/revert logic for the rematerialization stage now ensures that reverts always precede rollbacks.	2026-02-02 12:34:16 +01:00
Lucas Ramirez	cf60af88b4	[AMDGPU][Scheduler] Revert all regions when remat fails to increase occ. (#177205 ) When the rematerialization stage fails to increase occupancy in all regions, the current implementation only reverts the effect of re-scheduling in regions in which the increased occupancy target could not be achieved. However, given that re-scheduling with a higher occupancy target puts more pressure on the scheduler to achieve lower maximum RP at the cost of potentially lower ILP as well, region schedules made with higher occupancy targets are generally less desirable if the whole function is not able to meet that target. Therefore, if at least one region cannot reach its target, it makes sense to revert re-scheduling in all affected regions to go back to a schedule that was made with a lower occupancy target. This implements such logic for the rematerialization stage, and adds a test to showcase that re-scheduling is indeed interrupted/reverted as soon as a re-scheduled region that does not meet the increased target occupancy is encountered. As a minor improvement, this also sets higher occupancy targets for re-scheduling at the end of stage initialization in some cases. In cases where rematerializations alone are not able to achieve the target, this can push the scheduler to be more aggressive in reducing RP and achieve the target.	2026-02-02 00:33:40 +00:00
Lucas Ramirez	7274ae970d	[AMDGPU][Scheduler] Simplify scheduling revert logic (#177203 ) When scheduling must be reverted for a region, the current implementation re-orders non-debug instructions and debug instructions separately; the former in a first pass and the latter in a second pass handled by a generic machine scheduler helper whose state is tied to the current region being scheduled, in turn limiting the revert logic to only work on the active scheduling region. This makes the revert logic work in a single pass for all MIs, and removes the restriction that it works exclusively on the active scheduling region. The latter enables future use cases such as reverting scheduling of multiple regions at once.	2026-02-01 17:55:12 +01:00
Lucas Ramirez	8029699b8d	[AMDGPU][Scheduler] Make `finalizeGCNRegion` an overridable hook (NFC) (#177199 ) This allows individual stages to make decisions after re-scheduling individual regions.	2026-01-30 17:35:23 +01:00
LU-JOHN	90b5ec1226	[AMDGPU] revertScheduling must behave the same with or without debug (#177483 ) When revertScheduling reorders instructions, only update LiveIntervals if non-debug instructions have been re-ordered. Otherwise code generation can change with debug info. --------- Signed-off-by: John Lu <John.Lu@amd.com>	2026-01-27 14:33:03 -06:00
Tony Linthicum	fc94ef9576	AMDGPU: Disable scheduler mfma rewrite stage by default for now (#177624 ) Currently generating an excess number of copies. Turning it off to avoid churn for other developers.	2026-01-23 23:39:03 +01:00
Sam Elliott	7184229fea	[NFC][MI] Tidy Up RegState enum use (2/2) (#177090 ) This change makes `RegState` into an enum class, with bitwise operators. It also: - Updates declarations of flag variables/arguments/returns from `unsigned` to `RegState`. - Updates empty RegState initializers from 0 to `{}`. If this is causing problems in downstream code: - Adopt the `RegState getXXXRegState(bool)` functions instead of using a ternary operator such as `bool ? RegState::XXX : 0`. - Adopt the `bool hasRegState(RegState, RegState)` function instead of using a bitwise check of the flags.	2026-01-23 00:19:03 -08:00
Tony Linthicum	9bc0e62617	[AMDGPU] Add scheduling stage to rewrite MFMA from VGPR to AGPR (#170335 ) This pull request is an update of [Jeff Byrnes' PR](https://github.com/llvm/llvm-project/pull/149367). Additionally, all unresolved comments from the original PR have been addressed. Changes to make the MachineScheduler dependent upon MachineBlockFrequencyInfo have been pulled out into a new PR, [176172](https://github.com/llvm/llvm-project/pull/176172), upon which this PR now depends. --------- Co-authored-by: Jeffrey Byrnes <Jeffrey.Byrnes@amd.com>	2026-01-22 12:22:39 +01:00
Juan Manuel Martinez Caamaño	c30c2f4f3e	[AMDGPU] Rematerialize VGPR candidates when SGPR spills results in VGPR Excess (#168079 ) Before, when selecting candidates to rematerialize, we would only consider SGPR candidates when there was an excess of SGPR registers. Failing to eliminate the excess would result in spills to VGPRs. This is normally not an issue, unless spilling to VGPRs results in excess VGPRs. This patch does 2 things: * It relaxes the GCNRPTarget success criteria: now we accept regions where we spill SGPRs to VGPRs, as long as this does not end up in excess VGPRs. * It changes isSaveBeneficial to consider the excess VGPRs (which includes the SGPRs that would be spilled to VGPR). With these changes, the compiler rematerializes VGPRs when the excess SGPRs would result in VGPR excess. This has some unaddressed flaws: we should attempt to rematerialize SGPRs first in order to eliminate the SGPR excess that results in VGPR excess. Related to SWDEV-549940	2026-01-16 09:08:55 +01:00
Lucas Ramirez	7b699cc5de	Revert "[AMDGPU][Scheduler] Scoring system for rematerializations (#175050 )" (#175813 ) This reverts 8ab79377740789f6a34fc6f04ee321a39ab73724 and f21e3593371c049380f056a539a1601a843df558 which are causing a HIP failure in a Blender test.	2026-01-13 19:29:50 +00:00
Thurston Dang	458a2e88cc	[AMDGPU][Scheduler] Fix use-after-poison by printing before deleting (#175807 ) A buildbot was failing with a use-after-poison (https://lab.llvm.org/buildbot/#/builders/24/builds/16530) after https://github.com/llvm/llvm-project/pull/175050: ``` ==llc==1532559==ERROR: AddressSanitizer: use-after-poison on address 0xe26e74e12368 at pc 0xb36d41bd74dc bp 0xffffed72a450 sp 0xffffed72a448 READ of size 8 at 0xe26e74e12368 thread T0 #0 0xb36d41bd74d8 in llvm::MachineInstr::print(llvm::raw_ostream&, bool, bool, bool, bool, llvm::TargetInstrInfo const*) const /home/b/sanitizer-aarch64-linux-bootstrap-asan/build/llvm-project/llvm/lib/CodeGen/MachineInstr.cpp:1796:35 #1 0xb36d3e221b08 in operator<< /home/b/sanitizer-aarch64-linux-bootstrap-asan/build/llvm-project/llvm/include/llvm/CodeGen/MachineInstr.h:2150:6 #2 0xb36d3e221b08 in llvm::PreRARematStage::rollback(llvm::PreRARematStage::RollbackInfo const&, llvm::BitVector&) const /home/b/sanitizer-aarch64-linux-bootstrap-asan/build/llvm-project/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp:2363:57 ... ``` This is because it was printing an instruction that had already been deleted. This patch fixes this by reversing the order.	2026-01-13 10:14:32 -08:00
Lucas Ramirez	125d24af76	[AMDGPU][Scheduler] Fix compile failure due to const/sort interaction (#175755 ) On some configurations, sorting `ScoredRemat` objects that contain const members causes a compile failure due to the impossibility of swapping/moving objects. The problem was introduced in #175050. This removes const from those fields to address the issue. The design will soon change anyway to not rely on sorting objects of this type, and consts were only here for semantic clarity.	2026-01-13 14:11:47 +01:00
Lucas Ramirez	6aaa7fd7fb	[AMDGPU][Scheduler] Scoring system for rematerializations (#175050 ) This is a significant refactoring of the scheduler's rematerialization stage meant to improve rematerialization capabilities and lay strong foundations for future improvements. As before, the stage identifies scheduling regions in which RP must be reduced (so-called "target regions"), then rematerializes registers to try and achieve the desired reduction. All regions affected by rematerializations are re-scheduled, and, if the MIR is deemed worse than before, rematerializations are rolled back to leave the MIR in its pre-stage state. The core contribution is a scoring system to estimate the benefit of each rematerialization candidate. This score favors rematerializing candidates which, in order, would 1. (if the function is spilling) reduce RP in highest-frequency target regions, 2. be rematerialized to lowest-frequency target regions, and 3. reduce RP in the highest number of target regions. All rematerialization opportunities are initially scored and rematerialized in decreasing score order until RP objectives are met or pre-computed scores diverge from reality; in the latter case remaining candidates are re-scored and the process repeats. New tests in `machine-scheduler-rematerialization-scoring.mir` showcase how the scoring system dictates which rematerializations are the most beneficial and therefore performed first. A minor contribution included in this PR following previous feedback is that rollback now happens in-place i.e., without having to re-create the rematerialized MI. This leaves original slot indices and registers untouched. We achieve this by temporarily switching the opcode of rollback-able instructions to a debug opcode during re-scheduling so that they are ignored.	2026-01-13 12:24:16 +01:00
Dhruva Chakrabarti	94e4ee38aa	[AMDGPU] Fixed crash in getLastMIForRegion when the region is empty. (#168653 ) PreRARematStage builds region live-outs if GCN trackers are enabled. If rematerialization leads to empty regions, this can cause a crash because of dereference of an invalid iterator in getLastMIForRegion. The fix is to skip calling getLastMIForRegion for empty regions. This patch fixes another bug in the same code region. getLastMIForRegion calls skipDebugInstructionsBackward which may immediately return the RegionEnd if it is not the begin instruction and it is a non-debug instruction. That would imply considering an instruction that is outside the relevant region. The fix is to always pass the previous of RegionEnd to skipDebugInstructionsBackward. This bug was found while using GCN trackers on the existing LIT test machine-scheduler-sink-trivial-remats.mir. Here's the assertion failure. llvm-project/llvm/include/llvm/ADT/ilist_iterator.h:168: llvm::ilist_iterator<OptionsT, IsReverse, IsConst>::reference llvm::ilist_iterator<OptionsT, IsReverse, IsConst>::operator*() const [with OptionsT = llvm::ilist_detail::node_options<llvm::MachineInstr, true, true, void, false, void>; bool IsReverse = false; bool IsConst = false; llvm::ilist_iterator<OptionsT, IsReverse, IsConst>::reference = llvm::MachineInstr&]: Assertion `!NodePtr->isKnownSentinel()' failed.	2025-11-19 16:19:20 -08:00
Dhruva Chakrabarti	58ac95db60	[AMDGPU] Avoid changing minOccupancy if unclustered schedule was not run for any region. (#162025 ) During init of unclustered schedule stage, minOccupancy may be temporarily increased. But subsequently, if none of the regions are scheduled because they don't meet the conditions of initGCNRegion, minOccupancy remains incorrectly set. This patch avoids this incorrectness by delaying the change of minOccupancy until a region is about to be scheduled.	2025-11-12 10:11:22 -08:00
Matt Arsenault	a88fa64e25	CodeGen: Remove TRI argument from reMaterialize (#158229 )	2025-11-10 16:23:36 -08:00
Austin Kerbow	d4b1ab77c1	[AMDGPU] Examine instructions in pending queues during scheduling (#147653 ) Examine instructions in the pending queue when scheduling. This makes instructions visible to scheduling heuristics even when they aren't immediately issuable due to hardware resource constraints. The scheduler has two hardware resource modeling modes: an in-order mode where instructions must be ready to issue before scheduling, and out-of-order models where instructions are always visible to heuristics. Special handling exists for unbuffered processor resources in out-of-order models. These resources can cause pipeline stalls when used back-to-back, so they're typically avoided. However, for AMDGPU targets, managing register pressure and reducing spilling is critical enough to justify exceptions to this approach. This change enables examination of instructions that can't be immediately issued because they use an already occupied unbuffered resource. By making these instructions visible to scheduling heuristics anyway, we gain more flexibility in scheduling decisions, potentially allowing better register pressure and hardware resource management.	2025-10-16 08:46:29 -07:00
Valery Pykhtin	8823efe77d	[AMDGPU] Add register usage debug printing the point of maximum register pressure. (#161850 ) Basically this allows to analyze "why so many VGPRs used?". It prints all live registers at the point of maximum register pressure and for each register its defs/uses are dumped. Currently can be run before and after the scheduler but would be nice if it can be ran inbetween any passes (not sure this is possible with legacy pass-manager). Requires debug or built with asserts compiler. Highly recommended to run with debug info to have debug locations for instructions. Example output: ``` * Register pressure info (VGPRs) for _ZN7ck_tile6ken.... * Max pressure is 256 VGPRs at 41780e@BB.18 (LoopHdr BB.16, Depth 1): %9858:vreg_512_align2 = contract V_MFMA_F32_32X32X16_BF16_mac_vgprcd_e64 %10137:vreg_128_align2, %10141:vreg_128_align2, %9858:vreg_512_align2(tied-def 0), 0, 0, 0, implicit $mode, implicit $exec Live registers with single definition (123 VGPRs): %10126:VReg_128_Align2, LiveMask 00000000000000FF (4 VGPRs) def 41600r@BB.18 (LoopHdr BB.16, Depth 1): undef %10126.sub0_sub1:vreg_128_align2 = DS_READ_B64_TR_B16 %478:vgpr_32, 15232, 0, implicit $exec :: (load (s64) from %ir.sunkaddr1314, !noalias !60, addrspace 3) def 41608r@BB.18 (LoopHdr BB.16, Depth 1): %10126.sub2_sub3:vreg_128_align2 = DS_READ_B64_TR_B16 %478:vgpr_32, 16320, 0, implicit $exec :: (load (s64) from %ir.sunkaddr1315, !noalias !60, addrspace 3) use 41848r@BB.18 (LoopHdr BB.16, Depth 1): %9856:vreg_512_align2 = contract V_MFMA_F32_32X32X16_BF16_mac_vgprcd_e64 %10126:vreg_128_align2, %10138:vreg_128_align2, %9856:vreg_512_align2(tied-def 0), 0, 0, 0, implicit $mode, implicit $exec %10136:VReg_128_Align2, LiveMask 00000000000000FF (4 VGPRs) def 41264r@BB.18 (LoopHdr BB.16, Depth 1): undef %10136.sub0_sub1:vreg_128_align2 = DS_READ_B64_TR_B16 %478:vgpr_32, 2176, 0, implicit $exec :: (load (s64) from %ir.sunkaddr1294, !noalias !60, addrspace 3) def 41272r@BB.18 (LoopHdr BB.16, Depth 
1): %10136.sub2_sub3:vreg_128_align2 = DS_READ_B64_TR_B16 %478:vgpr_32, 3264, 0, implicit $exec :: (load (s64) from %ir.sunkaddr1295, !noalias !60, addrspace 3) use 41788r@BB.18 (LoopHdr BB.16, Depth 1): %9858:vreg_512_align2 = contract V_MFMA_F32_32X32X16_BF16_mac_vgprcd_e64 %10136:vreg_128_align2, %10140:vreg_128_align2, %9858:vreg_512_align2(tied-def 0), 0, 0, 0, implicit $mode, implicit $exec %10129:VReg_128_Align2, LiveMask 00000000000000FF (4 VGPRs) ... Live registers with multiple definitions (133 VGPRs): %9856:VReg_512_Align2, LiveMask 00000000FFFFFFFF (16 VGPRs) def 16544r@BB.8: INLINEASM &"v_pk_mul_f32 $0, $1, $2" [sideeffect] [isconvergent] [attdialect], $0:[regdef:VReg_64_Align2], def undef %9856.sub0_sub1:vreg_512_align2, $1:[reguse:VReg_64_Align2], %4069:vreg_64_align2, $2:[reguse:VReg_64_Align2], %10159:vreg_64_align2, !52 def 16592r@BB.8: INLINEASM &"v_pk_mul_f32 $0, $1, $2" [sideeffect] [isconvergent] [attdialect], $0:[regdef:VReg_64_Align2], def %9856.sub2_sub3:vreg_512_align2, $1:[reguse:VReg_64_Align2], %4069:vreg_64_align2, $2:[reguse:VReg_64_Align2], %10159:vreg_64_align2, !52 def 16608r@BB.8: INLINEASM &"v_pk_mul_f32 $0, $1, $2" [sideeffect] [isconvergent] [attdialect], $0:[regdef:VReg_64_Align2], def %9856.sub4_sub5:vreg_512_align2, $1:[reguse:VReg_64_Align2], %4069:vreg_64_align2, $2:[reguse:VReg_64_Align2], %10159:vreg_64_align2, !52 def 16656r@BB.8: INLINEASM &"v_pk_mul_f32 $0, $1, $2" [sideeffect] [isconvergent] [attdialect], $0:[regdef:VReg_64_Align2], def %9856.sub6_sub7:vreg_512_align2, $1:[reguse:VReg_64_Align2], %4069:vreg_64_align2, $2:[reguse:VReg_64_Align2], %10159:vreg_64_align2, !52 def 16672r@BB.8: INLINEASM &"v_pk_mul_f32 $0, $1, $2" [sideeffect] [isconvergent] [attdialect], $0:[regdef:VReg_64_Align2], def %9856.sub8_sub9:vreg_512_align2, $1:[reguse:VReg_64_Align2], %4069:vreg_64_align2, $2:[reguse:VReg_64_Align2], %10159:vreg_64_align2, !52 def 16720r@BB.8: INLINEASM &"v_pk_mul_f32 $0, $1, $2" [sideeffect] [isconvergent] 
[attdialect], $0:[regdef:VReg_64_Align2], def %9856.sub10_sub11:vreg_512_align2, $1:[reguse:VReg_64_Align2], %4069:vreg_64_align2, $2:[reguse:VReg_64_Align2], %10159:vreg_64_align2, !52 def 16736r@BB.8: INLINEASM &"v_pk_mul_f32 $0, $1, $2" [sideeffect] [isconvergent] [attdialect], $0:[regdef:VReg_64_Align2], def %9856.sub12_sub13:vreg_512_align2, $1:[reguse:VReg_64_Align2], %4069:vreg_64_align2, $2:[reguse:VReg_64_Align2], %10159:vreg_64_align2, !52 def 16784r@BB.8: INLINEASM &"v_pk_mul_f32 $0, $1, $2" [sideeffect] [isconvergent] [attdialect], $0:[regdef:VReg_64_Align2], def %9856.sub14_sub15:vreg_512_align2, $1:[reguse:VReg_64_Align2], %4069:vreg_64_align2, $2:[reguse:VReg_64_Align2], %10159:vreg_64_align2, !52 def use 41828r@BB.18 (LoopHdr BB.16, Depth 1): %9856:vreg_512_align2 = contract V_MFMA_F32_... ... ******** INTERVALS ****** ... ****** MACHINEINSTRS ******** # Machine code for function _ZN7ck_tile6kentr... ```	2025-10-13 16:17:11 +02:00
Philip Reames	58c959b4f7	[AMDGPU] Use common allUsesAvailableAt implementation [nfc] (#161418 ) Replace the target specific copy with a call to the generic routine. I don't spot any differences by eye, and there's nothing in the original review discussion (#124327) which makes it clear why this was duplicated.	2025-10-01 07:22:13 -07:00
Philip Reames	ea721e2fa1	[TII] Split isTrivialReMaterializable into two versions [nfc] (#160377 ) This change builds on https://github.com/llvm/llvm-project/pull/160319 which tries to clarify which callers (not backends) assume that the result is actually trivial. This change itself should be NFC. Essentially, I'm just renaming the existing isTrivialRematerializable to the non-trivial version and then adding a new trivial version (with the same name as the prior function) and simplifying a few callers which want that semantic. This change does not enable non-trivial remat any more broadly than was already done for our targets which were lying through the old APIs; that will come separately. The goal here is simply to make the code easier to follow in terms of what assumptions are being made where. --------- Co-authored-by: Luke Lau <luke_lau@icloud.com>	2025-09-24 18:52:17 -07:00
Lucas Ramirez	83c308f014	[AMDGPU][Scheduler] Consistent occupancy calculation during rematerialization (#149224 ) The `RPTarget`'s way of determining whether VGPRs are beneficial to save and whether the target has been reached w.r.t. VGPR usage currently assumes, if `CombinedVGPRSavings` is true, that free slots in one VGPR RC can always be used for the other. Implicitly, this makes the rematerialization stage (only current user of `RPTarget`) follow a different occupancy calculation than the "regular one" that the scheduler uses, one that assumes that ArchVGPR/AGPR usage can be balanced perfectly and at no cost, which is untrue in general. This ultimately yields suboptimal rematerialization decisions that require cross-VGPR-RC copies unnecessarily. This fixes that, making the `RPTarget`'s internal model of occupancy consistent with the regular one. The `CombinedVGPRSavings` flag is removed, and a form of cross-VGPR-RC saving implemented only for unified RFs, which is where it makes the most sense. Only when the amount of free VGPRs in a given VGPR RC (ArchVGPR or AGPR) is lower than the excess VGPR usage in the other VGPR RC does the `RPTarget` consider that a pressure reduction in the former will be beneficial to the latter.	2025-08-08 14:26:04 +02:00
Lucas Ramirez	96f3872dc9	[AMDGPU][Scheduler] Delete RegionsWithMinOcc bitvector from scheduler (NFC) (#142361 ) The `GCNScheduleDAGMILive`'s `RegionsWithMinOcc` bitvector is only used by the `UnclusteredHighRPStage`. Its presence in the scheduler's state forces us to maintain its value throughout scheduling even though it is of no use to the iterative scheduling process itself. At any point during scheduling it is possible to cheaply compute the occupancy induced by a particular register pressure. Furthermore, the field doesn't appear to be updated correctly throughout scheduling i.e., bits corresponding to regions at minimum occupancy are not always set in the vector. This removes the bitvector from `GCNScheduleDAGMILive`. `UnclusteredHighRPStage::initGCNRegion` now directly computes the occupancy of possibly reschedulable regions instead of querying the vector. Since it is the most expensive check, it is done last in the list.	2025-08-01 13:20:05 +02:00
Ruiling, Song	451912a24a	[MachineScheduler] Make cluster check more efficient (#150884 )	2025-08-01 16:00:42 +08:00
Jeffrey Byrnes	d3a9cde7b8	[AMDGPU] Don't skip regions in getRegionLiveInMap (#151423 ) Currently, this skips any region that is not the first region in a block. This is because the only user of it only cares about the LiveIns per-block. However, as named, this is supposed to compute the per-region LiveIns. This doesn't have any effect on scheduling / CodeGen currently (aside from computing LiveIns for all regions) since only the per-block LiveIns are needed. However, I'm working on something that will use this. Intended User: https://github.com/llvm/llvm-project/pull/149367/ `c62a2f127c/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp (L1351)`	2025-07-31 15:05:18 -07:00
Lucas Ramirez	c77a2a2403	[AMDGPU][Scheduler] Use `AMDGPU::NoSubRegister` instead of 0 (NFC) (#150610 )	2025-07-25 14:25:13 +02:00
Lucas Ramirez	e38f98f535	[AMDGPU][Scheduler] Fix usage of `TII.reMaterialize` (NFC) (#150259 ) Any non-zero `SubIdx` passed to the method is composed with the rematerialized instruction's first operand's subregister to determine the new register's subregister. In our case we want the new register to have the same subregister as the old one, so we should pass 0.	2025-07-25 12:51:21 +02:00
Jeffrey Byrnes	283a62fa5b	[AMDGPU] NFC: Decouple getRealRegPressure from current region (#149219 ) We're already accepting a RegionIdx for the LiveIns, also use this for the instruction iterators. Enables querying RP for other regions -- useful for function wide transformations (e.g. rematerialization, rewriting, etc).	2025-07-16 19:59:38 -07:00
Lucas Ramirez	6307b496f8	[AMDGPU] Add `GCNRPTarget` to track register pressure against a target (#145765 ) This adds the `GCNRPTarget` class which models a register pressure target (i.e., maximum number of SGPRs/VGPRs) that one can track register savings against. The only current use of this class is in the scheduler's rematerialization stage. It replaces the more ad-hoc (and now deleted) `ExcessRP` class which used to serve the same purpose. This is only NFC~ish because `GCNRPTarget` tracks VGPR usage more accurately than `ExcessRP` used to. To estimate required combined VGPR savings we now additionally take into account the number of available VGPRs in both banks (ArchVGPR and AGPR) at the time where the RP target is created, whereas we used to only consider explicit savings made from the starting RP. This makes VGPR savings estimations more accurate in cases where we allow for savings in one VGPR bank to help towards reducing pressure in another VGPR bank (see `GCNRPTarget::CombineVGPRSavings`). This is the cause for unit test changes.	2025-06-26 13:11:20 +02:00
Lucas Ramirez	c74ed8a0d3	[AMDGPU][Scheduler] Support for rematerializing SGPRs and AGPRs (#140036 ) This adds the ability to rematerialize SGPRs and AGPRs to the scheduler's `PreRARematStage`, which can currently only rematerialize ArchVGPRs. This also fixes a small potential issue in the stage where, in case of spilling, the target occupancy could be set to a lower than expected value when the function had either one of the "amdgpu-num-sgpr" or "amdgpu-num-vgpr" attributes set.	2025-06-24 19:30:27 +02:00
Diana Picus	a201f8872a	[AMDGPU] Replace dynamic VGPR feature with attribute (#133444 ) Use a function attribute (amdgpu-dynamic-vgpr) instead of a subtarget feature, as requested in #130030.	2025-06-24 11:09:36 +02:00
Mikael Holmen	77062244ed	Fix two instances of -Wparentheses warnings [NFC] Add parentheses around the assert conditions. Without this gcc warned like ../lib/Target/AMDGPU/GCNSchedStrategy.cpp:2250: warning: suggest parentheses around '&&' within '\|\|' [-Wparentheses] 2250 \| NewMI != RegionBounds.second && "cannot remove at region end"); and ../../clang/lib/Sema/SemaOverload.cpp:11326:39: warning: suggest parentheses around '&&' within '\|\|' [-Wparentheses] 11326 \| DeferredCandidatesCount == 0 && \| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~ 11327 \| "Unexpected deferred template candidates"); \| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~	2025-06-12 09:43:30 +02:00
Ruiling, Song	0487db1f13	MachineScheduler: Improve instruction clustering (#137784 ) The existing way of managing clustered nodes was done through adding weak edges between the neighbouring cluster nodes, which is a sort of ordered queue. And this will be later recorded as `NextClusterPred` or `NextClusterSucc` in `ScheduleDAGMI`. But actually the instruction may be picked not in the exact order of the queue. For example, we have a queue of cluster nodes A B C. But during scheduling, node B might be picked first, then it will be very likely that we only cluster B and C for Top-Down scheduling (leaving A alone). Another issue is: ``` if (!ReorderWhileClustering && SUa->NodeNum > SUb->NodeNum) std::swap(SUa, SUb); if (!DAG->addEdge(SUb, SDep(SUa, SDep::Cluster))) ``` may break the cluster queue. For example, we want to cluster nodes (order as in `MemOpRecords`): 1 3 2. 1(SUa) will be pred of 3(SUb) normally. But when it comes to (3, 2), as 3(SUa) > 2(SUb), we would reorder the two nodes, which makes 2 be pred of 3. This makes both 1 and 2 become preds of 3, but there is no edge between 1 and 2. Thus we get a broken cluster chain. To fix both issues, we introduce an unordered set in the change. This could help improve clustering in some hard cases. One key reason the change causes so many test check changes is: As the cluster candidates are not ordered now, the candidates might be picked in different order from before. The most affected targets are: AMDGPU, AArch64, RISCV. For RISCV, it seems to me most are just minor instruction reorderings; I don't see any obvious regression. For AArch64, some combining of ldr into ldp was affected, with two cases regressed and two improved. The deeper reason is that the machine scheduler cannot cluster them well either before or after the change, and the later load combine algorithm is also not smart enough. For AMDGPU, some cases have more v_dual instructions used while some are regressed. It seems less critical. It seems that test `v_vselect_v32bf16` gets more buffer_load instructions claused.	2025-06-05 15:28:04 +08:00
Lucas Ramirez	3e18216474	[AMDGPU][Scheduler] Delete RescheduleRegions bitvector from scheduler (NFC) (#141595 ) The `GCNScheduleDAGMILive`'s `RescheduleRegions` bitvector is only used by the rematerialization stage (`PreRARematStage`). Its presence in the scheduler's state forces us to maintain its value throughout scheduling even though it is of no use to the iterative scheduling process itself, which instead relies on each stage's `initGCNRegion` hook to determine whether the current region should be rescheduled. This moves the bitvector to the `PreRARematStage`, which uses it to store the set of regions that must be rescheduled between stage initialization and region initialization. This NFC also swaps a call to `GCNRegPressure::getArchVGPRNum(false)` for a call to `GCNRegPressure::getArchVGPRNum()`---which is equivalent but simpler in the context---and makes `GCNSchedStage::finalizeGCNRegion` use its own API to advance to the next region.	2025-05-27 17:10:28 +02:00
Lucas Ramirez	6456ee056f	Reapply "[AMDGPU][Scheduler] Refactor ArchVGPR rematerialization during scheduling (#125885 )" (#139548 ) This reapplies 067caaa and 382a085 (reverting b35f6e2) with fixes to issues detected by the address sanitizer (MIs have to be removed from live intervals before being removed from their parent MBB). Original commit description below. AMDGPU scheduler's `PreRARematStage` attempts to increase function occupancy w.r.t. ArchVGPR usage by rematerializing trivial ArchVGPR-defining instruction next to their single use. It first collects all eligible trivially rematerializable instructions in the function, then sinks them one-by-one while recomputing occupancy in all affected regions each time to determine if and when it has managed to increase overall occupancy. If it does, changes are committed to the scheduler's state; otherwise modifications to the IR are reverted and the scheduling stage gives up. In both cases, this scheduling stage currently involves repeated queries for up-to-date occupancy estimates and some state copying to enable reversal of sinking decisions when occupancy is revealed not to increase. The current implementation also does not accurately track register pressure changes in all regions affected by sinking decisions. This commit refactors this scheduling stage, improving RP tracking and splitting the stage into two distinct steps to avoid repeated occupancy queries and IR/state rollbacks. - Analysis and collection (`canIncreaseOccupancyOrReduceSpill`). The number of ArchVGPRs to save to reduce spilling or increase function occupancy by 1 (when there is no spilling) is computed. Then, instructions eligible for rematerialization are collected, stopping as soon as enough have been identified to be able to achieve our goal (according to slightly optimistic heuristics). If there aren't enough of such instructions, the scheduling stage stops here. - Rematerialization (`rematerialize`). 
Instructions collected in the first step are rematerialized one-by-one. Now we are able to directly update the scheduler's state since we have already done the occupancy analysis and know we won't have to rollback any state. Register pressures for impacted regions are recomputed only once, as opposed to at every sinking decision. In the case where the stage attempted to increase occupancy, and if both rematerializations alone and rescheduling after were unable to improve occupancy, then all rematerializations are rolled back.	2025-05-13 11:11:00 +02:00
Vitaly Buka	b35f6e26a5	Revert "[AMDGPU][Scheduler] Refactor ArchVGPR rematerialization during scheduling (#125885 )" (#139341 ) And related "[AMDGPU] Regenerate mfma-loop.ll test" Introduce memory error detected by Asan #125885. This reverts commit 382a085a95b0abeac77b150b7b644b372bd08e78. This reverts commit 067caaafb58a156d0d77229422607782a639f5b5.	2025-05-09 17:51:46 -07:00
Lucas Ramirez	067caaafb5	[AMDGPU][Scheduler] Refactor ArchVGPR rematerialization during scheduling (#125885 ) AMDGPU scheduler's `PreRARematStage` attempts to increase function occupancy w.r.t. ArchVGPR usage by rematerializing trivial ArchVGPR-defining instruction next to their single use. It first collects all eligible trivially rematerializable instructions in the function, then sinks them one-by-one while recomputing occupancy in all affected regions each time to determine if and when it has managed to increase overall occupancy. If it does, changes are committed to the scheduler's state; otherwise modifications to the IR are reverted and the scheduling stage gives up. In both cases, this scheduling stage currently involves repeated queries for up-to-date occupancy estimates and some state copying to enable reversal of sinking decisions when occupancy is revealed not to increase. The current implementation also does not accurately track register pressure changes in all regions affected by sinking decisions. This commit refactors this scheduling stage, improving RP tracking and splitting the stage into two distinct steps to avoid repeated occupancy queries and IR/state rollbacks. - Analysis and collection (`canIncreaseOccupancyOrReduceSpill`). The number of ArchVGPRs to save to reduce spilling or increase function occupancy by 1 (when there is no spilling) is computed. Then, instructions eligible for rematerialization are collected, stopping as soon as enough have been identified to be able to achieve our goal (according to slightly optimistic heuristics). If there aren't enough of such instructions, the scheduling stage stops here. - Rematerialization (`rematerialize`). Instructions collected in the first step are rematerialized one-by-one. Now we are able to directly update the scheduler's state since we have already done the occupancy analysis and know we won't have to rollback any state. 
Register pressures for impacted regions are recomputed only once, as opposed to at every sinking decision. In the case where the stage attempted to increase occupancy, and if both rematerializations alone and rescheduling after were unable to improve occupancy, then all rematerializations are rolled back.	2025-05-08 12:51:06 +02:00
Jeffrey Byrnes	6e7fe85247	[AMDGPU] Teach iterative schedulers about IGLP (#134953 ) This adds IGLP mutation to the iterative schedulers (`gcn-iterative-max-occupancy-experimental`, `gcn-iterative-minreg`, and `gcn-iterative-ilp`). The `gcn-iterative-minreg` and `gcn-iterative-ilp` schedulers never actually applied the mutations added, so this also has the effect of teaching them about mutations in general. The `gcn-iterative-max-occupancy-experimental` scheduler has calls to `ScheduleDAGMILive::schedule()`, so, before this, mutations were applied at this point. Now this is done during calls to `BuildDAG`, with IGLP superseding other mutations (similar to the other schedulers). We may end up scheduling regions multiple times, with mutations being applied each time, so we need to track for `AMDGPU::SchedulingPhase::PreRAReentry`	2025-04-11 15:34:49 -07:00
Kazu Hirata	71935281e0	[Target] Use *Set::insert_range (NFC) (#132140 ) DenseSet, SmallPtrSet, SmallSet, SetVector, and StringSet recently gained C++23-style insert_range. This patch replaces: Dest.insert(Src.begin(), Src.end()); with: Dest.insert_range(Src); This patch does not touch custom begin like succ_begin for now.	2025-03-20 09:09:30 -07:00
Emma Pilkington	3eddb992d0	[AMDGPU] Fix a crash by skipping DBG instrs at start of sched region (#131167 ) Fixes SWDEV-514946	2025-03-19 09:31:54 -04:00
Diana Picus	1f84495255	[AMDGPU] Update target helpers & GCNSchedStrategy for dynamic VGPRs (#130047 ) In dynamic VGPR mode, we can allocate up to 8 blocks of either 16 or 32 VGPRs (based on a chip-wide setting which we can model with a Subtarget feature). Update some of the subtarget helpers to reflect this. In particular: - getVGPRAllocGranule is set to the block size - getAddresableNumVGPR will limit itself to 8 * size of a block We also try to be more careful about how many VGPR blocks we allocate. Therefore, when deciding if we should revert scheduling after a given stage, we check that we haven't increased the number of VGPR blocks that need to be allocated. --------- Co-authored-by: Jannik Silvanus <jannik.silvanus@amd.com>	2025-03-19 10:29:38 +01:00
Julian Brown	84909d7977	[AMDGCN] Allow unscheduling of bundled insns This is a patch arising from AMD's fuzzing project. In the test case, the scheduling algorithm decides to undo an attempted schedule, but is unprepared to handle bundled instructions at that point -- and those can arise via the expansion of intrinsics earlier in compilation. The fix is to use the splice method instead of remove/insert, since that can handle bundles properly.	2025-03-18 11:56:51 -05:00
Lucas Ramirez	03677f63a7	[MachineScheduler] Optional scheduling of single-MI regions (#129704 ) Following 15e295d the machine scheduler no longer filters-out single-MI regions when emitting regions to schedule. While this has no functional impact at the moment, it generally has a negative compile-time impact (see #128739). Since all targets but AMDGPU do not care for this behavior, this introduces an off-by-default flag to `ScheduleDAGInstrs` to control whether such regions are going to be scheduled, effectively reverting 15e295d for all targets but AMDGPU (currently the only target enabling this flag).	2025-03-04 17:46:44 +01:00
Jeffrey Byrnes	16f7e961c6	[AMDGPU] Allow rematerialization of instructions with virtual register uses (#124327 ) Remove the restriction that scheduling rematerialization candidates cannot have virtual reg uses. Currently, this only allows for virtual reg uses which are already live at the rematerialization point, so bring in allUsesAvailableAt to check for this condition. Because of this condition, the uses of the remats will already be live in to the region, so the remat won't increase live-in pressure. Add an expensive check to check this condition.	2025-02-06 10:16:28 -08:00
Jeffrey Byrnes	e77d428e46	[AMDGPU] Do not remat instructions with PhysReg uses (#124366 ) This blocks rematerialization during scheduling if the instruction has a non accepted PhysReg use. Currently, there aren't any checks like this in place, and we may create invalid code: https://godbolt.org/z/xjPjdcorf	2025-01-27 10:50:06 -08:00
Lucas Ramirez	6206f5444f	[AMDGPU] Occupancy w.r.t. workgroup size range is also a range (#123748 ) Occupancy (i.e., the number of waves per EU) depends, in addition to register usage, on per-workgroup LDS usage as well as on the range of possible workgroup sizes. Mirroring the latter, occupancy should therefore be expressed as a range since different group sizes generally yield different achievable occupancies. `getOccupancyWithLocalMemSize` currently returns a scalar occupancy based on the maximum workgroup size and LDS usage. With respect to the workgroup size range, this scalar can be the minimum, the maximum, or neither of the two of the range of achievable occupancies. This commit fixes the function by making it compute and return the range of achievable occupancies w.r.t. workgroup size and LDS usage; it also renames it to `getOccupancyWithWorkGroupSizes` since it is the range of workgroup sizes that produces the range of achievable occupancies. Computing the achievable occupancy range is surprisingly involved. Minimum/maximum workgroup sizes do not necessarily yield maximum/minimum occupancies i.e., sometimes workgroup sizes inside the range yield the occupancy bounds. The implementation finds these sizes in constant time; heavy documentation explains the rationale behind the sometimes relatively obscure calculations. As a justifying example, consider a target with 10 waves / EU, 4 EUs/CU, 64-wide waves. Also consider a function with no LDS usage and a flat workgroup size range of [513,1024]. - A group of 513 items requires 9 waves per group. Only 4 groups made up of 9 waves each can fit fully on a CU at any given time, for a total of 36 waves on the CU, or 9 per EU. However, filling as much as possible the remaining 40-36=4 wave slots without decreasing the number of groups reveals that a larger group of 640 items yields 40 waves on the CU, or 10 per EU. - Similarly, a group of 1024 items requires 16 waves per group. 
Only 2 groups made up of 16 waves each can fit fully on a CU at any given time, for a total of 32 waves on the CU, or 8 per EU. However, removing as many waves as possible from the groups without being able to fit another equal-sized group on the CU reveals that a smaller group of 896 items yields 28 waves on the CU, or 7 per EU. Therefore the achievable occupancy range for this function is not [8,9] as the group size bounds directly yield, but [7,10]. Naturally this change causes a lot of test churn as instruction scheduling is driven by achievable occupancy estimates. In most unit tests the flat workgroup size range is the default [1,1024] which, ignoring potential LDS limitations, would previously produce a scalar occupancy of 8 (derived from 1024) on a lot of targets, whereas we now consider the maximum occupancy to be 10 in such cases. Most tests are updated automatically and checked manually for sanity. I also manually changed some non-automatically generated assertions when necessary. Fixes #118220.	2025-01-23 16:07:57 +01:00
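The arithmetic in the example above (10 waves/EU, 4 EUs/CU, 64-wide waves, no LDS usage, flat workgroup size range [513,1024]) can be checked with a brute-force sketch. This is not the constant-time computation the commit describes: it simply tries every workgroup size in the range. The `Target` struct and both function names are hypothetical and are not LLVM APIs. The sketch ignores LDS and any hardware limit on workgroups per CU, and it assumes that waves are rounded up when they do not divide evenly across EUs.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>

// Hypothetical target description; field names are illustrative only.
struct Target {
  unsigned MaxWavesPerEU; // e.g. 10
  unsigned EUsPerCU;      // e.g. 4
  unsigned WaveSize;      // e.g. 64
};

// Waves per EU achieved when every workgroup has exactly GroupSize items.
// Assumption: only whole workgroups are placed on a CU, and LDS is unused.
unsigned occupancyForGroupSize(unsigned GroupSize, const Target &T) {
  assert(GroupSize > 0 && "workgroup must contain at least one item");
  // A partially filled wave still occupies a full wave slot.
  unsigned WavesPerGroup = (GroupSize + T.WaveSize - 1) / T.WaveSize;
  unsigned WaveSlotsPerCU = T.MaxWavesPerEU * T.EUsPerCU;
  // Only complete groups fit; leftover wave slots stay empty.
  unsigned GroupsPerCU = WaveSlotsPerCU / WavesPerGroup;
  unsigned WavesPerCU = GroupsPerCU * WavesPerGroup;
  // Assumption of this sketch: round up when spreading waves over EUs.
  return (WavesPerCU + T.EUsPerCU - 1) / T.EUsPerCU;
}

// Brute-force {min, max} occupancy over all sizes in [MinSize, MaxSize].
std::pair<unsigned, unsigned> occupancyRange(unsigned MinSize,
                                             unsigned MaxSize,
                                             const Target &T) {
  assert(MinSize <= MaxSize && "invalid workgroup size range");
  unsigned Lo = occupancyForGroupSize(MinSize, T);
  unsigned Hi = Lo;
  for (unsigned Size = MinSize + 1; Size <= MaxSize; ++Size) {
    unsigned Occ = occupancyForGroupSize(Size, T);
    Lo = std::min(Lo, Occ);
    Hi = std::max(Hi, Occ);
  }
  return {Lo, Hi};
}
```

With `Target T{10, 4, 64}`, the range bounds 513 and 1024 give occupancies 9 and 8, while the interior sizes 640 and 896 give 10 and 7, so `occupancyRange(513, 1024, T)` returns the pair (7, 10), matching the range stated in the commit message.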
Austin Kerbow	657fb4433e	[AMDGPU] Add target hook to isGlobalMemoryObject (#112781 ) We want special handling for IGLP instructions in the scheduler but they should still be treated like they have side effects by other passes. Add a target hook to the ScheduleDAGInstrs DAG builder so that we have more control over this.	2025-01-11 09:57:57 -08:00
Ruiling, Song	b33c807b39	[AMDGPU] Add MaxMemoryClauseSchedStrategy (#114957 ) Also expose an option to choose custom scheduler strategy: amdgpu-sched-strategy={max-ilp|max-memory-clause} This can be set through either function attribute or command line option. The major behaviors of the max memory clause schedule strategy include: 1. Try to cluster memory instructions more aggressively. 2. Try to schedule long latency load earlier than short latency instruction. I tested locally against about 470 real shaders and got the perf changes (only count perf changes over +/-10%): About 15 shaders improved 10%~40%. Only 3 shaders drop ~10%. (This was tested together with another change which increases the maximum clustered dword from 8 to 32). I will make another change to make that threshold configurable.	2024-12-09 10:07:27 +08:00
Jeffrey Byrnes	3a08551a03	[AMDGPU] Fix expensive check Change-Id: I0b26d5db6d3da8936ab25ee2b1e9002840b9853e	2024-10-09 10:24:06 -07:00
Jeffrey Byrnes	17bc959961	[AMDGPU] Optionally Use GCNRPTrackers during scheduling (#93090 ) This adds the ability to use the GCNRPTrackers during scheduling. These trackers have several advantages over the generic trackers: 1. global live-thru trackers, 2. subregister based RP deltas, and 3. flexible vreg -> PressureSet mappings. This feature is off-by-default to ease with the roll-out process. In particular, when using the optional trackers, the scheduler will still maintain the generic trackers leading to unnecessary compile time.	2024-10-09 09:54:11 -07:00

147 Commits