llvm-project

Author	SHA1	Message	Date
vporpo	9898082bd3	[AMDGPU][SIInsertWaitcnt][NFC] Access Waitcnt elements using InstCounterType (#178345 ) This patch introduces `get(T)` and `set(T, Val)` functions for Waitcnt and removes getCounterRef() and getWait(). For this to work we also need to move InstrCounterType to AMDGPUBaseInfo.h. Please note that the member variables are still public to keep this patch small. They will be replaced in the follow-up patch.	2026-02-09 17:54:08 -08:00
vporpo	88dff28c9f	[AMDGPU][SIInsertWaitcnts][NFC] Make a few WaitcntBracket member functions private (#180018 ) The user of the WaitcntBrackets class shouldn't need to know about how the scoreboard has been implemented internally. So I think it is best to provide a higher level API that hides things like scoreUB, scoreLB and score ranges. This patch makes getScoreUB(), getScoreLB() and getScoreRange() private and introduces new functions that don't expose the internal implementation: - getOutstanding(T) - hasPendingVMEM(VMEMID, T) - empty(T) I also noticed that getSGPRScore() and getVMemScore() are not used externally so these are now private.	2026-02-07 13:55:01 -08:00
vporpo	7022e5a514	[AMDGPU][SIINsertWaitcnts][NFC] Make TTI and ST references (#180017 ) This patch converts WaitcntGenerator::TTI and ST pointers to references. This helps remove some null checking assertions.	2026-02-05 13:18:06 -08:00
vporpo	5326166866	[AMDGPU][SIInsertWaitcnt][NFC] Don't expose internal data structure to user (#179736 ) With this patch we are no longer exposing the internal data structure that holds the WaitEvents to the user through the `getWaitEventMask()` API. Instead we only allow the user to query a specific type and get the corresponding `WaitEventSet` with `getWaitEvents(T)`. Note: This patch also renames `getWaitEventMask()` to `getWaitEvents()` because we are no longer returning a mask but instead a `WaitEventSet` object.	2026-02-05 11:07:38 -08:00
Vigneshwar Jayakumar	2dcd75eb44	[AMDGPU] Fix missing waitcnt after buffer_wbl2 (#178316 ) On GFX9, BUFFER_WBL2 is used to write back dirty cache lines and requires an s_waitcnt vmcnt(0) afterwards to ensure completion. This patch fixes the problem by incrementing vmcnt for the buffer_wbl2 instruction. --------- Co-authored-by: Jay Foad <jay.foad@gmail.com>	2026-02-05 10:13:51 -06:00
vporpo	d9da5d7626	[AMDGPU][SIInsertWaitcnt][NFC] Move eventCounter() function (#178949 ) The eventCounter() function searches through the array of events. This array is owned by the WaitcntGenerator class. This patch moves the function into the WaitcntGenerator class which helps hide the event array from the user. It also renames it to getCounterFromEvent(). This should be NFC.	2026-02-04 11:11:18 -08:00
vporpo	3c67dcb2f2	[AMDGPU][SIInsertWaitcnts][NFC] Replace Wait.combined() with simple assignment (#179142 ) Wait is initialized with all ~0s and by the time it reaches the updated line it still holds the same value. So Wait.combined(AllZeroWait) is effectively combining all ~0s with AllZeroWait and given that combined() returns the min() of the two it should always return AllZeroWait. So this patch replaces the assignment with `= AllZeroWait` to make it easier to read.	2026-02-02 13:50:26 -08:00
vporpo	8b3b0b869e	[AMDGPU][SIInsertWaitcnt][NFC] Replace if/else with switch (#178956 ) This is an NFC patch that replaces the consecutive ifs and else ifs in generateWaitcntInstBefore() with a switch. This makes it a bit easier to read.	2026-02-01 13:13:01 -08:00
vporpo	ef720acad0	[AMDGPU][SIInsertWaitcnts][NFC] Use loop to set Wait entries (#178764 ) Please note that the original code was skipping STORE_CNT but this one is not.	2026-01-30 16:18:03 -08:00
vporpo	c5862719c5	[AMDGPU][SIInsertWaitcnts][NFC] Introduce WaitEventSet container for events (#178511 ) Before this patch WaitEventType events used to be collected in unsigned integers that were used as small bit vectors. This patch introduces a WaitEventSet container class to replace the integer bit vectors with a class that hides the implementation of common operations like insertion, removal, union, intersection etc. from the user. The WaitEventSet API matches that of a set and not a vector because we don't care about the order of its contents. Internally though it is still a bit vector that uses an unsigned integer as its storage, just like the original implementation. This patch should not change the functionality.	2026-01-30 08:55:44 -08:00
Jay Foad	7deea9db70	[AMDGPU] Move WaitcntBrackets::simplifyXcnt near other simplify functions. NFC. (#178673 )	2026-01-29 15:49:11 +00:00
vporpo	2615005982	[AMDGPU][SIInsertWaitcnts] Cleanup: Remove WaitEventMaskForInst member variable (#178030 ) The event mask is constant and target dependent, so it should be accessed through the WCG object.	2026-01-28 12:22:48 -08:00
Jay Foad	775f02521d	[AMDGPU] Fix buggy insertion of DEALLOC_VGPRS message (#178401 ) We inserted the DEALLOC_VGPRS message if there were no pending scratch stores the first time an S_ENDPGM instruction was visited. But because this pass uses a worklist to revisit blocks until it reaches a fixed point, it is possible that pending scratch stores are only discovered on the second or later visit to a block. Fix this by storing a flag for each S_ENDPGM instruction which can be updated by later visits.	2026-01-28 13:29:26 +00:00
vporpo	89e38d65d8	[AMDGPU][SIInsertWaitcnts][NFC] Move static array definition (#178014 ) Move the array out of the member function.	2026-01-26 11:21:20 -08:00
vporpo	8805e27365	[AMDGPU] Cleanup: Use unique_ptr for WCG and remove unnecessary class members (#177689 )	2026-01-26 10:10:35 -08:00
Christudasan Devadasan	cf25346dcb	[AMDGPU][GFX1250] Optimize s_wait_xcnt for back-to-back atomic RMWs (#177620 ) This patch optimizes the insertion of s_wait_xcnt instruction for sequences of atomic read-modify-write (RMW) operations in the SIInsertWaitcnts pass. The Memory Legalizer conservatively inserts a soft xcnt instruction before each atomic RMW operation as part of PR 168852, which is correct given the nature of atomic operations. However, for back-to-back atomic RMWs, only the first s_wait_xcnt is necessary for better runtime performance. This patch tracks atomic RMW blocks within each basic block and removes redundant soft xcnt instructions, keeping only the first wait in each sequence. An atomic RMW block continues through subsequent atomic RMWs and non-memory instructions (e.g., ALU operations) but is broken by CU-scoped memory operations, atomic stores, or basic block boundaries.	2026-01-24 08:21:41 +05:30
Shilei Tian	02d34a76f7	[NFCI][AMDGPU] Remove more redundant code from `GCNSubtarget.h` (#177297 ) We are getting pretty close to using `GET_SUBTARGETINFO_MACRO` in the header with this cleanup.	2026-01-22 09:07:15 -05:00
Jay Foad	2b522e0b09	Reapply "[AMDGPU] Fix excessive stack usage in SIInsertWaitcnts::run (#134835 )" (#174215 ) (#177338 ) This reverts commit 0dd03598dca91c93c74b94a714c38a4ffad0ed1c. Apparently the stack usage on Windows was still large enough to cause problems for some DirectX games.	2026-01-22 11:25:08 +00:00
Shilei Tian	1843a7fe9f	[NFCI][AMDGPU] Use X-macro to reduce boilerplate in `GCNSubtarget.h` (#176844 ) `GCNSubtarget.h` contained a large amount of repetitive code following the pattern `bool HasXXX = false;` for member declarations and `bool hasXXX() const { return HasXXX; }` for getters. This boilerplate made the file unnecessarily long and harder to maintain. This patch introduces an X-macro pattern `GCN_SUBTARGET_HAS_FEATURE` that consolidates 135 simple subtarget features into a single list. The macro is expanded twice: once in the protected section to generate member variable declarations, and once in the public section to generate the corresponding getter methods. This reduces the file by approximately 600 lines while preserving the exact same API and functionality. Features with complex getter logic or inconsistent naming conventions are left as manual implementations for future improvement. Ideally, these could be generated by TableGen using `GET_SUBTARGETINFO_MACRO`, similar to the X86 backend. However, `AMDGPU.td` has several issues that prevent direct adoption: duplicate field names (e.g., `DumpCode` is set by both `FeatureDumpCode` and `FeatureDumpCodeLower`), and inconsistent naming conventions where many features don't have the `Has` prefix (e.g., `FlatAddressSpace`, `GFX10Insts`, `FP64`). Fixing these issues would require renaming fields in `AMDGPU.td` and updating all references, which is left for future work.	2026-01-21 15:29:09 -05:00
Pankaj Dwivedi	6b86e24ec1	[AMDGPU][SIInsertWaitcnt] Address review feedback for waitcnt profiling expansion (#175922 )	2026-01-17 14:57:45 +05:30
Ramkumar Ramachandra	d69335bac9	[LLVM] Clean up code using [not_]equal_to (NFC) (#175824 ) Use llvm::[not_]equal_to landed in d2a521750 ([ADT] Introduce bind_{front,back}, [not_]equal_to, #175056) across LLVM for cleaner code.	2026-01-13 21:19:39 +00:00
hidekisaito	a062249932	[AMDGPU] Add DS loop waitcnt optimization for GFX12+ (#172728 ) Add support for flushing DS_CNT in loop preheaders when the loop uses values that were DS-loaded outside the loop. This is similar to the existing VMEM loop optimization. Assisted-by: Cursor / claude-4.5-opus-high	2026-01-12 13:34:02 -08:00
Jay Foad	27074aa31a	[AMDGPU] Fix crash in SIInsertWaitcnts debug output (#175518 ) In some cases we were accessing `OldWaitcntInstr.getParent()->end()` after `OldWaitcntInstr` had already been erased from its parent.	2026-01-12 14:48:53 +00:00
Jay Foad	3b2d14ba1c	[AMDGPU] Inline two helpers in SIInsertWaitcnts. NFC. (#174557 )	2026-01-12 14:26:31 +00:00
Pankaj Dwivedi	3dfb782333	[AMDGPU][SIInsertWaitcnt] Implement Waitcnt Expansion for Profiling (#169345 ) Reference issue: https://github.com/ROCm/llvm-project/issues/67 This patch adds support for expanding s_waitcnt instructions into sequences with decreasing counter values, enabling PC-sampling profilers to identify which specific memory operation is causing a stall. This is controlled via: Clang flag: -mamdgpu-expand-waitcnt-profiling / -mno-amdgpu-expand-waitcnt-profiling Function attribute: "amdgpu-expand-waitcnt-profiling" When enabled, instead of emitting a single waitcnt, the pass generates a sequence that waits for each outstanding operation individually. For example, if there are 5 outstanding memory operations and the target is to wait until 2 remain: Original: s_waitcnt vmcnt(2) Expanded: s_waitcnt vmcnt(4) s_waitcnt vmcnt(3) s_waitcnt vmcnt(2) The expansion starts from (Outstanding - 1) down to the target value, since waitcnt(Outstanding) would be a no-op (the counter is already at that value). - Uses ScoreBrackets to determine the actual number of outstanding operations - Only expands when operations complete in-order - Skips expansion for mixed event types (e.g., LDS+SMEM on same counter) - Skips expansion for scalar memory (always out-of-order) Related previous work for reference - PR: llvm/llvm-project#79236 (related `-amdgpu-waitcnt-forcezero`) --------- Co-authored-by: Pierre van Houtryve <pierre.vanhoutryve@amd.com>	2026-01-12 17:35:06 +05:30
Jay Foad	475f022cb7	[AMDGPU] Add support for GFX12 expert scheduling mode 2 (#170319 )	2026-01-09 15:49:10 +00:00
Jay Foad	0494abbea0	[AMDGPU] Make WaitcntBrackets::simplifyWaitcnt const again (#173390 ) The original design was: - WaitcntBrackets::simplifyWaitcnt(Wait) updates Wait based on the current state of WaitcntBrackets, removing unnecessary waits. - WaitcntBrackets::applyWaitcnt(Wait) updates WaitBrackets based on Wait, updating the state by applying the specified waits. This was changed by #164357 which started calling applyWaitcnt from simplifyWaitcnt. This patch restores the original design without any significant functional changes. There is some code duplication because both simplifyWaitcnt and applyWaitcnt need to understand how XCNT interacts with other counters like LOADCNT and KMCNT.	2026-01-05 13:34:32 +00:00
Jay Foad	b3c3e5fd99	[AMDGPU] Simplify and document waitcnt handling on call and return (#172453 ) Start documenting the ABI conventions for dependency counters on function call and return. Stop pretending that SIInsertWaitcnts can handle anything other than the default documented behavior.	2026-01-05 13:29:54 +00:00
Jay Foad	0dd03598dc	Revert "[AMDGPU] Fix excessive stack usage in SIInsertWaitcnts::run (#134835 )" (#174215 ) This reverts commit 008c875be85732f72c4df4671167f5be79f449eb. PR #162077 / #171779 shrunk the WaitcntBrackets class by using DenseMaps instead of large arrays, so the size of a temporary WaitcntBrackets allocated on the stack is no longer a concern. With this patch on Linux I measured the stack size of SIInsertWaitcnts::run increasing from 456 bytes to 632 bytes.	2026-01-05 12:10:11 +00:00
Matt Arsenault	9ad39dd116	AMDGPU: Avoid crashing on statepoint-like pseudoinstructions (#170657 ) At the moment the MIR tests are somewhat redundant. The waitcnt one is needed to ensure we actually have a load, given we are currently just emitting an error on ExternalSymbol. The asm printer one is more redundant for the moment, since it's stressed by the IR test. However I am planning to change the error path for the IR test, so it will soon not be redundant.	2025-12-29 19:08:08 +01:00
Pankaj Dwivedi	28d4e33b65	[AMDGPU][SIInsertWaitCnt] Optimize loadcnt insertion at function boundaries (#169647 ) On GFX12+, GLOBAL_INV increments the loadcnt counter but does not write results to any VGPRs. Previously, we unconditionally inserted s_wait_loadcnt 0 at function returns even when the only pending loadcnt was from GLOBAL_INV instructions. This patch optimizes waitcnt insertion by skipping the loadcnt wait at function boundaries when no VGPRs have pending loads. This is determined by checking if any VGPR has a score greater than the lower bound for LOAD_CNT - if not, the pending loadcnt must be from non-VGPR-writing instructions like GLOBAL_INV. The optimization is limited to GFX12+ targets where GLOBAL_INV exists and uses the extended wait count instructions. This is a follow-up optimization to PR #135340 which added tracking for GLOBAL_INV in the waitcnt pass.	2025-12-17 17:53:00 +05:30
Jay Foad	c1e829fc3d	[AMDGPU] Simplify waitcnt insertion on function entry. NFC. (#172461 ) This pass runs way too late for PHI instructions.	2025-12-16 12:00:13 +00:00
Pankaj Dwivedi	e151434b0f	[AMDGPU][InsertWaitCnts][NFC] Merge VMEM_ACCESS and VMEM_READ_ACCESS into a single event type (#171973 )	2025-12-12 18:26:40 +05:30
Pierre van Houtryve	025d0c0d1d	(reland) [AMDGPU][SIInsertWaitCnts] Use RegUnits-based tracking (#162077 ) (#171779 ) Fixed a crash in Blender due to some weird control flow. The issue was with the "merge" function which was only looking at the keys of the "Other" VMem/SGPR maps. It needs to look at the keys of both maps and merge them. Original commit message below ---- The pass was already "reinventing" the concept just to deal with 16 bit registers. Clean up the entire tracking logic to only use register units. There are no test changes because functionality didn't change, except: - We can now track more LDS DMA IDs if we need it (up to `1 << 16`) - The debug prints also changed a bit because we now talk in terms of register units. This also changes the tracking to use a DenseMap instead of a massive fixed size table. This trades a bit of access speed for a smaller memory footprint. Allocating and memsetting a huge table to zero caused a non-negligible performance impact (I've observed up to 50% of the time in the pass spent in the `memcpy` built-in on a big test file). I also think we don't access these often enough to really justify using a vector. We do a few accesses per instruction, but not much more. In a huge 120MB LL file, I can barely see the trace of the DenseMap accesses.	2025-12-12 09:41:04 +01:00
Sameer Sahasrabuddhe	130fa98a29	[AMDGPU][NFC] dump Waitcnt using an ostream operator (#171251 )	2025-12-10 20:45:49 +05:30
pvanhout	4572f4f5b1	Revert "[AMDGPU][SIInsertWaitCnts] Use RegUnits-based tracking (#162077 )" Fails on https://lab.llvm.org/buildbot/#/builders/123/builds/31922 This reverts commit bf9344099c63549b2f19f8ede29f883669b0baca.	2025-12-09 14:48:19 +01:00
Pierre van Houtryve	bf9344099c	[AMDGPU][SIInsertWaitCnts] Use RegUnits-based tracking (#162077 ) The pass was already "reinventing" the concept just to deal with 16 bit registers. Clean up the entire tracking logic to only use register units. There are no test changes because functionality didn't change, except: - We can now track more LDS DMA IDs if we need it (up to `1 << 16`) - The debug prints also changed a bit because we now talk in terms of register units. This also changes the tracking to use a DenseMap instead of a massive fixed size table. This trades a bit of access speed for a smaller memory footprint. Allocating and memsetting a huge table to zero caused a non-negligible performance impact (I've observed up to 50% of the time in the pass spent in the `memcpy` built-in on a big test file). I also think we don't access these often enough to really justify using a vector. We do a few accesses per instruction, but not much more. In a huge 120MB LL file, I can barely see the trace of the DenseMap accesses.	2025-12-09 13:51:19 +01:00
Sameer Sahasrabuddhe	8c8196c802	[AMDGPU][NFC] cleanup whitespace in debug log of SIInsertWaitcnts	2025-12-09 07:42:37 +05:30
Sameer Sahasrabuddhe	1e33e12f25	[AMDGPU][NFC] fix function names in debug log for SIInsertWaitcnts	2025-12-09 07:36:31 +05:30
Sameer Sahasrabuddhe	1ae957515c	[AMDGPU][NFC] Update a comment about FLAT v/s LDSDMA The change in #170263 does not do justice to common knowledge in the backend. Fix the comment to reflect the relation between FLAT encoding, flat pointer access, and LDSDMA operations.	2025-12-08 20:49:19 +05:30
Pierre van Houtryve	8aa82eff56	[AMDGPU][SIInsertWaitcnts] Wait on all LDS DMA operations when no aliasing store is found (#170660 ) Previously, we would miss inserting a wait if the ds_read had AA info, but it didn't match any LDS DMA op, for example if we didn't track the LDS DMA op it aliases with because it exceeded the tracking limit.	2025-12-08 11:02:24 +01:00
Jay Foad	0ecac6d5b9	[AMDGPU] Inherit constructors from WaitcntGenerator. NFC. (#170845 )	2025-12-05 13:38:00 +00:00
Jay Foad	64e3bcdd1f	[AMDGPU] Add an assertion. NFCI.	2025-12-05 12:55:09 +00:00
Sameer Sahasrabuddhe	cb8ce283e1	[AMDGPU][Waitcnts] Don't create a pending flat event for LDS DMA (#170263 ) Flat instructions need a waitcnt(0) on both VMEM and LDS accesses, but only when the instruction really is using flat addressing. The LDS DMA instructions (on GFX9) have the FLAT flag set, but they have very clear semantics. These instructions update only VM_CNT (on GFX9), and hence do not need to be treated like actual flat instructions.	2025-12-04 17:22:59 +05:30
Pierre van Houtryve	8feb6762ba	[AMDGPU] Take BUF instructions into account in mayAccessScratchThroughFlat (#170274 ) BUF instructions can access the scratch address space, so SIInsertWaitCnt needs to be able to track the SCRATCH_WRITE_ACCESS event for such BUF instructions. The release-vgprs.mir test had to be updated because BUF instructions w/o a MMO are now tracked as a SCRATCH_WRITE_ACCESS. I added a MMO that touches global to keep the test result unchanged. I also added a couple of testcases with no MMO to test the corrected behavior.	2025-12-03 10:37:58 +01:00
Ryan Mitchell	5e4505d562	[AMDGPU][SIInsertWaitCnts] Gfx12.5 - Refactor xcnt optimization (#164357 ) Refactor the XCnt optimization checks so that they can be checked when applying a pre-existing waitcnt. This removes unnecessary xcnt waits when taking a loop backedge.	2025-11-13 18:43:12 +00:00
Jay Foad	5e4f177142	[AMDGPU] Fix missing S_WAIT_XCNT with multiple pending VMEMs (#166779 )	2025-11-12 09:44:08 +00:00
Aaditya	c8187f6539	[AMDGPU] Fix Xcnt handling between blocks (#165201 ) For blocks with multiple predecessors, there may be `SMEM` and `VMEM` events active at the same time. This patch handles these cases.	2025-11-01 16:48:48 +05:30
Aaditya	982c9e6ac5	[AMDGPU][NFC] Use `getScoreUB` for XCNT insertion. (#162448 )	2025-10-13 11:07:05 +05:30
Aaditya	19cd5bd350	[AMDGPU] Account for implicit XCNT insertion (#160812 ) Hardware inserts an implicit `S_WAIT_XCNT 0` between alternate SMEM and VMEM instructions, so there are never outstanding address translations for both SMEM and VMEM at the same time.	2025-10-03 13:38:37 +05:30
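The waitcnt expansion described in commit 3dfb782333 (#169345) above can be illustrated with a small standalone sketch. It only computes the sequence of counter values, starting at (Outstanding - 1) and counting down to the target, as the commit message describes. The function name `expandWaitcntValues` is hypothetical and not part of SIInsertWaitcnts, and the fallback of returning the single original value when nothing can be expanded is an assumption made for this example.

```cpp
#include <vector>

// Hypothetical helper for illustration only; not the pass's real API.
// Returns the counter values a profiling expansion would wait on, in order.
std::vector<unsigned> expandWaitcntValues(unsigned Outstanding,
                                          unsigned Target) {
  std::vector<unsigned> Values;
  // Count down from Outstanding - 1 to Target (inclusive). The post-decrement
  // in the condition avoids unsigned underflow when Target is 0.
  for (unsigned V = Outstanding; V-- > Target;)
    Values.push_back(V);
  // Assumption: if the target is not below the outstanding count there is
  // nothing to expand, so keep the single original wait value.
  if (Values.empty())
    Values.push_back(Target);
  return Values;
}
```

With 5 outstanding operations and a target of 2, the sketch yields the values 4, 3, 2, matching the vmcnt(4), vmcnt(3), vmcnt(2) sequence in the commit message.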


315 Commits
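Commit c5862719c5 (#178511) in the log above describes a set-like container that is backed internally by an unsigned integer used as a bit vector. A minimal sketch of that general idea follows. The names `DemoEvent` and `DemoEventSet` and all of their members are hypothetical and chosen for illustration; they are not the actual `WaitEventType` or `WaitEventSet` definitions.

```cpp
#include <cassert>
#include <initializer_list>

// Hypothetical event kinds; the real WaitEventType enum has other members.
enum DemoEvent : unsigned { EventA, EventB, EventC, NumDemoEvents };

// Set-style API over a single unsigned integer, one bit per event.
class DemoEventSet {
  unsigned Mask = 0;

public:
  DemoEventSet() = default;
  DemoEventSet(std::initializer_list<DemoEvent> Events) {
    for (DemoEvent E : Events)
      insert(E);
  }

  void insert(DemoEvent E) {
    assert(E < NumDemoEvents && "event out of range");
    Mask |= 1u << E;
  }
  void remove(DemoEvent E) { Mask &= ~(1u << E); }
  bool contains(DemoEvent E) const { return (Mask & (1u << E)) != 0; }
  bool empty() const { return Mask == 0; }

  // Union and intersection map directly onto bitwise OR and AND.
  DemoEventSet operator|(const DemoEventSet &Other) const {
    DemoEventSet Result;
    Result.Mask = Mask | Other.Mask;
    return Result;
  }
  DemoEventSet operator&(const DemoEventSet &Other) const {
    DemoEventSet Result;
    Result.Mask = Mask & Other.Mask;
    return Result;
  }
  bool operator==(const DemoEventSet &Other) const {
    return Mask == Other.Mask;
  }
};
```

Callers of such a class work with insertion, removal and membership queries and never touch the mask directly, which is the encapsulation the commit message says it is after.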