315 Commits

Author SHA1 Message Date
vporpo
9898082bd3
[AMDGPU][SIInsertWaitcnt][NFC] Access Waitcnt elements using InstCounterType (#178345)
This patch introduces `get(T)` and `set(T, Val)` functions for Waitcnt
and removes getCounterRef() and getWait(). For this to work we also need
to move InstrCounterType to AMDGPUBaseInfo.h.

Please note that the member variables are still public to keep this
patch small.
They will be replaced in the follow-up patch.
2026-02-09 17:54:08 -08:00
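The `get(T)` / `set(T, Val)` accessors described in the commit above can be pictured with a minimal sketch. The array-backed storage, the four-entry enum, and the class layout here are assumptions made for illustration; the commit itself notes that the real member variables are still public and only replaced in a follow-up.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical subset of counter kinds; the real enum has more entries.
enum InstCounterType { LOAD_CNT, DS_CNT, EXP_CNT, STORE_CNT, NUM_INST_CNTS };

// Illustrative stand-in for Waitcnt: one slot per counter type.
class Waitcnt {
  // ~0u is used here to mean "no wait required" for that counter.
  std::array<unsigned, NUM_INST_CNTS> Cnt;

public:
  Waitcnt() { Cnt.fill(~0u); }

  // Read the wait value for counter type T.
  unsigned get(InstCounterType T) const { return Cnt[T]; }

  // Overwrite the wait value for counter type T.
  void set(InstCounterType T, unsigned Val) { Cnt[T] = Val; }
};
```

Indexing by the enum lets callers loop over counter types instead of naming each field, which is presumably what makes getCounterRef() and getWait() unnecessary.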
vporpo
88dff28c9f
[AMDGPU][SIInsertWaitcnts][NFC] Make a few WaitcntBracket member functions private (#180018)
The user of the WaitcntBrackets class shouldn't need to know about how
the scoreboard has been implemented internally. So I think it is best to
provide a higher level API that hides things like scoreUB, scoreLB and
score ranges.

This patch makes getScoreUB(), getScoreLB() and getScoreRange() private
and introduces new functions that don't expose the internal
implementation:
- getOutstanding(T)
- hasPendingVMEM(VMEMID, T)
- empty(T)

I also noticed that getSGPRScore() and getVMemScore() are not used
externally so these are now private.
2026-02-07 13:55:01 -08:00
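A rough sketch of the encapsulation described in the commit above: the lower/upper bound scores stay private and callers only see `getOutstanding(T)` and `empty(T)`. The class name, `recordEvent()`, `waitUntil()`, and the assumption that "outstanding" equals upper bound minus lower bound are all hypothetical and only serve the illustration.

```cpp
#include <cassert>

// Hypothetical subset of counter kinds.
enum InstCounterType { LOAD_CNT, DS_CNT, NUM_INST_CNTS };

class ScoreBracketsSketch {
  unsigned ScoreLBs[NUM_INST_CNTS] = {};
  unsigned ScoreUBs[NUM_INST_CNTS] = {};

  // Internal scoreboard details, hidden from users of the class.
  unsigned getScoreLB(InstCounterType T) const { return ScoreLBs[T]; }
  unsigned getScoreUB(InstCounterType T) const { return ScoreUBs[T]; }
  unsigned getScoreRange(InstCounterType T) const {
    return getScoreUB(T) - getScoreLB(T);
  }

public:
  // Hypothetical helper: a new operation of type T was issued.
  void recordEvent(InstCounterType T) { ++ScoreUBs[T]; }

  // Hypothetical helper: wait until at most Remaining operations are pending.
  void waitUntil(InstCounterType T, unsigned Remaining) {
    if (getScoreRange(T) > Remaining)
      ScoreLBs[T] = ScoreUBs[T] - Remaining;
  }

  // Higher-level queries that do not expose the score representation.
  unsigned getOutstanding(InstCounterType T) const { return getScoreRange(T); }
  bool empty(InstCounterType T) const { return getScoreRange(T) == 0; }
};
```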
vporpo
7022e5a514
[AMDGPU][SIINsertWaitcnts][NFC] Make TTI and ST references (#180017)
This patch converts WaitcntGenerator::TTI and ST pointers to references.
This helps remove some null checking assertions.
2026-02-05 13:18:06 -08:00
vporpo
5326166866
[AMDGPU][SIInsertWaitcnt][NFC] Don't expose internal data structure to user (#179736)
With this patch we are no longer exposing the internal data structure
that holds the WaitEvents to the user through the `getWaitEventMask()`
API. Instead we only allow the user to query a specific type and get the
corresponding `WaitEventSet` with `getWaitEvents(T)`.
Note: This patch also renames `getWaitEventMask()` to `getWaitEvents()`
because we are no longer returning a mask but instead a `WaitEventSet`
object.
2026-02-05 11:07:38 -08:00
Vigneshwar Jayakumar
2dcd75eb44
[AMDGPU] Fix missing waitcnt after buffer_wbl2 (#178316)
On GFX9, BUFFER_WBL2 is used to write back dirty cache lines and
requires an s_waitcnt vmcnt(0) afterwards to ensure completion.

This patch fixes this by incrementing vmcnt for the buffer_wbl2 instruction.

---------

Co-authored-by: Jay Foad <jay.foad@gmail.com>
2026-02-05 10:13:51 -06:00
vporpo
d9da5d7626
[AMDGPU][SIInsertWaitcnt][NFC] Move eventCounter() function (#178949)
The eventCounter() function searches through the array of events. This
array is owned by the WaitcntGenerator class.

This patch moves the function into the WaitcntGenerator class which
helps hide the event array from the user.
It also renames it to getCounterFromEvent().

This should be NFC.
2026-02-04 11:11:18 -08:00
vporpo
3c67dcb2f2
[AMDGPU][SIInsertWaitcnts][NFC] Replace Wait.combined() with simple assignment (#179142)
Wait is initialized with all ~0s and by the time it reaches the updated
line it still holds the same value. So Wait.combined(AllZeroWait) is
effectively combining all ~0s with AllZeroWait and given that combined()
returns the min() of the two it should always return AllZeroWait.

So this patch replaces the assignment with `= AllZeroWait` to make it
easier to read.
2026-02-02 13:50:26 -08:00
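The reasoning in the commit above can be checked with a small model: if `combined()` takes the element-wise minimum and the left-hand side is all ~0s, the result is always the right-hand side. The three-counter layout and the `noWait()` / `allZero()` helpers are assumptions for this sketch, not the pass's actual types.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical number of counters, chosen only for the sketch.
constexpr std::size_t NumCounters = 3;

struct WaitcntModel {
  std::array<unsigned, NumCounters> Cnt;

  // All ~0s: no wait required on any counter.
  static WaitcntModel noWait() {
    WaitcntModel W;
    W.Cnt.fill(~0u);
    return W;
  }

  // All zeros: wait for every counter to drain.
  static WaitcntModel allZero() {
    WaitcntModel W;
    W.Cnt.fill(0u);
    return W;
  }

  // Element-wise minimum, i.e. the stricter wait for each counter.
  WaitcntModel combined(const WaitcntModel &Other) const {
    WaitcntModel R;
    for (std::size_t I = 0; I < NumCounters; ++I)
      R.Cnt[I] = std::min(Cnt[I], Other.Cnt[I]);
    return R;
  }

  bool operator==(const WaitcntModel &Other) const { return Cnt == Other.Cnt; }
};
```

Since ~0u is the largest unsigned value, `min(~0u, x)` is `x` for every `x`, so the combine is equivalent to a plain assignment in that situation.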
vporpo
8b3b0b869e
[AMDGPU][SIInsertWaitcnt][NFC] Replace if/else with switch (#178956)
This is an NFC patch that replaces the consecutive ifs and else ifs in
generateWaitcntInstBefore() with a switch. This makes it a bit easier to
read.
2026-02-01 13:13:01 -08:00
vporpo
ef720acad0
[AMDGPU][SIInsertWaitcnts][NFC] Use loop to set Wait entries (#178764)
Please note that the original code was skipping STORE_CNT but this one
is not.
2026-01-30 16:18:03 -08:00
vporpo
c5862719c5
[AMDGPU][SIInsertWaitcnts][NFC] Introduce WaitEventSet container for events (#178511)
Before this patch WaitEventType events used to be collected in unsigned
integers that were used as small bit vectors.

This patch introduces a WaitEventSet container class to replace the
integer bit vectors with a class that hides the implementation of common
operations like insertion, removal, union, intersection etc. from the
user.

The WaitEventSet API matches that of a set and not a vector because we
don't care about the order of its contents. Internally though it is
still a bit vector that uses an unsigned integer as its storage, just
like the original implementation.

This patch should not change the functionality.
2026-01-30 08:55:44 -08:00
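As a hedged illustration of the container described in the commit above, the sketch below wraps an unsigned bit vector behind a set-like interface. The method names (`insert`, `remove`, `contains`, the `|` and `&` operators) and the four-entry event enum are assumptions; the actual WaitEventSet API may differ.

```cpp
#include <cassert>
#include <initializer_list>

// Hypothetical subset of wait events.
enum WaitEventType { VMEM_ACCESS, LDS_ACCESS, SMEM_ACCESS, EXP_GPR_LOCK };

class WaitEventSetSketch {
  unsigned Mask = 0; // one bit per event, as in the original integer masks

public:
  WaitEventSetSketch() = default;
  WaitEventSetSketch(std::initializer_list<WaitEventType> Events) {
    for (WaitEventType E : Events)
      insert(E);
  }

  void insert(WaitEventType E) { Mask |= 1u << E; }
  void remove(WaitEventType E) { Mask &= ~(1u << E); }
  bool contains(WaitEventType E) const { return (Mask & (1u << E)) != 0; }
  bool empty() const { return Mask == 0; }

  // Set union.
  WaitEventSetSketch operator|(const WaitEventSetSketch &O) const {
    WaitEventSetSketch R;
    R.Mask = Mask | O.Mask;
    return R;
  }

  // Set intersection.
  WaitEventSetSketch operator&(const WaitEventSetSketch &O) const {
    WaitEventSetSketch R;
    R.Mask = Mask & O.Mask;
    return R;
  }

  bool operator==(const WaitEventSetSketch &O) const { return Mask == O.Mask; }
};
```

Callers express intent (insert, union, membership) while the storage remains a single integer, so the generated code should stay close to the hand-written bit manipulation.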
Jay Foad
7deea9db70
[AMDGPU] Move WaitcntBrackets::simplifyXcnt near other simplify functions. NFC. (#178673) 2026-01-29 15:49:11 +00:00
vporpo
2615005982
[AMDGPU][SIInsertWaitcnts] Cleanup: Remove WaitEventMaskForInst member variable (#178030)
The event mask is constant and target dependent, so it should be accessed
through the WCG object.
2026-01-28 12:22:48 -08:00
Jay Foad
775f02521d
[AMDGPU] Fix buggy insertion of DEALLOC_VGPRS message (#178401)
We inserted the DEALLOC_VGPRS message if there were no pending scratch
stores the first time an S_ENDPGM instruction was visited. But because
this pass uses a worklist to revisit blocks until it reaches a fixed
point, it is possible that pending scratch stores are only discovered on
the second or later visit to a block. Fix this by storing a flag for
each S_ENDPGM instruction which can be updated by later visits.
2026-01-28 13:29:26 +00:00
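The fix in the commit above can be sketched as a per-instruction flag that every visit overwrites, with the final decision taken only after the worklist has reached its fixed point. The integer instruction ids, the `EndpgmTracker` name, and its methods are hypothetical and only model the idea.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Tracks, per S_ENDPGM (identified by a hypothetical integer id), whether a
// DEALLOC_VGPRS message may be inserted before it.
struct EndpgmTracker {
  std::map<int, bool> CanDealloc;

  // Called on every visit of the block; later visits overwrite earlier ones.
  void visit(int EndpgmId, bool HasPendingScratchStores) {
    CanDealloc[EndpgmId] = !HasPendingScratchStores;
  }

  // Called once after the fixed point: ids that still qualify.
  std::vector<int> finalize() const {
    std::vector<int> Ids;
    for (const auto &Entry : CanDealloc)
      if (Entry.second)
        Ids.push_back(Entry.first);
    return Ids;
  }
};
```

Deciding on the first visit would have kept instruction 7 in the test scenario (no pending stores seen yet), which is the kind of stale decision the commit describes.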
vporpo
89e38d65d8
[AMDGPU][SIInsertWaitcnts][NFC] Move static array definition (#178014)
Move the array out of the member function.
2026-01-26 11:21:20 -08:00
vporpo
8805e27365
[AMDGPU] Cleanup: Use unique_ptr for WCG and remove unnecessary class members (#177689) 2026-01-26 10:10:35 -08:00
Christudasan Devadasan
cf25346dcb
[AMDGPU][GFX1250] Optimize s_wait_xcnt for back-to-back atomic RMWs (#177620)
This patch optimizes the insertion of s_wait_xcnt instruction for
sequences of atomic read-modify-write (RMW) operations in the
SIInsertWaitcnts pass. The Memory Legalizer conservatively inserts a
soft xcnt instruction before each atomic RMW operation as part of PR
168852, which is correct given the nature of atomic operations.
However, for back-to-back atomic RMWs, only the first s_wait_xcnt is
necessary, and dropping the rest improves runtime performance. This
patch tracks atomic
RMW blocks within each basic block and removes redundant soft xcnt
instructions, keeping only the first wait in each sequence. An atomic
RMW block continues through subsequent atomic RMWs and non-memory
instructions (e.g., ALU operations) but is broken by CU-scoped memory
operations, atomic stores, or basic block boundaries.
2026-01-24 08:21:41 +05:30
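A simplified model of the tracking described in the commit above: walk one basic block, remember whether an atomic RMW sequence is open, and mark every soft xcnt wait inside an open sequence as redundant. The `InstKind` classification is an assumption; in particular `BlockBreaker` lumps together the CU-scoped memory operations and atomic stores that the commit says end a sequence.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical instruction classification for the sketch.
enum class InstKind { SoftXcnt, AtomicRMW, ALU, BlockBreaker };

// Returns the indices of soft xcnt waits that could be removed.
std::vector<std::size_t>
redundantSoftXcnts(const std::vector<InstKind> &Block) {
  std::vector<std::size_t> Redundant;
  bool InRMWBlock = false; // reset per basic block
  for (std::size_t I = 0; I < Block.size(); ++I) {
    switch (Block[I]) {
    case InstKind::SoftXcnt:
      // Only the first wait of a sequence is kept.
      if (InRMWBlock)
        Redundant.push_back(I);
      break;
    case InstKind::AtomicRMW:
      InRMWBlock = true;
      break;
    case InstKind::ALU:
      // Non-memory instructions do not end the sequence.
      break;
    case InstKind::BlockBreaker:
      InRMWBlock = false;
      break;
    }
  }
  return Redundant;
}
```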
Shilei Tian
02d34a76f7
[NFCI][AMDGPU] Remove more redundant code from GCNSubtarget.h (#177297)
We are getting pretty close to using `GET_SUBTARGETINFO_MACRO` in the
header with this cleanup.
2026-01-22 09:07:15 -05:00
Jay Foad
2b522e0b09
Reapply "[AMDGPU] Fix excessive stack usage in SIInsertWaitcnts::run (#134835)" (#174215) (#177338)
This reverts commit 0dd03598dca91c93c74b94a714c38a4ffad0ed1c.

Apparently the stack usage on Windows was still large enough to cause
problems for some DirectX games.
2026-01-22 11:25:08 +00:00
Shilei Tian
1843a7fe9f
[NFCI][AMDGPU] Use X-macro to reduce boilerplate in GCNSubtarget.h (#176844)
`GCNSubtarget.h` contained a large amount of repetitive code following
the pattern `bool HasXXX = false;` for member declarations and `bool
hasXXX() const { return HasXXX; }` for getters. This boilerplate made
the file unnecessarily long and harder to maintain.

This patch introduces an X-macro pattern `GCN_SUBTARGET_HAS_FEATURE`
that consolidates 135 simple subtarget features into a single list. The
macro is expanded twice: once in the protected section to generate
member variable declarations, and once in the public section to generate
the corresponding getter methods. This reduces the file by approximately
600 lines while preserving the exact same API and functionality.
Features with complex getter logic or inconsistent naming conventions
are left as manual implementations for future improvement.

Ideally, these could be generated by TableGen using
`GET_SUBTARGETINFO_MACRO`, similar to the X86 backend. However,
`AMDGPU.td` has several issues that prevent direct adoption: duplicate
field names (e.g., `DumpCode` is set by both `FeatureDumpCode` and
`FeatureDumpCodeLower`), and inconsistent naming conventions where many
features don't have the `Has` prefix (e.g., `FlatAddressSpace`,
`GFX10Insts`, `FP64`). Fixing these issues would require renaming fields
in `AMDGPU.td` and updating all references, which is left for future
work.
2026-01-21 15:29:09 -05:00
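The X-macro pattern described in the commit above can be sketched as follows. The macro name comes from the commit message, but its parameter shape, the three feature names, the `DemoSubtarget` class, and `enableImageInsts()` are assumptions chosen for illustration.

```cpp
#include <cassert>

// One list of features; X is applied to each entry.
#define GCN_SUBTARGET_HAS_FEATURE(X)                                           \
  X(FlatScratch)                                                               \
  X(ImageInsts)                                                                \
  X(PackedFP32Ops)

class DemoSubtarget {
protected:
  // First expansion: `bool HasXXX = false;` member declarations.
#define DECLARE_MEMBER(Name) bool Has##Name = false;
  GCN_SUBTARGET_HAS_FEATURE(DECLARE_MEMBER)
#undef DECLARE_MEMBER

public:
  // Second expansion: `bool hasXXX() const` getters.
#define DECLARE_GETTER(Name)                                                   \
  bool has##Name() const { return Has##Name; }
  GCN_SUBTARGET_HAS_FEATURE(DECLARE_GETTER)
#undef DECLARE_GETTER

  // Hypothetical setter so the sketch can flip one feature.
  void enableImageInsts() { HasImageInsts = true; }
};
```

Adding a feature then means adding one line to the list, and the member and its getter cannot drift apart.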
Pankaj Dwivedi
6b86e24ec1
[AMDGPU][SIInsertWaitcnt] Address review feedback for waitcnt profiling expansion (#175922) 2026-01-17 14:57:45 +05:30
Ramkumar Ramachandra
d69335bac9
[LLVM] Clean up code using [not_]equal_to (NFC) (#175824)
Use llvm::[not_]equal_to, which landed in d2a521750 ([ADT] Introduce
bind_{front,back}, [not_]equal_to, #175056), across LLVM for cleaner
code.
2026-01-13 21:19:39 +00:00
hidekisaito
a062249932
[AMDGPU] Add DS loop waitcnt optimization for GFX12+ (#172728)
Add support for flushing DS_CNT in loop preheaders when the loop uses
values that were DS-loaded outside the loop. This is similar to the
existing VMEM loop optimization.

Assisted-by: Cursor / claude-4.5-opus-high
2026-01-12 13:34:02 -08:00
Jay Foad
27074aa31a
[AMDGPU] Fix crash in SIInsertWaitcnts debug output (#175518)
In some cases we were accessing `OldWaitcntInstr.getParent()->end()`
after `OldWaitcntInstr` had already been erased from its parent.
2026-01-12 14:48:53 +00:00
Jay Foad
3b2d14ba1c
[AMDGPU] Inline two helpers in SIInsertWaitcnts. NFC. (#174557) 2026-01-12 14:26:31 +00:00
Pankaj Dwivedi
3dfb782333
[AMDGPU][SIInsertWaitcnt] Implement Waitcnt Expansion for Profiling (#169345)
Reference issue: https://github.com/ROCm/llvm-project/issues/67

This patch adds support for expanding s_waitcnt instructions into
sequences with decreasing counter values, enabling PC-sampling profilers
to identify which specific memory operation is causing a stall.

This is controlled via:
Clang flag: -mamdgpu-expand-waitcnt-profiling /
-mno-amdgpu-expand-waitcnt-profiling
Function attribute: "amdgpu-expand-waitcnt-profiling"

When enabled, instead of emitting a single waitcnt, the pass generates a
sequence that waits for each outstanding operation individually. For
example, if there are 5 outstanding memory operations and the target is
to wait until 2 remain:


**Original**: 
s_waitcnt vmcnt(2)

**Expanded**:  
s_waitcnt vmcnt(4)
s_waitcnt vmcnt(3)
s_waitcnt vmcnt(2)

The expansion starts from (Outstanding - 1) down to the target value,
since waitcnt(Outstanding) would be a no-op (the counter is already at
that value).

- Uses ScoreBrackets to determine the actual number of outstanding
operations
- Only expands when operations complete in-order
- Skips expansion for mixed event types (e.g., LDS+SMEM on same counter)
- Skips expansion for scalar memory (always out-of-order)

Related previous work for reference
- **PR**: llvm/llvm-project#79236 (related `-amdgpu-waitcnt-forcezero`)

---------

Co-authored-by: Pierre van Houtryve <pierre.vanhoutryve@amd.com>
2026-01-12 17:35:06 +05:30
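The expansion rule in the commit above (start at Outstanding - 1 and count down to the target) can be written as a short helper. `expandWaitcnt` is a hypothetical name, and returning an empty sequence when nothing needs expanding is an assumption of this sketch rather than the pass's documented behavior.

```cpp
#include <cassert>
#include <vector>

// Returns the counter values to wait on, in emission order.
// Outstanding: number of operations currently in flight.
// Target: number of operations allowed to remain in flight.
std::vector<unsigned> expandWaitcnt(unsigned Outstanding, unsigned Target) {
  std::vector<unsigned> Seq;
  // waitcnt(Outstanding) would be a no-op, so start one below it.
  for (unsigned V = Outstanding; V > Target; --V)
    Seq.push_back(V - 1);
  return Seq;
}
```

With 5 outstanding operations and a target of 2, this yields the values 4, 3, 2, matching the vmcnt(4), vmcnt(3), vmcnt(2) sequence in the commit message.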
Jay Foad
475f022cb7
[AMDGPU] Add support for GFX12 expert scheduling mode 2 (#170319) 2026-01-09 15:49:10 +00:00
Jay Foad
0494abbea0
[AMDGPU] Make WaitcntBrackets::simplifyWaitcnt const again (#173390)
The original design was:
- WaitcntBrackets::simplifyWaitcnt(Wait) updates Wait based on the
  current state of WaitcntBrackets, removing unnecessary waits.
- WaitcntBrackets::applyWaitcnt(Wait) updates WaitcntBrackets based on
  Wait, updating the state by applying the specified waits.

This was changed by #164357 which started calling applyWaitcnt from
simplifyWaitcnt.

This patch restores the original design without any significant
functional changes. There is some code duplication because both
simplifyWaitcnt and applyWaitcnt need to understand how XCNT interacts
with other counters like LOADCNT and KMCNT.
2026-01-05 13:34:32 +00:00
Jay Foad
b3c3e5fd99
[AMDGPU] Simplify and document waitcnt handling on call and return (#172453)
Start documenting the ABI conventions for dependency counters on
function call and return.

Stop pretending that SIInsertWaitcnts can handle anything other than the
default documented behavior.
2026-01-05 13:29:54 +00:00
Jay Foad
0dd03598dc
Revert "[AMDGPU] Fix excessive stack usage in SIInsertWaitcnts::run (#134835)" (#174215)
This reverts commit 008c875be85732f72c4df4671167f5be79f449eb.

PR #162077 / #171779 shrunk the WaitcntBrackets class by using DenseMaps
instead of large arrays, so the size of a temporary WaitcntBrackets
allocated on the stack is no longer a concern.

With this patch on Linux I measured the stack size of
SIInsertWaitcnts::run increasing from 456 bytes to 632 bytes.
2026-01-05 12:10:11 +00:00
Matt Arsenault
9ad39dd116
AMDGPU: Avoid crashing on statepoint-like pseudoinstructions (#170657)
At the moment the MIR tests are somewhat redundant. The waitcnt
one is needed to ensure we actually have a load, given we are
currently just emitting an error on ExternalSymbol. The asm printer
one is more redundant for the moment, since it's stressed by the IR
test. However I am planning to change the error path for the IR test,
so it will soon not be redundant.
2025-12-29 19:08:08 +01:00
Pankaj Dwivedi
28d4e33b65
[AMDGPU][SIInsertWaitCnt] Optimize loadcnt insertion at function boundaries (#169647)
On GFX12+, GLOBAL_INV increments the loadcnt counter but does not write
results to any VGPRs. Previously, we unconditionally inserted
s_wait_loadcnt 0 at function returns even when the only pending loadcnt
was from GLOBAL_INV instructions.

This patch optimizes waitcnt insertion by skipping the loadcnt wait at
function boundaries when no VGPRs have pending loads. This is determined
by checking if any VGPR has a score greater than the lower bound for
LOAD_CNT - if not, the pending loadcnt must be from non-VGPR-writing
instructions like GLOBAL_INV.

The optimization is limited to GFX12+ targets where GLOBAL_INV exists
and uses the extended wait count instructions.

This is a follow-up optimization to PR #135340 which added tracking for
GLOBAL_INV in the waitcnt pass.
2025-12-17 17:53:00 +05:30
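The check described in the commit above boils down to asking whether any VGPR score is above the LOAD_CNT lower bound. The sketch below models scores as a plain vector; the function names and that representation are assumptions, and the GFX12+ gating is passed in as a boolean rather than derived from a subtarget.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// True if at least one VGPR still has a load in flight, i.e. its score is
// above the lower bound for LOAD_CNT.
bool hasPendingVGPRLoad(const std::vector<unsigned> &VgprLoadScores,
                        unsigned LoadCntLB) {
  return std::any_of(VgprLoadScores.begin(), VgprLoadScores.end(),
                     [LoadCntLB](unsigned S) { return S > LoadCntLB; });
}

// Hypothetical decision helper for a function boundary.
bool needsLoadcntWaitAtReturn(bool IsGFX12Plus, bool HasPendingLoadCnt,
                              const std::vector<unsigned> &VgprLoadScores,
                              unsigned LoadCntLB) {
  if (!HasPendingLoadCnt)
    return false;
  // Before GFX12 the optimization does not apply, so stay conservative.
  if (!IsGFX12Plus)
    return true;
  // Pending loadcnt with no VGPR waiting on it must come from
  // non-VGPR-writing instructions such as GLOBAL_INV.
  return hasPendingVGPRLoad(VgprLoadScores, LoadCntLB);
}
```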
Jay Foad
c1e829fc3d
[AMDGPU] Simplify waitcnt insertion on function entry. NFC. (#172461)
This pass runs way too late for PHI instructions.
2025-12-16 12:00:13 +00:00
Pankaj Dwivedi
e151434b0f
[AMDGPU][InsertWaitCnts][NFC] Merge VMEM_ACCESS and VMEM_READ_ACCESS into a single event type (#171973) 2025-12-12 18:26:40 +05:30
Pierre van Houtryve
025d0c0d1d
(reland) [AMDGPU][SIInsertWaitCnts] Use RegUnits-based tracking (#162077) (#171779)
Fixed a crash in Blender due to some weird control flow.
The issue was with the "merge" function which was only looking at the
keys of the "Other" VMem/SGPR maps. It needs to look at the keys of both
maps and merge them.

Original commit message below
----

The pass was already "reinventing" the concept just to deal with 16 bit
registers. Clean up the entire tracking logic to only use register
units.

There are no test changes because functionality didn't change, except:
- We can now track more LDS DMA IDs if we need it (up to `1 << 16`)
- The debug prints also changed a bit because we now talk in terms of
register units.

This also changes the tracking to use a DenseMap instead of a massive
fixed size table. This trades a bit of access speed for a smaller memory
footprint. Allocating and memsetting a huge table to zero caused a
non-negligible performance impact (I've observed up to 50% of the time
in the pass spent in the `memcpy` built-in on a big test file).

I also think we don't access these often enough to really justify using
a vector. We do a few accesses per instruction, but not much more. In a
huge 120MB LL file, I can barely see the trace of the DenseMap accesses.
2025-12-12 09:41:04 +01:00
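The merge bug described in the reland above can be illustrated with a simplified model in which each side's scores are rebased by a shift before being combined. The shift-based rebase, `std::map` in place of DenseMap, and the `mergeScores` name are all assumptions; the point is only that keys present in just one of the two maps still have to be visited.

```cpp
#include <algorithm>
#include <cassert>
#include <map>

using ScoreMap = std::map<unsigned, unsigned>;

// Merge Other into Mine, rebasing both sides and keeping the larger score.
void mergeScores(ScoreMap &Mine, const ScoreMap &Other, unsigned MyShift,
                 unsigned OtherShift) {
  // Keys already in Mine, including those absent from Other, need rebasing.
  for (auto &Entry : Mine) {
    auto It = Other.find(Entry.first);
    unsigned OtherScore = It == Other.end() ? 0u : It->second + OtherShift;
    Entry.second = std::max(Entry.second + MyShift, OtherScore);
  }
  // Keys that exist only in Other are added with their rebased score.
  for (const auto &Entry : Other)
    if (Mine.find(Entry.first) == Mine.end())
      Mine[Entry.first] = Entry.second + OtherShift;
}
```

Iterating only over the keys of `Other` would leave entries that exist solely in `Mine` without their rebase, which is one way a merge that looks at a single key set can go wrong.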
Sameer Sahasrabuddhe
130fa98a29
[AMDGPU][NFC] dump Waitcnt using an ostream operator (#171251) 2025-12-10 20:45:49 +05:30
pvanhout
4572f4f5b1 Revert "[AMDGPU][SIInsertWaitCnts] Use RegUnits-based tracking (#162077)"
Fails on https://lab.llvm.org/buildbot/#/builders/123/builds/31922

This reverts commit bf9344099c63549b2f19f8ede29f883669b0baca.
2025-12-09 14:48:19 +01:00
Pierre van Houtryve
bf9344099c
[AMDGPU][SIInsertWaitCnts] Use RegUnits-based tracking (#162077)
The pass was already "reinventing" the concept just to deal with 16 bit
registers. Clean up the entire tracking logic to only use register
units.

There are no test changes because functionality didn't change, except:
- We can now track more LDS DMA IDs if we need it (up to `1 << 16`)
- The debug prints also changed a bit because we now talk in terms of
register units.

This also changes the tracking to use a DenseMap instead of a massive
fixed size table. This trades a bit of access speed for a smaller memory
footprint. Allocating and memsetting a huge table to zero caused a
non-negligible performance impact (I've observed up to 50% of the time
in the pass spent in the `memcpy` built-in on a big test file).

I also think we don't access these often enough to really justify using
a vector. We do a few accesses per instruction, but not much more. In a
huge 120MB LL file, I can barely see the trace of the DenseMap accesses.
2025-12-09 13:51:19 +01:00
Sameer Sahasrabuddhe
8c8196c802 [AMDGPU][NFC] cleanup whitespace in debug log of SIInsertWaitcnts 2025-12-09 07:42:37 +05:30
Sameer Sahasrabuddhe
1e33e12f25 [AMDGPU][NFC] fix function names in debug log for SIInsertWaitcnts 2025-12-09 07:36:31 +05:30
Sameer Sahasrabuddhe
1ae957515c [AMDGPU][NFC] Update a comment about FLAT v/s LDSDMA
The change in #170263 does not do justice to common knowledge in the backend.
Fix the comment to reflect the relation between FLAT encoding, flat pointer
access, and LDSDMA operations.
2025-12-08 20:49:19 +05:30
Pierre van Houtryve
8aa82eff56
[AMDGPU][SIInsertWaitcnts] Wait on all LDS DMA operations when no aliasing store is found (#170660)
Previously, we would miss inserting a wait if the ds_read had AA info
but did not match any LDS DMA op, for example if we didn't track the
LDS DMA op it aliases with because it exceeded the tracking limit.
2025-12-08 11:02:24 +01:00
Jay Foad
0ecac6d5b9
[AMDGPU] Inherit constructors from WaitcntGenerator. NFC. (#170845) 2025-12-05 13:38:00 +00:00
Jay Foad
64e3bcdd1f [AMDGPU] Add an assertion. NFCI. 2025-12-05 12:55:09 +00:00
Sameer Sahasrabuddhe
cb8ce283e1
[AMDGPU][Waitcnts] Don't create a pending flat event for LDS DMA (#170263)
Flat instructions need a waitcnt(0) on both VMEM and LDS accesses, but
only when the instruction really is using flat addressing. The LDS DMA
instructions (on GFX9) have the FLAT flag set, but they have very clear
semantics. These instructions update only VM_CNT (on GFX9), and hence do
not need to be treated like actual flat instructions.
2025-12-04 17:22:59 +05:30
Pierre van Houtryve
8feb6762ba
[AMDGPU] Take BUF instructions into account in mayAccessScratchThroughFlat (#170274)
BUF instructions can access the scratch address space, so
SIInsertWaitCnt needs to be able to track the SCRATCH_WRITE_ACCESS event
for such BUF instructions.

The release-vgprs.mir test had to be updated because BUF instructions
without an MMO are now tracked as a SCRATCH_WRITE_ACCESS. I added an MMO
that touches global to keep the test result unchanged. I also added a
couple of testcases with no MMO to test the corrected behavior.
2025-12-03 10:37:58 +01:00
Ryan Mitchell
5e4505d562
[AMDGPU][SIInsertWaitCnts] Gfx12.5 - Refactor xcnt optimization (#164357)
Refactor the XCnt optimization checks so that they can be checked when
applying a pre-existing waitcnt. This removes unnecessary xcnt waits
when taking a loop backedge.
2025-11-13 18:43:12 +00:00
Jay Foad
5e4f177142
[AMDGPU] Fix missing S_WAIT_XCNT with multiple pending VMEMs (#166779) 2025-11-12 09:44:08 +00:00
Aaditya
c8187f6539
[AMDGPU] Fix Xcnt handling between blocks (#165201)
For blocks with multiple predecessors, there may be `SMEM` and `VMEM`
events active at the same time. This patch handles these cases.
2025-11-01 16:48:48 +05:30
Aaditya
982c9e6ac5
[AMDGPU][NFC] Use getScoreUB for XCNT insertion. (#162448) 2025-10-13 11:07:05 +05:30
Aaditya
19cd5bd350
[AMDGPU] Account for implicit XCNT insertion (#160812)
Hardware inserts an implicit `S_WAIT_XCNT 0` between 
alternate SMEM and VMEM instructions, so there are 
never outstanding address translations for both SMEM 
and VMEM at the same time.
2025-10-03 13:38:37 +05:30