This patch introduces `get(T)` and `set(T, Val)` functions for Waitcnt
and removes getCounterRef() and getWait(). For this to work we also need
to move InstrCounterType to AMDGPUBaseInfo.h.
Please note that the member variables are still public to keep this
patch small.
They will be replaced in the follow-up patch.
The user of the WaitcntBrackets class shouldn't need to know about how
the scoreboard has been implemented internally. So I think it is best to
provide a higher level API that hides things like scoreUB, scoreLB and
score ranges.
This patch makes getScoreUB(), getScoreLB() and getScoreRange() private
and introduces new functions that don't expose the internal
implementation:
- getOutstanding(T)
- hasPendingVMEM(VMEMID, T)
- empty(T)
I also noticed that getSGPRScore() and getVMemScore() are not used
externally so these are now private.
With this patch we are no longer exposing the internal data structure
that holds the WaitEvents to the user through the `getWaitEventMask()`
API. Instead we only allow the user to query a specific type and get the
corresponding `WaitEventSet` with `getWaitEvents(T)`.
Note: This patch also renames `getWaitEventMask()` to `getWaitEvents()`
because we are no longer returning a mask but instead a `WaitEventSet`
object.
On GFX9, BUFFER_WBL2 is used to write back dirty cache lines and
requires an s_waitcnt vmcnt(0) afterwards to ensure completion.
This patch fixes by incrementing vmcnt for buffer_wbl2 instruction
---------
Co-authored-by: Jay Foad <jay.foad@gmail.com>
The eventCounter() function searches through the array of events. This
array is owned by the WaitcntGenerator class.
This patch moves the function into the WaitcntGenerator class which
helps hide the event array from the user.
It also renames it to getCounterFromEvent().
This should be NFC.
Wait is initialized with all ~0s and by the time it reaches the updated
line it still holds the same value. So Wait.combined(AllZeroWait) is
effectively combining all ~0s with AllZeroWait and given that combined()
returns the min() of the two it should always return AllZeroWait.
So this patch replaces the assignment with `= AllZeroWait` to make it
easier to read.
Before this patch WaitEventType events used to be collected in unsigned
integers that were used as small bit vectors.
This patch introduces a WaitEventSet container class to replace the
integer bit vectors with a class that hides the implementation of common
operations like insertion, removal, union, intersection etc. from the
user.
The WaitEventSet API matches that of a set and not a vector because we
don't care about the order of its contents. Internally though it is
still a bit vector that uses an unsigned integer as its storage, just
like the original implementation.
This patch should not change the functionality.
We inserted the DEALLOC_VGPRS message if there were no pending scratch
stores the first time an S_ENDPGM instruction was visited. But because
this pass uses a worklist to revisit blocks until it reaches a fixed
point, it is possible that pending scratch stores are only discovered on
the second or later visit to a block. Fix this by storing a flag for
each S_ENDPGM instruction which can be updated by later visits.
This patch optimizes the insertion of s_wait_xcnt instruction for
sequences of atomic read-modify-write (RMW) operations in the
SIInsertWaitcnts pass. The Memory Legalizer conservatively inserts a
soft xcnt instruction before each atomic RMW operation as part of PR
168852, which is correct given the nature of atomic operations.
However, for back-to-back atomic RMWs, only the first s_wait_xcnt is
necessary for better runtime performance. This patch tracks atomic
RMW blocks within each basic block and removes redundant soft xcnt
instructions, keeping only the first wait in each sequence. An atomic
RMW block continues through subsequent atomic RMWs and non-memory
instructions (e.g., ALU operations) but is broken by CU-scoped memory
operations, atomic stores, or basic block boundaries.
This reverts commit 0dd03598dca91c93c74b94a714c38a4ffad0ed1c.
Apparently the stack usage on Windows was still large enough to cause
problems for some DirectX games.
`GCNSubtarget.h` contained a large amount of repetitive code following
the pattern `bool HasXXX = false;` for member declarations and `bool
hasXXX() const { return HasXXX; }` for getters. This boilerplate made
the file unnecessarily long and harder to maintain.
This patch introduces an X-macro pattern `GCN_SUBTARGET_HAS_FEATURE`
that consolidates 135 simple subtarget features into a single list. The
macro is expanded twice: once in the protected section to generate
member variable declarations, and once in the public section to generate
the corresponding getter methods. This reduces the file by approximately
600 lines while preserving the exact same API and functionality.
Features with complex getter logic or inconsistent naming conventions
are left as manual implementations for future improvement.
Ideally, these could be generated by TableGen using
`GET_SUBTARGETINFO_MACRO`, similar to the X86 backend. However,
`AMDGPU.td` has several issues that prevent direct adoption: duplicate
field names (e.g., `DumpCode` is set by both `FeatureDumpCode` and
`FeatureDumpCodeLower`), and inconsistent naming conventions where many
features don't have the `Has` prefix (e.g., `FlatAddressSpace`,
`GFX10Insts`, `FP64`). Fixing these issues would require renaming fields
in `AMDGPU.td` and updating all references, which is left for future
work.
Add support for flushing DS_CNT in loop preheaders when the loop uses
values that were DS-loaded outside the loop. This is similar to the
existing VMEM loop optimization.
Assisted-by: Cursor / claude-4.5-opus-high
Reference issue: https://github.com/ROCm/llvm-project/issues/67
This patch adds support for expanding s_waitcnt instructions into
sequences with decreasing counter values, enabling PC-sampling profilers
to identify which specific memory operation is causing a stall.
This is controlled via:
Clang flag: -mamdgpu-expand-waitcnt-profiling /
-mno-amdgpu-expand-waitcnt-profiling
Function attribute: "amdgpu-expand-waitcnt-profiling"
When enabled, instead of emitting a single waitcnt, the pass generates a
sequence that waits for each outstanding operation individually. For
example, if there are 5 outstanding memory operations and the target is
to wait until 2 remain:
**Original**:
s_waitcnt vmcnt(2)
**Expanded**:
s_waitcnt vmcnt(4)
s_waitcnt vmcnt(3)
s_waitcnt vmcnt(2)
The expansion starts from (Outstanding - 1) down to the target value,
since waitcnt(Outstanding) would be a no-op (the counter is already at
that value).
- Uses ScoreBrackets to determine the actual number of outstanding
operations
- Only expands when operations complete in-order
- Skips expansion for mixed event types (e.g., LDS+SMEM on same counter)
- Skips expansion for scalar memory (always out-of-order)
Releated previous work for Reference
- **PR**: llvm/llvm-project#79236 (related `-amdgpu-waitcnt-forcezero`)
---------
Co-authored-by: Pierre van Houtryve <pierre.vanhoutryve@amd.com>
The original design was:
- WaitcntBrackets::simplifyWaitcnt(Wait) updates Wait based on the
current state of WaitcntBrackets, removing unnecesary waits.
- WaitcntBrackets::applyWaitcnt(Wait) updates WaitBrackets based on
Wait, updating the state by applying the specified waits.
This was changed by #164357 which started calling applyWaitcnt from
simplifyWaitcnt.
This patch restores the original design without any significant
functional changes. There is some code duplication because both
simplifyWaitcnt and applyWaitcnt need to understand how XCNT interacts
with other counters like LOADCNT and KMCNT.
Start documenting the ABI conventions for dependency counters on
function call and return.
Stop pretending that SIInsertWaitcnts can handle anything other than the
default documented behavior.
This reverts commit 008c875be85732f72c4df4671167f5be79f449eb.
PR #162077 / #171779 shrunk the WaitcntBrackets class by using DenseMaps
instead of large arrays, so the size of a temporary WaitcntBrackets
allocated on the stack is no longer a concern.
With this patch on Linux I measured the stack size of
SIInsertWaitcnts::run increasing from 456 bytes to 632 bytes.
At the moment the MIR tests are somewhat redundant. The waitcnt
one is needed to ensure we actually have a load, given we are
currently just emitting an error on ExternalSymbol. The asm printer
one is more redundant for the moment, since it's stressed by the IR
test. However I am planning to change the error path for the IR test,
so it will soon not be redundant.
On GFX12+, GLOBAL_INV increments the loadcnt counter but does not write
results to any VGPRs. Previously, we unconditionally inserted
s_wait_loadcnt 0 at function returns even when the only pending loadcnt
was from GLOBAL_INV instructions.
This patch optimizes waitcnt insertion by skipping the loadcnt wait at
function boundaries when no VGPRs have pending loads. This is determined
by checking if any VGPR has a score greater than the lower bound for
LOAD_CNT - if not, the pending loadcnt must be from non-VGPR-writing
instructions like GLOBAL_INV.
The optimization is limited to GFX12+ targets where GLOBAL_INV exists
and uses the extended wait count instructions.
This is a follow-up optimization to PR #135340 which added tracking for
GLOBAL_INV in the waitcnt pass.
Fixed a crash in Blender due to some weird control flow.
The issue was with the "merge" function which was only looking at the
keys of the "Other" VMem/SGPR maps. It needs to look at the keys of both
maps and merge them.
Original commit message below
----
The pass was already "reinventing" the concept just to deal with 16 bit
registers. Clean up the entire tracking logic to only use register
units.
There are no test changes because functionality didn't change, except:
- We can now track more LDS DMA IDs if we need it (up to `1 << 16`)
- The debug prints also changed a bit because we now talk in terms of
register units.
This also changes the tracking to use a DenseMap instead of a massive
fixed size table. This trades a bit of access speed for a smaller memory
footprint. Allocating and memsetting a huge table to zero caused a
non-negligible performance impact (I've observed up to 50% of the time
in the pass spent in the `memcpy` built-in on a big test file).
I also think we don't access these often enough to really justify using
a vector. We do a few accesses per instruction, but not much more. In a
huge 120MB LL file, I can barely see the trace of the DenseMap accesses.
The pass was already "reinventing" the concept just to deal with 16 bit
registers. Clean up the entire tracking logic to only use register
units.
There are no test changes because functionality didn't change, except:
- We can now track more LDS DMA IDs if we need it (up to `1 << 16`)
- The debug prints also changed a bit because we now talk in terms of
register units.
This also changes the tracking to use a DenseMap instead of a massive
fixed size table. This trades a bit of access speed for a smaller memory
footprint. Allocating and memsetting a huge table to zero caused a
non-negligible performance impact (I've observed up to 50% of the time
in the pass spent in the `memcpy` built-in on a big test file).
I also think we don't access these often enough to really justify using
a vector. We do a few accesses per instruction, but not much more. In a
huge 120MB LL file, I can barely see the trace of the DenseMap accesses.
The change in #170263 does not do justice to common knowledge in the backend.
Fix the comment to reflect the relation between FLAT encoding, flat pointer
access, and LDSDMA operations.
Previously, we would miss inserting a wait if the ds_read had AA info,
but it didn't match
any LDS DMA op, for example if we didn't track the LDS DMA op it aliases
with because it exceeded the tracking limit.
Flat instructions need a waitcnt(0) on both VMEM and LDS accesses, but
only when the instruction really is using flat addressing. The LDS DMA
instructions (on GFX9) have the FLAT flag set, but they have very clear
semantics. These instructions update only VM_CNT (on GFX9), and hence do
not need to be treated like actual flat instructions.
BUF instructions can access the scratch address space, so
SIInsertWaitCnt needs to be able
to track the SCRATCH_WRITE_ACCESS event for such BUF instructions.
The release-vgprs.mir test had to be updated because BUF instructions
w/o a MMO are now
tracked as a SCRATCH_WRITE_ACCESS. I added a MMO that touches global to
keep the test result unchanged. I also added a couple of testcases with no MMO to test the corrected behavior.
Refactor the XCnt optimization checks so that they can be checked when
applying a pre-existing waitcnt. This removes unnecessary xcnt waits
when taking a loop backedge.
Hardware inserts an implicit `S_WAIT_XCNT 0` between
alternate SMEM and VMEM instructions, so there are
never outstanding address translations for both SMEM
and VMEM at the same time.