MSG_DEALLOC_VGPRS slows down very small waveslot limited kernels. It's
been identified this message is only really needed for VGPR limited
kernels. A kernel becomes VGPR limited if a total number of VGPRs per
SIMD / number of used VGPRs is more than a number of wave slots.
For setScore, the root function is setScoreByInterval with RegInterval
input
For determineWait, the root function is determineWait with RegInterval
input
When generating waitcounts before a use or def skip VGPRs. We never have
a real implicit VGPR operands on memory instructions, it is only for
super-reg liveness accounting.
Some other instructions (MOVRELS as an example) may have real implicit
VGPR uses though.
This is less then ideal but most of the problems observed with spills.
Implicit defs and uses on spill stores were accounted as real defs and
uses, while only exist for liveness accounting. As a result unneded
waits were generated.
Fixes: SWDEV-484177
When a loop contains a VMEM load whose result is only used outside the
loop, do not bother to flush vmcnt in the loop head on GFX12. A wait for
vmcnt will be required inside the loop anyway, because VMEM instructions
can write their VGPR results out of order.
SLoadAddresses previously held data across different functions and used
these for dominance queries of blocks in different functions. This is
not intended; clear the state at the end of the pass.
An appropriately configured image resource descriptor can trigger
image_sample instructions to store outputs directly to a linked memory
location instead of returning to VGPRs.
This is opaque to the backend as instruction encoding is unchanged;
however, a mechanism is require to allow frontends to communicate that
these instructions do not require destination VGPRs and store to memory.
Flagging these as stores means they will not be optimized away.
I do not understand what optimization this was supposed to implement.
It has never been enabled. I suspect it no longer applies to GCN/RDNA
architectures.
They *are* still accepted by the HW but have a conservative effect.
Leave them untouched since handling them would complicate the logic a
bit, and developers who code to such a low level really need to revisit
what they're doing anyway.
https://github.com/llvm/llvm-project/pull/90201 made some fixes for
gfx12
image_msaa_load waitcnt insertion.
That fix might break in some situations for pre-gfx12 - this fixes that
by
explitly checking for VSAMPLE which always requires a s_wait_samplecnt
and
leaves the previous logic intact for non-gfx12.
Code to determine if a waitcnt is required before a barrier instruction
only
considered S_BARRIER.
gfx12 adds barrier_signal/wait so need to enhance the existing code to
look for
a barrier start (which is just an S_BARRIER for earlier architectures).
The autogenerated memory legalizer tests use -O0 so this allows us to
see the exact waitcnts that were inserted by the memory legalizer
without them being optimized away.
Fixes#82659
There are some functions, such as `findRegisterDefOperandIdx` and `findRegisterDefOperand`, that have too many default parameters. As a result, we have encountered some issues due to the lack of TRI parameters, as shown in issue #82411.
Following @RKSimon 's suggestion, this patch refactors 9 functions, including `{reads, kills, defines, modifies}Register`, `registerDefIsDead`, and `findRegister{UseOperandIdx, UseOperand, DefOperandIdx, DefOperand}`, adjusting the order of the TRI parameter and making it required. In addition, all the places that call these functions have also been updated correctly to ensure no additional impact.
After this, the caller of these functions should explicitly know whether to pass the `TargetRegisterInfo` or just a `nullptr`.
Deallocating VGPRs interferes with doing a context save, which is needed for GDB
to report a breakpoint. So, in this sequence:
s_sendmsg MSG_DEALLOC_VGPRS
s_endpgm
We now use the debug location of the s_endpgm for the s_sendmsg, so a breakpoint
set in the debugger at the end of a shader will be hit before deallocating VGPRs.
This patch introduces a new command-line option for clang, namely,
amdgpu-precise-mem-op (or precise-memory in the backend). When this option is specified, a waitcnt
instruction is generated after each memory load/store instruction. The
counter values are always 0, but which counters are involved depends on
the memory instruction.
---------
Co-authored-by: Jun Wang <jun.wang7@amd.com>
Call generateWaitcnt unconditionally at the end of
SIInsertWaitcnts::insertWaitcntInBlock. Even if we don't need to
generate a new waitcnt instruction it has the effect of combining or
removing redundant waitcnts that were already present. Tests show
various small improvements in waitcnt placement.
Update SIMemoryLegalizer and SIInsertWaitcnts to use separate wait
instructions per counter (e.g. S_WAIT_LOADCNT) and split VMCNT into
separate LOADCNT, SAMPLECNT and BVHCNT counters.
LDA DMA loads increase VMCNT and a load from the LDS stored must wait on
this counter to only read memory after it is written. Wait count
insertion pass does not track memory dependencies, it tracks register
dependencies. To model the LDS dependency a pseudo register is used in
the scoreboard, acting like if LDS DMA writes it and LDS load reads it.
This patch adds 8 more pseudo registers to use for independent LDS
locations if we can prove they are disjoint using alias analysis.
Fixes: SWDEV-433427
Fix a potential hang introduced by #77439 and #77935. This line:
setScoreUB(VS_CNT, getScoreLB(VS_CNT) + getWaitCountMax(VS_CNT));
could potentialy set UB lower than it was before, which confused
SIInsertWaitcnts's fixed point algorithm.
This was only triggered a STORE instruction with an implicit-def, which
seems odd but apparently happens for some spills.
This avoids listing all soft waitcnt opcodes in two places
(getNonSoftWaitcntOpcode and isSoftWaitcnt) and avoids the need for
helpers isWaitcnt and isWaitcntVsCnt.
Always set the upper bound for VS_CNT higher than the lower bound.
Before #77439 this code was only executed on function entry where the
lower bound was 0 so it was not a problem.
Fixes#77931
Calls do not have to wait for VsCnt, so after they return there might
still be scratch stores in progress. It's important that we don't send
the DEALLOC_VGPR message in that case, since that might release the
VGPRs and scratch allocation before those stores are complete.
This means that getRegInterval no longer depends on the MCInstrDesc, so
it could be simplified further to take just a MachineOperand or just a
physical register. NFCI.