For V_DOT2_F32_F16 and V_DOT2_F32_BF16 add their VOPDName and mark
them with usesCustomInserter which will be used to add pre-RA register
allocation hints to preferably assign dst and src2 to the same physical
register. When the hint is satisfied, canMapVOP3PToVOPD recognises the
instruction as eligible for VOPD pairing by checking that it is VOP2-like:
dst==src2, no source modifiers, no clamp, and src1 is a register.
Mark both instructions as commutable to allow a literal in src1 to be
moved to src0, since VOPD only permits a literal in src0.
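The VOP2-like check described above could be sketched as follows. This is a simplified illustration only: the `Dot2Inst` struct and its field names are hypothetical stand-ins for the operand queries the real canMapVOP3PToVOPD performs, not LLVM's MachineInstr API.

```cpp
#include <cassert>

// Hypothetical, simplified view of a dot2 instruction after register
// allocation. The field names are illustrative, not LLVM's API.
struct Dot2Inst {
  unsigned Dst;         // physical register assigned to dst
  unsigned Src2;        // physical register assigned to src2 (accumulator)
  bool Src1IsReg;       // true if src1 is a register operand
  bool HasSrcModifiers; // true if any source modifier is set
  bool HasClamp;        // true if the clamp bit is set
};

// Returns true when the instruction is VOP2-like in the sense used above:
// dst==src2, no source modifiers, no clamp, and src1 is a register.
constexpr bool isVOP2LikeForVOPD(const Dot2Inst &I) {
  return I.Dst == I.Src2 && !I.HasSrcModifiers && !I.HasClamp && I.Src1IsReg;
}
```

If the allocation hint is not satisfied (dst and src2 land in different registers), the check fails and the instruction simply stays in its VOP3P form.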
Async operations transfer data between global memory and LDS. Their
progress is tracked by the ASYNC_CNT counter on GFX1250 and later
architectures. This change introduces the representation of that counter
in SIInsertWaitCnts. For now, the programmer must manually insert
s_wait_asyncnt instructions. Later changes will add compiler assistance
for generating the waits by including this counter in the asyncmark
instructions.
Assisted-by: Claude Sonnet 4.5
This is part of a stack:
- #185813
- #185810
This patch replaces the member variables of Waitcnt with an array. This
helps in several ways:
(i) It replaces switch cases with array accesses, and
(ii) It makes it much easier to operate on all elements with a loop, and
should require less maintenance if we add more counters.
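A minimal sketch of the array-backed layout, to show why loops replace per-member code. The class name `WaitcntSketch`, the reduced counter list, and the use of ~0u as "no wait required" are assumptions made for illustration; the real Waitcnt and InstrCounterType have more entries.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>

// Illustrative subset of counter kinds (assumption, not the full list).
enum CounterKind : unsigned { LOAD_CNT, DS_CNT, EXP_CNT, NUM_COUNTERS };

class WaitcntSketch {
  // One slot per counter; ~0u is assumed to mean "no wait required".
  std::array<unsigned, NUM_COUNTERS> Cnt;

public:
  WaitcntSketch() { Cnt.fill(~0u); }

  // Array accesses replace a switch over the counter kind.
  unsigned get(CounterKind T) const { return Cnt[T]; }
  void set(CounterKind T, unsigned Val) { Cnt[T] = Val; }

  // Operating on all counters is a single loop; adding a counter to the
  // enum needs no change here.
  WaitcntSketch combined(const WaitcntSketch &Other) const {
    WaitcntSketch R;
    for (std::size_t I = 0; I < NUM_COUNTERS; ++I)
      R.Cnt[I] = std::min(Cnt[I], Other.Cnt[I]);
    return R;
  }

  // True if at least one counter requests a wait.
  bool hasWait() const {
    for (unsigned V : Cnt)
      if (V != ~0u)
        return true;
    return false;
  }
};
```

With separate member variables, `combined` and `hasWait` would each need one line per counter, which is the maintenance cost the patch removes.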
The maximum VGPR usage of a shader is limited based on the target occupancy,
ensuring that the targeted number of waves actually fit onto a CU/WGP.
However, in dynamic VGPR mode, we should not do that, because VGPRs are
allocated dynamically at runtime, and there are no static constraints based
on occupancy.
Fix that in this patch.
Also fixup the getMinNumVGPRs helper to behave consistently by always
returning zero in dVGPR mode.
This also fixes a problem where AMDGPUAsmPrinter bumps the VGPR usage to at
least the result of getMinNumVGPRs, per my understanding in order to avoid
an occupancy that is higher than the occupancy target. That was causing
incorrect (too high) VGPR usages in dVGPR mode with medium-sized workgroups
(say 768).
Fold the second operand (m0) of llvm.amdgcn.s.sendmsg and
llvm.amdgcn.s.sendmsghalt to poison when the message type does not use
m0.
Only MSG_GS_ALLOC_REQ (message ID 9) actually reads the m0 value. All
other message types ignore it, so we can fold the operand to poison,
which eliminates unnecessary s_mov_b32 m0, 0 instructions in the
generated code.
Fixes https://github.com/llvm/llvm-project/issues/183605
- Added InstCombine case for amdgcn_s_sendmsg and amdgcn_s_sendmsghalt
intrinsics
- Extract message ID using 8-bit mask to handle both pre-GFX11 (4-bit)
and GFX11+ (8-bit) encoding
- Only preserve m0 operand for ID_GS_ALLOC_REQ
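The decision made by the fold can be sketched like this. The function and constant names are hypothetical; only the 8-bit mask and the message ID 9 come from the description above.

```cpp
#include <cstdint>

// 8-bit mask, wide enough for both the pre-GFX11 (4-bit) and the
// GFX11+ (8-bit) message ID encodings.
constexpr uint32_t MsgIdMask = 0xFF;
// MSG_GS_ALLOC_REQ, the only message type that reads m0.
constexpr uint32_t IdGsAllocReq = 9;

// Returns true if the m0 operand must be preserved for this sendmsg
// immediate; for every other message the operand can become poison.
constexpr bool sendmsgReadsM0(uint32_t Imm) {
  return (Imm & MsgIdMask) == IdGsAllocReq;
}
```

When `sendmsgReadsM0` returns false, replacing the operand with poison lets later passes drop the `s_mov_b32 m0, 0` that would otherwise materialise it.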
Rename TENSOR_LOAD_TO_LDS to TENSOR_LOAD_TO_LDS_d4;
Rename TENSOR_STORE_FROM_LDS to TENSOR_STORE_FROM_LDS_d4;
Also rename function names in a couple of tests to reflect this change.
This patch makes Waitcnt member variables private and replaces their
accesses with calls to set() or get(). This will help us change the
implementation to an array in the follow-up patch.
Since #182059 there is only one case in which these functions return -1,
so callers no longer need to distinguish between (int64_t)-1 and
(uint32_t)-1, so we can go back to a 32-bit return value like it was
before #180954.
Introduce two new subtarget features:
- WMMA256bInsts for GFX11 WMMA instructions and
- WMMA128bInsts for GFX1170 and GFX12 WMMA and SWMMAC instructions
Some WMMA instructions have changed from GFX 11.0 to GFX 11.7 so new
Real versions were added with "_gfx1170" suffix. For consistency all
WMMA and SWMMAC GFX11.7 instructions use this suffix.
To resolve decoding issues between different formats for some WMMA
instructions between GFX 11 and GFX 11.7, new decoding tables were
added.
This patch introduces `get(T)` and `set(T, Val)` functions for Waitcnt
and removes getCounterRef() and getWait(). For this to work we also need
to move InstrCounterType to AMDGPUBaseInfo.h.
Please note that the member variables are still public to keep this
patch small.
They will be replaced in the follow-up patch.
This patch implements a custom printer/parser for the immediate operand
of s_wait_alu that prints/parses the decoded counter values.
Format:
```
.<counter1>_<value1>_<counter2>_<value2>
```
Example:
`s_wait_alu .VaVdst_1_VmVsrc_1`
Which is equivalent to this:
`s_wait_alu 8167`
Features:
- If a counter is at its maximum value it won't get printed.
- The parser will error out if a counter is greater than or equal to its
max value.
- If all counters are disabled we can use 'AllOff'.
- For now we also accept numeric values for backwards compatibility with
older MIR.
Note: This is similar to https://github.com/llvm/llvm-project/pull/96004
but for `s_wait_alu`.
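As a rough sketch of the printing side of the s_wait_alu format: the bit layout below is an assumption, chosen so that the example immediate 8167 decodes to VaVdst=1 and VmVsrc=1. Counter names other than VaVdst and VmVsrc, and the exact spelling returned for the all-disabled case, are guesses made for illustration.

```cpp
#include <cstdint>
#include <string>

// One counter field inside the 16-bit immediate.
struct Field {
  const char *Name;
  unsigned Shift;
  unsigned Width;
};

// Assumed layout (not taken from the patch); consistent with 8167.
static const Field Fields[] = {
    {"VaVdst", 12, 4}, {"VaSdst", 9, 3}, {"VaSsrc", 8, 1}, {"HoldCnt", 7, 1},
    {"VmVsrc", 2, 3},  {"VaVcc", 1, 1},  {"SaSdst", 0, 1},
};

// Prints ".<counter1>_<value1>_<counter2>_<value2>", skipping every
// counter that is at its maximum value (i.e. disabled).
std::string formatWaitAlu(uint16_t Imm) {
  std::string Out;
  for (const Field &F : Fields) {
    unsigned Max = (1u << F.Width) - 1;
    unsigned Val = (Imm >> F.Shift) & Max;
    if (Val == Max)
      continue; // counter disabled, not printed
    Out += Out.empty() ? "." : "_";
    Out += F.Name;
    Out += "_";
    Out += std::to_string(Val);
  }
  // All counters disabled; the leading dot here is a guess.
  return Out.empty() ? ".AllOff" : Out;
}
```

Under these assumptions `formatWaitAlu(8167)` yields `.VaVdst_1_VmVsrc_1`, matching the example.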
This PR handles `v_pk_fmac_f16` inline constant encoding/decoding
differences between pre-GFX11 and GFX11+ hardware.
- Pre-GFX11: fp16 inline constants produce `(f16, 0)` - value in low 16
bits, zero in high.
- GFX11+: fp16 inline constants are duplicated to both halves `(f16,
f16)`.
Fixes #94116.
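The difference between the two hardware behaviours can be illustrated with a small helper that expands a 16-bit inline constant to the 32-bit value the instruction sees. The function name and the boolean parameter are hypothetical.

```cpp
#include <cstdint>

// Expands the raw fp16 bits of an inline constant to the packed 32-bit
// operand value:
//   pre-GFX11: (f16, 0)   - value in the low 16 bits, zero in the high
//   GFX11+:    (f16, f16) - value duplicated into both halves
constexpr uint32_t expandF16InlineConst(uint16_t Bits, bool IsGFX11Plus) {
  return IsGFX11Plus ? (static_cast<uint32_t>(Bits) << 16) | Bits
                     : static_cast<uint32_t>(Bits);
}
```

For example, fp16 1.0 has the bit pattern 0x3C00, so the helper returns 0x00003C00 for pre-GFX11 and 0x3C003C00 for GFX11+.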
Many `SubtargetFeature` definitions in `AMDGPU.td` follow a repetitive
pattern where a `FeatureXYZ` is paired with a `HasXYZ` predicate. This
creates significant code duplication.
This PR introduces `AMDGPUSubtargetFeature` multiclass that generates
both the `SubtargetFeature` and its corresponding `Predicate` from a
single definition. The multiclass accepts an optional `GenPredicate`
parameter (default 1) to skip predicate generation when not needed.
Not converted:
- Features with dependencies - multiclass doesn't support this yet. Will
do it in a follow-up.
- Features with irregular predicates (e.g., Predicate without
`AssemblerPredicate`, negated `Predicate`, complex multi-feature
conditions). For those without `AssemblerPredicate`, this can be done by
adding an extra optional argument to indicate whether
`AssemblerPredicate` is needed. Will be done in a follow-up.
- Features where field name doesn't match the `HasXYZ` pattern.
148 features converted, saving ~529 lines of code.
Reference issue: https://github.com/ROCm/llvm-project/issues/67
This patch adds support for expanding s_waitcnt instructions into
sequences with decreasing counter values, enabling PC-sampling profilers
to identify which specific memory operation is causing a stall.
This is controlled via:
- Clang flag: -mamdgpu-expand-waitcnt-profiling /
-mno-amdgpu-expand-waitcnt-profiling
- Function attribute: "amdgpu-expand-waitcnt-profiling"
When enabled, instead of emitting a single waitcnt, the pass generates a
sequence that waits for each outstanding operation individually. For
example, if there are 5 outstanding memory operations and the target is
to wait until 2 remain:
**Original**:
s_waitcnt vmcnt(2)
**Expanded**:
s_waitcnt vmcnt(4)
s_waitcnt vmcnt(3)
s_waitcnt vmcnt(2)
The expansion starts from (Outstanding - 1) down to the target value,
since waitcnt(Outstanding) would be a no-op (the counter is already at
that value).
- Uses ScoreBrackets to determine the actual number of outstanding
operations
- Only expands when operations complete in-order
- Skips expansion for mixed event types (e.g., LDS+SMEM on same counter)
- Skips expansion for scalar memory (always out-of-order)
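The counter values emitted by the expansion can be sketched as below. This covers only the in-order case; the function name is hypothetical, and keeping the single original wait when nothing can be expanded is an assumption about the fallback.

```cpp
#include <vector>

// Returns the sequence of counter values to wait on, given the number of
// outstanding operations and the target count. Starts at Outstanding - 1
// because waiting for Outstanding itself would be a no-op.
std::vector<unsigned> expandWaitcnt(unsigned Outstanding, unsigned Target) {
  std::vector<unsigned> Seq;
  if (Outstanding <= Target) {
    // Nothing to expand; keep the original wait (assumed fallback).
    Seq.push_back(Target);
    return Seq;
  }
  for (unsigned N = Outstanding - 1; N > Target; --N)
    Seq.push_back(N);
  Seq.push_back(Target);
  return Seq;
}
```

With 5 outstanding operations and a target of 2 this produces 4, 3, 2, matching the vmcnt example.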
Related previous work for reference:
- **PR**: llvm/llvm-project#79236 (related `-amdgpu-waitcnt-forcezero`)
---------
Co-authored-by: Pierre van Houtryve <pierre.vanhoutryve@amd.com>
The 16-bit immediate operand of s_waitcnt_depctr / s_wait_alu has some
unused bits. Previously codegen would set these bits to 1, but setting
them to 0 matches the SP3 assembler behaviour better, which in turn
means that we can print them using the human readable SP3 syntax:
s_wait_alu 0xfffd ; unused bits set to 1
s_wait_alu 0xff9d ; unused bits set to 0
s_wait_alu depctr_va_vcc(0) ; unused bits set to 0, human readable
Note that the set of unused bits changed between GFX10.1 and GFX10.3.
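To make the bit arithmetic in the example concrete, here is a small sketch. The mask 0x0060 is derived only from the two example immediates (0xfffd versus 0xff9d); it is not a statement about any particular subtarget, since the set of unused bits differs between them. Both names are hypothetical.

```cpp
#include <cstdint>

// Unused bits implied by the example above (0xfffd vs 0xff9d). Assumption:
// valid for that example only.
constexpr uint16_t ExampleUnusedBits = 0x0060;

// Clears the unused bits of a depctr immediate so that it matches what
// the SP3 assembler would produce.
constexpr uint16_t clearUnusedBits(uint16_t Imm, uint16_t UnusedMask) {
  return static_cast<uint16_t>(Imm & ~UnusedMask);
}
```

Applied to 0xfffd with this mask, the helper returns 0xff9d, the value that can be printed as `depctr_va_vcc(0)`.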
There should normally be no need to generate implicit lit64()
modifiers on the assembler side. It's the encoder's responsibility
to recognise literals that are implicitly 64 bits wide.
The exceptions are where we rewrite floating-point operand values
as integer ones, which would not be assembled back to the original
values unless wrapped into lit64().
Respect explicit lit() modifiers for non-inline values as
necessary to avoid regressions in MC tests. This change still
doesn't prevent use of inline constants where lit()/lit64() is
specified; that is subject to a separate patch.
On disassembling, only create lit64() operands where necessary for
correct round-tripping.
Add round-tripping tests where useful and feasible.
This removes the special-case processing in TargetInstrInfo::getRegClass
that fixes up register operands which, depending on the subtarget, support
AGPRs or require even-aligned registers.
This regresses assembler diagnostics, which currently work by hackily
accepting invalid cases and then post-rejecting a validly parsed
instruction.
On the plus side this now emits a comment when disassembling unaligned
registers for targets with the alignment requirement.
Users of the backend are expected to enable dynamic VGPRs via the
`amdgpu-dynamic-vgpr-block-size` attribute instead of the subtarget
features (see https://github.com/llvm/llvm-project/pull/133444).
We have proper encoding facilities to encode operands and instructions;
there's no need to pollute the MC representation with encoding details.
This is supposed to be NFCI, but happens to fix some re-encoded instruction
codes in disassembler tests.
The 64-bit operands are to be addressed in following patches introducing
MC-level representation for lit() and lit64() modifiers, to then be
respected by both the assembler and disassembler.