The ASYNC_CNT is used to track the progress of asynchronous copies
between global and LDS memories. By including it in asyncmark, the
compiler can now assist the programmer in generating waits for
ASYNC_CNT.
Assisted-By: Claude Sonnet 4.5
This is part of a stack:
- #185813
- #185810
Fixes: LCOMPILER-332
Non-entry functions were unconditionally aligned to 4 bytes with no
architecture-specific preferred alignment, and setAlignment() was used
instead of ensureAlignment(), overwriting any explicit IR attributes.
Add instruction cache line size and fetch alignment data to GCNSubtarget
for each generation (GFX9: 64B/32B, GFX10: 64B/4B, GFX11+: 128B/4B). Use
this to call setPrefFunctionAlignment() in SITargetLowering, aligning
non-entry functions to the cache line size by default. Change
setAlignment to ensureAlignment in AMDGPUAsmPrinter so explicit IR align
attributes are respected.
Empirical thread trace analysis on gfx942, gfx1030, gfx1100, and gfx1200
showed that only GFX9 exhibits measurable fetch stalls when functions
cross the 32-byte fetch window boundary. GFX10+ showed no alignment
sensitivity. A hidden option -amdgpu-align-functions-for-fetch-only is
provided to use the fetch granularity instead of cache line size.
Assisted-by: Claude Opus
This patch adds target features:
- `+dpp-wavefront-shifts`, for DPP `wave_shl/rol/shr/ror`
- `+dpp-row-bcast`, for DPP `row_bcast15/31`
These DPP controls are not available in gfx10+, so these target features
enable `AMDGPURemoveIncompatibleFunctions` to remove functions that rely
on these controls when compiling for newer GPUs.
On GFX9, the instruction sequencer fetches 32 bytes at a time. When an
8-byte instruction at a loop header straddles a 32-byte fetch window
boundary, the sequencer must perform two fetches after a backward
branch, incurring a delay. On GFX950, this causes additional performance
issues.
This patch adds 32-byte alignment (.p2align 5, , 4) for loop headers on
GFX950 when the first real instruction is 8 bytes. At most one s_nop (4
bytes, 1 quad-cycle before the loop) is used for padding. If more than 4
bytes of padding were needed, the 8-byte instruction would not straddle
a 32-byte boundary anyway, so alignment is skipped.
Note: the alignment decision is made during block-placement, before
si-insert-waitcnts. In loops where a 4-byte S_WAITCNT is later inserted
as the first instruction, the alignment becomes redundant but mostly
harmless (at most one extra s_nop per affected loop).
Assisted-by: Claude (Anthropic)
Introduce two new subtarget features:
- WMMA256bInsts for GFX11 WMMA instructions and
- WMMA128bInsts for GFX1170 and GFX12 WMMA and SWMMAC instructions
Some WMMA instructions have changed from GFX 11.0 to GFX 11.7 so new
Real versions were added with "_gfx1170" suffix. For consistency all
WMMA and SWMMAC GFX11.7 instructions use this suffix.
To resolve decoding issues between different formats for some WMMA
instructions between GFX 11 and GFX 11.7, new decoding tables were
added.
Adds logic to detect cases where the llvm.amdgcn.wave.shuffle intrinsic
is being applied to an index operand that would make the result
equivalent to the various Row Share flavors of DPP16 operations, and
replaces the intrinsic and the instructions computing the index with an
equivalent llvm.amdgcn.update.dpp call.
We can finally get rid of the manually defined boolean variables, like
other targets. Even though most of them are now defined by macros, we
still need to add the entries.
Extend the X-macro pattern to eliminate boilerplate for additional
subtarget features.
This reduces ~50 lines of repetitive member declarations and getter
definitions.
This PR extends the multiclass to support two additional parameters: one
for specifying whether an `AssemblerPredicate` should be generated, and
another for dependent `SubtargetFeatures`. This allows 15 more
definitions to be converted to use the multiclass.
`GCNSubtarget.h` contained a large amount of repetitive code following
the pattern `bool HasXXX = false;` for member declarations and `bool
hasXXX() const { return HasXXX; }` for getters. This boilerplate made
the file unnecessarily long and harder to maintain.
This patch introduces an X-macro pattern `GCN_SUBTARGET_HAS_FEATURE`
that consolidates 135 simple subtarget features into a single list. The
macro is expanded twice: once in the protected section to generate
member variable declarations, and once in the public section to generate
the corresponding getter methods. This reduces the file by approximately
600 lines while preserving the exact same API and functionality.
Features with complex getter logic or inconsistent naming conventions
are left as manual implementations for future improvement.
Ideally, these could be generated by TableGen using
`GET_SUBTARGETINFO_MACRO`, similar to the X86 backend. However,
`AMDGPU.td` has several issues that prevent direct adoption: duplicate
field names (e.g., `DumpCode` is set by both `FeatureDumpCode` and
`FeatureDumpCodeLower`), and inconsistent naming conventions where many
features don't have the `Has` prefix (e.g., `FlatAddressSpace`,
`GFX10Insts`, `FP64`). Fixing these issues would require renaming fields
in `AMDGPU.td` and updating all references, which is left for future
work.
This PR handles`v_pk_fmac_f16` inline constant encoding/decoding
differences between pre-GFX11 and GFX11+ hardware.
- Pre-GFX11: fp16 inline constants produce `(f16, 0)` - value in low 16
bits, zero in high.
- GFX11+: fp16 inline constants are duplicated to both halves `(f16,
f16)`.
Fixes#94116.
Before, GlobalISel would still return true for lowering the intrinsic
for GFX7 and earlier even though the required ds_bpermute_b32
instruction is not supported. After this change, GlobalISel will
properly report failure to select in this case. Testing is updated
appropriately.
Signed-off-by: Domenic Nutile <domenic.nutile@gmail.com>
This change adds a new intrinsic for AMDGPU that implements a wave
shuffle, allowing arbitrary swizzling between lanes using an index. In
the initial version of this commit, there was an issue in one of the
tests added that returned a signal, causing testing to fail when
combined with another recent change to 'not'.
For context on the initial commit see #167372
---------
Signed-off-by: Domenic Nutile <domenic.nutile@gmail.com>
Co-authored-by: Jay Foad <jay.foad@gmail.com>
This intrinsic will be useful for implementing the
OpGroupNonUniformShuffle operation in the SPIR-V reference
---------
Signed-off-by: Domenic Nutile <domenic.nutile@gmail.com>
Co-authored-by: Jay Foad <jay.foad@gmail.com>
The xcnt wait is actually required before any memory access that can
only be done once, so atomic stores and volatile accesses are affected.
This patch also ensures buffer instructions are handled.
On some targets, a packed f32 instruction can only read 32 bits from a
scalar operand (SGPR or literal) and replicates the bits to both
channels. In this case, we should not fold an immediate value if it
can't be replicated from its lower 32-bit.
Fixes SWDEV-567139.
Introduce a target hook to incrementally flip the behavior of
targets with test changes, and start by implementing it for AMDGPU.
This appears to be forgotten switch flip from 2015. This
seems to do a nicer job with subregister copies. Most of the
test changes are improvements or neutral, not that many are
light regressions. The worst AMDGPU regressions are for true16
in the atomic tests, but I think that's due to existing true16
issues.
Make sure we cannot be in a mode with both wavesizes. This
prevents assertions in a future change. This should probably
just be an error, but we do not have a good way to report
errors from the MCSubtargetInfo constructor.