293 Commits

Author SHA1 Message Date
Sameer Sahasrabuddhe
f9adee2f6b
[AMDGPU] asyncmark support for ASYNC_CNT (#185813)
The ASYNC_CNT is used to track the progress of asynchronous copies
between global and LDS memories. By including it in asyncmark, the
compiler can now assist the programmer in generating waits for
ASYNC_CNT.

Assisted-By: Claude Sonnet 4.5

This is part of a stack:

- #185813
- #185810 

Fixes: LCOMPILER-332
2026-04-07 07:23:09 +05:30
Mirko Brkušanin
93d7583f4f
[AMDGPU] Update features for gfx1170 (#186107)
- Enable `NoF16PseudoScalarTransInlineConstants` for 11.7.
- Add test for `RequiredExportPriority`, one of the differences between
11.5 and 11.7.
2026-03-20 17:04:17 +01:00
Mirko Brkušanin
a5aa136eb3
[AMDGPU] Add GFX11_7Insts feature, eliminate isGFX1170 helpers. NFC (#185878) 2026-03-11 17:05:18 +01:00
michaelselehov
cb3fbe921b
[AMDGPU] Set preferred function alignment based on icache geometry (#183064)
Non-entry functions were unconditionally aligned to 4 bytes with no
architecture-specific preferred alignment, and setAlignment() was used
instead of ensureAlignment(), overwriting any explicit IR attributes.

Add instruction cache line size and fetch alignment data to GCNSubtarget
for each generation (GFX9: 64B/32B, GFX10: 64B/4B, GFX11+: 128B/4B). Use
this to call setPrefFunctionAlignment() in SITargetLowering, aligning
non-entry functions to the cache line size by default. Change
setAlignment to ensureAlignment in AMDGPUAsmPrinter so explicit IR align
attributes are respected.
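
The difference between overwriting and raising an alignment can be sketched in isolation. This is a minimal stand-alone model, not the LLVM code: `Generation`, `FunctionInfo`, `icacheLineSize`, and `fetchAlignment` are hypothetical names, and only the byte values are taken from the message above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the generations named in the message.
enum class Generation { GFX9, GFX10, GFX11Plus };

// Instruction cache line size in bytes, as listed in the message.
constexpr uint32_t icacheLineSize(Generation Gen) {
  return Gen == Generation::GFX11Plus ? 128 : 64;
}

// Instruction fetch alignment in bytes, as listed in the message.
constexpr uint32_t fetchAlignment(Generation Gen) {
  return Gen == Generation::GFX9 ? 32 : 4;
}

// Toy model of a function's alignment state (hypothetical type).
struct FunctionInfo {
  uint32_t Alignment = 4;

  // Overwrites the current value, possibly lowering an explicit alignment.
  void setAlignment(uint32_t A) {
    assert(A != 0 && (A & (A - 1)) == 0 && "alignment must be a power of two");
    Alignment = A;
  }

  // Only ever raises the alignment, so an explicit larger value survives.
  void ensureAlignment(uint32_t A) {
    assert(A != 0 && (A & (A - 1)) == 0 && "alignment must be a power of two");
    Alignment = std::max(Alignment, A);
  }
};
```

In this model, a function that already carries a 256-byte alignment keeps it after `ensureAlignment(4)`, whereas `setAlignment(4)` would drop it to 4.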

Empirical thread trace analysis on gfx942, gfx1030, gfx1100, and gfx1200
showed that only GFX9 exhibits measurable fetch stalls when functions
cross the 32-byte fetch window boundary. GFX10+ showed no alignment
sensitivity. A hidden option -amdgpu-align-functions-for-fetch-only is
provided to use the fetch granularity instead of cache line size.

Assisted-by: Claude Opus
2026-03-11 07:57:37 -04:00
Matt Arsenault
8ec961e1a9
Reapply "AMDGPU: Annotate group size ABI loads with range metadata (#185420)" (#185588)
This reverts commit d5685ac6db0ae4cbca1745f18d8f2f7dc7d673a5.

Fix off by one error. The end of the range is open.
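
The open end of the range can be illustrated with a half-open interval check. This is only a sketch: `HalfOpenRange`, `rangeForInclusive`, and the bound of 1024 used in the note below are illustrative assumptions, not values taken from the patch.

```cpp
#include <cassert>
#include <cstdint>

// Half-open interval [Lo, Hi), the convention used by LLVM range metadata.
struct HalfOpenRange {
  uint64_t Lo;
  uint64_t Hi;
  bool contains(uint64_t V) const { return Lo <= V && V < Hi; }
};

// Build a range that admits every value in [Min, Max] inclusive.
// Because the end is open, it has to be Max + 1 rather than Max.
inline HalfOpenRange rangeForInclusive(uint64_t Min, uint64_t Max) {
  assert(Min <= Max && Max < UINT64_MAX && "Max + 1 must not overflow");
  return HalfOpenRange{Min, Max + 1};
}
```

With an assumed maximum of 1024, `rangeForInclusive(1, 1024)` contains 1024, while a range written as `{1, 1024}` would exclude it.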
2026-03-10 07:41:26 +00:00
Matt Arsenault
3545e51093
Revert "AMDGPU: Annotate group size ABI loads with range metadata (#185420)" (#185521)
This reverts commit 76daf31b4000623d5c9548348a859ea3ed8712e1.

Bot failure.
2026-03-10 01:04:02 +00:00
Matt Arsenault
76daf31b40
AMDGPU: Annotate group size ABI loads with range metadata (#185420)
We previously did the same for the grid size when annotated.
The group size is easier, so it's weird that this wasn't implemented
first.
2026-03-09 19:11:59 +01:00
Mirko Brkušanin
d0f50d5574
[AMDGPU] Remove DX10_CLAMP and IEEE bits from gfx1170 (#182107)
Add `DX10ClampAndIEEEMode` feature and set it for every subtarget prior
to gfx1170.
2026-03-04 12:16:41 +01:00
LU-JOHN
7585ab05d6
[AMDGPU] Enable shift64 hazard recognition for gfx9 (#183839)
Enable shift64 hazard recognition for gfx9 cores.

---------

Signed-off-by: John Lu <John.Lu@amd.com>
2026-02-28 08:59:55 -06:00
zGoldthorpe
20dba979f7
[AMDGPU] Add target features to guard DPP controls (#182391)
This patch adds target features:
- `+dpp-wavefront-shifts`, for DPP `wave_shl/rol/shr/ror`
- `+dpp-row-bcast`, for DPP `row_bcast15/31`

These DPP controls are not available in gfx10+, so these target features
enable `AMDGPURemoveIncompatibleFunctions` to remove functions that rely
on these controls when compiling for newer GPUs.
2026-02-20 07:59:10 -07:00
michaelselehov
ed0ba3cb45
[AMDGPU] Align loop headers to prevent instruction fetch split on GFX950 (#181999)
On GFX9, the instruction sequencer fetches 32 bytes at a time. When an
8-byte instruction at a loop header straddles a 32-byte fetch window
boundary, the sequencer must perform two fetches after a backward
branch, incurring a delay. On GFX950, this causes additional performance
issues.

This patch adds 32-byte alignment (.p2align 5, , 4) for loop headers on
GFX950 when the first real instruction is 8 bytes. At most one s_nop (4
bytes, 1 quad-cycle before the loop) is used for padding. If more than 4
bytes of padding were needed, the 8-byte instruction would not straddle
a 32-byte boundary anyway, so alignment is skipped.
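
The padding rule above can be checked with a little offset arithmetic. This is a sketch under stated assumptions, not the code from the patch: the helper names are hypothetical, and offsets are assumed to be multiples of 4 bytes.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t FetchWindowBytes = 32; // fetch size given in the message
constexpr uint64_t MaxPaddingBytes = 4;   // one 4-byte s_nop

// Bytes needed to reach the next 32-byte boundary (0 if already aligned).
inline uint64_t paddingToBoundary(uint64_t Offset) {
  return (FetchWindowBytes - Offset % FetchWindowBytes) % FetchWindowBytes;
}

// True if an instruction of Size bytes at Offset crosses a window boundary.
inline bool straddlesFetchWindow(uint64_t Offset, uint64_t Size) {
  assert(Size > 0 && Size <= FetchWindowBytes && "unexpected instruction size");
  return Offset % FetchWindowBytes + Size > FetchWindowBytes;
}

// Models `.p2align 5, , 4`: pad only if at most MaxPaddingBytes are needed,
// otherwise leave the offset unchanged.
inline uint64_t alignLoopHeader(uint64_t Offset) {
  uint64_t Pad = paddingToBoundary(Offset);
  return Pad <= MaxPaddingBytes ? Offset + Pad : Offset;
}
```

Under the 4-byte assumption, an 8-byte instruction straddles a window only at offset 28 within it, which is exactly the case where 4 bytes of padding suffice. At offset 24 the padding would be 8 bytes, alignment is skipped, and the instruction ends on the boundary without crossing it.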

Note: the alignment decision is made during block-placement, before
si-insert-waitcnts. In loops where a 4-byte S_WAITCNT is later inserted
as the first instruction, the alignment becomes redundant but mostly
harmless (at most one extra s_nop per affected loop).

Assisted-by: Claude (Anthropic)
2026-02-19 14:18:44 -05:00
Mirko Brkušanin
829afc4c91
[AMDGPU] Add WMMA and SWMMAC instructions for gfx1170 (#180731)
Introduce two new subtarget features:

- WMMA256bInsts for GFX11 WMMA instructions and
- WMMA128bInsts for GFX1170 and GFX12 WMMA and SWMMAC instructions

Some WMMA instructions have changed from GFX 11.0 to GFX 11.7 so new
Real versions were added with "_gfx1170" suffix. For consistency all
WMMA and SWMMAC GFX11.7 instructions use this suffix.

To resolve decoding issues between different formats for some WMMA
instructions between GFX 11 and GFX 11.7, new decoding tables were
added.
2026-02-18 19:17:48 +01:00
Domenic Nutile
5c72240617
[AMDGPU] Add DPP16 Row Share optimization for llvm.amdgcn.wave.shuffle (#177470)
Adds logic to detect cases where the llvm.amdgcn.wave.shuffle intrinsic
is being applied to an index operand that would make the result
equivalent to the various Row Share flavors of DPP16 operations, and
replaces the intrinsic and the instructions computing the index with an
equivalent llvm.amdgcn.update.dpp call.
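
The equivalence being exploited can be modelled at lane level. This is a sketch only: it checks concrete per-lane index values, whereas the optimization matches the instructions that compute the index, and `matchRowShare` is a hypothetical name. It assumes that row share N makes every lane in a row of 16 read lane N of that same row.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t RowSize = 16; // DPP16 operates on rows of 16 lanes

// Returns N (0..15) if every lane reads lane N of its own row, else -1.
inline int matchRowShare(const std::vector<uint32_t> &Index) {
  assert(!Index.empty() && Index.size() % RowSize == 0 && "whole rows only");
  const uint32_t N = Index[0] % RowSize;
  for (std::size_t Lane = 0; Lane < Index.size(); ++Lane) {
    const std::size_t RowBase = Lane - Lane % RowSize;
    if (Index[Lane] != RowBase + N)
      return -1; // this lane reads outside the row share pattern
  }
  return static_cast<int>(N);
}
```

An index vector in which all 16 lanes of a row select lane 3 matches with N = 3. If a second row also selects lane 3 of the first row, the shuffle crosses rows and the match fails.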
2026-02-06 15:31:34 -05:00
Carl Ritson
61f272d5cc
[AMDGPU] Pre-GFX10 does not need added latency for workgroup fences (#177157)
Wait counts will not typically be introduced for workgroup scope fences
in pre-GFX10 ASICs.
Hence avoid adding scheduling latency for these.
2026-01-27 10:24:05 +09:00
Shilei Tian
786a20710d
[NFCI][AMDGPU] Use GET_SUBTARGETINFO_MACRO in GCNSubtarget.h and R600Subtarget.h (#177402)
We can finally get rid of the manually defined boolean variables, like
other targets. Even though most of them are now defined by macros, we
still need to add the entries.
2026-01-25 09:38:42 -05:00
Shilei Tian
a7732479c1
[NFCI][AMDGPU] Move more attributes from AMDGPUSubtarget to GCNSubtarget (#177670)
They are simply not used by `AMDGPUSubtarget &` but directly via
`GCNSubtarget &`.
2026-01-24 07:31:42 -05:00
Shilei Tian
bc4b2765eb
[NFCI][AMDGPU] Refine AMDGPUSubtarget.h (#177473)
This PR moves code around to pave the way for using
`GET_SUBTARGETINFO_MACRO` in `GCNSubtarget.h`.
2026-01-22 18:40:07 -05:00
Shilei Tian
4b1cfc5d7c
[NFCI][AMDGPU] Final touch before moving to GET_SUBTARGETINFO_MACRO (#177401) 2026-01-22 17:33:17 +00:00
Shilei Tian
9f536c771d
[NFC][AMDGPU] Remove unused FeatureDisable (#177288) 2026-01-22 09:07:28 -05:00
Shilei Tian
02d34a76f7
[NFCI][AMDGPU] Remove more redundant code from GCNSubtarget.h (#177297)
We are getting pretty close to using `GET_SUBTARGETINFO_MACRO` in the
header with this cleanup.
2026-01-22 09:07:15 -05:00
Shilei Tian
b857faeda6 [NFC][AMDGPU] Remove stale/dangling comments 2026-01-21 20:16:17 -05:00
Shilei Tian
2692f5ed53
[NFCI][AMDGPU] Convert more SubtargetFeatures to use AMDGPUSubtargetFeature and X-macros (#177256)
Extend the X-macro pattern to eliminate boilerplate for additional
subtarget features.

This reduces ~50 lines of repetitive member declarations and getter
definitions.
2026-01-21 18:03:32 -05:00
Shilei Tian
fa4f7657a2
[AMDGPU] Further improve AMDGPUSubtargetFeature multiclass (#177077)
This PR extends the multiclass to support two additional parameters: one
for specifying whether an `AssemblerPredicate` should be generated, and
another for dependent `SubtargetFeatures`. This allows 15 more
definitions to be converted to use the multiclass.
2026-01-21 21:05:13 +00:00
Shilei Tian
1843a7fe9f
[NFCI][AMDGPU] Use X-macro to reduce boilerplate in GCNSubtarget.h (#176844)
`GCNSubtarget.h` contained a large amount of repetitive code following
the pattern `bool HasXXX = false;` for member declarations and `bool
hasXXX() const { return HasXXX; }` for getters. This boilerplate made
the file unnecessarily long and harder to maintain.

This patch introduces an X-macro pattern `GCN_SUBTARGET_HAS_FEATURE`
that consolidates 135 simple subtarget features into a single list. The
macro is expanded twice: once in the protected section to generate
member variable declarations, and once in the public section to generate
the corresponding getter methods. This reduces the file by approximately
600 lines while preserving the exact same API and functionality.
Features with complex getter logic or inconsistent naming conventions
are left as manual implementations for future improvement.
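
The double expansion described above can be sketched as follows. This is a minimal illustration, not the contents of `GCNSubtarget.h`: the list macro `DEMO_HAS_FEATURE_LIST`, the class `DemoSubtarget`, the feature names, and the setter are all hypothetical.

```cpp
// Hypothetical feature list; each entry expands through the macro X.
#define DEMO_HAS_FEATURE_LIST(X)                                               \
  X(FeatureA)                                                                  \
  X(FeatureB)                                                                  \
  X(FeatureC)

class DemoSubtarget {
protected:
// First expansion: one `bool HasXXX = false;` member per feature.
#define DEMO_DECL_MEMBER(Name) bool Has##Name = false;
  DEMO_HAS_FEATURE_LIST(DEMO_DECL_MEMBER)
#undef DEMO_DECL_MEMBER

public:
// Second expansion: one `hasXXX()` getter per feature.
#define DEMO_DECL_GETTER(Name)                                                 \
  bool has##Name() const { return Has##Name; }
  DEMO_HAS_FEATURE_LIST(DEMO_DECL_GETTER)
#undef DEMO_DECL_GETTER

  // Hypothetical setter so the sketch can be exercised.
  void enableFeatureB() { HasFeatureB = true; }
};
```

Adding a feature then takes a single new `X(...)` entry, and the member and getter stay in sync by construction.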

Ideally, these could be generated by TableGen using
`GET_SUBTARGETINFO_MACRO`, similar to the X86 backend. However,
`AMDGPU.td` has several issues that prevent direct adoption: duplicate
field names (e.g., `DumpCode` is set by both `FeatureDumpCode` and
`FeatureDumpCodeLower`), and inconsistent naming conventions where many
features don't have the `Has` prefix (e.g., `FlatAddressSpace`,
`GFX10Insts`, `FP64`). Fixing these issues would require renaming fields
in `AMDGPU.td` and updating all references, which is left for future
work.
2026-01-21 15:29:09 -05:00
Shilei Tian
c253b9f9ca
[AMDGPU] Fix inline constant encoding for v_pk_fmac_f16 (#176659)
This PR handles `v_pk_fmac_f16` inline constant encoding/decoding
differences between pre-GFX11 and GFX11+ hardware.

- Pre-GFX11: fp16 inline constants produce `(f16, 0)` - value in low 16
bits, zero in high.
- GFX11+: fp16 inline constants are duplicated to both halves `(f16,
f16)`.
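
The two behaviours listed above can be written as a small bit-level sketch. `expandF16InlineConstant` is a hypothetical helper for illustration, not a function from the patch.

```cpp
#include <cstdint>

// Expand the 16-bit pattern of an fp16 inline constant into the 32-bit
// packed value the instruction sees.
constexpr uint32_t expandF16InlineConstant(uint16_t F16Bits, bool IsGFX11Plus) {
  const uint32_t Lo = F16Bits;
  // GFX11+: duplicated into the high half. Pre-GFX11: high half is zero.
  const uint32_t Hi = IsGFX11Plus ? Lo : 0u;
  return (Hi << 16) | Lo;
}
```

For fp16 1.0 (bit pattern `0x3C00`), this yields `0x00003C00` before GFX11 and `0x3C003C00` on GFX11+.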

Fixes #94116.
2026-01-20 19:14:59 -05:00
Stanislav Mekhanoshin
dd947ebcf3
[AMDGPU] Update gfx1250 memory model for global acquire/release (#175865)
Inserts required waits around GLOBAL_INV/GLOBAL_WBINV for
agent scope and above.
2026-01-15 03:25:03 -08:00
sstipano
cc1e10d50b
[AMDGPU] Disable s_add_pc_i64 instruction (#175644)
s_add_pc_i64 instruction is broken on gfx1250. Disable it by default.
2026-01-14 23:01:43 +01:00
Shoreshen
26624d51d1
[AMDGPU]Add specific instruction feature for multicast load (#175503) 2026-01-13 09:10:09 +08:00
Shilei Tian
df3629dc0c
[AMDGPU] Handle s_setreg_imm32_b32 targeting MODE register (#174681)
On certain hardware, this instruction clobbers VGPR MSB `bits[12:19]`,
so we need to restore the current mode.

Fixes SWDEV-571581.
2026-01-09 14:43:41 -05:00
Jay Foad
475f022cb7
[AMDGPU] Add support for GFX12 expert scheduling mode 2 (#170319) 2026-01-09 15:49:10 +00:00
saxlungs
7bbaf2e16b
[AMDGPU] Improve llvm.amdgcn.wave.shuffle handling for pre-GFX8 (#174845)
Before, GlobalISel would still return true for lowering the intrinsic
for GFX7 and earlier even though the required ds_bpermute_b32
instruction is not supported. After this change, GlobalISel will
properly report failure to select in this case. Testing is updated
appropriately.

Signed-off-by: Domenic Nutile <domenic.nutile@gmail.com>
2026-01-07 21:48:11 +01:00
saxlungs
c262893f4b
Reland "[AMDGPU] Add new llvm.amdgcn.wave.shuffle intrinsic (#167372)" (#174614)
This change adds a new intrinsic for AMDGPU that implements a wave
shuffle, allowing arbitrary swizzling between lanes using an index. In
the initial version of this commit, there was an issue in one of the
tests added that returned a signal, causing testing to fail when
combined with another recent change to 'not'.
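
The lane swizzle that the intrinsic implements can be modelled on the host. This is only a reference sketch: `waveShuffle` is a hypothetical function, and wrapping out-of-range indices into the wave is an assumption, since the message does not say how the intrinsic treats them.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Reference model: lane L of the result holds the source value of the lane
// selected by Index[L].
inline std::vector<uint32_t> waveShuffle(const std::vector<uint32_t> &Src,
                                         const std::vector<uint32_t> &Index) {
  assert(Src.size() == Index.size() && "one index per lane");
  std::vector<uint32_t> Result(Src.size());
  for (std::size_t Lane = 0; Lane < Src.size(); ++Lane) {
    // Assumption: out-of-range indices wrap around the wave size.
    Result[Lane] = Src[Index[Lane] % Src.size()];
  }
  return Result;
}
```

With four lanes holding `{10, 20, 30, 40}`, the index vector `{3, 2, 1, 0}` reverses the values and `{1, 1, 1, 1}` broadcasts lane 1.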

For context on the initial commit see #167372

---------

Signed-off-by: Domenic Nutile <domenic.nutile@gmail.com>
Co-authored-by: Jay Foad <jay.foad@gmail.com>
2026-01-06 15:02:08 -05:00
Joe Nash
4bca00d56b
Revert "[AMDGPU] Add new llvm.amdgcn.wave.shuffle intrinsic" (#174501)
Reverts llvm/llvm-project#167372
2026-01-05 17:52:28 -05:00
saxlungs
b9fbc19017
[AMDGPU] Add new llvm.amdgcn.wave.shuffle intrinsic (#167372)
This intrinsic will be useful for implementing the
OpGroupNonUniformShuffle operation in the SPIR-V reference

---------

Signed-off-by: Domenic Nutile <domenic.nutile@gmail.com>
Co-authored-by: Jay Foad <jay.foad@gmail.com>
2026-01-05 17:15:58 -05:00
Jay Foad
35c2dbd481
[AMDGPU] Remove trivially true predicates from GCNSubtarget. NFC. (#172830) 2025-12-18 11:05:34 +00:00
Mirko Brkušanin
5759a3a779
[AMDGPU] Add s_wakeup_barrier instruction for gfx1250 (#170501) 2025-12-10 09:45:13 +01:00
anjenner
740a3ad1f7
AMDGPU: Add codegen for atomicrmw operations usub_cond and usub_sat (#141068)
Split off from https://github.com/llvm/llvm-project/pull/105553 as per
discussion there.
2025-12-05 12:37:33 +00:00
Stanislav Mekhanoshin
9dd3346589
[AMDGPU] Prevent folding of flat_scr_base_hi into a 64-bit SALU (#170373)
Fixes: SWDEV-563886
2025-12-02 16:08:00 -08:00
Pierre van Houtryve
a086fb2fbb
[AMDGPU][gfx1250] Add wait_xcnt before any access that cannot be repeated (#168852)
The xcnt wait is actually required before any memory access that can
only be done once, so atomic stores and volatile accesses are affected.
This patch also ensures buffer instructions are handled.
2025-11-25 10:11:04 +01:00
Shoreshen
52a58a4193
[AMDGPU] Adding instruction specific features (#167809) 2025-11-19 11:06:00 +08:00
Shilei Tian
b4aa3d3ae3
[NFC] Check operand type instead of opcode (#168641)
A follow-up to #168458.
2025-11-18 21:37:56 -05:00
Shilei Tian
6665642ce4
[AMDGPU] Don't fold an i64 immediate value if it can't be replicated from its lower 32-bit (#168458)
On some targets, a packed f32 instruction can only read 32 bits from a
scalar operand (SGPR or literal) and replicates the bits to both
channels. In this case, we should not fold an immediate value if it
can't be replicated from its lower 32-bit.
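
The condition can be expressed as a comparison of the two 32-bit halves. `isReplicatedFromLo32` is a hypothetical helper used for illustration, not the check added by the patch.

```cpp
#include <cstdint>

// True if replicating the low 32 bits into both channels reproduces the
// full 64-bit immediate, i.e. the high and low halves are identical.
constexpr bool isReplicatedFromLo32(uint64_t Imm) {
  return static_cast<uint32_t>(Imm >> 32) == static_cast<uint32_t>(Imm);
}
```

For example, `0x3F8000003F800000` (f32 1.0 in both halves) passes, while `0x000000003F800000` does not, because the hardware would read only the low half and produce 1.0 in both channels.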

Fixes SWDEV-567139.
2025-11-18 17:11:10 -05:00
Matt Arsenault
dfdada1b78
CodeGen: Remove target hook for terminal rule (#165962)
Enables the terminal rule for remaining targets
2025-11-12 21:12:19 +00:00
Matt Arsenault
e95f6fa123
RegisterCoalescer: Enable terminal rule by default for AMDGPU (#161621)
Introduce a target hook to incrementally flip the behavior of
targets with test changes, and start by implementing it for AMDGPU.

This appears to be a forgotten switch flip from 2015. It
seems to do a nicer job with subregister copies. Most of the
test changes are improvements or neutral; only a few are
light regressions. The worst AMDGPU regressions are for true16
in the atomic tests, but I think that's due to existing true16
issues.
2025-11-10 09:37:14 -08:00
Jay Foad
60f20ea465
[AMDGPU] Add target feature for waits before system scope stores. NFC. (#164993) 2025-10-27 10:31:37 +00:00
Stanislav Mekhanoshin
9b5bc98743
[AMDGPU] Add intrinsics for v_[pk]_add_{min|max}_* instructions (#164731) 2025-10-22 17:46:33 -07:00
Matt Arsenault
d4b504ff20
AMDGPU: Remove triple field from subtarget (#164208)
This is redundant and already exists in the base class, and
is also unused.
2025-10-20 06:58:16 +00:00
Shilei Tian
9e8dda1034
[NFC] Change spelling of cluster feature to "clusters" (#162103) 2025-10-06 15:55:39 +00:00
Shilei Tian
bea0225c30
[AMDGPU] Make cluster a target feature (#162040)
This replaces the original arch check.
2025-10-06 05:05:53 +00:00
Matt Arsenault
0a80631142
AMDGPU: Ensure both wavesize features are not set (#159234)
Make sure we cannot be in a mode with both wavesizes. This
prevents assertions in a future change. This should probably
just be an error, but we do not have a good way to report
errors from the MCSubtargetInfo constructor.
2025-09-25 09:46:34 +00:00