461 Commits

Author SHA1 Message Date
Domenic Nutile
5b33f85a08
[AMDGPU] Change isSingleLaneExecution to account for WWM enabling lanes even if there's only one workitem (#188316)
This issue was discovered during some downstream work around Vulkan CTS
tests, specifically
`dEQP-VK.subgroups.arithmetic.compute.subgroupadd_float`

---------

Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
2026-04-06 12:51:46 -04:00
Mirko Brkušanin
5d9eb0c76a
[AMDGPU] Define new targets gfx1171 and gfx1172 (#187735) 2026-04-01 18:16:11 +02:00
Nicolai Hähnle
2fb59b4e6c
AMDGPU: Document two more in-flight address spaces (#185690)
There are two more address spaces being worked on internally for
future features. Note this in the documentation now to reduce the risk
of clashes.
2026-03-10 17:13:59 +00:00
Diana Picus
f2e8e2faff
[AMDGPU] Make chain functions receive a stack pointer (#184616)
Currently, chain functions are free to set up a stack pointer if they
need one, and they assume they can start at scratch offset 0. This is
not correct if CWSR and dynamic VGPRs are both enabled, since in that
case we need to reserve an area at offset 0 for the trap handler, but
only when running on a compute queue (which we determine at runtime).
Rather than duplicate in every chain function the code sequence for
determining if/how much scratch space needs to be reserved, this patch
changes the ABI of chain functions so that they receive a stack pointer
from their caller.

Since chain functions can no longer use plain offsets to access their
own stack, we'll also need to allocate a frame pointer more often (and
sometimes also a base pointer). For simplicity, we use the same
registers that `amdgpu_gfx` functions do (s32, s33, s34). This may
change in the future. Chain functions never return to their caller and
thus don't need to preserve the frame or base pointer.

Another consequence is that now we might need to realign the stack in
some cases (since it no longer starts at the infinitely aligned 0).
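
A minimal sketch of the realignment arithmetic this implies, assuming the incoming stack pointer is a byte offset into scratch and alignments are powers of two. The helper name and the sample offset are illustrative only and do not come from the patch:

```python
def align_up(offset: int, alignment: int) -> int:
    """Round offset up to the next multiple of alignment (a power of two)."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    return (offset + alignment - 1) & ~(alignment - 1)


# Offset 0 satisfies every alignment, so a stack starting there never moved.
start_at_zero = align_up(0, 64)

# A caller-provided stack pointer can sit anywhere and may need rounding up.
incoming_sp = 0x104  # hypothetical value handed over by the caller
aligned_sp = align_up(incoming_sp, 64)
```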
2026-03-06 11:01:42 +01:00
Mirko Brkušanin
d0f50d5574
[AMDGPU] Remove DX10_CLAMP and IEEE bits from gfx1170 (#182107)
Add `DX10ClampAndIEEEMode` feature and set it for every subtarget prior
to gfx1170
2026-03-04 12:16:41 +01:00
David Stuttard
cd68939326
[AMDGPU] Add attribute for FWD_PROGRESS (#181675)
Added an attribute for FWD_PROGRESS that allows it to be
turned off for some shaders.
2026-02-26 11:35:36 +00:00
Stanislav Mekhanoshin
33fd75f55d
[AMDGPU] Add gfx12-5-generic subtarget (#183381)
This is functionally equivalent to gfx1250.
2026-02-25 13:34:48 -08:00
Konstantin Zhuravlyov
ea4bbfbf5b
AMDGPU/Docs: Reserve 0x060 and 0x070 ELF MACH (e_flags) (#182341) 2026-02-19 14:06:02 -05:00
Pierre van Houtryve
c30879ea91
[AMDGPU][Doc] Small fix for GFX12 release atomic memory model doc (#182241)
That row goes for both generic/global but it only said global.
2026-02-19 11:38:05 +01:00
Sameer Sahasrabuddhe
b02b395a1e
[AMDGPU] Asynchronous loads from global/buffer to LDS on pre-GFX12 (#180466)
The existing "LDS DMA" builtins/intrinsics copy data from global/buffer
pointer to LDS. These are now augmented with their ".async" version,
where the compiler does not automatically track completion. The
completion is now tracked using explicit mark/wait intrinsics, which
must be inserted by the user. This makes it possible to write programs
with efficient waits in software pipeline loops. The program can now
wait for only the oldest outstanding operations to finish, while
launching more operations for later use.
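
As a rough illustration of the pipelining pattern described above, here is a toy Python model. It is not the intrinsic API: the class, its method names, and the assumption that marked groups retire strictly in order are all hypothetical, chosen only to show "wait for the oldest, keep the rest in flight".

```python
from collections import deque


class AsyncLoadTracker:
    """Toy model: async loads are grouped by a mark and retire oldest first."""

    def __init__(self):
        self.pending = []           # loads issued since the last mark
        self.outstanding = deque()  # marked groups, oldest first
        self.completed = []         # loads known to have finished

    def issue(self, name):
        self.pending.append(name)

    def mark(self):
        self.outstanding.append(self.pending)
        self.pending = []

    def wait(self, max_outstanding):
        """Retire the oldest groups until at most max_outstanding remain."""
        while len(self.outstanding) > max_outstanding:
            self.completed.extend(self.outstanding.popleft())


def pipeline(num_tiles, depth=2):
    """Consume tiles in order while keeping up to depth loads in flight."""
    tracker = AsyncLoadTracker()
    consumed = []
    # Prologue: start the first `depth` loads before consuming anything.
    for i in range(min(depth, num_tiles)):
        tracker.issue(f"tile{i}")
        tracker.mark()
    for i in range(num_tiles):
        # Wait only for tile i; newer groups are allowed to stay in flight.
        still_in_flight = min(depth, num_tiles - i) - 1
        tracker.wait(still_in_flight)
        assert f"tile{i}" in tracker.completed
        consumed.append(f"tile{i}")
        # Launch a later tile so the load overlaps with the next iterations.
        nxt = i + depth
        if nxt < num_tiles:
            tracker.issue(f"tile{nxt}")
            tracker.mark()
    return consumed, tracker.completed
```

In the steady state the wait always leaves `depth - 1` groups outstanding, which is the property a fully automatic "wait for everything" scheme cannot express.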

This change only contains the new names of the builtins/intrinsics,
which continue to behave exactly like their non-async counterparts. A
later change will implement the actual mark/wait semantics in
SIInsertWaitcnts.

This is part of a stack split out from #173259:
- #180467
- #180466

Fixes: SWDEV-521121
2026-02-11 05:26:58 +00:00
Pierre van Houtryve
b79ba02479
[AMDGPU][GFX12.5] Reimplement monitor load as an atomic operation (#177343)
Load monitor operations make more sense as atomic operations, as
non-atomic operations cannot be used for inter-thread communication without
additional synchronization.
The previous built-in made it work because one could just override the
CPol bits, but that bypasses the memory model and forces the user to learn
about ISA bit encodings.

Making load monitor an atomic operation has a couple of advantages.
First, the memory model foundation for it is stronger. We just lean on the
existing rules for atomic operations. Second, the CPol bits are abstracted away
from the user, which avoids leaking ISA details into the API.

This patch also adds supporting memory model and intrinsics
documentation to AMDGPUUsage.

Solves SWDEV-516398.
2026-02-09 09:57:27 +01:00
Mirko Brkušanin
20b5849e17
[AMDGPU] Define new target gfx1170 (#180185) 2026-02-06 14:38:50 +01:00
Aaditya
e9e8b38c80
[AMDGPU] Update documentation for wave reduction intrinsics (#175132) 2026-01-30 18:19:40 +05:30
Mariusz Sikora
6de6f7b46b
[AMDGPU] Define gfx1310 target with ELF number 0x50 (#177355)
For now this is identical to gfx1250.

---------

Co-authored-by: Jay Foad <jay.foad@amd.com>
2026-01-22 17:08:38 +01:00
Pierre van Houtryve
3dda4b5463
[Doc][AMDGPU] Add barrier execution & memory model (#170447)
Add a formal execution model, and a memory model for the execution barrier
primitives available in GFX12.0 and below.
The model also works for GFX12.5 workgroup/workgroup trap barriers, but does
not include the new barrier types and instructions added in GFX12.5.
These will be added at a later date.
2026-01-21 16:12:33 +01:00
Stanislav Mekhanoshin
dd947ebcf3
[AMDGPU] Update gfx1250 memory model for global acquire/release (#175865)
Inserts required waits around GLOBAL_INV/GLOBAL_WBINV for
agent scope and above.
2026-01-15 03:25:03 -08:00
Pankaj Dwivedi
017a27cb1a
[AMDGPU][Docs] Document amdgpu-expand-waitcnt-profiling attribute (#175750) 2026-01-13 18:21:13 +05:30
Jay Foad
475f022cb7
[AMDGPU] Add support for GFX12 expert scheduling mode 2 (#170319) 2026-01-09 15:49:10 +00:00
Jay Foad
b3c3e5fd99
[AMDGPU] Simplify and document waitcnt handling on call and return (#172453)
Start documenting the ABI conventions for dependency counters on
function call and return.

Stop pretending that SIInsertWaitcnts can handle anything other than the
default documented behavior.
2026-01-05 13:29:54 +00:00
Shilei Tian
c97de4387b
Revert "[AMDGPU] add clamp immediate operand to WMMA iu8 intrinsic (#171069)" (#174303)
This reverts commit 2c376ffeca490a5732e4fd6e98e5351fcf6d692a because it
breaks the assembler.

```
$ llvm-mc -triple=amdgcn -mcpu=gfx1250 -show-encoding <<< "v_wmma_i32_16x16x64_iu8 v[16:23], v[0:7], v[8:15], v[16:23] matrix_b_reuse"
  v_wmma_i32_16x16x64_iu8 v[16:23], v[0:7], v[8:15], v[16:23] clamp ; encoding: [0x10,0x80,0x72,0xcc,0x00,0x11,0x42,0x1c]
```

We have a fundamental issue in the clamp support in VOP3P instructions,
which will need more changes.
2026-01-04 02:13:21 +00:00
Muhammad Abdul
2c376ffeca
[AMDGPU] add clamp immediate operand to WMMA iu8 intrinsic (#171069)
Fixes #166989 

- Adds a clamp immediate operand to the AMDGPU WMMA iu8 intrinsic and
threads it through LLVM IR, MIR lowering, Clang builtins/tests, and MLIR
ROCDL dialect so all layers agree on the new operand
- Updates AMDGPUWmmaIntrinsicModsAB so the clamp attribute is emitted,
teaches VOP3P encoding to accept the immediate, and adjusts Clang
codegen/builtin headers plus MLIR op definitions and tests to match
- Documents what the WMMA clamp operand does
- Implement bitcode AutoUpgrade for source compatibility on WMMA IU8
Intrinsic op

Possible future enhancements:
- infer clamping as an optimization fold based on the use context

---------

Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
2025-12-27 12:51:29 -05:00
anjenner
27651133e2
AMDGPU: Drop and upgrade llvm.amdgcn.atomic.csub/cond.sub to atomicrmw (#105553)
These both perform conditional subtraction, returning the minuend and
zero respectively, if the difference is negative.
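
The two behaviours can be sketched as plain arithmetic. This models only the value computed from the minuend and subtrahend, assuming unsigned 32-bit operands; the function names are illustrative, and which retired intrinsic maps to which behaviour is not restated here:

```python
MASK32 = 0xFFFFFFFF  # assumption: operands are unsigned 32-bit values


def sub_keep_minuend(minuend: int, subtrahend: int) -> int:
    """Subtract, but return the minuend unchanged if the result would be negative."""
    a, b = minuend & MASK32, subtrahend & MASK32
    return a - b if a >= b else a


def sub_clamp_to_zero(minuend: int, subtrahend: int) -> int:
    """Subtract, but return zero if the result would be negative."""
    a, b = minuend & MASK32, subtrahend & MASK32
    return a - b if a >= b else 0
```

The two variants agree whenever the subtraction does not underflow and differ only in the fallback value.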
2025-12-09 23:13:33 +00:00
Jan Patrick Lehr
41a6a0a1d2
[AMDGPU] Add some more product names for GPUs (#170469) 2025-12-04 07:08:20 +01:00
Changpeng Fang
5f38ae4a77
[AMDGPU] update LDS block size for gfx1250 (#167614)
LDS block size should be 2048 bytes (512 dwords) based on current spec.
2025-11-17 16:03:47 -08:00
Kazu Hirata
ea0ecd63d4
[llvm] Proofread *.rst (#168254)
This patch is limited to hyphenation to ease the review process.
2025-11-16 08:09:07 -08:00
Krzysztof Drewniak
0190951a3e
[AMDGPU] Update buffer fat pointer docs for gfx1250, fix formatting (#167818) 2025-11-14 11:13:57 -08:00
Krzysztof Drewniak
d4e9982787
[AMDGPU] Document meaning of alignment of buffer fat pointers, intrinsics (#167553)
This commit adds documentation clarifying the meaning of `align` on ptr
addrspace(7) (buffer fat pointer) and ptr addrspace(9) (buffer
structured pointer) operations (specifying that both the base and the
offset need to be aligned) and documents the meaning of the `align`
attribute when used as an argument on *.buffer.ptr.* intrinsics.
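
A small sketch of why the rule needs both parts, assuming the base and the offset are available as plain integers (the helper names are illustrative). An aligned sum does not imply that each component is aligned:

```python
def is_aligned(value: int, align: int) -> bool:
    """True if value is a multiple of align."""
    return value % align == 0


def buffer_access_is_aligned(base: int, offset: int, align: int) -> bool:
    """Alignment holds only if the base and the offset are each aligned."""
    return is_aligned(base, align) and is_aligned(offset, align)


# base + offset is a multiple of 4 here, yet neither part is 4-byte aligned.
sum_only = is_aligned(2 + 2, 4)
both_parts = buffer_access_is_aligned(2, 2, 4)
```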
2025-11-12 19:39:27 -08:00
Kazu Hirata
ce7f9f9ccd
[llvm] Proofread *.rst (#167108)
This patch is limited to single-word replacements to fix spelling
and/or grammar to ease the review process.  Punctuation and markdown
fixes are specifically excluded.
2025-11-08 07:41:23 -08:00
Nicolai Hähnle
917d815d4e
AMDGPU: Preliminary documentation for named barriers (#165502) 2025-11-07 18:10:59 +00:00
Nicolai Hähnle
cfca229782
AMDGPU: Add and clarify reserved address spaces (#166486)
Address spaces 10 and 11 are reserved for future use in the sense that
we plan to upstream their use.

Address space 12 is used by LLPC. It is used in a workaround for an
issue with SMEM accesses to PRT buffers that is specific to the LLPC
ecosystem and makes no sense to upstream.
2025-11-05 01:57:54 +00:00
Pierre van Houtryve
07d47c792b
[AMDGPU] Update code sequence for CU-mode Release Fences in GFX10+ (#161638)
They were previously optimized to not emit any waitcnt, which is
technically correct because there is no reordering of operations at
workgroup scope in CU mode for GFX10+.

This breaks transitivity however, for example if we have the following
sequence of events in one thread:

- some stores
- store atomic release syncscope("workgroup")
- barrier

then another thread follows with

- barrier
- load atomic acquire
- store atomic release syncscope("agent")

It does not work because, while the other thread sees the stores, it
cannot release them at the wider scope. Our release fences aren't strong
enough to "wait" on stores from other waves.

We also cannot strengthen our release fences any further to allow for
releasing other waves' stores because only GFX12 can do that with
`global_wb`. GFX10-11 do not have the writeback instruction.
It'd also add yet another level of complexity to code sequences, with
both acquire/release having CU-mode only alternatives.
Lastly, acq/rel are always used together. The price for synchronization
has to be paid either at the acq, or the rel. Strengthening the releases
would just make the memory model more complex but wouldn't help
performance.

So the choice here is to streamline the code sequences by making CU and
WGP mode emit almost identical (vL0 inv is not needed in CU mode) code
for release (or stronger) atomic ordering.

This also removes the `vm_vsrc(0)` wait before barriers. Now that the
release fence in CU mode is strong enough, it is no longer needed.

Supersedes #160501
Solves SC1-6454
2025-10-21 09:23:46 +02:00
Krzysztof Drewniak
d37141776f
[AMDGPU] Enable volatile and non-temporal for loads to LDS (#153244)
The primary purpose of this commit is to enable marking loads to LDS
(global.load.lds, buffer.*.load.lds) volatile (using bit 31 of the aux
as with normal buffer loads) and to ensure that their !nontemporal
annotations translate to appropriate settings of the cache control bits.

However, in the process of implementing this feature, we also fixed
- Incorrect handling of buffer loads to LDS in GlobalISel
- Updating the handling of volatile on buffers in SIMemoryLegalizer:
previously, the mapping of address spaces would cause volatile on buffer
loads to be silently dropped on at least gfx10.

---------

Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
2025-10-20 12:42:22 -05:00
Nicolai Hähnle
56ee43a863
AMDGPU: Document address spaces as reserved (#163996)
They are going to be used for internal work downstream that we do expect
to upstream eventually.
2025-10-17 17:34:28 +00:00
Jan Patrick Lehr
4773751799
[AMDGPU] Add product names to processor table (#163717) 2025-10-16 12:17:20 +02:00
Kazu Hirata
bc0c232a2a
[llvm] Proofread AMDGPUUsage.rst (#163331) 2025-10-14 07:15:53 -07:00
Jun Wang
bdd98a0147
[AMDGPU] Add documentation files for GFX12. (#157151)
This patch adds documentation files for GFX12.
2025-10-01 15:31:50 -07:00
Stanislav Mekhanoshin
f81cc8bddc
[AMDGPU] Update gfx1250 documentation. NFC (#160457) 2025-09-24 10:04:47 -07:00
Stanislav Mekhanoshin
5f105fe806
[AMDGPU] Update documentation about DWARF registers mapping. NFC (#159447) 2025-09-17 14:08:19 -07:00
Stanislav Mekhanoshin
e556dc0b23
[AMDGPU] Add gfx1251 subtarget (#159430) 2025-09-17 13:02:02 -07:00
Shilei Tian
04cd39ae28
[AMDGPU] Add the support for .cluster_dims code object metadata (#158721)
Co-authored-by: Ivan Kosarev <ivan.kosarev@amd.com>
2025-09-15 16:13:07 -04:00
Shilei Tian
27b242fbff
[AMDGPU][Attributor] Add AAAMDGPUClusterDims (#158076) 2025-09-15 15:04:33 -04:00
Shilei Tian
1180c2ced0
[AMDGPU] Support lowering of cluster related intrinsics (#157978)
Since much of the code is connected, this also changes how workgroup id is lowered.

Co-authored-by: Jay Foad <jay.foad@amd.com>
Co-authored-by: Ivan Kosarev <ivan.kosarev@amd.com>
2025-09-12 21:11:17 -04:00
Stanislav Mekhanoshin
10cb685939
[AMDGPU] llvm.prefetch documentation for gfx1250. NFC (#157949) 2025-09-10 14:06:32 -07:00
Pierre van Houtryve
49a898f9b5
[AMDGPU][gfx1250] Support "cluster" syncscope (#157641)
Defaults to "agent" for targets that do not support it.

- Add documentation
- Register it in MachineModuleInfo
- Add MemoryLegalizer support
2025-09-10 11:41:43 +02:00
Pierre van Houtryve
dcaa29c8ed
Revert "[AMDGPU][gfx1250] Add cu-store subtarget feature (#150588)" (#157639)
This reverts commit be17791f2624f22b3ed24a2539406164a379125d.

This is not necessary for gfx1250 anymore.
2025-09-10 10:20:59 +02:00
Jan Patrick Lehr
c209cca677
[Docs][AMDGPU] Add gfx1200/gfx1201 product names (#155577)
I took the liberty to add the product names according to Wikipedia.
2025-08-27 14:03:26 +02:00
Stanislav Mekhanoshin
438c099c23
[AMDGPU] gfx1250 kernel descriptor update (#155008) 2025-08-22 12:58:41 -07:00
Gang Chen
60dbde69cd
[AMDGPU] report named barrier cnt part2 (#154588) 2025-08-20 12:00:45 -07:00
Tim Renouf
f279c47cb3
AMDGPU gfx12: Add _dvgpr$ symbols for dynamic VGPRs (#148251)
For each function with the AMDGPU_CS_Chain calling convention, with
dynamic VGPRs enabled, add a _dvgpr$ symbol, with the value of the
function symbol, plus an offset encoding one less than the number of
VGPR blocks used by the function (16 VGPRs per block, no more than 128)
in bits 5..3 of the symbol value. This is used by a front-end to have
functions that are chained rather than called, and a dispatcher that
dynamically resizes the VGPR count before dispatching to a function.
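
The encoding can be sketched as follows. This is a reading of the description above, not the backend code: it assumes the function symbol value has bits 5..3 clear (so adding the offset cannot carry) and that a function uses between 1 and 128 VGPRs. All names are illustrative.

```python
VGPRS_PER_BLOCK = 16
MAX_VGPRS = 128


def dvgpr_symbol_value(function_value: int, num_vgprs: int) -> int:
    """Function symbol value plus (VGPR blocks - 1) placed in bits 5..3."""
    if not 1 <= num_vgprs <= MAX_VGPRS:
        raise ValueError("assumed range is 1..128 VGPRs")
    blocks = -(-num_vgprs // VGPRS_PER_BLOCK)  # ceiling division
    return function_value + ((blocks - 1) << 3)


def vgpr_blocks_from_symbol(dvgpr_value: int) -> int:
    """Recover the block count a dispatcher would read from bits 5..3."""
    return ((dvgpr_value >> 3) & 0x7) + 1
```

For example, a function at a hypothetical address `0x1000` using 40 VGPRs needs 3 blocks, so the offset is `2 << 3`.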
2025-08-15 16:33:06 +01:00
Stanislav Mekhanoshin
49f2093477
[AMDGPU] Increase LDS to 320K on gfx1250 (#153645) 2025-08-14 12:52:00 -07:00