411 Commits

Author SHA1 Message Date
Kazu Hirata
183b38078a
[llvm] Proofread *.rst (#151087)
This patch hyphenates words that are used as adjecives, such as:

- architecture specific
- human readable
- implementation defined
- language independent
- language specific
- machine readable
- machine specific
- target independent
- target specific
2025-08-01 07:01:50 -07:00
Pierre van Houtryve
be17791f26
[AMDGPU][gfx1250] Add cu-store subtarget feature (#150588)
Determines whether we can use `SCOPE_CU` stores (on by default), or
whether all stores must be done at `SCOPE_SE` minimum.
2025-07-29 11:38:43 +02:00
Pierre van Houtryve
cd1b84caa8
[NFC][AMDGPU] Rename "amdgpu-as" to "amdgpu-synchronize-as" (#148627)
"amdgpu-as" is way too vague and doesn't give enough context.
We may want to support it on normal atomics too, to control the synchronized (ordered) AS.
If we do that, the name has to be less vague.
2025-07-24 12:41:57 +02:00
Kazu Hirata
9d9d971f71
[llvm] Proofread AMDGPUUsage.rst (#150273) 2025-07-23 15:19:01 -07:00
Diana Picus
20d8398825
[AMDGPU] ISel & PEI for whole wave functions (#145858)
Whole wave functions are functions that will run with a full EXEC mask.
They will not be invoked directly, but instead will be launched by way
of a new intrinsic, `llvm.amdgcn.call.whole.wave` (to be added in
a future patch). These functions are meant as an alternative to the
`llvm.amdgcn.init.whole.wave` or `llvm.amdgcn.strict.wwm` intrinsics.

Whole wave functions will set EXEC to -1 in the prologue and restore the
original value of EXEC in the epilogue. They must have a special first
argument, `i1 %active`, that is going to be mapped to EXEC. They may
have either the default calling convention or amdgpu_gfx. The inactive
lanes need to be preserved for all registers used, active lanes only for
the CSRs.

At the IR level, arguments to a whole wave function (other than
`%active`) contain poison in their inactive lanes. Likewise, the return
value for the inactive lanes is poison.

This patch contains the following work:
* 2 new pseudos, SI_SETUP_WHOLE_WAVE_FUNC and SI_WHOLE_WAVE_FUNC_RETURN
  used for managing the EXEC mask. SI_SETUP_WHOLE_WAVE_FUNC will return
  a SReg_1 representing `%active`, which needs to be passed into
  SI_WHOLE_WAVE_FUNC_RETURN.
* SelectionDAG support for generating these 2 new pseudos and the
  special handling of %active. Since the return may be in a different
  basic block, it's difficult to add the virtual reg for %active to
  SI_WHOLE_WAVE_FUNC_RETURN, so we initially generate an IMPLICIT_DEF
  which is later replaced via a custom inserter.
* Expansion of the 2 pseudos during prolog/epilog insertion. PEI also
  marks any used VGPRs as WWM registers, which are then spilled and
  restored with the usual logic.

Future patches will include the `llvm.amdgcn.call.whole.wave` intrinsic
and a lot of optimization work (especially in order to reduce spills
around function calls).

---------

Co-authored-by: Matt Arsenault <Matthew.Arsenault@amd.com>
Co-authored-by: Shilei Tian <i@tianshilei.me>
2025-07-21 10:39:09 +02:00
Nicolai Hähnle
61739d76f0
AMDGPU: Trivial doc fixes (#146021) 2025-06-26 20:55:16 -07:00
Diana Picus
a201f8872a
[AMDGPU] Replace dynamic VGPR feature with attribute (#133444)
Use a function attribute (amdgpu-dynamic-vgpr) instead of a subtarget
feature, as requested in #130030.
2025-06-24 11:09:36 +02:00
Stanislav Mekhanoshin
69974658f0
[AMDGPU] Initial support for gfx1250 target. (#144965)
This is just a stub for now.
2025-06-19 22:52:51 -07:00
Nicolai Hähnle
3bee9ba015
AMDGPU/GFX12: Fix s_barrier_signal_isfirst for single-wave workgroups (#143634)
Barrier instructions are no-ops in single-wave workgroups. This includes
s_barrier_signal_isfirst, which will leave SCC unmodified.

Model this correctly (via an implicit use of SCC) and ensure SCC==1
before the barrier instruction (if the wave is the only one of the
workgroup, then it is the first).

---------

Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
2025-06-19 11:22:49 -07:00
Scott Linder
e8362234f6
[Object][AMDGPU] Support REL relocations (#143966)
Shaders compiled with DXC/LLPC generate these relocations, and even if
that changes in the future we want to handle existing binaries. The
friction to support this and the maintenance cost long term both seem
incredibly low, considering other targets like ARM support both REL/RELA
static relocations behind the same interface.
2025-06-16 15:03:02 -04:00
Daniel Hernandez-Juarez
68b6f392ed
[MLIR][AMDGPU] Fix bug in GatherToLDSOpLowering, get the correct MemRefType for destination (#142915)
This PR fixes a bug in GatherToLDSOpLowering, we were getting the
MemRefType of source for the destination. Additionally, some related
typos are corrected.

CC: @krzysz00 @umangyadav @lialan
2025-06-13 11:33:51 -05:00
Diana Picus
a5cbd2ab0b
Revert "[AMDGPU] Skip register uses in AMDGPUResourceUsageAnalysis (#… (#144039)
…133242)"

This reverts commit 130080fab11cde5efcb338b77f5c3b31097df6e6 because it
causes issues in testcases similar to coalescer_remat.ll [1], i.e. when
we use a VGPR tuple but only write to its lower parts. The high VGPRs
would then not be included in the vgpr_count, and accessing them would
be an out of bounds violation.

[1]
https://github.com/llvm/llvm-project/blob/main/llvm/test/CodeGen/AMDGPU/coalescer_remat.ll
2025-06-13 12:48:24 +02:00
Saiyedul Islam
432d06ab91
[NFC][AMDGPU] Fix stale links to ROCm repositories (#143949)
Following GitHub organizations were merged into the ROCm org:
  * ROCm-Developer-Tools
  * RadeonOpenCompute
  * ROCmSoftwarePlatform

Ensure that all hyperlinks to the old organizations now point to the new
organization at https://github.com/ROCm.
2025-06-13 11:33:52 +05:30
Diana Picus
130080fab1
[AMDGPU] Skip register uses in AMDGPUResourceUsageAnalysis (#133242)
Don't count register uses when determining the maximum number of
registers used by a function. Count only the defs. This is really an
underestimate of the true register usage, but in practice that's not
a problem because if a function uses a register, then it has either
defined it earlier, or some other function that executed before has
defined it.

In particular, the register counts are used:
1. When launching an entry function - in which case we're safe because
   the register counts of the entry function will include the register
   counts of all callees.
2. At function boundaries in dynamic VGPR mode. In this case it's safe
   because whenever we set the new VGPR allocation we take into account
   the outgoing_vgpr_count set by the middle-end.

The main advantage of doing this is that the artificial VGPR arguments
used only for preserving the inactive lanes when using the
llvm.amdgcn.init.whole.wave intrinsic are no longer counted. This
enables us to allocate only the registers we need in dynamic VGPR mode.

---------

Co-authored-by: Thomas Symalla <5754458+tsymalla@users.noreply.github.com>
2025-06-03 11:20:48 +02:00
Kazu Hirata
9f4cea209e
[llvm] Fix typos in documentation (#140844) 2025-05-21 01:11:08 -07:00
Krzysztof Drewniak
4bdd116b80
[AMDGPU] Add a new amdgcn.load.to.lds intrinsic (#137425)
This PR adds a amdgns_load_to_lds intrinsic that abstracts over loads to
LDS from global (address space 1) pointers and buffer fat pointers
(address space 7), since they use the same API and "gather from a
pointer to LDS" is something of an abstract operation.

This commit adds the intrinsic and its lowerings for addrspaces 1 and 7,
and updates the MLIR wrappers to use it (loosening up the restrictions
on loads to LDS along the way to match the ground truth from target
features).

It also plumbs the intrinsic through to clang.
2025-05-19 07:15:04 -07:00
Kazu Hirata
ad6bb70773
[llvm] Fix typos in documentation (#140275)
Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
2025-05-16 12:37:43 -07:00
Chinmay Deshpande
5c551cbe46
[NFC] Fix warning formatting for AMDGPUUsage.rst 2025-05-15 11:50:54 -07:00
Frederik Harwath
124e547e83
[AMDGPU][Docs] Correct Radeon Pro 5600M ISA (#140041)
As observed by LLVM Discord user "buck", the Radeon Pro 5600M is gfx1011
and the entry in the gfx1010 category was probably meant to be Radeon RX
5600M which indeed is gfx1010.
2025-05-15 14:00:45 +02:00
Mirko Brkušanin
ba7feaab92
[AMDGPU][Docs] Fix and update AMDGPUUsage.rst (#133894)
- Fix notes about SALU float and src1 SGPRs for dpp instructions
- Add split between gfx11 and gfx12 sections, update references.
2025-04-01 13:59:42 +02:00
Diana Picus
72c3c30452
[AMDGPU] Allocate scratch space for dVGPRs for CWSR (#130055)
The CWSR trap handler needs to save and restore the VGPRs. When dynamic
VGPRs are in use, the fixed function hardware will only allocate enough
space for one VGPR block. The rest will have to be stored in scratch, at
offset 0.

This patch allocates the necessary space by:
- generating a prologue that checks at runtime if we're on a compute
queue (since CWSR only works on compute queues); for this we will have
to check the ME_ID bits of the ID_HW_ID2 register - if that is non-zero,
we can assume we're on a compute queue and initialize the SP and FP with
enough room for the dynamic VGPRs
- forcing all compute entry functions to use a FP so they can access
their locals/spills correctly (this isn't ideal but it's the quickest to
implement)

Note that at the moment we allocate enough space for the theoretical
maximum number of VGPRs that can be allocated dynamically (for blocks of
16 registers, this will be 128, of which we subtract the first 16, which
are already allocated by the fixed function hardware). Future patches
may decide to allocate less if they can prove the shader never allocates
that many blocks.

Also note that this should not affect any reported stack sizes (e.g. PAL
backend_stack_size etc).
2025-03-19 13:49:19 +01:00
Diana Picus
0a21ef9536
[AMDGPU] Add SubtargetFeature for dynamic VGPR mode (#130030)
This represents a hardware mode supported only for wave32 compute
shaders. When enabled, we set the `.dynamic_vgpr_en` field of
`.compute_registers` to true in the PAL metadata.

This will be changed to use an attribute after downstream consumers
have been migrated.
2025-03-18 11:48:01 +01:00
Alex Voicu
c1fabd681f
[llvm][AMDGPU] Enable FWD_PROGRESS bit for GFX10+ (#128367)
From GFX10 onwards it is possible to employ benevolent scheduling of
waves. This patch unconditionally enables, for the `amdhsa` OS, the bit
which controls that capability, as it is beneficial for algorithms that
rely on more complex concurrent coordination and it is generally
performance neutral otherwise.
2025-03-17 23:17:46 +00:00
Carl Ritson
d921bf233c
[AMDGPU] Extend promotion of alloca to vectors (#127973)
* Add multi dimensional array support
* Make maximum vector size tunable
* Make ratio of VGPRs used for vector promotion tunable
* Maximum array size now based on VGPR count (32b) instead of element count
2025-03-12 15:11:30 +09:00
Matt Arsenault
0d2c55cb96
AMDGPU: Move enqueued block handling into clang (#128519)
The previous implementation wasn't maintaining a faithful IR
representation of how this really works. The value returned by
createEnqueuedBlockKernel wasn't actually used as a function, and
hacked up later to be a pointer to the runtime handle global
variable. In reality, the enqueued block is a struct where the first
field is a pointer to the kernel descriptor, not the kernel itself. We
were also relying on passing around a reference to a global using a
string attribute containing its name. It's better to base this on a
proper IR symbol reference during final emission.

This now avoids using a function attribute on kernels and avoids using
the additional "runtime-handle" attribute to populate the final
metadata. Instead, associate the runtime handle reference to the
kernel with the !associated global metadata. We can then get a final,
correctly mangled name at the end.

I couldn't figure out how to get rename-with-external-symbol behavior
using a combination of comdats and aliases, so leaves an IR pass to
externalize the runtime handles for codegen. If anything breaks, it's
most likely this, so leave avoiding this for a later step. Use a
special section name to enable this behavior. This also means it's
possible to declare enqueuable kernels in source without going through
the dedicated block syntax or other dedicated compiler support.

We could move towards initializing the runtime handle in the
compiler/linker. I have a working patch where the linker sets up the
first field of the handle, avoiding the need to export the block
kernel symbol for the runtime. We would need new relocations to get
the private and group sizes, but that would avoid the runtime's
special case handling that requires the device_enqueue_symbol metadata
field.

https://reviews.llvm.org/D141700
2025-03-10 19:54:04 +07:00
Matt Arsenault
a216358ce7
AMDGPU: Replace amdgpu-no-agpr with amdgpu-agpr-alloc (#129893)
This performs the minimal replacment of amdgpu-no-agpr to
amdgpu-agpr-alloc=0. Most of the test diffs are due to the new
attribute sorting later alphabetically.

We could do better by trying to perform range merging in the attributor,
and trying to pick non-0 values.
2025-03-06 09:17:51 +07:00
Matt Arsenault
f4ba2bf236
AMDGPU: Add amdgpu-agpr-alloc attribute to control AGPR allocation (#128034)
This provides a range to decide how to subdivide the vector register
budget on gfx90a+. A single value declares the minimum AGPRs that
should be allocatable. Eventually this should replace amdgpu-no-agpr.

I want this primarily for testing agpr allocation behavior. We should
have a heuristic try to detect a reasonable number of AGPRs to keep
allocatable.
2025-03-06 09:13:59 +07:00
Carl Ritson
581599096e [AMDGPU] Expand IR Attribute table to handle longer names (NFC) 2025-02-21 16:34:54 +09:00
Fabian Ritter
db597084c5
[AMDGPU][docs] Replace gfx940 and gfx941 with gfx942 in llvm/docs (#126887)
gfx940 and gfx941 are no longer supported. This is one of a series of
PRs to remove them from the code base.

This PR removes all documentation occurrences of gfx940/gfx941 except
for the gfx940 ISA description, which will be the subject of a separate
PR.

For SWDEV-512631
2025-02-19 10:31:47 +01:00
Fabian Ritter
8615f9aaff
[AMDGPU] Replace gfx940 and gfx941 with gfx942 in llvm (#126763)
gfx940 and gfx941 are no longer supported. This is one of a series of
PRs to remove them from the code base.

This PR removes all non-documentation occurrences of gfx940/gfx941 from
the llvm directory, and the remaining occurrences in clang.

Documentation changes will follow.

For SWDEV-512631
2025-02-19 10:20:48 +01:00
Krzysztof Drewniak
f7d03707d1
[AMDGPU] Generalize amdgcn.make.buffer.rsrc to fat pointers (#126828)
Attempting to pass a `ptr addrspace(7)` to functions that take `ptr`
arguments produces undesirable `addrspacecast(addrspacecast(p8 x to p7)
to p0) => addrspacecast(p8 x to p0)` folds. This results in illegal GEP
operations on buffer resources, which can't be GEP'd. (However, note
that, while unimplemneted, addressspacecast from ptr addrspace(7) to ptr
is legal - it's just an effective address computation)

To resolve this problem, and thus prevent illegal
`getelementptr T, ptr addrspace(8) %x, ...` s from being produces, this
commit extends amdgcn.make.buffer.rsrc to also be variadic in its result
type, auto-upgrading old manglings.

The logic for handling a make.buffer.rsrc in instruction selection
remains untouched and expects the output type to be a ptr addrspace(8),
as does the Clang lowering for its builtin (the pointer-to-pointer
version might want a different name in clang). LowerBufferFatPointers
has been updated to lower
amdgcn.make.buffer.rsrc.p7.p* to amdgcn.make.buffer.rsrc.p8.p* .

This'll also make exposing buffer fat pointers in Clang easier, since
you don't have to cast between a `__amdgcn_rsrc_t` and a pointer.
2025-02-18 14:15:28 -06:00
Stanislav Mekhanoshin
7639242155
[AMDGPU] Create new directive .amdhsa_inst_pref_size (#126622)
The field INST_PREF_SIZE is available since gfx11.
2025-02-11 08:35:45 -08:00
Konstantin Zhuravlyov
fc4210fb6c
AMDGPU/Docs: Fix target properties for gfx9-4-generic (#125593)
gfx9-4-generic has architected flat scratch, not absolute
2025-02-04 21:47:43 -05:00
Carl Ritson
1f38d38d54 [AMDGPU] Fix documentation table formatting from #118750 (NFC) 2025-01-30 14:27:25 +09:00
Carl Ritson
a3a3e6997b
[AMDGPU] Rewrite GFX12 SGPR hazard handling to dedicated pass (#118750)
- Algorithm operates over whole IR to attempt to minimize waits.
- Add support for VALU->VALU SGPR hazards via VA_SDST/VA_VCC.
2025-01-30 11:21:11 +09:00
Jeffrey Byrnes
db1ee18eda NFC: Typo fix
Change-Id: I08470bc617490558250136ea35a4964003fa9981
2025-01-24 15:59:13 -08:00
Jun Wang
77c23fd0aa
[AMDGPU] Update AMDGPUUsage.rst to document two intrinsics (#123816)
The AMDGPUUsage.rst file is updated to document two intrinsics:
llvm.amdgcn.mov.dpp and llvm.amdgcn.update.dpp.
2025-01-24 14:12:18 -08:00
Austin Kerbow
2e5c298281
[AMDGPU] Add backward compatibility layer for kernarg preloading (#119167)
Add a prologue to the kernel entry to handle cases where code designed
for kernarg preloading is executed on hardware equipped with
incompatible firmware. If hardware has compatible firmware the 256 bytes
at the start of the kernel entry will be skipped. This skipping is done
automatically by hardware that supports the feature.

A pass is added which is intended to be run at the very end of the
pipeline to avoid any optimizations that would assume the prologue is a
real predecessor block to the actual code start. In reality we have two
possible entry points for the function. 1. The optimized path that
supports kernarg preloading which begins at an offset of 256 bytes. 2.
The backwards compatible entry point which starts at offset 0.
2025-01-10 11:39:02 -08:00
Austin Kerbow
aebd3389a9
[AMDGPU] Fix user SGPR alloc order in docs (#119092)
NFC. Preload kernarg SGPRs are allocated after the private segment size
SGPR. This patch updates AMDGPUUsage.rst to reflect this.
2024-12-07 13:08:35 -08:00
Jeffrey Byrnes
9ac52ce8d6
[AMDGPU] Add iglp_opt(3) for simple mfma / exp interleaving (#117269)
Adds a minimal iglp_opt to do simple exp / mfma interleaving.
2024-12-06 15:19:07 -08:00
Shilei Tian
17cfd016b4 [AMDGPU][Doc] Add gfx950 to gfx9-4-generic in the document 2024-12-03 11:17:22 -05:00
Matt Arsenault
d1cca3133a
AMDGPU: Add v_permlane16_swap_b32 and v_permlane32_swap_b32 for gfx950 (#117260)
This was a bit annoying because these introduce a new special case
encoding usage. op_sel is repurposed as a subset of dpp controls,
and is eligible for VOP3->VOP1 shrinking. For some reason fi also
uses an enum value, so we need to convert the raw boolean to 1 instead
of -1.

The 2 registers are swapped, so this has 2 defs. Ideally the builtin
would return a pair, but that's difficult so return a vector instead.
This would make a hypothetical builtin that supports v2f16 directly
uglier.
2024-11-22 20:12:50 -08:00
Matt Arsenault
01c9a14ccf
AMDGPU: Define v_mfma_f32_{16x16x128|32x32x64}_f8f6f4 instructions (#116723)
These use a new VOP3PX encoding for the v_mfma_scale_* instructions,
which bundles the pre-scale v_mfma_ld_scale_b32. None of the modifiers
are supported yet (op_sel, neg or clamp).

I'm not sure the intrinsic should really expose op_sel (or any of the
others). If I'm reading the documentation correctly, we should be able
to just have the raw scale operands and auto-match op_sel to byte
extract patterns.

The op_sel syntax also seems extra horrible in this usage, especially with the
usual assumed op_sel_hi=-1 behavior.
2024-11-21 08:51:58 -08:00
Matt Arsenault
5a556d55fb
AMDGPU: Increase the LDS size to support to 160 KB for gfx950 (#116309) 2024-11-18 10:48:56 -08:00
Matt Arsenault
a6fc489bb7
AMDGPU: Add gfx950 subtarget definitions (#116307)
Mostly a stub, but adds some baseline tests and
tests for removed instructions.
2024-11-18 10:41:14 -08:00
Diana Picus
2aa6cedfa8
[AMDGPU] Clarify amdgpu.cs.chain + init whole wave. NFC (#115452)
Add some docs clarifying how inactive lanes are handled in the
amdgpu_cs_chain calling convention when the llvm.amdgcn.init.whole.wave
intrinsic is used.
2024-11-14 10:10:33 +01:00
Shilei Tian
de0fd64bed
[AMDGPU] Introduce a new generic target gfx9-4-generic (#115190)
This patch introduces a new generic target, `gfx9-4-generic`. Since it doesn’t support FP8 and XF32-related instructions, the patch includes several code reorganizations to accommodate these changes.
2024-11-12 23:11:05 -05:00
Matt Arsenault
0b40f97929
AMDGPU: Treat uint32_max as the default value for amdgpu-max-num-workgroups (#113751)
0 does not make sense as a value for this to be, much less the default.
Also stop emitting each individual field if it is the default, rather than
if any element was the default. Also fix the name of the test since it didn't
exactly match the real attribute name.
2024-11-05 12:50:44 -08:00
Carl Ritson
076aac59ac
[AMDGPU] Add a new target for gfx1153 (#113138) 2024-10-23 12:56:58 +09:00
Jay Foad
e7f1dae412
[AMDGPU] gfx1152 does not have Feature1_5xVGPRs (#113163) 2024-10-22 11:12:00 +01:00