563 Commits

Author SHA1 Message Date
Sameer Sahasrabuddhe
f9adee2f6b
[AMDGPU] asyncmark support for ASYNC_CNT (#185813)
The ASYNC_CNT is used to track the progress of asynchronous copies
between global and LDS memories. By including it in asyncmark, the
compiler can now assist the programmer in generating waits for
ASYNC_CNT.

Assisted-By: Claude Sonnet 4.5

This is part of a stack:

- #185813
- #185810 

Fixes: LCOMPILER-332
2026-04-07 07:23:09 +05:30
Petar Avramovic
ce5a1dffa2
AMDGPU: Improve codegen for VOP2 v_dot2c_f32_f16/bf16 (#179225)
Select the VOP2 version when there are no src_modifiers, otherwise VOP3.
2026-03-23 16:55:53 +01:00
Petar Avramovic
034054431d
AMDGPU: Fix src2_modifiers for v_dot2_f32_f16/bf16 (#179224) 2026-03-23 15:44:09 +01:00
Jay Foad
79d1a2c418
[AMDGPU] Standardize on using AMDGPU::getNullPointerValue. NFC. (#187037)
AMDGPUTargetMachine also had a static method which did the same thing.
Remove it so that we have a single source of truth.
2026-03-17 17:08:16 +00:00
Vigneshwar Jayakumar
6e1aee4276
[AMDGPU] Select v_bfe_u32 for i8/i16 (and (srl x, c), mask) (#182446)
Combine i8 and i16 (and (srl x, c), mask) instructions to v_bfe_u32. This optimization is skipped on true_i16 targets.

resolves issue #179494
2026-03-04 13:31:15 -06:00
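The equivalence behind #182446 above can be checked with a small model. This is only a sketch: `and_srl`, `bfe_u32` and `mask_width` are illustrative names, and `bfe_u32` is a simplified model of an unsigned bit-field extract, not the instruction's full definition. It assumes the mask is a contiguous run of low ones.

```python
def and_srl(x: int, shift: int, mask: int) -> int:
    """The matched pattern: (and (srl x, shift), mask) on a 32-bit value."""
    return ((x & 0xFFFFFFFF) >> shift) & mask


def bfe_u32(x: int, offset: int, width: int) -> int:
    """Simplified unsigned bit-field extract: `width` bits of x from bit `offset`."""
    return ((x & 0xFFFFFFFF) >> offset) & ((1 << width) - 1)


def mask_width(mask: int):
    """Return n if mask == 2**n - 1 (a run of low ones), otherwise None."""
    if mask != 0 and (mask & (mask + 1)) == 0:
        return mask.bit_length()
    return None
```

When `mask_width(mask)` returns a width, `and_srl(x, c, mask)` and `bfe_u32(x, c, width)` give the same value, which is why a single extract can stand in for the shift-and-mask pair.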
Changpeng Fang
5b144c0aec
[AMDGPU] Add suffix _d4 to tensor load/store with 4 groups D#, NFC (#184176)
Rename TENSOR_LOAD_TO_LDS to TENSOR_LOAD_TO_LDS_d4;
Rename TENSOR_STORE_FROM_LDS to TENSOR_STORE_FROM_LDS_d4.
Also rename functions in a couple of tests to reflect this change.
2026-03-03 14:10:38 -08:00
Changpeng Fang
99dc561c7d
[AMDGPU] Use a general form of intrinsic for tensor load/store (#182334)
The intrinsic takes five arguments for the tensor descriptor (D#); the fifth is reserved for future targets and is silently ignored in codegen for gfx1250.
For tensors up to 2D, only the first two D# groups are meaningful and the rest should be zero-initialized.
2026-02-20 17:28:32 -08:00
Piotr Balcer
6012aa1d44
[AMDGPU] Fix opcode comparison logic for G_INTRINSIC (#156008)
The check `(Opc < TargetOpcode::GENERIC_OP_END)` incorrectly
includes `G_INTRINSIC` (129), which is less than
`GENERIC_OP_END` (313), leading to logically dead code.

This patch reorders the conditionals to first check for `G_INTRINSIC`,
ensuring correct handling of the `amdgcn_fdot2` intrinsic.
2026-02-18 11:05:36 +00:00
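The ordering problem described in #156008 above can be reproduced in miniature. The two opcode values are the ones quoted in the commit message; the function names and the returned labels are hypothetical and do not correspond to the actual selector code.

```python
G_INTRINSIC = 129      # value quoted in the commit message
GENERIC_OP_END = 313   # value quoted in the commit message


def classify_before(opc: int) -> str:
    """Range test first: the G_INTRINSIC branch below can never run."""
    if opc < GENERIC_OP_END:
        return "generic"
    if opc == G_INTRINSIC:
        return "intrinsic"  # dead code: 129 < 313 was already handled above
    return "target"


def classify_after(opc: int) -> str:
    """Specific opcode first, then the generic range test."""
    if opc == G_INTRINSIC:
        return "intrinsic"
    if opc < GENERIC_OP_END:
        return "generic"
    return "target"
```

With the original order, `G_INTRINSIC` is swallowed by the range test; checking the specific opcode first makes the intrinsic path reachable again.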
Sameer Sahasrabuddhe
128437fb6a
[AMDGPU] Introduce asyncmark/wait intrinsics (#180467)
Asynchronous operations are memory transfers (usually between the global
memory and LDS) that are completed independently at an unspecified
scope. A thread that requests one or more asynchronous transfers can use
async marks to track their completion. The thread waits for each mark to
be completed, which indicates that requests initiated in program order
before this mark have also completed.

For now, we implement asyncmark/wait operations on pre-GFX12
architectures that support "LDS DMA" operations. Future work will extend
support to GFX12Plus architectures that support "true" async operations.

This is part of a stack split out from #173259
- #180467
- #180466

Co-authored-by: Ryan Mitchell <ryan.mitchell@amd.com>

Fixes: SWDEV-521121
2026-02-11 07:15:51 +00:00
Sameer Sahasrabuddhe
b02b395a1e
[AMDGPU] Asynchronous loads from global/buffer to LDS on pre-GFX12 (#180466)
The existing "LDS DMA" builtins/intrinsics copy data from global/buffer
pointer to LDS. These are now augmented with their ".async" version,
where the compiler does not automatically track completion. The
completion is now tracked using explicit mark/wait intrinsics, which
must be inserted by the user. This makes it possible to write programs
with efficient waits in software pipeline loops. The program can now
wait for only the oldest outstanding operations to finish, while
launching more operations for later use.

This change only contains the new names of the builtins/intrinsics,
which continue to behave exactly like their non-async counterparts. A
later change will implement the actual mark/wait semantics in
SIInsertWaitcnts.

This is part of a stack split out from #173259:
- #180467
- #180466

Fixes: SWDEV-521121
2026-02-11 05:26:58 +00:00
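The mark/wait idea from #180467 and #180466 above can be illustrated with a toy model. Everything here is an assumption made for illustration: the class `AsyncTracker`, its methods, the `keep` parameter, and the premise that operations complete strictly in issue order. It is not the intrinsics' actual interface or the SIInsertWaitcnts implementation.

```python
from collections import deque


class AsyncTracker:
    """Toy model: operations complete strictly in issue order (an assumption)."""

    def __init__(self):
        self.pending = deque()  # ids of operations still in flight
        self.marks = deque()    # each mark records how many ops were issued before it
        self.issued = 0
        self.completed = 0

    def issue(self):
        """Start one asynchronous transfer."""
        self.pending.append(self.issued)
        self.issued += 1

    def mark(self):
        """Place a mark covering every operation issued so far."""
        self.marks.append(self.issued)

    def wait(self, keep: int):
        """Retire the oldest marks until at most `keep` marks remain outstanding."""
        while len(self.marks) > keep:
            target = self.marks.popleft()
            while self.completed < target:
                self.pending.popleft()
                self.completed += 1
```

In a software-pipelined loop, each iteration would issue its transfers, place a mark, and then wait while keeping the newest marks outstanding, so only the oldest batch has to finish before its data is used.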
Diana Picus
24405f070f
[AMDGPU] Add intrinsic exposing s_alloc_vgpr (#163951)
Make it possible to use `s_alloc_vgpr` at the IR level. This is a huge
footgun and use for anything other than compiler internal purposes is
heavily discouraged. The calling code must make sure that it does not
allocate fewer VGPRs than necessary - the intrinsic is NOT a request to
the backend to limit the number of VGPRs it uses (in essence it's not so
different from what we do with the dynamic VGPR flags of the
`amdgcn.cs.chain` intrinsic, it just makes it possible to use this
functionality in other scenarios).
2026-02-10 09:28:31 +01:00
Jay Foad
4a6697f393
[AMDGPU] Fix and simplify patterns selecting fsub to v_fma_mix_f32 (#180169)
Select (fsub x, y) -> (fma y, -1.0, x). Using -1.0 as the constant
avoids the need for ComplexPatterns to negate x or y.

This also fixes the bad pattern (fsub x, y) -> (fma -x, 1.0, y).
2026-02-06 14:39:13 +00:00
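The arithmetic behind #180169 above is easy to check. In this sketch `fma` is a plain unfused multiply-add used only to model the algebra, and the two wrapper names are hypothetical.

```python
def fma(a: float, b: float, c: float) -> float:
    """Unfused model of a * b + c; enough to check the algebra."""
    return a * b + c


def fsub_fixed(x: float, y: float) -> float:
    """(fsub x, y) -> (fma y, -1.0, x), the pattern the commit selects."""
    return fma(y, -1.0, x)


def fsub_bad(x: float, y: float) -> float:
    """(fsub x, y) -> (fma -x, 1.0, y), the pattern the commit removes."""
    return fma(-x, 1.0, y)
```

For x = 5.0 and y = 2.0 the fixed pattern yields x - y, while the removed pattern yields y - x, the result with the wrong sign.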
Acim Maravic
b0827f3b36
[LLVM] Select fma_mix for v_cvt_f32_f16 and v_add_f32/v_mul_f32 (#160151) 2026-02-05 11:51:25 +01:00
Juan Manuel Martinez Caamaño
04c56505f8
[NFC][LLVM] Make constrainSelectedInstRegOperands return void (#179501)
`constrainSelectedInstRegOperands` always returns `true`; so it can be
safely transformed to return `void` instead.

A follow-up patch should update `MachineInstrBuilder::constrainAllUses`.
2026-02-04 08:59:16 +01:00
serge-sans-paille
85919fbfa4
[perf] Replace copy-assign by move-assign in llvm/lib/Target/AMDGPU/ (#179460) 2026-02-03 14:24:31 +00:00
Sam Elliott
7184229fea
[NFC][MI] Tidy Up RegState enum use (2/2) (#177090)
This change makes `RegState` into an enum class, with bitwise operators.
It also:
- Updates declarations of flag variables/arguments/returns from
`unsigned` to `RegState`.
- Updates empty RegState initializers from 0 to `{}`.

If this is causing problems in downstream code:
- Adopt the `RegState getXXXRegState(bool)` functions instead of using a
ternary operator such as `bool ? RegState::XXX : 0`.
- Adopt the `bool hasRegState(RegState, RegState)` function instead of
using a bitwise check of the flags.
2026-01-23 00:19:03 -08:00
Shilei Tian
02d34a76f7
[NFCI][AMDGPU] Remove more redundant code from GCNSubtarget.h (#177297)
With this cleanup, we are getting pretty close to using
`GET_SUBTARGETINFO_MACRO` in the header.
2026-01-22 09:07:15 -05:00
saxlungs
7bbaf2e16b
[AMDGPU] Improve llvm.amdgcn.wave.shuffle handling for pre-GFX8 (#174845)
Before, GlobalISel would still return true for lowering the intrinsic
for GFX7 and earlier even though the required ds_bpermute_b32
instruction is not supported. After this change, GlobalISel will
properly report failure to select in this case. Testing is updated
appropriately.

Signed-off-by: Domenic Nutile <domenic.nutile@gmail.com>
2026-01-07 21:48:11 +01:00
saxlungs
c262893f4b
Reland "[AMDGPU] Add new llvm.amdgcn.wave.shuffle intrinsic (#167372)" (#174614)
This change adds a new intrinsic for AMDGPU that implements a wave
shuffle, allowing arbitrary swizzling between lanes using an index. In
the initial version of this commit, there was an issue in one of the
tests added that returned a signal, causing testing to fail when
combined with another recent change to 'not'.

For context on the initial commit see #167372

---------

Signed-off-by: Domenic Nutile <domenic.nutile@gmail.com>
Co-authored-by: Jay Foad <jay.foad@gmail.com>
2026-01-06 15:02:08 -05:00
Joe Nash
4bca00d56b
Revert "[AMDGPU] Add new llvm.amdgcn.wave.shuffle intrinsic" (#174501)
Reverts llvm/llvm-project#167372
2026-01-05 17:52:28 -05:00
saxlungs
b9fbc19017
[AMDGPU] Add new llvm.amdgcn.wave.shuffle intrinsic (#167372)
This intrinsic will be useful for implementing the
OpGroupNonUniformShuffle operation in the SPIR-V reference

---------

Signed-off-by: Domenic Nutile <domenic.nutile@gmail.com>
Co-authored-by: Jay Foad <jay.foad@gmail.com>
2026-01-05 17:15:58 -05:00
Mirko Brkušanin
5759a3a779
[AMDGPU] Add s_wakeup_barrier instruction for gfx1250 (#170501) 2025-12-10 09:45:13 +01:00
anjenner
740a3ad1f7
AMDGPU: Add codegen for atomicrmw operations usub_cond and usub_sat (#141068)
Split off from https://github.com/llvm/llvm-project/pull/105553 as per
discussion there.
2025-12-05 12:37:33 +00:00
Matt Arsenault
2ee12f191a
AMDGPU: Use RegClassByHwMode to manage GWS operand special case (#169373)
On targets that require even aligned 64-bit VGPRs, GWS operands
require even alignment of a 32-bit operand. Previously we had a hacky
post-processing which added an implicit operand to try to manage
the constraint. This would require special casing in other passes
to avoid breaking the operand constraint. This moves the handling
into the instruction definition, so other passes no longer need
to consider this edge case. MC still does need to special case this,
to print/parse as a 32-bit register. This also still ends up net
less work than introducing even aligned 32-bit register classes.

This also should be applied to the image special case.
2025-11-25 18:55:34 +00:00
Jay Foad
72c69aefba
[AMDGPU] Make use of getFunction and getMF. NFC. (#167872) 2025-11-14 11:00:57 +00:00
LU-JOHN
87b1d3537a
[AMDGPU][NFC] Avoid copying MachineOperands (#166293)
Avoid copying machine operands.

Signed-off-by: John Lu <John.Lu@amd.com>
2025-11-04 23:18:40 -06:00
Abhay Kanhere
d998f92a00
[CodeGen] MachineVerifier to check early-clobber constraint (#151421)
Currently, MachineVerifier does not verify the early-clobber operand
constraint.
The only other machine operand constraint, TiedTo, is already verified.
2025-11-04 18:39:31 -08:00
vangthao95
d1d635083d
[AMDGPU][GlobalISel] Clean up selectCOPY_SCC_VCC function (#165797)
Follow-up patch to address the comments in
https://github.com/llvm/llvm-project/pull/165355.
2025-10-31 13:17:44 -07:00
vangthao95
ba5cde79aa
[AMDGPU][GlobalISel] Fix issue with copy_scc_vcc on gfx7 (#165355)
When selecting for G_AMDGPU_COPY_SCC_VCC, we use S_CMP_LG_U64 or
S_CMP_LG_U32 for wave64 and wave32 respectively. However, on gfx7 we do
not have the S_CMP_LG_U64 instruction. Work around this issue by using
S_OR_B64 instead.
2025-10-30 08:19:12 -07:00
Harrison Hao
d604ab6288
[AMDGPU] Support image atomic no return instructions (#150742)
Add support for no-return variants of image atomic operations
(e.g. IMAGE_ATOMIC_ADD_NORTN, IMAGE_ATOMIC_CMPSWAP_NORTN). 
These variants are generated when the return value of the intrinsic is
unused, allowing the backend to select no return type instructions.
2025-10-29 10:42:15 +08:00
Krzysztof Drewniak
d37141776f
[AMDGPU] Enable volatile and non-temporal for loads to LDS (#153244)
The primary purpose of this commit is to enable marking loads to LDS
(global.load.lds, buffer.*.load.lds) volatile (using bit 31 of the aux
as with normal buffer loads) and to ensure that their !nontemporal
annotations translate to appropriate settings of the cache control bits.

However, in the process of implementing this feature, we also fixed:
- Incorrect handling of buffer loads to LDS in GlobalISel
- The handling of volatile on buffers in SIMemoryLegalizer:
previously, the mapping of address spaces would cause volatile on buffer
loads to be silently dropped on at least gfx10.

---------

Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
2025-10-20 12:42:22 -05:00
Petar Avramovic
98d43ef2d8
AMDGPU: Use srcvalue and delete Ignore complex pattern (#161359) 2025-09-30 16:18:51 +02:00
Petar Avramovic
1553b3de71
AMDGPU: Fix gcc build break (#161354) 2025-09-30 14:01:08 +02:00
Petar Avramovic
709a74dfb3
AMDGPU: Fix s_barrier_leave to write to scc (#161221)
s_barrier_leave implicitly defines $scc
and does not use the imm that represents the type of barrier;
the isel pattern ignores the imm operand from the llvm intrinsic.
Test if SIInsertWaitcnts tracks this scc write.
2025-09-30 12:55:35 +02:00
Changpeng Fang
7753f61f61
[AMDGPU] Support cluster_load_async_to_lds instructions on gfx1250 (#156595) 2025-09-03 11:22:10 -07:00
Changpeng Fang
d3d1d8ff21
[AMDGPU] Support cluster load instructions for gfx1250 (#156548) 2025-09-02 16:34:20 -07:00
Nicolai Hähnle
353b5e43c6
AMDGPU: Refactor lowering of s_barrier to split barriers (#154648)
Let's do the lowering of non-split into split barriers in a new IR pass,
AMDGPULowerIntrinsics. That way, there is no code duplication between
SelectionDAG and GlobalISel. This simplifies some upcoming extensions to
the code.
2025-08-28 07:01:20 -07:00
Stanislav Mekhanoshin
d0ee82040c
[AMDGPU] Add s_barrier_init|join|leave instructions (#153296) 2025-08-12 15:07:07 -07:00
Fabian Ritter
e9ece175f9
[AMDGPU][GISel] Only fold flat offsets if they are inbounds (#153001)
For flat memory instructions where the address is supplied as a base address
register with an immediate offset, the memory aperture test ignores the
immediate offset. Currently, ISel does not respect that, which leads to
miscompilations where valid input programs crash when the address computation
relies on the immediate offset to get the base address in the proper memory
aperture. Global or scratch instructions are not affected.

This patch only selects flat instructions with immediate offsets from address
computations with the inbounds flag: If the address computation does not leave
the bounds of the allocated object, it cannot leave the bounds of the memory
aperture and is therefore safe to handle with an immediate offset.

Relevant tests are in fold-gep-offset.ll.

Analogous to #132353 for SDAG (which is not yet in a mergeable state, its
progress is currently blocked by #146076).

Fixes SWDEV-516125 for GISel.
2025-08-12 10:14:20 +02:00
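The failure mode described in #153001 above can be modelled in a few lines. The aperture bounds and all function names are hypothetical, chosen only for illustration. The single behaviour taken from the commit message is that the aperture test looks at the base register and ignores the immediate offset.

```python
LDS_BASE = 0x1000_0000  # hypothetical aperture start
LDS_SIZE = 0x1_0000     # hypothetical aperture size


def aperture(addr: int) -> str:
    """Classify an address against the hypothetical LDS aperture."""
    return "lds" if LDS_BASE <= addr < LDS_BASE + LDS_SIZE else "global"


def flat_access(base: int, imm_offset: int):
    """Aperture is chosen from `base` alone; the access goes to base + imm_offset."""
    return aperture(base), base + imm_offset


def unfolded(base: int, offset: int):
    """The add happens in a register, so the aperture test sees the full address."""
    return flat_access(base + offset, 0)


def folded(base: int, offset: int):
    """The offset is folded into the instruction's immediate field."""
    return flat_access(base, offset)
```

If `base` lies just outside the aperture and the offset moves the address inside it, the folded and unfolded forms disagree on the aperture. When both `base` and `base + offset` stay within one object, which is what inbounds guarantees, they agree.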
Changpeng Fang
1e815ced81
[AMDGPU] Use SDNodeXForm to select a few VOP3P modifiers, NFC (#151907)
It is not necessary to use ComplexPattern to select VOP3PModsNeg, VOP3PModsNegs
and VOP3PModsNegAbs. We can use SDNodeXForm instead.
2025-08-04 12:51:48 -07:00
Changpeng Fang
7d2332391f
[AMDGPU] Fix destination op_sel for v_cvt_scale32_* and v_cvt_sr_* (#151411)
GFX950 uses OP_SEL[MSB:LSB] for both src reads and dest writes. So this
patch essentially reverts the work from
https://github.com/llvm/llvm-project/pull/151286 regarding dest writes.
2025-07-30 16:15:50 -07:00
Changpeng Fang
180281b8ec
[AMDGPU] Fix op_sel settings for v_cvt_scale32_* and v_cvt_sr_* (#151286)
For OPF_OPSEL_SRCBYTE: Vector instruction uses OPSEL[1:0] to specify a byte
select for the first source operand. So op_sel [0, 0], [1, 0], [0, 1] and
[1, 1] should map to byte 0, 1, 2 and 3, respectively.

For OPF_OPSEL_DSTBYTE: OPSEL is used as a destination byte select. OPSEL[2:3]
specify which byte of the destination to write to. Note that the order of the
bits is different from that of OPF_OPSEL_SRCBYTE. So the mapping should be:
op_sel [0, 0], [0, 1], [1, 0] and [1, 1] map to byte 0, 1, 2 and 3,
respectively.

Fixes: SWDEV-544901
2025-07-30 12:24:51 -07:00
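The two mappings listed in #151286 above differ only in which element of the pair is the low bit. A minimal sketch follows, with hypothetical helper names and each op_sel value treated as the two-element pair written in the commit message. Note that #151411, listed earlier in this log, reverts the destination-write part for GFX950, so `dst_byte` reflects this commit's description only.

```python
def src_byte(op_sel) -> int:
    """Source byte select: [0,0]->0, [1,0]->1, [0,1]->2, [1,1]->3."""
    first, second = op_sel
    return first + 2 * second


def dst_byte(op_sel) -> int:
    """Destination byte select: [0,0]->0, [0,1]->1, [1,0]->2, [1,1]->3."""
    first, second = op_sel
    return 2 * first + second
```

The pairs `[1, 0]` and `[0, 1]` swap roles between the two mappings, which is the bit-order difference the message points out.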
Stanislav Mekhanoshin
7eaf1f2b2d
[AMDGPU] Bitop3 opcodes for gfx1250 (#151235) 2025-07-29 15:36:56 -07:00
Stanislav Mekhanoshin
d99238263c
[AMDGPU] Implement v_mad_u32/v_mad_nc_u|i64_u32 on gfx1250 (#151226) 2025-07-29 15:06:35 -07:00
Changpeng Fang
3b66d4a987
[AMDGPU] Support builtin/intrinsics for async loads/stores on gfx1250 (#151058) 2025-07-29 08:20:05 -07:00
Changpeng Fang
d7a38a94cd
[AMDGPU] Support builtin/intrinsics for load monitors on gfx1250 (#150540) 2025-07-24 16:23:33 -07:00
Stanislav Mekhanoshin
96e5eed92a
[AMDGPU] Select VMEM prefetch for llvm.prefetch on gfx1250 (#150493)
We have a choice to use a scalar or vector prefetch for a uniform
pointer. Since we do not have scalar stores, our scalar cache is
practically read-only. The rw argument of the prefetch intrinsic is
used to force vector operation even for a uniform case. On GFX12
scalar prefetch will be used anyway; it is still useful, but it will
only bring data to L2.
2025-07-24 13:22:50 -07:00
Stanislav Mekhanoshin
c6e560a25b
[AMDGPU] Select scale_offset for scratch instructions on gfx1250 (#150111) 2025-07-22 15:24:55 -07:00
Stanislav Mekhanoshin
a0973de745
[AMDGPU] Select scale_offset for global instructions on gfx1250 (#150107)
Also switches immediate offset to signed for the subtarget.
2025-07-22 15:04:52 -07:00
Stanislav Mekhanoshin
a0aebb1935
[AMDGPU] Select scale_offset with SMEM instructions (#150078) 2025-07-22 13:26:28 -07:00