58 Commits

Author SHA1 Message Date
Alex Voicu
18e6958903
[SPIRV][AMDGPU][clang][CodeGen][opt] Add late-resolved feature identifying predicates (#134016)
This change adds two builtins for AMDGPU:

- `__builtin_amdgcn_processor_is`, which is similar in observable
behaviour with `__builtin_cpu_is`, except that it is never "evaluated"
at run time;
- `__builtin_amdgcn_is_invocable`, which is behaviourally similar with
`__has_builtin`, except that it is not a macro (i.e. not evaluated at
preprocessing time).

Neither of these are `constexpr`, even though when compiling for
concrete (i.e. `gfxXXX` / `gfxXXX-generic`) targets they get evaluated
in Clang, so they shouldn't tear the AST too badly / at all for
multi-pass compilation cases like HIP. They can only be used in specific
contexts (as args to control structures).

The motivation for adding these is two-fold:

- as a nice to have, it provides an AST-visible way to incorporate
architecture specific code, rather than having to rely on macros and the
preprocessor, which burn in the choice quite early;
- as a must have, it allows featureful AMDGCN flavoured SPIR-V to be
produced, where target specific capability is guarded and chosen or
discarded when finalising compilation for a concrete target; this is
built atop the Speciali\ation Constant concept which is described in the
SPIR-V specification under section [2.12
Specialization](https://registry.khronos.org/SPIR-V/specs/unified1/SPIRV.html#_specialization_2)

I've tried to keep the overall footprint of the change small. The
changes to Sema are a bit unpleasant, but there was a strong desire to
have Clang validate these, and to constrain their uses, and this was the
most compact solution I could come up with (suggestions welcome).

---------

Co-authored-by: Juan Manuel Martinez Caamaño <jmartinezcaamao@gmail.com>
Co-authored-by: Voicu <avoicu@amd.com>
2026-03-30 23:02:26 +01:00
Stanislav Mekhanoshin
5f99854d01
[AMDGPU] Drop A and B neg modifier from amdgcn_wmma_bf16_16x16x32_bf16 (#189468)
Fixes: LCOMPILER-1673
2026-03-30 14:14:22 -07:00
Stanislav Mekhanoshin
a2d84b5d8d
[AMDGPU] Remove neg support from 4 more gfx1250 WMMA (#189115)
These are previously covered by AMDGPUWmmaIntrinsicModsAllReuse.
2026-03-27 15:20:14 -07:00
Stanislav Mekhanoshin
e69c7312f3
[AMDGPU] Disable neg_lo[0:1] and neg_hi[0:1] on wmma_f32_16x16x32_bf16 (#188649)
This is the pilot change, the rest will follow the same idea.
2026-03-26 00:37:05 -07:00
Rana Pratap Reddy
584c83cb15
[Clang][AMDGPU] Add clang builtins for buffer format load/store intrinsics (#187064)
Adding new clang builtins for AMDGPU raw/struct buffer format load/store
intrinsics. Clang currently has `__builtin_amdgcn_raw_buffer_load_b*`
and `__builtin_amdgcn_raw_buffer_store_b*` builtins, but is missing
builtins for the format variants. These format intrinsics are currently
used by device-libs via manually written IR wrappers in
[buffer-intrinsics.ll](https://github.com/ROCm/llvm-project/blob/amd-staging/amd/device-libs/ockl/src/buffer-intrinsics.ll).
2026-03-19 19:09:44 +01:00
Matt Arsenault
7cb3005ba2
AMDGPU: Add dereferenceable attribute to dispatch ptr intrinsic (#185955)
Stop manually setting it on the callsite in clang.
2026-03-12 07:28:39 +01:00
Matt Arsenault
47e31088f3
clang/AMDGPU: Fix workgroup size builtins for nonuniform work group sizes (#185098)
These were assuming uniform work group sizes. Emit the v4 and v5
sequences to take the remainder group for the nonuniform case.

Currently the device libs uses this builtin on the legacy ABI path with
the same sequence to calculate the remainder, and fully implements the
v5 path. If you perform a franken-build of the library with the updated
builtin, the result is worse. The duplicate sequence does not fully fold out.
However, it does not appear to be wrong. The relevant conformance tests still
pass.
2026-03-09 11:47:49 +01:00
Matt Arsenault
ede065a682
clang/AMDGPU: Do not emit __oclc_ABI_version references with environment (#184868)
Assume a sufficently new code object version if the environment is set
to something indicating we should have a real library.
2026-03-05 20:29:34 +00:00
Sameer Sahasrabuddhe
b02b395a1e
[AMDGPU] Asynchronous loads from global/buffer to LDS on pre-GFX12 (#180466)
The existing "LDS DMA" builtins/intrinsics copy data from global/buffer
pointer to LDS. These are now augmented with their ".async" version,
where the compiler does not automatically track completion. The
completion is now tracked using explicit mark/wait intrinsics, which
must be inserted by the user. This makes it possible to write programs
with efficient waits in software pipeline loops. The program can now
wait for only the oldest outstanding operations to finish, while
launching more operations for later use.

This change only contains the new names of the builtins/intrinsics,
which continue to behave exactly like their non-async counterparts. A
later change will implement the actual mark/wait semantics in
SIInsertWaitcnts.

This is part of a stack split out from #173259:
- #180467
- #180466

Fixes: SWDEV-521121
2026-02-11 05:26:58 +00:00
Simon Pilgrim
1d61ac2d41
[clang][amdgpu] mapCABIAtomicOrdering - fix MSVC not all control paths return a value warning. NFC. (#180500) 2026-02-09 10:59:37 +00:00
Pierre van Houtryve
b79ba02479
[AMDGPU][GFX12.5] Reimplement monitor load as an atomic operation (#177343)
Load monitor operations make more sense as atomic operations, as
non-atomic operations cannot be used for inter-thread communication w/o
additional synchronization.
The previous built-in made it work because one could just override the
CPol bits, but that bypasses the memory model and forces the user to learn
about ISA bits encoding.

Making load monitor an atomic operation has a couple of advantages.
First, the memory model foundation for it is stronger. We just lean on the
existing rules for atomic operations. Second, the CPol bits are abstracted away
from the user, which avoids leaking ISA details into the API.

This patch also adds supporting memory model and intrinsics
documentation to AMDGPUUsage.

Solves SWDEV-516398.
2026-02-09 09:57:27 +01:00
Aaditya
f190477718
[AMDGPU] Add builtins for wave reduction intrinsics (#170813) 2026-01-30 18:15:06 +05:30
Shilei Tian
5a63367b15
Reapply "[AMDGPU] Rework the clamp support for WMMA instructions" (#174674) (#174697)
This reverts commit 0b2f3cfb72a76fa90f3ec2a234caabe0d0712590.
2026-01-07 06:12:19 +00:00
dyung
0b2f3cfb72
Revert "[AMDGPU] Rework the clamp support for WMMA instructions" (#174674)
Reverts llvm/llvm-project#174310

This change is causing 2 cross-project-test failures on
https://lab.llvm.org/buildbot/#/builders/174/builds/29695
2026-01-07 01:18:23 +00:00
Shilei Tian
ccca3b8c67
[AMDGPU] Rework the clamp support for WMMA instructions (#174310)
Fixes #166989.
2026-01-06 15:46:40 -05:00
Shilei Tian
c97de4387b
Revert "[AMDGPU] add clamp immediate operand to WMMA iu8 intrinsic (#171069)" (#174303)
This reverts commit 2c376ffeca490a5732e4fd6e98e5351fcf6d692a because it
breaks assembler.

```
$ llvm-mc -triple=amdgcn -mcpu=gfx1250 -show-encoding <<< "v_wmma_i32_16x16x64_iu8 v[16:23], v[0:7], v[8:15], v[16:23] matrix_b_reuse"
  v_wmma_i32_16x16x64_iu8 v[16:23], v[0:7], v[8:15], v[16:23] clamp ; encoding: [0x10,0x80,0x72,0xcc,0x00,0x11,0x42,0x1c]
```

We have a fundamental issue in the clamp support in VOP3P instructions,
which will need more changes.
2026-01-04 02:13:21 +00:00
Muhammad Abdul
2c376ffeca
[AMDGPU] add clamp immediate operand to WMMA iu8 intrinsic (#171069)
Fixes #166989 

- Adds a clamp immediate operand to the AMDGPU WMMA iu8 intrinsic and
threads it through LLVM IR, MIR lowering, Clang builtins/tests, and MLIR
ROCDL dialect so all layers agree on the new operand
- Updates AMDGPUWmmaIntrinsicModsAB so the clamp attribute is emitted,
teaches VOP3P encoding to accept the immediate, and adjusts Clang
codegen/builtin headers plus MLIR op definitions and tests to match
- Documents what the WMMA clamp operand do
- Implement bitcode AutoUpgrade for source compatibility on WMMA IU8
Intrinsic op

Possible future enhancements:
- infer clamping as an optimization fold based on the use context

---------

Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
2025-12-27 12:51:29 -05:00
Sven van Haastregt
98182f4d20
Move CodeGenFunction::EmitScalarOrConstFoldImmArg; NFC (#170286)
This function is called from various .cpp files under `TargetBuiltins/`,
and was moved unintentionally into `AMDGPU.cpp` in PR #132252. Move it
to a common place.
2025-12-03 08:58:31 +01:00
Aaditya
4604762cc3
[AMDGPU] Add builtins for wave reduction intrinsics (#161816) 2025-11-24 15:13:11 +05:30
Jordan Rupprecht
3d3307ecd8
[clang][NFC] Inline Frontend/FrontendDiagnostic.h -> Basic/DiagnosticFrontend.h (#162883)
d076608d58d1ec55016eb747a995511e3a3f72aa moved some deps around to avoid
cycles and left clang/Frontend/FrontendDiagnostic.h as a shim that
simply includes clang/Basic/DiagnosticFrontend.h. This PR inlines it so
that nothing in tree still includes clang/Frontend/FrontendDiagnostic.h.

Doing this will help prevent future layering issues. See #162865.

Frontend already depends on Basic, so no new deps need to be added
anywhere except for places that do strict dep checking.
2025-11-21 03:39:49 +00:00
Rana Pratap Reddy
24c75a21b8
[AMDGPU][Clang] Support for type inferring extended image builtins for AMDGPU (#164358)
Introduces the builtins for extended image insts for amdgcn.
2025-10-30 22:20:28 +05:30
macurtis-amd
5440cfc450
[clang] Add support for cluster sync scope (#162575)
From Sam Liu:
>CUDA supports thread block clusters
https://docs.nvidia.com/cuda/cuda-c-programming-guide/#thread-block-clusters
>
>In their atomic intrinsics, cluster scope is supported
https://docs.nvidia.com/cuda/cuda-c-programming-guide/#nv-atomic-fetch-add-and-nv-atomic-add
>
>For compatibility, clang and hip needs to support cluster scope.
2025-10-21 05:47:26 -05:00
Rana Pratap Reddy
f2ffb4d20b
[AMDGPU] Support for type inferring image load/store builtins for AMDGPU (#140210)
Introduces the builtins for amdgcn_image_load/store/sample.
2025-10-10 15:26:08 +05:30
Alex Voicu
d481e5f9b7
[AMDGPU][SPIRV] Use SPIR-V syncscopes for some AMDGCN BIs (#154867)
AMDGCN flavoured SPIR-V allows AMDGCN specific builtins, including those
for scoped fences and some specific RMWs. However, at present we don't
map syncscopes to their SPIR-V equivalents, but rather use the AMDGCN
ones. This ends up pessimising the resulting code as system scope is
used instead of device (agent) or subgroup (wavefront), so we correct
the behaviour, to ensure that we do the right thing during reverse
translation.
2025-09-29 22:50:15 +01:00
Aaditya
924bf242c8
[AMDGPU] Add builtins for wave reduction intrinsics (#150170) 2025-09-10 19:06:07 +05:30
Pierre van Houtryve
e2bd10cf16
[AMDGPU][gfx1250] Add 128B cooperative atomics (#156418)
- Add clang built-ins + sema/codegen
- Add IR Intrinsic + verifier
- Add DAG/GlobalISel codegen for the intrinsics
- Add lowering in SIMemoryLegalizer using a MMO flag.
2025-09-04 09:19:25 +00:00
Changpeng Fang
d3d1d8ff21
[AMDGPU] Support cluster load instructions for gfx1250 (#156548) 2025-09-02 16:34:20 -07:00
Nicolai Hähnle
deb851c6d0
clang/AMDGPU: Add __builtin_amdgcn_inverse_ballot_w{32,64} (#155724)
Add builtins that expose the underlying llvm.amdgcn.inverse.ballot
intrinsic that we've had for a while.

This allows more explicitly writing code that selects or branches in
terms of lane masks, which can lead to better code quality.
2025-08-27 19:40:03 -07:00
Stanislav Mekhanoshin
34aed0ed56
[AMDGPU] Add gfx1250 wmma_scale[16]_f32_32x16x128_f4 instructions (#152194) 2025-08-05 15:15:21 -07:00
zGoldthorpe
d7074b63ed
[Clang][AMDGPU] Add builtins for some buffer resource atomics (#149216)
This patch exposes builtins for atomic `add`, `max`, and `min` operations that
operate over buffer resource pointers.
2025-08-05 11:04:15 -06:00
Stanislav Mekhanoshin
a153e83e41
[AMDGPU] gfx1250 v_wmma_scale[16]_f32_16x16x128_f8f6f4 codegen (#152036) 2025-08-04 19:16:34 -07:00
Changpeng Fang
d7a38a94cd
[AMDGPU] Support builtin/intrinsics for load monitors on gfx1250 (#150540) 2025-07-24 16:23:33 -07:00
Pierre van Houtryve
cd1b84caa8
[NFC][AMDGPU] Rename "amdgpu-as" to "amdgpu-synchronize-as" (#148627)
"amdgpu-as" is way too vague and doesn't give enough context.
We may want to support it on normal atomics too, to control the synchronized (ordered) AS.
If we do that, the name has to be less vague.
2025-07-24 12:41:57 +02:00
Changpeng Fang
d6094370cb
AMDGPU: Support v_wmma_f32_16x16x128_f8f6f4 on gfx1250 (#149684)
Co-authored-by: Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com>
2025-07-21 10:09:42 -07:00
Shilei Tian
602d43cfd1
[Clang][AMDGPU] Add the missing builtin __builtin_amdgcn_sqrt_bf16 (#149447)
Co-authored-by: Mekhanoshin, Stanislav <Stanislav.Mekhanoshin@amd.com>
2025-07-18 08:43:08 -04:00
Shilei Tian
aecd44818a
[AMDGPU] Add support for v_tanh_f16 on gfx1250 (#149439)
Co-authored-by: Mekhanoshin, Stanislav <Stanislav.Mekhanoshin@amd.com>
2025-07-18 00:21:04 -04:00
Shilei Tian
7e105fbdbe
[AMDGPU] Add support for v_tanh_f32 on gfx1250 (#149360)
Co-authored-by: Mekhanoshin, Stanislav <Stanislav.Mekhanoshin@amd.com>
2025-07-17 15:42:35 -04:00
Shilei Tian
fd5fc76c91
[AMDGPU] Add support for v_cos_bf16 on gfx1250 (#149355)
Co-authored-by: Mekhanoshin, Stanislav <Stanislav.Mekhanoshin@amd.com>
2025-07-17 14:43:34 -04:00
Shilei Tian
a102342990
[AMDGPU] Add support for v_sin_bf16 on gfx1250 (#149241)
Co-authored-by: Mekhanoshin, Stanislav <Stanislav.Mekhanoshin@amd.com>
2025-07-17 08:49:45 -04:00
Shilei Tian
a6b5ece75e
[AMDGPU] Add support for v_exp_bf16 on gfx1250 (#149229)
Co-authored-by: Mekhanoshin, Stanislav <Stanislav.Mekhanoshin@amd.com>
2025-07-17 08:46:01 -04:00
Shilei Tian
ad6d5d2821
[AMDGPU] Add support for v_log_bf16 on gfx1250 (#149201)
Co-authored-by: Mekhanoshin, Stanislav <Stanislav.Mekhanoshin@amd.com>
2025-07-16 19:09:34 -04:00
Shilei Tian
7d2a58e87d
[AMDGPU] Add support for v_rsq_bf16 on gfx1250 (#149194)
Co-authored-by: Mekhanoshin, Stanislav <Stanislav.Mekhanoshin@amd.com>
2025-07-16 19:06:03 -04:00
Changpeng Fang
c962f2b29d
AMDGPU: Implement builtins for gfx1250 wmma instructions (#148991)
Co-authored-by: Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com>
Co-authored-by: Shilei Tian <Shilei.Tian@amd.com>
2025-07-15 18:17:12 -07:00
Shilei Tian
dabc8e2ec1
[AMDGPU] Add support for v_rcp_bf16 on gfx1250 (#148916)
Co-authored-by: Mekhanoshin, Stanislav <Stanislav.Mekhanoshin@amd.com>
2025-07-15 16:12:51 -04:00
Shilei Tian
d7ec80c897
[AMDGPU] Add support for v_tanh_bf16 on gfx1250 (#147425)
Co-authored-by: Mekhanoshin, Stanislav <Stanislav.Mekhanoshin@amd.com>
2025-07-14 16:30:18 -04:00
Changpeng Fang
1f5f381920
AMDGPU: Implement intrinsic/builtins for gfx1250 load transpose instructions (#146289) 2025-06-29 14:33:31 -07:00
choikwa
77de8a0c0a
[AMDGPU][clang] provide device implementation for __builtin_logb and … (#129347)
…__builtin_scalbn

Clang generates library calls for __builtin_* functions which can be a
problem for GPUs that cannot handle them. This patch generates call to
device implementation for __builtin_logb and ldexp intrinsic for
__builtin_scalbn.
2025-05-19 14:11:31 -04:00
Krzysztof Drewniak
4bdd116b80
[AMDGPU] Add a new amdgcn.load.to.lds intrinsic (#137425)
This PR adds a amdgns_load_to_lds intrinsic that abstracts over loads to
LDS from global (address space 1) pointers and buffer fat pointers
(address space 7), since they use the same API and "gather from a
pointer to LDS" is something of an abstract operation.

This commit adds the intrinsic and its lowerings for addrspaces 1 and 7,
and updates the MLIR wrappers to use it (loosening up the restrictions
on loads to LDS along the way to match the ground truth from target
features).

It also plumbs the intrinsic through to clang.
2025-05-19 07:15:04 -07:00
Matt Arsenault
9bdd9dc895
AMDGPU: Mark workitem ID intrinsics with range attribute (#136196)
This avoids the need to have special handling at every use site.
Unfortunately this means we unnecessarily emit AssertZext in the DAG
(where we already directly understand the range of the intrinsic), andt
we regress in undefined cases as we don't fold out asserts on undef.
2025-04-18 12:27:38 +02:00
Shilei Tian
9e90e10e76
[AMDGPU][Clang] Add builtins for gfx12 ray tracing intrinsics (#135224) 2025-04-11 09:33:32 -04:00