1039 Commits

Author SHA1 Message Date
Mirko Brkušanin
5d9eb0c76a
[AMDGPU] Define new targets gfx1171 and gfx1172 (#187735) 2026-04-01 18:16:11 +02:00
Alex Voicu
18e6958903
[SPIRV][AMDGPU][clang][CodeGen][opt] Add late-resolved feature identifying predicates (#134016)
This change adds two builtins for AMDGPU:

- `__builtin_amdgcn_processor_is`, which is similar in observable
behaviour to `__builtin_cpu_is`, except that it is never "evaluated"
at run time;
- `__builtin_amdgcn_is_invocable`, which is behaviourally similar to
`__has_builtin`, except that it is not a macro (i.e. not evaluated at
preprocessing time).

Neither of these are `constexpr`, even though when compiling for
concrete (i.e. `gfxXXX` / `gfxXXX-generic`) targets they get evaluated
in Clang, so they shouldn't tear the AST too badly / at all for
multi-pass compilation cases like HIP. They can only be used in specific
contexts (as args to control structures).

The motivation for adding these is two-fold:

- as a nice to have, it provides an AST-visible way to incorporate
architecture specific code, rather than having to rely on macros and the
preprocessor, which burn in the choice quite early;
- as a must have, it allows featureful AMDGCN flavoured SPIR-V to be
produced, where target specific capability is guarded and chosen or
discarded when finalising compilation for a concrete target; this is
built atop the Specialization Constant concept which is described in the
SPIR-V specification under section [2.12
Specialization](https://registry.khronos.org/SPIR-V/specs/unified1/SPIRV.html#_specialization_2)

I've tried to keep the overall footprint of the change small. The
changes to Sema are a bit unpleasant, but there was a strong desire to
have Clang validate these, and to constrain their uses, and this was the
most compact solution I could come up with (suggestions welcome).

---------

Co-authored-by: Juan Manuel Martinez Caamaño <jmartinezcaamao@gmail.com>
Co-authored-by: Voicu <avoicu@amd.com>
2026-03-30 23:02:26 +01:00
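As a rough mental model of the late resolution described in #134016, the sketch below keeps target predicates unevaluated in a generic module and folds them only once a concrete processor is chosen. It is plain Python, not Clang or SPIR-V code. `Guarded`, `finalise`, `processor_is`, `is_invocable`, and the feature table are hypothetical names invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical feature table; real processors and builtins differ.
INVOCABLE = {
    "gfx90a": {"mfma"},
    "gfx1250": {"wmma", "swmmac"},
}


@dataclass
class Guarded:
    """A block of code guarded by a predicate that is not yet evaluated."""
    predicate: Callable[[str], bool]  # takes the concrete processor name
    body: str


def processor_is(name: str) -> Callable[[str], bool]:
    # Stand-in for the idea behind __builtin_amdgcn_processor_is.
    return lambda target: target == name


def is_invocable(feature: str) -> Callable[[str], bool]:
    # Stand-in for the idea behind __builtin_amdgcn_is_invocable.
    return lambda target: feature in INVOCABLE.get(target, set())


def finalise(module: List[Guarded], target: str) -> List[str]:
    """Resolve every predicate for `target`; keep or discard each body."""
    return [g.body for g in module if g.predicate(target)]


generic_module = [
    Guarded(processor_is("gfx90a"), "gfx90a fast path"),
    Guarded(is_invocable("wmma"), "wmma path"),
]
```

Calling `finalise(generic_module, target)` with different target names yields different surviving bodies, which mirrors how guarded code is chosen or discarded at finalisation time.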
Stanislav Mekhanoshin
5f99854d01
[AMDGPU] Drop A and B neg modifier from amdgcn_wmma_bf16_16x16x32_bf16 (#189468)
Fixes: LCOMPILER-1673
2026-03-30 14:14:22 -07:00
Stanislav Mekhanoshin
a2d84b5d8d
[AMDGPU] Remove neg support from 4 more gfx1250 WMMA (#189115)
These were previously covered by AMDGPUWmmaIntrinsicModsAllReuse.
2026-03-27 15:20:14 -07:00
Stanislav Mekhanoshin
e69c7312f3
[AMDGPU] Disable neg_lo[0:1] and neg_hi[0:1] on wmma_f32_16x16x32_bf16 (#188649)
This is the pilot change, the rest will follow the same idea.
2026-03-26 00:37:05 -07:00
Rana Pratap Reddy
584c83cb15
[Clang][AMDGPU] Add clang builtins for buffer format load/store intrinsics (#187064)
Adding new clang builtins for AMDGPU raw/struct buffer format load/store
intrinsics. Clang currently has `__builtin_amdgcn_raw_buffer_load_b*`
and `__builtin_amdgcn_raw_buffer_store_b*` builtins, but is missing
builtins for the format variants. These format intrinsics are currently
used by device-libs via manually written IR wrappers in
[buffer-intrinsics.ll](https://github.com/ROCm/llvm-project/blob/amd-staging/amd/device-libs/ockl/src/buffer-intrinsics.ll).
2026-03-19 19:09:44 +01:00
CarolineConcatto
d96722b660
[LLVM] Improve IR parsing and printing for target memory locations (#176968)
This patch adds support for specifying all target memory locations using
a single IR spelling, such as:
```
memory(target_mem: read)
```

This form is not supported in TableGen, but it is now accepted by the IR
parser. When the parser encounters target_mem, it expands it to all
target-memory locations (e.g., target_mem0, target_mem1, …).

Printing behavior

When all target-memory locations share the same ModRef value, the
printer now collapses them into a single entry:
```
memory(target_mem: read)
```
Otherwise, each target memory location is printed separately.

Rejected IR:
```
memory(target_mem0: write, target_mem: read)
```
This is invalid because the default access kind for the target memory
group must appear first.
2026-03-19 17:29:54 +00:00
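The printing and ordering rules from #176968 can be sketched in a few lines of Python. This is only a model of the behaviour the commit message describes: the function names are made up, non-target locations are ignored, and the exact output of the real printer for the non-collapsed case is an assumption.

```python
from typing import Dict, List


def print_memory(target_locs: Dict[str, str]) -> str:
    """Render a memory(...) attribute for target-memory locations only.

    `target_locs` maps location names (e.g. "target_mem0") to a ModRef
    spelling such as "read". Other locations are left out of this sketch.
    """
    values = set(target_locs.values())
    if len(values) == 1:
        # All locations agree: collapse into the single group spelling.
        return f"memory(target_mem: {values.pop()})"
    # Otherwise print each target-memory location separately.
    parts = [f"{name}: {modref}" for name, modref in sorted(target_locs.items())]
    return "memory(" + ", ".join(parts) + ")"


def check_order(entries: List[str]) -> bool:
    """Accept the group default only if no specific location precedes it."""
    seen_specific = False
    for name in entries:
        if name == "target_mem":
            if seen_specific:
                return False
        elif name.startswith("target_mem"):
            seen_specific = True
    return True
```

Here `check_order(["target_mem0", "target_mem"])` returns `False`, matching the rejected IR above.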
Chinmay Deshpande
e044c4ad81
[AMDGPU] Add target features for SWMMAC instructions (#185785)
Introduce `swmmac-gfx1200-insts` and `swmmac-gfx1250-insts`
2026-03-18 13:52:34 -07:00
paperchalice
26ac669101
[LLVM] Remove "no-nans-fp-math" attribute support (#186285)
Now all `NoNaNsFPMath` uses have been removed, remove this attribute.
2026-03-13 09:29:28 +00:00
Matt Arsenault
7cb3005ba2
AMDGPU: Add dereferenceable attribute to dispatch ptr intrinsic (#185955)
Stop manually setting it on the callsite in clang.
2026-03-12 07:28:39 +01:00
Jan Patrick Lehr
883aa697db
Revert "[Clang][AMDGPU] Change __fp16 to _Float16 in builtin definitions" (#185861)
Reverts llvm/llvm-project#185446

This breaks CK build downstream.
2026-03-11 09:27:48 -04:00
Rana Pratap Reddy
a17289b76a
[Clang][AMDGPU] Change __fp16 to _Float16 in builtin definitions (#185446)
Change the type signature of the `SWMMAC`, `load`, and `cvt` builtins from
`__fp16` to `_Float16` in the tablegen builtin definitions.
2026-03-10 10:20:40 +05:30
Matt Arsenault
47e31088f3
clang/AMDGPU: Fix workgroup size builtins for nonuniform work group sizes (#185098)
These were assuming uniform work group sizes. Emit the v4 and v5
sequences to take the remainder group for the nonuniform case.

Currently the device libs use this builtin on the legacy ABI path with
the same sequence to calculate the remainder, and fully implements the
v5 path. If you perform a franken-build of the library with the updated
builtin, the result is worse. The duplicate sequence does not fully fold out.
However, it does not appear to be wrong. The relevant conformance tests still
pass.
2026-03-09 11:47:49 +01:00
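The "remainder group" in #185098 comes down to simple arithmetic per dimension. The sketch below shows that arithmetic only; it is not the v4 or v5 instruction sequence Clang emits, and the function name is hypothetical.

```python
def work_group_size(global_size: int, local_size: int, group_id: int) -> int:
    """Size of work-group `group_id` along one dimension.

    Every group has `local_size` items except possibly the last one,
    which gets whatever is left over when `global_size` is not a
    multiple of `local_size`.
    """
    remaining = global_size - group_id * local_size
    if remaining <= 0:
        raise ValueError("group_id is out of range")
    return min(local_size, remaining)
```

With a global size of 100 and a local size of 32, groups 0 to 2 hold 32 work-items each and group 3 holds the remaining 4. Code that assumes uniform sizes would report 32 for that last group as well.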
Matt Arsenault
d840396e20
clang: Simplify emission of uniform-work-group-size attribute (#185066) 2026-03-09 11:17:30 +01:00
Joseph Huber
5a88dffc40
[Clang] Only define wchar_size module flag if non-standard (#184668)
Summary:
This PR simply changes the behavior of the `wchar_size` flag. Currently,
we emit this in all cases for all targets. This causes problems during
LLVM-IR linking, specifically because this would vary between Linux and
Windows in unintuitive ways. Now we have an llvm::Triple helper to
determine the size from the known values. The module flag will only be
emitted if these do not match (indicating a non-standard environment).

In addition to fixing AMDGCN bitcode linking, this also means we don't
need to bloat *every* IR module compiled by clang with this flag. The
changed tests reflect this: one less unnecessary piece of metadata.
2026-03-04 16:13:48 -06:00
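The emission rule from #184668 can be modelled as a comparison against a per-platform default. In this sketch `default_wchar_size` is a hypothetical stand-in for the `llvm::Triple` helper, and the "2 bytes on Windows, 4 elsewhere" table is a simplifying assumption rather than the helper's actual logic.

```python
from typing import Optional


def default_wchar_size(os_name: str) -> int:
    """Hypothetical stand-in for the triple helper: bytes in wchar_t."""
    # Assumption for this sketch: 2 bytes on Windows, 4 elsewhere.
    return 2 if os_name == "windows" else 4


def wchar_size_flag(os_name: str, actual_size: int) -> Optional[int]:
    """Return the flag value to emit, or None when the flag is omitted."""
    if actual_size == default_wchar_size(os_name):
        return None  # standard environment: no module flag
    return actual_size  # non-standard (e.g. -fshort-wchar): record it
```

Two modules built for the same standard environment then carry no `wchar_size` flag at all, so there is nothing to conflict when they are linked.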
Mirko Brkušanin
d0f50d5574
[AMDGPU] Remove DX10_CLAMP and IEEE bits from gfx1170 (#182107)
Add the `DX10ClampAndIEEEMode` feature and set it for every subtarget prior
to gfx1170.
2026-03-04 12:16:41 +01:00
Changpeng Fang
5b144c0aec
[AMDGPU] Add suffix _d4 to tensor load/store with 4 groups D#, NFC (#184176)
Rename TENSOR_LOAD_TO_LDS to TENSOR_LOAD_TO_LDS_d4;
rename TENSOR_STORE_FROM_LDS to TENSOR_STORE_FROM_LDS_d4.
Also rename functions in a couple of tests to reflect this change.
2026-03-03 14:10:38 -08:00
Joseph Huber
d61b45cd40
[Clang] Generate ptr and float atomics without integer casts (#183853)
Summary:
LLVM IR should support these for all cases except for compare-exchange.
Currently the code goes through an integer indirection for these cases.
This PR changes the behavior to use atomics directly to the target
memory type.
2026-03-03 09:57:26 -06:00
Shilei Tian
81872e7049
[NFC] Fix check lines for clang/test/CodeGenOpenCL/cl-uniform-wg-size.cl on Darwin (#184042) 2026-03-01 22:07:50 +00:00
Shilei Tian
f05d2e8a39
[AMDGPU] Make uniform-work-group-size a valueless attribute (#183925)
The "uniform-work-group-size" function attribute previously took a
string value of "true" or "false". Since presence alone can convey the
"true" semantics and absence can convey "false", the value is
unnecessary.

This patch converts it to a valueless string attribute: presence
indicates true, absence indicates false. For backward compatibility,
auto-upgrade logic is added in both UpgradeAttributes (bitcode) and
UpgradeFunctionAttributes: if the old value is "true", the attribute is
kept without a value; if "false", the attribute is removed.
2026-03-01 21:29:55 +00:00
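The auto-upgrade rule in #183925 is small enough to model directly. The sketch below represents a function's string attributes as a Python dict, with `None` meaning "present without a value". That representation and the `upgrade` name are assumptions for illustration, not the LLVM implementation.

```python
from typing import Dict, Optional

ATTR = "uniform-work-group-size"


def upgrade(attrs: Dict[str, Optional[str]]) -> Dict[str, Optional[str]]:
    """Upgrade old-style string values; None means 'present, no value'."""
    upgraded = dict(attrs)
    value = upgraded.get(ATTR)
    if value == "true":
        upgraded[ATTR] = None  # keep the attribute, drop the value
    elif value == "false":
        del upgraded[ATTR]  # absence now means false
    return upgraded
```

Attributes that are already valueless, and unrelated attributes, pass through unchanged.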
Shilei Tian
e2ef93fc57
[NFC] Remove clang/test/CodeGenOpenCL/.gdb_history (#184038) 2026-03-01 16:21:17 -05:00
Shilei Tian
d9ca61b6e7
Revert "[NFC][Clang] Auto generate check lines for clang/test/CodeGenOpenCL/cl-uniform-wg-size.cl" (#184035)
Reverts llvm/llvm-project#183926 because of some BB failures.
2026-03-01 15:29:44 -05:00
Shilei Tian
dddd06be8c
[NFC][Clang] Auto generate check lines for clang/test/CodeGenOpenCL/cl-uniform-wg-size.cl (#183926) 2026-03-01 14:41:57 -05:00
Changpeng Fang
99dc561c7d
[AMDGPU] Use a general form of intrinsic for tensor load/store (#182334)
The intrinsic takes five arguments for the tensor descriptor (D#). The fifth is reserved for future targets and is silently ignored in codegen for gfx1250.
For tensors up to 2D, only the first two D# groups are meaningful; the rest should be zero-initialized.
2026-02-20 17:28:32 -08:00
Mirko Brkušanin
829afc4c91
[AMDGPU] Add WMMA and SWMMAC instructions for gfx1170 (#180731)
Introduce two new subtarget features:

- WMMA256bInsts for GFX11 WMMA instructions and
- WMMA128bInsts for GFX1170 and GFX12 WMMA and SWMMAC instructions

Some WMMA instructions have changed from GFX 11.0 to GFX 11.7 so new
Real versions were added with "_gfx1170" suffix. For consistency all
WMMA and SWMMAC GFX11.7 instructions use this suffix.

To resolve decoding issues between different formats for some WMMA
instructions between GFX 11 and GFX 11.7, new decoding tables were
added.
2026-02-18 19:17:48 +01:00
Stanislav Mekhanoshin
7487c7581e
[AMDGPU] Change 9 SWMMAC builtins to use 64-bit index (#181246)
These 9 gfx1250 instructions have a 64-bit packed index:

- v_swmmac_f16_16x16x128_bf8_bf8
- v_swmmac_f16_16x16x128_bf8_fp8
- v_swmmac_f16_16x16x128_fp8_bf8
- v_swmmac_f16_16x16x128_fp8_fp8
- v_swmmac_f32_16x16x128_bf8_bf8
- v_swmmac_f32_16x16x128_bf8_fp8
- v_swmmac_f32_16x16x128_fp8_bf8
- v_swmmac_f32_16x16x128_fp8_fp8
- v_swmmac_i32_16x16x128_iu8

Intrinsics accept anyint, but builtins are defined with i32 argument.

Fixes: SWDEV-579843
2026-02-12 15:31:50 -08:00
Sameer Sahasrabuddhe
128437fb6a
[AMDGPU] Introduce asyncmark/wait intrinsics (#180467)
Asynchronous operations are memory transfers (usually between the global
memory and LDS) that are completed independently at an unspecified
scope. A thread that requests one or more asynchronous transfers can use
async marks to track their completion. The thread waits for each mark to
be completed, which indicates that requests initiated in program order
before this mark have also completed.

For now, we implement asyncmark/wait operations on pre-GFX12
architectures that support "LDS DMA" operations. Future work will extend
support to GFX12Plus architectures that support "true" async operations.

This is part of a stack split out from #173259
- #180467
- #180466

Co-authored-by: Ryan Mitchell ryan.mitchell@amd.com

Fixes: SWDEV-521121
2026-02-11 07:15:51 +00:00
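The mark/wait relationship described in #180467 can be illustrated with a toy model: waiting on a mark guarantees that every operation issued before that mark has completed. The class below is purely conceptual. Its method names and the idea of passing a mark id to `wait` are hypothetical and do not reflect the operands of the real intrinsics.

```python
from typing import List, Set


class AsyncTracker:
    """Toy model: waiting on a mark guarantees all earlier operations are done."""

    def __init__(self) -> None:
        self.issued: List[str] = []  # operations in program order
        self.marks: List[int] = []   # marks[i] = number of ops issued before mark i
        self.completed: Set[str] = set()  # ops known to have completed

    def issue(self, op: str) -> None:
        self.issued.append(op)

    def mark(self) -> int:
        """Place a mark after everything issued so far; return its id."""
        self.marks.append(len(self.issued))
        return len(self.marks) - 1

    def wait(self, mark_id: int) -> None:
        """After this returns, all operations before the mark are complete."""
        self.completed.update(self.issued[: self.marks[mark_id]])
```

A pipelined loop would issue a batch, place a mark, issue the next batch, and then wait only on the older mark, leaving the newer transfers in flight.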
Sameer Sahasrabuddhe
b02b395a1e
[AMDGPU] Asynchronous loads from global/buffer to LDS on pre-GFX12 (#180466)
The existing "LDS DMA" builtins/intrinsics copy data from global/buffer
pointer to LDS. These are now augmented with their ".async" version,
where the compiler does not automatically track completion. The
completion is now tracked using explicit mark/wait intrinsics, which
must be inserted by the user. This makes it possible to write programs
with efficient waits in software pipeline loops. The program can now
wait for only the oldest outstanding operations to finish, while
launching more operations for later use.

This change only contains the new names of the builtins/intrinsics,
which continue to behave exactly like their non-async counterparts. A
later change will implement the actual mark/wait semantics in
SIInsertWaitcnts.

This is part of a stack split out from #173259:
- #180467
- #180466

Fixes: SWDEV-521121
2026-02-11 05:26:58 +00:00
YunQiang Su
7e734da346
Clang: Add nsz to llvm.minnum and llvm.maxnum emitted from fmin and fmax (#113133)
See: https://github.com/llvm/llvm-project/pull/112852

We will define llvm.minnum and llvm.maxnum with +0.0>-0.0, by default,
while libc doesn't require it.
2026-02-11 08:33:29 +08:00
Marcos Maronas
ce94d63f0f
Make OpenCL an OSType rather than an EnvironmentType. (#170297)
OpenCL was added as an `EnvironmentType` in
https://github.com/llvm/llvm-project/pull/78655, but there is no
explanation as to why it was added as such, even after explicitly asking
in the PR
(https://github.com/llvm/llvm-project/pull/78655#issuecomment-2743162853).
This PR makes it an `OSType` instead, which feels more natural, and
updates tests accordingly.

---------

Co-authored-by: Marcos Maronas <marcos.maronas@intel.com>
2026-02-10 18:45:50 +00:00
Mirko Brkušanin
4280f0d241
[AMDGPU] Add dot4 fp8/bf8 instructions for gfx1170 (#180516) 2026-02-10 12:14:49 +01:00
Mirko Brkušanin
45b037cf7a
[AMDGPU] Add fp8/bf8 conversion instructions for gfx1170 (#180191) 2026-02-09 13:56:43 +01:00
Pierre van Houtryve
b79ba02479
[AMDGPU][GFX12.5] Reimplement monitor load as an atomic operation (#177343)
Load monitor operations make more sense as atomic operations, as
non-atomic operations cannot be used for inter-thread communication w/o
additional synchronization.
The previous built-in made it work because one could just override the
CPol bits, but that bypasses the memory model and forces the user to learn
about ISA bits encoding.

Making load monitor an atomic operation has a couple of advantages.
First, the memory model foundation for it is stronger. We just lean on the
existing rules for atomic operations. Second, the CPol bits are abstracted away
from the user, which avoids leaking ISA details into the API.

This patch also adds supporting memory model and intrinsics
documentation to AMDGPUUsage.

Solves SWDEV-516398.
2026-02-09 09:57:27 +01:00
paperchalice
5c5677d7b8
[llvm] Remove "no-infs-fp-math" attribute support (#180083)
This is one of the global options in `TargetMachine::resetTargetOptions`.
No backend supports it any longer, so remove it.
2026-02-09 08:43:33 +08:00
Mirko Brkušanin
20b5849e17
[AMDGPU] Define new target gfx1170 (#180185) 2026-02-06 14:38:50 +01:00
Matt Arsenault
2502e3b7ba
IR: Promote "denormal-fp-math" to a first class attribute (#174293)
Convert "denormal-fp-math" and "denormal-fp-math-f32" into a first
class denormal_fpenv attribute. Previously the query for the effective
denormal mode involved two string attribute queries with parsing. I'm
introducing more uses of this, so it makes sense to convert this
to a more efficient encoding. The old representation was also awkward
since it was split across two separate attributes. The new encoding
just stores the default and float modes as bitfields, largely avoiding
the need to consider if the other mode is set.

The syntax in the common cases looks like this:
  `denormal_fpenv(preservesign,preservesign)`
  `denormal_fpenv(float: preservesign,preservesign)`
  `denormal_fpenv(dynamic,dynamic float: preservesign,preservesign)`

I wasn't sure about reusing the float type name instead of adding a
new keyword. It's parsed as a type but only accepts float. I'm also
debating switching the name to subnormal to match the current
preferred IEEE terminology (also used by nofpclass and other
contexts).

This has a behavior change when using the command flag debug
options to set the denormal mode. The behavior of the flag
ignored functions with an explicit attribute set, per
the default and f32 version. Now that these are one attribute,
the flag logic can't distinguish which of the two components
were explicitly set on the function. Only one test appeared to
rely on this behavior, so I just avoided using the flags in it.

This also does not perform all the code cleanups this enables.
In particular the attributor handling could be cleaned up.

I also guessed at how to support this in MLIR. I followed
MemoryEffects as a reference; it appears bitfields are expanded
into arguments to attributes, so the representation there is
a bit uglier with the 2 2-element fields flattened into 4 arguments.
2026-02-05 13:31:26 +00:00
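To make the "bitfields" point in #174293 concrete, the sketch below packs the default and float mode pairs into a single byte. The bit layout, the numeric values of the modes, and the (output, input) ordering of each pair are assumptions chosen for illustration; the commit message does not spell out the actual encoding.

```python
from enum import IntEnum
from typing import Tuple


class Mode(IntEnum):
    # Four modes fit in two bits; the numeric values are an assumption.
    IEEE = 0
    PRESERVE_SIGN = 1
    POSITIVE_ZERO = 2
    DYNAMIC = 3


Pair = Tuple[Mode, Mode]  # assumed order: (output mode, input mode)


def pack(default: Pair, f32: Pair) -> int:
    """Pack both mode pairs into one byte (hypothetical bit layout)."""
    return (default[0]
            | (default[1] << 2)
            | (f32[0] << 4)
            | (f32[1] << 6))


def unpack(bits: int) -> Tuple[Pair, Pair]:
    """Recover (default, f32) from the packed byte."""
    fields = [Mode((bits >> shift) & 0b11) for shift in (0, 2, 4, 6)]
    return (fields[0], fields[1]), (fields[2], fields[3])
```

Reading either pair is a shift and a mask, in contrast to the two string lookups plus parsing that the old attributes required.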
Wenju He
8ab29461c3
[OpenCL] Set half-precision Div and Sqrt accuracy (#179621)
The OpenCL spec relaxed half-precision divide to 1 ULP and sqrt to 1.5 ULP
in https://github.com/KhronosGroup/OpenCL-Docs/pull/1293
https://github.com/KhronosGroup/OpenCL-Docs/pull/1386
This allows targets to use the hardware rcp instruction for half.
2026-02-05 09:32:56 +08:00
Jameson Nash
0dd21ad1c6
[clang] remove addrspace cast from CreateIRTemp (#179327)
This just added unnecessary work to the IR, since they are only used for
load and store, which just causes some IR noise. Tests updated by UTC
script to remove the extra lines.
2026-02-04 13:09:32 -05:00
Aaditya
f190477718
[AMDGPU] Add builtins for wave reduction intrinsics (#170813) 2026-01-30 18:15:06 +05:30
Wenju He
c03d0fe672
[OpenCL] Add clang internal extension __cl_clang_function_scope_local_variables (#176726)
The OpenCL spec requires that variables in the local address space be
declared only at kernel function scope.
Add a Clang-internal extension __cl_clang_function_scope_local_variables
to lift the restriction.

To expose static local allocations at kernel scope, targets can either
force-inline non-kernel functions that declare local memory or pass a
kernel-allocated local buffer to those functions via an implicit argument.

Motivation: support local memory allocation in libclc's implementation
of work-group collective built-ins, see example at:
https://github.com/intel/llvm/blob/41455e305117/libclc/libspirv/lib/amdgcn-amdhsa/group/collectives_helpers.ll
https://github.com/intel/llvm/blob/41455e305117/libclc/libspirv/lib/amdgcn-amdhsa/group/collectives.cl#L182

Right now this is a Clang-only OpenCL extension intended for compiling
OpenCL libraries with Clang. It could be proposed as a standard OpenCL
extension in the future.
2026-01-26 08:13:22 +08:00
Shilei Tian
f3a674a2ef
[RFC][Clang][AMDGPU] Emit only delta target-features to reduce IR bloat (#176533)
Currently, AMDGPU functions have `target-features` attribute populated
with all default features for the target GPU. This is redundant because
the backend can derive these defaults from the `target-cpu` attribute
via `AMDGPUTargetMachine::getFeatureString()`.

In this PR, for AMDGPU targets only:

- Functions without explicit target attributes no longer emit
`target-features`
- Functions with `__attribute__((target(...)))` or `-target-feature`
emit only features that differ from the target's defaults (delta)

The backend already handles missing `target-features` correctly by
falling back to the TargetMachine's defaults.

A new cc1 flag `-famdgpu-emit-full-target-features` is added to emit
full features when needed.

Example:

Before:

```llvm
attributes #0 = { "target-cpu"="gfx90a" "target-features"="+16-bit-insts,+atomic-buffer-global-pk-add-f16-insts,+atomic-fadd-rtn-insts,+ci-insts,+dl-insts,+dot1-insts,+dot2-insts,..." }
```

After (default):

```llvm
attributes #0 = { "target-cpu"="gfx90a" }
```

After (with explicit `+wavefrontsize32` override):

```llvm
attributes #0 = { "target-cpu"="gfx90a" "target-features"="+wavefrontsize32" }
```
2026-01-20 14:49:35 -05:00
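The "delta" computation in #176533 amounts to dropping every requested feature that already matches the target's default. The sketch below models that filter. The function name is hypothetical, and `GFX90A_DEFAULTS` is a made-up, heavily abbreviated set rather than the real gfx90a feature list.

```python
from typing import List, Set


def delta_features(requested: List[str], defaults: Set[str]) -> List[str]:
    """Keep only features that differ from the target's defaults.

    Features are spelled "+name" or "-name". `defaults` holds the
    "+name" entries implied by the target CPU; anything not listed is
    treated as off by default.
    """
    delta = []
    for feat in requested:
        sign, name = feat[0], feat[1:]
        enabled_by_default = ("+" + name) in defaults
        if (sign == "+") != enabled_by_default:
            delta.append(feat)
    return delta


# Hypothetical, heavily abbreviated default set for illustration.
GFX90A_DEFAULTS = {"+16-bit-insts", "+ci-insts", "+dl-insts", "+wavefrontsize64"}
```

Requesting only default features yields an empty list, which corresponds to omitting `target-features` entirely. Adding `+wavefrontsize32` leaves just that one entry, as in the last IR example above.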
Shilei Tian
4efbe98659
[Clang][AMDGPU] Add a Sema check for the imm argument of __builtin_amdgcn_s_setreg (#176838)
Our backend cannot select the corresponding intrinsic if the imm
argument is not an `int16_t` or `uint16_t`, which is not really helpful.
2026-01-20 11:48:52 -05:00
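One way to read the constraint in #176838 is as a range check on the immediate: the value must be representable as either `int16_t` or `uint16_t`. The sketch below encodes that reading. It is an assumption about the intent, not a copy of the Sema check, and the function name is hypothetical.

```python
def fits_in_16_bits(value: int) -> bool:
    """True if `value` is representable as int16_t or as uint16_t."""
    # int16_t covers [-32768, 32767]; uint16_t covers [0, 65535].
    return -(1 << 15) <= value <= (1 << 16) - 1
```

Diagnosing an out-of-range immediate in the front end gives the user an error at the call site instead of a selection failure in the backend.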
Shilei Tian
39bd4562ba
[Clang][AMDGPU] Handle wavefrontsize32 and wavefrontsize64 features more robustly (#176599)
We should not allow `-wavefrontsize32` and `-wavefrontsize64` to be
specified at the same time. We should also not allow `-wavefrontsize32`
on a target that only supports `wavefrontsize32`, and vice versa.
2026-01-19 18:16:29 -05:00
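The two rules in #176599 can be written as a small validation function. In this sketch the set of supported wavefront sizes is passed in by the caller, and the error strings are invented; Clang's actual diagnostics and data structures are not shown in the commit message.

```python
from typing import List, Optional, Set


def check_wavefront_features(features: List[str],
                             supported: Set[int]) -> Optional[str]:
    """Return an error message, or None if the combination is acceptable.

    `supported` is the set of wavefront sizes the target can run,
    e.g. {64} or {32, 64}; it is supplied by the caller in this sketch.
    """
    off32 = "-wavefrontsize32" in features
    off64 = "-wavefrontsize64" in features
    if off32 and off64:
        return "cannot disable both wavefront sizes"
    if off32 and supported == {32}:
        return "target only supports wavefrontsize32"
    if off64 and supported == {64}:
        return "target only supports wavefrontsize64"
    return None
```

Disabling one size remains fine on a target that supports both, since the other size is still available.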
Shoreshen
26624d51d1
[AMDGPU]Add specific instruction feature for multicast load (#175503) 2026-01-13 09:10:09 +08:00
Shilei Tian
5a63367b15
Reapply "[AMDGPU] Rework the clamp support for WMMA instructions" (#174674) (#174697)
This reverts commit 0b2f3cfb72a76fa90f3ec2a234caabe0d0712590.
2026-01-07 06:12:19 +00:00
dyung
0b2f3cfb72
Revert "[AMDGPU] Rework the clamp support for WMMA instructions" (#174674)
Reverts llvm/llvm-project#174310

This change is causing 2 cross-project-test failures on
https://lab.llvm.org/buildbot/#/builders/174/builds/29695
2026-01-07 01:18:23 +00:00
Shilei Tian
ccca3b8c67
[AMDGPU] Rework the clamp support for WMMA instructions (#174310)
Fixes #166989.
2026-01-06 15:46:40 -05:00
Shilei Tian
ef55a0be4e
[NFC] Update clang/test/CodeGenOpenCL/builtins-amdgcn-gfx1250-wmma-w32.cl 2026-01-06 13:06:57 -05:00
Wenju He
1f14ed948d
[Clang] Honor '#pragma STDC FENV_ROUND' in __builtin_store_half/halff (#173821)
Before this change, constrained fptrunc for __builtin_store_half/halff
always used round.tonearest, ignoring the active pragma STDC FENV_ROUND.
This PR guards builtin emission with CGFPOptionsRAII so the current
rounding mode is propagated to the generated constrained intrinsic.
2026-01-04 17:25:22 +08:00
Shilei Tian
c97de4387b
Revert "[AMDGPU] add clamp immediate operand to WMMA iu8 intrinsic (#171069)" (#174303)
This reverts commit 2c376ffeca490a5732e4fd6e98e5351fcf6d692a because it
breaks the assembler: in the example below, the `matrix_b_reuse` modifier
is printed back as `clamp`.

```
$ llvm-mc -triple=amdgcn -mcpu=gfx1250 -show-encoding <<< "v_wmma_i32_16x16x64_iu8 v[16:23], v[0:7], v[8:15], v[16:23] matrix_b_reuse"
  v_wmma_i32_16x16x64_iu8 v[16:23], v[0:7], v[8:15], v[16:23] clamp ; encoding: [0x10,0x80,0x72,0xcc,0x00,0x11,0x42,0x1c]
```

We have a fundamental issue in the clamp support in VOP3P instructions,
which will need more changes.
2026-01-04 02:13:21 +00:00