93 Commits

Author SHA1 Message Date
Erick Ochoa Lopez
1fcfd5c67b
[mlir][amdgpu] Sink op creation in scaled conversion intrinsics (NFC) (#168542)
Where possible:

* notifyMatchFailure happen first
* then op.emitOpError
* finally assertions / op creation.

---------

Co-authored-by: Jakub Kuderski <kubakuderski@gmail.com>
2025-11-18 10:35:05 -05:00
Erick Ochoa Lopez
909c9aacea
[mlir][amdgpu] Add lowerings for ScaledExtPacked816 (#168123)
* Adds lowerings for amdgpy.scaled_ext_packed816
* updates verifiers
2025-11-17 16:51:52 -05:00
Muzammiluddin Syed
b1262d13e0
[mlir][ROCDL] Refactor wmma intrinsics to use attributes not operands where possible (#167041)
The current implementation of the WMMA intrinsic ops as they are defined
in the ROCDL tablegen is incorrect. They represent as operands what
should be attributes such as `clamp`, `opsel`, `signA/signB`. This
change performs a refactoring to bring it in line with what we expect.

---------

Signed-off-by: Muzammiluddin Syed <muzasyed@amd.com>
2025-11-13 19:50:02 -05:00
Jakub Kuderski
ba0be89cd2
[mlir] Simplify Default cases in type switches. NFC. (#165767)
Use default values instead of lambdas when possible. `std::nullopt` and
`nullptr` can be used now because of
https://github.com/llvm/llvm-project/pull/165724.
2025-10-30 15:10:59 -04:00
Jakub Kuderski
3167752f29
[mlir][amdgpu][rocdl] Allow for graceful wmma conversion failures (#165616) 2025-10-29 16:09:59 -04:00
Jakub Kuderski
466c526714
[mlir][amdgpu][rocdl] Add gfx1250 wmma ops (#165064)
Update `amdgpu.wmma` op definition and implement amdgpu to rocdl
conversion for new variants.
2025-10-28 12:42:39 -04:00
Jakub Kuderski
dc5f274560
[mlir][amdgpu] Add explicit intrinsic shape to wmma (#164920)
This is in preparation for adding support for gfx1250 wmma intrinsics
that include much more possible shapes.

Instead of guessing the wave32/wave64 mode based on element types and
vector sizes, require the intrinsic shapes to be set explicitly as
attributes.
2025-10-24 12:21:33 -04:00
Jakub Kuderski
ae11c5c2c4
[mlir] Switch uses of deprecated .create methods to free function. NFC. (#164635)
See https://discourse.llvm.org/t/psa-opty-create-now-with-100-more-tab-complete/87339.
2025-10-22 14:51:03 +00:00
Shilei Tian
2195fe7e01
[AMDGPU] Add the support for 45-bit buffer resource (#159702)
On new targets like `gfx1250`, the buffer resource (V#) now uses this
format:

```
base (57-bit): resource[56:0]
num_records (45-bit): resource[101:57]
reserved (6-bit): resource[107:102]
stride (14-bit): resource[121:108]
```

This PR changes the type of `num_records` from `i32` to `i64` in both
builtin and intrinsic, and also adds the support for lowering the new
format.

Fixes SWDEV-554034.

---------

Co-authored-by: Krzysztof Drewniak <Krzysztof.Drewniak@amd.com>
2025-09-24 11:12:02 -04:00
Krzysztof Drewniak
5ecc6d1951
[mlir][AMDGPU] Use LDS-only MMRA fences for lds_barrier (#157919)
The previous lowering strategy for amdgpu.lds_barrier (which is an
operation whose semantics are) "s.barrier, and all LDS operations before
this happen-before LDS operations after this, and there must not be an
inherent fence/forcing-to-completion of global memory (for performance)"
was previosuly implemented through using manual calls to waitcnt()
intrinsics and the s_barrire intrinsic(s).

The lack of explicit fencing enabled miscompiles (where LDS accesses
were reordered with the barrier) on gfx12. Since LLVM now allows MMRA
annotations to ensure that only LDS accesses are fenced by a pair of
fences, we can now use these fences in order to explicitly represent the
semantics we want instead of trying to prescribe the method of their
implemntation.

Note that the gfx908 workaround of hiding the s_barrier in inline
assembly in order to prevent spurious vmem barriers remains in place,
but is is removed for gfx11 because the fences have been changed to give
us the effect we want recently.
2025-09-23 14:00:09 -05:00
Gaurav Verma
a2a9601ea4
[mlir][AMDGPU] Updated PermlaneSwapOp to select correct val (#157586)
* as per the instruction description, updated `PermlaneSwapOp` to select
correct val
* updated corresponding lit tests

Issue it resolves: the block reduction was failing otherwise as we were
selecting the `{0}` always.

---------

Signed-off-by: xintin <gaurav.verma@amd.com>
2025-09-12 13:45:56 +02:00
Tim Gymnich
003cbbd4ca
[mlir][amdgpu] Promote gpu.shuffle to amdgpu.permlane_swap (#154933)
- promote `gpu.shuffle %src xor {16,32} 64` to `amdgpu.permlane_swap
%src {16,32}`
2025-08-24 12:41:09 +02:00
Tim Gymnich
e20fa4f412
[mlir][AMDGPU] Add PermlaneSwapOp (#154345)
- Add PermlaneSwapOp that lowers to `rocdl.permlane16.swap` and
`rocdl.permlane32.swap`

---------

Co-authored-by: Jakub Kuderski <kubakuderski@gmail.com>
2025-08-21 18:21:43 +02:00
Maksim Levental
c610b24493
[mlir][NFC] update mlir/Dialect create APIs (27/n) (#150638)
See https://github.com/llvm/llvm-project/pull/147168 for more info.
2025-07-25 11:48:32 -05:00
Maksim Levental
8e8f195322
[mlir][amd] fix LLVM::InsertValueOp::create failure to disambiguate (#150605)
fixes
https://github.com/llvm/llvm-project/pull/149879#issuecomment-3117145615

Note this happens because ADL can't disambiguate between
`mlir::DenseI64ArrayAttr` and `llvm::ArrayRef<int64_t>` **for the value
0** which I guess is equal to nullptr on some (most?) systems.

Note, this only occurs with the value 0.
2025-07-25 07:56:27 -04:00
Maksim Levental
b0434925c9
[mlir][NFC] update Conversion create APIs (4/n) (#149879)
See https://github.com/llvm/llvm-project/pull/147168 for more info.
2025-07-23 10:49:35 -05:00
Ivan Butygin
4977100624
[mlir][amdgpu] Add rocdl.s.waitcnt wrapper (#149670)
The main motivations is to pass vmcnt/expcnt/lgkmcnt values directly
(similar to the asm format) and delegate architecture-dependent
bitpacking to the amdgpu->rocdl lowering.

---------

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
2025-07-22 23:37:56 +03:00
Daniel Hernandez-Juarez
668c964282
[AMDGPU] [MLIR] Add 96 and 128 bit GatherToLDS for gfx950 (#147496)
This PR adds 96 and 128 gather_to_lds support for gfx950. Updating
lowering, verifier and tests.
2025-07-09 11:53:26 -04:00
Alan Li
3f3282cee8
[AMDGPU] Adding AMDGPU dialect wrapper for ROCDL transpose loads. (#145395)
* 1-to-1 mapping wrapper op.
* Direct lowering from AMDGPU wrapper to ROCDL intrinsics.
2025-06-25 22:58:14 -04:00
Umang Yadav
836201f117
Allow bf16 operands on new MFMAs (#144925)
New gfx950 MFMA allows bf16 operands. 


c0cc81cdc0/llvm/include/llvm/IR/IntrinsicsAMDGPU.td (L3434)

When running `amdgpu-to-rocdl`, Current logic converts bf16 to i16
always which fails to compile for newer bf16 MFMA e.g.
`v_mfma_f32_16x16x32bf16`.
Backend expects bf16 type for the operands for those newer MFMAs. This
patch fixes it.

CC: @krzysz00  @dhernandez0  @giuseros  @antiagainst  @kuhar
2025-06-19 12:52:31 -05:00
Daniel Hernandez-Juarez
68b6f392ed
[MLIR][AMDGPU] Fix bug in GatherToLDSOpLowering, get the correct MemRefType for destination (#142915)
This PR fixes a bug in GatherToLDSOpLowering, we were getting the
MemRefType of source for the destination. Additionally, some related
typos are corrected.

CC: @krzysz00 @umangyadav @lialan
2025-06-13 11:33:51 -05:00
Tim Gymnich
67c590004d
[mlir][AMDGPU] Add scaled floating point conversion ops (#141554)
implement `ScaledExtPackedOp` and `PackedScaledTruncOp`
2025-06-13 11:09:11 +02:00
Bruno Cardoso Lopes
05494f3bad
[MLIR][LLVM] Tail call support for inline asm op (#140826) 2025-05-22 15:30:31 -07:00
Krzysztof Drewniak
6c813e8a3c
[mlir][ROCDL] Add fp4 and fp6 conversion intrinsics, fix fp8 immargs (#140801)
This PR adds support for the scaled conversion intrinsics for fp4 and
fp6 types so that they can be targetted by a future amdgpu dialect op or
used directly.

Additionally, this patch refactors the copy-paste-heavy fp8 versions of
these scaled conversion intrinsics with tablegen `foreach` loops, and
fixes the fact that certain immargs weren't being stored as attributes.

Note that some of the MLIR-level tests for those scaled fp8 intrinsics
had incorrect return types, which have been fixed.

(Note that while the operations have a known return type, the IR format
still prints that type for clarity).
2025-05-21 13:50:02 -07:00
Peiyong Lin
04ad8d4900
Emit inbounds and nuw attributes in memref. (#138984)
Now that MLIR accepts nuw and nusw in getelementptr, this patch emits
the inbounds and nuw attributes when lower memref to LLVM in load and
store operators.

This patch also strengthens the memref.load and memref.store spec about
undefined behaviour during lowering.

This patch also lifts the |rewriter| parameter in getStridedElementPtr
ahead so that LLVM::GEPNoWrapFlags can be added at the end with a
default value and grouped together with other operators' parameters.

Signed-off-by: Lin, Peiyong <linpyong@gmail.com>
2025-05-20 14:16:22 -07:00
Krzysztof Drewniak
4bdd116b80
[AMDGPU] Add a new amdgcn.load.to.lds intrinsic (#137425)
This PR adds a amdgns_load_to_lds intrinsic that abstracts over loads to
LDS from global (address space 1) pointers and buffer fat pointers
(address space 7), since they use the same API and "gather from a
pointer to LDS" is something of an abstract operation.

This commit adds the intrinsic and its lowerings for addrspaces 1 and 7,
and updates the MLIR wrappers to use it (loosening up the restrictions
on loads to LDS along the way to match the ground truth from target
features).

It also plumbs the intrinsic through to clang.
2025-05-19 07:15:04 -07:00
Muzammil
105ce585d3
[mlir][amdgpu] Define an amdgpu.scaling_mfma wrapper (#137498)
Create a wrapper around the new scaled MFMAs that operate on specific
element types and tile sizes.

See [Issue](https://github.com/iree-org/iree/issues/20616).

---------

Signed-off-by: Muzammiluddin Syed <muzasyed@amd.com>
2025-05-02 18:07:17 -05:00
Ivan Butygin
dda4b968e7
[mlir] AMDGPUToROCDL: lower amdgpu.swizzle_bitmode (#136223)
Repack `amdgpu.swizzle_bitmode` arguments and lower it to
`rocdl.ds_swizzle`.

Repacking logic is follows:
* `sizeof(arg) < sizeof(i32)`: bitcast to integer and zext to i32 and
then trunc and bitcast back.
* `sizeof(arg) == sizeof(i32)`: just bitcast to i32 and back if not i32
* `sizeof(arg) > sizeof(i32)`: bitcast to `vector<Nxi32>`, extract
individual elements and do a series of `rocdl.ds_swizzle` and then
compose vector and bitcast back.

Added repacking logic to LLVM utils so it can be used elsewhere. I'm
planning to use it for `gpu.shuffle` later.
2025-04-18 17:19:04 +03:00
Alan Li
dae0ef53a0
[MLIR][AMDGPU] Add a wrapper for global LDS load intrinsics in AMDGPU (#133498)
Defining a new `amdgpu.global_load` op, which is a thin wrap around
ROCDL `global_load_lds` intrinsic, along with its lowering logics to
`rocdl.global.load.lds`.
2025-04-08 09:18:30 -04:00
Krzysztof Drewniak
25622aa745
[mlir][AMDGPU] Add gfx950 MFMAs to the amdgpu.mfma op (#133553)
This commit extends the lowering of amdgpu.mfma to handle the new
double-rate MFMAs in gfx950 and adds tests for these operations.

It also adds support for MFMAs on small floats (f6 and f4), which are
implented using the "scaled" MFMA intrinsic with a scale value of 0 in
order to have an unscaled MFMA.

This commit does not add a `amdgpu.scaled_mfma` operation, as that is
future work.

---------

Co-authored-by: Jakub Kuderski <kubakuderski@gmail.com>
2025-04-01 11:59:09 -05:00
Yi Qian
0ea4fb9264
[AMD][ROCDL] Add packed conversions fp8/bf8->bf16 and fp8/bf8->fp32 in ROCDL dialect (#131850)
- Add packed conversions fp8/bf8->bf16 for gfx950 and fp8/bf8->fp32 for
gfx942 in ROCDL dialect
- Update amdgpu.ext_packed_fp8 lowering to use ROCDL packed fp8/bf8->f32
conversions for vector target types and ROCDL scalar fp8/bf8->fp32 for
scalar target type.

---------

Co-authored-by: Jungwook Park <jungwook.park@amd.com>
2025-03-21 14:49:50 +00:00
Mirza Halilčević
1fc49ff593
[MLIR][AMDGPU] Add OCP FP8 support for new hardware (#127728)
(Continuing from #106160)

This PR addresses remaining review comments from the original PR.

Original PR Description
---
Upcoming hardware (gfx12 and some future gfx9) will support the OCP
8-bit float formats for their matrix multiplication intrinsics and
conversion operations, retaining existing opcodes and compiler builtins.

This commit adds support for these types to the MLIR wrappers around
such operations, ensuring that the OCP types aren't used to generate
those builtins on hardware that doesn't expect that format and,
conversely, to ensure that the pre-OCP formats aren't used on new
hardware.

---------

Signed-off-by: Mirza Halilcevic <mirza.halilcevic@amd.com>
Co-authored-by: Paul Fuqua <pf@acm.org>
Co-authored-by: Krzysztof Drewniak <Krzysztof.Drewniak@amd.com>
2025-03-03 14:10:31 -06:00
Krzysztof Drewniak
b31175a33a
[mlir][AMDGPU] Add int4 intrinsics, mixed-type fp8 to handle gfx12 (#128963)
1. Extend the gfx12 FP8 support to allow mixed-type intrinsics (since
they've been added), creating limited mixed-type support that mirrors
MFMA
2. Extend the `amdgpu.wmma` intrinsic lowering to correctly handle
shorter vectors because gfx12 now has instructions that logically take a
4xi8, or, as far as LLVM's concerned, an i32. Similarly, there are 4xi4
inputs, which are an i16 (that must be zero-extended to i32).
3. Correctly handle the ambiguities in the int4 intrinsics on gfx12,
which can either be 16x16x16 or 16x16x32
4. Add tests showing all WMMAs being lowered the way gfx12 expects
(mirroring LLVM's tests)
5. Add a verifier to prevent emiting ilegal instructions on gfx12.
2025-02-27 14:48:58 -06:00
Krzysztof Drewniak
42526d240c
[mlir][AMDGPU] Plumb address space 7 through MLIR, add address_space attr. (#125594)
This commit adds support for casting memrefs into fat raw buffer
pointers to the AMDGPU dialect.

Fat raw buffer pointers - or, in LLVM terms, ptr addrspcae(7), allow
encapsulating a buffer descriptor (as produced by the make.buffer.rsrc
intrinsic or provided from some API) into a pointer that supports
ordinary pointer operations like load or store. This allows people to
take advantage of the additional semantics that buffer_load and similar
instructions provide without forcing the use of entirely separate
amdgpu.raw_buffer_* operations.

Operations on fat raw buffer pointers are translated to the
corresponding LLVM intrinsics by the backend.

This commit also goes and and defines a #amdgpu.address_space<>
attribute so that AMDGPU-specific memory spaces can be represented. Only
#amdgpu.address_space<fat_raw_buffer> will work correctly with the
memref dialect, but the other possible address spaces are included for
completeness.

---------

Co-authored-by: Jakub Kuderski <kubakuderski@gmail.com>
Co-authored-by: Prashant Kumar <pk5561@gmail.com>
2025-02-26 16:02:39 -06:00
Ivan Butygin
6e611264c6
[mlir] AMDGPUToROCDL: handle 1-element vectors (#128266)
Buffer intrinsics doesn't support 1-element vectors, cast them to
scalars.
2025-02-23 03:50:59 +03:00
lorenzo chelini
0b63dfb066
[MLIR][NFC] Use base alias for constructor inheritance (#127756)
During my previous cleanup (#127403), I did not notice that we defined a
type alias for the base class. This type alias allows us to use the
shorter form Base::Base, and this PR switches to that.
2025-02-19 16:05:25 +01:00
Fabian Ritter
8900e412ae
[AMDGPU][MLIR] Replace gfx940 and gfx941 with gfx942 in MLIR (#125836)
gfx940 and gfx941 are no longer supported. This is one of a series of
PRs to remove them from the code base.

For SWDEV-512631
2025-02-19 10:05:45 +01:00
lorenzo chelini
c1a2292526
[MLIR][NFC] Retire let constructor for passes in Conversion directory (part1) (#127403)
`let constructor` is deprecated since the table gen backend emits most
of the glue logic to build a pass. This PR retires the td method for
most (I need another pass) passes in the Conversion directory.
2025-02-17 10:55:27 +01:00
Krzysztof Drewniak
efd0a7f446
[mlir][ROCDL][~NFC] Migrate to LLVM dialect default builders (#125609)
There were a bunch of spots in ROCDL.td where we were defining our own
llvmBuilder call which could have been generated using the default
built-in one on LLVM_IntrOpBase.

This commit cleans up such usages in the interests of potentinally
enabling ROCDL import in the future and of making best practices more
obvious.

The one breaking change is renaming WaitcntOp to SWaitcntOp, which
should have minimal impact.
2025-02-06 11:38:43 -06:00
Matthias Springer
6aaa8f25b6
[mlir][IR][NFC] Move free-standing functions to MemRefType (#123465)
Turn free-standing `MemRefType`-related helper functions in
`BuiltinTypes.h` into member functions.
2025-01-21 08:48:09 +01:00
Matthias Springer
7a77f14c0a
[mlir][IR] Remove isF...() type API for low-precision FP types (#123326)
Remove `type.isFloat4E2M1FN()` etc. Use `isa<Float4E2M1FNType>(type)`
instead.

For details, see:
https://discourse.llvm.org/t/rethink-on-approach-to-low-precision-fp-types/82361/28
2025-01-20 09:22:53 +01:00
Fabian Mora
0c1c49f0ff
[mlir][AMDGPU] Fix raw buffer ptr ops lowering (#122293)
This patch fixes several bugs in the lowering of AMDGPU raw buffer
operations. These bugs include:
- Incorrectly handling the offset of the memref, causing errors when
using subviews.
- Using the MaximumOp (float specific op) to calculate the number of
records.
 - The number of records in the static shape case.
 - The lowering when index bitwidth=i64.

Furthermore this patch also switches to use MLIR's data layout to get
the type size.

---------

Co-authored-by: Jakub Kuderski <kubakuderski@gmail.com>
2025-01-13 16:11:33 -05:00
Ivan Butygin
953b07febc
[mlir] AMDGPUToROCDL: RawBufferOpLowering fixes (#120642)
1. We can use `getNumElements()` only for memrefs with trivial layout.
2. Buffer ops expecting sizes in i32 but descriptor values can be either
i32 or i64, add appropriate casts. This implementation is not ideal as
it can overflow, but it's still better than generating broken IR.
2024-12-20 18:09:01 +03:00
Krzysztof Drewniak
3452149c05
[mlir][AMDGPU] Support vector<2xbf16> packed atomic fadd (#113929)
Now that we use LLVM's native bfloat types in the AMDGPU lowering,
enable vector<2xbf16> for AMDGPU.
2024-10-31 10:52:53 -05:00
Benoit Jacob
d8a656ffaf
[MLIR] AMDGPUToROCDL: Use a bitcast op to reintepret a vector of i8 as single integer. (#111400)
Found by inspecting AMDGPU assembly - so the arithmetic ops created
there were definitely making their way into the target ISA. A
`LLVM::BitcastOp` seems equivalent, and evaporates as expected in the
target asm.

Along the way, I thought that this helper function `mfmaConcatIfNeeded`
could be renamed to `convertMFMAVectorOperand` to better convey its
contract; so I don't need to think about whether a bitcast is a
legitimate "concat" :-)

---------

Signed-off-by: Benoit Jacob <jacob.benoit.1@gmail.com>
2024-10-07 14:14:18 -04:00
Matthias Springer
206fad0e21
[mlir][NFC] Mark type converter in populate... functions as const (#111250)
This commit marks the type converter in `populate...` functions as
`const`. This is useful for debugging.

Patterns already take a `const` type converter. However, some
`populate...` functions do not only add new patterns, but also add
additional type conversion rules. That makes it difficult to find the
place where a type conversion was added in the code base. With this
change, all `populate...` functions that only populate pattern now have
a `const` type converter. Programmers can then conclude from the
function signature that these functions do not register any new type
conversion rules.

Also some minor cleanups around the 1:N dialect conversion
infrastructure, which did not always pass the type converter as a
`const` object internally.
2024-10-05 21:32:40 +02:00
Daniel Hernandez-Juarez
b014265d99
[mlir][AMDGPU] New gfx12 barrier instructions and update lowering LDSBarrierOp (#109273)
New gfx12 barrier instructions: s.barrier.signal, s.barrier.wait and
s.wait.dscnt. And update lowering LDSBarrierOp accordingly.

CC: @krzysz00 @manupak @giuseros
2024-09-20 17:41:36 -05:00
Krzysztof Drewniak
6292ea6879
[mlir][AMDGPU] Remove an old bf16 workaround (#108409)
The AMDGPU backend now implements LLVM's `bfloat` type. Therefore, we no
longer need to type convert MLIR's `bf16` to `i16` during lowerings to
ROCDL.

As a result of this change, we discovered that, whel the code for MFMA
and WMMA intrinsics was mainly prepared for this change, we were failing
to bitcast the bf16 results of WMMA operations out from the i16 they're
natively represented as. This commit also fixes that issue.

---------

Co-authored-by: Jakub Kuderski <kubakuderski@gmail.com>
2024-09-12 17:45:39 -05:00
Krzysztof Drewniak
9596e83b2a
[mlir][AMDGPU] Enable emulating vector buffer_atomic_fadd on gfx11 (#108312)
* Fix a bug introduced by the Chipset refactoring in #107720 where
atomics emulation for adds was mistakenly applied to gfx11+
* Add the case needed for gfx11+ atomic emulation, namely that gfx11
doesn't support atomically adding a v2f16 or v2bf16, thus requiring
MLIR-level legalization for buffer intrinsics that attempt to do such an
addition
* Add tests, including tests for gfx11 atomic emulation

Co-authored-by: Manupa Karunaratne <manupa.karunaratne@amd.com>
2024-09-12 09:47:52 -05:00
Krzysztof Drewniak
aa60a3e4d0
[mlir][AMDGPU] Support vector<2xf16> inputs to buffer atomic fadd (#108286)
Extend the lowering of atomic.fadd to support the v2f16 variant
avaliable on some AMDGPU chips.

Re-lands #108238 (and addresses review comments from there)

Co-authored-by: Giuseppe Rossini <giuseppe.rossini@amd.com>
2024-09-11 17:51:07 -05:00