70 Commits

Author SHA1 Message Date
Krzysztof Drewniak
52dfcab327
[NFC][mlir][AMDGPU] Partition dialect .td into multiple files (#178562)
Follow the style of other dialects by having a distinct .td file for
each category of thing (type, attribute, operation, enum) generated for
the AMDGPU dialect.

Nothing has changed, but a lot of things have been copy-pasted.
2026-01-29 15:10:20 -08:00
Ivan Butygin
ac62f12192
[mlir][amdgpu] Remove redundant barriers (#175436) 2026-01-12 14:47:58 +03:00
Jorn Tuyls
5ff486d08e
[NFC][AMDGPU] Use getMixedSize in FatRawBufferCastOp dim reification (#174548)
After https://github.com/llvm/llvm-project/pull/174477, I found similar
logic that can be replaced by `memref::getMixedSize` in the
FatRawBufferCastOp dimension reification function.
2026-01-06 08:32:25 -05:00
Ivan Butygin
9afbcde1d2
[mlir][amdgpu] Fix gather_to_lds for 0d memrefs (#173421)
`dstType.areTrailingDimsContiguous(1)` asserts for memref of rank 0.
2025-12-24 02:00:55 +03:00
Eric Feng
24c7b4ea48
[mlir][amdgpu] implement amdgpu.sparse_mfma wrapper for smfmac instructions (#171968)
Signed-off-by: Eric Feng <Eric.Feng@amd.com>
2025-12-18 20:16:14 -06:00
Justin Rosner
3a88bb90bb
[mlir][AMDGPU] Add scaled wmma ops for gfx1250 (#169854)
This PR adds scaled WMMA ops (available on gfx1250) and the lowering to the AMDGPU dialect, wrapping the underlying intrinsics.
2025-12-15 15:44:36 -08:00
Erick Ochoa Lopez
5123d36c02
[mlir][amdgpu] Lower make_gather_dma_descriptor. (#172083)
* Makes `MakeDescriptorOp` a template for `make_dma_descriptor` and
`make_gather_dma_descriptor`.
* Makes verification and folder for `make_dma_descriptor` a template.
* Adds custom verification and folder for `make_dma_gather_descriptor`
based on template.
* Adds `make_gather_dma_descriptor` op.
* Lowers `make_gather_dma_descriptor` to ROCDL.
2025-12-15 13:32:57 -05:00
Zhewen Yu
d107b3c82a
[MLIR][AMDGPU] Implement reifyDimOfResult for FatRawBufferCastOp (#171839)
Since `FatRawBufferCastOp` preserves the shape of its source operand,
the result dimensions can be reified by querying the source's
dimensions.

---------

Signed-off-by: Yu-Zhewen <zhewenyu@amd.com>
2025-12-12 11:39:00 -08:00
Erick Ochoa Lopez
5ebb928532
[mlir][amdgpu] Adds make_dma_gather_base (#171857)
* Adds `tdm_gather_base` type.
* Adds `make_dma_gather_base` op.
* Adds `make_dma_gather_base` lowering to ROCDL.
2025-12-12 09:20:38 -05:00
Ivan Butygin
f88d060c41
[mlir][amdgpu] memory_counter_wait tensor counter support (#171153) 2025-12-08 20:02:40 +03:00
Ivan Butygin
ca8419d6cc
[mlir][amdgpu] Fuse adjacent MemoryCounterWaitOp (#171148)
Taking the minimum value.
2025-12-08 18:52:26 +03:00
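The fusion described in #171148 can be pictured as merging two back-to-back waits into one that keeps the stricter bound for every counter. The following is a minimal Python sketch of that idea, not the MLIR pattern itself. The dictionary representation, the counter names, and `fuse_waits` are illustrative assumptions; a counter that is absent from a wait is taken to mean "no constraint".

```python
from typing import Dict


def fuse_waits(first: Dict[str, int], second: Dict[str, int]) -> Dict[str, int]:
    """Merge two adjacent waits, keeping the minimum limit per counter.

    Hypothetical model: each wait maps a counter name to the maximum number
    of outstanding operations allowed once the wait completes.
    """
    fused = dict(first)
    for counter, limit in second.items():
        # A lower limit is the stricter requirement, so the minimum wins.
        fused[counter] = min(limit, fused.get(counter, limit))
    return fused


merged = fuse_waits({"load": 2, "ds": 1}, {"load": 0, "store": 3})
print(merged)
```

Taking the minimum is safe because waiting until at most `n` operations are outstanding also satisfies any wait that allowed more than `n`.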
Tim Gymnich
0487154588
[mlir][amdgpu] Add workgroup_mask to MakeDmaDescriptorOp (#171103)
- add `workgroup_mask` and `early_timeout`
2025-12-08 16:02:18 +01:00
Erick Ochoa Lopez
5dfd9c4f84
[mlir][amdgpu] Add lowering for make_dma_descriptor (#169955)
* Adds initial lowering for make_dma_descriptor supporting tensors of
rank 2.
* Adds folders for make_dma_descriptor allowing statically known
operands to be folded into attributes.
* Add AllElementTypesMatch<["lds", "global"]> to make_dma_base.
* Rename pad to pad_amount
* Rename pad_every to pad_interval
2025-12-05 14:24:23 -05:00
Krzysztof Drewniak
e209b8bc2f
[mlir][AMDGPU] Rename gfx1250 packed extension ops, change firstScaleLane (#170718)
The current name of scaled_ext_packed816 was, in retrospect, bothering
me, since it just has a bunch of numbers on the end and doesn't really
reflect the wave-wide nature of the operation.

On top of that, the fact that firstScaleLane was 0 or 1, which might be
read as the first lane being 1 (and not what it actually was, 16), also
seemed weird.

Therefore, before this op sees any use,

1. Rename it to scaled_ext_packed_matrix
2. Change the semantics of firstScaleLane to actually point at the lane
where the scales start (valid options currently are 0 or 16, the two
halves of a wave32 wave).

(Disclaimer: the mechanical updates were done via AI.)

---------

Co-authored-by: Erick Ochoa Lopez <eochoalo@amd.com>
2025-12-04 14:35:16 -08:00
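The firstScaleLane change in #170718 amounts to replacing a half-wave selector (0 or 1) with the lane index where the scales start (0 or 16). A small Python sketch of that mapping follows, assuming the old value simply selected one of the two 16-lane halves of a wave32 wave. The helper name `migrate_first_scale_lane` is hypothetical.

```python
WAVE32_HALF = 16  # number of lanes in one half of a wave32 wave


def migrate_first_scale_lane(old_value: int) -> int:
    """Convert the old 0/1 selector into the lane where the scales start."""
    if old_value not in (0, 1):
        raise ValueError(f"old firstScaleLane must be 0 or 1, got {old_value}")
    return old_value * WAVE32_HALF


print([migrate_first_scale_lane(v) for v in (0, 1)])
```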
Erick Ochoa Lopez
73979c1df9
[mlir][amdgpu] Lower amdgpu.make_dma_base (#169817)
* Adds lowering for `amdgpu.make_dma_base`
2025-12-02 13:48:31 -05:00
Erick Ochoa Lopez
df3e1b59d8
[mlir][amdgpu] Add amdgpu.make_dma_descriptor (#169407)
Co-authored-by: Jakub Kuderski <kubakuderski@gmail.com>
2025-12-01 15:05:02 -05:00
Erick Ochoa Lopez
9af00e62ec
[mlir][amdgpu] Add make_dma_base operation (#169086) 2025-11-26 19:43:38 +00:00
Erick Ochoa Lopez
909c9aacea
[mlir][amdgpu] Add lowerings for ScaledExtPacked816 (#168123)
* Adds lowerings for amdgpu.scaled_ext_packed816
* updates verifiers
2025-11-17 16:51:52 -05:00
Erick Ochoa Lopez
e468ea3f40
[mlir][amdgpu] Fix documentation and verifiers (#167369) 2025-11-17 08:34:21 -05:00
Jakub Kuderski
466c526714
[mlir][amdgpu][rocdl] Add gfx1250 wmma ops (#165064)
Update `amdgpu.wmma` op definition and implement amdgpu to rocdl
conversion for new variants.
2025-10-28 12:42:39 -04:00
Jakub Kuderski
f248010a52
[mlir][amdgpu] Update mfma assembly format with intrinsic shape (#165037)
Use the same format as introduced for wmma by
https://github.com/llvm/llvm-project/pull/164920.

Also make `blocks` default to 1.
2025-10-25 05:58:43 -04:00
Jakub Kuderski
dc5f274560
[mlir][amdgpu] Add explicit intrinsic shape to wmma (#164920)
This is in preparation for adding support for gfx1250 wmma intrinsics
that include many more possible shapes.

Instead of guessing the wave32/wave64 mode based on element types and
vector sizes, require the intrinsic shapes to be set explicitly as
attributes.
2025-10-24 12:21:33 -04:00
Erick Ochoa Lopez
a76c71b205
[mlir][amdgpu] Add scaled_ext_packed{8,16} operations (#159830) 2025-10-17 12:58:03 -04:00
Muzammil
5a6756d2a0
[mlir][AMDGPU] Replace use of SmallVector with ArrayRef, NFC (#163770)
Improving choice of class used, from SmallVector to ArrayRef
(https://llvm.org/docs/ProgrammersManual.html#llvm-adt-arrayref-h). Also infer template types when possible.
Leftover from https://github.com/llvm/llvm-project/pull/155951.

---------

Signed-off-by: Muzammiluddin Syed <muzasyed@amd.com>
2025-10-16 10:41:22 -04:00
Ivan Butygin
6ad662d322
[mlir][amdgpu] Add Inliner interface (#162873)
All the `amdgpu` dialect ops can be inlined.

---------

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
2025-10-10 21:34:00 +03:00
Muzammil
9628061e05
[mlir][AMDGPU] Add canonicalization pattern to pack scales for ScaledMFMAOp (#155951)
The ScaledMFMAOp accepts scales as a vector of 4 bytes
(`vector<4xf8E8M0FNU>`) that can be stored in a single register with a
particular scale accessed using the `OpSel` attribute. Currently, we
only use one byte in this 4-byte vector, resulting in 3 wasted
registers.

This is fixed by identifying when single byte extractions are performed
and rewriting them into extractions of 4-byte vectors.

Example:
```
  %unit = vector.extract %ScaleSrc[offsets] : f8E8M0FNU from vector<?x?x?xf8E8M0FNU>
  %scale = vector.insert %unit, ... : f8E8M0FNU into vector<4xf8E8M0FNU>
  amdgpu.scaled_mfma(%scale[0] * ...
```
to
```
  %reshaped = vector.shape_cast %ScaleSrc : vector<?x?x?xf8E8M0FNU> to vector<?x4xf8E8M0FNU> 
  %scale = vector.extract %reshaped[?] : vector<4xf8E8M0FNU> from vector<?x4xf8E8M0FNU>
  amdgpu.scaled_mfma(%scale[0-3] * ...
```

---------

Signed-off-by: Muzammiluddin Syed <muzasyed@amd.com>
2025-09-18 19:25:14 +00:00
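The index arithmetic behind the scale packing in #155951 can be sketched as follows: instead of extracting one byte, extract the 4-byte group that contains it and select the byte through `OpSel`. This is a rough Python illustration, assuming the scales are contiguous and their count is a multiple of 4. The function name `locate_scale` is hypothetical and the actual pattern operates on `vector` ops rather than flat indices.

```python
from typing import Tuple

SCALES_PER_REGISTER = 4  # four 1-byte f8E8M0FNU scales fit in one 32-bit register


def locate_scale(flat_index: int) -> Tuple[int, int]:
    """Return (group, byte) for a scale at a flat position.

    `group` indexes the vector<4xf8E8M0FNU> chunk to extract, and `byte`
    is the position inside that chunk (the value an OpSel-style selector
    would carry in this model).
    """
    if flat_index < 0:
        raise ValueError("flat_index must be non-negative")
    return divmod(flat_index, SCALES_PER_REGISTER)


for i in (0, 3, 6):
    print(i, locate_scale(i))
```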
Mehdi Amini
6ec08132ee [MLIR] Apply clang-tidy fixes for readability-identifier-naming in AMDGPUDialect.cpp (NFC) 2025-09-18 10:28:46 -07:00
Tim Gymnich
e20fa4f412
[mlir][AMDGPU] Add PermlaneSwapOp (#154345)
- Add PermlaneSwapOp that lowers to `rocdl.permlane16.swap` and
`rocdl.permlane32.swap`

---------

Co-authored-by: Jakub Kuderski <kubakuderski@gmail.com>
2025-08-21 18:21:43 +02:00
Quinn Dawkins
72bc1bea7a
[mlir][AMDGPU] Allow non-contiguous destination memrefs for gather_to_lds (#152559)
The requirement that the LDS operand is contiguous is overly restrictive
because it's perfectly valid to have a subview that depends on subgroup IDs
and is still subgroup contiguous. We could continue trying to do this
verification based on the number of copied elements, but instead this
change just opts to clarify the semantics on the op definition.
2025-08-07 17:24:15 -04:00
Quinn Dawkins
b7f889a29c
[mlir][AMDGPU] Add canonicalizer for folding casts into gather_to_lds (#150503) 2025-07-24 19:58:30 -04:00
Krzysztof Drewniak
9052a85da8
[mlir][AMDGPU] Infer canonical layouts for fat_raw_buffer_cast resetOffset (#149867)
When inferring the return type of amdgpu.fat_raw_buffer_cast with the
offset reset, we would sometimes use a strided layout, like
strided<[1]>, in cases where, after stripping the offset, the memref had
the identity layout. This would cause issues with EmulateNarrowTypes,
which does perform this layout canonicalization.

Now, the return type inference will put in an identity layout after
offset stripping for
1. Statically-shaped memrefs of any rank where the strides match the
suffix product of the shape, and
2. Memrefs of rank <= 1 whose strides are [1] (or []) that just had
their offset removed by resetOffset.
2025-07-21 15:18:19 -05:00
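The two conditions listed in #149867 can be checked with a suffix product over the shape. Below is a minimal Python sketch of that check, not the actual type-inference code. Dynamic dimensions are modeled as `None`, and the names `suffix_product` and `is_identity_after_reset` are assumptions made for illustration.

```python
from typing import List, Optional, Sequence


def suffix_product(shape: Sequence[int]) -> List[int]:
    """Row-major strides: stride[i] is the product of shape[i+1:]."""
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return strides


def is_identity_after_reset(shape: Sequence[Optional[int]],
                            strides: Sequence[int]) -> bool:
    """True if the layout can be the identity once the offset is dropped."""
    # Condition 2: rank <= 1 with strides [1] (or [] for rank 0).
    if len(shape) <= 1 and list(strides) in ([], [1]):
        return True
    # Condition 1: fully static shape whose strides are the suffix product.
    if all(isinstance(dim, int) for dim in shape):
        return list(strides) == suffix_product(shape)
    return False


print(suffix_product([2, 3, 4]))
print(is_identity_after_reset([2, 3, 4], [12, 4, 1]))
```

For example, a `2x3x4` memref is contiguous only with strides `[12, 4, 1]`, while a rank-1 memref with a dynamic size and stride `[1]` qualifies through the second condition.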
Ivan Butygin
6b29ee9d9a
[mlir][amdgpu] Properly handle mismatching memref ranks in amdgpu.gather_to_lds (#149407)
This op doesn't have any rank or indices restrictions on src/dst
memrefs, but was using `SameVariadicOperandSize` which was causing
issues. Also fix some other issues while we are at it.
2025-07-18 00:42:25 +03:00
Daniel Hernandez-Juarez
668c964282
[AMDGPU] [MLIR] Add 96 and 128 bit GatherToLDS for gfx950 (#147496)
This PR adds 96- and 128-bit gather_to_lds support for gfx950, updating
the lowering, verifier, and tests.
2025-07-09 11:53:26 -04:00
Alan Li
3f3282cee8
[AMDGPU] Adding AMDGPU dialect wrapper for ROCDL transpose loads. (#145395)
* 1-to-1 mapping wrapper op.
* Direct lowering from AMDGPU wrapper to ROCDL intrinsics.
2025-06-25 22:58:14 -04:00
Tim Gymnich
67c590004d
[mlir][AMDGPU] Add scaled floating point conversion ops (#141554)
implement `ScaledExtPackedOp` and `PackedScaledTruncOp`
2025-06-13 11:09:11 +02:00
Krzysztof Drewniak
4bdd116b80
[AMDGPU] Add a new amdgcn.load.to.lds intrinsic (#137425)
This PR adds an amdgcn_load_to_lds intrinsic that abstracts over loads to
LDS from global (address space 1) pointers and buffer fat pointers
(address space 7), since they use the same API and "gather from a
pointer to LDS" is something of an abstract operation.

This commit adds the intrinsic and its lowerings for addrspaces 1 and 7,
and updates the MLIR wrappers to use it (loosening up the restrictions
on loads to LDS along the way to match the ground truth from target
features).

It also plumbs the intrinsic through to clang.
2025-05-19 07:15:04 -07:00
Christian Sigg
3a6b9b3a87 [mlir][bazel] Fix after dae0ef53a0b99c6c2b74143baee5896e8bc5c8e7
Remove unnecessary include.
2025-04-08 15:47:14 +02:00
Alan Li
dae0ef53a0
[MLIR][AMDGPU] Add a wrapper for global LDS load intrinsics in AMDGPU (#133498)
Defining a new `amdgpu.global_load` op, which is a thin wrapper around the
ROCDL `global_load_lds` intrinsic, along with its lowering logic to
`rocdl.global.load.lds`.
2025-04-08 09:18:30 -04:00
Krzysztof Drewniak
25622aa745
[mlir][AMDGPU] Add gfx950 MFMAs to the amdgpu.mfma op (#133553)
This commit extends the lowering of amdgpu.mfma to handle the new
double-rate MFMAs in gfx950 and adds tests for these operations.

It also adds support for MFMAs on small floats (f6 and f4), which are
implemented using the "scaled" MFMA intrinsic with a scale value of 0 in
order to have an unscaled MFMA.

This commit does not add an `amdgpu.scaled_mfma` operation, as that is
future work.

---------

Co-authored-by: Jakub Kuderski <kubakuderski@gmail.com>
2025-04-01 11:59:09 -05:00
Mirza Halilčević
1fc49ff593
[MLIR][AMDGPU] Add OCP FP8 support for new hardware (#127728)
(Continuing from #106160)

This PR addresses remaining review comments from the original PR.

Original PR Description
---
Upcoming hardware (gfx12 and some future gfx9) will support the OCP
8-bit float formats for their matrix multiplication intrinsics and
conversion operations, retaining existing opcodes and compiler builtins.

This commit adds support for these types to the MLIR wrappers around
such operations, ensuring that the OCP types aren't used to generate
those builtins on hardware that doesn't expect that format and,
conversely, to ensure that the pre-OCP formats aren't used on new
hardware.

---------

Signed-off-by: Mirza Halilcevic <mirza.halilcevic@amd.com>
Co-authored-by: Paul Fuqua <pf@acm.org>
Co-authored-by: Krzysztof Drewniak <Krzysztof.Drewniak@amd.com>
2025-03-03 14:10:31 -06:00
Krzysztof Drewniak
b31175a33a
[mlir][AMDGPU] Add int4 intrinsics, mixed-type fp8 to handle gfx12 (#128963)
1. Extend the gfx12 FP8 support to allow mixed-type intrinsics (since
they've been added), creating limited mixed-type support that mirrors
MFMA
2. Extend the `amdgpu.wmma` intrinsic lowering to correctly handle
shorter vectors because gfx12 now has instructions that logically take a
4xi8, or, as far as LLVM's concerned, an i32. Similarly, there are 4xi4
inputs, which are an i16 (that must be zero-extended to i32).
3. Correctly handle the ambiguities in the int4 intrinsics on gfx12,
which can either be 16x16x16 or 16x16x32
4. Add tests showing all WMMAs being lowered the way gfx12 expects
(mirroring LLVM's tests)
5. Add a verifier to prevent emitting illegal instructions on gfx12.
2025-02-27 14:48:58 -06:00
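Point 2 of #128963 relies on a short vector being bit-equivalent to a single integer: a 4xi8 is an i32, and a 4xi4 is an i16 that is then zero-extended to i32. The packing can be modeled in Python as below. This is a sketch only, assuming little-endian lane order (element 0 in the lowest bits). The helper `pack_lanes` is hypothetical and is not part of the lowering.

```python
from typing import Sequence


def pack_lanes(values: Sequence[int], bits: int) -> int:
    """Pack small integers into one integer, element 0 in the lowest bits."""
    mask = (1 << bits) - 1
    packed = 0
    for lane, value in enumerate(values):
        # Masking keeps only the low `bits` bits, so negative inputs are
        # stored in two's-complement form.
        packed |= (value & mask) << (lane * bits)
    return packed


i8x4 = pack_lanes([0x01, 0x02, 0x03, 0x04], bits=8)  # occupies 32 bits
i4x4 = pack_lanes([0x1, 0x2, 0x3, 0x4], bits=4)      # occupies 16 bits
# Zero-extension to 32 bits leaves the value unchanged; the upper bits are 0.
i4x4_zext = i4x4 & 0xFFFFFFFF

print(hex(i8x4), hex(i4x4_zext))
```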
Krzysztof Drewniak
42526d240c
[mlir][AMDGPU] Plumb address space 7 through MLIR, add address_space attr. (#125594)
This commit adds support for casting memrefs into fat raw buffer
pointers to the AMDGPU dialect.

Fat raw buffer pointers (or, in LLVM terms, ptr addrspace(7)) allow
encapsulating a buffer descriptor (as produced by the make.buffer.rsrc
intrinsic or provided from some API) into a pointer that supports
ordinary pointer operations like load or store. This allows people to
take advantage of the additional semantics that buffer_load and similar
instructions provide without forcing the use of entirely separate
amdgpu.raw_buffer_* operations.

Operations on fat raw buffer pointers are translated to the
corresponding LLVM intrinsics by the backend.

This commit also goes and defines a #amdgpu.address_space<>
attribute so that AMDGPU-specific memory spaces can be represented. Only
#amdgpu.address_space<fat_raw_buffer> will work correctly with the
memref dialect, but the other possible address spaces are included for
completeness.

---------

Co-authored-by: Jakub Kuderski <kubakuderski@gmail.com>
Co-authored-by: Prashant Kumar <pk5561@gmail.com>
2025-02-26 16:02:39 -06:00
Matthias Springer
6aaa8f25b6
[mlir][IR][NFC] Move free-standing functions to MemRefType (#123465)
Turn free-standing `MemRefType`-related helper functions in
`BuiltinTypes.h` into member functions.
2025-01-21 08:48:09 +01:00
Matthias Springer
7a77f14c0a
[mlir][IR] Remove isF...() type API for low-precision FP types (#123326)
Remove `type.isFloat4E2M1FN()` etc. Use `isa<Float4E2M1FNType>(type)`
instead.

For details, see:
https://discourse.llvm.org/t/rethink-on-approach-to-low-precision-fp-types/82361/28
2025-01-20 09:22:53 +01:00
Frank Schlimbach
d5746d73ce
eliminating g++ warnings (#105520)
Eliminating g++ warnings. Mostly declaring "[[maybe_unused]]", adding
return statements where missing and fixing casts.

@rengolin

---------

Co-authored-by: Benjamin Maxwell <macdue@dueutil.tech>
Co-authored-by: Renato Golin <rengolin@systemcall.eu>
2024-10-18 21:20:47 +01:00
Giuseppe Rossini
a8e1c6f99a
[MLIR][AMDGPU] Add support for fp8 ops on gfx12 (#106388)
This PR is adding support for `fp8` and `bfp8` on gfx12
2024-09-03 17:47:08 +01:00
Giuseppe Rossini
1387ba48a3
[MLIR][AMDGPU] Introduce fp16 packed arithmetic (#105688)
This PR is introducing rocdl.cvt.pkrtz in the ROCDL dialect and it is
using that instruction when lowering `arith::TruncFOp`.
2024-08-26 12:48:57 -05:00
stefankoncarevic
1164e4aef2
[mlir][AMDGPU] Implement AMDGPU DPP operation in MLIR. (#89233)
Defined AMDGPU DPP operation in mlir to represent semantics. Introduced
a new enumeration attribute for different permutations and allowed for
different types of arguments. Implemented constant attribute handling
for ROCDL::DPPMovOp operation. The operation now correctly accepts
constant attributes for dppCtrl, rowMask, bankMask, boundCtrl, and
passes them to the corresponding LLVM intrinsic.
2024-08-16 11:19:39 -05:00
Christian Sigg
a5757c5b65
Switch member calls to isa/dyn_cast/cast/... to free function calls. (#89356)
This change cleans up call sites. Next step is to mark the member
functions deprecated.

See https://mlir.llvm.org/deprecation and
https://discourse.llvm.org/t/preferred-casting-style-going-forward.
2024-04-19 15:58:27 +02:00
Jakub Kuderski
44718311de
[mlir][amdgpu] Remove shared memory optimization pass (#88225)
This implementation has a number of issues and ultimately does not work
on gfx9.
* It does not reduce bank conflicts with wide memory accesses.
* It does not correctly account for when LDS bank conflicts occur on
amdgpu.
* The implementation is too fragile to be used on real-world code. For
example, the code bails out on any `memref.subview` in the root op, even
when the subview is not a user of any of the `memref.alloc` ops.

I do not see how these can be easily fixed, therefore I think it's
better to delete this code.
2024-04-11 11:07:17 -04:00