122 Commits

Author SHA1 Message Date
Krzysztof Drewniak
149fa17adf
[mlir][AMDGPU] Update gather_to_lds with explicit-async support (#181082)
This commit takes advantage of the new `load.async.to.lds` intrinsic in
order to add an `async` mode to `gather_to_lds`. In this mode,
completion of the load needs to be managed with `asyncmark` and
`wait.asyncmark` intrinsics instead of being implicitly derived by alias
analysis.

This commit adds the flag, a lowering for it, and updates tests.

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-16 20:52:35 +00:00
Krzysztof Drewniak
cba0e6ad8e
[mlir][AMDGPU] Change width of LDS barrier count (#180554)
Whoops, turns out I was off by 1 on how many bits are in the counts and
phases in these new LDS barriers. This commit fixes this.

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-09 13:37:14 -08:00
Krzysztof Drewniak
762c32aa08
[mlir][AMDGPU] Add wrappers for in-memory barriers on gfx1250 (#180112)
This commit introduces the `!amdgpu.ds_barrier_state` type and
operations on that type, including extracting its components and (more
importantly) provides wrappers around the upcoming barrier-management
instructions that will be added in gfx1250.

This commit is loosely based on work done for Triton, but it provides
slightly lower-level primitives (namely a known-atomic load for getting
the barrier state instead of a `wait` operation that includes an entire
spin-loop, though if people want one we could consider adding it). These
operations will allow LDS barriers to be interacted with in a more
type-safe manner.

The types and operations use the Ds naming scheme to match the
underlying instructions and to avoid confusion with the "LDS barrier"
already present in the AMDGPU dialect that was a workaround for LLVM's
memory fencing support.

(To summarize a potential usage pattern, one can use a pair of these
barriers to communicate between wave(s) in a workgroup that load data
into memory and a separate wave(s) that compute with that data.)
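
The usage pattern summarized above can be sketched on the CPU, purely as an analogy. In this minimal sketch, Python threads stand in for waves and two semaphores stand in for the pair of barriers; the names `run_pipeline`, `full`, and `empty` are hypothetical, and none of the hardware's count or phase semantics is modeled.

```python
import threading


def run_pipeline(tiles):
    """Hand tiles from a loader thread to a compute thread via one shared slot."""
    slot = [None]                   # stands in for the shared LDS buffer
    full = threading.Semaphore(0)   # signalled when the slot holds fresh data
    empty = threading.Semaphore(1)  # signalled when the slot may be overwritten
    results = []

    def loader():
        for tile in tiles:
            empty.acquire()         # wait until the consumer is done with the slot
            slot[0] = tile
            full.release()          # tell the consumer that data has arrived

    def computer():
        for _ in tiles:
            full.acquire()          # wait for data
            results.append(sum(slot[0]))
            empty.release()         # hand the slot back to the loader

    threads = [threading.Thread(target=loader), threading.Thread(target=computer)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


sums = run_pipeline([[1, 2], [3, 4], [5]])
```

Each side waits on one signal and raises the other, so the loader never overwrites data that is still being read and the consumer never reads stale data.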

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 17:43:12 +00:00
Ravil Dorozhinskii
b1907c109c
[ROCDL] Refactored MFMA ops in ODS; added constraints (#175775)
This PR improves the ROCDL MFMA intrinsics by making their operand and
result types explicit in the IR and by modeling immediate arguments
(immargs) as attributes rather than opaque operands.

This brings MFMA intrinsics in line with recent changes made to ROCDL
WMMA operations, where intrinsic signatures were clarified to avoid
treating them as an unstructured “blob of arguments”.
2026-01-21 21:50:25 +01:00
Erick Ochoa Lopez
aba7d72c8d
[mlir][amdgpu] gfx1250+ lower fat_raw_pointer_cast (#175047)
* numRecords are set to all 1s if out of bounds is not requested.
* set flags correctly to zero.
2026-01-08 16:47:24 -05:00
Erick Ochoa Lopez
19089fa13b
[mlir][amdgpu] Fix DMA lowerings. (#174008)
* Fixes an off-by-one error where tensor_dim_0_stride was always set to 1.
* Instead of always setting this value to 1, tensor_dim_0_stride is the
stride across the last dimension.
2025-12-30 13:33:35 -05:00
Eric Feng
24c7b4ea48
[mlir][amdgpu] implement amdgpu.sparse_mfma wrapper for smfmac instructions (#171968)
Signed-off-by: Eric Feng <Eric.Feng@amd.com>
2025-12-18 20:16:14 -06:00
Erick Ochoa Lopez
5f15fee8ac
[mlir][amdgpu] Add tensor load store operations (#172686)
Reland https://github.com/llvm/llvm-project/pull/170918

This PR differs from the original one by making the target
materialization more restrictive.
2025-12-17 12:37:27 -05:00
Erick Ochoa Lopez
b9d6ad9ce9
Revert "[mlir][amdgpu] Add tensor load store operations (#170918)" (#172671)
This reverts commit ecbb44464a3a5fad090be8c19632b9046f8eb109. Broke ROCM
integration tests. Will reland in future commit.
2025-12-17 15:06:22 +00:00
Mehdi Amini
8cc9c690eb
[MLIR] Fix clang-tidy fixes for llvm-prefer-isa-or-dyn-cast-in-conditionals in AMDGPUToROCDL.cpp (NFC)
The cast can't fail, the `if` checks are spurious.
2025-12-17 05:30:01 -08:00
Ivan Butygin
ce553ab69f
Revert "[mlir][amdgpu] Expose waitcnt bitpacking infra (#172313)" (#172636)
This reverts commit 93013817afabe23a07073528481856b3507b6faf.

Revert https://github.com/llvm/llvm-project/pull/172313

Missing libraries, again
2025-12-17 12:13:44 +00:00
Ivan Butygin
93013817af
[mlir][amdgpu] Expose waitcnt bitpacking infra (#172313)
So we can get rid of our copy in `AMDGPUToROCDL`.
2025-12-17 14:32:30 +03:00
Erick Ochoa Lopez
ecbb44464a
[mlir][amdgpu] Add tensor load store operations (#170918)
* removes unused code.
* lowers tensor load and store operations.
2025-12-16 09:09:28 -05:00
Justin Rosner
3a88bb90bb
[mlir][AMDGPU] Add scaled wmma ops for gfx1250 (#169854)
This PR adds scaled WMMA ops (available on gfx1250) and the lowering to the AMDGPU dialect, wrapping the underlying intrinsics.
2025-12-15 15:44:36 -08:00
Erick Ochoa Lopez
5123d36c02
[mlir][amdgpu] Lower make_gather_dma_descriptor. (#172083)
* Makes `MakeDescriptorOp` a template for `make_dma_descriptor` and
`make_gather_dma_descriptor`.
* Makes verification and folder for `make_dma_descriptor` a template.
* Adds custom verification and folder for `make_gather_dma_descriptor`
based on template.
* Adds `make_gather_dma_descriptor` op.
* Lowers `make_gather_dma_descriptor` to ROCDL.
2025-12-15 13:32:57 -05:00
Ivan Butygin
f785ca0d72
[mlir][nvgpu] Move memref memspace attributes conversion to single place (#172156)
Also includes some naming fixes for the AMDGPU part.
2025-12-14 12:44:47 +03:00
Erick Ochoa Lopez
5ebb928532
[mlir][amdgpu] Adds make_dma_gather_base (#171857)
* Adds `tdm_gather_base` type.
* Adds `make_dma_gather_base` op.
* Adds `make_dma_gather_base` lowering to ROCDL.
2025-12-12 09:20:38 -05:00
Erick Ochoa Lopez
2f9b8b7428
[mlir][amdgpu] Continue lowering make_tdm_descriptor. (#171498)
* changes workgroup mask's type from i16 to vector<16xi1>
* changes pad_amount and pad_interval from Index to I32
* adds lit tests for padEnable, iteration and dynamic cases
* adds TODO for a future instrumentation pass to validate inputs
* adds descriptor groups 2 and 3
2025-12-11 15:49:50 -05:00
Ivan Butygin
c22d82a1d4
[mlir][amdgpu] Move GPU memory spaces conversion to single place (#171876)
2025-12-11 21:39:57 +03:00
Erick Ochoa Lopez
87345d2ad4
[mlir][amdgpu] Add type conversion to populate method (NFC) (#171708)
* Renames populateAMDGPUMemorySpaceAttributeConversions to
populateAMDGPUTypeAndAttributeConversions.
* Adds TDMBaseType conversion to
populateAMDGPUTypeAndAttributeConversions.
2025-12-11 08:44:19 -05:00
Ivan Butygin
c9c4e6eb58
Reland [mlir][amdgpu] Add common gpu mem space conversions to convert-amdgpu-to-rocdl (#171599)
Reland https://github.com/llvm/llvm-project/pull/171543

Added missing GPU lib `MLIRGPUToGPURuntimeTransforms`.
2025-12-10 17:33:51 +03:00
Ivan Butygin
467af2715a
Revert "[mlir][amdgpu] Add common gpu mem space conversions to `conve… (#171594)
…rt-amdgpu-to-rocdl` (#171543)"

This reverts commit fd0fb05ae196cb664ebdd8940aad20f9606c62f7.

Forgot to link GPU lib and shared lib build failed.
2025-12-10 10:47:07 +00:00
Ivan Butygin
fd0fb05ae1
[mlir][amdgpu] Add common gpu mem space conversions to convert-amdgpu-to-rocdl (#171543)
Without it `convert-amdgpu-to-rocdl` will fail to convert
`amdgpu.gather_to_lds` with `#gpu.address_space<workgroup>` mem space.
2025-12-10 13:15:10 +03:00
Ivan Butygin
f88d060c41
[mlir][amdgpu] memory_counter_wait tensor counter support (#171153)
2025-12-08 20:02:40 +03:00
Tim Gymnich
0487154588
[mlir][amdgpu] Add workgroup_mask to MakeDmaDescriptorOp (#171103)
- add `workgroup_mask` and `early_timeout`
2025-12-08 16:02:18 +01:00
Kazu Hirata
29fa151a07
[mlir] Fix a warning
This patch fixes:

  mlir/lib/Conversion/AMDGPUToROCDL/AMDGPUToROCDL.cpp:2666:10: error:
  unused variable 'v4i32' [-Werror,-Wunused-variable]
2025-12-05 11:28:36 -08:00
Erick Ochoa Lopez
5dfd9c4f84
[mlir][amdgpu] Add lowering for make_dma_descriptor (#169955)
* Adds initial lowering for make_dma_descriptor supporting tensors of
rank 2.
* Adds folders for make_dma_descriptor allowing statically known
operands to be folded into attributes.
* Add AllElementTypesMatch<["lds", "global"]> to make_dma_base.
* Rename pad to pad_amount
* Rename pad_every to pad_interval
2025-12-05 14:24:23 -05:00
Krzysztof Drewniak
e209b8bc2f
[mlir][AMDGPU] Rename gfx1250 packed extension ops, change firstScaleLane (#170718)
The current name of scaled_ext_packed816 was, in retrospect, bothering
me, since it just has a bunch of numbers on the end and doesn't really
reflect the wave-wide nature of the operation.

On top of that, the fact that firstScaleLane was 0 or 1, which might be
read as the first lane being 1 (and not what it actually was, 16), also
seemed weird.

Therefore, before this op sees any use,

1. Rename it to scaled_ext_packed_matrix
2. Change the semantics of firstScaleLane to actually point at the lane
where the scales start (valid options currently are 0 or 16, the two
halves of a wave32 wave).
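
The change in meaning of firstScaleLane can be illustrated with a small sketch. The helper names below are hypothetical and not part of the dialect; the only facts taken from the message are that the old values were 0 or 1 and the new valid values are 0 or 16.

```python
WAVE32_HALF = 16  # lanes in each half of a wave32 wave, per the commit message


def first_scale_lane_from_half_index(half_index: int) -> int:
    """Convert the old 0/1 encoding into the lane where the scales start."""
    if half_index not in (0, 1):
        raise ValueError(f"old-style value must be 0 or 1, got {half_index}")
    return half_index * WAVE32_HALF


def is_valid_first_scale_lane(lane: int) -> bool:
    """Check a value against the currently valid options (0 or 16)."""
    return lane in (0, WAVE32_HALF)


new_values = [first_scale_lane_from_half_index(i) for i in (0, 1)]
```

With the new semantics the attribute names the lane directly, so a reader no longer has to know that the old value 1 meant lane 16.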

(Disclaimer: the mechanical updates were done via AI.)

---------

Co-authored-by: Erick Ochoa Lopez <eochoalo@amd.com>
2025-12-04 14:35:16 -08:00
Erick Ochoa Lopez
73979c1df9
[mlir][amdgpu] Lower amdgpu.make_dma_base (#169817)
* Adds lowering for `amdgpu.make_dma_base`
2025-12-02 13:48:31 -05:00
Erick Ochoa Lopez
1fcfd5c67b
[mlir][amdgpu] Sink op creation in scaled conversion intrinsics (NFC) (#168542)
Where possible:

* notifyMatchFailure happens first
* then op.emitOpError
* finally assertions / op creation.

---------

Co-authored-by: Jakub Kuderski <kubakuderski@gmail.com>
2025-11-18 10:35:05 -05:00
Erick Ochoa Lopez
909c9aacea
[mlir][amdgpu] Add lowerings for ScaledExtPacked816 (#168123)
* Adds lowerings for amdgpu.scaled_ext_packed816
* updates verifiers
2025-11-17 16:51:52 -05:00
Muzammiluddin Syed
b1262d13e0
[mlir][ROCDL] Refactor wmma intrinsics to use attributes not operands where possible (#167041)
The current implementation of the WMMA intrinsic ops as they are defined
in the ROCDL tablegen is incorrect. They represent as operands what
should be attributes such as `clamp`, `opsel`, `signA/signB`. This
change performs a refactoring to bring it in line with what we expect.

---------

Signed-off-by: Muzammiluddin Syed <muzasyed@amd.com>
2025-11-13 19:50:02 -05:00
Jakub Kuderski
ba0be89cd2
[mlir] Simplify Default cases in type switches. NFC. (#165767)
Use default values instead of lambdas when possible. `std::nullopt` and
`nullptr` can be used now because of
https://github.com/llvm/llvm-project/pull/165724.
2025-10-30 15:10:59 -04:00
Jakub Kuderski
3167752f29
[mlir][amdgpu][rocdl] Allow for graceful wmma conversion failures (#165616) 2025-10-29 16:09:59 -04:00
Jakub Kuderski
466c526714
[mlir][amdgpu][rocdl] Add gfx1250 wmma ops (#165064)
Update `amdgpu.wmma` op definition and implement amdgpu to rocdl
conversion for new variants.
2025-10-28 12:42:39 -04:00
Jakub Kuderski
dc5f274560
[mlir][amdgpu] Add explicit intrinsic shape to wmma (#164920)
This is in preparation for adding support for gfx1250 wmma intrinsics
that include many more possible shapes.

Instead of guessing the wave32/wave64 mode based on element types and
vector sizes, require the intrinsic shapes to be set explicitly as
attributes.
2025-10-24 12:21:33 -04:00
Jakub Kuderski
ae11c5c2c4
[mlir] Switch uses of deprecated .create methods to free function. NFC. (#164635)
See https://discourse.llvm.org/t/psa-opty-create-now-with-100-more-tab-complete/87339.
2025-10-22 14:51:03 +00:00
Shilei Tian
2195fe7e01
[AMDGPU] Add the support for 45-bit buffer resource (#159702)
On new targets like `gfx1250`, the buffer resource (V#) now uses this
format:

```
base (57-bit): resource[56:0]
num_records (45-bit): resource[101:57]
reserved (6-bit): resource[107:102]
stride (14-bit): resource[121:108]
```

This PR changes the type of `num_records` from `i32` to `i64` in both
builtin and intrinsic, and also adds the support for lowering the new
format.
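
To make the layout above concrete, here is a minimal sketch that packs and unpacks the four listed fields with plain integer arithmetic. The function and table names are hypothetical, and bits above 121 are left at zero because the message does not describe them; this is not the actual lowering.

```python
# Field layout as listed in the commit message: (name, low bit, width).
V_SHARP_FIELDS = [
    ("base", 0, 57),
    ("num_records", 57, 45),
    ("reserved", 102, 6),
    ("stride", 108, 14),
]


def pack_resource(**values: int) -> int:
    """Pack the named fields into one integer; unnamed fields stay zero."""
    word = 0
    for name, low, width in V_SHARP_FIELDS:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word |= value << low
    return word


def unpack_resource(word: int) -> dict:
    """Extract each listed field from a packed integer."""
    return {
        name: (word >> low) & ((1 << width) - 1)
        for name, low, width in V_SHARP_FIELDS
    }


packed = pack_resource(base=0x1000, num_records=(1 << 45) - 1, stride=4)
fields = unpack_resource(packed)
```

The largest `num_records` value here needs 45 bits, which is why a 32-bit integer can no longer carry it.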

Fixes SWDEV-554034.

---------

Co-authored-by: Krzysztof Drewniak <Krzysztof.Drewniak@amd.com>
2025-09-24 11:12:02 -04:00
Krzysztof Drewniak
5ecc6d1951
[mlir][AMDGPU] Use LDS-only MMRA fences for lds_barrier (#157919)
The previous lowering strategy for amdgpu.lds_barrier (an operation
whose semantics are "s.barrier, and all LDS operations before this
happen-before LDS operations after this, and there must not be an
inherent fence/forcing-to-completion of global memory (for performance)")
used manual calls to waitcnt() intrinsics and the s_barrier
intrinsic(s).

The lack of explicit fencing enabled miscompiles (where LDS accesses
were reordered with the barrier) on gfx12. Since LLVM now allows MMRA
annotations to ensure that only LDS accesses are fenced by a pair of
fences, we can now use these fences in order to explicitly represent the
semantics we want instead of trying to prescribe the method of their
implementation.

Note that the gfx908 workaround of hiding the s_barrier in inline
assembly in order to prevent spurious vmem barriers remains in place,
but it is removed for gfx11 because the fences were recently changed to
give us the effect we want.
2025-09-23 14:00:09 -05:00
Gaurav Verma
a2a9601ea4
[mlir][AMDGPU] Updated PermlaneSwapOp to select correct val (#157586)
* as per the instruction description, updated `PermlaneSwapOp` to select
correct val
* updated corresponding lit tests

Issue it resolves: the block reduction was otherwise failing because we
always selected `{0}`.

---------

Signed-off-by: xintin <gaurav.verma@amd.com>
2025-09-12 13:45:56 +02:00
Tim Gymnich
003cbbd4ca
[mlir][amdgpu] Promote gpu.shuffle to amdgpu.permlane_swap (#154933)
- promote `gpu.shuffle %src xor {16,32} 64` to `amdgpu.permlane_swap
%src {16,32}`
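
As a rough illustration of the data movement being matched, the sketch below models an xor shuffle over a 64-lane wave in plain Python. The function name is hypothetical and the model only covers the lane permutation; it says nothing about how the instruction is encoded or lowered.

```python
def shuffle_xor(values, offset):
    """Each lane reads the value held by lane (lane ^ offset).

    Lanes whose partner falls outside the wave keep their own value.
    """
    width = len(values)
    return [
        values[lane ^ offset] if (lane ^ offset) < width else values[lane]
        for lane in range(width)
    ]


wave = list(range(64))            # lane i initially holds the value i
swapped16 = shuffle_xor(wave, 16)
swapped32 = shuffle_xor(wave, 32)
```

With an offset of 16, adjacent 16-lane groups trade places (lanes 0-15 with 16-31, and 32-47 with 48-63); with an offset of 32, the two 32-lane halves trade places.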
2025-08-24 12:41:09 +02:00
Tim Gymnich
e20fa4f412
[mlir][AMDGPU] Add PermlaneSwapOp (#154345)
- Add PermlaneSwapOp that lowers to `rocdl.permlane16.swap` and
`rocdl.permlane32.swap`

---------

Co-authored-by: Jakub Kuderski <kubakuderski@gmail.com>
2025-08-21 18:21:43 +02:00
Maksim Levental
c610b24493
[mlir][NFC] update mlir/Dialect create APIs (27/n) (#150638)
See https://github.com/llvm/llvm-project/pull/147168 for more info.
2025-07-25 11:48:32 -05:00
Maksim Levental
8e8f195322
[mlir][amd] fix LLVM::InsertValueOp::create failure to disambiguate (#150605)
fixes
https://github.com/llvm/llvm-project/pull/149879#issuecomment-3117145615

Note this happens because ADL can't disambiguate between
`mlir::DenseI64ArrayAttr` and `llvm::ArrayRef<int64_t>` **for the value
0** which I guess is equal to nullptr on some (most?) systems.

Note, this only occurs with the value 0.
2025-07-25 07:56:27 -04:00
Maksim Levental
b0434925c9
[mlir][NFC] update Conversion create APIs (4/n) (#149879)
See https://github.com/llvm/llvm-project/pull/147168 for more info.
2025-07-23 10:49:35 -05:00
Ivan Butygin
4977100624
[mlir][amdgpu] Add rocdl.s.waitcnt wrapper (#149670)
The main motivation is to pass vmcnt/expcnt/lgkmcnt values directly
(similar to the asm format) and delegate architecture-dependent
bitpacking to the amdgpu->rocdl lowering.
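
As an illustration of what architecture-dependent bitpacking means here, the sketch below packs the three counters for one layout only. It assumes the gfx9-style encoding (vmcnt split across bits [3:0] and [15:14], expcnt in [6:4], lgkmcnt in [11:8]); other architectures place and size the fields differently, and the function name is hypothetical rather than taken from the lowering.

```python
def encode_waitcnt_gfx9(vmcnt: int, expcnt: int, lgkmcnt: int) -> int:
    """Pack the three counters into a 16-bit immediate (assumed gfx9 layout)."""
    if not 0 <= vmcnt < 64:
        raise ValueError("vmcnt must fit in 6 bits")
    if not 0 <= expcnt < 8:
        raise ValueError("expcnt must fit in 3 bits")
    if not 0 <= lgkmcnt < 16:
        raise ValueError("lgkmcnt must fit in 4 bits")
    vm_lo = vmcnt & 0xF         # low four bits of vmcnt go to bits [3:0]
    vm_hi = (vmcnt >> 4) & 0x3  # high two bits of vmcnt go to bits [15:14]
    return vm_lo | (expcnt << 4) | (lgkmcnt << 8) | (vm_hi << 14)


# Wait only on vector memory: the other counters are left at their maximum.
wait_vm_only = encode_waitcnt_gfx9(vmcnt=0, expcnt=7, lgkmcnt=15)
```

Passing the counters as separate values means a caller never needs to know this split-field detail; only the lowering does.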

---------

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
2025-07-22 23:37:56 +03:00
Daniel Hernandez-Juarez
668c964282
[AMDGPU] [MLIR] Add 96 and 128 bit GatherToLDS for gfx950 (#147496)
This PR adds 96-bit and 128-bit gather_to_lds support for gfx950,
updating the lowering, verifier, and tests.
2025-07-09 11:53:26 -04:00
Alan Li
3f3282cee8
[AMDGPU] Adding AMDGPU dialect wrapper for ROCDL transpose loads. (#145395)
* 1-to-1 mapping wrapper op.
* Direct lowering from AMDGPU wrapper to ROCDL intrinsics.
2025-06-25 22:58:14 -04:00
Umang Yadav
836201f117
Allow bf16 operands on new MFMAs (#144925)
New gfx950 MFMAs allow bf16 operands.


c0cc81cdc0/llvm/include/llvm/IR/IntrinsicsAMDGPU.td (L3434)

When running `amdgpu-to-rocdl`, the current logic always converts bf16
to i16, which fails to compile for newer bf16 MFMAs, e.g.
`v_mfma_f32_16x16x32bf16`.
The backend expects the bf16 type for the operands of those newer MFMAs.
This patch fixes it.

CC: @krzysz00  @dhernandez0  @giuseros  @antiagainst  @kuhar
2025-06-19 12:52:31 -05:00
Daniel Hernandez-Juarez
68b6f392ed
[MLIR][AMDGPU] Fix bug in GatherToLDSOpLowering, get the correct MemRefType for destination (#142915)
This PR fixes a bug in GatherToLDSOpLowering: we were getting the
MemRefType of the source for the destination. Additionally, some related
typos are corrected.

CC: @krzysz00 @umangyadav @lialan
2025-06-13 11:33:51 -05:00