This commit takes advantage of the new `load.async.to.lds` intrinsic in
order to add an `async` mode to `gather_to_lds`. In this mode,
completion of the load needs to be managed with `asyncmark` and
`wait.asyncmark` intrinsics instead of being implicitly derived by alias
analysis.
This commit adds the flag, a lowering for it, and updates tests.
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Whoops, turns out I was off by 1 on how many bits are in the counts and
phases ind these new LDS barriers. This commit fixes this.
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
This commit introduces the `!amdgpu.ds_barrier_state` type and
operations on that type, including extracting its components and (more
importantly) provides wrappers around the upcoming barrier-management
instructions that will be added in gfx1250.
This commit is loosely based on work done for Triton, but this commit
provides slightly more lower level-primitives (namely a known-atomic
load for getting the barrier state instead of providing a `wait`
operation that includes an entire spin-loop, though if people want one
we could consider adding it.) These operations will allow LDS barriers
to be interacted with in a more type-safe manner.
The types and operations use the Ds naming scheme to match the
underlying instructions and to avoid confusion with the "LDS barrier"
already present in the AMDGPU dialect that was a workaround for LLVM's
memory fencing support.
(To summarize a potential usage pattern, one can use a pair of these
barriers to communicate between wave(s) in a workgroup that load data
into memory and a separate wave(s) that compute with that data.)
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
This PR improves the ROCDL MFMA intrinsics by making their operand and
result types explicit in the IR and by modeling immediate arguments
(immargs) as attributes rather than opaque operands.
This brings MFMA intrinsics in line with recent changes made to ROCDL
WMMA operations, where intrinsic signatures were clarified to avoid
treating them as an unstructured “blob of arguments”.
* Fixes off by one error where tensor_dim_0_stride was always set to 1.
* Instead of always setting this value to 1, tensor_dim_0_stride is the
stride across the last dimension.
* Makes `MakeDescriptorOp` a template for `make_dma_descriptor` and
`make_gather_dma_descriptor`.
* Makes verification and folder for `make_dma_descriptor` a template.
* Adds custom verification and folder for `make_dma_gather_descriptor`
based on tempalte.
* Adds `make_gather_dma_descriptor` op.
* Lowers `make_gather_dma_descriptor` to ROCDL.
* changes workgroup mask's type from i16 to vector<16xi1>
* changes pad_amount and pad_interval from Index to I32
* adds lit tests for padEnable, iteration and dynamic cases
* adds TODO for a future instrumentation pass to validate inputs
* adds descriptor groups 2 and 3
* Adds initial lowering for make_dma_descriptor supporting tensors of
rank 2.
* Adds folders for make_dma_descriptor allowing statically known
operands to be folded into attributes.
* Add AllElementTypesMatch<["lds", "global"]> to make_dma_base.
* Rename pad to pad_amount
* Rename pad_every to pad_interval
The current name of scaled_ext_packed816 was, in retrospect, bothering
me, since it just has a bunch of numbers on the end and doesn't really
reflect the wave-wide nature of the operation.
On top of that, the fact that firstScaleLane was 0 or 1, which might be
read as the first lane being 1 (and not what it actually was, 16), also
seemed weird.
Therefore, before this op sees any use,
1. Renaem it to scaled_ext_packed_matrix
2. Change the semantics of firstScaleLane to actually point at the lane
where the scales start (valid options currently are 0 or 16, the two
halves of a wave32 wave).
(Disclaimer: the mechanical updates were done via AI.)
---------
Co-authored-by: Erick Ochoa Lopez <eochoalo@amd.com>
Where possible:
* notifyMatchFailure happen first
* then op.emitOpError
* finally assertions / op creation.
---------
Co-authored-by: Jakub Kuderski <kubakuderski@gmail.com>
The current implementation of the WMMA intrinsic ops as they are defined
in the ROCDL tablegen is incorrect. They represent as operands what
should be attributes such as `clamp`, `opsel`, `signA/signB`. This
change performs a refactoring to bring it in line with what we expect.
---------
Signed-off-by: Muzammiluddin Syed <muzasyed@amd.com>
This is in preparation for adding support for gfx1250 wmma intrinsics
that include much more possible shapes.
Instead of guessing the wave32/wave64 mode based on element types and
vector sizes, require the intrinsic shapes to be set explicitly as
attributes.
On new targets like `gfx1250`, the buffer resource (V#) now uses this
format:
```
base (57-bit): resource[56:0]
num_records (45-bit): resource[101:57]
reserved (6-bit): resource[107:102]
stride (14-bit): resource[121:108]
```
This PR changes the type of `num_records` from `i32` to `i64` in both
builtin and intrinsic, and also adds the support for lowering the new
format.
Fixes SWDEV-554034.
---------
Co-authored-by: Krzysztof Drewniak <Krzysztof.Drewniak@amd.com>
The previous lowering strategy for amdgpu.lds_barrier (which is an
operation whose semantics are) "s.barrier, and all LDS operations before
this happen-before LDS operations after this, and there must not be an
inherent fence/forcing-to-completion of global memory (for performance)"
was previosuly implemented through using manual calls to waitcnt()
intrinsics and the s_barrire intrinsic(s).
The lack of explicit fencing enabled miscompiles (where LDS accesses
were reordered with the barrier) on gfx12. Since LLVM now allows MMRA
annotations to ensure that only LDS accesses are fenced by a pair of
fences, we can now use these fences in order to explicitly represent the
semantics we want instead of trying to prescribe the method of their
implemntation.
Note that the gfx908 workaround of hiding the s_barrier in inline
assembly in order to prevent spurious vmem barriers remains in place,
but is is removed for gfx11 because the fences have been changed to give
us the effect we want recently.
* as per the instruction description, updated `PermlaneSwapOp` to select
correct val
* updated corresponding lit tests
Issue it resolves: the block reduction was failing otherwise as we were
selecting the `{0}` always.
---------
Signed-off-by: xintin <gaurav.verma@amd.com>
- Add PermlaneSwapOp that lowers to `rocdl.permlane16.swap` and
`rocdl.permlane32.swap`
---------
Co-authored-by: Jakub Kuderski <kubakuderski@gmail.com>
fixes
https://github.com/llvm/llvm-project/pull/149879#issuecomment-3117145615
Note this happens because ADL can't disambiguate between
`mlir::DenseI64ArrayAttr` and `llvm::ArrayRef<int64_t>` **for the value
0** which I guess is equal to nullptr on some (most?) systems.
Note, this only occurs with the value 0.
The main motivations is to pass vmcnt/expcnt/lgkmcnt values directly
(similar to the asm format) and delegate architecture-dependent
bitpacking to the amdgpu->rocdl lowering.
---------
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
New gfx950 MFMA allows bf16 operands.
c0cc81cdc0/llvm/include/llvm/IR/IntrinsicsAMDGPU.td (L3434)
When running `amdgpu-to-rocdl`, Current logic converts bf16 to i16
always which fails to compile for newer bf16 MFMA e.g.
`v_mfma_f32_16x16x32bf16`.
Backend expects bf16 type for the operands for those newer MFMAs. This
patch fixes it.
CC: @krzysz00 @dhernandez0 @giuseros @antiagainst @kuhar
This PR fixes a bug in GatherToLDSOpLowering, we were getting the
MemRefType of source for the destination. Additionally, some related
typos are corrected.
CC: @krzysz00 @umangyadav @lialan