Follow the style of other dialects by having a distinct .td file for
each category of thing (type, attribute, operation, enum) generated for
the AMDGPU dialect.
Nothing has changed, but a lot of things have been copy-pasted.
* Makes `MakeDescriptorOp` a template for `make_dma_descriptor` and
`make_gather_dma_descriptor`.
* Makes verification and folder for `make_dma_descriptor` a template.
* Adds custom verification and folder for `make_gather_dma_descriptor`
based on the template.
* Adds `make_gather_dma_descriptor` op.
* Lowers `make_gather_dma_descriptor` to ROCDL.
Since `FatRawBufferCastOp` preserves the shape of its source operand,
the result dimensions can be reified by querying the source's
dimensions.
---------
Signed-off-by: Yu-Zhewen <zhewenyu@amd.com>
* Adds initial lowering for make_dma_descriptor supporting tensors of
rank 2.
* Adds folders for make_dma_descriptor allowing statically known
operands to be folded into attributes.
* Add AllElementTypesMatch<["lds", "global"]> to make_dma_base.
* Rename pad to pad_amount
* Rename pad_every to pad_interval
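The folding of statically known operands into attributes can be modelled roughly as below. This is a plain-Python sketch, not the actual folder: `fold_static_operands` and the use of `None` as the "dynamic" marker are hypothetical, and operands are represented as either an `int` (a known constant) or a `str` (an SSA value name).

```python
from typing import List, Optional, Tuple, Union

Operand = Union[int, str]  # int: known constant, str: SSA value name (illustrative)


def fold_static_operands(
    operands: List[Operand],
) -> Tuple[List[Optional[int]], List[str]]:
    """Split operands into a static attribute list and the remaining dynamic operands.

    Slots that stay dynamic are recorded as None in the attribute list so the
    position of every operand is preserved.
    """
    static_attr: List[Optional[int]] = []
    dynamic: List[str] = []
    for operand in operands:
        if isinstance(operand, int):
            static_attr.append(operand)  # constant folded into the attribute
        else:
            static_attr.append(None)  # placeholder for a value only known at runtime
            dynamic.append(operand)
    return static_attr, dynamic
```

For example, `fold_static_operands([64, "%n", 1])` keeps `%n` as the only dynamic operand and records the two constants in the attribute list.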
The current name of scaled_ext_packed816 was, in retrospect, bothering
me, since it just has a bunch of numbers on the end and doesn't really
reflect the wave-wide nature of the operation.
On top of that, the fact that firstScaleLane was 0 or 1, which might be
read as the first lane being 1 (and not what it actually was, 16), also
seemed weird.
Therefore, before this op sees any use,
1. Rename it to scaled_ext_packed_matrix
2. Change the semantics of firstScaleLane to actually point at the lane
where the scales start (valid options currently are 0 or 16, the two
halves of a wave32 wave).
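As an illustration of the semantic change, the old 0/1 encoding maps onto the new lane numbers as sketched below. The helper name `migrate_first_scale_lane` is hypothetical and only models the description above; it is not part of the dialect.

```python
# Valid values under the new semantics: the first lane of each half of a wave32 wave.
VALID_FIRST_SCALE_LANES = (0, 16)


def migrate_first_scale_lane(old_value: int) -> int:
    """Convert the old 0/1 selector into the lane where the scales start."""
    if old_value not in (0, 1):
        raise ValueError(f"old firstScaleLane must be 0 or 1, got {old_value}")
    return old_value * 16
```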
(Disclaimer: the mechanical updates were done via AI.)
---------
Co-authored-by: Erick Ochoa Lopez <eochoalo@amd.com>
This is in preparation for adding support for gfx1250 wmma intrinsics
that include many more possible shapes.
Instead of guessing the wave32/wave64 mode based on element types and
vector sizes, require the intrinsic shapes to be set explicitly as
attributes.
The ScaledMFMAOp accepts scales as a vector of 4 bytes
(`vector<4xf8E8M0FNU>`) that can be stored in a single register with a
particular scale accessed using the `OpSel` attribute. Currently, we
only use one byte in this 4-byte vector, resulting in 3 wasted
registers.
This is fixed by identifying when single byte extractions are performed
and rewriting them into extractions of 4-byte vectors.
Example:
```
%unit = vector.extract %ScaleSrc[offsets] : f8E8M0FNU from vector<?x?x?xf8E8M0FNU>
%scale = vector.insert %unit, ... : f8E8M0FNU into vector<4xf8E8M0FNU>
amdgpu.scaled_mfma(%scale[0] * ...
```
to
```
%reshaped = vector.shape_cast %ScaleSrc : vector<?x?x?xf8E8M0FNU> to vector<?x4xf8E8M0FNU>
%scale = vector.extract %reshaped[?] : vector<4xf8E8M0FNU> from vector<?x4xf8E8M0FNU>
amdgpu.scaled_mfma(%scale[0-3] * ...
```
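The index arithmetic behind this rewrite can be sketched as follows. This is an illustrative model only: it assumes a row-major source vector whose element count is a multiple of 4, and the helper `locate_scale` is hypothetical. The actual pattern may impose further conditions.

```python
from typing import Sequence, Tuple


def locate_scale(offsets: Sequence[int], shape: Sequence[int]) -> Tuple[int, int]:
    """Map a single-byte extraction to (row in the ?x4 view, byte within the row).

    The second result plays the role of the OpSel value that selects one of
    the four scales held in the same register.
    """
    if len(offsets) != len(shape):
        raise ValueError("offsets and shape must have the same rank")
    total = 1
    for dim in shape:
        total *= dim
    if total % 4 != 0:
        raise ValueError("element count must be a multiple of 4")
    # Row-major linearization of the original extraction offsets.
    flat = 0
    for offset, dim in zip(offsets, shape):
        flat = flat * dim + offset
    return divmod(flat, 4)
```

For a `vector<2x4x8xf8E8M0FNU>` source and offsets `[1, 2, 5]`, the linear index is 53, so the scale sits in row 13 of the reshaped vector at byte 1.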
---------
Signed-off-by: Muzammiluddin Syed <muzasyed@amd.com>
- Add PermlaneSwapOp that lowers to `rocdl.permlane16.swap` and
`rocdl.permlane32.swap`
---------
Co-authored-by: Jakub Kuderski <kubakuderski@gmail.com>
The requirement that the LDS operand be contiguous is overly restrictive
because a subview that depends on subgroup IDs can still be
subgroup-contiguous. We could continue trying to do this
verification based on the number of copied elements, but instead this
change just opts to clarify the semantics on the op definition.
When inferring the return type of amdgpu.fat_raw_buffer_cast with the
offset reset, we would sometimes use a strided layout, like
strided<[1]>, in cases where, after stripping the offset, the memref had
the identity layout. This would cause issues with EmulateNarrowTypes,
which does perform this layout canonicalization.
Now, the return type inference will put in an identity layout after
offset stripping for
1. Statically-shaped memrefs of any rank where the strides match the
suffix product of the shape, and
2. Memrefs of rank <= 1 whose strides are [1] (or []) that just had
their offset removed by resetOffset.
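The two conditions can be sketched in plain Python as below. This is a simplified model of the rule as described here, not the actual inference code: dynamic dimensions are represented as `None`, strides are assumed static, and both function names are hypothetical.

```python
from typing import List, Optional, Sequence


def suffix_product_strides(shape: Sequence[int]) -> List[int]:
    """Row-major strides: each stride is the product of all later dimensions."""
    strides: List[int] = []
    running = 1
    for dim in reversed(shape):
        strides.append(running)
        running *= dim
    return list(reversed(strides))


def infers_identity_layout(
    shape: Sequence[Optional[int]], strides: Sequence[int]
) -> bool:
    """Return True when the offset-stripped memref should get the identity layout."""
    # Case 2: rank <= 1 with strides [1] (or []).
    if len(shape) <= 1 and list(strides) in ([], [1]):
        return True
    # Case 1: statically shaped, strides equal to the suffix product of the shape.
    if any(dim is None for dim in shape):
        return False
    return list(strides) == suffix_product_strides(shape)
```

For instance, a `2x3x4` memref with strides `[12, 4, 1]` qualifies, while the same shape with strides `[16, 4, 1]` keeps its strided layout.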
This op doesn't have any rank or indices restrictions on src/dst
memrefs, but was using `SameVariadicOperandSize` which was causing
issues. Also fix some other issues while we are at it.
This PR adds an amdgcn_load_to_lds intrinsic that abstracts over loads to
LDS from global (address space 1) pointers and buffer fat pointers
(address space 7), since they use the same API and "gather from a
pointer to LDS" is something of an abstract operation.
This commit adds the intrinsic and its lowerings for addrspaces 1 and 7,
and updates the MLIR wrappers to use it (loosening up the restrictions
on loads to LDS along the way to match the ground truth from target
features).
It also plumbs the intrinsic through to clang.
Defines a new `amdgpu.global_load` op, which is a thin wrapper around
the ROCDL `global_load_lds` intrinsic, along with its lowering to
`rocdl.global.load.lds`.
This commit extends the lowering of amdgpu.mfma to handle the new
double-rate MFMAs in gfx950 and adds tests for these operations.
It also adds support for MFMAs on small floats (f6 and f4), which are
implemented using the "scaled" MFMA intrinsic with a scale value of 0 in
order to have an unscaled MFMA.
This commit does not add an `amdgpu.scaled_mfma` operation, as that is
future work.
---------
Co-authored-by: Jakub Kuderski <kubakuderski@gmail.com>
(Continuing from #106160)
This PR addresses remaining review comments from the original PR.
Original PR Description
---
Upcoming hardware (gfx12 and some future gfx9) will support the OCP
8-bit float formats for their matrix multiplication intrinsics and
conversion operations, retaining existing opcodes and compiler builtins.
This commit adds support for these types to the MLIR wrappers around
such operations, ensuring that the OCP types aren't used to generate
those builtins on hardware that doesn't expect that format and,
conversely, to ensure that the pre-OCP formats aren't used on new
hardware.
---------
Signed-off-by: Mirza Halilcevic <mirza.halilcevic@amd.com>
Co-authored-by: Paul Fuqua <pf@acm.org>
Co-authored-by: Krzysztof Drewniak <Krzysztof.Drewniak@amd.com>
1. Extend the gfx12 FP8 support to allow mixed-type intrinsics (since
they've been added), creating limited mixed-type support that mirrors
MFMA
2. Extend the `amdgpu.wmma` intrinsic lowering to correctly handle
shorter vectors because gfx12 now has instructions that logically take a
4xi8, or, as far as LLVM's concerned, an i32. Similarly, there are 4xi4
inputs, which are an i16 (that must be zero-extended to i32).
3. Correctly handle the ambiguities in the int4 intrinsics on gfx12,
which can either be 16x16x16 or 16x16x32
4. Add tests showing all WMMAs being lowered the way gfx12 expects
(mirroring LLVM's tests)
5. Add a verifier to prevent emitting illegal instructions on gfx12.
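The packing described in item 2 can be modelled as below. This is a sketch that assumes element 0 lands in the least significant bits, as a bitcast does on a little-endian target; the helper names are hypothetical and the code is not the lowering itself.

```python
from typing import Sequence


def pack_4xi8(values: Sequence[int]) -> int:
    """Pack four 8-bit values into one 32-bit word, element 0 in the low byte."""
    if len(values) != 4:
        raise ValueError("expected exactly 4 elements")
    word = 0
    for i, value in enumerate(values):
        word |= (value & 0xFF) << (8 * i)
    return word


def pack_4xi4_zext(values: Sequence[int]) -> int:
    """Pack four 4-bit values into 16 bits, then zero-extend to 32 bits."""
    if len(values) != 4:
        raise ValueError("expected exactly 4 elements")
    half = 0
    for i, value in enumerate(values):
        half |= (value & 0xF) << (4 * i)
    # Zero extension: the upper 16 bits of the 32-bit result stay 0.
    return half & 0xFFFF
```

Zero extension matters for negative inputs: four 4-bit values of `-1` pack to `0xFFFF`, and the upper half of the 32-bit operand must remain zero rather than being sign-filled.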
This commit adds support for casting memrefs into fat raw buffer
pointers to the AMDGPU dialect.
Fat raw buffer pointers (or, in LLVM terms, ptr addrspace(7)) allow
encapsulating a buffer descriptor (as produced by the make.buffer.rsrc
intrinsic or provided from some API) into a pointer that supports
ordinary pointer operations like load or store. This allows people to
take advantage of the additional semantics that buffer_load and similar
instructions provide without forcing the use of entirely separate
amdgpu.raw_buffer_* operations.
Operations on fat raw buffer pointers are translated to the
corresponding LLVM intrinsics by the backend.
This commit also defines a #amdgpu.address_space<>
attribute so that AMDGPU-specific memory spaces can be represented. Only
#amdgpu.address_space<fat_raw_buffer> will work correctly with the
memref dialect, but the other possible address spaces are included for
completeness.
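For orientation, the mapping from these attribute values to LLVM address-space numbers can be sketched as below. Only the value 7 for fat raw buffers is stated in this message; the entries for 8 and 9 are assumptions taken from the address spaces listed in the LLVM AMDGPU backend documentation, and the Python names are illustrative.

```python
from enum import IntEnum


class AmdgpuAddressSpace(IntEnum):
    """Illustrative model of #amdgpu.address_space<> values."""

    FAT_RAW_BUFFER = 7  # ptr addrspace(7), as described above
    BUFFER_RSRC = 8  # assumption: buffer resource address space
    FAT_STRUCTURED_BUFFER = 9  # assumption: strided buffer pointer address space


def works_with_memref(space: AmdgpuAddressSpace) -> bool:
    """Per the description above, only fat raw buffers work correctly with memref."""
    return space is AmdgpuAddressSpace.FAT_RAW_BUFFER
```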
---------
Co-authored-by: Jakub Kuderski <kubakuderski@gmail.com>
Co-authored-by: Prashant Kumar <pk5561@gmail.com>
Defined an AMDGPU DPP operation in MLIR to represent DPP semantics. Introduced
a new enumeration attribute for different permutations and allowed for
different types of arguments. Implemented constant attribute handling
for ROCDL::DPPMovOp operation. The operation now correctly accepts
constant attributes for dppCtrl, rowMask, bankMask, boundCtrl, and
passes them to the corresponding LLVM intrinsic.
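As a rough illustration of what one such permutation does, the sketch below models a quad-permutation control over a list of per-lane values. It is a simplified model under stated assumptions: it ignores rowMask, bankMask, boundCtrl, and the `old` value, and `dpp_quad_perm` is a hypothetical helper, not an API of the dialect.

```python
from typing import List, Sequence


def dpp_quad_perm(lanes: Sequence[int], perm: Sequence[int]) -> List[int]:
    """Within each group of 4 lanes, lane i reads the value of lane perm[i]."""
    if len(lanes) % 4 != 0:
        raise ValueError("lane count must be a multiple of 4")
    if len(perm) != 4 or any(not 0 <= p < 4 for p in perm):
        raise ValueError("perm must hold 4 selectors in the range 0..3")
    result: List[int] = []
    for lane in range(len(lanes)):
        quad_base = lane - lane % 4  # first lane of this group of 4
        result.append(lanes[quad_base + perm[lane % 4]])
    return result
```

With `perm = [1, 0, 3, 2]`, neighbouring lanes exchange values pairwise inside every quad.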
This implementation has a number of issues and ultimately does not work
on gfx9.
* It does not reduce bank conflicts with wide memory accesses.
* It does not correctly account for when LDS bank conflicts occur on
amdgpu.
* The implementation is too fragile to be used on real-world code. For
example, the code bails out on any `memref.subview` in the root op, even
when the subview is not a user of any of the `memref.alloc` ops.
I do not see how these can be easily fixed, therefore I think it's
better to delete this code.