fixes
https://github.com/llvm/llvm-project/pull/149879#issuecomment-3117145615
Note: this happens because overload resolution can't disambiguate between
`mlir::DenseI64ArrayAttr` and `llvm::ArrayRef<int64_t>` **for the value
0**, since the literal 0 is also a valid null-pointer constant; the
ambiguity only occurs with the value 0.
The main motivation is to pass vmcnt/expcnt/lgkmcnt values directly
(similar to the asm format) and delegate architecture-dependent
bitpacking to the amdgpu->rocdl lowering.
---------
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
New gfx950 MFMA allows bf16 operands.
c0cc81cdc0/llvm/include/llvm/IR/IntrinsicsAMDGPU.td (L3434)
When running `amdgpu-to-rocdl`, the current logic always converts bf16
to i16, which fails to compile for the newer bf16 MFMAs, e.g.
`v_mfma_f32_16x16x32bf16`.
The backend expects the bf16 type for the operands of those newer MFMAs.
This patch fixes that.
CC: @krzysz00 @dhernandez0 @giuseros @antiagainst @kuhar
This PR fixes a bug in GatherToLDSOpLowering: we were taking the
MemRefType of the source for the destination. Additionally, some related
typos are corrected.
CC: @krzysz00 @umangyadav @lialan
This PR adds support for the scaled conversion intrinsics for fp4 and
fp6 types so that they can be targeted by a future amdgpu dialect op or
used directly.
Additionally, this patch refactors the copy-paste-heavy fp8 versions of
these scaled conversion intrinsics with tablegen `foreach` loops, and
fixes the fact that certain immargs weren't being stored as attributes.
Note that some of the MLIR-level tests for those scaled fp8 intrinsics
had incorrect return types, which have been fixed.
(Note that while the operations have a known return type, the IR format
still prints that type for clarity).
Now that MLIR accepts nuw and nusw on getelementptr, this patch emits
the inbounds and nuw attributes when lowering memref load and store
operations to LLVM.
This patch also strengthens the memref.load and memref.store specs
regarding undefined behaviour during lowering.
This patch also moves the |rewriter| parameter of getStridedElementPtr
earlier, so that LLVM::GEPNoWrapFlags can be added at the end with a
default value, grouped together with the other operations' parameters.
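As a rough sketch (the printed flag syntax is approximate), a plain `memref.load` now lowers along these lines:

```mlir
// Input: a load from a statically shaped, trivially laid out memref.
%v = memref.load %m[%i] : memref<16xf32>

// Lowered form: the GEP now carries the inbounds and nuw flags.
%p = llvm.getelementptr inbounds|nuw %base[%i] : (!llvm.ptr, i64) -> !llvm.ptr, f32
%r = llvm.load %p : !llvm.ptr -> f32
```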
Signed-off-by: Lin, Peiyong <linpyong@gmail.com>
This PR adds an amdgcn_load_to_lds intrinsic that abstracts over loads to
LDS from global (address space 1) pointers and buffer fat pointers
(address space 7), since they use the same API and "gather from a
pointer to LDS" is something of an abstract operation.
This commit adds the intrinsic and its lowerings for addrspaces 1 and 7,
and updates the MLIR wrappers to use it (loosening up the restrictions
on loads to LDS along the way to match the ground truth from target
features).
It also plumbs the intrinsic through to clang.
Create a wrapper around the new scaled MFMAs that operate on specific
element types and tile sizes.
See [Issue](https://github.com/iree-org/iree/issues/20616).
---------
Signed-off-by: Muzammiluddin Syed <muzasyed@amd.com>
Repack `amdgpu.swizzle_bitmode` arguments and lower it to
`rocdl.ds_swizzle`.
The repacking logic is as follows:
* `sizeof(arg) < sizeof(i32)`: bitcast to an integer, zext to i32, then
trunc and bitcast back.
* `sizeof(arg) == sizeof(i32)`: just bitcast to i32 and back, if not
already i32.
* `sizeof(arg) > sizeof(i32)`: bitcast to `vector<Nxi32>`, extract the
individual elements, do a series of `rocdl.ds_swizzle`s, then recompose
the vector and bitcast back.
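For example, the first case looks roughly like this (a sketch; the `rocdl.ds_swizzle` operand syntax is approximate):

```mlir
// sizeof(f16) < sizeof(i32): widen, swizzle as i32, then narrow back.
%b = llvm.bitcast %arg : f16 to i16
%z = llvm.zext %b : i16 to i32
%s = rocdl.ds_swizzle %z, %offset : (i32, i32) -> i32
%t = llvm.trunc %s : i32 to i16
%r = llvm.bitcast %t : i16 to f16
```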
Added repacking logic to LLVM utils so it can be used elsewhere. I'm
planning to use it for `gpu.shuffle` later.
Defining a new `amdgpu.global_load` op, a thin wrapper around the ROCDL
`global_load_lds` intrinsic, along with its lowering logic to
`rocdl.global.load.lds`.
This commit extends the lowering of amdgpu.mfma to handle the new
double-rate MFMAs in gfx950 and adds tests for these operations.
It also adds support for MFMAs on small floats (f6 and f4), which are
implemented using the "scaled" MFMA intrinsic with a scale value of 0 in
order to have an unscaled MFMA.
This commit does not add a `amdgpu.scaled_mfma` operation, as that is
future work.
---------
Co-authored-by: Jakub Kuderski <kubakuderski@gmail.com>
- Add packed conversions fp8/bf8->bf16 for gfx950 and fp8/bf8->fp32 for
gfx942 in ROCDL dialect
- Update amdgpu.ext_packed_fp8 lowering to use ROCDL packed fp8/bf8->f32
conversions for vector target types and ROCDL scalar fp8/bf8->fp32
conversions for the scalar target type.
---------
Co-authored-by: Jungwook Park <jungwook.park@amd.com>
(Continuing from #106160)
This PR addresses remaining review comments from the original PR.
Original PR Description
---
Upcoming hardware (gfx12 and some future gfx9) will support the OCP
8-bit float formats for their matrix multiplication intrinsics and
conversion operations, retaining existing opcodes and compiler builtins.
This commit adds support for these types to the MLIR wrappers around
such operations, ensuring that the OCP types aren't used to generate
those builtins on hardware that doesn't expect that format and,
conversely, to ensure that the pre-OCP formats aren't used on new
hardware.
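For reference, the OCP formats correspond to MLIR's `f8E4M3FN`/`f8E5M2` types, while the pre-OCP gfx940 formats are `f8E4M3FNUZ`/`f8E5M2FNUZ`; for example, an extension on an OCP type:

```mlir
// Legal only on chipsets that implement the OCP encodings (e.g. gfx12).
%f = arith.extf %x : f8E4M3FN to f32
```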
---------
Signed-off-by: Mirza Halilcevic <mirza.halilcevic@amd.com>
Co-authored-by: Paul Fuqua <pf@acm.org>
Co-authored-by: Krzysztof Drewniak <Krzysztof.Drewniak@amd.com>
1. Extend the gfx12 FP8 support to allow mixed-type intrinsics (since
they've been added), creating limited mixed-type support that mirrors
MFMA.
2. Extend the `amdgpu.wmma` intrinsic lowering to correctly handle
shorter vectors, because gfx12 now has instructions that logically take
a 4xi8 or, as far as LLVM is concerned, an i32. Similarly, there are
4xi4 inputs, which are an i16 (that must be zero-extended to i32); see
the sketch after this list.
3. Correctly handle the ambiguities in the int4 intrinsics on gfx12,
which can be either 16x16x16 or 16x16x32.
4. Add tests showing all WMMAs being lowered the way gfx12 expects
(mirroring LLVM's tests).
5. Add a verifier to prevent emitting illegal instructions on gfx12.
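Point 2 amounts to the following packing (a minimal sketch; the value names are hypothetical):

```mlir
// Pack a vector<4xi4> WMMA input into the i32 LLVM expects: reinterpret
// its 16 bits as an i16, then zero-extend.
%bits = llvm.bitcast %in : vector<4xi4> to i16
%word = llvm.zext %bits : i16 to i32
```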
This commit adds support to the AMDGPU dialect for casting memrefs into
fat raw buffer pointers.
Fat raw buffer pointers - or, in LLVM terms, ptr addrspace(7) - allow
encapsulating a buffer descriptor (as produced by the make.buffer.rsrc
intrinsic or provided by some API) in a pointer that supports ordinary
pointer operations like load or store. This allows people to take
advantage of the additional semantics that buffer_load and similar
instructions provide without forcing the use of entirely separate
amdgpu.raw_buffer_* operations.
Operations on fat raw buffer pointers are translated to the
corresponding LLVM intrinsics by the backend.
This commit also defines a #amdgpu.address_space<> attribute so that
AMDGPU-specific memory spaces can be represented. Only
#amdgpu.address_space<fat_raw_buffer> will work correctly with the
memref dialect, but the other possible address spaces are included for
completeness.
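A sketch of the intended usage (exact assembly syntax approximate):

```mlir
// Wrap a buffer descriptor around the source memref, then use ordinary
// memref operations on the result.
%fat = amdgpu.fat_raw_buffer_cast %src
    : memref<64xf32> to memref<64xf32, #amdgpu.address_space<fat_raw_buffer>>
%v = memref.load %fat[%i] : memref<64xf32, #amdgpu.address_space<fat_raw_buffer>>
```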
---------
Co-authored-by: Jakub Kuderski <kubakuderski@gmail.com>
Co-authored-by: Prashant Kumar <pk5561@gmail.com>
During my previous cleanup (#127403), I did not notice that we defined a
type alias for the base class. This type alias allows us to use the
shorter form Base::Base, and this PR switches to that.
`let constructor` is deprecated since the TableGen backend emits most
of the glue logic to build a pass. This PR retires the td method for
most passes in the Conversion directory (one more pass is needed for
the rest).
There were a bunch of spots in ROCDL.td where we were defining our own
llvmBuilder call which could have been generated using the default
built-in one on LLVM_IntrOpBase.
This commit cleans up such usages in the interest of potentially
enabling ROCDL import in the future and of making best practices more
obvious.
The one breaking change is renaming WaitcntOp to SWaitcntOp, which
should have minimal impact.
This patch fixes several bugs in the lowering of AMDGPU raw buffer
operations. These bugs include:
- Incorrectly handling the offset of the memref, causing errors when
using subviews.
- Using the MaximumOp (float specific op) to calculate the number of
records.
- The number of records in the static shape case.
- The lowering when the index bitwidth is i64.
Furthermore, this patch also switches to using MLIR's data layout to get
the type size.
---------
Co-authored-by: Jakub Kuderski <kubakuderski@gmail.com>
1. We can use `getNumElements()` only for memrefs with a trivial layout.
2. Buffer ops expect sizes as i32, but descriptor values can be either
i32 or i64, so add the appropriate casts. This implementation is not
ideal, as it can overflow, but it's still better than generating broken
IR; see the sketch below.
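For example, the i64 case becomes roughly:

```mlir
// Truncate a 64-bit descriptor size to the i32 the buffer ops expect;
// this can overflow for buffers larger than 4 GiB.
%size32 = llvm.trunc %size64 : i64 to i32
```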
Found by inspecting AMDGPU assembly - so the arithmetic ops created
there were definitely making their way into the target ISA. An
`LLVM::BitcastOp` seems equivalent, and evaporates as expected in the
target asm.
Along the way, I thought that this helper function `mfmaConcatIfNeeded`
could be renamed to `convertMFMAVectorOperand` to better convey its
contract; so I don't need to think about whether a bitcast is a
legitimate "concat" :-)
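The replacement is a single reinterpreting cast along these lines (a sketch; the types are illustrative):

```mlir
// Reinterpret the MFMA operand as packed i32s; no arithmetic survives
// into the target ISA.
%packed = llvm.bitcast %operand : vector<4xf16> to vector<2xi32>
```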
---------
Signed-off-by: Benoit Jacob <jacob.benoit.1@gmail.com>
This commit marks the type converter in `populate...` functions as
`const`. This is useful for debugging.
Patterns already take a `const` type converter. However, some
`populate...` functions not only add new patterns but also register
additional type conversion rules. That makes it difficult to find the
place where a type conversion was added in the code base. With this
change, all `populate...` functions that only populate patterns now take
a `const` type converter. Programmers can then conclude from the
function signature that these functions do not register any new type
conversion rules.
Also some minor cleanups around the 1:N dialect conversion
infrastructure, which did not always pass the type converter as a
`const` object internally.
The AMDGPU backend now implements LLVM's `bfloat` type. Therefore, we no
longer need to type convert MLIR's `bf16` to `i16` during lowerings to
ROCDL.
As a result of this change, we discovered that, while the code for MFMA
and WMMA intrinsics was mostly prepared for this change, we were failing
to bitcast the bf16 results of WMMA operations out from the i16 they're
natively represented as. This commit also fixes that issue.
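The missing step was essentially one cast on the result (a sketch; `%wmma` stands for the raw intrinsic result):

```mlir
// Bitcast the intrinsic's natively-i16 WMMA result back to bf16.
%res = llvm.bitcast %wmma : vector<8xi16> to vector<8xbf16>
```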
---------
Co-authored-by: Jakub Kuderski <kubakuderski@gmail.com>
* Fix a bug introduced by the Chipset refactoring in #107720 where
atomics emulation for adds was mistakenly applied to gfx11+
* Add the case needed for gfx11+ atomic emulation, namely that gfx11
doesn't support atomically adding a v2f16 or v2bf16, thus requiring
MLIR-level legalization for buffer intrinsics that attempt to do such an
addition
* Add tests, including tests for gfx11 atomic emulation
Co-authored-by: Manupa Karunaratne <manupa.karunaratne@amd.com>
Extend the lowering of atomic.fadd to support the v2f16 variant
available on some AMDGPU chips.
Re-lands #108238 (and addresses review comments from there)
Co-authored-by: Giuseppe Rossini <giuseppe.rossini@amd.com>
Extend the lowering of atomic.fadd to support the v2f16 variant
available on some AMDGPU chips.
Co-authored-by: Giuseppe Rossini <giuseppe.rossini@amd.com>
Update the Chipset struct to follow the `IsaVersion` definition from
llvm's `TargetParser`. This is a follow up to
https://github.com/llvm/llvm-project/pull/106169#discussion_r1733955012.
* Add the stepping version. Note: This may break downstream code that
compares against the minor version directly.
* Use comparisons with full Chipset version where possible.
Note that we can't use the code in `TargetParser` directly because the
chipset utility sits outside of `mlir/Target`, the layer that re-exports
llvm's target library.
Defined an AMDGPU DPP operation in MLIR to represent its semantics.
Introduced a new enumeration attribute for the different permutations
and allowed for different types of arguments. Implemented constant
attribute handling for the ROCDL::DPPMovOp operation. The operation now
correctly accepts constant attributes for dppCtrl, rowMask, bankMask,
and boundCtrl, and passes them to the corresponding LLVM intrinsic.
On some architectures (currently gfx90a, gfx94*, and gfx10**), we can
implement an LDS barrier using compiler intrinsics instead of inline
assembly, improving optimization possibilities and decreasing the
fragility of the underlying code.
Other AMDGPU chipsets continue to require inline assembly to implement
this barrier because, by default, the LLVM backend will insert waits on
global memory (s_waitcnt vmcnt(0)) before barriers in order to ensure
memory watchpoints set by debuggers work correctly.
Use of amdgpu.lds_barrier on these architectures imposes a tradeoff
between debuggability and performance. The documentation, as well as the
generated inline assembly, has been updated to explicitly call
attention to this fact.
For chipsets that did not require the inline assembly hack, we move to
the s.waitcnt and s.barrier intrinsics, which have been added to the
ROCDL dialect. The magic constants used as an argument to the waitcnt
intrinsic can be derived from
llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp
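On those chipsets the lowering is, schematically (the waitcnt immediate shown is illustrative, not one of the real per-architecture constants):

```mlir
// Wait only on LDS traffic rather than on global memory, then barrier.
rocdl.s.waitcnt 49279
rocdl.s.barrier
```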
Define operations that wrap gfx940's new operations for converting
between f32 and registers containing packed sets of four 8-bit floats.
Define rocdl operations for the intrinsics and an AMDGPU dialect
wrapper around them (to account for the fact that MLIR distinguishes
the two float formats at the type level but that the LLVM IR does
not).
Define an ArithToAMDGPU pass, meant to run before conversion to LLVM,
that replaces relevant calls to arith.extf and arith.truncf with the
packed operations in the AMDGPU dialect. Note that the conversion
currently only handles scalars and vectors of rank <= 1, as we do not
have a use case for multi-dimensional vector support right now.
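For example, the pass performs rewrites along these lines (the `amdgpu.ext_packed_fp8` syntax is approximate):

```mlir
// Before: a scalar extension from the gfx940-era fp8 format.
%y = arith.extf %x : f8E4M3FNUZ to f32
// After: the packed AMDGPU wrapper, reading element 0 of the source.
%y = amdgpu.ext_packed_fp8 %x[0] : f8E4M3FNUZ to f32
```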
Reviewed By: jsjodin
Differential Revision: https://reviews.llvm.org/D152457
The AMDGPU backend now has buffer resource intrinsics that take a ptr
addrspace(8) instead of a vector<4xi32>, improving LLVM's ability to
reason about their memory behavior. This commit moves MLIR to these
new functions.
Reviewed By: jsjodin
Differential Revision: https://reviews.llvm.org/D157053
Much of the previous AMDGPU dialect code was incorrect in the presence
of the bf16 type (when lowered to LLVM's bfloat), as it was developed
in a setting that ran a custom bf16-to-i16 pass before LLVM lowering.
An overall effect of this patch is that you should run
--arith-emulate-unsupported-floats="source-types=bf16 target-type=f32"
on your GPU module before calling --convert-gpu-to-rocdl if your code
performs bf16 arithmetic.
While LLVM now supports software bfloat, initial experiments showed
that using this support on AMDGPU inserted a large number of
conversions around loads and stores which had substantial performance
impacts. Furthermore, all of the native AMDGPU operations on bf16
types (like the WMMA operations) operate on 16-bit integers instead of
the bfloat type.
First, we make the following changes to preserve compatibility once
the LLVM bfloat type is reenabled.
1. The matrix multiplication operations (MFMA and WMMA) will bitcast
bfloat vectors to i16 vectors.
2. Buffer loads and stores will operate on the relevant integer
datatype and then cast to bfloat if needed.
Second, we add type conversions to convert bf16 and vectors of it to
equivalent i16 types.
Third, we add the bfloat <-> f32 expansion patterns to the set of
operations run before the main LLVM conversion so that MLIR's
implementation of these conversion routines is used.
Finally, we extend the "floats treated as integers" support in the
LLVM exporter to handle types other than fp8.
We also fix a bug in the unsupported floats emulation where it tried
to operate on `arith.bitcast` due to an oversight.
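Changes 1 and 2 amount to, schematically:

```mlir
// bf16 data is carried as its i16 bit pattern (sketch; types illustrative).
%opnd = llvm.bitcast %a : vector<4xbf16> to vector<4xi16>   // before an MFMA/WMMA
%back = llvm.bitcast %ld : vector<2xi16> to vector<2xbf16>  // after a buffer load
```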
Reviewed By: rsuderman
Differential Revision: https://reviews.llvm.org/D156361