16 Commits

Author SHA1 Message Date
Krzysztof Drewniak
b05c15259b [mlir][AMDGPU] Improve amdgpu.lds_barrier, add warnings (#77942)
On some architectures (currently gfx90a, gfx94*, and gfx10**), we can
implement an LDS barrier using compiler intrinsics instead of inline
assembly, improving optimization possibilities and decreasing the
fragility of the underlying code.

Other AMDGPU chipsets continue to require inline assembly to implement
this barrier because, by default, the LLVM backend inserts waits on
global memory (s_waitcnt vmcnt(0)) before barriers to ensure that
memory watchpoints set by debuggers work correctly.

Use of amdgpu.lds_barrier, on these architectures, imposes a tradeoff
between debuggability and performance. The documentation, as well as the
generated inline assembly, have been updated to explicitly call
attention to this fact.

For chipsets that did not require the inline assembly hack, we move to
the s.waitcnt and s.barrier intrinsics, which have been added to the
ROCDL dialect. The magic constants used as an argument to the waitcnt
intrinsic can be derived from
llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp
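
As an illustration of where such a constant can come from, the following
minimal Python sketch packs the three wait counters into one immediate.
The bit layout is an assumption (the gfx9-style layout: vmcnt in bits 3:0
and 15:14, expcnt in bits 6:4, lgkmcnt in bits 11:8). Other hardware
generations arrange these fields differently, so AMDGPUBaseInfo.cpp
remains the authoritative source. The name `encode_waitcnt_gfx9` is
hypothetical and is not part of LLVM or MLIR.

```python
def encode_waitcnt_gfx9(vmcnt: int, expcnt: int, lgkmcnt: int) -> int:
    """Pack wait counters into an s_waitcnt immediate (ASSUMED gfx9 layout)."""
    if not (0 <= vmcnt <= 63 and 0 <= expcnt <= 7 and 0 <= lgkmcnt <= 15):
        raise ValueError("counter out of range for the assumed layout")
    vm_lo = vmcnt & 0xF         # low four bits of vmcnt go to bits 3:0
    vm_hi = (vmcnt >> 4) & 0x3  # high two bits of vmcnt go to bits 15:14
    return vm_lo | (expcnt << 4) | (lgkmcnt << 8) | (vm_hi << 14)


# Wait until lgkmcnt (the counter that covers LDS accesses) reaches 0,
# leaving vmcnt and expcnt at their maximum values, i.e. not waiting on them.
LDS_ONLY = encode_waitcnt_gfx9(vmcnt=63, expcnt=7, lgkmcnt=0)
```

Leaving a counter at its maximum value means "do not wait on this
counter", which is how a barrier can wait on LDS traffic without also
waiting on global memory.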
2024-03-11 10:06:49 -05:00
Krzysztof Drewniak
2ebd633f14 [mlir][AMDGPU] Add packed 8-bit float conversion ops and lowering
Define operations that wrap the gfx940's new operations for converting
between f32 and registers containing packed sets of four 8-bit floats.

Define rocdl operations for the intrinsics and an AMDGPU dialect
wrapper around them (to account for the fact that MLIR distinguishes
the two float formats at the type level but that the LLVM IR does
not).

Define an ArithToAMDGPU pass, meant to run before conversion to LLVM,
that replaces relevant calls to arith.extf and arith.truncf with the
packed operations in the AMDGPU dialect. Note that the conversion
currently only handles scalars and vectors of rank <= 1, as we do not
have a use case for multi-dimensional vector support right now.

Reviewed By: jsjodin

Differential Revision: https://reviews.llvm.org/D152457
2023-09-28 14:44:16 +00:00
Krzysztof Drewniak
bfa501b892 [mlir][AMDGPU] Move to new buffer resource intrinsics
The AMDGPU backend now has buffer resource intrinsics that take a ptr
addrspace(8) instead of a vector<4xi32>, improving LLVM's ability to
reason about their memory behavior. This commit moves MLIR to these
new functions.

Reviewed By: jsjodin

Differential Revision: https://reviews.llvm.org/D157053
2023-09-22 19:48:06 +00:00
Krzysztof Drewniak
51b65d0895 [mlir][AMDGPU] Improve BF16 handling through AMDGPU compilation
Many previous sets of AMDGPU dialect code have been incorrect in the
presence of the bf16 type (when lowered to LLVM's bfloat) as they were
developed in a setting that ran a custom bf16-to-i16 pass before LLVM
lowering.

An overall effect of this patch is that you should run
--arith-emulate-unsupported-floats="source-types=bf16 target-type=f32"
on your GPU module before calling --convert-gpu-to-rocdl if your code
performs bf16 arithmetic.

While LLVM now supports software bfloat, initial experiments showed
that using this support on AMDGPU inserted a large number of
conversions around loads and stores which had substantial performance
impacts. Furthermore, all of the native AMDGPU operations on bf16
types (like the WMMA operations) operate on 16-bit integers instead of
the bfloat type.

First, we make the following changes to preserve compatibility once
the LLVM bfloat type is reenabled.
1. The matrix multiplication operations (MFMA and WMMA) will bitcast
bfloat vectors to i16 vectors.
2. Buffer loads and stores will operate on the relevant integer
datatype and then cast to bfloat if needed.

Second, we add type conversions to convert bf16 and vectors of it to
equivalent i16 types.

Third, we add the bfloat <-> f32 expansion patterns to the set of
operations run before the main LLVM conversion so that MLIR's
implementation of these conversion routines is used.

Finally, we extend the "floats treated as integers" support in the
LLVM exporter to handle types other than fp8.

We also fix a bug in the unsupported floats emulation where it tried
to operate on `arith.bitcast` due to an oversight.
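
The bf16 handling described above (treating bf16 values as 16-bit integer
bit patterns and expanding bfloat <-> f32 conversions in software) rests
on a simple bit-level relationship: a bf16 value is the upper 16 bits of
the corresponding f32. The Python sketch below illustrates that
relationship with round-to-nearest-even truncation. It is not MLIR's
expansion pattern; the function names are hypothetical and the NaN
handling is one reasonable choice among several.

```python
import struct


def f32_to_bits(x: float) -> int:
    """Reinterpret a value (rounded to f32) as its 32-bit pattern."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_f32(bits: int) -> float:
    """Reinterpret a 32-bit pattern as an f32 value."""
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def bf16_bits_to_f32(bits16: int) -> float:
    """Extend bf16 (held as an i16 pattern) to f32: shift into the top half."""
    return bits_to_f32((bits16 & 0xFFFF) << 16)


def f32_to_bf16_bits(x: float) -> int:
    """Truncate f32 to a bf16 bit pattern using round-to-nearest-even."""
    bits = f32_to_bits(x)
    is_nan = (bits & 0x7F800000) == 0x7F800000 and (bits & 0x007FFFFF) != 0
    if is_nan:
        # Keep the sign and exponent, and force a mantissa bit so the
        # result stays a NaN after the low bits are dropped.
        return ((bits >> 16) | 0x0040) & 0xFFFF
    # Add 0x7FFF plus the lowest kept bit so that ties round to even.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF
```

Because the extension direction is a plain shift, storing bf16 data as
i16 loses no information; only the f32-to-bf16 direction rounds.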

Reviewed By: rsuderman

Differential Revision: https://reviews.llvm.org/D156361
2023-08-17 18:31:28 +00:00
Giuseppe Rossini
4b3eaee270 [mlir][AMDGPU] Define wrappers for WMMA matrix ops
Wave Matrix Multiply Accumulate (WMMA) is the instruction to accelerate
matrix multiplication on RDNA3 architectures.  LLVM already provides a
set of intrinsics to generate wmma instructions. This change uses those
intrinsics to enable the feature in MLIR.

Reviewed By: krzysz00

Differential Revision: https://reviews.llvm.org/D152451
2023-07-20 18:38:35 +00:00
Giuseppe Rossini
20c66a0c66 [AMDGPU] Add basic support for gfx11xx
This patch fixes a minor issue in AMDGPUToROCDL to add gfx11 support in MLIR

Reviewed By: krzysz00

Differential Revision: https://reviews.llvm.org/D152450
2023-06-12 17:06:36 +00:00
Krzysztof Drewniak
98c1104d41 [mlir][AMDGPU] Define atomic compare-and-swap for raw buffers
This commit adds the buffer cmpswap intrinsic to the ROCDL dialect and
its corresponding AMDGPU dialect wrappers.

Reviewed By: nirvedhmeshram

Differential Revision: https://reviews.llvm.org/D148722
2023-05-03 21:11:20 +00:00
giuseros
82ac02e4a8 Add scalar support for amdgpu.raw_buffer_{load,store}
Allow scalars to be loaded and stored via amdgpu.raw_buffer_{load,store}

Reviewed By: krzysz00

Differential Revision: https://reviews.llvm.org/D146413
2023-03-20 20:19:20 +00:00
Manupa Karunaratne
584f64365a [MLIR][AMDGPU][ROCDL] Adding raw.buffer.atomic.fmax/smax/umin support
This commit adds atomic fmax/smax/umin support to the AMDGPU dialect
and to the dialects it depends on, so that these operations can be
lowered.

Reviewed By: krzysz00

Differential Revision: https://reviews.llvm.org/D144097
2023-02-28 16:58:35 +00:00
Krzysztof Drewniak
22f0c7a451 [mlir][AMDGPU] 8-bit float usage in the AMDGPU dialect
Upcoming AMD hardware will include functions that accept 8-bit floats.
Specifically, there are MFMA instructions that accept 8-bit floats,
either using the same or mixed formats. This patch adds MLIR wrappers
for these intrinsics and explicitly adds support for 8-bit floats in
the gpu-to-rocdl conversion by way of amdgpu-to-rocdl.

Since LLVM does not have f8 types, when targeting LLVM for compilation
on an AMD GPU, both f8 types used on AMD hardware (f8E5M2FNUZ and
f8E4M3FNUZ) are rewritten to i8.

This patch also relaxes the restriction that the types of both source
operands to an amdgpu.mfma instruction match exactly, as this is not
necessarily required for the bf8 (f8E5M2FNUZ) and fp8 (f8E4M3FNUZ)
instructions. In addition, since the buffer_{load,store} operations
maintain a whitelist of permitted types, we add the relevant f8 types
to that list.

This patch does not add any implementations of arithmetic operations
for f8 types.

Reviewed By: jakeh-gc

Differential Revision: https://reviews.llvm.org/D143956
2023-02-15 16:46:08 +00:00
Krzysztof Drewniak
c55b41d519 [mlir][AMDGPU] Define amdgpu.mfma operator
The amdgpu.mfma operator is a wrapper around the Matrix Fused Multiply
Add (MFMA) instructions on some AMD GPUs (the CDNA-based MI-* cards).

This interface allows for selecting the operation to be performed by
specifying the dimensions of the multiplication to be performed and
any additional attributes (such as whether to use reduced-precision
floating-point math) that are needed to select the relevant mfma
instruction and set its parameters.

Reviewed By: ThomasRaoux, nirvedhmeshram

Differential Revision: https://reviews.llvm.org/D132956
2022-08-31 21:06:12 +00:00
Krzysztof Drewniak
6329562249 [mlir][AMDGPU] Explicitly truncate memory addresses in buffer ops
As a precaution, truncate memory addresses passed to kernels to 48 bits,
since bits 48-63 of the buffer descriptor are used for the stride field
and, on gfx10, to control swizzling.
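
The truncation itself is a single mask. The Python sketch below shows the
arithmetic only; the names `ADDRESS_MASK` and `truncate_address` are
hypothetical and do not appear in the lowering.

```python
ADDRESS_BITS = 48
ADDRESS_MASK = (1 << ADDRESS_BITS) - 1  # 0x0000_FFFF_FFFF_FFFF


def truncate_address(ptr: int) -> int:
    """Keep bits 0-47 of a 64-bit address and clear bits 48-63."""
    return ptr & ADDRESS_MASK
```

Clearing the upper bits ensures that stray high bits in a pointer cannot
leak into the descriptor fields that share those bit positions.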

Reviewed By: ThomasRaoux

Differential Revision: https://reviews.llvm.org/D131016
2022-08-04 19:42:33 +00:00
Krzysztof Drewniak
bc61cc9a2d [mlir][AMDGPU] Add lds_barrier op
The lds_barrier op allows workgroups to wait at a barrier for
operations to/from their local data store (LDS) to complete without
incurring the performance penalties of a full memory fence.

Reviewed By: nirvedhmeshram

Differential Revision: https://reviews.llvm.org/D129522
2022-07-14 20:45:26 +00:00
Krzysztof Drewniak
db590549a9 [mlir][AMDGPU] Use the correct values for OOB_SELECT on gfx10
Differential Revision: https://reviews.llvm.org/D129320
2022-07-07 21:23:38 +00:00
Krzysztof Drewniak
cab44c515c [mlir][AMDGPU] Add --chipset option to AMDGPUToROCDL
Because the buffer descriptor structure (the V#) has no backwards-compatibility
guarantees, and since said guarantees have been violated in practice
(see https://github.com/llvm/llvm-project/issues/56323 ), and since
the `targetIsRDNA` attribute isn't something that higher-level clients can set
in general, make the lowering of the amdgpu dialect to rocdl take a --chipset
option.

Note that this option is a string because adding a parser for the Chipset
struct to llvm::cl wasn't working out.
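
To illustrate what parsing such a string involves, here is a minimal
Python sketch. The field layout is an assumption: everything after "gfx"
except the last two characters is read as a decimal major version, and
the last two characters are read as hexadecimal minor and stepping
digits. The `Chipset` tuple and `parse_chipset` function are hypothetical
and do not reproduce the actual MLIR struct or parser.

```python
from typing import NamedTuple


class Chipset(NamedTuple):
    major: int
    minor: int
    stepping: int


def parse_chipset(name: str) -> Chipset:
    """Parse names such as 'gfx90a' or 'gfx1100' (ASSUMED layout)."""
    prefix = "gfx"
    if not name.startswith(prefix) or len(name) < len(prefix) + 3:
        raise ValueError(f"malformed chipset name: {name!r}")
    digits = name[len(prefix):]
    try:
        major = int(digits[:-2], 10)    # leading decimal digits
        minor = int(digits[-2], 16)     # second-to-last hex digit
        stepping = int(digits[-1], 16)  # last hex digit
    except ValueError:
        raise ValueError(f"malformed chipset name: {name!r}") from None
    return Chipset(major, minor, stepping)
```

For example, under this assumed layout `parse_chipset("gfx90a")` yields
major 9, minor 0, and stepping 10.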

Reviewed By: herhut

Differential Revision: https://reviews.llvm.org/D129228
2022-07-07 14:58:13 +00:00
Krzysztof Drewniak
f1f05a91ca [MLIR][AMDGPU] Add AMDGPU dialect, wrappers around raw buffer intrinsics
By analogy with the NVGPU dialect, introduce an AMDGPU dialect for
AMD-specific intrinsic wrappers.

The dialect initially includes wrappers around the raw buffer intrinsics.

On AMD GPUs, a memref can be converted to a "buffer descriptor" that
allows more precise control of memory access, such as by allowing for
out of bounds loads/stores to be replaced by 0/ignored without adding
additional conditional logic, which is important for performance.
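
The out-of-bounds behavior described above can be modeled in a few lines
of Python. This is only a model of the semantics: the class name
`BoundsCheckedBuffer` is hypothetical, and on hardware the bounds check
is performed by the memory unit rather than by branches in the kernel,
which is where the performance benefit comes from.

```python
from typing import List


class BoundsCheckedBuffer:
    """Toy model of buffer-descriptor access with bounds checking."""

    def __init__(self, data: List[int]) -> None:
        self.data = list(data)

    def load(self, index: int) -> int:
        # An out-of-bounds load yields 0 instead of faulting.
        if 0 <= index < len(self.data):
            return self.data[index]
        return 0

    def store(self, index: int, value: int) -> None:
        # An out-of-bounds store is silently dropped.
        if 0 <= index < len(self.data):
            self.data[index] = value
```

A caller can therefore issue loads and stores at padded or overhanging
indices without guarding each access.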

The repository currently contains a limited conversion from
transfer_read/transfer_write to Mubuf intrinsics, which are older,
deprecated intrinsics for the same functionality.

The new amdgpu.raw_buffer_* ops allow these operations to be used
explicitly and to carry metadata such as whether the target
chipset is an RDNA chip or not (which impacts the interpretation of
some bits in the buffer descriptor), while still maintaining an
MLIR-like interface.

(This change also exposes the floating-point atomic add intrinsic.)

Reviewed By: ThomasRaoux

Differential Revision: https://reviews.llvm.org/D122765
2022-05-10 14:59:58 +00:00