llvm-project

Author	SHA1	Message	Date
Krzysztof Drewniak	b05c15259b	[mlir][AMDGPU] Improve amdgpu.lds_barrier, add warnings (#77942) On some architectures (currently gfx90a, gfx94, and gfx10*), we can implement an LDS barrier using compiler intrinsics instead of inline assembly, improving optimization possibilities and decreasing the fragility of the underlying code. Other AMDGPU chipsets continue to require inline assembly to implement this barrier, as, by default, the LLVM backend will insert waits on global memory (s_waitcnt vmcnt(0)) before barriers in order to ensure memory watchpoints set by debuggers work correctly. Use of amdgpu.lds_barrier, on these architectures, imposes a tradeoff between debuggability and performance. The documentation, as well as the generated inline assembly, have been updated to explicitly call attention to this fact. For chipsets that did not require the inline assembly hack, we move to the s.waitcnt and s.barrier intrinsics, which have been added to the ROCDL dialect. The magic constants used as an argument to the waitcnt intrinsic can be derived from llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp	2024-03-11 10:06:49 -05:00
Krzysztof Drewniak	2ebd633f14	[mlir][AMDGPU] Add packed 8-bit float conversion ops and lowering Define operations that wrap the gfx940's new operations for converting between f32 and registers containing packed sets of four 8-bit floats. Define rocdl operations for the intrinsics and an AMDGPU dialect wrapper around them (to account for the fact that MLIR distinguishes the two float formats at the type level but that the LLVM IR does not). Define an ArithToAMDGPU pass, meant to run before conversion to LLVM, that replaces relevant calls to arith.extf and arith.truncf with the packed operations in the AMDGPU dialect. Note that the conversion currently only handles scalars and vectors of rank <= 1, as we do not have a use case for multi-dimensional vector support right now. Reviewed By: jsjodin Differential Revision: https://reviews.llvm.org/D152457	2023-09-28 14:44:16 +00:00
Krzysztof Drewniak	bfa501b892	[mlir][AMDGPU] Move to new buffer resource intrinsics The AMDGPU backend now has buffer resource intrinsics that take a ptr addrspace(8) instead of a vector<4xi32>, improving LLVM's ability to reason about their memory behavior. This commit moves MLIR to these new functions. Reviewed By: jsjodin Differential Revision: https://reviews.llvm.org/D157053	2023-09-22 19:48:06 +00:00
Krzysztof Drewniak	51b65d0895	[mlir][AMDGPU] Improve BF16 handling through AMDGPU compilation Many previous sets of AMDGPU dialect code have been incorrect in the presence of the bf16 type (when lowered to LLVM's bfloat) as they were developed in a setting that ran a custom bf16-to-i16 pass before LLVM lowering. An overall effect of this patch is that you should run --arith-emulate-unsupported-floats="source-types=bf16 target-type=f32" on your GPU module before calling --convert-gpu-to-rocdl if your code performs bf16 arithmetic. While LLVM now supports software bfloat, initial experiments showed that using this support on AMDGPU inserted a large number of conversions around loads and stores which had substantial performance impacts. Furthermore, all of the native AMDGPU operations on bf16 types (like the WMMA operations) operate on 16-bit integers instead of the bfloat type. First, we make the following changes to preserve compatibility once the LLVM bfloat type is reenabled. 1. The matrix multiplication operations (MFMA and WMMA) will bitcast bfloat vectors to i16 vectors. 2. Buffer loads and stores will operate on the relevant integer datatype and then cast to bfloat if needed. Second, we add type conversions to convert bf16 and vectors of it to equivalent i16 types. Third, we add the bfloat <-> f32 expansion patterns to the set of operations run before the main LLVM conversion so that MLIR's implementation of these conversion routines is used. Finally, we extend the "floats treated as integers" support in the LLVM exporter to handle types other than fp8. We also fix a bug in the unsupported floats emulation where it tried to operate on `arith.bitcast` due to an oversight. Reviewed By: rsuderman Differential Revision: https://reviews.llvm.org/D156361	2023-08-17 18:31:28 +00:00
Giuseppe Rossini	4b3eaee270	[mlir][AMDGPU] Define wrappers for WMMA matrix ops Wave Matrix Multiply Accumulate (WMMA) is the instruction to accelerate matrix multiplication on RDNA3 architectures. LLVM already provides a set of intrinsics to generate wmma instructions. This change uses those intrinsics to enable the feature in MLIR. Reviewed By: krzysz00 Differential Revision: https://reviews.llvm.org/D152451	2023-07-20 18:38:35 +00:00
Giuseppe Rossini	20c66a0c66	[AMDGPU] Add basic support for gfx11xx This patch fixes a minor issue in AMDGPUToROCDL to add gfx11 support in MLIR Reviewed By: krzysz00 Differential Revision: https://reviews.llvm.org/D152450	2023-06-12 17:06:36 +00:00
Krzysztof Drewniak	98c1104d41	[mlir][AMDGPU] Define atomic compare-and-swap for raw buffers This commit adds the buffer cmpswap intrinsic to the ROCDL dialect and its corresponding AMDGPU dialect wrappers. Reviewed By: nirvedhmeshram Differential Revision: https://reviews.llvm.org/D148722	2023-05-03 21:11:20 +00:00
giuseros	82ac02e4a8	Add scalar support for amdgpu.raw_buffer_{load,store} Introduce the possibility to load/store scalars via amdgpu.raw_buffer_{load,store} Reviewed By: krzysz00 Differential Revision: https://reviews.llvm.org/D146413	2023-03-20 20:19:20 +00:00
Manupa Karunaratne	584f64365a	[MLIR][AMDGPU][ROCDL] Adding raw.buffer.atomic.fmax/smax/umin support This commit adds support for atomic fmax/smax/umin support for AMDGPU dialect and the dependent dialects to allow such a lowering. Reviewed By: krzysz00 Differential Revision: https://reviews.llvm.org/D144097	2023-02-28 16:58:35 +00:00
Krzysztof Drewniak	22f0c7a451	[mlir][AMDGPU] 8-bit float usage in the AMDGPU dialect Upcoming AMD hardware will include functions that accept 8-bit floats. Specifically, there are MFMA instructions that accept 8-bit floats, either using the same or mixed formats. This patch adds MLIR wrappers for these intrinsics and explicitly adds support for 8-bit floats in the gpu-to-rocdl conversion by way of amdgpu-to-rocdl. Since LLVM does not have f8 types, when targeting LLVM for compilation on an AMD GPU, both f8 types used on AMD hardware (f8E5M2FNUZ and f8E4M3FNUZ) are rewritten to i8. This patch also relaxes the restriction that the types of both source operands to an amdgpu.mfma instruction match exactly, as this is not necessarily required for the bf8 (f8E5M2FNUZ) and fp8 (f8E4M3FNUZ) instructions. In addition, since the buffer_{load,store} operations maintain a whitelist of permitted types, we add the relevant f8 types to that list. This patch does not add any implementations of arithmetic operations for f8 types. Reviewed By: jakeh-gc Differential Revision: https://reviews.llvm.org/D143956	2023-02-15 16:46:08 +00:00
Krzysztof Drewniak	c55b41d519	[mlir][AMDGPU] Define amdgpu.mfma operator The amdgpu.mfma operator is a wrapper around the Matrix Fused Multiply Add (MFMA) instructions on some AMD GPUs (the CDNA-based MI-* cards). This interface allows for selecting the operation to be performed by specifying the dimensions of the multiplication to be performed and any additional attributes (such as whether to use reduced-precision floating-point math) that are needed to select the relevant mfma instruction and set its parameters. Reviewed By: ThomasRaoux, nirvedhmeshram Differential Revision: https://reviews.llvm.org/D132956	2022-08-31 21:06:12 +00:00
Krzysztof Drewniak	6329562249	[mlir][AMDGPU] Explicitly truncate memory addresses in buffer ops As a precaution, truncate memory addresses passed to kernels to 48 bits, since bits 48-63 of the buffer descriptor are used for the stride field and, on gfx10, to control swizzling. Reviewed By: ThomasRaoux Differential Revision: https://reviews.llvm.org/D131016	2022-08-04 19:42:33 +00:00
Krzysztof Drewniak	bc61cc9a2d	[mlir][AMDGPU] Add lds_barrier op The lds_barrier op allows workgroups to wait at a barrier for operations to/from their local data store (LDS) to complete without incurring the performance penalties of a full memory fence. Reviewed By: nirvedhmeshram Differential Revision: https://reviews.llvm.org/D129522	2022-07-14 20:45:26 +00:00
Krzysztof Drewniak	db590549a9	[mlir][AMDGPU] Use the correct values for OOB_SELECT on gfx10 Differential Revision: https://reviews.llvm.org/D129320	2022-07-07 21:23:38 +00:00
Krzysztof Drewniak	cab44c515c	[mlir][AMDGPU] Add --chipset option to AMDGPUToROCDL Because the buffer descriptor structure (the V#) has no backwards-compatibility guarantees, and since said guarantees have been violated in practice (see https://github.com/llvm/llvm-project/issues/56323), and since the `targetIsRDNA` attribute isn't something that higher-level clients can set in general, make the lowering of the amdgpu dialect to rocdl take a --chipset option. Note that this option is a string because adding a parser for the Chipset struct to llvm::cl wasn't working out. Reviewed By: herhut Differential Revision: https://reviews.llvm.org/D129228	2022-07-07 14:58:13 +00:00
Krzysztof Drewniak	f1f05a91ca	[MLIR][AMDGPU] Add AMDGPU dialect, wrappers around raw buffer intrinsics By analogy with the NVGPU dialect, introduce an AMDGPU dialect for AMD-specific intrinsic wrappers. The dialect initially includes wrappers around the raw buffer intrinsics. On AMD GPUs, a memref can be converted to a "buffer descriptor" that allows more precise control of memory access, such as by allowing for out of bounds loads/stores to be replaced by 0/ignored without adding additional conditional logic, which is important for performance. The repository currently contains a limited conversion from transfer_read/transfer_write to Mubuf intrinsics, which are an older, deprecated intrinsic for the same functionality. The new amdgpu.raw_buffer_* ops allow these operations to be used explicitly and for including metadata such as whether the target chipset is an RDNA chip or not (which impacts the interpretation of some bits in the buffer descriptor), while still maintaining an MLIR-like interface. (This change also exposes the floating-point atomic add intrinsic.) Reviewed By: ThomasRaoux Differential Revision: https://reviews.llvm.org/D122765	2022-05-10 14:59:58 +00:00

16 Commits
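
Commit 51b65d0895 above refers to bfloat <-> f32 expansion patterns and to treating bf16 values as 16-bit integers. The following is a minimal sketch of what such a conversion looks like at the bit level, assuming the standard bf16 layout (the upper 16 bits of an IEEE f32) and round-to-nearest-even on truncation. The function names are hypothetical and this is not taken from MLIR's implementation, which may differ in details such as NaN handling.

```python
import struct


def f32_to_bits(x: float) -> int:
    """Reinterpret a Python float, rounded to f32, as a 32-bit unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_f32(bits: int) -> float:
    """Reinterpret a 32-bit unsigned integer as an f32 value."""
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def bf16_extend(bits16: int) -> float:
    """Extend a bf16 bit pattern (held as an i16) to f32 by shifting into the high half."""
    return bits_to_f32((bits16 & 0xFFFF) << 16)


def bf16_truncate(x: float) -> int:
    """Truncate an f32 value to a bf16 bit pattern with round-to-nearest-even.

    Assumption: NaN inputs are mapped to a quiet NaN by setting the top
    mantissa bit; this is one common choice, not necessarily MLIR's.
    """
    bits = f32_to_bits(x)
    is_nan = (bits & 0x7F800000) == 0x7F800000 and (bits & 0x007FFFFF) != 0
    if is_nan:
        return ((bits >> 16) | 0x0040) & 0xFFFF
    # Adding 0x7FFF plus the lowest kept bit makes exact ties round to even.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF
```

Under these assumptions, extending a bf16 value is exact, while truncating an f32 value loses the low 16 mantissa bits, so a round trip through bf16 only preserves values that are already representable in bf16.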