20 Commits

Author SHA1 Message Date
erman-gurses
87c0260f45
[AMDGPU] Add parameterization for optimized shared memory variables (#82508)
- This PR parameterizes the shared memory variables that are
used for optimization: `sharedMemoryLineSizeBytes` and
`defaultVectorSizeBits`.
- Both variables default to 128, since that setting gives
zero bank conflicts.
2024-02-27 23:28:12 -05:00
erman-gurses
04381c106f
[MLIR][AMDGPU]Add refactoring for shared-mem optimization (#81791)
Addressing the issues in this PR:
https://github.com/llvm/llvm-project/pull/81550
2024-02-15 13:53:15 -05:00
erman-gurses
29d1aca05c
[AMDGPU][MLIR]Add shmem-optimization as an op using transform dialect (#81550)
This PR adds functionality to use shared memory optimization as an op
using transform dialect.
2024-02-13 17:42:04 -08:00
erman-gurses
3f37df5b71
[reland][mlir][amdgpu] Shared memory access optimization pass (#79164)
- Reland: https://github.com/llvm/llvm-project/pull/75627

- Reproduced then fixed the build issue
2024-01-25 07:44:45 -08:00
Mehdi Amini
e611a4cf80
Revert "[mlir][amdgpu] Shared memory access optimization pass" (#78822)
Reverts llvm/llvm-project#75627 ; it broke the bot:
https://lab.llvm.org/buildbot/#/builders/61/builds/53218
2024-01-19 16:41:43 -08:00
erman-gurses
b7360fbe8c
[mlir][amdgpu] Shared memory access optimization pass (#75627)
Implements a transformation that optimizes accesses to shared memory.

Reference: https://reviews.llvm.org/D127457

_This change adds a transformation and pass to the NvGPU dialect that
attempts to optimize reads/writes from a memref representing GPU shared
memory in order to avoid bank conflicts. Given a value representing a
shared memory memref, it traverses all reads/writes within the parent op
and, subject to suitable conditions, rewrites all last dimension index
values such that element locations in the final (col) dimension are
given by newColIdx = col % vecSize + perm[row](col / vecSize, row)
where perm is a permutation function indexed by row and vecSize
is the vector access size in elements (currently assumes 128bit
vectorized accesses, but this can be made a parameter). This specific
transformation can help optimize typical distributed & vectorized
accesses common to loading matrix multiplication operands to/from
shared memory._
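
The index rewrite quoted above can be illustrated with a minimal sketch. The commit text does not say which permutation `perm[row]` is used, so the XOR of the vector index with the row number below is an assumption chosen for illustration, as are the names `swizzle_col` and `row_width`. The sketch also assumes the number of vectors per row is a power of two.

```python
def swizzle_col(row: int, col: int, vec_size: int, row_width: int) -> int:
    """Return a permuted column index for element (row, col).

    vec_size is the vector access size in elements; row_width is the size of
    the last dimension in elements (assumed: a multiple of vec_size with a
    power-of-two number of vectors per row).
    """
    num_vecs = row_width // vec_size
    vec_idx = col // vec_size
    # Hypothetical perm[row]: XOR the vector index with the row number.
    # The result is returned as an element offset (vector index * vec_size).
    permuted_offset = (vec_idx ^ (row % num_vecs)) * vec_size
    # newColIdx = col % vecSize + perm[row](col / vecSize, row)
    return col % vec_size + permuted_offset


# Each row is still a permutation of its original columns, and the same
# logical column lands in a different vector slot in each of the 8 rows.
row0 = [swizzle_col(0, c, 4, 32) for c in range(32)]
slots_for_col0 = [swizzle_col(r, 0, 4, 32) // 4 for r in range(8)]
```

With this choice, row 0 is left unchanged and rows that access the same logical column are spread across distinct vector slots, which is the property that helps avoid bank conflicts.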
2024-01-19 15:44:45 -08:00
Krzysztof Drewniak
2ebd633f14 [mlir][AMDGPU] Add packed 8-bit float conversion ops and lowering
Define operations that wrap the gfx940's new operations for converting
between f32 and registers containing packed sets of four 8-bit floats.

Define rocdl operations for the intrinsics and an AMDGPU dialect
wrapper around them (to account for the fact that MLIR distinguishes
the two float formats at the type level but that the LLVM IR does
not).

Define an ArithToAMDGPU pass, meant to run before conversion to LLVM,
that replaces relevant calls to arith.extf and arith.truncf with the
packed operations in the AMDGPU dialect. Note that the conversion
currently only handles scalars and vectors of rank <= 1, as we do not
have a usecase for multi-dimensional vector support right now.

Reviewed By: jsjodin

Differential Revision: https://reviews.llvm.org/D152457
2023-09-28 14:44:16 +00:00
Daniil Dudkin
8a6e54c9b3
[mlir][arith] Rename operations: maxf → maximumf, minf → minimumf (#65800)
This patch is part of a larger initiative aimed at fixing floating-point `max` and `min` operations in MLIR: https://discourse.llvm.org/t/rfc-fix-floating-point-max-and-min-operations-in-mlir/72671.

This commit addresses Task 1.2 of the mentioned RFC. By renaming these operations, we align their names with LLVM intrinsics that have corresponding semantics.
2023-09-11 22:02:19 -07:00
Giuseppe Rossini
4b3eaee270 [mlir][AMDGPU] Define wrappers for WMMA matrix ops
Wave Matrix Multiply Accumulate (WMMA) is the instruction to accelerate
matrix multiplication on RDNA3 architectures.  LLVM already provides a
set of intrinsics to generate wmma instructions. This change uses those
intrinsics to enable the feature in MLIR.

Reviewed By: krzysz00

Differential Revision: https://reviews.llvm.org/D152451
2023-07-20 18:38:35 +00:00
Krzysztof Drewniak
cc4703745f [mlir][AMDGPU] Add emulation pass for atomics on AMDGPU targets
Not all AMDGPU targets support all atomic operations. For example,
there are no atomic floating-point adds on the gfx10 series. Add a
pass to emulate these operations using a compare-and-swap loop, by
analogy to the generic atomicrmw rewrite in MemrefToLLVM.
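
As a rough illustration of the compare-and-swap loop idea, here is a minimal sketch in Python rather than the pass's actual output. The `Cell` class and the function names are hypothetical, and a lock stands in for hardware atomicity. The comparison is done on raw 32-bit patterns so that the loop still terminates when the stored value is NaN.

```python
import struct
import threading


class Cell:
    """A 32-bit memory cell holding raw bits, with an atomic compare-and-swap."""

    def __init__(self, bits: int = 0):
        self._bits = bits
        self._lock = threading.Lock()  # stands in for hardware atomicity

    def load(self) -> int:
        return self._bits

    def compare_and_swap(self, expected: int, new: int) -> int:
        """Store `new` if the cell holds `expected`; return the bits seen."""
        with self._lock:
            seen = self._bits
            if seen == expected:
                self._bits = new
            return seen


def f32_to_bits(x: float) -> int:
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_f32(b: int) -> float:
    return struct.unpack("<f", struct.pack("<I", b))[0]


def emulated_atomic_fadd(cell: Cell, operand: float) -> float:
    """Atomic f32 add built only from compare-and-swap; returns the old value."""
    old = cell.load()
    while True:
        new = f32_to_bits(bits_to_f32(old) + operand)
        seen = cell.compare_and_swap(old, new)
        if seen == old:  # nobody changed the cell in between: the add landed
            return bits_to_f32(old)
        old = seen  # another writer won the race; retry with the fresh value
```

Calling `emulated_atomic_fadd(cell, 2.25)` on a cell holding 1.5 returns 1.5 and leaves 3.75 in the cell.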

This pass is named generally, as in the future we may have a
memref-to-amdgpu that translates constructs like atomicrmw fmax (which
doesn't generally exist in LLVM) to the relevant intrinsics, which may
themselves require emulation.

Since the AMDGPU dialect now has a pass that operates on it, the
dialect's directory structure is reorganized to match other similarly
complex dialects.

The pass should be run before amdgpu-to-rocdl if desired.

This commit also adds f64 support to atomic_fmax.

Depends on D148722

Reviewed By: nirvedhmeshram

Differential Revision: https://reviews.llvm.org/D148724
2023-05-03 21:18:48 +00:00
Krzysztof Drewniak
98c1104d41 [mlir][AMDGPU] Define atomic compare-and-swap for raw buffers
This commit adds the buffer cmpswap intrinsic to the ROCDL dialect and
its corresponding AMDGPU dialect wrappers.

Reviewed By: nirvedhmeshram

Differential Revision: https://reviews.llvm.org/D148722
2023-05-03 21:11:20 +00:00
giuseros
82ac02e4a8 Add scalar support for amdgpu.raw_buffer_{load,store}
Allow loading and storing scalars via amdgpu.raw_buffer_{load,store}.

Reviewed By: krzysz00

Differential Revision: https://reviews.llvm.org/D146413
2023-03-20 20:19:20 +00:00
Krzysztof Drewniak
22f0c7a451 [mlir][AMDGPU] 8-bit float usage in the AMDGPU dialect
Upcoming AMD hardware will include functions that accept 8-bit floats.
Specifically, there are MFMA instructions that accept 8-bit floats,
either using the same or mixed formats. This patch adds MLIR wrappers
for these intrinsics and explicitly adds support for 8-bit floats in
the gpu-to-rocdl conversion by way of amdgpu-to-rocdl.

Since LLVM does not have f8 types, when targeting LLVM for compilation
on an AMD GPU, both f8 types used on AMD hardware (f8E5M2FNUZ and
f8E4M3FNUZ) are rewritten to i8.
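
To make the i8 representation concrete, the following sketch decodes a stored byte as either FNUZ format. It is not part of the patch, and the function name is hypothetical. It assumes the usual FNUZ layout: no infinities, no negative zero, the single NaN encoded as 0x80, and an exponent bias of 2^(exponent bits - 1), that is, 8 for f8E4M3FNUZ and 16 for f8E5M2FNUZ.

```python
import math


def decode_f8_fnuz(byte: int, exp_bits: int, man_bits: int) -> float:
    """Interpret an 8-bit pattern as an FNUZ float (1 sign bit + exp + mantissa)."""
    assert 0 <= byte <= 0xFF and 1 + exp_bits + man_bits == 8
    if byte == 0x80:  # the "negative zero" pattern is the only NaN
        return math.nan
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    bias = 1 << (exp_bits - 1)  # assumed FNUZ bias: 8 for E4M3, 16 for E5M2
    if exp == 0:  # zero and subnormals have no implicit leading one
        return sign * man * 2.0 ** (1 - bias - man_bits)
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)


fp8_one = decode_f8_fnuz(0x40, exp_bits=4, man_bits=3)  # f8E4M3FNUZ
bf8_max = decode_f8_fnuz(0x7F, exp_bits=5, man_bits=2)  # f8E5M2FNUZ
```

The same byte means different things in the two formats, which is why MLIR keeps them as distinct types even though both lower to i8.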

This patch also relaxes the restriction that the types of both source
operands to an amdgpu.mfma instruction match exactly, as this is not
necessarily required for the bf8 (f8E5M2FNUZ) and fp8 (f8E4M3FNUZ)
instructions. In addition, since the buffer_{load,store} operations
maintain a whitelist of permitted types, we add the relevant f8 types
to that list.

This patch does not add any implementations of arithmetic operations
for f8 types.

Reviewed By: jakeh-gc

Differential Revision: https://reviews.llvm.org/D143956
2023-02-15 16:46:08 +00:00
Matthias Springer
e7790fbed3 [mlir] Add test-convergence option to Canonicalizer tests
This new option is set to `false` by default. It should be set only in Canonicalizer tests to detect faulty canonicalization patterns, i.e., patterns that prevent the canonicalizer from converging. The canonicalizer should always converge on the small unit tests that we have in `canonicalize.mlir`.

Two faulty canonicalization patterns were detected and fixed with this change.

Differential Revision: https://reviews.llvm.org/D140873
2023-01-04 12:02:21 +01:00
Krzysztof Drewniak
d6abdf46bc [mlir][AMDGPU] Remove buffer ops that are statically out of bounds
When the bounds check attribute is true, the raw buffer load, store,
and atomic operations have well-defined behavior (returning 0 for
loads and ignoring stores) when the buffer access exceeds the bounds
of the memory being accessed.

Because of how LLVM currently implements these buffer operations (as
opaque intrinsics), the backend cannot optimize out this known
behavior and eliminate the memory operations. Therefore, use MLIR's
canonicalization system to eliminate these operations.
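
A minimal model of this behavior, written in Python rather than as an MLIR pattern: the first two functions mimic the bounds-checked semantics, and `canonicalize` folds accesses whose constant index is known to be out of range. The tuple-based op encoding and all names here are hypothetical, and every op is assumed to have the bounds check attribute set to true.

```python
def raw_buffer_load(buffer, index):
    """Bounds-checked load: out-of-bounds reads return 0."""
    return buffer[index] if 0 <= index < len(buffer) else 0


def raw_buffer_store(buffer, index, value):
    """Bounds-checked store: out-of-bounds writes are ignored."""
    if 0 <= index < len(buffer):
        buffer[index] = value


def canonicalize(ops, buffer_len):
    """Remove ops whose constant index is statically out of bounds.

    Ops are ("load", index) or ("store", index, value); an index is an int
    when it is a compile-time constant and a string (an SSA name) otherwise.
    """
    result = []
    for op in ops:
        kind, index = op[0], op[1]
        static_oob = isinstance(index, int) and not (0 <= index < buffer_len)
        if not static_oob:
            result.append(op)  # in bounds or unknown: keep the memory op
        elif kind == "load":
            result.append(("const", 0))  # the load is known to produce 0
        # a statically out-of-bounds store has no effect and is dropped
    return result


ops = [("load", 2), ("load", 9), ("store", 9, 7), ("store", "%i", 7)]
simplified = canonicalize(ops, buffer_len=4)
```

Only accesses with constant indices can be folded this way; the store through the dynamic index `%i` has to stay.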

Reviewed By: nirvedhmeshram

Differential Revision: https://reviews.llvm.org/D138146
2022-11-21 16:47:21 +00:00
Jeremy Furtek
f6ee194b68 [mlir][ods] Do not print default-valued attributes when the value is equal to the default
This diff causes the `tblgen`-erated print() function to skip printing a
`DefaultValuedAttr` attribute when the value is equal to the default.

This feature will reduce the amount of custom printing code that needs to be
written by users in a relatively common scenario. As a motivating example, for the
fastmath flags in the LLVMIR dialect, we would prefer to print this:

```
%0 = llvm.fadd %arg0, %arg1 : f32
```

instead of this:

```
%0 = llvm.fadd %arg0, %arg1 {fastmathFlags = #llvm.fastmath<none>} : f32
```

This diff makes the handling of print functionality for default-valued attributes
standard.
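
The printing rule can be sketched as follows. This is a Python illustration of the idea, not the generated C++; `print_op` and its parameters are hypothetical names.

```python
def print_op(name, operands, attrs, defaults, result_type):
    """Print an op, skipping attributes whose value equals the declared default."""
    shown = {
        key: value
        for key, value in attrs.items()
        if key not in defaults or defaults[key] != value
    }
    text = f"{name} {', '.join(operands)}"
    if shown:
        text += " {" + ", ".join(f"{k} = {v}" for k, v in shown.items()) + "}"
    return f"{text} : {result_type}"


defaults = {"fastmathFlags": "#llvm.fastmath<none>"}
plain = print_op("llvm.fadd", ["%arg0", "%arg1"],
                 {"fastmathFlags": "#llvm.fastmath<none>"}, defaults, "f32")
fast = print_op("llvm.fadd", ["%arg0", "%arg1"],
                {"fastmathFlags": "#llvm.fastmath<fast>"}, defaults, "f32")
```

Here `plain` omits the attribute dictionary entirely, while `fast` still prints `fastmathFlags` because its value differs from the default.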

This is an updated version of https://reviews.llvm.org/D135398, without the per-attribute bit to control printing.

Reviewed By: Mogball

Differential Revision: https://reviews.llvm.org/D135993
2022-10-17 13:57:36 -07:00
Krzysztof Drewniak
c55b41d519 [mlir][AMDGPU] Define amdgpu.mfma operator
The amdgpu.mfma operator is a wrapper around the Matrix Fused Multiply
Add (MFMA) instructions on some AMD GPUs (the CDNA-based MI-* cards).

This interface allows for selecting the operation to be performed by
specifying the dimensions of the multiplication to be performed and
any additional attributes (such as whether to use reduced-precision
floating-point math) that are needed to select the relevant mfma
instruction and set its parameters.

Reviewed By: ThomasRaoux, nirvedhmeshram

Differential Revision: https://reviews.llvm.org/D132956
2022-08-31 21:06:12 +00:00
Krzysztof Drewniak
bc61cc9a2d [mlir][AMDGPU] Add lds_barrier op
The lds_barrier op allows workgroups to wait at a barrier for
operations to/from their local data store (LDS) to complete without
incurring the performance penalties of a full memory fence.

Reviewed By: nirvedhmeshram

Differential Revision: https://reviews.llvm.org/D129522
2022-07-14 20:45:26 +00:00
Krzysztof Drewniak
cab44c515c [mlir][AMDGPU] Add --chipset option to AMDGPUToROCDL
Because the buffer descriptor structure (the V#) has no backwards-compatibility
guarantees, and since said guarantees have been violated in practice
(see https://github.com/llvm/llvm-project/issues/56323 ), and since
the `targetIsRDNA` attribute isn't something that higher-level clients can set
in general, make the lowering of the amdgpu dialect to rocdl take a --chipset
option.

Note that this option is a string because adding a parser for the Chipset
struct to llvm::cl wasn't working out.
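
For illustration, a chipset string such as `gfx90a` could be split into version fields along these lines. This is a hypothetical sketch, not the parser in the patch. It assumes the last two characters form a hexadecimal minor version, the characters before them form a decimal major version, and that major version 10 or above indicates an RDNA chip.

```python
from typing import NamedTuple


class Chipset(NamedTuple):
    major: int
    minor: int


def parse_chipset(name: str) -> Chipset:
    """Parse a string like "gfx908" or "gfx1030" into (major, minor)."""
    if not name.startswith("gfx") or len(name) < 6:
        raise ValueError(f"invalid chipset name: {name!r}")
    digits = name[3:]
    major = int(digits[:-2], 10)  # assumed: leading digits are decimal
    minor = int(digits[-2:], 16)  # assumed: last two characters are hex
    return Chipset(major, minor)


def is_rdna(chipset: Chipset) -> bool:
    # Assumption for this sketch: gfx10 and later are RDNA parts.
    return chipset.major >= 10


mi200 = parse_chipset("gfx90a")
navi = parse_chipset("gfx1030")
```

A lowering that takes the chipset as an option can derive facts like `is_rdna` itself instead of relying on clients to set a per-op attribute.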

Reviewed By: herhut

Differential Revision: https://reviews.llvm.org/D129228
2022-07-07 14:58:13 +00:00
Krzysztof Drewniak
f1f05a91ca [MLIR][AMDGPU] Add AMDGPU dialect, wrappers around raw buffer intrinsics
By analogy with the NVGPU dialect, introduce an AMDGPU dialect for
AMD-specific intrinsic wrappers.

The dialect initially includes wrappers around the raw buffer intrinsics.

On AMD GPUs, a memref can be converted to a "buffer descriptor" that
allows more precise control of memory access, such as by allowing for
out of bounds loads/stores to be replaced by 0/ignored without adding
additional conditional logic, which is important for performance.

The repository currently contains a limited conversion from
transfer_read/transfer_write to Mubuf intrinsics, which are older,
deprecated intrinsics for the same functionality.

The new amdgpu.raw_buffer_* ops allow these operations to be used
explicitly and to carry metadata such as whether the target
chipset is an RDNA chip or not (which impacts the interpretation of
some bits in the buffer descriptor), while still maintaining an
MLIR-like interface.

(This change also exposes the floating-point atomic add intrinsic.)

Reviewed By: ThomasRaoux

Differential Revision: https://reviews.llvm.org/D122765
2022-05-10 14:59:58 +00:00