22 Commits

Author SHA1 Message Date
Christian Sigg
a5757c5b65
Switch member calls to isa/dyn_cast/cast/... to free function calls. (#89356)
This change cleans up call sites. Next step is to mark the member
functions deprecated.

See https://mlir.llvm.org/deprecation and
https://discourse.llvm.org/t/preferred-casting-style-going-forward.
2024-04-19 15:58:27 +02:00
Guray Ozen
0a600c34c8
[mlir][nvgpu] Make phaseParity of mbarrier.try_wait i1 (#81460)
Currently, `phaseParity` argument of `nvgpu.mbarrier.try_wait.parity` is
index. This can cause a problem if it's passed any value different than
0 or 1. Because the PTX instruction only accepts even or odd phase. This
PR makes phaseParity argument i1 to avoid misuse.

Here is the information from PTX doc:

```
The .parity variant of the instructions test for the completion of the phase indicated 
by the operand phaseParity, which is the integer parity of either the current phase or 
the immediately preceding phase of the mbarrier object. An even phase has integer 
parity 0 and an odd phase has integer parity of 1. So the valid values of phaseParity 
operand are 0 and 1.
```
See for more information:

https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-mbarrier-test-wait-mbarrier-try-wait
2024-02-13 09:50:34 +01:00
Mehdi Amini
a7759fb0f8 Apply clang-tidy fixes for performance-unnecessary-value-param in NVGPUTransformOps.cpp (NFC) 2024-01-25 11:33:05 -08:00
Mehdi Amini
70fdaefe1f Apply clang-tidy fixes for bugprone-macro-parentheses in NVGPUTransformOps.cpp (NFC) 2024-01-25 11:33:05 -08:00
Guray Ozen
4319e1916d
[mlir][nvgpu] Introduce Multicast Capability to nvgpu.tma.async.load (#76935)
This PR improves the functionality of the `nvgpu.tma.async.load` Op by
adding support for multicast. While we already had this capability in
the lower-level `nvvm.cp.async.bulk.tensor.shared.cluster.global` NVVM
Op, this PR lowers mask information to the NVVM operation.
2024-01-05 10:48:55 +01:00
Guray Ozen
3a03da37a3
[mlir][nvgpu] Add address space attribute converter in nvgpu-to-nvvm pass (#74075)
GPU dialect has `#gpu.address_space<workgroup>` for shared memory of
NVGPU (address space =3). Howeverm when IR combine NVGPU and GPU
dialect, `nvgpu-to-nvvm` pass fails due to missing attribute conversion.

This PR adds `populateGpuMemorySpaceAttributeConversions` to
nvgou-to-nvvm lowering, so we can use `#gpu.address_space<workgroup>`
`nvgpu-to-nvvm` pass
2023-12-04 16:48:39 +01:00
Christian Ulmann
97a238e863
[MLIR][LLVM] Remove typed pointer conversion utils (#71169)
This commit removes the no longer required type pointer helpers from the
LLVM dialect conversion utils. Typed pointers have been deprecated for a
while now and it's planned to soon remove them from the LLVM dialect.

Related PSA:
https://discourse.llvm.org/t/psa-removal-of-typed-pointers-from-the-llvm-dialect/74502
2023-11-03 13:02:35 +01:00
Guray Ozen
192d3320f0
[mlir][nvgpu] Add predicate argument to NVGPU Ops (#69322) 2023-10-18 19:41:51 +02:00
Guray Ozen
52db7e2745
[mlir][nvgpu] Improve WarpgroupAccumulator type to simplify IR (#68728)
`WarpgroupAccumulator` (or `!nvgpu.warpgroup.accumulator`) is a type
that keeps the accumulator matrix that is used by warp-group level
matrix multiplication. It is handy to have a special type for that as
the matrix is distributed among the threads of the warp-group. However,
current transformations requires to create and use multiple
`WarpgroupAccumulator` if the shape of GEMM is larger than the supported
shape of `wgmma.mma_async` instruction. This makes IR looks dense.

This PR improves the transformation of `WarpgroupAccumulator` type in
every nvgpu Op that uses it.

**Example: Current GEMM in NVGPU-IR**
```
// Init
%m1, %m2 = nvgpu.warpgroup.mma.init.accumulator ->  
                    !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>,
                    !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>

// GEMM
%r1, %r2 = nvgpu.warpgroup.mma %descA, %descB, %m1, %m2 {transposeB}: 
      !nvgpu.warpgroup.descriptor<tensor = memref<128x64xf16, 3>>, 
      !nvgpu.warpgroup.descriptor<tensor = memref<64x128xf16, 3>>, 
      !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>,
      !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>> 
      -> 
      !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>,
      !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>  


// Epilogue 
nvgpu.warpgroup.mma.store [%r1, %r2] to %sharedMemoryBuffer
  : !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>, 
    !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>
    into memref<128x128xf32,3>
```

**Example: This PR simplifies the IR as below:**
```
// Init
%m = nvgpu.warpgroup.mma.init.accumulator ->  
           !nvgpu.warpgroup.accumulator<fragmented = vector<128x128xf32>>

// GEMM
%r1 = nvgpu.warpgroup.mma %descA, %descB, %m1 {transposeB}: 
      !nvgpu.warpgroup.descriptor<tensor = memref<128x64xf16, 3>>, 
      !nvgpu.warpgroup.descriptor<tensor = memref<64x128xf16, 3>>, 
      !nvgpu.warpgroup.accumulator<fragmented = vector<128x128xf32>> 
      -> 
      !nvgpu.warpgroup.accumulator<fragmented = vector<128x128xf32>>  

// Epilogue 
nvgpu.warpgroup.mma.store [%matrixD1, %matrixD2] to %sharedMemoryBuffer
  : !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>, 
    !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>
    into memref<128x128xf32,3>
```
2023-10-17 11:46:47 +02:00
Guray Ozen
17649a7726
[MLIR][NVGPU] Introduce nvgpu.mbarrier.group for multiple mbarrier use (#65951)
A common practice involves the creation of multiple `mbarrier` objects,
see an example below. This is particularly valuable in scenarios like
software pipelining for GEMM, where we need to generate multiple
barriers dynamically use and wait them in a loop.

PR improves `nvgpu.mbarrier.barrier` type into the
`nvgpu.mbarrier.group`. All `mbarrier` related Ops now uses this type.
Consequently, these Ops are now capable of managing multiple barriers
seamlessly.

Having `num_barriers = 4` helps us to locate mbarrier object(s) into
static shared memory. We could make the value dynamic that requires
dynamic shared memory it would complicate the codegen.

```
%barriers = nvgpu.mbarrier.create -> !nvgpu.mbarrier.group<3, num_barriers = 4>
nvgpu.mbarrier.init %barriers[%c0], %num_threads : !nvgpu.mbarrier.group<3, num_barriers = 4>
nvgpu.mbarrier.init %barriers[%c1], %num_threads : !nvgpu.mbarrier.group<3, num_barriers = 4>
nvgpu.mbarrier.init %barriers[%c2], %num_threads : !nvgpu.mbarrier.group<3, num_barriers = 4>
nvgpu.mbarrier.init %barriers[%c3], %num_threads : !nvgpu.mbarrier.group<3, num_barriers = 4>
...
scf.for %i = %c0 to %n step %c1 {
  nvgpu.mbarrier.try_wait %barriers[ (i % 4) ] ... 

  // ... Do work once mbarrier is ready 

  nvgpu.mbarrier.arrive.expect_tx %barriers[ (i + 3 % 4) ] ... 
}
```
We will have mbarrier usages like below: 
```
expect_tx[0]
expect_tx[1]
expect_tx[2]
Loop:
 try_wait mbarrier[0], expect_tx[3]
 try_wait mbarrier[1], expect_tx[0]
 try_wait mbarrier[2], expect_tx[1]
 try_wait mbarrier[3], expect_tx[2]
...
```
2023-09-22 17:09:43 +02:00
Guray Ozen
2388222695
[MLIR][NVGPU] Adding nvgpu.warpgroup.mma Op for Hopper GPUs (#65440)
This work introduces a new operation called `warpgroup.mma` to the NVGPU
dialect of MLIR. The purpose of this operation is to facilitate
warpgroup-level matrix multiply and accumulate (WGMMA) operations on
Hopper GPUs with sm_90a architecture.

Previously, the `nvvm.wgmma.mma_async` operation was introduced to
support warpgroup-level matrix operations in NVVM dialect. This op is
used multiple instances of `nvvm.wgmma.mma_async` to achieve the desired
shape. The new `nvgpu.warpgroup.mma` operation abstracts this complexity
and provides a higher-level interface for performing warpgroup-level
matrix operations.

The `nvgpu.warpgroup.mma` does followings:
1) Corresponds multiple `wgmma` instructions.
2) Iterates input matrix descriptors to achieve the desired computation
shape. 3) Groups and runs `wgmma` instructions asynchronously, and
eventually waits them. This are done by `wgmma.fence.aligned`,
`wgmma.commit.group.sync.aligned`, and `wgmma.wait.group.sync.aligned`
4) Results fragmented matrices

Here's an example usage of the `nvgpu.warpgroup.mma` operation:
```
%wgmmaResult, %wgmmaResult2 = nvgpu.warpgroup.mma %descA, %descB, %acc1, %acc2 {transposeB}: 
      !nvgpu.wgmma.descriptor<tensor = memref<128x64xf16, 3>>, 
      !nvgpu.wgmma.descriptor<tensor = memref<64x128xf16, 3>>, 
      !nvgpu.warpgroup.accumulator< fragmented = vector<64x128xf32>>,
      !nvgpu.warpgroup.accumulator< fragmented = vector<64x128xf32>> 
      -> 
      !nvgpu.warpgroup.accumulator< fragmented = vector<64x128xf32>>, 
      !nvgpu.warpgroup.accumulator< fragmented = vector<64x128xf32>>  
```

The op will result following PTX:
```
wgmma.fence.sync.aligned;
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f1, %f2,    62 more registers}, %descA,     %descB,     p, 1, 1, 0, 1;
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f1, %f2,    62 more registers}, %descA+2,   %descB+128, p, 1, 1, 0, 1;
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f1, %f2,    62 more registers}, %descA+4,   %descB+256, p, 1, 1, 0, 1;
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f1, %f2,    62 more registers}, %descA+8,   %descB+348, p, 1, 1, 0, 1;
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f500,%f501, 62 more registers}, %descA+512, %descB,     p, 1, 1, 0, 1;
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f500,%f501, 62 more registers}, %descA+514, %descB+128, p, 1, 1, 0, 1;
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f500,%f501, 62 more registers}, %descA+516, %descB+256, p, 1, 1, 0, 1;
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f500,%f501, 62 more registers}, %descA+518, %descB+348, p, 1, 1, 0, 1;
wgmma.commit_group.sync.aligned;
wgmma.wait_group.sync.aligned 1;
```

The Op keeps 
  - first 64 registers (`{%f1, %f2,    62 more registers}`) -> `%acc1` 
- second 64 registers (`{%f500,%f501, 62 more registers}`) -> `%acc2`.
2023-09-22 11:46:29 +02:00
Nicolas Vasilache
99475f5b4a [mlir][transform] Add NVGPU to NVVM conversion via transform.apply_conversion_patterns
Differential Revision: https://reviews.llvm.org/D157501
2023-08-09 14:09:57 +00:00
Jie Fu
28d8c0d168 [mlir][nvgpu] Fix -Wunused-variable in NVGPUTransformOps.cpp (NFC)
/data/llvm-project/mlir/lib/Dialect/NVGPU/TransformOps/NVGPUTransformOps.cpp:969:16: error: unused variable 'inMemRefType' [-Werror,-Wunused-variable]
    MemRefType inMemRefType = inMemRef.getType();
               ^
1 error generated.
2023-08-08 20:21:41 +08:00
Nicolas Vasilache
a3cd2eeb2d [mlir][nvgpu] Add a nvgpu.rewrite_copy_as_tma transform operation.
This revision adds support for direct lowering of a linalg.copy on buffers between global and shared memory to a tma async load + synchronization operations.
This uses the recently introduced Hopper NVVM and NVGPU abstraction to connect things end to end.

Differential Revision: https://reviews.llvm.org/D157087
2023-08-08 12:07:59 +00:00
Matthias Springer
db393288ff [mlir][NVGPU][transform] Add create_async_groups transform op
This transform looks for suitable vector transfers from global memory to shared memory and converts them to async device copies.

Differential Revision: https://reviews.llvm.org/D155569
2023-07-18 14:36:41 +02:00
Alex Zinenko
4a6b31b8d8 [mlir] NFC: untangle SCF Patterns.h and Transforms.h
These two headers both contained a strange mix of definitions related to
both patterns and non-pattern transforms. Put patterns and "populate"
functions into Patterns.h and standalone transforms into Transforms.h.

Depends On: D155223

Reviewed By: nicolasvasilache

Differential Revision: https://reviews.llvm.org/D155454
2023-07-18 11:27:36 +00:00
Alex Zinenko
371366ce27 [mlir][nvgpu] add simple pipelining for shared memory copies
Add a simple transform operation to the NVGPU extension that performs
software pipelining of copies to shared memory. The functionality is
extremely minimalistic in this version and only supports copies from
global to shared memory inside an `scf.for` loop with either
`vector.transfer` or `nvgpu.device_async_copy` operations when
pipelining preconditions are already satisfied in the IR. This is the
minimally useful version that uses the more general loop pipeliner in an
NVGPU-specific way. Further extensions and orthogonalizations will be
necessary.

This required a change to the loop pipeliner itself to properly
propagate errors should the predicate generator fail.

This is loosely inspired from the vesion in IREE, but has less unsafe
assumptions and more principled way of communicating decisions.

Reviewed By: nicolasvasilache

Differential Revision: https://reviews.llvm.org/D155223
2023-07-17 14:29:12 +00:00
Lorenzo Chelini
2049b2adfe [MLIR] Fix compiler warnings (NFC)
In `TestTensorTransforms.cpp` `replaced` is nullptr I assumed the intent
was to emit the error for the `rootOp`.

In `TransformInterfaces.cpp` there were some uninitialized variables.

In `NVGPUTransformOps.cpp` `matmulOp` was never used.

Reviewed By: ftynse

Differential Revision: https://reviews.llvm.org/D154439
2023-07-05 09:49:57 +02:00
Nicolas Vasilache
13f4e889c5 Revert "Revert "[mlir][Transform] Add support for mma.sync m16n8k16 f16 rewrite." and "[mlir][Transform] Introduce nvgpu transform extensions""
This reverts commit 6506692fe619ef8a1f7c6ea829d9a9eceb31622d.

Differential Revision: https://reviews.llvm.org/D153845
2023-06-28 06:50:05 +00:00
Mehdi Amini
6506692fe6 Revert "[mlir][Transform] Add support for mma.sync m16n8k16 f16 rewrite." and "[mlir][Transform] Introduce nvgpu transform extensions"
This reverts commit 40deed40ae77ba22f7c72693903752ab6bfeb4e7.
and commit 1660f2174d59bc2fd04131dab9ab0b43178bf665.

The buildbot is broken, the two tests aren't passing.
2023-06-27 08:46:18 +02:00
Nicolas Vasilache
1660f2174d [mlir][Transform] Add support for mma.sync m16n8k16 f16 rewrite.
This PR adds support for the m16n8k16 f16 case.
At this point, the support is mostly mechanical and could be Tablegen'd to all cases.
Until then, this can be populated as needed on a case-by-case basis.

Depends on: D153420

Differential Revision: https://reviews.llvm.org/D153428
2023-06-26 16:46:42 +00:00
Nicolas Vasilache
40deed40ae [mlir][Transform] Introduce nvgpu transform extensions
Mapping to NVGPU operations such as mma.sync with mixed precision and ldmatrix with transposes and
various data types involves complex matchings from low-level IR.
This is akin to raising complex patterns after unnecessarily having lost structural information.
To avoid such unnecessary complexity, introduce a direct mapping step from a matmul on memrefs
to distributed NVGPU vector abstractions.
In this context, mapping to specific mma.sync operations is trivial and consists in simply
translating the documentation into indexing expressions.

Correctness is demonstrated with an end-to-end integration test.

Differential Revision: https://reviews.llvm.org/D153420
2023-06-26 16:21:28 +00:00