Allow TMA's last dimension to be non-128B when swizzling mode is not
set.
The test `tma_load_64x8_8x128_noswizzle.mlir` currently fails because of the
verifier. This PR fixes that.
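The relaxed rule can be sketched as follows. This is a hedged illustration in Python, not the verifier's actual code: the function name, the string-valued swizzle mode, and the assumption that a swizzled descriptor needs exactly 128 bytes in its last dimension are all illustrative.

```python
def last_dim_ok(box_shape, elem_bits, swizzle="none"):
    """Hypothetical check: only swizzled descriptors need a 128-byte last dimension."""
    last_dim_bytes = box_shape[-1] * elem_bits // 8
    if swizzle == "none":
        # With swizzling off, this rule places no constraint on the last dimension.
        return True
    # Assumption: any swizzle mode other than "none" requires exactly 128 bytes.
    return last_dim_bytes == 128
```

For example, assuming 16-bit elements, a 64x8 box has a 16-byte last dimension and passes only when swizzling is off.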
This PR adds the `nvgpu.tma.async.store` Op for asynchronous stores using the
Tensor Memory Accelerator (TMA) unit.
It also implements Op lowering to NVVM dialect. The Op currently
performs asynchronous stores of a tile memory region from shared to
global memory for a single CTA.
This PR improves and cleans up the verifiers of the TmaCreateDescriptor and
TmaAsyncLoad Ops and unifies them.
The verifiers now check the following, which they did not before:
- address space
- rank match between descriptor and memref
- element type match between descriptor and memref
- shape type match between descriptor and memref
`WarpgroupAccumulator` (or `!nvgpu.warpgroup.accumulator`) is a type
that keeps the accumulator matrix that is used by warp-group level
matrix multiplication. It is handy to have a special type for that as
the matrix is distributed among the threads of the warp-group. However,
the current transformations require creating and using multiple
`WarpgroupAccumulator` values if the shape of the GEMM is larger than the
shape supported by the `wgmma.mma_async` instruction. This makes the IR look dense.
This PR improves the transformation of `WarpgroupAccumulator` type in
every nvgpu Op that uses it.
**Example: Current GEMM in NVGPU-IR**
```
// Init
%m1, %m2 = nvgpu.warpgroup.mma.init.accumulator ->
!nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>,
!nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>
// GEMM
%r1, %r2 = nvgpu.warpgroup.mma %descA, %descB, %m1, %m2 {transposeB}:
!nvgpu.warpgroup.descriptor<tensor = memref<128x64xf16, 3>>,
!nvgpu.warpgroup.descriptor<tensor = memref<64x128xf16, 3>>,
!nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>,
!nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>
->
!nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>,
!nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>
// Epilogue
nvgpu.warpgroup.mma.store [%r1, %r2] to %sharedMemoryBuffer
: !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>,
!nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>
into memref<128x128xf32,3>
```
**Example: This PR simplifies the IR as below:**
```
// Init
%m = nvgpu.warpgroup.mma.init.accumulator ->
!nvgpu.warpgroup.accumulator<fragmented = vector<128x128xf32>>
// GEMM
%r1 = nvgpu.warpgroup.mma %descA, %descB, %m {transposeB}:
!nvgpu.warpgroup.descriptor<tensor = memref<128x64xf16, 3>>,
!nvgpu.warpgroup.descriptor<tensor = memref<64x128xf16, 3>>,
!nvgpu.warpgroup.accumulator<fragmented = vector<128x128xf32>>
->
!nvgpu.warpgroup.accumulator<fragmented = vector<128x128xf32>>
// Epilogue
nvgpu.warpgroup.mma.store %r1 to %sharedMemoryBuffer
: !nvgpu.warpgroup.accumulator<fragmented = vector<128x128xf32>>
into memref<128x128xf32,3>
```
This Op generates and initializes the accumulator matrix for the
`nvgpu.warpgroup.mma` op to perform matrix-multiply-and-accumulate
(mma).
Its associated transformation generates an `!llvm.struct<>` and fills it
with the initial values. The size of the struct is the number of inout
registers required by the `nvgpu.warpgroup.mma` op.
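As a rough sketch of how the struct size relates to the accumulator shape, the helper below divides the accumulator elements evenly over the warpgroup. It assumes a warpgroup of 128 threads and one 32-bit register per f32 element; the function and constant names are hypothetical and do not come from the patch.

```python
WARPGROUP_THREADS = 128  # assumption: 4 warps of 32 threads each

def accumulator_registers_per_thread(rows, cols):
    """Registers each thread holds if the elements are spread evenly."""
    total = rows * cols
    if total % WARPGROUP_THREADS != 0:
        raise ValueError("accumulator does not divide evenly across the warpgroup")
    return total // WARPGROUP_THREADS
```

Under these assumptions, a 64x128 f32 accumulator needs 64 registers per thread.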
This PR introduces a new Op called `warpgroup.mma.store` to the NVGPU
dialect of MLIR. The purpose of this operation is to facilitate storing
the fragmented result(s) of type `nvgpu.warpgroup.accumulator` produced by
`warpgroup.mma` to the given memref.
An example of a fragmented matrix is given here:
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#wgmma-64n16-d
The `warpgroup.mma.store` Op does the following:
1) Takes one or more values of type `nvgpu.warpgroup.accumulator` (fragmented
result matrices).
2) Calculates the indexes for each thread in the warp-group and stores the data into
the given memref.
Here's an example usage:
```
// A warpgroup performs GEMM, results in fragmented matrix
%result1, %result2 = nvgpu.warpgroup.mma ...
// Stores the fragmented result to memref
nvgpu.warpgroup.mma.store [%result1, %result2], %matrixD :
!nvgpu.warpgroup.accumulator< fragmented = vector<64x128xf32>>,
!nvgpu.warpgroup.accumulator< fragmented = vector<64x128xf32>>
to memref<128x128xf32,3>
```
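The per-thread index calculation in step 2 can be sketched in Python. This is an illustration only, not the lowering itself. It assumes the m64nN accumulator layout from the PTX documentation linked above: each warp owns 16 rows, and each lane holds adjacent column pairs in rows `lane / 4` and `lane / 4 + 8`. The function name is hypothetical.

```python
def fragment_coords(tid, reg):
    """Map (thread id in warpgroup, register index) to (row, col) of a 64xN tile."""
    warp, lane = divmod(tid, 32)
    pair, lo = divmod(reg, 2)          # registers come in adjacent column pairs
    col_block, half = divmod(pair, 2)  # half selects the row at +0 or +8
    row = warp * 16 + lane // 4 + 8 * half
    col = col_block * 8 + (lane % 4) * 2 + lo
    return row, col
```

With 128 threads and 64 registers each, this mapping covers every element of a 64x128 tile exactly once, which is the property a store needs in order to write the whole fragment.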
The NVGPU dialect is gaining broad support for warpgroup-level operations,
and their names always start with `warpgroup....`.
This PR changes the name of the Op and type from `wgmma.descriptor` to
`warpgroup.descriptor` for the sake of consistency.
This work introduces a new operation called `warpgroup.mma` to the NVGPU
dialect of MLIR. The purpose of this operation is to facilitate
warpgroup-level matrix multiply and accumulate (WGMMA) operations on
Hopper GPUs with sm_90a architecture.
Previously, the `nvvm.wgmma.mma_async` operation was introduced to
support warpgroup-level matrix operations in the NVVM dialect. Multiple
instances of `nvvm.wgmma.mma_async` must be used to achieve the desired
shape. The new `nvgpu.warpgroup.mma` operation abstracts this complexity
and provides a higher-level interface for performing warpgroup-level
matrix operations.
The `nvgpu.warpgroup.mma` Op does the following:
1) Corresponds to multiple `wgmma` instructions.
2) Iterates over the input matrix descriptors to achieve the desired computation
shape.
3) Groups and runs the `wgmma` instructions asynchronously, and
eventually waits for them. This is done with `wgmma.fence.aligned`,
`wgmma.commit.group.sync.aligned`, and `wgmma.wait.group.sync.aligned`.
4) Produces fragmented result matrices.
Here's an example usage of the `nvgpu.warpgroup.mma` operation:
```
%wgmmaResult, %wgmmaResult2 = nvgpu.warpgroup.mma %descA, %descB, %acc1, %acc2 {transposeB}:
!nvgpu.wgmma.descriptor<tensor = memref<128x64xf16, 3>>,
!nvgpu.wgmma.descriptor<tensor = memref<64x128xf16, 3>>,
!nvgpu.warpgroup.accumulator< fragmented = vector<64x128xf32>>,
!nvgpu.warpgroup.accumulator< fragmented = vector<64x128xf32>>
->
!nvgpu.warpgroup.accumulator< fragmented = vector<64x128xf32>>,
!nvgpu.warpgroup.accumulator< fragmented = vector<64x128xf32>>
```
The op results in the following PTX:
```
wgmma.fence.sync.aligned;
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f1, %f2, 62 more registers}, %descA, %descB, p, 1, 1, 0, 1;
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f1, %f2, 62 more registers}, %descA+2, %descB+128, p, 1, 1, 0, 1;
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f1, %f2, 62 more registers}, %descA+4, %descB+256, p, 1, 1, 0, 1;
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f1, %f2, 62 more registers}, %descA+6, %descB+384, p, 1, 1, 0, 1;
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f500,%f501, 62 more registers}, %descA+512, %descB, p, 1, 1, 0, 1;
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f500,%f501, 62 more registers}, %descA+514, %descB+128, p, 1, 1, 0, 1;
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f500,%f501, 62 more registers}, %descA+516, %descB+256, p, 1, 1, 0, 1;
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f500,%f501, 62 more registers}, %descA+518, %descB+384, p, 1, 1, 0, 1;
wgmma.commit_group.sync.aligned;
wgmma.wait_group.sync.aligned 1;
```
The Op keeps
- first 64 registers (`{%f1, %f2, 62 more registers}`) -> `%acc1`
- second 64 registers (`{%f500,%f501, 62 more registers}`) -> `%acc2`.
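The number of `wgmma.mma_async` instructions in the PTX above follows from the problem shape. The Python sketch below is illustrative only: it assumes a fixed m64 x k16 instruction tile, as in `m64n128k16`, with N covered by a single instruction, and the function name is hypothetical.

```python
def wgmma_schedule(m, k, inst_m=64, inst_k=16):
    """List (accumulator index, k-step) pairs, one per wgmma.mma_async."""
    if m % inst_m or k % inst_k:
        raise ValueError("shape must be a multiple of the instruction tile")
    return [(acc, step)
            for acc in range(m // inst_m)
            for step in range(k // inst_k)]
```

For the 128x64 by 64x128 example, `wgmma_schedule(128, 64)` yields eight entries: four k-steps for the first accumulator, then four for the second.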
This work introduces a new Op, `wgmma.generate.descriptor`, designed to create a wgmma descriptor for the inputs of matrix multiply and accumulate operations using the `wgmma.mma_async` PTX instruction.
The descriptor format specifications can be found in the following link:
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#asynchronous-warpgroup-level-matrix-shared-memory-layout-matrix-descriptor
It's important to note that this op is in its initial phase, and it does come with certain limitations. It only supports 128b swizzling and does not incorporate interleaving. In the future, different calculations will be addressed in separate works, expanding the capabilities of the op.
Reviewed By: qcolombet
Differential Revision: https://reviews.llvm.org/D157382
The Op creates a tensor map descriptor object representing a tiled memory region. The descriptor is used by the Tensor Memory Accelerator (TMA). The `tensor` is the source tensor to be tiled. The `boxDimensions` is the size of the tiled memory region in each dimension.
The pattern here lowers `tma.create.descriptor` to a runtime function call that eventually calls the CUDA Driver's `cuTensorMapEncodeTiled`. For more information see below:
https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__TENSOR__MEMORY.html
Depends on D155453
Reviewed By: nicolasvasilache
Differential Revision: https://reviews.llvm.org/D155680
This work adds the `nvgpu.tma.async.load` Op, which requests a TMA load asynchronously using an mbarrier object.
It also creates the `nvgpu.tma.descriptor` type. The type is supposed to be created by the CUDA driver API `cuTensorMapEncodeTiled`.
Reviewed By: nicolasvasilache
Differential Revision: https://reviews.llvm.org/D155453
This work improves the verifier for invalid cases. It is NFC.
Reviewed By: nicolasvasilache, springerm
Differential Revision: https://reviews.llvm.org/D155448
`mbarrier` is a barrier created in shared memory that supports different flavors of thread synchronization beyond `__syncthreads`. For more information see below.
https://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-mbarrier
This work adds initial Ops related to `mbarrier` to the nvgpu dialect.
First, it introduces two types:
- `mbarrier.barrier`, the barrier object in shared memory
- `mbarrier.barrier.token`, a token
It introduces the following Ops:
- `mbarrier.create` creates an `mbarrier.barrier`
- `mbarrier.init` initializes an `mbarrier.barrier`
- `mbarrier.arrive` performs arrive-on on an `mbarrier.barrier` and returns an `mbarrier.barrier.token`
- `mbarrier.arrive.nocomplete` performs a non-blocking arrive-on on an `mbarrier.barrier` and returns an `mbarrier.barrier.token`
- `mbarrier.test_wait` waits on an `mbarrier.barrier` and an `mbarrier.barrier.token`
Reviewed By: nicolasvasilache
Differential Revision: https://reviews.llvm.org/D154090
`nvgpu.device_async_copy` is lowered into the `cp.async` PTX instruction. However, the NVPTX backend does not support all of its modes, especially when zero padding is needed. Therefore, the current MLIR implementation generates inline assembly for that.
This work simplifies PTX generation for `nvgpu.device_async_copy` and implements it through the `NVVMToLLVM` pass.
Depends on D154060
Reviewed By: nicolasvasilache, manishucsd
Differential Revision: https://reviews.llvm.org/D154345
The MLIR classes Type/Attribute/Operation/Op/Value support
cast/dyn_cast/isa/dyn_cast_or_null functionality through llvm's doCast
functionality in addition to defining methods with the same name.
This change begins the migration of uses of the method to the
corresponding function call as has been decided as more consistent.
Note that there still exist classes that only define methods directly,
such as AffineExpr, and this change does not include the work needed to
support a functional cast/isa call for them.
Context:
* https://mlir.llvm.org/deprecation/ at "Use the free function variants for dyn_cast/cast/isa/…"
* Original discussion at https://discourse.llvm.org/t/preferred-casting-style-going-forward/68443
Implementation:
This follows a previous patch that updated calls from
`op.cast<T>()` to `cast<T>(op)`. However, some cases could not handle an
unprefixed `cast` call due to occurrences of variables named cast, or
because they occur inside class definitions, where the call would resolve to the method.
All C++ files that did not work automatically with `cast<T>()` are
updated here to `llvm::cast` and similar with the intention that they
can be easily updated after the methods are removed through a
find-replace.
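The later find-replace mentioned above could look roughly like the Python sketch below. It is only an illustration and not part of this change. The function name is hypothetical, and it assumes that only templated calls to the four cast functions named in this message should lose their `llvm::` prefix.

```python
import re

# Match `llvm::` only when it qualifies one of the cast-family functions
# and is followed by a template argument list.
_CAST_CALL = re.compile(r"\bllvm::(cast|dyn_cast|isa|dyn_cast_or_null)(?=\s*<)")

def strip_llvm_prefix(source):
    """Rewrite `llvm::cast<T>(x)` and friends to the unprefixed form."""
    return _CAST_CALL.sub(r"\1", source)
```

Other uses of the `llvm::` namespace, such as `llvm::SmallVector<int>`, are left untouched by this pattern.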
See https://github.com/llvm/llvm-project/compare/main...tpopp:llvm-project:tidy-cast-check
for the clang-tidy check that is used and then update printed
occurrences of the function to include `llvm::` before.
One can then run the following:
```
ninja -C $BUILD_DIR clang-tidy
run-clang-tidy -clang-tidy-binary=$BUILD_DIR/bin/clang-tidy -checks='-*,misc-cast-functions' \
  -export-fixes /tmp/cast/casts.yaml mlir/* \
  -header-filter=mlir/ -fix
rm -rf $BUILD_DIR/tools/mlir/**/*.inc
```
Differential Revision: https://reviews.llvm.org/D150348
This is a purely mechanical change that introduces an enum attribute in the GPU
dialect to represent the various memref memory spaces as opposed to the
hard-coded integer attributes that are currently used.
The following steps were taken to make the transition across the codebase:
1. Introduce a pass "gpu-lower-memory-space-attributes":
The pass updates all memref types that have a memory space attribute that is a
`gpu::AddressSpaceAttr`. These attributes are changed to `IntegerAttr`'s using a
mapping that is given by the caller. This pass is based on the
"map-memref-spirv-storage-class" pass and the common functions can probably
be refactored into a set of utilities under the MemRef dialect.
2. Update the verifiers of GPU/NVGPU dialect operations.
If a verifier currently checks the address space of an operand using
e.g. `getWorkspaceAddressSpace`, then it can continue to do so. However, the
checks are changed to only fail if the memory space is either missing or a wrong
value of type `gpu::AddressSpaceAttr`. Otherwise, it just assumes the address
space is correct because it was specifically lowered to something other than a
`gpu::AddressSpaceAttr`.
3. Update existing gpu-to-llvm conversion infrastructure.
In the existing gpu-to-X passes, we add a full conversion equivalent to
`gpu-lower-memory-space-attributes` just before doing the conversion to the
LLVMDialect. This is done because currently both the gpu-to-llvm passes
(rocdl,nvvm) run gpu-to-gpu rewrites within the pass, which introduce
`AddressSpaceAttr` memory space annotations. Therefore, I inserted the
memory space conversion between the gpu-to-gpu rewrites and the LLVM
conversion.
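Steps 1 and 2 can be modeled with a small Python sketch. This is a simplified illustration and not the pass or verifier code: the enum, the function names, and the integer values in `EXAMPLE_MAPPING` are hypothetical stand-ins for `gpu::AddressSpaceAttr` and the caller-provided mapping.

```python
from enum import Enum

class GpuAddressSpace(Enum):
    """Stand-in for gpu::AddressSpaceAttr values."""
    GLOBAL = "global"
    WORKGROUP = "workgroup"
    PRIVATE = "private"

# Illustrative caller-provided mapping; the integers are example values only.
EXAMPLE_MAPPING = {
    GpuAddressSpace.GLOBAL: 1,
    GpuAddressSpace.WORKGROUP: 3,
    GpuAddressSpace.PRIVATE: 5,
}

def lower_memory_space(space, mapping):
    """Step 1: replace enum memory spaces with integers, keep everything else."""
    if isinstance(space, GpuAddressSpace):
        return mapping[space]
    return space

def verify_memory_space(space, expected):
    """Step 2: fail only on a missing space or a wrong enum value."""
    if space is None:
        return False
    if isinstance(space, GpuAddressSpace):
        return space is expected
    # Already lowered to something else: assume it is correct.
    return True
```

Note how an already lowered integer space passes verification unconditionally, which mirrors the relaxed check described in step 2.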
For more context see the below discourse discussion:
https://discourse.llvm.org/t/gpu-workgroup-shared-memory-address-space-is-hard-coded/
Reviewed By: ftynse
Differential Revision: https://reviews.llvm.org/D140644
This change adds a new NVGPU operation that targets the PTX `mma.sp.sync`
instruction variants. A lowering to NVVM is provided using inline
assembly.
Reviewed By: ThomasRaoux, manishucsd
Differential Revision: https://reviews.llvm.org/D137202
Moves definition of DeviceAsyncToken to use the declarative Tablegen
TypeDef since the type is trivial. This also allows for removing the
current code for parsing/printing types by using the auto-generated
functions.
Reviewed By: ThomasRaoux
Differential Revision: https://reviews.llvm.org/D134564
Adds an optional attribute to support tensor cores on the F32 datatype by lowering to `mma.sync` with TF32 operands. Since TF32 is not a native datatype in LLVM, we are adding `tf32Enabled` as an attribute to allow the IR to be aware of the `MmaSyncOp` datatype. Additionally, this patch adds placeholders for an nvgpu-to-nvgpu transformation targeting higher-precision tf32x3.
For `mma.sync` on f32 input using tensor cores there are two possibilities:
(a) tf32 (1 `mma.sync` per warp-level matrix-multiply-accumulate)
(b) tf32x3 (3 `mma.sync` per warp-level matrix-multiply-accumulate)
Typically, tf32 tensor core acceleration comes at a cost of accuracy from missing precision bits. While f32 has 23 precision bits, tf32 has only 10 precision bits. tf32x3 aims to recover the precision bits by splitting each operand into two tf32 values and issuing three `mma.sync` tensor core operations.
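The splitting idea behind tf32x3 can be illustrated on scalars in Python. This is a numerical sketch only, not the transformation itself. It assumes tf32 conversion by truncating the low 13 mantissa bits and accumulates in Python doubles rather than f32; all function names are hypothetical.

```python
import struct

def f32(x):
    """Round a Python float to the nearest IEEE-754 binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def to_tf32(x):
    """Keep the top 10 of the 23 f32 mantissa bits (truncation is an assumption)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFE000))[0]

def split(x):
    """Split x into a tf32 'big' part and a tf32 'small' residual."""
    big = to_tf32(x)
    small = to_tf32(f32(x) - big)
    return big, small

def mul_tf32(a, b):
    """One product, as in plain tf32."""
    return to_tf32(a) * to_tf32(b)

def mul_tf32x3(a, b):
    """Three products, as in tf32x3; the small*small term is dropped."""
    a_big, a_small = split(a)
    b_big, b_small = split(b)
    return a_small * b_big + a_big * b_small + a_big * b_big
```

For inputs such as 1.2345678 and 7.654321, the three-product result is much closer to the f32 product than the single tf32 product, at the cost of three multiplications instead of one.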
Reviewed By: ThomasRaoux
Differential Revision: https://reviews.llvm.org/D130294
- Adds verification for `nvgpu.mma.sync` op
- Adds tests to `mlir/test/Dialect/NVGPU/invalid.mlir`
- `nvgpu.mma.sync` verifier caught a bug and triggered a failure in m16n8k4_tf32_f32 variant in `mlir/test/Conversion/NVGPUToNVVM/nvgpu-to-nvvm.mlir`
- The output shape of vector holding thread-level accumulators was inconsistent and fixed in this change
Reviewed By: ThomasRaoux
Differential Revision: https://reviews.llvm.org/D129400
Follow-up to flipping the dialects to _Both: flip the accessors used to the prefixed
variant ahead of flipping from _Both to _Prefixed. This just switches to
the accessors introduced in the preceding change, which are prefixed
forms of the existing accessors.
Mechanical change using helper script
https://github.com/jpienaar/llvm-project/blob/main/clang-tools-extra/clang-tidy/misc/AddGetterCheck.cpp and clang-format.
This change adds a transformation and pass to the NvGPU dialect that
attempts to optimize reads/writes from a memref representing GPU shared
memory in order to avoid bank conflicts. Given a value representing a
shared memory memref, it traverses all reads/writes within the parent op
and, subject to suitable conditions, rewrites all last dimension index
values such that element locations in the final (col) dimension are
given by
`newColIdx = col % vecSize + perm[row](col/vecSize,row)`
where `perm` is a permutation function indexed by `row` and `vecSize`
is the vector access size in elements (currently assumes 128-bit
vectorized accesses, but this can be made a parameter). This specific
transformation can help optimize typical distributed & vectorized accesses
common to loading matrix multiplication operands to/from shared memory.
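The index rewrite can be sketched in Python. This is an illustration under stated assumptions, not the pass's implementation: it assumes an XOR-based permutation of vector slots and a row width that is a power-of-two multiple of `vec_size`. The function and parameter names are hypothetical.

```python
def swizzled_col(row, col, vec_size, row_width):
    """newColIdx = col % vecSize + perm[row](col / vecSize, row), with an XOR perm."""
    num_vecs = row_width // vec_size
    if num_vecs & (num_vecs - 1):
        raise ValueError("number of vectors per row must be a power of two")
    vec_idx, within = divmod(col, vec_size)
    # Assumed permutation: XOR the vector slot with the row index.
    permuted_vec = vec_idx ^ (row % num_vecs)
    return within + permuted_vec * vec_size
```

With `vec_size=4` and `row_width=32`, each row is still a permutation of its 32 columns, while the same logical column in eight consecutive rows lands in eight different vector slots. That spreading is the property that helps avoid bank conflicts.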
Differential Revision: https://reviews.llvm.org/D127457
Move async copy operations to NVGPU as they only exist on the NV target and are
designed to match PTX semantics. This also allows us to add finer-grained
caching hint attributes to the op.
Add a hint to bypass L1 and hook it up to the NVVM op.
Differential Revision: https://reviews.llvm.org/D125244