37 Commits

Author SHA1 Message Date
Guray Ozen
e892f323ea
[mlir][nvgpu] Allow TMA's last dim to be non-128B without swizzling (#81499)
Allow TMA's last dimension to be non-128B when swizzling mode is not
set.

Test `tma_load_64x8_8x128_noswizzle.mlir` is failing due to the
verifier. This PR will fix that
2024-02-13 09:17:09 +01:00
Guray Ozen
f33a0a4835
[mlir][nvgpu] Improve tensormap.descriptor Type Verifier (#77904)
This PR improves the verifier for the `nvgpu.tensormap.descriptor` type.
The descriptor contains information for TMA, and the compile-time check
ensures its restrictions, such as the last memory dimension being
128-byte. This prevents runtime crashes.

See cuda driver for more explanation:

https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__TENSOR__MEMORY.html#group__CUDA__TENSOR__MEMORY_1ga7c7d2aaac9e49294304e755e6f341d7
2024-02-05 10:32:03 +01:00
Guray Ozen
8dd0d95c7c
[mlir][nvgpu] Add nvgpu.tma.async.store (#77811)
PR adds `nvgpu.tma.async.store` Op for asynchronous stores using the
Tensor Memory Access (TMA) unit.

It also implements Op lowering to NVVM dialect. The Op currently
performs asynchronous stores of a tile memory region from shared to
global memory for a single CTA.
2024-01-15 11:44:51 +01:00
Guray Ozen
249186701d
[mlir][nvgpu] Improve verifier of ldmatrix (#77807)
PR improves the verifier of `nvgpu.ldmatrix` Op, so `nvgpu-to-nvvm`
lowering does not crash.
2024-01-12 08:57:12 +01:00
Jie Fu
b44399296a [mlir] Fix -Wsign-compare in NVGPUDialect.cpp (NFC)
/llvm-project/mlir/lib/Dialect/NVGPU/IR/NVGPUDialect.cpp:396:31: error: comparison of integers of different signs: 'size_t' (aka 'unsigned long') and 'int64_t' (aka 'long long') [-Werror,-Wsign-compare]
  if (getCoordinates().size() !=
      ~~~~~~~~~~~~~~~~~~~~~~~ ^
1 error generated.
2023-11-08 20:26:33 +08:00
Guray Ozen
6eb97f0380
[MLIR][NVGPU] Improve and Cleanup verifier of TMA OPs (#70923)
This PR improves and cleans-up verifiers of TmaCreateDescriptor and
TmaAsyncLoad Ops and unifies them.

The PR verifiers followings that didn't before:
- address space
- rank match between descriptor and memref
- element type match between descriptor and memref
- shape type match between descriptor and memref
2023-11-08 12:03:16 +01:00
Guray Ozen
52db7e2745
[mlir][nvgpu] Improve WarpgroupAccumulator type to simplify IR (#68728)
`WarpgroupAccumulator` (or `!nvgpu.warpgroup.accumulator`) is a type
that keeps the accumulator matrix that is used by warp-group level
matrix multiplication. It is handy to have a special type for that as
the matrix is distributed among the threads of the warp-group. However,
current transformations requires to create and use multiple
`WarpgroupAccumulator` if the shape of GEMM is larger than the supported
shape of `wgmma.mma_async` instruction. This makes IR looks dense.

This PR improves the transformation of `WarpgroupAccumulator` type in
every nvgpu Op that uses it.

**Example: Current GEMM in NVGPU-IR**
```
// Init
%m1, %m2 = nvgpu.warpgroup.mma.init.accumulator ->  
                    !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>,
                    !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>

// GEMM
%r1, %r2 = nvgpu.warpgroup.mma %descA, %descB, %m1, %m2 {transposeB}: 
      !nvgpu.warpgroup.descriptor<tensor = memref<128x64xf16, 3>>, 
      !nvgpu.warpgroup.descriptor<tensor = memref<64x128xf16, 3>>, 
      !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>,
      !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>> 
      -> 
      !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>,
      !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>  


// Epilogue 
nvgpu.warpgroup.mma.store [%r1, %r2] to %sharedMemoryBuffer
  : !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>, 
    !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>
    into memref<128x128xf32,3>
```

**Example: This PR simplifies the IR as below:**
```
// Init
%m = nvgpu.warpgroup.mma.init.accumulator ->  
           !nvgpu.warpgroup.accumulator<fragmented = vector<128x128xf32>>

// GEMM
%r1 = nvgpu.warpgroup.mma %descA, %descB, %m1 {transposeB}: 
      !nvgpu.warpgroup.descriptor<tensor = memref<128x64xf16, 3>>, 
      !nvgpu.warpgroup.descriptor<tensor = memref<64x128xf16, 3>>, 
      !nvgpu.warpgroup.accumulator<fragmented = vector<128x128xf32>> 
      -> 
      !nvgpu.warpgroup.accumulator<fragmented = vector<128x128xf32>>  

// Epilogue 
nvgpu.warpgroup.mma.store [%matrixD1, %matrixD2] to %sharedMemoryBuffer
  : !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>, 
    !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>
    into memref<128x128xf32,3>
```
2023-10-17 11:46:47 +02:00
Guray Ozen
315ab3c44b
[MLIR][NVGPU] Introduce warpgroup.init.accumulator Op (#67530)
This Op generates and initilizes the accumulator matrix for
`nvgpu.warpgroup.mma` op to perform matrix-multiply-and-accumulate
(mma).

Its associated transformation generates `!llvm.struct<>` and fill it
with the initial values. The size of struct is number of required inout
registers for `nvgpu.warpgroup.mma` op.
2023-10-11 08:28:26 -07:00
Guray Ozen
d20fbc9007
[MLIR][NVGPU] Introduce nvgpu.wargroup.mma.store Op for Hopper GPUs (#65441)
This PR introduces a new Op called `warpgroup.mma.store` to the NVGPU
dialect of MLIR. The purpose of this operation is to facilitate storing
fragmanted result(s) `nvgpu.warpgroup.accumulator` produced by
`warpgroup.mma` to the given memref.

An example of fragmentated matrix is given here :

https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#wgmma-64n16-d

The `warpgroup.mma.store` does followings:
1) Takes one or more `nvgpu.warpgroup.accumulator` type (fragmented
results matrix)
2) Calculates indexes per thread in warp-group and stores the data into
give memref.

Here's an example usage:
```
// A warpgroup performs GEMM, results in fragmented matrix
%result1, %result2 = nvgpu.warpgroup.mma ...

// Stores the fragmented result to memref
nvgpu.warpgroup.mma.store [%result1, %result2], %matrixD : 
    !nvgpu.warpgroup.accumulator< fragmented = vector<64x128xf32>>,
    !nvgpu.warpgroup.accumulator< fragmented = vector<64x128xf32>> 
    to memref<128x128xf32,3>
```
2023-10-05 10:54:13 +02:00
Guray Ozen
7eb2b99f16
[mlir] Change the class name of the GenerateWarpgroupDescriptor (#68286) 2023-10-05 10:15:40 +02:00
Guray Ozen
6dc7717bca
[MLIR][NVGPU] Change name wgmma.descriptor to warpgroup.descriptor (NFC) (#67526)
NVGPU dialect is gaining large support for warpgroup level operations,
and their names always starts with `warpgroup....`.

This PR changes name of Op and type from `wgmma.descriptor` to
`warpgroup.descriptor` for sake of consistency.
2023-10-05 09:01:48 +02:00
Kazu Hirata
3bca659556 Use llvm::is_contained (NFC) 2023-09-22 17:20:50 -07:00
Guray Ozen
2388222695
[MLIR][NVGPU] Adding nvgpu.warpgroup.mma Op for Hopper GPUs (#65440)
This work introduces a new operation called `warpgroup.mma` to the NVGPU
dialect of MLIR. The purpose of this operation is to facilitate
warpgroup-level matrix multiply and accumulate (WGMMA) operations on
Hopper GPUs with sm_90a architecture.

Previously, the `nvvm.wgmma.mma_async` operation was introduced to
support warpgroup-level matrix operations in NVVM dialect. This op is
used multiple instances of `nvvm.wgmma.mma_async` to achieve the desired
shape. The new `nvgpu.warpgroup.mma` operation abstracts this complexity
and provides a higher-level interface for performing warpgroup-level
matrix operations.

The `nvgpu.warpgroup.mma` does followings:
1) Corresponds multiple `wgmma` instructions.
2) Iterates input matrix descriptors to achieve the desired computation
shape. 3) Groups and runs `wgmma` instructions asynchronously, and
eventually waits them. This are done by `wgmma.fence.aligned`,
`wgmma.commit.group.sync.aligned`, and `wgmma.wait.group.sync.aligned`
4) Results fragmented matrices

Here's an example usage of the `nvgpu.warpgroup.mma` operation:
```
%wgmmaResult, %wgmmaResult2 = nvgpu.warpgroup.mma %descA, %descB, %acc1, %acc2 {transposeB}: 
      !nvgpu.wgmma.descriptor<tensor = memref<128x64xf16, 3>>, 
      !nvgpu.wgmma.descriptor<tensor = memref<64x128xf16, 3>>, 
      !nvgpu.warpgroup.accumulator< fragmented = vector<64x128xf32>>,
      !nvgpu.warpgroup.accumulator< fragmented = vector<64x128xf32>> 
      -> 
      !nvgpu.warpgroup.accumulator< fragmented = vector<64x128xf32>>, 
      !nvgpu.warpgroup.accumulator< fragmented = vector<64x128xf32>>  
```

The op will result following PTX:
```
wgmma.fence.sync.aligned;
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f1, %f2,    62 more registers}, %descA,     %descB,     p, 1, 1, 0, 1;
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f1, %f2,    62 more registers}, %descA+2,   %descB+128, p, 1, 1, 0, 1;
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f1, %f2,    62 more registers}, %descA+4,   %descB+256, p, 1, 1, 0, 1;
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f1, %f2,    62 more registers}, %descA+8,   %descB+348, p, 1, 1, 0, 1;
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f500,%f501, 62 more registers}, %descA+512, %descB,     p, 1, 1, 0, 1;
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f500,%f501, 62 more registers}, %descA+514, %descB+128, p, 1, 1, 0, 1;
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f500,%f501, 62 more registers}, %descA+516, %descB+256, p, 1, 1, 0, 1;
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f500,%f501, 62 more registers}, %descA+518, %descB+348, p, 1, 1, 0, 1;
wgmma.commit_group.sync.aligned;
wgmma.wait_group.sync.aligned 1;
```

The Op keeps 
  - first 64 registers (`{%f1, %f2,    62 more registers}`) -> `%acc1` 
- second 64 registers (`{%f500,%f501, 62 more registers}`) -> `%acc2`.
2023-09-22 11:46:29 +02:00
Guray Ozen
cce3e8ed89 [MLIR][NVGPU] Introduction of wgmma.generate.descriptor Op
This work introduces a new Op, `wgmma.generate.descriptor`, designed to create a wgmma descriptor for inputs of matrix multiply and accumulate operations using `wgmma.mma_async` PTX instruction.

The descriptor format specifications can be found in the following link:
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#asynchronous-warpgroup-level-matrix-shared-memory-layout-matrix-descriptor

It's important to note that this op is in its initial phase, and it does come with certain limitations. It only supports 128b swizzling and does not incorporate interleaving. In the future, different calculations will be addressed in separate works, expanding the capabilities of the op.

Reviewed By: qcolombet

Differential Revision: https://reviews.llvm.org/D157382
2023-08-22 16:12:25 +02:00
Guray Ozen
4622113820 [mlir][nvgpu] Set useDefaultAttributePrinterParser
Differential Revision: https://reviews.llvm.org/D155959
2023-07-21 17:00:39 +02:00
Jie Fu
3fd1790638 [mlir][nvgpu] Ignore -Wunused-function in NVGPUDialect.cpp (NFC)
In file included from /Users/jiefu/llvm-project/mlir/lib/Dialect/NVGPU/IR/NVGPUDialect.cpp:363:
/Users/jiefu/llvm-project/build-Release/tools/mlir/include/mlir/Dialect/NVGPU/IR/NVGPUAttrDefs.cpp.inc:22:36: error: unused function 'generatedAttributeParser' [-Werror,-Wunused-function]
static ::mlir::OptionalParseResult generatedAttributeParser(::mlir::AsmParser &parser, ::llvm::StringRef *mnemonic, ::mlir::Type type, ::mlir::Attribute &value) {
                                   ^
/Users/jiefu/llvm-project/build-Release/tools/mlir/include/mlir/Dialect/NVGPU/IR/NVGPUAttrDefs.cpp.inc:46:30: error: unused function 'generatedAttributePrinter' [-Werror,-Wunused-function]
static ::mlir::LogicalResult generatedAttributePrinter(::mlir::Attribute def, ::mlir::AsmPrinter &printer) {
                             ^
2 errors generated.
2023-07-21 20:50:48 +08:00
Guray Ozen
e56d6745f7 [mlir][nvgpu] Add tma.create.descriptor to create tensor map descriptor
The Op creates a tensor map descriptor object representing tiled memory region. The descriptor is used by Tensor Memory Access (TMA). The `tensor` is the source tensor to be tiled. The `boxDimensions` is the size of the tiled memory region in each dimension.

The pattern here lowers `tma.create.descriptor` to a runtime function call that eventually calls calls CUDA Driver's `cuTensorMapEncodeTiled`. For more information see below:
https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__TENSOR__MEMORY.html

Depends on D155453

Reviewed By: nicolasvasilache

Differential Revision: https://reviews.llvm.org/D155680
2023-07-21 11:33:04 +02:00
Guray Ozen
70c2e0618a [mlir][nvgpu] Add nvgpu.tma.async.load and nvgpu.tma.descriptor
This work adds `nvgpu.tma.async.load` Op that requests tma load asyncronusly using mbarrier object.

It also creates nvgpu.tma.descriptor type. The type is supposed be created by `cuTensorMapEncodeTiled` cuda drivers api.

Reviewed By: nicolasvasilache

Differential Revision: https://reviews.llvm.org/D155453
2023-07-21 10:23:25 +02:00
Guray Ozen
960ab5225b [mlir][nvgpu] Verify invalid copy size (nfc)
This work improves verifier for invalid cases. It is NFC.

Reviewed By: nicolasvasilache, springerm

Differential Revision: https://reviews.llvm.org/D155448
2023-07-17 17:09:33 +02:00
Matthias Springer
b1d2687501 [mlir][IR] Remove duplicate isLastMemrefDimUnitStride functions
This function is duplicated in various dialects.

Differential Revision: https://reviews.llvm.org/D155462
2023-07-17 16:31:04 +02:00
Guray Ozen
affcfccd3c [mlir][nvgpu] Add initial support for mbarrier
`mbarrier` is a barrier created in shared memory that supports different flavors of synchronizing threads other than `__syncthreads`, for more information see below.
https://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-mbarrier

This work adds initial Ops wrt `mbarrier` to nvgpu dialect.

First, it introduces to two types:
`mbarrier.barrier` that is barrier object in shared memory
`mbarrier.barrier.token` that is token

It introduces following Ops:
`mbarrier.create` creates `mbarrier.barrier`
`mbarrier.init` initializes `mbarrier.barrier`
`mbarrier.arrive` performs arrive-on `mbarrier.barrier` returns `mbarrier.barrier.token`
`mbarrier.arrive.nocomplete` performs arrive-on (non-blocking) `mbarrier.barrier` returns `mbarrier.barrier.token`
`mbarrier.test_wait` waits on `mbarrier.barrier` and `mbarrier.barrier.token`

Reviewed By: nicolasvasilache

Differential Revision: https://reviews.llvm.org/D154090
2023-07-11 17:35:27 +02:00
Guray Ozen
2c5739675c [mlir][nvgpu] Implement nvgpu.device_async_copy by NVVMToLLVM Pass
`nvgpu.device_async_copy` is lowered into `cp.async` PTX instruction. However, NVPTX backend does not support its all mode especially when zero padding is needed. Therefore, current MLIR implementation genereates inline assembly for that.

This work simplifies PTX generation for `nvgpu.device_async_copy`, and implements it by `NVVMToLLVM` Pass.

Depends on D154060

Reviewed By: nicolasvasilache, manishucsd

Differential Revision: https://reviews.llvm.org/D154345
2023-07-11 12:18:28 +02:00
Nicolas Vasilache
8c80d01a95 [mlir][NVGPU] NFC - Add a more convenient C++ builder for nvgpu::MmaSyncOp 2023-06-19 13:54:31 +00:00
Tres Popp
c1fa60b4cd [mlir] Update method cast calls to function calls
The MLIR classes Type/Attribute/Operation/Op/Value support
cast/dyn_cast/isa/dyn_cast_or_null functionality through llvm's doCast
functionality in addition to defining methods with the same name.
This change begins the migration of uses of the method to the
corresponding function call as has been decided as more consistent.

Note that there still exist classes that only define methods directly,
such as AffineExpr, and this does not include work currently to support
a functional cast/isa call.

Context:

* https://mlir.llvm.org/deprecation/ at "Use the free function variants for dyn_cast/cast/isa/…"
* Original discussion at https://discourse.llvm.org/t/preferred-casting-style-going-forward/68443

Implementation:
This follows a previous patch that updated calls
`op.cast<T>()-> cast<T>(op)`. However some cases could not handle an
unprefixed `cast` call due to occurrences of variables named cast, or
occurring inside of class definitions which would resolve to the method.
All C++ files that did not work automatically with `cast<T>()` are
updated here to `llvm::cast` and similar with the intention that they
can be easily updated after the methods are removed through a
find-replace.

See https://github.com/llvm/llvm-project/compare/main...tpopp:llvm-project:tidy-cast-check
for the clang-tidy check that is used and then update printed
occurrences of the function to include `llvm::` before.

One can then run the following:
```
ninja -C $BUILD_DIR clang-tidy

run-clang-tidy -clang-tidy-binary=$BUILD_DIR/bin/clang-tidy -checks='-*,misc-cast-functions'\
                 -export-fixes /tmp/cast/casts.yaml mlir/*\
                 -header-filter=mlir/ -fix

rm -rf $BUILD_DIR/tools/mlir/**/*.inc
```

Differential Revision: https://reviews.llvm.org/D150348
2023-05-12 11:21:30 +02:00
Aart Bik
a3bb693420 [mlir][gpu][nvvm] refined sparsity selector test and verification of mma.sp
Reviewed By: ThomasRaoux

Differential Revision: https://reviews.llvm.org/D146319
2023-03-17 15:50:36 -07:00
Christopher Bate
6ca1a09f03 [mlir][gpu] Migrate hard-coded address space integers to an enum attribute (gpu::AddressSpaceAttr)
This is a purely mechanical change that introduces an enum attribute in the GPU
dialect to represent the various memref memory spaces as opposed to the
hard-coded integer attributes that are currently used.

The following steps were taken to make the transition across the codebase:

1. Introduce a pass "gpu-lower-memory-space-attributes":

The pass updates all memref types that have a memory space attribute that is a
`gpu::AddressSpaceAttr`. These attributes are changed to `IntegerAttr`'s using a
mapping that is given by the caller. This pass is based on the
"map-memref-spirv-storage-class" pass and the common functions can probably
be refactored into a set of utilities under the MemRef dialect.

2. Update the verifiers of GPU/NVGPU dialect operations.

If a verifier currently checks the address space of an operand using
e.g.`getWorkspaceAddressSpace`, then it can continue to do so. However, the
checks are changed to only fail if the memory space is either missing or a wrong
value of type `gpu::AddressSpaceAttr`. Otherwise, it just assumes the address
space is correct because it was specifically lowered to something other than a
`gpu::AddressSpaceAttr`.

3. Update existing gpu-to-llvm conversion infrastructure.

In the existing gpu-to-X passes, we add a full conversion equivalent to
`gpu-lower-memory-space-attributes` just before doing the conversion to the
LLVMDialect. This is done because currently both the gpu-to-llvm passes
(rocdl,nvvm) run gpu-to-gpu rewrites within the pass, which introduce
`AddressSpaceAttr` memory space annotations. Therefore, I inserted the
memory space conversion between the gpu-to-gpu rewrites and the LLVM
conversion.

For more context see the below discourse discussion:
https://discourse.llvm.org/t/gpu-workgroup-shared-memory-address-space-is-hard-coded/

Reviewed By: ftynse

Differential Revision: https://reviews.llvm.org/D140644
2023-01-13 11:00:10 -07:00
Christopher Bate
708185f03f [mlir][NVGPU] Add support for structured sparsity MMA variants
This change adds a new NVGPU operation that targets the PTX `mma.sp.sync`
instruction variants. A lowering to NVVM is provided using inline
assembly.

Reviewed By: ThomasRaoux, manishucsd

Differential Revision: https://reviews.llvm.org/D137202
2022-11-07 09:43:03 -07:00
Christopher Bate
afb0b21f24 [mlir][nvgpu] Use TableGen TypeDef for NVGPU dialect types
Moves definition of DeviceAsyncToken to use the declarative Tablegen
TypeDef since the type is trivial. This also allows for removing the
current code for parsing/printing types by using the auto-generated
functions.

Reviewed By: ThomasRaoux

Differential Revision: https://reviews.llvm.org/D134564
2022-09-23 19:46:03 -06:00
Mehdi Amini
d2c7c725c1 Apply clang-tidy fixes for readability-simplify-boolean-expr in NVGPUDialect.cpp (NFC) 2022-09-05 12:34:47 +00:00
Manish Gupta
14d79afeae [mlir][NVGPU] nvgpu.mmasync on F32 through TF32
Adds optional attribute to support tensor cores on F32 datatype by lowering to `mma.sync` with TF32 operands. Since, TF32 is not a native datatype in LLVM we are adding `tf32Enabled` as an attribute to allow the IR to be aware of `MmaSyncOp` datatype. Additionally, this patch adds placeholders for nvgpu-to-nvgpu transformation targeting higher precision tf32x3.

For mma.sync on f32 input using tensor cores there are two possibilites:
(a) tf32   (1 `mma.sync` per warp-level matrix-multiply-accumulate)
(b) tf32x3 (3 `mma.sync` per warp-level matrix-multiply-accumulate)

Typically, tf32 tensor core acceleration comes at a cost of accuracy from missing precision bits. While f32 has 23 precision bits, tf32 has only 10 precision bits. tf32x3 aims to recover the precision bits by splitting each operand into two tf32 values and issue three `mma.sync` tensor core operations.

Reviewed By: ThomasRaoux

Differential Revision: https://reviews.llvm.org/D130294
2022-08-01 23:23:27 +00:00
Manish Gupta
713d3de5fb [mlir][NVGPU] Verifier for nvgpu.ldmatrix
* Adds verifiers for `nvgpu.ldmatrix` op
* Adds tests to `mlir/test/Dialect/NVGPU/invalid.mlir`

Reviewed By: ThomasRaoux

Differential Revision: https://reviews.llvm.org/D129669
2022-07-14 22:46:38 +00:00
Manish Gupta
f7d42d5149 [mlir][NVGPU] Verifiers for nvgpu.mma.sync Op
- Adds verification for `nvgpu.mma.sync` op
- Adds tests to `mlir/test/Dialect/NVGPU/invalid.mlir`
- `nvgpu.mma.sync` verifier caught a bug and triggered a failure in m16n8k4_tf32_f32 variant in `mlir/test/Conversion/NVGPUToNVVM/nvgpu-to-nvvm.mlir`
     - The output shape of vector holding thread-level accumulators was inconsistent  and fixed in this change

Reviewed By: ThomasRaoux

Differential Revision: https://reviews.llvm.org/D129400
2022-07-13 18:57:07 +00:00
Jacques Pienaar
8df54a6a03 [mlir] Update accessors to prefixed form (NFC)
Follow up from flipping dialects to both, flip accessor used to prefixed
variant ahead to flipping from _Both to _Prefixed. This just flips to
the accessors introduced in the preceding change which are just prefixed
forms of the existing accessor changed from.

Mechanical change using helper script
https://github.com/jpienaar/llvm-project/blob/main/clang-tools-extra/clang-tidy/misc/AddGetterCheck.cpp and clang-format.
2022-06-18 17:53:22 -07:00
Christopher Bate
51b925df94 [mlir][nvgpu] shared memory access optimization pass
This change adds a transformation and pass to the NvGPU dialect that
attempts to optimize reads/writes from a  memref representing GPU shared
memory in order to avoid bank conflicts. Given a value representing a
shared memory memref, it traverses all reads/writes within the parent op
and, subject to suitable conditions, rewrites all last dimension index
values such that element locations in the final (col) dimension are
given by
`newColIdx = col % vecSize + perm[row](col/vecSize,row)`
where `perm` is a permutation function indexed by `row` and `vecSize`
is the vector access size in elements (currently assumes 128bit
vectorized accesses, but this can be made a parameter). This specific
transformation can help optimize typical distributed & vectorized accesses
common to loading matrix multiplication operands to/from shared memory.

Differential Revision: https://reviews.llvm.org/D127457
2022-06-17 09:31:05 -06:00
Mogball
d7ef488bb6 [mlir][gpu] Move GPU headers into IR/ and Transforms/
Depends on D127350

Reviewed By: rriddle

Differential Revision: https://reviews.llvm.org/D127352
2022-06-09 22:49:03 +00:00
Thomas Raoux
15bcc36eed [mlir][gpu] Move async copy ops to NVGPU and add caching hints
Move async copy operations to NVGPU as they only exist on NV target and are
designed to match ptx semantic. This allows us to also add more fine grain
caching hint attribute to the op.
Add hint to bypass L1 and hook it up to NVVM op.

Differential Revision: https://reviews.llvm.org/D125244
2022-05-10 22:30:24 +00:00
Thomas Raoux
4c564940a1 [mlir][nvgpu] Add NVGPU dialect (architectural specific gpu dialect)
This introduce a new dialect for vendro specific ptx operations. This
also adds the first operation ldmatrix as an example. More operations
will be added in follow up patches.
This new dialect is meant to be a bridge between GPU and Vector
dialectis and NVVM dialect.

This is based on the RFC proposed here:
https://discourse.llvm.org/t/rfc-add-nv-gpu-dialect-hw-specific-extension-of-gpu-dialect-for-nvidia-gpus/61466/8

Differential Revision: https://reviews.llvm.org/D123266
2022-04-14 16:33:46 +00:00