39 Commits

Author SHA1 Message Date
Guray Ozen
0a600c34c8
[mlir][nvgpu] Make phaseParity of mbarrier.try_wait i1 (#81460)
Currently, `phaseParity` argument of `nvgpu.mbarrier.try_wait.parity` is
index. This can cause a problem if it's passed any value different than
0 or 1. Because the PTX instruction only accepts even or odd phase. This
PR makes phaseParity argument i1 to avoid misuse.

Here is the information from PTX doc:

```
The .parity variant of the instructions test for the completion of the phase indicated 
by the operand phaseParity, which is the integer parity of either the current phase or 
the immediately preceding phase of the mbarrier object. An even phase has integer 
parity 0 and an odd phase has integer parity of 1. So the valid values of phaseParity 
operand are 0 and 1.
```
See for more information:

https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-mbarrier-test-wait-mbarrier-try-wait
2024-02-13 09:50:34 +01:00
Guray Ozen
f33a0a4835
[mlir][nvgpu] Improve tensormap.descriptor Type Verifier (#77904)
This PR improves the verifier for the `nvgpu.tensormap.descriptor` type.
The descriptor contains information for TMA, and the compile-time check
ensures its restrictions, such as the last memory dimension being
128-byte. This prevents runtime crashes.

See cuda driver for more explanation:

https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__TENSOR__MEMORY.html#group__CUDA__TENSOR__MEMORY_1ga7c7d2aaac9e49294304e755e6f341d7
2024-02-05 10:32:03 +01:00
Guray Ozen
3477bcf4b9
[mlir][nvgpu] Mark TMA descriptor as MemWriteAt in tma.async.store (#79427) 2024-01-30 19:36:59 +01:00
Guray Ozen
249186701d
[mlir][nvgpu] Improve verifier of ldmatrix (#77807)
PR improves the verifier of `nvgpu.ldmatrix` Op, so `nvgpu-to-nvvm`
lowering does not crash.
2024-01-12 08:57:12 +01:00
Matthias Springer
bb6d5c2200
[mlir][Transforms] GreedyPatternRewriteDriver: Do not CSE constants during iterations (#75897)
The `GreedyPatternRewriteDriver` tries to iteratively fold ops and apply
rewrite patterns to ops. It has special handling for constants: they are
CSE'd and sometimes moved to parent regions to allow for additional
CSE'ing. This happens in `OperationFolder`.

To allow for efficient CSE'ing, `OperationFolder` maintains an internal
lookup data structure to find the existing constant ops with the same
value for each `IsolatedFromAbove` region:
```c++
/// A mapping between an insertion region and the constants that have been
/// created within it.
DenseMap<Region *, ConstantMap> foldScopes;
```

Rewrite patterns are allowed to modify operations. In particular, they
may move operations (including constants) from one region to another
one. Such an IR rewrite can make the above lookup data structure
inconsistent.

We encountered such a bug in a downstream project. This bug materialized
in the form of an op that uses the result of a constant op from a
different `IsolatedFromAbove` region (that is not accessible).

This commit changes the behavior of the `GreedyPatternRewriteDriver`
such that `OperationFolder` is used to CSE constants at the beginning of
each iteration (as the worklist is populated), but no longer during an
iteration. `OperationFolder` is no longer used after populating the
worklist, so we do not have to care about inconsistent state in the
`OperationFolder` due to IR rewrites. The `GreedyPatternRewriteDriver`
now performs the op folding by itself instead of calling
`OperationFolder::tryToFold`.

This change changes the order of constant ops in test cases, but not the
region in which they appear. All broken test cases were fixed by turning
`CHECK` into `CHECK-DAG`.

Alternatives considered: The state of `OperationFolder` could be
partially invalidated with every `notifyOperationModified` notification.
That is more fragile than the solution in this commit because incorrect
rewriter API usage can lead to missing notifications and hard-to-debug
`IsolatedFromAbove` violations. (It did not fix the above mention bug in
a downstream project, which could be due to incorrect rewriter API usage
or due to another conceptual problem that I missed.) Moreover, ops are
frequently getting modified during a greedy pattern rewrite, so we would
likely keep invalidating large parts of the state of `OperationFolder`
over and over.

Migration guide: Turn `CHECK` into `CHECK-DAG` in test cases. Constant
ops are no longer folded during a greedy pattern rewrite. If you rely on
folding (and rematerialization) of constant ops during a greedy pattern
rewrite, turn the folder into a pattern.
2024-01-05 09:22:18 +01:00
Thomas Raoux
ef112833e1
[MLIR][SCF] Add support for pipelining dynamic loops (#74350)
Support loops without static boundaries. Since the number of iteration
is not known we need to predicate prologue and epilogue in case the
number of iterations is smaller than the number of stages.

This patch includes work from @chengjunlu
2023-12-10 22:32:11 -08:00
Rik Huijzer
ae6eedd275
[mlir] Fix two CHECK: typos (#73803)
Out of curiosity, I ran [typos](https://github.com/crate-ci/typos)
against MLIR. It found two `CHECK:` typos (and many minor typos; which
I'm not gonna work on today).
2023-11-30 10:19:27 +01:00
Guray Ozen
6eb97f0380
[MLIR][NVGPU] Improve and Cleanup verifier of TMA OPs (#70923)
This PR improves and cleans-up verifiers of TmaCreateDescriptor and
TmaAsyncLoad Ops and unifies them.

The PR verifiers followings that didn't before:
- address space
- rank match between descriptor and memref
- element type match between descriptor and memref
- shape type match between descriptor and memref
2023-11-08 12:03:16 +01:00
Oleksandr "Alex" Zinenko
e4384149b5
[mlir] use transform-interpreter in test passes (#70040)
Update most test passes to use the transform-interpreter pass instead of
the test-transform-dialect-interpreter-pass. The new "main" interpreter
pass has a named entry point instead of looking up the top-level op with
`PossibleTopLevelOpTrait`, which is arguably a more understandable
interface. The change is mechanical, rewriting an unnamed sequence into
a named one and wrapping the transform IR in to a module when necessary.

Add an option to the transform-interpreter pass to target a tagged
payload op instead of the root anchor op, which is also useful for repro
generation.

Only the test in the transform dialect proper and the examples have not
been updated yet. These will be updated separately after a more careful
consideration of testing coverage of the transform interpreter logic.
2023-10-24 16:12:34 +02:00
Guray Ozen
52db7e2745
[mlir][nvgpu] Improve WarpgroupAccumulator type to simplify IR (#68728)
`WarpgroupAccumulator` (or `!nvgpu.warpgroup.accumulator`) is a type
that keeps the accumulator matrix that is used by warp-group level
matrix multiplication. It is handy to have a special type for that as
the matrix is distributed among the threads of the warp-group. However,
current transformations requires to create and use multiple
`WarpgroupAccumulator` if the shape of GEMM is larger than the supported
shape of `wgmma.mma_async` instruction. This makes IR looks dense.

This PR improves the transformation of `WarpgroupAccumulator` type in
every nvgpu Op that uses it.

**Example: Current GEMM in NVGPU-IR**
```
// Init
%m1, %m2 = nvgpu.warpgroup.mma.init.accumulator ->  
                    !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>,
                    !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>

// GEMM
%r1, %r2 = nvgpu.warpgroup.mma %descA, %descB, %m1, %m2 {transposeB}: 
      !nvgpu.warpgroup.descriptor<tensor = memref<128x64xf16, 3>>, 
      !nvgpu.warpgroup.descriptor<tensor = memref<64x128xf16, 3>>, 
      !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>,
      !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>> 
      -> 
      !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>,
      !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>  


// Epilogue 
nvgpu.warpgroup.mma.store [%r1, %r2] to %sharedMemoryBuffer
  : !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>, 
    !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>
    into memref<128x128xf32,3>
```

**Example: This PR simplifies the IR as below:**
```
// Init
%m = nvgpu.warpgroup.mma.init.accumulator ->  
           !nvgpu.warpgroup.accumulator<fragmented = vector<128x128xf32>>

// GEMM
%r1 = nvgpu.warpgroup.mma %descA, %descB, %m1 {transposeB}: 
      !nvgpu.warpgroup.descriptor<tensor = memref<128x64xf16, 3>>, 
      !nvgpu.warpgroup.descriptor<tensor = memref<64x128xf16, 3>>, 
      !nvgpu.warpgroup.accumulator<fragmented = vector<128x128xf32>> 
      -> 
      !nvgpu.warpgroup.accumulator<fragmented = vector<128x128xf32>>  

// Epilogue 
nvgpu.warpgroup.mma.store [%matrixD1, %matrixD2] to %sharedMemoryBuffer
  : !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>, 
    !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>
    into memref<128x128xf32,3>
```
2023-10-17 11:46:47 +02:00
Guray Ozen
6dc7717bca
[MLIR][NVGPU] Change name wgmma.descriptor to warpgroup.descriptor (NFC) (#67526)
NVGPU dialect is gaining large support for warpgroup level operations,
and their names always starts with `warpgroup....`.

This PR changes name of Op and type from `wgmma.descriptor` to
`warpgroup.descriptor` for sake of consistency.
2023-10-05 09:01:48 +02:00
Cullen Rhodes
9816edc9f3
[mlir][vector] add result type to vector.extract assembly format (#66499)
The vector.extract assembly format currently only contains the source
type, for example:

  %1 = vector.extract %0[1] : vector<3x7x8xf32>

it's not immediately obvious if this is the source or result type. This
patch improves the assembly format to make this clearer, so the above
becomes:

  %1 = vector.extract %0[1] : vector<7x8xf32> from vector<3x7x8xf32>
2023-09-28 11:11:16 +01:00
Guray Ozen
17649a7726
[MLIR][NVGPU] Introduce nvgpu.mbarrier.group for multiple mbarrier use (#65951)
A common practice involves the creation of multiple `mbarrier` objects,
see an example below. This is particularly valuable in scenarios like
software pipelining for GEMM, where we need to generate multiple
barriers dynamically use and wait them in a loop.

PR improves `nvgpu.mbarrier.barrier` type into the
`nvgpu.mbarrier.group`. All `mbarrier` related Ops now uses this type.
Consequently, these Ops are now capable of managing multiple barriers
seamlessly.

Having `num_barriers = 4` helps us to locate mbarrier object(s) into
static shared memory. We could make the value dynamic that requires
dynamic shared memory it would complicate the codegen.

```
%barriers = nvgpu.mbarrier.create -> !nvgpu.mbarrier.group<3, num_barriers = 4>
nvgpu.mbarrier.init %barriers[%c0], %num_threads : !nvgpu.mbarrier.group<3, num_barriers = 4>
nvgpu.mbarrier.init %barriers[%c1], %num_threads : !nvgpu.mbarrier.group<3, num_barriers = 4>
nvgpu.mbarrier.init %barriers[%c2], %num_threads : !nvgpu.mbarrier.group<3, num_barriers = 4>
nvgpu.mbarrier.init %barriers[%c3], %num_threads : !nvgpu.mbarrier.group<3, num_barriers = 4>
...
scf.for %i = %c0 to %n step %c1 {
  nvgpu.mbarrier.try_wait %barriers[ (i % 4) ] ... 

  // ... Do work once mbarrier is ready 

  nvgpu.mbarrier.arrive.expect_tx %barriers[ (i + 3 % 4) ] ... 
}
```
We will have mbarrier usages like below: 
```
expect_tx[0]
expect_tx[1]
expect_tx[2]
Loop:
 try_wait mbarrier[0], expect_tx[3]
 try_wait mbarrier[1], expect_tx[0]
 try_wait mbarrier[2], expect_tx[1]
 try_wait mbarrier[3], expect_tx[2]
...
```
2023-09-22 17:09:43 +02:00
Guray Ozen
2388222695
[MLIR][NVGPU] Adding nvgpu.warpgroup.mma Op for Hopper GPUs (#65440)
This work introduces a new operation called `warpgroup.mma` to the NVGPU
dialect of MLIR. The purpose of this operation is to facilitate
warpgroup-level matrix multiply and accumulate (WGMMA) operations on
Hopper GPUs with sm_90a architecture.

Previously, the `nvvm.wgmma.mma_async` operation was introduced to
support warpgroup-level matrix operations in NVVM dialect. This op is
used multiple instances of `nvvm.wgmma.mma_async` to achieve the desired
shape. The new `nvgpu.warpgroup.mma` operation abstracts this complexity
and provides a higher-level interface for performing warpgroup-level
matrix operations.

The `nvgpu.warpgroup.mma` does followings:
1) Corresponds multiple `wgmma` instructions.
2) Iterates input matrix descriptors to achieve the desired computation
shape. 3) Groups and runs `wgmma` instructions asynchronously, and
eventually waits them. This are done by `wgmma.fence.aligned`,
`wgmma.commit.group.sync.aligned`, and `wgmma.wait.group.sync.aligned`
4) Results fragmented matrices

Here's an example usage of the `nvgpu.warpgroup.mma` operation:
```
%wgmmaResult, %wgmmaResult2 = nvgpu.warpgroup.mma %descA, %descB, %acc1, %acc2 {transposeB}: 
      !nvgpu.wgmma.descriptor<tensor = memref<128x64xf16, 3>>, 
      !nvgpu.wgmma.descriptor<tensor = memref<64x128xf16, 3>>, 
      !nvgpu.warpgroup.accumulator< fragmented = vector<64x128xf32>>,
      !nvgpu.warpgroup.accumulator< fragmented = vector<64x128xf32>> 
      -> 
      !nvgpu.warpgroup.accumulator< fragmented = vector<64x128xf32>>, 
      !nvgpu.warpgroup.accumulator< fragmented = vector<64x128xf32>>  
```

The op will result following PTX:
```
wgmma.fence.sync.aligned;
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f1, %f2,    62 more registers}, %descA,     %descB,     p, 1, 1, 0, 1;
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f1, %f2,    62 more registers}, %descA+2,   %descB+128, p, 1, 1, 0, 1;
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f1, %f2,    62 more registers}, %descA+4,   %descB+256, p, 1, 1, 0, 1;
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f1, %f2,    62 more registers}, %descA+8,   %descB+348, p, 1, 1, 0, 1;
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f500,%f501, 62 more registers}, %descA+512, %descB,     p, 1, 1, 0, 1;
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f500,%f501, 62 more registers}, %descA+514, %descB+128, p, 1, 1, 0, 1;
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f500,%f501, 62 more registers}, %descA+516, %descB+256, p, 1, 1, 0, 1;
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f500,%f501, 62 more registers}, %descA+518, %descB+348, p, 1, 1, 0, 1;
wgmma.commit_group.sync.aligned;
wgmma.wait_group.sync.aligned 1;
```

The Op keeps 
  - first 64 registers (`{%f1, %f2,    62 more registers}`) -> `%acc1` 
- second 64 registers (`{%f500,%f501, 62 more registers}`) -> `%acc2`.
2023-09-22 11:46:29 +02:00
Matthias Springer
15ea2306a4 [mlir][NVGPU] Support N-D masks in transform.nvgpu.create_async_groups
Support IR that is generated by the vector-to-scf lowering of N-D vector transfers with a mask. (Until now only 1-D and 2-D transfers were supported.) Only transfers that were fully unrolled are supported.

Differential Revision: https://reviews.llvm.org/D157286
2023-08-08 14:30:03 +02:00
Nicolas Vasilache
a3cd2eeb2d [mlir][nvgpu] Add a nvgpu.rewrite_copy_as_tma transform operation.
This revision adds support for direct lowering of a linalg.copy on buffers between global and shared memory to a tma async load + synchronization operations.
This uses the recently introduced Hopper NVVM and NVGPU abstraction to connect things end to end.

Differential Revision: https://reviews.llvm.org/D157087
2023-08-08 12:07:59 +00:00
Matthias Springer
39d8876da3 [mlir][NVGPU] Support 2D masks in transform.nvgpu.create_async_groups
Support IR that is generated by the vector-to-scf lowering of 2D vector transfers with a mask. Only 2D transfers that were fully unrolled are supported at the moment.

Differential Revision: https://reviews.llvm.org/D156695
2023-08-07 15:46:54 +02:00
Matthias Springer
db393288ff [mlir][NVGPU][transform] Add create_async_groups transform op
This transform looks for suitable vector transfers from global memory to shared memory and converts them to async device copies.

Differential Revision: https://reviews.llvm.org/D155569
2023-07-18 14:36:41 +02:00
Guray Ozen
960ab5225b [mlir][nvgpu] Verify invalid copy size (nfc)
This work improves verifier for invalid cases. It is NFC.

Reviewed By: nicolasvasilache, springerm

Differential Revision: https://reviews.llvm.org/D155448
2023-07-17 17:09:33 +02:00
Alex Zinenko
371366ce27 [mlir][nvgpu] add simple pipelining for shared memory copies
Add a simple transform operation to the NVGPU extension that performs
software pipelining of copies to shared memory. The functionality is
extremely minimalistic in this version and only supports copies from
global to shared memory inside an `scf.for` loop with either
`vector.transfer` or `nvgpu.device_async_copy` operations when
pipelining preconditions are already satisfied in the IR. This is the
minimally useful version that uses the more general loop pipeliner in an
NVGPU-specific way. Further extensions and orthogonalizations will be
necessary.

This required a change to the loop pipeliner itself to properly
propagate errors should the predicate generator fail.

This is loosely inspired from the vesion in IREE, but has less unsafe
assumptions and more principled way of communicating decisions.

Reviewed By: nicolasvasilache

Differential Revision: https://reviews.llvm.org/D155223
2023-07-17 14:29:12 +00:00
Guray Ozen
2c5739675c [mlir][nvgpu] Implement nvgpu.device_async_copy by NVVMToLLVM Pass
`nvgpu.device_async_copy` is lowered into `cp.async` PTX instruction. However, NVPTX backend does not support its all mode especially when zero padding is needed. Therefore, current MLIR implementation genereates inline assembly for that.

This work simplifies PTX generation for `nvgpu.device_async_copy`, and implements it by `NVVMToLLVM` Pass.

Depends on D154060

Reviewed By: nicolasvasilache, manishucsd

Differential Revision: https://reviews.llvm.org/D154345
2023-07-11 12:18:28 +02:00
Nicolas Vasilache
13f4e889c5 Revert "Revert "[mlir][Transform] Add support for mma.sync m16n8k16 f16 rewrite." and "[mlir][Transform] Introduce nvgpu transform extensions""
This reverts commit 6506692fe619ef8a1f7c6ea829d9a9eceb31622d.

Differential Revision: https://reviews.llvm.org/D153845
2023-06-28 06:50:05 +00:00
Mehdi Amini
6506692fe6 Revert "[mlir][Transform] Add support for mma.sync m16n8k16 f16 rewrite." and "[mlir][Transform] Introduce nvgpu transform extensions"
This reverts commit 40deed40ae77ba22f7c72693903752ab6bfeb4e7.
and commit 1660f2174d59bc2fd04131dab9ab0b43178bf665.

The buildbot is broken, the two tests aren't passing.
2023-06-27 08:46:18 +02:00
Nicolas Vasilache
1660f2174d [mlir][Transform] Add support for mma.sync m16n8k16 f16 rewrite.
This PR adds support for the m16n8k16 f16 case.
At this point, the support is mostly mechanical and could be Tablegen'd to all cases.
Until then, this can be populated as needed on a case-by-case basis.

Depends on: D153420

Differential Revision: https://reviews.llvm.org/D153428
2023-06-26 16:46:42 +00:00
Nicolas Vasilache
40deed40ae [mlir][Transform] Introduce nvgpu transform extensions
Mapping to NVGPU operations such as mma.sync with mixed precision and ldmatrix with transposes and
various data types involves complex matchings from low-level IR.
This is akin to raising complex patterns after unnecessarily having lost structural information.
To avoid such unnecessary complexity, introduce a direct mapping step from a matmul on memrefs
to distributed NVGPU vector abstractions.
In this context, mapping to specific mma.sync operations is trivial and consists in simply
translating the documentation into indexing expressions.

Correctness is demonstrated with an end-to-end integration test.

Differential Revision: https://reviews.llvm.org/D153420
2023-06-26 16:21:28 +00:00
Aart Bik
a3bb693420 [mlir][gpu][nvvm] refined sparsity selector test and verification of mma.sp
Reviewed By: ThomasRaoux

Differential Revision: https://reviews.llvm.org/D146319
2023-03-17 15:50:36 -07:00
Christopher Bate
6ca1a09f03 [mlir][gpu] Migrate hard-coded address space integers to an enum attribute (gpu::AddressSpaceAttr)
This is a purely mechanical change that introduces an enum attribute in the GPU
dialect to represent the various memref memory spaces as opposed to the
hard-coded integer attributes that are currently used.

The following steps were taken to make the transition across the codebase:

1. Introduce a pass "gpu-lower-memory-space-attributes":

The pass updates all memref types that have a memory space attribute that is a
`gpu::AddressSpaceAttr`. These attributes are changed to `IntegerAttr`'s using a
mapping that is given by the caller. This pass is based on the
"map-memref-spirv-storage-class" pass and the common functions can probably
be refactored into a set of utilities under the MemRef dialect.

2. Update the verifiers of GPU/NVGPU dialect operations.

If a verifier currently checks the address space of an operand using
e.g.`getWorkspaceAddressSpace`, then it can continue to do so. However, the
checks are changed to only fail if the memory space is either missing or a wrong
value of type `gpu::AddressSpaceAttr`. Otherwise, it just assumes the address
space is correct because it was specifically lowered to something other than a
`gpu::AddressSpaceAttr`.

3. Update existing gpu-to-llvm conversion infrastructure.

In the existing gpu-to-X passes, we add a full conversion equivalent to
`gpu-lower-memory-space-attributes` just before doing the conversion to the
LLVMDialect. This is done because currently both the gpu-to-llvm passes
(rocdl,nvvm) run gpu-to-gpu rewrites within the pass, which introduce
`AddressSpaceAttr` memory space annotations. Therefore, I inserted the
memory space conversion between the gpu-to-gpu rewrites and the LLVM
conversion.

For more context see the below discourse discussion:
https://discourse.llvm.org/t/gpu-workgroup-shared-memory-address-space-is-hard-coded/

Reviewed By: ftynse

Differential Revision: https://reviews.llvm.org/D140644
2023-01-13 11:00:10 -07:00
Hanhan Wang
0a1569a400 [mlir][NFC] Remove trailing whitespaces from *.td and *.mlir files.
This is generated by running

```
sed --in-place 's/[[:space:]]\+$//' mlir/**/*.td
sed --in-place 's/[[:space:]]\+$//' mlir/**/*.mlir
```

Reviewed By: rriddle, dcaballe

Differential Revision: https://reviews.llvm.org/D138866
2022-11-28 15:26:30 -08:00
Christopher Bate
708185f03f [mlir][NVGPU] Add support for structured sparsity MMA variants
This change adds a new NVGPU operation that targets the PTX `mma.sp.sync`
instruction variants. A lowering to NVVM is provided using inline
assembly.

Reviewed By: ThomasRaoux, manishucsd

Differential Revision: https://reviews.llvm.org/D137202
2022-11-07 09:43:03 -07:00
rkayaith
13bd410962 [mlir][Pass] Include anchor op in -pass-pipeline
In D134622 the printed form of a pass manager is changed to include the
name of the op that the pass manager is anchored on. This updates the
`-pass-pipeline` argument format to include the anchor op as well, so
that the printed form of a pipeline can be directly passed to
`-pass-pipeline`. In most cases this requires updating
`-pass-pipeline='pipeline'` to
`-pass-pipeline='builtin.module(pipeline)'`.

This also fixes an outdated assert that prevented running a
`PassManager` anchored on `'any'`.

Reviewed By: rriddle

Differential Revision: https://reviews.llvm.org/D134900
2022-11-03 11:36:12 -04:00
Thomas Raoux
91594b5b98 [mlir][nvpu] Prevent F32ToTF32 pattern to generate illegal IR
We shouldn't apply this pattern to non F32->F32 mma.sync operations.

Differential Revision: https://reviews.llvm.org/D131902
2022-08-15 16:46:18 +00:00
Manish Gupta
14d79afeae [mlir][NVGPU] nvgpu.mmasync on F32 through TF32
Adds optional attribute to support tensor cores on F32 datatype by lowering to `mma.sync` with TF32 operands. Since, TF32 is not a native datatype in LLVM we are adding `tf32Enabled` as an attribute to allow the IR to be aware of `MmaSyncOp` datatype. Additionally, this patch adds placeholders for nvgpu-to-nvgpu transformation targeting higher precision tf32x3.

For mma.sync on f32 input using tensor cores there are two possibilites:
(a) tf32   (1 `mma.sync` per warp-level matrix-multiply-accumulate)
(b) tf32x3 (3 `mma.sync` per warp-level matrix-multiply-accumulate)

Typically, tf32 tensor core acceleration comes at a cost of accuracy from missing precision bits. While f32 has 23 precision bits, tf32 has only 10 precision bits. tf32x3 aims to recover the precision bits by splitting each operand into two tf32 values and issue three `mma.sync` tensor core operations.

Reviewed By: ThomasRaoux

Differential Revision: https://reviews.llvm.org/D130294
2022-08-01 23:23:27 +00:00
Manish Gupta
713d3de5fb [mlir][NVGPU] Verifier for nvgpu.ldmatrix
* Adds verifiers for `nvgpu.ldmatrix` op
* Adds tests to `mlir/test/Dialect/NVGPU/invalid.mlir`

Reviewed By: ThomasRaoux

Differential Revision: https://reviews.llvm.org/D129669
2022-07-14 22:46:38 +00:00
Manish Gupta
f7d42d5149 [mlir][NVGPU] Verifiers for nvgpu.mma.sync Op
- Adds verification for `nvgpu.mma.sync` op
- Adds tests to `mlir/test/Dialect/NVGPU/invalid.mlir`
- `nvgpu.mma.sync` verifier caught a bug and triggered a failure in m16n8k4_tf32_f32 variant in `mlir/test/Conversion/NVGPUToNVVM/nvgpu-to-nvvm.mlir`
     - The output shape of vector holding thread-level accumulators was inconsistent  and fixed in this change

Reviewed By: ThomasRaoux

Differential Revision: https://reviews.llvm.org/D129400
2022-07-13 18:57:07 +00:00
Christopher Bate
51b925df94 [mlir][nvgpu] shared memory access optimization pass
This change adds a transformation and pass to the NvGPU dialect that
attempts to optimize reads/writes from a  memref representing GPU shared
memory in order to avoid bank conflicts. Given a value representing a
shared memory memref, it traverses all reads/writes within the parent op
and, subject to suitable conditions, rewrites all last dimension index
values such that element locations in the final (col) dimension are
given by
`newColIdx = col % vecSize + perm[row](col/vecSize,row)`
where `perm` is a permutation function indexed by `row` and `vecSize`
is the vector access size in elements (currently assumes 128bit
vectorized accesses, but this can be made a parameter). This specific
transformation can help optimize typical distributed & vectorized accesses
common to loading matrix multiplication operands to/from shared memory.

Differential Revision: https://reviews.llvm.org/D127457
2022-06-17 09:31:05 -06:00
Thomas Raoux
15bcc36eed [mlir][gpu] Move async copy ops to NVGPU and add caching hints
Move async copy operations to NVGPU as they only exist on NV target and are
designed to match ptx semantic. This allows us to also add more fine grain
caching hint attribute to the op.
Add hint to bypass L1 and hook it up to NVVM op.

Differential Revision: https://reviews.llvm.org/D125244
2022-05-10 22:30:24 +00:00
River Riddle
0254b0bcf0 [mlir][NFC] Update textual references of func to func.func in LLVM/Math/MemRef/NVGPU/OpenACC/OpenMP/Quant/SCF/Shape tests
The special case parsing of `func` operations is being removed.
2022-04-20 22:17:28 -07:00
Thomas Raoux
894a591cf6 [mlir][nvgpu] Move mma.sync and ldmatrix in nvgpu dialect
Move gpu operation mma.sync and ldmatrix in nvgpu as they are specific
to nvidia target.

Differential Revision: https://reviews.llvm.org/D123824
2022-04-14 23:44:52 +00:00
Thomas Raoux
4c564940a1 [mlir][nvgpu] Add NVGPU dialect (architectural specific gpu dialect)
This introduce a new dialect for vendro specific ptx operations. This
also adds the first operation ldmatrix as an example. More operations
will be added in follow up patches.
This new dialect is meant to be a bridge between GPU and Vector
dialectis and NVVM dialect.

This is based on the RFC proposed here:
https://discourse.llvm.org/t/rfc-add-nv-gpu-dialect-hw-specific-extension-of-gpu-dialect-for-nvidia-gpus/61466/8

Differential Revision: https://reviews.llvm.org/D123266
2022-04-14 16:33:46 +00:00