llvm-project

Author	SHA1	Message	Date
Guray Ozen	0a600c34c8	[mlir][nvgpu] Make `phaseParity` of `mbarrier.try_wait` `i1` (#81460 ) Currently, `phaseParity` argument of `nvgpu.mbarrier.try_wait.parity` is index. This can cause a problem if it's passed any value different than 0 or 1. Because the PTX instruction only accepts even or odd phase. This PR makes phaseParity argument i1 to avoid misuse. Here is the information from PTX doc: ``` The .parity variant of the instructions test for the completion of the phase indicated by the operand phaseParity, which is the integer parity of either the current phase or the immediately preceding phase of the mbarrier object. An even phase has integer parity 0 and an odd phase has integer parity of 1. So the valid values of phaseParity operand are 0 and 1. ``` See for more information: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-mbarrier-test-wait-mbarrier-try-wait	2024-02-13 09:50:34 +01:00
Guray Ozen	f33a0a4835	[mlir][nvgpu] Improve `tensormap.descriptor` Type Verifier (#77904 ) This PR improves the verifier for the `nvgpu.tensormap.descriptor` type. The descriptor contains information for TMA, and the compile-time check ensures its restrictions, such as the last memory dimension being 128-byte. This prevents runtime crashes. See cuda driver for more explanation: https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__TENSOR__MEMORY.html#group__CUDA__TENSOR__MEMORY_1ga7c7d2aaac9e49294304e755e6f341d7	2024-02-05 10:32:03 +01:00
Guray Ozen	3477bcf4b9	[mlir][nvgpu] Mark TMA descriptor as MemWriteAt in `tma.async.store` (#79427 )	2024-01-30 19:36:59 +01:00
Guray Ozen	249186701d	[mlir][nvgpu] Improve verifier of `ldmatrix` (#77807 ) PR improves the verifier of `nvgpu.ldmatrix` Op, so `nvgpu-to-nvvm` lowering does not crash.	2024-01-12 08:57:12 +01:00
Matthias Springer	bb6d5c2200	[mlir][Transforms] `GreedyPatternRewriteDriver`: Do not CSE constants during iterations (#75897 ) The `GreedyPatternRewriteDriver` tries to iteratively fold ops and apply rewrite patterns to ops. It has special handling for constants: they are CSE'd and sometimes moved to parent regions to allow for additional CSE'ing. This happens in `OperationFolder`. To allow for efficient CSE'ing, `OperationFolder` maintains an internal lookup data structure to find the existing constant ops with the same value for each `IsolatedFromAbove` region: ```c++ /// A mapping between an insertion region and the constants that have been /// created within it. DenseMap<Region *, ConstantMap> foldScopes; ``` Rewrite patterns are allowed to modify operations. In particular, they may move operations (including constants) from one region to another one. Such an IR rewrite can make the above lookup data structure inconsistent. We encountered such a bug in a downstream project. This bug materialized in the form of an op that uses the result of a constant op from a different `IsolatedFromAbove` region (that is not accessible). This commit changes the behavior of the `GreedyPatternRewriteDriver` such that `OperationFolder` is used to CSE constants at the beginning of each iteration (as the worklist is populated), but no longer during an iteration. `OperationFolder` is no longer used after populating the worklist, so we do not have to care about inconsistent state in the `OperationFolder` due to IR rewrites. The `GreedyPatternRewriteDriver` now performs the op folding by itself instead of calling `OperationFolder::tryToFold`. This change changes the order of constant ops in test cases, but not the region in which they appear. All broken test cases were fixed by turning `CHECK` into `CHECK-DAG`. Alternatives considered: The state of `OperationFolder` could be partially invalidated with every `notifyOperationModified` notification. That is more fragile than the solution in this commit because incorrect rewriter API usage can lead to missing notifications and hard-to-debug `IsolatedFromAbove` violations. (It did not fix the above mention bug in a downstream project, which could be due to incorrect rewriter API usage or due to another conceptual problem that I missed.) Moreover, ops are frequently getting modified during a greedy pattern rewrite, so we would likely keep invalidating large parts of the state of `OperationFolder` over and over. Migration guide: Turn `CHECK` into `CHECK-DAG` in test cases. Constant ops are no longer folded during a greedy pattern rewrite. If you rely on folding (and rematerialization) of constant ops during a greedy pattern rewrite, turn the folder into a pattern.	2024-01-05 09:22:18 +01:00
Thomas Raoux	ef112833e1	[MLIR][SCF] Add support for pipelining dynamic loops (#74350 ) Support loops without static boundaries. Since the number of iteration is not known we need to predicate prologue and epilogue in case the number of iterations is smaller than the number of stages. This patch includes work from @chengjunlu	2023-12-10 22:32:11 -08:00
Rik Huijzer	ae6eedd275	[mlir] Fix two `CHECK:` typos (#73803 ) Out of curiosity, I ran [typos](https://github.com/crate-ci/typos) against MLIR. It found two `CHECK:` typos (and many minor typos; which I'm not gonna work on today).	2023-11-30 10:19:27 +01:00
Guray Ozen	6eb97f0380	[MLIR][NVGPU] Improve and Cleanup verifier of TMA OPs (#70923 ) This PR improves and cleans-up verifiers of TmaCreateDescriptor and TmaAsyncLoad Ops and unifies them. The PR verifiers followings that didn't before: - address space - rank match between descriptor and memref - element type match between descriptor and memref - shape type match between descriptor and memref	2023-11-08 12:03:16 +01:00
Oleksandr "Alex" Zinenko	e4384149b5	[mlir] use transform-interpreter in test passes (#70040 ) Update most test passes to use the transform-interpreter pass instead of the test-transform-dialect-interpreter-pass. The new "main" interpreter pass has a named entry point instead of looking up the top-level op with `PossibleTopLevelOpTrait`, which is arguably a more understandable interface. The change is mechanical, rewriting an unnamed sequence into a named one and wrapping the transform IR in to a module when necessary. Add an option to the transform-interpreter pass to target a tagged payload op instead of the root anchor op, which is also useful for repro generation. Only the test in the transform dialect proper and the examples have not been updated yet. These will be updated separately after a more careful consideration of testing coverage of the transform interpreter logic.	2023-10-24 16:12:34 +02:00
Guray Ozen	52db7e2745	[mlir][nvgpu] Improve `WarpgroupAccumulator` type to simplify IR (#68728 ) `WarpgroupAccumulator` (or `!nvgpu.warpgroup.accumulator`) is a type that keeps the accumulator matrix that is used by warp-group level matrix multiplication. It is handy to have a special type for that as the matrix is distributed among the threads of the warp-group. However, current transformations requires to create and use multiple `WarpgroupAccumulator` if the shape of GEMM is larger than the supported shape of `wgmma.mma_async` instruction. This makes IR looks dense. This PR improves the transformation of `WarpgroupAccumulator` type in every nvgpu Op that uses it. Example: Current GEMM in NVGPU-IR ``` // Init %m1, %m2 = nvgpu.warpgroup.mma.init.accumulator -> !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>, !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>> // GEMM %r1, %r2 = nvgpu.warpgroup.mma %descA, %descB, %m1, %m2 {transposeB}: !nvgpu.warpgroup.descriptor<tensor = memref<128x64xf16, 3>>, !nvgpu.warpgroup.descriptor<tensor = memref<64x128xf16, 3>>, !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>, !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>> -> !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>, !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>> // Epilogue nvgpu.warpgroup.mma.store [%r1, %r2] to %sharedMemoryBuffer : !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>, !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>> into memref<128x128xf32,3> ``` Example: This PR simplifies the IR as below: ``` // Init %m = nvgpu.warpgroup.mma.init.accumulator -> !nvgpu.warpgroup.accumulator<fragmented = vector<128x128xf32>> // GEMM %r1 = nvgpu.warpgroup.mma %descA, %descB, %m1 {transposeB}: !nvgpu.warpgroup.descriptor<tensor = memref<128x64xf16, 3>>, !nvgpu.warpgroup.descriptor<tensor = memref<64x128xf16, 3>>, !nvgpu.warpgroup.accumulator<fragmented = vector<128x128xf32>> -> !nvgpu.warpgroup.accumulator<fragmented = vector<128x128xf32>> // Epilogue nvgpu.warpgroup.mma.store [%matrixD1, %matrixD2] to %sharedMemoryBuffer : !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>, !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>> into memref<128x128xf32,3> ```	2023-10-17 11:46:47 +02:00
Guray Ozen	6dc7717bca	[MLIR][NVGPU] Change name `wgmma.descriptor` to `warpgroup.descriptor` (NFC) (#67526 ) NVGPU dialect is gaining large support for warpgroup level operations, and their names always starts with `warpgroup....`. This PR changes name of Op and type from `wgmma.descriptor` to `warpgroup.descriptor` for sake of consistency.	2023-10-05 09:01:48 +02:00
Cullen Rhodes	9816edc9f3	[mlir][vector] add result type to vector.extract assembly format (#66499 ) The vector.extract assembly format currently only contains the source type, for example: %1 = vector.extract %0[1] : vector<3x7x8xf32> it's not immediately obvious if this is the source or result type. This patch improves the assembly format to make this clearer, so the above becomes: %1 = vector.extract %0[1] : vector<7x8xf32> from vector<3x7x8xf32>	2023-09-28 11:11:16 +01:00
Guray Ozen	17649a7726	[MLIR][NVGPU] Introduce `nvgpu.mbarrier.group` for multiple mbarrier use (#65951 ) A common practice involves the creation of multiple `mbarrier` objects, see an example below. This is particularly valuable in scenarios like software pipelining for GEMM, where we need to generate multiple barriers dynamically use and wait them in a loop. PR improves `nvgpu.mbarrier.barrier` type into the `nvgpu.mbarrier.group`. All `mbarrier` related Ops now uses this type. Consequently, these Ops are now capable of managing multiple barriers seamlessly. Having `num_barriers = 4` helps us to locate mbarrier object(s) into static shared memory. We could make the value dynamic that requires dynamic shared memory it would complicate the codegen. ``` %barriers = nvgpu.mbarrier.create -> !nvgpu.mbarrier.group<3, num_barriers = 4> nvgpu.mbarrier.init %barriers[%c0], %num_threads : !nvgpu.mbarrier.group<3, num_barriers = 4> nvgpu.mbarrier.init %barriers[%c1], %num_threads : !nvgpu.mbarrier.group<3, num_barriers = 4> nvgpu.mbarrier.init %barriers[%c2], %num_threads : !nvgpu.mbarrier.group<3, num_barriers = 4> nvgpu.mbarrier.init %barriers[%c3], %num_threads : !nvgpu.mbarrier.group<3, num_barriers = 4> ... scf.for %i = %c0 to %n step %c1 { nvgpu.mbarrier.try_wait %barriers[ (i % 4) ] ... // ... Do work once mbarrier is ready nvgpu.mbarrier.arrive.expect_tx %barriers[ (i + 3 % 4) ] ... } ``` We will have mbarrier usages like below: ``` expect_tx[0] expect_tx[1] expect_tx[2] Loop: try_wait mbarrier[0], expect_tx[3] try_wait mbarrier[1], expect_tx[0] try_wait mbarrier[2], expect_tx[1] try_wait mbarrier[3], expect_tx[2] ... ```	2023-09-22 17:09:43 +02:00
Guray Ozen	2388222695	[MLIR][NVGPU] Adding `nvgpu.warpgroup.mma` Op for Hopper GPUs (#65440 ) This work introduces a new operation called `warpgroup.mma` to the NVGPU dialect of MLIR. The purpose of this operation is to facilitate warpgroup-level matrix multiply and accumulate (WGMMA) operations on Hopper GPUs with sm_90a architecture. Previously, the `nvvm.wgmma.mma_async` operation was introduced to support warpgroup-level matrix operations in NVVM dialect. This op is used multiple instances of `nvvm.wgmma.mma_async` to achieve the desired shape. The new `nvgpu.warpgroup.mma` operation abstracts this complexity and provides a higher-level interface for performing warpgroup-level matrix operations. The `nvgpu.warpgroup.mma` does followings: 1) Corresponds multiple `wgmma` instructions. 2) Iterates input matrix descriptors to achieve the desired computation shape. 3) Groups and runs `wgmma` instructions asynchronously, and eventually waits them. This are done by `wgmma.fence.aligned`, `wgmma.commit.group.sync.aligned`, and `wgmma.wait.group.sync.aligned` 4) Results fragmented matrices Here's an example usage of the `nvgpu.warpgroup.mma` operation: ``` %wgmmaResult, %wgmmaResult2 = nvgpu.warpgroup.mma %descA, %descB, %acc1, %acc2 {transposeB}: !nvgpu.wgmma.descriptor<tensor = memref<128x64xf16, 3>>, !nvgpu.wgmma.descriptor<tensor = memref<64x128xf16, 3>>, !nvgpu.warpgroup.accumulator< fragmented = vector<64x128xf32>>, !nvgpu.warpgroup.accumulator< fragmented = vector<64x128xf32>> -> !nvgpu.warpgroup.accumulator< fragmented = vector<64x128xf32>>, !nvgpu.warpgroup.accumulator< fragmented = vector<64x128xf32>> ``` The op will result following PTX: ``` wgmma.fence.sync.aligned; wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f1, %f2, 62 more registers}, %descA, %descB, p, 1, 1, 0, 1; wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f1, %f2, 62 more registers}, %descA+2, %descB+128, p, 1, 1, 0, 1; wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f1, %f2, 62 more registers}, %descA+4, %descB+256, p, 1, 1, 0, 1; wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f1, %f2, 62 more registers}, %descA+8, %descB+348, p, 1, 1, 0, 1; wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f500,%f501, 62 more registers}, %descA+512, %descB, p, 1, 1, 0, 1; wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f500,%f501, 62 more registers}, %descA+514, %descB+128, p, 1, 1, 0, 1; wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f500,%f501, 62 more registers}, %descA+516, %descB+256, p, 1, 1, 0, 1; wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f500,%f501, 62 more registers}, %descA+518, %descB+348, p, 1, 1, 0, 1; wgmma.commit_group.sync.aligned; wgmma.wait_group.sync.aligned 1; ``` The Op keeps - first 64 registers (`{%f1, %f2, 62 more registers}`) -> `%acc1` - second 64 registers (`{%f500,%f501, 62 more registers}`) -> `%acc2`.	2023-09-22 11:46:29 +02:00
Matthias Springer	15ea2306a4	[mlir][NVGPU] Support N-D masks in transform.nvgpu.create_async_groups Support IR that is generated by the vector-to-scf lowering of N-D vector transfers with a mask. (Until now only 1-D and 2-D transfers were supported.) Only transfers that were fully unrolled are supported. Differential Revision: https://reviews.llvm.org/D157286	2023-08-08 14:30:03 +02:00
Nicolas Vasilache	a3cd2eeb2d	[mlir][nvgpu] Add a nvgpu.rewrite_copy_as_tma transform operation. This revision adds support for direct lowering of a linalg.copy on buffers between global and shared memory to a tma async load + synchronization operations. This uses the recently introduced Hopper NVVM and NVGPU abstraction to connect things end to end. Differential Revision: https://reviews.llvm.org/D157087	2023-08-08 12:07:59 +00:00
Matthias Springer	39d8876da3	[mlir][NVGPU] Support 2D masks in transform.nvgpu.create_async_groups Support IR that is generated by the vector-to-scf lowering of 2D vector transfers with a mask. Only 2D transfers that were fully unrolled are supported at the moment. Differential Revision: https://reviews.llvm.org/D156695	2023-08-07 15:46:54 +02:00
Matthias Springer	db393288ff	[mlir][NVGPU][transform] Add `create_async_groups` transform op This transform looks for suitable vector transfers from global memory to shared memory and converts them to async device copies. Differential Revision: https://reviews.llvm.org/D155569	2023-07-18 14:36:41 +02:00
Guray Ozen	960ab5225b	[mlir][nvgpu] Verify invalid copy size (nfc) This work improves verifier for invalid cases. It is NFC. Reviewed By: nicolasvasilache, springerm Differential Revision: https://reviews.llvm.org/D155448	2023-07-17 17:09:33 +02:00
Alex Zinenko	371366ce27	[mlir][nvgpu] add simple pipelining for shared memory copies Add a simple transform operation to the NVGPU extension that performs software pipelining of copies to shared memory. The functionality is extremely minimalistic in this version and only supports copies from global to shared memory inside an `scf.for` loop with either `vector.transfer` or `nvgpu.device_async_copy` operations when pipelining preconditions are already satisfied in the IR. This is the minimally useful version that uses the more general loop pipeliner in an NVGPU-specific way. Further extensions and orthogonalizations will be necessary. This required a change to the loop pipeliner itself to properly propagate errors should the predicate generator fail. This is loosely inspired from the vesion in IREE, but has less unsafe assumptions and more principled way of communicating decisions. Reviewed By: nicolasvasilache Differential Revision: https://reviews.llvm.org/D155223	2023-07-17 14:29:12 +00:00
Guray Ozen	2c5739675c	[mlir][nvgpu] Implement `nvgpu.device_async_copy` by NVVMToLLVM Pass `nvgpu.device_async_copy` is lowered into `cp.async` PTX instruction. However, NVPTX backend does not support its all mode especially when zero padding is needed. Therefore, current MLIR implementation genereates inline assembly for that. This work simplifies PTX generation for `nvgpu.device_async_copy`, and implements it by `NVVMToLLVM` Pass. Depends on D154060 Reviewed By: nicolasvasilache, manishucsd Differential Revision: https://reviews.llvm.org/D154345	2023-07-11 12:18:28 +02:00
Nicolas Vasilache	13f4e889c5	Revert "Revert "[mlir][Transform] Add support for mma.sync m16n8k16 f16 rewrite." and "[mlir][Transform] Introduce nvgpu transform extensions"" This reverts commit 6506692fe619ef8a1f7c6ea829d9a9eceb31622d. Differential Revision: https://reviews.llvm.org/D153845	2023-06-28 06:50:05 +00:00
Mehdi Amini	6506692fe6	Revert "[mlir][Transform] Add support for mma.sync m16n8k16 f16 rewrite." and "[mlir][Transform] Introduce nvgpu transform extensions" This reverts commit 40deed40ae77ba22f7c72693903752ab6bfeb4e7. and commit 1660f2174d59bc2fd04131dab9ab0b43178bf665. The buildbot is broken, the two tests aren't passing.	2023-06-27 08:46:18 +02:00
Nicolas Vasilache	1660f2174d	[mlir][Transform] Add support for mma.sync m16n8k16 f16 rewrite. This PR adds support for the m16n8k16 f16 case. At this point, the support is mostly mechanical and could be Tablegen'd to all cases. Until then, this can be populated as needed on a case-by-case basis. Depends on: D153420 Differential Revision: https://reviews.llvm.org/D153428	2023-06-26 16:46:42 +00:00
Nicolas Vasilache	40deed40ae	[mlir][Transform] Introduce nvgpu transform extensions Mapping to NVGPU operations such as mma.sync with mixed precision and ldmatrix with transposes and various data types involves complex matchings from low-level IR. This is akin to raising complex patterns after unnecessarily having lost structural information. To avoid such unnecessary complexity, introduce a direct mapping step from a matmul on memrefs to distributed NVGPU vector abstractions. In this context, mapping to specific mma.sync operations is trivial and consists in simply translating the documentation into indexing expressions. Correctness is demonstrated with an end-to-end integration test. Differential Revision: https://reviews.llvm.org/D153420	2023-06-26 16:21:28 +00:00
Aart Bik	a3bb693420	[mlir][gpu][nvvm] refined sparsity selector test and verification of mma.sp Reviewed By: ThomasRaoux Differential Revision: https://reviews.llvm.org/D146319	2023-03-17 15:50:36 -07:00
Christopher Bate	6ca1a09f03	[mlir][gpu] Migrate hard-coded address space integers to an enum attribute (gpu::AddressSpaceAttr) This is a purely mechanical change that introduces an enum attribute in the GPU dialect to represent the various memref memory spaces as opposed to the hard-coded integer attributes that are currently used. The following steps were taken to make the transition across the codebase: 1. Introduce a pass "gpu-lower-memory-space-attributes": The pass updates all memref types that have a memory space attribute that is a `gpu::AddressSpaceAttr`. These attributes are changed to `IntegerAttr`'s using a mapping that is given by the caller. This pass is based on the "map-memref-spirv-storage-class" pass and the common functions can probably be refactored into a set of utilities under the MemRef dialect. 2. Update the verifiers of GPU/NVGPU dialect operations. If a verifier currently checks the address space of an operand using e.g.`getWorkspaceAddressSpace`, then it can continue to do so. However, the checks are changed to only fail if the memory space is either missing or a wrong value of type `gpu::AddressSpaceAttr`. Otherwise, it just assumes the address space is correct because it was specifically lowered to something other than a `gpu::AddressSpaceAttr`. 3. Update existing gpu-to-llvm conversion infrastructure. In the existing gpu-to-X passes, we add a full conversion equivalent to `gpu-lower-memory-space-attributes` just before doing the conversion to the LLVMDialect. This is done because currently both the gpu-to-llvm passes (rocdl,nvvm) run gpu-to-gpu rewrites within the pass, which introduce `AddressSpaceAttr` memory space annotations. Therefore, I inserted the memory space conversion between the gpu-to-gpu rewrites and the LLVM conversion. For more context see the below discourse discussion: https://discourse.llvm.org/t/gpu-workgroup-shared-memory-address-space-is-hard-coded/ Reviewed By: ftynse Differential Revision: https://reviews.llvm.org/D140644	2023-01-13 11:00:10 -07:00
Hanhan Wang	0a1569a400	[mlir][NFC] Remove trailing whitespaces from `.td` and `.mlir` files. This is generated by running ``` sed --in-place 's/[[:space:]]\+$//' mlir/*/.td sed --in-place 's/[[:space:]]\+$//' mlir/*/.mlir ``` Reviewed By: rriddle, dcaballe Differential Revision: https://reviews.llvm.org/D138866	2022-11-28 15:26:30 -08:00
Christopher Bate	708185f03f	[mlir][NVGPU] Add support for structured sparsity MMA variants This change adds a new NVGPU operation that targets the PTX `mma.sp.sync` instruction variants. A lowering to NVVM is provided using inline assembly. Reviewed By: ThomasRaoux, manishucsd Differential Revision: https://reviews.llvm.org/D137202	2022-11-07 09:43:03 -07:00
rkayaith	13bd410962	[mlir][Pass] Include anchor op in -pass-pipeline In D134622 the printed form of a pass manager is changed to include the name of the op that the pass manager is anchored on. This updates the `-pass-pipeline` argument format to include the anchor op as well, so that the printed form of a pipeline can be directly passed to `-pass-pipeline`. In most cases this requires updating `-pass-pipeline='pipeline'` to `-pass-pipeline='builtin.module(pipeline)'`. This also fixes an outdated assert that prevented running a `PassManager` anchored on `'any'`. Reviewed By: rriddle Differential Revision: https://reviews.llvm.org/D134900	2022-11-03 11:36:12 -04:00
Thomas Raoux	91594b5b98	[mlir][nvpu] Prevent F32ToTF32 pattern to generate illegal IR We shouldn't apply this pattern to non F32->F32 mma.sync operations. Differential Revision: https://reviews.llvm.org/D131902	2022-08-15 16:46:18 +00:00
Manish Gupta	14d79afeae	[mlir][NVGPU] nvgpu.mmasync on F32 through TF32 Adds optional attribute to support tensor cores on F32 datatype by lowering to `mma.sync` with TF32 operands. Since, TF32 is not a native datatype in LLVM we are adding `tf32Enabled` as an attribute to allow the IR to be aware of `MmaSyncOp` datatype. Additionally, this patch adds placeholders for nvgpu-to-nvgpu transformation targeting higher precision tf32x3. For mma.sync on f32 input using tensor cores there are two possibilites: (a) tf32 (1 `mma.sync` per warp-level matrix-multiply-accumulate) (b) tf32x3 (3 `mma.sync` per warp-level matrix-multiply-accumulate) Typically, tf32 tensor core acceleration comes at a cost of accuracy from missing precision bits. While f32 has 23 precision bits, tf32 has only 10 precision bits. tf32x3 aims to recover the precision bits by splitting each operand into two tf32 values and issue three `mma.sync` tensor core operations. Reviewed By: ThomasRaoux Differential Revision: https://reviews.llvm.org/D130294	2022-08-01 23:23:27 +00:00
Manish Gupta	713d3de5fb	[mlir][NVGPU] Verifier for nvgpu.ldmatrix * Adds verifiers for `nvgpu.ldmatrix` op * Adds tests to `mlir/test/Dialect/NVGPU/invalid.mlir` Reviewed By: ThomasRaoux Differential Revision: https://reviews.llvm.org/D129669	2022-07-14 22:46:38 +00:00
Manish Gupta	f7d42d5149	[mlir][NVGPU] Verifiers for nvgpu.mma.sync Op - Adds verification for `nvgpu.mma.sync` op - Adds tests to `mlir/test/Dialect/NVGPU/invalid.mlir` - `nvgpu.mma.sync` verifier caught a bug and triggered a failure in m16n8k4_tf32_f32 variant in `mlir/test/Conversion/NVGPUToNVVM/nvgpu-to-nvvm.mlir` - The output shape of vector holding thread-level accumulators was inconsistent and fixed in this change Reviewed By: ThomasRaoux Differential Revision: https://reviews.llvm.org/D129400	2022-07-13 18:57:07 +00:00
Christopher Bate	51b925df94	[mlir][nvgpu] shared memory access optimization pass This change adds a transformation and pass to the NvGPU dialect that attempts to optimize reads/writes from a memref representing GPU shared memory in order to avoid bank conflicts. Given a value representing a shared memory memref, it traverses all reads/writes within the parent op and, subject to suitable conditions, rewrites all last dimension index values such that element locations in the final (col) dimension are given by `newColIdx = col % vecSize + perm[row](col/vecSize,row)` where `perm` is a permutation function indexed by `row` and `vecSize` is the vector access size in elements (currently assumes 128bit vectorized accesses, but this can be made a parameter). This specific transformation can help optimize typical distributed & vectorized accesses common to loading matrix multiplication operands to/from shared memory. Differential Revision: https://reviews.llvm.org/D127457	2022-06-17 09:31:05 -06:00
Thomas Raoux	15bcc36eed	[mlir][gpu] Move async copy ops to NVGPU and add caching hints Move async copy operations to NVGPU as they only exist on NV target and are designed to match ptx semantic. This allows us to also add more fine grain caching hint attribute to the op. Add hint to bypass L1 and hook it up to NVVM op. Differential Revision: https://reviews.llvm.org/D125244	2022-05-10 22:30:24 +00:00
River Riddle	0254b0bcf0	[mlir][NFC] Update textual references of `func` to `func.func` in LLVM/Math/MemRef/NVGPU/OpenACC/OpenMP/Quant/SCF/Shape tests The special case parsing of `func` operations is being removed.	2022-04-20 22:17:28 -07:00
Thomas Raoux	894a591cf6	[mlir][nvgpu] Move mma.sync and ldmatrix in nvgpu dialect Move gpu operation mma.sync and ldmatrix in nvgpu as they are specific to nvidia target. Differential Revision: https://reviews.llvm.org/D123824	2022-04-14 23:44:52 +00:00
Thomas Raoux	4c564940a1	[mlir][nvgpu] Add NVGPU dialect (architectural specific gpu dialect) This introduce a new dialect for vendro specific ptx operations. This also adds the first operation ldmatrix as an example. More operations will be added in follow up patches. This new dialect is meant to be a bridge between GPU and Vector dialectis and NVVM dialect. This is based on the RFC proposed here: https://discourse.llvm.org/t/rfc-add-nv-gpu-dialect-hw-specific-extension-of-gpu-dialect-for-nvidia-gpus/61466/8 Differential Revision: https://reviews.llvm.org/D123266	2022-04-14 16:33:46 +00:00

39 Commits