llvm-project

Author	SHA1	Message	Date
Ivan Butygin	f785ca0d72	[mlir][nvgpu] Move memref memspace attributes conversion to single place (#172156 ) Also, some fixes for AMDGPU part for better naming.	2025-12-14 12:44:47 +03:00
Durgadoss R	fddf7b0510	[MLIR][NVVM] Update mbarrier.arrive.expect_tx Op (#169922 ) This patch updates the mbarrier.arrive.expect_tx Op. It also adds an Op for its arrive_drop version. * No change in the existing inline-asm lowering. This functionality continues to work as is. * An optional return value is added for shared_cta space. * The scope and semantics are added as attributes. * Inline-PTX lowering is available when `predicate` is provided. Otherwise, the Op lowers to intrinsics. * lit tests are added to verify the lowering to intrinsics. * Specific negative tests are added to check the invalid cases for inline-ptx lowering. Signed-off-by: Durgadoss R <durgadossr@nvidia.com>	2025-12-01 22:19:34 +05:30
Durgadoss R	7eeae8e41d	[MLIR][NVVM] Update mbarrier Ops to use AnyTypeOf[] (3/3) (#167567 ) This is a follow-up of PR #165558 and #165993. This patch updates the remaining two Ops to use the AnyTypeOf[] construct, completing the migration for the mbarrier family of Ops. ``` mbarrier.arrive.expect_tx mbarrier.try_wait.parity ``` Signed-off-by: Durgadoss R <durgadossr@nvidia.com>	2025-11-12 16:07:57 +05:30
Durgadoss R	35ee3c6f72	[MLIR][NVVM] Update mbarrier Ops to use AnyTypeOf[] (2/n) (#165993 ) This is a follow up of PR #165558. (1/n) This patch updates the below mbarrier Ops to use AnyTypeOf[] construct: ``` * mbarrier.arrive * mbarrier.arrive.noComplete * mbarrier.test.wait * cp.async.mbarrier.arrive ``` * Updated existing tests accordingly. * Verified locally that there are no new regressions in the `integration` tests. * TODO: Two more Ops remain and will be migrated in a subsequent PR. Signed-off-by: Durgadoss R <durgadossr@nvidia.com>	2025-11-05 15:38:24 +05:30
Durgadoss R	523706f2cd	[MLIR][NVVM] Update mbarrier.init/inval Ops to use AnyTypeOf[] (#165558 ) This patch updates the mbarrier.init/inval Ops to use the AnyTypeOf[] construct for their `addr` argument. This enables us to have a single Op that can take a pointer in either generic or shared memory space and generate the right intrinsics during the lowering. * Updated existing tests accordingly. * Verified locally that there are no new regressions in `integration` tests. * TODO: Additional updates for the remaining mbarrier Ops are in progress. These will be refactored in subsequent patches. Signed-off-by: Durgadoss R <durgadossr@nvidia.com>	2025-10-31 17:20:49 +05:30
Durgadoss R	fa366b4e9f	[MLIR][NVVM] Update TMA Load Op (#156347 ) This patch includes im2col and gather mode support for the TMA Load Op. The lowering is also updated to intrinsics except when a Predicate is given. This completes the Blackwell additions on this Op. * NVVM Dialect has support for Shared::Cluster address-space now. So, this patch also updates the Op to use AS(7) instead of AS(3). The corresponding inline-ptx based unit tests are also updated. * lit tests are added for all combinations. Signed-off-by: Durgadoss R <durgadossr@nvidia.com>	2025-09-23 13:03:35 +05:30
Fabian Mora	48babe1931	[mlir][LLVM] Add LLVMAddrSpaceAttrInterface and NVVMMemorySpaceAttr (#157339 ) This patch introduces the `LLVMAddrSpaceAttrInterface` for defining compatible LLVM address space attributes To test this interface, this patch also adds: - Adds NVVMMemorySpaceAttr implementing both LLVMAddrSpaceAttrInterface and MemorySpaceAttrInterface - Converts NVVM memory space constants from enum to MLIR enums - Updates all NVVM memory space references to use new attribute system - Adds support for NVVM memory spaces in ptr dialect translation Example: ```mlir llvm.func @nvvm_ptr_address_space( !ptr.ptr<#nvvm.memory_space<global>>, !ptr.ptr<#nvvm.memory_space<shared>>, !ptr.ptr<#nvvm.memory_space<constant>>, !ptr.ptr<#nvvm.memory_space<local>>, !ptr.ptr<#nvvm.memory_space<tensor>>, !ptr.ptr<#nvvm.memory_space<shared_cluster>> ) -> !ptr.ptr<#nvvm.memory_space<generic>> ``` Translating the above code to LLVM produces: ```llvm declare ptr @nvvm_ptr_address_space(ptr addrspace(1), ptr addrspace(3), ptr addrspace(4), ptr addrspace(5), ptr addrspace(6), ptr addrspace(7)) ``` To convert the memory space enum to the new enum class use: ```bash grep -r . -e "NVVMMemorySpace::kGenericMemorySpace" -l \| xargs sed -i -e "s/NVVMMemorySpace::kGenericMemorySpace/NVVMMemorySpace::Generic/g" grep -r . -e "NVVMMemorySpace::kGlobalMemorySpace" -l \| xargs sed -i -e "s/NVVMMemorySpace::kGlobalMemorySpace/NVVMMemorySpace::Global/g" grep -r . -e "NVVMMemorySpace::kSharedMemorySpace" -l \| xargs sed -i -e "s/NVVMMemorySpace::kSharedMemorySpace/NVVMMemorySpace::Shared/g" grep -r . -e "NVVMMemorySpace::kConstantMemorySpace" -l \| xargs sed -i -e "s/NVVMMemorySpace::kConstantMemorySpace/NVVMMemorySpace::Constant/g" grep -r . -e "NVVMMemorySpace::kLocalMemorySpace" -l \| xargs sed -i -e "s/NVVMMemorySpace::kLocalMemorySpace/NVVMMemorySpace::Local/g" grep -r . 
-e "NVVMMemorySpace::kTensorMemorySpace" -l \| xargs sed -i -e "s/NVVMMemorySpace::kTensorMemorySpace/NVVMMemorySpace::Tensor/g" grep -r . -e "NVVMMemorySpace::kSharedClusterMemorySpace" -l \| xargs sed -i -e "s/NVVMMemorySpace::kSharedClusterMemorySpace/NVVMMemorySpace::SharedCluster/g" ``` NOTE: A future patch will add support for ROCDL, it wasn't added here to keep the patch small.	2025-09-14 09:05:28 -04:00
Srinivasa Ravi	f1032f06e8	[MLIR][NVVM][NVGPU] Combine prefetch and prefetch.tensormap (#153134 ) This PR combines the `prefetch` and `prefetch.tensormap` NVVM Ops to one `prefetch` Op. The `tensormap` variant is lowered through the newly added intrinsics. The lowering of the NVGPU `tma.prefetch.descriptor` Op is changed from lowering to the `prefetch.tensormap` Op to `prefetch`. PTX Spec Reference: https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-prefetch-prefetchu	2025-09-01 15:56:31 +05:30
Durgadoss R	4a5b051d53	[MLIR][NVVM] Update TMA Store Op (#155435 ) This patch includes im2col and scatter mode support to the TMA Store Op. The lowering is also updated to intrinsics except when Predicate is given. This completes the Blackwell additions on this Op. * lit tests are added for all combinations. * Move the TMA reduce invalid tests to their own file. Signed-off-by: Durgadoss R <durgadossr@nvidia.com>	2025-08-29 11:27:17 +05:30
lonely eagle	200a9a87fe	[mlir][nvgpu] Move dependent dialect from C++ to TableGen for nvgpu-to-nvvm pass (NFC) (#155801 ) Removed the getDependentDialects function from the convert-nvgpu-to-nvvm pass and instead use TableGen to define dependent dialects.	2025-08-29 08:49:33 +08:00
Matthias Springer	f0967fca04	[mlir][LLVM] `FuncToLLVM`: Add 1:N type conversion support (#153823 ) Add support for 1:N type conversions to the `FuncToLLVM` lowering patterns. This commit does not change the lowering of any types (such as `MemRefType`). It just sets up the infrastructure, such that 1:N type conversions can be used during `FuncToLLVM`. Note: When the converted result types of a `func.func` have more than 1 type, then the results are wrapped in an `llvm.struct`. That's because `llvm.func` does not support multiple result values. This "wrapping" was already implemented for cases where the original `func.func` has multiple results. With 1:N conversions, even a single result can now expand to multiple converted results, triggering the same wrapping mechanism. The test cases are exercised with both the old and the new no-rollback conversion driver.	2025-08-16 09:45:08 +02:00
Gao Yanfeng	24f5385a85	[MLIR][NVVM] Support generating all the ldmatrix intrinsics from NVVM ops (#148783 ) Previously, the NVVM dialect's ldmatrix operation could only generate a limited subset of the available NVVM ldmatrix intrinsics. The intrinsics that generate the new ops introduced in Blackwell are not accessible through the NVVM ops. This commit extends the ldmatrix operation to support all available ldmatrix intrinsics.	2025-08-12 15:13:15 +01:00
Mehdi Amini	0d8abc2188	[MLIR] Migrate NVVM to the new LDBG debug macro (NFC) (#151162 )	2025-07-30 13:28:51 +02:00
Kazu Hirata	1a0f482de8	[mlir] Remove unused includes (NFC) (#150476 ) These are identified by misc-include-cleaner. I've filtered out those that break builds. Also, I'm staying away from llvm-config.h, config.h, and Compiler.h, which likely cause platform- or compiler-specific build failures.	2025-07-24 11:23:53 -07:00
Maksim Levental	4ae9fdca8a	[mlir][NFC] update `Conversion` create APIs (6/n) (#149888 ) See https://github.com/llvm/llvm-project/pull/147168 for more info.	2025-07-22 08:16:53 -04:00
Bruno Cardoso Lopes	05494f3bad	[MLIR][LLVM] Tail call support for inline asm op (#140826 )	2025-05-22 15:30:31 -07:00
Peiyong Lin	04ad8d4900	Emit inbounds and nuw attributes in memref. (#138984 ) Now that MLIR accepts nuw and nusw in getelementptr, this patch emits the inbounds and nuw attributes when lowering memref to LLVM in load and store operators. This patch also strengthens the memref.load and memref.store spec about undefined behaviour during lowering. This patch also moves the \|rewriter\| parameter in getStridedElementPtr ahead so that LLVM::GEPNoWrapFlags can be added at the end with a default value and grouped together with other operators' parameters. Signed-off-by: Lin, Peiyong <linpyong@gmail.com>	2025-05-20 14:16:22 -07:00
Matthias Springer	85742f7642	[mlir][LLVM] Delete `getFixedVectorType` and `getScalableVectorType` (#135051 ) The LLVM dialect no longer has its own vector types. It uses `mlir::VectorType` everywhere. Remove `LLVM::getFixedVectorType/getScalableVectorType` and use `VectorType::get` instead. This commit addresses a [comment](https://github.com/llvm/llvm-project/pull/133286#discussion_r2022192500) on the PR that deleted the LLVM vector types.	2025-04-10 10:36:21 +02:00
Guray Ozen	38d9a44510	[MLIR][NVGPU] Add `tma.fence.descriptor` OP (#133218 ) When the TMA descriptor is transferred from host memory to global memory using cudaMemcpy, each thread block must insert a fence before any thread accesses the updated tensor map in global memory. Once the tensor map has been accessed, no additional fences are needed by that block unless the map is modified again. [Example from cuda programming guide](https://docs.nvidia.com/cuda/cuda-c-programming-guide/#using-tma-to-transfer-multi-dimensional-arrays). The `tma.fence.descriptor` basically implements `ptx::fence_proxy_tensormap_generic`. ``` #include <cuda.h> #include <cuda/ptx> namespace ptx = cuda::ptx; __device__ CUtensorMap global_tensor_map; __global__ void kernel(CUtensorMap *tensor_map) { // Fence acquire tensor map: ptx::n32_t<128> size_bytes; // Since the tensor map was modified from the host using cudaMemcpy, // the scope should be .sys. ptx::fence_proxy_tensormap_generic( ptx::sem_acquire, ptx::scope_sys, tensor_map, size_bytes ); // Safe to use tensor_map after fence inside this thread.. } int main() { CUtensorMap local_tensor_map; // [ ..Initialize map.. ] cudaMemcpy(&global_tensor_map, &local_tensor_map, sizeof(CUtensorMap), cudaMemcpyHostToDevice); kernel<<<1, 1>>>(global_tensor_map); } ```	2025-03-27 15:20:19 +01:00
Guray Ozen	bc7e3915e1	[MLIR][NVGPU] Add `mbarrier.get` Op (#133221 ) The `mbarrier.create` op can create multiple mbarrier objects, and other mbarrier-related ops can access an mbarrier using a dynamic SSA value. This is especially useful when using mbarriers in dynamic loops. This PR adds the `mbarrier.get` op, which returns a pointer to a specific mbarrier object from a group of barriers created by the nvgpu.mbarrier.create operation. It is useful when composing the NVGPU and NVVM dialects. Example: ``` %mbars = nvgpu.mbarrier.create -> !nvgpu.mbarrier.group<memorySpace = #gpu.address_space<workgroup>, num_barriers = 10> %mbar_pointer = nvgpu.mbarrier.get %mbars[%c2] : !nvgpu.mbarrier.group<memorySpace = #gpu.address_space<workgroup>> -> i32 ```	2025-03-27 15:20:07 +01:00
Krzysztof Drewniak	f4e3b8783c	[mlir][LLVM] Switch `undef` for `poison` for uninitialized values (#125629 ) LLVM itself is generally moving away from using `undef` and towards using `poison`, to the point of having a lint that catches new uses of `undef` in tests. In order to not trip the lint on new patterns and to conform to the evolution of LLVM: - Rename various ::undef() methods on StructBuilder subclasses to ::poison() - Audit the uses of UndefOp in the MLIR libraries and replace almost all of them with PoisonOp The remaining uses of `undef` are initializing `uninitialized` memrefs, explicit conversions to undef from SPIR-V, and a few cases in AMDGPUToROCDL where usage like %v = insertelement <M x iN> undef, iN %v, i32 0 %arg = bitcast <M x iN> %v to i(M * N) is used to handle "i32" arguments that are really packed vectors of smaller types that won't always be fully initialized.	2025-02-06 12:49:30 -06:00
Matthias Springer	7a77f14c0a	[mlir][IR] Remove `isF...()` type API for low-precision FP types (#123326 ) Remove `type.isFloat4E2M1FN()` etc. Use `isa<Float4E2M1FNType>(type)` instead. For details, see: https://discourse.llvm.org/t/rethink-on-approach-to-low-precision-fp-types/82361/28	2025-01-20 09:22:53 +01:00
Matthias Springer	206fad0e21	[mlir][NFC] Mark type converter in `populate...` functions as `const` (#111250 ) This commit marks the type converter in `populate...` functions as `const`. This is useful for debugging. Patterns already take a `const` type converter. However, some `populate...` functions do not only add new patterns, but also add additional type conversion rules. That makes it difficult to find the place where a type conversion was added in the code base. With this change, all `populate...` functions that only populate patterns now have a `const` type converter. Programmers can then conclude from the function signature that these functions do not register any new type conversion rules. Also includes some minor cleanups around the 1:N dialect conversion infrastructure, which did not always pass the type converter as a `const` object internally.	2024-10-05 21:32:40 +02:00
Youngsuk Kim	123e8c735d	[mlir] Don't call llvm::raw_string_ostream::flush() (NFC) Don't call raw_string_ostream::flush(), which is essentially a no-op. As specified in the docs, raw_string_ostream is always unbuffered. ( 65b13610a5226b84889b923bae884ba395ad084d for further reference )	2024-09-22 15:37:34 -05:00
Observer007	2b23e6c8d6	[mlir][nvgpu] Add `nvgpu.rcp` OP (#100965 ) This PR introduces a new OP for reciprocal calculation for `vector` types using `nvvm.rcp` OPs. Currently, it supports only f32 types --------- Co-authored-by: jingzec <jingzec@nvidia.com>	2024-07-30 09:20:49 +02:00
Christian Sigg	a5757c5b65	Switch member calls to `isa/dyn_cast/cast/...` to free function calls. (#89356 ) This change cleans up call sites. Next step is to mark the member functions deprecated. See https://mlir.llvm.org/deprecation and https://discourse.llvm.org/t/preferred-casting-style-going-forward.	2024-04-19 15:58:27 +02:00
Guray Ozen	0a600c34c8	[mlir][nvgpu] Make `phaseParity` of `mbarrier.try_wait` `i1` (#81460 ) Currently, the `phaseParity` argument of `nvgpu.mbarrier.try_wait.parity` is index. This can cause a problem if it is passed any value other than 0 or 1, because the PTX instruction only accepts an even or odd phase. This PR makes the phaseParity argument i1 to avoid misuse. Here is the information from the PTX doc: ``` The .parity variant of the instructions test for the completion of the phase indicated by the operand phaseParity, which is the integer parity of either the current phase or the immediately preceding phase of the mbarrier object. An even phase has integer parity 0 and an odd phase has integer parity of 1. So the valid values of phaseParity operand are 0 and 1. ``` For more information, see: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-mbarrier-test-wait-mbarrier-try-wait	2024-02-13 09:50:34 +01:00
Guray Ozen	fa13c3eea7	[mlir][nvgpu] Fix `transposeB` in `nvgpu.warpgroup.mma` (#79271 ) #76150 fixed the meaning of `transposeB` in the NVVM dialect, which was initially implemented with the opposite meaning. This PR fixes the lowering of `nvgpu.warpgroup.mma` to the NVVM dialect. This will fix two integration tests: gemm_f32_f16_f16_128x128x128.mlir gemm_pred_f32_f16_f16_128x128x128.mlir	2024-01-25 09:25:43 +01:00
Guray Ozen	12c241b365	[MLIR][NVVM] Explicit Data Type for Output in `wgmma.mma_async` (#78713 ) The current implementation of `nvvm.wgmma.mma_async` Op deduces the data type of the output matrix from the data type of struct member, which can be non-intuitive, especially in cases where types like `2xf16` are packed into `i32`. This PR addresses this issue by improving the Op to include an explicit data type for the output matrix. The modified Op now includes an explicit data type for Matrix-D (<f16>), and looks as follows: ``` %result = llvm.mlir.undef : !llvm.struct<(struct<(i32, i32, ... nvvm.wgmma.mma_async %descA, %descB, %result, #nvvm.shape<m = 64, n = 32, k = 16>, D [<f16>, #nvvm.wgmma_scale_out<zero>], A [<f16>, #nvvm.wgmma_scale_in<neg>, <col>], B [<f16>, #nvvm.wgmma_scale_in<neg>, <col>] ```	2024-01-22 08:37:20 +01:00
Guray Ozen	21830c9135	[mlir][nvgpu] Fix 'warpgroup.mma.store' index calculation (#78413 ) This PR fixes the 'nvgpu.warpgroup.mma.store' index calculation. When the destination memref and current accumulator matrix were small, the previous code was reaching out of range.	2024-01-22 08:32:56 +01:00
Guray Ozen	8dd0d95c7c	[mlir][nvgpu] Add `nvgpu.tma.async.store` (#77811 ) PR adds `nvgpu.tma.async.store` Op for asynchronous stores using the Tensor Memory Access (TMA) unit. It also implements Op lowering to NVVM dialect. The Op currently performs asynchronous stores of a tile memory region from shared to global memory for a single CTA.	2024-01-15 11:44:51 +01:00
Guray Ozen	4319e1916d	[mlir][nvgpu] Introduce Multicast Capability to `nvgpu.tma.async.load` (#76935 ) This PR improves the functionality of the `nvgpu.tma.async.load` Op by adding support for multicast. While we already had this capability in the lower-level `nvvm.cp.async.bulk.tensor.shared.cluster.global` NVVM Op, this PR lowers mask information to the NVVM operation.	2024-01-05 10:48:55 +01:00
Guray Ozen	3a03da37a3	[mlir][nvgpu] Add address space attribute converter in nvgpu-to-nvvm pass (#74075 ) The GPU dialect has `#gpu.address_space<workgroup>` for shared memory of NVGPU (address space = 3). However, when IR combines the NVGPU and GPU dialects, the `nvgpu-to-nvvm` pass fails due to a missing attribute conversion. This PR adds `populateGpuMemorySpaceAttributeConversions` to the nvgpu-to-nvvm lowering, so we can use `#gpu.address_space<workgroup>` with the `nvgpu-to-nvvm` pass.	2023-12-04 16:48:39 +01:00
Guray Ozen	9ceea08859	[mlir] `im2col` & `l2cache` on cp.async.bulk.tensor.shared.cluster.global` (#72967 ) PR adds support of `im2col` and `l2cache` to `cp.async.bulk.tensor.shared.cluster.global`. The Op now supports all the traits of the corresponding PTX instruction. The current structure of this operation looks somewhat like below. The PR also simplifies types so we don't need to write obvious types after `:` anymore. ``` nvvm.cp.async.bulk.tensor.shared.cluster.global %dest, %tmaDescriptor, %barrier, box[%crd0,%crd1,%crd2,%crd3,%crd4] im2col[%off0,%off1,%off2] <-- PR introduces multicast_mask = %ctamask l2_cache_hint = %cacheHint <-- PR introduces : !llvm.ptr<3>, !llvm.ptr ```	2023-11-22 16:08:09 +01:00
Guray Ozen	108380da35	[mlir][nvvm] Add `cp.async.bulk.tensor.shared.cluster.global.multicast` (#72429 ) This PR introduces the `cp.async.bulk.tensor.shared.cluster.global.multicast` Op in the NVVM dialect. It uses TMA to load data from global memory to the shared memory of multiple CTAs in the cluster. It resolves #72368	2023-11-16 14:34:56 +01:00
Christian Ulmann	2f17c9f65e	[MLIR][NVGPUToNVVM] Remove typed pointer support (#70867 ) This commit removes the support for lowering NVGPU to NVVM dialect with typed pointers. Typed pointers have been deprecated for a while now and it's planned to soon remove them from the LLVM dialect. Related PSA: https://discourse.llvm.org/t/psa-removal-of-typed-pointers-from-the-llvm-dialect/74502	2023-11-02 07:35:21 +01:00
Guray Ozen	192d3320f0	[mlir][nvgpu] Add predicate argument to NVGPU Ops (#69322 )	2023-10-18 19:41:51 +02:00
Guray Ozen	39cdefb5b5	[mlir][nvvm] Add prefetch.tensormap (#67564 ) This PR adds `prefetch.tensormap` Op. It brings the cache line containing the given tma descriptor for subsequent use by the cp.async.bulk.tensor instruction. https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-prefetch-prefetchu	2023-10-17 13:03:37 +02:00
Guray Ozen	c4ba84d655	[mlir][nvgpu] Fix packing accumlator matrix (#69316 ) #68728 significantly simplified the accumulator matrix type, making it easier to work with the nvgpu dialect without worrying about the number of required structs, as this information is abstracted away in the nvgpu-to-nvvm transformation. However, we forgot to pack the structs after initialization, causing the accumulator matrix to hold undefined values, which is wrong. This PR addresses that.	2023-10-17 12:46:10 +02:00
Guray Ozen	63389326f5	[mlir][nvvm] Support predicates in `BasicPtxBuilder` (#67102 ) This PR enhances `BasicPtxBuilder` to support predicates in PTX code generation. The `BasicPtxBuilder` interface was initially introduced for generating PTX code automatically for Ops that aren't supported by LLVM core. Predicates, which are typically not supported in LLVM core, are now supported using the same mechanism. In PTX programming, instructions can be guarded by predicates as shown below. Here `@p` is a predicate register and guards the execution of the instruction. ``` @p ptx.code op1, op2, op3 ``` This PR introduces the `getPredicate` function in the `BasicPtxBuilder` interface to set an optional predicate. When a predicate is provided, the instruction is generated with the predicate as its guard; otherwise, no predicate is generated. Note that the predicate value must always appear as the last argument on the Op definition. Additionally, this PR implements predicate usage for the following ops: - mbarrier.init - mbarrier.init.shared - mbarrier.arrive.expect_tx - mbarrier.arrive.expect_tx.shared - cp.async.bulk.tensor.shared.cluster.global - cp.async.bulk.tensor.global.shared.cta For more detail, see the PTX programming model: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#ptx-instructions	2023-10-17 12:42:36 +02:00
Guray Ozen	52db7e2745	[mlir][nvgpu] Improve `WarpgroupAccumulator` type to simplify IR (#68728 ) `WarpgroupAccumulator` (or `!nvgpu.warpgroup.accumulator`) is a type that keeps the accumulator matrix that is used by warp-group level matrix multiplication. It is handy to have a special type for that as the matrix is distributed among the threads of the warp-group. However, the current transformations require creating and using multiple `WarpgroupAccumulator` values if the shape of the GEMM is larger than the supported shape of the `wgmma.mma_async` instruction. This makes the IR look dense. This PR improves the transformation of the `WarpgroupAccumulator` type in every nvgpu Op that uses it. Example: Current GEMM in NVGPU-IR ``` // Init %m1, %m2 = nvgpu.warpgroup.mma.init.accumulator -> !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>, !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>> // GEMM %r1, %r2 = nvgpu.warpgroup.mma %descA, %descB, %m1, %m2 {transposeB}: !nvgpu.warpgroup.descriptor<tensor = memref<128x64xf16, 3>>, !nvgpu.warpgroup.descriptor<tensor = memref<64x128xf16, 3>>, !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>, !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>> -> !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>, !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>> // Epilogue nvgpu.warpgroup.mma.store [%r1, %r2] to %sharedMemoryBuffer : !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>, !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>> into memref<128x128xf32,3> ``` Example: This PR simplifies the IR as below: ``` // Init %m = nvgpu.warpgroup.mma.init.accumulator -> !nvgpu.warpgroup.accumulator<fragmented = vector<128x128xf32>> // GEMM %r1 = nvgpu.warpgroup.mma %descA, %descB, %m1 {transposeB}: !nvgpu.warpgroup.descriptor<tensor = memref<128x64xf16, 3>>, !nvgpu.warpgroup.descriptor<tensor = memref<64x128xf16, 3>>, !nvgpu.warpgroup.accumulator<fragmented = vector<128x128xf32>> -> !nvgpu.warpgroup.accumulator<fragmented = vector<128x128xf32>> // Epilogue nvgpu.warpgroup.mma.store [%matrixD1, %matrixD2] to %sharedMemoryBuffer : !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>, !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>> into memref<128x128xf32,3> ```	2023-10-17 11:46:47 +02:00
Guray Ozen	315ab3c44b	[MLIR][NVGPU] Introduce `warpgroup.init.accumulator` Op (#67530 ) This Op generates and initializes the accumulator matrix for the `nvgpu.warpgroup.mma` op to perform matrix-multiply-and-accumulate (mma). Its associated transformation generates an `!llvm.struct<>` and fills it with the initial values. The size of the struct is the number of required inout registers for the `nvgpu.warpgroup.mma` op.	2023-10-11 08:28:26 -07:00
Guray Ozen	d20fbc9007	[MLIR][NVGPU] Introduce `nvgpu.wargroup.mma.store` Op for Hopper GPUs (#65441 ) This PR introduces a new Op called `warpgroup.mma.store` to the NVGPU dialect of MLIR. The purpose of this operation is to facilitate storing the fragmented result(s) `nvgpu.warpgroup.accumulator` produced by `warpgroup.mma` to the given memref. An example of a fragmented matrix is given here: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#wgmma-64n16-d The `warpgroup.mma.store` does the following: 1) Takes one or more `nvgpu.warpgroup.accumulator` type (fragmented results matrix) 2) Calculates indexes per thread in the warp-group and stores the data into the given memref. Here's an example usage: ``` // A warpgroup performs GEMM, results in fragmented matrix %result1, %result2 = nvgpu.warpgroup.mma ... // Stores the fragmented result to memref nvgpu.warpgroup.mma.store [%result1, %result2], %matrixD : !nvgpu.warpgroup.accumulator< fragmented = vector<64x128xf32>>, !nvgpu.warpgroup.accumulator< fragmented = vector<64x128xf32>> to memref<128x128xf32,3> ```	2023-10-05 10:54:13 +02:00
Guray Ozen	b74cfc139a	[mlir][nvgpu] Improve nvgpu->nvvm transformation of `warpgroup.mma` Op (NFC) (#67325 ) This PR introduces substantial improvements to the readability and maintainability of the `nvgpu.warpgroup.mma` Op transformation from nvgpu->nvvm. This transformation plays a crucial role in GEMM and manages complex operations such as generating multiple wgmma ops and iterating their descriptors. The prior code lacked clarity, but this PR addresses that issue effectively. The PR does the following: Introduces a helper class: the `WarpgroupGemm` class encapsulates the necessary functionality, making the code cleaner and more understandable. Detailed Documentation: Each function within the helper class is thoroughly documented to provide clear insights into its purpose and functionality.	2023-10-05 10:16:59 +02:00
Guray Ozen	7eb2b99f16	[mlir] Change the class name of the `GenerateWarpgroupDescriptor` (#68286 )	2023-10-05 10:15:40 +02:00
Guray Ozen	6dc7717bca	[MLIR][NVGPU] Change name `wgmma.descriptor` to `warpgroup.descriptor` (NFC) (#67526 ) The NVGPU dialect is gaining large support for warpgroup level operations, and their names always start with `warpgroup....`. This PR changes the name of the Op and type from `wgmma.descriptor` to `warpgroup.descriptor` for the sake of consistency.	2023-10-05 09:01:48 +02:00
Guray Ozen	ee49cda7d4	[mlir][nvgpu] Use ImplicitLocOpBuilder in nvgpu-to-nvvm pass (NFC) (#67993 ) For the sake of better readability, this PR uses `ImplicitLocOpBuilder` instead of rewriter+loc	2023-10-03 10:52:36 +02:00
Guray Ozen	17649a7726	[MLIR][NVGPU] Introduce `nvgpu.mbarrier.group` for multiple mbarrier use (#65951 ) A common practice involves the creation of multiple `mbarrier` objects, see an example below. This is particularly valuable in scenarios like software pipelining for GEMM, where we need to generate multiple barriers, then dynamically use and wait on them in a loop. The PR turns the `nvgpu.mbarrier.barrier` type into the `nvgpu.mbarrier.group`. All `mbarrier` related Ops now use this type. Consequently, these Ops are now capable of managing multiple barriers seamlessly. Having `num_barriers = 4` helps us to place the mbarrier object(s) in static shared memory. We could make the value dynamic, but that requires dynamic shared memory and would complicate the codegen. ``` %barriers = nvgpu.mbarrier.create -> !nvgpu.mbarrier.group<3, num_barriers = 4> nvgpu.mbarrier.init %barriers[%c0], %num_threads : !nvgpu.mbarrier.group<3, num_barriers = 4> nvgpu.mbarrier.init %barriers[%c1], %num_threads : !nvgpu.mbarrier.group<3, num_barriers = 4> nvgpu.mbarrier.init %barriers[%c2], %num_threads : !nvgpu.mbarrier.group<3, num_barriers = 4> nvgpu.mbarrier.init %barriers[%c3], %num_threads : !nvgpu.mbarrier.group<3, num_barriers = 4> ... scf.for %i = %c0 to %n step %c1 { nvgpu.mbarrier.try_wait %barriers[ (i % 4) ] ... // ... Do work once mbarrier is ready nvgpu.mbarrier.arrive.expect_tx %barriers[ ((i + 3) % 4) ] ... } ``` We will have mbarrier usages like below: ``` expect_tx[0] expect_tx[1] expect_tx[2] Loop: try_wait mbarrier[0], expect_tx[3] try_wait mbarrier[1], expect_tx[0] try_wait mbarrier[2], expect_tx[1] try_wait mbarrier[3], expect_tx[2] ... ```	2023-09-22 17:09:43 +02:00
Guray Ozen	2388222695	[MLIR][NVGPU] Adding `nvgpu.warpgroup.mma` Op for Hopper GPUs (#65440 ) This work introduces a new operation called `warpgroup.mma` to the NVGPU dialect of MLIR. The purpose of this operation is to facilitate warpgroup-level matrix multiply and accumulate (WGMMA) operations on Hopper GPUs with sm_90a architecture. Previously, the `nvvm.wgmma.mma_async` operation was introduced to support warpgroup-level matrix operations in the NVVM dialect. Multiple instances of `nvvm.wgmma.mma_async` must be used to achieve the desired shape. The new `nvgpu.warpgroup.mma` operation abstracts this complexity and provides a higher-level interface for performing warpgroup-level matrix operations. The `nvgpu.warpgroup.mma` does the following: 1) Corresponds to multiple `wgmma` instructions. 2) Iterates input matrix descriptors to achieve the desired computation shape. 3) Groups and runs `wgmma` instructions asynchronously, and eventually waits for them. This is done by `wgmma.fence.aligned`, `wgmma.commit.group.sync.aligned`, and `wgmma.wait.group.sync.aligned` 4) Results in fragmented matrices Here's an example usage of the `nvgpu.warpgroup.mma` operation: ``` %wgmmaResult, %wgmmaResult2 = nvgpu.warpgroup.mma %descA, %descB, %acc1, %acc2 {transposeB}: !nvgpu.wgmma.descriptor<tensor = memref<128x64xf16, 3>>, !nvgpu.wgmma.descriptor<tensor = memref<64x128xf16, 3>>, !nvgpu.warpgroup.accumulator< fragmented = vector<64x128xf32>>, !nvgpu.warpgroup.accumulator< fragmented = vector<64x128xf32>> -> !nvgpu.warpgroup.accumulator< fragmented = vector<64x128xf32>>, !nvgpu.warpgroup.accumulator< fragmented = vector<64x128xf32>> ``` The op will result in the following PTX: ``` wgmma.fence.sync.aligned; wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f1, %f2, 62 more registers}, %descA, %descB, p, 1, 1, 0, 1; wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f1, %f2, 62 more registers}, %descA+2, %descB+128, p, 1, 1, 0, 1; wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f1, %f2, 62 more registers}, %descA+4, %descB+256, p, 1, 1, 0, 1; wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f1, %f2, 62 more registers}, %descA+8, %descB+348, p, 1, 1, 0, 1; wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f500,%f501, 62 more registers}, %descA+512, %descB, p, 1, 1, 0, 1; wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f500,%f501, 62 more registers}, %descA+514, %descB+128, p, 1, 1, 0, 1; wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f500,%f501, 62 more registers}, %descA+516, %descB+256, p, 1, 1, 0, 1; wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f500,%f501, 62 more registers}, %descA+518, %descB+348, p, 1, 1, 0, 1; wgmma.commit_group.sync.aligned; wgmma.wait_group.sync.aligned 1; ``` The Op keeps - first 64 registers (`{%f1, %f2, 62 more registers}`) -> `%acc1` - second 64 registers (`{%f500,%f501, 62 more registers}`) -> `%acc2`.	2023-09-22 11:46:29 +02:00
Guray Ozen	b96d069324	[NVGPU] Add debug in nvgpu (nfc) Reviewed By: nicolasvasilache Differential Revision: https://reviews.llvm.org/D159343	2023-09-01 16:40:40 +02:00
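
The barrier rotation described in the `nvgpu.mbarrier.group` commit (17649a7726) can be sketched in a few lines of Python. This is only an illustration of the index arithmetic, not code from the repository: the function name `mbarrier_schedule` and its parameters are hypothetical, and it assumes the pipeline pre-issues `expect_tx` on all but one barrier before the loop, as the usage listing in that commit suggests.

```python
def mbarrier_schedule(num_iterations, num_barriers=4):
    """Sketch of the barrier indices used by a software-pipelined loop.

    Hypothetical helper: returns the barrier indices armed with expect_tx
    before the loop, and one (try_wait index, expect_tx index) pair per
    loop iteration.
    """
    # Prologue: arm every barrier except the last one.
    prologue = list(range(num_barriers - 1))
    # Loop body: wait on barrier i % N, then re-arm barrier (i + N - 1) % N,
    # i.e. the one that was consumed in the previous iteration.
    loop = [
        (i % num_barriers, (i + num_barriers - 1) % num_barriers)
        for i in range(num_iterations)
    ]
    return prologue, loop


if __name__ == "__main__":
    prologue, loop = mbarrier_schedule(4)
    for index in prologue:
        print(f"expect_tx[{index}]")
    for wait_index, arm_index in loop:
        print(f"try_wait mbarrier[{wait_index}], expect_tx[{arm_index}]")
```

With four barriers, the re-arm index is `(i + 3) % 4`; the parentheses matter, since `i + 3 % 4` would evaluate to `i + 3` under the usual operator precedence.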

86 Commits