67 Commits

Christian Sigg
7851b1bcf1
[mlir][gpu] Change GPU modules to globals (#135478)
Load/unload GPU modules in global ctors/dtors instead of on every kernel
launch.

Loading GPU modules is a heavy-weight operation and synchronizes the GPU
context. Now that the modules are loaded ahead of time, asynchronously
launched kernels can run concurrently, see
https://discourse.llvm.org/t/how-to-lower-the-combination-of-async-gpu-ops-in-gpu-dialect.

The implementations of `embedBinary()` and `launchKernel()` use slightly
different mechanics at the moment but I prefer to not change the latter
more than necessary as part of this PR. I will prepare a follow-up NFC
for `launchKernel()` to align them again.
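
A minimal sketch of the ctor/dtor pattern, assuming a hypothetical embedded
binary symbol `moduleBlob` (the actual generated symbols differ, and context
setup is omitted):

```
#include "cuda.h"

// Hypothetical embedded module binary emitted by embedBinary(); the real
// generated symbol names differ.
extern "C" unsigned char moduleBlob[];

static CUmodule gModule = nullptr;

// Load once at startup instead of on every kernel launch. Context
// initialization (cuInit, cuCtxCreate) is omitted for brevity.
__attribute__((constructor)) static void mgpuLoadModuleOnce() {
  cuModuleLoadData(&gModule, moduleBlob);
}

// Unload once at shutdown.
__attribute__((destructor)) static void mgpuUnloadModuleOnce() {
  cuModuleUnload(gModule);
}
```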
2025-04-22 13:49:58 +02:00
Guray Ozen
20861f1f2f
[mlir][gpu] Use alloc OP's host_shared in cuda runtime (#99035) 2024-07-17 07:25:11 +02:00
Justin Holewinski
5e78417db5
[MLIR][CUDA] Use _alloca instead of alloca on Windows (#85853)
MSVC/Windows does not support `alloca()`; instead it defines `_alloca()`
in `malloc.h`.
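
One portable shim for this (a sketch; the actual fix in the wrappers may
differ):

```
#ifdef _WIN32
#include <malloc.h> // MSVC declares _alloca() here.
#define ALLOCA(size) _alloca(size)
#else
#include <alloca.h>
#define ALLOCA(size) alloca(size)
#endif

int main() {
  // Stack-allocate a small kernel-parameter array on either toolchain.
  void **params = static_cast<void **>(ALLOCA(4 * sizeof(void *)));
  params[0] = nullptr;
  return params[0] == nullptr ? 0 : 1;
}
```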
2024-03-20 00:32:19 -07:00
Guray Ozen
7d55b916a5
[mlir][nvgpu] Support strided memref when creating TMA descriptor (#85652) 2024-03-18 19:47:39 +01:00
Aart Bik
41a07e668c
[mlir][sparse] recognize NVidia 2:4 type for matmul (#76758)
This removes the temporary DENSE24 attribute and replaces it with proper
recognition of dense-to-2:4 conversion. The compression will be
performed on the device prior to the matrix multiplication. Note that
we no longer need to start with the linalg version; we can lift this to
the proper named linalg op. Also renames some files to more consistent
names.
2024-01-02 14:44:24 -08:00
Adam Paszke
12e4332501
[mlir][nvgpu] Fix the TMA stride setup (#75838)
There were two issues with the previous computation:
* it never looked at dimensions past the second one
* the definition was recursive, making each dimension have an extra
`elementSize` power
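
A sketch of the corrected computation, with hypothetical names: each byte
stride is `elementSize` times the product of the sizes of all trailing
dimensions, accumulated iteratively so `elementSize` enters exactly once:

```
#include <cstdint>
#include <vector>

// Byte stride of dimension d = elementSize * product of the sizes of all
// dimensions after d, accumulated iteratively so elementSize enters exactly
// once (the old recursive form multiplied it in at every level).
std::vector<uint64_t> computeByteStrides(const std::vector<uint64_t> &sizes,
                                         uint64_t elementSize) {
  std::vector<uint64_t> strides(sizes.size());
  uint64_t running = elementSize;
  for (int d = static_cast<int>(sizes.size()) - 1; d >= 0; --d) {
    strides[d] = running; // covers every dimension, not just the first two
    running *= sizes[d];
  }
  return strides;
}
```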
2023-12-19 08:40:26 +01:00
Adam Paszke
65aab9e722
[mlir][gpu] Generate multiple rank-specializations for tensor map creation (#74082)

The previous code was technically incorrect in that the type indicated
that the memref only has 1 dimension, while the code below was happily
dereferencing the size array out of bounds. If the compiler doesn't
get too smart about optimizations, this code *might even work*. But if
the compiler realizes that the array has 1 element, it might start doing
silly things. This generates a specialization for each supported rank,
making sure we don't invoke any UB.
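
An illustrative sketch of the idea (not the generated code): give the
descriptor type its true rank so every size read is in bounds, and
instantiate one function per supported rank.

```
#include <cstdint>

// Descriptor whose trailing arrays really have `Rank` elements, so reading
// sizes[0..Rank-1] is in bounds in every specialization.
template <int Rank> struct MemRefDescriptor {
  void *allocated, *aligned;
  int64_t offset;
  int64_t sizes[Rank];
  int64_t strides[Rank];
};

template <int Rank>
void createTensorMap(const MemRefDescriptor<Rank> *desc) {
  for (int d = 0; d < Rank; ++d) {
    // In-bounds by construction: no UB for the optimizer to exploit.
    int64_t size = desc->sizes[d];
    (void)size;
  }
}

// One specialization per supported rank.
template void createTensorMap<1>(const MemRefDescriptor<1> *);
template void createTensorMap<2>(const MemRefDescriptor<2> *);
```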
2023-12-01 15:51:48 +01:00
Adam Paszke
1c2a0768de
[MLIR][CUDA] Update export macros in CudaRuntimeWrappers (#73932)
This fixes a few issues present in the current version:
1) The macro doesn't enforce the default visibility on exported
   functions, causing compilation to fail when using
   `-fvisibility=hidden`
2) Not all functions are exported
3) Sometimes the macro ended up weirdly interleaved with `extern "C"`
   declarations
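
A sketch of an export macro addressing point 1, assuming the usual platform
split (the actual macro may differ in detail):

```
#ifdef _WIN32
#define MLIR_CUDA_WRAPPERS_EXPORT __declspec(dllexport)
#else
// Force default visibility so -fvisibility=hidden builds still export.
#define MLIR_CUDA_WRAPPERS_EXPORT __attribute__((visibility("default")))
#endif

// Keep `extern "C"` outside the macro so the two never interleave oddly.
extern "C" MLIR_CUDA_WRAPPERS_EXPORT void mgpuExampleEntryPoint() {}
```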
2023-11-30 14:57:39 +01:00
Guray Ozen
f21a70f9fe
[mlir][cuda] Guard mgpuLaunchClusterKernel for Cuda 12.0+ (NFC) (#73495) 2023-11-27 11:50:46 +01:00
Guray Ozen
edf5cae739
[mlir][gpu] Support Cluster of Thread Blocks in gpu.launch_func (#72871)
NVIDIA Hopper architecture introduced the Cooperative Group Array (CGA).
It is a new level of parallelism, allowing clustering of Cooperative
Thread Arrays (CTA) to synchronize and communicate through shared memory
while running concurrently.

This PR enables support for CGA within the `gpu.launch_func` in the GPU
dialect. It extends `gpu.launch_func` to accommodate this functionality.

The GPU dialect remains architecture-agnostic, so we've added the CGA
functionality as optional parameters. We want to leverage the mechanisms
the GPU dialect already has, such as outlining and kernel launching,
making this a practical and convenient choice.

An example of this implementation can be seen below:

```
gpu.launch_func @kernel_module::@kernel
                clusters in (%1, %0, %0) // <-- Optional
                blocks in (%0, %0, %0)
                threads in (%0, %0, %0)
```

The PR also introduces index and dimension Ops specific to clusters,
binding them to NVVM Ops:

```
%cidX = gpu.cluster_id  x
%cidY = gpu.cluster_id  y
%cidZ = gpu.cluster_id  z

%cdimX = gpu.cluster_dim  x
%cdimY = gpu.cluster_dim  y
%cdimZ = gpu.cluster_dim  z
```

We will introduce cluster support in `gpu.launch` Op in an upcoming PR. 

See [the
documentation](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#cluster-of-cooperative-thread-arrays)
provided by NVIDIA for details.
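
On the runtime side, a cluster launch maps onto the CUDA 12 driver's
`cuLaunchKernelEx`; a sketch with hypothetical names and no error handling:

```
#include "cuda.h"

// Requires CUDA 12+. Grid dimensions must be multiples of the cluster
// dimensions.
void launchWithClusters(CUfunction fn, CUstream stream, void **params,
                        unsigned cx, unsigned cy, unsigned cz,
                        unsigned gx, unsigned gy, unsigned gz,
                        unsigned bx, unsigned by, unsigned bz,
                        unsigned smem) {
  CUlaunchAttribute attr{};
  attr.id = CU_LAUNCH_ATTRIBUTE_CLUSTER_DIMENSION;
  attr.value.clusterDim.x = cx;
  attr.value.clusterDim.y = cy;
  attr.value.clusterDim.z = cz;

  CUlaunchConfig config{};
  config.gridDimX = gx;
  config.gridDimY = gy;
  config.gridDimZ = gz;
  config.blockDimX = bx;
  config.blockDimY = by;
  config.blockDimZ = bz;
  config.sharedMemBytes = smem;
  config.hStream = stream;
  config.attrs = &attr;
  config.numAttrs = 1;

  cuLaunchKernelEx(&config, fn, params, /*extra=*/nullptr);
}
```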
2023-11-27 11:05:07 +01:00
Guray Ozen
5ef45c02dc
[mlir][cuda] Avoid driver call to check max shared memory (#70021)
This PR guards the driver call with an if statement, since driver calls
are relatively expensive.

As a future TODO, the if statement could be generated by the compiler
and thus optimized away in some cases.
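
A sketch of the guard with hypothetical names: the device's opt-in limit is
queried only when the request exceeds the 48 KB default:

```
#include "cuda.h"

// `smem` is the requested dynamic shared memory size in bytes.
void setDynamicSmemIfNeeded(CUfunction function, CUdevice device,
                            unsigned smem) {
  // Driver calls cost time, so query the opt-in limit only when the
  // request exceeds the 48 KB default that needs no attribute change.
  if (smem <= 48 * 1024)
    return;
  int maxOptIn = 0;
  cuDeviceGetAttribute(&maxOptIn,
                       CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK_OPTIN,
                       device);
  if (smem <= static_cast<unsigned>(maxOptIn))
    cuFuncSetAttribute(function,
                       CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES,
                       static_cast<int>(smem));
}
```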
2023-10-26 11:02:32 +03:00
Aart Bik
7ac330a461
[mlir][sparse][gpu] protect BSR method with cuda 12.1 (#67728)
The official MLIR build is not quite at CUDA 12.1 yet, so until then we
protect the BSR method with a macro guard.
2023-09-28 12:58:01 -07:00
Aart Bik
39038177ee
[mlir][sparse][gpu] add CSC and BSR format to cuSparse GPU ops (#67509)
This adds two cuSPARSE formats to the GPU dialect support, together with
proper lowering and runtime CUDA support. Also fixes a few minor
omissions.
2023-09-27 09:32:25 -07:00
Nishant Patel
1002a1d058
[MLIR] Pass hostShared flag in gpu.alloc op to runtime wrappers (#66401)
This PR is a breakdown of the big PR
https://github.com/llvm/llvm-project/pull/65539, which enables Intel GPU
integration. In this PR we pass the hostShared flag to the runtime
wrappers (required by the SyclRuntimeWrappers, which will come in a
subsequent PR) to indicate whether the allocation is done in host-shared
GPU memory or device-only memory.
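
A plausible sketch of what the wrapper does with the flag (managed memory for
host-shared allocations, plain device memory otherwise; names hypothetical):

```
#include "cuda.h"
#include <cstdint>

extern "C" void *mgpuMemAllocSketch(uint64_t sizeBytes, bool isHostShared) {
  CUdeviceptr ptr = 0;
  if (isHostShared) {
    // Host-shared: managed memory accessible from both host and device.
    cuMemAllocManaged(&ptr, sizeBytes, CU_MEM_ATTACH_GLOBAL);
  } else {
    cuMemAlloc(&ptr, sizeBytes);
  }
  return reinterpret_cast<void *>(ptr);
}
```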
2023-09-26 15:32:11 -07:00
Nishant Patel
ebfea261e6
[MLIR] Pass count of parameters & gpu binary size to runtime wrappers (#66154)
This PR is a breakdown of the big PR #65539, which enables Intel GPU
integration. In this PR we pass the parameter count and the GPU binary
size to the runtime wrappers, since the SyclRuntimeWrappers (which will
come in a subsequent PR) require the SPIR-V size for compilation as well
as the number of parameters to iterate over the params.
2023-09-26 11:27:07 -07:00
Guray Ozen
2379432c88
[MLIR] Correct Initial TMA Descriptor Values (#67397) 2023-09-26 09:20:46 +02:00
Aart Bik
8998bcfbce
[mlir][sparse][gpu] refine type of workspace size variables (#66438)
Rationale:
Some compiler settings don't like the size_t vs uint64_t setup.
2023-09-14 15:49:52 -07:00
Fabian Mora
5093413a50
[mlir][gpu][NVPTX] Enable NVIDIA GPU JIT compilation path (#66220)
This patch adds an NVPTX compilation path that enables JIT compilation
on NVIDIA targets. The following modifications were performed:
1. Adding a format field to the GPU object attribute, allowing the
translation attribute to use the correct runtime function to load the
module. Likewise, a dictionary attribute was added to add any possible
extra options.

2. Adding the `createObject` method to `GPUTargetAttrInterface`; this
method returns a GPU object from a binary string.

3. Adding the function `mgpuModuleLoadJIT`, which is only available for
NVIDIA GPUs, as there is no equivalent for AMD.

4. Adding the CMake flag `MLIR_GPU_COMPILATION_TEST_FORMAT` to specify
the format to use during testing.
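
A sketch of what a JIT load such as `mgpuModuleLoadJIT` boils down to on the
driver side (illustrative, not the exact implementation):

```
#include "cuda.h"
#include <cstdint>

CUmodule loadPtxJit(const char *ptx, int optLevel) {
  CUmodule module = nullptr;
  CUjit_option opts[] = {CU_JIT_OPTIMIZATION_LEVEL};
  void *vals[] = {
      reinterpret_cast<void *>(static_cast<intptr_t>(optLevel))};
  // The driver JIT-compiles the PTX for the current device here.
  cuModuleLoadDataEx(&module, ptx, 1, opts, vals);
  return module;
}
```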
2023-09-14 18:00:27 -04:00
Aart Bik
3635c74375
[mlir][gpu][sparse] gracefully accept zero size allocation (#66127)
This cleans up unnecessary code that changed zero-size allocations to
avoid the following error message:

'cuMemAlloc(&ptr, sizeBytes)' failed with 'CUDA_ERROR_INVALID_VALUE'
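
A sketch of the graceful handling, assuming the wrapper simply short-circuits
zero-byte requests (name hypothetical):

```
#include "cuda.h"
#include <cstdint>

extern "C" void *mgpuMemAllocZeroOk(uint64_t sizeBytes) {
  // cuMemAlloc rejects a size of 0 with CUDA_ERROR_INVALID_VALUE, so
  // return a null pointer instead of calling the driver at all.
  if (sizeBytes == 0)
    return nullptr;
  CUdeviceptr ptr = 0;
  cuMemAlloc(&ptr, sizeBytes);
  return reinterpret_cast<void *>(ptr);
}
```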
2023-09-12 13:07:24 -07:00
Guray Ozen
1dc0071216 [MLIR] Guard Cuda 12.0+ newer driver APIs with CUDA_VERSION macro checks
Fixes https://github.com/llvm/llvm-project/issues/64529
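
The guard pattern, as a minimal sketch (the wrapper name is hypothetical):

```
#include "cuda.h"

#if (CUDA_VERSION >= 12000)
// Wrappers over 12.0+ driver APIs are only compiled when the CUDA headers
// are new enough to declare them.
extern "C" void mgpuNewDriverApiWrapperSketch() { /* ... */ }
#endif
```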

Reviewed By: mehdi_amini

Differential Revision: https://reviews.llvm.org/D159440
2023-09-06 08:17:06 +02:00
Aart Bik
289f7231f9 [mlir][sparse][gpu] minor code cleanup for sparse gpu ops
Consistent order of ops and related methods.
Also, renamed SpGEMMGetSizeOp to SpMatGetSizeOp
since this is a general utility for sparse matrices,
not specific to GEMM ops only.

Reviewed By: Peiming

Differential Revision: https://reviews.llvm.org/D157922
2023-08-14 15:08:57 -07:00
Aart Bik
95a6c509c9 [mlir][sparse][gpu] add set csr pointers, remove estimate op, fix bugs
Rationale:
Since we only support the default algorithm for SpGEMM, we can remove the
estimate op (for now at least). This also introduces the set csr pointers
op, and fixes a few bugs in the existing lowering for the SpGEMM breakdown.
This revision paves the way for actual recognition of SpGEMM in the sparsifier.

Reviewed By: K-Wu

Differential Revision: https://reviews.llvm.org/D157645
2023-08-10 13:52:47 -07:00
Aart Bik
e7e4ed0d7a [mlir][sparse][gpu] only support default algorithm for SpGEMM
Rationale:
This is the approach taken for all the others too (SpMV, SpMM, SDDMM),
so it is more consistent to follow the same path (until we have a need
for more algorithms). Also, in a follow up revision, this will allow
us to remove some unused GEMM ops.

Reviewed By: K-Wu

Differential Revision: https://reviews.llvm.org/D157542
2023-08-09 12:49:47 -07:00
Kun Wu
0664db5425 [mlir][sparse][gpu] fix spgemm runtime compile error
Differential Revision: https://reviews.llvm.org/D157349
2023-08-08 01:37:31 +00:00
Kun Wu
dfe2942909 [mlir][sparse][gpu] add spgemm operator
Differential Revision: https://reviews.llvm.org/D152981
2023-08-08 00:29:23 +00:00
Guray Ozen
53881490c2 [mlir][cuda runtime] Set Max Dynamic Shared Memory Attribute
This work aims to address larger shared memory usage in the MLIR CUDA
runtime. When the shared memory usage exceeds 48KB, we need to set the
CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES attribute of the CUDA
kernel appropriately. This work takes care of that by setting the
attribute as required. Additionally, it includes some debug prints for
better visibility and troubleshooting.
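
A sketch of the attribute call (the wrapper applies it only when the
requested size demands it):

```
#include "cuda.h"

// Opt a kernel in to more than 48 KB of dynamic shared memory.
void allowLargeDynamicSmem(CUfunction fn, int smemBytes) {
  cuFuncSetAttribute(fn, CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES,
                     smemBytes);
}
```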

Reviewed By: nicolasvasilache

Differential Revision: https://reviews.llvm.org/D156874
2023-08-02 14:18:59 +02:00
Guray Ozen
19b1107963 [mlir][gpu] Add debug print with environment value
This work introduces the `MLIR_CUDA_DEBUG` environment variable and a `debug_print` function to make the runtimes debuggable.
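
A sketch of such an environment-gated printer (the actual helper may differ
in detail):

```
#include <cstdarg>
#include <cstdio>
#include <cstdlib>

// Prints only when MLIR_CUDA_DEBUG is set in the environment; the check is
// cached in a function-local static so getenv runs once.
static void debug_print(const char *fmt, ...) {
  static const bool enabled = std::getenv("MLIR_CUDA_DEBUG") != nullptr;
  if (!enabled)
    return;
  va_list args;
  va_start(args, fmt);
  std::vfprintf(stderr, fmt, args);
  va_end(args);
}
```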

Reviewed By: aartbik

Differential Revision: https://reviews.llvm.org/D156232
2023-08-02 11:55:32 +02:00
Kun Wu
1e491c425b [mlir][sparse][gpu] add 2:4 spmm prune_and_check flag
Differential Revision: https://reviews.llvm.org/D155909
2023-08-01 18:24:18 +00:00
Guray Ozen
e56d6745f7 [mlir][nvgpu] Add tma.create.descriptor to create tensor map descriptor
The Op creates a tensor map descriptor object representing a tiled memory region. The descriptor is used by Tensor Memory Access (TMA). The `tensor` is the source tensor to be tiled. The `boxDimensions` are the sizes of the tiled memory region in each dimension.

The pattern here lowers `tma.create.descriptor` to a runtime function call that eventually calls the CUDA Driver's `cuTensorMapEncodeTiled`. For more information see below:
https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__TENSOR__MEMORY.html

Depends on D155453
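
A sketch of the eventual driver call for a 1024x1024 f32 tensor tiled into
128x64 boxes (CUDA 12+; all values illustrative, errors unchecked):

```
#include "cuda.h"

CUtensorMap makeDescriptor(void *devicePtr) {
  CUtensorMap map;
  cuuint64_t globalDim[2] = {1024, 1024};
  cuuint64_t globalStrides[1] = {1024 * sizeof(float)}; // rank - 1 entries
  cuuint32_t boxDim[2] = {128, 64};
  cuuint32_t elementStrides[2] = {1, 1};
  cuTensorMapEncodeTiled(&map, CU_TENSOR_MAP_DATA_TYPE_FLOAT32,
                         /*tensorRank=*/2, devicePtr, globalDim,
                         globalStrides, boxDim, elementStrides,
                         CU_TENSOR_MAP_INTERLEAVE_NONE,
                         CU_TENSOR_MAP_SWIZZLE_NONE,
                         CU_TENSOR_MAP_L2_PROMOTION_NONE,
                         CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE);
  return map;
}
```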

Reviewed By: nicolasvasilache

Differential Revision: https://reviews.llvm.org/D155680
2023-07-21 11:33:04 +02:00
Aart Bik
4df01dc270 [mlir][sparse][gpu][nvidia] add pruning step and check to 2:4 matrix multiplication
(1) without the check, the results may silently be wrong, so a check is needed
(2) add a pruning step to guarantee the 2:4 property

Note, in the longer run, we may want to split out the pruning step somehow,
or make it optional.

Reviewed By: K-Wu

Differential Revision: https://reviews.llvm.org/D155320
2023-07-14 12:08:13 -07:00
Aart Bik
97678cec1b [mlir][sparse][gpu] remove zero init memset
This avoids a rather large memory fill on each setup.

Reviewed By: K-Wu

Differential Revision: https://reviews.llvm.org/D155251
2023-07-13 18:22:21 -07:00
Aart Bik
86eff489e7 [mlir][sparse][gpu] force 16-byte alignment on data structs for cuSparseLt
Also makes some minor consistency edits in the cuSparseLt wrapper lib.
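
An illustrative sketch of forcing the alignment on a wrapper-side struct (the
real structs hold cuSPARSELt descriptors):

```
// Force 16-byte alignment so the struct satisfies cuSPARSELt's alignment
// requirements for its handles and descriptors.
struct alignas(16) SpMatHandleAndData {
  // cusparseLtMatDescriptor_t mat;  // actual member in the wrappers
  void *values = nullptr;
};

static_assert(alignof(SpMatHandleAndData) == 16, "must be 16-byte aligned");
```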

Reviewed By: Peiming, K-Wu

Differential Revision: https://reviews.llvm.org/D155139
2023-07-13 10:45:15 -07:00
Adrian Kuegel
f250fbcbbb [mlir] Apply ClangTidy fix (NFC)
The return statement is redundant.
2023-07-10 11:46:32 +02:00
Aart Bik
03125e6894 [mlir][sparse][gpu] fix missing dealloc
This dealloc was incorrectly removed in
https://reviews.llvm.org/D153173

Reviewed By: K-Wu

Differential Revision: https://reviews.llvm.org/D154564
2023-07-06 09:48:19 -07:00
Kun Wu
be2dd22b8f [mlir][sparse][gpu] reuse CUDA environment handle throughout instance lifetime
Differential Revision: https://reviews.llvm.org/D153173
2023-06-30 21:52:34 +00:00
Kun Wu
7a3ebba9cb [mlir][sparse][gpu] Add explaining string to three static_assert stmts
Differential Revision: https://reviews.llvm.org/D154243
2023-06-30 14:10:45 -05:00
Kun Wu
632ccc538c [mlir][sparse][gpu] remove tuple as one of the spmm_buffer_size output type
Reviewed By: aartbik

Differential Revision: https://reviews.llvm.org/D153188
2023-06-19 15:57:50 +00:00
Kun Wu
9167dd46ba [mlir][sparse][gpu] recognizing sddmm pattern in GPU libgen path
Reviewed By: aartbik

Differential Revision: https://reviews.llvm.org/D151582
2023-06-15 23:48:11 +00:00
Kun Wu
ac30f48e37 [mlir][sparse][gpu] fix various cusparseLt bugs
Reviewed By: aartbik

Differential Revision: https://reviews.llvm.org/D152489
2023-06-12 23:48:49 +00:00
Navdeep Katel
18cc07aa07 [MLIR][GPU] Add 16-bit version of cudaMemset in cudaRuntimeWrappers
Add 16-bit version of cudaMemset in cudaRuntimeWrappers and update the GPU to LLVM lowering.
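
A sketch of the 16-bit memset wrapper over the driver API (the entry-point
name is hypothetical):

```
#include "cuda.h"
#include <cstddef>

// Fills `count` 16-bit elements starting at `dst` with `value` on `stream`.
extern "C" void mgpuMemset16Sketch(void *dst, unsigned short value,
                                   size_t count, CUstream stream) {
  cuMemsetD16Async(reinterpret_cast<CUdeviceptr>(dst), value, count, stream);
}
```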

Reviewed By: bondhugula

Differential Revision: https://reviews.llvm.org/D151642
2023-06-08 17:33:26 +05:30
Aart Bik
50db4789a8 [mlir][sparse][gpu] refined build setup for cusparse
Reviewed By: K-Wu

Differential Revision: https://reviews.llvm.org/D152387
2023-06-07 11:09:22 -07:00
Kun Wu
8ed59c53de [mlir][sparse][gpu] add sm8.0+ tensor core 2:4 sparsity support
Reviewed By: aartbik

Differential Revision: https://reviews.llvm.org/D151775
2023-06-06 23:13:21 +00:00
Aart Bik
9fc02a7a08 [mlir][sparse][gpu] add AoS COO support to cuSPARSE
Even though this feature was deprecated in release 11.2,
any library before this version still supports the feature,
which is why we are making it available under a macro.

Reviewed By: K-Wu

Differential Revision: https://reviews.llvm.org/D152290
2023-06-06 12:32:46 -07:00
Kun Wu
7e44f0736a [mlir][gpu][sparse] fix broken type in cusparseCreateCsr
Differential Revision: https://reviews.llvm.org/D151912
2023-06-01 18:06:09 +00:00
Kun Wu
be6c532005 [mlir][sparse][gpu] fixing broken literal names in cuda runner macros
Differential Revision: https://reviews.llvm.org/D151910
2023-06-01 17:52:58 +00:00
Kun Wu
cc402de0b1 [mlir][sparse][gpu] add result type to spmv and spmm gpu libgen path
Differential Revision: https://reviews.llvm.org/D151592
2023-06-01 17:17:40 +00:00
Aart Bik
752c04777f [mlir][sparse][gpu] fix merge conflict
Reviewed By: K-Wu

Differential Revision: https://reviews.llvm.org/D151619
2023-05-27 13:42:20 -07:00
Kun Wu
cf44847b4d [mlir][gpu][sparse] adding cusparse sddmm support
Differential Revision: https://reviews.llvm.org/D151279
2023-05-27 20:01:41 +00:00
Aart Bik
74e29d3715 [mlir][sparse][gpu] fix merge conflict
Reviewed By: Peiming

Differential Revision: https://reviews.llvm.org/D151574
2023-05-26 11:00:20 -07:00
Kun Wu
235fbe792b [mlir][sparse][gpu] adding transpose support to spmm spmv
Reviewed By: aartbik, wrengr

Differential Revision: https://reviews.llvm.org/D151259
2023-05-26 17:07:09 +00:00