Load/unload GPU modules in global ctors/dtors instead of each time a
kernel is launched.
Loading GPU modules is a heavy-weight operation and synchronizes the GPU
context. Now that the modules are loaded ahead of time, asynchronously
launched kernels can run concurrently, see
https://discourse.llvm.org/t/how-to-lower-the-combination-of-async-gpu-ops-in-gpu-dialect.
The implementations of `embedBinary()` and `launchKernel()` use slightly
different mechanics at the moment but I prefer to not change the latter
more than necessary as part of this PR. I will prepare a follow-up NFC
for `launchKernel()` to align them again.
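For illustration, a rough sketch of the ctor/dtor idea (not the exact generated code; context setup and error handling are omitted, and the names are illustrative placeholders):
```
#include "cuda.h"

// Placeholder for the GPU binary embedded by `embedBinary()`.
extern const char binaryBlob[];

static CUmodule module = nullptr;

// Load once at program start instead of on every launch; cuModuleLoadData is
// heavy-weight and synchronizes the context, so paying the cost here lets
// asynchronously launched kernels run concurrently later.
__attribute__((constructor)) static void mgpuLoadModuleIllustrative() {
  cuModuleLoadData(&module, binaryBlob);
}

// Matching unload at program exit.
__attribute__((destructor)) static void mgpuUnloadModuleIllustrative() {
  cuModuleUnload(module);
}
```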
This removes the temporary DENSE24 attribute and replaces it with proper
recognition of the dense-to-2:4 conversion. The compression will be
performed on the device prior to the matrix multiplication. Note that
we no longer need to start with the linalg version; we can lift this to
the proper named linalg op. Also renames some files to more consistent
names.
There were two issues with the previous computation:
* it never looked at dimensions past the second one
* the definition was recursive, so each dimension picked up an extra
  factor of `elementSize`
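For clarity, a minimal sketch of the corrected computation (the names are illustrative):
```
#include <cstdint>

// Multiply every dimension exactly once and apply `elementSize` exactly once,
// instead of recursing and picking up an extra elementSize factor per level.
int64_t sizeInBytes(const int64_t *sizes, int rank, int64_t elementSize) {
  int64_t numElements = 1;
  for (int i = 0; i < rank; ++i) // walk *all* dimensions, not just the first two
    numElements *= sizes[i];
  return numElements * elementSize;
}
```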
The previous code was technically incorrect in that the type indicated
that the memref has only 1 dimension, while the code below happily
dereferenced the size array out of bounds. If the compiler doesn't
get too smart about optimizations, this code *might even work*; but if
the compiler realizes that the array has 1 element, it might start doing
silly things. This change generates a specialization for each supported
rank, making sure we don't invoke any UB.
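A minimal sketch of the rank-specialized approach, assuming a descriptor layout similar to MLIR's `StridedMemRefType` (the names here are illustrative):
```
#include <cstdint>

// The sizes/strides arrays are sized by the rank, so indexing them with
// i < Rank is always in bounds -- no more out-of-bounds dereferences for
// the compiler to exploit.
template <typename T, int Rank>
struct StridedMemRef {
  T *basePtr;
  T *data;
  int64_t offset;
  int64_t sizes[Rank];
  int64_t strides[Rank];
};

// One instantiation is generated per supported rank.
template <typename T, int Rank>
int64_t numElements(const StridedMemRef<T, Rank> &m) {
  int64_t n = 1;
  for (int i = 0; i < Rank; ++i)
    n *= m.sizes[i];
  return n;
}
```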
This fixes a few issues present in the current version:
1) The macro doesn't enforce the default visibility on exported
functions, causing compilation to fail when using
`-fvisibility=hidden`
2) Not all functions are exported
3) Sometimes the macro ended up weirdly interleaved with `extern "C"`
declarations
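A hedged sketch of the kind of macro this implies (the macro name and the exported function shown are illustrative, not the ones used in the tree):
```
// Force default visibility so the symbols stay exported even under
// -fvisibility=hidden.
#if defined(_WIN32)
#define MLIR_RUNNER_EXPORT __declspec(dllexport)
#else
#define MLIR_RUNNER_EXPORT __attribute__((visibility("default")))
#endif

// Keep `extern "C"` on the declaration itself so the macro is never
// interleaved with the linkage specifier.
extern "C" MLIR_RUNNER_EXPORT void mgpuStreamSynchronize(void *stream);
```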
The NVIDIA Hopper architecture introduced the Cooperative Group Array (CGA).
It is a new level of parallelism that allows clustering Cooperative
Thread Arrays (CTAs) so they can synchronize and communicate through
shared memory while running concurrently.
This PR enables support for CGA within the `gpu.launch_func` in the GPU
dialect. It extends `gpu.launch_func` to accommodate this functionality.
The GPU dialect remains architecture-agnostic, so we've added the CGA
functionality as optional parameters. We want to leverage the mechanisms
already present in the GPU dialect, such as outlining and kernel launching,
which makes this a practical and convenient choice.
An example of this implementation can be seen below:
```
gpu.launch_func @kernel_module::@kernel
clusters in (%1, %0, %0) // <-- Optional
blocks in (%0, %0, %0)
threads in (%0, %0, %0)
```
The PR also introduces cluster-specific index and dimension Ops,
binding them to NVVM Ops:
```
%cidX = gpu.cluster_id x
%cidY = gpu.cluster_id y
%cidZ = gpu.cluster_id z
%cdimX = gpu.cluster_dim x
%cdimY = gpu.cluster_dim y
%cdimZ = gpu.cluster_dim z
```
We will introduce cluster support in `gpu.launch` Op in an upcoming PR.
See [the
documentation](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#cluster-of-cooperative-thread-arrays)
provided by NVIDIA for details.
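For reference, a hedged sketch of how such a cluster launch maps onto the CUDA driver API on Hopper (the dimensions are illustrative and the actual lowering in the runtime wrappers may differ):
```
#include "cuda.h"

void launchWithClusters(CUfunction function, CUstream stream, void **params) {
  // Request a 2x1x1 cluster shape; the grid dimensions below are the total
  // number of blocks and must be divisible by the cluster dimensions.
  CUlaunchAttribute clusterAttr = {};
  clusterAttr.id = CU_LAUNCH_ATTRIBUTE_CLUSTER_DIMENSION;
  clusterAttr.value.clusterDim.x = 2;
  clusterAttr.value.clusterDim.y = 1;
  clusterAttr.value.clusterDim.z = 1;

  CUlaunchConfig config = {};
  config.gridDimX = 8;
  config.gridDimY = 1;
  config.gridDimZ = 1;
  config.blockDimX = 128;
  config.blockDimY = 1;
  config.blockDimZ = 1;
  config.hStream = stream;
  config.attrs = &clusterAttr;
  config.numAttrs = 1;

  cuLaunchKernelEx(&config, function, params, /*extra=*/nullptr);
}
```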
This PR guards the driver call with an if statement, since driver calls
are comparatively expensive.
As a future todo, the if statement could be generated by the compiler
and thus optimized away in some cases.
This PR is a breakdown of the big PR
https://github.com/llvm/llvm-project/pull/65539 which enables Intel GPU
integration. In this PR we pass the hostShared flag to the runtime wrappers
(required by the SyclRuntimeWrappers, which will come in a subsequent PR) to
indicate whether the allocation is done in host-shared GPU memory or in
device-only memory.
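A hedged sketch of how a wrapper might honor the flag (the name and signature are illustrative, not the actual SyclRuntimeWrappers API):
```
#include "cuda.h"
#include <cstdint>

extern "C" void *mgpuMemAllocIllustrative(uint64_t sizeBytes, bool isHostShared) {
  CUdeviceptr ptr = 0;
  if (isHostShared) {
    // Host-shared memory, accessible from both host and device.
    cuMemAllocManaged(&ptr, sizeBytes, CU_MEM_ATTACH_GLOBAL);
  } else {
    // Device-only memory.
    cuMemAlloc(&ptr, sizeBytes);
  }
  return reinterpret_cast<void *>(ptr);
}
```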
This PR is a breakdown of the big PR #65539 which enables Intel GPU
integration. In this PR we pass the parameter count and the size of the GPU
binary to the runtime wrappers, since the SyclRuntimeWrappers (which will
come in a subsequent PR) require the SPIR-V size for compilation and also
the number of parameters to iterate over the params.
This patch adds an NVPTX compilation path that enables JIT compilation
on NVIDIA targets. The following modifications were performed:
1. Adding a format field to the GPU object attribute, allowing the
translation attribute to use the correct runtime function to load the
module. Likewise, a dictionary attribute was added for passing any
extra options.
2. Adding the `createObject` method to `GPUTargetAttrInterface`; this
method returns a GPU object from a binary string.
3. Adding the function `mgpuModuleLoadJIT`, which is only available for
NVIDIA GPUs, as there is no equivalent for AMD.
4. Adding the CMake flag `MLIR_GPU_COMPILATION_TEST_FORMAT` to specify
the format to use during testing.
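A hedged sketch of the JIT path, assuming it boils down to `cuModuleLoadDataEx` on a PTX string (the helper name and option values are illustrative):
```
#include "cuda.h"
#include <cstdint>

// Hand the PTX to the driver, which JIT compiles it for the current device,
// instead of loading a precompiled cubin.
static CUmodule loadPtxWithJit(const char *ptx, int optLevel) {
  CUjit_option options[] = {CU_JIT_OPTIMIZATION_LEVEL};
  void *optionValues[] = {
      reinterpret_cast<void *>(static_cast<intptr_t>(optLevel))};
  CUmodule module = nullptr;
  cuModuleLoadDataEx(&module, ptx, /*numOptions=*/1, options, optionValues);
  return module;
}
```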
This cleans up unnecessary code that changed zero-size allocations to
avoid the following error message:
'cuMemAlloc(&ptr, sizeBytes)' failed with 'CUDA_ERROR_INVALID_VALUE'
Consistent order of ops and related methods.
Also renamed SpGEMMGetSizeOp to SpMatGetSizeOp,
since this is a general utility for sparse matrices,
not specific to GEMM ops.
Reviewed By: Peiming
Differential Revision: https://reviews.llvm.org/D157922
Rationale:
Since we only support the default algorithm for SpGEMM, we can remove the
estimate op (for now at least). This also introduces the set-csr-pointers
op and fixes a few bugs in the existing lowering of the SpGEMM breakdown.
This revision paves the way for actual recognition of SpGEMM in the sparsifier.
Reviewed By: K-Wu
Differential Revision: https://reviews.llvm.org/D157645
Rationale:
This is the approach taken for all the others too (SpMV, SpMM, SDDMM),
so it is more consistent to follow the same path (until we have a need
for more algorithms). Also, in a follow-up revision, this will allow
us to remove some unused GEMM ops.
Reviewed By: K-Wu
Differential Revision: https://reviews.llvm.org/D157542
This work aims to address the issue of larger shared memory usage in the MLIR CUDA runtime. Currently, when shared memory usage exceeds 48KB, the CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES attribute of the CUDA kernel must be set appropriately. This work takes care of that by setting the attribute as required. Additionally, it includes some debug prints for better visibility and troubleshooting.
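Roughly, the relevant driver call looks like this (a sketch; the helper name is illustrative and the real wrapper also checks the returned status and prints debug output):
```
#include "cuda.h"

// Kernels requesting more than the default 48KB of dynamic shared memory
// must opt in explicitly; guarding the call keeps the relatively expensive
// driver call off the common path.
static void maybeRaiseSharedMemLimit(CUfunction function, int smemBytes) {
  if (smemBytes > 48 * 1024)
    cuFuncSetAttribute(function,
                       CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES,
                       smemBytes);
}
```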
Reviewed By: nicolasvasilache
Differential Revision: https://reviews.llvm.org/D156874
This work introduces the `MLIR_CUDA_DEBUG` environment variable and a `debug_print` function to make it possible to debug the runtimes.
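A minimal sketch of the mechanism, assuming the environment variable is read once and output goes to stderr (the exact formatting in the wrappers may differ):
```
#include <cstdarg>
#include <cstdio>
#include <cstdlib>

static bool debugEnabled() {
  // Read MLIR_CUDA_DEBUG once and cache the result.
  static bool enabled = std::getenv("MLIR_CUDA_DEBUG") != nullptr;
  return enabled;
}

static void debug_print(const char *fmt, ...) {
  if (!debugEnabled())
    return;
  va_list args;
  va_start(args, fmt);
  std::vfprintf(stderr, fmt, args);
  va_end(args);
}
```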
Reviewed By: aartbik
Differential Revision: https://reviews.llvm.org/D156232
The Op creates a tensor map descriptor object representing a tiled memory region. The descriptor is used by Tensor Memory Access (TMA). The `tensor` is the source tensor to be tiled. The `boxDimensions` is the size of the tiled memory region in each dimension.
The pattern here lowers `tma.create.descriptor` to a runtime function call that eventually calls the CUDA Driver's `cuTensorMapEncodeTiled`. For more information see:
https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__TENSOR__MEMORY.html
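A hedged sketch of the eventual driver call for a row-major MxN float tensor tiled into 64x64 boxes (all arguments and the helper name are illustrative; the real wrapper derives them from the op's operands and adds error handling):
```
#include "cuda.h"
#include <cstdint>

static CUtensorMap makeDescriptor(void *globalAddress, uint64_t M, uint64_t N) {
  CUtensorMap map;
  cuuint64_t globalDim[2] = {N, M};                  // fastest-varying dim first
  cuuint64_t globalStrides[1] = {N * sizeof(float)}; // byte strides for dims > 0
  cuuint32_t boxDim[2] = {64, 64};                   // the `boxDimensions`
  cuuint32_t elementStrides[2] = {1, 1};
  cuTensorMapEncodeTiled(&map, CU_TENSOR_MAP_DATA_TYPE_FLOAT32,
                         /*tensorRank=*/2, globalAddress, globalDim,
                         globalStrides, boxDim, elementStrides,
                         CU_TENSOR_MAP_INTERLEAVE_NONE,
                         CU_TENSOR_MAP_SWIZZLE_NONE,
                         CU_TENSOR_MAP_L2_PROMOTION_NONE,
                         CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE);
  return map;
}
```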
Depends on D155453
Reviewed By: nicolasvasilache
Differential Revision: https://reviews.llvm.org/D155680
(1) Without the check, the results may silently be wrong, so the check is needed.
(2) Add a pruning step to guarantee the 2:4 property.
Note that, in the longer run, we may want to split out the pruning step somehow,
or make it optional.
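A hedged sketch of what the prune-then-check sequence could look like with cuSPARSELt (the helper name is illustrative, handles and descriptors are assumed to be set up elsewhere, and error handling is omitted):
```
#include <cuda_runtime.h>
#include <cusparseLt.h>

// Prune A to the 2:4 pattern in place, then verify the result actually
// satisfies the 2:4 property before running the matmul.
static bool pruneAndCheck(cusparseLtHandle_t *handle,
                          cusparseLtMatmulDescriptor_t *matmul, void *dA,
                          int *dValid, cudaStream_t stream) {
  cusparseLtSpMMAPrune(handle, matmul, dA, dA, CUSPARSELT_PRUNE_SPMMA_STRIP,
                       stream);
  cusparseLtSpMMAPruneCheck(handle, matmul, dA, dValid, stream);
  int valid = 1;
  cudaMemcpyAsync(&valid, dValid, sizeof(int), cudaMemcpyDeviceToHost, stream);
  cudaStreamSynchronize(stream);
  return valid == 0; // 0 means the 2:4 pattern holds
}
```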
Reviewed By: K-Wu
Differential Revision: https://reviews.llvm.org/D155320
Also makes some minor consistency edits in the cuSparseLt wrapper lib.
Reviewed By: Peiming, K-Wu
Differential Revision: https://reviews.llvm.org/D155139
Add a 16-bit version of cudaMemset in cudaRuntimeWrappers and update the GPU-to-LLVM lowering.
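A hedged sketch of the wrapper side (the name is illustrative):
```
#include "cuda.h"
#include <cstddef>

// Fill `count` 16-bit elements with `value` on `stream`.
extern "C" void mgpuMemset16Illustrative(void *dst, unsigned short value,
                                         size_t count, CUstream stream) {
  cuMemsetD16Async(reinterpret_cast<CUdeviceptr>(dst), value, count, stream);
}
```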
Reviewed By: bondhugula
Differential Revision: https://reviews.llvm.org/D151642
Even though this feature was deprecated in release 11.2,
any library version prior to 11.2 still supports the feature,
which is why we are making it available under a macro.
Reviewed By: K-Wu
Differential Revision: https://reviews.llvm.org/D152290