LegalizeLaunchFuncOpPattern previously rejected gpu.launch_func ops with
more than one async dependency. This change removes that limitation by
synchronizing additional dependencies onto the primary stream using
CUDA/HIP events, following the same approach already used in
ConvertWaitAsyncOpToGpuRuntimeCallPattern for gpu.wait async.
For each additional async dependency beyond the first:
- If it is a stream (produced by mgpuStreamCreate), create an event,
record it on that stream, wait for it on the primary stream, then
destroy the event.
- If it is already an event, wait for it directly on the primary stream
and destroy it.
Fixes#156984
Assisted-by: Claude Code
When using `--gpu-to-llvm` with `use-bare-pointers-for-kernels=true` and
a `gpu.launch_func` whose kernel has an `UnrankedMemRefType` argument,
`LLVMTypeConverter::promoteOperands` would hit an `llvm_unreachable`
because unranked memrefs are not supported with the bare-pointer calling
convention.
Fix this by checking for unranked memref kernel arguments before calling
`promoteOperands` and returning a proper conversion failure with a
diagnostic instead of crashing.
Fixes#184939
Assisted-by: Claude Code
This PR moves the patterns that unroll vector.to_elements and
vector.from_elements into the file with other vector unrolling
operations. This PR also adds these unrolling patterns into the
`populateVectorUnrollPatterns`. And renames
`populateVectorToElementsLoweringPatterns`
`populateVectorFromElementsLoweringPatterns` to
`populateVectorToElementsUnrollPatterns`
`populateVectorFromElementsUnrollPatterns`.
### Problem
PR #142944 introduced a new canonicalization pattern which caused
failures in the following GPU-related integration tests:
-
mlir/test/Integration/GPU/CUDA/TensorCore/sm80/transform-mma-sync-matmul-f16-f16-accum.mlir
-
mlir/test/Integration/GPU/CUDA/TensorCore/sm80/transform-mma-sync-matmul-f32.mlir
The issue occurs because the new canonicalization pattern can generate
multi-dimensional `vector.from_elements` operations (rank > 1), but the
GPU lowering pipelines were not equipped to handle these during the
conversion to LLVM.
### Fix
This PR adds `vector::populateVectorFromElementsLoweringPatterns` to the
GPU lowering passes that are integrated in `gpu-lower-to-nvvm-pipeline`:
- `GpuToLLVMConversionPass`: the general GPU-to-LLVM conversion pass.
- `LowerGpuOpsToNVVMOpsPass`: the NVVM-specific lowering pass.
Co-authored-by: Yang Bai <yangb@nvidia.com>
These are identified by misc-include-cleaner. I've filtered out those
that break builds. Also, I'm staying away from llvm-config.h,
config.h, and Compiler.h, which likely cause platform- or
compiler-specific build failures.
This commit is a follow-up to 99a562b3cb17e89273ba0fe77129f2fb17a19381,
which migrated some of the mlir-vulkan-runner tests to mlir-cpu-runner
using a new pipeline and set of wrappers. That commit could not migrate
all the tests, because the existing calling conventions/ABIs for kernel
arguments generated by GPUToLLVMConversionPass were not a good fit for
the Vulkan runtime. This commit fixes this and migrates the remaining
tests. With this commit, mlir-vulkan-runner and many related components
are now unused, and they will be removed in a later commit (see #73457).
The old calling conventions require both the caller (host LLVM code) and
callee (device code) to have compile-time knowledge of the precise
argument types. This works for CUDA, ROCm and SYCL, where there is a
C-like calling convention agreed between the host and device code, and
the runtime passes through arguments as raw data without comprehension.
For Vulkan, however, the interface declared by the shader/kernel is in a
more abstract form, so the device code has indirect access to the
argument data, and the runtime must process the arguments to set up and
bind appropriately-sized buffer descriptors.
This commit introduces a new calling convention option to meet the
Vulkan runtime's needs. It lowers memref arguments to {void*, size_t}
pairs, which can be trivially interpreted by the runtime without it
needing to know the original argument types. Unlike the stopgap measure
in the previous commit, this system can support memrefs of various ranks
and element types, which unblocked migrating the remaining tests.
This commit adds new wrappers around the MLIR Vulkan runtime which
implement the mgpu* APIs (as generated by GPUToLLVMConversionPass), adds
an optional LLVM lowering to the Vulkan runner mlir-opt pipeline based
on GPUToLLVMConversionPass, and migrates several of the
mlir-vulkan-runner tests to use mlir-cpu-runner instead, together with
the new pipeline and wrappers.
This is a further incremental step towards eliminating
mlir-vulkan-runner and its associated pipeline, passes and wrappers
(#73457). This commit does not migrate all of the tests to the new
system, because changes to the mgpuLaunchKernel ABI will be necessary to
support the tests that use multi-dimensional memref arguments.
This patch fixes:
mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp:535:13: error:
'applyPatternsAndFoldGreedily' is deprecated: Use
applyPatternsGreedily() instead [-Werror,-Wdeprecated-declarations]
Clean up `populateVectorToLLVMConversionPatterns` so that it populates
only conversion patterns. All rewrite patterns that do not lower to LLVM
should be populated into a separate greedy pattern rewrite.
The current combination of rewrite patterns and conversion patterns
triggered an edge case when merging the 1:1 and 1:N dialect conversions.
Depends on #119973.
This patch adds the `ConvertToLLVMAttrInterface` and
`ConvertToLLVMOpInterface` interfaces. It also modifies the
`convert-to-llvm` pass to use these interfaces when available.
The `ConvertToLLVMAttrInterface` interfaces allows attributes to
configure conversion to LLVM, including the conversion target, LLVM type
converter, and populating conversion patterns. See the `NVVMTargetAttr`
implementation of this interface for an example of how this interface
can be used to configure conversion to LLVM.
The `ConvertToLLVMOpInterface` interface collects all convert to LLVM
attributes stored in an operation.
Finally, the `convert-to-llvm` pass was modified to use these interfaces
when available. This allows applying `convert-to-llvm` to GPU modules
and letting the `NVVMTargetAttr` decide which patterns to populate.
This patch updates the lowering of `LaunchFuncOp` in GPU to LLVM to only
legalize the operation with the converted operands, effectively removing
the lowering used by the old serialization pipeline.
It also removes all remaining uses of the old gpu serialization
infrastructure in `gpu-to-llvm`.
See [Compilation overview | 'gpu' Dialect - MLIR
docs](https://mlir.llvm.org/docs/Dialects/GPU/#compilation-overview) for
additional information on the target attributes compilation pipeline
that replaced the old serialization pipeline.
I'm planning to remove StringRef::equals in favor of
StringRef::operator==.
- StringRef::operator==/!= outnumber StringRef::equals by a factor of
10 under mlir/ in terms of their usage.
- The elimination of StringRef::equals brings StringRef closer to
std::string_view, which has operator== but not equals.
- S == "foo" is more readable than S.equals("foo"), especially for
!Long.Expression.equals("str") vs Long.Expression != "str".
GEPArg can only be constructed from int32_t and mlir::Value. Explicitly
cast other types (e.g. unsigned, size_t) to int32_t to avoid narrowing
conversion warnings on MSVC. Some recent examples of such are:
```
mlir\lib\Dialect\LLVMIR\Transforms\TypeConsistency.cpp: error C2398:
Element '1': conversion from 'size_t' to 'T' requires a narrowing
conversion
with
[
T=mlir::LLVM::GEPArg
]
mlir\lib\Dialect\LLVMIR\Transforms\TypeConsistency.cpp: error C2398:
Element '1': conversion from 'unsigned int' to 'T' requires a narrowing
conversion
with
[
T=mlir::LLVM::GEPArg
]
```
Co-authored-by: Nikita Kudriavtsev <nikita.kudriavtsev@intel.com>
This commit renames 4 pattern rewriter API functions:
* `updateRootInPlace` -> `modifyOpInPlace`
* `startRootUpdate` -> `startOpModification`
* `finalizeRootUpdate` -> `finalizeOpModification`
* `cancelRootUpdate` -> `cancelOpModification`
The term "root" is a misnomer. The root is the op that a rewrite pattern
matches against
(https://mlir.llvm.org/docs/PatternRewriter/#root-operation-name-optional).
A rewriter must be notified of all in-place op modifications, not just
in-place modifications of the root
(https://mlir.llvm.org/docs/PatternRewriter/#pattern-rewriter). The old
function names were confusing and have contributed to various broken
rewrite patterns.
Note: The new function names use the term "modify" instead of "update"
for consistency with the `RewriterBase::Listener` terminology
(`notifyOperationModified`).
This is a follow-up to the introduction of `convert-to-llvm`: it is
supposed to be a unifying pass through the
`ConvertToLLVMPatternInterface`, but some specific conversion (like the
GPU target) aren't vanilla LLVM target. Instead they need extra
customizations that are specific to LLVM-on-GPUs and our custom runtime
wrappers.
This change make the GpuToLLVMConversionPass just as pluggable as the
`convert-to-llvm` by using the same mechanism.
NVIDIA Hopper architecture introduced the Cooperative Group Array (CGA).
It is a new level of parallelism, allowing clustering of Cooperative
Thread Arrays (CTA) to synchronize and communicate through shared memory
while running concurrently.
This PR enables support for CGA within the `gpu.launch_func` in the GPU
dialect. It extends `gpu.launch_func` to accommodate this functionality.
The GPU dialect remains architecture-agnostic, so we've added CGA
functionality as optional parameters. We want to leverage mechanisms
that we have in the GPU dialects such as outlining and kernel launching,
making it a practical and convenient choice.
An example of this implementation can be seen below:
```
gpu.launch_func @kernel_module::@kernel
clusters in (%1, %0, %0) // <-- Optional
blocks in (%0, %0, %0)
threads in (%0, %0, %0)
```
The PR also introduces index and dimensions Ops specific to clusters,
binding them to NVVM Ops:
```
%cidX = gpu.cluster_id x
%cidY = gpu.cluster_id y
%cidZ = gpu.cluster_id z
%cdimX = gpu.cluster_dim x
%cdimY = gpu.cluster_dim y
%cdimZ = gpu.cluster_dim z
```
We will introduce cluster support in `gpu.launch` Op in an upcoming PR.
See [the
documentation](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#cluster-of-cooperative-thread-arrays)
provided by NVIDIA for details.
If gpu.alloc has no asyn deependency ( in case if gpu.alloc has
hostShared allocation), create a new stream & synchronize. This PR is
follow up to #66401
This PR is a breakdown of the big PR
https://github.com/llvm/llvm-project/pull/65539 which enables intel gpu
integration. In this PR we pass hostShared flag to runtime wrappers
(required by SyclRuntimeWrappers which will come in subsequent PR) to
indicate if the allocation is done on host shared gpu memory or device
only memory.
This PR is a breakdown of the big PR #65539 which enables intel gpu
integration. In this PR we pass count of parameters and size of gpu
binary to runtime wrappers since the SyclRuntimeWrappers (which will
come in subsequent PR) requires the spirv size for compilation and also
the number of parameters to iterate over the params.
This revision replaces the LLVM dialect NullOp by the recently
introduced ZeroOp. The ZeroOp is more generic in the sense that it
represents zero values of any LLVM type rather than null pointers only.
This is a follow to https://github.com/llvm/llvm-project/pull/65508
Consistent order of ops and related methods.
Also, renamed SpGEMMGetSizeOp to SpMatGetSizeOp
since this is a general utility for sparse matrices,
not specific to GEMM ops only.
Reviewed By: Peiming
Differential Revision: https://reviews.llvm.org/D157922
ConversionPatterns do not (and should not) modify the type converter that they are using.
* Make `ConversionPattern::typeConverter` const.
* Make member functions of the `LLVMTypeConverter` const.
* Conversion patterns take a const type converter.
* Various helper functions (that are called from patterns) now also take a const type converter.
Differential Revision: https://reviews.llvm.org/D157601
**For an explanation of these patches see D154153.**
This patch modifies the lowering of `gpu.module` & `gpu.launch_func` in the `gpu-to-llvm` pass,
allowing the usage of the new GPU compilation mechanism in the patch series ending in D154153.
Instead of removing Modules, this patch preserves the module if it has target attributes so that the
`gpu-module-to-binary` pass can later serialize them.
Instead of lowering the kernel calls to the LLVM dialect, this patch primarily updates the operation's
arguments, leaving the job of converting the operation into LLVM instructions to the translation stage.
The reason for not lowering the operation to LLVM at this stage is that kernel launches do not have a
single one-to-one representation in LLVM. For example, a kernel launch can be represented by a call
to a kernel stub, like in CUDA or HIP.
Kernel launches are also intrinsically linked to the binary associated with the call, and the binaries are
converted during translation.
Depends on D154149
Reviewed By: mehdi_amini
Differential Revision: https://reviews.llvm.org/D154152
Rationale:
Since we only support default algorithm for SpGEMM, we can remove the
estimate op (for now at least). This also introduces the set csr pointers
op, and fixes a few bugs in the existing lowering for the SpGEMM breakdown.
This revision paves the way for actual recognition of SpGEMM in the sparsifier.
Reviewed By: K-Wu
Differential Revision: https://reviews.llvm.org/D157645
Rationale:
This is the approach taken for all the others too (SpMV, SpMM, SDDMM),
so it is more consistent to follow the same path (until we have a need
for more algorithms). Also, in a follow up revision, this will allow
us to remove some unused GEMM ops.
Reviewed By: K-Wu
Differential Revision: https://reviews.llvm.org/D157542
Macro is used to avoid repeating same pattern many times.
Also fixed the ordering of ops to be consistent.
Reviewed By: K-Wu
Differential Revision: https://reviews.llvm.org/D157419
This revision removes the createIndexConstant method, which implicitly creates constants of the
getIndexType type and updates all uses to the more explicit createIndexAttrConstant which requires
an explicit Type parameter.
This is an NFC step towards entangling index type conversion in LLVM lowering.
The selection of which index type to use requires finer granularity than the existing
implementations which all rely on pass level flags and end up in mismatches, especially on GPUs
with multiple address spaces of different capacities.
This revision also includes an NFC fix to MemRefToLLVM.cpp that prevents a crash in cases where
an integer memory space cannot be derived for a MemRef.
Differential Revision: https://reviews.llvm.org/D156854
The Op creates a tensor map descriptor object representing tiled memory region. The descriptor is used by Tensor Memory Access (TMA). The `tensor` is the source tensor to be tiled. The `boxDimensions` is the size of the tiled memory region in each dimension.
The pattern here lowers `tma.create.descriptor` to a runtime function call that eventually calls calls CUDA Driver's `cuTensorMapEncodeTiled`. For more information see below:
https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__TENSOR__MEMORY.html
Depends on D155453
Reviewed By: nicolasvasilache
Differential Revision: https://reviews.llvm.org/D155680
Also makes some minor consistency edits in the cuSparseLt wrapper lib.
Reviewed By: Peiming, K-Wu
Differential Revision: https://reviews.llvm.org/D155139
Add support for the bare pointer calling convention in the gpu-to-llvm
pass. This wasn't being exposed and is needed when GPU-compiled MLIR is
to be called with this convention.
Reviewed By: krzysz00
Differential Revision: https://reviews.llvm.org/D152477