The new patterns break down subgroup reduce ops with vector values into
a sequence of subgroup reductions that fit the native shuffle size. The
maximum/native shuffle size is parametrized.
The overall goal is to be able to perform multi-element reductions with
a sequence of `gpu.shuffle` ops.
The `test-lower-to-nvvm` pipeline serves as the common and proper
pipeline for nvvm+host compilation, and it's used across our CUDA
integration tests.
This PR updates the `test-lower-to-nvvm` pipeline to `gpu-lower-to-nvvm`
and moves it within `InitAllPasses.h`. The aim is to call it from
Python, also having a standardize compilation process for nvvm.
#69934 broke integration tests that rely on the
kernel-bare-ptr-calling-convention and host-bare-ptr-calling-convention
flags. This PR brings these flags.
Also the kernel-index-bitwidth flag is removed, as kernel pointer size
depends on the host. Separating host (64-bit) and kernel (32-bit) is not
viable.
The test-`lower-to-nvvm pipeline`, designed for NVGPU dialect within GPU
kernels, plays important role for compiling integration tests. This PR
restructured the passes, and cleaned up the code. It also fixes the
order of pipelines.
This fix is needed for #69913
This PR enables `test-lower-to-nvvm` pass pipeline for the integration
tests for NVIDIA sm_90 architecture.
This PR adjusts `test-lower-to-nvvm` pass in two ways:
1) Calls `createConvertNVGPUToNVVMPass` before the outlining process.
This particular pass is responsible for generating both device and host
code. On the host, it calls the CUDA driver to build the TMA descriptor
(`cuTensorMap`).
2) Integrates the `createConvertNVVMToLLVMPass` to generate PTXs for
NVVM Ops.
This patch adds an NVPTX compilation path that enables JIT compilation
on NVIDIA targets. The following modifications were performed:
1. Adding a format field to the GPU object attribute, allowing the
translation attribute to use the correct runtime function to load the
module. Likewise, a dictionary attribute was added to add any possible
extra options.
2. Adding the `createObject` method to `GPUTargetAttrInterface`; this
method returns a GPU object from a binary string.
3. Adding the function `mgpuModuleLoadJIT`, which is only available for
NVIDIA GPUs, as there is no equivalent for AMD.
4. Adding the CMake flag `MLIR_GPU_COMPILATION_TEST_FORMAT` to specify
the format to use during testing.
Currently, the VectorToLLVM patterns are built into a library along
with the corresponding pass, which also pulls in all the
platform-specific vector dialects (like AMXDialect) to apply all the
vector to LLVM conversions.
This causes dependency bloat when writing libraries - for example the
GPU to LLVM passes, which use the vector to LLVM patterns, don't need
the X86Vector dialect to be present at all.
This commit partitions the library into VectorToLLVM and
VectorToLLVMPass, where the latter pulls in all the other vector
transformations.
Reviewed By: nicolasvasilache, mehdi_amini
Differential Revision: https://reviews.llvm.org/D158287
Deprecate the `gpu-to-cubin` & `gpu-to-hsaco` passes in favor of the
`TargetAttr` workflow. This patch removes remaining upstream uses of the
aforementioned passes, including the option to use them in `mlir-opt`. A
future patch will remove these passes entirely.
The passes can be re-enabled in `mlir-opt` by adding the CMake flag: `-DMLIR_ENABLE_DEPRECATED_GPU_SERIALIZATION=1`.
The revert happened due to a build bot failure that threw 'CUDA_ERROR_UNSUPPORTED_PTX_VERSION'.
The failure's root cause was a pass using "+ptx76" for compilation and an old CUDA driver
on the bot. This commit relands the patch with "+ptx60".
Original Gh PR: #65768
Original commit message:
Migrate tests referencing `gpu-to-cubin` to the new compilation workflow
using `TargetAttrs`. The `test-lower-to-nvvm` pass pipeline was modified
to use the new compilation workflow to simplify the introduction of
future tests.
The `createLowerGpuOpsToNVVMOpsPass` function was removed, as it didn't
allow for passing all options available in the `ConvertGpuOpsToNVVMOp`
pass.
Migrate tests referencing `gpu-to-cubin` to the new compilation workflow
using `TargetAttrs`. The `test-lower-to-nvvm` pass pipeline was modified
to use the new compilation workflow to simplify the introduction of
future tests.
The `createLowerGpuOpsToNVVMOpsPass` function was removed, as it didn't
allow for passing all options available in the `ConvertGpuOpsToNVVMOp`
pass.
This reverts commit 2e0e00ed841951e358a85a871647be9b3a622f51
and reverts commit a6eb40692c795a9cc29266779ceca2e304141114
and reverts commit 585cbe3f639783bf0307b47504acbd205f135310.
15 tests are broken on the mlir-nvidia buildbot:
'cuModuleLoadData(&module, data)' failed with 'CUDA_ERROR_INVALID_SOURCE'
'cuModuleGetFunction(&function, module, name)' failed with 'CUDA_ERROR_INVALID_HANDLE'
'cuLaunchKernel(function, gridX, gridY, gridZ, blockX, blockY, blockZ, smem, stream, params, extra)' failed with 'CUDA_ERROR_INVALID_HANDLE'
'cuModuleUnload(module)' failed with 'CUDA_ERROR_INVALID_HANDLE'
Current SM version is 35 but it is deprecated long time ago. D155563 introduced ptxas compilations, using sm_35 causes failures in builtbot. This change increase default SM version to 50.
Differential Revision: https://reviews.llvm.org/D156098
This work improves how we compile the generated PTX code using the `ptxas` compiler. Currently, we rely on the driver's jit API to compile the PTX code. However, this approach has some limitations. It doesn't always produce the same binary output as the ptxas compiler, leading to potential inconsistencies in the generated Cubin files.
This work introduces a significant improvement by directly utilizing the ptxas compiler for PTX compilation. By doing so, we can achieve more consistent and reliable results in generating cubin files. Key Benefits:
- Using the Ptxas compiler directly ensures that the cubin files generated during the build process remain consistent with CUDA compilation using `nvcc` or `clang`.
- Another advantage of this work is that it allows developers to experiment with different ptxas compilers without the need to change the compiler. Performance among ptxas compiler versions are vary, therefore, one can easily try different ptxas compilers.
Reviewed By: nicolasvasilache
Differential Revision: https://reviews.llvm.org/D155563
This fixes builds for 7e78ecfe10ea9071234de8d385b87d338d280266 (both cmake and bazel) as well as trim unnecessary dependencies.
This is achieved by moving the functionality to test/lib/GPU which is a more natural landing pad.
This patch implements a rewrite pattern for transforming gpu.global_id x
to gpu.thread_id + gpu.block_id * gpu.block_dim.
Reviewed By: makslevental
Differential Revision: https://reviews.llvm.org/D148978
This aligns the SCF dialect file layout with the majority of the dialects.
Reviewed By: jpienaar
Differential Revision: https://reviews.llvm.org/D128049
This commit restructures how TypeID is implemented to ideally avoid
the current problems related to shared libraries. This is done by changing
the "implicit" fallback path to use the name of the type, instead of using
a static template variable (which breaks shared libraries). The major downside to this
is that it adds some additional initialization costs for the implicit path. Given the
use of type names for uniqueness in the fallback, we also no longer allow types
defined in anonymous namespaces to have an implicit TypeID. To simplify defining
an ID for these classes, a new `MLIR_DEFINE_EXPLICIT_INTERNAL_INLINE_TYPE_ID` macro
was added to allow for explicitly defining a TypeID directly on an internal class.
To help identify when types are using the fallback, `-debug-only=typeid` can be
used to log which types are using implicit ids.
This change generally only requires changes to the test passes, which are all defined
in anonymous namespaces, and thus can't use the fallback any longer.
Differential Revision: https://reviews.llvm.org/D122775
A lot of test passes are currently anchored on FuncOp, but this
dependency
is generally just historical. A majority of these test passes can run on
any operation, or can operate on a specific interface
(FunctionOpInterface/SymbolOpInterface).
This allows for greatly reducing the API dependency on FuncOp, which
is slated to be moved out of the Builtin dialect.
Differential Revision: https://reviews.llvm.org/D121191
The Func has a large number of legacy dependencies carried over from the old
Standard dialect, which was pervasive and contained a large number of varied
operations. With the split of the standard dialect and its demise, a lot of lingering
dead dependencies have survived to the Func dialect. This commit removes a
large majority of then, greatly reducing the dependence surface area of the
Func dialect.
The last remaining operations in the standard dialect all revolve around
FuncOp/function related constructs. This patch simply handles the initial
renaming (which by itself is already huge), but there are a large number
of cleanups unlocked/necessary afterwards:
* Removing a bunch of unnecessary dependencies on Func
* Cleaning up the From/ToStandard conversion passes
* Preparing for the move of FuncOp to the Func dialect
See the discussion at https://discourse.llvm.org/t/standard-dialect-the-final-chapter/6061
Differential Revision: https://reviews.llvm.org/D120624
Precursor: https://reviews.llvm.org/D110200
Removed redundant ops from the standard dialect that were moved to the
`arith` or `math` dialects.
Renamed all instances of operations in the codebase and in tests.
Reviewed By: rriddle, jpienaar
Differential Revision: https://reviews.llvm.org/D110797
Split out GPU ops library from GPU transforms. This allows libraries to
depend on GPU Ops without needing/building its transforms.
Differential Revision: https://reviews.llvm.org/D105472
test/lib/Transforms/ has bitrot and become somewhat of a dumping grounds for testing pretty much any part of the project. This revision cleans this up, and moves the files within to a directory that reflects what is actually being tested.
Differential Revision: https://reviews.llvm.org/D102456