Shenghang Tsai 7610b13729
[MLIR] Split ExecutionEngine Initialization out of ctor into an explicit method call (#153524)
Retry landing https://github.com/llvm/llvm-project/pull/153373
## Major changes from previous attempt
- remove the test in CAPI because no existing tests in CAPI deal with
sanitizer exemptions
- update `mlir/docs/Dialects/GPU.md` to reflect the new behavior: load
GPU binary in global ctors, instead of loading them at call site.
- skip the test on Aarch64 since we have an issue with initialization there

---------

Co-authored-by: Mehdi Amini <joker.eph@gmail.com>
2025-08-17 23:07:24 +02:00

10 KiB

'gpu' Dialect

Note: this dialect is more likely to change than others in the near future; use with caution.

This dialect provides middle-level abstractions for launching GPU kernels following a programming model similar to that of CUDA or OpenCL. It provides abstractions for kernel invocations (and may eventually provide those for device management) that are not present at the lower level (e.g., as LLVM IR intrinsics for GPUs). Its goal is to abstract away device- and driver-specific manipulations to launch a GPU kernel and provide a simple path towards GPU execution from MLIR. It may be targeted, for example, by DSLs using MLIR. The dialect uses gpu as its canonical prefix.

This dialect also abstracts away primitives commonly available in GPU code, such as with gpu.thread_id (an operation that returns the ID of threads within a thread block/workgroup along a given dimension). While the compilation pipelines documented below expect such code to live inside a gpu.module and gpu.func, these intrinsic wrappers may be used outside of this context.

Intrinsic-wrapping operations should not expect that they have a parent of type gpu.func. However, operations that deal in compiling and launching GPU functions, like gpu.launch_func or gpu.binary may assume that the dialect's full layering is being used.

[TOC]

GPU address spaces

The GPU dialect exposes the gpu.address_space attribute, which currently has three values: global, workgroup, and private.

These address spaces represent the types of buffer commonly seen in GPU compilation. global memory is memory that resides in the GPU's global memory. workgroup memory is a limited, per-workgroup resource: all threads in a workgroup/thread block access the same values in workgroup memory. Finally, private memory is used to represent alloca-like buffers that are private to a single thread/workitem.

These address spaces may be used as the memorySpace attribute on memref values. The gpu.module/gpu.func compilation pipeline will lower such memory space usages to the correct address spaces on target platforms. Memory attributions should be created with the correct memory space on the memref.

Memory attribution

Memory buffers are defined at the function level, either in "gpu.launch" or in "gpu.func" ops. This encoding makes it clear where the memory belongs and makes the lifetime of the memory visible. The memory is only accessible while the kernel is launched/the function is currently invoked. The latter is more strict than actual GPU implementations but using static memory at the function level is just for convenience. It is also always possible to pass pointers to the workgroup memory into other functions, provided they expect the correct memory space.

The buffers are considered live throughout the execution of the GPU function body. The absence of memory attribution syntax means that the function does not require special buffers. Rationale: although the underlying models declare memory buffers at the module level, we chose to do it at the function level to provide some structuring for the lifetime of those buffers; this avoids the incentive to use the buffers for communicating between different kernels or launches of the same kernel, which should be done through function arguments instead; we chose not to use alloca-style approach that would require more complex lifetime analysis following the principles of MLIR that promote structure and representing analysis results in the IR.

GPU Compilation

Compilation overview

The compilation process in the GPU dialect has two main stages: GPU module serialization and offloading operations translation. Together these stages can produce GPU binaries and the necessary code to execute them.

An example of how the compilation workflow look is:

mlir-opt example.mlir                   \
  --pass-pipeline="builtin.module(      \
    gpu-kernel-outlining,               \ # Outline gpu.launch body to a kernel.
    nvvm-attach-target{chip=sm_90 O=3}, \ # Attach an NVVM target to a gpu.module op.
    gpu.module(convert-gpu-to-nvvm),    \ # Convert GPU to NVVM.
    gpu-to-llvm,                        \ # Convert GPU to LLVM.
    gpu-module-to-binary                \ # Serialize GPU modules to binaries.
  )" -o example-nvvm.mlir
mlir-translate example-nvvm.mlir        \
  --mlir-to-llvmir                      \ # Obtain the translated LLVM IR.
  -o example.ll

This compilation process expects all GPU code to live in a gpu.module and expects all kernels to be gpu.func operations. Non-kernel functions, like device library calls, may be defined using func.func or other non-GPU dialect operations. This permits downstream systems to use these wrappers without requiring them to use the GPU dialect's function operations, which might not include information those systems want to have as intrinsic values on their functions. Additionally, this allows for using func.func for device-side library functions in gpu.modules.

Default NVVM Compilation Pipeline: gpu-lower-to-nvvm-pipeline

The gpu-lower-to-nvvm-pipeline compilation pipeline serves as the default way for NVVM target compilation within MLIR. This pipeline operates by lowering primary dialects (arith, memref, scf, vector, gpu, and nvgpu) to NVVM target. It begins by lowering GPU code region(s) to the specified NVVM compilation target and subsequently handles the host code.

This pipeline specifically requires explicitly parallel IR and doesn't do GPU parallelization. To enable parallelism, necessary transformations must be applied before utilizing this pipeline.

It's designed to provide a generic solution for NVVM targets, generating NVVM and LLVM dialect code compatible with mlir-runner or execution engine.

Example:

Here's a snippet illustrating the use of primary dialects, including arith, within GPU code execution:

func.func @main() {
    %c2 = arith.constant 2 : index
    %c1 = arith.constant 1 : index
    gpu.launch
        blocks(%0, %1, %2) in (%3 = %c1, %4 = %c1, %5 = %c1)
        threads(%6, %7, %8) in (%9 = %c2, %10 = %c1, %11 = %c1) {
        gpu.printf "Hello from %d\n" %6 : index
        gpu.terminator
    }
    return
}

The gpu-lower-to-nvvm pipeline compiles this input code to NVVM format as below. It provides customization options like specifying SM capability, PTX version, and optimization level. Once compiled, the resulting IR is ready for execution using mlir-runner. Alternatively, it can be translated into LLVM, expanding its utility within the system.

mlir-opt example.mlir -gpu-lower-to-nvvm-pipeline = "cubin-chip=sm_90a cubin-features=+ptx80 opt-level=3"

Module serialization

Attributes implementing the GPU Target Attribute Interface handle the serialization process and are called Target attributes. These attributes can be attached to GPU Modules indicating the serialization scheme to compile the module into a binary string.

The gpu-module-to-binary pass searches for all nested GPU modules and serializes the module using the target attributes attached to the module, producing a binary with an object for every target.

Example:

// Input:
gpu.module @kernels [#nvvm.target<chip = "sm_90">, #nvvm.target<chip = "sm_60">] {
  ...
}
// mlir-opt --gpu-module-to-binary:
gpu.binary @kernels [
  #gpu.object<#nvvm.target<chip = "sm_90">, "sm_90 cubin">,
  #gpu.object<#nvvm.target<chip = "sm_60">, "sm_60 cubin">
]

Offloading LLVM translation

Attributes implementing the GPU Offloading LLVM Translation Attribute Interface handle the translation of GPU binaries and kernel launches into LLVM instructions and are called Offloading attributes. These attributes are attached to GPU binary operations.

During the LLVM translation process, GPU binaries get translated using the scheme provided by the Offloading attribute, translating the GPU binary into LLVM instructions. Meanwhile, Kernel launches are translated by searching the appropriate binary and invoking the procedure provided by the Offloading attribute in the binary for translating kernel launches into LLVM instructions.

Example:

// Input:
// Binary with multiple objects but selecting the second one for embedding.
gpu.binary @binary <#gpu.select_object<#rocdl.target<chip = "gfx90a">>> [
    #gpu.object<#nvvm.target, "NVPTX">,
    #gpu.object<#rocdl.target<chip = "gfx90a">, "AMDGPU">
  ]
llvm.func @foo() {
  ...
  // Launching a kernel inside the binary.
  gpu.launch_func @binary::@func blocks in (%0, %0, %0)
                                 threads in (%0, %0, %0) : i64
                                 dynamic_shared_memory_size %2
                                 args(%1 : i32, %1 : i32)
  ...
}
// mlir-translate --mlir-to-llvmir:
@binary_bin_cst = internal constant [6 x i8] c"AMDGPU", align 8
@binary_func_kernel_name = private unnamed_addr constant [7 x i8] c"func\00", align 1
@binary_module = internal global ptr null
@llvm.global_ctors = appending global [1 x {i32, ptr, ptr}] [{i32 123, ptr @binary_load, ptr null}]
@llvm.global_dtors = appending global [1 x {i32, ptr, ptr}] [{i32 123, ptr @binary_unload, ptr null}]
define internal void @binary_load() section ".text.startup" {
entry:
  %0 = call ptr @mgpuModuleLoad(ptr @binary_bin_cst)
  store ptr %0, ptr @binary_module
  ...
}
define internal void @binary_unload() section ".text.startup" {
entry:
  %0 = load ptr, ptr @binary_module, align 8
  call void @mgpuModuleUnload(ptr %0)
  ...
}
...
define void @foo() {
  ...
  %module = load ptr, ptr @binary_module, align 8
  %kernel = call ptr @mgpuModuleGetFunction(ptr %module, ptr @binary_func_kernel_name)
  call void @mgpuLaunchKernel(ptr %kernel, ...) ; Launch the kernel
  ...
  call void @mgpuModuleUnload(ptr %module)
  ...
}
...

The binary operation

From a semantic point of view, GPU binaries allow the implementation of many concepts, from simple object files to fat binaries. By default, the binary operation uses the #gpu.select_object offloading attribute; this attribute embeds a single object in the binary as a global string, see the attribute docs for more information.

Operations

[include "Dialects/GPUOps.md"]