#190563 re-enabled FileCheck on `Integration/GPU/CUDA/async.mlir`, but
the buildbot has shown intermittent wrong-output failures
([example](https://lab.llvm.org/buildbot/#/builders/116/builds/27026)):
the test produces `[42, 42]` instead of the expected `[84, 84]`.
This wrong-output flakiness is distinct from the cleanup-time
`cuModuleUnload` errors that #190563 actually fixes — it's the
underlying issue tracked by #170833. The merged commit message for
#190563 incorrectly says `Fixes #170833`; that issue should be reopened,
since the cleanup-error fix doesn't address the wrong-output behavior.
This PR puts the test back in its previously-disabled state. The runtime
cleanup fix in #190563 is unaffected.
`mgpuModuleUnload` may be called from a global destructor (registered by
`SelectObjectAttr`'s `appendToGlobalDtors`) after the CUDA primary
context has already been destroyed during program shutdown. In this
case, `cuModuleUnload` returns `CUDA_ERROR_DEINITIALIZED`, which is
benign since the module's resources are already freed with the context.
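The error handling described above can be sketched as follows. This is a minimal Python model rather than the runtime's C++ code: `module_unload` and the `driver_unload` callback are hypothetical names introduced for illustration, and only the numeric values of `CUDA_SUCCESS` (0) and `CUDA_ERROR_DEINITIALIZED` (4) are taken from the CUDA driver API.

```python
# Illustrative model of tolerating a late module unload.
# Assumption: `driver_unload` stands in for `cuModuleUnload` and returns
# a CUresult-style integer code.
CUDA_SUCCESS = 0
CUDA_ERROR_DEINITIALIZED = 4  # driver is shutting down; context already gone


def module_unload(driver_unload, module):
    """Call the (hypothetical) driver callback and classify its result.

    Returns None when the unload succeeded or the failure is benign,
    otherwise an error string that a caller could report.
    """
    result = driver_unload(module)
    if result in (CUDA_SUCCESS, CUDA_ERROR_DEINITIALIZED):
        # The module's resources were freed together with the context,
        # so there is nothing to report.
        return None
    return f"'cuModuleUnload(module)' failed with code {result}"
```

Any other error code is still reported, so the change only silences the shutdown-ordering case.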
## Reproduction
Any program that uses `gpu.launch_func` and is AOT-compiled (via
`mlir-translate --mlir-to-llvmir | llc | cc -lmlir_cuda_runtime`) will
print `'cuModuleUnload(module)' failed with '<unknown>'` on exit. This
is because `SelectObjectAttr` registers the module unload as a global
destructor, which runs after the CUDA primary context is released.
This script reproduces the error message from `mgpuModuleUnload` on my
system:
```bash
#!/bin/bash
set -e
LLVM_BUILD=${LLVM_BUILD:-$HOME/dev/git/llvm-project-22/build}
cat > /tmp/repro.mlir << 'MLIR'
func.func @main() {
%c1 = arith.constant 1 : index
gpu.launch blocks(%bx, %by, %bz) in (%gx = %c1, %gy = %c1, %gz = %c1)
threads(%tx, %ty, %tz) in (%bsx = %c1, %bsy = %c1, %bsz = %c1) {
gpu.terminator
}
return
}
MLIR
$LLVM_BUILD/bin/mlir-opt /tmp/repro.mlir \
-gpu-lower-to-nvvm-pipeline="cubin-format=fatbin" \
| $LLVM_BUILD/bin/mlir-translate --mlir-to-llvmir -o /tmp/repro.ll
$LLVM_BUILD/bin/llc -relocation-model=pic -filetype=obj /tmp/repro.ll -o /tmp/repro.o
cc /tmp/repro.o \
-L$LLVM_BUILD/lib -Wl,-rpath,$LLVM_BUILD/lib \
-lmlir_cuda_runtime -lmlir_runner_utils -o /tmp/repro
echo "Running:"
/tmp/repro 2>&1
echo "Exit code: $?"
```
## Context
This matches how other projects handle the same shutdown ordering issue:
- Clang CUDA (D48613) switched module cleanup from
`__attribute__((destructor))` to `atexit()`
- GCC libgomp checks context validity before `cuModuleUnload`
- Apache TVM silently ignores `CUDA_ERROR_DEINITIALIZED` on module
unload
Fixes #170833
Extend the vector.gather e2e test to cover both available lowering
paths:
* Direct lowering to LLVM (via -test-lower-to-llvm)
* Lowering via vector.load (via -test-vector-gather-lowering)
This is a follow-up to https://github.com/llvm/llvm-project/pull/184706,
which updated a pattern used by -test-vector-gather-lowering.
The test is extended to operate on 2D memrefs so that the changes
in https://github.com/llvm/llvm-project/pull/184706 are meaningfully
exercised.
This MR removes a hard-coded compute number from an MLIR test, so the
test will not need to be updated in the future. The default value will
come from `NVVMOps.td`.
Follow-up to #184253. Update tests that checked for the old double-space
output of gpu.block_id using GPU_DimensionAttr.
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Follow-up to #184253. The ODS attr/type printer fix removed the leading
space from generated print() methods. Update tests that checked for the
old double-space output of GPU ops using GPU_DimensionAttr and
GPU_MmaElementwiseOpAttr.
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
The `OneDimMultiReductionToTwoDim` pattern had some issues. For the
input program:
```mlir
func.func @rank1_multi_reduction(%arg0: vector<8xf32>, %acc: f32) -> f32 {
%0 = vector.multi_reduction <add>, %arg0, %acc [0] : vector<8xf32> to f32
return %0 : f32
}
```
* when lowering using the inner-parallel strategy, the compiler would
essentially produce scalar code:
```mlir
func.func @rank1_multi_reduction(%arg0: vector<8xf32>, %arg1: f32) -> f32 {
%0 = vector.shape_cast %arg0 : vector<8xf32> to vector<1x8xf32>
%1 = vector.broadcast %arg1 : f32 to vector<1xf32>
%2 = vector.transpose %0, [1, 0] : vector<1x8xf32> to vector<8x1xf32>
%3 = vector.extract %2[0] : vector<1xf32> from vector<8x1xf32>
%4 = arith.addf %3, %1 : vector<1xf32>
%5 = vector.extract %2[1] : vector<1xf32> from vector<8x1xf32>
%6 = arith.addf %5, %4 : vector<1xf32>
... (repeats for all 8 elements) ...
%17 = vector.extract %2[7] : vector<1xf32> from vector<8x1xf32>
%18 = arith.addf %17, %16 : vector<1xf32>
%19 = vector.extract %18[0] : f32 from vector<1xf32>
return %19 : f32
}
```
* when lowering using the inner-reduction strategy, the compiler would
first unnecessarily transform it into a 2-D multi_reduction operation
<1x8xf32> and then extract an <8xf32> vector and apply reduction. The
canonicalization and folding would lead to the following final result:
```mlir
func.func @rank1_multi_reduction(%arg0: vector<8xf32>, %arg1: f32) -> f32 {
%0 = vector.reduction <add>, %arg0, %arg1 : vector<8xf32> into f32
return %0 : f32
}
```
Now, after this change:
* the compiler produces the following for both strategies in one
step:
```mlir
func.func @rank1_multi_reduction(%arg0: vector<8xf32>, %arg1: f32) -> f32 {
%0 = vector.reduction <add>, %arg0, %arg1 : vector<8xf32> into f32
return %0 : f32
}
```
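As a sanity check on why the two outputs are interchangeable, the sketch below models both lowerings on plain Python lists. The function names are hypothetical, and the inputs are chosen to be exactly representable so that the order of the floating-point additions does not matter; it is an illustration, not a statement about the pattern's implementation.

```python
from functools import reduce
import operator


def chained_lowering(elements, acc):
    """Model the old inner-parallel output: one add per extracted element."""
    running = acc
    for element in elements:
        running = element + running  # mirrors `arith.addf %elem, %prev`
    return running


def single_reduction(elements, acc):
    """Model `vector.reduction <add>` with an accumulator operand."""
    return reduce(operator.add, elements, acc)


# Eight lanes, matching the vector<8xf32> example above.
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
```

Both functions perform eight additions; the difference in the real IR is that the single reduction keeps the work in one vector op instead of a chain of `vector<1xf32>` adds.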
This pattern is also useful for an ongoing refactoring that is happening
in the multi_reduction patterns. It is the only pattern that increases
multi_reduction in rank and would lead to an infinite loop when
attempting to reach a fixed point once we generalize other unrolling
patterns.
Assisted-by: Claude
Unifies the two dialects that define x86 operations into a single one.
The AMX dialect is moved into X86 in line with other x86 extensions.
Following the dialect renaming, the X86 dialect is now a suitable home
for a wider range of operations targeting specific hardware features.
Moving the AMX definitions to the X86 dialect creates a single,
centralized hub for defining all x86 intrinsic-like operations. The new
grouping aims to eliminate the need for new dialects as new hardware
extensions become available.
The two dialects are simply merged together. X86 dialect refactoring
will be addressed separately.
List of changes:
- operations: 'amx.tile_*' => 'x86.amx.tile_*'
- types: '!amx.tile' => '!x86.amx.tile'
- namespace: 'mlir::amx' => 'mlir::x86::amx'
- test define: 'MLIR_RUN_AMX_TESTS' => 'MLIR_RUN_X86_AMX_TESTS'
- vector lowering: AMX is enabled by default together with X86
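For downstream users, the textual renames in the list above could be applied with a small script along these lines. This is a rough sketch, not a tool shipped with the PR: the `migrate` helper and its regular expressions are assumptions, and real code bases may need extra care (for example, for names built by string concatenation).

```python
import re

# Illustrative rewrite rules derived from the list of changes above.
# The lookbehind keeps already-migrated names (e.g. `x86.amx.tile_load`)
# from being rewritten twice.
RENAMES = [
    # 'amx.tile_*' => 'x86.amx.tile_*' and '!amx.tile' => '!x86.amx.tile'
    (re.compile(r"(?<![\w.])amx\.tile"), "x86.amx.tile"),
    # 'mlir::amx' => 'mlir::x86::amx'
    (re.compile(r"\bmlir::amx\b"), "mlir::x86::amx"),
    # 'MLIR_RUN_AMX_TESTS' => 'MLIR_RUN_X86_AMX_TESTS'
    (re.compile(r"MLIR_RUN_AMX_TESTS"), "MLIR_RUN_X86_AMX_TESTS"),
]


def migrate(text):
    """Apply every rename rule to `text` and return the rewritten string."""
    for pattern, replacement in RENAMES:
        text = pattern.sub(replacement, text)
    return text
```

Running `migrate` on already-migrated text leaves it unchanged, so the script can be applied repeatedly.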
The MLIR AMX tests are now nested under the X86 directory. To enable AMX
integration tests, 'MLIR_RUN_X86_TESTS' must also be defined.
Renames 'x86vector' dialect to 'x86'.
This is the first PR in a series of cleanups around dialects targeting
x86 platforms.
The new naming scheme is shorter, cleaner, and opens the possibility of
integrating other x86-specific operations that do not strictly fit a
pure vector representation. For example, the generalization will allow
for a future merger of the AMX dialect into the x86 dialect to create a
one-stop collection of x86 operations and boost discoverability.
This PR makes the compilation log from the ISA compiler available to
users by returning it as part of the `gpu::ObjectAttr` properties,
following the existing pattern of properties such as
`LLVMIRToISATimeInMs`.
Currently, the compiler log (which contains useful information such as
spill statistics when --verbose is passed) is only accessible in debug
builds via `LLVM_DEBUG`. However, there are good reasons to make this
information available in release builds as well:
1. Both `ptxas` and `libnvptxcompiler` are publicly available
tools/libraries distributed with the CUDA Toolkit. The `--verbose` flag
and its output are documented public features, not internal debug
information.
2. The verbose output provides valuable insights for users.
A new `SerializedObject` class is used to carry the metadata alongside
the binary when returning from `serializeObject`.
This patch re-enables the matmul.py tests:
* Fix gpu.wait usages
* Fix gpu.launchOp usage
* Fix format-string for gpu.printf
* Fix verification failure by removing the block[0] append.
This is now done by the python script's init.
* Fix the runtime error by adding the missing initialize() call during
JIT.
* Add the missing waitGroup(0) for _ws implementation.
This was mistakenly removed in PR #113713. Without this fix,
I see timing issues and the _ws tests with stage>1 randomly show output
mismatch.
With all these fixes, the test compiles and
executes successfully on an sm90a machine.
(locally verified for 1K iterations)
Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
This patch updates a few FileCheck primitives for the TMA test
to use CHECK-PTX-DAG instead of CHECK-PTX to accommodate
a slightly different ordering of BB's.
The dump-ptx integration test fails when the PTX is generated
through nvcc (intermediates) from the public toolkit. This patch fixes
it by allowing regex strings from both backends.
Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
This PR builds on the anchor layout mechanism introduced in
https://github.com/llvm/llvm-project/pull/169267 and performs the
following refactoring:
1. Introduce getAnchorLayout() and setAnchorLayout() interface for
anchor ops to get and set layout attributes.
2. Add getLocalLayout() and setLocalLayout() utility functions, and
refactor workgroup/subgroup distribution patterns to use these APIs.
These utilities access the layout information directly and locally,
without relying on global propagation.
3. Introduce localPropagateLayoutsFromAnchor(), a utility used by
subgroup distribution to unify non-anchor layout setup.
This function is intended to be invoked upfront by all layout-based
passes (including workgroup/subgroup distribution and unrolling) to
propagate layouts from anchor ops to non-anchor ops.
After this step, patterns within the pass should exclusively use
getLocalLayout() / setLocalLayout().
4. Refactor getDistributeLayoutAttr() and setDistributeLayoutAttr() to
remove special-case handling. These APIs now operate in a uniform order:
anchor ops first, then non-anchor ops, and finally block arguments.
These APIs will be deprecated in the long run.
5. Refactor patterns in the wg/sg distribution and load optimization
passes to use get/setAnchorLayout() and get/setLocalLayout().
6. Update test cases to enforce that anchor ops must use—and only
use—anchor layouts.
The gpu printf test was not using the runtime required by lit.local.cfg.
All other tests in the directory correctly use the Level Zero runtime,
but the gpu printf test was using the SYCL runtime.
Add support for vectorized operations such as `arith.addf ... :
vector<4xf4E2M1FN>`. The computation is scalarized: scalar operands are
extracted with `vector.to_elements`, multiple scalar computations are
performed and the result is inserted back into a vector with
`vector.from_elements`.
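The scalarization scheme described above can be modeled in a few lines. The sketch below is an assumption-laden illustration: `scalarize_binary` is a hypothetical helper, tuples stand in for vectors, and the rounding behavior of narrow types such as `f4E2M1FN` is not modeled.

```python
def scalarize_binary(scalar_op, lhs, rhs):
    """Apply `scalar_op` lane by lane, mimicking the scalarized lowering."""
    if len(lhs) != len(rhs):
        raise ValueError("operand vectors must have the same length")
    lhs_elements = list(lhs)  # stands in for `vector.to_elements`
    rhs_elements = list(rhs)  # stands in for `vector.to_elements`
    # One scalar computation per lane.
    results = [scalar_op(a, b) for a, b in zip(lhs_elements, rhs_elements)]
    return tuple(results)  # stands in for `vector.from_elements`
```

For example, `scalarize_binary(lambda a, b: a + b, (0.5, 1.0), (0.5, 0.5))` performs two scalar additions and repacks the results into a two-element tuple.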
This PR re-lands #165873.
This PR extends the gpu.subgroup_mma_* ops to support fp64 type.
The extension requires special handling during the lowering to nvvm due
to the return type for load ops for fragment a and b (they return a
scalar instead of a struct).
The original PR did not guard the new test based on the required
architecture (sm80), which led to a failure on the CUDA runners with T4
GPUs.
Disallow implicit casting, which is surprising, and, IME, usually
indicative of copy-paste errors.
Because the initial value must be a scalar, I don't expect this to
affect any data movement.
This PR extends the `gpu.subgroup_mma_*` ops to support fp64 type.
The extension requires special handling during the lowering to `nvvm`
due to the return type for load ops for fragment a and b (they return a
scalar instead of a struct).
Add pass options to run lowerings to NVVM without pattern rollback. This
makes the dialect conversions easier to debug and improves
performance/memory usage.
This integration test has been broken for a while. This commit partially
fixes it.
- Use `CHECK` + `CHECK-NEXT` to ensure that the correct error lines are
matched together.
- Move all `CHECK-NOT` to the end. Having a `CHECK` with the same string
does not make sense after a `CHECK-NOT`.
- Add a missing `CHECK: ERROR` for one of the test cases.
- Deactivate `reverse_from_3`, which is broken, and put a TODO.
I hit another runtime verification issue (similar to
https://github.com/llvm/llvm-project/pull/164878) while working with
TFLite models. The verifier is incorrectly rejecting
`tensor.extract_slice` operations when extracting an empty slice
(size=0) that starts exactly at the tensor boundary.
The current runtime verification unconditionally enforces `offset <
dim_size`. This makes sense for non-empty slices, but it's too strict
for empty slices, causing false positives that lead to spurious runtime
assertions.
**Simple example that demonstrates the issue:**
```mlir
func.func @extract_empty_slice(%tensor: tensor<?xf32>, %offset: index, %size: index) {
// When called with: tensor size=10, offset=10, size=0
// Runtime verification fails: "offset 0 is out-of-bounds"
%slice = tensor.extract_slice %tensor[%offset] [%size] [1]
: tensor<?xf32> to tensor<?xf32>
return
}
```
For the above example, the check evaluates `10 < 10` which is false, so
verification fails. However, I believe this operation should be valid -
we're extracting zero elements, so there's no actual out-of-bounds
access.
**Real-world repro from the TensorFlow Lite models:**
This issue manifests while lowering TFLite models and a lot of our
system tests are failing due to this. Here's a simplified version
showing the problematic pattern:
In this code, `%extracted_slice_0` becomes an empty tensor when SSA
value `%15` reaches 10 (on the final loop iteration), making `%16 = 0`.
The operation extracts zero elements along dimension 0, which is
semantically valid but fails runtime verification.
```mlir
func.func @simplified_repro_from_tensorflowlite_model(%arg0: tensor<10x4x1xf32>) -> tensor<10x4x1xf32> {
%c0 = arith.constant 0 : index
%c1 = arith.constant 1 : index
%c2 = arith.constant 2 : index
%c10 = arith.constant 10 : index
%c-1 = arith.constant -1 : index
%0 = "tosa.const"() <{values = dense<0> : tensor<i32>}> : () -> tensor<i32>
%1 = "tosa.const"() <{values = dense<1> : tensor<i32>}> : () -> tensor<i32>
%2 = "tosa.const"() <{values = dense<10> : tensor<i32>}> : () -> tensor<i32>
%3 = "tosa.const"() <{values = dense<-1> : tensor<2xi32>}> : () -> tensor<2xi32>
%4 = "tosa.const"() <{values = dense<0> : tensor<2xi32>}> : () -> tensor<2xi32>
%5 = "tosa.const"() <{values = dense<0.000000e+00> : tensor<1x4x1xf32>}> : () -> tensor<1x4x1xf32>
%c4_1 = tosa.const_shape {values = dense<1> : tensor<1xindex>} : () -> !tosa.shape<1>
%6:2 = scf.while (%arg1 = %0, %arg2 = %arg0)
: (tensor<i32>, tensor<10x4x1xf32>) -> (tensor<i32>, tensor<10x4x1xf32>) {
%7 = tosa.greater %2, %arg1 : (tensor<i32>, tensor<i32>) -> tensor<i1>
%extracted = tensor.extract %7[] : tensor<i1>
scf.condition(%extracted) %arg1, %arg2 : tensor<i32>, tensor<10x4x1xf32>
} do {
^bb0(%arg1: tensor<i32>, %arg2: tensor<10x4x1xf32>):
%7 = tosa.add %arg1, %1 : (tensor<i32>, tensor<i32>) -> tensor<i32>
// First slice
%8 = tosa.reshape %arg1, %c4_1 : (tensor<i32>, !tosa.shape<1>) -> tensor<1xi32>
%9 = tosa.concat %8, %3 {axis = 0 : i32} : (tensor<1xi32>, tensor<2xi32>) -> tensor<3xi32>
%extracted_0 = tensor.extract %9[%c0] : tensor<3xi32>
%10 = index.casts %extracted_0 : i32 to index
%11 = arith.cmpi eq, %10, %c-1 : index
%12 = arith.select %11, %c10, %10 : index
%extracted_slice = tensor.extract_slice %arg2[0, 0, 0] [%12, 4, 1] [1, 1, 1]
: tensor<10x4x1xf32> to tensor<?x4x1xf32>
// Second slice - this is where the failure occurs
%13 = tosa.reshape %7, %c4_1 : (tensor<i32>, !tosa.shape<1>) -> tensor<1xi32>
%14 = tosa.concat %13, %4 {axis = 0 : i32} : (tensor<1xi32>, tensor<2xi32>) -> tensor<3xi32>
%extracted_1 = tensor.extract %14[%c0] : tensor<3xi32>
%15 = index.castu %extracted_1 : i32 to index
%16 = arith.subi %c10, %15 : index // size = 10 - offset
%extracted_2 = tensor.extract %14[%c1] : tensor<3xi32>
%17 = index.castu %extracted_2 : i32 to index
%extracted_3 = tensor.extract %14[%c2] : tensor<3xi32>
%18 = index.castu %extracted_3 : i32 to index
// On the last loop iteration: %15=10, %16=0
// %extracted_slice_0 becomes an empty tensor
// Runtime verification fails: "offset 0 is out-of-bounds"
%extracted_slice_0 = tensor.extract_slice %arg2[%15, %17, %18] [%16, 4, 1] [1, 1, 1]
: tensor<10x4x1xf32> to tensor<?x4x1xf32>
%19 = tosa.concat %extracted_slice, %5, %extracted_slice_0 {axis = 0 : i32}
: (tensor<?x4x1xf32>, tensor<1x4x1xf32>, tensor<?x4x1xf32>) -> tensor<10x4x1xf32>
scf.yield %7, %19 : tensor<i32>, tensor<10x4x1xf32>
}
return %6#1 : tensor<10x4x1xf32>
}
```
**The fix:**
Make the offset check conditional on slice size:
- Empty slice (size == 0): allow `0 <= offset <= dim_size`
- Non-empty slice (size > 0): require `0 <= offset < dim_size`
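The two cases above amount to the following predicate. This is a minimal sketch of the intended condition for a single dimension, written in Python for readability; `offset_in_bounds` is a hypothetical name, and the actual check is emitted as IR by the runtime verification code.

```python
def offset_in_bounds(offset, size, dim_size):
    """Return True if a slice starting at `offset` passes the relaxed check.

    Empty slices may start exactly at the end of the dimension, because
    they never read an element; non-empty slices must start inside it.
    """
    if size == 0:
        return 0 <= offset <= dim_size
    return 0 <= offset < dim_size
```

With the values from the example (`dim_size = 10`, `offset = 10`, `size = 0`), the predicate holds, whereas the same offset with `size = 1` is still rejected.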
**Question for reviewers:**
Should we also relax the static verifier to allow this edge case?
Currently, the static verifier rejects the following IR:
```mlir
%tensor = arith.constant dense<1.0> : tensor<10xf32>
%slice = tensor.extract_slice %tensor[10] [0] [1] : tensor<10xf32> to tensor<0xf32>
```
Since we're allowing it at runtime for dynamic shapes, it seems
inconsistent to reject it statically. However, I wanted to get feedback
before making that change - this PR focuses only on the runtime
verification fix for dynamic shapes.
P.S. We have a similar issue with `memref.subview`. I will send a
separate patch for the issue.
Co-authored-by: Hanumanth Hanumantharayappa <hhanuman@ah-hhanuman-l.dhcp.mathworks.com>