133 Commits

Author SHA1 Message Date
Jared Hoberock
90ec5f2f62
[MLIR][test] Re-disable FileCheck on async.mlir integration test (#190702)
#190563 re-enabled FileCheck on `Integration/GPU/CUDA/async.mlir`, but
the buildbot has shown intermittent wrong-output failures
([example](https://lab.llvm.org/buildbot/#/builders/116/builds/27026)):
the test produces `[42, 42]` instead of the expected `[84, 84]`.

This wrong-output flakiness is distinct from the cleanup-time
`cuModuleUnload` errors that #190563 actually fixes — it's the
underlying issue tracked by #170833. The merged commit message for
#190563 incorrectly says `Fixes #170833`; that issue should be reopened,
since the cleanup-error fix doesn't address the wrong-output behavior.

This PR puts the test back in its previously-disabled state. The runtime
cleanup fix in #190563 is unaffected.
2026-04-07 01:14:56 +02:00
Jared Hoberock
7087ece044
[MLIR][ExecutionEngine] Tolerate CUDA_ERROR_DEINITIALIZED in mgpuModuleUnload (#190563)
`mgpuModuleUnload` may be called from a global destructor (registered by
`SelectObjectAttr`'s `appendToGlobalDtors`) after the CUDA primary
context has already been destroyed during program shutdown. In this
case, `cuModuleUnload` returns `CUDA_ERROR_DEINITIALIZED`, which is
benign since the module's resources are already freed with the context.
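A minimal Python model of the wrapper's error policy, as a sketch only. It assumes a stand-in `cu_module_unload` callable in place of the real driver call. `module_unload` and the `log` list are hypothetical names. Only the three result codes needed here are modeled, with numeric values as in `cuda.h`.

```python
from enum import IntEnum
from typing import Callable, List


class CUresult(IntEnum):
    # Subset of CUDA driver API result codes (values as in cuda.h).
    CUDA_SUCCESS = 0
    CUDA_ERROR_INVALID_VALUE = 1
    CUDA_ERROR_DEINITIALIZED = 4


def module_unload(cu_module_unload: Callable[[object], CUresult],
                  module: object,
                  log: List[str]) -> None:
    """Hypothetical model of an mgpuModuleUnload-style wrapper.

    CUDA_ERROR_DEINITIALIZED means the primary context is already gone and
    its resources (including the module) were freed with it, so it is
    treated as benign. Any other failure is still reported.
    """
    result = cu_module_unload(module)
    if result in (CUresult.CUDA_SUCCESS, CUresult.CUDA_ERROR_DEINITIALIZED):
        return
    log.append(f"'cuModuleUnload(module)' failed with '{result.name}'")
```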

## Reproduction

Any program that uses `gpu.launch_func` and is AOT-compiled (via
`mlir-translate --mlir-to-llvmir | llc | cc -lmlir_cuda_runtime`) will
print `'cuModuleUnload(module)' failed with '<unknown>'` on exit. This
is because `SelectObjectAttr` registers the module unload as a global
destructor, which runs after the CUDA primary context is released.

This script reproduces the error message from `mgpuModuleUnload` on my
system:

```
#!/bin/bash
set -e

LLVM_BUILD=${LLVM_BUILD:-$HOME/dev/git/llvm-project-22/build}

cat > /tmp/repro.mlir << 'MLIR'
func.func @main() {
  %c1 = arith.constant 1 : index
  gpu.launch blocks(%bx, %by, %bz) in (%gx = %c1, %gy = %c1, %gz = %c1)
             threads(%tx, %ty, %tz) in (%bsx = %c1, %bsy = %c1, %bsz = %c1) {
    gpu.terminator
  }
  return
}
MLIR

$LLVM_BUILD/bin/mlir-opt /tmp/repro.mlir \
  -gpu-lower-to-nvvm-pipeline="cubin-format=fatbin" \
  | $LLVM_BUILD/bin/mlir-translate --mlir-to-llvmir -o /tmp/repro.ll

$LLVM_BUILD/bin/llc -relocation-model=pic -filetype=obj /tmp/repro.ll -o /tmp/repro.o

cc /tmp/repro.o \
  -L$LLVM_BUILD/lib -Wl,-rpath,$LLVM_BUILD/lib \
  -lmlir_cuda_runtime -lmlir_runner_utils -o /tmp/repro

echo "Running:"
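# With `set -e`, record a non-zero exit status instead of aborting here.
status=0
/tmp/repro 2>&1 || status=$?
echo "Exit code: $status"
```

## Context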

This matches how other projects handle the same shutdown ordering issue:
- Clang CUDA (D48613) switched module cleanup from
`__attribute__((destructor))` to `atexit()`
- GCC libgomp checks context validity before `cuModuleUnload`
- Apache TVM silently ignores `CUDA_ERROR_DEINITIALIZED` on module
unload
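
The ordering problem can be pictured with a toy model. As a simplifying assumption, it treats all teardown as one LIFO list of exit handlers, the way `atexit()` runs them. Real shutdown ordering also depends on the C library, the dynamic loader, and CUDA driver internals. `simulate_shutdown` and its event strings are hypothetical.

```python
from typing import Callable, Dict, List


def run_exit_handlers(handlers: List[Callable[[], None]]) -> None:
    """Run handlers in reverse registration order, as atexit() does."""
    for handler in reversed(handlers):
        handler()


def simulate_shutdown(register_unload_before_init: bool) -> List[str]:
    """Toy model of module unload vs. context teardown (hypothetical)."""
    events: List[str] = []
    state: Dict[str, bool] = {"context_alive": False}
    handlers: List[Callable[[], None]] = []

    def destroy_context() -> None:
        state["context_alive"] = False
        events.append("destroy context")

    def unload_module() -> None:
        if state["context_alive"]:
            events.append("unload ok")
        else:
            events.append("unload: CUDA_ERROR_DEINITIALIZED")

    if register_unload_before_init:
        # Cleanup registered before the runtime set up its own teardown,
        # similar to a destructor registered at load time.
        handlers.append(unload_module)
    state["context_alive"] = True
    handlers.append(destroy_context)  # runtime registers teardown on init
    if not register_unload_before_init:
        # Cleanup registered after init, similar to calling atexit() then.
        handlers.append(unload_module)

    run_exit_handlers(handlers)
    return events
```

Registering the cleanup after the runtime has initialized makes it run first, which is the effect the Clang change relies on; the early registration reproduces the error case.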

Fixes #170833
2026-04-06 21:11:58 +00:00
Stefan Mada
0769dde7a2
Removed Hardcoded SM Number from Mlir Test (#186917)
This PR removes a hard-coded SM (compute capability) number from an MLIR
test, so the test will not need to be updated in the future. The default
value will come from `NVVMOps.td`.
2026-03-17 11:12:52 -07:00
Jakub Kuderski
15e7177f08
[mlir][GPU] Fix double spaces in tests after ODS printer fix. NFC. (#185325)
Follow-up to #184253. The ODS attr/type printer fix removed the leading
space from generated print() methods. Update tests that checked for the
old double-space output of GPU ops using GPU_DimensionAttr and
GPU_MmaElementwiseOpAttr.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 18:46:54 -04:00
Zichen Lu
fbffdaa174
[MLIR][GPU] Update serializeToObject to use SerializedObject wrapper and include ISA compiler logs (#176697)
This PR makes the compilation log from the ISA compiler available to users
by returning it as part of the `gpu::ObjectAttr` properties, following
the existing pattern used for `LLVMIRToISATimeInMs`.

Currently, the compiler log (which contains useful information such as
spill statistics when --verbose is passed) is only accessible in debug
builds via `LLVM_DEBUG`. However, there are good reasons to make this
information available in release builds as well:

1. Both `ptxas` and `libnvptxcompiler` are publicly available
tools/libraries distributed with the CUDA Toolkit. The `--verbose` flag
and its output are documented public features, not internal debug
information.
2. The verbose output provides valuable insights for users.

A new `SerializedObject` class is used to carry the metadata alongside
the binary when returning from `serializeToObject`.
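
A minimal Python analogue of the wrapper idea: return the binary together with a metadata dictionary, and copy that metadata into the object's properties. This is a sketch under assumptions, not the C++ API. `serialize_to_object`, `object_attr_properties`, and the `ISACompilerLog` key are hypothetical names; only `LLVMIRToISATimeInMs` comes from the message above.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class SerializedObject:
    """Binary plus metadata (hypothetical Python analogue)."""
    binary: bytes
    metadata: Dict[str, object] = field(default_factory=dict)


def serialize_to_object(isa: str, compiler_log: Optional[str],
                        time_ms: int) -> SerializedObject:
    # Stand-in for running ptxas/libnvptxcompiler: the "binary" is just the
    # encoded ISA text so the sketch runs anywhere.
    obj = SerializedObject(binary=isa.encode())
    obj.metadata["LLVMIRToISATimeInMs"] = time_ms  # existing property name
    if compiler_log:
        obj.metadata["ISACompilerLog"] = compiler_log  # hypothetical key
    return obj


def object_attr_properties(obj: SerializedObject) -> Dict[str, object]:
    """Properties that would be attached to the resulting object attribute."""
    return dict(obj.metadata)
```

Keeping the metadata in one container means new entries, such as the compiler log, do not change the function's return type.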
2026-01-30 12:56:20 +01:00
Durgadoss R
22271c9e76
[MLIR][NVVM][Tests] Re-enable matmul.py tests (#175728)
This patch re-enables the matmul.py tests:
* Fix gpu.wait usages.
* Fix gpu.launchOp usage.
* Fix the format string for gpu.printf.
* Fix a verification failure by removing the block[0] append.
  This is now done by the Python script's init.
* Fix the runtime error by adding the missing initialize() call during
  JIT.
* Add the missing waitGroup(0) for the _ws implementation.
  This was mistakenly removed in PR #113713. Without this fix,
  I see timing issues, and the _ws tests with stage>1 randomly show
  output mismatches.

With all these fixes, the test compiles and
executes successfully on an sm90a machine
(locally verified for 1K iterations).

Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
2026-01-16 10:57:54 +05:30
Maksim Levental
ad5be31c30
[mlir][Python] fix NV examples after #172892 (#174481) 2026-01-05 21:47:35 +00:00
Durgadoss R
6778f0d483
[MLIR][NVVM][Tests]: Update FileCheck primitives (#173252)
This patch updates a few FileCheck primitives for the TMA test
to use CHECK-PTX-DAG instead of CHECK-PTX to accommodate
a slightly different ordering of basic blocks (BBs).

The dump-ptx integration test fails when the PTX is generated
through nvcc (intermediates) from the public toolkit. This patch fixes
it by allowing regex strings from both backends.
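
The difference between the two directives can be sketched with a simplified line-based model. Real FileCheck matches over the whole input text, and a `-DAG` group is bounded by the surrounding non-DAG directives. `check_ordered` and `check_dag` are hypothetical helpers, and the PTX lines in the usage are made-up samples.

```python
import re
from typing import List, Set


def check_ordered(lines: List[str], patterns: List[str]) -> bool:
    """CHECK-style: each pattern must match on a line after the previous match."""
    pos = 0
    for pat in patterns:
        regex = re.compile(pat)
        while pos < len(lines) and not regex.search(lines[pos]):
            pos += 1
        if pos == len(lines):
            return False
        pos += 1
    return True


def check_dag(lines: List[str], patterns: List[str]) -> bool:
    """CHECK-DAG-style: patterns may match in any order, on distinct lines."""
    used: Set[int] = set()
    for pat in patterns:
        regex = re.compile(pat)
        for i, line in enumerate(lines):
            if i not in used and regex.search(line):
                used.add(i)
                break
        else:
            return False
    return True
```

With two backends emitting the same blocks in different orders, only the DAG form accepts both outputs.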

Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
2025-12-23 00:01:25 +05:30
Sang Ik Lee
528caf99a7
[MLIR] Fix GPU integration tests for SYCL and LevelZero runtime. (#171718)
i1 type load / store lowering no longer works for SPIR-V kernels.
Rewrite the test cases so that they do not use i1 load / store.
2025-12-19 09:10:44 -08:00
Mehdi Amini
d02471ae5e [MLIR] Partially disable test/Integration/GPU/CUDA/async.mlir
This test is flaky, needs investigation.

See #170833
2025-12-05 03:54:04 -08:00
Sohaib Iftikhar
b31a398bcf
[MLIR][NVVM] Fix wmma test after d3edc94d (#170659)
See discussion on #169061
2025-12-04 13:37:29 +00:00
Giacomo Castiglioni
d3edc94d11
[MLIR][GPU] subgroup_mma fp64 extension - take 2 (#169061)
This PR re-lands #165873.

This PR extends the gpu.subgroup_mma_* ops to support fp64 type.
The extension requires special handling during the lowering to nvvm due
to the return type for load ops for fragment a and b (they return a
scalar instead of a struct).

The original PR did not guard the new test on the required
architecture (sm80), which led to a failure on the CUDA runners with T4
GPUs.
2025-12-01 07:39:59 -05:00
Fabian Mora
8c3f59f1b2
Revert "[MLIR][GPU] subgroup_mma fp64 extension" (#169049)
Reverts llvm/llvm-project#165873

The revert is triggered by a failing integration test on a couple of
buildbots.
2025-11-21 10:02:59 -05:00
Giacomo Castiglioni
49995b2af0
[MLIR][GPU] subgroup_mma fp64 extension (#165873)
This PR extends the `gpu.subgroup_mma_*` ops to support fp64 type.
The extension requires special handling during the lowering to `nvvm`
due to the return type for load ops for fragment a and b (they return a
scalar instead of a struct).
2025-11-21 09:07:43 -05:00
Matthias Springer
951ab04d6c
[mlir][NVVM] Add no-rollback option to NVVM lowering passes (#168477)
Add pass options to run lowerings to NVVM without pattern rollback. This
makes the dialect conversions easier to debug and improves
performance/memory usage.
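
The trade-off can be pictured with a toy rewriter. As a simplifying assumption, rollback support is modeled as journaling every change so that it can be undone; skipping the journal saves memory and time but leaves a failed attempt in place. This is not MLIR's `ConversionPatternRewriter`, and `ToyRewriter` and its methods are hypothetical.

```python
from typing import Callable, List


class ToyRewriter:
    """Toy model of applying rewrites with or without rollback support."""

    def __init__(self, ops: List[str], allow_rollback: bool) -> None:
        self.ops = ops
        self.allow_rollback = allow_rollback
        self.journal: List[Callable[[], None]] = []

    def replace(self, index: int, new_op: str) -> None:
        old = self.ops[index]
        self.ops[index] = new_op  # the change is visible immediately
        if self.allow_rollback:
            # Remember how to undo this change (costs memory per rewrite).
            self.journal.append(lambda: self.ops.__setitem__(index, old))

    def rollback(self) -> None:
        if not self.allow_rollback:
            raise RuntimeError("rollback is disabled")
        while self.journal:
            self.journal.pop()()  # undo in reverse order
```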
2025-11-18 13:47:28 +08:00
Erick Ochoa Lopez
55d4d2ee0d
[mlir][spirv] Fix test (NFC) (#163413)
This test had a CHECK-RAW command. The intention behind this command
appears to have been to avoid the regular expression matching
capabilities. However, FileCheck interpreted it as a comment.
To check for literal strings, the {LITERAL} modifier should be
used.
https://llvm.org/docs/CommandGuide/FileCheck.html#directive-modifiers
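
A toy parser makes both points concrete: an unrecognized suffix such as `-RAW` means the line is not a directive at all, and `{LITERAL}` turns off the `{{...}}` regex syntax. This is a simplified model, not FileCheck's implementation. It recognizes only a subset of suffixes and ignores `[[...]]` variables; `parse_directive` and `to_regex` are hypothetical.

```python
import re
from typing import Optional, Tuple

# Directive suffixes this toy parser recognizes (a subset of FileCheck's).
KNOWN_SUFFIXES = ("", "-NEXT", "-SAME", "-NOT", "-DAG", "-LABEL", "-EMPTY")
DIRECTIVE_RE = re.compile(r"\b(CHECK)(-[A-Z]+)?(\{LITERAL\})?:\s?(.*)$")


def parse_directive(line: str) -> Optional[Tuple[str, bool, str]]:
    """Return (suffix, is_literal, pattern), or None if not a directive."""
    m = DIRECTIVE_RE.search(line)
    if not m or (m.group(2) or "") not in KNOWN_SUFFIXES:
        return None  # e.g. "CHECK-RAW:" is skipped like a comment
    return m.group(2) or "", m.group(3) is not None, m.group(4)


def to_regex(pattern: str, literal: bool) -> "re.Pattern[str]":
    """Translate a check pattern; {{...}} is a regex unless literal."""
    if literal:
        return re.compile(re.escape(pattern))
    # re.split with one group alternates: text, regex, text, regex, ...
    parts = re.split(r"\{\{(.*?)\}\}", pattern)
    return re.compile("".join(
        f"(?:{p})" if i % 2 else re.escape(p) for i, p in enumerate(parts)))
```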
2025-10-14 12:09:27 -04:00
Durgadoss R
fa366b4e9f
[MLIR][NVVM] Update TMA Load Op (#156347)
This patch adds im2col and gather mode
support for the TMA Load Op. The lowering is
also updated to use intrinsics, except when a Predicate
is given. This completes the Blackwell additions
on this Op.

* The NVVM Dialect now supports the Shared::Cluster
  address-space. So, this patch also updates the
  Op to use AS(7) instead of AS(3). The corresponding
  inline-ptx based unit tests are also updated.
* lit tests are added for all combinations.

Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
2025-09-23 13:03:35 +05:30
lonely eagle
a579278312
[mlir][nvgpu] Fix nvgpu integration test (#154748)
Fix an nvgpu MLIR integration test. This PR fixes the bug by removing
memref.get_global and then using memref.view.
2025-08-25 17:04:01 +08:00
Md Abdullah Shahneous Bari
281e6d2cc4
[mlir][ExecutionEngine] Add LevelZeroRuntimeWrapper. (#151038)
Adds LevelZeroRuntime wrapper and tests.

Co-authored-by: Artem Kroviakov <artem.kroviakov@intel.com>
Co-authored-by: Nishant Patel <nishant.b.patel@intel.com>
2025-08-06 16:48:59 -05:00
Diego Caballero
33465bb2bb
[mlir][Vector] Remove vector.extractelement and vector.insertelement ops (#149603)
This PR removes `vector.extractelement` and `vector.insertelement` ops
from the code base in favor of the `vector.extract` and `vector.insert`
counterparts.

See RFC:
https://discourse.llvm.org/t/rfc-psa-remove-vector-extractelement-and-vector-insertelement-ops-in-favor-of-vector-extract-and-vector-insert-ops
2025-07-28 11:01:14 -07:00
lorenzo chelini
c5613dc863
[MLIR] Mark LLVM::FMAOp as legal (#144671)
Mark LLVM::FMAOp as legal in configureGpuToNVVMConversionLegality, since
we can handle intrinsic lowering in the NVPTX backend and emit
fma.rn.f32.
2025-06-18 15:49:00 +02:00
Han-Chung Wang
77e2e3f641
[mlir][memref] Update tests to use memref.assume_alignment properly. (#142358)
With ffb9bbfd07, the memref.assume_alignment op returns a result value.
This revision updates the tests to reflect the change:

- Update all the lit tests to use the result of memref.assume_alignment,
  if it is present.
- Capture the result of the op in lit tests.

Signed-off-by: hanhanW <hanhan0912@gmail.com>
2025-06-02 07:57:36 -07:00
Sang Ik Lee
3fa65dee14
[mlir] SYCL runtime wrapper: add memcpy support. (#141647) 2025-05-28 11:33:15 -07:00
Mehdi Amini
c3858e55f4 [MLIR] Fix test run line: use env to set environment variable 2025-04-27 14:48:13 -07:00
Christian Sigg
7851b1bcf1
[mlir][gpu] Change GPU modules to globals (#135478)
Load/unload GPU modules in global ctors/dtors instead of every time a
kernel is launched.

Loading GPU modules is a heavy-weight operation and synchronizes the GPU
context. Now that the modules are loaded ahead of time, asynchronously
launched kernels can run concurrently, see
https://discourse.llvm.org/t/how-to-lower-the-combination-of-async-gpu-ops-in-gpu-dialect.

The implementations of `embedBinary()` and `launchKernel()` use slightly
different mechanics at the moment but I prefer to not change the latter
more than necessary as part of this PR. I will prepare a follow-up NFC
for `launchKernel()` to align them again.
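
The effect on synchronization can be counted with a toy model. It assumes, as the message describes, that each module load is one context-wide synchronization point. `ToyDevice` and the two launch functions are hypothetical; the real change emits LLVM global ctors/dtors rather than Python calls.

```python
class ToyDevice:
    """Counts module loads/unloads; each load is modeled as a context sync."""

    def __init__(self) -> None:
        self.loads = 0
        self.unloads = 0
        self.syncs = 0

    def load_module(self) -> None:
        self.loads += 1
        self.syncs += 1  # loading a module synchronizes the context

    def unload_module(self) -> None:
        self.unloads += 1


def launch_per_kernel(dev: ToyDevice, num_launches: int) -> None:
    # Old scheme: load and unload around every kernel launch.
    for _ in range(num_launches):
        dev.load_module()
        # ... launch kernel ...
        dev.unload_module()


def launch_with_global_module(dev: ToyDevice, num_launches: int) -> None:
    dev.load_module()  # like a global constructor
    for _ in range(num_launches):
        pass  # ... launch kernel; no reload, so no extra sync ...
    dev.unload_module()  # like a global destructor
```

With the module loaded once, the number of synchronization points no longer grows with the number of launches, which is what lets asynchronously launched kernels overlap.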
2025-04-22 13:49:58 +02:00
Qinkun Bao
91d2ecf0d5
[NFC] Fix some typos in libc and mlir comments (#133374) 2025-03-28 15:52:37 -04:00
Guray Ozen
e8dfd70fe2
[MLIR][NVGPU] Use gpu.dynamic_shared_memory in tests (#133122)
Reland #133051
2025-03-26 18:00:22 +01:00
Karlo Basioli
3f82c3d5a8
Revert "[MLIR][NVGPU] Use gpu.dynamic_shared_memory in tests" (#133103)
Reverts llvm/llvm-project#133051 due to failing integration tests
2025-03-26 15:39:14 +00:00
Guray Ozen
15f5a7a3ec
[MLIR][NVGPU] Use gpu.dynamic_shared_memory in tests (#133051)
The `memref.subview` ops in the test case were incorrect: they extracted
out-of-bounds.
2025-03-26 14:32:04 +01:00
Guray Ozen
837b89fc0f
[MLIR][NVVM] Add ptxas-cmd-options to pass flags to the downstream compiler (#127457)
This PR adds `ptxas-cmd-options` to the `gpu-lower-to-nvvm-pipeline` pipeline
and the `nvvm-attach-target` pass, allowing users to pass flags to the
downstream compiler, *ptxas*.

Example:
```
mlir-opt -gpu-lower-to-nvvm-pipeline="cubin-chip=sm_80 ptxas-cmd-options='-v --register-usage-level=8'"
```
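
When driving `mlir-opt` from a script, the same option string can be assembled programmatically. This is a minimal sketch; `nvvm_pipeline_arg` is a hypothetical helper. It assumes, following the example above, that option values containing spaces are wrapped in single quotes, and that the outer double quotes in the shell command are shell quoting rather than part of the argument.

```python
import shlex
from typing import Dict, List


def nvvm_pipeline_arg(options: Dict[str, str]) -> str:
    """Build a -gpu-lower-to-nvvm-pipeline=... argument (hypothetical helper).

    Values containing spaces are wrapped in single quotes so the pipeline
    option parser sees them as a single value.
    """
    parts: List[str] = []
    for key, value in options.items():
        if " " in value:
            value = f"'{value}'"
        parts.append(f"{key}={value}")
    return "-gpu-lower-to-nvvm-pipeline=" + " ".join(parts)


arg = nvvm_pipeline_arg({
    "cubin-chip": "sm_80",
    "ptxas-cmd-options": "-v --register-usage-level=8",
})

# The shell command from the example; after shell quote removal, the argv
# element mlir-opt receives should equal `arg`.
shell_argv = shlex.split(
    "mlir-opt -gpu-lower-to-nvvm-pipeline=\"cubin-chip=sm_80 "
    "ptxas-cmd-options='-v --register-usage-level=8'\"")
```

Passing `arg` as one element of a `subprocess.run` argument list avoids a second layer of shell quoting.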
2025-02-17 12:09:27 +01:00
Andrea Faulds
eb206e9ea8
[mlir] Rename mlir-cpu-runner to mlir-runner (#123776)
With the removal of mlir-vulkan-runner (as part of #73457) in
e7e3c45bc70904e24e2b3221ac8521e67eb84668, mlir-cpu-runner is now the
only runner for all CPU and GPU targets, and the "cpu" name has been
misleading for some time already. This commit renames it to mlir-runner.
2025-01-24 14:08:38 +01:00
Andrea Faulds
e7e3c45bc7
[mlir] Remove mlir-vulkan-runner and GPUToVulkan conversion passes (#123750)
This follows up on 733be4ed7dcf976719f424c0cb81b77a14f91f5a, which made
mlir-vulkan-runner and its associated passes redundant, and completes
the main goal of #73457. The mlir-vulkan-runner tests become part of the
integration test suite, and the Vulkan runner runtime components become
part of ExecutionEngine, just as was done when removing other
target-specific runners.
2025-01-21 16:51:27 +01:00
Benjamin Kramer
0d7022ed75 [MLIR][GPU] Fix gpu.printf test syntax after f50f9698ad012882df8dd605f5482e280c138266 2025-01-08 15:17:39 +01:00
Guray Ozen
f50f9698ad
[MLIR][GPU] Fix gpu.printf (#121940) 2025-01-08 08:25:57 +01:00
Matthias Springer
599c739905
[mlir][GPU] Add NVVM-specific cf.assert lowering (#120431)
This commit adds an NVIDIA-specific lowering of `cf.assert` to
`__assertfail`.

Note: `getUniqueFormatGlobalName`, `getOrCreateFormatStringConstant` and
`getOrDefineFunction` are moved to `GPUOpsLowering.h`, so that they can
be reused.
2025-01-06 12:00:11 +01:00
Matthias Springer
0dc086a787
[mlir] Fix integration tests after #120580 (#120729)
This commit should have been part of #120580.
2024-12-20 14:01:23 +01:00
Matthias Springer
b03a09e74f
[mlir] Fix integration tests after #120548 (#120706)
This should have been part of #120548.
2024-12-20 11:03:33 +01:00
Andrea Faulds
0e39b1348e
[mlir] Remove the mlir-spirv-cpu-runner (move to mlir-cpu-runner) (#114563)
This commit builds on and completes the work done in
9f6c632ecda08bfff76b798c46d5d7cfde57b5e9 to eliminate the need for a
separate mlir-spirv-cpu-runner binary. Since the MLIR processing is
already done outside this runner, the only real difference between it
and the mlir-cpu-runner is the final linking step between the nested
LLVM IR modules. By moving this step into mlir-cpu-runner behind a new
command-line flag (`--link-nested-modules`), this commit is able to
completely remove the runner component of the mlir-spirv-cpu-runner.

The runtime libraries and the tests are moved and renamed to fit into
the Execution Engine and Integration tests, following the model of the
similar migration done for the CUDA Runner in D97463.
2024-11-08 08:01:52 -05:00
Durgadoss R
13d6233e77
[MLIR][NVGPU] Fix nvgpu_arrive syntax in matmulBuilder.py (#113713)
This patch updates the syntax for nvgpu_arrive Op
in matmulBuilder.py. This fixes the compilation
error for this test.

For the warp-specialized matmul_kernel implementation,
removing the WaitGroupSyncOp (after the mma-main-loop)
fixes the observed hang.

With these two fixes, the test compiles and
executes successfully on an sm90a machine.

Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
2024-10-26 11:15:50 +05:30
Durgadoss R
a8b5115441
[MLIR][NVGPU] Fix the cga_cluster.mlir test (#112191)
This patch fixes the sm90 cluster test by:
* Fixing a typo in LowerGpuOpsToNVVMOps where one of the ClusterDim Op
  conversion patterns should actually have been for the
  ClusterDimBlocks Op. This addresses the compilation error for this test.
* Changing the grid size from (2,2,1) to (4,4,1). This passes the
  scf-if check against the threshold of 3 below and actually
  generates the required prints from the GPU.

Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
2024-10-14 19:44:13 +05:30
Durgadoss R
9432f7074c
[MLIR][NVGPU-Tests] Fix a failing sm90 test (#111731)
The memref.expand_shape explicitly takes an output_shape now.
This patch adds it to the Op and fixes the failing test.

Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
2024-10-10 10:51:59 +05:30
Matthias Springer
8e33ff7d56
[mlir][GPU][NFC] Move dump-ptx.mlir test case (#111142) 2024-10-04 15:13:20 +02:00
Guray Ozen
816134b333
[MLIR] Dump sass (#110227)
This PR dumps SASS by using nvdisasm.
2024-09-27 13:52:15 +02:00
Christian Sigg
b4ac5c4b7c
[mlir][cuda] NFC: Remove accidentally committed 'asd' file. (#105491)
Co-authored-by: Christian Sigg <chsigg@users.noreply.github.com>
2024-08-22 10:52:50 +02:00
Guray Ozen
f2251f93ab [mlir][gpu] Add mlir_c_runner_utils to fix #99035
This fixes the unit test that was broken by #99035.
2024-07-17 09:23:32 +02:00
Guray Ozen
20861f1f2f
[mlir][gpu] Use alloc OP's host_shared in cuda runtime (#99035) 2024-07-17 07:25:11 +02:00
Matthias Springer
7775be4d48
[mlir] Fix GPU integration test (part 2) (#98918)
Fix tests that were broken by #97903.
2024-07-15 17:39:16 +02:00
Matthias Springer
6469faf9fd
[mlir] Fix GPU integration test (#98917)
Fix tests that were broken by #97903.
2024-07-15 17:30:04 +02:00
Guray Ozen
f8ff909471
[mlir][gpu] Add py binding for AsyncTokenType (#96466)
This PR adds a Python binding for `AsyncTokenType`.
2024-06-24 11:39:22 +02:00
Pradeep Kumar
bd6568c98a
[MLIR][GPU] Add gpu.cluster_dim_blocks and gpu.cluster_block_id Ops (#95245)
This commit adds support for `gpu.cluster_dim_blocks` and
`gpu.cluster_block_id` Ops to represent the number of blocks per cluster
and the block id inside a cluster, respectively. It also fixes the
description of the `gpu.cluster_dim` Op and updates the `cga_cluster.mlir`
test file to use `gpu.cluster_dim_blocks`.
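
The per-dimension arithmetic behind the two new Ops can be sketched as follows. It assumes the usual CUDA tiling of the grid into equally sized clusters: `gpu.cluster_dim_blocks` corresponds to the cluster size in blocks, and `gpu.cluster_block_id` to a block's position inside its cluster. `cluster_coords` is a hypothetical helper; the clusters-per-grid count and cluster id are included only for context.

```python
from typing import Tuple

Dim3 = Tuple[int, int, int]


def cluster_coords(block_id: Dim3, grid_blocks: Dim3,
                   cluster_blocks: Dim3) -> Tuple[Dim3, Dim3, Dim3]:
    """Return (clusters per grid, cluster id, block id within cluster).

    Assumes each grid dimension is divisible by the matching cluster
    dimension, i.e. the grid is tiled by equally sized clusters.
    """
    for g, c in zip(grid_blocks, cluster_blocks):
        if c <= 0 or g % c:
            raise ValueError("grid must be divisible by cluster size")
    clusters = tuple(g // c for g, c in zip(grid_blocks, cluster_blocks))
    cluster_id = tuple(b // c for b, c in zip(block_id, cluster_blocks))
    in_cluster = tuple(b % c for b, c in zip(block_id, cluster_blocks))
    return clusters, cluster_id, in_cluster
```

For example, in a (4,4,1) grid of (2,2,1) clusters, block (3,1,0) is in cluster (1,0,0) at position (1,1,0).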

Co-authored-by: pradeepku <pradeepku@nvidia.com>
Co-authored-by: Guray Ozen <guray.ozen@gmail.com>
2024-06-14 10:35:35 +05:30