1068 Commits

Author SHA1 Message Date
Jared Hoberock
90ec5f2f62
[MLIR][test] Re-disable FileCheck on async.mlir integration test (#190702)
#190563 re-enabled FileCheck on `Integration/GPU/CUDA/async.mlir`, but
the buildbot has shown intermittent wrong-output failures
([example](https://lab.llvm.org/buildbot/#/builders/116/builds/27026)):
the test produces `[42, 42]` instead of the expected `[84, 84]`.

This wrong-output flakiness is distinct from the cleanup-time
`cuModuleUnload` errors that #190563 actually fixes — it's the
underlying issue tracked by #170833. The merged commit message for
#190563 incorrectly says `Fixes #170833`; that issue should be reopened,
since the cleanup-error fix doesn't address the wrong-output behavior.

This PR puts the test back in its previously-disabled state. The runtime
cleanup fix in #190563 is unaffected.
2026-04-07 01:14:56 +02:00
Jared Hoberock
7087ece044
[MLIR][ExecutionEngine] Tolerate CUDA_ERROR_DEINITIALIZED in mgpuModuleUnload (#190563)
`mgpuModuleUnload` may be called from a global destructor (registered by
`SelectObjectAttr`'s `appendToGlobalDtors`) after the CUDA primary
context has already been destroyed during program shutdown. In this
case, `cuModuleUnload` returns `CUDA_ERROR_DEINITIALIZED`, which is
benign since the module's resources are already freed with the context.
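
The shutdown-ordering behavior described here can be modeled in a few lines. This is a minimal Python sketch, not the runtime's actual C++ code: `module_unload`, the injected `cu_module_unload` callable, and the `log` list are hypothetical stand-ins, and only the numeric result codes are taken from the CUDA driver API.

```python
from enum import IntEnum


class CUresult(IntEnum):
    """Small subset of CUDA driver API result codes used by this sketch."""
    CUDA_SUCCESS = 0
    CUDA_ERROR_INVALID_VALUE = 1
    CUDA_ERROR_DEINITIALIZED = 4


def module_unload(cu_module_unload, module, log):
    """Hypothetical model of a module unload that tolerates late shutdown.

    `cu_module_unload` stands in for the driver call; `log` collects the
    error messages that would otherwise be printed.
    """
    result = cu_module_unload(module)
    if result == CUresult.CUDA_ERROR_DEINITIALIZED:
        # The primary context is already gone, so the module's resources
        # were freed with it. Nothing to report.
        return
    if result != CUresult.CUDA_SUCCESS:
        log.append(f"'cuModuleUnload(module)' failed with '{result.name}'")
```

Any other non-success code is still reported, so the sketch only silences the one case that is known to be benign at program exit.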

## Reproduction

Any program that uses `gpu.launch_func` and is AOT-compiled (via
`mlir-translate --mlir-to-llvmir | llc | cc -lmlir_cuda_runtime`) will
print `'cuModuleUnload(module)' failed with '<unknown>'` on exit. This
is because `SelectObjectAttr` registers the module unload as a global
destructor, which runs after the CUDA primary context is released.

This script reproduces the error message from `mgpuModuleUnload` on my
system:

```bash
#!/bin/bash
set -e

LLVM_BUILD=${LLVM_BUILD:-$HOME/dev/git/llvm-project-22/build}

cat > /tmp/repro.mlir << 'MLIR'
func.func @main() {
  %c1 = arith.constant 1 : index
  gpu.launch blocks(%bx, %by, %bz) in (%gx = %c1, %gy = %c1, %gz = %c1)
             threads(%tx, %ty, %tz) in (%bsx = %c1, %bsy = %c1, %bsz = %c1) {
    gpu.terminator
  }
  return
}
MLIR

$LLVM_BUILD/bin/mlir-opt /tmp/repro.mlir \
  -gpu-lower-to-nvvm-pipeline="cubin-format=fatbin" \
  | $LLVM_BUILD/bin/mlir-translate --mlir-to-llvmir -o /tmp/repro.ll

$LLVM_BUILD/bin/llc -relocation-model=pic -filetype=obj /tmp/repro.ll -o /tmp/repro.o

cc /tmp/repro.o \
  -L$LLVM_BUILD/lib -Wl,-rpath,$LLVM_BUILD/lib \
  -lmlir_cuda_runtime -lmlir_runner_utils -o /tmp/repro

echo "Running:"
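# Capture the exit status explicitly: with `set -e`, a non-zero exit from
# the binary would otherwise abort the script before the echo below.
status=0
/tmp/repro 2>&1 || status=$?
echo "Exit code: $status"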
```

## Context

This matches how other projects handle the same shutdown ordering issue:
- Clang CUDA (D48613) switched module cleanup from
`__attribute__((destructor))` to `atexit()`
- GCC libgomp checks context validity before `cuModuleUnload`
- Apache TVM silently ignores `CUDA_ERROR_DEINITIALIZED` on module
unload

Fixes #170833
2026-04-06 21:11:58 +00:00
Will Froom
f52b2616f4
[mlir][vector] Use non-native runner in gather.mlir test (#187243)
Fix after https://github.com/llvm/llvm-project/pull/187071
2026-03-18 11:28:14 +00:00
Andrzej Warzyński
9cb9081049
[mlir][vector] Extend vector.gather e2e test (#187071)
Extend the vector.gather e2e test to cover both available lowering
paths:

* Direct lowering to LLVM (via -test-lower-to-llvm)
* Lowering via vector.load (via -test-vector-gather-lowering)

This is a follow-up to https://github.com/llvm/llvm-project/pull/184706,
which updated a pattern used by -test-vector-gather-lowering.

The test is extended to operate on 2D memrefs so that the changes
in https://github.com/llvm/llvm-project/pull/184706 are meaningfully
exercised.
2026-03-18 09:23:17 +00:00
Stefan Mada
0769dde7a2
Removed Hardcoded SM Number from Mlir Test (#186917)
This MR removes a hard-coded SM (compute capability) number from an MLIR
test, so the test will not need to be updated in the future. The default
value will come from `NVVMOps.td`.
2026-03-17 11:12:52 -07:00
Jakub Kuderski
ade6309229
[mlir][XeGPU] Fix double spaces in tests after ODS printer fix. NFC. (#185324)
Follow-up to #184253. Update tests that checked for the old double-space
output of gpu.block_id using GPU_DimensionAttr.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 20:59:47 -04:00
Jakub Kuderski
15e7177f08
[mlir][GPU] Fix double spaces in tests after ODS printer fix. NFC. (#185325)
Follow-up to #184253. The ODS attr/type printer fix removed the leading
space from generated print() methods. Update tests that checked for the
old double-space output of GPU ops using GPU_DimensionAttr and
GPU_MmaElementwiseOpAttr.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 18:46:54 -04:00
Sang Ik Lee
dcd1bfb0df
[MLIR][XeVM] Mark gpu.printf test with XFAIL. (#184215)
The gpu.printf test is expected to fail until the vararg handling issue with
the SPIR-V backend is resolved.
2026-03-05 23:30:18 +00:00
Erick Ochoa Lopez
613a5c555e
[mlir][vector] Replace OneDimMultiReductionToTwoDim with OneDimMultiReductionToReduction (#184241)
The `OneDimMultiReductionToTwoDim` pattern had some issues. For the
input program:

```mlir
func.func @rank1_multi_reduction(%arg0: vector<8xf32>, %acc: f32) -> f32 {
    %0 = vector.multi_reduction <add>, %arg0, %acc [0] : vector<8xf32> to f32
    return %0 : f32
}
```

* when lowering using the inner-parallel strategy, the compiler would
essentially produce scalar code:
```mlir
func.func @rank1_multi_reduction(%arg0: vector<8xf32>, %arg1: f32) -> f32 {
    %0 = vector.shape_cast %arg0 : vector<8xf32> to vector<1x8xf32>
    %1 = vector.broadcast %arg1 : f32 to vector<1xf32>
    %2 = vector.transpose %0, [1, 0] : vector<1x8xf32> to vector<8x1xf32>
    %3 = vector.extract %2[0] : vector<1xf32> from vector<8x1xf32>
    %4 = arith.addf %3, %1 : vector<1xf32>
    %5 = vector.extract %2[1] : vector<1xf32> from vector<8x1xf32>
    %6 = arith.addf %5, %4 : vector<1xf32>
    ... (repeats for all 8 elements) ...
    %17 = vector.extract %2[7] : vector<1xf32> from vector<8x1xf32>
    %18 = arith.addf %17, %16 : vector<1xf32>
    %19 = vector.extract %18[0] : f32 from vector<1xf32>
    return %19 : f32
}
```
* when lowering using the inner-reduction strategy, the compiler would
first unnecessarily transform it into a 2-D multi_reduction operation
<1x8xf32> and then extract an <8xf32> vector and apply reduction. The
canonicalization and folding would lead to the following final result:
```mlir
func.func @rank1_multi_reduction(%arg0: vector<8xf32>, %arg1: f32) -> f32 {
    %0 = vector.reduction <add>, %arg0, %arg1 : vector<8xf32> into f32
    return %0 : f32
}
```

Now, after this change:
* when lowering, the compiler now produces the following for both
strategies in one step:
```mlir
func.func @rank1_multi_reduction(%arg0: vector<8xf32>, %arg1: f32) -> f32 {
    %0 = vector.reduction <add>, %arg0, %arg1 : vector<8xf32> into f32
    return %0 : f32
}
```
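
As a rough illustration of why a single `vector.reduction` can replace the unrolled chain, the following Python sketch models both forms on plain lists. The function names are hypothetical, and the model assumes strictly sequential accumulation, which the actual lowering does not necessarily guarantee.

```python
def unrolled_chain(vec, acc):
    """Models the old inner-parallel output: one extract + addf per element."""
    result = acc
    for element in vec:
        # Mirrors `%n = arith.addf %extracted, %previous`.
        result = element + result
    return result


def single_reduction(vec, acc):
    """Models `vector.reduction <add>, %vec, %acc` as one sequential fold."""
    total = acc
    for element in vec:
        total = total + element
    return total
```

For an 8-element input such as `[0.0, 1.0, ..., 7.0]` with accumulator `1.0`, both functions return the same value, which is the property the new pattern relies on.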

This change is also useful for an ongoing refactoring of the
multi_reduction patterns. `OneDimMultiReductionToTwoDim` was the only
pattern that increases the rank of a multi_reduction, and it would lead
to an infinite loop when attempting to reach a fixed point once other
unrolling patterns are generalized.

Assisted-by: Claude
2026-03-04 16:13:11 +00:00
Adam Siemieniuk
0fff939c1a
[mlir][linalg] Lower unpack - capture handle to created copy op (#183744)
Adds the copy op created during unpack lowering, which was missing, to the
lowering results. The corresponding transform op is also updated with the
new result value.
2026-03-03 08:26:04 +01:00
Adam Siemieniuk
e44fd05035
[mlir][x86] Move AMX dialect into X86 dialect (#183717)
Unifies the two dialects that define x86 operations into a single one.
The AMX dialect is moved into X86 in line with other x86 extensions.

Following the dialect renaming, the X86 dialect is now a suitable home for
a wider range of operations targeting specific hardware features. Moving
the AMX definitions to the X86 dialect creates a single, centralized hub
for defining all x86 intrinsic-like operations. The new grouping aims to
eliminate the need for new dialects as new hardware extensions become
available.

The two dialects are simply merged together. X86 dialect refactoring
will be addressed separately.

List of changes:
  - operations: 'amx.tile_*' => 'x86.amx.tile_*'
  - types: '!amx.tile' => '!x86.amx.tile'
  - namespace: 'mlir::amx' => 'mlir::x86::amx'
  - test define: 'MLIR_RUN_AMX_TESTS' => 'MLIR_RUN_X86_AMX_TESTS'
  - vector lowering: AMX is enabled by default together with X86

The MLIR AMX tests are now nested under the X86 directory. To enable AMX
integration tests, 'MLIR_RUN_X86_TESTS' must also be defined.
2026-03-02 11:47:30 +01:00
Adam Siemieniuk
67ac275fee
[mlir][x86] Rename x86vector to x86 (#183311)
Renames 'x86vector' dialect to 'x86'.

This is the first PR in a series of cleanups around dialects targeting x86
platforms.
The new naming scheme is shorter, cleaner, and opens the possibility of
integrating other x86-specific operations that do not strictly fit a pure
vector representation. For example, the generalization will allow for
a future merger of the AMX dialect into the x86 dialect to create a
one-stop x86 operations collection and boost discoverability.
2026-02-26 11:21:58 +01:00
Erick Ochoa Lopez
eeb6b394c5
[mlir][vector] remove lower_multi_reduction (#182332)
* Removes `ApplyLowerMultiReductionPatternsOp`
(`apply_patterns.vector.lower_multi_reduction`)
* Updates uses of `apply_patterns.vector.lower_multi_reduction` in tests
to use:
  * reorder_and_expand_multi_reduction_dims
  * multi_reduction_flattening
  * multi_reduction_unrolling
* Removes `populateVectorMultiReductionLoweringPatterns` (unused)
2026-02-20 08:33:24 -05:00
Sang Ik Lee
f481bf1031
[MLIR][XeVM] Update cache control values and metadata format. (#175274)
Fix incorrect cache control metadata values and format.
2026-02-18 13:14:16 -08:00
Sang Ik Lee
a51bc254bc
[MLIR][XeGPU] Add LANE level integration test without XeGPU ops. (#181891)
The XeGPU LANE level integration tests lack a test that does not use any
XeGPU dialect ops.
Add an integration test without XeGPU dialect ops.
2026-02-18 10:22:47 -08:00
Zichen Lu
fbffdaa174
[MLIR][GPU] Update serializeToObject to use SerializedObject wrapper and include ISA compiler logs (#176697)
This PR makes the compilation log from ISA compiler available to users
by returning it as part of the `gpu::ObjectAttr` properties, following
the existing pattern like `LLVMIRToISATimeInMs`.

Currently, the compiler log (which contains useful information such as
spill statistics when --verbose is passed) is only accessible in debug
builds via `LLVM_DEBUG`. However, there are good reasons to make this
information available in release builds as well:

1. Both `ptxas` and `libnvptxcompiler` are publicly available
tools/libraries distributed with the CUDA Toolkit. The `--verbose` flag
and its output are documented public features, not internal debug
information.
2. The verbose output provides valuable insights for users.

A new `SerializedObject` class is used to carry the metadata alongside
the binary when returning from `serializeToObject`.
2026-01-30 12:56:20 +01:00
Maksim Levental
47f9e0ab2a
[mlir][math] Add vector support for math-to-apfloat (#172715)
This PR adds vector type support to `math-to-apfloat`.
2026-01-16 10:14:00 -08:00
Durgadoss R
22271c9e76
[MLIR][NVVM][Tests] Re-enable matmul.py tests (#175728)
This patch re-enables the matmul.py tests:
* Fix gpu.wait usages
* Fix gpu.launchOp usage
* Fix format-string for gpu.printf
* Fix verification failure by removing the block[0] append.
   This is now done by the python script's init.
* Fix the runtime error by adding the missing initialize() call during
JIT.
* Add the missing waitGroup(0) for _ws implementation.
  This was mistakenly removed in PR #113713. Without this fix,
I see timing issues and the _ws tests with stage>1 randomly show output
mismatch.

With all these fixes, the test compiles and
executes successfully on an sm90a machine.
(locally verified for 1K iterations)

Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
2026-01-16 10:57:54 +05:30
Maksim Levental
ad5be31c30
[mlir][Python] fix NV examples after #172892 (#174481)
2026-01-05 21:47:35 +00:00
Durgadoss R
6778f0d483
[MLIR][NVVM][Tests]: Update FileCheck primitives (#173252)
This patch updates a few FileCheck primitives for the TMA test
to use CHECK-PTX-DAG instead of CHECK-PTX to accommodate
a slightly different ordering of basic blocks.

The dump-ptx integration test fails when the PTX is generated
through nvcc (intermediates) from the public toolkit. This patch fixes
it by allowing regex strings from both backends.

Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
2025-12-23 00:01:25 +05:30
Sang Ik Lee
528caf99a7
[MLIR] Fix GPU integration tests for SYCL and LevelZero runtime. (#171718)
Lowering of i1 type loads / stores no longer works for SPIR-V kernels.
Rewrite the test cases so that they do not use i1 loads / stores.
2025-12-19 09:10:44 -08:00
Maksim Levental
54eee1e947
Reapply "[mlir][math] Add FP software implementation lowering pass: math-to-apfloat" (#172714) (#172716)
Reapply https://github.com/llvm/llvm-project/pull/171221 - Fix builder
by linking `MLIRTransformUtils`. Also move headers to
`mlir/Conversion/ArithAndMathToAPFloat`.
2025-12-17 17:26:37 -08:00
Jianhui Li
2b9e47749c
[MLIR][XeGPU] Refactor Layout access interface (#172125)
This PR builds on the anchor layout mechanism introduced in
https://github.com/llvm/llvm-project/pull/169267 and performs the
following refactoring:

1. Introduce getAnchorLayout() and setAnchorLayout() interface for
anchor ops to get and set layout attributes.

2. Add getLocalLayout() and setLocalLayout() utility functions, and
refactor workgroup/subgroup distribution patterns to use these APIs.
These utilities access the layout information directly and locally,
without relying on global propagation.

3. Introduce localPropagateLayoutsFromAnchor(), a utility used by
subgroup distribution to unify non-anchor layout setup.
This function is intended to be invoked upfront by all layout-based
passes (including workgroup/subgroup distribution and unrolling) to
propagate layouts from anchor ops to non-anchor ops.
After this step, patterns within the pass should exclusively use
getLocalLayout() / setLocalLayout().

4. Refactor getDistributeLayoutAttr() and setDistributeLayoutAttr() to
remove special-case handling. These APIs now operate in a uniform order:
anchor ops first, then non-anchor ops, and finally block arguments.
These APIs will be deprecated in the long run.

5. Refactor patterns in wg/sg distribution, load optimization passes to
use get/setAnchorLayout() and get/setLocalLayout().

6. Update test cases to enforce that anchor ops must use—and only
use—anchor layouts.
2025-12-17 12:04:58 -08:00
Maksim Levental
621fe03eaa
Revert "[mlir][math] Add FP software implementation lowering pass: math-to-apfloat" (#172714)
Reverts llvm/llvm-project#171221

Broken builder https://lab.llvm.org/buildbot/#/builders/138/builds/23270
2025-12-17 10:52:43 -08:00
Maksim Levental
7f1a30ebd2
[mlir][math] Add FP software implementation lowering pass: math-to-apfloat (#171221)
Add APFloat software implementation for `math.fma`, `math.abs`,
`math.isnan`, `math.isfinite`, `math.isinf`, `math.isnormal` for reduced
precision (`fp4*`, `fp6*`, `fp8*`).
2025-12-17 18:37:13 +00:00
Maksim Levental
8d5ade8feb
[mlir] enable APFloatWrappers on MacOS (#172070)
2025-12-12 11:34:23 -08:00
Sang Ik Lee
b8ddbc4f03
[MLIR][XeVM] gpu.printf test: use correct runtime. (#170754)
The gpu.printf test was not using the runtime required by lit.local.cfg.
All other tests in the directory correctly use the Level Zero runtime,
but the gpu.printf test was using the SYCL runtime.
2025-12-08 08:14:56 -08:00
Matthias Springer
8378a6fa4f
[mlir][arith] Fix build after #171024 (#171057)
Fix build after #171024.
2025-12-07 21:48:00 +01:00
Matthias Springer
5dbd049662
[mlir][arith] arith-to-apfloat: Add vector support (#171024)
Add support for vectorized operations such as `arith.addf ... :
vector<4xf4E2M1FN>`. The computation is scalarized: scalar operands are
extracted with `vector.to_elements`, multiple scalar computations are
performed and the result is inserted back into a vector with
`vector.from_elements`.
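
The scalarization scheme can be pictured with a short Python model. This is a sketch under stated assumptions: `scalarize_binary` is a hypothetical name, tuples stand in for vectors, and `scalar_op` stands in for whatever scalar computation the pass emits per element.

```python
def scalarize_binary(scalar_op, lhs, rhs):
    """Apply a scalar binary op elementwise, mimicking the scalarized lowering."""
    if len(lhs) != len(rhs):
        raise ValueError("operand vectors must have the same shape")
    lhs_elements = list(lhs)  # models `vector.to_elements` on the lhs
    rhs_elements = list(rhs)  # models `vector.to_elements` on the rhs
    # One independent scalar computation per element.
    results = [scalar_op(a, b) for a, b in zip(lhs_elements, rhs_elements)]
    return tuple(results)     # models `vector.from_elements`
```

For example, `scalarize_binary(lambda a, b: a + b, (1.0, 2.0), (0.5, 0.5))` models a 2-element `arith.addf` as two scalar additions.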
2025-12-07 20:55:48 +01:00
Mehdi Amini
d02471ae5e
[MLIR] Partially disable test/Integration/GPU/CUDA/async.mlir
This test is flaky, needs investigation.

See #170833
2025-12-05 03:54:04 -08:00
Sohaib Iftikhar
b31a398bcf
[MLIR][NVVM] Fix wmma test after d3edc94d (#170659)
See discussion on #169061
2025-12-04 13:37:29 +00:00
Sang Ik Lee
c379f7cc01
[MLIR][XeGPU] Add integration with XeGPU load / store ops to / from memref subview. (#170385)
Add an XeGPU integration test for a missing use case: base memory from a
memref subview.
2025-12-03 09:22:18 -08:00
Jakub Kuderski
ad656d3a19
[mlir][linalg][arm] Fix use of fill in arm integration tests (#170143)
Follow up to
https://github.com/llvm/llvm-project/pull/169567#issuecomment-3596220014
2025-12-01 14:19:07 +00:00
Ryan Holt
b27301ff5d
[mlir][linalg] Re-enable linalg runtime verification test (#170129)
Test seems to pass after re-enabling without any additional changes.
2025-12-01 08:52:20 -05:00
Giacomo Castiglioni
d3edc94d11
[MLIR][GPU] subgroup_mma fp64 extension - take 2 (#169061)
This PR re-lands #165873.

This PR extends the gpu.subgroup_mma_* ops to support fp64 type.
The extension requires special handling during the lowering to nvvm due
to the return type for load ops for fragment a and b (they return a
scalar instead of a struct).

The original PR did not guard the new test based on the required
architecture (sm80), which led to a failure on the CUDA runners with T4
GPUs.
2025-12-01 07:39:59 -05:00
Matthias Springer
147c466bcd
[mlir][arith] Add support for min/max to ArithToAPFloat (#169760)
Add support for `arith.minnumf`, `arith.maxnumf`, `arith.minimumf`,
`arith.maximumf`.
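
The `num` and non-`num` variants differ mainly in how they treat NaN and signed zeros. The Python sketch below models the two `min` ops on host floats, assuming the semantics given in the Arith dialect documentation (`minnumf` returns the non-NaN operand, `minimumf` propagates NaN and orders -0.0 below +0.0). The function names are illustrative only and say nothing about how the APFloat-based lowering is implemented.

```python
import math


def minnumf(a, b):
    """Model of `arith.minnumf`: a NaN operand yields the other operand."""
    if math.isnan(a):
        return b
    if math.isnan(b):
        return a
    return min(a, b)


def minimumf(a, b):
    """Model of `arith.minimumf`: NaN propagates, and -0.0 < +0.0."""
    if math.isnan(a) or math.isnan(b):
        return math.nan
    if a == 0.0 and b == 0.0:
        # Both are zeros; prefer the negative one if there is one.
        return a if math.copysign(1.0, a) < 0 else b
    return min(a, b)
```

The `max` variants follow the same split, with the comparison and the signed-zero preference reversed.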
2025-12-01 08:50:02 +00:00
Matthias Springer
05b1989551
[mlir][arith] Add support for negf to ArithToAPFloat (#169759)
Add support for `arith.negf`.
2025-12-01 08:28:23 +00:00
Matthias Springer
4d7abe5355
[mlir][arith] Add support for cmpf to ArithToAPFloat (#169753)
Add support for `arith.cmpf`.
2025-12-01 09:12:11 +01:00
Jakub Kuderski
0bd2f12753
[mlir][linalg] Restrict fill initial value type to output element type (#169567)
Disallow implicit casting, which is surprising, and, IME, usually
indicative of copy-paste errors.

Because the initial value must be a scalar, I don't expect this to
affect any data movement.
2025-11-30 09:51:37 -05:00
Matthias Springer
6ec686735c
[mlir][arith] Add support for sitofp, uitofp to ArithToAPFloat (#169284)
Add support for `arith.sitofp` and `arith.uitofp`.
2025-11-25 11:31:23 +09:00
Matthias Springer
3db8ed0500
[mlir][arith] Add support for fptosi, fptoui to ArithToAPFloat (#169277)
Add support for `arith.fptosi` and `arith.fptoui`.
2025-11-25 10:50:20 +09:00
Matthias Springer
78994706d8
[mlir][arith] Add support for extf, truncf to ArithToAPFloat (#169275)
Add support for `arith.extf` and `arith.truncf`. No support for custom
rounding modes yet.
2025-11-25 10:09:26 +09:00
Fabian Mora
8c3f59f1b2
Revert "[MLIR][GPU] subgroup_mma fp64 extension" (#169049)
Reverts llvm/llvm-project#165873

The revert is triggered by a failing integration test on a couple of
buildbots.
2025-11-21 10:02:59 -05:00
Giacomo Castiglioni
49995b2af0
[MLIR][GPU] subgroup_mma fp64 extension (#165873)
This PR extends the `gpu.subgroup_mma_*` ops to support fp64 type.
The extension requires special handling during the lowering to `nvvm`
due to the return type for load ops for fragment a and b (they return a
scalar instead of a struct).
2025-11-21 09:07:43 -05:00
Matthias Springer
951ab04d6c
[mlir][NVVM] Add no-rollback option to NVVM lowering passes (#168477)
Add pass options to run lowerings to NVVM without pattern rollback. This
makes the dialect conversions easier to debug and improves
performance/memory usage.
2025-11-18 13:47:28 +08:00
Matthias Springer
7a53d33e7c
[mlir] Add FP software implementation lowering pass: arith-to-apfloat (#167848)
Reland pass and fix linker errors.

---------

Co-authored-by: Maksim Levental <maksim.levental@gmail.com>
2025-11-13 18:35:30 +09:00
Maksim Levental
140e07c862
Revert "Reland yet again: [mlir] Add FP software implementation lowering pass: arith-to-apfloat" (#167834)
Reverts llvm/llvm-project#167608

Broken builder https://lab.llvm.org/buildbot/#/builders/52/builds/12781
2025-11-12 23:02:21 -08:00
Matthias Springer
73e70e0c88
[mlir][linalg] Fix Linalg runtime verification test (#167814)
This integration test has been broken for a while. This commit partially
fixes it.

- Use `CHECK` + `CHECK-NEXT` to ensure that the correct error lines are
matched together.
- Move all `CHECK-NOT` to the end. Having a `CHECK` with the same string
does not make sense after a `CHECK-NOT`.
- Add a missing `CHECK: ERROR` for one of the test cases.
- Deactivate `reverse_from_3`, which is broken, and put a TODO.
2025-11-13 12:40:43 +09:00
Maksim Levental
0bba1e7658
Reland yet again: [mlir] Add FP software implementation lowering pass: arith-to-apfloat (#167608)
Fix both symbol visibility issue in the mlir_apfloat_wrappers lib and the linkage issue in ArithToAPFloat.
2025-11-12 17:57:53 -08:00
Hanumanth
81964597f9
[mlir][tensor] Fix runtime verification for tensor.extract_slice for empty tensor slices (#166569)
I hit another runtime verification issue (similar to
https://github.com/llvm/llvm-project/pull/164878) while working with
TFLite models. The verifier is incorrectly rejecting
`tensor.extract_slice` operations when extracting an empty slice
(size=0) that starts exactly at the tensor boundary.

The current runtime verification unconditionally enforces `offset <
dim_size`. This makes sense for non-empty slices, but it's too strict
for empty slices, causing false positives that lead to spurious runtime
assertions.

**Simple example that demonstrates the issue:**

```mlir
func.func @extract_empty_slice(%tensor: tensor<?xf32>, %offset: index, %size: index) {
  // When called with: tensor size=10, offset=10, size=0
  // Runtime verification fails: "offset 0 is out-of-bounds"
  %slice = tensor.extract_slice %tensor[%offset] [%size] [1] 
    : tensor<?xf32> to tensor<?xf32>
  return
}
```

For the above example, the check evaluates `10 < 10` which is false, so
verification fails. However, I believe this operation should be valid -
we're extracting zero elements, so there's no actual out-of-bounds
access.

**Real-world repro from the TensorFlow Lite models:**

This issue manifests while lowering TFLite models and a lot of our
system tests are failing due to this. Here's a simplified version
showing the problematic pattern:

In this code, `%extracted_slice_0` becomes an empty tensor when SSA
value `%15` reaches 10 (on the final loop iteration), making `%16 = 0`.
The operation extracts zero elements along dimension 0, which is
semantically valid but fails runtime verification.

```mlir
func.func @simplified_repro_from_tensorflowlite_model(%arg0: tensor<10x4x1xf32>) -> tensor<10x4x1xf32> {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %c2 = arith.constant 2 : index
  %c10 = arith.constant 10 : index
  %c-1 = arith.constant -1 : index
  
  %0 = "tosa.const"() <{values = dense<0> : tensor<i32>}> : () -> tensor<i32>
  %1 = "tosa.const"() <{values = dense<1> : tensor<i32>}> : () -> tensor<i32>
  %2 = "tosa.const"() <{values = dense<10> : tensor<i32>}> : () -> tensor<i32>
  %3 = "tosa.const"() <{values = dense<-1> : tensor<2xi32>}> : () -> tensor<2xi32>
  %4 = "tosa.const"() <{values = dense<0> : tensor<2xi32>}> : () -> tensor<2xi32>
  %5 = "tosa.const"() <{values = dense<0.000000e+00> : tensor<1x4x1xf32>}> : () -> tensor<1x4x1xf32>
  %c4_1 = tosa.const_shape  {values = dense<1> : tensor<1xindex>} : () -> !tosa.shape<1>
  
  %6:2 = scf.while (%arg1 = %0, %arg2 = %arg0) 
    : (tensor<i32>, tensor<10x4x1xf32>) -> (tensor<i32>, tensor<10x4x1xf32>) {
    %7 = tosa.greater %2, %arg1 : (tensor<i32>, tensor<i32>) -> tensor<i1>
    %extracted = tensor.extract %7[] : tensor<i1>
    scf.condition(%extracted) %arg1, %arg2 : tensor<i32>, tensor<10x4x1xf32>
  } do {
  ^bb0(%arg1: tensor<i32>, %arg2: tensor<10x4x1xf32>):
    %7 = tosa.add %arg1, %1 : (tensor<i32>, tensor<i32>) -> tensor<i32>
    
    // First slice
    %8 = tosa.reshape %arg1, %c4_1 : (tensor<i32>, !tosa.shape<1>) -> tensor<1xi32>
    %9 = tosa.concat %8, %3 {axis = 0 : i32} : (tensor<1xi32>, tensor<2xi32>) -> tensor<3xi32>
    
    %extracted_0 = tensor.extract %9[%c0] : tensor<3xi32>
    %10 = index.casts %extracted_0 : i32 to index
    %11 = arith.cmpi eq, %10, %c-1 : index
    %12 = arith.select %11, %c10, %10 : index
    
    %extracted_slice = tensor.extract_slice %arg2[0, 0, 0] [%12, 4, 1] [1, 1, 1] 
      : tensor<10x4x1xf32> to tensor<?x4x1xf32>
    
    // Second slice - this is where the failure occurs
    %13 = tosa.reshape %7, %c4_1 : (tensor<i32>, !tosa.shape<1>) -> tensor<1xi32>
    %14 = tosa.concat %13, %4 {axis = 0 : i32} : (tensor<1xi32>, tensor<2xi32>) -> tensor<3xi32>
    
    %extracted_1 = tensor.extract %14[%c0] : tensor<3xi32>
    %15 = index.castu %extracted_1 : i32 to index
    %16 = arith.subi %c10, %15 : index  // size = 10 - offset
    
    %extracted_2 = tensor.extract %14[%c1] : tensor<3xi32>
    %17 = index.castu %extracted_2 : i32 to index
    
    %extracted_3 = tensor.extract %14[%c2] : tensor<3xi32>
    %18 = index.castu %extracted_3 : i32 to index
    
    // On the last loop iteration: %15=10, %16=0
    // %extracted_slice_0 becomes an empty tensor
    // Runtime verification fails: "offset 0 is out-of-bounds"
    %extracted_slice_0 = tensor.extract_slice %arg2[%15, %17, %18] [%16, 4, 1] [1, 1, 1] 
      : tensor<10x4x1xf32> to tensor<?x4x1xf32>
    
    %19 = tosa.concat %extracted_slice, %5, %extracted_slice_0 {axis = 0 : i32} 
      : (tensor<?x4x1xf32>, tensor<1x4x1xf32>, tensor<?x4x1xf32>) -> tensor<10x4x1xf32>
    
    scf.yield %7, %19 : tensor<i32>, tensor<10x4x1xf32>
  }
  
  return %6#1 : tensor<10x4x1xf32>
}
```

**The fix:**

Make the offset check conditional on slice size:
- Empty slice (size == 0): allow `0 <= offset <= dim_size`
- Non-empty slice (size > 0): require `0 <= offset < dim_size`
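
The two cases above amount to the following predicate. This is a minimal Python sketch of the rule only: `offset_in_bounds` is a hypothetical name, and the actual check is emitted as IR by the runtime verification pass rather than written like this.

```python
def offset_in_bounds(offset, size, dim_size):
    """Return True if `offset` is acceptable for a slice of `size` elements."""
    if size == 0:
        # Empty slice: starting exactly at the tensor boundary is allowed.
        return 0 <= offset <= dim_size
    # Non-empty slice: the first accessed element must exist.
    return 0 <= offset < dim_size
```

With the values from the first example (tensor size 10, offset 10, size 0), the predicate accepts the slice, while the same offset with a non-zero size is still rejected.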


**Question for reviewers:**
Should we also relax the static verifier to allow this edge case?
Currently, the static verifier rejects the following IR:

```mlir
%tensor = arith.constant dense<1.0> : tensor<10xf32>
%slice = tensor.extract_slice %tensor[10] [0] [1] : tensor<10xf32> to tensor<0xf32>
```
Since we're allowing it at runtime for dynamic shapes, it seems
inconsistent to reject it statically. However, I wanted to get feedback
before making that change - this PR focuses only on the runtime
verification fix for dynamic shapes.

P.S. We have a similar issue with `memref.subview`. I will send a
separate patch for the issue.

Co-authored-by: Hanumanth Hanumantharayappa <hhanuman@ah-hhanuman-l.dhcp.mathworks.com>
2025-11-12 08:37:15 +09:00