I hit another runtime verification issue (similar to
https://github.com/llvm/llvm-project/pull/164878) while working with
TFLite models. The verifier is incorrectly rejecting
`tensor.extract_slice` operations when extracting an empty slice
(size=0) that starts exactly at the tensor boundary.
The current runtime verification unconditionally enforces `offset <
dim_size`. This makes sense for non-empty slices, but it's too strict
for empty slices, causing false positives that lead to spurious runtime
assertions.
**Simple example that demonstrates the issue:**
```mlir
func.func @extract_empty_slice(%tensor: tensor<?xf32>, %offset: index, %size: index) {
// When called with: tensor size=10, offset=10, size=0
// Runtime verification fails: "offset 0 is out-of-bounds"
%slice = tensor.extract_slice %tensor[%offset] [%size] [1]
: tensor<?xf32> to tensor<?xf32>
return
}
```
For the above example, the check evaluates `10 < 10` which is false, so
verification fails. However, I believe this operation should be valid -
we're extracting zero elements, so there's no actual out-of-bounds
access.
**Real-world repro from the TensorFlow Lite models:**
This issue manifests while lowering TFLite models and a lot of our
system tests are failing due to this. Here's a simplified version
showing the problematic pattern:
In this code, `%extracted_slice_0` becomes an empty tensor when SSA
value `%15` reaches 10 (on the final loop iteration), making `%16 = 0`.
The operation extracts zero elements along dimension 0, which is
semantically valid but fails runtime verification.
```mlir
func.func @simplified_repro_from_tensorflowlite_model(%arg0: tensor<10x4x1xf32>) -> tensor<10x4x1xf32> {
%c0 = arith.constant 0 : index
%c1 = arith.constant 1 : index
%c2 = arith.constant 2 : index
%c10 = arith.constant 10 : index
%c-1 = arith.constant -1 : index
%0 = "tosa.const"() <{values = dense<0> : tensor<i32>}> : () -> tensor<i32>
%1 = "tosa.const"() <{values = dense<1> : tensor<i32>}> : () -> tensor<i32>
%2 = "tosa.const"() <{values = dense<10> : tensor<i32>}> : () -> tensor<i32>
%3 = "tosa.const"() <{values = dense<-1> : tensor<2xi32>}> : () -> tensor<2xi32>
%4 = "tosa.const"() <{values = dense<0> : tensor<2xi32>}> : () -> tensor<2xi32>
%5 = "tosa.const"() <{values = dense<0.000000e+00> : tensor<1x4x1xf32>}> : () -> tensor<1x4x1xf32>
%c4_1 = tosa.const_shape {values = dense<1> : tensor<1xindex>} : () -> !tosa.shape<1>
%6:2 = scf.while (%arg1 = %0, %arg2 = %arg0)
: (tensor<i32>, tensor<10x4x1xf32>) -> (tensor<i32>, tensor<10x4x1xf32>) {
%7 = tosa.greater %2, %arg1 : (tensor<i32>, tensor<i32>) -> tensor<i1>
%extracted = tensor.extract %7[] : tensor<i1>
scf.condition(%extracted) %arg1, %arg2 : tensor<i32>, tensor<10x4x1xf32>
} do {
^bb0(%arg1: tensor<i32>, %arg2: tensor<10x4x1xf32>):
%7 = tosa.add %arg1, %1 : (tensor<i32>, tensor<i32>) -> tensor<i32>
// First slice
%8 = tosa.reshape %arg1, %c4_1 : (tensor<i32>, !tosa.shape<1>) -> tensor<1xi32>
%9 = tosa.concat %8, %3 {axis = 0 : i32} : (tensor<1xi32>, tensor<2xi32>) -> tensor<3xi32>
%extracted_0 = tensor.extract %9[%c0] : tensor<3xi32>
%10 = index.casts %extracted_0 : i32 to index
%11 = arith.cmpi eq, %10, %c-1 : index
%12 = arith.select %11, %c10, %10 : index
%extracted_slice = tensor.extract_slice %arg2[0, 0, 0] [%12, 4, 1] [1, 1, 1]
: tensor<10x4x1xf32> to tensor<?x4x1xf32>
// Second slice - this is where the failure occurs
%13 = tosa.reshape %7, %c4_1 : (tensor<i32>, !tosa.shape<1>) -> tensor<1xi32>
%14 = tosa.concat %13, %4 {axis = 0 : i32} : (tensor<1xi32>, tensor<2xi32>) -> tensor<3xi32>
%extracted_1 = tensor.extract %14[%c0] : tensor<3xi32>
%15 = index.castu %extracted_1 : i32 to index
%16 = arith.subi %c10, %15 : index // size = 10 - offset
%extracted_2 = tensor.extract %14[%c1] : tensor<3xi32>
%17 = index.castu %extracted_2 : i32 to index
%extracted_3 = tensor.extract %14[%c2] : tensor<3xi32>
%18 = index.castu %extracted_3 : i32 to index
// On the last loop iteration: %15=10, %16=0
// %extracted_slice_0 becomes an empty tensor
// Runtime verification fails: "offset 0 is out-of-bounds"
%extracted_slice_0 = tensor.extract_slice %arg2[%15, %17, %18] [%16, 4, 1] [1, 1, 1]
: tensor<10x4x1xf32> to tensor<?x4x1xf32>
%19 = tosa.concat %extracted_slice, %5, %extracted_slice_0 {axis = 0 : i32}
: (tensor<?x4x1xf32>, tensor<1x4x1xf32>, tensor<?x4x1xf32>) -> tensor<10x4x1xf32>
scf.yield %7, %19 : tensor<i32>, tensor<10x4x1xf32>
}
return %6#1 : tensor<10x4x1xf32>
}
```
**The fix:**
Make the offset check conditional on slice size:
- Empty slice (size == 0): allow `0 <= offset <= dim_size`
- Non-empty slice (size > 0): require `0 <= offset < dim_size`
**Question for reviewers:**
Should we also relax the static verifier to allow this edge case?
Currently, the static verifier rejects the following IR:
```mlir
%tensor = arith.constant dense<1.0> : tensor<10xf32>
%slice = tensor.extract_slice %tensor[10] [0] [1] : tensor<10xf32> to tensor<0xf32>
```
Since we're allowing it at runtime for dynamic shapes, it seems
inconsistent to reject it statically. However, I wanted to get feedback
before making that change - this PR focuses only on the runtime
verification fix for dynamic shapes.
P.S. We have a similar issue with `memref.subview`. I will send a
separate patch for the issue.
Co-authored-by: Hanumanth Hanumantharayappa <hhanuman@ah-hhanuman-l.dhcp.mathworks.com>
Previously, the runtime verification pass would insert assertion
statements with conditions that always evaluate to false for
semantically valid `tensor.extract_slice` operations where one of the
dimensions had a size of 0.
The `tensor.extract_slice` runtime verification logic was
unconditionally generating checks for the position of the last element
(`offset + (size - 1) * stride`). When `size` is 0, this causes the
assertion condition to always be false, leading to runtime failures even
though the operation is semantically valid.
This patch fixes the issue by making the `lastPos` check conditional.
The offset is always verified, but the endpoint check is only performed
when `size > 0` to avoid generating spurious assert statements.
This issue was discovered through LiteRT model, where a dynamic shape
calculation resulted in a zero-sized dimension being passed to
`tensor.extract_slice`.
The following is a simplified IR snippet from the model. After running
the runtime verification pass, an assertion that always fails is
generated because the SSA value `%3` becomes 0.
```mlir
func.func @simple_repro_from_liteRT_model(%arg0: tensor<10x4x1xf32>) -> tensor<?x?x?xf32> {
%cst = arith.constant dense<0> : tensor<1xi32>
%cst_0 = arith.constant dense<-1> : tensor<2xi32>
%c-1 = arith.constant -1 : index
%c0 = arith.constant 0 : index
%c10 = arith.constant 10 : index
%c1 = arith.constant 1 : index
%c4 = arith.constant 4 : index
%c2 = arith.constant 2 : index
%0 = tensor.empty() : tensor<3xi32>
%inserted_slice = tensor.insert_slice %cst into %0[0] [1] [1] : tensor<1xi32> into tensor<3xi32>
%inserted_slice_1 = tensor.insert_slice %cst_0 into %inserted_slice[1] [2] [1] : tensor<2xi32> into tensor<3xi32>
%extracted = tensor.extract %inserted_slice_1[%c0] : tensor<3xi32>
%1 = index.casts %extracted : i32 to index
%2 = arith.cmpi eq, %1, %c-1 : index
%3 = arith.select %2, %c10, %1 : index
%extracted_2 = tensor.extract %inserted_slice_1[%c1] : tensor<3xi32>
%4 = index.casts %extracted_2 : i32 to index
%5 = arith.cmpi eq, %4, %c-1 : index
%6 = arith.select %5, %c4, %4 : index
%extracted_3 = tensor.extract %inserted_slice_1[%c2] : tensor<3xi32>
%7 = index.casts %extracted_3 : i32 to index
%8 = arith.cmpi eq, %7, %c-1 : index
%9 = arith.select %8, %c1, %7 : index
%extracted_slice = tensor.extract_slice %arg0[0, 0, 0] [%3, %6, %9] [1, 1, 1] : tensor<10x4x1xf32> to tensor<?x?x?xf32>
return %extracted_slice : tensor<?x?x?xf32>
}
```
The issue can be reproduced more simply with the following test case,
where `dim_0` is `0`. When the runtime verification pass is applied to
this code with `dim_0 = 0`, it generates an assertion that will always
fail at runtime.
```mlir
func.func @extract_slice_zero_size_dim(%arg0: tensor<10x4x1xf32>,
%dim_0: index,
%dim_1: index,
%dim_2: index) {
%slice = tensor.extract_slice %arg0[0, 0, 0] [%dim_0, %dim_1, %dim_2] [1, 1, 1]
: tensor<10x4x1xf32> to tensor<?x?x?xf32>
return
}
func.func @test_zero_size_extraction() {
%input = arith.constant dense<1.0> : tensor<10x4x1xf32>
// Define slice dimensions: 0x4x1 (zero-size in first dimension)
%dim_0 = arith.constant 0 : index
%dim_1 = arith.constant 4 : index
%dim_2 = arith.constant 1 : index
func.call @extract_slice_zero_size_dim(%input, %dim_0, %dim_1, %dim_2)
: (tensor<10x4x1xf32>, index, index, index) -> ()
return
}
```
P.S. We probably have a similar issue with `memref.subview`. I will
check this and send a separate PR for the issue.
---------
Co-authored-by: Hanumanth Hanumantharayappa <hhanuman@ah-hhanuman-l.dhcp.mathworks.com>
The pass generate-runtime-verification generates additional runtime op
verification checks.
Currently, the pass is extremely expensive. For example, with a
mobilenet v2 ssd network(converted to mlir), running this pass alone in
debug mode will take 30 minutes. The same observation has been made to
other networks as small as 5 Mb.
The culprit is this line "op->print(stream, flags);" in function
"RuntimeVerifiableOpInterface::generateErrorMessage" in File
mlir/lib/Interfaces/RuntimeVerifiableOpInterface.cpp.
As we are printing the op with all the names of the operands in the
middle end, we are constructing a new SSANameState for each
op->print(...) call. Thus, we are doing a new SSA analysis for each
error message printed.
Perf profiling shows that 98% percent of the time is spent in the
constructor of SSANameState.
This change refactored the message generator. We use a toplevel
AsmState, and reuse it with all the op-print(stream, asmState). With a
release build, this change reduces the pass exeuction time from ~160
seconds to 0.3 seconds on my machine.
This change also adds verbose options to generate-runtime-verification
pass.
verbose 0: print only source location with error message.
verbose 1: print the full op, including the name of the operands.
These are identified by misc-include-cleaner. I've filtered out those
that break builds. Also, I'm staying away from llvm-config.h,
config.h, and Compiler.h, which likely cause platform- or
compiler-specific build failures.
Add `RuntimeVerifiableOpInterface` implementations for the following
ops. These were mostly copied from the respective memref
implementations. Only the part that deals with offsets and strides was
removed.
* `tensor.cast`: `memref.cast`
* `tensor.dim`: `memref.dim`
* `tensor.extract`: `memref.load`
* `tensor.insert`: `memref.store`
* `tensor.extract_slice`: `memref.subview`