Add support for folding `memref.expand_shape` ops into
`vector.transfer_read` ops when the permutation map is a
non-minor-identity.
In the case that the permutation map indexes into expanded dimensions
that would be contiguous within the original source shape then it is
safe to make this transformation.
Signed-off-by: Jack Frankland <jack.frankland@arm.com>
The revision follows
f0e1857c84
to add an option for supporting non-atomic RMW emulation. The 0D case
uses non-atomic option unconditionally because it writes the entire
value.
Signed-off-by: hanhanW <hanhan0912@gmail.com>
The memref::multiBuffer transformation replaces an allocation with a
multi-buffered allocation and creates a strided memref.subview at each
loop iteration. When the original allocation is used through view-like
ops, the existing code only handles SubViewOp, leaving other view-like
ops with incorrect types.
This patch extends replaceUsesAndPropagateType to handle ExpandShapeOp,
CollapseShapeOp, and CastOp using TypeSwitch. For each view-like op, we
compute the correct result type (or assert on failure) and create a new
operation, then recursively propagate the updated type through chains.
New FileCheck tests cover expand_shape, collapse_shape, cast, and a
chained expand_shape->cast case.
A single ViewLikeOpInterface hook is not practical here: view-like ops
have distinct type inference and validity rules (e.g., subview uses
offset/size/stride inference, expand/collapse use reassociation, cast
requires compatibility checks). Ops like memref.view or
memref.reinterpret_cast need additional layout/size validation beyond
what multi-buffering currently tracks, so this patch handles the common
safe cases directly.
I'm planning to introduce an interface that'll allow FoldMemRefAliasOps
to not know about dialects like NVVM or GPU. To do this, however, I need
to get the `affine` ops (which need special handling in order to handle
their implicit affine maps) into a separate pass, analogously to how
`amdgpu` ops have these patterns under their dialect and ton under
`memref`.
This commit also changes the expand/collapse_shape index resolvers to
return `void`, since they never actually failed and to make it clearer
that they modify IR.
(Note: An LLM did the initial refactoring and test movement, I've
reviewed the results and edited them some.)
Extend the load of a expand shape rewrite pattern to support folding a
`memref.expand_shape` and `vector.transfer_read` when the permutation
map on `vector.transfer_read` is a minor identity.
---------
Signed-off-by: Jack Frankland <jack.frankland@arm.com>
`RewriteExtractAlignedPointerAsIndexOfViewLikeOp` tries to propagate
`extract_aligned_pointer_as_index` through the view ops.
`ViewLikeOpInterface` by itself doesn't guarantee to preserve the base
pointer and `memref.view` is one such example, so limit pattern to a few
specific ops.
This PR applies the same fix from #166569 to `memref.subview`. That PR
fixed the issue for `tensor.extract_slice`, and this one addresses the
identical problem for `memref.subview`.
The runtime verification for `memref.subview` incorrectly rejects valid
empty subviews (size=0) starting at the memref boundary.
**Example that demonstrates the issue:**
```mlir
func.func @subview_with_empty_slice(%memref: memref<10x4x1xf32, strided<[?, ?, ?], offset: ?>>,
%dim_0: index,
%dim_1: index,
%dim_2: index,
%offset: index) {
// When called with: offset=10, dim_0=0, dim_1=4, dim_2=1
// Runtime verification fails: "offset 0 is out-of-bounds"
%subview = memref.subview %memref[%offset, 0, 0] [%dim_0, %dim_1, %dim_2] [1, 1, 1] :
memref<10x4x1xf32, strided<[?, ?, ?], offset: ?>> to
memref<?x?x?xf32, strided<[?, ?, ?], offset: ?>>
return
}
```
When `%offset=10` and `%dim_0=0`, we're creating an empty subview (zero
elements along dimension 0) starting at the boundary. The current
verification enforces `offset < dim_size`, which evaluates to `10 < 10`
and fails. I feel this should be valid since no memory is accessed.
**The fix:**
Same as #166569 - make the offset check conditional on subview size:
- Empty subview (size == 0): allow `0 <= offset <= dim_size`
- Non-empty subview (size > 0): require `0 <= offset < dim_size`
Please see #166569 for motivation and rationale.
---
Co-authored-by: Hanumanth Hanumantharayappa <hhanuman@ah-hhanuman-l.dhcp.mathworks.com>
Current implementation of `reifyResultShapes` forces all
implementations to return all dimensions of all results. This can be
wasteful when you only require dimensions of one result, or a single
dimension of a result. Further this also creates issues with using
patterns to resolve the `tensor.dim` and `memref.dim` operations since
the extra operations created result in the pattern rewriter entering
an infinite loop (eventually breaking out of the loop due to the
iteration limit on the pattern rewriter). This is demonstrated by some
of the test cases added here that hit this limit when using
`--resolve-shaped-type-result-dims` and
`--resolve-ranked-shaped-type-result-dims`. To resolve this issue the
interface should allow for creating just the operations needed. This
change is the first step in resolving this.
The original implementation was done with the restriction in mind that
it might not always be possible to compute dimension of a single
result or one dimension of a single result in all cases. To account
for such cases, two additional interface methods are added
- `reifyShapeOfResult` (which allows reifying dimensions of
just one result), has a default implementation that calls
`reifyResultShapes` and returns the dimensions of a single result.
- `reifyDimOfResult` (which allows reifying a single dimension of a
single result) has a default implementation that calls
`reifyDimOfResult` and returns the value for the dimension of the
result (which in turn for the default case would call
`reifyDimOfResult`).
While this change sets up the interface, ideally most operations will
implement the `refiyDimOfResult` when possible. For almost all
operations in tree this is true. Subsequent commits will change those
incrementally.
Some of the tests added here that check that the default
implementations for the above method work as expected, also end up
hitting the pattern rewriter limit when using
`--resolve-ranked-shaped-type-result-dims`/
`--resolve-ranked-shaped-type-result-dims`. For testing purposes, a
flag is added to these passes that ignore the error returned by the
pattern application (this flag is left on by default to maintain
current state).
Changes required downstream to integrate this change
1. In operation definitions in .td files, for those operations that
implement the `ReifyRankedShapedTypeOpInterface`.
```
def <op-name> : Op<..., [...,
DeclareOpInterfaceMethods[ReifyRankedShapedTypeOpInterface]]>
```
should be changed to
```
def <op-name> : Op<..., [...,
DeclareOpInterfaceMethods[ReifyRankedShapedTypeOpInterface, [
"reifyResultShapes"]]]>
```
---------
Signed-off-by: MaheshRavishankar <mahesh.ravishankar@gmail.com>
Previously, the runtime verification pass would insert assertion
statements with conditions that always evaluate to false for
semantically valid `memref.subview` operations where one of the
dimensions had a size of 0.
The `memref.subview` runtime verification logic was unconditionally
generating checks for the position of the last element (`offset + (size
- 1) * stride`). When `size` is 0, this causes the assertion condition
to always be false, leading to runtime failures even though the
operation is semantically valid.
This patch fixes the issue by making the `lastPos` check conditional.
The offset is always verified, but the endpoint check is only performed
when `size > 0` to avoid generating spurious assert statements.
This issue was discovered through a LiteRT model, where a dynamic shape
calculation resulted in a zero-sized dimension being passed to
`memref.subview`. The following is a simplified IR snippet from the
model. After running the runtime verification pass, an assertion that
always fails is generated because the SSA value `%5` becomes 0.
```mlir
module {
memref.global "private" constant @__constant_2xi32 : memref<2xi32> = dense<-1> {alignment = 64 : i64}
memref.global "private" constant @__constant_1xi32 : memref<1xi32> = dense<0> {alignment = 64 : i64}
func.func @simpleRepro(%arg0: memref<10x4x1xf32, strided<[?, ?, ?], offset: ?>>) -> memref<?x?x?xf32, strided<[?, ?, ?], offset: ?>> {
%c2 = arith.constant 2 : index
%c4 = arith.constant 4 : index
%c1 = arith.constant 1 : index
%c10 = arith.constant 10 : index
%c0 = arith.constant 0 : index
%c-1 = arith.constant -1 : index
%0 = memref.get_global @__constant_1xi32 : memref<1xi32>
%1 = memref.get_global @__constant_2xi32 : memref<2xi32>
%alloca = memref.alloca() {alignment = 64 : i64} : memref<3xi32>
%subview = memref.subview %alloca[0] [1] [1] : memref<3xi32> to memref<1xi32, strided<[1]>>
memref.copy %0, %subview : memref<1xi32> to memref<1xi32, strided<[1]>>
%subview_0 = memref.subview %alloca[1] [2] [1] : memref<3xi32> to memref<2xi32, strided<[1], offset: 1>>
memref.copy %1, %subview_0 : memref<2xi32> to memref<2xi32, strided<[1], offset: 1>>
%2 = memref.load %alloca[%c0] : memref<3xi32>
%3 = index.casts %2 : i32 to index
%4 = arith.cmpi eq, %3, %c-1 : index
%5 = arith.select %4, %c10, %3 : index
%6 = memref.load %alloca[%c1] : memref<3xi32>
%7 = index.casts %6 : i32 to index
%8 = arith.cmpi eq, %7, %c-1 : index
%9 = arith.select %8, %c4, %7 : index
%10 = memref.load %alloca[%c2] : memref<3xi32>
%11 = index.casts %10 : i32 to index
%12 = arith.cmpi eq, %11, %c-1 : index
%13 = arith.select %12, %c1, %11 : index
%subview_1 = memref.subview %arg0[0, 0, 0] [%5, %9, %13] [1, 1, 1] : memref<10x4x1xf32, strided<[?, ?, ?], offset: ?>> to memref<?x?x?xf32, strided<[?, ?, ?], offset: ?>>
return %subview_1 : memref<?x?x?xf32, strided<[?, ?, ?], offset: ?>>
}
}
```
P.S. This is a similar issue to the one fixed for `tensor.extract_slice`
in https://github.com/llvm/llvm-project/pull/164878
---------
Co-authored-by: Hanumanth Hanumantharayappa <hhanuman@ah-hhanuman-l.dhcp.mathworks.com>
The pass generate-runtime-verification generates additional runtime op
verification checks.
Currently, the pass is extremely expensive. For example, with a
mobilenet v2 ssd network(converted to mlir), running this pass alone in
debug mode will take 30 minutes. The same observation has been made to
other networks as small as 5 Mb.
The culprit is this line "op->print(stream, flags);" in function
"RuntimeVerifiableOpInterface::generateErrorMessage" in File
mlir/lib/Interfaces/RuntimeVerifiableOpInterface.cpp.
As we are printing the op with all the names of the operands in the
middle end, we are constructing a new SSANameState for each
op->print(...) call. Thus, we are doing a new SSA analysis for each
error message printed.
Perf profiling shows that 98% percent of the time is spent in the
constructor of SSANameState.
This change refactored the message generator. We use a toplevel
AsmState, and reuse it with all the op-print(stream, asmState). With a
release build, this change reduces the pass exeuction time from ~160
seconds to 0.3 seconds on my machine.
This change also adds verbose options to generate-runtime-verification
pass.
verbose 0: print only source location with error message.
verbose 1: print the full op, including the name of the operands.
Addresses: https://github.com/llvm/llvm-project/issues/115653
We already have utilities to flatten memrefs into 1-D. This change makes
memref flattening a prerequisite for vector narrow type emulation,
ensuring that emulation patterns only need to handle 1-D scenarios.
The viewLikeOpInterface abstracts the behavior of an operation view one
buffer as another. However, the current interface only includes a
"getViewSource" method and lacks a "getViewDest" method.
Previously, it was generally assumed that viewLikeOpInterface operations
would have only one return value, which was the view dest. This
assumption was broken by memref.extract_strided_metadata, and more
operations may break these silent conventions in the future. Calling
"viewLikeInterface->getResult(0)" may lead to a core dump at runtime.
Therefore, we need 'getViewDest' method to standardize our behavior.
This patch adds the getViewDest function to viewLikeOpInterface and
modifies the usage points of viewLikeOpInterface to standardize its use.
Supports the case where the sizes of the subview op is dynamic.When
there are more for loops in the tile algorithm, multiple subviews are
performed and test-compose-subview does not work when the size operand
of the subview ops is dynamic value.
This is a reapply of patch #149851. The reapply also fixes a CMake/Bazel
build issue, which was the reason of the revert. (Thanks @rupprecht )
Original patch (#149851) message:
-----
This PR adds a new optimization pass to fold
`memref.subview/expand_shape/collapse_shape` ops into consumer
`amdgpu.gather_to_lds` operations.
* Implements a new pass `AmdgpuFoldMemRefOpsPass` with pattern
`FoldMemRefOpsIntoGatherToLDSOp`
* Adds corresponding folding tests
These are identified by misc-include-cleaner. I've filtered out those
that break builds. Also, I'm staying away from llvm-config.h,
config.h, and Compiler.h, which likely cause platform- or
compiler-specific build failures.
This PR adds a new optimization pass to fold
`memref.subview/expand_shape/collapse_shape` ops into consumer
`amdgpu.gather_to_lds` operations.
* Implements a new pass `AmdgpuFoldMemRefOpsPass` with pattern
`FoldMemRefOpsIntoGatherToLDSOp`
* Adds corresponding folding tests
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
This enables memref.load/store + vector.load/store support for sub-byte
float types. Since the memref types don't matter for loads/stores, we
still use the same types as integers with equivalent widths, with a few
extra bitcasts needed around certain operations.
There is no direct change needed for vector.load/store support. The
tests added for them are to verify that float types are
supported as well.
These are identified by misc-include-cleaner. I've filtered out those
that break builds. Also, I'm staying away from llvm-config.h,
config.h, and Compiler.h, which likely cause platform- or
compiler-specific build failures.
The motivation is to avoid having to negate `isDynamic*` checks, avoid
double negations, and allow for `ShapedType::isStaticDim` to be used in
ADT functions without having to wrap it in a lambda performing the
negation.
Also add the new functions to C and Python bindings.
This pass reifies the shapes of a subset of
`ReifyRankedShapedTypeOpInterface` ops with `tensor` results.
The pass currently only supports result shape type reification for:
- tensor::PadOp
- tensor::ConcatOp
It addresses a representation gap where implicit op semantics are needed
to infer static result types from dynamic
operands. But it does so by using `ReifyRankedShapedTypeOpInterface` as
the source of truth rather than the op itself.
As a consequence, this cannot generalize today.
TODO: in the future, we should consider coupling this information with
op "transfer functions" (e.g.
`IndexingMapOpInterface`) to provide a source of truth that can work
across result shape inference, canonicalization and
op verifiers.
The pass replaces the operations with their reified versions, when more
static information can be derived, and inserts
casts when results shapes are updated.
Example:
```mlir
#map = affine_map<(d0) -> (-d0 + 256)>
func.func @func(%arg0: f32, %arg1: index, %arg2: tensor<64x?x64xf32>) -> tensor<1x?x64xf32> {
%0 = affine.apply #map(%arg1)
%extracted_slice = tensor.extract_slice %arg2[0, 0, 0] [1, %arg1, 64] [1, 1, 1] : tensor<64x?x64xf32> to tensor<1x?x64xf32>
%padded = tensor.pad %extracted_slice low[0, 0, 0] high[0, %0, 0] {
^bb0(%arg3: index, %arg4: index, %arg5: index):
tensor.yield %arg0 : f32
} : tensor<1x?x64xf32> to tensor<1x?x64xf32>
return %padded : tensor<1x?x64xf32>
}
// mlir-opt --reify-result-shapes
#map = affine_map<()[s0] -> (-s0 + 256)>
func.func @func(%arg0: f32, %arg1: index, %arg2: tensor<64x?x64xf32>) -> tensor<1x?x64xf32> {
%0 = affine.apply #map()[%arg1]
%extracted_slice = tensor.extract_slice %arg2[0, 0, 0] [1, %arg1, 64] [1, 1, 1] : tensor<64x?x64xf32> to tensor<1x?x64xf32>
%padded = tensor.pad %extracted_slice low[0, 0, 0] high[0, %0, 0] {
^bb0(%arg3: index, %arg4: index, %arg5: index):
tensor.yield %arg0 : f32
} : tensor<1x?x64xf32> to tensor<1x256x64xf32>
%cast = tensor.cast %padded : tensor<1x256x64xf32> to tensor<1x?x64xf32>
return %cast : tensor<1x?x64xf32>
}
```
---------
Co-authored-by: Fabian Mora <fabian.mora-cordero@amd.com>
Fixes: https://github.com/llvm/llvm-project/issues/130257
Fix affine-data-copy-generate in certain cases that involved users in
multiple blocks. Perform the memref replacement correctly during copy
generation.
Improve/clean up memref affine use replacement API. Instead of
supporting dominance and post dominance filters (which aren't adequate
in most cases) and computing dominance info expensively each time in
RAMUW, provide a user filter callback, i.e., force users to compute
dominance if needed.
The expansion of `memref.atomic_rmw` into a `memref.generic_atomic_rmw`
for floating-point min/max operations is no longer necessary as those
are now supported by the LLVM dialect and LLVM IR.
Furthermore, combining this expansion with direct lowering of
`generic_atomic_rmw` could leads to invalid LLVM dialect IR with
`cmpxchg` operating on floating-point values that it does not support.
To add a transformation that simplifies memory access patterns, this PR
adds a memref linearizer which is based on the GPU/DecomposeMemRefs
pass, with the following changes:
* support vector dialect ops
* instead of decompose memrefs to rank-0 memrefs, flatten higher-ranked
memrefs to rank-1.
Notes:
* After the linearization, a MemRef's offset is kept, so a
`memref<4x8xf32, strided<[8, 1], offset: 100>>` becomes `memref<32xf32,
strided<[1], offset: 100>>`.
* It also works with dynamic shapes and strides and offsets (see test
cases for details).
* The shape of the casted memref is computed as 1d, flattened.
The revision adds isOneInteger helper, and simplifies the existing code
with the two methods. It removes some lambda, which makes code cleaner.
For downstream users, you can update the code with the below script.
```bash
sed -i "s/isZeroIndex/isZeroInteger/g" **/*.h
sed -i "s/isZeroIndex/isZeroInteger/g" **/*.cpp
```
---------
Signed-off-by: hanhanW <hanhan0912@gmail.com>
Made AssumeAlignment a ViewLikeOp that returns a new SSA memref equal
to its memref argument and made it have Pure trait. This
gives it a defined memory effect that matches what it does in practice
and makes it behave nicely with optimizations which won't get rid of it
unless its result isn't being used.
This PR updates the FoldMemRefAliasOps to use `affine.linearize_index`
and `affine.delinearize_index` to perform the index computations needed
to fold a `memref.expand_shape` or `memref.collapse_shape` into its
consumers, respectively.
This also loosens some limitations of the pass:
1. The existing `output_shape` argument to `memref.expand_shape` is now
used, eliminating the need to re-infer this shape or call `memref.dim`.
2. Because we're using `affine.delinearize_index`, the restriction that
each group in a `memref.collapse_shape` can only have one dynamic
dimension is removed.
[mlir][vector] Standardize base Naming Across Vector Ops (NFC)
This change standardizes the naming convention for the argument
representing the value to read from or write to in Vector ops that
interface with Tensors or MemRefs. Specifically, it ensures that all
such ops use the name `base` (i.e., the base address or location to
which offsets are applied).
Updated operations:
* `vector.transfer_read`,
* `vector.transfer_write`.
For reference, these ops already use `base`:
* `vector.load`, `vector.store`, `vector.scatter`, `vector.gather`,
`vector.expandload`, `vector.compressstore`, `vector.maskedstore`,
`vector.maskedload`.
This is a non-functional change (NFC) and does not alter the semantics of these
operations. However, it does require users of the XFer ops to switch from
`op.getSource()` to `op.getBase()`.
To ease the transition, this PR temporarily adds a `getSource()` interface
method for compatibility. This is intended for downstream use only and should
not be relied on upstream. The method will be removed prior to the LLVM 21
release.
Implements #131602
The runtime verification code used to verify that the result of a
`memref.reinterpret_cast` is in-bounds with respect to the source
memref. This is incorrect: `memref.reinterpret_cast` allows users to
construct almost arbitrary memref descriptors and there is no
correctness expectation.
This op is supposed to be used when the user "knows what they are
doing." Similarly, the static verifier of `memref.reinterpret_cast` does
not verify in-bounds semantics either.
Rewrites memrefs defined by reinterpet_cast ops to have an identity
layout map and updates all their indexing uses. Also, extend
`replaceAllMemRefUsesWith` utility to work when there are multiple
occurrences of `oldMemRef` in `op`'s operand list when op is
non-dereferencing.
Fixes#122090Fixes#121091
let constructor is legacy (do not use in tree!) since the tableGen
backend emits most of the glue logic to build a pass.
Note: The following constructor has been retired:
```cpp
std::unique_ptr<Pass> createExpandReallocPass(bool emitDeallocs = true);
```
To update your codebase, replace it with the new options-based API:
```cpp
memref::ExpandReallocPassOptions expandAllocPassOptions{
/*emitDeallocs=*/false};
pm.addPass(memref::createExpandReallocPass(expandAllocPassOptions));
```
This commit addresses a TODO in the runtime verification of
`memref.subview`. Each dimension is now verified: the offset must be
in-bounds and the slice must not run out-of-bounds.
This commit aligns runtime verification with static op verification
(which was improved in #133086).