## Description
This change introduces a new canonicalization pattern for the MLIR
Vector dialect that optimizes chains of insertions. The optimization
identifies when a vector is **completely** initialized through a series
of vector.insert operations and replaces the entire chain with a
single `vector.from_elements` operation.
Please be aware that the new pattern does **not** apply when only **some**
elements are set (the rest remaining poison), as MLIR doesn't currently
support partially-poisoned vectors.
**New Pattern: InsertChainFullyInitialized**
* Detects chains of vector.insert operations.
* Validates that all insertions are at static positions, and all
intermediate insertions have only one use.
* Ensures the entire vector is **completely** initialized.
* Replaces the entire chain with a
single vector.from_elements operation.
**Refactored Helper Function**
* Extracted `calculateInsertPosition` from
`foldDenseElementsAttrDestInsertOp` to avoid code duplication.
## Example
```
// Before:
%v1 = vector.insert %c10, %v0[0] : i64 into vector<2xi64>
%v2 = vector.insert %c20, %v1[1] : i64 into vector<2xi64>
// After:
%v2 = vector.from_elements %c10, %c20 : vector<2xi64>
```
It also works for multidimensional vectors.
```
// Before:
%v1 = vector.insert %cv0, %v0[0] : vector<3xi64> into vector<2x3xi64>
%v2 = vector.insert %cv1, %v1[1] : vector<3xi64> into vector<2x3xi64>
// After:
%0:3 = vector.to_elements %cv0 : vector<3xi64>
%1:3 = vector.to_elements %cv1 : vector<3xi64>
%v2 = vector.from_elements %0#0, %0#1, %0#2, %1#0, %1#1, %1#2 : vector<2x3xi64>
```
---------
Co-authored-by: Yang Bai <yangb@nvidia.com>
Co-authored-by: Andrzej Warzyński <andrzej.warzynski@gmail.com>
Fold `broadcast(shape_cast(x))` into `broadcast(x)` if the type of x is
compatible with broadcast's result type and the shape_cast only adds or removes ones in the leading dimensions.
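For illustration, a sketch of the kind of IR this fold targets (the types here are hypothetical):
```mlir
// Before: the shape_cast only adds a leading unit dimension.
%0 = vector.shape_cast %x : vector<4x2xf32> to vector<1x4x2xf32>
%1 = vector.broadcast %0 : vector<1x4x2xf32> to vector<3x4x2xf32>
// After: %x is broadcast directly to the result type.
%1 = vector.broadcast %x : vector<4x2xf32> to vector<3x4x2xf32>
```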
---------
Co-authored-by: Andrzej Warzyński <andrzej.warzynski@gmail.com>
Co-authored-by: James Newling <james.newling@gmail.com>
This PR ensures parity in folding/canonicalizing of vector.broadcast
(from a scalar) and vector.splat. This means that by using
vector.broadcast instead of vector.splat (which is currently
deprecated), there is no loss in optimizations performed. All tests
which were previously checking folding/canonicalizing of vector.splat
are now done for vector.broadcast. The vector.splat canonicalization
tests are now in a separate file, ready for removal when, in the future,
we remove vector.splat completely.
This PR also adds a canonicalizer to vector.splat to always convert it
to vector.broadcast. This is to reduce the 'traffic' through
vector.splat.
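For illustration, the new canonicalizer rewrites a splat into the equivalent broadcast, roughly (types hypothetical):
```mlir
// Before:
%0 = vector.splat %s : vector<4xf32>
// After:
%0 = vector.broadcast %s : f32 to vector<4xf32>
```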
There is a chance that this PR will break downstream users who create or
expect vector.splat ops. Changing all such logic to work with
vector.broadcast instead should fix that.
The folder `shape_cast(splat constant) -> splat constant` was first
introduced
[here](36480657d8 (diff-484cea976e0c96459027c951733bf2d22d34c5a0c0de6f577069870ef4588983R2600))
(Nov 2020). In that commit there is a comment to _Only handle splat for
now_. Based on that I assume the intention was to, at a later time,
support a general `shape_cast(constant) -> constant` folder. That is
what this PR does.
One minor downside: with this folder it is possible to end up with 2 large
constants instead of 1 large constant and 1 shape_cast:
```mlir
func.func @foo() -> (vector<4xi32>, vector<2x2xi32>) {
%cst = arith.constant dense<[1, 2, 3, 4]> : vector<4xi32> // 'large' constant 1
%0 = vector.shape_cast %cst : vector<4xi32> to vector<2x2xi32>
return %cst, %0 : vector<4xi32>, vector<2x2xi32>
}
```
gets folded with this new folder to
```mlir
func.func @foo() -> (vector<4xi32>, vector<2x2xi32>) {
%cst = arith.constant dense<[1, 2, 3, 4]> : vector<4xi32> // 'large' constant 1
%cst_0 = arith.constant dense<[[1, 2], [3, 4]]> : vector<2x2xi32> // 'large' constant 2
return %cst, %cst_0 : vector<4xi32>, vector<2x2xi32>
}
```
Notes on the above case:
1) This only affects the textual IR; the actual values share the same
context storage (I've verified this by checking pointer values in the
`DenseIntOrFPElementsAttrStorage`
[constructor](da5c442550/mlir/lib/IR/AttributeDetail.h (L59))),
so there is no compile-time memory overhead from this folding. At the LLVM
IR level the constant is shared, too.
2) This only happens when the pre-folded constant cannot be dead-code
eliminated (i.e. when it has 2+ uses), which I don't think is common.
Implements the `inferResultRanges` method from the
`InferIntRangeInterface` interface for `vector.step`. The implementation
is similar to that of arith.constant, since the exact result values are
statically known.
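For example (a sketch, not actual pass output):
```mlir
// vector.step materializes [0, 1, ..., 7], so the inferred range of the
// result elements is [0, 7].
%0 = vector.step : vector<8xindex>
```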
Signed-off-by: Max Dawkins <max.dawkins@gmail.com>
Implements the `inferResultRanges` method from the
`InferIntRangeInterface` interface for `vector.transpose`. The result
ranges simply match the source ranges.
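For example (a sketch with hypothetical types):
```mlir
// The inferred ranges of the elements of %1 are simply those of %0;
// the transpose only permutes where each value sits.
%1 = vector.transpose %0, [1, 0] : vector<2x4xindex> to vector<4x2xindex>
```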
Signed-off-by: Max Dawkins <max.dawkins@gmail.com>
This PR uses `val.getDefiningOp<OpTy>()` to replace `dyn_cast<OpTy>(val.getDefiningOp())`, `dyn_cast_or_null<OpTy>(val.getDefiningOp())`, and `dyn_cast_if_present<OpTy>(val.getDefiningOp())`.
The Canonicalizer pass had a dependency on the UB dialect that it shouldn't have.
It also no longer needs to depend directly on the UB dialect, since the Vector
dialect (which uses the UB dialect for the poison index operations introduced by
35df525) already declares this dependency (878d3594).
In a later PR more shape_cast ops will appear. Specifically, broadcasts that
just prepend ones become shape_cast ops (i.e. volume preserving broadcasts
are canonicalized to shape_casts). This PR ensures that broadcast-like
shape_cast ops fold at least as well as broadcast ops.
This is done by modifying patterns that target broadcast ops, to target
'broadcast-like' ops. No new patterns are added, the patterns that exist
are just made to match on shape_casts where appropriate.
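For illustration, a sketch (types hypothetical) of a broadcast-like shape_cast and the kind of folding this enables:
```mlir
// %0 only prepends a unit dimension, so it is 'broadcast-like':
%0 = vector.shape_cast %v : vector<4xf32> to vector<1x4xf32>
// Extracting through it can now fold just like extracting through a
// broadcast, i.e. %1 folds directly to %v:
%1 = vector.extract %0[0] : vector<4xf32> from vector<1x4xf32>
```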
This PR also includes minor code simplifications: use
`isBroadcastableTo` to simplify `ExtractOpFromBroadcast` and simplify
how broadcast dims are detected in `foldExtractFromBroadcast`. These are
NFC.
---------
Co-authored-by: Andrzej Warzyński <andrzej.warzynski@gmail.com>
These are identified by misc-include-cleaner. I've filtered out those
that break builds. Also, I'm staying away from llvm-config.h,
config.h, and Compiler.h, which likely cause platform- or
compiler-specific build failures.
When the result of an insert op is used by a subsequent insert op that
inserts at the same position, replace the dest of the subsequent insert op
with the dest of the previous insert op. This works because the previous
insert op is fully overwritten and therefore does not affect the subsequent
insert op.
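For example (a sketch with hypothetical names):
```mlir
// Before: the element written by %v1 at [0] is fully overwritten by %v2.
%v1 = vector.insert %e0, %dst[0] : f32 into vector<4xf32>
%v2 = vector.insert %e1, %v1[0] : f32 into vector<4xf32>
// After: the later insert uses the original destination directly.
%v2 = vector.insert %e1, %dst[0] : f32 into vector<4xf32>
```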
---------
Co-authored-by: Mehdi Amini <joker.eph@gmail.com>
Co-authored-by: Andrzej Warzyński <andrzej.warzynski@gmail.com>
The motivation is to avoid having to negate `isDynamic*` checks, avoid
double negations, and allow for `ShapedType::isStaticDim` to be used in
ADT functions without having to wrap it in a lambda performing the
negation.
Also add the new functions to C and Python bindings.
Context:
`vector.transfer_read` always requires a padding value. Most of its
builders take no `padding` value and assume the safe value of `0`.
However, this should be a conscious choice by the API user, as it makes
it easy to introduce bugs.
For example, while making this patch I found several occasions where the
padding value was not getting propagated (`vector.transfer_read` was
transformed into another `vector.transfer_read`). These bugs were always
caused by constructors that don't require specifying padding.
Additionally, using `ub.poison` as a possible default value is better, as it
explicitly indicates that the user "doesn't care" about the actual padding
value, while still forcing users to specify the padding semantics they want.
With that in mind, this patch changes the builders of
`vector.transfer_read` to always take a `std::optional<Value> padding`
argument. The argument itself can never be omitted, but for convenience users
can pass `std::nullopt`, which pads the transfer read with `ub.poison`.
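At the IR level, passing `std::nullopt` produces a read padded with `ub.poison`, roughly like this (a sketch with hypothetical names):
```mlir
%c0 = arith.constant 0 : index
%pad = ub.poison : f32
%v = vector.transfer_read %mem[%c0, %c0], %pad : memref<?x?xf32>, vector<4xf32>
```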
---------
Signed-off-by: Fabian Mora <fabian.mora-cordero@amd.com>
This patch enforces a restriction in the Vector dialect: the non-indexed
operands of `vector.insert` and `vector.extract` must no longer be 0-D
vectors. In other words, rank-0 vector types like `vector<f32>` are
disallowed as the source or result.
EXAMPLES
--------
The following are now **illegal** (note the use of `vector<f32>`):
```mlir
%0 = vector.insert %v, %dst[0, 0] : vector<f32> into vector<2x2xf32>
%1 = vector.extract %src[0, 0] : vector<f32> from vector<2x2xf32>
```
Instead, use scalars as the source and result types:
```mlir
%0 = vector.insert %v, %dst[0, 0] : f32 into vector<2x2xf32>
%1 = vector.extract %src[0, 0] : f32 from vector<2x2xf32>
```
Note, this change serves three goals. These are summarised below.
## 1. REDUCED AMBIGUITY
By enforcing scalar-only semantics when the result (`vector.extract`)
or source (`vector.insert`) are rank-0, we eliminate ambiguity
in interpretation. Prior to this patch, both `f32` and `vector<f32>`
were accepted.
## 2. MATCH IMPLEMENTATION TO DOCUMENTATION
The current behaviour contradicts the documented intent. For example,
`vector.extract` states:
> Degenerates to an element type if n-k is zero.
This patch enforces that intent in code.
## 3. ENSURE SYMMETRY BETWEEN INSERT AND EXTRACT
With the stricter semantics in place, it’s natural and consistent to
make `vector.insert` behave symmetrically to `vector.extract`, i.e.,
degenerate the source type to a scalar when n = 0.
NOTES FOR REVIEWERS
-------------------
1. Main change is in "VectorOps.cpp", where stricter type checks are
implemented.
2. Test updates in "invalid.mlir" and "ops.mlir" are minor cleanups to
remove now-illegal examples.
3. Lowering changes in "VectorToSCF.cpp" are the main trade-off: we now
require an additional `vector.extract` when a preceding
`vector.transfer_read` generates a rank-0 vector.
RELATED RFC
-----------
* https://discourse.llvm.org/t/rfc-should-we-restrict-the-usage-of-0-d-vectors-in-the-vector-dialect
As a follow-up to PR #135841 (see discussion for background), this patch
removes the `ConvertIllegalShapeCastOpsToTransposes` pattern from the SME
legalization pass. This change unblocks folding for ShapeCastOp involving
scalable vectors.
Originally, the `ConvertIllegalShapeCastOpsToTransposes` pattern was introduced
to rewrite certain `vector.shape_cast` ops that could not be lowered otherwise.
Based on local end-to-end testing, this workaround is no longer required, and
the pattern can now be safely removed.
This patch also removes a special case from `ShapeCastOp::fold`, simplifying
the fold logic.
As a side effect of removing `ConvertIllegalShapeCastOpsToTransposes`, we lose
the mechanism that enabled lowering of certain ops like:
```mlir
%res = vector.transfer_read %mem[%a, %b] (...) :
memref<?x?xf32>, vector<[4]x1xf32>
```
Previously, such cases were handled by:
* Rewriting a nearby `vector.shape_cast` to a `vector.transpose` (via
`ConvertIllegalShapeCastOpsToTransposes`)
* Then lowering the result with `LiftIllegalVectorTransposeToMemory`.
This patch introduces a new dedicated pattern,
`LowerColumnTransferReadToLoops`, that directly handles illegal
`vector.transfer_read` ops involving leading scalable dimensions.
Refactor the verifiers to make use of the common bits and make
`vector.contract` also use this interface.
In the process, the confusingly named `getStaticShape` has been removed.
Note: the verifier for `IndexingMapOpInterface` is currently called manually
from other verifiers, as it was unclear how to avoid it taking precedence
over more meaningful error messages.
### Description
This patch improves the folding efficiency of `vector.insert` and
`vector.extract` operations by not returning early after successfully
converting dynamic indices to static indices.
This PR also renames the test pass `TestConstantFold` to
`TestSingleFold` and adds comprehensive documentation explaining the
single-pass folding behavior.
### Motivation
Since the `OpBuilder::createOrFold` function only calls `fold` **once**,
the current `fold` methods of `vector.insert` and `vector.extract` may
leave the op in a state that can be folded further. For example,
consider the following un-folded IR:
```
%v1 = vector.insert %e1, %v0 [0] : f32 into vector<128xf32>
%c0 = arith.constant 0 : index
%e2 = vector.extract %v1[%c0] : f32 from vector<128xf32>
```
If we use `createOrFold` to create the `vector.extract` op, then the
result will be:
```
%v1 = vector.insert %e1, %v0 [0] : f32 into vector<128xf32>
%e2 = vector.extract %v1[0] : f32 from vector<128xf32>
```
But this is not the optimal result: `createOrFold` should have returned
`%e1` directly. The reason is that `fold` returns immediately after
`extractInsertFoldConstantOp` succeeds, causing the subsequent folding logic
to be skipped.
---------
Co-authored-by: Yang Bai <yangb@nvidia.com>
…WriteToBroadcast
The pattern would previously apply in spurious cases and generate
incorrect IR.
In the process, we disable the application of this pattern in the case
where there is no broadcast; this should be handled separately and may
more easily support masking.
The case {no-broadcast, yes-transpose} was previously caught by this
pattern and arguably could also generate incorrect IR (and was also
untested): this case does not apply anymore.
The last case {yes-broadcast, yes-transpose} continues to apply, but should
arguably be removed in the future because creating transposes as part of
canonicalization feels dangerous.
There are other patterns that move permutation logic:
- either into the transfer, or
- outside of the transfer
Ideally, this would be target-dependent and not a canonicalization (i.e.
does your DMA HW allow transpose on the fly or not) but this is beyond
the scope of this PR.
Co-authored-by: Nicolas Vasilache <nicolasvasilache@users.noreply.github.com>
This fixes an issue with `TransferWriteOp`'s implementation of the
`MemoryEffectOpInterface` where the write effect was attached to the
stored value rather than the base.
This had the effect that when asking for the memory effects for the
input memref buffer using `getEffectsOnValue(...)`, the function would
return no-effects (as the effect would have been attached to the stored
value rather than the input buffer).
This PR adds `UnrollBroadcastPattern` to `VectorUnroll` transform.
To support this, it also extends the `BroadcastOp` definition with
`VectorUnrollOpInterface`.
Example:
```mlir
%0 = vector.extract %source[0, 0] : i8 from vector<1x2xi8>
%1 = vector.extract %source[0, 1] : i8 from vector<1x2xi8>
%2 = vector.from_elements %0, %1 : vector<2xi8>
```
becomes
```mlir
%2 = vector.shape_cast %source : vector<1x2xi8> to vector<2xi8>
```
It was decided that we should split canonicalization tests into new
files (see
[discussion](https://github.com/llvm/llvm-project/pull/135096#pullrequestreview-2760245596)).
In view of this, I added the new tests to a new file specifically for
canonicalization of from_elements. To keep the location of the tests
consistent, I moved the existing tests `extract_scalar_from_from_element`,
`extract_1d_from_from_elements`, `extract_2d_from_from_elements` and
`from_elements_to_splat` from `canonicalize.mlir` to
`canonicalize/vector-from-elements.mlir`. In addition to moving them, I
changed the LIT variables to be all upper-case for consistency.
This PR moves `extractInsertFoldConstantOp` earlier in the folder lists
of `vector.extract` and `vector.insert`. Many folders require
non-dynamic indices, so `extractInsertFoldConstantOp` is a prerequisite
for them to trigger.
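For example (a sketch with hypothetical values), the constant-index folder first rewrites the dynamic index below to a static one, which is what most of the remaining folders need in order to apply:
```mlir
%c0 = arith.constant 0 : index
%0 = vector.extract %v[%c0] : f32 from vector<8xf32>
// -> folds to: %0 = vector.extract %v[0] : f32 from vector<8xf32>
```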
This PR improves the `vector.mask` verifier to make sure it's not
applying masking semantics to operations defined outside of the
`vector.mask` region. The documentation is updated to state this more
clearly, even though it already said as much.
As part of this change, the logic that ensures a terminator is present in
the mask region has been simplified, to make it less surprising to the user
when a `vector.yield` is explicitly provided in the IR.
This patch contains two minor changes, which I believe were the original
author's intent.
* when folding `transpose(broadcast(x))` emit `broadcast(x)` instead of
`broadcast(broadcast(x))`. The latter causes transient verifier
failures with `mlir-opt --debug`, e.g.
```
mlir-asm-printer: 'func.func' failed to verify and will be printed in generic form
"func.func"() <{function_type = (vector<4x1x1x7xi8>) -> vector<3x2x4x5x6x7xi8>, sym_name = "broadcast_transpose_mixed_example"}> ({
^bb0(%arg0: vector<4x1x1x7xi8>):
%0 = "vector.broadcast"(%arg0) : (vector<4x1x1x7xi8>) -> vector<2x3x4x5x6x7xi8>
%1 = "vector.broadcast"(%0) : (vector<2x3x4x5x6x7xi8>) -> vector<3x2x4x5x6x7xi8>
"func.return"(%1) : (vector<3x2x4x5x6x7xi8>) -> ()
}) : () -> ()
```
* when checking permutation groups, the variable `low` was set just once
to zero, so the checking was quadratic. It looks like the intent was for
`low` to track the beginning of each dimension group. (Nevertheless the
check was correct).
Fold transpose with unit-dimensions. Seen in the wild:
```
%0 = vector.transpose %arg, [0, 2, 1, 3] : vector<6x1x1x4xi8> to vector<6x1x1x4xi8>
```
This transpose can be folded because (1) it preserves the shape and (2)
the shuffled dims are unit extent.
Also addresses a comment about static vs anonymous namespace:
https://github.com/llvm/llvm-project/pull/135841#discussion_r2071869067
---------
Signed-off-by: James Newling <james.newling@gmail.com>
[mlir][vector] Standardize base Naming Across Vector Ops (NFC)
This change standardizes the naming convention for the argument
representing the value to read from or write to in Vector ops that
interface with Tensors or MemRefs. Specifically, it ensures that all
such ops use the name `base` (i.e., the base address or location to
which offsets are applied).
Updated operations:
* `vector.transfer_read`,
* `vector.transfer_write`.
For reference, these ops already use `base`:
* `vector.load`, `vector.store`, `vector.scatter`, `vector.gather`,
`vector.expandload`, `vector.compressstore`, `vector.maskedstore`,
`vector.maskedload`.
This is a non-functional change (NFC) and does not alter the semantics of these
operations. However, it does require users of the XFer ops to switch from
`op.getSource()` to `op.getBase()`.
To ease the transition, this PR temporarily adds a `getSource()` interface
method for compatibility. This is intended for downstream use only and should
not be relied on upstream. The method will be removed prior to the LLVM 21
release.
Implements #131602
Handles the special case where a transpose doesn't permute any non-1
dimensions (and so is effectively a shape_cast) and is adjacent to a
shape_cast that it can fold into. For example
```
%1 = vector.transpose %0, [1, 0, 3, 2] : vector<4x1x1x6xf32> to vector<1x4x6x1xf32>
```
can be folded into an adjacent shape_cast. An alternative to this PR
would be to canonicalize such transposes to shape_casts directly, but I
think it'll be difficult getting consensus that shape_cast is 'more
canonical' than transpose, so this PR compromises with the less
opinionated claim that
1) shape_cast is more canonical than shape_cast(transpose)
2) shape_cast is more canonical than transpose(shape_cast)
The pattern `ConvertIllegalShapeCastOpsToTransposes`, which is specific to
transposes with scalable dimensions, reverses the canonicalization added
here, so I've disabled this canonicalization for scalable vectors.
`vector.shape_cast` was initially designed to be the union of
collapse_shape and expand_shape. There was an inconsistency in the
verifier that allowed any shape casts when the rank did not change, which
led to a strange middle ground where you could cast from shape (4,3) to
(3,4) but not from (4,3) to (2,3,2). That issue was fixed (verifier made stricter)
in https://github.com/llvm/llvm-project/pull/135855, but further feedback
there (and polling) suggests that vector.shape_cast should rather allow all
shape casts (so more like tensor.reshape than
tensor.collapse_shape/tensor.expand_shape). This PR makes this simplification
by relaxing the verifier.
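For example, a reshape like the following is now accepted by the verifier (a sketch; any cast that preserves the number of elements is allowed):
```mlir
// (4,3) -> (2,3,2): same number of elements, arbitrary reshape.
%0 = vector.shape_cast %v : vector<4x3xf32> to vector<2x3x2xf32>
```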
This is a small follow-on for #133721:
* Renamed `getRealVectorRank` as `getEffectiveVectorRankForXferOp` (to
emphasise that this method was written specifically for transfer Ops).
* Marginally tweaked the description for
`getEffectiveVectorRankForXferOp` (mostly to highlight the two edge
cases being covered).
* Added tests for cases when the element type (of the shaped type) is a
vector.
* Unified the naming (and the order) of arguments in tests with the
surrounding tests (e.g. `%vec_to_write` -> `%arg1`). Mostly for
consistency (it would be good to use self-documenting names like
`%vec_to_write` throughout).
This is a minor follow-up to #135498. It ensures that operations like
the following are not treated as out-of-bounds accesses and can be
folded correctly (*):
```mlir
%c_neg_1 = arith.constant -1 : index
%0 = vector.insert %value_to_store, %dest[%c_neg_1] : vector<5xf32> into vector<4x5xf32>
%1 = vector.extract %src[%c_neg_1, 0] : f32 from vector<4x5xf32>
```
In addition to adding tests for the case above, this PR also relocates
the tests from #135498 to be alongside existing tests for the
`vector.{insert|extract}` folder, and reformats them to follow:
* https://mlir.llvm.org/getting_started/TestingGuide/
For example:
* The "no_fold" prefix is now used to label negative tests.
* Redundant check lines have been removed (e.g., CHECK: vector.insert
is sufficient to verify that folding did not occur).
(*) As per https://mlir.llvm.org/docs/Dialects/Vector/#vectorinsert-vectorinsertop,
these are poison values.
Out-of-bounds position values should not be folded in vector.extract and
vector.insert operations, as only in-bounds constants and -1 are valid.
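For example (a sketch with hypothetical names), the dynamic index below must not be folded into a static position:
```mlir
%c5 = arith.constant 5 : index
// 5 is out of bounds for vector<4xf32>; only in-bounds positions (or -1,
// which yields poison) are valid as static positions, so no folding occurs.
%0 = vector.extract %v[%c5] : f32 from vector<4xf32>
```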
Fixes #134516
In addition to the new folder, I've also added a test for broadcast(splat) ->
splat, which I think was missing.
Signed-off-by: James Newling <james.newling@gmail.com>