## Description
This change introduces a new canonicalization pattern for the MLIR
Vector dialect that optimizes chains of insertions. The optimization
identifies when a vector is **completely** initialized through a series
of vector.insert operations and replaces the entire chain with a
single `vector.from_elements` operation.
Note that the new pattern does **not** apply when only **some**
elements are set (leaving the rest poison), as MLIR doesn't support
partially-poisoned vectors for now.
**New Pattern: InsertChainFullyInitialized**
* Detects chains of vector.insert operations.
* Validates that all insertions are at static positions, and all
intermediate insertions have only one use.
* Ensures the entire vector is **completely** initialized.
* Replaces the entire chain with a
single `vector.from_elements` operation.
**Refactored Helper Function**
* Extracted `calculateInsertPosition` from
`foldDenseElementsAttrDestInsertOp` to avoid code duplication.
## Example
```mlir
// Before:
%v1 = vector.insert %c10, %v0[0] : i64 into vector<2xi64>
%v2 = vector.insert %c20, %v1[1] : i64 into vector<2xi64>
// After:
%v2 = vector.from_elements %c10, %c20 : vector<2xi64>
```
It also works for multidimensional vectors.
```mlir
// Before:
%v1 = vector.insert %cv0, %v0[0] : vector<3xi64> into vector<2x3xi64>
%v2 = vector.insert %cv1, %v1[1] : vector<3xi64> into vector<2x3xi64>
// After:
%0:3 = vector.to_elements %cv0 : vector<3xi64>
%1:3 = vector.to_elements %cv1 : vector<3xi64>
%v2 = vector.from_elements %0#0, %0#1, %0#2, %1#0, %1#1, %1#2 : vector<2x3xi64>
```
---------
Co-authored-by: Yang Bai <yangb@nvidia.com>
Co-authored-by: Andrzej Warzyński <andrzej.warzynski@gmail.com>
In the FoldArithToVectorOuterProduct pattern, a static cast to vector
type caused an assertion when a scalar type was encountered. It seems
the author meant to use a dyn_cast instead.
This NFC patch handles it by using dyn_cast.
Fold `broadcast(shape_cast(x))` into `broadcast(x)` if the type of x is
compatible with broadcast's result type and the shape_cast only adds or
removes leading unit dimensions.
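For example (an illustrative sketch with hypothetical shapes and value names):
```mlir
// Before: the shape_cast only prepends a leading unit dimension, and %x
// is broadcastable to the final result type.
%0 = vector.shape_cast %x : vector<4xf32> to vector<1x4xf32>
%1 = vector.broadcast %0 : vector<1x4xf32> to vector<2x3x4xf32>

// After:
%1 = vector.broadcast %x : vector<4xf32> to vector<2x3x4xf32>
```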
---------
Co-authored-by: Andrzej Warzyński <andrzej.warzynski@gmail.com>
Co-authored-by: James Newling <james.newling@gmail.com>
This patch updates `vectorizeAsTensorUnpackOp` to support scalable
vectorization by requiring user-specified vector sizes for the _read_ operation
(rather than the _write_ operation) in `linalg.unpack`.
Conceptually, `linalg.unpack` consists of these high-level steps:
* **Read** from the source tensor using `vector.transfer_read`.
* **Transpose** the read value according to the permutation in the
`linalg.unpack` op (via `vector.transpose`).
* **Re-associate** dimensions of the transposed value, as specified by
the op (via `vector.shape_cast`).
* **Write** the result into the destination tensor via
`vector.transfer_write`.
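For a static case, the result might look roughly as follows (a hand-written sketch with hypothetical shapes and value names, not actual vectorizer output):
```mlir
// Unpacking a tiled tensor<2x3x8xf32> (inner_tiles = [8]) into
// tensor<16x3xf32>; the read-vector-sizes here would be [2, 3, 8].
%read = vector.transfer_read %src[%c0, %c0, %c0], %pad
    {in_bounds = [true, true, true]} : tensor<2x3x8xf32>, vector<2x3x8xf32>
%tr = vector.transpose %read, [0, 2, 1]
    : vector<2x3x8xf32> to vector<2x8x3xf32>
%sc = vector.shape_cast %tr : vector<2x8x3xf32> to vector<16x3xf32>
%write = vector.transfer_write %sc, %dst[%c0, %c0]
    {in_bounds = [true, true]} : vector<16x3xf32>, tensor<16x3xf32>
```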
Previously, the vector sizes provided by the user were interpreted as
write-vector sizes. These were used to:
* Infer read-vector sizes using the `inner_tiles` attribute of the unpack op.
* Deduce vector sizes for the transpose and shape cast operations.
* Ultimately determine the vector shape for the write.
However, this logic breaks when one or more tile sizes are dynamic. In such
cases, `vectorizeUnPackOpPrecondition` fails, and vectorization is rejected.
This patch switches the contract: users now directly specify the
"read-vector-sizes", which inherently encode all inner tile sizes - including
dynamic ones. It becomes the user's responsibility to provide valid sizes.
In practice, since `linalg.unpack` is typically constructed, tiled, and
vectorized by the same transformation pipeline, the necessary
"read-vector-sizes" should be recoverable.
This PR ensures parity in folding/canonicalizing of vector.broadcast
(from a scalar) and vector.splat. This means that by using
vector.broadcast instead of vector.splat (which is currently
deprecated), there is no loss in optimizations performed. All tests
which were previously checking folding/canonicalizing of vector.splat
are now done for vector.broadcast. The vector.splat canonicalization
tests are now in a separate file, ready for removal when, in the future,
we remove vector.splat completely.
This PR also adds a canonicalizer to vector.splat to always convert it
to vector.broadcast. This is to reduce the 'traffic' through
vector.splat.
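For example, the new canonicalization rewrites:
```mlir
// Before:
%0 = vector.splat %s : vector<4xf32>
// After:
%0 = vector.broadcast %s : f32 to vector<4xf32>
```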
There is a chance that this PR will break downstream users who create
or expect vector.splat. Changing all such logic to work with
vector.broadcast instead should fix that.
The folder `shape_cast(splat constant) -> splat constant` was first
introduced
[here](36480657d8 (diff-484cea976e0c96459027c951733bf2d22d34c5a0c0de6f577069870ef4588983R2600))
(Nov 2020). In that commit there is a comment to _Only handle splat for
now_. Based on that I assume the intention was to, at a later time,
support a general `shape_cast(constant) -> constant` folder. That is
what this PR does.
One minor downside: with this folder it is possible to end up with 2
large constants instead of 1 large constant and 1 shape_cast:
```mlir
func.func @foo() -> (vector<4xi32>, vector<2x2xi32>) {
  %cst = arith.constant dense<[1, 2, 3, 4]> : vector<4xi32> // 'large' constant 1
  %0 = vector.shape_cast %cst : vector<4xi32> to vector<2x2xi32>
  return %cst, %0 : vector<4xi32>, vector<2x2xi32>
}
```
gets folded with this new folder to
```mlir
func.func @foo() -> (vector<4xi32>, vector<2x2xi32>) {
  %cst = arith.constant dense<[1, 2, 3, 4]> : vector<4xi32> // 'large' constant 1
  %cst_0 = arith.constant dense<[[1, 2], [3, 4]]> : vector<2x2xi32> // 'large' constant 2
  return %cst, %cst_0 : vector<4xi32>, vector<2x2xi32>
}
```
Notes on the above case:
1) This only affects the textual IR; the actual values share the same
context storage (I've verified this by checking pointer values in the
`DenseIntOrFPElementsAttrStorage`
[constructor](da5c442550/mlir/lib/IR/AttributeDetail.h (L59))),
so there is no compile-time memory overhead from this folding. At the
LLVM IR level the constant is shared, too.
2) This only happens when the pre-folded constant cannot be dead-code
eliminated (i.e. when it has 2+ uses), which I don't think is common.
Implements the `inferResultRanges` method from the
`InferIntRangeInterface` interface for `vector.step`. The implementation
is similar to that of `arith.constant`, since the exact result values are
statically known.
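For instance (illustrative only):
```mlir
// The result is known to be exactly [0, 1, 2, 3], so the inferred
// range of its elements is [0, 3].
%0 = vector.step : vector<4xindex>
```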
Signed-off-by: Max Dawkins <max.dawkins@gmail.com>
Implements the `inferResultRanges` method from the
`InferIntRangeInterface` interface for `vector.transpose`. The result
ranges simply match the source ranges.
Signed-off-by: Max Dawkins <max.dawkins@gmail.com>
This patch extends the pattern that rewrites elementwise operations
whose inputs are all broadcast from the same shape so that it handles
mixed types, such as when the result and input types don't match, or
when the inputs have multiple types.
PR #150867 failed to check for the possibility of type mismatches when
rewriting splat constants. In order to fix that issue, we add support
for mixed-type operations more generally.
There is a pattern that rewrites
elementwise_op(broadcast(x1 : T to U), broadcast(x2 : T to U), ...) to
broadcast(elementwise_op(x1, x2, ...) : T to U).
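A minimal sketch of that rewrite (hypothetical shapes and value names):
```mlir
// Before: both operands are broadcast from vector<4xf32>.
%b0 = vector.broadcast %x : vector<4xf32> to vector<2x4xf32>
%b1 = vector.broadcast %y : vector<4xf32> to vector<2x4xf32>
%r = arith.addf %b0, %b1 : vector<2x4xf32>

// After: the elementwise op runs on the narrower type.
%s = arith.addf %x, %y : vector<4xf32>
%r = vector.broadcast %s : vector<4xf32> to vector<2x4xf32>
```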
This pattern did not, however, account for the case where a broadcast
constant is represented as a SplatElementsAttr, which can safely be
reshaped or scalarized but is not a `vector.broadcast` or `vector.splat`
operation.
This patch fixes this oversight, preventing premature broadcasting.
This did result in the need to update some linalg dialect tests, which
now feature a less-broadcast computation and/or more constant folding.
The crash is caused because, during IR transformation, the
vector-unrolling pass (using ExtractStridedSliceOp) attempts to slice an
input vector of higher rank using a target vector of lower rank, which
is not supported. Fixes #148368.
This PR uses `val.getDefiningOp<OpTy>()` to replace `dyn_cast<OpTy>(val.getDefiningOp())`, `dyn_cast_or_null<OpTy>(val.getDefiningOp())`, and `dyn_cast_if_present<OpTy>(val.getDefiningOp())`.
The Canonicalizer pass had a dependency on the UB dialect which it
shouldn't have. It also no longer needs to depend on the UB dialect
directly, since the Vector dialect (which uses the UB dialect for the
poison index operations introduced by 35df525) already declares this
dependency (878d3594).
These are identified by misc-include-cleaner. I've filtered out those
that break builds. Also, I'm staying away from llvm-config.h,
config.h, and Compiler.h, which likely cause platform- or
compiler-specific build failures.
Extends the linalg vectorizer with a path to lower contraction ops
directly into `vector.contract`.
The direct rewriting preserves high-level op semantics and provides a
more progressive lowering compared to reconstructing the contraction
back from a multi-dimensional reduction.
The added lowering focuses on named linalg ops and leverages their well
defined semantics to avoid complex precondition verification.
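For illustration, a hand-written sketch of the direct rewrite for a matmul (transfer reads/writes omitted; shapes and value names hypothetical):
```mlir
// linalg.matmul ins(%A, %B) outs(%C) vectorizes directly into:
%0 = vector.contract {
    indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>,
                     affine_map<(d0, d1, d2) -> (d2, d1)>,
                     affine_map<(d0, d1, d2) -> (d0, d1)>],
    iterator_types = ["parallel", "parallel", "reduction"],
    kind = #vector.kind<add>
  } %vA, %vB, %vC : vector<4x8xf32>, vector<8x16xf32> into vector<4x16xf32>
```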
The new path is optional and disabled by default to avoid changing the
default vectorizer behavior.
In a later PR more shape_cast ops will appear. Specifically, broadcasts
that just prepend ones become shape_cast ops (i.e. volume-preserving
broadcasts are canonicalized to shape_casts). This PR ensures that
broadcast-like shape_cast ops fold at least as well as broadcast ops.
This is done by modifying patterns that target broadcast ops, to target
'broadcast-like' ops. No new patterns are added, the patterns that exist
are just made to match on shape_casts where appropriate.
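For example (an illustrative sketch), a fold like `extract(broadcast(x)) -> x` now also applies to the broadcast-like shape_cast form:
```mlir
// Broadcast-like: the shape_cast only prepends unit dimensions.
%0 = vector.shape_cast %v : vector<4xf32> to vector<1x1x4xf32>
// Folds to %v, just as if %0 were produced by vector.broadcast.
%1 = vector.extract %0[0, 0] : vector<4xf32> from vector<1x1x4xf32>
```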
This PR also includes minor code simplifications: use
`isBroadcastableTo` to simplify `ExtractOpFromBroadcast` and simplify
how broadcast dims are detected in `foldExtractFromBroadcast`. These are
NFC.
---------
Co-authored-by: Andrzej Warzyński <andrzej.warzynski@gmail.com>
This patch deletes `vector.matrix_multiply` and `vector.flat_transpose`,
which are thin wrappers around the corresponding LLVM intrinsics:
- `llvm.intr.matrix.multiply`
- `llvm.intr.matrix.transpose`
These Vector dialect ops did not provide additional semantics or
abstraction beyond the LLVM intrinsics. Their removal simplifies the
lowering pipeline without losing any functionality.
The lowering chains:
- `vector.contract` → `vector.matrix_multiply` →
`llvm.intr.matrix.multiply`
- `vector.transpose` → `vector.flat_transpose` →
`llvm.intr.matrix.transpose`
are now replaced with:
- `vector.contract` → `llvm.intr.matrix.multiply`
- `vector.transpose` → `llvm.intr.matrix.transpose`
This was accomplished by directly replacing:
- `vector::MatrixMultiplyOp` with `LLVM::MatrixMultiplyOp`
- `vector::FlatTransposeOp` with `LLVM::MatrixTransposeOp`
Note: To avoid a build-time dependency from `Vector` to `LLVM`,
relevant transformations are moved from "Vector/Transforms" to
`Conversion/VectorToLLVM`.
These are identified by misc-include-cleaner. I've filtered out those
that break builds. Also, I'm staying away from llvm-config.h,
config.h, and Compiler.h, which likely cause platform- or
compiler-specific build failures.
Reapply attempt for: https://github.com/llvm/llvm-project/pull/148291
Fix for the build failure reported in:
https://lab.llvm.org/buildbot/#/builders/116/builds/15477
-----
This crash is caused by a mismatch between the distributed type
returned by `getDistributedType` and the intended distributed type for
forOp results.
Solution diff:
20c2cf6766
Example:
```mlir
func.func @warp_scf_for_broadcasted_result(%arg0: index) -> vector<1xf32> {
  %c128 = arith.constant 128 : index
  %c1 = arith.constant 1 : index
  %c0 = arith.constant 0 : index
  %2 = gpu.warp_execute_on_lane_0(%arg0)[32] -> (vector<1xf32>) {
    %ini = "some_def"() : () -> (vector<1xf32>)
    %0 = scf.for %arg3 = %c0 to %c128 step %c1 iter_args(%arg4 = %ini) -> (vector<1xf32>) {
      %1 = "some_op"(%arg4) : (vector<1xf32>) -> (vector<1xf32>)
      scf.yield %1 : vector<1xf32>
    }
    gpu.yield %0 : vector<1xf32>
  }
  return %2 : vector<1xf32>
}
```
In this case the distributed type for the forOp result is
`vector<1xf32>` (the result is not distributed, but broadcast to all
lanes instead). However, in this case `getDistributedType` will return a
NULL type. Therefore, if the distributed type can be recovered from the
warpOp, we should always do that first before using
`getDistributedType`.
This enables memref.load/store + vector.load/store support for sub-byte
float types. Since the memref types don't matter for loads/stores, we
still use integer types of equivalent width for the memrefs, with a few
extra bitcasts needed around certain operations.
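For example, a load might be emulated roughly as follows (a sketch, assuming the built-in `f4E2M1FN` sub-byte float type):
```mlir
// The memref keeps an integer element type of equivalent width; a
// bitcast recovers the float value after the load.
%i = memref.load %mem[%idx] : memref<8xi4>
%f = arith.bitcast %i : i4 to f4E2M1FN
```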
There is no direct change needed for vector.load/store support. The
tests added for them are to verify that float types are
supported as well.
Propagating vector.extract when a dynamic position is present can cause
dominance issues and needs better handling. For now, disable propagation
if there is a dynamic position present.
This PR adds a new transformation that turns sequences of `vector.to_elements` and `vector.from_elements` into a binary tree of `vector.shuffle` operations.
(Related RFC:
https://discourse.llvm.org/t/rfc-adding-vector-to-elements-op-to-the-vector-dialect/86779).
Example:
```
%0:4 = vector.to_elements %a : vector<4xf32>
%1:4 = vector.to_elements %b : vector<4xf32>
%2:4 = vector.to_elements %c : vector<4xf32>
%3 = vector.from_elements %0#0, %0#1, %0#2, %0#3,
                          %1#0, %1#1, %1#2, %1#3,
                          %2#0, %2#1, %2#2, %2#3 : vector<12xf32>
==>
%0 = vector.shuffle %a, %b [0, 1, 2, 3, 4, 5, 6, 7] : vector<4xf32>, vector<4xf32>
%1 = vector.shuffle %c, %c [0, 1, 2, 3, -1, -1, -1, -1] : vector<4xf32>, vector<4xf32>
%2 = vector.shuffle %0, %1 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11] : vector<8xf32>, vector<8xf32>
```
The algorithm leverages the structured extraction/insertion information
of `vector.to_elements` and `vector.from_elements` operations and builds
a set of intervals to determine the vector length that should be used at
each level of the tree to combine the level inputs in pairs.
There are a few improvements that can be implemented in the future, such
as shuffle mask compression to avoid unnecessarily large vector lengths
with poison values, but I decided to keep things "simpler" and spend
more time documenting the different steps of the algorithm so that
people can follow along.
When the result of an insert op is used as the destination of a
subsequent insert op at the same position, replace the destination of
the subsequent insert with the destination of the previous one. This is
sound because the subsequent insert fully overwrites that position, so
the previous insert cannot affect the final result.
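For example:
```mlir
// Before: the second insert fully overwrites position [0] of %v1.
%v1 = vector.insert %a, %v0[0] : i64 into vector<2xi64>
%v2 = vector.insert %b, %v1[0] : i64 into vector<2xi64>

// After: the second insert takes its dest directly from %v0.
%v2 = vector.insert %b, %v0[0] : i64 into vector<2xi64>
```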
---------
Co-authored-by: Mehdi Amini <joker.eph@gmail.com>
Co-authored-by: Andrzej Warzyński <andrzej.warzynski@gmail.com>
The motivation is to avoid having to negate `isDynamic*` checks, avoid
double negations, and allow for `ShapedType::isStaticDim` to be used in
ADT functions without having to wrap it in a lambda performing the
negation.
Also add the new functions to C and Python bindings.
Renames `populateVectorTransferCollapseInnerMostContiguousDimsPatterns`
as `populateDropInnerMostUnitDimsXferOpPatterns` and updates the
corresponding comments.
This addresses a TODO and makes the difference between these two
`populate*` methods clearer:
* `populateDropUnitDimWithShapeCastPatterns`,
* `populateDropInnerMostUnitDimsXferOpPatterns`.
Context:
`vector.transfer_read` always requires a padding value. Most of its
builders take no `padding` value and assume the safe value of `0`.
However, this should be a conscious choice by the API user, as it makes
it easy to introduce bugs.
For example, while making this patch I found several occasions where
the padding value was not getting propagated (a `vector.transfer_read`
was transformed into another `vector.transfer_read`). These bugs were
always caused by constructors that don't require specifying padding.
Additionally, using `ub.poison` as a possible default value is better:
it indicates the user "doesn't care" about the actual padding value,
while forcing users who do care to specify the padding semantics they
want.
With that in mind, this patch changes the builders of
`vector.transfer_read` to always take a `std::optional<Value> padding`
argument. The argument itself can never be omitted, but for convenience
users can pass `std::nullopt`, which pads the transfer read with
`ub.poison`.
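For example, passing `std::nullopt` yields IR equivalent to (a sketch with hypothetical value names):
```mlir
// Padding with poison makes the "don't care" semantics explicit.
%pad = ub.poison : f32
%v = vector.transfer_read %src[%c0], %pad {in_bounds = [true]}
    : tensor<16xf32>, vector<8xf32>
```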
---------
Signed-off-by: Fabian Mora <fabian.mora-cordero@amd.com>
This patch enforces a restriction in the Vector dialect: the non-indexed
operands of `vector.insert` and `vector.extract` must no longer be 0-D
vectors. In other words, rank-0 vector types like `vector<f32>` are
disallowed as the source or result.
EXAMPLES
--------
The following are now **illegal** (note the use of `vector<f32>`):
```mlir
%0 = vector.insert %v, %dst[0, 0] : vector<f32> into vector<2x2xf32>
%1 = vector.extract %src[0, 0] : vector<f32> from vector<2x2xf32>
```
Instead, use scalars as the source and result types:
```mlir
%0 = vector.insert %v, %dst[0, 0] : f32 into vector<2x2xf32>
%1 = vector.extract %src[0, 0] : f32 from vector<2x2xf32>
```
Note, this change serves three goals. These are summarised below.
## 1. REDUCED AMBIGUITY
By enforcing scalar-only semantics when the result (`vector.extract`)
or source (`vector.insert`) are rank-0, we eliminate ambiguity
in interpretation. Prior to this patch, both `f32` and `vector<f32>`
were accepted.
## 2. MATCH IMPLEMENTATION TO DOCUMENTATION
The current behaviour contradicts the documented intent. For example,
`vector.extract` states:
> Degenerates to an element type if n-k is zero.
This patch enforces that intent in code.
## 3. ENSURE SYMMETRY BETWEEN INSERT AND EXTRACT
With the stricter semantics in place, it’s natural and consistent to
make `vector.insert` behave symmetrically to `vector.extract`, i.e.,
degenerate the source type to a scalar when n = 0.
NOTES FOR REVIEWERS
-------------------
1. Main change is in "VectorOps.cpp", where stricter type checks are
implemented.
2. Test updates in "invalid.mlir" and "ops.mlir" are minor cleanups to
remove now-illegal examples.
3. Lowering changes in "VectorToSCF.cpp" are the main trade-off: we now
require an additional `vector.extract` when a preceding
`vector.transfer_read` generates a rank-0 vector.
RELATED RFC
-----------
* https://discourse.llvm.org/t/rfc-should-we-restrict-the-usage-of-0-d-vectors-in-the-vector-dialect