This patch adds handling of an empty `MaskOp` to `MaskOpRewritePattern`
and thereby fixes a crash.
It also pulls the `MaskOp` canonicalization patterns into
`LowerVectorMask` so that empty `MaskOp`s are folded away in the Pass.
Fixes https://github.com/llvm/llvm-project/issues/71036
A number of the warp distribution patterns work by rewriting a warp op
in place by moving a contained op outside. This notifies the rewriter
that the warp op is changing in this case.
A `stride == 1` does not by itself imply that the dimension can be dropped,
because the transfer could still load more than one element along it. The
source sizes and vector sizes must also be taken into account; otherwise
invalid IR is generated. E.g.,
```mlir
func.func @foo(%arg0: memref<1x1xf32>) -> vector<4x8xf32> {
%c0 = arith.constant 0 : index
%cst = arith.constant 0.000000e+00 : f32
%0 = vector.transfer_read %arg0[%c0, %c0], %cst : memref<1x1xf32>, vector<4x8xf32>
return %0 : vector<4x8xf32>
}
```
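The check described above can be sketched outside of MLIR as follows. This is only an illustration: `droppable_inner_dims` is a hypothetical helper (not the name used in the patch), and it assumes the source shape, source strides, and vector shape are all static and are compared from the innermost dimension outwards.

```python
def droppable_inner_dims(src_shape, src_strides, vec_shape):
    """Count trailing dims that could be dropped from a transfer op.

    Hypothetical helper: a trailing dim is dropped only if the source size,
    the source stride, and the vector size along it are all 1.
    """
    count = 0
    # Walk the trailing dims of the source and the vector in lockstep.
    for size, stride, vec in zip(reversed(src_shape),
                                 reversed(src_strides),
                                 reversed(vec_shape)):
        if size != 1 or stride != 1 or vec != 1:
            break
        count += 1
    return count


# The example above: unit sizes and strides, but the vector is 4x8.
print(droppable_inner_dims([1, 1], [1, 1], [4, 8]))  # 0
```

Checking only the stride would report two droppable dims for this input, which is the invalid case.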
Fixes https://github.com/openxla/iree/issues/15493
This is the last step needed for basic support for distributing masked
vector code. The lane id gets delinearized based on the distributed mask
shape and then compared against the original mask sizes to compute the
bounds for the distributed mask. Note that the distribution of masks is
implicit on the shape specified by the warp op. As a result, it is the
responsibility of the consumer of the mask to ensure the distributed
mask will match its own distribution semantics.
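A rough model of the bound computation, written in Python rather than as IR. The names are hypothetical, and the sketch assumes that the distributed shape evenly divides the original shape and that each lane owns a contiguous block of `dist_shape[i]` elements along dimension `i`.

```python
def delinearize(linear_id, shape):
    """Split a linear id into row-major per-dimension indices."""
    indices = []
    for size in reversed(shape):
        indices.append(linear_id % size)
        linear_id //= size
    return list(reversed(indices))


def distributed_mask_sizes(lane_id, mask_sizes, orig_shape, dist_shape):
    """Return the mask sizes seen by one lane (assumed block distribution)."""
    # Number of lanes along each dim (assumes dist_shape divides orig_shape).
    lanes_per_dim = [o // d for o, d in zip(orig_shape, dist_shape)]
    lane_idx = delinearize(lane_id, lanes_per_dim)
    # Shift the original bound by the lane's start offset, then clamp it
    # to the range [0, dist_shape[i]].
    return [max(0, min(d, m - idx * d))
            for m, idx, d in zip(mask_sizes, lane_idx, dist_shape)]


# A mask of size 5 over vector<64xi1>, distributed as vector<2xi1> per lane.
print(distributed_mask_sizes(2, [5], [64], [2]))  # [1]
```

Lanes 0 and 1 get a full mask of size 2, lane 2 gets size 1, and every later lane gets size 0.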
Currently when there is a mix of transfer read ops and transfer write
ops that need to be distributed, because the pattern for write
distribution is rooted on the transfer write, it is hard to guarantee
that the write gets distributed after the read when the two aren't
directly connected by SSA. This is likely still relatively unsafe when
there are undistributable ops, but structurally these patterns are a bit
difficult to work with. For now pattern benefits give fairly good
guarantees for happy paths.
This fixes two bugs:
1) When deciding whether a transfer read could be propagated out of
a warp op, it looked for the first yield operand that was produced by
a transfer read. If this transfer read wasn't ready to be
distributed, the pattern would not re-check for any other transfer
reads that could have been propagated.
2) When dropping dead warp results, we do so by updating the warp op
signature and splicing in the old region. This does not add the ops
in the body of the warp op back to the pattern applicator's worklist,
and thus those operations won't be DCE'd. This is a problem for
patterns like the one for transfer reads that will still see the dead
operation as a user.
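The first fix amounts to searching for the first yield operand that is both a transfer read and ready to distribute, rather than stopping at the first transfer read. A minimal sketch, with hypothetical predicate callbacks standing in for the real checks:

```python
def find_distributable_read(yield_operands, is_transfer_read, is_ready):
    """Return the index of the first yielded transfer read that is ready."""
    for index, operand in enumerate(yield_operands):
        # Keep scanning past reads that are not ready yet instead of
        # giving up on the first transfer read found.
        if is_transfer_read(operand) and is_ready(operand):
            return index
    return None


operands = ["read_a", "add", "read_b"]
index = find_distributable_read(
    operands,
    is_transfer_read=lambda op: op.startswith("read"),
    is_ready=lambda op: op == "read_b",
)
print(index)  # 2
```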
Because the distribution is based on types, supporting general masked
reads requires first materializing the permutation map in IR to align
the elements of the mask with the elements read by the transfer op. For
now just support cases with the trivial permutation map.
General distribution of masked writes requires materializing the permutation on the vector of the write in IR to ensure the vector lines up with the mask. For now just support cases with trivial permutation maps.
This handles `vector.transfer_read`, `vector.transfer_write`, and
`vector.constant_mask`. The unit dims are only relevant for masks
created by `create_mask` and `constant_mask` if the mask size for the
unit dim is non-one, in which case all subsequent sizes must also be
zero. From the perspective of the vector transfers, however, these unit
dims can just be dropped directly.
After propagation of `vector.warp_execute_on_lane_0` through `scf.for`,
uniform operations like those on the loop iterators can now be hoisted
out of the inner warp op.
- Implement `SubsetOpInterface`, `SubsetExtractionOpInterface`,
`SubsetInsertionOpInterface` for `vector.transfer_read` and
`vector.transfer_write`.
- Move all tensor subset hoisting test cases from `Linalg` to
`loop-invariant-subset-hoisting.mlir`. (Removing 1 duplicate test case.)
Update the remaining tests for matrix multiplication (_matmul_) in:
* vector-contract-to-outerproduct-transforms.mlir
with cases for scalable vectors.
Note that in order for the "vector.contract -> vector.outerproduct"
patterns to work, only the non-reduction dimension can be scalable (*).
For Matmul operations that is set to be the N dimension (i.e. rows of
the output matrix), which matches how matrix multiplication is normally
implemented for e.g. Arm's SVE. However, making the M dimension scalable
(i.e. columns of the output matrix) should work as well.
Making both parallel dimensions scalable is left as a TODO for when
support for 2-D scalable vectors is more established (this is
work-in-progress as part of the effort to support Arm's SME in MLIR).
The change in:
* `UnrolledOuterProductGenerator`
is a "bug fix" to make sure that the conversion pattern correctly
propagates scalability when creating `arith.extf` operations.
(*) The conversion tested in this file unrolls along the reduction
dimension, which is not supported for scalable vectors.
Recent changes (https://github.com/llvm/llvm-project/pull/66930)
disabled vector transfer ops hoisting with view-like intermediate ops.
The recommended way is to fold subview ops into transfer op indices
before invoking hoisting. As a result, transfer op indices now involve
dynamic values, where previously (with subview ops) they were static
constants, so hoisting no longer kicks in. This breaks
downstream users.
To fix it, this commit enables hoisting transfer ops with dynamic
indices by using `ValueBoundsConstraintSet` to prove ranges are disjoint
in `isDisjointTransferIndices`. Given that this utility is used in many
places, including op folders, we introduce a flag for it and only set it
to true for "heavy" transforms in hoisting and load-store
forwarding.
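The idea behind proving disjointness with dynamic indices can be illustrated with a simplified model. This is not the `ValueBoundsConstraintSet` API: here each index is a hypothetical `(symbol, offset)` pair, the difference of two indices is treated as a known constant only when both share the same symbol, and the vector rank is assumed to equal the memref rank.

```python
def is_disjoint(indices_a, indices_b, vector_shape):
    """Return True if two transfers provably touch disjoint ranges.

    Each index is a (symbol, offset) pair; symbol None means a static index.
    """
    for (sym_a, off_a), (sym_b, off_b), size in zip(indices_a, indices_b,
                                                    vector_shape):
        # The difference is only a known constant when both indices share
        # the same dynamic base (or are both static).
        if sym_a == sym_b and abs(off_a - off_b) >= size:
            return True
    return False


# %i and %i + 4 with vector<4xf32>: the ranges [i, i+4) and [i+4, i+8)
# cannot overlap, whatever the value of %i is.
print(is_disjoint([("i", 0)], [("i", 4)], [4]))  # True
```

With unrelated symbols (say `%i` and `%j`) nothing can be proven in this model, so the answer stays conservative.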
This patch constrains the patterns for converting `vector.contract` to
`vector.outerproduct` so that
* the reduction dimension is _not unrolled_ if the corresponding
dimension is scalable.
This is necessary as the current lowering is incorrect for scalable
dims. Indeed, the following unrolling for `vector.contract` would be
invalid if the corresponding dimension was scalable (K is the size of
the reduction dimension):
```
// K times. This is valid if K _is not_ scalable.
%lhs = vector.extract %LHS[0]
%rhs = vector.extract %RHS[0]
vector.outerproduct %lhs, %rhs
%lhs = vector.extract %LHS[1]
%rhs = vector.extract %RHS[1]
vector.outerproduct %lhs, %rhs
// ...
```
Instead, a `for` loop should be generated:
```
// This would be valid regardless of whether K is scalable or not
scf.for %k = 0 to K step 1
%lhs = vector.extract LHS[%k]
%rhs = vector.extract RHS[%k]
vector.outerproduct %lhs, %rhs
```
However, the lowering of:
* `vector.extract` of vector slices with dynamic indices
is incomplete and hence the implementation proposed above (with
`scf.for`) wouldn't work just yet, i.e. it wouldn't be possible to lower
it further. Instead, this patch disables unrolling in cases when the
reduction dimension is scalable, i.e. where the generated code would be
functionally incorrect.
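For reference, the loop form can be modelled in plain Python, where a matrix product is accumulated as a sum of K outer products. The function names are illustrative only, and the matrices are nested lists.

```python
def outer(lhs, rhs):
    """Outer product of two 1-D lists."""
    return [[a * b for b in rhs] for a in lhs]


def matmul_via_outer_products(lhs, rhs):
    """Compute lhs (M x K) times rhs (K x N) as a sum of K outer products."""
    m, k_size, n = len(lhs), len(rhs), len(rhs[0])
    acc = [[0] * n for _ in range(m)]
    # Loop over the reduction dim; K only needs to be known at run time,
    # which is what a scalable reduction dim would require.
    for k in range(k_size):
        lhs_col = [lhs[i][k] for i in range(m)]
        update = outer(lhs_col, rhs[k])
        acc = [[acc[i][j] + update[i][j] for j in range(n)]
               for i in range(m)]
    return acc


print(matmul_via_outer_products([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# [[19, 22], [43, 50]]
```

Full unrolling corresponds to writing out the loop body once per value of `k`, which is only possible when K is a compile-time constant.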
In order to document unsupported cases, a dedicated test file is added:
* "vector-contract-to-outerproduct-transforms-unsupported.mlir"
This is the first patch in a series of patches that strives to update
these patterns (and to test them) for scalable vectors.
Resolves #68400
Also fix an issue with sinking a broadcast across an elementwise op:
`arith.cmpf` is elementwise, but its result type is not the same as its
operand type, so sinking created illegal IR.
A similar issue exists with `vector.fma`, which only accepts vector operand
types, while broadcasts can have scalar sources. Sinking the broadcast across
it would result in an illegal `vector.fma` (with scalar operands).
The following patterns
- TransferReadToVectorLoadLowering
- TransferWriteToVectorStoreLowering
attempt to generate invalid vector.maskedload and vector.maskedstore ops
for non rank-1 vector types. These ops operate on 1-D vectors. This
patch adds a check to prevent this.
The vector.extract assembly format currently only contains the source
type, for example:
%1 = vector.extract %0[1] : vector<3x7x8xf32>
It's not immediately obvious whether this is the source or result type. This
patch improves the assembly format to make this clearer, so the above
becomes:
%1 = vector.extract %0[1] : vector<7x8xf32> from vector<3x7x8xf32>
This extends `vector.constant_mask` so that mask dim sizes that
correspond to a scalable dimension are treated as if they're implicitly
multiplied by vscale. Currently this is limited to mask dim sizes of 0
or the size of the dim/vscale. This allows constant masks to represent
all true and all false scalable masks (and some variations):
```
// All true scalable mask
%mask = vector.constant_mask [8] : vector<[8]xi1>
// All false scalable mask
%mask = vector.constant_mask [0] : vector<[8]xi1>
// First two scalable rows
%mask = vector.constant_mask [2,4] : vector<4x[4]xi1>
```
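To make the implicit scaling concrete, the following sketch expands a constant mask for a given runtime `vscale`. It is a model of the semantics described above, not code from the patch, and `expand_constant_mask` is a made-up name.

```python
from itertools import product


def expand_constant_mask(mask_sizes, shape, scalable, vscale):
    """Materialize the mask for a concrete vscale as a dict: index -> bool."""
    def scale(value, is_scalable):
        # Sizes on scalable dims are implicitly multiplied by vscale.
        return value * vscale if is_scalable else value

    run_shape = [scale(s, f) for s, f in zip(shape, scalable)]
    run_sizes = [scale(m, f) for m, f in zip(mask_sizes, scalable)]
    return {idx: all(i < bound for i, bound in zip(idx, run_sizes))
            for idx in product(*(range(s) for s in run_shape))}


# vector.constant_mask [2,4] : vector<4x[4]xi1> with vscale = 2
mask = expand_constant_mask([2, 4], [4, 4], [False, True], vscale=2)
print(sum(mask.values()), len(mask))  # 16 32
```

The first two rows are fully set (2 rows of 8 elements each) and the remaining two rows are clear.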
Extend `ReorderElementwiseOpsOnBroadcast` so that the broadcasting op
can be either `vector.broadcast` (already supported) or
`vector.splat` (support added in this patch).
…cast) expansion
This revision adds a rewrite for sequences of vector `ext(bitcast)` to
use a more efficient sequence of vector operations comprising `shuffle`
and `bitwise` ops.
Such patterns appear naturally when writing quantization /
dequantization functionality with the vector dialect.
The rewrite performs a simple enumeration of each of the bits in the
result vector and determines its provenance in the source vector. The
enumeration is used to generate the proper sequence of `shuffle`,
`andi`, `ori` with shifts.
The rewrite currently only applies to 1-D non-scalable vectors and bails
out if the bitwidth of the final vector element type is not a multiple of
8. This is a failsafe heuristic determined empirically: if the resulting
type is not a whole number of bytes, further complexities arise that are not
improved by this pattern: the heavy lifting still needs to be done by
LLVM.
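The bit enumeration can be pictured with a small table builder. This is a sketch under the assumption of a little-endian bit layout (bit 0 of element 0 comes first), and `bit_provenance` is a hypothetical name, not a function from the patch.

```python
def bit_provenance(src_bits, dst_bits, num_dst_elems):
    """Map each (result element, bit) to its (source element, bit).

    Models the bitcast only: result element widths are `dst_bits` and
    source element widths are `src_bits`.
    """
    table = []
    for elem in range(num_dst_elems):
        for bit in range(dst_bits):
            # Position of this bit in the flattened bit string.
            flat = elem * dst_bits + bit
            table.append(((elem, bit), (flat // src_bits, flat % src_bits)))
    return table


# bitcast vector<1xi8> to vector<2xi4>: result element 1 comes from the
# high nibble of source element 0.
print(bit_provenance(8, 4, 2)[4])  # ((1, 0), (0, 4))
```

Runs of consecutive bits that come from the same source element are what can then be turned into a shuffle, a mask, and a shift.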
…(trunci) expansion
This revision adds a rewrite for sequences of vector `bitcast(trunci)`
to use a more efficient sequence of vector operations comprising
`shuffle` and `bitwise` ops.
Such patterns appear naturally when writing quantization /
dequantization functionality with the vector dialect.
The rewrite performs a simple enumeration of each of the bits in the
result vector and determines its provenance in the pre-trunci vector.
The enumeration is used to generate the proper sequence of `shuffle`,
`andi`, `ori` followed by an optional final `trunci`/`extui`.
The rewrite currently only applies to 1-D non-scalable vectors and bails
out if the bitwidth of the final vector element type is not a multiple of
8. This is a failsafe heuristic determined empirically: if the resulting
type is not a whole number of bytes, further complexities arise that are not
improved by this pattern: the heavy lifting still needs to be done by
LLVM.
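As an illustration of the kind of sequence involved, here is an integer-level emulation of `bitcast(trunci)` for one specific case: truncating N x i8 to N x i4 and bitcasting the result to N/2 x i8. It assumes a little-endian layout (the even element lands in the low nibble) and is not the generated IR.

```python
def trunc_and_pack_i4(values):
    """Emulate bitcast(trunci) from N x i8 through i4 to N/2 x i8."""
    if len(values) % 2 != 0:
        raise ValueError("expected an even number of elements")
    low = values[0::2]   # "shuffle": even elements
    high = values[1::2]  # "shuffle": odd elements
    # Mask each element to 4 bits (andi), shift the odd ones, then combine (ori).
    return [(lo & 0xF) | ((hi & 0xF) << 4) for lo, hi in zip(low, high)]


print([hex(b) for b in trunc_and_pack_i4([0x12, 0x34, 0xAB, 0xCD])])
# ['0x42', '0xdb']
```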
* Always use the auto-generated `getInitArgs` function. Remove the
hand-written `getInitOperands` duplicate.
* Remove `hasIterOperands` and `getNumIterOperands`. The names were
inconsistent because the "arg" is called `initArgs` in TableGen. Use
`getInitArgs().size()` instead.
* Fix verification around ops with no results.
This patch makes sure that the following case is lowered correctly
("duplication"):
```
func.func @broadcast_scalable_duplication(%arg0: vector<[32]xf32>) -> vector<1x[32]xf32> {
%res = vector.broadcast %arg0 : vector<[32]xf32> to vector<1x[32]xf32>
return %res : vector<1x[32]xf32>
}
```
This change refactors some of the utilities used to unroll larger vector
computations into smaller vector computations. In fact, the indexing
computations used here are rather generic and are useful in other dialects or
downstream projects. Therefore, a utility for iterating over all possible tile
offsets for a particular pair of static (shape, tiled shape) is introduced in
IndexingUtils and replaces the existing computations in the vector unrolling
transformations. This builds off of the refactoring of IndexingUtils introduced
in 203fad476b7e.
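The iteration itself is simple to state. A minimal Python sketch of enumerating tile offsets for a static (shape, tiled shape) pair follows; it is not the IndexingUtils API, the name `tile_offsets` is illustrative, and it assumes the tile shape evenly divides the shape.

```python
from itertools import product


def tile_offsets(shape, tile_shape):
    """Yield the offset of every tile, in row-major order."""
    for dim, tile in zip(shape, tile_shape):
        if dim % tile != 0:
            raise ValueError("tile shape must evenly divide shape")
    yield from product(*(range(0, dim, tile)
                         for dim, tile in zip(shape, tile_shape)))


print(list(tile_offsets([4, 6], [2, 3])))
# [(0, 0), (0, 3), (2, 0), (2, 3)]
```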
Reviewed By: nicolasvasilache
Differential Revision: https://reviews.llvm.org/D150000
This patch adds support for lowering vector.outerproduct to the ArmSME
MOPA intrinsic for the following types:
vector<[8]xf16>, vector<[8]xf16> -> vector<[8]x[8]xf16>
vector<[8]xbf16>, vector<[8]xbf16> -> vector<[8]x[8]xbf16>
vector<[4]xf32>, vector<[4]xf32> -> vector<[4]x[4]xf32>
vector<[2]xf64>, vector<[2]xf64> -> vector<[2]x[2]xf64>
The FP variants are lowered to FMOPA (non-widening) [1] and BFloat to
BFMOPA (non-widening) [2].
Note that at the ISA level these variants are implemented by different
architecture features, listed below:
FMOPA (non-widening)
* half-precision - +sme2p1,+sme-f16f16
* single-precision - +sme
* double-precision - +sme-f64f64
BFMOPA (non-widening)
* half-precision - +sme2p1,+b16b16
There's currently no way to target different features when lowering to
ArmSME. Integration tests are added for F32 and F64. We use QEMU to run
the integration tests, but SME2 support isn't available yet (it's
targeted for 9.0), so integration tests for the variants that require it
are excluded.
Masking is currently unsupported.
Depends on #65450.
[1] https://developer.arm.com/documentation/ddi0602/2023-06/SME-Instructions/FMOPA--non-widening---Floating-point-outer-product-and-accumulate-
[2] https://developer.arm.com/documentation/ddi0602/2023-06/SME-Instructions/BFMOPA--non-widening---BFloat16-floating-point-outer-product-and-accumulate-
Make sure that when analysing a `vector.transfer_read` that's a
candidate for either hoisting or store-to-load forwarding,
`memref.collapse_shape` Ops are correctly included in the alias
analysis. This is done by either
* making sure that relevant users are taken into account, or
* making sure that source Ops are correctly identified.
This patch is part of a larger initiative aimed at fixing floating-point `max` and `min` operations in MLIR: https://discourse.llvm.org/t/rfc-fix-floating-point-max-and-min-operations-in-mlir/72671.
This commit addresses Task 1.2 of the mentioned RFC. By renaming these operations, we align their names with LLVM intrinsics that have corresponding semantics.
This is a follow-on to D158753, and allows the lowering of a
transfer read/write of n-D vectors with a single trailing scalable dimension
to primitive vector ops.
The final conversion to LLVM depends on D158517 and D158752; without
these patches, type conversion will fail (or an assert is hit in the LLVM
backend) if the final IR contains an array of scalable vectors.
This patch adds `transform.apply_patterns.vector.lower_create_mask`
which allows the lowering of vector.create_mask/constant_mask to be
tested independently of --convert-vector-to-llvm.
Reviewed By: c-rhodes, awarzynski, dcaballe
Differential Revision: https://reviews.llvm.org/D159482
This adds a lowering similar to the general shape_cast lowering, but
instead moves elements a (scalable) subvector at a time via
vector.scalable.extract/insert. It is restricted to the case where both
the source and result vector types have a single trailing scalable
dimension (due to limitations of the insert/extract ops).
The current lowerings are now disabled for scalable vectors, as they
produce incorrect results at runtime (due to assuming a fixed number
of elements).
Examples of casts that now work:
// Flattening:
%v = vector.shape_cast %arg0 : vector<4x[8]xi8> to vector<[32]xi8>
// Un-flattening:
%v = vector.shape_cast %arg0 : vector<[8]xi32> to vector<2x1x[4]xi32>
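The flattening case can be modelled for a concrete `vscale` as below. This is a sketch of the "one subvector at a time" idea only: the function name is made up, and it assumes the insert position is expressed in units that get multiplied by `vscale`.

```python
def flatten_scalable(rows, vscale, min_row_len):
    """Flatten e.g. 4x[8] into [32] by inserting one scalable row at a time."""
    row_len = min_row_len * vscale
    result = [None] * (len(rows) * row_len)
    for i, row in enumerate(rows):
        if len(row) != row_len:
            raise ValueError("row length must be min_row_len * vscale")
        # The position is min_row_len * i before scaling (assumed to be
        # multiplied by vscale, like the row length itself).
        start = (min_row_len * i) * vscale
        result[start:start + row_len] = row
    return result


# vector<2x[2]xi8> to vector<[4]xi8> with vscale = 2
print(flatten_scalable([[0, 1, 2, 3], [4, 5, 6, 7]], vscale=2, min_row_len=2))
# [0, 1, 2, 3, 4, 5, 6, 7]
```

A lowering that assumes a fixed number of elements per row would only move `min_row_len` elements per row, which is why the existing lowerings are wrong for `vscale > 1`.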
Reviewed By: awarzynski, nicolasvasilache
Differential Revision: https://reviews.llvm.org/D159217
This was introduced before the Optional directive and uses Variadic, but
it's really optional.
Reviewed By: nicolasvasilache, benmxwl-arm, dcaballe
Differential Revision: https://reviews.llvm.org/D159259
0-D vectors are now supported, so the special case of returning just
the element type can now be removed.
A few callers that relied on the old behaviour have been updated.
Reviewed By: awarzynski, nicolasvasilache
Differential Revision: https://reviews.llvm.org/D159122
This patch effectively enables the CastAwayElementwiseLeadingOneDim
rewrite pattern for scalable vectors. To this end,
`ExtractOp::inferReturnTypes` is updated so that scalable dimensions are
correctly recognised.
The change to ExtractOp will likely also make other conversion patterns
valid for scalable vectors, but this patch focuses on just one case.
Other conversion patterns will be enabled in the forthcoming patches.
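The type inference change boils down to carrying the scalable flags along with the shape when leading dimensions are dropped. A simplified model, with a hypothetical function name and a `(shape, scalable_dims)` pair standing in for a vector type:

```python
def infer_extract_type(shape, scalable_dims, num_indices):
    """Return (shape, scalable_dims) of the extracted value.

    Returns None when every dim is indexed, i.e. the result is a scalar.
    """
    if num_indices > len(shape):
        raise ValueError("too many indices")
    if num_indices == len(shape):
        return None
    # Drop the indexed leading dims from both the shape and the flags.
    return shape[num_indices:], scalable_dims[num_indices:]


# Extracting position [0] from vector<1x[4]xf32> gives vector<[4]xf32>.
print(infer_extract_type([1, 4], [False, True], 1))  # ([4], [True])
```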
Depends on D157993
Differential Revision: https://reviews.llvm.org/D158335
This patch updates one specific hook in "VectorDropLeadUnitDim.cpp" to
make sure that "scalable dims" are handled correctly. While this change
affects multiple patterns, I am only adding one regression test that
captures one specific case that affects me right now.
I am also adding Vector dialect to the list of dependencies of
`-test-vector-to-vector-lowering`. Otherwise my test case won't work as
a standalone test.
Differential Revision: https://reviews.llvm.org/D157993
If the original shape and the distributed shape are the same,
we don't distribute at all: every thread handles the whole vector.
Reviewed By: hanchung
Differential Revision: https://reviews.llvm.org/D158235
When handling sub-byte emulation, the sizes of the converted `memref`s
also need to be updated (this was not done in the current
implementation). This adds the additional complexity of having to
linearize the `memref`s as well. Consider a `memref<3x3xi4>` where the
`i4` elements are packed. This has an overall size of 5 bytes (rounded
up to number of bytes). This can only be represented by a
`memref<5xi8>`. A `memref<3x2xi8>` would imply an implicit padding of
4 bits at the end of each row. So incorporate linearization into the
sub-byte load-store emulation.
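The arithmetic in the example can be checked with a short sketch. The helper names are hypothetical, and the sketch assumes that the element width divides 8 (so no element straddles a byte) and that elements are packed starting from the low bits of each byte.

```python
def linearized_byte_size(shape, elem_bits):
    """Bytes needed to store all elements packed back to back."""
    total_bits = elem_bits
    for dim in shape:
        total_bits *= dim
    return (total_bits + 7) // 8  # round up to whole bytes


def locate(indices, shape, elem_bits):
    """Return (byte index, bit offset) of an element in the packed buffer."""
    linear = 0
    for idx, dim in zip(indices, shape):
        linear = linear * dim + idx  # row-major linearization
    bit = linear * elem_bits
    return bit // 8, bit % 8


print(linearized_byte_size([3, 3], 4))  # 5
print(locate([1, 0], [3, 3], 4))        # (1, 4)
```

Element `[1, 0]` starts in the high nibble of byte 1, so rows do not begin on byte boundaries. That is why a `memref<3x2xi8>` cannot represent the packed data.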
This patch also updates some of the utility functions to make better
use of statically available information using `OpFoldResult` and
`makeComposedFoldedAffineApplyOps`.
Reviewed By: hanchung, yzhang93
Differential Revision: https://reviews.llvm.org/D158125
This commit starts enabling vector distribution over multiple
dimensions. It requires delinearizing the lane ID to match the
expected rank. shape_cast and transfer_read now can properly
handle multiple dimensions.
Reviewed By: hanchung
Differential Revision: https://reviews.llvm.org/D157931
This revision is needed to support bufferization of `cf.br`/`cf.cond_br`. It will also be useful for better analysis of loop ops.
This revision generalizes `getAliasingOpResults` to `getAliasingValues`. An OpOperand can now not only alias with OpResults but also with BlockArguments. In the case of `cf.br` (will be added in a later revision): a `cf.br` operand will alias with the corresponding argument of the destination block.
If an op does not implement the `BufferizableOpInterface`, the analysis is conservative. It previously assumed that an OpOperand may alias with each OpResult. It now assumes that an OpOperand may alias with each OpResult and each BlockArgument of the entry block.
Differential Revision: https://reviews.llvm.org/D157957
This patch adds the missing logic so that the
`TransferReadPermutationLowering` can be used for scalable vectors. To
this end:
* TransferOp custom C++ builder is updated to support scalable
vectors,
* `TransferOpReduceRank` is also updated to support scalable vectors.
This pattern is relevant when lowering `linalg.matmul` via
`vector_multi_reduction` for scalable vectors.
I've also updated relevant code in `TransferOpReduceRank` not to use
`llvm::to_vector` for constructing `SmallVector` from `ArrayRef`. That
hook doesn't work for `ArrayRef<bool>` (*), so for consistency I
switched to an explicit constructor (so that both `newShape` and
`newScalableDim` are constructed in a similar fashion).
(*) IIUC, that's due to how implicit narrowing conversions between `bool`
and `*bool` work. Note that these narrowing conversions change when
using initializer lists, see
* https://en.cppreference.com/w/cpp/language/list_initialization.
Depends on D157092
Differential Revision: https://reviews.llvm.org/D157268