This PR builds on the previous PR, #146531, and enables scalable
vectorization for `batch_mmt4d` as well.
---------
Signed-off-by: Ege Beysel <beyselege@gmail.com>
Adds a `linalg-morph-ops` pass to convert an op from one representation to another:
named-op <--> category_op (elementwise, contraction, ..) <--> generic
e.g.
```mlir
%exp = linalg.exp ins(%A : tensor<16x8xf32>) outs(%B : tensor<16x8xf32>) -> tensor<16x8xf32>
```
After `mlir-opt -linalg-morph-ops=named-to-category ..`
```mlir
%0 = linalg.elementwise kind=#linalg.elementwise_kind<exp> ins(%arg0 : tensor<16x8xf32> ..
```
Note: this is a generalization of the existing passes:
`--linalg-generalize-named-ops` is the path `named-op --> generic-op`,
`--linalg-specialize-generic-ops` is the path `named-op <-- generic-op`.
email: quic_mabsar@quicinc.com
This patch updates `vectorizeAsTensorUnpackOp` to support scalable
vectorization by requiring user-specified vector sizes for the _read_ operation
(rather than the _write_ operation) in `linalg.unpack`.
Conceptually, `linalg.unpack` consists of these high-level steps (sketched below):
* **Read** from the source tensor using `vector.transfer_read`.
* **Transpose** the read value according to the permutation in the
`linalg.unpack` op (via `vector.transpose`).
* **Re-associate** dimensions of the transposed value, as specified by the op
(via `vector.shape_cast`).
* **Write** the result into the destination tensor via
`vector.transfer_write`.
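For illustration, here is a rough sketch of these steps for a hypothetical
`tensor<2x4x8x16xf32> -> tensor<16x64xf32>` unpack (the values `%src`, `%dest`,
`%pad` and `%c0` are assumed to be defined; shapes are illustrative and masking
is omitted):
```mlir
// Read the full source (outer-dims x inner-tiles layout).
%read = vector.transfer_read %src[%c0, %c0, %c0, %c0], %pad
    {in_bounds = [true, true, true, true]}
    : tensor<2x4x8x16xf32>, vector<2x4x8x16xf32>
// Interleave outer and inner tile dimensions.
%tr = vector.transpose %read, [0, 2, 1, 3]
    : vector<2x4x8x16xf32> to vector<2x8x4x16xf32>
// Collapse each (outer, inner) pair into one unpacked dimension.
%sc = vector.shape_cast %tr : vector<2x8x4x16xf32> to vector<16x64xf32>
// Write the result into the destination tensor.
%res = vector.transfer_write %sc, %dest[%c0, %c0]
    {in_bounds = [true, true]}
    : vector<16x64xf32>, tensor<16x64xf32>
```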
Previously, the vector sizes provided by the user were interpreted as
write-vector sizes. These were used to:
* Infer read-vector sizes using the `inner_tiles` attribute of the unpack op.
* Deduce vector sizes for the transpose and shape cast operations.
* Ultimately determine the vector shape for the write.
However, this logic breaks when one or more tile sizes are dynamic. In such
cases, `vectorizeUnPackOpPrecondition` fails, and vectorization is rejected.
This patch switches the contract: users now directly specify the
"read-vector-sizes", which inherently encode all inner tile sizes - including
dynamic ones. It becomes the user's responsibility to provide valid sizes.
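For example, for an unpack with a dynamic inner tile, as in the hypothetical
snippet below, the write-vector sizes cannot be used to infer the read shape,
so the read-vector sizes (one per source dimension, e.g. `[2, 4, 8, 16]`) are
supplied directly:
```mlir
%unpacked = linalg.unpack %src
    inner_dims_pos = [0, 1] inner_tiles = [%tile, 16] into %dest
    : tensor<2x4x?x16xf32> -> tensor<?x64xf32>
```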
In practice, since `linalg.unpack` is typically constructed, tiled, and
vectorized by the same transformation pipeline, the necessary
"read-vector-sizes" should be recoverable.
This patch introduces a new helper, `getCollapsedVecType`, and updates
`vectorizeAsTensorUnpackOp` to use it. The motivation stems from improving how
`vector.shape_cast` operations are generated when vectorizing `linalg.unpack`.
Previously, the vectorizer relied on
* `tensor::CollapseShapeOp::inferCollapsedType`
to compute the collapsed vector type. This approach is suboptimal
because:
* `inferCollapsedType` lacks awareness of scalable vector flags.
* Linalg vectorization should not depend on Tensor dialect utilities.
Instead of relocating `inferCollapsedType`, we introduce
`getCollapsedVecType` — a lightweight, specialized hook that:
* Assumes no dynamic sizes.
* Handles scalable flags alongside shape dimensions.
This change also reduces temporary variables in
`vectorizeAsTensorUnpackOp` and paves the way for a cleaner update in
#149293.
Previously, when dropping a unit dim from a pad op with mixed dynamic/static
input/output shapes, the resulting shape would take on the type of the
input, resulting in invalid IR.
Also did some minor cleanup to the formatting of the
`drop_unit_dim_corresponding_to_dynamic_dim` test to make it match the
rest of the file.
---------
Signed-off-by: dan <danimal197@gmail.com>
In https://github.com/llvm/llvm-project/pull/149156, I ensured that we
no longer generate spurious `tensor.empty` ops when vectorizing
`linalg.unpack`.
This follow-up removes leftover code that is now redundant but was
missed both in the original PR and in #150602, which was also meant to clean
up leftover code.
Note that this removes the code for computing the "write-vector-sizes";
instead, these are now fully inferred from the preceding ops.
This PR fixes the computation of padded shapes for convolution-style
affine maps (e.g., d0 + d1) in `PadTilingInterface`. Previously, the
code used the direct sum of loop upper bounds, leading to over-padding.
For example, for the following `conv_2d_nhwc_fhwc` op, when padding only the
`c` dimensions to multiples of 16, it also incorrectly pads the convolved
dimensions and generates the wrong input shapes:
```
%padded = tensor.pad %arg0 low[0, 0, 0, 0] high[0, 1, 1, 12] {
^bb0(%arg3: index, %arg4: index, %arg5: index, %arg6: index):
tensor.yield %cst : f32
} : tensor<1x16x16x4xf32> to tensor<1x17x17x16xf32>
%padded_0 = tensor.pad %arg1 low[0, 0, 0, 0] high[0, 0, 0, 12] {
^bb0(%arg3: index, %arg4: index, %arg5: index, %arg6: index):
tensor.yield %cst : f32
} : tensor<16x3x3x4xf32> to tensor<16x3x3x16xf32>
%0 = linalg.conv_2d_nhwc_fhwc {dilations = dense<1> : tensor<2xi64>, strides = dense<1> : tensor<2xi64>} ins(%padded, %padded_0 : tensor<1x17x17x16xf32>, tensor<16x3x3x16xf32>) outs(%arg2 : tensor<1x14x14x16xf32>) -> tensor<1x14x14x16xf32>
return %0 : tensor<1x14x14x16xf32>
```
The new implementation uses the maximum accessed index as the input to the
affine map and then adds 1 after aggregating all the terms to obtain the
final padded size. This fixes
https://github.com/llvm/llvm-project/issues/148679.
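With the fix, the convolved dimensions keep their original sizes and only the
`c` dimension is padded; the expected IR for the same example looks roughly as
follows:
```
%padded = tensor.pad %arg0 low[0, 0, 0, 0] high[0, 0, 0, 12] {
^bb0(%arg3: index, %arg4: index, %arg5: index, %arg6: index):
  tensor.yield %cst : f32
} : tensor<1x16x16x4xf32> to tensor<1x16x16x16xf32>
%padded_0 = tensor.pad %arg1 low[0, 0, 0, 0] high[0, 0, 0, 12] {
^bb0(%arg3: index, %arg4: index, %arg5: index, %arg6: index):
  tensor.yield %cst : f32
} : tensor<16x3x3x4xf32> to tensor<16x3x3x16xf32>
```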
Consumer fusion was previously disabled because there may be artificial padding. After [refining the pack op semantics](773e158c64),
we can assume that there is no artificial padding. Thus, the check can
be removed, and we can unconditionally enable consumer fusion when it
is a perfect tiling case.
Signed-off-by: hanhanW <hanhan0912@gmail.com>
This reverts commit
0844812b2e
with a shape fix in
1db4c6b275
The revision restricts the `linalg.pack` op to not have artificial
padding semantics. E.g., the example below is valid without the change, and it
becomes invalid with the change.
```mlir
func.func @foo(%src: tensor<9xf32>) -> tensor<100x8xf32> {
%cst = arith.constant 0.000000e+00 : f32
%dest = tensor.empty() : tensor<100x8xf32>
%pack = linalg.pack %src
padding_value(%cst : f32)
inner_dims_pos = [0]
inner_tiles = [8] into %dest
: tensor<9xf32> -> tensor<100x8xf32>
return %pack : tensor<100x8xf32>
}
```
IMO, it is a misuse if we use pack ops with artificial padding sizes
because the intention of the pack op is to relayout the source based on
target intrinsics, etc. The output shape is expected to be
`tensor<2x8xf32>`. If people need extra padding sizes, they can create a
new pad op followed by the pack op.
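For instance, the example above could be rewritten without artificial padding
as an explicit pad followed by a pack, along these lines (a sketch):
```mlir
%cst = arith.constant 0.000000e+00 : f32
%padded = tensor.pad %src low[0] high[791] {
^bb0(%i: index):
  tensor.yield %cst : f32
} : tensor<9xf32> to tensor<800xf32>
%dest = tensor.empty() : tensor<100x8xf32>
%pack = linalg.pack %padded
    inner_dims_pos = [0]
    inner_tiles = [8] into %dest
    : tensor<800xf32> -> tensor<100x8xf32>
```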
This also makes consumer tiling much easier, because consumer fusion
does not support artificial padding sizes. It is very hard to make it
work without ad-hoc patterns, because the tile sizes are defined with
respect to the source, which implies that there is no core_id/thread_id
available to write padding values to the whole tile.
People may wonder why the pad tiling implementation works. The
answer is that it creates an `if-else` branch to handle the case. In my
experience, this is a struggle during transformation, because most of the
time people only need one side of the branch, given that the tile sizes
are usually greater than the padding sizes. However, the implementation is
conservatively correct in terms of semantics. Given that the
`pack` op was introduced to serve relayout needs better, having
the restriction makes sense to me.
Removed tests:
- `no_bubble_up_pack_extending_dimension_through_expand_cannot_reassociate`
  from `data-layout-propagation.mlir`: it duplicates
  `bubble_up_pack_non_expanded_dims_through_expand` after the shape fix.
- `fuse_pack_consumer_with_untiled_extra_padding` from
  `tile-and-fuse-consumer.mlir`: it was created to exercise artificial padding
  in the consumer fusion implementation.
The other changes in lit tests are just fixing the shape.
---------
Signed-off-by: hanhanW <hanhan0912@gmail.com>
In https://github.com/llvm/llvm-project/pull/149156, I ensured that we
no longer generate spurious `tensor.empty` ops when vectorizing
`linalg.unpack`.
This follow-up removes leftover code that is now redundant but was
missed in the original PR.
The revision restricts the `linalg.pack` op to not have artificial
padding semantics. E.g., the example below is valid without the change, and it
becomes invalid with the change.
```mlir
func.func @foo(%src: tensor<9xf32>) -> tensor<100x8xf32> {
%cst = arith.constant 0.000000e+00 : f32
%dest = tensor.empty() : tensor<100x8xf32>
%pack = linalg.pack %src
padding_value(%cst : f32)
inner_dims_pos = [0]
inner_tiles = [8] into %dest
: tensor<9xf32> -> tensor<100x8xf32>
return %pack : tensor<100x8xf32>
}
```
IMO, it is a misuse if we use pack ops with artificial padding sizes
because the intention of the pack op is to relayout the source based on
target intrinsics, etc. The output shape is expected to be
`tensor<2x8xf32>`. If people need extra padding sizes, they can create a
new pad op followed by the pack op.
This also makes consumer tiling much easier, because consumer fusion
does not support artificial padding sizes. It is very hard to make it
work without ad-hoc patterns, because the tile sizes are defined with
respect to the source, which implies that there is no core_id/thread_id
available to write padding values to the whole tile.
People may wonder why the pad tiling implementation works. The
answer is that it creates an `if-else` branch to handle the case. In my
experience, this is a struggle during transformation, because most of the
time people only need one side of the branch, given that the tile sizes
are usually greater than the padding sizes. However, the implementation is
conservatively correct in terms of semantics. Given that the
`pack` op was introduced to serve relayout needs better, having
the restriction makes sense to me.
Removed tests:
- `no_bubble_up_pack_extending_dimension_through_expand_cannot_reassociate`
  from `data-layout-propagation.mlir`: it duplicates
  `bubble_up_pack_non_expanded_dims_through_expand` after the shape fix.
- `fuse_pack_consumer_with_untiled_extra_padding` from
  `tile-and-fuse-consumer.mlir`: it was created to exercise artificial padding
  in the consumer fusion implementation.
The other changes in lit tests are just fixing the shape.
---------
Signed-off-by: hanhanW <hanhan0912@gmail.com>
This PR uses `val.getDefiningOp<OpTy>()` to replace `dyn_cast<OpTy>(val.getDefiningOp())`, `dyn_cast_or_null<OpTy>(val.getDefiningOp())`, and `dyn_cast_if_present<OpTy>(val.getDefiningOp())`.
The revision folds the tensor.pad/extract_slice ops into
linalg.pack/unpack ops only when it is safe to fold. It is not valid to
have artificial padding.
The documentation improvement and verifier update will be done in a
separate PR (i.e., https://github.com/llvm/llvm-project/pull/149624).
The revision is a step towards it.
---------
Signed-off-by: hanhanW <hanhan0912@gmail.com>
Generalizes `dropUnitDims` to operate on any op implementing the
`IndexingMapOpInterface`. Operation specific creation is handled by
passing a builder that will construct the new operation based on the
dropped dimensions.
---------
Signed-off-by: Ian Wood <ianwood@u.northwestern.edu>
Co-authored-by: Kunwar Grover <groverkss@gmail.com>
These are identified by misc-include-cleaner. I've filtered out those
that break builds. Also, I'm staying away from llvm-config.h,
config.h, and Compiler.h, which likely cause platform- or
compiler-specific build failures.
In the linalg ElementwiseOpFusion transform, a prerequisite for the
fusion between a producer and a consumer op is that the producer's output
indexing map associated with the result to be fused must be invertible
(e.g. a simple permutation).
Before this patch, only the first output indexing map was being checked;
this bug produced issues when the operand to fuse was not the 1st result
of the producer op. For example, this situation arises when the producer
op has multiple results because it's the result of previous fusions
where the original result had been preserved: in these cases, the pass
ought to check the indexing map of the result being fused, which is not
necessarily the 1st one.
Signed-off-by: Fabrizio Indirli <Fabrizio.Indirli@arm.com>
Extends the linalg vectorizer with a path to lower contraction ops directly
into `vector.contract`.
The direct rewriting preserves high-level op semantics and provides a more
progressive lowering compared to reconstructing the contraction back from a
multi-dimensional reduction.
The added lowering focuses on named linalg ops and leverages their well-defined
semantics to avoid complex precondition verification.
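For instance, under the new path a matmul-like contraction is rewritten roughly
as in the sketch below (shapes and value names are hypothetical):
```mlir
%res = vector.contract {
    indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>,
                     affine_map<(d0, d1, d2) -> (d2, d1)>,
                     affine_map<(d0, d1, d2) -> (d0, d1)>],
    iterator_types = ["parallel", "parallel", "reduction"],
    kind = #vector.kind<add>}
  %lhs, %rhs, %acc
  : vector<4x8xf32>, vector<8x16xf32> into vector<4x16xf32>
```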
The new path is optional and disabled by default to avoid changing the
default vectorizer behavior.
This happens only when a larger tile size is used, i.e., one that is greater
than or equal to the dimension size. In this case, it is a full slice, so it
is fusible.
The IR can be generated during the TileAndFuse process. It is hard to
fix in such a driver, so we enable the naive fusion for this case.
---------
Signed-off-by: hanhanW <hanhan0912@gmail.com>
If a dimension is not tiled, it is always valid to fuse the pack op,
even if it has padding semantics, because it always generates a full
slice along that dimension.
If a dimension is tiled and does not need extra padding, the fusion
is valid.
The revision also formats corresponding tests for consistency.
---------
Signed-off-by: hanhanW <hanhan0912@gmail.com>
This patch adds support for scalable vectorization of linalg.mmt4d. The
key design change is the introduction of a new vectorizer state variable:
* `assumeDynamicDimsMatchVecSizes`
...along with the corresponding Transform dialect attribute:
* `assume_dynamic_dims_match_vec_sizes`.
This flag instructs the vectorizer to assume that dynamic memref/tensor
dimensions match the corresponding vector sizes (fixed or scalable). With this
assumption, masking becomes unnecessary, which simplifies the lowering pipeline
significantly.
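A hypothetical Transform dialect invocation could look roughly as follows (the
attribute name is the one added here; the target handle and the vector sizes
are illustrative, and the exact syntax may differ):
```mlir
transform.structured.vectorize %mmt4d vector_sizes [1, 1, 1, 8, [8], 1]
    {assume_dynamic_dims_match_vec_sizes} : !transform.any_op
```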
While this assumption is not universally valid, it typically holds for
`linalg.mmt4d`. Inputs and outputs are explicitly packed using `linalg.pack`,
and this packing includes padding, ensuring that dimension sizes align with
vector sizes (*).
* Related discussion: https://github.com/llvm/llvm-project/issues/143920
An upcoming patch will include an end-to-end test that leverages scalable
vectorization of linalg.mmt4d to demonstrate the newly enabled functionality.
This would not be feasible without the changes introduced here, as it would
otherwise require additional logic to handle complex - but ultimately redundant
- masks.
(*) This holds provided that the tile sizes used for packing match the vector
sizes used during vectorization. It is the user’s responsibility to enforce
this.
When collapsing linalg dimensions, we check whether the op's memref operands
are guaranteed to be collapsible. However, we currently assume that the
matching indexing map is the identity map.
This commit modifies this behavior and checks whether the memref is
collapsible on the transformed dimensions.
The motivation is to avoid having to negate `isDynamic*` checks, avoid
double negations, and allow for `ShapedType::isStaticDim` to be used in
ADT functions without having to wrap it in a lambda performing the
negation.
Also add the new functions to C and Python bindings.
These are identified by misc-include-cleaner. I've filtered out those
that break builds. Also, I'm staying away from llvm-config.h,
config.h, and Compiler.h, which likely cause platform- or
compiler-specific build failures.
This patch is a follow-up to https://github.com/llvm/llvm-project/pull/146088 and changes the padding value in the linalg vectorizer from `0` to `ub.poison` in the `vector.transfer_read`s created for extracting slices or when vectorizing a generic.
Signed-off-by: Fabian Mora <fabian.mora-cordero@amd.com>
Linalg promotion attempts to compute a constant upper bound for the
allocated buffer size. Only when it fails to compute an upper bound does it
fall back to the original subview size, which may be dynamic.
This adds a promotion option to use the original subview size by default,
thus minimizing the allocation size.
Fixes #144268.
The issue is triggered by
ee070d0816
which checks `TensorLikeType` when downstream projects use the pattern
without registering bufferization::BufferizationDialect. The
registration is needed because the interface implementations for builtin
types are located in `BufferizationDialect::initialize()`. However, we do not
need to fix it via registration; the proper fix is to use the linalg
method, i.e., `hasPureTensorSemantics`.
No additional tests are added because the functionality is well tested
in
[transpose-matmul.mlir](https://github.com/llvm/llvm-project/blob/main/mlir/test/Dialect/Linalg/transpose-matmul.mlir).
Reproducing the issue would require a different setup, e.g., writing a
new C++ pass, which does not seem worth it.
Signed-off-by: hanhanW <hanhan0912@gmail.com>
This PR adds a mechanism so that downstream consumers can pass in
control functions for the application of these patterns. This change
shouldn't affect any consumers of this method that do not specify a
controlFn. In each of the patterns, the controlFn always receives the
source operand of the consumer as a parameter.
In IREE, we (will) use it to prevent folding patterns that
would inhibit fusion. See IREE issue
[#20896](https://github.com/iree-org/iree/issues/20896) for more
details.
In both `bubbleUpPackOpThroughGenericOp()` and
`pushDownUnPackOpThroughGenericOp()`, we can simplify the lowered IR by
removing the pack of an empty tensor when the init tensor isn't used in the
generic op. Instead of packing an empty tensor, the empty tensor can be
forwarded to the generic output. This allows a cleaner result after data
layout propagation.
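A rough sketch of the difference (names and shapes are hypothetical):
```mlir
// Before: the unused init is packed explicitly.
%empty = tensor.empty() : tensor<64x64xf32>
%packed_empty = linalg.pack %empty
    inner_dims_pos = [0, 1] inner_tiles = [8, 8] into %init
    : tensor<64x64xf32> -> tensor<8x8x8x8xf32>
// After: an empty tensor of the packed type is forwarded as the init instead.
%new_init = tensor.empty() : tensor<8x8x8x8xf32>
```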
Context:
`vector.transfer_read` always requires a padding value. Most of its
builders take no `padding` value and assume the safe value of `0`.
However, this should be a conscious choice by the API user, as it makes
it easy to introduce bugs.
For example, while making this patch I found several occasions where the
padding value was not getting propagated (`vector.transfer_read` was
transformed into another `vector.transfer_read`). These bugs were
always caused by constructors that don't require specifying the
padding.
Additionally, using `ub.poison` as a possible default value is better,
as it indicates the user "doesn't care" about the actual padding value,
forcing users to specify the actual padding semantics they want.
With that in mind, this patch changes the builders of
`vector.transfer_read` to always take a `std::optional<Value> padding`
argument. The argument must always be provided, but for convenience users can
pass `std::nullopt`, which pads the transfer read with `ub.poison`.
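For reference, passing `std::nullopt` produces IR along these lines (a sketch
with hypothetical shapes):
```mlir
%poison = ub.poison : f32
%v = vector.transfer_read %src[%c0, %c0], %poison {in_bounds = [true, true]}
    : tensor<4x8xf32>, vector<4x8xf32>
```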
---------
Signed-off-by: Fabian Mora <fabian.mora-cordero@amd.com>
This patch adds additional checks to the hoisting logic to prevent hoisting of
`vector.transfer_read` / `vector.transfer_write` pairs when the underlying
memref has users that introduce aliases via operations implementing
`ViewLikeOpInterface`.
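For example, in a sketch like the one below (hypothetical IR), `%view` aliases
`%buf` through a `ViewLikeOpInterface` op, so the `vector.transfer_read` /
`vector.transfer_write` pair on `%buf` is no longer hoisted out of the loop:
```mlir
%view = memref.subview %buf[0, 0] [4, 4] [1, 1]
    : memref<8x8xf32> to memref<4x4xf32, strided<[8, 1]>>
scf.for %i = %lb to %ub step %step {
  %v = vector.transfer_read %buf[%c0, %c0], %pad {in_bounds = [true, true]}
      : memref<8x8xf32>, vector<4x4xf32>
  %add = arith.addf %v, %v : vector<4x4xf32>
  vector.transfer_write %add, %buf[%c0, %c0] {in_bounds = [true, true]}
      : vector<4x4xf32>, memref<8x8xf32>
  memref.copy %view, %other
      : memref<4x4xf32, strided<[8, 1]>> to memref<4x4xf32>
}
```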
Note: This may conservatively block some valid hoisting opportunities and could
affect performance. However, as demonstrated by the included tests, the current
logic is too permissive and can lead to incorrect transformations.
If this change prevents hoisting in cases that are provably safe, please share
a minimal repro - I'm happy to explore ways to relax the check.
Special treatment is given to `memref.assume_alignment`, mainly to accommodate
recent updates in:
* https://github.com/llvm/llvm-project/pull/139521
Note that such special casing does not scale and should generally be avoided.
The current hoisting logic lacks robust alias analysis. While better support
would require more work, the broader semantics of `memref.assume_alignment`
remain somewhat unclear. It's possible this op may eventually be replaced with
the "alignment" attribute added in:
* https://github.com/llvm/llvm-project/pull/144344
Given the following example:
```
module {
func.func @main(%arg0: tensor<1x1x1x4x1xf32>, %arg1: tensor<1x1x4xf32>) -> tensor<1x1x1x4x1xf32> {
%pack = linalg.pack %arg1 outer_dims_perm = [1, 2, 0] inner_dims_pos = [2, 0] inner_tiles = [4, 1] into %arg0 : tensor<1x1x4xf32> -> tensor<1x1x1x4x1xf32>
return %pack : tensor<1x1x1x4x1xf32>
}
}
```
We would generate an invalid transpose operation because the calculated
permutation would be `[0, 2, 0]`, which is semantically incorrect, as the
permutation must contain unique integers corresponding to the source
tensor dimensions.
The following change modifies how we calculate the permutation array and
ensures that the dimension indices given in the permutation array are
unique.
The above example then translates to a transpose with a
permutation of `[1, 2, 0]`, following the rule that `inner_dims_pos`
is appended to the permutation array and the preceding indices are
filled with the remaining dimensions.
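In the decomposed IR, this corresponds to a transpose along the lines of the
sketch below (hypothetical; the surrounding pad/insert ops are omitted):
```mlir
// Remaining dimension [1] first, then inner_dims_pos [2, 0] appended:
// permutation = [1, 2, 0].
%init = tensor.empty() : tensor<1x4x1xf32>
%transposed = linalg.transpose ins(%arg1 : tensor<1x1x4xf32>)
    outs(%init : tensor<1x4x1xf32>) permutation = [1, 2, 0]
```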
For consumer fusion cases of this form
```
%0:2 = scf.forall .. shared_outs(%arg0 = ..., %arg1 = ...) {
tensor.parallel_insert_slice ... into %arg0
tensor.parallel_insert_slice ... into %arg1
}
%1 = linalg.generic ... ins(%0#0, %0#1)
```
the current consumer fusion, which handles one slice at a time, cannot fuse
the consumer into the loop, since fusing along one slice would create an
SSA violation on the other use from the `scf.forall`. The solution is to
allow consumer fusion to consider multiple slices at once. This
PR changes the `TilingInterface` methods related to consumer fusion,
i.e.
- `getTiledImplementationFromOperandTile`
- `getIterationDomainFromOperandTile`
to allow fusion while considering multiple operands. It is up to the
`TilingInterface` implementation to return an error if a list of
operand tiles cannot result in a consistent implementation of the
tiled operation.
The Linalg operation implementation of `TilingInterface` has been
modified to account for these changes and to handle the cases where the
operand tiles can result in a consistent tiling implementation.
---------
Signed-off-by: MaheshRavishankar <mahesh.ravishankar@gmail.com>
This include introduces a dependency for LinalgTransforms on
LinalgTransformOps, which is unspecified in the module dependencies, and
would produce a cyclic dependency if it were specified.
The include is unused in WinogradConv2D.cpp, so this change removes it.
We only support a fixed set of minimum filtering algorithms for Winograd
Conv2D decomposition. Instead of letting users specify arbitrary integers,
define a fixed set of enumeration values for the parameters of the minimum
filtering algorithm.
Updates the linalg::vectorize function to return a
`FailureOr<VectorizationResult>` containing the values to replace the
original operation, instead of directly replacing the original
operation. This aligns better with the style of transforms used with the
TilingInterface, and gives more control to users over the lowering,
since it allows for additional transformation of the IR before
replacement.
There was already a `VectorizationResult` defined, which was used for
the internal vectorize implementation using `CustomVectorizationHook`s,
so the old struct is renamed to `VectorizationHookResult`.
Note for integration: The replacement of the original operation is now
the responsibility of the caller, so wherever `linalg::vectorize` is
used, the caller must also do
`rewriter.replaceOp(vectorizeResults->replacements)`.
---------
Signed-off-by: Max Dawkins <max.dawkins@gmail.com>