The hoistRedundantVectorTransfers function does not verify loop
bounds when hoisting vector transfers. This is not safe in general,
since the loop may have a zero trip count. This PR uses ValueBounds
to verify that the lower bound of the loop is less than its upper
bound before hoisting. Trip-count verification is currently behind an
option `verifyNonZeroTrip`, which is false by default.
Zero trip count loops can arise in GPU code generation, where a loop
bound can be dependent on a thread id. If not all threads execute the
loop body, then hoisting out of the loop can cause these threads to
execute the transfers when they are not supposed to.
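As a hedged illustration (the loop structure and value names are made up
for this example), consider a loop whose lower bound is a thread id.
Threads with `%tid >= %ub` have a zero trip count, so hoisting the
transfer pair out of the loop would make them execute it anyway:
```mlir
// Illustrative sketch only: %tid could come from e.g. gpu.thread_id.
scf.for %i = %tid to %ub step %c1 {
  %v = vector.transfer_read %mem[%c0], %pad : memref<8xf32>, vector<8xf32>
  %r = arith.addf %v, %v : vector<8xf32>
  vector.transfer_write %r, %mem[%c0] : vector<8xf32>, memref<8xf32>
}
// Hoisting the transfer_read/transfer_write pair out of the loop would
// execute the transfers even for threads whose loop body never runs.
```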
---------
Signed-off-by: Max Dawkins <max.dawkins@gmail.com>
The pack_paddings attribute in the structured.pad TD Op is used to set
the `nofold` attribute in the generated tensor.pad Op. The current name
is confusing and suggests that there's a relation with the tensor.pack
Op. This patch renames it to `nofold_flags` to better match the actual
usage.
The SCF helper for tiling an operation implementing the TilingInterface
and greedily fusing consumers requires an uninterrupted chain of
operations implementing the tiling interface to succeed. There can be
cases with intermediate ops that don't implement the interface but have
producers that could be fused if various canonicalization/simplification
patterns could run in between fusion steps.
This adds an option to SCFTileAndFuseOptions for a pattern set to run,
between fusion steps, on the ops that result from fusion/tiling. Removed
and newly inserted slices are tracked for continued fusion applications.
See this RFC for more discussion:
https://discourse.llvm.org/t/rfc-split-fusion-portions-of-the-tilinginterface-into-a-new-interface/81155
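A minimal sketch of what this could look like (the `cleanupPatterns`
member name is an assumption based on this description, not a verified
API, and `rewriter`/`consumerOp` are stand-ins):
```c++
// Sketch: run simplification patterns between fusion steps so that
// producers hidden behind interface-less intermediate ops become fusable.
MLIRContext *context = rewriter.getContext();
RewritePatternSet cleanupPatterns(context);
tensor::ExtractSliceOp::getCanonicalizationPatterns(cleanupPatterns, context);

scf::SCFTileAndFuseOptions tileAndFuseOptions;
// Assumed member name; see SCFTileAndFuseOptions for the actual API.
tileAndFuseOptions.cleanupPatterns =
    FrozenRewritePatternSet(std::move(cleanupPatterns));
FailureOr<scf::SCFTileAndFuseResult> result =
    scf::tileConsumerAndFuseProducersUsingSCF(rewriter, consumerOp,
                                              tileAndFuseOptions);
```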
The newly added hook simply returns `false` for Ops for which there's no
"vectorization logic" in the Linalg Vectorizer (i.e. the `vectorize()`
method). It's added so that the following two TD ops expose an
identical level of functionality (that's not the case ATM):
* `transform.structured.vectorize_children_and_apply_patterns`
* `transform.structured.vectorize`
Specifically, ATM, the former works only for Linalg Ops, while the
latter works for all Ops that the vectorizer supports (*). With this
change, I am making sure that both TD ops behave consistently.
Note, this shouldn't affect any of the current uses of the vectorizer.
(*) This is implemented via the `vectorize()` method in
Vectorization.cpp.
When a contraction with a zero-filled, "identity-mapped" destination
has a linalg.add as its single user, and the add's `other` operand
dominates the contraction, replace the add by the contraction itself
with `other` as its destination.
Benefits include the elision of an elementwise op (the linalg.add) and
the removal of a tensor.empty destination, which is likely to require
an allocation upon bufferization.
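A sketch of the intended rewrite (names, shapes and the choice of
`linalg.matmul` as the contraction are illustrative):
```mlir
// Before: the matmul writes into a zero-filled destination and its only
// user is a linalg.add with %other as the other operand.
%zero = linalg.fill ins(%cst0 : f32) outs(%empty : tensor<4x4xf32>) -> tensor<4x4xf32>
%mm = linalg.matmul ins(%a, %b : tensor<4x2xf32>, tensor<2x4xf32>)
                    outs(%zero : tensor<4x4xf32>) -> tensor<4x4xf32>
%res = linalg.add ins(%mm, %other : tensor<4x4xf32>, tensor<4x4xf32>)
                  outs(%empty2 : tensor<4x4xf32>) -> tensor<4x4xf32>

// After: the add is gone and %other becomes the contraction's destination.
%res = linalg.matmul ins(%a, %b : tensor<4x2xf32>, tensor<2x4xf32>)
                     outs(%other : tensor<4x4xf32>) -> tensor<4x4xf32>
```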
Group all patterns that re-order vector.transpose and vector.broadcast
Ops (*) under `populateSinkVectorOpsPatterns`. These patterns are
normally used to "sink" redundant Vector Ops, hence the grouping.
Example:
```mlir
%at = vector.transpose %a, [1, 0] : vector<4x2xf32> to vector<2x4xf32>
%bt = vector.transpose %b, [1, 0] : vector<4x2xf32> to vector<2x4xf32>
%r = arith.addf %at, %bt : vector<2x4xf32>
```
would get converted to:
```mlir
%0 = arith.addf %a, %b : vector<4x2xf32>
%r = vector.transpose %0, [1, 0] : vector<4x2xf32> to vector<2x4xf32>
```
This patch also moves all tests for these patterns so that all of them
are:
* run under one test-flag: `test-vector-sink-patterns`,
* located in one file: "vector-sink.mlir".
To facilitate this change:
* `-test-sink-vector-broadcast` is renamed as
`test-vector-sink-patterns`,
* "sink-vector-broadcast.mlir" is renamed as "vector-sink.mlir",
* tests for `ReorderCastOpsOnBroadcast` and
`ReorderElementwiseOpsOnTranspose` patterns are moved from
"vector-reduce-to-contract.mlir" to "vector-sink.mlir",
* `ReorderElementwiseOpsOnTranspose` patterns are removed from
`populateVectorReductionToContractPatterns` and added to (newly
created) `populateSinkVectorOpsPatterns`,
* `ReorderCastOpsOnBroadcast` patterns are removed from
`populateVectorReductionToContractPatterns` - these are already
present in `populateSinkVectorOpsPatterns`.
This should give us better layering and more straightforward testing.
For the latter, the goal is to be able to easily identify which pattern
a particular test is exercising (especially when it's a specific
pattern).
NOTES FOR DOWNSTREAM USERS
In order to preserve the current functionality, please make sure to add
`populateSinkVectorOpsPatterns` wherever you are using
`populateVectorReductionToContractPatterns`. Also, rename
`populateSinkVectorBroadcastPatterns` to `populateSinkVectorOpsPatterns`.
(*) I didn't notice any other re-order patterns.
In order to support arbitrarily sized input data for conv2d, implement
the TilingInterface for Winograd operations. Before converting Winograd
operations into nested loops with matrix multiplies, first tile the
input of conv2d to a supported size.
Add a transform operation structured.decompose_winograd_op to decompose
Winograd operations. Before applying the transform op, use
tile_using_for to tile the input data to a supported size. The test case
shows how to tile and decompose Winograd operations.
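A hedged sketch of how this might be driven from the transform dialect
(the exact operand/result signature of the new op is assumed to follow
the common single-handle form, and the tile size is illustrative):
```mlir
// Tile the Winograd op to a supported size first, then decompose it.
%tiled, %loop = transform.structured.tile_using_for %winograd tile_sizes [4]
  : (!transform.any_op) -> (!transform.any_op, !transform.any_op)
%decomposed = transform.structured.decompose_winograd_op %tiled
  : (!transform.any_op) -> !transform.any_op
```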
The implementations of these methods are legacy, and they are removed in
favor of the `scf::tileUsingSCF` methods as replacements. To get the
latter on par with the requirements of the deprecated methods, the
tiling now allows one to specify the maximum number of tiles to use
instead of specifying the tile sizes. When tiling to `scf.forall`, this
specification is used to generate the `num_threads` version of the
operation.
A slight deviation from the previous implementation is that the
deprecated method always generated the `num_threads` variant of the
`scf.forall` operation. Now this is instead driven by the tiling options
specified. This reduces the indexing math generated when the tile sizes
are specified.
**Moving from `linalg::tileToForallOp` to `scf::tileUsingSCF`**
```
OpBuilder b;
TilingInterface op;
ArrayRef<OpFoldResult> numThreads;
ArrayAttr mapping;
FailureOr<ForallTilingResult> result = linalg::tileToForallOp(b, op, numThreads, mapping);
```
can be replaced by
```
scf::SCFTilingOptions options;
options.setNumThreads(numThreads);
options.setLoopType(scf::SCFTilingOptions::LoopType::ForallOp);
options.setMapping(mapping.getValue()); /* note: setMapping takes an ArrayRef<Attribute> */
FailureOr<scf::SCFTilingResult> result = scf::tileUsingSCF(b, op, options);
```
This generates the `numThreads` version of the `scf.forall` for the
inter-tile loops, i.e.
```
... = scf.forall (%arg0, %arg1) in (%nt0, %nt1) shared_outs(...)
```
**Moving from `linalg::tileToForallOpUsingTileSizes` to
`scf::tileUsingSCF`**
```
OpBuilder b;
TilingInterface op;
ArrayRef<OpFoldResult> tileSizes;
ArrayAttr mapping;
FailureOr<ForallTilingResult> result = linalg::tileToForallOpUsingTileSizes(b, op, tileSizes, mapping);
```
can be replaced by
```
scf::SCFTilingOptions options;
options.setTileSizes(tileSizes);
options.setLoopType(scf::SCFTilingOptions::LoopType::ForallOp);
options.setMapping(mapping.getValue()); /* note: setMapping takes an ArrayRef<Attribute> */
FailureOr<scf::SCFTilingResult> result = scf::tileUsingSCF(b, op, options);
```
Also note that `linalg::tileToForallOpUsingTileSizes` would effectively
call `linalg::tileToForallOp` by computing the `numThreads` from the
`op` and `tileSizes`, and generate the `numThreads` version of the
`scf.forall`. That is no longer the case. Instead, this will directly
generate the `tileSizes` version of the `scf.forall` op
```
... = scf.forall (%arg0, %arg1) = (%lb0, %lb1) to (%ub0, %ub1) step(%step0, %step1) shared_outs(...)
```
If you actually want to use the `numThreads` version, it is up to the
caller to compute the `numThreads` and call `options.setNumThreads`
instead of `options.setTileSizes`. Note that there is a slight
difference between the `numThreads` version and the tile-size version:
the former requires an additional `affine.max` on the tile size to
ensure non-negative tile sizes. When lowering to the `numThreads`
version, this `affine.max` is not needed since, by construction, the
tile sizes are non-negative. In the previous implementation, the
`numThreads` version generated by the
`linalg::tileToForallOpUsingTileSizes` method would avoid generating the
`affine.max` operation. To get to the same state, downstream users will
have to additionally normalize the `scf.forall` operation.
**Changes to `transform.structured.tile_using_forall`**
The transform dialect op that called into `linalg::tileToForallOp` and
`linalg::tileToForallOpUsingTileSizes` has been modified to call
`scf::tileUsingSCF`. The transform dialect op always generates the
`numThreads` version of the `scf.forall` op. So when `tile_sizes` are
specified for the transform dialect op, the `tile_sizes` version of the
`scf.forall` is first generated by the `scf::tileUsingSCF` method and
then normalized to get back to the same state. So there is no functional
change to `transform.structured.tile_using_forall`. It always generates
the `numThreads` version of the `scf.forall` op (as it did before this
change).
---------
Signed-off-by: MaheshRavishankar <mahesh.ravishankar@gmail.com>
This patch adds a check for the correct number of `loops` results of the
`transform.structured.tile_using_for` Op to the verifier, fixing a
crash.
Fix https://github.com/llvm/llvm-project/issues/98008
Detected with ASAN. `Operation::getLoc()` was called after erasing the
operation.
Reverts 48cf6b6bbe7a22bfcd98f82dc7afd21c9decd22f, which attempted to fix
the use-after-free. (But the use-after-free is still there when the
`hasFailed` branch is taken.)
This patch enables continuous tiling of a target structured op using
diminishing tile sizes. In cases where the tensor dimensions are not
exactly divisible by the tile size, we are left with leftover tensor
chunks that are irregularly tiled. This approach enables tiling of the
leftover chunk with a smaller tile size and repeats this process
recursively using exponentially diminishing tile sizes. This eventually
generates a chain of loops that apply tiling using diminishing tile
sizes.
Adds a `continuous_tile_sizes` op to the transform dialect. This op,
when given a tile size and a dimension, computes a series of diminishing
tile sizes that can be used to tile the target along the given
dimension. Additionally, this op also generates a series of chunk sizes
that the corresponding tile sizes should be applied to along the given
dimension. For illustration, a dimension of size 147 with an initial
tile size of 32 could be covered by chunks of sizes 128, 16, 2 and 1,
tiled with sizes 32, 16, 2 and 1 respectively.
Adds a `multiway` attribute to `transform.structured.split` that enables
multiway splitting of a single target op along the given dimension, as
specified in a list enumerating the chunk sizes.
This patch adds more precise side effects to the current ops with memory
effects, allowing us to determine which OpOperand/OpResult/BlockArgument
an operation reads or writes, rather than just recording the reading and
writing of values. This makes it convenient to use precise side effects
for analysis and optimization.
Related discussions:
https://discourse.llvm.org/t/rfc-add-operandindex-to-sideeffect-instance/79243
The `TilingInterface` methods have return values that allow the
interface implementation to return multiple operations, and also to
return the tiled values explicitly. This avoids the assumption that the
interface returns a single operation whose results are the expected
tiled values. Make
`PartialReductionOpInterface::tileToPartialReduction` return
`TilingResult` as well for the same reason.
Similarly, make `PartialReductionOpInterface::mergeReductions` also
return a list of generated operations and values to use as replacements.
This is just a refactoring to allow for the deprecation of
`linalg::tileReductionUsingForall` in favor of the
`scf::tileReductionUsingSCF` method.
This patch adds support for reducing operations with multiple results
using PartialReductionOpInterface. Also adds an implementation of
PartialReductionOpInterface for multiple results for linalg.generic.
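For illustration, a multi-result reduction of the kind this enables
(a sketch; shapes and body are made up):
```mlir
// Computes a row-wise sum and max in one linalg.generic with two results.
%res:2 = linalg.generic {
    indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>,
                     affine_map<(d0, d1) -> (d0)>,
                     affine_map<(d0, d1) -> (d0)>],
    iterator_types = ["parallel", "reduction"]}
    ins(%in : tensor<4x8xf32>) outs(%sum, %max : tensor<4xf32>, tensor<4xf32>) {
  ^bb0(%e : f32, %s : f32, %m : f32):
    %0 = arith.addf %e, %s : f32
    %1 = arith.maximumf %e, %m : f32
    linalg.yield %0, %1 : f32, f32
} -> (tensor<4xf32>, tensor<4xf32>)
```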
This patch is a first pass at making the syntax consistent across the
`LinalgTransformOp`s that use dynamic index lists for size parameters.
Previously, there were two different forms: inline the types in the
list, or place them in the functional-style tuple. This patch goes for
the latter.
In order to do this, the `printPackedOrDynamicIndexList`,
`printDynamicIndexList` and their `parse` counterparts were modified so
that the types can be optionally provided to the corresponding custom
directives.
All affected ops now use tablegen `assemblyFormat`, so custom
`parse`/`print` functions have been removed. There are a couple of ops
that will likely add dynamic size support, and once that happens, care
should be taken that the assembly remains consistent with the changes in
this patch.
The affected ops are as follows: `pack`, `pack_greedily`,
`tile_using_forall`. The `tile_using_for` and `vectorize` ops already
used this syntax, but their custom assembly was removed.
---------
Co-authored-by: Oleksandr "Alex" Zinenko <ftynse@gmail.com>
This patch modifies the definition of `PadOp` to take transform params
and handles for the `pad_to_multiple_of` operand.
---------
Co-authored-by: Oleksandr "Alex" Zinenko <ftynse@gmail.com>
This adds patterns to convert the Linalg matmul and batch_matmul ops to
their transposed variants. By default, the LHS matrix is transposed.
Our work enabling a lowering path from linalg.matmul to ArmSME has
revealed that the current lowering results in non-contiguous memory
accesses for the A matrix and very poor performance.
These patterns provide a simple option to fix this.
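A sketch of the rewrite (shapes illustrative; `%empty` stands for a
suitably shaped init tensor):
```mlir
// Before:
%mm = linalg.matmul ins(%A, %B : tensor<16x8xf32>, tensor<8x32xf32>)
                    outs(%C : tensor<16x32xf32>) -> tensor<16x32xf32>

// After (the LHS is transposed by default):
%At = linalg.transpose ins(%A : tensor<16x8xf32>)
                       outs(%empty : tensor<8x16xf32>) permutation = [1, 0]
%mm = linalg.matmul_transpose_a ins(%At, %B : tensor<8x16xf32>, tensor<8x32xf32>)
                                outs(%C : tensor<16x32xf32>) -> tensor<16x32xf32>
```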
The previous implementation inappropriately used `notifyMatchFailure` to
emit a failure message and then used
`emitDefaultSilenceableFailure`. This patch changes this to use the more
appropriate `emitSilenceableFailure` with an error message. Additionally,
a failure test has been added.
The previous implementation did an early successful return on
`rank <= 1` without adding the original op to the transform results.
This resulted in errors about the number of returns. This patch fixes
this by adding the original op to the results. Additionally, we first
check whether the op is elementwise and return a silenceable failure
early if not.
Transform interfaces are implemented, directly or via extensions, in
libraries belonging to multiple other dialects. Those dialects don't
need to depend on the non-interface part of the transform dialect, which
includes the growing number of ops and a transitive dependency
footprint. Split out the interfaces into a separate library. This in
turn requires flipping the dependency of the interface on the dialect,
which had crept in because both co-existed in one library. The interface
shouldn't depend on the transform dialect either.
As a consequence of splitting, the capability of the interpreter to
automatically walk the payload IR to identify payload ops of a certain
kind based on the type used for the entry point symbol argument is
disabled. This is a good move by itself as it simplifies the interpreter
logic. This functionality can be trivially replaced by a
`transform.structured.match` operation.
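For example, the implicit walk can be replaced with an explicit match
(a sketch; `%root` is an illustrative handle name):
```mlir
// Explicitly collect payload ops of the desired kind instead of relying
// on the interpreter to walk the payload IR based on the entry-point type.
%matmuls = transform.structured.match ops{["linalg.matmul"]} in %root
  : (!transform.any_op) -> !transform.any_op
```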
This lets `transform.structured.convert_to_loops` return handles to the
generated loops, making this transformation more useful for
(transformation-)nesting purposes. This is modelled after SCF's
`transform.loop.forall_to_for`, which returns handles to loops.
Introduced in commit aa2a96a24ae3a8cc04635ab6ede474c5f2665053 with a
note that the op might move out of the `Linalg` dialect, but no reason
was given for not returning handles. As far as I can see, this transform
always produces loops.
A `transform.structured.flatten_elementwise` op is implemented for
flattening the iteration space and (applicable) operands/results to a
single dimension.
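A hedged usage sketch (assuming the op follows the common single-handle
transform form; the signature is not verified here):
```mlir
%flattened = transform.structured.flatten_elementwise %target
  : (!transform.any_op) -> !transform.any_op
```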
Added support for vectorizing tensor.unpack. The unpack Op is split into
a `vector.transfer_read`, `vector.transpose`, `vector.shape_cast` and a
`vector.transfer_write`.
Rename listener callback names:
* `notifyOperationRemoved` -> `notifyOperationErased`
* `notifyBlockRemoved` -> `notifyBlockErased`
The current callback names are misnomers. The callbacks are triggered
when an operation/block is erased, not when it is removed (unlinked).
E.g.:
```c++
/// Notify the listener that the specified operation is about to be erased.
/// At this point, the operation has zero uses.
///
/// Note: This notification is not triggered when unlinking an operation.
virtual void notifyOperationErased(Operation *op) {}
```
This change is in preparation for adding listener support to the dialect
conversion. The dialect conversion internally unlinks IR before erasing
it at a later point in time. There is an important difference between
"remove" and "erase"; listener callback names should be accurate to
avoid confusion.
This PR adds a direct vectorization lowering of `tensor.pack` into
`mask(vector.transfer_read)`->`vector.shape_cast`->`vector.transpose`->`vector.transfer_write`.
Using `LoopLikeOpInterface` as the basis for the implementation unifies
all the tiling logic for both `scf.for` and `scf.forall`. The only
difference is the actual loop generation. This is a follow-up to
https://github.com/llvm/llvm-project/pull/72178
Instead of many entry points for each loop type, the loop type is now
passed as part of the options passed to the tiling method.
This is a breaking change, with the following changes:
1) The `scf::tileUsingSCFForOp` is renamed to `scf::tileUsingSCF`
2) The `scf::tileUsingSCFForallOp` is deprecated. The same
functionality is obtained by using `scf::tileUsingSCF` and setting
the loop type in `scf::SCFTilingOptions` passed into this method to
`scf::SCFTilingOptions::LoopType::ForallOp` (using the
`setLoopType` method).
3) The `scf::tileConsumerAndFusedProducerGreedilyUsingSCFForOp` is
renamed to `scf::tileConsumerAndFuseProducerUsingSCF`. The use of
the `controlFn` in `scf::SCFTileAndFuseOptions` allows implementing
any strategy, with the default callback implementing the greedy fusion.
4) The `scf::SCFTilingResult` and `scf::SCFTileAndFuseResult` now use
`SmallVector<LoopLikeOpInterface>`.
5) To make `scf::ForallOp` implement the needed parts of
`LoopLikeOpInterface`, the `getOutputBlockArguments()` method is
replaced with `getRegionIterArgs()`.
These changes now bring the tiling and fusion capabilities using
`scf.forall` on par with what was already supported by `scf.for`.
The pattern rewriter documentation states that "*all* IR mutations [...]
are required to be performed via the `PatternRewriter`." This commit
adds two functions that were missing from the rewriter API:
`moveOpBefore` and `moveOpAfter`.
After an operation was moved, the `notifyOperationInserted` callback is
triggered. This allows listeners such as the greedy pattern rewrite
driver to react to IR changes.
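A minimal sketch of using the new API inside a pattern (the op and
operand names are illustrative):
```c++
// Move the operand's producer via the rewriter rather than mutating the
// IR directly; this triggers `notifyOperationInserted` on any listener.
LogicalResult matchAndRewrite(SomeOp op,
                              PatternRewriter &rewriter) const override {
  Operation *producer = op.getOperand().getDefiningOp();
  if (!producer)
    return failure();
  rewriter.moveOpBefore(producer, op);
  return success();
}
```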
This commit narrows the discrepancy between the kind of IR modification
that can be performed and the kind of IR modifications that can be
listened to.
This commit renames 4 pattern rewriter API functions:
* `updateRootInPlace` -> `modifyOpInPlace`
* `startRootUpdate` -> `startOpModification`
* `finalizeRootUpdate` -> `finalizeOpModification`
* `cancelRootUpdate` -> `cancelOpModification`
The term "root" is a misnomer. The root is the op that a rewrite pattern
matches against
(https://mlir.llvm.org/docs/PatternRewriter/#root-operation-name-optional).
A rewriter must be notified of all in-place op modifications, not just
in-place modifications of the root
(https://mlir.llvm.org/docs/PatternRewriter/#pattern-rewriter). The old
function names were confusing and have contributed to various broken
rewrite patterns.
Note: The new function names use the term "modify" instead of "update"
for consistency with the `RewriterBase::Listener` terminology
(`notifyOperationModified`).
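For instance, an in-place update under the renamed API (a sketch; the
attribute name is illustrative):
```c++
// Previously spelled `rewriter.updateRootInPlace(op, ...)`.
rewriter.modifyOpInPlace(op, [&] {
  op->setAttr("some_attr", rewriter.getUnitAttr());
});
```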
In the process, a couple of test transform dialect ops are added just
for testing. These operations are not intended to be used as fully
fleshed-out transformation ops, but are rather operations added for
testing.
A separate operation is added to `LinalgTransformOps.td` to convert a
`TilingInterface` operation to loops using the
`generateScalarImplementation` method implemented by the
operation. Eventually, this and other operations related to tiling
using the `TilingInterface` need to move to a better place (i.e., out
of the `Linalg` dialect).
Adjust the silenceable failure message as we lower `tensor.unpack` as a
combination of `linalg.transpose` + `tensor.collapse_shape` and
`tensor.extract_slice`.
The test was failing due to a different transform sequence declaration (an unnamed transform sequence was used, while it should now be a named transform sequence). The test is now fixed.
The current vectorization of 1D depthwise convolutions in Linalg is
_sub-optimal_ for tensors with a low number of channels, e.g.:
```mlir
linalg.depthwise_conv_1d_nwc_wc
{dilations = dense<1> : vector<1xi64>,
strides = dense<1> : vector<1xi64>}
ins(%input, %filter : tensor<1x8x3xi8>, tensor<1x3xi8>)
outs(%output : tensor<1x8x3xi8>) -> tensor<1x8x3xi8>
```
That's due to the fact that ultimately (i.e. at the LLVM level),
vectorization happens along the trailing dimension (i.e. the channel
dimension). In this case it leads to vectors with 3 elements (or worse,
e.g. if there's only 1 channel). For comparison, a 128-bit wide vector
register can hold 16 x i8.
Instead, this patch adds an option to flatten/collapse the channel
dimension into the width dimension of the input/filter/output using the
`vector.shape_cast` operation:
```mlir
%sc_input = vector.shape_cast %input : vector<1x8x3xi8> to vector<1x24xi8>
%sc_output = vector.shape_cast %output : vector<1x8x3xi8> to vector<1x24xi8>
%b_filter = vector.broadcast %filter : vector<3xi8> to vector<1x8x3xi8>
%sc_filter = vector.shape_cast %b_filter : vector<1x8x3xi8> to vector<1x24xi8>
```
This new vectorization mode is implemented in `depthwiseConv` by
inserting `vector.shape_cast` Ops before and after
`depthwiseConv1dSliceAsMulAcc` is invoked. It can be selected through
e.g. a transform dialect attribute:
```mlir
transform.structured.vectorize_children_and_apply_patterns %conv {flatten_1d_depthwise_conv}
```
A forthcoming patch will implement a strategy to automatically switch
between the two implementations, depending on the shape of the input
tensors.
Co-authored-by: Bradley Smith <bradley.smith@arm.com>
`TileUsingForOp` has an optional `interchange` Attribute, which was
given in curly braces like this: `{interchange = [...]}`. The way this
was parsed meant that no `attr-dict` could be attached to the Op.
This patch adds printing/parsing of an `attr-dict` to the Op and
prints/parses the `interchange` Attribute separately from the
discardable Attributes.