Previously, for masked tile loads/stores we directly used the dimension
size from the `vector.create_mask` operation as the upper bound of the
`scf.for` over the tile slices. This was not correct, as `create_mask`
allows operands to be greater than the size of the vector dimension, in
which case the for loop bounds should be clamped to the number of tile
slices.
This will make fixing a bug (next patch) a change to one place, rather
than fixing three separate rewrites.
Note: `TileLoadOpWithMaskAndPadZeroConversion` has been merged into
`TileLoadOpConversion`, since after this change those two rewrites were
pretty much identical.
Since #73253, loops over tiles in SSA form (i.e. loops that take
`iter_args` and yield a new tile) are supported, so this patch updates
ArmSME lowerings to this form. This is a NFC, as it still lowers to the
same intrinsics, but this makes IR less 'surprising' at a higher-level,
and may be recognised by more transforms.
Example:
IR before:
```mlir
scf.for %tile_slice_index = %c0 to %num_tile_slices step %c1
{
arm_sme.move_vector_to_tile_slice
%broadcast_to_1d, %tile, %tile_slice_index :
vector<[4]xi32> into vector<[4]x[4]xi32>
}
// ... later use %tile
```
IR now:
```mlir
%broadcast_to_tile = scf.for %tile_slice_index = %c0 to %num_tile_slices
step %c1 iter_args(%iter_tile = %init_tile) -> (vector<[4]x[4]xi32>)
{
%tile_update = arm_sme.move_vector_to_tile_slice
%broadcast_to_1d, %iter_tile, %tile_slice_index :
vector<[4]xi32> into vector<[4]x[4]xi32>
scf.yield %tile_update : vector<[4]x[4]xi32>
}
// ... later use %broadcast_to_tile
```
This moves the SME tile vector.print lowering from
`-convert-arm-sme-to-scf` to `-convert-vector-to-arm-sme`. This seems
like a more logical place, as this is lowering a vector op to ArmSME,
and it also prevents vector.print from blocking tile allocation.
This reworks the ArmSME dialect to use attributes for tile allocation.
This has a number of advantages and corrects some issues with the
previous approach:
* Tile allocation can now be done ASAP (i.e. immediately after
`-convert-vector-to-arm-sme`)
* SSA form for control flow is now supported (e.g.`scf.for` loops that
yield tiles)
* ArmSME ops can be converted to intrinsics very late (i.e. after
lowering to control flow)
* Tests are simplified by removing constants and casts
* Avoids correctness issues with representing LLVM `immargs` as MLIR
values
- The tile ID on the SME intrinsics is an `immarg` (so is required to be
a compile-time constant), `immargs` should be mapped to MLIR attributes
(this is already the case for intrinsics in the LLVM dialect)
- Using MLIR values for `immargs` can lead to invalid LLVM IR being
generated (and passes such as -cse making incorrect optimizations)
As part of this patch we bid farewell to the following operations:
```mlir
arm_sme.get_tile_id : i32
arm_sme.cast_tile_to_vector : i32 to vector<[4]x[4]xi32>
arm_sme.cast_vector_to_tile : vector<[4]x[4]xi32> to i32
```
These are now replaced with:
```mlir
// Allocates a new tile with (indeterminate) state:
arm_sme.get_tile : vector<[4]x[4]xi32>
// A placeholder operation for lowering ArmSME ops to intrinsics:
arm_sme.materialize_ssa_tile : vector<[4]x[4]xi32>
```
The new tile allocation works by operations implementing the
`ArmSMETileOpInterface`. This interface says that an operation needs to
be assigned a tile ID, and may conditionally allocate a new SME tile.
Operations allocate a new tile by implementing...
```c++
std::optional<arm_sme::ArmSMETileType> getAllocatedTileType()
```
...and returning what type of tile the op allocates (ZAB, ZAH, etc).
Operations that don't allocate a tile return `std::nullopt` (which is
the default behaviour).
Currently the following ops are defined as allocating:
```mlir
arm_sme.get_tile
arm_sme.zero
arm_sme.tile_load
arm_sme.outerproduct // (if no accumulator is specified)
```
Allocating operations become the roots for the tile allocation pass,
which currently just (naively) assigns all transitive uses of a root
operation the same tile ID. However, this is enough to handle current
use cases.
Once tile IDs have been allocated subsequent rewrites can forward the
tile IDs to any newly created operations.
This patch extends ArmSMEToSCF to support lowering of masked tile_load
ops. Only masks created by 'vector.create_mask' are currently supported.
There are two lowerings depending on the pad.
For pad of constant zero, the tile is first zeroed, then only active
rows are loaded.
For non-zero pad, the scalar pad is broadcast to a 1-D vector and a
regular 'vector.masked_load' (will be lowered to SVE, not SME) loads
each slice, with padding specified as a passthru and the 2-D mask
combined into a 1-D mask. The resulting slice is then inserted into the
tile with 'arm_sme.move_vector_to_tile_slice'.
This patch prefixes tile slice layout with `layout` in the
assemblyFormat:
- `<vertical>` -> `layout<vertical>`
- `<horizontal>` -> `layout<horizontal>`
The reason for this change is the current format doesn't play nicely
with additional optional operands, required to support padding and
masking (#69148), as it becomes ambiguous.
This affects the the following ops:
- arm_sme.tile_load
- arm_sme.tile_store
- arm_sme.load_tile_slice
- arm_sme.store_tile_slice
This adds a simple higher-level op for the tile slice to vector
intrinsics (and updates the existing vector.print lowering to use it).
This op will be used a few more times to implement vector.insert/extract
lowerings in later patches.
This adds a custom lowering for SME that loops over each row of the
tile, extracting it via an SME MOVA, then printing with a normal 1D
vector.print.
This makes writing SME integration tests easier and less verbose.
Depends on: #66910, #66911
In SME a ZA tile slice is a one-dimensional set of horizontally or
vertically contiguous elements within a ZA tile. Currently the load and
store ops only support horizontal tile slices. This patch adds a tile
slice layout attribute to the load and store ops to support both
horizontal and vertical tile slices.
When lowering from Vector dialect horizontal layout is the default.
This patch extends the ArmSME load and store op lowering to use the
memref indices. An integration test that loads two 32-bit element ZA
tiles from memory and stores them back to memory in reverse order to
verify this is added.
Depends on D156467 D156558
Reviewed By: awarzynski, dcaballe
Differential Revision: https://reviews.llvm.org/D156689
Currently a loop is materialized when lowering ArmSME loads and stores
to intrinsics. This patch introduces two new ops to the ArmSME dialect
that map 1-1 with intrinsics:
1. arm_sme.load_tile_slice - Loads a 1D tile slice from
memory into a 2D SME "virtual tile".
2. arm_sme.store_tile_slice - Stores a 1D tile slice from a 2D SME
"virtual tile" into memory.
As well as a new conversion pass '-convert-arm-sme-to-scf' that
materializes loops with these ops. The existing load/store lowering to
intrinsics is updated to use these ops.
Depends on D156517
Discourse thread:
https://discourse.llvm.org/t/loop-materialization-in-armsme/72354
Reviewed By: awarzynski, dcaballe, WanderAway
Differential Revision: https://reviews.llvm.org/D156467