This adds a rewrite that converts illegal 2D unit-dim `shape_casts` into
`vector.transpose` ops.
E.g.
```mlir
// Case 1:
%a = vector.shape_cast %0 : vector<[4]x1xf32> to vector<1x[4]xf32>
// Case 2:
%b = vector.shape_cast %1 : vector<[4]x1xf32> to vector<[4]xf32>
```
Becomes:
```mlir
// Case 1:
%a = vector.transpose %0 : [1, 0] vector<[4]x1xf32> to vector<1x[4]xf32>
// Case 2:
%t = vector.transpose %1 : [1, 0] vector<[4]x1xf32> to vector<1x[4]xf32>
%b = vector.shape_cast %t : vector<1x[4]xf32> to vector<[4]xf32>
```
Various lowerings and drop unit-dims patterns add such shape_casts,
however, if they do not cancel out (which they likely won't if we've
reached the vector-legalization pass) they will prevent lowering the IR.
Rewriting them as a transpose gives `LiftIllegalVectorTransposeToMemory`
a chance to eliminate the illegal types.
This patch introduces support for 4-way widening outer products. This
enables the fusion of 4 'arm_sme.outerproduct' operations that are
chained via the accumulator into single widened operations.
Changes:
- Adds the following operations:
- smopa_4way, smops_4way
- umopa_4way, umops_4way
- sumopa_4way, sumops_4way
- sumopa_4way, sumops_4way
- Implements conversions for the above ops to intrinsics in ArmSMEToLLVM.
- Extends 'arm-sme-outer-product' pass.
For a detailed description of these operations see the
'arm_sme.smopa_4way' description.
When unrolling the reduction dimension of something like a matmul for
SME, you can end up with transposed reads of illegal types, like so:
```mlir
%illegalRead = vector.transfer_read %memref[%a, %b]
: memref<?x?xf32>, vector<[8]x4xf32>
%legalType = vector.transpose %illegalRead, [1, 0]
: vector<[8]x4xf32> to vector<4x[8]xf32>
```
Here the `vector<[8]x4xf32>` is an illegal type, there's no way to lower
a scalable vector of fixed vectors. However, as the final type
`vector<4x[8]xf32>` is legal, we can instead lift the transpose to
memory (producing a strided memref), and eliminate all the illegal
types. This is shown below.
```mlir
%readSubview = memref.subview %memref[%a, %b] [%c8_vscale, %c4] [%c1, %c1]
: memref<?x?xf32> to memref<?x?xf32>
%transpose = memref.transpose %readSubview (d0, d1) -> (d1, d0)
: memref<?x?xf32> to memref<?x?xf32>
%legalType = vector.transfer_read %transpose[%c0, %c0]
: memref<?x?xf32>, vector<4x[8]xf32>
```
In mixed matmul lowering (e.g., i8 to i32) we're seeing the following
sequence:
%0 = arith.extsi %src : vector<4x[8]xi8> to vector<4x[8]xi32>
%1 = vector.extract %0[0] : vector<[8]xi32> from vector<4x[8]xi32>
%lhs = vector.scalable.extract %1[0] : vector<[4]xi32> from
vector<[8]xi32>
... (same for rhs)
%2 = vector.outerproduct %lhs, %rhs, %acc vector<[4]xi32>,
vector<[4]xi32>
// x4 chained by accumulator
This chain of 4 outer products can be fused into a single 4-way widening
variant but the pass doesn't match on the IR, as it expects the source
of the inputs to be an extend and it can't look through the extracts.
This patch fixes this with two rewrites that swaps extract(extend) into
extend(extract).
Related to #78975, #79288.
When unrolling the reduction dimension of something like a matmul for
SME, it is possible to get 3D masks, which are vectors of SME-like
masks. The 2D masks for individual operations are then extracted from
the 3D masks.
i.e.:
```mlir
%mask = vector.create_mask %nonConstantDim, %a, %b : vector<4x[4]x[4]xi1>
%subMask = vector.extract %mask[2]
: vector<[4]x[4]xi1> from vector<4x[4]x[4]xi1>
```
ArmSME only supports lowering 2D create_masks, so we must fold the
extract into the create_mask. This can be done by checking if the
extraction index is within the true region, then using that select the
first dimension of the 2D mask. This is shown below.
```mlir
%extractionInTrueRegion = arith.cmpi slt, %c2, %nonConstantDim : index
%newMaskFrontDim = arith.select %extractionInTrueRegion, %a, %c0 : index
%subMask = vector.create_mask %newMaskFrontDim, %b : vector<[4]x[4]xi1>
```
This adds a new pass (`-arm-sme-vector-legalization`) which legalizes
vector operations so that they can be lowered to ArmSME. This initial
patch adds decomposition for `vector.outerproduct`,
`vector.transfer_read`, and `vector.transfer_write` when they operate on
vector types larger than a single SME tile. For example, a [8]x[8]xf32
outer product would be decomposed into four [4]x[4]xf32 outer products,
which could then be lowered to ArmSME. These three ops have been picked
as supporting them alone allows lowering matmuls that use all ZA
accumulators to ArmSME.
For it to be possible to legalize a vector type it has to be a multiple
of an SME tile size, but other than that any shape can be used. E.g.
`vector<[8]x[8]xf32>`, `vector<[4]x[16]xf32>`, `vector<[16]x[4]xf32>`
can all be lowered to four `vector<[4]x[4]xf32>` operations.
In future, this pass will be extended with more SME-specific rewrites to
legalize unrolling the reduction dimension of matmuls (which is not
type-decomposition), which is why the pass has quite a general name.
This patch introduces support for 2-way widening outer products. This
enables the fusion of 2 'arm_sme.outerproduct' operations that are
chained via the accumulator into a 2-way widening outer product
operation.
Changes:
- Add 'llvm.aarch64.sme.[us]mop[as].za32' intrinsics for 2-way variants.
These map to instruction variants added in SME2 and use different
intrinsics. Intrinsics are already implemented for widening variants
from SME1.
- Adds the following operations:
- fmopa_2way, fmops_2way
- smopa_2way, smops_2way
- umopa_2way, umops_2way
- Implements conversions for the above ops to intrinsics in
ArmSMEToLLVM.
- Adds a pass 'arm-sme-outer-product-fusion' that fuses
'arm_sme.outerproduct' operations.
For a detailed description of these operations see the
'arm_sme.fmopa_2way' description.
The reason for introducing many operations rather than one is the
signed/unsigned variants can't be distinguished with types (e.g., ui16,
si16) since 'arith.extui' and 'arith.extsi' only support signless
integers. A single operation would require this information and an
attribute (for example) for the sign doesn't feel right if
floating-point types are also supported where this wouldn't apply.
Furthermore, the SME FP8 extensions (FEAT_SME_F8F16, FEAT_SME_F8F32)
introduce FMOPA 2-way (FP8 to FP16) and 4-way (FP8 to FP32) variants but
no subtract variant. Whilst these are not supported in this patch, it
felt simpler to have separate ops for add/subtract given this.
This commit renames 4 pattern rewriter API functions:
* `updateRootInPlace` -> `modifyOpInPlace`
* `startRootUpdate` -> `startOpModification`
* `finalizeRootUpdate` -> `finalizeOpModification`
* `cancelRootUpdate` -> `cancelOpModification`
The term "root" is a misnomer. The root is the op that a rewrite pattern
matches against
(https://mlir.llvm.org/docs/PatternRewriter/#root-operation-name-optional).
A rewriter must be notified of all in-place op modifications, not just
in-place modifications of the root
(https://mlir.llvm.org/docs/PatternRewriter/#pattern-rewriter). The old
function names were confusing and have contributed to various broken
rewrite patterns.
Note: The new function names use the term "modify" instead of "update"
for consistency with the `RewriterBase::Listener` terminology
(`notifyOperationModified`).
This commit fixes rewrite pattern API violations:
* Rewrite pattern must return "failure" if the IR was not modified.
* In-place op modifications must be communicated to the rewriter
(`updateRootInPlace`).
This commit fixes `test/Dialect/ArmSVE/legalize-vector-storage.mlir`,
`test/Dialect/ArmSME/vector-ops-to-llvm.mlir`,
`test/Dialect/ArmSME/tile-allocation-invalid.mlir`,
`test/Conversion/ArmSMEToLLVM/arm-sme-to-llvm.mlir`,
`test/Conversion/ArmSMEToLLVM/tile-spills-and-fills.mlir`,
`test/Conversion/ArmSMEToLLVM/unsupported.mlir` when running with
`MLIR_ENABLE_EXPENSIVE_PATTERN_API_CHECKS`.
---------
Co-authored-by: Benjamin Maxwell <macdue@dueutil.tech>
This adds very basic (and inelegant) support for something like spilling
and reloading tiles, if you use more SME tiles than physically exist.
This is purely implemented to prevent the compiler from aborting if a
function uses too many tiles (i.e. due to bad unrolling), but is
expected to perform very poorly.
Currently, this works in two stages:
During tile allocation, if we run out of tiles instead of giving up, we
switch to allocating 'in-memory' tile IDs. These are tile IDs that start
at 16 (which is higher than any real tile ID). A warning will also be
emitted for each (root) tile op assigned an in-memory tile ID:
```
warning: failed to allocate SME virtual tile to operation, all tile operations will go through memory, expect degraded performance
```
Everything after this works like normal until `-convert-arm-sme-to-llvm`
Here the in-memory tile op:
```mlir
arm_sme.tile_op { tile_id = <IN MEMORY TILE> }
```
Is lowered to:
```mlir
// At function entry:
%alloca = memref.alloca ... : memref<?x?xty>
// Around the op:
// Swap the contents of %alloca and tile 0.
scf.for %slice_idx {
%current_slice = "arm_sme.intr.read.horiz" ... <{tile_id = 0 : i32}>
"arm_sme.intr.ld1h.horiz"(%alloca, %slice_idx) <{tile_id = 0 : i32}>
vector.store %current_slice, %alloca[%slice_idx, %c0]
}
// Execute op using tile 0.
arm_sme.tile_op { tile_id = 0 }
// Swap the contents of %alloca and tile 0.
// This restores tile 0 to its original state.
scf.for %slice_idx {
%current_slice = "arm_sme.intr.read.horiz" ... <{tile_id = 0 : i32}>
"arm_sme.intr.ld1h.horiz"(%alloca, %slice_idx) <{tile_id = 0 : i32}>
vector.store %current_slice, %alloca[%slice_idx, %c0]
}
```
This is inserted during the lowering to LLVM as spilling/reloading
registers is a very low-level concept, that can't really be modeled
correctly at a high level in MLIR.
Note: This is always doing the worst case full-tile swap. This could be
optimized to only spill/load data the tile op will use, which could be
just a slice. It's also not making any use of liveness, which could
allow reusing tiles. But these is not seen as important as correct code
should only use the available number of tiles.
When operations are modified in-place, the rewriter must be notified.
This commit fixes `mlir/test/Conversion/ArmSMEToLLVM/unsupported.mlir`,
`mlir/test/Dialect/ArmSME/tile-zero-masks.mlir` and
`mlir/test/Dialect/ArmSME/vector-ops-to-llvm.mlir` when running with
`MLIR_ENABLE_EXPENSIVE_PATTERN_API_CHECKS` enabled.
This patch removes the ArmSMETypeConverter, and instead updates
`populateArmSMEToLLVMConversionPatterns()` to add an ArmSME vector type
conversion to the existing LLVMTypeConverter. This makes it easier to
add these patterns to an existing `-to-llvm` lowering pass.
This adds a `only-if-required-by-ops` flag to the `enable-arm-streaming`
pass. This flag defaults to `false` (which preserves the original
behaviour), however, if set to `true` the pass will only add the
selected ZA/streaming mode to functions that contain ops that implement
`ArmSMETileOpInterface`.
This simplifies enabling these modes, as we can now first try lowering
ops to ArmSME, then only if we succeed, add the relevant function
attributes.
This reworks the ArmSME dialect to use attributes for tile allocation.
This has a number of advantages and corrects some issues with the
previous approach:
* Tile allocation can now be done ASAP (i.e. immediately after
`-convert-vector-to-arm-sme`)
* SSA form for control flow is now supported (e.g.`scf.for` loops that
yield tiles)
* ArmSME ops can be converted to intrinsics very late (i.e. after
lowering to control flow)
* Tests are simplified by removing constants and casts
* Avoids correctness issues with representing LLVM `immargs` as MLIR
values
- The tile ID on the SME intrinsics is an `immarg` (so is required to be
a compile-time constant), `immargs` should be mapped to MLIR attributes
(this is already the case for intrinsics in the LLVM dialect)
- Using MLIR values for `immargs` can lead to invalid LLVM IR being
generated (and passes such as -cse making incorrect optimizations)
As part of this patch we bid farewell to the following operations:
```mlir
arm_sme.get_tile_id : i32
arm_sme.cast_tile_to_vector : i32 to vector<[4]x[4]xi32>
arm_sme.cast_vector_to_tile : vector<[4]x[4]xi32> to i32
```
These are now replaced with:
```mlir
// Allocates a new tile with (indeterminate) state:
arm_sme.get_tile : vector<[4]x[4]xi32>
// A placeholder operation for lowering ArmSME ops to intrinsics:
arm_sme.materialize_ssa_tile : vector<[4]x[4]xi32>
```
The new tile allocation works by operations implementing the
`ArmSMETileOpInterface`. This interface says that an operation needs to
be assigned a tile ID, and may conditionally allocate a new SME tile.
Operations allocate a new tile by implementing...
```c++
std::optional<arm_sme::ArmSMETileType> getAllocatedTileType()
```
...and returning what type of tile the op allocates (ZAB, ZAH, etc).
Operations that don't allocate a tile return `std::nullopt` (which is
the default behaviour).
Currently the following ops are defined as allocating:
```mlir
arm_sme.get_tile
arm_sme.zero
arm_sme.tile_load
arm_sme.outerproduct // (if no accumulator is specified)
```
Allocating operations become the roots for the tile allocation pass,
which currently just (naively) assigns all transitive uses of a root
operation the same tile ID. However, this is enough to handle current
use cases.
Once tile IDs have been allocated subsequent rewrites can forward the
tile IDs to any newly created operations.
This gives more flexibility with when these lowerings are performed,
without also lowering unrelated vector ops.
This is a NFC (other than adding a new `-convert-arm-sme-to-llvm` pass)
Previously, we were inserting za.enable/disable intrinsics for functions
with the "arm_za" attribute (at the MLIR level), rather than using the
backend attributes. This was done to avoid a dependency on the SME ABI
functions from compiler-rt (which have only recently been implemented).
Doing things this way did have correctness issues, for example, calling
a streaming-mode function from another streaming-mode function (both
with ZA enabled) would lead to ZA being disabled after returning to the
caller (where it should still be enabled). Fixing issues like this would
require re-doing the ABI work already done in the backend within MLIR.
Instead, this patch switches to use the "arm_new_za" (backend) attribute
for enabling ZA for an MLIR function. For the integration tests, this
requires some way of linking the SME ABI functions. This is done via the
`%arm_sme_abi_shlib` lit substitution. By default, this expands to a
stub implementation of the SME ABI functions, but this can be overridden
by providing the `ARM_SME_ABI_ROUTINES_SHLIB` CMake cache variable
(pointing it at an alternative implementation). For now, the ArmSME
integration tests pass with just stubs, as we don't make use of nested
ZA-enabled calls.
A future patch may add an option to compiler-rt to build the SME
builtins into a standalone shared library to allow easily
building/testing with the actual implementation.
This patch adds support for lowering masked outer products to SME. This
is done in two stages. First, vector.outerproducts (both masked and
non-masked) are rewritten to arm_sme.outerproducts. The
arm_sme.outerproduct op is close to vector.outerproduct, but supports
masking on the operands rather than the result. It also limits the cases
it handles to things that could be (directly) lowered to SME.
This currently requires that the source of the mask is a
vector.create_mask op. E.g.:
```mlir
%mask = vector.create_mask %dimA, %dimB : vector<[4]x[4]xi1>
%result = vector.mask %mask {
vector.outerproduct %vecA, %vecB
: vector<[4]xf32>, vector<[4]xf32>
} : vector<[4]x[4]xi1> -> vector<[4]x[4]xf32>
```
Is rewritten to:
```
%maskA = vector.create_mask %dimA : vector<[4]xi1>
%maskB = vector.create_mask %dimB : vector<[4]xi1>
%result = arm_sme.outerproduct %vecA, %vecB masks(%maskA, %maskB)
: vector<[4]xf32>, vector<[4]xf32>
```
(The same rewrite works for non-masked vector.outerproducts too)
The arm_sme.outerproduct can then be directly lowered to SME intrinsics.
This is used in #69148 when lowering masked tile_store with non-zero
pad, see #69148
This updates:
* `arm_sme.move_vector_to_tile_slice`
* `arm_sme.move_tile_slice_to_vector`
This patch adds support for lowering vector.insert/extract of tile
slices or elements to ArmSME MOVA intrinsics.
This enables the following operations for ArmSME:
```
// Extract slice from tile:
%slice = vector.extract %tile[%row]
: vector<[4]xi32> from vector<[4]x[4]xi32>
```
```
// Extract element from tile:
%el = vector.extract %tile[%row, %col]
: i32 from vector<[4]x[4]xi32>
```
```
// Insert slice into tile:
%new_tile = vector.insert %slice, %tile[%row]
: vector<[4]xi32> into vector<[4]x[4]xi32>
```
```
// Insert element into tile;
%new_tile = vector.insert %el, %tile[%row, %col]
: i32 into vector<[4]x[4]xi32>
```
This adds a simple higher-level op for the tile slice to vector
intrinsics (and updates the existing vector.print lowering to use it).
This op will be used a few more times to implement vector.insert/extract
lowerings in later patches.
This adds a custom lowering for SME that loops over each row of the
tile, extracting it via an SME MOVA, then printing with a normal 1D
vector.print.
This makes writing SME integration tests easier and less verbose.
Depends on: #66910, #66911
In SME a ZA tile slice is a one-dimensional set of horizontally or
vertically contiguous elements within a ZA tile. Currently the load and
store ops only support horizontal tile slices. This patch adds a tile
slice layout attribute to the load and store ops to support both
horizontal and vertical tile slices.
When lowering from Vector dialect horizontal layout is the default.
This attribute makes the `enable_arm_streaming` pass ignore a function
(i.e. not add the enable streaming/za attributes). The main use case for
this is to prevent helper functions within tests being made streaming
functions.
This patch adds support for lowering vector.outerproduct to the ArmSME
MOPA intrinsic for the following types:
vector<[8]xf16>, vector<[8]xf16> -> vector<[8]x[8]xf16>
vector<[8]xbf16>, vector<[8]xbf16> -> vector<[8]x[8]xbf16>
vector<[4]xf32>, vector<[4]xf32> -> vector<[4]x[4]xf32>
vector<[2]xf64>, vector<[2]xf64> -> vector<[2]x[2]xf64>
The FP variants are lowered to FMOPA (non-widening) [1] and BFloat to
BFMOPA
(non-widening) [2].
Note at the ISA level these variants are implemented by different
architecture features, these are listed below:
FMOPA (non-widening)
* half-precision - +sme2p1,+sme-f16f16
* single-precision - +sme
* double-precision - +sme-f64f64
BFMOPA (non-widening)
* half-precision - +sme2p1,+b16b16
There's currently no way to target different features when lowering to
ArmSME. Integration tests are added for F32 and F64. We use QEMU to run
the integration tests but SME2 support isn't available yet, it's
targeted for 9.0, so integration tests for these variants excluded.
Masking is currently unsupported.
Depends on #65450.
[1] https://developer.arm.com/documentation/ddi0602/2023-06/SME-Instructions/FMOPA--non-widening---Floating-point-outer-product-and-accumulate-
[2] https://developer.arm.com/documentation/ddi0602/2023-06/SME-Instructions/BFMOPA--non-widening---BFloat16-floating-point-outer-product-and-accumulate-
The arm_sme.get_tile_id op returns a scalar integer but the arm_sme.zero
op lowering incorrectly uses the element type, which could be
floating-point.
Reviewed By: awarzynski, benmxwl-arm
Differential Revision: https://reviews.llvm.org/D159080
This adds a 'move_vector_to_tile_slice' op to the ArmSME dialect that
moves a 1-D scalable vector to a slice of a 2-D tile at a given index.
This is lowered to the 'llvm.aarch64.sme.write.horiz' intrinsic that
maps to the MOVA (vector to tile, single) SME instruction [1] when
lowering to LLVM. Like the SME load and store instructions this operates
on ZA tile slices, which are 1D vectors of horizontally or vertically
contiguous elements within a ZA tile.
This patch extends the lowering of 'arith.constant' to SME to support
non-zero constants using this new op. This requires materializing a
loop that broadcasts the constant to each tile slice with the
'vector_to_tile_slice' op. Unlike load and store, this is done during
conversion from Vector to ArmSME, rather than ArmSME to SCF. The latter
would require a higher-level custom op in the ArmSME dialect like
'tile_load' and 'tile_store' and this isn't necessary. We may also
remove the load and store ops in the future in favour of lowering
straight from Vector, at which point this would converge.
Currently only horizontal tile slices are supported. A future patch will
extend this mechanism to support 'vector.broadcast'.
Depends on D156980 D157004
[1] https://developer.arm.com/documentation/ddi0602
Reviewed By: awarzynski, dcaballe
Differential Revision: https://reviews.llvm.org/D157005
This follows from D155306.
Loads and stores of 128-bit tiles have been confirmed to work in the
`load-store-128-bit-tile.mlir` integration test. However, there is
currently a bug in QEMU (see: https://gitlab.com/qemu-project/qemu/-/issues/1833)
which means this test produces incorrect results (a patch for this issue
is available but not yet in any released version of QEMU). Until a
fixed version of QEMU is available the integration test is expected to fail.
Reviewed By: c-rhodes, awarzynski
Differential Revision: https://reviews.llvm.org/D158418
This patch updates the lowering of the arm_sme.zero to intrinsics so
that it calculates the correct mask for the tile to zero.
The zero instruction takes an 8-bit mask which specifies which 64-bit
tiles to zero, ZA0.D to ZA7.D correspond to bits 0 to 7. To zero tiles
with element sizes of 8-bit to 32-bit just requires zeroing the right
64-bit tiles.
This is quite easy to calculate, each size has a "base mask" which can
be shifted left by the tile ID to get the mask for that tile.
base_mask << tile_id
After tile allocation, this will be folded to a constant mask.
Reviewed By: awarzynski
Differential Revision: https://reviews.llvm.org/D157902
This patch extends the ArmSME load and store op lowering to use the
memref indices. An integration test that loads two 32-bit element ZA
tiles from memory and stores them back to memory in reverse order to
verify this is added.
Depends on D156467 D156558
Reviewed By: awarzynski, dcaballe
Differential Revision: https://reviews.llvm.org/D156689
Currently a loop is materialized when lowering ArmSME loads and stores
to intrinsics. This patch introduces two new ops to the ArmSME dialect
that map 1-1 with intrinsics:
1. arm_sme.load_tile_slice - Loads a 1D tile slice from
memory into a 2D SME "virtual tile".
2. arm_sme.store_tile_slice - Stores a 1D tile slice from a 2D SME
"virtual tile" into memory.
As well as a new conversion pass '-convert-arm-sme-to-scf' that
materializes loops with these ops. The existing load/store lowering to
intrinsics is updated to use these ops.
Depends on D156517
Discourse thread:
https://discourse.llvm.org/t/loop-materialization-in-armsme/72354
Reviewed By: awarzynski, dcaballe, WanderAway
Differential Revision: https://reviews.llvm.org/D156467
This extends the existing 'arm_sme.tile_store' op to support all tile
sizes and adds a new op 'arm_sme.tile_load', as well as lowerings from
vector -> custom ops and custom ops -> intrinsics. Currently there's no
lowering for i128.
Depends on D154867
Reviewed By: awarzynski, dcaballe
Differential Revision: https://reviews.llvm.org/D155306
At the moment, SME-to-LLVM lowerings rely entirely on
`LLVMTypeConverter`. This patch introduces a dedicated `TypeConverter`
that inherits from `LLVMTypeConverter` (it will also be used when
lowering ArmSME Ops to LLVM).
The new type converter merely disables lowerings for `VectorType` to
prevent 2-d scalable vectors (common in the context of ArmSME), e.g.
`vector<[16]x[16]xi8>`,
entering the LLVM Type converter. LLVM does not support arrays of
scalable vectors and hence the need for specialisation. In the case of
SME such types are effectively eliminated when emitting LLVM IR
intrinsics for SME.
Differential Revision: https://reviews.llvm.org/D155365
This patch adds a pass '-allocate-sme-tiles' to the ArmSME dialect that
implements allocation of SME ZA tiles.
It does this at the 'func.func' op level by replacing
'arm_sme.get_tile_id' ops with 'arith.constant' ops that represent the
tile number. The tiles in use in a given function are tracked by an
integer function attribute 'arm_sme.tiles_in_use' that is a 16-bit tile
mask with a bit for each 128-bit element tile (ZA0.Q-ZA15.Q), the
smallest ZA tile granule. This is initialized on the first
'arm_sme.get_tile_id' rewrite and updated on each subsequent rewrite.
Mixing of different element tile types is supported.
Section B2.3.2 of the SME spec [1] describes how the 128-bit element
tiles overlap with other element tiles.
Depends on D154941
[1] https://developer.arm.com/documentation/ddi0616/aa
Reviewed By: awarzynski
Differential Revision: https://reviews.llvm.org/D154955
At the moment, the lowering from the Vector dialect to SME looks like
this:
* Vector --> SME LLVM IR intrinsics
This patch introduces a new lowering layer between the Vector dialect
and the Arm SME extension:
* Vector --> ArmSME dialect (custom Ops) --> SME LLVM IR intrinsics.
This is motivated by 2 considerations:
1. Storing `ZA` to memory (e.g. `vector.transfer_write`) requires an
`scf.for` loop over all rows of `ZA`. Similar logic will apply to
"load to ZA from memory". This is a rather complex transformation and
a custom Op seems justified.
2. As discussed in [1], we need to prevent the LLVM type converter from
having to convert types unsupported in LLVM, e.g.
`vector<[16]x[16]xi8>`. A dedicated abstraction layer with custom Ops
opens a path to some fine tuning (e.g. custom type converters) that
will allow us to avoid this.
To facilitate this change, two new custom SME Op are introduced:
* `TileStoreOp`, and
* `ZeroOp`.
Note that no new functionality is added - these Ops merely model what's
already supported. In particular, the following tile size is assumed
(dimension and element size are fixed):
* `vector<[16]x[16]xi8>`
The new lowering layer is introduced via a conversion pass between the
Vector and the SME dialects. You can use the `-convert-vector-to-sme`
flag to run it. The following function:
```
func.func @example(%arg0 : memref<?x?xi8>) {
// (...)
%cst = arith.constant dense<0> : vector<[16]x[16]xi8>
vector.transfer_write %cst, %arg0 : vector<[16]x[16]xi8>, memref<?x?xi8>
return
}
```
would be lowered to:
```
func.func @example(%arg0: memref<?x?xi8>) {
// (...)
%0 = arm_sme.zero : vector<[16]x[16]xi8>
arm_sme.tile_store %arg0[%c0, %c0], %0 : memref<?x?xi8>, vector<[16]x[16]xi8>
return
}
```
Later, a mechanism will be introduced to guarantee that `arm_sme.zero`
and `arm_sme.tile_store` operate on the same virtual tile. For `i8`
elements this is not required as there is only one tile.
In order to lower the above output to LLVM, use
* `-convert-vector-to-llvm="enable-arm-sme"`.
[1] https://github.com/openxla/iree/issues/14294
Reviewed By: WanderAway
Differential Revision: https://reviews.llvm.org/D154867
This patch adds three new custom ops to the ArmSME dialect:
* arm_sme.get_tile_id - returns a scalar integer representing an SME
"virtual tile" that is not in use.
* arm_sme.cast_tile_to_vector - casts from a tile id to a 2-d scalable
vector type, which represents an SME "virtual tile".
* arm_sme.cast_vector_to_tile - casts from a 2-d scalable vector type,
which represents an SME "virtual tile", to a tile id.
The 'arm_sme.get_tile_id' op currently only supports tile 0, a follow-up
patch will implement proper tile allocation. A further follow-up patch
will demonstrate load/store to/from ZA using these ops.
See the op descriptions for further details and examples.
Thanks to @paulwalker-arm and @awarzynski for helping drive this.
Reviewed By: awarzynski, dcaballe
Differential Revision: https://reviews.llvm.org/D154941
This patch adds two LLVM intrinsics to the ArmSME dialect:
* llvm.aarch64.sme.za.enable
* llvm.aarch64.sme.za.disable
for enabling the ZA storage array [1], as well as patterns for inserting
them during legalization to LLVM at the start and end of functions if
the function has the 'arm_za' attribute (D152695).
In the future ZA should probably be automatically enabled/disabled when
lowering from vector to SME, but this should be sufficient for now at
least until we have patterns lowering to SME instructions that use ZA.
N.B. The backend function attribute 'aarch64_pstate_za_new' can be used
manage ZA state (as was originally tried in D152694), but it emits calls
to the following SME support routines [2] for the lazy-save mechanism
[3]:
* __arm_tpidr2_restore
* __arm_tpidr2_save
These will soon be added to compiler-rt but there's currently no public
implementation, and using this attribute would introduce an MLIR
dependency on compiler-rt. Furthermore, this mechanism is for routines
with ZA enabled calling other routines with it also enabled. We can
choose not to enable ZA in the compiler when this is case.
Depends on D152695
[1] https://developer.arm.com/documentation/ddi0616/aa
[2] https://github.com/ARM-software/abi-aa/blob/main/aapcs64/aapcs64.rst#sme-support-routines
[3] https://github.com/ARM-software/abi-aa/blob/main/aapcs64/aapcs64.rst#the-za-lazy-saving-scheme
Reviewed By: awarzynski, dcaballe
Differential Revision: https://reviews.llvm.org/D153050
This patch extends the 'enable-arm-streaming' pass with a new option to
enable the ZA storage array by adding the 'arm_za' attribute to
'func.func' ops.
A later patch will insert `llvm.aarch64.sme.za.enable` at the beginning
of 'func.func' ops and `llvm.aarch64.sme.za.disable` before
`func.return` statements when lowering to LLVM dialect.
Currently the pass only supports enabling ZA with streaming-mode on but
the SME LDR, STR and ZERO instructions can access ZA when not in
streaming-mode (section B1.1.1, IDGNQM [1]), so it may be worth making
these options independent in the future.
N.B. This patch is generally useful in the context of SME enablement in
MLIR, but it will help enable writing an integration test for rewrite
pattern that lowers `vector.transfer_write` -> `zero {za}` (D152508).
[1] https://developer.arm.com/documentation/ddi0616/aa
Reviewed By: awarzynski, dcaballe
Differential Revision: https://reviews.llvm.org/D152695