48 Commits

Author SHA1 Message Date
Benjamin Maxwell
d1fc59c3b5
[mlir][ArmSME] Rewrite illegal shape_casts to vector.transpose ops (#82985)
This adds a rewrite that converts illegal 2D unit-dim `shape_casts` into
`vector.transpose` ops.

E.g.

```mlir
// Case 1:
%a = vector.shape_cast %0 : vector<[4]x1xf32> to vector<1x[4]xf32>
// Case 2:
%b = vector.shape_cast %1 : vector<[4]x1xf32> to vector<[4]xf32>
```

Becomes:

```mlir
// Case 1:
%a = vector.transpose %0 : [1, 0] vector<[4]x1xf32> to vector<1x[4]xf32>
// Case 2:
%t = vector.transpose %1 : [1, 0] vector<[4]x1xf32> to vector<1x[4]xf32>
%b = vector.shape_cast %t : vector<1x[4]xf32> to vector<[4]xf32>
```

Various lowerings and drop unit-dims patterns add such shape_casts,
however, if they do not cancel out (which they likely won't if we've
reached the vector-legalization pass) they will prevent lowering the IR.

Rewriting them as a transpose gives `LiftIllegalVectorTransposeToMemory`
a chance to eliminate the illegal types.
2024-03-07 17:04:12 +00:00
Benjamin Maxwell
8cfb71613c
[mlir][ArmSME] Replace use of isa with isa_and_present (#82798)
`op` can be null here, in which case this should just return a null
value back.
2024-02-26 09:44:26 +00:00
Benjamin Maxwell
1408667fdd [mlir][ArmSME] Follow MLIR constant style in VectorLegalization.cpp (NFC) 2024-02-23 16:55:32 +00:00
Cullen Rhodes
fff86c6111
[mlir][ArmSME] Support 4-way widening outer products (#79288)
This patch introduces support for 4-way widening outer products. This
enables the fusion of 4 'arm_sme.outerproduct' operations that are
chained via the accumulator into single widened operations.

Changes:

- Adds the following operations:
  - smopa_4way, smops_4way
  - umopa_4way, umops_4way
  - sumopa_4way, sumops_4way
  - sumopa_4way, sumops_4way
- Implements conversions for the above ops to intrinsics in ArmSMEToLLVM.
- Extends 'arm-sme-outer-product' pass.

For a detailed description of these operations see the
'arm_sme.smopa_4way' description.
2024-02-07 08:17:47 +00:00
Benjamin Maxwell
0473e322f6
[mlir][ArmSME] Add rewrite to lift illegal vector.transposes to memory (#80170)
When unrolling the reduction dimension of something like a matmul for
SME, you can end up with transposed reads of illegal types, like so:

```mlir
%illegalRead = vector.transfer_read %memref[%a, %b]
                : memref<?x?xf32>, vector<[8]x4xf32>
%legalType = vector.transpose %illegalRead, [1, 0]
                : vector<[8]x4xf32> to vector<4x[8]xf32>
```

Here the `vector<[8]x4xf32>` is an illegal type, there's no way to lower
a scalable vector of fixed vectors. However, as the final type
`vector<4x[8]xf32>` is legal, we can instead lift the transpose to
memory (producing a strided memref), and eliminate all the illegal
types. This is shown below.

```mlir
%readSubview = memref.subview %memref[%a, %b] [%c8_vscale, %c4] [%c1, %c1]
                : memref<?x?xf32> to memref<?x?xf32>
%transpose = memref.transpose %readSubview (d0, d1) -> (d1, d0)
                : memref<?x?xf32> to memref<?x?xf32>
%legalType = vector.transfer_read %transpose[%c0, %c0]
                : memref<?x?xf32>, vector<4x[8]xf32>
```
2024-02-06 09:30:55 +00:00
Cullen Rhodes
5f5b3bb22b
[mlir][ArmSME] Add rewrites to swap extract of extend (#80407)
In mixed matmul lowering (e.g., i8 to i32) we're seeing the following
sequence:

  %0 = arith.extsi %src : vector<4x[8]xi8> to vector<4x[8]xi32>
  %1 = vector.extract %0[0] : vector<[8]xi32> from vector<4x[8]xi32>
%lhs = vector.scalable.extract %1[0] : vector<[4]xi32> from
vector<[8]xi32>

  ... (same for rhs)

%2 = vector.outerproduct %lhs, %rhs, %acc vector<[4]xi32>,
vector<[4]xi32>

  // x4 chained by accumulator

This chain of 4 outer products can be fused into a single 4-way widening
variant but the pass doesn't match on the IR, as it expects the source
of the inputs to be an extend and it can't look through the extracts.

This patch fixes this with two rewrites that swaps extract(extend) into
extend(extract).

Related to #78975, #79288.
2024-02-05 14:13:53 +00:00
Benjamin Maxwell
c2dea7122c
[mlir][ArmSME] Fold extracts from 3D create_masks of SME-like masks (#80148)
When unrolling the reduction dimension of something like a matmul for
SME, it is possible to get 3D masks, which are vectors of SME-like
masks. The 2D masks for individual operations are then extracted from
the 3D masks.

i.e.:

```mlir
%mask = vector.create_mask %nonConstantDim, %a, %b : vector<4x[4]x[4]xi1>
%subMask = vector.extract %mask[2]
        : vector<[4]x[4]xi1> from vector<4x[4]x[4]xi1>
```

ArmSME only supports lowering 2D create_masks, so we must fold the
extract into the create_mask. This can be done by checking if the
extraction index is within the true region, then using that select the
first dimension of the 2D mask. This is shown below.

```mlir
%extractionInTrueRegion = arith.cmpi slt, %c2, %nonConstantDim : index
%newMaskFrontDim = arith.select %extractionInTrueRegion, %a, %c0 : index
%subMask = vector.create_mask %newMaskFrontDim, %b : vector<[4]x[4]xi1>
```
2024-02-02 10:06:11 +00:00
Benjamin Maxwell
042800a4dd
[mlir][ArmSME] Add initial SME vector legalization pass (#79152)
This adds a new pass (`-arm-sme-vector-legalization`) which legalizes
vector operations so that they can be lowered to ArmSME. This initial
patch adds decomposition for `vector.outerproduct`,
`vector.transfer_read`, and `vector.transfer_write` when they operate on
vector types larger than a single SME tile. For example, a [8]x[8]xf32
outer product would be decomposed into four [4]x[4]xf32 outer products,
which could then be lowered to ArmSME. These three ops have been picked
as supporting them alone allows lowering matmuls that use all ZA
accumulators to ArmSME.

For it to be possible to legalize a vector type it has to be a multiple
of an SME tile size, but other than that any shape can be used. E.g.
`vector<[8]x[8]xf32>`, `vector<[4]x[16]xf32>`, `vector<[16]x[4]xf32>`
can all be lowered to four `vector<[4]x[4]xf32>` operations.

In future, this pass will be extended with more SME-specific rewrites to
legalize unrolling the reduction dimension of matmuls (which is not
type-decomposition), which is why the pass has quite a general name.
2024-01-31 11:55:22 +00:00
Cullen Rhodes
95ef8e3868
[mlir][ArmSME] Support 2-way widening outer products (#78975)
This patch introduces support for 2-way widening outer products. This
enables the fusion of 2 'arm_sme.outerproduct' operations that are
chained via the accumulator into a 2-way widening outer product
operation.

Changes:

- Add 'llvm.aarch64.sme.[us]mop[as].za32' intrinsics for 2-way variants.
  These map to instruction variants added in SME2 and use different
  intrinsics. Intrinsics are already implemented for widening variants
  from SME1.
- Adds the following operations:
  - fmopa_2way, fmops_2way
  - smopa_2way, smops_2way
  - umopa_2way, umops_2way
- Implements conversions for the above ops to intrinsics in
ArmSMEToLLVM.
- Adds a pass 'arm-sme-outer-product-fusion'  that fuses
  'arm_sme.outerproduct' operations.

For a detailed description of these operations see the
'arm_sme.fmopa_2way' description.

The reason for introducing many operations rather than one is the
signed/unsigned variants can't be distinguished with types (e.g., ui16,
si16) since 'arith.extui' and 'arith.extsi' only support signless
integers. A single operation would require this information and an
attribute (for example) for the sign doesn't feel right if
floating-point types are also supported where this wouldn't apply.
Furthermore, the SME FP8 extensions (FEAT_SME_F8F16, FEAT_SME_F8F32)
introduce FMOPA 2-way (FP8 to FP16) and 4-way (FP8 to FP32) variants but
no subtract variant. Whilst these are not supported in this patch, it
felt simpler to have separate ops for add/subtract given this.
2024-01-31 09:13:18 +00:00
Matthias Springer
5fcf907b34
[mlir][IR] Rename "update root" to "modify op" in rewriter API (#78260)
This commit renames 4 pattern rewriter API functions:
* `updateRootInPlace` -> `modifyOpInPlace`
* `startRootUpdate` -> `startOpModification`
* `finalizeRootUpdate` -> `finalizeOpModification`
* `cancelRootUpdate` -> `cancelOpModification`

The term "root" is a misnomer. The root is the op that a rewrite pattern
matches against
(https://mlir.llvm.org/docs/PatternRewriter/#root-operation-name-optional).
A rewriter must be notified of all in-place op modifications, not just
in-place modifications of the root
(https://mlir.llvm.org/docs/PatternRewriter/#pattern-rewriter). The old
function names were confusing and have contributed to various broken
rewrite patterns.

Note: The new function names use the term "modify" instead of "update"
for consistency with the `RewriterBase::Listener` terminology
(`notifyOperationModified`).
2024-01-17 11:08:59 +01:00
Matthias Springer
e2bb47caa6
[mlir][Arm] Fix invalid rewrite pattern API violations (#78246)
This commit fixes rewrite pattern API violations:
* Rewrite pattern must return "failure" if the IR was not modified.
* In-place op modifications must be communicated to the rewriter
(`updateRootInPlace`).

This commit fixes `test/Dialect/ArmSVE/legalize-vector-storage.mlir`,
`test/Dialect/ArmSME/vector-ops-to-llvm.mlir`,
`test/Dialect/ArmSME/tile-allocation-invalid.mlir`,
`test/Conversion/ArmSMEToLLVM/arm-sme-to-llvm.mlir`,
`test/Conversion/ArmSMEToLLVM/tile-spills-and-fills.mlir`,
`test/Conversion/ArmSMEToLLVM/unsupported.mlir` when running with
`MLIR_ENABLE_EXPENSIVE_PATTERN_API_CHECKS`.

---------

Co-authored-by: Benjamin Maxwell <macdue@dueutil.tech>
2024-01-16 13:26:39 +01:00
Benjamin Maxwell
5417a5fed6
[mlir][ArmSME] Add rudimentary support for tile spills to the stack (#76086)
This adds very basic (and inelegant) support for something like spilling
and reloading tiles, if you use more SME tiles than physically exist.

This is purely implemented to prevent the compiler from aborting if a
function uses too many tiles (i.e. due to bad unrolling), but is
expected to perform very poorly.

Currently, this works in two stages:

During tile allocation, if we run out of tiles instead of giving up, we
switch to allocating 'in-memory' tile IDs. These are tile IDs that start
at 16 (which is higher than any real tile ID). A warning will also be
emitted for each (root) tile op assigned an in-memory tile ID:

```
warning: failed to allocate SME virtual tile to operation, all tile operations will go through memory, expect degraded performance
```

Everything after this works like normal until `-convert-arm-sme-to-llvm`

Here the in-memory tile op:

```mlir
arm_sme.tile_op { tile_id = <IN MEMORY TILE> }
```

Is lowered to:

```mlir
// At function entry:
%alloca = memref.alloca ... : memref<?x?xty>

// Around the op:
// Swap the contents of %alloca and tile 0.
scf.for %slice_idx {
  %current_slice = "arm_sme.intr.read.horiz" ... <{tile_id = 0 : i32}>
  "arm_sme.intr.ld1h.horiz"(%alloca, %slice_idx)  <{tile_id = 0 : i32}>
  vector.store %current_slice, %alloca[%slice_idx, %c0]
}
// Execute op using tile 0.
arm_sme.tile_op { tile_id = 0 }
// Swap the contents of %alloca and tile 0.
// This restores tile 0 to its original state.
scf.for %slice_idx {
  %current_slice = "arm_sme.intr.read.horiz" ... <{tile_id = 0 : i32}>
  "arm_sme.intr.ld1h.horiz"(%alloca, %slice_idx)  <{tile_id = 0 : i32}>
  vector.store %current_slice, %alloca[%slice_idx, %c0]
}
```

This is inserted during the lowering to LLVM as spilling/reloading
registers is a very low-level concept, that can't really be modeled
correctly at a high level in MLIR.

Note: This is always doing the worst case full-tile swap. This could be
optimized to only spill/load data the tile op will use, which could be
just a slice. It's also not making any use of liveness, which could
allow reusing tiles. But these is not seen as important as correct code
should only use the available number of tiles.
2024-01-12 14:51:47 +00:00
Matthias Springer
db8a119e8f
[mlir][ArmSME] Fix invalid rewriter API usage (#76123)
When operations are modified in-place, the rewriter must be notified.
This commit fixes `mlir/test/Conversion/ArmSMEToLLVM/unsupported.mlir`,
`mlir/test/Dialect/ArmSME/tile-zero-masks.mlir` and
`mlir/test/Dialect/ArmSME/vector-ops-to-llvm.mlir` when running with
`MLIR_ENABLE_EXPENSIVE_PATTERN_API_CHECKS` enabled.
2023-12-21 17:39:36 +09:00
Benjamin Maxwell
01e40a8a3d
[mlir][ArmSME] Remove ArmSMETypeConverter (and configure LLVM one instead) (#73639)
This patch removes the ArmSMETypeConverter, and instead updates
`populateArmSMEToLLVMConversionPatterns()` to add an ArmSME vector type
conversion to the existing LLVMTypeConverter. This makes it easier to
add these patterns to an existing `-to-llvm` lowering pass.
2023-12-04 17:02:48 +00:00
Benjamin Maxwell
f7d91faa79
[mlir][ArmSME] Add option to only enable streaming mode/ZA if required (#73931)
This adds a `only-if-required-by-ops` flag to the `enable-arm-streaming`
pass. This flag defaults to `false` (which preserves the original
behaviour), however, if set to `true` the pass will only add the
selected ZA/streaming mode to functions that contain ops that implement
`ArmSMETileOpInterface`.

This simplifies enabling these modes, as we can now first try lowering
ops to ArmSME, then only if we succeed, add the relevant function
attributes.
2023-12-01 10:39:01 +00:00
Benjamin Maxwell
eaff02f28e
[mlir][ArmSME] Switch to an attribute-based tile allocation scheme (#73253)
This reworks the ArmSME dialect to use attributes for tile allocation.
This has a number of advantages and corrects some issues with the
previous approach:

* Tile allocation can now be done ASAP (i.e. immediately after
`-convert-vector-to-arm-sme`)
* SSA form for control flow is now supported (e.g.`scf.for` loops that
yield tiles)
* ArmSME ops can be converted to intrinsics very late (i.e. after
lowering to control flow)
 * Tests are simplified by removing constants and casts
* Avoids correctness issues with representing LLVM `immargs` as MLIR
values
- The tile ID on the SME intrinsics is an `immarg` (so is required to be
a compile-time constant), `immargs` should be mapped to MLIR attributes
(this is already the case for intrinsics in the LLVM dialect)
- Using MLIR values for `immargs` can lead to invalid LLVM IR being
generated (and passes such as -cse making incorrect optimizations)

As part of this patch we bid farewell to the following operations:

```mlir
arm_sme.get_tile_id : i32
arm_sme.cast_tile_to_vector : i32 to vector<[4]x[4]xi32>
arm_sme.cast_vector_to_tile : vector<[4]x[4]xi32> to i32
```

These are now replaced with:
```mlir
// Allocates a new tile with (indeterminate) state:
arm_sme.get_tile : vector<[4]x[4]xi32>
// A placeholder operation for lowering ArmSME ops to intrinsics:
arm_sme.materialize_ssa_tile : vector<[4]x[4]xi32>
```

The new tile allocation works by operations implementing the
`ArmSMETileOpInterface`. This interface says that an operation needs to
be assigned a tile ID, and may conditionally allocate a new SME tile.

Operations allocate a new tile by implementing...
```c++
std::optional<arm_sme::ArmSMETileType> getAllocatedTileType()
```
...and returning what type of tile the op allocates (ZAB, ZAH, etc).

Operations that don't allocate a tile return `std::nullopt` (which is
the default behaviour).

Currently the following ops are defined as allocating:
```mlir
arm_sme.get_tile
arm_sme.zero
arm_sme.tile_load
arm_sme.outerproduct // (if no accumulator is specified)
```

Allocating operations become the roots for the tile allocation pass,
which currently just (naively) assigns all transitive uses of a root
operation the same tile ID. However, this is enough to handle current
use cases.

Once tile IDs have been allocated subsequent rewrites can forward the
tile IDs to any newly created operations.
2023-11-30 10:22:22 +00:00
Benjamin Maxwell
dff97c1e4c
[mlir][ArmSME] Move ArmSME -> intrinsics lowerings to convert-arm-sme-to-llvm pass (#72890)
This gives more flexibility with when these lowerings are performed,
without also lowering unrelated vector ops.

This is a NFC (other than adding a new `-convert-arm-sme-to-llvm` pass)
2023-11-22 13:36:36 +00:00
Benjamin Maxwell
c4c52d4199
[mlir][ArmSME] Move vector.extract/insert lowerings to vector-to-arm-sme (NFC) (#72852)
These were placed in LegalizeForLLVMExport.cpp, which is the wrong stage
for these, as these lower to high-level ArmSME ops, not intrinsics.
2023-11-20 14:04:59 +00:00
Benjamin Maxwell
783ac3b6fb
[mlir][ArmSME] Make use of backend function attributes for enabling ZA storage (#71044)
Previously, we were inserting za.enable/disable intrinsics for functions
with the "arm_za" attribute (at the MLIR level), rather than using the
backend attributes. This was done to avoid a dependency on the SME ABI
functions from compiler-rt (which have only recently been implemented).

Doing things this way did have correctness issues, for example, calling
a streaming-mode function from another streaming-mode function (both
with ZA enabled) would lead to ZA being disabled after returning to the
caller (where it should still be enabled). Fixing issues like this would
require re-doing the ABI work already done in the backend within MLIR.

Instead, this patch switches to use the "arm_new_za" (backend) attribute
for enabling ZA for an MLIR function. For the integration tests, this
requires some way of linking the SME ABI functions. This is done via the
`%arm_sme_abi_shlib` lit substitution. By default, this expands to a
stub implementation of the SME ABI functions, but this can be overridden
by providing the `ARM_SME_ABI_ROUTINES_SHLIB` CMake cache variable
(pointing it at an alternative implementation). For now, the ArmSME
integration tests pass with just stubs, as we don't make use of nested
ZA-enabled calls.

A future patch may add an option to compiler-rt to build the SME
builtins into a standalone shared library to allow easily
building/testing with the actual implementation.
2023-11-14 12:50:38 +00:00
Cullen Rhodes
8f564e014e
[mlir][ArmSME] Add mask operand to store_tile_slice (#70838) 2023-11-02 08:43:37 +00:00
Cullen Rhodes
8ea260a093
[mlir][ArmSME] Add mask operand to load_tile_slice (#70655) 2023-10-31 13:08:55 +00:00
Benjamin Maxwell
e666295011
[mlir][ArmSME] Support lowering masked vector.outerproduct ops to SME (#69604)
This patch adds support for lowering masked outer products to SME. This
is done in two stages. First, vector.outerproducts (both masked and
non-masked) are rewritten to arm_sme.outerproducts. The
arm_sme.outerproduct op is close to vector.outerproduct, but supports
masking on the operands rather than the result. It also limits the cases
it handles to things that could be (directly) lowered to SME.

This currently requires that the source of the mask is a
vector.create_mask op. E.g.:

```mlir
%mask = vector.create_mask %dimA, %dimB : vector<[4]x[4]xi1>
%result = vector.mask %mask {
             vector.outerproduct %vecA, %vecB
              : vector<[4]xf32>, vector<[4]xf32>
          } : vector<[4]x[4]xi1> -> vector<[4]x[4]xf32>
```
Is rewritten to:
```
%maskA = vector.create_mask %dimA : vector<[4]xi1>
%maskB = vector.create_mask %dimB : vector<[4]xi1>
%result = arm_sme.outerproduct %vecA, %vecB masks(%maskA, %maskB)
              : vector<[4]xf32>, vector<[4]xf32>
```
(The same rewrite works for non-masked vector.outerproducts too)

The arm_sme.outerproduct can then be directly lowered to SME intrinsics.
2023-10-31 09:06:21 +00:00
Cullen Rhodes
2f055ddca3
[mlir][ArmSME] Add tile slice layout attr to vector <-> tile ops (#69186)
This is used in #69148 when lowering masked tile_store with non-zero
pad, see #69148

This updates:
 * `arm_sme.move_vector_to_tile_slice`
 * `arm_sme.move_tile_slice_to_vector`
2023-10-25 14:44:31 +01:00
Benjamin Maxwell
496318ad8d
[mlir][ArmSME] Lower vector.extract/insert on SME tiles to MOVA intrinsics (#67786)
This patch adds support for lowering vector.insert/extract of tile
slices or elements to ArmSME MOVA intrinsics.

This enables the following operations for ArmSME:
```
// Extract slice from tile:
%slice = vector.extract %tile[%row] 
                 : vector<[4]xi32> from vector<[4]x[4]xi32>
```
```
// Extract element from tile:
%el = vector.extract %tile[%row, %col]
                 : i32 from vector<[4]x[4]xi32>
```
```
// Insert slice into tile:
%new_tile = vector.insert %slice, %tile[%row]
                    : vector<[4]xi32> into vector<[4]x[4]xi32>
```
```
// Insert element into tile;
%new_tile = vector.insert %el, %tile[%row, %col]
                    : i32 into vector<[4]x[4]xi32>
```
2023-10-04 09:28:39 +01:00
Andrzej Warzynski
23b5f92c97 [mlir][SME] Re-order patterns alphabetically (nfc) 2023-09-29 16:54:47 +00:00
Benjamin Maxwell
b34f15df55
[mlir][ArmSME] Add arm_sme.move_tile_slice_to_vector op (#67652)
This adds a simple higher-level op for the tile slice to vector
intrinsics (and updates the existing vector.print lowering to use it).
This op will be used a few more times to implement vector.insert/extract
lowerings in later patches.
2023-09-29 10:33:09 +01:00
Benjamin Maxwell
174cd6145b
[mlir][ArmSME] Add custom vector.print lowering for SME tiles (#66691)
This adds a custom lowering for SME that loops over each row of the
tile, extracting it via an SME MOVA, then printing with a normal 1D
vector.print.

This makes writing SME integration tests easier and less verbose.

Depends on: #66910, #66911
2023-09-26 17:09:57 +01:00
Cullen Rhodes
75a71c27c1
[mlir][ArmSME] Support vertical layout in load and store ops (#66758)
In SME a ZA tile slice is a one-dimensional set of horizontally or
vertically contiguous elements within a ZA tile. Currently the load and
store ops only support horizontal tile slices. This patch adds a tile
slice layout attribute to the load and store ops to support both
horizontal and vertical tile slices.

When lowering from Vector dialect horizontal layout is the default.
2023-09-25 09:34:23 +01:00
Benjamin Maxwell
06f22c9ae5
[mlir][ArmSME] Add enable_arm_streaming_ignore attribute (#66911)
This attribute makes the `enable_arm_streaming` pass ignore a function
(i.e. not add the enable streaming/za attributes). The main use case for
this is to prevent helper functions within tests being made streaming
functions.
2023-09-22 11:27:55 +01:00
Cullen Rhodes
f75d46a7ec
[mlir][ArmSME] Lower vector.outerproduct to FMOPA/BFMOPA (#65621)
This patch adds support for lowering vector.outerproduct to the ArmSME
MOPA intrinsic for the following types:

  vector<[8]xf16>,  vector<[8]xf16>  -> vector<[8]x[8]xf16>
  vector<[8]xbf16>, vector<[8]xbf16> -> vector<[8]x[8]xbf16>
  vector<[4]xf32>,  vector<[4]xf32>  -> vector<[4]x[4]xf32>
  vector<[2]xf64>,  vector<[2]xf64>  -> vector<[2]x[2]xf64>

The FP variants are lowered to FMOPA (non-widening) [1] and BFloat to
BFMOPA
(non-widening) [2].

Note at the ISA level these variants are implemented by different
architecture features, these are listed below:

  FMOPA (non-widening)
    * half-precision   - +sme2p1,+sme-f16f16
    * single-precision - +sme
    * double-precision - +sme-f64f64
  BFMOPA (non-widening)
    * half-precision   - +sme2p1,+b16b16

There's currently no way to target different features when lowering to
ArmSME. Integration tests are added for F32 and F64. We use QEMU to run
the integration tests but SME2 support isn't available yet, it's
targeted for 9.0, so integration tests for these variants excluded.

Masking is currently unsupported.

Depends on #65450.

[1] https://developer.arm.com/documentation/ddi0602/2023-06/SME-Instructions/FMOPA--non-widening---Floating-point-outer-product-and-accumulate-
[2] https://developer.arm.com/documentation/ddi0602/2023-06/SME-Instructions/BFMOPA--non-widening---BFloat16-floating-point-outer-product-and-accumulate-
2023-09-14 08:31:52 +01:00
Cullen Rhodes
834cdc8b64 [mlir][ArmSME] Fix get_tile_id type in zero lowering
The arm_sme.get_tile_id op returns a scalar integer but the arm_sme.zero
op lowering incorrectly uses the element type, which could be
floating-point.

Reviewed By: awarzynski, benmxwl-arm

Differential Revision: https://reviews.llvm.org/D159080
2023-08-30 07:16:35 +00:00
Cullen Rhodes
3b4b6cbba5 [mlir][ArmSME] Add move vector to tile slice op and lowerings
This adds a 'move_vector_to_tile_slice' op to the ArmSME dialect that
moves a 1-D scalable vector to a slice of a 2-D tile at a given index.

This is lowered to the 'llvm.aarch64.sme.write.horiz' intrinsic that
maps to the MOVA (vector to tile, single) SME instruction [1] when
lowering to LLVM. Like the SME load and store instructions this operates
on ZA tile slices, which are 1D vectors of horizontally or vertically
contiguous elements within a ZA tile.

This patch extends the lowering of 'arith.constant' to SME to support
non-zero constants using this new op.  This requires materializing a
loop that broadcasts the constant to each tile slice with the
'vector_to_tile_slice' op. Unlike load and store, this is done during
conversion from Vector to ArmSME, rather than ArmSME to SCF. The latter
would require a higher-level custom op in the ArmSME dialect like
'tile_load' and 'tile_store' and this isn't necessary. We may also
remove the load and store ops in the future in favour of lowering
straight from Vector, at which point this would converge.

Currently only horizontal tile slices are supported. A future patch will
extend this mechanism to support 'vector.broadcast'.

Depends on D156980 D157004

[1] https://developer.arm.com/documentation/ddi0602

Reviewed By: awarzynski, dcaballe

Differential Revision: https://reviews.llvm.org/D157005
2023-08-29 09:29:22 +00:00
Benjamin Maxwell
97da414182 [mlir][ArmSME] Lower loads/stores of (.Q) 128-bit tiles to intrinsics
This follows from D155306.

Loads and stores of 128-bit tiles have been confirmed to work in the
`load-store-128-bit-tile.mlir` integration test. However, there is
currently a bug in QEMU (see: https://gitlab.com/qemu-project/qemu/-/issues/1833)
which means this test produces incorrect results (a patch for this issue
is available but not yet in any released version of QEMU). Until a
fixed version of QEMU is available the integration test is expected to fail.

Reviewed By: c-rhodes, awarzynski

Differential Revision: https://reviews.llvm.org/D158418
2023-08-23 09:16:20 +00:00
Benjamin Maxwell
a4d87e3d06 [mlir][ArmSME] Calculate correct tile mask when lowering arm_sme.zero
This patch updates the lowering of the arm_sme.zero to intrinsics so
that it calculates the correct mask for the tile to zero.

The zero instruction takes an 8-bit mask which specifies which 64-bit
tiles to zero, ZA0.D to ZA7.D correspond to bits 0 to 7. To zero tiles
with element sizes of 8-bit to 32-bit just requires zeroing the right
64-bit tiles.

This is quite easy to calculate, each size has a "base mask" which can
be shifted left by the tile ID to get the mask for that tile.

  base_mask << tile_id

After tile allocation, this will be folded to a constant mask.

Reviewed By: awarzynski

Differential Revision: https://reviews.llvm.org/D157902
2023-08-18 09:34:29 +00:00
Cullen Rhodes
65a6be5de9 [mlir][ArmSME] Use memref indices for load and store
This patch extends the ArmSME load and store op lowering to use the
memref indices. An integration test that loads two 32-bit element ZA
tiles from memory and stores them back to memory in reverse order to
verify this is added.

Depends on D156467 D156558

Reviewed By: awarzynski, dcaballe

Differential Revision: https://reviews.llvm.org/D156689
2023-08-03 08:50:12 +00:00
Cullen Rhodes
9e1b825321 [mlir][ArmSME] Add conversion from ArmSME to SCF to materialize loops
Currently a loop is materialized when lowering ArmSME loads and stores
to intrinsics. This patch introduces two new ops to the ArmSME dialect
that map 1-1 with intrinsics:

  1. arm_sme.load_tile_slice  - Loads a 1D tile slice from
     memory into a 2D SME "virtual tile".
  2. arm_sme.store_tile_slice - Stores a 1D tile slice from a 2D SME
     "virtual tile" into memory.

As well as a new conversion pass '-convert-arm-sme-to-scf' that
materializes loops with these ops. The existing load/store lowering to
intrinsics is updated to use these ops.

Depends on D156517

Discourse thread:
https://discourse.llvm.org/t/loop-materialization-in-armsme/72354

Reviewed By: awarzynski, dcaballe, WanderAway

Differential Revision: https://reviews.llvm.org/D156467
2023-08-01 08:20:02 +00:00
Cullen Rhodes
ca9a3354d0 [mlir][ArmSME] Add tile load op and extend tile store tile size support
This extends the existing 'arm_sme.tile_store' op to support all tile
sizes and adds a new op 'arm_sme.tile_load', as well as lowerings from
vector -> custom ops and custom ops -> intrinsics. Currently there's no
lowering for i128.

Depends on D154867

Reviewed By: awarzynski, dcaballe

Differential Revision: https://reviews.llvm.org/D155306
2023-07-25 08:28:36 +00:00
Andrzej Warzynski
3fa5ee67ba [mlir][ArmSME] Introduce custom TypeConverter for ArmSME
At the moment, SME-to-LLVM lowerings rely entirely on
`LLVMTypeConverter`. This patch introduces a dedicated `TypeConverter`
that inherits from `LLVMTypeConverter` (it will also be used when
lowering ArmSME Ops to LLVM).

The new type converter merely disables lowerings for `VectorType` to
prevent 2-d scalable vectors (common in the context of ArmSME), e.g.

   `vector<[16]x[16]xi8>`,

entering the LLVM Type converter. LLVM does not support arrays of
scalable vectors and hence the need for specialisation. In the case of
SME such types are effectively eliminated when emitting LLVM IR
intrinsics for SME.

Differential Revision: https://reviews.llvm.org/D155365
2023-07-18 09:35:32 +00:00
Cullen Rhodes
fb54fec726 [mlir][ArmSME] Implement tile allocation
This patch adds a pass '-allocate-sme-tiles' to the ArmSME dialect that
implements allocation of SME ZA tiles.

It does this at the 'func.func' op level by replacing
'arm_sme.get_tile_id' ops with 'arith.constant' ops that represent the
tile number. The tiles in use in a given function are tracked by an
integer function attribute 'arm_sme.tiles_in_use' that is a 16-bit tile
mask with a bit for each 128-bit element tile (ZA0.Q-ZA15.Q), the
smallest ZA tile granule. This is initialized on the first
'arm_sme.get_tile_id' rewrite and updated on each subsequent rewrite.
Mixing of different element tile types is supported.

Section B2.3.2 of the SME spec [1] describes how the 128-bit element
tiles overlap with other element tiles.

Depends on D154941

[1] https://developer.arm.com/documentation/ddi0616/aa

Reviewed By: awarzynski

Differential Revision: https://reviews.llvm.org/D154955
2023-07-18 08:46:40 +00:00
Andrzej Warzynski
447bb5bee4 [mlir][ArmSME] Introduce new lowering layer (Vector -> ArmSME)
At the moment, the lowering from the Vector dialect to SME looks like
this:

  * Vector --> SME LLVM IR intrinsics

This patch introduces a new lowering layer between the Vector dialect
and the Arm SME extension:

  * Vector --> ArmSME dialect (custom Ops) --> SME LLVM IR intrinsics.

This is motivated by 2 considerations:
1. Storing `ZA` to memory (e.g. `vector.transfer_write`) requires an
   `scf.for` loop over all rows of `ZA`. Similar logic will apply to
   "load to ZA from memory". This is a rather complex transformation and
   a custom Op seems justified.
2. As discussed in [1], we need to prevent the LLVM type converter from
   having to convert types unsupported in LLVM, e.g.
   `vector<[16]x[16]xi8>`. A dedicated abstraction layer with custom Ops
   opens a path to some fine tuning (e.g. custom type converters) that
   will allow us to avoid this.

To facilitate this change, two new custom SME Op are introduced:

  * `TileStoreOp`, and
  * `ZeroOp`.

Note that no new functionality is added - these Ops merely model what's
already supported. In particular, the following tile size is assumed
(dimension and element size are fixed):

  * `vector<[16]x[16]xi8>`

The new lowering layer is introduced via a conversion pass between the
Vector and the SME dialects. You can use the `-convert-vector-to-sme`
flag to run it. The following function:
```
func.func @example(%arg0 : memref<?x?xi8>) {
  // (...)
  %cst = arith.constant dense<0> : vector<[16]x[16]xi8>
  vector.transfer_write %cst, %arg0 : vector<[16]x[16]xi8>, memref<?x?xi8>
  return
}
```
would be lowered to:
```
  func.func @example(%arg0: memref<?x?xi8>) {
    // (...)
    %0 = arm_sme.zero : vector<[16]x[16]xi8>
    arm_sme.tile_store %arg0[%c0, %c0], %0 : memref<?x?xi8>, vector<[16]x[16]xi8>
    return
  }
```

Later, a mechanism will be introduced to guarantee that `arm_sme.zero`
and `arm_sme.tile_store` operate on the same virtual tile. For `i8`
elements this is not required as there is only one tile.

In order to lower the above output to LLVM, use
  * `-convert-vector-to-llvm="enable-arm-sme"`.

[1] https://github.com/openxla/iree/issues/14294

Reviewed By: WanderAway

Differential Revision: https://reviews.llvm.org/D154867
2023-07-18 08:04:59 +00:00
Cullen Rhodes
6ff9761a69 [mlir][ArmSME] Add custom get_tile_id and cast ops
This patch adds three new custom ops to the ArmSME dialect:

  * arm_sme.get_tile_id - returns a scalar integer representing an SME
    "virtual tile" that is not in use.
  * arm_sme.cast_tile_to_vector - casts from a tile id to a 2-d scalable
    vector type, which represents an SME "virtual tile".
  * arm_sme.cast_vector_to_tile - casts from a 2-d scalable vector type,
    which represents an SME "virtual tile", to a tile id.

The 'arm_sme.get_tile_id' op currently only supports tile 0, a follow-up
patch will implement proper tile allocation. A further follow-up patch
will demonstrate load/store to/from ZA using these ops.

See the op descriptions for further details and examples.

Thanks to @paulwalker-arm and @awarzynski for helping drive this.

Reviewed By: awarzynski, dcaballe

Differential Revision: https://reviews.llvm.org/D154941
2023-07-18 07:41:45 +00:00
Cullen Rhodes
564713c471 [mlir][ArmSME] Add basic lowering of vector.transfer_write to zero
This patch adds support for lowering a 'vector.transfer_write' of zeroes
and type 'vector<[16x16]xi8>' to the SME 'zero {za}' instruction [1],
which zeroes the entire accumulator, and then writing it out to memory
with the 'str' instruction [2].

This contributes to supporting a path from 'linalg.fill' to SME.

[1] https://developer.arm.com/documentation/ddi0602/2022-06/SME-Instructions/ZERO--Zero-a-list-of-64-bit-element-ZA-tiles-
[2] https://developer.arm.com/documentation/ddi0602/2022-06/SME-Instructions/STR--Store-vector-from-ZA-array-

Reviewed By: awarzynski, dcaballe, WanderAway

Differential Revision: https://reviews.llvm.org/D152508
2023-07-03 10:18:43 +00:00
Cullen Rhodes
51b0398b76 [mlir][ArmSME] Fix crash on func decls in 'arm_za' legality checks
Reviewed By: dcaballe, Dinistro

Differential Revision: https://reviews.llvm.org/D153750
2023-06-27 07:39:35 +00:00
Cullen Rhodes
65305aeab9 [mlir][ArmSME] Insert intrinsics to enable/disable ZA
This patch adds two LLVM intrinsics to the ArmSME dialect:

  * llvm.aarch64.sme.za.enable
  * llvm.aarch64.sme.za.disable

for enabling the ZA storage array [1], as well as patterns for inserting
them during legalization to LLVM at the start and end of functions if
the function has the 'arm_za' attribute (D152695).

In the future ZA should probably be automatically enabled/disabled when
lowering from vector to SME, but this should be sufficient for now at
least until we have patterns lowering to SME instructions that use ZA.

N.B. The backend function attribute 'aarch64_pstate_za_new' can be used
manage ZA state (as was originally tried in D152694), but it emits calls
to the following SME support routines [2] for the lazy-save mechanism
[3]:

  * __arm_tpidr2_restore
  * __arm_tpidr2_save

These will soon be added to compiler-rt but there's currently no public
implementation, and using this attribute would introduce an MLIR
dependency on compiler-rt. Furthermore, this mechanism is for routines
with ZA enabled calling other routines with it also enabled. We can
choose not to enable ZA in the compiler when this is case.

Depends on D152695

[1] https://developer.arm.com/documentation/ddi0616/aa
[2] https://github.com/ARM-software/abi-aa/blob/main/aapcs64/aapcs64.rst#sme-support-routines
[3] https://github.com/ARM-software/abi-aa/blob/main/aapcs64/aapcs64.rst#the-za-lazy-saving-scheme

Reviewed By: awarzynski, dcaballe

Differential Revision: https://reviews.llvm.org/D153050
2023-06-16 09:40:48 +00:00
Cullen Rhodes
e947e76058 [mlir][ArmSME] Extend streaming-mode pass to support enabling ZA
This patch extends the 'enable-arm-streaming' pass with a new option to
enable the ZA storage array by adding the 'arm_za' attribute to
'func.func' ops.

A later patch will insert `llvm.aarch64.sme.za.enable` at the beginning
of 'func.func' ops and `llvm.aarch64.sme.za.disable` before
`func.return` statements when lowering to LLVM dialect.

Currently the pass only supports enabling ZA with streaming-mode on but
the SME LDR, STR and ZERO instructions can access ZA when not in
streaming-mode (section B1.1.1, IDGNQM [1]), so it may be worth making
these options independent in the future.

N.B. This patch is generally useful in the context of SME enablement in
MLIR, but it will help enable writing an integration test for rewrite
pattern that lowers `vector.transfer_write` -> `zero {za}` (D152508).

[1] https://developer.arm.com/documentation/ddi0616/aa

Reviewed By: awarzynski, dcaballe

Differential Revision: https://reviews.llvm.org/D152695
2023-06-16 09:26:42 +00:00
Cullen Rhodes
1e41a29d73 Revert "[mlir][ArmSME] Add initial dialect with basic lowering of vector.transfer write to zero"
Apologies I shouldn't have comitted this, need to wait until the planned
MLIR ODM:

  https://discourse.llvm.org/t/rfc-creating-a-armsme-dialect/67208/76

This reverts commit a48fe898857c95a063fa6c201343dca969bc098a.
2023-06-14 09:03:10 +00:00
Cullen Rhodes
a48fe89885 [mlir][ArmSME] Add initial dialect with basic lowering of vector.transfer write to zero
This patch adds support for lowering a `vector.transfer_write` of zeroes
and type `vector<[16x16]xi8>` to the SME `zero {za}` instruction [1],
which zeroes the entire accumulator.

This contributes to supporting a path from `linalg.fill` to SME.

[1] https://developer.arm.com/documentation/ddi0602/2022-06/SME-Instructions/ZERO--Zero-a-list-of-64-bit-element-ZA-tiles-

Reviewed By: awarzynski, dcaballe

Differential Revision: https://reviews.llvm.org/D152508
2023-06-14 08:46:53 +00:00
Cullen Rhodes
1264849299 [mlir] Add pass to enable Armv9 Streaming SVE mode
This patch adds a pass 'enable-arm-streaming' that enables the Armv9
Scalable Matrix Extension (SME) Streaming SVE (SSVE) mode [1] by adding
either of the following attributes to 'func.func' ops:

  * arm_streaming (default)
  * arm_locally_streaming

PATCH [2 / 2] in series for RFC: https://discourse.llvm.org/t/rfc-supporting-armv9-scalable-matrix-extension-sme-streaming-sve-ssve-mode-in-mlir/70678

[1] https://developer.arm.com/documentation/ddi0616/aa

Reviewed By: awarzynski, dcaballe

Differential Revision: https://reviews.llvm.org/D150934
2023-05-25 09:20:36 +00:00