30 Commits

Author SHA1 Message Date
Benjamin Maxwell
c42512436b
[mlir][ArmSME] Rename slice move operations to insert/extract_tile_slice (#106755)
This renames:

- `arm_sme.move_tile_slice_to_vector` to `arm_sme.extract_tile_slice`
- `arm_sme.move_vector_to_tile_slice` to `arm_sme.insert_tile_slice`

The new names are more consistent with the rest of MLIR and should be
easier to understand. The current names (to me personally) are hard to
parse and easy to mix up when skimming through code.

Additionally, the syntax for `insert_tile_slice` has changed from:

```mlir
%4 = arm_sme.insert_tile_slice %0, %1, %2
  : vector<[16]xi8> into vector<[16]x[16]xi8>
```

To:

```mlir
%4 = arm_sme.insert_tile_slice %0, %1[%2]
  : vector<[16]xi8> into vector<[16]x[16]xi8>
```

This is for consistency with `extract_tile_slice`, but also helps with
readability as it makes it clear which operand is the index.
2024-09-02 11:12:40 +01:00
Benjamin Maxwell
e2296d8295
[mlir][ArmSME] Lower extract from 2D scalable create_mask to psel (#96066)
Example:
```mlir
%mask = vector.create_mask %a, %b : vector<[4]x[8]xi1>
%slice = vector.extract %mask[%index]
           : vector<[8]xi1> from vector<[4]x[8]xi1>
```
Becomes:
```mlir
%mask_rows = vector.create_mask %a : vector<[4]xi1>
%mask_cols = vector.create_mask %b : vector<[8]xi1>
%slice = arm_sve.psel %mask_cols, %mask_rows[%index]
           : vector<[8]xi1>, vector<[4]xi1>
```

Note: While psel is under ArmSVE, it requires SME (or SVE 2.1), so this
is currently the most logical place for this lowering.
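As a rough, non-authoritative model of why this rewrite is valid, the C++ sketch below treats predicates as `std::vector<bool>` with fixed (non-scalable) lengths. It assumes `arm_sve.psel %p, %q[%index]` yields `%p` when lane `%index` of `%q` is set and an all-false predicate otherwise. `createMask1D`, `createMask2D`, `psel` and `extractRowViaPsel` are hypothetical helpers, not MLIR APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Mask1D = std::vector<bool>;
using Mask2D = std::vector<Mask1D>;

// Model of vector.create_mask %n : vector<len x i1>: lanes [0, n) are set.
Mask1D createMask1D(std::size_t n, std::size_t len) {
  Mask1D m(len, false);
  for (std::size_t i = 0; i < len && i < n; ++i) m[i] = true;
  return m;
}

// Model of vector.create_mask %a, %b : vector<rows x cols x i1>.
Mask2D createMask2D(std::size_t a, std::size_t b, std::size_t rows,
                    std::size_t cols) {
  Mask2D m(rows, Mask1D(cols, false));
  for (std::size_t i = 0; i < rows; ++i)
    for (std::size_t j = 0; j < cols; ++j) m[i][j] = i < a && j < b;
  return m;
}

// Model of arm_sve.psel %p, %q[%index]: %p if q[index] is set, else all-false.
Mask1D psel(const Mask1D &p, const Mask1D &q, std::size_t index) {
  assert(index < q.size());
  return q[index] ? p : Mask1D(p.size(), false);
}

// The rewritten form: two 1-D create_masks combined with psel.
Mask1D extractRowViaPsel(std::size_t a, std::size_t b, std::size_t rows,
                         std::size_t cols, std::size_t index) {
  Mask1D maskRows = createMask1D(a, rows);
  Mask1D maskCols = createMask1D(b, cols);
  return psel(maskCols, maskRows, index);
}
```

Comparing `createMask2D(a, b, rows, cols)[index]` with `extractRowViaPsel(a, b, rows, cols, index)` for every small `a`, `b` and `index` shows the two forms agree.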
2024-06-20 10:27:07 +01:00
Benjamin Maxwell
4d6b9921b3
[mlir][ArmSME] Fold MoveTileSliceToVector + TransferWrite to StoreTileSlice (#95907) 2024-06-19 12:52:53 +01:00
Cullen Rhodes
b49c0b8abc
[mlir][ArmSME] Simplify permutation map handling (#93515)
In -convert-vector-to-arm-sme the permutation_map is explicitly checked
for transpose when converting xfer ops, but for 2-D vector types the
only non-identity permutation map is transpose so this can be
simplified.
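The reasoning can be checked by brute force. This hypothetical C++ sketch (plain arrays, not MLIR's `AffineMap` API) enumerates every permutation of two dimensions and confirms that the only non-identity one is the transpose `(d0, d1) -> (d1, d0)`. It deliberately ignores non-permutation maps such as broadcasts or minor identities.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <vector>

// A 2-D permutation map (d0, d1) -> (d[p[0]], d[p[1]]), stored as {p[0], p[1]}.
using Perm2D = std::array<int, 2>;

// Enumerate all permutations of {0, 1} in lexicographic order.
std::vector<Perm2D> allPermutations2D() {
  Perm2D p = {0, 1};
  std::vector<Perm2D> result;
  do {
    result.push_back(p);
  } while (std::next_permutation(p.begin(), p.end()));
  return result;
}

bool isIdentity(const Perm2D &p) { return p == Perm2D{0, 1}; }
bool isTranspose(const Perm2D &p) { return p == Perm2D{1, 0}; }

// Checks the claim: every non-identity 2-D permutation is the transpose.
bool nonIdentityImpliesTranspose() {
  for (const Perm2D &p : allPermutations2D())
    if (!isIdentity(p) && !isTranspose(p)) return false;
  return true;
}
```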
2024-05-30 13:22:43 +01:00
Cullen Rhodes
bfb5fe218e
[mlir][ArmSME] Fold transpose into xfer read to enable in-flight transpose (#92562)
vector.transpose ops whose inputs come from vector.transfer_read can be
eliminated by folding the transpose into the xfer op to enable in-flight
transposition when converting xfer read to arm_sme.tile_load.
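A hypothetical sketch of the folding arithmetic, treating both the read's permutation map and the transpose as 2-D permutations (plain C++ arrays, not MLIR's `AffineMap`; `foldTransposeIntoRead` is an illustrative name). Folding a `[1, 0]` transpose into an identity read yields the transposed map, which is what allows the read to become a vertical `arm_sme.tile_load`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// A 2-D permutation: result dimension i comes from source dimension perm[i].
using Perm2D = std::array<int, 2>;

// Fold `vector.transpose %read, transposePerm` into the read's permutation map:
// result dim i = transpose input dim transposePerm[i]
//              = memory dim readPerm[transposePerm[i]].
Perm2D foldTransposeIntoRead(const Perm2D &readPerm,
                             const Perm2D &transposePerm) {
  Perm2D folded{};
  for (std::size_t i = 0; i < folded.size(); ++i) {
    assert(transposePerm[i] == 0 || transposePerm[i] == 1);
    folded[i] = readPerm[static_cast<std::size_t>(transposePerm[i])];
  }
  return folded;
}
```

Folding a transpose into an already-transposed read gives back the identity map, so the pair would lower to an ordinary horizontal load.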
2024-05-21 08:08:05 +01:00
Cullen Rhodes
9f7fff7f13
[mlir][ArmSME] Add arith-to-arm-sme conversion pass (#78197)
Existing 'arith::ConstantOp' conversion and tests are moved from
VectorToArmSME. Only a single op is converted at the moment, but this
will grow in the future as things like in-tile add are implemented.
Also, 'createLoopOverTileSlices' is moved to ArmSME utils since it's
relevant for both conversions.
2024-01-22 09:23:11 +00:00
Cullen Rhodes
e7432babaf
[mlir][ArmSME] Fail instead of error in vector.outerproduct lowering (#75447)
The 'vector.outerproduct' -> 'arm_sme.outerproduct' conversion currently
errors on unsupported cases when it should return failure.
2023-12-15 07:30:32 +00:00
Benjamin Maxwell
b0b69fd879
[mlir][ArmSME] More precisely model dataflow in ArmSME to SCF lowerings (#73922)
Since #73253, loops over tiles in SSA form (i.e. loops that take
`iter_args` and yield a new tile) are supported, so this patch updates
ArmSME lowerings to this form. This is an NFC change, as it still lowers to
the same intrinsics, but it makes the IR less 'surprising' at a higher
level, and may be recognised by more transforms.

Example:

IR before:
```mlir
scf.for %tile_slice_index = %c0 to %num_tile_slices step %c1
{
  arm_sme.move_vector_to_tile_slice
    %broadcast_to_1d, %tile, %tile_slice_index :
    vector<[4]xi32> into vector<[4]x[4]xi32>
}
// ... later use %tile
```
IR now:
```mlir
%broadcast_to_tile = scf.for %tile_slice_index = %c0 to %num_tile_slices
    step %c1 iter_args(%iter_tile = %init_tile) -> (vector<[4]x[4]xi32>)
{
  %tile_update = arm_sme.move_vector_to_tile_slice
    %broadcast_to_1d, %iter_tile, %tile_slice_index :
    vector<[4]xi32> into vector<[4]x[4]xi32>
  scf.yield %tile_update : vector<[4]x[4]xi32>
}
// ... later use %broadcast_to_tile
```
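The change from side effects to explicit dataflow can be mirrored in ordinary C++ with value semantics. In this hypothetical sketch, `insertTileSlice` stands in for `move_vector_to_tile_slice` (a tile is modelled as a fixed-size `std::vector` of rows, ignoring scalability) and returns an updated tile instead of mutating one in place, so the loop threads the tile through each iteration the way `iter_args` and `scf.yield` do.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<int>;
using Tile = std::vector<Vec>;

// Hypothetical value-semantics model of move_vector_to_tile_slice: returns a
// new tile with row `index` replaced by `slice`; the caller's tile is untouched.
Tile insertTileSlice(const Vec &slice, Tile tile, std::size_t index) {
  assert(index < tile.size() && slice.size() == tile[index].size());
  tile[index] = slice;
  return tile;
}

// Mirrors the scf.for with iter_args: the tile is threaded through the loop
// and the final value is the loop result (%broadcast_to_tile).
Tile broadcastToTile(const Vec &broadcastTo1D, const Tile &initTile) {
  Tile iterTile = initTile;  // %iter_tile = %init_tile
  for (std::size_t i = 0; i < initTile.size(); ++i)
    iterTile = insertTileSlice(broadcastTo1D, iterTile, i);  // scf.yield
  return iterTile;
}
```

Because the initial tile is never modified, any later use of the loop result clearly refers to the updated tile, which is the property the SSA form makes explicit.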
2023-12-06 14:31:05 +00:00
Benjamin Maxwell
10063c5a29
[mlir][ArmSME] Move vector.print -> ArmSME lowering to VectorToArmSME (#74063)
This moves the SME tile vector.print lowering from
`-convert-arm-sme-to-scf` to `-convert-vector-to-arm-sme`. This seems
like a more logical place, as this is lowering a vector op to ArmSME,
and it also prevents vector.print from blocking tile allocation.
2023-12-04 09:42:11 +00:00
Benjamin Maxwell
eaff02f28e
[mlir][ArmSME] Switch to an attribute-based tile allocation scheme (#73253)
This reworks the ArmSME dialect to use attributes for tile allocation.
This has a number of advantages and corrects some issues with the
previous approach:

* Tile allocation can now be done ASAP (i.e. immediately after
  `-convert-vector-to-arm-sme`)
* SSA form for control flow is now supported (e.g. `scf.for` loops that
  yield tiles)
* ArmSME ops can be converted to intrinsics very late (i.e. after
  lowering to control flow)
* Tests are simplified by removing constants and casts
* Avoids correctness issues with representing LLVM `immargs` as MLIR
  values
  - The tile ID on the SME intrinsics is an `immarg` (so is required to
    be a compile-time constant), and `immargs` should be mapped to MLIR
    attributes (this is already the case for intrinsics in the LLVM dialect)
  - Using MLIR values for `immargs` can lead to invalid LLVM IR being
    generated (and passes such as -cse making incorrect optimizations)

As part of this patch we bid farewell to the following operations:

```mlir
arm_sme.get_tile_id : i32
arm_sme.cast_tile_to_vector : i32 to vector<[4]x[4]xi32>
arm_sme.cast_vector_to_tile : vector<[4]x[4]xi32> to i32
```

These are now replaced with:
```mlir
// Allocates a new tile with (indeterminate) state:
arm_sme.get_tile : vector<[4]x[4]xi32>
// A placeholder operation for lowering ArmSME ops to intrinsics:
arm_sme.materialize_ssa_tile : vector<[4]x[4]xi32>
```

The new tile allocation works via operations implementing the
`ArmSMETileOpInterface`. This interface says that an operation needs to
be assigned a tile ID, and may conditionally allocate a new SME tile.

Operations allocate a new tile by implementing...
```c++
std::optional<arm_sme::ArmSMETileType> getAllocatedTileType()
```
...and returning what type of tile the op allocates (ZAB, ZAH, etc.).

Operations that don't allocate a tile return `std::nullopt` (which is
the default behaviour).

Currently the following ops are defined as allocating:
```mlir
arm_sme.get_tile
arm_sme.zero
arm_sme.tile_load
arm_sme.outerproduct // (if no accumulator is specified)
```

Allocating operations become the roots for the tile allocation pass,
which currently just (naively) assigns all transitive uses of a root
operation the same tile ID. However, this is enough to handle current
use cases.

Once tile IDs have been allocated, subsequent rewrites can forward the
tile IDs to any newly created operations.
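The root-and-propagate scheme described above can be sketched as a worklist over a simplified use graph. Everything in this C++ sketch is hypothetical (`Op`, `TileType` and `allocateTiles` are not the real ArmSME classes): ops are indices, `users` lists the ops consuming a result, and the sketch ignores tile types and the limited number of physical tiles, which a real allocator must respect.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <vector>

// Hypothetical stand-ins; not the real ArmSME C++ API.
enum class TileType { ZAB, ZAH, ZAS, ZAD, ZAQ };

struct Op {
  std::optional<TileType> allocatedTileType;  // std::nullopt => not a root
  std::vector<int> users;                     // indices of ops using the result
};

// Naive allocation: each allocating op (root) gets a fresh tile ID, and every
// transitive user of the root inherits that same ID.
std::map<int, int> allocateTiles(const std::vector<Op> &ops) {
  std::map<int, int> tileIds;
  int nextId = 0;
  for (int root = 0; root < static_cast<int>(ops.size()); ++root) {
    if (!ops[root].allocatedTileType) continue;
    const int id = nextId++;
    std::vector<int> worklist = {root};
    while (!worklist.empty()) {
      const int op = worklist.back();
      worklist.pop_back();
      if (tileIds.count(op)) continue;  // already assigned
      tileIds[op] = id;
      for (int user : ops[op].users) worklist.push_back(user);
    }
  }
  return tileIds;
}
```

For example, a `get_tile` root feeding an insert and then a store gives all three ops the same ID, while a separate `zero` root and its users receive the next ID.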
2023-11-30 10:22:22 +00:00
Benjamin Maxwell
c4c52d4199
[mlir][ArmSME] Move vector.extract/insert lowerings to vector-to-arm-sme (NFC) (#72852)
These lowerings were placed in LegalizeForLLVMExport.cpp, which is the
wrong stage for them, as they lower to high-level ArmSME ops, not intrinsics.
2023-11-20 14:04:59 +00:00
Matthias Springer
32c3decb77
[mlir][vector] Modernize vector.transpose op (#72594)
* Declare arguments/results with `let` statements.
* Rename `transp` to `permutation`.
* Change type of `transp` from `I64ArrayAttr` to `DenseI64ArrayAttr`
(provides direct access to `ArrayRef<int64_t>` instead of `ArrayAttr`).
2023-11-20 11:25:35 +01:00
Cullen Rhodes
4240b1790f
[mlir][ArmSME] Lower transfer_write + transpose to vertical store (#71181)
This patch extends the lowering of vector.transfer_write in
VectorToArmSME to support in-flight transpose via SME vertical store.
2023-11-10 07:51:06 +00:00
Cullen Rhodes
22f1159223
[mlir][ArmSME] Propagate pad and mask in vector.transfer_read lowering (#70814)
This extends the vector.transfer_read -> arm_sme.tile_load lowering to
propagate pad and mask.

The restriction on the transfer_read being a transposition is also
removed; identity maps are lowered to normal horizontal loads.
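Semantically, propagating the mask and pad means masked-off elements take the padding value instead of being read from memory. Below is a hypothetical C++ model of that behaviour for a horizontal load, using plain nested vectors with fixed sizes and no scalability; `maskedTileLoad` is an illustrative name, not an MLIR function.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Tile = std::vector<std::vector<float>>;
using Mask = std::vector<std::vector<bool>>;

// Hypothetical model of a masked, padded horizontal tile load:
// masked-off lanes receive `pad` instead of reading memory.
Tile maskedTileLoad(const Tile &mem, const Mask &mask, float pad) {
  assert(mem.size() >= mask.size());
  Tile result(mask.size());
  for (std::size_t i = 0; i < mask.size(); ++i) {
    result[i].resize(mask[i].size());
    for (std::size_t j = 0; j < mask[i].size(); ++j)
      result[i][j] = mask[i][j] ? mem[i][j] : pad;
  }
  return result;
}
```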
2023-11-02 10:23:38 +00:00
Cullen Rhodes
1908f47a9b
[mlir][ArmSME] Add optional mask operand to tile_store (#70657) 2023-11-01 07:15:51 +00:00
Benjamin Maxwell
e666295011
[mlir][ArmSME] Support lowering masked vector.outerproduct ops to SME (#69604)
This patch adds support for lowering masked outer products to SME. This
is done in two stages. First, vector.outerproducts (both masked and
non-masked) are rewritten to arm_sme.outerproducts. The
arm_sme.outerproduct op is close to vector.outerproduct, but supports
masking on the operands rather than the result. It also limits the cases
it handles to things that could be (directly) lowered to SME.

This currently requires that the source of the mask is a
vector.create_mask op. E.g.:

```mlir
%mask = vector.create_mask %dimA, %dimB : vector<[4]x[4]xi1>
%result = vector.mask %mask {
             vector.outerproduct %vecA, %vecB
              : vector<[4]xf32>, vector<[4]xf32>
          } : vector<[4]x[4]xi1> -> vector<[4]x[4]xf32>
```
Is rewritten to:
```mlir
%maskA = vector.create_mask %dimA : vector<[4]xi1>
%maskB = vector.create_mask %dimB : vector<[4]xi1>
%result = arm_sme.outerproduct %vecA, %vecB masks(%maskA, %maskB)
              : vector<[4]xf32>, vector<[4]xf32>
```
(The same rewrite works for non-masked vector.outerproducts too)

The arm_sme.outerproduct can then be directly lowered to SME intrinsics.
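The rewrite is valid because a 2-D `create_mask %dimA, %dimB` is the outer product of two 1-D masks: element `(i, j)` is set exactly when `i < dimA` and `j < dimB`. The hypothetical C++ sketch below checks this for a fixed 4x4 tile. It adds an accumulator so that masked-off lanes have a defined value, and assumes, as SME's predicated outer products do, that those lanes keep the accumulator unchanged. Both function names are illustrative only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<float>;
using Tile = std::vector<Vec>;

// Model of vector.mask (create_mask dimA, dimB) { outerproduct a, b, acc }:
// lanes outside the 2-D mask keep the accumulator value (assumed semantics).
Tile maskedOuterProduct2D(const Vec &a, const Vec &b, const Tile &acc,
                          std::size_t dimA, std::size_t dimB) {
  Tile r = acc;
  for (std::size_t i = 0; i < a.size(); ++i)
    for (std::size_t j = 0; j < b.size(); ++j)
      if (i < dimA && j < dimB) r[i][j] += a[i] * b[j];
  return r;
}

// Model of arm_sme.outerproduct with operand masks (maskA, maskB), where
// maskA = create_mask dimA and maskB = create_mask dimB.
Tile operandMaskedOuterProduct(const Vec &a, const Vec &b, const Tile &acc,
                               std::size_t dimA, std::size_t dimB) {
  std::vector<bool> maskA(a.size()), maskB(b.size());
  for (std::size_t i = 0; i < a.size(); ++i) maskA[i] = i < dimA;
  for (std::size_t j = 0; j < b.size(); ++j) maskB[j] = j < dimB;
  Tile r = acc;
  for (std::size_t i = 0; i < a.size(); ++i)
    for (std::size_t j = 0; j < b.size(); ++j)
      if (maskA[i] && maskB[j]) r[i][j] += a[i] * b[j];
  return r;
}
```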
2023-10-31 09:06:21 +00:00
Cullen Rhodes
d86047cb66
[mlir][ArmSME] Update tile slice layout syntax (#69151)
This patch prefixes tile slice layout with `layout` in the
assemblyFormat:

  - `<vertical>`   -> `layout<vertical>`
  - `<horizontal>` -> `layout<horizontal>`

The reason for this change is the current format doesn't play nicely
with additional optional operands, required to support padding and
masking (#69148), as it becomes ambiguous.

This affects the following ops:

  - arm_sme.tile_load
  - arm_sme.tile_store
  - arm_sme.load_tile_slice
  - arm_sme.store_tile_slice
2023-10-16 10:55:30 +01:00
Andrzej Warzynski
23b5f92c97 [mlir][SME] Re-order patterns alphabetically (nfc) 2023-09-29 16:54:47 +00:00
Andrzej Warzyński
35dd3a6475
[mlir][SME][nfc] Clarify the usage of insertion guard (#67668)
Added an extra comment that should clarify the need for an insertion guard
when using `getLoopOverTileSlices`. Also removed some redundant calls to
`setInsertionPointAfter`, since the insertion guard would overwrite them on
destruction anyway.
2023-09-29 17:33:57 +01:00
Goran Flegar
042468bff5 [mlir][SME] Fix unused variable warning 2023-09-28 15:24:27 +02:00
Andrzej Warzyński
0cb0df41d4
[mlir][SME] Add vector.splat -> SME conversion (#67659)
This conversion is identical to vector.broadcast when broadcasting a
scalar.
2023-09-28 13:46:24 +01:00
Cullen Rhodes
8e64e9c365
[mlir][ArmSME] Add support for vector.transfer_read with transpose (#67527)
This patch adds support for lowering a vector.transfer_read with a
transpose permutation map to a vertical tile load, for example:

```mlir
vector.transfer_read ... {permutation_map = affine_map<(d0, d1) -> (d1, d0)>} ...
```

is converted to:

```mlir
arm_sme.tile_load ... <vertical>
```

On SME the transpose can be done in-flight, rather than as a separate
operation as in the TransferReadPermutationLowering, which would do the
following:

```mlir
%0 = vector.transfer_read ...
vector.transpose %0, [1, 0] ...
```

The lowering doesn't support masking yet and the transfer_read must be
in-bounds. It also intentionally doesn't handle simple loads as
transfer_write currently does, as the generic
TransferReadToVectorLoadLowering can lower these to simple vector.load
ops, which can already be lowered to ArmSME.

A subsequent patch will update the existing transfer_write lowering;
this is a separate patch as there is currently no lowering for
vector.transfer_read.
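To see why no separate transpose is needed, here is a hypothetical C++ model of square n x n tile loads from a row-major buffer (ignoring scalability and masking; the function names are illustrative only). A vertical load reads each tile slice from a column, which yields the same tile as a horizontal load followed by `vector.transpose [1, 0]`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Tile = std::vector<std::vector<int>>;

// Horizontal load: tile slice i is row i of a row-major buffer with `stride`.
Tile loadHorizontal(const std::vector<int> &mem, std::size_t n,
                    std::size_t stride) {
  assert(n == 0 || (n - 1) * stride + (n - 1) < mem.size());
  Tile tile(n, std::vector<int>(n));
  for (std::size_t i = 0; i < n; ++i)
    for (std::size_t j = 0; j < n; ++j) tile[i][j] = mem[i * stride + j];
  return tile;
}

// Vertical load: tile slice i is column i of the buffer, so the transpose
// happens in flight as part of the load.
Tile loadVertical(const std::vector<int> &mem, std::size_t n,
                  std::size_t stride) {
  assert(n == 0 || (n - 1) * stride + (n - 1) < mem.size());
  Tile tile(n, std::vector<int>(n));
  for (std::size_t i = 0; i < n; ++i)
    for (std::size_t j = 0; j < n; ++j) tile[i][j] = mem[j * stride + i];
  return tile;
}

// The separate-transpose path: load horizontally, then transpose [1, 0].
Tile transpose(const Tile &t) {
  Tile r(t.size(), std::vector<int>(t.size()));
  for (std::size_t i = 0; i < t.size(); ++i)
    for (std::size_t j = 0; j < t.size(); ++j) r[i][j] = t[j][i];
  return r;
}
```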
2023-09-28 10:54:20 +01:00
Cullen Rhodes
eaf15900ff
[mlir][ArmSME] Add support for vector.transpose (#66760)
This patch adds support for lowering vector.transpose to ArmSME. It's
implemented by storing the input tile of the transpose to memory and
reloading vertically, building on top of the tile slice layout support.

Transposing via memory is obviously expensive. The current intention is
to avoid the transpose if possible, so this is intended as a fallback
and to provide base support for Vector ops. If it turns out transposes
can't be avoided, then this should be replaced with a more optimal
implementation, perhaps with tile <-> vector (MOVA) ops.
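A hypothetical C++ model of this fallback (scalability ignored, names illustrative): the tile is written to a scratch buffer one horizontal slice at a time and read back one vertical slice at a time, which is why it costs a full round trip through memory.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Tile = std::vector<std::vector<int>>;

// Hypothetical model: store an n x n tile to a scratch buffer with horizontal
// slices (row-major), then reload it with vertical slices (column-major).
Tile transposeViaMemory(const Tile &tile) {
  const std::size_t n = tile.size();
  std::vector<int> scratch(n * n);
  // Horizontal store: tile slice i -> row i of the buffer.
  for (std::size_t i = 0; i < n; ++i) {
    assert(tile[i].size() == n);  // SME tiles are square
    for (std::size_t j = 0; j < n; ++j) scratch[i * n + j] = tile[i][j];
  }
  // Vertical load: tile slice i <- column i of the buffer.
  Tile result(n, std::vector<int>(n));
  for (std::size_t i = 0; i < n; ++i)
    for (std::size_t j = 0; j < n; ++j) result[i][j] = scratch[j * n + i];
  return result;
}
```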

Depends on https://github.com/llvm/llvm-project/pull/66758.
2023-09-25 12:15:12 +01:00
Cullen Rhodes
2dd3f42083 [mlir][ArmSME] Lower vector.broadcast to ArmSME
This adds support for lowering vector.broadcast ops to SME, if the
source is either a scalar, 0-d vector, or 1-d vector, and the result is a
2-d scalable vector that aligns with SME tiles.

This follows on from D157005 which introduced a vector to tile slice op
that moves a 1-d scalable vector to a slice of a 2-d scalable vector
(tile). The lowering from vector.broadcast is similar, and a couple of
helper functions are added to prevent duplication.

Lowering of vector.broadcast contributes towards a path from linalg.fill
to SME.

Depends on D157005

Reviewed By: awarzynski, dcaballe

Differential Revision: https://reviews.llvm.org/D158586
2023-08-29 09:43:16 +00:00
Cullen Rhodes
3b4b6cbba5 [mlir][ArmSME] Add move vector to tile slice op and lowerings
This adds a 'move_vector_to_tile_slice' op to the ArmSME dialect that
moves a 1-D scalable vector to a slice of a 2-D tile at a given index.

This is lowered to the 'llvm.aarch64.sme.write.horiz' intrinsic that
maps to the MOVA (vector to tile, single) SME instruction [1] when
lowering to LLVM. Like the SME load and store instructions this operates
on ZA tile slices, which are 1D vectors of horizontally or vertically
contiguous elements within a ZA tile.

This patch extends the lowering of 'arith.constant' to SME to support
non-zero constants using this new op. This requires materializing a
loop that broadcasts the constant to each tile slice with the
'move_vector_to_tile_slice' op. Unlike load and store, this is done during
conversion from Vector to ArmSME, rather than ArmSME to SCF. The latter
would require a higher-level custom op in the ArmSME dialect like
'tile_load' and 'tile_store', which isn't necessary here. We may also
remove the load and store ops in the future in favour of lowering
straight from Vector, at which point this would converge.

Currently only horizontal tile slices are supported. A future patch will
extend this mechanism to support 'vector.broadcast'.

Depends on D156980 D157004

[1] https://developer.arm.com/documentation/ddi0602

Reviewed By: awarzynski, dcaballe

Differential Revision: https://reviews.llvm.org/D157005
2023-08-29 09:29:22 +00:00
Cullen Rhodes
dfa10ec2e6 [mlir][ArmSME] Extend arm_sme.zero for all types
The arm_sme.zero op currently only supports 8-bit element tiles. This
extends the op and lowering from 'arith.constant dense<0>' ->
'arm_sme.zero' to support all tile types.

The lowering from arm_sme.zero to intrinsics is not updated as part of
this patch and will be done separately.

Reviewed By: dcaballe

Differential Revision: https://reviews.llvm.org/D156980
2023-08-11 12:44:56 +00:00
Cullen Rhodes
12e1a9b876 [mlir][ArmSME] Extend vector.transfer_write lowering
Enables the lowering of other tile types and values to match the
vector.store -> arm_sme.tile_store lowering.

Reviewed By: awarzynski, dcaballe

Differential Revision: https://reviews.llvm.org/D156976
2023-08-11 12:33:09 +00:00
Cullen Rhodes
781883ea62 [mlir][ArmSME] Split lowering of arith.constant from vector.transfer_write
An 'arith.constant dense<0>' is currently lowered to 'arm_sme.zero' as
part of the 'vector.transfer_write' lowering during '-vector-to-arm-sme'
conversion. This patch makes this lowering independent of the
'vector.transfer_write'. This can then be extended for further tile
types and non-zero constants.

Reviewed By: awarzynski

Differential Revision: https://reviews.llvm.org/D156802
2023-08-03 08:57:33 +00:00
Cullen Rhodes
ca9a3354d0 [mlir][ArmSME] Add tile load op and extend tile store tile size support
This extends the existing 'arm_sme.tile_store' op to support all tile
sizes and adds a new op 'arm_sme.tile_load', as well as lowerings from
vector -> custom ops and custom ops -> intrinsics. Currently there's no
lowering for i128.

Depends on D154867

Reviewed By: awarzynski, dcaballe

Differential Revision: https://reviews.llvm.org/D155306
2023-07-25 08:28:36 +00:00
Andrzej Warzynski
447bb5bee4 [mlir][ArmSME] Introduce new lowering layer (Vector -> ArmSME)
At the moment, the lowering from the Vector dialect to SME looks like
this:

  * Vector --> SME LLVM IR intrinsics

This patch introduces a new lowering layer between the Vector dialect
and the Arm SME extension:

  * Vector --> ArmSME dialect (custom Ops) --> SME LLVM IR intrinsics.

This is motivated by 2 considerations:
1. Storing `ZA` to memory (e.g. `vector.transfer_write`) requires an
   `scf.for` loop over all rows of `ZA`. Similar logic will apply to
   "load to ZA from memory". This is a rather complex transformation and
   a custom Op seems justified.
2. As discussed in [1], we need to prevent the LLVM type converter from
   having to convert types unsupported in LLVM, e.g.
   `vector<[16]x[16]xi8>`. A dedicated abstraction layer with custom Ops
   opens a path to some fine tuning (e.g. custom type converters) that
   will allow us to avoid this.
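Consideration 1 can be pictured with a hypothetical C++ model (fixed sizes, `std::int8_t` elements to match `vector<[16]x[16]xi8>`, and an illustrative `storeTileBySlices` that is not an MLIR API): writing a whole tile to a strided buffer becomes a loop that stores one row per iteration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Tile = std::vector<std::vector<std::int8_t>>;

// Hypothetical model of lowering a whole-tile store to a loop over tile
// slices: each iteration writes one row of ZA to memory, starting at
// row * stride, mirroring the scf.for over all rows of ZA.
void storeTileBySlices(const Tile &tile, std::vector<std::int8_t> &mem,
                       std::size_t stride) {
  for (std::size_t row = 0; row < tile.size(); ++row) {
    // One 1-D slice store per loop iteration.
    for (std::size_t col = 0; col < tile[row].size(); ++col) {
      assert(row * stride + col < mem.size());
      mem[row * stride + col] = tile[row][col];
    }
  }
}
```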

To facilitate this change, two new custom SME Ops are introduced:

  * `TileStoreOp`, and
  * `ZeroOp`.

Note that no new functionality is added - these Ops merely model what's
already supported. In particular, the following tile size is assumed
(dimension and element size are fixed):

  * `vector<[16]x[16]xi8>`

The new lowering layer is introduced via a conversion pass between the
Vector and the SME dialects. You can use the `-convert-vector-to-sme`
flag to run it. The following function:
```mlir
func.func @example(%arg0 : memref<?x?xi8>) {
  // (...)
  %cst = arith.constant dense<0> : vector<[16]x[16]xi8>
  vector.transfer_write %cst, %arg0[%c0, %c0] : vector<[16]x[16]xi8>, memref<?x?xi8>
  return
}
```
would be lowered to:
```mlir
  func.func @example(%arg0: memref<?x?xi8>) {
    // (...)
    %0 = arm_sme.zero : vector<[16]x[16]xi8>
    arm_sme.tile_store %arg0[%c0, %c0], %0 : memref<?x?xi8>, vector<[16]x[16]xi8>
    return
  }
```

Later, a mechanism will be introduced to guarantee that `arm_sme.zero`
and `arm_sme.tile_store` operate on the same virtual tile. For `i8`
elements this is not required as there is only one tile.

In order to lower the above output to LLVM, use
  * `-convert-vector-to-llvm="enable-arm-sme"`.

[1] https://github.com/openxla/iree/issues/14294

Reviewed By: WanderAway

Differential Revision: https://reviews.llvm.org/D154867
2023-07-18 08:04:59 +00:00