16 Commits

Author SHA1 Message Date
Benjamin Maxwell
fb8eb4251a
[mlir][ArmSME] Fix loop bounds of masked loads/stores (#78983)
Previously, for masked tile loads/stores we directly used the dimension
size from the `vector.create_mask` operation as the upper bound of the
`scf.for` over the tile slices. This was not correct, as `create_mask`
allows operands to be greater than the size of the vector dimension, in
which case the for loop bounds should be clamped to the number of tile
slices.
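
The clamping described above can be sketched in plain C++. This is only a scalar model of the bound computation, not the actual ArmSMEToSCF code; the function name `clampTileSliceBound` and both parameter names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper (not an MLIR API): models the upper bound of the
// loop over tile slices.
//   maskDimSize   - the operand given to `vector.create_mask`, which may
//                   exceed the size of the vector dimension
//   numTileSlices - the number of slices actually present in the tile
int64_t clampTileSliceBound(int64_t maskDimSize, int64_t numTileSlices) {
  assert(numTileSlices >= 0 && "a tile cannot have a negative slice count");
  // Never iterate past the last tile slice, however large the mask operand.
  return std::min(maskDimSize, numTileSlices);
}
```

For instance, with four tile slices a mask operand of 3 leaves the bound at 3, while a mask operand of 10 is clamped to 4.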
2024-01-26 09:39:43 +00:00
Benjamin Maxwell
8ff16f646f
[mlir][ArmSME] Refactor ArmSMEToSCF to use shared loop-building helper (NFC) (#79172)
This will make fixing a bug (next patch) a change to one place, rather
than fixing three separate rewrites.

Note: `TileLoadOpWithMaskAndPadZeroConversion` has been merged into
`TileLoadOpConversion`, since after this change those two rewrites were
pretty much identical.
2024-01-25 15:11:46 +00:00
Benjamin Maxwell
01ac530a2e
[mlir][ArmSME] Remove vector.print legality from ArmSMEToSCF (NFC) (#74875)
This was moved to VectorToArmSME in #74063, so this is no longer needed.

VectorToArmSME uses a greedy rewriter, so a similar legality rule is not
needed there.

See:
bbb8a0df73/mlir/lib/Conversion/VectorToArmSME/VectorToArmSMEPass.cpp (L35)
2023-12-11 11:25:43 +00:00
Benjamin Maxwell
b0b69fd879
[mlir][ArmSME] More precisely model dataflow in ArmSME to SCF lowerings (#73922)
Since #73253, loops over tiles in SSA form (i.e. loops that take
`iter_args` and yield a new tile) are supported, so this patch updates
ArmSME lowerings to this form. This is NFC, as it still lowers to the
same intrinsics, but it makes the IR less 'surprising' at a higher level,
and may be recognised by more transforms.


Example:

IR before:
```mlir
scf.for %tile_slice_index = %c0 to %num_tile_slices step %c1
{
  arm_sme.move_vector_to_tile_slice
    %broadcast_to_1d, %tile, %tile_slice_index :
    vector<[4]xi32> into vector<[4]x[4]xi32>
}
// ... later use %tile
```
IR now:
```mlir
%broadcast_to_tile = scf.for %tile_slice_index = %c0 to %num_tile_slices
    step %c1 iter_args(%iter_tile = %init_tile) -> (vector<[4]x[4]xi32>)
{
  %tile_update = arm_sme.move_vector_to_tile_slice
    %broadcast_to_1d, %iter_tile, %tile_slice_index :
    vector<[4]xi32> into vector<[4]x[4]xi32>
  scf.yield %tile_update : vector<[4]x[4]xi32>
}
// ... later use %broadcast_to_tile
```
2023-12-06 14:31:05 +00:00
Benjamin Maxwell
10063c5a29
[mlir][ArmSME] Move vector.print -> ArmSME lowering to VectorToArmSME (#74063)
This moves the SME tile vector.print lowering from
`-convert-arm-sme-to-scf` to `-convert-vector-to-arm-sme`. This seems
like a more logical place, as this is lowering a vector op to ArmSME,
and it also prevents vector.print from blocking tile allocation.
2023-12-04 09:42:11 +00:00
Benjamin Maxwell
eaff02f28e
[mlir][ArmSME] Switch to an attribute-based tile allocation scheme (#73253)
This reworks the ArmSME dialect to use attributes for tile allocation.
This has a number of advantages and corrects some issues with the
previous approach:

* Tile allocation can now be done ASAP (i.e. immediately after
  `-convert-vector-to-arm-sme`)
* SSA form for control flow is now supported (e.g. `scf.for` loops that
  yield tiles)
* ArmSME ops can be converted to intrinsics very late (i.e. after
  lowering to control flow)
* Tests are simplified by removing constants and casts
* Avoids correctness issues with representing LLVM `immargs` as MLIR
  values
  - The tile ID on the SME intrinsics is an `immarg` (so is required to be
    a compile-time constant); `immargs` should be mapped to MLIR attributes
    (this is already the case for intrinsics in the LLVM dialect)
  - Using MLIR values for `immargs` can lead to invalid LLVM IR being
    generated (and passes such as `-cse` making incorrect optimizations)

As part of this patch we bid farewell to the following operations:

```mlir
arm_sme.get_tile_id : i32
arm_sme.cast_tile_to_vector : i32 to vector<[4]x[4]xi32>
arm_sme.cast_vector_to_tile : vector<[4]x[4]xi32> to i32
```

These are now replaced with:
```mlir
// Allocates a new tile with (indeterminate) state:
arm_sme.get_tile : vector<[4]x[4]xi32>
// A placeholder operation for lowering ArmSME ops to intrinsics:
arm_sme.materialize_ssa_tile : vector<[4]x[4]xi32>
```

The new tile allocation works by having operations implement the
`ArmSMETileOpInterface`. This interface indicates that an operation needs
to be assigned a tile ID, and may conditionally allocate a new SME tile.

Operations allocate a new tile by implementing...
```c++
std::optional<arm_sme::ArmSMETileType> getAllocatedTileType()
```
...and returning what type of tile the op allocates (ZAB, ZAH, etc).

Operations that don't allocate a tile return `std::nullopt` (which is
the default behaviour).
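
As a rough illustration of this pattern, the following standalone C++ sketch models an interface with a `std::nullopt` default and two overriding ops. It does not use MLIR at all: `ToyTileType`, `ToyTileOp`, `ToyZeroOp` and `ToyOuterProductOp` are hypothetical names, and the enumerator list is an assumption that simply names one tile type per element size.

```cpp
#include <optional>

// Toy stand-in for arm_sme::ArmSMETileType (enumerators are assumed here).
enum class ToyTileType { ZAB, ZAH, ZAS, ZAD, ZAQ };

// Toy stand-in for an op implementing the tile op interface.
struct ToyTileOp {
  virtual ~ToyTileOp() = default;
  // Default behaviour: the op needs a tile ID but does not allocate a tile.
  virtual std::optional<ToyTileType> getAllocatedTileType() const {
    return std::nullopt;
  }
};

// Models an op that always allocates a tile (like a zeroing op).
struct ToyZeroOp : ToyTileOp {
  ToyTileType tileType;
  explicit ToyZeroOp(ToyTileType type) : tileType(type) {}
  std::optional<ToyTileType> getAllocatedTileType() const override {
    return tileType;
  }
};

// Models an op that allocates only when no accumulator tile is supplied.
struct ToyOuterProductOp : ToyTileOp {
  ToyTileType tileType;
  bool hasAccumulator;
  ToyOuterProductOp(ToyTileType type, bool hasAcc)
      : tileType(type), hasAccumulator(hasAcc) {}
  std::optional<ToyTileType> getAllocatedTileType() const override {
    if (hasAccumulator)
      return std::nullopt; // reuses the accumulator's tile
    return tileType;
  }
};
```

In this model a caller can tell allocating ops apart simply by checking whether the returned optional holds a value.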

Currently the following ops are defined as allocating:
```mlir
arm_sme.get_tile
arm_sme.zero
arm_sme.tile_load
arm_sme.outerproduct // (if no accumulator is specified)
```

Allocating operations become the roots for the tile allocation pass,
which currently just (naively) assigns all transitive uses of a root
operation the same tile ID. However, this is enough to handle current
use cases.

Once tile IDs have been allocated, subsequent rewrites can forward the
tile IDs to any newly created operations.
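
The naive "roots plus transitive uses" scheme can be sketched on a toy use graph. This is an assumption-laden model rather than the real pass: `ToyOp` and `assignTileIds` are hypothetical names, tile IDs are handed out as a simple counter, and the limited number of hardware tiles per tile type is ignored.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Toy op: not an MLIR operation, just enough state for the sketch.
struct ToyOp {
  bool allocatesTile = false;       // models getAllocatedTileType() != nullopt
  std::vector<std::size_t> users;   // indices of ops using this op's result
  std::optional<unsigned> tileId;   // filled in by assignTileIds
};

// Gives every allocating (root) op a fresh ID and propagates that ID to
// all of its transitive users.
void assignTileIds(std::vector<ToyOp> &ops) {
  unsigned nextTileId = 0;
  for (std::size_t root = 0; root < ops.size(); ++root) {
    if (!ops[root].allocatesTile || ops[root].tileId)
      continue;
    const unsigned id = nextTileId++;
    std::vector<std::size_t> worklist{root};
    while (!worklist.empty()) {
      const std::size_t current = worklist.back();
      worklist.pop_back();
      if (ops[current].tileId)
        continue; // already assigned; the first root to reach it wins
      ops[current].tileId = id;
      for (std::size_t user : ops[current].users)
        worklist.push_back(user);
    }
  }
}
```

With two independent chains of ops, each chain ends up sharing the ID of its own root, which matches the behaviour described above for the simple cases.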
2023-11-30 10:22:22 +00:00
Cullen Rhodes
9783cf448a
[mlir][ArmSME] Add support for lowering masked tile_load ops (#70915)
This patch extends ArmSMEToSCF to support lowering of masked tile_load
ops. Only masks created by 'vector.create_mask' are currently supported.
There are two lowerings depending on the pad.

For pad of constant zero, the tile is first zeroed, then only active
rows are loaded.

For non-zero pad, the scalar pad is broadcast to a 1-D vector and a
regular 'vector.masked_load' (will be lowered to SVE, not SME) loads
each slice, with padding specified as a passthru and the 2-D mask
combined into a 1-D mask. The resulting slice is then inserted into the
tile with 'arm_sme.move_vector_to_tile_slice'.
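
The non-zero pad lowering can be modelled with scalar C++ to show what each loop iteration computes. This is a sketch of the semantics only, assuming a mask of the `create_mask` form (the first `numRows` rows and first `numCols` columns are active); `maskedTileLoadWithPad` and the `Slice`/`Tile` aliases are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using Slice = std::vector<int32_t>;
using Tile = std::vector<Slice>;

// Scalar model: `memory` must cover at least the active numRows x numCols
// region (accesses are bounds-checked with .at()).
Tile maskedTileLoadWithPad(const Tile &memory, std::size_t numSlices,
                           std::size_t sliceLen, std::size_t numRows,
                           std::size_t numCols, int32_t pad) {
  Tile tile(numSlices, Slice(sliceLen, 0));
  for (std::size_t row = 0; row < numSlices; ++row) {
    // Broadcast the scalar pad to a 1-D vector; it acts as the passthru.
    Slice slice(sliceLen, pad);
    const bool rowActive = row < numRows;
    for (std::size_t col = 0; col < sliceLen; ++col) {
      // The 2-D mask collapses to a 1-D mask for this slice.
      const bool active = rowActive && col < numCols;
      if (active)
        slice[col] = memory.at(row).at(col); // masked load of one element
    }
    tile[row] = slice; // models inserting the slice into the tile
  }
  return tile;
}
```

Inactive rows end up holding the pad value in every lane, which is why this path cannot reuse the "zero the tile, then load active rows" shortcut used for a zero pad.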
2023-11-08 09:02:09 +00:00
Cullen Rhodes
ed350bb3d8
[mlir][ArmSME] Add support for lowering masked tile_store ops (#71180)
This patch extends ArmSMEToSCF to support lowering of masked tile_store
ops. Only masks created by 'vector.create_mask' are currently supported.

Example:

  %mask = vector.create_mask %c3, %c2 : vector<[4]x[4]xi1>
  arm_sme.tile_store %tile, %dest[%c0, %c0], %mask : memref<?x?xi32>,
    vector<[4]x[4]xi32>

Produces:

  %num_rows = arith.constant 3 : index
  %num_cols = vector.create_mask %c2 : vector<[4]xi1>
  scf.for %slice_idx = %c0 to %num_rows step %c1
    arm_sme.store_tile_slice %tile, %slice_idx, %num_cols, %dest[%slice_idx, %c0]
      : memref<?x?xi32>, vector<[4]xi1>, vector<[4]x[4]xi32>
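
The effect of the loop above can be checked against a scalar C++ model. This is a sketch of the intended semantics, not the lowering itself; `maskedTileStore` and the `Row`/`Matrix` aliases are hypothetical, and the model bounds both loops by the tile size purely for its own memory safety.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

using Row = std::vector<int32_t>;
using Matrix = std::vector<Row>;

// Stores the first numRows slices of `tile` into `dest`, writing only the
// first numCols elements of each slice. `dest` must be large enough for
// the written region (accesses are bounds-checked with .at()).
void maskedTileStore(const Matrix &tile, Matrix &dest, std::size_t numRows,
                     std::size_t numCols) {
  const std::size_t rowBound = std::min(numRows, tile.size());
  for (std::size_t sliceIdx = 0; sliceIdx < rowBound; ++sliceIdx) {
    const std::size_t colBound = std::min(numCols, tile[sliceIdx].size());
    for (std::size_t col = 0; col < colBound; ++col)
      dest.at(sliceIdx).at(col) = tile[sliceIdx][col];
  }
}
```

Calling it with `numRows = 3` and `numCols = 2` on a 4x4 tile mirrors the `%c3, %c2` mask in the example: only the top-left 3x2 region of the destination is written.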
2023-11-06 11:18:57 +00:00
Cullen Rhodes
8f564e014e
[mlir][ArmSME] Add mask operand to store_tile_slice (#70838)
2023-11-02 08:43:37 +00:00
Cullen Rhodes
8ea260a093
[mlir][ArmSME] Add mask operand to load_tile_slice (#70655)
2023-10-31 13:08:55 +00:00
Cullen Rhodes
d86047cb66
[mlir][ArmSME] Update tile slice layout syntax (#69151)
This patch prefixes tile slice layout with `layout` in the
assemblyFormat:

  - `<vertical>`   -> `layout<vertical>`
  - `<horizontal>` -> `layout<horizontal>`

The reason for this change is the current format doesn't play nicely
with additional optional operands, required to support padding and
masking (#69148), as it becomes ambiguous.

This affects the following ops:

  - arm_sme.tile_load
  - arm_sme.tile_store
  - arm_sme.load_tile_slice
  - arm_sme.store_tile_slice
2023-10-16 10:55:30 +01:00
Benjamin Maxwell
b34f15df55
[mlir][ArmSME] Add arm_sme.move_tile_slice_to_vector op (#67652)
This adds a simple higher-level op for the tile slice to vector
intrinsics (and updates the existing vector.print lowering to use it).
This op will be used a few more times to implement vector.insert/extract
lowerings in later patches.
2023-09-29 10:33:09 +01:00
Benjamin Maxwell
174cd6145b
[mlir][ArmSME] Add custom vector.print lowering for SME tiles (#66691)
This adds a custom lowering for SME that loops over each row of the
tile, extracting it via an SME MOVA, then printing with a normal 1D
vector.print.

This makes writing SME integration tests easier and less verbose.

Depends on: #66910, #66911
2023-09-26 17:09:57 +01:00
Cullen Rhodes
75a71c27c1
[mlir][ArmSME] Support vertical layout in load and store ops (#66758)
In SME a ZA tile slice is a one-dimensional set of horizontally or
vertically contiguous elements within a ZA tile. Currently the load and
store ops only support horizontal tile slices. This patch adds a tile
slice layout attribute to the load and store ops to support both
horizontal and vertical tile slices.

When lowering from Vector dialect horizontal layout is the default.
2023-09-25 09:34:23 +01:00
Cullen Rhodes
65a6be5de9
[mlir][ArmSME] Use memref indices for load and store
This patch extends the ArmSME load and store op lowering to use the
memref indices. An integration test is added to verify this: it loads
two 32-bit element ZA tiles from memory and stores them back to memory in
reverse order.

Depends on D156467 D156558

Reviewed By: awarzynski, dcaballe

Differential Revision: https://reviews.llvm.org/D156689
2023-08-03 08:50:12 +00:00
Cullen Rhodes
9e1b825321
[mlir][ArmSME] Add conversion from ArmSME to SCF to materialize loops
Currently a loop is materialized when lowering ArmSME loads and stores
to intrinsics. This patch introduces two new ops to the ArmSME dialect
that map 1-1 with intrinsics:

  1. arm_sme.load_tile_slice  - Loads a 1D tile slice from
     memory into a 2D SME "virtual tile".
  2. arm_sme.store_tile_slice - Stores a 1D tile slice from a 2D SME
     "virtual tile" into memory.

It also adds a new conversion pass '-convert-arm-sme-to-scf' that
materializes loops with these ops. The existing load/store lowering to
intrinsics is updated to use these ops.

Depends on D156517

Discourse thread:
https://discourse.llvm.org/t/loop-materialization-in-armsme/72354

Reviewed By: awarzynski, dcaballe, WanderAway

Differential Revision: https://reviews.llvm.org/D156467
2023-08-01 08:20:02 +00:00