llvm-project

Author	SHA1	Message	Date
Benjamin Maxwell	d1fc59c3b5	[mlir][ArmSME] Rewrite illegal `shape_casts` to `vector.transpose` ops (#82985 ) This adds a rewrite that converts illegal 2D unit-dim `shape_casts` into `vector.transpose` ops. E.g. ```mlir // Case 1: %a = vector.shape_cast %0 : vector<[4]x1xf32> to vector<1x[4]xf32> // Case 2: %b = vector.shape_cast %1 : vector<[4]x1xf32> to vector<[4]xf32> ``` Becomes: ```mlir // Case 1: %a = vector.transpose %0, [1, 0] : vector<[4]x1xf32> to vector<1x[4]xf32> // Case 2: %t = vector.transpose %1, [1, 0] : vector<[4]x1xf32> to vector<1x[4]xf32> %b = vector.shape_cast %t : vector<1x[4]xf32> to vector<[4]xf32> ``` Various lowerings and drop unit-dims patterns add such shape_casts, however, if they do not cancel out (which they likely won't if we've reached the vector-legalization pass) they will prevent lowering the IR. Rewriting them as a transpose gives `LiftIllegalVectorTransposeToMemory` a chance to eliminate the illegal types.	2024-03-07 17:04:12 +00:00
Benjamin Maxwell	8cfb71613c	[mlir][ArmSME] Replace use of `isa` with `isa_and_present` (#82798 ) `op` can be null here, in which case this should just return a null value back.	2024-02-26 09:44:26 +00:00
Benjamin Maxwell	1408667fdd	[mlir][ArmSME] Follow MLIR constant style in VectorLegalization.cpp (NFC)	2024-02-23 16:55:32 +00:00
Cullen Rhodes	fff86c6111	[mlir][ArmSME] Support 4-way widening outer products (#79288 ) This patch introduces support for 4-way widening outer products. This enables the fusion of 4 'arm_sme.outerproduct' operations that are chained via the accumulator into single widened operations. Changes: - Adds the following operations: - smopa_4way, smops_4way - umopa_4way, umops_4way - sumopa_4way, sumops_4way - usmopa_4way, usmops_4way - Implements conversions for the above ops to intrinsics in ArmSMEToLLVM. - Extends 'arm-sme-outer-product' pass. For a detailed description of these operations see the 'arm_sme.smopa_4way' description.	2024-02-07 08:17:47 +00:00
Benjamin Maxwell	0473e322f6	[mlir][ArmSME] Add rewrite to lift illegal vector.transposes to memory (#80170 ) When unrolling the reduction dimension of something like a matmul for SME, you can end up with transposed reads of illegal types, like so: ```mlir %illegalRead = vector.transfer_read %memref[%a, %b] : memref<?x?xf32>, vector<[8]x4xf32> %legalType = vector.transpose %illegalRead, [1, 0] : vector<[8]x4xf32> to vector<4x[8]xf32> ``` Here the `vector<[8]x4xf32>` is an illegal type, there's no way to lower a scalable vector of fixed vectors. However, as the final type `vector<4x[8]xf32>` is legal, we can instead lift the transpose to memory (producing a strided memref), and eliminate all the illegal types. This is shown below. ```mlir %readSubview = memref.subview %memref[%a, %b] [%c8_vscale, %c4] [%c1, %c1] : memref<?x?xf32> to memref<?x?xf32> %transpose = memref.transpose %readSubview (d0, d1) -> (d1, d0) : memref<?x?xf32> to memref<?x?xf32> %legalType = vector.transfer_read %transpose[%c0, %c0] : memref<?x?xf32>, vector<4x[8]xf32> ```	2024-02-06 09:30:55 +00:00
Cullen Rhodes	5f5b3bb22b	[mlir][ArmSME] Add rewrites to swap extract of extend (#80407 ) In mixed matmul lowering (e.g., i8 to i32) we're seeing the following sequence: %0 = arith.extsi %src : vector<4x[8]xi8> to vector<4x[8]xi32> %1 = vector.extract %0[0] : vector<[8]xi32> from vector<4x[8]xi32> %lhs = vector.scalable.extract %1[0] : vector<[4]xi32> from vector<[8]xi32> ... (same for rhs) %2 = vector.outerproduct %lhs, %rhs, %acc : vector<[4]xi32>, vector<[4]xi32> // x4 chained by accumulator This chain of 4 outer products can be fused into a single 4-way widening variant but the pass doesn't match on the IR, as it expects the source of the inputs to be an extend and it can't look through the extracts. This patch fixes this with two rewrites that swap extract(extend) into extend(extract). Related to #78975, #79288.	2024-02-05 14:13:53 +00:00
Benjamin Maxwell	c2dea7122c	[mlir][ArmSME] Fold extracts from 3D create_masks of SME-like masks (#80148 ) When unrolling the reduction dimension of something like a matmul for SME, it is possible to get 3D masks, which are vectors of SME-like masks. The 2D masks for individual operations are then extracted from the 3D masks. i.e.: ```mlir %mask = vector.create_mask %nonConstantDim, %a, %b : vector<4x[4]x[4]xi1> %subMask = vector.extract %mask[2] : vector<[4]x[4]xi1> from vector<4x[4]x[4]xi1> ``` ArmSME only supports lowering 2D create_masks, so we must fold the extract into the create_mask. This can be done by checking if the extraction index is within the true region, then using that to select the first dimension of the 2D mask. This is shown below. ```mlir %extractionInTrueRegion = arith.cmpi slt, %c2, %nonConstantDim : index %newMaskFrontDim = arith.select %extractionInTrueRegion, %a, %c0 : index %subMask = vector.create_mask %newMaskFrontDim, %b : vector<[4]x[4]xi1> ```	2024-02-02 10:06:11 +00:00
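The fold described in c2dea7122c can be modelled on plain integers. The following is a minimal sketch, not the upstream pattern: `Mask2D` and `foldExtractOfCreateMask` are hypothetical names, and the mask is represented only by its `create_mask` bounds.

```cpp
#include <cassert>
#include <cstdint>

// Bounds of a 2D create_mask: elements [0, rows) x [0, cols) are true.
// Hypothetical type used only for this illustration.
struct Mask2D {
  int64_t rows;
  int64_t cols;
};

// Models extracting slice `index` from a 3D mask created with bounds
// (frontDim, a, b). If the index lies inside the true region, the slice is
// the 2D mask (a, b); otherwise every element is false, i.e. (0, b).
inline Mask2D foldExtractOfCreateMask(int64_t frontDim, int64_t a, int64_t b,
                                      int64_t index) {
  assert(index >= 0 && "extraction index must be non-negative");
  bool inTrueRegion = index < frontDim; // mirrors arith.cmpi slt
  return Mask2D{inTrueRegion ? a : 0, b}; // mirrors arith.select
}
```

With a front bound of 3, extracting index 2 keeps the `(a, b)` mask; with a front bound of 2 the same extraction produces an all-false mask.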
Benjamin Maxwell	042800a4dd	[mlir][ArmSME] Add initial SME vector legalization pass (#79152 ) This adds a new pass (`-arm-sme-vector-legalization`) which legalizes vector operations so that they can be lowered to ArmSME. This initial patch adds decomposition for `vector.outerproduct`, `vector.transfer_read`, and `vector.transfer_write` when they operate on vector types larger than a single SME tile. For example, a [8]x[8]xf32 outer product would be decomposed into four [4]x[4]xf32 outer products, which could then be lowered to ArmSME. These three ops have been picked as supporting them alone allows lowering matmuls that use all ZA accumulators to ArmSME. For it to be possible to legalize a vector type it has to be a multiple of an SME tile size, but other than that any shape can be used. E.g. `vector<[8]x[8]xf32>`, `vector<[4]x[16]xf32>`, `vector<[16]x[4]xf32>` can all be lowered to four `vector<[4]x[4]xf32>` operations. In future, this pass will be extended with more SME-specific rewrites to legalize unrolling the reduction dimension of matmuls (which is not type-decomposition), which is why the pass has quite a general name.	2024-01-31 11:55:22 +00:00
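The type decomposition described in 042800a4dd can be sketched as a plain index computation. This is an illustration under stated assumptions rather than the pass itself: `decomposeToSMETiles` is a hypothetical helper, `rows` and `cols` are the scalable base sizes (the `N` in `[N]`), and the tile edge is taken to be `128 / elementBits` (4 for f32).

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Returns the (row, column) offset of every SME tile inside a larger
// scalable 2D vector, in row-major order. Returns an empty vector when the
// shape is not a multiple of the SME tile size and so cannot be legalized.
inline std::vector<std::pair<unsigned, unsigned>>
decomposeToSMETiles(unsigned rows, unsigned cols, unsigned elementBits) {
  assert(elementBits > 0 && 128 % elementBits == 0 &&
         "element width must divide 128 bits");
  const unsigned tileDim = 128 / elementBits; // e.g. 4 for f32, 2 for f64
  std::vector<std::pair<unsigned, unsigned>> offsets;
  if (rows % tileDim != 0 || cols % tileDim != 0)
    return offsets;
  for (unsigned r = 0; r < rows; r += tileDim)
    for (unsigned c = 0; c < cols; c += tileDim)
      offsets.emplace_back(r, c);
  return offsets;
}
```

For f32, the shapes `[8]x[8]`, `[4]x[16]` and `[16]x[4]` each yield four offsets, matching the four `vector<[4]x[4]xf32>` operations mentioned in the commit message.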
Cullen Rhodes	95ef8e3868	[mlir][ArmSME] Support 2-way widening outer products (#78975 ) This patch introduces support for 2-way widening outer products. This enables the fusion of 2 'arm_sme.outerproduct' operations that are chained via the accumulator into a 2-way widening outer product operation. Changes: - Add 'llvm.aarch64.sme.[us]mop[as].za32' intrinsics for 2-way variants. These map to instruction variants added in SME2 and use different intrinsics. Intrinsics are already implemented for widening variants from SME1. - Adds the following operations: - fmopa_2way, fmops_2way - smopa_2way, smops_2way - umopa_2way, umops_2way - Implements conversions for the above ops to intrinsics in ArmSMEToLLVM. - Adds a pass 'arm-sme-outer-product-fusion' that fuses 'arm_sme.outerproduct' operations. For a detailed description of these operations see the 'arm_sme.fmopa_2way' description. The reason for introducing many operations rather than one is the signed/unsigned variants can't be distinguished with types (e.g., ui16, si16) since 'arith.extui' and 'arith.extsi' only support signless integers. A single operation would require this information and an attribute (for example) for the sign doesn't feel right if floating-point types are also supported where this wouldn't apply. Furthermore, the SME FP8 extensions (FEAT_SME_F8F16, FEAT_SME_F8F32) introduce FMOPA 2-way (FP8 to FP16) and 4-way (FP8 to FP32) variants but no subtract variant. Whilst these are not supported in this patch, it felt simpler to have separate ops for add/subtract given this.	2024-01-31 09:13:18 +00:00
Matthias Springer	5fcf907b34	[mlir][IR] Rename "update root" to "modify op" in rewriter API (#78260 ) This commit renames 4 pattern rewriter API functions: * `updateRootInPlace` -> `modifyOpInPlace` * `startRootUpdate` -> `startOpModification` * `finalizeRootUpdate` -> `finalizeOpModification` * `cancelRootUpdate` -> `cancelOpModification` The term "root" is a misnomer. The root is the op that a rewrite pattern matches against (https://mlir.llvm.org/docs/PatternRewriter/#root-operation-name-optional). A rewriter must be notified of all in-place op modifications, not just in-place modifications of the root (https://mlir.llvm.org/docs/PatternRewriter/#pattern-rewriter). The old function names were confusing and have contributed to various broken rewrite patterns. Note: The new function names use the term "modify" instead of "update" for consistency with the `RewriterBase::Listener` terminology (`notifyOperationModified`).	2024-01-17 11:08:59 +01:00
Matthias Springer	e2bb47caa6	[mlir][Arm] Fix invalid rewrite pattern API violations (#78246 ) This commit fixes rewrite pattern API violations: * Rewrite pattern must return "failure" if the IR was not modified. * In-place op modifications must be communicated to the rewriter (`updateRootInPlace`). This commit fixes `test/Dialect/ArmSVE/legalize-vector-storage.mlir`, `test/Dialect/ArmSME/vector-ops-to-llvm.mlir`, `test/Dialect/ArmSME/tile-allocation-invalid.mlir`, `test/Conversion/ArmSMEToLLVM/arm-sme-to-llvm.mlir`, `test/Conversion/ArmSMEToLLVM/tile-spills-and-fills.mlir`, `test/Conversion/ArmSMEToLLVM/unsupported.mlir` when running with `MLIR_ENABLE_EXPENSIVE_PATTERN_API_CHECKS`. --------- Co-authored-by: Benjamin Maxwell <macdue@dueutil.tech>	2024-01-16 13:26:39 +01:00
Benjamin Maxwell	5417a5fed6	[mlir][ArmSME] Add rudimentary support for tile spills to the stack (#76086 ) This adds very basic (and inelegant) support for something like spilling and reloading tiles, if you use more SME tiles than physically exist. This is purely implemented to prevent the compiler from aborting if a function uses too many tiles (i.e. due to bad unrolling), but is expected to perform very poorly. Currently, this works in two stages: During tile allocation, if we run out of tiles instead of giving up, we switch to allocating 'in-memory' tile IDs. These are tile IDs that start at 16 (which is higher than any real tile ID). A warning will also be emitted for each (root) tile op assigned an in-memory tile ID: ``` warning: failed to allocate SME virtual tile to operation, all tile operations will go through memory, expect degraded performance ``` Everything after this works like normal until `-convert-arm-sme-to-llvm` Here the in-memory tile op: ```mlir arm_sme.tile_op { tile_id = <IN MEMORY TILE> } ``` Is lowered to: ```mlir // At function entry: %alloca = memref.alloca ... : memref<?x?xty> // Around the op: // Swap the contents of %alloca and tile 0. scf.for %slice_idx { %current_slice = "arm_sme.intr.read.horiz" ... <{tile_id = 0 : i32}> "arm_sme.intr.ld1h.horiz"(%alloca, %slice_idx) <{tile_id = 0 : i32}> vector.store %current_slice, %alloca[%slice_idx, %c0] } // Execute op using tile 0. arm_sme.tile_op { tile_id = 0 } // Swap the contents of %alloca and tile 0. // This restores tile 0 to its original state. scf.for %slice_idx { %current_slice = "arm_sme.intr.read.horiz" ... <{tile_id = 0 : i32}> "arm_sme.intr.ld1h.horiz"(%alloca, %slice_idx) <{tile_id = 0 : i32}> vector.store %current_slice, %alloca[%slice_idx, %c0] } ``` This is inserted during the lowering to LLVM as spilling/reloading registers is a very low-level concept, that can't really be modeled correctly at a high level in MLIR. Note: This is always doing the worst case full-tile swap. 
This could be optimized to only spill/load data the tile op will use, which could be just a slice. It's also not making any use of liveness, which could allow reusing tiles. But this is not seen as important as correct code should only use the available number of tiles.	2024-01-12 14:51:47 +00:00
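The swap-execute-swap scheme from 5417a5fed6 can be modelled with ordinary arrays. This is only a sketch: `Tile`, `swapTileWithMemory` and `runInMemoryTileOp` are hypothetical names, and a fixed 4x4 integer array stands in for both the physical tile 0 and the stack allocation.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kSlices = 4; // assumed tile size for the model
using Tile = std::array<std::array<int, kSlices>, kSlices>;

// Swaps the contents of the tile and the spill buffer one slice at a time,
// mirroring the read / load / store sequence in the lowered loop.
inline void swapTileWithMemory(Tile &tile, Tile &memory) {
  for (std::size_t slice = 0; slice < kSlices; ++slice) {
    std::array<int, kSlices> current = tile[slice]; // read the tile slice
    tile[slice] = memory[slice];                    // load slice from memory
    memory[slice] = current;                        // store the old slice
  }
}

// Runs `op` on an in-memory tile by temporarily swapping it into tile 0.
// The second swap restores tile 0 and writes the result back to memory.
template <typename Op>
void runInMemoryTileOp(Tile &tile0, Tile &spill, Op op) {
  assert(&tile0 != &spill && "tile and spill buffer must be distinct");
  swapTileWithMemory(tile0, spill);
  op(tile0);
  swapTileWithMemory(tile0, spill);
}
```

After `runInMemoryTileOp` returns, the original contents of tile 0 are unchanged and the spill buffer holds the op's result, which is why two full swaps are the worst case the commit message refers to.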
Matthias Springer	db8a119e8f	[mlir][ArmSME] Fix invalid rewriter API usage (#76123 ) When operations are modified in-place, the rewriter must be notified. This commit fixes `mlir/test/Conversion/ArmSMEToLLVM/unsupported.mlir`, `mlir/test/Dialect/ArmSME/tile-zero-masks.mlir` and `mlir/test/Dialect/ArmSME/vector-ops-to-llvm.mlir` when running with `MLIR_ENABLE_EXPENSIVE_PATTERN_API_CHECKS` enabled.	2023-12-21 17:39:36 +09:00
Benjamin Maxwell	01e40a8a3d	[mlir][ArmSME] Remove ArmSMETypeConverter (and configure LLVM one instead) (#73639 ) This patch removes the ArmSMETypeConverter, and instead updates `populateArmSMEToLLVMConversionPatterns()` to add an ArmSME vector type conversion to the existing LLVMTypeConverter. This makes it easier to add these patterns to an existing `-to-llvm` lowering pass.	2023-12-04 17:02:48 +00:00
Benjamin Maxwell	f7d91faa79	[mlir][ArmSME] Add option to only enable streaming mode/ZA if required (#73931 ) This adds an `only-if-required-by-ops` flag to the `enable-arm-streaming` pass. This flag defaults to `false` (which preserves the original behaviour), however, if set to `true` the pass will only add the selected ZA/streaming mode to functions that contain ops that implement `ArmSMETileOpInterface`. This simplifies enabling these modes, as we can now first try lowering ops to ArmSME, then only if we succeed, add the relevant function attributes.	2023-12-01 10:39:01 +00:00
Benjamin Maxwell	eaff02f28e	[mlir][ArmSME] Switch to an attribute-based tile allocation scheme (#73253 ) This reworks the ArmSME dialect to use attributes for tile allocation. This has a number of advantages and corrects some issues with the previous approach: * Tile allocation can now be done ASAP (i.e. immediately after `-convert-vector-to-arm-sme`) * SSA form for control flow is now supported (e.g.`scf.for` loops that yield tiles) * ArmSME ops can be converted to intrinsics very late (i.e. after lowering to control flow) * Tests are simplified by removing constants and casts * Avoids correctness issues with representing LLVM `immargs` as MLIR values - The tile ID on the SME intrinsics is an `immarg` (so is required to be a compile-time constant), `immargs` should be mapped to MLIR attributes (this is already the case for intrinsics in the LLVM dialect) - Using MLIR values for `immargs` can lead to invalid LLVM IR being generated (and passes such as -cse making incorrect optimizations) As part of this patch we bid farewell to the following operations: ```mlir arm_sme.get_tile_id : i32 arm_sme.cast_tile_to_vector : i32 to vector<[4]x[4]xi32> arm_sme.cast_vector_to_tile : vector<[4]x[4]xi32> to i32 ``` These are now replaced with: ```mlir // Allocates a new tile with (indeterminate) state: arm_sme.get_tile : vector<[4]x[4]xi32> // A placeholder operation for lowering ArmSME ops to intrinsics: arm_sme.materialize_ssa_tile : vector<[4]x[4]xi32> ``` The new tile allocation works by operations implementing the `ArmSMETileOpInterface`. This interface says that an operation needs to be assigned a tile ID, and may conditionally allocate a new SME tile. Operations allocate a new tile by implementing... ```c++ std::optional<arm_sme::ArmSMETileType> getAllocatedTileType() ``` ...and returning what type of tile the op allocates (ZAB, ZAH, etc). Operations that don't allocate a tile return `std::nullopt` (which is the default behaviour). 
Currently the following ops are defined as allocating: ```mlir arm_sme.get_tile arm_sme.zero arm_sme.tile_load arm_sme.outerproduct // (if no accumulator is specified) ``` Allocating operations become the roots for the tile allocation pass, which currently just (naively) assigns all transitive uses of a root operation the same tile ID. However, this is enough to handle current use cases. Once tile IDs have been allocated subsequent rewrites can forward the tile IDs to any newly created operations.	2023-11-30 10:22:22 +00:00
Benjamin Maxwell	dff97c1e4c	[mlir][ArmSME] Move ArmSME -> intrinsics lowerings to `convert-arm-sme-to-llvm` pass (#72890 ) This gives more flexibility with when these lowerings are performed, without also lowering unrelated vector ops. This is a NFC (other than adding a new `-convert-arm-sme-to-llvm` pass)	2023-11-22 13:36:36 +00:00
Benjamin Maxwell	c4c52d4199	[mlir][ArmSME] Move vector.extract/insert lowerings to vector-to-arm-sme (NFC) (#72852 ) These were placed in LegalizeForLLVMExport.cpp, which is the wrong stage for these, as these lower to high-level ArmSME ops, not intrinsics.	2023-11-20 14:04:59 +00:00
Benjamin Maxwell	783ac3b6fb	[mlir][ArmSME] Make use of backend function attributes for enabling ZA storage (#71044 ) Previously, we were inserting za.enable/disable intrinsics for functions with the "arm_za" attribute (at the MLIR level), rather than using the backend attributes. This was done to avoid a dependency on the SME ABI functions from compiler-rt (which have only recently been implemented). Doing things this way did have correctness issues, for example, calling a streaming-mode function from another streaming-mode function (both with ZA enabled) would lead to ZA being disabled after returning to the caller (where it should still be enabled). Fixing issues like this would require re-doing the ABI work already done in the backend within MLIR. Instead, this patch switches to use the "arm_new_za" (backend) attribute for enabling ZA for an MLIR function. For the integration tests, this requires some way of linking the SME ABI functions. This is done via the `%arm_sme_abi_shlib` lit substitution. By default, this expands to a stub implementation of the SME ABI functions, but this can be overridden by providing the `ARM_SME_ABI_ROUTINES_SHLIB` CMake cache variable (pointing it at an alternative implementation). For now, the ArmSME integration tests pass with just stubs, as we don't make use of nested ZA-enabled calls. A future patch may add an option to compiler-rt to build the SME builtins into a standalone shared library to allow easily building/testing with the actual implementation.	2023-11-14 12:50:38 +00:00
Cullen Rhodes	8f564e014e	[mlir][ArmSME] Add mask operand to store_tile_slice (#70838 )	2023-11-02 08:43:37 +00:00
Cullen Rhodes	8ea260a093	[mlir][ArmSME] Add mask operand to load_tile_slice (#70655 )	2023-10-31 13:08:55 +00:00
Benjamin Maxwell	e666295011	[mlir][ArmSME] Support lowering masked vector.outerproduct ops to SME (#69604 ) This patch adds support for lowering masked outer products to SME. This is done in two stages. First, vector.outerproducts (both masked and non-masked) are rewritten to arm_sme.outerproducts. The arm_sme.outerproduct op is close to vector.outerproduct, but supports masking on the operands rather than the result. It also limits the cases it handles to things that could be (directly) lowered to SME. This currently requires that the source of the mask is a vector.create_mask op. E.g.: ```mlir %mask = vector.create_mask %dimA, %dimB : vector<[4]x[4]xi1> %result = vector.mask %mask { vector.outerproduct %vecA, %vecB : vector<[4]xf32>, vector<[4]xf32> } : vector<[4]x[4]xi1> -> vector<[4]x[4]xf32> ``` Is rewritten to: ``` %maskA = vector.create_mask %dimA : vector<[4]xi1> %maskB = vector.create_mask %dimB : vector<[4]xi1> %result = arm_sme.outerproduct %vecA, %vecB masks(%maskA, %maskB) : vector<[4]xf32>, vector<[4]xf32> ``` (The same rewrite works for non-masked vector.outerproducts too) The arm_sme.outerproduct can then be directly lowered to SME intrinsics.	2023-10-31 09:06:21 +00:00
Cullen Rhodes	2f055ddca3	[mlir][ArmSME] Add tile slice layout attr to vector <-> tile ops (#69186 ) This is used in #69148 when lowering masked tile_store with non-zero pad, see #69148 This updates: * `arm_sme.move_vector_to_tile_slice` * `arm_sme.move_tile_slice_to_vector`	2023-10-25 14:44:31 +01:00
Benjamin Maxwell	496318ad8d	[mlir][ArmSME] Lower vector.extract/insert on SME tiles to MOVA intrinsics (#67786 ) This patch adds support for lowering vector.insert/extract of tile slices or elements to ArmSME MOVA intrinsics. This enables the following operations for ArmSME: ``` // Extract slice from tile: %slice = vector.extract %tile[%row] : vector<[4]xi32> from vector<[4]x[4]xi32> ``` ``` // Extract element from tile: %el = vector.extract %tile[%row, %col] : i32 from vector<[4]x[4]xi32> ``` ``` // Insert slice into tile: %new_tile = vector.insert %slice, %tile[%row] : vector<[4]xi32> into vector<[4]x[4]xi32> ``` ``` // Insert element into tile: %new_tile = vector.insert %el, %tile[%row, %col] : i32 into vector<[4]x[4]xi32> ```	2023-10-04 09:28:39 +01:00
Andrzej Warzynski	23b5f92c97	[mlir][SME] Re-order patterns alphabetically (nfc)	2023-09-29 16:54:47 +00:00
Benjamin Maxwell	b34f15df55	[mlir][ArmSME] Add arm_sme.move_tile_slice_to_vector op (#67652 ) This adds a simple higher-level op for the tile slice to vector intrinsics (and updates the existing vector.print lowering to use it). This op will be used a few more times to implement vector.insert/extract lowerings in later patches.	2023-09-29 10:33:09 +01:00
Benjamin Maxwell	174cd6145b	[mlir][ArmSME] Add custom vector.print lowering for SME tiles (#66691 ) This adds a custom lowering for SME that loops over each row of the tile, extracting it via an SME MOVA, then printing with a normal 1D vector.print. This makes writing SME integration tests easier and less verbose. Depends on: #66910, #66911	2023-09-26 17:09:57 +01:00
Cullen Rhodes	75a71c27c1	[mlir][ArmSME] Support vertical layout in load and store ops (#66758 ) In SME a ZA tile slice is a one-dimensional set of horizontally or vertically contiguous elements within a ZA tile. Currently the load and store ops only support horizontal tile slices. This patch adds a tile slice layout attribute to the load and store ops to support both horizontal and vertical tile slices. When lowering from Vector dialect horizontal layout is the default.	2023-09-25 09:34:23 +01:00
Benjamin Maxwell	06f22c9ae5	[mlir][ArmSME] Add `enable_arm_streaming_ignore` attribute (#66911 ) This attribute makes the `enable_arm_streaming` pass ignore a function (i.e. not add the enable streaming/za attributes). The main use case for this is to prevent helper functions within tests being made streaming functions.	2023-09-22 11:27:55 +01:00
Cullen Rhodes	f75d46a7ec	[mlir][ArmSME] Lower vector.outerproduct to FMOPA/BFMOPA (#65621 ) This patch adds support for lowering vector.outerproduct to the ArmSME MOPA intrinsic for the following types: vector<[8]xf16>, vector<[8]xf16> -> vector<[8]x[8]xf16> vector<[8]xbf16>, vector<[8]xbf16> -> vector<[8]x[8]xbf16> vector<[4]xf32>, vector<[4]xf32> -> vector<[4]x[4]xf32> vector<[2]xf64>, vector<[2]xf64> -> vector<[2]x[2]xf64> The FP variants are lowered to FMOPA (non-widening) [1] and BFloat to BFMOPA (non-widening) [2]. Note at the ISA level these variants are implemented by different architecture features, these are listed below: FMOPA (non-widening) * half-precision - +sme2p1,+sme-f16f16 * single-precision - +sme * double-precision - +sme-f64f64 BFMOPA (non-widening) * half-precision - +sme2p1,+b16b16 There's currently no way to target different features when lowering to ArmSME. Integration tests are added for F32 and F64. We use QEMU to run the integration tests but SME2 support isn't available yet, it's targeted for 9.0, so integration tests for these variants are excluded. Masking is currently unsupported. Depends on #65450. [1] https://developer.arm.com/documentation/ddi0602/2023-06/SME-Instructions/FMOPA--non-widening---Floating-point-outer-product-and-accumulate- [2] https://developer.arm.com/documentation/ddi0602/2023-06/SME-Instructions/BFMOPA--non-widening---BFloat16-floating-point-outer-product-and-accumulate-	2023-09-14 08:31:52 +01:00
Cullen Rhodes	834cdc8b64	[mlir][ArmSME] Fix get_tile_id type in zero lowering The arm_sme.get_tile_id op returns a scalar integer but the arm_sme.zero op lowering incorrectly uses the element type, which could be floating-point. Reviewed By: awarzynski, benmxwl-arm Differential Revision: https://reviews.llvm.org/D159080	2023-08-30 07:16:35 +00:00
Cullen Rhodes	3b4b6cbba5	[mlir][ArmSME] Add move vector to tile slice op and lowerings This adds a 'move_vector_to_tile_slice' op to the ArmSME dialect that moves a 1-D scalable vector to a slice of a 2-D tile at a given index. This is lowered to the 'llvm.aarch64.sme.write.horiz' intrinsic that maps to the MOVA (vector to tile, single) SME instruction [1] when lowering to LLVM. Like the SME load and store instructions this operates on ZA tile slices, which are 1D vectors of horizontally or vertically contiguous elements within a ZA tile. This patch extends the lowering of 'arith.constant' to SME to support non-zero constants using this new op. This requires materializing a loop that broadcasts the constant to each tile slice with the 'vector_to_tile_slice' op. Unlike load and store, this is done during conversion from Vector to ArmSME, rather than ArmSME to SCF. The latter would require a higher-level custom op in the ArmSME dialect like 'tile_load' and 'tile_store' and this isn't necessary. We may also remove the load and store ops in the future in favour of lowering straight from Vector, at which point this would converge. Currently only horizontal tile slices are supported. A future patch will extend this mechanism to support 'vector.broadcast'. Depends on D156980 D157004 [1] https://developer.arm.com/documentation/ddi0602 Reviewed By: awarzynski, dcaballe Differential Revision: https://reviews.llvm.org/D157005	2023-08-29 09:29:22 +00:00
Benjamin Maxwell	97da414182	[mlir][ArmSME] Lower loads/stores of (.Q) 128-bit tiles to intrinsics This follows from D155306. Loads and stores of 128-bit tiles have been confirmed to work in the `load-store-128-bit-tile.mlir` integration test. However, there is currently a bug in QEMU (see: https://gitlab.com/qemu-project/qemu/-/issues/1833) which means this test produces incorrect results (a patch for this issue is available but not yet in any released version of QEMU). Until a fixed version of QEMU is available the integration test is expected to fail. Reviewed By: c-rhodes, awarzynski Differential Revision: https://reviews.llvm.org/D158418	2023-08-23 09:16:20 +00:00
Benjamin Maxwell	a4d87e3d06	[mlir][ArmSME] Calculate correct tile mask when lowering arm_sme.zero This patch updates the lowering of the arm_sme.zero to intrinsics so that it calculates the correct mask for the tile to zero. The zero instruction takes an 8-bit mask which specifies which 64-bit tiles to zero, ZA0.D to ZA7.D correspond to bits 0 to 7. To zero tiles with element sizes of 8-bit to 32-bit just requires zeroing the right 64-bit tiles. This is quite easy to calculate, each size has a "base mask" which can be shifted left by the tile ID to get the mask for that tile. base_mask << tile_id After tile allocation, this will be folded to a constant mask. Reviewed By: awarzynski Differential Revision: https://reviews.llvm.org/D157902	2023-08-18 09:34:29 +00:00
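The `base_mask << tile_id` calculation from a4d87e3d06 can be written out as follows. This is a sketch, not the upstream lowering: `zeroMaskForTile` is a hypothetical helper, and bit `i` of the result is taken to select `ZAi.D`, as the commit message describes.

```cpp
#include <cassert>
#include <cstdint>

// Computes the 8-bit ZERO mask for tile `tileId` of the given element width.
// Each width has a base mask covering tile 0; shifting it left by the tile
// ID selects the 64-bit tiles that make up the requested tile.
inline uint32_t zeroMaskForTile(unsigned elementBits, unsigned tileId) {
  uint32_t baseMask = 0;
  unsigned numTiles = 0;
  switch (elementBits) {
  case 8: // ZA0.B spans all eight 64-bit tiles
    baseMask = 0b11111111;
    numTiles = 1;
    break;
  case 16: // ZA0.H = ZA0.D, ZA2.D, ZA4.D, ZA6.D
    baseMask = 0b01010101;
    numTiles = 2;
    break;
  case 32: // ZA0.S = ZA0.D, ZA4.D
    baseMask = 0b00010001;
    numTiles = 4;
    break;
  case 64: // ZA0.D itself
    baseMask = 0b00000001;
    numTiles = 8;
    break;
  default:
    assert(false && "unsupported element width in this sketch");
    return 0;
  }
  assert(tileId < numTiles && "tile ID out of range for this width");
  return baseMask << tileId;
}
```

Because the tile ID is a constant once allocation has run, the shift folds to a constant mask, as noted in the commit message.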
Cullen Rhodes	65a6be5de9	[mlir][ArmSME] Use memref indices for load and store This patch extends the ArmSME load and store op lowering to use the memref indices. An integration test that loads two 32-bit element ZA tiles from memory and stores them back to memory in reverse order to verify this is added. Depends on D156467 D156558 Reviewed By: awarzynski, dcaballe Differential Revision: https://reviews.llvm.org/D156689	2023-08-03 08:50:12 +00:00
Cullen Rhodes	9e1b825321	[mlir][ArmSME] Add conversion from ArmSME to SCF to materialize loops Currently a loop is materialized when lowering ArmSME loads and stores to intrinsics. This patch introduces two new ops to the ArmSME dialect that map 1-1 with intrinsics: 1. arm_sme.load_tile_slice - Loads a 1D tile slice from memory into a 2D SME "virtual tile". 2. arm_sme.store_tile_slice - Stores a 1D tile slice from a 2D SME "virtual tile" into memory. As well as a new conversion pass '-convert-arm-sme-to-scf' that materializes loops with these ops. The existing load/store lowering to intrinsics is updated to use these ops. Depends on D156517 Discourse thread: https://discourse.llvm.org/t/loop-materialization-in-armsme/72354 Reviewed By: awarzynski, dcaballe, WanderAway Differential Revision: https://reviews.llvm.org/D156467	2023-08-01 08:20:02 +00:00
Cullen Rhodes	ca9a3354d0	[mlir][ArmSME] Add tile load op and extend tile store tile size support This extends the existing 'arm_sme.tile_store' op to support all tile sizes and adds a new op 'arm_sme.tile_load', as well as lowerings from vector -> custom ops and custom ops -> intrinsics. Currently there's no lowering for i128. Depends on D154867 Reviewed By: awarzynski, dcaballe Differential Revision: https://reviews.llvm.org/D155306	2023-07-25 08:28:36 +00:00
Andrzej Warzynski	3fa5ee67ba	[mlir][ArmSME] Introduce custom TypeConverter for ArmSME At the moment, SME-to-LLVM lowerings rely entirely on `LLVMTypeConverter`. This patch introduces a dedicated `TypeConverter` that inherits from `LLVMTypeConverter` (it will also be used when lowering ArmSME Ops to LLVM). The new type converter merely disables lowerings for `VectorType` to prevent 2-d scalable vectors (common in the context of ArmSME), e.g. `vector<[16]x[16]xi8>`, entering the LLVM Type converter. LLVM does not support arrays of scalable vectors and hence the need for specialisation. In the case of SME such types are effectively eliminated when emitting LLVM IR intrinsics for SME. Differential Revision: https://reviews.llvm.org/D155365	2023-07-18 09:35:32 +00:00
Cullen Rhodes	fb54fec726	[mlir][ArmSME] Implement tile allocation This patch adds a pass '-allocate-sme-tiles' to the ArmSME dialect that implements allocation of SME ZA tiles. It does this at the 'func.func' op level by replacing 'arm_sme.get_tile_id' ops with 'arith.constant' ops that represent the tile number. The tiles in use in a given function are tracked by an integer function attribute 'arm_sme.tiles_in_use' that is a 16-bit tile mask with a bit for each 128-bit element tile (ZA0.Q-ZA15.Q), the smallest ZA tile granule. This is initialized on the first 'arm_sme.get_tile_id' rewrite and updated on each subsequent rewrite. Mixing of different element tile types is supported. Section B2.3.2 of the SME spec [1] describes how the 128-bit element tiles overlap with other element tiles. Depends on D154941 [1] https://developer.arm.com/documentation/ddi0616/aa Reviewed By: awarzynski Differential Revision: https://reviews.llvm.org/D154955	2023-07-18 08:46:40 +00:00
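The `arm_sme.tiles_in_use` bookkeeping from fb54fec726 can be sketched as a first-fit search over a 16-bit mask. The helper names below are hypothetical, and the encoding (bit `i` stands for `ZAi.Q`) is an assumption made for this illustration; the upstream pass may order the bits differently. The overlap rule used is the one from the SME spec section cited in the commit: tile `t` of a width with `N` tiles overlaps `ZA(t + k*N).Q`.

```cpp
#include <cassert>
#include <cstdint>

// Number of ZA tiles for an element width: 8 -> 1, 16 -> 2, ..., 128 -> 16.
inline unsigned numTilesFor(unsigned elementBits) {
  assert(elementBits == 8 || elementBits == 16 || elementBits == 32 ||
         elementBits == 64 || elementBits == 128);
  return elementBits / 8;
}

// Mask of the 128-bit tiles overlapped by tile `tileId` of the given width.
inline uint16_t tileMask(unsigned elementBits, unsigned tileId) {
  const unsigned n = numTilesFor(elementBits);
  assert(tileId < n && "tile ID out of range for this width");
  uint16_t mask = 0;
  for (unsigned q = tileId; q < 16; q += n)
    mask = static_cast<uint16_t>(mask | (1u << q));
  return mask;
}

// Returns the first tile ID whose 128-bit tiles are all free and marks them
// as used in `tilesInUse`, or -1 if no tile of this width is available.
inline int allocateTile(unsigned elementBits, uint16_t &tilesInUse) {
  const unsigned n = numTilesFor(elementBits);
  for (unsigned id = 0; id < n; ++id) {
    const uint16_t mask = tileMask(elementBits, id);
    if ((tilesInUse & mask) == 0) {
      tilesInUse = static_cast<uint16_t>(tilesInUse | mask);
      return static_cast<int>(id);
    }
  }
  return -1;
}
```

Tracking the smallest granule is what lets different element types be mixed in one function: after two 32-bit tiles are allocated, no 16-bit tile is free, but `ZA2.D` still is.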
Andrzej Warzynski	447bb5bee4	[mlir][ArmSME] Introduce new lowering layer (Vector -> ArmSME) At the moment, the lowering from the Vector dialect to SME looks like this: * Vector --> SME LLVM IR intrinsics This patch introduces a new lowering layer between the Vector dialect and the Arm SME extension: * Vector --> ArmSME dialect (custom Ops) --> SME LLVM IR intrinsics. This is motivated by 2 considerations: 1. Storing `ZA` to memory (e.g. `vector.transfer_write`) requires an `scf.for` loop over all rows of `ZA`. Similar logic will apply to "load to ZA from memory". This is a rather complex transformation and a custom Op seems justified. 2. As discussed in [1], we need to prevent the LLVM type converter from having to convert types unsupported in LLVM, e.g. `vector<[16]x[16]xi8>`. A dedicated abstraction layer with custom Ops opens a path to some fine tuning (e.g. custom type converters) that will allow us to avoid this. To facilitate this change, two new custom SME Ops are introduced: * `TileStoreOp`, and * `ZeroOp`. Note that no new functionality is added - these Ops merely model what's already supported. In particular, the following tile size is assumed (dimension and element size are fixed): * `vector<[16]x[16]xi8>` The new lowering layer is introduced via a conversion pass between the Vector and the SME dialects. You can use the `-convert-vector-to-sme` flag to run it. The following function: ``` func.func @example(%arg0 : memref<?x?xi8>) { // (...) %cst = arith.constant dense<0> : vector<[16]x[16]xi8> vector.transfer_write %cst, %arg0 : vector<[16]x[16]xi8>, memref<?x?xi8> return } ``` would be lowered to: ``` func.func @example(%arg0: memref<?x?xi8>) { // (...) %0 = arm_sme.zero : vector<[16]x[16]xi8> arm_sme.tile_store %arg0[%c0, %c0], %0 : memref<?x?xi8>, vector<[16]x[16]xi8> return } ``` Later, a mechanism will be introduced to guarantee that `arm_sme.zero` and `arm_sme.tile_store` operate on the same virtual tile. 
For `i8` elements this is not required as there is only one tile. In order to lower the above output to LLVM, use * `-convert-vector-to-llvm="enable-arm-sme"`. [1] https://github.com/openxla/iree/issues/14294 Reviewed By: WanderAway Differential Revision: https://reviews.llvm.org/D154867	2023-07-18 08:04:59 +00:00
Cullen Rhodes	6ff9761a69	[mlir][ArmSME] Add custom get_tile_id and cast ops This patch adds three new custom ops to the ArmSME dialect: * arm_sme.get_tile_id - returns a scalar integer representing an SME "virtual tile" that is not in use. * arm_sme.cast_tile_to_vector - casts from a tile id to a 2-d scalable vector type, which represents an SME "virtual tile". * arm_sme.cast_vector_to_tile - casts from a 2-d scalable vector type, which represents an SME "virtual tile", to a tile id. The 'arm_sme.get_tile_id' op currently only supports tile 0, a follow-up patch will implement proper tile allocation. A further follow-up patch will demonstrate load/store to/from ZA using these ops. See the op descriptions for further details and examples. Thanks to @paulwalker-arm and @awarzynski for helping drive this. Reviewed By: awarzynski, dcaballe Differential Revision: https://reviews.llvm.org/D154941	2023-07-18 07:41:45 +00:00
Cullen Rhodes	564713c471	[mlir][ArmSME] Add basic lowering of vector.transfer_write to zero This patch adds support for lowering a 'vector.transfer_write' of zeroes and type 'vector<[16x16]xi8>' to the SME 'zero {za}' instruction [1], which zeroes the entire accumulator, and then writing it out to memory with the 'str' instruction [2]. This contributes to supporting a path from 'linalg.fill' to SME. [1] https://developer.arm.com/documentation/ddi0602/2022-06/SME-Instructions/ZERO--Zero-a-list-of-64-bit-element-ZA-tiles- [2] https://developer.arm.com/documentation/ddi0602/2022-06/SME-Instructions/STR--Store-vector-from-ZA-array- Reviewed By: awarzynski, dcaballe, WanderAway Differential Revision: https://reviews.llvm.org/D152508	2023-07-03 10:18:43 +00:00
Cullen Rhodes	51b0398b76	[mlir][ArmSME] Fix crash on func decls in 'arm_za' legality checks Reviewed By: dcaballe, Dinistro Differential Revision: https://reviews.llvm.org/D153750	2023-06-27 07:39:35 +00:00
Cullen Rhodes	65305aeab9	[mlir][ArmSME] Insert intrinsics to enable/disable ZA This patch adds two LLVM intrinsics to the ArmSME dialect: * llvm.aarch64.sme.za.enable * llvm.aarch64.sme.za.disable for enabling the ZA storage array [1], as well as patterns for inserting them during legalization to LLVM at the start and end of functions if the function has the 'arm_za' attribute (D152695). In the future ZA should probably be automatically enabled/disabled when lowering from vector to SME, but this should be sufficient for now at least until we have patterns lowering to SME instructions that use ZA. N.B. The backend function attribute 'aarch64_pstate_za_new' can be used to manage ZA state (as was originally tried in D152694), but it emits calls to the following SME support routines [2] for the lazy-save mechanism [3]: * __arm_tpidr2_restore * __arm_tpidr2_save These will soon be added to compiler-rt but there's currently no public implementation, and using this attribute would introduce an MLIR dependency on compiler-rt. Furthermore, this mechanism is for routines with ZA enabled calling other routines with it also enabled. We can choose not to enable ZA in the compiler when this is the case. Depends on D152695 [1] https://developer.arm.com/documentation/ddi0616/aa [2] https://github.com/ARM-software/abi-aa/blob/main/aapcs64/aapcs64.rst#sme-support-routines [3] https://github.com/ARM-software/abi-aa/blob/main/aapcs64/aapcs64.rst#the-za-lazy-saving-scheme Reviewed By: awarzynski, dcaballe Differential Revision: https://reviews.llvm.org/D153050	2023-06-16 09:40:48 +00:00
Cullen Rhodes	e947e76058	[mlir][ArmSME] Extend streaming-mode pass to support enabling ZA This patch extends the 'enable-arm-streaming' pass with a new option to enable the ZA storage array by adding the 'arm_za' attribute to 'func.func' ops. A later patch will insert `llvm.aarch64.sme.za.enable` at the beginning of 'func.func' ops and `llvm.aarch64.sme.za.disable` before `func.return` statements when lowering to LLVM dialect. Currently the pass only supports enabling ZA with streaming-mode on but the SME LDR, STR and ZERO instructions can access ZA when not in streaming-mode (section B1.1.1, IDGNQM [1]), so it may be worth making these options independent in the future. N.B. This patch is generally useful in the context of SME enablement in MLIR, but it will help enable writing an integration test for rewrite pattern that lowers `vector.transfer_write` -> `zero {za}` (D152508). [1] https://developer.arm.com/documentation/ddi0616/aa Reviewed By: awarzynski, dcaballe Differential Revision: https://reviews.llvm.org/D152695	2023-06-16 09:26:42 +00:00
Cullen Rhodes	1e41a29d73	Revert "[mlir][ArmSME] Add initial dialect with basic lowering of vector.transfer write to zero" Apologies, I shouldn't have committed this, need to wait until the planned MLIR ODM: https://discourse.llvm.org/t/rfc-creating-a-armsme-dialect/67208/76 This reverts commit a48fe898857c95a063fa6c201343dca969bc098a.	2023-06-14 09:03:10 +00:00
Cullen Rhodes	a48fe89885	[mlir][ArmSME] Add initial dialect with basic lowering of vector.transfer write to zero This patch adds support for lowering a `vector.transfer_write` of zeroes and type `vector<[16x16]xi8>` to the SME `zero {za}` instruction [1], which zeroes the entire accumulator. This contributes to supporting a path from `linalg.fill` to SME. [1] https://developer.arm.com/documentation/ddi0602/2022-06/SME-Instructions/ZERO--Zero-a-list-of-64-bit-element-ZA-tiles- Reviewed By: awarzynski, dcaballe Differential Revision: https://reviews.llvm.org/D152508	2023-06-14 08:46:53 +00:00
Cullen Rhodes	1264849299	[mlir] Add pass to enable Armv9 Streaming SVE mode This patch adds a pass 'enable-arm-streaming' that enables the Armv9 Scalable Matrix Extension (SME) Streaming SVE (SSVE) mode [1] by adding either of the following attributes to 'func.func' ops: * arm_streaming (default) * arm_locally_streaming PATCH [2 / 2] in series for RFC: https://discourse.llvm.org/t/rfc-supporting-armv9-scalable-matrix-extension-sme-streaming-sve-ssve-mode-in-mlir/70678 [1] https://developer.arm.com/documentation/ddi0616/aa Reviewed By: awarzynski, dcaballe Differential Revision: https://reviews.llvm.org/D150934	2023-05-25 09:20:36 +00:00

48 Commits