Fixes correctness issue with current smmla unrolling patterns whereby
unrolling K dimension would only include the result from the last tile
along K. Updates patterns to feed previous smmla output of the previous
tile into the next one along K.
Patterns in `LowerContractionToSMMLAPattern` are designed to handle
vector-to-matrix multiplication but not matrix-to-vector. This leads to
the following error when processing `rhs` with rank < 2:
```
iree-compile: /usr/local/google/home/kooljblack/code/iree-build/llvm-project/tools/mlir/include/mlir/IR/BuiltinTypeInterfaces.h.inc:268: int64_t mlir::detail::ShapedTypeTrait<mlir::VectorType>::getDimSize(unsigned int) const [ConcreteType = mlir::VectorType]: Assertion `idx < getRank() && "invalid index for shaped type"' failed.
```
Updates to explicitly check the rhs rank and fail cases that cannot
process.
Updates smmla unrolling patterns to handle vecmat contracts where `dimM=1`. This includes explicit vecmats in the form: `<1x8xi8> x <8x8xi8> --> <1x8xi32>` or implied with the leading dim folded: `<8xi8> x <8x8xi8> --> <8xi32>`
Since the smmla operates on two `<2x8xi8>` input vectors to produce `<2x2xi8>` accumulators, half of each 2x2 accumulator tile is dummy data not pertinent to the computation, resulting in half throughput.
This patch updates `LowerContractionToSMMLAPattern` to unroll larger vector contracts into multiple smmla instructions.
Now accepts up to [8,8,8] tiles (previously only [2,2,8]). The N/M dimensions must be powers of 2. `vector.extract_strided_slice`/`vector.insert_strided_slice` divides the contract into tiles to be processed in a row.
This patch adds a the `LowerVectorToArmNeonPattern` patterns to the
ArmNeon.
This pattern inspects `vector.contract` ops that can be 1-1 mapped to an
`arm.neon.smmla` intrinsic. The contract ops must be separated into
tiles who's inputs must fit that of a single smmla op (`2x8xi32` inputs
and `2x2xi32` output). The `vector.contract` inputs must be sign
extended from narrow types (<=i8) to be converted. If all conditions are
met, an smmla op is inserted with additional `vector.shape_casts` to
handle linearizing the input and output dimension.