The PR adds [`vector.contract(transpose_a/transpose_b)` decomposition
patterns](3215645b8d/mlir/lib/Conversion/VectorToGPU/VectorToGPU.cpp (L1263))
from the `vector-to-gpu` pass to the `vector-to-xegpu` pass.
`populatePrepareVectorToMMAPatterns` adds two patterns:
1. `PrepareContractToGPUMMA` that splits `vector.contract(transpose)`
into `vector.transpose + vector.contract`
2. `CombineTransferReadOpTranspose` that fuses `vector.transpose` into
the permutation map of `vector.transfer_read`
The second pattern doesn't always produce the desired result
(`xegpu.load_nd + vector.transpose + xegpu.dpas`) since [not all data
types are supported](1237bd6df0/mlir/lib/Conversion/VectorToXeGPU/VectorToXeGPU.cpp (L570-L575))
for the transposed-read case. A second PR (#182875) addresses this by
adding a decomposition pattern for unsupported types. It might seem
strange to first fuse and then decompose `transfer_read + transpose`,
but this way we avoid code duplication between the `vector-to-gpu` and
`vector-to-xegpu` passes and cover all functional cases.
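The split performed by the first pattern can be illustrated outside of MLIR. The following is a minimal sketch on plain integer matrices, not the actual pattern: `Matrix`, `transpose`, `matmul`, and `contractTransposedA` are hypothetical helpers introduced only to show that a contraction reading a transposed operand equals an explicit transpose followed by a canonical contraction.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<int>>;

// Explicit transpose, standing in for `vector.transpose` (illustrative only).
Matrix transpose(const Matrix &m) {
  assert(!m.empty());
  Matrix t(m[0].size(), std::vector<int>(m.size()));
  for (std::size_t i = 0; i < m.size(); ++i)
    for (std::size_t j = 0; j < m[i].size(); ++j)
      t[j][i] = m[i][j];
  return t;
}

// Canonical contraction: C[i][j] = sum_k A[i][k] * B[k][j].
Matrix matmul(const Matrix &a, const Matrix &b) {
  assert(!a.empty() && !b.empty() && a[0].size() == b.size());
  Matrix c(a.size(), std::vector<int>(b[0].size(), 0));
  for (std::size_t i = 0; i < a.size(); ++i)
    for (std::size_t j = 0; j < b[0].size(); ++j)
      for (std::size_t k = 0; k < b.size(); ++k)
        c[i][j] += a[i][k] * b[k][j];
  return c;
}

// Contraction that reads A transposed: C[i][j] = sum_k At[k][i] * B[k][j].
Matrix contractTransposedA(const Matrix &at, const Matrix &b) {
  assert(!at.empty() && !b.empty() && at.size() == b.size());
  Matrix c(at[0].size(), std::vector<int>(b[0].size(), 0));
  for (std::size_t i = 0; i < at[0].size(); ++i)
    for (std::size_t j = 0; j < b[0].size(); ++j)
      for (std::size_t k = 0; k < b.size(); ++k)
        c[i][j] += at[k][i] * b[k][j];
  return c;
}
```

Under these assumptions, `contractTransposedA(at, b)` and `matmul(transpose(at), b)` produce the same matrix, which is the equivalence the decomposition relies on.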
---------
Signed-off-by: dchigarev <dmitry.chigarev@intel.com>
The vector.store and vector.load lowering in --convert-vector-to-xegpu
would crash when the source memref had a non-integer/float element type
(e.g. memref<?xvector<4xf32>>).
The crash occurred inside createNdDescriptor() when computing the byte
offset for dynamic memrefs: srcTy.getElementTypeBitWidth() internally
calls getIntOrFloatBitWidth() which asserts on non-scalar types such as
vector<4xf32>.
Fix by adding a check for the memref's element type in
storeLoadPreconditions(). If the element type is not an integer or
float, the pattern returns notifyMatchFailure() instead of proceeding
and crashing.
The same guard is applied to TransferReadLowering and
TransferWriteLowering which share the same helper and can hit the same
path.
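The shape of the fix, checking the element type before calling an accessor that asserts, can be sketched in isolation. This is a toy model and not the MLIR API: `ElemKind`, `ElemType`, `intOrFloatBitWidth`, and `elementByteSize` are hypothetical names that only mimic the behavior described above.

```cpp
#include <cassert>
#include <optional>

// Toy stand-in for an element type (assumption: not the real mlir::Type).
enum class ElemKind { Integer, Float, Vector };
struct ElemType {
  ElemKind kind;
  unsigned bits; // scalar bit width; meaningless for Vector in this model
};

bool isIntOrFloat(const ElemType &t) {
  return t.kind == ElemKind::Integer || t.kind == ElemKind::Float;
}

// Mimics an accessor that asserts on non-scalar types.
unsigned intOrFloatBitWidth(const ElemType &t) {
  assert(isIntOrFloat(t) && "only integers and floats have a bit width");
  return t.bits;
}

// Guarded variant: report failure instead of reaching the assertion.
std::optional<unsigned> elementByteSize(const ElemType &t) {
  if (!isIntOrFloat(t))
    return std::nullopt; // analogous to returning a match failure
  return intOrFloatBitWidth(t) / 8;
}
```

In this model a `vector<4xf32>`-like element type yields an empty result rather than tripping the assertion, which mirrors the pattern bailing out early.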
Fixes #181463
Fixes https://github.com/llvm/llvm-project/issues/173851
1. Only allow XeGPU_ScalarType element types in `xegpu::TensorDescType`
(via the verifier, keeping mlir::Type params in the API)
2. Fix `VectorToXeGPU` to prevent vectors with invalid TensorDescType
element types from lowering
If the source strided memref is not fully static, i.e. at least one of
its shape, strides, or offset is kDynamic, use the i64 source variant.
With this change, an xegpu.create_nd_tdesc created by lowering from the
vector dialect can rely on getMixedOffsets, getMixedSize and
getMixedStrides to get the relevant values.
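The static-versus-dynamic decision can be sketched as follows. This is an illustration under assumptions, not the lowering code: `StridedLayout`, `isFullyStatic`, and `useI64Source` are hypothetical names, and the local `kDynamic` constant is a stand-in for the sentinel MLIR uses for unknown values.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>
#include <vector>

// Stand-in for the dynamic-value sentinel (assumption for this sketch).
constexpr int64_t kDynamic = std::numeric_limits<int64_t>::min();

// Hypothetical summary of a strided memref type.
struct StridedLayout {
  std::vector<int64_t> shape;
  std::vector<int64_t> strides;
  int64_t offset;
};

// True only if shape, strides, and offset are all known at compile time.
bool isFullyStatic(const StridedLayout &l) {
  auto isDyn = [](int64_t v) { return v == kDynamic; };
  return l.offset != kDynamic &&
         std::none_of(l.shape.begin(), l.shape.end(), isDyn) &&
         std::none_of(l.strides.begin(), l.strides.end(), isDyn);
}

// Any dynamic component selects the i64 (raw pointer) source variant.
bool useI64Source(const StridedLayout &l) { return !isFullyStatic(l); }
```

For example, a `memref<256x256xf16, strided<[4096, 1], offset: ?>>` has a dynamic offset, so this sketch would select the i64 source variant for it.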
This PR adds support for retaining the anchor op layouts (after dropping
what's not required) for xegpu nD ops during the workgroup-to-subgroup
and unroll transformations.
The PR changes `TransferReadLowering` to always use `xegpu.load`
(and not `xegpu.load_nd`) for 1D cases, as it has a more developed
interface (e.g. layout capabilities).
Signed-off-by: dchigarev <dmitry.chigarev@intel.com>
As [suggested
here](https://github.com/llvm/llvm-project/pull/163071#discussion_r2427229637)
the PR adds an optional layout attribute for `LoadGather` and
`StoreScatter` ops.
For the load op the attribute describes the layout of the result (e.g.
`layout_result_0`), and for the store op it describes the layout of the
vector-to-store operand (e.g. `layout_operand_0`).
The PR also reworks the `propagate-layout` pass to consider permanent
layout attributes and back-propagate them accordingly.
The helper utility function `getDistributeLayoutAttr` is reworked to
return either `layout_operand/result_0` or `layout` for load/store ops
(depending on which one is set). After an offline discussion, we decided
that the overall layout utilities API is confusing since it tries to
mix permanent and temporary layouts. It will need to change in the
future.
---------
Signed-off-by: dchigarev <dmitry.chigarev@intel.com>
Lowering for `vector.gather`/`vector.scatter` into `xegpu.load`/`xegpu.store`.
High level steps to lower vector.gather/scatter:
```
%0 = vector.gather %source[%off1, %off2, %off3][%indices], %mask,
%pass_thru : memref<8x16x32xf32>, vector<8xindex>, vector<8xi1>, vector<8xf32> into vector<8xf32>
```
1. Compute strides and a memref offset for the `%source` memref using
`computeMemrefMeta` func from the transfer_read/write lowering
2. Compute a linear offset like `%lin_off = %base_offset + %off1 *
strides#0 + %off2 * strides#1 + %off3 * strides#2`
3. Combine the linear offset with `%indices`: `%off = (broadcast
%lin_off : index to vector<8xindex>) + %indices * strides#2`
4. Convert memref to an i64: `%flat_memref =
memref.extract_aligned_pointer_as_index %source + arith.index_cast`
5. Perform load/store: `%vec = xegpu.load %flat_memref[%off], %mask`
6. Apply selection to propagate values from the pass_thru vector: `%res
= arith.select %mask, %vec, %pass_thru`
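Steps 2, 3, 5, and 6 above amount to simple index arithmetic, which can be sketched on plain arrays. This is a simplified model, not the lowering itself: `gatherOffsets` and `gatherSelect` are hypothetical helpers, offsets are counted in elements, and the flat buffer stands in for the memory behind `%flat_memref`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Steps 2-3: linear start offset plus per-lane indices scaled by the
// innermost stride.
std::vector<int64_t> gatherOffsets(int64_t baseOffset,
                                   const std::vector<int64_t> &starts,
                                   const std::vector<int64_t> &strides,
                                   const std::vector<int64_t> &indices) {
  assert(starts.size() == strides.size() && !strides.empty());
  int64_t linOff = baseOffset;
  for (std::size_t d = 0; d < starts.size(); ++d)
    linOff += starts[d] * strides[d];
  std::vector<int64_t> offs;
  for (int64_t idx : indices)
    offs.push_back(linOff + idx * strides.back());
  return offs;
}

// Steps 5-6: read enabled lanes, take pass-through values for the rest.
std::vector<float> gatherSelect(const std::vector<float> &flat,
                                const std::vector<int64_t> &offs,
                                const std::vector<bool> &mask,
                                const std::vector<float> &passThru) {
  assert(offs.size() == mask.size() && offs.size() == passThru.size());
  std::vector<float> res(offs.size());
  for (std::size_t i = 0; i < offs.size(); ++i)
    res[i] = mask[i] ? flat[static_cast<std::size_t>(offs[i])] : passThru[i];
  return res;
}
```

For a row-major `memref<8x16x32xf32>` the strides are `{512, 32, 1}`, so start indices `{1, 2, 3}` give a linear offset of 579, and lane indices `{0, 4}` map to element offsets 579 and 583.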
This PR fixes a case where the source memref in
`vector.transfer_read/write` is not contiguous, which violates the
semantics of `memref.collapse_shape` used in the lowering.
<details><summary>An example of a failing test</summary>
```mlir
gpu.module @xevm_module {
  gpu.func @load_from_subview(%source: memref<4096x4096xf16>, %off1: index, %off2: index) -> vector<8xf16> {
    %c0 = arith.constant 0.0 : f16
    %subview = memref.subview %source[%off1, %off2] [256, 256] [1, 1] : memref<4096x4096xf16> to memref<256x256xf16, strided<[4096, 1], offset: ?>>
    %0 = vector.transfer_read %subview[%off2, %off2], %c0
      {in_bounds = [true]} : memref<256x256xf16, strided<[4096, 1], offset: ?>>, vector<8xf16>
    gpu.return %0 : vector<8xf16>
  }
}
```
Fails with:
```
/home/user/llvm/mlir/test/Conversion/VectorToXeGPU/transfer-read-to-xegpu.mlir:404:8: error: 'memref.collapse_shape' op invalid source layout map or collapsing non-contiguous dims
%0 = vector.transfer_read %subview[%off2, %off2], %c0
^
/home/user/llvm/mlir/test/Conversion/VectorToXeGPU/transfer-read-to-xegpu.mlir:404:8: note: see current operation: %8 = "memref.collapse_shape"(%2) <{reassociation = [[0, 1]]}> : (memref<256x256xf16, strided<[4096, 1], offset: ?>>) -> memref<65536xf16>
```
</details>
The suggestion was to replace `memref.collapse_shape` with
`memref.extract_aligned_pointer_as_index`, which is done in this PR.
Since `extract_aligned_pointer` applied to a subview returns the original
pointer without the subview offsets, this PR also adds logic to include
the offset obtained from `memref.extract_strided_metadata` in the
`baseOffset` calculation in `computeOffsets`.
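Why the metadata offset matters can be shown with plain arithmetic. The following is a sketch under assumptions (element-granular offsets, unit-step subview, a hypothetical `linearize` helper), not the code in `computeOffsets`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Linear element index: base + sum(indices[d] * strides[d]).
int64_t linearize(int64_t base, const std::vector<int64_t> &indices,
                  const std::vector<int64_t> &strides) {
  assert(indices.size() == strides.size());
  int64_t off = base;
  for (std::size_t d = 0; d < indices.size(); ++d)
    off += indices[d] * strides[d];
  return off;
}

// Element read at [i, j] of a subview that starts at [off1, off2] of a
// 4096x4096 parent, measured from the parent's aligned base pointer.
int64_t readThroughSubview(int64_t off1, int64_t off2, int64_t i, int64_t j) {
  const std::vector<int64_t> strides = {4096, 1};
  // Offset the strided metadata would report for the subview.
  int64_t subviewOffset = linearize(0, {off1, off2}, strides);
  // The aligned pointer ignores the subview, so the offset is added here.
  return linearize(subviewOffset, {i, j}, strides);
}
```

With `off1 = 2`, `off2 = 3` and a read at `[1, 5]`, the sketch yields element 12296, the same as parent element `[3, 8]`. Dropping `subviewOffset` would give 4101, i.e. an element outside the subview.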
---------
Signed-off-by: dchigarev <dmitry.chigarev@intel.com>
Lower transfer_read/transfer_write to load_gather/store_scatter when
the target uArch doesn't support load_nd/store_nd. The high-level
steps:
1. compute strides;
2. compute offsets;
3. collapse the memref to 1D (`collapseMemrefTo1D`);
4. create the load_gather or store_scatter op.
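The first two steps can be sketched for the simplest case. The assumptions here are a statically shaped, row-major memref and a 1D vector read along the innermost dimension; `rowMajorStrides` and `laneOffsets` are hypothetical helpers, not functions from the lowering.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Step 1: strides of a contiguous row-major shape (suffix products).
std::vector<int64_t> rowMajorStrides(const std::vector<int64_t> &shape) {
  std::vector<int64_t> strides(shape.size(), 1);
  for (int d = static_cast<int>(shape.size()) - 2; d >= 0; --d)
    strides[d] = strides[d + 1] * shape[d + 1];
  return strides;
}

// Step 2: one element offset per vector lane, walking the innermost dim.
std::vector<int64_t> laneOffsets(const std::vector<int64_t> &indices,
                                 const std::vector<int64_t> &strides,
                                 int64_t numLanes) {
  assert(indices.size() == strides.size() && !strides.empty());
  int64_t start = 0;
  for (std::size_t d = 0; d < indices.size(); ++d)
    start += indices[d] * strides[d];
  std::vector<int64_t> offs;
  for (int64_t lane = 0; lane < numLanes; ++lane)
    offs.push_back(start + lane * strides.back());
  return offs;
}
```

The resulting offsets index into the memref once it is viewed as a 1D buffer, which is what the gather or scatter op in step 4 consumes in this simplified picture.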
These are identified by misc-include-cleaner. I've filtered out those
that break builds. Also, I'm staying away from llvm-config.h,
config.h, and Compiler.h, which likely cause platform- or
compiler-specific build failures.
[mlir][vector] Standardize base Naming Across Vector Ops (NFC)
This change standardizes the naming convention for the argument
representing the value to read from or write to in Vector ops that
interface with Tensors or MemRefs. Specifically, it ensures that all
such ops use the name `base` (i.e., the base address or location to
which offsets are applied).
Updated operations:
* `vector.transfer_read`,
* `vector.transfer_write`.
For reference, these ops already use `base`:
* `vector.load`, `vector.store`, `vector.scatter`, `vector.gather`,
`vector.expandload`, `vector.compressstore`, `vector.maskedstore`,
`vector.maskedload`.
This is a non-functional change (NFC) and does not alter the semantics of these
operations. However, it does require users of the XFer ops to switch from
`op.getSource()` to `op.getBase()`.
To ease the transition, this PR temporarily adds a `getSource()` interface
method for compatibility. This is intended for downstream use only and should
not be relied on upstream. The method will be removed prior to the LLVM 21
release.
Implements #131602
`let constructor` is deprecated since the TableGen backend emits most
of the glue logic to build a pass. This PR retires the td method for
most passes in the Conversion directory (I need another pass to cover
the rest).
In the LLVM style guide, we prefer not using braced initializer lists to
call a constructor. Also, we prefer using an equals sign before the open
curly brace if we use a braced initializer list when initializing a variable.
See
https://llvm.org/docs/CodingStandards.html#do-not-use-braced-initializer-lists-to-call-a-constructor
for more details.
The style guide does not explain the reason well. There is an article
from abseil, which mentions a few benefits. E.g., we can avoid the most
vexing parse, etc. See https://abseil.io/tips/88 for more details.
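A small standalone example, not taken from the patch, shows why the distinction matters. With `std::vector<int>`, parentheses call the (count, value) constructor while braces select the initializer-list constructor, so the two spellings build different containers. The function names are illustrative only.

```cpp
#include <vector>

// Parentheses: constructor call, yields three elements equal to 1.
std::vector<int> withParens() {
  std::vector<int> v(3, 1);
  return v;
}

// Braces used like a constructor call: yields the two elements 3 and 1.
// This is the spelling the style guide asks us to avoid.
std::vector<int> withBraces() {
  std::vector<int> v{3, 1};
  return v;
}

// Preferred spelling when a list of values is intended: `=` before `{`.
std::vector<int> withEqualsBraces() {
  std::vector<int> v = {3, 1};
  return v;
}
```

Writing `= {...}` makes it visible that the braces hold element values, while parentheses make it visible that a constructor with its own logic is being called.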
Signed-off-by: hanhanW <hanhan0912@gmail.com>
The greedy rewriter is used in many different flows and it has a lot of
convenience (work list management, debugging actions, tracing, etc). But
it combines two kinds of greedy behavior: 1) how ops are matched, and 2)
folding wherever it can.
These are independent forms of greediness, and combining them leads to
inefficiency, e.g., in cases where one needs to create different phases
in lowering and is required to apply patterns in a specific order split
across different passes. Using the driver, one ends up needlessly
retrying folding or having multiple rounds of folding attempts where one
final run would have sufficed.
Of course folks can locally avoid this behavior by just building their
own driver, but this is also a commonly requested feature that folks
keep working around locally in suboptimal ways.
For downstream users, there should be no behavioral change. Updating
from the deprecated API should just be a find and replace (e.g., of the `find ./
-type f -exec sed -i
's|applyPatternsAndFoldGreedily|applyPatternsGreedily|g' {} \;` variety)
as the API arguments haven't changed between the two.
Constrains Vector lowering to apply boundary checks only to data
transfers operating on block shapes.
This further aligns lowering with the current Xe instructions'
restrictions.
Relaxes vector.transfer_write lowering to allow out-of-bound writes.
This aligns lowering with the current hardware specification which does
not update bytes in out-of-bound locations during block stores.