This PR calls recoverTemporaryLayout before the XeGPUWgtoSgDistribute &
XeGPUBlocking Pass to recover all the temporary operand layout which
might be required by the transformation patterns for checks and
verification
This PR refactors layout propagation into two distinct components:
result/anchor layout setup and source layout inference from the result.
For operations that require a specific result layout due to semantic or
hardware constraints, the propagation logic explicitly sets up the
result or anchor layout. Otherwise, it infers the source layout from the
backward-propagated consumer layout.
The result or anchor layout may differ from the backward-propagated
consumer layout; any such discrepancies are resolved via the existing
layout-conflict mechanism.
**This PR introduces the following utility functions:**
Source layout inference:
> inferBroadcastSourceLayout()
> inferMultiReductionSourceLayout()
> inferBitCastSourceLayout()
> inferShapeCastSourceLayout()
> inferInsertStridedSliceSourceLayout()
Result / anchor layout setup:
> setupMultiReductionResultLayout()
> setupBitCastResultLayout()
> setupInsertStridedSliceResultLayout()
> setupLoadMatrixAnchorLayout()
> setupStoreMatrixAnchorLayout()
> setupLoadGatherAnchorLayout()
> setupStoreScatterAnchorLayout()
Part of subgroup distribution related code changes are separated and
created as PR https://github.com/llvm/llvm-project/pull/179018/changes.
This PR fix a bug in setUnitDimData and setUnitDimLayout, and adds
documentation and test.
It also cleans up the shapecast op pattern in the wg distribution to use
local temporary layout instead of getting from definition op's result
(one TODO item from PR
[#172125](https://github.com/llvm/llvm-project/pull/172125)).
This PR adds support for cross-sg reduction whilst distributing from
workgroup to subgroup. It has following limitation
1. Cannot reduce to a scalar
2. For cross-sg, only 1:1 decomposition (each sg should be assigned only
one tile in the original WG tile) is supported for now. For example for
a WG tile of size 256x128, sg_layout = [8, 4], sg_data = [16, 16] wont
be supported.
Removing temporary slice layout (besides the regular layout) at the end
of wg distribution and blocking pass.
The PR also drop sg_data/inst_data from anchor layouts in every
wg-to-sg/blocking/unrolling pattern.
---------
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
Co-authored-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
This PR builds on the anchor layout mechanism introduced in
https://github.com/llvm/llvm-project/pull/169267 and performs the
following refactoring:
1. Introduce getAnchorLayout() and setAnchorLayout() interface for
anchor ops to get and set layout attributes.
2. Add getLocalLayout() and setLocalLayout() utility functions, and
refactor workgroup/subgroup distribution patterns to use these APIs.
These utilities access the layout information directly and locally,
without relying on global propagation.
3. Introduce localPropagateLayoutsFromAnchor(), a utility used by
subgroup distribution to unify non-anchor layout setup.
This function is intended to be invoked upfront by all layout-based
passes (including workgroup/subgroup distribution and unrolling) to
propagate layouts from anchor ops to non-anchor ops.
After this step, patterns within the pass should exclusively use
getLocalLayout() / setLocalLayout().
4. Refactor getDistributeLayoutAttr() and setDistributeLayoutAttr() to
remove special-case handling. These APIs now operate in a uniform order:
anchor ops first, then non-anchor ops, and finally block arguments.
These APIs will be deprecated on long run.
5. Refactor patterns in wg/sg distribution, load optimization passes to
use get/setAnchorLayout() and get/setLocalLayout().
6. Update test cases to enforce that anchor ops must use—and only
use—anchor layouts.
Add special-case handling for ShapeCast when it expands unit dimensions
for a succeeding broadcast op. In this scenario, distribution requires
the source layout to be a slice layout, and the result layout is first
normalized by setting the expanded unit dimensions to 1 before computing
the distributed result shape. In all other cases, ShapeCast is
distributed as usual.
This PR also updates the propagation rule for vectors with expanded unit
dimensions, allowing them to share the same layout as the result of a
broadcast op. This enables correct layout propagation back to the source
of the ShapeCast op, as that layout must ultimately be restored as the
parent layout of the slice layout.
This PR adds support to retain the anchor op layouts (after dropping
what's not required) for xegpu nD ops during workgroup to subgroup &
unroll transformation
Introduce anchor layout for XeGPU anchor ops: load_nd, store_nd,
prefetch_nd, dpas, load, store, prefetch, load_matrix, store_matrix, and
atomic_rmw. Anchor layout is permanent, and is guaranteed to be honored
by XeGPU distribution and lowerinngs once specified.
1. Add anchor_layout for XeGPU anchor OPs: load_nd, store_nd,
prefetch_nd, dpas, load, store, prefetch, load_matrix, store_matrix, and
atomic_rmw.
2. rename layout attributes to anchor_layout for these ops: load, store,
load_matrix, store_matrix
3. update layout propagation pass: Only when user doesn't specify anchor
layout, the pass computes a default layout and set to anchor op's
permant layout and use that for propagation. if user specified anchor
layout, the pass takes user-specified anchor layout. permant layout and
use that for propagation. if user specified anchor layout, the pass
takes user-specified anchor layout.
This PR does the following:
1. Handle order attribute during the delinearization from linear
subgroup Id to multi-dim id.
2. Adds a transformation pattern for vector.transpose in wg to sg pass.
3. Updates CHECKS in the wg to sg tests
As [suggested
here](https://github.com/llvm/llvm-project/pull/163071#discussion_r2427229637)
the PR adds an optional layout attribute for `LoadGather` and
`StoreScatter` ops.
For the load-op the attribute describes the layout of the result (ex
`layout_result_0`), and for store-op it describes the layout for the
vector-to-store operand (ex `layout_operand_0`).
The PR also reworks `propagate-layout` pass to consider perm layout
attributes and back-propagate them accordingly.
The helper utility function `getDistributeLayoutAttr` is reworked to
return either `layout_operand/result_0` or `layout` for load/store ops
(denepding on which one is set). After an offline discussion decided
that the overall utilities layouts API is confusing since it tries to
mix permament and temporary layouts. Would need to change it in the
future.
---------
Signed-off-by: dchigarev <dmitry.chigarev@intel.com>
This PR adds lowering of xegpu.load_matrix/store_matrix to
xevm.blockload/blockstore or and llvm.load/store, depending on wi level
attributes.
It includes a few components:
1. adds wi-level attributes: subgroup_block_io.
2. expand load_matrix/store_matrix op definition to support scalar data
(besides vector data).
2. adds a member function to mem_desc to compute the linearized address
for a nd offsets.
3. add lowering depending on wi-level attributes:
a) if subgroup_block_io attribute presents, lower to
xevm.blockload/blockstore
c) else lower to llvm.load/store. If result is a vector, lower to
llvm.load/store with vector operand.
This PR adds patterns to distribute vector.step and vector.shape_cast op
from wg to sg and it also enables constant, broadcast and elementwise
ops to handle the slice attribute
Add support for distributing the `vector.multi_reduction` operation
across lanes in a warp. Currently only 2D to 1D reductions are
supported. Given layouts for the source and accumulator vectors,
* If the reduction dimension is distributed across lanes, the reduction
is non-lane-local and the reduction is done using warp shuffles. Here we
simply rewrite the `MultiDimReductionOp` to a sequence of `ReductionOp`s
inside the warp op body. Actual distribution will be done by
`WarpOpReduction` pattern.
* If the reduction dimension is not distributed across lanes, the
reduction is lane-local. In this case, we yield the source and
accumulator vectors from the warp op and perform the lane-local
reduction outside the warp op using a sequence of `ReductionOp`s.
PR also adds support for distributing `vector.shape_cast` based on
layouts.
This PR adds pattern to distribute the load/store/prefetch nd ops with
offsets from workgroup to subgroup IR. This PR is part of the transition
to move offsets from create_nd to load/store/prefetch nd ops.
Create_nd PR : #152351
This PR adds pattern to distribute the create_nd_desc op without offsets
from workgroup (Wg) IR to subgroup (Sg) IR.
The round robin distribution logic (involves offset calculation) now
will happen in load/store/prefetch nd ops instead of create_nd.
This PR adds a new attribute to the xegpu dialect called xegpu.range.
One use case of this attribute can be to attach subgroup_id_range to
scf.if of to drive the execution.
This PR adds initial skeleton implementation for lowering
ConvertLayoutOp. It currently only supports cases where SLM is not
needed.
---------
Co-authored-by: Adam Siemieniuk <adam.siemieniuk@intel.com>
This PR allows load_nd/store_nd/prefetch_nd to take an additional offset
operand.
It is based on this PR https://github.com/llvm/llvm-project/pull/148335.
Now user can create a nd_tdesc with no offset, and instead set the
offset with the load_nd operation.
This PR introduces support for `scf::ForOp`, `scf::WhileOp`, `scf::If`,
and `scf::Condition` within the workgroup-subgroup-distribution pass,
leveraging the `SCFStructuralTypeConversionsAndLegality`.
This PR adds the XeGPU workgroup (wg) to subgroup (sg) pass. The wg to
sg pass transforms the xegpu wg level operations to subgroup operations
based on the sg_layout and sg_data attribute. The PR adds transformation
patterns for following Ops
1. CreateNdDesc
2. LoadNd
3. StoreNd
4. PrefetchNd
5. UpdateNdOffset
6. Dpas
This PR adds the XeGPU workgroup (wg) to subgroup (sg) pass. The wg to
sg pass transforms the xegpu wg level operations to subgroup operations
based on the sg_layout and sg_data attribute. The PR adds transformation
patterns for following Ops
1. CreateNdDesc
2. LoadNd
3. StoreNd
4. PrefetchNd
4. UpdateNdOffset
5. Dpas