This PR calls recoverTemporaryLayout before the XeGPUWgtoSgDistribute &
XeGPUBlocking Pass to recover all the temporary operand layout which
might be required by the transformation patterns for checks and
verification
This PR refactors layout propagation into two distinct components:
result/anchor layout setup and source layout inference from the result.
For operations that require a specific result layout due to semantic or
hardware constraints, the propagation logic explicitly sets up the
result or anchor layout. Otherwise, it infers the source layout from the
backward-propagated consumer layout.
The result or anchor layout may differ from the backward-propagated
consumer layout; any such discrepancies are resolved via the existing
layout-conflict mechanism.
**This PR introduces the following utility functions:**
Source layout inference:
> inferBroadcastSourceLayout()
> inferMultiReductionSourceLayout()
> inferBitCastSourceLayout()
> inferShapeCastSourceLayout()
> inferInsertStridedSliceSourceLayout()
Result / anchor layout setup:
> setupMultiReductionResultLayout()
> setupBitCastResultLayout()
> setupInsertStridedSliceResultLayout()
> setupLoadMatrixAnchorLayout()
> setupStoreMatrixAnchorLayout()
> setupLoadGatherAnchorLayout()
> setupStoreScatterAnchorLayout()
Part of subgroup distribution related code changes are separated and
created as PR https://github.com/llvm/llvm-project/pull/179018/changes.
This PR preserves leading dimensions for
xegpu.load_matrix/store_matrix/atomic_rmw/convert_layout, and vector
operations which have impact on shapes:
broadcast/multi-reduction/shape_cast/transpose.
Rank-sensitive operations are operations whose semantics depend on the
tensor rank (and consequently its shape), and therefore must not alter
the input tile rank or shape, such as by dropping leading dimensions.
Removing temporary slice layout (besides the regular layout) at the end
of wg distribution and blocking pass.
The PR also drop sg_data/inst_data from anchor layouts in every
wg-to-sg/blocking/unrolling pattern.
---------
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
Co-authored-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
This PR builds on the anchor layout mechanism introduced in
https://github.com/llvm/llvm-project/pull/169267 and performs the
following refactoring:
1. Introduce getAnchorLayout() and setAnchorLayout() interface for
anchor ops to get and set layout attributes.
2. Add getLocalLayout() and setLocalLayout() utility functions, and
refactor workgroup/subgroup distribution patterns to use these APIs.
These utilities access the layout information directly and locally,
without relying on global propagation.
3. Introduce localPropagateLayoutsFromAnchor(), a utility used by
subgroup distribution to unify non-anchor layout setup.
This function is intended to be invoked upfront by all layout-based
passes (including workgroup/subgroup distribution and unrolling) to
propagate layouts from anchor ops to non-anchor ops.
After this step, patterns within the pass should exclusively use
getLocalLayout() / setLocalLayout().
4. Refactor getDistributeLayoutAttr() and setDistributeLayoutAttr() to
remove special-case handling. These APIs now operate in a uniform order:
anchor ops first, then non-anchor ops, and finally block arguments.
These APIs will be deprecated on long run.
5. Refactor patterns in wg/sg distribution, load optimization passes to
use get/setAnchorLayout() and get/setLocalLayout().
6. Update test cases to enforce that anchor ops must use—and only
use—anchor layouts.
This PR changes the pack/unpack method used for unrolling to allow for
lower rank slice to be extracted and inserted from and to src vector by
adding reshapes. It also removes leading unit dims from inst_data if
there are any.
The unroll patterns for these ops were added in the previous PR but the
getTileShape method was not changed to handle these ops and hence
blocking pass was not kicking in.
Add support for distributing the `vector.multi_reduction` operation
across lanes in a warp. Currently only 2D to 1D reductions are
supported. Given layouts for the source and accumulator vectors,
* If the reduction dimension is distributed across lanes, the reduction
is non-lane-local and the reduction is done using warp shuffles. Here we
simply rewrite the `MultiDimReductionOp` to a sequence of `ReductionOp`s
inside the warp op body. Actual distribution will be done by
`WarpOpReduction` pattern.
* If the reduction dimension is not distributed across lanes, the
reduction is lane-local. In this case, we yield the source and
accumulator vectors from the warp op and perform the lane-local
reduction outside the warp op using a sequence of `ReductionOp`s.
PR also adds support for distributing `vector.shape_cast` based on
layouts.
This PR adds initial skeleton implementation for lowering
ConvertLayoutOp. It currently only supports cases where SLM is not
needed.
---------
Co-authored-by: Adam Siemieniuk <adam.siemieniuk@intel.com>
These are identified by misc-include-cleaner. I've filtered out those
that break builds. Also, I'm staying away from llvm-config.h,
config.h, and Compiler.h, which likely cause platform- or
compiler-specific build failures.
This PR adds blocking support for vector dialect operations (`reduce`,
`broadcast`, and `transpose`) in the XeGPU based IR. It simply assigned
the shape specified by "inst_data" as its target shape of the unrolling
to implement the blocking. It is based on
https://github.com/llvm/llvm-project/pull/140163.
This PR introduces the initial implementation of a blocking pass for
XeGPU programs. The pass leverages unroll patterns from both the XeGPU
and Vector dialects.
---------
Co-authored-by: Adam Siemieniuk <adam.siemieniuk@intel.com>