This PR adds layout propagation support for multi-reduction/reduction ops
with a scalar result:
1) Enhance setupMultiReductionResultLayout() and
LayoutInfoPropagation::visitVectorMultiReductionOp() to support scalar
result
2) Add propagation support for vector.reduction op at the lane level,
since the op is only introduced at the lane level.
Layout propagation fails if dpas has an f16 accumulator. This fix
resolves the issue by removing the packingSize argument, which does not
appear to be valid here.
This PR enhances multi-reduction layout propagation:
1. Improve inst_data and lane_data to support fractional subgroup sizes
2. Improve subgroup_layout/data setup to utilize the (nested) slice
layout from the consumer op
It also removes the restriction in load_matrix/store_matrix layout
propagation to allow nd (n>2) layouts.
This PR enhances the layout propagation rules for broadcast operations.
The source layout is derived from the result layout based on the
broadcast pattern:
1. Broadcast on leading dimensions
The source layout is the slice layout of the result layout.
2. Broadcast on inner unit dimensions
The source layout matches the result layout, with sg_data and lane_data
set to 1.
3. Broadcast on both leading dimensions and inner unit dimensions
The source layout is derived by combining the above two rules.
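The three broadcast rules above can be sketched as a small standalone model. This is only an illustration, not the MLIR implementation: the `Layout` and `SliceLayout` classes and the function `infer_broadcast_source_layout` are hypothetical Python stand-ins, and the sketch assumes that sg_data and lane_data are set to 1 only on the broadcast unit dimensions and that rule 3 applies rule 2 first and then slices the leading dimensions.

```python
from dataclasses import dataclass, replace
from typing import Tuple, Union


@dataclass(frozen=True)
class Layout:
    """Hypothetical stand-in for a per-dimension XeGPU layout."""
    sg_layout: Tuple[int, ...]
    sg_data: Tuple[int, ...]
    lane_layout: Tuple[int, ...]
    lane_data: Tuple[int, ...]


@dataclass(frozen=True)
class SliceLayout:
    """Hypothetical stand-in for a slice layout: `parent` with `dims` removed."""
    parent: Layout
    dims: Tuple[int, ...]


def infer_broadcast_source_layout(
    result_layout: Layout,
    result_shape: Tuple[int, ...],
    source_shape: Tuple[int, ...],
) -> Union[Layout, SliceLayout]:
    lead = len(result_shape) - len(source_shape)
    assert lead >= 0, "source rank must not exceed result rank"

    # Result dimensions where the source has extent 1 but the result does not.
    unit_dims = {
        lead + i
        for i, (s, r) in enumerate(zip(source_shape, result_shape[lead:]))
        if s == 1 and r != 1
    }

    layout = result_layout
    if unit_dims:
        # Rule 2 (assumption: only the broadcast unit dims are reset to 1).
        layout = replace(
            layout,
            sg_data=tuple(1 if d in unit_dims else v
                          for d, v in enumerate(layout.sg_data)),
            lane_data=tuple(1 if d in unit_dims else v
                            for d, v in enumerate(layout.lane_data)),
        )
    if lead:
        # Rule 1 (and rule 3 when combined with rule 2): slice leading dims.
        return SliceLayout(layout, tuple(range(lead)))
    return layout
```

For instance, broadcasting a `(64,)` source to a `(32, 64)` result yields a slice of the result layout over dimension 0, while a `(32, 1)` source keeps the result layout with sg_data and lane_data set to 1 on dimension 1.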
This PR enhances the insert_strided_slice layout rules to handle slice
layouts and to adjust the layout to fit the source shape. It adds dropDims
as a layout utility function.
This PR refactors Transpose Op Layout Propagation:
1. Add inferTransposeSourceLayout() to layout utility, enhance layout
propagation and conflict handling to use this function
2. Add Layout utility: TransposeDims()
3. Refactor IsTransposeOf() and fix minor bugs
4. Fix minor issue in dropSgLayoutAndData()
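As a rough illustration of what inferring a transpose source layout involves, the sketch below models a layout as a dictionary of per-dimension tuples and permutes every field with the inverse permutation. The Python names `transpose_dims`, `infer_transpose_source_layout` and `is_transpose_of` are hypothetical analogues chosen for this example; the actual utilities in the PR may have different signatures and semantics. It assumes the `vector.transpose` convention that result dimension `i` corresponds to source dimension `perm[i]`.

```python
from typing import Dict, Sequence, Tuple


def transpose_dims(values: Sequence[int], perm: Sequence[int]) -> Tuple[int, ...]:
    """Permute per-dimension values: out[i] = values[perm[i]]."""
    return tuple(values[p] for p in perm)


def invert_permutation(perm: Sequence[int]) -> Tuple[int, ...]:
    inv = [0] * len(perm)
    for i, p in enumerate(perm):
        inv[p] = i
    return tuple(inv)


def infer_transpose_source_layout(
    result_layout: Dict[str, Tuple[int, ...]], perm: Sequence[int]
) -> Dict[str, Tuple[int, ...]]:
    """Source dim perm[i] receives whatever the result layout assigns to dim i."""
    inv = invert_permutation(perm)
    return {name: transpose_dims(vals, inv) for name, vals in result_layout.items()}


def is_transpose_of(
    a: Dict[str, Tuple[int, ...]],
    b: Dict[str, Tuple[int, ...]],
    perm: Sequence[int],
) -> bool:
    """True if layout `a` equals layout `b` with its dims permuted by `perm`."""
    return a.keys() == b.keys() and all(
        a[name] == transpose_dims(b[name], perm) for name in a
    )
```

With `perm = [1, 0]` and a result layout of `lane_layout = (1, 16)`, the inferred source layout has `lane_layout = (16, 1)`, and the result layout is a transpose of that source layout under the same permutation.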
This PR enhances the layout assignment for XeGPU load/store operations
to handle vector sizes smaller than the subgroup size.
For example, vector[4] is assigned lane_data=[1], lane_layout=[4] and
inst_data=[4].
Fractional-subgroup-size vector support is required for the
cross-subgroup reduction case. The number of subgroups participating in a
reduction can be small, so each subgroup may need to reduce a
small vector, often a fraction of the subgroup size.
Most layout-based subgroup distribution patterns support
fractional-subgroup-size vectors without any change, except for a few:
reduction, insert/extract, constant. We don't expect ND operations (like
load_nd/store_nd/dpas) to accept fractional-subgroup-size vectors.
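The vector[4] example can be reproduced with a minimal sketch of the assignment for a 1-D vector. The function `assign_1d_layout` and the default subgroup size of 16 are assumptions made for illustration only; in particular, clamping to the subgroup size for larger vectors is a guess at the non-fractional case, not something stated in this PR.

```python
from typing import Dict, List


def assign_1d_layout(vector_size: int, subgroup_size: int = 16) -> Dict[str, List[int]]:
    """Pick lane_layout/lane_data/inst_data for a 1-D vector.

    When the vector is smaller than the subgroup, only `vector_size` lanes
    participate, each holding one element (the fractional case).
    """
    lanes = min(vector_size, subgroup_size)
    return {
        "lane_layout": [lanes],
        "lane_data": [1],
        # Assumption: one instruction covers one element per participating lane.
        "inst_data": [lanes],
    }
```

Calling `assign_1d_layout(4)` gives `lane_layout=[4]`, `lane_data=[1]` and `inst_data=[4]`, matching the example in the description.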
This PR adds support for layout conflict handling for vector operands. A
conflict for a vector operand occurs when a value consumed at a given
operand is not in the expected layout in the context of the consumer
(for example, the source of a `vector.multi_reduction` op requires a specific
layout inferred from its current result layout). To resolve this
conflict, we insert an `xegpu.convert_layout` right after the producer
(essentially duplicating the producer with expected layout) and use the
new value in the consumer.
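The resolution step can be pictured on a toy IR. The sketch below is not the MLIR pass: `Op` and `resolve_operand_conflict` are hypothetical names, layouts are plain tuples, and the op list stands in for a block. It only shows the shape of the rewrite, namely inserting a convert op right after the producer and rewiring the single conflicting operand.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(eq=False)  # identity comparison, so list.index finds the exact op
class Op:
    name: str
    operands: List["Op"]
    result_layout: Optional[Tuple[int, ...]]


def resolve_operand_conflict(
    ops: List[Op], consumer: Op, operand_idx: int, expected_layout: Tuple[int, ...]
) -> Optional[Op]:
    """Insert a convert_layout after the producer if its layout is not the expected one.

    Returns the inserted op, or None when there is no conflict.
    """
    producer = consumer.operands[operand_idx]
    if producer.result_layout == expected_layout:
        return None
    convert = Op("xegpu.convert_layout", [producer], expected_layout)
    ops.insert(ops.index(producer) + 1, convert)
    # Only this consumer operand is rewired; other users keep the original value.
    consumer.operands[operand_idx] = convert
    return convert
```

Because only the conflicting operand is rewired, other users of the producer continue to see the value in its original layout.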
This PR calls recoverTemporaryLayout before the XeGPUWgtoSgDistribute and
XeGPUBlocking passes to recover all the temporary operand layouts that
might be required by the transformation patterns for checks and
verification.
Fix two issues introduced by PR179016:
1. An unused variable when building with the option
"-DLLVM_ENABLE_ASSERTIONS=OFF"
2. Restore the modification to recoverTemporaryLayouts() made by
PR176737, which was unintentionally lost during the merging process.
This PR refactors layout propagation into two distinct components:
result/anchor layout setup and source layout inference from the result.
For operations that require a specific result layout due to semantic or
hardware constraints, the propagation logic explicitly sets up the
result or anchor layout. Otherwise, it infers the source layout from the
backward-propagated consumer layout.
The result or anchor layout may differ from the backward-propagated
consumer layout; any such discrepancies are resolved via the existing
layout-conflict mechanism.
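The split between result/anchor setup and source inference can be summarized with a small control-flow sketch. Everything here is hypothetical scaffolding for illustration: `propagate`, the rule tables and the string layouts are assumptions, not the pass's actual data structures.

```python
from typing import Callable, Dict, Tuple


def propagate(
    op_name: str,
    consumer_layout: str,
    setup_rules: Dict[str, Callable[[str], str]],
    infer_rules: Dict[str, Callable[[str], str]],
) -> Tuple[str, str, bool]:
    """Return (result_layout, source_layout, conflict) for one op.

    Ops with a setup rule get an explicitly chosen result/anchor layout;
    all other ops adopt the backward-propagated consumer layout.
    """
    setup = setup_rules.get(op_name)
    result_layout = setup(consumer_layout) if setup else consumer_layout
    # A mismatch is not an error here; it is left to the conflict mechanism.
    conflict = result_layout != consumer_layout
    source_layout = infer_rules[op_name](result_layout)
    return result_layout, source_layout, conflict
```

In this model an op without a setup rule never reports a conflict, since its result layout is by construction the consumer layout.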
**This PR introduces the following utility functions:**
Source layout inference:
> inferBroadcastSourceLayout()
> inferMultiReductionSourceLayout()
> inferBitCastSourceLayout()
> inferShapeCastSourceLayout()
> inferInsertStridedSliceSourceLayout()
Result / anchor layout setup:
> setupMultiReductionResultLayout()
> setupBitCastResultLayout()
> setupInsertStridedSliceResultLayout()
> setupLoadMatrixAnchorLayout()
> setupStoreMatrixAnchorLayout()
> setupLoadGatherAnchorLayout()
> setupStoreScatterAnchorLayout()
Some of the subgroup-distribution-related code changes are split out into
a separate PR: https://github.com/llvm/llvm-project/pull/179018/changes.