This PR modifies the subgroup distribution pass to sink an
insert_strided_slice operation only if it becomes the last op before the
yield. This avoids sinking the same insert_strided_slice multiple times,
which could cause issues in the worst case.
This PR calls recoverTemporaryLayout before the XeGPUWgToSgDistribute and
XeGPUBlocking passes to recover all temporary operand layouts, which the
transformation patterns may require for checks and verification.
Fix two issues introduced by PR179016:
1. An unused variable when building with
"-DLLVM_ENABLE_ASSERTIONS=OFF".
2. Restore the modification to recoverTemporaryLayouts() made by
PR176737, which was unintentionally lost during the merge.
This PR refactors layout propagation into two distinct components:
result/anchor layout setup and source layout inference from the result.
For operations that require a specific result layout due to semantic or
hardware constraints, the propagation logic explicitly sets up the
result or anchor layout. Otherwise, it infers the source layout from the
backward-propagated consumer layout.
The result or anchor layout may differ from the backward-propagated
consumer layout; any such discrepancies are resolved via the existing
layout-conflict mechanism.
**This PR introduces the following utility functions:**
Source layout inference:
> inferBroadcastSourceLayout()
> inferMultiReductionSourceLayout()
> inferBitCastSourceLayout()
> inferShapeCastSourceLayout()
> inferInsertStridedSliceSourceLayout()
Result / anchor layout setup:
> setupMultiReductionResultLayout()
> setupBitCastResultLayout()
> setupInsertStridedSliceResultLayout()
> setupLoadMatrixAnchorLayout()
> setupStoreMatrixAnchorLayout()
> setupLoadGatherAnchorLayout()
> setupStoreScatterAnchorLayout()
Some of the subgroup-distribution-related code changes are split out
into a separate PR: https://github.com/llvm/llvm-project/pull/179018/changes.
Currently, the XeGPU lowering pipeline uses the `XeGPUSubgroupDistribute`
pass for subgroup to work-item distribution of ops. This pass is well
established and relies on vector distribution's `WarpOp`-based
distribution mechanism. However, recent experiments with larger kernels
have shown that this pass is very expensive in terms of compile time
(see below). This prompted us to create a new pass that does not rely on
`WarpOp`-based distribution. This PR adds the initial infrastructure to
move away from the old approach and align SG to WI distribution with WG
to SG distribution. The new pass also uses context-aware type conversion
based on XeGPU layouts to distribute vector types from SG to WI.
This PR adds the following changes:
* SG to WI distribution pass based on context-aware type conversions
using `OpConversionPatterns`
* Test pass for testing individual patterns
(`TestXeGPUSgToWiDistributeExperimental`)
* `XeGPUSgToWiDistributeExperimentalPass` which will eventually replace
`XeGPUSubgroupDistribute`
Flash attention e2e compilation stats:
```
----Wall Time---- ----Name----
0.0032 ( 0.2%) Parser
0.0008 ( 0.0%) CSE
0.0000 ( 0.0%) (A) DominanceInfo
0.0002 ( 0.0%) GpuXeVMAttachTarget
1.1427 ( 58.7%) 'gpu.module' Pipeline
0.0019 ( 0.1%) XeGPUWgToSgDistribute
0.0003 ( 0.0%) CSE
0.0000 ( 0.0%) (A) DominanceInfo
0.0002 ( 0.0%) LowerAffinePass
0.0001 ( 0.0%) CSE
0.0000 ( 0.0%) (A) DominanceInfo
0.0008 ( 0.0%) XeGPUPropagateLayout
0.0056 ( 0.3%) XeGPUBlocking
0.0010 ( 0.1%) Canonicalizer
0.0004 ( 0.0%) CSE
0.0000 ( 0.0%) (A) DominanceInfo
0.0015 ( 0.1%) XeGPUPropagateLayout
0.0007 ( 0.0%) XeGPUOptimizeBlockLoads
0.0010 ( 0.0%) Canonicalizer
0.0004 ( 0.0%) CSE
0.0000 ( 0.0%) (A) DominanceInfo
0.0015 ( 0.1%) XeGPUPropagateLayout
1.1274 ( 57.9%) XeGPUSubgroupDistribute
0.7959 ( 40.9%) Output
0.0022 ( 0.1%) Rest
1.9461 (100.0%) Total
```
This PR adds initial support for layout conflict resolution in XeGPU.
A layout conflict occurs when a use point of an op expects a different
layout than the one the op can currently provide. Such a conflict needs
to be resolved by inserting additional xegpu ops.
Initially, we only handle conflicts at tensor desc use points.
This PR preserves leading dimensions for
xegpu.load_matrix/store_matrix/atomic_rmw/convert_layout, and vector
operations that affect shapes:
broadcast/multi-reduction/shape_cast/transpose.
Rank-sensitive operations are operations whose semantics depend on the
tensor rank (and consequently its shape), and therefore must not alter
the input tile rank or shape, such as by dropping leading dimensions.
This PR fixes a bug in setUnitDimData and setUnitDimLayout, and adds
documentation and tests.
It also cleans up the ShapeCast op pattern in the wg distribution to use
the local temporary layout instead of getting it from the defining op's
result (one TODO item from PR
[#172125](https://github.com/llvm/llvm-project/pull/172125)).
This PR adds support for cross-sg reduction when distributing from
workgroup to subgroup. It has the following limitations:
1. Cannot reduce to a scalar.
2. For cross-sg reduction, only 1:1 decomposition (each sg is assigned
only one tile of the original WG tile) is supported for now. For example,
for a WG tile of size 256x128, sg_layout = [8, 4], sg_data = [16, 16]
won't be supported.
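The 1:1 restriction can be illustrated with a small check. This is a
minimal sketch, assuming that "1:1 decomposition" means
`sg_layout[i] * sg_data[i]` equals the WG tile size in every dimension.
The function name `is_one_to_one` is hypothetical and is not part of the
XeGPU code base.

```python
def is_one_to_one(wg_shape, sg_layout, sg_data):
    """Return True if every subgroup owns exactly one tile of the WG tile.

    Assumption (not taken from the PR): a decomposition is 1:1 when the
    subgroup grid times the per-subgroup tile covers the WG tile exactly
    once in each dimension.
    """
    if not (len(wg_shape) == len(sg_layout) == len(sg_data)):
        raise ValueError("wg_shape, sg_layout and sg_data must have the same rank")
    return all(n * d == s for s, n, d in zip(wg_shape, sg_layout, sg_data))


# The example from the description: 8 * 16 = 128 != 256 in dim 0,
# so each subgroup would have to handle more than one tile.
print(is_one_to_one([256, 128], [8, 4], [16, 16]))  # False
# A 1:1 variant of the same WG tile: 8 * 32 = 256 and 4 * 32 = 128.
print(is_one_to_one([256, 128], [8, 4], [32, 32]))  # True
```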
In XeGPUPropagateLayout.cpp, the helper getDefaultSIMTLayoutInfo is
implemented via multiple overloads that differ significantly in
semantics, not just parameter types.
Reusing the same function name for these semantically different
behaviors makes call sites harder to read and reason about and increases
the maintenance burden. This PR improves readability and maintainability
of layout propagation logic.
Remove the temporary slice layout (in addition to the regular layout) at
the end of the wg distribution and blocking passes.
The PR also drops sg_data/inst_data from anchor layouts in every
wg-to-sg/blocking/unrolling pattern.
---------
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
Co-authored-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
Add a visitNonControlFlowArguments API to SparseBackwardDataFlowAnalysis.
The current SparseBackwardDataFlowAnalysis cannot access all SSA values,
such as a loop's IV. With visitNonControlFlowArguments, such values can
now be visited. Applying it in LivenessAnalysis/RemoveDeadValues solves
the issue of IV liveness in loops.
https://discourse.llvm.org/t/rfc-add-visitbranchregionargument-interface-to-sparsedataflowanalysis/89061
Add a helper function to compute a mapping of successor operands to
successor inputs. This mapping is computed in various places. Also add a
helper function to gather all region branch points.
This commit is in preparation of a bug fix / partial redesign of
`-remove-dead-values`. This commit also removes some duplicate code in
various places.
This PR builds on the anchor layout mechanism introduced in
https://github.com/llvm/llvm-project/pull/169267 and performs the
following refactoring:
1. Introduce getAnchorLayout() and setAnchorLayout() interface for
anchor ops to get and set layout attributes.
2. Add getLocalLayout() and setLocalLayout() utility functions, and
refactor workgroup/subgroup distribution patterns to use these APIs.
These utilities access the layout information directly and locally,
without relying on global propagation.
3. Introduce localPropagateLayoutsFromAnchor(), a utility used by
subgroup distribution to unify non-anchor layout setup.
This function is intended to be invoked upfront by all layout-based
passes (including workgroup/subgroup distribution and unrolling) to
propagate layouts from anchor ops to non-anchor ops.
After this step, patterns within the pass should exclusively use
getLocalLayout() / setLocalLayout().
4. Refactor getDistributeLayoutAttr() and setDistributeLayoutAttr() to
remove special-case handling. These APIs now operate in a uniform order:
anchor ops first, then non-anchor ops, and finally block arguments.
These APIs will be deprecated in the long run.
5. Refactor patterns in the wg/sg distribution and load optimization
passes to use get/setAnchorLayout() and get/setLocalLayout().
6. Update test cases to enforce that anchor ops use anchor layouts, and
only anchor layouts.
Add special-case handling for ShapeCast when it expands unit dimensions
for a succeeding broadcast op. In this scenario, distribution requires
the source layout to be a slice layout, and the result layout is first
normalized by setting the expanded unit dimensions to 1 before computing
the distributed result shape. In all other cases, ShapeCast is
distributed as usual.
This PR also updates the propagation rule for vectors with expanded unit
dimensions, allowing them to share the same layout as the result of a
broadcast op. This enables correct layout propagation back to the source
of the ShapeCast op, as that layout must ultimately be restored as the
parent layout of the slice layout.
`vector.extract_strided_slice` can have two forms when specifying
offsets.
Case 1:
```
%1 = vector.extract_strided_slice %0 { offsets = [8, 0], sizes = [8, 16], strides = [1, 1]}
: vector<24x16xf32> to vector<8x16xf32>
```
Case 2:
```
%1 = vector.extract_strided_slice %0 { offsets = [8], sizes = [8], strides = [1]}
: vector<24x16xf32> to vector<8x16xf32>
```
These two ops mean the same thing, but case 2 is syntactic sugar that
avoids specifying offsets for fully extracted dims. Currently, case 2
fails in XeGPU SIMT distribution. This PR fixes the issue.
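The equivalence of the two forms can be sketched as a normalization
step that pads the short form out to full rank. This is only an
illustration of the idea, not the fix implemented in the PR; the helper
name `complete_slice_attrs` is hypothetical.

```python
def complete_slice_attrs(src_shape, offsets, sizes, strides):
    """Pad offsets/sizes/strides of a strided slice to the source rank.

    Trailing dims that are not mentioned are fully extracted, so they get
    offset 0, the full dim size, and stride 1.
    """
    k = len(offsets)
    if not (len(sizes) == len(strides) == k) or k > len(src_shape):
        raise ValueError("offsets, sizes and strides must agree and fit the source rank")
    trailing = list(src_shape[k:])
    return (
        list(offsets) + [0] * len(trailing),
        list(sizes) + trailing,
        list(strides) + [1] * len(trailing),
    )


# Case 2 from above, expanded to the explicit form of case 1.
print(complete_slice_attrs([24, 16], [8], [8], [1]))
# ([8, 0], [8, 16], [1, 1])
```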
This PR extends XeGPU layout propagation and distribution for
vector.broadcast operation.
It relaxes the restriction of layout propagation to allow low-rank and
scalar source input, and adds a pattern in sg-to-wi distribution to
support the lowering.
This PR adds support for retaining the anchor op layouts (after dropping
what's not required) for xegpu nD ops during the workgroup to subgroup
and unroll transformations.
Introduce anchor layouts for XeGPU anchor ops: load_nd, store_nd,
prefetch_nd, dpas, load, store, prefetch, load_matrix, store_matrix, and
atomic_rmw. An anchor layout is permanent and, once specified, is
guaranteed to be honored by XeGPU distribution and lowerings.
1. Add anchor_layout for XeGPU anchor ops: load_nd, store_nd,
prefetch_nd, dpas, load, store, prefetch, load_matrix, store_matrix, and
atomic_rmw.
2. Rename the layout attributes to anchor_layout for these ops: load,
store, load_matrix, store_matrix.
3. Update the layout propagation pass: only when the user doesn't specify
an anchor layout does the pass compute a default layout, set it as the
anchor op's permanent layout, and use it for propagation. If the user
specified an anchor layout, the pass uses the user-specified anchor
layout.
This PR adds general SIMT distribution support for
`vector.extract/insert_strided_slice`. Vector distribution already
supports these operations, but with restrictions that avoid requiring
layouts in the distribution logic. For example, `extract_strided_slice`
requires that the distributed dimension is fully extracted. However,
more complex cases may require extracting partially from the distributed
dimension (e.g. 8x16xf16 extraction from 8x32xf16). Such cases need the
layouts to reason about how the data is spread across SIMT lanes.
Currently, we don't have layout access in vector distribution, so these
new patterns are placed on the XeGPU side. They have a higher pattern
benefit so that they are tried before the regular vector distribution
patterns.
This PR does the following:
1. Handles the order attribute during delinearization from the linear
subgroup id to the multi-dim id.
2. Adds a transformation pattern for vector.transpose in the wg to sg pass.
3. Updates CHECKS in the wg to sg tests.
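Order-aware delinearization (item 1) can be sketched as follows. This is
a minimal illustration, assuming that `order[0]` names the
fastest-varying dimension, so that `[1, 0]` corresponds to row-major for
a 2D layout. The function name `delinearize` and its signature are
hypothetical and do not mirror the pass's actual helper.

```python
def delinearize(linear_id, shape, order):
    """Convert a linear id into a multi-dim id for the given dim order.

    Assumption: order[0] is the fastest-varying dimension, order[-1] the
    slowest. `order` must be a permutation of range(len(shape)).
    """
    if sorted(order) != list(range(len(shape))):
        raise ValueError("order must be a permutation of the dimensions")
    total = 1
    for extent in shape:
        total *= extent
    if not 0 <= linear_id < total:
        raise ValueError("linear_id is out of range for the given shape")

    ids = [0] * len(shape)
    remaining = linear_id
    for dim in order:  # peel off the fastest-varying dimension first
        ids[dim] = remaining % shape[dim]
        remaining //= shape[dim]
    return ids


# sg_layout = [8, 4]: with order [1, 0], dim 1 varies fastest.
print(delinearize(5, [8, 4], [1, 0]))  # [1, 1]
# With order [0, 1], dim 0 varies fastest instead.
print(delinearize(5, [8, 4], [0, 1]))  # [5, 0]
```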
This pass rewrites certain xegpu `CreateNd` and `LoadNd` operations that
feed into `vector.transpose` into a more optimal form to improve
performance. Specifically, low-precision (bitwidth < 32) `LoadNd` ops
that feed into transpose ops are rewritten to i32 loads with a valid
transpose layout, so that later passes can use the load-with-transpose
HW feature to accelerate such loads.
**Update:**
The pass is renamed to `OptimizeBlockLoads` because we plan to later add
the array length optimization to this pass as well. This will break
down a larger load (like `32x32xf16`) into more DPAS-favorable array
length loads (`32x16xf16` with array length = 2). Both optimizations
require rewriting `CreateNd` and `LoadNd`, so it makes sense to have a
common pass for both.
As [suggested
here](https://github.com/llvm/llvm-project/pull/163071#discussion_r2427229637)
the PR adds an optional layout attribute for `LoadGather` and
`StoreScatter` ops.
For the load-op the attribute describes the layout of the result (ex
`layout_result_0`), and for store-op it describes the layout for the
vector-to-store operand (ex `layout_operand_0`).
The PR also reworks the `propagate-layout` pass to consider permanent
layout attributes and back-propagate them accordingly.
The helper utility function `getDistributeLayoutAttr` is reworked to
return either `layout_operand/result_0` or `layout` for load/store ops
(depending on which one is set). After an offline discussion, we decided
that the overall layout utilities API is confusing since it tries to
mix permanent and temporary layouts. It will need to change in the
future.
---------
Signed-off-by: dchigarev <dmitry.chigarev@intel.com>
This PR changes the pack/unpack method used for unrolling so that
lower-rank slices can be extracted from and inserted into the source
vector by adding reshapes. It also removes leading unit dims from
inst_data, if any.
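Removing leading unit dims from inst_data can be sketched as below. This
is an illustration only: the helper name `drop_leading_unit_dims` is
hypothetical, and keeping at least one dimension when all entries are 1
is an assumption, not something stated in the PR.

```python
def drop_leading_unit_dims(inst_data):
    """Strip leading 1s from an inst_data list.

    Interior and trailing unit dims are kept. Assumption: at least one
    dimension is always preserved, even if every entry is 1.
    """
    first = 0
    while first < len(inst_data) - 1 and inst_data[first] == 1:
        first += 1
    return list(inst_data[first:])


print(drop_leading_unit_dims([1, 1, 8, 16]))  # [8, 16]
print(drop_leading_unit_dims([1, 8, 1, 16]))  # [8, 1, 16]
```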
Uniform values should not be distributed during vector distribution.
An example is a reduction result where the reduction happens across
lanes.
However, the current `getDistributedType` does not accept a zero-result
affine map (i.e. no distributed dims) when describing the distributed
dimensions. This results in a null type being returned, crashing vector
distribution in some cases. An example is an `scf.for` op (about to be
distributed) in which one of the loop results is a uniform value that
has no user outside the warp op. This necessitates querying
`getDistributedType` to figure out the distributed type of this value.
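The intended behavior for uniform values can be sketched with a
simplified model of shape distribution. This is not the MLIR
implementation of `getDistributedType`: the function
`distributed_shape`, its list-of-dims argument (standing in for the
affine map results), and the splitting rule are assumptions made for
illustration. The point is only that an empty list of distributed dims
should yield the original shape rather than a failure.

```python
def distributed_shape(shape, distributed_dims, warp_size):
    """Return the per-lane shape, or None if the shape cannot be split.

    Simplified model (assumption): lanes are consumed by the listed dims
    in order. No distributed dims means the value is uniform and keeps
    its shape.
    """
    if not distributed_dims:
        return list(shape)  # uniform value: nothing to distribute

    result = list(shape)
    lanes = warp_size
    for dim in distributed_dims:
        if result[dim] % lanes == 0:
            # This dim absorbs all remaining lanes.
            result[dim] //= lanes
            lanes = 1
            break
        if lanes % result[dim] != 0:
            return None  # neither divides the other: not distributable
        # The whole dim is spread over a subset of the remaining lanes.
        lanes //= result[dim]
        result[dim] = 1
    return result if lanes == 1 else None


print(distributed_shape([32], [], 32))       # [32]  (uniform value)
print(distributed_shape([16, 32], [1], 16))  # [16, 2]
print(distributed_shape([16, 3], [1], 16))   # None
```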