Uniform values should not be distributed during vector distribution.
An example is a reduction result where the reduction happens across
lanes.
However, the current `getDistributedType` does not accept a zero-result
affine map (i.e. no distributed dims) when describing the distributed
dimensions. This results in a null type being returned, which crashes
vector distribution in some cases. An example is an `scf.for` op (about
to be distributed) in which one of the loop results is a uniform value
that has no user outside the warp op. This necessitates querying
`getDistributedType` to figure out the distributed type of this value.
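To illustrate the intended behavior, here is a minimal Python model of how a per-lane shape could be derived from a list of distributed dimensions. It is a sketch, not the actual `getDistributedType` implementation: the function name `distributed_shape` and its parameters are hypothetical, and the real code operates on MLIR vector types and affine maps.

```python
from typing import Optional, Sequence, Tuple


def distributed_shape(
    shape: Sequence[int], distributed_dims: Sequence[int], warp_size: int
) -> Optional[Tuple[int, ...]]:
    """Return the per-lane shape, or None if the shape cannot be distributed.

    `distributed_dims` plays the role of the affine map results. An empty
    list models the zero-result map: the value is uniform, so every lane
    keeps the full shape instead of getting a null type.
    """
    if not distributed_dims:
        return tuple(shape)  # uniform value: nothing to distribute

    per_lane = list(shape)
    remaining = warp_size  # lanes that still have to be mapped onto a dim
    for dim in distributed_dims:
        if remaining == 1:
            break
        extent = per_lane[dim]
        if extent % remaining == 0:
            # This dim absorbs all remaining lanes.
            per_lane[dim] = extent // remaining
            remaining = 1
        elif remaining % extent == 0:
            # This dim is fully consumed; leftover lanes go to the next dim.
            per_lane[dim] = 1
            remaining //= extent
        else:
            return None
    return tuple(per_lane) if remaining == 1 else None
```

For the `scf.for` case described above, a uniform `vector<1xf32>` result corresponds to `distributed_shape((1,), [], 32)`, which keeps the shape `(1,)` rather than failing.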
This PR adds tests covering some useful corner cases:
1. more tests for `vector.shape_cast` distribution.
2. testing for `MoveFuncBodyToWarpOp` pattern that was not possible
before.
Current subgroup distribution tests employ the entire
`xegpu-subgroup-distribute` pass, which includes multiple steps: layout
propagation, moving the function body into a warp op, and distribution
to work items.
This makes it harder to isolate the testing of the xegpu subgroup
distribution logic, because certain corner cases may not yet be
supported by the other steps mentioned above.
This PR introduces a test pass for the subgroup distribution logic and
isolates the testing of that logic. We plan to add more corner cases
(that were not possible before) covering non-xegpu ops (like vector) in
upcoming PRs.
This PR also includes:
1. minor bug fixes in gather/scatter distribution.
2. bug fix in vector multi reduction lowering where it fails to retain
some layouts.
Currently offsets are given as operands of the `CreateNd` op. Subgroup
distribution does not support offset arguments at the consumer.
This PR adds support for offsets given at the consumer (like LoadNd).
With this change, the offsets must be specified at the consumer op
(LoadNd, StoreNd, PrefetchNd) of the tile; otherwise distribution will
fail.
This also removes the need for the UpdateNdOffset op, so this PR removes
support for UpdateNdOffset.
This PR adds the features needed for supporting the GEMM with transpose
B case.
Summary of changes.
1). Add distribution logic for `vector.bitcast`, `vector.transpose` and
`memref.extract_aligned_pointer_as_index` cases.
2). Add layout propagation support for `vector.shape_cast`,
`vector.broadcast` and `vector.bitcast`
3). Incorporate slice attribute and `DistributeLayoutAttr` interface
with the core logic in layout prop.
Add support for distributing the `vector.multi_reduction` operation
across lanes in a warp. Currently only 2D to 1D reductions are
supported. Given layouts for the source and accumulator vectors,
* If the reduction dimension is distributed across lanes, the reduction
is non-lane-local and the reduction is done using warp shuffles. Here we
simply rewrite the `MultiDimReductionOp` to a sequence of `ReductionOp`s
inside the warp op body. Actual distribution will be done by
`WarpOpReduction` pattern.
* If the reduction dimension is not distributed across lanes, the
reduction is lane-local. In this case, we yield the source and
accumulator vectors from the warp op and perform the lane-local
reduction outside the warp op using a sequence of `ReductionOp`s.
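The two cases can be sketched with a small Python model. It assumes a 2D source whose columns are each owned by one lane (so dimension 1 is the distributed dimension) and a power-of-two lane count; the helper names are hypothetical and the butterfly exchange only stands in for the warp shuffles, it is not the code generated by `WarpOpReduction`.

```python
def butterfly_sum(lane_values):
    """Cross-lane sum: each step combines lane l with lane l XOR offset.

    After log2(n) steps every lane holds the full sum, which mimics a
    shuffle-based warp reduction.
    """
    n = len(lane_values)
    assert n > 0 and n & (n - 1) == 0, "lane count must be a power of two"
    vals = list(lane_values)
    offset = n // 2
    while offset >= 1:
        vals = [vals[lane] + vals[lane ^ offset] for lane in range(n)]
        offset //= 2
    return vals


def reduce_lane_local(src, acc):
    """Reduce dim 0 (not distributed): each lane reduces its own column."""
    return [acc[lane] + sum(row[lane] for row in src) for lane in range(len(acc))]


def reduce_cross_lane(src, acc):
    """Reduce dim 1 (distributed): one cross-lane reduction per row."""
    return [acc[i] + butterfly_sum(row)[0] for i, row in enumerate(src)]
```

With `src = [[1, 2, 3, 4], [5, 6, 7, 8]]` spread over four lanes, the lane-local case yields one partial result per lane, while the cross-lane case yields one value per row that is the same on every lane.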
PR also adds support for distributing `vector.shape_cast` based on
layouts.
This PR is a reapply of
https://github.com/llvm/llvm-project/pull/154949, which failed one of
the sanitizer checks.
The issue was querying the `warpOp` results in `LoadDistribution` after
calling `moveRegionToNewWarpOpAndAppendReturns()`, which resulted in a
use-after-free. This PR solves the issue by moving the op query before
the call and is otherwise identical to the one linked above.
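The ordering problem can be modeled in a few lines of Python. This is only an analogy under stated assumptions: the `WarpOp` class and both functions are hypothetical stand-ins, and the model raises an exception where the real C++ code would read freed memory.

```python
class WarpOp:
    """Toy stand-in for a warp op that becomes invalid once replaced."""

    def __init__(self, results):
        self._results = list(results)
        self.erased = False

    @property
    def results(self):
        if self.erased:
            # The C++ equivalent is undefined behavior, not an exception.
            raise RuntimeError("use of erased warp op")
        return self._results


def move_region_to_new_warp_op(old_op, extra_results):
    """Build a replacement op with extra results and invalidate the old one."""
    new_op = WarpOp(old_op.results + list(extra_results))
    old_op.erased = True
    return new_op


def distribute_load(warp_op):
    # Query everything needed from the old op BEFORE it is replaced.
    num_original_results = len(warp_op.results)
    new_op = move_region_to_new_warp_op(warp_op, ["offsets", "mask"])
    # From here on only `new_op` may be touched.
    return new_op, num_original_results
```

Swapping the first two statements of `distribute_load` would reproduce the bug in this model: the query would hit an op that has already been invalidated.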
---------
Co-authored-by: Charitha Saumya <136391709+charithaintc@users.noreply.github.com>
This PR adds distribution patterns for scattered load and store ops,
chunk size included.
XeGPU moves toward offsets being part of the load/store ops, so the pass
only supports this case. Manipulating a vector of offsets indirectly
through create_tdesc is complex and soon to become obsolete anyway.
This PR assumes the SIMT-adapted scatter ops verification introduced in
https://github.com/llvm/llvm-project/pull/154653. The distribution
itself can be reviewed in the meantime.
Adds a utility getter to `warp_execute_on_lane_0` which simplifies
access to the op's terminator.
Uses are refactored to utilize the new terminator getter.
The cause is the UpdateNdOffset source operand not retaining its layout
when it is yielded by the warp op. The `warp_execute_on_lane_0` op
expects the TensorDesc type to be unchanged during distribution out of
its region. We use UnrealizedCasts to reconcile this mismatch outside
the warpOp (via `resolveDistributedTy`).
This PR allows load_nd/store_nd/prefetch_nd to take an additional offset
operand.
It is based on https://github.com/llvm/llvm-project/pull/148335.
Users can now create an nd_tdesc with no offset and instead set the
offset on the load_nd operation.
This PR addresses the following issues.
1. Add the missing attributes when creating a new GPU funcOp in
`MoveFuncBodyToWarpExecuteOnLane0` pattern.
2. Bug fix in LoadNd distribution to make sure LoadOp is the last op in
warpOp region before it is distributed (needed for preserving the memory
op ordering during distribution).
3. Add utility for removing OpOperand or OpResult layout attributes.
Reapply attempt for https://github.com/llvm/llvm-project/pull/148291
Fix for the build failure reported in
https://lab.llvm.org/buildbot/#/builders/116/builds/15477
-----
This crash is caused by a mismatch between the distributed type returned
by `getDistributedType` and the intended distributed type for forOp
results.
Solution diff:
20c2cf6766
Example:
```
func.func @warp_scf_for_broadcasted_result(%arg0: index) -> vector<1xf32> {
  %c128 = arith.constant 128 : index
  %c1 = arith.constant 1 : index
  %c0 = arith.constant 0 : index
  %2 = gpu.warp_execute_on_lane_0(%arg0)[32] -> (vector<1xf32>) {
    %ini = "some_def"() : () -> (vector<1xf32>)
    %0 = scf.for %arg3 = %c0 to %c128 step %c1 iter_args(%arg4 = %ini) -> (vector<1xf32>) {
      %1 = "some_op"(%arg4) : (vector<1xf32>) -> (vector<1xf32>)
      scf.yield %1 : vector<1xf32>
    }
    gpu.yield %0 : vector<1xf32>
  }
  return %2 : vector<1xf32>
}
```
In this case the distributed type for the forOp result is `vector<1xf32>`
(the result is not distributed and is broadcast to all lanes instead).
However, `getDistributedType` returns a null type here.
Therefore, if the distributed type can be recovered from the warpOp, we
should always do that first before falling back to `getDistributedType`.
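The lookup order described above can be sketched in Python as follows. The function and parameter names are hypothetical, values and types are represented as plain strings, and `get_distributed_type` is passed in as a callable that may return `None` to model the null type.

```python
def pick_distributed_type(
    for_result, warp_yield_operands, warp_result_types, get_distributed_type
):
    """Prefer the type recorded on the warp op; fall back to computing it.

    `warp_yield_operands[i]` is the value yielded for warp op result `i`,
    whose (already distributed) type is `warp_result_types[i]`.
    """
    for yielded, result_type in zip(warp_yield_operands, warp_result_types):
        if yielded == for_result:
            return result_type  # recovered directly from the warp op
    # Not yielded by the warp op: the only option left is to compute it.
    return get_distributed_type(for_result)
```

In the example above, `%0` is yielded by the warp op with type `vector<1xf32>`, so the first branch applies and the null-returning fallback is never consulted.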
Changes:
* Decouple layout propagation from subgroup distribution and move it to
an independent pass.
* Refine layout assignment to handle control-flow ops correctly (scf.for, scf.while).
* Refine test cases.
Cf. https://discourse.llvm.org/t/mlir-dead-code-analysis/67568/10
Custom analysis passes will not work properly unless both
DeadCodeAnalysis and SparseConstantPropagation are loaded to the
DataFlowSolver. This is intended behavior, but surprising to many users
as shown in the thread. In lieu of a longer-term fix (which I am not
knowledgeable enough to implement myself, yet), this commit adds a
helper function that loads these two analyses, as well as providing
breadcrumbs for an explanation of the problem. The existing places in
the codebase where these two analyses are loaded for the purpose of
running other unrelated analyses are replaced by the use of the helper.
---------
Co-authored-by: Jeremy Kun <j2kun@users.noreply.github.com>
Co-authored-by: Oleksandr "Alex" Zinenko <azinenko@amd.com>
This PR introduces the initial implementation of a blocking pass for
XeGPU programs. The pass leverages unroll patterns from both the XeGPU
and Vector dialects.
---------
Co-authored-by: Adam Siemieniuk <adam.siemieniuk@intel.com>
This patch fixes several typographical errors in comments and test
files:
1. Corrected "achive" to "archive" in archive-update.test.
2. Fixed "achive" to "achieve" in a comment in
XeGPUSubgroupDistribute.cpp.
3. Corrected "achived" to "achieved" in a test note in
SimpleSIVNoValidityCheckFixedSize.ll.
These changes are non-functional and intended to improve readability and
documentation accuracy.
Signed-off-by: Kane Wang <wangqiang1@kylinos.cn>
Co-authored-by: Kane Wang <wangqiang1@kylinos.cn>
This PR adds support for moving scalar uniform ops (gpu index ops,
constants, etc.) outside the `gpu.warp_execute_on_lane_0` op. These
kinds of ops do not require distribution and are safe to move out of the
warp op. This also avoids adding separate distribution patterns for
these ops.
Example:
```
%1 = gpu.warp_execute_on_lane_0(%laneid) -> (index) {
  ...
  %block_id_x = gpu.block_id x
  gpu.yield %block_id_x
}
// use %1
```
To:
```
%block_id_x = gpu.block_id x
%1 = gpu.warp_execute_on_lane_0(%laneid) -> (index) {
  ...
  gpu.yield %block_id_x
}
// use %1
```
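The hoisting decision can be sketched in Python under a few assumptions: ops are modeled as dictionaries, the set of op names treated as uniform is illustrative rather than the list used by the pass, and an op is hoisted only if it produces a non-vector result and all of its operands are already available outside the warp op.

```python
# Illustrative set of ops whose result is the same for every lane (assumption).
UNIFORM_SCALAR_OPS = {"gpu.block_id", "gpu.block_dim", "gpu.grid_dim", "arith.constant"}


def hoist_uniform_ops(region_ops, outside_values):
    """Split the warp region's ops into (hoisted, kept), preserving order.

    Each op is a dict with keys: name, operands, result, result_type.
    """
    available = set(outside_values)  # values defined outside the warp op
    hoisted, kept = [], []
    for op in region_ops:
        is_scalar = not op["result_type"].startswith("vector")
        operands_outside = all(v in available for v in op["operands"])
        if op["name"] in UNIFORM_SCALAR_OPS and is_scalar and operands_outside:
            hoisted.append(op)
            # A hoisted result is now visible to later hoisting candidates.
            available.add(op["result"])
        else:
            kept.append(op)
    return hoisted, kept
```

Applied to the example above, `gpu.block_id` lands in `hoisted`, while ops that need distribution stay in the region.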
This patch fixes:
mlir/lib/Dialect/XeGPU/Transforms/XeGPUSubgroupDistribute.cpp:901:12:
error: variable 'origVecType' set but not used
[-Werror,-Wunused-but-set-variable]
mlir/lib/Dialect/XeGPU/Transforms/XeGPUSubgroupDistribute.cpp:908:12:
error: variable 'origTensorDescTy' set but not used
[-Werror,-Wunused-but-set-variable]
This patch fixes:
mlir/lib/Dialect/XeGPU/Transforms/XeGPUSubgroupDistribute.cpp:54:3:
error: definition of implicit copy assignment operator for 'Layout'
is deprecated because it has a user-declared copy constructor
[-Werror,-Wdeprecated-copy]
mlir/lib/Dialect/XeGPU/Transforms/XeGPUSubgroupDistribute.cpp:103:3:
error: definition of implicit copy assignment operator for 'SGMap'
is deprecated because it has a user-declared copy constructor
[-Werror,-Wdeprecated-copy]
Originally introduced in #130240 and reverted in #131364.
Reproduced the issue locally on Linux by doing a shared-library build.
The fix includes adding the missing LINK_LIBS.
**Original commit message:**
This PR adds the SG map propagation step of the XeGPU SIMT distribution.
SG map propagation is a sparse backward dataflow analysis that
propagates the sg_map backward, starting from the operands of certain
operations (DPAS, store, etc.).
This is the first step of XeGPU subgroup distribution. The analysis
result is used to attach layout information to each XeGPU SIMD subgroup
op. The lowering patterns in XeGPUSubgroupDistribute will consume this
layout info to distribute SIMD ops into SIMT ops that work on
work-item-level data fragments.
### Summary of Lowering XeGPU SIMD -> SIMT
1. Subgroup map propagation (This PR)
2. Attach `sg_map` to each op and move all ops inside the
`gpu.warp_execute_on_lane_0` region.
3. Distribute each op using `sg_map`
4. Additional legalization steps to align more with Xe HW.
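
As a rough illustration of step 1, the following Python sketch propagates a layout backward along def-use chains from anchor values. It is a deliberately simplified model: it assumes every op passes its result layout unchanged to all operands, whereas the real analysis uses per-op transfer functions, and the function name and string-based layouts are hypothetical.

```python
def propagate_layouts_backward(ops, anchors):
    """Propagate layouts from results to operands until nothing changes.

    `ops` is a list of (result, operands) pairs in program order.
    `anchors` maps values with a known layout (e.g. DPAS or store operands)
    to that layout.
    """
    layouts = dict(anchors)
    changed = True
    while changed:
        changed = False
        # Walk users before definitions so one sweep covers straight-line code.
        for result, operands in reversed(ops):
            layout = layouts.get(result)
            if layout is None:
                continue
            for operand in operands:
                if operand not in layouts:
                    layouts[operand] = layout
                    changed = True
    return layouts
```

Values that never feed an anchor keep no layout in this model, which mirrors the sparse nature of the analysis.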