28 Commits

Author SHA1 Message Date
Jianhui Li
61b8a57839
[MLIR][XeGPU] Refactor layout propagation utilities (#179016)
This PR refactors layout propagation into two distinct components:
result/anchor layout setup and source layout inference from the result.

For operations that require a specific result layout due to semantic or
hardware constraints, the propagation logic explicitly sets up the
result or anchor layout. Otherwise, it infers the source layout from the
backward-propagated consumer layout.

The result or anchor layout may differ from the backward-propagated
consumer layout; any such discrepancies are resolved via the existing
layout-conflict mechanism.

**This PR introduces the following utility functions:**

Source layout inference:

> inferBroadcastSourceLayout()
> inferMultiReductionSourceLayout()
> inferBitCastSourceLayout()
> inferShapeCastSourceLayout()
> inferInsertStridedSliceSourceLayout()

Result / anchor layout setup:

> setupMultiReductionResultLayout()
> setupBitCastResultLayout()
> setupInsertStridedSliceResultLayout()
> setupLoadMatrixAnchorLayout()
> setupStoreMatrixAnchorLayout()
> setupLoadGatherAnchorLayout()
> setupStoreScatterAnchorLayout()

Part of subgroup distribution related code changes are separated and
created as PR https://github.com/llvm/llvm-project/pull/179018/changes.
2026-02-05 19:26:25 -08:00
Charitha Saumya
d9c65f94b1
[mlir][xegpu] Add XeGPUSgToWiDistributeExperimental pass. (#177492)
Currently XeGPU lowering pipeline uses `XeGPUSubgroupDistribute` pass to
subgroup to work item distribution of ops. This pass is well established
and relies on vector distribution's `WarpOp` based distribution
mechanism. However, recent experiments with larger kernels have shown
that this pass is very expensive in terms of compile time (see below).

This prompted us to create a new pass that does not rely on `WarpOp`
based distribution. This PR adds the initial infra to move away from the
old way and align Wg To WI distribution with Wg to Sg distribution. New
pass also uses context-aware type conversion based on XeGPU layouts to
distributed vector types from SG to WI.

This PR adds the following changes:
* SG to WI distribution pass based on context-aware type conversions
using `OpConversionPatterns`
* Test pass for testing individual patterns
(`TestXeGPUSgToWiDistributeExperimental`)
* `XeGPUSgToWiDistributeExperimentalPass` which will eventually replace
`XeGPUSubgroupDistribute`

Flash attention e2e compilations stats:
```
----Wall Time----  ----Name----
    0.0032 (  0.2%)  Parser
    0.0008 (  0.0%)  CSE
    0.0000 (  0.0%)    (A) DominanceInfo
    0.0002 (  0.0%)  GpuXeVMAttachTarget
    1.1427 ( 58.7%)  'gpu.module' Pipeline
    0.0019 (  0.1%)    XeGPUWgToSgDistribute
    0.0003 (  0.0%)    CSE
    0.0000 (  0.0%)      (A) DominanceInfo
    0.0002 (  0.0%)    LowerAffinePass
    0.0001 (  0.0%)    CSE
    0.0000 (  0.0%)      (A) DominanceInfo
    0.0008 (  0.0%)    XeGPUPropagateLayout
    0.0056 (  0.3%)    XeGPUBlocking
    0.0010 (  0.1%)    Canonicalizer
    0.0004 (  0.0%)    CSE
    0.0000 (  0.0%)      (A) DominanceInfo
    0.0015 (  0.1%)    XeGPUPropagateLayout
    0.0007 (  0.0%)    XeGPUOptimizeBlockLoads
    0.0010 (  0.0%)    Canonicalizer
    0.0004 (  0.0%)    CSE
    0.0000 (  0.0%)      (A) DominanceInfo
    0.0015 (  0.1%)    XeGPUPropagateLayout
    1.1274 ( 57.9%)    XeGPUSubgroupDistribute
    0.7959 ( 40.9%)  Output
    0.0022 (  0.1%)  Rest
    1.9461 (100.0%)  Total
```
2026-01-29 09:57:01 -08:00
Artem Kroviakov
0926743e2e
[MLIR][XeGPU] Add uniform values distribution pattern (#176737) 2026-01-26 21:23:31 +01:00
Jianhui Li
1b8903aa8e
[MLIR][XeGPU] setUnitDim bug fix and add documentation (#173521)
This PR fix a bug in setUnitDimData and setUnitDimLayout, and adds
documentation and test.
It also cleans up the shapecast op pattern in the wg distribution to use
local temporary layout instead of getting from definition op's result
(one TODO item from PR
[#172125](https://github.com/llvm/llvm-project/pull/172125)).
2026-01-20 21:17:00 -08:00
Jianhui Li
074740df8a
[MLIR][XeGPU] bug fix: removing temporary slice layout at the pass end (#172589)
Removing temporary slice layout (besides the regular layout) at the end
of wg distribution and blocking pass.
The PR also drop sg_data/inst_data from anchor layouts in every
wg-to-sg/blocking/unrolling pattern.

---------

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
Co-authored-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
2026-01-14 09:57:14 -08:00
Jianhui Li
2b9e47749c
[MLIR][XeGPU] Refactor Layout access interface (#172125)
This PR builds on the anchor layout mechanism introduced in
https://github.com/llvm/llvm-project/pull/169267 and performs the
following refactoring:

1. Introduce getAnchorLayout() and setAnchorLayout() interface for
anchor ops to get and set layout attributes.

2. Add getLocalLayout() and setLocalLayout() utility functions, and
refactor workgroup/subgroup distribution patterns to use these APIs.
These utilities access the layout information directly and locally,
without relying on global propagation.

3. Introduce localPropagateLayoutsFromAnchor(), a utility used by
subgroup distribution to unify non-anchor layout setup.
This function is intended to be invoked upfront by all layout-based
passes (including workgroup/subgroup distribution and unrolling) to
propagate layouts from anchor ops to non-anchor ops.
After this step, patterns within the pass should exclusively use
getLocalLayout() / setLocalLayout().

4. Refactor getDistributeLayoutAttr() and setDistributeLayoutAttr() to
remove special-case handling. These APIs now operate in a uniform order:
anchor ops first, then non-anchor ops, and finally block arguments.
These APIs will be deprecated on long run. 

5. Refactor patterns in wg/sg distribution, load optimization passes to
use get/setAnchorLayout() and get/setLocalLayout().

6. Update test cases to enforce that anchor ops must use—and only
use—anchor layouts.
2025-12-17 12:04:58 -08:00
Charitha Saumya
eb7db0b9ec
[mlir][xegpu] Change index arithmetic ops to arith ops. (#170390)
Index ops cause some issues during SIMT distribution because they don't
have the `Elementwise` mappable trait. This PR replaces all index
arithmetic ops with matching `arith` dialect ops.
2025-12-03 07:48:00 -08:00
Jianhui Li
326a1a4bad
[MLIR][XeGPU] Add anchor_layout and update propagation to honor user-specified layouts (#169267)
Introduce anchor layout for XeGPU anchor ops: load_nd, store_nd,
prefetch_nd, dpas, load, store, prefetch, load_matrix, store_matrix, and
atomic_rmw. Anchor layout is permanent, and is guaranteed to be honored
by XeGPU distribution and lowerinngs once specified.
1. Add anchor_layout for XeGPU anchor OPs: load_nd, store_nd,
prefetch_nd, dpas, load, store, prefetch, load_matrix, store_matrix, and
atomic_rmw.
2. rename layout attributes to anchor_layout for these ops: load, store,
load_matrix, store_matrix
3. update layout propagation pass: Only when user doesn't specify anchor
layout, the pass computes a default layout and set to anchor op's
permant layout and use that for propagation. if user specified anchor
layout, the pass takes user-specified anchor layout. permant layout and
use that for propagation. if user specified anchor layout, the pass
takes user-specified anchor layout.
2025-11-26 23:02:01 -08:00
Kazu Hirata
ff8ed4d80a
[mlir] Use llvm::copy (NFC) (#168213)
Identified with llvm-use-ranges.
2025-11-15 10:54:01 -08:00
Charitha Saumya
9703bda95b
[mlir][xegpu] Add OptimizeBlockLoads pass. (#165483)
This pass rewrites certain xegpu `CreateNd` and `LoadNd` operations that
feeds into `vector.transpose` to more optimal form to improve
performance. Specifically, low precision (bitwidth < 32) `LoadNd` ops
that feeds into transpose ops are rewritten to i32 loads with a valid
transpose layout such that later passes can use the load with transpose
HW feature to accelerate such load ops.

**Update:**
Pass is renamed to `OptimizeBlockLoads ` because later we plan to add
the array length optimization into this pass as well. This will break
down a larger load (like `32x32xf16`) into more DPAS-favorable array
length loads (`32x16xf16` with array length = 2). Both these
optmizations require rewriting `CreateNd` and `LoadNd` and it makes
sense to have a common pass for both.
2025-11-04 13:15:32 -08:00
Dmitry Chigarev
6c563dc6a2
[mlir][XeGPU] Add optional layout attribute to LoadGather StoreScatter ops (#163414)
As [suggested
here](https://github.com/llvm/llvm-project/pull/163071#discussion_r2427229637)
the PR adds an optional layout attribute for `LoadGather` and
`StoreScatter` ops.

For the load-op the attribute describes the layout of the result (ex
`layout_result_0`), and for store-op it describes the layout for the
vector-to-store operand (ex `layout_operand_0`).

The PR also reworks `propagate-layout` pass to consider perm layout
attributes and back-propagate them accordingly.

The helper utility function `getDistributeLayoutAttr` is reworked to
return either `layout_operand/result_0` or `layout` for load/store ops
(denepding on which one is set). After an offline discussion decided
that the overall utilities layouts API is confusing since it tries to
mix permament and temporary layouts. Would need to change it in the
future.

---------

Signed-off-by: dchigarev <dmitry.chigarev@intel.com>
2025-11-04 08:19:47 -08:00
Mehdi Amini
bc2f746e78 [MLIR] Apply clang-tidy fixes for llvm-qualified-auto in XeGPUUtils.cpp (NFC) 2025-10-27 17:14:50 -07:00
Nishant Patel
621ed04e28
[MLIR][XeGPU]Enhance Pack/Unpack for XeGPUUnroll (#163459)
This PR changes the pack/unpack method used for unrolling to allow for
lower rank slice to be extracted and inserted from and to src vector by
adding reshapes. It also removes leading unit dims from inst_data if
there are any.
2025-10-24 11:04:05 -07:00
Jakub Kuderski
0820266651
[mlir] Use llvm accumulate wrappers. NFCI. (#162957)
Use wrappers around `std::accumulate` to make the code more concise and
less bug-prone: https://github.com/llvm/llvm-project/pull/162129.

With `std::accumulate`, it's the initial value that determines the
accumulator type. `llvm::sum_of` and `llvm::product_of` pick the right
accumulator type based on the range element type.

Found some funny bugs like a local accumulate helper that calculated a
sum with initial value of 1 -- we didn't hit the bug because the code
was actually dead...
2025-10-11 11:33:18 -04:00
Chao Chen
6026ca301d
[mlir][XeGPU] add unroll patterns for load_matrix and store_matrix (#154637) 2025-09-03 13:56:41 -05:00
Chao Chen
c96e2cdd13
[mlir][XeGPU] Update utils for LayoutAttr and SliceAttr support (#154819) 2025-08-27 12:37:15 -05:00
Chao Chen
68d6866428
[mlir][XeGPU] add WgToSg distribution pattern for load_matrix and store_matrix. (#154403) 2025-08-21 10:02:45 -05:00
Jianhui Li
98728d9dc8
[MLIR][XeGPU] Add lowering from transfer_read/transfer_write to load_gather/store_scatter (#152429)
Lowering transfer_read/transfer_write to load_gather/store_scatter in
case the target uArch doesn't support load_nd/store_nd. The high level
steps:
  1. compute Strides;
  2. compute Offsets;
  3. collapseMemrefTo1D;
  4. create Load gather or store_scatter op
2025-08-14 11:27:07 -07:00
Kazu Hirata
0925d7572a
[mlir] Remove unused includes (NFC) (#150266)
These are identified by misc-include-cleaner.  I've filtered out those
that break builds.  Also, I'm staying away from llvm-config.h,
config.h, and Compiler.h, which likely cause platform- or
compiler-specific build failures.
2025-07-23 15:18:53 -07:00
Chao Chen
317dae1a7e
[mlir][xegpu] Add initial skeleton implementation for lowering ConvertLayoutOp (#146176)
This PR adds initial skeleton implementation for lowering
ConvertLayoutOp. It currently only supports cases where SLM is not
needed.

---------

Co-authored-by: Adam Siemieniuk <adam.siemieniuk@intel.com>
2025-07-23 11:35:40 -05:00
Chao Chen
0e00bc4f83
[mlir][xegpu] cleanup the print format for TensorDesc (#149182) 2025-07-22 14:16:58 -05:00
Maksim Levental
7b78796543
[mlir][NFC] update mlir/Dialect create APIs (25/n) (#149932)
See https://github.com/llvm/llvm-project/pull/147168 for more info.
2025-07-21 19:57:59 -04:00
Charitha Saumya
fc3781853b
[mlir][xegpu] Minor fixes in XeGPU subgroup distribution. (#147846)
This PR addresses the following issues.

1. Add the missing attributes when creating a new GPU funcOp in
`MoveFuncBodyToWarpExecuteOnLane0` pattern.
2. Bug fix in LoadNd distribution to make sure LoadOp is the last op in
warpOp region before it is distributed (needed for preserving the memory
op ordering during distribution).
3. Add utility for removing OpOperand or OpResult layout attributes.
2025-07-17 15:13:20 -07:00
Chao Chen
5578bcbcfd
[mlir][xegpu] add support for structure control flow ops in workgroup to subgroup distribution (#142618)
This PR introduces support for `scf::ForOp`, `scf::WhileOp`, `scf::If`,
and `scf::Condition` within the workgroup-subgroup-distribution pass,
leveraging the `SCFStructuralTypeConversionsAndLegality`.
2025-06-13 12:32:46 -05:00
Chao Chen
9e2684e4cf
[MLIR][XeGPU] Add unroll patterns and blocking pass for XeGPU [2/N] (#142477)
Bring back https://github.com/llvm/llvm-project/pull/140163 with fixes
2025-06-02 21:39:30 -05:00
Chao Chen
b88dfb0b23
Revert "[MLIR][XeGPU] Add unroll patterns and blocking pass for XeGPU [2/N]" (#142459)
Reverts llvm/llvm-project#140163
2025-06-02 15:47:21 -04:00
Chao Chen
0210750d5a
[MLIR][XeGPU] Add unroll patterns and blocking pass for XeGPU [2/N] (#140163)
This PR introduces the initial implementation of a blocking pass for
XeGPU programs. The pass leverages unroll patterns from both the XeGPU
and Vector dialects. 

---------

Co-authored-by: Adam Siemieniuk <adam.siemieniuk@intel.com>
2025-06-02 14:02:45 -05:00
Charitha Saumya
d30554b19e
[mlir][xegpu] SIMT distribution patterns for XeGPU CreateNdTdesc, LoadNd, StoreNd and Dpas Ops. (#135271)
This PR adds the SIMT distribution patterns for create_nd_tdesc, load_nd, store_nd and dpas XeGPU ops.
2025-04-30 12:16:47 -07:00