22 Commits

Author SHA1 Message Date
Jianhui Li
61b8a57839
[MLIR][XeGPU] Refactor layout propagation utilities (#179016)
This PR refactors layout propagation into two distinct components:
result/anchor layout setup and source layout inference from the result.

For operations that require a specific result layout due to semantic or
hardware constraints, the propagation logic explicitly sets up the
result or anchor layout. Otherwise, it infers the source layout from the
backward-propagated consumer layout.

The result or anchor layout may differ from the backward-propagated
consumer layout; any such discrepancies are resolved via the existing
layout-conflict mechanism.

**This PR introduces the following utility functions:**

Source layout inference:

> inferBroadcastSourceLayout()
> inferMultiReductionSourceLayout()
> inferBitCastSourceLayout()
> inferShapeCastSourceLayout()
> inferInsertStridedSliceSourceLayout()

Result / anchor layout setup:

> setupMultiReductionResultLayout()
> setupBitCastResultLayout()
> setupInsertStridedSliceResultLayout()
> setupLoadMatrixAnchorLayout()
> setupStoreMatrixAnchorLayout()
> setupLoadGatherAnchorLayout()
> setupStoreScatterAnchorLayout()

Part of subgroup distribution related code changes are separated and
created as PR https://github.com/llvm/llvm-project/pull/179018/changes.
2026-02-05 19:26:25 -08:00
Charitha Saumya
d9c65f94b1
[mlir][xegpu] Add XeGPUSgToWiDistributeExperimental pass. (#177492)
Currently XeGPU lowering pipeline uses `XeGPUSubgroupDistribute` pass to
subgroup to work item distribution of ops. This pass is well established
and relies on vector distribution's `WarpOp` based distribution
mechanism. However, recent experiments with larger kernels have shown
that this pass is very expensive in terms of compile time (see below).

This prompted us to create a new pass that does not rely on `WarpOp`
based distribution. This PR adds the initial infra to move away from the
old way and align Wg To WI distribution with Wg to Sg distribution. New
pass also uses context-aware type conversion based on XeGPU layouts to
distributed vector types from SG to WI.

This PR adds the following changes:
* SG to WI distribution pass based on context-aware type conversions
using `OpConversionPatterns`
* Test pass for testing individual patterns
(`TestXeGPUSgToWiDistributeExperimental`)
* `XeGPUSgToWiDistributeExperimentalPass` which will eventually replace
`XeGPUSubgroupDistribute`

Flash attention e2e compilations stats:
```
----Wall Time----  ----Name----
    0.0032 (  0.2%)  Parser
    0.0008 (  0.0%)  CSE
    0.0000 (  0.0%)    (A) DominanceInfo
    0.0002 (  0.0%)  GpuXeVMAttachTarget
    1.1427 ( 58.7%)  'gpu.module' Pipeline
    0.0019 (  0.1%)    XeGPUWgToSgDistribute
    0.0003 (  0.0%)    CSE
    0.0000 (  0.0%)      (A) DominanceInfo
    0.0002 (  0.0%)    LowerAffinePass
    0.0001 (  0.0%)    CSE
    0.0000 (  0.0%)      (A) DominanceInfo
    0.0008 (  0.0%)    XeGPUPropagateLayout
    0.0056 (  0.3%)    XeGPUBlocking
    0.0010 (  0.1%)    Canonicalizer
    0.0004 (  0.0%)    CSE
    0.0000 (  0.0%)      (A) DominanceInfo
    0.0015 (  0.1%)    XeGPUPropagateLayout
    0.0007 (  0.0%)    XeGPUOptimizeBlockLoads
    0.0010 (  0.0%)    Canonicalizer
    0.0004 (  0.0%)    CSE
    0.0000 (  0.0%)      (A) DominanceInfo
    0.0015 (  0.1%)    XeGPUPropagateLayout
    1.1274 ( 57.9%)    XeGPUSubgroupDistribute
    0.7959 ( 40.9%)  Output
    0.0022 (  0.1%)  Rest
    1.9461 (100.0%)  Total
```
2026-01-29 09:57:01 -08:00
Charitha Saumya
cd4c9d200b
[mlir][xegpu] Add initial support for layout conflict handling. (#173090)
This PR adds initial support for layout conflict resolution in XeGPU.
Layout conflict occurs when some op's use point expects a different
layout than what the op can currently provide. This conflict needs to be
resolved by adding certain other xegpu ops.

Initially, We only focus conflict handling at tensor desc use points.
2026-01-28 11:50:42 -08:00
Jianhui Li
2b9e47749c
[MLIR][XeGPU] Refactor Layout access interface (#172125)
This PR builds on the anchor layout mechanism introduced in
https://github.com/llvm/llvm-project/pull/169267 and performs the
following refactoring:

1. Introduce getAnchorLayout() and setAnchorLayout() interface for
anchor ops to get and set layout attributes.

2. Add getLocalLayout() and setLocalLayout() utility functions, and
refactor workgroup/subgroup distribution patterns to use these APIs.
These utilities access the layout information directly and locally,
without relying on global propagation.

3. Introduce localPropagateLayoutsFromAnchor(), a utility used by
subgroup distribution to unify non-anchor layout setup.
This function is intended to be invoked upfront by all layout-based
passes (including workgroup/subgroup distribution and unrolling) to
propagate layouts from anchor ops to non-anchor ops.
After this step, patterns within the pass should exclusively use
getLocalLayout() / setLocalLayout().

4. Refactor getDistributeLayoutAttr() and setDistributeLayoutAttr() to
remove special-case handling. These APIs now operate in a uniform order:
anchor ops first, then non-anchor ops, and finally block arguments.
These APIs will be deprecated on long run. 

5. Refactor patterns in wg/sg distribution, load optimization passes to
use get/setAnchorLayout() and get/setLocalLayout().

6. Update test cases to enforce that anchor ops must use—and only
use—anchor layouts.
2025-12-17 12:04:58 -08:00
Mehdi Amini
7ead626bf1 [MLIR] Apply clang-tidy fixes for modernize-use-equals-default in TestXeGPUTransforms.cpp (NFC) 2025-12-17 11:16:52 -08:00
Artem Kroviakov
68c4c83bcb
[MLIR][XeGPU] Matrix load/store subgroup distribution (#165008) 2025-11-03 21:48:27 +01:00
Charitha Saumya
bd6da1feaa
[mlir][xegpu] Add more tests in XeGPU subgroup distribution. (#162543)
This PR adds some tests for covering some useful corner cases.

1. more tests for `vector.shape_cast` distribution.
2. testing for `MoveFuncBodyToWarpOp` pattern that was not possible
before.
2025-10-10 09:27:36 -07:00
Charitha Saumya
b86fef88c5
[mlir][xegpu] Create a test pass for subgroup distribution. (#161592)
Current subgroup distribution test employ the entire
`xegpu-subgroup-distribute` pass which include multiple steps like
layout propagation, move func body into warp op, and distribute to work
items.

This makes it harder to isolate the testing for xegpu subgroup
distribution logic, because certain corner cases may be not supported
yet by other steps mentioned above.

This PR introduces a test pass for subgroup distribution logic and
isolate the testing for distribution logic. We plan to add more corner
case (that were not possible before) covering non-xegpu ops (like
vector) in next PRs.

This PR also include,
1. minor bug fixes in gather/scatter distribution.  
2. bug fix in vector multi reduction lowering where it fails to retain
some layouts.
2025-10-03 12:35:13 -07:00
Mehdi Amini
da160574e0 [MLIR] Remove unused debug macros (NFC) 2025-10-01 05:09:58 -07:00
Dmitry Chigarev
04258fe3b1
[mlir][XeGPU][XeGPUUnroll] Support new syntax with offsets moved to load_nd/store_nd/prefetch_nd (#160323)
Adds support for new syntax in XeGPUUnroll for:
1. `create_nd_desc` without offsets
2. `load_nd` with offsets
3. `store_nd` with offsets
4. `prefetch_nd` with offsets

`create_nd_desc with offsets` + `load_nd with offsets` won't be lowered
correctly. In this case the IR would still have two unrealized
conversions that will fail later in the pipeline.

The offsets computation for the unrolled tile is now moved from
descriptors to load/store/prefetch operations. The resulted IR now has
one single descriptor that is being iterated in load/store/prefetch ops.

<details><summary>old/new behavior examples</summary>

```mlir
// before unroll pass:
gpu.func @load_nd(%src: memref<256x318xf32>) -> vector<24x32xf32> {
  %tdesc = xegpu.create_nd_tdesc %src : memref<256x318xf32> -> !xegpu.tensor_desc<24x32xf32, #xegpu.layout<inst_data = [8, 16]>>
  %ld = xegpu.load_nd %tdesc[8, 16]: !xegpu.tensor_desc<24x32xf32, #xegpu.layout<inst_data = [8, 16]>> -> vector<24x32xf32>
  gpu.return %ld : vector<24x32xf32>
}

// after unroll pass (offsets in create_nd_desc):
gpu.func @create_nd_tdesc2(%arg0: memref<256x318xf32>) -> vector<24x32xf32> {
  %cst = arith.constant dense<0.000000e+00> : vector<24x32xf32>
  %c24 = arith.constant 24 : index
  %c32 = arith.constant 32 : index
  %c8 = arith.constant 8 : index
  %c16 = arith.constant 16 : index
  // create 6 descriptors for each tile
  %0 = xegpu.create_nd_tdesc %arg0[%c8, %c16] : memref<256x318xf32> -> !xegpu.tensor_desc<8x16xf32>
  %1 = xegpu.create_nd_tdesc %arg0[%c8, %c32] : memref<256x318xf32> -> !xegpu.tensor_desc<8x16xf32>
  %2 = xegpu.create_nd_tdesc %arg0[%c16, %c16] : memref<256x318xf32> -> !xegpu.tensor_desc<8x16xf32>
  %3 = xegpu.create_nd_tdesc %arg0[%c16, %c32] : memref<256x318xf32> -> !xegpu.tensor_desc<8x16xf32>
  %4 = xegpu.create_nd_tdesc %arg0[%c24, %c16] : memref<256x318xf32> -> !xegpu.tensor_desc<8x16xf32>
  %5 = xegpu.create_nd_tdesc %arg0[%c24, %c32] : memref<256x318xf32> -> !xegpu.tensor_desc<8x16xf32>
  %6 = xegpu.load_nd %0  : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32>
  %7 = xegpu.load_nd %1  : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32>
  %8 = xegpu.load_nd %2  : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32>
  %9 = xegpu.load_nd %3  : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32>
  %10 = xegpu.load_nd %4  : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32>
  %11 = xegpu.load_nd %5  : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32>
  ...
}

// after unroll pass (offsets in load_nd):
gpu.func @load_nd(%arg0: memref<256x318xf32>) -> vector<24x32xf32> {
  %cst = arith.constant dense<0.000000e+00> : vector<24x32xf32>
  %c24 = arith.constant 24 : index
  %c32 = arith.constant 32 : index
  %c16 = arith.constant 16 : index
  %c8 = arith.constant 8 : index
  // create only one descriptor with proper tile shape
  %0 = xegpu.create_nd_tdesc %arg0 : memref<256x318xf32> -> !xegpu.tensor_desc<8x16xf32>
  // compute tile offsets at the operation (using only one descriptor)
  %1 = xegpu.load_nd %0[%c8, %c16]  : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32>
  %2 = xegpu.load_nd %0[%c8, %c32]  : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32>
  %3 = xegpu.load_nd %0[%c16, %c16]  : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32>
  %4 = xegpu.load_nd %0[%c16, %c32]  : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32>
  %5 = xegpu.load_nd %0[%c24, %c16]  : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32>
  %6 = xegpu.load_nd %0[%c24, %c32]  : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32>
  ...
}
```

</details>

---------

Signed-off-by: dchigarev <dmitry.chigarev@intel.com>
2025-09-25 11:31:17 -07:00
Nishant Patel
d235d62d65
[MLIR][XeGPU] Add unroll pattern for load_gather and store_scatter with offsets (#159453)
This PR adds unrolling/blocking patterns for load_gather and
store_scatter ops with offsets.
2025-09-24 13:28:43 -07:00
Charitha Saumya
9b0d7ddb04
[mlir][xegpu] Add support for vector.multi_reduction and vector.shape_cast SIMT distribution. (#157560)
Add support for distributing the `vector.multi_reduction` operation
across lanes in a warp. Currently only 2D to 1D reductions are
supported. Given layouts for the source and accumulator vectors,
* If the reduction dimension is distributed across lanes, the reduction
is non-lane-local and the reduction is done using warp shuffles. Here we
simply rewrite the `MultiDimReductionOp` to a sequence of `ReductionOp`s
inside the warp op body. Actual distribution will be done by
`WarpOpReduction` pattern.
* If the reduction dimension is not distributed across lanes, the
reduction is lane-local. In this case, we yield the source and
accumulator vectors from the warp op and perform the lane-local
reduction outside the warp op using a sequence of `ReductionOp`s.

PR also adds support for distributing `vector.shape_cast` based on
layouts.
2025-09-12 09:37:04 -07:00
Chao Chen
68d6866428
[mlir][XeGPU] add WgToSg distribution pattern for load_matrix and store_matrix. (#154403) 2025-08-21 10:02:45 -05:00
Jacques Pienaar
4bf33958da
[mlir] Update builders to use new form. (#154132)
Mechanically applied using clang-tidy.
2025-08-18 15:19:34 +00:00
Chao Chen
c96223434c
[mlir][xegpu] Add definition of SliceAttr (#150146)
---------

Co-authored-by: Charitha Saumya <136391709+charithaintc@users.noreply.github.com>
2025-08-08 11:27:17 -05:00
Jacques Pienaar
07967d4af8
[mlir] Switch to new LDBG macro (#150616)
Change local variants to use new central one.
2025-07-25 18:22:46 +02:00
Chao Chen
75524dee18
[mlir][xegpu] Relax rank restriction of TensorDescType (#145916) 2025-07-09 19:40:24 -05:00
Jianhui Li
118bfcda46
[MLIR][XEGPU] Add blocking support for scatter ops (#144766)
Add blocking support for scatter ops: Create_tdesc, update, prefetch,
load and store. It also enables the load/store with chunk size.
2025-06-18 14:52:03 -07:00
Jianhui Li
f25f2f7de4
[MLIR][XeGPU] Extend unrolling support for scatter ops with chunk_size (#144447)
Add support for load/store with chunk_size, which requires special
consideration for the operand blocking since offests and masks are
 n-D and tensor are n+1-D. Support operations including create_tdesc,
update_tdesc, load, store, and prefetch.

---------

Co-authored-by: Adam Siemieniuk <adam.siemieniuk@intel.com>
2025-06-17 17:46:35 -05:00
Jianhui Li
58d23476f0
[MLIR][XeGPU] Add unroll patterns for scatter ops (#143602)
Add unrolling support for create_tdesc, load, store, prefetch, and update_offset.

---------

Co-authored-by: Adam Siemieniuk <adam.siemieniuk@intel.com>
Co-authored-by: Chao Chen <chao.chen@intel.com>
2025-06-16 10:48:41 -05:00
Kazu Hirata
1eb843b1a0
[mlir] Ensure newline at the end of files (NFC) (#143155) 2025-06-06 09:16:52 -07:00
Chao Chen
db42345dc6
[MLIR][XeGPU] Add unroll patterns for XeGPU (1/N) (#137010)
Similar to vector ops, XeGPU ops need to be unrolled into smaller shapes
such that they can be dispatched into a hardware instruction. This PR
marks the initial phase of a series dedicated to incorporating unroll
patterns for XeGPU operations. In this installment, we introduce
patterns for the following operations:
1. createNd
2. updateNd
3. prefetchNd
4. loadNd
5. storeNd
6. dpas
2025-05-12 09:16:21 -05:00