llvm-project

Author	SHA1	Message	Date
Nishant Patel	570055bf97	[MLIR][XeGPU] Propagate layout from anchor ops before Wg To Sg & Blocking Pass (#179490 ) This PR calls recoverTemporaryLayout before the XeGPUWgtoSgDistribute and XeGPUBlocking passes to recover all temporary operand layouts that the transformation patterns may need for checks and verification.	2026-02-06 15:56:09 -08:00
Jianhui Li	61b8a57839	[MLIR][XeGPU] Refactor layout propagation utilities (#179016 ) This PR refactors layout propagation into two distinct components: result/anchor layout setup and source layout inference from the result. For operations that require a specific result layout due to semantic or hardware constraints, the propagation logic explicitly sets up the result or anchor layout. Otherwise, it infers the source layout from the backward-propagated consumer layout. The result or anchor layout may differ from the backward-propagated consumer layout; any such discrepancies are resolved via the existing layout-conflict mechanism. This PR introduces the following utility functions. Source layout inference: inferBroadcastSourceLayout(), inferMultiReductionSourceLayout(), inferBitCastSourceLayout(), inferShapeCastSourceLayout(), inferInsertStridedSliceSourceLayout(). Result / anchor layout setup: setupMultiReductionResultLayout(), setupBitCastResultLayout(), setupInsertStridedSliceResultLayout(), setupLoadMatrixAnchorLayout(), setupStoreMatrixAnchorLayout(), setupLoadGatherAnchorLayout(), setupStoreScatterAnchorLayout(). Some of the subgroup-distribution-related code changes are split out into a separate PR: https://github.com/llvm/llvm-project/pull/179018/changes.	2026-02-05 19:26:25 -08:00
Jianhui Li	fe71ea4437	[MLIR][XeGPU] Preserve Leading dimension when blocking rank-sensitive operations (#177489 ) This PR preserves leading dimensions for xegpu.load_matrix/store_matrix/atomic_rmw/convert_layout, and for vector operations that affect shapes: broadcast/multi-reduction/shape_cast/transpose. Rank-sensitive operations are operations whose semantics depend on the tensor rank (and consequently its shape); blocking therefore must not alter the rank or shape of their input tiles, for example by dropping leading dimensions.	2026-01-24 12:34:38 -08:00
Jianhui Li	074740df8a	[MLIR][XeGPU] bug fix: removing temporary slice layout at the pass end (#172589 ) Removes the temporary slice layout (in addition to the regular layout) at the end of the wg distribution and blocking passes. The PR also drops sg_data/inst_data from anchor layouts in every wg-to-sg/blocking/unrolling pattern. --------- Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com> Co-authored-by: Dmitry Chigarev <dmitry.chigarev@intel.com>	2026-01-14 09:57:14 -08:00
Nishant Patel	4a8ecccd70	[MLIR][XeGPU] Pass inst_data for blocking create/constant Mask and Step op (#175456 )	2026-01-13 15:21:39 -08:00
Jianhui Li	2b9e47749c	[MLIR][XeGPU] Refactor Layout access interface (#172125 ) This PR builds on the anchor layout mechanism introduced in https://github.com/llvm/llvm-project/pull/169267 and performs the following refactoring: 1. Introduce the getAnchorLayout() and setAnchorLayout() interfaces for anchor ops to get and set layout attributes. 2. Add the getLocalLayout() and setLocalLayout() utility functions, and refactor workgroup/subgroup distribution patterns to use these APIs. These utilities access the layout information directly and locally, without relying on global propagation. 3. Introduce localPropagateLayoutsFromAnchor(), a utility used by subgroup distribution to unify non-anchor layout setup. This function is intended to be invoked upfront by all layout-based passes (including workgroup/subgroup distribution and unrolling) to propagate layouts from anchor ops to non-anchor ops. After this step, patterns within the pass should exclusively use getLocalLayout() / setLocalLayout(). 4. Refactor getDistributeLayoutAttr() and setDistributeLayoutAttr() to remove special-case handling. These APIs now operate in a uniform order: anchor ops first, then non-anchor ops, and finally block arguments. These APIs will be deprecated in the long run. 5. Refactor patterns in the wg/sg distribution and load optimization passes to use get/setAnchorLayout() and get/setLocalLayout(). 6. Update test cases to enforce that anchor ops use anchor layouts, and only anchor layouts.	2025-12-17 12:04:58 -08:00
Nishant Patel	07372fcf6c	[MLIR][XeGPU] Remove leading unit dims from vector ops before unrolling (#165030 ) This PR uses the upstream populateCastAwayVectorLeadingOneDimPatterns to remove leading unit dims from vector ops before doing the unrolling/blocking.	2025-10-27 09:09:33 -07:00
Nishant Patel	621ed04e28	[MLIR][XeGPU]Enhance Pack/Unpack for XeGPUUnroll (#163459 ) This PR changes the pack/unpack method used for unrolling so that, by adding reshapes, lower-rank slices can be extracted from and inserted into the source vector. It also removes any leading unit dims from inst_data.	2025-10-24 11:04:05 -07:00
Nishant Patel	48c9a8a9c8	[MLIR][XeGPU] Enable blocking for scatter ops with offsets (#162896 ) The unroll patterns for these ops were added in the previous PR, but the getTileShape method was not updated to handle them, so the blocking pass was not kicking in.	2025-10-13 11:50:11 -07:00
Dmitry Chigarev	04258fe3b1	[mlir][XeGPU][XeGPUUnroll] Support new syntax with offsets moved to load_nd/store_nd/prefetch_nd (#160323 ) Adds support for new syntax in XeGPUUnroll for: 1. `create_nd_desc` without offsets 2. `load_nd` with offsets 3. `store_nd` with offsets 4. `prefetch_nd` with offsets. The combination `create_nd_desc with offsets` + `load_nd with offsets` won't be lowered correctly: in this case the IR would still have two unrealized conversions that will fail later in the pipeline. The offset computation for the unrolled tiles is now moved from descriptors to load/store/prefetch operations. The resulting IR now has one single descriptor that is iterated over by the load/store/prefetch ops. Old/new behavior examples:
```mlir
// before unroll pass:
gpu.func @load_nd(%src: memref<256x318xf32>) -> vector<24x32xf32> {
  %tdesc = xegpu.create_nd_tdesc %src : memref<256x318xf32> -> !xegpu.tensor_desc<24x32xf32, #xegpu.layout<inst_data = [8, 16]>>
  %ld = xegpu.load_nd %tdesc[8, 16]: !xegpu.tensor_desc<24x32xf32, #xegpu.layout<inst_data = [8, 16]>> -> vector<24x32xf32>
  gpu.return %ld : vector<24x32xf32>
}

// after unroll pass (offsets in create_nd_desc):
gpu.func @create_nd_tdesc2(%arg0: memref<256x318xf32>) -> vector<24x32xf32> {
  %cst = arith.constant dense<0.000000e+00> : vector<24x32xf32>
  %c24 = arith.constant 24 : index
  %c32 = arith.constant 32 : index
  %c8 = arith.constant 8 : index
  %c16 = arith.constant 16 : index
  // create 6 descriptors for each tile
  %0 = xegpu.create_nd_tdesc %arg0[%c8, %c16] : memref<256x318xf32> -> !xegpu.tensor_desc<8x16xf32>
  %1 = xegpu.create_nd_tdesc %arg0[%c8, %c32] : memref<256x318xf32> -> !xegpu.tensor_desc<8x16xf32>
  %2 = xegpu.create_nd_tdesc %arg0[%c16, %c16] : memref<256x318xf32> -> !xegpu.tensor_desc<8x16xf32>
  %3 = xegpu.create_nd_tdesc %arg0[%c16, %c32] : memref<256x318xf32> -> !xegpu.tensor_desc<8x16xf32>
  %4 = xegpu.create_nd_tdesc %arg0[%c24, %c16] : memref<256x318xf32> -> !xegpu.tensor_desc<8x16xf32>
  %5 = xegpu.create_nd_tdesc %arg0[%c24, %c32] : memref<256x318xf32> -> !xegpu.tensor_desc<8x16xf32>
  %6 = xegpu.load_nd %0 : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32>
  %7 = xegpu.load_nd %1 : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32>
  %8 = xegpu.load_nd %2 : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32>
  %9 = xegpu.load_nd %3 : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32>
  %10 = xegpu.load_nd %4 : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32>
  %11 = xegpu.load_nd %5 : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32>
  ...
}

// after unroll pass (offsets in load_nd):
gpu.func @load_nd(%arg0: memref<256x318xf32>) -> vector<24x32xf32> {
  %cst = arith.constant dense<0.000000e+00> : vector<24x32xf32>
  %c24 = arith.constant 24 : index
  %c32 = arith.constant 32 : index
  %c16 = arith.constant 16 : index
  %c8 = arith.constant 8 : index
  // create only one descriptor with proper tile shape
  %0 = xegpu.create_nd_tdesc %arg0 : memref<256x318xf32> -> !xegpu.tensor_desc<8x16xf32>
  // compute tile offsets at the operation (using only one descriptor)
  %1 = xegpu.load_nd %0[%c8, %c16] : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32>
  %2 = xegpu.load_nd %0[%c8, %c32] : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32>
  %3 = xegpu.load_nd %0[%c16, %c16] : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32>
  %4 = xegpu.load_nd %0[%c16, %c32] : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32>
  %5 = xegpu.load_nd %0[%c24, %c16] : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32>
  %6 = xegpu.load_nd %0[%c24, %c32] : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32>
  ...
}
```
--------- Signed-off-by: dchigarev <dmitry.chigarev@intel.com>	2025-09-25 11:31:17 -07:00
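The per-tile offsets in the example for #160323 (a 24x32 load at base offset [8, 16], blocked by `inst_data = [8, 16]`) can be reproduced with a few lines of arithmetic. This is only an illustrative sketch, not the XeGPUUnroll implementation: `tile_offsets` and its parameters are hypothetical names, and it assumes the shape is an exact multiple of the tile shape and that tiles are visited in row-major order.

```python
from itertools import product


def tile_offsets(shape, tile, base):
    """Return the absolute offset of every tile, in row-major order.

    Hypothetical helper for illustration only; it is not part of MLIR.
    """
    if not (len(shape) == len(tile) == len(base)):
        raise ValueError("shape, tile and base must have the same rank")
    if any(s % t != 0 for s, t in zip(shape, tile)):
        raise ValueError("shape must be an exact multiple of the tile shape")
    # Offsets of each tile relative to the start of the blocked region.
    steps = [range(0, s, t) for s, t in zip(shape, tile)]
    # Shift every relative offset by the base offset of the original op.
    return [tuple(b + o for b, o in zip(base, rel)) for rel in product(*steps)]


# 24x32 vector, 8x16 instruction tiles, original load at [8, 16].
offsets = tile_offsets(shape=(24, 32), tile=(8, 16), base=(8, 16))
for off in offsets:
    print(off)
```

The six offsets produced match the six `load_nd` offsets in the "offsets in load_nd" IR above; only the place where they are attached (descriptor versus load/store/prefetch op) differs between the old and new behavior.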
Charitha Saumya	9b0d7ddb04	[mlir][xegpu] Add support for `vector.multi_reduction` and `vector.shape_cast` SIMT distribution. (#157560 ) Add support for distributing the `vector.multi_reduction` operation across lanes in a warp. Currently only 2D to 1D reductions are supported. Given layouts for the source and accumulator vectors, * If the reduction dimension is distributed across lanes, the reduction is non-lane-local and the reduction is done using warp shuffles. Here we simply rewrite the `MultiDimReductionOp` to a sequence of `ReductionOp`s inside the warp op body. Actual distribution will be done by `WarpOpReduction` pattern. * If the reduction dimension is not distributed across lanes, the reduction is lane-local. In this case, we yield the source and accumulator vectors from the warp op and perform the lane-local reduction outside the warp op using a sequence of `ReductionOp`s. PR also adds support for distributing `vector.shape_cast` based on layouts.	2025-09-12 09:37:04 -07:00
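The lane-local versus non-lane-local distinction described in #157560 can be illustrated with a small simulation. This is a sketch under simplifying assumptions, not the actual distribution pattern: it assumes a lane layout of the form `[1, N]` with one column of the 2D source owned by each lane, and `distribute_columns` and `reduce_distributed` are hypothetical names. The cross-lane sum stands in for what warp shuffles would compute.

```python
def distribute_columns(src, num_lanes):
    """Give lane i column i of the 2D source (assumes cols == num_lanes)."""
    return [[row[lane] for row in src] for lane in range(num_lanes)]


def reduce_distributed(src, lane_layout, reduction_dim):
    """Sum-reduce a 2D list along reduction_dim under a [1, N] lane layout.

    Returns (kind, result), where kind says whether lanes had to cooperate.
    Hypothetical helper for illustration only.
    """
    if lane_layout[0] != 1 or len(src[0]) != lane_layout[1]:
        raise ValueError("sketch only supports lane_layout [1, N] with N columns")
    num_lanes = lane_layout[1]
    per_lane = distribute_columns(src, num_lanes)
    if lane_layout[reduction_dim] == 1:
        # Reduction dim is not distributed: each lane reduces its own fragment.
        return "lane-local", [sum(fragment) for fragment in per_lane]
    # Reduction dim is distributed: every result element needs one value from
    # each lane (modelled here as a plain sum across lanes).
    rows = len(src)
    return "cross-lane", [
        sum(per_lane[lane][r] for lane in range(num_lanes)) for r in range(rows)
    ]


src = [[1, 2, 3, 4], [5, 6, 7, 8]]
local_kind, local_result = reduce_distributed(src, [1, 4], reduction_dim=0)
cross_kind, cross_result = reduce_distributed(src, [1, 4], reduction_dim=1)
print(local_kind, local_result)
print(cross_kind, cross_result)
```

In the lane-local case each lane ends up holding one element of the result, which is why the reduction can be moved outside the warp op; in the cross-lane case no single lane has all the inputs for any result element.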
Chao Chen	6026ca301d	[mlir][XeGPU] add unroll patterns for load_matrix and store_matrix (#154637 )	2025-09-03 13:56:41 -05:00
Chao Chen	c96e2cdd13	[mlir][XeGPU] Update utils for LayoutAttr and SliceAttr support (#154819 )	2025-08-27 12:37:15 -05:00
Chao Chen	68d6866428	[mlir][XeGPU] add WgToSg distribution pattern for load_matrix and store_matrix. (#154403 )	2025-08-21 10:02:45 -05:00
Jacques Pienaar	07967d4af8	[mlir] Switch to new LDBG macro (#150616 ) Change local variants to use new central one.	2025-07-25 18:22:46 +02:00
Chao Chen	317dae1a7e	[mlir][xegpu] Add initial skeleton implementation for lowering ConvertLayoutOp (#146176 ) This PR adds an initial skeleton implementation for lowering ConvertLayoutOp. It currently only supports cases where SLM is not needed. --------- Co-authored-by: Adam Siemieniuk <adam.siemieniuk@intel.com>	2025-07-23 11:35:40 -05:00
Kazu Hirata	c06d3a7b72	[mlir] Remove unused includes (NFC) (#148769 ) These are identified by misc-include-cleaner. I've filtered out those that break builds. Also, I'm staying away from llvm-config.h, config.h, and Compiler.h, which likely cause platform- or compiler-specific build failures.	2025-07-14 22:19:23 -07:00
Chao Chen	75524dee18	[mlir][xegpu] Relax rank restriction of TensorDescType (#145916 )	2025-07-09 19:40:24 -05:00
Jianhui Li	118bfcda46	[MLIR][XEGPU] Add blocking support for scatter ops (#144766 ) Add blocking support for scatter ops: Create_tdesc, update, prefetch, load and store. It also enables the load/store with chunk size.	2025-06-18 14:52:03 -07:00
Jianhui Li	9630d7cb92	[MLIR][XeGPU] add blocking support for reduce, broadcast, and transpose (#143389 ) This PR adds blocking support for vector dialect operations (`reduce`, `broadcast`, and `transpose`) in XeGPU-based IR. It implements the blocking by simply using the shape specified by "inst_data" as the target shape for unrolling. It is based on https://github.com/llvm/llvm-project/pull/140163.	2025-06-10 10:50:26 -05:00
Chao Chen	9e2684e4cf	[MLIR][XeGPU] Add unroll patterns and blocking pass for XeGPU [2/N] (#142477 ) Bring back https://github.com/llvm/llvm-project/pull/140163 with fixes	2025-06-02 21:39:30 -05:00
Chao Chen	b88dfb0b23	Revert "[MLIR][XeGPU] Add unroll patterns and blocking pass for XeGPU [2/N]" (#142459 ) Reverts llvm/llvm-project#140163	2025-06-02 15:47:21 -04:00
Chao Chen	0210750d5a	[MLIR][XeGPU] Add unroll patterns and blocking pass for XeGPU [2/N] (#140163 ) This PR introduces the initial implementation of a blocking pass for XeGPU programs. The pass leverages unroll patterns from both the XeGPU and Vector dialects. --------- Co-authored-by: Adam Siemieniuk <adam.siemieniuk@intel.com>	2025-06-02 14:02:45 -05:00

23 Commits