llvm-project

Author	SHA1	Message	Date
Nishant Patel	07372fcf6c	[MLIR][XeGPU] Remove leading unit dims from vector ops before unrolling (#165030 ) This PR uses the upstream populateCastAwayVectorLeadingOneDimPatterns to remove leading unit dims from vector ops and then do the unrolling/blocking	2025-10-27 09:09:33 -07:00
Nishant Patel	621ed04e28	[MLIR][XeGPU]Enhance Pack/Unpack for XeGPUUnroll (#163459 ) This PR changes the pack/unpack method used for unrolling to allow for lower rank slice to be extracted and inserted from and to src vector by adding reshapes. It also removes leading unit dims from inst_data if there are any.	2025-10-24 11:04:05 -07:00
Nishant Patel	48c9a8a9c8	[MLIR][XeGPU] Enable blocking for scatter ops with offsets (#162896 ) The unroll patterns for these ops were added in the previous PR but the getTileShape method was not changed to handle these ops and hence blocking pass was not kicking in.	2025-10-13 11:50:11 -07:00
Dmitry Chigarev	04258fe3b1	[mlir][XeGPU][XeGPUUnroll] Support new syntax with offsets moved to load_nd/store_nd/prefetch_nd (#160323 ) Adds support for new syntax in XeGPUUnroll for: 1. `create_nd_desc` without offsets 2. `load_nd` with offsets 3. `store_nd` with offsets 4. `prefetch_nd` with offsets `create_nd_desc with offsets` + `load_nd with offsets` won't be lowered correctly. In this case the IR would still have two unrealized conversions that will fail later in the pipeline. The offsets computation for the unrolled tile is now moved from descriptors to load/store/prefetch operations. The resulted IR now has one single descriptor that is being iterated in load/store/prefetch ops. <details><summary>old/new behavior examples</summary> ```mlir // before unroll pass: gpu.func @load_nd(%src: memref<256x318xf32>) -> vector<24x32xf32> { %tdesc = xegpu.create_nd_tdesc %src : memref<256x318xf32> -> !xegpu.tensor_desc<24x32xf32, #xegpu.layout<inst_data = [8, 16]>> %ld = xegpu.load_nd %tdesc[8, 16]: !xegpu.tensor_desc<24x32xf32, #xegpu.layout<inst_data = [8, 16]>> -> vector<24x32xf32> gpu.return %ld : vector<24x32xf32> } // after unroll pass (offsets in create_nd_desc): gpu.func @create_nd_tdesc2(%arg0: memref<256x318xf32>) -> vector<24x32xf32> { %cst = arith.constant dense<0.000000e+00> : vector<24x32xf32> %c24 = arith.constant 24 : index %c32 = arith.constant 32 : index %c8 = arith.constant 8 : index %c16 = arith.constant 16 : index // create 6 descriptors for each tile %0 = xegpu.create_nd_tdesc %arg0[%c8, %c16] : memref<256x318xf32> -> !xegpu.tensor_desc<8x16xf32> %1 = xegpu.create_nd_tdesc %arg0[%c8, %c32] : memref<256x318xf32> -> !xegpu.tensor_desc<8x16xf32> %2 = xegpu.create_nd_tdesc %arg0[%c16, %c16] : memref<256x318xf32> -> !xegpu.tensor_desc<8x16xf32> %3 = xegpu.create_nd_tdesc %arg0[%c16, %c32] : memref<256x318xf32> -> !xegpu.tensor_desc<8x16xf32> %4 = xegpu.create_nd_tdesc %arg0[%c24, %c16] : memref<256x318xf32> -> !xegpu.tensor_desc<8x16xf32> %5 = xegpu.create_nd_tdesc %arg0[%c24, %c32] : memref<256x318xf32> -> !xegpu.tensor_desc<8x16xf32> %6 = xegpu.load_nd %0 : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32> %7 = xegpu.load_nd %1 : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32> %8 = xegpu.load_nd %2 : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32> %9 = xegpu.load_nd %3 : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32> %10 = xegpu.load_nd %4 : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32> %11 = xegpu.load_nd %5 : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32> ... } // after unroll pass (offsets in load_nd): gpu.func @load_nd(%arg0: memref<256x318xf32>) -> vector<24x32xf32> { %cst = arith.constant dense<0.000000e+00> : vector<24x32xf32> %c24 = arith.constant 24 : index %c32 = arith.constant 32 : index %c16 = arith.constant 16 : index %c8 = arith.constant 8 : index // create only one descriptor with proper tile shape %0 = xegpu.create_nd_tdesc %arg0 : memref<256x318xf32> -> !xegpu.tensor_desc<8x16xf32> // compute tile offsets at the operation (using only one descriptor) %1 = xegpu.load_nd %0[%c8, %c16] : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32> %2 = xegpu.load_nd %0[%c8, %c32] : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32> %3 = xegpu.load_nd %0[%c16, %c16] : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32> %4 = xegpu.load_nd %0[%c16, %c32] : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32> %5 = xegpu.load_nd %0[%c24, %c16] : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32> %6 = xegpu.load_nd %0[%c24, %c32] : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32> ... } ``` </details> --------- Signed-off-by: dchigarev <dmitry.chigarev@intel.com>	2025-09-25 11:31:17 -07:00
Charitha Saumya	9b0d7ddb04	[mlir][xegpu] Add support for `vector.multi_reduction` and `vector.shape_cast` SIMT distribution. (#157560 ) Add support for distributing the `vector.multi_reduction` operation across lanes in a warp. Currently only 2D to 1D reductions are supported. Given layouts for the source and accumulator vectors, * If the reduction dimension is distributed across lanes, the reduction is non-lane-local and the reduction is done using warp shuffles. Here we simply rewrite the `MultiDimReductionOp` to a sequence of `ReductionOp`s inside the warp op body. Actual distribution will be done by `WarpOpReduction` pattern. * If the reduction dimension is not distributed across lanes, the reduction is lane-local. In this case, we yield the source and accumulator vectors from the warp op and perform the lane-local reduction outside the warp op using a sequence of `ReductionOp`s. PR also adds support for distributing `vector.shape_cast` based on layouts.	2025-09-12 09:37:04 -07:00
Chao Chen	6026ca301d	[mlir][XeGPU] add unroll patterns for load_matrix and store_matrix (#154637 )	2025-09-03 13:56:41 -05:00
Chao Chen	c96e2cdd13	[mlir][XeGPU] Update utils for LayoutAttr and SliceAttr support (#154819 )	2025-08-27 12:37:15 -05:00
Chao Chen	68d6866428	[mlir][XeGPU] add WgToSg distribution pattern for load_matrix and store_matrix. (#154403 )	2025-08-21 10:02:45 -05:00
Jacques Pienaar	07967d4af8	[mlir] Switch to new LDBG macro (#150616 ) Change local variants to use new central one.	2025-07-25 18:22:46 +02:00
Chao Chen	317dae1a7e	[mlir][xegpu] Add initial skeleton implementation for lowering ConvertLayoutOp (#146176 ) This PR adds initial skeleton implementation for lowering ConvertLayoutOp. It currently only supports cases where SLM is not needed. --------- Co-authored-by: Adam Siemieniuk <adam.siemieniuk@intel.com>	2025-07-23 11:35:40 -05:00
Kazu Hirata	c06d3a7b72	[mlir] Remove unused includes (NFC) (#148769 ) These are identified by misc-include-cleaner. I've filtered out those that break builds. Also, I'm staying away from llvm-config.h, config.h, and Compiler.h, which likely cause platform- or compiler-specific build failures.	2025-07-14 22:19:23 -07:00
Chao Chen	75524dee18	[mlir][xegpu] Relax rank restriction of TensorDescType (#145916 )	2025-07-09 19:40:24 -05:00
Jianhui Li	118bfcda46	[MLIR][XEGPU] Add blocking support for scatter ops (#144766 ) Add blocking support for scatter ops: Create_tdesc, update, prefetch, load and store. It also enables the load/store with chunk size.	2025-06-18 14:52:03 -07:00
Jianhui Li	9630d7cb92	[MLIR][XeGPU] add blocking support for reduce, broadcast, and transpose (#143389 ) This PR adds blocking support for vector dialect operations (`reduce`, `broadcast`, and `transpose`) in the XeGPU based IR. It simply assigned the shape specified by "inst_data" as its target shape of the unrolling to implement the blocking. It is based on https://github.com/llvm/llvm-project/pull/140163.	2025-06-10 10:50:26 -05:00
Chao Chen	9e2684e4cf	[MLIR][XeGPU] Add unroll patterns and blocking pass for XeGPU [2/N] (#142477 ) Bring back https://github.com/llvm/llvm-project/pull/140163 with fixes	2025-06-02 21:39:30 -05:00
Chao Chen	b88dfb0b23	Revert "[MLIR][XeGPU] Add unroll patterns and blocking pass for XeGPU [2/N]" (#142459 ) Reverts llvm/llvm-project#140163	2025-06-02 15:47:21 -04:00
Chao Chen	0210750d5a	[MLIR][XeGPU] Add unroll patterns and blocking pass for XeGPU [2/N] (#140163 ) This PR introduces the initial implementation of a blocking pass for XeGPU programs. The pass leverages unroll patterns from both the XeGPU and Vector dialects. --------- Co-authored-by: Adam Siemieniuk <adam.siemieniuk@intel.com>	2025-06-02 14:02:45 -05:00

17 Commits