llvm-project

Author	SHA1	Message	Date
Charitha Saumya	bd6da1feaa	[mlir][xegpu] Add more tests in XeGPU subgroup distribution. (#162543 ) This PR adds tests covering some useful corner cases: 1. more tests for `vector.shape_cast` distribution. 2. tests for the `MoveFuncBodyToWarpOp` pattern, which were not possible before.	2025-10-10 09:27:36 -07:00
Charitha Saumya	b86fef88c5	[mlir][xegpu] Create a test pass for subgroup distribution. (#161592 ) The current subgroup distribution tests employ the entire `xegpu-subgroup-distribute` pass, which includes multiple steps such as layout propagation, moving the func body into a warp op, and distribution to work items. This makes it harder to isolate testing of the XeGPU subgroup distribution logic, because certain corner cases may not yet be supported by the other steps mentioned above. This PR introduces a test pass for the subgroup distribution logic and isolates the testing of that logic. We plan to add more corner cases (that were not possible before) covering non-xegpu ops (like vector) in upcoming PRs. This PR also includes: 1. minor bug fixes in gather/scatter distribution. 2. a bug fix in vector multi reduction lowering, where it fails to retain some layouts.	2025-10-03 12:35:13 -07:00
Mehdi Amini	da160574e0	[MLIR] Remove unused debug macros (NFC)	2025-10-01 05:09:58 -07:00
Dmitry Chigarev	04258fe3b1	[mlir][XeGPU][XeGPUUnroll] Support new syntax with offsets moved to load_nd/store_nd/prefetch_nd (#160323 ) Adds support for new syntax in XeGPUUnroll for: 1. `create_nd_desc` without offsets 2. `load_nd` with offsets 3. `store_nd` with offsets 4. `prefetch_nd` with offsets `create_nd_desc with offsets` + `load_nd with offsets` won't be lowered correctly. In this case the IR would still have two unrealized conversions that will fail later in the pipeline. The offsets computation for the unrolled tile is now moved from descriptors to load/store/prefetch operations. The resulting IR now has one single descriptor that is being iterated in load/store/prefetch ops. <details><summary>old/new behavior examples</summary>
```mlir
// before unroll pass:
gpu.func @load_nd(%src: memref<256x318xf32>) -> vector<24x32xf32> {
  %tdesc = xegpu.create_nd_tdesc %src : memref<256x318xf32> -> !xegpu.tensor_desc<24x32xf32, #xegpu.layout<inst_data = [8, 16]>>
  %ld = xegpu.load_nd %tdesc[8, 16]: !xegpu.tensor_desc<24x32xf32, #xegpu.layout<inst_data = [8, 16]>> -> vector<24x32xf32>
  gpu.return %ld : vector<24x32xf32>
}

// after unroll pass (offsets in create_nd_desc):
gpu.func @create_nd_tdesc2(%arg0: memref<256x318xf32>) -> vector<24x32xf32> {
  %cst = arith.constant dense<0.000000e+00> : vector<24x32xf32>
  %c24 = arith.constant 24 : index
  %c32 = arith.constant 32 : index
  %c8 = arith.constant 8 : index
  %c16 = arith.constant 16 : index
  // create 6 descriptors for each tile
  %0 = xegpu.create_nd_tdesc %arg0[%c8, %c16] : memref<256x318xf32> -> !xegpu.tensor_desc<8x16xf32>
  %1 = xegpu.create_nd_tdesc %arg0[%c8, %c32] : memref<256x318xf32> -> !xegpu.tensor_desc<8x16xf32>
  %2 = xegpu.create_nd_tdesc %arg0[%c16, %c16] : memref<256x318xf32> -> !xegpu.tensor_desc<8x16xf32>
  %3 = xegpu.create_nd_tdesc %arg0[%c16, %c32] : memref<256x318xf32> -> !xegpu.tensor_desc<8x16xf32>
  %4 = xegpu.create_nd_tdesc %arg0[%c24, %c16] : memref<256x318xf32> -> !xegpu.tensor_desc<8x16xf32>
  %5 = xegpu.create_nd_tdesc %arg0[%c24, %c32] : memref<256x318xf32> -> !xegpu.tensor_desc<8x16xf32>
  %6 = xegpu.load_nd %0 : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32>
  %7 = xegpu.load_nd %1 : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32>
  %8 = xegpu.load_nd %2 : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32>
  %9 = xegpu.load_nd %3 : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32>
  %10 = xegpu.load_nd %4 : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32>
  %11 = xegpu.load_nd %5 : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32>
  ...
}

// after unroll pass (offsets in load_nd):
gpu.func @load_nd(%arg0: memref<256x318xf32>) -> vector<24x32xf32> {
  %cst = arith.constant dense<0.000000e+00> : vector<24x32xf32>
  %c24 = arith.constant 24 : index
  %c32 = arith.constant 32 : index
  %c16 = arith.constant 16 : index
  %c8 = arith.constant 8 : index
  // create only one descriptor with proper tile shape
  %0 = xegpu.create_nd_tdesc %arg0 : memref<256x318xf32> -> !xegpu.tensor_desc<8x16xf32>
  // compute tile offsets at the operation (using only one descriptor)
  %1 = xegpu.load_nd %0[%c8, %c16] : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32>
  %2 = xegpu.load_nd %0[%c8, %c32] : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32>
  %3 = xegpu.load_nd %0[%c16, %c16] : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32>
  %4 = xegpu.load_nd %0[%c16, %c32] : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32>
  %5 = xegpu.load_nd %0[%c24, %c16] : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32>
  %6 = xegpu.load_nd %0[%c24, %c32] : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32>
  ...
}
```
</details> --------- Signed-off-by: dchigarev <dmitry.chigarev@intel.com>	2025-09-25 11:31:17 -07:00
Nishant Patel	d235d62d65	[MLIR][XeGPU] Add unroll pattern for load_gather and store_scatter with offsets (#159453 ) This PR adds unrolling/blocking patterns for load_gather and store_scatter ops with offsets.	2025-09-24 13:28:43 -07:00
Charitha Saumya	9b0d7ddb04	[mlir][xegpu] Add support for `vector.multi_reduction` and `vector.shape_cast` SIMT distribution. (#157560 ) Add support for distributing the `vector.multi_reduction` operation across lanes in a warp. Currently only 2D to 1D reductions are supported. Given layouts for the source and accumulator vectors, * If the reduction dimension is distributed across lanes, the reduction is non-lane-local and the reduction is done using warp shuffles. Here we simply rewrite the `MultiDimReductionOp` to a sequence of `ReductionOp`s inside the warp op body. Actual distribution will be done by `WarpOpReduction` pattern. * If the reduction dimension is not distributed across lanes, the reduction is lane-local. In this case, we yield the source and accumulator vectors from the warp op and perform the lane-local reduction outside the warp op using a sequence of `ReductionOp`s. PR also adds support for distributing `vector.shape_cast` based on layouts.	2025-09-12 09:37:04 -07:00
Chao Chen	68d6866428	[mlir][XeGPU] add WgToSg distribution pattern for load_matrix and store_matrix. (#154403 )	2025-08-21 10:02:45 -05:00
Jacques Pienaar	4bf33958da	[mlir] Update builders to use new form. (#154132 ) Mechanically applied using clang-tidy.	2025-08-18 15:19:34 +00:00
Chao Chen	c96223434c	[mlir][xegpu] Add definition of SliceAttr (#150146 ) --------- Co-authored-by: Charitha Saumya <136391709+charithaintc@users.noreply.github.com>	2025-08-08 11:27:17 -05:00
Jacques Pienaar	07967d4af8	[mlir] Switch to new LDBG macro (#150616 ) Change local variants to use new central one.	2025-07-25 18:22:46 +02:00
Chao Chen	75524dee18	[mlir][xegpu] Relax rank restriction of TensorDescType (#145916 )	2025-07-09 19:40:24 -05:00
Jianhui Li	118bfcda46	[MLIR][XEGPU] Add blocking support for scatter ops (#144766 ) Add blocking support for scatter ops: Create_tdesc, update, prefetch, load and store. It also enables the load/store with chunk size.	2025-06-18 14:52:03 -07:00
Jianhui Li	f25f2f7de4	[MLIR][XeGPU] Extend unrolling support for scatter ops with chunk_size (#144447 ) Add support for load/store with chunk_size, which requires special consideration for operand blocking since offsets and masks are n-D while tensors are (n+1)-D. Supported operations include create_tdesc, update_tdesc, load, store, and prefetch. --------- Co-authored-by: Adam Siemieniuk <adam.siemieniuk@intel.com>	2025-06-17 17:46:35 -05:00
Jianhui Li	58d23476f0	[MLIR][XeGPU] Add unroll patterns for scatter ops (#143602 ) Add unrolling support for create_tdesc, load, store, prefetch, and update_offset. --------- Co-authored-by: Adam Siemieniuk <adam.siemieniuk@intel.com> Co-authored-by: Chao Chen <chao.chen@intel.com>	2025-06-16 10:48:41 -05:00
Kazu Hirata	1eb843b1a0	[mlir] Ensure newline at the end of files (NFC) (#143155 )	2025-06-06 09:16:52 -07:00
Chao Chen	db42345dc6	[MLIR][XeGPU] Add unroll patterns for XeGPU (1/N) (#137010 ) Similar to vector ops, XeGPU ops need to be unrolled into smaller shapes such that they can be dispatched into a hardware instruction. This PR marks the initial phase of a series dedicated to incorporating unroll patterns for XeGPU operations. In this installment, we introduce patterns for the following operations: 1. createNd 2. updateNd 3. prefetchNd 4. loadNd 5. storeNd 6. dpas	2025-05-12 09:16:21 -05:00

16 Commits