llvm-project

Author	SHA1	Message	Date
Dmitry Chigarev	cd5d5b31bf	[mlir][XeGPU] Use DistributeLayoutAttr instead of LayoutAttr for load gather/scatter ops (#167850 ) The PR changes the layout attribute type for `xegpu::LoadGatherOp/StoreScatterOp` from `LayoutAttr` to `DistributeLayoutAttr` to also support `xegpu.slice` layouts. Initially we [wanted to restrict slice layouts](https://github.com/llvm/llvm-project/pull/163414#discussion_r2478978798) from the attribute, but now it turns out there are actually valid use cases for that: ```mlir gpu.func @distribute_load_slice_attr() { %2 = memref.alloca() {alignment = 1024} : memref<4096xf32> %offset = arith.constant {layout_result_0 = #xegpu.layout<sg_layout = [8], sg_data = [32], inst_data = [16]> } dense<0> : vector<256xindex> %mask = arith.constant {layout_result_0 = #xegpu.layout<sg_layout = [8], sg_data = [32], inst_data = [16]> } dense<1> : vector<256xi1> %3 = xegpu.load %2[%offset], %mask <{chunk_size = 1, layout = #xegpu.slice<#xegpu.layout<sg_layout = [8, 8], sg_data = [32, 32], inst_data = [8, 16]>, dims = [0]>>} { layout_result_0 = #xegpu.slice<#xegpu.layout<sg_layout = [8, 8], sg_data = [32, 32], inst_data = [8, 16]>, dims = [0]> } : memref<4096xf32>, vector<256xindex>, vector<256xi1> -> vector<256xf32> %4 = vector.broadcast %3 {layout_result_0 = #xegpu.layout<sg_layout = [8, 8], sg_data = [32, 32], inst_data = [8, 16]>} : vector<256xf32> to vector<256x256xf32> gpu.return } ``` Signed-off-by: dchigarev <dmitry.chigarev@intel.com>	2025-11-17 11:00:03 -08:00
Dmitry Chigarev	6c563dc6a2	[mlir][XeGPU] Add optional layout attribute to LoadGather StoreScatter ops (#163414 ) As [suggested here](https://github.com/llvm/llvm-project/pull/163071#discussion_r2427229637) the PR adds an optional layout attribute for `LoadGather` and `StoreScatter` ops. For the load op the attribute describes the layout of the result (e.g. `layout_result_0`), and for the store op it describes the layout of the vector-to-store operand (e.g. `layout_operand_0`). The PR also reworks the `propagate-layout` pass to consider permanent layout attributes and back-propagate them accordingly. The helper utility function `getDistributeLayoutAttr` is reworked to return either `layout_operand/result_0` or `layout` for load/store ops (depending on which one is set). After an offline discussion, we decided that the overall layout utilities API is confusing, since it tries to mix permanent and temporary layouts; it will need to be changed in the future. --------- Signed-off-by: dchigarev <dmitry.chigarev@intel.com>	2025-11-04 08:19:47 -08:00
Nishant Patel	621ed04e28	[MLIR][XeGPU]Enhance Pack/Unpack for XeGPUUnroll (#163459 ) This PR changes the pack/unpack method used for unrolling to allow lower-rank slices to be extracted from and inserted into the src vector by adding reshapes. It also removes leading unit dims from inst_data, if there are any.	2025-10-24 11:04:05 -07:00
Jianhui Li	77cb19d7aa	[MLIR][XeGPU] XeVM lowering support for load_matrix/store_matrix + fix sanitizer issue (#163858 ) This PR fixes the sanitizer issue reported post-merge for https://github.com/llvm/llvm-project/pull/162780	2025-10-16 14:09:48 -07:00
Vitaly Buka	d43581aaee	Revert "[MLIR][XeGPU] XeVM lowering support for load_matrix/store_matrix" (#163684 ) Reverts llvm/llvm-project#162780 Breaks build bots, see #162780.	2025-10-16 03:11:42 +00:00
Jianhui Li	6cae29fb3a	[MLIR][XeGPU] XeVM lowering support for load_matrix/store_matrix (#162780 ) This PR adds lowering of xegpu.load_matrix/store_matrix to either xevm.blockload/blockstore or llvm.load/store, depending on wi-level attributes. It includes a few components: 1. adds a wi-level attribute: subgroup_block_io. 2. extends the load_matrix/store_matrix op definitions to support scalar data (in addition to vector data). 3. adds a member function to mem_desc that computes the linearized address for n-D offsets. 4. adds lowering depending on wi-level attributes: a) if the subgroup_block_io attribute is present, lower to xevm.blockload/blockstore; b) otherwise, lower to llvm.load/store. If the result is a vector, lower to llvm.load/store with a vector operand.	2025-10-15 16:50:41 -07:00
Dmitry Chigarev	04258fe3b1	[mlir][XeGPU][XeGPUUnroll] Support new syntax with offsets moved to load_nd/store_nd/prefetch_nd (#160323 ) Adds support for new syntax in XeGPUUnroll for: 1. `create_nd_desc` without offsets 2. `load_nd` with offsets 3. `store_nd` with offsets 4. `prefetch_nd` with offsets. The combination `create_nd_desc with offsets` + `load_nd with offsets` won't be lowered correctly: in this case the IR would still have two unrealized conversions that will fail later in the pipeline. The offsets computation for the unrolled tile is now moved from descriptors to load/store/prefetch operations. The resulting IR now has a single descriptor that is iterated over in load/store/prefetch ops. Old/new behavior examples: ```mlir // before unroll pass: gpu.func @load_nd(%src: memref<256x318xf32>) -> vector<24x32xf32> { %tdesc = xegpu.create_nd_tdesc %src : memref<256x318xf32> -> !xegpu.tensor_desc<24x32xf32, #xegpu.layout<inst_data = [8, 16]>> %ld = xegpu.load_nd %tdesc[8, 16]: !xegpu.tensor_desc<24x32xf32, #xegpu.layout<inst_data = [8, 16]>> -> vector<24x32xf32> gpu.return %ld : vector<24x32xf32> } // after unroll pass (offsets in create_nd_desc): gpu.func @create_nd_tdesc2(%arg0: memref<256x318xf32>) -> vector<24x32xf32> { %cst = arith.constant dense<0.000000e+00> : vector<24x32xf32> %c24 = arith.constant 24 : index %c32 = arith.constant 32 : index %c8 = arith.constant 8 : index %c16 = arith.constant 16 : index // create 6 descriptors for each tile %0 = xegpu.create_nd_tdesc %arg0[%c8, %c16] : memref<256x318xf32> -> !xegpu.tensor_desc<8x16xf32> %1 = xegpu.create_nd_tdesc %arg0[%c8, %c32] : memref<256x318xf32> -> !xegpu.tensor_desc<8x16xf32> %2 = xegpu.create_nd_tdesc %arg0[%c16, %c16] : memref<256x318xf32> -> !xegpu.tensor_desc<8x16xf32> %3 = xegpu.create_nd_tdesc %arg0[%c16, %c32] : memref<256x318xf32> -> !xegpu.tensor_desc<8x16xf32> %4 = xegpu.create_nd_tdesc %arg0[%c24, %c16] : memref<256x318xf32> -> !xegpu.tensor_desc<8x16xf32> %5 = xegpu.create_nd_tdesc %arg0[%c24, %c32] : memref<256x318xf32> -> !xegpu.tensor_desc<8x16xf32> %6 = xegpu.load_nd %0 : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32> %7 = xegpu.load_nd %1 : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32> %8 = xegpu.load_nd %2 : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32> %9 = xegpu.load_nd %3 : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32> %10 = xegpu.load_nd %4 : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32> %11 = xegpu.load_nd %5 : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32> ... } // after unroll pass (offsets in load_nd): gpu.func @load_nd(%arg0: memref<256x318xf32>) -> vector<24x32xf32> { %cst = arith.constant dense<0.000000e+00> : vector<24x32xf32> %c24 = arith.constant 24 : index %c32 = arith.constant 32 : index %c16 = arith.constant 16 : index %c8 = arith.constant 8 : index // create only one descriptor with proper tile shape %0 = xegpu.create_nd_tdesc %arg0 : memref<256x318xf32> -> !xegpu.tensor_desc<8x16xf32> // compute tile offsets at the operation (using only one descriptor) %1 = xegpu.load_nd %0[%c8, %c16] : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32> %2 = xegpu.load_nd %0[%c8, %c32] : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32> %3 = xegpu.load_nd %0[%c16, %c16] : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32> %4 = xegpu.load_nd %0[%c16, %c32] : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32> %5 = xegpu.load_nd %0[%c24, %c16] : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32> %6 = xegpu.load_nd %0[%c24, %c32] : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32> ... } ``` --------- Signed-off-by: dchigarev <dmitry.chigarev@intel.com>	2025-09-25 11:31:17 -07:00
Nishant Patel	d235d62d65	[MLIR][XeGPU] Add unroll pattern for load_gather and store_scatter with offsets (#159453 ) This PR adds unrolling/blocking patterns for load_gather and store_scatter ops with offsets.	2025-09-24 13:28:43 -07:00
Jakub Kuderski	2ed3f49c49	[mlir] Use free op create functions. NFC. (#157374 ) The builder create methods are deprecated: https://mlir.llvm.org/deprecation/. See https://discourse.llvm.org/t/psa-opty-create-now-with-100-more-tab-complete/87339.	2025-09-07 22:13:20 -04:00
Chao Chen	6026ca301d	[mlir][XeGPU] add unroll patterns for load_matrix and store_matrix (#154637 )	2025-09-03 13:56:41 -05:00
Jianhui Li	e6f360b0ab	[MLIR][XeGPU] Allow load/store/prefetch uses [memref+offset] instead of tdesc (#150576 ) Add a variant of load/store/prefetch that accepts offsets. The new xegpu.load variant accepts memref+offset, and the existing tdesc operand will be removed in a future PR. The semantics are a combination of "creating scattered_tdesc + xegpu.load with scattered_tdesc". The current xegpu.load accepts a tdesc operand, which encapsulates "memref+offset". This PR folds "memref+offset" directly into xegpu.load, replacing "tdesc". Create_tdesc will be removed: once the offsets are taken away, scatter_tdesc only contains the base address, so there is no point in keeping it. ```mlir // wi level code example %2 = xegpu.load %src[%offsets], %mask <{chunk_size = 2}> : ui64, vector<1xindex>, vector<1xi1> -> vector<2xf32> xegpu.store %val, %src[%offsets], %mask: vector<1xf16>, memref<?xf16>, vector<1xindex>, vector<1xi1> xegpu.prefetch %src[%0] : ui64, vector<1xindex> ```	2025-07-30 16:00:40 -07:00
Jacques Pienaar	07967d4af8	[mlir] Switch to new LDBG macro (#150616 ) Change local variants to use new central one.	2025-07-25 18:22:46 +02:00
Jianhui Li	90944b85c5	[MLIR][XeGPU] Add offset operands to load_nd/store_nd/prefetch_nd (#149424 ) This PR allows load_nd/store_nd/prefetch_nd to take an additional offset operand. It is based on PR https://github.com/llvm/llvm-project/pull/148335. Now users can create an nd_tdesc with no offset and instead set the offset with the load_nd operation.	2025-07-23 09:00:51 -07:00
James Newling	6ed921f967	Reland "[mlir][vector] Use vector.broadcast in place of vector.splat" (#150138 ) This reverts commit 228c45f13dc92546661b6825b7b32c3808b0d2eb (PR #148937) . Now that #148027 is landed, I think it is safe to "reland" the original PR: #148028	2025-07-23 06:00:59 -07:00
Maksim Levental	7b78796543	[mlir][NFC] update `mlir/Dialect` create APIs (25/n) (#149932 ) See https://github.com/llvm/llvm-project/pull/147168 for more info.	2025-07-21 19:57:59 -04:00
James Newling	228c45f13d	Revert [mlir][vector] Use vector.broadcast in place of vector.splat (#148937 ) This reverts PR/commit `99875733fc` This PR/commit should only be landed after https://github.com/llvm/llvm-project/pull/148027, at which point we don't need to assume that vector.broadcast has been lowered to another form.	2025-07-15 20:45:01 -07:00
Kazu Hirata	c06d3a7b72	[mlir] Remove unused includes (NFC) (#148769 ) These are identified by misc-include-cleaner. I've filtered out those that break builds. Also, I'm staying away from llvm-config.h, config.h, and Compiler.h, which likely cause platform- or compiler-specific build failures.	2025-07-14 22:19:23 -07:00
James Newling	99875733fc	[mlir][vector] Use vector.broadcast in place of vector.splat (#148028 ) Part of deprecation of vector.splat RFC: https://discourse.llvm.org/t/rfc-mlir-vector-deprecate-then-remove-vector-splat/87143/4 More complete deprecation: https://github.com/llvm/llvm-project/pull/147818	2025-07-14 15:12:21 -07:00
Chao Chen	75524dee18	[mlir][xegpu] Relax rank restriction of TensorDescType (#145916 )	2025-07-09 19:40:24 -05:00
Chao Chen	36fbc6a8d2	[MLIR][XeGPU] Remove the transpose attribute from Gather/Scatter ops and Cleanup the documents (#145389 )	2025-06-25 19:43:53 -05:00
Jianhui Li	f25f2f7de4	[MLIR][XeGPU] Extend unrolling support for scatter ops with chunk_size (#144447 ) Add support for load/store with chunk_size, which requires special consideration for operand blocking, since offsets and masks are n-D while the tensor is (n+1)-D. Supported operations include create_tdesc, update_tdesc, load, store, and prefetch. --------- Co-authored-by: Adam Siemieniuk <adam.siemieniuk@intel.com>	2025-06-17 17:46:35 -05:00
Jianhui Li	58d23476f0	[MLIR][XeGPU] Add unroll patterns for scatter ops (#143602 ) Add unrolling support for create_tdesc, load, store, prefetch, and update_offset. --------- Co-authored-by: Adam Siemieniuk <adam.siemieniuk@intel.com> Co-authored-by: Chao Chen <chao.chen@intel.com>	2025-06-16 10:48:41 -05:00
Chao Chen	9e2684e4cf	[MLIR][XeGPU] Add unroll patterns and blocking pass for XeGPU [2/N] (#142477 ) Bring back https://github.com/llvm/llvm-project/pull/140163 with fixes	2025-06-02 21:39:30 -05:00
Chao Chen	b88dfb0b23	Revert "[MLIR][XeGPU] Add unroll patterns and blocking pass for XeGPU [2/N]" (#142459 ) Reverts llvm/llvm-project#140163	2025-06-02 15:47:21 -04:00
Chao Chen	0210750d5a	[MLIR][XeGPU] Add unroll patterns and blocking pass for XeGPU [2/N] (#140163 ) This PR introduces the initial implementation of a blocking pass for XeGPU programs. The pass leverages unroll patterns from both the XeGPU and Vector dialects. --------- Co-authored-by: Adam Siemieniuk <adam.siemieniuk@intel.com>	2025-06-02 14:02:45 -05:00
Chao Chen	db42345dc6	[MLIR][XeGPU] Add unroll patterns for XeGPU (1/N) (#137010 ) Similar to vector ops, XeGPU ops need to be unrolled into smaller shapes such that they can be dispatched into a hardware instruction. This PR marks the initial phase of a series dedicated to incorporating unroll patterns for XeGPU operations. In this installment, we introduce patterns for the following operations: 1. createNd 2. updateNd 3. prefetchNd 4. loadNd 5. storeNd 6. dpas	2025-05-12 09:16:21 -05:00

26 Commits
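The per-tile offset computation that #160323 moves from `create_nd_tdesc` onto `load_nd`/`store_nd`/`prefetch_nd` can be illustrated with a small Python sketch. This is not the XeGPUUnroll implementation (which is C++ inside MLIR); the function name `unroll_tile_offsets` is hypothetical, and the sketch assumes every dimension of the original shape is evenly divisible by the `inst_data` tile and that tiles are visited in row-major order, which matches the order of the `load_nd` ops in the commit's example.

```python
from itertools import product


def unroll_tile_offsets(shape, tile, base=None):
    """Hypothetical helper: list the offsets of each `tile`-shaped piece of a
    `shape`-shaped access, shifted by the `base` offsets of the original op.

    Assumes len(shape) == len(tile) and that each dimension divides evenly.
    Tiles are enumerated in row-major order (last dimension varies fastest).
    """
    if len(shape) != len(tile):
        raise ValueError("shape and tile must have the same rank")
    if any(s % t for s, t in zip(shape, tile)):
        raise ValueError("each dimension must be divisible by the tile size")
    if base is None:
        base = [0] * len(shape)
    # Number of tiles along each dimension.
    counts = [s // t for s, t in zip(shape, tile)]
    offsets = []
    # itertools.product yields tile indices in row-major order.
    for idx in product(*(range(c) for c in counts)):
        offsets.append(tuple(b + i * t for b, i, t in zip(base, idx, tile)))
    return offsets


# The example from #160323: a 24x32 load at [8, 16] with inst_data = [8, 16].
print(unroll_tile_offsets([24, 32], [8, 16], base=[8, 16]))
```

For the commit's example this yields the six offsets `(8, 16), (8, 32), (16, 16), (16, 32), (24, 16), (24, 32)`, i.e. the operands of `%1` to `%6` in the "offsets in load_nd" IR, all applied to one 8x16 descriptor.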