26 Commits

Author SHA1 Message Date
Dmitry Chigarev
cd5d5b31bf
[mlir][XeGPU] Use DistributeLayoutAttr instead of LayoutAttr for load gather/scatter ops (#167850)
The PR changes the layout attribute type for
`xegpu::LoadGatherOp/StoreScatterOp` from `LayoutAttr` to
`DistributeLayoutAttr` to also support `xegpu.slice` layouts.

Initially we [wanted to restrict slice
layouts](https://github.com/llvm/llvm-project/pull/163414#discussion_r2478978798)
from the attribute, but now it turns out there are actually valid use
cases for that:
```mlir
gpu.func @distribute_load_slice_attr() {
  %2 = memref.alloca() {alignment = 1024} : memref<4096xf32>
  %offset =  arith.constant {layout_result_0 = #xegpu.layout<sg_layout = [8], sg_data = [32], inst_data = [16]> } dense<0> : vector<256xindex>
  %mask = arith.constant {layout_result_0 = #xegpu.layout<sg_layout = [8], sg_data = [32], inst_data = [16]> } dense<1> : vector<256xi1>

  %3 = xegpu.load %2[%offset], %mask <{chunk_size = 1, layout = #xegpu.slice<#xegpu.layout<sg_layout = [8, 8], sg_data = [32, 32], inst_data = [8, 16]>, dims = [0]>>} {
      layout_result_0 = #xegpu.slice<#xegpu.layout<sg_layout = [8, 8], sg_data = [32, 32], inst_data = [8, 16]>, dims = [0]> 
  } : memref<4096xf32>, vector<256xindex>, vector<256xi1> -> vector<256xf32>

  %4 = vector.broadcast %3 {layout_result_0 =
      #xegpu.layout<sg_layout = [8, 8], sg_data = [32, 32], inst_data = [8, 16]>} : vector<256xf32> to vector<256x256xf32>
  gpu.return
}
```
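
The relationship between the sliced layout on the load and the 1-D layouts on `%offset` and `%mask` in the example above can be sketched in plain Python. This is a simplified model, not the XeGPU implementation: it assumes that `#xegpu.slice<..., dims = [...]>` simply drops the listed dimensions from every field of the parent layout, and the helper name `slice_layout` is hypothetical.

```python
def slice_layout(parent, dims):
    """Drop the sliced dimensions from every field of a parent layout.

    `parent` maps field names (sg_layout, sg_data, inst_data) to
    per-dimension lists; `dims` lists the dimensions removed by the slice.
    Assumption: slicing only removes entries and leaves the rest untouched.
    """
    dropped = set(dims)
    return {
        name: [v for i, v in enumerate(values) if i not in dropped]
        for name, values in parent.items()
    }


# Parent layout and slice dims taken from the MLIR example above.
parent = {"sg_layout": [8, 8], "sg_data": [32, 32], "inst_data": [8, 16]}
sliced = slice_layout(parent, dims=[0])
print(sliced)
```

Under that assumption, the sliced layout has the same per-field values as the 1-D layout attached to `%offset` and `%mask`.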

Signed-off-by: dchigarev <dmitry.chigarev@intel.com>
2025-11-17 11:00:03 -08:00
Dmitry Chigarev
6c563dc6a2
[mlir][XeGPU] Add optional layout attribute to LoadGather StoreScatter ops (#163414)
As [suggested
here](https://github.com/llvm/llvm-project/pull/163071#discussion_r2427229637)
the PR adds an optional layout attribute for `LoadGather` and
`StoreScatter` ops.

For the load op, the attribute describes the layout of the result (e.g.
`layout_result_0`), and for the store op it describes the layout of the
vector-to-store operand (e.g. `layout_operand_0`).

The PR also reworks the `propagate-layout` pass to consider permanent layout
attributes and back-propagate them accordingly.

The helper utility function `getDistributeLayoutAttr` is reworked to
return either `layout_operand/result_0` or `layout` for load/store ops
(depending on which one is set). After an offline discussion, we decided
that the overall layout utilities API is confusing because it tries to
mix permanent and temporary layouts. It will need to change in the
future.

---------

Signed-off-by: dchigarev <dmitry.chigarev@intel.com>
2025-11-04 08:19:47 -08:00
Nishant Patel
621ed04e28
[MLIR][XeGPU]Enhance Pack/Unpack for XeGPUUnroll (#163459)
This PR changes the pack/unpack method used for unrolling so that a
lower-rank slice can be extracted from, and inserted into, the source
vector by adding reshapes. It also removes leading unit dims from
inst_data if there are any.
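
As a rough illustration of the last point, removing leading unit dims from an `inst_data` shape could look like the sketch below. The function name is hypothetical, and keeping at least one dimension when every entry is 1 is an assumption, not something the commit message states.

```python
def drop_leading_unit_dims(inst_data):
    """Strip leading 1s from an inst_data shape.

    Assumption: at least one dimension is always kept, so an all-ones
    shape collapses to [1] rather than to an empty list.
    """
    first = 0
    while first < len(inst_data) - 1 and inst_data[first] == 1:
        first += 1
    return inst_data[first:]


print(drop_leading_unit_dims([1, 1, 8, 16]))  # [8, 16]
print(drop_leading_unit_dims([1, 8, 1, 16]))  # [8, 1, 16]
```

Only leading unit dims are removed; a unit dim that follows a non-unit dim is left in place.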
2025-10-24 11:04:05 -07:00
Jianhui Li
77cb19d7aa
[MLIR][XeGPU] XeVM lowering support for load_matrix/store_matrix + fix sanitizer issue (#163858)
This PR fixes the sanitizer issue reported post-merge for
https://github.com/llvm/llvm-project/pull/162780
2025-10-16 14:09:48 -07:00
Vitaly Buka
d43581aaee
Revert "[MLIR][XeGPU] XeVM lowering support for load_matrix/store_matrix" (#163684)
Reverts llvm/llvm-project#162780

Breaks build bots, see #162780.
2025-10-16 03:11:42 +00:00
Jianhui Li
6cae29fb3a
[MLIR][XeGPU] XeVM lowering support for load_matrix/store_matrix (#162780)
This PR adds lowering of xegpu.load_matrix/store_matrix to
xevm.blockload/blockstore or llvm.load/store, depending on wi-level
attributes.
It includes a few components:
1. Adds wi-level attributes: subgroup_block_io.
2. Expands the load_matrix/store_matrix op definitions to support scalar
   data (besides vector data).
3. Adds a member function to mem_desc to compute the linearized address
   for n-D offsets.
4. Adds lowering depending on wi-level attributes:
   a) if the subgroup_block_io attribute is present, lower to
      xevm.blockload/blockstore;
   b) else lower to llvm.load/store. If the result is a vector, lower to
      llvm.load/store with a vector operand.
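
The linearized address mentioned in the component list above can be sketched as a dot product of offsets and strides. This is a minimal model that assumes a plain strided, row-major buffer; the function names are hypothetical, and any additional layout information the real `mem_desc` may carry is ignored here.

```python
def row_major_strides(shape):
    """Strides (in elements) of a dense row-major buffer of the given shape."""
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return strides


def linearize(offsets, strides):
    """Linear element offset for n-D `offsets` given per-dimension `strides`."""
    if len(offsets) != len(strides):
        raise ValueError("offsets and strides must have the same rank")
    return sum(o * s for o, s in zip(offsets, strides))


strides = row_major_strides([32, 64])  # [64, 1]
print(linearize([3, 5], strides))      # 3 * 64 + 5 = 197
```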
2025-10-15 16:50:41 -07:00
Dmitry Chigarev
04258fe3b1
[mlir][XeGPU][XeGPUUnroll] Support new syntax with offsets moved to load_nd/store_nd/prefetch_nd (#160323)
Adds support for new syntax in XeGPUUnroll for:
1. `create_nd_desc` without offsets
2. `load_nd` with offsets
3. `store_nd` with offsets
4. `prefetch_nd` with offsets

`create_nd_desc with offsets` + `load_nd with offsets` won't be lowered
correctly. In this case the IR would still have two unrealized
conversions that will fail later in the pipeline.

The offset computation for each unrolled tile is now moved from the
descriptors to the load/store/prefetch operations. The resulting IR now
has a single descriptor that is reused by all load/store/prefetch ops.
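
The per-tile offsets that end up on the unrolled operations can be reproduced with a short Python sketch. It is only a model of the arithmetic, not the pass itself: the function name `tile_offsets` is hypothetical, and it assumes the original shape is an exact multiple of the `inst_data` tile. The numbers are taken from the `load_nd` example in this commit message (a 24x32 load at offset `[8, 16]` with `inst_data = [8, 16]`).

```python
from itertools import product


def tile_offsets(shape, tile, base):
    """Offsets of every `tile`-shaped block covering `shape`, shifted by `base`.

    Assumes each dimension of `shape` is an exact multiple of `tile`.
    """
    if any(s % t for s, t in zip(shape, tile)):
        raise ValueError("shape must be divisible by tile")
    steps = [range(0, s, t) for s, t in zip(shape, tile)]
    return [tuple(b + o for b, o in zip(base, offs)) for offs in product(*steps)]


offsets = tile_offsets(shape=(24, 32), tile=(8, 16), base=(8, 16))
print(offsets)
```

The six resulting pairs correspond to the six `load_nd` operations that share one descriptor in the "offsets in load_nd" variant of that example.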

<details><summary>old/new behavior examples</summary>

```mlir
// before unroll pass:
gpu.func @load_nd(%src: memref<256x318xf32>) -> vector<24x32xf32> {
  %tdesc = xegpu.create_nd_tdesc %src : memref<256x318xf32> -> !xegpu.tensor_desc<24x32xf32, #xegpu.layout<inst_data = [8, 16]>>
  %ld = xegpu.load_nd %tdesc[8, 16]: !xegpu.tensor_desc<24x32xf32, #xegpu.layout<inst_data = [8, 16]>> -> vector<24x32xf32>
  gpu.return %ld : vector<24x32xf32>
}

// after unroll pass (offsets in create_nd_desc):
gpu.func @create_nd_tdesc2(%arg0: memref<256x318xf32>) -> vector<24x32xf32> {
  %cst = arith.constant dense<0.000000e+00> : vector<24x32xf32>
  %c24 = arith.constant 24 : index
  %c32 = arith.constant 32 : index
  %c8 = arith.constant 8 : index
  %c16 = arith.constant 16 : index
  // create 6 descriptors for each tile
  %0 = xegpu.create_nd_tdesc %arg0[%c8, %c16] : memref<256x318xf32> -> !xegpu.tensor_desc<8x16xf32>
  %1 = xegpu.create_nd_tdesc %arg0[%c8, %c32] : memref<256x318xf32> -> !xegpu.tensor_desc<8x16xf32>
  %2 = xegpu.create_nd_tdesc %arg0[%c16, %c16] : memref<256x318xf32> -> !xegpu.tensor_desc<8x16xf32>
  %3 = xegpu.create_nd_tdesc %arg0[%c16, %c32] : memref<256x318xf32> -> !xegpu.tensor_desc<8x16xf32>
  %4 = xegpu.create_nd_tdesc %arg0[%c24, %c16] : memref<256x318xf32> -> !xegpu.tensor_desc<8x16xf32>
  %5 = xegpu.create_nd_tdesc %arg0[%c24, %c32] : memref<256x318xf32> -> !xegpu.tensor_desc<8x16xf32>
  %6 = xegpu.load_nd %0  : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32>
  %7 = xegpu.load_nd %1  : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32>
  %8 = xegpu.load_nd %2  : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32>
  %9 = xegpu.load_nd %3  : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32>
  %10 = xegpu.load_nd %4  : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32>
  %11 = xegpu.load_nd %5  : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32>
  ...
}

// after unroll pass (offsets in load_nd):
gpu.func @load_nd(%arg0: memref<256x318xf32>) -> vector<24x32xf32> {
  %cst = arith.constant dense<0.000000e+00> : vector<24x32xf32>
  %c24 = arith.constant 24 : index
  %c32 = arith.constant 32 : index
  %c16 = arith.constant 16 : index
  %c8 = arith.constant 8 : index
  // create only one descriptor with proper tile shape
  %0 = xegpu.create_nd_tdesc %arg0 : memref<256x318xf32> -> !xegpu.tensor_desc<8x16xf32>
  // compute tile offsets at the operation (using only one descriptor)
  %1 = xegpu.load_nd %0[%c8, %c16]  : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32>
  %2 = xegpu.load_nd %0[%c8, %c32]  : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32>
  %3 = xegpu.load_nd %0[%c16, %c16]  : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32>
  %4 = xegpu.load_nd %0[%c16, %c32]  : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32>
  %5 = xegpu.load_nd %0[%c24, %c16]  : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32>
  %6 = xegpu.load_nd %0[%c24, %c32]  : !xegpu.tensor_desc<8x16xf32> -> vector<8x16xf32>
  ...
}
```

</details>

---------

Signed-off-by: dchigarev <dmitry.chigarev@intel.com>
2025-09-25 11:31:17 -07:00
Nishant Patel
d235d62d65
[MLIR][XeGPU] Add unroll pattern for load_gather and store_scatter with offsets (#159453)
This PR adds unrolling/blocking patterns for load_gather and
store_scatter ops with offsets.
2025-09-24 13:28:43 -07:00
Jakub Kuderski
2ed3f49c49
[mlir] Use free op create functions. NFC. (#157374)
The builder create methods are deprecated:
https://mlir.llvm.org/deprecation/. See
https://discourse.llvm.org/t/psa-opty-create-now-with-100-more-tab-complete/87339.
2025-09-07 22:13:20 -04:00
Chao Chen
6026ca301d
[mlir][XeGPU] add unroll patterns for load_matrix and store_matrix (#154637)
2025-09-03 13:56:41 -05:00
Jianhui Li
e6f360b0ab
[MLIR][XeGPU] Allow load/store/prefetch uses [memref+offset] instead of tdesc (#150576)
Add variants of load/store/prefetch that take an offset. The new
xegpu.load variant accepts memref+offset, and the existing tdesc operand
will be removed in a future PR.

The semantics are a combination of "creating scattered_tdesc + xegpu.load
with scattered_tdesc". The current xegpu.load accepts a tdesc operand,
which encapsulates "memref+offset". This PR folds "memref+offset"
directly into xegpu.load, replacing "tdesc". create_tdesc will be removed
because scatter_tdesc only contains the base address once the offsets are
taken away, so there is no point in keeping it.

```mlir 
    // wi level code example
    %2 = xegpu.load %src[%offsets], %mask <{chunk_size = 2}> : ui64,  vector<1xindex>, vector<1xi1> -> vector<2xf32>
    xegpu.store %val, %src[%offsets], %mask: vector<1xf16>, memref<?xf16>, vector<1xindex>, vector<1xi1>
    xegpu.prefetch %src[%0] : ui64, vector<1xindex>
```
2025-07-30 16:00:40 -07:00
Jacques Pienaar
07967d4af8
[mlir] Switch to new LDBG macro (#150616)
Change local variants to use new central one.
2025-07-25 18:22:46 +02:00
Jianhui Li
90944b85c5
[MLIR][XeGPU] Add offset operands to load_nd/store_nd/prefetch_nd (#149424)
This PR allows load_nd/store_nd/prefetch_nd to take an additional offset
operand.
It is based on this PR https://github.com/llvm/llvm-project/pull/148335.
Users can now create an nd_tdesc with no offset, and instead set the
offset with the load_nd operation.
2025-07-23 09:00:51 -07:00
James Newling
6ed921f967
Reland "[mlir][vector] Use vector.broadcast in place of vector.splat" (#150138)
This reverts commit 228c45f13dc92546661b6825b7b32c3808b0d2eb (PR
#148937). Now that #148027 has landed, I think it is safe to "reland"
the original PR: #148028
2025-07-23 06:00:59 -07:00
Maksim Levental
7b78796543
[mlir][NFC] update mlir/Dialect create APIs (25/n) (#149932)
See https://github.com/llvm/llvm-project/pull/147168 for more info.
2025-07-21 19:57:59 -04:00
James Newling
228c45f13d
Revert [mlir][vector] Use vector.broadcast in place of vector.splat (#148937)
This reverts PR/commit 99875733fc

This PR/commit should only be landed after
https://github.com/llvm/llvm-project/pull/148027, at which point we
don't need to assume that vector.broadcast has been lowered to another
form.
2025-07-15 20:45:01 -07:00
Kazu Hirata
c06d3a7b72
[mlir] Remove unused includes (NFC) (#148769)
These are identified by misc-include-cleaner.  I've filtered out those
that break builds.  Also, I'm staying away from llvm-config.h,
config.h, and Compiler.h, which likely cause platform- or
compiler-specific build failures.
2025-07-14 22:19:23 -07:00
James Newling
99875733fc
[mlir][vector] Use vector.broadcast in place of vector.splat (#148028)
Part of deprecation of vector.splat

RFC:
https://discourse.llvm.org/t/rfc-mlir-vector-deprecate-then-remove-vector-splat/87143/4
More complete deprecation:
https://github.com/llvm/llvm-project/pull/147818
2025-07-14 15:12:21 -07:00
Chao Chen
75524dee18
[mlir][xegpu] Relax rank restriction of TensorDescType (#145916)
2025-07-09 19:40:24 -05:00
Chao Chen
36fbc6a8d2
[MLIR][XeGPU] Remove the transpose attribute from Gather/Scatter ops and Cleanup the documents (#145389)
2025-06-25 19:43:53 -05:00
Jianhui Li
f25f2f7de4
[MLIR][XeGPU] Extend unrolling support for scatter ops with chunk_size (#144447)
Add support for load/store with chunk_size, which requires special
consideration for operand blocking since offsets and masks are
n-D while the tensor is (n+1)-D. Supported operations include
create_tdesc, update_tdesc, load, store, and prefetch.
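
The rank mismatch described above can be illustrated with a small sketch of how per-operand block shapes might be derived. This is an assumption-laden model rather than the actual pattern: the helper name is hypothetical, and it assumes the chunk dimension is the innermost dimension of the data, which offsets and masks do not have.

```python
def operand_block_shapes(data_block):
    """Split an (n+1)-D data block shape into per-operand block shapes.

    Assumption: the last dimension of the data block is the chunk
    dimension, so offsets and masks use the block shape without it.
    """
    if len(data_block) < 2:
        raise ValueError("a chunked data block needs at least 2 dims")
    index_block = list(data_block[:-1])
    return {
        "data": list(data_block),
        "offsets": list(index_block),
        "mask": list(index_block),
    }


shapes = operand_block_shapes([16, 2])
print(shapes)
```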

---------

Co-authored-by: Adam Siemieniuk <adam.siemieniuk@intel.com>
2025-06-17 17:46:35 -05:00
Jianhui Li
58d23476f0
[MLIR][XeGPU] Add unroll patterns for scatter ops (#143602)
Add unrolling support for create_tdesc, load, store, prefetch, and update_offset.

---------

Co-authored-by: Adam Siemieniuk <adam.siemieniuk@intel.com>
Co-authored-by: Chao Chen <chao.chen@intel.com>
2025-06-16 10:48:41 -05:00
Chao Chen
9e2684e4cf
[MLIR][XeGPU] Add unroll patterns and blocking pass for XeGPU [2/N] (#142477)
Bring back https://github.com/llvm/llvm-project/pull/140163 with fixes
2025-06-02 21:39:30 -05:00
Chao Chen
b88dfb0b23
Revert "[MLIR][XeGPU] Add unroll patterns and blocking pass for XeGPU [2/N]" (#142459)
Reverts llvm/llvm-project#140163
2025-06-02 15:47:21 -04:00
Chao Chen
0210750d5a
[MLIR][XeGPU] Add unroll patterns and blocking pass for XeGPU [2/N] (#140163)
This PR introduces the initial implementation of a blocking pass for
XeGPU programs. The pass leverages unroll patterns from both the XeGPU
and Vector dialects. 

---------

Co-authored-by: Adam Siemieniuk <adam.siemieniuk@intel.com>
2025-06-02 14:02:45 -05:00
Chao Chen
db42345dc6
[MLIR][XeGPU] Add unroll patterns for XeGPU (1/N) (#137010)
Similar to vector ops, XeGPU ops need to be unrolled into smaller shapes
such that they can be dispatched into a hardware instruction. This PR
marks the initial phase of a series dedicated to incorporating unroll
patterns for XeGPU operations. In this installment, we introduce
patterns for the following operations:
1. createNd
2. updateNd
3. prefetchNd
4. loadNd
5. storeNd
6. dpas
2025-05-12 09:16:21 -05:00