llvm-project

Author	SHA1	Message	Date
Jianhui Li	470c5ca81c	[MLIR][XeGPU] Fix insert_strided_slice op in subgroup distribution (#180604 ) The PR modifies the subgroup distribution pass to sink an insert_strided_slice operation only if it becomes the last op before the yield. This avoids sinking insert_strided_slice multiple times, which could cause issues in the worst case.	2026-02-09 13:29:46 -08:00
Nishant Patel	570055bf97	[MLIR][XeGPU] Propagate layout from anchor ops before Wg To Sg & Blocking Pass (#179490 ) This PR calls recoverTemporaryLayout before the XeGPUWgtoSgDistribute & XeGPUBlocking passes to recover all the temporary operand layouts that the transformation patterns might require for checks and verification.	2026-02-06 15:56:09 -08:00
Jianhui Li	8102ebf6a3	[MLIR][XeGPU] Fixing PR179016 minor issues (#180295 ) Fix two issues brought by PR179016: 1. An unused variable when building with "-DLLVM_ENABLE_ASSERTIONS=OFF". 2. Recover the modification to recoverTemporaryLayouts() brought by PR176737, which was unintentionally lost during the merging process.	2026-02-06 14:51:40 -08:00
Jianhui Li	61b8a57839	[MLIR][XeGPU] Refactor layout propagation utilities (#179016 ) This PR refactors layout propagation into two distinct components: result/anchor layout setup and source layout inference from the result. For operations that require a specific result layout due to semantic or hardware constraints, the propagation logic explicitly sets up the result or anchor layout. Otherwise, it infers the source layout from the backward-propagated consumer layout. The result or anchor layout may differ from the backward-propagated consumer layout; any such discrepancies are resolved via the existing layout-conflict mechanism. This PR introduces the following utility functions: Source layout inference: > inferBroadcastSourceLayout() > inferMultiReductionSourceLayout() > inferBitCastSourceLayout() > inferShapeCastSourceLayout() > inferInsertStridedSliceSourceLayout() Result / anchor layout setup: > setupMultiReductionResultLayout() > setupBitCastResultLayout() > setupInsertStridedSliceResultLayout() > setupLoadMatrixAnchorLayout() > setupStoreMatrixAnchorLayout() > setupLoadGatherAnchorLayout() > setupStoreScatterAnchorLayout() Part of subgroup distribution related code changes are separated and created as PR https://github.com/llvm/llvm-project/pull/179018/changes.	2026-02-05 19:26:25 -08:00
Jianhui Li	983d8663b0	[MLIR] [XeGPU] SG distribution: adding tests for alloca/create_memdesc and remove unncessary check from shape_cast op lowering (#179018 ) This PR adds subgroup distribution tests for memref.alloca and xegpu.create_memdesc ops. It also removes the slice layout requirement for shape_cast.	2026-02-03 09:17:25 -08:00
Artem Kroviakov	56d31697c2	[MLIR][XeGPU] Reorganize uArch for easier extension (#178907 )	2026-02-03 09:47:11 +01:00
Charitha Saumya	eb33585b7d	[mlir][xegpu] Remove unused headers added in #177492 (#178719 ) Remove unused headers added in #177492	2026-01-29 10:47:26 -08:00
Charitha Saumya	d9c65f94b1	[mlir][xegpu] Add `XeGPUSgToWiDistributeExperimental` pass. (#177492 ) Currently the XeGPU lowering pipeline uses the `XeGPUSubgroupDistribute` pass for subgroup to work item distribution of ops. This pass is well established and relies on vector distribution's `WarpOp` based distribution mechanism. However, recent experiments with larger kernels have shown that this pass is very expensive in terms of compile time (see below). This prompted us to create a new pass that does not rely on `WarpOp` based distribution. This PR adds the initial infra to move away from the old way and align Wg To WI distribution with Wg to Sg distribution. The new pass also uses context-aware type conversion based on XeGPU layouts to distribute vector types from SG to WI. This PR adds the following changes: * SG to WI distribution pass based on context-aware type conversions using `OpConversionPatterns` * Test pass for testing individual patterns (`TestXeGPUSgToWiDistributeExperimental`) * `XeGPUSgToWiDistributeExperimentalPass` which will eventually replace `XeGPUSubgroupDistribute` Flash attention e2e compilation stats: ``` ----Wall Time---- ----Name---- 0.0032 ( 0.2%) Parser 0.0008 ( 0.0%) CSE 0.0000 ( 0.0%) (A) DominanceInfo 0.0002 ( 0.0%) GpuXeVMAttachTarget 1.1427 ( 58.7%) 'gpu.module' Pipeline 0.0019 ( 0.1%) XeGPUWgToSgDistribute 0.0003 ( 0.0%) CSE 0.0000 ( 0.0%) (A) DominanceInfo 0.0002 ( 0.0%) LowerAffinePass 0.0001 ( 0.0%) CSE 0.0000 ( 0.0%) (A) DominanceInfo 0.0008 ( 0.0%) XeGPUPropagateLayout 0.0056 ( 0.3%) XeGPUBlocking 0.0010 ( 0.1%) Canonicalizer 0.0004 ( 0.0%) CSE 0.0000 ( 0.0%) (A) DominanceInfo 0.0015 ( 0.1%) XeGPUPropagateLayout 0.0007 ( 0.0%) XeGPUOptimizeBlockLoads 0.0010 ( 0.0%) Canonicalizer 0.0004 ( 0.0%) CSE 0.0000 ( 0.0%) (A) DominanceInfo 0.0015 ( 0.1%) XeGPUPropagateLayout 1.1274 ( 57.9%) XeGPUSubgroupDistribute 0.7959 ( 40.9%) Output 0.0022 ( 0.1%) Rest 1.9461 (100.0%) Total ```	2026-01-29 09:57:01 -08:00
Jakub Kuderski	9aaf0b89f5	[mlir] Apply clang-tidy check llvm-use-vector-utils. NFC. (#178526 )	2026-01-29 02:19:00 +00:00
Charitha Saumya	cd4c9d200b	[mlir][xegpu] Add initial support for layout conflict handling. (#173090 ) This PR adds initial support for layout conflict resolution in XeGPU. A layout conflict occurs when an op's use point expects a different layout than the op can currently provide. The conflict needs to be resolved by adding certain other xegpu ops. Initially, we only handle conflicts at tensor desc use points.	2026-01-28 11:50:42 -08:00
Jakub Kuderski	59e44799bd	[mlir] Fix new clang-tidy warning llvm-type-switch-case-types. NFC. (#178487 ) Pre-committing this before landing the new check in https://github.com/llvm/llvm-project/pull/177892	2026-01-28 19:13:47 +00:00
Artem Kroviakov	0926743e2e	[MLIR][XeGPU] Add uniform values distribution pattern (#176737 )	2026-01-26 21:23:31 +01:00
Jianhui Li	fe71ea4437	[MLIR][XeGPU] Preserve Leading dimension when blocking rank-sensitive operations (#177489 ) This PR preserves leading dimensions for xegpu.load_matrix/store_matrix/atomic_rmw/convert_layout, and vector operations which have impact on shapes: broadcast/multi-reduction/shape_cast/transpose. Rank-sensitive operations are operations whose semantics depend on the tensor rank (and consequently its shape), and therefore must not alter the input tile rank or shape, such as by dropping leading dimensions.	2026-01-24 12:34:38 -08:00
Artem Kroviakov	9109c603a8	[MLIR][XeGPU] Add uArch limitation to scatter load store (#172845 )	2026-01-23 19:10:51 +01:00
Artem Kroviakov	2357408dd2	[MLIR][XeGPU] Add layout propagation for `xegpu.store_matrix` (#174952 )	2026-01-23 17:08:30 +01:00
Artem Kroviakov	e6195237a2	[MLIR][XeGPU] Add simple rank-based sg layout creation (#172867 )	2026-01-23 11:43:44 +01:00
Jianhui Li	1b8903aa8e	[MLIR][XeGPU] setUnitDim bug fix and add documentation (#173521 ) This PR fixes a bug in setUnitDimData and setUnitDimLayout, and adds documentation and a test. It also cleans up the shapecast op pattern in the wg distribution to use the local temporary layout instead of getting it from the definition op's result (one TODO item from PR [#172125](https://github.com/llvm/llvm-project/pull/172125)).	2026-01-20 21:17:00 -08:00
Nishant Patel	9a2d3ab6ad	[MLIR][XeGPU] Add support for cross-subgroup reduction from wg to sg (#170936 ) This PR adds support for cross-sg reduction whilst distributing from workgroup to subgroup. It has the following limitations: 1. Cannot reduce to a scalar. 2. For cross-sg, only 1:1 decomposition (each sg should be assigned only one tile in the original WG tile) is supported for now. For example, a WG tile of size 256x128 with sg_layout = [8, 4], sg_data = [16, 16] won't be supported.	2026-01-16 07:19:23 -08:00
Nishant Patel	3150b73dec	[MLIR][XeGPU] Clean up helpers in XeGPUPropagateLayout (#175857 ) In XeGPUPropagateLayout.cpp, the helper getDefaultSIMTLayoutInfo is implemented via multiple overloads that differ significantly in semantics, not just parameter types. Reusing the same function name for these semantically different behaviors makes call sites harder to read and reason about and increases the maintenance burden. This PR improves readability and maintainability of layout propagation logic.	2026-01-15 08:13:38 -08:00
Jianhui Li	074740df8a	[MLIR][XeGPU] bug fix: removing temporary slice layout at the pass end (#172589 ) Remove the temporary slice layout (besides the regular layout) at the end of the wg distribution and blocking passes. The PR also drops sg_data/inst_data from anchor layouts in every wg-to-sg/blocking/unrolling pattern. --------- Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com> Co-authored-by: Dmitry Chigarev <dmitry.chigarev@intel.com>	2026-01-14 09:57:14 -08:00
Artem Kroviakov	0b7d14e9a8	[MLIR][XeGPU] Add 2D `vector.multi_reduction` optimization (#171154 )	2026-01-14 12:58:30 +01:00
Nishant Patel	4a8ecccd70	[MLIR][XeGPU] Pass inst_data for blocking create/constant Mask and Step op (#175456 )	2026-01-13 15:21:39 -08:00
lonely eagle	0394ad1bfa	[mlir][dataflow] Add new visitNonControlFlowArgumentst API to SparseBackwardDataFlowAnalysis and apply it in LivenessAnalysis/RemoveDeadValues (#169816 ) Add visitNonControlFlowArgumentst API to SparseBackwardDataFlowAnalysis, current SparseBackwardDataflowAnalysis cannot access all SSA values, such as, the loop's IV. Now we can use visitNonControlFlowArgumentst to visit it. Apply it in LivenessAnalysis/RemoveDeadValues, solved the issue of IV liveness in the loop. https://discourse.llvm.org/t/rfc-add-visitbranchregionargument-interface-to-sparsedataflowanalysis/89061	2026-01-02 17:10:05 +08:00
Matthias Springer	0b24580a26	[mlir][Interfaces][NFC] Add `RegionBranchOpInterface` helper for forwarded values (#173981 ) Add a helper function to compute a mapping of successor operands to successor inputs. This mapping is computed in various places. Also add a helper function to gather all region branch points. This commit is in preparation of a bug fix / partial redesign of `-remove-dead-values`. This commit also removes some duplicate code in various places.	2026-01-01 11:56:05 +01:00
Jianhui Li	2b9e47749c	[MLIR][XeGPU] Refactor Layout access interface (#172125 ) This PR builds on the anchor layout mechanism introduced in https://github.com/llvm/llvm-project/pull/169267 and performs the following refactoring: 1. Introduce getAnchorLayout() and setAnchorLayout() interface for anchor ops to get and set layout attributes. 2. Add getLocalLayout() and setLocalLayout() utility functions, and refactor workgroup/subgroup distribution patterns to use these APIs. These utilities access the layout information directly and locally, without relying on global propagation. 3. Introduce localPropagateLayoutsFromAnchor(), a utility used by subgroup distribution to unify non-anchor layout setup. This function is intended to be invoked upfront by all layout-based passes (including workgroup/subgroup distribution and unrolling) to propagate layouts from anchor ops to non-anchor ops. After this step, patterns within the pass should exclusively use getLocalLayout() / setLocalLayout(). 4. Refactor getDistributeLayoutAttr() and setDistributeLayoutAttr() to remove special-case handling. These APIs now operate in a uniform order: anchor ops first, then non-anchor ops, and finally block arguments. These APIs will be deprecated in the long run. 5. Refactor patterns in wg/sg distribution and load optimization passes to use get/setAnchorLayout() and get/setLocalLayout(). 6. Update test cases to enforce that anchor ops must use, and only use, anchor layouts.	2025-12-17 12:04:58 -08:00
Artem Kroviakov	a6f837e9f8	[MLIR][XeGPU] Add sg layout propagation (#170879 )	2025-12-17 11:03:21 +01:00
Jianhui Li	492340aeb1	[MLIR][XeGPU] Add handling for unit-dim expansion in ShapeCast workgroup-to-subgroup distribution (#171758 ) Add special-case handling for ShapeCast when it expands unit dimensions for a succeeding broadcast op. In this scenario, distribution requires the source layout to be a slice layout, and the result layout is first normalized by setting the expanded unit dimensions to 1 before computing the distributed result shape. In all other cases, ShapeCast is distributed as usual. This PR also updates the propagation rule for vectors with expanded unit dimensions, allowing them to share the same layout as the result of a broadcast op. This enables correct layout propagation back to the source of the ShapeCast op, as that layout must ultimately be restored as the parent layout of the slice layout.	2025-12-16 13:13:11 -08:00
Charitha Saumya	3ece6626cb	[mlir][xegpu] Add support for `vector.extract_strided_slice` XeGPU SIMT distribution with partial offsets. (#171512 ) `vector.extract_strided_slice` can have two forms when specifying offsets. Case 1: ``` %1 = vector.extract_strided_slice %0 { offsets = [8, 0], sizes = [8, 16], strides = [1, 1]} : vector<24x16xf32> to vector<8x16xf32> ``` Case 2: ``` %1 = vector.extract_strided_slice %0 { offsets = [8], sizes = [8], strides = [1]} : vector<24x16xf32> to vector<8x16xf32> ``` These two ops mean the same thing, but case 2 is syntactic sugar to avoid specifying offsets for fully extracted dims. Currently case 2 fails in XeGPU SIMT distribution. This PR fixes this issue.	2025-12-10 09:53:56 -08:00
Jianhui Li	5236af88e5	[MLIR][XeGPU] Extend propagation and sg_to_lane distribution pass support broadcast with low rank and scalar source input (#170409 ) This PR extends XeGPU layout propagation and distribution for vector.broadcast operation. It relaxes the restriction of layout propagation to allow low-rank and scalar source input, and adds a pattern in sg-to-wi distribution to support the lowering.	2025-12-09 08:48:27 -08:00
Nishant Patel	5fc8e87fe2	[MLIR][XeGPU] Retain anchor op layouts for XeGPU nD ops (#170934 ) This PR adds support to retain the anchor op layouts (after dropping what's not required) for xegpu nD ops during workgroup to subgroup & unroll transformation	2025-12-05 21:49:13 -08:00
Nishant Patel	c8d3b0c8e3	[MLIR][XeGPU] Add distribution for vector.create_mask from Wg to Sg (#169571 )	2025-12-03 16:01:46 -08:00
Artem Kroviakov	ea00593dd1	[MLIR][XeGPU][Quickfix] Disable block count in propagation (#170304 ) One of the previous PRs https://github.com/llvm/llvm-project/pull/169267/ has reintroduced block count to layout propagation that was removed in https://github.com/llvm/llvm-project/pull/168504/. This PR patches the issue.	2025-12-02 09:49:06 -08:00
Jianhui Li	326a1a4bad	[MLIR][XeGPU] Add anchor_layout and update propagation to honor user-specified layouts (#169267 ) Introduce anchor layout for XeGPU anchor ops: load_nd, store_nd, prefetch_nd, dpas, load, store, prefetch, load_matrix, store_matrix, and atomic_rmw. Anchor layout is permanent, and is guaranteed to be honored by XeGPU distribution and lowerings once specified. 1. Add anchor_layout for XeGPU anchor ops: load_nd, store_nd, prefetch_nd, dpas, load, store, prefetch, load_matrix, store_matrix, and atomic_rmw. 2. Rename layout attributes to anchor_layout for these ops: load, store, load_matrix, store_matrix. 3. Update the layout propagation pass: only when the user doesn't specify an anchor layout does the pass compute a default layout, set it as the anchor op's permanent layout, and use that for propagation. If the user specified an anchor layout, the pass takes the user-specified anchor layout.	2025-11-26 23:02:01 -08:00
Charitha Saumya	c333f7dab9	[mlir][xegpu] Add layout based SIMT distribution support for `vector.extract/insert_strided_slice` (#168626 ) This PR adds general SIMT distribution support for `vector.extract/insert_strided_slice`. Currently vector distribution already has support for these operations but has restrictions to avoid requiring layouts during distribution logic. For example, `extract_strided_slice` requires that the distributed dimension is fully extracted. However, more complex cases may require extracting partially from the distributed dimension (e.g. 8x16xf16 extraction from 8x32xf16). These types of cases need the layouts to reason about how the data is spread across SIMT lanes. Currently, we don't have layout access in vector distribution, so these new patterns are placed on the XeGPU side. They have higher pattern benefit so that they will be tried first before trying regular vector distribution based patterns.	2025-11-26 10:10:36 -06:00
Kazu Hirata	67391fc039	[mlir] Construct SmallVector with initial values (NFC) (#169239 ) Identified with llvm-use-ranges.	2025-11-23 22:32:50 -08:00
Artem Kroviakov	8ea5e20ce4	[MLIR][XeGPU] Disable block count usage in layout propagation (#168504 )	2025-11-23 10:34:57 +01:00
Nishant Patel	778e104dee	[MLIR] [XeGPU] Fix dropSgLayoutAndData & dropInstData in SliceAttr (#168618 )	2025-11-21 12:40:16 -08:00
Nishant Patel	310abe0e4b	[MLIR] [XeGPU] Add distribution pattern for vector.constant_mask from Wg To Sg (#168118 )	2025-11-20 15:00:57 -08:00
Dmitry Chigarev	cd5d5b31bf	[mlir][XeGPU] Use DistributeLayoutAttr instead of LayoutAttr for load gather/scatter ops (#167850 ) The PR changes the layout attribute type for `xegpu::LoadGatherOp/StoreScatterOp` from `LayoutAttr` to `DistributeLayoutAttr` to also support `xegpu.slice` layouts. Initially we [wanted to restrict slice layouts](https://github.com/llvm/llvm-project/pull/163414#discussion_r2478978798) from the attribute, but now it turns out there are actually valid use cases for that: ```mlir gpu.func @distribute_load_slice_attr() { %2 = memref.alloca() {alignment = 1024} : memref<4096xf32> %offset = arith.constant {layout_result_0 = #xegpu.layout<sg_layout = [8], sg_data = [32], inst_data = [16]> } dense<0> : vector<256xindex> %mask = arith.constant {layout_result_0 = #xegpu.layout<sg_layout = [8], sg_data = [32], inst_data = [16]> } dense<1> : vector<256xi1> %3 = xegpu.load %2[%offset], %mask <{chunk_size = 1, layout = #xegpu.slice<#xegpu.layout<sg_layout = [8, 8], sg_data = [32, 32], inst_data = [8, 16]>, dims = [0]>>} { layout_result_0 = #xegpu.slice<#xegpu.layout<sg_layout = [8, 8], sg_data = [32, 32], inst_data = [8, 16]>, dims = [0]> } : memref<4096xf32>, vector<256xindex>, vector<256xi1> -> vector<256xf32> %4 = vector.broadcast %3 {layout_result_0 = #xegpu.layout<sg_layout = [8, 8], sg_data = [32, 32], inst_data = [8, 16]>} : vector<256xf32> to vector<256x256xf32> gpu.return } ``` Signed-off-by: dchigarev <dmitry.chigarev@intel.com>	2025-11-17 11:00:03 -08:00
Jakub Kuderski	1fd9c02513	[mlir] Adopt cast function objects. NFC. (#168228 ) These were added in https://github.com/llvm/llvm-project/pull/165803.	2025-11-15 14:51:14 -05:00
Artem Kroviakov	bba40ab4bd	[MLIR][XeGPU] Decouple `inst_data` and `lane_layout` in propagation (#166941 )	2025-11-10 14:14:11 +01:00
Nishant Patel	f291f335c9	[MLIR][XeGPU] Support order attribute and add pattern for vector.transpose in WgToSg Pass (#165307 ) This PR does the following: 1. Handle order attribute during the delinearization from linear subgroup Id to multi-dim id. 2. Adds a transformation pattern for vector.transpose in wg to sg pass. 3. Updates CHECKS in the wg to sg tests	2025-11-04 19:37:08 -08:00
Charitha Saumya	9703bda95b	[mlir][xegpu] Add OptimizeBlockLoads pass. (#165483 ) This pass rewrites certain xegpu `CreateNd` and `LoadNd` operations that feed into `vector.transpose` to a more optimal form to improve performance. Specifically, low precision (bitwidth < 32) `LoadNd` ops that feed into transpose ops are rewritten to i32 loads with a valid transpose layout such that later passes can use the load with transpose HW feature to accelerate such load ops. Update: Pass is renamed to `OptimizeBlockLoads` because later we plan to add the array length optimization into this pass as well. This will break down a larger load (like `32x32xf16`) into more DPAS-favorable array length loads (`32x16xf16` with array length = 2). Both of these optimizations require rewriting `CreateNd` and `LoadNd`, and it makes sense to have a common pass for both.	2025-11-04 13:15:32 -08:00
Dmitry Chigarev	6c563dc6a2	[mlir][XeGPU] Add optional layout attribute to LoadGather StoreScatter ops (#163414 ) As [suggested here](https://github.com/llvm/llvm-project/pull/163071#discussion_r2427229637) the PR adds an optional layout attribute for `LoadGather` and `StoreScatter` ops. For the load-op the attribute describes the layout of the result (ex `layout_result_0`), and for store-op it describes the layout for the vector-to-store operand (ex `layout_operand_0`). The PR also reworks the `propagate-layout` pass to consider permanent layout attributes and back-propagate them accordingly. The helper utility function `getDistributeLayoutAttr` is reworked to return either `layout_operand/result_0` or `layout` for load/store ops (depending on which one is set). After an offline discussion, we decided that the overall layout utilities API is confusing since it tries to mix permanent and temporary layouts. It would need to change in the future. --------- Signed-off-by: dchigarev <dmitry.chigarev@intel.com>	2025-11-04 08:19:47 -08:00
Artem Kroviakov	68c4c83bcb	[MLIR][XeGPU] Matrix load/store subgroup distribution (#165008 )	2025-11-03 21:48:27 +01:00
Artem Kroviakov	ec657d859c	[MLIR][XeGPU] Introduce `xegpu::uArch` usage in target-sensitive passes (#163801 )	2025-10-31 17:33:11 +01:00
Nishant Patel	07372fcf6c	[MLIR][XeGPU] Remove leading unit dims from vector ops before unrolling (#165030 ) This PR uses the upstream populateCastAwayVectorLeadingOneDimPatterns to remove leading unit dims from vector ops and then do the unrolling/blocking	2025-10-27 09:09:33 -07:00
Nishant Patel	621ed04e28	[MLIR][XeGPU]Enhance Pack/Unpack for XeGPUUnroll (#163459 ) This PR changes the pack/unpack method used for unrolling to allow for lower rank slice to be extracted and inserted from and to src vector by adding reshapes. It also removes leading unit dims from inst_data if there are any.	2025-10-24 11:04:05 -07:00
Charitha Saumya	1e8834ea3a	[mlir][vector][xegpu] Accept uniform values in `getDistributedType` (#163887 ) Uniform values should not be distributed during vector distribution. An example would be a reduction result where the reduction happens across lanes. However, the current `getDistributedType` does not accept a zero result affine map (i.e. no distributed dims) when describing the distributed dimensions. This results in a null type being returned, crashing the vector distribution in some cases. An example case would be a `scf.for` op (about to be distributed) in which one of the for results is a uniform value and it does not have a user outside the warp op. This necessitates querying `getDistributedType` to figure out the distributed type of this value.	2025-10-22 08:41:41 -07:00
Jakub Kuderski	ae11c5c2c4	[mlir] Switch uses of deprecated .create methods to free function. NFC. (#164635 ) See https://discourse.llvm.org/t/psa-opty-create-now-with-100-more-tab-complete/87339.	2025-10-22 14:51:03 +00:00

132 Commits