llvm-project

Author	SHA1	Message	Date
Jianhui Li	61b8a57839	[MLIR][XeGPU] Refactor layout propagation utilities (#179016) This PR refactors layout propagation into two distinct components: result/anchor layout setup and source layout inference from the result. For operations that require a specific result layout due to semantic or hardware constraints, the propagation logic explicitly sets up the result or anchor layout. Otherwise, it infers the source layout from the backward-propagated consumer layout. The result or anchor layout may differ from the backward-propagated consumer layout; any such discrepancies are resolved via the existing layout-conflict mechanism. This PR introduces the following utility functions: Source layout inference: inferBroadcastSourceLayout(), inferMultiReductionSourceLayout(), inferBitCastSourceLayout(), inferShapeCastSourceLayout(), inferInsertStridedSliceSourceLayout(). Result / anchor layout setup: setupMultiReductionResultLayout(), setupBitCastResultLayout(), setupInsertStridedSliceResultLayout(), setupLoadMatrixAnchorLayout(), setupStoreMatrixAnchorLayout(), setupLoadGatherAnchorLayout(), setupStoreScatterAnchorLayout(). Some of the subgroup-distribution-related code changes are split out into a separate PR: https://github.com/llvm/llvm-project/pull/179018/changes.	2026-02-05 19:26:25 -08:00
Charitha Saumya	d9c65f94b1	[mlir][xegpu] Add `XeGPUSgToWiDistributeExperimental` pass. (#177492) Currently the XeGPU lowering pipeline uses the `XeGPUSubgroupDistribute` pass for subgroup to work-item distribution of ops. This pass is well established and relies on vector distribution's `WarpOp` based distribution mechanism. However, recent experiments with larger kernels have shown that this pass is very expensive in terms of compile time (see below). This prompted us to create a new pass that does not rely on `WarpOp` based distribution. This PR adds the initial infra to move away from the old approach and align Wg To WI distribution with Wg to Sg distribution. The new pass also uses context-aware type conversion based on XeGPU layouts to distribute vector types from SG to WI. This PR adds the following changes: * SG to WI distribution pass based on context-aware type conversions using `OpConversionPatterns` * Test pass for testing individual patterns (`TestXeGPUSgToWiDistributeExperimental`) * `XeGPUSgToWiDistributeExperimentalPass` which will eventually replace `XeGPUSubgroupDistribute` Flash attention e2e compilation stats: ``` ----Wall Time---- ----Name---- 0.0032 ( 0.2%) Parser 0.0008 ( 0.0%) CSE 0.0000 ( 0.0%) (A) DominanceInfo 0.0002 ( 0.0%) GpuXeVMAttachTarget 1.1427 ( 58.7%) 'gpu.module' Pipeline 0.0019 ( 0.1%) XeGPUWgToSgDistribute 0.0003 ( 0.0%) CSE 0.0000 ( 0.0%) (A) DominanceInfo 0.0002 ( 0.0%) LowerAffinePass 0.0001 ( 0.0%) CSE 0.0000 ( 0.0%) (A) DominanceInfo 0.0008 ( 0.0%) XeGPUPropagateLayout 0.0056 ( 0.3%) XeGPUBlocking 0.0010 ( 0.1%) Canonicalizer 0.0004 ( 0.0%) CSE 0.0000 ( 0.0%) (A) DominanceInfo 0.0015 ( 0.1%) XeGPUPropagateLayout 0.0007 ( 0.0%) XeGPUOptimizeBlockLoads 0.0010 ( 0.0%) Canonicalizer 0.0004 ( 0.0%) CSE 0.0000 ( 0.0%) (A) DominanceInfo 0.0015 ( 0.1%) XeGPUPropagateLayout 1.1274 ( 57.9%) XeGPUSubgroupDistribute 0.7959 ( 40.9%) Output 0.0022 ( 0.1%) Rest 1.9461 (100.0%) Total ```	2026-01-29 09:57:01 -08:00
Artem Kroviakov	0926743e2e	[MLIR][XeGPU] Add uniform values distribution pattern (#176737)	2026-01-26 21:23:31 +01:00
Jianhui Li	1b8903aa8e	[MLIR][XeGPU] setUnitDim bug fix and add documentation (#173521) This PR fixes a bug in setUnitDimData and setUnitDimLayout, and adds documentation and tests. It also cleans up the shapecast op pattern in the wg distribution to use the local temporary layout instead of getting it from the definition op's result (one TODO item from PR [#172125](https://github.com/llvm/llvm-project/pull/172125)).	2026-01-20 21:17:00 -08:00
Jianhui Li	074740df8a	[MLIR][XeGPU] bug fix: removing temporary slice layout at the pass end (#172589) Removes the temporary slice layout (in addition to the regular layout) at the end of the wg distribution and blocking passes. The PR also drops sg_data/inst_data from anchor layouts in every wg-to-sg/blocking/unrolling pattern. --------- Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com> Co-authored-by: Dmitry Chigarev <dmitry.chigarev@intel.com>	2026-01-14 09:57:14 -08:00
Jianhui Li	2b9e47749c	[MLIR][XeGPU] Refactor Layout access interface (#172125) This PR builds on the anchor layout mechanism introduced in https://github.com/llvm/llvm-project/pull/169267 and performs the following refactoring: 1. Introduce the getAnchorLayout() and setAnchorLayout() interface for anchor ops to get and set layout attributes. 2. Add getLocalLayout() and setLocalLayout() utility functions, and refactor workgroup/subgroup distribution patterns to use these APIs. These utilities access the layout information directly and locally, without relying on global propagation. 3. Introduce localPropagateLayoutsFromAnchor(), a utility used by subgroup distribution to unify non-anchor layout setup. This function is intended to be invoked upfront by all layout-based passes (including workgroup/subgroup distribution and unrolling) to propagate layouts from anchor ops to non-anchor ops. After this step, patterns within the pass should exclusively use getLocalLayout() / setLocalLayout(). 4. Refactor getDistributeLayoutAttr() and setDistributeLayoutAttr() to remove special-case handling. These APIs now operate in a uniform order: anchor ops first, then non-anchor ops, and finally block arguments. These APIs will be deprecated in the long run. 5. Refactor patterns in the wg/sg distribution and load optimization passes to use get/setAnchorLayout() and get/setLocalLayout(). 6. Update test cases to enforce that anchor ops must use, and only use, anchor layouts.	2025-12-17 12:04:58 -08:00
Charitha Saumya	eb7db0b9ec	[mlir][xegpu] Change `index` arithmetic ops to `arith` ops. (#170390) Index ops cause some issues during SIMT distribution because they don't have the `Elementwise` mappable trait. This PR replaces all index arithmetic ops with matching `arith` dialect ops.	2025-12-03 07:48:00 -08:00
Jianhui Li	326a1a4bad	[MLIR][XeGPU] Add anchor_layout and update propagation to honor user-specified layouts (#169267) Introduce anchor layout for XeGPU anchor ops: load_nd, store_nd, prefetch_nd, dpas, load, store, prefetch, load_matrix, store_matrix, and atomic_rmw. Anchor layout is permanent, and is guaranteed to be honored by XeGPU distribution and lowerings once specified. 1. Add anchor_layout for XeGPU anchor ops: load_nd, store_nd, prefetch_nd, dpas, load, store, prefetch, load_matrix, store_matrix, and atomic_rmw. 2. Rename layout attributes to anchor_layout for these ops: load, store, load_matrix, store_matrix. 3. Update the layout propagation pass: only when the user doesn't specify an anchor layout does the pass compute a default layout, set it as the anchor op's permanent layout, and use that for propagation. If the user specified an anchor layout, the pass takes the user-specified anchor layout.	2025-11-26 23:02:01 -08:00
Kazu Hirata	ff8ed4d80a	[mlir] Use llvm::copy (NFC) (#168213) Identified with llvm-use-ranges.	2025-11-15 10:54:01 -08:00
Charitha Saumya	9703bda95b	[mlir][xegpu] Add OptimizeBlockLoads pass. (#165483) This pass rewrites certain xegpu `CreateNd` and `LoadNd` operations that feed into `vector.transpose` into a more optimal form to improve performance. Specifically, low-precision (bitwidth < 32) `LoadNd` ops that feed into transpose ops are rewritten to i32 loads with a valid transpose layout such that later passes can use the load-with-transpose HW feature to accelerate such load ops. Update: The pass is renamed to `OptimizeBlockLoads` because later we plan to add the array length optimization into this pass as well. This will break down a larger load (like `32x32xf16`) into more DPAS-favorable array length loads (`32x16xf16` with array length = 2). Both these optimizations require rewriting `CreateNd` and `LoadNd` and it makes sense to have a common pass for both.	2025-11-04 13:15:32 -08:00
Dmitry Chigarev	6c563dc6a2	[mlir][XeGPU] Add optional layout attribute to LoadGather StoreScatter ops (#163414) As [suggested here](https://github.com/llvm/llvm-project/pull/163071#discussion_r2427229637) the PR adds an optional layout attribute for `LoadGather` and `StoreScatter` ops. For the load op the attribute describes the layout of the result (e.g. `layout_result_0`), and for the store op it describes the layout of the vector-to-store operand (e.g. `layout_operand_0`). The PR also reworks the `propagate-layout` pass to consider permanent layout attributes and back-propagate them accordingly. The helper utility function `getDistributeLayoutAttr` is reworked to return either `layout_operand/result_0` or `layout` for load/store ops (depending on which one is set). After an offline discussion, we decided that the overall layout utilities API is confusing since it tries to mix permanent and temporary layouts. It will need to change in the future. --------- Signed-off-by: dchigarev <dmitry.chigarev@intel.com>	2025-11-04 08:19:47 -08:00
Mehdi Amini	bc2f746e78	[MLIR] Apply clang-tidy fixes for llvm-qualified-auto in XeGPUUtils.cpp (NFC)	2025-10-27 17:14:50 -07:00
Nishant Patel	621ed04e28	[MLIR][XeGPU]Enhance Pack/Unpack for XeGPUUnroll (#163459) This PR changes the pack/unpack method used for unrolling so that lower-rank slices can be extracted from and inserted into the source vector by adding reshapes. It also removes leading unit dims from inst_data if there are any.	2025-10-24 11:04:05 -07:00
Jakub Kuderski	0820266651	[mlir] Use llvm accumulate wrappers. NFCI. (#162957) Use wrappers around `std::accumulate` to make the code more concise and less bug-prone: https://github.com/llvm/llvm-project/pull/162129. With `std::accumulate`, it's the initial value that determines the accumulator type. `llvm::sum_of` and `llvm::product_of` pick the right accumulator type based on the range element type. Found some funny bugs like a local accumulate helper that calculated a sum with initial value of 1 -- we didn't hit the bug because the code was actually dead...	2025-10-11 11:33:18 -04:00
Chao Chen	6026ca301d	[mlir][XeGPU] add unroll patterns for load_matrix and store_matrix (#154637)	2025-09-03 13:56:41 -05:00
Chao Chen	c96e2cdd13	[mlir][XeGPU] Update utils for LayoutAttr and SliceAttr support (#154819)	2025-08-27 12:37:15 -05:00
Chao Chen	68d6866428	[mlir][XeGPU] add WgToSg distribution pattern for load_matrix and store_matrix. (#154403)	2025-08-21 10:02:45 -05:00
Jianhui Li	98728d9dc8	[MLIR][XeGPU] Add lowering from transfer_read/transfer_write to load_gather/store_scatter (#152429) Lowers transfer_read/transfer_write to load_gather/store_scatter when the target uArch doesn't support load_nd/store_nd. The high-level steps: 1. compute strides; 2. compute offsets; 3. collapseMemrefTo1D; 4. create the load_gather or store_scatter op.	2025-08-14 11:27:07 -07:00
Kazu Hirata	0925d7572a	[mlir] Remove unused includes (NFC) (#150266) These are identified by misc-include-cleaner. I've filtered out those that break builds. Also, I'm staying away from llvm-config.h, config.h, and Compiler.h, which likely cause platform- or compiler-specific build failures.	2025-07-23 15:18:53 -07:00
Chao Chen	317dae1a7e	[mlir][xegpu] Add initial skeleton implementation for lowering ConvertLayoutOp (#146176) This PR adds initial skeleton implementation for lowering ConvertLayoutOp. It currently only supports cases where SLM is not needed. --------- Co-authored-by: Adam Siemieniuk <adam.siemieniuk@intel.com>	2025-07-23 11:35:40 -05:00
Chao Chen	0e00bc4f83	[mlir][xegpu] cleanup the print format for TensorDesc (#149182)	2025-07-22 14:16:58 -05:00
Maksim Levental	7b78796543	[mlir][NFC] update `mlir/Dialect` create APIs (25/n) (#149932) See https://github.com/llvm/llvm-project/pull/147168 for more info.	2025-07-21 19:57:59 -04:00
Charitha Saumya	fc3781853b	[mlir][xegpu] Minor fixes in XeGPU subgroup distribution. (#147846) This PR addresses the following issues. 1. Add the missing attributes when creating a new GPU funcOp in `MoveFuncBodyToWarpExecuteOnLane0` pattern. 2. Bug fix in LoadNd distribution to make sure LoadOp is the last op in warpOp region before it is distributed (needed for preserving the memory op ordering during distribution). 3. Add utility for removing OpOperand or OpResult layout attributes.	2025-07-17 15:13:20 -07:00
Chao Chen	5578bcbcfd	[mlir][xegpu] add support for structure control flow ops in workgroup to subgroup distribution (#142618) This PR introduces support for `scf::ForOp`, `scf::WhileOp`, `scf::If`, and `scf::Condition` within the workgroup-subgroup-distribution pass, leveraging the `SCFStructuralTypeConversionsAndLegality`.	2025-06-13 12:32:46 -05:00
Chao Chen	9e2684e4cf	[MLIR][XeGPU] Add unroll patterns and blocking pass for XeGPU [2/N] (#142477) Bring back https://github.com/llvm/llvm-project/pull/140163 with fixes	2025-06-02 21:39:30 -05:00
Chao Chen	b88dfb0b23	Revert "[MLIR][XeGPU] Add unroll patterns and blocking pass for XeGPU [2/N]" (#142459) Reverts llvm/llvm-project#140163	2025-06-02 15:47:21 -04:00
Chao Chen	0210750d5a	[MLIR][XeGPU] Add unroll patterns and blocking pass for XeGPU [2/N] (#140163) This PR introduces the initial implementation of a blocking pass for XeGPU programs. The pass leverages unroll patterns from both the XeGPU and Vector dialects. --------- Co-authored-by: Adam Siemieniuk <adam.siemieniuk@intel.com>	2025-06-02 14:02:45 -05:00
Charitha Saumya	d30554b19e	[mlir][xegpu] SIMT distribution patterns for XeGPU CreateNdTdesc, LoadNd, StoreNd and Dpas Ops. (#135271) This PR adds the SIMT distribution patterns for create_nd_tdesc, load_nd, store_nd and dpas XeGPU ops.	2025-04-30 12:16:47 -07:00

28 Commits