llvm-project

Author	SHA1	Message	Date
Maksim Levental	c090ed53fb	[mlir][NFC] update `mlir/Dialect` create APIs (33/n) (#150659) See https://github.com/llvm/llvm-project/pull/147168 for more info.	2025-07-25 16:13:55 -04:00
Maksim Levental	dce6679cf5	[mlir][NFC] update `mlir/Dialect` create APIs (16/n) (#149922) See https://github.com/llvm/llvm-project/pull/147168 for more info.	2025-07-21 19:57:30 -04:00
Kazu Hirata	54bd936ec9	[mlir] Remove unused includes (NFC) (#147455) These are identified by misc-include-cleaner. I've filtered out those that break builds. Also, I'm staying away from llvm-config.h, config.h, and Compiler.h, which likely cause platform- or compiler-specific build failures.	2025-07-07 23:40:44 -07:00
Muzammil	893ef7ffbd	[mlir][GPU] Fixes subgroup reduce lowering (#141825) Fixes the final reduction steps, which were taken from an implementation of scan, not reduction, causing lanes earlier in the wave to have incorrect results due to masking. Now aligning more closely with the Triton implementation: https://github.com/triton-lang/triton/pull/5019 # Hypothetical example To provide an explanation of the issue with the current implementation, let's take the simple example of attempting to perform a sum over 64 lanes where the initial values are as follows (first lane has value 1, and all other lanes have value 0): ``` [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] ``` When performing a sum reduction over these 64 lanes, in the current implementation we perform 6 dpp instructions which in sequential order do the following: 1) sum over clusters of 2 contiguous lanes 2) sum over clusters of 4 contiguous lanes 3) sum over clusters of 8 contiguous lanes 4) sum over an entire row 5) broadcast the result of last lane in each row to the next row and each lane sums current value with incoming value. 6) broadcast the result of the 32nd lane to last two rows and each lane sums current value with incoming value.
After step 4) the result for the example above looks like this: ``` [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] ``` After step 5) the result looks like this: ``` [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] ``` After step 6) the result looks like this: ``` [4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] ``` Note that the correct value here is always 1, yet after the `dpp.broadcast` ops some lanes have incorrect values. The reason is that for these incorrect lanes, like lanes 0-15 in step 5, the `dpp.broadcast` op doesn't provide them incoming values from other lanes. Instead, these lanes are provided either their own values, or 0 (depending on whether `bound_ctrl` is true or false) as values to sum over. Either way, these values are stale and these lanes shouldn't be used in general. So what this means: - For a subgroup reduce over 32 lanes (like Step 5), the correct result is stored in lanes 16 to 31 - For a subgroup reduce over 64 lanes (like Step 6), the correct result is stored in lanes 32 to 63. However, in the current implementation we do not specifically read the value from one of the correct lanes when returning a final value. In some workloads, it seems that without this explicit read, the stale value from the first lane is returned instead. # Actual failing test For a specific example of how the current implementation causes issues, take a look at the IR below which represents an additive reduction over a dynamic dimension.
``` !matA = tensor<1x?xf16> !matB = tensor<1xf16> #map = affine_map<(d0, d1) -> (d0, d1)> #map1 = affine_map<(d0, d1) -> (d0)> func.func @only_producer_fusion_multiple_result(%arg0: !matA) -> !matB { %cst_1 = arith.constant 0.000000e+00 : f16 %c2_i64 = arith.constant 2 : i64 %0 = tensor.empty() : !matB %2 = linalg.fill ins(%cst_1 : f16) outs(%0 : !matB) -> !matB %4 = linalg.generic {indexing_maps = [#map, #map1], iterator_types = ["parallel", "reduction"]} ins(%arg0 : !matA) outs(%2 : !matB) { ^bb0(%in: f16, %out: f16): %7 = arith.addf %in, %out : f16 linalg.yield %7 : f16 } -> !matB return %4 : !matB } ``` When provided an input of type `tensor<1x2xf16>` and values `{0, 1}` to perform the reduction over, the value returned is consistently 4. By the same analysis done above, this shows that the returned value is coming from one of these stale lanes and needs to be read instead from one of the lanes storing the correct result. Signed-off-by: Muzammiluddin Syed <muzasyed@amd.com>	2025-05-28 17:47:22 -05:00
Muzammil	905f1d8068	[mlir][AMDGPU] Implement gpu.subgroup_reduce with DPP intrinsics on AMD GPUs (#133204) When performing cross-lane reductions using subgroup_reduce ops across contiguous lanes on AMD GPUs, lower to Data Parallel Primitives (DPP) ops when possible. This reduces latency on applicable devices. See related [Issue](https://github.com/iree-org/iree/issues/20007) To do: - Improve lowering to subgroup_reduce in compatible matvecs (these get directly lowered to gpu.shuffles in an earlier pass) --------- Signed-off-by: Muzammiluddin Syed <muzasyed@amd.com>	2025-04-23 17:37:32 -07:00
Petr Kurapov	bc29fc937c	[MLIR] Create GPU utils library & move distribution utils (#119264) Continue the move of `warp_execute_on_lane_0` op to the gpu dialect (#116994). This patch creates a utils library in GPU and moves generic helper functions there.	2024-12-13 10:26:57 +01:00
Andrea Faulds	a800ffac41	[mlir][gpu] Disjoint patterns for lowering clustered subgroup reduce (#109158) Making the existing populateGpuLowerSubgroupReduceToShufflePatterns() function also cover the new "clustered" subgroup reductions is proving to be inconvenient, because certain backends may have more specific lowerings that only cover the non-clustered type, and this creates pass ordering constraints. This commit removes coverage of clustered reductions from this function in favour of a new separate function, which makes controlling the lowering much more straightforward.	2024-09-18 15:55:53 -04:00
Andrea Faulds	fd26f8444a	[mlir][gpu] Rename two misspelled pattern population functions (#109015)	2024-09-17 15:26:14 -04:00
Andrea Faulds	3d01f0a33b	[mlir][gpu] Add 'cluster_stride' attribute to gpu.subgroup_reduce (#107142) Follow-up to 7aa22f013e24d20291aad745368ff907baa9dfa4, adding an additional attribute needed in some applications.	2024-09-05 09:03:22 -04:00
Andrea Faulds	7aa22f013e	[mlir][gpu] Add 'cluster_size' attribute to gpu.subgroup_reduce (#104851) This enables performing several reductions in parallel, each smaller than the size of the subgroup. One potential application is flash attention with subgroup-wide matrix multiplication and reduction combined in one kernel. The multiplication operation requires a 2D matrix to be distributed over the lanes of the subgroup, which then constrains the shape the following reduction can have if we want to keep data in registers.	2024-08-20 13:37:03 -04:00
Ramkumar Ramachandra	db791b278a	mlir/LogicalResult: move into llvm (#97309) This patch is part of a project to move the Presburger library into LLVM.	2024-07-02 10:42:33 +01:00
Jakub Kuderski	c0345b4648	[mlir][gpu] Add subgroup_reduce to shuffle lowering (#76530) This supports both the scalar and the vector multi-reduction cases.	2024-01-02 16:14:22 -05:00
Jakub Kuderski	2af186f9bd	[mlir][gpu] Add patterns to break down subgroup reduce (#76271) The new patterns break down subgroup reduce ops with vector values into a sequence of subgroup reductions that fit the native shuffle size. The maximum/native shuffle size is parametrized. The overall goal is to be able to perform multi-element reductions with a sequence of `gpu.shuffle` ops.	2024-12-28 14:39:46 -05:00

13 Commits
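
The stale-lane behaviour described in the message of commit 893ef7ffbd can be reproduced with a small Python model. This is only a sketch of the steps as the message words them, not of the actual DPP hardware semantics or the MLIR lowering: the row width of 16, the helper names (`row_reduce`, `broadcast_add`, `step5_src`, `step6_src`, `simulate`), and the choice that a lane with no incoming value re-adds its own value are all assumptions made for illustration.

```python
ROW = 16     # lanes per row (assumed from the arrays in the commit message)
LANES = 64   # subgroup size used in the commit message's example


def row_reduce(vals):
    """Steps 1-4: every lane ends up holding the sum of its own row."""
    out = []
    for start in range(0, len(vals), ROW):
        total = sum(vals[start:start + ROW])
        out.extend([total] * ROW)
    return out


def broadcast_add(vals, src_of):
    """Each lane adds an incoming value.

    Lanes for which src_of returns None receive nothing from another lane;
    in this toy model they re-add their own (stale) value.
    """
    out = []
    for lane, v in enumerate(vals):
        src = src_of(lane)
        incoming = vals[lane] if src is None else vals[src]
        out.append(v + incoming)
    return out


def step5_src(lane):
    # The last lane of the previous row feeds the next row; row 0 has no source.
    row = lane // ROW
    return row * ROW - 1 if row > 0 else None


def step6_src(lane):
    # Lane 31 feeds the last two rows; lanes 0-31 have no source.
    return 31 if lane >= 32 else None


def simulate(vals):
    after4 = row_reduce(vals)
    after5 = broadcast_add(after4, step5_src)
    after6 = broadcast_add(after5, step6_src)
    return after4, after5, after6


initial = [1] + [0] * (LANES - 1)
after4, after5, after6 = simulate(initial)
print(after6[0], after6[LANES - 1])  # stale first lane vs. last lane
```

For the example input, the model yields the same three arrays as the commit message: the first lane ends at 4 while the last lane holds the true sum of 1, which is why the final value has to be read from a lane that actually received the broadcast values.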