llvm-project

Author	SHA1	Message	Date
Krzysztof Drewniak	c08b2c75db	[mlir][gpu] Make barrier elimination address-space aware (#178101 ) Upgrade the barrier elimination pass to account for the address spaces of accessed memory when deciding which barriers to eliminate. In particular, a loop that only reads and writes global memory that has a workgroup-memory-fencing barrier inside of it will now have that barrier marked for elimination, as the global memory traffic is not being synchronized by the barrier. The pass is also adjusted to ignore barriers whose memory fencing list is [], as those do not synchronize memory and therefore the logic in this pass would potentially incorrectly remove them after proving that fact. --------- Co-authored-by: Jakub Kuderski <kubakuderski@gmail.com>	2026-02-04 08:39:44 -08:00
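The address-space reasoning described in the commit above can be illustrated with a toy model. This is only a sketch, not the pass's implementation: the function name `barrier_is_removable`, the string labels for address spaces, and the convention that `None` means "fence every address space" are all assumptions made for illustration, and the real pass analyses memory-effect conflicts rather than plain sets.

```python
from typing import FrozenSet, Optional


def barrier_is_removable(
    fenced: Optional[FrozenSet[str]],
    accessed_before: FrozenSet[str],
    accessed_after: FrozenSet[str],
) -> bool:
    """Toy decision: may a barrier be dropped, given what it fences?

    fenced          -- address spaces the barrier fences; None is assumed
                       here to mean "all address spaces" (hypothetical).
    accessed_before -- address spaces touched before the barrier.
    accessed_after  -- address spaces touched after the barrier.
    """
    # A barrier with an empty fence list synchronizes no memory, so a
    # memory-based argument must not be used to remove it: leave it alone.
    if fenced is not None and len(fenced) == 0:
        return False

    def relevant(spaces: FrozenSet[str]) -> FrozenSet[str]:
        # Only accesses to fenced address spaces are ordered by the barrier.
        return spaces if fenced is None else spaces & fenced

    # Conservative simplification: if one side performs no access to a
    # fenced address space, there is nothing for the barrier to order.
    return not relevant(accessed_before) or not relevant(accessed_after)


workgroup_fence = frozenset({"workgroup"})
global_only = frozenset({"global"})

# Loop body that only touches global memory around a workgroup fence.
print(barrier_is_removable(workgroup_fence, global_only, global_only))
# Same fence, but both sides touch workgroup memory.
print(barrier_is_removable(workgroup_fence, workgroup_fence, workgroup_fence))
```

The first call models the case from the commit message (global-only traffic around a workgroup-fencing barrier); the second models a barrier that is still needed.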
Mehdi Amini	fba111de5e	[MLIR] Apply clang-tidy fixes for bugprone-argument-comment in PromoteShuffleToAMDGPU.cpp (NFC)	2026-02-02 10:27:20 -08:00
Zichen Lu	fbffdaa174	[MLIR][GPU] Update serializeToObject to use SerializedObject wrapper and include ISA compiler logs (#176697 ) This PR makes the compilation log from ISA compiler available to users by returning it as part of the `gpu::ObjectAttr` properties, following the existing pattern like `LLVMIRToISATimeInMs`. Currently, the compiler log (which contains useful information such as spill statistics when --verbose is passed) is only accessible in debug builds via `LLVM_DEBUG`. However, there are good reasons to make this information available in release builds as well: 1. Both `ptxas` and `libnvptxcompiler` are publicly available tools/libraries distributed with the CUDA Toolkit. The `--verbose` flag and its output are documented public features, not internal debug information. 2. The verbose output provides valuable insights for users. A new `SerializedObject` class is used to carry the metadata alongside the binary when returning from `serializeObject`.	2026-01-30 12:56:20 +01:00
Jakub Kuderski	59e44799bd	[mlir] Fix new clang-tidy warning llvm-type-switch-case-types. NFC. (#178487 ) Pre-committing this before landing the new check in https://github.com/llvm/llvm-project/pull/177892	2026-01-28 19:13:47 +00:00
Krzysztof Drewniak	3446ff1e67	[mlir] Update all-reduce (& vector tests) to use workgroup barriers (#178285 ) This commit updates the lowering of all-reduce operations to annotate the generated barriers with `memfence [#gpu.address_space<workgroup>]` so that these barriers do not force unrelated global memory operations to complete. It similarly sets up the warp synchronization function in the vector distribute tests, since they also only read/write shared memory. In addition, this commit adds convenience builders for gpu.barrier, which will allow it to either fence on a given address space or on the address space of a provided memref.	2026-01-27 13:09:18 -08:00
Mehdi Amini	24b5cf154b	[MLIR] Apply clang-tidy fixes for llvm-else-after-return in EliminateBarriers.cpp (NFC)	2026-01-27 07:46:37 -08:00
Srinivasa Ravi	1e468b2813	[MLIR][GPU][NVVM] Add verify-target-arch option to nvvm-attach-target pass (#176774 ) This change adds the `verify-target-arch` option to the `nvvm-attach-target` pass to control the `verifyTarget` parameter in the attached `NVVMTargetAttr`, which is used to enable/disable the verification of the target architecture with respect to the NVVM Ops.	2026-01-22 17:18:22 +05:30
Adam Paszke	064cbec2c1	[MLIR][GPU] Make sure to propagate known cluster sizes in kernel outlining (#176894 ) Otherwise, the changes from #174404 don't kick in.	2026-01-21 11:10:40 +01:00
Longsheng Mou	ad8d9e1428	[mlir][gpu] Use `arith` dialect to lower gpu.global_id (#171614 ) This PR lowers the `gpu.global_id` op using the arith dialect instead of the index dialect. Fixes #171303.	2025-12-13 18:43:12 +08:00
Mehdi Amini	971e124f0d	[MLIR] Apply clang-tidy fixes for misc-use-internal-linkage in AsyncRegionRewriter.cpp (NFC)	2025-11-13 04:39:15 -08:00
Jakub Kuderski	4c21d0cb14	[ADT] Prepare to deprecate variadic `StringSwitch::Cases`. NFC. (#166020 ) Update all uses of variadic `.Cases` to use the initializer list overload instead. I plan to mark variadic `.Cases` as deprecated in a followup PR. For more context, see https://github.com/llvm/llvm-project/pull/163117.	2025-11-02 00:12:33 +00:00
James Newling	0928f46c69	[MLIR][GPU] Ensure all lanes in cluster have final reduction value (#165764 ) This is a fix for a cluster size of 32 when the subgroup size is 64. Previously, only lanes [16, 32) u [48, 64) contained the correct clusterwise reduction value. This PR adds a swizzle instruction to broadcast the correct value down to lanes [0, 16) u [32, 48).	2025-10-31 09:12:43 -07:00
Jakub Kuderski	ba0be89cd2	[mlir] Simplify Default cases in type switches. NFC. (#165767 ) Use default values instead of lambdas when possible. `std::nullopt` and `nullptr` can be used now because of https://github.com/llvm/llvm-project/pull/165724.	2025-10-30 15:10:59 -04:00
Mehdi Amini	8907adc28c	[MLIR] Apply clang-tidy fixes for bugprone-argument-comment in SubgroupReduceLowering.cpp (NFC)	2025-09-29 01:10:16 -07:00
Georgios Pinitas	c1b211034b	[mlir][gpu] Add innermost-first policy when mapping loops to GPU IDs (#160634 )	2025-09-25 17:23:53 +01:00
Mehdi Amini	b88e5ca5f9	[MLIR] Adopt LDBG() in EliminateBarriers.cpp (NFC) (#155092 ) Also add an extra optional TYPE argument to the LDBG() macro to make it easier to selectively override DEBUG_TYPE.	2025-08-27 10:45:58 +02:00
Tim Gymnich	003cbbd4ca	[mlir][amdgpu] Promote gpu.shuffle to amdgpu.permlane_swap (#154933 ) - promote `gpu.shuffle %src xor {16,32} 64` to `amdgpu.permlane_swap %src {16,32}`	2025-08-24 12:41:09 +02:00
Sang Ik Lee	baae949f19	[MLIR][GPU][XeVM] Add XeVM target and XeVM dialect integration tests. (#148286 ) As part of XeVM dialect upstreaming, covers remaining parts required for XeVM dialect integration and testing. It has two high level components - XeVM target and serialization support - XeVM dialect integration tests using level zero runtime Co-Authored-by: Artem Kroviakov <artem.kroviakov@intel.com>	2025-08-13 13:17:10 -07:00
Longsheng Mou	2edee0bc79	[mlir][gpu] Support outlining nested `gpu.launch` (#152696 ) This PR fixes a crash in `GpuKernelOutliningPass` that occurred when encountering a symbol that was not a `FlatSymbolRefAttr`, enabling outlining of nested `gpu.launch` operations. Fixes #149318.	2025-08-13 11:42:52 +08:00
Longsheng Mou	7d886fab74	[mlir][gpu] Update attribute definitions in `gpu::LaunchOp` (#152106 ) `gpu::LaunchOp` is updated in the following way: - Change the attribute type of kernel function and module from `SymbolRefAttr` to `FlatSymbolRefAttr` to avoid nested symbol references. - Rename variables from camel case (kernelFunc, kernelModule) to lower case (function, module) and update the syntax. - `LaunchOp::build` supports passing `module` and `function` attributes.	2025-08-08 11:43:21 +08:00
Maksim Levental	c090ed53fb	[mlir][NFC] update `mlir/Dialect` create APIs (33/n) (#150659 ) See https://github.com/llvm/llvm-project/pull/147168 for more info.	2025-07-25 16:13:55 -04:00
Longsheng Mou	3eb49c482c	[mlir][NFC] Use `hasOneBlock` instead of `llvm::hasSingleElement(region)` (#149809 )	2025-07-24 10:11:21 +08:00
Kazu Hirata	0925d7572a	[mlir] Remove unused includes (NFC) (#150266 ) These are identified by misc-include-cleaner. I've filtered out those that break builds. Also, I'm staying away from llvm-config.h, config.h, and Compiler.h, which likely cause platform- or compiler-specific build failures.	2025-07-23 15:18:53 -07:00
Maksim Levental	dce6679cf5	[mlir][NFC] update `mlir/Dialect` create APIs (16/n) (#149922 ) See https://github.com/llvm/llvm-project/pull/147168 for more info.	2025-07-21 19:57:30 -04:00
Sang Ik Lee	61004b7eb5	[MLIR][GPU] Add xevm-attach-target transform pass. (#147372 ) Add xevm-attach-target transform pass and unit-tests. Co-authored-by: Sang Ik Lee sang.ik.lee@intel.com. Co-authored-by: Artem Kroviakov artem.kroviakov@intel.com	2025-07-10 15:44:26 -05:00
Kazu Hirata	54bd936ec9	[mlir] Remove unused includes (NFC) (#147455 ) These are identified by misc-include-cleaner. I've filtered out those that break builds. Also, I'm staying away from llvm-config.h, config.h, and Compiler.h, which likely cause platform- or compiler-specific build failures.	2025-07-07 23:40:44 -07:00
Kazu Hirata	28f6f87061	[mlir] Migrate away from std::nullopt (NFC) (#145523 ) ArrayRef has a constructor that accepts std::nullopt. This constructor dates back to the days when we still had llvm::Optional. Since the use of std::nullopt outside the context of std::optional is kind of abuse and not intuitive to newcomers, I would like to move away from the constructor and eventually remove it. This patch migrates away from std::nullopt in favor of ArrayRef<T>() where we use perfect forwarding. Note that {} would be ambiguous for perfect forwarding to work.	2025-06-25 11:49:22 -07:00
Skrai Pardus	a45fda6aeb	switch type and value ordering for arith `Constant[XX]Op` (#144636 ) This change standardizes the order of the parameters for `Constant[XXX] Ops` to match with all other `Op` `build()` constructors. In all instances of generated code for the MLIR dialects's Ops (that is the TableGen using the .td files to create the .h.inc/.cpp.inc files), the desired result type is always specified before the value. Examples: ``` // ArithOps.h.inc class ConstantOp : public ::mlir::Op<ConstantOp, ::mlir::OpTrait::ZeroRegions, ::mlir::OpTrait::OneResult, ::mlir::OpTrait::OneTypedResult<::mlir::Type>::Impl, ::mlir::OpTrait::ZeroSuccessors, ::mlir::OpTrait::ZeroOperands, ::mlir::OpTrait::OpInvariants, ::mlir::BytecodeOpInterface::Trait, ::mlir::OpTrait::ConstantLike, ::mlir::ConditionallySpeculatable::Trait, ::mlir::OpTrait::AlwaysSpeculatableImplTrait, ::mlir::MemoryEffectOpInterface::Trait, ::mlir::OpAsmOpInterface::Trait, ::mlir::InferIntRangeInterface::Trait, ::mlir::InferTypeOpInterface::Trait> { public: .... static void build(::mlir::OpBuilder &odsBuilder, ::mlir::OperationState &odsState, ::mlir::Type result, ::mlir::TypedAttr value); static void build(::mlir::OpBuilder &odsBuilder, ::mlir::OperationState &odsState, ::mlir::TypedAttr value); static void build(::mlir::OpBuilder &odsBuilder, ::mlir::OperationState &odsState, ::mlir::TypeRange resultTypes, ::mlir::TypedAttr value); static void build(::mlir::OpBuilder &, ::mlir::OperationState &odsState, ::mlir::TypeRange resultTypes, ::mlir::ValueRange operands, ::llvm::ArrayRef<::mlir::NamedAttribute> attributes = {}); static void build(::mlir::OpBuilder &odsBuilder, ::mlir::OperationState &odsState, ::mlir::ValueRange operands, ::llvm::ArrayRef<::mlir::NamedAttribute> attributes = {}); ... 
``` ``` // ArithOps.h.inc class SubIOp : public ::mlir::Op<SubIOp, ::mlir::OpTrait::ZeroRegions, ::mlir::OpTrait::OneResult, ::mlir::OpTrait::OneTypedResult<::mlir::Type>::Impl, ::mlir::OpTrait::ZeroSuccessors, ::mlir::OpTrait::NOperands<2>::Impl, ::mlir::OpTrait::OpInvariants, ::mlir::BytecodeOpInterface::Trait, ::mlir::ConditionallySpeculatable::Trait, ::mlir::OpTrait::AlwaysSpeculatableImplTrait, ::mlir::MemoryEffectOpInterface::Trait, ::mlir::InferIntRangeInterface::Trait, ::mlir::arith::ArithIntegerOverflowFlagsInterface::Trait, ::mlir::OpTrait::SameOperandsAndResultType, ::mlir::VectorUnrollOpInterface::Trait, ::mlir::OpTrait::Elementwise, ::mlir::OpTrait::Scalarizable, ::mlir::OpTrait::Vectorizable, ::mlir::OpTrait::Tensorizable, ::mlir::InferTypeOpInterface::Trait> { public: ... static void build(::mlir::OpBuilder &odsBuilder, ::mlir::OperationState &odsState, ::mlir::Type result, ::mlir::Value lhs, ::mlir::Value rhs, ::mlir::arith::IntegerOverflowFlagsAttr overflowFlags); static void build(::mlir::OpBuilder &odsBuilder, ::mlir::OperationState &odsState, ::mlir::Value lhs, ::mlir::Value rhs, ::mlir::arith::IntegerOverflowFlagsAttr overflowFlags); static void build(::mlir::OpBuilder &odsBuilder, ::mlir::OperationState &odsState, ::mlir::TypeRange resultTypes, ::mlir::Value lhs, ::mlir::Value rhs, ::mlir::arith::IntegerOverflowFlagsAttr overflowFlags); static void build(::mlir::OpBuilder &odsBuilder, ::mlir::OperationState &odsState, ::mlir::Type result, ::mlir::Value lhs, ::mlir::Value rhs, ::mlir::arith::IntegerOverflowFlags overflowFlags = ::mlir::arith::IntegerOverflowFlags::none); static void build(::mlir::OpBuilder &odsBuilder, ::mlir::OperationState &odsState, ::mlir::Value lhs, ::mlir::Value rhs, ::mlir::arith::IntegerOverflowFlags overflowFlags = ::mlir::arith::IntegerOverflowFlags::none); static void build(::mlir::OpBuilder &odsBuilder, ::mlir::OperationState &odsState, ::mlir::TypeRange resultTypes, ::mlir::Value lhs, ::mlir::Value rhs, 
::mlir::arith::IntegerOverflowFlags overflowFlags = ::mlir::arith::IntegerOverflowFlags::none); static void build(::mlir::OpBuilder &, ::mlir::OperationState &odsState, ::mlir::TypeRange resultTypes, ::mlir::ValueRange operands, ::llvm::ArrayRef<::mlir::NamedAttribute> attributes = {}); static void build(::mlir::OpBuilder &odsBuilder, ::mlir::OperationState &odsState, ::mlir::ValueRange operands, ::llvm::ArrayRef<::mlir::NamedAttribute> attributes = {}); ... ``` In comparison, in the distinct case of `ConstantIntOp` and `ConstantFloatOp`, the ordering of the result type and the value is switched. Thus, this PR corrects the ordering of the aforementioned `Constant[XXX]Ops` to match with other constructors.	2025-06-23 23:35:50 +02:00
Kazu Hirata	887222e352	[mlir] Migrate away from ArrayRef(std::nullopt) (NFC) (#144989 ) ArrayRef has a constructor that accepts std::nullopt. This constructor dates back to the days when we still had llvm::Optional. Since the use of std::nullopt outside the context of std::optional is kind of abuse and not intuitive to newcomers, I would like to move away from the constructor and eventually remove it. This patch takes care of the mlir side of the migration, starting with straightforward places where I see ArrayRef or ValueRange nearby. Note that ValueRange has a constructor that forwards arguments to an ArrayRef constructor.	2025-06-20 08:33:59 -07:00
Muzammil	893ef7ffbd	[mlir][GPU] Fixes subgroup reduce lowering (#141825 ) Fixes the final reduction steps which were taken from an implementation of scan, not reduction, causing lanes earlier in the wave to have incorrect results due to masking. Now aligning more closely with the Triton implementation: https://github.com/triton-lang/triton/pull/5019 # Hypothetical example To provide an explanation of the issue with the current implementation, let's take the simple example of attempting to perform a sum over 64 lanes where the initial values are as follows (first lane has value 1, and all other lanes have value 0): ``` [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] ``` When performing a sum reduction over these 64 lanes, in the current implementation we perform 6 dpp instructions which in sequential order do the following: 1) sum over clusters of 2 contiguous lanes 2) sum over clusters of 4 contiguous lanes 3) sum over clusters of 8 contiguous lanes 4) sum over an entire row 5) broadcast the result of last lane in each row to the next row and each lane sums current value with incoming value. 6) broadcast the result of the 32nd lane to last two rows and each lane sums current value with incoming value.
After step 4) the result for the example above looks like this: ``` [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] ``` After step 5) the result looks like this: ``` [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] ``` After step 6) the result looks like this: ``` [4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] ``` Note that the correct value here is always 1, yet after the `dpp.broadcast` ops some lanes have incorrect values. The reason is that for these incorrect lanes, like lanes 0-15 in step 5, the `dpp.broadcast` op doesn't provide them incoming values from other lanes. Instead these lanes are provided either their own values, or 0 (depending on whether `bound_ctrl` is true or false) as values to sum over, either way these values are stale and these lanes shouldn't be used in general. So what this means: - For a subgroup reduce over 32 lanes (like Step 5), the correct result is stored in lanes 16 to 31 - For a subgroup reduce over 64 lanes (like Step 6), the correct result is stored in lanes 32 to 63. However in the current implementation we do not specifically read the value from one of the correct lanes when returning a final value. In some workloads it seems without this specification, the stale value from the first lane is returned instead. # Actual failing test For a specific example of how the current implementation causes issues, take a look at the IR below which represents an additive reduction over a dynamic dimension. 
``` !matA = tensor<1x?xf16> !matB = tensor<1xf16> #map = affine_map<(d0, d1) -> (d0, d1)> #map1 = affine_map<(d0, d1) -> (d0)> func.func @only_producer_fusion_multiple_result(%arg0: !matA) -> !matB { %cst_1 = arith.constant 0.000000e+00 : f16 %c2_i64 = arith.constant 2 : i64 %0 = tensor.empty() : !matB %2 = linalg.fill ins(%cst_1 : f16) outs(%0 : !matB) -> !matB %4 = linalg.generic {indexing_maps = [#map, #map1], iterator_types = ["parallel", "reduction"]} ins(%arg0 : !matA) outs(%2 : !matB) { ^bb0(%in: f16, %out: f16): %7 = arith.addf %in, %out : f16 linalg.yield %7 : f16 } -> !matB return %4 : !matB } ``` When provided an input of type `tensor<1x2xf16>` and values `{0, 1}` to perform the reduction over, the value returned is consistently 4. By the same analysis done above, this shows that the returned value is coming from one of these stale lanes and needs to be read instead from one of the lanes storing the correct result. Signed-off-by: Muzammiluddin Syed <muzasyed@amd.com>	2025-05-28 17:47:22 -05:00
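The lane values quoted in the commit message above can be reproduced with a small simulation. This is a minimal sketch under stated assumptions, not the actual DPP semantics: rows are taken to be 16 lanes wide, lanes that receive no broadcast are assumed to add their own value (one of the two `bound_ctrl` outcomes the message mentions), and the helper name `row_broadcast_add` is hypothetical.

```python
ROW = 16    # lanes per row, as implied by the 16-wide groups in the message
LANES = 64  # subgroup size used in the example


def row_broadcast_add(values, source_lane_for_row, receiving_rows):
    """Add an incoming value to every lane.

    Lanes in `receiving_rows` receive the value of the lane returned by
    `source_lane_for_row(row)`. All other lanes are assumed to receive
    their own (stale) value, which is what makes their result wrong.
    """
    out = []
    for lane, value in enumerate(values):
        row = lane // ROW
        if row in receiving_rows:
            incoming = values[source_lane_for_row(row)]
        else:
            incoming = value  # stale operand: the lane's own value
        out.append(value + incoming)
    return out


# State after step 4 (sum over each row): only row 0 saw the initial 1.
after_step4 = [1] * ROW + [0] * (LANES - ROW)

# Step 5: the last lane of each row is broadcast to the next row.
after_step5 = row_broadcast_add(
    after_step4, lambda row: row * ROW - 1, receiving_rows={1, 2, 3}
)

# Step 6: lane 31 is broadcast to the last two rows.
after_step6 = row_broadcast_add(
    after_step5, lambda row: 31, receiving_rows={2, 3}
)

print(after_step5)
print(after_step6)
print("value in lane 0:", after_step6[0], "value in lane 63:", after_step6[-1])
```

Under these assumptions only lanes 32 to 63 end up with the true sum, which is why reading the result from the first lane yields the stale value described in the message.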
Shay Kleiman	ffb9bbfd07	[mlir][MemRef] Changed AssumeAlignment into a Pure ViewLikeOp (#139521 ) Made AssumeAlignment a ViewLikeOp that returns a new SSA memref equal to its memref argument and made it have Pure trait. This gives it a defined memory effect that matches what it does in practice and makes it behave nicely with optimizations which won't get rid of it unless its result isn't being used.	2025-05-18 13:50:29 +03:00
Ivan Butygin	91f3cdbd4f	[mlir][gpu] Pattern to promote `gpu.shuffle` to specialized AMDGPU ops (#137109 ) Only swizzle promotion for now, may add DPP ops support later.	2025-05-13 13:26:46 +03:00
Alan Li	0ba1361478	[MLIR][GPU] Use arith instead of index for subgroup_id (#137843 ) Trying to simplify situation by using `arith` dialect instead of `index` in the rewriting of `gpu.subgroup_id`.	2025-04-30 09:03:24 -04:00
Alan Li	ac65b2c327	[MLIR][GPU] Add a pattern to rewrite gpu.subgroup_id (#137671 ) This patch implements a rewrite pattern for transforming `gpu.subgroup_id` to: ``` subgroup_id = linearized_thread_id / gpu.subgroup_size ``` where: ``` linearized_thread_id = thread_id.x + block_dim.x * (thread_id.y + block_dim.y * thread_id.z) ```	2025-04-29 10:54:48 -04:00
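The two formulas in the commit above can be checked with plain integer arithmetic. The following is an illustrative sketch only: the function name `subgroup_id` and the tuple-based arguments are hypothetical, and the division is assumed to be an integer (floor) division on non-negative values.

```python
def subgroup_id(thread_id, block_dim, subgroup_size):
    """Compute the subgroup index of a thread from the formulas above.

    thread_id     -- (x, y, z) thread index within the block
    block_dim     -- (x, y, z) block dimensions
    subgroup_size -- number of lanes per subgroup
    """
    tx, ty, tz = thread_id
    bx, by, _bz = block_dim  # block_dim.z is not needed for linearization
    linearized_thread_id = tx + bx * (ty + by * tz)
    return linearized_thread_id // subgroup_size


# A 32x4x1 block with 32-lane subgroups: each y-row is its own subgroup.
print(subgroup_id((0, 1, 0), (32, 4, 1), 32))
# An 8x4x2 block with 16-lane subgroups.
print(subgroup_id((5, 2, 1), (8, 4, 2), 16))
```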
Kazu Hirata	d1e85a0ea0	[mlir] Use range constructors of *Set (NFC) (#137563 )	2025-04-27 17:52:41 -07:00
Muzammil	905f1d8068	[mlir][AMDGPU] Implement gpu.subgroup_reduce with DPP intrinsics on AMD GPUs (#133204 ) When performing cross-lane reductions using subgroup_reduce ops across contiguous lanes on AMD GPUs, lower to Data Parallel Primitives (DPP) ops when possible. This reduces latency on applicable devices. See related [Issue](https://github.com/iree-org/iree/issues/20007) To do: - Improve lowering to subgroup_reduce in compatible matvecs (these get directly lowered to gpu.shuffles in an earlier pass) --------- Signed-off-by: Muzammiluddin Syed <muzasyed@amd.com>	2025-04-23 17:37:32 -07:00
Krzysztof Drewniak	bf3b3d012c	[mlir][GPU] Don't look into neighboring functions for barrier elimination (#135293 ) If a `func.func` is nested in some other operation, the barrier eliminator's recursion into parents will examine the neighbors of each function. Therefore, don't recurse into the parent of an operation if that operation is IsolatedFromAbove, like a func.func is. Furthermore, define functions as a region that executes only once, since, within the context of this pass (which runs on functions) it is true.	2025-04-15 07:04:24 -07:00
Ivan Butygin	d893d129e6	[mlir] GPUToROCDL: Fix crashes with unsupported shuffle datatypes (#135504 ) Calling `getIntOrFloatBitWidth` on non-int/float types (`gpu.shuffle` also accepts vectors) will crash.	2025-04-13 20:26:19 +02:00
Kazu Hirata	3041fa6c7a	[mlir] Use *Set::insert_range (NFC) (#132326 ) DenseSet, SmallPtrSet, SmallSet, SetVector, and StringSet recently gained C++23-style insert_range. This patch replaces: Dest.insert(Src.begin(), Src.end()); with: Dest.insert_range(Src); This patch does not touch custom begin like succ_begin for now.	2025-03-20 22:24:17 -07:00
lorenzo chelini	556a64507b	[MLIR][NFC] Retire let constructor for GPU (#129849 ) `let constructor` is legacy (do not use in tree!) since the table gen backend emits most of the glue logic to build a pass.	2025-03-06 11:48:24 +01:00
Guray Ozen	837b89fc0f	[MLIR][NVVM] Add `ptxas-cmd-options` to pass flags to the downstream compiler (#127457 ) This PR adds `cmd-options` to the `gpu-lower-to-nvvm-pipeline` pipeline and the `nvvm-attach-target` pass, allowing users to pass flags to the downstream compiler, ptxas. Example: ``` mlir-opt -gpu-lower-to-nvvm-pipeline="cubin-chip=sm_80 ptxas-cmd-options='-v --register-usage-level=8'" ```	2025-02-17 12:09:27 +01:00
Matthias Springer	6aaa8f25b6	[mlir][IR][NFC] Move free-standing functions to `MemRefType` (#123465 ) Turn free-standing `MemRefType`-related helper functions in `BuiltinTypes.h` into member functions.	2025-01-21 08:48:09 +01:00
Jacques Pienaar	09dfc5713d	[mlir] Enable decoupling two kinds of greedy behavior. (#104649 ) The greedy rewriter is used in many different flows and it has a lot of convenience (work list management, debugging actions, tracing, etc). But it combines two kinds of greedy behavior 1) how ops are matched, 2) folding wherever it can. These are independent forms of greedy and lead to inefficiency. E.g., cases where one needs to create different phases in lowering and is required to apply patterns in a specific order split across different passes. Using the driver one ends up needlessly retrying folding/having multiple rounds of folding attempts, where one final run would have sufficed. Of course folks can locally avoid this behavior by just building their own, but this is also a commonly requested feature that folks keep on working around locally in suboptimal ways. For downstream users, there should be no behavioral change. Updating from the deprecated API should just be a find and replace (e.g., `find ./ -type f -exec sed -i 's|applyPatternsAndFoldGreedily|applyPatternsGreedily|g' {} \;` variety) as the API arguments haven't changed between the two.	2024-12-20 08:15:48 -08:00
Mehdi Amini	72e8b9aeaa	[MLIR] Add a BlobAttr interface for attribute to wrap arbitrary content and use it as linkLibs for ModuleToObject (#120116 ) This change allows exposing, through an interface, attributes that wrap content as external resources, and the usage inside ModuleToObject shows how we will be able to provide runtime libraries without relying on the filesystem.	2024-12-17 01:30:56 +01:00
Renaud Kauffmann	9919295cfd	[mlir][gpu] Adding ELF section option to the gpu-module-to-binary pass (#119440 ) This is a follow-up of #117246. I thought then it would be easy to edit a DictionaryAttr but it turns out that these attributes are immutable and need to be passed during the construction of the gpu.binary Op. The first commit was using the NVVMTargetAttr to pass the information. After feedback from @fabianmcg, this PR now passes the information through a new option of the gpu-module-to-binary pass. Please add reviewers, as you see fit.	2024-12-16 09:09:41 -08:00
Petr Kurapov	bc29fc937c	[MLIR] Create GPU utils library & move distribution utils (#119264 ) Continue the move of `warp_execute_on_lane_0` op to the gpu dialect (#116994). This patch creates a utils library in GPU and moves generic helper functions there.	2024-12-13 10:26:57 +01:00
Zhen Wang	516d6ede12	[mlir][gpu] Add optional attributes of kernelModule and kernelFunc for outlining kernels. (#118861 ) Adding optional attributes so we can specify the kernel function names and the kernel module names generated.	2024-12-06 12:33:34 -08:00
Oleksandr "Alex" Zinenko	5ce4d4c775	[mlir] fix memory effects in GPU barrier elimination (#117432 ) Existing implementation may trigger infinite cycles when collecting effects above or below the current block after wrapping around a loop-like construct. Limit this case to only looking at the immediate block (loop body). This is correct because wrap around is intended to consider effects of different iterations of the same loop and shouldn't be exiting the loop block. Reported-by: Fabian Mora <fmora.dev@gmail.com> Co-authored-by: Fabian Mora <fmora.dev@gmail.com>	2024-11-24 23:25:11 +01:00
donald chen	889b67c9d3	[mlir] [memref] add more checks to the memref.reinterpret_cast (#112669 ) Operation memref.reinterpret_cast used to accept input like: %out = memref.reinterpret_cast %in to offset: [%offset], sizes: [10], strides: [1] : memref<?xf32> to memref<10xf32> A problem arises: while lowering, the true offset of %out is %offset, but its data type indicates an offset of 0. Permitting this inconsistency can result in incorrect outcomes, as certain passes might erroneously extract the offset from the data type of %out. This patch fixes this by enforcing that the return value's data type aligns with the input parameter.	2024-10-26 08:07:51 +08:00
Andrea Faulds	a800ffac41	[mlir][gpu] Disjoint patterns for lowering clustered subgroup reduce (#109158 ) Making the existing populateGpuLowerSubgroupReduceToShufflePatterns() function also cover the new "clustered" subgroup reductions is proving to be inconvenient, because certain backends may have more specific lowerings that only cover the non-clustered type, and this creates pass ordering constraints. This commit removes coverage of clustered reductions from this function in favour of a new separate function, which makes controlling the lowering much more straightforward.	2024-09-18 15:55:53 -04:00


329 Commits