llvm-project

Author	SHA1	Message	Date
Longsheng Mou	7d886fab74	[mlir][gpu] Update attribute definitions in `gpu::LaunchOp` (#152106 ) `gpu::LaunchOp` is updated the following way: - Change the attribute type of kernel function and module from `SymbolRefAttr` to `FlatSymbolRefAttr` to avoid nested symbol references. - Rename variables from camel case (kernelFunc, kernelModule) to lower case (function, module) and update the syntax. - `LaunchOp::build` support passing `module` and `function` attributes.	2025-08-08 11:43:21 +08:00
Hsiangkai Wang	0d21522c00	[mlir][gpu] Make offset and width in gpu.rotate as attributes (#150901 ) `offset` and `width` must be constants and there are constraints on their values. Update the operation definition to use attributes instead of operands.	2025-07-29 09:02:42 +01:00
Maksim Levental	dce6679cf5	[mlir][NFC] update `mlir/Dialect` create APIs (16/n) (#149922 ) See https://github.com/llvm/llvm-project/pull/147168 for more info.	2025-07-21 19:57:30 -04:00
Nicolas Vasilache	2b28d10022	[mlir][SCF][GPU] Add DeviceMaskingAttrInterface (#146943 ) This revision adds DeviceMaskingAttrInterface and extends DeviceMappingArrayAttr to accept a union of DeviceMappingAttrInterface and DeviceMaskingAttrInterface. Support is added to GPUTransformOps to take advantage of this information and lower to block/warpgroup/warp/thread specialization when mapped to linear ids. The revision also connects to scf::ForallOp and uses the new attribute to implement warp specialization. The implementation is in the form of a GPUMappingMaskAttr, which can be additionally passed to the scf.forall.mapping attribute to specify a mask on compute resources that should be active. In the first implementation the masking is a bitfield that specifies for each processing unit whether it is active or not. In the future, we may want to implement this as a symbol to refer to dynamically defined values. Extending op semantics with an operand is deemed too intrusive at this time. --------- Co-authored-by: Oleksandr "Alex" Zinenko <git@ozinenko.com>	2025-07-07 18:06:41 +02:00
Nicolas Vasilache	0a62836969	[mlir][gpu][transforms] Add support for mapping to lanes (#146912 ) This revision adds a new attribute for mapping `scf.forall` to linear lane ids. Example: ``` // %arg2 and %arg3 map to lanes [0, 6) and are turned into epxressions // involving threadIdx.x/y by the map_nested_forall_to_threads // transformation. This results in a if (linear_thread_id < 6) conditional. scf.forall (%arg2, %arg3) in (2, 3) { ... } {mapping = [#gpu.lane<linear_dim_0>, #gpu.lane<linear_dim_1>]} ``` --------- Co-authored-by: Oleksandr "Alex" Zinenko <git@ozinenko.com>	2025-07-07 15:14:52 +02:00
Kazu Hirata	ed0ee3a419	[mlir] Use llvm::fill (NFC) (#147100 ) We can pass a range to llvm::fill.	2025-07-04 13:30:14 -07:00
Hsiangkai Wang	f581ef5b66	[mlir][gpu] Add gpu.rotate operation (#142796 ) Add gpu.rotate operation and a pattern to convert gpu.rotate to SPIR-V OpGroupNonUniformRotateKHR.	2025-07-01 11:32:25 +01:00
Kazu Hirata	63f30d7d82	[mlir] Migrate away from {TypeRange,ValueRange}(std::nullopt) (NFC) (#145445 ) ArrayRef has a constructor that accepts std::nullopt. This constructor dates back to the days when we still had llvm::Optional. Since the use of std::nullopt outside the context of std::optional is kind of abuse and not intuitive to new comers, I would like to move away from the constructor and eventually remove it. This patch migrates away from TypeRagne(std::nullopt) and ValueRange(std::nullopt).	2025-06-24 07:03:59 -07:00
Srinivasa Ravi	9a553d3766	[MLIR][NVVM] Add NVVMRequiresSM op traits (#126886 ) Motivation: Currently, the NVVMOps are not verified against the supported SM architectures. This can manifest as an ISel failure in the NVPTX LLVM backend during CodeGen to PTX ISA. This PR addresses this issue by adding verifier checks for Target-SM architectures in the NVVM Dialect itself, thereby catching the errors early on. Summary: * Parametric traits named `NVVMRequiresSM` and `NVVMRequiresSMa` are added to facilitate the version checks for typical and arch-accelerated versions respectively. * These traits can be attached to any NVVM Op to enable the checks for the particular Op. (example shown below) * An attribute interface called named `TargetAttrVerifyInterface` is added to the GPU dialect which any target attribute seeking to perform target-verification on the module can implement. * The checks are performed by the `NVVMTargetAttr` (implementing the `TargetAttrVerifyInterface` interface) when called from the GPU module verifier where it walks through the module and performs the checks for Ops with the `NVVMRequiresSM` traits. * A few Ops in `NVVMOps.td` have been updated to serve as examples. Example Usage: ``` def NVVM_ReduxOp : NVVM_Op<"redux.sync"> {...} ----> def NVVM_ReduxOp : NVVM_Op<"redux.sync", [NVVMRequiresSM<80>]> {...} def NVVM_WgmmaFenceAlignedOp : NVVM_Op<"wgmma.fence.aligned"> {...} ----> def NVVM_WgmmaFenceAlignedOp : NVVM_Op<"wgmma.fence.aligned", [NVVMRequiresSMa<[90]>]> {...} ``` --------- Co-authored-by: Guray Ozen <guray.ozen@gmail.com>	2025-05-21 08:53:00 +05:30
Kazu Hirata	15f7c6ed70	[mlir] Remove unused local variables (NFC) (#138481 )	2025-05-05 10:08:00 -07:00
Jakub Kuderski	c62afbfeda	[mlir][linalg][gpu] Clean up printing. NFC. (#136330 ) * Use `llvm::interleaved` from #135517 to simplify printing * Avoid needless vector allocations	2025-04-18 15:05:27 -04:00
Jakub Kuderski	4be84a142e	[mlir][gpu] Clean up prints in GPU dialect. NFC. (#136250 ) Clean up printing code by switching to `llvm::interleaved` from https://github.com/llvm/llvm-project/pull/135517. Also make some minor readability & performance fixes.	2025-04-18 11:10:17 -04:00
Zichen Lu	360630b567	[mlir][GPUDialect] Add cmdOption suffix consumer in GpuModuleToBinary Pass (#127646 ) Add cmdOption suffix consumer function in GpuModuleToBinary Pass, which can tokenize and remove a specific suffix of cmdOption.	2025-02-18 19:02:23 +01:00
Guray Ozen	837b89fc0f	[MLIR][NVVM] Add `ptxas-cmd-options` to pass flags to the downstream compiler (#127457 ) This PR adds `cmd-options` to the `gpu-lower-to-nvvm-pipeline` pipeline and the `nvvm-attach-target` pass, allowing users to pass flags to the downstream compiler, ptxas. Example: ``` mlir-opt -gpu-lower-to-nvvm-pipeline="cubin-chip=sm_80 ptxas-cmd-options='-v --register-usage-level=8'" ```	2025-02-17 12:09:27 +01:00
jeanPerier	327d627066	[mlir] share argument attributes interface between calls and callables (#123176 ) This patch shares core interface methods dealing with argument and result attributes from CallableOpInterface with the CallOpInterface and makes them mandatory to gives more consistent guarantees about concrete operations using these interfaces. This allows adding argument attributes on call like operations, which is sometimes required to get proper ABI, like with llvm.call (and llvm.invoke). The patch adds optional `arg_attrs` and `res_attrs` attributes to operations using these interfaces that did not have that already. They can then re-use the common "rich function signature" printing/parsing helpers if they want (for the LLVM dialect, this is done in the next patch). Part of RFC: https://discourse.llvm.org/t/mlir-rfc-adding-argument-and-result-attributes-to-llvm-call/84107	2025-02-03 11:27:14 +01:00
Matthias Springer	6aaa8f25b6	[mlir][IR][NFC] Move free-standing functions to `MemRefType` (#123465 ) Turn free-standing `MemRefType`-related helper functions in `BuiltinTypes.h` into member functions.	2025-01-21 08:48:09 +01:00
Krzysztof Drewniak	0aa831e0ed	[mlir][GPU] Implement ValueBoundsOpInterface for GPU ID operations (#122190 ) The GPU ID operations already implement InferIntRangeInterface, which gives constant lower and upper bounds on those IDs when appropriate metadata is prentent on the operations or in the surrounding context. This commit uses that existing code to implement the ValueBoundsOpInterface, which is used when analyzing affine operations (unlike the integer range interface, which is used for arithmetic optimization). It also implements the interface for gpu.launch, where we can use it to express the constraint that block/grid sizes are equal to their value from outside the launch op and that the corresponding IDs are bounded above by that size. As a consequence, the test pass for this inference is updated to work on a FunctionOpInterface and not a func.func, creating minor churn in other tests.	2025-01-09 11:42:22 -08:00
Timothy Hoffman	fbbbd65b25	[MLIR] correct return type of parse() functions (#120180 ) The `parseX()` functions that are defined to support `custom<X>` in `assemblyFormat` should return `ParseResult` rather than `LogicalResult`. The `ParseResult` type is necessary due to tablegen generating code that expects this type within an Op `parseX()` function.	2024-12-17 09:06:55 -08:00
Mehdi Amini	72e8b9aeaa	[MLIR] Add a BlobAttr interface for attribute to wrap arbitrary content and use it as linkLibs for ModuleToObject (#120116 ) This change allows to expose through an interface attributes wrapping content as external resources, and the usage inside the ModuleToObject show how we will be able to provide runtime libraries without relying on the filesystem.	2024-12-17 01:30:56 +01:00
Renaud Kauffmann	9919295cfd	[mlir][gpu] Adding ELF section option to the gpu-module-to-binary pass (#119440 ) This is a follow-up of #117246. I thought then it would be easy to edit a DictionaryAttr but it turns out that these attributes are immutable and need to be passed during the construction of the gpu.binary Op. The first commit was using the NVVMTargetAttr to pass the information. After feedback from @fabianmcg, this PR now passes the information through a new option of the gpu-module-to-binary pass. Please add reviewers, as you see fit.	2024-12-16 09:09:41 -08:00
Mehdi Amini	a9b399aeef	[MLIR][GPU] Fix memref.dim folding with out-of-bound index (#118890 ) Fixes #118760	2024-12-05 16:36:33 -08:00
Petr Kurapov	ecaf2c335c	[MLIR] Move warp_execute_on_lane_0 from vector to gpu (#116994 ) Please see the related RFC here: https://discourse.llvm.org/t/rfc-move-execute-on-lane-0-from-vector-to-gpu-dialect/82989. This patch does exactly one thing - moves the op to gpu.	2024-11-22 15:30:47 +01:00
Zichen Lu	08e7609692	[mlir][fix] Add callback functions for ModuleToObject (#116916 ) Here is the [merged MR](https://github.com/llvm/llvm-project/pull/116007) which caused a failure and [was reverted](https://github.com/llvm/llvm-project/pull/116811). Thanks to @joker-eph for the help, I fix it (miss constructing `ModuleObject` with callback functions in `mlir/lib/Target/LLVM/NVVM/Target.cpp`) and split unit tests from origin test which don't need `ptxas` to make the test runs more widely.	2024-11-20 13:22:08 +01:00
Mehdi Amini	af41c55673	Revert "[MLIR] Add callback functions for ModuleToObject" (#116811 ) Reverts llvm/llvm-project#116007 Bot is broken.	2024-11-19 15:28:17 +01:00
Zichen Lu	2153672ba3	[MLIR] Add callback functions for ModuleToObject (#116007 ) In ModuleToObject flow, users may want to add some callback functions invoked with LLVM IR/ISA for debugging or other purposes.	2024-11-19 13:51:08 +01:00
Andrzej Warzyński	bfde17834d	[mlir] Update the return type of `getNum{Dynamic\|Scalable}Dims` (#110472 ) Updates the return type of `getNumDynamicDims` and `getNumScalableDims` from `int64_t` to `size_t`. This is for consistency with other helpers/methods that return "size" and to reduce the number of `static_cast`s in various places.	2024-09-30 14:53:50 +01:00
Andrea Faulds	3d01f0a33b	[mlir][gpu] Add 'cluster_stride' attribute to gpu.subgroup_reduce (#107142 ) Follow-up to 7aa22f013e24d20291aad745368ff907baa9dfa4, adding an additional attribute needed in some applications.	2024-09-05 09:03:22 -04:00
Fabian Mora	016e1eb9c8	[mlir][gpu] Add metadata attributes for storing kernel metadata in GPU objects (#95292 ) This patch adds the `#gpu.kernel_metadata` and `#gpu.kernel_table` attributes. The `#gpu.kernel_metadata` attribute allows storing metadata related to a compiled kernel, for example, the number of scalar registers used by the kernel. The attribute only has 2 required parameters, the name and function type. It also has 2 optional parameters, the arguments attributes and generic dictionary for storing all other metadata. The `#gpu.kernel_table` stores a table of `#gpu.kernel_metadata`, mapping the name of the kernel to the metadata. Finally, the function `ROCDL::getAMDHSAKernelsELFMetadata` was added to collect ELF metadata from a binary, and to test the class methods in both attributes. Example: ```mlir gpu.binary @binary [#gpu.object<#rocdl.target<chip = "gfx900">, kernels = #gpu.kernel_table<[ #gpu.kernel_metadata<"kernel0", (i32) -> (), metadata = {sgpr_count = 255}>, #gpu.kernel_metadata<"kernel1", (i32, f32) -> (), arg_attrs = [{llvm.read_only}, {}]> ]> , bin = "BLOB">] ``` The motivation behind these attributes is to provide useful information for things like tunning. --------- Co-authored-by: Mehdi Amini <joker.eph@gmail.com>	2024-08-27 18:44:50 -04:00
Andrea Faulds	7aa22f013e	[mlir][gpu] Add 'cluster_size' attribute to gpu.subgroup_reduce (#104851 ) This enables performing several reductions in parallel, each smaller than the size of the subgroup. One potential application is flash attention with subgroup-wide matrix multiplication and reduction combined in one kernel. The multiplication operation requires a 2D matrix to be distributed over the lanes of the subgroup, which then constrains the shape the following reduction can have if we want to keep data in registers.	2024-08-20 13:37:03 -04:00
Matthias Springer	7030280329	[mlir][GPU] Improve `gpu.module` op implementation (#102866 ) - Replace hand-written parser/printer with auto-generated assembly format. - Remove implicit `gpu.module_end` terminator and use the `NoTerminator` trait instead. (Same as `builtin.module`.) - Turn the region into a graph region. (Same as `builtin.module`.)	2024-08-13 09:37:36 +02:00
Matthias Springer	7359a6b799	[mlir][ODS] Verify type constraints in Types and Attributes (#102326 ) When a type/attribute is defined in TableGen, a type constraint can be used for parameters, but the type constraint verification was missing. Example: ``` def TestTypeVerification : Test_Type<"TestTypeVerification"> { let parameters = (ins AnyTypeOf<[I16, I32]>:$param); // ... } ``` No verification code was generated to ensure that `$param` is I16 or I32. When type constraints a present, a new method will generated for types and attributes: `verifyInvariantsImpl`. (The naming is similar to op verifiers.) The user-provided verifier is called `verify` (no change). There is now a new entry point to type/attribute verification: `verifyInvariants`. This function calls both `verifyInvariantsImpl` and `verify`. If neither of those two verifications are present, the `verifyInvariants` function is not generated. When a type/attribute is not defined in TableGen, but a verifier is needed, users can implement the `verifyInvariants` function. (This function was previously called `verify`.) Note for LLVM integration: If you have an attribute/type that is not defined in TableGen (i.e., just C++), you have to rename the verification function from `verify` to `verifyInvariants`. (Most attributes/types have no verification, in which case there is nothing to do.) Depends on #102657.	2024-08-09 22:04:40 +02:00
Ramkumar Ramachandra	db791b278a	mlir/LogicalResult: move into llvm (#97309 ) This patch is part of a project to move the Presburger library into LLVM.	2024-07-02 10:42:33 +01:00
Krzysztof Drewniak	43fd4c49bd	[mlir][GPU] Improve handling of GPU bounds (#95166 ) This change reworks how range information for GPU dispatch IDs (block IDs, thread IDs, and so on) is handled. 1. `known_block_size` and `known_grid_size` become inherent attributes of GPU functions. This makes them less clunky to work with. As a consequence, the `gpu.func` lowering patterns now only look at the inherent attributes when setting target-specific attributes on the `llvm.func` that they lower to. 2. At the same time, `gpu.known_block_size` and `gpu.known_grid_size` are made official dialect-level discardable attributes which can be placed on arbitrary functions. This allows for progressive lowerings (without this, a lowering for `gpu.thread_id` couldn't know about the bounds if it had already been moved from a `gpu.func` to an `llvm.func`) and allows for range information to be provided even when `gpu._{id,dim}` are being used outside of a `gpu.func` context. 3. All of these index operations have gained an optional `upper_bound` attribute, allowing for an alternate mode of operation where the bounds are specified locally and not inherited from the operation's context. These also allow handling of cases where the precise launch sizes aren't known, but can be bounded more precisely than the maximum of what any platform's API allows. (I'd like to thank @benvanik for pointing out that this could be useful.) When inferring bounds (either for range inference or for setting `range` during lowering) these sources of information are consulted in order of specificity (`upper_bound` > inherent attribute > discardable attribute, except that dimension sizes check for `known__bounds` to see if they can be constant-folded before checking their `upper_bound`). This patch also updates the documentation about the bounds and inference behavior to clarify what these attributes do when set and the consequences of setting them up incorrectly. --------- Co-authored-by: Mehdi Amini <joker.eph@gmail.com>	2024-06-17 23:47:38 -05:00
Fabian Mora	f3b4c00304	[mlir][gpu] Add builder to `gpu.launch_func` (#95541 ) This patch adds a builder to `gpu.launch_func` allowing it to be created using `SymbolRefAttr` instead of `GPUFuncOp`. This allows creating `launch_func` when only a `gpu.binary` is present, instead of the full `gpu.module {...}`.	2024-06-15 12:40:14 -05:00
Fabian Mora	54373e0f40	[NFC][mlir][gpu] Make sym_name an inherent attr in GPUModuleOp (#94918 ) Make `sym_name` an inherent attr in GPUModuleOp so that it doesn't show in the discardable attributes. The change is safe as the attribute is always expected to be present.	2024-06-09 17:36:19 -05:00
Kazu Hirata	dec8055a1e	[mlir] Use StringRef::operator== instead of StringRef::equals (NFC) (#91560 ) I'm planning to remove StringRef::equals in favor of StringRef::operator==. - StringRef::operator==/!= outnumber StringRef::equals by a factor of 10 under mlir/ in terms of their usage. - The elimination of StringRef::equals brings StringRef closer to std::string_view, which has operator== but not equals. - S == "foo" is more readable than S.equals("foo"), especially for !Long.Expression.equals("str") vs Long.Expression != "str".	2024-05-08 23:52:22 -07:00
Justin Fargnoli	35d55f2894	[NFC][mlir] Reorder `declarePromisedInterface()` operands (#86628 ) Reorder the template operands of `declarePromisedInterface()` to match `declarePromisedInterfaces()`.	2024-03-27 10:30:17 -07:00
Justin Fargnoli	513cdb8222	[mlir] Declare promised interfaces for all dialects (#78368 ) This PR adds promised interface declarations for all interfaces declared in `InitAllDialects.h`. Promised interfaces allow a dialect to declare that it will have an implementation of a particular interface, crashing the program if one isn't provided when the interface is used.	2024-03-15 20:23:20 -07:00
Matthias Springer	492e8ba038	[mlir] Fix memory leaks after #81759 (#82762 ) This commit fixes memory leaks that were introduced by #81759. The way ops and blocks are erased changed slightly. The leaks were caused by an incorrect implementation of op builders: blocks must be created with the supplied builder object. Otherwise, they are not properly tracked by the dialect conversion and can leak during rollback.	2024-02-23 14:28:57 +01:00
Matthias Springer	5fcf907b34	[mlir][IR] Rename "update root" to "modify op" in rewriter API (#78260 ) This commit renames 4 pattern rewriter API functions: * `updateRootInPlace` -> `modifyOpInPlace` * `startRootUpdate` -> `startOpModification` * `finalizeRootUpdate` -> `finalizeOpModification` * `cancelRootUpdate` -> `cancelOpModification` The term "root" is a misnomer. The root is the op that a rewrite pattern matches against (https://mlir.llvm.org/docs/PatternRewriter/#root-operation-name-optional). A rewriter must be notified of all in-place op modifications, not just in-place modifications of the root (https://mlir.llvm.org/docs/PatternRewriter/#pattern-rewriter). The old function names were confusing and have contributed to various broken rewrite patterns. Note: The new function names use the term "modify" instead of "update" for consistency with the `RewriterBase::Listener` terminology (`notifyOperationModified`).	2024-01-17 11:08:59 +01:00
Fabian Mora	5b4f2b906b	[mlir][gpu] Add an offloading handler attribute to `gpu.module` (#78047 ) This patch adds an optional offloading handler attribute to the`gpu.module` op. This attribute will be used during `gpu-module-to-binary` pass to override the offloading handler used in the `gpu.binary` op.	2024-01-15 16:58:10 -05:00
Guray Ozen	5b33cff397	[mlir][gpu] Add Support for Cluster of Thread Blocks in `gpu.launch` (#76924 )	2024-01-06 11:17:01 +01:00
Jakub Kuderski	72003adf6b	[mlir][gpu] Allow subgroup reductions over 1-d vector types (#76015 ) Each vector element is reduced independently, which is a form of multi-reduction. The plan is to allow for gradual lowering of multi-reduction that results in fewer `gpu.shuffle` ops at the end: 1d `vector.multi_reduction` --> 1d `gpu.subgroup_reduce` --> smaller 1d `gpu.subgroup_reduce` --> packed `gpu.shuffle` over i32 For example we can perform 2 independent f16 reductions with a series of `gpu.shuffles` over i32, reducing the final number of `gpu.shuffles` by 2x.	2023-12-21 11:55:43 -05:00
Jakub Kuderski	560564f51c	[mlir][vector][gpu] Align minf/maxf reduction kind names with arith (#75901 ) This is to avoid confusion when dealing with reduction/combining kinds. For example, see a recent PR comment: https://github.com/llvm/llvm-project/pull/75846#discussion_r1430722175. Previously, they were picked to mostly mirror the names of the llvm vector reduction intrinsics: https://llvm.org/docs/LangRef.html#llvm-vector-reduce-fmin-intrinsic. In isolation, it was not clear if `<maxf>` has `arith.maxnumf` or `arith.maximumf` semantics. The new reduction kind names map 1:1 to arith ops, which makes it easier to tell/look up their semantics. Because both the vector and the gpu dialect depend on the arith dialect, it's more natural to align names with those in arith than with the lowering to llvm intrinsics. Issue: https://github.com/llvm/llvm-project/issues/72354	2023-12-20 00:14:43 -05:00
Jakub Kuderski	7eccd52842	Reland "[mlir][gpu] Align reduction operations with vector combining kinds (#73423 )" This reverts commit dd09221a29506031415cad8a1308998358633d48 and relands https://github.com/llvm/llvm-project/pull/73423. * Updated `gpu.all_reduce` `min`/`max` in CUDA integration tests.	2023-11-27 11:38:18 -05:00
Jakub Kuderski	dd09221a29	Revert "[mlir][gpu] Align reduction operations with vector combining kinds (#73423 )" This reverts commit e0aac8c88d0d30e8da0f8a240ad1e6b4d88782e0. I'm seeing some nvidia integration test failures: https://lab.llvm.org/buildbot/#/builders/61/builds/52334.	2023-11-27 11:29:23 -05:00
Jakub Kuderski	e0aac8c88d	[mlir][gpu] Align reduction operations with vector combining kinds (#73423 ) The motivation for this change is explained in https://github.com/llvm/llvm-project/issues/72354. Before this change, we could not tell between signed/unsigned minimum/maximum and NaN treatment for floating point values. The mapping of old reduction operations to the new ones is as follows: * `min` --> `minsi` for ints, `minf` for floats * `max` --> `maxsi` for ints, `maxf` for floats New reduction kinds not represented in the old enum: `minui`, `maxui`, `minimumf`, `maximumf`. As a next step, I would like to have a common definition of combining kinds used by the `vector` and `gpu` dialects. Separately, the GPU to SPIR-V lowering does not yet properly handle zero and NaN values -- the behavior of floating point min/max group reductions is not specified by the SPIR-V spec, see https://github.com/llvm/llvm-project/issues/73459. Issue: https://github.com/llvm/llvm-project/issues/72354	2023-11-27 11:19:20 -05:00
Guray Ozen	edf5cae739	[mlir][gpu] Support Cluster of Thread Blocks in `gpu.launch_func` (#72871 ) NVIDIA Hopper architecture introduced the Cooperative Group Array (CGA). It is a new level of parallelism, allowing clustering of Cooperative Thread Arrays (CTA) to synchronize and communicate through shared memory while running concurrently. This PR enables support for CGA within the `gpu.launch_func` in the GPU dialect. It extends `gpu.launch_func` to accommodate this functionality. The GPU dialect remains architecture-agnostic, so we've added CGA functionality as optional parameters. We want to leverage mechanisms that we have in the GPU dialects such as outlining and kernel launching, making it a practical and convenient choice. An example of this implementation can be seen below: ``` gpu.launch_func @kernel_module::@kernel clusters in (%1, %0, %0) // <-- Optional blocks in (%0, %0, %0) threads in (%0, %0, %0) ``` The PR also introduces index and dimensions Ops specific to clusters, binding them to NVVM Ops: ``` %cidX = gpu.cluster_id x %cidY = gpu.cluster_id y %cidZ = gpu.cluster_id z %cdimX = gpu.cluster_dim x %cdimY = gpu.cluster_dim y %cdimZ = gpu.cluster_dim z ``` We will introduce cluster support in `gpu.launch` Op in an upcoming PR. See [the documentation](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#cluster-of-cooperative-thread-arrays) provided by NVIDIA for details.	2023-11-27 11:05:07 +01:00
Guray Ozen	ea84897ba3	[mlir][gpu] Introduce `gpu.dynamic_shared_memory` Op (#71546 ) While the `gpu.launch` Op allows setting the size via the `dynamic_shared_memory_size` argument, accessing the dynamic shared memory is very convoluted. This PR implements the proposed Op, `gpu.dynamic_shared_memory` that aims to simplify the utilization of dynamic shared memory. RFC: https://discourse.llvm.org/t/rfc-simplifying-dynamic-shared-memory-access-in-gpu/ Proposal from RFC This PR `gpu.dynamic.shared.memory` Op to use dynamic shared memory feature efficiently. It is is a powerful feature that enables the allocation of shared memory at runtime with the kernel launch on the host. Afterwards, the memory can be accessed directly from the device. I believe similar story exists for AMDGPU. Current way Using Dynamic Shared Memory with MLIR Let me illustrate the challenges of using dynamic shared memory in MLIR with an example below. The process involves several steps: - memref.global 0-sized array LLVM's NVPTX backend expects - dynamic_shared_memory_size Set the size of dynamic shared memory - memref.get_global Access the global symbol - reinterpret_cast and subview Many OPs for pointer arithmetic ``` // Step 1. Create 0-sized global symbol. Manually set the alignment memref.global "private" @dynamicShmem : memref<0xf16, 3> { alignment = 16 } func.func @main() { // Step 2. Allocate shared memory gpu.launch blocks(...) threads(...) dynamic_shared_memory_size %c10000 { // Step 3. Access the global object %shmem = memref.get_global @dynamicShmem : memref<0xf16, 3> // Step 4. A sequence of `memref.reinterpret_cast` and `memref.subview` operations. %4 = memref.reinterpret_cast %shmem to offset: [0], sizes: [14, 64, 128], strides: [8192,128,1] : memref<0xf16, 3> to memref<14x64x128xf16,3> %5 = memref.subview %4[7, 0, 0][7, 64, 128][1,1,1] : memref<14x64x128xf16,3> to memref<7x64x128xf16, strided<[8192, 128, 1], offset: 57344>, 3> %6 = memref.subview %5[2, 0, 0][1, 64, 128][1,1,1] : memref<7x64x128xf16, strided<[8192, 128, 1], offset: 57344>, 3> to memref<64x128xf16, strided<[128, 1], offset: 73728>, 3> %7 = memref.subview %6[0, 0][64, 64][1,1] : memref<64x128xf16, strided<[128, 1], offset: 73728>, 3> to memref<64x64xf16, strided<[128, 1], offset: 73728>, 3> %8 = memref.subview %6[32, 0][64, 64][1,1] : memref<64x128xf16, strided<[128, 1], offset: 73728>, 3> to memref<64x64xf16, strided<[128, 1], offset: 77824>, 3> // Step.5 Use "test.use.shared.memory"(%7) : (memref<64x64xf16, strided<[128, 1], offset: 73728>, 3>) -> (index) "test.use.shared.memory"(%8) : (memref<64x64xf16, strided<[128, 1], offset: 77824>, 3>) -> (index) gpu.terminator } ``` Let’s write the program above with that: ``` func.func @main() { gpu.launch blocks(...) threads(...) dynamic_shared_memory_size %c10000 { %i = arith.constant 18 : index // Step 1: Obtain shared memory directly %shmem = gpu.dynamic_shared_memory : memref<?xi8, 3> %c147456 = arith.constant 147456 : index %c155648 = arith.constant 155648 : index %7 = memref.view %shmem[%c147456][] : memref<?xi8, 3> to memref<64x64xf16, 3> %8 = memref.view %shmem[%c155648][] : memref<?xi8, 3> to memref<64x64xf16, 3> // Step 2: Utilize the shared memory "test.use.shared.memory"(%7) : (memref<64x64xf16, 3>) -> (index) "test.use.shared.memory"(%8) : (memref<64x64xf16, 3>) -> (index) } } ``` This PR resolves #72513	2023-11-16 14:42:17 +01:00
spaceotter	51af040b22	[mlir][gpu] Eliminate redundant gpu.barrier ops (#71575 ) Adds a canonicalizer for gpu.barrier that gets rid of duplicates. Co-authored-by: Eric Eaton <eric@nod-labs.com>	2023-11-09 18:06:20 -05:00

1 2 3 4 5 ...

253 Commits