329 Commits

Author SHA1 Message Date
Krzysztof Drewniak
c08b2c75db
[milr][gpu] Make barrier elimination address-space aware (#178101)
Upgrade the barrier eliminiation pass to account for the address spaces
of accessed memory when deciding which barriers to eliminiate. In
particular, a loop that only reads and writes global memory that has a
workgoup-memory-fencing barrier inside of it will now have that barrier
marked for elimiination, as the global memory traffic is not being
synchronized by the barrier.

The pass is also adjusted to ignore barriers whose memory fencing list
is [], as those do not synchronize memory and therefore the logic in
this pass would potentially incorrectly remove them after proving that
fact.

---------

Co-authored-by: Jakub Kuderski <kubakuderski@gmail.com>
2026-02-04 08:39:44 -08:00
Mehdi Amini
fba111de5e [MLIR] Apply clang-tidy fixes for bugprone-argument-comment in PromoteShuffleToAMDGPU.cpp (NFC) 2026-02-02 10:27:20 -08:00
Zichen Lu
fbffdaa174
[MLIR][GPU] Update serializeToObject to use SerializedObject wrapper and include ISA compiler logs (#176697)
This PR makes the compilation log from ISA compiler available to users
by returning it as part of the `gpu::ObjectAttr` properties, following
the existing pattern like `LLVMIRToISATimeInMs`.

Currently, the compiler log (which contains useful information such as
spill statistics when --verbose is passed) is only accessible in debug
builds via `LLVM_DEBUG`. However, there are good reasons to make this
information available in release builds as well:

1. Both `ptxas` and `libnvptxcompiler` are publicly available
tools/libraries distributed with the CUDA Toolkit. The `--verbose` flag
and its output are documented public features, not internal debug
information.
2. The verbose output provides valuable insights for users.

A new `SerializedObject` class is used to carry the metadata alongside
the binary when returning from `serializeObject`.
2026-01-30 12:56:20 +01:00
Jakub Kuderski
59e44799bd
[mlir] Fix new clang-tidy warning llvm-type-switch-case-types. NFC. (#178487)
Pre-commiting this before landing the new check in
https://github.com/llvm/llvm-project/pull/177892
2026-01-28 19:13:47 +00:00
Krzysztof Drewniak
3446ff1e67
[mlir] Update all-reduce (& vector tests) to use workgroup barriers (#178285)
This commit updates the lowering of all-reduce operations to annotate
the generated barriers with `memfence [#gpu.address_space<workgroup>]`
so that these barriers do not force unrelated global memory operations
to complete. It similarly sets up the warp synchronization function in
the vectory distribuhte tests, since they also only read/write shared
memory.

In additon, this commit adds convenience builders for gpu.barrier, which
will allow it to either fence on a given address space or on the address
space of a provided memref.
2026-01-27 13:09:18 -08:00
Mehdi Amini
24b5cf154b [MLIR] Apply clang-tidy fixes for llvm-else-after-return in EliminateBarriers.cpp (NFC) 2026-01-27 07:46:37 -08:00
Srinivasa Ravi
1e468b2813
[MLIR][GPU][NVVM] Add verify-target-arch option to nvvm-attach-target pass (#176774)
This change adds the `verify-target-arch` option to the
`nvvm-attach-target` to control the `verifyTarget` parameter in the
attached `NVVMTargetAttr` which is used to enable/disable the
verification of the target architecture with respect to the NVVM Ops.
2026-01-22 17:18:22 +05:30
Adam Paszke
064cbec2c1
[MLIR][GPU] Make sure to propagate known cluster sizes in kernel outlining (#176894)
Otherwise, the changes from #174404 don't kick in.
2026-01-21 11:10:40 +01:00
Longsheng Mou
ad8d9e1428
[mlir][gpu] Use arith dialect to lower gpu.global_id (#171614)
This PR lowers the`gpu.global_id` op using the arith dialect instead of
the index dialect. Fixes #171303.
2025-12-13 18:43:12 +08:00
Mehdi Amini
971e124f0d [MLIR] Apply clang-tidy fixes for misc-use-internal-linkage in AsyncRegionRewriter.cpp (NFC) 2025-11-13 04:39:15 -08:00
Jakub Kuderski
4c21d0cb14
[ADT] Prepare to deprecate variadic StringSwitch::Cases. NFC. (#166020)
Update all uses of variadic `.Cases` to use the initializer list
overload instead. I plan to mark variadic `.Cases` as deprecated in a
followup PR.

For more context, see https://github.com/llvm/llvm-project/pull/163117.
2025-11-02 00:12:33 +00:00
James Newling
0928f46c69
[MLIR][GPU] Ensure all lanes in cluster have final reduction value (#165764)
This is a fix for a cluster size of 32 when the subgroup size is 64.
Previously, only lanes [16, 32) u [48, 64) contained the correct
clusterwise reduction value. This PR adds a swizzle instruction to
broadcast the correct value down to lanes [0, 16) u [32, 48).
2025-10-31 09:12:43 -07:00
Jakub Kuderski
ba0be89cd2
[mlir] Simplify Default cases in type switches. NFC. (#165767)
Use default values instead of lambdas when possible. `std::nullopt` and
`nullptr` can be used now because of
https://github.com/llvm/llvm-project/pull/165724.
2025-10-30 15:10:59 -04:00
Mehdi Amini
8907adc28c [MLIR] Apply clang-tidy fixes for bugprone-argument-comment in SubgroupReduceLowering.cpp (NFC) 2025-09-29 01:10:16 -07:00
Georgios Pinitas
c1b211034b
[mlir][gpu] Add innermost-first policy when mapping loops to GPU IDs (#160634) 2025-09-25 17:23:53 +01:00
Mehdi Amini
b88e5ca5f9
[MLIR] Adopt LDBG() in EliminateBarriers.cpp (NFC) (#155092)
Also add an extra optional TYPE argument to the LDBG() macro to make it
easier to punctually overide DEBUG_TYPE.
2025-08-27 10:45:58 +02:00
Tim Gymnich
003cbbd4ca
[mlir][amdgpu] Promote gpu.shuffle to amdgpu.permlane_swap (#154933)
- promote `gpu.shuffle %src xor {16,32} 64` to `amdgpu.permlane_swap
%src {16,32}`
2025-08-24 12:41:09 +02:00
Sang Ik Lee
baae949f19
[MLIR][GPU][XeVM] Add XeVM target and XeVM dialect integration tests. (#148286)
As part of XeVM dialect upsteaming, covers remaining parts required for XeVM dialect integration and testing.
It has two high level components
- XeVM target and serialization support
- XeVM dialect integration tests using level zero runtime

Co-Authored-by: Artem Kroviakov <artem.kroviakov@intel.com>
2025-08-13 13:17:10 -07:00
Longsheng Mou
2edee0bc79
[mlir][gpu] Support outlining nested gpu.launch (#152696)
This PR fixes a crash in `GpuKernelOutliningPass` that occurred when
encountering a symbol that was not a `FlatSymbolRefAttr`, enabling
outlining of nested `gpu.launch` operations. Fixes #149318.
2025-08-13 11:42:52 +08:00
Longsheng Mou
7d886fab74
[mlir][gpu] Update attribute definitions in gpu::LaunchOp (#152106)
`gpu::LaunchOp` is updated the following way:
- Change the attribute type of kernel function and module from
`SymbolRefAttr` to `FlatSymbolRefAttr` to avoid nested symbol
references.
- Rename variables from camel case (kernelFunc, kernelModule) to lower
case (function, module) and update the syntax.
- `LaunchOp::build` support passing `module` and `function` attributes.
2025-08-08 11:43:21 +08:00
Maksim Levental
c090ed53fb
[mlir][NFC] update mlir/Dialect create APIs (33/n) (#150659)
See https://github.com/llvm/llvm-project/pull/147168 for more info.
2025-07-25 16:13:55 -04:00
Longsheng Mou
3eb49c482c
[mlir][NFC] Use hasOneBlock instead of llvm::hasSingleElement(region) (#149809) 2025-07-24 10:11:21 +08:00
Kazu Hirata
0925d7572a
[mlir] Remove unused includes (NFC) (#150266)
These are identified by misc-include-cleaner.  I've filtered out those
that break builds.  Also, I'm staying away from llvm-config.h,
config.h, and Compiler.h, which likely cause platform- or
compiler-specific build failures.
2025-07-23 15:18:53 -07:00
Maksim Levental
dce6679cf5
[mlir][NFC] update mlir/Dialect create APIs (16/n) (#149922)
See https://github.com/llvm/llvm-project/pull/147168 for more info.
2025-07-21 19:57:30 -04:00
Sang Ik Lee
61004b7eb5
[MLIR][GPU] Add xevm-attach-target transform pass. (#147372)
Add xevm-attach-target transform pass and unit-tests.

Co-authored-by: by Sang Ik Lee sang.ik.lee@intel.com.
Co-authored-by: Artem Kroviakov artem.kroviakov@intel.com
2025-07-10 15:44:26 -05:00
Kazu Hirata
54bd936ec9
[mlir] Remove unused includes (NFC) (#147455)
These are identified by misc-include-cleaner.  I've filtered out those
that break builds.  Also, I'm staying away from llvm-config.h,
config.h, and Compiler.h, which likely cause platform- or
compiler-specific build failures.
2025-07-07 23:40:44 -07:00
Kazu Hirata
28f6f87061
[mlir] Migrate away from std::nullopt (NFC) (#145523)
ArrayRef has a constructor that accepts std::nullopt.  This
constructor dates back to the days when we still had llvm::Optional.

Since the use of std::nullopt outside the context of std::optional is
kind of abuse and not intuitive to new comers, I would like to move
away from the constructor and eventually remove it.

This patch migrates away from std::nullopt in favor of ArrayRef<T>()
where we use perfect forwarding.  Note that {} would be ambiguous for
perfect forwarding to work.
2025-06-25 11:49:22 -07:00
Skrai Pardus
a45fda6aeb
switch type and value ordering for arith Constant[XX]Op (#144636)
This change standardizes the order of the parameters for `Constant[XXX]
Ops` to match with all other `Op` `build()` constructors.

In all instances of generated code for the MLIR dialects's Ops (that is
the TableGen using the .td files to create the .h.inc/.cpp.inc files),
the desired result type is always specified before the value.

Examples: 
```
// ArithOps.h.inc
class ConstantOp : public ::mlir::Op<ConstantOp, ::mlir::OpTrait::ZeroRegions, ::mlir::OpTrait::OneResult, ::mlir::OpTrait::OneTypedResult<::mlir::Type>::Impl, ::mlir::OpTrait::ZeroSuccessors, ::mlir::OpTrait::ZeroOperands, ::mlir::OpTrait::OpInvariants, ::mlir::BytecodeOpInterface::Trait, ::mlir::OpTrait::ConstantLike, ::mlir::ConditionallySpeculatable::Trait, ::mlir::OpTrait::AlwaysSpeculatableImplTrait, ::mlir::MemoryEffectOpInterface::Trait, ::mlir::OpAsmOpInterface::Trait, ::mlir::InferIntRangeInterface::Trait, ::mlir::InferTypeOpInterface::Trait> {
public:
....
static void build(::mlir::OpBuilder &odsBuilder, ::mlir::OperationState &odsState, ::mlir::Type result, ::mlir::TypedAttr value);
  static void build(::mlir::OpBuilder &odsBuilder, ::mlir::OperationState &odsState, ::mlir::TypedAttr value);
  static void build(::mlir::OpBuilder &odsBuilder, ::mlir::OperationState &odsState, ::mlir::TypeRange resultTypes, ::mlir::TypedAttr value);
  static void build(::mlir::OpBuilder &, ::mlir::OperationState &odsState, ::mlir::TypeRange resultTypes, ::mlir::ValueRange operands, ::llvm::ArrayRef<::mlir::NamedAttribute> attributes = {});
  static void build(::mlir::OpBuilder &odsBuilder, ::mlir::OperationState &odsState, ::mlir::ValueRange operands, ::llvm::ArrayRef<::mlir::NamedAttribute> attributes = {});
...
```
```
// ArithOps.h.inc
class SubIOp : public ::mlir::Op<SubIOp, ::mlir::OpTrait::ZeroRegions, ::mlir::OpTrait::OneResult, ::mlir::OpTrait::OneTypedResult<::mlir::Type>::Impl, ::mlir::OpTrait::ZeroSuccessors, ::mlir::OpTrait::NOperands<2>::Impl, ::mlir::OpTrait::OpInvariants, ::mlir::BytecodeOpInterface::Trait, ::mlir::ConditionallySpeculatable::Trait, ::mlir::OpTrait::AlwaysSpeculatableImplTrait, ::mlir::MemoryEffectOpInterface::Trait, ::mlir::InferIntRangeInterface::Trait, ::mlir::arith::ArithIntegerOverflowFlagsInterface::Trait, ::mlir::OpTrait::SameOperandsAndResultType, ::mlir::VectorUnrollOpInterface::Trait, ::mlir::OpTrait::Elementwise, ::mlir::OpTrait::Scalarizable, ::mlir::OpTrait::Vectorizable, ::mlir::OpTrait::Tensorizable, ::mlir::InferTypeOpInterface::Trait> {
public:
...
static void build(::mlir::OpBuilder &odsBuilder, ::mlir::OperationState &odsState, ::mlir::Type result, ::mlir::Value lhs, ::mlir::Value rhs, ::mlir::arith::IntegerOverflowFlagsAttr overflowFlags);
  static void build(::mlir::OpBuilder &odsBuilder, ::mlir::OperationState &odsState, ::mlir::Value lhs, ::mlir::Value rhs, ::mlir::arith::IntegerOverflowFlagsAttr overflowFlags);
  static void build(::mlir::OpBuilder &odsBuilder, ::mlir::OperationState &odsState, ::mlir::TypeRange resultTypes, ::mlir::Value lhs, ::mlir::Value rhs, ::mlir::arith::IntegerOverflowFlagsAttr overflowFlags);
  static void build(::mlir::OpBuilder &odsBuilder, ::mlir::OperationState &odsState, ::mlir::Type result, ::mlir::Value lhs, ::mlir::Value rhs, ::mlir::arith::IntegerOverflowFlags overflowFlags = ::mlir::arith::IntegerOverflowFlags::none);
  static void build(::mlir::OpBuilder &odsBuilder, ::mlir::OperationState &odsState, ::mlir::Value lhs, ::mlir::Value rhs, ::mlir::arith::IntegerOverflowFlags overflowFlags = ::mlir::arith::IntegerOverflowFlags::none);
  static void build(::mlir::OpBuilder &odsBuilder, ::mlir::OperationState &odsState, ::mlir::TypeRange resultTypes, ::mlir::Value lhs, ::mlir::Value rhs, ::mlir::arith::IntegerOverflowFlags overflowFlags = ::mlir::arith::IntegerOverflowFlags::none);
  static void build(::mlir::OpBuilder &, ::mlir::OperationState &odsState, ::mlir::TypeRange resultTypes, ::mlir::ValueRange operands, ::llvm::ArrayRef<::mlir::NamedAttribute> attributes = {});
  static void build(::mlir::OpBuilder &odsBuilder, ::mlir::OperationState &odsState, ::mlir::ValueRange operands, ::llvm::ArrayRef<::mlir::NamedAttribute> attributes = {});
...
```
In comparison, in the distinct case of `ConstantIntOp` and
`ConstantFloatOp`, the ordering of the result type and the value is
switched.

Thus, this PR corrects the ordering of the aforementioned
`Constant[XXX]Ops` to match with other constructors.
2025-06-23 23:35:50 +02:00
Kazu Hirata
887222e352
[mlir] Migrate away from ArrayRef(std::nullopt) (NFC) (#144989)
ArrayRef has a constructor that accepts std::nullopt.  This
constructor dates back to the days when we still had llvm::Optional.

Since the use of std::nullopt outside the context of std::optional is
kind of abuse and not intuitive to new comers, I would like to move
away from the constructor and eventually remove it.

This patch takes care of the mlir side of the migration, starting with
straightforward places where I see ArrayRef or ValueRange nearby.
Note that ValueRange has a constructor that forwards arguments to an
ArrayRef constructor.
2025-06-20 08:33:59 -07:00
Muzammil
893ef7ffbd
[mlir][GPU] Fixes subgroup reduce lowering (#141825)
Fixes the final reduction steps which were taken from an implementation
of scan, not reduction, causing lanes earlier in the wave to have
incorrect results due to masking.

Now aligning more closely with triton implementation :
https://github.com/triton-lang/triton/pull/5019

# Hypothetical example
To provide an explanation of the issue with the current implementation,
let's take the simple example of attempting to perform a sum over 64
lanes where the initial values are as follows (first lane has value 1,
and all other lanes have value 0):
```
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```
When performing a sum reduction over these 64 lanes, in the current
implementation we perform 6 dpp instructions which in sequential order
do the following:
1) sum over clusters of 2 contiguous lanes
2) sum over clusters of 4 contiguous lanes
3) sum over clusters of 8 contiguous lanes
4) sum over an entire row
5) broadcast the result of last lane in each row to the next row and
each lane sums current value with incoming value.
5) broadcast the result of the 32nd lane to last two rows and each lane
sums current value with incoming value.

After step 4) the result for the example above looks like this:

```
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

After step 5) the result looks like this:
```
[2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

After step 6) the result looks like this:
```
[4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
```
Note that the correct value here is always 1, yet after the
`dpp.broadcast` ops some lanes have incorrect values. The reason is that
for these incorrect lanes, like lanes 0-15 in step 5, the
`dpp.broadcast` op doesn't provide them incoming values from other
lanes. Instead these lanes are provided either their own values, or 0
(depending on whether `bound_ctrl` is true or false) as values to sum
over, either way these values are stale and these lanes shouldn't be
used in general.

So what this means:
- For a subgroup reduce over 32 lanes (like Step 5), the correct result
is stored in lanes 16 to 31
- For a subgroup reduce over 64 lanes (like Step 6), the correct result
is stored in lanes 32 to 63.

However in the current implementation we do not specifically read the
value from one of the correct lanes when returning a final value. In
some workloads it seems without this specification, the stale value from
the first lane is returned instead.

# Actual failing test
For a specific example of how the current implementation causes issues,
take a look at the IR below which represents an additive reduction over
a dynamic dimension.
```
!matA = tensor<1x?xf16>
!matB = tensor<1xf16>
#map = affine_map<(d0, d1) -> (d0, d1)>
#map1 = affine_map<(d0, d1) -> (d0)>
func.func @only_producer_fusion_multiple_result(%arg0: !matA) -> !matB {
  %cst_1 = arith.constant 0.000000e+00 : f16
  %c2_i64 = arith.constant 2 : i64
  %0 = tensor.empty() : !matB
  %2 = linalg.fill ins(%cst_1 : f16) outs(%0 : !matB) -> !matB
  %4 = linalg.generic {indexing_maps = [#map, #map1], iterator_types = ["parallel", "reduction"]} ins(%arg0 : !matA) outs(%2 : !matB)  {
  ^bb0(%in: f16, %out: f16):
    %7 = arith.addf %in, %out : f16
    linalg.yield %7 : f16
  } -> !matB
  return %4 : !matB
}
```
When provided an input of type `tensor<1x2xf16>` and values `{0, 1}` to
perform the reduction over, the value returned is consistently 4. By the
same analysis done above, this shows that the returned value is coming
from one of these stale lanes and needs to be read instead from one of
the lanes storing the correct result.

Signed-off-by: Muzammiluddin Syed <muzasyed@amd.com>
2025-05-28 17:47:22 -05:00
Shay Kleiman
ffb9bbfd07
[mlir][MemRef] Changed AssumeAlignment into a Pure ViewLikeOp (#139521)
Made AssumeAlignment a ViewLikeOp that returns a new SSA memref equal
to its memref argument and made it have Pure trait. This
gives it a defined memory effect that matches what it does in practice
and makes it behave nicely with optimizations which won't get rid of it
unless its result isn't being used.
2025-05-18 13:50:29 +03:00
Ivan Butygin
91f3cdbd4f
[mlir][gpu] Pattern to promote gpu.shuffle to specialized AMDGPU ops (#137109)
Only swizzle promotion for now, may add DPP ops support later.
2025-05-13 13:26:46 +03:00
Alan Li
0ba1361478
[MLIR][GPU] Use arith instead of index for subgroup_id (#137843)
Trying to simplify situation by using `arith` dialect instead of `index`
in the rewriting of `gpu.subgroup_id`.
2025-04-30 09:03:24 -04:00
Alan Li
ac65b2c327
[MLIR][GPU] Add a pattern to rewrite gpu.subgroup_id (#137671)
This patch impelemnts a rewrite pattern for transforming
`gpu.subgroup_id` to:
```
subgroup_id = linearized_thread_id / gpu.subgroup_size
```

where:
```
linearized_thread_id = thread_id.x + block_dim.x * (thread_id.y + block_dim.y * thread_id.z)
```
2025-04-29 10:54:48 -04:00
Kazu Hirata
d1e85a0ea0
[mlir] Use range constructors of *Set (NFC) (#137563) 2025-04-27 17:52:41 -07:00
Muzammil
905f1d8068
[mlir][AMDGPU] Implement gpu.subgroup_reduce with DPP intrinsics on AMD GPUs (#133204)
When performing cross-lane reductions using subgroup_reduce ops across
contiguous lanes on AMD GPUs, lower to Data Parallel Primitives (DPP)
ops when possible. This reduces latency on applicable devices.
See related [Issue](https://github.com/iree-org/iree/issues/20007)
To do:
- Improve lowering to subgroup_reduce in compatible matvecs (these get
directly lowered to gpu.shuffles in an earlier pass)

---------

Signed-off-by: Muzammiluddin Syed <muzasyed@amd.com>
2025-04-23 17:37:32 -07:00
Krzysztof Drewniak
bf3b3d012c
[mlir][GPU] Don't look into neighboring functions for barrier elimination (#135293)
If a `func.func` is nested in some other operation, the barrier
eliminator's recursion into parents will examine the neighbors of each
function. Therefore, don't recurse into the parent of an operation if
that operation is IsolatedFromAbove, like a func.func is.

Furthermore, define functions as a region that executes only once,
since, within the context of this pass (which runs on functions) it is
true.
2025-04-15 07:04:24 -07:00
Ivan Butygin
d893d129e6
[mlir] GPUToROCDL: Fix crashes with unsupported shuffle datatypes (#135504)
Calling `getIntOrFloatBitWidth` on non-int/float types (`gpu.shuffle`
also accepts vectors) will crash.
2025-04-13 20:26:19 +02:00
Kazu Hirata
3041fa6c7a
[mlir] Use *Set::insert_range (NFC) (#132326)
DenseSet, SmallPtrSet, SmallSet, SetVector, and StringSet recently
gained C++23-style insert_range.  This patch replaces:

  Dest.insert(Src.begin(), Src.end());

with:

  Dest.insert_range(Src);

This patch does not touch custom begin like succ_begin for now.
2025-03-20 22:24:17 -07:00
lorenzo chelini
556a64507b
[MLIR][NFC] Retire let constructor for GPU (#129849)
`let constructor` is legacy (do not use in tree!) since the table gen
backend emits most of the glue logic to build a pass.
2025-03-06 11:48:24 +01:00
Guray Ozen
837b89fc0f
[MLIR][NVVM] Add ptxas-cmd-options to pass flags to the downstream compiler (#127457)
This PR adds `cmd-options` to the `gpu-lower-to-nvvm-pipeline` pipeline
and the `nvvm-attach-target` pass, allowing users to pass flags to the
downstream compiler, *ptxas*.

Example:
```
mlir-opt -gpu-lower-to-nvvm-pipeline="cubin-chip=sm_80 ptxas-cmd-options='-v --register-usage-level=8'"
```
2025-02-17 12:09:27 +01:00
Matthias Springer
6aaa8f25b6
[mlir][IR][NFC] Move free-standing functions to MemRefType (#123465)
Turn free-standing `MemRefType`-related helper functions in
`BuiltinTypes.h` into member functions.
2025-01-21 08:48:09 +01:00
Jacques Pienaar
09dfc5713d
[mlir] Enable decoupling two kinds of greedy behavior. (#104649)
The greedy rewriter is used in many different flows and it has a lot of
convenience (work list management, debugging actions, tracing, etc). But
it combines two kinds of greedy behavior 1) how ops are matched, 2)
folding wherever it can.

These are independent forms of greedy and leads to inefficiency. E.g.,
cases where one need to create different phases in lowering and is
required to applying patterns in specific order split across different
passes. Using the driver one ends up needlessly retrying folding/having
multiple rounds of folding attempts, where one final run would have
sufficed.

Of course folks can locally avoid this behavior by just building their
own, but this is also a common requested feature that folks keep on
working around locally in suboptimal ways.

For downstream users, there should be no behavioral change. Updating
from the deprecated should just be a find and replace (e.g., `find ./
-type f -exec sed -i
's|applyPatternsAndFoldGreedily|applyPatternsGreedily|g' {} \;` variety)
as the API arguments hasn't changed between the two.
2024-12-20 08:15:48 -08:00
Mehdi Amini
72e8b9aeaa
[MLIR] Add a BlobAttr interface for attribute to wrap arbitrary content and use it as linkLibs for ModuleToObject (#120116)
This change allows to expose through an interface attributes wrapping
content as external resources, and the usage inside the ModuleToObject
show how we will be able to provide runtime libraries without relying on
the filesystem.
2024-12-17 01:30:56 +01:00
Renaud Kauffmann
9919295cfd
[mlir][gpu] Adding ELF section option to the gpu-module-to-binary pass (#119440)
This is a follow-up of #117246.

I thought then it would be easy to edit a DictionaryAttr but it turns
out that these attributes are immutable and need to be passed during the
construction of the gpu.binary Op.

The first commit was using the NVVMTargetAttr to pass the information.
After feedback from @fabianmcg, this PR now passes the information
through a new option of the gpu-module-to-binary pass.

Please add reviewers, as you see fit.
2024-12-16 09:09:41 -08:00
Petr Kurapov
bc29fc937c
[MLIR] Create GPU utils library & move distribution utils (#119264)
Continue the move of `warp_execute_on_lane_0` op to the gpu dialect
(#116994). This patch creates a utils library in GPU and moves generic
helper functions there.
2024-12-13 10:26:57 +01:00
Zhen Wang
516d6ede12
[mlir][gpu] Add optional attributes of kernelModule and kernelFunc for outlining kernels. (#118861)
Adding optional attributes so we can specify the kernel function names
and the kernel module names generated.
2024-12-06 12:33:34 -08:00
Oleksandr "Alex" Zinenko
5ce4d4c775
[mlir] fix memory effects in GPU barrier elimination (#117432)
Existing implementation may trigger infinite cycles when collecting
effects above or below the current block after wrapping around a
loop-like construct. Limit this case to only looking at the immediate
block (loop body). This is correct because wrap around is intended to
consider effects of different iterations of the same loop and shouldn't
be existing the loop block.

Reported-by: Fabian Mora <fmora.dev@gmail.com>
Co-authored-by: Fabian Mora <fmora.dev@gmail.com>
2024-11-24 23:25:11 +01:00
donald chen
889b67c9d3
[mlir] [memref] add more checks to the memref.reinterpret_cast (#112669)
Operation memref.reinterpret_cast was accept input like:

%out = memref.reinterpret_cast %in to offset: [%offset], sizes: [10],
strides: [1]
         : memref<?xf32> to memref<10xf32>

A problem arises: while lowering, the true offset of %out is %offset,
but its data type indicates an offset of 0. Permitting this
inconsistency can result in incorrect outcomes, as certain pass might
erroneously extract the offset from the data type of %out.

This patch fixes this by enforcing that the return value's data type
aligns
with the input parameter.
2024-10-26 08:07:51 +08:00
Andrea Faulds
a800ffac41
[mlir][gpu] Disjoint patterns for lowering clustered subgroup reduce (#109158)
Making the existing populateGpuLowerSubgroupReduceToShufflePatterns()
function also cover the new "clustered" subgroup reductions is proving
to be inconvenient, because certain backends may have more specific
lowerings that only cover the non-clustered type, and this creates pass
ordering constraints. This commit removes coverage of clustered
reductions from this function in favour of a new separate function,
which makes controlling the lowering much more straightforward.
2024-09-18 15:55:53 -04:00