137 Commits

Author SHA1 Message Date
Ivan Butygin
d8350712b3
[mlir] GPUToROCDL: lower gpu.subgroup_id to the intrinsic where possible (#179422)
Lower `gpu.subgroup_id` to the `wave.id` intrinsic on gfx12+; on older
targets, lower to `linearized_thread_id / subgroup_size`.
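
The fallback formula can be illustrated with a small standalone sketch. It is not the MLIR lowering itself: `Dim3`, `linearizedThreadId`, and `subgroupId` are hypothetical names, and the x-fastest linearization order is an assumption made for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type for illustration; not part of MLIR.
struct Dim3 {
  uint32_t x, y, z;
};

// Linearize a thread ID within its workgroup, assuming x varies fastest.
inline uint32_t linearizedThreadId(Dim3 tid, Dim3 blockDim) {
  assert(tid.x < blockDim.x && tid.y < blockDim.y && tid.z < blockDim.z);
  return tid.x + blockDim.x * (tid.y + blockDim.y * tid.z);
}

// Fallback from the commit message: linearized_thread_id / subgroup_size.
inline uint32_t subgroupId(Dim3 tid, Dim3 blockDim, uint32_t subgroupSize) {
  assert(subgroupSize != 0);
  return linearizedThreadId(tid, blockDim) / subgroupSize;
}
```

For a 64x2x1 workgroup with a subgroup size of 64, thread (5, 1, 0) has linear ID 69 and therefore falls in subgroup 1.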
2026-02-04 00:53:07 +03:00
Krzysztof Drewniak
df739ba008
[mlir][gpu] Add address space modifier to gpu.barrier (#177425)
This is a takeover of PR #110527

This commit adds an optional list of memory fences to gpu.barrier,
allowing users to specify which memory scopes they wish to fence
explicitly, while leaving the default semantics (which are equivalent to
calling for a global and local fence by analogy to CUDA's __syncthreads)
unchanged. The new expanded semantics are implemented for SPIR-V and for
the AMDGPU backend.

See also

https://discourse.llvm.org/t/rfc-add-memory-scope-to-gpu-barrier/81021/2?u=fmarno,
where the default behavior of a gpu.barrier was hashed out (though note
that the examples based on VMCNT are outdated for AMDGPU in that memory
fences can now be annotated with the correct set of address spaces).

This commit also deprecates amdgpu.lds_barrier for use cases that don't
involve targeting gfx908.

Assisted-by: Cursor/Claude code (tests and extending amdgpu.lds_barrier
pattern while copying it over)

---------

Co-authored-by: Finlay Marno <finlay.marno@codeplay.com>
Co-authored-by: Jakub Kuderski <kubakuderski@gmail.com>
Co-authored-by: Alan Li <alan.li@me.com>
2026-01-26 12:08:47 -08:00
Ivan Butygin
8b907a3a20
[mlir] GPUToROCDL: repack unsupported types when lowering subgroup_broadcast (#174206)
Use the same repacking logic as for shuffle/swizzle.
2026-01-06 23:26:47 +03:00
Adam Paszke
9a93769853
[MLIR] Propagate known cluster sizes from gpu.launch to gpu.func (#174404)
This lets us properly annotate ranges for gpu.cluster_block_id and
gpu.cluster_dim_blocks. It also allows us to fill in the
nvvm.cluster_dim attribute for use in the NVVM backend.
2026-01-06 03:49:02 -08:00
Ivan Butygin
f785ca0d72
[mlir][nvgpu] Move memref memspace attributes conversion to single place (#172156)
Also, some fixes for AMDGPU part for better naming.
2025-12-14 12:44:47 +03:00
Ivan Butygin
c22d82a1d4
[mlir][amdgpu] Move GPU memory spaces conversion to single place (#171876) 2025-12-11 21:39:57 +03:00
Keshav Vinayak Jha
fbbffc1169
[MLIR][ROCDL] Add math.clampf -> rocdl.fmed3 conversion (#163520)
Added a pattern for lowering `Math::ClampFOp` to `ROCDL::FMED3`.
Also added a `chipset` option to the `MathToRocdl` pass to check whether
the target architecture supports the required ISA instructions.
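
Why a clamp can map to a median-of-three instruction can be shown with a standalone sketch. This is a plain C++ model rather than the MLIR pattern; `med3` and `clampViaMed3` are hypothetical names, and NaN handling is deliberately ignored.

```cpp
#include <algorithm>
#include <cassert>

// Median of three floats, the operation a med3-style instruction computes.
inline float med3(float a, float b, float c) {
  return std::max(std::min(a, b), std::min(std::max(a, b), c));
}

// clamp(x, lo, hi) equals the median of {x, lo, hi} whenever lo <= hi.
// NaN inputs are outside the scope of this sketch.
inline float clampViaMed3(float x, float lo, float hi) {
  assert(lo <= hi && "clamp bounds must be ordered");
  return med3(x, lo, hi);
}
```

With ordered bounds, the median is the lower bound when `x` is below it, the upper bound when `x` is above it, and `x` otherwise, which is the definition of clamping.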

Solves [#157052](https://github.com/llvm/llvm-project/issues/157052)

Reapplies https://github.com/llvm/llvm-project/pull/160100

Un-reverts the merged https://github.com/llvm/llvm-project/pull/163259,
and fixes the error.

---------

Signed-off-by: Keshav Vinayak Jha <keshavvinayakjha@gmail.com>
2025-10-17 08:14:58 -04:00
Fabian Mora
e34b71e351
Revert "[MLIR][ROCDL] Add math.clampf -> rocdl.fmed3 conversion" (#163447)
Reverts llvm/llvm-project#163259. Reverting due to missing link libraries
causing failures in shared build bots.
2025-10-14 16:33:09 -04:00
Keshav Vinayak Jha
1e6df640e2
[MLIR][ROCDL] Add math.clampf -> rocdl.fmed3 conversion (#163259)
Added a pattern for lowering `Math::ClampFOp` to `ROCDL::FMED3`.
Also added a `chipset` option to the `MathToRocdl` pass to check whether
the target architecture supports the required ISA instructions.

Solves [#157052](https://github.com/llvm/llvm-project/issues/157052)

Reapplies https://github.com/llvm/llvm-project/pull/160100

---------

Signed-off-by: Keshav Vinayak Jha <keshavvinayakjha@gmail.com>
2025-10-14 15:21:26 -05:00
Mehdi Amini
beb6bab87e [MLIR] Apply clang-tidy fixes for llvm-qualified-auto in LowerGpuOpsToROCDLOps.cpp (NFC) 2025-09-16 08:00:52 -07:00
Pablo Antonio Martinez
dd04668138
[mlir][gpu] Refactor GpuOpsToROCDLOps pass interface (NFC) (#157402)
This PR deletes the `createLowerGpuOpsToROCDLOpsPass` constructor from
the .td file, making the `createConvertGpuOpsToROCDLOps` pass available
to users. This has the following effects:

1. `createLowerGpuOpsToROCDLOpsPass` is not available anymore. Instead,
`createConvertGpuOpsToROCDLOps` should be used. This makes the interface
consistent with ConvertGpuOpsToNVVMOps.

2. To call `createConvertGpuOpsToROCDLOps`, the options must be passed
via ConvertGpuOpsToROCDLOpsOptions. This has the side effect of
making the `allowed-dialects` option available, which was not accessible
via C++ before.
2025-09-10 09:04:34 +02:00
Jakub Kuderski
2b3d3fce73
[mlir][gpu] Revert gpu.subgroup_broadcast with any_lane (#157373)
This partially reverts https://github.com/llvm/llvm-project/pull/152808.

Post-commit comments revealed that the `any_lane` variant hadn't been
fully agreed upon at the time of landing.
2025-09-08 00:43:57 +00:00
Ivan Butygin
4880940c84
[mlir][gpu] Add subgroup_broadcast op (#152808)
`subgroup_broadcast` broadcasts a value from one lane to all
lanes in the subgroup.

Supported modes:
* `first_active_lane` - broadcast value from the first active lane in
subgroup.
* `specific_lane` - broadcast value from the specified lane, lane index
must be within subgroup.
* `any_lane` - if the `src` value is uniform across all the subgroup lanes,
return it unchanged; otherwise the result is poison. This variant is
essentially a uniformity hint for the compiler, conveying that a specific
value is uniform across all subgroup lanes. Dropping an `any_lane`
broadcast should not change the code semantics.
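
The two value-moving modes can be modeled on a host-side array of per-lane values. This is only a sketch of the described semantics, not the op's implementation: the function names are hypothetical, and the model writes the result to every lane, including inactive ones, for simplicity.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// first_active_lane: every lane receives the value held by the
// lowest-numbered active lane.
inline std::vector<int> broadcastFirstActive(const std::vector<int> &values,
                                             const std::vector<bool> &active) {
  assert(values.size() == active.size());
  for (std::size_t lane = 0; lane < values.size(); ++lane)
    if (active[lane])
      return std::vector<int>(values.size(), values[lane]);
  return values; // No active lane: nothing to broadcast in this model.
}

// specific_lane: every lane receives the value held by the given lane,
// which must be within the subgroup.
inline std::vector<int> broadcastSpecificLane(const std::vector<int> &values,
                                              std::size_t lane) {
  assert(lane < values.size() && "lane index must be within the subgroup");
  return std::vector<int>(values.size(), values[lane]);
}
```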
2025-08-30 09:25:49 +03:00
Tim Gymnich
003cbbd4ca
[mlir][amdgpu] Promote gpu.shuffle to amdgpu.permlane_swap (#154933)
- promote `gpu.shuffle %src xor {16,32} 64` to `amdgpu.permlane_swap
%src {16,32}`
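
As a rough model of the shuffle being promoted, the sketch below applies an xor shuffle to a 64-lane array on the host. It only models the data movement of an xor shuffle with width 64; `kLanes`, `laneIds`, and `shuffleXor` are hypothetical names, and nothing here describes how `amdgpu.permlane_swap` is implemented.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kLanes = 64;

// Convenience input: lane k holds the value k.
inline std::array<int, kLanes> laneIds() {
  std::array<int, kLanes> ids{};
  for (std::size_t lane = 0; lane < kLanes; ++lane)
    ids[lane] = static_cast<int>(lane);
  return ids;
}

// Xor shuffle model: lane k receives the value held by lane (k ^ offset).
inline std::array<int, kLanes> shuffleXor(const std::array<int, kLanes> &src,
                                          std::size_t offset) {
  assert(offset < kLanes);
  std::array<int, kLanes> dst{};
  for (std::size_t lane = 0; lane < kLanes; ++lane)
    dst[lane] = src[lane ^ offset];
  return dst;
}
```

With offset 16, lanes 0-15 exchange values with lanes 16-31 (and 32-47 with 48-63); with offset 32, the two 32-lane halves are exchanged. Every source lane stays within the 64-lane width for both offsets.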
2025-08-24 12:41:09 +02:00
Krzysztof Drewniak
bbe3d64b39
[mlir][ROCDL] Annotate lane ID functions with noundef, ranges (#151396)
Now that we have general support for setting argument and result
attributes on LLVM intrinsics, extend the definitions of mbcnt.lo and
mbcnt.hi to carry such attributes. With that, update the construction of
the mbcnt.lo/mbcnt.hi calls used to get the lane ID to be `noundef`
(since the lane ID is always defined) and to be annotated with the
correct ranges (so that generic LLVM passes can correctly optimize
based on the fact that there are never more than 32/64 lanes).
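
The range claim can be checked with a host-side model of the mbcnt pair. This is a sketch under assumptions: on hardware each lane knows its own position implicitly, whereas here `lane` is passed explicitly, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Number of set bits in a 32-bit word.
inline uint32_t popcount32(uint32_t v) {
  uint32_t n = 0;
  for (; v != 0; v &= v - 1)
    ++n;
  return n;
}

// Model of mbcnt.lo: count mask bits for lanes 0..31 that are below `lane`,
// then add `base`.
inline uint32_t mbcntLo(uint32_t mask, uint32_t base, uint32_t lane) {
  assert(lane < 64);
  uint32_t below = lane >= 32 ? 0xFFFFFFFFu : ((1u << lane) - 1u);
  return popcount32(mask & below) + base;
}

// Model of mbcnt.hi: count mask bits for lanes 32..63 that are below `lane`,
// then add `base`.
inline uint32_t mbcntHi(uint32_t mask, uint32_t base, uint32_t lane) {
  assert(lane < 64);
  uint32_t below = lane <= 32 ? 0u : ((1u << (lane - 32)) - 1u);
  return popcount32(mask & below) + base;
}

// Lane ID as mbcnt.hi(-1, mbcnt.lo(-1, 0)); always in [0, 64).
inline uint32_t laneId(uint32_t lane) {
  return mbcntHi(0xFFFFFFFFu, mbcntLo(0xFFFFFFFFu, 0u, lane), lane);
}
```

Because the result counts lanes strictly below the current one, it can never reach the wave size, which is what makes the range annotation sound.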

(Also, handle a pattern that wasn't using getLaneId() and get rid of a
dead argument)
2025-08-13 17:44:03 -05:00
Maksim Levental
eaa67a3cf0
[mlir][NFC] update Conversion create APIs (5/n) (#149887)
See https://github.com/llvm/llvm-project/pull/147168 for more info.
2025-07-22 10:40:45 -04:00
Kazu Hirata
fa9adbfda9
[mlir] Remove unused includes (NFC) (#147101)
These are identified by misc-include-cleaner.  I've filtered out those
that break builds.  Also, I'm staying away from llvm-config.h,
config.h, and Compiler.h, which likely cause platform- or
compiler-specific build failures.
2025-07-04 13:30:21 -07:00
Alexander Richardson
07e2ba445d
[AMDGPU] Set AS8 address width to 48 bits
Of the 128 bits of the buffer descriptor, only 48 are address bits, so
following the discussion on https://discourse.llvm.org/t/clarifiying-the-semantics-of-ptrtoint/83987/54,
the logical conclusion is to set the index width to 48 bits instead of
the current value of 128.

Most of the test changes are mechanical datalayout updates, but there
is one actual change: the ptrmask test now uses .i48 instead of .i128
and I had to update SelectionDAGBuilder to correctly extend the mask.

Reviewed By: krzysz00

Pull Request: https://github.com/llvm/llvm-project/pull/139419
2025-05-19 17:26:05 -07:00
Ivan Butygin
91f3cdbd4f
[mlir][gpu] Pattern to promote gpu.shuffle to specialized AMDGPU ops (#137109)
Only swizzle promotion for now, may add DPP ops support later.
2025-05-13 13:26:46 +03:00
Krzysztof Drewniak
2880859604
[mlir][ROCDL] Remove unneeded bf16 expansion in LowerGPUToROCDL (#139603)
The umbrella pass for lowering GPU ops to ROCDL (aka lowering to LLVM
+ the AMDGPU-specific setup) would call the arith patterns that manually
implemented extf and truncf on bfloat because the LLVM AMDGPU backend
used to not support those operations.

Since the backend does now support these operations and has for quite
some time, remove these patterns from the default lowering flow.
2025-05-12 16:42:15 -05:00
Stanley Winata
1c8e5e223f
[mlir][gpu] Fix breaking constructor from GPUSubgroupSizeToROCDL (#137439)
This PR fixes a bug introduced by llvm/llvm-project#137360, which passed
GPUSubgroupSizeToROCDL to a patterns function that does not have a valid
constructor for it. This causes the compilation error below:

error: constructor inherited by 'GPUSubgroupSizeOpToROCDL' from base
class 'ConvertOpToLLVMPattern<mlir::gpu::SubgroupSizeOp>' is implicitly
deleted

Signed-off-by: Stanley Winata <stanley.winata@amd.com>
2025-04-25 20:25:06 -07:00
Kazu Hirata
9799746fea [mlir] Fix a warning
This patch fixes:

  mlir/lib/Conversion/GPUToROCDL/LowerGpuOpsToROCDLOps.cpp:170:5:
  error: default label in switch which covers all enumeration values
  [-Werror,-Wcovered-switch-default]
2025-04-25 16:21:29 -07:00
Alan Li
6e47937eed
[MLIR][ROCDL] Lower gpu.subgroup_size to wavefrontsize (#137360) 2025-04-25 19:21:15 -04:00
Gaurav Verma
60a1f5a8a0
[mlir] added gpu.shuffle mode UP support (#137300)
Added support for `gpu.shuffle` mode `UP`

Signed-off-by: xintin <gaurav.verma@amd.com>
2025-04-25 11:15:37 -07:00
Kazu Hirata
4c17a5c663 [mlir] Fix a warning
This patch fixes:

  mlir/lib/Conversion/GPUToROCDL/LowerGpuOpsToROCDLOps.cpp:140:10:
  error: unused variable 'shflType' [-Werror,-Wunused-variable]
2025-04-18 09:59:19 -07:00
Ivan Butygin
46d1cb8335
[mlir] GPUToROCDL: Add support for non-i32/f32 shuffle types (#136320)
Use recently added repacking utilities to support other datatypes.
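
The general idea behind repacking can be sketched outside MLIR: a value wider than 32 bits is split into 32-bit words, each word goes through the 32-bit primitive, and the results are reassembled. The helper names below are hypothetical and the 32-bit primitive is passed in as a callable; the actual utilities operate on LLVM dialect values and may differ in detail.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// Split a 64-bit float into its low and high 32-bit words.
inline std::array<uint32_t, 2> unpackWords(double v) {
  uint64_t bits;
  std::memcpy(&bits, &v, sizeof bits);
  std::array<uint32_t, 2> words = {
      {static_cast<uint32_t>(bits), static_cast<uint32_t>(bits >> 32)}};
  return words;
}

// Reassemble a 64-bit float from its low and high 32-bit words.
inline double packWords(const std::array<uint32_t, 2> &words) {
  uint64_t bits = (static_cast<uint64_t>(words[1]) << 32) | words[0];
  double v;
  std::memcpy(&v, &bits, sizeof v);
  return v;
}

// Apply a 32-bit-only primitive (e.g. a shuffle) to each word of a double.
template <typename Shuffle32>
double shuffleF64(double v, Shuffle32 shuffle32) {
  std::array<uint32_t, 2> words = unpackWords(v);
  words[0] = shuffle32(words[0]);
  words[1] = shuffle32(words[1]);
  return packWords(words);
}
```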

Also, tighten `gpu.shuffle` verification to reject scalable vectors
2025-04-18 19:53:24 +03:00
Ivan Butygin
d893d129e6
[mlir] GPUToROCDL: Fix crashes with unsupported shuffle datatypes (#135504)
Calling `getIntOrFloatBitWidth` on non-int/float types (`gpu.shuffle`
also accepts vectors) will crash.
2025-04-13 20:26:19 +02:00
Ivan Butygin
aecb764cc2
[mlir][gpu] GPUToROCDL/NVVM: use generic llvm conversion interface instead of hardcoded conversions. (#124439)
Using `ConvertToLLVMPatternInterface` removes hardcoded dialect
conversions from the passes and, more importantly, allows downstream
projects to inject their op/type translations here by registering the
corresponding interface.

Add an `allowed-dialects` option so that users can control which dialects
can be used to populate conversions.
2025-02-13 17:53:12 +03:00
Matthias Springer
599c739905
[mlir][GPU] Add NVVM-specific cf.assert lowering (#120431)
This commit adds an NVIDIA-specific lowering of `cf.assert` to
`__assertfail`.

Note: `getUniqueFormatGlobalName`, `getOrCreateFormatStringConstant` and
`getOrDefineFunction` are moved to `GPUOpsLowering.h`, so that they can
be reused.
2025-01-06 12:00:11 +01:00
Ivan Butygin
0e23cb0cc5
[mlir][nfc] GpuToROCDL: Remove some dead code (#121403) 2024-12-31 20:39:31 +03:00
Ivan Butygin
018b32ca1f
Revert "[mlir][nfc] GpuToROCDL: Remove some dead code" (#121402)
Reverts llvm/llvm-project#121395
2024-12-31 18:55:00 +03:00
Ivan Butygin
0b08e095cc
[mlir][nfc] GpuToROCDL: Remove some dead code (#121395) 2024-12-31 18:54:41 +03:00
Jacques Pienaar
09dfc5713d
[mlir] Enable decoupling two kinds of greedy behavior. (#104649)
The greedy rewriter is used in many different flows and it has a lot of
convenience (work list management, debugging actions, tracing, etc). But
it combines two kinds of greedy behavior 1) how ops are matched, 2)
folding wherever it can.

These are independent forms of greediness, and combining them leads to
inefficiency, e.g., in cases where one needs to create different phases
in a lowering and is required to apply patterns in a specific order split
across different passes. Using the driver, one ends up needlessly
retrying folding or having multiple rounds of folding attempts where one
final run would have sufficed.

Of course folks can locally avoid this behavior by just building their
own driver, but this is also a commonly requested feature that folks keep
working around locally in suboptimal ways.

For downstream users, there should be no behavioral change. Updating
from the deprecated API should just be a find and replace (e.g., of the `find ./
-type f -exec sed -i
's|applyPatternsAndFoldGreedily|applyPatternsGreedily|g' {} \;` variety),
as the API arguments haven't changed between the two.
2024-12-20 08:15:48 -08:00
Dragan Mladjenovic
596bfb804b
[MLIR][AMDGPU] Support gpu::ShuffleMode::DOWN lowering in ROCDL (#106237) 2024-11-20 03:00:05 -06:00
Matthias Springer
206fad0e21
[mlir][NFC] Mark type converter in populate... functions as const (#111250)
This commit marks the type converter in `populate...` functions as
`const`. This is useful for debugging.

Patterns already take a `const` type converter. However, some
`populate...` functions do not only add new patterns, but also add
additional type conversion rules. That makes it difficult to find the
place where a type conversion was added in the code base. With this
change, all `populate...` functions that only populate patterns now have
a `const` type converter. Programmers can then conclude from the
function signature that these functions do not register any new type
conversion rules.

Also some minor cleanups around the 1:N dialect conversion
infrastructure, which did not always pass the type converter as a
`const` object internally.
2024-10-05 21:32:40 +02:00
Daniel Hernandez-Juarez
1c47fa9b62
[mlir][AMDGPU] Add support for AMD f16 math library calls (#108809)
In this PR we add support for AMD f16 math library calls
(`__ocml_*_f16`)

CC: @krzysz00 @manupak
2024-09-23 12:52:00 -05:00
Nirvedh Meshram
a16164d0c2
[MLIR][ROCDL] Add dynamically legal ops to LowerGpuOpsToROCDLOpsPass (#108302)
Similar to https://github.com/llvm/llvm-project/pull/108266
After https://github.com/llvm/llvm-project/pull/102971
It is legal to generate `LLVM::ExpOp` and `LLVM::LogOp` if the type
is a float16 or float32
2024-09-12 11:20:27 -05:00
Nirvedh Meshram
c31d343857
Update legalizations for LowerGpuOpsToROCDLOps (#108266)
LLVM::FAbsOp and LLVM::SqrtOp are legal after
https://github.com/llvm/llvm-project/pull/102971
2024-09-11 15:02:38 -05:00
Matthias Springer
7030280329
[mlir][GPU] Improve gpu.module op implementation (#102866)
- Replace hand-written parser/printer with auto-generated assembly
format.
- Remove implicit `gpu.module_end` terminator and use the `NoTerminator`
trait instead. (Same as `builtin.module`.)
- Turn the region into a graph region. (Same as `builtin.module`.)
2024-08-13 09:37:36 +02:00
Victor Perez
d45de8003a
[MLIR][GPU-LLVM] Convert gpu.func to llvm.func (#101664)
Add support in `-convert-gpu-to-llvm-spv` to convert `gpu.func` to
`llvm.func` operations.

- `spir_kernel`/`spir_func` calling conventions used for
kernels/functions.
- `workgroup` attributions encoded as additional `llvm.ptr<3>`
arguments.
- No attribute used to annotate kernels
- `reqd_work_group_size` attribute used to encode
`gpu.known_block_size`.
- `llvm.mlir.workgroup_attrib_size` used to encode workgroup attribution
sizes. This is attached to the pointer argument that each workgroup
attribution lowers to.

**Note**: A notable missing feature that will be addressed in a
follow-up PR is a `-use-bare-ptr-memref-call-conv` option to replace
MemRef arguments with bare pointers to the MemRef element types instead
of the current MemRef descriptor approach.

---------

Signed-off-by: Victor Perez <victor.perez@codeplay.com>
2024-08-09 16:09:11 +02:00
Jan Leyonberg
3fae5551de
[MLIR][ROCDL] Refactor conversion of math operations to ROCDL calls to a separate pass (#98653)
This patch refactors the conversion of math operations to ROCDL library
calls. This pass will also be used in flang to lower Fortran
intrinsics/math functions for OpenMP target offloading codegen.
2024-07-17 09:33:04 -04:00
Krzysztof Drewniak
43fd4c49bd
[mlir][GPU] Improve handling of GPU bounds (#95166)
This change reworks how range information for GPU dispatch IDs (block
IDs, thread IDs, and so on) is handled.

1. `known_block_size` and `known_grid_size` become inherent attributes
of GPU functions. This makes them less clunky to work with. As a
consequence, the `gpu.func` lowering patterns now only look at the
inherent attributes when setting target-specific attributes on the
`llvm.func` that they lower to.
2. At the same time, `gpu.known_block_size` and `gpu.known_grid_size`
are made official dialect-level discardable attributes which can be
placed on arbitrary functions. This allows for progressive lowerings
(without this, a lowering for `gpu.thread_id` couldn't know about the
bounds if it had already been moved from a `gpu.func` to an `llvm.func`)
and allows for range information to be provided even when
`gpu.*_{id,dim}` are being used outside of a `gpu.func` context.
3. All of these index operations have gained an optional `upper_bound`
attribute, allowing for an alternate mode of operation where the bounds
are specified locally and not inherited from the operation's context.
These also allow handling of cases where the precise launch sizes aren't
known, but can be bounded more precisely than the maximum of what any
platform's API allows. (I'd like to thank @benvanik for pointing out
that this could be useful.)

When inferring bounds (either for range inference or for setting `range`
during lowering) these sources of information are consulted in order of
specificity (`upper_bound` > inherent attribute > discardable attribute,
except that dimension sizes check for `known_*_bounds` to see if they
can be constant-folded before checking their `upper_bound`).
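
The precedence for ID operations can be summarized in a small standalone sketch (C++17, for `std::optional`). `BoundSources` and `resolveIdBound` are hypothetical names, and the sketch leaves out the special case for dimension sizes mentioned above.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical container for the three places a bound may come from.
struct BoundSources {
  std::optional<uint64_t> upperBound;      // `upper_bound` on the op itself
  std::optional<uint64_t> inherentAttr;    // inherent attribute on gpu.func
  std::optional<uint64_t> discardableAttr; // discardable attribute on a function
};

// Consult the sources from most to least specific; empty if none is set.
inline std::optional<uint64_t> resolveIdBound(const BoundSources &s) {
  if (s.upperBound)
    return s.upperBound;
  if (s.inherentAttr)
    return s.inherentAttr;
  return s.discardableAttr;
}
```

An op-local `upper_bound` therefore wins over anything inherited from the enclosing function, and an op with no information from any source gets no bound from this step.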

This patch also updates the documentation about the bounds and inference
behavior to clarify what these attributes do when set and the
consequences of setting them up incorrectly.

---------

Co-authored-by: Mehdi Amini <joker.eph@gmail.com>
2024-06-17 23:47:38 -05:00
stefankoncarevic
94be801879
[mlir][ROCDL] Update the LLVM data layout for ROCDL lowering. (#92127)
This change updates the dataLayout string to ensure alignment with the
latest LLVM TargetMachine configuration. The aim is to
maintain consistency and prevent potential compilation issues related to
memory address space handling.
2024-05-28 10:17:02 -05:00
Krzysztof Drewniak
4cba5957e6
[mlir][ROCDL] Set the LLVM data layout when lowering to ROCDL LLVM (#74501)
In order to ensure operations lower correctly (especially
memref.addrspacecast, which relies on the data layout being set
correctly when dealing with dynamic memrefs) and to prevent compilation
issues later down the line, set the `llvm.data_layout` attribute on GPU
modules when lowering their contents to a ROCDL / AMDGPU target.

If there's a good way to test the embedded string to prevent it from
going out of sync with the LLVM TargetMachine, I'd appreciate hearing
about it. (Or, alternatively, if there's a place I could factor the
string out to).
2024-02-27 09:59:50 -06:00
Mehdi Amini
45c226d452
[MLIR] Add ODS support for generating helpers for dialect (discardable) attributes (#77024)
This is a new ODS feature that allows dialects to define a list of
key/value pairs, each representing an attribute type and a name.
This will generate helper classes on the dialect to be able to
manage discardable attributes on operations in a type-safe way.

For example the `test` dialect can define:

```
  let discardableAttrs = (ins
     "mlir::IntegerAttr":$discardable_attr_key,
  );
```

And the following will be generated in the TestDialect class:

```
  /// Helper to manage the discardable attribute `discardable_attr_key`.
  class DiscardableAttrKeyAttrHelper {
    ::mlir::StringAttr name;
  public:
    static constexpr ::llvm::StringLiteral getNameStr() {
      return "test.discardable_attr_key";
    }
    constexpr ::mlir::StringAttr getName() {
      return name;
    }

    DiscardableAttrKeyAttrHelper(::mlir::MLIRContext *ctx)
      : name(::mlir::StringAttr::get(ctx, getNameStr())) {}

    mlir::IntegerAttr getAttr(::mlir::Operation *op) {
      return op->getAttrOfType<mlir::IntegerAttr>(name);
    }
    void setAttr(::mlir::Operation *op, mlir::IntegerAttr val) {
      op->setAttr(name, val);
    }
    bool isAttrPresent(::mlir::Operation *op) {
      return op->hasAttrOfType<mlir::IntegerAttr>(name);
    }
    void removeAttr(::mlir::Operation *op) {
      assert(op->hasAttrOfType<mlir::IntegerAttr>(name));
      op->removeAttr(name);
    }
  };
  DiscardableAttrKeyAttrHelper getDiscardableAttrKeyAttrHelper() {
    return discardableAttrKeyAttrName;
  }
```

User code that has an instance of the TestDialect can then manipulate this
attribute on an operation using:

```
  auto helper = testDialect.getDiscardableAttrKeyAttrHelper();

  helper.setAttr(op, value);
  helper.isAttrPresent(op);
  ...
```
2024-02-19 23:30:03 -08:00
Hugo Trachino
65066c0277
[mlir] Use create instead of createOrFold for ConstantOp as folding has no effect (NFC) (#80129)
This aims to clean-up confusing uses of
builder.createOrFold<ConstantOp> since folding of constants fails.
2024-01-31 23:40:37 -08:00
Guray Ozen
391a7577e7
[mlir][gpu] Add lowering dynamic_shared_memory op for rocdl (#74473)
This PR adds lowering of `gpu.dynamic_shared_memory` to rocdl target.
2023-12-05 19:56:43 +01:00
Christian Ulmann
4279a642fb
[MLIR][GPUToROCDL] Remove typed pointer support (#70908)
This commit removes the support for lowering GPU to ROCDL dialect with
typed pointers. Typed pointers have been deprecated for a while now and
it's planned to soon remove them from the LLVM dialect.

Related PSA:
https://discourse.llvm.org/t/psa-removal-of-typed-pointers-from-the-llvm-dialect/74502
2023-11-01 10:13:06 +01:00
Adrian Kuegel
baf2d13519 [mlir][GPUToROCDL] Lower arith.remf to GPU intrinsic.
Differential Revision: https://reviews.llvm.org/D159423
2023-09-04 14:05:04 +02:00
Stanley Winata
1896096002 [mlir][ROCM] Add Wave/Warp shuffle lowering and op for ROCM.
Reduction is heavily used in many DL workloads, especially with
softmax/Attention layers. Wave/Warp shuffle and reduction is known to be
a speedy/efficient way to do these reductions.

In this patch we introduce AMD shuffle intrinsic Ops to ROCDL, along with their corresponding lowering from gpu.shuffle. This should speed up a lot of DL workloads on the ROCM backend. Currently, we have support for xor and idx, which are the more common ones. In the future, we plan on adding support for Down and Up, as well as using ds_swizzle to further enhance performance when width and offsets are constant.

Reviewed By: antiagainst

Differential Revision: https://reviews.llvm.org/D158684
2023-08-24 17:35:34 -07:00