198 Commits

Author SHA1 Message Date
Maksim Levental
eaa67a3cf0
[mlir][NFC] update Conversion create APIs (5/n) (#149887)
See https://github.com/llvm/llvm-project/pull/147168 for more info.
2025-07-22 10:40:45 -04:00
Kazu Hirata
fa9adbfda9
[mlir] Remove unused includes (NFC) (#147101)
These are identified by misc-include-cleaner.  I've filtered out those
that break builds.  Also, I'm staying away from llvm-config.h,
config.h, and Compiler.h, which likely cause platform- or
compiler-specific build failures.
2025-07-04 13:30:21 -07:00
lorenzo chelini
c5613dc863
[MLIR] Mark LLVM::FMAOp as legal (#144671)
Mark LLVM::FMAOp as legal in configureGpuToNVVMConversionLegality, since
we can handle intrinsic lowering in the NVPTX backend and emit
fma.rn.f32.
2025-06-18 15:49:00 +02:00
William Moses
d9c65af626
[MLIR][GPUToNVVM] Support 32-bit isfinite (#131699)
Co-authored-by: Ivan Radanov Ivanov <ivanov.i.aa@m.titech.ac.jp>
2025-03-18 02:11:38 +01:00
Valentin Clement (バレンタイン クレメン)
b3e05d58b9
[mlir][nvvm] Add conversion for math.erfc (#129329)
Add missing pattern to convert `math.erfc` operation to `__nv_erfcf` or
`__nv_erfc` function call.
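
A minimal sketch of what this pattern produces, assuming the usual GPU-to-NVVM libdevice call lowering; only the `__nv_erfcf`/`__nv_erfc` names come from the patch itself:

```
// Illustrative Math-dialect input:
%y = math.erfc %x : f32

// Rough shape after conversion (sketch): a call into libdevice.
llvm.func @__nv_erfcf(f32) -> f32
%r = llvm.call @__nv_erfcf(%x) : (f32) -> f32
```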
2025-02-28 14:51:56 -08:00
Matthias Springer
4defac91db
[mlir][GPUToNVVM] Add benefit to populate functions (#128484)
Certain GPU->NVVM patterns (the ones that lower to libdevice) compete with
Arith->LLVM patterns. Add an optional `benefit` parameter to all `populate`
functions so that users can give preference to the GPU->NVVM patterns.
2025-02-24 17:27:55 +01:00
Oleksandr "Alex" Zinenko
79d8a34bc5
[mlir] add some FP classification ops and their lowering to libdevice (#127322)
Introduce a subset of floating point classification ops to the Math
dialect. These ops mirror functions provided by the C math library and,
similarly to the existing `math.copysign`, belong to the math dialect.
Add a lowering of those ops to Nvidia libdevice calls when possible as
the first mechanism to exercise them.
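
A hedged sketch of how such classification ops might look and lower; the op names shown (`math.isnan`, `math.isfinite`) and the libdevice helper are illustrative examples drawn from the description above, not an exhaustive list:

```
// Illustrative classification ops (each yields an i1):
%is_nan    = math.isnan %x : f32
%is_finite = math.isfinite %x : f32

// Sketch of the NVIDIA lowering: libdevice classification helpers return
// i32, so the call result is then compared against zero.
%c = llvm.call @__nv_isnanf(%x) : (f32) -> i32
```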
2025-02-16 14:51:21 +01:00
Oleksandr "Alex" Zinenko
963ff1c305
[mlir] lower min/maxnum to libdevice calls (#127323)
Introduce lowering from arith.minnum/maxnum operations to the
corresponding Nvidia libdevice calls. This requires reordering the pattern
population methods so that the libdevice-targeting patterns are
prioritized over the default patterns targeting LLVM IR intrinsics from the
Arith dialect. The tests are placed into a separate file because the existing
gpu-to-nvvm.mlir file has a mode that forces Arith dialect operations
to be preserved as is, without using a separate FileCheck tag to
differentiate.

Co-authored-by: William Moses <gh@wsmoses.com>
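
A sketch of the intended effect; the `__nv_fminf` target for the f32 minimum is an assumption used for illustration:

```
// Illustrative input:
%m = arith.minnumf %a, %b : f32

// With the libdevice-targeting pattern prioritized, the lowering calls into
// libdevice instead of the default Arith->LLVM intrinsic path (sketch):
%m2 = llvm.call @__nv_fminf(%a, %b) : (f32, f32) -> f32
```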
2025-02-15 17:53:36 -05:00
Ivan Butygin
aecb764cc2
[mlir][gpu] GPUToROCDL/NVVM: use generic llvm conversion interface instead of hardcoded conversions. (#124439)
Using `ConvertToLLVMPatternInterface` avoids hardcoding specific
dialect conversions in passes and, more importantly, allows downstream
projects to inject their own op/type conversions here by registering the
corresponding interface.

Add an `allowed-dialects` option so users can control which dialects are
used to populate conversions.
2025-02-13 17:53:12 +03:00
Oleksandr "Alex" Zinenko
4df28af713
[mlir] Add lowering of absi and fpowi to libdevice (#123644)
More concise version of #123422.

---------

Co-authored-by: William S. Moses <gh@wsmoses.com>
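
A minimal sketch, with the libdevice function names (`__nv_abs`, `__nv_powif`) assumed as the targets of the lowering this commit describes:

```
// Illustrative inputs:
%a = math.absi %i : i32
%p = math.fpowi %x, %i : f32, i32

// Rough shape of the libdevice lowering (sketch):
%a2 = llvm.call @__nv_abs(%i) : (i32) -> i32
%p2 = llvm.call @__nv_powif(%x, %i) : (f32, i32) -> f32
```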
2025-01-20 12:53:44 -06:00
Matthias Springer
599c739905
[mlir][GPU] Add NVVM-specific cf.assert lowering (#120431)
This commit adds an NVIDIA-specific lowering of `cf.assert` to
`__assertfail`.

Note: `getUniqueFormatGlobalName`, `getOrCreateFormatStringConstant` and
`getOrDefineFunction` are moved to `GPUOpsLowering.h`, so that they can
be reused.
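
A hedged sketch of the shape of this lowering; the operand names and the exact argument plumbing into `__assertfail` below are illustrative, only the function name comes from the commit:

```
// Illustrative input inside a GPU kernel:
cf.assert %cond, "value must be positive"

// Rough shape after the NVVM-specific lowering (sketch): branch to a block
// that calls the device-side __assertfail, then becomes unreachable.
llvm.cond_br %cond, ^ok, ^fail
^fail:
  llvm.call @__assertfail(%msg, %file, %line, %fn, %charSize)
      : (!llvm.ptr, !llvm.ptr, i32, !llvm.ptr, i64) -> ()
  llvm.unreachable
^ok:
  // ...
```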
2025-01-06 12:00:11 +01:00
Jacques Pienaar
09dfc5713d
[mlir] Enable decoupling two kinds of greedy behavior. (#104649)
The greedy rewriter is used in many different flows and it has a lot of
convenience (work list management, debugging actions, tracing, etc). But
it combines two kinds of greedy behavior: 1) how ops are matched, and 2)
folding wherever it can.

These are independent forms of greediness, and combining them leads to
inefficiency, e.g., cases where one needs to create different phases in a
lowering and apply patterns in a specific order split across different
passes. Using the driver, one ends up needlessly retrying folding/having
multiple rounds of folding attempts, where one final run would have
sufficed.

Of course folks can locally avoid this behavior by just building their
own, but this is also a commonly requested feature that folks keep
working around locally in suboptimal ways.

For downstream users, there should be no behavioral change. Updating
from the deprecated API should just be a find-and-replace (e.g., of the `find ./
-type f -exec sed -i
's|applyPatternsAndFoldGreedily|applyPatternsGreedily|g' {} \;` variety),
as the API arguments haven't changed between the two.
2024-12-20 08:15:48 -08:00
Fabian Mora
7498eaa9ab
[mlir][LLVM] Add the ConvertToLLVMAttrInterface and ConvertToLLVMOpInterface interfaces (#99566)
This patch adds the `ConvertToLLVMAttrInterface` and
`ConvertToLLVMOpInterface` interfaces. It also modifies the
`convert-to-llvm` pass to use these interfaces when available.

The `ConvertToLLVMAttrInterface` interface allows attributes to
configure conversion to LLVM, including the conversion target, LLVM type
converter, and populating conversion patterns. See the `NVVMTargetAttr`
implementation of this interface for an example of how this interface
can be used to configure conversion to LLVM.

The `ConvertToLLVMOpInterface` interface collects all convert to LLVM
attributes stored in an operation.

Finally, the `convert-to-llvm` pass was modified to use these interfaces
when available. This allows applying `convert-to-llvm` to GPU modules
and letting the `NVVMTargetAttr` decide which patterns to populate.
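
A hedged sketch of the kind of IR this enables: a GPU module carrying an NVVM target attribute can be handed to the generic `convert-to-llvm` pass, with the target attribute deciding which patterns to populate (the module contents here are illustrative):

```
// Sketch: the #nvvm.target attribute implements ConvertToLLVMAttrInterface,
// so convert-to-llvm can configure the conversion from the module itself.
gpu.module @kernels [#nvvm.target] {
  gpu.func @kernel() kernel {
    %tid = gpu.thread_id x
    gpu.return
  }
}
```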
2024-11-24 10:09:43 -05:00
Durgadoss R
a8b5115441
[MLIR][NVGPU] Fix the cga_cluster.mlir test (#112191)
This patch fixes the sm90 cluster test by:
* Fixing a typo in LowerGpuOpsToNVVMOps where one of the ClusterDim Op
   conversion patterns should actually be for the
   ClusterDimBlocks Op. This addresses the compilation error for this test.
* Changing the grid size to (4,4,1) instead of (2,2,1). This passes the
   scf-if check against the threshold of 3 below and actually
   generates the required prints from the GPU.

Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
2024-10-14 19:44:13 +05:30
Matthias Springer
206fad0e21
[mlir][NFC] Mark type converter in populate... functions as const (#111250)
This commit marks the type converter in `populate...` functions as
`const`. This is useful for debugging.

Patterns already take a `const` type converter. However, some
`populate...` functions not only add new patterns but also additional
type conversion rules. That makes it difficult to find the
place where a type conversion was added in the code base. With this
change, all `populate...` functions that only populate patterns now have
a `const` type converter. Programmers can then conclude from the
function signature that these functions do not register any new type
conversion rules.

Also some minor cleanups around the 1:N dialect conversion
infrastructure, which did not always pass the type converter as a
`const` object internally.
2024-10-05 21:32:40 +02:00
Daniel Hernandez-Juarez
1c47fa9b62
[mlir][AMDGPU] Add support for AMD f16 math library calls (#108809)
In this PR we add support for AMD f16 math library calls
(`__ocml_*_f16`)

CC: @krzysz00 @manupak
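
A minimal sketch under the assumption that `__ocml_exp_f16` is one of the `__ocml_*_f16` entry points referred to above:

```
// Illustrative f16 math op in a ROCDL-targeted kernel:
%y = math.exp %x : f16

// Rough shape of the AMD device-library lowering (sketch):
llvm.func @__ocml_exp_f16(f16) -> f16
%y2 = llvm.call @__ocml_exp_f16(%x) : (f16) -> f16
```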
2024-09-23 12:52:00 -05:00
Benjamin Kramer
ac11945386 [mlir][GPU] block_id has the grid size as its range 2024-09-17 18:38:04 +02:00
Krzysztof Drewniak
a953982cb7
[mlir][GPU] Plumb range information through the NVVM lowerings (#107659)
Update the GPU to NVVM lowerings to correctly propagate range
information on IDs and dimension queries, either from
known_{block,grid}_size attributes or from `upperBound` annotations on
the operations themselves.
2024-09-13 12:07:51 -05:00
Andrea Faulds
7aa22f013e
[mlir][gpu] Add 'cluster_size' attribute to gpu.subgroup_reduce (#104851)
This enables performing several reductions in parallel, each smaller
than the size of the subgroup.

One potential application is flash attention with subgroup-wide matrix
multiplication and reduction combined in one kernel. The multiplication
operation requires a 2D matrix to be distributed over the lanes of the
subgroup, which then constrains the shape the following reduction can
have if we want to keep data in registers.
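
A hedged sketch of the new attribute in use; the exact assembly syntax is an assumption based on the attribute name:

```
// Sketch: reduce within clusters of 4 lanes rather than across the whole
// subgroup (assembly form illustrative).
%sum = gpu.subgroup_reduce add %v cluster_size(4) : (f32) -> f32
```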
2024-08-20 13:37:03 -04:00
Matthias Springer
7030280329
[mlir][GPU] Improve gpu.module op implementation (#102866)
- Replace hand-written parser/printer with auto-generated assembly
format.
- Remove implicit `gpu.module_end` terminator and use the `NoTerminator`
trait instead. (Same as `builtin.module`.)
- Turn the region into a graph region. (Same as `builtin.module`.)
2024-08-13 09:37:36 +02:00
Victor Perez
d45de8003a
[MLIR][GPU-LLVM] Convert gpu.func to llvm.func (#101664)
Add support in `-convert-gpu-to-llvm-spv` to convert `gpu.func` to
`llvm.func` operations.

- `spir_kernel`/`spir_func` calling conventions used for
kernels/functions.
- `workgroup` attributions encoded as additional `llvm.ptr<3>`
arguments.
- No attribute used to annotate kernels.
- `reqd_work_group_size` attribute used to encode
`gpu.known_block_size`.
- `llvm.mlir.workgroup_attrib_size` used to encode workgroup attribution
sizes. This will be attached to the pointer arguments that workgroup
attributions lower to.

**Note**: A notable missing feature that will be addressed in a
follow-up PR is a `-use-bare-ptr-memref-call-conv` option to replace
MemRef arguments with bare pointers to the MemRef element types instead
of the current MemRef descriptor approach.

---------

Signed-off-by: Victor Perez <victor.perez@codeplay.com>
2024-08-09 16:09:11 +02:00
runseny
f6431f0c52
[MLIR][GPUToNVVM] support fastMath and other non-supported mathOp (#99890)
Support fastMath and other previously unsupported math ops that only require
float operands by calling the corresponding libdevice functions directly from NVVM.

1. Lower math ops carrying the fastMath attribute to the correct libdevice
intrinsics.
2. Some math dialect ops are already lowered to libdevice, but the coverage
is incomplete, so this MR lowers all the remaining math ops
that only require float operands.
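
A minimal sketch of the fast-math path; the `__nv_fast_sinf` target is an assumption used for illustration:

```
// Illustrative input carrying a fast-math flag:
%y = math.sin %x fastmath<afn> : f32

// Sketch: the fastMath-aware pattern can pick the approximate libdevice
// variant instead of the precise one.
%y2 = llvm.call @__nv_fast_sinf(%x) : (f32) -> f32
```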
2024-07-25 13:54:58 +02:00
Krzysztof Drewniak
43fd4c49bd
[mlir][GPU] Improve handling of GPU bounds (#95166)
This change reworks how range information for GPU dispatch IDs (block
IDs, thread IDs, and so on) is handled.

1. `known_block_size` and `known_grid_size` become inherent attributes
of GPU functions. This makes them less clunky to work with. As a
consequence, the `gpu.func` lowering patterns now only look at the
inherent attributes when setting target-specific attributes on the
`llvm.func` that they lower to.
2. At the same time, `gpu.known_block_size` and `gpu.known_grid_size`
are made official dialect-level discardable attributes which can be
placed on arbitrary functions. This allows for progressive lowerings
(without this, a lowering for `gpu.thread_id` couldn't know about the
bounds if it had already been moved from a `gpu.func` to an `llvm.func`)
and allows for range information to be provided even when
`gpu.*_{id,dim}` are being used outside of a `gpu.func` context.
3. All of these index operations have gained an optional `upper_bound`
attribute, allowing for an alternate mode of operation where the bounds
are specified locally and not inherited from the operation's context.
These also allow handling of cases where the precise launch sizes aren't
known, but can be bounded more precisely than the maximum of what any
platform's API allows. (I'd like to thank @benvanik for pointing out
that this could be useful.)

When inferring bounds (either for range inference or for setting `range`
during lowering) these sources of information are consulted in order of
specificity (`upper_bound` > inherent attribute > discardable attribute,
except that dimension sizes check for `known_*_bounds` to see if they
can be constant-folded before checking their `upper_bound`).

This patch also updates the documentation about the bounds and inference
behavior to clarify what these attributes do when set and the
consequences of setting them up incorrectly.

---------

Co-authored-by: Mehdi Amini <joker.eph@gmail.com>
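
A hedged sketch of two of the bound sources described above; the exact assembly placement of `upper_bound`, the attribute spelling, and the function body are illustrative:

```
// Discardable attribute on an arbitrary function (sketch):
func.func @body() attributes {gpu.known_block_size = array<i32: 128, 1, 1>} {
  %tid0 = gpu.thread_id x
  // Locally specified bound, independent of the surrounding context (sketch):
  %tid1 = gpu.thread_id x upper_bound 128
  return
}
```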
2024-06-17 23:47:38 -05:00
Pradeep Kumar
bd6568c98a
[MLIR][GPU] Add gpu.cluster_dim_blocks and gpu.cluster_block_id Ops (#95245)
This commit adds support for the `gpu.cluster_dim_blocks` and
`gpu.cluster_block_id` Ops, representing the number of blocks per cluster and
the block ID inside a cluster, respectively. It also fixes the description of
the `gpu.cluster_dim` Op and updates the `cga_cluster.mlir` test file to use
`gpu.cluster_dim_blocks`.

Co-authored-by: pradeepku <pradeepku@nvidia.com>
Co-authored-by: Guray Ozen <guray.ozen@gmail.com>
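
An illustrative sketch of the new ops, following the same form as the existing cluster queries shown elsewhere in this log:

```
// Number of blocks per cluster and the block id within the cluster (sketch):
%cdimx = gpu.cluster_dim_blocks x
%cbidx = gpu.cluster_block_id x
```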
2024-06-14 10:35:35 +05:30
Johannes Reifferscheid
a6a9215b93
Lower shuffle to single-result form if possible. (#84321)
We currently always lower shuffle to the struct-returning variant. I saw
some cases where this survived all the way through ptx, resulting in
increased register usage. The easiest fix is to simply lower to the
single-result version when the predicate is unused.
2024-03-21 10:33:49 +01:00
David Majnemer
0039a2ff4f
[mlir][gpu] Add support for lowering math.erf to __nv_erf (#79848) 2024-01-29 19:35:23 +00:00
Guray Ozen
763109e346
[mlir][gpu] Use known_block_size to set maxntid for NVVM target (#77301)
Setting the thread block size with `maxntid` on the kernel has great
performance benefits. In this way, the downstream PTX compiler can do better
register allocation.

MLIR's `gpu.launch` and `gpu.launch_func` already have an attribute
(`known_block_size`) that keeps the thread block size when it is known.
This PR simply uses this attribute to set `maxntid`.
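
A hedged sketch of the flow: a known block size on the GPU side ends up as a `maxntid`-style NVVM annotation on the lowered `llvm.func`; the attribute spellings below are illustrative:

```
// Illustrative kernel whose block size is known at compile time:
gpu.func @kernel() kernel attributes {known_block_size = array<i32: 128, 1, 1>} {
  gpu.return
}
// After lowering (sketch), the resulting llvm.func would carry a maxntid-style
// NVVM annotation so the PTX compiler can budget registers, e.g.
//   attributes {nvvm.kernel, nvvm.maxntid = array<i32: 128, 1, 1>}
```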
2024-01-08 14:49:19 +01:00
Jakub Kuderski
560564f51c
[mlir][vector][gpu] Align minf/maxf reduction kind names with arith (#75901)
This is to avoid confusion when dealing with reduction/combining kinds.
For example, see a recent PR comment:
https://github.com/llvm/llvm-project/pull/75846#discussion_r1430722175.

Previously, they were picked to mostly mirror the names of the llvm
vector reduction intrinsics:
https://llvm.org/docs/LangRef.html#llvm-vector-reduce-fmin-intrinsic. In
isolation, it was not clear if `<maxf>` has `arith.maxnumf` or
`arith.maximumf` semantics. The new reduction kind names map 1:1 to
arith ops, which makes it easier to tell/look up their semantics.

Because both the vector and the gpu dialect depend on the arith dialect,
it's more natural to align names with those in arith than with the
lowering to llvm intrinsics.

Issue: https://github.com/llvm/llvm-project/issues/72354
2023-12-20 00:14:43 -05:00
Guray Ozen
641e05decc
[mlir][gpu] Support dynamic_shared_memory Op with vector dialect (#74475)
`gpu.dynamic_shared_memory` currently does not get lowered when it is
used with the vector dialect. The reason is that the vector-to-llvm conversion
is not included in gpu-to-nvvm. This PR includes it and adds a test.
2023-12-06 10:41:57 +01:00
Jakub Kuderski
7eccd52842 Reland "[mlir][gpu] Align reduction operations with vector combining kinds (#73423)"
This reverts commit dd09221a29506031415cad8a1308998358633d48 and relands
https://github.com/llvm/llvm-project/pull/73423.

* Updated `gpu.all_reduce` `min`/`max` in CUDA integration tests.
2023-11-27 11:38:18 -05:00
Jakub Kuderski
dd09221a29 Revert "[mlir][gpu] Align reduction operations with vector combining kinds (#73423)"
This reverts commit e0aac8c88d0d30e8da0f8a240ad1e6b4d88782e0.

I'm seeing some nvidia integration test failures:
https://lab.llvm.org/buildbot/#/builders/61/builds/52334.
2023-11-27 11:29:23 -05:00
Jakub Kuderski
e0aac8c88d
[mlir][gpu] Align reduction operations with vector combining kinds (#73423)
The motivation for this change is explained in
https://github.com/llvm/llvm-project/issues/72354.

Before this change, we could not distinguish between signed/unsigned
minimum/maximum and NaN treatment for floating point values.

The mapping of old reduction operations to the new ones is as follows:
*  `min` --> `minsi` for ints, `minf` for floats
*  `max` --> `maxsi` for ints, `maxf` for floats

New reduction kinds not represented in the old enum: `minui`, `maxui`,
`minimumf`, `maximumf`.

As a next step, I would like to have a common definition of combining
kinds used by the `vector` and `gpu` dialects. Separately, the GPU to
SPIR-V lowering does not yet properly handle zero and NaN values -- the
behavior of floating point min/max group reductions is not specified by
the SPIR-V spec, see https://github.com/llvm/llvm-project/issues/73459. 

Issue: https://github.com/llvm/llvm-project/issues/72354
2023-11-27 11:19:20 -05:00
Guray Ozen
edf5cae739
[mlir][gpu] Support Cluster of Thread Blocks in gpu.launch_func (#72871)
The NVIDIA Hopper architecture introduced the Cooperative Group Array (CGA).
It is a new level of parallelism that allows Cooperative Thread Arrays (CTAs)
to be clustered so they can synchronize and communicate through shared memory
while running concurrently.

This PR enables support for CGA within the `gpu.launch_func` in the GPU
dialect. It extends `gpu.launch_func` to accommodate this functionality.

The GPU dialect remains architecture-agnostic, so we've added the CGA
functionality as optional parameters. We want to leverage the mechanisms
that we already have in the GPU dialect, such as outlining and kernel
launching, making this a practical and convenient choice.

An example of this implementation can be seen below:

```
gpu.launch_func @kernel_module::@kernel
                clusters in (%1, %0, %0) // <-- Optional
                blocks in (%0, %0, %0)
                threads in (%0, %0, %0)
```

The PR also introduces index and dimensions Ops specific to clusters,
binding them to NVVM Ops:

```
%cidX = gpu.cluster_id  x
%cidY = gpu.cluster_id  y
%cidZ = gpu.cluster_id  z

%cdimX = gpu.cluster_dim  x
%cdimY = gpu.cluster_dim  y
%cdimZ = gpu.cluster_dim  z
```

We will introduce cluster support in `gpu.launch` Op in an upcoming PR. 

See [the
documentation](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#cluster-of-cooperative-thread-arrays)
provided by NVIDIA for details.
2023-11-27 11:05:07 +01:00
Guray Ozen
ea84897ba3
[mlir][gpu] Introduce gpu.dynamic_shared_memory Op (#71546)
While the `gpu.launch` Op allows setting the size via the
`dynamic_shared_memory_size` argument, accessing the dynamic shared
memory is very convoluted. This PR implements the proposed
`gpu.dynamic_shared_memory` Op, which aims to simplify the utilization of
dynamic shared memory.

RFC:
https://discourse.llvm.org/t/rfc-simplifying-dynamic-shared-memory-access-in-gpu/

**Proposal from RFC**
This PR introduces the `gpu.dynamic.shared.memory` Op to use the dynamic shared
memory feature efficiently. It is a powerful feature that enables the
allocation of shared memory at runtime with the kernel launch on the
host. Afterwards, the memory can be accessed directly from the device. I
believe a similar story exists for AMDGPU.

**Current way Using Dynamic Shared Memory with MLIR**

Let me illustrate the challenges of using dynamic shared memory in MLIR
with an example below. The process involves several steps:
- `memref.global`: the 0-sized array that LLVM's NVPTX backend expects
- `dynamic_shared_memory_size`: sets the size of dynamic shared memory
- `memref.get_global`: accesses the global symbol
- `reinterpret_cast` and `subview`: many ops for pointer arithmetic

```
// Step 1. Create 0-sized global symbol. Manually set the alignment
memref.global "private" @dynamicShmem  : memref<0xf16, 3> { alignment = 16 }
func.func @main() {
  // Step 2. Allocate shared memory
  gpu.launch blocks(...) threads(...)
    dynamic_shared_memory_size %c10000 {
    // Step 3. Access the global object
    %shmem = memref.get_global @dynamicShmem : memref<0xf16, 3>
    // Step 4. A sequence of `memref.reinterpret_cast` and `memref.subview` operations.
    %4 = memref.reinterpret_cast %shmem to offset: [0], sizes: [14, 64, 128],  strides: [8192,128,1] : memref<0xf16, 3> to memref<14x64x128xf16,3>
    %5 = memref.subview %4[7, 0, 0][7, 64, 128][1,1,1] : memref<14x64x128xf16,3> to memref<7x64x128xf16, strided<[8192, 128, 1], offset: 57344>, 3>
    %6 = memref.subview %5[2, 0, 0][1, 64, 128][1,1,1] : memref<7x64x128xf16, strided<[8192, 128, 1], offset: 57344>, 3> to memref<64x128xf16, strided<[128, 1], offset: 73728>, 3>
    %7 = memref.subview %6[0, 0][64, 64][1,1]  : memref<64x128xf16, strided<[128, 1], offset: 73728>, 3> to memref<64x64xf16, strided<[128, 1], offset: 73728>, 3>
    %8 = memref.subview %6[32, 0][64, 64][1,1] : memref<64x128xf16, strided<[128, 1], offset: 73728>, 3> to memref<64x64xf16, strided<[128, 1], offset: 77824>, 3>
    // Step.5 Use
    "test.use.shared.memory"(%7) : (memref<64x64xf16, strided<[128, 1], offset: 73728>, 3>) -> (index)
    "test.use.shared.memory"(%8) : (memref<64x64xf16, strided<[128, 1], offset: 77824>, 3>) -> (index)
    gpu.terminator
  }
```

Let's rewrite the program above using the new Op:

```
func.func @main() {
    gpu.launch blocks(...) threads(...) dynamic_shared_memory_size %c10000 {
    	%i = arith.constant 18 : index
        // Step 1: Obtain shared memory directly
        %shmem = gpu.dynamic_shared_memory : memref<?xi8, 3>
        %c147456 = arith.constant 147456 : index
        %c155648 = arith.constant 155648 : index
        %7 = memref.view %shmem[%c147456][] : memref<?xi8, 3> to memref<64x64xf16, 3>
        %8 = memref.view %shmem[%c155648][] : memref<?xi8, 3> to memref<64x64xf16, 3>

        // Step 2: Utilize the shared memory
        "test.use.shared.memory"(%7) : (memref<64x64xf16, 3>) -> (index)
        "test.use.shared.memory"(%8) : (memref<64x64xf16, 3>) -> (index)
    }
}
```

This PR resolves #72513
2023-11-16 14:42:17 +01:00
Christian Ulmann
02307a1444
[MLIR][GPUToNVVM] Remove typed pointer support (#70861)
This commit removes the support for lowering GPU to NVVM dialect with
typed pointers. Typed pointers have been deprecated for a while now and
it's planned to soon remove them from the LLVM dialect.

Related PSA:
https://discourse.llvm.org/t/psa-removal-of-typed-pointers-from-the-llvm-dialect/74502
2023-11-01 08:40:32 +01:00
Fabian Mora
119c489cc1 Reland [mlir][test][gpu] Migrate CUDA tests to the TargetAttr compilation workflow (llvm#65768)
The revert happened due to a build bot failure that threw 'CUDA_ERROR_UNSUPPORTED_PTX_VERSION'.
The failure's root cause was a pass using "+ptx76" for compilation and an old CUDA driver
on the bot. This commit relands the patch with "+ptx60".

Original Gh PR: #65768
Original commit message:
    Migrate tests referencing `gpu-to-cubin` to the new compilation workflow
    using `TargetAttrs`. The `test-lower-to-nvvm` pass pipeline was modified
    to use the new compilation workflow to simplify the introduction of
    future tests.

    The `createLowerGpuOpsToNVVMOpsPass` function was removed, as it didn't
    allow for passing all options available in the `ConvertGpuOpsToNVVMOp`
    pass.
2023-09-09 12:45:21 +00:00
Fabian Mora
2c596ea951
Revert "[mlir][test][gpu] Migrate CUDA tests to the TargetAttr compilation workflow (#65768) (#65848)
This reverts commit d21b67293be15f8a89378e4785d70cc037866406.
2023-09-09 07:14:19 -04:00
Fabian Mora
d21b67293b
[mlir][test][gpu] Migrate CUDA tests to the TargetAttr compilation workflow (#65768)
Migrate tests referencing `gpu-to-cubin` to the new compilation workflow
using `TargetAttrs`. The `test-lower-to-nvvm` pass pipeline was modified
to use the new compilation workflow to simplify the introduction of
future tests.

The `createLowerGpuOpsToNVVMOpsPass` function was removed, as it didn't
allow for passing all options available in the `ConvertGpuOpsToNVVMOp`
pass.
2023-09-09 07:03:38 -04:00
Adrian Kuegel
640b95da80 [mlir][GPU] Lower arith.remf to GPU intrinsic.
Differential Revision: https://reviews.llvm.org/D159422
2023-09-04 13:38:45 +02:00
Nicolas Vasilache
888717e853 [mlir][transform] Enable gpu-to-nvvm via conversion patterns driven by TD
This revision untangles a few more conversion pieces and allows rewriting
the relatively intricate (and somewhat inconsistent) LowerGpuOpsToNVVMOpsPass
in a declarative fashion that provides much better understanding and control.

Differential Revision: https://reviews.llvm.org/D157617
2023-08-10 15:30:48 +00:00
Uday Bondhugula
27ccf0f407 [MLIR] Provide bare pointer memref lowering option on gpu-to-nvvm pass
Provide the bare pointer memref lowering option on the gpu-to-nvvm pass.
This is needed whenever we lower memrefs on the host function side and
the kernel calls on the host side (gpu-to-llvm) with the bare pointer
convention. The GPU module side of the lowering should also "align" and
use the bare pointer convention.

Reviewed By: krzysz00

Differential Revision: https://reviews.llvm.org/D152480
2023-06-18 22:53:50 +05:30
Markus Böck
0e5aeae6f5 [mlir][GPUToLLVM] Add support for emitting opaque pointers
Part of https://discourse.llvm.org/t/rfc-switching-the-llvm-dialect-and-dialect-lowerings-to-opaque-pointers/68179

This patch adds the new pass option `use-opaque-pointers` to the GPU to LLVM lowerings (including ROCDL and NVVM) and adapts the code to support using opaque pointers in addition to typed pointers.
The required changes mostly boil down to avoiding `getElementType` and specifying base types in GEP and Alloca.

In the future opaque pointers will be the only supported model, hence tests have been ported to using opaque pointers by default. Additional regression tests for typed-pointers have been added to avoid breaking existing clients.

Note: This does not yet port the `GpuToVulkan` passes.

Differential Revision: https://reviews.llvm.org/D144448
2023-02-21 20:46:33 +01:00
Krzysztof Drewniak
499abb243c Add generic type attribute mapping infrastructure, use it in GpuToX
Remapping memory spaces is a function often needed in type
conversions, most often when going to LLVM or to/from SPIR-V (a future
commit), and it is possible that such remappings may become more
common in the future as dialects take advantage of the more generic
memory space infrastructure.

Currently, memory space remappings are handled by running a
special-purpose conversion pass, which changes the address space
attributes, before the main conversion. In this commit, this approach is
replaced by adding a notion of type attribute conversions to
TypeConverter, which is then used to convert memory space attributes.

Then, we use this infrastructure throughout the *ToLLVM conversions.
This has the advantage of loosening the requirements on the inputs to
those passes from "all address spaces must be integers" to "all
memory spaces must be convertible to integer spaces", a looser
requirement that reduces the coupling between portions of MLIR.

On top of that, this change leads to the removal of most of the calls
to getMemorySpaceAsInt(), bringing us closer to removing it.

(A rework of the SPIR-V conversions to use this new system will be in
a follow-up commit.)

As a note, one long-term motivation for this change is that I would
eventually like to add an allocaMemorySpace key to MLIR data layouts
and then call getMemRefAddressSpace(allocaMemorySpace) in the
relevant *ToLLVM passes in order to ensure all alloca()s, whether incoming or
produced during the LLVM lowering, have the correct address space for
a given target.

I expect that the type attribute conversion system may be useful in
other contexts.

Reviewed By: ftynse

Differential Revision: https://reviews.llvm.org/D142159
2023-02-09 18:00:46 +00:00
Kazu Hirata
5c9013e266 Use std::optional instead of llvm::Optional (NFC) 2023-01-28 00:45:19 -08:00
Quentin Colombet
cb4ccd38fa [mlir][Conversion] Rename the MemRefToLLVM pass
Since the recent MemRef refactoring that centralizes the lowering of
complex MemRef operations outside of the conversion framework, the
MemRefToLLVM pass doesn't directly convert these complex operations.

Instead, to fully convert the whole MemRef dialect space, MemRefToLLVM
needs to run after `expand-strided-metadata`.

Make this more obvious by changing the name of the pass and the option
associated with it from `convert-memref-to-llvm` to
`finalize-memref-to-llvm`.
The word "finalize" conveys that this pass needs to run after something
else and that something else is documented in its tablegen description.

This is a follow-up patch related to the conversation at:
https://discourse.llvm.org/t/psa-you-need-to-run-expand-strided-metadata-before-memref-to-llvm-now/66956/14

Differential Revision: https://reviews.llvm.org/D142463
2023-01-27 09:10:10 +00:00
Guray Ozen
a3388f3e2a [mlir] Introduce a pattern to lower gpu.subgroup_reduce to nvvm.redux_op
This revision introduces a pattern to lower the `gpu.subgroup_reduce` op into the `nvvm.redux_sync` op. The op must be run by the entire subgroup, otherwise the behaviour is undefined.

It also adds a flag and a populate function, because the op is not available on every GPU (it requires sm80+), so the lowering can be enabled only when it is desired.

Depends on D142088

Reviewed By: nicolasvasilache

Differential Revision: https://reviews.llvm.org/D142103
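
An illustrative sketch of the op this pattern targets, assuming a 32-bit integer reduction (the kind `redux.sync` handles); the lowering is opt-in via the flag mentioned above because it needs sm80+:

```
// Illustrative subgroup-wide reduction that the new pattern can rewrite
// into nvvm.redux_sync (sketch):
%r = gpu.subgroup_reduce add %x : (i32) -> i32
```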
2023-01-20 13:56:23 +01:00
Christopher Bate
60dd937d56 [mlir][gpu] Fix build failure / silence windows build warnings
Fixes Windows build failure (C4715) caused by
6ca1a09f03e8e940f306bea73efa935e4ee38173.
2023-01-13 13:40:57 -07:00
Christopher Bate
6ca1a09f03 [mlir][gpu] Migrate hard-coded address space integers to an enum attribute (gpu::AddressSpaceAttr)
This is a purely mechanical change that introduces an enum attribute in the GPU
dialect to represent the various memref memory spaces as opposed to the
hard-coded integer attributes that are currently used.

The following steps were taken to make the transition across the codebase:

1. Introduce a pass "gpu-lower-memory-space-attributes":

The pass updates all memref types that have a memory space attribute that is a
`gpu::AddressSpaceAttr`. These attributes are changed to `IntegerAttr`'s using a
mapping that is given by the caller. This pass is based on the
"map-memref-spirv-storage-class" pass and the common functions can probably
be refactored into a set of utilities under the MemRef dialect.

2. Update the verifiers of GPU/NVGPU dialect operations.

If a verifier currently checks the address space of an operand using
e.g.`getWorkspaceAddressSpace`, then it can continue to do so. However, the
checks are changed to only fail if the memory space is either missing or a wrong
value of type `gpu::AddressSpaceAttr`. Otherwise, it just assumes the address
space is correct because it was specifically lowered to something other than a
`gpu::AddressSpaceAttr`.

3. Update existing gpu-to-llvm conversion infrastructure.

In the existing gpu-to-X passes, we add a full conversion equivalent to
`gpu-lower-memory-space-attributes` just before doing the conversion to the
LLVMDialect. This is done because currently both the gpu-to-llvm passes
(rocdl,nvvm) run gpu-to-gpu rewrites within the pass, which introduce
`AddressSpaceAttr` memory space annotations. Therefore, I inserted the
memory space conversion between the gpu-to-gpu rewrites and the LLVM
conversion.

For more context see the below discourse discussion:
https://discourse.llvm.org/t/gpu-workgroup-shared-memory-address-space-is-hard-coded/

Reviewed By: ftynse

Differential Revision: https://reviews.llvm.org/D140644
2023-01-13 11:00:10 -07:00
Goran Flegar
b7fe0d346b Lower math.tan to __nv_tan[f] / __ocml_tan_f{32|64}
At present math.tan fails to lower for NVVM and ROCDL.

Differential Revision: https://reviews.llvm.org/D141505
2023-01-12 15:26:11 +01:00
Johannes Reifferscheid
059cf735a9 Lower math.cbrt to NVVM/ROCDL.
Reviewed By: pifon2a

Differential Revision: https://reviews.llvm.org/D141270
2023-01-09 13:17:35 +01:00