10 Commits

Author SHA1 Message Date
Guray Ozen
63389326f5
[mlir][nvvm] Support predicates in BasicPtxBuilder (#67102)
This PR enhances `BasicPtxBuilder` to support predicates in PTX code
generation. The `BasicPtxBuilder` interface was initially introduced for
generating PTX code automatically for Ops that aren't supported by LLVM
core. Predicates, which are typically not supported in LLVM core, are
now supported using the same mechanism.

In PTX programming, instructions can be guarded by predicates as shown
below:. Here `@p` is a predicate register and guard the execution of the
instruction.

```
@p ptx.code op1, op2, op3
```

This PR introduces the `getPredicate` function in the `BasicPtxBuilder`
interface to set an optional predicate. When a predicate is provided,
the instruction is generated with predicate and guarded, otherwise,
predicate is not genearted. Note that the predicate value must always
appear as the last argument on the Op definition.

Additionally, this PR implements predicate usage for the following ops:

- mbarrier.init
- mbarrier.init.shared
- mbarrier.arrive.expect_tx
- mbarrier.arrive.expect_tx.shared
- cp.async.bulk.tensor.shared.cluster.global
- cp.async.bulk.tensor.global.shared.cta

See for more detail in PTX programing model

https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#ptx-instructions
2023-10-17 12:42:36 +02:00
Guray Ozen
25da11541c
[mlir][nvvm] Move BasicPtxBuilder Interface to Its Own File (NFC) (#68095) 2023-10-12 07:13:58 -07:00
Guray Ozen
6e8ffab9b6
[mlir][nvvm] Introduce elect.sync Op (#68323)
The Op selects a leader thread from a set of threads.

See for more information:

https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-elect-sync
2023-10-09 10:21:45 -07:00
Guray Ozen
18e161f9e1 [MLIR][NVVM] Introduction of the wgmma.mma_async Op
This work introduces the `wgmma.mma_async` Op along PTX generation using `BasicPtxBuilderOpInterface`. The Op is designed to execute the matrix multiply-and-accumulate operation across a warpgroup (128 threads). It's important to note that this operation works for devices with the sm_90a capability.

The matrix multiply-and-accumulate operation can take one of the following forms. In both cases, matrix D is referred to as the accumulator:
	D = A * B + D 	: Result is added to the accumulator matrix D.
	D = A * B 		: The input from the accumulator matrix D is not utilized.

Reviewed By: nicolasvasilache

Differential Revision: https://reviews.llvm.org/D157370
2023-08-09 23:08:00 +02:00
Matthias Springer
876a480cac [mlir][Conversion] Add type converter parameter to ConvertToLLVMPatternInterface
Most `*-to-llvm` conversion patterns require a type converter. This
revision adds a type converter to the
`populateConvertToLLVMConversionPatterns` function and implements the
interface for the MemRef dialect.

Differential Revision: https://reviews.llvm.org/D157387
2023-08-09 09:00:46 +02:00
Mehdi Amini
4529797a9d Add a generic "convert-to-llvm" pass delegating to an interface
The multiple -convert-XXX-to-llvm passes are really nice testing tools for
individual dialects, but the expectation is that a proper conversion should
assemble the conversion patterns using `populateXXXToLLVMConversionPatterns()
APIs. However most customers just chain the conversion passes by convenience.

This pass makes it composable more transparently to assemble the required
patterns for conversion to LLVM dialect by using an interface.
The Pass will scan the input and collect all the dialect present, and for
those who implement the `ConvertToLLVMPatternInterface` it will use it to
populate the conversion pattern, and possible the conversion target.

Since these conversions can involve intermediate dialects, or target other
dialects than LLVM (for example AVX or NVVM), this pass can't statically
declare the required `getDependentDialects()` before the pass pipeline
begins. This is worked around by using an extension in the dialectRegistry
that will be invoked for every new loaded dialects in the context. This
allows to lookup the interface ahead of time and use it to query the
dependent dialects.

Differential Revision: https://reviews.llvm.org/D157183
2023-08-07 18:46:08 -07:00
Guray Ozen
eda52f3cd3 [mlir][nvvm] Add populate function (nfc)
This work adds populate function for the nvvm to llvm conversion pattern.

Reviewed By: kuhar

Differential Revision: https://reviews.llvm.org/D155189
2023-07-13 14:53:51 +02:00
Guray Ozen
ffbca7e9f3 [mlir][nvvm] Change return type of std::string of getPtx of PtxBuilder
getPtx used to return `const char*`. It is not flexible when one needs to build string in the function. This work changes return type.

Reviewed By: springerm

Differential Revision: https://reviews.llvm.org/D155056
2023-07-12 14:59:54 +02:00
Guray Ozen
b6bf775f58 [mlir][nvvm] fix potential bug (NFC)
Reviewed By: nicolasvasilache

Differential Revision: https://reviews.llvm.org/D155048
2023-07-12 10:09:21 +02:00
Guray Ozen
dd080c7579 [mlir][nvvm] Add NVVMToLLVM Pass
It introduces an NVVMToLLVM Pass and a `BasicPtxBuilderOpInterface` interface. The Pass performs pattern matching on all the NVVM Ops that implement the BasicPtxBuilderOpInterface interface to generate LLVM Inline Assembly Ops.

The BasicPtxBuilderOpInterface interface is utilized in the convert-nvvm-to-llvm pass, which lowers Ops that support this interface to inline assembly Ops. The interface provides several methods that are used for this lowering.

The `getPtx` method returns PTX code. The `hasSideEffect` method is used to determine whether the op has any side effects on the memory. The `hasIntrinsic` method indicates whether the operation has intrinsic support in LLVM. This is particularly useful for Ops that don't have intrinsic support for each case. The `getAsmValues` method returns the arguments to be passed to the PTX code. The order of arguments starts with the results and they are used for write operations, followed by the operands and attributes.

Example:

If we have the following Op definition that returns PTX code through getPtx:
```tablegen
def NVVM_MBarrierArriveExpectTxOp : NVVM_Op<\"mbarrier.arrive.expect_tx\",
                    [DeclareOpInterfaceMethods<BasicPtxBuilderOpInterface>]>,
  Results<(outs LLVM_Type:$res)>, Arguments<(ins LLVM_i64ptr_any:$addr, I32:$txcount)> {
  ...
  let extraClassDefinition = [{
    const char* $cppClass::getPtx() { return \"mbarrier.arrive.expect_tx.b64 %0, [%1], %2;\"; }
  }\];
}
```

The NVVM Op will look like below:
```mlir
  %0 = nvvm.mbarrier.arrive.expect_tx %barrier, %txcount : !llvm.ptr, i32 -> i32
```

The `convert-nvvm-to-llvm` Pass generates the following PTX code, while keeping the order of arguments the same. The read/write modifiers are set based on the input and result types.
```mlir
  %0 = llvm.inline_asm has_side_effects asm_dialect = att "mbarrier.arrive.expect_tx.b64 %0, [%1], %2;", "=r,l,r" %arg0, %arg1 : (!llvm.ptr, i32) -> i32
```

Reviewed By: nicolasvasilache

Differential Revision: https://reviews.llvm.org/D154060
2023-07-11 12:14:24 +02:00