This new option is set to `false` by default. It should be set only in Canonicalizer tests to detect faulty canonicalization patterns. I.e., patterns that prevent the canonicalizer from converging. The canonicalizer should always convergence on such small unit tests that we have in `canonicalize.mlir`.
Two faulty canonicalization patterns were detected and fixed with this change.
Differential Revision: https://reviews.llvm.org/D140873
In many cases, the the number of workgroups (the grid size) and the
number of workitems within each group (the block size) that a GPU
kernel will be launched with are known. For example, if gpu.launch is
called with constant block and grid sizes, we know that those are the
only possible sizes that will be used to launch that kernel. In other
cases, a custom code-generation pipeline that eventually produces GPU
kernels may know the launch dimensions of those kernels, or at least
may be able to provide an upper bound on them.
Other GPU programming systems, such as OpenCL, allow capturing such
information to enable compiler optimizations - see
reqd_work_group_size, but MLIR currently has no mechanism for doing so.
This set of attributes is the first step in enabling optimizations
based on the known launch dimensions of kernels. It extends the kernel
outline pass to set these bounds on kernels with constant launch
dimensions and extends integer range inference for GPU index
operations to account for the bounds when they are known.
Subsequent revisions will use this data when lowering GPU operations
to the ROCDL dialect.
Reviewed By: antiagainst
Differential Revision: https://reviews.llvm.org/D139865
This is generated by running
```
sed --in-place 's/[[:space:]]\+$//' mlir/**/*.td
sed --in-place 's/[[:space:]]\+$//' mlir/**/*.mlir
```
Reviewed By: rriddle, dcaballe
Differential Revision: https://reviews.llvm.org/D138866
In a nested loop nest, it is not feasible to map different loops to the same processing unit; for an example, check the code below. This modification includes a check in this circumstance.
```
scf.foreach_thread (%i, %j) in (%c32, %c32) {...}
{ mapping = [#gpu.thread<x>, #gpu.thread<x>] }
```
Note: It also deletes a test because it is not possible to reproduce this error.
Depends on D138020
Reviewed By: ftynse
Differential Revision: https://reviews.llvm.org/D138032
The given test fails due to error below.
The following error is why the test is failing. One `memref.store` and two `memref.load` are consumers of the loop index for which I do RAUW. `memref.store` is first in the list. If I RAUW on this the loop of `llvm::make early inc range(threadIdx.getUsers())` does not return two `memref.load` as users. They remain unchanged. I'm not really certain why.
This change applies RAUW after collecting the users. If a better solution exists, I would be happy to implement it.
```
mlir-opt: ...llvm-project/mlir/include/mlir/IR/UseDefLists.h:175: mlir::IRObjectWithUseList<mlir::OpOperand>::~IRObjectWithUseList() [OperandType = mlir::OpOperand]: Assertion `use_empty() && "Cannot destroy a value that still has uses!"' failed.
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
```
Reviewed By: springerm
Differential Revision: https://reviews.llvm.org/D138029
`DeviceMappingAttrInterface` is implemented as unifiying mechanism for thread mapping. A code generator could use any attribute that implements this interface to lower `scf.foreach_thread` to device specific code. It is allowed to choose its own mapping and interpretation.
Currently, GPU transform dialect supports only `GPUThreadMapping` and `GPUBlockMapping`; however, other mappings should to be supported as well. This change addresses this issue. It decouples gpu transform dialect from the `GPUThreadMapping` and `GPUBlockMapping`. Now, they can work any other mapping.
Reviewed By: nicolasvasilache
Differential Revision: https://reviews.llvm.org/D138020
Previously, the need for a dense permutation leaked into the thread_dim_mapping specification.
This revision allows to use a sparse specification of the thread_dim_mapping and the proper completion / sorting is applied automatically.
In the process, the sematics of scf.foreach_thread is tightened to require a matching number of thread dimensions and mappings.
The relevant negative test is added.
Differential Revision: https://reviews.llvm.org/D137906
`scf.foreach_thread` defines mapping its loops to processors via an integer array, see an example below. A lowering can use this mapping. However, expressing mapping as an integer array is very confusing, especially when there are multiple levels of parallelism. In addition, the op does not verify the integer array. This change introduces device mapping attribute to make mapping descriptive and verifiable. Then it makes GPU transform dialect use it.
```
scf.foreach_thread (%i, %j) in (%c1, %c2) {
scf.foreach_thread (%i2, %j2) in (%c1, %c2)
{...} { thread_dim_mapping = [0, 1]}
} { thread_dim_mapping = [0, 1]}
```
It first introduces a `DeviceMappingInterface` which is an attribute interface. `scf.foreach_thread` defines its mapping via this interface. A lowering must define its attributes and implement this interface as well. This way gives us a clear validation.
The change also introduces two new attributes (`#gpu.thread<x/y/z>` and `#gpu.block<x,y,z>` ). After this change, the above code prints as below, as seen here, this way clarifies the loop mappings. The change also implements consuming of these two new attribute by the transform dialect. Transform dialect binds the outermost loops to the thread blocks and innermost loops to threads.
```
scf.foreach_thread (%i, %j) in (%c1, %c2) {
scf.foreach_thread (%i2, %j2) in (%c1, %c2)
{...} { thread_dim_mapping = [#gpu.thread<x>, #gpu.thread<y>]}
} { thread_dim_mapping = [#gpu.block<x>, #gpu.block<y>]}
```
Reviewed By: ftynse, nicolasvasilache
Differential Revision: https://reviews.llvm.org/D137413
In D134622 the printed form of a pass manager is changed to include the
name of the op that the pass manager is anchored on. This updates the
`-pass-pipeline` argument format to include the anchor op as well, so
that the printed form of a pipeline can be directly passed to
`-pass-pipeline`. In most cases this requires updating
`-pass-pipeline='pipeline'` to
`-pass-pipeline='builtin.module(pipeline)'`.
This also fixes an outdated assert that prevented running a
`PassManager` anchored on `'any'`.
Reviewed By: rriddle
Differential Revision: https://reviews.llvm.org/D134900
These are test updates required for D135745, which disallows mixing
`-pass-pipeline` and the individual `-pass-name` options.
Reviewed By: rriddle, mehdi_amini
Differential Revision: https://reviews.llvm.org/D135746
This is useful for converting to SPIR-V, where we'd like to have
memref of vector element types.
Reviewed By: ThomasRaoux, bondhugula
Differential Revision: https://reviews.llvm.org/D137143
If, in the GPU async transformation, the operation being given an
async dependency already depended on the token in question, we
would add duplicate tokens, creating issues in GPU to LLVM lowering.
To resolve this issue, add a check to addAsyncDependency() to ensure
that duplicate tokens are not present in the token list.
(I'm open to a different approach here, this is just what I went with
initially)
Reviewed By: Mogball
Differential Revision: https://reviews.llvm.org/D136105
Many tests wrap the piece of the IR related to the transform dialect
into `transform.with_pdl_patterns` without actually using PDL patterns
inside. Some of these are leftovers from migration to `structured.match`
and some others are cargo cult, both are useless and pollute the tests.
Reviewed By: guraypp
Differential Revision: https://reviews.llvm.org/D135661
Use the recently introduced TransformTypeInterface instead of hardcoding
the PDLOperationType. This will allow the operations to use more
specific transform types to express pre/post-conditions in the future.
It requires the syntax and Python op construction API to be updated.
Dialect extensions will be switched separately.
Reviewed By: nicolasvasilache
Differential Revision: https://reviews.llvm.org/D135584
Introduce `subgroup_reduce` operation, similar to `all_reduce`, but operating on subgroup scope instead of workgroup.
It is intended as low-level building block for more high level abstractions (e.g for workgroup-wide `all_reduce` ops).
Only introduce version taking reduce operation enum for simplicity sake.
Differential Revision: https://reviews.llvm.org/D135323
Motivation: we have lowering pipeline based on upstream gpu and spirv dialects and and we are using host shared gpu memory to transfer data between host and device.
Add `host_shared` flag to `gpu.alloc` to distinguish between shared and device-only gpu memory allocations.
Differential Revision: https://reviews.llvm.org/D133533
The GPU transform dialect currently has restrictions and several situations where we can't use transform dialect.
This update includes a method to test a failing cases in GPU transform dialect.
Differential Revision: https://reviews.llvm.org/D135063
This revision adds GPU transform dialect. It also introduce a prefix such as "transform.gpu" for all ops related to this dialect.
MLIR already had two GPU transform op in linalg. This revision moves these ops into GPUTransformOps. The Ops are as follows:
`transform.structured.map_nested_foreach_thread_to_gpu_blocks` -> `transform.gpu.map_foreach_to_blocks`
This op selects the outermost (toplevel) foreach_thread and parallelize across GPU blocks. It can also generate `gpu_launch`.
`transform.structured.map_nested_foreach_thread_to_gpu_threads` -> `transform.gpu.map_nested_foreach_to_threads`
This op parallelizes nested foreach_thread that are inside `gpu_launch` across GPU threads.
It doesn't add new functionality, but there are some minor refactoring of the code.
Reviewed By: ftynse
Differential Revision: https://reviews.llvm.org/D134800
The helper function in GPUFuncOp incorrectly assumed the workgroup
attribution attribute is always present. Instead, treat its absence as
if its value was zero, i.e., no workgroup attributions are specified.
Closes#58045.
Reviewed By: nicolasvasilache
Differential Revision: https://reviews.llvm.org/D134865
This allows for incrementally updating the old API usages without
needing to update everything at once. These will be left on Both
for a little bit and then flipped to prefixed when all APIs have been
updated.
Differential Revision: https://reviews.llvm.org/D134386
GPUFuncOpLowering moves the body out of gpu.func op and erases it. An empty gpu.func may fail verification but should not crash it. Verification of an erased op is triggered e.g. with debug printing on.
Reviewed By: akuegel
Differential Revision: https://reviews.llvm.org/D132446
Default attributes were only handled by ODS accessors generated with the
intention that these behave as if set attributes. This addresses the
long standing TODO to address this inconsistency. Moving the
initialization to construction vs every access. Removing need for
duplicated default attribute population in python bindings.
Switch some of the OpenMP ones to optional attribute with default as the
currently set default values are not legal. May need to dig more there.
Switched LinAlg generated ones to optional attribute with default as its
quite widely used and unclear where it falls on two different
interpretations.
Differential Revision: https://reviews.llvm.org/D130916
This reland includes changes to the Python bindings.
Switch variadic operand and result segment size attributes to use the
dense i32 array. Dense integer arrays were introduced primarily to
represent index lists. They are a better fit for segment sizes than
dense elements attrs.
Depends on D131801
Reviewed By: rriddle
Differential Revision: https://reviews.llvm.org/D131803
Switch variadic operand and result segment size attributes to use the
dense i32 array. Dense integer arrays were introduced primarily to
represent index lists. They are a better fit for segment sizes than
dense elements attrs.
Depends on D131738
Reviewed By: mehdi_amini
Differential Revision: https://reviews.llvm.org/D131702
This commit refactors the syntax of "ugly" attribute/type formats to not use
strings for wrapping. This means that moving forward attirbutes and type formats
will always need to be in some recognizable form, i.e. if they use incompatible
characters they will need to manually wrap those in a string, the framework will
no longer do it automatically.
This has the benefit of greatly simplifying how parsing attributes/types work, given
that we currently rely on some extremely complicated nested parser logic which is
quite problematic for a myriad of reasons; unecessary complexity(we create a nested
source manager/lexer/etc.), diagnostic locations can be off/wrong given string escaping,
etc.
Differential Revision: https://reviews.llvm.org/D118505
Infers block/grid dimensions/indices or ranges of such dimensions/indices.
Reviewed By: krzysz00
Differential Revision: https://reviews.llvm.org/D129036
It was a StructAttr. Also adds a FieldParser for AffineMap.
Depends on D127348
Reviewed By: rriddle
Differential Revision: https://reviews.llvm.org/D127350
Erase gpu.memcpy op when only uses of dest are
the memcpy op in question, its allocation and deallocation
ops.
Reviewed By: bondhugula, csigg
Differential Revision: https://reviews.llvm.org/D124257
Move async copy operations to NVGPU as they only exist on NV target and are
designed to match ptx semantic. This allows us to also add more fine grain
caching hint attribute to the op.
Add hint to bypass L1 and hook it up to NVVM op.
Differential Revision: https://reviews.llvm.org/D125244
This was leftover from when the standard dialect was destroyed, and
when FuncOp moved to the func dialect. Now that these transitions
have settled a bit we can drop these.
Most updates were handled using a simple regex: replace `^( *)func` with `$1func.func`
Differential Revision: https://reviews.llvm.org/D124146
MLIR has a common pattern for "arguments" that uses syntax
like `%x : i32 {attrs} loc("sourceloc")` which is implemented
in adhoc ways throughout the codebase. The approach this uses
is verbose (because it is implemented with parallel arrays) and
inconsistent (e.g. lots of things drop source location info).
Solve this by introducing OpAsmParser::Argument and make addRegion
(which sets up BlockArguments for the region) take it. Convert the
world to propagating this down. This means that we correctly
capture and propagate source location information in a lot more
cases (e.g. see the affine.for testcase example), and it also
simplifies much code.
Differential Revision: https://reviews.llvm.org/D124649
Add async dependencies support for gpu.launch op: this allows specifying
a list of async tokens ("streams") as dependencies for the launch.
Update the GPU kernel outlining pass lowering to propagate async
dependencies from gpu.launch to gpu.launch_func op. Previously, a new
stream was being created and destroyed for a kernel launch. The async
deps support allows the kernel launch to be serialized on an existing
stream.
Differential Revision: https://reviews.llvm.org/D123499
Fold away gpu.memcpy op when only uses of dest are
the memcpy op in question, its allocation and deallocation
ops.
Reviewed By: bondhugula
Differential Revision: https://reviews.llvm.org/D121279
This removes any potential confusion with the `getType` accessors
which correspond to SSA results of an operation, and makes it
clear what the intent is (i.e. to represent the type of the function).
Differential Revision: https://reviews.llvm.org/D121762
In this CL, update the function name of verifier according to the
behavior. If a verifier needs to access the region then it'll be updated
to `verifyRegions`.
Reviewed By: rriddle
Differential Revision: https://reviews.llvm.org/D120373
This change gives explicit order of verifier execution and adds
`hasRegionVerifier` and `verifyWithRegions` to increase the granularity
of verifier classification. The orders are as below,
1. InternalOpTrait will be verified first, they can be run independently.
2. `verifyInvariants` which is constructed by ODS, it verifies the type,
attributes, .etc.
3. Other Traits/Interfaces that have marked their verifier as
`verifyTrait` or `verifyWithRegions=0`.
4. Custom verifier which is defined in the op and has marked
`hasVerifier=1`
If an operation has regions, then it may have the second phase,
5. Traits/Interfaces that have marked their verifier as
`verifyRegionTrait` or
`verifyWithRegions=1`. This implies the verifier needs to access the
operations in its regions.
6. Custom verifier which is defined in the op and has marked
`hasRegionVerifier=1`
Note that the second phase will be run after the operations in the
region are verified. Based on the verification order, you will be able to
avoid verifying duplicate things.
Reviewed By: Mogball
Differential Revision: https://reviews.llvm.org/D116789
This op is added to allow MLIR code running on multi-GPU systems to
select the GPU they want to execute operations on when no GPU is
otherwise specified.
Reviewed By: mehdi_amini
Differential Revision: https://reviews.llvm.org/D119883
Previously `gpu-kernel-outlining` pass was also doing index computation sinking into gpu.launch before actual outlining.
Split ops sinking from `gpu-kernel-outlining` pass into separate pass, so users can use theirs own sinking pass before outlining.
To achieve old behavior users will need to call both passes: `-gpu-launch-sink-index-computations -gpu-kernel-outlining`.
Differential Revision: https://reviews.llvm.org/D119932