The `scf.index_switch` is a control-flow operation that branches to one of the
given regions based on the values of the argument and the cases. The
argument is always of type `index`.
Example:
```mlir
%0 = scf.index_switch %arg0 -> i32
case 2 {
%1 = arith.constant 10 : i32
scf.yield %1 : i32
}
case 5 {
%2 = arith.constant 20 : i32
scf.yield %2 : i32
}
default {
%3 = arith.constant 30 : i32
scf.yield %3 : i32
}
```
Reviewed By: jpienaar
Differential Revision: https://reviews.llvm.org/D136003
By matching `arith.constant` specifically, SCF canonicalizers/folders
are incompatible with other kinds of constants. Use the generic
matchers instead.
Reviewed By: rriddle
Differential Revision: https://reviews.llvm.org/D135517
This revision adds a new op `map_nested_foreach_thread_to_gpu_threads` to transform dialect. The op searches `scf.foreach_threads` inside the `gpu_launch` and distributes them with `gpu.thread_id` attribute.
Loop mapping is explicit and given by the `map_nested_foreach_thread_to_gpu_threads` op. Mapping is done one-to-one, therefore the loops dissappear.
The dynamic trip count or trip count that are larger than thread size are not supported for the time being. However, we can indeed support them by generating a loop inside with cyclic scheduling.
For the time being, trip counts that are dynamic or bigger than thread sizes are not supported. However, in the future the compiler can indeed generate a loop with static cyclic scheduling to support these cases.
Current mechanism allows `scf.foreach_threads` to be siblings or nested. There cannot be interleaving code between the loops when they are nested.
Reviewed By: nicolasvasilache
Differential Revision: https://reviews.llvm.org/D133950
This change refines the semantics of scf.foreach_thread. Tensors that are inserted into in the terminator must now be passed to the region explicitly via `shared_outs`. Inside of the body of the op, those tensors are then accessed via block arguments.
The body of a scf.foreach_thread is now treated as a repetitive region. I.e., op dominance can no longer be used in conflict detection when using a value that is defined outside of the body. Such uses may now be considered as conflicts (if there is at least one read and one write in the body), effectively privatizing the tensor. Shared outputs are not privatized when they are used via their corresponding block arguments.
As part of this change, it was also necessary to update the "tiling to scf.foreach_thread", such that the generated tensor.extract_slice ops use the scf.foreach_thread's block arguments. This is implemented by cloning the TilingInterface op inside the scf.foreach_thread, rewriting all of its outputs with block arguments and then calling the tiling implementation. Afterwards, the cloned op is deleted again.
Differential Revision: https://reviews.llvm.org/D133114
This overload just wraps a bitvector, and in most cases a bitvector
could be used directly instead of a list.
Reviewed By: rriddle
Differential Revision: https://reviews.llvm.org/D132896
This reland includes changes to the Python bindings.
Switch variadic operand and result segment size attributes to use the
dense i32 array. Dense integer arrays were introduced primarily to
represent index lists. They are a better fit for segment sizes than
dense elements attrs.
Depends on D131801
Reviewed By: rriddle
Differential Revision: https://reviews.llvm.org/D131803
Switch variadic operand and result segment size attributes to use the
dense i32 array. Dense integer arrays were introduced primarily to
represent index lists. They are a better fit for segment sizes than
dense elements attrs.
Depends on D131738
Reviewed By: mehdi_amini
Differential Revision: https://reviews.llvm.org/D131702
This is moslty NFC and will allow tensor.parallel_insert_slice to gain
rank-reducing semantics by reusing the vast majority of the tensor.insert_slice impl.
Depends on D128857
Differential Revision: https://reviews.llvm.org/D128920
This allows purging references of scf.ForeachThreadOp and scf.PerformConcurrentlyOp from
ParallelInsertSliceOp.
This will allowmoving the op closer to tensor::InsertSliceOp with which it should share much more
code.
In the future, the decoupling will also allow extending the type of ops that can be used in the
parallel combinator as well as semantics related to multiple concurrent inserts to the same
result.
Differential Revision: https://reviews.llvm.org/D128857
This was previous implemented as part of the BufferizableOpInterface of ForEachThreadOp. Moving the implementation to ParallelInsertSliceOp to be consistent with the remaining ops and to have a nice example op that can serve as a blueprint for other ops.
Differential Revision: https://reviews.llvm.org/D128666
An optional thread_dim_mapping index array attribute specifies for each
virtual thread dimension, how it remaps 1-1 to a set of concrete processing
element resources (e.g. a CUDA grid dimension or a level of concrete nested
async parallelism). At this time, the specification is backend-dependent and
is not verified by the op, beyond being an index array attribute.
It is the reponsibility of the lowering to interpret the index array in the
context of the concrete target the op is lowered to, or to ignore it when
the specification is ill-formed or unsupported for a particular target.
Differential Revision: https://reviews.llvm.org/D128633
This revision adds the necessary plumbing for canonicalizing scf::ForeachThread with the
`AffineOpSCFCanonicalizationPattern`.
In the process the `loopMatcher` helper is updated to take OpFoldResult instead of just values.
This allows composing various scenarios without the need for an artificial builder.
Differential Revision: https://reviews.llvm.org/D128244
This aligns the SCF dialect file layout with the majority of the dialects.
Reviewed By: jpienaar
Differential Revision: https://reviews.llvm.org/D128049