Adds initial support for GPU by-ref reductions. The main problem for
reduction by reference is that, prior to this PR, we were shuffling
(from remote lanes within the same warp or across different warps within
the block) pointers/references to the private reduction values rather
than the private reduction values themselves.
In particular, this diff adds support for reductions on scalar
allocatables where reductions happen on loops nested in `target`
regions. For example:
```fortran
integer :: i
real, allocatable :: scalar_alloc
allocate(scalar_alloc)
scalar_alloc = 0
!$omp target map(tofrom: scalar_alloc)
!$omp parallel do reduction(+: scalar_alloc)
do i = 1, 1000000
scalar_alloc = scalar_alloc + 1
end do
!$omp end target
```
This PR supports by-ref reductions on the intra- and inter-warp levels.
So far, there are still steps to be takens for full support of by-ref
reductions, for example:
* Support inter-block value combination is still not supported.
Therefore, `target teams distribute parallel do` is still not supported.
* Support for dynamically-sized arrays still needs to be added.
* Support for more than one allocatable/array on the same `reduction`
clause.
This PR shifts from using the LLVM OpenMP enumerator bit flags to an
OpenMP dialect specific enumerator. This allows us to better represent
map types that wouldn't be of interest to the LLVM backend and runtime
in the dialect.
Primarily things like
ref_ptr/ref_ptee/ref_ptr_ptee/atach_none/attach_always/attach_auto which
are of interest to the compiler for certrain transformations (primarily
in the FIR transformation passes dealing with mapping), but the runtime
has no need to know about them. It also means if another OpenMP
implementation comes along they won't need to stick to the same bit flag
system LLVM chose/do leg work to address it.
This commit generalizes `replaceUsesOfBlockArgument` to
`replaceAllUsesWith`. In rollback mode, the same restrictions keep
applying: a value cannot be replaced multiple times and a call to
`replaceAllUsesWith` will replace all current and future uses of the
`from` value.
`replaceAllUsesWith` is now fully supported and its behavior is
consistent with the remaining dialect conversion API. Before this
commit, `replaceAllUsesWith` was immediately reflected in the IR when
running in rollback mode. After this commit, `replaceAllUsesWith`
changes are materialized in a delayed fashion, at the end of the dialect
conversion. This is consistent with the `replaceUsesOfBlockArgument` and
`replaceOp` APIs.
`replaceAllUsesExcept` etc. are still not supported and will be
deactivated on the `ConversionPatternRewriter` (when running in rollback
mode) in a follow-up commit.
Note for LLVM integration: Replace `replaceUsesOfBlockArgument` with
`replaceAllUsesWith`. If you are seeing failures, you may have patterns
that use `replaceAllUsesWith` incorrectly (e.g., being called multiple
times on the same value) or bypass the rewriter API entirely. E.g., such
failures were mitigated in Flang by switching to the walk-patterns
driver (#156171).
You can temporarily reactivate the old behavior by calling
`RewriterBase::replaceAllUsesWith`. However, note that that behavior is
faulty in a dialect conversion. E.g., the base
`RewriterBase::replaceAllUsesWith` implementation does not see uses of
the `from` value that have not materialized yet and will, therefore, not
replace them.
Fixes#155273
This PR introduces 2 changes:
1. The `do concurrent` to OpenMP pass is now a module pass rather than a
function pass.
2. Reduction ops are looked up in the parent module before being
created.
The benefit of using a module pass is that the same reduction operation
can be used across multiple functions if the reduction type matches.
Starts the effort to map `do concurrent` locality specifiers to OpenMP
clauses. This PR adds support for basic specifiers (no `init` or `copy`
regions yet).
This PR updates the `do concurrent` to OpenMP mapping pass to use the
newly added `fir.do_concurrent` ops that were recently added upstream
instead of handling nests of `fir.do_loop ... unordered` ops.
Parent PR: https://github.com/llvm/llvm-project/pull/137928.
This PR starts the effort to upstream AMD's internal implementation of
`do concurrent` to OpenMP mapping. This replaces #77285 since we
extended this WIP quite a bit on our fork over the past year.
An important part of this PR is a document that describes the current
status downstream, the upstreaming status, and next steps to make this
pass much more useful.
In addition to this document, this PR also contains the skeleton of the
pass (no useful transformations are done yet) and some testing for the
added command line options.
This looks like a huge PR but a lot of the added stuff is documentation.
It is also worth noting that the downstream pass has been validated on
https://github.com/BerkeleyLab/fiats. For the CPU mapping, this achived
performance speed-ups that match pure OpenMP, for GPU mapping we are
still working on extending our support for implicit memory mapping and
locality specifiers.
PR stack:
- https://github.com/llvm/llvm-project/pull/126026 (this PR)
- https://github.com/llvm/llvm-project/pull/127595
- https://github.com/llvm/llvm-project/pull/127633
- https://github.com/llvm/llvm-project/pull/127634
- https://github.com/llvm/llvm-project/pull/127635