<!--===- docs/DoConcurrentMappingToOpenMP.md

   Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
   See https://llvm.org/LICENSE.txt for license information.
   SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception

-->

# `DO CONCURRENT` mapping to OpenMP

```{contents}
---
local:
---
```

This document seeks to describe the effort to parallelize `do concurrent` loops
by mapping them to OpenMP worksharing constructs. The goals of this document
are:
* Describing how to instruct `flang` to map `DO CONCURRENT` loops to OpenMP
  constructs.
* Tracking the current status of such mapping.
* Describing the limitations of the current implementation.
* Describing next steps.
* Tracking the current upstreaming status (from the AMD ROCm fork).

## Usage

In order to enable `do concurrent` to OpenMP mapping, `flang` adds a new
compiler flag: `-fdo-concurrent-to-openmp`. This flag has 3 possible values:
1. `host`: this maps `do concurrent` loops to run in parallel on the host CPU.
   This maps such loops to the equivalent of `omp parallel do`.
2. `device`: this maps `do concurrent` loops to run in parallel on a target device.
   This maps such loops to the equivalent of
   `omp target teams distribute parallel do`.
3. `none`: this disables `do concurrent` mapping altogether. In that case, such
   loops are emitted as sequential loops.

The `-fdo-concurrent-to-openmp` compiler switch is currently available only when
OpenMP is also enabled. So you need to provide the following options to flang in
order to enable it:
```
flang ... -fopenmp -fdo-concurrent-to-openmp=[host|device|none] ...
```
For mapping to device, the target device architecture must be specified as well.
See `-fopenmp-targets` and `--offload-arch` for more info.

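As a starting point for trying the flag, the following is a minimal sketch of a
program whose only loop is a candidate for mapping. The file name `squares.f90`
and the program name are hypothetical and chosen for illustration only; the
check at the end does not depend on how the loop was executed.

```fortran
! Hypothetical file name: squares.f90
program squares
  implicit none
  integer, parameter :: n = 1000
  integer :: i
  integer :: a(n)

  ! Candidate for mapping: every iteration writes a distinct element.
  do concurrent (i = 1:n)
    a(i) = i * i
  end do

  ! Self-check that is independent of the chosen mapping.
  if (a(1) /= 1 .or. a(n) /= n * n) error stop "unexpected result"
  print *, "ok"
end program squares
```

Following the command shape shown above, this could be built for the host with
`flang squares.f90 -fopenmp -fdo-concurrent-to-openmp=host`.
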
## Current status

Under the hood, `do concurrent` mapping is implemented in the
`DoConcurrentConversionPass`. This is still an experimental pass which means
that:
* It has been tested in a very limited way so far.
* It has been tested mostly on simple synthetic inputs.

### Loop nest detection

On the `FIR` dialect level, the following loop:
```fortran
  do concurrent(i=1:n, j=1:m, k=1:o)
    a(i,j,k) = i + j + k
  end do
```
is modelled as a nest of `fir.do_loop` ops such that an outer loop's region
contains **only** the following:
  1. The operations needed to assign/update the outer loop's induction variable.
  1. The inner loop itself.

So the MLIR structure for the above example looks similar to the following:
```
  fir.do_loop %i_idx = %34 to %36 step %c1 unordered {
    %i_idx_2 = fir.convert %i_idx : (index) -> i32
    fir.store %i_idx_2 to %i_iv#1 : !fir.ref<i32>

    fir.do_loop %j_idx = %37 to %39 step %c1_3 unordered {
      %j_idx_2 = fir.convert %j_idx : (index) -> i32
      fir.store %j_idx_2 to %j_iv#1 : !fir.ref<i32>

      fir.do_loop %k_idx = %40 to %42 step %c1_5 unordered {
        %k_idx_2 = fir.convert %k_idx : (index) -> i32
        fir.store %k_idx_2 to %k_iv#1 : !fir.ref<i32>

        ... loop nest body goes here ...
      }
    }
  }
```
This applies to multi-range loops in general; they are represented in the IR as
a nest of `fir.do_loop` ops with the above nesting structure.

Therefore, the pass detects such "perfectly" nested loop ops to identify multi-range
loops and map them as "collapsed" loops in OpenMP.

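To make the "collapsed" mapping more concrete, the sketch below places the
multi-range loop next to a hand-written OpenMP loop nest with a `collapse(3)`
clause. This is only an approximation written by hand, not output of the pass;
the array `b` and the program name are assumptions added for the comparison.

```fortran
program collapse_sketch
  implicit none
  integer, parameter :: n = 4, m = 5, o = 6
  integer :: i, j, k
  integer :: a(n, m, o), b(n, m, o)

  ! Multi-range form, as in the example above.
  do concurrent (i = 1:n, j = 1:m, k = 1:o)
    a(i, j, k) = i + j + k
  end do

  ! Hand-written approximation: a single worksharing loop over the
  ! collapsed i/j/k iteration space.
  !$omp parallel do collapse(3)
  do i = 1, n
    do j = 1, m
      do k = 1, o
        b(i, j, k) = i + j + k
      end do
    end do
  end do
  !$omp end parallel do

  ! Both forms are expected to compute the same values.
  if (any(a /= b)) error stop "results differ"
  print *, "ok"
end program collapse_sketch
```
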
#### Further info regarding loop nest detection

Loop nest detection is currently limited to the scenario described in the previous
section. However, this is quite limited and can be extended in the future to cover
more cases. At the moment, for the following loop nest, even though both loops are
perfectly nested, only the outer loop is parallelized:
```fortran
do concurrent(i=1:n)
  do concurrent(j=1:m)
    a(i,j) = i * j
  end do
end do
```

Similarly, for the following loop nest, even though the intervening statement `x = 41`
does not have any memory effects that would affect parallelization, the two loops
are not detected as a nest either, so only the outer loop is parallelized.

```fortran
do concurrent(i=1:n)
  x = 41
  do concurrent(j=1:m)
    a(i,j) = i * j
  end do
end do
```

The above also has the consequence that the `j` variable will **not** be
privatized in the OpenMP parallel/target region. In other words, it will be
treated as if it was a `shared` variable. For more details about privatization,
see the "Data environment" section below.

See `flang/test/Transforms/DoConcurrent/loop_nest_test.f90` for more examples
of what is and is not detected as a perfect loop nest.

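Given these limits, one way to get both loops into the detected nest is to write
them as a single multi-range `do concurrent`. The sketch below assumes the inner
bounds do not depend on the outer index, since that is what makes the rewrite
possible; the array `b` and the program name are hypothetical.

```fortran
program merge_ranges
  implicit none
  integer, parameter :: n = 8, m = 6
  integer :: i, j
  integer :: a(n, m), b(n, m)

  ! Nested form: per the text above, only the outer loop is parallelized.
  do concurrent (i = 1:n)
    do concurrent (j = 1:m)
      a(i, j) = i * j
    end do
  end do

  ! Multi-range form: the same iteration space in one construct, which is
  ! the shape the pass detects as a loop nest.
  do concurrent (i = 1:n, j = 1:m)
    b(i, j) = i * j
  end do

  ! Both forms are expected to compute the same values.
  if (any(a /= b)) error stop "results differ"
  print *, "ok"
end program merge_ranges
```
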
### Single-range loops

Given the following loop:
```fortran
  do concurrent(i=1:n)
    a(i) = i * i
  end do
```

#### Mapping to `host`

Mapping this loop to the `host` generates MLIR operations of the following
structure:

```
%4 = fir.address_of(@_QFEa) ...
%6:2 = hlfir.declare %4 ...

omp.parallel {
  // Allocate private copy for `i`.
  // TODO Use delayed privatization.
  %19 = fir.alloca i32 {bindc_name = "i"}
  %20:2 = hlfir.declare %19 {uniq_name = "_QFEi"} ...

  omp.wsloop {
    omp.loop_nest (%arg0) : index = (%21) to (%22) inclusive step (%c1_2) {
      %23 = fir.convert %arg0 : (index) -> i32
      // Use the privatized version of `i`.
      fir.store %23 to %20#1 : !fir.ref<i32>
      ...

      // Use "shared" SSA value of `a`.
      %42 = hlfir.designate %6#0
      hlfir.assign %35 to %42
      ...
      omp.yield
    }
    omp.terminator
  }
  omp.terminator
}
```

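At the source level, the structure above corresponds roughly to a parallel
region with a private `i` that encloses a worksharing loop over a shared array.
The following is a hand-written sketch of that correspondence, not something the
pass produces; the array `b` and the program name are assumptions used to
compare the two forms.

```fortran
program host_sketch
  implicit none
  integer, parameter :: n = 1000
  integer :: i
  integer :: a(n), b(n)

  ! Original form.
  do concurrent (i = 1:n)
    a(i) = i * i
  end do

  ! Hand-written approximation of the host mapping: `i` is private to
  ! each thread, while the array is shared by all of them.
  !$omp parallel private(i) shared(b)
  !$omp do
  do i = 1, n
    b(i) = i * i
  end do
  !$omp end do
  !$omp end parallel

  ! Both forms are expected to compute the same values.
  if (any(a /= b)) error stop "results differ"
  print *, "ok"
end program host_sketch
```
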
#### Mapping to `device`

<!-- TODO -->

### Multi-range loops

The pass currently supports multi-range loops as well. Given the following
example:

```fortran
   do concurrent(i=1:n, j=1:m)
       a(i,j) = i * j
   end do
```

The generated `omp.loop_nest` operation looks like:

```
omp.loop_nest (%arg0, %arg1)
    : index = (%17, %19) to (%18, %20)
    inclusive step (%c1_2, %c1_4) {
  fir.store %arg0 to %private_i#1 : !fir.ref<i32>
  fir.store %arg1 to %private_j#1 : !fir.ref<i32>
  ...
  omp.yield
}
```

It is worth noting that we have privatized versions for both iteration
variables: `i` and `j`. These are locally allocated inside the parallel/target
OpenMP region, similar to what the single-range example in the previous section
shows.

### Data environment

By default, variables that are used inside a `do concurrent` loop nest are
either treated as `shared` in case of mapping to `host`, or mapped into the
`target` region using a `map` clause in case of mapping to `device`. The only
exceptions to this are:
  1. the loop's iteration variable(s) (IV) of **perfect** loop nests. In that
     case, for each IV, we allocate a local copy as shown by the mapping
     examples above.
  1. any values that are from allocations outside the loop nest and used
     exclusively inside of it. In such cases, a local privatized
     copy is created in the OpenMP region to prevent multiple teams of threads
     from accessing and destroying the same memory block, which causes runtime
     issues. For an example of such cases, see
     `flang/test/Transforms/DoConcurrent/locally_destroyed_temp.f90`.

Implicit mapping detection (for mapping to the target device) is still quite
limited and work to make it smarter is underway for both OpenMP in general
and `do concurrent` mapping.

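For the `device` case, the defaults described above can be pictured with a
hand-written OpenMP loop in which the array is mapped and the IV is private.
This is a sketch only: the `tofrom` map type is an assumption (the text above
does not state which map type the pass uses), and the array `b` and the program
name are hypothetical.

```fortran
program device_sketch
  implicit none
  integer, parameter :: n = 1000
  integer :: i
  integer :: a(n), b(n)

  ! Original form.
  do concurrent (i = 1:n)
    a(i) = i * i
  end do

  ! Hand-written approximation of the device mapping. The map type is an
  ! assumption made for this sketch; `i` is private, as for perfect nests.
  !$omp target teams distribute parallel do map(tofrom: b) private(i)
  do i = 1, n
    b(i) = i * i
  end do
  !$omp end target teams distribute parallel do

  ! Both forms are expected to compute the same values.
  if (any(a /= b)) error stop "results differ"
  print *, "ok"
end program device_sketch
```
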
#### Non-perfectly-nested loops' IVs

For non-perfectly-nested loops, the IVs are still treated as `shared` or
`map` entries as pointed out above. This **might not** be consistent with what
the Fortran specification tells us. In particular, taking the following
snippets from the spec (version 2023) into account:

> § 3.35
> ------
> construct entity
> entity whose identifier has the scope of a construct

> § 19.4
> ------
>  A variable that appears as an index-name in a FORALL or DO CONCURRENT
>  construct [...] is a construct entity. A variable that has LOCAL or
>  LOCAL_INIT locality in a DO CONCURRENT construct is a construct entity.
> [...]
> The name of a variable that appears as an index-name in a DO CONCURRENT
> construct, FORALL statement, or FORALL construct has a scope of the statement
> or construct. A variable that has LOCAL or LOCAL_INIT locality in a DO
> CONCURRENT construct has the scope of that construct.

From the above quotes, it seems there is an equivalence between the IV of a `do
concurrent` loop and a variable with a `LOCAL` locality specifier (equivalent
to OpenMP's `private` clause). This means that we should probably
localize/privatize a `do concurrent` loop's IV even if it is not perfectly
nested in the nest we are parallelizing. For now, however, we **do not** do
that as pointed out previously. In the near future, we propose a middle-ground
solution (see the Next steps section for more details).

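The correspondence between `LOCAL` and `private` can be illustrated with a
scalar temporary. The sketch below is standard Fortran plus a hand-written
OpenMP counterpart, and it needs a compiler that accepts locality specifiers;
the names `tmp` and `b` are hypothetical. Note that `LOCAL` gives each iteration
its own entity while `private` gives each thread its own copy, so `private` is
the closest counterpart rather than an exact one.

```fortran
program local_sketch
  implicit none
  integer, parameter :: n = 100
  integer :: i, tmp
  integer :: a(n), b(n)

  ! `tmp` is a construct entity here: it is local to the construct.
  do concurrent (i = 1:n) local(tmp)
    tmp = 2 * i
    a(i) = tmp + 1
  end do

  ! Closest hand-written OpenMP counterpart: one copy of `tmp` per thread.
  !$omp parallel do private(tmp)
  do i = 1, n
    tmp = 2 * i
    b(i) = tmp + 1
  end do
  !$omp end parallel do

  ! Both forms are expected to compute the same values.
  if (any(a /= b)) error stop "results differ"
  print *, "ok"
end program local_sketch
```
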
<!--
More details about current status will be added along with relevant parts of the
implementation in later upstreaming patches.
-->

## Next steps

This section describes some of the open questions/issues that are not tackled yet
even in the downstream implementation.

### Separate MLIR op for `do concurrent`

At the moment, both increment and concurrent loops are represented by one MLIR
op: `fir.do_loop`; where we differentiate concurrent loops with the `unordered`
attribute. This is not ideal since the `fir.do_loop` op supports only single
iteration ranges. Consequently, to model multi-range `do concurrent` loops, flang
emits a nest of `fir.do_loop` ops which we have to detect in the OpenMP conversion
pass to handle multi-range loops. Instead, it would be better to model multi-range
concurrent loops using a separate op, which would make the IR more representative
of the input Fortran code and also easier to detect and transform.

### Delayed privatization

So far, we emit the privatization logic for IVs inline in the parallel/target
region. This is enough for our purposes right now since we don't
localize/privatize any sophisticated types of variables yet. Once we have need
for more advanced localization through `do concurrent`'s locality specifiers
(see below), delayed privatization will enable us to have a much cleaner IR.
Once the upstream implementation of delayed privatization supports the
constructs the pass requires, we will move to it rather than inlined/early
privatization.

### Locality specifiers for `do concurrent`

Locality specifiers will enable the user to control the data environment of the
loop nest in a more fine-grained way. Implementing these specifiers on the
`FIR` dialect level is needed in order to support this in the
`DoConcurrentConversionPass`.

Such specifiers will also unlock a potential solution to the
non-perfectly-nested loops' IVs issue described above. In particular, for a
non-perfectly nested loop, one middle-ground proposal/solution would be to:
* Emit the loop's IV as shared/mapped just like we do currently.
* Emit a warning that the IV of the loop is emitted as shared/mapped.
* Given support for `LOCAL`, we can recommend that the user explicitly
  localize/privatize the loop's IV if they choose to.

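At the source level, the last bullet might look like the sketch below, where the
user lists the inner loop's IV in a `LOCAL` specifier on the outer construct.
This is one possible reading of the proposal, not a confirmed design, and it is
not something the pass handles today. The program name is hypothetical, and `x`
is also made local here as an extra assumption so that each iteration has its
own copy.

```fortran
program localize_iv
  implicit none
  integer, parameter :: n = 8, m = 6
  integer :: i, j, x
  integer :: a(n, m)

  ! Non-perfect nest: `x = 41` sits between the two loop headers.
  ! The user explicitly localizes the inner IV `j` (and `x`).
  do concurrent (i = 1:n) local(j, x)
    x = 41
    do concurrent (j = 1:m)
      a(i, j) = i * j + x
    end do
  end do

  ! Spot-check two corners of the array.
  if (a(1, 1) /= 42 .or. a(n, m) /= n * m + 41) error stop "unexpected result"
  print *, "ok"
end program localize_iv
```
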
#### Sharing TableGen clause records from the OpenMP dialect

At the moment, the FIR dialect does not have a way to model locality specifiers
on the IR level. Instead, something similar to early/eager privatization in OpenMP
is done for the locality specifiers in `fir.do_loop` ops. Having locality specifiers
modelled in a way similar to delayed privatization (i.e. the `omp.private` op) and
reductions (i.e. the `omp.declare_reduction` op) can make mapping `do concurrent`
to OpenMP (and other parallel programming models) much easier.

Therefore, one way to approach this problem is to extract the TableGen records
for relevant OpenMP clauses in a shared dialect for "data environment management"
and use these shared records for OpenMP, `do concurrent`, and possibly OpenACC
as well.

#### Supporting reductions

Similar to locality specifiers, mapping reductions from `do concurrent` to OpenMP
is also still an open TODO. We can potentially extend the MLIR infrastructure
proposed in the previous section to share reduction records among the different
relevant dialects as well.

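To show what such a mapping would have to connect, the sketch below places a
Fortran 2023 `REDUCE` locality specifier next to a hand-written OpenMP
`reduction` clause. It illustrates the source-level correspondence only and says
nothing about how the pass will implement it; it needs a compiler that accepts
`REDUCE`, and the names `s1` and `s2` are hypothetical.

```fortran
program reduce_sketch
  implicit none
  integer, parameter :: n = 100
  integer :: i, s1, s2

  ! Reduction expressed with a DO CONCURRENT locality specifier.
  s1 = 0
  do concurrent (i = 1:n) reduce(+:s1)
    s1 = s1 + i
  end do

  ! Hand-written OpenMP counterpart of the same sum.
  s2 = 0
  !$omp parallel do reduction(+:s2)
  do i = 1, n
    s2 = s2 + i
  end do
  !$omp end parallel do

  ! Both sums should equal the closed form n*(n+1)/2.
  if (s1 /= n * (n + 1) / 2 .or. s2 /= s1) error stop "unexpected sum"
  print *, "ok"
end program reduce_sketch
```
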
### More advanced detection of loop nests

As pointed out earlier, any intervening code between the headers of 2 nested
`do concurrent` loops prevents us from detecting this as a loop nest. In some
cases this is overly conservative. Therefore, a more flexible detection logic
of loop nests needs to be implemented.

### Data-dependence analysis

Right now, we map loop nests without analysing whether such mapping is safe to
do or not. We probably need to at least warn the user of unsafe loop nests due
to loop-carried dependencies.

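As an illustration of the kind of loop such an analysis would need to flag, the
sketch below shows a loop-carried dependence. It is deliberately written as an
ordinary `do` loop: the same body inside a `do concurrent` would be
non-conforming, and, per the text above, the pass would currently map it without
any diagnostic. The program name is hypothetical.

```fortran
program carried_dependence
  implicit none
  integer, parameter :: n = 10
  integer :: i
  integer :: a(n)

  a = 1

  ! Iteration i reads a(i-1), which iteration i-1 writes. The result is
  ! only correct when the iterations run in order.
  do i = 2, n
    a(i) = a(i - 1) + 1
  end do

  ! Run in order, the loop produces a(i) = i.
  if (a(n) /= n) error stop "unexpected result"
  print *, "ok"
end program carried_dependence
```
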
### Non-rectangular loop nests

So far, we did not need to use the pass for non-rectangular loop nests. For
example:
```fortran
do concurrent(i=1:n)
  do concurrent(j=i:n)
    ...
  end do
end do
```
We defer this to the (hopefully) near future when we get the conversion in a
good shape for the samples/projects at hand.

### Generalizing the pass to other parallel programming models

Once we have a stable and capable `do concurrent` to OpenMP mapping, we can take
this in a more generalized direction and allow the pass to target other models;
e.g. OpenACC. This goal should be kept in mind from the get-go even while only
targeting OpenMP.

## Upstreaming status

- [x] Command line options for `flang` and `bbc`.
- [x] Conversion pass skeleton (no transformations happen yet).
- [x] Status description and tracking document (this document).
- [x] Loop nest detection to identify multi-range loops.
- [ ] Basic host/CPU mapping support.
- [ ] Basic device/GPU mapping support.
- [ ] More advanced host and device support (expanded to multiple items as needed).