llvm-project/flang/docs/DoConcurrentConversionToOpenMP.md

<!--===- docs/DoConcurrentMappingToOpenMP.md

   Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
   See https://llvm.org/LICENSE.txt for license information.
   SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception

-->

# `DO CONCURRENT` mapping to OpenMP

```{contents}
---
local:
---
```

This document seeks to describe the effort to parallelize `do concurrent` loops
by mapping them to OpenMP worksharing constructs. The goals of this document
are:
* Describing how to instruct `flang` to map `DO CONCURRENT` loops to OpenMP
  constructs.
* Tracking the current status of such mapping.
* Describing the limitations of the current implementation.
* Describing next steps.
* Tracking the current upstreaming status (from the AMD ROCm fork).

## Usage

In order to enable `do concurrent` to OpenMP mapping, `flang` adds a new
compiler flag: `-fdo-concurrent-to-openmp`. This flag has 3 possible values:
1. `host`: this maps `do concurrent` loops to run in parallel on the host CPU.
   This maps such loops to the equivalent of `omp parallel do`.
2. `device`: this maps `do concurrent` loops to run in parallel on a target device.
   This maps such loops to the equivalent of
   `omp target teams distribute parallel do`.
3. `none`: this disables `do concurrent` mapping altogether. In that case, such
   loops are emitted as sequential loops.

The `-fdo-concurrent-to-openmp` compiler switch is currently available only when
OpenMP is also enabled. So you need to provide the following options to flang in
order to enable it:
```
flang ... -fopenmp -fdo-concurrent-to-openmp=[host|device|none] ...
```
For mapping to device, the target device architecture must be specified as well.
See `-fopenmp-targets` and `--offload-arch` for more info.

## Current status

Under the hood, `do concurrent` mapping is implemented in the
`DoConcurrentConversionPass`. This is still an experimental pass which means
that:
* It has been tested in a very limited way so far.
* It has been tested mostly on simple synthetic inputs.

### Loop nest detection

On the `FIR` dialect level, the following loop:
```fortran
  do concurrent(i=1:n, j=1:m, k=1:o)
    a(i,j,k) = i + j + k
  end do
```
is modelled as a nest of `fir.do_loop` ops such that an outer loop's region
contains **only** the following:
  1. The operations needed to assign/update the outer loop's induction variable.
  1. The inner loop itself.

So the MLIR structure for the above example looks similar to the following:
```
  fir.do_loop %i_idx = %34 to %36 step %c1 unordered {
    %i_idx_2 = fir.convert %i_idx : (index) -> i32
    fir.store %i_idx_2 to %i_iv#1 : !fir.ref<i32>

    fir.do_loop %j_idx = %37 to %39 step %c1_3 unordered {
      %j_idx_2 = fir.convert %j_idx : (index) -> i32
      fir.store %j_idx_2 to %j_iv#1 : !fir.ref<i32>

      fir.do_loop %k_idx = %40 to %42 step %c1_5 unordered {
        %k_idx_2 = fir.convert %k_idx : (index) -> i32
        fir.store %k_idx_2 to %k_iv#1 : !fir.ref<i32>

        ... loop nest body goes here ...
      }
    }
  }
```
This applies to multi-range loops in general; they are represented in the IR as
a nest of `fir.do_loop` ops with the above nesting structure.

Therefore, the pass detects such "perfectly" nested loop ops to identify multi-range
loops and map them as "collapsed" loops in OpenMP.

#### Further info regarding loop nest detection

Loop nest detection is currently limited to the scenario described in the previous
section. However, this is quite limited and can be extended in the future to cover
more cases. At the moment, for the following loop nest, even though both loops are
perfectly nested, only the outer loop is parallelized:
```fortran
do concurrent(i=1:n)
  do concurrent(j=1:m)
    a(i,j) = i * j
  end do
end do
```

Similarly, for the following loop nest, even though the intervening statement `x = 41`
does not have any memory effects that would affect parallelization, this nest is
not parallelized either (only the outer loop is).

```fortran
do concurrent(i=1:n)
  x = 41
  do concurrent(j=1:m)
    a(i,j) = i * j
  end do
end do
```

The above also has the consequence that the `j` variable will **not** be
privatized in the OpenMP parallel/target region. In other words, it will be
treated as if it was a `shared` variable. For more details about privatization,
see the "Data environment" section below.

See `flang/test/Transforms/DoConcurrent/loop_nest_test.f90` for more examples
of what is and is not detected as a perfect loop nest.

### Single-range loops

Given the following loop:
```fortran
  do concurrent(i=1:n)
    a(i) = i * i
  end do
```

#### Mapping to `host`

Mapping this loop to the `host`, generates MLIR operations of the following
structure:

```
%4 = fir.address_of(@_QFEa) ...
%6:2 = hlfir.declare %4 ...

omp.parallel {
  // Allocate private copy for `i`.
  // TODO Use delayed privatization.
  %19 = fir.alloca i32 {bindc_name = "i"}
  %20:2 = hlfir.declare %19 {uniq_name = "_QFEi"} ...

  omp.wsloop {
    omp.loop_nest (%arg0) : index = (%21) to (%22) inclusive step (%c1_2) {
      %23 = fir.convert %arg0 : (index) -> i32
      // Use the privatized version of `i`.
      fir.store %23 to %20#1 : !fir.ref<i32>
      ...

      // Use "shared" SSA value of `a`.
      %42 = hlfir.designate %6#0
      hlfir.assign %35 to %42
      ...
      omp.yield
    }
    omp.terminator
  }
  omp.terminator
}
```

#### Mapping to `device`

<!-- TODO -->

### Multi-range loops

The pass currently supports multi-range loops as well. Given the following
example:

```fortran
   do concurrent(i=1:n, j=1:m)
       a(i,j) = i * j
   end do
```

The generated `omp.loop_nest` operation look like:

```
omp.loop_nest (%arg0, %arg1)
    : index = (%17, %19) to (%18, %20)
    inclusive step (%c1_2, %c1_4) {
  fir.store %arg0 to %private_i#1 : !fir.ref<i32>
  fir.store %arg1 to %private_j#1 : !fir.ref<i32>
  ...
  omp.yield
}
```

It is worth noting that we have privatized versions for both iteration
variables: `i` and `j`. These are locally allocated inside the parallel/target
OpenMP region similar to what the single-range example in previous section
shows.

### Data environment

By default, variables that are used inside a `do concurrent` loop nest are
either treated as `shared` in case of mapping to `host`, or mapped into the
`target` region using a `map` clause in case of mapping to `device`. The only
exceptions to this are:
  1. the loop's iteration variable(s) (IV) of **perfect** loop nests. In that
     case, for each IV, we allocate a local copy as shown by the mapping
     examples above.
  1. any values that are from allocations outside the loop nest and used
     exclusively inside of it. In such cases, a local privatized
     copy is created in the OpenMP region to prevent multiple teams of threads
     from accessing and destroying the same memory block, which causes runtime
     issues. For an example of such cases, see
     `flang/test/Transforms/DoConcurrent/locally_destroyed_temp.f90`.

Implicit mapping detection (for mapping to the target device) is still quite
limited and work to make it smarter is underway for both OpenMP in general
and `do concurrent` mapping.

#### Non-perfectly-nested loops' IVs

For non-perfectly-nested loops, the IVs are still treated as `shared` or
`map` entries as pointed out above. This **might not** be consistent with what
the Fortran specification tells us. In particular, taking the following
snippets from the spec (version 2023) into account:

> § 3.35
> ------
> construct entity
> entity whose identifier has the scope of a construct

> § 19.4
> ------
>  A variable that appears as an index-name in a FORALL or DO CONCURRENT
>  construct [...] is a construct entity. A variable that has LOCAL or
>  LOCAL_INIT locality in a DO CONCURRENT construct is a construct entity.
> [...]
> The name of a variable that appears as an index-name in a DO CONCURRENT
> construct, FORALL statement, or FORALL construct has a scope of the statement
> or construct. A variable that has LOCAL or LOCAL_INIT locality in a DO
> CONCURRENT construct has the scope of that construct.

From the above quotes, it seems there is an equivalence between the IV of a `do
concurrent` loop and a variable with a `LOCAL` locality specifier (equivalent
to OpenMP's `private` clause). Which means that we should probably
localize/privatize a `do concurrent` loop's IV even if it is not perfectly
nested in the nest we are parallelizing. For now, however, we **do not** do
that as pointed out previously. In the near future, we propose a middle-ground
solution (see the Next steps section for more details).

<!--
More details about current status will be added along with relevant parts of the
implementation in later upstreaming patches.
-->

## Next steps

This section describes some of the open questions/issues that are not tackled yet
even in the downstream implementation.

### Separate MLIR op for `do concurrent`

At the moment, both increment and concurrent loops are represented by one MLIR
op: `fir.do_loop`; where we differentiate concurrent loops with the `unordered`
attribute. This is not ideal since the `fir.do_loop` op support only single
iteration ranges. Consequently, to model multi-range `do concurrent` loops, flang
emits a nest of `fir.do_loop` ops which we have to detect in the OpenMP conversion
pass to handle multi-range loops. Instead, it would better to model multi-range
concurrent loops using a separate op which the IR more representative of the input
Fortran code and also easier to detect and transform.

### Delayed privatization

So far, we emit the privatization logic for IVs inline in the parallel/target
region. This is enough for our purposes right now since we don't
localize/privatize any sophisticated types of variables yet. Once we have need
for more advanced localization through `do concurrent`'s locality specifiers
(see below), delayed privatization will enable us to have a much cleaner IR.
Once delayed privatization's implementation upstream is supported for the
required constructs by the pass, we will move to it rather than inlined/early
privatization.

### Locality specifiers for `do concurrent`

Locality specifiers will enable the user to control the data environment of the
loop nest in a more fine-grained way. Implementing these specifiers on the
`FIR` dialect level is needed in order to support this in the
`DoConcurrentConversionPass`.

Such specifiers will also unlock a potential solution to the
non-perfectly-nested loops' IVs issue described above. In particular, for a
non-perfectly nested loop, one middle-ground proposal/solution would be to:
* Emit the loop's IV as shared/mapped just like we do currently.
* Emit a warning that the IV of the loop is emitted as shared/mapped.
* Given support for `LOCAL`, we can recommend the user to explicitly
  localize/privatize the loop's IV if they choose to.

#### Sharing TableGen clause records from the OpenMP dialect

At the moment, the FIR dialect does not have a way to model locality specifiers
on the IR level. Instead, something similar to early/eager privatization in OpenMP
is done for the locality specifiers in `fir.do_loop` ops. Having locality specifier
modelled in a way similar to delayed privatization (i.e. the `omp.private` op) and
reductions (i.e. the `omp.declare_reduction` op) can make mapping `do concurrent`
to OpenMP (and other parallel programming models) much easier.

Therefore, one way to approach this problem is to extract the TableGen records
for relevant OpenMP clauses in a shared dialect for "data environment management"
and use these shared records for OpenMP, `do concurrent`, and possibly OpenACC
as well.

#### Supporting reductions

Similar to locality specifiers, mapping reductions from `do concurrent` to OpenMP
is also still an open TODO. We can potentially extend the MLIR infrastructure
proposed in the previous section to share reduction records among the different
relevant dialects as well.

### More advanced detection of loop nests

As pointed out earlier, any intervening code between the headers of 2 nested
`do concurrent` loops prevents us from detecting this as a loop nest. In some
cases this is overly conservative. Therefore, a more flexible detection logic
of loop nests needs to be implemented.

### Data-dependence analysis

Right now, we map loop nests without analysing whether such mapping is safe to
do or not. We probably need to at least warn the user of unsafe loop nests due
to loop-carried dependencies.

### Non-rectangular loop nests

So far, we did not need to use the pass for non-rectangular loop nests. For
example:
```fortran
do concurrent(i=1:n)
  do concurrent(j=i:n)
    ...
  end do
end do
```
We defer this to the (hopefully) near future when we get the conversion in a
good share for the samples/projects at hand.

### Generalizing the pass to other parallel programming models

Once we have a stable and capable `do concurrent` to OpenMP mapping, we can take
this in a more generalized direction and allow the pass to target other models;
e.g. OpenACC. This goal should be kept in mind from the get-go even while only
targeting OpenMP.


## Upstreaming status

- [x] Command line options for `flang` and `bbc`.
- [x] Conversion pass skeleton (no transormations happen yet).
- [x] Status description and tracking document (this document).
- [x] Loop nest detection to identify multi-range loops.
- [ ] Basic host/CPU mapping support.
- [ ] Basic device/GPU mapping support.
- [ ] More advanced host and device support (expaned to multiple items as needed).