This PR aims to fix a mapping error when trying to map nullary elements
of a record type (primary example is allocatables/pointer types in
Fortran at the moment). This should be legal to map, just not write to
without pointing to anything within the target region. A common Fortran
OpenMP idiom/example where this is useful can be found in the added
Fortran offload example.
The runtime error arises when we try to map the pointer member utilising
a prescribed constant size that we receive from the lowered type,
resulting in mapping of data that will be non-existent when there is no
allocated data. The fix in this case is to emit a runtime check to see
if the data has been allocated, if it hasn't been we select a size of 0,
if it has we emit the usual type size.
Scan directive allows to specify scan reductions within an worksharing
loop, worksharing loop simd or simd directive which should have an
`InScan` modifier associated with it. This change adds the mlir support
for the same.
Related PR: [Parsing and Semantic Support for
scan](https://github.com/llvm/llvm-project/pull/102792)
Implementation details:
The UNTIED clause is recognized by setting the flag=0 for the default
case or performing logical OR to flag if other clauses are specified,
and this flag is passed as an argument to the `__kmpc_omp_task_alloc`
runtime call.
Resubmitting the PR with fix for the failure, as it was reverted here:
927a70daf31b1610627f346b0dc140eda72144b9
and previously merged here: https://github.com/llvm/llvm-project/pull/115283
While https://github.com/llvm/llvm-project/pull/122866 fixed some
issues, it introduced a regression in worksharing loops. The new bug
comes from the fact that we now conditionally created the `after_alloca`
block based on the number of sucessors of the alloca insertion point.
This is unneccessary, we can just alway create the block. If we do this,
we respect the post condtions expected after calling
`allocatePrivateVars` (i.e. that the `afterAlloca` block has a single
predecessor.
This enable delayed privatization by default for `omp.wsloop` ops, with
one caveat! I had to workaround the "impure" alloc region issue that
being resolved at the moment. The workaround detects whether the alloc
region's argument is used in the region and at the same time defined in
block that does not dominate the chosen alloca insertion point. If so,
we move the alloca insertion point below the defining instruction of the
alloc region argument. This basically reverts to the
non-delayed-privatizaiton behavior.
Implementation details:
The PRIORITY clause is recognized by setting the flags = 32 to the
`__kmpc_omp_task_alloc` runtime call. Also, store the priority-value
to the `kmp_task_t` struct member
This PR generalizes a fix that we implemented previously for
`omp.wsloop`s. The fix makes sure the pre-condtion that the `alloca`
block has a single successor whenever we inline delayed privatizers is
respected. I simply moved the fix to `allocatePrivateVars` so that it
kicks in for any op not just `omp.wsloop`.
This handles a bug uncovered by [a
test](https://github.com/OpenMP-Validation-and-Verification/OpenMP_VV/blob/master/tests/4.5/target_simd/test_target_simd_safelen.F90)
in the OpenMP_VV test suite.
This patch implements support for handling the 'if' clause of OpenMP
'target' constructs in the OMPIRBuilder and updates MLIR to LLVM IR
translation of the `omp.target` MLIR operation to make use of this new
feature.
This patch adds support for processing the `host_eval` clause of
`omp.target` to populate default and runtime kernel launch attributes.
Specifically, these related to the `num_teams`, `thread_limit` and
`num_threads` clauses attached to operations nested inside of
`omp.target`. As a result, the `thread_limit` clause of `omp.target` is
also supported.
The implementation of `initTargetDefaultAttrs()` is intended to reflect
clang's own processing of multiple constructs and clauses in order to
define a default number of teams and threads to be used as kernel
attributes and to populate global variables in the target device module.
One side effect of this change is that it is no longer possible to
translate to LLVM IR target device MLIR modules unless they have a
supported target triple. This is because the local `getGridValue()`
function in the `OpenMPIRBuilder` only works for certain architectures,
and it is called whenever the maximum number of threads has not been
explicitly defined. This limitation also matches clang.
Evaluating the collapsed loop trip count of SPMD and Generic-SPMD
kernels remains unsupported.
This patch introduces a `TargetKernelRuntimeAttrs` structure to hold
host-evaluated `num_teams`, `thread_limit`, `num_threads` and trip count
values passed to the runtime kernel offloading call.
Additionally, kernel type information is used to influence target device
code generation and the `IsSPMD` flag is replaced by `ExecFlags`, which
provides more granularity.
This patch introduces the `OpenMPIRBuilder::TargetKernelDefaultAttrs`
structure used to simplify passing default and constant values for
number of teams and threads, and possibly other target kernel-related
information in the future.
This is used to forward values passed to `createTarget` to
`createTargetInit`, which previously used a default unrelated set of
values.
This patch adds the `host_eval` clause to the `omp.target` operation.
Additionally, it updates its op verifier to make sure all uses of block
arguments defined by this clause fall within one of the few cases where
they are allowed.
MLIR to LLVM IR translation fails on translation of this clause with a
not-yet-implemented error.
Replaces https://github.com/llvm/llvm-project/pull/121886
Fixes https://github.com/llvm/llvm-project/issues/120254 (hopefully 🤞)
## Problem
Consider the following example:
```fortran
program test
real :: x(1)
integer :: i
!$omp parallel do reduction(+:x)
do i = 1,1
x = 1
end do
!$omp end parallel do
end program
```
The HLFIR+OMP IR for this example looks like this:
```mlir
func.func @_QQmain() {
...
omp.parallel {
%5 = fir.embox %4#0(%3) : (!fir.ref<!fir.array<1xf32>>, !fir.shape<1>) -> !fir.box<!fir.array<1xf32>>
%6 = fir.alloca !fir.box<!fir.array<1xf32>>
...
omp.wsloop private(@_QFEi_private_ref_i32 %1#0 -> %arg0 : !fir.ref<i32>) reduction(byref @add_reduction_byref_box_1xf32 %6 -> %arg1 : !fir.ref<!fir.box<!fir.array<1xf32>>>) {
omp.loop_nest (%arg2) : i32 = (%c1_i32) to (%c1_i32_0) inclusive step (%c1_i32_1) {
...
omp.yield
}
}
omp.terminator
}
return
}
```
The problem addressed by this PR is related to: the `alloca` in the
`omp.parallel` region + the related `reduction` clause on the
`omp.wsloop` op. When we try translate the reduction from MLIR to LLVM,
we have to choose an `alloca` insertion point. This happens in
`convertOmpWsloop` where at entry to that function, this is what the
LLVM module looks like:
```llvm
define void @_QQmain() {
%tid.addr = alloca i32, align 4
...
entry:
%omp_global_thread_num = call i32 @__kmpc_global_thread_num(ptr @1)
br label %omp.par.entry
omp.par.entry:
%tid.addr.local = alloca i32, align 4
...
br label %omp.par.region
omp.par.region:
br label %omp.par.region1
omp.par.region1:
...
%5 = alloca { ptr, i64, i32, i8, i8, i8, i8, [1 x [3 x i64]] }, align 8
```
Now, when we choose an `alloca` insertion point for the reduction, this
is the chosen block `omp.par.entry` (without the changes in this PR).
The problem is that the allocation needed for the reduction needs to
reference the `%5` SSA value. This results in inserting allocations in
`omp.par.entry` that reference allocations in a later block
`omp.par.region1` which causes the `Instruction does not dominate all
uses!` error.
## Possible solution - take 2:
This PR contains a more localized solution than
https://github.com/llvm/llvm-project/pull/121886. It makes sure that on
entry to `initReductionVars`, the IR builder is at a point where we can
starting inserting initialization region; to make things cleaner, we
still split the builder insertion point to a dedicated
`omp.reduction.init`. This way we avoid splitting after the latest
allocation block; which is what causing the issue.
Currently the compiler will ICE in programs like the following on the
device lowering pass:
```
program main
implicit none
type i1_t
integer :: val(1000)
end type i1_t
integer :: i
type(i1_t), pointer :: newi1
type(i1_t), pointer :: tab=>null()
integer, dimension(:), pointer :: tabval
!$omp THREADPRIVATE(tab)
allocate(newi1)
tab=>newi1
tab%val(:)=1
tabval=>tab%val
!$omp target teams distribute parallel do
do i = 1, 1000
tabval(i) = i
end do
!$omp end target teams distribute parallel do
end program main
```
This is due to the fact that THREADPRIVATE returns a result operation,
and this operation can actually be used by other LLVM dialect (or other
dialect) operations. However, we currently skip the lowering of
threadprivate, so we effectively never generate and bind an LLVM-IR
result to the threadprivate operation result. So when we later go on to
lower dependent LLVM dialect operations, we are missing the required
LLVM-IR result, try to access and use it and then ICE. The fix in this
particular PR is to allow compilation of threadprivate for device as
well as host, and simply treat the device compilation as a no-op,
binding the LLVM-IR result of threadprivate with no alterations and
binding it, which will allow the rest of the compilation to proceed,
where we'll eventually discard the host segment in any case.
The other possible solution to this I can think of, is doing something
similar to Flang's passes that occur prior to CodeGen to the LLVM
dialect, where they erase/no-op certain unrequired operations or
transform them to lower level series of operations. And we would
erase/no-op threadprivate on device as we'd never have these in target
regions.
The main issues I can see with this are that we currently do not
specialise this stage based on wether we're compiling for device or
host, so it's setting a precedent and adding another point of having to
understand the separation between target and host compilation. I am also
not sure we'd necessarily want to enforce this at a dialect level incase
someone else wishes to add a different lowering flow or translation
flow. Another possible issue is that a target operation we have/utilise
would depend on the result of threadprivate, meaning we'd not be allowed
to entirely erase/no-op it, I am not sure of any situations where this
may be an issue currently though.
This patch,
- Added a translation support for aligned clause in SIMD directive by passing the alignment details to "llvm.assume" intrinsic.
- Updated the insertion point for llvm.assume intrinsic call in "OMPIRBuilder.cpp".
- Added a check in aligned clause MLIR lowering, to ensure that the alignment value must be a power of 2.
Implementatoin details:
Both Mutexinoutset and Inoutset is recognized as flag=0x4
and 0x8 respectively, the flags is set to `kmp_depend_info` and
passed as argument to `__kmpc_omp_task_with_deps` runtime call
Implementation details:
The UNTIED clause is recognized by setting the flag=0 for the default
case or performing logical OR to flag if other clauses are specified,
and this flag is passed as an argument to the `__kmpc_omp_task_alloc`
runtime call.
This reapplies PR #118463 after introducing a fix for a bug uncovere by
the test suite. The problem is that when the alloca block is terminated
with a conditional branch, this violates a pre-condition of
`allocatePrivateVars` (which assumes the alloca block has a single
successor). This new PR includes a test that reproduces the issue.
Extend MLIR to LLVM lowering by adding support for `omp.wsloop` for
delayed privatization. This also refactors a few bit of code to isolate
the logic needed for `firstprivate` initialization in a shared util that
can be used across constructs that need it. The same is done for
`dealloc` regions.
This PR adds translation support for task detach. Essentially, if the
`detach` clause is present on a task, emit a
`__kmpc_task_allow_completion_event` on it, and store its return (of
type `kmp_event_t*`) into the `event_handle`.
Extend MLIR to LLVM lowering by adding support for `omp.wsloop` for
delayed privatization. This also refactors a few bit of code to isolate
the logic needed for `firstprivate` initialization in a shared util that
can be used across constructs that need it. The same is done for
`dealloc`
regions.
Parent PR: https://github.com/llvm/llvm-project/pull/118447. Only latest
commit is relevant for this PR.
This refactors the logic needed to emit init logic for reductions by
moving some duplicated code into a shared util. The logic for doing is
quite involved and is needed for any construct that has reductions.
Moreover, when a construct has both private and reduction clauses, both
sets of clauses need to cooperate with each other when emitting the
logic needed for allocation and initialization. Therefore, this PR
clearly sets the boundaries for the logic needed to initialize
reductions.
Add FIR generation and LLVMIR translation support for mergeable clause
on task construct. If mergeable clause is present on a task, the
relevant flag in `ompt_task_flag_t` is set and passed to
`__kmpc_omp_task_alloc`.
This is one of 3 PRs in a PR stack that aims to add support for explicit
mapping of allocatable members in derived types.
The primary changes in this PR are the OpenMPToLLVMIRTranslation.cpp
changes, which are small and seek to alter the current member mapping to
add an additional map insertion for pointers. Effectively, if the member
is a pointer (currently indicated by having a varPtrPtr field) we add an
additional map for the pointer and then alter the subsequent mapping of
the member (the data) to utilise the member rather than the parents base
pointer. This appears to be necessary in certain cases when mapping
pointer data within record types to avoid segfaulting on device (due to
incorrect data mapping). In general this record type mapping may be
simplifiable in the future.
There are also additions of tests which should help to showcase the
affect of the changes above.
This patch primarily updates the MapInfoFinalization pass to utilise the
BlockArgument interface. It also shuffles newly added arguments the
MapInfoFinalization passes to the end of the BlockArg/Relevant MapInfo
lists, instead of one prior to the owning descriptor type.
During this it was noted that the use_device_ptr/addr handling of target
data was a little bit too order dependent so I've attempted to make it
less so, as we cannot depend on argument ordering to be the same as
Fortran for any future frontends.
Upstreaming some code cleanups ahead of supporting delayed task
execution.
- Make allocatePrivateVars not need to be a template (it will need to
operate separately on firstprivate and private variables for delayed
task execution so it can't index into lists of all variables in the
operation).
- Use llvm::SmallVectorImpl for function arguments
- collectPrivatizationDecls already reserves size for privateDecls so we
don't need to do that in callers
- Use llvm::zip_equal instead of C-style array indexing
This uses essentially an identical implementation to that used for
ParallelOp. The private variable allocation and deallocation use shared
functions to avoid code duplication. FIRSTPRIVATE variable copying uses
duplicated code for now because I anticipate the implementation
diverging in the near future once I store data for firstprivate
variables in the task description structure.
After enabling delayed privatisation for TASK in flang, one more test in
the fujitsu test suite passes (I haven't looked into why).
This patch improves not-yet-implemented error diagnostics to more
closely follow the format used by Flang lowering for the same kind of
errors. This helps keep some level of uniformity from a user
perspective.
When composite constructs are lowered, clauses for each leaf construct
are lowered before creating the set of loop wrapper operations, using
these outside values to populate their operand lists. Then, when the
loop nest associated to that composite construct is lowered, the binding
of Fortran symbols to the entry block arguments defined by these loop
wrappers is performed, resulting in the creation of `hlfir.declare`
operations in the entry block of the `omp.loop_nest`.
This approach prevents `hlfir.declare` operations related to the binding
and other operations resulting from the evaluation of the clauses from
being inserted between loop wrapper operations, which would be an
illegal MLIR representation. However, this introduces the problem of
entry block arguments defined by a wrapper that then should be used by
one of its nested wrappers, because the corresponding Fortran symbol
would still be mapped to an outside value at the time of gathering the
list of operands for the nested wrapper.
This patch adds operand re-mapping logic to update wrappers without
changing when clauses are evaluated or where the `hlfir.declare`
creation is performed.
This patch improves error reporting in the MLIR to LLVM IR translation
pass for the 'omp' dialect by emitting descriptive errors when
encountering clauses not yet supported by that pass.
Additionally, not-yet-implemented errors previously missing for some
clauses are added, to avoid silently ignoring them.
Error messages related to inlining of `omp.private` and
`omp.declare_reduction` regions have been updated to use the same
format.
This patch unifies the handling of errors passed through the
OpenMPIRBuilder and removes some redundant error messages through the
introduction of a custom `ErrorInfo` subclass.
Additionally, the current list of operations and clauses unsupported by
the MLIR to LLVM IR translation pass is added to a new Lit test to check
they are being reported to the user.
This patch updates the translation of `omp.wsloop` with a nested
`omp.simd` to prevent uses of block arguments defined by the latter from
triggering null pointer dereferences.
This happens because the inner `omp.simd` operation representing
composite `do simd` constructs is currently skipped and not translated,
but this results in block arguments defined by it not being mapped to an
LLVM value. The proposed solution is to map these block arguments to the
LLVM value associated to the corresponding operand, which is defined
above.
This patch implements an approach to communicate errors between the
OMPIRBuilder and its users. It introduces `llvm::Error` and
`llvm::Expected` objects to replace the values returned by callbacks
passed to `OMPIRBuilder` codegen functions. These functions then check
the result for errors when callbacks are called and forward them back to
the caller, which has the flexibility to recover, exit cleanly or dump a
stack trace.
This prevents a failed callback to leave the IR in an invalid state and
still continue the codegen process, triggering unrelated assertions or
segmentation faults. In the case of MLIR to LLVM IR translation of the
'omp' dialect, this change results in the compiler emitting errors and
exiting early instead of triggering a crash for not-yet-implemented
errors. The behavior in Clang and openmp-opt stays unchanged, since
callbacks will continue always returning 'success'.
Extends `nowait` support for other device directives. This PR refactors
the task generation utils used for the `target` directive so that they
are general enough to be reused for other device directives as well.
The existing conversion inlined private alloc regions and firstprivate
copy regions in mlir, then undoing the modification of the mlir module
before completing the conversion. To make this work, LLVM IR had to be
generated using the wrong mapping for privatised values and then later
fixed inside of OpenMPIRBuilder.
This approach violated an assumption in OpenMPIRBuilder that private
variables would be values not constants. Flang sometimes generates code
where private variables are promoted to globals, the address of which is
treated as a constant in LLVM IR. This caused the incorrect values for
the private variable from being replaced by OpenMPIRBuilder: ultimately
resulting in programs producing incorrect results.
This patch rewrites delayed privatisation for omp.parallel to work more
similarly to reductions: translating directly into LLVMIR with correct
mappings for private variables.
RFC:
https://discourse.llvm.org/t/rfc-openmp-fix-issue-in-mlir-to-llvmir-translation-for-delayed-privatisation/81225
Tested against the gfortran testsuite and our internal test suite.
Linaro's post-commit bots will check against the fujitsu test suite.
I decided to add the new tests as flang integration tests rather than in
mlir/test/Target/LLVMIR:
- The regression test is for an issue filed against flang. i wanted to
keep the reproducer similar to the code in the ticket.
- I found the "worst case" CFG test difficult to reason about in
abstract it helped me to think about what was going on in terms of a
Fortran program.
Fixes#106297
This patch enables lowering to MLIR of the reduction clause of `simd`
constructs. Lowering from MLIR to LLVM IR remains unimplemented, so at
that stage it will result in errors being emitted rather than silently
ignoring it as it is currently done.
On composite `do simd` constructs, this lowering error will remain
untriggered, as the `omp.simd` operation in that case is currently
ignored. The MLIR representation, however, will now contain `reduction`
information.
Adds MLIR to LLVM lowering support for `target ... nowait`. This
leverages the already existings code-gen patterns for `task` by treating
`target ... nowait` as `task ... if(1)` and `target` (without `nowait`)
as `task ... if(0)`; similar to what clang does.
This patch adds extra class declarations to the `omp.declare_reduction`
and `omp.private` operations to access the entry block arguments defined
by their regions. Some existing accesses to these arguments are updated
to use the new named methods to improve code readability.
This patch adds functionality to emit relevant libcalls in case
atomicrmw instruction can not be emitted (for instance, in case of
complex types). The IRBuilder is modified to directly emit __atomic_load
and __atomic_compare_exchange libcalls. The added functions follow a
similar codegen path as Clang, so that LLVM Flang generates almost
similar IR as Clang.
Fixes https://github.com/llvm/llvm-project/issues/83760 and
https://github.com/llvm/llvm-project/issues/75138
Co-authored-by: Michael Kruse <llvm-project@meinersbur.de>
This patch updates the `omp.target_data` operation to use the same
formatting as `map` clauses on `omp.target` for `use_device_addr` and
`use_device_ptr`. This is done so the mapping that is being enforced
between op arguments and associated entry block arguments is explicit.
The way it is achieved is by marking these clauses as entry block
argument-defining and adjusting printer/parsers accordingly.
As a result of this change, block arguments for `use_device_addr` come
before those for `use_device_ptr`, which is the opposite of the previous
undocumented situation. Some unit tests are updated based on this
change, in addition to those updated because of the format change.
This patch introduces a new MLIR interface for the OpenMP dialect aimed
at providing a uniform way of verifying and handling entry block
arguments defined by OpenMP clauses.
The approach consists in defining a set of overrideable methods that
return the number of block arguments the operation holds regarding each
of the clauses that may define them. These by default return 0, but they
are overriden by the corresponding clause through the
`extraClassDeclaration` mechanism.
Another set of interface methods to get the actual lists of block
arguments is defined, which is implemented based on the previously
described methods. These implicitly define a standardized ordering
between the list of block arguments associated to each clause, based on
the alphabetical ordering of their names. They should be the preferred
way of matching operation arguments and entry block arguments to that
operation's first region.
Some updates are made to the printing/parsing of `omp.parallel` to
follow the expected order between `private` and `reduction` clauses, as
well as the MLIR to LLVM IR translation pass to access block arguments
using the new interface. Unit tests of operations impacted by additional
verification checks and sorting of entry block arguments.
This patch adds support to translate the `private` clause on
`omp.target` ops from MLIR to LLVMIR. This first cut only handles
non-allocatables. Also, this is for delayed privatization.