This patch is a follow-up to #161213 and adds the `omp.fuse` loop
transformation to the OpenMP dialect. It is used for lowering
`!$omp fuse` in Flang.
Added lowering and end-to-end tests.
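To illustrate what fusing two loops means, here is a toy Python model (not the dialect lowering). It assumes the fused loop runs for the larger of the two trip counts and guards each body by its own bound, which is my reading of the construct; the function names are made up for this sketch.

```python
def run_separately(n1, n2):
    """Reference behaviour: the two loops run one after the other."""
    trace = []
    for i in range(n1):
        trace.append(("body1", i))
    for j in range(n2):
        trace.append(("body2", j))
    return trace


def run_fused(n1, n2):
    """Fused form: one loop over max(n1, n2), each body guarded by its own trip count."""
    trace = []
    for k in range(max(n1, n2)):
        if k < n1:
            trace.append(("body1", k))
        if k < n2:
            trace.append(("body2", k))
    return trace
```

Both forms execute the same set of iterations; only the interleaving differs, so fusion is only safe when the bodies do not rely on the original ordering.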
When a target region is placed inside a constant false condition (e.g.,
`if (.false.)`), the dead code gets eliminated on the host side,
removing the `omp.target` operation entirely. However, the device-side
compilation pipeline is unaware of this elimination and attempts to
generate kernel code. Since the host never created offload metadata for
the eliminated target, the device-side kernel function lacks the
"kernel" attribute, causing `OpenMPOpt` to fail with an assertion when
it expects all outlined kernels to have this attribute. The problem can
be seen with the following code:
```fortran
program cele
implicit none
real :: V
integer :: i
if (.false.) then
!$omp target teams distribute parallel do
do i = 1, 5
V = V * 2
end do
!$omp end target teams distribute parallel do
end if
end program
```
It currently fails with the following assertion:
```
Assertion `omp::isOpenMPKernel(*Kernel) && "Expected kernel function!"' failed.
llvm/lib/Transforms/IPO/OpenMPOpt.cpp:4291
```
This PR adds `DeleteUnreachableTargetsPass` that identifies `omp.target`
operations in unreachable code blocks and removes them.
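The idea can be sketched with a toy control-flow graph in Python. This is not the pass itself: the CFG encoding, the op names as plain strings, and the helper names are all assumptions made for illustration.

```python
from collections import deque


def reachable_blocks(cfg, entry):
    """Breadth-first walk over successor edges starting at the entry block."""
    seen = {entry}
    work = deque([entry])
    while work:
        block = work.popleft()
        for succ in cfg[block]["succs"]:
            if succ not in seen:
                seen.add(succ)
                work.append(succ)
    return seen


def delete_unreachable_targets(cfg, entry):
    """Drop every "omp.target" op that sits in a block the entry cannot reach."""
    live = reachable_blocks(cfg, entry)
    removed = 0
    for name, block in cfg.items():
        if name in live:
            continue
        kept = [op for op in block["ops"] if op != "omp.target"]
        removed += len(block["ops"]) - len(kept)
        block["ops"] = kept
    return removed


# `if (.false.)` folded into an unconditional branch that skips the "then" block.
example_cfg = {
    "entry": {"ops": ["br"], "succs": ["exit"]},
    "then": {"ops": ["omp.target"], "succs": ["exit"]},
    "exit": {"ops": ["return"], "succs": []},
}
```

Calling `delete_unreachable_targets(example_cfg, "entry")` removes the single target in the dead `then` block and leaves reachable blocks untouched.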
7c07cb6542a0c5e4340e09a9a247e3e5123c6567 introduced a variable created
in an if statement that is only used in an assertion. Per the coding
guidelines, mark it [[maybe_unused]].
Following the work completed in #174386 and #174623, this patch adds support
for collapse to Taskloop. Collapse lets the user combine
multiple nested loops into a single loop. For this to work with
Taskloop, some changes are needed to how we process the loops
and the tasks that run them.
This patch brings Taskloop support in MLIR and Flang up to the level of
OpenMP 4.5.
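For readers unfamiliar with collapse, a minimal Python sketch of the idea follows. It assumes a rectangular two-level nest and an even split of the flat range across tasks; the helper names are hypothetical and this is not how the translation is written.

```python
def collapsed_iterations(n_outer, n_inner):
    """Walk a two-level nest as one flat loop and recover (i, j) from the flat index."""
    for flat in range(n_outer * n_inner):
        i, j = divmod(flat, n_inner)
        yield i, j


def split_into_tasks(total, num_tasks):
    """Split the flat range [0, total) into contiguous chunks, one per task.

    Assumes num_tasks >= 1.
    """
    base, extra = divmod(total, num_tasks)
    chunks, start = [], 0
    for t in range(num_tasks):
        size = base + (1 if t < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks
```

Because the tasks now divide the flat range rather than the outer loop alone, a nest with a short outer loop can still be spread over many tasks.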
Recursive types can cause re-entrant mapper emission. The mapper
function is created by OpenMPIRBuilder before the callbacks run, so it
may already exist in the LLVM module even though it is not yet
registered in the ModuleTranslation mapping table. Reuse and register it
to break the recursion. Added offloading test.
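A toy model of the re-entrancy, in Python. The `Translator` class, its two dictionaries, and the `mapper.<type>` naming are stand-ins I made up for the LLVM module and the ModuleTranslation table; only the "look in the module before creating" step reflects the fix described above.

```python
class Translator:
    def __init__(self, types):
        self.types = types            # type name -> list of member type names
        self.module_functions = {}    # stands in for the LLVM module
        self.mapping_table = {}       # stands in for the ModuleTranslation table
        self.created = 0

    def get_or_emit_mapper(self, type_name):
        fn_name = f"mapper.{type_name}"
        if type_name in self.mapping_table:
            return self.mapping_table[type_name]
        # Re-entrant case: the function exists in the module but is not registered yet.
        if fn_name in self.module_functions:
            fn = self.module_functions[fn_name]
            self.mapping_table[type_name] = fn
            return fn
        # The builder creates the function first ...
        fn = {"name": fn_name, "calls": []}
        self.module_functions[fn_name] = fn
        self.created += 1
        # ... and only then runs the callback that emits the member mappers.
        for member in self.types[type_name]:
            fn["calls"].append(self.get_or_emit_mapper(member)["name"])
        self.mapping_table[type_name] = fn
        return fn
```

With a self-referential type such as `{"node": ["node"]}`, the inner request finds the half-built function in the module and reuses it instead of recursing forever.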
After removing host operations from the device MLIR module, it is no
longer necessary to provide special codegen logic to prevent these
operations from causing compiler crashes or miscompilations.
This patch removes these now unnecessary code paths to simplify codegen
logic. Some MLIR tests are now replaced with Flang tests, since the
responsibility of dealing with host operations has been moved earlier in
the compilation flow.
MLIR tests holding target device modules are updated to no longer
include now unsupported host operations.
Fix OpenMP mapper lowering by attaching user-defined/default mappers
only to the base parent entry, not combined/segment entries. This
prevents mapper calls with partial sizes. Added relevant tests.
Fixes a segfault when trip count values are null by skipping trip count
calculation when we cannot determine if it is safe to hoist out the
values.
Of note, I originally tried to modify `extractOnlyOmpNestedDir` to return
the first OpenMPConstruct directive, skipping over any earlier
directives (i.e. stores), which did work for the generic test case below:
```fortran
program minimal_repro
implicit none
integer :: i, m
integer :: res(10) = 0
!$omp target teams map(from:m,res) private(m)
m = 5
!$omp distribute parallel do
do i = 1, 10
res(i) = 5 + i
end do
!$omp end distribute parallel do
!$omp end target teams
end program minimal_repro
```
But that led to incorrect output in this test case, as the trip count was
hoisted out and calculated from m(1000000) instead of m(1):
```fortran
program minimal_repro
implicit none
integer :: i, x
integer :: m(1) = 0
integer :: res(10) = 0
m(1) = 10
x = 1000000
!$omp target teams map(res)
x = 1
!$omp distribute parallel do
do i = 1, m(x)
res(i) = 5 + i
end do
!$omp end distribute parallel do
!$omp end target teams
print *, "Test completed successfully m =", m, " res=", res
end program minimal_repro
```
This leads to a segfault, because the loop bounds are calculated with
m(1000000):
```mlir
%c1000000_i32 = arith.constant 1000000 : i32
hlfir.assign %c1000000_i32 to %10#0 : i32, !fir.ref<i32>
%c1_i32 = arith.constant 1 : i32
%12 = fir.load %10#0 : !fir.ref<i32>
%13 = fir.convert %12 : (i32) -> i64
%14 = hlfir.designate %5#0 (%13) : (!fir.ref<!fir.array<1xi32>>, i64) -> !fir.ref<i32>
%15 = fir.load %14 : !fir.ref<i32>
...
omp.target host_eval(%c1_i32 -> %arg0, %15 -> %arg1, %c1_i32_1 -> %arg2 : i32, i32, i32) map_entries(%18 -> %arg3, %19 -> %arg4, %20 -> %arg5, %23 -> %arg6 : !fir.ref<!fir.array<10xi32>>, !fir.ref<i32>, !fir.ref<i32>, !fir.ref<!fir.array<1xi32>>) {
...
omp.teams {
...
omp.loop_nest (%arg8) : i32 = (%arg0) to (%arg1) inclusive step (%arg2) {
```
The wip commit for this change is here:
beafeae396
We would need to have some sort of intelligent hoisting for these cases,
to allow hoisting, but for now I just created this PR to fix the bug.
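The hazard can be reproduced in miniature with plain Python, using a list for `m` and an out-of-range index standing in for the segfault. The two functions are hypothetical and only model the order in which the bound is read.

```python
def bound_when_hoisted(m, x_before_region, x_inside_region):
    """Buggy order: the bound m(x) is read on the host before the region assigns x."""
    x = x_before_region
    upper = m[x - 1]          # Fortran m(x) is 1-based
    x = x_inside_region       # the assignment inside the region comes too late
    return upper


def bound_in_region(m, x_before_region, x_inside_region):
    """Intended order: x is assigned inside the region, then m(x) is read."""
    x = x_before_region
    x = x_inside_region
    return m[x - 1]
```

With `m = [10]`, `x = 1000000` before the region and `x = 1` inside it, the in-region version returns 10, while the hoisted version raises `IndexError`, the Python analogue of reading m(1000000).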
Fixes: #176030
This PR fixes a bug where the lower and upper bound for the number of
teams was not an `int32`, but a different type. In this case, an
internal compiler error would trigger due to a mismatching call to
`__kmpc_push_num_teams`.
Corrected various spelling mistakes such as 'occurred', 'receiver',
'initialized', 'length', and others in comments, variable names,
function names, and documentation throughout the project. These
changes improve code readability and maintain consistency in naming
and documentation.
Co-authored-by: Louis Dionne <ldionne.2@gmail.com>
Implementation follows exactly what is done for omp.wsloop and omp.task.
See #137841.
The change to the operation verifier is to allow a taskgroup
cancellation point inside of a taskloop. This was already allowed for
omp.cancel.
Don't allocate a task context structure if none of the private variables
needed it. This was already skipped when there were no private variables
at all.
Following on from the work to implement MLIR -> LLVM IR Translation for
Taskloop, this adds support for the following clauses to be used
alongside taskloop:
- if
- grainsize
- num_tasks
- untied
- nogroup
- final
- mergeable
- priority
These clauses are ones which work directly through the relevant OpenMP
Runtime functions, so their information just needed collecting from the
relevant location and passing through to the appropriate runtime
function.
Remaining clauses retain their TODO message as they have not yet been
implemented.
This PR replaces #166903
This implements translation for taskloop, along with DSA clauses. Other
clauses will follow immediately after this is merged.
This patch was collaborative work by myself, @kaviya2510, and
@Stylie777. I’ve left the commits unsquashed to make authorship clear.
My only changes to other authors' commits are rebasing and running
clang-format.
The taskloop implementation in the runtime works roughly like this: if
the number of loop iterations to perform is more than some threshold,
the current task is duplicated and each of the resulting tasks gets half of the
loop range. This continues recursively until each task has a small
enough loop range to run itself in a single thread.
This leads to two implementation complexities:
- The runtime needs to be able to update the loop bounds used when
executing the loop inside of the task. This has been implemented by
forcing them to always have a fixed location inside of the structure
produced when outlining the task.
- When a task is duplicated, all data stored for the task’s
(first)private variables needs to also be duplicated and appropriate
constructors run. This is handled by a task duplication function invoked
by the runtime.
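The recursive splitting and the duplication step described above can be sketched in Python as follows. This is a model of the behaviour as I understand it, not the runtime's code: the dictionary layout, the half-open bounds, and `copy.deepcopy` standing in for the task duplication function are all assumptions.

```python
import copy


def split_taskloop(lower, upper, threshold, privates):
    """Recursively halve [lower, upper) until each task has at most `threshold` iterations."""
    if threshold < 1:
        raise ValueError("threshold must be at least 1")
    if upper - lower <= threshold:
        # The bounds live at a fixed place in the task record so they can be rewritten.
        return [{"lb": lower, "ub": upper, "privates": privates}]
    mid = lower + (upper - lower) // 2
    # Task duplication: the new task gets its own copy of the (first)private data.
    duplicate = copy.deepcopy(privates)
    return (split_taskloop(lower, mid, threshold, privates)
            + split_taskloop(mid, upper, threshold, duplicate))
```

For example, `split_taskloop(0, 10, 3, {"a": [1]})` yields four tasks covering [0, 2), [2, 5), [5, 7) and [7, 10), each holding a distinct copy of the private data.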
With regards to testing, most existing tests in the gfortran and fujitsu
test suites require the reduction clause (not part of OpenMP 4.5). I
wrote some tests of my own and was satisfied that it seems to be
working.
Co-authored-by: Kaviya Rajendiran <kaviyara2000@gmail.com>
Co-authored-by: Jack Styles <jack.styles@arm.com>
Extend OpenMP device clause lowering for target data, target enter data,
target exit data, and target update to accept non-constant values.
Previously, only constant device IDs could be lowered to LLVM IR.
Add Flang tests to validate device clause handling and mark the feature
as supported in the OpenMPSupport documentation. New tests cover:
- target teams
- target teams distribute
- target teams distribute parallel do
- target teams distribute parallel do simd
- target data
Tests for target update and target enter/exit were
already present in Flang.
Add lowering support for the OpenMP `device` clause on the `target`
directive in Flang.
The device expression is propagated through MLIR OpenMP and passed to
the host-side `__tgt_target_kernel` call.
We add barriers to the firstprivate copy region when they are required
to avoid a race condition with the lastprivate clause.
The problem is that these barriers are added by the compiler, not implied
by user code, so it is the compiler's problem to avoid deadlock.
I came across a testcase whilst working on taskloop support that looks a
bit like this:
```
!$omp parallel
!$omp single
!$omp taskloop firstprivate(a) lastprivate(a)
...
!$omp end single
!$omp end parallel
```
This is so that there are multiple threads for the generated tasks to be
distributed over, but we don't generate the tasks afresh in every
thread.
The problem comes when the taskloop requires a barrier to prevent the
datarace between firstprivate and lastprivate. This barrier will then be
generated inside of SINGLE and so only one thread will encounter the
barrier: leading to a deadlock.
This patch works around the problem by detecting this situation
statically and then not generating the barrier. There are cases where we
cannot detect this statically (e.g. if the TASKLOOP is inside a function
call inside of SINGLE). The program will still deadlock in this case
after my patch. I'm unsure what the solution would be for that case. I
want to fix this simple case in LLVM 22 before engaging in a longer
discussion as to whether there is a better way to handle the more
general case.
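A rough Python sketch of the kind of static check described here. The ancestor list, the op names as strings, and the rule of stopping at the nearest `omp.parallel` are my assumptions for illustration; the actual detection in the patch may differ.

```python
def barrier_would_deadlock(ancestors):
    """`ancestors` lists the enclosing op names, innermost first."""
    for op in ancestors:
        if op == "omp.single":
            return True       # only one thread of the team would reach the barrier
        if op == "omp.parallel":
            return False      # a new team starts here; no enclosing single in between
    return False              # nothing visible statically, e.g. inside a called function


def needs_copy_barrier(firstprivate_lastprivate_race, ancestors):
    """Emit the barrier only when it is required and not known to deadlock."""
    return firstprivate_lastprivate_race and not barrier_would_deadlock(ancestors)
```

Note that `needs_copy_barrier(True, ["func.func"])` still returns `True`: a construct inside a function called from SINGLE looks the same as one called from anywhere else, which is the undetected case mentioned above.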
Testing using wsloop because I want to land this (or not) independently
of taskloop. Note that for wsloop it would be up to the programmer to
remember to use the nowait clause, but nowait cannot be used to control
generation of this barrier because it refers to the barrier after the
construct not after firstprivate copyin (before the construct
execution).
Up till OpenMP version 4.5, the loop iteration variable in the
associated do-construct of simd is linear with a linear step equal to
the increment of the loop. This PR implements this functionality. For
versions > 4.5, such an implicit linear clause is not assumed for the
loop iteration variable.
Fixes https://github.com/llvm/llvm-project/issues/171006
Fixes #173332
The compiler was crashing when compiling OpenMP `parallel do simd` with
a `linear` clause on `INTEGER(8)` variables. The assertion failure
occurred during MLIR-to-LLVM translation:
`Cannot create binary operator with two operands of differing type!`
**Root Cause:**
The bug was in `LinearClauseProcessor::updateLinearVar()` where the step
value (i32) and induction variable were multiplied without normalizing
to the linear variable's type (i64), causing type mismatches in LLVM IR
generation.
**Solution:**
Updated the translation logic to cast both the induction variable and
step value to `linearVarTypes[index]` before performing arithmetic
operations. This ensures type consistency for both integer and
floating-point linear variables.
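The mismatch and the fix can be modelled with a small typed-value sketch in Python. The `Value` class and the helpers are hypothetical and cover integers only; they mimic an IR builder that refuses mixed-type operands.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Value:
    type: str   # e.g. "i32" or "i64"
    data: int


def binop(op, a, b):
    """Mimic an IR builder that rejects operands of different types."""
    if a.type != b.type:
        raise TypeError("Cannot create binary operator with two operands of differing type!")
    return Value(a.type, op(a.data, b.data))


def cast(value, target_type):
    """Stand-in for an integer cast to the linear variable's type."""
    return Value(target_type, value.data)


def update_linear_var(start, induction_var, step):
    """Compute start + iv * step after normalizing iv and step to the variable's type."""
    ty = start.type
    scaled = binop(lambda x, y: x * y, cast(induction_var, ty), cast(step, ty))
    return binop(lambda x, y: x + y, start, scaled)
```

Without the two `cast` calls, an `i64` linear variable combined with an `i32` step would raise the same "differing type" error as in the report.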
**Testing:**
- Added integration test verifying successful compilation to LLVM IR
- Added lowering test for MLIR generation with various linear clause
forms
- Verified the exact reproducer from the issue now compiles without
errors
Add support for OpenMP is_device_ptr clause for target directives.
[MLIR][OpenMP] Add OpenMPToLLVMIRTranslation support for is_device_ptr #169367
This PR adds support for the OpenMP is_device_ptr clause in the
MLIR to LLVM IR translation for target regions. The is_device_ptr clause
allows device pointers (allocated via OpenMP runtime APIs) to be used
directly in target regions without implicit mapping.
A barrier will pause execution until all threads reach it. If some go to
a different barrier then we deadlock. This manifests in that the
finalization callback must only be run once. Fix by ensuring we always
go through the same finalization block whether the thread is cancelled
or not and no matter which cancellation point causes the cancellation.
The old callback only affected PARALLEL, so it has been moved into the
code generating PARALLEL. For this reason, we don't need similar changes
for other cancellable constructs. We need to create the barrier on the
shared exit from the outlined function instead of only on the cancelled
branch to make sure that threads exiting normally (without cancellation)
meet the same barriers as those which were cancelled. For example,
previously we might have generated code like
```
...
%ret = call i32 @__kmpc_cancel(...)
%cond = icmp eq i32 %ret, 0
br i1 %cond, label %continue, label %cancel
continue:
// do the rest of the callback, eventually branching to %fini
br label %fini
cancel:
// Populated by the callback:
// unsafe: if any thread makes it to the end without being cancelled
// it won't reach this barrier and then the program will deadlock
%unused = call i32 @__kmpc_cancel_barrier(...)
br label %fini
fini:
// run destructors etc
ret
```
In the new version the barrier is moved into fini. I generate it *after*
the destructors because the standard describes the barrier as occurring
after the end of the parallel region.
```
...
%ret = call i32 @__kmpc_cancel(...)
%cond = icmp eq i32 %ret, 0
br i1 %cond, label %continue, label %cancel
continue:
// do the rest of the callback, eventually branching to %fini
br label %fini
cancel:
br label %fini
fini:
// run destructors etc
// safe so long as every exit from the function happens via this block:
%unused = call i32 @__kmpc_cancel_barrier(...)
ret
```
To achieve this, the barrier is now generated alongside the finalization
code instead of in the callback. This is the reason for the changes to
the unit test.
I'm unsure if I should keep the incorrect barrier generation callback
only on the cancellation branch in clang with the OMPIRBuilder backend
because that would match clang's ordinary codegen. Right now I have
opted to remove it entirely because it is a deadlock waiting to happen.
---
This re-lands #164586 with a small fix for a failing buildbot running
address sanitizer on clang lit tests.
In the previous version of the patch I added an insertion point guard
"just to be safe" and never removed it. There isn't insertion point
guarding on the other route out of this function and we do not
preserve the insertion point around getFiniBB either so it is not
needed here.
The problem flagged by the sanitizers was because the saved insertion
point pointed to an instruction which was then removed inside the FiniCB
for some clang codegen functions. The instruction was freed when it was
removed. Then accessing it to restore the insertion point was a use
after free bug.
`dist_schedule` was previously supported in Flang/Clang but was not
implemented in MLIR; instead, a user would get a "not yet implemented"
error. This patch adds support for lowering the `dist_schedule` clause
to LLVM IR when it is used in an `omp.distribute` or `omp.wsloop`
section.
Some rework was required to ensure that MLIR/LLVM
emits the correct schedule type for the clause, as it uses a different
schedule type from other OpenMP directives/clauses in the runtime library.
This patch also ensures that when using dist_schedule or a chunked
schedule clause, the correct LLVM loop parallel-access details are
added.
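As a reminder of what a chunked `dist_schedule(static, chunk)` means for the iteration space, here is a small Python model. It assumes chunks are handed to teams round-robin in team-number order, which is my reading of the clause; it says nothing about the schedule-type constants the runtime uses.

```python
def dist_schedule_static(trip_count, num_teams, chunk):
    """Assign chunks of `chunk` iterations to teams round-robin, in team-number order."""
    per_team = [[] for _ in range(num_teams)]
    for chunk_index, start in enumerate(range(0, trip_count, chunk)):
        team = chunk_index % num_teams
        per_team[team].extend(range(start, min(start + chunk, trip_count)))
    return per_team
```

For 10 iterations, 2 teams and a chunk size of 3, team 0 gets iterations 0-2 and 6-8, and team 1 gets 3-5 and 9.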
Adds initial support for GPU by-ref reductions. The main problem for
reduction by reference is that, prior to this PR, we were shuffling
(from remote lanes within the same warp or across different warps within
the block) pointers/references to the private reduction values rather
than the private reduction values themselves.
In particular, this diff adds support for reductions on scalar
allocatables where reductions happen on loops nested in `target`
regions. For example:
```fortran
integer :: i
real, allocatable :: scalar_alloc
allocate(scalar_alloc)
scalar_alloc = 0
!$omp target map(tofrom: scalar_alloc)
!$omp parallel do reduction(+: scalar_alloc)
do i = 1, 1000000
scalar_alloc = scalar_alloc + 1
end do
!$omp end target
```
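To show why the values themselves have to travel between lanes, here is a toy Python simulation of an intra-warp tree reduction. The list stands in for the per-lane private values, `shuffle_down` is a made-up stand-in for the hardware shuffle, and a power-of-two lane count is assumed.

```python
def shuffle_down(values, offset):
    """Each lane reads the value held by lane + offset (or keeps its own at the edge)."""
    n = len(values)
    return [values[lane + offset] if lane + offset < n else values[lane]
            for lane in range(n)]


def warp_reduce_sum(private_values):
    """Tree reduction within one warp; lane 0 ends up holding the total."""
    vals = list(private_values)
    offset = len(vals) // 2
    while offset > 0:
        incoming = shuffle_down(vals, offset)
        vals = [mine + theirs for mine, theirs in zip(vals, incoming)]
        offset //= 2
    return vals[0]
```

What moves between lanes here is the partial sum. If a reference to a lane's private copy were exchanged instead, the receiving lane would be combining an address rather than the data it points at.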
This PR supports by-ref reductions on the intra- and inter-warp levels.
So far, there are still steps to be taken for full support of by-ref
reductions, for example:
* Inter-block value combination is still not supported.
Therefore, `target teams distribute parallel do` is still not supported.
* Support for dynamically-sized arrays still needs to be added.
* Support for more than one allocatable/array on the same `reduction`
clause still needs to be added.
If we are given the same index in the comparator callback, simply return
false. Otherwise we will end up adding invalid items to
occludedChildren, causing extra items to get removed that should not be,
resulting in failures that manifest in different forms (assertions, asan
failures, ubsan failures, etc.).
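A hypothetical Python model of such a comparator is given below. The member-index paths, the prefix rule for deciding that a child is occluded, and the function names are all my own illustration, and paths are assumed to be distinct per entry; only the `a == b` guard corresponds to the fix.

```python
def make_comparator(paths, occluded):
    """paths[i] is the member-index path of entry i; `occluded` collects covered children."""
    def less(a, b):
        if a == b:
            return False      # the guard: an entry never precedes or occludes itself
        pa, pb = paths[a], paths[b]
        shared = min(len(pa), len(pb))
        if pa[:shared] == pb[:shared]:
            # One path is a prefix of the other, so the longer one is the child.
            if len(pa) < len(pb):
                occluded.add(b)
                return True
            occluded.add(a)
            return False
        return pa[:shared] < pb[:shared]
    return less
```

In this model, dropping the guard would make `less(1, 1)` fall into the prefix branch and record entry 1 as occluded by itself, which mirrors the extra removals described above.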
This patch adds support for lowering of custom reductions to MLIR. It
also enhances the capability of the pass to automatically mark functions
as "declare target" by traversing custom reduction initializers and
combiners.
While the infrastructure for declare target to/enter and link for
variables exists in the MLIR dialect and at the Flang level, the
lowering from MLIR -> LLVM IR is currently only in place for
variables that have the link clause applied.
This PR aims to extend that lowering to an initial implementation that
incorporates declare target to as well, which primarily requires changes
in the OpenMPToLLVMIRTranslation phase. However, a minor addition to the
OpenMP dialect was required to extend the declare target enumerator to
include a default None field as well.
This also requires a minor change to the Flang lowering's
MapInfoFinalization.cpp pass to alter the map type for descriptors to
deal with cases where a variable is marked declare target to. Currently, when a
descriptor variable is mapped declare target to, the descriptor component
can become attached and cannot be updated. This results in issues when
an unusual allocation range is specified (effectively an off-by-X
error). The current solution is to map the descriptor always, as we
always require an up-to-date version of this data. However, this also
requires an interlinked PR that adds a more intricate type of mapping of
structures/record types that clang currently implements, to circumvent
the overwriting of the pointer in the descriptor.
3/3 required PRs to enable declare target to mapping, this PR should
pass all tests and provide an all green CI.
Co-authored-by: Raghu Maddhipatla raghu.maddhipatla@amd.com
This PR introduces a new additional type of map lowering for record
types that Clang currently supports, in which a user can map a top-level
record type and then individual members with different mapping,
effectively creating a sort of "overlapping" mapping that we attempt to
cut around.
This is currently most predominantly used in Fortran, when mapping
descriptors and their data: we map the descriptor and its data with
separate map modifiers and "cut around" the pointer data, so that we do
not overwrite it unless the runtime deems it a necessary action based on
its reference counting mechanism. However, it is a mechanism that will
come in handy/trigger when a user explicitly maps a record type (derived
type or structure) and then explicitly maps a member with a different
map type.
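The "cut around" idea can be sketched as a byte-range computation in Python. The function name and the sizes in the example are made up; the real translation works on map entries rather than raw tuples.

```python
def cut_around(struct_size, members):
    """Return the [start, end) byte ranges of a record not covered by separately mapped members.

    `members` is a list of (offset, size) pairs for the members mapped on their own.
    """
    segments, cursor = [], 0
    for offset, size in sorted(members):
        if offset > cursor:
            segments.append((cursor, offset))
        cursor = max(cursor, offset + size)
    if cursor < struct_size:
        segments.append((cursor, struct_size))
    return segments
```

For a hypothetical 48-byte descriptor whose first 8 bytes hold the data pointer, `cut_around(48, [(0, 8)])` gives `[(8, 48)]`, so the pointer is left alone when the rest of the descriptor is copied.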
These additions were predominantly in the OpenMPToLLVMIRTranslation.cpp
file and phase, however, one Flang test that checks end-to-end IR
compilation (as far as we care for now at least) was altered.
2/3 required PRs to enable declare target to mapping, should look at PR
3/3 to check for full green passes (this one will fail a number due to
some dependencies).
Co-authored-by: Raghu Maddhipatla raghu.maddhipatla@amd.com
Currently this is being calculated incorrectly and will result in
incorrect index offsets in more complicated array slices. This PR tries
to address it by refactoring and changing the calculation to be more
correct.
This PR adds support for translation of the private clause on deferred
target tasks - that is `omp.target` operations with the `nowait` clause.
An offloading call for a deferred target-task is not blocking - the
offloading (target-generating) host task continues its execution after issuing the offloading
call. Therefore, the key problem we need to solve is to ensure that the
data needed for private variables to be initialized in the target task
persists even after the host task has completed.
We do this in a new pass called `PrepareForOMPOffloadPrivatizationPass`.
For a privatized variable that needs its host counterpart for
initialization (such as the shape of the data from the descriptor when
an allocatable is privatized or the value of the data when an
allocatable is firstprivatized),
- The pass allocates memory on the heap.
- It then initializes this memory by using the `init` and `copy` (for
firstprivate) regions of the corresponding `omp::PrivateClauseOp`.
- Finally, the memory allocated on the heap is freed using the `dealloc`
region of the same `omp::PrivateClauseOp` instance. This step is not
straightforward though, because we cannot simply free the memory that's
going to be used by another thread without any synchronization. So, for
deallocation, we create an `omp.task` after the `omp.target` and
synchronize the two with a dummy dependency (using the `depend` clause).
In this newly created `omp.task` we do the deallocation.
This PR shifts from using the LLVM OpenMP enumerator bit flags to an
OpenMP dialect specific enumerator. This allows us to better represent
map types that wouldn't be of interest to the LLVM backend and runtime
in the dialect.
This mainly concerns things like
ref_ptr/ref_ptee/ref_ptr_ptee/attach_none/attach_always/attach_auto, which
are of interest to the compiler for certain transformations (primarily
in the FIR transformation passes dealing with mapping), but which the runtime
has no need to know about. It also means that if another OpenMP
implementation comes along, it won't need to stick to the bit flag
system LLVM chose or do legwork to work around it.
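A minimal Python sketch of the separation, assuming a made-up dialect-side encoding. `DialectMapFlag` and its bit values are hypothetical; the `RuntimeMapFlag` values for to/from/always follow my understanding of LLVM's offload mapping flags and should be checked against the headers.

```python
from enum import IntFlag


class DialectMapFlag(IntFlag):
    """Hypothetical dialect-side encoding; includes compiler-only modifiers."""
    REF_PTR = 1
    ATTACH_AUTO = 2
    TO = 4
    FROM = 8
    ALWAYS = 16


class RuntimeMapFlag(IntFlag):
    """Subset of the runtime-facing bit flags (values assumed, see lead-in)."""
    TO = 0x1
    FROM = 0x2
    ALWAYS = 0x4


_TO_RUNTIME = {
    DialectMapFlag.TO: RuntimeMapFlag.TO,
    DialectMapFlag.FROM: RuntimeMapFlag.FROM,
    DialectMapFlag.ALWAYS: RuntimeMapFlag.ALWAYS,
}


def to_runtime_flags(flags):
    """Translate dialect flags to runtime flags, dropping compiler-only modifiers."""
    result = RuntimeMapFlag(0)
    for dialect_flag, runtime_flag in _TO_RUNTIME.items():
        if flags & dialect_flag:
            result |= runtime_flag
    return result
```

Compiler-only modifiers such as `REF_PTR` simply have no entry in the translation table, so they can guide transformations without ever reaching the runtime.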
This PR uses the VFS to create the OpenMP target entry instead of going
straight to the real file system. This matches the behavior of other
input files of the compiler.
Enable the generation of no-loop kernels for Fortran OpenMP code. `target
teams distribute parallel do` directives can be promoted to no-loop kernels
if the user adds the -fopenmp-assume-teams-oversubscription and
-fopenmp-assume-threads-oversubscription flags.
If the OpenMP kernel contains reduction or num_teams clauses, it is not
promoted to no-loop mode.
The global OpenMP device RTL oversubscription flags no longer force
no-loop code generation for Fortran.
With declare mapper, the parent base entry was emitted as `TARGET_PARAM`
only. The mapper received a map-type without `to/from`, causing
components to degrade to `alloc`-only (no copies), breaking allocatable
payload mapping. This PR preserves the map-type bits from the parent.
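Why a parent without `to/from` degrades its components can be shown with a simplified Python model. It assumes the mapper narrows each member's map type by the parent's transfer direction; the function name is hypothetical and the bit values are my understanding of the runtime flags.

```python
TO, FROM, TARGET_PARAM = 0x1, 0x2, 0x20   # bit values assumed for this sketch


def member_map_type(parent_type, member_type):
    """Narrow a member's map type by the parent's transfer direction."""
    direction = parent_type & (TO | FROM)
    if direction == 0:
        return member_type & ~(TO | FROM)   # parent is alloc-only: no copies at all
    if direction == TO:
        return member_type & ~FROM
    if direction == FROM:
        return member_type & ~TO
    return member_type                      # tofrom: member keeps its own direction
```

With a parent of `TARGET_PARAM` alone, a `tofrom` member collapses to 0 (alloc), whereas a parent of `TARGET_PARAM | TO | FROM` leaves the member's `tofrom` intact.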
This fixes #156466.