Currently this is being calculated incorrectly and will result in
incorrect index offsets in more complicated array slices. This PR tries
to address it by refactoring and changing the calculation to be more
correct.
This PR adds support for translation of the private clause on deferred
target tasks - that is `omp.target` operations with the `nowait` clause.
An offloading call for a deferred target-task is not blocking - the
offloading (target-generating) host task continues its execution after issuing the offloading
call. Therefore, the key problem we need to solve is to ensure that the
data needed for private variables to be initialized in the target task
persists even after the host task has completed.
We do this in a new pass called `PrepareForOMPOffloadPrivatizationPass`.
For a privatized variable that needs its host counterpart for
initialization (such as the shape of the data from the descriptor when
an allocatable is privatized or the value of the data when an
allocatable is firstprivatized),
- the pass allocates memory on the heap.
- it then initializes this memory by using the `init` and `copy` (for
firstprivate) regions of the corresponding `omp::PrivateClauseOp`.
- Finally the memory allocated on the heap is freed using the `dealloc`
region of the same `omp::PrivateClauseOp` instance. This step is not
straightforward though, because we cannot simply free the memory that's
going to be used by another thread without any synchronization. So, for
deallocation, we create a `omp.task` after the `omp.target` and
synchronize the two with a dummy dependency (using the `depend` clause).
In this newly created `omp.task` we do the deallocation.
This PR shifts from using the LLVM OpenMP enumerator bit flags to an
OpenMP dialect specific enumerator. This allows us to better represent
map types that wouldn't be of interest to the LLVM backend and runtime
in the dialect.
Primarily things like
ref_ptr/ref_ptee/ref_ptr_ptee/atach_none/attach_always/attach_auto which
are of interest to the compiler for certrain transformations (primarily
in the FIR transformation passes dealing with mapping), but the runtime
has no need to know about them. It also means if another OpenMP
implementation comes along they won't need to stick to the same bit flag
system LLVM chose/do leg work to address it.
This PR uses the VFS to create the OpenMP target entry instead of going
straight to the real file system. This matches the behavior of other
input files of the compiler.
Enable the generation of no-loop kernels for Fortran OpenMP code. target
teams distribute parallel do pragmas can be promoted to no-loop kernels
if the user adds the -fopenmp-assume-teams-oversubscription and
-fopenmp-assume-threads-oversubscription flags.
If the OpenMP kernel contains reduction or num_teams clauses, it is not
promoted to no-loop mode.
The global OpenMP device RTL oversubscription flags no longer force
no-loop code generation for Fortran.
With declare mapper, the parent base entry was emitted as `TARGET_PARAM`
only. The mapper received a map-type without `to/from`, causing
components to degrade to `alloc`-only (no copies), breaking allocatable
payload mapping. This PR preserves the map-type bits from the parent.
This fixes#156466.
This patch enables tiling in flang. In MLIR tiling is handled by
changing the the omp.loop_nest op to be able to represent both collapse
and tiling, so the flang front-end will combine the nested constructs into
a single MLIR op. The MLIR->LLVM-IR lowering of the LoopNestOp is
enhanced to first do the tiling if present, then collapse.
**Issue:**
When SIMD aligned clause has a alignment value which is not a power of
2, compiler crashes with error Assertion (alignment & (alignment - 1))
== 0 && "alignment is not power of 2"
**Fix:**
According to LLVM Language Reference manual
[[link]](https://llvm.org/docs/LangRef.html#assume-opbundles), the
alignment value may be non-power-of-two. In that case, the pointer value
must be a null pointer otherwise the behavior is undefined.
So instead of emitting `llvm.assume` intrinsic function with a null
pointer having the specified alignment, modified the implementation
which ignores the aligned clause which has an alignment value which is
not a power of 2. This patch also emits a warning indicating that the aligned clause is ignored if the alignment value is not a power of two.
It fixes the issue https://github.com/llvm/llvm-project/issues/149458
This PR introduces two new ops in omp dialect, omp.target_allocmem and
omp.target_freemem.
omp.target_allocmem: Allocates heap memory on device. Will be lowered to
omp_target_alloc call in llvm.
omp.target_freemem: Deallocates heap memory on device. Will be lowered
to omp+target_free call in llvm.
Example:
%1 = omp.target_allocmem %device : i32, i64
omp.target_freemem %device, %1 : i32, i64
The work in this PR is C-P/inspired from @ivanradanov commit from
coexecute implementation:
[Add fir omp target alloc and free
ops](be860ac8ba)
[Lower omp_target_{alloc,free} to
llvm](6e2d584dc9)
This fixes#147063.
I tried to fix this issue in more general way in
https://github.com/llvm/llvm-project/pull/147091 but the reviewer
suggested to fix the locations which are causing this issue. So this is
a more targeted approach.
The `restoreIP` is frequently used in the `OMPIRBuilder` to change the
insert position. This function eventually calls
`SetInsertPoint(BasicBlock *TheBB, BasicBlock::iterator IP)`. This
function updates the insert point and the debug location. But if the
`IP` is pointing to the end of the `TheBB`, then the debug location is
not updated and we could have a mismatch between insert point and the
debug location.
The problem can occur in 2 different code patterns.
This code below shows the first scenario.
```
1. auto curPos = builder.saveIP();
2. builder.restoreIP(/* some new pos */);
3. // generate some code
4. builder.restoreIP(curPos);
```
If `curPos` points to the end of basic block, we could have a problem.
But it is easy one to handle as we have the location before hand and can
save the correct debug location before 2 and then restore it after 3.
This can be done either manually or using the `llvm::InsertPointGuard`
as shown below.
```
// manual approach
auto curPos = builder.saveIP();
llvm::DebugLoc DbgLoc = builder.getCurrentDebugLocation();
builder.restoreIP(/* some new pos */);
// generate some code
builder.SetCurrentDebugLocation(DbgLoc);
builder.restoreIP(curPos);
{
// using InsertPointGuard
llvm::InsertPointGuard IPG(builder);
builder.restoreIP(/* some new pos */);
// generate some code
}
```
This PR fixes one problematic case using the manual approach.
For the 2nd scenario, look at the code below.
```
1. void fn(InsertPointTy allocIP, InsertPointTy codegenIP) {
2. builder.setInsertPoint(allocIP);
3. // generate some alloca
4. builder.setInsertPoint(codegenIP);
5. }
```
The `fn` can be called from anywhere and we can't assume the debug
location of the builder is valid at the start of the function. So if 4
does not update the debug location because the `codegenIP` points at the
end of the block, the rest of the code can end up using the debug
location of the `allocaIP`. Unlike the first case, we don't have a debug
location that we can save before hand and restore afterwards.
The solution here is to use the location of the last instruction in that
block. I have added a wrapper function over `restoreIP` that could be
called for such cases. This PR uses it to fix one problematic case.
Adding mlir to llvm support for atomic control options.
Atomic Control Options are used to specify architectural characteristics
to help lowering of atomic operations. The options used are:
`-f[no-]atomic-remote-memory`, `-f[no-]atomic-fine-grained-memory`,
`-f[no-]atomic-ignore-denormal-mode`.
Legacy option `-m[no-]unsafe-fp-atomics` is aliased to
`-f[no-]ignore-denormal-mode`.
More details can be found in
https://github.com/llvm/llvm-project/pull/102569. This PR implements the
MLIR to LLVM lowering support of atomic control attributes specified
with OpenMP `atomicUpdateOp`.
Initial support can be found in PR:
https://github.com/llvm/llvm-project/pull/150860
This PR uses `val.getDefiningOp<OpTy>()` to replace `dyn_cast<OpTy>(val.getDefiningOp())` , `dyn_cast_or_null<OpTy>(val.getDefiningOp())` and `dyn_cast_if_present<OpTy>(val.getDefiningOp())`.
These are identified by misc-include-cleaner. I've filtered out those
that break builds. Also, I'm staying away from llvm-config.h,
config.h, and Compiler.h, which likely cause platform- or
compiler-specific build failures.
Reduction support: https://github.com/llvm/llvm-project/pull/146671
If Support is fixed in this PR
The problem for the IF clause in composite constructs was that wsloop
and simd both operate on the same CanonicalLoopInfo structure: with the
SIMD processed first, followed by the wsloop. Previously the IF clause
generated code like
```
if (cond) {
while (...) {
simd_loop_body;
}
} else {
while (...) {
nonsimd_loop_body;
}
}
```
The problem with this is that this invalidates the CanonicalLoopInfo
structure to be processed by the wsloop later. To avoid this, in this
patch I preserve the original loop, moving the IF clause inside of the
loop:
```
while (...) {
if (cond) {
simd_loop_body;
} else {
non_simd_loop_body;
}
}
```
On simple examples I tried LLVM was able to hoist the if condition
outside of the loop at -O3.
The disadvantage of this is that we cannot add the
llvm.loop.vectorize.enable attribute on either the SIMD or non-SIMD
loops because they both share a loop back edge. There's no way of
solving this without keeping the old design of having two different
loops: which cannot be represented using only one CanonicalLoopInfo
structure. I don't think the presence or absence of this attribute makes
much difference. In my testing it is the llvm.loop.parallel_access
metadata which makes the difference to vectorization. LLVM will
vectorize if legal whether or not this attribute is there in the TRUE
branch. In the FALSE branch this means the loop might be vectorized even
when the condition is false: but I think this is still standards
compliant: OpenMP 6.0 says that when the if clause is false that should
be treated like the SIMDLEN clause is one. The SIMDLEN clause is defined
as a "hint". For the same reason, SIMDLEN and SAFELEN clauses are
silently ignored when SIMD IF is used.
I think it is better to implement SIMD IF and ignore SIMDLEN and SAFELEN
and some vectorization encouragement metadata when combined with IF than
to ignore IF because IF could have correctness consequences whereas the
rest are optimiztion hints. For example, the user might use the IF
clause to disable SIMD programatically when it is known not safe to
vectorize the loop. In this case it is not at all safe to add the
parallel access or SAFELEN metadata.
Support for translating the operations introduced in #144785 to LLVM-IR.
In order to keep the lowering simple,
`OpenMPIRBuider::unrollLoopHeuristic` is applied when encountering the
`omp.unroll_heuristic` op. As a result, the operation that unrolling is
applied to (`omp.canonical_loop`) must have been emitted before even
though logically there is no such requirement.
Eventually, all transformations on a loop must be applied directly after
emitting `omp.canonical_loop`, i.e. future transformations must be
looked-up when encountering `omp.canonical_loop` itself. This is because
many OpenMPIRBuilder methods (e.g. `createParallel`) expect all the
region code to be emitted withing a callback. In the case of
`createParallel`, the region code is getting outlined into a new
function. Therefore, making the operation order a formal requirement
would not make the implementation any easier.
The patch introduced changes to add address spaces to a wider array of MLIR/LLVM values, however,
it was missing an address space cast that exists in our downstream implementation that's required
for declare target to work correctly.
Despite currently being ignored with a warning, simd as a leaf in
composite constructs behaves as expected when the construct does not
contain a reduction. Enable it for those non-reduction constructs.
---------
Signed-off-by: Kajetan Puchalski <kajetan.puchalski@arm.com>
Temps needed for the reduction init regions are now allocate on the heap
all the time. However, this is performance killer for GPUs since malloc
calls are prohibitively expensive. Therefore, we should do these
allocations on the stack for GPU reductions.
This is combination of https://github.com/llvm/llvm-project/pull/138149
and https://github.com/llvm/llvm-project/pull/138039 which were opened
separately for ease of reviewing. Only other change is adjustments in 2
tests which have gone in since.
There are `DeclareOp` present for the variables mapped into target
region. That allow us to generate debug information for them. But the
`TargetOp` is still part of parent function and those variables get the
parent function's `DISubprogram` as a scope.
In `OMPIRBuilder`, a new function is created for the `TargetOp`. We also
create a new `DISubprogram` for it. All the variables that were in the
target region now have to be updated to have the correct scope. This
after the fact updating of
debug information becomes very difficult in certain cases. Take the
example of variable arrays. The type of those arrays depend on the
artificial `DILocalVariable`(s) which hold the size(s) of the array.
This new function will now require that we generate the new variable and
and new types. Similar issue exist for character type variables too.
To avoid this after the fact updating, this PR generates a
`DISubprogramAttr` for the `TargetOp` while generating the debug info in
`flang`. Then we don't need to generate a `DISubprogram` in
`OMPIRBuilder`. This change is made a bit more complicated by the the
fact that in new scheme, the debug location already points to the new
`DISubprogram` by the time it reaches `convertOmpTarget`. But we need
some code generation in the parent function so we have to carefully
manage the debug locations.
This fixes issue `#134991`.
This replicates clang's implementation. Basically:
- A private copy of the reduction variable is created, initialized to
the reduction neutral value (using regions from the reduction
declaration op).
- The body of the loop is lowered as usual, with accesses to the
reduction variable mapped to the private copy.
- After the loop, we inline the reduction region from the declaration op
to combine the privatized variable into the original variable.
- As usual with the SIMD construct, attributes are added to encourage
vectorization of the loop and to assert that memory accesses in the loop
don't alias across iterations.
I have verified that simple scalar examples do vectorize at -O3 and the
tests I could find in the Fujitsu test suite produce correct results. I
tested on top of #146097 and this seemed to work for composite
constructs as well.
Fixes#144290
Please see the following program.
```
module test_0
INTEGER :: sp = 1
!$omp declare target link(sp)
end module test_0
program main
use test_0
integer :: new_len
!$omp target map(tofrom:new_len) map(tofrom:sp)
new_len = sp
!$omp end target
print *, new_len
print *, sp
end program
```
When compiled with
`flang -g -O0 -fopenmp --offload-arch=gfx1100`
will fail the compilation with the following error:
`dbg attachment points at wrong subprogram for function`
The reason is that with the `link` clause on `!$omp declare target`, an
extra load instruction is inserted. But the debug location was not
updated before insertion which caused an invalid location to be attached
to the instruction.
CodeExtractor can currently erroneously insert an alloca into a
different function than it inserts its users into, in cases where code
is being extracted out of a function that has already been outlined. Add
an assertion that the two blocks being inserted into are actually in the
same function.
Add a check to findAllocaInsertPoint in OpenMP to LLVMIR translation to
prevent the aforementioned scenario from happening.
OpenMPIRBuilder relies on a callback mechanism to fix-up a module later
on during the finaliser step. In some cases this results in the module
being invalid prior to the finalise step running. Remove calls to
verifyModule wrapped in LLVM_DEBUG from CodeExtractor, as the presence
of those results in the compiler crashing with -mllvm -debug due to
premature module verification where it would not crash without -debug.
Call ompBuilder->finalize() the end of mlir::translateModuleToLLVMIR, in
order to make sure the module has actually been finalized prior to
trying to verify it.
Resolves https://github.com/llvm/llvm-project/issues/138102.
---------
Signed-off-by: Kajetan Puchalski <kajetan.puchalski@arm.com>
When no offload targets are specified flang will avoid offloading for
"target" constructs, but not "target data" constructs. This patch makes
the behavior consistent across all offload-related operations.
While ignoring "target" may produce semantically incorrect code, it may
still be a useful debugging tool.
--
This reinstates commits 6ba1955 and 349f8d6, reverted due to compilation
failures in the gfortran test suite. These build problems were caused by
an unrelated issue (https://github.com/llvm/llvm-project/issues/145558)
which is now fixed.
Ref: https://github.com/llvm/llvm-project/pull/144534
Given a TARGET DATA construct with USE_DEVICE_PTR(x) and IF(FALSE), the
compiler will crash if `x` was used in the body. The cause of the crash
is that the MLIR->LLVM codegen tries to look up the translated value of
x, but one had not been mapped.
Given an IF clause, the translation will generate an if-then-else
construct, with the "else" block corresponding to the false condition,
i.e. the host device playing the role of the target device. In that
block, still process the USE_DEVICE_ADDR/USE_DEVICE_PTR clauses, which
will cause the translation mappings to be created.
Fixes https://github.com/llvm/llvm-project/issues/145558
When no offload targets are specified flang will ignore "target"
constructs, but not "target data" constructs. This patch makes the
behavior consistent across all offload-related operations.
While ignoring "target" may produce semantically incorrect code, it may
still be a useful debugging tool.
Reintroduce a TODO for linear clause translation unless corner issues
(like linear variables being entities other than `alloca`, and support
for linear variables of types other than integer) are solved.
This patch adds assertions to map-related MLIR to LLVM IR translation
functions and utils to explicitly document whether they are intended for
host or device compilation only.
Over time, map-related handling has increased in complexity. This is
compounded by the fact that some handling is device-specific and some is
host-specific. By explicitly asserting on these functions on the
expected compilation pass, the flow should become slighlty easier to
follow.
A cancel or cancellation point for taskgroup is always nested inside of
a task inside of the taskgroup. For the task which is cancelled, it is
that task which needs to be cleaned up: not the owning taskgroup.
Therefore the cancellation branch handler is done in the conversion of
the task not in conversion of taskgroup.
I added a firstprivate clause to the test for cancel taskgroup to
demonstrate that the block being branched to is the same block where
mandatory cleanup code is added. Cancellation point follows exactly the
same code path.
This patch,
- Added a new attribute `nontemporal` to fir.load and fir.store operation in the FIR dialect.
- Added a pass `lower-nontemporal` which is called before FIRToLLVM conversion pass and adds the nontemporal attribute to loads and stores on the list items specified in the nontemporal clause of the SIMD directive.
- Set the `UnitAttr:$nontemporal` to llvm.load and llvm.store operations during FIR to LLVM dialect conversion, if the corresponding fir.load or fir.store operations have the nontemporal attribute.
- Attached the `nontemporal metadata` to load and store instructions that have the nontemporal attribute, during LLVM dialect to LLVM IR translation.