Add Automap modifier to the MLIR op definition for the DeclareTarget
directive's Enter clause. Also add lowering support in Flang.
Automap Ref: OpenMP 6.0 section 7.9.7.
Fortran::parser::omp::GetOmpDirectiveName(t) will get the
OmpDirectiveName object that corresponds to construct t. That object (an
AST node) contains the enum id and the source information of the
directive.
Replace uses of extractOmpDirective and getOpenMPDirectiveEnum with the
new function.
This PR tries to fix `lastprivate` update issues in composite
constructs. In particular, pre-determined `lastprivate` symbols are
attached to the wrong leaf of the composite construct (the outermost
one). When using delayed privatization (should be the default mode in
the future), this results in trying to update the `lastprivate` symbol
in the wrong construct (outside the `omp.loop_nest` op).
For example, given the following input:
```fortran
!$omp target teams distribute parallel do simd collapse(2) private(y_max)
do i=x_min,x_max
do j=y_min,y_max
enddo
enddo
```
Without the fixes introduced in this PR, the `DataSharingProcessor`
tries to generate the `lastprivate` update ops in the `parallel` op
since this is the op for which the DSP instance is created.
The fix consists of 2 main parts:
1. Instead of creating a single DSP instance, one instance is created
for the leaf constructs that might need privatization (whether for
explicit, implicit, or pre-determined symbols).
2. When generating the `lastprivate` comparison ops, we don't directly
use the SSA values of the UBs and steps. Instead, we regenerated these
SSA values from the original loop bounds' expressions. We have to do
this to avoid using `host_eval` values in the `lastprivate` comparison
logic which is illegal.
This PR is one of 3 in a PR stack, this is the primary change set which
seeks to extend the current derived type explicit member mapping support
to handle descriptor member mapping at arbitrary levels of nesting. The
PR stack seems to do this reasonably (from testing so far) but as you
can create quite complex mappings with derived types (in particular when
adding allocatable derived types or arrays of allocatable derived types)
I imagine there will be hiccups, which I am more than happy to address.
There will also be further extensions to this work to handle the
implicit auto-magical mapping of descriptor members in derived types and
a few other changes planned for the future (with some ideas on
optimizing things).
The changes in this PR primarily occur in the OpenMP lowering and the
OMPMapInfoFinalization pass.
In the OpenMP lowering several utility functions were added or extended
to support the generation of appropriate intermediate member mappings
which are currently required when the parent (or multiple parents) of a
mapped member are descriptor types. We need to map the entirety of these
types or do a "deep copy" for lack of a better term, where we map both
the base address and the descriptor as without the copying of both of
these we lack the information in the case of the descriptor to access
the member or attach the pointers data to the pointer and in the latter
case we require the base address to map the chunk of data. Currently we
do not segment descriptor based derived types as we do with regular
non-descriptor derived types, we effectively map their entirety in all
cases at the moment, I hope to address this at some point in the future
as it adds a fair bit of a performance penalty to having nestings of
allocatable derived types as an example. The process of mapping all
intermediate descriptor members in a members path only occurs if a
member has an allocatable or object parent in its symbol path or the
member itself is a member or allocatable. This occurs in the
createParentSymAndGenIntermediateMaps function, which will also generate
the appropriate address for the allocatable member within the derived
type to use as a the varPtr field of the map (for intermediate
allocatable maps and final allocatable mappings). In this case it's
necessary as we can't utilise the usual Fortran::lower functionality
such as gatherDataOperandAddrAndBounds without causing issues later in
the lowering due to extra allocas being spawned which seem to affect the
pointer attachment (at least this is my current assumption, it results
in memory access errors on the device due to incorrect map information
generation). This is similar to why we do not use the MLIR value
generated for this and utilise the original symbol provided when mapping
descriptor types external to derived types. Hopefully this can be
rectified in the future so this function can be simplified and more
closely aligned to the other type mappings. We also make use of
fir::CoordinateOp as opposed to the HLFIR version as the HLFIR version
doesn't support the appropriate lowering to FIR necessary at the moment,
we also cannot use a single CoordinateOp (similarly to a single GEP) as
when we index through a descriptor operation (BoxType) we encounter
issues later in the lowering, however in either case we need access to
intermediate descriptors so individual CoordinateOp's aid this
(although, being able to compress them into a smaller amount of
CoordinateOp's may simplify the IR and perhaps result in a better end
product, something to consider for the future).
The other large change area was in the OMPMapInfoFinalization pass,
where the pass had to be extended to support the expansion of box types
(or multiple nestings of box types) within derived types, or box type
derived types. This requires expanding each BoxType mapping from one
into two maps and then modifying all of the existing member indices of
the overarching parent mapping to account for the addition of these new
members alongside adjusting the existing member indices to support the
addition of these new maps which extend the original member indices (as
a base address of a box type is currently considered a member of the box
type at a position of 0 as when lowered to LLVM-IR it's a pointer
contained at this position in the descriptor type, however, this means
extending mapped children of this expanded descriptor type to
additionally incorporate the new member index in the correct location in
its own index list). I believe there is a reasonable amount of comments
that should aid in understanding this better, alongside the test
alterations for the pass.
A subset of the changes were also aimed at making some of the utilities
for packing and unpacking the DenseIntElementsAttr containing the member
indices shareable across the lowering and OMPMapInfoFinalization, this
required moving some functions to the Lower/Support/Utils.h header, and
transforming the lowering structure containing the member index data
into something more similar to the version used in
OMPMapInfoFinalization. There we also some other attempts at tidying
things up in relation to the member index data generation in the
lowering, some of which required creating a logical operator for the
OpenMP ID class so it can be utilised as a map key (it simply utilises
the symbol address for the moment as ordering isn't particularly
important).
Otherwise I have added a set of new tests encompassing some of the
mappings currently supported by this PR (unfortunately as you can have
arbitrary nestings of all shapes and types it's not very feasible to
cover them all).
The main purpose of this patch is to centralize the logic for creating
MLIR operation entry blocks and for binding them to the corresponding
symbols. This minimizes the chances of mixing arguments up for
operations having multiple entry block argument-generating clauses and
prevents divergence while binding arguments.
Some changes implemented to this end are:
- Split into two functions the creation of the entry block, and the
binding of its arguments and the corresponding Fortran symbol. This
enabled a significant simplification of the lowering of composite
constructs, where it's no longer necessary to manually ensure the lists
of arguments and symbols refer to the same variables in the same order
and also match the expected order by the `BlockArgOpenMPOpInterface`.
- Removed redundant and error-prone passing of types and locations from
`ClauseProcessor` methods. Instead, these are obtained from the values
in the appropriate clause operands structure. This also simplifies
argument lists of several lowering functions.
- Access block arguments of already created MLIR operations through the
`BlockArgOpenMPOpInterface` instead of directly indexing the argument
list of the operation, which is not scalable as more entry block
argument-generating clauses are added to an operation.
- Simplified the implementation of `genParallelOp` to no longer need to
define different callbacks depending on whether delayed privatization is
enabled.
A compound construct with a list of clauses is broken up into individual
leaf/composite constructs. Each such construct has the list of clauses
that apply to it based on the OpenMP spec.
Each lowering function (i.e. a function that generates MLIR ops) is now
responsible for generating its body as described below.
Functions that receive AST nodes extract the construct, and the clauses
from the node. They then create a work queue consisting of individual
constructs, and invoke a common dispatch function to process (lower) the
queue.
The dispatch function examines the current position in the queue, and
invokes the appropriate lowering function. Each lowering function
receives the queue as well, and once it needs to generate its body, it
either invokes the dispatch function on the rest of the queue (if any),
or processes nested evaluations if the work queue is at the end.
Re-application of ca1bd5995f6ed934f9187305190a5abfac049173 with fixes for
compilation errors.
A compound construct with a list of clauses is broken up into individual
leaf/composite constructs. Each such construct has the list of clauses
that apply to it based on the OpenMP spec.
Each lowering function (i.e. a function that generates MLIR ops) is now
responsible for generating its body as described below.
Functions that receive AST nodes extract the construct, and the clauses
from the node. They then create a work queue consisting of individual
constructs, and invoke a common dispatch function to process (lower) the
queue.
The dispatch function examines the current position in the queue, and
invokes the appropriate lowering function. Each lowering function
receives the queue as well, and once it needs to generate its body, it
either invokes the dispatch function on the rest of the queue (if any),
or processes nested evaluations if the work queue is at the end.
This patch is one in a series of four patches that seeks to refactor
slightly and extend the current record type map support that was
put in place for Fortran's descriptor types to handle explicit
member mapping for record types at a single level of depth.
For example, the below case where two members of a Fortran
derived type are mapped explicitly:
''''
type :: scalar_and_array
real(4) :: real
integer(4) :: array(10)
integer(4) :: int
end type scalar_and_array
type(scalar_and_array) :: scalar_arr
!$omp target map(tofrom: scalar_arr%int, scalar_arr%real)
''''
Current cases of derived type mapping left for future work are:
> explicit member mapping of nested members (e.g. two layers of
record types where we explicitly map a member from the internal
record type)
> Fortran's automagical mapping of all elements and nested elements
of a derived type
> explicit member mapping of a derived type and then constituient members
(redundant in Fortran due to former case but still legal as far as I am aware)
> explicit member mapping of a record type (may be handled reasonably, just
not fully tested in this iteration)
> explicit member mapping for Fortran allocatable types (a variation of nested
record types)
This patch seeks to support this by extending the Flang-new OpenMP lowering to
support generation of this newly required information, creating the neccessary
parent <-to-> member map_info links, calculating the member indices and
setting if it's a partial map.
The OMPDescriptorMapInfoGen pass has also been generalized into a map
finalization phase, now named OMPMapInfoFinalization. This pass was extended
to support the insertion of member maps into the BlockArg and MapOperands of
relevant map carrying operations. Similar to the method in which descriptor types
are expanded and constituient members inserted.
Pull Request: https://github.com/llvm/llvm-project/pull/82853
This patch contains slight modifications to the reverted PR #85258 to
avoid issues with constructs containing multiple reduction clauses,
uncovered by a test on the gfortran testsuite.
This reverts commit 9f80444c2e669237a5c92013f1a42b91b5609012.
This patch moves some code in PFT to MLIR OpenMP lowering to the
`ClauseProcessor` class. This is so that some behavior that is related
to certain clauses stays within the `ClauseProcessor` and it's not the
caller the one responsible for always doing this when the clause is
present.
In this patch some uses of `llvm::SmallVector` in Flang's lowering to
MLIR are replaced by other types (i.e. `llvm::ArrayRef` and
`llvm::SmallVectorImpl`) which are intended for these uses. This
generally prevents relying on always passing small vectors with a
particular number of elements in the stack.
…essor
Rename `findRepeatableClause` to `findRepeatableClause2`, and make the
new `findRepeatableClause` operate on new `omp::Clause` objects.
Leave `Map` unchanged, because it will require more changes for it to
work.
[Clause representation 3/6]
This patch seeks to create a process that happens on module finalization
for OpenMP, in which a list of operations that had declare target
directives applied to them and were not generated at the time of
processing the original declare target directive are re-checked to apply
the appropriate declare target semantics.
This works by maintaining a vector of declare target related data inside
of the FIR converter, in this case the symbol and the two relevant
unsigned integers representing the enumerators. This vector is added to
via a new function called from Bridge.cpp, insertDeferredDeclareTargets,
which happens prior to the processing of the directive (similarly to
getDeclareTargetFunctionDevice currently for requires), it effectively
checks if the Operation the declare target directive is applied to
currently exists, if it doesn't it appends to the vector. This is a
seperate function to the processing of the declare target via the
overloaded genOMP as we unfortunately do not have access to the list
without passing it through every call, as the AbstractConverter we pass
will not allow access to it (I've seen no other cases of casting it to a
FirConverter, so I opted to not do that).
The list is then processed at the end of the module in the
finalizeOpenMPLowering function in Bridge by calling a new function
markDelayedDeclareTargetFunctions which marks the latently generated
operations. In certain cases, some still will not be generated, e.g. if
an interface is defined, marked as declare target, but has no definition
or usage in the module then it will not be emitted to the module, so due
to these cases we must silently ignore when an operation has not been
found via it's symbol.
The main use-case for this (although, I imagine there is others) is for
processing interfaces that have been declared in a module with a declare
target directive but do not have their implementation defined in the
same module. For example, inside of a seperate C++ module that will be
linked in. In cases where the interface is called inside of a target
region it'll be marked as used on device appropriately (although,
realistically a user should explicitly mark it to match the
corresponding definition), however, in cases where it's used in a
non-clear manner through something like a function pointer passed to an
external call we require this explicit marking, which this patch adds
support for (currently will cause the compiler to crash).
This patch also adds documentation on the declare target process and
mechanisms within the compiler currently.
Adds basic support for emitting delayed privatizers from flang. So far,
only types of symbols are supported (i.e. scalars), support for more
complicated types will be added later. This also makes sure that
reduction and delayed privatization work properly together by merging
the
body-gen callbacks for both in case both clauses are present on the
parallel construct.
This started as an experiment to reduce the compilation time of
iterating over `Lower/OpenMP.cpp` a bit since it is too slow at the
moment. Trying to do that, I split the `DataSharingProcessor`,
`ReductionProcessor`, and `ClauseProcessor` into their own files and
extracted some shared code into a util file. All of these new `.h/.cpp`
files as well as `OpenMP.cpp` are now under a `Lower/OpenMP/` directory.
This resulted is a slightly better organization of the OpenMP lowering
code and hence opening this NFC.
As for the compilation time, this unfortunately does not affect it much
(it shaves off a few seconds of `OpenMP.cpp` compilation) since from
what I learned the bottleneck is in `DirectivesCommon.h` and
`PFTBuilder.h` which both consume a lot of time in template
instantiation it seems.