The related functions are `gatherDataOperandAddrAndBounds` and
`genBoundsOps`. The former is used in OpenACC as well, and it was
updated to pass evaluate::Expr instead of parser objects.
The difference in the test case comes from unfolded conversions of index
expressions, which are explicitly of type integer(kind=8).
Delete now unused `findRepeatableClause2` and `findClause2`.
Add `AsGenericExpr` that takes std::optional. It already returns
optional Expr. Making it accept an optional Expr as input would reduce
the number of necessary checks when handling frequent optional values in
evaluator.
[Clause representation 4/6]
This will be useful for OpenMP too.
I changed the definition slightly to use `fir::isa_ref_type` (which also
includes llvm pointers) because I think it reads better using the common
type helpers. There shouldn't be any llvm pointers in lowering so this
isn't a functional change.
Four "issues" on GitHub report possible performance problems, likely
detected by static analysis. None of them would ever make a measureable
difference in compilation time, but I'm resolving them to clean up the
open issues list.
Fixes https://github.com/llvm/llvm-project/issues/79703, .../79705,
.../79706, & .../79707.
IV variable are privatized during acc loop lowering. An hlfir.declare
operation is added when mapping the symbol to the new private value. In
order to avoid using multiple value in the acc.loop region, we map the
symbol to the result of the hlfir.declare operation inserted.
In #80317 the data op generation was updated to use correctly the #0
result from the hlfir.delcare op. In case of optional that are not
descriptor, it is preferable to use the original input for the varPtr
value of the OpenACC data op.
This patch also make sure that the descriptor value of optional is only
accessed when present.
The `acc.declate_action` attribute was sometime misplaced as reported in
#79770.
This patch updates the lowering code to place the
postAllocate/postDeallocate actions at the correct place.
- Support wait(devnum: ) with device_type support on all operations that
require it
- devnum value is stored as the first value of waitOperands in its
device_type sub-segment. The hasWaitDevnum attribute inform which
sub-segment has a wait(devnum) value.
- Make async/wait information homogenous on compute ops, data and update
op.
- Unify operands/attributes names across operations and use the same
custom parser/printer
Lower basic DO CONCURRENT with acc loop construct. The DO CONCURRENT is
lowered to an acc.loop operation.
This does not currently cover the DO CONCURRENT with locality specs.
acc.loop was redesigned in https://reviews.llvm.org/D159229. This patch
updates the lowering to match the new op.
DO CONCURRENT construct will be added in a follow up patch.
Note that the pre-commit ci will fail until D159229 is merged.
Depends on #67355
routine, data, parallel, serial, kernels and loop construct all support
the device_type clause. This clause takes a list of device_type.
Previously the lowering code was assuming that the list s a single item.
This PR updates the lowering to handle any number of device_types.
This patch add support for device_type on the acc.routine operation.
device_type can be specified on seq, worker, vector, gang and bind
information.
The support is following the same design than the one for compute
operations, data operation and the loop operation.
This is adding support for `device_type` clause representation in the
OpenACC MLIR dialect on the acc.loop operation and adjust flang to lower
correctly to the new representation.
Each "value" that can be impacted by a `device_type` clause is now
associated with an array attribute that carry this information. This
includes:
- `worker` clause information
- `gang` clause information
- `vector` clause information
- `collapse` clause information
- `tile` clause information
The representation of the `gang` clause information has been updated and
all values are now carried in a single operand segment. This segment is
then subdivided by `device_type`. Each value in a segment is also
associated with a `GangArgType` so it can be differentiated
(num/dim/static). This simplify the handling of gang values an limit the
number of new attributes needed.
When the clause can be associated with the operation without any value
(`gang`, `vector`, `worker`). These are represented by a dedicated
attributes with device_type information.
Extra getter functions are provided to make it easier to retrieve a
value based on a device_type.
Re-land PR after being reverted because of buildbot failures.
This patch adds representation for `device_type` clause information on
compute construct (parallel, kernels, serial).
The `device_type` clause on compute construct impacts clauses that
appear after it. The values impacted by `device_type` are now tied with
an attribute array that represent the device_type associated with them.
`DeviceType::None` is used to represent the value produced by a clause
before any `device_type`. The operands and the attribute information are
parser/printed together.
This is an example with `vector_length` clause. The first value (64) is
not impacted by `device_type` so it will be represented with
DeviceType::None. None is not printed. The second value (128) is tied
with the `device_type(multicore)` clause.
```
!$acc parallel vector_length(64) device_type(multicore) vector_length(256)
```
```
acc.parallel vector_length(%c64 : i32, %c128 : i32 [#acc.device_type<multicore>]) {
}
```
When multiple values can be produced for a single clause like
`num_gangs` and `wait`, an extra attribute describe the number of values
belonging to each `device_type`. Values and attributes are
parsed/printed together.
```
acc.parallel num_gangs({%c2 : i32, %c4 : i32}, {%c4 : i32} [#acc.device_type<nvidia>])
```
While preparing this patch I noticed that the wait devnum is not part of
the operations and is not lowered. It will be added in a follow up
patch.
This patch adds representation for `device_type` clause information on
compute construct (parallel, kernels, serial).
The `device_type` clause on compute construct impacts clauses that
appear after it. The values impacted by `device_type` are now tied with
an attribute array that represent the device_type associated with them.
`DeviceType::None` is used to represent the value produced by a clause
before any `device_type`. The operands and the attribute information are
parser/printed together.
This is an example with `vector_length` clause. The first value (64) is
not impacted by `device_type` so it will be represented with
DeviceType::None. None is not printed. The second value (128) is tied
with the `device_type(multicore)` clause.
```
!$acc parallel vector_length(64) device_type(multicore) vector_length(256)
```
```
acc.parallel vector_length(%c64 : i32, %c128 : i32 [#acc.device_type<multicore>]) {
}
```
When multiple values can be produced for a single clause like
`num_gangs` and `wait`, an extra attribute describe the number of values
belonging to each `device_type`. Values and attributes are
parsed/printed together.
```
acc.parallel num_gangs({%c2 : i32, %c4 : i32}, {%c4 : i32} [#acc.device_type<nvidia>])
```
While preparing this patch I noticed that the wait devnum is not part of
the operations and is not lowered. It will be added in a follow up
patch.
The `acc` dialect operations now implement MemoryEffects interfaces in
the following ways:
- Data entry operations which may read host memory via `varPtr` are now
marked as so. The majority of them do NOT actually read the host memory.
For example, `acc.present` works on the basis of presence of pointer and
not necessarily what the data points to - so they are not marked as
reading the host memory. They still use `varPtr` though but this
dependency is reflected through ssa.
- Data clause operations which may mutate the data pointed to by
`accPtr` are marked as doing so.
- Data clause operations which update required structured or dynamic
runtime counters are marked as reading and writing the newly defined
`RuntimeCounters` resource. Some operations, like `acc.getdeviceptr` do
not actually use the runtime counters - but are marked as reading them
since the address obtained depends on the mapping operations which do
update the runtime counters. Namely, `acc.getdeviceptr` cannot be moved
across other mapping operations.
- Constructs are marked as writing to the `ConstructResource`. This may
be too strict but is needed for the following reasons: 1) Structured
constructs may not use `accPtr` and instead use `varPtr` - when this is
the case, data actions may be removed even when used. 2) Unstructured
constructs are currently used to aggregate multiple data actions. We do
not want such constructs removed or moved for now.
- Terminators are marked as `Pure` as in other dialects.
The current approach has the following limitations which may require
further improvements:
- Subsequent `acc.copyin` operations on same data do not actually read
host memory pointed to by `varPtr` but are still marked as so.
- Two `acc.delete` operations on same data may not mutate `accPtr` until
the runtime counters are zero (but are still marked as mutating).
- The `varPtrPtr` argument, when present, points to the address of
location of `varPtr`. When mapping to target device, an `accPtrPtr`
needs computed and this memory is mutated. This effect is not captured
since the current operations do not produce `accPtrPtr`.
- Runtime counter effects are imprecise since two operations with
differing `varPtr` increment/decrement different counters. Additionally,
operations with `varPtrPtr` mutate attachment counters.
- The `ConstructResource` is too strict and likely can be relaxed with
better modeling.
Make sure we only load box and read its bounds when it is present.
- Add `AddrAndBoundInfo` struct to be able to carry around the `addr`
and `isPresent` values. This is likely to grow so we can make all the
access in a single `fir.if` operation.
Early return is accepted in OpenACC loop not directly nested in a
compute construct. Since acc.loop operation has a region, the
`func.return` operation cannot be directly used inside the region.
An early return is materialized by an `acc.yield` operation returning a
`true` value. The standard end of the `acc.loop` region yield a `false`
value in this case.
A conditional branch operation on the `acc.loop` result will branch to
the `finalBlock` or just to the continue block whether an early exit was
produce in the acc.loop.
Using an op with a region cause some issue with unstructured code. This
patch make use of acc.declare_enter and acc.declare_exit to represent
the implicit declare region.
PR #70698 relax the duplication rule in acc declare clauses. This lead
to potential duplicate creation of the global constructor/destructor.
This patch make sure to not generate a duplicate ctor/dtor.
When the acc routine directive was in an interface block in a
subroutine, the routine information was attached to the wrong
subroutine. This patch fixes this be retrieving the subroutine name in
the interface.
In cases like `copy(array(N))` it is still useful to represent
the data operand uniformly with `copy(array(N:N))`.
This change generates data bounds even if it is not an array
section with the triplets. The lower and the upper bounds
are the same and the extent is one in this case.
A variable in equivalence share the storage units with one or more
objects. When lowered to FIR, the global created for the equivalence has
the name of one of the object. The variable also has an offset in the
storage unit.
This patch takes all of this into account for variable part of
equivalence used in a declare directive.
The compute and data constructs implement getNumDataOperands and
getDataOperand. The acc.loop operation similarly has multiple data
operands - thus it makes sense to expose them the same way.
For loop, only private and reduction operands are exposed this way.
Technically, acc.loop also holds cache operands - but these are hints
not a data attribute.
The location set on atomic operations in both OpenMP and OpenACC was
completly off. The real location needs to be created from the source
CharBlock of the parse tree node of the respective atomic statement.
This patch updates locations in lowering for atomic operations.
After PR#69417, lowering for combined constructs was updated to adhere
to OpenACC 3.3, section 2.11: `A private or reduction clause on a
combined construct is treated as if it appeared on the loop construct.`
However, the second part of that paragraph notes `In addition, a
reduction clause on a combined construct implies a copy clause`. Since
the acc dialect decomposes combined constructs, it is important to
distinguish between the case where an explicit data clause is required
(as noted in section 2.6.2) and the case where an implicit data action
must be generated by compiler.
Some compilers allow the `$acc routine(<name>)` to be placed at the
program unit level. To be compatible, this patch enables the use of acc
routine at this level. These acc routine directives must have a name.
The declare actions were introduced to capture semantics dealing with
allocation of descriptor-based variable. However, the post_alloc action
has an ordering error. It needs to update descriptor first before the
mapping action of the data. The reason for this is that implicit attach
must occur during mapping action - but updating the descriptor
synchronizes it with the host copy (which would hold a host pointer).
Add lowering support for array with dynamic extents in the firstprivate
recipe. Generalize the lowering so static shaped arrays and array with
dynamic extents use the same path.
Some cleaning code is taken from #68836 that is not landed yet.
Add support for assumed shape arrays in lowering of the copy region of
the firstprivate recipe. Information is passed in block arguments as it
is done for the reduction recipe.