This reverts commit 5178aeff7b96e86b066f8407b9d9732ec660dd2e.
In addition:
* Scalar constant UNSIGNED BOUNDARY is explicitly casted
to the result type so that the generated hlfir.eoshift
operation is valid. The lowering produces signless constants
by default. It might be a bigger issue in lowering, so I just
want to "fix" it for EOSHIFT in this patch.
* Since we have to create unsigned integer constant during
HLFIR inlining, I added code in createIntegerConstant
to make it possible.
Array sections like this have not been using the knowledge that
the result is contiguous:
```
type t
integer :: f
end type
type(t) :: a(:)
a%f = 0
```
Peter Klausler is working on a change that will result in the
corresponding
hlfir.designate having a component and a non-box result.
This patch fixes the issues found in HLFIR-to-FIR conversion.
This patch generalizes the code for hlfir.cshift to be applicable
for hlfir.eoshift. The major difference is the selection
of the boundary value that might be statically/dynamically absent,
in which case the default scalar value has to be used.
The scalar value of the boundary is always computed before
the hlfir.elemental or the assignment loop.
Contrary to hlfir.cshift simplication, the SHIFT value is not
normalized,
because the original value (and its sign) participate in the EOSHIFT
index computation for addressing the input array and selecting
which elements of the results are assigned from the boundary operand.
This patch allows optimizing redundant array repacking, when
the source array is statically known to be contiguous.
This is part of the implementation plan for the array repacking
feature, though, it does not affect any real life use case
as long as FIR inlining is not a thing. I experimented with
simple cases of FIR inling using `-inline-all`, and I recorded
these cases in optimize-array-repacking.fir tests.
Handles some loose ends in `do concurrent` reduction declarations. This
PR extends `getAllocaBlock` to handle declare ops, and also emit
`fir.yield` in all regions.
Host associated variables were not being handled properly.
For array references, get the fixed shape extents from the value
type instead, that works correctly in all cases.
Assignments of n-dimensional arrays, with trivial RHS, were
always being converted to n nested loops. For contiguous arrays,
it's possible to flatten them and use a single loop, that can
usually be better optimized by LLVM.
In a test program, using a 3-dimensional array and varying its
size, the resulting speedup was as follows (measured on Graviton4):
16K 1.09
64K 1.40
128K 1.90
256K 1.91
512K 1.00
For sizes above or equal to 512K no improvement was observed.
It looks like LLVM stops trying to perform aggressive loop
unrolling at a certain threshold and just uses nested loops
instead. Larger sizes won't fit on L1 and L2 caches too.
This was noticed while profiling 527.cam4_r. This optimization
makes aer_rad_props_sw slightly faster, but unfortunately it
practically doesn't change 527.cam4_r total execution time.
Do not create new descriptor for polymorphic scalars when lowering
hlfir.declare.
hlfir.declare of box/class is lowered to a fir.rebox to ensure that
local lower bounds and descriptor attributes (Pointer/Allocatable/None)
are properly set-up in the descriptor associated to the symbol.
For polymorphic scalar, this created a useless temporary descriptor.
This was breaking invalid code #145256 that violates OPTIONAL usage
rules. I am not fixing it primarily to support this invalid code, but
rather because it is dumb to create a useless fir.rebox.
This fixes an issue with `TransferWriteOp`'s implementation of the
`MemoryEffectOpInterface` where the write effect was attached to the
stored value rather than the base.
This had the effect that when asking for the memory effects for the
input memref buffer using `getEffectsOnValue(...)`, the function would
return no-effects (as the effect would have been attached to the stored
value rather than the input buffer).
Prevent hlfir.declare output to be fir.box/class values with the
heap/pointer attribute to ensure the runtime descriptor attributes are
in line with the Fortran attributes for the entities being declared
(only fir.ref<box/class> can be ALLOCATABLE/POINTERS).
This fixes a bug where an associated entity inside a SELECT TYPE was being
unexpectedly reallocated inside assign runtime because the selector was allocatable
and this attribute was not properly removed when creating the descriptor
for the associated entity (that does not inherit the ALLOCATABLE/POINTER
attribute according to Fortran 2023 section 11.1.3.3).
During `hlfir.assign` inlining and `ElementalAssignBufferization`
we can deduce the optimal shape from `lhs` and `rhs` shapes.
It is probably better be done in a separate pass that propagates
constant shapes, but I have not seen any benchmarks that would
benefit from this yet. So consider this as a workaround for a bigger
TODO issue.
The `ElementalAssignBufferization` case is from 465.tonto,
but I do not have performance results yet (I do not expect much).
If there is a read-effect operation inside `hlfir.elemental`,
there is no reason to block moving it to the assignment point
unless there are write-effect operations between the elemental
and the assignment. The previous code was disallowing the optimization
even if there were only read-effect operations in between.
This case is from 465.tonto, though this change does not improve
performance at all.
hlfir.copy_in implements copying non-contiguous array slices for
functions that take in arrays required to be contiguous through
flang-rt.
For large arrays of trivial types, this can incur overhead compared to a
plain, inlined copy loop.
To address that, add a new InlineHLFIRCopyIn optimisation pass to inline
hlfir.copy_in operations for trivial types.
For the time being, the pattern is only applied in cases where the
copy-in does not require a corresponding copy-out, such as when the
function being called declares the array parameter as intent(in).
Applying this optimisation reduces the runtime of thornado-mini's
DeleptonizationProblem by about 10%.
---------
Signed-off-by: Kajetan Puchalski <kajetan.puchalski@arm.com>
Polymorphic temporary are currently propagated as
fir.ref<fir.class<fir.heap<>>> because their allocation may be delayed
to the hlfir.assign copy (using realloc).
This patch moves away from this and directly allocate the temp and
propagate it as a fir.class.
The polymorphic temporaries creating is also simplified by avoiding the
need to call the runtime to setup the descriptor altogether (the runtime
is still call for the allocation currently because alloca/allocmem do
not support polymorphism).
Memory effects on the volatile memory resource may not be attached to a
particular source, in which case the value of an effect will be null.
This caused this test case to crash in the optimized bufferization
pass's safety analysis because it assumes it can get the SSA value
modified by the memory effect. This is because memory effects on the
volatile resource indicate that the operation must not be reordered with
respect to other volatile operations, but there is not a material ssa
value that can be pointed to.
This patch changes the safety checks such that memory effects which do
not have associated values are not safe for optimized bufferization.
CSE may delete operations from hlfir.exactly_once and reuse
the equivalent results from the parent region(s), e.g. from the parent
hlfir.region_assign. This makes it problematic to clone
hlfir.exactly_once before the top-level hlfir.where.
This patch adds a "canonicalizer" that pulls in such operations
back into hlfir.exactly_once.
Switch from `int64_t` to `int64_t*` to fit with the rest of the
implementation.
New tentative with some fix. The previous was reverted some time ago.
Reviewed in #138010
Early HLFIR optimizations may experience problems with values
produced by hlfir.associate. In most cases this is a unique
local memory allocation, but it can also reuse some other
hlfir.expr memory sometimes. It seems to be safe to assume
unique allocation for trivial types, since we always
allocate new memory for them.
This change allows marking more designators producing an opaque
box with 'contiguous' attribute, e.g. like in test1 case
in flang/test/HLFIR/propagate-contiguous-attribute.fir.
This would make isSimplyContiguous() return true for such
designators allowing merging hlfir.eval_in_mem with hlfir.assign
where the LHS is a contiguous array section.
Depends on #139003
Contiguous variables represented with a box do not have
explicit shape, but it looks like the base/shape computation
was assuming that. This caused generation of raw address
fir.array_coor without the shape. This patch is needed
to fix failures hapenning with #138797.
Enabling volatility lowering by default revealed some issues in lowering
and op verification.
For example, given volatile variable of a nested type, accessing
structure members of a structure member would result in a volatility
mismatch when the inner structure member is designated (and thus a
verification error at compile time).
In other cases, I found correct codegen when the checks were disabled,
also related to allocatable types and how we handle volatile references
of boxes.
This hides the strict verification of fir and hlfir ops behind a flag so
I can iteratively improve lowering of volatile variables without causing
compile-time failures, keeping the strict verification on when running
tests.
Instead of using loop-carried IsFirst predicate, we can fetch
the initial reduction values for MIN/MAX LOC/VAL reductions
from the array itself. This results in a little bit cleaner
loop nests, especially, generated for total reductions.
Otherwise, LLVM is able to peel the first iteration of the innermost
loop, but the surroudings of the peeled code are executed
multiple times withing the outer loop(s).
This patch does the manual peeling, which only works for
non-masked reductions where the input array is not empty.
[RFC on
discourse](https://discourse.llvm.org/t/rfc-volatile-representation-in-flang/85404/1)
Flang currently lacks support for volatile variables. For some cases,
the compiler produces TODO error messages and others are ignored. Some
of our tests are like the example from _C.4 Clause 8 notes: The VOLATILE
attribute (8.5.20)_ and require volatile variables.
Prior commits:
```
c9ec1bc753b0 [flang] Handle volatility in lowering and codegen (#135311)
e42f8609858f [flang][nfc] Support volatility in Fir ops (#134858)
b2711e1526f9 [flang][nfc] Support volatile on ref, box, and class types (#134386)
```
This change generalizes SumAsElemental inlining in
SimplifyHLFIRIntrinsics pass so that it can be applied
to ALL, ANY, COUNT, MAXLOC, MAXVAL, MINLOC, MINVAL, SUM.
This change makes the special handling of the reduction
operations in OptimizedBufferization redundant: once HLFIR
operations are inlined, the hlfir.elemental inlining should
do the rest of the job.
This change generalizes SumAsElemental inlining in
SimplifyHLFIRIntrinsics pass so that it can be applied
to ALL, ANY, COUNT, MAXLOC, MAXVAL, MINLOC, MINVAL, SUM.
This change makes the special handling of the reduction
operations in OptimizedBufferization redundant: once HLFIR
operations are inlined, the hlfir.elemental inlining should
do the rest of the job.
This change generalizes SumAsElemental inlining in
SimplifyHLFIRIntrinsics pass so that it can be applied
to ALL, ANY, COUNT, MAXLOC, MAXVAL, MINLOC, MINVAL, SUM.
This change makes the special handling of the reduction
operations in OptimizedBufferization redundant: once HLFIR
operations are inlined, the hlfir.elemental inlining should
do the rest of the job.
In `makeMinMaxInitValGenerator` it explicitly checks for only
`FloatType` and `IntegerType`, so we shouldn't match if we don't have
either of those types.
Fix for #134308
* Enable lowering and conversion patterns to pass volatility information
from higher level operations to lower level ones.
* Enable codegen to pass volatility to LLVM dialect ops by setting an
attribute on loads, stores, and memory intrinsics.
* Add utilities for passing along the volatility from an input type to
an output type.
To introduce volatile types into the IR, entities with the volatile
attribute will be given a volatile type in the bridge; this is not
enabled in this patch. User code should not result in IR with volatile
types yet, so this patch contains no tests with Fortran source, only IR
that already contains volatile types.
Part 3 of #132486.
Follow up to #134170.
We should be using the LLVM intrinsics instead of plain fir.calls when
we can. Existing code creates a declaration for the llvm intrinsic and
a regular fir.call, which makes it hard for consumers of the IR to find
all the intrinsic calls.
The code generation relies on `ShallowCopyDirect` runtime
to copy data between the original and the temporary arrays
(both directions). The allocations are done by the compiler
generated code. The heap allocations could have been passed
to `ShallowCopy` runtime, but I decided to expose the allocations
so that the temporary descriptor passed to `ShallowCopyDirect`
has `nocapture` - maybe this will be better for LLVM optimizations.
Currently, the helpers to get fir::ExtendedValue out of hlfir::Entity
use hlfir.declare second result (`#1`) in most cases. This is because
this result is the same as the input and matches what FIR was getting
before lowering to HLFIR.
But this creates odd situations when both hlfir.declare are raw pointers
and either result ends-up being used in the IR depending on whether the
code was generated by a helper using fir::ExtendedValue, or via "pure
HLFIR" helpers using the first result.
This will typically prevent simple CSE and easy identification that two
operation (e.g load/store) are touching the exact same memory location
without using alias analysis or "manual detection" (looking for common
hlfir.declare defining op).
Hence, when hlfir.declare results are both raw pointers, use `#0` when
producing `fir::ExtendedValue`.
When `#0` is a fir.box, keep using `#1` because these are not the same.
The only code change is in HLFIRTools.cpp and is pretty small, but there
is a big test fallout of `#1` to `#0`.
There is no reason for character element type to be forbidden in this
helper.
The assert was firing in character pointer assignment in FORALL after
#130772 added a usage of this helper.
Implement handling of `NULL()` RHS, polymorphic pointers, as well as
lower bounds or bounds remapping in pointer assignment inside FORALL.
These cases eventually do not require updating hlfir.region_assign,
lowering can simply prepare the new descriptor for the LHS inside the
RHS region.
Looking more closely at the polymorphic cases, there is not need to call
the runtime, fir.rebox and fir.embox do handle the dynamic type setting
correctly.
After this patch, the last remaining TODO is the allocatable assignment
inside FORALL, which like some cases here, is more likely an accidental
feature given FORALL was deprecated in F2003 at the same time than
allocatable components where added.
Very similar to object pointer assignment, the difference is the SSA
types of the LHS (!fir.ref<!fir.boxproc<()->()>> and RHS
(!fir.boxproc<()->()).
The RHS must be saved as simple address, not descriptors (it is not
possible to make CFI descriptor out of procedure entity).
The semantic of pointer assignments inside FORALL requires evaluating
the targets (RHS) and pointer variables (LHS) of all iterations before
evaluating the assignments.
In practice, if the compiler can prove that the RHS and LHS evaluations
are not impacted by the assignments, the evaluation of the FORALL
assignment statement can be done in a single loop. However, if the
compiler cannot prove this, it needs to "save" the addresses of the
targets and/or the pointer descriptors of each iterations before doing
the assignments.
This patch implements the most common cases where there is no lower bound
spec, no bounds remapping, the LHS is not polymorphic, and the RHS is
not NULL.
The HLFIR operation used to represent assignments inside FORALL can be
used for pointer assignments to (the only difference being that the LHS
is a descriptor address).
The analysis for intrinsic assignment can be reused, with the
distinction that the RHS data is not read during the assignment.
The logic that is used to save LHS in intrinsic assignments inside
FORALL is extracted to be used for the RHS of pointer assignments when
needed (saving a descriptor value).
Pointer assignment LHS are just descriptor addresses and are saved as
int_ptr values.
The HLFIR inlining of MAXVAL kicks in at O1 and more when the argument
is an array component reference but the implementation did not account
for the rare cases where the array components have non default lower
bounds.
This patch fixes the issue by using `getElementAt` to compute the
element address.
Rename `indices` to `oneBasedIndices` for more clarity.
Flang generates slower code for `CSHIFT(CSHIFT(PTR(:,:,I),sh1,1),sh2,2)`
pattern in facerec than other compilers. The first CSHIFT can be done
as two memcpy's wrapped in a loop for the second dimension.
This does require creating a temporary array, but it seems to be faster,
than the current hlfir.elemental inlining.
I started with modifying the new index computation in
hlfir.elemental inlining: the new arith.select approach does enable
some vectorization in LLVM, but on x86 it is using gathers/scatters
and does not give much speed-up.
I also experimented with LoopBoundSplitPass
and InductiveRangeCheckElimination for a simple (not chained) CSHIFT
case, but I could not adjust them to split the loop with a condition
on the value of the IV into two loops with disjoint iteration spaces.
I thought if I could do it, I would be able to keep the hlfir.elemental
inlining mostly untouched, and then adjust the hlfir.elemental inlining
heuristics for the facerec case.
Since I was not able to make these pass work for me, I added a special
case inlining for CSHIFT(ARRAY,SH,DIM=1) via hlfir.eval_in_mem.
If ARRAY is not statically known to have the contiguous leading
dimension, there is a dynamic check for contiguity, which allows
exposing it to LLVM and enabling the rewrite of the copy loops
into memcpys. This approach is stepping on the toes of LoopVersioning,
but it is helpful in facerec case.
I measured ~6% speed-up on grace, and ~4% on zen4.
This patch updates fir.coordinate_op to carry the field index as
attributes instead of relying on getting it from the fir.field_index
operations defining its operands.
The rational is that FIR currently has a few operations that require
DAGs to be preserved in order to be able to do code generation. This is
the case of fir.coordinate_op, which requires its fir.field operand
producer to be visible.
This makes IR transformation harder/brittle, so I want to update FIR to
get rid if this.
Codegen/printer/parser of fir.coordinate_of and many tests need to be
updated after this change.
One constructor was missing to propagate fast-math flags
from an operation to the builder. It is fixed now.
And the builder creation in one opt-bufferization case should take
the rewriter, I think.