16 Commits

Author SHA1 Message Date
David Green
7242896233
[Flang] Attempt to fix Nan handling in Minloc/Maxloc intrinsic simplification (#82313)
In certain cases, "extreme" values like NaN, Inf and 0xffffffff could lead
to generating different code via the inline-generated intrinsics vs the
versions in the runtimes (and other compilers like gfortran). There are
some examples I was using for testing in
https://godbolt.org/z/x4EfqEss5.

This changes the generation for the intrinsics to be more like the
runtimes, using a condition that is similar to:
  isFirst || (prev != prev && elem == elem) || elem < prev
The middle part is only used for floating point operations, and checks
whether the values are NaN. This should then hopefully make the logic
closer to: return the first element with the lowest value, with NaNs
ignored unless there are only NaNs. The initial limit value for floats is
also changed from the largest float to Inf, to make sure it is handled
correctly.

The integer reductions are also changed to use a similar scheme to make
sure they work with masked values. This means that the preamble after
the loop can be removed.
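
As a rough illustration of the condition above, here is a minimal Python model of a masked minloc. It is a sketch, not the generated FIR: the function name `minloc`, the 1-based result, and the result of 0 when no element is selected are assumptions chosen to mirror Fortran's MINLOC for a rank-1 array.

```python
import math

def minloc(values, mask=None):
    """Model of a rank-1 MINLOC: 1-based index of the first minimum, 0 if none."""
    result = 0        # assumed result when no element is selected
    prev = math.inf   # initial limit is Inf rather than the largest finite float
    is_first = True
    for i, elem in enumerate(values, start=1):
        if mask is not None and not mask[i - 1]:
            continue  # masked-out elements never update the result
        # isFirst || (prev != prev && elem == elem) || elem < prev
        if is_first or (prev != prev and elem == elem) or elem < prev:
            result, prev = i, elem
            is_first = False
    return result
```

With this model, a leading NaN is replaced by the first non-NaN element, an all-NaN input selects position 1, and ties keep the earliest position.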
2024-02-21 09:31:29 +00:00
David Green
815a846552
[Flang] Move genMinMaxlocReductionLoop to Transforms/Utils.cpp (#81380)
This is one option for attempting to move genMinMaxlocReductionLoop to a
better location. It moves it into Transforms and makes HLFIRTransforms
depend upon FIRTransforms.

It passes a build locally, both with and without -DBUILD_SHARED_LIBS,
and does OK on the windows CI.
2024-02-13 08:31:07 +00:00
David Green
9308d6688c
[Flang] Correct initial limit value in float min/maxloc reductions. (#81260)
I was looking through to check whether NaN was being handled correctly,
and couldn't work out why simple cases were behaving differently than
they should. It turns out the initial limit value was backwards for
minloc/maxloc reductions in general. This fixes that bug, which was
introduced in #79469.
2024-02-10 08:19:49 +00:00
David Green
ec8c8b6487 [Flang] Remove constexpr from isMax variable. NFC
The MSVC build doesn't allow the constexpr isMax variable to be used in a lambda
without a capture. The -Weverything build does not allow isMax to be used in a
lambda capture as it is a constexpr. I've removed the constexpr as it shouldn't
be necessary.
2024-01-29 12:33:32 +00:00
David Green
378f7ad3b7
[Flang] Maxloc elemental intrinsic lowering. (#79469)
This is an extension to #74828 to handle maxloc too, to keep the minloc
and maxloc symmetric.
2024-01-29 10:22:28 +00:00
David Green
223d3dabc8
[Flang] Minloc elemental intrinsic lowering (#74828)
Currently the lowering of a minloc intrinsic with a mask will look something
like:
  %e = hlfir.elemental %shape ({
    ...
  })
  %m = hlfir.minloc %array mask %e
  hlfir.assign %m to %result
  hlfir.destroy %m
The elemental will be expanded into a temporary+loop, the minloc into a
FortranAMinloc call (which hopefully gets simplified to a specialized call that
can be inlined at the call site), and the assign might get expanded to a
FortranAAssign. It would be better to generate the entire construct as a single
loop if we can - one that performs the minloc calculation with the mask
elemental computed inline.
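
The difference between the two shapes can be sketched in Python. This is only a model of the idea, not the HLFIR rewrite itself: `minloc_two_step`, `minloc_fused` and the `mask_at` callback (standing in for the body of the hlfir.elemental) are hypothetical names.

```python
def minloc_two_step(array, mask_at):
    """Materialize the mask into a temporary, then run a separate minloc loop."""
    mask = [mask_at(i) for i in range(len(array))]  # temporary from the elemental
    best = 0  # 1-based position, 0 when no element is selected
    for i, x in enumerate(array, start=1):
        if mask[i - 1] and (best == 0 or x < array[best - 1]):
            best = i
    return best

def minloc_fused(array, mask_at):
    """Single loop: the mask element is computed inline, with no temporary."""
    best = 0
    for i, x in enumerate(array, start=1):
        if mask_at(i - 1) and (best == 0 or x < array[best - 1]):
            best = i
    return best
```

Both functions return the same position for the same inputs; the fused form simply avoids the intermediate mask array and the second pass over it.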

This patch attempts to do that, adding a hlfir version of the expansion code
from SimplifyIntrinsics that turns a minloc+elemental into a single combined
loop nest. It attempts to reuse the methods in genMinlocReductionLoop for
constructing the loop with a modified loop body. The declaration for the
function is currently in Optimizer/Support/Utils.h, but there might be a better
place for it.

It is added as part of the OptimizedBufferizationPass, like the
similar count/any/all that have been added recently.
2024-01-25 12:17:12 +00:00
David Green
e22cb93890
[Flang] Any and All elemental lowering (#75776)
This is an extension of https://github.com/llvm/llvm-project/pull/75774,
with Any and All lowering added alongside Count.
2024-01-10 09:52:06 +00:00
David Green
9052512542 [Flang] Remove unnecessary static_assert
Certain compilers do not seem to like the static assert with a string, causing
an implicit conversion. It can be removed as it should not be reachable and the
mlir::failure should handle it correctly in case it is.
2024-01-09 17:45:13 +00:00
David Green
810c291574
[Flang] Generate inline reduction loops for elemental count intrinsics (#75774)
This adds a ReductionElementalConversion transform to
OptimizedBufferizationPass, taking hlfir::count(hlfir::elemental) and
generating the inline loop to perform the count of true elements. This
lets us generate a single loop instead of ending up as two plus a
temporary.

Any and All should be able to share the same code with a different
function/initial value.
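
A small Python sketch of why Count, Any and All can share one loop: only the combining function and the initial value differ. The names `reduce_elemental`, `count`, `any_of` and `all_of` are hypothetical, and `element_at` stands in for the inlined body of the hlfir.elemental.

```python
def reduce_elemental(extent, element_at, combine, init):
    """One loop over the elemental's index space, with no temporary array."""
    acc = init
    for i in range(extent):
        acc = combine(acc, element_at(i))
    return acc

def count(extent, element_at):
    # Count: add one for every true element, starting from 0.
    return reduce_elemental(extent, element_at,
                            lambda acc, v: acc + (1 if v else 0), 0)

def any_of(extent, element_at):
    # Any: logical OR, starting from False.
    return reduce_elemental(extent, element_at, lambda acc, v: acc or v, False)

def all_of(extent, element_at):
    # All: logical AND, starting from True.
    return reduce_elemental(extent, element_at, lambda acc, v: acc and v, True)
```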
2024-01-09 17:25:46 +00:00
Slava Zakharin
ab1db26272
[flang][hlfir] Fixed some finalization/deallocation issues. (#67047)
This set of commits resolves some of the issues with elemental calls producing
results that may require finalization, and also some memory leak issues due to
the missing deallocation of allocatable components of the temporary buffers
created by the bufferization pass.

- [flang][runtime] Expose Finalize API for derived types.

- [flang][hlfir] Add 'finalize' attribute for DestroyOp.

- [flang][hlfir] Postpone result finalization for elemental calls.

    The results of elemental calls generated inside hlfir.elemental must not
    be finalized/destructed before they are copied into the resulting
    array. The finalization must be done on the array as a whole
    (e.g. there might be different scalar and array finalization routines).
    The finalization work is left to the hlfir.destroy corresponding
    to this hlfir.elemental.

- [flang][hlfir] Tighten requirements on hlfir.end_associate operand.

    If component deallocation might be required for the operand of
    hlfir.end_associate, we have to be able to get the variable
    shape/params to create a descriptor for calling the runtime.
    This commit adds verification that we can do so.

- [flang][hlfir] Lower argument clean-ups using valid hlfir.end_associate.

    The operand must be a Fortran entity, when allocatable component
    deallocation may be required.

- [flang][hlfir] Properly clean-up temporary buffers in bufferization pass.

    This commit combines changes for proper finalization and component
    deallocation of the temporary buffers. The finalization part
    relates to hlfir.destroy operations with 'finalize' attribute.
    The component deallocation might be invoked for both hlfir.destroy
    and hlfir.end_associate, if the operand is of a derived type
    with allocatable component(s).

The changes are mostly in one function, so I decided not to split them.

- [flang][hlfir] Disable optimizations for hlfir.elemental requiring finalization.

    If hlfir.elemental is coupled with hlfir.destroy with 'finalize' attribute,
    the temporary array result of hlfir.elemental needs to be created
    for the purpose of finalization. We cannot do certain optimizations
    on such hlfir.elemental operations.

    I was not able to come up with a test for the OptimizedBufferization pass,
    but I put the check there as well.
2023-09-22 10:47:53 -07:00
Slava Zakharin
39b6c82c5d
[flang][hlfir] Better recognize non-overlapping array sections. (#65707)
This is a copy of the corresponding ArrayValueCopy analysis
for non-overlapping array slices. It is required to achieve
the same performance for Polyhedron/nf, though additional
changes are needed in the alias analysis for disambiguating
host associated accesses.
2023-09-08 09:01:37 -07:00
Slava Zakharin
09361b1974 [flang][hlfir] Allow expanding realloc assignments with scalar RHS.
F18 10.2.1.3 p. 3 states:
If the variable is an unallocated allocatable array, expr shall have the same rank.

So if LHS is an array and RHS is a scalar, then LHS must be allocated and
the assignment is performed according to F18 10.2.1.3 p. 5:
If expr is a scalar and the variable is an array,
the expr is treated as if it were an array of the same shape as the
variable with every element of the array equal to the scalar value of expr.

This resolves a performance regression in CPU2006/437.leslie3d caused
by extra Assign runtime calls for ALLOCATABLE local arrays.
Note that the extra calls do not add overhead themselves.
The problem is that the descriptor for ALLOCATABLE is passed
to Assign runtime function, and this messes up the points-to
analysis.

Example:
```
      ALLOCATABLE DUDX(:),DUDY(:),DUDZ(:)
...
      ALLOCATE( QS(IMAX-1),FSK(IMAX-1,0:KMAX,ND),
     >      QDIFFZ(IMAX-1), RMU(IMAX-1), EKCOEF(IMAX-1),
     >      DUDX(IMAX-1),DUDY(IMAX-1),DUDZ(IMAX-1),
...
      DUDZ=0D0
...
               DO I = I1, I2
                  DUDZ(I) =
     >                  DZI * ABD * ((U(I,J,KBD) - U(I,J,KCD)) +
     >                       8.0D0 * (U(I,J, KK) - U(I,J,KBD))) * R6I
```

When we are not lowering `DUDZ=0D0` to Assign call, the `base_addr` of
`DUDZ`'s descriptor is a result of `malloc`, and LLVM is able to figure out
that the accesses through this `base_addr` cannot overlap with accesses of,
for example, module (global) variable DZI. This enables CSE and LICM
for the loop, eventually, resulting in clean vectorization.

When `DUDZ`'s descriptor "escapes" to Assign runtime function,
there are no guarantees about where `base_addr` can point to.
I do not think this can be resolved by using any existing LLVM function/argument
attributes. Maybe we will be able to communicate the no-aliasing information
to LLVM using `Full Restrict Support` representation.

For the purpose of enabling HLFIR by default, I am just aligning the IR
with what we have with FIR lowering.

Reviewed By: tblah

Differential Revision: https://reviews.llvm.org/D159391
2023-09-04 14:55:09 -07:00
Slava Zakharin
8f1671c065 [flang][hlfir] Allow hlfir.assign expansion for array slices.
This case is important for `Polyhedron/channel2`:
```
    u(2:M-1,1:N,new) = u(2:M-1,1:N,old) &
        +2.d0*dt*f(2:M-1,1:N)*v(2:M-1,1:N,mid) &
        -2.d0*dt/(2.d0*dx)*g*dhdx(2:M-1,1:N)
```

The slices of `u` on the left and the right hand sides are completely
disjoint, but `old` and `new` are unknown runtime values. So the slices
may also be identical rather than disjoint. For the purpose of
hlfir.assign expansion we do not care whether they are identical or
disjoint. That kind of answer does not fit well into the alias
analysis definition, so I added a very simplified check to handle
this case. This drops icelake execution time from 120 to 70 seconds.
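
One way to picture an "identical or disjoint" test is the Python sketch below. It is an assumed model for illustration only, not the check added by this patch: the function name `identical_or_disjoint` and the representation of subscripts (a tuple of symbolic lower bound, upper bound and stride for a range, a plain string for a scalar index) are hypothetical.

```python
def identical_or_disjoint(lhs, rhs):
    """Return True if two sections of the same array can only be
    identical or disjoint, never partially overlapping."""
    if len(lhs) != len(rhs):
        return False
    for a, b in zip(lhs, rhs):
        a_is_range = isinstance(a, tuple)
        b_is_range = isinstance(b, tuple)
        if a_is_range != b_is_range:
            return False  # a range against a scalar index: give up
        if a_is_range and a != b:
            return False  # different ranges could partially overlap
        # Two scalar indices are either equal (same plane) or different
        # (disjoint planes), so unknown runtime values are acceptable here.
    return True
```

For the `channel2` statement, the first two subscripts are the same ranges on both sides and the third is a scalar (`new` or `old`), so this model answers True.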

Reviewed By: tblah

Differential Revision: https://reviews.llvm.org/D159323
2023-09-01 12:09:23 -07:00
Slava Zakharin
cdd5b1629a [flang][hlfir] Expand array hlfir.assign's.
Expand hlfir.assign with in-memory array RHS and LHS into
a loop nest with element-by-element assignments.
For small arrays this may result in further loop nest unrolling
enabling more value propagation and redundancy elimination.

Note the change in flang/test/HLFIR/opt-bufferization.fir:
the hlfir.assign inside hlfir.elemental gets expanded by the new
pattern.

Depends on D159151

Reviewed By: tblah

Differential Revision: https://reviews.llvm.org/D159246
2023-08-31 08:46:26 -07:00
Slava Zakharin
e60dc8ed7e [flang][hlfir] Expand hlfir.assign's with scalar RHS.
Expanding hlfir.assign's with scalar RHS late in MLIR optimization
pipeline allows LLVM to recognize most of them as simple memset loops.
This is especially important for small size LHS arrays, because
the assign loop nest may be completely unrolled enabling more value
propagation.

Reviewed By: tblah

Differential Revision: https://reviews.llvm.org/D159151
2023-08-31 08:46:26 -07:00
Tom Eccles
66abe64466 [flang][hlfir] add an optimized bufferization pass
This pass is intended to spot cases where we can do better than the
default bufferization and to rewrite those specific cases. Then the
default bufferization (bufferize-hlfir pass) can handle everything else.

The transformation added in this patch rewrites simple element-wise
updates to an array into a do-loop modifying the array in place instead of
creating and assigning an array temporary.
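
The effect of the rewrite can be sketched in Python. This is a model of the two code shapes rather than of the pass itself; the function names and the `f` callback (standing in for the element-wise expression) are hypothetical, and the in-place form is only assumed valid when each new element depends solely on the old element at the same position.

```python
def update_with_temporary(a, f):
    """Default shape: build an array temporary, then assign it back."""
    tmp = [f(x) for x in a]  # array temporary
    a[:] = tmp               # whole-array assignment

def update_in_place(a, f):
    """Optimized shape: a single do-loop modifying the array in place."""
    for i in range(len(a)):
        a[i] = f(a[i])
```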

See the RFC at
https://discourse.llvm.org/t/rfc-hlfir-optimized-bufferization-for-elemental-array-updates

This patch gets the improvement to exchange2 but not the improvement to cam4
described in the RFC. I think the cam4 improvement will require better alias
analysis. I aim to follow up to fix this in a later patch. With changes
since the RFC, the pass improves polyhedron channel2 by about 52%.

Depends on: D156805 D157718 D157626

Differential Revision: https://reviews.llvm.org/D157107
2023-08-18 09:51:22 +00:00