Unused loop invariant loads were not sunk from the preheader to the exit
block, increasing live range.
This commit moves the sinkUnusedInvariant logic from indvarsimplify to
LICM also adds functionality to sink unused load that's not
clobbered by the loop body.
This follows similar reasoning as 45ce88758d24
(https://github.com/llvm/llvm-project/pull/159556):
LV does not preserve LCSSA, it constructs it just before processing a
loop to vectorize. Runtime check expressions are invariant to that loop,
so expanding them should not break LCSSA form for the loop we are about
to vectorize.
LV creates SCEV and memory runtime checks early on and then disconnects
the blocks temporarily. The patch fixes a mis-compile, where previously
LCSSA construction during SCEV expand may replace uses in currently
unreachable SCEV/memory check blocks.
Fixes https://github.com/llvm/llvm-project/issues/162512
PR: https://github.com/llvm/llvm-project/pull/165505
A reduction (including partial reductions) with a multiply of a constant
value can be bundled by first converting it from `reduce.add(mul(ext,
const))` to `reduce.add(mul(ext, ext(const)))` as long as it is safe to
extend the constant.
This PR adds such bundling by first truncating the constant to the
source type of the other extend, then extending it to the destination
type of the extend. The first truncate is necessary so that the types of
each extend's operand are then the same, and the call to
canConstantBeExtended proves that the extend following a truncate is
safe to do. The truncate is removed by optimisations.
This is a stacked PR, 1a and 1b can be merged in any order:
1a. https://github.com/llvm/llvm-project/pull/147302
1b. https://github.com/llvm/llvm-project/pull/163175
2. -> https://github.com/llvm/llvm-project/pull/162503
Factor out common code to determine legality of hoisting and sinking.
The patch has the side-effect of fixing an underlying bug, where a
load/store pair is reordered.
The map returned by collectStridedAccess is used to replace strides with
their versioned values. This does not work for Undef/Poison, which don't
have use-lists. Don't try to version them, as versioning won't be useful in
practice.
Fixes https://github.com/llvm/llvm-project/issues/162922.
InstSimplifyFolder can fold binary intrinsics, so take the opportunity
to unify code with getOpcodeOrIntrinsicID, and handle the case. The
additional handling of WidenGEP is non-functional, as the GEP is
simplified before it is widened, as the included test shows.
When an VF is specified via a loop hint, it will be clamped to a safe
VF or ignored if it is found to be unsafe. This is not the case for
user-specified interleave counts, which can lead to loops such as
the following with a memory dependence being vectorised with
interleaving:
```
#pragma clang loop interleave_count(4)
for (int i = 4; i < LEN; i++)
b[i] = b[i - 4] + a[i];
```
According to [1], loop hints are ignored if they are not safe to apply.
This patch adds a check to prevent vectorisation with interleaving if
isSafeForAnyVectorWidth() returns false. This is already checked in
selectInterleaveCount().
[1]
https://llvm.org/docs/LangRef.html#llvm-loop-vectorize-and-llvm-loop-interleave
8d29d09309 exposed a crash due to incorrectly trying to handle masked
interleave recipes. For now, the current code does not support masked
interleave recipes. Bail out for them.
Move narrowInterleaveGroups to to general VPlan optimization stage.
To do so, narrowInterleaveGroups now has to find a suitable VF where all
interleave groups are consecutive and saturate the full vector width.
If such a VF is found, the original VPlan is split into 2:
a) a new clone which contains all VFs of Plan, except VFToOptimize, and
b) the original Plan with VFToOptimize as single VF.
The original Plan is then optimized. If a new copy for the other VFs has
been created, it is returned and the caller has to add it to the list of
candidate plans.
Together with https://github.com/llvm/llvm-project/pull/149702, this
allows to take the narrowed interleave groups into account when
computing costs to choose the best VF and interleave count.
One example where we currently miss interleaving/unrolling when
narrowing interleave groups is https://godbolt.org/z/Yz77zbacz
PR: https://github.com/llvm/llvm-project/pull/149706
Split off from PR #163525, this standalone patch replaces
use of undef as incoming PHI values with zero, in order
to reduce the likelihood of contributors hitting the
`undef deprecator` warning in github.
natively supported on Neon and SVE
PR #158641 refined and refactored the cost model for partial reductions.
While doing so, it missed out on certain constraints. Specifically,
cases like i32 -> i64 partial reduce are not natively supported. This
patch adds back the condition/constraint that was present before PR
#158641
Recipes in replicate regions implicitly depend on the region's
predicate. Limit CSE to recipes in the same block, when either recipe is
in a replicate region.
This allows handling VPPredInstPHIRecipe during CSE. If we perform CSE
on recipes inside a replicate region, we may end up with 2
VPPredInstPHIRecipes sharing the same operand. This is incompatible with
current VPPredInstPHIRecipe codegen, which re-sets the current value of
its operand in VPTransformState. This can cause crashes in the added
test cases.
Note that this patch only modifies ::isEqual to check for replicating
regions and not getHash, as CSE across replicating regions should be
uncommon.
Fixes https://github.com/llvm/llvm-project/issues/157314.
Fixes https://github.com/llvm/llvm-project/issues/161974.
PR: https://github.com/llvm/llvm-project/pull/162110
createWidenCast doesn't set the flag type, so when we simplify trunc
(zext nneg x) -> zext x we would hit an assertion in CSE that the flag
types don't match with other VPWidenCastRecipes that weren't simplified.
This fixes it the same way trunc flags are handled too.
As an aside I think it should be correct to preserve the nneg flag in
this case since the input operand is still non-negative after the
transform. But that's left to another PR.
Fixes https://github.com/llvm/llvm-project/issues/164171
When the legacy cost model scalarizes loads that are used as addresses
for other loads and stores, it looks to phi nodes, if they are direct
address operands of loads/stores. Match this behavior in
isUsedByLoadStoreAddress, to fix a divergence between legacy and
VPlan-based cost model.
The `masked.load`, `masked.store`, `masked.gather` and `masked.scatter`
intrinsics currently accept a separate alignment immarg. Replace this
with an `align` attribute on the pointer / vector of pointers argument.
This is the standard representation for alignment information on
intrinsics, and is already used by all other memory intrinsics. This
means the signatures now match llvm.expandload, llvm.vp.load, etc.
(Things like llvm.memcpy used to have a separate alignment argument as
well, but were already migrated a long time ago.)
It's worth noting that the masked.gather and masked.scatter intrinsics
previously accepted a zero alignment to indicate the ABI type alignment
of the element type. This special case is gone now: If the align
attribute is omitted, the implied alignment is 1, as usual. If ABI
alignment is desired, it needs to be explicitly emitted (which the
IRBuilder API already requires anyway).
Split off from PR #163525, this standalone patch replaces `ret * undef`
returns with `ret void` in order to reduce the likelihood of
contributors hitting the `undef deprecator` warning in github.
When narrowing stores of a single-scalar, we currently use
ExtractLastElement, which extracts the last element across all parts.
This is not correct if the store's address is not uniform across all
parts. If it is only uniform-per-part, the last lane per part must be
extracted. Add a new ExtractLastLanePerPart opcode to handle this
correctly. Most transforms apply to both ExtractLastElement and
ExtractLastLanePerPart, with the only difference being their treatment
during unrolling.
Fixes https://github.com/llvm/llvm-project/issues/162498.
PR: https://github.com/llvm/llvm-project/pull/163056
We have seen performance regression for several instances of the Numba
benchmark, with some ranging around 70%, on Neoverse-v2 post #158641.
The mentioned case is short reproducer of the same. See
https://godbolt.org/z/j9Mj5WM7c for the IR differences.. A future patch
will address this.
Add test coverage for min/max reductions with various combinations of
users (in and outside loops, used by stores) and predicated variants.
This adds missing test coverage for min/max reductions.
Currently we cannot vectorize loops with latch blocks terminated by a
switch. In the future this could be handled by materializing appropriate
compares.
Fixes https://github.com/llvm/llvm-project/issues/156894.
VPWidenCastRecipes with Trunc opcodes where missing the correct OpType
for IR flags. Update createWidenCast to set the correct flags for
truncs, and use it consistenly.
Fixes https://github.com/llvm/llvm-project/issues/162374.
Add extra test coverage for narrowing stores to single scalars, with the
store address being uniform-per-part, not uniform-across-all-parts.
Test for https://github.com/llvm/llvm-project/issues/162498.
Replication is currently not supported for scalable VFs. Make sure
VPReplicateRecipe::computeCost returns an invalid cost early, for
scalable VFs if the recipe is not a single-scalar.
Note that this moves the existing invalid-costs.ll out of the AArch64
subdirectory, as it does not use a target triple.
Fixes https://github.com/llvm/llvm-project/issues/160792.
Consider the following transform:
```
C = binop float A, nnan OOp
D = select ninf, i1 cond, float C, float A
->
E = select ninf, i1 cond, float OOp, float Identity
F = binop float A, E
```
We cannot propagate ninf from the original select, because OOp may be
inf, and the flag only guarantees that FalseVal (op OOp) is never
infinity.
Examples: -inf + +inf = NaN, -inf - -inf = NaN, 0 * inf = NaN
Specifically, if the original select has both ninf and nnan, we can
safely propagate the flag.
Alive2:
+ fadd: https://alive2.llvm.org/ce/z/TWfktv
+ fsub: https://alive2.llvm.org/ce/z/RAsjJb
+ fmul: https://alive2.llvm.org/ce/z/8eg4ND
Closes https://github.com/llvm/llvm-project/issues/161634.
Consistently scalarize loads used as part of address computations across
all uses in the loop. This aligns the VPlan and legacy cost model and
fixes a divergence crash. It doesn't matter if the load and address
users are in different blocks, as long as they are in the same loop, the
scalar value can be used. This removes a number of insert/extracts.
The vector intrinsics in question have no undefined behavior, and have
no other effect besides returning the result: they should hence be
marked speculatable.
VPBlendRecipes are introduced as part of if-conversion, potentially adding
a def-use chain from a load used in a compare to another load/store. In
the scalar IR, there is no connection via def-use chains, so the legacy
cost model won't consider the load used by memory operation.
Skipping blends brings the VPlan-based cost-computation in line with the
legacy cost model after https://github.com/llvm/llvm-project/pull/162157.