Move VPlan-based calculateRegisterUsage from LoopVectorize
to VPlanAnalysis.cpp. It is a VPlan-based analysis and this helps
to reduce the size of LoopVectorize.
PR: https://github.com/llvm/llvm-project/pull/135673
This patch implements the VPlan-based cost model for VPReduction,
VPExtendedReduction and VPMulAccumulateReduction.
With this patch, we can calculate the reduction cost with the VPlan-based
cost model, so remove the reduction costs in `precomputeCost()`.
Ref: Original instruction based implementation:
https://reviews.llvm.org/D93476
For more powerful folding with operands that are not necessarily
all-constant, use InstSimplifyFolder instead of TargetFolder in
tryToConstantFold, and rename the function tryToFoldLiveIns.
Update initial VPlan construction to connect the Plan's entry to the
scalar preheader. This moves a small part of the
skeleton creation out of ILV and will also enable replacing
VPInstruction::ResumePhi with regular VPPhi recipes.
Resume phis need 2 incoming values to start with, the second being the
bypass value from the scalar ph (and used to replicate the incoming
value for other bypass blocks). Adding the extra edge ensures the
incoming values for resume phis match the incoming blocks.
PR: https://github.com/llvm/llvm-project/pull/140132
The plan is to eventually add support for scalably vectorizing these for
non-power-of-2 factors, see https://github.com/llvm/llvm-project/pull/139893
Simultaneously, we need to add a test to make sure we don't generate
@llvm.vector.[de]interleave3 for AArch64 if we can't lower it (yet).
Building on top of https://github.com/llvm/llvm-project/pull/114305,
replace VPRegionBlocks with explicit CFG before executing.
This brings the final VPlan closer to the IR that is generated and
helps to simplify codegen.
It will also enable further simplifications of phi handling during
execution and transformations that do not have to preserve the
canonical IV required by loop regions. This for example could include
replacing the canonical IV with an EVL based phi while completely
removing the original canonical IV.
PR: https://github.com/llvm/llvm-project/pull/117506
As noted in the TODO, we don't need to cover up the poison elements
placed in the unused lanes for shifts, since it's not UB unlike div/rem.
New poison elements are only introduced in cases like
ShMask = <1,1,2,2> and C = <5,5,6,6> --> NewC = <poison,5,6,poison>
And the resulting shuffle won't use the poison lanes.
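The example above can be reproduced with a small standalone sketch. This is not the InstCombine code: `Lane`, `buildNewConstant` and `applyShuffle` are hypothetical names, `std::nullopt` stands in for a poison lane, and the sketch assumes that lanes sharing a mask index hold the same constant, as they do in the example.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// std::nullopt models a poison lane (illustrative only).
using Lane = std::optional<int>;

// Hypothetical helper: move constant lane C[I] to position ShMask[I] of the
// new constant. Positions never named by the mask stay poison.
std::vector<Lane> buildNewConstant(const std::vector<int> &ShMask,
                                   const std::vector<int> &C) {
  assert(ShMask.size() == C.size() && "mask and constant lengths must match");
  std::vector<Lane> NewC(C.size(), std::nullopt);
  for (std::size_t I = 0; I < ShMask.size(); ++I) {
    int Idx = ShMask[I];
    assert(Idx >= 0 && static_cast<std::size_t>(Idx) < NewC.size());
    // Assumption: lanes mapped to the same index carry the same constant.
    assert(!NewC[Idx] || *NewC[Idx] == C[I]);
    NewC[Idx] = C[I];
  }
  return NewC;
}

// Hypothetical helper: single-source shuffle, Result[I] = V[ShMask[I]].
std::vector<Lane> applyShuffle(const std::vector<Lane> &V,
                               const std::vector<int> &ShMask) {
  std::vector<Lane> Result;
  Result.reserve(ShMask.size());
  for (int Idx : ShMask) {
    assert(Idx >= 0 && static_cast<std::size_t>(Idx) < V.size());
    Result.push_back(V[Idx]);
  }
  return Result;
}
```

With ShMask = <1,1,2,2> and C = <5,5,6,6>, `buildNewConstant` leaves lanes 0 and 3 as poison, and `applyShuffle` with the same mask only reads lanes 1 and 2.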
This PR moves the register usage checking to after the plans are
created, so that any recipes that optimise register usage (such as
partial reductions) can be properly costed and not have their VF pruned
unnecessarily.
Depends on https://github.com/llvm/llvm-project/pull/137746
Don't use the order of incoming values of IR phis when creating
VPBlendRecipes. Instead, simply use the incoming operands and
blocks from the VPWidenPHIRecipe.
Note that this changes the order of the incoming operands/masks for some
blends.
PR: https://github.com/llvm/llvm-project/pull/139475
Split off from #118638, this adds VPInstruction::StepVector, which
generates integer step vectors (0,1,2,...,VF-1). This is a step towards
eventually modelling all the separate parts of
VPWidenIntOrFpInductionRecipe in VPlan.
This is then used by VPWidenIntOrFpInductionRecipe, where we materialize
it just before unrolling so the operands stay in a fixed position.
The need for a separate operand in VPWidenIntOrFpInductionRecipe, as
well as the need to update it in
optimizeVectorInductionWidthForTCAndVFUF, should be removed with #118638
when everything is expanded in convertToConcreteRecipes.
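As a scalar model of what a step vector contributes to a widened induction, here is a minimal sketch for a fixed VF. `makeStepVector` and `widenInduction` are hypothetical names used only for illustration; they are not VPlan recipes or LLVM APIs.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical model of a fixed-width step vector: lane I holds the value I.
std::vector<std::uint64_t> makeStepVector(unsigned VF) {
  assert(VF > 0 && "vectorization factor must be non-zero");
  std::vector<std::uint64_t> Steps(VF);
  for (unsigned I = 0; I < VF; ++I)
    Steps[I] = I;
  return Steps;
}

// Hypothetical model of a widened integer induction for one vector
// iteration: lane I holds Start + I * Step.
std::vector<std::uint64_t> widenInduction(std::uint64_t Start,
                                          std::uint64_t Step, unsigned VF) {
  std::vector<std::uint64_t> IV = makeStepVector(VF);
  for (std::uint64_t &Lane : IV)
    Lane = Start + Lane * Step;
  return IV;
}
```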
Add test showing that incorrect TBAA metadata is added to the widened
loads and stores when narrowing interleave groups.
The widened loads/stores currently have the TBAA metadata of the first
load/store, even though the wide accesses also access data with types of
the second load/store.
This patch adds a test for the fmuladd reduction to show the test
change/failure caused by the cost model change.
Note that without the fp128 load and trunc, there is no failure.
Pre-commit test for #113903.
This reverts commit 8dd160f4767f971572eac065c8650d9202ff5bf9.
The recommit contains an adjustment to planContainsAdditionalSimplifications,
which considers changes to the original predicate for compares.
Original commit message:
Add simplification to fold negation into a compare, if the negation is
the only user of the compare. This removes a number of redundant
negations.
Alive2 Proofs for FPCMP test changes: https://alive2.llvm.org/ce/z/WGDz9U
PR: https://github.com/llvm/llvm-project/pull/129430
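The fold in the original commit message can be illustrated on signed integer compares. This is a standalone sketch, not the VPlan simplification itself: `Pred`, `inversePredicate`, `compare`, `negatedCompare` and `foldedCompare` are hypothetical names, the predicate set is deliberately reduced, and the "only user" condition is not modelled. Floating-point predicates would additionally have to swap ordered and unordered forms.

```cpp
#include <cassert>

// Hypothetical, reduced set of signed integer predicates.
enum class Pred { EQ, NE, SLT, SGE, SGT, SLE };

// Predicate that yields the opposite result for every pair of operands.
Pred inversePredicate(Pred P) {
  switch (P) {
  case Pred::EQ:  return Pred::NE;
  case Pred::NE:  return Pred::EQ;
  case Pred::SLT: return Pred::SGE;
  case Pred::SGE: return Pred::SLT;
  case Pred::SGT: return Pred::SLE;
  case Pred::SLE: return Pred::SGT;
  }
  assert(false && "unhandled predicate");
  return P;
}

bool compare(Pred P, int A, int B) {
  switch (P) {
  case Pred::EQ:  return A == B;
  case Pred::NE:  return A != B;
  case Pred::SLT: return A < B;
  case Pred::SGE: return A >= B;
  case Pred::SGT: return A > B;
  case Pred::SLE: return A <= B;
  }
  assert(false && "unhandled predicate");
  return false;
}

// Before the fold: a compare followed by a separate negation.
bool negatedCompare(Pred P, int A, int B) { return !compare(P, A, B); }

// After the fold: a single compare with the inverse predicate.
bool foldedCompare(Pred P, int A, int B) {
  return compare(inversePredicate(P), A, B);
}
```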
ExtractFromEnd only has 2 uses, extracting the last and penultimate
elements. Replace it with 2 separate opcodes, removing the need to
materialize and handle a constant argument.
PR: https://github.com/llvm/llvm-project/pull/137030
Remove legacy ILV sinkScalarOperands, which is superseded by the
sinkScalarOperands VPlan transforms.
There are a few cases that aren't handled by VPlan's sinkScalarOperands,
because the recipes don't support replicating. Those are pointer
inductions and blends.
We could probably improve this further, by allowing replication for more
recipes, but I don't think the extra complexity is warranted.
Depends on https://github.com/llvm/llvm-project/pull/136021.
PR: https://github.com/llvm/llvm-project/pull/136023
willGenerateVectors switches on opcodes of a recipe, but Histogram is
missing in the switch statement, which could cause a crash in some
cases. The crash was initially observed when developing another patch.
Some vector math routines, e.g. ArmPL, specify a particular
calling convention on the routines which can help improve
performance by specifying what registers have to be preserved
across the call.
Extend sinking logic to duplicate scalar steps recipe if it enables
sinking, that is if all users in a destination block require all lanes.
This should be the last step before removing legacy sinkScalarOperands.
PR: https://github.com/llvm/llvm-project/pull/136021
For loops without loads/stores, where the smallest/widest types are
calculated from the reduction, the smallest type returned is always -1U
and it actually returns the smallest type as the widest type. This PR
fixes the calculation.
This follows from
https://github.com/llvm/llvm-project/pull/132190#discussion_r2044232607
Any VPlan we generate that contains a replicator region will result in
replicated blocks in the output, causing a large code size increase.
Reject such VPlans when optimizing for size, as the code size impact is
usually worse than having a scalar epilogue, which we already forbid
with optsize.
This change requires a lot of test changes. For tests of optsize
specifically I've updated the test with the new output, otherwise the
tests have been adjusted to not rely on optsize.
Fixes #66652
Support auto-vectorization of fminimum_num and fmaximum_num.
On ARM64 with SVE, scalable vectors are not supported yet.
This PR accounts for scaled reductions in `calculateRegisterUsage` to
reflect the fact that the number of lanes in their output is smaller
than the VF.
Depends on https://github.com/llvm/llvm-project/pull/126437
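The effect on register pressure can be shown with some illustrative arithmetic. This sketch is not how `calculateRegisterUsage` or the target hooks compute usage: `outputLanes`, `registersNeeded`, the scale factor parameter and the 128-bit register width used in the usage note are all assumptions made for the example.

```cpp
#include <cassert>

// Hypothetical: number of lanes produced by a reduction whose output is
// narrower than the VF by ScaleFactor.
unsigned outputLanes(unsigned VF, unsigned ScaleFactor) {
  assert(ScaleFactor > 0 && VF % ScaleFactor == 0 &&
         "VF must be a multiple of the scale factor");
  return VF / ScaleFactor;
}

// Hypothetical: vector registers needed to hold Lanes elements of
// ElementBits each, rounded up to whole registers of RegisterBits.
unsigned registersNeeded(unsigned Lanes, unsigned ElementBits,
                         unsigned RegisterBits) {
  assert(RegisterBits > 0 && "register width must be non-zero");
  return (Lanes * ElementBits + RegisterBits - 1) / RegisterBits;
}
```

Under these assumptions, a 32-bit accumulator at VF = 16 would occupy `registersNeeded(16, 32, 128)` = 4 registers if all 16 lanes were counted, but only `registersNeeded(outputLanes(16, 4), 32, 128)` = 1 register with a scale factor of 4.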
Add a version of calculateRegisterUsage that estimates register
usage for a VPlan. This mostly just ports the existing code, with some
updates to figure out what recipes will generate vectors vs scalars.
There are a number of changes in the computed register usages, but they
should be more accurate w.r.t. the generated vector code.
There are the following changes:
* Scalar usage increases in most cases by 1, as we always create a
scalar canonical IV, which is alive across the loop and is not
considered by the legacy implementation
* Output is ordered by insertion; scalar registers are now added first
due to the canonical IV phi.
* Using the VPlan, we now also more precisely know if an induction will
be vectorized or scalarized.
Depends on https://github.com/llvm/llvm-project/pull/126415
PR: https://github.com/llvm/llvm-project/pull/126437
Recommit. This work was done by #132246 but failed buildbots because the
test it introduced needed updates.
Inefficient SVE codegen occurs on at least two in-order cores:
Cortex-A510 and Cortex-A520. For example, a simple vector add
```
void foo(float *a, float *b, float *dst, unsigned n) {
  for (unsigned i = 0; i < n; ++i)
    dst[i] = a[i] + b[i];
}
```
is vectorized into the following interleaved sequence of
instructions:
```
add x12, x1, x10
ld1b { z0.b }, p0/z, [x1, x10]
add x13, x2, x10
ld1b { z1.b }, p0/z, [x2, x10]
ldr z2, [x12, #1, mul vl]
ldr z3, [x13, #1, mul vl]
dech x11
add x12, x0, x10
fadd z0.s, z1.s, z0.s
fadd z1.s, z3.s, z2.s
st1b { z0.b }, p0, [x0, x10]
addvl x10, x10, #2
str z1, [x12, #1, mul vl]
```
By adjusting the target features to prefer fixed over scalable if the
cost is equal, we get the following vectorized loop:
```
ldp q0, q3, [x11, #-16]
subs x13, x13, #8
ldp q1, q2, [x10, #-16]
add x10, x10, #32
add x11, x11, #32
fadd v0.4s, v1.4s, v0.4s
fadd v1.4s, v2.4s, v3.4s
stp q0, q1, [x12, #-16]
add x12, x12, #32
```
which is more efficient.
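The tie-break described above can be sketched as follows. This is a simplified model rather than the vectorizer's profitability code: `Candidate`, `isBetter` and `pickBest` are hypothetical names, costs are plain integers, and the sketch assumes that without the preference a tie goes to the scalable candidate.

```cpp
#include <cassert>
#include <vector>

// Hypothetical candidate vectorization factor with an abstract cost.
struct Candidate {
  unsigned Cost;
  bool IsScalable;
};

// Returns true if A should replace B as the chosen candidate.
bool isBetter(const Candidate &A, const Candidate &B, bool PreferFixedOnTie) {
  if (A.Cost != B.Cost)
    return A.Cost < B.Cost; // A strictly cheaper candidate always wins.
  if (A.IsScalable == B.IsScalable)
    return false; // Same kind and same cost: keep the current choice.
  // Equal cost, different kinds. Assumption: scalable wins ties unless the
  // target asks for fixed-width vectors.
  return PreferFixedOnTie ? !A.IsScalable : A.IsScalable;
}

Candidate pickBest(const std::vector<Candidate> &Candidates,
                   bool PreferFixedOnTie) {
  assert(!Candidates.empty() && "need at least one candidate");
  Candidate Best = Candidates.front();
  for (const Candidate &C : Candidates)
    if (isBetter(C, Best, PreferFixedOnTie))
      Best = C;
  return Best;
}
```

The preference only matters when costs are equal; a strictly cheaper scalable candidate is still selected.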
Add an initial CFG simplification transform, which removes the dead
edges for blocks terminated with BranchOnCond true.
At the moment, this removes the edge between middle block and scalar
preheader when folding the tail.
PR: https://github.com/llvm/llvm-project/pull/106748
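A toy model of the transform, assuming a CFG stored as indexed blocks. `Block` and `removeDeadEdges` are hypothetical and unrelated to the VPlan classes; the sketch only shows the edge removal for a two-way branch whose condition is the constant true.

```cpp
#include <algorithm>
#include <cassert>
#include <optional>
#include <vector>

// Hypothetical toy block. For a two-way branch, Successors[0] is taken when
// the condition is true and Successors[1] when it is false. ConstCond is set
// only when the condition is a known constant.
struct Block {
  std::vector<int> Successors;
  std::vector<int> Predecessors;
  std::optional<bool> ConstCond;
};

// Removes the never-taken edge of every two-way branch on constant true.
// Returns the number of edges removed.
unsigned removeDeadEdges(std::vector<Block> &CFG) {
  unsigned Removed = 0;
  for (int Id = 0; Id < static_cast<int>(CFG.size()); ++Id) {
    Block &B = CFG[Id];
    if (B.Successors.size() != 2 || !B.ConstCond.has_value() || !*B.ConstCond)
      continue;
    int Dead = B.Successors[1];
    B.Successors.pop_back();
    // Drop exactly one predecessor entry for the removed edge.
    std::vector<int> &Preds = CFG[Dead].Predecessors;
    auto It = std::find(Preds.begin(), Preds.end(), Id);
    assert(It != Preds.end() && "successor does not list us as predecessor");
    Preds.erase(It);
    B.ConstCond.reset(); // The branch is now unconditional.
    ++Removed;
  }
  return Removed;
}
```

In the tail-folding case from the message, block 0 would play the middle block, with the exit block as its true successor and the scalar preheader as its false successor; after the transform the scalar preheader has no edge from the middle block.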
Previously only fixed vector splats were handled. This adds support for
scalable vectors too by allowing ConstantExpr splats.
We need to add the extra V->getType()->isVectorTy() check because a
ConstantExpr might be a scalar to vector bitcast.
By allowing ConstantExprs this also allows fixed vector ConstantExprs to
be folded, which causes the diffs in
llvm/test/Analysis/ValueTracking/known-bits-from-operator-constexpr.ll
and llvm/test/Transforms/InstSimplify/ConstProp/cast-vector.ll. I can
remove them from this PR if reviewers would prefer.
Fixes #132922
Vectorization of fminimumnum and fmaximumnum is not supported yet. Let's
add the testcase for it now, and we will update the testcase when we
support it.