816 Commits

Author SHA1 Message Date
Sam Tebbs
2876dbcd66
[AArch64] Don't allow mixed partial reductions without i8mm (#137602)
Partial reductions with mixed extends should only be allowed if i8mm is
present.
2025-05-01 16:06:37 +01:00
Samuel Tebbs
fa769655e7 [LV] NFC: Make VPPartialReductionRecipe a VPReductionRecipe 2025-04-30 19:44:40 +01:00
Elvis Wang
1fc0a1401a
[LV][AArch64] Add test for fp128 fmuladd reduction.(NFC) (#137576)
This patch add the test for the fmuladd reduction to show the test
change/fail for the cost model change.

Note that without the fp128 load and trunc, there is no failure.

Pre-commit test for #113903.
2025-04-29 09:18:07 +08:00
Florian Hahn
043b04acff
Reapply "[VPlan] Fold NOT into predicate of wide compares." (#130347)
This reverts commit 8dd160f4767f971572eac065c8650d9202ff5bf9.

The recommit contains an adjustment to planContainsAdditionalSimplifications,
which considers changes to the original predicate for compares.

Original commit message:

Add simplification to fold negation into a compare, if the negation is
the only user of the compare. This removes a number of redundant
negations.

Alive2 Proofs for FPCMP test changes:  https://alive2.llvm.org/ce/z/WGDz9U

PR: https://github.com/llvm/llvm-project/pull/129430
2025-04-28 20:01:37 +01:00
Florian Hahn
df21288247
[VPlan] Replace ExtractFromEnd with Extract(Last|Penultimate)Element (NFC). (#137030)
ExtractFromEnd only has 2 uses, extracting the last and penultimate
elements. Replace it with 2 separate opcodes, removing the need to
materialize and handle a constant argument.

PR: https://github.com/llvm/llvm-project/pull/137030
2025-04-25 16:27:29 +01:00
Ramkumar Ramachandra
4955c3c476
[LV] Strip bad FIXME in test (#137142)
See https://github.com/llvm/llvm-project/pull/130118/files#r1983745712
for context.
2025-04-25 09:47:47 +01:00
Florian Hahn
15bb1db4a9
[VPlan] Remove ILV::sinkScalarOperands. (#136023)
Remove legacy ILV sinkScalarOperands, which is superseded by the
sinkScalarOperands VPlan transforms.

There are a few cases that aren't handled by VPlan's sinkScalarOperands,
because the recipes doesn't support replicating. Those are pointer
inductions and blends.

We could probably improve this further, by allowing replication for more
recipes, but I don't think the extra complexity is warranted.

Depends on https://github.com/llvm/llvm-project/pull/136021.

PR: https://github.com/llvm/llvm-project/pull/136023
2025-04-24 08:37:49 +01:00
Ramkumar Ramachandra
bdf21ca8ac
[LV] Fix missing entry in willGenerateVectors (#136712)
willGenerateVectors switches on opcodes of a recipe, but Histogram is
missing in the switch statement, which could cause a crash in some
cases. The crash was initially observed when developing another patch.
2025-04-23 19:06:38 +01:00
Nicholas Guy
1ce709cb84
[LV] Fix crash when building partial reductions using types that aren't known scale factors (#136680) 2025-04-23 13:19:18 +01:00
David Sherwood
ef72b93626
[LV] Use requested calling convention for vector math routines (#136122)
Some vector math routines, e.g. ArmPL, specify a particular
calling convention on the routines which can help improve
performance by specifying what registers have to be preserved
across the call.
2025-04-22 09:33:52 +01:00
Florian Hahn
5739a22fbb
[VPlan] Also duplicated scalar-steps when it enables sinking scalars. (#136021)
Extend sinking logic to duplicate scalar steps recipe if it enables
sinking, that is if all users in a destination block require all lanes.

This should be the last step before removing legacy sinkScalarOperands.

PR: https://github.com/llvm/llvm-project/pull/136021
2025-04-21 18:36:43 +01:00
David Sherwood
927a0cb8d6
[LV][NFC] Regenerate AArch64/veclib-* test CHECK lines (#136138) 2025-04-17 15:07:33 +01:00
Sander de Smalen
f9c01b59e3
[LV] Fix '-1U' bits for smallest type in getSmallestAndWidestTypes (#135783)
For loops without loads/stores, where the smallest/widest types are
calculated from the reduction, the smallest type returned is always -1U
and it actually returns the smallest type as the widest type. This PR
fixes the calculation.

This follows from
https://github.com/llvm/llvm-project/pull/132190#discussion_r2044232607
2025-04-17 13:26:15 +01:00
John Brawn
eafbb879f6
[LoopVectorize] Don't replicate blocks with optsize (#129265)
Any VPlan we generate that contains a replicator region will result in
replicated blocks in the output, causing a large code size increase.
Reject such VPlans when optimizing for size, as the code size impact is
usually worse than having a scalar epilogue, which we already forbid
with optsize.

This change requires a lot of test changes. For tests of optsize
specifically I've updated the test with the new output, otherwise the
tests have been adjusted to not rely on optsize.

Fixes #66652
2025-04-17 11:50:49 +01:00
YunQiang Su
fe9e2090be
Vectorize: Support fminimumnum and fmaximumnum (#131781)
Support auto-vectorize for fminimum_num and fmaximum_num. 
For ARM64 with SVE, scalable vector cannot support yet.

---------

Co-authored-by: Your Name <you@example.com>
2025-04-15 08:08:45 +08:00
Sam Tebbs
b658a2e74a
[LV] Reduce register usage for scaled reductions (#133090)
This PR accounts for scaled reductions in `calculateRegisterUsage` to
reflect the fact that the number of lanes in their output is smaller
than the VF.

Depends on https://github.com/llvm/llvm-project/pull/126437
2025-04-11 14:31:08 +01:00
Florian Hahn
6f92339d9e
[LV] Compute register usage for interleaving on VPlan. (#126437)
Add a version of calculateRegisterUsage that works estimates register
usage for a VPlan. This mostly just ports the existing code, with some
updates to figure out what recipes will generate vectors vs scalars.

There are number of changes in the computed register usages, but they
should be more accurate w.r.t. to the generated vector code.

There are the following changes:

 * Scalar usage increases in most cases by 1, as we always create a
   scalar canonical IV, which is alive across the loop and is not
   considered by the legacy implementation

 * Output is ordered by insertion, now scalar registers are added first
   due the canonical IV phi.

 * Using the VPlan, we now also more precisely know if an induction will
   be vectorized or scalarized.

Depends on https://github.com/llvm/llvm-project/pull/126415

PR: https://github.com/llvm/llvm-project/pull/126437
2025-04-08 20:52:50 +01:00
Nashe Mncube
67dd2019ac
Recommit [AArch64][SVE]Use FeatureUseFixedOverScalableIfEqualCost for A510/A520 (#134606)
Recommit. This work was done by #132246 but failed buildbots due to the
test introduced needing updates

Inefficient SVE codegen occurs on at least two in-order cores, those
being Cortex-A510 and Cortex-A520. For example a simple vector add

```
void foo(float a, float b, float dst, unsigned n) {
    for (unsigned i = 0; i < n; ++i)
        dst[i] = a[i] + b[i];
}
```

Vectorizes the inner loop into the following interleaved sequence of
instructions.

```
        add     x12, x1, x10
        ld1b    { z0.b }, p0/z, [x1, x10]
        add     x13, x2, x10
        ld1b    { z1.b }, p0/z, [x2, x10]
        ldr     z2, [x12, #1, mul vl]
        ldr     z3, [x13, #1, mul vl]
        dech    x11
        add     x12, x0, x10
        fadd    z0.s, z1.s, z0.s
        fadd    z1.s, z3.s, z2.s
        st1b    { z0.b }, p0, [x0, x10]
        addvl   x10, x10, #2
        str     z1, [x12, #1, mul vl]
```

By adjusting the target features to prefer fixed over scalable if the
cost is equal we get the following vectorized loop.

```
         ldp q0, q3, [x11, #-16]
         subs    x13, x13, #8
         ldp q1, q2, [x10, #-16]
         add x10, x10, #32
         add x11, x11, #32
         fadd    v0.4s, v1.4s, v0.4s
         fadd    v1.4s, v2.4s, v3.4s
         stp q0, q1, [x12, #-16]
         add x12, x12, #32
```

Which is more efficient.
2025-04-07 14:09:43 +01:00
Florian Hahn
464286ba63
[VPlan] Don't narrow interleave groups if there are vector pointers.
Do not narrow interleave groups if there are VectorPointer recipes and
the plan was unrolled. The recipe implicitly uses VF from VPTransformState.
2025-04-06 22:14:24 +01:00
Florian Hahn
12a377ed71
[LV] Add test for mis-compile when narrowing interleave groups.
Add test case showing mis-compile due to unrolling vector-pointer
recipes after 6b98134.
2025-04-06 21:17:44 +01:00
Florian Hahn
5fbd0658a0
[VPlan] Add initial CFG simplification, removing BranchOnCond true. (#106748)
Add an initial CFG simplification transform, which removes the dead
edges for blocks terminated with BranchOnCond true.

At the moment, this removes the edge between middle block and scalar
preheader when folding the tail.

PR: https://github.com/llvm/llvm-project/pull/106748
2025-04-04 15:44:26 +01:00
Nashe Mncube
846000c005
Revert "[AArch64][SVE] Use FeatureUseFixedOverScalableIfEqualCost for A510 and A520" (#134382)
Reverts llvm/llvm-project#132246
2025-04-04 14:36:38 +01:00
Nashe Mncube
d2bcc11067
[AArch64][SVE] Use FeatureUseFixedOverScalableIfEqualCost for A510 and A520 (#132246)
Inefficient SVE codegen occurs on at least two in-order cores,
those being Cortex-A510 and Cortex-A520. For example a simple vector
add

```
void foo(float a, float b, float dst, unsigned n) {
    for (unsigned i = 0; i < n; ++i)
        dst[i] = a[i] + b[i];
}
```

Vectorizes the inner loop into the following interleaved sequence
of instructions.

```
        add     x12, x1, x10
        ld1b    { z0.b }, p0/z, [x1, x10]
        add     x13, x2, x10
        ld1b    { z1.b }, p0/z, [x2, x10]
        ldr     z2, [x12, #1, mul vl]
        ldr     z3, [x13, #1, mul vl]
        dech    x11
        add     x12, x0, x10
        fadd    z0.s, z1.s, z0.s
        fadd    z1.s, z3.s, z2.s
        st1b    { z0.b }, p0, [x0, x10]
        addvl   x10, x10, #2
        str     z1, [x12, #1, mul vl]
```

By adjusting the target features to prefer fixed over scalable if the
cost is equal we get the following vectorized loop.

```
         ldp q0, q3, [x11, #-16]
         subs    x13, x13, #8
         ldp q1, q2, [x10, #-16]
         add x10, x10, #32
         add x11, x11, #32
         fadd    v0.4s, v1.4s, v0.4s
         fadd    v1.4s, v2.4s, v3.4s
         stp q0, q1, [x12, #-16]
         add x12, x12, #32
```

Which is more efficient.
2025-04-04 14:12:44 +01:00
Florian Hahn
2bdc1a1337
[LV] Use frozen start value for FindLastIV if needed. (#132691)
FindLastIV introduces multiple uses of the start value, where in the
original source there was only a single use, when the epilogue is
vectorized.

Each use of undef may produce a different result, so introducing
multiple uses can produce incorrect results when the input is
undef/poison.

If the start value may be undef or poison, freeze it and use the frozen
value, which will be the same at all uses.

See the following scenarios in Alive2:
* Both main and epilogue vector loops execute, go to exit block: https://alive2.llvm.org/ce/z/_TSvRr
* Both main and epilogue vector loops execute, go to scalar loop: https://alive2.llvm.org/ce/z/CsPj5v
* Only epilogue vector loop executes, go to exit block: https://alive2.llvm.org/ce/z/5XqkNV
* Only epilogue vector loop executes, go to scalar loop: https://alive2.llvm.org/ce/z/JUpqRN

The latter 2 show requiring freezing the resume phi. That means we cannot freeze 
in the preheader. We could move the freeze to the main iteration count check, but
that would be a bit fragile to find and other transforms can sink the freeze if needed.


Depends on https://github.com/llvm/llvm-project/pull/132689
and https://github.com/llvm/llvm-project/pull/132690.

Fixes https://github.com/llvm/llvm-project/issues/126836

PR: https://github.com/llvm/llvm-project/pull/132691
2025-04-04 11:48:01 +01:00
Florian Hahn
0f696c2e86
[LV] Add test where epilogue is vectorized and backedge removed.
Adds extra test coverage for
https://github.com/llvm/llvm-project/pull/106748.
2025-04-03 22:14:15 +01:00
Florian Hahn
012e574d4d
[LV] Add FindLastIV test with truncated IV and epilogue vectorization.
This adds missing test coverage for
https://github.com/llvm/llvm-project/pull/132691.
2025-04-03 21:01:58 +01:00
Luke Lau
79435de8a5
[ConstantFold] Support scalable constant splats in ConstantFoldCastInstruction (#133207)
Previously only fixed vector splats were handled. This adds supports for
scalable vectors too by allowing ConstantExpr splats.

We need to add the extra V->getType()->isVectorTy() check because a
ConstantExpr might be a scalar to vector bitcast.

By allowing ConstantExprs this also allow fixed vector ConstantExprs to
be folded, which causes the diffs in
llvm/test/Analysis/ValueTracking/known-bits-from-operator-constexpr.ll
and llvm/test/Transforms/InstSimplify/ConstProp/cast-vector.ll. I can
remove them from this PR if reviewers would prefer.

Fixes #132922
2025-04-03 16:24:56 +01:00
YunQiang Su
e25187bc3e
LLVM/Test: Add vectorizing testcases for fminimumnum and fminimumnum (#133843)
Vectorizing of fminimumnum and fminimumnum have not support yet. Let's
add the testcase for it now, and we will update the testcase when we
support it.
2025-04-02 08:46:02 +08:00
Samuel Tebbs
a1e041b646 [NFC][AArch64] Pre-commit high register pressure dot product test 2025-04-01 14:13:30 +01:00
Florian Hahn
4e8fbc6071
[LV] Add epilogue vectorization tests for FindLastIV reductions.
Add missing test coverage for #126836.
2025-03-31 21:23:35 +01:00
David Sherwood
f4d25c498a
[LV][NFC] Regenerate some SVE tests using --filter-out-after option (#132174)
I recently added a new option to update_test_checks.py that can
filter out all CHECK lines after a certain point. We usually don't
care about checking for the original scalar loop after the vector
loop because it doesn't change. Cutting out unnecessary CHECK
lines makes the files smaller and hopefully the tests run quicker.
2025-03-31 12:40:41 +01:00
Florian Hahn
6b98134466
[VPlan] Re-enable narrowing interleave groups with interleaving.
Remove the UF = 1 restriction introduced by 577631f0a5 building on top
of 783a846507683, which allows updating all relevant users of the VF,
VPScalarIVSteps in particular.

This restores the full functionality of
https://github.com/llvm/llvm-project/pull/106441.
2025-03-29 20:14:10 +00:00
Florian Hahn
783a846507
[VPlan] Add VF as operand to VPScalarIVStepsRecipe.
Similarly to other recipes, update VPScalarIVStepsRecipe to also take
the runtime VF as argument. This removes some unnecessary runtime VF
computations for scalable vectors. It will also allow dropping the
UF == 1 restriction for narrowing interleave groups required in
577631f0a528.
2025-03-28 21:48:59 +00:00
David Green
70f083f068 [LV][AArch64] Test cleanup of low_trip_count_predicates.ll. NFC
Post commit cleanup from #132170
2025-03-28 19:31:37 +00:00
Hari Limaye
bf5627c85e
[LV] Optimize VPWidenIntOrFpInductionRecipe for known TC (#118828)
Optimize the IR generated for a VPWidenIntOrFpInductionRecipe to use the
narrowest type necessary, when the trip-count of a loop is known to be
constant and the only use of the recipe is the condition used by the
vector loop's backedge branch.
2025-03-28 14:47:40 +00:00
Florian Hahn
5c26e80e57
[LV] Make cost model tests independent of VPValue numbers.
Update tests to not rely on hard-coded VPValue numbers.
2025-03-27 21:15:32 +00:00
Florian Hahn
2c7d40b2f0
[VPlan] Generalize SCALAR-STEPS removal to any unroll factor.
Follow-up to dfca6c0d3bf9d1a056 to extend isUnrolled handle any unrolled
VPlan, which means there's a single UF, but it will be > 1 if unrolling
took place.
2025-03-26 21:03:50 +00:00
David Green
de1c2f24bc
[LoopVectorizer][AArch64] Move getMinTripCountTailFoldingThreshold later. (#132170)
This moves the checks of MinTripCountTailFoldingThreshold later, during the
calculation of whether to tail fold. This allows it to check beforehand whether
tail predication is required, either for scalable or fixed-width vectors.

This option is only specified for AArch64, where it returns the minimum of 5.
This patch aims to allow the vectorization of TC=4 loops, preventing them from
performing slower when SVE is present.
2025-03-26 19:35:08 +00:00
David Sherwood
1c9fe8c8af
[LV] Optimise users of induction variables in early exit blocks (#130766)
This is the second of two PRs that attempts to improve the IR
generated in the exit blocks of vectorised loops with uncountable
early exits. It follows on from PR #128880. In this PR I am
improving the generated code for users of induction variables in
early exit blocks.

This required using a newly add VPInstruction called
FirstActiveLane, which calculates the index of the first active
predicate in the mask operand.

I have added a new function optimizeEarlyExitInductionUser that
is called from optimizeInductionExitUsers when handling users in
early exit blocks.
2025-03-26 12:09:59 +00:00
Florian Hahn
577631f0a5
Reapply "[VPlan] Add transformation to narrow interleave groups. (#106441)"
This reverts commit ff3e2ba9eb94217f3ad3525dc18b0c7b684e0abf.

The recommmitted version limits to transform to cases where no
interleaving is taking place, to avoid a mis-compile when interleaving.

Original commit message:

This patch adds a new narrowInterleaveGroups transfrom, which tries
convert a plan with interleave groups with VF elements to a plan that
instead replaces the interleave groups with wide loads and stores
processing VF elements.

This effectively is a very simple form of loop-aware SLP, where we
use interleave groups to identify candidates.

This initial version is quite restricted and hopefully serves as a
starting point for how to best model those kinds of transforms.

Depends on https://github.com/llvm/llvm-project/pull/106431.

Fixes https://github.com/llvm/llvm-project/issues/82936.

PR: https://github.com/llvm/llvm-project/pull/106441
2025-03-25 20:57:10 +00:00
Florian Hahn
dfca6c0d3b
[VPlan] Remove no-op SCALAR-STEPS after unrolling. (#123655)
After unrolling, there may be additional simplifications that can be
applied. One example is removing SCALAR-STEPS for the first part where
only the first lane is demanded.

This removes redundant adds of 0 from a large number of tests (~200),
many which I am still working on updating.

In preparation for removing redundant WideIV steps added in
https://github.com/llvm/llvm-project/pull/119284.

PR: https://github.com/llvm/llvm-project/pull/123655
2025-03-25 12:57:24 +00:00
Florian Hahn
06fd10f1da
[VPlan] Don't create ExtractElement recipes for scalar plans. (#131604)
ExtractElements are no-ops for scalar VPlans. Don't introduce them in 
handleUncountableEarlyExit if the plan has only a scalar VF.

This fixes a crash trying to compute the cost of ExtractElement after 26ecf978951b79.

PR: https://github.com/llvm/llvm-project/pull/131604
2025-03-23 22:00:02 +00:00
Martin Storsjö
ff3e2ba9eb Revert "[VPlan] Add transformation to narrow interleave groups. (#106441)"
This reverts commit dfa665f19c52d98b8d833a8e9073427ba5641b19.

This commit caused miscompilations in ffmpeg, see
https://github.com/llvm/llvm-project/pull/106441 for details.
2025-03-23 23:27:39 +02:00
Florian Hahn
dfa665f19c
[VPlan] Add transformation to narrow interleave groups. (#106441)
This patch adds a new narrowInterleaveGroups transfrom, which tries
convert a plan with interleave groups with VF elements to a plan that
instead replaces the interleave groups with wide loads and stores
processing VF elements.

This effectively is a very simple form of loop-aware SLP, where we
use interleave groups to identify candidates.

This initial version is quite restricted and hopefully serves as a
starting point for how to best model those kinds of transforms.

Depends on https://github.com/llvm/llvm-project/pull/106431.

Fixes https://github.com/llvm/llvm-project/issues/82936.

PR: https://github.com/llvm/llvm-project/pull/106441
2025-03-22 21:40:17 +00:00
Florian Hahn
2f2100c879
[LV] Add additional tests for #106441.
Further increase test coverage for
https://github.com/llvm/llvm-project/pull/106441

Also regenerate checks with -filter-out-after.
2025-03-22 10:07:11 +00:00
David Sherwood
4e69258bf3
[LoopVectorize] Add cost of generating tail-folding mask to the loop (#130565)
At the moment if we decide to enable tail-folding we do not include
the cost of generating the mask per VF. This can mean we make some
poor choices of VF, which is definitely true for SVE-enabled AArch64
targets where mask generation for fixed-width vectors is more
expensive than for scalable vectors.

I've added a VPInstruction::computeCost function to return the costs
of the ActiveLaneMask and ExplicitVectorLength operations.
Unfortunately, in order to prevent asserts firing I've also had to
duplicate the same code in the legacy cost model to make sure the
chosen VFs match up. I've wrapped this up in a ifndef NDEBUG for
now. The alternative would be to disable the assert completely when
tail-folding, which I imagine is just as bad.

New tests added:

  Transforms/LoopVectorize/AArch64/sve-tail-folding-cost.ll
  Transforms/LoopVectorize/RISCV/tail-folding-cost.ll
2025-03-21 09:24:56 +00:00
Florian Hahn
c73ad7ba20
[VPlan] Add transformation to narrow interleave groups.
This patch adds a new narrowInterleaveGroups transfrom, which tries
convert a plan with interleave groups with VF elements to a plan that
instead replaces the interleave groups with wide loads and stores
processing VF elements.

This effectively is a very simple form of loop-aware SLP, where we
use interleave groups to identify candidates.

This initial version is quite restricted and hopefully serves as a
starting point for how to best model those kinds of transforms. For now
it only transforms load interleave groups feeding store groups.

Depends on #106431.

This lands the main parts of the approved
https://github.com/llvm/llvm-project/pull/106441 as suggested to break
things up a bit more.
2025-03-20 19:41:37 +00:00
Florian Hahn
11b8699572
[LV] Don't skip instrs with side-effects in reg pressure computation. (#126415)
calculateRegisterUsage adds end points for each user of an instruction
to Ends and ignores instructions not added to it, i.e. instructions with
no users.

This means things like stores aren't included, which in turn means
values that are only used in stores are also not included for
consideration. This means we underestimate the register usage in cases
where the only users are things like stores.

Update the code to don't skip instructions without users (i.e. not in
Ends) if they have side-effects.

PR: https://github.com/llvm/llvm-project/pull/126415
2025-03-19 15:13:43 +00:00
Florian Hahn
3c554deaaa
[LV] Add reg-usage test with values only used by llvm.assume.
Add test checking we are not counting registers that are only used by
ephemeral users, like llvm.assume.
2025-03-19 12:17:50 +00:00
Florian Hahn
870f753f1f
[VPlan] Also materialize broadcasts for backedge-taken-counts (NFC).
Also include VPlan's BTC in the set of VPValues to materialize
broadcasts for, if it is used.
2025-03-18 22:35:18 +00:00