837 Commits

Author SHA1 Message Date
Paul Walker
6955a7d134 [NFC][LLVM][Instrumentation][LoopVectorize] Regenerate test checks. 2025-06-05 11:38:30 +00:00
Florian Hahn
11713e86b0
[LV] Move VPlan-based calculateRegisterUsage to VPlanAnalysis (NFC). (#135673)
Move VPlan-based calculateRegisterUsage from LoopVectorize
to VPlanAnalysis.cpp. It is a VPlan-based analysis and this helps
to reduce the size of LoopVectorize.

PR: https://github.com/llvm/llvm-project/pull/135673
2025-06-02 17:40:50 +01:00
Madhur Amilkanthwar
67ff713052
[NFC][LV] Remove incorrect comment about lack of support (#142126) 2025-05-30 13:25:55 +02:00
Florian Hahn
9ea4924720
[VPlan] Use EMIT-SCALAR for single-scalar VPPhis (NFC).
Follow-up to https://github.com/llvm/llvm-project/pull/141428, to also
use EMIT-SCALAR for VPPhis that are single scalars.
2025-05-29 11:20:07 +01:00
Florian Hahn
5b85e4b08d
[VPlan] Use EMIT-SCALAR when printing single-scalar VPInstructions. (#141428)
By using EMIT-SCALAR when printing, it is clear in the debug output
that those VPInstructions only produce a single scalar.

Split off in preparation for
https://github.com/llvm/llvm-project/pull/140623.

PR: https://github.com/llvm/llvm-project/pull/141428
2025-05-29 09:29:06 +01:00
Elvis Wang
332fe08f1d
[VPlan] Implement VPlan-based cost model for VPReduction, VPExtendedReduction and VPMulAccumulateReduction. (#113903)
This patch implements the VPlan-based cost model for VPReduction,
VPExtendedReduction and VPMulAccumulateReduction.

With this patch, the reduction cost is calculated by the VPlan-based
cost model, so the reduction costs in `precomputeCost()` are removed.

Ref: Original instruction based implementation:
https://reviews.llvm.org/D93476
2025-05-29 11:15:16 +08:00
Paul Walker
9aebf4c399 [NFC][LLVM] Tests for vectorisation of loops with vscale base trip counts. 2025-05-28 12:42:41 +00:00
Ramkumar Ramachandra
5f39be5917
[VPlan] Use InstSimplifyFolder instead of TargetFolder (#141222)
For more powerful folding with operands that are not necessarily
all-constant, use InstSimplifyFolder instead of TargetFolder in
tryToConstantFold, and rename the function tryToFoldLiveIns.
2025-05-28 11:00:14 +02:00
Florian Hahn
d56deea1e4
[VPlan] Connect Entry to scalar preheader during initial construction. (#140132)
Update initial construction to connect the Plan's entry to the scalar
preheader during initial construction. This moves a small part of the
skeleton creation out of ILV and will also enable replacing
VPInstruction::ResumePhi with regular VPPhi recipes.

Resume phis need 2 incoming values to start with, the second being the
bypass value from the scalar ph (and used to replicate the incoming
value for other bypass blocks). Adding the extra edge ensures the
incoming values for resume phis match the incoming blocks.

PR: https://github.com/llvm/llvm-project/pull/140132
2025-05-27 16:07:56 +01:00
Luke Lau
841c8d48a6 [LV] Add tests for more interleave group factors on AArch64 and RISC-V. NFC
The plan is to eventually add support for scalably vectorizing these for
non-power-of-2 factors, see https://github.com/llvm/llvm-project/pull/139893

Simultaneously, we need to add a test to make sure we don't generate
@llvm.vector.[de]interleave3 for AArch64 if we can't lower it (yet)
2025-05-26 18:21:27 +01:00
Florian Hahn
dcef154b5c
[VPlan] Replace VPRegionBlock with explicit CFG before execute (NFCI). (#117506)
Building on top of https://github.com/llvm/llvm-project/pull/114305,
replace VPRegionBlocks with explicit CFG before executing.

This brings the final VPlan closer to the IR that is generated and
helps to simplify codegen.

It will also enable further simplifications of phi handling during
execution and transformations that do not have to preserve the 
canonical IV required by loop regions. This for example could include
replacing the canonical IV with an EVL based phi while completely
removing the original canonical IV.

PR: https://github.com/llvm/llvm-project/pull/117506
2025-05-24 19:17:16 +01:00
Luke Lau
4b4699a13c
[InstCombine] Don't cover up poison elements for shifts when folding shuffles thru binops (#141303)
As noted in the TODO, we don't need to cover up the poison elements
placed in the unused lanes for shifts, since it's not UB unlike div/rem.

New poison elements are only introduced in cases like

ShMask = <1,1,2,2> and C = <5,5,6,6> --> NewC = <poison,5,6,poison>

And the resulting shuffle won't use the poison lanes.
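
The mapping in the example above can be sketched as follows. This is a simplified model, not the InstCombine code: `moveConstantThroughShuffle` is a hypothetical name, `std::nullopt` stands in for a poison lane, and mask entries that are out of range or that demand two different constants for one lane (cases the real transform has to handle) are simply asserted away.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// std::nullopt models a poison lane.
using Lane = std::optional<int>;

// For binop(shuffle(X, ShMask), C), compute the constant NewC to use when the
// binop is applied to X before the shuffle. Source lane ShMask[I] ends up in
// result lane I, so it must be combined with C[I]. Unselected lanes stay poison.
std::vector<Lane> moveConstantThroughShuffle(const std::vector<int> &ShMask,
                                             const std::vector<int> &C) {
  assert(ShMask.size() == C.size() && "one constant per result lane");
  std::vector<Lane> NewC(C.size(), std::nullopt);
  for (std::size_t I = 0; I < ShMask.size(); ++I) {
    assert(ShMask[I] >= 0 && static_cast<std::size_t>(ShMask[I]) < NewC.size());
    Lane &Slot = NewC[ShMask[I]];
    assert((!Slot || *Slot == C[I]) && "conflicting constants for one lane");
    Slot = C[I];
  }
  return NewC;
}
```

For `ShMask = <1,1,2,2>` and `C = <5,5,6,6>` this yields `<poison,5,6,poison>`: lanes 0 and 3 are never selected by the mask, so their contents do not reach the result.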
2025-05-24 13:47:18 +01:00
Mohammad Bashir
bcdce987c0
Fix regression tests with bad FileCheck checks (#140373)
Fixes https://github.com/llvm/llvm-project/issues/140149
2025-05-22 07:59:57 +03:00
Ramkumar Ramachandra
cf1f116f78
[VPlan] Introduce constant folder in simplifyRecipe (#125365)
Introduce a VPlan-level constant folder in simplifyRecipe that tries to
fold a recipe to a constant using TargetFolder.
2025-05-20 14:16:01 +01:00
Sam Tebbs
70501ed2f0
[LoopVectorizer] Prune VFs based on plan register pressure (#132190)
This PR moves the register usage checking to after the plans are
created, so that any recipes that optimise register usage (such as
partial reductions) can be properly costed and not have their VF pruned
unnecessarily.

Depends on https://github.com/llvm/llvm-project/pull/137746
2025-05-19 13:27:17 +01:00
Florian Hahn
7a9fd62278
[VPlan] Use VPlan operand order for VPBlendRecipes. (#139475)
Don't use the order of incoming values of IR phis when creating 
VPBlendRecipes. Instead, simply use the incoming operands and
blocks from the VPWidenPHIRecipe.

Note that this changes the order of the incoming operands/masks for some
blends.

PR: https://github.com/llvm/llvm-project/pull/139475
2025-05-14 14:56:35 +01:00
Florian Hahn
5fa64d65e9
[VPlan] Use printPhiOperands for VPPhi.
Split off from https://github.com/llvm/llvm-project/pull/139151 to land
printing improvements separately.

Updates printing of VPPhi operands to be consistent with
VPWidenPHIRecipe.
2025-05-10 12:49:29 +01:00
Florian Hahn
e854c381c6
[VPlan] Manage noalias/alias_scope metadata in VPlan. (#136450)
Use VPIRMetadata added in
https://github.com/llvm/llvm-project/pull/135272
to also manage no-alias metadata added by versioning.

Note that this means we have to build the no-alias metadata up-front
once. If it is not used, it will be discarded automatically.

This also fixes a case where incorrect metadata was added to wide
loads/stores that got converted from an interleave group.

Compile-time impact is neutral:

https://llvm-compile-time-tracker.com/compare.php?from=38bf1af41c5425a552a53feb13c71d82873f1c18&to=2fd7844cfdf5ec0f1c2ce0b9b3ae0763245b6922&stat=instructions:u
2025-05-09 11:19:12 +01:00
Luke Lau
1484f82cbc
[VPlan] Add VPInstruction::StepVector and use it in VPWidenIntOrFpInductionRecipe (#129508)
Split off from #118638, this adds VPInstruction::StepVector, which
generates integer step vectors (0,1,2,...,VF-1). This is a step towards
eventually modelling all the separate parts of
VPWidenIntOrFpInductionRecipe in VPlan.

This is then used by VPWidenIntOrFpInductionRecipe, where we materialize
it just before unrolling so the operands stay in a fixed position.

The need for a separate operand in VPWidenIntOrFpInductionRecipe, as
well as the need to update it in
optimizeVectorInductionWidthForTCAndVFUF, should be removed with #118638
when everything is expanded in convertToConcreteRecipes.
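
As a rough illustration of what a step vector is and how a widened induction can use it, here is a minimal scalar model. The function names are hypothetical and this is not the VPlan implementation; it only assumes a fixed VF and an integer induction of the form Start + Lane * Step.

```cpp
#include <cstdint>
#include <vector>

// Integer step vector: lanes hold 0, 1, ..., VF-1.
std::vector<std::int64_t> stepVector(unsigned VF) {
  std::vector<std::int64_t> Lanes(VF);
  for (unsigned I = 0; I < VF; ++I)
    Lanes[I] = I;
  return Lanes;
}

// First vector value of a widened integer induction: Start + Lane * Step.
std::vector<std::int64_t> widenInduction(std::int64_t Start, std::int64_t Step,
                                         unsigned VF) {
  std::vector<std::int64_t> IV = stepVector(VF);
  for (std::int64_t &Lane : IV)
    Lane = Start + Lane * Step;
  return IV;
}
```

With `VF = 4`, `stepVector` produces `<0,1,2,3>`, and `widenInduction(10, 3, 4)` produces `<10,13,16,19>`.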
2025-05-08 18:47:44 +08:00
Florian Hahn
127f48668b
[LV] Add test showing incorrect metadata merging when narrowing IGs.
Add test showing that incorrect tbaa metadata is added to the widened
loads and stores when narrowing interleave groups.

The widened loads/stores currently have the TBAA metadata of the first
load/store, even though the wide accesses also access data with types of
the second load/store.
2025-05-08 11:13:25 +01:00
Florian Hahn
9a26b2903b
[VPlan] Don't rely on region check in isUniformAfterVectorization. (#137883)
Generalize isUniformAfterVectorization check to not rely on the region,
but purely work on checking operands and opcodes.

This will be needed when dissolving the vector region
(https://github.com/llvm/llvm-project/pull/117506) and improves codegen
slightly in some cases.

PR: https://github.com/llvm/llvm-project/pull/137883
2025-05-02 15:42:21 +01:00
Sam Tebbs
2876dbcd66
[AArch64] Don't allow mixed partial reductions without i8mm (#137602)
Partial reductions with mixed extends should only be allowed if i8mm is
present.
2025-05-01 16:06:37 +01:00
Samuel Tebbs
fa769655e7 [LV] NFC: Make VPPartialReductionRecipe a VPReductionRecipe 2025-04-30 19:44:40 +01:00
Elvis Wang
1fc0a1401a
[LV][AArch64] Add test for fp128 fmuladd reduction.(NFC) (#137576)
This patch adds a test for the fmuladd reduction to show the test
change/failure caused by the cost model change.

Note that without the fp128 load and trunc, there is no failure.

Pre-commit test for #113903.
2025-04-29 09:18:07 +08:00
Florian Hahn
043b04acff
Reapply "[VPlan] Fold NOT into predicate of wide compares." (#130347)
This reverts commit 8dd160f4767f971572eac065c8650d9202ff5bf9.

The recommit contains an adjustment to planContainsAdditionalSimplifications,
which considers changes to the original predicate for compares.

Original commit message:

Add simplification to fold negation into a compare, if the negation is
the only user of the compare. This removes a number of redundant
negations.
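
The idea behind the fold can be sketched on signed integer predicates: negating a compare is the same as evaluating the inverse predicate on the same operands. The enum and function names below are hypothetical stand-ins, not LLVM's `CmpInst` API, and floating-point predicates (where the inverse must also switch between ordered and unordered) are left out.

```cpp
// A small subset of signed integer compare predicates.
enum class Pred { EQ, NE, SLT, SGE, SGT, SLE };

bool evaluate(Pred P, int A, int B) {
  switch (P) {
  case Pred::EQ:  return A == B;
  case Pred::NE:  return A != B;
  case Pred::SLT: return A < B;
  case Pred::SGE: return A >= B;
  case Pred::SGT: return A > B;
  case Pred::SLE: return A <= B;
  }
  return false; // Unreachable for valid predicates.
}

// not(cmp P, A, B) is equivalent to cmp inversePredicate(P), A, B.
Pred inversePredicate(Pred P) {
  switch (P) {
  case Pred::EQ:  return Pred::NE;
  case Pred::NE:  return Pred::EQ;
  case Pred::SLT: return Pred::SGE;
  case Pred::SGE: return Pred::SLT;
  case Pred::SGT: return Pred::SLE;
  case Pred::SLE: return Pred::SGT;
  }
  return P; // Unreachable for valid predicates.
}
```

Rewriting the predicate in place is only safe when the negation is the compare's sole user, which is the condition the commit message states.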

Alive2 Proofs for FPCMP test changes:  https://alive2.llvm.org/ce/z/WGDz9U

PR: https://github.com/llvm/llvm-project/pull/129430
2025-04-28 20:01:37 +01:00
Florian Hahn
df21288247
[VPlan] Replace ExtractFromEnd with Extract(Last|Penultimate)Element (NFC). (#137030)
ExtractFromEnd only has 2 uses, extracting the last and penultimate
elements. Replace it with 2 separate opcodes, removing the need to
materialize and handle a constant argument.

PR: https://github.com/llvm/llvm-project/pull/137030
2025-04-25 16:27:29 +01:00
Ramkumar Ramachandra
4955c3c476
[LV] Strip bad FIXME in test (#137142)
See https://github.com/llvm/llvm-project/pull/130118/files#r1983745712
for context.
2025-04-25 09:47:47 +01:00
Florian Hahn
15bb1db4a9
[VPlan] Remove ILV::sinkScalarOperands. (#136023)
Remove legacy ILV sinkScalarOperands, which is superseded by the
sinkScalarOperands VPlan transforms.

There are a few cases that aren't handled by VPlan's sinkScalarOperands,
because the recipes don't support replicating. Those are pointer
inductions and blends.

We could probably improve this further, by allowing replication for more
recipes, but I don't think the extra complexity is warranted.

Depends on https://github.com/llvm/llvm-project/pull/136021.

PR: https://github.com/llvm/llvm-project/pull/136023
2025-04-24 08:37:49 +01:00
Ramkumar Ramachandra
bdf21ca8ac
[LV] Fix missing entry in willGenerateVectors (#136712)
willGenerateVectors switches on opcodes of a recipe, but Histogram is
missing in the switch statement, which could cause a crash in some
cases. The crash was initially observed when developing another patch.
2025-04-23 19:06:38 +01:00
Nicholas Guy
1ce709cb84
[LV] Fix crash when building partial reductions using types that aren't known scale factors (#136680) 2025-04-23 13:19:18 +01:00
David Sherwood
ef72b93626
[LV] Use requested calling convention for vector math routines (#136122)
Some vector math routines, e.g. ArmPL, specify a particular
calling convention on the routines which can help improve
performance by specifying what registers have to be preserved
across the call.
2025-04-22 09:33:52 +01:00
Florian Hahn
5739a22fbb
[VPlan] Also duplicate scalar-steps when it enables sinking scalars. (#136021)
Extend sinking logic to duplicate scalar steps recipe if it enables
sinking, that is if all users in a destination block require all lanes.

This should be the last step before removing legacy sinkScalarOperands.

PR: https://github.com/llvm/llvm-project/pull/136021
2025-04-21 18:36:43 +01:00
David Sherwood
927a0cb8d6
[LV][NFC] Regenerate AArch64/veclib-* test CHECK lines (#136138) 2025-04-17 15:07:33 +01:00
Sander de Smalen
f9c01b59e3
[LV] Fix '-1U' bits for smallest type in getSmallestAndWidestTypes (#135783)
For loops without loads/stores, where the smallest/widest types are
calculated from the reduction, the smallest type returned is always -1U
and it actually returns the smallest type as the widest type. This PR
fixes the calculation.

This follows from
https://github.com/llvm/llvm-project/pull/132190#discussion_r2044232607
2025-04-17 13:26:15 +01:00
John Brawn
eafbb879f6
[LoopVectorize] Don't replicate blocks with optsize (#129265)
Any VPlan we generate that contains a replicator region will result in
replicated blocks in the output, causing a large code size increase.
Reject such VPlans when optimizing for size, as the code size impact is
usually worse than having a scalar epilogue, which we already forbid
with optsize.

This change requires a lot of test changes. For tests of optsize
specifically I've updated the test with the new output, otherwise the
tests have been adjusted to not rely on optsize.

Fixes #66652
2025-04-17 11:50:49 +01:00
YunQiang Su
fe9e2090be
Vectorize: Support fminimumnum and fmaximumnum (#131781)
Support auto-vectorization of fminimum_num and fmaximum_num.
On ARM64 with SVE, scalable vectors are not supported yet.

2025-04-15 08:08:45 +08:00
Sam Tebbs
b658a2e74a
[LV] Reduce register usage for scaled reductions (#133090)
This PR accounts for scaled reductions in `calculateRegisterUsage` to
reflect the fact that the number of lanes in their output is smaller
than the VF.
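
A back-of-the-envelope version of that adjustment, assuming a scaled reduction produces VF / ScaleFactor lanes: `estimateVectorRegs` and its parameters are hypothetical, and the real analysis asks the target for the register usage of a type rather than dividing bit widths.

```cpp
#include <cassert>

// Hypothetical helper: number of RegBits-wide vector registers needed for a
// value with VF / ScaleFactor lanes of ElemBits each.
unsigned estimateVectorRegs(unsigned VF, unsigned ScaleFactor,
                            unsigned ElemBits, unsigned RegBits) {
  assert(ScaleFactor != 0 && VF % ScaleFactor == 0 && RegBits != 0);
  unsigned Lanes = VF / ScaleFactor;     // Scaled reductions have fewer lanes.
  unsigned Bits = Lanes * ElemBits;
  return (Bits + RegBits - 1) / RegBits; // Round up to whole registers.
}
```

For example, with VF = 16, 32-bit accumulator elements and 128-bit registers, an unscaled reduction (scale factor 1) is estimated at 4 registers, while a reduction with scale factor 4 needs only 1.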

Depends on https://github.com/llvm/llvm-project/pull/126437
2025-04-11 14:31:08 +01:00
Florian Hahn
6f92339d9e
[LV] Compute register usage for interleaving on VPlan. (#126437)
Add a version of calculateRegisterUsage that estimates register
usage for a VPlan. This mostly just ports the existing code, with some
updates to figure out what recipes will generate vectors vs scalars.

There are a number of changes in the computed register usages, but they
should be more accurate w.r.t. the generated vector code.

There are the following changes:

 * Scalar usage increases in most cases by 1, as we always create a
   scalar canonical IV, which is alive across the loop and is not
   considered by the legacy implementation

 * Output is ordered by insertion, now scalar registers are added first
   due to the canonical IV phi.

 * Using the VPlan, we now also more precisely know if an induction will
   be vectorized or scalarized.

Depends on https://github.com/llvm/llvm-project/pull/126415

PR: https://github.com/llvm/llvm-project/pull/126437
2025-04-08 20:52:50 +01:00
Nashe Mncube
67dd2019ac
Recommit [AArch64][SVE]Use FeatureUseFixedOverScalableIfEqualCost for A510/A520 (#134606)
Recommit of #132246, which failed buildbots because the test it
introduced needed updates.

Inefficient SVE codegen occurs on at least two in-order cores, those
being Cortex-A510 and Cortex-A520. For example a simple vector add

```
void foo(float *a, float *b, float *dst, unsigned n) {
    for (unsigned i = 0; i < n; ++i)
        dst[i] = a[i] + b[i];
}
```

Vectorizes the inner loop into the following interleaved sequence of
instructions.

```
        add     x12, x1, x10
        ld1b    { z0.b }, p0/z, [x1, x10]
        add     x13, x2, x10
        ld1b    { z1.b }, p0/z, [x2, x10]
        ldr     z2, [x12, #1, mul vl]
        ldr     z3, [x13, #1, mul vl]
        dech    x11
        add     x12, x0, x10
        fadd    z0.s, z1.s, z0.s
        fadd    z1.s, z3.s, z2.s
        st1b    { z0.b }, p0, [x0, x10]
        addvl   x10, x10, #2
        str     z1, [x12, #1, mul vl]
```

By adjusting the target features to prefer fixed over scalable if the
cost is equal we get the following vectorized loop.

```
         ldp q0, q3, [x11, #-16]
         subs    x13, x13, #8
         ldp q1, q2, [x10, #-16]
         add x10, x10, #32
         add x11, x11, #32
         fadd    v0.4s, v1.4s, v0.4s
         fadd    v1.4s, v2.4s, v3.4s
         stp q0, q1, [x12, #-16]
         add x12, x12, #32
```

Which is more efficient.
2025-04-07 14:09:43 +01:00
Florian Hahn
464286ba63
[VPlan] Don't narrow interleave groups if there are vector pointers.
Do not narrow interleave groups if there are VectorPointer recipes and
the plan was unrolled. The recipe implicitly uses VF from VPTransformState.
2025-04-06 22:14:24 +01:00
Florian Hahn
12a377ed71
[LV] Add test for mis-compile when narrowing interleave groups.
Add test case showing mis-compile due to unrolling vector-pointer
recipes after 6b98134.
2025-04-06 21:17:44 +01:00
Florian Hahn
5fbd0658a0
[VPlan] Add initial CFG simplification, removing BranchOnCond true. (#106748)
Add an initial CFG simplification transform, which removes the dead
edges for blocks terminated with BranchOnCond true.

At the moment, this removes the edge between middle block and scalar
preheader when folding the tail.
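
The transform can be pictured on a toy CFG. The `Block` type and helper names below are hypothetical and much simpler than VPlan's block classes; the sketch only assumes that the first successor is the one taken when the condition is true.

```cpp
#include <algorithm>
#include <optional>
#include <vector>

struct Block {
  std::vector<Block *> Succs;    // Succs[0] is taken when the condition is true.
  std::vector<Block *> Preds;
  std::optional<bool> ConstCond; // Set if the branch condition is a known constant.
};

void connect(Block &From, Block &To) {
  From.Succs.push_back(&To);
  To.Preds.push_back(&From);
}

// Remove the never-taken edge of a two-way branch whose condition is the
// constant true. Returns true if an edge was removed.
bool removeDeadEdge(Block &B) {
  if (B.Succs.size() != 2 || !B.ConstCond.value_or(false))
    return false;
  Block *Dead = B.Succs[1];
  auto It = std::find(Dead->Preds.begin(), Dead->Preds.end(), &B);
  if (It != Dead->Preds.end())
    Dead->Preds.erase(It);
  B.Succs.pop_back();
  B.ConstCond.reset(); // The branch is now unconditional.
  return true;
}
```

In the tail-folding case from the message, `B` would be the middle block, `Succs[0]` the exit block and `Succs[1]` the scalar preheader, which loses its only incoming edge from the middle block.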

PR: https://github.com/llvm/llvm-project/pull/106748
2025-04-04 15:44:26 +01:00
Nashe Mncube
846000c005
Revert "[AArch64][SVE] Use FeatureUseFixedOverScalableIfEqualCost for A510 and A520" (#134382)
Reverts llvm/llvm-project#132246
2025-04-04 14:36:38 +01:00
Nashe Mncube
d2bcc11067
[AArch64][SVE] Use FeatureUseFixedOverScalableIfEqualCost for A510 and A520 (#132246)
Inefficient SVE codegen occurs on at least two in-order cores,
those being Cortex-A510 and Cortex-A520. For example a simple vector
add

```
void foo(float *a, float *b, float *dst, unsigned n) {
    for (unsigned i = 0; i < n; ++i)
        dst[i] = a[i] + b[i];
}
```

Vectorizes the inner loop into the following interleaved sequence
of instructions.

```
        add     x12, x1, x10
        ld1b    { z0.b }, p0/z, [x1, x10]
        add     x13, x2, x10
        ld1b    { z1.b }, p0/z, [x2, x10]
        ldr     z2, [x12, #1, mul vl]
        ldr     z3, [x13, #1, mul vl]
        dech    x11
        add     x12, x0, x10
        fadd    z0.s, z1.s, z0.s
        fadd    z1.s, z3.s, z2.s
        st1b    { z0.b }, p0, [x0, x10]
        addvl   x10, x10, #2
        str     z1, [x12, #1, mul vl]
```

By adjusting the target features to prefer fixed over scalable if the
cost is equal we get the following vectorized loop.

```
         ldp q0, q3, [x11, #-16]
         subs    x13, x13, #8
         ldp q1, q2, [x10, #-16]
         add x10, x10, #32
         add x11, x11, #32
         fadd    v0.4s, v1.4s, v0.4s
         fadd    v1.4s, v2.4s, v3.4s
         stp q0, q1, [x12, #-16]
         add x12, x12, #32
```

Which is more efficient.
2025-04-04 14:12:44 +01:00
Florian Hahn
2bdc1a1337
[LV] Use frozen start value for FindLastIV if needed. (#132691)
FindLastIV introduces multiple uses of the start value, where in the
original source there was only a single use, when the epilogue is
vectorized.

Each use of undef may produce a different result, so introducing
multiple uses can produce incorrect results when the input is
undef/poison.

If the start value may be undef or poison, freeze it and use the frozen
value, which will be the same at all uses.
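
The difference between several uses of undef and several uses of one frozen value can be mimicked with a toy model. The classes below are purely illustrative and do not correspond to any LLVM API: `UndefLike` returns a different value on every read, and `Frozen` reads once and then always returns that same value.

```cpp
// Toy model of undef: each use may observe a different value.
class UndefLike {
  int Next = 0;

public:
  int read() { return Next++; }
};

// Toy model of freeze: pick one value once; every later use sees that value.
class Frozen {
  int Value;

public:
  explicit Frozen(UndefLike &U) : Value(U.read()) {}
  int read() const { return Value; }
};
```

If the main and epilogue vector loops each read the start value directly, they may disagree about it; reading it through `Frozen` gives both the same value.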

See the following scenarios in Alive2:
* Both main and epilogue vector loops execute, go to exit block: https://alive2.llvm.org/ce/z/_TSvRr
* Both main and epilogue vector loops execute, go to scalar loop: https://alive2.llvm.org/ce/z/CsPj5v
* Only epilogue vector loop executes, go to exit block: https://alive2.llvm.org/ce/z/5XqkNV
* Only epilogue vector loop executes, go to scalar loop: https://alive2.llvm.org/ce/z/JUpqRN

The latter 2 show that freezing the resume phi is required. That means we cannot freeze
in the preheader. We could move the freeze to the main iteration count check, but
that would be a bit fragile to find and other transforms can sink the freeze if needed.


Depends on https://github.com/llvm/llvm-project/pull/132689
and https://github.com/llvm/llvm-project/pull/132690.

Fixes https://github.com/llvm/llvm-project/issues/126836

PR: https://github.com/llvm/llvm-project/pull/132691
2025-04-04 11:48:01 +01:00
Florian Hahn
0f696c2e86
[LV] Add test where epilogue is vectorized and backedge removed.
Adds extra test coverage for
https://github.com/llvm/llvm-project/pull/106748.
2025-04-03 22:14:15 +01:00
Florian Hahn
012e574d4d
[LV] Add FindLastIV test with truncated IV and epilogue vectorization.
This adds missing test coverage for
https://github.com/llvm/llvm-project/pull/132691.
2025-04-03 21:01:58 +01:00
Luke Lau
79435de8a5
[ConstantFold] Support scalable constant splats in ConstantFoldCastInstruction (#133207)
Previously only fixed vector splats were handled. This adds support for
scalable vectors too by allowing ConstantExpr splats.

We need to add the extra V->getType()->isVectorTy() check because a
ConstantExpr might be a scalar to vector bitcast.

By allowing ConstantExprs this also allows fixed vector ConstantExprs to
be folded, which causes the diffs in
llvm/test/Analysis/ValueTracking/known-bits-from-operator-constexpr.ll
and llvm/test/Transforms/InstSimplify/ConstProp/cast-vector.ll. I can
remove them from this PR if reviewers would prefer.

Fixes #132922
2025-04-03 16:24:56 +01:00
YunQiang Su
e25187bc3e
LLVM/Test: Add vectorizing testcases for fminimumnum and fmaximumnum (#133843)
Vectorization of fminimumnum and fmaximumnum is not supported yet. Let's
add the testcases for it now, and we will update them when we
support it.
2025-04-02 08:46:02 +08:00
Samuel Tebbs
a1e041b646 [NFC][AArch64] Pre-commit high register pressure dot product test 2025-04-01 14:13:30 +01:00