786 Commits

Author SHA1 Message Date
David Sherwood
f4d25c498a
[LV][NFC] Regenerate some SVE tests using --filter-out-after option (#132174)
I recently added a new option to update_test_checks.py that can
filter out all CHECK lines after a certain point. We usually don't
care about checking for the original scalar loop after the vector
loop because it doesn't change. Cutting out unnecessary CHECK
lines makes the files smaller and hopefully the tests run quicker.
2025-03-31 12:40:41 +01:00
Florian Hahn
6b98134466
[VPlan] Re-enable narrowing interleave groups with interleaving.
Remove the UF = 1 restriction introduced by 577631f0a5 building on top
of 783a846507683, which allows updating all relevant users of the VF,
VPScalarIVSteps in particular.

This restores the full functionality of
https://github.com/llvm/llvm-project/pull/106441.
2025-03-29 20:14:10 +00:00
Florian Hahn
783a846507
[VPlan] Add VF as operand to VPScalarIVStepsRecipe.
Similarly to other recipes, update VPScalarIVStepsRecipe to also take
the runtime VF as argument. This removes some unnecessary runtime VF
computations for scalable vectors. It will also allow dropping the
UF == 1 restriction for narrowing interleave groups required in
577631f0a528.
2025-03-28 21:48:59 +00:00
David Green
70f083f068
[LV][AArch64] Test cleanup of low_trip_count_predicates.ll. NFC
Post commit cleanup from #132170
2025-03-28 19:31:37 +00:00
Hari Limaye
bf5627c85e
[LV] Optimize VPWidenIntOrFpInductionRecipe for known TC (#118828)
Optimize the IR generated for a VPWidenIntOrFpInductionRecipe to use the
narrowest type necessary, when the trip-count of a loop is known to be
constant and the only use of the recipe is the condition used by the
vector loop's backedge branch.
2025-03-28 14:47:40 +00:00
Florian Hahn
5c26e80e57
[LV] Make cost model tests independent of VPValue numbers.
Update tests to not rely on hard-coded VPValue numbers.
2025-03-27 21:15:32 +00:00
Florian Hahn
2c7d40b2f0
[VPlan] Generalize SCALAR-STEPS removal to any unroll factor.
Follow-up to dfca6c0d3bf9d1a056 to extend isUnrolled to handle any unrolled
VPlan, which means there's a single UF, but it will be > 1 if unrolling
took place.
2025-03-26 21:03:50 +00:00
David Green
de1c2f24bc
[LoopVectorizer][AArch64] Move getMinTripCountTailFoldingThreshold later. (#132170)
This moves the checks of MinTripCountTailFoldingThreshold later, during the
calculation of whether to tail fold. This allows it to check beforehand whether
tail predication is required, either for scalable or fixed-width vectors.

This option is only specified for AArch64, where it returns the minimum of 5.
This patch aims to allow the vectorization of TC=4 loops, preventing them from
performing slower when SVE is present.
2025-03-26 19:35:08 +00:00
David Sherwood
1c9fe8c8af
[LV] Optimise users of induction variables in early exit blocks (#130766)
This is the second of two PRs that attempt to improve the IR
generated in the exit blocks of vectorised loops with uncountable
early exits. It follows on from PR #128880. In this PR I am
improving the generated code for users of induction variables in
early exit blocks.

This required using a newly added VPInstruction called
FirstActiveLane, which calculates the index of the first active
predicate in the mask operand.

I have added a new function optimizeEarlyExitInductionUser that
is called from optimizeInductionExitUsers when handling users in
early exit blocks.
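
As a rough illustration of what FirstActiveLane computes and how an induction user in the early exit block could be derived from it, here is a minimal standalone C++ sketch. It does not use the VPlan API: the helper names `firstActiveLane` and `ivAtEarlyExit`, and the use of `std::vector<bool>` as the mask, are assumptions made for this example only.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: index of the first set lane in a predicate mask,
// or Mask.size() if no lane is active.
std::size_t firstActiveLane(const std::vector<bool> &Mask) {
  for (std::size_t Lane = 0; Lane < Mask.size(); ++Lane)
    if (Mask[Lane])
      return Lane;
  return Mask.size();
}

// Hypothetical helper: value of an induction variable in the lane that took
// the early exit, given its value in lane 0 of the current vector iteration.
std::int64_t ivAtEarlyExit(std::int64_t IVLane0, std::int64_t Step,
                           const std::vector<bool> &ExitMask) {
  return IVLane0 +
         static_cast<std::int64_t>(firstActiveLane(ExitMask)) * Step;
}
```

For a mask of {0, 0, 1, 1}, an induction value of 16 in lane 0 and a step of 2, the sketch yields 16 + 2 * 2 = 20.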
2025-03-26 12:09:59 +00:00
Florian Hahn
577631f0a5
Reapply "[VPlan] Add transformation to narrow interleave groups. (#106441)"
This reverts commit ff3e2ba9eb94217f3ad3525dc18b0c7b684e0abf.

The recommitted version limits the transform to cases where no
interleaving is taking place, to avoid a mis-compile when interleaving.

Original commit message:

This patch adds a new narrowInterleaveGroups transform, which tries to
convert a plan with interleave groups with VF elements to a plan that
instead replaces the interleave groups with wide loads and stores
processing VF elements.

This effectively is a very simple form of loop-aware SLP, where we
use interleave groups to identify candidates.

This initial version is quite restricted and hopefully serves as a
starting point for how to best model those kinds of transforms.

Depends on https://github.com/llvm/llvm-project/pull/106431.

Fixes https://github.com/llvm/llvm-project/issues/82936.

PR: https://github.com/llvm/llvm-project/pull/106441
2025-03-25 20:57:10 +00:00
Florian Hahn
dfca6c0d3b
[VPlan] Remove no-op SCALAR-STEPS after unrolling. (#123655)
After unrolling, there may be additional simplifications that can be
applied. One example is removing SCALAR-STEPS for the first part where
only the first lane is demanded.

This removes redundant adds of 0 from a large number of tests (~200),
many of which I am still working on updating.

In preparation for removing redundant WideIV steps added in
https://github.com/llvm/llvm-project/pull/119284.

PR: https://github.com/llvm/llvm-project/pull/123655
2025-03-25 12:57:24 +00:00
Florian Hahn
06fd10f1da
[VPlan] Don't create ExtractElement recipes for scalar plans. (#131604)
ExtractElements are no-ops for scalar VPlans. Don't introduce them in 
handleUncountableEarlyExit if the plan has only a scalar VF.

This fixes a crash trying to compute the cost of ExtractElement after 26ecf978951b79.

PR: https://github.com/llvm/llvm-project/pull/131604
2025-03-23 22:00:02 +00:00
Martin Storsjö
ff3e2ba9eb
Revert "[VPlan] Add transformation to narrow interleave groups. (#106441)"
This reverts commit dfa665f19c52d98b8d833a8e9073427ba5641b19.

This commit caused miscompilations in ffmpeg, see
https://github.com/llvm/llvm-project/pull/106441 for details.
2025-03-23 23:27:39 +02:00
Florian Hahn
dfa665f19c
[VPlan] Add transformation to narrow interleave groups. (#106441)
This patch adds a new narrowInterleaveGroups transform, which tries to
convert a plan with interleave groups with VF elements to a plan that
instead replaces the interleave groups with wide loads and stores
processing VF elements.

This effectively is a very simple form of loop-aware SLP, where we
use interleave groups to identify candidates.

This initial version is quite restricted and hopefully serves as a
starting point for how to best model those kinds of transforms.

Depends on https://github.com/llvm/llvm-project/pull/106431.

Fixes https://github.com/llvm/llvm-project/issues/82936.

PR: https://github.com/llvm/llvm-project/pull/106441
2025-03-22 21:40:17 +00:00
Florian Hahn
2f2100c879
[LV] Add additional tests for #106441.
Further increase test coverage for
https://github.com/llvm/llvm-project/pull/106441

Also regenerate checks with -filter-out-after.
2025-03-22 10:07:11 +00:00
David Sherwood
4e69258bf3
[LoopVectorize] Add cost of generating tail-folding mask to the loop (#130565)
At the moment if we decide to enable tail-folding we do not include
the cost of generating the mask per VF. This can mean we make some
poor choices of VF, which is definitely true for SVE-enabled AArch64
targets where mask generation for fixed-width vectors is more
expensive than for scalable vectors.

I've added a VPInstruction::computeCost function to return the costs
of the ActiveLaneMask and ExplicitVectorLength operations.
Unfortunately, in order to prevent asserts firing I've also had to
duplicate the same code in the legacy cost model to make sure the
chosen VFs match up. I've wrapped this up in an ifndef NDEBUG for
now. The alternative would be to disable the assert completely when
tail-folding, which I imagine is just as bad.

New tests added:

  Transforms/LoopVectorize/AArch64/sve-tail-folding-cost.ll
  Transforms/LoopVectorize/RISCV/tail-folding-cost.ll
2025-03-21 09:24:56 +00:00
Florian Hahn
c73ad7ba20
[VPlan] Add transformation to narrow interleave groups.
This patch adds a new narrowInterleaveGroups transform, which tries to
convert a plan with interleave groups with VF elements to a plan that
instead replaces the interleave groups with wide loads and stores
processing VF elements.

This effectively is a very simple form of loop-aware SLP, where we
use interleave groups to identify candidates.

This initial version is quite restricted and hopefully serves as a
starting point for how to best model those kinds of transforms. For now
it only transforms load interleave groups feeding store groups.

Depends on #106431.

This lands the main parts of the approved
https://github.com/llvm/llvm-project/pull/106441 as suggested to break
things up a bit more.
2025-03-20 19:41:37 +00:00
Florian Hahn
11b8699572
[LV] Don't skip instrs with side-effects in reg pressure computation. (#126415)
calculateRegisterUsage adds end points for each user of an instruction
to Ends and ignores instructions not added to it, i.e. instructions with
no users.

This means things like stores aren't included, which in turn means
values that are only used in stores are also not included for
consideration. This means we underestimate the register usage in cases
where the only users are things like stores.

Update the code to not skip instructions without users (i.e. not in
Ends) if they have side-effects.

PR: https://github.com/llvm/llvm-project/pull/126415
2025-03-19 15:13:43 +00:00
Florian Hahn
3c554deaaa
[LV] Add reg-usage test with values only used by llvm.assume.
Add test checking we are not counting registers that are only used by
ephemeral users, like llvm.assume.
2025-03-19 12:17:50 +00:00
Florian Hahn
870f753f1f
[VPlan] Also materialize broadcasts for backedge-taken-counts (NFC).
Also include VPlan's BTC in the set of VPValues to materialize
broadcasts for, if it is used.
2025-03-18 22:35:18 +00:00
David Sherwood
f6b1b91a3d
[LV][NFC] Regenerate CHECK lines in some tests (#131799)
Regenerates CHECK lines in tests that are affected by
PR #130565 to aid reviews.
2025-03-18 14:38:01 +00:00
David Sherwood
2586e7fcd8
[LV][NFC] Tidy up partial reduction tests with filter-out-after option (#129047)
A few test files seemed to have been edited after using the
update_test_checks.py script, which can make life hard for
developers when trying to update these tests in future
patches. Also, the tests still had this comment at the top

; NOTE: Assertions have been autogenerated by ...

which could potentially be confusing, since they've not
strictly been auto-generated.

I've attempted to keep the spirit of the original tests by
excluding all CHECK lines after the scalar.ph IR block,
however I've done this by using a new option called
--filter-out-after to the update_test_checks.py script.
2025-03-18 11:39:55 +00:00
Florian Hahn
ee29e16135
[LV] Reorganize tests for narrowing interleave group transform.
Make test target-dependent, as they will require access to a concrete
vector register width. Also add new tests for cost modeling, unrolling
and removing the vector loop region.
2025-03-16 19:18:47 +00:00
Florian Hahn
6a8d5f22ff
[VPlan] Don't access canonical IV in VPWidenPointerInduction::execute.
This updates VPWidenPointerInductionRecipe::execute to not use the
canonical IV to determine the insert point. Instead, it relies on the
current recipe position. In cases where this is not sufficient, set the
insert point to the first non-phi instruction, to ensure phis are
created together.
2025-03-15 21:32:48 +00:00
Florian Hahn
aadfa9f6c8
[LV] Add additional tests for narrowing interleave groups.
Extend test coverage for https://github.com/llvm/llvm-project/pull/106441.
2025-03-15 21:13:49 +00:00
Florian Hahn
37a57ca257
[FMF] Set all bits if needed when setting individual flags. (#131321)
Currently isFast() won't return true if all flags are set via setXXX,
which is surprising. Update setters to set all bits if needed to make
sure isFast() consistently returns the expected result.

PR: https://github.com/llvm/llvm-project/pull/131321
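
The idea can be sketched with a toy flag set. This is not LLVM's FastMathFlags: the class `ToyFlags`, its three flags, and the assumption that the "fast" state is represented by one canonical all-bits-set value are all made up for illustration.

```cpp
#include <cstdint>

// Simplified, hypothetical model of a flag set (not LLVM's FastMathFlags).
class ToyFlags {
  static constexpr std::uint32_t NoNaNs = 1u << 0;
  static constexpr std::uint32_t NoInfs = 1u << 1;
  static constexpr std::uint32_t Reassoc = 1u << 2;
  static constexpr std::uint32_t AllNamed = NoNaNs | NoInfs | Reassoc;

  std::uint32_t Bits = 0;

  void setBit(std::uint32_t Bit, bool On) {
    if (On) {
      Bits |= Bit;
      // Once every named flag is set, switch to the canonical "fast" form
      // so that isFast() agrees with setFast().
      if ((Bits & AllNamed) == AllNamed)
        Bits = ~0u;
    } else {
      // Clearing a flag also drops the unused high bits again.
      Bits &= AllNamed & ~Bit;
    }
  }

public:
  void setFast() { Bits = ~0u; }
  bool isFast() const { return Bits == ~0u; }

  void setNoNaNs(bool On = true) { setBit(NoNaNs, On); }
  void setNoInfs(bool On = true) { setBit(NoInfs, On); }
  void setReassoc(bool On = true) { setBit(Reassoc, On); }
};
```

Without the canonicalization step in `setBit`, setting the three flags one by one would leave `Bits` at 0b111 and `isFast()` would report false, which is the kind of surprise the commit message describes.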
2025-03-15 18:46:26 +00:00
Florian Hahn
56b05a0d6b
[VPlan] Use VFxUF in VPWidenPointerInductionRecipe.
Use VFxUF VPValue instead of computing VF * UF explicitly.
2025-03-15 18:18:53 +00:00
David Sherwood
3b6d0093aa
[LV][NFC] Refactor code for extracting first active element (#131118)
Refactor the code to extract the first active element of a
vector in the early exit block, in preparation for PR #130766.
I've replaced the VPInstruction::ExtractFirstActive nodes with
a combination of a new VPInstruction::FirstActiveLane node and
an Instruction::ExtractElement node.
2025-03-14 11:14:09 +00:00
Florian Hahn
02575f887b
[VPlan] Use VPInstruction for VPScalarPHIRecipe. (NFCI) (#129767)
Now that all phi nodes manage their incoming blocks through the
VPlan-predecessors, there should be no need for having a dedicated
recipe, it should be sufficient to allow PHI opcodes in VPInstruction.

Follow-ups will also migrate VPWidenPHIRecipe and possibly others,
building on top of https://github.com/llvm/llvm-project/pull/129388.

PR: https://github.com/llvm/llvm-project/pull/129767
2025-03-13 18:35:07 +00:00
Mel Chen
5d5e706691
[VPlan] Restrict hoisting of broadcast operations using VPDominatorTree (#117138)
This patch restricts broadcast operations from being hoisted to the vector
preheader unless the basic block that defines the broadcasted value properly
dominates the vector preheader.

This prevents potential use-before-definition issues when the broadcasted
value is defined within the plan. VPDominatorTree is used to confirm this
restriction while still allowing safe hoisting for broadcasted values defined
outside the plan.

Issue https://github.com/llvm/llvm-project/issues/117139
2025-03-13 07:16:04 -07:00
Mel Chen
ffe202ca00
Revert "[LV] Limits the splat operations be hoisted must not be defined by a recipe. (#117138)"
This reverts commit 1ff10fa82fff83bb2f0a5c1ffde6203b52bc9619.
2025-03-13 07:16:04 -07:00
Florian Hahn
8132c4f554
[VPlan] Also introduce broadcasts for live-ins used in vec preheader.
Slightly generalize materializeLiveInBroadcasts to also introduce
broadcasts for live-ins used in the vector preheader. This should cover
all live-ins.

If the live-in is used in the vector preheader, insert the broadcast at
the beginning of the block.
2025-03-11 21:19:14 +00:00
David Sherwood
26ecf97895
[LoopVectorize] Further improve cost model for early exit loops (#126235)
Following on from #125058, this patch takes into account the
work done in the vector early exit block when assessing the
profitability of vectorising the loop. I have renamed
areRuntimeChecksProfitable to isOutsideLoopWorkProfitable and
we now pass in the early exit costs. As part of this, I have
added the ExtractFirstActive opcode to VPInstruction::computeCost.

It's worth pointing out that when we assess profitability of the
loop we calculate a minimum trip count and compare that against
the *maximum* trip count. However, since the loop has an early
exit the runtime trip count can still end up being less than the
minimum. Alternatively, we may never take the early exit at all
at runtime and so we have the opposite problem of over-estimating
the cost of the loop. The loop vectoriser cannot simultaneously
take two contradictory positions and so I feel the only sensible
thing to do is be conservative and assume the loop will be more
expensive than loops without early exits.

We may find in future that we need to adjust the cost according to
the probability of taking the early exit. This will become even
more important once we support multiple early exits. However, we
have to start somewhere and we can always revisit this later.
2025-03-11 11:48:55 +00:00
Mel Chen
1ff10fa82f
[LV] Limits the splat operations be hoisted must not be defined by a recipe. (#117138)
Issue https://github.com/llvm/llvm-project/issues/117139
2025-03-11 17:59:12 +08:00
Sushant Gokhale
c4808741e8
[AArch64][CostModel] Alter sdiv/srem cost where the divisor is constant (#123552)
This patch revises the cost model for sdiv/srem and draws its inspiration from the udiv/urem patch #122236

The typical codegen for the different scenarios is documented in comments in the code itself, since there are too many scenarios to describe them in this patch description.
2025-03-09 22:26:39 -07:00
Florian Hahn
8dd160f476
Revert "[VPlan] Fold NOT into predicate of wide compares." (#130347)
Reverts llvm/llvm-project#129430

This seems to have introduced a divergence between the legacy and
VPlan-based cost models.

https://lab.llvm.org/buildbot/#/builders/30/builds/17159
2025-03-07 21:18:49 +00:00
Florian Hahn
cb3ce30ca8
[VPlan] Fold NOT into predicate of wide compares. (#129430)
Add simplification to fold negation into a compare, if the negation is
the only user of the compare. This removes a number of redundant
negations.

Alive2 Proofs for FPCMP test changes:  https://alive2.llvm.org/ce/z/WGDz9U

PR: https://github.com/llvm/llvm-project/pull/129430
2025-03-07 20:32:43 +00:00
Ramkumar Ramachandra
ddffb74afd
[LV] Strip unreachable SCEV-check blocks (#130079)
emitSCEVChecks checks if SCEVCheckCond matches zero, and returns
nullptr. However, it sets SCEVCheckCond as used before it does this,
which prevents it from being removed during cleanup, resulting in
unreachable blocks being emitted. Fix this.
2025-03-06 19:30:25 +00:00
Ramkumar Ramachandra
03da079968
[LoopUtils] Saturate at INT_MAX when estimating TC (#129683)
getLoopEstimatedTripCount returns std::nullopt when the trip count would
overflow the return type, but since it is an estimate anyway, we might
as well saturate at UINT_MAX, improving results.
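
A minimal sketch of the saturating behaviour, assuming the estimate is first computed in 64 bits and then returned as `unsigned`. The helper name `saturateTripCount` is hypothetical and is not the LoopUtils code.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper: clamp a 64-bit trip-count estimate into `unsigned`
// instead of giving up when it does not fit.
unsigned saturateTripCount(std::uint64_t Estimate) {
  constexpr std::uint64_t Max = std::numeric_limits<unsigned>::max();
  return Estimate > Max ? static_cast<unsigned>(Max)
                        : static_cast<unsigned>(Estimate);
}
```

Since the value is only an estimate, a caller that asks "is the trip count at least N?" still gets a useful answer from the clamped value.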
2025-03-05 18:19:39 +00:00
Ramkumar Ramachandra
80bdfcd411
[LoopUtils] Don't wrap in getLoopEstimatedTripCount (#129080)
getLoopEstimatedTripCount returns the trip count based on profiling
data, and its documentation says that it could return 0 when the trip
count is zero, but this is not the case: a valid trip count can never be
zero, and it returns 0 when the unsigned ExitCount is incremented by 1
and wraps. Some callers are careful about checking for this zero value
in an std::optional, but it makes for an API with footguns, as a
std::optional return value indicates that a non-nullopt value would be a
valid trip count. Fix this by explicitly returning std::nullopt when the
return value would wrap, and strip additional checks in callers. This
also fixes a minor bug in LoopVectorize.
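
The wrap described above can be sketched in isolation. The function name `tripCountFromExitCount` and the 32-bit width are assumptions for this example, not the actual LoopUtils signature.

```cpp
#include <cstdint>
#include <limits>
#include <optional>

// Hypothetical helper: trip count = exit count + 1. If the increment would
// wrap to 0, report "unknown" instead of returning an invalid trip count.
std::optional<std::uint32_t> tripCountFromExitCount(std::uint32_t ExitCount) {
  if (ExitCount == std::numeric_limits<std::uint32_t>::max())
    return std::nullopt; // ExitCount + 1 would wrap around to 0.
  return ExitCount + 1;
}
```

With this shape, any value held by the optional is a valid trip count, so callers no longer need a separate check for zero.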
2025-03-04 08:43:08 +00:00
Benjamin Maxwell
89e7f4d31b
[LV] Teach the vectorizer to cost and vectorize modf and sincospi intrinsics (#129064)
Follow on to #128035. It is a small extension to support vectorizing
`llvm.modf.*` and `llvm.sincospi.*` too.

This renames the test files from `sincos.ll` ->
`multiple-result-intrinsics.ll` to group together the similar tests
(which make up most of this PR).
2025-02-28 12:56:12 +00:00
Florian Hahn
1e1b9bccc0
[VPlan] Simplify BLEND %a, %b, NOT(%m) -> BLEND %b, %a, %m. (#128375)
Avoid negations for normalized blends by reordering operands.

PR: https://github.com/llvm/llvm-project/pull/128375
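
The rewrite relies on a simple lane-wise identity, which the following standalone sketch checks. It assumes, for illustration, that a blend picks the second value where the mask is set and the first value otherwise; `Vec`, `Mask`, `blend` and `maskNot` are names invented for this example.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t VF = 4; // Illustrative vector width.
using Vec = std::array<int, VF>;
using Mask = std::array<bool, VF>;

// Assumed semantics for this sketch: lane I is B[I] if M[I], else A[I].
Vec blend(const Vec &A, const Vec &B, const Mask &M) {
  Vec R{};
  for (std::size_t I = 0; I < VF; ++I)
    R[I] = M[I] ? B[I] : A[I];
  return R;
}

// Lane-wise negation of a mask.
Mask maskNot(const Mask &M) {
  Mask R{};
  for (std::size_t I = 0; I < VF; ++I)
    R[I] = !M[I];
  return R;
}
```

Under these semantics `blend(A, B, maskNot(M))` and `blend(B, A, M)` select the same lanes, so swapping the operands makes the negation unnecessary.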
2025-02-27 17:43:24 +00:00
Florian Hahn
649f4dcc19
[LV] Fix tests after 8150ab93f741.
PR #124119 wasn't rebased & tested before merging. Update the failing
tests.
2025-02-27 12:15:24 +00:00
John Brawn
8150ab93f7
[LoopVectorize] Use CodeSize as the cost kind for minsize (#124119)
Functions marked with minsize should aim for minimum code size, so the
vectorizer should use CodeSize for the cost kind and also the cost we
compare should be the cost for the entire loop: it shouldn't be divided
by the number of vector elements and block costs shouldn't be divided by
the block probability.

Possibly we should also be doing this for optsize as well, but there are
a lot of tests that assume the current behaviour and the definition of
optsize is less clear than minsize (for minsize the goal is to "keep the
code size of this function as small as possible" whereas for optsize
it's "keep the code size of this function low").
2025-02-27 11:07:02 +00:00
Benjamin Maxwell
3307b0374a
[LV] Teach the loop vectorizer llvm.sincos is trivially vectorizable (#128035)
Depends on #123210
2025-02-27 09:37:06 +00:00
Florian Hahn
4277c21059
[VPlan] Introduce explicit broadcasts for live-ins. (#124644)
Add a new VPInstruction::Broadcast opcode and use it to materialize
explicit broadcasts of live-ins. The initial patch only materializes the
broadcasts if the vector preheader dominates all uses that need it.
Later patches will pick the best valid insert point, thus retiring
implicit hoisting of broadcasts from VPTransformState::get().

PR: https://github.com/llvm/llvm-project/pull/124644
2025-02-26 13:57:51 +00:00
Florian Hahn
52ded67249
[LAA] Always require non-wrapping pointers for runtime checks. (#127543)
Currently we only check if the pointers involved in runtime checks do
not wrap if we need to perform dependency checks. If that's not the
case, we generate runtime checks, even if the pointers may wrap (see
test/Analysis/LoopAccessAnalysis/runtime-checks-may-wrap.ll).

If the pointer wraps, then we swap start and end of the runtime check,
leading to incorrect checks.

An Alive2 proof of what the runtime checks are checking conceptually (on
i4 to have it complete in reasonable time) showing the incorrect result
should be https://alive2.llvm.org/ce/z/KsHzn8

Depends on https://github.com/llvm/llvm-project/pull/127410 to avoid
more regressions.

PR: https://github.com/llvm/llvm-project/pull/127543
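
To see why wrapping bounds break an interval-style overlap check, consider a toy model on a 4-bit address space, mirroring the i4 setting of the proof. This is a simplified sketch, not the LoopAccessAnalysis implementation: `Range`, `boundsOf`, `mayOverlap` and `reallyOverlaps` are hypothetical names.

```cpp
constexpr unsigned AddrMask = 0xF; // 4-bit address space.

struct Range {
  unsigned Low;  // First address accessed.
  unsigned High; // One past the last address, computed with wrapping.
};

// Bounds of an access of Len bytes starting at Start, using wrapping math.
Range boundsOf(unsigned Start, unsigned Len) {
  return {Start & AddrMask, (Start + Len) & AddrMask};
}

// Interval-style check; only meaningful if neither range wraps.
bool mayOverlap(Range A, Range B) {
  return A.Low < B.High && B.Low < A.High;
}

// Ground truth: enumerate every address touched by both accesses.
bool reallyOverlaps(unsigned StartA, unsigned LenA, unsigned StartB,
                    unsigned LenB) {
  for (unsigned I = 0; I < LenA; ++I)
    for (unsigned J = 0; J < LenB; ++J)
      if (((StartA + I) & AddrMask) == ((StartB + J) & AddrMask))
        return true;
  return false;
}
```

An access of 4 bytes starting at address 14 touches 14, 15, 0 and 1, so it overlaps an access of 2 bytes at address 0. Its wrapped upper bound is 2, however, and `mayOverlap` reports no conflict, which is why the pointers must be known not to wrap before such a check can be trusted.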
2025-02-20 19:00:23 +01:00
Benjamin Maxwell
e0e67a6207
[LV] Add initial support for vectorizing literal struct return values (#109833)
This patch adds initial support for vectorizing literal struct return
values. Currently, this is limited to the case where the struct is
homogeneous (all elements have the same type) and not packed. The users
of the call also must all be `extractvalue` instructions.

The intended use case for this is vectorizing intrinsics such as:

```
declare { float, float } @llvm.sincos.f32(float %x)
```

Mapping them to structure-returning library calls such as:

```
declare { <4 x float>, <4 x float> } @Sleef_sincosf4_u10advsimd(<4 x float>)
```

Or their widened form (such as `@llvm.sincos.v4f32` in this case).

Implementing this required two main changes:

1. Supporting widening `extractvalue`
2. Adding support for vectorized struct types in LV
  * This is mostly limited to parts of the cost model and scalarization

Since the supported use case is narrow, the required changes are
relatively small.
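
The shape of the transformation can be mirrored in plain C++: a scalar call returning a two-field struct becomes a wide call returning a struct of vectors, and each field access plays the role of an `extractvalue`. The types `SinCos` and `WideSinCos`, the functions `sincosScalar` and `sincosWide`, and the fixed width of 4 are assumptions for this sketch; the wide version simply loops over lanes rather than calling a real vector library.

```cpp
#include <array>
#include <cmath>
#include <cstddef>

constexpr std::size_t VF = 4; // Illustrative vector width.

// Models the scalar return type { float, float }.
struct SinCos {
  float Sin;
  float Cos;
};

// Models the widened return type { <4 x float>, <4 x float> }.
struct WideSinCos {
  std::array<float, VF> Sin;
  std::array<float, VF> Cos;
};

SinCos sincosScalar(float X) { return {std::sin(X), std::cos(X)}; }

// Stand-in for a vector library call: one struct of vectors for VF inputs.
WideSinCos sincosWide(const std::array<float, VF> &X) {
  WideSinCos R{};
  for (std::size_t I = 0; I < VF; ++I) {
    SinCos S = sincosScalar(X[I]);
    R.Sin[I] = S.Sin; // Field 0 of the scalar result goes to lane I.
    R.Cos[I] = S.Cos; // Field 1 of the scalar result goes to lane I.
  }
  return R;
}
```

Because both fields have the same element type, each field widens to the same vector type, which matches the homogeneous-struct restriction mentioned above.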
2025-02-17 09:51:35 +00:00
Florian Hahn
e5f5517f91
[VPlan] Create IR basic block for middle.block in VPlan.
Create an IR BB directly for the middle.block, instead of creating the IR
BB during skeleton creation and then replacing the middle VPBB with a
VPIRBB.

This moves another part of skeleton creation to VPlan and simplifies
the code slightly by removing code to disconnect the middle block and
vector preheader + the corresponding DT update.

NFC modulo IR block naming and block creation order, which changes the
IR names for the blocks.
2025-02-15 21:54:16 +01:00
Nicholas Guy
9c89faa62b
[LoopVectorizer][AArch64] Add support for partial reduce subtraction (#123636)
2025-02-13 10:35:45 +00:00