300 Commits

Author SHA1 Message Date
David Sherwood
4e69258bf3
[LoopVectorize] Add cost of generating tail-folding mask to the loop (#130565)
At the moment if we decide to enable tail-folding we do not include
the cost of generating the mask per VF. This can mean we make some
poor choices of VF, which is definitely true for SVE-enabled AArch64
targets where mask generation for fixed-width vectors is more
expensive than for scalable vectors.

I've added a VPInstruction::computeCost function to return the costs
of the ActiveLaneMask and ExplicitVectorLength operations.
Unfortunately, in order to prevent asserts firing I've also had to
duplicate the same code in the legacy cost model to make sure the
chosen VFs match up. I've wrapped this up in a ifndef NDEBUG for
now. The alternative would be to disable the assert completely when
tail-folding, which I imagine is just as bad.

New tests added:

  Transforms/LoopVectorize/AArch64/sve-tail-folding-cost.ll
  Transforms/LoopVectorize/RISCV/tail-folding-cost.ll
2025-03-21 09:24:56 +00:00
Luke Lau
01f04252b6
[LV] Get FMFs from VectorBuilder in createSimpleReduction. NFC (#132017)
The other createSimpleReduction takes the FMFs from the IRBuilder, so
this aligns the VectorBuilder variant to do the same and reduce the
possibility of there being a mismatch in flags.
2025-03-20 16:38:56 +08:00
Luke Lau
f536f71580
[LV] Split RecurrenceDescriptor into RecurKind + FastMathFlags in LoopUtils. NFC (#132014)
Split off from #131300, this splits up RecurrenceDescriptor arguments so
that arbitrary recurrence kinds may be used down the line.
2025-03-19 22:56:57 +08:00
Luke Lau
a4dc02c0e7
[VPlan] Rename VPReverseVectorPointerRecipe to VPVectorEndPointerRecipe. NFC (#131086)
After #128718 lands there will be two ways of performing a reversed
widened memory access, either by performing a consecutive unit-stride
access and a reverse, or a strided access with a negative stride.

Even though both produce a reversed vector, only the former needs
VPReverseVectorPointerRecipe which computes a pointer to the last
element of each part. A strided reverse still needs a pointer to the
first element of each part so it will use VPVectorPointerRecipe.

This renames VPReverseVectorPointerRecipe to VPVectorEndPointerRecipe to
clarify that a reversed access may not necessarily need a pointer to the
last element.
2025-03-19 00:09:15 +08:00
Elvis Wang
ed19620b8c
[VPlan] Make VPReductionRecipe a VPRecipeWithIRFlags. NFC (#130881)
This patch change the parent of the VPReductionRecipe from
VPSingleDefRecipe to VPRecipeWithIRFlags and also print/get/drop/control
flags by the VPRecipeWithIRFlags. This will remove the dependency of the
underlying instruction.

This patch also add a new function `setFastMathFlags()` to the
VPRecipeWithIRFlags because the entire reduction chain may contains
multiple instructions. And the underlying instruction may not contains
the corresponding flags for this reduction.

Split from #113903.
2025-03-18 10:08:23 +08:00
Luke Lau
67f1c033b8
[VPlan] Remove createReduction. NFCI (#131336)
This is split off from #131300.

A VPReductionRecipe will never have a AnyOf or FindLastIV recurrence, so
when it calls createReduction it always calls createSimpleReduction.

If we replace the call then it leaves createReduction with one user in
VPInstruction::ComputeReductionResult, which we can inline and then
remove.
2025-03-18 00:18:15 +08:00
Florian Hahn
4e9894498e
[VPlan] Truncate VFxUF if needed in VPWidenPointerInduction::execute.
Create truncate if needed after 56b05a0d6. Note that this preserves the
original behavior pre 56b05a0d6. If truncate would strip any set bits,
then the explicit computation in the narrower type would wrap.
2025-03-16 11:37:58 +00:00
Florian Hahn
0bd8a75a0c
[VPlan] Fix formatting after 6a8d5f22f. 2025-03-15 21:33:30 +00:00
Florian Hahn
6a8d5f22ff
[VPlan] Don't access canonical IV in VPWidenPointerInduction::execute.
This updates VPWidenPointerInductionRecipe::execute to not use the
canonical IV to determine the insert point. Instead, it relies on the
current recipe position. In cases where this is not sufficient, set the
insert point to the first non-phi instruction, to ensure phis are
created together.
2025-03-15 21:32:48 +00:00
Florian Hahn
56b05a0d6b
[VPlan] Use VFxUF in VPWidenPointerInductionRecipe.
Use VFxUF VPValue instead of computing VF * UF explicitly.
2025-03-15 18:18:53 +00:00
Florian Hahn
8ff27bbd84
[VPlan] Remove unneeded select in VPWidenPointerInductionRecipe (NFC). 2025-03-15 17:34:23 +00:00
David Sherwood
3b6d0093aa
[LV][NFC] Refactor code for extracting first active element (#131118)
Refactor the code to extract the first active element of a
vector in the early exit block, in preparation for PR #130766.
I've replaced the VPInstruction::ExtractFirstActive nodes with
a combination of a new VPInstruction::FirstActiveLane node and
a Instruction::ExtractElement node.
2025-03-14 11:14:09 +00:00
Luke Lau
26324bc1bf
[VPlan] Move FOR splice cost into VPInstruction::FirstOrderRecurrenceSplice (#129645)
After #124093 we now support fixed-order recurrences with EVL tail
folding by replacing VPInstruction::FirstOrderRecurrenceSplice with a VP
splice intrinsic.

However the costing for the splice is currently done in
VPFirstOrderRecurrencePHIRecipe, so when we add the VP splice intrinsic
we end up costing it twice.

This fixes it by splitting out the cost for the splice into
FirstOrderRecurrenceSplice so that it's not duplicated when we replace
it.

We still have to keep the VF=1 checks in VPFirstOrderRecurrencePHIRecipe
since the splice might end up dead and discarded, e.g. in the test
@pr97452_scalable_vf1_for.
2025-03-14 15:33:32 +08:00
Florian Hahn
02575f887b
[VPlan] Use VPInstruction for VPScalarPHIRecipe. (NFCI) (#129767)
Now that all phi nodes manage their incoming blocks through the
VPlan-predecessors, there should be no need for having a dedicate
recipe, it should be sufficient to allow PHI opcodes in VPInstruction.

Follow-ups will also migrate VPWidenPHIRecipe and possibly others,
building on top of https://github.com/llvm/llvm-project/pull/129388.

PR: https://github.com/llvm/llvm-project/pull/129767
2025-03-13 18:35:07 +00:00
Florian Hahn
dc23234a66
[VPlan] Remove dead code in VPWidenPHIRecipe::print (NFC).
All incoming models for VPWidenPHIRecipe are modled in VPlan, remove
code trying to print the orignial phi.
2025-03-11 21:44:04 +00:00
David Sherwood
26ecf97895
[LoopVectorize] Further improve cost model for early exit loops (#126235)
Following on from #125058, this patch takes into account the
work done in the vector early exit block when assessing the
profitability of vectorising the loop. I have renamed
areRuntimeChecksProfitable to isOutsideLoopWorkProfitable and
we now pass in the early exit costs. As part of this, I have
added the ExtractFirstActive opcode to VPInstruction::computeCost.

It's worth pointing out that when we assess profitability of the
loop we calculate a minimum trip count and compare that against
the *maximum* trip count. However, since the loop has an early
exit the runtime trip count can still end up being less than the
minimum. Alternatively, we may never take the early exit at all
at runtime and so we have the opposite problem of over-estimating
the cost of the loop. The loop vectoriser cannot simultaneously
take two contradictory positions and so I feel the only sensible
thing to do is be conservative and assume the loop will be more
expensive than loops without early exits.

We may find in future that we need to adjust the cost according to
the probability of taking the early exit. This will become even
more important once we support multiple early exits. However, we
have to start somewhere and we can always revisit this later.
2025-03-11 11:48:55 +00:00
Ramkumar Ramachandra
809a33d7ab
[VPlan] Clean up unused header (NFC) (#130599)
We use VPlanPatternMatch, and not PatternMatch: clean up the
PatternMatch include to avoid confusion.
2025-03-10 17:57:11 +00:00
Luke Lau
6d89c042e3
[VPlan] Remove dead AnyOf reduction case in VPReductionRecipe. NFCI (#130048)
From what I understand, we only create VPReductionRecipes for in-loop
reductions, and we don't currently support in-loop AnyOf reductions.

We only create VPReductionRecipes in the !PhiR->isInLoop() section of
adjustRecipesForReductions, and this comment from the initial patch
seems to confirm this
https://reviews.llvm.org/D108136#anchor-inline-1038338, so I think we
can remove this check in the condition logic.

I checked compiling SPEC 2017 with -prefer-inloop-predicates and the
added assertion doesn't trigger.
2025-03-07 01:05:53 +08:00
Luke Lau
47fb9c4bb9
[VPlan] Add Name argument to VPWidenPHIRecipe. NFC (#129527)
This allows a different IR name for the generated phi to be used. This
is split off from #118638 and helps remove some of the diffs in it.
2025-03-04 16:47:21 +08:00
Florian Hahn
6ce41db6b0
[VPlan] Preserve DebugLoc for VPBranchOnMaskRecipe.
Update code to set and generate debug location for branch recipe
2025-02-27 20:19:42 +00:00
Florian Hahn
253d691596
[VPlan] Update VPBranchOnMaskRecipe to always set the mask (NFC).
The mask is always available at construction time. Make it non-optional
to simlpify code.
2025-02-27 18:53:24 +00:00
Benjamin Maxwell
3307b0374a
[LV] Teach the loop vectorizer llvm.sincos is trivially vectorizable (#128035)
Depends on #123210
2025-02-27 09:37:06 +00:00
Florian Hahn
4277c21059
[VPlan] Introduce explicit broadcasts for live-ins. (#124644)
Add a new VPInstruction::Broadcast opcode and use it to materialize
explicit broadcasts of live-ins. The initial patch only materlizes the
broadcasts if the vector preheader dominates all uses that need it.
Later patches will pick the best valid insert point, thus retiring
implicit hoisting of broadcasts from VPTransformsState::get().

PR: https://github.com/llvm/llvm-project/pull/124644
2025-02-26 13:57:51 +00:00
Luke Lau
e23ab73335
[VPlan] Don't convert widen recipes to VP intrinsics in EVL transform (#127180)
This is a copy of #126177, since it was automatically and permanently
closed because I messed up the source branch on my remote

This patch proposes to avoid converting widening recipes to VP
intrinsics during the EVL transform.

IIUC we initially did this to avoid `vl` toggles on RISC-V. However we
now have the RISCVVLOptimizer pass which mostly makes this redundant.

Emitting regular IR instead of VP intrinsics allows more generic
optimisations, both in the middle end and DAGCombiner, and we generally
have better patterns in the RISC-V backend for non-VP nodes. Sticking to
regular IR instructions is likely a lot less work than reimplementing
all of these optimisations for VP intrinsics, and on SPEC CPU 2017 we get
noticeably better code generation.
2025-02-22 19:38:11 +08:00
Florian Hahn
6c627831f9
[VPlan] Use VPlan predecessors in VPWidenPHIRecipe (NFC). (#126388)
Update VPWidenPHIRecipe to use the predecessors in VPlan to determine
the incoming blocks instead of tracking them separately. This brings
VPWidenPHIRecipe in line with the other phi recipes.

PR: https://github.com/llvm/llvm-project/pull/126388
2025-02-17 16:40:37 +01:00
Benjamin Maxwell
e0e67a6207
[LV] Add initial support for vectorizing literal struct return values (#109833)
This patch adds initial support for vectorizing literal struct return
values. Currently, this is limited to the case where the struct is
homogeneous (all elements have the same type) and not packed. The users
of the call also must all be `extractvalue` instructions.

The intended use case for this is vectorizing intrinsics such as:

```
declare { float, float } @llvm.sincos.f32(float %x)
```

Mapping them to structure-returning library calls such as:

```
declare { <4 x float>, <4 x float> } @Sleef_sincosf4_u10advsimd(<4 x float>)
```

Or their widened form (such as `@llvm.sincos.v4f32` in this case).

Implementing this required two main changes:

1. Supporting widening `extractvalue`
2. Adding support for vectorized struct types in LV
  * This is mostly limited to parts of the cost model and scalarization

Since the supported use case is narrow, the required changes are
relatively small.
2025-02-17 09:51:35 +00:00
Nicholas Guy
9c89faa62b
[LoopVectorizer][AArch64] Add support for partial reduce subtraction (#123636) 2025-02-13 10:35:45 +00:00
Florian Hahn
e258bca950
[VPlan] Only skip expansion for SCEVUnknown if it isn't an instruction. (#125235)
Update getOrCreateVPValueForSCEVExpr to only skip expansion of
SCEVUnknown if the underlying value isn't an instruction. Instructions
may be defined in a loop and using them without expansion may break
LCSSA form. SCEVExpander will take care of preserving LCSSA if needed.

We could also try to pass LoopInfo, but there are some users of the
function where it won't be available and main benefit from skipping
expansion is slightly more concise VPlans.

Note that SCEVExpander is now used to expand SCEVUnknown with floats.
Adjust the check in expandCodeFor to only check the types and casts if
the type of the value is different to the requested type. Otherwise we
crash when trying to expand a float and requesting a float type.

Fixes https://github.com/llvm/llvm-project/issues/121518.

PR: https://github.com/llvm/llvm-project/pull/125235
2025-02-11 13:03:12 +01:00
Hassnaa Hamdi
e9a20f77ee
Reland "[LV]: Teach LV to recursively (de)interleave." (#125094)
This patch relands the changes from "[LV]: Teach LV to recursively
(de)interleave.#122989"
    Reason for revert:
- The patch exposed an assert in the vectorizer related to VF difference
between
legacy cost model and VPlan-based cost model because of uncalculated
cost for
VPInstruction which is created by VPlanTransforms as a replacement to
'or disjoint'
       instruction.
VPlanTransforms do that instructions change when there are memory
interleaving and
predicated blocks, but that change didn't cause problems because at most
cases the cost
      difference between legacy/new models is not noticeable.
    - Issue is fixed by #125434

Original patch: https://github.com/llvm/llvm-project/pull/89018
Reviewed-by: paulwalker-arm, Mel-Chen
2025-02-09 19:21:54 +00:00
Florian Hahn
32c4493d5f
[VPlan] Add incoming values for all predecessor to ResumePHI (NFCI).
Follow-up as discussed when using VPInstruction::ResumePhi for all resume
values (#112147). This patch explicitly adds incoming values for each
predecessor in VPlan. This simplifies codegen and allows transformations
adjusting the predecessors of blocks with

NFC modulo incoming block order in phis.
2025-02-09 11:20:20 +00:00
Florian Hahn
1611059f5d
[VPlan] Compute cost for binary op VPInstruction with underlying values. (#125434)
As exposed by https://github.com/llvm/llvm-project/pull/125094, we are
missing cost computation for some binary VPInstructions we created based
on original IR instructions. Their cost should be considered.

PR: https://github.com/llvm/llvm-project/pull/125434
2025-02-07 15:27:31 +00:00
James Chesterman
ac158aa13b
[LoopVectorizer] Allow partial reductions to be made in predicated loops (#124268)
Does a select on the input rather than the output. This way the mask has
the same number of lanes as the other operand in the select instruction.
2025-02-07 09:09:10 +00:00
David Sherwood
f07cd36a5d
[LoopVectorize] Add the cost of VPInstruction::AnyOf to vplan (#125058)
This patch adds an initial implementation of
VPInstruction::computeCost with support for only one
instruction so far - VPInstruction::AnyOf. This is only
used when vectorising loops with uncountable early exits.
2025-02-05 16:31:14 +00:00
Florian Hahn
5008277322
[VPlan] Move auxiliary declarations out of VPlan.h (NFC). (#124104)
Nothing in VPlan.h directly depends on VPTransformState, VPCostContext,
VPFRange, VPlanPrinter or VPSlotTracker. Move them out to a separate
header to reduce the size of widely used VPlan.h.

This is a first step towards more cleanly separating declarations in
VPlan.

Besides reducing VPlan.h's size, this also allows including additional
VPlan-related headers in VPlanHelpers.h for use there. An example is
using VPDominatorTree in VPTransformState
(https://github.com/llvm/llvm-project/pull/117138).

PR: https://github.com/llvm/llvm-project/pull/124104
2025-02-02 13:44:07 +00:00
David Sherwood
3bc2dade36
[LoopVectorize] Enable vectorisation of early exit loops with live-outs (#120567)
This work feeds part of PR
https://github.com/llvm/llvm-project/pull/88385, and adds support for
vectorising
loops with uncountable early exits and outside users of loop-defined
variables. When calculating the final value from an uncountable early
exit we need to calculate the vector lane that triggered the exit,
and hence determine the value at the point we exited.

All code for calculating the last value when exiting the loop early
now lives in a new vector.early.exit block, which sits between the
middle.split block and the original exit block. Doing this required
two fixes:

1. The vplan verifier incorrectly assumed that the block containing
a definition always dominates the block of the user. That's not true
if you can arrive at the use block from multiple incoming blocks.
This is possible for early exit loops where both the early exit and
the latch jump to the same block.
2. We were adding the new vector.early.exit to the wrong parent loop.
It needs to have the same parent as the actual early exit block from
the original loop.

I've added a new ExtractFirstActive VPInstruction that extracts the
first active lane of a vector, i.e. the lane of the vector predicate
that triggered the exit.

NOTE: The IR generated for dealing with live-outs from early exit
loops is unoptimised, as opposed to normal loops. This inevitably
leads to poor quality code, but this can be fixed up later.
2025-01-30 10:37:00 +00:00
Florian Hahn
713482fccf [VPlan] Use State.get to extract lane mask for BranchOnMask.
Simplifies the code slightly and avoids redundant extracts/broadcasts if
the operand is live-in or already scalar.
2025-01-27 21:35:36 +00:00
Florian Hahn
09a29fcc8d
[VPlan] Don't collect live-ins in collectUsersInExitBlocks. (NFC) (#123819)
Live-ins don't need to be handled, other than adding to the exit phi
recipe. Do that early and assert that otherwise the exit value is
defined in the vector loop region.

This should enable simply skipping other exit values that do not need
further fixing, e.g. if handling the exit value from the early exit
directly in handleUncountableEarlyExit.

PR: https://github.com/llvm/llvm-project/pull/123819
2025-01-27 16:12:07 +00:00
Elvis Wang
aff1242b8e
[LV] Align debug location of the widen-phi to the original phi. (#120338)
This patch align the debug location of the widen-phi to the debug
location of original phi.

Split from: #120054
2025-01-24 17:49:54 +08:00
Florian Hahn
6c787ff6cf
Revert "[LV]: Teach LV to recursively (de)interleave. (#122989)"
This reverts commit 9491f75e1d912b277247450d1c7b6d56f7faf885.

This triggers an assert when building with SVE enabled.
https://lab.llvm.org/buildbot/#/builders/143/builds/4795
2025-01-21 21:36:16 +00:00
John Brawn
edf3a55bce
[LoopVectorize][NFC] Centralize the setting of CostKind (#121937)
In each class which calculates instruction costs (VPCostContext,
LoopVectorizationCostModel, GeneratedRTChecks) set the CostKind once in
the constructor instead of in each function that calculates a cost. This
is in preparation for potentially changing the CostKind when compiling
for optsize.
2025-01-17 15:06:18 +00:00
Hassnaa Hamdi
9491f75e1d
Reland: [LV]: Teach LV to recursively (de)interleave. (#122989)
This commit relands the changes from "[LV]: Teach LV to recursively
(de)interleave. #89018"

Reason for revert:
- The patch exposed a bug in the IA pass, the bug is now fixed and landed by commit: #122643
2025-01-17 10:34:57 +00:00
LiqinWeng
0294dab79e
[LV][VPlan] Add fast flags for selectRecipe (#121023)
Change the inheritance of class VPWidenSelectRecipe to class
VPRecipeWithIRFlags, which allows recipe of the select to pass the
fastmath flags.The patch of #119847 will add the fastmath flag to for
recipe
2025-01-15 10:10:11 +08:00
Sam Tebbs
795e35a653
Reland "[LoopVectorizer] Add support for partial reductions" with non-phi operand fix. (#121744)
This relands the reverted #120721 with a fix for cases where neither
reduction operand are the reduction phi. Only
63114239cc8d26225a0ef9920baacfc7cc00fc58 and
63114239cc8d26225a0ef9920baacfc7cc00fc58 are new on top of the reverted
PR.

---------

Co-authored-by: Nicholas Guy <nicholas.guy@arm.com>
2025-01-13 11:20:35 +00:00
Luke Lau
f0d5104c94
[VPlan] Handle some VPInstructions in may{Read,Write}FromMemory (#120058)
This just copies the same conservative definition from mayWriteToMemory,
and enables more VPInstructions to be hoisted out in LICM.

I think this should give more accurate costs, and I was able to build
llvm-test-suite without the legacy-vplan cost model assertion going off.
2025-01-08 15:17:26 +08:00
Yingwei Zheng
a77346bad0
[IRBuilder] Refactor FMF interface (#121657)
Up to now, the only way to set specified FMF flags in IRBuilder is to
use `FastMathFlagGuard`. It makes the code ugly and hard to maintain.

This patch introduces a helper class `FMFSource` to replace the original
parameter `Instruction *FMFSource` in IRBuilder. To maximize the
compatibility, it accepts an instruction or a specified FMF.
This patch also removes the use of `FastMathFlagGuard` in some simple
cases.

Compile-time impact:
https://llvm-compile-time-tracker.com/compare.php?from=f87a9db8322643ccbc324e317a75b55903129b55&to=9397e712f6010be15ccf62f12740e9b4a67de2f4&stat=instructions%3Au
2025-01-06 14:37:04 +08:00
Florian Hahn
f4230b4332
[VPlan] Add and use debug location for VPScalarCastRecipe.
Update the recipe it always take a debug location and set it.
2025-01-05 20:08:51 +00:00
Florian Hahn
b06a45c66f
[VPlan] Add all blocks to outer loop if present during ::execute (NFCI).
This ensures that all blocks created during VPlan execution are properly
added to an enclosing loop, if present.

Split off from https://github.com/llvm/llvm-project/pull/108378 and also
needed once more of the skeleton blocks are created directly via VPlan.

This also allows removing the custom logic for early-exit loop
vectorization added as part of
https://github.com/llvm/llvm-project/pull/117008.
2024-12-31 19:34:34 +00:00
Muhammad Omair Javaid
332d2647ff Revert "[LV]: Teach LV to recursively (de)interleave. (#89018)"
This reverts commit ccfe0de0e1e37ed369c9bf89dd0188ba0afb2e9a.

This breaks LLVM build on AArch64 SVE Linux buildbots
https://lab.llvm.org/buildbot/#/builders/143/builds/4462
https://lab.llvm.org/buildbot/#/builders/17/builds/4902
https://lab.llvm.org/buildbot/#/builders/4/builds/4399
https://lab.llvm.org/buildbot/#/builders/41/builds/4299
2024-12-31 03:12:24 +05:00
Zequan Wu
4d8f9594b2 Revert "Reland "[LoopVectorizer] Add support for partial reductions" (#120721)"
This reverts commit c858bf620c3ab2a4db53e84b9365b553c3ad1aa6 as it casuse optimization crash on -O2, see https://github.com/llvm/llvm-project/pull/120721#issuecomment-2563192057
2024-12-27 11:51:54 -08:00
Hassnaa Hamdi
ccfe0de0e1
[LV]: Teach LV to recursively (de)interleave. (#89018)
Currently available intrinsics are only ld2/st2, which don't support interleaving factor > 2.
This patch teaches the LV to use ld2/st2 recursively to support high
interleaving factors.
2024-12-27 12:42:07 +00:00