2323 Commits

Author SHA1 Message Date
David Sherwood
13107cb094
[LoopVectorize] Enable more early exit vectorisation tests (#117008)
PR #112138 introduced initial support for dispatching to
multiple exit blocks via split middle blocks. This patch
fixes a few issues so that we can enable more tests to use
the new enable-early-exit-vectorization flag. Fixes are:

1. The code to bail out for any loop live-out values happens
too late. This is because collectUsersInExitBlocks ignores
induction variables, which get dealt with in fixupIVUsers.
I've moved the check much earlier in processLoop by looking
for outside users of loop-defined values.
2. We shouldn't yet be interleaving when vectorising loops
with uncountable early exits, since we've not added support
for this yet.
3. Similarly, we also shouldn't be creating vector epilogues.
4. Similarly, we shouldn't enable tail-folding.
5. The existing implementation doesn't yet support loops
that require scalar epilogues, although I plan to add that
as part of PR #88385.
6. The new split middle blocks weren't being added to the
parent loop.
2024-12-18 09:25:45 +00:00
Nikita Popov
1157187496
[VPlan] Propagate all GEP flags (#119899)
Store GEPNoWrapFlags instead of only InBounds and propagate them.
2024-12-17 13:48:50 +01:00
Florian Hahn
0e528ac404
[VPlan] Use start value operand for FindLastIV reduction phis.
Update VPReductionPHIRecipe::execute to use the start value from the
start value operand of the recipe. This is needed to make sure we resume
from the correct value during epilogue vectorization.

At the moment, the start value is set to the sentinel value in
adjustRecipesForReductions, as the original start value needs to be used
when creating ResumePhi recipes.

Fixes a mis-compile introduced by b3cba9be41bfa8 in SPEC2017 on AArch64.
2024-12-16 23:29:49 +00:00
Florian Hahn
f9120dc2a6
[VPlan] Make sure vector trip count is ready for prepareToExecute (NFC)
Split off from https://github.com/llvm/llvm-project/pull/112145. This
ensures that getOrCreateVectorTripCount creates the trip count as needed
when induction resume value creation is moved to VPlan and no longer
creates the vector trip count early.
2024-12-16 20:44:20 +00:00
Florian Hahn
89d5272841
[VPlan] Remove getPreheader(). (NFC)
The preheader is now the entry block, connected to the vector.ph.

Clean up after https://github.com/llvm/llvm-project/pull/114292.
2024-12-16 19:48:02 +00:00
Florian Hahn
95e509a989
[VPlan] Add VPWidenInduction recipe as common base class (NFC). (#120008)
This helps to simplify some existing code and new code
(https://github.com/llvm/llvm-project/pull/112145)

PR: https://github.com/llvm/llvm-project/pull/120008
2024-12-16 09:40:03 +00:00
Florian Hahn
2067e604a4
[VPlan] Manage VPWidenPointerInduction debug location via recipe.
Update VPWidenPointerInduction to manage its debug location via recipe.
This makes sure we emit a proper debug location for
VPWidenPointerInductionRecipes.
2024-12-15 14:41:07 +00:00
Florian Hahn
734a204fbd
[VPlan] Manage VPWidenIntOrFPInduction debug location via recipe (NFC).
Properly set VPWidenIntOrFpInductionRecipe's debug location in the
recipe and use it, instead of using the debug location of the underlying
IR instruction.
2024-12-15 13:45:28 +00:00
Florian Hahn
4e828f8d74
[VPlan] Perform DT expensive input DT verification earlier (NFC).
After 6c8f41d33674, DT adjustments for the skeleton are applied as VPBBs
are executed. Move input DT verification up before starting to execute
any VPBBs to avoid checking DT while the CFG and DT are in an incomplete
state.

This fixes a number of verification failures with expensive checks
enabled, including
https://lab.llvm.org/buildbot/#/builders/16/builds/10584
2024-12-12 20:06:20 +00:00
Florian Hahn
6c8f41d336
[VPlan] Hook IR blocks into VPlan during skeleton creation (NFC) (#114292)
As a first step to move towards modeling the full skeleton in VPlan,
start by wrapping IR blocks created during legacy skeleton creation in
VPIRBasicBlocks and hook them into the VPlan. This means the skeleton
CFG is represented in VPlan, just before execute. This allows moving
parts of skeleton creation into recipes in the VPBBs gradually.

Note that this allows retiring some manual DT updates, as this will be
handled automatically during VPlan execution.

PR: https://github.com/llvm/llvm-project/pull/114292
2024-12-12 15:58:16 +00:00
Florian Hahn
a480d51722
[VPlan] Use existing vector trip count VPValue for resume phi (NFC)
Instead of going through getOrAddLiveIn to get a VPValue for the vector
trip count retrieve it directly from VPlan via getVectorTripCount.

Small simplification following 0e70289f373.
2024-12-12 11:03:47 +00:00
Mel Chen
b3cba9be41
[LoopVectorize] Vectorize select-cmp reduction pattern for increasing integer induction variable (#67812)
Consider the following loop:
```
  int rdx = init;
  for (int i = 0; i < n; ++i)
    rdx = (a[i] > b[i]) ? i : rdx;
```
We can vectorize this loop if `i` is an increasing induction variable.
The final reduced value will be the maximum of `i` that the condition
`a[i] > b[i]` is satisfied, or the start value `init`.

This patch added new RecurKind enums - IFindLastIV and FFindLastIV.

---------

Co-authored-by: Alexey Bataev <5361294+alexey-bataev@users.noreply.github.com>
2024-12-12 16:48:31 +08:00
Florian Hahn
5fae408d3a
[VPlan] Dispatch to multiple exit blocks via middle blocks. (#112138)
A more lightweight variant of
https://github.com/llvm/llvm-project/pull/109193,
which dispatches to multiple exit blocks via the middle blocks.

The patch also introduces a bit of required scaffolding to enable
early-exit vectorization, including an option. At the moment, early-exit
vectorization doesn't come with legality checks, and is only used if the
option is provided and the loop has metadata forcing vectorization. This
is only intended to be used for testing during bring-up, with @david-arm
enabling auto early-exit vectorization plugging in the changes from
https://github.com/llvm/llvm-project/pull/88385.

PR: https://github.com/llvm/llvm-project/pull/112138
2024-12-11 21:11:05 +00:00
Florian Hahn
0e7f18791c
[LV] Relax assertion in fixupIVUsers (NFC).
Adjust the assertion in fixupIVUsers to only require a unique exit block
if there are any values to fix up. This enables the bring up of
multi-exit loop vectorization without requiring a scalar epilogue.

Split off as suggested from
https://github.com/llvm/llvm-project/pull/112138.
2024-12-10 10:39:28 +00:00
Florian Hahn
56ddbeff83
[LV] Use getUniqueLatchExitBlock in createVectorLoopSkeleton (NFC).
Use getUniqueLatchExitBlock instead of getUniqueExitBlock in preparation
for multi-exit vectorization *without* requiring a scalar epilogue.

Split off as suggested from
https://github.com/llvm/llvm-project/pull/112138
2024-12-10 09:35:08 +00:00
Florian Hahn
e9834209aa
[VPlan] Move convertToConreteRecipes to end of VPlan-opt phase (NFCI).
Adjust placement as suggested in
https://github.com/llvm/llvm-project/pull/114305, after some refactoring
to prepare for the move.
2024-12-10 09:13:13 +00:00
Florian Hahn
0e70289f37
[VPlan] Create canonical IV resume value for epilogue in VPlan. (NFCI)
Update the code to create induction resume PHIs to also create a resume
phi for the canonical induction during epilogue vectorization. This
unifies the code for handling induction resume values and removes the
need to explicitly create manually resume PHI and return it during
epilogue creation.

Overall it helps to move the code for updating the canonical induction
resume value to the place where all other header phi resume values are
updated.

This is NFC, modulo order of the created phis.
2024-12-09 23:11:38 +00:00
Florian Hahn
adfe54f7da
[VPlan] Directly check VectorizingEpilogue in ::executePlan (NFC).
Directly check VectorizingEpilogue which directly indicates that the
epilogue is vectorized.
2024-12-09 22:21:25 +00:00
Florian Hahn
4fd8dbc184
[LV] Move code to prepare VPlan for epilogue vector loop to helper (NFC)
Move code to prepare the VPlan for the epilogue vector loop to a helper
to reduce size and complexity of processLoop.
2024-12-09 21:56:10 +00:00
Kazu Hirata
9099d694f6 [Vectorize] Fix a warning
This patch fixes:

  llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:2699:49: error:
  captured structured bindings are a C++20 extension
  [-Werror,-Wc++20-extensions]
2024-12-09 10:26:48 -08:00
Igor Kirillov
337936a83b
[LV] Ignore some costs when loop gets fully unrolled (#106699)
When VF has a fixed width and equals the number of iterations, and we are not
tail folding by masking, comparison instruction and induction operation will be DCEed later.
Ignoring the costs of these instructions improves the cost model.
2024-12-09 18:17:52 +00:00
Florian Hahn
eff0d8103c
[VPlan] Adjust original position of convertToConcreteRecipes.
Restore the original position of the call before afef545efab77a8
to fix a number of crashes.
2024-12-08 21:52:51 +00:00
Florian Hahn
afef545efa
[VPlan] Address post-commit for #114305.
Apply suggested renaming and adjust placement as suggested in
https://github.com/llvm/llvm-project/pull/114305. Also drop unneeded
RPOT creation.
2024-12-08 21:24:19 +00:00
Florian Hahn
156da98683
[VPlan] Move printing final VPlan to ::execute (NFC).
This moves printing of the final VPlan to ::execute. This ensures the
final VPlan is printed, including recipes that get introduced by late,
lowering transforms and skeleton construction.

Split off from https://github.com/llvm/llvm-project/pull/114292, to
simplify the diff.
2024-12-07 09:39:10 +00:00
Florian Hahn
7f7f540a48
Reapply "[VPlan] Update scalar induction resume values in VPlan. (#110577)"
This reverts commit f09b16e2671cbcdf7cb7dc7ed705db092a9deda1.

The crash when building llvm-test-suite with stage2 should have been
fixed by 1091fad31a83d5ab87eb6fa11fe3bdb3f0d152ea.
2024-12-06 19:41:51 +00:00
Nikita Popov
f09b16e267 Revert "[VPlan] Update scalar induction resume values in VPlan. (#110577)"
This reverts commit 0678e2058364ec10b94560d27ec7138dfa003287.
This reverts commit 1091fad31a83d5ab87eb6fa11fe3bdb3f0d152ea.

Causes crashes in llvm-test-suite when using stage 2 clang.
2024-12-06 18:01:42 +01:00
Florian Hahn
0678e20583
[VPlan] Update scalar induction resume values in VPlan. (#110577)
Updated ILV.createInductionResumeValues (now createInductionResumeVPValue)
to directly update the VPIRInstructions wrapping the original phis with the 
created resume values.

This is the first step towards modeling them completely in VPlan.
Subsequent patches will move creation of the resume values completely
into VPlan.

Depends on https://github.com/llvm/llvm-project/pull/109975.

PR: https://github.com/llvm/llvm-project/pull/110577
2024-12-06 12:26:19 +00:00
Florian Hahn
f081ffe701
[LV] Simplify & clarify bypass handling for IV resume values (NFC)
Split off NFC part refactoring from
https://github.com/llvm/llvm-project/pull/110577. This simplifies and
clarifies induction resume value creation for bypass blocks.
2024-12-06 11:33:30 +00:00
Florian Hahn
a7fda0e1e4
[VPlan] Introduce VPScalarPHIRecipe, use for can & EVL IV codegen (NFC). (#114305)
Introduce a general recipe to generate a scalar phi. Lower
VPCanonicalIVPHIRecipe and VPEVLBasedIVRecipe to VPScalarIVPHIrecipe
before plan execution, avoiding the need for duplicated ::execute
implementations. There are other cases that could benefit, including
in-loop reduction phis and pointer induction phis.

Builds on a similar idea as
https://github.com/llvm/llvm-project/pull/82270.

PR: https://github.com/llvm/llvm-project/pull/114305
2024-12-03 14:53:51 +00:00
Florian Hahn
77767986ed
[LV] Use IsaPred in a few more places (NFC).
Simplifies the code slightly by removing explicit lambdas.
2024-12-01 18:47:53 +00:00
Florian Hahn
82821254f5
[LV] Use IVUpdateMayOverflow to set HasNUW. (#111758)
If IVUpdateMayOverflow is false, we proved that the induction increment
cannot overflow in the vector loop. This allows setting NUW in some
cases when folding the tail.

PR: https://github.com/llvm/llvm-project/pull/111758
2024-11-28 10:12:41 +00:00
Elvis Wang
9ea5be639d
Recommit "[LV][VPlan] Remove any-of reduction from precomputeCost. NFC (#117109)" (#117289)
Update the test cases contains `any-of` printings from the
precomputeCost().

Origin message: 

The any-of reduction contains phi and select instructions.

The select instruction might be optimized and removed in the vplan which
may cause VF difference between legacy and VPlan-based model. But if the
select instruction be removed, planContainsAdditionalSimplifications()
will catch it and disable the assertion.

Therefore, we can just remove the ayn-of reduction calculation in the
precomputeCost().



Recommit "[LV][VPlan] Remove any-of reduction from precomputeCost. NFC
(#117109)"
2024-11-28 15:07:36 +08:00
Florian Hahn
590f451b60
[VPlan] Allow setting IR name for VPDerivedIVRecipe (NFCI).
Allow setting the name to use for the generated IR value of the derived
IV in preparations for https://github.com/llvm/llvm-project/pull/112145.

This is analogous to VPInstruction::Name.
2024-11-24 20:39:12 +00:00
Elvis Wang
0e3c791916
Revert "[LV][VPlan] Remove any-of reduction from precomputeCost. NFC" (#117280)
Reverts llvm/llvm-project#117109

Some test cases need to update.
2024-11-22 11:32:52 +08:00
Elvis Wang
ce66b56865
[LV][VPlan] Remove any-of reduction from precomputeCost. NFC (#117109)
The any-of reduction contains phi and select instructions. 

The select instruction might be optimized and removed in the vplan which
may cause VF difference between legacy and VPlan-based model. But if the
select instruction be removed, `planContainsAdditionalSimplifications()`
will catch it and disable the assertion.

Therefore, we can just remove the ayn-of reduction calculation in the
precomputeCost().
2024-11-22 10:48:11 +08:00
Florian Hahn
4d1959b70b
[VPlan] Generalize collectUsersInExitBlocks for multiple exit bbs. (#115066)
Generalize collectUsersInExitBlock to collecting exit users in multiple
exit blocks. Exit blocks are leaf nodes in the VPlan (without
successors) except the scalar header.

Split off in preparation for
https://github.com/llvm/llvm-project/pull/112138

PR: https://github.com/llvm/llvm-project/pull/115066
2024-11-21 21:15:36 +00:00
Finn Plummer
8663b8777e
[NFC][VectorUtils][TargetTransformInfo] Add isVectorIntrinsicWithOverloadTypeAtArg api (#114849)
This changes allows target intrinsics to specify and overwrite overloaded types.

- Updates `ReplaceWithVecLib` to not provide TTI as there most probably won't be a use-case
- Updates `SLPVectorizer` to use available TTI
- Updates `VPTransformState` to pass down TTI
- Updates `VPlanRecipe` to use passed-down TTI

This change will let us add scalarization for `asdouble`:  #114847
2024-11-21 11:04:25 -08:00
Sjoerd Meijer
9bccf61f5f
[AArch64][LV] Set MaxInterleaving to 4 for Neoverse V2 and V3 (#100385)
Set the maximum interleaving factor to 4, aligning with the number of available
SIMD pipelines. This increases the number of vector instructions in the vectorised
loop body, enhancing performance during its execution. However, for very low
iteration counts, the vectorised body might not execute at all, leaving only the
epilogue loop to run. This issue affects e.g. cam4_r from SPEC FP, which
experienced a performance regression. To address this, the patch reduces the
minimum epilogue vectorisation factor from 16 to 8, enabling the epilogue to be
vectorised and largely mitigating the regression.
2024-11-20 09:33:39 +00:00
David Sherwood
12180717cb
[NFC][LoopVectorize] Introduce new getEstimatedRuntimeVF function (#116247)
There are lots of places where we try to estimate the runtime
vectorisation factor based on the getVScaleForTuning TTI hook.
I've added a new getEstimatedRuntimeVF function and taught
several places in the vectoriser to use this new function.
2024-11-19 12:38:11 +00:00
David Sherwood
3097c60928
[LoopVectorize][NFC] Rewrite tests to check output of vplan cost model (#113697)
Currently it's very difficult to improve the cost model for tail-folded
loops because as soon as you add a VPInstruction::computeCost function
that adds the costs of instructions such as
VPInstruction::ActiveLaneMask
and VPInstruction::ExplicitVectorLength the assert in
LoopVectorizationPlanner::computeBestVF fails for some tests. This is
because the VF chosen by the legacy cost model doesn't match the vplan
cost model. See PR #90191. This assert is currently making it difficult
to improve the cost model.

Hopefully we will be in a position to remove the assert soon, however
in order to do that we have to fix up a whole bunch of tests that rely
upon the legacy cost model output. I've tried my best to update
these tests to use vplan output instead.

There is still work needed for the VF=1 case because the vplan cost
model is not printed out in this case. I've not attempted to fix those
in this patch.
2024-11-19 08:55:39 +00:00
Julian Nagele
a8538b9138
[LV] Vectorize Epilogues for loops with small VF but high IC (#108190)
- Consider MainLoopVF * IC when determining whether Epilogue
Vectorization is profitable
- Allow the same VF for the Epilogue as for the main loop
- Use an upper bound for the trip count of the Epilogue when choosing
the Epilogue VF

PR: https://github.com/llvm/llvm-project/pull/108190
---------

Co-authored-by: Florian Hahn <flo@fhahn.com>
2024-11-17 19:35:32 +00:00
Luke Lau
050e2d325a
[LV] Remove assertions in IV overflow check (#115705)
In #111310 an assert was added that for the IV overflow check used with
tail folding, the overflow check is never known.

However when applying the loop guards, it looks like it's possible that
we might actually know the IV won't overflow: this occurs in
500.perlbench_r from SPEC CPU 2017 and triggers the assertion:

Assertion failed: (!isIndvarOverflowCheckKnownFalse(Cost, VF * UF) &&
!SE.isKnownPredicate(CmpInst::getInversePredicate(ICmpInst::ICMP_ULT),
TC2OverflowSCEV, SE.getSCEV(Step)) && "unexpectedly proved overflow
check to be known"), function emitIterationCountCheck, file
LoopVectorize.cpp, line 2501.

There is a discrepancy between `isIndvarOverflowCheckKnownFalse` and the
ICMP_ULT check, because the former uses `getSmallConstantMaxTripCount`
which only takes into trip counts that fit into 32 bits. There doesn't
seem to be an easy way to make the assertion aware of this, so this PR
just removes it for now.

There are two potential follow up things from this PR:

1. We miss calculating the max trip count in `@trip_count_max_1024`, it
looks like we might need to apply loop guards somewhere in
`ScalarEvolution::computeExitLimitFromICmp`
2. In `@overflow_at_0`, if `%tc == 0` then we the overflow check will
always return false, even though it will overflow

Fixes https://github.com/llvm/llvm-project/issues/115755
2024-11-14 17:04:49 +09:00
Luke Lau
9e77f59005
[LV] Account for vp_merge in out of loop EVL reductions in legacy cost model (#115903)
In #101641, support for out of loop reductions with EVL tail folding was
added by transforming selects to vp_merges in
transformRecipestoEVLRecipes.

Whilst the select was previously free, the vp_merge wasn't and incurs a
cost on RISC-V with the VPlan cost model. But this diverged from the
legacy cost model and caused the "VPlan cost model and legacy cost model
disagreed" assertion to trigger when building 502.gcc_r from SPEC CPU
2017.

Neither the select nor vp_merge recipes from the VPlan exist in the
underlying instructions, so I thought it would make the most sense to
fix this by adding the cost to the underlying phi instruction in
getInstructionCost.

It's worth noting that on RISC-V this vp_merge won't actually generate
any instructions because the mask is all true, and will be folded away.
So we should update the cost model at some point to reflect that.
2024-11-14 16:55:18 +09:00
Elvis Wang
b4d23cf685
[LV] Fix missing precomptueCosts() in emitInvalidCostRemarks(). (#114918)
We should always update the `SkipComputation` which is set in
`VPCostContext` before VPlan compute costs.

This patch prevent the assertion of in-loop reduction in the
`VPReductionRecipe::computeCost()` and other potential assertions of
partially implemented VPlan-based cost model.
2024-11-14 08:29:55 +08:00
Florian Hahn
98c4f4fce8
[LV] Remove IVEndValues, use resume value directly from fixed phi.(NFC)
Use the IV resume/end values from the phis in the scalar header,
instead of collecting them in a map. This removes some complexity
from the code dealing with induction resume values.

Analogous to 1edd22030 which did the same for reduction resume values.
2024-11-13 21:03:54 +00:00
Kazu Hirata
c236dbc343
[Vectorize] Simplify code with MapVector::operator[] (NFC) (#115592) 2024-11-09 14:36:32 -08:00
Florian Hahn
144bdf3eb7
[VPlan] Also check if plan for best legacy VF contains simplifications.
The plan for the VF chosen by the legacy cost model could also contain
additional simplifications that cause cost differences. Also check if it
contains simplifications.

Fixes https://github.com/llvm/llvm-project/issues/114860.
2024-11-08 20:53:03 +00:00
David Sherwood
d77a36e01b
[LoopVectorize] Use new getUniqueLatchExitBlock routine (#108231)
With PR #88385 I am introducing support for vectorising more loops with
early exits that don't require a scalar epilogue. As such, if a loop
doesn't have a unique exit block it will not automatically imply we
require a scalar epilogue. Also, in all places in the code today where
we use the variable LoopExitBlock we actually mean the exit block from
the latch. Therefore, it seemed reasonable to add a new
getUniqueLatchExitBlock that allows the caller to determine the exit
block taken from the latch and use this instead of getUniqueExitBlock. I
also renamed LoopExitBlock to be LatchExitBlock. I feel this not only
better reflects how the variable is used today, but also prepares the
code for PR #88385.

While doing this I also noticed that one of the comments in
requiresScalarEpilogue is wrong when we require a scalar epilogue, i.e.
when we're not exiting from the latch block. This doesn't always imply
we have multiple exits, e.g. see the test in

Transforms/LoopVectorize/unroll_nonlatch.ll

where the latch unconditionally branches back to the only exiting block.
2024-11-06 10:35:35 +00:00
Mel Chen
4480a22c2b
[LV][EVL] Emit vp.merge intrinsic to enable out-loop reduction in EVL vectorization. (#101641)
Following #90184, this patch emits vp.merge intrinsic, which is used to
set the inactive lanes in a select operation to the RHS instead of
undef. Currently, it is applied to out-loop reduction for EVL
vectorization.

This patch performs transformation to convert 
  select(header_mask, LHS, RHS)
into
  vp.merge(all-true, LHS, RHS, EVL) 
And always use the predicated reduction select to set the incoming value
of the reduction phi to support out-loop reduction when using tail
folding with EVL.

TODO: Postpone the adjustment of the predicated reduction select to
VPlanTransform. The current adjustment might be too early, which could
lead to a situation where the predicated reduction select is adjusted,
but the EVL recipes cannot be successfully generated during
VPlanTransform.
2024-11-06 14:53:49 +08:00
Mel Chen
70de0b8bea
[LV][NFC] Simplify initialization of MinProfitableTripCount (#113445)
Iteration runtime check confirms whether the trip count is greater than
VFxUF at least. Therefore, there is no need to adjust the
MinProfitableTripCount to VF if it is zero.
Retaining the original MinProfitableTripCount information is also
beneficial for supporting more profitable runtime checks in the future.
2024-11-05 15:13:59 +08:00