Update the code to create induction resume PHIs to also create a resume
phi for the canonical induction during epilogue vectorization. This
unifies the code for handling induction resume values and removes the
need to explicitly create manually resume PHI and return it during
epilogue creation.
Overall it helps to move the code for updating the canonical induction
resume value to the place where all other header phi resume values are
updated.
This is NFC, modulo order of the created phis.
This patch fixes:
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:2699:49: error:
captured structured bindings are a C++20 extension
[-Werror,-Wc++20-extensions]
When VF has a fixed width and equals the number of iterations, and we are not
tail folding by masking, comparison instruction and induction operation will be DCEed later.
Ignoring the costs of these instructions improves the cost model.
This moves printing of the final VPlan to ::execute. This ensures the
final VPlan is printed, including recipes that get introduced by late,
lowering transforms and skeleton construction.
Split off from https://github.com/llvm/llvm-project/pull/114292, to
simplify the diff.
This reverts commit f09b16e2671cbcdf7cb7dc7ed705db092a9deda1.
The crash when building llvm-test-suite with stage2 should have been
fixed by 1091fad31a83d5ab87eb6fa11fe3bdb3f0d152ea.
This reverts commit 0678e2058364ec10b94560d27ec7138dfa003287.
This reverts commit 1091fad31a83d5ab87eb6fa11fe3bdb3f0d152ea.
Causes crashes in llvm-test-suite when using stage 2 clang.
Updated ILV.createInductionResumeValues (now createInductionResumeVPValue)
to directly update the VPIRInstructions wrapping the original phis with the
created resume values.
This is the first step towards modeling them completely in VPlan.
Subsequent patches will move creation of the resume values completely
into VPlan.
Depends on https://github.com/llvm/llvm-project/pull/109975.
PR: https://github.com/llvm/llvm-project/pull/110577
Introduce a general recipe to generate a scalar phi. Lower
VPCanonicalIVPHIRecipe and VPEVLBasedIVRecipe to VPScalarIVPHIrecipe
before plan execution, avoiding the need for duplicated ::execute
implementations. There are other cases that could benefit, including
in-loop reduction phis and pointer induction phis.
Builds on a similar idea as
https://github.com/llvm/llvm-project/pull/82270.
PR: https://github.com/llvm/llvm-project/pull/114305
If IVUpdateMayOverflow is false, we proved that the induction increment
cannot overflow in the vector loop. This allows setting NUW in some
cases when folding the tail.
PR: https://github.com/llvm/llvm-project/pull/111758
Update the test cases contains `any-of` printings from the
precomputeCost().
Origin message:
The any-of reduction contains phi and select instructions.
The select instruction might be optimized and removed in the vplan which
may cause VF difference between legacy and VPlan-based model. But if the
select instruction be removed, planContainsAdditionalSimplifications()
will catch it and disable the assertion.
Therefore, we can just remove the ayn-of reduction calculation in the
precomputeCost().
Recommit "[LV][VPlan] Remove any-of reduction from precomputeCost. NFC
(#117109)"
The any-of reduction contains phi and select instructions.
The select instruction might be optimized and removed in the vplan which
may cause VF difference between legacy and VPlan-based model. But if the
select instruction be removed, `planContainsAdditionalSimplifications()`
will catch it and disable the assertion.
Therefore, we can just remove the ayn-of reduction calculation in the
precomputeCost().
This changes allows target intrinsics to specify and overwrite overloaded types.
- Updates `ReplaceWithVecLib` to not provide TTI as there most probably won't be a use-case
- Updates `SLPVectorizer` to use available TTI
- Updates `VPTransformState` to pass down TTI
- Updates `VPlanRecipe` to use passed-down TTI
This change will let us add scalarization for `asdouble`: #114847
Set the maximum interleaving factor to 4, aligning with the number of available
SIMD pipelines. This increases the number of vector instructions in the vectorised
loop body, enhancing performance during its execution. However, for very low
iteration counts, the vectorised body might not execute at all, leaving only the
epilogue loop to run. This issue affects e.g. cam4_r from SPEC FP, which
experienced a performance regression. To address this, the patch reduces the
minimum epilogue vectorisation factor from 16 to 8, enabling the epilogue to be
vectorised and largely mitigating the regression.
There are lots of places where we try to estimate the runtime
vectorisation factor based on the getVScaleForTuning TTI hook.
I've added a new getEstimatedRuntimeVF function and taught
several places in the vectoriser to use this new function.
Currently it's very difficult to improve the cost model for tail-folded
loops because as soon as you add a VPInstruction::computeCost function
that adds the costs of instructions such as
VPInstruction::ActiveLaneMask
and VPInstruction::ExplicitVectorLength the assert in
LoopVectorizationPlanner::computeBestVF fails for some tests. This is
because the VF chosen by the legacy cost model doesn't match the vplan
cost model. See PR #90191. This assert is currently making it difficult
to improve the cost model.
Hopefully we will be in a position to remove the assert soon, however
in order to do that we have to fix up a whole bunch of tests that rely
upon the legacy cost model output. I've tried my best to update
these tests to use vplan output instead.
There is still work needed for the VF=1 case because the vplan cost
model is not printed out in this case. I've not attempted to fix those
in this patch.
- Consider MainLoopVF * IC when determining whether Epilogue
Vectorization is profitable
- Allow the same VF for the Epilogue as for the main loop
- Use an upper bound for the trip count of the Epilogue when choosing
the Epilogue VF
PR: https://github.com/llvm/llvm-project/pull/108190
---------
Co-authored-by: Florian Hahn <flo@fhahn.com>
In #111310 an assert was added that for the IV overflow check used with
tail folding, the overflow check is never known.
However when applying the loop guards, it looks like it's possible that
we might actually know the IV won't overflow: this occurs in
500.perlbench_r from SPEC CPU 2017 and triggers the assertion:
Assertion failed: (!isIndvarOverflowCheckKnownFalse(Cost, VF * UF) &&
!SE.isKnownPredicate(CmpInst::getInversePredicate(ICmpInst::ICMP_ULT),
TC2OverflowSCEV, SE.getSCEV(Step)) && "unexpectedly proved overflow
check to be known"), function emitIterationCountCheck, file
LoopVectorize.cpp, line 2501.
There is a discrepancy between `isIndvarOverflowCheckKnownFalse` and the
ICMP_ULT check, because the former uses `getSmallConstantMaxTripCount`
which only takes into trip counts that fit into 32 bits. There doesn't
seem to be an easy way to make the assertion aware of this, so this PR
just removes it for now.
There are two potential follow up things from this PR:
1. We miss calculating the max trip count in `@trip_count_max_1024`, it
looks like we might need to apply loop guards somewhere in
`ScalarEvolution::computeExitLimitFromICmp`
2. In `@overflow_at_0`, if `%tc == 0` then we the overflow check will
always return false, even though it will overflow
Fixes https://github.com/llvm/llvm-project/issues/115755
In #101641, support for out of loop reductions with EVL tail folding was
added by transforming selects to vp_merges in
transformRecipestoEVLRecipes.
Whilst the select was previously free, the vp_merge wasn't and incurs a
cost on RISC-V with the VPlan cost model. But this diverged from the
legacy cost model and caused the "VPlan cost model and legacy cost model
disagreed" assertion to trigger when building 502.gcc_r from SPEC CPU
2017.
Neither the select nor vp_merge recipes from the VPlan exist in the
underlying instructions, so I thought it would make the most sense to
fix this by adding the cost to the underlying phi instruction in
getInstructionCost.
It's worth noting that on RISC-V this vp_merge won't actually generate
any instructions because the mask is all true, and will be folded away.
So we should update the cost model at some point to reflect that.
We should always update the `SkipComputation` which is set in
`VPCostContext` before VPlan compute costs.
This patch prevent the assertion of in-loop reduction in the
`VPReductionRecipe::computeCost()` and other potential assertions of
partially implemented VPlan-based cost model.
Use the IV resume/end values from the phis in the scalar header,
instead of collecting them in a map. This removes some complexity
from the code dealing with induction resume values.
Analogous to 1edd22030 which did the same for reduction resume values.
The plan for the VF chosen by the legacy cost model could also contain
additional simplifications that cause cost differences. Also check if it
contains simplifications.
Fixes https://github.com/llvm/llvm-project/issues/114860.
With PR #88385 I am introducing support for vectorising more loops with
early exits that don't require a scalar epilogue. As such, if a loop
doesn't have a unique exit block it will not automatically imply we
require a scalar epilogue. Also, in all places in the code today where
we use the variable LoopExitBlock we actually mean the exit block from
the latch. Therefore, it seemed reasonable to add a new
getUniqueLatchExitBlock that allows the caller to determine the exit
block taken from the latch and use this instead of getUniqueExitBlock. I
also renamed LoopExitBlock to be LatchExitBlock. I feel this not only
better reflects how the variable is used today, but also prepares the
code for PR #88385.
While doing this I also noticed that one of the comments in
requiresScalarEpilogue is wrong when we require a scalar epilogue, i.e.
when we're not exiting from the latch block. This doesn't always imply
we have multiple exits, e.g. see the test in
Transforms/LoopVectorize/unroll_nonlatch.ll
where the latch unconditionally branches back to the only exiting block.
Following #90184, this patch emits vp.merge intrinsic, which is used to
set the inactive lanes in a select operation to the RHS instead of
undef. Currently, it is applied to out-loop reduction for EVL
vectorization.
This patch performs transformation to convert
select(header_mask, LHS, RHS)
into
vp.merge(all-true, LHS, RHS, EVL)
And always use the predicated reduction select to set the incoming value
of the reduction phi to support out-loop reduction when using tail
folding with EVL.
TODO: Postpone the adjustment of the predicated reduction select to
VPlanTransform. The current adjustment might be too early, which could
lead to a situation where the predicated reduction select is adjusted,
but the EVL recipes cannot be successfully generated during
VPlanTransform.
Iteration runtime check confirms whether the trip count is greater than
VFxUF at least. Therefore, there is no need to adjust the
MinProfitableTripCount to VF if it is zero.
Retaining the original MinProfitableTripCount information is also
beneficial for supporting more profitable runtime checks in the future.
This work is in preparation for PRs #112138 and #88385 where
the middle block is not guaranteed to be the immediate successor
to the region block. I've simply add new getMiddleBlock()
interfaces to VPlan that for now just return
cast<VPBasicBlock>(VectorRegion->getSingleSuccessor())
Once PR #112138 lands we'll need to do more work to discover
the middle block.
Update VPlan to include the scalar loop header. This allows retiring
VPLiveOut, as the remaining live-outs can now be handled by adding
operands to the wrapped phis in the scalar loop header.
Note that the current version only includes the scalar loop header, no
other loop blocks and also does not wrap it in a region block.
PR: https://github.com/llvm/llvm-project/pull/109975
Use VPInstruction::ResumePhi to create phi nodes for reduction resume
values in the scalar preheader, similar to how ResumePhis are used for
first-order recurrence resume values after 9a5a8731e77.
This allows simplifying createAndCollectMergePhiForReduction to only
collect reduction resume phis when vectorizing epilogue loops and adding
extra incoming edges from the main vector loop. Updating phis for the
epilogue vector loops requires special attention, because additional
incoming values from the bypass blocks need to be added.
PR: https://github.com/llvm/llvm-project/pull/110004
Refactors VPVectorPointerRecipe to use the VF VPValue to obtain the
runtime VF, similar to #95305.
Since only reverse vector pointers require the runtime VF, the patch
sets VPUnrollPart::PartOpIndex to 1 for vector pointers and 2 for
reverse vector pointers. As a result, the generation of reverse vector
pointers is moved into a separate recipe.
Induction users only need to be updated when vectorizing the epilogue.
Avoid running fixupIVUsers when vectorizing the main loop during
epilogue vectorization.
Previously the legacy cost model would pick the type for the cost
computation depending on the order of the members in the input IR.
This is incompatible with the VPlan-based cost model (independent of
original IR order) and also doesn't match code-gen, which uses the type
of the insert position.
Update the legacy cost model to use the type (and address space) from
the Group's insert position.
This brings the legacy cost model in line with the legacy cost model and
fixes a divergence between both models.
Note that the X86 cost model seems to assign different costs to groups
with i64 and double types. Added a TODO to check.
Fixes https://github.com/llvm/llvm-project/issues/112922.