Auto-generate check lines for scalable-loop-unpredicated-body-scalar-tail.ll,
while also updating the input to be more compact and avoid unnecessary
checks to keep auto-generated checks compact without loss of
generality.
This reverts commit 1c7c8e3ad39957285524ff116d9a6aec0d9b62f9.
Recommit with a fix for the verifier error caused for EVL recipes.
Extra test coverage added in 6f939da60e.
Materialize VF and VFxUF computation using VPInstruction
instead of directly creating IR.
This is one of the last few steps needed to model the full vector
skeleton in VPlan.
This is mostly NFC, although in some cases we remove some unused
computations.
PR: https://github.com/llvm/llvm-project/pull/152879
I've changed how we construct the EpilogueVectorizerEpilogueLoop and
EpilogueVectorizerMainLoop classes so that we construct the parent class
with an additional boolean parameter indicating whether we're
vectorising the main or epilogue loop. The
InnerLoopAndEpilogueVectorizer class uses this new argument in
combination with the EpilogueLoopVectorizationInfo struct to set the
right UF and VF values. This then allows EpilogueVectorizerEpilogueLoop
to access the correct values of VF and UF for the main loop, which are
required when setting branch weights in the minimum iteration check
block.
`VPInstruction::Not` which will generate xor instruction is widely used
for the exit condition. This patch make `VPInstruction::Not` generate
scalar `xor` if possible.
This can help reducing the (splat true) in the `xor` and make `xor` be
scalar.
Epilogue vectorization currently relies on the resume phi for the
canonical induction being always available, which is why VPPhi are
considered to have side-effects, to prevent their removal.
This patch adds a new ResumeForEpilogue opcode to mark the resume phi as
used for epilogue vectorization. This allows treating VPPhis in general
as not having side-effects, enabling removal of unused VPPhis.
The clang-x64-windows-msvc buildbot is failing after
707447159341f7b5678dee4f47731af50524b9ae due to this test failing:
https://lab.llvm.org/buildbot/#/builders/63/builds/8528
This is a stab in the dark, but my first thought is that it may be due
to the handling of floats with MSVC or something. So this removes the
floating point part of the check. I don't have access to a Windows
machine handy to debug this just yet, so pushing this to see if it can
quickly return the buildbot to green.
Materialize the vector trip count computation using VPInstruction
instead of directly creating IR. This is one of the last few steps
needed to model the full vector skeleton in VPlan. It also simplifies
vector-trip count computations for scalable vectors, as we can re-use
the UF x VF computation.
PR: https://github.com/llvm/llvm-project/pull/151925
Now that #149310 has restricted lifetime intrinsics to only work on
allocas, we can also drop the explicit size argument. Instead, the size
is implied by the alloca.
This removes the ability to only mark a prefix of an alloca alive/dead.
We never used that capability, so we should remove the need to handle
that possibility everywhere (though many key places, including stack
coloring, did not actually respect this).
We have been tracking the performance of EVL tail folding in the loop
vectorizer on RISC-V for a while now, and after much hard work from
various contributors we think it should be generally profitable to
enable by default now.
With tail folding there is a 21% improvement on 525.x264_r on SPEC CPU
2017 on the BPI-F3 (-march=rva22u64_v -O3 -flto), as well as a 30%
geomean codesize reduction on SPEC and TSVC, with no significant
regressions detected.
Now that we are early into the LLVM 22.x development cycle it seems like
a good time to enable it to catch any issues. There are still more EVL
related items of work being tracked in #123069, which should continue to
improve performance.
Changes: The original patch, landed as 1336675, was reverted due to a
bug in LoopVectorize resulting in a crash. The bug has now been fixed by
95c32bf ([VPlan] Return invalid cost if any skeleton block has invalid
costs), and this reland is identical to the original patch.
Previously we could only scalably vectorize interleave groups with
factor 2, but after 7ef77eb9984d1fb537a409cf4be89560fbb681fe we now
support all factors (available on RISC-V). So this adds the remaining
check lines for the scalable VFs.
Now that VPWidenPointerInductionRecipes are modelled in VPlan in
#148274, we can support them in EVL tail folding.
We need to replace their VFxUF operand with EVL as the increment is not
guaranteed to always be VF on the penultimate iteration, and UF is
always 1 with EVL tail folding.
We also need to move the creation of the backedge value to the latch so
that EVL dominates it.
With this we will no longer fail to convert a VPlan to EVL tail folding,
so adjust tryAddExplicitVectorLength to account for this. This brings us
to 99.4% of all vector loops vectorized on SPEC CPU 2017 with tail
folding vs no tail folding.
The test in only-compute-cost-for-vplan-vfs.ll previously relied on
widened pointer inductions with EVL tail folding to end up in a scenario
with no vector VPlans, so this also replaces it with an unvectorizable
fixed-order recurrence test from
first-order-recurrence-multiply-recurrences.ll that also gets discarded.
The initial VPlan closely reflects the original scalar loop, so unsing
VPWidenPHIRecipe here is premature. Widened phi recipes should only be
introduced together with other widened recipes.
PR: https://github.com/llvm/llvm-project/pull/150847
Update handling of canonical IV resume phi for the epilogue loop to make
sure the resume phi for the canonical IV is always the first phi in the
scalar preheader.
This makes it easier to retrieve it in preparePlanForEpilogueVectorLoop.
For now, we keep an assert to make sure we use the same resume phi as
before. This will be removed in the future.
This is the VPWidenPointerInductionRecipe equivalent of #118638, with
the motivation of allowing us to use the EVL as the induction step.
There is a new VPInstruction added, WidePtrAdd to allow adding the step
vector to the induction phi, since VPInstruction::PtrAdd only handles
scalars or multiple scalar lanes.
Originally this transformation was copied from the original recipe's
execute code, but it's since been simplifed by teaching
`unrollWidenInductionByUF` to unroll the recipe, which brings it inline
with VPWidenIntOrFpInductionRecipe.
Move selectInterleaveCount to LoopVectorizationPlanner and retrieve some
information directly from VPlan. Register pressure was already computed
for a VPlan, and with this patch we now also check for reductions
directly on VPlan, as well as checking how many load and store
operations remain in the loop.
This should be mostly NFC, but we may compute slightly different
interleave counts, except for some edge cases, e.g. where dead loads
have been removed. This shouldn't happen in practice, and the patch
doesn't cause changes across a large test corpus on AArch64.
Computing the interleave count based on VPlan allows for making better
decisions in presence of VPlan optimizations, for example when
operations on interleave groups are narrowed.
Note that there are a few test changes for tests that were still
checking the legacy cost-model output when it was computed in
selectInterleaveCount.
PR: https://github.com/llvm/llvm-project/pull/149702
Now that support for masked loads/stores of interleave groups has
landed, we can enable the loop vectorizer to generate masked interleave
access where applicable.
This improves vectorization in several ways:
* Internal predication support: This enables interleave group
vectorization for loops with internal control flow predication, provided
all members of the group share the same predicate. Gaps in interleave
groups are still not efficiently handled by masking, so masking for gaps
remains disabled for now.
* Tail folding: This allows tail folding of loops with interleave groups
by using masking. Without this, vectorized loops with interleaves would
fall back to using separate gather/scatter accesses, which can be
significantly less efficient.
"[RISCV][TTI] Enable masked interleave access for scalable vector
(#149981)" was reverted by 5294793bdcf6ca142f7a0df897638bd4e85ed1a7 due
to triggering an assertion. The issue has been addressed in the patch
"[LV] Fix gap mask requirement for interleaved access (#151105)". On the
other hand, this patch also enable fixed-length masked interleave access
(#150624) since support for fixed-length has also been landed
992118cb4deab139ae384bb85f03225a9a21b008.
---------
Co-authored-by: Philip Reames <preames@rivosinc.com>
This patch uses VPTypeAnalysis to determine its type since the induction
step is not always a live-in value in the VPlan and may be defined by a
recipe.
Currently hasEarlyExit returns true, if there are multiple exit blocks.
ExitBlocks contains the wrapped original IR exit blocks. Without
checking the predecessors we incorrectly return true for loops with
multiple countable exits, that have been vectorized by requiring a
scalar epilogue. In that case, the exit blocks will get disconnected.
Fix this by filtering out disconnected exit blocks.
Currently this should only impact the 'early-exit vectorized' statistic.
PR: https://github.com/llvm/llvm-project/pull/151718
Explicitly compute the backedge-taken count using VPInstruction. This is
needed to model the full skeleton in VPlan.
NFC modulo some instruction re-ordering.
The operands of the replicate recipe may have been narrowed, resulting
in a narrower result type. Update the type of the cloned instruction to
the correct type.
Fixes https://github.com/llvm/llvm-project/issues/151392.
Update isConditionTrueViaVFAndUF to use the vector trip count if
computable. This is the case when it has been materialized to a
constant. Otherwise fall back to the trip count.
PR: https://github.com/llvm/llvm-project/pull/151034
Make sure to check that the vector trip count is containedin the list of
incoming values to serve as tie-breaker with phis with all-zero incoming
values.
Fixes https://github.com/llvm/llvm-project/issues/151686.
Currently the multi-exit loops with all countable exits are considered
vectorized with early exit, while that terminology is used for loops we
performed early-exit vectorization, instead of requring a scalar
epilogue.