Instead of looking up the narrower reduction type via getRecurrenceType,
we can generate the needed extend directly at construction and re-use
the truncated value from the loop.
PR: https://github.com/llvm/llvm-project/pull/141860
Similar to FindLastIV, add FindFirstIVSMin to support select (icmp(), x, y)
reductions where one of x or y is a decreasing induction, producing an SMin
reduction. It uses the signed maximum as the sentinel value.
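A hedged C++ sketch of the kind of source pattern this targets (the
function, array and key names are illustrative, not taken from the
patch): the decreasing induction makes the final value of idx the
smallest matching index, which an SMin reduction with a signed-max
sentinel reproduces.
```
#include <climits>

// Illustrative only: select(icmp(), induction, idx) with a decreasing
// induction; INT_MAX acts as the signed-max sentinel.
int findSMin(const int *a, int n, int key) {
  int idx = INT_MAX;                 // sentinel: signed max
  for (int i = n - 1; i >= 0; --i)   // decreasing induction
    idx = (a[i] == key) ? i : idx;   // vectorizable as an SMin reduction
  return idx;                        // still INT_MAX if nothing matched
}
```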
PR: https://github.com/llvm/llvm-project/pull/140451
Until now the feature to enable vectorisation of some early exit
loops with uncountable exits was controlled by a flag that was off by
default. Now that we have efficient code generation for
vectorising such loops (see PR #130766) and there is still some
time before the next LLVM release, it seems like a good point
to enable the feature by default. If any issues arise post-commit
it can be easily reverted.
Using this patch I built and ran the LLVM test suite successfully,
which on neoverse-v1 led to the vectorisation of 114 additional
early exit loops. I also built and ran SPEC2017 successfully for
both neoverse-v1 and neoverse-v2.
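For context, a hedged example of the kind of loop this covers (names
are illustrative): the exit below depends on loaded data, so its trip
count is uncountable, yet the loop can now be vectorized by default on
targets that support it.
```
// Early-exit search loop with an uncountable (data-dependent) exit.
int firstMatch(const int *a, int n, int key) {
  for (int i = 0; i < n; ++i)
    if (a[i] == key)   // uncountable early exit
      return i;
  return -1;
}
```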
Currently, if the user enables interleaving during vectorisation of
uncountable early exit loops via the interleave_count pragma and the
enable-early-exit-vectorization option, the loop will be miscompiled.
There is ongoing work to fix this, but for now it seems safer to ignore
the hint until interleaving is supported.
---------
Co-authored-by: Paul Walker <paul.walker@arm.com>
Explicitly unroll VPReplicateRecipes outside replicate regions by VF,
replacing them with VF single-scalar recipes. Extracts for operands are
added as needed and the scalar results are combined into a vector using
a new BuildVector VPInstruction.
It also adds a few folds to simplify unnecessary extracts/BuildVectors.
It also adds a BuildStructVector opcode for handling of calls that have
struct return types.
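As a plain-C++ analogy of the unrolling described above (purely
illustrative, not VPlan code; the function and the use of std::sin are
assumptions): each lane becomes a single-scalar computation fed by
operand extracts, with a BuildVector-style recombination of the results.
```
#include <array>
#include <cmath>

// VF = 4 analogy: one scalar operation per lane, results rebuilt into a
// "vector".
std::array<double, 4> replicateByVF(const std::array<double, 4> &V) {
  std::array<double, 4> R;
  for (unsigned Lane = 0; Lane < 4; ++Lane) // one single-scalar op per lane
    R[Lane] = std::sin(V[Lane]);            // extract operand lane, scalar op
  return R;                                 // BuildVector of the scalar results
}
```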
VPReplicateRecipes in replicate regions will be unrolled as a follow-up,
turning non-single-scalar VPReplicateRecipes into 'abstract' recipes,
i.e. not executable.
PR: https://github.com/llvm/llvm-project/pull/142433
Replace redundant ExtractLastElement VPInstructions early. This is NFC,
as the VPInstruction computing the final result is vector-to-scalar,
producing a single scalar already. This enables follow-up changes to
model more aspects of reductions directly in VPlan.
A shuffle takes two input vectors and a mask, producing a new
vector of size <MaskElts x SrcEltTy>. Historically it has been assumed
that the SrcTy and the DstTy are the same for getShuffleCost, with that
being relaxed in recent years. If the Tp passed to getShuffleCost is the
SrcTy, then the DstTy can be calculated from the Mask elts and the src
elt size, but the Mask is not always provided and the Tp is not reliably
always the SrcTy. This has led to situations, notably in the SLP
vectorizer but also in the generic cost routines, where assumptions
about how vectors will be legalized are built into the generic cost
routines - for example whether they will widen or promote, with the cost
modelling assuming they will widen while the default lowering promotes
integer vectors.
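A minimal sketch of that calculation for the fixed-length case (an
illustration of the relationship, not code from the patch):
```
#include "llvm/ADT/ArrayRef.h"
#include "llvm/IR/DerivedTypes.h"
using namespace llvm;

// The destination has one lane per mask element, using the source
// element type.
static VectorType *getDstTyFromMask(VectorType *SrcTy, ArrayRef<int> Mask) {
  return FixedVectorType::get(SrcTy->getElementType(), Mask.size());
}
```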
This patch attempts to start improving that - it originally tried to
alter more of the cost model but that too quickly became too many
changes at once, so this patch just plumbs in a DstTy to getShuffleCost
so that DstTy and SrcTy can be reliably distinguished. The callers of
getShuffleCost have been updated to try and include a DstTy that is more
accurate. Otherwise the patch aims to be largely non-functional, keeping
SrcTy as the primary type used in the shuffle cost routines and only
using DstTy where it was already used in the past (for InsertSubVector,
for example).
Some asserts have been added that help to check for consistent values
when a Mask and a DstTy are provided to getShuffleCost. Some of them
took a while to get right, and some non-mask calls might still be
incorrect. Hopefully this will provide a useful base on which to support
more shuffles that alter size.
Going mostly by the comment here - but it says "vscale is not
necessarily a power-of-2". Both in-tree targets have vscale as a power
of two, and we have an existing TTI hook for that.
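Assuming the hook being referred to is
TargetTransformInfo::isVScaleKnownToBeAPowerOfTwo (an assumption on my
part), the check could look roughly like:
```
#include "llvm/Analysis/TargetTransformInfo.h"
using namespace llvm;

// Hedged sketch: gate the stronger assumption on the target's answer.
static bool canAssumePowerOfTwoVScale(const TargetTransformInfo &TTI) {
  return TTI.isVScaleKnownToBeAPowerOfTwo();
}
```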
Similarly to VPWidenIntOrFpInductionRecipe, if we want to support it in
EVL tail folding we need to increment the induction by EVL steps instead
of VF*UF steps, but currently this is hard-wired in
VPWidenPointerInductionRecipe.
This adds an operand for the number of elements unrolled and plumbs it
through, so that we can swap it out in
VPlanTransforms::tryAddExplicitVectorLength further down the line.
This patch adds a test that shows incorrect branch weights being set in
function
EpilogueVectorizerEpilogueLoop::emitMinimumVectorEpilogueIterCountCheck.
PPCTTIImpl defines hasActiveVectorLength and also getVPMemoryOpCost, but
they appear unused (i.e. removing them causes no test changes).
Remove them, as they complicate the interface for hasActiveVectorLength.
This simplifies the only use in LV as now no placeholder values need to
be passed.
PR: https://github.com/llvm/llvm-project/pull/142310
This is prep work for enabling better UF calculations when using
vscale-based VFs to vectorise loops with vscale-based trip counts.
NOTE: NFC because all uses remain fixed-length until a following PR
changes LoopVectorize's version of getSmallConstantTripCount().
When the fixed-order recurrence phi is live-out from the loop, the
vectorizer uses VPInstruction::ExtractPenultimateElement to extract the
penultimate element from the recurrence vector. However, this is not
feasible when the VF is vscale x 1, since vscale could be 1, making the
vector contain only one element.
This patch changes the behavior for vscale x 1 by extracting the last
element from the vector produced by splicing the recurrence phi and the
previous value. This ensures we can still determine the correct live-out
value of the recurrence phi.
The motivation for this PR is to make #115274 easier to implement; it
should allow us to add EVL support by just passing EVL to the VF
operand.
The current difficulty with widening IVs with EVL is that
VPWidenIntOrFpInductionRecipe generates its own backedge value. Since
it's a VPHeaderPHIRecipe the VF operand must be in the preheader, which
means we can't use the EVL since it's defined in the loop body.
The gist of this PR is to take the approach in #114305 and expand
VPWidenIntOrFpInductionRecipe into several recipes for the initial
value, phi and backedge value just before execution. I.e. this example:
```
vector.ph:
Successor(s): vector loop
<x1> vector loop: {
vector.body:
WIDEN-INDUCTION %i = phi %start, %step, %vf
...
EMIT branch-on-count ...
No successors
}
```
gets expanded to:
```
vector.ph:
...
vp<%induction.start> = ...
vp<%induction.increment> = ...
Successor(s): vector loop
<x1> vector loop: {
vector.body:
ir<%i> = WIDEN-PHI vp<%induction.start>, vp<%vec.ind.next>
...
vp<%vec.ind.next> = add ir<%i>, vp<%induction.increment>
EMIT branch-on-count ...
No successors
}
```
This allows us to use a value defined in the loop as the backedge value,
and also means we can just reuse the existing backedge fixups in
VPlan::execute without having to handle it specially ourselves.
After this #115274 should just become a matter of setting the VF operand
to EVL (and building the increment step in the loop body, not the
preheader).
This fixes a crash where all incoming values for the epilogue resume
value are zero, because there are no remaining iterations to execute for
the epilogue loop.
In fixVectorizedLoop we call setProfileInfoAfterUnrolling to update the
profile information after vectorising; however, for scalable VFs we
pessimistically assume vscale=1. We can improve upon this by using the
value of vscale used for tuning, i.e. when targeting neoverse-v1 the
expected value is 2.
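A minimal standalone sketch of the improved estimate (the wrapper and
parameter names below are assumptions for illustration; VScaleForTuning
stands for the target's tuning value of vscale, however it is obtained):
```
#include <optional>

// Estimated elements processed per vector iteration for a possibly
// scalable VF.
static unsigned
estimateElementsPerIteration(unsigned KnownMinVF, bool IsScalable,
                             std::optional<unsigned> VScaleForTuning) {
  unsigned Estimate = KnownMinVF;
  if (IsScalable)
    Estimate *= VScaleForTuning.value_or(1); // e.g. 2 for neoverse-v1
  return Estimate;                           // previously assumed vscale = 1
}
```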
Register pressure was only considered if the vector bandwidth was being
maximised (chosen either by the target or user options), but #132190
inadvertently caused high pressure VFs to be pruned even when max
bandwidth wasn't enabled. This PR returns to the previous behaviour.
Skip induction phis when checking for simplifications, as they may not
be lowered directly to a corresponding PHI recipe. Reductions
and first-order recurrences will get lowered to phi recipes, unless they
are removed. Considering them for simplifications allows removing them
if there are no remaining users.
NFC as currently reduction and recurrence phis are not
simplified/removed if dead.
This reverts commit 0604dc199c019b23746f4a54885ba0c75569cdae.
The recommitted version addresses post-commit comments and adjusts where
the branch weights are added. It now runs before VPlans are optimized
for VF and UF, since that optimization may remove the vector loop
region, which previously caused a crash when trying to get the middle
block afterwards. Test case added in 72f99b75afc12bb.
Original message:
Manage branch weights for the BranchOnCond in the middle block in VPlan.
This requires updating VPInstruction to inherit from VPIRMetadata, which
in general makes sense as there are a number of opcodes that could take
metadata.
There are other branches (part of the skeleton) that also need branch
weights adding.
PR: https://github.com/llvm/llvm-project/pull/143035
Similar to modeling the start value as an operand, also model the
sentinel value as an operand explicitly. This makes all required
information for code-gen available directly in VPlan.
PR: https://github.com/llvm/llvm-project/pull/142291
There are many places in VPlan and LoopVectorize where we use
getKnownMinValue to discover the number of elements in a vector. Where
we expect the vector to have a fixed length, I have used the stronger
getFixedValue call. I believe this is clearer and adds extra protection
in the form of an assert in getFixedValue that the vector is not
scalable.
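A short sketch of the distinction, using LLVM's ElementCount (where both
calls live); the example function is illustrative only:
```
#include "llvm/Support/TypeSize.h"
using namespace llvm;

void elementCountExample() {
  ElementCount FixedVF = ElementCount::getFixed(4);
  unsigned N = FixedVF.getFixedValue();       // 4; asserts not scalable
  ElementCount ScalableVF = ElementCount::getScalable(4);
  unsigned M = ScalableVF.getKnownMinValue(); // 4; real length is 4 * vscale
  (void)N;
  (void)M;
}
```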
While looking at VPFirstOrderRecurrencePHIRecipe::computeCost I also
took the liberty of simplifying the code.
In theory I believe this patch should be NFC, but I'm reluctant to add
that to the title in case we're just missing tests for some of the VPlan
changes. I built and ran the LLVM test suite when targeting neoverse-v1
and it seemed ok.
Move the logic to create the iteration count check to a separate helper,
so it can be re-used when creating the skeleton for epilogue
vectorization as well.
This caused assertion failures:
llvm/lib/Transforms/Vectorize/VPlan.h:4021:
llvm::VPBasicBlock* llvm::VPlan::getMiddleBlock():
Assertion `LoopRegion && "cannot call the function after vector loop region has been removed"' failed.
See comment on the PR.
> Manage branch weights for the BranchOnCond in the middle block in VPlan.
> This requires updating VPInstruction to inherit from VPIRMetadata, which
> in general makes sense as there are a number of opcodes that could take
> metadata.
>
> There are other branches (part of the skeleton) that also need branch
> weights adding.
>
> PR: https://github.com/llvm/llvm-project/pull/143035
This reverts commit db8d34db26e9ea92c08d6e813eca9cce40c48478.
Currently the loop vectorizer can only vectorize interleave groups for
power-of-2 factors at scalable VFs, by recursively applying the
[de]interleave2 intrinsics.
However after https://github.com/llvm/llvm-project/pull/124825 and
#139893, we now have [de]interleave intrinsics for all factors up to 8,
which is enough to support all types of segmented loads and stores on
RISC-V.
Now that the interleaved access pass has been taught to lower these in
#139373 and #141512, this patch teaches the loop vectorizer to emit
these intrinsics for factors up to 8, which enables scalable
vectorization for non-power-of-2 factors.
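For illustration (a hedged example, not taken from the patch), a
factor-3 interleave group that can now be vectorized at scalable VFs via
the new intrinsics on targets with suitable segmented load support:
```
// rgb is laid out as r0,g0,b0,r1,g1,b1,...: three interleaved members.
void sumChannels(float *sum, const float *rgb, int n) {
  for (int i = 0; i < n; ++i)
    sum[i] = rgb[3 * i] + rgb[3 * i + 1] + rgb[3 * i + 2];
}
```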
As far as I'm aware, no in-tree target will vectorize a scalable
interleave group above factor 8 because the maximum interleave factor is
capped at 4 on AArch64 and 8 on RISC-V, and the
`-max-interleave-group-factor` CLI option defaults to 8, so the
recursive [de]interleaving code has been removed for now.
Factors of 3 with scalable VFs are also turned off in AArch64 since
there's no lowering for [de]interleave3 just yet either.
Manage branch weights for the BranchOnCond in the middle block in VPlan.
This requires updating VPInstruction to inherit from VPIRMetadata, which
in general makes sense as there are a number of opcodes that could take
metadata.
There are other branches (part of the skeleton) that also need branch
weights adding.
PR: https://github.com/llvm/llvm-project/pull/143035
Directly check via GeneratedRTChecks if any checks have been added,
instead of needing to go through ILV. This simplifies the code and
enables further refactoring in follow-up patches.
Following the work in PR #107279, this patch applies the annotative
DebugLocs, which indicate that a particular instruction is intentionally
missing a location for a given reason, to existing sites in the compiler
where their conditions apply. This is NFC in ordinary LLVM builds (each
function `DebugLoc::getFoo()` is inlined as `DebugLoc()`), but marks the
instruction in coverage-tracking builds so that it will be ignored by
Debugify, allowing only real errors to be reported. From a developer
standpoint, it also communicates the intentionality and reason for a
missing DebugLoc.
Some notes for reviewers:
- The difference between `I->dropLocation()` and
`I->setDebugLoc(DebugLoc::getDropped())` is that the former _may_ decide
to keep some debug info alive, while the latter will always be empty; in
this patch, I always used the latter (even if the former could
technically be correct), because the former could result in some
(barely) different output, and I'd prefer to keep this patch purely NFC
(see the sketch after these notes).
- I've generally documented the uses of `DebugLoc::getUnknown()`, with
the exception of the vectorizers - in summary, they are a huge cause of
dropped source locations, and I don't have the time or the domain
knowledge currently to solve that, so I've plastered it all over them as
a form of "fixme".
fixReductionScalarResumeWhenVectorizingEpilog updates the resume phis in
the scalar preheader. Instead of looking at all recipes in the middle
block and finding their resume-phi users we can iterate over all resume
phis in the scalar preheader directly.
This slightly simplifies the code and removes the need to look for the
resume phi.
Also slightly simplifies https://github.com/llvm/llvm-project/pull/141860.
Add a new VPInstruction::ReductionStartVector opcode to create the start
values for wide reductions. This more accurately models the start value
creation in VPlan and simplifies VPReductionPHIRecipe::execute. Down the
line it also allows removing VPReductionPHIRecipe::RdxDesc.
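For intuition only, a hedged sketch of what a start vector can look like
for an ordinary integer add reduction at VF = 4 (how the new opcode
materializes it is not spelled out here, so treat this as an
assumption): the scalar start value occupies one lane and the remaining
lanes hold the neutral element, so a final horizontal add reproduces the
scalar start.
```
#include <array>

// Lane 0 carries the start value; the other lanes are the add identity.
std::array<int, 4> addReductionStartVector(int Start) {
  return {Start, 0, 0, 0};
}
```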
PR: https://github.com/llvm/llvm-project/pull/142290
It should be sufficient to check that the resume phi has the correct
type, as it has the vector trip count as incoming value and starts at 0
otherwise. There is no need to find the middle block.
Apart from the stylistic improvement, lookup has the nice property of
returning a default-constructed object on failure-to-find, while find
returns the end iterator, which cannot be dereferenced.
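A small sketch of that difference, assuming an llvm::DenseMap-style
container (the key/value types here are hypothetical):
```
#include "llvm/ADT/DenseMap.h"
using namespace llvm;

int *findValue(const DenseMap<int, int *> &Map, int Key) {
  // lookup: default-constructed result (nullptr here) on failure-to-find.
  int *V = Map.lookup(Key);
  // find: returns Map.end() on failure, which must not be dereferenced.
  auto It = Map.find(Key);
  return It != Map.end() ? It->second : V;
}
```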
After updating #118638 on tip of tree, expanding
VPWidenIntOrFpInductionRecipes fails because it needs the loop region to
get the latch to insert the increment into:
  VPBasicBlock *ExitingBB =
      Plan->getVectorLoopRegion()->getExitingBasicBlock();
  Builder.setInsertPoint(ExitingBB,
                         ExitingBB->getTerminator()->getIterator());
  auto *Next = Builder.createNaryOp(AddOp, {Prev, Inc}, Flags,
                                    WidenIVR->getDebugLoc(), "vec.ind.next");
However after #117506, the region is dissolved so it doesn't work.
This moves the dissolveLoopRegions step to after
convertToConcreteRecipes, so we can use the region when expanding
VPWidenIntOrFpInductionRecipes.
Move VPlan-based calculateRegisterUsage from LoopVectorize
to VPlanAnalysis.cpp. It is a VPlan-based analysis and this helps
to reduce the size of LoopVectorize.
PR: https://github.com/llvm/llvm-project/pull/135673
CodeRegions were previously passed as Value*, but then immediately
cast to BasicBlock. Let's keep the type information around until the
use cases for non-BasicBlock code regions actually materialize.
This patch implements the VPlan-based cost model for VPReduction,
VPExtendedReduction and VPMulAccumulateReduction.
With this patch, we can calculate the reduction cost with the VPlan-based
cost model, so the reduction costs in `precomputeCost()` are removed.
Ref: original instruction-based implementation:
https://reviews.llvm.org/D93476