Update isConditionTrueViaVFAndUF to use the vector trip count if
computable. This is the case when it has been materialized to a
constant. Otherwise fall back to the trip count.
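A minimal sketch of the folding this enables (values illustrative, assuming VF=4, UF=2 and a vector trip count already materialized to the constant 8):
```
vector.body:
  EMIT vp<%index.next> = add nuw vp<%index>, ir<8>
  EMIT branch-on-count vp<%index.next>, ir<8>   ; constant vector trip count
```
Since the constant vector trip count equals VF * UF, the vector loop is known to execute exactly once and the branch condition folds to true.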
PR: https://github.com/llvm/llvm-project/pull/151034
When interleaved stores contain gaps, a mask is required to skip the
gaps, regardless of whether scalar epilogues are allowed.
This patch corrects the condition under which a gap mask is needed,
ensuring consistency between the legacy and VPlan-based cost models and
avoiding assertion failures.
Related #149981
This implements the first half of #151459 by changing the AVL so that it
is no longer computed as `trip-count - EVL-based IV`, but is instead a
separate scalar phi that is decremented by the EVL each iteration.
This shortens the dependency chain for computing the AVL and should
eventually allow us to convert the branch condition to `branch-count
avl-next, 0`.
`simplifyBranchConditionForVFAndUF` had to be updated to prevent a
regression because this introduces a VPPhi in the header block.
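A rough sketch of the new form (recipe names and values are schematic, not an exact dump):
```
vector.body:
  EMIT vp<%avl> = phi [ vp<%trip.count>, vector.ph ], [ vp<%avl.next>, vector.body ]
  EMIT vp<%evl> = EXPLICIT-VECTOR-LENGTH vp<%avl>
  ...
  EMIT vp<%avl.next> = sub vp<%avl>, vp<%evl>
```
Previously the AVL was recomputed every iteration as `sub vp<%trip.count>, vp<%evl.based.iv>`.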
Loop regions require fixed-length steps and a rounded-up trip count, but
once dissolution creates explicit control flow, EVL loops can use
variable-length stepping with the original trip count.
This patch adds a post-dissolution transform pass to convert EVL loops
from fixed-length to variable-length stepping.
With EVL tail folding, the EVL may not always be VF on the
second-to-last iteration.
Recipes that have been converted to VP intrinsics via optimizeMaskToEVL
account for this, but recipes that are left behind will still use the
old header mask, which may end up having a different vector length.
This is effectively the same as #95368, and fixes this by converting
header masks from `icmp ule wide-canonical-iv, backedge-trip-count` to
`icmp ult step-vector, evl`. Without it, recipes that fall through
optimizeMaskToEVL may use the wrong vector length, e.g. in #150074 and
#149981.
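Schematically (value names illustrative), the header mask changes from
```
EMIT vp<%header.mask> = icmp ule ir<%wide.canonical.iv>, vp<%backedge.trip.count>
```
to
```
EMIT vp<%header.mask> = icmp ult step-vector, vp<%evl>
```
so any recipe still using the mask covers exactly the EVL lanes of the current iteration.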
We really need to split off optimizeMaskToEVL into
VPlanTransforms::optimize and move transformRecipestoEVLRecipes into
tryToBuildVPlanWithVPRecipes, so we don't mix up what is needed for
correctness and what is needed to optimize away the mask computations.
We should be able to still generate a correct albeit suboptimal VPlan
without running optimizeMaskToEVL. I've added a TODO for this, which I
think we can do after #148274.
Fixes #150197
VPVectorPointer for part 0 is just the pointer operand. Simplify it
after unrolling. This removes a large number of redundant GEPs with
index 0.
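For example (printing approximate), a part-0 recipe like
```
vp<%vec.ptr> = vector-pointer ir<%base>
```
can be replaced by its pointer operand ir<%base> directly, so no `getelementptr ir<%base>, i32 0` needs to be generated for part 0.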
PR: https://github.com/llvm/llvm-project/pull/149735
This patch adds a new ExtractLane VPInstruction which extracts across
multiple parts using a wide index, to be used in combination with
FirstActiveLane.
The patch updates early-exit codegen to use it instead of ExtractElement,
which is only per-part. With this change, interleaving should work
correctly with early-exit loops.
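A rough sketch with an interleave count of 2 (opcode spellings and operands are approximate):
```
; per-part: the extract is only correct within a single part
EMIT vp<%idx> = first-active-lane vp<%cond>
EMIT vp<%res> = extract-element ir<%vec>, vp<%idx>

; new: one extract across both parts using a wide index
EMIT vp<%idx> = first-active-lane vp<%cond>, vp<%cond>.1
EMIT vp<%res> = extract-lane vp<%idx>, ir<%vec>, ir<%vec>.1
```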
The patch removes the restrictions added in 6f43754e9 (#145877), but
does not yet automatically select interleave counts > 1 for early-exit
loops.
I'll share a patch as follow-up. The cost of extracting a lane adds
non-trivial overhead in the exit block, so that should be considered
when picking the interleave count.
PR: https://github.com/llvm/llvm-project/pull/148817
Materialize constant vector trip counts before ::execute, if the trip
count can be computed as (Original TC / (VF * UF)) * (VF * UF). For now
this excludes cases where the tail is folded or scalar epilogues are required.
This enables removing a number of redundant branches from the middle
block.
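For example, with an original trip count of 10, VF=4 and UF=1, the vector trip count is materialized as the constant (10 / 4) * 4 = 8, so a middle-block check along the lines of (schematic)
```
middle.block:
  EMIT vp<%cmp.n> = icmp eq ir<10>, ir<8>
  EMIT branch-on-cond vp<%cmp.n>
```
folds to a known-false condition and the conditional branch can be simplified to an unconditional branch to the scalar loop.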
For now this is also only done when not vectorizing the epilogue, as the
simplification complicates stitching the 2 plans together.
PR: https://github.com/llvm/llvm-project/pull/142309
When looking at some EVL tail folded code in SPEC CPU 2017 I noticed we
sometimes have both VPBlendRecipes and select VPInstructions in the same
plan:
EMIT vp<%active.lane.mask> = active lane mask vp<%5>, vp<%3>
EMIT vp<%7> = icmp ...
EMIT vp<%8> = logical-and vp<%active.lane.mask>, vp<%7>
BLEND ir<%8> = ir<%n.015> ir<%foo>/vp<%8>
EMIT vp<%9> = select vp<%active.lane.mask>, ir<%8>, ir<%n.015>
Since a blend will ultimately generate a chain of selects, we could fold
the blend into the select:
EMIT vp<%active.lane.mask> = active lane mask vp<%5>, vp<%3>
EMIT vp<%7> = icmp ...
EMIT vp<%8> = logical-and vp<%active.lane.mask>, vp<%7>
EMIT ir<%8> = select vp<%8>, ir<%foo>, ir<%n.015>
So as a first step, this patch expands blends to a series of select
instructions, which may allow them to be simplified further with other
select instructions.
Previously we fell back to just simplifying the branch condition to true
since one of the phis was a VPEVLBasedIVPHIRecipe. However, such a phi
should be fine to replace with its start value.
This is split off from #133993
On its own this simplification isn't that useful, but it allows us to
make the equivalent VPBlendRecipe optimisation more generic by operating
on VPInstructions.
In order to actually test this without #133993, I've also had to extend
the m_Not pattern matcher to catch VPWidenRecipes, since I couldn't
really think of a straightforward way to create a VPInstruction::Select
with a negated condition.
This fixes a buildbot failure with EVL tail folding after #144666:
https://lab.llvm.org/buildbot/#/builders/132/builds/1653
For a first-order recurrence to be correct with EVL tail folding we need
to convert splices to vp splices with the EVL operand.
Originally we did this by looking for users of the header mask and its
users, and converting it in createEVLRecipe.
However after #144666 a FOR splice might not actually use the header
mask if it's based off e.g. an induction variable, and so we wouldn't
pick it up in createEVLRecipe.
Fix this by converting FOR splices separately in a loop over all
recipes in the plan, regardless of whether or not they use the header
mask.
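Schematically (intrinsic operands abbreviated), the conversion turns
```
EMIT vp<%splice> = first-order splice ir<%for.phi>, ir<%cur>
```
into
```
WIDEN-INTRINSIC vp<%splice> = call llvm.experimental.vp.splice(ir<%for.phi>, ir<%cur>, -1, all-true, vp<%prev.evl>, vp<%evl>)
```
and it now applies to every such splice in the plan, not just those reachable through the header mask.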
I think there was some conflation in createEVLRecipe between what was an
optimisation and what was needed for correctness. Most of the transforms
in it just exist to optimize the mask away and we should still emit
correct code without them. So I've renamed it to make the separation
clearer.
createEVLRecipe tries to optimise recipes that use the header mask by
replacing them with their VP equivalents and setting the EVL, allowing
the mask to be removed.
However we currently also convert widened selects to vp.select even
though they don't necessarily use the header mask.
Unlike vp.merge, a vp.select only makes the "unused" lanes past EVL
poison, so the transform isn't needed for correctness.
In the same vein as #127180, this patch removes the transform for
VPWidenSelectRecipes and keeps them as plain select instructions to
allow for more optimisations.
RISCVVLOptimizer will still be able to optimise away any VL toggles and
we end up with better code generation across llvm-test-suite and SPEC
CPU 2017.
A reverse interleave access is essentially composed of multiple
load/store operations with the same negative stride, and their addresses
are based on the last lane address of member 0 in the interleaved group.
Currently, we already have VPVectorEndPointerRecipe for computing the
last lane address of consecutive reverse memory accesses. This patch
extends VPVectorEndPointerRecipe to support constant stride and extracts
the reverse interleave group address adjustment from
VPInterleaveRecipe::execute, replacing it with a
VPVectorEndPointerRecipe.
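Roughly (printing schematic), the address adjustment that VPInterleaveRecipe::execute used to build internally now appears in the plan as something like
```
vp<%end.ptr> = vector-end-pointer inbounds ir<%base>, vp<%vf>   ; group stride folded into the recipe
```
and the interleave recipe simply uses vp<%end.ptr> as its address operand.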
The final goal is to support interleaved accesses with EVL tail folding.
Given that VPInterleaveRecipe is large and tightly coupled (it combines
both load and store, and embeds operations like the reverse pointer
adjustment (GEP), the widened load/store, deinterleave/interleave, and
the reversal), breaking it down into smaller, dedicated recipes may allow
VPlanTransforms::tryAddExplicitVectorLength to lower them into EVL-aware
form more effectively.
One foreseeable challenge is that
VPlanTransforms::convertToConcreteRecipes currently runs after
tryAddExplicitVectorLength, so decomposing VPInterleaveRecipe will
likely need to happen earlier in the pipeline to be effective.
This patch adds a new recipe to combine multiple recipes into an
'expression' recipe, which should be considered a single entity for
cost-modeling and transforms. The recipe needs to be 'decomposed', i.e.
replaced by its individual recipes before execute.
This subsumes VPExtendedReductionRecipe and
VPMulAccumulateReductionRecipe and should make it easier to extend to
include more types of bundled patterns, e.g. extends folded into
loads or various arithmetic instructions, if supported by the target.
It allows avoiding re-creating the original recipes when converting to
concrete recipes, together with removing the need to record various
information. The current version of the patch still retains the original
printing matching VPExtendedReductionRecipe and
VPMulAccumulateReductionRecipe, but this specialized print could be
replaced with printing the bundled recipes directly.
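For example, an extended multiply-accumulate reduction could be bundled into something like (printing schematic):
```
EXPRESSION vp<%rdx.next> = ir<%rdx> + reduce.add (mul (ir<%a> extended to i32), (ir<%b> extended to i32))
```
and decomposed back into the individual extend, multiply and reduction recipes just before ::execute.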
PR: https://github.com/llvm/llvm-project/pull/144281
In addBranchWeightToMiddleTerminator we attempt to add branch weights to
the middle block terminator. We pessimistically assume vscale=1, whereas
we can improve the estimate by using the value of vscale used for
tuning.
Following on from #118638, this handles widened induction variables with
EVL tail folding by setting the VF operand to be EVL, calculated in the
vector body.
We need to do this for correctness, since with EVL tail folding the
number of elements processed in the penultimate iteration may not be VF
but the runtime EVL, and we need to take this into account when updating
the backedge value, as sketched after the notes below.
- Because the VF may now not be a live-in, we need to move the insertion
point to just after the VF's definition.
- We also need to avoid truncating it when it's the same size as the
step type; previously this wasn't a problem for live-ins.
- Also, because the VF may be smaller than the IV type (the EVL is
always i32), we may need to zext it.
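A schematic sketch of the resulting backedge update (names and opcodes illustrative, assuming an i64 IV):
```
vector.body:
  ir<%iv> = WIDEN-PHI vp<%induction.start>, vp<%vec.ind.next>
  EMIT vp<%evl> = EXPLICIT-VECTOR-LENGTH vp<%avl>
  ...
  EMIT vp<%evl.zext> = zext vp<%evl> to i64
  EMIT vp<%inc> = mul vp<%evl.zext>, ir<%step>
  EMIT vp<%inc.bc> = broadcast vp<%inc>
  EMIT vp<%vec.ind.next> = add ir<%iv>, vp<%inc.bc>
```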
On -march=rva23u64 -O3 we get 87.1% more loops vectorized on TSVC, and
42.8% more loops vectorized on SPEC CPU 2017.
With EVL tail folding, any use of the VF live-in should be replaced by
the EVL. Otherwise, it should likely be emitted directly as a constant
via VPTransformState::VF.
This strengthens the EVL transformation by replacing all uses of VF with
EVL and asserting that the only users are VPVectorEndPointerRecipe and
VPScalarIVStepsRecipe, the latter of which is new.
This should be NFC because even though we didn't previously replace the
EVL of VPScalarIVStepsRecipe, it's only used when unrolling which we
don't allow with EVL tail folding yet.
Both VPDerivedIVRecipe and VPScalarIVSteps recipe should be supported in
narrowInterleaveGroups:
* VPDerivedIVRecipe is based on the canonical IV and independent of VF,
* VPScalarIVSteps takes the VF as operand, so it will be updated by
narrowInterleaveGroups.
VPWidenSelectRecipes are single scalars if all their operands are. Add
support for narrowing them to a single scalar VPReplicateRecipe.
This fixes a crash after
https://github.com/llvm/llvm-project/pull/142433 (aa24029319083), caused
by a replicate recipe that was not converted to a single scalar being
hoisted to the vector preheader.
Explicitly unroll VPReplicateRecipes outside replicate regions by VF,
replacing them by VF single-scalar recipes. Extracts for operands are
added as needed and the scalar results are combined to a vector using a
new BuildVector VPInstruction.
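As a schematic example with VF=2 (names and opcode spellings are illustrative, and @foo is a hypothetical non-vectorizable call), a recipe like
```
REPLICATE ir<%r> = call @foo(ir<%x>)
```
is unrolled into
```
CLONE ir<%r>.0 = call @foo(vp<%x.lane0>)
CLONE ir<%r>.1 = call @foo(vp<%x.lane1>)
EMIT vp<%r.vec> = buildvector ir<%r>.0, ir<%r>.1
```
with the per-lane extracts of ir<%x> added as needed.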
It also adds a few folds to simplify unnecessary extracts/BuildVectors.
It also adds a BuildStructVector opcode for handling of calls that have
struct return types.
VPReplicateRecipes in replicate regions will be unrolled as a follow-up,
turning non-single-scalar VPReplicateRecipes into 'abstract', i.e. not
executable, recipes.
PR: https://github.com/llvm/llvm-project/pull/142433
Explicitly pass the operand we are checking to canNarrowLoad. This
simplifies the check if the operands match across recipes and enables
future optimizations.
Recipes that are vector-to-scalar are guaranteed to generate a scalar
value, so the extract is redundant after VPlan unrolling. Remove it.
This removes unneeded ExtractLastElement VPInstruction of reduction
result computations.
The motivation of this PR is to make #115274 easier to implement, and
should allow us to add EVL support by just passing EVL to the VF
operand.
The current difficulty with widening IVs with EVL is that
VPWidenIntOrFpInductionRecipe generates its own backedge value. Since
it's a VPHeaderPHIRecipe the VF operand must be in the preheader, which
means we can't use the EVL since it's defined in the loop body.
The gist in this PR is to take the approach in #114305 and expand
VPWidenIntOrFpInductionRecipe into several recipes for the initial
value, phi and backedge value just before execution. For example, this:
```
vector.ph:
Successor(s): vector loop
<x1> vector loop: {
vector.body:
WIDEN-INDUCTION %i = phi %start, %step, %vf
...
EMIT branch-on-count ...
No successors
}
```
gets expanded to:
```
vector.ph:
...
vp<%induction.start> = ...
vp<%induction.increment> = ...
Successor(s): vector loop
<x1> vector loop: {
vector.body:
ir<%i> = WIDEN-PHI vp<%induction.start>, vp<%vec.ind.next>
...
vp<%vec.ind.next> = add ir<%i>, vp<%induction.increment>
EMIT branch-on-count ...
No successors
}
```
This allows us to use a value defined in the loop as the backedge value, and
also means we can just reuse the existing backedge fixups in
VPlan::execute without having to specially handle it ourselves.
After this #115274 should just become a matter of setting the VF operand
to EVL (and building the increment step in the loop body, not the
preheader).
VPFirstOrderRecurrencePHIRecipes where the incoming values are the same
can be simplified and removed.
Fixes https://github.com/llvm/llvm-project/issues/144212.
The new test is added together with other related tests from
first-order-recurrence.ll.
This reverts commit 0604dc199c019b23746f4a54885ba0c75569cdae.
The recommitted version addresses post-commit comments and adjusts where
the branch weights are added. It now runs before VPlans are optimized
for VF and UF, which may remove the vector loop region, causing a crash
trying to get the middle block after that. Test case added in
72f99b75afc12bb.
Original message:
Manage branch weights for the BranchOnCond in the middle block in VPlan.
This requires updating VPInstruction to inherit from VPIRMetadata, which
in general makes sense as there are a number of opcodes that could take
metadata.
There are other branches (part of the skeleton) that also need branch
weights adding.
PR: https://github.com/llvm/llvm-project/pull/143035
Add a new VPInstruction::ReductionStartVector opcode to create the start
values for wide reductions. This more accurately models the start value
creation in VPlan and simplifies VPReductionPHIRecipe::execute. Down the
line it also allows removing VPReductionPHIRecipe::RdxDesc.
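Schematically (printing and operands approximate), the start value for a wide add reduction now looks like
```
vector.ph:
  EMIT vp<%rdx.start> = reduction-start-vector ir<%init>, ir<0>, ir<1>
```
i.e. a vector of the reduction's identity (0 for add) with the initial scalar value inserted into lane 0, built in the preheader rather than inside VPReductionPHIRecipe::execute.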
PR: https://github.com/llvm/llvm-project/pull/142290