This is the first step in untangling the variable step transform and
header mask optimizations as described in #152541.
Currently we replace all VF users globally in the plan, including
VPVectorEndPointerRecipe. However, this leaves reversed loads and stores
in an incorrect state until they are adjusted in optimizeMaskToEVL.
This moves the VPVectorEndPointerRecipe transform so that it is updated
in lockstep with the actual load/store recipe.
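To sketch the invariant in C++ terms (an illustrative model assuming
unit stride; none of these names are from the patch):
// The lowest address touched by a reversed access of Len lanes based at
// Ptr, with Len >= 1. If the load/store switches from VF lanes to EVL
// lanes, the end pointer must switch with it or the address is wrong.
const int *reversedAccessBase(const int *Ptr, unsigned Len) {
  return Ptr - (Len - 1);
}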
One thought that crossed my mind was that VPInterleaveRecipe could also
use VPVectorEndPointerRecipe, in which case we would also have been
computing the wrong address, because we don't transform it to an EVL
recipe that accounts for the reversed address.
If we end up with an extract_element VPInstruction where both operands
are live-ins, we will try to fold the live-ins even though the first
operand is a vector whilst the live-in is scalar.
This fixes it by just returning the vector live-in instead of calling
the folder, and removes the handling for insertelement where we aren't
able to do the fold. From some quick testing we previously never hit
this fold anyway, and were probably just missing test coverage.
Fixes #154045
Materialize Build(Struct)Vectors explicitly for VPReplicateRecipes, to
serve their users requiring a vector, instead of doing so when unrolling
by VF.
Now we only need to implicitly build vectors in VPTransformState::get
for VPInstructions. Once they are also unrolled by VF we can remove the
code path altogether.
PR: https://github.com/llvm/llvm-project/pull/151487
We almost only ever have one header mask, except with the data tail
folding style, i.e. with VPInstruction::ActiveLaneMask.
All we need to do is make sure to erase the old icmp-based header mask
when replacing it.
This reverts commit 1c7c8e3ad39957285524ff116d9a6aec0d9b62f9.
Recommit with a fix for the verifier error caused for EVL recipes.
Extra test coverage added in 6f939da60e.
Materialize VF and VFxUF computation using VPInstruction
instead of directly creating IR.
This is one of the last few steps needed to model the full vector
skeleton in VPlan.
This is mostly NFC, although in some cases we remove some unused
computations.
PR: https://github.com/llvm/llvm-project/pull/152879
A lot of the time, getCanonicalIV() is used just to get the canonical IV type,
e.g. to instantiate a VPTypeAnalysis or to get the LLVMContext.
However VPTypeAnalysis has a constructor that takes the VPlan directly
and there's a method on VPlan to get the LLVMContext directly, so use
those instead where possible.
This lets us remove a constructor on VPTypeAnalysis.
Also remove an unused LLVMContext argument in UnrollState whilst we're
here.
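For illustration, the usage change looks roughly like this (a sketch;
exact spellings may differ from the tree):
// Before: reach through the canonical IV for the type and context.
VPTypeAnalysis TypeInfoOld(Plan.getCanonicalIV()->getScalarType());
// After: the VPlan-based constructor and context accessor noted above.
VPTypeAnalysis TypeInfo(Plan);
LLVMContext &Ctx = Plan.getContext();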
The EVL mask is always defined as `icmp ult (step-vector, EVL)`, so we
only need to generate it once per plan in the header. Then, we replace
all uses of the header mask with the EVL mask, and recursively optimize
the users of EVL mask into EVL recipes. This way, the transformation to
EVL recipes can be done with just a single loop.
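Lane-wise the EVL mask is simply (illustrative C++):
// Lane I of `icmp ult (step-vector, EVL)` is active iff I < EVL.
bool evlMaskLane(unsigned I, unsigned EVL) { return I < EVL; }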
Materialize the vector trip count computation using VPInstruction
instead of directly creating IR. This is one of the last few steps
needed to model the full vector skeleton in VPlan. It also simplifies
vector-trip count computations for scalable vectors, as we can re-use
the UF x VF computation.
PR: https://github.com/llvm/llvm-project/pull/151925
Now that VPWidenPointerInductionRecipes are modelled in VPlan in
#148274, we can support them in EVL tail folding.
We need to replace their VFxUF operand with EVL as the increment is not
guaranteed to always be VF on the penultimate iteration, and UF is
always 1 with EVL tail folding.
We also need to move the creation of the backedge value to the latch so
that EVL dominates it.
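A scalar model of the backedge value (illustrative names, not from the
patch):
// With EVL tail folding UF is 1 and an iteration processes EVL
// elements, so the pointer IV steps by EVL * Stride rather than
// VF * UF * Stride. This is emitted in the latch, where EVL dominates.
const char *pointerIVNext(const char *Phi, long Stride, long EVL) {
  return Phi + EVL * Stride;
}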
With this we will no longer fail to convert a VPlan to EVL tail folding,
so adjust tryAddExplicitVectorLength to account for this. This brings us
to 99.4% of all vector loops vectorized on SPEC CPU 2017 with tail
folding vs no tail folding.
The test in only-compute-cost-for-vplan-vfs.ll previously relied on
widened pointer inductions with EVL tail folding to end up in a scenario
with no vector VPlans, so this also replaces it with an unvectorizable
fixed-order recurrence test from
first-order-recurrence-multiply-recurrences.ll that also gets discarded.
The initial VPlan closely reflects the original scalar loop, so using
VPWidenPHIRecipe here is premature. Widened phi recipes should only be
introduced together with other widened recipes.
PR: https://github.com/llvm/llvm-project/pull/150847
This is the VPWidenPointerInductionRecipe equivalent of #118638, with
the motivation of allowing us to use the EVL as the induction step.
A new VPInstruction, WidePtrAdd, is added to allow adding the step
vector to the induction phi, since VPInstruction::PtrAdd only handles
scalars or multiple scalar lanes.
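Conceptually WidePtrAdd behaves like this (an illustrative model, not
the recipe's implementation):
#include <vector>
// Add a whole step vector to a single base pointer, yielding one
// pointer per lane; PtrAdd by contrast only handles scalar offsets.
std::vector<const char *> widePtrAdd(const char *Base,
                                     const std::vector<long> &StepVec) {
  std::vector<const char *> Lanes;
  for (long Step : StepVec)
    Lanes.push_back(Base + Step);
  return Lanes;
}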
Originally this transformation was copied from the original recipe's
execute code, but it has since been simplified by teaching
`unrollWidenInductionByUF` to unroll the recipe, which brings it in line
with VPWidenIntOrFpInductionRecipe.
This patch uses VPTypeAnalysis to determine the step's type, since the induction
step is not always a live-in value in the VPlan and may be defined by a
recipe.
Explicitly compute the backedge-taken count using VPInstruction. This is
needed to model the full skeleton in VPlan.
NFC modulo some instruction re-ordering.
Update isConditionTrueViaVFAndUF to use the vector trip count if
computable. This is the case when it has been materialized to a
constant. Otherwise fall back to the trip count.
PR: https://github.com/llvm/llvm-project/pull/151034
When interleaved stores contain gaps, a mask is required to skip the
gaps, regardless of whether scalar epilogues are allowed.
This patch corrects the condition under which a gap mask is needed,
ensuring consistency between the legacy and VPlan-based cost models and
avoiding assertion failures.
Related #149981
This implements the first half of #151459, by changing the AVL so it's
no longer computed as `trip-count - EVL-based IV`, but instead a
separate scalar phi that is decremented by EVL each iteration.
This shortens the dependency chain for computing the AVL and should
eventually allow us to convert the branch condition to `branch-on-count
avl-next, 0`.
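A scalar sketch of the resulting loop shape (illustrative names):
#include <algorithm>
// AVL starts at the trip count and is decremented by EVL, instead of
// being recomputed as trip-count - EVL-based-IV each iteration.
void evlLoopModel(long TripCount, long MaxVL) {
  long AVL = TripCount;              // scalar phi in the header
  do {
    long EVL = std::min(AVL, MaxVL); // target-chosen vector length
    // ... process EVL elements ...
    AVL -= EVL;                      // avl-next depends only on AVL
  } while (AVL != 0);                // future: branch-on-count avl-next, 0
}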
`simplifyBranchConditionForVFAndUF` had to be updated to prevent a
regression because this introduces a VPPhi in the header block.
Loop regions require fixed-length steps and rounded-up trip counts, but
once dissolution has created explicit control flow, EVL loops can use
variable-length stepping with the original trip count.
This patch adds a post-dissolution transform pass that converts EVL
loops from fixed-length to variable-length stepping.
With EVL tail folding, the EVL may not always be VF on the
second-to-last iteration.
Recipes that have been converted to VP intrinsics via optimizeMaskToEVL
account for this, but recipes that are left behind will still use the
old header mask which may end up having a different vector length.
This is effectively the same as #95368, and fixes this by converting
header masks from `icmp ule wide-canonical-iv, backedge-taken-count` to
`icmp ult step-vector, evl`. Without it, recipes that fall through
optimizeMaskToEVL may use the wrong vector length, e.g. in #150074 and
#149981.
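Lane-wise the difference is (illustrative C++):
// Old header mask: icmp ule wide-canonical-iv, backedge-taken-count.
bool oldHeaderMaskLane(long IV, unsigned Lane, long BTC) {
  return IV + Lane <= BTC;
}
// New header mask: icmp ult step-vector, evl. This matches exactly the
// lanes the EVL recipes process, even when EVL < VF.
bool newHeaderMaskLane(unsigned Lane, long EVL) {
  return Lane < EVL;
}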
We really need to split off optimizeMaskToEVL into
VPlanTransforms::optimize and move transformRecipestoEVLRecipes into
tryToBuildVPlanWithVPRecipes, so we don't mix up what is needed for
correctness and what is needed to optimize away the mask computations.
We should be able to still generate a correct albeit suboptimal VPlan
without running optimizeMaskToEVL. I've added a TODO for this, which I
think we can do after #148274.
Fixes #150197
VPVectorPointer for part 0 is just the pointer operand. Simplify it
after unrolling. This removes a large number of redundant GEPs with
index 0.
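Illustratively:
// Part K of a VPVectorPointer is Base + K * VF elements, so part 0 is
// just Base and its GEP with index 0 folds away.
int *vectorPointerPart(int *Base, unsigned VF, unsigned K) {
  return Base + (long)K * VF;
}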
PR: https://github.com/llvm/llvm-project/pull/149735
This patch adds a new ExtractLane VPInstruction which extracts across
multiple parts using a wide index, to be used in combination with
FirstActiveLane.
The patch updates early-exit codegen to use it instead of ExtractElement,
which is only per-part. With this change, interleaving should work
correctly with early-exit loops.
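A sketch of the semantics, assuming lanes are numbered consecutively
across parts (illustrative, not the actual implementation):
#include <vector>
// ExtractLane with a wide index L over UF parts of VF lanes each:
// pick part L / VF, then lane L % VF within it. ExtractElement can
// only index within a single part.
int extractLane(const std::vector<std::vector<int>> &Parts, unsigned VF,
                unsigned L) {
  return Parts[L / VF][L % VF];
}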
The patch removes the restrictions added in 6f43754e9 (#145877), but
does not yet automatically select interleave counts > 1 for early-exit
loops.
I'll share a patch as follow-up. The cost of extracting a lane adds
non-trivial overhead in the exit block, so that should be considered
when picking the interleave count.
PR: https://github.com/llvm/llvm-project/pull/148817
Materialize constant vector trip counts before ::execute, if the trip
count can be computed as (Original TC / (VF * UF)) * (VF * UF). For now
this excludes when the tail is folded or scalar epilogues are required.
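For example (illustrative C++; VF * UF assumed non-zero), TC = 17 with
VF = 4 and UF = 2 materializes a vector trip count of 16:
#include <cstdint>
// Round the trip count down to a multiple of VF * UF.
uint64_t vectorTripCount(uint64_t TC, uint64_t VF, uint64_t UF) {
  return (TC / (VF * UF)) * (VF * UF); // (17 / 8) * 8 == 16
}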
This enables removing a number of redundant branches from the middle
block.
For now this is also only done when not vectorizing the epilogue, as the
simplification complicates stitching the 2 plans together.
PR: https://github.com/llvm/llvm-project/pull/142309
When looking at some EVL tail folded code in SPEC CPU 2017 I noticed we
sometimes have both VPBlendRecipes and select VPInstructions in the same
plan:
EMIT vp<%active.lane.mask> = active lane mask vp<%5>, vp<%3>
EMIT vp<%7> = icmp ...
EMIT vp<%8> = logical-and vp<%active.lane.mask>, vp<%7>
BLEND ir<%8> = ir<%n.015> ir<%foo>/vp<%8>
EMIT vp<%9> = select vp<%active.lane.mask>, ir<%8>, ir<%n.015>
Since a blend will ultimately generate a chain of selects, we could fold
the blend into the select:
EMIT vp<%active.lane.mask> = active lane mask vp<%5>, vp<%3>
EMIT vp<%7> = icmp ...
EMIT vp<%8> = logical-and vp<%active.lane.mask>, vp<%7>
EMIT ir<%8> = select vp<%8>, ir<%foo>, ir<%n.015>
So as a first step, this patch expands blends to a series of select
instructions, which may allow them to be simplified further with other
select instructions.
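A lane-wise check that the fold above is sound (illustrative C++):
// With M8 = ALM && C: select(ALM, select(M8, Foo, N), N) always equals
// select(M8, Foo, N), because whenever ALM is false, M8 is false too.
int foldedLane(bool ALM, bool C, int Foo, int N) {
  bool M8 = ALM && C;
  int Blend = M8 ? Foo : N;    // the expanded blend
  int Outer = ALM ? Blend : N; // the pre-existing select
  return Outer;                // == (M8 ? Foo : N)
}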
Previously we fell back to just simplifying the branch condition to true
since one of the phis was a VPEVLBasedIVPHIRecipe. However this should
be fine to replace with its start value.
This is split off from #133993.
On its own this simplification isn't that useful, but it allows us to
make the equivalent VPBlendRecipe optimisation more generic by operating
on VPInstructions.
In order to actually test this without #133993, I've had to also extend
the m_Not pattern matcher to also catch VPWidenRecipes, since I couldn't
really think of a straightforward way to create a VPInstruction::Select
with a negated condition.