Dissolving the hierarchical VPlan CFG and converting abstract to
concrete recipes can expose additional simplification opportunities.
Do a final run of simplifyRecipes before executing the VPlan.
Add 3 new iterator ranges to VPPhiAccessors
* incoming_values(): returns a range over the incoming
values of a phi
* incoming_blocks(): returns a range over the incoming
blocks of a phi
* incoming_values_and_blocks: returns a range over pairs of
incoming values and blocks.
Depends on https://github.com/llvm/llvm-project/pull/124838.
PR: https://github.com/llvm/llvm-project/pull/138472
Now that VPWidenPointerInductionRecipes are modelled in VPlan in
#148274, we can support them in EVL tail folding.
We need to replace their VFxUF operand with EVL as the increment is not
guaranteed to always be VF on the penultimate iteration, and UF is
always 1 with EVL tail folding.
We also need to move the creation of the backedge value to the latch so
that EVL dominates it.
With this we will no longer fail to convert a VPlan to EVL tail folding,
so adjust tryAddExplicitVectorLength to account for this. This brings us
to 99.4% of all vector loops vectorized on SPEC CPU 2017 with tail
folding vs no tail folding.
The test in only-compute-cost-for-vplan-vfs.ll previously relied on
widened pointer inductions with EVL tail folding to end up in a scenario
with no vector VPlans, so this also replaces it with an unvectorizable
fixed-order recurrence test from
first-order-recurrence-multiply-recurrences.ll that also gets discarded.
This implements the first half of #151459, by changing the AVL so it's
no longer computed as `trip-count - EVL-based IV`, but instead a
separate scalar phi that is decremented by EVL each iteration.
This shortens the dependency chain for computing the AVL and should
eventually allow us to convert the branch condition to `branch-count
avl-next, 0`.
`simplifyBranchConditionForVFAndUF` had to be updated to prevent a
regression because this introduces a VPPhi in the header block.
Noticed this when checking the invariant that all phis in the header
block must be header phis. I think there's a missing set of parentheses
here, since otherwise it only cast<VPInstruction> when RecipeI isn't a
VPInstruction.
Loop regions require fixed-length steps and rounded-up trip counts, but
after dissolution creates explicit control flow, EVL loops can leverage
variable-length stepping with original trip counts.
This patch adds a post-dissolution transform pass to convert EVL loops
from fixed-length to variable-length stepping .
With EVL tail folding, the EVL may not always be VF on the
second-to-last iteration.
Recipes that have been converted to VP intrinsics via optimizeMaskToEVL
account for this, but recipes that are left behind will still use the
old header mask which may end up having a different vector length.
This is effectively the same as #95368, and fixes this by converting
header masks from icmp ule wide-canonical-iv, backedge-trip-count ->
icmp ult step-vector, evl. Without it, recipes that fall through
optimizeMaskToEVL may use the wrong vector length, e.g. in #150074 and
#149981.
We really need to split off optimizeMaskToEVL into
VPlanTransforms::optimize and move transformRecipestoEVLRecipes into
tryToBuildVPlanWithVPRecipes, so we don't mix up what is needed for
correctness and what is needed to optimize away the mask computations.
We should be able to still generate a correct albeit suboptimal VPlan
without running optimizeMaskToEVL. I've added a TODO for this, which I
think we can do after #148274Fixes#150197
Following on from #118638, this handles widened induction variables with
EVL tail folding by setting the VF operand to be EVL, calculated in the
vector body.
We need to do this for correctness since with EVL tail folding the
number of elements processed in the penultimate iteration may not be VF,
but the runtime EVL, and we need take this into account when updating
the backedge value.
- Because the VF may now not be a live-in we need to move the insertion
point to just after the VFs definition
- We also need to avoid truncating it when it's the same size as the
step type, previously this wasn't a problem for live-ins.
- Also because the VF may be smaller than the IV type, since the EVL is
always i32, we may need to zext it.
On -march=rva23u64 -O3 we get 87.1% more loops vectorized on TSVC, and
42.8% more loops vectorized on SPEC CPU 2017
With EVL tail folding, any use of the VF live in should be replaced by
the EVL. Otherwise, it should likely be directly emitted as a constant
via VPTransformState::VF.
This strengthens the EVL transformation by replacing all uses of VF with
EVL and asserting that the only users are VPVectorEndPointerRecipe and
VPScalarIVStepsRecipe, the latter of which is new.
This should be NFC because even though we didn't previously replace the
EVL of VPScalarIVStepsRecipe, it's only used when unrolling which we
don't allow with EVL tail folding yet.
Building on top of https://github.com/llvm/llvm-project/pull/114305,
replace VPRegionBlocks with explicit CFG before executing.
This brings the final VPlan closer to the IR that is generated and
helps to simplify codegen.
It will also enable further simplifications of phi handling during
execution and transformations that do not have to preserve the
canonical IV required by loop regions. This for example could include
replacing the canonical IV with an EVL based phi while completely
removing the original canonical IV.
PR: https://github.com/llvm/llvm-project/pull/117506
Add additional verifier call just before execution, to make sure the
final VPlan is valid.
Note that this currently requires disabling a small number of checks
when running late.
Add constructor that retrieves the scalar type from the trip count
expression, if no canonical IV is available. Used in the verifier, in
preparation for late verification, when the canonical IV has been
dissolved.
There are some opcodes that currently require specialized recipes, due
to their result type not being implied by their operands, including
casts.
This leads to duplication from defining multiple full recipes.
This patch introduces a new VPInstructionWithType subclass that also
stores the result type. The general idea is to have opcodes needing to
specify a result type to use this general recipe. The current patch
replaces VPScalarCastRecipe with VInstructionWithType, a similar patch
for VPWidenCastRecipe will follow soon.
There are a few proposed opcodes that should also benefit, without the
need of workarounds:
* https://github.com/llvm/llvm-project/pull/129508
* https://github.com/llvm/llvm-project/pull/119284
PR: https://github.com/llvm/llvm-project/pull/129706
Add a new VPIRPhi subclass of VPIRInstruction, that purely serves as an
overlay, to provide more convenient checking (via directly doing
isa/dyn_cast/cast) and specialied execute/print implementations.
Both VPIRInstruction and VPIRPhi share the same VPDefID, and are
differentiated by the backing IR instruction.
This pattern could alos be used to provide more specialized interfaces
for some VPInstructions ocpodes, without introducing new, completely
spearate recipes. An example would be modeling VPWidenPHIRecipe &
VPScalarPHIRecip using VPInstructions opcodes and providing an interface
to retrieve incoming blocks and values through a VPInstruction subclass
similar to VPIRPhi.
PR: https://github.com/llvm/llvm-project/pull/129387
After #128718 lands there will be two ways of performing a reversed
widened memory access, either by performing a consecutive unit-stride
access and a reverse, or a strided access with a negative stride.
Even though both produce a reversed vector, only the former needs
VPReverseVectorPointerRecipe which computes a pointer to the last
element of each part. A strided reverse still needs a pointer to the
first element of each part so it will use VPVectorPointerRecipe.
This renames VPReverseVectorPointerRecipe to VPVectorEndPointerRecipe to
clarify that a reversed access may not necessarily need a pointer to the
last element.
Now that all phi nodes manage their incoming blocks through the
VPlan-predecessors, there should be no need for having a dedicate
recipe, it should be sufficient to allow PHI opcodes in VPInstruction.
Follow-ups will also migrate VPWidenPHIRecipe and possibly others,
building on top of https://github.com/llvm/llvm-project/pull/129388.
PR: https://github.com/llvm/llvm-project/pull/129767
This patch converts the llvm.vector.splice intrinsic to
llvm.experimental.vp.splice, ensuring that fixed-order recurrences
execute correctly when tail folding by EVL is enable.
Due to the non-VFxUF penultimate EVL issue, the EVL from the previous
iteration will be preserved and used in llvm.experimental.vp.splice.
This is a copy of #126177, since it was automatically and permanently
closed because I messed up the source branch on my remote
This patch proposes to avoid converting widening recipes to VP
intrinsics during the EVL transform.
IIUC we initially did this to avoid `vl` toggles on RISC-V. However we
now have the RISCVVLOptimizer pass which mostly makes this redundant.
Emitting regular IR instead of VP intrinsics allows more generic
optimisations, both in the middle end and DAGCombiner, and we generally
have better patterns in the RISC-V backend for non-VP nodes. Sticking to
regular IR instructions is likely a lot less work than reimplementing
all of these optimisations for VP intrinsics, and on SPEC CPU 2017 we get
noticeably better code generation.
This work feeds part of PR
https://github.com/llvm/llvm-project/pull/88385, and adds support for
vectorising
loops with uncountable early exits and outside users of loop-defined
variables. When calculating the final value from an uncountable early
exit we need to calculate the vector lane that triggered the exit,
and hence determine the value at the point we exited.
All code for calculating the last value when exiting the loop early
now lives in a new vector.early.exit block, which sits between the
middle.split block and the original exit block. Doing this required
two fixes:
1. The vplan verifier incorrectly assumed that the block containing
a definition always dominates the block of the user. That's not true
if you can arrive at the use block from multiple incoming blocks.
This is possible for early exit loops where both the early exit and
the latch jump to the same block.
2. We were adding the new vector.early.exit to the wrong parent loop.
It needs to have the same parent as the actual early exit block from
the original loop.
I've added a new ExtractFirstActive VPInstruction that extracts the
first active lane of a vector, i.e. the lane of the vector predicate
that triggered the exit.
NOTE: The IR generated for dealing with live-outs from early exit
loops is unoptimised, as opposed to normal loops. This inevitably
leads to poor quality code, but this can be fixed up later.
VTypeAnalysis contains some assertions which can be useful for reasoning
that the types of various operands match.
This patch teaches VPlanVerifier to invoke VTypeAnalysis to check them,
and catches some issues with VPInstruction types that are also fixed
here:
* Handles the missing cases for CalculateTripCountMinusVF,
CanonicalIVIncrementForPart and AnyOf
* Fixes ICmp and ActiveLaneMask to return i1 (to align with `icmp` and
`@llvm.get.active.lane.mask` in the LangRef)
The VPlanVerifier unit tests also need to be fleshed out a bit more to
satisfy the stricter assertions
As a first step to move towards modeling the full skeleton in VPlan,
start by wrapping IR blocks created during legacy skeleton creation in
VPIRBasicBlocks and hook them into the VPlan. This means the skeleton
CFG is represented in VPlan, just before execute. This allows moving
parts of skeleton creation into recipes in the VPBBs gradually.
Note that this allows retiring some manual DT updates, as this will be
handled automatically during VPlan execution.
PR: https://github.com/llvm/llvm-project/pull/114292
VPReverseVectorPointer relies on the runtime VF, but in DataWithEVL
tail-folding, EVL (which can be less than VF at runtime) should be used
instead.
This patch updates the logic to check the users of VF and replaces the
second operand if the user is VPReverseVectorPointer.
Update VPlan to include the scalar loop header. This allows retiring
VPLiveOut, as the remaining live-outs can now be handled by adding
operands to the wrapped phis in the scalar loop header.
Note that the current version only includes the scalar loop header, no
other loop blocks and also does not wrap it in a region block.
PR: https://github.com/llvm/llvm-project/pull/109975
Use VPWidenIntrinsicRecipe
(https://github.com/llvm/llvm-project/pull/110486)
to create vp.select intrinsics. This potentially offers an alternative
to duplicating EVL recipes for all existing recipes.
There are some recipes that will need duplicates (at least at the
moment), due to extra code-gen needs (e.g. widening loads and stores).
But in cases the intrinsic can directly be used, creating the widened
intrinsic directly would reduce the need to duplicate some recipes.
PR: https://github.com/llvm/llvm-project/pull/110489
Add a new VPIRInstruction recipe to wrap existing IR instructions not to
be modified during execution, execept for PHIs. For PHIs, a single
VPValue
operand is allowed, and it is used to add a new incoming value for the
single predecessor VPBB. Expect PHIs, VPIRInstructions cannot have any
operands.
Depends on https://github.com/llvm/llvm-project/pull/100658.
PR: https://github.com/llvm/llvm-project/pull/100735
This patch moves branch condition creation to enter the scalar epilogue
loop to VPlan. Modeling the branch in the middle block also requires
modeling the successor blocks. This is done using the recently
introduced VPIRBasicBlock.
Note that the middle.block is still created as part of the skeleton and
then patched in during VPlan execution. Unfortunately the skeleton needs
to create the middle.block early on, as it is also used for induction
resume value creation and is also needed to properly update the
dominator tree during skeleton creation.
After this patch lands, I plan to move induction resume value and phi
node creation in the scalar preheader to VPlan. Once that is done, we
should be able to create the middle.block in VPlan directly.
This is a re-worked version based on the earlier
https://reviews.llvm.org/D150398 and the main change is the use of
VPIRBasicBlock.
Depends on https://github.com/llvm/llvm-project/pull/92525
PR: https://github.com/llvm/llvm-project/pull/92651
This patch adds a new special type of VPBasicBlock that wraps an
existing IR basic block. Recipes of the block get added before the
terminator of the wrapped IR basic block. Making it a subclass of
VPBasicBlock avoids duplicating various APIs to manage recipes in a
block, as well as makes sure the traversals filtering VPBasicBlocks
automatically apply as well.
Initially VPIRBasicBlock are only used for the pre-preheader (wrapping
the original preheader of the scalar loop).
As follow-up, this will be used to move more parts of the skeleton
inside VPlan, starting with the branch and condition in the middle
block.
Separated out of https://github.com/llvm/llvm-project/pull/92651
PR: https://github.com/llvm/llvm-project/pull/93398
After e2a72fa583d9, def-use chains of EVL are modeled explicitly.
So there's no need for a custom check of its placement, as regular
def-use verification will catch mis-placements.
This patch introduces a new VPWidenMemoryRecipe base class and distinct
sub-classes to model loads and stores.
This is a first step in an effort to simplify and modularize code
generation for widened loads and stores and enable adding further more
specialized memory recipes.
PR: https://github.com/llvm/llvm-project/pull/87411
This patch introduces generating VP intrinsics in the Loop Vectorizer.
Currently the Loop Vectorizer supports vector predication in a very
limited capacity via tail-folding and masked load/store/gather/scatter
intrinsics. However, this does not let architectures with active vector
length predication support take advantage of their capabilities.
Architectures with general masked predication support also can only take
advantage of predication on memory operations. By having a way for the
Loop Vectorizer to generate Vector Predication intrinsics, which (will)
provide a target-independent way to model predicated vector
instructions. These architectures can make better use of their
predication capabilities.
Our first approach (implemented in this patch) builds on top of the
existing tail-folding mechanism in the LV (just adds a new tail-folding
mode using EVL), but instead of generating masked intrinsics for memory
operations it generates VP intrinsics for loads/stores instructions. The
patch adds a new VPlanTransforms to replace the wide header predicate
compare with EVL and updates codegen for load/stores to use VP
store/load with EVL.
Other important part of this approach is how the Explicit Vector Length
is computed. (VP intrinsics define this vector length parameter as
Explicit Vector Length (EVL)). We use an experimental intrinsic
`get_vector_length`, that can be lowered to architecture specific
instruction(s) to compute EVL.
Also, added a new recipe to emit instructions for computing EVL. Using
VPlan in this way will eventually help build and compare VPlans
corresponding to different strategies and alternatives.
Differential Revision: https://reviews.llvm.org/D99750
Verifying CFG invariants of a block before verifying its contents allows
contents verification to rely on the CFG invariants (e.g. that there's a
vector loop region that can be retrieved).
This avoids extra checks in
https://github.com/llvm/llvm-project/pull/76172.
Unify VPlan verifiers in verifyVPlanIsValid. This adds verification for
various properties on blocks to the verifier used for VPlans generated
by the inner loop vectorizer. It also adds def-use checks for the
verifier used in the VPlan native path.
This drops the separate flag to enable HCFG verification. Instead, all
VPlans are verified once they have been created, if assertions are
enabled.
This also removes VPWidenPHIRecipe from VPHeaderPHIRecipe; it is used to
model any phi node in the native path.
Similar to vp_depth_first_shallow (D140512) add vp_depth_first_deep to
make existing code clearer and more compact.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D142055
This patch adds a new VPBlockShallowTraversalWrapper struct to
provide graph traits specialization that do not traverse through
VPRegionBlocks. This matches the behavior of the existing traits for
plain VPBlockBase and is a step before moving the graph traits for
VPBlockBase to traverse through VPRegionBlocks to enable cross region
support in VPDominatorTree.
Depends on D140511.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D140512
This reduces the size of VPlan.h and avoids future growth of the file
when the graph traits are extended in future patches.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D140500
Add verification that VPHeaderPHIRecipes are only in header VPBBs. Also
adds missing checks for VPPointerInductionRecipe to
VPHeaderPHIRecipe::classof.
Split off from D119661.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D131989
This patch introduces some initial def-use verification. This catches
cases like the one fixed by D129436.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D129717