Before this patch, a VPlan contained 2 mappings for Values -> VPValue:
1) Value2VPValue and 2) VPExternalDefs.
This duplication is unnecessary and there are already cases where
external defs are added to Value2VPValue. This patch replaces all uses
of VPExternalDefs with Value2VPValue.
It clarifies the naming of getOrAddVPValue (to getOrAddExternalVPValue)
and addVPValue (to addExternalVPValue).
At the moment, this is NFC, but will enable additional simplifications
in D147783.
Depends on D147891.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D147892
Update the isCanonical() implementations to check the VPValue step
operand instead of the step in the induction descriptor.
At the moment this is NFC, but it enables further optimizations if the
step is replaced by a constant in D147783.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D147891
To calculate the trip count we need to add 1 to the backedge
taken count. If we need to widen the backedge count, it's better
to do the add before the widening if we can guarantee it won't
overflow.
The code here is based on similar code I found in
LoopIdiomRecognize.
This is the vectorizer version of this InstCombine patch D142783.
Looking at the IR diffs, this does look like it gets more cases
than the InstCombine patch.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D147355
Instead of iterating over all LCSSA phis in the exit block, collect all
LiveOut users of the FOR splice VPInstruction and only update those
users.
Building on top of D147471, this removes an access to the cost model
after VPlan execution.
Depends on D147471.
Reviewed By: Ayal, michaelmaitland
Differential Revision: https://reviews.llvm.org/D147472
Instead of clearing live outs when a scalar epilogue is required late,
don't add live outs during VPlan construction if a scalar epilogue is
required.
This enables more VPlan-based DCE (if the live out would be the only
user in the plan) and is a step towards removing an access of the cost
model in fixedVectorizedLoop (which is after VPlan execution).
Depends on D147468.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D147471
This removes the need to convert the end of the range to the next
power-of-2 for the end iterator after 4bd3fda5124962 and was suggested
as follow-up TODO in D147468.
Add an iterator to iterate over all VFs in VFRange. This simplifies some
existing code and allows using all_of,any_of and none_of on a VFRange.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D147468
This patch adds a NeedsMaskForGaps field to VPInterleaveRecipe to record
whether a mask for gaps is needed. This removes a dependence on the cost
model in VPlan code-generation.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D147467
There is no need to store information about invariance in the recipe.
Replace the fields with checks of the operands using
isDefinedOutsideVectorRegions.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D144489
There is no need to store information about invariance in the recipe.
Replace the fields with checks of the operands using
isDefinedOutsideVectorRegions.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D144487
This patch adds the predicate as additional operand to VPReplicateRecipe
during initial construction. The predicated recipes are later moved into
replicate regions. This simplifies constructions and some VPlan
transformations, like fixed-order recurrence handling.
It also improves codegen in some cases (e.g. for in-loop reductions),
because the recipes remain in the same block.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D143865
When using tail-folding and using the predicate for both data and control-flow
(the next vector iteration's predicate is generated with the llvm.active.lane.mask
intrinsic and then tested for the backedge), the LoopVectorizer still inserts a
runtime check to see if the 'i + VF' may at any point overflow for the given
trip-count. When it does, it falls back to a scalar epilogue loop.
We can get rid of that runtime check in the pre-header and therefore also
remove the scalar epilogue loop. This reduces code-size and avoids a runtime
check.
Consider the following loop:
void foo(char * __restrict__ dst, char *src, unsigned long N) {
for (unsigned long i=0; i<N; ++i)
dst[i] = src[i] + 42;
}
If 'N' is e.g. ULONG_MAX, and the VF > 1, then the loop iteration counter
will overflow when calculating the predicate for the next vector iteration
at some point, because LLVM does:
vector.ph:
%active.lane.mask.entry = tail call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 0, i64 %N)
vector.body:
%index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%active.lane.mask = phi <vscale x 16 x i1> [ %active.lane.mask.entry, %vector.ph ], [ %active.lane.mask.next, %vector.body ]
...
%index.next = add i64 %index, 16
; The add above may overflow, which would affect the lane mask and control flow. Hence a runtime check is needed.
%active.lane.mask.next = tail call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 %index.next, i64 %N)
%8 = extractelement <vscale x 16 x i1> %active.lane.mask.next, i64 0
br i1 %8, label %vector.body, label %for.cond.cleanup, !llvm.loop !7
The solution:
What we can do instead is calculate the predicate before incrementing
the loop iteration counter, such that the llvm.active.lane.mask is
calculated from 'i' to 'tripcount > VF ? tripcount - VF : 0', i.e.
vector.ph:
%active.lane.mask.entry = tail call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 0, i64 %N)
%N_minus_VF = select %N > 16 ? %N - 16 : 0
vector.body:
%index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%active.lane.mask = phi <vscale x 16 x i1> [ %active.lane.mask.entry, %vector.ph ], [ %active.lane.mask.next, %vector.body ]
...
%active.lane.mask.next = tail call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 %index, i64 %N_minus_VF)
%index.next = add i64 %index, %4
; The add above may still overflow, but this time the active.lane.mask is not affected
%8 = extractelement <vscale x 16 x i1> %active.lane.mask.next, i64 0
br i1 %8, label %vector.body, label %for.cond.cleanup, !llvm.loop !7
For N = 20, we'd then get:
vector.ph:
%active.lane.mask.entry = tail call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 0, i64 %N)
; %active.lane.mask.entry = <1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1>
%N_minus_VF = select 20 > 16 ? 20 - 16 : 0
; %N_minus_VF = 4
vector.body: (1st iteration)
... ; using <1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1> as predicate in the loop
...
%active.lane.mask.next = tail call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 0, i64 4)
; %active.lane.mask.next = <1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0>
%index.next = add i64 0, 16
; %index.next = 16
%8 = extractelement <vscale x 16 x i1> %active.lane.mask.next, i64 0
; %8 = 1
br i1 %8, label %vector.body, label %for.cond.cleanup, !llvm.loop !7
; branch to %vector.body
vector.body: (2nd iteration)
... ; using <1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0> as predicate in the loop
...
%active.lane.mask.next = tail call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 16, i64 4)
; %active.lane.mask.next = <0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0>
%index.next = add i64 16, 16
; %index.next = 32
%8 = extractelement <vscale x 16 x i1> %active.lane.mask.next, i64 0
; %8 = 0
br i1 %8, label %vector.body, label %for.cond.cleanup, !llvm.loop !7
; branch to %for.cond.cleanup
Reviewed By: fhahn, david-arm
Differential Revision: https://reviews.llvm.org/D142109
There is no need to update the AlsoPack field when creating
VPReplicateRecipes. It can be easily computed based on the VP def-use
chains when it is needed.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D143864
When vectorizing code with function calls in it, if we encounter
a function which only has vectorized variants requiring a mask
we can synthesize an all-true mask to enable us to proceed.
Since we want the mask to be represented in vplan, the pointer
to the chosen Function is now stored as part of the
VPWidenCallRecipe, and mask arguments are added at the
appropriate index to the recipe operands.
Reviewed By: david-arm, fhahn, reames
Differential Revision: https://reviews.llvm.org/D132458
This patch introduces DomTreeNodeTraits for customization. Clients can implement
DomTreeNodeTraitsCustom to provide custom ParentPtr, getEntryNode and getParent.
There's also a default specialization if DomTreeNodeTraitsCustom is not implemented,
that assume a Function-like NodeT. This is what is used for the existing DominatorTree
and MachineDominatorTree.
The main motivation for this patch is using DominatorTreeBase across all
regions of a VPlan, see D140513.
Reviewed By: kuhar
Differential Revision: https://reviews.llvm.org/D142162
At the moment, both VPValue and VPDef have an ID used when casting via
classof. This duplication is cumbersome, because it requires adding IDs
for new recipes twice and also requires setting them twice. In a few
cases, there's only a VPDef ID and no VPValue ID, which can cause same
confusion.
To simplify things, remove the VPValue IDs for different recipes.
Instead, only retain the generic VPValue ID (= used VPValues without a
corresponding defining recipe) and VPVRecipe for VPValues that are
defined by recipes that inherit from VPValue.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D140848
Various places in the code where still using the VPRecipeBase:: prefix
for VPDef IDs or not prefix at all. Now that the VPDef IDs have been
moved to VPDef, use this prefix instead and consistently use it.
This reduces the size of VPlan.h and avoids future growth of the file
when the graph traits are extended in future patches.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D140500
Add and run VPlan transform to fold blocks with a single predecessor
into the predecessor. This remove redundant blocks and addresses a TODO
to replace special handling for the vector latch VPBB.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D139927
This sets the stage for D133017 by moving out the code that performs
VPlan based simplifications to a separate transform that takes the
chosen VF & UF as arguments.
The main advantage is that this transform runs before any changes to
the CFG are being made. This allows using SCEV without worrying about
making queries while the IR is in an incomplete state.
Note that this patch switches the reasoning to use SCEV, but still only
simplifies loops with constant trip counts. Using SCEV here is needed to
access the backedge taken count, because the trip count IR value has not
been created yet.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D135017
Explicitly track the UFs supported in a VPlan. This is needed to
allow transformations to restrict the UFs which are supported.
Discussed as separate improvement in D135017.
The VFs and UFs may be more constrained as the plans are transformed
(e.g. see D135017 for an example).
To make sure the VFs/UFs included in the VPlan dump are accurate,
generate them when accessing a plan's name, rather than include them in
the name string set after initial construction.
Code generation now uses the start VPValue of induction recipes.
This makes it possible to adjust the start value of the epilogue
vector loop to use the 'resume' value of the main vector loop.
Fixes#59459.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D92132
Add a VP_CLASSOF_IMPL macro to define common classof implementations for
recipes. This reduces duplication and also adds missing implementations
to existing recipes.
In scalar plans, replicate recipes will only generate a single value per
UF, independent of whether they are uniform or not. So don't consider
uniformity for plans with scalar VFs only.
This allows us to handle a few additional cases in VPlan sinking instead
of non-VPlan sinkScalarOperands.
Depends on D133762.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D134218
Document recipes used to model inductions after introducing
VPDerivedIVRecipe in 0c5df7cd2f81c.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D138748
This reverts commit bf15f1e489aa2f1ac13268c9081a992a8963eb5b.
The updated version fixes a crash by checking the induction kind instead
of the opcode; for integer inductions, the step is always added, but the
opcode might not be set.
This patch splits off the logic to transform the canonical IV to a
a value for an induction with a different start and step. This
transformation only needs to be done once (independent of VF/UF) and
enables sinking of VPScalarIVStepsRecipe as follow-up.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D133758
The return value of getDef is guaranteed to be a VPRecipeBase and all
users can also accept a VPRecipeBase *. Most users actually case to
VPRecipeBase or a specific recipe before using it, so this change
removes a number of redundant casts.
Also rename it to getDefiningRecipe to make the name a bit clearer.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D136068
@Ayal suggested a better named helper than using `!getDef()` to check if
a value is invariant across all parts.
The property we are using here is that the VPValue is defined outside
any vector loop region. There's a TODO left to handle recipes defined in
pre-header blocks.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D133666
The dependent code has been changed quite a lot since 151c144 which
b73d2c8 effectively reverts. Now we run into a case where lowering
didn't expect/support the behavior pre 151c144 any longer.
Update the code dealing with scalable pointer inductions to also check
for uniformity in combination with isScalarAfterVectorization. This
should ensure scalable pointer inductions are handled properly during
epilogue vectorization.
Fixes#57912.
Epilogue vectorization uses isScalarAfterVectorization to check if
widened versions for inductions need to be generated and bails out in
those cases.
At the moment, there are scenarios where isScalarAfterVectorization
returns true but VPWidenPointerInduction::onlyScalarsGenerated would
return false, causing widening.
This can lead to widened phis with incorrect start values being created
in the epilogue vector body.
This patch addresses the issue by storing the cost-model decision in
VPWidenPointerInductionRecipe and restoring the behavior before 151c144.
This effectively reverts 151c144, but the long-term fix is to properly
support widened inductions during epilogue vectorization
Fixes#57712.