ILV::scalarizeInstruction still uses the original IR operands to check
if an input value is uniform after vectorization.
There is no need to go back to the cost model to figure that out, as the
information is already explicit in the VPlan. Just check directly
whether the VPValue is defined outside the plan or is a uniform
VPReplicateRecipe.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D114253
If the condition of a select is a compare, pass its predicate to
TTI::getCmpSelInstrCost to get a more accurate cost value instead
of passing BAD_ICMP_PREDICATE.
I noticed that the commit message from D90070 had a comment about the
vectorized select predicate possibly being composed of other compares with
different predicate values, but I wasn't able to construct an example
where this was an actual issue. If this is an issue, I guess we could
add another check that the block isn't predicated for any reason.
Reviewed By: dmgreen, fhahn
Differential Revision: https://reviews.llvm.org/D114646
VPWidenIntOrFpInductionRecipes can either be constructed with a PHI and
an optional cast or a PHI and a trunc instruction. Reflect this in 2
separate constructors. This also simplifies a follow-up change.
The code in widenMemoryInstruction has already been transitioned
to only rely on information provided by VPWidenMemoryInstructionRecipe
directly.
Moving the code directly to VPWidenMemoryInstructionRecipe::execute
completes the transition for the recipe.
It provides the following advantages:
1. Less indirection, easier to see what's going on.
2. Removes accesses to fields of ILV.
2) in particular ensures that no dependencies on
fields in ILV for vector code generation are re-introduced.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D114324
The code in widenSelectInstruction has already been transitioned
to only rely on information provided by VPWidenSelectRecipe directly.
Moving the code directly to VPWidenSelectRecipe::execute completes
the transition for the recipe.
It provides the following advantages:
1. Less indirection, easier to see what's going on.
2. Removes accesses to fields of ILV.
2) in particular ensures that no dependencies on
fields in ILV for vector code generation are re-introduced.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D114323
We ask `TTI.getAddressComputationCost()` about the cost of computing vector address,
and then multiply it by the vector width. This doesn't make any sense,
it implies that we'd do a vector GEP and then scalarize the vector of pointers,
but there is no such thing in the vectorized IR, we perform scalar GEP's.
This is *especially* bad on X86, and was effectively prohibiting any scalarized
vectorization of gathers/scatters, because `X86TTIImpl::getAddressComputationCost()`
says that cost of vector address computation is `10` as compared to `1` for scalar.
The computed costs are similar to the ones with D111222+D111220,
but we end up without masked memory intrinsics that we'd then have to
expand later on, without much luck. (D111363)
Differential Revision: https://reviews.llvm.org/D111460
The code in widenInstruction has already been transitioned to
only rely on information provided by VPWidenRecipe directly.
Moving the code directly to VPWidenRecipe::execute completes
the transition for the recipe.
It provides the following advantages:
1. Less indirection, easier to see what's going on.
2. Removes accesses to fields of ILV.
2) in particular ensures that no dependencies on
fields in ILV for vector code generation are re-introduced.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D114322
The code in widenGEP has already been transitioned to only rely on
information provided by VPWidenGEPRecipe directly.
Moving the code directly to VPWidenGEPRecipe::execute completes
the transition for the recipe.
It provides the following advantages:
1. Less indirection, easier to see what's going on.
2. Removes accesses to fields of ILV.
2) in particular ensures that no dependencies on
fields in ILV for GEP code generation are re-introduced.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D114321
collectLoopScalars should only add non-uniform nodes to the list if they
are used by a load/store instruction that is marked as CM_Scalarize.
Before this patch, the LV incorrectly marked pointer induction variables
as 'scalar' when they required to be widened by something else,
such as a compare instruction, and weren't used by a node marked as
'CM_Scalarize'. This case is covered by sve-widen-phi.ll.
This change also allows removing some code where the LV tried to
widen the PHI nodes with a stepvector, even though it was marked as
'scalarAfterVectorization'. Now that this code is more careful about
marking instructions that need widening as 'scalar', this code has
become redundant.
Differential Revision: https://reviews.llvm.org/D114373
In VPRecipeBuilder::handleReplication if we believe the instruction
is predicated we then proceed to create new VP region blocks even
when the load is uniform and only predicated due to tail-folding.
I have updated isPredicatedInst to avoid treating a uniform load as
predicated when tail-folding, which means we can do a single scalar
load and a vector splat of the value.
Tests added here:
Transforms/LoopVectorize/AArch64/tail-fold-uniform-memops.ll
Differential Revision: https://reviews.llvm.org/D112552
This patch updates the cost model for ordered reductions so that a call
to the llvm.fmuladd intrinsic is modelled as a normal fmul instruction
plus the cost of an ordered fadd reduction.
Differential Revision: https://reviews.llvm.org/D111630
In-loop vector reductions which use the llvm.fmuladd intrinsic involve
the creation of two recipes; a VPReductionRecipe for the fadd and a
VPInstruction for the fmul. If the call to llvm.fmuladd has fast-math flags
these should be propagated through to the fmul instruction, so an
interface setFastMathFlags has been added to the VPInstruction class to
enable this.
Differential Revision: https://reviews.llvm.org/D113125
This patch fixes PR52111. The problem is that LV propagates poison-generating flags (`nuw`/`nsw`, `exact`
and `inbounds`) in instructions that contribute to the address computation of widen loads/stores that are
guarded by a condition. It may happen that when the code is vectorized and the control flow within the loop
is linearized, these flags may lead to generating a poison value that is effectively used as the base address
of the widen load/store. The fix drops all the integer poison-generating flags from instructions that
contribute to the address computation of a widen load/store whose original instruction was in a basic block
that needed predication and is not predicated after vectorization.
Reviewed By: fhahn, spatel, nlopes
Differential Revision: https://reviews.llvm.org/D111846
A first step towards modeling preheader and exit blocks in VPlan as well.
Keeping the vector loop in a region allows for changing the VF as we
traverse region boundaries.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D113182
When asking how many parts are required for a scalable vector type
there are occasions when it cannot be computed. For example, <vscale x 1 x i3>
is one such vector for AArch64+SVE because at the moment no matter how we
promote the i3 type we never end up with a legal vector. This means
that getTypeConversion returns TypeScalarizeScalableVector as the
LegalizeKind, and then getTypeLegalizationCost returns an invalid cost.
This then causes BasicTTImpl::getNumberOfParts to dereference an invalid
cost, which triggers an assert. This patch changes getNumberOfParts to
return 0 for such cases, since the definition of getNumberOfParts in
TargetTransformInfo.h states that we can use a return value of 0 to represent
an unknown answer.
Currently, LoopVectorize.cpp is the only place where we need to check for
0 as a return value, because all other instances will not currently
ask for the number of parts for <vscale x 1 x iX> types.
In addition, I have changed the target-independent interface for
getNumberOfParts to return 1 and assume there is a single register
that can fit the type. The loop vectoriser has lots of tests that are
target-independent and they relied upon the 0 value to mean the
answer is known and that we are not scalarising the vector.
I have added tests here that show we correctly return an invalid cost
for VF=vscale x 1 when the loop contains unusual types such as i7:
Transforms/LoopVectorize/AArch64/sve-inductions-unusual-types.ll
Differential Revision: https://reviews.llvm.org/D113772
The interface is a convenience function to ask if a block requires
predication when widening, but it's important that there are two
separate concepts to consider:
(A) The block was predicated in the original loop.
(B) The block was unpredicated in the original loop, but requires
predication because of tail folding.
In the case of (B) we know that at least one lane of the vector will
be executed, which means we can implementing a load from a uniform address
with a scalar load + splat (D112552). In the case of predication because
of (A), we cannot do this, because the scalar load itself requires
predication.
The name 'blockNeedsPredication' does not make the distinction between
(A) and (B), hence the reason to rename it.
Reviewed By: david-arm
Differential Revision: https://reviews.llvm.org/D113392
Unfortunately sinking recipes for first-order recurrences relies on
the original position of recipes. So if a recipes needs to be sunk after
an optimized induction, it needs to stay in the original position, until
sinking is done. This is causing PR52460.
To fix the crash, keep the recipes in the original position until
sink-after is done.
Post-commit follow-up to c45045bfd04af9 to address PR52460.
This reverts commit 0d748b4d32cbddf58a1ff83f3ff178ec1ad49edc.
This is causing some failures when building Spec2017 with scalable
vectors. Reverting to investigate.
Changes VPReplicateRecipe to extract the last lane from an unconditional,
uniform store instruction. collectLoopUniforms will also add stores to
the list of uniform instructions where Legal->isUniformMemOp is true.
setCostBasedWideningDecision now sets the widening decision for
all uniform memory ops to Scalarize, where previously GatherScatter
may have been chosen for scalable stores.
This fixes an assert ("Cannot yet scalarize uniform stores") in
setCostBasedWideningDecision when we have a loop containing a
uniform i1 store and a scalable VF, which we cannot create a scatter for.
Reviewed By: sdesmalen, david-arm, fhahn
Differential Revision: https://reviews.llvm.org/D112725
This patch adds a function to verify general properties of VPlans. The
first check makes sure that all phi-like recipes are at the beginning of
a block, with no other recipes in between.
Note that this currently may not hold for VPBlendRecipes at the moment,
as other recipes may be inserted before the VPBlendRecipe during mask
creation.
Note that this patch depends on D111300 and D111301, which fix code that
breaks the checked invariant.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D111302
All phi-like recipes should be at the beginning of a VPBasicBlock with
no other recipes in between. Ensure that the recurrence-splicing recipe
is not added between phi-like recipes, but after them.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D111301
When targeting a specific CPU with scalable vectorization, the knowledge
of that particular CPU's vscale value can be used to tune the cost-model
and make the cost per lane less pessimistic.
If the target implements 'TTI.getVScaleForTuning()', the cost-per-lane
is calculated as:
Cost / (VScaleForTuning * VF.KnownMinLanes)
Otherwise, it assumes a value of 1 meaning that the behavior
is unchanged and calculated as:
Cost / VF.KnownMinLanes
Reviewed By: kmclaughlin, david-arm
Differential Revision: https://reviews.llvm.org/D113209
The common use case for calling createStepForVF is currently something
like:
Value *Step = createStepForVF(Builder, ConstantInt::get(Ty, UF), VF);
and it makes more sense to reduce overall lines of code and change the
function to let it create the constant instead. With my patch this
becomes:
Value *Step = createStepForVF(Builder, Ty, VF, UF);
and the ConstantInt is created instead createStepForVF. A side-effect of
this is that the code in createStepForVF is also becomes simpler.
As part of this patch I've also replaced some calls to getRuntimeVF
with calls to createStepForVF, i.e.
getRuntimeVF(Builder, Count->getType(), VFactor * UFactor) ->
createStepForVF(Builder, Count->getType(), VFactor, UFactor)
because this feels semantically better.
Differential Revision: https://reviews.llvm.org/D113122
At the moment in LoopVectorizationCostModel::selectEpilogueVectorizationFactor
we bail out if the main vector loop uses a scalable VF. This patch adds
support for generating epilogue vector loops using a fixed-width VF when the
main vector loop uses a scalable VF.
I've changed LoopVectorizationCostModel::selectEpilogueVectorizationFactor
so that we convert the scalable VF into a fixed-width VF and do profitability
checks on that instead. In addition, since the scalable and fixed-width VFs
live in different VPlans that means I had to change the calls to
LVP.hasPlanWithVFs so that we only pass in the fixed-width VF.
New tests added here:
Transforms/LoopVectorize/AArch64/sve-epilog-vect.ll
Differential Revision: https://reviews.llvm.org/D109432
The public API for this functionality is forgetValue(). There was
only one call from LoopVectorize, which was directly next to a
forgetValue() call and as such redundant.
The recipe produces exactly one VPValue and can inherit directly from
it. This is in line with other recipes and avoids having to use
getVPSingleValue.
This patch updates VPReductionRecipe::execute so that the fast-math
flags associated with the underlying instruction of the VPRecipe are
propagated through to the reductions which are created.
Differential Revision: https://reviews.llvm.org/D112548
We never expect the runtime VF to be negative so we should use
the uitofp instruction instead of sitofp.
Differential revision: https://reviews.llvm.org/D112610
This patch updates recipe creation to ensure all
VPWidenIntOrFpInductionRecipes are in the header block. At the moment,
new induction recipes can be created in different blocks when trying to
optimize casts and induction variables.
Having all induction recipes in the header makes it easier to
analyze/transform them in VPlan.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D111300
This patch changes the definition of getStepVector from:
Value *getStepVector(Value *Val, int StartIdx, Value *Step, ...
to
Value *getStepVector(Value *Val, Value *StartIdx, Value *Step, ...
because:
1. it seems inconsistent to pass some values as Value* and some as
integer, and
2. future work will require the StartIdx to be an expression made up
of runtime calculations of the VF.
In widenIntOrFpInduction I've changed the code to pass in the
value returned from getRuntimeVF, but the presence of the assert:
assert(!VF.isScalable() && "scalable vectors not yet supported.");
means that currently this code path is only exercised for fixed-width
VFs and so the patch is still NFC.
Differential revision: https://reviews.llvm.org/D111882
I have removed LoopVectorizationPlanner::setBestPlan, since this
function is quite aggressive because it deletes all other plans
except the one containing the <VF,UF> pair required. The code is
currently written to assume that all <VF,UF> pairs will live in the
same vplan. This is overly restrictive, since scalable VFs live in
different plans to fixed-width VFS. When we add support for
vectorising epilogue loops when the main loop uses scalable vectors
then we will the vplan for the main loop will be different to the
epilogue.
Instead I have added a new function called
LoopVectorizationPlanner::getBestPlanFor
that returns the best vplan for the <VF,UF> pair requested and leaves
all the vplans untouched. We then pass this best vplan to
LoopVectorizationPlanner::executePlan
which now takes an additional VPlanPtr argument.
Differential revision: https://reviews.llvm.org/D111125
Use RdxDesc->getOpcode instead of getUnderlingInstr()->getOpcode.
Move the code which finds Kind and IsOrdered to be outside the for loop
since neither of these change with the vector part.
Differential Revision: https://reviews.llvm.org/D112547
At the moment a dummy entry block is created at the beginning of VPlan
construction. This dummy block is later removed again.
This means it is not easy to identify the VPlan header block in a
general fashion, because during recipe creation it is the single
successor of the entry block, while later it is the entry block.
To make getting the header easier, just skip creating the dummy block.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D111299
This simplifies the return value of addRuntimeCheck from a pair of
instructions to a single `Value *`.
The existing users of addRuntimeChecks were ignoring the first element
of the pair, hence there is not reason to track FirstInst and return
it.
Additionally all users of addRuntimeChecks use the second returned
`Instruction *` just as `Value *`, so there is no need to return an
`Instruction *`. Therefore there is no need to create a redundant
dummy `and X, true` instruction any longer.
Effectively this change should not impact the generated code because the
redundant AND will be folded by later optimizations. But it is easy to
avoid creating it in the first place and it allows more accurately
estimating the cost of the runtime checks.
Record widening decisions for memory operations within the planned recipes and
use the recorded decisions in code-gen rather than querying the cost model.
Differential Revision: https://reviews.llvm.org/D110479
This patch fixes another crash revealed by PR51614:
when *deciding* to vectorize with masked interleave groups, check if the access
is reverse (which is currently not supported).
Differential Revision: https://reviews.llvm.org/D108900
collectLoopScalars collects pointer induction updates in ScalarPtrs, assuming
that the instruction will be scalar after vectorization. This may crash later
in VPReplicateRecipe::execute() if there there is another user of the instruction
other than the Phi node which needs to be widened.
This changes collectLoopScalars so that if there are any other users of
Update other than a Phi node, it is not added to ScalarPtrs.
Reviewed By: david-arm, fhahn
Differential Revision: https://reviews.llvm.org/D111294
This patch adds further support for vectorisation of loops that involve
selecting an integer value based on a previous comparison. Consider the
following C++ loop:
int r = a;
for (int i = 0; i < n; i++) {
if (src[i] > 3) {
r = b;
}
src[i] += 2;
}
We should be able to vectorise this loop because all we are doing is
selecting between two states - 'a' and 'b' - both of which are loop
invariant. This just involves building a vector of values that contain
either 'a' or 'b', where the final reduced value will be 'b' if any lane
contains 'b'.
The IR generated by clang typically looks like this:
%phi = phi i32 [ %a, %entry ], [ %phi.update, %for.body ]
...
%pred = icmp ugt i32 %val, i32 3
%phi.update = select i1 %pred, i32 %b, i32 %phi
We already detect min/max patterns, which also involve a select + cmp.
However, with the min/max patterns we are selecting loaded values (and
hence loop variant) in the loop. In addition we only support certain
cmp predicates. This patch adds a new pattern matching function
(isSelectCmpPattern) and new RecurKind enums - SelectICmp & SelectFCmp.
We only support selecting values that are integer and loop invariant,
however we can support any kind of compare - integer or float.
Tests have been added here:
Transforms/LoopVectorize/AArch64/sve-select-cmp.ll
Transforms/LoopVectorize/select-cmp-predicated.ll
Transforms/LoopVectorize/select-cmp.ll
Differential Revision: https://reviews.llvm.org/D108136