Replace direct access to underlying IR instructions with VPlan-level
equivalents, i.e. VPTypeAnalysis and pattern matching on the recipe.
Removes a few uses of accessing underlying IR.
Directly update induction increments with step value created for wide
inductions in createWidenInductionRecipes, which does not require
looking up via RecipeBuilder.
On RISC-V narrowInterleaveGroups doesn't kick in because the wrong
VectorRegWidth is passed to isConsecutiveInterleaveGroup.
narrowInterleaveGroups is always passed the RGK_FixedWidthVector
register size, but on RISC-V the RGK_ScalableVector size is twice as
large because we want to use LMUL 2. This causes the `GroupSize ==
VectorRegWidth` check to fail.
This fixes it by using the scalable register size whenever the VF is
scalable and plumbing it through as a potentially scalable TypeSize.
Note that this only makes a difference when tail folding is disabled, as
narrowInterleaveGroups can't handle EVL based IVs yet.
Currently the only way to enable the use of wide active lane masks is to pass
-enable-wide-lane-mask and force both interleaving & tail-folding with additional
flags. This patch changes selectInterleaveCount to consider interleaving if wide
lane masks were requested, although the feature remains off by default.
This means that VPExpressions will now be constructed for
VPPartialReductionRecipe's when the loop has tail-folding predication.
Note that control-flow (if/else) predication is not yet handled
for partial reductions, because of the way partial reductions
are recognised and built up.
Split off from #158690. Currently if an instruction needs predicated due
to tail folding, it will also have a predicated discount applied to it
in multiple places.
This is likely inaccurate because we can expect a tail folded
instruction to be executed on every iteration bar the last.
This fixes it by checking if the instruction/block was originally
predicated, and in doing so prevents vectorization with tail folding
where we would have had to scalarize the memory op anyway.
On llvm-test-suite this causes 4 loops in total to no longer be
vectorized with -O3 on arm64-apple-darwin, and there's no observable
performance impact.
Update VPRecipeBuilder methos to take VPInstruction* directly instead of
ArrayRef<> for operands and Instruction * separately.
This allows avoid accessing the underlying instruction in some cases, by
using information directly from VPInstruction, like getOpcode(),
getDebugLoc(), and getOperand().
It also allows directly transferring other information directly from
VPInstruction in the future.
This patch removes the explicit Alignment parameter from
VPWidenLoadRecipe and VPWidenStoreRecipe constructors. Instead, these
recipes now directly retrieve the alignment from their
LoadInst/StoreInst.
Update getSCEVExprForVPValue to handle more complex expressions, to use
it in VPReplicateRecipe::comptueCost.
In particular, it supports construction SCEV expressions for
GetElementPtr VPReplicateRecipes, with operands that are
VPScalarIVStepsRecipe, VPDerivedIVRecipe and VPCanonicalIVRecipe. If we
hit a sub-expression we don't support yet, we return
SCEVCouldNotCompute.
Note that the SCEV expression is valid VF = 1: we only support
construction AddRecs for VPCanonicalIVRecipe, which is an AddRec
starting at 0 and stepping by 1. The returned SCEV expressions could be
converted to a VF specific one, by rewriting the AddRecs to ones with
the appropriate step.
Note that the logic for constructing SCEVs for GetElementPtr was
directly ported from ScalarEvolution.cpp.
Another thing to note is that we construct SCEV expression purely by
looking at the operation of the recipe and its translated operands, w/o
accessing the underlying IR (the exception being getting the source
element type for GEPs).
PR: https://github.com/llvm/llvm-project/pull/161276
VPWidenCanonicalIV and VPBlend recipes are created by VPPredicator, and
VPCanonicalIVPHI and VPInstruction recipes are created by
VPlanConstruction. WidenPHIs are never created.
Refine logic to scalarize interleave group member: only skip
scalarization overhead for member being used as addresses. For others,
use the regular scalar memory op cost.
This currently doesn't trigger in practice as far as I could find, but
fixes a potential divergence between VPlan- and legacy cost models.
It fixes a concrete divergence with a follow-up patch,
https://github.com/llvm/llvm-project/pull/161276.
Add an member Alignment to VPWidenMemoryRecipe to store memory alignment
directly in the recipe. Update constructors, clone(), and relevant
methods to use this stored alignment instead of querying the IR
instruction. This allows VPWidenLoadRecipe/VPWidenStoreRecipe to be
constructed without relying on the original IR instruction in the
future.
When an VF is specified via a loop hint, it will be clamped to a safe
VF or ignored if it is found to be unsafe. This is not the case for
user-specified interleave counts, which can lead to loops such as
the following with a memory dependence being vectorised with
interleaving:
```
#pragma clang loop interleave_count(4)
for (int i = 4; i < LEN; i++)
b[i] = b[i - 4] + a[i];
```
According to [1], loop hints are ignored if they are not safe to apply.
This patch adds a check to prevent vectorisation with interleaving if
isSafeForAnyVectorWidth() returns false. This is already checked in
selectInterleaveCount().
[1]
https://llvm.org/docs/LangRef.html#llvm-loop-vectorize-and-llvm-loop-interleave
Move narrowInterleaveGroups to to general VPlan optimization stage.
To do so, narrowInterleaveGroups now has to find a suitable VF where all
interleave groups are consecutive and saturate the full vector width.
If such a VF is found, the original VPlan is split into 2:
a) a new clone which contains all VFs of Plan, except VFToOptimize, and
b) the original Plan with VFToOptimize as single VF.
The original Plan is then optimized. If a new copy for the other VFs has
been created, it is returned and the caller has to add it to the list of
candidate plans.
Together with https://github.com/llvm/llvm-project/pull/149702, this
allows to take the narrowed interleave groups into account when
computing costs to choose the best VF and interleave count.
One example where we currently miss interleaving/unrolling when
narrowing interleave groups is https://godbolt.org/z/Yz77zbacz
PR: https://github.com/llvm/llvm-project/pull/149706
Follow up on 7c4f188 ([LV] Support multiplies by constants when forming
scaled reductions), introducing m_APInt, and improving code around
canConstantBeExtended: we change canConstantBeExtended to take an APInt.
VPWidenCastRecipes with Trunc opcodes where missing the correct OpType
for IR flags. Update createWidenCast to set the correct flags for
truncs, and use it consistenly.
Fixes https://github.com/llvm/llvm-project/issues/162374.
Instead of re-setting the start value of the canonical IV when
vectorizing the epilogue we can emit an Add VPInstruction to provide
canonical IV value, adjusted by the resume value from the main loop.
This is in preparation to make the canonical IV a VPValue defined by
loop regions. It ensures that the canonical IV always starts at 0.
PR: https://github.com/llvm/llvm-project/pull/161589
Consistently scalarize loads used as part of address computations across
all uses in the loop. This aligns the VPlan and legacy cost model and
fixes a divergence crash. It doesn't matter if the load and address
users are in different blocks, as long as they are in the same loop, the
scalar value can be used. This removes a number of insert/extracts.
This reverts commit f80c0baf058dbdc5 and 94eade61a02ae5.
Recommit a small fix for targets using prefersVectorizedAddressing.
Original message:
Update VPReplicateRecipe::computeCost to compute costs of more
replicating loads/stores.
There are 2 cases that require extra checks to match the legacy cost
model:
1. If the pointer is based on an induction, the legacy cost model passes
its SCEV to getAddressComputationCost. In those cases, still fall back
to the legacy cost. SCEV computations will be added as follow-up
2. If a load is used as part of an address of another load, the legacy
cost model skips the scalarization overhead. Those cases are currently
handled by a usedByLoadOrStore helper.
Note that getScalarizationOverhead also needs updating, because when the
legacy cost model computes the scalarization overhead, scalars have not
been collected yet, so we can't each for replicating recipes to skip
their cost, except other loads. This again can be further improved by
modeling inserts/extracts explicitly and consistently, and compute costs
for those operations directly where needed.
PR: https://github.com/llvm/llvm-project/pull/160053
This reverts commit f61be4352592639a0903e67a9b5d3ec664ad4d23.
Recommit a small fix handling scalarization overhead consistently with
legacy cost model if a load is used directly as operand of another
memory operation, which fixes
https://github.com/llvm/llvm-project/issues/161404.
Original message:
Update VPReplicateRecipe::computeCost to compute costs of more
replicating loads/stores.
There are 2 cases that require extra checks to match the legacy cost
model:
1. If the pointer is based on an induction, the legacy cost model passes
its SCEV to getAddressComputationCost. In those cases, still fall back
to the legacy cost. SCEV computations will be added as follow-up
2. If a load is used as part of an address of another load, the legacy
cost model skips the scalarization overhead. Those cases are currently
handled by a usedByLoadOrStore helper.
Note that getScalarizationOverhead also needs updating, because when the
legacy cost model computes the scalarization overhead, scalars have not
been collected yet, so we can't each for replicating recipes to skip
their cost, except other loads. This again can be further improved by
modeling inserts/extracts explicitly and consistently, and compute costs
for those operations directly where needed.
PR: https://github.com/llvm/llvm-project/pull/160053
We can create partial reductions for multiplies with constants, if the
constant is small enough to be extended from source to destination type
w/o changing the value.
This only handles constant on the right side of a multiply, relying on
other passes to canonicalize the input.
Alive2 Proofs: https://alive2.llvm.org/ce/z/iWRMr6
PR: https://github.com/llvm/llvm-project/pull/161092
If there are direct memory op users of the newly scalarized load,
their cost may have changed because there's no scalarization
overhead for the operand. Update it.
This ensures assigning consistent costs to scalarized memory
instructions that themselves have scalarized memory instructions as
operands.
LV does not preserve LCSSA, it constructs it just before processing a
loop to vectorize. Runtime check expressions are invariant to that loop,
so expanding them should not break LCSSA form for the loop we are about
to vectorize.
This fixes a crash when discarding instructions generated when expanding
runtime checks, if the expansion introduces LCSSA phis for values from
other loops which are not in LCSSA form: we would introduce new LCSSA
phis and update all outside users, some of which are not created by the
expander and cannot be cleaned up.
Fixes https://github.com/llvm/llvm-project/issues/158259.
PR: https://github.com/llvm/llvm-project/pull/159556
Update VPReplicateRecipe::computeCost to compute costs of more
replicating loads/stores.
There are 2 cases that require extra checks to match the legacy cost
model:
1. If the pointer is based on an induction, the legacy cost model passes
its SCEV to getAddressComputationCost. In those cases, still fall back
to the legacy cost. SCEV computations will be added as follow-up
2. If a load is used as part of an address of another load, the legacy
cost model skips the scalarization overhead. Those cases are currently
handled by a usedByLoadOrStore helper.
Note that getScalarizationOverhead also needs updating, because when the
legacy cost model computes the scalarization overhead, scalars have not
been collected yet, so we can't each for replicating recipes to skip
their cost, except other loads. This again can be further improved by
modeling inserts/extracts explicitly and consistently, and compute costs
for those operations directly where needed.
PR: https://github.com/llvm/llvm-project/pull/160053
In order to avoid conflating the legacy CSE with the VPlan-based one,
rename the legacy CSE and insert a FIXME to clarify the nature of the
legacy CSE.
Additional CSE opportunities are exposed after converting to concrete
recipes/dissolving regions and materializing various expressions. Run
CSE later, to capitalize on some of the late opportunities.
PR: https://github.com/llvm/llvm-project/pull/160572
Move creation of the minimum iteration check for the epilogue vector
loop to VPlan. This is a first step towards breaking up and moving
skeleton creation for epilogue vectorization to VPlan.
It moves most logic out of EpilogueVectorizerEpilogueLoop: the minimum
iteration check is created directly in VPlan, connecting the check
blocks from the main vector loop is done as post-processing. Next steps
are to move connecting and updating the branches from the check blocks
to VPlan, as well as updating the incoming values for phis.
Test changes are improvements due to folding of live-ins.
PR: https://github.com/llvm/llvm-project/pull/157545