Update createScalarIVSteps to take an insert point as parameter. This
ensures that the inserted scalar steps are in the same order as the
recipes they replace (vs in reverse order as currently). This helps to
reduce the diff for follow-up changes.
Update loop interleaving count computation to address loops that require at least one scalar iteration in the epilogue loop. For this case, the available trip count for interleaving the loop is one less.
demote the tree entry.
Need to check if all user nodes are marked for demotion before demoting
the node. Otherwise, some data info might be lost after vectorization.
Move simplification of VPBlendRecipes from early VPlan construction to
VPlan-to-VPlan based recipe simplification. This simplifies initial
construction.
Note that some in-loop reduction tests are failing at the moment, due to
the reduction predicate being created after the reduction recipe. I will
provide a patch for that soon.
PR: https://github.com/llvm/llvm-project/pull/76090
Similarly to how block masks are created up front and later only
retrieved also make sure masks are created in cases where edge masks are
needed, i.e. blend recipes.
Creating block-in masks for all blocks in the loop also ensures edge
masks for all relevant edges have been created. Later, the new
getEdgeMask can be used to look up cached edge masks.
This makes sure edge masks are available in all cases for
https://github.com/llvm/llvm-project/pull/76090.
This patch implements cloning for VPlans and recipes. Cloning is used in
the epilogue vectorization path, to clone the VPlan for the main vector
loop. This means we won't re-use a VPlan when executing the VPlan for
the epilogue vector loop, which in turn will enable us to perform
optimizations based on UF & VF.
When we generate runtime memory checks for an inner loop it's
possible that these checks are invariant in the outer loop and
so will get hoisted out. In such cases, the effective cost of
the checks should reduce to reflect the outer loop trip count.
This fixes a 25% performance regression introduced by commit
49b0e6dcc296792b577ae8f0f674e61a0929b99d
when building the SPEC2017 x264 benchmark with PGO, where we
decided the inner loop trip count wasn't high enough to warrant
the (incorrect) high cost of the runtime checks. Also, when
runtime memory checks consist entirely of diff checks these are
likely to be outer loop invariant.
Currently when interleaving vector calls with linear arguments,
the Part is ignored and all vector calls use the initial value
from the first lane of the current iteration.
Fix this to extract from the correct part of the linear vector.
Add a new recipe to model scalar cast instructions, without relying on
an underlying instruction.
This allows creating scalar casts, without relying on an underlying
instruction (like the current VPReplicateRecipe). The new recipe is
used to explicitly model both truncating the induction step and the
VPDerivedIVRecipe, thus simplifying both the recipe and code
needed to introduce it.
Truncating VPWidenIntOrFpInductionRecipes should also be modeled using
the new recipe, as follow-up.
PR: https://github.com/llvm/llvm-project/pull/78113
multiregister node.
If the node can be span between several registers and same
extractelement instruction is used in several parts, it may be required
to keep such extractelement instruction to avoid compiler crash.
Replace EnableVPlanNativePath checks in the cost-model by assertions
that the code is only called for innermost loops. This ensures that the
cost model isn't used in the VPlanNativePath, which is only used for
outer-loop vectorization.
Even with EnableVPlanNativePath, inner loops are processed by the
inner loop vectorization path, not the native path, so checking for
EnableVPlanNativePath may impact decisions for inner loops and can
cause crashes, like in the attached test case.
before erasing.
Before trying to erase the extractelement instruction, not enough to
check for single use, need to check that it is not used in several nodes
because of the preliminary nodes reordering.
This patch trivially updates various opt passes to handle DPVAssigns. In
all cases, this means some combination of generifying existing code to
handle DPValues and DbgAssignIntrinsics, iterating over DPValues where
previously we did not, or duplicating code for DbgAssignIntrinsics to
the equivalent DPValue function (in inlining and salvageDebugInfo).
Instead of using the debug location of the underlying instruction, use
the debug location from the recipe. This removes an unneeded dependency
of the underlying instruction.
This patch introduces a new common base class for recipes defining a
single result VPValue. This has been discussed/mentioned at various
previous reviews as potential follow-up and helps to replace various
getVPSingleValue calls.
PR: https://github.com/llvm/llvm-project/pull/77023
LoopVectorizer is aware when a target can replace a scalable frem
instruction with a vector library call for a given VF and it returns the
relevant cost. Otherwise, it returns an invalid cost (as previously).
Add test that check costs on AArch64, when there is no vector library
available and when there is (with and without tail-folding).
NOTE: Invoking CostModel directly (not through LV) would still return
invalid costs.
!isa<Constant>(GEPIdx)' failed.
The non-constant index might be folded to constant during earlier stages
of vectorization. Need to consider this option and filter out out GEP
with the constant indices from the candidates list.
times during reduction vectorization.
If the external value was replaced in the vectorizer several times during reduction vectorization, need to find the original value to correctly handle external uses and emit extractelement instructions properly.
The following internal error occurred when using native vplan to
vectorize the program with the debug info generation.
Assertion `!isa<DbgInfoIntrinsic>(CI) && "DbgInfoIntrinsic should have been dropped during VPlan construction"' failed.
This patch ignored all debug instructions to fix the error when native
vplan is enabled.
the second vector.
Need to transform all elements in the long mask, if we decided to
produce shorter version, some elements may still have incorrect inifices
after transformation for the first vector in the permutation.
After changes, that does not require support from InstCombine, we can
drop some extra requirements for values-to-be-demoted. No need to check
for external uses for roots/other instructions, just check that the
no non-vectorized insertelement instruction, which may require
widening.
Review: https://github.com/llvm/llvm-project/pull/72679
After changes, that does not require support from InstCombine, we can
drop some extra requirements for values-to-be-demoted. No need to check
for external uses for roots/other instructions, just check that the
no non-vectorized insertelement instruction, which may require
widening.
SLP vectorizer passes the type of the subvector and the mask, which size
determines the size of the resulting vector. TTI should support this
pattern to improve cost estimation of the insert_subvector shuffle
pattern.
At the moment, block and edge masks are created on demand, which means
that they are inserted at the point where they are demanded and then
cached. It is possible that the mask for a block is looked up later at a
point that's not dominated by the point where the mask has been
inserted.
To avoid this, create masks up front on entry to the corresponding basic
block and leave it to VPlan simplification to remove unneeded masks.
Note that we need to create masks for all blocks, if any of the blocks
in the loop needs predication, as computing the mask of a block depends
on the masks of its predecessor.
Needed for #76090.
https://github.com/llvm/llvm-project/pull/76635
Need to limit the transformation of the VecMask by the corresponding part of the mask of SliceSize size to avoid compiler crash during further cost analysis.
As suggested as follow-up in
https://github.com/llvm/llvm-project/pull/72164, manage inbounds via
VPRecipeWithIRFlags.
Note that in some cases we can now preserve inbounds in a few more
cases.
With #70253 landed, selects for reduction results are explicitly used by
ComputeReductionResult and Selects can be marked as not having
side-effects again.
This reverts the revert commit 173032902c960d4d0d67b521d8c149553d8e8ba3.
This patch introduces a new ComputeReductionResult opcode to compute the
final reduction result in the middle block. The code from fixReduction
has been moved to ComputeReductionResult, after some earlier cleanup
changes to model parts of fixReduction explicitly elsewhere as needed.
The recipe may be broken down further in the future.
Note that the phi nodes to merge the reduction result from the trip
count check and the middle block, to be used as resume value for the
scalar remainder loop are also generated based on
ComputeReductionResult.
Once we have a VPValue for the reduction result, this can also be
modeled explicitly and moved out of the recipe.
vectorized.
When trying to reuse the extractelement instruction, emitted for the
insertelement instruction, need to check, if the this insertelement
instruction was vectorized. In this case, need to use vectorized value,
not the original insertelement.