Move collectPoisonGeneratingFlags from InnerLoopVectorizer to
VPlanTransforms and also update its name. collectPoisonGeneratingFlags
already directly drops poison-generating flags, not only collecting it.
This means it is more appropriate to integerate it directly into the
VPlan transform pipeline.
The current implementation still calls back to legal to check if a block
needs predication, which should be improved in the future.
A set of microbenchmarks (https://github.com/llvm/llvm-test-suite/pull/26) showed that loop interleaving can be beneficial for loops with low trip count as well. Loop interleaving count computation is updated accordingly in prior patches while this patch removes the loop trip count threshold for interleaving.
Update loop interleaving count computation to address loops that require at least one scalar iteration in the epilogue loop. For this case, the available trip count for interleaving the loop is one less.
Move simplification of VPBlendRecipes from early VPlan construction to
VPlan-to-VPlan based recipe simplification. This simplifies initial
construction.
Note that some in-loop reduction tests are failing at the moment, due to
the reduction predicate being created after the reduction recipe. I will
provide a patch for that soon.
PR: https://github.com/llvm/llvm-project/pull/76090
Similarly to how block masks are created up front and later only
retrieved also make sure masks are created in cases where edge masks are
needed, i.e. blend recipes.
Creating block-in masks for all blocks in the loop also ensures edge
masks for all relevant edges have been created. Later, the new
getEdgeMask can be used to look up cached edge masks.
This makes sure edge masks are available in all cases for
https://github.com/llvm/llvm-project/pull/76090.
This patch implements cloning for VPlans and recipes. Cloning is used in
the epilogue vectorization path, to clone the VPlan for the main vector
loop. This means we won't re-use a VPlan when executing the VPlan for
the epilogue vector loop, which in turn will enable us to perform
optimizations based on UF & VF.
When we generate runtime memory checks for an inner loop it's
possible that these checks are invariant in the outer loop and
so will get hoisted out. In such cases, the effective cost of
the checks should reduce to reflect the outer loop trip count.
This fixes a 25% performance regression introduced by commit
49b0e6dcc296792b577ae8f0f674e61a0929b99d
when building the SPEC2017 x264 benchmark with PGO, where we
decided the inner loop trip count wasn't high enough to warrant
the (incorrect) high cost of the runtime checks. Also, when
runtime memory checks consist entirely of diff checks these are
likely to be outer loop invariant.
Add a new recipe to model scalar cast instructions, without relying on
an underlying instruction.
This allows creating scalar casts, without relying on an underlying
instruction (like the current VPReplicateRecipe). The new recipe is
used to explicitly model both truncating the induction step and the
VPDerivedIVRecipe, thus simplifying both the recipe and code
needed to introduce it.
Truncating VPWidenIntOrFpInductionRecipes should also be modeled using
the new recipe, as follow-up.
PR: https://github.com/llvm/llvm-project/pull/78113
Replace EnableVPlanNativePath checks in the cost-model by assertions
that the code is only called for innermost loops. This ensures that the
cost model isn't used in the VPlanNativePath, which is only used for
outer-loop vectorization.
Even with EnableVPlanNativePath, inner loops are processed by the
inner loop vectorization path, not the native path, so checking for
EnableVPlanNativePath may impact decisions for inner loops and can
cause crashes, like in the attached test case.
Instead of using the debug location of the underlying instruction, use
the debug location from the recipe. This removes an unneeded dependency
of the underlying instruction.
This patch introduces a new common base class for recipes defining a
single result VPValue. This has been discussed/mentioned at various
previous reviews as potential follow-up and helps to replace various
getVPSingleValue calls.
PR: https://github.com/llvm/llvm-project/pull/77023
LoopVectorizer is aware when a target can replace a scalable frem
instruction with a vector library call for a given VF and it returns the
relevant cost. Otherwise, it returns an invalid cost (as previously).
Add test that check costs on AArch64, when there is no vector library
available and when there is (with and without tail-folding).
NOTE: Invoking CostModel directly (not through LV) would still return
invalid costs.
At the moment, block and edge masks are created on demand, which means
that they are inserted at the point where they are demanded and then
cached. It is possible that the mask for a block is looked up later at a
point that's not dominated by the point where the mask has been
inserted.
To avoid this, create masks up front on entry to the corresponding basic
block and leave it to VPlan simplification to remove unneeded masks.
Note that we need to create masks for all blocks, if any of the blocks
in the loop needs predication, as computing the mask of a block depends
on the masks of its predecessor.
Needed for #76090.
https://github.com/llvm/llvm-project/pull/76635
As suggested as follow-up in
https://github.com/llvm/llvm-project/pull/72164, manage inbounds via
VPRecipeWithIRFlags.
Note that in some cases we can now preserve inbounds in a few more
cases.
This patch introduces a new ComputeReductionResult opcode to compute the
final reduction result in the middle block. The code from fixReduction
has been moved to ComputeReductionResult, after some earlier cleanup
changes to model parts of fixReduction explicitly elsewhere as needed.
The recipe may be broken down further in the future.
Note that the phi nodes to merge the reduction result from the trip
count check and the middle block, to be used as resume value for the
scalar remainder loop are also generated based on
ComputeReductionResult.
Once we have a VPValue for the reduction result, this can also be
modeled explicitly and moved out of the recipe.
[LV] Change loops' interleave count computation
A set of microbenchmarks in llvm-test-suite (https://github.com/llvm/llvm-test-suite/pull/56), when tested on a AArch64 platform, demonstrates that loop interleaving is beneficial when the vector loop runs at least twice or when the epilogue loop trip count (TC) is minimal. Therefore, we choose interleaving count (IC) between TC/VF & TC/2*VF (VF = vectorization factor), such that remainder TC for the epilogue loop is minimum while the IC is maximum in case the remainder TC is same for both.
The initial tests for this change were submitted in PRs:
https://github.com/llvm/llvm-project/pull/70272 and https://github.com/llvm/llvm-project/pull/74689.
Move vector pointer generation to a separate VPVectorPointerRecipe.
This untangles address computation from the memory recipes future
and is also needed to enable explicit unrolling in VPlan.
https://github.com/llvm/llvm-project/pull/72164
This patch starts initial modeling of VF * UF in VPlan.
Initially, introduce a dedicated VFxUF VPValue, which is then
populated during VPlan::prepareToExecute. Initially, the VF * UF
applies only to the main vector loop region. Once we extend the
scope of VPlan in the future, we may want to associate different VFxUFs
with different vector loop regions (e.g. the epilogue vector loop)
This allows explicitly parameterizing recipes that rely on the
VF * UF, like the canonical induction increment. At the moment, this
mainly helps to avoid generating some duplicated calls to vscale with
scalable vectors. It should also allow using EVL as induction increments
explicitly in D99750. Referring to VF * UF is also needed in other
places that we plan to migrate to VPlan, like the minimum trip count
check during skeleton creation.
The first version creates the value for VF * UF directly in
prepareToExecute to limit the scope of the patch. A follow-on patch will
model VF * UF computation explicitly in VPlan using recipes.
Moved from Phabricator (https://reviews.llvm.org/D157322)
Compiler crashes when the assertion triggered for zext nneg instruction,
that checks that the instruction cannot produce poison. Changed the base
class for widencast recipe to handle dropping nneg flag to avoid
compiler crash.
This patch replaces the IR based truncateToMinimalBitwidths with a VPlan
version. This has 3 benefits:
1) the VPlan-based version is simpler; we don't need to implement
special codegen for each supported instruction type like the IR based
one.
2) Removes a dependency on the cost-model after VPlan execution and
3) Removes a use of getVPValue that uses underlying values after VPlan
execution (See removed FIXME).
Depends on D149081.
Depends on D149079.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D149903
After inserting a select for the final value, update the VPlan def-use
chains. At the moment, the incorrect live-out doesn't cause a
mis-compile, as computing the final reduction value is not yet modeled
in VPlan.
Parameters marked as uniform take a scalar value, assuming the value is
invariant in the scalar loop. In order to support this, we need to stop
asking for a vector function variant with a default shape assuming that all
arguments will become vector arguments, and instead consider all available
variants and their parameter types.
VPReductionRecipe may be executed for scalar VFs. Make sure to access
part 0 of the condition, as it could be an active-lane-mask, which is a
vector <1 x i1>
Fixes https://github.com/llvm/llvm-project/issues/72720.
Reverse mask early on when populating BlockInMask. This will enable
separating mask management and address computation from the memory
recipes in the future and is also needed to enable explicit unrolling in
VPlan.
This an alternative to #69886. The basic problem is that SCEV can look
through trivial LCSSA phis. When the phi node later becomes non-trivial,
we do invalidate it, but this doesn't catch uses that are not covered by
the IR use-def walk, such as those in BECounts.
Fix this by adding a special invalidation method for LCSSA phis, which
will also invalidate all the SCEVUnknowns/SCEVAddRecExprs used by the
LCSSA phi node and defined in the loop.
We should probably also use this invalidation method in other places
that add predecessors to exit blocks, such as loop unrolling and loop
peeling.
Fixes#69097.
Fixes#66616.
Fixes#63970.
Consistently add `branch_weights` metadata in any condition branch
created by `LoopVectorize.cpp`:
- Will only add metadata if the original loop-latch branch had metadata
assigned.
- Most checks should rarely trigger so I am using a 127:1 ratio.
- For the middle block we assume an equal distribution of modulo
results.
If there are function calls in the candidate loop and we have vectorized
variants available, try some wider VFs in case the conservative initial
maximum based on the widest types in the loop won't actually allow us
to make use of those function variants.
This patch moves creating the middle VPBBs and an initial empty
vector loop region for the top-level loop to createInitialVPlan.
This consolidates code to create the initial VPlan skeleton and enables
adding other bits outside the main region during initial VPlan
construction. In particular, D150398 will add the exit check & branch to
the middle block.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D158333
VPReductionRecipe::execute was not handling predicates for ordered
reduction with scalar VFs, which was causing a crash. Thsi patch adds
dedicated handling for scalar VFs when dealing with the condition.
The other operands are already handled in a similar fashion below.
Fixes#70988.
Support recipes without underlying instruction in
collectPoisonGeneratingRecipes by directly trying to dyn_cast_or_null
the underlying value.
Fixes https://github.com/llvm/llvm-project/issues/70590.
* Avoid using `CM_ScalarEpilogueNotAllowedLowTripLoop` for loops known
to be predicate tail-folded, delegating to `areRuntimeChecksProfitable`
to decide on the profitability of vectorizing loops with runtime checks.
* Update the `areRuntimeChecksProfitable` function to consider the
`ScalarEpilogueLowering` setting when assessing vectorization of a loop.
With this patch, we can make more informed decisions for loops with low
trip counts, especially when leveraging Profile-Guided Optimization
(PGO) data.
This patch adds initial type inferrence for VPValues. It infers the
scalar type of a VPValue, by bottom-up traversing through defining
recipes until root nodes with known types are reached (e.g. live-ins or
load recipes). The types are then propagated top down through
operations.
This is intended as building block for a VPlan-based cost model, which
will need access to type information for VPValues/recipes.
Initial testing is done by asserting the inferred type matches the type
of the result value generated for a widen and replicate recipes.
Reductions with intermediate stores currently need to be fixed in order
of their intermediate stores. Instead of doing this at fixup time after
code has been generated, sort the reductions in adjustRecipesForReductions.
This makes the order explicit in VPlan and will enable removing
fixReductions with modeling computing the final reduction result in
VPlan, followed by also modeling the intermediate stores explicitly.
This patch enables scalable vectors in the VPlan-native path.
If a vectorization factor is specified via loop vectorization hints,
that factor is used. If no vectorization factor is specified, but the
target preferes scalable vectorization, a scalable vectorization factor
is selected.
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D157484
This reverts commit e4ea0997486000b460c4875a00301b73b3c0d6a7.
The recommit fixes a reported crash by adding a missing check to make
sure the cast recipes are only introduced when vectorizing.
Test coverage added in 3cac608fbd0811b2f5c59c6e13148162ccd8543e.
Original commit message:
Update the code to create Trunc/Ext recipes directly in
adjustRecipesForReductions instead of fixing it up later in
fixReductions.
This explicitly models the required conversions and also makes sure they
are generated at the right place (instead of after the exit condition),
hence the changes in a few tests.