Need to check for the number of the unique non-constant values since the
unique values may include several constants.
Differential Revision: https://reviews.llvm.org/D115939
Pass in the vector header instead of relying on ILV::LoopVectorBody.
This reduces the dependence on state from ILV. Where VPTransformState is
available, State.CFG.PrevBB can be used.
Need to early exit out of the reordering process if the perfect/shuffled match is found in the operands. Such pattern will result in not profitable reordering because of (false positive) external use of scalars.
Differential Revision: https://reviews.llvm.org/D115811
These are deprecated and should be replaced with getAlign().
Some of these asserts don't do anything because Load/Store/AllocaInst never have a 0 align value.
No need to represent splats as a node with the reused scalars, it may
increase the cost (currently pass just ignores extra shuffle cost and it
is still not correct).
Differential Revision: https://reviews.llvm.org/D115800
Changes the preliminary multinode analysis:
1. Introduced scores for reversed loads/extractelements.
2. Improved shallow score calculation.
3. Lowered the cost of external uses (no need to consider it several times, just ones).
4. The initial lane for analysis is the one with the minimal possible
reorderings.
These changes in general shall reduce compile time and improve the
reordering in many cases.
Part of D57059.
Differential Revision: https://reviews.llvm.org/D101109
If the gather node is a mix of undefvalues and exractelement
instructions, need to take the ordering for such nodes into account too.
It allows to reorder some (sub)trees and remove some extra shuffles,
improving overall vectorization.
Also, outlined common functionality into a separate function.
Differential Revision: https://reviews.llvm.org/D115358
For the simple copy loop (see test case) vectorizer selects VF equal to 32 while the loop is known to have 17 iterations only. Such behavior makes no sense to me since such vector loop will never be executed. The only case we may want to select VF large than TC is masked vectoriztion. So I haven't touched that case.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D114528
Given a MLA reduction from two different types (say i8 and i16), we were
previously failing to find the reduction pattern, often making us chose
the lower vector factor. This improves that by using the largest of the
two extension types, allowing us to use the larger VF as the type of the
reduction.
As per https://godbolt.org/z/KP549EEYM the backend handles this
valiantly, leading to better performance.
Differential Revision: https://reviews.llvm.org/D115432
This patch simplifies handling of redundant induction casts, by
removing dead cast instructions after initial VPlan construction.
This has the following benefits:
1. fixes a crash
(see @test_optimized_cast_induction_feeding_first_order_recurrence)
2. Simplifies VPWidenIntOrFpInduction to a single-def recipes
3. Retires recordVectorLoopValueForInductionCast.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D115112
This patch uses a similar trick as in D113947 to only run the extra
passes after vectorization on functions where loops have been
vectorized.
The reason for running the 'extra vector passes' is
simplification/unswitching of the runtime checks created by LV, there
should be no need to run them if nothing got vectorized
To do that, a new dummy analysis ShouldRunExtraVectorPasses has been
added. If loops have been vectorized for a function, LV will cache the
analysis. At the moment it uses MadeCFGChanges as proxy for loop
vectorized, which isn't perfect (it could be too aggressive, e.g.
because no runtime checks have been added), but should be good enough
for now.
The extra passes are now managed by a new FunctionPassManager that
runs its passes only if ShouldRunExtraVectorPasses has been cached.
Without this patch, `-extra-vectorizer-passes` has the following
compile-time impact:
NewPM-O3: +4.86%
NewPM-ReleaseThinLTO: +3.56%
NewPM-ReleaseLTO-g: +7.17%
http://llvm-compile-time-tracker.com/compare.php?from=ead3979a92fc33add4710c4510d6906260dcb4ad&to=c292da649e2c6e88a31e702fdc474727d09c72bc&stat=instructions
With this patch, that gets reduced to
NewPM-O3: +1.43%
NewPM-ReleaseThinLTO: +1.00%
NewPM-ReleaseLTO-g: +1.58%
http://llvm-compile-time-tracker.com/compare.php?from=ead3979a92fc33add4710c4510d6906260dcb4ad&to=e67d86b57810011cf285eb9aa1944781be6096f0&stat=instructions
It is probably still too high to enable by default, but much better.
Reviewed By: aeubanks
Differential Revision: https://reviews.llvm.org/D115052
This allows easier access to the induction descriptor from VPlan,
without needing to go through Legal. VPReductionPHIRecipe already
contains a RecurrenceDescriptor in a similar fashion.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D115111
The comparator for the sort functions should provide strict weak
ordering relation between parameters. Current solution causes compiler
crash with some standard c++ library implementations, because it does
not meet this criteria. Tried to fix it + it improves the iverall
vectorization result.
Differential Revision: https://reviews.llvm.org/D115268
The recurrence lowering code has handling which claims to be about flag intersection, but all the callers pass empty arrays to the arguments. The sole exception is a caller of a method which has the argument, but no implementation.
I don't know what the intent was here, but it certaintly doesn't actually do anything today.
Both the entry and exit blocks of the top-region of a plan must be
VPBasicBlocks. They also must have no predecessors or successors
respectively.
This invariant was broken when splitting a block for sink-after. To fix
the issue, set the exit block of the region *after* sink-after is done.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D114586
Need to add an extra check for potential undef values in
computeExtractCost function to avoid compiler crash on casting to
instructon.
Differential Revision: https://reviews.llvm.org/D115162
ILV::scalarizeInstruction still uses the original IR operands to check
if an input value is uniform after vectorization.
There is no need to go back to the cost model to figure that out, as the
information is already explicit in the VPlan. Just check directly
whether the VPValue is defined outside the plan or is a uniform
VPReplicateRecipe.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D114253
If the condition of a select is a compare, pass its predicate to
TTI::getCmpSelInstrCost to get a more accurate cost value instead
of passing BAD_ICMP_PREDICATE.
I noticed that the commit message from D90070 had a comment about the
vectorized select predicate possibly being composed of other compares with
different predicate values, but I wasn't able to construct an example
where this was an actual issue. If this is an issue, I guess we could
add another check that the block isn't predicated for any reason.
Reviewed By: dmgreen, fhahn
Differential Revision: https://reviews.llvm.org/D114646
VPWidenIntOrFpInductionRecipes can either be constructed with a PHI and
an optional cast or a PHI and a trunc instruction. Reflect this in 2
separate constructors. This also simplifies a follow-up change.
If the extractelement instruction is used multiple times in the
different tree entries (either vectorized, or gathered), need to
compensate the scalar cost of such instructions. They are completely
removed if all users are part of the tree but we need to compensate the
cost only once for each instruction.
Differential Revision: https://reviews.llvm.org/D114958
Need to outline the code for finding common vectors in insertelement
instructions into a separate function for future patches. It also
improves the process by adding some extra checks for early exit and
fixes a bug where it always finds the match because of erroneous compare
of the same values.
Differential Revision: https://reviews.llvm.org/D114909
If several shuffle instructions are emitted, some of them might
same/compatible (less defined) with the previously emitted ones. Such
shuffles can be removed safely, improving the total cost of the
vectorized code.
Differential Revision: https://reviews.llvm.org/D114087
Improved the calculation of the shuffled extracts, where possible. Need
to calculate the cost for the extracted scalars if some users are not
insertelements + improved the total estimation of the shuffled scalars
used in insertelements build vectors.
Differential Revision: https://reviews.llvm.org/D113782
Undefined vector might be not only the UndefValue, but also it can be
a constant vector with undef ot poison elements, need to check for this
kind of undef too.
Differential Revision: https://reviews.llvm.org/D114873
The code in widenMemoryInstruction has already been transitioned
to only rely on information provided by VPWidenMemoryInstructionRecipe
directly.
Moving the code directly to VPWidenMemoryInstructionRecipe::execute
completes the transition for the recipe.
It provides the following advantages:
1. Less indirection, easier to see what's going on.
2. Removes accesses to fields of ILV.
2) in particular ensures that no dependencies on
fields in ILV for vector code generation are re-introduced.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D114324
Extended support for undefined source vector/extract indices/non-fixed
vector types, also no need to check for the parent of the extractelement
instructions with the constant indicies.
Differential Revision: https://reviews.llvm.org/D114121
The code in widenSelectInstruction has already been transitioned
to only rely on information provided by VPWidenSelectRecipe directly.
Moving the code directly to VPWidenSelectRecipe::execute completes
the transition for the recipe.
It provides the following advantages:
1. Less indirection, easier to see what's going on.
2. Removes accesses to fields of ILV.
2) in particular ensures that no dependencies on
fields in ILV for vector code generation are re-introduced.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D114323
We ask `TTI.getAddressComputationCost()` about the cost of computing vector address,
and then multiply it by the vector width. This doesn't make any sense,
it implies that we'd do a vector GEP and then scalarize the vector of pointers,
but there is no such thing in the vectorized IR, we perform scalar GEP's.
This is *especially* bad on X86, and was effectively prohibiting any scalarized
vectorization of gathers/scatters, because `X86TTIImpl::getAddressComputationCost()`
says that cost of vector address computation is `10` as compared to `1` for scalar.
The computed costs are similar to the ones with D111222+D111220,
but we end up without masked memory intrinsics that we'd then have to
expand later on, without much luck. (D111363)
Differential Revision: https://reviews.llvm.org/D111460
The code in widenInstruction has already been transitioned to
only rely on information provided by VPWidenRecipe directly.
Moving the code directly to VPWidenRecipe::execute completes
the transition for the recipe.
It provides the following advantages:
1. Less indirection, easier to see what's going on.
2. Removes accesses to fields of ILV.
2) in particular ensures that no dependencies on
fields in ILV for vector code generation are re-introduced.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D114322
The code in widenGEP has already been transitioned to only rely on
information provided by VPWidenGEPRecipe directly.
Moving the code directly to VPWidenGEPRecipe::execute completes
the transition for the recipe.
It provides the following advantages:
1. Less indirection, easier to see what's going on.
2. Removes accesses to fields of ILV.
2) in particular ensures that no dependencies on
fields in ILV for GEP code generation are re-introduced.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D114321
collectLoopScalars should only add non-uniform nodes to the list if they
are used by a load/store instruction that is marked as CM_Scalarize.
Before this patch, the LV incorrectly marked pointer induction variables
as 'scalar' when they required to be widened by something else,
such as a compare instruction, and weren't used by a node marked as
'CM_Scalarize'. This case is covered by sve-widen-phi.ll.
This change also allows removing some code where the LV tried to
widen the PHI nodes with a stepvector, even though it was marked as
'scalarAfterVectorization'. Now that this code is more careful about
marking instructions that need widening as 'scalar', this code has
become redundant.
Differential Revision: https://reviews.llvm.org/D114373
Compiler has an analysis for perfect diamond matching but it does not
support nodes with main/alternate opcodes. The problem is that the
scalars themselves are different and might not match directly with other
nodes, but operands and main/alternate opcodes might match and compiler
might reuse some previously emitted vector instructions. Need to include
this analysis in the cost model and actual vector instructions emission
process.
Differential Revision: https://reviews.llvm.org/D114101
In VPRecipeBuilder::handleReplication if we believe the instruction
is predicated we then proceed to create new VP region blocks even
when the load is uniform and only predicated due to tail-folding.
I have updated isPredicatedInst to avoid treating a uniform load as
predicated when tail-folding, which means we can do a single scalar
load and a vector splat of the value.
Tests added here:
Transforms/LoopVectorize/AArch64/tail-fold-uniform-memops.ll
Differential Revision: https://reviews.llvm.org/D112552
Compiler has an analysis for perfect diamond matching but it does not
support nodes with main/alternate opcodes. The problem is that the
scalars themselves are different and might not match directly with other
nodes, but operands and main/alternate opcodes might match and compiler
might reuse some previously emitted vector instructions. Need to include
this analysis in the cost model and actual vector instructions emission
process.
Differential Revision: https://reviews.llvm.org/D114101