Splat loads are inexpensive in X86. For a 2-lane vector we need just one
instruction: `movddup (%reg), xmm0`. Using the standard Splat score leads
to worse code. This patch adds a new score dedicated for splat loads.
Please note that a splat is usually three IR instructions:
- It is usually a load and 2 inserts:
%ld = load double, double* %gep
%ins1 = insertelement <2 x double> poison, double %ld, i32 0
%ins2 = insertelement <2 x double> %ins1, double %ld, i32 1
- But it can also be a load, an insert and a shuffle:
%ld = load double, double* %gep
%ins = insertelement <2 x double> poison, double %ld, i32 0
%shf = shufflevector <2 x double> %ins, <2 x double> poison, <2 x i32> zeroinitializer
Because of this some of the lit tests contain more IR instructions.
Differential Revision: https://reviews.llvm.org/D121354
No need to schedule entry nodes where all instructions are not memory
read/write instructions and their operands are either constants, or
arguments, or phis, or instructions from others blocks, or their users
are phis or from the other blocks.
The resulting vector instructions can be placed at
the beginning of the basic block without scheduling (if operands does
not need to be scheduled) or at the end of the block (if users are
outside of the block).
It may save some compile time and scheduling resources.
Differential Revision: https://reviews.llvm.org/D121121
This uses the existing VPlan helpers to check whether there are scalar
uses of a phi recipe. It remove one of the few remaining dependencies on
the cost model from VPlan code generation.
Depends on D121612.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D121613
VPInterleaveRecipe only uses the first lane of the address. Add
onlyFirstLaneUsed implementation. This is needed for a follow-up patch.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D121612
This patch ensures scalars (except for uniforms) are no
longer collected (prior to LVP planning phase) for
scalable vectorization.
This is to avoid the chances of generating scalarized
instructions later (during LVP execute phase) as they
are not supported for scalable vectorization.
Relevant test has also been added.
Differential Revision: https://reviews.llvm.org/D121452
No need to schedule entry nodes where all instructions are not memory
read/write instructions and their operands are either constants, or
arguments, or phis, or instructions from others blocks, or their users
are phis or from the other blocks.
The resulting vector instructions can be placed at
the beginning of the basic block without scheduling (if operands does
not need to be scheduled) or at the end of the block (if users are
outside of the block).
It may save some compile time and scheduling resources.
Differential Revision: https://reviews.llvm.org/D121121
This initial patch adds code to preserve MemorySSA through a run of SLP vectorizer. The eventual plan is to use MemorySSA to accelerate SLP's memory dependence checking, but we're a ways from that. In particular, this patch is correct, but really slow. It's being landed so that we can work incrementally in tree, not because it's expected to be useful to anyone just yet.
The broader effort is being tracked in https://github.com/llvm/llvm-project/issues/54256. Its worth noting expicitly that this may not work out, and if not, we will be reverting all of the MSSA support in SLP at some point in the next few weeks.
Differential Revision: https://reviews.llvm.org/D117926
Update places still referencing LoopVectorBody to use the vector loop to
get the vector loop header. This is needed to move vector loop
code-generation to VPlan completely, which in turn is needed to model
pre-header & exit blocks in VPlan as well.
The insertion point for the builder used during VPlan code generation is
set during code generation. Setting the insert point here is dead code
and can be removed.
This patch is a follow-up to D115953. It updates optimizeInductions
to also introduce new VPScalarIVStepsRecipes if an IV has both vector
and scalar uses.
It updates all uses that only need scalar values to use the newly
created recipe for the scalar steps.
This completes untangling of VPWidenIntOrFpInductionRecipe
code-generation. Now the recipe *only* creates the widened vector
values, as it says on the tin.
The code to genereate IR has been moved directly to
VPWidenIntOrFpInductionRecipe::execute.
Note that the recipe has been updated to hold a reference to
ScalarEvolution, which is needed to expand the step, until we can place
the corresponding SCEV expansion in the pre-header.
Depends on D120827.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D120828
This patch adds a helper to check if a recipe only uses scalars of a
given operand. This is similar to onlyFirstLaneUsed, which was
introduced earlier.
By default, usesScalars falls back on onlyFirstLaneUsed.
Will be used by D120828.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D120827
Single value phis won't be modeled in VPlan. If the phi only gets used
outside the loop, the current code misses the fact that the incoming
value is not dead. Update the code to also look through such phis to
check for outside users.
Fixes#54266
The analysis passes output function name encapsulated in `'` braces,
but LV uses `"`. Harmonizing this may help in creating an update script
for the LV costmodel test checks.
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D121105
Root issue which triggered the revert was fixed in 689bab. No changes in the reapplied patch.
Original commit message follows:
SLP currently schedules all instructions within a scheduling window which stretches from the first instr
uction potentially vectorized to the last. This window can include a very large number of unrelated instruct
ions which are not being considered for vectorization. This change switches the code to only schedule the su
b-graph consisting of the instructions being vectorized and their transitive users.
This has the effect of greatly reducing the amount of work performed in large basic blocks, and thus greatly improves compile time on degenerate examples. To understand the effects, I added some statistics (not planned for upstream contribution). Here's an illustration from my motivating example:
Before this patch:
704357 SLP - Number of calcDeps actions
699021 SLP - Number of schedule calls
5598 SLP - Number of ReSchedule actions
59 SLP - Number of ReScheduleOnFail actions
10084 SLP - Number of schedule resets
8523 SLP - Number of vector instructions generated
After this patch:
102895 SLP - Number of calcDeps actions
161916 SLP - Number of schedule calls
5637 SLP - Number of ReSchedule actions
55 SLP - Number of ReScheduleOnFail actions
10083 SLP - Number of schedule resets
8403 SLP - Number of vector instructions generated
I do want to highlight that there is a small difference in number of generated vector instructions. This example is hitting the bailout due to maximum window size, and the change in scheduling is slightly perturbing when and how we hit it. This can be seen in the RescheduleOnFail counter change. Given that, I think we can safely ignore.
The downside of this change can be seen in the large test diff. We group all vectorizable instructions together at the bottom of the scheduling region. This means that vector instructions can move quite far from their original point in code. While maybe undesirable, I don't see this as being a major problem as this pass is not intended to be a general scheduling pass.
For context, it's worth noting that the pre-scheduling that SLP does while building the vector tree is exactly the sub-graph scheduling implemented by this patch.
Differential Revision: https://reviews.llvm.org/D118538
While a collection of allocas are technically vectorizeable - by forming a wider alloca - this was not a transform SLP actually knows how to do. Instead, we were forming a bundle with missing dependencies, and then relying on the scheduling code to preserve program order if multiple instructions were scheduleable at once. I haven't been able to write a test case, but I'm 99% sure this was wrong in some edge case.
The unknown op case was flowing down the shufflevector path. This did result in some splat handling being lost with this change, but the same lack of splat handling is visible in a whole bunch of simple examples for the gather path. I didn't consider this interesting to fix given how narrow the splat of allocas case is.
Instead of relying on underlying instructions, this patch updates
VPScalarIVStepsRecipe to only store the required type information.
This removes access to unrelated information, as well as avoiding issues
with the same underlying instruction being shared by multiple recipes.
This change should only change the debug output and not cause any
codegen changes, hence NFCI.
This reverts commit 0539a26d91a1b7c74022fa9cf33bd7faca87544d.
Causes a miscompile, see comments on D118538.
Required updating bottom-to-top-reorder.ll.
Currently bottom-to-top reordering analysis counts orders of the
operands and then adds natural order counts for the operand users. It is
very conservative, this the user nodes themselves may require
reordering. Patch improves bottom-to-top analysis by checking for the
user nodes if they require/allows the reordring. If the user node must
be reordered, has reused scalars, is an alternate op vectorization node,
is a non-ordered gather node or may allow reordering because of the
reordered operands, such node is considered as the node that allows
reodring and is not counted as a node with the natural order.
Differential Revision: https://reviews.llvm.org/D120492
This reverts the revert commit ff93260bf6bddfbad1fa65c4d5184988885b900f.
The underlying issue causing the PPC bot failures has been fixed in
cbaac1473403 and a corresponding test case has been added in
ad2cad1c521c.
Original message:
This patch adds a new VPScalarIVStepsRecipe to handle building scalar
steps.
In the first patch, it only handles the case where there is no vector
induction variable needed.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D115953
Exit values of vector inductions are generated completely independent of
the induction recipes. Consider them for removal, if they are not used
in loop.
This fixes a crash exposed by 49b23f451cf7130.
SLP makes very heavy use of aliasing queries to construct pointer dependencies for scheduling purposes. AA internally usings pointerMayBeCaptured to prove some noalias results. In a local profile, we were spending about 4% of total O2 time in capture tracking. By using BatchAA interface - which caches capture results - this drops to 2%.
Note that there is no invalidation of BatchAA here. This assumes that no transformation done by SLP invalidates alias or capture results. This is the same assumption made by the existing AliasCache, so this is not a new assumption in the code.
This patch adds a new VPScalarIVStepsRecipe to handle building scalar
steps.
In the first patch, it only handles the case where there is no vector
induction variable needed.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D115953
This can be used to explicitly model VPValues that depend on SCEV
expansion, like the step for inductions.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D116288
This patch adds a new transform to remove dead recipes. For now, it only
removes dead recipes in the header, to keep the number tests that require
updating manageable. Future patches will extend this to remove dead
recipes across the whole plan.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D118051
Currently, SLP can insert "shuffle" instruction beween PHI and Landing pad instruction. The problem is demonstrated by LIT test. The solution is to adjust insertion point once we are done with PHI generation.
Differential Revision: https://reviews.llvm.org/D120552
The min/max intrinsic cost is currently too low because in the cost calculation
we subtract the cost of the vector compare as we will not emit it.
For the cost of the vector compare we are currently passing BAD_ICMP_PREDICATE
which returns 3, the worst case cost.
I think we should be passing VecPred instead, since we know the predicates of
the compare instr.
I think this is related to commit b3b993a7ad817 which introduced the predicate
argument to getCmpSelInstrCost().
https://reviews.llvm.org/rGb3b993a7ad817c3c5801341fa78f34332900eb83
Differential Revision: https://reviews.llvm.org/D120439