Compiler adds the estimation for the external uses during operands
reordering analysis, which makes it tend to prefer duplicates in the
lanes rather than diamond/shuffled match in the graph. It changes the sizes of
the vector operands and may prevent some vectorization. We don't need
this kind of estimation for the analysis phase, because we just need to
choose the most compatible instruction and it does not matter if it has
external user or used in the non-matching lane. Instead, we count the number
of unique instruction in the lane and see if the reassociation changes
the number of unique scalars to be power of 2 or not. If we have power
of 2 unique scalars in the lane, it is considered more profitable rather
than having non-power-of-2 number of unique scalars.
Metric: SLP.NumVectorInstructions
test-suite :: MultiSource/Benchmarks/FreeBench/distray/distray.test 70.00 86.00 22.9%
test-suite :: External/SPEC/CFP2017rate/544.nab_r/544.nab_r.test 346.00 353.00 2.0%
test-suite :: External/SPEC/CFP2017speed/644.nab_s/644.nab_s.test 346.00 353.00 2.0%
test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test 235.00 239.00 1.7%
test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test 235.00 239.00 1.7%
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 8723.00 8834.00 1.3%
test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test 1051.00 1064.00 1.2%
test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 1628.00 1646.00 1.1%
test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 1628.00 1646.00 1.1%
test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 9100.00 9184.00 0.9%
test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test 3565.00 3577.00 0.3%
test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test 3565.00 3577.00 0.3%
test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test 4235.00 4245.00 0.2%
test-suite :: MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test 1996.00 1998.00 0.1%
test-suite :: MultiSource/Applications/JM/lencod/lencod.test 1671.00 1672.00 0.1%
test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test 783.00 782.00 -0.1%
test-suite :: SingleSource/Benchmarks/Misc/oourafft.test 69.00 68.00 -1.4%
test-suite :: External/SPEC/CINT2017speed/641.leela_s/641.leela_s.test 207.00 192.00 -7.2%
test-suite :: External/SPEC/CINT2017rate/541.leela_r/541.leela_r.test 207.00 192.00 -7.2%
test-suite :: External/SPEC/CINT2017rate/531.deepsjeng_r/531.deepsjeng_r.test 89.00 80.00 -10.1%
test-suite :: External/SPEC/CINT2017speed/631.deepsjeng_s/631.deepsjeng_s.test 89.00 80.00 -10.1%
test-suite :: MultiSource/Benchmarks/mediabench/jpeg/jpeg-6a/cjpeg.test 260.00 215.00 -17.3%
test-suite :: MultiSource/Benchmarks/MiBench/consumer-jpeg/consumer-jpeg.test 256.00 211.00 -17.6%
MultiSource/Benchmarks/Prolangs-C/TimberWolfMC - pretty the same.
SingleSource/Benchmarks/Misc/oourafft.test - 2 <2 x > loads replaced by
one <4 x> load.
External/SPEC/CINT2017speed/641.leela_s - function gets vectorized and
not inlined anymore.
External/SPEC/CINT2017rate/541.leela_r - same
xternal/SPEC/CINT2017rate/531.deepsjeng_r - changed the order in
multi-block tree, the result is pretty the same.
External/SPEC/CINT2017speed/631.deepsjeng_s - same.
MultiSource/Benchmarks/mediabench/jpeg/jpeg-6a - the result is the same
as before.
MultiSource/Benchmarks/MiBench/consumer-jpeg - same.
Differential Revision: https://reviews.llvm.org/D116688
Added support for alternate ops vectorization of the cmp instructions.
It allows to vectorize either cmp instructions with same/swapped
predicate but different (swapped) operands kinds or cmp instructions
with different predicates and compatible operands kinds.
Differential Revision: https://reviews.llvm.org/D115955
Added support for alternate ops vectorization of the cmp instructions.
It allows to vectorize either cmp instructions with same/swapped
predicate but different (swapped) operands kinds or cmp instructions
with different predicates and compatible operands kinds.
Differential Revision: https://reviews.llvm.org/D115955
Added support for alternate ops vectorization of the cmp instructions.
It allows to vectorize either cmp instructions with same/swapped
predicate but different (swapped) operands kinds or cmp instructions
with different predicates and compatible operands kinds.
Differential Revision: https://reviews.llvm.org/D115955
Added support for alternate ops vectorization of the cmp instructions.
It allows to vectorize either cmp instructions with same/swapped
predicate but different (swapped) operands kinds or cmp instructions
with different predicates and compatible operands kinds.
Differential Revision: https://reviews.llvm.org/D115955
This reverts commit db49a78900f5e4b59714565876b5dbb5e2dfe840. The reasoning in the patch applied to a downstream branch, and I got myself confused when trying to split apart pieces. Thankfully, the assert was simply weaker than the actual invariant currently upstream which is that ReadyInsts is not empty.
The fact we could have a block with a valid scheduling window, but nothing to schedule was surprising to me. After digging through the code, this can only happen if we don't find anything to directly vectorize. However, the reduction handling code relies on this mode, so we can't simply consider such trees unvectorizeable. The assert conveys both that this situation can happen, but also that it can *only* happen for an immediate gather.
Context: We built the bundle before deciding that vectorization of a bundle is possible. A side effect of bundle construction is manipulating the scheduling window, so a bundle which isn't vectorizable can cause the creation or expansion of a scheduling window.
No need to reorder the top nodes, if they are not stores or
insertelement instructions and each node should be analized only
once, when the bottom-to-top analysis is performed.
We still endup with extractelements for the top node scalars and
the final shuffle just adds an extra cost and currently
crashes the compiler for PHI nodes.
Differential Revision: https://reviews.llvm.org/D116760
No need to include the order of the scalars beeing used as part of the
alternate vectorization into account when trying to reorder the whole
graph. Such elements better to reorder in the following phase because
the subtree still ends up in shuffle.
Part of D116688, fixes the regression in D116690.
Differential Revision: https://reviews.llvm.org/D116740
There is a bug in the reordering analysis stage. If the element with the
given hash is not added to the map but has the same number of APOs and
instructions with same parent, but different instruction opcode, it will
be initalized with default values and then the counter is increased by
1. But the lane is not updated and default to 0 instead of the actual
`Lane` value. It leads to the fact that the analysis is useless in
many cases and default to lane 0 instead of actual lane with the
minimum amount of APO operands.
Differential Revision: https://reviews.llvm.org/D116690
Need to clear CurrentOrder order mask if it is determined that
extractelements form identity order and need to use a vector-like
construct when iterating over ordered entries in the reorderTopToBottom
function.
Need to check for the number of the unique non-constant values since the
unique values may include several constants.
Differential Revision: https://reviews.llvm.org/D115939
Need to check for the number of the unique non-constant values since the
unique values may include several constants.
Differential Revision: https://reviews.llvm.org/D115939
Need to early exit out of the reordering process if the perfect/shuffled match is found in the operands. Such pattern will result in not profitable reordering because of (false positive) external use of scalars.
Differential Revision: https://reviews.llvm.org/D115811
No need to represent splats as a node with the reused scalars, it may
increase the cost (currently pass just ignores extra shuffle cost and it
is still not correct).
Differential Revision: https://reviews.llvm.org/D115800
Changes the preliminary multinode analysis:
1. Introduced scores for reversed loads/extractelements.
2. Improved shallow score calculation.
3. Lowered the cost of external uses (no need to consider it several times, just ones).
4. The initial lane for analysis is the one with the minimal possible
reorderings.
These changes in general shall reduce compile time and improve the
reordering in many cases.
Part of D57059.
Differential Revision: https://reviews.llvm.org/D101109
If the gather node is a mix of undefvalues and exractelement
instructions, need to take the ordering for such nodes into account too.
It allows to reorder some (sub)trees and remove some extra shuffles,
improving overall vectorization.
Also, outlined common functionality into a separate function.
Differential Revision: https://reviews.llvm.org/D115358
The comparator for the sort functions should provide strict weak
ordering relation between parameters. Current solution causes compiler
crash with some standard c++ library implementations, because it does
not meet this criteria. Tried to fix it + it improves the iverall
vectorization result.
Differential Revision: https://reviews.llvm.org/D115268
The recurrence lowering code has handling which claims to be about flag intersection, but all the callers pass empty arrays to the arguments. The sole exception is a caller of a method which has the argument, but no implementation.
I don't know what the intent was here, but it certaintly doesn't actually do anything today.
Need to add an extra check for potential undef values in
computeExtractCost function to avoid compiler crash on casting to
instructon.
Differential Revision: https://reviews.llvm.org/D115162
If the extractelement instruction is used multiple times in the
different tree entries (either vectorized, or gathered), need to
compensate the scalar cost of such instructions. They are completely
removed if all users are part of the tree but we need to compensate the
cost only once for each instruction.
Differential Revision: https://reviews.llvm.org/D114958
Need to outline the code for finding common vectors in insertelement
instructions into a separate function for future patches. It also
improves the process by adding some extra checks for early exit and
fixes a bug where it always finds the match because of erroneous compare
of the same values.
Differential Revision: https://reviews.llvm.org/D114909
If several shuffle instructions are emitted, some of them might
same/compatible (less defined) with the previously emitted ones. Such
shuffles can be removed safely, improving the total cost of the
vectorized code.
Differential Revision: https://reviews.llvm.org/D114087
Improved the calculation of the shuffled extracts, where possible. Need
to calculate the cost for the extracted scalars if some users are not
insertelements + improved the total estimation of the shuffled scalars
used in insertelements build vectors.
Differential Revision: https://reviews.llvm.org/D113782
Undefined vector might be not only the UndefValue, but also it can be
a constant vector with undef ot poison elements, need to check for this
kind of undef too.
Differential Revision: https://reviews.llvm.org/D114873
Extended support for undefined source vector/extract indices/non-fixed
vector types, also no need to check for the parent of the extractelement
instructions with the constant indicies.
Differential Revision: https://reviews.llvm.org/D114121
Compiler has an analysis for perfect diamond matching but it does not
support nodes with main/alternate opcodes. The problem is that the
scalars themselves are different and might not match directly with other
nodes, but operands and main/alternate opcodes might match and compiler
might reuse some previously emitted vector instructions. Need to include
this analysis in the cost model and actual vector instructions emission
process.
Differential Revision: https://reviews.llvm.org/D114101