If two nodes share the same value and it is replaced in one of the
nodes, the same value must be replaced automatically in all nodes. Better to
use WeakTrackingVH for this, which fixes a compiler crash.
With this patch an undefined mask in a shufflevector will be printed as poison.
This change is done to support the new shufflevector semantics
for undefined mask elements.
Differential Revision: https://reviews.llvm.org/D149210
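As a rough illustration of the shufflevector printing change described above, the sketch below formats a mask so that undefined elements appear as poison. It assumes an undefined mask element is represented by -1; the helper name `format_shuffle_mask` is hypothetical and this is not LLVM's actual printer.

```python
# Toy formatter, not LLVM's IR printer.
# Assumption: -1 marks an undefined mask element.
UNDEFINED_MASK_ELEM = -1


def format_shuffle_mask(mask):
    """Render a shuffle mask as an IR-like vector constant."""
    parts = []
    for elem in mask:
        if elem == UNDEFINED_MASK_ELEM:
            # Undefined lanes are printed as poison.
            parts.append("i32 poison")
        else:
            parts.append(f"i32 {elem}")
    return f"<{len(mask)} x i32> <{', '.join(parts)}>"


print(format_shuffle_mask([0, -1, 2, -1]))
```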
llvm.is.fpclass is different from other vectorizable intrinsics in that
it is overloaded on an argument type, not on the return type.
Differential Revision: https://reviews.llvm.org/D148905
For 8-bit/16-bit vector loads/stores we scalarize and transfer to/from the vector unit, or use the (usually slow) PINSR/PEXTR instructions.
Fixes #59867
Currently the compiler calculates the compensation cost for the
extractelements removed during vectorization. But if an extractelement
instruction is used in several nodes, the compensation for it may be
calculated several times.
Differential Revision: https://reviews.llvm.org/D148806
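The double counting described above can be sketched with a toy model. The node lists, the instruction names, and the unit cost are all hypothetical; the only point is that an extractelement shared by several nodes should be compensated once.

```python
def naive_compensation_cost(nodes, extract_cost):
    """Counts every use, so a shared extractelement is compensated repeatedly."""
    return sum(extract_cost for node in nodes for _ in node)


def compensation_cost(nodes, extract_cost):
    """Counts each distinct extractelement once across all nodes."""
    seen = set()
    total = 0
    for node in nodes:
        for extract in node:
            if extract in seen:
                continue  # already compensated through another node
            seen.add(extract)
            total += extract_cost
    return total


# "e1" is used by both nodes (hypothetical names).
nodes = [["e0", "e1"], ["e1", "e2"]]
print(naive_compensation_cost(nodes, 1), compensation_cost(nodes, 1))
```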
We were treating vXi8 multiply as the sum of a trunc(mul(extend(),extend())), which diverged from the costs from llvm-mca once we extended beyond legal types.
Use a modified version of the D103695 script to determine more accurate throughput/latency/codesize/size-latency cost estimates.
Helps address some of the regressions identified in D148806.
There are 2 problems in the cost estimation for buildvector/gather.
1. If the buildvector/gather node is the same as another node, the
cost of this node must be estimated as 0.
2. The cost of inserting a floating-point register into a non-poison
vector is not 0; it should not be considered free.
Differential Revision: https://reviews.llvm.org/D148801
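The two buildvector/gather rules above can be modeled with a small sketch. It assumes a unit cost per insert and assumes that the only free case is a floating-point insert into lane 0 of a poison vector; both assumptions and all function names are illustrative and are not taken from the patch.

```python
def insert_cost(is_float, lane, base_is_poison, unit_cost=1):
    """Cost of inserting one scalar into a vector (toy model)."""
    # Assumption: only a float insert into lane 0 of a poison vector is free.
    if is_float and lane == 0 and base_is_poison:
        return 0
    return unit_cost


def buildvector_cost(scalars, seen_nodes, is_float, base_is_poison):
    """Cost of a buildvector/gather node; an identical earlier node costs 0."""
    key = tuple(scalars)
    if key in seen_nodes:
        return 0  # rule 1: same as another node, so the value is reused
    seen_nodes.add(key)
    # Rule 2: inserts into a non-poison vector are never free.
    return sum(insert_cost(is_float, lane, base_is_poison)
               for lane in range(len(scalars)))


seen = set()
print(buildvector_cost(("a", "b"), seen, True, True))   # first occurrence
print(buildvector_cost(("a", "b"), seen, True, True))   # identical node
print(buildvector_cost(("c", "d"), seen, True, False))  # non-poison base
```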
The buildvector cost for the case shown in the test should be 0 but it is -1, causing the code to get vectorized when it shouldn't.
Differential Revision: https://reviews.llvm.org/D148732
If a partial match is found and some other scalars must be
inserted, need to account for the cost of the extractelements transformed
to shuffles and/or the reused entries, and to calculate the cost of
inserting constants properly into the non-poison vectors.
Also, fixed the cost calculation for the final gather/buildvector sequence.
Differential Revision: https://reviews.llvm.org/D148362
Implemented the reshuffling in finalize member function + add basic
support for add member functions, used during vector build.
Part of D110978
Differential Revision: https://reviews.llvm.org/D148279
Introduced BoUpSLP::ShuffleCostEstimator::gather function as an initial
implementation of the gather/buildvector cost estimation for buildvector
nodes. It allows the general codegen infrastructure to be used for better
cost estimation, and it improves the cost estimation for the
gathers/buildvectors.
Improved part of D110978.
Differential Revision: https://reviews.llvm.org/D148174
By default these will expand back to cmp/sel, but some targets (X86) have optimized costs for scalar integer min/max patterns which are lower than the default expansion (pre-SSE41 is particularly weak for vector min/max support).
Differential Revision: [SLP] Compute min/max scalar reduction costs using min/max intrinsics instead of expanded cmp+sel
Instead of abstract cost of the scalar reduction ops, try to use the
cost of actual reduction operation instructions, where possible. Also,
remove the estimation of the vectorized GEPs pointers for reduced loads,
since it is already handled in the tree.
Differential Revision: https://reviews.llvm.org/D148036
getMinMaxCost has an alternative set of min/max costs to getIntrinsicInstrCost that are only used by getMinMaxReductionCost, but they are a lot less thorough and fall back to an expansion in most cases, resulting in cost overestimations. We're better off just using getIntrinsicInstrCost.
getIntrinsicInstrCost is still missing complete FMINNUM/FMAXNUM costs, so until then getMinMaxCost will still be used for these, after that we can remove getMinMaxCost and have getMinMaxReductionCost call getIntrinsicInstrCost directly.
Fixes regression noticed in D148036
This lowers the cost for FADD, FSUB, and FNEG. The motivation is to avoid
over-eager SLP vectorisation, where SLP vectorisation looks
profitable but results in significant slowdowns. Lowering the scalar
FADD/FSUB costs helps the profitability decision to favour the scalar
version where vectorisation isn't beneficial.
Lowering the cost for these floating point operations makes sense because a lot
of other instructions including many shuffles have only a cost of 1; these
FADD/FSUB/FNEG instructions should not be twice the cost.
Performance results show a 7% improvement for Imagick from SPEC FP 2017, a
small improvement in Blender, and unchanged results for the other apps in SPEC.
RAJAPerf is neutral and mostly shows no changes.
Differential Revision: https://reviews.llvm.org/D146033
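The effect of the FADD/FSUB/FNEG cost change on the profitability decision can be sketched with a toy comparison. The per-operation costs of 2 and 1 are inferred from the "twice the cost" remark above, and the vector tree cost of 6 is made up; the function name and the threshold parameter are illustrative and are not the vectorizer's code.

```python
def is_vectorization_profitable(scalar_costs, vector_cost, threshold=0):
    """Toy decision: vectorize only if the vector tree is cheaper than the scalars."""
    return vector_cost - sum(scalar_costs) < -threshold


# Hypothetical tree: four scalar FADDs versus a vector tree costing 6
# (one vector FADD plus shuffles/extracts). All numbers are assumptions.
vector_tree_cost = 6
old_scalar_costs = [2, 2, 2, 2]  # assumed old per-FADD cost
new_scalar_costs = [1, 1, 1, 1]  # assumed lowered per-FADD cost

print(is_vectorization_profitable(old_scalar_costs, vector_tree_cost))
print(is_vectorization_profitable(new_scalar_costs, vector_tree_cost))
```

With the higher scalar cost the toy tree looks profitable; with the lowered cost the scalar version wins, which is the direction of the change described above.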
If the value is used in the expression, need to adjust the mask before
applying the mask. Plus, need to fix the analysis of the phi nodes for
reused scalars.
Made the condition for erasing the gathered extractelements
stricter: remove one only if it has a single vectorized use, otherwise
leave it for instcombiner/instsimplify analysis.
Patch generalizes analysis of scalars. The main part is outlined into a
lambda, which can be used to find reused inserted scalars and emit a
shuffle for them instead of multiple insertelement instructions, if the
permutation is found already. I.e. some scalars are transformed by the
permutation of previously vectorized nodes, and some are inserted
directly.
Reworked part of D110978
Differential Revision: https://reviews.llvm.org/D146564
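The split described above between scalars taken by permutation of a previously vectorized node and scalars inserted directly can be sketched as follows. The helper name `split_gather` and the use of -1 for lanes not covered by the shuffle are assumptions made for this illustration.

```python
def split_gather(gather_scalars, vectorized_scalars):
    """Return (mask, inserts) for a gather node.

    mask[lane] is the index into the already vectorized node, or -1 when
    the lane is not covered by the shuffle. inserts lists (lane, scalar)
    pairs that still need a direct insertelement.
    """
    position = {s: i for i, s in enumerate(vectorized_scalars)}
    mask = []
    inserts = []
    for lane, scalar in enumerate(gather_scalars):
        if scalar in position:
            mask.append(position[scalar])  # reuse via shuffle
        else:
            mask.append(-1)
            inserts.append((lane, scalar))  # insert directly
    return mask, inserts


mask, inserts = split_gather(["b", "x", "a", "b"], ["a", "b", "c", "d"])
print(mask, inserts)
```

In this toy case three lanes come from one shuffle of the existing vector and only one scalar needs an insertelement.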
The counters for the repeated scalars are ordered in the natural order,
but the original scalars might be reordered during SLP graph reordering
and this order can be dropped. Need to use the scalars after the
reordering, not the original ones, to emit correct code for same value
counters.
instruction with users."' failed.
If the externally used scalar is part of the tree and is replaced by
extractelement instruction, need to add generated extractelement
instruction to the list of the ExternallyUsedValues to avoid deletion
during vectorization.
For the attached test case, LLVM currently generates instructions to load/or/store the bytes one by one. Although NEON doesn't support v4i8 natively, we can promote it to v4i16 and operate on v4i16 vectors. So this patch overrides getStoreMinimumVF and specifies v4i8 as the minimum for i8 vectors.
Differential Revision: https://reviews.llvm.org/D145614
Need to transform the mask after applying the shuffle, using the mask itself
as a base, to correctly mark as identity those indices actually used in the
previous shuffle. This fixes a crash when vectors of different sizes are
shuffled.
Currently the cost for fshl is an overestimate causing SLP to vectorize when it is not necessary.
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D147056
This reverts commit 1387a13e1d0bac94457626ef3e7427c84caf6e65.
This introduced performance regressions on AArch64, when the cost of a
vector GEP + extracts is offset by the benefits of vectorizing the rest
of the tree.
The test in llvm/test/Transforms/SLPVectorizer/AArch64/vector-getelementptr.ll
illustrates the issue. It was extracted from code that regressed a SPEC
benchmark by 15%.
Add a test to check for gep vectorization after the change from D144128, where the gep vectorization is dependent on the target hook `prefersVectorizedAddressing()`.
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D146540
They are functionally equivalent but currently one fails to vectorize
because the cost of an insert subvector shuffle is too expensive.
D146747 will update the cost of these types of shuffles, so add a test
case for it.
Horizontal reduction can still kick in even when the max VF is set to 0,
but strange stuff can happen as it affects the cost model.
Enable it for these tests as eventually the goal will be to have SLP
enabled.
After some discussion and experimentation, we have seen that changing the default number of vector register bits to LMUL=2 strikes a sweet spot.
Whilst we could be clever here and make the vectorizer smarter about dynamically selecting an LMUL that
a) Doesn't affect register pressure
b) Is suitable for the microarchitecture
we would need to teach its heuristics about RISC-V register grouping specifics.
Instead this just does the easy, pragmatic thing by changing the default to a safe value that doesn't affect register pressure significantly[1], but should increase throughput and unlock more interleaving.
[1] Register spilling when compiling sqlite at various levels of `-riscv-v-register-bit-width-lmul`:
LMUL=1 2573 spills
LMUL=2 2583 spills
LMUL=4 2819 spills
LMUL=8 3256 spills
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D143723
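To put the sqlite spill counts from footnote [1] above in relative terms, the short calculation below expresses each LMUL setting as a percentage increase over LMUL=1. The spill counts are copied from the footnote; the function name is illustrative.

```python
# Spill counts from footnote [1], keyed by LMUL.
spills = {1: 2573, 2: 2583, 4: 2819, 8: 3256}


def relative_increase(lmul, baseline=1):
    """Percentage increase in spills relative to the baseline LMUL."""
    return 100.0 * (spills[lmul] - spills[baseline]) / spills[baseline]


for lmul in spills:
    print(f"LMUL={lmul}: {relative_increase(lmul):.2f}% more spills than LMUL=1")
```

LMUL=2 adds roughly 0.4% more spills, compared with roughly 9.6% for LMUL=4 and 26.5% for LMUL=8.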
Horizontal reductions still occur on RISC-V, even though TTI reports a
maximum SLP VF of 1 in order to disable SLP.
This can cause the cost model to think it can vectorize a gather into
smaller, widened loads, when it will actually fail to do so.
This should ultimately be fixed whenever SLP is re-enabled for RISC-V at
some point.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D146529
If the buildvector node matches the vector node, it reuses the vector
value from this vector node, but its VectorizedValue field is not
updated. Need to update this field to avoid misses during the analysis
of the reused gather/buildvector nodes.
Currently the compiler does not support mixing of shuffled nodes
with gather/buildvector of the remaining scalar values. Supporting it may
reduce the total number of instructions and improve performance of the
gather/buildvector sequences.
Part of D110978
Differential Revision: https://reviews.llvm.org/D146167
When all the pointers are off the same base address and have known
distances to each other these differences can be encoded into displacements
in x86 arch. So the only cost that matters is cost of the base GEP.
Differential Revision: https://reviews.llvm.org/D146102
After merging the main part of the gather/buildvector code, the CreateShuffle
lambda can be removed and the ShuffleBuilder add functions can be used instead.
Also, part of the code from CreateShuffle migrated to the
BaseShuffleAnalysis::createShuffle function for better code emission.
Differential Revision: https://reviews.llvm.org/D145988
operation for combined entries.
The vector factor after combining of the shuffle entries is defined by
the size of the mask, not by the vector factors of the original
entries. So, need to adjust it to emit correct code.
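The point about the vector factor can be seen in a small emulation of a two-source shuffle, where the result has one lane per mask element regardless of the input widths. This is a plain-Python model with hypothetical lane names, using -1 for an undefined lane; it is not the vectorizer's code.

```python
def shufflevector(v1, v2, mask):
    """Emulate a two-source shuffle; the result length equals len(mask)."""
    combined = list(v1) + list(v2)
    # -1 stands for an undefined lane, modeled here as None.
    return [None if m == -1 else combined[m] for m in mask]


a = ["a0", "a1"]  # vector factor 2
b = ["b0", "b1"]  # vector factor 2

# Combining both entries: the vector factor of the result is 4, which
# comes from the mask size, not from the factor 2 of either input.
wide = shufflevector(a, b, [0, 1, 2, 3])
print(len(wide), wide)
```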