Previous patch did not pass the list of the extract indices by
reference, so the compiler just ignored them. Pass indices by reference
and fix the per-register analysis.
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/96808
If the base node is signed, but some values are unsigned, still the
whole node should be considered signed. Also, an extra bitwidth analysis
should be performed, when estimating the minimal bitwidth.
Uses the new InsertPosition class (added in #94226) to simplify some of
the IRBuilder interface, and removes the need to pass a BasicBlock
alongside a BasicBlock::iterator, using the fact that we can now get the
parent basic block from the iterator even if it points to the sentinel.
This patch removes the BasicBlock argument from each constructor or call
to setInsertPoint.
This has no functional effect, but later on as we look to remove the
`Instruction *InsertBefore` argument from instruction-creation
(discussed
[here](https://discourse.llvm.org/t/psa-instruction-constructors-changing-to-iterator-only-insertion/77845)),
this will simplify the process by allowing us to deprecate the
InsertPosition constructor directly and catch all the cases where we use
instructions rather than iterators.
- Prevent null dereference: if the Mask given to
`ShuffleInstructionBuilder::adjustExtracts()` is empty or all-poison,
then `VecBase` will be `nullptr` and the call to
`castToScalarTyElem(VecBase)` will dereference it. Add an assert
to guard against this.
- Prevent use of uninitialized scalar: in the unlikely event that
`CandidateVFs` is empty, then `AnyProfitableGraph` will be
uninitialized in `if` condition following the loop. (This seems like a
false-positive, but I submitted this change anyways as initializing
bools costs nothing and is generally good practice)
One of the previous patches introduced initial support for non-power-of-2
number of elements but some parts of the SLP vectorizer still were not
adjusted to handle the costs correctly. Patch fixes it by improving
analysis of the non-power-of-2 number of elements and fixes in the cost
of the extractelements instructions.
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/93213
If trying to find vector value in shuffling of the extractelements and
one of the vector values is undef value, need to generate real mask value
for such vector and either undef vector, or incoming second vector, if
non-poisonous.
Need to look through the SExt/ZExt scalars to be gathered, when trying
to reduce their width after minbitwidth analysis to prevent permanent
attempts to revectorize such gathered instructions.
Need to look through the SExt/ZExt scalars to be gathered, when trying
to reduce their width after minbitwidth analysis to prevent permanent
attempts to revectorize such gathered instructions.
Still need to do the full analysis of the signedness of the values
rather than rely on Instruction opcode, if the opcode is SExt. Still may
produce unsigned result.
Need to check that the signed operand has an extra sign bit to be sure
that we do not skip signedness, when trying to minimize bitwidth for
smin/smax intrinsics.
In some cases masked gather is less profitable than insert-subvector of
consecutive/strided stores. SLP has this kind of analysis, but need to
improve it by adding the cost of the GEP analysis.
Also, the GEP cost estimation for masked gather is fixed.
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/90737
After minbitwidth analysis, and <v>, (power_of_2 - 1 const) can be
transformed into just an <v>, (all_ones const), which can be ignored at
the cost estimation and at the codegen. x264 benchmark has this pattern.
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/90739
Adds transformation of consecutive vector store + reverse to strided
stores with stride -1, if it is profitable
Reviewers: RKSimon, preames
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/90464
Added a limit of 128 incoming values at max for PHIs nodes to be
vectorized plus improved performance by using logarithmic search instead
of linear if the number of incoming values is > 4.
Metric: size..text
Program size..text
exp ref diff
test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test 42906.00 42986.00 0.2%
test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test 42909.00 42989.00 0.2%
test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 664581.00 664661.00 0.0%
test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 664581.00 664661.00 0.0%
Less is better.
Replaces `buildvector <p x in> + trunc <p x in> to <p x im>` sequences to
`buildvector <p x im> of { trunc in to im }` scalars, which is free in
most cases, results in better code.
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/88504
If the gather node matches the vectorized node, it must also match with
the scalars completely. Otherwise, need to revectorize the gather node
to generate correct code.
Before deleting extractelement instruction for vectorized GEP with
external users, need to check that all users vectorized before deleting
this extractelement.
We can try to vectorize long store sequences, if short ones were
unsuccessful because of the non-profitable vectorization. It should not
increase compile time significantly (stores are sorted already,
complexity is n x log n), but vectorize extra code.
Metric: size..text
Program size..text
results results0 diff
test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test 1088012.00 1088236.00 0.0%
test-suite :: SingleSource/UnitTests/matrix-types-spec.test 480396.00 480476.00 0.0%
test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 664613.00 664661.00 0.0%
test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 664613.00 664661.00 0.0%
test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 2041105.00 2040961.00 -0.0%
test-suite :: MultiSource/Applications/JM/lencod/lencod.test 836563.00 836387.00 -0.0%
test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test 1035100.00 1032140.00 -0.3%
In all benchmarks extra code gets vectorized
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/88563
No need to try to vectorize single gather/buildvector with alternate
opcode graph, it is not profitable. In other cases, need to use last
instruction for inserting the vectorized code.
Introduced transformNodes() function to perform transformation of the
nodes (cost-based, instruction count based, etc.).
Implemented transformation of consecutive loads + reverse order to
strided loads with stride -1, if profitable.
Reviewers: RKSimon, preames, topperc
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/88530
The compiler should not take into account the type of the cmp
instruction, otherwise it may treat the size incorrectly and it may lead
to incorrect codegen.
Need to check that at least single bit is cleared for unsigned nodes
before reducing their size. Otherwise they might be treated as signed in
signed nodes.
This reverts commit 74e07ab523122d6a8347b25770062ab331b6bb84.
It might be that Mask.getBitWidth() == Mask.countl_zero() (32 in my
case) and zero bitwidth2 causes the crash.
Need to check that at least single bit is cleared for unsigned nodes
before reducing their size. Otherwise they might be treated as signed in
signed nodes.
We can try to vectorize long store sequences, if short ones were
unsuccessful because of the non-profitable vectorization. It should not
increase compile time significantly (stores are sorted already,
complexity is n x log n), but vectorize extra code.
Metric: size..text
Program size..text
results results0 diff
test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test 1088012.00 1088236.00 0.0%
test-suite :: SingleSource/UnitTests/matrix-types-spec.test 480396.00 480476.00 0.0%
test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 664613.00 664661.00 0.0%
test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 664613.00 664661.00 0.0%
test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 2041105.00 2040961.00 -0.0%
test-suite :: MultiSource/Applications/JM/lencod/lencod.test 836563.00 836387.00 -0.0%
test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test 1035100.00 1032140.00 -0.3%
In all benchmarks extra code gets vectorized
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/88563