When reducing the width of gathered scalars after minbitwidth analysis,
look through SExt/ZExt scalars. Otherwise the vectorizer keeps trying to
revectorize such gathered instructions.
Even if the opcode is SExt, the signedness of the values still needs a
full analysis rather than relying on the instruction opcode: the node
may still produce an unsigned result.
When minimizing the bitwidth for smin/smax intrinsics, check that the
signed operand has an extra sign bit, so that signedness is not lost.
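As an illustration of the extra-sign-bit requirement, the following is a minimal Python model, not the vectorizer's implementation. All helper names (`to_signed`, `num_sign_bits`, `can_narrow_signed`, `smax_narrow`) are hypothetical, and the 32-to-8-bit widths are chosen only for the example.

```python
def to_signed(value, width):
    """Interpret the low `width` bits of `value` as a two's-complement integer."""
    value &= (1 << width) - 1
    return value - (1 << width) if value >> (width - 1) else value


def num_sign_bits(value, width):
    """Count the leading bits that are copies of the sign bit (at least 1)."""
    value &= (1 << width) - 1
    sign = value >> (width - 1)
    count = 0
    for bit in range(width - 1, -1, -1):
        if (value >> bit) & 1 != sign:
            break
        count += 1
    return count


def can_narrow_signed(value, width, narrow):
    """The dropped bits plus the new sign bit must all repeat the old sign bit."""
    return num_sign_bits(value, width) >= width - narrow + 1


def smax_narrow(a, b, narrow):
    """smax evaluated on operands truncated to `narrow` bits."""
    return max(to_signed(a, narrow), to_signed(b, narrow))
```

With 32-bit operands 100 and 200 narrowed to 8 bits, 200 has only 24 sign bits where 25 are needed, so `can_narrow_signed` rejects it; evaluating the narrowed smax anyway would return 100 instead of 200.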
In some cases a masked gather is less profitable than an insert-subvector
of consecutive/strided stores. SLP already has this kind of analysis,
but it needs to be improved by adding the cost of the GEPs.
Also fixes the GEP cost estimation for masked gathers.
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/90737
After minbitwidth analysis, `and <v>, (power_of_2 - 1 const)` can be
transformed into just `and <v>, (all_ones const)`, which can be ignored in
both cost estimation and codegen. The x264 benchmark has this pattern.
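The arithmetic behind this can be checked with a small Python sketch. It only models integer truncation and masking; the helper names are hypothetical and the 8-bit width is an assumption made for the example.

```python
def trunc(value, width):
    """Keep only the low `width` bits, like an integer trunc."""
    return value & ((1 << width) - 1)


def narrowed_and_is_noop(mask, narrow):
    """True if `mask` becomes all ones once the node is narrowed to `narrow` bits."""
    return trunc(mask, narrow) == (1 << narrow) - 1


def and_then_trunc(value, mask, narrow):
    """Original form: `and` in the wide type, then truncate."""
    return trunc(value & mask, narrow)
```

For a mask of 0xFF and a node narrowed to 8 bits, the narrowed mask is all ones, so `and_then_trunc(x, 0xFF, 8)` equals `trunc(x, 8)` for every `x` and the `and` does no work. A mask of 0x7F does not qualify at that width.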
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/90739
Adds a transformation of a consecutive vector store + reverse into a
strided store with stride -1, if it is profitable.
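The equivalence the transformation relies on can be sketched in Python. This is a toy memory model, not vectorizer code: addresses are counted in elements rather than bytes, and all function names are hypothetical.

```python
def store_consecutive(memory, base, values):
    """Store lane i at address base + i."""
    for lane, value in enumerate(values):
        memory[base + lane] = value


def store_strided(memory, base, stride, values):
    """Store lane i at address base + i * stride."""
    for lane, value in enumerate(values):
        memory[base + lane * stride] = value


def reverse_then_store(memory, base, values):
    """Original form: reverse shuffle followed by a consecutive store."""
    store_consecutive(memory, base, list(reversed(values)))


def store_with_stride_minus_one(memory, base, values):
    """Transformed form: start at the last address and walk backwards."""
    store_strided(memory, base + len(values) - 1, -1, values)
```

Both forms leave memory in the same state; whether the strided form is actually cheaper is target-dependent, which is why the transformation is gated on profitability.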
Reviewers: RKSimon, preames
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/90464
Metric: size..text
Program                                                                            exp        ref  diff
test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test          42906.00   42986.00  0.2%
test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test   42909.00   42989.00  0.2%
test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test          664581.00  664661.00  0.0%
test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test         664581.00  664661.00  0.0%
Less is better.
Replaces `buildvector <p x in> + trunc <p x in> to <p x im>` sequences with
`buildvector <p x im> of { trunc in to im }` scalars. The scalar truncs are
free in most cases, which results in better code.
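A short Python sketch shows why the two sequences produce the same vector. It models vectors as lists and `trunc` as masking; the function names are hypothetical and nothing here reflects the cost model that decides when the scalar truncs are free.

```python
def trunc(value, width):
    """Keep only the low `width` bits, like an integer trunc."""
    return value & ((1 << width) - 1)


def buildvector_then_trunc(scalars, narrow):
    """buildvector <p x in>, then one vector trunc to <p x im>."""
    wide_vector = list(scalars)
    return [trunc(lane, narrow) for lane in wide_vector]


def buildvector_of_truncs(scalars, narrow):
    """Scalar trunc of each element, then buildvector <p x im>."""
    narrow_scalars = [trunc(scalar, narrow) for scalar in scalars]
    return list(narrow_scalars)
```

Because truncation acts on each lane independently, it commutes with building the vector, so the choice between the two forms is purely a matter of cost.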
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/88504
If a gather node matches a vectorized node, its scalars must also match
completely. Otherwise the gather node needs to be revectorized to
generate correct code.
Before deleting an extractelement instruction for a vectorized GEP with
external users, check that all of its users have been vectorized.
We can try to vectorize long store sequences if the short ones failed
because vectorization was not profitable. This should not increase
compile time significantly (the stores are already sorted, complexity is
n x log n), but it vectorizes extra code.
Metric: size..text
Program                                                                    results    results0   diff
test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test   1088012.00  1088236.00   0.0%
test-suite :: SingleSource/UnitTests/matrix-types-spec.test              480396.00   480476.00   0.0%
test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test      664613.00   664661.00   0.0%
test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test     664613.00   664661.00   0.0%
test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test  2041105.00  2040961.00  -0.0%
test-suite :: MultiSource/Applications/JM/lencod/lencod.test             836563.00   836387.00  -0.0%
test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test           1035100.00  1032140.00  -0.3%
In all benchmarks extra code gets vectorized.
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/88563
No need to try to vectorize a single gather/buildvector graph with
alternate opcode; it is not profitable. In other cases, the last
instruction needs to be used for inserting the vectorized code.
Introduces the transformNodes() function to perform transformations of
the nodes (cost-based, instruction-count-based, etc.).
Implements the transformation of consecutive loads + reverse order into
strided loads with stride -1, if profitable.
Reviewers: RKSimon, preames, topperc
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/88530
The compiler should not take into account the type of the cmp
instruction; otherwise it may compute the size incorrectly, which may
lead to incorrect codegen.
Check that at least one bit is cleared in unsigned nodes before reducing
their size. Otherwise they might be treated as signed when used in
signed nodes.
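A minimal Python sketch of the failure this check guards against. The helper names are hypothetical and the 8-bit and 9-bit widths are assumptions for the example; the real check operates on the vectorizer's own bit-tracking data.

```python
def to_signed(value, width):
    """Interpret the low `width` bits of `value` as a two's-complement integer."""
    value &= (1 << width) - 1
    return value - (1 << width) if value >> (width - 1) else value


def safe_in_signed_node(unsigned_value, narrow):
    """An unsigned value keeps its meaning under a signed reading of `narrow`
    bits only if the top bit of the narrow type is clear."""
    return 0 <= unsigned_value < (1 << (narrow - 1))
```

The unsigned value 200 fits in 8 bits, but a signed consumer reads those 8 bits as -56. With one more bit (9 bits wide) the top bit is clear and the value is read as 200 again.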
This reverts commit 74e07ab523122d6a8347b25770062ab331b6bb84.
It might be that Mask.getBitWidth() == Mask.countl_zero() (32 in my
case), in which case a zero bitwidth2 causes the crash.
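The degenerate case can be reproduced with a small Python model of a leading-zero count. The message does not show how bitwidth2 is computed, so the subtraction in `significant_bits` is an assumption of this sketch, and both function names are hypothetical stand-ins rather than LLVM's API.

```python
def countl_zero(value, bit_width):
    """Number of leading zero bits of `value` viewed as a `bit_width`-bit integer."""
    value &= (1 << bit_width) - 1
    return bit_width - value.bit_length()


def significant_bits(mask, bit_width):
    """Assumed derivation: total width minus leading zeros."""
    return bit_width - countl_zero(mask, bit_width)
```

For an all-zero 32-bit mask, `countl_zero` returns 32, which equals the bit width, so the derived width is 0. Any code that later builds a type or shifts by that width needs to handle the zero case explicitly.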
This reverts commit 7d4e8c1f3bbfe976f4871c9cf953f76d771b0eda.
Contrary to the commit description, this does cause large
compile-time regressions (up to 10% on individual files).
Check that an unsigned argument can be safely used in smax/smin
intrinsics by verifying that at least one sign bit is cleared; otherwise
its value may be treated as negative instead of positive.
The option is intended primarily for LIT tests: it suppresses the
heuristic-based profitability check and proceeds with vectorization of a
seemingly unprofitable alternate operation pattern. This allows the
vectorizer to execute the path that a test was originally intended to
exercise.
This ensures that MinVF is a power-of-2, even if ValueTy's width is
not a power-of-2.
This should fix a number of buildbot failures with X86 bootstrapping.
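To see how a non-power-of-2 element width can yield a non-power-of-2 factor, consider the Python sketch below. The formula in `raw_min_vf` (register width divided by element width, with a floor of 2) and the choice to round up are assumptions made for illustration; the message does not state how MinVF is derived or in which direction it is rounded.

```python
def bit_ceil(n):
    """Smallest power of two that is >= n, for n >= 1."""
    return 1 << (n - 1).bit_length()


def raw_min_vf(min_vec_reg_bits, value_bits):
    """Hypothetical unrounded factor: how many elements fill the register."""
    return max(2, min_vec_reg_bits // value_bits)


def min_vf(min_vec_reg_bits, value_bits):
    """Same factor, forced to a power of two by rounding up."""
    return bit_ceil(raw_min_vf(min_vec_reg_bits, value_bits))
```

With a 128-bit register and a 24-bit element the unrounded factor is 5, which `min_vf` rounds to 8; for a 32-bit element the factor is already 4 and is left unchanged.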