This reverts commit 74e07ab523122d6a8347b25770062ab331b6bb84.
It might be that Mask.getBitWidth() == Mask.countl_zero() (32 in my
case), and the resulting zero bitwidth2 causes the crash.
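As a standalone C++ sketch of the degenerate case (not the vectorizer
code; the variable names only mirror the APInt calls mentioned above):
when the mask is all zeroes, the leading-zero count equals the bit
width, so a width computed as their difference collapses to zero.

  #include <bit>
  #include <cstdint>
  #include <cstdio>

  int main() {
    uint32_t Mask = 0;                              // all bits known zero
    unsigned BitWidth = 32;                         // analogue of Mask.getBitWidth()
    unsigned LeadingZeros = std::countl_zero(Mask); // 32, equal to the width
    unsigned Needed = BitWidth - LeadingZeros;      // 0 -- not a usable type width
    std::printf("needed bits: %u\n", Needed);       // prints 0
  }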
Need to check that at least a single bit is cleared for unsigned nodes
before reducing their size. Otherwise they might be treated as signed
values in signed nodes.
We can try to vectorize long store sequences, if short ones were
unsuccessful because vectorization was not profitable. It should not
increase compile time significantly (the stores are sorted already, so
the complexity is n x log n), but it lets extra code get vectorized.
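A rough standalone sketch of the complexity argument (illustrative only,
not the actual SLP store handling): sorting the store offsets is the
n log n part, and collecting the consecutive runs that form candidate
sequences is a single linear scan over the sorted list.

  #include <algorithm>
  #include <cstdint>
  #include <vector>

  // Illustrative only: group sorted store offsets into maximal consecutive
  // runs; those runs are the candidate sequences, and the long ones can be
  // retried when the shorter attempts turn out to be unprofitable.
  std::vector<std::vector<int64_t>> collectRuns(std::vector<int64_t> Offsets) {
    std::sort(Offsets.begin(), Offsets.end());      // the n log n part
    std::vector<std::vector<int64_t>> Runs;
    for (int64_t Off : Offsets) {                   // single linear scan
      if (Runs.empty() || Off != Runs.back().back() + 1)
        Runs.push_back({});
      Runs.back().push_back(Off);
    }
    return Runs;
  }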
Metric: size..text
Program size..text
results results0 diff
test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test 1088012.00 1088236.00 0.0%
test-suite :: SingleSource/UnitTests/matrix-types-spec.test 480396.00 480476.00 0.0%
test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 664613.00 664661.00 0.0%
test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 664613.00 664661.00 0.0%
test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 2041105.00 2040961.00 -0.0%
test-suite :: MultiSource/Applications/JM/lencod/lencod.test 836563.00 836387.00 -0.0%
test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test 1035100.00 1032140.00 -0.3%
In all benchmarks extra code gets vectorized
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/88563
This reverts commit 7d4e8c1f3bbfe976f4871c9cf953f76d771b0eda.
Contrary to the commit description, this does cause large
compile-time regressions (up to 10% on individual files).
We can try to vectorize long store sequences, if short ones were
unsuccessful because vectorization was not profitable. It should not
increase compile time significantly (the stores are sorted already, so
the complexity is n x log n), but it lets extra code get vectorized.
Metric: size..text
Program size..text
results results0 diff
test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test 1088012.00 1088236.00 0.0%
test-suite :: SingleSource/UnitTests/matrix-types-spec.test 480396.00 480476.00 0.0%
test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 664613.00 664661.00 0.0%
test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 664613.00 664661.00 0.0%
test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 2041105.00 2040961.00 -0.0%
test-suite :: MultiSource/Applications/JM/lencod/lencod.test 836563.00 836387.00 -0.0%
test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test 1035100.00 1032140.00 -0.3%
In all benchmarks extra code gets vectorized
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/88563
Need to check that an unsigned argument can be safely used in smax/smin
intrinsics by checking that at least a single sign bit is cleared;
otherwise its value may be treated as negative instead of positive.
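A small standalone C++ example of the hazard (illustrative only, not the
analysis code): an unsigned value with the sign bit set flips meaning
once it feeds a signed max/min.

  #include <algorithm>
  #include <cstdint>
  #include <cstdio>

  int main() {
    uint8_t U = 200;                 // sign bit (bit 7) is set
    int8_t AsSigned = (int8_t)U;     // same bits reinterpreted: -56
    // Unsigned max picks 200, but a signed smax on the same bits picks 7.
    std::printf("umax = %d\n", (int)std::max<uint8_t>(U, 7));       // 200
    std::printf("smax = %d\n", (int)std::max<int8_t>(AsSigned, 7)); // 7
    // Only if the sign bit is known to be clear (value <= 127) do the
    // signed and unsigned interpretations agree.
  }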
Perform the required operations only when the result is actually
required, and check whether the value was already inserted into the list
of externally used scalars, exiting early if it was.
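A minimal sketch of that kind of early exit (generic C++, not the
vectorizer's actual data structures; the set and the names are
illustrative): insertion into a set reports whether the value was
already present, so the expensive work can be skipped.

  #include <unordered_set>

  struct Value;                       // stand-in for a scalar value

  void noteExternallyUsed(const Value *V,
                          std::unordered_set<const Value *> &ExternallyUsed) {
    // insert() returns {iterator, bool}; the bool is false if V was
    // already recorded, so we exit early and do no further work.
    if (!ExternallyUsed.insert(V).second)
      return;
    // ... perform the required bookkeeping only for newly seen values ...
  }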
The option is intended primarily for LIT tests to suppress the
heuristic-based profitability check and proceed with vectorization of a
seemingly unprofitable alternate operation pattern. This allows the
vectorizer to execute the path that was the original intent of a test.
This ensures that MinVF is a power-of-2, even if ValueTy's width is
not a power-of-2.
This should fix a number of buildbot failures with X86 bootstrapping.
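As a standalone sketch of one way to guarantee that MinVF property (the
formula and the names here are assumptions, not the actual computation):
round the candidate value up to a power-of-2 with std::bit_ceil when a
non-power-of-2 ValueTy width would otherwise produce one.

  #include <bit>
  #include <cstdio>

  int main() {
    unsigned RegWidthBits = 128;        // hypothetical vector register width
    unsigned ValueTyWidthBits = 24;     // non-power-of-2 element width
    unsigned Raw = RegWidthBits / ValueTyWidthBits;  // 5, not a power of 2
    unsigned MinVF = std::bit_ceil(Raw);             // rounded up to 8
    std::printf("raw = %u, MinVF = %u\n", Raw, MinVF);
  }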
This patch enables vectorization for non-power-of-2 VFs. Initially only
VFs where adding 1 makes the VF a power-of-2 are allowed (see the sketch
below), i.e. we can still make relatively effective use of the vectors.
It relies on the existing target cost-models to return accurate costs
for non-power-of-2 vectors. I checked mostly AArch64 and X86 and the
costs seem reasonable for the cases I checked, although I expect there
will be a need to refine both the cost-models and lowering to make the
most effective use of non-power-of-2 SLP vectorization.
Note that re-ordering and shuffling is not implemented for nodes
requiring padding yet, to keep the initial implementation simpler.
The feature is guarded by a new flag, off by default for now.
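A standalone sketch of the VF filter implied by the description above
(illustrative, not the actual code): besides power-of-2 VFs, accept VFs
where adding 1 gives a power of 2, e.g. 3, 7, 15.

  #include <bit>
  #include <cstdio>

  // True for power-of-2 VFs and for VFs that become a power of 2
  // after adding 1 (3 -> 4, 7 -> 8, 15 -> 16, ...).
  static bool isAcceptableVF(unsigned VF) {
    return VF >= 2 && (std::has_single_bit(VF) || std::has_single_bit(VF + 1));
  }

  int main() {
    for (unsigned VF = 2; VF <= 16; ++VF)
      std::printf("VF=%u %s\n", VF, isAcceptableVF(VF) ? "ok" : "skip");
  }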
PR: https://github.com/llvm/llvm-project/pull/77790
If a (f)sub is the only operand of llvm.(f)abs or of an icmp eq/ne with
0 (int only), we can consider it a commutative operation; we just need
to drop the wrapping flags for the integer operation.
https://alive2.llvm.org/ce/z/GxvxjB for correctness of abs with dropped
flags.
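A small standalone C++ illustration of the point (the alive2 link above
is the actual correctness proof): |a - b| == |b - a| and (a - b) == 0
iff (b - a) == 0, but the swapped subtraction may wrap where the
original did not, which is why the wrapping flags must be dropped.

  #include <cstdint>
  #include <cstdio>
  #include <cstdlib>

  int main() {
    int32_t A = 5, B = 9;
    // abs of the difference is insensitive to the operand order ...
    std::printf("%d %d\n", std::abs(A - B), std::abs(B - A)); // 4 4
    // ... and equality with zero is too.
    std::printf("%d %d\n", (A - B) == 0, (B - A) == 0);       // 0 0
    // But signed-overflow guarantees do not survive the swap:
    // INT32_MIN - 0 does not wrap, while 0 - INT32_MIN does,
    // hence the wrapping flags have to be dropped on the swapped sub.
  }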
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/86196
The previous build system was adding custom "OpenCL" and "LLVM IR"
languages in CMake to build the builtin libraries. This was making it
harder to build in-tree because the tool binaries needed to be present
at configure time.
This commit refactors the build system to use custom commands to build
the bytecode files one by one, and link them all together into the final
bytecode library. It also enables in-tree builds by aliasing the
clang/llvm-link/etc. tool targets to internal targets, which are
imported from the LLVM installation directory when building out of tree.
Diffing (with llvm-diff) all of the final bytecode libraries in an
out-of-tree configuration against those built using the current tip
system shows no changes. Note that there are textual changes to metadata
IDs which confuse regular diff, and that llvm-diff 14 and below may show
false-positives.
This commit also removes a file listed in one of the SOURCEs which
didn't exist and which was preventing the use of
ENABLE_RUNTIME_SUBNORMAL when configuring CMake.
If the operands of the potentially alternate node are going to produce
buildvector sequences which result in more instructions than the
original code, then such instructions should not be vectorized as an
alternate node; it is better to end up with the buildvector node.
Left column - experimental, Right - reference.
Metric: size..text
Program size..text
results results0 diff
test-suite :: SingleSource/Benchmarks/Adobe-C++/loop_unroll.test 413680.00 416272.00 0.6%
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12351788.00 12354844.00 0.0%
test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 664901.00 664949.00 0.0%
test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 664901.00 664949.00 0.0%
test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test 1171371.00 1171355.00 -0.0%
test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test 1036396.00 1036284.00 -0.0%
test-suite :: MultiSource/Benchmarks/MiBench/consumer-jpeg/consumer-jpeg.test 111280.00 111248.00 -0.0%
test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test 1392113.00 1391361.00 -0.1%
test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test 1392113.00 1391361.00 -0.1%
test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test 281676.00 281452.00 -0.1%
test-suite :: MultiSource/Benchmarks/VersaBench/ecbdes/ecbdes.test 3025.00 3019.00 -0.2%
test-suite :: MultiSource/Benchmarks/Prolangs-C/plot2fig/plot2fig.test 6351.00 6335.00 -0.3%
Metric: SLP.NumVectorInstructions
Program SLP.NumVectorInstructions
results results0 diff
test-suite :: MultiSource/Benchmarks/VersaBench/ecbdes/ecbdes.test 15.00 16.00 6.7%
test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 1703.00 1707.00 0.2%
test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 1703.00 1707.00 0.2%
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 26241.00 26239.00 -0.0%
test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 11761.00 11754.00 -0.1%
test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test 824.00 822.00 -0.2%
test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test 5668.00 5654.00 -0.2%
test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test 5668.00 5654.00 -0.2%
test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test 792.00 790.00 -0.3%
test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test 792.00 790.00 -0.3%
test-suite :: MultiSource/Benchmarks/FreeBench/pifft/pifft.test 1389.00 1384.00 -0.4%
test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test 596.00 590.00 -1.0%
test-suite :: MultiSource/Benchmarks/Prolangs-C/plot2fig/plot2fig.test 6.00 5.00 -16.7%
Metric: exec_time
Program exec_time
results results0 diff
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 99.14 100.00 0.9%
Other changes are not significant (less than 0.1%, with exec time less
than 5 secs).
SingleSource/Benchmarks/Adobe-C++/loop_unroll - same small patterns
remain scalar, smaller code.
External/SPEC/CFP2017rate/526.blender_r/526.blender_r - many small
changes, some extra stores get vectorized.
External/SPEC/CINT2017speed/625.x264_s/625.x264_s
External/SPEC/CINT2017rate/525.x264_r/525.x264_r
x264 has one change in a loop body, in the function ssim_end4; some code
remains scalar, resulting in smaller code size.
External/SPEC/CFP2017rate/511.povray_r/511.povray_r - some extra code
gets vectorized, looks like some other patterns were matched.
MultiSource/Benchmarks/7zip/7zip-benchmark - extra stores were
vectorized (looks like the graphs became profitable)
MultiSource/Benchmarks/MiBench/consumer-jpeg/consumer-jpeg - small
changes in vectorized code (some small part remains scalar).
External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r
External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s
Many changes are caused by the fact that the code of one function
(onvertLCHabToRGB) becomes smaller and this function gets inlined after that.
MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc - some small
changes here and there, some extra code is vectorized, some remains
scalar (2 x vectors)
MultiSource/Benchmarks/VersaBench/ecbdes/ecbdes - emits 2 scalars
+ 2 insertelems instead of insert, broadcast, alt code (3 instructions,
total 5 insts)
MultiSource/Benchmarks/Prolangs-C/plot2fig/plot2fig - small graph
becomes profitable and gets vectorized.
External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r
External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s
Some small graph becomes profitable and gets vectorized.
MultiSource/Benchmarks/FreeBench/pifft/pifft - no changes in final code.
Reviewers: RKSimon, dtcxzyw
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/84978
If the first node is a gather node with extractelement instructions, we
still need to put the vector value after all of those instructions, not
after the very first one.
Better to walk over SLP nodes rather than single values. Matching
a value to a node is not a 1-to-1 relation; one value may be part of
several nodes, and the compiler may get the wrong node when trying to
map it. Currently there are no such issues detected, but they may appear
in the future.
Need to improve detection of the number of bits required for the operand
before doing a reduction. If the instruction is an incoming operand of a
signed compare, need to consider adding an extra bit for the sign.
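A standalone numeric illustration of the extra bit (not the analysis
code itself): a value that fits in 8 unsigned bits needs a 9th bit once
it feeds a signed compare, otherwise the compare sees a negative number.

  #include <cstdint>
  #include <cstdio>

  int main() {
    unsigned Val = 200;          // fits in 8 bits as an unsigned value
    int8_t  At8 = (int8_t)Val;   // 8 signed bits: reinterpreted as -56
    int16_t At9 = (int16_t)Val;  // with a spare sign bit it stays 200
    std::printf("signed cmp at 8 bits:  %d\n", At8 > 100);  // 0 (wrong)
    std::printf("signed cmp at 9+ bits: %d\n", At9 > 100);  // 1 (correct)
  }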
Need to use the vectorized operands, not the vector operand of the
extractelement instructions, to avoid false detection of an extra vector
operand in the extractelement shuffling.
Added a demanded-bits analysis part to IsPotentiallyTruncated to improve
the final attempts of the minbitwidth analysis.
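A standalone sketch of the demanded-bits idea feeding the minbitwidth
decision (generic C++, not IsPotentiallyTruncated itself): if a user
only demands the low bits of a value, the value can be represented in
that many bits without changing the result.

  #include <cstdint>
  #include <cstdio>

  int main() {
    uint32_t X = 0x12345678;
    // The user only looks at the low 8 bits, so only those are "demanded".
    uint32_t DemandedMask = 0xFF;
    uint8_t Narrow = (uint8_t)X;          // truncate to the demanded width
    std::printf("%u %u\n", X & DemandedMask, (uint32_t)(Narrow & DemandedMask));
    // Both print 120 (0x78): truncating to 8 bits is safe for this user.
  }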
Metric: size..text
Program size..text
results results0 diff
test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test 43069.00 42973.00 -0.2%
test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test 43066.00 42970.00 -0.2%
Extra trunc instructions are emitted to operate with <32 x i8> instead
of <32 x i16>; they will be removed in the next patches.
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/87786
Need to fix the way the cost is calculated; otherwise a wrong cast
opcode can be selected and lead to an over-optimistic vector cost. Plus,
need to take the reduction type size into account.
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/87528
We started seeing a crash after 8a0bfe490592de3df28d82c5dd69956e43c20f1d where
the user could be scalable, meaning the type size is scalable and an implicit
conversion to uint64_t could be performed. Protect against that by making sure
the user's type is not scalable.
The compiler can improve the analysis for operands of UIToFP/SIToFP
instructions and operands of ICmp instructions.
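A standalone C++ illustration of why uitofp/sitofp operands are good
narrowing candidates (not the actual analysis): if the integer value
fits in a smaller width, converting the narrowed value to floating point
yields the same result.

  #include <cstdint>
  #include <cstdio>

  int main() {
    uint32_t Wide = 200;             // known to fit in 8 bits
    uint8_t Narrow = (uint8_t)Wide;  // narrowing loses nothing here
    float FromWide = (float)Wide;
    float FromNarrow = (float)Narrow;
    std::printf("%d\n", FromWide == FromNarrow);   // 1: same float value
  }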
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/85966
The compiler can improve the analysis for operands of UIToFP/SIToFP
instructions and operands of ICmp instructions.
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/85966
If the node has cmp instructions with 3 or more different but swappable
predicates, we need to keep the same kind of main/alternate opcodes to
avoid incorrect detection of opcodes after reordering. Reordering
changes the order, and we may erroneously consider swappable opcodes as
non-compatible/alternate, which may lead to a later compiler crash.
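A standalone reminder of what "swappable predicates" means here
(illustrative only): swapping the operands of a compare flips the
predicate, e.g. a < b is the same as b > a, so after reordering the
recorded main/alternate opcodes must stay consistent.

  #include <cstdio>

  int main() {
    int A = 3, B = 7;
    // slt with the original operand order ...
    std::printf("%d\n", A < B);   // 1
    // ... equals sgt with the operands swapped.
    std::printf("%d\n", B > A);   // 1
    // With three or more distinct-but-swappable predicates in one node,
    // inconsistent swapping can make them look incompatible/alternate.
  }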
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/87267
If we have mixed extractelement instructions, fixed and scalable ones,
we need to check that the compiler estimates the cost for the
fixed-vector extractelement, not the scalable one, to avoid a compiler
crash.
Need to handle the case where the resize operation itself is not reduced
but its operand is. In this case, extra analysis is needed for the
operand, not the instruction itself.
Need to fix the analysis for the alternate instructions based on int
extension operations. If the alternate extension node is resized, but
not its operand, we need to resize the node and not shuffle the final
result; we end up with only a trunc instruction.
Need to check that the initial ext/trunc nodes can be safely represented
using the calculated bitwidth before applying it.