363 Commits

Author SHA1 Message Date
Alexey Bataev
2b00a73f62
[SLP]Buildvector for alternate instructions with non-profitable gather operands.
If the operands of the potentially alternate node are going to produce
buildvector sequences, which result in more instructions, than the
original code, then suhinstructions should be vectorized as alternate
node, better to end up with the buildvector node.

Left column - experimental, Right - reference.

Metric: size..text

Program                                                                                                                                                size..text
                                                                                                                                                       results     results0    diff
                                                                                      test-suite :: SingleSource/Benchmarks/Adobe-C++/loop_unroll.test   413680.00   416272.00  0.6%
                                                                              test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12351788.00 12354844.00  0.0%
                                                                                  test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test   664901.00   664949.00  0.0%
                                                                                   test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test   664901.00   664949.00  0.0%
                                                                                test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test  1171371.00  1171355.00 -0.0%
                                                                                         test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test  1036396.00  1036284.00 -0.0%
                                                                         test-suite :: MultiSource/Benchmarks/MiBench/consumer-jpeg/consumer-jpeg.test   111280.00   111248.00 -0.0%
                                                                              test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test  1392113.00  1391361.00 -0.1%
                                                                             test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test  1392113.00  1391361.00 -0.1%
                                                                        test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test   281676.00   281452.00 -0.1%
                                                                                    test-suite :: MultiSource/Benchmarks/VersaBench/ecbdes/ecbdes.test     3025.00     3019.00 -0.2%
                                                                                test-suite :: MultiSource/Benchmarks/Prolangs-C/plot2fig/plot2fig.test     6351.00     6335.00 -0.3%

Metric: SLP.NumVectorInstructions

Program                                                                                                                                                SLP.NumVectorInstructions
                                                                                                                                                       results                   results0 diff
                                                                                    test-suite :: MultiSource/Benchmarks/VersaBench/ecbdes/ecbdes.test    15.00                     16.00   6.7%
                                                                                   test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test  1703.00                   1707.00   0.2%
                                                                                  test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test  1703.00                   1707.00   0.2%
                                                                              test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 26241.00                  26239.00  -0.0%
                                                                                test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 11761.00                  11754.00  -0.1%
                                                                        test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test   824.00                    822.00  -0.2%
                                                                             test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test  5668.00                   5654.00  -0.2%
                                                                              test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test  5668.00                   5654.00  -0.2%
                                                                                     test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test   792.00                    790.00  -0.3%
                                                                                    test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test   792.00                    790.00  -0.3%
                                                                                       test-suite :: MultiSource/Benchmarks/FreeBench/pifft/pifft.test  1389.00                   1384.00  -0.4%
                                                                                         test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test   596.00                    590.00  -1.0%
                                                                                test-suite :: MultiSource/Benchmarks/Prolangs-C/plot2fig/plot2fig.test     6.00                      5.00 -16.7%

Metric: exec_time

Program                                                                                                                                                exec_time
                                                                                                                                                       results   results0  diff
                                                                               test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test     99.14    100.00    0.9%

Other changes are not significant (less than 0.1% percent with exectime
less 5 secs).

SingleSource/Benchmarks/Adobe-C++/loop_unroll - same small patterns
remain scalar, smaller code.
External/SPEC/CFP2017rate/526.blender_r/526.blender_r - many small
changes, some extra stores gets vectorized.
External/SPEC/CINT2017speed/625.x264_s/625.x264_s
External/SPEC/CINT2017rate/525.x264_r/525.x264_r
x264 has one change in a loop body, in function ssim_end4, some code
remain scalar, resulting in less code size.
External/SPEC/CFP2017rate/511.povray_r/511.povray_r - some extra code
gets vectorized, looks like some other patterns were matched.
MultiSource/Benchmarks/7zip/7zip-benchmark - extra stores were
vectorized (looks like the graphs become profitable)
MultiSource/Benchmarks/MiBench/consumer-jpeg/consumer-jpeg - small
changes in vectorized code (some small part remain scalar).
External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r
External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s
Many changes cause by the fact that the code of one function becomes
smaller (onvertLCHabToRGB) and this functions gets inlined after that.
MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc - some small
changes here and there, some extra code is vectorized, some remain
scalar (2 x vectors)
MultiSource/Benchmarks/VersaBench/ecbdes/ecbdes - emits 2 scalars
+ 2 insertelems instead of insert, broadcast, alt code (3 instructions,
  total 5 insts)
MultiSource/Benchmarks/Prolangs-C/plot2fig/plot2fig - small graph
becomes profitable and gets vectorized.
External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r
External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s
Some small graph becomes profitable and gets vectorized.
MultiSource/Benchmarks/FreeBench/pifft/pifft - no changes in final code.

Reviewers: RKSimon, dtcxzyw

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/84978
2024-04-10 14:33:56 -04:00
Alexey Bataev
01d9528ef9
[SLP]Improve final minbitwidth analysis attempt.
Added part for demanded bits analysis in the IsPotentiallyTruncated to
improve minbitwidth analysis final attempts.

Metric: size..text

Program                                                                                                                                                size..text
                                                                                                                                                       results     results0    diff
                                                                           test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test    43069.00    42973.00 -0.2%
                                                                                  test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test    43066.00    42970.00 -0.2%

Extra trunc instructions are emitted to operate with <32 x i8> instead
of <32 x i16>, will be removed in the next patches.

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/87786
2024-04-08 15:54:30 -04:00
David Green
31fd6b8eec [SLP] Protect against scalable vector users.
We started seeing a crash after 8a0bfe490592de3df28d82c5dd69956e43c20f1d that
the user could be scalable, meaning the typesize is scalable and an implicit
convertion to uint64_t could be performed. Protect against that by making sure
the users type is not scalable.
2024-04-05 11:30:14 +01:00
Alexey Bataev
d57884011e
[SLP]Add support for commutative intrinsics.
Implemented long-standing TODO to support commutative intrinsics.

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/86316
2024-04-03 14:28:36 -04:00
Alexey Bataev
81d9ed605b [SLP]Do extra analysis int minbitwidth if some checks return false.
The instruction itself can be considered good for minbitwidth casting,
even if one of the operand checks returns false.

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/84363
2024-03-20 05:48:55 -07:00
Alexey Bataev
3a90cb4c18 Revert "[SLP]Do extra analysis int minbitwidth if some checks return false."
This reverts commit da118c93b40f74f6770cf8550903721555d3c97b to fix
crashes reported in https://github.com/llvm/llvm-project/pull/84363.
2024-03-20 05:00:05 -07:00
Alexey Bataev
da118c93b4 [SLP]Do extra analysis int minbitwidth if some checks return false.
The instruction itself can be considered good for minbitwidth casting,
even if one of the operand checks returns false.

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/84363
2024-03-19 12:24:18 -07:00
Alexey Bataev
31eaf86a1e [SLP]Improve minbitwidth analysis.
This improves overall analysis for minbitwidth in SLP. It allows to
analyze the trees with store/insertelement root nodes. Also, instead of
using single minbitwidth, detected from the very first analysis stage,
it tries to detect the best one for each trunc/ext subtree in the graph
and use it for the subtree.
Results in better code and less vector register pressure.

Metric: size..text

Program                                                                                                                                                size..text
                                                                                                                                                       results     results0    diff
                                                                      test-suite :: SingleSource/Benchmarks/Adobe-C++/simple_types_loop_invariant.test    92549.00    92609.00  0.1%
                                                                                  test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test   663381.00   663493.00  0.0%
                                                                                   test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test   663381.00   663493.00  0.0%
                                                                                               test-suite :: MultiSource/Benchmarks/Bullet/bullet.test   307182.00   307214.00  0.0%
                                                                             test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test  1394420.00  1394484.00  0.0%
                                                                              test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test  1394420.00  1394484.00  0.0%
                                                                                test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test  2040257.00  2040273.00  0.0%

                                                                              test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12396098.00 12395858.00 -0.0%
                                                                                         test-suite :: External/SPEC/CINT2006/445.gobmk/445.gobmk.test   909944.00   909768.00 -0.0%

SingleSource/Benchmarks/Adobe-C++/simple_types_loop_invariant - 4 scalar
instructions remain scalar (good).
Spec2017/x264 - the whole function idct4x4dc is vectorized using <16
x i16> instead of <16 x i32>, also zext/trunc are removed. In other
places last vector zext/sext removed and replaced by
extractelement + scalar zext/sext pair.
MultiSource/Benchmarks/Bullet/bullet - reduce or <4 x i32> replaced by
reduce or <4 x i8>
Spec2017/imagick - Removed extra zext from 2 packs of the operations.
Spec2017/parest - Removed extra zext, replaced by extractelement+scalar
zext
Spec2017/blender - the whole bunch of vector zext/sext replaced by
extractelement+scalar zext/sext, some extra code vectorized in smaller
types.
Spec2006/gobmk - fixed cost estimation, some small code remains scalar.

Original Pull Request: https://github.com/llvm/llvm-project/pull/84334

The patch has the same functionality (no test changes, no changes in
benchmarks) as the original patch, just has some compile time
improvements + fixes for xxhash unittest, discovered earlier in the
previous version of the patch.

Reviewers:

Pull Request: https://github.com/llvm/llvm-project/pull/84536
2024-03-19 08:19:45 -07:00
Alexey Bataev
3789870758 Revert "[SLP]Improve minbitwidth analysis."
This reverts commit 7f2167868d8c1cedd3915883412b9c787a2f01db to fix
issues reported in https://github.com/llvm/llvm-project/pull/84536.
2024-03-15 03:59:48 -07:00
Alexey Bataev
7567f5ba78 Revert "[SLP]Do extra analysis int minbitwidth if some checks return false."
This reverts commit ea429e19f56005bf89e717c14efdf49ec055b183 to fix
issues reported in https://github.com/llvm/llvm-project/pull/84536#issuecomment-1999295445.
2024-03-15 03:52:58 -07:00
Alexey Bataev
e4b772444c [SLP]Do extra analysis int minbitwidth if some checks return false.
The instruction itself can be considered good for minbitwidth casting,
even if one of the operand checks returns false.

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/84363
2024-03-14 16:41:04 -07:00
Alexey Bataev
5b303a98a8 Revert "[SLP]Do extra analysis int minbitwidth if some checks return false."
This reverts commit ea429e19f56005bf89e717c14efdf49ec055b183 to fix
issues revealed in
https://lab.llvm.org/buildbot/#/builders/186/builds/15299 and https://lab.llvm.org/buildbot/#/builders/238/builds/8426.
2024-03-14 15:48:26 -07:00
Alexey Bataev
ea429e19f5
[SLP]Do extra analysis int minbitwidth if some checks return false.
The instruction itself can be considered good for minbitwidth casting,
even if one of the operand checks returns false.

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/84363
2024-03-14 16:16:18 -04:00
Paschalis Mpeis
f795d1a8b1
[AArch64][LV][SLP] Vectorizers use call cost for vectorized frem (#82488)
getArithmeticInstrCost is used by both LoopVectorizer and SLPVectorizer
to compute the cost of frem, which becomes a call cost on AArch64 when
TLI has a vector library function.

Add tests that do SLP vectorization for code that contains 2x double and
4x float frem instructions.
2024-03-14 17:20:29 +00:00
Alexey Bataev
7f2167868d [SLP]Improve minbitwidth analysis.
This improves overall analysis for minbitwidth in SLP. It allows to
analyze the trees with store/insertelement root nodes. Also, instead of
using single minbitwidth, detected from the very first analysis stage,
it tries to detect the best one for each trunc/ext subtree in the graph
and use it for the subtree.
Results in better code and less vector register pressure.

Metric: size..text

Program                                                                                                                                                size..text
                                                                                                                                                       results     results0    diff
                                                                      test-suite :: SingleSource/Benchmarks/Adobe-C++/simple_types_loop_invariant.test    92549.00    92609.00  0.1%
                                                                                  test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test   663381.00   663493.00  0.0%
                                                                                   test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test   663381.00   663493.00  0.0%
                                                                                               test-suite :: MultiSource/Benchmarks/Bullet/bullet.test   307182.00   307214.00  0.0%
                                                                             test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test  1394420.00  1394484.00  0.0%
                                                                              test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test  1394420.00  1394484.00  0.0%
                                                                                test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test  2040257.00  2040273.00  0.0%

                                                                              test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12396098.00 12395858.00 -0.0%
                                                                                         test-suite :: External/SPEC/CINT2006/445.gobmk/445.gobmk.test   909944.00   909768.00 -0.0%

SingleSource/Benchmarks/Adobe-C++/simple_types_loop_invariant - 4 scalar
instructions remain scalar (good).
Spec2017/x264 - the whole function idct4x4dc is vectorized using <16
x i16> instead of <16 x i32>, also zext/trunc are removed. In other
places last vector zext/sext removed and replaced by
extractelement + scalar zext/sext pair.
MultiSource/Benchmarks/Bullet/bullet - reduce or <4 x i32> replaced by
reduce or <4 x i8>
Spec2017/imagick - Removed extra zext from 2 packs of the operations.
Spec2017/parest - Removed extra zext, replaced by extractelement+scalar
zext
Spec2017/blender - the whole bunch of vector zext/sext replaced by
extractelement+scalar zext/sext, some extra code vectorized in smaller
types.
Spec2006/gobmk - fixed cost estimation, some small code remains scalar.

Original Pull Request: https://github.com/llvm/llvm-project/pull/84334

The patch has the same functionality (no test changes, no changes in
benchmarks) as the original patch, just has some compile time
improvements + fixes for xxhash unittest, discovered earlier in the
previous version of the patch.

Reviewers:

Pull Request: https://github.com/llvm/llvm-project/pull/84536
2024-03-14 06:23:14 -07:00
Alexey Bataev
3d45d8bc70 [SLP][NFC]Add a test with the operand node, not being in MinBWs, though
user is in.
2024-03-13 12:04:28 -07:00
Alexey Bataev
a8967b060d [SLP][NFC]Add a test with buildvector with minbitwidth Root, NFC. 2024-03-13 11:52:48 -07:00
Alexey Bataev
3e6d56617f [SLP][NFC]Add a test with reused buildvector node, being resized after
minbitwidth analysis.
2024-03-13 11:02:18 -07:00
Martin Storsjö
5b5c21d772 Revert "[SLP]Improve minbitwidth analysis."
This reverts commit 2bd369b48dbf0bc3128becb7ef8f8a1b82514b87.

That commit triggered failed assertions:
$ cat repro.c
short *a;
int b;
void h() {
  short *c = a;
  b = 0;
  for (; b < 4; b++) {
    unsigned d = a[b] + a[b + 4 * 2], e = a[b] - a[b + 4 * 2],
             f = (a[b + 4] >> 1) - a[b + 4 * 3],
             g = a[b + 4] + (a[b + 4 * 3] >> 1);
    c[b] = g;
    c[b + 4] = e + f;
    c[b + 4 * 2] = e - f;
    c[b + 4 * 3] = d - g;
  }
}
$ clang -target aarch64-linux-gnu -c -O2 repro.c
clang: ../lib/Transforms/Vectorize/SLPVectorizer.cpp:12503: llvm::Value* llvm::slpvectorizer::BoUpSLP::vectorizeTree(llvm::slpvectorizer::BoUpSLP::TreeEntry*, bool): Assertion `(MinBWs.contains(getOperandEntry(E, 0)) || MinBWs.contains(getOperandEntry(E, 1))) && "Expected item in MinBWs."' failed.
2024-03-09 13:53:13 +02:00
Alexey Bataev
2bd369b48d
[SLP]Improve minbitwidth analysis.
This improves overall analysis for minbitwidth in SLP. It allows to
analyze the trees with store/insertelement root nodes. Also, instead of
using single minbitwidth, detected from the very first analysis stage,
it tries to detect the best one for each trunc/ext subtree in the graph
and use it for the subtree.
Results in better code and less vector register pressure.

Metric: size..text

Program                                                                                                                                                size..text
                                                                                                                                                       results     results0    diff
                                                                      test-suite :: SingleSource/Benchmarks/Adobe-C++/simple_types_loop_invariant.test    92549.00    92609.00  0.1%
                                                                                  test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test   663381.00   663493.00  0.0%
                                                                                   test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test   663381.00   663493.00  0.0%
                                                                                               test-suite :: MultiSource/Benchmarks/Bullet/bullet.test   307182.00   307214.00  0.0%
                                                                             test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test  1394420.00  1394484.00  0.0%
                                                                              test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test  1394420.00  1394484.00  0.0%
                                                                                test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test  2040257.00  2040273.00  0.0%

                                                                              test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12396098.00 12395858.00 -0.0%
                                                                                         test-suite :: External/SPEC/CINT2006/445.gobmk/445.gobmk.test   909944.00   909768.00 -0.0%

SingleSource/Benchmarks/Adobe-C++/simple_types_loop_invariant - 4 scalar
instructions remain scalar (good).
Spec2017/x264 - the whole function idct4x4dc is vectorized using <16
x i16> instead of <16 x i32>, also zext/trunc are removed. In other
places last vector zext/sext removed and replaced by
extractelement + scalar zext/sext pair.
MultiSource/Benchmarks/Bullet/bullet - reduce or <4 x i32> replaced by
reduce or <4 x i8>
Spec2017/imagick - Removed extra zext from 2 packs of the operations.
Spec2017/parest - Removed extra zext, replaced by extractelement+scalar
zext
Spec2017/blender - the whole bunch of vector zext/sext replaced by
extractelement+scalar zext/sext, some extra code vectorized in smaller
types.
Spec2006/gobmk - fixed cost estimation, some small code remains scalar.

Original Pull Request: https://github.com/llvm/llvm-project/pull/84334

The patch has the same functionality (no test changes, no changes in
benchmarks) as the original patch, just has some compile time
improvements + fixes for xxhash unittest, discovered earlier in the
previous version of the patch.

Reviewers: 

Pull Request: https://github.com/llvm/llvm-project/pull/84536
2024-03-08 13:57:02 -05:00
Alexey Bataev
11185715a2 Revert "[SLP]Improve minbitwidth analysis."
This reverts commit 4ce52e2d576937fe930294cae883a0daa17eeced to fix
issues detected by https://lab.llvm.org/buildbot/#/builders/74/builds/26470/steps/12/logs/stdio.
2024-03-07 12:44:53 -08:00
Alexey Bataev
4ce52e2d57
[SLP]Improve minbitwidth analysis.
This improves overall analysis for minbitwidth in SLP. It allows to
analyze the trees with store/insertelement root nodes. Also, instead of
using single minbitwidth, detected from the very first analysis stage,
it tries to detect the best one for each trunc/ext subtree in the graph
and use it for the subtree.
Results in better code and less vector register pressure.

Metric: size..text

Program                                                                                                                                                size..text
                                                                                                                                                       results     results0    diff
                                                                      test-suite :: SingleSource/Benchmarks/Adobe-C++/simple_types_loop_invariant.test    92549.00    92609.00  0.1%
                                                                                  test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test   663381.00   663493.00  0.0%
                                                                                   test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test   663381.00   663493.00  0.0%
                                                                                               test-suite :: MultiSource/Benchmarks/Bullet/bullet.test   307182.00   307214.00  0.0%
                                                                             test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test  1394420.00  1394484.00  0.0%
                                                                              test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test  1394420.00  1394484.00  0.0%
                                                                                test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test  2040257.00  2040273.00  0.0%

                                                                              test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12396098.00 12395858.00 -0.0%
                                                                                         test-suite :: External/SPEC/CINT2006/445.gobmk/445.gobmk.test   909944.00   909768.00 -0.0%

SingleSource/Benchmarks/Adobe-C++/simple_types_loop_invariant - 4 scalar
instructions remain scalar (good).
Spec2017/x264 - the whole function idct4x4dc is vectorized using <16
x i16> instead of <16 x i32>, also zext/trunc are removed. In other
places last vector zext/sext removed and replaced by
extractelement + scalar zext/sext pair.
MultiSource/Benchmarks/Bullet/bullet - reduce or <4 x i32> replaced by
reduce or <4 x i8>
Spec2017/imagick - Removed extra zext from 2 packs of the operations.
Spec2017/parest - Removed extra zext, replaced by extractelement+scalar
zext
Spec2017/blender - the whole bunch of vector zext/sext replaced by
extractelement+scalar zext/sext, some extra code vectorized in smaller
types.
Spec2006/gobmk - fixed cost estimation, some small code remains scalar.

Reviewers: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/84334
2024-03-07 10:36:41 -05:00
Alexey Bataev
aae152f1be Revert "[SLP]Improve minbitwidth analysis."
This reverts commit a730ed7c1a4a35f5219df720ffb0ba6122d64fe4 to fix
compile time issue.
2024-03-05 12:13:45 -08:00
Alexey Bataev
a730ed7c1a
[SLP]Improve minbitwidth analysis.
This improves overall analysis for minbitwidth in SLP. It allows to
analyze the trees with store/insertelement root nodes. Also, instead of
using single minbitwidth, detected from the very first analysis stage,
it tries to detect the best one for each trunc/ext subtree in the graph
and use it for the subtree.
Results in better code and less vector register pressure.

Metric: size..text

Program                                                                                                                                                size..text
                                                                                                                                                       results     results0    diff
                                                                      test-suite :: SingleSource/Benchmarks/Adobe-C++/simple_types_loop_invariant.test    92549.00    92609.00  0.1%
                                                                                  test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test   663381.00   663493.00  0.0%
                                                                                   test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test   663381.00   663493.00  0.0%
                                                                                               test-suite :: MultiSource/Benchmarks/Bullet/bullet.test   307182.00   307214.00  0.0%
                                                                             test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test  1394420.00  1394484.00  0.0%
                                                                              test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test  1394420.00  1394484.00  0.0%
                                                                                test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test  2040257.00  2040273.00  0.0%

                                                                              test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12396098.00 12395858.00 -0.0%
                                                                                         test-suite :: External/SPEC/CINT2006/445.gobmk/445.gobmk.test   909944.00   909768.00 -0.0%

SingleSource/Benchmarks/Adobe-C++/simple_types_loop_invariant - 4 scalar
instructions remain scalar (good).
Spec2017/x264 - the whole function idct4x4dc is vectorized using <16
x i16> instead of <16 x i32>, also zext/trunc are removed. In other
places last vector zext/sext removed and replaced by
extractelement + scalar zext/sext pair.
MultiSource/Benchmarks/Bullet/bullet - reduce or <4 x i32> replaced by
reduce or <4 x i8>
Spec2017/imagick - Removed extra zext from 2 packs of the operations.
Spec2017/parest - Removed extra zext, replaced by extractelement+scalar
zext
Spec2017/blender - the whole bunch of vector zext/sext replaced by
extractelement+scalar zext/sext, some extra code vectorized in smaller
types.
Spec2006/gobmk - fixed cost estimation, some small code remains scalar.

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/78976
2024-03-05 12:20:28 -05:00
Alexey Bataev
169824ba40 [SLP][NFC]SPlit test/Transforms/SLPVectorizer/AArch64/getelementptr.ll,
NFC.
2024-03-05 08:18:51 -08:00
Alexey Bataev
32994cc0d6
[SLP]Improve findReusedOrderedScalars and graph rotation.
Patch syncs the code in findReusedOrderedScalars with cost
estimation/codegen. It tries to use similar logic to better determine
best order.
Before, it just tried to find previously vectorized node without
checking if it is possible to use the vectorized value in the shuffle.
Now it relies on the more generalized version. If it determines, that
a single vector must be reordered (using same mechanism, as codegen and
cost estimation), it generates better order.

The comparison between new/ref ordering:

Metric: SLP.NumVectorInstructions

Program                                                                                                                                                SLP.NumVectorInstructions
                                                                                                                                                       results                   results0 diff
                                                                                               test-suite :: MultiSource/Benchmarks/nbench/nbench.test   139.00                    140.00   0.7%
                                                                             test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/miniFE/miniFE.test   344.00                    346.00   0.6%
                                                                                        test-suite :: MultiSource/Benchmarks/FreeBench/pifft/pifft.test  1293.00                   1292.00  -0.1%
                                                                                test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test  5176.00                   5170.00  -0.1%
                                                                                        test-suite :: External/SPEC/CFP2006/453.povray/453.povray.test  5173.00                   5167.00  -0.1%
                                                                                test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 11692.00                  11660.00  -0.3%
                                                                                     test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test  1621.00                   1615.00  -0.4%
                                                                                             test-suite :: External/SPEC/CINT2006/403.gcc/403.gcc.test   795.00                    792.00  -0.4%
                                                                              test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 26499.00                  26338.00  -0.6%
                                                                                               test-suite :: MultiSource/Benchmarks/Bullet/bullet.test  7343.00                   7281.00  -0.8%
                                                                                          test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test  1104.00                   1094.00  -0.9%
                                                                                          test-suite :: MultiSource/Applications/JM/lencod/lencod.test  2216.00                   2180.00  -1.6%
                                                                                            test-suite :: External/SPEC/CFP2006/433.milc/433.milc.test   787.00                    637.00 -19.1%

Less 0% is better.
Most of the benchmarks see more vectorized code. The first ones just
have shuffles removed.

The ordering analysis still may require some improvements (e.g. for
alternate nodes), but this one should be produce better results.

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/77529
2024-02-22 14:32:15 -05:00
Philip Reames
0f1f82a92e [SLP] Regen a couple of tests to reduce future diff 2024-02-16 10:51:42 -08:00
Nikita Popov
2d69827c5c [Transforms] Convert tests to opaque pointers (NFC) 2024-02-05 11:57:34 +01:00
Nikita Popov
90ba33099c
[InstCombine] Canonicalize constant GEPs to i8 source element type (#68882)
This patch canonicalizes getelementptr instructions with constant
indices to use the `i8` source element type. This makes it easier for
optimizations to recognize that two GEPs are identical, because they
don't need to see past many different ways to express the same offset.

This is a first step towards
https://discourse.llvm.org/t/rfc-replacing-getelementptr-with-ptradd/68699.
This is limited to constant GEPs only for now, as they have a clear
canonical form, while we're not yet sure how exactly to deal with
variable indices.

The test llvm/test/Transforms/PhaseOrdering/switch_with_geps.ll gives
two representative examples of the kind of optimization improvement we
expect from this change. In the first test SimplifyCFG can now realize
that all switch branches are actually the same. In the second test it
can convert it into simple arithmetic. These are representative of
common optimization failures we see in Rust.

Fixes https://github.com/llvm/llvm-project/issues/69841.
2024-01-24 15:25:29 +01:00
Florian Hahn
3b3da7c7fb
[SLP] Add a set of tests with non-power-of-2 operations. 2024-01-11 16:47:38 +00:00
Alexey Bataev
e775ba384e [SLP][NFC]Add some extra values to avoid constant expressions in the test. 2024-01-02 11:23:10 -08:00
Nikita Popov
eecb99c5f6 [Tests] Add disjoint flag to some tests (NFC)
These tests rely on SCEV looking recognizing an "or" with no common
bits as an "add". Add the disjoint flag to relevant or instructions
in preparation for switching SCEV to use the flag instead of the
ValueTracking query. The IR with disjoint flag matches what
InstCombine would produce.
2023-12-05 14:09:36 +01:00
Craig Topper
7ec4f6094e
[InstCombine] Infer disjoint flag on Or instructions. (#72912)
The disjoint flag was recently added to IR in #72583

We already set it when we turn an add into an or. This patch sets it on Ors that weren't converted from an Add.
2023-12-02 14:11:12 -08:00
Alexey Bataev
279b1ea65f [SLP]Improve gathering of the scalars used in the graph.
Currently we emit gathers for scalars being vectorized in the tree as
a pair of extractelement/insertelement instructions. Instead we can try
to find all required vectors and emit shuffle vector instructions
directly, improving the code and reducing compile time.

Part of non-power-of-2 vectorization.

Differential Revision: https://reviews.llvm.org/D110978
2023-12-01 11:23:57 -08:00
Craig Topper
03d4a9d94d
[InstCombine] Set disjoint flag when turning Add into Or. (#72702)
The disjoint flag was recently added to IR in #72583
2023-11-27 12:54:11 -08:00
Alexey Bataev
b6f51787f6 [SLP]Fix signedness analysis for scalars in graph.
Cannot use the sign info for the roots for all scalars in the graph,
need to perform the analysis for each particular scalar (tree node).
2023-11-15 07:10:59 -08:00
Alexey Bataev
5adfad254e [SLP]Emit actual bitwidth for analyzed MinBitwidth nodes, NFCI.
SLP includes analysis for the minimum bitwidth, the actual integer
operations can be emitted. It allows to reduce register pressure and
improve perf. Currently, it includes only cost model and the next
transformation relies on InstructionCombiner. Better to do it directly
in SLP, it allows to reduce compile time and fix cost model issues.
2023-11-14 11:12:52 -08:00
Alexey Bataev
f2f3050476 Revert "[SLP]Emit actual bitwidth for analyzed MinBitwidth nodes, NFCI."
This reverts commit f6ae50f710d02d8553d28192a1f048b2a9e1fc4d to fix
a crash revealed in the internal testing.
2023-11-14 09:45:54 -08:00
Alexey Bataev
f6ae50f710 [SLP]Emit actual bitwidth for analyzed MinBitwidth nodes, NFCI.
SLP includes analysis for the minimum bitwidth, the actual integer
operations can be emitted. It allows to reduce register pressure and
improve perf. Currently, it includes only cost model and the next
transformation relies on InstructionCombiner. Better to do it directly
in SLP, it allows to reduce compile time and fix cost model issues.
2023-11-14 07:57:37 -08:00
Alexey Bataev
ac254fc055 [SLP]Improve tryToGatherExtractElements by using per-register analysis.
Currently tryToGatherExtractElements function analyzes the whole vector,
regrdless number of actual registers, used in this vector. It may
prevent some optimizations, because per-register analysis may allow to
simplify the final code by reusing more already emitted vectors and
better shuffles.

Differential Revision: https://reviews.llvm.org/D148855
2023-11-06 07:29:27 -08:00
Hans Wennborg
046c57e705 Revert "[SLP]Improve tryToGatherExtractElements by using per-register analysis."
This causes asserts:

  llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:10082:
  Value *llvm::slpvectorizer::BoUpSLP::ShuffleInstructionBuilder::adjustExtracts(
    const TreeEntry *, MutableArrayRef<int>, unsigned int, bool &):
  Assertion `Part == 0 && "Expected firs part."' failed.

See comment on the code review.

> Currently tryToGatherExtractElements function analyzes the whole vector,
> regrdless number of actual registers, used in this vector. It may
> prevent some optimizations, because per-register analysis may allow to
> simplify the final code by reusing more already emitted vectors and
> better shuffles.
>
> Differential Revision: https://reviews.llvm.org/D148855

This reverts commit 9dfdbd788707edc8c39eb2bff16004aba1f3586b.
2023-11-06 13:56:42 +01:00
Alexey Bataev
9dfdbd7887 [SLP]Improve tryToGatherExtractElements by using per-register analysis.
Currently tryToGatherExtractElements function analyzes the whole vector,
regrdless number of actual registers, used in this vector. It may
prevent some optimizations, because per-register analysis may allow to
simplify the final code by reusing more already emitted vectors and
better shuffles.

Differential Revision: https://reviews.llvm.org/D148855
2023-11-03 10:43:58 -07:00
Martin Storsjö
66152f4eed Revert "[SLP]Improve tryToGatherExtractElements by using per-register analysis."
This reverts commit 3e6d7c6d983dd5896e3a03857584654eb1360fda.

That commit caused miscompilation of ffmpeg's libavcodec/vp9dsp_8bpp.o
on aarch64; the file still compiles correctly, but no longer produces
the right result - see https://reviews.llvm.org/D148855#4655968
for details.
2023-11-03 00:08:17 +02:00
Alexey Bataev
3e6d7c6d98 [SLP]Improve tryToGatherExtractElements by using per-register analysis.
Currently tryToGatherExtractElements function analyzes the whole vector,
regrdless number of actual registers, used in this vector. It may
prevent some optimizations, because per-register analysis may allow to
simplify the final code by reusing more already emitted vectors and
better shuffles.

Differential Revision: https://reviews.llvm.org/D148855
2023-11-01 10:42:35 -07:00
Alexey Bataev
6e8d957a22 Revert "[SLP]Improve tryToGatherExtractElements by using per-register analysis."
This reverts commit 0a34aaedd8ec2dc2375076976c1327fdbfd7877f to fix
fails reported in https://lab.llvm.org/buildbot/#/builders/265/builds/40
2023-11-01 08:52:31 -07:00
Alexey Bataev
0a34aaedd8 [SLP]Improve tryToGatherExtractElements by using per-register analysis.
Currently tryToGatherExtractElements function analyzes the whole vector,
regrdless number of actual registers, used in this vector. It may
prevent some optimizations, because per-register analysis may allow to
simplify the final code by reusing more already emitted vectors and
better shuffles.

Differential Revision: https://reviews.llvm.org/D148855
2023-11-01 07:44:49 -07:00
Philip Reames
3f2ed812f0
[InstCombine] Infer nneg on zext when forming from non-negative sext (#70706)
Builds on #67982 which recently introduced the nneg flag on a zext
instruction. InstCombine is one of our largest canonicalizers of zext
from non-negative sext instructions, so set the flag there.
2023-10-30 12:09:43 -07:00
Philip Reames
89564f0b69 Regenerate a set of auto-update tests [nfc]
To reduce the spurious test delta in an upcoming change.
2023-10-30 11:36:43 -07:00
Antonio Frighetto
138e6c1c86 [AArch64][TTI] Improve LegalVF when gather loads are scalarized
After determining the cost of loads that could not be coalesced into
`VectorizedLoads` in SLP, computing the cost of a gather-vectorized
load is carried out. Favour a potentially high valid cost when the
type of a group of loads, whose type is a vector of size dependent
upon `VF`, may be legalized into a scalar value.

Fixes: https://github.com/llvm/llvm-project/issues/68953.
2023-10-27 20:22:54 +02:00
Alex Richardson
e39f6c1844 [opt] Infer DataLayout from triple if not specified
There are many tests that specify a target triple/CPU flags but no
DataLayout which can lead to IR being generated that has unusual
behaviour. This commit attempts to use the default DataLayout based
on the relevant flags if there is no explicit override on the command
line or in the IR file.

One thing that is not currently possible to differentiate from a missing
datalayout `target datalayout = ""` in the IR file since the current
APIs don't allow detecting this case. If it is considered useful to
support this case (instead of passing "-data-layout=" on the command
line), I can change IR parsers to track whether they have seen such a
directive and change the callback type.

Differential Revision: https://reviews.llvm.org/D141060
2023-10-26 12:07:37 -07:00