2162 Commits

Author SHA1 Message Date
Alexey Bataev
48a4b14cb6 [SLP]Fix whole vector registers calculations for compares
Need to check that the calculated number of the elements is not larger
than the original number of scalars to prevent a compiler crash.

Fixes #134013
2025-04-02 07:26:40 -07:00
Han-Kuan Chen
5bbcc765cc
[SLP][REVEC] getNumElements should not be used as VF when REVEC is enabled. (#134031) 2025-04-02 19:04:07 +08:00
Alexey Bataev
0e3049c562
[SLP]Support revectorization of the previously vectorized scalars
If the scalar instructions is marked for the vectorization in the tree,
it cannot be vectorized as part of the another node in the same tree, in
general. It may prevent some potentially profitable vectorization
opportunities, since some nodes end up being buildvector/gather nodes,
which add to the total cost.
Patch allows revectorization of the previously vectorized scalars.

Reviewers: hiraditya, RKSimon

Reviewed By: RKSimon, hiraditya

Pull Request: https://github.com/llvm/llvm-project/pull/133091
2025-04-01 14:30:06 -04:00
Alexey Bataev
cf6a452cc7
[SLP]Fix same/alternate analysis in split node analysis for compares
getSameOpcode in some cases may consider 2 compares as having same
opcode, even though previously they were considered as alternate. It may
happen, because getSameOpcode looses info about previous instructions
and their states. Need to use isAlternateInstruction function instead
for the correct analysis.

Reviewers: RKSimon, hiraditya

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/133769
2025-03-31 19:33:40 -04:00
Alexey Bataev
bfd8cc0a3e [SLP]Fix a check for the whole register use
Need to check the value type, not the return type, of the instructions,
when doing the analysis for the whole register use to prevent a compiler
crash.

Fixes #133751
2025-03-31 10:52:12 -07:00
Han-Kuan Chen
65734de9b9
[SLP] NFC. Remove the redundant MainOp and AltOp find process. (#133642) 2025-03-31 10:26:45 +08:00
Alexey Bataev
1bfc61064a [SLP]Fix spill cost analysis for split vectorized nodes
If the entry is SplitVectorize, it can be skipped in favor of its
operands, operands allow correctly detect spill costs.

Fixes #133288
2025-03-28 12:45:53 -07:00
Kazu Hirata
673f4705a8
[llvm] Use *Set::insert_range (NFC) (#133353)
We can use *Set::insert_range to collapse:

  for (auto Elem : Range)
    Set.insert(E.first);

down to:

  Set.insert_range(llvm::make_first_range(Range));

In some cases, we can further fold that into the set declaration.
2025-03-27 20:44:20 -07:00
Kazu Hirata
cde58bfc16
[Transforms] Use range constructors of *Set (NFC) (#133203) 2025-03-27 07:51:58 -07:00
Martin Storsjö
a2e5932e8b Revert "[SLP] Make getSameOpcode support interchangeable instructions. (#132887)"
This reverts commit 6e66cfeeaec6f09a4454400e45d690457ecdd3de.

This change causes crashes on compiling some inputs, see
https://github.com/llvm/llvm-project/pull/127450#issuecomment-2752833710
and
https://github.com/llvm/llvm-project/pull/127450#issuecomment-2753375326
for details.
2025-03-26 10:24:25 +02:00
Han-Kuan Chen
6e66cfeeae
[SLP] Make getSameOpcode support interchangeable instructions. (#132887)
We use the term "interchangeable instructions" to refer to different
operators that have the same meaning (e.g., `add x, 0` is equivalent to
`mul x, 1`).
Non-constant values are not supported, as they may incur high costs with
little benefit.

---------

Co-authored-by: Alexey Bataev <a.bataev@gmx.com>
2025-03-25 19:46:15 +08:00
Alexey Bataev
8122bb9dbe [SLP]Fix a check for non-schedulable instructions
Need to fix a check for non-schedulable instructions in
getLastInstructionInBundle function, because this check may not work
correctly during the codegen. Instead, need to check that actually these
instructions were never scheduled, since the scheduling analysis always
performed before the codegen and is stable.

Fixes #132841
2025-03-25 04:35:33 -07:00
Han-Kuan Chen
2682a9433b
[SLP][REVEC] Add ExtractSubvector cost for ExternalUses. (#132761)
For llvm/test/Transforms/SLPVectorizer/revec-shufflevector.ll,
ScalarCost and ExtraCost is 1, so the original scalar will be kept.
2025-03-25 18:58:54 +08:00
Martin Storsjö
b33bec9b21 Revert "[SLP] Make getSameOpcode support interchangeable instructions. (#127450)"
This reverts commit 71a0cfd93263552ddc0bfd2ea7b0abe9a578f87e.

This commit triggers failed asserts when compiling ffmpeg. The
issue is reproducible with a small standalone reproducer like this:

    void make_filters_from_proto(int *filter[][2], int bands) {
      int c, q, n;
      for (;; q++) {
        n = 0;
        for (; n < 7; n++) {
          int theta = (q * (n - 6) + (n >> 1) - 3) % bands;
          if (theta)
            c = theta;
          filter[q][n][0] = c;
        }
      }
    }

$ clang -target x86_64-linux-gnu -c repro.c -O3
clang: ../lib/Transforms/Vectorize/SLPVectorizer.cpp:989: llvm::SmallVector<llvm
::Value*> {anonymous}::BinOpSameOpcodeHelper::InterchangeableInfo::getOperand(ll
vm::Instruction*) const: Assertion `FromCIValue.isZero() && "Cannot convert the
instruction."' failed.

The same issue also reproduces for a large number of other target
triples, aarch64-linux-gnu and others.
2025-03-25 10:22:44 +02:00
Martin Storsjö
dd059338a2 Revert "[Vectorize] Fix a warning"
This reverts commit 4c68061254c896214b7ad5ab807ac4ba11517812.

Reverting as part of a revert of a preceding commit.
2025-03-25 10:21:05 +02:00
Kazu Hirata
4c68061254 [Vectorize] Fix a warning
This patch fixes:

  llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:855:52: error:
  unused variable 'SupportedOp' [-Werror,-Wunused-const-variable]
2025-03-24 17:38:47 -07:00
Han-Kuan Chen
71a0cfd932
[SLP] Make getSameOpcode support interchangeable instructions. (#127450)
We use the term "interchangeable instructions" to refer to different
operators that have the same meaning (e.g., `add x, 0` is equivalent to
`mul x, 1`).
Non-constant values are not supported, as they may incur high costs with
little benefit.

---------

Co-authored-by: Alexey Bataev <a.bataev@gmx.com>
2025-03-25 08:24:46 +08:00
Alexey Bataev
ad9909dd73
[SLP]Fix perfect diamond match with extractelements in scalars
Need to drop all previous estimations/vectorizations, when found
a perfect diamond match. This improves cost estimation and improves code
emission.
Also, need to adjust getScalarizationOverhead cost for non-poison input
vector. Currently, it does not allow to estimate it correctly, so
instead use conservative element-by-element insertelement cost for each
unique scalar.

Reviewers: RKSimon, hiraditya

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/132466
2025-03-24 09:29:18 -04:00
Alexey Bataev
3b0ec61156
[SLP][NFC] Redesign schedule bundle, separate from schedule data, NFC
That's the initial patch, intended to support revectorization of the
previously vectorized scalars. If the scalar is marked for the
vectorization, it becomes a part of the schedule bundle, used to check
dependencies and then schedule tree entry scalars into a single batch of
instructions. Unfortunately, currently this info is part of the
ScheduleData struct and it does not allow making scalars part of many
bundles. The patch separates schedule bundles from the ScheduleData,
introduces explicit class ScheduleBundle for bundles, allowing later to
extend it to support revectorization of the previously vectorized
scalars.

Reviewers: hiraditya, RKSimon

Reviewed By: RKSimon, hiraditya

Pull Request: https://github.com/llvm/llvm-project/pull/131625
2025-03-21 13:36:57 -04:00
Han-Kuan Chen
73558dc329
[SLP][REVEC] Fix getStoreMinimumVF only accept scalar types. (#132181)
Fix "Element type of a VectorType must " "be an integer, floating point,
or " "pointer type.".
2025-03-20 21:04:30 +08:00
Han-Kuan Chen
a5d4b50f93
[SLP] NFC. Change the inner loop and outer loop of appendOperandsOfVL. (#132152) 2025-03-20 20:32:20 +08:00
Han-Kuan Chen
c3e16337a4
[SLP][REVEC] Ignore UserTreeIndex if it is empty. (#131993)
Previously, the all_of check did not consider the case where the
TreeEntry is empty (i.e., when it is the first entry).
2025-03-20 11:31:49 +08:00
Kazu Hirata
0dcc201ac4
[Transforms] Use *Set::insert_range (NFC) (#132056)
DenseSet, SmallPtrSet, SmallSet, SetVector, and StringSet recently
gained C++23-style insert_range.  This patch replaces:

  Dest.insert(Src.begin(), Src.end());

with:

  Dest.insert_range(Src);

This patch does not touch custom begin like succ_begin for now.
2025-03-19 15:35:01 -07:00
Longsheng Mou
f3f7f08eca
[SLP] Fix Wsign-compare warning (NFC) (#131948)
llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:4805:57:
warning: comparison of integer expressions of different signedness:
‘int’ and ‘std::size_t’ {aka ‘long unsigned int’} [-Wsign-compare]
    [](const auto &P) { return P.value() % 2 != P.index() % 2; }))
                               ~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~
2025-03-19 17:01:42 +08:00
Alexey Bataev
45090b3059 [SLP]Check the whole def-use chain in the tree to find proper dominance, if the last instruction is the same
If the insertion point (last instruction) of the user nodes is the same,
need to check the whole def-use chain in the tree to find proper
dominance to prevent a compiler crash.

Fixes #131818
2025-03-18 10:01:13 -07:00
Jeffrey Byrnes
4336e5edbc
[SLP] Sort PHIs by ExtractElements when relevant (#131229)
Considering the PHIs in order of element extracted can lead to better shuffles.
2025-03-17 14:19:46 -07:00
Alexey Bataev
ead9d6a56d [SLP]Check VectorizableTree is not empty before accessing elements
Need to check VectorizableTree is not empty before accessing elements.

Fixes #131635
2025-03-17 11:04:38 -07:00
Alexey Bataev
fbf0276b6a [SLP] Reorder reuses mask, if it is not empty, for subvector operands
If the subvector operands has reuses mask, need to reorder the mask, not
the scalars, to prevent compiler crash due to mask/scalars size
mismatch.

Fixes #131360
2025-03-14 14:11:09 -07:00
Alexey Bataev
605a9f590d [SLP]Check if user node is same as other node and check operand order
Need to check if the user node is same as other node and check operand
order to prevent a compiler crash when trying to find matching gather
node with user nodes, having the same last instruction.

Fixes #131195
2025-03-14 13:46:07 -07:00
Alexey Bataev
9c86198caf [SLP] Update vector value for incoming phi node, beeing vectorized already
If the phi node contains multiple same incoming blocks/values, need to
update the corresponding vectorized value, if it is not going to be
vectorized, if the incoming value was vectorized already.

Fixes #131355
2025-03-14 12:53:56 -07:00
Alexey Bataev
bbd1bb4057 [SLP]Set insert point for split node with non-scheulable instructions after the last instruction
Need to set the insert point for non-schedulable instructions in
SplitVectorize node after the last instruction, not before, to avoid
a crash in case of buildvector subvector node.
2025-03-14 07:04:55 -07:00
Alexey Bataev
202137dbea [SLP]Fix a crash on matching gather operands of phi nodes in loops
If the gather operands in phi nodes are matching and phi nodes may build
up a loop, it may cause a compiler crash with the incorrect def-use
chain. Patch fixes this crash.
2025-03-12 14:46:00 -07:00
Alexey Bataev
10085390c6 [SLP]Reduce number of alternate instruction, where possible
Previous version was reviewed here https://github.com/llvm/llvm-project/pull/123360
It is mostly the same, adjusted after graph-to-tree transformation

Patch tries to remove wide alternate operations.
Currently SLP vectorizer emits something like this:
```
%0 = add i32
%1 = sub i32
%2 = add i32
%3 = sub i32
%4 = add i32
%5 = sub i32
%6 = add i32
%7 = sub i32

transformes to

%v1 = add <8 x i32>
%v2 = sub <8 x i32>
%res = shuffle %v1, %v2, <0, 9, 2, 11, 4, 13, 6, 15>
```
i.e. half of the results are just unused. This leads to increased
register pressure and potentially doubles number of operations.

Patch introduces SplitVectorize mode, where it splits the operations by
opcodes and produces instead something like this:
```
%v1 = add <4 x i32>
%v2 = sub <4 x i32>
%res = shuffle %v1, %v2, <0, 4, 1, 5, 2, 6, 3, 7>
```
It allows to improve the performance by reducing number of ops. Also, it
turns on some other improvements, like improved graph reordering.

-O3+LTO, AVX512
Metric: size..text
Program                                                                         size..text
                                                                                results     results0    diff
                       test-suite :: MultiSource/Benchmarks/Olden/tsp/tsp.test     2788.00     2820.00  1.1%
test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test   278168.00   280904.00  1.0%
               test-suite :: MultiSource/Benchmarks/FreeBench/pifft/pifft.test    82682.00    83258.00  0.7%
                    test-suite :: External/SPEC/CFP2006/433.milc/433.milc.test   139344.00   139712.00  0.3%
     test-suite :: MultiSource/Benchmarks/FreeBench/fourinarow/fourinarow.test    27149.00    27197.00  0.2%
               test-suite :: MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test  1008188.00  1009948.00  0.2%
          test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test    39226.00    39290.00  0.2%
   test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test    39229.00    39293.00  0.2%
 test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test  2074533.00  2076549.00  0.1%
test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test  2074533.00  2076549.00  0.1%
             test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test   798440.00   798952.00  0.1%
     test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test    44123.00    44139.00  0.0%
                       test-suite :: MultiSource/Benchmarks/Bullet/bullet.test   318942.00   319038.00  0.0%
        test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test  1159880.00  1160152.00  0.0%
     test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniAMR/miniAMR.test    73595.00    73611.00  0.0%
                test-suite :: External/SPEC/CFP2006/453.povray/453.povray.test  1146124.00  1146348.00  0.0%
       test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/CLAMR/CLAMR.test   203831.00   203847.00  0.0%
 test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test   207662.00   207678.00  0.0%
                test-suite :: External/SPEC/CFP2006/447.dealII/447.dealII.test   589851.00   589883.00  0.0%
      test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test  1398543.00  1398559.00  0.0%
     test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test  1398543.00  1398559.00  0.0%
        test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test  2050990.00  2051006.00  0.0%

      test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12559687.00 12559591.00 -0.0%
                     test-suite :: External/SPEC/CINT2006/403.gcc/403.gcc.test  3074157.00  3074125.00 -0.0%
         test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test  1092252.00  1092188.00 -0.0%
            test-suite :: External/SPEC/CFP2017rate/508.namd_r/508.namd_r.test   779763.00   779715.00 -0.0%
         test-suite :: MultiSource/Benchmarks/ASCI_Purple/SMG2000/smg2000.test   253517.00   253485.00 -0.0%
                  test-suite :: MultiSource/Applications/JM/lencod/lencod.test   848259.00   848035.00 -0.0%
     test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/miniFE/miniFE.test    93064.00    93016.00 -0.1%
                  test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test   383747.00   383475.00 -0.1%
          test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test   673051.00   662907.00 -1.5%
           test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test   673051.00   662907.00 -1.5%

Olden/tsp - small variations
Prolangs-C/TimberWolfMC - small variations, some code not inlined
FreeBench/pifft - extra store <8 x double> vectorized, some other extra
vectorizations
CFP2006/433.milc - better vector code
FreeBench/fourinarow - better vector code
Benchmarks/tramp3d-v4 - extra vector code, small variations
mediabench/gsm/toast - small variations
MiBench/telecomm-gsm - small variations
CINT2017rate/500.perlbench_r
CINT2017speed/600.perlbench_s - better vector code, small variations
CINT2006/464.h264ref - some smaller code + changes similar to x264
DOE-ProxyApps-C/miniGMG - small variations
Benchmarks/Bullet - small variations
CFP2017rate/511.povray_r - small variations
DOE-ProxyApps-C/miniAMR - small variations
CFP2006/453.povray - small variations
DOE-ProxyApps-C++/CLAMR - small variations
MiBench/consumer-lame - small variations
CFP2006/447.dealII - small variations
CFP2017rate/538.imagick_r
CFP2017speed/638.imagick_s - small variations
CFP2017rate/510.parest_r - better vector code, small variations
CFP2017rate/526.blender_r - small variations
CINT2006/403.gcc - small variations
CINT2006/400.perlbench - small variations
CFP2017rate/508.namd_r - small variations
ASCI_Purple/SMG2000 - small variations
JM/lencod - extra store <16 x i32>, small variations
DOE-ProxyApps-C++/miniFE - small variations
JM/ldecod - extra vector code, small variations, less shuffles
CINT2017speed/625.x264_s
CINT2017rate/525.x264_r - the number of instructions increased, but
looks like they are more performant. E.g., for function
x264_pixel_satd_8x8, llvm-mca reports better throughput - 84 for the
current version and 59 for the new version.

-O3+LTO, mcpu=sifive-p470

Metric: size..text

                                                                               results    results0   diff
                                 test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test  580768.00  581118.00   0.1%
                                        test-suite :: MultiSource/Applications/d/make_dparser.test   78854.00   78894.00   0.1%
                                      test-suite :: MultiSource/Applications/JM/lencod/lencod.test  633448.00  633750.00   0.0%
                                           test-suite :: MultiSource/Benchmarks/Bullet/bullet.test  277002.00  277080.00   0.0%
                             test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test  931938.00  931960.00   0.0%
                                         test-suite :: External/SPEC/CINT2006/403.gcc/403.gcc.test 2512806.00 2512822.00   0.0%
                                test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test 7659880.00 7659876.00  -0.0%
                                 test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test 7659880.00 7659876.00  -0.0%
                            test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 1602448.00 1602434.00  -0.0%
                          test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 9496664.00 9496542.00  -0.0%
                     test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test  147424.00  147422.00  -0.0%
                    test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test 1764608.00 1764578.00  -0.0%
                     test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test 1764608.00 1764578.00  -0.0%
                                     test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test  841656.00  841632.00  -0.0%
                                    test-suite :: External/SPEC/CFP2006/453.povray/453.povray.test  949026.00  948962.00  -0.0%
                            test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test  946348.00  946284.00  -0.0%
                                      test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test  279794.00  279764.00  -0.0%
                       test-suite :: MultiSource/Benchmarks/mediabench/g721/g721encode/encode.test    4776.00    4772.00  -0.1%
                              test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test   25074.00   25028.00  -0.2%
                       test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test   25074.00   25028.00  -0.2%
                         test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test   29336.00   29184.00  -0.5%
                               test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test  535390.00  510124.00  -4.7%
                              test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test  535390.00  510124.00  -4.7%
test-suite :: SingleSource/Regression/C/gcc-c-torture/execute/ieee/GCC-C-execute-ieee-pr50310.test     886.00     608.00 -31.4%

CINT2006/464.h264ref - extra v16i32 reduction
d/make_dparser - better vector code
JM/lencod - extra v16i32 reduction
Benchmarks/Bullet - smaller vector code
CINT2006/400.perlbench - better vector code
CINT2006/403.gcc - small variations
CINT2017speed/602.gcc_s
CINT2017rate/502.gcc_r - small variations
CFP2017rate/510.parest_r - small variations
CFP2017rate/526.blender_r - small variations
MiBench/consumer-lame - small variations
CINT2017speed/600.perlbench_s
CINT2017rate/500.perlbench_r - small variations
Benchmarks/7zip - small variations
CFP2017rate/511.povray_r - small variations
JM/ldecod - extra vector code
mediabench/g721/g721encode - extra vector code
mediabench/gsm - extra vector code
MiBench/telecomm-gsm - extra vector code
DOE-ProxyApps-C/miniGMG - extra vector code
CINT2017rate/525.x264_r
CINT2017speed/625.x264_s - reduced number of wide operations and
shuffles, saving the registers, similar to X86, extra code in
pixel_hadamard_ac vectorized
ieee/GCC-C-execute-ieee-pr50310 - extra code vectorized

CINT2006/464.h264ref - extra vector code in find_sad_16x16
JM/lencod - extra vector code in find_sad_16x16
d/make_dparser - smaller vector code
Benchmarks/Bullet - small variations
CINT2006/400.perlbench - smaller vector code
CFP2017rate/526.blender_r - small variations, extra store <8 x float> in
the loop, extra store <8 x i8> in loop
CINT2017rate/500.perlbench_r
CINT2017speed/600.perlbench_s - small variations
MiBench/consumer-lame - small variations
JM/ldecod - extra vector code
mediabench/g721/g721encode - small variations

Reviewers: hiraditya

Reviewed By: hiraditya

Pull Request: https://github.com/llvm/llvm-project/pull/128907
2025-03-12 08:18:51 -07:00
Hans Wennborg
5ec884e5d8 Revert "[SLP]Reduce number of alternate instruction, where possible"
This caused assertion failures:

  llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:16237:
  Value *llvm::slpvectorizer::BoUpSLP::vectorizeTree(TreeEntry *):
  Assertion `OpTE1.isSame( ArrayRef(E->Scalars).take_front(OpTE1.getVectorFactor())) && "Expected same first part of scalars."' failed.

See comment on the PR.

> Previous version was reviewed here https://github.com/llvm/llvm-project/pull/123360
> It is mostly the same, adjusted after graph-to-tree transformation

This reverts commit 7de895ff1146c17ec78877900c01c09f4140e692.
2025-03-12 11:16:02 +01:00
Alexey Bataev
7de895ff11 [SLP]Reduce number of alternate instruction, where possible
Previous version was reviewed here https://github.com/llvm/llvm-project/pull/123360
It is mostly the same, adjusted after graph-to-tree transformation

Patch tries to remove wide alternate operations.
Currently SLP vectorizer emits something like this:
```
%0 = add i32
%1 = sub i32
%2 = add i32
%3 = sub i32
%4 = add i32
%5 = sub i32
%6 = add i32
%7 = sub i32

transformes to

%v1 = add <8 x i32>
%v2 = sub <8 x i32>
%res = shuffle %v1, %v2, <0, 9, 2, 11, 4, 13, 6, 15>
```
i.e. half of the results are just unused. This leads to increased
register pressure and potentially doubles number of operations.

Patch introduces SplitVectorize mode, where it splits the operations by
opcodes and produces instead something like this:
```
%v1 = add <4 x i32>
%v2 = sub <4 x i32>
%res = shuffle %v1, %v2, <0, 4, 1, 5, 2, 6, 3, 7>
```
It allows to improve the performance by reducing number of ops. Also, it
turns on some other improvements, like improved graph reordering.

-O3+LTO, AVX512
Metric: size..text
Program                                                                         size..text
                                                                                results     results0    diff
                       test-suite :: MultiSource/Benchmarks/Olden/tsp/tsp.test     2788.00     2820.00  1.1%
test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test   278168.00   280904.00  1.0%
               test-suite :: MultiSource/Benchmarks/FreeBench/pifft/pifft.test    82682.00    83258.00  0.7%
                    test-suite :: External/SPEC/CFP2006/433.milc/433.milc.test   139344.00   139712.00  0.3%
     test-suite :: MultiSource/Benchmarks/FreeBench/fourinarow/fourinarow.test    27149.00    27197.00  0.2%
               test-suite :: MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test  1008188.00  1009948.00  0.2%
          test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test    39226.00    39290.00  0.2%
   test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test    39229.00    39293.00  0.2%
 test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test  2074533.00  2076549.00  0.1%
test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test  2074533.00  2076549.00  0.1%
             test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test   798440.00   798952.00  0.1%
     test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test    44123.00    44139.00  0.0%
                       test-suite :: MultiSource/Benchmarks/Bullet/bullet.test   318942.00   319038.00  0.0%
        test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test  1159880.00  1160152.00  0.0%
     test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniAMR/miniAMR.test    73595.00    73611.00  0.0%
                test-suite :: External/SPEC/CFP2006/453.povray/453.povray.test  1146124.00  1146348.00  0.0%
       test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/CLAMR/CLAMR.test   203831.00   203847.00  0.0%
 test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test   207662.00   207678.00  0.0%
                test-suite :: External/SPEC/CFP2006/447.dealII/447.dealII.test   589851.00   589883.00  0.0%
      test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test  1398543.00  1398559.00  0.0%
     test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test  1398543.00  1398559.00  0.0%
        test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test  2050990.00  2051006.00  0.0%

      test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12559687.00 12559591.00 -0.0%
                     test-suite :: External/SPEC/CINT2006/403.gcc/403.gcc.test  3074157.00  3074125.00 -0.0%
         test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test  1092252.00  1092188.00 -0.0%
            test-suite :: External/SPEC/CFP2017rate/508.namd_r/508.namd_r.test   779763.00   779715.00 -0.0%
         test-suite :: MultiSource/Benchmarks/ASCI_Purple/SMG2000/smg2000.test   253517.00   253485.00 -0.0%
                  test-suite :: MultiSource/Applications/JM/lencod/lencod.test   848259.00   848035.00 -0.0%
     test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/miniFE/miniFE.test    93064.00    93016.00 -0.1%
                  test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test   383747.00   383475.00 -0.1%
          test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test   673051.00   662907.00 -1.5%
           test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test   673051.00   662907.00 -1.5%

Olden/tsp - small variations
Prolangs-C/TimberWolfMC - small variations, some code not inlined
FreeBench/pifft - extra store <8 x double> vectorized, some other extra
vectorizations
CFP2006/433.milc - better vector code
FreeBench/fourinarow - better vector code
Benchmarks/tramp3d-v4 - extra vector code, small variations
mediabench/gsm/toast - small variations
MiBench/telecomm-gsm - small variations
CINT2017rate/500.perlbench_r
CINT2017speed/600.perlbench_s - better vector code, small variations
CINT2006/464.h264ref - some smaller code + changes similar to x264
DOE-ProxyApps-C/miniGMG - small variations
Benchmarks/Bullet - small variations
CFP2017rate/511.povray_r - small variations
DOE-ProxyApps-C/miniAMR - small variations
CFP2006/453.povray - small variations
DOE-ProxyApps-C++/CLAMR - small variations
MiBench/consumer-lame - small variations
CFP2006/447.dealII - small variations
CFP2017rate/538.imagick_r
CFP2017speed/638.imagick_s - small variations
CFP2017rate/510.parest_r - better vector code, small variations
CFP2017rate/526.blender_r - small variations
CINT2006/403.gcc - small variations
CINT2006/400.perlbench - small variations
CFP2017rate/508.namd_r - small variations
ASCI_Purple/SMG2000 - small variations
JM/lencod - extra store <16 x i32>, small variations
DOE-ProxyApps-C++/miniFE - small variations
JM/ldecod - extra vector code, small variations, less shuffles
CINT2017speed/625.x264_s
CINT2017rate/525.x264_r - the number of instructions increased, but
looks like they are more performant. E.g., for function
x264_pixel_satd_8x8, llvm-mca reports better throughput - 84 for the
current version and 59 for the new version.

-O3+LTO, mcpu=sifive-p470

Metric: size..text

                                                                               results    results0   diff
                                 test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test  580768.00  581118.00   0.1%
                                        test-suite :: MultiSource/Applications/d/make_dparser.test   78854.00   78894.00   0.1%
                                      test-suite :: MultiSource/Applications/JM/lencod/lencod.test  633448.00  633750.00   0.0%
                                           test-suite :: MultiSource/Benchmarks/Bullet/bullet.test  277002.00  277080.00   0.0%
                             test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test  931938.00  931960.00   0.0%
                                         test-suite :: External/SPEC/CINT2006/403.gcc/403.gcc.test 2512806.00 2512822.00   0.0%
                                test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test 7659880.00 7659876.00  -0.0%
                                 test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test 7659880.00 7659876.00  -0.0%
                            test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 1602448.00 1602434.00  -0.0%
                          test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 9496664.00 9496542.00  -0.0%
                     test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test  147424.00  147422.00  -0.0%
                    test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test 1764608.00 1764578.00  -0.0%
                     test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test 1764608.00 1764578.00  -0.0%
                                     test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test  841656.00  841632.00  -0.0%
                                    test-suite :: External/SPEC/CFP2006/453.povray/453.povray.test  949026.00  948962.00  -0.0%
                            test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test  946348.00  946284.00  -0.0%
                                      test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test  279794.00  279764.00  -0.0%
                       test-suite :: MultiSource/Benchmarks/mediabench/g721/g721encode/encode.test    4776.00    4772.00  -0.1%
                              test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test   25074.00   25028.00  -0.2%
                       test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test   25074.00   25028.00  -0.2%
                         test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test   29336.00   29184.00  -0.5%
                               test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test  535390.00  510124.00  -4.7%
                              test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test  535390.00  510124.00  -4.7%
test-suite :: SingleSource/Regression/C/gcc-c-torture/execute/ieee/GCC-C-execute-ieee-pr50310.test     886.00     608.00 -31.4%

CINT2006/464.h264ref - extra v16i32 reduction
d/make_dparser - better vector code
JM/lencod - extra v16i32 reduction
Benchmarks/Bullet - smaller vector code
CINT2006/400.perlbench - better vector code
CINT2006/403.gcc - small variations
CINT2017speed/602.gcc_s
CINT2017rate/502.gcc_r - small variations
CFP2017rate/510.parest_r - small variations
CFP2017rate/526.blender_r - small variations
MiBench/consumer-lame - small variations
CINT2017speed/600.perlbench_s
CINT2017rate/500.perlbench_r - small variations
Benchmarks/7zip - small variations
CFP2017rate/511.povray_r - small variations
JM/ldecod - extra vector code
mediabench/g721/g721encode - extra vector code
mediabench/gsm - extra vector code
MiBench/telecomm-gsm - extra vector code
DOE-ProxyApps-C/miniGMG - extra vector code
CINT2017rate/525.x264_r
CINT2017speed/625.x264_s - reduced number of wide operations and
shuffles, saving the registers, similar to X86, extra code in
pixel_hadamard_ac vectorized
ieee/GCC-C-execute-ieee-pr50310 - extra code vectorized

CINT2006/464.h264ref - extra vector code in find_sad_16x16
JM/lencod - extra vector code in find_sad_16x16
d/make_dparser - smaller vector code
Benchmarks/Bullet - small variations
CINT2006/400.perlbench - smaller vector code
CFP2017rate/526.blender_r - small variations, extra store <8 x float> in
the loop, extra store <8 x i8> in loop
CINT2017rate/500.perlbench_r
CINT2017speed/600.perlbench_s - small variations
MiBench/consumer-lame - small variations
JM/ldecod - extra vector code
mediabench/g721/g721encode - small variations

Reviewers: hiraditya

Reviewed By: hiraditya

Pull Request: https://github.com/llvm/llvm-project/pull/128907
2025-03-11 11:40:28 -07:00
Hans Wennborg
e858b10917 Revert "[SLP]Reduce number of alternate instruction, where possible"
This caused failures such as:

  Instruction does not dominate all uses!
  %29 = insertelement <8 x i64> %28, i64 %xor6.i.5, i64 6
  %17 = shufflevector <8 x i64> %29, <8 x i64> poison, <6 x i32> <i32 1, i32 2, i32 3, i32 4, i32 5, i32 6>

see comment on https://github.com/llvm/llvm-project/pull/123360

> Previous version was reviewed here https://github.com/llvm/llvm-project/pull/123360
> It is mostly the same, adjusted after graph-to-tree transformation
>
> Patch tries to remove wide alternate operations.
> Currently SLP vectorizer emits something like this:
> ```
> %0 = add i32
> %1 = sub i32
> %2 = add i32
> %3 = sub i32
> %4 = add i32
> %5 = sub i32
> %6 = add i32
> %7 = sub i32
>
> transformes to
>
> %v1 = add <8 x i32>
> %v2 = sub <8 x i32>
> %res = shuffle %v1, %v2, <0, 9, 2, 11, 4, 13, 6, 15>
> ```
> i.e. half of the results are just unused. This leads to increased
> register pressure and potentially doubles number of operations.
>
> Patch introduces SplitVectorize mode, where it splits the operations by
> opcodes and produces instead something like this:
> ```
> %v1 = add <4 x i32>
> %v2 = sub <4 x i32>
> %res = shuffle %v1, %v2, <0, 4, 1, 5, 2, 6, 3, 7>
> ```
> It allows to improve the performance by reducing number of ops. Also, it
> turns on some other improvements, like improved graph reordering.
>
> [...]

This reverts commit 9d37e61fc77d3d6de891c30630f1c0227522031d as well as
the follow-up commit 72bb0a9a9c6fdde43e1e191f2dc0d5d2d46aff4e.
2025-03-11 15:04:36 +01:00
Alexey Bataev
9d37e61fc7
[SLP]Reduce number of alternate instruction, where possible
Previous version was reviewed here https://github.com/llvm/llvm-project/pull/123360
It is mostly the same, adjusted after graph-to-tree transformation

Patch tries to remove wide alternate operations.
Currently SLP vectorizer emits something like this:
```
%0 = add i32
%1 = sub i32
%2 = add i32
%3 = sub i32
%4 = add i32
%5 = sub i32
%6 = add i32
%7 = sub i32

transformes to

%v1 = add <8 x i32>
%v2 = sub <8 x i32>
%res = shuffle %v1, %v2, <0, 9, 2, 11, 4, 13, 6, 15>
```
i.e. half of the results are just unused. This leads to increased
register pressure and potentially doubles number of operations.

Patch introduces SplitVectorize mode, where it splits the operations by
opcodes and produces instead something like this:
```
%v1 = add <4 x i32>
%v2 = sub <4 x i32>
%res = shuffle %v1, %v2, <0, 4, 1, 5, 2, 6, 3, 7>
```
It allows to improve the performance by reducing number of ops. Also, it
turns on some other improvements, like improved graph reordering.

-O3+LTO, AVX512
Metric: size..text
Program                                                                         size..text
                                                                                results     results0    diff
                       test-suite :: MultiSource/Benchmarks/Olden/tsp/tsp.test     2788.00     2820.00  1.1%
test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test   278168.00   280904.00  1.0%
               test-suite :: MultiSource/Benchmarks/FreeBench/pifft/pifft.test    82682.00    83258.00  0.7%
                    test-suite :: External/SPEC/CFP2006/433.milc/433.milc.test   139344.00   139712.00  0.3%
     test-suite :: MultiSource/Benchmarks/FreeBench/fourinarow/fourinarow.test    27149.00    27197.00  0.2%
               test-suite :: MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test  1008188.00  1009948.00  0.2%
          test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test    39226.00    39290.00  0.2%
   test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test    39229.00    39293.00  0.2%
 test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test  2074533.00  2076549.00  0.1%
test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test  2074533.00  2076549.00  0.1%
             test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test   798440.00   798952.00  0.1%
     test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test    44123.00    44139.00  0.0%
                       test-suite :: MultiSource/Benchmarks/Bullet/bullet.test   318942.00   319038.00  0.0%
        test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test  1159880.00  1160152.00  0.0%
     test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniAMR/miniAMR.test    73595.00    73611.00  0.0%
                test-suite :: External/SPEC/CFP2006/453.povray/453.povray.test  1146124.00  1146348.00  0.0%
       test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/CLAMR/CLAMR.test   203831.00   203847.00  0.0%
 test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test   207662.00   207678.00  0.0%
                test-suite :: External/SPEC/CFP2006/447.dealII/447.dealII.test   589851.00   589883.00  0.0%
      test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test  1398543.00  1398559.00  0.0%
     test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test  1398543.00  1398559.00  0.0%
        test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test  2050990.00  2051006.00  0.0%

      test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12559687.00 12559591.00 -0.0%
                     test-suite :: External/SPEC/CINT2006/403.gcc/403.gcc.test  3074157.00  3074125.00 -0.0%
         test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test  1092252.00  1092188.00 -0.0%
            test-suite :: External/SPEC/CFP2017rate/508.namd_r/508.namd_r.test   779763.00   779715.00 -0.0%
         test-suite :: MultiSource/Benchmarks/ASCI_Purple/SMG2000/smg2000.test   253517.00   253485.00 -0.0%
                  test-suite :: MultiSource/Applications/JM/lencod/lencod.test   848259.00   848035.00 -0.0%
     test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/miniFE/miniFE.test    93064.00    93016.00 -0.1%
                  test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test   383747.00   383475.00 -0.1%
          test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test   673051.00   662907.00 -1.5%
           test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test   673051.00   662907.00 -1.5%

Olden/tsp - small variations
Prolangs-C/TimberWolfMC - small variations, some code not inlined
FreeBench/pifft - extra store <8 x double> vectorized, some other extra
vectorizations
CFP2006/433.milc - better vector code
FreeBench/fourinarow - better vector code
Benchmarks/tramp3d-v4 - extra vector code, small variations
mediabench/gsm/toast - small variations
MiBench/telecomm-gsm - small variations
CINT2017rate/500.perlbench_r
CINT2017speed/600.perlbench_s - better vector code, small variations
CINT2006/464.h264ref - some smaller code + changes similar to x264
DOE-ProxyApps-C/miniGMG - small variations
Benchmarks/Bullet - small variations
CFP2017rate/511.povray_r - small variations
DOE-ProxyApps-C/miniAMR - small variations
CFP2006/453.povray - small variations
DOE-ProxyApps-C++/CLAMR - small variations
MiBench/consumer-lame - small variations
CFP2006/447.dealII - small variations
CFP2017rate/538.imagick_r
CFP2017speed/638.imagick_s - small variations
CFP2017rate/510.parest_r - better vector code, small variations
CFP2017rate/526.blender_r - small variations
CINT2006/403.gcc - small variations
CINT2006/400.perlbench - small variations
CFP2017rate/508.namd_r - small variations
ASCI_Purple/SMG2000 - small variations
JM/lencod - extra store <16 x i32>, small variations
DOE-ProxyApps-C++/miniFE - small variations
JM/ldecod - extra vector code, small variations, less shuffles
CINT2017speed/625.x264_s
CINT2017rate/525.x264_r - the number of instructions increased, but
looks like they are more performant. E.g., for function
x264_pixel_satd_8x8, llvm-mca reports better throughput - 84 for the
current version and 59 for the new version.

-O3+LTO, mcpu=sifive-p470

Metric: size..text

                                                                               results    results0   diff
                                 test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test  580768.00  581118.00   0.1%
                                        test-suite :: MultiSource/Applications/d/make_dparser.test   78854.00   78894.00   0.1%
                                      test-suite :: MultiSource/Applications/JM/lencod/lencod.test  633448.00  633750.00   0.0%
                                           test-suite :: MultiSource/Benchmarks/Bullet/bullet.test  277002.00  277080.00   0.0%
                             test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test  931938.00  931960.00   0.0%
                                         test-suite :: External/SPEC/CINT2006/403.gcc/403.gcc.test 2512806.00 2512822.00   0.0%
                                test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test 7659880.00 7659876.00  -0.0%
                                 test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test 7659880.00 7659876.00  -0.0%
                            test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 1602448.00 1602434.00  -0.0%
                          test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 9496664.00 9496542.00  -0.0%
                     test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test  147424.00  147422.00  -0.0%
                    test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test 1764608.00 1764578.00  -0.0%
                     test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test 1764608.00 1764578.00  -0.0%
                                     test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test  841656.00  841632.00  -0.0%
                                    test-suite :: External/SPEC/CFP2006/453.povray/453.povray.test  949026.00  948962.00  -0.0%
                            test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test  946348.00  946284.00  -0.0%
                                      test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test  279794.00  279764.00  -0.0%
                       test-suite :: MultiSource/Benchmarks/mediabench/g721/g721encode/encode.test    4776.00    4772.00  -0.1%
                              test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test   25074.00   25028.00  -0.2%
                       test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test   25074.00   25028.00  -0.2%
                         test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test   29336.00   29184.00  -0.5%
                               test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test  535390.00  510124.00  -4.7%
                              test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test  535390.00  510124.00  -4.7%
test-suite :: SingleSource/Regression/C/gcc-c-torture/execute/ieee/GCC-C-execute-ieee-pr50310.test     886.00     608.00 -31.4%

CINT2006/464.h264ref - extra v16i32 reduction
d/make_dparser - better vector code
JM/lencod - extra v16i32 reduction
Benchmarks/Bullet - smaller vector code
CINT2006/400.perlbench - better vector code
CINT2006/403.gcc - small variations
CINT2017speed/602.gcc_s
CINT2017rate/502.gcc_r - small variations
CFP2017rate/510.parest_r - small variations
CFP2017rate/526.blender_r - small variations
MiBench/consumer-lame - small variations
CINT2017speed/600.perlbench_s
CINT2017rate/500.perlbench_r - small variations
Benchmarks/7zip - small variations
CFP2017rate/511.povray_r - small variations
JM/ldecod - extra vector code
mediabench/g721/g721encode - extra vector code
mediabench/gsm - extra vector code
MiBench/telecomm-gsm - extra vector code
DOE-ProxyApps-C/miniGMG - extra vector code
CINT2017rate/525.x264_r
CINT2017speed/625.x264_s - reduced number of wide operations and
shuffles, saving the registers, similar to X86, extra code in
pixel_hadamard_ac vectorized
ieee/GCC-C-execute-ieee-pr50310 - extra code vectorized

CINT2006/464.h264ref - extra vector code in find_sad_16x16
JM/lencod - extra vector code in find_sad_16x16
d/make_dparser - smaller vector code
Benchmarks/Bullet - small variations
CINT2006/400.perlbench - smaller vector code
CFP2017rate/526.blender_r - small variations, extra store <8 x float> in
the loop, extra store <8 x i8> in loop
CINT2017rate/500.perlbench_r
CINT2017speed/600.perlbench_s - small variations
MiBench/consumer-lame - small variations
JM/ldecod - extra vector code
mediabench/g721/g721encode - small variations

Reviewers: hiraditya

Reviewed By: hiraditya

Pull Request: https://github.com/llvm/llvm-project/pull/128907
2025-03-10 10:06:39 -04:00
Alexey Bataev
4959025bbc [SLP]Fix non-determinism in reused elements analysis
Need to use consistent storages for unique elements, when going to
iterate over them to avoid non-determinism in reused elements analysis.

Fixes #130082
2025-03-06 10:12:49 -08:00
Alexey Bataev
31845cf06c Revert "[SLP]Fix non-determinism in reused elements analysis"
This reverts commit 3158525afdc3677457712963ef45c83f4f8f900f to fix
a bug revealed in https://lab.llvm.org/buildbot/#/builders/123/builds/14930
2025-03-06 08:59:08 -08:00
Alexey Bataev
3158525afd [SLP]Fix non-determinism in reused elements analysis
Need to use consistent storages for unique elements, when going to
iterate over them to avoid non-determinism in reused elements analysis.

Fixes #130082
2025-03-06 08:51:31 -08:00
Alexey Bataev
1182be503d [SLP]Fix a crash for buildvector nodes with parent phi nodes with same incoming blocks
If trying to find matching buildvector node for another nodes, and both
nodes are used by vectorized phi nodes and are coming from the same
parent block, this nodes should be considered matched to avoid a crash.
2025-03-06 07:42:43 -08:00
Alexey Bataev
855178af99
[SLP]Fix/improve getSpillCost analysis
Previous implementation may took some extra time, when walked over the
same instructions several times. And also it did not include proper
analysis for cross-basic-block use of the vectorized values. This
version fixes it.

It walks over the tree and checks the deps between entries and their
operands. If there are non-vectorized calls in between, it adds
a single(!) spill cost, because the vector value should be
spilled/reloaded only once.

Also, this version caches analysis for each entries, which are detected,
and do not repeats it, uses data, found during previous analysis for
previous nodes.

Also, it has the internal limit. If the number of instructions
between nodes and their operands is too big (> than ScheduleRegionSizeBudget / VectorizableTree.size()), it is considered that the spill is required. It allows to improve compile time.

Reviewers: preames, RKSimon, mikhailramalho

Reviewed By: preames

Pull Request: https://github.com/llvm/llvm-project/pull/129258
2025-03-04 15:47:23 -05:00
Alexey Bataev
a36a67c79a [SLP]Fix the analysis of the user buildvector nodes for minbitwidth
If the user node is a buildvector/gather node and it has no internal
instructions state, need to check properly for this state and check the
type of the node itself, not its operands.

Fixes #129242
2025-02-28 13:17:14 -08:00
Alexey Bataev
e1e20c07e4 [SLP]Fix bitwidth analysis for signed nodes, incoming into UITOFP nodes
If the signed node is the operand of UITOFP, the bitwidth analysis
should consider minimum value between incoming bitwidth and the bitwidth
of the UITOFP node.

Fixes #129244
2025-02-28 11:50:50 -08:00
Alexey Bataev
69effe054c [SLP]Check for potential safety of the truncation for vectorized scalars with multi uses
If the vectorized scalars has multiple uses, need to check if it is safe
to truncate the vectorized value, before actually trying doing it.
Otherwise, the compiler may loose some important bits, which may lead to
a miscompilation.

Fixes #129057
2025-02-27 08:41:46 -08:00
Alexey Bataev
39bab1de33 [SLP]Check if the operand for removal is the reduction operand, awaiting for the reduction
If the operand of the instruction-to-be-removed is a reduction value,
which is not reduced yet, and, thus, it has no users, it may be removed
during operands analysis.

Fixes #128736
2025-02-26 14:17:11 -08:00
Alexey Bataev
418a987285 [SLP]Do not use node, if it is a subvector or buildvector node
If the buildvector has some matches with another node, which is
a subvector of another buildvector node, need to check for this and
cancel matching to avoid incorrect ordering of the nodes.

Fixes #128770
2025-02-26 13:25:37 -08:00
Han-Kuan Chen
a12ca57c1c
[SLP][REVEC] Add getScalarizationOverhead helper function to reduce error when REVEC is enabled. (#128530) 2025-02-25 23:16:05 +08:00
Han-Kuan Chen
3a6108bcac
[SLP][REVEC] Fix scalar mask is passed to getScalarizationOverhead but the type is vector. (#128476)
Fix "Vector size mismatch".
2025-02-24 23:43:27 +08:00
Alexey Bataev
eb14d2a1d4 [SLP]Fix check for matched gather node, if it is a subvector node
If the gather node is a subvector node, it may match the existing
vector/gather node in the graph, but still may require reordering. in
this case need to fully check its dependencies to prevent a compiler
crash.

Fixes #128401
2025-02-24 06:48:43 -08:00