Alexey Bataev
daab7d0807
[SLP]Initial support for (masked)loads + compress and (masked)interleaved
...
Added initial support for (masked)loads + compress and
(masked)interleaved loads.
Reviewers: RKSimon, hiraditya
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/132099
2025-04-03 13:17:40 -07:00
Alexey Bataev
7c4013d591
Revert "[SLP]Initial support for (masked)loads + compress and (masked)interleaved"
...
This reverts commit 0bec0f5c059af5f920fe22ecda469b666b5971b0 to fix
a crash reported in https://lab.llvm.org/buildbot/#/builders/143/builds/6668 .
2025-04-03 12:58:49 -07:00
Alexey Bataev
0bec0f5c05
[SLP]Initial support for (masked)loads + compress and (masked)interleaved
...
Added initial support for (masked)loads + compress and
(masked)interleaved loads.
Reviewers: RKSimon, hiraditya
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/132099
2025-04-03 13:21:22 -04:00
Alexey Bataev
843ef77dc2
[SLP]Update mapping between values and their matching entries upon selection
...
Need to update the mapping between gathered values and their matching
entries, if the list of the entries is updated and only some of them are
selected for final shuffling.
Fixes #134085
2025-04-02 11:59:32 -07:00
Alexey Bataev
48a4b14cb6
[SLP]Fix whole vector registers calculations for compares
...
Need to check that the calculated number of the elements is not larger
than the original number of scalars to prevent a compiler crash.
Fixes #134013
2025-04-02 07:26:40 -07:00
Han-Kuan Chen
5bbcc765cc
[SLP][REVEC] getNumElements should not be used as VF when REVEC is enabled. ( #134031 )
2025-04-02 19:04:07 +08:00
Alexey Bataev
0e3049c562
[SLP]Support revectorization of the previously vectorized scalars
...
If the scalar instructions is marked for the vectorization in the tree,
it cannot be vectorized as part of the another node in the same tree, in
general. It may prevent some potentially profitable vectorization
opportunities, since some nodes end up being buildvector/gather nodes,
which add to the total cost.
Patch allows revectorization of the previously vectorized scalars.
Reviewers: hiraditya, RKSimon
Reviewed By: RKSimon, hiraditya
Pull Request: https://github.com/llvm/llvm-project/pull/133091
2025-04-01 14:30:06 -04:00
Alexey Bataev
cf6a452cc7
[SLP]Fix same/alternate analysis in split node analysis for compares
...
getSameOpcode in some cases may consider 2 compares as having same
opcode, even though previously they were considered as alternate. It may
happen, because getSameOpcode looses info about previous instructions
and their states. Need to use isAlternateInstruction function instead
for the correct analysis.
Reviewers: RKSimon, hiraditya
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/133769
2025-03-31 19:33:40 -04:00
Alexey Bataev
bfd8cc0a3e
[SLP]Fix a check for the whole register use
...
Need to check the value type, not the return type, of the instructions,
when doing the analysis for the whole register use to prevent a compiler
crash.
Fixes #133751
2025-03-31 10:52:12 -07:00
Han-Kuan Chen
65734de9b9
[SLP] NFC. Remove the redundant MainOp and AltOp find process. ( #133642 )
2025-03-31 10:26:45 +08:00
Alexey Bataev
1bfc61064a
[SLP]Fix spill cost analysis for split vectorized nodes
...
If the entry is SplitVectorize, it can be skipped in favor of its
operands, operands allow correctly detect spill costs.
Fixes #133288
2025-03-28 12:45:53 -07:00
Kazu Hirata
673f4705a8
[llvm] Use *Set::insert_range (NFC) ( #133353 )
...
We can use *Set::insert_range to collapse:
for (auto Elem : Range)
Set.insert(E.first);
down to:
Set.insert_range(llvm::make_first_range(Range));
In some cases, we can further fold that into the set declaration.
2025-03-27 20:44:20 -07:00
Kazu Hirata
cde58bfc16
[Transforms] Use range constructors of *Set (NFC) ( #133203 )
2025-03-27 07:51:58 -07:00
Martin Storsjö
a2e5932e8b
Revert "[SLP] Make getSameOpcode support interchangeable instructions. ( #132887 )"
...
This reverts commit 6e66cfeeaec6f09a4454400e45d690457ecdd3de.
This change causes crashes on compiling some inputs, see
https://github.com/llvm/llvm-project/pull/127450#issuecomment-2752833710
and
https://github.com/llvm/llvm-project/pull/127450#issuecomment-2753375326
for details.
2025-03-26 10:24:25 +02:00
Han-Kuan Chen
6e66cfeeae
[SLP] Make getSameOpcode support interchangeable instructions. ( #132887 )
...
We use the term "interchangeable instructions" to refer to different
operators that have the same meaning (e.g., `add x, 0` is equivalent to
`mul x, 1`).
Non-constant values are not supported, as they may incur high costs with
little benefit.
---------
Co-authored-by: Alexey Bataev <a.bataev@gmx.com>
2025-03-25 19:46:15 +08:00
Alexey Bataev
8122bb9dbe
[SLP]Fix a check for non-schedulable instructions
...
Need to fix a check for non-schedulable instructions in
getLastInstructionInBundle function, because this check may not work
correctly during the codegen. Instead, need to check that actually these
instructions were never scheduled, since the scheduling analysis always
performed before the codegen and is stable.
Fixes #132841
2025-03-25 04:35:33 -07:00
Han-Kuan Chen
2682a9433b
[SLP][REVEC] Add ExtractSubvector cost for ExternalUses. ( #132761 )
...
For llvm/test/Transforms/SLPVectorizer/revec-shufflevector.ll,
ScalarCost and ExtraCost is 1, so the original scalar will be kept.
2025-03-25 18:58:54 +08:00
Martin Storsjö
b33bec9b21
Revert "[SLP] Make getSameOpcode support interchangeable instructions. ( #127450 )"
...
This reverts commit 71a0cfd93263552ddc0bfd2ea7b0abe9a578f87e.
This commit triggers failed asserts when compiling ffmpeg. The
issue is reproducible with a small standalone reproducer like this:
void make_filters_from_proto(int *filter[][2], int bands) {
int c, q, n;
for (;; q++) {
n = 0;
for (; n < 7; n++) {
int theta = (q * (n - 6) + (n >> 1) - 3) % bands;
if (theta)
c = theta;
filter[q][n][0] = c;
}
}
}
$ clang -target x86_64-linux-gnu -c repro.c -O3
clang: ../lib/Transforms/Vectorize/SLPVectorizer.cpp:989: llvm::SmallVector<llvm
::Value*> {anonymous}::BinOpSameOpcodeHelper::InterchangeableInfo::getOperand(ll
vm::Instruction*) const: Assertion `FromCIValue.isZero() && "Cannot convert the
instruction."' failed.
The same issue also reproduces for a large number of other target
triples, aarch64-linux-gnu and others.
2025-03-25 10:22:44 +02:00
Martin Storsjö
dd059338a2
Revert "[Vectorize] Fix a warning"
...
This reverts commit 4c68061254c896214b7ad5ab807ac4ba11517812.
Reverting as part of a revert of a preceding commit.
2025-03-25 10:21:05 +02:00
Kazu Hirata
4c68061254
[Vectorize] Fix a warning
...
This patch fixes:
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:855:52: error:
unused variable 'SupportedOp' [-Werror,-Wunused-const-variable]
2025-03-24 17:38:47 -07:00
Han-Kuan Chen
71a0cfd932
[SLP] Make getSameOpcode support interchangeable instructions. ( #127450 )
...
We use the term "interchangeable instructions" to refer to different
operators that have the same meaning (e.g., `add x, 0` is equivalent to
`mul x, 1`).
Non-constant values are not supported, as they may incur high costs with
little benefit.
---------
Co-authored-by: Alexey Bataev <a.bataev@gmx.com>
2025-03-25 08:24:46 +08:00
Alexey Bataev
ad9909dd73
[SLP]Fix perfect diamond match with extractelements in scalars
...
Need to drop all previous estimations/vectorizations, when found
a perfect diamond match. This improves cost estimation and improves code
emission.
Also, need to adjust getScalarizationOverhead cost for non-poison input
vector. Currently, it does not allow to estimate it correctly, so
instead use conservative element-by-element insertelement cost for each
unique scalar.
Reviewers: RKSimon, hiraditya
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/132466
2025-03-24 09:29:18 -04:00
Alexey Bataev
3b0ec61156
[SLP][NFC] Redesign schedule bundle, separate from schedule data, NFC
...
That's the initial patch, intended to support revectorization of the
previously vectorized scalars. If the scalar is marked for the
vectorization, it becomes a part of the schedule bundle, used to check
dependencies and then schedule tree entry scalars into a single batch of
instructions. Unfortunately, currently this info is part of the
ScheduleData struct and it does not allow making scalars part of many
bundles. The patch separates schedule bundles from the ScheduleData,
introduces explicit class ScheduleBundle for bundles, allowing later to
extend it to support revectorization of the previously vectorized
scalars.
Reviewers: hiraditya, RKSimon
Reviewed By: RKSimon, hiraditya
Pull Request: https://github.com/llvm/llvm-project/pull/131625
2025-03-21 13:36:57 -04:00
Han-Kuan Chen
73558dc329
[SLP][REVEC] Fix getStoreMinimumVF only accept scalar types. ( #132181 )
...
Fix "Element type of a VectorType must " "be an integer, floating point,
or " "pointer type.".
2025-03-20 21:04:30 +08:00
Han-Kuan Chen
a5d4b50f93
[SLP] NFC. Change the inner loop and outer loop of appendOperandsOfVL. ( #132152 )
2025-03-20 20:32:20 +08:00
Han-Kuan Chen
c3e16337a4
[SLP][REVEC] Ignore UserTreeIndex if it is empty. ( #131993 )
...
Previously, the all_of check did not consider the case where the
TreeEntry is empty (i.e., when it is the first entry).
2025-03-20 11:31:49 +08:00
Kazu Hirata
0dcc201ac4
[Transforms] Use *Set::insert_range (NFC) ( #132056 )
...
DenseSet, SmallPtrSet, SmallSet, SetVector, and StringSet recently
gained C++23-style insert_range. This patch replaces:
Dest.insert(Src.begin(), Src.end());
with:
Dest.insert_range(Src);
This patch does not touch custom begin like succ_begin for now.
2025-03-19 15:35:01 -07:00
Longsheng Mou
f3f7f08eca
[SLP] Fix Wsign-compare warning (NFC) ( #131948 )
...
llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:4805:57:
warning: comparison of integer expressions of different signedness:
‘int’ and ‘std::size_t’ {aka ‘long unsigned int’} [-Wsign-compare]
[](const auto &P) { return P.value() % 2 != P.index() % 2; }))
~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~
2025-03-19 17:01:42 +08:00
Alexey Bataev
45090b3059
[SLP]Check the whole def-use chain in the tree to find proper dominance, if the last instruction is the same
...
If the insertion point (last instruction) of the user nodes is the same,
need to check the whole def-use chain in the tree to find proper
dominance to prevent a compiler crash.
Fixes #131818
2025-03-18 10:01:13 -07:00
Jeffrey Byrnes
4336e5edbc
[SLP] Sort PHIs by ExtractElements when relevant ( #131229 )
...
Considering the PHIs in order of element extracted can lead to better shuffles.
2025-03-17 14:19:46 -07:00
Alexey Bataev
ead9d6a56d
[SLP]Check VectorizableTree is not empty before accessing elements
...
Need to check VectorizableTree is not empty before accessing elements.
Fixes #131635
2025-03-17 11:04:38 -07:00
Alexey Bataev
fbf0276b6a
[SLP] Reorder reuses mask, if it is not empty, for subvector operands
...
If the subvector operands has reuses mask, need to reorder the mask, not
the scalars, to prevent compiler crash due to mask/scalars size
mismatch.
Fixes #131360
2025-03-14 14:11:09 -07:00
Alexey Bataev
605a9f590d
[SLP]Check if user node is same as other node and check operand order
...
Need to check if the user node is same as other node and check operand
order to prevent a compiler crash when trying to find matching gather
node with user nodes, having the same last instruction.
Fixes #131195
2025-03-14 13:46:07 -07:00
Alexey Bataev
9c86198caf
[SLP] Update vector value for incoming phi node, beeing vectorized already
...
If the phi node contains multiple same incoming blocks/values, need to
update the corresponding vectorized value, if it is not going to be
vectorized, if the incoming value was vectorized already.
Fixes #131355
2025-03-14 12:53:56 -07:00
Alexey Bataev
bbd1bb4057
[SLP]Set insert point for split node with non-scheulable instructions after the last instruction
...
Need to set the insert point for non-schedulable instructions in
SplitVectorize node after the last instruction, not before, to avoid
a crash in case of buildvector subvector node.
2025-03-14 07:04:55 -07:00
Alexey Bataev
202137dbea
[SLP]Fix a crash on matching gather operands of phi nodes in loops
...
If the gather operands in phi nodes are matching and phi nodes may build
up a loop, it may cause a compiler crash with the incorrect def-use
chain. Patch fixes this crash.
2025-03-12 14:46:00 -07:00
Alexey Bataev
10085390c6
[SLP]Reduce number of alternate instruction, where possible
...
Previous version was reviewed here https://github.com/llvm/llvm-project/pull/123360
It is mostly the same, adjusted after graph-to-tree transformation
Patch tries to remove wide alternate operations.
Currently SLP vectorizer emits something like this:
```
%0 = add i32
%1 = sub i32
%2 = add i32
%3 = sub i32
%4 = add i32
%5 = sub i32
%6 = add i32
%7 = sub i32
transformes to
%v1 = add <8 x i32>
%v2 = sub <8 x i32>
%res = shuffle %v1, %v2, <0, 9, 2, 11, 4, 13, 6, 15>
```
i.e. half of the results are just unused. This leads to increased
register pressure and potentially doubles number of operations.
Patch introduces SplitVectorize mode, where it splits the operations by
opcodes and produces instead something like this:
```
%v1 = add <4 x i32>
%v2 = sub <4 x i32>
%res = shuffle %v1, %v2, <0, 4, 1, 5, 2, 6, 3, 7>
```
It allows to improve the performance by reducing number of ops. Also, it
turns on some other improvements, like improved graph reordering.
-O3+LTO, AVX512
Metric: size..text
Program size..text
results results0 diff
test-suite :: MultiSource/Benchmarks/Olden/tsp/tsp.test 2788.00 2820.00 1.1%
test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test 278168.00 280904.00 1.0%
test-suite :: MultiSource/Benchmarks/FreeBench/pifft/pifft.test 82682.00 83258.00 0.7%
test-suite :: External/SPEC/CFP2006/433.milc/433.milc.test 139344.00 139712.00 0.3%
test-suite :: MultiSource/Benchmarks/FreeBench/fourinarow/fourinarow.test 27149.00 27197.00 0.2%
test-suite :: MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test 1008188.00 1009948.00 0.2%
test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test 39226.00 39290.00 0.2%
test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test 39229.00 39293.00 0.2%
test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test 2074533.00 2076549.00 0.1%
test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test 2074533.00 2076549.00 0.1%
test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 798440.00 798952.00 0.1%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test 44123.00 44139.00 0.0%
test-suite :: MultiSource/Benchmarks/Bullet/bullet.test 318942.00 319038.00 0.0%
test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test 1159880.00 1160152.00 0.0%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniAMR/miniAMR.test 73595.00 73611.00 0.0%
test-suite :: External/SPEC/CFP2006/453.povray/453.povray.test 1146124.00 1146348.00 0.0%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/CLAMR/CLAMR.test 203831.00 203847.00 0.0%
test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test 207662.00 207678.00 0.0%
test-suite :: External/SPEC/CFP2006/447.dealII/447.dealII.test 589851.00 589883.00 0.0%
test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test 1398543.00 1398559.00 0.0%
test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test 1398543.00 1398559.00 0.0%
test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 2050990.00 2051006.00 0.0%
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12559687.00 12559591.00 -0.0%
test-suite :: External/SPEC/CINT2006/403.gcc/403.gcc.test 3074157.00 3074125.00 -0.0%
test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test 1092252.00 1092188.00 -0.0%
test-suite :: External/SPEC/CFP2017rate/508.namd_r/508.namd_r.test 779763.00 779715.00 -0.0%
test-suite :: MultiSource/Benchmarks/ASCI_Purple/SMG2000/smg2000.test 253517.00 253485.00 -0.0%
test-suite :: MultiSource/Applications/JM/lencod/lencod.test 848259.00 848035.00 -0.0%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/miniFE/miniFE.test 93064.00 93016.00 -0.1%
test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test 383747.00 383475.00 -0.1%
test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 673051.00 662907.00 -1.5%
test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 673051.00 662907.00 -1.5%
Olden/tsp - small variations
Prolangs-C/TimberWolfMC - small variations, some code not inlined
FreeBench/pifft - extra store <8 x double> vectorized, some other extra
vectorizations
CFP2006/433.milc - better vector code
FreeBench/fourinarow - better vector code
Benchmarks/tramp3d-v4 - extra vector code, small variations
mediabench/gsm/toast - small variations
MiBench/telecomm-gsm - small variations
CINT2017rate/500.perlbench_r
CINT2017speed/600.perlbench_s - better vector code, small variations
CINT2006/464.h264ref - some smaller code + changes similar to x264
DOE-ProxyApps-C/miniGMG - small variations
Benchmarks/Bullet - small variations
CFP2017rate/511.povray_r - small variations
DOE-ProxyApps-C/miniAMR - small variations
CFP2006/453.povray - small variations
DOE-ProxyApps-C++/CLAMR - small variations
MiBench/consumer-lame - small variations
CFP2006/447.dealII - small variations
CFP2017rate/538.imagick_r
CFP2017speed/638.imagick_s - small variations
CFP2017rate/510.parest_r - better vector code, small variations
CFP2017rate/526.blender_r - small variations
CINT2006/403.gcc - small variations
CINT2006/400.perlbench - small variations
CFP2017rate/508.namd_r - small variations
ASCI_Purple/SMG2000 - small variations
JM/lencod - extra store <16 x i32>, small variations
DOE-ProxyApps-C++/miniFE - small variations
JM/ldecod - extra vector code, small variations, less shuffles
CINT2017speed/625.x264_s
CINT2017rate/525.x264_r - the number of instructions increased, but
looks like they are more performant. E.g., for function
x264_pixel_satd_8x8, llvm-mca reports better throughput - 84 for the
current version and 59 for the new version.
-O3+LTO, mcpu=sifive-p470
Metric: size..text
results results0 diff
test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 580768.00 581118.00 0.1%
test-suite :: MultiSource/Applications/d/make_dparser.test 78854.00 78894.00 0.1%
test-suite :: MultiSource/Applications/JM/lencod/lencod.test 633448.00 633750.00 0.0%
test-suite :: MultiSource/Benchmarks/Bullet/bullet.test 277002.00 277080.00 0.0%
test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test 931938.00 931960.00 0.0%
test-suite :: External/SPEC/CINT2006/403.gcc/403.gcc.test 2512806.00 2512822.00 0.0%
test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test 7659880.00 7659876.00 -0.0%
test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test 7659880.00 7659876.00 -0.0%
test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 1602448.00 1602434.00 -0.0%
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 9496664.00 9496542.00 -0.0%
test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test 147424.00 147422.00 -0.0%
test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test 1764608.00 1764578.00 -0.0%
test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test 1764608.00 1764578.00 -0.0%
test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test 841656.00 841632.00 -0.0%
test-suite :: External/SPEC/CFP2006/453.povray/453.povray.test 949026.00 948962.00 -0.0%
test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test 946348.00 946284.00 -0.0%
test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test 279794.00 279764.00 -0.0%
test-suite :: MultiSource/Benchmarks/mediabench/g721/g721encode/encode.test 4776.00 4772.00 -0.1%
test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test 25074.00 25028.00 -0.2%
test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test 25074.00 25028.00 -0.2%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test 29336.00 29184.00 -0.5%
test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 535390.00 510124.00 -4.7%
test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 535390.00 510124.00 -4.7%
test-suite :: SingleSource/Regression/C/gcc-c-torture/execute/ieee/GCC-C-execute-ieee-pr50310.test 886.00 608.00 -31.4%
CINT2006/464.h264ref - extra v16i32 reduction
d/make_dparser - better vector code
JM/lencod - extra v16i32 reduction
Benchmarks/Bullet - smaller vector code
CINT2006/400.perlbench - better vector code
CINT2006/403.gcc - small variations
CINT2017speed/602.gcc_s
CINT2017rate/502.gcc_r - small variations
CFP2017rate/510.parest_r - small variations
CFP2017rate/526.blender_r - small variations
MiBench/consumer-lame - small variations
CINT2017speed/600.perlbench_s
CINT2017rate/500.perlbench_r - small variations
Benchmarks/7zip - small variations
CFP2017rate/511.povray_r - small variations
JM/ldecod - extra vector code
mediabench/g721/g721encode - extra vector code
mediabench/gsm - extra vector code
MiBench/telecomm-gsm - extra vector code
DOE-ProxyApps-C/miniGMG - extra vector code
CINT2017rate/525.x264_r
CINT2017speed/625.x264_s - reduced number of wide operations and
shuffles, saving the registers, similar to X86, extra code in
pixel_hadamard_ac vectorized
ieee/GCC-C-execute-ieee-pr50310 - extra code vectorized
CINT2006/464.h264ref - extra vector code in find_sad_16x16
JM/lencod - extra vector code in find_sad_16x16
d/make_dparser - smaller vector code
Benchmarks/Bullet - small variations
CINT2006/400.perlbench - smaller vector code
CFP2017rate/526.blender_r - small variations, extra store <8 x float> in
the loop, extra store <8 x i8> in loop
CINT2017rate/500.perlbench_r
CINT2017speed/600.perlbench_s - small variations
MiBench/consumer-lame - small variations
JM/ldecod - extra vector code
mediabench/g721/g721encode - small variations
Reviewers: hiraditya
Reviewed By: hiraditya
Pull Request: https://github.com/llvm/llvm-project/pull/128907
2025-03-12 08:18:51 -07:00
Hans Wennborg
5ec884e5d8
Revert "[SLP]Reduce number of alternate instruction, where possible"
...
This caused assertion failures:
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:16237:
Value *llvm::slpvectorizer::BoUpSLP::vectorizeTree(TreeEntry *):
Assertion `OpTE1.isSame( ArrayRef(E->Scalars).take_front(OpTE1.getVectorFactor())) && "Expected same first part of scalars."' failed.
See comment on the PR.
> Previous version was reviewed here https://github.com/llvm/llvm-project/pull/123360
> It is mostly the same, adjusted after graph-to-tree transformation
This reverts commit 7de895ff1146c17ec78877900c01c09f4140e692.
2025-03-12 11:16:02 +01:00
Alexey Bataev
7de895ff11
[SLP]Reduce number of alternate instruction, where possible
...
Previous version was reviewed here https://github.com/llvm/llvm-project/pull/123360
It is mostly the same, adjusted after graph-to-tree transformation
Patch tries to remove wide alternate operations.
Currently SLP vectorizer emits something like this:
```
%0 = add i32
%1 = sub i32
%2 = add i32
%3 = sub i32
%4 = add i32
%5 = sub i32
%6 = add i32
%7 = sub i32
transformes to
%v1 = add <8 x i32>
%v2 = sub <8 x i32>
%res = shuffle %v1, %v2, <0, 9, 2, 11, 4, 13, 6, 15>
```
i.e. half of the results are just unused. This leads to increased
register pressure and potentially doubles number of operations.
Patch introduces SplitVectorize mode, where it splits the operations by
opcodes and produces instead something like this:
```
%v1 = add <4 x i32>
%v2 = sub <4 x i32>
%res = shuffle %v1, %v2, <0, 4, 1, 5, 2, 6, 3, 7>
```
It allows to improve the performance by reducing number of ops. Also, it
turns on some other improvements, like improved graph reordering.
-O3+LTO, AVX512
Metric: size..text
Program size..text
results results0 diff
test-suite :: MultiSource/Benchmarks/Olden/tsp/tsp.test 2788.00 2820.00 1.1%
test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test 278168.00 280904.00 1.0%
test-suite :: MultiSource/Benchmarks/FreeBench/pifft/pifft.test 82682.00 83258.00 0.7%
test-suite :: External/SPEC/CFP2006/433.milc/433.milc.test 139344.00 139712.00 0.3%
test-suite :: MultiSource/Benchmarks/FreeBench/fourinarow/fourinarow.test 27149.00 27197.00 0.2%
test-suite :: MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test 1008188.00 1009948.00 0.2%
test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test 39226.00 39290.00 0.2%
test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test 39229.00 39293.00 0.2%
test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test 2074533.00 2076549.00 0.1%
test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test 2074533.00 2076549.00 0.1%
test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 798440.00 798952.00 0.1%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test 44123.00 44139.00 0.0%
test-suite :: MultiSource/Benchmarks/Bullet/bullet.test 318942.00 319038.00 0.0%
test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test 1159880.00 1160152.00 0.0%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniAMR/miniAMR.test 73595.00 73611.00 0.0%
test-suite :: External/SPEC/CFP2006/453.povray/453.povray.test 1146124.00 1146348.00 0.0%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/CLAMR/CLAMR.test 203831.00 203847.00 0.0%
test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test 207662.00 207678.00 0.0%
test-suite :: External/SPEC/CFP2006/447.dealII/447.dealII.test 589851.00 589883.00 0.0%
test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test 1398543.00 1398559.00 0.0%
test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test 1398543.00 1398559.00 0.0%
test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 2050990.00 2051006.00 0.0%
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12559687.00 12559591.00 -0.0%
test-suite :: External/SPEC/CINT2006/403.gcc/403.gcc.test 3074157.00 3074125.00 -0.0%
test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test 1092252.00 1092188.00 -0.0%
test-suite :: External/SPEC/CFP2017rate/508.namd_r/508.namd_r.test 779763.00 779715.00 -0.0%
test-suite :: MultiSource/Benchmarks/ASCI_Purple/SMG2000/smg2000.test 253517.00 253485.00 -0.0%
test-suite :: MultiSource/Applications/JM/lencod/lencod.test 848259.00 848035.00 -0.0%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/miniFE/miniFE.test 93064.00 93016.00 -0.1%
test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test 383747.00 383475.00 -0.1%
test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 673051.00 662907.00 -1.5%
test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 673051.00 662907.00 -1.5%
Olden/tsp - small variations
Prolangs-C/TimberWolfMC - small variations, some code not inlined
FreeBench/pifft - extra store <8 x double> vectorized, some other extra
vectorizations
CFP2006/433.milc - better vector code
FreeBench/fourinarow - better vector code
Benchmarks/tramp3d-v4 - extra vector code, small variations
mediabench/gsm/toast - small variations
MiBench/telecomm-gsm - small variations
CINT2017rate/500.perlbench_r
CINT2017speed/600.perlbench_s - better vector code, small variations
CINT2006/464.h264ref - some smaller code + changes similar to x264
DOE-ProxyApps-C/miniGMG - small variations
Benchmarks/Bullet - small variations
CFP2017rate/511.povray_r - small variations
DOE-ProxyApps-C/miniAMR - small variations
CFP2006/453.povray - small variations
DOE-ProxyApps-C++/CLAMR - small variations
MiBench/consumer-lame - small variations
CFP2006/447.dealII - small variations
CFP2017rate/538.imagick_r
CFP2017speed/638.imagick_s - small variations
CFP2017rate/510.parest_r - better vector code, small variations
CFP2017rate/526.blender_r - small variations
CINT2006/403.gcc - small variations
CINT2006/400.perlbench - small variations
CFP2017rate/508.namd_r - small variations
ASCI_Purple/SMG2000 - small variations
JM/lencod - extra store <16 x i32>, small variations
DOE-ProxyApps-C++/miniFE - small variations
JM/ldecod - extra vector code, small variations, less shuffles
CINT2017speed/625.x264_s
CINT2017rate/525.x264_r - the number of instructions increased, but
looks like they are more performant. E.g., for function
x264_pixel_satd_8x8, llvm-mca reports better throughput - 84 for the
current version and 59 for the new version.
-O3+LTO, mcpu=sifive-p470
Metric: size..text
results results0 diff
test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 580768.00 581118.00 0.1%
test-suite :: MultiSource/Applications/d/make_dparser.test 78854.00 78894.00 0.1%
test-suite :: MultiSource/Applications/JM/lencod/lencod.test 633448.00 633750.00 0.0%
test-suite :: MultiSource/Benchmarks/Bullet/bullet.test 277002.00 277080.00 0.0%
test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test 931938.00 931960.00 0.0%
test-suite :: External/SPEC/CINT2006/403.gcc/403.gcc.test 2512806.00 2512822.00 0.0%
test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test 7659880.00 7659876.00 -0.0%
test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test 7659880.00 7659876.00 -0.0%
test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 1602448.00 1602434.00 -0.0%
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 9496664.00 9496542.00 -0.0%
test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test 147424.00 147422.00 -0.0%
test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test 1764608.00 1764578.00 -0.0%
test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test 1764608.00 1764578.00 -0.0%
test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test 841656.00 841632.00 -0.0%
test-suite :: External/SPEC/CFP2006/453.povray/453.povray.test 949026.00 948962.00 -0.0%
test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test 946348.00 946284.00 -0.0%
test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test 279794.00 279764.00 -0.0%
test-suite :: MultiSource/Benchmarks/mediabench/g721/g721encode/encode.test 4776.00 4772.00 -0.1%
test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test 25074.00 25028.00 -0.2%
test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test 25074.00 25028.00 -0.2%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test 29336.00 29184.00 -0.5%
test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 535390.00 510124.00 -4.7%
test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 535390.00 510124.00 -4.7%
test-suite :: SingleSource/Regression/C/gcc-c-torture/execute/ieee/GCC-C-execute-ieee-pr50310.test 886.00 608.00 -31.4%
CINT2006/464.h264ref - extra v16i32 reduction
d/make_dparser - better vector code
JM/lencod - extra v16i32 reduction
Benchmarks/Bullet - smaller vector code
CINT2006/400.perlbench - better vector code
CINT2006/403.gcc - small variations
CINT2017speed/602.gcc_s
CINT2017rate/502.gcc_r - small variations
CFP2017rate/510.parest_r - small variations
CFP2017rate/526.blender_r - small variations
MiBench/consumer-lame - small variations
CINT2017speed/600.perlbench_s
CINT2017rate/500.perlbench_r - small variations
Benchmarks/7zip - small variations
CFP2017rate/511.povray_r - small variations
JM/ldecod - extra vector code
mediabench/g721/g721encode - extra vector code
mediabench/gsm - extra vector code
MiBench/telecomm-gsm - extra vector code
DOE-ProxyApps-C/miniGMG - extra vector code
CINT2017rate/525.x264_r
CINT2017speed/625.x264_s - reduced number of wide operations and
shuffles, saving the registers, similar to X86, extra code in
pixel_hadamard_ac vectorized
ieee/GCC-C-execute-ieee-pr50310 - extra code vectorized
CINT2006/464.h264ref - extra vector code in find_sad_16x16
JM/lencod - extra vector code in find_sad_16x16
d/make_dparser - smaller vector code
Benchmarks/Bullet - small variations
CINT2006/400.perlbench - smaller vector code
CFP2017rate/526.blender_r - small variations, extra store <8 x float> in
the loop, extra store <8 x i8> in loop
CINT2017rate/500.perlbench_r
CINT2017speed/600.perlbench_s - small variations
MiBench/consumer-lame - small variations
JM/ldecod - extra vector code
mediabench/g721/g721encode - small variations
Reviewers: hiraditya
Reviewed By: hiraditya
Pull Request: https://github.com/llvm/llvm-project/pull/128907
2025-03-11 11:40:28 -07:00
Hans Wennborg
e858b10917
Revert "[SLP]Reduce number of alternate instruction, where possible"
...
This caused failures such as:
Instruction does not dominate all uses!
%29 = insertelement <8 x i64> %28, i64 %xor6.i.5, i64 6
%17 = shufflevector <8 x i64> %29, <8 x i64> poison, <6 x i32> <i32 1, i32 2, i32 3, i32 4, i32 5, i32 6>
see comment on https://github.com/llvm/llvm-project/pull/123360
> Previous version was reviewed here https://github.com/llvm/llvm-project/pull/123360
> It is mostly the same, adjusted after graph-to-tree transformation
>
> Patch tries to remove wide alternate operations.
> Currently SLP vectorizer emits something like this:
> ```
> %0 = add i32
> %1 = sub i32
> %2 = add i32
> %3 = sub i32
> %4 = add i32
> %5 = sub i32
> %6 = add i32
> %7 = sub i32
>
> transformes to
>
> %v1 = add <8 x i32>
> %v2 = sub <8 x i32>
> %res = shuffle %v1, %v2, <0, 9, 2, 11, 4, 13, 6, 15>
> ```
> i.e. half of the results are just unused. This leads to increased
> register pressure and potentially doubles number of operations.
>
> Patch introduces SplitVectorize mode, where it splits the operations by
> opcodes and produces instead something like this:
> ```
> %v1 = add <4 x i32>
> %v2 = sub <4 x i32>
> %res = shuffle %v1, %v2, <0, 4, 1, 5, 2, 6, 3, 7>
> ```
> It allows to improve the performance by reducing number of ops. Also, it
> turns on some other improvements, like improved graph reordering.
>
> [...]
This reverts commit 9d37e61fc77d3d6de891c30630f1c0227522031d as well as
the follow-up commit 72bb0a9a9c6fdde43e1e191f2dc0d5d2d46aff4e.
2025-03-11 15:04:36 +01:00
Alexey Bataev
9d37e61fc7
[SLP]Reduce number of alternate instruction, where possible
...
Previous version was reviewed here https://github.com/llvm/llvm-project/pull/123360
It is mostly the same, adjusted after graph-to-tree transformation
Patch tries to remove wide alternate operations.
Currently SLP vectorizer emits something like this:
```
%0 = add i32
%1 = sub i32
%2 = add i32
%3 = sub i32
%4 = add i32
%5 = sub i32
%6 = add i32
%7 = sub i32
transformes to
%v1 = add <8 x i32>
%v2 = sub <8 x i32>
%res = shuffle %v1, %v2, <0, 9, 2, 11, 4, 13, 6, 15>
```
i.e. half of the results are just unused. This leads to increased
register pressure and potentially doubles number of operations.
Patch introduces SplitVectorize mode, where it splits the operations by
opcodes and produces instead something like this:
```
%v1 = add <4 x i32>
%v2 = sub <4 x i32>
%res = shuffle %v1, %v2, <0, 4, 1, 5, 2, 6, 3, 7>
```
It allows to improve the performance by reducing number of ops. Also, it
turns on some other improvements, like improved graph reordering.
-O3+LTO, AVX512
Metric: size..text
Program size..text
results results0 diff
test-suite :: MultiSource/Benchmarks/Olden/tsp/tsp.test 2788.00 2820.00 1.1%
test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test 278168.00 280904.00 1.0%
test-suite :: MultiSource/Benchmarks/FreeBench/pifft/pifft.test 82682.00 83258.00 0.7%
test-suite :: External/SPEC/CFP2006/433.milc/433.milc.test 139344.00 139712.00 0.3%
test-suite :: MultiSource/Benchmarks/FreeBench/fourinarow/fourinarow.test 27149.00 27197.00 0.2%
test-suite :: MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test 1008188.00 1009948.00 0.2%
test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test 39226.00 39290.00 0.2%
test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test 39229.00 39293.00 0.2%
test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test 2074533.00 2076549.00 0.1%
test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test 2074533.00 2076549.00 0.1%
test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 798440.00 798952.00 0.1%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test 44123.00 44139.00 0.0%
test-suite :: MultiSource/Benchmarks/Bullet/bullet.test 318942.00 319038.00 0.0%
test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test 1159880.00 1160152.00 0.0%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniAMR/miniAMR.test 73595.00 73611.00 0.0%
test-suite :: External/SPEC/CFP2006/453.povray/453.povray.test 1146124.00 1146348.00 0.0%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/CLAMR/CLAMR.test 203831.00 203847.00 0.0%
test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test 207662.00 207678.00 0.0%
test-suite :: External/SPEC/CFP2006/447.dealII/447.dealII.test 589851.00 589883.00 0.0%
test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test 1398543.00 1398559.00 0.0%
test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test 1398543.00 1398559.00 0.0%
test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 2050990.00 2051006.00 0.0%
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12559687.00 12559591.00 -0.0%
test-suite :: External/SPEC/CINT2006/403.gcc/403.gcc.test 3074157.00 3074125.00 -0.0%
test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test 1092252.00 1092188.00 -0.0%
test-suite :: External/SPEC/CFP2017rate/508.namd_r/508.namd_r.test 779763.00 779715.00 -0.0%
test-suite :: MultiSource/Benchmarks/ASCI_Purple/SMG2000/smg2000.test 253517.00 253485.00 -0.0%
test-suite :: MultiSource/Applications/JM/lencod/lencod.test 848259.00 848035.00 -0.0%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/miniFE/miniFE.test 93064.00 93016.00 -0.1%
test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test 383747.00 383475.00 -0.1%
test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 673051.00 662907.00 -1.5%
test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 673051.00 662907.00 -1.5%
Olden/tsp - small variations
Prolangs-C/TimberWolfMC - small variations, some code not inlined
FreeBench/pifft - extra store <8 x double> vectorized, some other extra
vectorizations
CFP2006/433.milc - better vector code
FreeBench/fourinarow - better vector code
Benchmarks/tramp3d-v4 - extra vector code, small variations
mediabench/gsm/toast - small variations
MiBench/telecomm-gsm - small variations
CINT2017rate/500.perlbench_r
CINT2017speed/600.perlbench_s - better vector code, small variations
CINT2006/464.h264ref - some smaller code + changes similar to x264
DOE-ProxyApps-C/miniGMG - small variations
Benchmarks/Bullet - small variations
CFP2017rate/511.povray_r - small variations
DOE-ProxyApps-C/miniAMR - small variations
CFP2006/453.povray - small variations
DOE-ProxyApps-C++/CLAMR - small variations
MiBench/consumer-lame - small variations
CFP2006/447.dealII - small variations
CFP2017rate/538.imagick_r
CFP2017speed/638.imagick_s - small variations
CFP2017rate/510.parest_r - better vector code, small variations
CFP2017rate/526.blender_r - small variations
CINT2006/403.gcc - small variations
CINT2006/400.perlbench - small variations
CFP2017rate/508.namd_r - small variations
ASCI_Purple/SMG2000 - small variations
JM/lencod - extra store <16 x i32>, small variations
DOE-ProxyApps-C++/miniFE - small variations
JM/ldecod - extra vector code, small variations, less shuffles
CINT2017speed/625.x264_s
CINT2017rate/525.x264_r - the number of instructions increased, but
looks like they are more performant. E.g., for function
x264_pixel_satd_8x8, llvm-mca reports better throughput - 84 for the
current version and 59 for the new version.
-O3+LTO, mcpu=sifive-p470
Metric: size..text
results results0 diff
test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 580768.00 581118.00 0.1%
test-suite :: MultiSource/Applications/d/make_dparser.test 78854.00 78894.00 0.1%
test-suite :: MultiSource/Applications/JM/lencod/lencod.test 633448.00 633750.00 0.0%
test-suite :: MultiSource/Benchmarks/Bullet/bullet.test 277002.00 277080.00 0.0%
test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test 931938.00 931960.00 0.0%
test-suite :: External/SPEC/CINT2006/403.gcc/403.gcc.test 2512806.00 2512822.00 0.0%
test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test 7659880.00 7659876.00 -0.0%
test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test 7659880.00 7659876.00 -0.0%
test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 1602448.00 1602434.00 -0.0%
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 9496664.00 9496542.00 -0.0%
test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test 147424.00 147422.00 -0.0%
test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test 1764608.00 1764578.00 -0.0%
test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test 1764608.00 1764578.00 -0.0%
test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test 841656.00 841632.00 -0.0%
test-suite :: External/SPEC/CFP2006/453.povray/453.povray.test 949026.00 948962.00 -0.0%
test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test 946348.00 946284.00 -0.0%
test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test 279794.00 279764.00 -0.0%
test-suite :: MultiSource/Benchmarks/mediabench/g721/g721encode/encode.test 4776.00 4772.00 -0.1%
test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test 25074.00 25028.00 -0.2%
test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test 25074.00 25028.00 -0.2%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test 29336.00 29184.00 -0.5%
test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 535390.00 510124.00 -4.7%
test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 535390.00 510124.00 -4.7%
test-suite :: SingleSource/Regression/C/gcc-c-torture/execute/ieee/GCC-C-execute-ieee-pr50310.test 886.00 608.00 -31.4%
CINT2006/464.h264ref - extra v16i32 reduction
d/make_dparser - better vector code
JM/lencod - extra v16i32 reduction
Benchmarks/Bullet - smaller vector code
CINT2006/400.perlbench - better vector code
CINT2006/403.gcc - small variations
CINT2017speed/602.gcc_s
CINT2017rate/502.gcc_r - small variations
CFP2017rate/510.parest_r - small variations
CFP2017rate/526.blender_r - small variations
MiBench/consumer-lame - small variations
CINT2017speed/600.perlbench_s
CINT2017rate/500.perlbench_r - small variations
Benchmarks/7zip - small variations
CFP2017rate/511.povray_r - small variations
JM/ldecod - extra vector code
mediabench/g721/g721encode - extra vector code
mediabench/gsm - extra vector code
MiBench/telecomm-gsm - extra vector code
DOE-ProxyApps-C/miniGMG - extra vector code
CINT2017rate/525.x264_r
CINT2017speed/625.x264_s - reduced number of wide operations and
shuffles, saving the registers, similar to X86, extra code in
pixel_hadamard_ac vectorized
ieee/GCC-C-execute-ieee-pr50310 - extra code vectorized
CINT2006/464.h264ref - extra vector code in find_sad_16x16
JM/lencod - extra vector code in find_sad_16x16
d/make_dparser - smaller vector code
Benchmarks/Bullet - small variations
CINT2006/400.perlbench - smaller vector code
CFP2017rate/526.blender_r - small variations, extra store <8 x float> in
the loop, extra store <8 x i8> in loop
CINT2017rate/500.perlbench_r
CINT2017speed/600.perlbench_s - small variations
MiBench/consumer-lame - small variations
JM/ldecod - extra vector code
mediabench/g721/g721encode - small variations
Reviewers: hiraditya
Reviewed By: hiraditya
Pull Request: https://github.com/llvm/llvm-project/pull/128907
2025-03-10 10:06:39 -04:00
Alexey Bataev
4959025bbc
[SLP]Fix non-determinism in reused elements analysis
...
Need to use consistent storages for unique elements, when going to
iterate over them to avoid non-determinism in reused elements analysis.
Fixes #130082
2025-03-06 10:12:49 -08:00
Alexey Bataev
31845cf06c
Revert "[SLP]Fix non-determinism in reused elements analysis"
...
This reverts commit 3158525afdc3677457712963ef45c83f4f8f900f to fix
a bug revealed in https://lab.llvm.org/buildbot/#/builders/123/builds/14930
2025-03-06 08:59:08 -08:00
Alexey Bataev
3158525afd
[SLP]Fix non-determinism in reused elements analysis
...
Need to use consistent storages for unique elements, when going to
iterate over them to avoid non-determinism in reused elements analysis.
Fixes #130082
2025-03-06 08:51:31 -08:00
Alexey Bataev
1182be503d
[SLP]Fix a crash for buildvector nodes with parent phi nodes with same incoming blocks
...
If trying to find matching buildvector node for another nodes, and both
nodes are used by vectorized phi nodes and are coming from the same
parent block, this nodes should be considered matched to avoid a crash.
2025-03-06 07:42:43 -08:00
Alexey Bataev
855178af99
[SLP]Fix/improve getSpillCost analysis
...
Previous implementation may took some extra time, when walked over the
same instructions several times. And also it did not include proper
analysis for cross-basic-block use of the vectorized values. This
version fixes it.
It walks over the tree and checks the deps between entries and their
operands. If there are non-vectorized calls in between, it adds
a single(!) spill cost, because the vector value should be
spilled/reloaded only once.
Also, this version caches analysis for each entries, which are detected,
and do not repeats it, uses data, found during previous analysis for
previous nodes.
Also, it has the internal limit. If the number of instructions
between nodes and their operands is too big (> than ScheduleRegionSizeBudget / VectorizableTree.size()), it is considered that the spill is required. It allows to improve compile time.
Reviewers: preames, RKSimon, mikhailramalho
Reviewed By: preames
Pull Request: https://github.com/llvm/llvm-project/pull/129258
2025-03-04 15:47:23 -05:00
Alexey Bataev
a36a67c79a
[SLP]Fix the analysis of the user buildvector nodes for minbitwidth
...
If the user node is a buildvector/gather node and it has no internal
instructions state, need to check properly for this state and check the
type of the node itself, not its operands.
Fixes #129242
2025-02-28 13:17:14 -08:00
Alexey Bataev
e1e20c07e4
[SLP]Fix bitwidth analysis for signed nodes, incoming into UITOFP nodes
...
If the signed node is the operand of UITOFP, the bitwidth analysis
should consider minimum value between incoming bitwidth and the bitwidth
of the UITOFP node.
Fixes #129244
2025-02-28 11:50:50 -08:00
Alexey Bataev
69effe054c
[SLP]Check for potential safety of the truncation for vectorized scalars with multi uses
...
If the vectorized scalars has multiple uses, need to check if it is safe
to truncate the vectorized value, before actually trying doing it.
Otherwise, the compiler may loose some important bits, which may lead to
a miscompilation.
Fixes #129057
2025-02-27 08:41:46 -08:00
Alexey Bataev
39bab1de33
[SLP]Check if the operand for removal is the reduction operand, awaiting for the reduction
...
If the operand of the instruction-to-be-removed is a reduction value,
which is not reduced yet, and, thus, it has no users, it may be removed
during operands analysis.
Fixes #128736
2025-02-26 14:17:11 -08:00