Jeffrey Byrnes
4336e5edbc
[SLP] Sort PHIs by ExtractElements when relevant ( #131229 )
...
Considering the PHIs in order of element extracted can lead to better shuffles.
2025-03-17 14:19:46 -07:00
Alexey Bataev
ead9d6a56d
[SLP]Check VectorizableTree is not empty before accessing elements
...
Need to check VectorizableTree is not empty before accessing elements.
Fixes #131635
2025-03-17 11:04:38 -07:00
Alexey Bataev
fbf0276b6a
[SLP] Reorder reuses mask, if it is not empty, for subvector operands
...
If the subvector operands has reuses mask, need to reorder the mask, not
the scalars, to prevent compiler crash due to mask/scalars size
mismatch.
Fixes #131360
2025-03-14 14:11:09 -07:00
Alexey Bataev
605a9f590d
[SLP]Check if user node is same as other node and check operand order
...
Need to check if the user node is same as other node and check operand
order to prevent a compiler crash when trying to find matching gather
node with user nodes, having the same last instruction.
Fixes #131195
2025-03-14 13:46:07 -07:00
Alexey Bataev
9c86198caf
[SLP] Update vector value for incoming phi node, beeing vectorized already
...
If the phi node contains multiple same incoming blocks/values, need to
update the corresponding vectorized value, if it is not going to be
vectorized, if the incoming value was vectorized already.
Fixes #131355
2025-03-14 12:53:56 -07:00
Alexey Bataev
bbd1bb4057
[SLP]Set insert point for split node with non-scheulable instructions after the last instruction
...
Need to set the insert point for non-schedulable instructions in
SplitVectorize node after the last instruction, not before, to avoid
a crash in case of buildvector subvector node.
2025-03-14 07:04:55 -07:00
Alexey Bataev
202137dbea
[SLP]Fix a crash on matching gather operands of phi nodes in loops
...
If the gather operands in phi nodes are matching and phi nodes may build
up a loop, it may cause a compiler crash with the incorrect def-use
chain. Patch fixes this crash.
2025-03-12 14:46:00 -07:00
Alexey Bataev
10085390c6
[SLP]Reduce number of alternate instruction, where possible
...
Previous version was reviewed here https://github.com/llvm/llvm-project/pull/123360
It is mostly the same, adjusted after graph-to-tree transformation
Patch tries to remove wide alternate operations.
Currently SLP vectorizer emits something like this:
```
%0 = add i32
%1 = sub i32
%2 = add i32
%3 = sub i32
%4 = add i32
%5 = sub i32
%6 = add i32
%7 = sub i32
transformes to
%v1 = add <8 x i32>
%v2 = sub <8 x i32>
%res = shuffle %v1, %v2, <0, 9, 2, 11, 4, 13, 6, 15>
```
i.e. half of the results are just unused. This leads to increased
register pressure and potentially doubles number of operations.
Patch introduces SplitVectorize mode, where it splits the operations by
opcodes and produces instead something like this:
```
%v1 = add <4 x i32>
%v2 = sub <4 x i32>
%res = shuffle %v1, %v2, <0, 4, 1, 5, 2, 6, 3, 7>
```
It allows to improve the performance by reducing number of ops. Also, it
turns on some other improvements, like improved graph reordering.
-O3+LTO, AVX512
Metric: size..text
Program size..text
results results0 diff
test-suite :: MultiSource/Benchmarks/Olden/tsp/tsp.test 2788.00 2820.00 1.1%
test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test 278168.00 280904.00 1.0%
test-suite :: MultiSource/Benchmarks/FreeBench/pifft/pifft.test 82682.00 83258.00 0.7%
test-suite :: External/SPEC/CFP2006/433.milc/433.milc.test 139344.00 139712.00 0.3%
test-suite :: MultiSource/Benchmarks/FreeBench/fourinarow/fourinarow.test 27149.00 27197.00 0.2%
test-suite :: MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test 1008188.00 1009948.00 0.2%
test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test 39226.00 39290.00 0.2%
test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test 39229.00 39293.00 0.2%
test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test 2074533.00 2076549.00 0.1%
test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test 2074533.00 2076549.00 0.1%
test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 798440.00 798952.00 0.1%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test 44123.00 44139.00 0.0%
test-suite :: MultiSource/Benchmarks/Bullet/bullet.test 318942.00 319038.00 0.0%
test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test 1159880.00 1160152.00 0.0%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniAMR/miniAMR.test 73595.00 73611.00 0.0%
test-suite :: External/SPEC/CFP2006/453.povray/453.povray.test 1146124.00 1146348.00 0.0%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/CLAMR/CLAMR.test 203831.00 203847.00 0.0%
test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test 207662.00 207678.00 0.0%
test-suite :: External/SPEC/CFP2006/447.dealII/447.dealII.test 589851.00 589883.00 0.0%
test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test 1398543.00 1398559.00 0.0%
test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test 1398543.00 1398559.00 0.0%
test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 2050990.00 2051006.00 0.0%
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12559687.00 12559591.00 -0.0%
test-suite :: External/SPEC/CINT2006/403.gcc/403.gcc.test 3074157.00 3074125.00 -0.0%
test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test 1092252.00 1092188.00 -0.0%
test-suite :: External/SPEC/CFP2017rate/508.namd_r/508.namd_r.test 779763.00 779715.00 -0.0%
test-suite :: MultiSource/Benchmarks/ASCI_Purple/SMG2000/smg2000.test 253517.00 253485.00 -0.0%
test-suite :: MultiSource/Applications/JM/lencod/lencod.test 848259.00 848035.00 -0.0%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/miniFE/miniFE.test 93064.00 93016.00 -0.1%
test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test 383747.00 383475.00 -0.1%
test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 673051.00 662907.00 -1.5%
test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 673051.00 662907.00 -1.5%
Olden/tsp - small variations
Prolangs-C/TimberWolfMC - small variations, some code not inlined
FreeBench/pifft - extra store <8 x double> vectorized, some other extra
vectorizations
CFP2006/433.milc - better vector code
FreeBench/fourinarow - better vector code
Benchmarks/tramp3d-v4 - extra vector code, small variations
mediabench/gsm/toast - small variations
MiBench/telecomm-gsm - small variations
CINT2017rate/500.perlbench_r
CINT2017speed/600.perlbench_s - better vector code, small variations
CINT2006/464.h264ref - some smaller code + changes similar to x264
DOE-ProxyApps-C/miniGMG - small variations
Benchmarks/Bullet - small variations
CFP2017rate/511.povray_r - small variations
DOE-ProxyApps-C/miniAMR - small variations
CFP2006/453.povray - small variations
DOE-ProxyApps-C++/CLAMR - small variations
MiBench/consumer-lame - small variations
CFP2006/447.dealII - small variations
CFP2017rate/538.imagick_r
CFP2017speed/638.imagick_s - small variations
CFP2017rate/510.parest_r - better vector code, small variations
CFP2017rate/526.blender_r - small variations
CINT2006/403.gcc - small variations
CINT2006/400.perlbench - small variations
CFP2017rate/508.namd_r - small variations
ASCI_Purple/SMG2000 - small variations
JM/lencod - extra store <16 x i32>, small variations
DOE-ProxyApps-C++/miniFE - small variations
JM/ldecod - extra vector code, small variations, less shuffles
CINT2017speed/625.x264_s
CINT2017rate/525.x264_r - the number of instructions increased, but
looks like they are more performant. E.g., for function
x264_pixel_satd_8x8, llvm-mca reports better throughput - 84 for the
current version and 59 for the new version.
-O3+LTO, mcpu=sifive-p470
Metric: size..text
results results0 diff
test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 580768.00 581118.00 0.1%
test-suite :: MultiSource/Applications/d/make_dparser.test 78854.00 78894.00 0.1%
test-suite :: MultiSource/Applications/JM/lencod/lencod.test 633448.00 633750.00 0.0%
test-suite :: MultiSource/Benchmarks/Bullet/bullet.test 277002.00 277080.00 0.0%
test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test 931938.00 931960.00 0.0%
test-suite :: External/SPEC/CINT2006/403.gcc/403.gcc.test 2512806.00 2512822.00 0.0%
test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test 7659880.00 7659876.00 -0.0%
test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test 7659880.00 7659876.00 -0.0%
test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 1602448.00 1602434.00 -0.0%
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 9496664.00 9496542.00 -0.0%
test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test 147424.00 147422.00 -0.0%
test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test 1764608.00 1764578.00 -0.0%
test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test 1764608.00 1764578.00 -0.0%
test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test 841656.00 841632.00 -0.0%
test-suite :: External/SPEC/CFP2006/453.povray/453.povray.test 949026.00 948962.00 -0.0%
test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test 946348.00 946284.00 -0.0%
test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test 279794.00 279764.00 -0.0%
test-suite :: MultiSource/Benchmarks/mediabench/g721/g721encode/encode.test 4776.00 4772.00 -0.1%
test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test 25074.00 25028.00 -0.2%
test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test 25074.00 25028.00 -0.2%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test 29336.00 29184.00 -0.5%
test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 535390.00 510124.00 -4.7%
test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 535390.00 510124.00 -4.7%
test-suite :: SingleSource/Regression/C/gcc-c-torture/execute/ieee/GCC-C-execute-ieee-pr50310.test 886.00 608.00 -31.4%
CINT2006/464.h264ref - extra v16i32 reduction
d/make_dparser - better vector code
JM/lencod - extra v16i32 reduction
Benchmarks/Bullet - smaller vector code
CINT2006/400.perlbench - better vector code
CINT2006/403.gcc - small variations
CINT2017speed/602.gcc_s
CINT2017rate/502.gcc_r - small variations
CFP2017rate/510.parest_r - small variations
CFP2017rate/526.blender_r - small variations
MiBench/consumer-lame - small variations
CINT2017speed/600.perlbench_s
CINT2017rate/500.perlbench_r - small variations
Benchmarks/7zip - small variations
CFP2017rate/511.povray_r - small variations
JM/ldecod - extra vector code
mediabench/g721/g721encode - extra vector code
mediabench/gsm - extra vector code
MiBench/telecomm-gsm - extra vector code
DOE-ProxyApps-C/miniGMG - extra vector code
CINT2017rate/525.x264_r
CINT2017speed/625.x264_s - reduced number of wide operations and
shuffles, saving the registers, similar to X86, extra code in
pixel_hadamard_ac vectorized
ieee/GCC-C-execute-ieee-pr50310 - extra code vectorized
CINT2006/464.h264ref - extra vector code in find_sad_16x16
JM/lencod - extra vector code in find_sad_16x16
d/make_dparser - smaller vector code
Benchmarks/Bullet - small variations
CINT2006/400.perlbench - smaller vector code
CFP2017rate/526.blender_r - small variations, extra store <8 x float> in
the loop, extra store <8 x i8> in loop
CINT2017rate/500.perlbench_r
CINT2017speed/600.perlbench_s - small variations
MiBench/consumer-lame - small variations
JM/ldecod - extra vector code
mediabench/g721/g721encode - small variations
Reviewers: hiraditya
Reviewed By: hiraditya
Pull Request: https://github.com/llvm/llvm-project/pull/128907
2025-03-12 08:18:51 -07:00
Hans Wennborg
5ec884e5d8
Revert "[SLP]Reduce number of alternate instruction, where possible"
...
This caused assertion failures:
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:16237:
Value *llvm::slpvectorizer::BoUpSLP::vectorizeTree(TreeEntry *):
Assertion `OpTE1.isSame( ArrayRef(E->Scalars).take_front(OpTE1.getVectorFactor())) && "Expected same first part of scalars."' failed.
See comment on the PR.
> Previous version was reviewed here https://github.com/llvm/llvm-project/pull/123360
> It is mostly the same, adjusted after graph-to-tree transformation
This reverts commit 7de895ff1146c17ec78877900c01c09f4140e692.
2025-03-12 11:16:02 +01:00
Alexey Bataev
7de895ff11
[SLP]Reduce number of alternate instruction, where possible
...
Previous version was reviewed here https://github.com/llvm/llvm-project/pull/123360
It is mostly the same, adjusted after graph-to-tree transformation
Patch tries to remove wide alternate operations.
Currently SLP vectorizer emits something like this:
```
%0 = add i32
%1 = sub i32
%2 = add i32
%3 = sub i32
%4 = add i32
%5 = sub i32
%6 = add i32
%7 = sub i32
transformes to
%v1 = add <8 x i32>
%v2 = sub <8 x i32>
%res = shuffle %v1, %v2, <0, 9, 2, 11, 4, 13, 6, 15>
```
i.e. half of the results are just unused. This leads to increased
register pressure and potentially doubles number of operations.
Patch introduces SplitVectorize mode, where it splits the operations by
opcodes and produces instead something like this:
```
%v1 = add <4 x i32>
%v2 = sub <4 x i32>
%res = shuffle %v1, %v2, <0, 4, 1, 5, 2, 6, 3, 7>
```
It allows to improve the performance by reducing number of ops. Also, it
turns on some other improvements, like improved graph reordering.
-O3+LTO, AVX512
Metric: size..text
Program size..text
results results0 diff
test-suite :: MultiSource/Benchmarks/Olden/tsp/tsp.test 2788.00 2820.00 1.1%
test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test 278168.00 280904.00 1.0%
test-suite :: MultiSource/Benchmarks/FreeBench/pifft/pifft.test 82682.00 83258.00 0.7%
test-suite :: External/SPEC/CFP2006/433.milc/433.milc.test 139344.00 139712.00 0.3%
test-suite :: MultiSource/Benchmarks/FreeBench/fourinarow/fourinarow.test 27149.00 27197.00 0.2%
test-suite :: MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test 1008188.00 1009948.00 0.2%
test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test 39226.00 39290.00 0.2%
test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test 39229.00 39293.00 0.2%
test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test 2074533.00 2076549.00 0.1%
test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test 2074533.00 2076549.00 0.1%
test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 798440.00 798952.00 0.1%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test 44123.00 44139.00 0.0%
test-suite :: MultiSource/Benchmarks/Bullet/bullet.test 318942.00 319038.00 0.0%
test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test 1159880.00 1160152.00 0.0%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniAMR/miniAMR.test 73595.00 73611.00 0.0%
test-suite :: External/SPEC/CFP2006/453.povray/453.povray.test 1146124.00 1146348.00 0.0%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/CLAMR/CLAMR.test 203831.00 203847.00 0.0%
test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test 207662.00 207678.00 0.0%
test-suite :: External/SPEC/CFP2006/447.dealII/447.dealII.test 589851.00 589883.00 0.0%
test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test 1398543.00 1398559.00 0.0%
test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test 1398543.00 1398559.00 0.0%
test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 2050990.00 2051006.00 0.0%
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12559687.00 12559591.00 -0.0%
test-suite :: External/SPEC/CINT2006/403.gcc/403.gcc.test 3074157.00 3074125.00 -0.0%
test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test 1092252.00 1092188.00 -0.0%
test-suite :: External/SPEC/CFP2017rate/508.namd_r/508.namd_r.test 779763.00 779715.00 -0.0%
test-suite :: MultiSource/Benchmarks/ASCI_Purple/SMG2000/smg2000.test 253517.00 253485.00 -0.0%
test-suite :: MultiSource/Applications/JM/lencod/lencod.test 848259.00 848035.00 -0.0%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/miniFE/miniFE.test 93064.00 93016.00 -0.1%
test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test 383747.00 383475.00 -0.1%
test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 673051.00 662907.00 -1.5%
test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 673051.00 662907.00 -1.5%
Olden/tsp - small variations
Prolangs-C/TimberWolfMC - small variations, some code not inlined
FreeBench/pifft - extra store <8 x double> vectorized, some other extra
vectorizations
CFP2006/433.milc - better vector code
FreeBench/fourinarow - better vector code
Benchmarks/tramp3d-v4 - extra vector code, small variations
mediabench/gsm/toast - small variations
MiBench/telecomm-gsm - small variations
CINT2017rate/500.perlbench_r
CINT2017speed/600.perlbench_s - better vector code, small variations
CINT2006/464.h264ref - some smaller code + changes similar to x264
DOE-ProxyApps-C/miniGMG - small variations
Benchmarks/Bullet - small variations
CFP2017rate/511.povray_r - small variations
DOE-ProxyApps-C/miniAMR - small variations
CFP2006/453.povray - small variations
DOE-ProxyApps-C++/CLAMR - small variations
MiBench/consumer-lame - small variations
CFP2006/447.dealII - small variations
CFP2017rate/538.imagick_r
CFP2017speed/638.imagick_s - small variations
CFP2017rate/510.parest_r - better vector code, small variations
CFP2017rate/526.blender_r - small variations
CINT2006/403.gcc - small variations
CINT2006/400.perlbench - small variations
CFP2017rate/508.namd_r - small variations
ASCI_Purple/SMG2000 - small variations
JM/lencod - extra store <16 x i32>, small variations
DOE-ProxyApps-C++/miniFE - small variations
JM/ldecod - extra vector code, small variations, less shuffles
CINT2017speed/625.x264_s
CINT2017rate/525.x264_r - the number of instructions increased, but
looks like they are more performant. E.g., for function
x264_pixel_satd_8x8, llvm-mca reports better throughput - 84 for the
current version and 59 for the new version.
-O3+LTO, mcpu=sifive-p470
Metric: size..text
results results0 diff
test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 580768.00 581118.00 0.1%
test-suite :: MultiSource/Applications/d/make_dparser.test 78854.00 78894.00 0.1%
test-suite :: MultiSource/Applications/JM/lencod/lencod.test 633448.00 633750.00 0.0%
test-suite :: MultiSource/Benchmarks/Bullet/bullet.test 277002.00 277080.00 0.0%
test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test 931938.00 931960.00 0.0%
test-suite :: External/SPEC/CINT2006/403.gcc/403.gcc.test 2512806.00 2512822.00 0.0%
test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test 7659880.00 7659876.00 -0.0%
test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test 7659880.00 7659876.00 -0.0%
test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 1602448.00 1602434.00 -0.0%
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 9496664.00 9496542.00 -0.0%
test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test 147424.00 147422.00 -0.0%
test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test 1764608.00 1764578.00 -0.0%
test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test 1764608.00 1764578.00 -0.0%
test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test 841656.00 841632.00 -0.0%
test-suite :: External/SPEC/CFP2006/453.povray/453.povray.test 949026.00 948962.00 -0.0%
test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test 946348.00 946284.00 -0.0%
test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test 279794.00 279764.00 -0.0%
test-suite :: MultiSource/Benchmarks/mediabench/g721/g721encode/encode.test 4776.00 4772.00 -0.1%
test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test 25074.00 25028.00 -0.2%
test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test 25074.00 25028.00 -0.2%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test 29336.00 29184.00 -0.5%
test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 535390.00 510124.00 -4.7%
test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 535390.00 510124.00 -4.7%
test-suite :: SingleSource/Regression/C/gcc-c-torture/execute/ieee/GCC-C-execute-ieee-pr50310.test 886.00 608.00 -31.4%
CINT2006/464.h264ref - extra v16i32 reduction
d/make_dparser - better vector code
JM/lencod - extra v16i32 reduction
Benchmarks/Bullet - smaller vector code
CINT2006/400.perlbench - better vector code
CINT2006/403.gcc - small variations
CINT2017speed/602.gcc_s
CINT2017rate/502.gcc_r - small variations
CFP2017rate/510.parest_r - small variations
CFP2017rate/526.blender_r - small variations
MiBench/consumer-lame - small variations
CINT2017speed/600.perlbench_s
CINT2017rate/500.perlbench_r - small variations
Benchmarks/7zip - small variations
CFP2017rate/511.povray_r - small variations
JM/ldecod - extra vector code
mediabench/g721/g721encode - extra vector code
mediabench/gsm - extra vector code
MiBench/telecomm-gsm - extra vector code
DOE-ProxyApps-C/miniGMG - extra vector code
CINT2017rate/525.x264_r
CINT2017speed/625.x264_s - reduced number of wide operations and
shuffles, saving the registers, similar to X86, extra code in
pixel_hadamard_ac vectorized
ieee/GCC-C-execute-ieee-pr50310 - extra code vectorized
CINT2006/464.h264ref - extra vector code in find_sad_16x16
JM/lencod - extra vector code in find_sad_16x16
d/make_dparser - smaller vector code
Benchmarks/Bullet - small variations
CINT2006/400.perlbench - smaller vector code
CFP2017rate/526.blender_r - small variations, extra store <8 x float> in
the loop, extra store <8 x i8> in loop
CINT2017rate/500.perlbench_r
CINT2017speed/600.perlbench_s - small variations
MiBench/consumer-lame - small variations
JM/ldecod - extra vector code
mediabench/g721/g721encode - small variations
Reviewers: hiraditya
Reviewed By: hiraditya
Pull Request: https://github.com/llvm/llvm-project/pull/128907
2025-03-11 11:40:28 -07:00
Hans Wennborg
e858b10917
Revert "[SLP]Reduce number of alternate instruction, where possible"
...
This caused failures such as:
Instruction does not dominate all uses!
%29 = insertelement <8 x i64> %28, i64 %xor6.i.5, i64 6
%17 = shufflevector <8 x i64> %29, <8 x i64> poison, <6 x i32> <i32 1, i32 2, i32 3, i32 4, i32 5, i32 6>
see comment on https://github.com/llvm/llvm-project/pull/123360
> Previous version was reviewed here https://github.com/llvm/llvm-project/pull/123360
> It is mostly the same, adjusted after graph-to-tree transformation
>
> Patch tries to remove wide alternate operations.
> Currently SLP vectorizer emits something like this:
> ```
> %0 = add i32
> %1 = sub i32
> %2 = add i32
> %3 = sub i32
> %4 = add i32
> %5 = sub i32
> %6 = add i32
> %7 = sub i32
>
> transformes to
>
> %v1 = add <8 x i32>
> %v2 = sub <8 x i32>
> %res = shuffle %v1, %v2, <0, 9, 2, 11, 4, 13, 6, 15>
> ```
> i.e. half of the results are just unused. This leads to increased
> register pressure and potentially doubles number of operations.
>
> Patch introduces SplitVectorize mode, where it splits the operations by
> opcodes and produces instead something like this:
> ```
> %v1 = add <4 x i32>
> %v2 = sub <4 x i32>
> %res = shuffle %v1, %v2, <0, 4, 1, 5, 2, 6, 3, 7>
> ```
> It allows to improve the performance by reducing number of ops. Also, it
> turns on some other improvements, like improved graph reordering.
>
> [...]
This reverts commit 9d37e61fc77d3d6de891c30630f1c0227522031d as well as
the follow-up commit 72bb0a9a9c6fdde43e1e191f2dc0d5d2d46aff4e.
2025-03-11 15:04:36 +01:00
Alexey Bataev
9d37e61fc7
[SLP]Reduce number of alternate instruction, where possible
...
Previous version was reviewed here https://github.com/llvm/llvm-project/pull/123360
It is mostly the same, adjusted after graph-to-tree transformation
Patch tries to remove wide alternate operations.
Currently SLP vectorizer emits something like this:
```
%0 = add i32
%1 = sub i32
%2 = add i32
%3 = sub i32
%4 = add i32
%5 = sub i32
%6 = add i32
%7 = sub i32
transformes to
%v1 = add <8 x i32>
%v2 = sub <8 x i32>
%res = shuffle %v1, %v2, <0, 9, 2, 11, 4, 13, 6, 15>
```
i.e. half of the results are just unused. This leads to increased
register pressure and potentially doubles number of operations.
Patch introduces SplitVectorize mode, where it splits the operations by
opcodes and produces instead something like this:
```
%v1 = add <4 x i32>
%v2 = sub <4 x i32>
%res = shuffle %v1, %v2, <0, 4, 1, 5, 2, 6, 3, 7>
```
It allows to improve the performance by reducing number of ops. Also, it
turns on some other improvements, like improved graph reordering.
-O3+LTO, AVX512
Metric: size..text
Program size..text
results results0 diff
test-suite :: MultiSource/Benchmarks/Olden/tsp/tsp.test 2788.00 2820.00 1.1%
test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test 278168.00 280904.00 1.0%
test-suite :: MultiSource/Benchmarks/FreeBench/pifft/pifft.test 82682.00 83258.00 0.7%
test-suite :: External/SPEC/CFP2006/433.milc/433.milc.test 139344.00 139712.00 0.3%
test-suite :: MultiSource/Benchmarks/FreeBench/fourinarow/fourinarow.test 27149.00 27197.00 0.2%
test-suite :: MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test 1008188.00 1009948.00 0.2%
test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test 39226.00 39290.00 0.2%
test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test 39229.00 39293.00 0.2%
test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test 2074533.00 2076549.00 0.1%
test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test 2074533.00 2076549.00 0.1%
test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 798440.00 798952.00 0.1%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test 44123.00 44139.00 0.0%
test-suite :: MultiSource/Benchmarks/Bullet/bullet.test 318942.00 319038.00 0.0%
test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test 1159880.00 1160152.00 0.0%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniAMR/miniAMR.test 73595.00 73611.00 0.0%
test-suite :: External/SPEC/CFP2006/453.povray/453.povray.test 1146124.00 1146348.00 0.0%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/CLAMR/CLAMR.test 203831.00 203847.00 0.0%
test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test 207662.00 207678.00 0.0%
test-suite :: External/SPEC/CFP2006/447.dealII/447.dealII.test 589851.00 589883.00 0.0%
test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test 1398543.00 1398559.00 0.0%
test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test 1398543.00 1398559.00 0.0%
test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 2050990.00 2051006.00 0.0%
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12559687.00 12559591.00 -0.0%
test-suite :: External/SPEC/CINT2006/403.gcc/403.gcc.test 3074157.00 3074125.00 -0.0%
test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test 1092252.00 1092188.00 -0.0%
test-suite :: External/SPEC/CFP2017rate/508.namd_r/508.namd_r.test 779763.00 779715.00 -0.0%
test-suite :: MultiSource/Benchmarks/ASCI_Purple/SMG2000/smg2000.test 253517.00 253485.00 -0.0%
test-suite :: MultiSource/Applications/JM/lencod/lencod.test 848259.00 848035.00 -0.0%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/miniFE/miniFE.test 93064.00 93016.00 -0.1%
test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test 383747.00 383475.00 -0.1%
test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 673051.00 662907.00 -1.5%
test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 673051.00 662907.00 -1.5%
Olden/tsp - small variations
Prolangs-C/TimberWolfMC - small variations, some code not inlined
FreeBench/pifft - extra store <8 x double> vectorized, some other extra
vectorizations
CFP2006/433.milc - better vector code
FreeBench/fourinarow - better vector code
Benchmarks/tramp3d-v4 - extra vector code, small variations
mediabench/gsm/toast - small variations
MiBench/telecomm-gsm - small variations
CINT2017rate/500.perlbench_r
CINT2017speed/600.perlbench_s - better vector code, small variations
CINT2006/464.h264ref - some smaller code + changes similar to x264
DOE-ProxyApps-C/miniGMG - small variations
Benchmarks/Bullet - small variations
CFP2017rate/511.povray_r - small variations
DOE-ProxyApps-C/miniAMR - small variations
CFP2006/453.povray - small variations
DOE-ProxyApps-C++/CLAMR - small variations
MiBench/consumer-lame - small variations
CFP2006/447.dealII - small variations
CFP2017rate/538.imagick_r
CFP2017speed/638.imagick_s - small variations
CFP2017rate/510.parest_r - better vector code, small variations
CFP2017rate/526.blender_r - small variations
CINT2006/403.gcc - small variations
CINT2006/400.perlbench - small variations
CFP2017rate/508.namd_r - small variations
ASCI_Purple/SMG2000 - small variations
JM/lencod - extra store <16 x i32>, small variations
DOE-ProxyApps-C++/miniFE - small variations
JM/ldecod - extra vector code, small variations, less shuffles
CINT2017speed/625.x264_s
CINT2017rate/525.x264_r - the number of instructions increased, but
looks like they are more performant. E.g., for function
x264_pixel_satd_8x8, llvm-mca reports better throughput - 84 for the
current version and 59 for the new version.
-O3+LTO, mcpu=sifive-p470
Metric: size..text
results results0 diff
test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 580768.00 581118.00 0.1%
test-suite :: MultiSource/Applications/d/make_dparser.test 78854.00 78894.00 0.1%
test-suite :: MultiSource/Applications/JM/lencod/lencod.test 633448.00 633750.00 0.0%
test-suite :: MultiSource/Benchmarks/Bullet/bullet.test 277002.00 277080.00 0.0%
test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test 931938.00 931960.00 0.0%
test-suite :: External/SPEC/CINT2006/403.gcc/403.gcc.test 2512806.00 2512822.00 0.0%
test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test 7659880.00 7659876.00 -0.0%
test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test 7659880.00 7659876.00 -0.0%
test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 1602448.00 1602434.00 -0.0%
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 9496664.00 9496542.00 -0.0%
test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test 147424.00 147422.00 -0.0%
test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test 1764608.00 1764578.00 -0.0%
test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test 1764608.00 1764578.00 -0.0%
test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test 841656.00 841632.00 -0.0%
test-suite :: External/SPEC/CFP2006/453.povray/453.povray.test 949026.00 948962.00 -0.0%
test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test 946348.00 946284.00 -0.0%
test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test 279794.00 279764.00 -0.0%
test-suite :: MultiSource/Benchmarks/mediabench/g721/g721encode/encode.test 4776.00 4772.00 -0.1%
test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test 25074.00 25028.00 -0.2%
test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test 25074.00 25028.00 -0.2%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test 29336.00 29184.00 -0.5%
test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 535390.00 510124.00 -4.7%
test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 535390.00 510124.00 -4.7%
test-suite :: SingleSource/Regression/C/gcc-c-torture/execute/ieee/GCC-C-execute-ieee-pr50310.test 886.00 608.00 -31.4%
CINT2006/464.h264ref - extra v16i32 reduction
d/make_dparser - better vector code
JM/lencod - extra v16i32 reduction
Benchmarks/Bullet - smaller vector code
CINT2006/400.perlbench - better vector code
CINT2006/403.gcc - small variations
CINT2017speed/602.gcc_s
CINT2017rate/502.gcc_r - small variations
CFP2017rate/510.parest_r - small variations
CFP2017rate/526.blender_r - small variations
MiBench/consumer-lame - small variations
CINT2017speed/600.perlbench_s
CINT2017rate/500.perlbench_r - small variations
Benchmarks/7zip - small variations
CFP2017rate/511.povray_r - small variations
JM/ldecod - extra vector code
mediabench/g721/g721encode - extra vector code
mediabench/gsm - extra vector code
MiBench/telecomm-gsm - extra vector code
DOE-ProxyApps-C/miniGMG - extra vector code
CINT2017rate/525.x264_r
CINT2017speed/625.x264_s - reduced number of wide operations and
shuffles, saving the registers, similar to X86, extra code in
pixel_hadamard_ac vectorized
ieee/GCC-C-execute-ieee-pr50310 - extra code vectorized
CINT2006/464.h264ref - extra vector code in find_sad_16x16
JM/lencod - extra vector code in find_sad_16x16
d/make_dparser - smaller vector code
Benchmarks/Bullet - small variations
CINT2006/400.perlbench - smaller vector code
CFP2017rate/526.blender_r - small variations, extra store <8 x float> in
the loop, extra store <8 x i8> in loop
CINT2017rate/500.perlbench_r
CINT2017speed/600.perlbench_s - small variations
MiBench/consumer-lame - small variations
JM/ldecod - extra vector code
mediabench/g721/g721encode - small variations
Reviewers: hiraditya
Reviewed By: hiraditya
Pull Request: https://github.com/llvm/llvm-project/pull/128907
2025-03-10 10:06:39 -04:00
Alexey Bataev
4959025bbc
[SLP]Fix non-determinism in reused elements analysis
...
Need to use consistent storages for unique elements, when going to
iterate over them to avoid non-determinism in reused elements analysis.
Fixes #130082
2025-03-06 10:12:49 -08:00
Alexey Bataev
31845cf06c
Revert "[SLP]Fix non-determinism in reused elements analysis"
...
This reverts commit 3158525afdc3677457712963ef45c83f4f8f900f to fix
a bug revealed in https://lab.llvm.org/buildbot/#/builders/123/builds/14930
2025-03-06 08:59:08 -08:00
Alexey Bataev
3158525afd
[SLP]Fix non-determinism in reused elements analysis
...
Need to use consistent storages for unique elements, when going to
iterate over them to avoid non-determinism in reused elements analysis.
Fixes #130082
2025-03-06 08:51:31 -08:00
Alexey Bataev
1182be503d
[SLP]Fix a crash for buildvector nodes with parent phi nodes with same incoming blocks
...
If trying to find matching buildvector node for another nodes, and both
nodes are used by vectorized phi nodes and are coming from the same
parent block, this nodes should be considered matched to avoid a crash.
2025-03-06 07:42:43 -08:00
Alexey Bataev
855178af99
[SLP]Fix/improve getSpillCost analysis
...
Previous implementation may took some extra time, when walked over the
same instructions several times. And also it did not include proper
analysis for cross-basic-block use of the vectorized values. This
version fixes it.
It walks over the tree and checks the deps between entries and their
operands. If there are non-vectorized calls in between, it adds
a single(!) spill cost, because the vector value should be
spilled/reloaded only once.
Also, this version caches analysis for each entries, which are detected,
and do not repeats it, uses data, found during previous analysis for
previous nodes.
Also, it has the internal limit. If the number of instructions
between nodes and their operands is too big (> than ScheduleRegionSizeBudget / VectorizableTree.size()), it is considered that the spill is required. It allows to improve compile time.
Reviewers: preames, RKSimon, mikhailramalho
Reviewed By: preames
Pull Request: https://github.com/llvm/llvm-project/pull/129258
2025-03-04 15:47:23 -05:00
Alexey Bataev
a36a67c79a
[SLP]Fix the analysis of the user buildvector nodes for minbitwidth
...
If the user node is a buildvector/gather node and it has no internal
instructions state, need to check properly for this state and check the
type of the node itself, not its operands.
Fixes #129242
2025-02-28 13:17:14 -08:00
Alexey Bataev
e1e20c07e4
[SLP]Fix bitwidth analysis for signed nodes, incoming into UITOFP nodes
...
If the signed node is the operand of UITOFP, the bitwidth analysis
should consider minimum value between incoming bitwidth and the bitwidth
of the UITOFP node.
Fixes #129244
2025-02-28 11:50:50 -08:00
Alexey Bataev
69effe054c
[SLP]Check for potential safety of the truncation for vectorized scalars with multi uses
...
If the vectorized scalars has multiple uses, need to check if it is safe
to truncate the vectorized value, before actually trying doing it.
Otherwise, the compiler may loose some important bits, which may lead to
a miscompilation.
Fixes #129057
2025-02-27 08:41:46 -08:00
Alexey Bataev
39bab1de33
[SLP]Check if the operand for removal is the reduction operand, awaiting for the reduction
...
If the operand of the instruction-to-be-removed is a reduction value,
which is not reduced yet, and, thus, it has no users, it may be removed
during operands analysis.
Fixes #128736
2025-02-26 14:17:11 -08:00
Alexey Bataev
418a987285
[SLP]Do not use node, if it is a subvector or buildvector node
...
If the buildvector has some matches with another node, which is
a subvector of another buildvector node, need to check for this and
cancel matching to avoid incorrect ordering of the nodes.
Fixes #128770
2025-02-26 13:25:37 -08:00
Han-Kuan Chen
a12ca57c1c
[SLP][REVEC] Add getScalarizationOverhead helper function to reduce error when REVEC is enabled. ( #128530 )
2025-02-25 23:16:05 +08:00
Han-Kuan Chen
3a6108bcac
[SLP][REVEC] Fix scalar mask is passed to getScalarizationOverhead but the type is vector. ( #128476 )
...
Fix "Vector size mismatch".
2025-02-24 23:43:27 +08:00
Alexey Bataev
eb14d2a1d4
[SLP]Fix check for matched gather node, if it is a subvector node
...
If the gather node is a subvector node, it may match the existing
vector/gather node in the graph, but still may require reordering. in
this case need to fully check its dependencies to prevent a compiler
crash.
Fixes #128401
2025-02-24 06:48:43 -08:00
Alexey Bataev
8ffdc3b207
[SLP]Fix a crash when checking a scalar in a reordered buildvector node
...
Need to check reordered scalars, not the original ones, to correctly
check proper scalar.
2025-02-21 14:59:43 -08:00
Alexey Bataev
894935cb51
[SLP]Represent SLP graph as a tree
...
We can stop using a graph representation of the SLP structure and switch
directly to tree by relying on a single user of each tree node. If the
node has multiple uses, other uses must be represented as a separate
gather/buildvector node, which then will be combined with the existing
vectorized node(s) uoon cost estimation/codegen.
This allow to simplify inner structure and turn in some extra
optimizations, which could not be turned on for the nodes with multi
users (reordering, minbitwidth analysis).
AVX512, -O3+LTO
Metric: size..text
results results0 diff
test-suite :: MultiSource/Benchmarks/ASCI_Purple/SMG2000/smg2000.test 253453.00 254253.00 0.3%
test-suite :: External/SPEC/CFP2006/444.namd/444.namd.test 251411.00 252051.00 0.3%
test-suite :: SingleSource/Benchmarks/Misc/oourafft.test 19114.00 19146.00 0.2%
test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test 1399200.00 1399520.00 0.0%
test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test 1399200.00 1399520.00 0.0%
test-suite :: MicroBenchmarks/LCALS/SubsetALambdaLoops/lcalsALambda.test 304310.00 304326.00 0.0%
test-suite :: MicroBenchmarks/LCALS/SubsetARawLoops/lcalsARaw.test 304662.00 304678.00 0.0%
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12566919.00 12567511.00 0.0%
test-suite :: External/SPEC/CFP2006/453.povray/453.povray.test 1146300.00 1146316.00 0.0%
test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test 1159864.00 1159880.00 0.0%
test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test 9407880.00 9407864.00 -0.0%
test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test 9407880.00 9407864.00 -0.0%
test-suite :: MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test 1011612.00 1011596.00 -0.0%
test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test 280584.00 280536.00 -0.0%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/miniFE/miniFE.test 93016.00 93000.00 -0.0%
ASCI_Purple/SMG2000 - extra code vectorized, small variations
CFP2006/444.namd - small variations, less shuffles
Benchmarks/Misc/oourafft - small variations
CFP2017rate/538.imagick_r
CFP2017speed/638.imagick_s - small variations, less shuffles
LCALS/SubsetALambdaLoops - less shuffles
LCALS/SubsetARawLoops - less shuffles
CFP2017rate/526.blender_r - small variations, extra vector code
CFP2006/453.povray - small variations
CFP2017rate/511.povray_r - small variations
CINT2017rate/502.gcc_r
CINT2017speed/602.gcc_s - small variations
Benchmarks/tramp3d-v4 - small variations
Prolangs-C/TimberWolfMC - small variations
DOE-ProxyApps-C++/miniFE - extra code vectorized, small variations
DOE-ProxyApps-C++/CLAMR - extra code vectorized, small variations
ASCI_Purple/SMG2000 - no significant changes
RISCV, -O3+LTO
Metric: size..text
results results0 diff
test-suite :: SingleSource/Regression/C/gcc-c-torture/execute/GCC-C-execute-pr28982b.test 1812.00 1866.00 3.0%
test-suite :: MultiSource/Benchmarks/Olden/health/health.test 3946.00 4016.00 1.8%
test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 513180.00 513550.00 0.1%
test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 513180.00 513550.00 0.1%
test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test 7672198.00 7672202.00 0.0%
test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test 7672198.00 7672202.00 0.0%
test-suite :: External/SPEC/CFP2017rate/508.namd_r/508.namd_r.test 746060.00 746044.00 -0.0%
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 9497716.00 9497364.00 -0.0%
test-suite :: External/SPEC/CFP2006/453.povray/453.povray.test 948266.00 948214.00 -0.0%
test-suite :: External/SPEC/CFP2006/433.milc/433.milc.test 89874.00 89862.00 -0.0%
test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test 835492.00 835346.00 -0.0%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/miniFE/miniFE.test 66230.00 66202.00 -0.0%
test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test 946090.00 944206.00 -0.2%
test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test 1136404.00 1131854.00 -0.4%
test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test 1136404.00 1131854.00 -0.4%
gcc-c-torture/execute/GCC-C-execute-pr28982b - better vector code
Olden/health - extra vector code
CINT2017speed/625.x264_s
CINT2017rate/525.x264_r - small variation + improvements in reordering, @pixel_hadamard_ac stopped
being vectorized because of some non-effective shuffle recognition by
the compiler
CINT2017rate/502.gcc_r
CINT2017speed/602.gcc_s - small variations
CFP2017rate/508.namd_r - small variations
CFP2017rate/526.blender_r - small variations
CFP2006/453.povray - extra vector code
Benchmarks/7zip - extra vector code
DOE-ProxyApps-C++/miniFE - small variations
CFP2017rate/511.povray_r - extra vector code
CFP2017speed/638.imagick_s
CFP2017rate/538.imagick_r - extra vector code
Reviewers: RKSimon, hiraditya
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/126771
2025-02-21 07:15:02 -05:00
Alexey Bataev
0e1ffa397e
[SLP]Fix a crash when comparing phis from unreachable blocks
...
Need to check if the block is reachable before comparing phis from it to
avoid compiler crash when requesting node.
Fixes report in https://github.com/llvm/llvm-project/pull/110529#issuecomment-2664723338
2025-02-18 08:20:48 -08:00
Alexey Bataev
37bde7ae5b
[SLP]Fix hanging on small trees with phis only with adjusted cost threshold
...
Need to check if the tree is too small before attempting to vectorize the tree to prevent hanging on small trees with phis only.
2025-02-18 07:56:47 -08:00
Alexey Bataev
3b18d47ecb
[SLP]Improved reduction cost/codegen
...
SLP vectorizer is able to combine several reductions from the list of
(potentially) reduced values with the different opcodes/values kind.
Currently, these reductions are handled independently of each other. But
instead the compiler can combine them into wide vector operations and
then perform only single reduction.
E.g, if the SLP vectorizer emits currently something like:
```
%r1 = reduce.add(<4 x i32> %v1)
%r2 = reduce.add(<4 x i32> %v2)
%r = add i32 %r1, %r2
```
it can be emitted as:
```
%v = add <4 x i32> %v1, %v2
%r = reduce.add(<4 x i32> %v)
```
It allows to improve the performance in some cases.
AVX512, -O3+LTO
Metric: size..text
Program size..text
results results0 diff
test-suite :: SingleSource/Benchmarks/Shootout-C++/Shootout-C++-matrix.test 4553.00 4615.00 1.4%
test-suite :: SingleSource/Benchmarks/Adobe-C++/loop_unroll.test 412708.00 416820.00 1.0%
test-suite :: SingleSource/UnitTests/Vector/AVX512BWVL/Vector-AVX512BWVL-mask_set_bw.test 12901.00 12981.00 0.6%
test-suite :: MultiSource/Benchmarks/FreeBench/fourinarow/fourinarow.test 22717.00 22813.00 0.4%
test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test 39722.00 39850.00 0.3%
test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test 39725.00 39853.00 0.3%
test-suite :: SingleSource/Regression/C/gcc-c-torture/execute/GCC-C-execute-builtin-bitops-1.test 15918.00 15967.00 0.3%
test-suite :: External/SPEC/CFP2006/433.milc/433.milc.test 155491.00 155587.00 0.1%
test-suite :: MicroBenchmarks/ImageProcessing/Blur/blur.test 227894.00 227942.00 0.0%
test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test 1062188.00 1062364.00 0.0%
test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 793672.00 793720.00 0.0%
test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 657371.00 657403.00 0.0%
test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 657371.00 657403.00 0.0%
test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test 2074917.00 2074933.00 0.0%
test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test 2074917.00 2074933.00 0.0%
test-suite :: MultiSource/Applications/JM/lencod/lencod.test 855219.00 855203.00 -0.0%
Benchmarks/Shootout-C++ - same transformed reduction
Adobe-C++/loop_unroll - same transformed reductions, new vector code
AVX512BWVL/Vector-AVX512BWVL-mask_set_bw - same transformed reductions
FreeBench/fourinarow - same transformed reductions
MiBench/telecomm-gsm - same transformed reductions
execute/GCC-C-execute-builtin-bitops-1 - same transformed reductions
CFP2006/433.milc - better vector code, several x i64 reductions + trunc
to i32 gets trunced to x i32 reductions
ImageProcessing/Blur - same transformed reductions
Benchmarks/7zip - same transformed reductions, extra 4 x vectorization
CINT2006/464.h264ref - same transformed reductions
CINT2017rate/525.x264_r
CINT2017speed/625.x264_s - same transformed reductions
CINT2017speed/600.perlbench_s
CINT2017rate/500.perlbench_r - transformed same reduction
JM/lencod - extra 4 x vectorization
RISC-V, SiFive-p670, -O3+LTO
Metric: size..text
Program size..text
results results0 diff
test-suite :: SingleSource/Regression/C/gcc-c-torture/execute/GCC-C-execute-builtin-bitops-1.test 8990.00 9514.00 5.8%
test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 588504.00 588488.00 -0.0%
test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test 147464.00 147440.00 -0.0%
test-suite :: MultiSource/Benchmarks/MiBench/automotive-susan/automotive-susan.test 21496.00 21492.00 -0.0%
test-suite :: MicroBenchmarks/ImageProcessing/Blur/blur.test 165420.00 165372.00 -0.0%
test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test 843928.00 843648.00 -0.0%
test-suite :: External/SPEC/CINT2006/458.sjeng/458.sjeng.test 100712.00 100672.00 -0.0%
test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test 24384.00 24336.00 -0.2%
test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test 24380.00 24332.00 -0.2%
test-suite :: SingleSource/UnitTests/Vectorizer/VPlanNativePath/outer-loop-vect.test 10348.00 10316.00 -0.3%
test-suite :: SingleSource/Benchmarks/Adobe-C++/loop_unroll.test 221304.00 220480.00 -0.4%
test-suite :: SingleSource/Benchmarks/Shootout-C++/Shootout-C++-matrix.test 3750.00 3736.00 -0.4%
test-suite :: SingleSource/Regression/C/Regression-C-DuffsDevice.test 678.00 370.00 -45.4%
execute/GCC-C-execute-builtin-bitops-1 - extra 4 x reductions, same
transformed reductions
CINT2006/464.h264ref - extra 4 x reductions, same transformed reductions
MiBench/consumer-lame - 2 4 x i1 merged to 8 x i1 reductions (bitcast + ctpop)
MiBench/automotive-susan - same transformed reductions
ImageProcessing/Blur - same transformed reductions
Benchmarks/7zip - same transformed reductions
CINT2006/458.sjeng - 2 4 x i1 merged to 8 x i1 reductions (bitcast + ctpop)
MiBench/telecomm-gsm - same transformed reductions
Benchmarks/mediabench - same transformed reductions
Vectorizer/VPlanNativePath - same transformed reductions
Adobe-C++/loop_unroll - extra 4 x reductions, same transformed reductions
Benchmarks/Shootout-C++ - extra 4 x reductions, same transformed reductions
Regression/C/Regression-C-DuffsDevice - same transformed reductions
Reviewers: hiraditya, topperc, preames
Pull Request: https://github.com/llvm/llvm-project/pull/118293
2025-02-14 11:03:33 -08:00
Alexey Bataev
40029800e7
Revert "[SLP]Improved reduction cost/codegen"
...
This reverts commit 7ec60bf0166519317b5ae2505dd6ed4660e3ea39 to fix
a bug reported in https://github.com/llvm/llvm-project/issues/127220 .
2025-02-14 10:18:07 -08:00
Alexey Bataev
7ec60bf016
[SLP]Improved reduction cost/codegen
...
SLP vectorizer is able to combine several reductions from the list of
(potentially) reduced values with the different opcodes/values kind.
Currently, these reductions are handled independently of each other. But
instead the compiler can combine them into wide vector operations and
then perform only single reduction.
E.g, if the SLP vectorizer emits currently something like:
```
%r1 = reduce.add(<4 x i32> %v1)
%r2 = reduce.add(<4 x i32> %v2)
%r = add i32 %r1, %r2
```
it can be emitted as:
```
%v = add <4 x i32> %v1, %v2
%r = reduce.add(<4 x i32> %v)
```
It allows to improve the performance in some cases.
AVX512, -O3+LTO
Metric: size..text
Program size..text
results results0 diff
test-suite :: SingleSource/Benchmarks/Shootout-C++/Shootout-C++-matrix.test 4553.00 4615.00 1.4%
test-suite :: SingleSource/Benchmarks/Adobe-C++/loop_unroll.test 412708.00 416820.00 1.0%
test-suite :: SingleSource/UnitTests/Vector/AVX512BWVL/Vector-AVX512BWVL-mask_set_bw.test 12901.00 12981.00 0.6%
test-suite :: MultiSource/Benchmarks/FreeBench/fourinarow/fourinarow.test 22717.00 22813.00 0.4%
test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test 39722.00 39850.00 0.3%
test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test 39725.00 39853.00 0.3%
test-suite :: SingleSource/Regression/C/gcc-c-torture/execute/GCC-C-execute-builtin-bitops-1.test 15918.00 15967.00 0.3%
test-suite :: External/SPEC/CFP2006/433.milc/433.milc.test 155491.00 155587.00 0.1%
test-suite :: MicroBenchmarks/ImageProcessing/Blur/blur.test 227894.00 227942.00 0.0%
test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test 1062188.00 1062364.00 0.0%
test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 793672.00 793720.00 0.0%
test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 657371.00 657403.00 0.0%
test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 657371.00 657403.00 0.0%
test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test 2074917.00 2074933.00 0.0%
test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test 2074917.00 2074933.00 0.0%
test-suite :: MultiSource/Applications/JM/lencod/lencod.test 855219.00 855203.00 -0.0%
Benchmarks/Shootout-C++ - same transformed reduction
Adobe-C++/loop_unroll - same transformed reductions, new vector code
AVX512BWVL/Vector-AVX512BWVL-mask_set_bw - same transformed reductions
FreeBench/fourinarow - same transformed reductions
MiBench/telecomm-gsm - same transformed reductions
execute/GCC-C-execute-builtin-bitops-1 - same transformed reductions
CFP2006/433.milc - better vector code, several x i64 reductions + trunc
to i32 gets trunced to x i32 reductions
ImageProcessing/Blur - same transformed reductions
Benchmarks/7zip - same transformed reductions, extra 4 x vectorization
CINT2006/464.h264ref - same transformed reductions
CINT2017rate/525.x264_r
CINT2017speed/625.x264_s - same transformed reductions
CINT2017speed/600.perlbench_s
CINT2017rate/500.perlbench_r - transformed same reduction
JM/lencod - extra 4 x vectorization
RISC-V, SiFive-p670, -O3+LTO
Metric: size..text
Program size..text
results results0 diff
test-suite :: SingleSource/Regression/C/gcc-c-torture/execute/GCC-C-execute-builtin-bitops-1.test 8990.00 9514.00 5.8%
test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 588504.00 588488.00 -0.0%
test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test 147464.00 147440.00 -0.0%
test-suite :: MultiSource/Benchmarks/MiBench/automotive-susan/automotive-susan.test 21496.00 21492.00 -0.0%
test-suite :: MicroBenchmarks/ImageProcessing/Blur/blur.test 165420.00 165372.00 -0.0%
test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test 843928.00 843648.00 -0.0%
test-suite :: External/SPEC/CINT2006/458.sjeng/458.sjeng.test 100712.00 100672.00 -0.0%
test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test 24384.00 24336.00 -0.2%
test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test 24380.00 24332.00 -0.2%
test-suite :: SingleSource/UnitTests/Vectorizer/VPlanNativePath/outer-loop-vect.test 10348.00 10316.00 -0.3%
test-suite :: SingleSource/Benchmarks/Adobe-C++/loop_unroll.test 221304.00 220480.00 -0.4%
test-suite :: SingleSource/Benchmarks/Shootout-C++/Shootout-C++-matrix.test 3750.00 3736.00 -0.4%
test-suite :: SingleSource/Regression/C/Regression-C-DuffsDevice.test 678.00 370.00 -45.4%
execute/GCC-C-execute-builtin-bitops-1 - extra 4 x reductions, same
transformed reductions
CINT2006/464.h264ref - extra 4 x reductions, same transformed reductions
MiBench/consumer-lame - 2 4 x i1 merged to 8 x i1 reductions (bitcast + ctpop)
MiBench/automotive-susan - same transformed reductions
ImageProcessing/Blur - same transformed reductions
Benchmarks/7zip - same transformed reductions
CINT2006/458.sjeng - 2 4 x i1 merged to 8 x i1 reductions (bitcast + ctpop)
MiBench/telecomm-gsm - same transformed reductions
Benchmarks/mediabench - same transformed reductions
Vectorizer/VPlanNativePath - same transformed reductions
Adobe-C++/loop_unroll - extra 4 x reductions, same transformed reductions
Benchmarks/Shootout-C++ - extra 4 x reductions, same transformed reductions
Regression/C/Regression-C-DuffsDevice - same transformed reductions
Reviewers: hiraditya, topperc, preames
Pull Request: https://github.com/llvm/llvm-project/pull/118293
2025-02-14 05:15:29 -08:00
Alexey Bataev
afa3c10de7
Revert "[SLP]Improved reduction cost/codegen"
...
This reverts commit 2ad816648f2719e6c0da507a1a371f2cad4a3f1c to fix
bug/miscompiles, reported in
https://github.com/llvm/llvm-project/pull/118293#issuecomment-2658906033
and https://github.com/llvm/llvm-project/pull/118293#issuecomment-2659024785 .
2025-02-14 04:12:47 -08:00
Alexey Bataev
ac217ee389
[SLP] Check for PHI nodes (potentially cycles!) when checking dependencies
...
When checking for dependecies for gather nodes with users with the same
last instruction, cannot rely on the index order, if there is (even
potential!) cycle in the graph, which may cause order not work correctly
and cause compiler crash.
Fixes #127128
2025-02-13 14:21:48 -08:00
Alexey Bataev
d18b1ebef5
[SLP]Check if vector user exist before accessing it
...
Need to check if vector user exist before accessing it to avoid compiler
crash.
Fixes #126581
2025-02-13 09:44:34 -08:00
Alexey Bataev
2ad816648f
[SLP]Improved reduction cost/codegen
...
SLP vectorizer is able to combine several reductions from the list of
(potentially) reduced values with the different opcodes/values kind.
Currently, these reductions are handled independently of each other. But
instead the compiler can combine them into wide vector operations and
then perform only single reduction.
E.g, if the SLP vectorizer emits currently something like:
```
%r1 = reduce.add(<4 x i32> %v1)
%r2 = reduce.add(<4 x i32> %v2)
%r = add i32 %r1, %r2
```
it can be emitted as:
```
%v = add <4 x i32> %v1, %v2
%r = reduce.add(<4 x i32> %v)
```
It allows to improve the performance in some cases.
AVX512, -O3+LTO
Metric: size..text
Program size..text
results results0 diff
test-suite :: SingleSource/Benchmarks/Shootout-C++/Shootout-C++-matrix.test 4553.00 4615.00 1.4%
test-suite :: SingleSource/Benchmarks/Adobe-C++/loop_unroll.test 412708.00 416820.00 1.0%
test-suite :: SingleSource/UnitTests/Vector/AVX512BWVL/Vector-AVX512BWVL-mask_set_bw.test 12901.00 12981.00 0.6%
test-suite :: MultiSource/Benchmarks/FreeBench/fourinarow/fourinarow.test 22717.00 22813.00 0.4%
test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test 39722.00 39850.00 0.3%
test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test 39725.00 39853.00 0.3%
test-suite :: SingleSource/Regression/C/gcc-c-torture/execute/GCC-C-execute-builtin-bitops-1.test 15918.00 15967.00 0.3%
test-suite :: External/SPEC/CFP2006/433.milc/433.milc.test 155491.00 155587.00 0.1%
test-suite :: MicroBenchmarks/ImageProcessing/Blur/blur.test 227894.00 227942.00 0.0%
test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test 1062188.00 1062364.00 0.0%
test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 793672.00 793720.00 0.0%
test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 657371.00 657403.00 0.0%
test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 657371.00 657403.00 0.0%
test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test 2074917.00 2074933.00 0.0%
test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test 2074917.00 2074933.00 0.0%
test-suite :: MultiSource/Applications/JM/lencod/lencod.test 855219.00 855203.00 -0.0%
Benchmarks/Shootout-C++ - same transformed reduction
Adobe-C++/loop_unroll - same transformed reductions, new vector code
AVX512BWVL/Vector-AVX512BWVL-mask_set_bw - same transformed reductions
FreeBench/fourinarow - same transformed reductions
MiBench/telecomm-gsm - same transformed reductions
execute/GCC-C-execute-builtin-bitops-1 - same transformed reductions
CFP2006/433.milc - better vector code, several x i64 reductions + trunc
to i32 gets trunced to x i32 reductions
ImageProcessing/Blur - same transformed reductions
Benchmarks/7zip - same transformed reductions, extra 4 x vectorization
CINT2006/464.h264ref - same transformed reductions
CINT2017rate/525.x264_r
CINT2017speed/625.x264_s - same transformed reductions
CINT2017speed/600.perlbench_s
CINT2017rate/500.perlbench_r - transformed same reduction
JM/lencod - extra 4 x vectorization
RISC-V, SiFive-p670, -O3+LTO
Metric: size..text
Program size..text
results results0 diff
test-suite :: SingleSource/Regression/C/gcc-c-torture/execute/GCC-C-execute-builtin-bitops-1.test 8990.00 9514.00 5.8%
test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 588504.00 588488.00 -0.0%
test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test 147464.00 147440.00 -0.0%
test-suite :: MultiSource/Benchmarks/MiBench/automotive-susan/automotive-susan.test 21496.00 21492.00 -0.0%
test-suite :: MicroBenchmarks/ImageProcessing/Blur/blur.test 165420.00 165372.00 -0.0%
test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test 843928.00 843648.00 -0.0%
test-suite :: External/SPEC/CINT2006/458.sjeng/458.sjeng.test 100712.00 100672.00 -0.0%
test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test 24384.00 24336.00 -0.2%
test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test 24380.00 24332.00 -0.2%
test-suite :: SingleSource/UnitTests/Vectorizer/VPlanNativePath/outer-loop-vect.test 10348.00 10316.00 -0.3%
test-suite :: SingleSource/Benchmarks/Adobe-C++/loop_unroll.test 221304.00 220480.00 -0.4%
test-suite :: SingleSource/Benchmarks/Shootout-C++/Shootout-C++-matrix.test 3750.00 3736.00 -0.4%
test-suite :: SingleSource/Regression/C/Regression-C-DuffsDevice.test 678.00 370.00 -45.4%
execute/GCC-C-execute-builtin-bitops-1 - extra 4 x reductions, same
transformed reductions
CINT2006/464.h264ref - extra 4 x reductions, same transformed reductions
MiBench/consumer-lame - 2 4 x i1 merged to 8 x i1 reductions (bitcast + ctpop)
MiBench/automotive-susan - same transformed reductions
ImageProcessing/Blur - same transformed reductions
Benchmarks/7zip - same transformed reductions
CINT2006/458.sjeng - 2 4 x i1 merged to 8 x i1 reductions (bitcast + ctpop)
MiBench/telecomm-gsm - same transformed reductions
Benchmarks/mediabench - same transformed reductions
Vectorizer/VPlanNativePath - same transformed reductions
Adobe-C++/loop_unroll - extra 4 x reductions, same transformed reductions
Benchmarks/Shootout-C++ - extra 4 x reductions, same transformed reductions
Regression/C/Regression-C-DuffsDevice - same transformed reductions
Reviewers: hiraditya, topperc, preames
Pull Request: https://github.com/llvm/llvm-project/pull/118293
2025-02-13 10:36:28 -05:00
Alexey Bataev
7d1db31aa0
[SLP]Check the first instruction instead the first scalar for subvectors
...
Need to check the first instruction instead of first scalar for
subvectors, when trying to find full matched vectorized node in the
graph.
Fixes #126909 .
2025-02-13 06:40:37 -08:00
Alexey Bataev
bb3d789dfe
[SLP][NFC]Improve dump of the ScheduleData, NFC
2025-02-12 06:51:30 -08:00
Alexey Bataev
e1935a2b15
Revert "[SLP][NFC]Improve dump of the ScheduleData, NFC"
...
This reverts commit 108e6bca693e5f44d2d17da5a6e06203a0290de7 to fix
error revealed by buildbots https://lab.llvm.org/buildbot/#/builders/159/builds/15888 .
2025-02-12 06:34:27 -08:00
Alexey Bataev
108e6bca69
[SLP][NFC]Improve dump of the ScheduleData, NFC
2025-02-12 06:25:04 -08:00
Alexey Bataev
10844fb9b0
[SLP]Fix attempt to build the reorder mask for non-adjusted reuse mask
...
When building the reorder for non-single use reuse mask, need to check
if the size of the mask is multiple of the number of unique scalars.
Otherwise, the compiler may crash when trying to reorder nodes.
Fixes #126304
2025-02-11 13:41:25 -08:00
Alexey Bataev
7dca2c628c
[SLP]Gather scalarized calls
...
If the calls won't be vectorized, but will be scalarized after
vectorization, they should be build as buildvector nodes, not vector
nodes. Vectorization of such calls leads to incorrect cost estimation,
does not allow to calculate correctly spills costs.
Reviewers: lukel97, preames
Reviewed By: preames
Pull Request: https://github.com/llvm/llvm-project/pull/125070
2025-02-04 19:09:57 -05:00
Alexey Bataev
88e7b8b81c
[SLP]Use TTI::getScalarizationOverhead where possible
...
Better to use TTI::getScalarizationOverhead instead of
TTI::getVectorInstrCost to correctly calculate the costs of
buildvectors/extracts.
Reviewers: RKSimon
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/125725
2025-02-04 18:49:43 -05:00
Alexey Bataev
fe7e280820
[SLP][NFC]Move functions definitions, NFC
...
Move functions to use them later in the following patches
2025-02-04 07:19:18 -08:00
Alexey Bataev
0c70a26f46
[SLP]Clear root node reordering only if the root node is not re-used in graph
...
The reordering of the root node can be safely cleared only if the root
node is not reused, otherwise the graph might be broken
Fixes #125357
2025-02-03 06:05:19 -08:00
Simon Pilgrim
e3fbf19eb4
[SLP] getSpillCost - fully populate IntrinsicCostAttributes to improve cost analysis. ( #124129 ) (REAPPLIED)
...
We were only constructing the IntrinsicCostAttributes with the arg type info, and not the args themselves, preventing more detailed cost analysis (constant / uniform args etc.)
Just pass the whole IntrinsicInst to the constructor and let it resolve everything it can.
Noticed while having yet another attempt at #63980
Reapplied cleanup now that #125223 and #124984 have landed.
2025-02-03 09:55:41 +00:00
Martin Storsjö
d00579be39
Revert "[SLP]Reduce number of alternate instruction, where possible"
...
This reverts commit d5a7a483a65f830a0c7a931781bc90046dc67ff4.
That commit triggers failed asserts, see
https://github.com/llvm/llvm-project/pull/123360 for details.
2025-02-02 15:56:08 +02:00
Alexey Bataev
d5a7a483a6
[SLP]Reduce number of alternate instruction, where possible
...
Patch tries to remove wide alternate operations.
Currently SLP vectorizer emits something like this:
```
%0 = add i32
%1 = sub i32
%2 = add i32
%3 = sub i32
%4 = add i32
%5 = sub i32
%6 = add i32
%7 = sub i32
transformes to
%v1 = add <8 x i32>
%v2 = sub <8 x i32>
%res = shuffle %v1, %v2, <0, 9, 2, 11, 4, 13, 6, 15>
```
i.e. half of the results are just unused. This leads to increased
register pressure and potentially doubles number of operations.
Patch introduces SplitVectorize mode, where it splits the operations by
opcodes and produces instead something like this:
```
%v1 = add <4 x i32>
%v2 = sub <4 x i32>
%res = shuffle %v1, %v2, <0, 4, 1, 5, 2, 6, 3, 7>
```
It allows to improve the performance by reducing number of ops. Also, it
turns on some other improvements, like improved graph reordering.
-O3+LTO, AVX512
Metric: size..text
Program size..text
results results0 diff
test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test 277800.00 280536.00 1.0%
test-suite :: MultiSource/Benchmarks/FreeBench/pifft/pifft.test 81802.00 82426.00 0.8%
test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 790552.00 790952.00 0.1%
test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test 383795.00 383987.00 0.1%
test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test 2075541.00 2076501.00 0.0%
test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test 2075541.00 2076501.00 0.0%
test-suite :: MultiSource/Benchmarks/Bullet/bullet.test 312702.00 312766.00 0.0%
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12569783.00 12569751.00 -0.0%
test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 2049374.00 2049358.00 -0.0%
test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test 1091836.00 1091772.00 -0.0%
test-suite :: MultiSource/Applications/JM/lencod/lencod.test 852339.00 852211.00 -0.0%
test-suite :: MultiSource/Applications/oggenc/oggenc.test 190651.00 190523.00 -0.1%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test 44203.00 44155.00 -0.1%
test-suite :: SingleSource/UnitTests/Vector/AVX512BWVL/Vector-AVX512BWVL-mask_set_bw.test 12997.00 12981.00 -0.1%
test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 668971.00 658427.00 -1.6%
test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 668971.00 658427.00 -1.6%
Prolangs-C/TimberWolfMC/timberwolfmc - small variations, some code not
inlined
FreeBench/pifft - extra stores <8 x double> vectorized, some other extra
vectorizations
CINT2006/464.h264ref - some smaller code + changes similar to x264
JM/ldecod - changes similar x264
CINT2017speed/600.perlbench_s
CINT2017rate/500.perlbench_r - significantly compact vector code
Benchmarks/Bullet - small variations
CFP2017rate/526.blender_r - small variations
CFP2017rate/510.parest_r - small variations
CINT2006/400.perlbench - extra vector code
JM/lencod - extra store <16 x i32> and other changes similar x264
Applications/oggenc - extra store <16 x i8>, small variations
DOE-ProxyApps-C/miniGMG - small variations
Vector/AVX512BWVL/Vector-AVX512BWVL-mask_set_bw - better vector code
CINT2017speed/625.x264_s
CINT2017rate/525.x264_r - the number of instructions increased, but
looks like they are more performant. E.g., for function
x264_pixel_satd_8x8, llvm-mca reports better throughput - 84 for the
current version and 59 for the new version.
-O3+LTO, march=rva32u64
CINT2017rate/525.x264_r - similar to x86, extra code in pixel_hadamard_ac
function vectorized, idct4x4dc stopped being vectorized (looks like
issue with shuffles cost)
CINT2006/400.perlbench - better vector code
CINT2006/445.gobmk - some variations in vector code
CINT2006/464.h264ref - extra code vectorized
CINT2017rate/500.perlbench_r - small variations
-O3+LTO, mcpu=sifive-p470
Metric: size..text
Program size..text
results results0 diff
test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 587336.00 587668.00 0.1%
test-suite :: MultiSource/Applications/JM/lencod/lencod.test 643308.00 643614.00 0.0%
test-suite :: MultiSource/Applications/d/make_dparser.test 79678.00 79710.00 0.0%
test-suite :: MultiSource/Benchmarks/Bullet/bullet.test 277322.00 277420.00 0.0%
test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test 933660.00 933682.00 0.0%
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 9497722.00 9497682.00 -0.0%
test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test 1767806.00 1767772.00 -0.0%
test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test 1767806.00 1767772.00 -0.0%
test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test 148038.00 148024.00 -0.0%
test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test 283036.00 283008.00 -0.0%
test-suite :: MultiSource/Benchmarks/mediabench/g721/g721encode/encode.test 4776.00 4772.00 -0.1%
test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 540582.00 511772.00 -5.3%
test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 540582.00 511772.00 -5.3%
CINT2006/464.h264ref - extra vector code in find_sad_16x16
JM/lencod - extra vector code in find_sad_16x16
d/make_dparser - smaller vector code
Benchmarks/Bullet - small variations
CINT2006/400.perlbench - smaller vector code
CFP2017rate/526.blender_r - small variations, extra store <8 x float> in
the loop, extra store <8 x i8> in loop
CINT2017rate/500.perlbench_r
CINT2017speed/600.perlbench_s - small variations
MiBench/consumer-lame - small variations
JM/ldecod - extra vector code
mediabench/g721/g721encode - small variations
CINT2017rate/525.x264_r
CINT2017speed/625.x264_s - reduced number of wide operations and
shuffles, saving the registers, similar to X86, extra code in
pixel_hadamard_ac vectorized, idct4x4dc not vectorized (issue with some
TTI costs)
Reviewers: RKSimon, hiraditya
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/123360
2025-02-01 10:00:16 -08:00
Alexey Bataev
d9c9326a21
[SLP]Recalculate number of parts when requesting number of elements based on original scalars size
...
Need to recalculate number of parts, since gathered scalar size might be changed
during building the buildvector shuffles.
Fixes #125259
2025-01-31 12:55:03 -08:00
Alexey Bataev
631abff733
Revert "[SLP]Use the size of gathered scalars when evaluating slice size"
...
This reverts commit e78aa8f35e6dd66d5152396406d3d4f37f43e7f4 to fix
crashes reported in https://lab.llvm.org/buildbot/#/builders/140/builds/16047 .
2025-01-31 11:31:45 -08:00