llvm-project

Author	SHA1	Message	Date
Alexey Bataev	e05def081e	[SLP]Do not vectorize code in EH and non-returning blocks The code in EH and non-returning blocks can be skipped by the vectorizer, since vectorizing it does not improve performance and only consumes compile/link time. Reviewers: RKSimon Reviewed By: RKSimon Pull Request: https://github.com/llvm/llvm-project/pull/112221	2024-10-31 13:50:02 -04:00
Sushant Gokhale	c9f01f699c	[SLP][AArch64][NFC] Add more tests for SLP vectorization of div (#113876) Currently, we don't have many tests that show the SLP outcome for integer divisions. This patch adds tests for the same. In certain scenarios, for Neon, vectorization is profitable. An attempt will be made in the future to improve the cost model for the same.	2024-10-28 20:37:41 +05:30
Alexey Bataev	4b1b51ac52	[SLP]Initial non-power-of-2 support (but still whole register) for reductions Enables initial non-power-of-2 support for reductions (but still requires the number of elements to form whole registers). Enables extra vectorization for MultiSource/Benchmarks/7zip/7zip-benchmark, CINT2006/464.h264ref and CFP2017rate/526.blender_r (checked for SSE2). Reviewers: RKSimon Reviewed By: RKSimon Pull Request: https://github.com/llvm/llvm-project/pull/112361	2024-10-21 12:25:39 -07:00
David Green	17ac10c28f	Revert "[SLP]Initial non-power-of-2 support (but still whole register) for reductions" This reverts commit 7f2e937469a8cec3fe977bf41ad2dfb9b4ce648a as it causes regressions in the tests it modifies, and undoes what was added in #100653 (which itself was a fix for a previous regression).	2024-10-21 13:37:44 +01:00
Alexey Bataev	7f2e937469	[SLP]Initial non-power-of-2 support (but still whole register) for reductions Enables initial non-power-of-2 support for reductions (but still requires the number of elements to form whole registers). Enables extra vectorization for MultiSource/Benchmarks/7zip/7zip-benchmark, CINT2006/464.h264ref and CFP2017rate/526.blender_r (checked for SSE2). Reviewers: RKSimon Reviewed By: RKSimon Pull Request: https://github.com/llvm/llvm-project/pull/112361	2024-10-18 12:50:11 -07:00
Alexey Bataev	f74879cf0c	[SLP]Make PHICompare comparator follow weak strict ordering requirement Reviewers: efriedma-quic Reviewed By: efriedma-quic Pull Request: https://github.com/llvm/llvm-project/pull/110529	2024-10-04 14:23:48 -04:00
Alexey Bataev	be6aed90c7	[SLP]Use number of scalars as a vector length for minbw cast Need to use the number of scalars, not the vector factor of the node. Otherwise incorrect casting can be estimated, leading to a compiler crash.	2024-09-26 13:06:19 -07:00
Alexey Bataev	3469db82b5	[SLP]Add subvector vectorization for non-load nodes Previously, the SLP vectorizer supported clustered vectorization for loads only. This patch adds support for "clustered" vectorization for other instructions. If the buildvector node contains "clusters" that can be vectorized separately and then inserted into the resulting buildvector, it is better to do so, since it may reduce the cost of the vector graph and produce better vector code. The patch analyzes whether it is profitable to try this kind of extra vectorization. It checks the scalar instructions and their operands and tries to vectorize them only if they result in a better graph. Reviewers: RKSimon Reviewed By: RKSimon Pull Request: https://github.com/llvm/llvm-project/pull/108430	2024-09-25 10:23:41 -04:00
Sushant Gokhale	c5672e21ca	[AArch64][CostModel] Reduce the cost of fadd reduction with fast flag (#108791) An fadd reduction with (1) the fast flag set and (2) a power-of-2 number of elements in the input vector results in a series of faddp instructions. The faddp instruction has latency/throughput identical to the fadd instruction, and hence we set the relative cost to 1 for faddp as well. The change didn't show any regression with SPEC17-FP (C/C++) or llvm-test-suite on Neoverse-V2.	2024-09-24 14:35:01 +05:30
Alexey Bataev	1833d418a0	[SLP]Vectorize gathered loads Final gather/buildvector nodes may have scalar loads, which are not vectorized (since they are part of the gather nodes) but may form full vector loads, being combined. This patch walks over all gather nodes, "gathering" and sorting gathered scalar loads and then tries to build vector loads, which later are reshuffled between the gather nodes. It allows later to add support for segmented loads (kind of AOS to SOA load kind for RISC-V RVV) and may help with the removal of the alternate opcodes support. Currently, alternate nodes may depend on each other because of the consecutive loads between their operands. Because of that we cannot simply remove alternate vectorization. But this approach may help to remove most of the stuff for it, since we'll be able to vectorize loads in between lanes. Metric: size..text, AVX512 Program size..text test-suite :: MultiSource/Benchmarks/ASCI_Purple/SMG2000/smg2000.test 238381.00 250669.00 5.2% test-suite :: SingleSource/UnitTests/Vectorizer/VPlanNativePath/outer-loop-vect.test 25753.00 26329.00 2.2% test-suite :: SingleSource/UnitTests/Vector/AVX512BWVL/Vector-AVX512BWVL-psadbw.test 3028.00 3092.00 2.1% test-suite :: MultiSource/Benchmarks/Rodinia/hotspot/hotspot.test 4243.00 4275.00 0.8% test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 649765.00 653877.00 0.6% test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 649765.00 653877.00 0.6% test-suite :: SingleSource/Benchmarks/BenchmarkGame/n-body.test 4199.00 4222.00 0.5% test-suite :: SingleSource/UnitTests/Vector/AVX512BWVL/Vector-AVX512BWVL-mask_set_bw.test 12933.00 12997.00 0.5% test-suite :: SingleSource/Benchmarks/Misc/flops.test 8282.00 8314.00 0.4% test-suite :: SingleSource/UnitTests/Vector/AVX512BWVL/Vector-AVX512BWVL-unpack_msasm.test 10065.00 10097.00 0.3% test-suite :: SingleSource/Benchmarks/Misc-C++/Large/ray.test 5160.00 5176.00 0.3% test-suite :: 
External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12472220.00 12509612.00 0.3% test-suite :: MultiSource/Benchmarks/Prolangs-C++/city/city.test 6908.00 6924.00 0.2% test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test 202830.00 203278.00 0.2% test-suite :: SingleSource/Benchmarks/CoyoteBench/fftbench.test 9133.00 9149.00 0.2% test-suite :: MultiSource/Benchmarks/Olden/power/power.test 6792.00 6803.00 0.2% test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test 1395585.00 1397473.00 0.1% test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test 1395585.00 1397473.00 0.1% test-suite :: External/SPEC/CINT2017speed/631.deepsjeng_s/631.deepsjeng_s.test 97662.00 97758.00 0.1% test-suite :: External/SPEC/CFP2006/447.dealII/447.dealII.test 595179.00 595739.00 0.1% test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniAMR/miniAMR.test 70603.00 70667.00 0.1% test-suite :: MultiSource/Benchmarks/Prolangs-C/unix-smail/unix-smail.test 19877.00 19893.00 0.1% test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/PENNANT/PENNANT.test 90231.00 90279.00 0.1% test-suite :: External/SPEC/CINT2006/473.astar/473.astar.test 33738.00 33754.00 0.0% test-suite :: External/SPEC/CFP2017speed/619.lbm_s/619.lbm_s.test 13262.00 13268.00 0.0% test-suite :: External/SPEC/CFP2006/453.povray/453.povray.test 1139964.00 1140460.00 0.0% test-suite :: MultiSource/Applications/JM/lencod/lencod.test 849507.00 849875.00 0.0% test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test 1158379.00 1158859.00 0.0% test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/CoMD/CoMD.test 38724.00 38740.00 0.0% test-suite :: External/SPEC/CFP2006/470.lbm/470.lbm.test 15180.00 15186.00 0.0% test-suite :: External/SPEC/CFP2017rate/519.lbm_r/519.lbm_r.test 15484.00 15490.00 0.0% test-suite :: External/SPEC/CINT2006/456.hmmer/456.hmmer.test 167391.00 167455.00 0.0% test-suite :: 
MultiSource/Benchmarks/TSVC/ControlFlow-dbl/ControlFlow-dbl.test 137448.00 137496.00 0.0% test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 2030254.00 2030766.00 0.0% test-suite :: MicroBenchmarks/LCALS/SubsetALambdaLoops/lcalsALambda.test 302870.00 302934.00 0.0% test-suite :: MicroBenchmarks/LCALS/SubsetARawLoops/lcalsARaw.test 303126.00 303190.00 0.0% test-suite :: External/SPEC/CFP2006/444.namd/444.namd.test 241107.00 241155.00 0.0% test-suite :: External/SPEC/CFP2006/482.sphinx3/482.sphinx3.test 162974.00 163006.00 0.0% test-suite :: MultiSource/Applications/siod/siod.test 167168.00 167200.00 0.0% test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test 1048796.00 1048988.00 0.0% test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/CLAMR/CLAMR.test 201623.00 201655.00 0.0% test-suite :: MultiSource/Applications/sqlite3/sqlite3.test 501734.00 501798.00 0.0% test-suite :: MultiSource/Applications/ClamAV/clamscan.test 580888.00 580952.00 0.0% test-suite :: MultiSource/Benchmarks/MallocBench/gs/gs.test 168319.00 168335.00 0.0% test-suite :: MicroBenchmarks/ImageProcessing/Interpolation/Interpolation.test 226022.00 226038.00 0.0% test-suite :: MultiSource/Benchmarks/TSVC/StatementReordering-flt/StatementReordering-flt.test 118011.00 118015.00 0.0% test-suite :: External/SPEC/CINT2006/471.omnetpp/471.omnetpp.test 550589.00 550605.00 0.0% test-suite :: External/SPEC/CINT2006/403.gcc/403.gcc.test 3072477.00 3072541.00 0.0% test-suite :: External/SPEC/CINT2006/483.xalancbmk/483.xalancbmk.test 2385563.00 2385579.00 0.0% test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test 389171.00 389155.00 -0.0% test-suite :: MultiSource/Applications/lua/lua.test 234764.00 234748.00 -0.0% test-suite :: MultiSource/Benchmarks/mafft/pairlocalalign.test 227694.00 227678.00 -0.0% test-suite :: MultiSource/Benchmarks/TSVC/NodeSplitting-flt/NodeSplitting-flt.test 119819.00 119807.00 -0.0% test-suite :: 
MultiSource/Benchmarks/TSVC/Recurrences-flt/Recurrences-flt.test 117995.00 117983.00 -0.0% test-suite :: MultiSource/Benchmarks/TSVC/InductionVariable-flt/InductionVariable-flt.test 123610.00 123594.00 -0.0% test-suite :: MultiSource/Benchmarks/FreeBench/pifft/pifft.test 81414.00 81398.00 -0.0% test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 782040.00 781880.00 -0.0% test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test 9597420.00 9595292.00 -0.0% test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test 9597420.00 9595292.00 -0.0% test-suite :: External/SPEC/CINT2006/445.gobmk/445.gobmk.test 911832.00 911608.00 -0.0% test-suite :: MultiSource/Applications/oggenc/oggenc.test 192507.00 192459.00 -0.0% test-suite :: MultiSource/Benchmarks/TSVC/LoopRestructuring-flt/LoopRestructuring-flt.test 122843.00 122811.00 -0.0% test-suite :: MultiSource/Benchmarks/TSVC/CrossingThresholds-flt/CrossingThresholds-flt.test 122292.00 122260.00 -0.0% test-suite :: External/SPEC/CFP2017rate/508.namd_r/508.namd_r.test 777363.00 777155.00 -0.0% test-suite :: MultiSource/Benchmarks/TSVC/Expansion-flt/Expansion-flt.test 123265.00 123205.00 -0.0% test-suite :: MultiSource/Benchmarks/Bullet/bullet.test 315534.00 315358.00 -0.1% test-suite :: MultiSource/Benchmarks/TSVC/ControlFlow-flt/ControlFlow-flt.test 128163.00 128083.00 -0.1% test-suite :: MultiSource/Benchmarks/mediabench/g721/g721encode/encode.test 6562.00 6555.00 -0.1% test-suite :: MultiSource/Benchmarks/Prolangs-C/compiler/compiler.test 23428.00 23396.00 -0.1% test-suite :: MultiSource/Benchmarks/FreeBench/fourinarow/fourinarow.test 22749.00 22717.00 -0.1% test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test 39549.00 39485.00 -0.2% test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test 39546.00 39482.00 -0.2% test-suite :: MultiSource/Benchmarks/Prolangs-C/bison/mybison.test 57214.00 57118.00 -0.2% test-suite :: 
SingleSource/Benchmarks/Adobe-C++/loop_unroll.test 413668.00 412804.00 -0.2% test-suite :: MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test 1044047.00 1041487.00 -0.2% test-suite :: MultiSource/Benchmarks/McCat/18-imp/imp.test 12414.00 12382.00 -0.3% test-suite :: MultiSource/Benchmarks/Prolangs-C/gnugo/gnugo.test 31161.00 30969.00 -0.6% test-suite :: MultiSource/Benchmarks/MallocBench/espresso/espresso.test 224726.00 223254.00 -0.7% test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/miniFE/miniFE.test 93512.00 92824.00 -0.7% test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test 281151.00 278463.00 -1.0% test-suite :: MultiSource/Benchmarks/Olden/tsp/tsp.test 2820.00 2788.00 -1.1% test-suite :: External/SPEC/CFP2006/433.milc/433.milc.test 156819.00 154739.00 -1.3% test-suite :: MultiSource/Benchmarks/MiBench/security-blowfish/security-blowfish.test 11560.00 11160.00 -3.5% test-suite :: MultiSource/Benchmarks/McCat/08-main/main.test 6734.00 6382.00 -5.2% results results0 diff ASCI_Purple/SMG2000 - extra vector code VPlanNativePath/outer-loop-vect - extra vectorization, better vector code AVX512BWVL/Vector-AVX512BWVL-psadbw - better vector code Rodinia/hotspot - small variations CINT2017speed/625.x264_s CINT2017rate/525.x264_r - extra vector code, better vectorization BenchmarkGame/n-body - better vector code. 
AVX512BWVL/Vector-AVX512BWVL-unpack_msasm - small variations Misc/flops - extra vector code AVX512BWVL/Vector-AVX512BWVL-mask_set_bw - small variations Misc-C++/Large - better vector code CFP2017rate/526.blender_r - extra vector code Prolangs-C++/city - extra vector code MiBench/consumer-lame - extra vector code CoyoteBench/fftbench - extra vector code Olden/power - better vector code CFP2017rate/538.imagick_r CFP2017speed/638.imagick_s - extra vector code CINT2017rate/531.deepsjeng_r - extra vector code CFP2006/447.dealII - small variations DOE-ProxyApps-C/miniAMR - small variations Prolangs-C/unix-smail - small variations DOE-ProxyApps-C++/PENNANT - small variations CINT2006/473.astar - small variations CFP2006/453.povray - small variations JM/lencod - extra vector code CFP2017rate/511.povray_r - small variations DOE-ProxyApps-C/CoMD - small variations CFP2006/470.lbm - extra vector code CFP2017speed/619.lbm_s CFP2017rate/519.lbm_r - extra vector code CINT2006/456.hmmer - extra code vectorized TSVC/ControlFlow-dbl - extra vector code CFP2017rate/510.parest_r - better vector code LCALS/SubsetALambdaLoops - extra code vectorized LCALS/SubsetARawLoops - extra code vectorized CFP2006/444.namd - extra code vectorized CFP2006/482.sphinx3 - better vector code Applications/siod - better vector code Benchmarks/7zip - better vector code DOE-ProxyApps-C++/CLAMR - extra code vectorized Applications/sqlite3 - extra code vectorized Applications/ClamAV - smaller vector code MallocBench/gs - small variations MicroBenchmarks/ImageProcessing - small variations TSVC/StatementReordering-flt - extra code vectorized CINT2006/471.omnetpp - small variations CINT2006/403.gcc - extra code vectorized CINT2006/483.xalancbmk - extra code vectorized JM/ldecod - small variations Applications/lua - extra code vectorized mafft/pairlocalalign - small variations TSVC/NodeSplitting-flt - extra code vectorized TSVC/Recurrences-flt - extra code vectorized TSVC/InductionVariable-flt - extra code 
vectorized FreeBench/pifft - small variations CINT2006/464.h264ref - extra code vectorized CINT2017speed/602.gcc_s CINT2017rate/502.gcc_r - some extra code vectorized, extra code inlined CINT2006/445.gobmk - small variations Applications/oggenc - small variations TSVC/LoopRestructuring-flt - extra code vectorized TSVC/CrossingThresholds-flt - extra code vectorized CFP2017rate/508.namd_r - small variations TSVC/ControlFlow-flt - extra code vectorized mediabench/g721 - small variations Prolangs-C/compiler - small variations FreeBench/fourinarow - better vector code MiBench/telecomm-gsm - small variation in vector code mediabench/gsm - same Prolangs-C/bison - small variations Adobe-C++/loop_unroll - extra code vectorized Benchmarks/tramp3d-v4 - extra code gets inlined, small changes in vector code McCat/18-imp - variations in vector code Prolangs-C/gnugo - variations in vector code MallocBench/espresso - extra code vectorized DOE-ProxyApps-C++/miniFE - small variations in vector code Prolangs-C/TimberWolfMC - extra code vectorized, small changes in previously vectorized code. Olden/tsp - small changes in vector code CFP2006/433.milc - extra code gets inlined, vectorized 2 x stores to 4 x stores MiBench/security-blowfish - extra code vectorized McCat/08-main - better vector code. 
Metric: size..text, RISCV, sifive-p670 Program size..text results results0 diff test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/miniFE/miniFE.test 63580.00 64020.00 0.7% test-suite :: MultiSource/Benchmarks/MiBench/automotive-susan/automotive-susan.test 21388.00 21406.00 0.1% test-suite :: MultiSource/Benchmarks/Bullet/bullet.test 296992.00 297088.00 0.0% test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test 968112.00 968208.00 0.0% test-suite :: MultiSource/Benchmarks/TSVC/StatementReordering-dbl/StatementReordering-dbl.test 45160.00 45164.00 0.0% test-suite :: External/SPEC/CINT2017rate/523.xalancbmk_r/523.xalancbmk_r.test 2635902.00 2635854.00 -0.0% test-suite :: External/SPEC/CINT2017speed/623.xalancbmk_s/623.xalancbmk_s.test 2635902.00 2635854.00 -0.0% test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test 7568730.00 7568578.00 -0.0% test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test 7568730.00 7568578.00 -0.0% test-suite :: MultiSource/Benchmarks/TSVC/CrossingThresholds-flt/CrossingThresholds-flt.test 49764.00 49762.00 -0.0% test-suite :: MultiSource/Applications/sqlite3/sqlite3.test 449132.00 449108.00 -0.0% test-suite :: MultiSource/Applications/JM/lencod/lencod.test 695932.00 695892.00 -0.0% test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 508820.00 508788.00 -0.0% test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 508820.00 508788.00 -0.0% test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 9594152.00 9593336.00 -0.0% test-suite :: MultiSource/Benchmarks/ASCI_Purple/SMG2000/smg2000.test 166522.00 166490.00 -0.0% test-suite :: External/SPEC/CFP2017rate/508.namd_r/508.namd_r.test 722252.00 722092.00 -0.0% test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test 27554.00 27546.00 -0.0% test-suite :: SingleSource/UnitTests/Vectorizer/VPlanNativePath/outer-loop-vect.test 10900.00 10896.00 -0.0% test-suite :: 
MultiSource/Benchmarks/TSVC/CrossingThresholds-dbl/CrossingThresholds-dbl.test 46754.00 46732.00 -0.0% test-suite :: MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test 631570.00 631226.00 -0.1% test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test 850698.00 850218.00 -0.1% test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test 24816.00 24800.00 -0.1% test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test 24814.00 24798.00 -0.1% test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 1599946.00 1598394.00 -0.1% test-suite :: MultiSource/Applications/hbd/hbd.test 27236.00 27204.00 -0.1% test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test 293848.00 293480.00 -0.1% test-suite :: MultiSource/Benchmarks/Prolangs-C/compiler/compiler.test 20160.00 20048.00 -0.6% test-suite :: MultiSource/Benchmarks/MallocBench/espresso/espresso.test 182088.00 181040.00 -0.6% test-suite :: MultiSource/Benchmarks/mediabench/g721/g721encode/encode.test 4788.00 4748.00 -0.8% DOE-ProxyApps-C++/miniFE - extra vector code MiBench/automotive-susan - small variations Benchmarks/Bullet - extra vector code CFP2017rate/511.povray_r - slightly better vector code TSVC/StatementReordering-dbl - small variations CINT2017rate/523.xalancbmk_r CINT2017speed/623.xalancbmk_s - extra vector code CINT2017rate/502.gcc_r CINT2017speed/602.gcc_s - extra vector code TSVC/CrossingThresholds-flt - small variations Applications/sqlite3 - extra vector code JM/lencod - extra vector code, small variations CINT2017rate/525.x264_r CINT2017speed/625.x264_s - small variations CFP2017rate/526.blender_r - extra vector code, small variations DOE-ProxyApps-C/miniGMG - small variations Vectorizer/VPlanNativePath/outer-loop-vect - small variations TSVC/CrossingThresholds-dbl - small variations Benchmarks/tramp3d-v4 - small variations Benchmarks/7zip - extra vector code MiBench/telecomm-gsm - small variations mediabench/gsm/toast - small variations 
CFP2017rate/510.parest_r - extra vector code Applications/hbd - extra vector code JM/ldecod - better vector code Prolangs-C/compiler - extra vector code MallocBench/espresso - extra vector code mediabench/g721/g721encode - extra vectorization Reviewers: RKSimon Reviewed By: RKSimon Pull Request: https://github.com/llvm/llvm-project/pull/107461	2024-09-21 15:41:06 -07:00
Alexey Bataev	e588fd994f	Revert "[SLP]Vectorize gathered loads" This reverts commit dc2deb53131b9d4c5e881229190bdda1ca3ea47f to fix the issue reported in https://lab.llvm.org/buildbot/#/builders/25/builds/2668	2024-09-21 15:25:36 -07:00
Alexey Bataev	dc2deb5313	[SLP]Vectorize gathered loads Final gather/buildvector nodes may have scalar loads, which are not vectorized (since they are part of the gather nodes) but may form full vector loads, being combined. This patch walks over all gather nodes, "gathering" and sorting gathered scalar loads and then tries to build vector loads, which later are reshuffled between the gather nodes. It allows later to add support for segmented loads (kind of AOS to SOA load kind for RISC-V RVV) and may help with the removal of the alternate opcodes support. Currently, alternate nodes may depend on each other because of the consecutive loads between their operands. Because of that we cannot simply remove alternate vectorization. But this approach may help to remove most of the stuff for it, since we'll be able to vectorize loads in between lanes. Metric: size..text, AVX512 Program size..text test-suite :: MultiSource/Benchmarks/ASCI_Purple/SMG2000/smg2000.test 238381.00 250669.00 5.2% test-suite :: SingleSource/UnitTests/Vectorizer/VPlanNativePath/outer-loop-vect.test 25753.00 26329.00 2.2% test-suite :: SingleSource/UnitTests/Vector/AVX512BWVL/Vector-AVX512BWVL-psadbw.test 3028.00 3092.00 2.1% test-suite :: MultiSource/Benchmarks/Rodinia/hotspot/hotspot.test 4243.00 4275.00 0.8% test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 649765.00 653877.00 0.6% test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 649765.00 653877.00 0.6% test-suite :: SingleSource/Benchmarks/BenchmarkGame/n-body.test 4199.00 4222.00 0.5% test-suite :: SingleSource/UnitTests/Vector/AVX512BWVL/Vector-AVX512BWVL-mask_set_bw.test 12933.00 12997.00 0.5% test-suite :: SingleSource/Benchmarks/Misc/flops.test 8282.00 8314.00 0.4% test-suite :: SingleSource/UnitTests/Vector/AVX512BWVL/Vector-AVX512BWVL-unpack_msasm.test 10065.00 10097.00 0.3% test-suite :: SingleSource/Benchmarks/Misc-C++/Large/ray.test 5160.00 5176.00 0.3% test-suite :: 
External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12472220.00 12509612.00 0.3% test-suite :: MultiSource/Benchmarks/Prolangs-C++/city/city.test 6908.00 6924.00 0.2% test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test 202830.00 203278.00 0.2% test-suite :: SingleSource/Benchmarks/CoyoteBench/fftbench.test 9133.00 9149.00 0.2% test-suite :: MultiSource/Benchmarks/Olden/power/power.test 6792.00 6803.00 0.2% test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test 1395585.00 1397473.00 0.1% test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test 1395585.00 1397473.00 0.1% test-suite :: External/SPEC/CINT2017speed/631.deepsjeng_s/631.deepsjeng_s.test 97662.00 97758.00 0.1% test-suite :: External/SPEC/CFP2006/447.dealII/447.dealII.test 595179.00 595739.00 0.1% test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniAMR/miniAMR.test 70603.00 70667.00 0.1% test-suite :: MultiSource/Benchmarks/Prolangs-C/unix-smail/unix-smail.test 19877.00 19893.00 0.1% test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/PENNANT/PENNANT.test 90231.00 90279.00 0.1% test-suite :: External/SPEC/CINT2006/473.astar/473.astar.test 33738.00 33754.00 0.0% test-suite :: External/SPEC/CFP2017speed/619.lbm_s/619.lbm_s.test 13262.00 13268.00 0.0% test-suite :: External/SPEC/CFP2006/453.povray/453.povray.test 1139964.00 1140460.00 0.0% test-suite :: MultiSource/Applications/JM/lencod/lencod.test 849507.00 849875.00 0.0% test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test 1158379.00 1158859.00 0.0% test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/CoMD/CoMD.test 38724.00 38740.00 0.0% test-suite :: External/SPEC/CFP2006/470.lbm/470.lbm.test 15180.00 15186.00 0.0% test-suite :: External/SPEC/CFP2017rate/519.lbm_r/519.lbm_r.test 15484.00 15490.00 0.0% test-suite :: External/SPEC/CINT2006/456.hmmer/456.hmmer.test 167391.00 167455.00 0.0% test-suite :: 
MultiSource/Benchmarks/TSVC/ControlFlow-dbl/ControlFlow-dbl.test 137448.00 137496.00 0.0% test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 2030254.00 2030766.00 0.0% test-suite :: MicroBenchmarks/LCALS/SubsetALambdaLoops/lcalsALambda.test 302870.00 302934.00 0.0% test-suite :: MicroBenchmarks/LCALS/SubsetARawLoops/lcalsARaw.test 303126.00 303190.00 0.0% test-suite :: External/SPEC/CFP2006/444.namd/444.namd.test 241107.00 241155.00 0.0% test-suite :: External/SPEC/CFP2006/482.sphinx3/482.sphinx3.test 162974.00 163006.00 0.0% test-suite :: MultiSource/Applications/siod/siod.test 167168.00 167200.00 0.0% test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test 1048796.00 1048988.00 0.0% test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/CLAMR/CLAMR.test 201623.00 201655.00 0.0% test-suite :: MultiSource/Applications/sqlite3/sqlite3.test 501734.00 501798.00 0.0% test-suite :: MultiSource/Applications/ClamAV/clamscan.test 580888.00 580952.00 0.0% test-suite :: MultiSource/Benchmarks/MallocBench/gs/gs.test 168319.00 168335.00 0.0% test-suite :: MicroBenchmarks/ImageProcessing/Interpolation/Interpolation.test 226022.00 226038.00 0.0% test-suite :: MultiSource/Benchmarks/TSVC/StatementReordering-flt/StatementReordering-flt.test 118011.00 118015.00 0.0% test-suite :: External/SPEC/CINT2006/471.omnetpp/471.omnetpp.test 550589.00 550605.00 0.0% test-suite :: External/SPEC/CINT2006/403.gcc/403.gcc.test 3072477.00 3072541.00 0.0% test-suite :: External/SPEC/CINT2006/483.xalancbmk/483.xalancbmk.test 2385563.00 2385579.00 0.0% test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test 389171.00 389155.00 -0.0% test-suite :: MultiSource/Applications/lua/lua.test 234764.00 234748.00 -0.0% test-suite :: MultiSource/Benchmarks/mafft/pairlocalalign.test 227694.00 227678.00 -0.0% test-suite :: MultiSource/Benchmarks/TSVC/NodeSplitting-flt/NodeSplitting-flt.test 119819.00 119807.00 -0.0% test-suite :: 
MultiSource/Benchmarks/TSVC/Recurrences-flt/Recurrences-flt.test 117995.00 117983.00 -0.0% test-suite :: MultiSource/Benchmarks/TSVC/InductionVariable-flt/InductionVariable-flt.test 123610.00 123594.00 -0.0% test-suite :: MultiSource/Benchmarks/FreeBench/pifft/pifft.test 81414.00 81398.00 -0.0% test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 782040.00 781880.00 -0.0% test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test 9597420.00 9595292.00 -0.0% test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test 9597420.00 9595292.00 -0.0% test-suite :: External/SPEC/CINT2006/445.gobmk/445.gobmk.test 911832.00 911608.00 -0.0% test-suite :: MultiSource/Applications/oggenc/oggenc.test 192507.00 192459.00 -0.0% test-suite :: MultiSource/Benchmarks/TSVC/LoopRestructuring-flt/LoopRestructuring-flt.test 122843.00 122811.00 -0.0% test-suite :: MultiSource/Benchmarks/TSVC/CrossingThresholds-flt/CrossingThresholds-flt.test 122292.00 122260.00 -0.0% test-suite :: External/SPEC/CFP2017rate/508.namd_r/508.namd_r.test 777363.00 777155.00 -0.0% test-suite :: MultiSource/Benchmarks/TSVC/Expansion-flt/Expansion-flt.test 123265.00 123205.00 -0.0% test-suite :: MultiSource/Benchmarks/Bullet/bullet.test 315534.00 315358.00 -0.1% test-suite :: MultiSource/Benchmarks/TSVC/ControlFlow-flt/ControlFlow-flt.test 128163.00 128083.00 -0.1% test-suite :: MultiSource/Benchmarks/mediabench/g721/g721encode/encode.test 6562.00 6555.00 -0.1% test-suite :: MultiSource/Benchmarks/Prolangs-C/compiler/compiler.test 23428.00 23396.00 -0.1% test-suite :: MultiSource/Benchmarks/FreeBench/fourinarow/fourinarow.test 22749.00 22717.00 -0.1% test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test 39549.00 39485.00 -0.2% test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test 39546.00 39482.00 -0.2% test-suite :: MultiSource/Benchmarks/Prolangs-C/bison/mybison.test 57214.00 57118.00 -0.2% test-suite :: 
SingleSource/Benchmarks/Adobe-C++/loop_unroll.test 413668.00 412804.00 -0.2% test-suite :: MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test 1044047.00 1041487.00 -0.2% test-suite :: MultiSource/Benchmarks/McCat/18-imp/imp.test 12414.00 12382.00 -0.3% test-suite :: MultiSource/Benchmarks/Prolangs-C/gnugo/gnugo.test 31161.00 30969.00 -0.6% test-suite :: MultiSource/Benchmarks/MallocBench/espresso/espresso.test 224726.00 223254.00 -0.7% test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/miniFE/miniFE.test 93512.00 92824.00 -0.7% test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test 281151.00 278463.00 -1.0% test-suite :: MultiSource/Benchmarks/Olden/tsp/tsp.test 2820.00 2788.00 -1.1% test-suite :: External/SPEC/CFP2006/433.milc/433.milc.test 156819.00 154739.00 -1.3% test-suite :: MultiSource/Benchmarks/MiBench/security-blowfish/security-blowfish.test 11560.00 11160.00 -3.5% test-suite :: MultiSource/Benchmarks/McCat/08-main/main.test 6734.00 6382.00 -5.2% results results0 diff ASCI_Purple/SMG2000 - extra vector code VPlanNativePath/outer-loop-vect - extra vectorization, better vector code AVX512BWVL/Vector-AVX512BWVL-psadbw - better vector code Rodinia/hotspot - small variations CINT2017speed/625.x264_s CINT2017rate/525.x264_r - extra vector code, better vectorization BenchmarkGame/n-body - better vector code. 
AVX512BWVL/Vector-AVX512BWVL-unpack_msasm - small variations Misc/flops - extra vector code AVX512BWVL/Vector-AVX512BWVL-mask_set_bw - small variations Misc-C++/Large - better vector code CFP2017rate/526.blender_r - extra vector code Prolangs-C++/city - extra vector code MiBench/consumer-lame - extra vector code CoyoteBench/fftbench - extra vector code Olden/power - better vector code CFP2017rate/538.imagick_r CFP2017speed/638.imagick_s - extra vector code CINT2017rate/531.deepsjeng_r - extra vector code CFP2006/447.dealII - small variations DOE-ProxyApps-C/miniAMR - small variations Prolangs-C/unix-smail - small variations DOE-ProxyApps-C++/PENNANT - small variations CINT2006/473.astar - small variations CFP2006/453.povray - small variations JM/lencod - extra vector code CFP2017rate/511.povray_r - small variations DOE-ProxyApps-C/CoMD - small variations CFP2006/470.lbm - extra vector code CFP2017speed/619.lbm_s CFP2017rate/519.lbm_r - extra vector code CINT2006/456.hmmer - extra code vectorized TSVC/ControlFlow-dbl - extra vector code CFP2017rate/510.parest_r - better vector code LCALS/SubsetALambdaLoops - extra code vectorized LCALS/SubsetARawLoops - extra code vectorized CFP2006/444.namd - extra code vectorized CFP2006/482.sphinx3 - better vector code Applications/siod - better vector code Benchmarks/7zip - better vector code DOE-ProxyApps-C++/CLAMR - extra code vectorized Applications/sqlite3 - extra code vectorized Applications/ClamAV - smaller vector code MallocBench/gs - small variations MicroBenchmarks/ImageProcessing - small variations TSVC/StatementReordering-flt - extra code vectorized CINT2006/471.omnetpp - small variations CINT2006/403.gcc - extra code vectorized CINT2006/483.xalancbmk - extra code vectorized JM/ldecod - small variations Applications/lua - extra code vectorized mafft/pairlocalalign - small variations TSVC/NodeSplitting-flt - extra code vectorized TSVC/Recurrences-flt - extra code vectorized TSVC/InductionVariable-flt - extra code 
vectorized FreeBench/pifft - small variations CINT2006/464.h264ref - extra code vectorized CINT2017speed/602.gcc_s CINT2017rate/502.gcc_r - some extra code vectorized, extra code inlined CINT2006/445.gobmk - small variations Applications/oggenc - small variations TSVC/LoopRestructuring-flt - extra code vectorized TSVC/CrossingThresholds-flt - extra code vectorized CFP2017rate/508.namd_r - small variations TSVC/ControlFlow-flt - extra code vectorized mediabench/g721 - small variations Prolangs-C/compiler - small variations FreeBench/fourinarow - better vector code MiBench/telecomm-gsm - small variation in vector code mediabench/gsm - same Prolangs-C/bison - small variations Adobe-C++/loop_unroll - extra code vectorized Benchmarks/tramp3d-v4 - extra code gets inlined, small changes in vector code McCat/18-imp - variations in vector code Prolangs-C/gnugo - variations in vector code MallocBench/espresso - extra code vectorized DOE-ProxyApps-C++/miniFE - small variations in vector code Prolangs-C/TimberWolfMC - extra code vectorized, small changes in previously vectorized code. Olden/tsp - small changes in vector code CFP2006/433.milc - extra code gets inlined, vectorized 2 x stores to 4 x stores MiBench/security-blowfish - extra code vectorized McCat/08-main - better vector code.
Metric: size..text, RISCV, sifive-p670 Program size..text results results0 diff test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/miniFE/miniFE.test 63580.00 64020.00 0.7% test-suite :: MultiSource/Benchmarks/MiBench/automotive-susan/automotive-susan.test 21388.00 21406.00 0.1% test-suite :: MultiSource/Benchmarks/Bullet/bullet.test 296992.00 297088.00 0.0% test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test 968112.00 968208.00 0.0% test-suite :: MultiSource/Benchmarks/TSVC/StatementReordering-dbl/StatementReordering-dbl.test 45160.00 45164.00 0.0% test-suite :: External/SPEC/CINT2017rate/523.xalancbmk_r/523.xalancbmk_r.test 2635902.00 2635854.00 -0.0% test-suite :: External/SPEC/CINT2017speed/623.xalancbmk_s/623.xalancbmk_s.test 2635902.00 2635854.00 -0.0% test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test 7568730.00 7568578.00 -0.0% test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test 7568730.00 7568578.00 -0.0% test-suite :: MultiSource/Benchmarks/TSVC/CrossingThresholds-flt/CrossingThresholds-flt.test 49764.00 49762.00 -0.0% test-suite :: MultiSource/Applications/sqlite3/sqlite3.test 449132.00 449108.00 -0.0% test-suite :: MultiSource/Applications/JM/lencod/lencod.test 695932.00 695892.00 -0.0% test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 508820.00 508788.00 -0.0% test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 508820.00 508788.00 -0.0% test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 9594152.00 9593336.00 -0.0% test-suite :: MultiSource/Benchmarks/ASCI_Purple/SMG2000/smg2000.test 166522.00 166490.00 -0.0% test-suite :: External/SPEC/CFP2017rate/508.namd_r/508.namd_r.test 722252.00 722092.00 -0.0% test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test 27554.00 27546.00 -0.0% test-suite :: SingleSource/UnitTests/Vectorizer/VPlanNativePath/outer-loop-vect.test 10900.00 10896.00 -0.0% test-suite :: 
MultiSource/Benchmarks/TSVC/CrossingThresholds-dbl/CrossingThresholds-dbl.test 46754.00 46732.00 -0.0% test-suite :: MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test 631570.00 631226.00 -0.1% test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test 850698.00 850218.00 -0.1% test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test 24816.00 24800.00 -0.1% test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test 24814.00 24798.00 -0.1% test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 1599946.00 1598394.00 -0.1% test-suite :: MultiSource/Applications/hbd/hbd.test 27236.00 27204.00 -0.1% test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test 293848.00 293480.00 -0.1% test-suite :: MultiSource/Benchmarks/Prolangs-C/compiler/compiler.test 20160.00 20048.00 -0.6% test-suite :: MultiSource/Benchmarks/MallocBench/espresso/espresso.test 182088.00 181040.00 -0.6% test-suite :: MultiSource/Benchmarks/mediabench/g721/g721encode/encode.test 4788.00 4748.00 -0.8% DOE-ProxyApps-C++/miniFE - extra vector code MiBench/automotive-susan - small variations Benchmarks/Bullet - extra vector code CFP2017rate/511.povray_r - slightly better vector code TSVC/StatementReordering-dbl - small variations CINT2017rate/523.xalancbmk_r CINT2017speed/623.xalancbmk_s - extra vector code CINT2017rate/502.gcc_r CINT2017speed/602.gcc_s - extra vector code TSVC/CrossingThresholds-flt - small variations Applications/sqlite3 - extra vector code JM/lencod - extra vector code, small variations CINT2017rate/525.x264_r CINT2017speed/625.x264_s - small variations CFP2017rate/526.blender_r - extra vector code, small variations DOE-ProxyApps-C/miniGMG - small variations Vectorizer/VPlanNativePath/outer-loop-vect - small variations TSVC/CrossingThresholds-dbl - small variations Benchmarks/tramp3d-v4 - small variations Benchmarks/7zip - extra vector code MiBench/telecomm-gsm - small variations mediabench/gsm/toast - small variations 
CFP2017rate/510.parest_r - extra vector code Applications/hbd - extra vector code JM/ldecod - better vector code Prolangs-C/compiler - extra vector code MallocBench/espresso - extra vector code mediabench/g721/g721encode - extra vectorization Reviewers: RKSimon Reviewed By: RKSimon Pull Request: https://github.com/llvm/llvm-project/pull/107461	2024-09-21 14:08:23 -07:00
Nikita Popov	848cec11f5	Revert "[SLP]Vectorize gathered loads" This reverts commit de1f5b96adcea52bf7c9670c46123fe1197050d2. This has a very large compile-time impact in some cases, in particular lencod. See: http://llvm-compile-time-tracker.com/compare.php?from=b1339abb713063363e7804124b8fb3d84143a003&to=de1f5b96adcea52bf7c9670c46123fe1197050d2&stat=instructions:u	2024-09-17 16:45:25 +02:00
Alexey Bataev	de1f5b96ad	[SLP]Vectorize gathered loads Final gather/buildvector nodes may have scalar loads, which are not vectorized (since they are part of the gather nodes) but may form full vector loads, being combined. This patch walks over all gather nodes, "gathering" and sorting gathered scalar loads and then tries to build vector loads, which later are reshuffled between the gather nodes. It allows later to add support for segmented loads (kind of AOS to SOA load kind for RISC-V RVV) and may help with the removal of the alternate opcodes support. Currently, alternate nodes may depend on each other because of the consecutive loads between their operands. Because of that we cannot simply remove alternate vectorization. But this approach may help to remove most of the stuff for it, since we'll be able to vectorize loads in between lanes. Metric: size..text, AVX512 Program size..text test-suite :: MultiSource/Benchmarks/ASCI_Purple/SMG2000/smg2000.test 238381.00 250669.00 5.2% test-suite :: SingleSource/UnitTests/Vectorizer/VPlanNativePath/outer-loop-vect.test 25753.00 26329.00 2.2% test-suite :: SingleSource/UnitTests/Vector/AVX512BWVL/Vector-AVX512BWVL-psadbw.test 3028.00 3092.00 2.1% test-suite :: MultiSource/Benchmarks/Rodinia/hotspot/hotspot.test 4243.00 4275.00 0.8% test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 649765.00 653877.00 0.6% test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 649765.00 653877.00 0.6% test-suite :: SingleSource/Benchmarks/BenchmarkGame/n-body.test 4199.00 4222.00 0.5% test-suite :: SingleSource/UnitTests/Vector/AVX512BWVL/Vector-AVX512BWVL-mask_set_bw.test 12933.00 12997.00 0.5% test-suite :: SingleSource/Benchmarks/Misc/flops.test 8282.00 8314.00 0.4% test-suite :: SingleSource/UnitTests/Vector/AVX512BWVL/Vector-AVX512BWVL-unpack_msasm.test 10065.00 10097.00 0.3% test-suite :: SingleSource/Benchmarks/Misc-C++/Large/ray.test 5160.00 5176.00 0.3% test-suite ::
External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12472220.00 12509612.00 0.3% test-suite :: MultiSource/Benchmarks/Prolangs-C++/city/city.test 6908.00 6924.00 0.2% test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test 202830.00 203278.00 0.2% test-suite :: SingleSource/Benchmarks/CoyoteBench/fftbench.test 9133.00 9149.00 0.2% test-suite :: MultiSource/Benchmarks/Olden/power/power.test 6792.00 6803.00 0.2% test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test 1395585.00 1397473.00 0.1% test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test 1395585.00 1397473.00 0.1% test-suite :: External/SPEC/CINT2017speed/631.deepsjeng_s/631.deepsjeng_s.test 97662.00 97758.00 0.1% test-suite :: External/SPEC/CFP2006/447.dealII/447.dealII.test 595179.00 595739.00 0.1% test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniAMR/miniAMR.test 70603.00 70667.00 0.1% test-suite :: MultiSource/Benchmarks/Prolangs-C/unix-smail/unix-smail.test 19877.00 19893.00 0.1% test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/PENNANT/PENNANT.test 90231.00 90279.00 0.1% test-suite :: External/SPEC/CINT2006/473.astar/473.astar.test 33738.00 33754.00 0.0% test-suite :: External/SPEC/CFP2017speed/619.lbm_s/619.lbm_s.test 13262.00 13268.00 0.0% test-suite :: External/SPEC/CFP2006/453.povray/453.povray.test 1139964.00 1140460.00 0.0% test-suite :: MultiSource/Applications/JM/lencod/lencod.test 849507.00 849875.00 0.0% test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test 1158379.00 1158859.00 0.0% test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/CoMD/CoMD.test 38724.00 38740.00 0.0% test-suite :: External/SPEC/CFP2006/470.lbm/470.lbm.test 15180.00 15186.00 0.0% test-suite :: External/SPEC/CFP2017rate/519.lbm_r/519.lbm_r.test 15484.00 15490.00 0.0% test-suite :: External/SPEC/CINT2006/456.hmmer/456.hmmer.test 167391.00 167455.00 0.0% test-suite :: 
MultiSource/Benchmarks/TSVC/ControlFlow-dbl/ControlFlow-dbl.test 137448.00 137496.00 0.0% test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 2030254.00 2030766.00 0.0% test-suite :: MicroBenchmarks/LCALS/SubsetALambdaLoops/lcalsALambda.test 302870.00 302934.00 0.0% test-suite :: MicroBenchmarks/LCALS/SubsetARawLoops/lcalsARaw.test 303126.00 303190.00 0.0% test-suite :: External/SPEC/CFP2006/444.namd/444.namd.test 241107.00 241155.00 0.0% test-suite :: External/SPEC/CFP2006/482.sphinx3/482.sphinx3.test 162974.00 163006.00 0.0% test-suite :: MultiSource/Applications/siod/siod.test 167168.00 167200.00 0.0% test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test 1048796.00 1048988.00 0.0% test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/CLAMR/CLAMR.test 201623.00 201655.00 0.0% test-suite :: MultiSource/Applications/sqlite3/sqlite3.test 501734.00 501798.00 0.0% test-suite :: MultiSource/Applications/ClamAV/clamscan.test 580888.00 580952.00 0.0% test-suite :: MultiSource/Benchmarks/MallocBench/gs/gs.test 168319.00 168335.00 0.0% test-suite :: MicroBenchmarks/ImageProcessing/Interpolation/Interpolation.test 226022.00 226038.00 0.0% test-suite :: MultiSource/Benchmarks/TSVC/StatementReordering-flt/StatementReordering-flt.test 118011.00 118015.00 0.0% test-suite :: External/SPEC/CINT2006/471.omnetpp/471.omnetpp.test 550589.00 550605.00 0.0% test-suite :: External/SPEC/CINT2006/403.gcc/403.gcc.test 3072477.00 3072541.00 0.0% test-suite :: External/SPEC/CINT2006/483.xalancbmk/483.xalancbmk.test 2385563.00 2385579.00 0.0% test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test 389171.00 389155.00 -0.0% test-suite :: MultiSource/Applications/lua/lua.test 234764.00 234748.00 -0.0% test-suite :: MultiSource/Benchmarks/mafft/pairlocalalign.test 227694.00 227678.00 -0.0% test-suite :: MultiSource/Benchmarks/TSVC/NodeSplitting-flt/NodeSplitting-flt.test 119819.00 119807.00 -0.0% test-suite :: 
MultiSource/Benchmarks/TSVC/Recurrences-flt/Recurrences-flt.test 117995.00 117983.00 -0.0% test-suite :: MultiSource/Benchmarks/TSVC/InductionVariable-flt/InductionVariable-flt.test 123610.00 123594.00 -0.0% test-suite :: MultiSource/Benchmarks/FreeBench/pifft/pifft.test 81414.00 81398.00 -0.0% test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 782040.00 781880.00 -0.0% test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test 9597420.00 9595292.00 -0.0% test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test 9597420.00 9595292.00 -0.0% test-suite :: External/SPEC/CINT2006/445.gobmk/445.gobmk.test 911832.00 911608.00 -0.0% test-suite :: MultiSource/Applications/oggenc/oggenc.test 192507.00 192459.00 -0.0% test-suite :: MultiSource/Benchmarks/TSVC/LoopRestructuring-flt/LoopRestructuring-flt.test 122843.00 122811.00 -0.0% test-suite :: MultiSource/Benchmarks/TSVC/CrossingThresholds-flt/CrossingThresholds-flt.test 122292.00 122260.00 -0.0% test-suite :: External/SPEC/CFP2017rate/508.namd_r/508.namd_r.test 777363.00 777155.00 -0.0% test-suite :: MultiSource/Benchmarks/TSVC/Expansion-flt/Expansion-flt.test 123265.00 123205.00 -0.0% test-suite :: MultiSource/Benchmarks/Bullet/bullet.test 315534.00 315358.00 -0.1% test-suite :: MultiSource/Benchmarks/TSVC/ControlFlow-flt/ControlFlow-flt.test 128163.00 128083.00 -0.1% test-suite :: MultiSource/Benchmarks/mediabench/g721/g721encode/encode.test 6562.00 6555.00 -0.1% test-suite :: MultiSource/Benchmarks/Prolangs-C/compiler/compiler.test 23428.00 23396.00 -0.1% test-suite :: MultiSource/Benchmarks/FreeBench/fourinarow/fourinarow.test 22749.00 22717.00 -0.1% test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test 39549.00 39485.00 -0.2% test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test 39546.00 39482.00 -0.2% test-suite :: MultiSource/Benchmarks/Prolangs-C/bison/mybison.test 57214.00 57118.00 -0.2% test-suite :: 
SingleSource/Benchmarks/Adobe-C++/loop_unroll.test 413668.00 412804.00 -0.2% test-suite :: MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test 1044047.00 1041487.00 -0.2% test-suite :: MultiSource/Benchmarks/McCat/18-imp/imp.test 12414.00 12382.00 -0.3% test-suite :: MultiSource/Benchmarks/Prolangs-C/gnugo/gnugo.test 31161.00 30969.00 -0.6% test-suite :: MultiSource/Benchmarks/MallocBench/espresso/espresso.test 224726.00 223254.00 -0.7% test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/miniFE/miniFE.test 93512.00 92824.00 -0.7% test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test 281151.00 278463.00 -1.0% test-suite :: MultiSource/Benchmarks/Olden/tsp/tsp.test 2820.00 2788.00 -1.1% test-suite :: External/SPEC/CFP2006/433.milc/433.milc.test 156819.00 154739.00 -1.3% test-suite :: MultiSource/Benchmarks/MiBench/security-blowfish/security-blowfish.test 11560.00 11160.00 -3.5% test-suite :: MultiSource/Benchmarks/McCat/08-main/main.test 6734.00 6382.00 -5.2% results results0 diff ASCI_Purple/SMG2000 - extra vector code VPlanNativePath/outer-loop-vect - extra vectorization, better vector code AVX512BWVL/Vector-AVX512BWVL-psadbw - better vector code Rodinia/hotspot - small variations CINT2017speed/625.x264_s CINT2017rate/525.x264_r - extra vector code, better vectorization BenchmarkGame/n-body - better vector code. 
AVX512BWVL/Vector-AVX512BWVL-unpack_msasm - small variations Misc/flops - extra vector code AVX512BWVL/Vector-AVX512BWVL-mask_set_bw - small variations Misc-C++/Large - better vector code CFP2017rate/526.blender_r - extra vector code Prolangs-C++/city - extra vector code MiBench/consumer-lame - extra vector code CoyoteBench/fftbench - extra vector code Olden/power - better vector code CFP2017rate/538.imagick_r CFP2017speed/638.imagick_s - extra vector code CINT2017rate/531.deepsjeng_r - extra vector code CFP2006/447.dealII - small variations DOE-ProxyApps-C/miniAMR - small variations Prolangs-C/unix-smail - small variations DOE-ProxyApps-C++/PENNANT - small variations CINT2006/473.astar - small variations CFP2006/453.povray - small variations JM/lencod - extra vector code CFP2017rate/511.povray_r - small variations DOE-ProxyApps-C/CoMD - small variations CFP2006/470.lbm - extra vector code CFP2017speed/619.lbm_s CFP2017rate/519.lbm_r - extra vector code CINT2006/456.hmmer - extra code vectorized TSVC/ControlFlow-dbl - extra vector code CFP2017rate/510.parest_r - better vector code LCALS/SubsetALambdaLoops - extra code vectorized LCALS/SubsetARawLoops - extra code vectorized CFP2006/444.namd - extra code vectorized CFP2006/482.sphinx3 - better vector code Applications/siod - better vector code Benchmarks/7zip - better vector code DOE-ProxyApps-C++/CLAMR - extra code vectorized Applications/sqlite3 - extra code vectorized Applications/ClamAV - smaller vector code MallocBench/gs - small variations MicroBenchmarks/ImageProcessing - small variations TSVC/StatementReordering-flt - extra code vectorized CINT2006/471.omnetpp - small variations CINT2006/403.gcc - extra code vectorized CINT2006/483.xalancbmk - extra code vectorized JM/ldecod - small variations Applications/lua - extra code vectorized mafft/pairlocalalign - small variations TSVC/NodeSplitting-flt - extra code vectorized TSVC/Recurrences-flt - extra code vectorized TSVC/InductionVariable-flt - extra code 
vectorized FreeBench/pifft - small variations CINT2006/464.h264ref - extra code vectorized CINT2017speed/602.gcc_s CINT2017rate/502.gcc_r - some extra code vectorized, extra code inlined CINT2006/445.gobmk - small variations Applications/oggenc - small variations TSVC/LoopRestructuring-flt - extra code vectorized TSVC/CrossingThresholds-flt - extra code vectorized CFP2017rate/508.namd_r - small variations TSVC/ControlFlow-flt - extra code vectorized mediabench/g721 - small variations Prolangs-C/compiler - small variations FreeBench/fourinarow - better vector code MiBench/telecomm-gsm - small variation in vector code mediabench/gsm - same Prolangs-C/bison - small variations Adobe-C++/loop_unroll - extra code vectorized Benchmarks/tramp3d-v4 - extra code gets inlined, small changes in vector code McCat/18-imp - variations in vector code Prolangs-C/gnugo - variations in vector code MallocBench/espresso - extra code vectorized DOE-ProxyApps-C++/miniFE - small variations in vector code Prolangs-C/TimberWolfMC - extra code vectorized, small changes in previously vectorized code. Olden/tsp - small changes in vector code CFP2006/433.milc - extra code gets inlined, vectorized 2 x stores to 4 x stores MiBench/security-blowfish - extra code vectorized McCat/08-main - better vector code.
Metric: size..text, RISCV, sifive-p670 Program size..text results results0 diff test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/miniFE/miniFE.test 63580.00 64020.00 0.7% test-suite :: MultiSource/Benchmarks/MiBench/automotive-susan/automotive-susan.test 21388.00 21406.00 0.1% test-suite :: MultiSource/Benchmarks/Bullet/bullet.test 296992.00 297088.00 0.0% test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test 968112.00 968208.00 0.0% test-suite :: MultiSource/Benchmarks/TSVC/StatementReordering-dbl/StatementReordering-dbl.test 45160.00 45164.00 0.0% test-suite :: External/SPEC/CINT2017rate/523.xalancbmk_r/523.xalancbmk_r.test 2635902.00 2635854.00 -0.0% test-suite :: External/SPEC/CINT2017speed/623.xalancbmk_s/623.xalancbmk_s.test 2635902.00 2635854.00 -0.0% test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test 7568730.00 7568578.00 -0.0% test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test 7568730.00 7568578.00 -0.0% test-suite :: MultiSource/Benchmarks/TSVC/CrossingThresholds-flt/CrossingThresholds-flt.test 49764.00 49762.00 -0.0% test-suite :: MultiSource/Applications/sqlite3/sqlite3.test 449132.00 449108.00 -0.0% test-suite :: MultiSource/Applications/JM/lencod/lencod.test 695932.00 695892.00 -0.0% test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 508820.00 508788.00 -0.0% test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 508820.00 508788.00 -0.0% test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 9594152.00 9593336.00 -0.0% test-suite :: MultiSource/Benchmarks/ASCI_Purple/SMG2000/smg2000.test 166522.00 166490.00 -0.0% test-suite :: External/SPEC/CFP2017rate/508.namd_r/508.namd_r.test 722252.00 722092.00 -0.0% test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test 27554.00 27546.00 -0.0% test-suite :: SingleSource/UnitTests/Vectorizer/VPlanNativePath/outer-loop-vect.test 10900.00 10896.00 -0.0% test-suite :: 
MultiSource/Benchmarks/TSVC/CrossingThresholds-dbl/CrossingThresholds-dbl.test 46754.00 46732.00 -0.0% test-suite :: MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test 631570.00 631226.00 -0.1% test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test 850698.00 850218.00 -0.1% test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test 24816.00 24800.00 -0.1% test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test 24814.00 24798.00 -0.1% test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 1599946.00 1598394.00 -0.1% test-suite :: MultiSource/Applications/hbd/hbd.test 27236.00 27204.00 -0.1% test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test 293848.00 293480.00 -0.1% test-suite :: MultiSource/Benchmarks/Prolangs-C/compiler/compiler.test 20160.00 20048.00 -0.6% test-suite :: MultiSource/Benchmarks/MallocBench/espresso/espresso.test 182088.00 181040.00 -0.6% test-suite :: MultiSource/Benchmarks/mediabench/g721/g721encode/encode.test 4788.00 4748.00 -0.8% DOE-ProxyApps-C++/miniFE - extra vector code MiBench/automotive-susan - small variations Benchmarks/Bullet - extra vector code CFP2017rate/511.povray_r - slightly better vector code TSVC/StatementReordering-dbl - small variations CINT2017rate/523.xalancbmk_r CINT2017speed/623.xalancbmk_s - extra vector code CINT2017rate/502.gcc_r CINT2017speed/602.gcc_s - extra vector code TSVC/CrossingThresholds-flt - small variations Applications/sqlite3 - extra vector code JM/lencod - extra vector code, small variations CINT2017rate/525.x264_r CINT2017speed/625.x264_s - small variations CFP2017rate/526.blender_r - extra vector code, small variations DOE-ProxyApps-C/miniGMG - small variations Vectorizer/VPlanNativePath/outer-loop-vect - small variations TSVC/CrossingThresholds-dbl - small variations Benchmarks/tramp3d-v4 - small variations Benchmarks/7zip - extra vector code MiBench/telecomm-gsm - small variations mediabench/gsm/toast - small variations 
CFP2017rate/510.parest_r - extra vector code Applications/hbd - extra vector code JM/ldecod - better vector code Prolangs-C/compiler - extra vector code MallocBench/espresso - extra vector code mediabench/g721/g721encode - extra vectorization Reviewers: RKSimon Reviewed By: RKSimon Pull Request: https://github.com/llvm/llvm-project/pull/107461	2024-09-17 06:57:47 -04:00
Sushant Gokhale	d37d05795d	[SLP][AArch64] Fix test failure for PR #106507 (#108442) Updating the failing test in this patch.	2024-09-13 00:51:49 +05:30
Sushant Gokhale	7a6945fcf6	[AArch64][SLP] Add NFC test cases for floating point reductions (#106507) A successive patch would be added to fix some of the tests. Pull request: #106507	2024-09-12 23:07:12 +05:30
Philip Reames	63e8a1b16f	[SLP] Enable reordering for non-power-of-two vectors (#106638) This change tries to enable vector reordering during vectorization for non-power-of-two vectors. Specifically, my goal is to be able to vectorize reductions whose operands appear in other than identity order. (i.e. a[1] + a[0] + a[2]). In our standard pass pipeline, Reassociation effectively canonicalizes towards this form. So for reduction vectorization to be widely applicable, we need this feature. This change enables the use of a non-empty ReorderIndices structure - which is effectively required for out of order loads or gathers - while leaving the ReuseShuffleIndices mechanism unused and disabled. If I've understood the code structure, the former is used when describing implicit shuffles required by the vectorization strategy (i.e. loading elements 0,1,3,2 in the order 0,1,2,3 and then shuffling later), while the latter is used when trying to optimize explode/buildvectors (called gathers in this code). I audited all the code enabled by this change, but can't claim to deeply understand most of it. I added a couple of bailouts in places which appeared to be difficult to audit and optional optimizations. I've tried to do so in the least risky way I can, but am not completely confident in this change. Careful review appreciated.	2024-09-05 07:52:27 -07:00
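The reordering idea in the commit above can be illustrated with a small model. This is only a sketch under stated assumptions, not the SLP vectorizer's implementation: the function names `reorder_indices` and `reduce_add` are hypothetical, the "vector load" is a Python list slice, and the reduction is an integer add (for floating point, such reordering is only valid when reassociation is permitted).

```python
def reorder_indices(offsets):
    """Return the permutation that puts element offsets in ascending order.

    `offsets` is a non-empty list of element offsets in operand order,
    e.g. [1, 0, 2] for a[1] + a[0] + a[2]. Returns None when the offsets
    do not form one consecutive run, i.e. a single contiguous load
    cannot cover them.
    """
    order = sorted(range(len(offsets)), key=lambda i: offsets[i])
    lowest = offsets[order[0]]
    in_order = [offsets[i] for i in order]
    if in_order != list(range(lowest, lowest + len(offsets))):
        return None
    return order


def reduce_add(a, offsets):
    """Sum a[o] for each offset, using one contiguous slice when possible."""
    order = reorder_indices(offsets)
    if order is None:
        # Fallback: scalar accesses in the original operand order.
        return sum(a[o] for o in offsets)
    lowest = offsets[order[0]]
    # Models a single 3-wide (non-power-of-two) load in identity order.
    lanes = a[lowest:lowest + len(offsets)]
    return sum(lanes)
```

For the operand order a[1] + a[0] + a[2], `reorder_indices([1, 0, 2])` yields the permutation `[1, 0, 2]`, and the reduction can then consume the three lanes of one contiguous load.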
Simon Pilgrim	6c8746b6e3	[Analysis] getIntrinsicForCallSite - add vectorization support for acos/asin/atan and cosh/sinh/tanh libcalls (#106844) Followup to #106584 - ensure acos/asin/atan and cosh/sinh/tanh libcalls correctly map to the llvm intrinsic equivalents	2024-09-03 10:05:56 +01:00
Simon Pilgrim	d58d105cda	[Analysis] isTriviallyVectorizable - add vectorization support for acos/asin/atan and cosh/sinh/tanh intrinsics (#106584) Show fallback cases in amdlibm tests where it doesn't have that specific op	2024-08-30 16:49:23 +01:00
Simon Pilgrim	c4b5cb0f31	[AArch64] Add accelerate test coverage for acos/asin/atan and cosh/sinh/tanh intrinsics to support #106584	2024-08-30 10:58:31 +01:00
Alexey Bataev	cc943a67d1	[SLP]Fix PR106626: try several attempts for lookup values, if not found. If the value is used in Scalar several times, the first attempt to find its position in the node (if ReuseShuffleIndices and ReorderIndices not empty) may fail. In this case need to find another copy of the same value and try again. Fixes https://github.com/llvm/llvm-project/issues/106626	2024-08-29 15:07:20 -07:00
Philip Reames	ee764a2603	[SLP] Remove -slp-optimize-identity-hor-reduction-ops option (#106238) This code has been unchanged for two years; let's simplify the code and remove configurability which makes the code harder to follow.	2024-08-27 13:21:57 -07:00
Alexey Bataev	f3d2609af3	[SLP]Improve/fix subvectors in gather/buildvector nodes handling SLP vectorizer has an estimation for gather/buildvector nodes, which contain some scalar loads. SLP vectorizer performs pretty similar (but large in SLOCs) estimation, which is not always correct. Instead, this patch implements clustering analysis and actual node allocation with the full analysis for the vectorized clustered scalars (not only loads, but also some other instructions) with the correct cost estimation and vector insert instructions. Improves overall vectorization quality and simplifies analysis/estimations. Reviewers: RKSimon Reviewed By: RKSimon Pull Request: https://github.com/llvm/llvm-project/pull/104144	2024-08-23 06:45:22 -07:00
Vitaly Buka	96b3166602	Revert "[SLP]Improve/fix subvectors in gather/buildvector nodes handling" (#105780) with "[Vectorize] Fix warnings" It introduced compiler crashes, see #104144. This reverts commit 69332bb8995aef60d830406de12cb79a50390261 and 351f4a5593f1ef507708ec5eeca165b20add3340.	2024-08-22 22:21:20 -07:00
Alexey Bataev	69332bb899	[SLP]Improve/fix subvectors in gather/buildvector nodes handling SLP vectorizer has an estimation for gather/buildvector nodes, which contain some scalar loads. SLP vectorizer performs pretty similar (but large in SLOCs) estimation, which is not always correct. Instead, this patch implements clustering analysis and actual node allocation with the full analysis for the vectorized clustered scalars (not only loads, but also some other instructions) with the correct cost estimation and vector insert instructions. Improves overall vectorization quality and simplifies analysis/estimations. Reviewers: RKSimon Reviewed By: RKSimon Pull Request: https://github.com/llvm/llvm-project/pull/104144	2024-08-22 11:24:08 -04:00
Alexey Bataev	b10ecfa914	[SLP]Represent externally used values as original scalars, if profitable. Currently SLP vectorizer tries to keep only GEPs as scalar, if they are vectorized but used externally. Same approach can be used for all scalar values. This patch tries to keep original scalars if all its operands remain scalar or externally used, the cost of the original scalar is lower than the cost of the extractelement instruction, or if the number of externally used scalars in the same entry is power of 2. Last criterion allows better revectorization for multiply used scalars. Reviewers: RKSimon Reviewed By: RKSimon Pull Request: https://github.com/llvm/llvm-project/pull/100904	2024-08-12 10:15:02 -04:00
David Green	89b67a6400	[SLP] Cluster SortedBases before sorting. (#101144) In order to enforce a strict-weak ordering, this patch clusters the bases that are being sorted by the root - the first value in a gep chain. The sorting is then performed in each cluster.	2024-07-30 22:12:20 +01:00
David Green	f2d2ae3f5a	[SLP] Order clustered load base pointers by ascending offsets (#100653) This attempts to fix a regression from #98025, where the new order of reduction nodes causes later passes to not be able to produce as nice shuffles. The issue boils down to picking an order of [0 1 3 2] for loaded v4i8 values, which meant later parts could not find a simpler ordering for the shuffles given the legal nodes available in AArch64. If instead we make sure they are ordered [0 1 2 3] then everything can fall into place. In order to produce a better order that is more likely to work in more cases, this patch takes the existing clustered loads and sorts the base pointers if there is an order between them. i.e. if `V2 == gep (V1, X)` then V1 is sorted before V2.	2024-07-27 11:18:56 +01:00
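The base-pointer ordering described in the two commits above can be modelled roughly as follows. This is an illustrative sketch, not the LLVM code: the function `sort_bases` and the `(name, root, offset)` tuple representation are assumptions made for the example, where `root` stands for the first value in a gep chain and `offset` for the accumulated offset from it.

```python
from collections import defaultdict


def sort_bases(bases):
    """Order base pointers by ascending offset within each root cluster.

    `bases` is a list of (name, root, offset) tuples. Bases with different
    roots have no defined order between them, so they are first grouped by
    root; comparing offsets only inside a cluster keeps the comparison a
    strict weak ordering.
    """
    clusters = defaultdict(list)
    for name, root, offset in bases:
        clusters[root].append((offset, name))

    ordered = []
    # Clusters are visited in order of first appearance of their root.
    for root in clusters:
        ordered.extend(name for _, name in sorted(clusters[root]))
    return ordered
```

With four bases at offsets 0, 1, 3, 2 from the same root, the result comes back in [0 1 2 3] order, which is the order the commit message says lets later shuffles simplify.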
Alexey Bataev	15915c06d5	[SLP]Do not vectorize small (<=2) buildvector/buildvalue sequences with MaxVF==true. If MaxVFOnly for buildvector/buildvalue vectorization is set to true and the total number of elements to vectorize is <= 2, better to try to vectorize reductions at first, which may produce larger tree (reductions have a limit of at least 4 elements to vectorize). Smaller buildvector/buildvalue sequence will be attempted to vectorize later, with MaxVFOnly set to false. Reviewers: RKSimon Reviewed By: RKSimon Pull Request: https://github.com/llvm/llvm-project/pull/98957	2024-07-16 12:45:58 -04:00
Alexey Bataev	8ff233f4f1	[SLP]Correctly detect minnum/maxnum patterns for select/cmp operations on floats. The patch enables detection of minnum/maxnum patterns for float point instruction, represented as select/cmp. Also, enables better cost estimation for integer min/max patterns since the compiler starts to estimate the scalars separately. Reviewers: nikic, RKSimon Reviewed By: RKSimon Pull Request: https://github.com/llvm/llvm-project/pull/98570	2024-07-16 09:42:08 -07:00
Alexey Bataev	c3540d0b6b	Revert "[SLP]Correctly detect minnum/maxnum patterns for select/cmp operations on floats." This reverts commit c7aac38c29f564bc48f7cfb71d3b3b8b482c873b to fix crashes revealed by the buildbot in https://lab.llvm.org/buildbot/#/builders/168/builds/1104.	2024-07-16 05:59:59 -07:00
Alexey Bataev	c7aac38c29	[SLP]Correctly detect minnum/maxnum patterns for select/cmp operations on floats. The patch enables detection of minnum/maxnum patterns for float point instruction, represented as select/cmp. Also, enables better cost estimation for integer min/max patterns since the compiler starts to estimate the scalars separately. Reviewers: nikic, RKSimon Reviewed By: RKSimon Pull Request: https://github.com/llvm/llvm-project/pull/98570	2024-07-16 08:14:27 -04:00
Alexey Bataev	a988821123	[SLP]Keep the original order in the reductions. The patch tries to keep the original order of the instruction in the reductions. Previously, two first instructions were switched, giving reverse order. The first step to support of the ordered reductions. Reviewers: RKSimon Reviewed By: RKSimon Pull Request: https://github.com/llvm/llvm-project/pull/98025	2024-07-09 12:26:42 -04:00
Alexey Bataev	385118644c	[SLP]Remove operands upon marking instruction for deletion. If the instruction is marked for deletion, better to drop all its operands and mark them for deletion too (if allowed). It allows to have more vectorizable patterns and generate less useless extractelement instructions. Reviewers: RKSimon Reviewed By: RKSimon Pull Request: https://github.com/llvm/llvm-project/pull/97409	2024-07-08 07:56:48 -07:00
Alexey Bataev	4eecf3c650	[SLP]Reorder buildvector/reduction vectorization and fuse the loops. Currently SLP vectorizer tries at first to find reduction nodes, and then vectorize buildvector sequences. Need to try to vectorize wide buildvector sequences at first and only then try to vectorize reductions, and then smaller buildvector sequences. Reviewers: RKSimon Reviewed By: RKSimon Pull Request: https://github.com/llvm/llvm-project/pull/96943	2024-07-03 14:36:30 -04:00
Gabriel Baraldi	380beaec86	Fix potential crash in SLPVectorizer caused by missing check (#95937 ) I'm not super familiar with this code, but it seems that we were just missing a check. The original code that triggered this did not have uselistorders but llvm-reduce created them and it reproduces the same issue in a way more compact way. Fixes https://github.com/llvm/llvm-project/issues/95016	2024-07-02 08:15:51 -04:00
Farzon Lotfi	918313d17d	[SLPVectorizer] Support SLPVectorizer cases of tan across all backends (#95517 ) This PR is intended to address the limited SLPVectorizer support of tan raised in the comments of this PR: https://github.com/llvm/llvm-project/pull/94559. Right now emitting the tan intrinsic allows you to vectorize tan, but emitting the libfunc does not. To address this, the libcall needs to be mapped to the intrinsic, and the libcall and function name need to be marked appropriately so they can be optimized or defined as a call lowering.	2024-06-27 15:15:13 -04:00
Stephen Tozer	094572701d	[RemoveDIs] Print IR with debug records by default (#91724 ) This patch makes the final major change of the RemoveDIs project, changing the default IR output from debug intrinsics to debug records. This is expected to break a large number of tests: every single one that tests for uses or declarations of debug intrinsics and does not explicitly disable writing records. If this patch has broken your downstream tests (or upstream tests on a configuration I wasn't able to run): 1. If you need to immediately unblock a build, pass `--write-experimental-debuginfo=false` to LLVM's option processing for all failing tests (remember to use `-mllvm` for clang/flang to forward arguments to LLVM). 2. For most test failures, the changes are trivial and mechanical, enough that they can be done by script; see the migration guide for a guide on how to do this: https://llvm.org/docs/RemoveDIsDebugInfo.html#test-updates 3. If any tests fail for reasons other than FileCheck check lines that need updating, such as assertion failures, that is most likely a real bug with this patch and should be reported as such. For more information, see the recent PSA: https://discourse.llvm.org/t/psa-ir-output-changing-from-debug-intrinsics-to-debug-records/79578	2024-06-14 15:07:27 +01:00
Nikita Popov	8e8d2595da	[ConstantFolding] Canonicalize constexpr GEPs to i8 (#89872 ) This patch canonicalizes constant expression GEPs to use i8 source element type, aka ptradd. This is the ConstantFolding equivalent of the InstCombine canonicalization introduced in #68882. I believe all our optimizations working on constant expression GEPs (like GlobalOpt etc) have already been switched to work on offsets, so I don't expect any significant fallout from this change. This is part of: https://discourse.llvm.org/t/rfc-replacing-getelementptr-with-ptradd/68699	2024-05-20 11:47:30 +02:00
Alexey Bataev	58a94b1d0a	[SLP]Fix PR91467: Look through scalar cast, when trying to cast to another type. Need to look through the SExt/ZExt scalars to be gathered, when trying to reduce their width after minbitwidth analysis to prevent permanent attempts to revectorize such gathered instructions.	2024-05-09 04:19:43 -07:00
Arthur Eubanks	2fb3774321	Revert "[SLP]Fix PR91467: Look through scalar cast, when trying to cast to another type." This reverts commit 2475efa91d8b4fa8f1a2d16052cb6d14be7d5dc6. Causes crashes, see comments on `2475efa91d`.	2024-05-08 23:01:47 +00:00
Alexey Bataev	2475efa91d	[SLP]Fix PR91467: Look through scalar cast, when trying to cast to another type. Need to look through the SExt/ZExt scalars to be gathered, when trying to reduce their width after minbitwidth analysis to prevent permanent attempts to revectorize such gathered instructions.	2024-05-08 07:25:19 -07:00
Alexey Bataev	f00f294130	[SLP]Fix PR91309: Do not consider SExt as always producing signed result. Still need to do the full analysis of the signedness of the values rather than rely on Instruction opcode, if the opcode is SExt. Still may produce unsigned result.	2024-05-07 08:57:52 -07:00
Alexey Bataev	5d9b549bb0	[SLP][NFC]Add a test showing incorrect signedness detection in sext nodes.	2024-05-07 06:46:30 -07:00
Alexey Bataev	37ae4ad0ee	[SLP]Support minbitwidth analysis for buildvector nodes. Metric: size..text Program size..text exp ref diff test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test 42906.00 42986.00 0.2% test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test 42909.00 42989.00 0.2% test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 664581.00 664661.00 0.0% test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 664581.00 664661.00 0.0% Less is better. Replaces `buildvector <p x in> + trunc <p x in> to <p x im>` sequences with `buildvector <p x im> of { trunc in to im }` scalars, which is free in most cases and results in better code. Reviewers: RKSimon Reviewed By: RKSimon Pull Request: https://github.com/llvm/llvm-project/pull/88504	2024-04-29 09:57:37 -04:00
Alexey Bataev	d74e42acd2	[SLP]Attempt to vectorize long stores, if short one failed. We can try to vectorize long store sequences, if short ones were unsuccessful because of the non-profitable vectorization. It should not increase compile time significantly (stores are sorted already, complexity is n x log n), but vectorize extra code. Metric: size..text Program size..text results results0 diff test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test 1088012.00 1088236.00 0.0% test-suite :: SingleSource/UnitTests/matrix-types-spec.test 480396.00 480476.00 0.0% test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 664613.00 664661.00 0.0% test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 664613.00 664661.00 0.0% test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 2041105.00 2040961.00 -0.0% test-suite :: MultiSource/Applications/JM/lencod/lencod.test 836563.00 836387.00 -0.0% test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test 1035100.00 1032140.00 -0.3% In all benchmarks extra code gets vectorized Reviewers: RKSimon Reviewed By: RKSimon Pull Request: https://github.com/llvm/llvm-project/pull/88563	2024-04-26 06:53:44 -07:00
Alexey Bataev	102a811094	[SLP]Fix a check for multi-users for icmp user. The compiler should not take into account the type of the cmp instruction, otherwise it may treat the size incorrectly and it may lead to incorrect codegen.	2024-04-22 08:23:15 -07:00
Alexey Bataev	19a625a0a7	[SLP][NFC]Add a test with incorrect size of the external user detection.	2024-04-22 08:16:43 -07:00
Florian Hahn	6d66db3890	[SLP] Initial vectorization of non-power-of-2 ops. (#77790 ) This patch enables vectorization for non-power-of-2 VFs. Initially only VFs where adding 1 makes the VF a power-of-2, i.e. we can still make relatively effective use of the vectors. It relies on the existing target cost-models to return accurate costs for non-power-of-2 vectors. I checked mostly AArch64 and X86 and there the costs seem reasonable for the costs I checked, although I expect there will be a need to refine both the cost-models and lowering to make most effective use of non-power-of-2 SLP vectorization. Note that re-ordering and shuffling is not implemented for nodes requiring padding yet to keep the initial implementation simpler. The feature is guarded by a new flag, off by default for now. PR: https://github.com/llvm/llvm-project/pull/77790	2024-04-13 08:14:40 +01:00
Alexey Bataev	2b00a73f62	[SLP]Buildvector for alternate instructions with non-profitable gather operands. If the operands of the potentially alternate node are going to produce buildvector sequences, which result in more instructions, than the original code, then suhinstructions should be vectorized as alternate node, better to end up with the buildvector node. Left column - experimental, Right - reference. Metric: size..text Program size..text results results0 diff test-suite :: SingleSource/Benchmarks/Adobe-C++/loop_unroll.test 413680.00 416272.00 0.6% test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12351788.00 12354844.00 0.0% test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 664901.00 664949.00 0.0% test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 664901.00 664949.00 0.0% test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test 1171371.00 1171355.00 -0.0% test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test 1036396.00 1036284.00 -0.0% test-suite :: MultiSource/Benchmarks/MiBench/consumer-jpeg/consumer-jpeg.test 111280.00 111248.00 -0.0% test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test 1392113.00 1391361.00 -0.1% test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test 1392113.00 1391361.00 -0.1% test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test 281676.00 281452.00 -0.1% test-suite :: MultiSource/Benchmarks/VersaBench/ecbdes/ecbdes.test 3025.00 3019.00 -0.2% test-suite :: MultiSource/Benchmarks/Prolangs-C/plot2fig/plot2fig.test 6351.00 6335.00 -0.3% Metric: SLP.NumVectorInstructions Program SLP.NumVectorInstructions results results0 diff test-suite :: MultiSource/Benchmarks/VersaBench/ecbdes/ecbdes.test 15.00 16.00 6.7% test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 1703.00 1707.00 0.2% test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 1703.00 1707.00 0.2% test-suite :: 
External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 26241.00 26239.00 -0.0% test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 11761.00 11754.00 -0.1% test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test 824.00 822.00 -0.2% test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test 5668.00 5654.00 -0.2% test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test 5668.00 5654.00 -0.2% test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test 792.00 790.00 -0.3% test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test 792.00 790.00 -0.3% test-suite :: MultiSource/Benchmarks/FreeBench/pifft/pifft.test 1389.00 1384.00 -0.4% test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test 596.00 590.00 -1.0% test-suite :: MultiSource/Benchmarks/Prolangs-C/plot2fig/plot2fig.test 6.00 5.00 -16.7% Metric: exec_time Program exec_time results results0 diff test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 99.14 100.00 0.9% Other changes are not significant (less than 0.1% percent with exectime less 5 secs). SingleSource/Benchmarks/Adobe-C++/loop_unroll - same small patterns remain scalar, smaller code. External/SPEC/CFP2017rate/526.blender_r/526.blender_r - many small changes, some extra stores gets vectorized. External/SPEC/CINT2017speed/625.x264_s/625.x264_s External/SPEC/CINT2017rate/525.x264_r/525.x264_r x264 has one change in a loop body, in function ssim_end4, some code remain scalar, resulting in less code size. External/SPEC/CFP2017rate/511.povray_r/511.povray_r - some extra code gets vectorized, looks like some other patterns were matched. MultiSource/Benchmarks/7zip/7zip-benchmark - extra stores were vectorized (looks like the graphs become profitable) MultiSource/Benchmarks/MiBench/consumer-jpeg/consumer-jpeg - small changes in vectorized code (some small part remain scalar). 
External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s Many changes caused by the fact that the code of one function becomes smaller (onvertLCHabToRGB) and this function gets inlined after that. MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc - some small changes here and there, some extra code is vectorized, some remain scalar (2 x vectors) MultiSource/Benchmarks/VersaBench/ecbdes/ecbdes - emits 2 scalars + 2 insertelems instead of insert, broadcast, alt code (3 instructions, total 5 insts) MultiSource/Benchmarks/Prolangs-C/plot2fig/plot2fig - small graph becomes profitable and gets vectorized. External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s Some small graph becomes profitable and gets vectorized. MultiSource/Benchmarks/FreeBench/pifft/pifft - no changes in final code. Reviewers: RKSimon, dtcxzyw Reviewed By: RKSimon Pull Request: https://github.com/llvm/llvm-project/pull/84978	2024-04-10 14:33:56 -04:00

412 Commits