llvm-project

Author	SHA1	Message	Date
Alexey Bataev	39bab1de33	[SLP]Check if the operand for removal is the reduction operand, awaiting for the reduction If the operand of the instruction-to-be-removed is a reduction value, which is not reduced yet, and, thus, it has no users, it may be removed during operands analysis. Fixes #128736	2025-02-26 14:17:11 -08:00
Alexey Bataev	418a987285	[SLP]Do not use node, if it is a subvector or buildvector node If the buildvector has some matches with another node, which is a subvector of another buildvector node, need to check for this and cancel matching to avoid incorrect ordering of the nodes. Fixes #128770	2025-02-26 13:25:37 -08:00
Han-Kuan Chen	a12ca57c1c	[SLP][REVEC] Add getScalarizationOverhead helper function to reduce error when REVEC is enabled. (#128530 )	2025-02-25 23:16:05 +08:00
Han-Kuan Chen	3a6108bcac	[SLP][REVEC] Fix scalar mask is passed to getScalarizationOverhead but the type is vector. (#128476 ) Fix "Vector size mismatch".	2025-02-24 23:43:27 +08:00
Alexey Bataev	eb14d2a1d4	[SLP]Fix check for matched gather node, if it is a subvector node If the gather node is a subvector node, it may match the existing vector/gather node in the graph, but still may require reordering. in this case need to fully check its dependencies to prevent a compiler crash. Fixes #128401	2025-02-24 06:48:43 -08:00
Alexey Bataev	8ffdc3b207	[SLP]Fix a crash when checking a scalar in a reordered buildvector node Need to check reordered scalars, not the original ones, to correctly check proper scalar.	2025-02-21 14:59:43 -08:00
Alexey Bataev	894935cb51	[SLP]Represent SLP graph as a tree We can stop using a graph representation of the SLP structure and switch directly to tree by relying on a single user of each tree node. If the node has multiple uses, other uses must be represented as a separate gather/buildvector node, which then will be combined with the existing vectorized node(s) uoon cost estimation/codegen. This allow to simplify inner structure and turn in some extra optimizations, which could not be turned on for the nodes with multi users (reordering, minbitwidth analysis). AVX512, -O3+LTO Metric: size..text results results0 diff test-suite :: MultiSource/Benchmarks/ASCI_Purple/SMG2000/smg2000.test 253453.00 254253.00 0.3% test-suite :: External/SPEC/CFP2006/444.namd/444.namd.test 251411.00 252051.00 0.3% test-suite :: SingleSource/Benchmarks/Misc/oourafft.test 19114.00 19146.00 0.2% test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test 1399200.00 1399520.00 0.0% test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test 1399200.00 1399520.00 0.0% test-suite :: MicroBenchmarks/LCALS/SubsetALambdaLoops/lcalsALambda.test 304310.00 304326.00 0.0% test-suite :: MicroBenchmarks/LCALS/SubsetARawLoops/lcalsARaw.test 304662.00 304678.00 0.0% test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12566919.00 12567511.00 0.0% test-suite :: External/SPEC/CFP2006/453.povray/453.povray.test 1146300.00 1146316.00 0.0% test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test 1159864.00 1159880.00 0.0% test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test 9407880.00 9407864.00 -0.0% test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test 9407880.00 9407864.00 -0.0% test-suite :: MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test 1011612.00 1011596.00 -0.0% test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test 280584.00 280536.00 -0.0% test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/miniFE/miniFE.test 93016.00 93000.00 -0.0% ASCI_Purple/SMG2000 - extra code vectorized, small variations CFP2006/444.namd - small variations, less shuffles Benchmarks/Misc/oourafft - small variations CFP2017rate/538.imagick_r CFP2017speed/638.imagick_s - small variations, less shuffles LCALS/SubsetALambdaLoops - less shuffles LCALS/SubsetARawLoops - less shuffles CFP2017rate/526.blender_r - small variations, extra vector code CFP2006/453.povray - small variations CFP2017rate/511.povray_r - small variations CINT2017rate/502.gcc_r CINT2017speed/602.gcc_s - small variations Benchmarks/tramp3d-v4 - small variations Prolangs-C/TimberWolfMC - small variations DOE-ProxyApps-C++/miniFE - extra code vectorized, small variations DOE-ProxyApps-C++/CLAMR - extra code vectorized, small variations ASCI_Purple/SMG2000 - no significant changes RISCV, -O3+LTO Metric: size..text results results0 diff test-suite :: SingleSource/Regression/C/gcc-c-torture/execute/GCC-C-execute-pr28982b.test 1812.00 1866.00 3.0% test-suite :: MultiSource/Benchmarks/Olden/health/health.test 3946.00 4016.00 1.8% test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 513180.00 513550.00 0.1% test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 513180.00 513550.00 0.1% test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test 7672198.00 7672202.00 0.0% test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test 7672198.00 7672202.00 0.0% test-suite :: External/SPEC/CFP2017rate/508.namd_r/508.namd_r.test 746060.00 746044.00 -0.0% test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 9497716.00 9497364.00 -0.0% test-suite :: External/SPEC/CFP2006/453.povray/453.povray.test 948266.00 948214.00 -0.0% test-suite :: External/SPEC/CFP2006/433.milc/433.milc.test 89874.00 89862.00 -0.0% test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test 835492.00 835346.00 -0.0% test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/miniFE/miniFE.test 66230.00 66202.00 -0.0% test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test 946090.00 944206.00 -0.2% test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test 1136404.00 1131854.00 -0.4% test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test 1136404.00 1131854.00 -0.4% gcc-c-torture/execute/GCC-C-execute-pr28982b - better vector code Olden/health - extra vector code CINT2017speed/625.x264_s CINT2017rate/525.x264_r - small variation + improvements in reordering, @pixel_hadamard_ac stopped being vectorized because of some non-effective shuffle recognition by the compiler CINT2017rate/502.gcc_r CINT2017speed/602.gcc_s - small variations CFP2017rate/508.namd_r - small variations CFP2017rate/526.blender_r - small variations CFP2006/453.povray - extra vector code Benchmarks/7zip - extra vector code DOE-ProxyApps-C++/miniFE - small variations CFP2017rate/511.povray_r - extra vector code CFP2017speed/638.imagick_s CFP2017rate/538.imagick_r - extra vector code Reviewers: RKSimon, hiraditya Reviewed By: RKSimon Pull Request: https://github.com/llvm/llvm-project/pull/126771	2025-02-21 07:15:02 -05:00
Alexey Bataev	0e1ffa397e	[SLP]Fix a crash when comparing phis from unreachable blocks Need to check if the block is reachable before comparing phis from it to avoid compiler crash when requesting node. Fixes report in https://github.com/llvm/llvm-project/pull/110529#issuecomment-2664723338	2025-02-18 08:20:48 -08:00
Alexey Bataev	37bde7ae5b	[SLP]Fix hanging on small trees with phis only with adjusted cost threshold Need to check if the tree is too small before attempting to vectorize the tree to prevent hanging on small trees with phis only.	2025-02-18 07:56:47 -08:00
Alexey Bataev	3b18d47ecb	[SLP]Improved reduction cost/codegen SLP vectorizer is able to combine several reductions from the list of (potentially) reduced values with the different opcodes/values kind. Currently, these reductions are handled independently of each other. But instead the compiler can combine them into wide vector operations and then perform only single reduction. E.g, if the SLP vectorizer emits currently something like: ``` %r1 = reduce.add(<4 x i32> %v1) %r2 = reduce.add(<4 x i32> %v2) %r = add i32 %r1, %r2 ``` it can be emitted as: ``` %v = add <4 x i32> %v1, %v2 %r = reduce.add(<4 x i32> %v) ``` It allows to improve the performance in some cases. AVX512, -O3+LTO Metric: size..text Program size..text results results0 diff test-suite :: SingleSource/Benchmarks/Shootout-C++/Shootout-C++-matrix.test 4553.00 4615.00 1.4% test-suite :: SingleSource/Benchmarks/Adobe-C++/loop_unroll.test 412708.00 416820.00 1.0% test-suite :: SingleSource/UnitTests/Vector/AVX512BWVL/Vector-AVX512BWVL-mask_set_bw.test 12901.00 12981.00 0.6% test-suite :: MultiSource/Benchmarks/FreeBench/fourinarow/fourinarow.test 22717.00 22813.00 0.4% test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test 39722.00 39850.00 0.3% test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test 39725.00 39853.00 0.3% test-suite :: SingleSource/Regression/C/gcc-c-torture/execute/GCC-C-execute-builtin-bitops-1.test 15918.00 15967.00 0.3% test-suite :: External/SPEC/CFP2006/433.milc/433.milc.test 155491.00 155587.00 0.1% test-suite :: MicroBenchmarks/ImageProcessing/Blur/blur.test 227894.00 227942.00 0.0% test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test 1062188.00 1062364.00 0.0% test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 793672.00 793720.00 0.0% test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 657371.00 657403.00 0.0% test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 657371.00 657403.00 0.0% test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test 2074917.00 2074933.00 0.0% test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test 2074917.00 2074933.00 0.0% test-suite :: MultiSource/Applications/JM/lencod/lencod.test 855219.00 855203.00 -0.0% Benchmarks/Shootout-C++ - same transformed reduction Adobe-C++/loop_unroll - same transformed reductions, new vector code AVX512BWVL/Vector-AVX512BWVL-mask_set_bw - same transformed reductions FreeBench/fourinarow - same transformed reductions MiBench/telecomm-gsm - same transformed reductions execute/GCC-C-execute-builtin-bitops-1 - same transformed reductions CFP2006/433.milc - better vector code, several x i64 reductions + trunc to i32 gets trunced to x i32 reductions ImageProcessing/Blur - same transformed reductions Benchmarks/7zip - same transformed reductions, extra 4 x vectorization CINT2006/464.h264ref - same transformed reductions CINT2017rate/525.x264_r CINT2017speed/625.x264_s - same transformed reductions CINT2017speed/600.perlbench_s CINT2017rate/500.perlbench_r - transformed same reduction JM/lencod - extra 4 x vectorization RISC-V, SiFive-p670, -O3+LTO Metric: size..text Program size..text results results0 diff test-suite :: SingleSource/Regression/C/gcc-c-torture/execute/GCC-C-execute-builtin-bitops-1.test 8990.00 9514.00 5.8% test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 588504.00 588488.00 -0.0% test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test 147464.00 147440.00 -0.0% test-suite :: MultiSource/Benchmarks/MiBench/automotive-susan/automotive-susan.test 21496.00 21492.00 -0.0% test-suite :: MicroBenchmarks/ImageProcessing/Blur/blur.test 165420.00 165372.00 -0.0% test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test 843928.00 843648.00 -0.0% test-suite :: External/SPEC/CINT2006/458.sjeng/458.sjeng.test 100712.00 100672.00 -0.0% test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test 24384.00 24336.00 -0.2% test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test 24380.00 24332.00 -0.2% test-suite :: SingleSource/UnitTests/Vectorizer/VPlanNativePath/outer-loop-vect.test 10348.00 10316.00 -0.3% test-suite :: SingleSource/Benchmarks/Adobe-C++/loop_unroll.test 221304.00 220480.00 -0.4% test-suite :: SingleSource/Benchmarks/Shootout-C++/Shootout-C++-matrix.test 3750.00 3736.00 -0.4% test-suite :: SingleSource/Regression/C/Regression-C-DuffsDevice.test 678.00 370.00 -45.4% execute/GCC-C-execute-builtin-bitops-1 - extra 4 x reductions, same transformed reductions CINT2006/464.h264ref - extra 4 x reductions, same transformed reductions MiBench/consumer-lame - 2 4 x i1 merged to 8 x i1 reductions (bitcast + ctpop) MiBench/automotive-susan - same transformed reductions ImageProcessing/Blur - same transformed reductions Benchmarks/7zip - same transformed reductions CINT2006/458.sjeng - 2 4 x i1 merged to 8 x i1 reductions (bitcast + ctpop) MiBench/telecomm-gsm - same transformed reductions Benchmarks/mediabench - same transformed reductions Vectorizer/VPlanNativePath - same transformed reductions Adobe-C++/loop_unroll - extra 4 x reductions, same transformed reductions Benchmarks/Shootout-C++ - extra 4 x reductions, same transformed reductions Regression/C/Regression-C-DuffsDevice - same transformed reductions Reviewers: hiraditya, topperc, preames Pull Request: https://github.com/llvm/llvm-project/pull/118293	2025-02-14 11:03:33 -08:00
Alexey Bataev	40029800e7	Revert "[SLP]Improved reduction cost/codegen" This reverts commit 7ec60bf0166519317b5ae2505dd6ed4660e3ea39 to fix a bug reported in https://github.com/llvm/llvm-project/issues/127220.	2025-02-14 10:18:07 -08:00
Alexey Bataev	7ec60bf016	[SLP]Improved reduction cost/codegen SLP vectorizer is able to combine several reductions from the list of (potentially) reduced values with the different opcodes/values kind. Currently, these reductions are handled independently of each other. But instead the compiler can combine them into wide vector operations and then perform only single reduction. E.g, if the SLP vectorizer emits currently something like: ``` %r1 = reduce.add(<4 x i32> %v1) %r2 = reduce.add(<4 x i32> %v2) %r = add i32 %r1, %r2 ``` it can be emitted as: ``` %v = add <4 x i32> %v1, %v2 %r = reduce.add(<4 x i32> %v) ``` It allows to improve the performance in some cases. AVX512, -O3+LTO Metric: size..text Program size..text results results0 diff test-suite :: SingleSource/Benchmarks/Shootout-C++/Shootout-C++-matrix.test 4553.00 4615.00 1.4% test-suite :: SingleSource/Benchmarks/Adobe-C++/loop_unroll.test 412708.00 416820.00 1.0% test-suite :: SingleSource/UnitTests/Vector/AVX512BWVL/Vector-AVX512BWVL-mask_set_bw.test 12901.00 12981.00 0.6% test-suite :: MultiSource/Benchmarks/FreeBench/fourinarow/fourinarow.test 22717.00 22813.00 0.4% test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test 39722.00 39850.00 0.3% test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test 39725.00 39853.00 0.3% test-suite :: SingleSource/Regression/C/gcc-c-torture/execute/GCC-C-execute-builtin-bitops-1.test 15918.00 15967.00 0.3% test-suite :: External/SPEC/CFP2006/433.milc/433.milc.test 155491.00 155587.00 0.1% test-suite :: MicroBenchmarks/ImageProcessing/Blur/blur.test 227894.00 227942.00 0.0% test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test 1062188.00 1062364.00 0.0% test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 793672.00 793720.00 0.0% test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 657371.00 657403.00 0.0% test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 657371.00 657403.00 0.0% test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test 2074917.00 2074933.00 0.0% test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test 2074917.00 2074933.00 0.0% test-suite :: MultiSource/Applications/JM/lencod/lencod.test 855219.00 855203.00 -0.0% Benchmarks/Shootout-C++ - same transformed reduction Adobe-C++/loop_unroll - same transformed reductions, new vector code AVX512BWVL/Vector-AVX512BWVL-mask_set_bw - same transformed reductions FreeBench/fourinarow - same transformed reductions MiBench/telecomm-gsm - same transformed reductions execute/GCC-C-execute-builtin-bitops-1 - same transformed reductions CFP2006/433.milc - better vector code, several x i64 reductions + trunc to i32 gets trunced to x i32 reductions ImageProcessing/Blur - same transformed reductions Benchmarks/7zip - same transformed reductions, extra 4 x vectorization CINT2006/464.h264ref - same transformed reductions CINT2017rate/525.x264_r CINT2017speed/625.x264_s - same transformed reductions CINT2017speed/600.perlbench_s CINT2017rate/500.perlbench_r - transformed same reduction JM/lencod - extra 4 x vectorization RISC-V, SiFive-p670, -O3+LTO Metric: size..text Program size..text results results0 diff test-suite :: SingleSource/Regression/C/gcc-c-torture/execute/GCC-C-execute-builtin-bitops-1.test 8990.00 9514.00 5.8% test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 588504.00 588488.00 -0.0% test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test 147464.00 147440.00 -0.0% test-suite :: MultiSource/Benchmarks/MiBench/automotive-susan/automotive-susan.test 21496.00 21492.00 -0.0% test-suite :: MicroBenchmarks/ImageProcessing/Blur/blur.test 165420.00 165372.00 -0.0% test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test 843928.00 843648.00 -0.0% test-suite :: External/SPEC/CINT2006/458.sjeng/458.sjeng.test 100712.00 100672.00 -0.0% test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test 24384.00 24336.00 -0.2% test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test 24380.00 24332.00 -0.2% test-suite :: SingleSource/UnitTests/Vectorizer/VPlanNativePath/outer-loop-vect.test 10348.00 10316.00 -0.3% test-suite :: SingleSource/Benchmarks/Adobe-C++/loop_unroll.test 221304.00 220480.00 -0.4% test-suite :: SingleSource/Benchmarks/Shootout-C++/Shootout-C++-matrix.test 3750.00 3736.00 -0.4% test-suite :: SingleSource/Regression/C/Regression-C-DuffsDevice.test 678.00 370.00 -45.4% execute/GCC-C-execute-builtin-bitops-1 - extra 4 x reductions, same transformed reductions CINT2006/464.h264ref - extra 4 x reductions, same transformed reductions MiBench/consumer-lame - 2 4 x i1 merged to 8 x i1 reductions (bitcast + ctpop) MiBench/automotive-susan - same transformed reductions ImageProcessing/Blur - same transformed reductions Benchmarks/7zip - same transformed reductions CINT2006/458.sjeng - 2 4 x i1 merged to 8 x i1 reductions (bitcast + ctpop) MiBench/telecomm-gsm - same transformed reductions Benchmarks/mediabench - same transformed reductions Vectorizer/VPlanNativePath - same transformed reductions Adobe-C++/loop_unroll - extra 4 x reductions, same transformed reductions Benchmarks/Shootout-C++ - extra 4 x reductions, same transformed reductions Regression/C/Regression-C-DuffsDevice - same transformed reductions Reviewers: hiraditya, topperc, preames Pull Request: https://github.com/llvm/llvm-project/pull/118293	2025-02-14 05:15:29 -08:00
Alexey Bataev	afa3c10de7	Revert "[SLP]Improved reduction cost/codegen" This reverts commit 2ad816648f2719e6c0da507a1a371f2cad4a3f1c to fix bug/miscompiles, reported in https://github.com/llvm/llvm-project/pull/118293#issuecomment-2658906033 and https://github.com/llvm/llvm-project/pull/118293#issuecomment-2659024785.	2025-02-14 04:12:47 -08:00
Alexey Bataev	ac217ee389	[SLP] Check for PHI nodes (potentially cycles!) when checking dependencies When checking for dependecies for gather nodes with users with the same last instruction, cannot rely on the index order, if there is (even potential!) cycle in the graph, which may cause order not work correctly and cause compiler crash. Fixes #127128	2025-02-13 14:21:48 -08:00
Alexey Bataev	d18b1ebef5	[SLP]Check if vector user exist before accessing it Need to check if vector user exist before accessing it to avoid compiler crash. Fixes #126581	2025-02-13 09:44:34 -08:00
Alexey Bataev	2ad816648f	[SLP]Improved reduction cost/codegen SLP vectorizer is able to combine several reductions from the list of (potentially) reduced values with the different opcodes/values kind. Currently, these reductions are handled independently of each other. But instead the compiler can combine them into wide vector operations and then perform only single reduction. E.g, if the SLP vectorizer emits currently something like: ``` %r1 = reduce.add(<4 x i32> %v1) %r2 = reduce.add(<4 x i32> %v2) %r = add i32 %r1, %r2 ``` it can be emitted as: ``` %v = add <4 x i32> %v1, %v2 %r = reduce.add(<4 x i32> %v) ``` It allows to improve the performance in some cases. AVX512, -O3+LTO Metric: size..text Program size..text results results0 diff test-suite :: SingleSource/Benchmarks/Shootout-C++/Shootout-C++-matrix.test 4553.00 4615.00 1.4% test-suite :: SingleSource/Benchmarks/Adobe-C++/loop_unroll.test 412708.00 416820.00 1.0% test-suite :: SingleSource/UnitTests/Vector/AVX512BWVL/Vector-AVX512BWVL-mask_set_bw.test 12901.00 12981.00 0.6% test-suite :: MultiSource/Benchmarks/FreeBench/fourinarow/fourinarow.test 22717.00 22813.00 0.4% test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test 39722.00 39850.00 0.3% test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test 39725.00 39853.00 0.3% test-suite :: SingleSource/Regression/C/gcc-c-torture/execute/GCC-C-execute-builtin-bitops-1.test 15918.00 15967.00 0.3% test-suite :: External/SPEC/CFP2006/433.milc/433.milc.test 155491.00 155587.00 0.1% test-suite :: MicroBenchmarks/ImageProcessing/Blur/blur.test 227894.00 227942.00 0.0% test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test 1062188.00 1062364.00 0.0% test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 793672.00 793720.00 0.0% test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 657371.00 657403.00 0.0% test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 657371.00 657403.00 0.0% test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test 2074917.00 2074933.00 0.0% test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test 2074917.00 2074933.00 0.0% test-suite :: MultiSource/Applications/JM/lencod/lencod.test 855219.00 855203.00 -0.0% Benchmarks/Shootout-C++ - same transformed reduction Adobe-C++/loop_unroll - same transformed reductions, new vector code AVX512BWVL/Vector-AVX512BWVL-mask_set_bw - same transformed reductions FreeBench/fourinarow - same transformed reductions MiBench/telecomm-gsm - same transformed reductions execute/GCC-C-execute-builtin-bitops-1 - same transformed reductions CFP2006/433.milc - better vector code, several x i64 reductions + trunc to i32 gets trunced to x i32 reductions ImageProcessing/Blur - same transformed reductions Benchmarks/7zip - same transformed reductions, extra 4 x vectorization CINT2006/464.h264ref - same transformed reductions CINT2017rate/525.x264_r CINT2017speed/625.x264_s - same transformed reductions CINT2017speed/600.perlbench_s CINT2017rate/500.perlbench_r - transformed same reduction JM/lencod - extra 4 x vectorization RISC-V, SiFive-p670, -O3+LTO Metric: size..text Program size..text results results0 diff test-suite :: SingleSource/Regression/C/gcc-c-torture/execute/GCC-C-execute-builtin-bitops-1.test 8990.00 9514.00 5.8% test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 588504.00 588488.00 -0.0% test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test 147464.00 147440.00 -0.0% test-suite :: MultiSource/Benchmarks/MiBench/automotive-susan/automotive-susan.test 21496.00 21492.00 -0.0% test-suite :: MicroBenchmarks/ImageProcessing/Blur/blur.test 165420.00 165372.00 -0.0% test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test 843928.00 843648.00 -0.0% test-suite :: External/SPEC/CINT2006/458.sjeng/458.sjeng.test 100712.00 100672.00 -0.0% test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test 24384.00 24336.00 -0.2% test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test 24380.00 24332.00 -0.2% test-suite :: SingleSource/UnitTests/Vectorizer/VPlanNativePath/outer-loop-vect.test 10348.00 10316.00 -0.3% test-suite :: SingleSource/Benchmarks/Adobe-C++/loop_unroll.test 221304.00 220480.00 -0.4% test-suite :: SingleSource/Benchmarks/Shootout-C++/Shootout-C++-matrix.test 3750.00 3736.00 -0.4% test-suite :: SingleSource/Regression/C/Regression-C-DuffsDevice.test 678.00 370.00 -45.4% execute/GCC-C-execute-builtin-bitops-1 - extra 4 x reductions, same transformed reductions CINT2006/464.h264ref - extra 4 x reductions, same transformed reductions MiBench/consumer-lame - 2 4 x i1 merged to 8 x i1 reductions (bitcast + ctpop) MiBench/automotive-susan - same transformed reductions ImageProcessing/Blur - same transformed reductions Benchmarks/7zip - same transformed reductions CINT2006/458.sjeng - 2 4 x i1 merged to 8 x i1 reductions (bitcast + ctpop) MiBench/telecomm-gsm - same transformed reductions Benchmarks/mediabench - same transformed reductions Vectorizer/VPlanNativePath - same transformed reductions Adobe-C++/loop_unroll - extra 4 x reductions, same transformed reductions Benchmarks/Shootout-C++ - extra 4 x reductions, same transformed reductions Regression/C/Regression-C-DuffsDevice - same transformed reductions Reviewers: hiraditya, topperc, preames Pull Request: https://github.com/llvm/llvm-project/pull/118293	2025-02-13 10:36:28 -05:00
Alexey Bataev	7d1db31aa0	[SLP]Check the first instruction instead the first scalar for subvectors Need to check the first instruction instead of first scalar for subvectors, when trying to find full matched vectorized node in the graph. Fixes #126909.	2025-02-13 06:40:37 -08:00
Alexey Bataev	bb3d789dfe	[SLP][NFC]Improve dump of the ScheduleData, NFC	2025-02-12 06:51:30 -08:00
Alexey Bataev	e1935a2b15	Revert "[SLP][NFC]Improve dump of the ScheduleData, NFC" This reverts commit 108e6bca693e5f44d2d17da5a6e06203a0290de7 to fix error revealed by buildbots https://lab.llvm.org/buildbot/#/builders/159/builds/15888.	2025-02-12 06:34:27 -08:00
Alexey Bataev	108e6bca69	[SLP][NFC]Improve dump of the ScheduleData, NFC	2025-02-12 06:25:04 -08:00
Alexey Bataev	10844fb9b0	[SLP]Fix attempt to build the reorder mask for non-adjusted reuse mask When building the reorder for non-single use reuse mask, need to check if the size of the mask is multiple of the number of unique scalars. Otherwise, the compiler may crash when trying to reorder nodes. Fixes #126304	2025-02-11 13:41:25 -08:00
Alexey Bataev	7dca2c628c	[SLP]Gather scalarized calls If the calls won't be vectorized, but will be scalarized after vectorization, they should be build as buildvector nodes, not vector nodes. Vectorization of such calls leads to incorrect cost estimation, does not allow to calculate correctly spills costs. Reviewers: lukel97, preames Reviewed By: preames Pull Request: https://github.com/llvm/llvm-project/pull/125070	2025-02-04 19:09:57 -05:00
Alexey Bataev	88e7b8b81c	[SLP]Use TTI::getScalarizationOverhead where possible Better to use TTI::getScalarizationOverhead instead of TTI::getVectorInstrCost to correctly calculate the costs of buildvectors/extracts. Reviewers: RKSimon Reviewed By: RKSimon Pull Request: https://github.com/llvm/llvm-project/pull/125725	2025-02-04 18:49:43 -05:00
Alexey Bataev	fe7e280820	[SLP][NFC]Move functions definitions, NFC Move functions to use them later in the following patches	2025-02-04 07:19:18 -08:00
Alexey Bataev	0c70a26f46	[SLP]Clear root node reordering only if the root node is not re-used in graph The reordering of the root node can be safely cleared only if the root node is not reused, otherwise the graph might be broken Fixes #125357	2025-02-03 06:05:19 -08:00
Simon Pilgrim	e3fbf19eb4	[SLP] getSpillCost - fully populate IntrinsicCostAttributes to improve cost analysis. (#124129 ) (REAPPLIED) We were only constructing the IntrinsicCostAttributes with the arg type info, and not the args themselves, preventing more detailed cost analysis (constant / uniform args etc.) Just pass the whole IntrinsicInst to the constructor and let it resolve everything it can. Noticed while having yet another attempt at #63980 Reapplied cleanup now that #125223 and #124984 have landed.	2025-02-03 09:55:41 +00:00
Martin Storsjö	d00579be39	Revert "[SLP]Reduce number of alternate instruction, where possible" This reverts commit d5a7a483a65f830a0c7a931781bc90046dc67ff4. That commit triggers failed asserts, see https://github.com/llvm/llvm-project/pull/123360 for details.	2025-02-02 15:56:08 +02:00
Alexey Bataev	d5a7a483a6	[SLP]Reduce number of alternate instruction, where possible Patch tries to remove wide alternate operations. Currently SLP vectorizer emits something like this: ``` %0 = add i32 %1 = sub i32 %2 = add i32 %3 = sub i32 %4 = add i32 %5 = sub i32 %6 = add i32 %7 = sub i32 transformes to %v1 = add <8 x i32> %v2 = sub <8 x i32> %res = shuffle %v1, %v2, <0, 9, 2, 11, 4, 13, 6, 15> ``` i.e. half of the results are just unused. This leads to increased register pressure and potentially doubles number of operations. Patch introduces SplitVectorize mode, where it splits the operations by opcodes and produces instead something like this: ``` %v1 = add <4 x i32> %v2 = sub <4 x i32> %res = shuffle %v1, %v2, <0, 4, 1, 5, 2, 6, 3, 7> ``` It allows to improve the performance by reducing number of ops. Also, it turns on some other improvements, like improved graph reordering. -O3+LTO, AVX512 Metric: size..text Program size..text results results0 diff test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test 277800.00 280536.00 1.0% test-suite :: MultiSource/Benchmarks/FreeBench/pifft/pifft.test 81802.00 82426.00 0.8% test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 790552.00 790952.00 0.1% test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test 383795.00 383987.00 0.1% test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test 2075541.00 2076501.00 0.0% test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test 2075541.00 2076501.00 0.0% test-suite :: MultiSource/Benchmarks/Bullet/bullet.test 312702.00 312766.00 0.0% test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12569783.00 12569751.00 -0.0% test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 2049374.00 2049358.00 -0.0% test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test 1091836.00 1091772.00 -0.0% test-suite :: MultiSource/Applications/JM/lencod/lencod.test 852339.00 852211.00 -0.0% test-suite :: MultiSource/Applications/oggenc/oggenc.test 190651.00 190523.00 -0.1% test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test 44203.00 44155.00 -0.1% test-suite :: SingleSource/UnitTests/Vector/AVX512BWVL/Vector-AVX512BWVL-mask_set_bw.test 12997.00 12981.00 -0.1% test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 668971.00 658427.00 -1.6% test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 668971.00 658427.00 -1.6% Prolangs-C/TimberWolfMC/timberwolfmc - small variations, some code not inlined FreeBench/pifft - extra stores <8 x double> vectorized, some other extra vectorizations CINT2006/464.h264ref - some smaller code + changes similar to x264 JM/ldecod - changes similar x264 CINT2017speed/600.perlbench_s CINT2017rate/500.perlbench_r - significantly compact vector code Benchmarks/Bullet - small variations CFP2017rate/526.blender_r - small variations CFP2017rate/510.parest_r - small variations CINT2006/400.perlbench - extra vector code JM/lencod - extra store <16 x i32> and other changes similar x264 Applications/oggenc - extra store <16 x i8>, small variations DOE-ProxyApps-C/miniGMG - small variations Vector/AVX512BWVL/Vector-AVX512BWVL-mask_set_bw - better vector code CINT2017speed/625.x264_s CINT2017rate/525.x264_r - the number of instructions increased, but looks like they are more performant. E.g., for function x264_pixel_satd_8x8, llvm-mca reports better throughput - 84 for the current version and 59 for the new version. -O3+LTO, march=rva32u64 CINT2017rate/525.x264_r - similar to x86, extra code in pixel_hadamard_ac function vectorized, idct4x4dc stopped being vectorized (looks like issue with shuffles cost) CINT2006/400.perlbench - better vector code CINT2006/445.gobmk - some variations in vector code CINT2006/464.h264ref - extra code vectorized CINT2017rate/500.perlbench_r - small variations -O3+LTO, mcpu=sifive-p470 Metric: size..text Program size..text results results0 diff test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 587336.00 587668.00 0.1% test-suite :: MultiSource/Applications/JM/lencod/lencod.test 643308.00 643614.00 0.0% test-suite :: MultiSource/Applications/d/make_dparser.test 79678.00 79710.00 0.0% test-suite :: MultiSource/Benchmarks/Bullet/bullet.test 277322.00 277420.00 0.0% test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test 933660.00 933682.00 0.0% test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 9497722.00 9497682.00 -0.0% test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test 1767806.00 1767772.00 -0.0% test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test 1767806.00 1767772.00 -0.0% test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test 148038.00 148024.00 -0.0% test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test 283036.00 283008.00 -0.0% test-suite :: MultiSource/Benchmarks/mediabench/g721/g721encode/encode.test 4776.00 4772.00 -0.1% test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 540582.00 511772.00 -5.3% test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 540582.00 511772.00 -5.3% CINT2006/464.h264ref - extra vector code in find_sad_16x16 JM/lencod - extra vector code in find_sad_16x16 d/make_dparser - smaller vector code Benchmarks/Bullet - small variations CINT2006/400.perlbench - smaller vector code CFP2017rate/526.blender_r - small variations, extra store <8 x float> in the loop, extra store <8 x i8> in loop CINT2017rate/500.perlbench_r CINT2017speed/600.perlbench_s - small variations MiBench/consumer-lame - small variations JM/ldecod - extra vector code mediabench/g721/g721encode - small variations CINT2017rate/525.x264_r CINT2017speed/625.x264_s - reduced number of wide operations and shuffles, saving the registers, similar to X86, extra code in pixel_hadamard_ac vectorized, idct4x4dc not vectorized (issue with some TTI costs) Reviewers: RKSimon, hiraditya Reviewed By: RKSimon Pull Request: https://github.com/llvm/llvm-project/pull/123360	2025-02-01 10:00:16 -08:00
Alexey Bataev	d9c9326a21	[SLP]Recalculate number of parts when requesting number of elements based on original scalars size Need to recalculate number of parts, since gathered scalar size might be changed during building the buildvector shuffles. Fixes #125259	2025-01-31 12:55:03 -08:00
Alexey Bataev	631abff733	Revert "[SLP]Use the size of gathered scalars when evaluating slice size" This reverts commit e78aa8f35e6dd66d5152396406d3d4f37f43e7f4 to fix crashes reported in https://lab.llvm.org/buildbot/#/builders/140/builds/16047.	2025-01-31 11:31:45 -08:00
Alexey Bataev	e78aa8f35e	[SLP]Use the size of gathered scalars when evaluating slice size Need to use the size of the gathered scalars, not the original size of the buildvector scalars, since gathered scalar size might be changed during building the buildvector shuffles. Fixes #125259	2025-01-31 11:19:46 -08:00
Kazu Hirata	5a116f8a73	[Vectorize] Migrate away from PointerUnion::dyn_cast (NFC) (#125159 ) Note that PointerUnion::dyn_cast has been soft deprecated in PointerUnion.h: // FIXME: Replace the uses of is(), get() and dyn_cast() with // isa<T>, cast<T> and the llvm::dyn_cast<T> Literal migration would result in dyn_cast_if_present (see the definition of PointerUnion::dyn_cast), but this patch uses dyn_cast because we expect InVectors.front() and P to be nonnull.	2025-01-31 07:50:44 -08:00
Alexey Bataev	6dd07b17c7	Revert "[SLP]Reduce number of alternate instruction, where possible" This reverts commit e588085af03ba4be14a502806918fd74ca1cf367 to fix a crash reported in https://github.com/llvm/llvm-project/pull/123360#issuecomment-2627439245	2025-01-31 06:37:16 -08:00
Alexey Bataev	e588085af0	[SLP]Reduce number of alternate instruction, where possible Patch tries to remove wide alternate operations. Currently SLP vectorizer emits something like this: ``` %0 = add i32 %1 = sub i32 %2 = add i32 %3 = sub i32 %4 = add i32 %5 = sub i32 %6 = add i32 %7 = sub i32 transformes to %v1 = add <8 x i32> %v2 = sub <8 x i32> %res = shuffle %v1, %v2, <0, 9, 2, 11, 4, 13, 6, 15> ``` i.e. half of the results are just unused. This leads to increased register pressure and potentially doubles number of operations. Patch introduces SplitVectorize mode, where it splits the operations by opcodes and produces instead something like this: ``` %v1 = add <4 x i32> %v2 = sub <4 x i32> %res = shuffle %v1, %v2, <0, 4, 1, 5, 2, 6, 3, 7> ``` It allows to improve the performance by reducing number of ops. Also, it turns on some other improvements, like improved graph reordering. -O3+LTO, AVX512 Metric: size..text Program size..text results results0 diff test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test 277800.00 280536.00 1.0% test-suite :: MultiSource/Benchmarks/FreeBench/pifft/pifft.test 81802.00 82426.00 0.8% test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 790552.00 790952.00 0.1% test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test 383795.00 383987.00 0.1% test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test 2075541.00 2076501.00 0.0% test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test 2075541.00 2076501.00 0.0% test-suite :: MultiSource/Benchmarks/Bullet/bullet.test 312702.00 312766.00 0.0% test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12569783.00 12569751.00 -0.0% test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 2049374.00 2049358.00 -0.0% test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test 1091836.00 1091772.00 -0.0% test-suite :: MultiSource/Applications/JM/lencod/lencod.test 852339.00 852211.00 -0.0% test-suite :: MultiSource/Applications/oggenc/oggenc.test 190651.00 190523.00 -0.1% test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test 44203.00 44155.00 -0.1% test-suite :: SingleSource/UnitTests/Vector/AVX512BWVL/Vector-AVX512BWVL-mask_set_bw.test 12997.00 12981.00 -0.1% test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 668971.00 658427.00 -1.6% test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 668971.00 658427.00 -1.6% Prolangs-C/TimberWolfMC/timberwolfmc - small variations, some code not inlined FreeBench/pifft - extra stores <8 x double> vectorized, some other extra vectorizations CINT2006/464.h264ref - some smaller code + changes similar to x264 JM/ldecod - changes similar x264 CINT2017speed/600.perlbench_s CINT2017rate/500.perlbench_r - significantly compact vector code Benchmarks/Bullet - small variations CFP2017rate/526.blender_r - small variations CFP2017rate/510.parest_r - small variations CINT2006/400.perlbench - extra vector code JM/lencod - extra store <16 x i32> and other changes similar x264 Applications/oggenc - extra store <16 x i8>, small variations DOE-ProxyApps-C/miniGMG - small variations Vector/AVX512BWVL/Vector-AVX512BWVL-mask_set_bw - better vector code CINT2017speed/625.x264_s CINT2017rate/525.x264_r - the number of instructions increased, but looks like they are more performant. E.g., for function x264_pixel_satd_8x8, llvm-mca reports better throughput - 84 for the current version and 59 for the new version. -O3+LTO, march=rva32u64 CINT2017rate/525.x264_r - similar to x86, extra code in pixel_hadamard_ac function vectorized, idct4x4dc stopped being vectorized (looks like issue with shuffles cost) CINT2006/400.perlbench - better vector code CINT2006/445.gobmk - some variations in vector code CINT2006/464.h264ref - extra code vectorized CINT2017rate/500.perlbench_r - small variations -O3+LTO, mcpu=sifive-p470 Metric: size..text Program size..text results results0 diff test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 587336.00 587668.00 0.1% test-suite :: MultiSource/Applications/JM/lencod/lencod.test 643308.00 643614.00 0.0% test-suite :: MultiSource/Applications/d/make_dparser.test 79678.00 79710.00 0.0% test-suite :: MultiSource/Benchmarks/Bullet/bullet.test 277322.00 277420.00 0.0% test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test 933660.00 933682.00 0.0% test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 9497722.00 9497682.00 -0.0% test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test 1767806.00 1767772.00 -0.0% test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test 1767806.00 1767772.00 -0.0% test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test 148038.00 148024.00 -0.0% test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test 283036.00 283008.00 -0.0% test-suite :: MultiSource/Benchmarks/mediabench/g721/g721encode/encode.test 4776.00 4772.00 -0.1% test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 540582.00 511772.00 -5.3% test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 540582.00 511772.00 -5.3% CINT2006/464.h264ref - extra vector code in find_sad_16x16 JM/lencod - extra vector code in find_sad_16x16 d/make_dparser - smaller vector code Benchmarks/Bullet - small variations CINT2006/400.perlbench - smaller vector code CFP2017rate/526.blender_r - small variations, extra store <8 x float> in the loop, extra store <8 x i8> in loop CINT2017rate/500.perlbench_r CINT2017speed/600.perlbench_s - small variations MiBench/consumer-lame - small variations JM/ldecod - extra vector code mediabench/g721/g721encode - small variations CINT2017rate/525.x264_r CINT2017speed/625.x264_s - reduced number of wide operations and shuffles, saving the registers, similar to X86, extra code in pixel_hadamard_ac vectorized, idct4x4dc not vectorized (issue with some TTI costs) Reviewers: RKSimon, hiraditya Reviewed By: RKSimon Pull Request: https://github.com/llvm/llvm-project/pull/123360	2025-01-31 06:56:12 -05:00
Alexey Bataev	466217eb03	[SLP]Fix graph traversal in getSpillCost getSpill cost relies on def-use order when performs the analysis for the vectorized instructions live-over-calls spills. Patch fixes it to check the dependencies based on TreeEntries and performs actual vectorized type analysis. Reviewers: RKSimon, preames Reviewed By: preames Pull Request: https://github.com/llvm/llvm-project/pull/124984	2025-01-31 06:27:47 -05:00
Philip Reames	3cef99f652	[SLP] Use early return in NoCallIntrinsic	2025-01-30 08:02:40 -08:00
Simon Pilgrim	5921295dca	Revert "[SLP] getSpillCost - fully populate IntrinsicCostAttributes to improve cost analysis." (#124962 ) Reverts llvm/llvm-project#124129 as its currently causing a regression at #124499 - avoids the regression until a proper fix can be added to getSpillCost	2025-01-29 22:17:53 +00:00
Alexey Bataev	4a1a697427	[SLP][NFC]Unify ScalarToTreeEntries and MultiNodeScalars, NFC Currently, SLP has 2 distinct storages to manage mapping between vectorized instructions and their corresponding vectorized TreeEntry nodes. It leads to inefficient lookup for the matching TreeEntries and makes it harder to correctly track instructions, associated with multiple nodes. There is a plan to extend this support for instructions, that require scheduling, to allow support for copyable elements. Merging ScalarToTreeEntry and MultiNodeScalars will allow reduce maintenance of the feature Reviewers: RKSimon Reviewed By: RKSimon Pull Request: https://github.com/llvm/llvm-project/pull/124914	2025-01-29 09:05:54 -05:00
Alexey Bataev	947d8ebbf3	[SLP]Unify getNumberOfParts use Adds getNumberOfParts and uses it instead of similar code across code base, fixes analysis of non-vectorizable types in computeMinimumValueSizes. Reviewers: RKSimon Reviewed By: RKSimon Pull Request: https://github.com/llvm/llvm-project/pull/124774	2025-01-28 12:16:44 -05:00
Alexey Bataev	a1ab5b4c87	[SLP]Check the MainOp matches the requirements for the instructions Need to include MainOp into the analysis of the instructions in getSameOpcode to be sure that it is checked for the requirements to prevent crashes during further analysis.	2025-01-28 06:00:52 -08:00
Alexey Bataev	1d5fbe83c3	[SLP]Adjust NumberOfParts value for adjusted number of buildvector scalars Need to adjust NumParts value, when GatheredScalars scalars are adjusted after extractelements analysis, to fix compiler crash	2025-01-28 05:45:13 -08:00
Han-Kuan Chen	08d14e10ca	[SLP] Fix CommonMask will be transformed into an incorrect mask if createShuffle is called multiple times. (#124244 ) We have two types of mask in SLP: a scalar mask and a vector mask. When vectorizing four i32 additions into <4 x i32>, SLP creates a mask of length 4. When vectorizing four <2 x i32> additions into <8 x i32>, SLP also creates a mask of length 4. We refer to the first case as a scalar mask (because the mask element represents a scalar, i32), and the second case as a vector mask (because the mask element represents a vector, <4 x i32>). At some point, we must convert the scalar mask into a vector mask (otherwise, calling TTI cost functions or IRBuilderBase functions may yield incorrect results). Since both ShuffleCostEstimator and ShuffleInstructionBuilder can modify the CommonMask, we have decided to perform the mask transformation only within createShuffle. However, we do not store the transformed result, as createShuffle may be called multiple times.	2025-01-28 12:02:37 +08:00
Alexey Bataev	f1d5e70a00	[SLP][NFC]Do not check poison values for corresponding vectorized entries No need to check poison values if they have been vectorized and/or mark them as vectorized, it should work only for instructions.	2025-01-27 06:38:23 -08:00
Simon Pilgrim	a12d7e4b61	[SLP] getVectorCallCosts - don't provide scalar argument data for vector IntrinsicCostAttributes (#124254 ) getVectorCallCosts determines the cost of a vector intrinsic, based off an existing scalar intrinsic call - but we were including the scalar argument data to the IntrinsicCostAttributes, which meant that not only was the cost calculation not type-only based, it was making incorrect assumptions about constant values etc. This also exposed an issue that x86 relied on fallback calculations for funnel shift costs - this is great when we have the argument data as that improves the accuracy of uniform shift amounts etc., but meant that type-only costs would default to Cost=2 for all custom lowered funnel shifts, which was far too cheap. This is the reverse of #124129 where we weren't including argument data when we could. Fixes #63980	2025-01-24 15:13:13 +00:00
Jeremy Morse	8e70273509	[NFC][DebugInfo] Use iterator moveBefore at many call-sites (#123583 ) As part of the "RemoveDIs" project, BasicBlock::iterator now carries a debug-info bit that's needed when getFirstNonPHI and similar feed into instruction insertion positions. Call-sites where that's necessary were updated a year ago; but to ensure some type safety however, we'd like to have all calls to moveBefore use iterators. This patch adds a (guaranteed dereferenceable) iterator-taking moveBefore, and changes a bunch of call-sites where it's obviously safe to change to use it by just calling getIterator() on an instruction pointer. A follow-up patch will contain less-obviously-safe changes. We'll eventually deprecate and remove the instruction-pointer insertBefore, but not before adding concise documentation of what considerations are needed (very few).	2025-01-24 10:53:11 +00:00
Alexey Bataev	c7e6ca76cb	[SLP][NFC]Add dump() method for ScheduleData struct type for better debugging	2025-01-23 09:49:37 -08:00
Simon Pilgrim	d8cd8d56ea	[SLP] getSpillCost - fully populate IntrinsicCostAttributes to improve cost analysis. (#124129 ) We were only constructing the IntrinsicCostAttributes with the arg type info, and not the args themselves, preventing more detailed cost analysis (constant / uniform args etc.) Just pass the whole IntrinsicInst to the constructor and let it resolve everything it can. Noticed while having yet another attempt at #63980	2025-01-23 16:57:13 +00:00
Alexey Bataev	fa299294c0	[SLP][NFC]Modernize code base in several places	2025-01-23 08:43:07 -08:00
Han-Kuan Chen	d3aea77f50	[SLP] Move transformMaskAfterShuffle into BaseShuffleAnalysis and use it as much as possible. (#123896 )	2025-01-23 09:47:38 +08:00
Alexey Bataev	5deb4ef9ab	[SLP]Initial non-power-of-2 (but still whole register) for remaining nodes Added non-power-of-2 (but still whole registers) vectorization support for nodes other than stores and reductions. Reviewers: preames, RKSimon, hiraditya Reviewed By: RKSimon Pull Request: https://github.com/llvm/llvm-project/pull/113356	2025-01-21 10:33:03 -05:00

1 2 3 4 5 ...

2117 Commits