llvm-project

Author	SHA1	Message	Date
Alexey Bataev	10844fb9b0	[SLP]Fix attempt to build the reorder mask for non-adjusted reuse mask When building the reorder for non-single use reuse mask, need to check if the size of the mask is multiple of the number of unique scalars. Otherwise, the compiler may crash when trying to reorder nodes. Fixes #126304	2025-02-11 13:41:25 -08:00
Mikhail R. Gadelha	e78be31639	[RISCV] Added cost model for fmuladd (#125683 ) This patch updates the cost model for fmuladd on vector types to scale with LMUL. This was found when analyzing a hot loop in 519.lbm_r that was unprofitably vectorized, but doesn't directly impact that case and is split off so it doesn't get forgotten. Unlike other FP arithmetic ops, it's not scaled by 2 because the scalar cost isn't scaled by 2.	2025-02-05 09:33:24 -03:00
Simon Pilgrim	4fdd28b791	[SLP][X86] Add test coverage for #124993	2025-02-05 08:54:09 +00:00
Alexey Bataev	7dca2c628c	[SLP]Gather scalarized calls If the calls won't be vectorized, but will be scalarized after vectorization, they should be built as buildvector nodes, not vector nodes. Vectorization of such calls leads to incorrect cost estimation and prevents correct calculation of spill costs. Reviewers: lukel97, preames Reviewed By: preames Pull Request: https://github.com/llvm/llvm-project/pull/125070	2025-02-04 19:09:57 -05:00
Alexey Bataev	88e7b8b81c	[SLP]Use TTI::getScalarizationOverhead where possible Better to use TTI::getScalarizationOverhead instead of TTI::getVectorInstrCost to correctly calculate the costs of buildvectors/extracts. Reviewers: RKSimon Reviewed By: RKSimon Pull Request: https://github.com/llvm/llvm-project/pull/125725	2025-02-04 18:49:43 -05:00
Alexey Bataev	5ca136d0e7	[SLP][NFC]Replace undefs with just poison in the test	2025-02-04 08:24:45 -08:00
Simon Pilgrim	f4c2e5df6f	[SLP][X86] revectorized_rdx_crash.ll - regenerate to reduce diff in #118293	2025-02-04 15:13:07 +00:00
Alexey Bataev	0c70a26f46	[SLP]Clear root node reordering only if the root node is not re-used in graph The reordering of the root node can be safely cleared only if the root node is not reused, otherwise the graph might be broken Fixes #125357	2025-02-03 06:05:19 -08:00
Martin Storsjö	d00579be39	Revert "[SLP]Reduce number of alternate instruction, where possible" This reverts commit d5a7a483a65f830a0c7a931781bc90046dc67ff4. That commit triggers failed asserts, see https://github.com/llvm/llvm-project/pull/123360 for details.	2025-02-02 15:56:08 +02:00
Alexey Bataev	d5a7a483a6	[SLP]Reduce number of alternate instruction, where possible

Patch tries to remove wide alternate operations. Currently SLP vectorizer emits something like this:
```
%0 = add i32
%1 = sub i32
%2 = add i32
%3 = sub i32
%4 = add i32
%5 = sub i32
%6 = add i32
%7 = sub i32

transforms to

%v1 = add <8 x i32>
%v2 = sub <8 x i32>
%res = shuffle %v1, %v2, <0, 9, 2, 11, 4, 13, 6, 15>
```
i.e. half of the results are just unused. This leads to increased register pressure and potentially doubles the number of operations.

Patch introduces SplitVectorize mode, where it splits the operations by opcodes and produces instead something like this:
```
%v1 = add <4 x i32>
%v2 = sub <4 x i32>
%res = shuffle %v1, %v2, <0, 4, 1, 5, 2, 6, 3, 7>
```
This improves performance by reducing the number of ops. It also enables some other improvements, such as improved graph reordering.

-O3+LTO, AVX512
Metric: size..text
Program  results  results0  diff
test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test  277800.00  280536.00  1.0%
test-suite :: MultiSource/Benchmarks/FreeBench/pifft/pifft.test  81802.00  82426.00  0.8%
test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test  790552.00  790952.00  0.1%
test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test  383795.00  383987.00  0.1%
test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test  2075541.00  2076501.00  0.0%
test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test  2075541.00  2076501.00  0.0%
test-suite :: MultiSource/Benchmarks/Bullet/bullet.test  312702.00  312766.00  0.0%
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test  12569783.00  12569751.00  -0.0%
test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test  2049374.00  2049358.00  -0.0%
test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test  1091836.00  1091772.00  -0.0%
test-suite :: MultiSource/Applications/JM/lencod/lencod.test  852339.00  852211.00  -0.0%
test-suite :: MultiSource/Applications/oggenc/oggenc.test  190651.00  190523.00  -0.1%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test  44203.00  44155.00  -0.1%
test-suite :: SingleSource/UnitTests/Vector/AVX512BWVL/Vector-AVX512BWVL-mask_set_bw.test  12997.00  12981.00  -0.1%
test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test  668971.00  658427.00  -1.6%
test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test  668971.00  658427.00  -1.6%

Prolangs-C/TimberWolfMC/timberwolfmc - small variations, some code not inlined
FreeBench/pifft - extra stores <8 x double> vectorized, some other extra vectorizations
CINT2006/464.h264ref - some smaller code + changes similar to x264
JM/ldecod - changes similar to x264
CINT2017speed/600.perlbench_s, CINT2017rate/500.perlbench_r - significantly compact vector code
Benchmarks/Bullet - small variations
CFP2017rate/526.blender_r - small variations
CFP2017rate/510.parest_r - small variations
CINT2006/400.perlbench - extra vector code
JM/lencod - extra store <16 x i32> and other changes similar to x264
Applications/oggenc - extra store <16 x i8>, small variations
DOE-ProxyApps-C/miniGMG - small variations
Vector/AVX512BWVL/Vector-AVX512BWVL-mask_set_bw - better vector code
CINT2017speed/625.x264_s, CINT2017rate/525.x264_r - the number of instructions increased, but looks like they are more performant. E.g., for function x264_pixel_satd_8x8, llvm-mca reports better throughput - 84 for the current version and 59 for the new version.

-O3+LTO, march=rva32u64
CINT2017rate/525.x264_r - similar to x86, extra code in pixel_hadamard_ac function vectorized, idct4x4dc stopped being vectorized (looks like issue with shuffles cost)
CINT2006/400.perlbench - better vector code
CINT2006/445.gobmk - some variations in vector code
CINT2006/464.h264ref - extra code vectorized
CINT2017rate/500.perlbench_r - small variations

-O3+LTO, mcpu=sifive-p470
Metric: size..text
Program  results  results0  diff
test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test  587336.00  587668.00  0.1%
test-suite :: MultiSource/Applications/JM/lencod/lencod.test  643308.00  643614.00  0.0%
test-suite :: MultiSource/Applications/d/make_dparser.test  79678.00  79710.00  0.0%
test-suite :: MultiSource/Benchmarks/Bullet/bullet.test  277322.00  277420.00  0.0%
test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test  933660.00  933682.00  0.0%
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test  9497722.00  9497682.00  -0.0%
test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test  1767806.00  1767772.00  -0.0%
test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test  1767806.00  1767772.00  -0.0%
test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test  148038.00  148024.00  -0.0%
test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test  283036.00  283008.00  -0.0%
test-suite :: MultiSource/Benchmarks/mediabench/g721/g721encode/encode.test  4776.00  4772.00  -0.1%
test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test  540582.00  511772.00  -5.3%
test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test  540582.00  511772.00  -5.3%

CINT2006/464.h264ref - extra vector code in find_sad_16x16
JM/lencod - extra vector code in find_sad_16x16
d/make_dparser - smaller vector code
Benchmarks/Bullet - small variations
CINT2006/400.perlbench - smaller vector code
CFP2017rate/526.blender_r - small variations, extra store <8 x float> in the loop, extra store <8 x i8> in loop
CINT2017rate/500.perlbench_r, CINT2017speed/600.perlbench_s - small variations
MiBench/consumer-lame - small variations
JM/ldecod - extra vector code
mediabench/g721/g721encode - small variations
CINT2017rate/525.x264_r, CINT2017speed/625.x264_s - reduced number of wide operations and shuffles, saving the registers, similar to X86, extra code in pixel_hadamard_ac vectorized, idct4x4dc not vectorized (issue with some TTI costs)

Reviewers: RKSimon, hiraditya
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/123360	2025-02-01 10:00:16 -08:00
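
The two shuffle masks quoted in the commit message above follow a simple pattern, and a small sketch may make the difference easier to see. The C++ below is only an illustration, not code from the SLP vectorizer: the function names `wideAlternateMask` and `splitInterleaveMask` are hypothetical, and the sketch assumes the usual shufflevector convention that indices at or beyond the first operand's length select from the second operand.

```cpp
#include <cstddef>
#include <vector>

// Mask for the "wide alternate" form: both operands have numLanes lanes.
// Even result lanes come from the first operand (the add), odd result lanes
// from the second operand (the sub), whose lanes are numbered from numLanes.
std::vector<int> wideAlternateMask(std::size_t numLanes) {
  std::vector<int> mask(numLanes);
  for (std::size_t i = 0; i < numLanes; ++i)
    mask[i] = (i % 2 == 0) ? static_cast<int>(i)
                           : static_cast<int>(numLanes + i);
  return mask;
}

// Mask for the split form: each operand has only numLanes / 2 lanes, and the
// result interleaves them (first operand lane k, then second operand lane k).
// Assumes numLanes is even.
std::vector<int> splitInterleaveMask(std::size_t numLanes) {
  const std::size_t half = numLanes / 2;
  std::vector<int> mask(numLanes);
  for (std::size_t i = 0; i < numLanes; ++i)
    mask[i] = static_cast<int>(i / 2 + (i % 2) * half);
  return mask;
}
```

For eight result lanes, `wideAlternateMask(8)` yields `<0, 9, 2, 11, 4, 13, 6, 15>` and `splitInterleaveMask(8)` yields `<0, 4, 1, 5, 2, 6, 3, 7>`, matching the two masks in the message. In the wide form, half of the sixteen computed lanes never appear in the mask, which is the waste the patch describes.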
Simon Pilgrim	71d05ac64e	[TTI] getTypeBasedIntrinsicInstrCost - add basic handling for strided load/store intrinsics (#125223 ) (REAPPLIED) As noted on #124499 - this is currently missing for type-only analysis and was falling back to scalarization for fixed vectors (and failing entirely for scalable vectors)	2025-02-01 16:24:51 +00:00
Alexey Bataev	d9c9326a21	[SLP]Recalculate number of parts when requesting number of elements based on original scalars size Need to recalculate number of parts, since gathered scalar size might be changed during building the buildvector shuffles. Fixes #125259	2025-01-31 12:55:03 -08:00
Alexey Bataev	631abff733	Revert "[SLP]Use the size of gathered scalars when evaluating slice size" This reverts commit e78aa8f35e6dd66d5152396406d3d4f37f43e7f4 to fix crashes reported in https://lab.llvm.org/buildbot/#/builders/140/builds/16047.	2025-01-31 11:31:45 -08:00
Alexey Bataev	e78aa8f35e	[SLP]Use the size of gathered scalars when evaluating slice size Need to use the size of the gathered scalars, not the original size of the buildvector scalars, since gathered scalar size might be changed during building the buildvector shuffles. Fixes #125259	2025-01-31 11:19:46 -08:00
Alexey Bataev	9955d849d6	[SLP][NFC]Add a test with the incorrect shuffled elements of buildvector	2025-01-31 11:05:20 -08:00
Alexey Bataev	6dd07b17c7	Revert "[SLP]Reduce number of alternate instruction, where possible" This reverts commit e588085af03ba4be14a502806918fd74ca1cf367 to fix a crash reported in https://github.com/llvm/llvm-project/pull/123360#issuecomment-2627439245	2025-01-31 06:37:16 -08:00
Alexey Bataev	e588085af0	[SLP]Reduce number of alternate instruction, where possible

Patch tries to remove wide alternate operations. Currently SLP vectorizer emits something like this:
```
%0 = add i32
%1 = sub i32
%2 = add i32
%3 = sub i32
%4 = add i32
%5 = sub i32
%6 = add i32
%7 = sub i32

transforms to

%v1 = add <8 x i32>
%v2 = sub <8 x i32>
%res = shuffle %v1, %v2, <0, 9, 2, 11, 4, 13, 6, 15>
```
i.e. half of the results are just unused. This leads to increased register pressure and potentially doubles the number of operations.

Patch introduces SplitVectorize mode, where it splits the operations by opcodes and produces instead something like this:
```
%v1 = add <4 x i32>
%v2 = sub <4 x i32>
%res = shuffle %v1, %v2, <0, 4, 1, 5, 2, 6, 3, 7>
```
This improves performance by reducing the number of ops. It also enables some other improvements, such as improved graph reordering.

-O3+LTO, AVX512
Metric: size..text
Program  results  results0  diff
test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test  277800.00  280536.00  1.0%
test-suite :: MultiSource/Benchmarks/FreeBench/pifft/pifft.test  81802.00  82426.00  0.8%
test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test  790552.00  790952.00  0.1%
test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test  383795.00  383987.00  0.1%
test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test  2075541.00  2076501.00  0.0%
test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test  2075541.00  2076501.00  0.0%
test-suite :: MultiSource/Benchmarks/Bullet/bullet.test  312702.00  312766.00  0.0%
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test  12569783.00  12569751.00  -0.0%
test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test  2049374.00  2049358.00  -0.0%
test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test  1091836.00  1091772.00  -0.0%
test-suite :: MultiSource/Applications/JM/lencod/lencod.test  852339.00  852211.00  -0.0%
test-suite :: MultiSource/Applications/oggenc/oggenc.test  190651.00  190523.00  -0.1%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test  44203.00  44155.00  -0.1%
test-suite :: SingleSource/UnitTests/Vector/AVX512BWVL/Vector-AVX512BWVL-mask_set_bw.test  12997.00  12981.00  -0.1%
test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test  668971.00  658427.00  -1.6%
test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test  668971.00  658427.00  -1.6%

Prolangs-C/TimberWolfMC/timberwolfmc - small variations, some code not inlined
FreeBench/pifft - extra stores <8 x double> vectorized, some other extra vectorizations
CINT2006/464.h264ref - some smaller code + changes similar to x264
JM/ldecod - changes similar to x264
CINT2017speed/600.perlbench_s, CINT2017rate/500.perlbench_r - significantly compact vector code
Benchmarks/Bullet - small variations
CFP2017rate/526.blender_r - small variations
CFP2017rate/510.parest_r - small variations
CINT2006/400.perlbench - extra vector code
JM/lencod - extra store <16 x i32> and other changes similar to x264
Applications/oggenc - extra store <16 x i8>, small variations
DOE-ProxyApps-C/miniGMG - small variations
Vector/AVX512BWVL/Vector-AVX512BWVL-mask_set_bw - better vector code
CINT2017speed/625.x264_s, CINT2017rate/525.x264_r - the number of instructions increased, but looks like they are more performant. E.g., for function x264_pixel_satd_8x8, llvm-mca reports better throughput - 84 for the current version and 59 for the new version.

-O3+LTO, march=rva32u64
CINT2017rate/525.x264_r - similar to x86, extra code in pixel_hadamard_ac function vectorized, idct4x4dc stopped being vectorized (looks like issue with shuffles cost)
CINT2006/400.perlbench - better vector code
CINT2006/445.gobmk - some variations in vector code
CINT2006/464.h264ref - extra code vectorized
CINT2017rate/500.perlbench_r - small variations

-O3+LTO, mcpu=sifive-p470
Metric: size..text
Program  results  results0  diff
test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test  587336.00  587668.00  0.1%
test-suite :: MultiSource/Applications/JM/lencod/lencod.test  643308.00  643614.00  0.0%
test-suite :: MultiSource/Applications/d/make_dparser.test  79678.00  79710.00  0.0%
test-suite :: MultiSource/Benchmarks/Bullet/bullet.test  277322.00  277420.00  0.0%
test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test  933660.00  933682.00  0.0%
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test  9497722.00  9497682.00  -0.0%
test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test  1767806.00  1767772.00  -0.0%
test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test  1767806.00  1767772.00  -0.0%
test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test  148038.00  148024.00  -0.0%
test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test  283036.00  283008.00  -0.0%
test-suite :: MultiSource/Benchmarks/mediabench/g721/g721encode/encode.test  4776.00  4772.00  -0.1%
test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test  540582.00  511772.00  -5.3%
test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test  540582.00  511772.00  -5.3%

CINT2006/464.h264ref - extra vector code in find_sad_16x16
JM/lencod - extra vector code in find_sad_16x16
d/make_dparser - smaller vector code
Benchmarks/Bullet - small variations
CINT2006/400.perlbench - smaller vector code
CFP2017rate/526.blender_r - small variations, extra store <8 x float> in the loop, extra store <8 x i8> in loop
CINT2017rate/500.perlbench_r, CINT2017speed/600.perlbench_s - small variations
MiBench/consumer-lame - small variations
JM/ldecod - extra vector code
mediabench/g721/g721encode - small variations
CINT2017rate/525.x264_r, CINT2017speed/625.x264_s - reduced number of wide operations and shuffles, saving the registers, similar to X86, extra code in pixel_hadamard_ac vectorized, idct4x4dc not vectorized (issue with some TTI costs)

Reviewers: RKSimon, hiraditya
Reviewed By: RKSimon
Pull Request: https://github.com/llvm/llvm-project/pull/123360	2025-01-31 06:56:12 -05:00
Alexey Bataev	466217eb03	[SLP]Fix graph traversal in getSpillCost getSpillCost relies on def-use order when it performs the analysis of live-over-calls spills for the vectorized instructions. Patch fixes it to check the dependencies based on TreeEntries and performs actual vectorized type analysis. Reviewers: RKSimon, preames Reviewed By: preames Pull Request: https://github.com/llvm/llvm-project/pull/124984	2025-01-31 06:27:47 -05:00
Alexey Bataev	fc39617746	[SLP][NFC]Update tests and remove undefs, NFC	2025-01-30 06:58:14 -08:00
Alexey Bataev	d1033d15cb	[SLP][NFC]Autogenerate checks and remove undef, NFC	2025-01-30 06:49:42 -08:00
Simon Pilgrim	5921295dca	Revert "[SLP] getSpillCost - fully populate IntrinsicCostAttributes to improve cost analysis." (#124962 ) Reverts llvm/llvm-project#124129 as it's currently causing a regression at #124499 - avoids the regression until a proper fix can be added to getSpillCost	2025-01-29 22:17:53 +00:00
Nikita Popov	29441e4f5f	[IR] Convert from nocapture to captures(none) (#123181 ) This PR removes the old `nocapture` attribute, replacing it with the new `captures` attribute introduced in #116990. This change is intended to be essentially NFC, replacing existing uses of `nocapture` with `captures(none)` without adding any new analysis capabilities. Making use of non-`none` values is left for a followup. Some notes: * `nocapture` will be upgraded to `captures(none)` by the bitcode reader. * `nocapture` will also be upgraded by the textual IR reader. This is to make it easier to use old IR files and somewhat reduce the test churn in this PR. * Helper APIs like `doesNotCapture()` will check for `captures(none)`. * MLIR import will convert `captures(none)` into an `llvm.nocapture` attribute. The representation in the LLVM IR dialect should be updated separately.	2025-01-29 16:56:47 +01:00
Simon Pilgrim	89ca3e72ca	[CostModel][X86] Reduce worst case v8i16/v16i8 SSE2 shuffle costs (#124789 ) These were based off instruction count, not throughput - we can probably improve these further, but these throughput numbers match the worse expanded shuffles we see in the vector-shuffle-128-v* codegen tests.	2025-01-29 10:23:09 +00:00
Alexey Bataev	947d8ebbf3	[SLP]Unify getNumberOfParts use Adds getNumberOfParts and uses it instead of similar code across code base, fixes analysis of non-vectorizable types in computeMinimumValueSizes. Reviewers: RKSimon Reviewed By: RKSimon Pull Request: https://github.com/llvm/llvm-project/pull/124774	2025-01-28 12:16:44 -05:00
Alexey Bataev	a1ab5b4c87	[SLP]Check the MainOp matches the requirements for the instructions Need to include MainOp into the analysis of the instructions in getSameOpcode to be sure that it is checked for the requirements to prevent crashes during further analysis.	2025-01-28 06:00:52 -08:00
Alexey Bataev	1d5fbe83c3	[SLP]Adjust NumberOfParts value for adjusted number of buildvector scalars Need to adjust NumParts value, when GatheredScalars scalars are adjusted after extractelements analysis, to fix compiler crash	2025-01-28 05:45:13 -08:00
Han-Kuan Chen	08d14e10ca	[SLP] Fix CommonMask will be transformed into an incorrect mask if createShuffle is called multiple times. (#124244 ) We have two types of mask in SLP: a scalar mask and a vector mask. When vectorizing four i32 additions into <4 x i32>, SLP creates a mask of length 4. When vectorizing four <2 x i32> additions into <8 x i32>, SLP also creates a mask of length 4. We refer to the first case as a scalar mask (because the mask element represents a scalar, i32), and the second case as a vector mask (because the mask element represents a vector, <2 x i32>). At some point, we must convert the scalar mask into a vector mask (otherwise, calling TTI cost functions or IRBuilderBase functions may yield incorrect results). Since both ShuffleCostEstimator and ShuffleInstructionBuilder can modify the CommonMask, we have decided to perform the mask transformation only within createShuffle. However, we do not store the transformed result, as createShuffle may be called multiple times.	2025-01-28 12:02:37 +08:00
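
The mask conversion described in the commit message above can be illustrated with a short standalone sketch. This is not the SLP vectorizer's code: the names `expandMask` and `kPoisonLane` are hypothetical, and the sketch assumes that a mask of length 4 over four `<2 x i32>` values is widened by replacing each index with the range of lane indices that value occupies in the `<8 x i32>` result, with poison entries staying poison.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical marker for a poison (don't-care) mask element.
constexpr int kPoisonLane = -1;

// Expand a mask whose elements each name a whole sub-vector of
// lanesPerElement lanes into a mask over individual lanes.
std::vector<int> expandMask(const std::vector<int> &mask,
                            std::size_t lanesPerElement) {
  std::vector<int> expanded;
  expanded.reserve(mask.size() * lanesPerElement);
  for (int idx : mask) {
    for (std::size_t lane = 0; lane < lanesPerElement; ++lane) {
      // Sub-vector idx occupies lanes [idx * N, idx * N + N) of the result.
      expanded.push_back(idx == kPoisonLane
                             ? kPoisonLane
                             : idx * static_cast<int>(lanesPerElement) +
                                   static_cast<int>(lane));
    }
  }
  return expanded;
}
```

For example, `expandMask({1, 0, 3, 2}, 2)` gives `{2, 3, 0, 1, 6, 7, 4, 5}`. Applying the expansion a second time to that result would produce a mask of length 16 with out-of-range indices, which is one way to read the message's point that the transformed mask is not stored back when the transformation can run more than once.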
Simon Pilgrim	dec47b76f4	[CostModel][X86] Update baseline CTTZ/CTLZ costs for x86_64 (#124312 ) Followup to #123623 - now that the CMOV has been removed, the throughput has improved, reducing the benefit of vectorization on pre-x86-64-v3 CPUs	2025-01-26 14:43:51 +00:00
Alexey Bataev	5e65f43041	[SLP][NFC]Add a test, producing series of extractelements, building non-extendable tree	2025-01-25 11:50:14 -08:00
Simon Pilgrim	a12d7e4b61	[SLP] getVectorCallCosts - don't provide scalar argument data for vector IntrinsicCostAttributes (#124254 ) getVectorCallCosts determines the cost of a vector intrinsic, based off an existing scalar intrinsic call - but we were including the scalar argument data to the IntrinsicCostAttributes, which meant that not only was the cost calculation not type-only based, it was making incorrect assumptions about constant values etc. This also exposed an issue that x86 relied on fallback calculations for funnel shift costs - this is great when we have the argument data as that improves the accuracy of uniform shift amounts etc., but meant that type-only costs would default to Cost=2 for all custom lowered funnel shifts, which was far too cheap. This is the reverse of #124129 where we weren't including argument data when we could. Fixes #63980	2025-01-24 15:13:13 +00:00
Simon Pilgrim	625e0a40f1	[SLP][X86] Add missing SSE2/SSE4 checks from vector rotate tests	2025-01-24 10:12:19 +00:00
Simon Pilgrim	7746596713	[SLP][X86] Add VBMI2 coverage for funnel shift tests VBMI2 CPUs actually have vector funnel shift instruction support	2025-01-24 09:47:40 +00:00
Simon Pilgrim	d8cd8d56ea	[SLP] getSpillCost - fully populate IntrinsicCostAttributes to improve cost analysis. (#124129 ) We were only constructing the IntrinsicCostAttributes with the arg type info, and not the args themselves, preventing more detailed cost analysis (constant / uniform args etc.) Just pass the whole IntrinsicInst to the constructor and let it resolve everything it can. Noticed while having yet another attempt at #63980	2025-01-23 16:57:13 +00:00
Alexey Bataev	ccd77953d0	[SLP][NFC]Add a test with potential alternate node, marked for minbitwidth size	2025-01-22 06:48:34 -08:00
Sushant Gokhale	c6c647588f	[SLP][NFC] Update test for PR #118055 (#122696 ) This patch updates the motivating test for the above PR so that it does not conflict with urem PR #122236	2025-01-22 03:28:34 -08:00
Alexey Bataev	184c056e35	[SLP][NFC]Update the test by replacing undefs with constant values, NFC	2025-01-21 08:43:33 -08:00
Alexey Bataev	5deb4ef9ab	[SLP]Initial non-power-of-2 (but still whole register) for remaining nodes Added non-power-of-2 (but still whole registers) vectorization support for nodes other than stores and reductions. Reviewers: preames, RKSimon, hiraditya Reviewed By: RKSimon Pull Request: https://github.com/llvm/llvm-project/pull/113356	2025-01-21 10:33:03 -05:00
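
A rough way to picture "non-power-of-2 (but still whole register)" from the entry above is a lane count that is not a power of two yet still fills an integral number of vector registers. The sketch below encodes that reading only; it is an assumption about the phrase, not the check the patch actually implements, and the function names and the fixed register width parameter are hypothetical.

```cpp
#include <cstdint>

// True if n is a power of two (1, 2, 4, ...).
bool isPowerOf2(std::uint64_t n) { return n != 0 && (n & (n - 1)) == 0; }

// True if numElts elements of eltBits bits each occupy a whole number of
// registers of regBits bits, with no partially filled register.
bool fillsWholeRegisters(std::uint64_t numElts, std::uint64_t eltBits,
                         std::uint64_t regBits) {
  if (numElts < 2 || eltBits == 0 || regBits == 0)
    return false;
  return (numElts * eltBits) % regBits == 0;
}

// The case of interest: a non-power-of-2 element count that still maps
// onto full registers.
bool isNonPow2WholeRegister(std::uint64_t numElts, std::uint64_t eltBits,
                            std::uint64_t regBits) {
  return !isPowerOf2(numElts) && fillsWholeRegisters(numElts, eltBits, regBits);
}
```

Under this reading, twelve `i32` elements with 128-bit registers qualify (three full registers), while six `i32` elements do not (one and a half registers), and eight are excluded only because they are already a power of two.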
Alexey Bataev	7d01a8f2b9	[SLP]Fix vector factor for repeated node for bv When adding a node vector that is already used in the shuffle for buildvector, need to calculate vector factor from all vectors, not only this single vector, to avoid incorrect result. Also, need to increase stability of the reused entries detection to avoid mismatch in cost estimation/codegen. Fixes #123639	2025-01-20 14:22:20 -08:00
Alexey Bataev	5e4c34a9b6	[SLP][NFC]Add a test with incorrect length and cost for repeated matching node	2025-01-20 14:17:15 -08:00
Alexey Bataev	2b1e037adb	[SLP]Fix createInsertVector mask emission	2025-01-18 11:48:53 -08:00
Alexey Bataev	92a6eff62b	[SLP][NFC]Fix the test to use poison and update to show the error	2025-01-18 11:46:04 -08:00
Alexey Bataev	55f7491dde	[SLP][NFC]Add a test with incomplete insertion mask, NFC	2025-01-18 08:13:54 -08:00
Han-Kuan Chen	07d496538f	[SLP] Replace MainOp and AltOp in TreeEntry with InstructionsState. (#122443 ) Add TreeEntry::hasState. Add assert for getTreeEntry. Remove the OpValue parameter from the canReuseExtract function. Remove the Opcode parameter from the ComputeMaxBitWidth lambda function.	2025-01-18 10:23:20 +08:00
Alexey Bataev	ebfdd38228	[SLP][NFC]Replace undef with constant zero in tests, NFC	2025-01-17 09:48:03 -08:00
Alexey Bataev	98e2328451	[SLP][NFC]Add a test with non-power-of-2 gathered consecutive loads, NFC	2025-01-13 12:50:11 -08:00
Alexey Bataev	066b88879a	[SLP]Correctly set vector operand for extracts with poisons When extracts are vectorized and some of them are poison values instead of instructions, need to set the vectorized operand not as poison, but as the main vector operand of the main extract instruction. Fixes #122583	2025-01-13 10:57:07 -08:00
Alexey Bataev	ae54617523	[SLP][NFC]Add a test with incorrect extractelement parameter after extending with poison	2025-01-13 10:32:11 -08:00
Alexey Bataev	092d628383	[SLP]Check for div/rem instructions before extending with poisons Need to check if the instructions can be safely extended with poison before actually doing this to avoid incorrect transformations. Fixes #122691	2025-01-13 09:28:27 -08:00
Alexey Bataev	af524de1fa	[SLP]Do not include subvectors for fully matched buildvectors If the buildvector node fully matched another node, need to exclude subvectors, when building final shuffle, just a shuffle of the original node must be emitted. Fixes #122584	2025-01-13 07:24:16 -08:00
Alexey Bataev	681c83a2f9	[SLP]Fix mask generation after cost estimation When estimating the cost of entries shuffles for buildvectors, need to rebuild original mask, not a generated submask, used for subregisters analysis. Fixes #122430	2025-01-10 09:32:35 -08:00

2106 Commits