2106 Commits

Author SHA1 Message Date
Alexey Bataev
10844fb9b0 [SLP]Fix attempt to build the reorder mask for non-adjusted reuse mask
When building the reorder for non-single use reuse mask, need to check
if the size of the mask is multiple of the number of unique scalars.
  Otherwise, the compiler may crash when trying to reorder nodes.

Fixes #126304
2025-02-11 13:41:25 -08:00
Mikhail R. Gadelha
e78be31639
[RISCV] Added cost model for fmuladd (#125683)
This patch updates the cost model for fmuladd on vector types to scale with LMUL. This was found when analyzing a hot loop in 519.lbm_r that was unprofitably vectorized, but doesn't directly impact that case and is split off so it doesn't get forgotten.

Unlike other FP arithmetic ops, it's not scaled by 2 because the scalar cost isn't scaled by 2.
2025-02-05 09:33:24 -03:00
Simon Pilgrim
4fdd28b791 [SLP][X86] Add test coverage for #124993 2025-02-05 08:54:09 +00:00
Alexey Bataev
7dca2c628c
[SLP]Gather scalarized calls
If the calls won't be vectorized, but will be scalarized after
vectorization, they should be build as buildvector nodes, not vector
nodes. Vectorization of such calls leads to incorrect cost estimation,
does not allow to calculate correctly spills costs.

Reviewers: lukel97, preames

Reviewed By: preames

Pull Request: https://github.com/llvm/llvm-project/pull/125070
2025-02-04 19:09:57 -05:00
Alexey Bataev
88e7b8b81c
[SLP]Use TTI::getScalarizationOverhead where possible
Better to use TTI::getScalarizationOverhead instead of
TTI::getVectorInstrCost to correctly calculate the costs of
buildvectors/extracts.

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/125725
2025-02-04 18:49:43 -05:00
Alexey Bataev
5ca136d0e7 [SLP][NFC]Replace undefs with just poison in the test 2025-02-04 08:24:45 -08:00
Simon Pilgrim
f4c2e5df6f [SLP][X86] revectorized_rdx_crash.ll - regenerate to reduce diff in #118293 2025-02-04 15:13:07 +00:00
Alexey Bataev
0c70a26f46 [SLP]Clear root node reordering only if the root node is not re-used in graph
The reordering of the root node can be safely cleared only if the root
node is not reused, otherwise the graph might be broken

Fixes #125357
2025-02-03 06:05:19 -08:00
Martin Storsjö
d00579be39 Revert "[SLP]Reduce number of alternate instruction, where possible"
This reverts commit d5a7a483a65f830a0c7a931781bc90046dc67ff4.

That commit triggers failed asserts, see
https://github.com/llvm/llvm-project/pull/123360 for details.
2025-02-02 15:56:08 +02:00
Alexey Bataev
d5a7a483a6 [SLP]Reduce number of alternate instruction, where possible
Patch tries to remove wide alternate operations.
Currently SLP vectorizer emits something like this:
```
%0 = add i32
%1 = sub i32
%2 = add i32
%3 = sub i32
%4 = add i32
%5 = sub i32
%6 = add i32
%7 = sub i32

transformes to

%v1 = add <8 x i32>
%v2 = sub <8 x i32>
%res = shuffle %v1, %v2, <0, 9, 2, 11, 4, 13, 6, 15>
```
i.e. half of the results are just unused. This leads to increased
register pressure and potentially doubles number of operations.

Patch introduces SplitVectorize mode, where it splits the operations by
opcodes and produces instead something like this:
```
%v1 = add <4 x i32>
%v2 = sub <4 x i32>
%res = shuffle %v1, %v2, <0, 4, 1, 5, 2, 6, 3, 7>
```
It allows to improve the performance by reducing number of ops. Also, it
turns on some other improvements, like improved graph reordering.

-O3+LTO, AVX512
Metric: size..text
Program                                                                         size..text
                                                                                            results     results0    diff
           test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test   277800.00   280536.00  1.0%
                          test-suite :: MultiSource/Benchmarks/FreeBench/pifft/pifft.test    81802.00    82426.00  0.8%
                        test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test   790552.00   790952.00  0.1%
                             test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test   383795.00   383987.00  0.1%
           test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test  2075541.00  2076501.00  0.0%
            test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test  2075541.00  2076501.00  0.0%
                                  test-suite :: MultiSource/Benchmarks/Bullet/bullet.test   312702.00   312766.00  0.0%
                 test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12569783.00 12569751.00 -0.0%
                   test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test  2049374.00  2049358.00 -0.0%
                    test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test  1091836.00  1091772.00 -0.0%
                             test-suite :: MultiSource/Applications/JM/lencod/lencod.test   852339.00   852211.00 -0.0%
                                test-suite :: MultiSource/Applications/oggenc/oggenc.test   190651.00   190523.00 -0.1%
                test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test    44203.00    44155.00 -0.1%
test-suite :: SingleSource/UnitTests/Vector/AVX512BWVL/Vector-AVX512BWVL-mask_set_bw.test    12997.00    12981.00 -0.1%
                     test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test   668971.00   658427.00 -1.6%
                      test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test   668971.00   658427.00 -1.6%

Prolangs-C/TimberWolfMC/timberwolfmc - small variations, some code not
inlined
FreeBench/pifft - extra stores <8 x double> vectorized, some other extra
vectorizations
CINT2006/464.h264ref - some smaller code + changes similar to x264
JM/ldecod - changes similar x264
CINT2017speed/600.perlbench_s
CINT2017rate/500.perlbench_r - significantly compact vector code
Benchmarks/Bullet - small variations
CFP2017rate/526.blender_r - small variations
CFP2017rate/510.parest_r - small variations
CINT2006/400.perlbench - extra vector code
JM/lencod - extra store <16 x i32> and other changes similar x264
Applications/oggenc - extra store <16 x i8>, small variations
DOE-ProxyApps-C/miniGMG - small variations
Vector/AVX512BWVL/Vector-AVX512BWVL-mask_set_bw - better vector code
CINT2017speed/625.x264_s
CINT2017rate/525.x264_r - the number of instructions increased, but
looks like they are more performant. E.g., for function
x264_pixel_satd_8x8, llvm-mca reports better throughput - 84 for the
current version and 59 for the new version.

-O3+LTO, march=rva32u64

CINT2017rate/525.x264_r - similar to x86, extra code in pixel_hadamard_ac
function vectorized, idct4x4dc stopped being vectorized (looks like
issue with shuffles cost)
CINT2006/400.perlbench - better vector code
CINT2006/445.gobmk - some variations in vector code
CINT2006/464.h264ref - extra code vectorized
CINT2017rate/500.perlbench_r - small variations

-O3+LTO, mcpu=sifive-p470

Metric: size..text

Program                                                                                                                                                size..text
                                                                               results    results0   diff
             test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test  587336.00  587668.00  0.1%
                  test-suite :: MultiSource/Applications/JM/lencod/lencod.test  643308.00  643614.00  0.0%
                    test-suite :: MultiSource/Applications/d/make_dparser.test   79678.00   79710.00  0.0%
                       test-suite :: MultiSource/Benchmarks/Bullet/bullet.test  277322.00  277420.00  0.0%
         test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test  933660.00  933682.00  0.0%
      test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 9497722.00 9497682.00 -0.0%
 test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test 1767806.00 1767772.00 -0.0%
test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test 1767806.00 1767772.00 -0.0%
 test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test  148038.00  148024.00 -0.0%
                  test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test  283036.00  283008.00 -0.0%
   test-suite :: MultiSource/Benchmarks/mediabench/g721/g721encode/encode.test    4776.00    4772.00 -0.1%
           test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test  540582.00  511772.00 -5.3%
          test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test  540582.00  511772.00 -5.3%

CINT2006/464.h264ref - extra vector code in find_sad_16x16
JM/lencod - extra vector code in find_sad_16x16
d/make_dparser - smaller vector code
Benchmarks/Bullet - small variations
CINT2006/400.perlbench - smaller vector code
CFP2017rate/526.blender_r - small variations, extra store <8 x float> in
the loop, extra store <8 x i8> in loop
CINT2017rate/500.perlbench_r
CINT2017speed/600.perlbench_s - small variations
MiBench/consumer-lame - small variations
JM/ldecod - extra vector code
mediabench/g721/g721encode - small variations
CINT2017rate/525.x264_r
CINT2017speed/625.x264_s - reduced number of wide operations and
shuffles, saving the registers, similar to X86, extra code in
pixel_hadamard_ac vectorized, idct4x4dc not vectorized (issue with some
TTI costs)

Reviewers: RKSimon, hiraditya

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/123360
2025-02-01 10:00:16 -08:00
Simon Pilgrim
71d05ac64e [TTI] getTypeBasedIntrinsicInstrCost - add basic handling for strided load/store intrinsics (#125223) (REAPPLIED)
As noted on #124499 - this is currently missing for type-only analysis and was falling back to scalarization for fixed vectors (and failing entirely for scalable vectors)
2025-02-01 16:24:51 +00:00
Alexey Bataev
d9c9326a21 [SLP]Recalculate number of parts when requesting number of elements based on original scalars size
Need to recalculate number of parts, since gathered scalar size might be changed
during building the buildvector shuffles.

Fixes #125259
2025-01-31 12:55:03 -08:00
Alexey Bataev
631abff733 Revert "[SLP]Use the size of gathered scalars when evaluating slice size"
This reverts commit e78aa8f35e6dd66d5152396406d3d4f37f43e7f4 to fix
crashes reported in https://lab.llvm.org/buildbot/#/builders/140/builds/16047.
2025-01-31 11:31:45 -08:00
Alexey Bataev
e78aa8f35e [SLP]Use the size of gathered scalars when evaluating slice size
Need to use the size of the gathered scalars, not the original size of
the buildvector scalars, since gathered scalar size might be changed
during building the buildvector shuffles.

Fixes #125259
2025-01-31 11:19:46 -08:00
Alexey Bataev
9955d849d6 [SLP][NFC]Add a test with the incorrect shuffled elements of buildvector 2025-01-31 11:05:20 -08:00
Alexey Bataev
6dd07b17c7 Revert "[SLP]Reduce number of alternate instruction, where possible"
This reverts commit e588085af03ba4be14a502806918fd74ca1cf367 to fix
a crash reported in https://github.com/llvm/llvm-project/pull/123360#issuecomment-2627439245
2025-01-31 06:37:16 -08:00
Alexey Bataev
e588085af0
[SLP]Reduce number of alternate instruction, where possible
Patch tries to remove wide alternate operations.
Currently SLP vectorizer emits something like this:
```
%0 = add i32
%1 = sub i32
%2 = add i32
%3 = sub i32
%4 = add i32
%5 = sub i32
%6 = add i32
%7 = sub i32

transformes to

%v1 = add <8 x i32>
%v2 = sub <8 x i32>
%res = shuffle %v1, %v2, <0, 9, 2, 11, 4, 13, 6, 15>
```
i.e. half of the results are just unused. This leads to increased
register pressure and potentially doubles number of operations.

Patch introduces SplitVectorize mode, where it splits the operations by
opcodes and produces instead something like this:
```
%v1 = add <4 x i32>
%v2 = sub <4 x i32>
%res = shuffle %v1, %v2, <0, 4, 1, 5, 2, 6, 3, 7>
```
It allows to improve the performance by reducing number of ops. Also, it
turns on some other improvements, like improved graph reordering.

-O3+LTO, AVX512
Metric: size..text
Program                                                                         size..text
                                                                                            results     results0    diff
           test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test   277800.00   280536.00  1.0%
                          test-suite :: MultiSource/Benchmarks/FreeBench/pifft/pifft.test    81802.00    82426.00  0.8%
                        test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test   790552.00   790952.00  0.1%
                             test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test   383795.00   383987.00  0.1%
           test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test  2075541.00  2076501.00  0.0%
            test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test  2075541.00  2076501.00  0.0%
                                  test-suite :: MultiSource/Benchmarks/Bullet/bullet.test   312702.00   312766.00  0.0%
                 test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12569783.00 12569751.00 -0.0%
                   test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test  2049374.00  2049358.00 -0.0%
                    test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test  1091836.00  1091772.00 -0.0%
                             test-suite :: MultiSource/Applications/JM/lencod/lencod.test   852339.00   852211.00 -0.0%
                                test-suite :: MultiSource/Applications/oggenc/oggenc.test   190651.00   190523.00 -0.1%
                test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test    44203.00    44155.00 -0.1%
test-suite :: SingleSource/UnitTests/Vector/AVX512BWVL/Vector-AVX512BWVL-mask_set_bw.test    12997.00    12981.00 -0.1%
                     test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test   668971.00   658427.00 -1.6%
                      test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test   668971.00   658427.00 -1.6%

Prolangs-C/TimberWolfMC/timberwolfmc - small variations, some code not
inlined
FreeBench/pifft - extra stores <8 x double> vectorized, some other extra
vectorizations
CINT2006/464.h264ref - some smaller code + changes similar to x264
JM/ldecod - changes similar x264
CINT2017speed/600.perlbench_s
CINT2017rate/500.perlbench_r - significantly compact vector code
Benchmarks/Bullet - small variations
CFP2017rate/526.blender_r - small variations
CFP2017rate/510.parest_r - small variations
CINT2006/400.perlbench - extra vector code
JM/lencod - extra store <16 x i32> and other changes similar x264
Applications/oggenc - extra store <16 x i8>, small variations
DOE-ProxyApps-C/miniGMG - small variations
Vector/AVX512BWVL/Vector-AVX512BWVL-mask_set_bw - better vector code
CINT2017speed/625.x264_s
CINT2017rate/525.x264_r - the number of instructions increased, but
looks like they are more performant. E.g., for function
x264_pixel_satd_8x8, llvm-mca reports better throughput - 84 for the
current version and 59 for the new version.

-O3+LTO, march=rva32u64

CINT2017rate/525.x264_r - similar to x86, extra code in pixel_hadamard_ac
function vectorized, idct4x4dc stopped being vectorized (looks like
issue with shuffles cost)
CINT2006/400.perlbench - better vector code
CINT2006/445.gobmk - some variations in vector code
CINT2006/464.h264ref - extra code vectorized
CINT2017rate/500.perlbench_r - small variations

-O3+LTO, mcpu=sifive-p470

Metric: size..text

Program                                                                                                                                                size..text
                                                                               results    results0   diff
             test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test  587336.00  587668.00  0.1%
                  test-suite :: MultiSource/Applications/JM/lencod/lencod.test  643308.00  643614.00  0.0%
                    test-suite :: MultiSource/Applications/d/make_dparser.test   79678.00   79710.00  0.0%
                       test-suite :: MultiSource/Benchmarks/Bullet/bullet.test  277322.00  277420.00  0.0%
         test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test  933660.00  933682.00  0.0%
      test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 9497722.00 9497682.00 -0.0%
 test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test 1767806.00 1767772.00 -0.0%
test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test 1767806.00 1767772.00 -0.0%
 test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test  148038.00  148024.00 -0.0%
                  test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test  283036.00  283008.00 -0.0%
   test-suite :: MultiSource/Benchmarks/mediabench/g721/g721encode/encode.test    4776.00    4772.00 -0.1%
           test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test  540582.00  511772.00 -5.3%
          test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test  540582.00  511772.00 -5.3%

CINT2006/464.h264ref - extra vector code in find_sad_16x16
JM/lencod - extra vector code in find_sad_16x16
d/make_dparser - smaller vector code
Benchmarks/Bullet - small variations
CINT2006/400.perlbench - smaller vector code
CFP2017rate/526.blender_r - small variations, extra store <8 x float> in
the loop, extra store <8 x i8> in loop
CINT2017rate/500.perlbench_r
CINT2017speed/600.perlbench_s - small variations
MiBench/consumer-lame - small variations
JM/ldecod - extra vector code
mediabench/g721/g721encode - small variations
CINT2017rate/525.x264_r
CINT2017speed/625.x264_s - reduced number of wide operations and
shuffles, saving the registers, similar to X86, extra code in
pixel_hadamard_ac vectorized, idct4x4dc not vectorized (issue with some
TTI costs)

Reviewers: RKSimon, hiraditya

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/123360
2025-01-31 06:56:12 -05:00
Alexey Bataev
466217eb03
[SLP]Fix graph traversal in getSpillCost
getSpill cost relies on def-use order when performs the analysis for the
vectorized instructions live-over-calls spills.
Patch fixes it to check the dependencies based on TreeEntries and
performs actual vectorized type analysis.

Reviewers: RKSimon, preames

Reviewed By: preames

Pull Request: https://github.com/llvm/llvm-project/pull/124984
2025-01-31 06:27:47 -05:00
Alexey Bataev
fc39617746 [SLP][NFC]Update tests and remove undefs, NFC 2025-01-30 06:58:14 -08:00
Alexey Bataev
d1033d15cb [SLP][NFC]Autogenerate checks and remove undef, NFC 2025-01-30 06:49:42 -08:00
Simon Pilgrim
5921295dca
Revert "[SLP] getSpillCost - fully populate IntrinsicCostAttributes to improve cost analysis." (#124962)
Reverts llvm/llvm-project#124129 as its currently causing a regression at #124499 - avoids the regression until a proper fix can be added to getSpillCost
2025-01-29 22:17:53 +00:00
Nikita Popov
29441e4f5f
[IR] Convert from nocapture to captures(none) (#123181)
This PR removes the old `nocapture` attribute, replacing it with the new
`captures` attribute introduced in #116990. This change is
intended to be essentially NFC, replacing existing uses of `nocapture`
with `captures(none)` without adding any new analysis capabilities.
Making use of non-`none` values is left for a followup.

Some notes:
* `nocapture` will be upgraded to `captures(none)` by the bitcode
   reader.
* `nocapture` will also be upgraded by the textual IR reader. This is to
   make it easier to use old IR files and somewhat reduce the test churn in
   this PR.
* Helper APIs like `doesNotCapture()` will check for `captures(none)`.
* MLIR import will convert `captures(none)` into an `llvm.nocapture`
   attribute. The representation in the LLVM IR dialect should be updated
   separately.
2025-01-29 16:56:47 +01:00
Simon Pilgrim
89ca3e72ca
[CostModel][X86] Reduce worst case v8i16/v16i8 SSE2 shuffle costs (#124789)
These were based off instruction count, not throughput - we can probably improve these further, but these throughput numbers match the worse expanded shuffles we see in the vector-shuffle-128-v* codegen tests.
2025-01-29 10:23:09 +00:00
Alexey Bataev
947d8ebbf3
[SLP]Unify getNumberOfParts use
Adds getNumberOfParts and uses it instead of similar code across code
base, fixes analysis of non-vectorizable types in
computeMinimumValueSizes.

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/124774
2025-01-28 12:16:44 -05:00
Alexey Bataev
a1ab5b4c87 [SLP]Check the MainOp matches the requirements for the instructions
Need to include MainOp into the analysis of the instructions in
getSameOpcode to be sure that it is checked for the requirements to
prevent crashes during further analysis.
2025-01-28 06:00:52 -08:00
Alexey Bataev
1d5fbe83c3 [SLP]Adjust NumberOfParts value for adjusted number of buildvector scalars
Need to adjust NumParts value, when GatheredScalars scalars are adjusted
after extractelements analysis, to fix compiler crash
2025-01-28 05:45:13 -08:00
Han-Kuan Chen
08d14e10ca
[SLP] Fix CommonMask will be transformed into an incorrect mask if createShuffle is called multiple times. (#124244)
We have two types of mask in SLP: a scalar mask and a vector mask.
When vectorizing four i32 additions into <4 x i32>, SLP creates a mask
of length 4.
When vectorizing four <2 x i32> additions into <8 x i32>, SLP also
creates a mask of length 4.
We refer to the first case as a scalar mask (because the mask element
represents a scalar, i32), and the second case as a vector mask (because
the mask element represents a vector, <4 x i32>).
At some point, we must convert the scalar mask into a vector mask
(otherwise, calling TTI cost functions or IRBuilderBase functions may
yield incorrect results).
Since both ShuffleCostEstimator and ShuffleInstructionBuilder can modify
the CommonMask, we have decided to perform the mask transformation only
within createShuffle. However, we do not store the transformed result,
as createShuffle may be called multiple times.
2025-01-28 12:02:37 +08:00
Simon Pilgrim
dec47b76f4
[CostModel][X86] Update baseline CTTZ/CTLZ costs for x86_64 (#124312)
Followup to #123623 - now that the CMOV has been removed, the throughput has improved, reducing the benefit of vectorization on pre-x86-64-v3 CPUs
2025-01-26 14:43:51 +00:00
Alexey Bataev
5e65f43041 [SLP][NFC]Add a test, producing serie of extrtactelements, building non-extendable tree 2025-01-25 11:50:14 -08:00
Simon Pilgrim
a12d7e4b61
[SLP] getVectorCallCosts - don't provide scalar argument data for vector IntrinsicCostAttributes (#124254)
getVectorCallCosts determines the cost of a vector intrinsic, based off
an existing scalar intrinsic call - but we were including the scalar
argument data to the IntrinsicCostAttributes, which meant that not only
was the cost calculation not type-only based, it was making incorrect
assumptions about constant values etc.

This also exposed an issue that x86 relied on fallback calculations for
funnel shift costs - this is great when we have the argument data as
that improves the accuracy of uniform shift amounts etc., but meant that
type-only costs would default to Cost=2 for all custom lowered funnel
shifts, which was far too cheap.

This is the reverse of #124129 where we weren't including argument data
when we could.

Fixes #63980
2025-01-24 15:13:13 +00:00
Simon Pilgrim
625e0a40f1 [SLP][X86] Add missing SSE2/SSE4 checks from vector rotate tests 2025-01-24 10:12:19 +00:00
Simon Pilgrim
7746596713 [SLP][X86] Add VBMI2 coverage for funnel shift tests
VBMI2 CPUs actually have vector funnel shift instruction support
2025-01-24 09:47:40 +00:00
Simon Pilgrim
d8cd8d56ea
[SLP] getSpillCost - fully populate IntrinsicCostAttributes to improve cost analysis. (#124129)
We were only constructing the IntrinsicCostAttributes with the arg type info, and not the args themselves, preventing more detailed cost analysis (constant / uniform args etc.)

Just pass the whole IntrinsicInst to the constructor and let it resolve everything it can.

Noticed while having yet another attempt at #63980
2025-01-23 16:57:13 +00:00
Alexey Bataev
ccd77953d0 [SLP][NFC]Add a test with potential alternate node, marked for minbitwidth size 2025-01-22 06:48:34 -08:00
Sushant Gokhale
c6c647588f
[SLP][NFC] Update test for PR #118055 (#122696)
This patch updates the motivating test for the above PR so that it does
not conflict with urem PR #122236
2025-01-22 03:28:34 -08:00
Alexey Bataev
184c056e35 [SLP][NFC]Update the test by replacing undefs with constant values, NFC 2025-01-21 08:43:33 -08:00
Alexey Bataev
5deb4ef9ab
[SLP]Initial non-power-of-2 (but still whole register) for remaining nodes
Added non-power-of-2 (but still whole registers) vectorization support
for nodes other than stores and reductions.

Reviewers: preames, RKSimon, hiraditya

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/113356
2025-01-21 10:33:03 -05:00
Alexey Bataev
7d01a8f2b9 [SLP]Fix vector factor for repeated node for bv
When adding a node vector, when it is used already in the shuffle for
buildvector, need to calculate vector factor from all vector, not only
this single vector, to avoid incorrect result. Also, need to increase
stability of the reused entries detection to avoid mismatch in cost
estimation/codegen.

Fixes #123639
2025-01-20 14:22:20 -08:00
Alexey Bataev
5e4c34a9b6 [SLP][NFC]Add a test with incorrect length and cost for repeated matching node 2025-01-20 14:17:15 -08:00
Alexey Bataev
2b1e037adb [SLP]Fix createInsertVector mask emission 2025-01-18 11:48:53 -08:00
Alexey Bataev
92a6eff62b [SLP][NFC]Fix the test to use poison and update to show the error 2025-01-18 11:46:04 -08:00
Alexey Bataev
55f7491dde [SLP][NFC]Add a test with incomplete insertion mask, NFC 2025-01-18 08:13:54 -08:00
Han-Kuan Chen
07d496538f
[SLP] Replace MainOp and AltOp in TreeEntry with InstructionsState. (#122443)
Add TreeEntry::hasState.
Add assert for getTreeEntry.
Remove the OpValue parameter from the canReuseExtract function.
Remove the Opcode parameter from the ComputeMaxBitWidth lambda function.
2025-01-18 10:23:20 +08:00
Alexey Bataev
ebfdd38228 [SLP][NFC]Replace undef with constant zero in tests, NFC 2025-01-17 09:48:03 -08:00
Alexey Bataev
98e2328451 [SLP][NFC]Add a test with non-power-of-2 gathered consecutive loads, NFC 2025-01-13 12:50:11 -08:00
Alexey Bataev
066b88879a [SLP]Correctly set vector operand for extracts with poisons
When extracts are vectorized and it has some poison values instead of
instructions, need to correctly set the vectorized operand not as
poison, but as a main vector operand of the main extract instruction.

Fixes #122583
2025-01-13 10:57:07 -08:00
Alexey Bataev
ae54617523 [SLP][NFC]Add a test with incorrect extractelement parameter after extending with poison 2025-01-13 10:32:11 -08:00
Alexey Bataev
092d628383 [SLP]Check for div/rem instructions before extending with poisons
Need to check if the instructions can be safely extended with poison
before actually doing this to avoid incorrect transformations.

Fixes #122691
2025-01-13 09:28:27 -08:00
Alexey Bataev
af524de1fa [SLP]Do not include subvectors for fully matched buildvectors
If the buildvector node fully matched another node, need to exclude
subvectors, when building final shuffle, just a shuffle of the original
node must be emitted.

Fixes #122584
2025-01-13 07:24:16 -08:00
Alexey Bataev
681c83a2f9 [SLP]Fix mask generation after cost estimation
When estimating the cost of entries shuffles for buildvectors, need to
rebuild original mask, not a generated submask, used for subregisters
analysis.

Fixes #122430
2025-01-10 09:32:35 -08:00