1748 Commits

Author SHA1 Message Date
Stephen Tozer
6481dc5761
[IR][NFC] Update IRBuilder to use InsertPosition (#96497)
Uses the new InsertPosition class (added in #94226) to simplify some of
the IRBuilder interface, and removes the need to pass a BasicBlock
alongside a BasicBlock::iterator, using the fact that we can now get the
parent basic block from the iterator even if it points to the sentinel.
This patch removes the BasicBlock argument from each constructor or call
to setInsertPoint.

This has no functional effect, but later on as we look to remove the
`Instruction *InsertBefore` argument from instruction-creation
(discussed
[here](https://discourse.llvm.org/t/psa-instruction-constructors-changing-to-iterator-only-insertion/77845)),
this will simplify the process by allowing us to deprecate the
InsertPosition constructor directly and catch all the cases where we use
instructions rather than iterators.
2024-06-24 17:27:43 +01:00
Simon Pilgrim
f9fc6f6d75 [SLP] Remove dead initialization noticed by static analyser. NFC. 2024-06-21 17:42:01 +01:00
Han-Kuan Chen
be339fd99d
[SLP] NFC. Reduce redundant assignment. (#96149) 2024-06-20 20:09:28 +08:00
Tyler Lanphear
d337c504ef
[SLP][NFCI] Address issues seen in downstream Coverity scan. (#93757)
- Prevent null dereference: if the Mask given to
  `ShuffleInstructionBuilder::adjustExtracts()` is empty or all-poison,
  then `VecBase` will be `nullptr` and the call to
  `castToScalarTyElem(VecBase)` will dereference it. Add an assert
  to guard against this.

- Prevent use of uninitialized scalar: in the unlikely event that
  `CandidateVFs` is empty, then `AnyProfitableGraph` will be
  uninitialized in `if` condition following the loop. (This seems like a
  false-positive, but I submitted this change anyways as initializing
  bools costs nothing and is generally good practice)
2024-05-31 18:34:23 -07:00
Alexey Bataev
70a54bca6f
[SLP]Improve/fix extracts calculations for non-power-of-2 elements.
One of the previous patches introduced initial support for non-power-of-2
number of elements but some parts of the SLP vectorizer still were not
adjusted to handle the costs correctly. Patch fixes it by improving
analysis of the non-power-of-2 number of elements and fixes in the cost
of the extractelements instructions.

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/93213
2024-05-24 09:33:36 -04:00
Alexey Bataev
554c47c8e9 [SLP]Fix undef poison vector values shuffles with poisonous vectors.
If trying to find vector value in shuffling of the extractelements and
one of the vector values is undef value, need to generate real mask value
for such vector and either undef vector, or incoming second vector, if
  non-poisonous.
2024-05-22 10:41:57 -07:00
Han-Kuan Chen
5b205956e1
[SLP] NFC. Reduce newTreeEntry usage. (#92994) 2024-05-22 23:26:33 +08:00
Alexey Bataev
30d484fa99 [SLP]Fix a crash when trying to convert masked gather nodes to strided.
Need to check if the loads node is masked gather. Only vectorized loads
can be converted to strided.
2024-05-22 08:08:56 -07:00
Han-Kuan Chen
9f449c3427
[SLP] NFC. Use TreeEntry::getOperand if setOperandsInOrder is called (#92727)
already.
2024-05-20 18:46:30 +08:00
Jay Foad
1650f1b3d7
Fix typo "indicies" (#92232) 2024-05-15 13:10:16 +01:00
Alexey Bataev
58a94b1d0a [SLP]Fix PR91467: Look through scalar cast, when trying to cast to another type.
Need to look through the SExt/ZExt scalars to be gathered, when trying
to reduce their width after minbitwidth analysis to prevent permanent
attempts to revectorize such gathered instructions.
2024-05-09 04:19:43 -07:00
Arthur Eubanks
2fb3774321 Revert "[SLP]Fix PR91467: Look through scalar cast, when trying to cast to another type."
This reverts commit 2475efa91d8b4fa8f1a2d16052cb6d14be7d5dc6.

Causes crashes, see comments on 2475efa91d.
2024-05-08 23:01:47 +00:00
Alexey Bataev
2475efa91d [SLP]Fix PR91467: Look through scalar cast, when trying to cast to another type.
Need to look through the SExt/ZExt scalars to be gathered, when trying
to reduce their width after minbitwidth analysis to prevent permanent
attempts to revectorize such gathered instructions.
2024-05-08 07:25:19 -07:00
Alexey Bataev
f00f294130 [SLP]Fix PR91309: Do not consider SExt as always producing signed result.
Still need to do the full analysis of the signedness of the values
rather than rely on Instruction opcode, if the opcode is SExt. Still may
produce unsigned result.
2024-05-07 08:57:52 -07:00
Alexey Bataev
c144157f3d [SLP]Use last pointer instead of first for reversed strided stores.
Need to use the last address of the vectorized stores for the strided
stores, not the first one, to correctly store the data.
2024-05-06 10:16:28 -07:00
Alexey Bataev
a476032101 [SLP]Fix PR91025: correctly handle smin/smax of signed operands.
Need to check that the signed operand has an extra sign bit to be sure
that we do not skip signedness, when trying to minimize bitwidth for
smin/smax intrinsics.
2024-05-06 08:10:20 -07:00
Alexey Bataev
c7910ee1f0 [SLP][NFC]Use std::optional::value_or. 2024-05-04 11:47:41 -07:00
Alexey Bataev
03972261a9 [SLP]Fix PR90892: do a correct sign analysis of the entries elements in gather shuffles.
Need to do extra analysis of the scalar elements of the tree entry to be
shuffled instead of the vectorized value to correctly deduce signedness
info.
2024-05-03 14:01:25 -07:00
Alexey Bataev
5e67c41a93 [SLP]Fix PR90780: insert cast instruction for PHI nodes after all phi nodes.
Need to check if the vectorized value is a PHINode before insert casting
instruction and insert it after all phis to generate the code correctly.
2024-05-02 06:30:14 -07:00
Alexey Bataev
fc382db239
[SLP]Improve comparison of shuffled loads/masked gathers by adding GEP cost.
In some cases masked gather is less profitable than insert-subvector of
consecutive/strided stores. SLP has this kind of analysis, but need to
improve it by adding the cost of the GEP analysis.
Also, the GEP cost estimation for masked gather is fixed.

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/90737
2024-05-01 15:53:25 -04:00
Alexey Bataev
59ef94d7cf
[SLP]Do not include the cost of and -1, <v> and emit just <v> after MinBitWidth.
After minbitwidth analysis, and <v>, (power_of_2 - 1 const) can be
transformed into just an <v>, (all_ones const), which can be ignored at
the cost estimation and at the codegen. x264 benchmark has this pattern.

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/90739
2024-05-01 15:52:23 -04:00
Alexey Bataev
576261ac8f
[SLP]Improve reordering for consts, splats and ops from same nodes + improved analysis.
Improved detection of const/splat candidates, their matching and analysis of instructions from same nodes.

Metric: size..text

Program                                                                                                                                                size..text
                                                                                                                                                       results     results0    diff
                                                                                                                                                       results     results0    diff
                                                                             test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/miniFE/miniFE.test    92952.00    93096.00  0.2%
                                                                                     test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test   779832.00   780136.00  0.0%
                                                                                          test-suite :: MultiSource/Applications/JM/lencod/lencod.test   839923.00   840179.00  0.0%
                                                                                          test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test   392708.00   392740.00  0.0%
                                                                                test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test  1171131.00  1171147.00  0.0%

                                                                              test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test  1391089.00  1391073.00 -0.0%
                                                                             test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test  1391089.00  1391073.00 -0.0%
                                                                              test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12352780.00 12352636.00 -0.0%

MultiSource/Benchmarks/DOE-ProxyApps-C++/miniFE/miniFE - small
reordering
External/SPEC/CINT2006/464.h264ref/464.h264ref - small better code after
reordering
MultiSource/Applications/JM/lencod/lencod - smaller code with less
shuffles
MultiSource/Applications/JM/ldecod/ldecod - same
External/SPEC/CFP2017rate/511.povray_r/511.povray_r - 2 extra loads
vectorized, smaller code
External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r - better code,
size increased because of more constant vectors.
External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s - same
External/SPEC/CFP2017rate/526.blender_r/526.blender_r - small change in
the vectorized code, some code a bit better, some a bit worse.

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/87091
2024-05-01 07:34:06 -04:00
Alexey Bataev
67e726a2f7
[SLP]Transform stores + reverse to strided stores with stride -1, if profitable.
Adds transformation of consecutive vector store + reverse to strided
stores with stride -1, if it is profitable

Reviewers: RKSimon, preames

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/90464
2024-05-01 07:32:33 -04:00
Alexey Bataev
51aac5b043 [SLP][NFCI]Improve compile time for phis with large number of incoming values.
Added a limit of 128 incoming values at max for PHIs nodes to be
vectorized plus improved performance by using logarithmic search instead
of linear if the number of incoming values is > 4.
2024-04-30 14:42:49 -07:00
Alexey Bataev
37ae4ad0ee
[SLP]Support minbitwidth analisys for buildvector nodes.
Metric: size..text

Program                                                                                                                                                size..text
                                                                                                                                                       exp           ref        diff
                                                                                  test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test    42906.00    42986.00  0.2%
                                                                           test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test    42909.00    42989.00  0.2%
                                                                                   test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test   664581.00   664661.00  0.0%
                                                                                  test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test   664581.00   664661.00  0.0%

Less is better.

Replaces `buildvector <p x in> + trunc <p x in> to <p x im>` sequences to
`buildvector <p x im> of { trunc in to im }` scalars, which is free in
most cases, results in better code.

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/88504
2024-04-29 09:57:37 -04:00
Alexey Bataev
040b5a1255 [SLP]Fix PR90211: vectorized node must match completely to be reused.
If the gather node matches the vectorized node, it must also match with
the scalars completely. Otherwise, need to revectorize the gather node
to generate correct code.
2024-04-29 06:51:11 -07:00
Alexey Bataev
79314c64d0 [SLP]Fix PR90224: check that users of gep are all vectorized.
Before deleting extractelement instruction for vectorized GEP with
external users, need to check that all users vectorized before deleting
this extractelement.
2024-04-26 11:49:12 -07:00
Alexey Bataev
d74e42acd2 [SLP]Attempt to vectorize long stores, if short one failed.
We can try to vectorize long store sequences, if short ones were
unsuccessful because of the non-profitable vectorization. It should not
increase compile time significantly (stores are sorted already,
complexity is n x log n), but vectorize extra code.

Metric: size..text

Program                                                                         size..text
                                                                                results     results0    diff
         test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test  1088012.00  1088236.00  0.0%
                  test-suite :: SingleSource/UnitTests/matrix-types-spec.test   480396.00   480476.00  0.0%
          test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test   664613.00   664661.00  0.0%
         test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test   664613.00   664661.00  0.0%
        test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test  2041105.00  2040961.00 -0.0%
                 test-suite :: MultiSource/Applications/JM/lencod/lencod.test   836563.00   836387.00 -0.0%
                 test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test  1035100.00  1032140.00 -0.3%

In all benchmarks extra code gets vectorized

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/88563
2024-04-26 06:53:44 -07:00
Alexey Bataev
f758bb66e8 [SLP]Fix PR89988: do extra analysis of the icmp args to correctly handle signed/unsigned comparison.
If operands of icmp has different signedness, need to consider extending
unsigned operands to correctly handle comparison with the signed
operands.
2024-04-25 16:10:24 -07:00
Alexey Bataev
b4a0fd40f1 [SLP]Fix PR89635: do not try to vectorize single-gather alternate node.
No need to try to vectorize single gather/buildvector with alternate
opcode graph, it is not profitable. In other cases, need to use last
instruction for inserting the vectorized code.
2024-04-23 06:45:43 -07:00
Alexey Bataev
0ab0c1d982
[SLP]Introduce transformNodes() and transform loads + reverse to strided loads.
Introduced transformNodes() function to perform transformation of the
nodes (cost-based, instruction count based, etc.).
Implemented transformation of consecutive loads + reverse order to
strided loads with stride -1, if profitable.

Reviewers: RKSimon, preames, topperc

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/88530
2024-04-22 12:31:57 -04:00
Alexey Bataev
6bd29d6639 [SLP]Fix PR89614: phis can be reordered, if reuses are not empty.
Need to relax assertion and check ReuseShuffleIndices is not empty, if
the root phi node has reorder indices.
2024-04-22 08:40:19 -07:00
Alexey Bataev
102a811094 [SLP]Fix a check for multi-users for icmp user.
The compiler should not take into account the type of the cmp
instruction, otherwise it may treat the size incorrectly and it may lead
to incorrect codegen.
2024-04-22 08:23:15 -07:00
Alexey Bataev
ef1d19b0a5 [SLP]Fix PR89438: check for all tree entries for the resized value.
Need to check all possible entries, before trying looking for the
minbitwidth in the user node. Otherwise we may incorrectly get
signedness info.
2024-04-22 06:38:38 -07:00
Alexey Bataev
cee7d994b9 [SLP]Fix PR89438: Check for same vectorized node in MinBWs, not user.
Need to check if the buildvector node has perfect diamond match in the
graph and the matched node is resized.
2024-04-19 12:52:19 -07:00
Alexey Bataev
4d7f3d9e0f [SLP]Fix final analysis for unsigned nodes.
Need to check that at least single bit is cleared for unsigned nodes
before reducing their size. Otherwise they might be treated as signed in
signed nodes.
2024-04-19 03:03:56 -07:00
Mikhail Goncharov
054b1b3b5a Revert "[SLP]Fix final analysis for unsigned nodes."
This reverts commit 74e07ab523122d6a8347b25770062ab331b6bb84.

It might be that Mask.getBitWidth() == Mask.countl_zero() (32 in my
case) and zero bitwidth2 causes the crash.
2024-04-19 11:32:56 +02:00
Alexey Bataev
74e07ab523 [SLP]Fix final analysis for unsigned nodes.
Need to check that at least single bit is cleared for unsigned nodes
before reducing their size. Otherwise they might be treated as signed in
signed nodes.
2024-04-18 10:05:54 -07:00
Alexey Bataev
9462abdff1 [SLP]Fix PR89187: fixx assertion check.
Need to use proper index variable to fix a crash.
2024-04-18 04:22:25 -07:00
Nikita Popov
888836930b Revert "[SLP]Attempt to vectorize long stores, if short one failed."
This reverts commit 6f7160eedb2db02f37d4ffd52fff7b0cf88b3fdc.

This still causes large compile-time regressions in some cases.
2024-04-18 10:15:45 +09:00
Alexey Bataev
6f7160eedb [SLP]Attempt to vectorize long stores, if short one failed.
We can try to vectorize long store sequences, if short ones were
unsuccessful because of the non-profitable vectorization. It should not
increase compile time significantly (stores are sorted already,
complexity is n x log n), but vectorize extra code.

Metric: size..text

Program                                                                         size..text
                                                                                results     results0    diff
         test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test  1088012.00  1088236.00  0.0%
                  test-suite :: SingleSource/UnitTests/matrix-types-spec.test   480396.00   480476.00  0.0%
          test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test   664613.00   664661.00  0.0%
         test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test   664613.00   664661.00  0.0%
        test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test  2041105.00  2040961.00 -0.0%
                 test-suite :: MultiSource/Applications/JM/lencod/lencod.test   836563.00   836387.00 -0.0%
                 test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test  1035100.00  1032140.00 -0.3%

In all benchmarks extra code gets vectorized

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/88563
2024-04-17 10:24:35 -07:00
Nikita Popov
efd60556f7 Revert "[SLP]Attempt to vectorize long stores, if short one failed."
This reverts commit 7d4e8c1f3bbfe976f4871c9cf953f76d771b0eda.

Contrary to the commit description, this does cause large
compile-time regressions (up to 10% on individual files).
2024-04-17 09:25:05 +09:00
Alexey Bataev
7d4e8c1f3b
[SLP]Attempt to vectorize long stores, if short one failed.
We can try to vectorize long store sequences, if short ones were
unsuccessful because of the non-profitable vectorization. It should not
increase compile time significantly (stores are sorted already,
complexity is n x log n), but vectorize extra code.

Metric: size..text

Program                                                                         size..text
                                                                                results     results0    diff
         test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test  1088012.00  1088236.00  0.0%
                  test-suite :: SingleSource/UnitTests/matrix-types-spec.test   480396.00   480476.00  0.0%
          test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test   664613.00   664661.00  0.0%
         test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test   664613.00   664661.00  0.0%
        test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test  2041105.00  2040961.00 -0.0%
                 test-suite :: MultiSource/Applications/JM/lencod/lencod.test   836563.00   836387.00 -0.0%
                 test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test  1035100.00  1032140.00 -0.3%

In all benchmarks extra code gets vectorized

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/88563
2024-04-16 14:55:41 -04:00
Alexey Bataev
c7657cf7d1
[SLP]Keep externally used GEPs as GEPs, if possible instead of extractelement.
If the vectorized GEP instruction can be still kept as a scalar GEP,
better to keep it as scalar instead of extractelement. In many cases it
is more profitable.

Metric: size..text

Program                                                                          size..text
                                                                                 results     results0    diff
                        test-suite :: SingleSource/Benchmarks/Misc/oourafft.test    18911.00    19695.00  4.1%
                   test-suite :: SingleSource/Benchmarks/Misc-C++-EH/spirit.test    59987.00    60707.00  1.2%
       test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test  1392209.00  1392753.00  0.0%
        test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test  1392209.00  1392753.00  0.0%
           test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test  1087996.00  1088236.00  0.0%
                         test-suite :: MultiSource/Benchmarks/Bullet/bullet.test   309310.00   309342.00  0.0%
             test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test   664661.00   664693.00  0.0%
            test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test   664661.00   664693.00  0.0%
        test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12354636.00 12354908.00  0.0%
                  test-suite :: External/SPEC/CFP2006/453.povray/453.povray.test  1152748.00  1152716.00 -0.0%
                       test-suite :: MultiSource/Applications/oggenc/oggenc.test   191787.00   191771.00 -0.0%
                     test-suite :: SingleSource/UnitTests/matrix-types-spec.test   480796.00   480476.00 -0.1%

Misc/oourafft - Extra code gets vectorized
Misc-C++-EH/spirit - same
CFP2017speed/638.imagick_s
CFP2017rate/538.imagick_r - same, extra code gets vectorized
CINT2006/400.perlbench - some extra 4 x ptr stores vectorized
Bullet/bullet - extra 4 x ptr store vectorized
CINT2017rate/525.x264_r
CINT2017speed/625.x264_s - same
CFP2017rate/526.blender_r - extra 8 x float stores (several), some extra
4 x ptr stores
CFP2006/453.povray - 2 x double loads/stores replaced by 4 x double
loads/stores
Applications/oggenc - extra code is vectorized
UnitTests/matrix-types-spec - extra code gets vectorized

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/88877
2024-04-16 14:54:06 -04:00
Alexey Bataev
26ebe16d78 [SLP]Fix PR88834: check if unsigned arg can be trunced, being used in smax/smin intrinsics.
Need to check that unsigned argument can be safely used in smax/smin
intrinsics by checking if at least single sign bit is cleared, otherwise
its value may be treated as negative instead of positive.
2024-04-16 06:42:15 -07:00
Florian Hahn
b73476c784
[SLP] Make sure MinVF is a power-of-2 by using PowerOf2Ceil.
This should ensure we explore the same VFs as before 6d66db3890a18e39.

Fixes https://github.com/llvm/llvm-project/issues/88640.
2024-04-16 13:29:35 +01:00
Alexey Bataev
b7b183371b [SLP][NFC]Improve perf of BoUpSLP::buildExternalUses function, NFC.
Perform required operations, only when the result is required + check if
value was not inserted already into the list of the externally used
scalars and early exit, if it was.
2024-04-15 11:18:24 -07:00
Valery Dmitriev
9abb1ffc5c
[SLP][NFC] Add option to bypass early profitability check. (#88594)
The option intended primarily for LIT tests to suppress heuristic based
profitability check and proceed vectorization of a seemingly
unprofitable alternate operation pattern. This allows the vectorizer to
execute path that was the original intent of a test.
2024-04-15 09:09:39 -07:00
Florian Hahn
6704faf6f8
[SLP] Use StoreTy to compute min VF.
This ensures that MinVF is a power-of-2, even if ValueTy's width is
not a power-of-2.

This should fix a number of buildbot failures with X86 bootstrapping.
2024-04-13 11:12:33 +01:00
Florian Hahn
6d66db3890
[SLP] Initial vectorization of non-power-of-2 ops. (#77790)
This patch enables vectorization for non-power-of-2 VFs. Initially only
VFs where adding 1 makes the VF a power-of-2, i.e. we can still make
relatively effective use of the vectors.

It relies on the existing target cost-models to return accurate costs
for
non-power-of-2 vectors. I checked mostly AArch64 and X86 and
there the costs seem reasonable for the costs I checked, although
I expect there will be a need to refine both the cost-models and
lowering
to make most effective use of non-power-of-2 SLP vectorization.

Note that re-ordering and shuffling is not implemented for nodes
requiring padding yet to keep the initial implementation simpler.

The feature is guarded by a new flag, off by defaul for now.

PR: https://github.com/llvm/llvm-project/pull/77790
2024-04-13 08:14:40 +01:00