2243 Commits

Author SHA1 Message Date
Alexey Bataev
cb648ba970 [SLP]Check if the user node has instructions, used only outside
Gather nodes with parents, which scalar instructions are used only
outside, are generated before the whole tree vectorization. Need to
teach isGatherShuffledSingleRegisterEntry to check that such nodes are
emitted first and they cannot depend on other nodes, which are emitted
later.

Fixes #141628
2025-05-29 10:09:49 -07:00
Alexey Bataev
aa452b65fc [SLP]Restore insertion points after gathers vectorization
Restore insertion points after gathers vectorization to avoid a crash in
a root node vectorization.

Fixes #141265
2025-05-24 07:25:20 -07:00
Alexey Bataev
3918ef3688
[SLP]Fix the analysis for masked compress loads
Need to remove the check for Orders in interleaved loads analysis and
estimate shuffle cost without the reordering to correctly handle the
costs of masked compress loads.

Reviewers: hiraditya, HanKuanChen, RKSimon

Reviewed By: HanKuanChen, RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/140647
2025-05-20 07:31:16 -04:00
Alexey Bataev
bb8e2a8937 [SLP]Relax assertion to avoid compiler crash
Need to relax the assertion to fix a compiler crash in case if the
reordered compress loads are more profitable than the ordered ones.

Fixes #140334
2025-05-18 14:26:36 -07:00
Alexey Bataev
fb86b3d96b [SLP]Change the insertion point for outside-block-used nodes and prevec phi operand gathers
Need to set the insertion point for (non-schedulable) vector node after
the last instruction in the node to avoid def-use breakage. But it also
causes miscompilation with gather/buildvector operands of the phi nodes,
used in the same phi only in the block.
These nodes supposed to be inserted at the end of the block and after
changing the insertion point for the non-schedulable vec block, it also
may break def-use dependencies. Need to prevector such nodes, to emit
them as early as possible, so the vectorized nodes are inserted before
these nodes.

Fixes #139728

Recommit after revert 60fb92179291e848eb7b04913bdc818d081db296

Reviewers: hiraditya, HanKuanChen, RKSimon

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/139917
2025-05-18 12:59:36 -07:00
Alexey Bataev
60fb921792 Revert "[SLP]Change the insertion point for outside-block-used nodes and prevec phi operand gathers"
This reverts commit d79d9b8fbfc7e8411aeaf2f5e1be9d4247594fee to fix
a bug reported in https://github.com/llvm/llvm-project/pull/139917#issuecomment-2888216404
2025-05-17 11:06:37 -07:00
Alexey Bataev
d79d9b8fbf
[SLP]Change the insertion point for outside-block-used nodes and prevec phi operand gathers
Need to set the insertion point for (non-schedulable) vector node after
the last instruction in the node to avoid def-use breakage. But it also
causes miscompilation with gather/buildvector operands of the phi nodes,
used in the same phi only in the block.
These nodes supposed to be inserted at the end of the block and after
changing the insertion point for the non-schedulable vec block, it also
may break def-use dependencies. Need to prevector such nodes, to emit
them as early as possible, so the vectorized nodes are inserted before
these nodes.

Fixes #139728

Reviewers: hiraditya, HanKuanChen, RKSimon

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/139917
2025-05-16 12:52:27 -04:00
Ramkumar Ramachandra
c807395011
[LAA/SLP] Don't truncate APInt in getPointersDiff (#139941)
Change getPointersDiff to return an std::optional<int64_t>, and fill
this value with using APInt::trySExtValue. This simple change requires
changes to other functions in LAA, and major changes in SLPVectorizer
changing types from 32-bit to 64-bit.

Fixes #139202.
2025-05-15 10:08:05 +01:00
Alexey Bataev
c632ac3506 [SLP][NFC]Add a test with the gather operand in phi node in gathered loads 2025-05-14 08:18:23 -07:00
Alexey Bataev
e1ea86e849 [SLP]Do not try to use interleaved loads, if reordering is required
If the interleaved loads require reordering, better to avoid generate
load + shuffle sequence, which in this case cannot be recognized as
interleaved load. Also, it fixes the issue with the incorrect codegen.

Fixes #138923
2025-05-12 14:12:51 -07:00
Alexey Bataev
fa985b5f1e [SLP][NFC]Add a test with missed reordering of the interleaved loads 2025-05-12 13:48:11 -07:00
Alexey Bataev
2e13f7ab01 [SLP][NFC]Add a test with the incorrect vectorization for the pointers with distance difference > 2^32 2025-05-12 06:30:05 -07:00
Han-Kuan Chen
53df6400af
[SLP] Fix incorrect operand order in interchangeable instruction. (#139225) 2025-05-12 20:03:45 +08:00
Alexey Bataev
49042f2bee [SLP][NFC]Add a test with ordering of the operands of unordered loads 2025-05-11 08:09:51 -07:00
David Green
3b4d5638b3 [AArch64] Limit vector splitting to vectors of size larger than 128bit
The intent of this code is to split larger vectors into smaller shuffles, but
it currently triggering on some small vector types. Limit it to vectors of size
>128bit.
2025-05-09 22:17:28 +01:00
Gheorghe-Teodor Bercea
25a031947a
[AMDGPU][NFC] Add tests in preparation for i8 vectorization (#138801)
Precommit tests for PR: https://github.com/llvm/llvm-project/pull/134934
2025-05-09 10:32:49 -04:00
David Green
8b41551651 [AArch64] Add a slp vectorization test for extract and shuffle costs. NFC 2025-05-06 18:09:29 +01:00
Alexey Bataev
3aecbbcbf6 [SLP]Do not match nodes if schedulability of parent nodes is different
If one user node is non-schedulable and another one is schedulable, such
nodes should be considered matched. The selection of the actual insert
point in this case differs and the insert points may match, which may
cause a compiler crash because of the broken def-use chain.

Fixes #137797
2025-05-06 07:52:49 -07:00
Alexey Bataev
9400270449 [SLP]Fix comparator for vector operands of extractelements in PHICompare
Need to make comparator to follow strict-weak ordering to fix compiler
crashes.

Fixes #138178
2025-05-01 14:28:20 -07:00
Alexander Richardson
ee13638362
[AMDGPU] Remove explicit datalayout from tests where not needed
Since e39f6c1844fab59c638d8059a6cf139adb42279a opt will infer the
correct datalayout when given a triple. Avoid explicitly specifying it
in tests that depend on the AMDGPU target being present to avoid the
string becoming out of sync with the TargetInfo value.
Only tests with REQUIRES: amdgpu-registered-target or a local lit.cfg
were updated to ensure that tests for non-target-specific passes that
happen to use the AMDGPU layout still pass when building with a limited
set of targets.

Reviewed By: shiltian, arsenm

Pull Request: https://github.com/llvm/llvm-project/pull/137921
2025-04-30 10:58:17 -07:00
Jonas Paulsson
f5c8c1eedb
[SLPVectorizer] Move X86 specific handling into X86TTIImpl. (#137830)
`ad9909d "[SLP]Fix perfect diamond match with extractelements in scalars" `
changed SLPVectorizer getScalarizationOverhead() to call
TTI.getVectorInstrCost() instead of TTI.getScalarizationOverhead() in some
cases. This was due to X86 specific handlings in these (overridden) methods,
and unfortunately the general preference of TTI.getScalarizationOverhead()
was dropped. If VL is available it should always be preferred to use
getScalarizationOverhead(), and this is indeed the case for SystemZ which
has a special insertion instruction that can insert two GPR64s.

Then ` 33af951 "[SLP]Synchronize cost of gather/buildvector nodes with
codegen"` reworked SLPVectorizer getGatherCost() which together with
ad9909d caused the SystemZ test vec-elt-insertion.ll to fail.

This patch restores the SystemZ test and reverts the change in SLPVectorizer
getScalarizationOverhead() so that TTI.getScalarizationOverhead() is always
called again. The ForPoisonSrc argument is now passed on to the TTI method
so that X86 can handle this as required.

Fixes: #135346
2025-04-30 17:11:27 +02:00
Florian Hahn
ec1016f7ef
[IVDescriptors] Support reductions with minimumnum/maximumnum. (#137335)
Add a new reduction recurrence kind for reductions with
minimumnum/maximumnum. Such reductions can be vectorized without
nsz/nnans, same as reductions with maximum/minimum intrinsics.

Note that a new reduction kind is needed to make sure partial reductions
are also combined with minimumnum/maximumnum.

Note that the final reduction to a scalar value is performed with
vector.reduce.fmin/fmax. This should be fine, as the results of the
partial reductions with maximumnum/minimumnum silences any sNaNs.

In-loop and reductions in SLP are not supported yet, as there's no
reduction version of maximumnum/minimumnum yet and fmax may be
incorrect.

PR: https://github.com/llvm/llvm-project/pull/137335
2025-04-28 11:16:36 +01:00
YunQiang Su
e9a34e4236
[RISCV] Support vectorizing FMINIMUMNUM and FMAXIMUMNUM (#135727)
RISC-V V extension support vfmax and vfmin, which follow IEEE754-2019.
We can use them directly.
2025-04-27 19:10:02 +08:00
Alexey Bataev
a7a74b349d
[SLP]Improve reordering of the alternate nodes
Better to preserve the original order of the alternate nodes to avoid
inter-lane shuffling, select/insert subvector patterns provide better
perf.

Reviewers: RKSimon, hiraditya

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/136329
2025-04-24 14:33:10 -04:00
Alexey Bataev
f427890a1d [SLP]Fix PHI comparator to make it follow weak strict ordering restriction
Fixes #137164
2025-04-24 11:08:17 -07:00
Philip Reames
1c722fc8f5
[RISCV][TTI] Use processShuffleMask for shuffle legalization estimate (#136191)
We had some code which tried to estimate legalization costs for
illegally typed shuffles, but it only handled the case of a widening
shuffle, and used a somewhat adhoc heuristic. We can reuse the
processShuffleMask utility (which we already use for individual vector
register splitting when exact VLEN is known) to perform the same
splitting given the legal vector type as the unit of split instead. This
makes the costing both simpler and more robust.

Note that this swings costs for illegal shuffles pretty wildly as we
were previously sometimes hitting the adhoc code, and sometimes falling
through into generic scalarization costing. I don't know that any of the
costs for the individual tests in tree are significant, but the test
which which triggered me finding this was reported to me by Alexey
reduced from something triggering a bad choice in SLP for x264. So this
has the potential to be somewhat high impact.
2025-04-22 10:50:20 -07:00
Alexey Bataev
9c388f1f05
[SLP]Prefer segmented/deinterleaved loads to strided and fix codegen
Need to estimate, which one is preferable, deinterleaved/segmented
loads or strided. Segmented loads can be combined, improving
the overall performance.

Reviewers: RKSimon, hiraditya

Reviewed By: hiraditya, RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/135058
2025-04-22 12:11:01 -04:00
Alexey Bataev
0252d338fa
[SLP]Model single unique value insert + shuffle as splat + select, where profitable
When we have the remaining unique scalar, that should be inserted into
non-poison vector and into non-zero position:
```
%vec1 = insertelement %vec, %v, pos1
%res = shuffle %vec1, poison, <0, 1, 2,..., pos1, pos1 + 1, ..., pos1,
...>
```
better to estimate if it is profitable to model it as is or model it as:
```
%bv = insertelement poison, %v, 0
%splat = shuffle %bv, poison, <poison, ..., 0, ..., 0, ...>
%res = shuffle %vec, %splat, <0, 1, 2,..., pos1 + VF, pos1 + 1, ...>
```

Reviewers: preames, hiraditya, RKSimon

Reviewed By: preames

Pull Request: https://github.com/llvm/llvm-project/pull/136590
2025-04-22 11:30:29 -04:00
Alexey Bataev
fdcee2dd36
[SLP]Reorder tree, if the reorder indices are non empty
Need to consider the ordering for all nodes with the specified ordering,
not only loads/store/extracts.

Reviewers: hiraditya, RKSimon

Reviewed By: hiraditya

Pull Request: https://github.com/llvm/llvm-project/pull/136185
2025-04-18 13:37:08 -04:00
Alexey Bataev
5fe91f1b59 [SLP]Check for catchswitch block before doing the analysis of the instructions
Need to skip the analysis of the catchswitch blocks to avoid a compiler
crash when trying to get the first instruction in the block.
2025-04-17 09:10:15 -07:00
Alexey Bataev
1fcf78d153 [SLP]Cache data for compressed loads before codegen
Need to cache and use cached data for compressed loads before codegen to
avoid side-effects, caused by the earlier vectorization, which may
affect the analysis.
2025-04-17 08:43:44 -07:00
Alexey Bataev
4aca20c8b6 [SLP]Pre-cache the last instruction for all entries before vectorization
Need to pre-cache last instruction to avoid unexpected changes in the
last instruction detection during the vectorization, caused by adding
the new vector instructions, which add new uses and may affect the
analysis.
2025-04-16 11:44:55 -07:00
Alexey Bataev
913dcf1aa3 [SLP]Fix type promotion for smax reduction with unsigned reduced operands
Need to add an extra bit for sign info for unsigned reduced values to
generate correct code.
2025-04-16 10:14:29 -07:00
Alexey Bataev
51fa6cde7d [SLP][NFC]Add a test with missing unsigned promotion for smax reduction, NFC 2025-04-16 09:55:34 -07:00
Alexey Bataev
af28c9c65a [SLP]Do not reorder split node operand with reuses, if not possible
Need to check if the operand node of the split vectorize node has reuses
and check if it is possible to build the order for this node to reorder
it correctly.

Fixes #135912
2025-04-16 06:23:44 -07:00
Alexey Bataev
ddb1267430 [SLP]Insert vector instruction after landingpad
If the node must be emitted in the landingpad block, need to insert the
instructions after the landingpad instruction to avoid a crash.

Fixes #135781
2025-04-15 13:57:53 -07:00
Alexey Bataev
85eb44e304 [SLP]Fix number of operands for the split node
FOr the split node number of operands should be requested via
getNumOperands() function, even if the main op is CallInst.
2025-04-15 13:33:36 -07:00
Alexey Bataev
2271f0bebd [SLP]Check for perfect/shuffled match for the split node
If the potential split node is a perfect/shuffled match of another split
node, need to skip creation of the another split node with the same
scalars, it should be a buildvector.

Fixes #135800
2025-04-15 13:17:46 -07:00
Han-Kuan Chen
d41e517748
[SLP] Make getSameOpcode support interchangeable instructions. (#135797)
We use the term "interchangeable instructions" to refer to different
operators that have the same meaning (e.g., `add x, 0` is equivalent to
`mul x, 1`).
Non-constant values are not supported, as they may incur high costs with
little benefit.

---------

Co-authored-by: Alexey Bataev <a.bataev@gmx.com>
2025-04-16 00:08:59 +08:00
Han-Kuan Chen
bcfc9f4529
[SLP][REVEC] VectorValuesAndScales should be supported by REVEC. (#135762)
We should align REVEC with the SLP algorithm as closely as possible. For
example, by applying REVEC-specific handling when calling IRBuilder's
Create methods, performing cost analysis via TTI, and expanding shuffle
masks using transformScalarShuffleIndicesToVector.

reference commit: 3b18d47ecbaba4e519ebf0d1bc134a404a56a9da
2025-04-15 23:03:55 +08:00
Alexey Bataev
57025b42c4 [SLP]Mark smin reduction as signed compare
Reduction signed min must be marked as signed compare, fixing the
analysis for the cases, where the incoming arguments are unsigned.

Fixes #133943
2025-04-15 07:24:17 -07:00
Alexey Bataev
7f2587a239 [SLP][NFC]Add a test with missing zext on signed minimum reduction, NFC 2025-04-15 07:14:36 -07:00
Han-Kuan Chen
e1382b3b45 Revert "[SLP] Make getSameOpcode support interchangeable instructions. (#133888)"
This reverts commit 123993fd974629ca0a094918db4c21ad1c2624d0.
2025-04-15 06:02:42 -07:00
YunQiang Su
fe9e2090be
Vectorize: Support fminimumnum and fmaximumnum (#131781)
Support auto-vectorize for fminimum_num and fmaximum_num. 
For ARM64 with SVE, scalable vector cannot support yet.

---------

Co-authored-by: Your Name <you@example.com>
2025-04-15 08:08:45 +08:00
Han-Kuan Chen
123993fd97
[SLP] Make getSameOpcode support interchangeable instructions. (#133888)
We use the term "interchangeable instructions" to refer to different
operators that have the same meaning (e.g., `add x, 0` is equivalent to
`mul x, 1`).
Non-constant values are not supported, as they may incur high costs with
little benefit.

---------

Co-authored-by: Alexey Bataev <a.bataev@gmx.com>
2025-04-14 19:23:18 +08:00
Alexey Bataev
38e64b1a84 [SLP]Fix minbiwidth analysis for gather nodes with SIToFP users
If the buildvector node has cast to float user, it cannot be considered as safe
for truncation, need to use the original bitwidth here.

Fixes #135410
2025-04-11 11:40:41 -07:00
Alexey Bataev
c9ad5bed7f [SLP][NFC]Add a test with the incorrect type promotion after bitwidth analysis, NFC 2025-04-11 11:10:01 -07:00
Alexey Bataev
a2d129b792 [SLP]Fix a crash when trying to reduce in revec after minbitwidth analysis
Need to use the original scalar type, when building the reduction, and
use the scalar type, when performing casting, to avoid compiler crash.
2025-04-11 10:58:39 -07:00
Alexey Bataev
33af951f3f
[SLP]Synchronize cost of gather/buildvector nodes with codegen
If the buildvector node contains constants and non-constants, need to
consider shuffling of the constant vec and insertion of unique elements
into the vector. Also, if there is an input vector, need to consider the
cost of shuffling source vector and constant vector and then insertion
and shuffling of the non-constant elements.

Reviewers: hiraditya, RKSimon

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/135245
2025-04-11 09:42:34 -04:00
Han-Kuan Chen
b99a2b6221 [SLP][REVEC] Update test.
The test is affected by commit aaaa2a325bd1abb8c87e0171384fd2c42da5e38a.
2025-04-11 03:01:09 -07:00