2295 Commits

Author SHA1 Message Date
Alexey Bataev
dd5ba694bd [SLP]Recalculate deps for potential control-dependent schedule data
After clearing the dependencies in copyable data, need to recalculate
dependencies for the original ScheduleData, if it can be marked as
control dependent.

Fixes #153289
2025-08-13 08:18:26 -07:00
Sam Tebbs
0bfa1718af
[LV] Create in-loop sub reductions (#147026)
This PR allows the loop vectorizer to handle in-loop sub reductions by
forming a normal in-loop add reduction with a negated input.
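
The identity this relies on can be checked with a small sketch. It is plain Python over lists, not LLVM code, and the function names `sub_reduction` and `add_reduction_negated` are hypothetical, chosen only for illustration.

```python
def sub_reduction(start, values):
    """Reference form: subtract each value from the accumulator in turn."""
    acc = start
    for v in values:
        acc -= v
    return acc


def add_reduction_negated(start, values):
    """Same result expressed as an add reduction over negated inputs."""
    return start + sum(-v for v in values)


data = [3, 1, 4, 1, 5, 9, 2, 6]
assert sub_reduction(100, data) == add_reduction_negated(100, data)
```

The sketch uses Python integers; it says nothing about how the vectorizer treats floating-point reductions.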

Stacked PRs:
1. -> https://github.com/llvm/llvm-project/pull/147026
2. https://github.com/llvm/llvm-project/pull/147255
3. https://github.com/llvm/llvm-project/pull/147302
4. https://github.com/llvm/llvm-project/pull/147513
2025-08-12 10:22:41 +01:00
Alexey Bataev
2d7b55a028
[SLP]Initial support for copyable elements
Adds initial support for copyable elements, both schedulable and
non-schedulable.
Adds support only for add for now; other opcodes will be added in the future.
Some cases are still not handled, e.g. stores, because they currently do
not check for copyable elements.

Reviewers: hiraditya, RKSimon

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/147366
2025-08-11 09:41:19 -04:00
Alexey Bataev
67af2f6c5c [SLP]Initial FMAD support (#149102)
Added initial check for potential fmad conversion in reductions and
operands vectorization.

Added the check for instruction to fix #152683

Skipped the code for reduction to avoid regressions.
2025-08-11 05:53:55 -07:00
David Green
cfe190979e Revert "[SLP]Initial FMAD support (#149102)"
This reverts commit 0fffb9f9ed81f4c2084b8fe040c88b60bb6c372a due to major
performance regressions.
2025-08-10 15:16:01 +01:00
Alexey Bataev
0fffb9f9ed [SLP]Initial FMAD support (#149102)
Added initial check for potential fmad conversion in reductions and
operands vectorization.

Added the check for instruction to fix #152683
2025-08-08 10:30:23 -07:00
Alexey Bataev
0419b459be Revert "[SLP]Initial FMAD support (#149102)"
This reverts commit 0bcf45ea3458ba79eb4257afcfd6af954292c9ce to fix the
regressions reported in https://github.com/llvm/llvm-project/issues/152683
2025-08-08 09:17:59 -07:00
Alexey Bataev
adae370805 [SLP][NFC]Cleanup undefs and the whole test, NFC 2025-08-07 13:41:22 -07:00
Alexey Bataev
0bcf45ea34
[SLP]Initial FMAD support (#149102)
Added initial check for potential fmad conversion in reductions and
operands vectorization.
2025-08-07 09:51:43 -04:00
Ramkumar Ramachandra
edeee824f0
Reland [VectorUtils] Trivially vectorize ldexp, [l]lround (#152476)
Changes: The original patch, landed as 1336675, was reverted due to a
bug in LoopVectorize resulting in a crash. The bug has now been fixed by
95c32bf ([VPlan] Return invalid cost if any skeleton block has invalid
costs), and this reland is identical to the original patch.
2025-08-07 12:07:29 +01:00
Mikhail Gudim
3404c0b013
Slp basic test (#152355)
Add a basic test for SLPVectorizer to make sure that upcoming
refactoring patches don't break anything. Also, record a test for a
missed opportunity.
2025-08-06 14:54:50 -04:00
Alexey Bataev
e27831ff9b [SLP] Fix a check for main/alternate interchanged instruction
If the instruction is checked for matching the main instruction, need to
check if the opcode of the main instruction is compatible with the
operands of the instruction. If they are not, need to check the
alternate instruction and its operands for compatibility and return
alternate instruction as a match.

Fixes #151699

Fixed check for non-supported binary operations.
2025-08-04 11:20:54 -07:00
Michael Halkenhäuser
70af09e3a1
Revert "[SLP] Fix a check for main/alternate interchanged instruction" (#151997)
This reverts commit 3ee8d047109ea4bb479095f4b153c2120a8d726c.

Revert reason: FAILED build for openmp-offload-amdgpu-runtime-2 
https://lab.llvm.org/buildbot/#/builders/10/builds/10827
2025-08-04 12:57:20 -04:00
Alexey Bataev
3ee8d04710 [SLP] Fix a check for main/alternate interchanged instruction
If the instruction is checked for matching the main instruction, need to
check if the opcode of the main instruction is compatible with the
operands of the instruction. If they are not, need to check the
alternate instruction and its operands for compatibility and return
alternate instruction as a match.

Fixes #151699
2025-08-04 08:31:35 -07:00
Alexey Bataev
7cd1ce3aa0 [SLP]Check vector-like instruction for dominance in copyables
Need to check if the vector-like instruction is dominated by main
operation in the copyables to prevent broken def-use chain

Fixes #151456
2025-08-04 06:14:19 -07:00
David Green
b30d5315b7
[AArch64] Add better fcmp costs for expanded predicates (#147940)
Certain fcmp predicates need to be expanded into multiple operations and
or'd together. This adds some more accurate cost modelling for them
based on the predicate. Unsupported operations are given the cost of a
libcall and the latency is set to 2 as that seemed to be fairly common
between different CPUs.
2025-08-04 13:42:57 +01:00
Muhammad Omair Javaid
176d54aa33 Revert "[VectorUtils] Trivially vectorize ldexp, [l]lround (#145545)"
This reverts commit 13366759c3b9db9366659d870cc73c938422b020.

This broke various LLVM testsuite buildbots for AArch64 SVE, but the
problem got masked because relevant buildbots were already failing
due to other breakage.

It has broken llvm-test-suite test:
gfortran-regression-compile-regression__vect__pr106253_f.test

https://lab.llvm.org/buildbot/#/builders/4/builds/8164
https://lab.llvm.org/buildbot/#/builders/17/builds/9858
https://lab.llvm.org/buildbot/#/builders/41/builds/8067
https://lab.llvm.org/buildbot/#/builders/143/builds/9607
2025-08-01 01:24:52 +05:00
Ramkumar Ramachandra
13366759c3
[VectorUtils] Trivially vectorize ldexp, [l]lround (#145545) 2025-07-29 19:23:09 +01:00
Simon Pilgrim
0fa0ce1f3a
[CostModel][X86] Update SK_Broadcast based on cost kinds (#150620)
When these were converted to CostKindTblEntry the throughput was mainly copied to all cost kinds

Regenerated with my check_cost_tables.py helper script
2025-07-26 13:52:47 +01:00
Alexey Bataev
ef98e248c7 [SLP]Initial support for copyable elements (non-schedulable only)
Adds initial support for copyable elements. This patch only models adds
and model copyable elements as add <element>, 0, i.e. uses identity
constants for missing lanes.
Only support for elements, which do not require scheduling, is added to
reduce size of the patch.
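
As a rough illustration of the `add <element>, 0` modelling described above, the sketch below compares a lane-by-lane scalar form with a single vector add whose missing lanes are padded with the identity constant 0. It is plain Python, not SLP vectorizer code; `scalar_lanes`, `vector_add_with_identity`, and the `has_add` flag list are hypothetical names.

```python
def scalar_lanes(a, b, has_add):
    """Scalar form: lane i is a[i] + b[i] if it has an add, else just a[i]."""
    return [x + y if flag else x for x, y, flag in zip(a, b, has_add)]


def vector_add_with_identity(a, b, has_add):
    """Vector form: lanes without an add are modelled as `add <element>, 0`."""
    rhs = [y if flag else 0 for y, flag in zip(b, has_add)]
    return [x + y for x, y in zip(a, rhs)]


a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
has_add = [True, False, True, False]  # lanes 1 and 3 are copyable elements
assert scalar_lanes(a, b, has_add) == vector_add_with_identity(a, b, has_add)
```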

Fixed compile time regressions, reported crashes, updated release notes

Reviewers: RKSimon, hiraditya

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/140279
2025-07-25 10:55:07 -07:00
Martin Storsjö
936ee35dcc Revert "[SLP]Initial support for copyable elements (non-schedulable only)"
This reverts commit 898bba311f180ed54de33dc09e7071c279a4942a.

This change caused hangs and crashes, see
https://github.com/llvm/llvm-project/pull/140279#issuecomment-3115051063.
2025-07-25 01:22:20 +03:00
Martin Storsjö
bd170b78bb Revert "[SLP] Check if the user node has state before trying getting main instruction/opcode"
This reverts commit c9cea24fe68e24750b2d479144f839e1c2ec9d2b.

This is being reverted as it is intermixed with another commit
(898bba311f180ed54de33dc09e7071c279a4942a) that needs to be reverted.
2025-07-25 01:22:19 +03:00
Alexey Bataev
c9cea24fe6 [SLP] Check if the user node has state before trying getting main instruction/opcode
Need to check if the parent node has state to prevent compiler crashes.
Fixes #150479
2025-07-24 12:00:43 -07:00
Alexey Bataev
898bba311f [SLP]Initial support for copyable elements (non-schedulable only)
Adds initial support for copyable elements. This patch only models adds
and model copyable elements as add <element>, 0, i.e. uses identity
constants for missing lanes.
Only support for elements, which do not require scheduling, is added to
reduce size of the patch.

Fixed compile time regressions, updated release notes

Reviewers: RKSimon, hiraditya

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/140279
2025-07-23 13:38:34 -07:00
Alexey Bataev
a415d68e48 Revert "[SLP]Initial support for copyable elements (non-schedulable only)"
This reverts commit e202dba288edd47f1b370cc43aa8cd36a924e7c1 to try to
resolve compile time issues, reported in https://llvm-compile-time-tracker.com/compare.php?from=36089e5d983fe9ae00f497c2d500f37227f82db1&to=e202dba288edd47f1b370cc43aa8cd36a924e7c1&stat=instructions%3Au&details=on
2025-07-22 07:39:32 -07:00
Alexey Bataev
e202dba288
[SLP]Initial support for copyable elements (non-schedulable only)
Adds initial support for copyable elements. This patch only models adds
and model copyable elements as add <element>, 0, i.e. uses identity
constants for missing lanes.
Only support for elements, which do not require scheduling, is added to
reduce size of the patch.

Reviewers: RKSimon, hiraditya

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/140279
2025-07-21 14:07:28 -04:00
Alexey Bataev
ff225b5d88 [SLP][NFC]Add a run line for the test, NFC 2025-07-18 10:14:18 -07:00
Nikita Popov
369f749dc4 [SLP] Remove lifetime.start on null pointer in test (NFC) 2025-07-18 12:47:49 +02:00
Alexey Bataev
60ae9c9c63
[SLP]Do not consider non-profitable loads slices
If all slices are small and end up with strided or even vectorization
states, better to not consider these candidates for the vectorization
and try to vectorize the whole bunch as gathered loads.

Reviewers: hiraditya, RKSimon, HanKuanChen

Reviewed By: RKSimon, HanKuanChen

Pull Request: https://github.com/llvm/llvm-project/pull/149209
2025-07-17 08:00:02 -04:00
Florian Hahn
02d3738be9
[AArch64,TTI] Remove RealUse check for vector insert/extract costs. (#146526)
getVectorInstrCostHelper would return costs of zero for vector
inserts/extracts that move data between GPR and vector registers, if
there was no 'real' use, i.e. there was no corresponding existing
instruction.

This meant that passes like LoopVectorize and SLPVectorizer, which
likely are the main users of the interface, would underestimate the cost
of insert/extracts that move data between GPR and vector registers,
which has non-trivial costs.

The patch removes the special case and only returns costs of zero for
lane 0 if there is no need to transfer between integer and vector
registers.

This impacts a number of SLP tests, and most of them look like general
improvements. I think the change should make things more accurate for any
AArch64 target, but if not it could also just be Apple CPU specific.

I am seeing +2% end-to-end improvements on SLP-heavy workloads.

PR: https://github.com/llvm/llvm-project/pull/146526
2025-07-15 15:19:27 +01:00
Gaëtan Bossu
adb6efeac9
[SLP] Fix cost estimation of external uses with wrong VF (#148185)
It assumed that the VF remains constant throughout the tree. That's not
always true. This meant that we could query the extraction cost for a
lane that is out of bounds.

While experimenting with re-vectorisation for AArch64, we ran into this
issue. We cannot add a proper AArch64 test as more changes would need to
be brought in.

This commit is only fixing the computation of VF and adding an assert.
Some tests were failing after adding the assert:
- foo() in llvm/test/Transforms/SLPVectorizer/X86/horizontal.ll
- test() in llvm/test/Transforms/SLPVectorizer/X86/reduction-with-removed-extracts.ll
- test_with_extract() in llvm/test/Transforms/SLPVectorizer/RISCV/segmented-loads.ll
2025-07-15 11:39:09 +01:00
Florian Hahn
eb4de577da
[SLP,AArch64] Update build-vector test to actually build vectors.
Update test with all zero constant input values which get folded during
IR construction to actually use different input values, which require
materializing build vectors.
2025-07-14 13:47:44 +01:00
Alexey Bataev
a999a1b88c
[SLP]Remove emission of vector_insert/vector_extract intrinsics
Replaced by the regular shuffles.

Fixes #145512

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/148007
2025-07-11 15:26:45 -04:00
Alexey Bataev
dd60663b9b [SLP] Emit reduction instead of 2 extracts + scalar op, when vectorizing operands (#147583)
Added emission of the 2-element reduction instead of 2 extracts + scalar
op, when trying to vectorize operands of the instruction, if it is more
profitable.
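
The equivalence between the two forms can be sketched as follows. This is plain Python on a 2-element list, not the emitted IR; `two_extracts_scalar_op` and `two_element_reduction` are hypothetical names, and the sketch says nothing about the cost comparison that decides which form is more profitable.

```python
from functools import reduce
import operator


def two_extracts_scalar_op(vec, op):
    """Extract both lanes, then apply the scalar operation to them."""
    lane0 = vec[0]
    lane1 = vec[1]
    return op(lane0, lane1)


def two_element_reduction(vec, op):
    """Reduce the 2-element vector with the same operation."""
    return reduce(op, vec)


v = [7, 5]
assert two_extracts_scalar_op(v, operator.add) == two_element_reduction(v, operator.add)
```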
2025-07-10 12:50:52 -07:00
Alex Bradbury
18627e995c Revert "[SLP] Emit reduction instead of 2 extracts + scalar op, when vectorizing operands (#147583)"
This reverts commit ac4a38e9bd573a173432b89cbef7cce7a48e7907.

This breaks the RVV builders
(MicroBenchmarks/ImageProcessing/Blur/blur.test and
MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test from llvm-test-suite)
and reportedly SPEC Accel2023
<https://github.com/llvm/llvm-project/pull/147583#issuecomment-3057183138>.
2025-07-10 14:55:22 +01:00
Simon Pilgrim
59a99c6f2c [SLP] Drop unnecessary '' from around -passes=... arg lists to appease update_test_checks.py when run on DOS. NFC. 2025-07-10 12:18:41 +01:00
Alexey Bataev
ac4a38e9bd
[SLP] Emit reduction instead of 2 extracts + scalar op, when vectorizing operands (#147583)
Added emission of the 2-element reduction instead of 2 extracts + scalar
op, when trying to vectorize operands of the instruction, if it is more
profitable.
2025-07-09 19:52:09 -04:00
Gaëtan Bossu
50facad7fc
[SLP][REVEC] Fix insertelement legality checks (#146921)
The current code assumes that all the values in VL are valid
instructions, while it is possible to get poison.
2025-07-09 10:28:50 +01:00
David Spickett
651c994feb [llvm][test] Fix REQUIRES in extractelement-insertpoint.ll
The target is called "x86" not "x86_64".
2025-07-03 13:14:42 +00:00
Hanyang (Eric) Xu
6e1e89ee38
[SLP] Avoid -passes=instcombine stages in SLP tests (#146257)
Fixes #145511

Note that there are still two instances of
--passes=slp-vectorizer,instcombine left unchanged because it seems that
the tests are meant to run in conjunction with instcombine and removing
instcombine would invalidate their original objective:


[llvm/test/Transforms/SLPVectorizer/arith-div-undef.ll](https://github.com/llvm/llvm-project/blob/main/llvm/test/Transforms/SLPVectorizer/arith-div-undef.ll)

[llvm/test/Transforms/SLPVectorizer/slp-hr-with-reuse.ll](https://github.com/llvm/llvm-project/blob/main/llvm/test/Transforms/SLPVectorizer/slp-hr-with-reuse.ll)
2025-07-02 06:14:41 -04:00
Luke Lau
d0c1ea928c
[InstCombine] Pull unary shuffles through fneg/fabs (#144933)
This canonicalizes fneg/fabs (shuffle X, poison, mask) -> shuffle
(fneg/fabs X), poison, mask
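
A small sketch of why the two forms compute the same lanes, using Python lists in place of vectors. The helpers `shuffle` and `fneg` are hypothetical stand-ins, not LLVM APIs, and poison mask elements are represented as -1 with poison lanes shown as `None`; both are assumptions of this sketch.

```python
POISON = -1  # assumed marker for a poison mask element


def shuffle(x, mask):
    """Unary shuffle: pick x[m] for each mask element; poison lanes are None."""
    return [None if m == POISON else x[m] for m in mask]


def fneg(v):
    """Lane-wise negation that leaves poison (None) lanes untouched."""
    return [None if e is None else -e for e in v]


x = [1.0, 2.0, 3.0, 4.0]
mask = [3, 3, POISON, 0]
# fneg (shuffle X, poison, mask) == shuffle (fneg X), poison, mask
assert fneg(shuffle(x, mask)) == shuffle(fneg(x), mask)
```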

This undoes part of b331a7ebc1e02f9939d1a4a1509e7eb6cdda3d38 and
a8f13dbdeb31be37ee15b5febb7cc2137bbece67, but keeps the binary shuffle
case i.e. shuffle fneg, fneg, mask.

By pulling out the shuffle we bring it in line with the same
canonicalisation we perform on binary ops and intrinsics; the
original commit acknowledges that it goes in the opposite direction.

However nowadays VectorCombine is more powerful and can do more
optimisations when the shuffle is pulled out, so I think we should
revisit this. In particular we get more shuffles folded and can perform
scalarization.
2025-06-30 10:40:12 +01:00
Gheorghe-Teodor Bercea
3df36a2b18
[AMDGPU] Enable vectorization of i8 values. (#134934)
This patch adjusts the cost model to account for the ability of the
AMDGPU optimizer to group together i8 values into i32 values.

Co-authored-by: Erich Keane <ekeane@nvidia.com>
2025-06-26 19:15:31 -04:00
Simon Pilgrim
1a60c74c13
[CostModel][X86] SK_InsertSubvector inserted into the lowest subvector should be treated as SK_Select blend (#145892)
X86 uses implicit widening and BLEND/MOV shuffles in these cases - otherwise we still treat it as a SK_PermuteTwoSrc
2025-06-26 16:00:51 +01:00
Simon Pilgrim
8202c94cec
[CostModel][X86] getMaskedMemoryOpCost - widening masks must compute the cost of the full width insert_subvector across multiple legal vectors (#145693)
The memory value and mask value types might legalise differently - e.g. a v64i32 might split into 4 x v16i32 / 8 x v8i32 but the mask might legalize as 1 x v64i8 / 2 x v32i8 etc.

If the legalised value type has been split, then we must ensure we compute the cost for the entire mask value type and let getShuffleCost handle any legalisation, not assume that only a single trailing split mask will require widening.
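
The split counts quoted above follow from simple arithmetic, sketched here in Python. The model is an assumption: it treats legalization as splitting a vector into pieces of a fixed register width in bits (512 or 256), which reproduces the numbers in the message but is not how the X86 cost model is implemented. `legal_parts` is a hypothetical helper.

```python
def legal_parts(num_elts, elt_bits, reg_bits):
    """Number of register-sized pieces needed to hold num_elts x elt_bits."""
    total_bits = num_elts * elt_bits
    return -(-total_bits // reg_bits)  # ceiling division


# v64i32 value: 4 x v16i32 with 512-bit registers, 8 x v8i32 with 256-bit.
assert legal_parts(64, 32, 512) == 4
assert legal_parts(64, 32, 256) == 8
# v64i8 mask: 1 x v64i8 with 512-bit registers, 2 x v32i8 with 256-bit.
assert legal_parts(64, 8, 512) == 1
assert legal_parts(64, 8, 256) == 2
```

Because the value and the mask split into different numbers of pieces, a cost computed for one trailing piece of the mask does not cover the whole mask type.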
2025-06-25 16:30:35 +01:00
Simon Pilgrim
bf4afb08fe
[CostModel] improveShuffleKindFromMask - recognise a SK_PermuteSingleSrc incorrectly tagged as SK_PermuteTwoSrc (#145352)
If a SK_PermuteTwoSrc shuffle kind's mask only references the first
operand, then treat this as SK_PermuteSingleSrc
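
The check described above can be sketched as a predicate over the shuffle mask. This is illustrative Python, not the `improveShuffleKindFromMask` implementation; `references_only_first_operand` is a hypothetical name, and poison mask elements are assumed to be encoded as -1.

```python
POISON = -1  # assumed encoding of a poison mask element


def references_only_first_operand(mask, num_src_elts):
    """True if every non-poison index selects from the first source vector.

    In a two-source shuffle, indices 0..num_src_elts-1 select from the first
    operand and num_src_elts..2*num_src_elts-1 select from the second.
    """
    return all(m == POISON or 0 <= m < num_src_elts for m in mask)


# Only the first operand is used: can be treated as a single-source permute.
assert references_only_first_operand([3, 2, 1, 0], 4)
# Indices 5 and 7 select from the second operand: still a two-source permute.
assert not references_only_first_operand([0, 5, 2, 7], 4)
```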

Part of #145335
2025-06-23 20:20:47 +01:00
Matt Arsenault
54015f36c6
AMDGPU: Cost model for minimumnum/maximumnum (#141946) 2025-06-18 08:19:06 +09:00
Matt Arsenault
c9b2816388
AMDGPU: Fix cost model for 16-bit operations on gfx8 (#141943)
We should only divide the number of pieces to fit the packed instructions
if we actually have pk instructions. This increases the cost of copysign,
but is closer to the current codegen output. It could be much cheaper
than it is now.
2025-06-18 08:07:03 +09:00
Alexey Bataev
0108a5908c [SLP]Fix a crash on an subvector size calculation for non-power-of-2 vector
Patch fixes cost estimation for the extractelements from non-power-of-2
vectors, defined as subvector extracts. In this case the subvector size
might not be adjusted to a whole register size, so the minimum of the
whole vector size and the actual difference must be taken to prevent a
compiler crash.

Fixes #143513
2025-06-17 08:58:07 -07:00
Jeffrey Byrnes
c9a87a50ae
[SLPVectorizer] Use accurate cost for external users of resize shuffles (#137419)
When implementing the vectorization, we potentially need to add shuffles
for external users. In such cases, we may be shuffling a smaller vector
into a larger vector. When this happens `ResizeToVF` will just build a
poison padded identity vector. Then, to build the final shuffle, we
just use the `SK_InsertSubvector` mask.

This is possibly clearer by looking at the included test in
SLPVectorizer/AMDGPU/external-shuffle.ll

In the exit block we have a bunch of shuffles to glue the vectorized
tree to match the `InsertElement` users. `TMP25` holds the result of
resizing the v2i16 vectorized sequence to match the `InsertElement` size
v16i16. Then `TMP26` is the final shuffle which replaces the
`InsertElement` sequence. This is just an insertsubvector.

However, when calculating the cost for these shuffles, we aren't
modelling this correctly. `ResizeToVF` will indicate to
`performExtractsShuffleAction` that we cannot use the original mask due
to the resize shuffle. The consequence is that the cost calculation uses
a different shuffle mask than what is ultimately used.

Going back to the included test, we can consider again `TMP26`. Clearly
we can see the shuffle uses a mask {0, 1, 2, 3, 16, 17, poison ..}.
However, we will currently calculate the cost with a mask {0, 1, 2, 3,
20, 21, ...}: we have replaced 16 and 17 with 20 and 21 (Index + Vector
Size). Queries like BasicTTImpl::improveShuffleKindFromMask will not
recognize this as an `SK_InsertSubvector` mask, and targets which have
reduced costs for `SK_InsertSubvector` will not accurately calculate the
cost.
2025-06-17 08:14:05 -07:00
Han-Kuan Chen
414710c753
[SLP] Fix isCommutative to check uses of the original instruction instead of the converted instruction. (#143094) 2025-06-17 22:03:14 +08:00