4459 Commits

Author SHA1 Message Date
Alexey Bataev
51aac5b043 [SLP][NFCI]Improve compile time for phis with large number of incoming values.
Added a limit of at most 128 incoming values for PHI nodes to be
vectorized, and improved performance by using logarithmic search instead
of linear search if the number of incoming values is > 4.
2024-04-30 14:42:49 -07:00
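The size limit and the switch from linear to logarithmic search in the message above can be sketched in isolation. This is a minimal sketch, not the SLP vectorizer code: the `Incoming` pair, the integer block ids, and the names `withinLimit` and `findIncoming` are hypothetical, and only the two thresholds (128 and 4) come from the commit message.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-ins: a block is an int id and an incoming value is an int.
using Incoming = std::pair<int, int>; // (block id, value)

constexpr std::size_t MaxIncoming = 128;   // limit named in the commit message
constexpr std::size_t LinearThreshold = 4; // above this, switch to binary search

// True if a PHI with this many incoming values is small enough to consider.
bool withinLimit(const std::vector<Incoming> &In) {
  return In.size() <= MaxIncoming;
}

// Returns the value coming from Block, or -1 if there is none.
// Precondition: Sorted is ordered by block id.
int findIncoming(const std::vector<Incoming> &Sorted, int Block) {
  assert(std::is_sorted(Sorted.begin(), Sorted.end()));
  if (Sorted.size() <= LinearThreshold) {
    // Few entries: a linear scan is cheap and needs no extra machinery.
    for (const Incoming &I : Sorted)
      if (I.first == Block)
        return I.second;
    return -1;
  }
  // Many entries: binary search over the sorted block ids.
  auto It = std::lower_bound(
      Sorted.begin(), Sorted.end(), Block,
      [](const Incoming &I, int B) { return I.first < B; });
  return (It != Sorted.end() && It->first == Block) ? It->second : -1;
}
```

The sorted-vector layout is a choice made for this sketch; a map keyed by block would give the same logarithmic lookup.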
Florian Hahn
9c3f5fe88f
[LV] Don't consider the latch block as ScalarPredicatedBB.
The conditional branch from the loop latch will be replaced by a
single branch controlling the loop, so there is no extra overhead from
scalarization. This improves the cost estimates in some cases.
2024-04-29 19:15:46 +01:00
Alexey Bataev
37ae4ad0ee
[SLP]Support minbitwidth analysis for buildvector nodes.
Metric: size..text

Program                                                                       size..text
                                                                              exp        ref        diff
        test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test   42906.00   42986.00  0.2%
 test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test   42909.00   42989.00  0.2%
         test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test  664581.00  664661.00  0.0%
        test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test  664581.00  664661.00  0.0%

Less is better.

Replaces `buildvector <p x in> + trunc <p x in> to <p x im>` sequences with
`buildvector <p x im> of { trunc in to im }` scalars. The scalar truncs are
free in most cases, which results in better code.

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/88504
2024-04-29 09:57:37 -04:00
Alexey Bataev
040b5a1255 [SLP]Fix PR90211: vectorized node must match completely to be reused.
If the gather node matches the vectorized node, it must also match with
the scalars completely. Otherwise, need to revectorize the gather node
to generate correct code.
2024-04-29 06:51:11 -07:00
Maciej Gabka
bfc0317153
Move several vector intrinsics out of experimental namespace (#88748)
This patch moves the following intrinsics out of the experimental
namespace:
* vector.interleave2/deinterleave2
* vector.reverse
* vector.splice

All these intrinsics have existed in LLVM for more than a year now and are
widely used, so they should no longer be considered experimental.
2024-04-29 10:16:45 +01:00
Florian Hahn
aafed3408e
[VPlan] Make createScalarIVSteps return VPScalarIVStepsRecipe (NFC).
This avoids the need for using getVPSingleValue/getDefiningRecipe at the
place the return value is used.
2024-04-28 21:56:55 +01:00
Florian Hahn
b6a8f5486b
[LV] Consider all exit branch conditions uniform.
If we vectorize a loop with multiple exits, all exiting branches should
be considered uniform, as the resulting loop will be controlled by the
canonical IV only. Previously we were overestimating the cost of values
contributing to the other exits.
2024-04-28 13:15:55 +01:00
Florian Hahn
9ee8e38cdc
[VPlan] Also propagate versioned strides to users via sext/zext.
The versioned value may not be used in the loop directly but through a
sext/zext. Add new live-ins in those cases.
2024-04-26 21:29:43 +01:00
Alexey Bataev
79314c64d0 [SLP]Fix PR90224: check that users of gep are all vectorized.
Before deleting the extractelement instruction for a vectorized GEP with
external users, need to check that all of its users are vectorized.
2024-04-26 11:49:12 -07:00
Alexey Bataev
d74e42acd2 [SLP]Attempt to vectorize long stores, if short one failed.
We can try to vectorize long store sequences, if short ones were
unsuccessful because of the non-profitable vectorization. It should not
increase compile time significantly (stores are sorted already,
complexity is n x log n), but vectorize extra code.

Metric: size..text

Program                                                                         size..text
                                                                                results     results0    diff
         test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test  1088012.00  1088236.00  0.0%
                  test-suite :: SingleSource/UnitTests/matrix-types-spec.test   480396.00   480476.00  0.0%
          test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test   664613.00   664661.00  0.0%
         test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test   664613.00   664661.00  0.0%
        test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test  2041105.00  2040961.00 -0.0%
                 test-suite :: MultiSource/Applications/JM/lencod/lencod.test   836563.00   836387.00 -0.0%
                 test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test  1035100.00  1032140.00 -0.3%

In all benchmarks extra code gets vectorized

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/88563
2024-04-26 06:53:44 -07:00
Troy Butler
468fecfc39
Fix mismatches between function parameter definitions and declarations (#89512)
Addresses issue #88716.

Some function parameter names in the affected header files did not match
the parameter names in the definitions, or were listed in a different
order.

---------

Signed-off-by: Troy-Butler <squintik@outlook.com>
2024-04-26 13:00:31 +02:00
Alexey Bataev
f758bb66e8 [SLP]Fix PR89988: do extra analysis of the icmp args to correctly handle signed/unsigned comparison.
If the operands of an icmp have different signedness, need to consider
extending the unsigned operands to correctly handle comparison with the
signed operands.
2024-04-25 16:10:24 -07:00
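A small scalar example shows why mixed signedness matters when operands are narrowed before a signed comparison. This is a sketch under stated assumptions, not the SLP analysis itself: the function names are hypothetical, 8 and 16 bits are arbitrary widths picked for illustration, and the narrowing casts rely on two's-complement wraparound (guaranteed since C++20, the de facto behavior before that).

```cpp
#include <cstdint>

// Reference: the original wide comparison, done as signed on 32 bits.
bool wideSignedLess(std::int32_t A, std::int32_t B) { return A < B; }

// Unsafe narrowing: keep the low 8 bits of both operands and compare them as
// signed. A non-negative value such as 200 has bit 7 set, so it is misread
// as -56.
bool narrow8SignedLess(std::int32_t A, std::int32_t B) {
  return static_cast<std::int8_t>(A) < static_cast<std::int8_t>(B);
}

// Safer narrowing: give the operands enough extra bits that the top bit of
// the narrow type is clear for the non-negative operand.
bool narrow16SignedLess(std::int32_t A, std::int32_t B) {
  return static_cast<std::int16_t>(A) < static_cast<std::int16_t>(B);
}
```

With `A = 200` and `B = -1`, the wide comparison and the 16-bit comparison agree, while the 8-bit one flips the result.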
Simon Pilgrim
282b56f43d
[VectorCombine] foldShuffleOfBinops - add support for length changing shuffles (#88899)
Refactor to be closer to foldShuffleOfCastops - sibling patch to #88743 that can be used to address some of the issues identified in #88693
2024-04-24 10:18:49 +01:00
Patrick O'Neill
adb0126ef1
[VPlan] Add scalar inferencing support for Not and Or insns (#89160)
Fixes #87394.

PR: https://github.com/llvm/llvm-project/pull/89160
2024-04-23 15:48:43 +01:00
Alexey Bataev
b4a0fd40f1 [SLP]Fix PR89635: do not try to vectorize single-gather alternate node.
No need to try to vectorize single gather/buildvector with alternate
opcode graph, it is not profitable. In other cases, need to use last
instruction for inserting the vectorized code.
2024-04-23 06:45:43 -07:00
Florian Hahn
dadf6f2c5a
[VPlan] Ignore incoming values with constant false mask. (#89384)
Ignore incoming values with constant false masks when trying to simplify
VPBlendRecipes.

As a follow-on optimization, we should also be able to drop all incoming
values with false masks by creating a new VPBlendRecipe with those
operands dropped.

PR: https://github.com/llvm/llvm-project/pull/89384
2024-04-23 13:59:01 +01:00
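One way to picture the simplification described above: if every incoming value whose mask is not the constant false is the same value, the blend can be replaced by that value. The following is a minimal sketch with hypothetical types (`BlendIncoming`, integer value ids) and a hypothetical function name; it is not the VPlan implementation.

```cpp
#include <optional>
#include <vector>

struct BlendIncoming {
  int Value;             // hypothetical id of the incoming value
  bool MaskIsConstFalse; // true if this incoming value's mask is constant false
};

// Returns the single value the blend reduces to, if any. Incoming values
// guarded by a constant-false mask can never be selected, so they are skipped.
std::optional<int> simplifyBlend(const std::vector<BlendIncoming> &Ins) {
  std::optional<int> Common;
  for (const BlendIncoming &In : Ins) {
    if (In.MaskIsConstFalse)
      continue;
    if (!Common)
      Common = In.Value;
    else if (*Common != In.Value)
      return std::nullopt; // two different live values: keep the blend
  }
  return Common; // empty if every mask was constant false
}
```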
Simon Pilgrim
7f4f237cd8 [VectorCombine] foldShuffleOfShuffles - add missing arguments to getShuffleCost calls.
Ensure the getShuffleCost arguments/instruction args are populated - minor extension to #88743 to help improve shuffle costs for certain corner cases (e.g. shuffles of loads)
2024-04-23 11:53:08 +01:00
Florian Hahn
17fb3e82f6
[VPlan] Skip extending ICmp results in truncateToMinimalBitwidth.
Results of icmp don't need extending after truncating their operands, as
the result will always be i1. Skip them during extending.

Fixes https://github.com/llvm/llvm-project/issues/79742
Fixes https://github.com/llvm/llvm-project/issues/85185
2024-04-23 11:50:26 +01:00
Alexey Bataev
0ab0c1d982
[SLP]Introduce transformNodes() and transform loads + reverse to strided loads.
Introduced transformNodes() function to perform transformation of the
nodes (cost-based, instruction count based, etc.).
Implemented transformation of consecutive loads + reverse order to
strided loads with stride -1, if profitable.

Reviewers: RKSimon, preames, topperc

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/88530
2024-04-22 12:31:57 -04:00
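The equivalence behind the loads + reverse transformation can be checked on plain arrays: loading four consecutive elements and reversing them gives the same lanes as starting at the last element and stepping by -1. This is an illustrative sketch with hypothetical helper names and a fixed width of 4; it says nothing about when the transformation is profitable.

```cpp
#include <array>
#include <cstddef>

// Consecutive load of four elements starting at Base, then a reverse shuffle.
std::array<int, 4> loadThenReverse(const int *Base) {
  std::array<int, 4> V{Base[0], Base[1], Base[2], Base[3]};
  return {V[3], V[2], V[1], V[0]};
}

// Strided load of four elements: lane I reads Start[I * Stride].
std::array<int, 4> stridedLoad(const int *Start, std::ptrdiff_t Stride) {
  std::array<int, 4> V{};
  for (std::size_t I = 0; I < V.size(); ++I)
    V[I] = Start[static_cast<std::ptrdiff_t>(I) * Stride];
  return V;
}
```

For an array `A` of four ints, `loadThenReverse(A)` and `stridedLoad(A + 3, -1)` produce the same result.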
Alexey Bataev
6bd29d6639 [SLP]Fix PR89614: phis can be reordered, if reuses are not empty.
Need to relax assertion and check ReuseShuffleIndices is not empty, if
the root phi node has reorder indices.
2024-04-22 08:40:19 -07:00
Alexey Bataev
102a811094 [SLP]Fix a check for multi-users for icmp user.
The compiler should not take into account the type of the cmp
instruction, otherwise it may treat the size incorrectly and it may lead
to incorrect codegen.
2024-04-22 08:23:15 -07:00
Simon Pilgrim
bddfbe748b
[VectorCombine] foldShuffleOfShuffles - fold "shuffle (shuffle x, undef), (shuffle y, undef)" -> "shuffle x, y" (#88743)
Another step towards cleaning up shuffles that have been split, often across bitcasts between SSE intrinsics.

Strip shuffles entirely if we fold to an identity shuffle.
2024-04-22 15:57:59 +01:00
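The fold named in the title amounts to composing shuffle masks. The sketch below assumes both inner shuffles produce the same number of lanes, that `x` and `y` have the same width, and that -1 marks an undefined lane; the function name and the mask representation are hypothetical and not taken from VectorCombine.

```cpp
#include <vector>

// Composes Outer over two inner masks, where the inner shuffles are
// "shuffle x, undef, M0" and "shuffle y, undef, M1". The result is a mask
// for "shuffle x, y". SrcWidth is the lane count of x and of y.
std::vector<int> composeMasks(const std::vector<int> &Outer,
                              const std::vector<int> &M0,
                              const std::vector<int> &M1, int SrcWidth) {
  const int InnerWidth = static_cast<int>(M0.size());
  std::vector<int> Result;
  Result.reserve(Outer.size());
  for (int Idx : Outer) {
    int New = -1; // stays undefined unless a real source lane is found
    if (Idx >= 0 && Idx < InnerWidth) {
      int Src = M0[Idx];
      if (Src >= 0 && Src < SrcWidth)
        New = Src; // lane of x
    } else if (Idx >= InnerWidth &&
               Idx < InnerWidth + static_cast<int>(M1.size())) {
      int Src = M1[Idx - InnerWidth];
      if (Src >= 0 && Src < SrcWidth)
        New = Src + SrcWidth; // lane of y, offset past x's lanes
    }
    Result.push_back(New);
  }
  return Result;
}
```

For two 2-lane sources widened to 4 lanes with `{0, 1, 2, 3}` and combined by `{0, 1, 4, 5}`, the composed mask is `{0, 1, 2, 3}`, a plain concatenation of `x` and `y`.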
Alexey Bataev
ef1d19b0a5 [SLP]Fix PR89438: check for all tree entries for the resized value.
Need to check all possible entries before trying to look for the
minbitwidth in the user node. Otherwise we may incorrectly get
signedness info.
2024-04-22 06:38:38 -07:00
Florian Hahn
c93f02978c
[VPlan] Remove custom checks for EVL placement in verifier (NFCI).
After e2a72fa583d9, def-use chains of EVL are modeled explicitly.
So there's no need for a custom check of its placement, as regular
def-use verification will catch mis-placements.
2024-04-22 12:49:49 +01:00
Simon Pilgrim
4cc9c6d98d [VectorCombine] foldShuffleOfBinops - don't fold shuffle(divrem(x,y),divrem(z,w)) if mask contains poison
Fixes #89390
2024-04-22 09:00:38 +01:00
David Green
a8105026ff [LV] Fix warning about Mask being set twice. NFC
2024-04-20 16:40:08 +01:00
Alexey Bataev
cee7d994b9 [SLP]Fix PR89438: Check for same vectorized node in MinBWs, not user.
Need to check if the buildvector node has perfect diamond match in the
graph and the matched node is resized.
2024-04-19 12:52:19 -07:00
Alexey Bataev
4d7f3d9e0f [SLP]Fix final analysis for unsigned nodes.
Need to check that at least a single bit is cleared for unsigned nodes
before reducing their size. Otherwise they might be treated as signed in
signed nodes.
2024-04-19 03:03:56 -07:00
Mikhail Goncharov
054b1b3b5a Revert "[SLP]Fix final analysis for unsigned nodes."
This reverts commit 74e07ab523122d6a8347b25770062ab331b6bb84.

It might be that Mask.getBitWidth() == Mask.countl_zero() (32 in my
case) and zero bitwidth2 causes the crash.
2024-04-19 11:32:56 +02:00
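The degenerate case in the revert message (bit width equal to the number of leading zeros) and the "one cleared bit" requirement of the reverted fix can both be seen on plain 32-bit integers. This is a sketch with hypothetical helper names, using a hand-written leading-zero count rather than the `APInt` API.

```cpp
#include <algorithm>
#include <cstdint>

// Number of leading zero bits in a 32-bit value (32 when V is zero).
unsigned countLeadingZeros32(std::uint32_t V) {
  unsigned N = 0;
  for (std::uint32_t Bit = 0x80000000u; Bit != 0 && (V & Bit) == 0; Bit >>= 1)
    ++N;
  return N;
}

// Bits needed to hold V as an unsigned value. This is 0 when V == 0, which
// is the case where the width equals the leading-zero count.
unsigned unclampedWidth(std::uint32_t V) {
  return 32 - countLeadingZeros32(V);
}

// Hypothetical guard: never report a zero-bit type.
unsigned clampedWidth(std::uint32_t V) {
  return std::max(unclampedWidth(V), 1u);
}

// Bits needed if the value may later be read as signed: one extra cleared
// bit on top, so the value is not mistaken for a negative number.
unsigned widthInSignedContext(std::uint32_t V) {
  return unclampedWidth(V) + 1;
}
```

For example, 200 fits in 8 bits as unsigned but needs 9 bits once a signed user may read it.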
Florian Hahn
e2a72fa583
[VPlan] Introduce recipes for VP loads and stores. (#87816)
Introduce new subclasses of VPWidenMemoryRecipe for VP
(vector-predicated) loads and stores to address multiple TODOs from
https://github.com/llvm/llvm-project/pull/76172

Note that the introduction of the new recipes also improves code-gen for
VP gather/scatters by removing the redundant header mask. With the new
approach, it is not sufficient to look at users of the widened canonical
IV to find all uses of the header mask.

In some cases, a widened IV is used instead of separately widening the
canonical IV. To handle that, first collect all VPValues representing header
masks (by looking at users of both the canonical IV and widened inductions
that are canonical) and then checking all users (recursively) of those header
masks.

Depends on https://github.com/llvm/llvm-project/pull/87411.

PR: https://github.com/llvm/llvm-project/pull/87816
2024-04-19 09:44:23 +01:00
Alexey Bataev
74e07ab523 [SLP]Fix final analysis for unsigned nodes.
Need to check that at least a single bit is cleared for unsigned nodes
before reducing their size. Otherwise they might be treated as signed in
signed nodes.
2024-04-18 10:05:54 -07:00
Ramkumar Ramachandra
73e7f2ff70
LoopVectorize: guard marking iv as scalar; fix bug (#88730)
When collecting loop scalars, LoopVectorize over-eagerly marks the
induction variable and its update as scalars after vectorization, even
if the induction variable update is a first-order recurrence. Guard the
process with this check, fixing a crash.

Fixes #72969.
2024-04-18 14:41:07 +01:00
Alexey Bataev
9462abdff1 [SLP]Fix PR89187: fix assertion check.
Need to use proper index variable to fix a crash.
2024-04-18 04:22:25 -07:00
Ramkumar Ramachandra
63d8058ef5
LoopVectorize: guard appending InstsToScalarize; fix bug (#88720)
In the process of collecting instructions to scalarize, LoopVectorize
uses faulty reasoning whereby it also adds instructions that will be
scalar after vectorization. If an instruction satisfies
isScalarAfterVectorization() for the given VF, it should not be appended
to InstsToScalarize. Add this extra guard, fixing a crash.

Fixes #55096.
2024-04-18 10:03:07 +01:00
Nikita Popov
1baa385065
[IR][PatternMatch] Only accept poison in getSplatValue() (#89159)
In #88217 a large set of matchers was changed to only accept poison
values in splats, but not undef values. This is because we now use
poison for non-demanded vector elements, and allowing undef can cause
correctness issues.

This patch covers the remaining matchers by changing the AllowUndef
parameter of getSplatValue() to AllowPoison instead. We also carry out
corresponding renames in matchers.

As a followup, we may want to change the default for things like m_APInt
to m_APIntAllowPoison (as this is much less risky when only allowing
poison), but this change doesn't do that.

There is one caveat here: We have a single place
(X86FixupVectorConstants) which does require handling of vector splats
with undefs. This is because this works on backend constant pool
entries, which currently still use undef instead of poison for
non-demanded elements (because SDAG as a whole does not have an explicit
poison representation). As it's just the single use, I've open-coded a
getSplatValueAllowUndef() helper there, to discourage use in any other
places.
2024-04-18 15:44:12 +09:00
Nikita Popov
888836930b Revert "[SLP]Attempt to vectorize long stores, if short one failed."
This reverts commit 6f7160eedb2db02f37d4ffd52fff7b0cf88b3fdc.

This still causes large compile-time regressions in some cases.
2024-04-18 10:15:45 +09:00
Alexey Bataev
6f7160eedb [SLP]Attempt to vectorize long stores, if short one failed.
We can try to vectorize long store sequences, if short ones were
unsuccessful because of the non-profitable vectorization. It should not
increase compile time significantly (stores are sorted already,
complexity is n x log n), but vectorize extra code.

Metric: size..text

Program                                                                         size..text
                                                                                results     results0    diff
         test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test  1088012.00  1088236.00  0.0%
                  test-suite :: SingleSource/UnitTests/matrix-types-spec.test   480396.00   480476.00  0.0%
          test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test   664613.00   664661.00  0.0%
         test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test   664613.00   664661.00  0.0%
        test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test  2041105.00  2040961.00 -0.0%
                 test-suite :: MultiSource/Applications/JM/lencod/lencod.test   836563.00   836387.00 -0.0%
                 test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test  1035100.00  1032140.00 -0.3%

In all benchmarks extra code gets vectorized

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/88563
2024-04-17 10:24:35 -07:00
Florian Hahn
5d314353fb
[VPlan] Check for VPWidenLoadRecipe directly in truncateToMinBW. (NFCI).
After a separate recipe has been introduced for wide loads in
a9bafe91dd0, we can directly check for load recipes in the early
bail-out and remove the redundant bail-out for stores.
2024-04-17 15:53:32 +01:00
Florian Hahn
41b7341d6b
[VPlan] Factor out helper to recursively collect all users (NFCI).
Factor out logic to collect all users recursively to be re-used
in https://github.com/llvm/llvm-project/pull/87816.
2024-04-17 14:56:47 +01:00
Florian Hahn
a9bafe91dd
[VPlan] Split VPWidenMemoryInstructionRecipe (NFCI). (#87411)
This patch introduces a new VPWidenMemoryRecipe base class and distinct
sub-classes to model loads and stores.

This is a first step in an effort to simplify and modularize code
generation for widened loads and stores and enable adding further more
specialized memory recipes.

PR: https://github.com/llvm/llvm-project/pull/87411
2024-04-17 11:00:58 +01:00
Mel Chen
cbe148b730
[LV][NFC] Remove the declaration of function fixReduction. (#88491)
2024-04-17 17:59:52 +08:00
Nikita Popov
efd60556f7 Revert "[SLP]Attempt to vectorize long stores, if short one failed."
This reverts commit 7d4e8c1f3bbfe976f4871c9cf953f76d771b0eda.

Contrary to the commit description, this does cause large
compile-time regressions (up to 10% on individual files).
2024-04-17 09:25:05 +09:00
Arthur Eubanks
c6e01627ac Revert "Reapply "[LV] Improve AnyOf reduction codegen. (#78304)""
This reverts commit c6e38b928c56f562aea68a8e90f02dbdf0eada85.

Causes miscompiles, see comments on #78304.
2024-04-16 20:40:21 +00:00
Florian Hahn
34777c238b
[VPlan] Don't mark VPBlendRecipe as phi-like.
VPBlendRecipes don't get lowered to phis and usually do not appear at
the beginning of blocks, due to their masks appearing before them.

This effectively relaxes an over-eager verifier message.

Fixes https://github.com/llvm/llvm-project/issues/88297.
Fixes https://github.com/llvm/llvm-project/issues/88804.
2024-04-16 21:24:25 +01:00
Alexey Bataev
7d4e8c1f3b
[SLP]Attempt to vectorize long stores, if short one failed.
We can try to vectorize long store sequences, if short ones were
unsuccessful because of the non-profitable vectorization. It should not
increase compile time significantly (stores are sorted already,
complexity is n x log n), but vectorize extra code.

Metric: size..text

Program                                                                         size..text
                                                                                results     results0    diff
         test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test  1088012.00  1088236.00  0.0%
                  test-suite :: SingleSource/UnitTests/matrix-types-spec.test   480396.00   480476.00  0.0%
          test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test   664613.00   664661.00  0.0%
         test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test   664613.00   664661.00  0.0%
        test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test  2041105.00  2040961.00 -0.0%
                 test-suite :: MultiSource/Applications/JM/lencod/lencod.test   836563.00   836387.00 -0.0%
                 test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test  1035100.00  1032140.00 -0.3%

In all benchmarks extra code gets vectorized

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/88563
2024-04-16 14:55:41 -04:00
Alexey Bataev
c7657cf7d1
[SLP]Keep externally used GEPs as GEPs, if possible instead of extractelement.
If the vectorized GEP instruction can be still kept as a scalar GEP,
better to keep it as scalar instead of extractelement. In many cases it
is more profitable.

Metric: size..text

Program                                                                          size..text
                                                                                 results     results0    diff
                        test-suite :: SingleSource/Benchmarks/Misc/oourafft.test    18911.00    19695.00  4.1%
                   test-suite :: SingleSource/Benchmarks/Misc-C++-EH/spirit.test    59987.00    60707.00  1.2%
       test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test  1392209.00  1392753.00  0.0%
        test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test  1392209.00  1392753.00  0.0%
           test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test  1087996.00  1088236.00  0.0%
                         test-suite :: MultiSource/Benchmarks/Bullet/bullet.test   309310.00   309342.00  0.0%
             test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test   664661.00   664693.00  0.0%
            test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test   664661.00   664693.00  0.0%
        test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12354636.00 12354908.00  0.0%
                  test-suite :: External/SPEC/CFP2006/453.povray/453.povray.test  1152748.00  1152716.00 -0.0%
                       test-suite :: MultiSource/Applications/oggenc/oggenc.test   191787.00   191771.00 -0.0%
                     test-suite :: SingleSource/UnitTests/matrix-types-spec.test   480796.00   480476.00 -0.1%

Misc/oourafft - Extra code gets vectorized
Misc-C++-EH/spirit - same
CFP2017speed/638.imagick_s
CFP2017rate/538.imagick_r - same, extra code gets vectorized
CINT2006/400.perlbench - some extra 4 x ptr stores vectorized
Bullet/bullet - extra 4 x ptr store vectorized
CINT2017rate/525.x264_r
CINT2017speed/625.x264_s - same
CFP2017rate/526.blender_r - extra 8 x float stores (several), some extra
4 x ptr stores
CFP2006/453.povray - 2 x double loads/stores replaced by 4 x double
loads/stores
Applications/oggenc - extra code is vectorized
UnitTests/matrix-types-spec - extra code gets vectorized

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/88877
2024-04-16 14:54:06 -04:00
Harald van Dijk
60de56c743
[ValueTracking] Restore isKnownNonZero parameter order. (#88873)
Prior to #85863, the required parameters of llvm::isKnownNonZero were
Value and DataLayout. After, they are Value, Depth, and SimplifyQuery,
where SimplifyQuery is implicitly constructible from DataLayout. The
change to move Depth before SimplifyQuery needed callers to be updated
unnecessarily, and as commented in #85863, we actually want Depth to be
after SimplifyQuery anyway so that it can be defaulted and the caller
does not need to specify it.
2024-04-16 15:21:09 +01:00
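The argument for putting `Depth` last can be shown with stand-in types: when the query object is implicitly constructible from the data layout and `Depth` is a trailing defaulted parameter, old-style two-argument calls keep compiling. Everything here (`DataLayoutStub`, `QueryStub`, `isKnownNonZeroStub`) is hypothetical and only mirrors the shape described in the message, not the real LLVM declarations.

```cpp
// Hypothetical stand-in for a data layout object.
struct DataLayoutStub {
  unsigned PointerBits = 64;
};

// Hypothetical stand-in for a query object. The converting constructor is
// deliberately not explicit, so a DataLayoutStub can be passed directly.
struct QueryStub {
  const DataLayoutStub &DL;
  QueryStub(const DataLayoutStub &Layout) : DL(Layout) {}
};

// Depth comes last so it can be defaulted; callers that do not care about
// recursion depth pass only the value and the layout (or query).
bool isKnownNonZeroStub(int V, const QueryStub &Q, unsigned Depth = 0) {
  (void)Q;
  (void)Depth;
  return V != 0; // placeholder logic for the sketch
}
```

With this order, `isKnownNonZeroStub(V, DL)` and `isKnownNonZeroStub(V, DL, 2)` are both valid; with `Depth` in the middle, every caller would have to spell it out.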
Alexey Bataev
e84b2fb48d
[LV][NFCI]Use integer for cost/trip count calculations instead of double, fix possible UB.
Using a floating-point type in the compiler is not the best idea; here it
is used in a comparison for equality with 0 and may cause undefined
behavior in some cases.

Reviewers: fhahn

Reviewed By: fhahn

Pull Request: https://github.com/llvm/llvm-project/pull/87241
2024-04-16 09:48:13 -04:00
Alexey Bataev
26ebe16d78 [SLP]Fix PR88834: check if unsigned arg can be truncated, being used in smax/smin intrinsics.
Need to check that an unsigned argument can be safely used in smax/smin
intrinsics by checking if at least a single sign bit is cleared, otherwise
its value may be treated as negative instead of positive.
2024-04-16 06:42:15 -07:00
Florian Hahn
b73476c784
[SLP] Make sure MinVF is a power-of-2 by using PowerOf2Ceil.
This should ensure we explore the same VFs as before 6d66db3890a18e39.

Fixes https://github.com/llvm/llvm-project/issues/88640.
2024-04-16 13:29:35 +01:00
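Rounding a candidate minimum VF up to a power of two can be sketched as follows. This is a standalone re-implementation for illustration, not LLVM's `PowerOf2Ceil`; returning 0 for an input of 0 and for inputs above 2^63 is a choice made in this sketch.

```cpp
#include <cstdint>

// Smallest power of two that is greater than or equal to A.
// Returns 0 for A == 0 and when the result would not fit in 64 bits.
std::uint64_t powerOf2Ceil(std::uint64_t A) {
  if (A == 0 || A > (std::uint64_t{1} << 63))
    return 0;
  std::uint64_t P = 1;
  while (P < A)
    P <<= 1;
  return P;
}
```

A value such as 3 or 6 is rounded up to 4 or 8, while values that are already powers of two are returned unchanged.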