4261 Commits

Author SHA1 Message Date
Florian Hahn
617398e5e2
[SLP] Collect candidate VFs in vector in vectorizeStores (NFC). (#82793)
This is in preparation for
https://github.com/llvm/llvm-project/pull/77790 and makes it easy to add
other, non-power-of-2 VFs for processing.

PR: https://github.com/llvm/llvm-project/pull/82793
2024-03-01 20:13:49 +00:00
Florian Hahn
6fe60bd89f
[SLP] Exit early if MaxVF < MinVF (NFCI). (#83283)
Exit early if MaxVF < MinVF. In that case, the loop body below will
never get entered. Note that this adjusts the condition from MaxVF <=
MinVF. If MaxVF == MinVF, vectorization may still be feasible (and the
loop below gets entered).

PR: https://github.com/llvm/llvm-project/pull/83283
2024-03-01 19:43:06 +00:00
Alexey Bataev
2ab6d1e18e [SLP][NFC]Move some check to the outer if to simplify inner checks. 2024-03-01 10:58:41 -08:00
Alexey Bataev
3a30d8e9e5
[SLP]Check if masked gather can be emitted as a serie of loads/insert subvector.
Masked gather is very expensive operation and sometimes better to
represent it as a serie of consecutive/strided loads + insertsubvectors
sequences. Patch adds some basic estimation and if loads+insertsubvector
is cheaper, decides to represent it in this way rather than masked
gather.

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/83481
2024-03-01 13:49:29 -05:00
Alexey Bataev
df0fd3a80e
[SLP]Try to vectorize small graph with extractelements, used in buildvector.
If the graph incudes only single "gather" node with only
extractelements/undefs, which used only in insertelement-based
buildvector sequences, it still might be profitable to vectorize it.
Need to rely on the cost model, not throw this graph away immediately.

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/83581
2024-03-01 12:48:45 -05:00
Benjamin Kramer
3fc277f665
[SLPVectorizer] Make the insert/extractvector PHICompare a strict-weak ordering (#83571)
This was tripping off STL implementations that check for it (like libc++
with debug checking). The goal of this sort is to cluster operations on
the same values so preserve that property but sort everything else based
on the existing numbering.
2024-03-01 15:37:54 +01:00
Alexey Bataev
5bafb8d952 [SLP][NFC]Add/use single UsesLimit constant, NFC. 2024-03-01 06:37:08 -08:00
Florian Hahn
6ecd26132b
[SLP] Use ScopeExit to update Operands/PrevDist on all paths. (NFC) (#83490)
Use ScopeExit to make sure Operands/PrevDist are updated on all paths in
the loop. This makes it easier to ensure they are updated correctly if
new early continues are added.

Split off from https://github.com/llvm/llvm-project/pull/83283

PR: https://github.com/llvm/llvm-project/pull/83490
2024-03-01 14:30:01 +00:00
Alexey Bataev
f28c4b4bac
[SLP]Fix/improve potential masked gather loads analysis.
When do the analysis for the (potential) masked gather node, we check
that not greater than half of  the pointer operands are loop invariants
or potentially vectorizable.
Need to check actually, that we have a loop at first
and do better check for the potentially vectorizable
pointers.

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/83472
2024-03-01 07:38:18 -05:00
Alexey Bataev
2d98d763a8
[SLP]Fix the cost model for extracts combined with later shuffle.
If the buildvector node contains extract, which later should be combined
with some other nodes by shuffling, need to estimate the cost of this
shuffle before building the mask after shuffle.

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/83442
2024-03-01 07:12:07 -05:00
Alexey Bataev
45d82f33af [SLP]Fix miscompilation, cause by incorrect final node reordering.
Need to use the regular reordering from the correct node for the final
store/insertelement node to avoid miscommilation.
2024-02-29 11:21:06 -08:00
Florian Hahn
4d525f2b9a
[VPlan] Remove unneeded InsertPointGuard (NFCI).
getBlockInMask now simply returns an already computed mask, hence
there's no need to adjust the builder insert point.
2024-02-29 12:37:13 +00:00
Nilanjana Basu
1c211bc76e
[LV] Remove unused configuration option (#82955)
Recent set of changes (PR #67725) in loop interleaving algorithm caused removal of the loop trip count threshold for allowing interleaving. Therefore configuration option interleave-small-loop-scalar-reduction is no longer needed.
2024-02-28 10:17:25 -08:00
Florian Hahn
3fac0562f8
[VPlan] Reset trip count when replacing ExpandSCEV recipe.
Otherwise accessing the trip count may accesses freed memory.

Fixes https://lab.llvm.org/buildbot/#/builders/74/builds/26239 and
others.
2024-02-28 16:31:49 +00:00
Alexey Bataev
80cff27390 [LV][NFC]Fix a misprint, NFC. 2024-02-28 07:56:31 -08:00
Alexey Bataev
fdd60c7b96
[LV][NFC]Preselect folding style while chosing the max VF, NFC.
Selects the tail-folding style while choosing the max vector
factor and storing it in the data member rather than calculating it each
time upon getTailFoldingStyle call.

Part of https://github.com/llvm/llvm-project/pull/76172

Reviewers: ayalz, fhahn

Reviewed By: fhahn

Pull Request: https://github.com/llvm/llvm-project/pull/81885
2024-02-28 10:15:52 -05:00
Alexey Bataev
c89d51112d [SLP]Use It->second.first for BWSz, NFC. 2024-02-28 06:38:41 -08:00
Florian Hahn
15d9d0fa8f
[VPlan] Also print final VPlan directly before codegen/execute. (#82269)
Some optimizations are apply after UF and VF have been chosen. This
patch adds an extra print of the final VPlan just before
codegen/execution.

In the future, there will be additional transforms that are applied
later (interleaving for example).

PR: https://github.com/llvm/llvm-project/pull/82269
2024-02-28 13:19:43 +00:00
Florian Hahn
911055e34f
[VPlan] Consistently use (Part, 0) for first lane scalar values (#80271)
At the moment, some VPInstructions create only a single scalar value,
but use VPTransformatState's 'vector' storage for this value. Those
values are effectively uniform-per-VF (or in some cases
uniform-across-VF-and-UF). Using the vector/per-part storage doesn't
interact well with other recipes, that more accurately using (Part,
Lane) to look up scalar values and prevents VPInstructions creating
scalars from interacting with other recipes working with scalars.

This PR tries to unify handling of scalars by using (Part, 0) for scalar
values where only the first lane is demanded. This allows using
VPInstructions with other recipes like VPScalarCastRecipe and is also
needed when using VPInstructions in more cases otuside the vector loop
region to generate scalars.

Depends on https://github.com/llvm/llvm-project/pull/80269
2024-02-26 19:06:43 +00:00
Florian Hahn
85da9f80b8
[VPlan] Remove unused VPTransformState::VPValue2Value (NFCI).
Clean up unused member variable.
2024-02-25 12:14:44 +00:00
Florian Hahn
0b01320d28
[VPlan] Remove unused VPTransformState::CanonicalIV (NFCI).
Clean up unused member variable.
2024-02-23 16:54:30 +00:00
Alexey Bataev
32994cc0d6
[SLP]Improve findReusedOrderedScalars and graph rotation.
Patch syncs the code in findReusedOrderedScalars with cost
estimation/codegen. It tries to use similar logic to better determine
best order.
Before, it just tried to find previously vectorized node without
checking if it is possible to use the vectorized value in the shuffle.
Now it relies on the more generalized version. If it determines, that
a single vector must be reordered (using same mechanism, as codegen and
cost estimation), it generates better order.

The comparison between new/ref ordering:

Metric: SLP.NumVectorInstructions

Program                                                                                                                                                SLP.NumVectorInstructions
                                                                                                                                                       results                   results0 diff
                                                                                               test-suite :: MultiSource/Benchmarks/nbench/nbench.test   139.00                    140.00   0.7%
                                                                             test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/miniFE/miniFE.test   344.00                    346.00   0.6%
                                                                                        test-suite :: MultiSource/Benchmarks/FreeBench/pifft/pifft.test  1293.00                   1292.00  -0.1%
                                                                                test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test  5176.00                   5170.00  -0.1%
                                                                                        test-suite :: External/SPEC/CFP2006/453.povray/453.povray.test  5173.00                   5167.00  -0.1%
                                                                                test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 11692.00                  11660.00  -0.3%
                                                                                     test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test  1621.00                   1615.00  -0.4%
                                                                                             test-suite :: External/SPEC/CINT2006/403.gcc/403.gcc.test   795.00                    792.00  -0.4%
                                                                              test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 26499.00                  26338.00  -0.6%
                                                                                               test-suite :: MultiSource/Benchmarks/Bullet/bullet.test  7343.00                   7281.00  -0.8%
                                                                                          test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test  1104.00                   1094.00  -0.9%
                                                                                          test-suite :: MultiSource/Applications/JM/lencod/lencod.test  2216.00                   2180.00  -1.6%
                                                                                            test-suite :: External/SPEC/CFP2006/433.milc/433.milc.test   787.00                    637.00 -19.1%

Less 0% is better.
Most of the benchmarks see more vectorized code. The first ones just
have shuffles removed.

The ordering analysis still may require some improvements (e.g. for
alternate nodes), but this one should be produce better results.

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: https://github.com/llvm/llvm-project/pull/77529
2024-02-22 14:32:15 -05:00
Florian Hahn
cd160a6e98
[VPlan] Do not add call results with void type to State (NFC).
With vector libraries, we may vectorize calls with void return types. Do
not add those values to the state; they can never be accessed.
2024-02-21 20:36:17 +00:00
Florian Hahn
3d66d6932e
[VPlan] Support live-ins without underlying IR in type analysis. (#80723)
A VPlan contains multiple live-ins without underlying IR, like VFxUF or
VectorTripCount. Trying to infer the scalar type of those causes a crash
at the moment.

Update VPTypeAnalysis to take a VPlan in its constructor and assign
types to those live-ins up front. All those live-ins share the type of
the canonical IV.

PR: https://github.com/llvm/llvm-project/pull/80723
2024-02-21 19:37:15 +00:00
Florian Hahn
9923d29cfa
[VPlan] Merge main VPlan verifer with HCFG verifier.
Unify VPlan verifiers in verifyVPlanIsValid. This adds verification for
various properties on blocks to the verifier used for VPlans generated
by the inner loop vectorizer. It also adds def-use checks for the
verifier used in the VPlan native path.

This drops the separate flag to enable HCFG verification. Instead, all
VPlans are verified once they have been created, if assertions are
enabled.

This also removes VPWidenPHIRecipe from VPHeaderPHIRecipe; it is used to
model any phi node in the native path.
2024-02-20 16:43:57 +00:00
Florian Hahn
44b17679e3
[VPlan] Remove stale comment from VPTransformState::get (NFC)
All values accessed via get are now part of VPTransformState, the ILV
reference in the comment has been removed a long time ago. Remove the
stale comment.
2024-02-19 22:11:51 +00:00
Alexey Bataev
35f45926eb [SLP][NFC]Add asserts for undef handling in PHIComparator, NFC. 2024-02-19 12:57:56 -08:00
Simon Pilgrim
769c22f25b
[VectorCombine] Fold reduce(trunc(x)) -> trunc(reduce(x)) iff cost effective (#81852)
Vector truncations can be pretty expensive, especially on X86, whilst scalar truncations are often free.

If the cost of performing the add/mul/and/or/xor reduction is cheap enough on the pre-truncated type, then avoid the vector truncation entirely.

Fixes https://github.com/llvm/llvm-project/issues/81469
2024-02-19 11:32:23 +00:00
Florian Hahn
536d78c213
[VPlan] Remove VPInstruction::setUnderlyingInstr (NFCI).
VPInstruction doesn't rely on the underlying instruction any longer for
codegen, remove the unneeded setUnderlyingInstr.
2024-02-18 18:50:01 +00:00
Florian Hahn
35ee6de966
[VPlan] Simplify addCanonicalIVRecipes by using VPBuilder (NFC).
Use VPBuilder to construct VPInstructions, which means there's no need
to manually inserting recipes.
2024-02-17 18:30:01 +00:00
Florian Hahn
20177c45db
[VPlan] Turn private members of VPlanTransforms to static funcs (NFC)
Private members of VPlanTransforms are only used inside
VPlanTransforms.cpp, just make them static.
2024-02-17 13:45:23 +00:00
Florian Hahn
0dacba3ad1
[VPlan] Handle truncating ICMPs in truncateToMinimalBWs.
Update truncateToMinimalBitwidths to handle truncating ICMPs. For ICMPs,
the new target type will be the same as the original type. In that case,
only truncate the operands, but skip the extend. This is in line with
what the original truncateToMinimalBitwidths did for compares.

Fixes https://github.com/llvm/llvm-project/issues/81415.
2024-02-16 12:58:56 +00:00
David Sherwood
1c10821022
[LoopVectorize] Fix divide-by-zero bug (#80836) (#81721)
When attempting to use the estimated trip count to refine the costs of
the runtime memory checks we should also check for sane trip counts to
prevent divide-by-zero faults on some platforms.

Fixes #80836
2024-02-14 16:07:51 +00:00
Florian Hahn
debca7ee43
[VPlan] Move dropping of poison flags to VPlanTransforms. (NFC)
Move collectPoisonGeneratingFlags from InnerLoopVectorizer to
VPlanTransforms and also update its name. collectPoisonGeneratingFlags
already directly drops poison-generating flags, not only collecting it.
This means it is more appropriate to integerate it directly into the
VPlan transform pipeline.

The current implementation still calls back to legal to check if a block
needs predication, which should be improved in the future.
2024-02-14 12:28:58 +00:00
Florian Hahn
ca56966684
[VPlan] Properly retain flags when cloning VPReplicateRecipe.
This makes sure the correct flags are used for the clone (i.e. the ones
present on the recipe), instead of the ones on the original IR
instruction.

At the moment, this should not change anything, as flags of replicate
recipe should not be dropped before they are cloned at the moment. But
that will change in a follow-up patch.
2024-02-14 11:11:46 +00:00
Alexey Bataev
b04dd5d187 [SLP]FIx PR81403: compiler crah because wrongly resized vector value.
The mask for the reshuffling/resizing might be calculated incorrectly,
fixed.
2024-02-12 10:27:25 -08:00
Alexey Bataev
833a1cadeb [SLP]Add support for strided loads.
Added basic support for strided loads support in SLP vectorizer.
Supports constant strides only. If the strided load must be
reversed, applies -stride to avoid extra reverse shuffle.

Reviewers: preames, lukel97

Reviewed By: preames

Pull Request: https://github.com/llvm/llvm-project/pull/80310
2024-02-12 09:43:54 -08:00
Alexey Bataev
6a3a5cad2e Revert "[SLP]Add support for strided loads."
This reverts commit 0940f9083e68bda78bcbb323c2968a4294092e21 to fix
issues reported in https://github.com/llvm/llvm-project/pull/80310.
2024-02-12 08:47:28 -08:00
Alexey Bataev
0940f9083e
[SLP]Add support for strided loads.
Added basic support for strided loads support in SLP vectorizer.
Supports constant strides only. If the strided load must be
reversed, applies -stride to avoid extra reverse shuffle.

Reviewers: preames, lukel97

Reviewed By: preames

Pull Request: https://github.com/llvm/llvm-project/pull/80310
2024-02-12 07:41:42 -05:00
Alexey Bataev
df856e4977
[SLP]Add GEP cost estimation for gathered loads.
When doing estimation for vectorization of gathered loads, need to
estimate the cost of the pointer (vectorization), as it is done for the
actual vectorized loads. Otherwise may be too optimistic about the cost
of the gathered loads.

Reviewers: preames

Reviewed By: preames

Pull Request: https://github.com/llvm/llvm-project/pull/80867
2024-02-07 07:30:41 -05:00
Alexey Bataev
299e5fef9d [SLP][NFC]Simplify/unify vectors for scattered/vectorized loads from
gathers, NFC.
2024-02-06 08:18:11 -08:00
Alexey Bataev
36e8db7d8c [SLP][NFC]Extract main part of GetGEPCostDiff to a function, NFC. 2024-02-06 08:05:42 -08:00
Nilanjana Basu
c1c5b854ad
[LV] Remove loop trip count threshold for deciding whether to interleave a loop (#67725)
A set of microbenchmarks (https://github.com/llvm/llvm-test-suite/pull/26) showed that loop interleaving can be beneficial for loops with low trip count as well. Loop interleaving count computation is updated accordingly in prior patches while this patch removes the loop trip count threshold for interleaving.
2024-02-05 17:23:58 -08:00
Florian Hahn
8cb2de7fae
[VPlan] Implement type inference for ICmp.
This fixes a crash in the attached test case due to missing type
inference for ICmp VPInstructions.
2024-02-05 15:42:07 +00:00
Florian Hahn
47abbf4fe9
[VPlan] Update VPInst::onlyFirstLaneUsed to check users. (#80269)
A VPInstruction only has its first lane used if all users use its first
lane only. Use vputils::onlyFirstLaneUsed to continue checking the
recipe's users to handle more cases.

Besides allowing additional introduction of scalar steps when
interleaving in some cases, this also enables using an Add VPInstruction
to model the increment - as a follow up.
2024-02-03 16:19:10 +00:00
Florian Hahn
3444240540
[VPlan] Mark vputils::onlyFirstPartUsed arg as const (NFC)
Split off https://github.com/llvm/llvm-project/pull/80269 as
suggested.
2024-02-03 15:59:09 +00:00
Florian Hahn
6936479020
[VPlan] Mark vputils::onlyFirstLaneUsed arg as const (NFC)
Split off https://github.com/llvm/llvm-project/pull/80269 as suggested.
2024-02-03 15:56:40 +00:00
Florian Hahn
2906f3626b
[VPlan] Update ::onlyScalarsGenerated to take IsScalable bool (NFCI).
Instead of passing in a full VF, just pass IsScalable as bool.
2024-02-03 14:51:14 +00:00
Alexey Bataev
ef7f6aca14 [SLP][NFC]Add some extra checks/reorganize the code to improve compile time, NFC. 2024-02-01 10:53:39 -08:00
Alexey Bataev
15295d0135 [SLP][NFC]Introduce and use computeCommonAlignment function, NFC. 2024-02-01 06:13:39 -08:00