3059 Commits

Author SHA1 Message Date
Philip Reames
a83441e8cd Revert "[SLP] Simplify extendSchedulingRegion"
This reverts commit 8c85f3a0523070ef656e30e368df0a679c1400cd.
2022-02-23 13:12:07 -08:00
Philip Reames
222e8610f1 [SLP] Rearrange fields in ScheduleData for density [NFC] 2022-02-23 12:33:43 -08:00
Philip Reames
a3e9b32c00 [SLP] Remove SchedulingPriority from ScheduleData [NFC]
First step in trying to shrink the memory footprint of ScheduleData to improve cache locality.
2022-02-23 11:43:46 -08:00
Philip Reames
8c85f3a052 [SLP] Simplify extendSchedulingRegion
This change uses instruction's comesBefore method to simplify the code significantly. There's little compile time concern here because getSpillCost already calls comesBefore on every basic block which contains a vectorization candidate. The only additional times we'll build basic block ordering is when we can't schedule a vector candidate anywhere in the containing block.

Differential Revision: https://reviews.llvm.org/D120364
2022-02-23 11:23:38 -08:00
Philip Reames
6adf4b039e [SLP] Remove cap on schedule window size
This cap was first added in 848c1aa45 (back in 2015).  Per the original commit message, the purpose was to avoid a compile time explosion in long basic blocks.  The algorithmic problem in scheduling has now been fixed in 0539a26d.

In the meantime, the code has rotten fairly badly.  Some intermediate refactoring caused the size to only be incremented if *both* iterators advance in the window search.  This causes the size to be badly undercounted when near one end of a basic block.  We no longer have any test which exercises the logic in an intentional way; there's one test which differs with this change, but the changes appear fairly orthoganol to the purpose of the test file.

Unfortunately, we no longer have the original motivating example, so it's possible that it also hits some other issue.  I tested locally with a large example, but even at it's worst, that one doesn't demonstrate anything too extreme even without the algorithmic fix.  It's clearly faster with, but only by ~20% which doesn't seem in line with the original commit message.   If regressions with this patch are seen, please file a bug and I'll try to fix any other algorithmic problems which fall out.
2022-02-23 08:27:45 -08:00
Brendon Cahoon
3cc15e2cb6 [SLP] Fix assert from non-constant index in insertelement
A call to getInsertIndex() in getTreeCost() is returning None,
which causes an assert because a non-constant index value for
insertelement was not expected. This case occurs when the
insertelement index value is defined with a PHI.

Differential Revision: https://reviews.llvm.org/D120223
2022-02-22 15:57:14 -06:00
Philip Reames
8612b11c86 [SLP] Use isInSchedulingRegion consistently [NFC] 2022-02-22 10:27:16 -08:00
Philip Reames
0539a26d91 [SLP] Schedule only sub-graph of vectorizable instructions
SLP currently schedules all instructions within a scheduling window which stretches from the first instruction potentially vectorized to the last. This window can include a very large number of unrelated instructions which are not being considered for vectorization. This change switches the code to only schedule the sub-graph consisting of the instructions being vectorized and their transitive users.

This has the effect of greatly reducing the amount of work performed in large basic blocks, and thus greatly improves compile time on degenerate examples. To understand the effects, I added some statistics (not planned for upstream contribution). Here's an illustration from my motivating example:

Before this patch:

704357 SLP                          - Number of calcDeps actions
 699021 SLP                          - Number of schedule calls
   5598 SLP                          - Number of ReSchedule actions
     59 SLP                          - Number of ReScheduleOnFail actions
  10084 SLP                          - Number of schedule resets
   8523 SLP                          - Number of vector instructions generated

After this patch:

102895 SLP                          - Number of calcDeps actions
 161916 SLP                          - Number of schedule calls
   5637 SLP                          - Number of ReSchedule actions
     55 SLP                          - Number of ReScheduleOnFail actions
  10083 SLP                          - Number of schedule resets
   8403 SLP                          - Number of vector instructions generated

I do want to highlight that there is a small difference in number of generated vector instructions. This example is hitting the bailout due to maximum window size, and the change in scheduling is slightly perturbing when and how we hit it. This can be seen in the RescheduleOnFail counter change. Given that, I think we can safely ignore.

The downside of this change can be seen in the large test diff. We group all vectorizable instructions together at the bottom of the scheduling region. This means that vector instructions can move quite far from their original point in code. While maybe undesirable, I don't see this as being a major problem as this pass is not intended to be a general scheduling pass.

For context, it's worth noting that the pre-scheduling that SLP does while building the vector tree is exactly the sub-graph scheduling implemented by this patch.

Differential Revision: https://reviews.llvm.org/D118538
2022-02-22 10:15:55 -08:00
Kerry McLaughlin
12fb133eba [LoopVectorize] Support conditional in-loop vector reductions
Extends getReductionOpChain to look through Phis which may be part of
the reduction chain. adjustRecipesForReductions will now also create a
CondOp for VPReductionRecipe if the block is predicated and not only if
foldTailByMasking is true.

Changes were required in tryToBlend to ensure that we don't attempt
to convert the reduction Phi into a select by returning a VPBlendRecipe.
The VPReductionRecipe will create a select between the Phi and the reduction.

Reviewed By: david-arm

Differential Revision: https://reviews.llvm.org/D117580
2022-02-22 12:04:35 +00:00
Florian Hahn
c141d158e5
[VectorCombine] Remove redundant checks (NFC).
The removed conditions are already checked by the if above.

Fixes #53761.
2022-02-19 21:05:32 +00:00
Philip Reames
3ad0bdae8f [SLP] Address post commit comment from 2e50760 2022-02-18 10:57:15 -08:00
Alexey Bataev
b0a0df9809 [SLP]Fix vectorization of the alternate cmp instruction with swapped predicates.
If the alternate cmp instruction is a swapped predicate of the main cmp
instruction, need to generate alternate instruction, not the one with
the swapped predicate. Also, the lane with the alternate opcode should
be selected only, if the corresponding operands are not compatible.

Correctness confirmed:
https://alive2.llvm.org/ce/z/94BG66

Differential Revision: https://reviews.llvm.org/D119855
2022-02-18 04:27:45 -08:00
Alexey Bataev
d1cd64ffdd [SLP][NFC]Fix misprint in function name, NFC. 2022-02-17 05:57:51 -08:00
Arthur Eubanks
826fae51d2 [SLPVectorizer][OpaquePtrs] Check GEP source element type
Fixes a miscompile with opaque pointers.

Reviewed By: #opaque-pointers, nikic

Differential Revision: https://reviews.llvm.org/D119980
2022-02-16 14:47:20 -08:00
Philip Reames
2e50760775 [SLP] Add assert that entities are scheduled as expected
Requested in D118538
2022-02-15 12:21:49 -08:00
Anton Afanasyev
b7574b092a [SLP] Don't try to vectorize pair with insertelement
Particularly this breaks vectorization of insertelements where some of
intermediate (i.e. not last) insertelements are used externally.

Fixes PR52275
Fixes #51617

Differential Revision: https://reviews.llvm.org/D119679
2022-02-15 16:12:59 +03:00
Anton Afanasyev
954ea0f044 [SLP] Simplify indices processing for insertelements
Get rid of non-constant and undef indices of insertelements
at `buildTree()` stage. Fix bugs.

Differential Revision: https://reviews.llvm.org/D119623
2022-02-14 14:50:44 +03:00
Florian Hahn
2cd22ce0d0
[LV] Pass start value directly to emitTransformedIndex (NFC). 2022-02-12 19:03:32 +00:00
Anton Afanasyev
cd685f5736 [NFC][SLP] Set default parameter for Offset equal to zero 2022-02-11 17:22:33 +03:00
Philip Reames
5ba115031d [PSE] Remove assumption that top level predicate is union from public interface [NFC*]
Note that this doesn't actually cause the top level predicate to become a non-union just yet.

The * above comes from a case in the LoopVectorizer where a predicate which is later proven no longer blocks vectorization due to a change from checking if predicates exists to whether the predicate is possibly false.
2022-02-10 16:14:52 -08:00
Simon Pilgrim
6af7c1371a [LoopVectorize] getStepVector - reduce scope of local variable. NFC. 2022-02-10 20:44:25 +00:00
David Green
b55d4c2ad8 Revert "[LV] Remove LoopVectorizationCostModel::useEmulatedMaskMemRefHack()"
This reverts commit 77a0da926c9ea86afa9baf28158d79c7678fc6b9 as we've
received multiple reports of this significantly impacting performance,
in ways that don't seem to just be target specific cost models going
wrong. I would offer some reproducers, but the test changes here seem to
be full of them!

Reverting for now and hopefully we can remove the "hack" more carefully
as we go.
2022-02-09 20:02:54 +00:00
Alexey Bataev
370ea1a199 [SLP][NFC]Fix comment, NFC. 2022-02-09 07:14:14 -08:00
Florian Hahn
8aa122081f
[LV] Pass step to emitTransformedIndex (NFC).
Move out the induction step creation from emitTransformedIndex to the
callers. In some places (e.g. widenIntOrFpInduction) the step is already
created. Passing the step in ensures the steps are kept in sync.
2022-02-09 11:12:45 +00:00
Florian Hahn
c9e6678b56
[LV] Move buildScalarSteps out of ILV (NFC).
This makes the function independent of shared state in ILV (ensures no
new dependencies on things like the cost model are introduced) and allows
for use directly in recipe's ::execute functions.
2022-02-08 21:18:40 +00:00
David Green
b4c6d1bb37 [LoopVectorizer] Don't perform interleaving of predicated scalar loops
The vectorizer will choose at times to "vectorize" loops with a scalar
factor (VF=1) with interleaving (IC > 1). This can occasionally produce
better code than the unroller (notable for reductions where it can
produce independent reduction chains that are combined after the loop).
At times this is not very beneficial though, for example when runtime
checks are needed or when the scalar code requires predication.

This addresses the second point, preventing the vectorizer from
interleaving when the scalar loop will require predication. This
prevents it from making a bit of a mess, that is worse than the original
and better left for the unroller to unroll if beneficial. It helps
reverse some of the regressions from D118090.

Differential Revision: https://reviews.llvm.org/D118566
2022-02-07 19:34:28 +00:00
Florian Hahn
5a72357697
[LV] Use IRBuilderBase in VPlan.h, remove IRBuilder.h include (NFC).
By using IRBuilderBase instead of IRBuilder<> a forward declaration can
be used instead of including IRBuilder.h
2022-02-07 17:46:16 +00:00
Roman Lebedev
77a0da926c
[LV] Remove LoopVectorizationCostModel::useEmulatedMaskMemRefHack()
D43208 extracted `useEmulatedMaskMemRefHack()` from legality into cost model.
What it essentially does is prevents scalarized vectorization of masked memory operations:
```
  // TODO: Cost model for emulated masked load/store is completely
  // broken. This hack guides the cost model to use an artificially
  // high enough value to practically disable vectorization with such
  // operations, except where previously deployed legality hack allowed
  // using very low cost values. This is to avoid regressions coming simply
  // from moving "masked load/store" check from legality to cost model.
  // Masked Load/Gather emulation was previously never allowed.
  // Limited number of Masked Store/Scatter emulation was allowed.
```

While i don't really understand about what specifically `is completely broken`
was talking about, i believe that at least on X86 with AVX2-or-later,
this is no longer true. (or at least, i would like to know what is still broken).
So i would like to follow suit after D111460, and like wise disable that hack for AVX2+.

But since this was added for X86 specifically, let's just instead completely remove this hack.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D114779
2022-02-07 16:08:31 +03:00
Djordje Todorovic
afd54e1ed1 [SLPVectorizer] Fix "unused variable" build warning 2022-02-07 10:38:19 +01:00
Kazu Hirata
3a3cb929ab [llvm] Use = default (NFC) 2022-02-06 22:18:35 -08:00
Florian Hahn
541ca12dcd
[LV] Use VPReplicateRecipe::isUniform instead isUniformAfterVec (NFCI).
In scalarizeInstruction(), isUniformAfterVectorization is used to detect
cases where it is sufficient to always access the first lane. This
should map directly checking whether the operand is a uniform replicate
recipe.

Differential Revision: https://reviews.llvm.org/D116654
2022-02-06 16:37:20 +00:00
Benjamin Kramer
ce9417348e [SLP] Skip a DenseSet<unsigned> -> bit vector conversion. NFCI. 2022-02-06 00:57:47 +01:00
Philip Reames
0cc6165d05 [SLP] Strengthen internal asserts about scheduled node state [NFC]
All members of a scheduled bundle must have valid dependencies, with no unscheduled ones, and only the lead element gets marked scheduled.
2022-02-04 12:22:52 -08:00
Philip Reames
f3f8e3da9f [SLP] Remove ScheduleData::UnscheduledDepsInBundle field [NFC-ish]
We can simply compute the value of this field on demand.  Doing so clarifies the behavior when one of the instructions within a bundle doesn't have valid dependencies.  I vaguely thing this could change behavior slightly, but none of the test cases are affected, and my attempts to write one by hand have failed.

This also minorly reduces memory usage, but that's a secondary value at best.
2022-02-04 10:12:09 -08:00
Philip Reames
bb9964ba43 [SLP] Have only ready items in ready list [NFC]
This adds the assertion that all items in the ready list are in-fact scheduleable entities ready to be scheduled.  This involves changing the ReadyInsts structure to be a set, and fixing a couple places where we left nodes on the list when they were no longer ready.
2022-02-03 19:49:24 -08:00
Philip Reames
2cbc92fb11 [SLP] Strengthen internal invariant assertions slightly
This builds on the invariant checks introduced in 1519629, and adds a couple more than seem to hold without additional work.
2022-02-03 14:56:39 -08:00
Philip Reames
1519629a20 [SLP] Add basic self consistency asserts into scheduling
The idea here is to have a verify routine we can call during scheduling to ensure broken invariants are reported.  The intent is to help in debugging scheduling bugs.

At the moment, only the most basic properties are checked as adding several I thought held reported failures.
2022-02-03 13:27:35 -08:00
Philip Reames
6d0c007bc1 [SLP] Fix a typo in comment 2022-02-03 09:11:47 -08:00
Sander de Smalen
eaee477eda [LV] Use VScaleForTuning to allow wider epilogue VFs.
When the main loop is e.g. VF=vscale x 1 and the epilogue VF cannot
be any smaller, the vectorizer should try to estimate how many lanes are
executed at runtime and allow a suitable fixed-width VF to be chosen. It
can use VScaleForTuning to figure out what a suitable fixed-width VF could
be. For the case where the main loop VF is VF=vscale x 1, and VScaleForTuning=8,
it could still choose an epilogue VF upto VF=4.

This was a bit tricky to test, so this patch also introduces a wrapper
function to get 'VScaleForTuning' by also considering vscale_range.
If min and max are equal, then that will be the vscale we compile for.
It makes little sense to tune for a different width if the code
will not be portable for other widths.

Reviewed By: david-arm

Differential Revision: https://reviews.llvm.org/D118709
2022-02-03 15:40:17 +00:00
Alexey Bataev
802ceb8343 [SLP]Excluded external uses from the reordering estimation.
Compiler adds the estimation for the external uses during operands
reordering analysis, which makes it tend to prefer duplicates in the
lanes rather than diamond/shuffled match in the graph. It changes the sizes of
the vector operands and may prevent some vectorization. We don't need
this kind of estimation for the analysis phase, because we just need to
choose the most compatible instruction and it does not matter if it has
external user or used in the non-matching lane. Instead, we count the number
of unique instruction in the lane and see if the reassociation changes
the number of unique scalars to be power of 2 or not. If we have power
of 2 unique scalars in the lane, it is considered more profitable rather
than having non-power-of-2 number of unique scalars.

Metric: SLP.NumVectorInstructions

                          test-suite :: MultiSource/Benchmarks/FreeBench/distray/distray.test   70.00   86.00   22.9%
                             test-suite :: External/SPEC/CFP2017rate/544.nab_r/544.nab_r.test  346.00  353.00    2.0%
                            test-suite :: External/SPEC/CFP2017speed/644.nab_s/644.nab_s.test  346.00  353.00    2.0%
                         test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test  235.00  239.00    1.7%
                  test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test  235.00  239.00    1.7%
                     test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 8723.00 8834.00    1.3%
                                 test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test 1051.00 1064.00    1.2%
                         test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 1628.00 1646.00    1.1%
                          test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 1628.00 1646.00    1.1%
                       test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 9100.00 9184.00    0.9%
                     test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test 3565.00 3577.00    0.3%
                    test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test 3565.00 3577.00    0.3%
                       test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test 4235.00 4245.00    0.2%
                              test-suite :: MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test 1996.00 1998.00    0.1%
                                 test-suite :: MultiSource/Applications/JM/lencod/lencod.test 1671.00 1672.00    0.1%

test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test  783.00  782.00   -0.1%
                      test-suite :: SingleSource/Benchmarks/Misc/oourafft.test   69.00   68.00   -1.4%
        test-suite :: External/SPEC/CINT2017speed/641.leela_s/641.leela_s.test  207.00  192.00   -7.2%
         test-suite :: External/SPEC/CINT2017rate/541.leela_r/541.leela_r.test  207.00  192.00   -7.2%
 test-suite :: External/SPEC/CINT2017rate/531.deepsjeng_r/531.deepsjeng_r.test   89.00   80.00  -10.1%
test-suite :: External/SPEC/CINT2017speed/631.deepsjeng_s/631.deepsjeng_s.test   89.00   80.00  -10.1%
       test-suite :: MultiSource/Benchmarks/mediabench/jpeg/jpeg-6a/cjpeg.test  260.00  215.00  -17.3%
 test-suite :: MultiSource/Benchmarks/MiBench/consumer-jpeg/consumer-jpeg.test  256.00  211.00  -17.6%

MultiSource/Benchmarks/Prolangs-C/TimberWolfMC - pretty the same.
SingleSource/Benchmarks/Misc/oourafft.test - 2 <2 x > loads replaced by
one <4 x> load.
External/SPEC/CINT2017speed/641.leela_s - function gets vectorized and
not inlined anymore.
External/SPEC/CINT2017rate/541.leela_r - same
xternal/SPEC/CINT2017rate/531.deepsjeng_r - changed the order in
multi-block tree, the result is pretty the same.
External/SPEC/CINT2017speed/631.deepsjeng_s - same.
MultiSource/Benchmarks/mediabench/jpeg/jpeg-6a - the result is the same
as before.
MultiSource/Benchmarks/MiBench/consumer-jpeg - same.

Differential Revision: https://reviews.llvm.org/D116688
2022-02-03 06:50:06 -08:00
Alexey Bataev
ad2a0ccf8f [SLP]Alternate vectorization for cmp instructions.
Added support for alternate ops vectorization of the cmp instructions.
It allows to vectorize either cmp instructions with same/swapped
predicate but different (swapped) operands kinds or cmp instructions
with different predicates and compatible operands kinds.

Differential Revision: https://reviews.llvm.org/D115955
2022-02-03 06:24:10 -08:00
Alexey Bataev
8a1dfbc4d8 Revert "[SLP]Alternate vectorization for cmp instructions."
This reverts commit 842a2360a84692f2e4c37cc3e652640e6627d004 to fix the
bugs reported by users in https://reviews.llvm.org/D115955#3291538.
2022-02-02 12:06:36 -08:00
Alexey Bataev
842a2360a8 [SLP]Alternate vectorization for cmp instructions.
Added support for alternate ops vectorization of the cmp instructions.
It allows to vectorize either cmp instructions with same/swapped
predicate but different (swapped) operands kinds or cmp instructions
with different predicates and compatible operands kinds.

Differential Revision: https://reviews.llvm.org/D115955
2022-02-02 10:32:52 -08:00
Benjamin Kramer
0c3d22a592 Revert "[SLP]Alternate vectorization for cmp instructions."
This reverts commit 83620bd2ad867f706c699d0f2b8be10e43d9f3d7.

It's causing miscompilations, see review comments at
https://reviews.llvm.org/D115955
2022-02-02 13:08:51 +01:00
serge-sans-paille
e188aae406 Cleanup header dependencies in LLVMCore
Based on the output of include-what-you-use.

This is a big chunk of changes. It is very likely to break downstream code
unless they took a lot of care in avoiding hidden ehader dependencies, something
the LLVM codebase doesn't do that well :-/

I've tried to summarize the biggest change below:

- llvm/include/llvm-c/Core.h: no longer includes llvm-c/ErrorHandling.h
- llvm/IR/DIBuilder.h no longer includes llvm/IR/DebugInfo.h
- llvm/IR/IRBuilder.h no longer includes llvm/IR/IntrinsicInst.h
- llvm/IR/LLVMRemarkStreamer.h no longer includes llvm/Support/ToolOutputFile.h
- llvm/IR/LegacyPassManager.h no longer include llvm/Pass.h
- llvm/IR/Type.h no longer includes llvm/ADT/SmallPtrSet.h
- llvm/IR/PassManager.h no longer includes llvm/Pass.h nor llvm/Support/Debug.h

And the usual count of preprocessed lines:
$ clang++ -E  -Iinclude -I../llvm/include ../llvm/lib/IR/*.cpp -std=c++14 -fno-rtti -fno-exceptions | wc -l
before: 6400831
after:  6189948

200k lines less to process is no that bad ;-)

Discourse thread on the topic: https://llvm.discourse.group/t/include-what-you-use-include-cleanup

Differential Revision: https://reviews.llvm.org/D118652
2022-02-02 06:54:20 +01:00
Sander de Smalen
2a44eaf20f [LV] Allow a scalable VF for the epilogue.
For some reason we limited the epilogue VF to be fixed-width, but there
is not necessarily a reason for doing so. If the main VF=vscale x 16, the
epilogue VF could be either fixed-width, or a scalable VF upto vscale x 8.

Reviewed By: david-arm

Differential Revision: https://reviews.llvm.org/D118688
2022-02-01 22:38:55 +00:00
Alexey Bataev
83620bd2ad [SLP]Alternate vectorization for cmp instructions.
Added support for alternate ops vectorization of the cmp instructions.
It allows to vectorize either cmp instructions with same/swapped
predicate but different (swapped) operands kinds or cmp instructions
with different predicates and compatible operands kinds.

Differential Revision: https://reviews.llvm.org/D115955
2022-02-01 09:54:20 -08:00
Benjamin Kramer
5281f0dab2 Revert "[SLP]Alternate vectorization for cmp instructions."
This reverts commit afaaecc88c6e5989de8a6a0266610860ef99d9d6.

Crashes when compiling SciPy, test case https://reviews.llvm.org/P8276
2022-02-01 11:40:43 +01:00
Florian Hahn
7fe4fa9a0a
[LV] Use onlyFirstLaneDemanded when widening pointer phis (NFCI).
This removes another instance of recipe execution still relying on
the cost model.

Depends on D116554.

Reviewed By: david-arm

Differential Revision: https://reviews.llvm.org/D116656
2022-02-01 09:50:47 +00:00
Alexey Bataev
afaaecc88c [SLP]Alternate vectorization for cmp instructions.
Added support for alternate ops vectorization of the cmp instructions.
It allows to vectorize either cmp instructions with same/swapped
predicate but different (swapped) operands kinds or cmp instructions
with different predicates and compatible operands kinds.

Differential Revision: https://reviews.llvm.org/D115955
2022-01-31 11:11:25 -08:00