1526 Commits

Author SHA1 Message Date
Florian Hahn
3f2fb767e3
[VPlan] Make IV operand explicit for VPWidenCanonicalIVRecipe (NFC).
This makes the def-use relationship between VPCanonicalIVPHIRecipe and
VPWidenCanonicalIVRecipe explicit. Needed for D117140.
2022-01-13 11:13:05 +00:00
Florian Hahn
7ce48be0fd
[LV] Inline CreateSplatIV call for scalar VFs (NFC).
This is a NFC change split off from D116123, as suggested there.
D116123 will remove the last user of CreateSplatIV.
2022-01-13 09:34:31 +00:00
Florian Hahn
d4a8fc3a87
[VPlan] Introduce and use BranchOnCount VPInstruction.
This patch adds a new BranchOnCount VPInstruction opcode with 2
operands. It first compares its 2 operands (increment of canonical
induction and vector trip count), followed by a branch to either the
exit block or back to the vector header.

It must be the last recipe in the exit block of the topmost vector loop
region.

This extracts parts from D113224 and was discussed in D113223.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D116479
2022-01-12 13:42:13 +00:00
Rosie Sumpter
552eb372cb [LoopVectorize] Pass a vector type to isLegalMaskedGather/Scatter
This is required to query the legality more precisely in the LoopVectorizer.

This adds another TTI function named 'forceScalarizeMaskedGather/Scatter'
function to work around the hack introduced for MVE, where
isLegalMaskedGather/Scatter would return an answer by second-guessing
where the function was called from, based on the Type passed in (vector
vs scalar). The new interface makes this explicit. It is also used by
X86 to check for vector widths where gather/scatters aren't profitable
(or don't exist) for certain subtargets.

Differential Revision: https://reviews.llvm.org/D115329
2022-01-12 13:34:12 +00:00
David Sherwood
e3c84fb948 [LoopVectorize] Add support for tail folding using scalable vectors
This patch fixes up an issue with InnerLoopVectorizer::getOrCreateVectorTripCount
whereby we weren't correctly generating the runtime trip count
for scalable vectors when tail-folding.

It also removes some asserts in the tail-folding path for cases when
the VF is not scalable.

In this patch I have only permitted tail-folding to be enabled
explicitly for scalable vectors when the user has specified one
of the following flags:

  -prefer-predicate-over-epilogue=predicate-dont-vectorize
  -prefer-predicate-over-epilogue=predicate-else-scalar-epilogue

For now it's best not to enable tail-folding with scalable vectors for
low trip counts or when optimising for code size, since there has been
no analysis on whether this is worth it.

Various tests have been added here:

  Transforms/LoopVectorize/AArch64/sve-tail-folding.ll
  Transforms/LoopVectorize/AArch64/sve-tail-folding-forced.ll

The tests cannot be target independent because they require masked
load/store support, i.e. TTI.isLegalMaskedLoad and TTI.isLegalMaskedStore
need to return true.

Differential Revision: https://reviews.llvm.org/D113003
2022-01-10 10:55:40 +00:00
Florian Hahn
1ce01b7dfe
[SCEVExpander] Simplify cleanup, skip sorting by dominance.
There is no need to sort inserted instructions by dominance, as the
deletion loop still requires RAUW with undef before deleting. Removing
instructions in reverse insertion order should still insure that the
number of uselist updates is kept to a minimum.
2022-01-09 18:38:41 +00:00
Sander de Smalen
9cbe000df2 [LV] Load/store/reduction type must be sized, assert it.
This addresses a suggestion by @nikic on D115356.
2022-01-06 12:35:27 +00:00
Florian Hahn
2ee8154816
[LV] Don't use getVPSingleValue for VPWidenMemoryInstRecipe (NFC).
VPWidenMemoryInstructionRecipe is a VPValue, so this can be passed
directly, instead of relying on getVPSingleValue.
2022-01-05 13:51:50 +00:00
Sander de Smalen
95a93722db [LV] Remove what seems like stale code in collectElementTypesForWidening.
This was originally added in rG22174f5d5af1eb15b376c6d49e7925cbb7cca6be
although that patch doesn't really mention any reasons for ignoring the
pointer type in this calculation if the memory access isn't consecutive.

Reviewed By: david-arm

Differential Revision: https://reviews.llvm.org/D115356
2022-01-05 12:20:59 +00:00
Florian Hahn
65c4d6191f
[VPlan] Add VPCanonicalIVPHIRecipe, partly retire createInductionVariable.
At the moment, the primary induction variable for the vector loop is
created as part of the skeleton creation. This is tied to creating the
vector loop latch outside of VPlan. This prevents from modeling the
*whole* vector loop in VPlan, which in turn is required to model
preheader and exit blocks in VPlan as well.

This patch introduces a new recipe VPCanonicalIVPHIRecipe to represent the
primary IV in VPlan and CanonicalIVIncrement{NUW} opcodes for
VPInstruction to model the increment.

This allows us to partly retire createInductionVariable. At the moment,
a bit of patching up is done after executing all blocks in the plan.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D113223
2022-01-05 10:46:06 +00:00
Rosie Sumpter
961f51fdf0 [LoopVectorize][CostModel] Choose smaller VFs for in-loop reductions without loads/stores
For loops that contain in-loop reductions but no loads or stores, large
VFs are chosen because LoopVectorizationCostModel::getSmallestAndWidestTypes
has no element types to check through and so returns the default widths
(-1U for the smallest and 8 for the widest). This results in the widest
VF being chosen for the following example,

float s = 0;
for (int i = 0; i < N; ++i)
  s += (float) i*i;

which, for more computationally intensive loops, leads to large loop
sizes when the operations end up being scalarized.

In this patch, for the case where ElementTypesInLoop is empty, the widest
type is determined by finding the smallest type used by recurrences in
the loop instead of falling back to a default value of 8 bits. This
results in the cost model choosing a more sensible VF for loops like
the one above.

Differential Revision: https://reviews.llvm.org/D113973
2022-01-04 10:12:57 +00:00
Florian Hahn
791523bae6
[LV] Set loop metadata after VPlan execution (NFC).
Setting the loop metadata for the vector loop after VPlan execution
allows generating the full loop body during VPlan execution. This is in
preparation for D113224.
2022-01-03 09:59:50 +00:00
Florian Hahn
6e0a333f71
[LV] Use Builder.CreateVectorReverse directly. (NFC)
IRBuilder::CreateVectorReverse already handles all cases required by
LoopVectorize. It can be used directly instead of reverseVector.
2022-01-02 19:09:30 +00:00
Florian Hahn
b1a333f0fe
[VPlan] Don't consider VPWidenCanonicalIVRecipe phi-like.
VPWidenCanonicalIVRecipe does not create PHI instructions, so it does
not need to be placed in the phi section of a VPBasicBlock.

Also tidies the code so the WidenCanonicalIV recipe and the
compare/lane-masks are created in the header.

Discussed D113223.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D116473
2022-01-02 12:48:17 +00:00
Florian Hahn
7305798049
[VPlan] Remove VPWidenPHIRecipe constructor without start value (NFC).
This was suggested as a separate cleanup in recent reviews.
2022-01-01 13:53:48 +00:00
Florian Hahn
e2f1c4c706
[LV] Turn check for unexpected VF into assertion (NFC).
VF should always be non-zero in widenIntOrFpInduction. Turn check into
assertion.
2021-12-31 13:19:03 +00:00
Florian Hahn
ba9016a030
[LV] Replace redundant tail-fold check with assert (NFC).
The code path can only be reached when folding the tail, so turn the
check into an assertion.
2021-12-29 19:00:41 +01:00
Florian Hahn
9d297c7894
[VPlan] Add prepareToExecute to set up live-ins (NFC).
This patch adds a new prepareToExecute helper to set up live-ins, so
VPTransformState doesn't need to hold values like TripCount.

This also requires making the trip count operand for ActiveLaneMask
explicit in VPlan.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D116320
2021-12-28 17:49:47 +01:00
Florian Hahn
c2275278c6
[VPlan] Add abstract base class for header phi recipes (NFC).
Not all header phis widen the phi, e.g. like the new
VPCanonicalIVPHIRecipe in D113223. To let those recipes also inherit
from a phi-like base class, add a more generic VPHeaderPHIRecipe
abstract base class.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D116304
2021-12-28 15:37:47 +01:00
Florian Hahn
c66286ed59
[LV] Use specific first-order recurrence recipe as arg type (NFC).
Required for further refactoring in D116304.
2021-12-28 10:58:21 +01:00
Florian Hahn
2e630eabd3
[LV] Sink BTC creation to actual use (NFC).
Suggested separately in D116123.
2021-12-27 11:25:46 +01:00
Florian Hahn
511726c64d
[LV] Move getStepVector out of ILV (NFC).
First step to split up induction handling and move it outside ILV.
Used in D116123 and following.
2021-12-26 21:17:26 +01:00
Kazu Hirata
76f0f1cc5c Use {DenseSet,SetVector,SmallPtrSet}::contains (NFC) 2021-12-24 21:43:06 -08:00
Florian Hahn
ede7c2438f
[VPlan] Create header & latch blocks for skeleton up front (NFC).
By creating the header and latch blocks up front and adding blocks and
recipes in between those 2 blocks we ensure that the entry and exits of
the plan remain valid throughout construction.

In order to avoid test changes and keep printing of the plans the same,
we use the new header block instead of creating a new block on the first
iteration of the loop traversing the original loop.

We also fold the latch into its predecessor.

This is a follow up to a post-commit suggestion in D114586.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D115793
2021-12-22 12:44:25 +00:00
Florian Hahn
c83ef407df
[LV] Adjust comment to say the induction is created in header.
Follow-up suggested post-commit for 1a54889f48fa.
2021-12-22 11:56:40 +00:00
Florian Hahn
1a54889f48
[LV] Ensure WidenCanonicalIVRecipe is always created in header (NFC).
The VPWidenCanonicalIVRecipe must always be created in the phi section
of the header block. Use that block as insert point.
2021-12-21 15:14:48 +00:00
Sander de Smalen
b1ff20fd35 [LV] Enable scalable vectorization by default for SVE cores.
The availability of SVE should be sufficient to enable scalable
auto-vectorization.

This patch adds a new TTI interface to query the target what style of
vectorization it wants when scalable vectors are available. For other
targets than AArch64, this currently defaults to 'FixedWidthOnly'.

Differential Revision: https://reviews.llvm.org/D115651
2021-12-20 16:23:29 +00:00
Florian Hahn
5b362e4c7f
[VPlan] Add Debugloc to VPInstruction.
Upcoming changes require attaching debug locations to VPInstructions,
e.g. adding induction increment recipes in D113223.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D115123
2021-12-20 15:10:41 +00:00
Florian Hahn
564d109b35
[LV] Pass VectorHeader block to emitTransformedIndex (NFC).
Pass in the vector header instead of relying on ILV::LoopVectorBody.
This reduces the dependence on state from ILV. Where VPTransformState is
available, State.CFG.PrevBB can be used.
2021-12-17 10:11:16 +00:00
Evgeniy Brevnov
7002125cff [LV][NFC] Fix debug message to print out resulting clamped VF 2021-12-13 18:54:05 +07:00
Evgeniy Brevnov
2025e0985c [LV] Make sure VF doesn't exceed compile time known TC
For the simple copy loop (see test case) vectorizer selects VF equal to 32 while the loop is known to have 17 iterations only. Such behavior makes no sense to me since such vector loop will never be executed. The only case we may want to select VF large than TC is masked vectoriztion. So I haven't touched that case.

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D114528
2021-12-13 13:48:46 +07:00
Florian Hahn
b6a2ddb6c8
[LV] Use info from State in some helper functions (NFC).
This updates several helper functions to use information provided by
VPTransformState instead of ILV directly, to help with the transition
out of ILV.
2021-12-12 20:48:38 +00:00
David Green
fed3041863 [LV][ARM] Improve reduction costmodel for mismatching extension types.
Given a MLA reduction from two different types (say i8 and i16), we were
previously failing to find the reduction pattern, often making us chose
the lower vector factor. This improves that by using the largest of the
two extension types, allowing us to use the larger VF as the type of the
reduction.

As per https://godbolt.org/z/KP549EEYM the backend handles this
valiantly, leading to better performance.

Differential Revision: https://reviews.llvm.org/D115432
2021-12-10 15:40:58 +00:00
Florian Hahn
505ad03c7d
[LV] Remove redundant IV casts using VPlan (NFCI).
This patch simplifies handling of redundant induction casts, by
removing dead cast instructions after initial VPlan construction.
This has the following benefits:

  1. fixes a crash
     (see @test_optimized_cast_induction_feeding_first_order_recurrence)
  2. Simplifies VPWidenIntOrFpInduction to a single-def recipes
  3. Retires recordVectorLoopValueForInductionCast.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D115112
2021-12-10 13:57:03 +00:00
Florian Hahn
acea6e9cfa
[Passes] Only run extra vector passes if loops have been vectorized.
This patch uses a similar trick as in D113947 to only run the extra
passes after vectorization on functions where loops have been
vectorized.

The reason for running the 'extra vector passes' is
simplification/unswitching of the runtime checks created by LV, there
should be no need to run them if nothing got vectorized

To do that, a new dummy analysis ShouldRunExtraVectorPasses has been
added. If loops have been vectorized for a function, LV will cache the
analysis. At the moment it uses MadeCFGChanges as proxy for loop
vectorized, which isn't perfect (it could be too aggressive, e.g.
because no runtime checks have been added), but should be good enough
for now.

The extra passes are now managed by a new FunctionPassManager that
runs its passes only if ShouldRunExtraVectorPasses has been cached.

Without this patch, `-extra-vectorizer-passes` has the following
compile-time impact:

NewPM-O3: +4.86%
NewPM-ReleaseThinLTO: +3.56%
NewPM-ReleaseLTO-g: +7.17%

http://llvm-compile-time-tracker.com/compare.php?from=ead3979a92fc33add4710c4510d6906260dcb4ad&to=c292da649e2c6e88a31e702fdc474727d09c72bc&stat=instructions

With this patch, that gets reduced to

NewPM-O3: +1.43%
NewPM-ReleaseThinLTO: +1.00%
NewPM-ReleaseLTO-g: +1.58%

http://llvm-compile-time-tracker.com/compare.php?from=ead3979a92fc33add4710c4510d6906260dcb4ad&to=e67d86b57810011cf285eb9aa1944781be6096f0&stat=instructions

It is probably still too high to enable by default, but much better.

Reviewed By: aeubanks

Differential Revision: https://reviews.llvm.org/D115052
2021-12-10 11:42:45 +00:00
Florian Hahn
978883d254
[VPlan] Add InductionDescriptor to VPWidenIntOrFpInduction. (NFC)
This allows easier access to the induction descriptor from VPlan,
without needing to go through Legal. VPReductionPHIRecipe already
contains a RecurrenceDescriptor in a similar fashion.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D115111
2021-12-10 09:55:09 +00:00
Florian Hahn
d74a8a78ad
[LV] Mark various functions as const (NFC).
Make sure various accessors do not modify any state, in preparation for
D115111.
2021-12-09 10:51:29 +00:00
Florian Hahn
e9a2944495
[VPlan] Verify plan entry and exit blocks, set correct exit block.
Both the entry and exit blocks of the top-region of a plan must be
VPBasicBlocks. They also must have no predecessors or successors
respectively.

This invariant was broken when splitting a block for sink-after. To fix
the issue, set the exit block of the region *after* sink-after is done.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D114586
2021-12-07 16:26:31 +00:00
Cullen Rhodes
0395e01583 [IR] Split vscale_range interface
Interface is split from:

  std::pair<unsigned, unsigned> getVScaleRangeArgs()

into separate functions for min/max:

  unsigned getVScaleRangeMin();
  Optional<unsigned> getVScaleRangeMax();

Reviewed By: sdesmalen, paulwalker-arm

Differential Revision: https://reviews.llvm.org/D114075
2021-12-07 10:38:26 +00:00
Florian Hahn
07276e49e3
[LV] Check VPValue operand instead of Cost::isUniformAfterVec (NFC).
ILV::scalarizeInstruction still uses the original IR operands to check
if an input value is uniform after vectorization.

There is no need to go back to the cost model to figure that out, as the
information is already explicit in the VPlan. Just check directly
whether the VPValue is defined outside the plan or is a uniform
VPReplicateRecipe.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D114253
2021-12-06 18:32:35 +00:00
Sander de Smalen
3d549dddf7 [LV] Pass compare predicate to getCmpSelInstrCost.
If the condition of a select is a compare, pass its predicate to
TTI::getCmpSelInstrCost to get a more accurate cost value instead
of passing BAD_ICMP_PREDICATE.

I noticed that the commit message from D90070 had a comment about the
vectorized select predicate possibly being composed of other compares with
different predicate values, but I wasn't able to construct an example
where this was an actual issue. If this is an issue, I guess we could
add another check that the block isn't predicated for any reason.

Reviewed By: dmgreen, fhahn

Differential Revision: https://reviews.llvm.org/D114646
2021-12-06 11:41:27 +00:00
Florian Hahn
7c3c352d82
[VPlan] Separate ctors for VPWidenIntOrFpInduction. (NFC)
VPWidenIntOrFpInductionRecipes can either be constructed with a PHI and
an optional cast or a PHI and a trunc instruction. Reflect this in 2
separate constructors. This also simplifies a follow-up change.
2021-12-05 12:15:18 +00:00
Florian Hahn
e44298a8f8
[LV] Move code from vectorizeMemoryInstruction to recipe's execute().
The code in widenMemoryInstruction has already been transitioned
to only rely on information provided by VPWidenMemoryInstructionRecipe
directly.

Moving the code directly to VPWidenMemoryInstructionRecipe::execute
completes the transition for the recipe.

It provides the following advantages:

1. Less indirection, easier to see what's going on.
2. Removes accesses to fields of ILV.

2) in particular ensures that no dependencies on
fields in ILV for vector code generation are re-introduced.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D114324
2021-12-01 14:56:51 +00:00
Philip Reames
c41b318423 [LV] Remove unneeded cast to Operator [NFC] 2021-11-30 08:45:13 -08:00
Florian Hahn
dab776dd0f
[LV] Move code from widenSelectInstruction to VPWidenSelectRecipe. (NFC)
The code in widenSelectInstruction has already been transitioned
to only rely on information provided by VPWidenSelectRecipe directly.

Moving the code directly to VPWidenSelectRecipe::execute completes
the transition for the recipe.

It provides the following advantages:

1. Less indirection, easier to see what's going on.
2. Removes accesses to fields of ILV.

2) in particular ensures that no dependencies on
fields in ILV for vector code generation are re-introduced.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D114323
2021-11-30 10:32:44 +00:00
Roman Lebedev
8cd782487f
[X86][LoopVectorize] "Fix" X86TTIImpl::getAddressComputationCost()
We ask `TTI.getAddressComputationCost()` about the cost of computing vector address,
and then multiply it by the vector width. This doesn't make any sense,
it implies that we'd do a vector GEP and then scalarize the vector of pointers,
but there is no such thing in the vectorized IR, we perform scalar GEP's.

This is *especially* bad on X86, and was effectively prohibiting any scalarized
vectorization of gathers/scatters, because `X86TTIImpl::getAddressComputationCost()`
says that cost of vector address computation is `10` as compared to `1` for scalar.

The computed costs are similar to the ones with D111222+D111220,
but we end up without masked memory intrinsics that we'd then have to
expand later on, without much luck. (D111363)

Differential Revision: https://reviews.llvm.org/D111460
2021-11-30 10:47:56 +03:00
Florian Hahn
fd71159f64
[LV] Move code from widenInstruction to VPWidenRecipe. (NFC)
The code in widenInstruction has already been transitioned to
only rely on information provided by VPWidenRecipe directly.

Moving the code directly to VPWidenRecipe::execute completes
the transition for the recipe.

It provides the following advantages:

1. Less indirection, easier to see what's going on.
2. Removes accesses to fields of ILV.

2) in particular ensures that no dependencies on
fields in ILV for vector code generation are re-introduced.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D114322
2021-11-29 09:09:00 +00:00
Florian Hahn
3495090b9b
[LV] Move code from widenGEP to VPWidenGEPRecipe (NFC).
The code in widenGEP has already been transitioned to only rely on
information provided by VPWidenGEPRecipe directly.

Moving the code directly to VPWidenGEPRecipe::execute completes
the transition for the recipe.

It provides the following advantages:

1. Less indirection, easier to see what's going on.
2. Removes accesses to fields of ILV.

2) in particular ensures that no dependencies on
fields in ILV for GEP code generation are re-introduced.

Reviewed By: Ayal

Differential Revision: https://reviews.llvm.org/D114321
2021-11-28 18:29:18 +00:00
Sander de Smalen
28a4deab92 [LV] Fix incorrectly marking a pointer indvar as 'scalar'.
collectLoopScalars should only add non-uniform nodes to the list if they
are used by a load/store instruction that is marked as CM_Scalarize.

Before this patch, the LV incorrectly marked pointer induction variables
as 'scalar' when they required to be widened by something else,
such as a compare instruction, and weren't used by a node marked as
'CM_Scalarize'. This case is covered by sve-widen-phi.ll.

This change also allows removing some code where the LV tried to
widen the PHI nodes with a stepvector, even though it was marked as
'scalarAfterVectorization'. Now that this code is more careful about
marking instructions that need widening as 'scalar', this code has
become redundant.

Differential Revision: https://reviews.llvm.org/D114373
2021-11-28 09:49:28 +00:00
David Sherwood
e20391fc5d [LoopVectorize] When tail-folding, don't always predicate uniform loads
In VPRecipeBuilder::handleReplication if we believe the instruction
is predicated we then proceed to create new VP region blocks even
when the load is uniform and only predicated due to tail-folding.

I have updated isPredicatedInst to avoid treating a uniform load as
predicated when tail-folding, which means we can do a single scalar
load and a vector splat of the value.

Tests added here:

  Transforms/LoopVectorize/AArch64/tail-fold-uniform-memops.ll

Differential Revision: https://reviews.llvm.org/D112552
2021-11-26 11:30:54 +00:00