No in-tree targets currently use it in the
preferInLoopReduction/preferPredicatedReductionSelect TTI hooks. It
looks like it used to be used in LoopUtils, at least in
8ca60db40bd944dc5f67e0f200a403b4e03818ea, but I presume it was replaced
by RecurrenceDescriptor.
getLoopEstimatedTripCount returns the trip count based on profiling
data, and its documentation says that it could return 0 when the trip
count is zero, but this is not the case: a valid trip count can never be
zero, and it returns 0 when the unsigned ExitCount is incremented by 1
and wraps. Some callers are careful about checking for this zero value
in an std::optional, but it makes for an API with footguns, as a
std::optional return value indicates that a non-nullopt value would be a
valid trip count. Fix this by explicitly returning std::nullopt when the
return value would wrap, and strip additional checks in callers. This
also fixes a minor bug in LoopVectorize.
This patch converts the llvm.vector.splice intrinsic to
llvm.experimental.vp.splice, ensuring that fixed-order recurrences
execute correctly when tail folding by EVL is enable.
Due to the non-VFxUF penultimate EVL issue, the EVL from the previous
iteration will be preserved and used in llvm.experimental.vp.splice.
Functions marked with minsize should aim for minimum code size, so the
vectorizer should use CodeSize for the cost kind and also the cost we
compare should be the cost for the entire loop: it shouldn't be divided
by the number of vector elements and block costs shouldn't be divided by
the block probability.
Possibly we should also be doing this for optsize as well, but there are
a lot of tests that assume the current behaviour and the definition of
optsize is less clear than minsize (for minsize the goal is to "keep the
code size of this function as small as possible" whereas for optsize
it's "keep the code size of this function low").
Add a new VPInstruction::Broadcast opcode and use it to materialize
explicit broadcasts of live-ins. The initial patch only materlizes the
broadcasts if the vector preheader dominates all uses that need it.
Later patches will pick the best valid insert point, thus retiring
implicit hoisting of broadcasts from VPTransformsState::get().
PR: https://github.com/llvm/llvm-project/pull/124644
Constract immutable VPIRBasicBlocks for all exit blocks up front and
keep a list of them. Same as the scalar header, they are leaf nodes of
the VPlan and won't change. Some exit blocks may be unreachable, e.g. if
the scalar epilogue always executes or depending on optimizations.
This simplifies both the way we retrieve the exit blocks as well as
hooking up the exit blocks.
PR: https://github.com/llvm/llvm-project/pull/128374
As the name of the function suggests, convertPointerToIntegerType should
return an IntegerType instead of a Type, and should only ever be called
with integer or ptr type. Fix the callers getWiderType, and
addInductionPhi to narrow the type of WidestIndTy to IntegerType,
stripping unclear casts. While at it, rename convertPointerToIntegerType
and getWiderType for clarity.
Use HCFGBuilder to build an initial VPlan 0, which wraps all input
instructions in VPInstructions and update tryToBuildVPlanWithVPRecipes
to replace the VPInstructions with widened recipes.
At the moment, widened recipes are created based on the underlying
instruction of the VPInstruction. Masks are also still created based on
the input IR basic blocks and the loop CFG is flattened in the main loop
processing the VPInstructions.
This patch also incldues support for Switch instructions in HCFGBuilder
using just a VPInstruction with Instruction::Switch opcode.
There are multiple follow-ups planned:
* Perform predication on the VPlan directly,
* Unify code constructing VPlan 0 to be shared by both inner and outer
loop code paths.
* Construct VPlan 0 once, clone subsequent ones for VFs
PR: https://github.com/llvm/llvm-project/pull/124432
This patch adds initial support for vectorizing literal struct return
values. Currently, this is limited to the case where the struct is
homogeneous (all elements have the same type) and not packed. The users
of the call also must all be `extractvalue` instructions.
The intended use case for this is vectorizing intrinsics such as:
```
declare { float, float } @llvm.sincos.f32(float %x)
```
Mapping them to structure-returning library calls such as:
```
declare { <4 x float>, <4 x float> } @Sleef_sincosf4_u10advsimd(<4 x float>)
```
Or their widened form (such as `@llvm.sincos.v4f32` in this case).
Implementing this required two main changes:
1. Supporting widening `extractvalue`
2. Adding support for vectorized struct types in LV
* This is mostly limited to parts of the cost model and scalarization
Since the supported use case is narrow, the required changes are
relatively small.
Create a IR BB directly for the middle.block, instead of creating the IR
BB during skeleton creation and then replacing the middle VPBB with a
VPIRBB.
This moves another part of skeleton creation to VPlan and simplififes
the code slightly by removing code to disconnect the middle block and
vector preheader + the corresponding DT update.
NFC modulo IR block naming and block creation order, which changes the
IR names for the blocks.
We call collectInLoopReductions in multiple places asking
the same question with exactly the same answer. For
example, this was being called from a loop in
calculateRegisterUsage and this patch hoists the call out
to above the loop. In addition I've changed
collectInLoopReductions so that it bails out if we've
already built up a list.
`forgetLcssaPhiWithNewPredecessor` performs additional invalidation if
there is an existing SCEV for the phi, but earlier
`forgetBlockAndLoopDispositions` or `forgetLoop` may already invalidate
the SCEV for the phi.
Change the order to first call `forgetLcssaPhiWithNewPredecessor` to
ensure it runs before its SCEV gets invalidated too eagerly.
Fixes https://github.com/llvm/llvm-project/issues/119665.
PR: https://github.com/llvm/llvm-project/pull/119897
This patch relands the changes from "[LV]: Teach LV to recursively
(de)interleave.#122989"
Reason for revert:
- The patch exposed an assert in the vectorizer related to VF difference
between
legacy cost model and VPlan-based cost model because of uncalculated
cost for
VPInstruction which is created by VPlanTransforms as a replacement to
'or disjoint'
instruction.
VPlanTransforms do that instructions change when there are memory
interleaving and
predicated blocks, but that change didn't cause problems because at most
cases the cost
difference between legacy/new models is not noticeable.
- Issue is fixed by #125434
Original patch: https://github.com/llvm/llvm-project/pull/89018
Reviewed-by: paulwalker-arm, Mel-Chen
Follow-up as discussed when using VPInstruction::ResumePhi for all resume
values (#112147). This patch explicitly adds incoming values for each
predecessor in VPlan. This simplifies codegen and allows transformations
adjusting the predecessors of blocks with
NFC modulo incoming block order in phis.
Run recipe simplification and dead recipe removal after VPlan-based
unrolling and optimizeForVFAndUF, to clean up any redundant or dead
recipes introduced by them. Currently this is NFC, as it removes the
corresponding removeDeadRecipes run in optimizeForVFAndUF and no
additional simplifications kick in after unrolling yet. That is changing
with https://github.com/llvm/llvm-project/pull/123655.
Note that with this change, pattern-matching is now applied after
EVL-based recipes have been introduced.
Trying to match VPWidenEVLRecipe when not explicitly requested might
apply a pattern with 2 operands to one with 3 due to the extra EVL
operand and VPWidenEVLRecipe being a subclass of VPWidenRecipe.
To prevent this, update Recipe_match::match to only match
VPWidenEVLRecipe if it is in the requested recipe types (RecipeTy).
PR: https://github.com/llvm/llvm-project/pull/125926
Consistently use hasScalarVFOnly instead of using
hasVF(ElementCount::getFixed(1)). Also add an assert to ensure all cases
are covered by hasScalarVFOnly.
The legacy and vplan cost models did not agree because
VPWidenCallRecipe::computeCost only calculates the cost of the
call instruction, whereas
LoopVectorizationCostModel::setVectorizedCallDecision in some
cases adds on the cost of a synthesised mask argument. However,
this mask is always 'splat(i1 true)' which should be hoisted out
of the loop during codegen. In order to synchronise the two cost
models I have two options:
1) Also add the cost of the splat to the vplan model, or
2) Remove the cost of the splat from the legacy model.
I chose 2) because I feel this more closely represents what the
final code will look like. There is an argument that we should
take account of such broadcast costs in the preheader when
deciding if it's profitable to vectorise a loop, however there
isn't currently a mechanism to do this. We currently only take
account of the runtime checks when assessing profitability and
what the minimum trip count should be. However, I don't believe
this work needs doing as part of this PR.
The current legality check for folding with EVL has incomplete
verification for VF.
This patch fixes the VF check, ensuring that tail folding with EVL is
enabled only when a scalable VF is available. This allows loops that
prefer tail folding with EVL but cannot use scalable VF vectorization to
still be vectorized using a fixed VF, rather than abandoning
vectorization entirely.
This patch adds an initial implementation of
VPInstruction::computeCost with support for only one
instruction so far - VPInstruction::AnyOf. This is only
used when vectorising loops with uncountable early exits.
The plans with scalar VF should not be transformed the plans folded by
EVL.
TODO: Move the scalar VF checking into `LoopVectorizationCostModel
::foldTailWithEVL()`.
We currently call getVScaleForTuning in many places, doing a
lot of work asking the same question with the same answer.
I've refactored the code to cache the value if the max
scalable VF != 0 and pull out the cached value from
LoopVectorizationCostModel.
Nothing in VPlan.h directly depends on VPTransformState, VPCostContext,
VPFRange, VPlanPrinter or VPSlotTracker. Move them out to a separate
header to reduce the size of widely used VPlan.h.
This is a first step towards more cleanly separating declarations in
VPlan.
Besides reducing VPlan.h's size, this also allows including additional
VPlan-related headers in VPlanHelpers.h for use there. An example is
using VPDominatorTree in VPTransformState
(https://github.com/llvm/llvm-project/pull/117138).
PR: https://github.com/llvm/llvm-project/pull/124104
This work feeds part of PR
https://github.com/llvm/llvm-project/pull/88385, and adds support for
vectorising
loops with uncountable early exits and outside users of loop-defined
variables. When calculating the final value from an uncountable early
exit we need to calculate the vector lane that triggered the exit,
and hence determine the value at the point we exited.
All code for calculating the last value when exiting the loop early
now lives in a new vector.early.exit block, which sits between the
middle.split block and the original exit block. Doing this required
two fixes:
1. The vplan verifier incorrectly assumed that the block containing
a definition always dominates the block of the user. That's not true
if you can arrive at the use block from multiple incoming blocks.
This is possible for early exit loops where both the early exit and
the latch jump to the same block.
2. We were adding the new vector.early.exit to the wrong parent loop.
It needs to have the same parent as the actual early exit block from
the original loop.
I've added a new ExtractFirstActive VPInstruction that extracts the
first active lane of a vector, i.e. the lane of the vector predicate
that triggered the exit.
NOTE: The IR generated for dealing with live-outs from early exit
loops is unoptimised, as opposed to normal loops. This inevitably
leads to poor quality code, but this can be fixed up later.
Add new runPass helpers to run a VPlan transformation. This makes it
easier to add additional checks/functionality for each transform run. In
this patch, an option is added to run the verifier after each VPlan
transform.
Follow-ups will use the same helper to also support printing VPlans
after each transform.
Note that the verifier at the moment requires there to be a canonical IV
and vector loop region, so the final lowering transforms aren't run via
runPass yet.
PR: https://github.com/llvm/llvm-project/pull/123640
Change `getScaledReduction` to take an existing vector, rather than
creating and returning a new one each call.
Rename `getScaledReduction` to `getScaledReductions` to more accurately
reflect what it's now doing.
---------
Co-authored-by: Karlo Basioli <68535415+basioli-k@users.noreply.github.com>
Live-ins don't need to be handled, other than adding to the exit phi
recipe. Do that early and assert that otherwise the exit value is
defined in the vector loop region.
This should enable simply skipping other exit values that do not need
further fixing, e.g. if handling the exit value from the early exit
directly in handleUncountableEarlyExit.
PR: https://github.com/llvm/llvm-project/pull/123819
Update HCFG builder to preserve the original latch block of the initial
VPlan, ensuring there is always a latch.
It also skips creating the BranchOnCond for the latch of the top-level
loop, instead of removing it later. Exiting via the latch is controlled
by later recipes.
This further unifies HCFG construction and prepares for use to also
build an initial VPlan (VPlan0) for inner loops.
As part of the "RemoveDIs" project, BasicBlock::iterator now carries a
debug-info bit that's needed when getFirstNonPHI and similar feed into
instruction insertion positions. Call-sites where that's necessary were
updated a year ago; but to ensure some type safety however, we'd like to
have all calls to getFirstNonPHI use the iterator-returning version.
This patch changes a bunch of call-sites calling getFirstNonPHI to use
getFirstNonPHIIt, which returns an iterator. All these call sites are
where it's obviously safe to fetch the iterator then dereference it. A
follow-up patch will contain less-obviously-safe changes.
We'll eventually deprecate and remove the instruction-pointer
getFirstNonPHI, but not before adding concise documentation of what
considerations are needed (very few).
---------
Co-authored-by: Stephen Tozer <Melamoto@gmail.com>