Refactors VPVectorPointerRecipe to use the VF VPValue to obtain the
runtime VF, similar to #95305.
Since only reverse vector pointers require the runtime VF, the patch
sets VPUnrollPart::PartOpIndex to 1 for vector pointers and 2 for
reverse vector pointers. As a result, the generation of reverse vector
pointers is moved into a separate recipe.
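For illustration, a simplified IRBuilder-style sketch of the address
arithmetic a reverse vector pointer performs (variable names are assumed,
not the recipe's actual code); the runtime VF appears in both offsets,
which is why only the reverse case needs it:
  // Assumed in scope: Builder, Ptr, ElemTy, IdxTy, Part and MinVF.
  Value *RuntimeVF = Builder.CreateVScale(ConstantInt::get(IdxTy, MinVF));
  // Step back over Part whole vectors ...
  Value *NumElt = Builder.CreateMul(
      ConstantInt::getSigned(IdxTy, -(int64_t)Part), RuntimeVF);
  // ... then down to the lowest-addressed lane of the reversed vector.
  Value *LastLane = Builder.CreateSub(ConstantInt::get(IdxTy, 1), RuntimeVF);
  Value *Addr = Builder.CreateGEP(ElemTy, Ptr, NumElt);
  Addr = Builder.CreateGEP(ElemTy, Addr, LastLane);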
Induction users only need to be updated when vectorizing the epilogue.
Avoid running fixupIVUsers when vectorizing the main loop during
epilogue vectorization.
Previously the legacy cost model would pick the type for the cost
computation depending on the order of the members in the input IR.
This is incompatible with the VPlan-based cost model (independent of
original IR order) and also doesn't match code-gen, which uses the type
of the insert position.
Update the legacy cost model to use the type (and address space) from
the Group's insert position.
This brings the legacy cost model in line with the VPlan-based cost
model and fixes a divergence between the two models.
Note that the X86 cost model seems to assign different costs to groups
with i64 and double types. Added a TODO to check.
Fixes https://github.com/llvm/llvm-project/issues/112922.
Use SCEV to check if the minimum iteration check (TC < Step) is known to
be false.
This is a first step towards addressing
https://github.com/llvm/llvm-project/issues/111098. To catch the exact
case from the issue, we need to do extra work to make sure the wrap
flags on the shl are preserved and used by SCEV.
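A minimal sketch of the idea, assuming a PredicatedScalarEvolution PSE
plus SCEVs for the trip count (TCSCEV) and the step (StepSCEV) are at
hand (names are illustrative):
  // If SCEV can prove TC >= Step, the TC < Step guard can never trigger,
  // so the minimum-iteration check and its branch can be folded away.
  ScalarEvolution &SE = *PSE.getSE();
  bool CheckKnownFalse =
      SE.isKnownPredicate(CmpInst::ICMP_UGE, TCSCEV, StepSCEV);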
Note that skeleton creation will gradually be moved to VPlan, and this
simplification should eventually be done as a VPlan transform. The
current plan is to move skeleton creation to VPlan starting with the
parts closest to those already created by VPlan: induction resume value
creation (started with
https://github.com/llvm/llvm-project/pull/110577), then memory and SCEV
checks, and finally the minimum iteration checks.
PR: https://github.com/llvm/llvm-project/pull/111310
Enabled initial support for max safe distance in DataWithEVL
tail-folding mode. If a max safe distance is required, special code
needs to be emitted while vectorizing the loop:
  CMP = icmp ult AVL, MAX_SAFE_DISTANCE
  SAFE_AVL = select CMP, AVL, MAX_SAFE_DISTANCE
  EVL = call i32 @llvm.experimental.get.vector.length(i64 SAFE_AVL)
Reviewers: fhahn
Reviewed By: fhahn
Pull Request: https://github.com/llvm/llvm-project/pull/102897
Previously, the cost model was returning an invalid cost. This simply
moves the check from one place to another. This is mostly to make the
cost modeling code a bit easier to follow.
---------
Co-authored-by: Mel Chen <mel.chen@sifive.com>
Any-of reductions are narrowed to i1. Update the legacy cost model to
use the correct type when computing the cost of a phi that gets lowered
to selects (BLEND).
This fixes a divergence between legacy and VPlan-based cost models after
36fc291b6ec6d.
Fixes https://github.com/llvm/llvm-project/issues/111874.
There are a number of places where we call getSmallConstantMaxTripCount
without passing a vector of predicates:
getSmallBestKnownTC
isIndvarOverflowCheckKnownFalse
computeMaxVF
isMoreProfitable
I've changed all of these to now pass in a predicate vector so that
we get the benefit of making better vectorisation choices when we
know the max trip count for loops that require SCEV predicate checks.
I've tried to add tests that cover all the cases affected by these
changes.
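A hedged sketch of the resulting calling pattern (the exact signature of
the predicate-taking overload is assumed here, not quoted): callers
collect the SCEV predicates the max trip count depends on and must make
sure they are checked at runtime.
  // Assumed in scope: SE and TheLoop.
  SmallVector<const SCEVPredicate *, 4> Predicates;
  unsigned MaxTC = SE.getSmallConstantMaxTripCount(TheLoop, &Predicates);
  // MaxTC is only meaningful if the collected predicates hold, so they
  // must be checked (or versioned on) before relying on it.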
Update fixupIVUsers to compute the value for escaped inductions using
the already computed end value of the induction (EndValue), but
subtracting the step.
This results in slightly simpler codegen, as we avoid computing the full
transformed index at VectorTripCount - 1.
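For a simple induction with start value Start and step Step, EndValue =
Start + VectorTripCount * Step, so the escaped value at the last vector
iteration is Start + (VectorTripCount - 1) * Step = EndValue - Step; a
single subtraction replaces re-evaluating the induction at
VectorTripCount - 1.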
PR: https://github.com/llvm/llvm-project/pull/110576
This patch splits off intrinsic handling into a new
VPWidenIntrinsicRecipe. A VPWidenIntrinsicRecipe only needs access to
the intrinsic ID to widen and the scalar result type (in case the
intrinsic is overloaded on the result type); it does not need access to
an underlying IR call instruction or function.
This means VPWidenIntrinsicRecipe can be created easily without access
to underlying IR.
Implement VPBlendRecipe::computeCost. VPBlendRecipe is currently also
used if only the first lane is used.
This also requires pre-computing costs for forced scalars and
instructions considered profitable to scalarize. For those, the cost
will be computed separately in the legacy cost model. This will also be
needed when implementing VPReplicateRecipe::computeCost.
There was some code in emitSCEVChecks to update the dominator
tree if LoopBypassBlocks is empty; however, there are no tests
that fail when replacing this code with an assert. I built
both SPEC2017 and the LLVM test suite and also didn't see any
build failures. I've removed the code for now and added an
assert to guard this in case anything changes, since it seems
pointless to have code that's impossible to defend.
Update VPInterleaveRecipe to always use the pointer to member 0 as the
pointer argument. In many cases this helps remove unneeded index
adjustments and simplifies VPInterleaveRecipe::execute.
In some rare cases, the address of member 0 does not dominate the insert
position of the interleave group. In those cases a PtrAdd VPInstruction
is emitted to compute the address of member 0 based on the address of
the insert position. Alternatively we could hoist the recipe computing
the address of member 0.
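Conceptually, member Idx of the group sits Idx elements above member 0,
so the emitted PtrAdd just applies a negative byte offset (a sketch with
assumed variable names, not the recipe's actual code):
  // Assumed in scope: Builder, InsertPosPtr (address available at the
  // insert position), Idx (its member index) and ElemSizeInBytes.
  Value *Offset = Builder.getInt64(-(int64_t)(Idx * ElemSizeInBytes));
  Value *Member0Addr = Builder.CreatePtrAdd(InsertPosPtr, Offset);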
The legacy cost model always computes the cost for uniforms as the cost
at VF = 1, but VPWidenCallRecipes would still be created, as
setVectorizedCallDecision would not consider uniform calls.
Fix setVectorizedCallDecision to set the decision to Scalarize if the
call is uniform-after-vectorization.
This fixes a bug in VPlan construction uncovered by the VPlan-based
cost model.
Fixes https://github.com/llvm/llvm-project/issues/111040.
This fixes another case where the VPlan-based and legacy cost models
disagree. If any of the operands is predicated, it can't be trivially
hoisted, and we should account for the cost of evaluating it in each
loop iteration.
Fixes https://github.com/llvm/llvm-project/issues/108697.
LoopVectorizationLegality currently only treats a loop as legal to
vectorise if PredicatedScalarEvolution::getBackedgeTakenCount returns a
valid SCEV, or more precisely, if the loop has an exact backedge-taken
count. Therefore, in LoopVectorize.cpp we can safely replace all calls
to getBackedgeTakenCount with calls to getSymbolicMaxBackedgeTakenCount,
since the result is the same.
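A sketch of the substitution this enables (assuming a
PredicatedScalarEvolution PSE for a loop that Legality already
accepted):
  // Legality guarantees an exact backedge-taken count, so the symbolic
  // maximum is the same SCEV and the two calls are interchangeable here.
  const SCEV *BTC = PSE.getSymbolicMaxBackedgeTakenCount();
  assert(!isa<SCEVCouldNotCompute>(BTC) && "loop was vetted by Legality");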
This also helps prepare the loop vectoriser for PR #88385.
This replaces some of the most frequent offenders: uses of a DenseMap
that cause a malloc even though the typical element count is small
enough to fit in an initial stack allocation.
Most of these are fairly obvious; one to highlight is the collectOffset
method of GEP instructions: if there's a GEP, of course it's going to
have at least one offset, but every time we called collectOffset we
also ended up calling malloc for the DenseMap inside the MapVector.
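As a self-contained illustration of the pattern (not one of the call
sites touched here): SmallDenseMap keeps its first few buckets in inline
storage, so small maps never touch the heap.
  #include "llvm/ADT/ArrayRef.h"
  #include "llvm/ADT/SmallDenseMap.h"
  #include "llvm/IR/Value.h"

  // Counts distinct values; with 8 inline buckets the common small case
  // performs no heap allocation at all.
  unsigned countDistinct(llvm::ArrayRef<const llvm::Value *> Vals) {
    llvm::SmallDenseMap<const llvm::Value *, unsigned, 8> Seen;
    for (const llvm::Value *V : Vals)
      ++Seen[V];
    return Seen.size();
  }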
GeneratedRTChecks::getCost partially duplicates getSmallBestKnownTC
when attempting to get the best trip-count estimate. Since the intent of
this code is to get the best trip-count estimate, and
getSmallBestKnownTC is written for exactly this purpose, replace the
partial code-duplication with a call to this function.
This patch separates the computation of the final reduction result from
the intermediate stores of the reduction.
---------
Co-authored-by: Florian Hahn <flo@fhahn.com>
Predicated instructions cannot be hoisted trivially, so don't treat
them as uniform values in the cost model.
This fixes a difference between legacy and VPlan-based cost model.
Fixes https://github.com/llvm/llvm-project/issues/110295.
Use the reduction resume values from the phis in the scalar header,
instead of collecting them in a map. This removes some complexity from
the general executePlan code paths and pushes it to only the epilogue
vectorization part.
The vector trip count must already be created when fixupIVUsers is
called. Don't pass the vector preheader there and delay retrieving the
vector loop header. This ensures we are re-using the already computed
trip count. Computing the trip count from scratch would not be correct,
as the IR may not be in a valid state yet.
After 8ec406757cb92 (https://github.com/llvm/llvm-project/pull/95842),
only the lane part of VPIteration is used.
Simplify the code by replacing remaining uses of VPIteration with VPLane directly.
This follows in the spirit of 7d82c99403f615f6236334e698720bf979959704,
and extends the costing API for compares and selects to provide
information about the operands passed in an analogous manner. This
allows us to model the cost of materializing the vector constant, as
some select-of-constants are significantly more expensive than others
when you account for the cost of materializing the constants involved.
This is a stepping stone towards fixing
https://github.com/llvm/llvm-project/issues/109466. A separate SLP patch
will be required to utilize the new API.
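A hedged sketch of what a caller might now pass (parameter names and
order are assumed, not the exact new signature):
  // Assumed in scope: TTI, Sel (a select instruction), VecTy, CondTy and
  // CostKind. getOperandInfo classifies constant/uniform operands.
  TargetTransformInfo::OperandValueInfo Op1Info =
      TargetTransformInfo::getOperandInfo(Sel->getTrueValue());
  TargetTransformInfo::OperandValueInfo Op2Info =
      TargetTransformInfo::getOperandInfo(Sel->getFalseValue());
  InstructionCost Cost = TTI.getCmpSelInstrCost(
      Instruction::Select, VecTy, CondTy, CmpInst::BAD_ICMP_PREDICATE,
      CostKind, Op1Info, Op2Info);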
* Rename Speculative -> Uncountable and update tests.
* Add comments explaining why it's safe to ignore the predicates when
building up a list of exiting blocks.
* Reshuffle some code to do (hopefully) cheaper checks first.
When trying to maximize vector bandwidth we ask TTI for the number of
registers required for a given operation. If the type of that operation
happens to be something illegal for scalable vectors (e.g.
<vscale x 4 x fp128>) then we would see a crash.
Instead, just return a default value and let the cost model reject the
invalid operation later.
After 8ec406757cb92 (https://github.com/llvm/llvm-project/pull/95842),
VPTransformState only stores a single vector value per VPValue.
Simplify the code by replacing the SmallVector in PerPartOutput with a
single Value * and renaming it to VPV2Vector for clarity.
Also remove the redundant Part argument from various accessors.
This patch implements explicit unrolling by UF as a VPlan transform. In
follow-up patches this will allow simplifying VPTransformState (no need
to store unrolled parts) as well as recipe execution (no need to
generate code for multiple parts in each recipe). It also allows for
more general optimizations (e.g. avoiding generating code for recipes
that are uniform across parts).
It also unifies the logic dealing with unrolled parts in a single place,
rather than spreading it across multiple places (e.g. previously in
VPlan post-processing for header-phi recipes).
In the initial implementation, a number of recipes still take the
unrolled part as an additional, optional argument, if their execution
depends on the unrolled part.
The computation of start/step values for scalable inductions changed
slightly. Previously the step would be computed as a scalar and then
splatted; now vscale gets splatted and multiplied by the step in a
vector mul.
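Roughly, in IRBuilder terms (a simplified sketch with assumed names):
  // Before: multiply vscale and the step as scalars, then splat the result.
  //   Value *StepVec = Builder.CreateVectorSplat(VF, ScalarVScaleTimesStep);
  // After: splat vscale, splat the step, and multiply them as vectors.
  Value *VScale = Builder.CreateVScale(ConstantInt::get(IntTy, 1));
  Value *StepVec = Builder.CreateMul(Builder.CreateVectorSplat(VF, VScale),
                                     Builder.CreateVectorSplat(VF, Step));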
This has been split off from
https://github.com/llvm/llvm-project/pull/94339, which also includes
changes to simplify VPTransformState and recipes' ::execute
implementations.
The current version mostly leaves the existing ::execute implementations
untouched and instead sets VPTransformState::UF to 1.
A follow-up patch will clean up all references to VPTransformState::UF.
Another follow-up patch will simplify VPTransformState to only store a
single vector value per VPValue.
PR: https://github.com/llvm/llvm-project/pull/95842
Don't call raw_string_ostream::flush(), which is essentially a no-op.
As specified in the docs, raw_string_ostream is always unbuffered.
(See 65b13610a5226b84889b923bae884ba395ad084d for further reference.)
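For reference, a minimal self-contained example:
  #include "llvm/Support/raw_ostream.h"
  #include <string>

  std::string describe(unsigned VF) {
    std::string Buf;
    llvm::raw_string_ostream OS(Buf);
    OS << "chose VF = " << VF;
    // No OS.flush() needed: raw_string_ostream is unbuffered and writes
    // straight through to Buf.
    return Buf;
  }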