Previously only fixed vector splats were handled. This adds supports for
scalable vectors too by allowing ConstantExpr splats.
We need to add the extra V->getType()->isVectorTy() check because a
ConstantExpr might be a scalar to vector bitcast.
By allowing ConstantExprs this also allow fixed vector ConstantExprs to
be folded, which causes the diffs in
llvm/test/Analysis/ValueTracking/known-bits-from-operator-constexpr.ll
and llvm/test/Transforms/InstSimplify/ConstProp/cast-vector.ll. I can
remove them from this PR if reviewers would prefer.
Fixes#132922
Vectorizing of fminimumnum and fminimumnum have not support yet. Let's
add the testcase for it now, and we will update the testcase when we
support it.
I recently added a new option to update_test_checks.py that can
filter out all CHECK lines after a certain point. We usually don't
care about checking for the original scalar loop after the vector
loop because it doesn't change. Cutting out unnecessary CHECK
lines makes the files smaller and hopefully the tests run quicker.
Remove the UF = 1 restriction introduced by 577631f0a5 building on top
of 783a846507683, which allows updating all relevant users of the VF,
VPScalarIVSteps in particular.
This restores the full functionality of
https://github.com/llvm/llvm-project/pull/106441.
Similarly to other recipes, update VPScalarIVStepsRecipe to also take
the runtime VF as argument. This removes some unnecessary runtime VF
computations for scalable vectors. It will also allow dropping the
UF == 1 restriction for narrowing interleave groups required in
577631f0a528.
Optimize the IR generated for a VPWidenIntOrFpInductionRecipe to use the
narrowest type necessary, when the trip-count of a loop is known to be
constant and the only use of the recipe is the condition used by the
vector loop's backedge branch.
Follow-up to dfca6c0d3bf9d1a056 to extend isUnrolled handle any unrolled
VPlan, which means there's a single UF, but it will be > 1 if unrolling
took place.
This moves the checks of MinTripCountTailFoldingThreshold later, during the
calculation of whether to tail fold. This allows it to check beforehand whether
tail predication is required, either for scalable or fixed-width vectors.
This option is only specified for AArch64, where it returns the minimum of 5.
This patch aims to allow the vectorization of TC=4 loops, preventing them from
performing slower when SVE is present.
This is the second of two PRs that attempts to improve the IR
generated in the exit blocks of vectorised loops with uncountable
early exits. It follows on from PR #128880. In this PR I am
improving the generated code for users of induction variables in
early exit blocks.
This required using a newly add VPInstruction called
FirstActiveLane, which calculates the index of the first active
predicate in the mask operand.
I have added a new function optimizeEarlyExitInductionUser that
is called from optimizeInductionExitUsers when handling users in
early exit blocks.
This reverts commit ff3e2ba9eb94217f3ad3525dc18b0c7b684e0abf.
The recommmitted version limits to transform to cases where no
interleaving is taking place, to avoid a mis-compile when interleaving.
Original commit message:
This patch adds a new narrowInterleaveGroups transfrom, which tries
convert a plan with interleave groups with VF elements to a plan that
instead replaces the interleave groups with wide loads and stores
processing VF elements.
This effectively is a very simple form of loop-aware SLP, where we
use interleave groups to identify candidates.
This initial version is quite restricted and hopefully serves as a
starting point for how to best model those kinds of transforms.
Depends on https://github.com/llvm/llvm-project/pull/106431.
Fixes https://github.com/llvm/llvm-project/issues/82936.
PR: https://github.com/llvm/llvm-project/pull/106441
After unrolling, there may be additional simplifications that can be
applied. One example is removing SCALAR-STEPS for the first part where
only the first lane is demanded.
This removes redundant adds of 0 from a large number of tests (~200),
many which I am still working on updating.
In preparation for removing redundant WideIV steps added in
https://github.com/llvm/llvm-project/pull/119284.
PR: https://github.com/llvm/llvm-project/pull/123655
ExtractElements are no-ops for scalar VPlans. Don't introduce them in
handleUncountableEarlyExit if the plan has only a scalar VF.
This fixes a crash trying to compute the cost of ExtractElement after 26ecf978951b79.
PR: https://github.com/llvm/llvm-project/pull/131604
This patch adds a new narrowInterleaveGroups transfrom, which tries
convert a plan with interleave groups with VF elements to a plan that
instead replaces the interleave groups with wide loads and stores
processing VF elements.
This effectively is a very simple form of loop-aware SLP, where we
use interleave groups to identify candidates.
This initial version is quite restricted and hopefully serves as a
starting point for how to best model those kinds of transforms.
Depends on https://github.com/llvm/llvm-project/pull/106431.
Fixes https://github.com/llvm/llvm-project/issues/82936.
PR: https://github.com/llvm/llvm-project/pull/106441
At the moment if we decide to enable tail-folding we do not include
the cost of generating the mask per VF. This can mean we make some
poor choices of VF, which is definitely true for SVE-enabled AArch64
targets where mask generation for fixed-width vectors is more
expensive than for scalable vectors.
I've added a VPInstruction::computeCost function to return the costs
of the ActiveLaneMask and ExplicitVectorLength operations.
Unfortunately, in order to prevent asserts firing I've also had to
duplicate the same code in the legacy cost model to make sure the
chosen VFs match up. I've wrapped this up in a ifndef NDEBUG for
now. The alternative would be to disable the assert completely when
tail-folding, which I imagine is just as bad.
New tests added:
Transforms/LoopVectorize/AArch64/sve-tail-folding-cost.ll
Transforms/LoopVectorize/RISCV/tail-folding-cost.ll
This patch adds a new narrowInterleaveGroups transfrom, which tries
convert a plan with interleave groups with VF elements to a plan that
instead replaces the interleave groups with wide loads and stores
processing VF elements.
This effectively is a very simple form of loop-aware SLP, where we
use interleave groups to identify candidates.
This initial version is quite restricted and hopefully serves as a
starting point for how to best model those kinds of transforms. For now
it only transforms load interleave groups feeding store groups.
Depends on #106431.
This lands the main parts of the approved
https://github.com/llvm/llvm-project/pull/106441 as suggested to break
things up a bit more.
calculateRegisterUsage adds end points for each user of an instruction
to Ends and ignores instructions not added to it, i.e. instructions with
no users.
This means things like stores aren't included, which in turn means
values that are only used in stores are also not included for
consideration. This means we underestimate the register usage in cases
where the only users are things like stores.
Update the code to don't skip instructions without users (i.e. not in
Ends) if they have side-effects.
PR: https://github.com/llvm/llvm-project/pull/126415
A few test files seemed to have been edited after using the
update_test_checks.py script, which can make life hard for
developers when trying to update these tests in future
patches. Also, the tests still had this comment at the top
; NOTE: Assertions have been autogenerated by ...
which could potentially be confusing, since they've not
strictly been auto-generated.
I've attempted to keep the spirit of the original tests by
excluding all CHECK lines after the scalar.ph IR block,
however I've done this by using a new option called
--filter-out-after to the update_test_checks.py script.
Make test target-dependent, as they will require access to a concrete
vector register width. Also add new tests for cost modeling, unrolling
and removing the vector loop region.
This updates VPWidenPointerInductionRecipe::execute to not use the
canonical IV to determine the insert point. Instead, it relies on the
current recipe position. In cases where this is not sufficient, set the
insert point to the first non-phi instruction, to ensure phis are
created together.
Currently fast() won't return true if all flags are set via setXXX,
which is surprising. Update setters to set all bits if needed to make
sure isFast() consistently returns the expected result.
PR: https://github.com/llvm/llvm-project/pull/131321
Refactor the code to extract the first active element of a
vector in the early exit block, in preparation for PR #130766.
I've replaced the VPInstruction::ExtractFirstActive nodes with
a combination of a new VPInstruction::FirstActiveLane node and
a Instruction::ExtractElement node.
Now that all phi nodes manage their incoming blocks through the
VPlan-predecessors, there should be no need for having a dedicate
recipe, it should be sufficient to allow PHI opcodes in VPInstruction.
Follow-ups will also migrate VPWidenPHIRecipe and possibly others,
building on top of https://github.com/llvm/llvm-project/pull/129388.
PR: https://github.com/llvm/llvm-project/pull/129767
This patch restricts broadcast operations from being hoisted to the vector
preheader unless the basic block that defines the broadcasted value properly
dominates the vector preheader.
This prevents potential use-before-definition issues when the broadcasted
value is defined within the plan. VPDominatorTree is used to confirm this
restriction while still allowing safe hoisting for broadcasted values defined
outside the plan.
Issue https://github.com/llvm/llvm-project/issues/117139
Slightly generalize materializeLiveInBroadcasts to also introduce
broadcasts for live-ins used in the vector preheader. This should cover
all live-ins.
If the live-in is used in the vector preheader, insert the broadcast at
the beginning of the block.
Following on from #125058, this patch takes into account the
work done in the vector early exit block when assessing the
profitability of vectorising the loop. I have renamed
areRuntimeChecksProfitable to isOutsideLoopWorkProfitable and
we now pass in the early exit costs. As part of this, I have
added the ExtractFirstActive opcode to VPInstruction::computeCost.
It's worth pointing out that when we assess profitability of the
loop we calculate a minimum trip count and compare that against
the *maximum* trip count. However, since the loop has an early
exit the runtime trip count can still end up being less than the
minimum. Alternatively, we may never take the early exit at all
at runtime and so we have the opposite problem of over-estimating
the cost of the loop. The loop vectoriser cannot simultaneously
take two contradictory positions and so I feel the only sensible
thing to do is be conservative and assume the loop will be more
expensive than loops without early exits.
We may find in future that we need to adjust the cost according to
the probability of taking the early exit. This will become even
more important once we support multiple early exits. However, we
have to start somewhere and we can always revisit this later.
This patch revises the cost model for sdiv/srem and draws its inspiration from the udiv/urem patch #122236
The typical codegen for the different scenarios has been mentioned as notes/comments in the code itself( this is done owing to lot of scenarios such that it would be difficult to mention them here in the patch description).
emitSCEVChecks checks if SCEVCheckCond matches zero, and returns
nullptr. However, it sets SCEVCheckCond as used before it does this,
which prevents it from being removed during cleanup, resulting in
unreachable blocks being emitted. Fix this.
getLoopEstimatedTripCount returns std::nullopt when the trip count would
overflow the return type, but since it is an estimate anyway, we might
as well saturate at UINT_MAX, improving results.
getLoopEstimatedTripCount returns the trip count based on profiling
data, and its documentation says that it could return 0 when the trip
count is zero, but this is not the case: a valid trip count can never be
zero, and it returns 0 when the unsigned ExitCount is incremented by 1
and wraps. Some callers are careful about checking for this zero value
in an std::optional, but it makes for an API with footguns, as a
std::optional return value indicates that a non-nullopt value would be a
valid trip count. Fix this by explicitly returning std::nullopt when the
return value would wrap, and strip additional checks in callers. This
also fixes a minor bug in LoopVectorize.
Follow on to #128035. It is a small extension to support vectorizing
`llvm.modf.*` and `llvm.sincospi.*` too.
This renames the test files from `sincos.ll` ->
`multiple-result-intrinsics.ll` to group together the similar tests
(which make up most of this PR).