Move materialization of the symbolic UF directly to unrollByUF. At this
point, unrolling materializes the decision and it is natural to also
materialize the symbolic UF here.
Previously, the canonical IV increment may have overflowed to a non-zero
value due to vscale being a non power-of-two. So we used to emit a
runtime check for this.
If you didn't want the runtime check,
DataAndControlFlowWithoutRuntimeCheck skipped it and instead tweaked the
trip count so it wouldn't overflow.
However, #144963 stopped the check from ever being emitted because vscale
is always a power-of-two on AArch64 and RISC-V, so it never overflowed
to a non-zero value. And in #183292 the code to emit the check was
removed. But we never restored the trip count back to normal when the
target's vscale was a power-of-two.
Now that vscale is always a power-of-two, this PR avoids adjusting the
trip count. A follow-up NFC can then remove
DataAndControlFlowWithoutRuntimeCheck.
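
A minimal sketch of the arithmetic behind this, assuming a simplified model in which the trip count is rounded up to a multiple of the step and then wrapped to the IV width. The function name and the 8-bit width are illustrative only and do not come from the patch.

```python
def wrapped_vector_trip_count(trip_count: int, step: int, bits: int = 8) -> int:
    """Round trip_count up to a multiple of step, then wrap to `bits` bits."""
    rounded = -(-trip_count // step) * step  # ceiling to a multiple of step
    return rounded % (1 << bits)


# Power-of-two step: 255 rounds up to 256, which wraps to exactly 0.
print(wrapped_vector_trip_count(255, 8))
# Non-power-of-two step: 255 rounds up to 258, which wraps to 2 (non-zero).
print(wrapped_vector_trip_count(255, 6))
```

In this model a power-of-two step can only wrap to zero, which is why the non-zero case needed a runtime check only for non-power-of-two vscale.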
Account for masked VPInstruction when verifying the operands in the
constructor. Fixes a crash when trying to unroll VPlans for predicated
early exits.
In the reordered RHS path of matchesShlZExt, the code never checked that
each shift amount (0, Stride, 2×Stride, …) appears at most once. When
the same shift appeared in multiple lanes, it still filled Order,
producing a non-permutation (e.g. Order = [0,0,0,1]). That led to bad
shuffle masks and miscompilation (e.g. shuffles with poison).
The patch adds an explicit duplicate check: before setting Order[Idx] =
Pos, it ensures Pos has not been seen before, using a SmallBitVector
SeenPositions(VF). If a position is seen twice, the function returns
false and the optimization is not applied.
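
The duplicate check can be sketched as follows. This is a Python model, not the C++ from the patch: `build_order` is a hypothetical name, and a list of booleans stands in for the SmallBitVector SeenPositions(VF).

```python
from typing import List, Optional


def build_order(shift_amounts: List[int], stride: int) -> Optional[List[int]]:
    """Map each lane's shift amount to a position; reject non-permutations."""
    vf = len(shift_amounts)
    seen_positions = [False] * vf  # stands in for SmallBitVector SeenPositions(VF)
    order = [0] * vf
    for idx, shift in enumerate(shift_amounts):
        if shift < 0 or shift % stride != 0:
            return None  # not one of 0, Stride, 2*Stride, ...
        pos = shift // stride
        if pos >= vf or seen_positions[pos]:
            return None  # out of range or duplicate: Order would not be a permutation
        seen_positions[pos] = True
        order[idx] = pos
    return order


print(build_order([8, 0, 24, 16], 8))  # a valid reordering
print(build_order([0, 0, 0, 8], 8))    # duplicate shift amounts are rejected
```

Without the `seen_positions` test, the second call would produce [0, 0, 0, 1], the non-permutation described above.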
Replace manual region dissolution code in
simplifyBranchConditionForVFAndUF with the general
removeBranchOnConst. simplifyBranchConditionForVFAndUF now just creates
a (BranchOnCond true) or updates BranchOnTwoConds.
The loop then gets automatically removed by running removeBranchOnConst.
This removes a bunch of special logic to handle header phi replacements
and CFG updates. With the new code, there's no restriction on what kind
of header phi recipes the loop contains.
Note that VPEVLBasedIVRecipe needs to be marked as readnone. This is
technically unrelated, but I could not find an independent test that
would be impacted.
The code to deal with epilogue resume values now needs updating, because
we may simplify a reduction directly to the start value.
PR: https://github.com/llvm/llvm-project/pull/181252
After #183080, the canonical IV (not the increment!) can't overflow. So
now canonical IVs that are unrolled will have steps that don't overflow,
so we can add the nuw flag.
This allows us to tighten the VPlanVerifier isKnownMonotonic check by
restricting it to adds with nuw.
This reverts commit b0b3e3e1c7f6387eabc2ef9ff1fea311e63a4299.
After thinking about this for a bit, I don't think this is correct.
vscale being a power-of-2 only guarantees that when the canonical IV
increment overflows it overflows to zero; it does not rule out overflow
in general.
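
A small simulation of that distinction, assuming an 8-bit IV that starts at 0 and is stepped ceil(trip_count / step) times. The function name and the model are illustrative, not taken from the reverted commit.

```python
def run_canonical_iv(trip_count: int, step: int, bits: int = 8):
    """Return (any_add_wrapped, final_iv) for an IV stepped from 0 by `step`."""
    modulus = 1 << bits
    iterations = -(-trip_count // step)  # ceil(trip_count / step)
    iv, wrapped = 0, False
    for _ in range(iterations):
        nxt = iv + step
        wrapped = wrapped or nxt >= modulus  # an unsigned wrap: nuw would not hold
        iv = nxt % modulus
    return wrapped, iv


# The last increment is 248 + 8 = 256: it lands on zero, but it did wrap.
print(run_canonical_iv(250, 8))
# A smaller trip count never wraps.
print(run_canonical_iv(200, 8))
```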
The reason for doing this in `transformToPartialReduction` is so that we
can create the VPExpressions directly when transforming reductions into
partial reductions (to be done in a follow-up PR).
I also intend to see if we can merge the in-loop reductions with partial
reductions, so that there will be no need for the separate
`convertToAbstractRecipes` VPlan Transform pass.
Unless we're working with AVX512 mask predicate types, sign extending a
vXi1 comparison result back to the width of the comparison source types
is free.
VectorCombine::foldShuffleOfCastops - pass the original CastInst in the
getCastInstrCost calls to track the source comparison instruction.
Fixes #165813
After #183080, vscale can no longer be a non-power-of-2, which means the
canonical IV can't overflow with tail folding with scalable vectors
anymore. Therefore we don't need to drop the NUW flag.
IVUpdateMayOverflow is left to be removed in a separate PR since it
removes further runtime checks.
Now that we have ExitingIVValue, we can also use it for tail-folded
loops; the only difference is that we have to compute the end value with
the original trip count instead of the vector trip count.
This allows removing the induction increment operand only used when
tail-folding.
PR: https://github.com/llvm/llvm-project/pull/182507
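
As a rough illustration of why the original trip count is needed, assume a tail-folded vector loop whose trip count is the original one rounded up to a multiple of VF. The helper names below are hypothetical.

```python
def end_value(start: int, step: int, trip_count: int) -> int:
    """IV value after `trip_count` scalar iterations."""
    return start + step * trip_count


def rounded_up_trip_count(trip_count: int, vf: int) -> int:
    """Trip count of the tail-folded vector loop in this simplified model."""
    return -(-trip_count // vf) * vf


start, step, tc, vf = 10, 3, 10, 4
print(end_value(start, step, tc))                             # matches the scalar loop
print(end_value(start, step, rounded_up_trip_count(tc, vf)))  # overshoots the scalar loop
```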
Previously, we could miscompile when vectorizing conditional scalar
assignments with forced tail folding, as the backedge select could be
based on the header mask, not the assignment conditional.
This resulted in a number of failures in the LLVM test suite when
building with `-O3 -march=armv8-a+sve -mllvm
-prefer-predicate-over-epilogue=predicate-dont-vectorize`.
The patch reworks `handleFindLastReductions()` to correctly handle tail
folding.
Currently if -vplan-verify-each is enabled and a pass fails the
verifier, it will output the failure to stderr but will still finish
with a zero exit code.
This adds an assert that verification succeeds so that e.g. lit will
pick up verifier failures in the in-tree tests with an EXPENSIVE_CHECKS
build.
Currently the LastActiveLane verification fails in several tests, so
this also includes a fix to handle more prefix masks. All of the prefix
masks that the verifier encounters are of the form `icmp ult/ule
monotonically-increasing-sequence, uniform`, which always generate a
prefix mask.
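
A quick way to see why such a compare yields a prefix mask, modelled on plain Python integers rather than VPlan recipes (both helper names are made up for this sketch):

```python
def is_prefix_mask(mask) -> bool:
    """True if the mask is a run of True lanes followed only by False lanes."""
    seen_false = False
    for lane in mask:
        if lane and seen_false:
            return False
        if not lane:
            seen_false = True
    return True


def compare_mask(seq, bound: int, inclusive: bool = False):
    """Lane-wise `icmp ult` (or `ule` when inclusive) against a uniform bound."""
    return [(x <= bound) if inclusive else (x < bound) for x in seq]


print(compare_mask([4, 5, 6, 7], 6))  # True lanes first, then False lanes
print(is_prefix_mask(compare_mask([4, 5, 6, 7], 6)))
```

Once one lane fails the compare, every later lane holds a larger value and fails it too, so no True lane can follow a False one.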
Tested that llvm-test-suite + SPEC CPU 2017 now pass with
-vplan-verify-each enabled for RISC-V.
Add support for a single early exit that is executed conditionally. This
requires making sure the mask from any non-exiting control flow is
combined with the early exit condition.
To do so, introduce a MaskedCond VPInstruction, which is inserted as
user of the early-exit condition, at the point of the early-exit branch.
The VPInstruction will get masked automatically if needed by the
predicator, ensuring that we properly account for it when checking
whether the early exit has been taken.
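
Conceptually, the masked check behaves like the sketch below, where Python lists stand in for vector lanes. `early_exit_taken` is an illustrative name, not part of the patch.

```python
def early_exit_taken(block_mask, exit_cond) -> bool:
    """The exit counts only in lanes where the guarding control flow is active."""
    return any(m and c for m, c in zip(block_mask, exit_cond))


block_mask = [True, False, True, False]  # lanes that reach the early-exit branch
exit_cond = [False, True, False, False]  # raw early-exit condition per lane

print(any(exit_cond))                           # ignoring the mask reports an exit
print(early_exit_taken(block_mask, exit_cond))  # with the mask, no exit is taken
```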
Note that this does not allow for instructions that require predication
after the early exit. That requires additional work, which is in progress:
https://github.com/llvm/llvm-project/pull/172454
As an alternative to MaskedCond, we could also predicate before handling
early exiting blocks: https://github.com/llvm/llvm-project/pull/181830
PR: https://github.com/llvm/llvm-project/pull/182395
After https://github.com/llvm/llvm-project/pull/183080 this is no longer
a configurable property.
NOTE: No test changes expected beyond
llvm/test/Transforms/LoopVectorize/scalable-predication.ll which has
been removed because it only existed to verify the now-unsupported
functionality.
The correct way to check if two memory locations may alias is outlined
in ScopedNoAliasAAResult::alias: extract this into a helper, to fix the
current logic.
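
For reference, here is a simplified model of the scoped-noalias rule as I understand it from ScopedNoAliasAAResult: two accesses are known not to alias if, for some domain, every alias.scope of one access in that domain appears in the noalias list of the other. Scopes are modelled as (domain, name) tuples. This is a sketch of the rule, not the helper extracted by the patch.

```python
def may_alias_in_scopes(scopes, noalias) -> bool:
    """False only if some domain's scopes are all covered by `noalias`."""
    if not scopes or not noalias:
        return True
    for domain in {d for d, _ in noalias}:
        scopes_in_domain = {s for s in scopes if s[0] == domain}
        noalias_in_domain = {s for s in noalias if s[0] == domain}
        if scopes_in_domain and scopes_in_domain <= noalias_in_domain:
            return False
    return True


def may_alias(a_scopes, a_noalias, b_scopes, b_noalias) -> bool:
    """The check has to pass in both directions for the accesses to alias."""
    return (may_alias_in_scopes(a_scopes, b_noalias)
            and may_alias_in_scopes(b_scopes, a_noalias))


s1, s2, s3 = ("D", "s1"), ("D", "s2"), ("D", "s3")
print(may_alias({s1}, set(), set(), {s1, s2}))  # A's scopes are covered by B's noalias
print(may_alias({s1, s3}, set(), set(), {s1}))  # s3 is not covered, so they may alias
```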
Add an alternative to test VPlan in more isolation via a new
`vplan-test-transform` option, which builds VPlan0 for each loop in the
input IR and then can invoke a set of transforms on it.
In order to allow different recipe types to be created, a new
widen-from-metadata transform is added, which transforms VPInstructions
to different recipes, based on custom !vplan.widen metadata. Currently
this supports creating widen & replicate recipes, but can easily be
extended in the future.
Currently the handling is intentionally bare-bones, to be extended
gradually as needed.
PR: https://github.com/llvm/llvm-project/pull/178522
In #182254 we want to start aborting compilation when the verifier fails
between passes, but currently we run into various EVL related failures.
The EVL is used in quite a few more places than when the verification
was originally added, all of which need to be handled by the verifier. I
think this is also exacerbated by the fact that many recipes nowadays
are converted to concrete recipes later in the pipeline, which multiplies
the number of patterns we need to match.
The EVL transform itself has also changed much since its original
implementation, i.e. non-trapping recipes don't use EVL (#127180) and VP
recipes are generated via pattern matching instead of unconditionally
(#155394), so I'm not sure if the verification is as relevant today.
Rather than try to add more patterns, this PR removes the verification to
reduce the maintenance cost. Split off from #182254.
Currently createExtractsForLiveOuts only handles creating extracts when
the middle block has one predecessor, but if an early exit exits to the
same block as the latch then it might have multiple predecessors.
This handles the latter case to avoid the need to handle it in
VPlanTransforms::handleUncountableEarlyExits. Addresses the comment in
https://github.com/llvm/llvm-project/pull/174864#discussion_r2794153217
This enables vectorization of epilogue loops produced by LoopVectorizer on
SystemZ.
LoopVectorizationCostModel::isEpilogueVectorizationProfitable() and
TTI.preferEpilogueVectorization() have been refactored slightly so that
targets can override preferEpilogueVectorization(ElementCount Iters) and
directly control this, whereas before this depended on
TTI.getMaxInterleaveFactor() as well.
The Iters passed to preferEpilogueVectorization() reflects the total number
of scalar iterations performed in the vectorized loop (including interleaving).
The default implementation of preferEpilogueVectorization() now subsumes
the old check against getMaxInterleaveFactor(). This patch should be NFC for
other targets.
For FindLast reduction selecting an IV, we can avoid the horizontal
AnyOf in the vector loop, by introducing an independent boolean
reduction to track if the condition was ever true in the loop. If it was
never true in the loop, we select the start value; otherwise we select
the min/max of the FindIV reduction, as required by the predicate.
The main advantage of this approach is that we have 2 independent
reductions, that do not require a horizontal AnyOf reduction in the
loop.
Currently this requires a non-wrapping IV, but this can be relaxed in
the future by selecting a canonical IV, which is then transformed to the
specific derived IV for the reduction after the loop.
Depends on https://github.com/llvm/llvm-project/pull/177870.
PR: https://github.com/llvm/llvm-project/pull/172569
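
The idea of two independent reductions can be sketched in Python, with lists standing in for vector lanes. The function names, the -1 sentinel and the choice of an increasing IV (so the FindIV reduction is a max) are assumptions made for this example.

```python
def find_last_iv_scalar(cond, start):
    """Reference loop: the last index where cond holds, else start."""
    res = start
    for i, c in enumerate(cond):
        if c:
            res = i
    return res


def find_last_iv_two_reductions(cond, start, vf: int = 4):
    iv_rdx = [-1] * vf      # per-lane max of the IV where cond was true
    any_rdx = [False] * vf  # per-lane boolean OR of cond
    for base in range(0, len(cond), vf):
        for lane in range(vf):
            i = base + lane
            if i < len(cond):
                iv_rdx[lane] = max(iv_rdx[lane], i) if cond[i] else iv_rdx[lane]
                any_rdx[lane] = any_rdx[lane] or cond[i]
    # Both horizontal reductions happen once, after the loop.
    return max(iv_rdx) if any(any_rdx) else start


cond = [False, True, False, False, True, False]
print(find_last_iv_scalar(cond, 100), find_last_iv_two_reductions(cond, 100))
```

Neither per-lane update needs a cross-lane operation inside the loop, which is the point of tracking the boolean separately.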
Bail on vectorizing a loop in LoopIdiomVectorize when the loop carries
hints that indicate vectorization is disabled.
This means that LoopIdiomVectorize will now respect vectorize(disable)
loop hints.
#149042 added last-active-lane and removed the restriction that we
couldn't tail fold loops that had outside users (in AllowedExit).
However we still have a restriction that IVs can't have outside users.
This was added separately to the AllowedExit restriction in #81609, but
it looks like #149042 didn't remove it.
AFAICT we currently extract the correct lane for IVs, so this PR relaxes
the restriction. This helps a good few loops get tail folded in
llvm-test-suite.
-force-tail-folding-style=none was added to pr5881-scev-expansion.ll to
preserve the original scev expansion, since otherwise we end up with a
cttz.elts(false, false, true, true) that blocks SCEV analysis. We should
probably teach ConstantFolding to fold it.
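
For illustration, the fold would compute something like the following, assuming cttz.elts returns the index of the first true lane and the lane count when no lane is true. This models my reading of the intrinsic; it is not ConstantFolding code.

```python
def cttz_elts(mask) -> int:
    """Count leading-lane false elements: index of the first true lane."""
    for idx, lane in enumerate(mask):
        if lane:
            return idx
    return len(mask)  # assumed result when no lane is set


print(cttz_elts([False, False, True, True]))
```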
This avoids having to pass around the RecurKind or re-figure it out from
the VPReductionPHI node.
This is useful in a follow-up PR, where we need to distinguish between a
`Sub` and `AddWithSub` recurrence, which can't be deduced from the
`ReductionBinOp` field.
In order to be able to create selects for reduction phis through tail
folding in foldTailByMasking (#176143), make VPReductionPHIRecipe an
instance of VPIRFlags and plumb the FMFs from the original RdxDesc.
This allows us to remove more uses of the RecurrenceDescriptor in
addReductionResultComputation, which should help untie it from
LoopVectorizationLegality.
Extend handleMultiUseReductions to support strict predicates (>, <),
matching the first index rather than the last, as is done for non-strict
predicates.
Builds on top of https://github.com/llvm/llvm-project/pull/141431.
FindLast reductions with strict predicates are adjusted to compute the
correct result as follows:
1. Find the first canonical indices corresponding to partial min/max
values, using loop reductions.
2. Find which of the partial min/max values are equal to the overall
min/max value.
3. Select among the canonical indices those corresponding to the overall
min/max value.
4. Find the first canonical index of overall min/max and scale it back to
the original IV using VPDerivedIVRecipe.
5. If the overall min/max equals the starting min/max, the condition in
the loop was always false, due to being strict; return the original start
value in that case.
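
The steps above can be sketched for a strict `<` min reduction as follows. Lists stand in for vector lanes, the out-of-range sentinel index and all function names are assumptions, and the derived IV is modelled as `iv_start + index * iv_step` rather than an actual VPDerivedIVRecipe.

```python
def first_min_scalar(values, start_min, start_value, iv_start=0, iv_step=1):
    """Reference loop: a strict '<' keeps the first index of the minimum."""
    best, res = start_min, start_value
    for i, v in enumerate(values):
        if v < best:
            best, res = v, iv_start + i * iv_step
    return res


def first_min_lanes(values, start_min, start_value, vf=4, iv_start=0, iv_step=1):
    sentinel = len(values)  # larger than any canonical index
    part_min = [start_min] * vf
    part_idx = [sentinel] * vf
    # Step 1: per-lane partial min and the first canonical index reaching it.
    for i, v in enumerate(values):
        lane = i % vf
        if v < part_min[lane]:
            part_min[lane], part_idx[lane] = v, i
    overall = min(part_min)
    # Step 5: the strict condition was never true, so keep the start value.
    if overall == start_min:
        return start_value
    # Steps 2-3: keep the indices of lanes whose partial min is the overall min.
    candidates = [part_idx[l] for l in range(vf) if part_min[l] == overall]
    # Step 4: first canonical index, scaled back to the original IV.
    return iv_start + min(candidates) * iv_step


vals = [5, 3, 7, 3, 9, 1, 1, 4]
print(first_min_scalar(vals, 10, -1), first_min_lanes(vals, 10, -1))
```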
When the vectorized epilogue loop uses partial reductions, the PHI node
in the loop must start at 0 (because for partial sub-reductions the
sub is done in the middle block) and the compute-reduction-result must
subtract from the partial result (as calculated in the middle block of
the main vector loop), instead of subtracting from the original init
value.
This fixes the issue reported in #178919 by @aeubanks.
Fixes #179187 - as described in the issue, the current FindFirstByte
transformation in LoopIdiomVectorizePass will incorrectly early-exit as
soon as a needle matching a search element is found, even if a previous
search element could match a subsequent needle.
This patch ensures all needles are tested before we return a matching
search element.
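
A Python model of the difference, assuming the search range is processed in chunks of VF elements and each needle is compared against the whole chunk. The loop structure and the function names are assumptions for illustration; the real transformation operates on IR.

```python
def find_first_of(search, needles) -> int:
    """Reference: index of the first search element equal to any needle."""
    for i, s in enumerate(search):
        if s in needles:
            return i
    return -1


def find_first_of_early_exit(search, needles, vf: int = 4) -> int:
    """Models the described bug: return as soon as any needle matches."""
    for base in range(0, len(search), vf):
        chunk = search[base:base + vf]
        for needle in needles:
            if needle in chunk:
                return base + chunk.index(needle)
    return -1


def find_first_of_all_needles(search, needles, vf: int = 4) -> int:
    """Test every needle against the chunk before picking the first match."""
    for base in range(0, len(search), vf):
        chunk = search[base:base + vf]
        lanes = [chunk.index(n) for n in needles if n in chunk]
        if lanes:
            return base + min(lanes)
    return -1


search, needles = list(b"abcd"), list(b"db")
print(find_first_of_early_exit(search, needles))   # reports the later match on 'd'
print(find_first_of_all_needles(search, needles))  # reports the earlier match on 'b'
```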
Converts reduced or(select %cmp, bitmask, 0) to zext(bitcast %vector_cmp to
i<num_reduced_values>) to in
Reviewers: RKSimon, hiraditya
Pull Request: https://github.com/llvm/llvm-project/pull/181940
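
A sketch of the equivalence behind this conversion, under two assumptions that the truncated description does not state: lane i contributes the bitmask 1 << i, and the bitcast places lane 0 in the least significant bit. Both function names are illustrative.

```python
def or_reduce_select(cmp) -> int:
    """OR-reduce select(cmp[i], 1 << i, 0) over all lanes (assumed bitmasks)."""
    acc = 0
    for lane, c in enumerate(cmp):
        acc |= (1 << lane) if c else 0
    return acc


def bitcast_mask_to_int(cmp) -> int:
    """Read the i1 lanes as one integer, lane 0 in the lowest bit (assumed)."""
    bits = "".join("1" if c else "0" for c in reversed(cmp))
    return int(bits, 2)


cmp = [True, False, True, True]
print(or_reduce_select(cmp), bitcast_mask_to_int(cmp))
```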