With #181706 using the cost-model to decide whether using partial
reductions is profitable, we need to more accurately represent the cost
of certain partial reduction operations:
* Reflect the fact that *MLALB/T instructions can be used for 16-bit ->
32-bit partial reductions (or *MLAL/MLAL2 for NEON).
* Calculate the cost of expanding the partial reduction in ISel for
reductions that don't have an explicit instruction, rather than
returning a random number. For sub-reductions we scale the cost to make
them slightly cheaper, so that they're still candidates for forming cdot
operations.
Due to a somewhat recent change, IntOrFpInduction recipes have
associated VPIRFlags. The VPlanUnroll logic for WidenInduction recipes
predates this change, and computes incomplete wrap-flags: update it to
simply use the flags on IntOrFpInduction recipes; PointerInduction
recipes have no associated flags, and indeed, no flags should be used.
The wrap flags from the IV bin-op are not guaranteed to apply to
truncated inductions, which are evaluated in narrower types.
Instead of dropping them late (in expandVPWidenIntOrFpInduction), do not
add them at the outset, the prevent invalid transforms based on
incorrect flags in the future.
PR: https://github.com/llvm/llvm-project/pull/188966
This reverts commit 4562a953db9d9813a873b78144cee1df39c7e0c0.
The recommit adjusts processLaneForReplicateRegion to first remap all
operands, then update the new operands. This fixes a VPlan verification
failure when running LV tests with expensive checks.
Original message:
This patch adds a new replicateReplicateRegionsByVF transform to unroll
replicate=regions by VF, dissolving them. The transform creates VF
copies of the replicate-region's content, connects them and converts
recipes to single-scalar variants for the corresponding lanes.
The initial version skips regions with live-outs (VPPredInstPHIRecipe),
which will be added in follow-up patches.
Depends on https://github.com/llvm/llvm-project/pull/170053
PR: https://github.com/llvm/llvm-project/pull/170212
This patch adds a new replicateReplicateRegionsByVF transform to
unroll replicate=regions by VF, dissolving them. The transform creates
VF copies of the replicate-region's content, connects them and converts
recipes to single-scalar variants for the corresponding lanes.
The initial version skips regions with live-outs (VPPredInstPHIRecipe),
which will be added in follow-up patches.
Depends on https://github.com/llvm/llvm-project/pull/170053
PR: https://github.com/llvm/llvm-project/pull/170212
When not folding the tail the minimum iteration count check ensures that
the vector loop is not executed if computing the trip count wraps around
to zero, as the trip count must be at least VF when vectorizing without
tail-folding.
Add and use a new tryToRefineConstantMaxTripCount helper. This ensures
we do not create dead main loops when vectorizing the epilogue, as we
choose smaller main VFs.
PR: https://github.com/llvm/llvm-project/pull/188135
This reverts commit e30f9c19464bcf1bf1e9f69b63884fb78ad2d05d.
Re-land, now that the reported crash causing the revert has been fixed
as part of 77fb84889 (#187504).
Original message:
Replace manual region dissolution code in
simplifyBranchConditionForVFAndUF with using general
removeBranchOnConst. simplifyBranchConditionForVFAndUF now just creates
a (BranchOnCond true) or updates BranchOnTwoConds.
The loop then gets automatically removed by running removeBranchOnConst.
This removes a bunch of special logic to handle header phi replacements
and CFG updates. With the new code, there's no restriction on what kind
of header phi recipes the loop contains.
Note that VPEVLBasedIVRecipe needs to be marked as readnone. This is
technically unrelated, but I could not find an independent test that
would be impacted.
The code to deal with epilogue resume values now needs updating, because
we may simplify a reduction directly to the start value.
PR: https://github.com/llvm/llvm-project/pull/181252
MVE only has 8 vector registers, so it's not too hard for the vectorizer
to end up using more than that resulting in enough spilling that it's
worse than not vectorizing. Enable
shouldConsiderVectorizationRegPressure for targets with MVE so the
vectorizer doesn't vectorize in those cases.
The legacy cost model computes and passes RHSInfo both when widening and
replicating. Match behavior in VPlan-based cost model.
The added test shows that we now compute the same cost as the legacy
cost model.
Without this change, the test added in
llvm/test/Transforms/LoopVectorize/AArch64/predicated-costs.ll would
crash with https://github.com/llvm/llvm-project/pull/187056.
PR: https://github.com/llvm/llvm-project/pull/188126
This reverts commit cdaf29f84dd0abbd1f961982799059c92d76625b.
This version skips removeBranchOnConst when vectorizing the epilogue, as
it may trigger folds that remove the resume phi used as resume value
from the epilogue.
This fixes https://github.com/llvm/llvm-project/issues/187323.
Original message:
This patch tries to drastically simplify resume value handling for the
scalar loop when vectorizing the epilogue.
It uses a simpler, uniform approach for updating all resume values in
the scalar loop:
1. Create ResumeForEpilogue recipes for all scalar resume phis in the
main loop (the epilogue plan will have exactly the same scalar resume
phis, in exactly the same order)
2. Update ::execute for ResumeForEpilogue to set the underlying value
when executing. This is not super clean, but allows easy lookup of the
generated IR value when we update the resume phis in the epilogue. Once
we connect the 2 plans together explicitly, this can be removed.
3. Use the list of ResumeForEpilogue VPInstructions from the main loop
to update the resume/bypass values from the epilogue.
This simplifies the code quite a bit, makes it more robust (should fix
https://github.com/llvm/llvm-project/issues/179407) and also fixes a
mis-compile in the existing tests (see change in
llvm/test/Transforms/LoopVectorize/AArch64/partial-reduce-sub-epilogue-vec.ll,
where previously we would incorrectly resume using the start value when
the epilogue iteration check failed)
In some cases, we get simpler code, due to additional CSE, in some cases
the induction end value computations get moved from the epilogue
iteration check to the vector preheader. We could try to sink the
instructions as cleanup, but it is probably not worth the trouble.
Fixes https://github.com/llvm/llvm-project/issues/179407.
PR for recommit https://github.com/llvm/llvm-project/pull/188134
The test was added in
4e9894498e.
Alternative fixes would be:
* Remove unused GEP, although not clear why we'd want to overwrite
stored `i64` with `ptr` store.
* Keep this patch, but perform both GEPs with `i64` element type to
reduce the diff. It's not clear if the scalarization caused by that type
mismatch is intentional/relevant for the original change.
`LoopVectorizationCostModel::getPredBlockCostDivisor(...)` may return
large `uint64_t` values that get coerced to an `unsigned` by
`VPCostContext::getPredBlockCostDivisor(...)`, which can cause division
by zero.
Fixes#187584
This test failed on the llvm-clang-win-x-aarch64 buildbot.
It seems the rounding is different, leading to a different output.
Instead of:
Cost for VF 4: 9 (Estimated cost per lane: 2.2)
The windows buildbot it fails because the test output is:
Cost for VF 4: 9 (Estimated cost per lane: 2.3)
When narrowing interleave groups, the main vector loop processes IC
iterations instead of VF * IC. Update selectEpilogueVectorizationFactor
to use the effective VF, checking if the canonical IV controlling the
loop now steps by UF instead of VFxUF.
This avoids epilogue vectorization with dead epilogue vector loops and
also prevents crashes in cases where we can prove both the epilogue and
scalar loop are dead.
Fixes https://github.com/llvm/llvm-project/issues/186846
PR: https://github.com/llvm/llvm-project/pull/187016
Simplify exactly as InstCombine does. A follow-up would include
simplifying add x, (sub 0, y) -> sub x, y.
Alive2 proof: https://alive2.llvm.org/ce/z/Af7QiD
This updates `matchExtendedReductionOperand` so the simple case of
`UpdateR(PrevValue, ext(...))` is matched first as an early exit. The
binop matching is then flattened to remove the extra layer of the
`MatchExtends` lambda.
I was very puzzled the other day when it showed that VF 8 had a cost of
X and VF 16 had a cost of X/2, yet it still choose VF 8. This PR adds
some extra debug output to explain why this happens.
Instead of checking dereferenceability early during
LoopVectorizationLegality, defer the check to VPlan construction via
areAllLoadsDereferenceable.
This in preparation for supporting early exit vectorization of
non-dereferencable loads, e.g. via speculative loads
(https://discourse.llvm.org/t/rfc-provide-intrinsics-for-speculative-loads/89692)
or first-faulting loads. Detection in VPlan allows easily replacing
potentially non-deref loads with other loads as needed.
PR: https://github.com/llvm/llvm-project/pull/185323
This test is failing on the llvm-clang-x-aarch64 buildbot due to what
looks like a difference in rounding behaviour when printing estimated
cost per lane. Solve this by removing the fractional part, which is what
we've done in the past when this has happened (e.g. commit aeb88f677).
When matching scalar steps of the canonical IV, also match a derived IV
of the canonical IV if the derivation is essentially a no-op. Fixes a
failure in the mve-reg-pressure-spills.ll test when expensive checks are
enabled.
Correctly inform transform passes about our registers; this prevents the
issue with the `find-last` test where the loop vectorizer pass
mistakenly thinks that the backend has vector capabilities and generates
vector types, which causes the backend to crash.
See also: https://github.com/sparclinux/issues/issues/69
This patch replace the remaining LogicalAnd to vp.merge in the second
pass to not break the `m_RemoveMask` pattern in the optimizeMaskToEVL.
Also skip cost model comparison when the plan contains `vp_merge` which
won't be calculated by the legacy model.
This can help to remove header mask for FindLast reduction (CSA) loops.
Original PR: https://github.com/llvm/llvm-project/pull/184068
Original built-bot failure:
https://lab.llvm.org/buildbot/#/builders/213/builds/2497
masked_cond is used to combine early-exit conditions with masks from
predicate. The early-exit condition should only be evaluated if the mask
is true. Emit the mask first, to avoid incorrect poison propagation.
Fixes https://github.com/llvm/llvm-project/issues/187061.
Now that we can vectorize loops with multiple early exits, we emit
dispatch blocks after the middle block to go to a specific exit or
continue in the dispatch chain.
With that, we need to be a bit more careful when it comes to picking the
loop the dispatch block belongs to. The dispatch block will belong to
the innermost loop of all exit blocks reachable from the current block.
Fixes https://github.com/llvm/llvm-project/issues/185362
PR: https://github.com/llvm/llvm-project/pull/185618
Align it with the style of `LoopVectorize/VPlan/predicator.ll`:
* Move ascii-graphs close to IR to avoid scrolling through CHECKs when
comparing the picture and actual IR
* Rename `%cN` to ensure that `bbN` branches on `%cN`
Currently when considering register pressure is enabled, we reject any
VF that has higher pressure than the number of registers. However this
can result in failing to vectorize in cases where it's beneficial, as
the cost of the extra spills is less than the benefit we get from
vectorizing.
Deal with this by instead calculating the cost of spills and adding that
to the rest of the cost, so we can detect this kind of situation and
still vectorize while avoiding vectorizing in cases where the extra cost
makes it not with it.
Fixes#186005
On RV32 with zve32x, i.e. no legal 64 bit types either scalar or vector,
@llvm.cttz.elts.i64 cannot be lowered and so returns an illegal cost for
scalable VFs. However VPInstruction::FirstActiveLane and
VPInstruction::LastActiveLane always use a hardcoded i64 type.
This causes a legacy/VPlan cost model mismatch in the live-out.ll test,
and in early-exit-live-out.ll prevents the scalable VF from being
chosen.
This PR teaches the two VPInstructions to use the target's index type,
i.e. the width of a pointer in the default address space, so it will
generate a 32 bit cttz.elts on RV32. This should be large enough to hold
the maximum number of elements in a vector, as if the vector was any
bigger it would imply it isn't accessible by memory.
I considered using the canonical IV type but I don't think that will
work since the canonical IV can be i64 on RV32, and it causes
regressions due to extra zexting on 64-bit targets with a 32-bit IV.
This patch replace the remaining LogicalAnd to vp.merge in the second
pass to not break the `m_RemoveMask` pattern in the optimizeMaskToEVL.
This can help to remove header mask for FindLast reduction (CSA) loops.
PR: https://github.com/llvm/llvm-project/pull/184068
This patch removes the extra logical-and in `x && (x && y)` and `x && (y && x)` to `x && y`.
This helps to simplify mask calculation in the FindLast reduction and
exposes more opportunities to replace to EVL.
PR link: https://github.com/llvm/llvm-project/pull/185806
This patch tries to drastically simplify resume value handling for the
scalar loop when vectorizing the epilogue.
It uses a simpler, uniform approach for updating all resume values in
the scalar loop:
1. Create ResumeForEpilogue recipes for all scalar resume phis in the
main loop (the epilogue plan will have exactly the same scalar resume
phis, in exactly the same order)
2. Update ::execute for ResumeForEpilogue to set the underlying value
when executing. This is not super clean, but allows easy lookup of the
generated IR value when we update the resume phis in the epilogue. Once
we connect the 2 plans together explicitly, this can be removed.
3. Use the list of ResumeForEpilogue VPInstructions from the main loop
to update the resume/bypass values from the epilogue.
This simplifies the code quite a bit, makes it more robust (should fix
https://github.com/llvm/llvm-project/issues/179407) and also fixes a
mis-compile in the existing tests (see change in
llvm/test/Transforms/LoopVectorize/AArch64/partial-reduce-sub-epilogue-vec.ll,
where previously we would incorrectly resume using the start value when
the epilogue iteration check failed)
In some cases, we get simpler code, due to additional CSE, in some cases
the induction end value computations get moved from the epilogue
iteration check to the vector preheader. We could try to sink the
instructions as cleanup, but it is probably not worth the trouble.
Fixes https://github.com/llvm/llvm-project/issues/179407.
The patch was intially landed as bd5f9384, but then reverted due to an
underlying issue in narrowInterleaveGroups, described in #185860. The
issue has since been fixed. The reland is simply a conflict-resolved
version of the original patch, which includes an additonal test update.
WidenCast is very similar to Widen recipes.
Fixes#128062.
Restrict PHI nodes that getRangeRef is allowed to recursively examine so
we don't need a "visited" set. And fix createSCEVIter so it creates all
the relevant SCEV nodes before getRangeRef tries to examine them.
The tests that are affected have induction variables that aren't
AddRecs. (Other cases are theoretically affected, but don't seem to show
up in our tests.)