This patch fix the assertion when the `isUniform` (from legacy model)
and `isSingleScalar`(from Vplan-based model) mismatch.
The simplify test that cause assertion
```
loop:
loadA = load %a => %a is loop invariant.
loadB = load %LoadA
...
```
In the legacy cost model, it cannot analysis that addr of `%loadB` is
uniform but in the Vplan-based cost model both addr in `%loadA` and
`loadB` is single scalar.
Full test caused crash: https://llvm.godbolt.org/z/zEG8YKjqh.
---------
Co-authored-by: Luke Lau <luke@igalia.com>
This adds initial support to LoopVectorizationLegality to analyze loops
with side effects (particularly stores to memory) and an uncountable
exit. This patch alone doesn't enable any new transformations, but
does give clearer reasons for rejecting vectorization for such a loop.
The intent is for a loop like the following to pass the specific checks,
and only be rejected at the end until the transformation code is
committed:
```
// Assume a is marked restrict
// Assume b is known to be large enough to access up to b[N-1]
for (int i = 0; i < N; ++) {
a[i]++;
if (b[i] > threshold)
break;
}
```
Follow up on 528b13d ([SCEVExp] Add helper to clean up dead instructions
after expansion.) to hoist the SCEVExapnder::eraseDeadInstructions call
from LoopVectorize into the LoopUtils APIs add[Diff]RuntimeChecks, so
that other callers (LoopDistribute and LoopVersioning) can benefit from
the patch.
Update hasChecks to use getSCEV/MemRuntimeChecks(), which automatically
handles checking for known-false checks.
This improves a few cases where we previously did not add metadata to
disable runtime unrolling, due to runtime checks, even though no runtime
checks are needed.
The second operand when using a safe divisor will always be a select in
the loop, so won't be invariant; don't treat it as such.
This fixes a divergence with legacy and VPlan based cost model.
Fixes https://github.com/llvm/llvm-project/issues/156066.
In #149056 VF pruning was changed so that it only pruned VFs that
stemmed from MaxBandwidth being enabled.
However we always compute register pressure regardless of whether or not
max bandwidth is permitted for any VFs (via
`MaxPermissibleVFWithoutMaxBW`).
This skips the computation if not needed and renames the method for
clarity.
The diff in reg-usage.ll is due to the scalable VPlan not actually
having any maxbandwidth VFs, so I've changed it to check the
fixed-length VF instead, which is affected by maxbandwidth.
This patch splits out the legality checks from PR #151300, following the
landing of PR #128593.
It is a step toward supporting vectorization of early-exit loops that
contain potentially faulting loads.
In this commit, an early-exit loop is considered legal for vectorization
if it satisfies the following criteria:
1. it is a read-only loop.
2. all potentially faulting loads are unit-stride, which is the only
type currently supported by vp.load.ff.
This patch consolidates updating loop metadata and profile info for both
the remainder and vector loops in a single place. This is NFC, modulo
consistently applying vectorization specific metadata also in the
experimental VPlan-native path.
Split off from https://github.com/llvm/llvm-project/pull/154510.
In case of equal costs Prefer epilogue with fixed-width over scalable VF.
That is helpful in cases like post-LTO vectorization where epilogue with
fixed-width VF can be removed when we eventually know that the trip count
is less than the epilogue iterations.
Introduce a simple common-subexpression-elimination pass at the
VPlan-level, running late during the execution of the VPlan. The
long-term vision is to get rid of the legacy non-VPlan-based cse routine
in LV, but this patch doesn't yet fully subsume it.
We currently only emit the branch weights for the epilogue
iteration count check if there was already branch weight
data for the scalar loop. However, the code makes no use
of the existing branch weight when estimating the
likelihood of taking a particular branch and so we can
just always add the branch weights regardless. These
hints should hopefully improve code generation.
This patch check if the addr is uniform in legacy cost model to align
vplan-based cost model after #150371.
This patch fixes llvm-test-suite assertion
(https://lab.llvm.org/buildbot/#/builders/210/builds/1935) due to cost
model misaligned after #149955 under RISCV.
I've tested this patch (on top of #149955) on the llvm-test-suite
locally with crashed options `rva23u64`, `rva23u64_zvl1024b` and build
successfully.
Since this fix will change LV, I think would be better to create a PR to
fix this.
Move fixing up reduction resume values out of the general ::executePlan
and perform it together with updating induction resume values.
This also allows moving additional bypass block handling to
EpilogueVectorizerEpilogueLoop.
The InterleavedAccess pass already supports transforming
vector-predicated (vp) load/store intrinsics. With this patch, we start
enabling interleaved access under tail folding by EVL.
This patch introduces a new base class, VPInterleaveBase, and a concrete
class, VPInterleaveEVLRecipe. Both the existing VPInterleaveRecipe and
the new VPInterleaveEVLRecipe inherit from and implement
VPInterleaveBase.
Compared to VPInterleaveRecipe, VPInterleaveEVLRecipe adds an EVL
operand to emit vp.load/vp.store intrinsics.
Currently, tail folding by EVL is only supported for scalable
vectorization. Therefore, VPInterleaveEVLRecipe will only emit
interleave/deinterleave intrinsics. Reverse accesses are not yet
implemented, as masked reverse interleaved access under tail folding is
not yet supported.
Fixed#123201
This patch adds a new flag (-enable-wide-lane-mask) which allows
LoopVectorize to generate wider-than-VF active lane masks when it
is safe to do so (i.e. the mask is used for data and control flow).
The transform in extractFromWideActiveLaneMask creates vector
extracts from the first active lane mask in the header & loop body,
modifying the active lane mask phi operands to use the extracts.
An additional operand is passed to the ActiveLaneMask instruction,
the value of which is used as a multiplier of VF when generating the
mask.
By default this is 1, and is updated to UF by
extractFromWideActiveLaneMask.
The motivation for this change is to improve interleaved loops when
SVE2.1 is available, where we can make use of the whilelo instruction
which returns a predicate pair.
This is based on a PR that was created by @momchil-velikov (#81140)
and contains tests which were added there.
This patch adds a new VPlan-based addMinimumIterationCheck, which
replaced the ILV version for the non-epilogue case.
The VPlan-based version constructs a SCEV expression to compute the
minimum iterations, use that to check if the check is known true or
false. Otherwise it creates a VPExpandSCEV recipe and emits a
compare-and-branch.
When using epilogue vectorization, we still need to create the minimum
trip-count-check during the legacy skeleton creation. The patch moves
the definitions out of ILV.
PR: https://github.com/llvm/llvm-project/pull/153643
LoopVectorizationCostModel::expectedCost will only override the cost
returned by getInstructionCost when valid. This patch ensures we do
the same in VPCostContext::getLegacyCost, avoiding the "VPlan cost
model and legacy cost model disagreed" assert in the included test.
In VPWidenRecipe::computeCost for the instructions udiv, sdiv, urem and
srem we fall back on the legacy cost unnecessarily. At this point we
know that the vplan must be functionally correct, i.e. if the
divide/remainder is not safe to speculatively execute then we must have
either:
1. Scalarised the operation, in which case we wouldn't be using a
VPWidenRecipe, or
2. We've inserted a select for the second operand to ensure we don't
fault through divide-by-zero.
For 2) it's necessary to add the select operation to
VPInstruction::computeCost so that we mirror the cost of the legacy cost
model. The only problem with this is that we also generate selects in
vplan for predicated loops with reductions, which *aren't* accounted for
in the legacy cost model. In order to prevent asserts firing I've also
added the selects to precomputeCosts to ensure the legacy costs match
the vplan costs for reductions.
Extend [Specific]Cmp_match to handle floating-point compares, and
introduce m_Cmp that matches both integer and floating-point compares.
Use it in simplifyRecipe to match and simplify the general case of
compares. The change has necessitated a bugfix in
VPReplicateRecipe::execute.
Move the logic to expand SCEVs directly to a late VPlan transform that
expands SCEVs in the entry block. This turns VPExpandSCEVRecipe into an
abstract recipe without execute, which clarifies how the recipe is
handled, i.e. it is not executed like regular recipes.
It also helps to simplify construction, as now scalar evolution isn't
required to be passed to the recipe.
Remove the ArrayRef<const Value*> Args operand from
getOperandsScalarizationOverhead and require that the callers
de-duplicate arguments and filter constant operands.
Removing the Value * based Args argument enables callers where no Value
* operands are available to use the function in a follow-up: computing
the scalarization cost directly for a VPlan recipe.
It also allows more accurate cost-estimates in the future: for example,
when vectorizing a loop, we could also skip operands that are live-ins,
as those also do not require scalarization.
PR: https://github.com/llvm/llvm-project/pull/154126
In setVectorizedCallDecision we attempt to calculate the scalar costs
for vectorisation calls, even for scalable VFs where we already know the
answer is Invalid. We can avoid doing unnecessary work by skipping this
completely for scalable vectors.
After a485e0e, we may not set the vector trip count in
preparePlanForEpilogueVectorLoop if it is zero. We should not choose a
VF * UF that makes the main vector loop dead (i.e. vector trip count is
zero), but there are some cases where this can happen currently.
In those cases, set EPI.VectorTripCount to zero.
There are a couple of places in the loop vectoriser where we
want to calculate the cost of extracting the last lane in a
vector. However, we wrongly assume that asking for the cost
of extracting lane (VF.getKnownMinValue() - 1) is an accurate
representation of the cost of extracting the last lane. For
SVE at least, this is non-trivial as it requires the use of
whilelo and lastb instructions.
To solve this problem I have added a new
getReverseVectorInstrCost interface where the index is used
in reverse from the end of the vector. Suppose a vector has
a given ElementCount EC, the extracted/inserted lane would be
EC - 1 - Index. For scalable vectors this index is unknown at
compile time. I've added a AArch64 hook that better represents
the cost, and also a RISCV hook that maintains compatibility
with the behaviour prior to this PR.
I've also taken the liberty of adding support in vplan for
calculating the cost of VPInstruction::ExtractLastElement.
Currently, VPInterleaveRecipe::execute does not support generating LLVM
IR for interleaved accesses that require a gap mask for scalable VFs.
It would be better to detect and prevent such groups from being
vectorized as interleaved accesses in
LoopVectorizationCostModel::interleavedAccessCanBeWidened, rather than
relying on the TTI function getInterleavedMemoryOpCost to return an
invalid cost.
Materialze Build(Struct)Vectors explicitly for VPRecplicateRecipes, to
serve their users requiring a vector, instead of doing so when unrolling
by VF.
Now we only need to implicitly build vectors in VPTransformState::get
for VPInstructions. Once they are also unrolled by VF we can remove the
code-path alltogether.
PR: https://github.com/llvm/llvm-project/pull/151487
This patch replaces SmallSet<T *, N> with SmallPtrSet<T *, N>. Note
that SmallSet.h "redirects" SmallSet to SmallPtrSet for pointer
element types:
template <typename PointeeType, unsigned N>
class SmallSet<PointeeType*, N> : public SmallPtrSet<PointeeType*, N>
{};
We only have 140 instances that rely on this "redirection", with the
vast majority of them under llvm/. Since relying on the redirection
doesn't improve readability, this patch replaces SmallSet with
SmallPtrSet for pointer element types.
Dissolving the hierarchical VPlan CFG and converting abstract to
concrete recipes can expose additional simplification opportunities.
Do a final run of simplifyRecipes before executing the VPlan.