Follow up on 7c4f188 ([LV] Support multiplies by constants when forming
scaled reductions), introducing m_APInt, and improving code around
canConstantBeExtended: we change canConstantBeExtended to take an APInt.
VPWidenCastRecipes with Trunc opcodes where missing the correct OpType
for IR flags. Update createWidenCast to set the correct flags for
truncs, and use it consistenly.
Fixes https://github.com/llvm/llvm-project/issues/162374.
Instead of re-setting the start value of the canonical IV when
vectorizing the epilogue we can emit an Add VPInstruction to provide
canonical IV value, adjusted by the resume value from the main loop.
This is in preparation to make the canonical IV a VPValue defined by
loop regions. It ensures that the canonical IV always starts at 0.
PR: https://github.com/llvm/llvm-project/pull/161589
Consistently scalarize loads used as part of address computations across
all uses in the loop. This aligns the VPlan and legacy cost model and
fixes a divergence crash. It doesn't matter if the load and address
users are in different blocks, as long as they are in the same loop, the
scalar value can be used. This removes a number of insert/extracts.
This reverts commit f80c0baf058dbdc5 and 94eade61a02ae5.
Recommit a small fix for targets using prefersVectorizedAddressing.
Original message:
Update VPReplicateRecipe::computeCost to compute costs of more
replicating loads/stores.
There are 2 cases that require extra checks to match the legacy cost
model:
1. If the pointer is based on an induction, the legacy cost model passes
its SCEV to getAddressComputationCost. In those cases, still fall back
to the legacy cost. SCEV computations will be added as follow-up
2. If a load is used as part of an address of another load, the legacy
cost model skips the scalarization overhead. Those cases are currently
handled by a usedByLoadOrStore helper.
Note that getScalarizationOverhead also needs updating, because when the
legacy cost model computes the scalarization overhead, scalars have not
been collected yet, so we can't each for replicating recipes to skip
their cost, except other loads. This again can be further improved by
modeling inserts/extracts explicitly and consistently, and compute costs
for those operations directly where needed.
PR: https://github.com/llvm/llvm-project/pull/160053
This reverts commit f61be4352592639a0903e67a9b5d3ec664ad4d23.
Recommit a small fix handling scalarization overhead consistently with
legacy cost model if a load is used directly as operand of another
memory operation, which fixes
https://github.com/llvm/llvm-project/issues/161404.
Original message:
Update VPReplicateRecipe::computeCost to compute costs of more
replicating loads/stores.
There are 2 cases that require extra checks to match the legacy cost
model:
1. If the pointer is based on an induction, the legacy cost model passes
its SCEV to getAddressComputationCost. In those cases, still fall back
to the legacy cost. SCEV computations will be added as follow-up
2. If a load is used as part of an address of another load, the legacy
cost model skips the scalarization overhead. Those cases are currently
handled by a usedByLoadOrStore helper.
Note that getScalarizationOverhead also needs updating, because when the
legacy cost model computes the scalarization overhead, scalars have not
been collected yet, so we can't each for replicating recipes to skip
their cost, except other loads. This again can be further improved by
modeling inserts/extracts explicitly and consistently, and compute costs
for those operations directly where needed.
PR: https://github.com/llvm/llvm-project/pull/160053
We can create partial reductions for multiplies with constants, if the
constant is small enough to be extended from source to destination type
w/o changing the value.
This only handles constant on the right side of a multiply, relying on
other passes to canonicalize the input.
Alive2 Proofs: https://alive2.llvm.org/ce/z/iWRMr6
PR: https://github.com/llvm/llvm-project/pull/161092
If there are direct memory op users of the newly scalarized load,
their cost may have changed because there's no scalarization
overhead for the operand. Update it.
This ensures assigning consistent costs to scalarized memory
instructions that themselves have scalarized memory instructions as
operands.
LV does not preserve LCSSA, it constructs it just before processing a
loop to vectorize. Runtime check expressions are invariant to that loop,
so expanding them should not break LCSSA form for the loop we are about
to vectorize.
This fixes a crash when discarding instructions generated when expanding
runtime checks, if the expansion introduces LCSSA phis for values from
other loops which are not in LCSSA form: we would introduce new LCSSA
phis and update all outside users, some of which are not created by the
expander and cannot be cleaned up.
Fixes https://github.com/llvm/llvm-project/issues/158259.
PR: https://github.com/llvm/llvm-project/pull/159556
Update VPReplicateRecipe::computeCost to compute costs of more
replicating loads/stores.
There are 2 cases that require extra checks to match the legacy cost
model:
1. If the pointer is based on an induction, the legacy cost model passes
its SCEV to getAddressComputationCost. In those cases, still fall back
to the legacy cost. SCEV computations will be added as follow-up
2. If a load is used as part of an address of another load, the legacy
cost model skips the scalarization overhead. Those cases are currently
handled by a usedByLoadOrStore helper.
Note that getScalarizationOverhead also needs updating, because when the
legacy cost model computes the scalarization overhead, scalars have not
been collected yet, so we can't each for replicating recipes to skip
their cost, except other loads. This again can be further improved by
modeling inserts/extracts explicitly and consistently, and compute costs
for those operations directly where needed.
PR: https://github.com/llvm/llvm-project/pull/160053
In order to avoid conflating the legacy CSE with the VPlan-based one,
rename the legacy CSE and insert a FIXME to clarify the nature of the
legacy CSE.
Additional CSE opportunities are exposed after converting to concrete
recipes/dissolving regions and materializing various expressions. Run
CSE later, to capitalize on some of the late opportunities.
PR: https://github.com/llvm/llvm-project/pull/160572
Move creation of the minimum iteration check for the epilogue vector
loop to VPlan. This is a first step towards breaking up and moving
skeleton creation for epilogue vectorization to VPlan.
It moves most logic out of EpilogueVectorizerEpilogueLoop: the minimum
iteration check is created directly in VPlan, connecting the check
blocks from the main vector loop is done as post-processing. Next steps
are to move connecting and updating the branches from the check blocks
to VPlan, as well as updating the incoming values for phis.
Test changes are improvements due to folding of live-ins.
PR: https://github.com/llvm/llvm-project/pull/157545
Check if the scale-factor of the accumulator is the same as the request
ScaleFactor in tryToCreatePartialReductions.
This prevents creating partial reductions if not all instructions in the
reduction chain form partial reductions. e.g. because we do not form a
partial reduction for the loop exit instruction.
Currently code-gen works fine, because the scale factor of
VPPartialReduction is not used during ::execute, but it means we compute
incorrect cost/register pressure, because the partial reduction won't
reduce to the specified scaling factor.
PR: https://github.com/llvm/llvm-project/pull/158603
In some cases, safe-divisor selects can be hoisted out of the vector
loop. Catching all cases in the legacy cost model isn't possible, in
particular checking if all conditions guarding a division are loop
invariant.
Instead, check in planContainsAdditionalSimplifications if there are any
hoisted safe-divisor selects. If so, don't compare to the more
inaccurate legacy cost model.
Fixes https://github.com/llvm/llvm-project/issues/160354.
Fixes https://github.com/llvm/llvm-project/issues/160356.
This ensures each scalarized member has an accurate cost, matching the
cost it would have if it would not have been considered for an
interleave group.
For UDiv/SDiv with invariant divisors, the created selects will be
hoisted out. Don't compute their cost for each iteration, to match the
more accurate VPlan-based cost modeling.
Fixes https://github.com/llvm/llvm-project/issues/159402.
Loads of addresses are scalarized and have their costs computed w/o
scalarization overhead. Consistently apply this logic also to
non-uniform loads that are already scalarized, to ensure their costs are
consistent with other scalarized lodas that are used as addresses.
Always add pointers proved to be uniform via legal/SCEV to worklist.
This extends the existing logic to handle a few more pointers known to
be uniform.
After https://github.com/llvm/llvm-project/pull/153643, there may be a
BranchOnCond with constant condition in the entry block.
Simplify those in removeBranchOnConst. This removes a number of
redundant conditional branch from entry blocks.
In some cases, it may also make the original scalar loop unreachable,
because we know it will never execute. In that case, we need to remove
the loop from LoopInfo, because all unreachable blocks may dominate each
other, making LoopInfo invalid. In those cases, we can also completely
remove the loop, for which I'll share a follow-up patch.
Depends on https://github.com/llvm/llvm-project/pull/153643.
PR: https://github.com/llvm/llvm-project/pull/154510
Follow up on 7fb3a91 ([PatternMatch] Introduce match functor) to
introduce the VPlanPatternMatch version of the match functor to shorten
some idioms.
Co-authored-by: Luke Lau <luke@igalia.com>
Stacked on #156923
In https://godbolt.org/z/8svWaredK, we spill a lot on RISC-V because
whilst the largest element type is i8, we generate a bunch of pointer
vectors for gathers and scatters. This means the VF chosen is quite high
e.g. <vscale x 16 x i8>, but we end up using a bunch of <vscale x 16 x
i64> m8 registers for the pointers.
This was briefly fixed by #132190 where we computed register pressure in
VPlan and used it to prune VFs that were likely to spill. The legacy
cost model wasn't able to do this pruning because it didn't have
visibility into the pointer vectors that were needed for the
gathers/scatters.
However VF pruning was restricted again to just the case when max
bandwidth was enabled in #141736 to avoid an AArch64 regression, and
restricted again in #149056 to only prune VFs that had max bandwidth
enabled.
On RISC-V we take advantage of register grouping for performance and
choose a default of LMUL 2, which means there are 16 registers to work
with – half the number as SVE, so we encounter higher register pressure
more frequently.
As such, we likely want to always consider pruning VFs with high
register pressure and not just the VFs from max bandwidth.
This adds a TTI hook to opt into this behaviour for RISC-V which fixes
the motivating godbolt example above. When last checked this
significantly reduces the number of spills on SPEC CPU 2017, up to
80% on 538.imagick_r.
This patch fix the assertion when the `isUniform` (from legacy model)
and `isSingleScalar`(from Vplan-based model) mismatch.
The simplify test that cause assertion
```
loop:
loadA = load %a => %a is loop invariant.
loadB = load %LoadA
...
```
In the legacy cost model, it cannot analysis that addr of `%loadB` is
uniform but in the Vplan-based cost model both addr in `%loadA` and
`loadB` is single scalar.
Full test caused crash: https://llvm.godbolt.org/z/zEG8YKjqh.
---------
Co-authored-by: Luke Lau <luke@igalia.com>
This adds initial support to LoopVectorizationLegality to analyze loops
with side effects (particularly stores to memory) and an uncountable
exit. This patch alone doesn't enable any new transformations, but
does give clearer reasons for rejecting vectorization for such a loop.
The intent is for a loop like the following to pass the specific checks,
and only be rejected at the end until the transformation code is
committed:
```
// Assume a is marked restrict
// Assume b is known to be large enough to access up to b[N-1]
for (int i = 0; i < N; ++) {
a[i]++;
if (b[i] > threshold)
break;
}
```
Follow up on 528b13d ([SCEVExp] Add helper to clean up dead instructions
after expansion.) to hoist the SCEVExapnder::eraseDeadInstructions call
from LoopVectorize into the LoopUtils APIs add[Diff]RuntimeChecks, so
that other callers (LoopDistribute and LoopVersioning) can benefit from
the patch.
Update hasChecks to use getSCEV/MemRuntimeChecks(), which automatically
handles checking for known-false checks.
This improves a few cases where we previously did not add metadata to
disable runtime unrolling, due to runtime checks, even though no runtime
checks are needed.
The second operand when using a safe divisor will always be a select in
the loop, so won't be invariant; don't treat it as such.
This fixes a divergence with legacy and VPlan based cost model.
Fixes https://github.com/llvm/llvm-project/issues/156066.