Extend replicateByVF added in #142433 (aa240293190) to also explicitly
unroll replicating VPInstructions.
Now the only remaining case where we replicate for all lanes is
VPReplicateRecipes in replicate regions.
PR: https://github.com/llvm/llvm-project/pull/155102
Stacked on #156923
In https://godbolt.org/z/8svWaredK, we spill a lot on RISC-V because
whilst the largest element type is i8, we generate a bunch of pointer
vectors for gathers and scatters. This means the VF chosen is quite high
e.g. <vscale x 16 x i8>, but we end up using a bunch of <vscale x 16 x
i64> m8 registers for the pointers.
This was briefly fixed by #132190 where we computed register pressure in
VPlan and used it to prune VFs that were likely to spill. The legacy
cost model wasn't able to do this pruning because it didn't have
visibility into the pointer vectors that were needed for the
gathers/scatters.
However VF pruning was restricted again to just the case when max
bandwidth was enabled in #141736 to avoid an AArch64 regression, and
restricted again in #149056 to only prune VFs that had max bandwidth
enabled.
On RISC-V we take advantage of register grouping for performance and
choose a default of LMUL 2, which means there are 16 registers to work
with – half the number as SVE, so we encounter higher register pressure
more frequently.
As such, we likely want to always consider pruning VFs with high
register pressure and not just the VFs from max bandwidth.
This adds a TTI hook to opt into this behaviour for RISC-V which fixes
the motivating godbolt example above. When last checked this
significantly reduces the number of spills on SPEC CPU 2017, up to
80% on 538.imagick_r.
This patch fix the assertion when the `isUniform` (from legacy model)
and `isSingleScalar`(from Vplan-based model) mismatch.
The simplify test that cause assertion
```
loop:
loadA = load %a => %a is loop invariant.
loadB = load %LoadA
...
```
In the legacy cost model, it cannot analysis that addr of `%loadB` is
uniform but in the Vplan-based cost model both addr in `%loadA` and
`loadB` is single scalar.
Full test caused crash: https://llvm.godbolt.org/z/zEG8YKjqh.
---------
Co-authored-by: Luke Lau <luke@igalia.com>
Track which ops already have been narrowed, to avoid narrowing the same
operation multiple times. Repeated narrowing will lead to incorrect
results, because we could first narrow from an interleave group -> wide
load, and then narrow the wide load > single-scalar load.
Fixes thttps://github.com/llvm/llvm-project/issues/156190.
Split GEPs that have more than one non-zero offset into two GEPs. This
is in preparation for the ptradd migration, which can only represent
such GEPs.
This also enables CSE and LICM of the common base.
This adds initial support to LoopVectorizationLegality to analyze loops
with side effects (particularly stores to memory) and an uncountable
exit. This patch alone doesn't enable any new transformations, but
does give clearer reasons for rejecting vectorization for such a loop.
The intent is for a loop like the following to pass the specific checks,
and only be rejected at the end until the transformation code is
committed:
```
// Assume a is marked restrict
// Assume b is known to be large enough to access up to b[N-1]
for (int i = 0; i < N; ++) {
a[i]++;
if (b[i] > threshold)
break;
}
```
- In sve-epilog-vscale-fixed.ll file, it tests the preference of
fixed-width epilogue VF vs scalable when costs are equal. This NFC patch
is changing the TC in the test case to be unknown to avoid folding the
epilogue in future LV changes.
In simplifyBlends, when normalizing a blend recipe, the first mask that
is used only by the blend and is not all-false is chosen, and its
corresponding incoming value becomes the initial value, with the others
blended into it. At the same time, the mask that is chosen can be
eliminated. However, a multi-user mask might be used by a dead recipe,
which prevents this optimization. This patch moves removeDeadRecipes
before simplifyBlends to eliminate dead recipes, allowing simplifyBlends
to remove more dead masks.
Extracts technically do not use scalars, but vectors, but if the operand
is a single scalar we do not need a vector and they should not block
forming single scalars.
Remove instcombine, simplifycfg and dce from some tests, as they make it
a bit more difficult to see the codegen coming out of LV and most
simplifications are already done on the VPlan-level.
Also modernizes some check lines.
In #157322 we crash because we try to infer a type for a VPReplicate
switch recipe.
My understanding was that these switches should be removed by
VPlanPredicator, but this switch survived through it because it was
unconditional, i.e. had no cases other than the default case.
This fixes#157322 by not emitting any recipes for unconditional
switches to begin with, similar to how we treat unconditional branches.
Look through extractvalue to simplify umul_with_overflow where one of
the operands is 1.
This removes some redundant instructions when expanding SCEVs, which in
turn makes the runtime check cost estimate more accurate, reducing the
minimum iterations for which vectorization is profitable.
PR: https://github.com/llvm/llvm-project/pull/157307
The SCEV predicate in the existing tests for optimizing for size is
known to be false. Add additional test with a predicate that cannot be
proven true/false.
Also generate checks with latest version of script.
Remove instcombine, simplifycfg and dce from some tests, as they make it
a bit more difficult to see the codegen coming out of LV and most
simplifications are already done on the VPlan-level.
Also modernizes some check lines.
Update hasChecks to use getSCEV/MemRuntimeChecks(), which automatically
handles checking for known-false checks.
This improves a few cases where we previously did not add metadata to
disable runtime unrolling, due to runtime checks, even though no runtime
checks are needed.
The second operand when using a safe divisor will always be a select in
the loop, so won't be invariant; don't treat it as such.
This fixes a divergence with legacy and VPlan based cost model.
Fixes https://github.com/llvm/llvm-project/issues/156066.
The 256-bit maximum vector register size control was removed from AVX10
whitepaper, ref: https://cdrdv2.intel.com/v1/dl/getContent/784343
We have warned these options in LLVM21 through #132542. This patch
removes underlying implementations in LLVM22.
Following up from #150368, this moves folding common edge masks into
simplifyBlends.
One test in uniform-blend.ll ended up regressing but after looking at it
closely, it came from a weird (x && !x) edge mask. So I've just included
a simplifcation in this PR to fold that to false.
In #149056 VF pruning was changed so that it only pruned VFs that
stemmed from MaxBandwidth being enabled.
However we always compute register pressure regardless of whether or not
max bandwidth is permitted for any VFs (via
`MaxPermissibleVFWithoutMaxBW`).
This skips the computation if not needed and renames the method for
clarity.
The diff in reg-usage.ll is due to the scalable VPlan not actually
having any maxbandwidth VFs, so I've changed it to check the
fixed-length VF instead, which is affected by maxbandwidth.
This patch splits out the legality checks from PR #151300, following the
landing of PR #128593.
It is a step toward supporting vectorization of early-exit loops that
contain potentially faulting loads.
In this commit, an early-exit loop is considered legal for vectorization
if it satisfies the following criteria:
1. it is a read-only loop.
2. all potentially faulting loads are unit-stride, which is the only
type currently supported by vp.load.ff.
This patch consolidates updating loop metadata and profile info for both
the remainder and vector loops in a single place. This is NFC, modulo
consistently applying vectorization specific metadata also in the
experimental VPlan-native path.
Split off from https://github.com/llvm/llvm-project/pull/154510.
In case of equal costs Prefer epilogue with fixed-width over scalable VF.
That is helpful in cases like post-LTO vectorization where epilogue with
fixed-width VF can be removed when we eventually know that the trip count
is less than the epilogue iterations.
Update evaluatePtrAddrecAtMaxBTCWillNotWrap to support non-constant
sizes in dereferenceable assumptions.
Apply loop-guards in a few places needed to reason about expressions
involving trip counts of the from (BTC - 1).
PR: https://github.com/llvm/llvm-project/pull/156758
This reverts commit f0df1e3dd4ec064821f673ced7d83e5a2cf6afa1.
Recommit with extra check for SCEVCouldNotCompute. Test has been added in
b16930204b.
Original message:
Remove the fall-back to constant max BTC if the backedge-taken-count
cannot be computed.
The constant max backedge-taken count is computed considering loop
guards, so to avoid regressions we need to apply loop guards as needed.
Also remove the special handling for Mul in willNotOverflow, as this
should not longer be needed after 914374624f
(https://github.com/llvm/llvm-project/pull/155300).
PR: https://github.com/llvm/llvm-project/pull/155672