When vectorizing with predication, some loops that were previously
vectorized without zvfhmin/zvfbfmin will no longer be vectorized because
the masked load/store or gather/scatter cost returns illegal.
This is due to a discrepancy where for these costs we check
isLegalElementTypeForRVV but for regular memory accesses we don't.
But for bf16 and f16 vectors we don't actually need the extension
support for loads and stores, so this adds a new function which takes
this into account.
For regular memory accesses we should probably also e.g. return an
invalid cost for i64 elements on zve32x, but it doesn't look like we
have tests for this yet.
We also should probably not be vectorizing these bf16/f16 loops to begin
with if we don't have zvfhmin/zvfbfmin and zfhmin/zfbfmin. I think this
is due to the scalar costs being too cheap. I've added tests for this in
a100f6367205c6a909d68027af6a8675a8091bd9 to fix in another patch.
Align the tests closer with what we eventually intend to enable by
default on RISC-V by using
-prefer-predicate-over-epilogue=predicate-else-scalar-epilogue, instead
of dropping vectorization entirely with predicate-dont-vectorize.
Also adjust the non-EVL run lines so that they use
-prefer-predicate-over-epilogue=scalar-epilogue instead of
-force-tail-folding-style=none, so we're only testing one type of flag
instead of a combination of two.
This isn't needed after we set the tail folding style to data-with-evl
via TTI in #148686. Also rename the tests to reflect the fact they're
no longer forcing the tail folding style.
VPVectorPointer for part 0 is just the pointer operand. Simplify it
after unrolling. This removes a large number of redundant GEPs with
index 0.
PR: https://github.com/llvm/llvm-project/pull/149735
This patch adds a new ExtractLane VPInstruction which extracts across
multiple parts using a wide index, to be used in combination with
FirstActiveLane.
The patch updates early-exit codegen to use it instead of ExtractElement,
which is only per-part. With this change, interleaving should work
correctly with early-exit loops.
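
As a rough illustration of extracting across parts with a wide index, the
sketch below models each unrolled part as a plain std::vector. The helper
names firstActiveLane and extractLane are hypothetical stand-ins, not the
VPlan implementation, and returning the total lane count when no mask bit
is set is an assumption of this model only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One "part" is the VF-wide vector produced by one unrolled copy of the body.
using Part = std::vector<int>;
using MaskPart = std::vector<bool>;

// Hypothetical model of FirstActiveLane over all parts: index of the first
// set mask bit, counted across the concatenation of all parts. Returns the
// total number of lanes if no bit is set (assumption of this sketch).
std::size_t firstActiveLane(const std::vector<MaskPart> &Masks) {
  std::size_t Index = 0;
  for (const MaskPart &M : Masks)
    for (bool Bit : M) {
      if (Bit)
        return Index;
      ++Index;
    }
  return Index;
}

// Hypothetical model of ExtractLane: the wide index selects a part
// (WideIndex / VF) and a lane within that part (WideIndex % VF).
int extractLane(const std::vector<Part> &Parts, std::size_t WideIndex) {
  assert(!Parts.empty());
  std::size_t VF = Parts.front().size();
  assert(VF > 0 && WideIndex < VF * Parts.size());
  return Parts[WideIndex / VF][WideIndex % VF];
}
```

A per-part extract can only see lanes 0..VF-1 of a single part, which is
why an index that spans all parts is needed once the loop is interleaved.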
The patch removes the restrictions added in 6f43754e9 (#145877), but
does not yet automatically select interleave counts > 1 for early-exit
loops.
I'll share a patch as follow-up. The cost of extracting a lane adds
non-trivial overhead in the exit block, so that should be considered
when picking the interleave count.
PR: https://github.com/llvm/llvm-project/pull/148817
Materialize constant vector trip counts before ::execute, if the trip
count can be computed as (Original TC / (VF * UF)) * (VF * UF). For now
this excludes cases where the tail is folded or a scalar epilogue is
required.
This enables removing a number of redundant branches from the middle
block.
For now this is also only done when not vectorizing the epilogue, as the
simplification complicates stitching the 2 plans together.
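
For reference, the trip count formula above can be written as a small
standalone function. This is only a sketch with hypothetical helper names
(vectorTripCount, remainingIterations), assuming a fixed VF and no tail
folding or required scalar epilogue.

```cpp
#include <cassert>
#include <cstdint>

// Number of scalar iterations covered by the vector loop:
// (Original TC / (VF * UF)) * (VF * UF), using integer division.
std::uint64_t vectorTripCount(std::uint64_t OrigTC, std::uint64_t VF,
                              std::uint64_t UF) {
  assert(VF > 0 && UF > 0);
  std::uint64_t Step = VF * UF;
  return (OrigTC / Step) * Step;
}

// Iterations left for the scalar loop once the vector loop has finished.
std::uint64_t remainingIterations(std::uint64_t OrigTC, std::uint64_t VF,
                                  std::uint64_t UF) {
  return OrigTC - vectorTripCount(OrigTC, VF, UF);
}
```

When all three inputs are constants, both results are constants too, so a
comparison of the vector trip count against the original trip count can be
decided at compile time.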
PR: https://github.com/llvm/llvm-project/pull/142309
Handle mem checks known to be false in getMemRuntimeChecks the same way
as SCEV checks known to be false in getSCEVChecks. This ensures such
redundant check blocks are not added in the first place.
Update tests with first-order recurrences, for which checking both the
scalar resume and exit values is interesting, to have variable trip
counts. This avoids the branch in middle.block being folded away by
https://github.com/llvm/llvm-project/pull/142309.
For similar reasons, also update check-prof-info.ll.
There are a number of cases for which SCEV may not be able to prove a
predicate will always be true/false, which may be simplified to a
constant during expansion (see discussion in
https://github.com/llvm/llvm-project/pull/131538).
Bail out early if runtime checks are known to always fail, as the
vector loop generated later will never execute.
Now that support for masked loads/stores of interleave groups has
landed, we can enable the loop vectorizer to generate masked interleave
access where applicable.
This improves vectorization in several ways:
* Internal predication support: This enables interleave group
vectorization for loops with internal control flow predication, provided
all members of the group share the same predicate. Gaps in interleave
groups are still not efficiently handled by masking, so masking for gaps
remains disabled for now.
* Tail folding: This allows tail folding of loops with interleave groups
by using masking. Without this, vectorized loops with interleaves would
fall back to using separate gather/scatter accesses, which can be
significantly less efficient.
* Scalable vector support: Currently, only scalable vector types are
supported for masked interleave lowering. Fixed-length vector support
will be enabled in the future.
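
One way to picture a masked interleave access where all members share the
same predicate: each per-iteration mask bit is repeated once per group
member to form the mask of the wide access. The sketch below models this
on fixed-size std::vector masks; replicateMask is a hypothetical helper
written for illustration and is not the lowering used for scalable vectors.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Repeat every mask bit Factor times, so that iteration i guards the
// Factor consecutive memory elements belonging to that iteration.
std::vector<bool> replicateMask(const std::vector<bool> &Mask,
                                std::size_t Factor) {
  assert(Factor > 0);
  std::vector<bool> Wide;
  Wide.reserve(Mask.size() * Factor);
  for (bool Bit : Mask)
    for (std::size_t I = 0; I < Factor; ++I)
      Wide.push_back(Bit);
  return Wide;
}
```

This only works when every member uses the same predicate; a gap in the
group would need a different bit per member, which is the case that stays
disabled.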
As interleave access is not yet supported with tail folding by EVL, that
functionality is temporarily disabled. We are going to create another
patch to support it.
Co-authored-by: Philip Reames <preames@rivosinc.com>
Also clamp the max VF when maximizing vector bandwidth by the maximum
trip count. Otherwise we may end up choosing a VF for which the vector
loop never executes.
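
A minimal sketch of the clamping idea, assuming fixed-width power-of-two
VFs and using 0 to mean "maximum trip count unknown". The function name
clampVFByMaxTripCount is hypothetical and the real heuristics consider
more than this.

```cpp
#include <cassert>
#include <cstdint>

// Return the largest power-of-two VF that is <= MaxVF and <= MaxTC.
// MaxTC == 0 is treated as "unknown", in which case MaxVF is kept.
std::uint64_t clampVFByMaxTripCount(std::uint64_t MaxVF, std::uint64_t MaxTC) {
  assert(MaxVF > 0 && (MaxVF & (MaxVF - 1)) == 0 && "VF must be a power of 2");
  if (MaxTC == 0)
    return MaxVF;
  std::uint64_t Clamped = 1;
  // Compare against the halved bounds to avoid overflow when doubling.
  while (Clamped <= MaxTC / 2 && Clamped <= MaxVF / 2)
    Clamped *= 2;
  return Clamped;
}
```

With a maximum trip count of 5 and a bandwidth-maximized VF of 16, the
vector loop would need 16 iterations to run even once, so the sketch
clamps the VF down to 4.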
PR: https://github.com/llvm/llvm-project/pull/149794
This reverts commit 25e97fc420f8ecc43fbabadfe9767b4163e6ee36.
The original commit was reverted due to a crash in llvm-test-suite. The
crash stemmed from a multiply reduction, which isn't supported for
scalable VFs on RISC-V. But for EVL tail folding we only support
scalable VFs, so when -force-tail-folding-style=data-with-evl is
specified we check to see if there's a scalable VF, and fall back to
data-without-lane-mask if there isn't.
This is done in setTailFoldingStyles, but previously we were only
checking if the forced tail folding style was legal, not the style
returned by TTI.
This version fixes this by checking the actual computed tail folding
style and not just the forced one, and adds a test for the crash in
llvm/test/Transforms/LoopVectorize/RISCV/low-trip-count.ll
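
The difference between the two versions can be sketched as follows. The
enum is a reduced, illustrative subset of tail folding styles, and
chooseStyle and legalizeStyle are hypothetical names; the point is only
that the legality fallback is applied to the style that was actually
selected, whether it came from the flag or from TTI.

```cpp
#include <cassert>
#include <optional>

// Reduced, illustrative subset of tail folding styles.
enum class TailFoldingStyle { None, Data, DataWithoutLaneMask, DataWithEVL };

// EVL tail folding needs a scalable VF; otherwise fall back.
TailFoldingStyle legalizeStyle(TailFoldingStyle Computed, bool HasScalableVF) {
  if (Computed == TailFoldingStyle::DataWithEVL && !HasScalableVF)
    return TailFoldingStyle::DataWithoutLaneMask;
  return Computed;
}

// Pick the forced style if one was given, else the TTI preference, and
// legalize the result in both cases.
TailFoldingStyle chooseStyle(std::optional<TailFoldingStyle> Forced,
                             TailFoldingStyle FromTTI, bool HasScalableVF) {
  return legalizeStyle(Forced.value_or(FromTTI), HasScalableVF);
}
```

Checking only the forced style would let a DataWithEVL preference coming
from TTI through unchanged, even when no scalable VF is available.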
Previously we fell back to just simplifying the branch cond to true
since one of the phis was a VPEVLBasedIVPHIRecipe. However this should
be fine to replace with its start value.
In preparation to eventually make EVL tail folding the default, this
patch sets DataWithEVL as the preferred tail folding style for RISC-V,
but doesn't enable tail folding by default.
And although tail folding isn't enabled by default, the loop vectorizer
will actually tail fold loops with a small trip count, so this will
cause some EVL vectorized loops to be generated in the default
configuration.
The EVL tail folding work is still not complete, e.g. we still need to
handle interleave groups etc., see #123069, but a lot of these missing
features also apply to the data (masked) tail folding strategy, which is
the default anyway.
The overall performance picture is much better: on TSVC, EVL tail
folding is faster than data on every benchmark on the spacemit-x60[^1]:
https://lnt.lukelau.me/db_default/v4/nts/755?compare_to=756
And on SPEC CPU 2017 we see a geomean improvement[^2]:
https://lnt.lukelau.me/db_default/v4/nts/751?compare_to=753
This is likely due to masked instructions generally being less
performant on the spacemit-x60, up to twice as slow:
https://camel-cdr.github.io/rvv-bench-results/bpi_f3/index.html
[^1]: These benchmarks don't exactly give the same performance numbers
as this patch, but it's a good indicator that EVL tail folding is
generally faster than masked tail folding.
[^2]: The large code size increase in 505.mcf_r is due to a function
being inlined now
Currently we may try to vectorize the epilogue with a scalable VF, even
if there are no remaining iterations after the main vector loop with a
fixed VF.
Update selectEpilogueVectorizationFactor to always compute the number of
remaining iterations and exit early if no epilogue iterations remain.
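
The early exit can be sketched like this, assuming a constant trip count
and a fixed main-loop VF. pickEpilogueVF is a hypothetical helper, and
choosing the largest candidate that fits is a placeholder for the real,
cost-based selection.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Returns no value if the main vector loop leaves nothing for an epilogue,
// otherwise the largest candidate VF that fits the remaining iterations
// (placeholder policy, for illustration only).
std::optional<std::uint64_t>
pickEpilogueVF(std::uint64_t TC, std::uint64_t MainVF, std::uint64_t UF,
               const std::vector<std::uint64_t> &CandidateVFs) {
  assert(MainVF > 0 && UF > 0);
  std::uint64_t Remaining = TC % (MainVF * UF);
  if (Remaining == 0)
    return std::nullopt; // Nothing left: do not vectorize the epilogue.
  std::optional<std::uint64_t> Best;
  for (std::uint64_t VF : CandidateVFs)
    if (VF <= Remaining && (!Best || VF > *Best))
      Best = VF;
  return Best;
}
```

The remaining-iteration count is computed before any candidate is looked
at, so the kind of epilogue VF (fixed or scalable) cannot bypass the check.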
Fixes https://github.com/llvm/llvm-project/issues/149726
PR: https://github.com/llvm/llvm-project/pull/149789
This patch includes the following changes:
1. Merge riscv-vector-reverse-output.ll into riscv-vector-reverse.ll,
and only check the generated LLVM IR.
2. Add vplan-riscv-vector-reverse.ll to preserve the original debug
output checks from riscv-vector-reverse.ll.
Update LV to vectorize maxnum/minnum reductions without fast-math flags,
by adding an extra check in the loop if any inputs to maxnum/minnum are
NaN, due to maxnum/minnum behavior w.r.t. signaling NaNs. Signed zeros
are already handled consistently by maxnum/minnum.
If any input is NaN,
* exit the vector loop,
* compute the reduction result up to the vector iteration that contained
  NaN inputs, and
* resume in the scalar loop.
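
The control flow of the three steps above can be modeled in scalar C++.
This is a sketch only: std::fmax stands in for maxnum, the "vector loop"
is an ordinary loop over chunks of VF elements, and the names MaxResult
and maxReduceWithNaNExit are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct MaxResult {
  float Value;             // Final reduction result.
  std::size_t ResumeIndex; // First element handled by the scalar loop.
};

MaxResult maxReduceWithNaNExit(const std::vector<float> &In, float Start,
                               std::size_t VF) {
  assert(VF > 0);
  float Acc = Start;
  std::size_t I = 0;
  // "Vector loop": process VF elements per iteration.
  for (; I + VF <= In.size(); I += VF) {
    bool AnyNaN = false;
    for (std::size_t L = 0; L < VF; ++L)
      AnyNaN |= std::isnan(In[I + L]);
    if (AnyNaN)
      break; // Exit; Acc holds the result of all earlier chunks.
    for (std::size_t L = 0; L < VF; ++L)
      Acc = std::fmax(Acc, In[I + L]);
  }
  std::size_t Resume = I;
  // Scalar loop: handles the chunk containing NaN and everything after it.
  for (; I < In.size(); ++I)
    Acc = std::fmax(Acc, In[I]);
  return {Acc, Resume};
}
```

The chunk containing the NaN is never reduced in the vector loop, so the
scalar loop sees exactly the same inputs, in the same order, as the
original scalar code would from that point on.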
New recurrence kinds are added for reductions using maxnum/minnum
without fast-math flags.
PR: https://github.com/llvm/llvm-project/pull/148239
Currently if MaxBandwidth is enabled, the register pressure is checked
for each VF. This changes that to only perform the check if the VF
would not otherwise have been considered by the LoopVectorizer had
MaxBandwidth not been enabled.
Theoretically this allows for higher VFs to be considered than would
otherwise be deemed "safe" (from a regpressure perspective), but more
concretely this reduces the amount of work done at compile-time when
MaxBandwidth is enabled.
Simplify the handling of exit users by generating all extracts first
(safe option), and have FOR handling optimize the extracts, similar to
what is already done for reductions and inductions.
NFC modulo first-order recurrence extract order in middle block.
In getScaledReductions, for the case where we try to match a partial
reduction of the form:
  %phi = phi i32 ...
  ...
  %add = add i32 %phi, %zext
where
  %zext = zext i8 %some_val to i32
we should ensure that %zext is actually inside the loop.
Fixes https://github.com/llvm/llvm-project/issues/148260
This preserves the nuw/nsw flags on widened truncs by checking for
TruncInst in the VPIRFlags constructor.
The motivation for this is to be able to fold away some redundant truncs
feeding into uitofps (or potentially narrow the inductions feeding them).
This reverts commit d43a80936d437d217d5a6dbbaa5fb131c27e7085.
With the correctness issue blocking the recommit finally fixed
(5d01697ec6cb), again unconditionally check if accesses are completely
before or after each other.
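
On concrete addresses, "completely before or after each other" amounts to
an interval disjointness test. The sketch below uses half-open byte ranges
and hypothetical names (AccessRange, completelyBeforeOrAfter); the actual
analysis reasons about symbolic ranges rather than fixed integers.

```cpp
#include <cassert>
#include <cstdint>

// Half-open byte range [Start, End) touched by one access over the loop.
struct AccessRange {
  std::uint64_t Start;
  std::uint64_t End;
};

// True if A ends before B starts, or B ends before A starts.
bool completelyBeforeOrAfter(const AccessRange &A, const AccessRange &B) {
  assert(A.Start <= A.End && B.Start <= B.End);
  return A.End <= B.Start || B.End <= A.Start;
}
```

If the test succeeds, the two accesses never touch the same bytes, so no
dependence between them has to be considered.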
If I understand correctly, there was a point where we needed this,
before it was implied by Zvl*b.
Now that it is implied, and we use -mattr=+v in pretty much every test,
we can remove it.
In unroll-in-loop-vectorizer.ll we can force a VF of 1 instead by using
-force-vector-width=1, and in scalable-basics.ll the two RUN lines were
the same so I merged them.
Update isDereferenceableAndAlignedPointer to make use of dereferenceable
assumptions with variable sizes via SCEV.
To do so, factor out the logic to check via an assumption to a helper,
and use SE to check if the access size is less than the dereferenceable
size.
PR: https://github.com/llvm/llvm-project/pull/128436
Legal::isMaskRequired may be overly conservative and also return true
when no mask is actually required.
Use isPredicatedInst from the cost model instead, which fixes a
cost-model divergence between legacy and VPlan cost model where the
legacy cost model incorrectly assumed some loads were predicated.
Fixes https://github.com/llvm/llvm-project/issues/148431.
Interleaving does not currently work properly when vectorising loops
with uncountable early exits. Interleaving is already disabled for
normal vectorisation and for the pragma/hint - this patch also disables
it when using -force-vector-interleave.