Spent a bunch of time tracing down an odd issue "in SCEV" which turned out
to be the fact that SCEV doesn't have access to TTI. As a result, the only
way for it to get range facts on vscales (to avoid collapsing ranges of
element counts and type sizes to trivial ranges on multiplies) is to look
at the vscale_range attribute. Since vscale_range is set by clang by
default, manually setting it in the tests shouldn't interfere with the
test intent.
After some discussion and experimentation, we have seen that changing the default number of vector register bits to LMUL=2 strikes a sweet spot.
Whilst we could be clever here and make the vectorizer smarter about dynamically selecting an LMUL that
a) Doesn't affect register pressure
b) Suitable for the microarchitecture
we would need to teach its heuristics about RISC-V register grouping specifics.
Instead this just does the easy, pragmatic thing by changing the default to a safe value that doesn't affect register pressure signifcantly[1], but should increase throughput and unlock more interleaving.
[1] Register spilling when compiling sqlite at various levels of `-riscv-v-register-bit-width-lmul`:
LMUL=1 2573 spills
LMUL=2 2583 spills
LMUL=4 2819 spills
LMUL=8 3256 spills
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D143723
This work follows on from D142109 and addresses a possible regression
when we know the loop iteration counter cannot overflow.
When we know the overflow-check always evaluates to false, it's better to
use the other style of tail folding where it assumes a runtime check was
added, because that avoids having to calculate a modified trip-count.
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D142894
Instcombine prefers this canonical form (see getPreferredVectorIndex),
as does IRBuilder when passing the index as an integer so we may as
well use the prefered form from creation.
NOTE: All test changes are mechanical with nothing else expected
beyond a change of index type from i32 to i64.
Differential Revision: https://reviews.llvm.org/D140983
The motivation here is to a) bring us closer into alignment with AArch64 under the assumption that codepath is better tested, and b) simplify pattern matching in an upcoming change.
The immediate impact is a significant IR reduction but a fairly minimal change in the generated assembly. Due to a difference in expansion behavior we get a saturating add vs an unsaturating one for the old code, but that's about it. This difference comes down to different handling of overflow, which doesn't seem to be possible here anyways, so the assembly codegen is arguably a minor regression. I don't expect that to matter in practice.
Differential Revision: https://reviews.llvm.org/D129221
When the loop vectoriser encounters a known low trip count it tries
to create a single predicated loop in order to get the benefit of
vectorisation and eliminate the scalar tail. However, until now the
vectoriser prevented the use of scalable vectors in this case due
to concerns in the past about stability. I believe that tail-folded
loops using scalable vectors are now sufficiently well tested that
we can enable this. For the same reason I've also enabled it when
optimising for code size too.
Tests added here:
Transforms/LoopVectorize/AArch64/sve-low-trip-count.ll
Transforms/LoopVectorize/AArch64/sve-tail-folding-optsize.ll
Transforms/LoopVectorize/RISCV/low-trip-count.ll
Differential Revision: https://reviews.llvm.org/D121595