Add missing test coverage of loops where the vector loop region can be
removed that include replicate recipes as well as nested loops.
Extra test coverage for https://github.com/llvm/llvm-project/pull/108378.
Currently available intrinsics are only ld2/st2, which don't support interleaving factor > 2.
This patch teaches the LV to use ld2/st2 recursively to support high
interleaving factors.
This re-lands the reverted #92418
When the VF is small enough so that dividing the VF by the scaling
factor results in 1, the reduction phi execution thinks the VF is scalar
and sets the reduction's output as a scalar value, tripping assertions
expecting a vector value. The latest commit in this PR fixes that by
using `State.VF` in the scalar check, rather than the divided VF.
---------
Co-authored-by: Nicholas Guy <nicholas.guy@arm.com>
I've only fixed up the tests where I was able to use a simple sed script
to replace the text. Even after this patch lands, there are still over
50 tests that need updating in X86/CostModel!
Following on from https://github.com/llvm/llvm-project/pull/94499, this
patch adds support to the Loop Vectorizer to emit the partial reduction
intrinsics where they may be beneficial for the target.
---------
Co-authored-by: Samuel Tebbs <samuel.tebbs@arm.com>
Currently the addUsersInExitBlocks incorrectly assumes exit phis only
have a single operand, which may not be the case for loops with early
exits when they share a common exit block.
Also further relax the assertion in fixupIVUsers to allow exit values if
they come from theloop latch/middle.block.
PR: https://github.com/llvm/llvm-project/pull/120260
PR #112138 introduced initial support for dispatching to
multiple exit blocks via split middle blocks. This patch
fixes a few issues so that we can enable more tests to use
the new enable-early-exit-vectorization flag. Fixes are:
1. The code to bail out for any loop live-out values happens
too late. This is because collectUsersInExitBlocks ignores
induction variables, which get dealt with in fixupIVUsers.
I've moved the check much earlier in processLoop by looking
for outside users of loop-defined values.
2. We shouldn't yet be interleaving when vectorising loops
with uncountable early exits, since we've not added support
for this yet.
3. Similarly, we also shouldn't be creating vector epilogues.
4. Similarly, we shouldn't enable tail-folding.
5. The existing implementation doesn't yet support loops
that require scalar epilogues, although I plan to add that
as part of PR #88385.
6. The new split middle blocks weren't being added to the
parent loop.
When building SPEC CPU 2017 with RISC-V and EVL tail folding, this
assertion in VPTypeAnalysis would trigger during the transformation to
EVL recipes:
d8a0709b10/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp (L135-L142)
It was caused by this recipe:
```
WIDEN ir<%shr> = vp.or ir<%add33>, ir<0>, vp<%6>
```
Having its type inferred as i16, when ir<%add33> and ir<0> had inferred
types of i32 somehow.
The cause of this turned out to be because the VPTypeAnalysis cache was
getting clobbered: In this transform we were erasing recipes but keeping
around the same mapping from VPValue* to Type*. In the meantime, new
recipes would be created which would have the same address as the old
value. They would then incorrectly get the old erased VPValue*'s cached
type:
```
--- before ---
0x600001ec5030: WIDEN ir<%mul21.neg> = vp.mul vp<%11>, ir<0>, vp<%6>
0x600001ec5450: <badref> <- some value that was erased
--- after ---
0x600001ec5030: WIDEN ir<%mul21.neg> = vp.mul vp<%11>, ir<0>, vp<%6>
0x600001ec5450: WIDEN ir<%shr> = vp.or ir<%add33>, ir<0>, vp<%6> <- a new value that happens to have the same address
```
This fixes this by deferring the erasing of recipes till after the
transformation.
The test case might be a bit flakey since it just happens to have the
right conditions to recreate this. I tried to add an assert in
inferScalarType that every VPValue in the cache was valid, but couldn't
find a way of telling if a VPValue had been erased.
---------
Co-authored-by: Florian Hahn <flo@fhahn.com>
This fixes a crash that shows up when building SPEC CPU 2017 with EVL
tail folding on RISC-V.
A VPWidenCastRecipe doesn't always have an underlying value, and in the
case of this crash this happens whenever a widened cast is created via
truncateToMinimalBitwidths.
Fix this by just using the opcode stored in the recipe itself.
I think a similar issue exists with VPWidenIntrinsicRecipe and how it's
widened, but I haven't run into any crashes with it just yet.
This was originally done to reduce the diff for the change. Remove it
and update the remaining tests. NFC modulo reordering of incoming
values.
Clean up after https://github.com/llvm/llvm-project/pull/114292.
Update VPReductionPHIRecipe::execute to use the start value from the
start value operand of the recipe. This is needed to make sure we resume
from the correct value during epilogue vectorization.
At the moment, the start value is set to the sentinel value in
adjustRecipesForReductions, as the original start value needs to be used
when creating ResumePhi recipes.
Fixes a mis-compile introduced by b3cba9be41bfa8 in SPEC2017 on AArch64.
I'm not sure if getStepVector was used for other things in the past
where StartIdx was non-zero, but nowadays VPWidenIntOrFpInductionRecipe
is the only user of it, and just passes zero to it. I presume
InstCombine was already catching this so hopefully removing this won't
affect codegen.
Modernize VPWidenIntOrFpInductionRecipe printing by including the result
VPValue and all operand VPValues, similar to VPScalarIVStepsRecipe and
VPDerivedIVRecipe.
Update VPWidenPointerInduction to manage its debug location via recipe.
This makes sure we emit a proper debug location for
VPWidenPointerInductionRecipes.
As a first step to move towards modeling the full skeleton in VPlan,
start by wrapping IR blocks created during legacy skeleton creation in
VPIRBasicBlocks and hook them into the VPlan. This means the skeleton
CFG is represented in VPlan, just before execute. This allows moving
parts of skeleton creation into recipes in the VPBBs gradually.
Note that this allows retiring some manual DT updates, as this will be
handled automatically during VPlan execution.
PR: https://github.com/llvm/llvm-project/pull/114292
Consider the following loop:
```
int rdx = init;
for (int i = 0; i < n; ++i)
rdx = (a[i] > b[i]) ? i : rdx;
```
We can vectorize this loop if `i` is an increasing induction variable.
The final reduced value will be the maximum of `i` that the condition
`a[i] > b[i]` is satisfied, or the start value `init`.
This patch added new RecurKind enums - IFindLastIV and FFindLastIV.
---------
Co-authored-by: Alexey Bataev <5361294+alexey-bataev@users.noreply.github.com>
A more lightweight variant of
https://github.com/llvm/llvm-project/pull/109193,
which dispatches to multiple exit blocks via the middle blocks.
The patch also introduces a bit of required scaffolding to enable
early-exit vectorization, including an option. At the moment, early-exit
vectorization doesn't come with legality checks, and is only used if the
option is provided and the loop has metadata forcing vectorization. This
is only intended to be used for testing during bring-up, with @david-arm
enabling auto early-exit vectorization plugging in the changes from
https://github.com/llvm/llvm-project/pull/88385.
PR: https://github.com/llvm/llvm-project/pull/112138
When we have a gep inbounds from the base of an object (e.g. alloca or
global), we know that the index cannot be negative, as this would go out
of bounds. As such, we can infer nuw as well.
The implementation is a bit stricter than necessary, we could also
accept one unknown index followed by known-non-negative indices.
Proof: https://alive2.llvm.org/ce/z/Hp7-6w (Note that alive2 currently
incorrectly doesn't require the inbounds for the alloca case, see
https://github.com/AliveToolkit/alive2/issues/1138).
Update the code to create induction resume PHIs to also create a resume
phi for the canonical induction during epilogue vectorization. This
unifies the code for handling induction resume values and removes the
need to explicitly create manually resume PHI and return it during
epilogue creation.
Overall it helps to move the code for updating the canonical induction
resume value to the place where all other header phi resume values are
updated.
This is NFC, modulo order of the created phis.
When VF has a fixed width and equals the number of iterations, and we are not
tail folding by masking, comparison instruction and induction operation will be DCEed later.
Ignoring the costs of these instructions improves the cost model.