We have an optimization in VPPredicator when creating blends where if
all the incoming values are the same, we just return that value.
This extends it to handle cases like "phi [%x, %x, poison, %x]" by
ignoring poison values.
This is split off from #176143 to prevent regressions when maintaining
SSA by adding PHIs with a poison incoming value.
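The rule described above can be sketched as a small standalone function. This is an illustrative model only, not the VPPredicator code: `Incoming` and `getCommonNonPoisonValue` are hypothetical names, and values are identified by plain integers.

```cpp
#include <cassert>
#include <optional>
#include <vector>

// Hypothetical stand-in for one incoming value of a blend.
struct Incoming {
  int ValueId;   // identifies the value (ignored when IsPoison is true)
  bool IsPoison; // true if this incoming value is poison
};

// Returns the single non-poison value shared by all incoming values, or
// std::nullopt if two distinct non-poison values exist or all are poison.
std::optional<int> getCommonNonPoisonValue(const std::vector<Incoming> &Ins) {
  std::optional<int> Common;
  for (const Incoming &In : Ins) {
    if (In.IsPoison)
      continue; // poison can be refined to any value, so it is skipped
    if (Common && *Common != In.ValueId)
      return std::nullopt; // two different real values: a blend is needed
    Common = In.ValueId;
  }
  return Common;
}
```

With incoming values `[%x, %x, poison, %x]` the sketch returns `%x`, so no blend would be needed in that case.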
When a recipe can be safely sunk and all of its users are outside the
vector loop region in the same dedicated exit block, the recipe does not
need to be executed on every iteration.
This patch extends the VPlan-based LICM (Loop Invariant Code Motion) to
also sink such recipes from the vector loop region into the exit block.
This reduces redundant computation and improves cost model accuracy.
TODO: Support nested loop sinking
TODO: Support sinking `VPReplicateRecipe` (requires `replicateByVF`
fixes)
TODO: Support recipes with multiple defined values (e.g., interleaved
loads)
TODO: Clone recipes without users to all exit blocks
TODO: Support PHI node users by checking incoming value blocks
TODO: Support sinking when users are in multiple blocks
TODO: Clone recipes when users are on multiple exit paths
Co-authored-by: Luke Lau <luke@igalia.com>
---------
Co-authored-by: Luke Lau <luke_lau@icloud.com>
Fixes #176208. Scaled-back version of #176515 that only affects the RISCV backend.
Only modifies the cost for cases when DIV is a legal operation.
Updates the cost for both Scalar and Vector types.
Used `TTI::TCC_Expensive` as suggested by
https://github.com/llvm/llvm-project/issues/176208#issuecomment-3760902537.
---------
Co-authored-by: Luke Lau <luke_lau@icloud.com>
Whilst reviewing PR #176754 I realised there seemed to be some odd cost
model issues for the tests in file
LoopVectorize/AArch64/fold-tail-low-trip-count.ll
where we seemed to be vectorising loops that aren't worth it. It turns
out the tests were not targeting AArch64 despite being in the AArch64
directory. I fixed the RUN line for the file and also added a new file
for RISCV so we get more test coverage.
When `extract-lane` has only a single vector operand, it can be
simplified to `extractelement`.
This patch makes `extract-lane` generate a plain `extractelement` in that
case, which avoids generating unused IR.
This patch is mostly NFC, since the unused IR should be removed by later
IR passes.
Remove the artificial PhiR operand of ComputeReductionResult, which was
only used to look up recurrence kind, in-loop and ordered properties.
Instead, encode them as VPIRFlags as suggested by @ayalz in
https://github.com/llvm/llvm-project/pull/170223.
This addresses a TODO to make codegen for ComputeReductionResult
independent of looking up information from other recipes.
This is NFC w.r.t. codegen, the printing has been improved to include
the reduction type, and whether it is in-loop/ordered.
PR: https://github.com/llvm/llvm-project/pull/174026
Fixes #175058
Similar to #175028, on RV64 we insert a zext in between most uses of EVL
so most of the VPlanVerifier EVL checks don't fire unless we're
compiling for RV32.
In this case, we're experiencing a crash because we can have a PtrAdd
that uses EVL. This fixes it by adding PtrAdd to the list of allowed
instructions.
Fixes #175028
We have a VPlanVerifier assertion that a VPInstruction that uses EVL
only has one use. This used to hold until we implemented CSE, but now we
can run into the case where e.g. a multiply from an expanded
VPWidenPointerInductionRecipe gets cse'd, causing it to have multiple
uses:
EMIT ir<%0> = WIDEN-POINTER-INDUCTION ir<%.pre3>, ir<6>, vp<%5>
EMIT ir<%1> = WIDEN-POINTER-INDUCTION ir<%.pre>, ir<6>, vp<%5>
EMIT-SCALAR vp<%5> = EXPLICIT-VECTOR-LENGTH vp<%avl>
-->
EMIT-SCALAR vp<%10> = EXPLICIT-VECTOR-LENGTH vp<%avl>
EMIT vp<%11> = mul ir<6>, vp<%10>
EMIT vp<%ptr.ind> = ptradd vp<%pointer.phi>, vp<%11>
EMIT vp<%12> = mul ir<6>, vp<%10>
EMIT vp<%ptr.ind>.1 = ptradd vp<%pointer.phi>.1, vp<%12>
-->
EMIT-SCALAR vp<%5> = EXPLICIT-VECTOR-LENGTH vp<%avl>
EMIT vp<%6> = mul ir<6>, vp<%5>
EMIT vp<%ptr.ind> = ptradd vp<%pointer.phi>, vp<%6>
EMIT vp<%ptr.ind>.1 = ptradd vp<%pointer.phi>.1, vp<%6>
This removes the check, as I'm not sure it's that useful anymore now
that we have CSE. Coincidentally, this crash only happened on RV32
because RV64 requires zexting the EVL, which sidesteps a lot of the
checks to begin with.
This patch simplifies extract-lane(%lane_num, %X) to %X when %X is a
scalar value. Extracting from a scalar is redundant since there is only
one value to extract.
In VPlanPatternMatch.h I have changed the int_pred_ty code to look
through broadcasts in order to catch more cases, i.e. multiplying by a
splat of one, etc.
Conservatively predicate sdiv/srem:
- RHS may carry poison in masked‑off lanes.
- RHS could be −1 while LHS has masked‑off lanes (risking INT_MIN/−1
overflow).
We’ll relax this once we can prove non‑wrap/non‑poison conditions.
Fixes #170775.
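The hazard can be illustrated with a scalar model of a predicated division. This is a sketch, not the vectorizer's code: `predicatedSDiv` is a hypothetical name, and active lanes are assumed to hold well-defined operands.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <vector>

// Scalar model of a predicated signed division over lanes. Masked-off
// lanes get a divisor of 1, so they can neither divide by zero nor
// evaluate INT_MIN / -1, both of which are undefined.
std::vector<int> predicatedSDiv(const std::vector<int> &LHS,
                                const std::vector<int> &RHS,
                                const std::vector<bool> &Mask) {
  std::vector<int> Result(LHS.size());
  for (std::size_t I = 0; I < LHS.size(); ++I) {
    int SafeRHS = Mask[I] ? RHS[I] : 1; // select(mask, RHS, 1)
    Result[I] = LHS[I] / SafeRHS;       // safe for masked-off lanes
  }
  return Result;
}
```

Without the select, a masked-off lane holding `INT_MIN` on the left and `-1` on the right would overflow even though its result is never used.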
The original patch, landed as a2db31b0 ([VPlan] Simplify pow-of-2
(mul|udiv) -> (shl|lshr), #172477) had a critical commutative matcher
bug, which has now been fixed. An assert has also been strengthened,
following a post-commit review.
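As a rough illustration of why operand order matters for this simplification, the sketch below models the two rewrites on 32-bit unsigned integers. The helper names are hypothetical and this is not the VPlan matcher itself; it only shows that `mul` may fold a power of two on either side, while `udiv` may only fold a power-of-two divisor.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Returns log2(V) if V is a power of two, otherwise std::nullopt.
std::optional<unsigned> exactLog2(std::uint32_t V) {
  if (V == 0 || (V & (V - 1)) != 0)
    return std::nullopt;
  unsigned Log = 0;
  while ((V >>= 1) != 0)
    ++Log;
  return Log;
}

// mul is commutative: either operand may be the power of two.
std::uint32_t simplifiedMul(std::uint32_t A, std::uint32_t B) {
  if (auto Sh = exactLog2(B))
    return A << *Sh;
  if (auto Sh = exactLog2(A))
    return B << *Sh;
  return A * B;
}

// udiv is not commutative: only a power-of-two divisor may be folded.
std::uint32_t simplifiedUDiv(std::uint32_t A, std::uint32_t B) {
  if (auto Sh = exactLog2(B))
    return A >> *Sh;
  return A / B; // division by zero stays undefined, as for plain udiv
}
```

Treating `udiv` as if it were commutative would turn `8 / 40` into `40 >> 3`, which is the kind of mistake a commutative matcher can introduce.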
This PR implements the first change outlined in
https://discourse.llvm.org/t/rfc-allow-non-constant-offsets-in-llvm-vector-splice/88974?u=lukel
In order to allow non-immediate offsets in the llvm.vector.splice
intrinsic, we need to separate out the "shift left" and "shift right"
modes into two separate intrinsics, which were previously determined by
whether or not the offset is positive or negative.
The description in the LangRef has also been reworded in terms of
sliding elements left or right and extracting either the upper or lower
half as opposed to extracting from a certain index, which brings it
in line with the definition of `llvm.fshr.*`/`llvm.fshl.*`.
This patch teaches AutoUpgrade.cpp to upgrade the old intrinsics into
their new equivalent one based on their offset, so existing uses of
vector.splice should still work.
Uses of llvm.vector.splice in `llvm/test/CodeGen` haven't been replaced
in this PR to keep the diff small and kick the tyres on the AutoUpgrader
a bit. I plan to do this in a follow-up NFC but can include it in
this PR if reviewers prefer.
Similarly the shuffle costing kind `SK_Splice` has just been kept the
same for now, to be split into `SK_SpliceLeft` and `SK_SpliceRight`
later.
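The mapping from the old signed offset to the two modes can be modelled on fixed-length vectors. This is a sketch under stated assumptions: `spliceLeft`, `spliceRight` and `oldSplice` are hypothetical helper names rather than the intrinsic names, both inputs have the same length, and the offset does not exceed that length.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<int>;

// Extracts A.size() elements from concat(A, B) starting at index Start.
Vec extractFromConcat(const Vec &A, const Vec &B, std::size_t Start) {
  Vec Concat(A);
  Concat.insert(Concat.end(), B.begin(), B.end());
  auto First = Concat.begin() + static_cast<std::ptrdiff_t>(Start);
  return Vec(First, First + static_cast<std::ptrdiff_t>(A.size()));
}

// "Left" mode: slide concat(A, B) left by Offset elements.
Vec spliceLeft(const Vec &A, const Vec &B, std::size_t Offset) {
  return extractFromConcat(A, B, Offset);
}

// "Right" mode: keep the last Offset elements of A, followed by B.
Vec spliceRight(const Vec &A, const Vec &B, std::size_t Offset) {
  return extractFromConcat(A, B, A.size() - Offset);
}

// Old single-intrinsic behaviour: the sign of Imm selects the mode.
Vec oldSplice(const Vec &A, const Vec &B, int Imm) {
  return Imm >= 0 ? spliceLeft(A, B, static_cast<std::size_t>(Imm))
                  : spliceRight(A, B, static_cast<std::size_t>(-Imm));
}
```

Because the mode no longer depends on the sign of the offset, each of the two helpers can take a non-negative offset that need not be a compile-time constant.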
This patch introduces VPInstruction::Reverse and extracts the reverse
operations of loaded/stored values from reverse memory accesses. This
extraction facilitates future support for permutation elimination within
VPlan.
In an effort to get rid of VPUnrollPartAccessor and directly unroll
recipes, start by directly unrolling VectorPointerRecipe, allowing for
VPlan-based simplifications and simplification of the corresponding
execute.
These quantities should never unsigned-wrap. This matches the behavior
if only VFxUF is used (and not VF): when computing both VF and VFxUF,
nuw should hold for each step separately.
In 531.deepsjeng_r from SPEC CPU 2017 there's a loop that we
unprofitably loop vectorize on RISC-V.
The loop looks something like:
```c
for (int i = 0; i < n; i++) {
  if (x0[i] == a)
    if (x1[i] == b)
      if (x2[i] == c)
        // do stuff...
}
```
Because it's so deeply nested the actual inner level of the loop rarely
gets executed. However we still deem it profitable to vectorize, which
due to the if-conversion means we now always execute the body.
This stems from the fact that `getPredBlockCostDivisor` currently
assumes that blocks have 50% chance of being executed as a heuristic.
We can fix this by using BlockFrequencyInfo, which gives a more accurate
estimate of the innermost block being executed 12.5% of the time. We can
then calculate the divisor as `HeaderFrequency / BlockFrequency`.
Fixing the cost here gives a 7% speedup for 531.deepsjeng_r on RISC-V.
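The arithmetic behind the ratio `HeaderFrequency / BlockFrequency` can be sketched as follows. The frequencies are plain integers standing in for BlockFrequencyInfo results, and `predBlockCostDivisor` and `scaledCost` are hypothetical names, not the in-tree functions.

```cpp
#include <cassert>
#include <cstdint>

// Fixed heuristic: a predicated block is assumed to run half the time.
constexpr std::uint64_t LegacyDivisor = 2;

// Frequency-based divisor: header executions per execution of the block.
std::uint64_t predBlockCostDivisor(std::uint64_t HeaderFreq,
                                   std::uint64_t BlockFreq) {
  assert(BlockFreq != 0 && BlockFreq <= HeaderFreq);
  return HeaderFreq / BlockFreq;
}

// Cost attributed to a predicated block after scaling by the divisor.
std::uint64_t scaledCost(std::uint64_t BlockCost, std::uint64_t Divisor) {
  return BlockCost / Divisor;
}
```

For three nested branches that are each taken half the time, the innermost block runs 12.5% of the time, so the divisor is 8 rather than 2 and the block contributes a quarter of the cost it did under the fixed heuristic.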
Whilst there are a lot of changes in the in-tree tests, this doesn't
affect llvm-test-suite or SPEC CPU 2017 that much:
- On armv9-a -flto -O3 there's 0.0%/0.2% more geomean loops vectorized
on llvm-test-suite/SPEC CPU 2017.
- On x86-64 -flto -O3 **with PGO** there's 0.9%/0% fewer geomean loops
vectorized on llvm-test-suite/SPEC CPU 2017.
Overall geomean compile time impact is 0.03% on stage1-ReleaseLTO:
https://llvm-compile-time-tracker.com/compare.php?from=9eee396c58d2e24beb93c460141170def328776d&to=32fbff48f965d03b51549fdf9bbc4ca06473b623&stat=instructions%3Au
The VPlan-based cost model uses vp_gather/vp_scatter for gather/scatter
costs, which is different to the legacy cost model and cannot be matched
there. Don't verify the costs match for plans containing gather/scatters
with EVL.
Fixes https://github.com/llvm/llvm-project/issues/169948.
Update the logic in narrowToSingleScalar to allow narrowing even if not
all users use scalars, if at least one of the operands already needs
broadcasting.
In that case, there won't be any additional broadcasts introduced. This
should allow removing the special handling for stores, which can
introduce additional broadcasts currently.
Fixes https://github.com/llvm/llvm-project/issues/169668.
PR: https://github.com/llvm/llvm-project/pull/168246
In some cases, VPWidenPointerInductions become only used by scalars after
legalizeAndOptimizeInductions was already run, for example due to
some VPlan optimizations.
Move the code to scalarize VPWidenPointerInductions to a helper and use
it if needed.
This fixes a crash after #148274 in the added test case.
Fixes https://github.com/llvm/llvm-project/issues/169780
In preparation to strip VPUnrollPartAccessor and unroll recipes
directly, strip unnecessary complication in getGEPIndexTy, as the unroll
part will no longer be available in follow-ups (see #168886 for
instance). The patch also helps by doing a mass test update up-front.
Narrowing the GEP index type conditionally does not yield any benefit,
and the change is non-functional in terms of emitted assembly. While at
it, avoid hard-coding address-space 0, and use the pointer operand's
address space to get the GEP index type.
Changes: Fix a missed update to WidenGEP::usesFirstLaneOnly, and include
a reduced test case that was previously hitting the new assert: the
underlying reason was that VPWidenGEP::usesScalars was too weak, and the
single-scalar WidenGEP was not narrowed by narrowToSingleScalarRecipes.
This allows us to strip a special case in VPWidenGEP::execute.
This patch implements a transform to hoist single-scalar replicated
loads with invariant addresses out of the vector loop to the preheader
when scoped noalias metadata proves they cannot alias with any stores in
the loop.
This enables hoisting of loads we can prove do not alias any stores in
the loop due to memory runtime checks added during vectorization.
PR: https://github.com/llvm/llvm-project/pull/166247
Changes: The previous patch had to be reverted due to a mismatching-OpType
assert in CSE. A reduced test corresponding to an RVV pointer induction
has now been added, and the pointer-induction case has been updated
to use createOverflowingBinaryOp.
While at it, record VPIRFlags in VPWidenInductionRecipe.
VPPartialReductionRecipe doesn't yet support an EVL variant, and we
guard against this by not calling convertToAbstractRecipes when we're
tail folding with EVL.
However recently some things got shuffled around which means we may
detect some scaled reductions in collectScaledReductions and store them
in ScaledReductionMap, where outside of convertToAbstractRecipes we may
look them up and start e.g. adding a scale factor to an otherwise
regular VPReductionPHI.
This fixes it by skipping collectScaledReductions, and fixes #167861.
[`llvm.experimental.get.vector.length`](https://llvm.org/docs/LangRef.html#id2399)
has the property that if the AVL (%cnt) is less than or equal to VF
(%max_lanes) then the return value is just AVL.
This patch uses SCEV to simplify this in optimizeForVFAndUF, and adds
`ExplicitVectorLength` to
`VPInstruction::opcodeMayReadOrWriteFromMemory` so it gets removed once
dead.
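The property being exploited can be modelled with plain integers. This is a simplified sketch: `explicitVectorLength` and `evlFoldsToAVL` are hypothetical names, and the value returned when the AVL exceeds the lane count is an assumption of this model, since a target may pick a smaller value there.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Simplified model of the explicit vector length. Assumption: when AVL
// exceeds MaxLanes this sketch returns MaxLanes.
std::uint64_t explicitVectorLength(std::uint64_t AVL, std::uint64_t MaxLanes) {
  return std::min(AVL, MaxLanes);
}

// If an upper bound on the AVL is known not to exceed MaxLanes, the EVL
// always equals the AVL, so the EVL computation can be replaced by it.
bool evlFoldsToAVL(std::uint64_t MaxAVL, std::uint64_t MaxLanes) {
  return MaxAVL <= MaxLanes;
}
```

In the patch the bound comes from SCEV; here it is simply passed in as `MaxAVL`.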
On RISC-V narrowInterleaveGroups doesn't kick in because the wrong
VectorRegWidth is passed to isConsecutiveInterleaveGroup.
narrowInterleaveGroups is always passed the RGK_FixedWidthVector
register size, but on RISC-V the RGK_ScalableVector size is twice as
large because we want to use LMUL 2. This causes the `GroupSize ==
VectorRegWidth` check to fail.
This fixes it by using the scalable register size whenever the VF is
scalable and plumbing it through as a potentially scalable TypeSize.
Note that this only makes a difference when tail folding is disabled, as
narrowInterleaveGroups can't handle EVL based IVs yet.
Since div/rem operations don’t support a mask operand, the lanes of the
divisor that are masked out are currently replaced with 1 using
VPInstruction::Select before the predicated div/rem operation.
This patch replaces
```
VPInstruction::Select(logical_and(header_mask, conditional_mask), LHS, RHS)
```
with
```
vp.merge(conditional_mask, LHS, RHS, EVL)
```
so that the header mask can be replaced by EVL in this usage scenario
when tail folding with EVL.
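The equivalence of the two forms can be checked with a lane-by-lane model, where the header mask is taken to be `lane < EVL`. This is a sketch only: `selectWithHeaderMask` and `vpMerge` are hypothetical helpers that mimic the semantics on `std::vector`, with lanes at or beyond EVL taking the on-false operand.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Lanes = std::vector<int>;
using Mask = std::vector<bool>;

// select(logical_and(header_mask, cond), OnTrue, OnFalse), where
// header_mask[i] is modelled as i < EVL.
Lanes selectWithHeaderMask(const Mask &Cond, const Lanes &OnTrue,
                           const Lanes &OnFalse, std::size_t EVL) {
  Lanes R(OnTrue.size());
  for (std::size_t I = 0; I < R.size(); ++I) {
    bool HeaderMask = I < EVL;
    R[I] = (HeaderMask && Cond[I]) ? OnTrue[I] : OnFalse[I];
  }
  return R;
}

// Model of a merge with an explicit vector length: start from OnFalse and
// only overwrite enabled lanes below EVL.
Lanes vpMerge(const Mask &Cond, const Lanes &OnTrue, const Lanes &OnFalse,
              std::size_t EVL) {
  Lanes R(OnFalse);
  for (std::size_t I = 0; I < EVL && I < R.size(); ++I)
    if (Cond[I])
      R[I] = OnTrue[I];
  return R;
}
```

For the safe-divisor case, `OnTrue` is the original divisor and `OnFalse` is a splat of 1, so every lane that is masked off or beyond EVL divides by 1.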
narrowToSingleScalarRecipes can permit users that are WidenStore, or a
VPInstruction that has a suitable opcode. This is a generalization and
extension of the existing code.
Call getVectorTripCount first, and call getTripCount failing that, in
simplifyBranchConditionForVFAndUF, to simplify missed cases. While at
it, strip the dead check for a zero TC.