At the moment, the full cost of all interleave group members is assigned
to the instruction at the group's insert position, even if the decision
was to not form an interleave group.
This can lead to inaccurate cost estimates, e.g. if the instruction at
the insert position is dead. If the decision is to not vectorize but
scalarize or scather/gather, then the cost will be to total cost for all
members. In those cases, assign individual the cost per member, to more
closely reflect to choice per instruction.
This fixes a divergence between legacy and VPlan-based cost model.
Fixes https://github.com/llvm/llvm-project/issues/108098.
Similar to VFxUF, also add a VF VPValue to VPlan and use it to get the
runtime VF in VPWidenIntOrFpInductionRecipe. Code for VF is only
generated if there are users of VF, to avoid unnecessary test changes.
PR: https://github.com/llvm/llvm-project/pull/95305
Update some tests with loop-invariant instructions, where hoisting them
out of the loop changes the vectorization decision. This should preserve
their original spirit when making further improvements.
Similarly to dd94537b4, setVectorizedCallDecision also did not consider
ForcedScalars. This lead to VPlans not reflecting the decision by the
legacy cost model (cost computation would use scalar cost, VPlan would
have VPWidenCallRecipe).
To fix this, check if the call has been forced to scalar in
setVectorizedCallDecision.
Note that this requires moving setVectorizedCallDecision after
collectLoopUniforms (which sets ForcedScalars). collectLoopUniforms does
not depend on call decisions and can safely be moved.
Fixes https://github.com/llvm/llvm-project/issues/107051.
Analogous to 2c7786e94a1058bd4f96794a1d4f70dcb86e5cc5, cleanup a case
where the vectorizer is emitting a non-canonical identity value given
the available flags. We use largest/smallest value during ISEL, and VP
expansion, but not during vectorization.
Since the fmin/fmax/fminimum/fmaximum intrinsics don't require a start
value, this difference is only visible when masking of inactive lanes is
required.
Primary motivation of this change is simply to remove a difference
between version of code which reason about the identity value of a
reduction so I can kill all but one off.
In review, it was pointed out that this is actually a functional fix as well.
The old code used inf on a noinf reduction instruction - whose
result is poison! That wasn't the intent of the code.
This is a follow up to 924907bc6, and is mostly motivated by consistency
but does include one additional optimization. In general, we prefer 0.0
over -0.0 as the identity value for an fadd. We use that value in
several places, but don't in others. So, let's be consistent and use the
same identity (when nsz allows) everywhere.
This creates a bunch of test churn, but due to 924907bc6, most of that
churn doesn't actually indicate a change in codegen. The exception is
that this change enables the use of 0.0 for nsz, but *not* reasoc, fadd
reductions. Or said differently, it allows the neutral value of an
ordered fadd reduction to be 0.0.
collectInstsToScalarize may decide to scalarize a call. If so, we have
to update the widening decision for the call, otherwise the call won't
be scalarized as expected during VPlan construction.
This issue was uncovered by f82543d509.
This moves the logic to create simplified operands using SCEV to MUL
recipe creation. This is needed to match the behavior of the legacy's cost
model. TODOs are to extend to other opcodes and move to a transform.
Note that this also restricts the number of SCEV simplifications we
apply to more precisely match the cases handled by the legacy cost
model.
Fixes https://github.com/llvm/llvm-project/issues/107015.
Follow-up to 9ccf825, adjust computeCost to also pass IntrinsicInst to
TTI if available, as there are multiple places in TTI which use the
IntrinsicInst.
Fixes https://github.com/llvm/llvm-project/issues/107016.
This patch replaces all dominated uses of condition with true/false to
improve context-sensitive optimizations. It eliminates a bunch of
branches in llvm-opt-benchmark.
As a side effect, it may introduce new phi nodes in some corner cases.
See the following case:
```
define i1 @test(i1 %cmp, i1 %cond) {
entry:
br i1 %cond, label %bb1, label %bb2
bb1:
br i1 %cmp, label %if.then, label %if.else
if.then:
br %bb2
if.else:
br %bb2
bb2:
%res = phi i1 [%cmp, %entry], [%cmp, %if.then], [%cmp, %if.else]
ret i1 %res
}
```
It will be simplified into:
```
define i1 @test(i1 %cmp, i1 %cond) {
entry:
br i1 %cond, label %bb1, label %bb2
bb1:
br i1 %cmp, label %if.then, label %if.else
if.then:
br %bb2
if.else:
br %bb2
bb2:
%res = phi i1 [%cmp, %entry], [true, %if.then], [false, %if.else]
ret i1 %res
}
```
I am planning to fix this in late pipeline/CGP since this problem exists
before the patch.
This patch is moving out stepvector intrinsic from the experimental
namespace.
This intrinsic exists in LLVM for several years now, and is widely used.
Don't consider the cost of branches marked to be skipped in VPlan cost
pre-computation. Those aren't included in the legacy cost, so they
should not be included in the VPlan cast.
The idea behind this canonicalization is that it allows us to handle less
patterns, because we know that some will be canonicalized away. This is
indeed very useful to e.g. know that constants are always on the right.
However, this is only useful if the canonicalization is actually
reliable. This is the case for constants, but not for arguments: Moving
these to the right makes it look like the "more complex" expression is
guaranteed to be on the left, but this is not actually the case in
practice. It fails as soon as you replace the argument with another
instruction.
The end result is that it looks like things correctly work in tests,
while they actually don't. We use the "thwart complexity-based
canonicalization" trick to handle this in tests, but it's often a
challenge for new contributors to get this right, and based on the
regressions this PR originally exposed, we clearly don't get this right
in many cases.
For this reason, I think that it's better to remove this complexity
canonicalization. It will make it much easier to write tests for
commuted cases and make sure that they are handled.
This patch increases scatter overhead on Neoverse-V2 to 13. This
benefits s128 kernel from TSVC_2 test suite.
SPEC 17, RAJAPerf, and Sptter are unaffected by this patch.
This patch boosts s128 kernel's performance from TSVC test suite by about
40% as this enables vectorization. Also, handle minor code refactoring
for gather related part.
The command line option enable-scalable-autovec-in-streaming-mode is
used to enable scalable vectors but the same check is missing from
enableScalableVectorization, which is blocking auto-vectorisation.
This patch moves the logic to create remarks for instructions with
invalid costs to work on recipes and decoupling it from
selectVectorizationFactor. This is needed to replace the remaining uses
of selectVectorizationFactor with getBestPlan using the VPlan-based cost
model.
The current implementation iterates over all VPlans and their recipes
again, to find recipes with invalid costs, which is more work but will
only be done when remarks for LV are enabled. Once the remaining uses of
selectVectorizationFactor are retired, we can collect VPlans with
invalid costs as part of getBestPlan if we want to optimize the remarks
case a bit, at the cost of adding additional complexity.
PR: https://github.com/llvm/llvm-project/pull/99322
Update existing tests with dead interleave groups by adding users. This
ensures the tests keep testing what they were intended to test with a
planned change to skip unused instructions in cost computations.
For the Neoverse V2 we would like to prefer fixed width over scalable
vectorisation if the cost-model assigns an equal cost to both for certain
loops. This improves 7 kernels from TSVC-2 and several production kernels by
about 2x, and does not affect SPEC21017 INT and FP. This also adds a new TTI
hook that can steer the loop vectorizater to preferring fixed width
vectorization, which can be set per CPU. For now, this is only enabled for the
Neoverse V2.
There are 3 reasons why preferring NEON might be better in the case the
cost-model is a tie and the SVE vector size is the same as NEON (128-bit):
architectural reasons, micro-architecture reasons, and SVE codegen reasons. The
latter will be improved over time, so the more important reasons are the former
two. I.e., (micro) architecture reason is the use of LPD/STP instructions which
are not available in SVE2 and it avoids predication.
For what it is worth: this codegen strategy to generate more NEON is inline
with GCC's codegen strategy, which is actually even more aggressive in
generating NEON when no predication is required. We could be smarter about the
decision making, but this seems to be a first good step in the right direction,
and we can always revise this later (for example make the target hook more
general).
This change allows to consider compare instructions in the loop with
multiple use inside the loop and outside.
This change allows to vectorise this loop:
int foo(float* a, int n) {
_Bool any = 0;
_Bool all = 1;
for (int i = 0; i < n; i++) {
if (a[i] < 0.0f) {
any = 1;
} else {
all = 0;
}
}
return all ? 1 : any ? 2 : 3;
}
Resume and exit values for inductions are currently still created
outside of VPlan and independent of the induction recipes. Don't add
live-outs for now, as the additional unneeded users can pessimize other
anlysis.
Fixes https://github.com/llvm/llvm-project/issues/98660.
This patch introduces a new ResumePhi VPInstruction which creates a phi
in a leaf block of a VPlan. The first use is to create the phi node for
fixed-order recurrence resume values in the scalar preheader.
The VPInstruction takes 2 operands: 1) the incoming value from the
middle-block and a default value to be used for all other incoming
blocks.
In follow-up changes, it will also be used to create phis for reduction
and induction resume values.
Depends on https://github.com/llvm/llvm-project/pull/92651
PR: https://github.com/llvm/llvm-project/pull/94760
This patch implements limited loop vectorization support for the 'all-in-one' histogram intrinsic. The feature is disabled by default, and when enabled will only vectorize if there are no other users of values in the gather-modify-scatter sequence.
Update buildPlainCFG to re-use the vector and latch VPBBs created as
part of the initial skeleton in 72937203dd3b.
This should fix the leak sanitizer failure discovered by
https://lab.llvm.org/buildbot/#/builders/52/builds/619.
This patch moves branch condition creation to enter the scalar epilogue
loop to VPlan. Modeling the branch in the middle block also requires
modeling the successor blocks. This is done using the recently
introduced VPIRBasicBlock.
Note that the middle.block is still created as part of the skeleton and
then patched in during VPlan execution. Unfortunately the skeleton needs
to create the middle.block early on, as it is also used for induction
resume value creation and is also needed to properly update the
dominator tree during skeleton creation.
After this patch lands, I plan to move induction resume value and phi
node creation in the scalar preheader to VPlan. Once that is done, we
should be able to create the middle.block in VPlan directly.
This is a re-worked version based on the earlier
https://reviews.llvm.org/D150398 and the main change is the use of
VPIRBasicBlock.
Depends on https://github.com/llvm/llvm-project/pull/92525
PR: https://github.com/llvm/llvm-project/pull/92651
This is a small canonicalization for `gep i32, p, (mul x, C)` -> `gep
i8, p, (mul x, C*4)`, so that the mul can combine both of the constant
multiplications, and we take a small step towards canonicalizing more
geps to i8.
It currently doesn't attempt to check for multiple uses on the mul, but
that should be possible if it sounds better. Let me know what you think
of the idea in general.
Previously isElementTypeLegalForScalableVector returned false for i1
types, which also prevented vectorisation of loops with i1 reductions.
This is overkill - we only need to disable vectorisation for loads
and/or stores of i1 types. I've added i1 as a legal type, but changed
the cost model to return an invalid cost for loads and stores.
At the moment, vectorization is only enabled in streaming(-compatible)
mode when enabled through an option. But the interfaces should check
more than just 'hasSVE()', because a function with +sme in streaming
mode should also vectorize with the option enabled.
Additionally, a streaming-compatible function should only be able to use
fixed-length autovec if SVE is available, otherwise the vector code will
be scalarised by the backend.
Use VPIRBasicBlock to wrap the middle block and implement patching up
branches in predecessors in VPIRBasicBlock::execute. The IR middle block
is only created after skeleton creation. Initially a regular
VPBasicBlock is created, which will later be replaced by a
VPIRBasicBlock once the middle IR basic block has been created.
Note that this slightly changes the order of instructions created in the
middle block; code generated by recipe execution in the middle block
will now be inserted before the terminator (and in between the compare
to used by the terminator). The original order will be restored in
https://github.com/llvm/llvm-project/pull/92651.
PR: https://github.com/llvm/llvm-project/pull/95816