Now that we store the ScalarCost in the VectorizationFactor it is possible to
use it to get a slightly more accurate cost in isMoreProfitable between two
vector factors. This extends the logic added in D101726 to non-tail-folded
cases, using the costs of `VecCost * (TripCount / VF) + ScalarCost * (TripCount % VF)`
to compare VFs where the TripCount is known and we are not folding the tail.
This shouldn't alter very much as small trip counts are usually not vectorized,
but does seem to help in the testcase where 4 * VF4 is chosen as profitable
compared to 2 * VF8 + 4 * scalar.
Differential Revision: https://reviews.llvm.org/D147720
llvm.is.fpclass is different from other vectorizable intrinsics in that
it is overloaded on an argument type, not on the return type.
Differential Revision: https://reviews.llvm.org/D148905
This reverts the revert commit 3d8ed8b5192a59104bfbd5bf7ac84d035ee0a4a5.
The new version of the patch adds a set to avoid duplicating work in
isFixedOrderRecurrence, which was previously done through the removed
SinkAfter map.
Original commit message:
Building on D142885 and D142589, retire the SinkAfter map from the
recurrence handling code. It is replaced by checking whether it is
possible to sink all users of a recurrence directly in VPlan. This
results in simpler code overall and allows to handle additional cases
(see the improvements in @test_crash).
Depends on D142885.
Depends on D142589.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D142886
Since D146813, LICM will reassociate GEPs to expose hoisting
opportunities itself. Don't perform this transform in InstCombine,
where it is fragile because it depends on an optional LoopInfo
analysis.
This ensures VPlan-based DCE won't be able to remove the unused
recurrences.
It also adds a dedicated new test (@unused_recurrence) where an unused
recurrence can be removed.
In LoopVectorizationCostModel::isEpilogueVectorizationProfitable we
check to see if the chosen main vector loop VF >= 16. If so, we
decide to create a vector epilogue loop. However, this doesn't
take VScaleForTuning into account because we could be targeting a
CPU where vscale > 1, and hence the runtime VF would be a multiple
of the known minimum value.
This patch multiplies scalable VFs by VScaleForTuning and several
tests have been updated that now produce vector epilogues.
Differential Revision: https://reviews.llvm.org/D147522
The current sinking code doesn't prevent us from sinking a load past an
aliasing store. Skip sinking instructions that may read from memory to
avoid a mis-compile.
See @minimal_bit_widths_with_aliasing_store for an example where 2 loads
are sunk past aliasing stores before this fix.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D147259
Building on D142885 and D142589, retire the SinkAfter map from the
recurrence handling code. It is replaced by checking whether it is
possible to sink all users of a recurrence directly in VPlan. This
results in simpler code overall and allows to handle additional cases
(see the improvements in @test_crash).
Depends on D142885.
Depends on D142589.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D142886
Add users of exit values for recurrences to make sure exit value
generation will be checked in a follow-up change.
Also adjusts/fixes naming in the test.
Based off D148215, when expanding a min/max reduction we should be creating min/max intrinsics directly instead of relying on instcombine to fold them back together.
This patch handles integer min/max cases. Hopefully we can add floating point support soon (at least for fastmath/nnan cases) - but we're missing some of the plumbing to pass the correct FMF to the intrinsic at the moment.
Differential Revision: https://reviews.llvm.org/D148221
To calculate the trip count we need to add 1 to the backedge
taken count. If we need to widen the backedge count, it's better
to do the add before the widening if we can guarantee it won't
overflow.
The code here is based on similar code I found in
LoopIdiomRecognize.
This is the vectorizer version of this InstCombine patch D142783.
Looking at the IR diffs, this does look like it gets more cases
than the InstCombine patch.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D147355
Move the code to collect live-out earlier and only generate extracts for
exit values if there are any live-outs that use them.
Depends on D147472.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D147567
Instead of clearing live outs when a scalar epilogue is required late,
don't add live outs during VPlan construction if a scalar epilogue is
required.
This enables more VPlan-based DCE (if the live out would be the only
user in the plan) and is a step towards removing an access of the cost
model in fixedVectorizedLoop (which is after VPlan execution).
Depends on D147468.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D147471
In getInstructionCost if we know a zext/sext is going to be shrunk
we should only be changing the destination type, and leave the
source type unchanged. For example, we may change a zext from
zext <16 x i8> %a to <16 x i32>
to
zext <16 x i8> %a to <16 x i16>
However, we were previously calculating the cost for doing
zext <16 x i16> %a to <16 x i16>
which is incorrect.
Differential Revision: https://reviews.llvm.org/D147152
(JFYI - This has been heavily reframed since original attempt at landing.)
This change updates the InductionDescriptor logic to allow matching a pointer IV with a non-constant stride, but also updates the LoopVectorizer to bailout on such descriptors by default. This preserves the default vectorizer behavior.
In review, it was pointed out that there's multiple unfortunate performance implications which need to be addressed before this can be enabled. Having a flag allows us to exercise the behavior, and write test cases for logic which is otherwise unreachable (or hard to reach).
This will also enable non-constant stride pointer recurrences for other consumers. I've audited said code, and don't see any obvious issues.
Differential Revision: https://reviews.llvm.org/D147336
Generally, the cost of a memory op will scale with the number of vector registers accessed. Machines might exist which have a narrow memory access than vector register width, but machines with a wider memory access width than vector register width seem unlikely.
I noticed this because we were preferring wide loads + deinterleaves on examples where the cost of a short gather (actually a strided load) would be better. Touching 8 vector registers instead of doing a 4 element gather is not a good tradeoff.
Differential Revision: https://reviews.llvm.org/D147470
LLVM has the ability to vectorize using function variants that require
a mask by creating an all-true mask, and to vectorize a conditional
call via scalarization, now we want to join the two parts together
and use a masked variant when a mask is required.
Reviewed By: david-arm
Differential Revision: https://reviews.llvm.org/D136251
If the legalized type is a legal interleaved access type (i.e. there's a
supported vlseg/vsseg instruction for it), the interleaved access pass
will pick any interleaved memory op (wide load + shuffles) and lower it
into a vlseg/vsseg intrinsic.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D146522
Extend handling to support `%base + offset` as start for AddRecs in
isDereferenceableAndAlignedInLoop. This is done by adjusting AccessSize
by the offset and effectively checking if the full object starting from
%base to %base + offset + access-size is dereferenceable.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D147260
Multiple errors have being reported on
https://reviews.llvm.org/rG498aa534f472d28db893aa9a8627d0b46e17f312
Reverting until the correctness issues can be resolved.
We are also seeing a lot of performance differences from the patch. Some are
looking good, but some are looking pretty bad.
This matches the handling for integer IVs. I left the non-opaque cases alone, mostly because they're largely irrelevant today.
This doesn't actually make much difference in vectorization right now as we immediately fail on aliasing checks (which also bail on non-constant strides). Slightly suprisingly, it's the case which *do* need runtime checks which work after this patch as they don't use the same dependency analysis path.
This will also enable non-constant stride pointer recurrences for other consumers. I've auditted said code, and don't see any obvious issues.
This commit extends D134719 "[AArch64] Enable libm vectorized
functions via SLEEF" with the mappings for the scalable functions.
It also introduces all the necessary changes needed to support masked
interfaces.
Reviewed By: danielkiss, sdesmalen
Differential Revision: https://reviews.llvm.org/D146839
This commit extends D134719 "[AArch64] Enable libm vectorized
functions via SLEEF" with the mappings for the scalable functions.
It also introduces all the necessary changes needed to support masked
interfaces.
Signed-off-by: Paul Osmialowski <pawel.osmialowski@arm.com>
The cost model was not accounting for the fact that we can generate vrgather + an index expression.
Two cases to call out.
1) I did not model the difference between vrgather and vrgatherei16. The result is the constant pool cost can be slightly understated on RV32. I don't think we care, but if someone disagrees, this would be easy to add.
2) Our current codegen for i8 vectors longer than 256 (which is the limit of what this costs) has some room for improvement.
Differential Revision: https://reviews.llvm.org/D147000
If we use tail-folding for reverse loops that contain loads
and stores then we will need to reverse the loop predicate.
This patch adds a new 'reverse' sve-tail-folding option and
ensures they are not considered 'simple'.
I did this by adding a function called
containsDecreasingPointers to AArch64TargetTransformInfo.cpp
that searches all instructions in the loop for loads or
stores with negative strides.
Differential Revision: https://reviews.llvm.org/D146128