Previously we could only vectorize FP reductions if fast math was enabled, as this allows us to
reorder FP operations. However, it may still be beneficial to vectorize the loop by moving
the reduction inside the vectorized loop and making sure that the scalar reduction value
be an input to the horizontal reduction, e.g:
%phi = phi float [ 0.0, %entry ], [ %reduction, %vector_body ]
%load = load <8 x float>
%reduction = call float @llvm.vector.reduce.fadd.v8f32(float %phi, <8 x float> %load)
This patch adds a new flag (IsOrdered) to RecurrenceDescriptor and makes use of the changes added
by D75069 as much as possible, which already teaches the vectorizer about in-loop reductions.
For now in-order reduction support is off by default and controlled with the `-enable-strict-reductions` flag.
Reviewed By: david-arm
Differential Revision: https://reviews.llvm.org/D98435
Changes getRecurrenceIdentity to always return a neutral value of -0.0 for FAdd.
Reviewed By: dmgreen, spatel
Differential Revision: https://reviews.llvm.org/D98963
We can't see how much overhead/redundancy is being
created with the partial checks.
To make it smaller and easier to read, I reduced the
vectorization factor because that does not add new
information - it just duplicates things.
Support deriving dereferenceability facts from allocation sites with known object sizes while correctly accounting for any possibly frees between allocation and use site. (At the moment, we're conservative and only allowing it in functions where we know we can't free.)
This is part of the work on deref-at-point semantics. I'm making the change unconditional as the miscompile in this case is way too easy to trip by accident, and the optimization was only recently added (by me).
There will be a follow up patch wiring through TLI since that should now be doable without introducing widespread miscompiles.
Differential Revision: https://reviews.llvm.org/D95815
This patch just adds tests that we can vectorize loop such as these:
for (i = 0; i < n; i++)
dst[i * 7] += 1;
and
for (i = 0; i < n; i++)
if (cond[i])
dst[i * 7] += 1;
using scalable vectors, where we expect to use gathers and scatters in the
vectorized loop. The vector of pointers used for the gather is identical
to those used for the scatter so there should be no memory dependences.
Tests are added here:
Transforms/LoopVectorize/AArch64/sve-large-strides.ll
Differential Revision: https://reviews.llvm.org/D99192
This marks FSIN and other operations to EXPAND for scalable
vectors, so that they are not assumed to be legal by the cost-model.
Depends on D97470
Reviewed By: dmgreen, paulwalker-arm
Differential Revision: https://reviews.llvm.org/D97471
LLVM test Transforms/LoopVectorize/X86/x86-pr39099.ll tries to check for
the absence of a sequence of instructions with several CHECK-NOT with
one of those directives using a variable defined in another. However
CHECK-NOT are checked independently so that is using a variable defined
in a pattern that should not occur in the input.
This commit only checks for the absence of a widened load which rules
out the presence of the whole sequence and does not involve an undefined
variable.
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D99583
This patch adds support for the vectorization of induction variables when
using scalable vectors, which required the following changes:
1. Removed assert from InnerLoopVectorizer::getStepVector.
2. Modified InnerLoopVectorizer::createVectorIntOrFpInductionPHI to use
a runtime determined value for VF and removed an assert.
3. Modified InnerLoopVectorizer::buildScalarSteps to work for scalable
vectors. I did this by calculating the full vector value for each Part
of the unroll factor (UF) and caching this in the VP state. This means
that we are always able to extract an arbitrary element from the vector
if necessary. In addition to this, I also permitted the caching of the
individual lane values themselves for the known minimum number of elements
in the same way we do for fixed width vectors. This is a further
optimisation that improves the code quality since it avoids unnecessary
extractelement operations when extracting the first lane.
4. Added an assert to InnerLoopVectorizer::widenPHIInstruction, since while
testing some code paths I noticed this is currently broken for scalable
vectors.
Various tests to support different cases have been added here:
Transforms/LoopVectorize/AArch64/sve-inductions.ll
Differential Revision: https://reviews.llvm.org/D98715
Re-apply 25fbe803d4db, with a small update to emit the right remark
class.
Original message:
[LV] Move runtime pointer size check to LVP::plan().
This removes the need for the remaining doesNotMeet check and instead
directly checks if there are too many runtime checks for vectorization
in the planner.
A subsequent patch will adjust the logic used to decide whether to
vectorize with runtime to consider their cost more accurately.
Reviewed By: lebedev.ri
This removes the need for the remaining doesNotMeet check and instead
directly checks if there are too many runtime checks for vectorization
in the planner.
A subsequent patch will adjust the logic used to decide whether to
vectorize with runtime to consider their cost more accurately.
Reviewed By: lebedev.ri
Differential Revision: https://reviews.llvm.org/D98634
This patch simplifies the calculation of certain costs in
getInstructionCost when isScalarAfterVectorization() returns a true value.
There are a few places where we multiply a cost by a number N, i.e.
unsigned N = isScalarAfterVectorization(I, VF) ? VF.getKnownMinValue() : 1;
return N * TTI.getArithmeticInstrCost(...
After some investigation it seems that there are only these cases that occur
in practice:
1. VF is a scalar, in which case N = 1.
2. VF is a vector. We can only get here if: a) the instruction is a
GEP/bitcast with scalar uses, or b) this is an update to an induction variable
that remains scalar.
I have changed the code so that N is assumed to always be 1. For GEPs
the cost is always 0, since this is calculated later on as part of the
load/store cost. For all other cases I have added an assert that none of the
users needs scalarising, which didn't fire in any unit tests.
Only one test required fixing and I believe the original cost for the scalar
add instruction to have been wrong, since only one copy remains after
vectorisation.
Differential Revision: https://reviews.llvm.org/D98512
D95598 added a cost model for broadcast shuffle, which should enable loops
such as the following to vectorize, where the load of b[42] is invariant
and can be done using a scalar load + splat:
for (int i=0; i<n; ++i)
a[i] = b[i] + b[42];
This patch adds tests to verify that we can vectorize such loops.
Reviewed By: joechrisellis
Differential Revision: https://reviews.llvm.org/D98506
The name is included when printing in DOT mode. Also print it in non-DOT
mode after 93a9d2de8f4f.
This will become more important to distinguish different plans once
VPlans are gradually refined.
The scalarization overhead was set deliberately high for MVE, whilst the
codegen was new. It helps protect us against the negative ramifications
of mixing scalar and vector instructions. This decreases that,
especially for floating point where the cost of extracting/inserting
lane elements can be low. For integer the cost is still fairly high due
to the cross-register-bank copy, but is no longer n^2 in the length of
the vector.
In general, this will decrease the cost of scalarizing floats and long
integer vectors. i64 increase in cost, having a high cost before and
after this patch. For floats this allows up to start doing things like
vectorizing fdiv instructions, even if they are scalarized.
Differential Revision: https://reviews.llvm.org/D98245
I foresee two uses for this:
1) It's easier to use those in debugger.
2) Once we start implementing more VPlan-to-VPlan transformations (especially
inner loop massaging stuff), using the vectorized LLVM IR as CHECK targets in
LIT test would become too obscure. I can imagine that we'd want to CHECK
against VPlan dumps after multiple transformations instead. That would be
easier with plain text dumps than with DOT format.
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D96628
This patch adds support for intrinsic overloading on unnamed types.
This fixes PR38117 and PR48340 and will also be needed for the Full Restrict Patches (D68484).
The main problem is that the intrinsic overloading name mangling is using 's_s' for unnamed types.
This can result in identical intrinsic mangled names for different function prototypes.
This patch changes this by adding a '.XXXXX' to the intrinsic mangled name when at least one of the types is based on an unnamed type, ensuring that we get a unique name.
Implementation details:
- The mapping is created on demand and kept in Module.
- It also checks for existing clashes and recycles potentially existing prototypes and declarations.
- Because of extra data in Module, Intrinsic::getName needs an extra Module* argument and, for speed, an optional FunctionType* argument.
- I still kept the original two-argument 'Intrinsic::getName' around which keeps the original behavior (providing the base name).
-- Main reason is that I did not want to change the LLVMIntrinsicGetName version, as I don't know how acceptable such a change is
-- The current situation already has a limitation. So that should not get worse with this patch.
- Intrinsic::getDeclaration and the verifier are now using the new version.
Other notes:
- As far as I see, this should not suffer from stability issues. The count is only added for prototypes depending on at least one anonymous struct
- The initial count starts from 0 for each intrinsic mangled name.
- In case of name clashes, existing prototypes are remembered and reused when that makes sense.
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D91250
This reverts commit 6b053c9867a3ede32e51cef3ed972d5ce5b38bc0.
The build is broken:
ld.lld: error: undefined symbol: llvm::VPlan::printDOT(llvm::raw_ostream&) const
>>> referenced by LoopVectorize.cpp
>>> LoopVectorize.cpp.o:(llvm::LoopVectorizationPlanner::printPlans(llvm::raw_ostream&)) in archive lib/libLLVMVectorize.a
I foresee two uses for this:
1) It's easier to use those in debugger.
2) Once we start implementing more VPlan-to-VPlan transformations (especially
inner loop massaging stuff), using the vectorized LLVM IR as CHECK targets in
LIT test would become too obscure. I can imagine that we'd want to CHECK
against VPlan dumps after multiple transformations instead. That would be
easier with plain text dumps than with DOT format.
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D96628
This makes the induction part of the loop vectorizer match the reduction part.
We do not need all of the fast-math-flags. For example, there are some that
clearly are not in play like arcp or afn.
If we want to make FMF constraints consistent across the IR optimizer, we
might want to add nsz too, but that's up for debate (users can't expect
associative FP math and preservation of sign-of-zero at the same time?).
The calling code was fixed to avoid miscompiles with:
1bee549737ac
Differential Revision: https://reviews.llvm.org/D98708
The `hasIrregularType` predicate checks whether an array of N values of type Ty is "bitcast-compatible" with a <N x Ty> vector.
The previous check returned invalid results in some cases where there's some padding between the array elements: eg. a 4-element array of u7 values is considered as compatible with <4 x u7>, even though the vector is only loading/storing 28 bits instead of 32.
The problem causes LLVM to generate incorrect code for some targets: for AArch64 the vector loads/stores are lowered in terms of ubfx/bfi, effectively losing the top (N * padding bits).
Reviewed By: lebedev.ri
Differential Revision: https://reviews.llvm.org/D97465
This adds the cost of an i1 extract and a branch to the cost in
getMemInstScalarizationCost when the instruction is predicated. These
predicated loads/store would generate blocks of something like:
%c1 = extractelement <4 x i1> %C, i32 1
br i1 %c1, label %if, label %else
if:
%sa = extractelement <4 x i32> %a, i32 1
%sb = getelementptr inbounds float, float* %pg, i32 %sa
%sv = extractelement <4 x float> %x, i32 1
store float %sa, float* %sb, align 4
else:
So this increases the cost by the extract and branch. This is probably
still too low in many cases due to the cost of all that branching, but
there is already an existing hack increasing the cost using
useEmulatedMaskMemRefHack. It will increase the cost of a memop if it is
a load or there are more than one store. This patch improves the cost
for when there is only a single store, and hopefully at some point in
the future the hack can be removed.
Differential Revision: https://reviews.llvm.org/D98243
This patch adds support for reverse loop vectorization.
It is possible to vectorize the following loop:
```
for (int i = n-1; i >= 0; --i)
a[i] = b[i] + 1.0;
```
with fixed or scalable vector.
The loop-vectorizer will use 'reverse' on the loads/stores to make
sure the lanes themselves are also handled in the right order.
This patch adds support for scalable vector on IRBuilder interface to
create a reverse vector. The IR function
CreateVectorReverse lowers to experimental.vector.reverse for scalable vector
and keedp the original behavior for fixed vector using shuffle reverse.
Differential Revision: https://reviews.llvm.org/D95363
This reverts commit 329aeb5db43f5e69df038fb20d2def77fe6f8595,
and relands commit 61f006ac655431bd44b9e089f74c73bec0c1a48c.
This is a continuation of D89456.
As it was suggested there, now that SCEV models `PtrToInt`,
we can try to improve SCEV's pointer handling.
In particular, i believe, i will need this in the future
to further fix `SCEVAddExpr`operation type handling.
This removes special handling of `ConstantPointerNull`
from `ScalarEvolution::createSCEV()`, and add constant folding
into `ScalarEvolution::getPtrToIntExpr()`.
This way, `null` constants stay as such in SCEV's,
but gracefully become zero integers when asked.
Reviewed By: Meinersbur
Differential Revision: https://reviews.llvm.org/D98147
This is a continuation of D89456.
As it was suggested there, now that SCEV models `PtrToInt`,
we can try to improve SCEV's pointer handling.
In particular, i believe, i will need this in the future
to further fix `SCEVAddExpr`operation type handling.
This removes special handling of `ConstantPointerNull`
from `ScalarEvolution::createSCEV()`, and add constant folding
into `ScalarEvolution::getPtrToIntExpr()`.
This way, `null` constants stay as such in SCEV's,
but gracefully become zero integers when asked.
Reviewed By: Meinersbur
Differential Revision: https://reviews.llvm.org/D98147
This patch fixes a crash when trying to get a scalar value using
VPTransformState::get() for uniform induction values or truncated
induction values. IVs and truncated IVs can be uniform and the updated
code accounts for that, fixing the crash.
This should fix
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=31981
Add support to widen select instructions in VPlan native path by using a correct recipe when such instructions are encountered. This is already used by inner loop vectorizer.
Previously select instructions get handled by the wrong recipe and resulted in unreachable instruction errors like this one: https://bugs.llvm.org/show_bug.cgi?id=48139.
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D97136
This test shows a case where we can potentially scalarize the store in a
predicated loop, creating a lot of instructions that would be much
slower than scalar.
Since P8 is the oldest machine supported by MASSV pass,
_massv place holder is removed and the oldest version of
MASSV functions is assumed. If the P9 vector specific is
detected in the compilation process, the P8 prefix will
be updated to P9.
Differential Revision: https://reviews.llvm.org/D98064
For loops of the form:
void foo(int *a, int *cond, short *inv, long long n) {
for (long long i=0; i<n; ++i) {
if (cond[i])
a[i] = *inv;
}
}
we can vectorise for SVE using masked gather loads where the array
of pointers is simply a vector splat of 'inv' and the mask comes
from the condition 'cond[i] != 0'.
This patch simply adds tests upstream to defend this capability.
Differential Revision: https://reviews.llvm.org/D98043
Add support to widen call instructions in VPlan native path by using a correct recipe when such instructions are encountered. This is already used by inner loop vectorizer.
Previously call instructions got handled by wrong recipes and resulted in unreachable instruction errors like this one: https://bugs.llvm.org/show_bug.cgi?id=48139.
Patch by Mauri Mustonen <mauri.mustonen@tuni.fi>
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D97278
These intrinsics, not the icmp+select are the canonical form nowadays,
so we might as well directly emit them.
This should not cause any regressions, but if it does,
then then they would needed to be fixed regardless.
Note that this doesn't deal with `SCEVExpander::isHighCostExpansion()`,
but that is a pessimization, not a correctness issue.
Additionally, the non-intrinsic form has issues with undef,
see https://reviews.llvm.org/D88287#2587863
There are certain loops like this below:
for (int i = 0; i < n; i++) {
a[i] = b[i] + 1;
*inv = a[i];
}
that can only be vectorised if we are able to extract the last lane of the
vectorised form of 'a[i]'. For fixed width vectors this already works since
we know at compile time what the final lane is, however for scalable vectors
this is a different story. This patch adds support for extracting the last
lane from a scalable vector using a runtime determined lane value. I have
added support to VPIteration for runtime-determined lanes that still permit
the caching of values. I did this by introducing a new class called VPLane,
which describes the lane we're dealing with and provides interfaces to get
both the compile-time known lane and the runtime determined value. Whilst
doing this work I couldn't find any explicit tests for extracting the last
lane values of fixed width vectors so I added tests for both scalable and
fixed width vectors.
Differential Revision: https://reviews.llvm.org/D95139
By implementing the method "unsigned RISCVTTIImpl::getRegisterBitWidth(bool Vector)",
fixed-length vectorization is enabled when possible. Without this method, the
"#pragma clang loop" directive is needed to enable vectorization(or the cost model
may inform LLVM that "Vectorization is possible but not beneficial").
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D97549
This code assumed that FP math was only permissable if it was
fully "fast", so it hard-coded "fast" when creating new instructions.
The underlying code already allows matching recurrences/reductions
that are only "reassoc", so this change should prevent the potential
miscompile seen in the test diffs (we created "fast" ops even though
none existed in the original code).
I don't know if we need to create the temporary IRBuilder objects
used here, so that could be follow-up clean-up.
There's an open question about whether we should require "nsz" in
addition to "reassoc" here. InstCombine uses that combo for its
reassociative folds, but I think codegen is not as strict.
Similar to b3a33553aec7, but this shows a TODO and a potential
miscompile is already present.
We are tracking an FP instruction that does *not* have FMF (reassoc)
properties, so calling that "Unsafe" seems opposite of the common
reading.
I also removed one getter method by rolling the null check into
the access. Further simplification may be possible.
The motivation is to clean up the interactions between FMF and
function-level attributes in these classes and their callers.
The new test shows that there is an existing bug somewhere in
the callers. We assumed that the original code was fully 'fast'
and so we produced IR with 'fast' even though it was just 'reassoc'.
Under -O3 and -Ofast, the MASSV conversion prevents the sqrt call to be inlined.
Inline sqrt is faster than MASSV call on leppc.
Differential Revision: https://reviews.llvm.org/D97487
This patch updates LV to generate the runtime checks just after cost
modeling, to allow a more precise estimate of the actual cost of the
checks. This information will be used in future patches to generate
larger runtime checks in cases where the checks only make up a small
fraction of the expected scalar loop execution time.
The runtime checks are created up-front in a temporary block to allow better
estimating the cost and un-linked from the existing IR. After deciding to
vectorize, the checks are moved backed. If deciding not to vectorize, the
temporary block is completely removed.
This patch is similar in spirit to D71053, but explores a different
direction: instead of delaying the decision on whether to vectorize in
the presence of runtime checks it instead optimistically creates the
runtime checks early and discards them later if decided to not
vectorize. This has the advantage that the cost-modeling decisions
can be kept together and can be done up-front and thus preserving the
general code structure. I think delaying (part) of the decision to
vectorize would also make the VPlan migration a bit harder.
One potential drawback of this patch is that we speculatively
generate IR which we might have to clean up later. However it seems like
the code required to do so is quite manageable.
Reviewed By: lebedev.ri, ebrevnov
Differential Revision: https://reviews.llvm.org/D75980
This reverts the revert commit 437f0bbcd509d0ed71b91ec1f86f48c2f4aae980.
It adds a new toVPRecipeResult, which forces VPRecipeOrVPValueTy to be
constructed with a VPRecipeBase *. This should address ambiguous
constructor issues for recipe sub-types that also inherit from VPValue.
This reverts commit 4efa097eb4c87d7ffe09a95a5b4ff372bdddda85, because
some the compilers used for some bots do not support automatic
conversions to PointerUnion.
Generalize the return value of tryToCreateWidenRecipe to return either a
newly create recipe or an existing VPValue. Use this to avoid creating
unnecessary VPBlendRecipes.
Fixes PR44800.