This removes the need to convert the end of the range to the next
power-of-2 for the end iterator after 4bd3fda5124962 and was suggested
as a follow-up TODO in D147468.
Add an iterator to iterate over all VFs in a VFRange. This simplifies some
existing code and allows using all_of, any_of and none_of on a VFRange.
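As a rough, self-contained sketch of why such an iteration helper pays off (the struct below is a hypothetical stand-in, not the actual VFRange API): once the VFs in a range can be enumerated, range-based predicates collapse into one-liners.

#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a VF range [Start, End) where VFs are powers of 2.
struct SimpleVFRange {
  uint64_t Start, End;
  std::vector<uint64_t> vfs() const {
    std::vector<uint64_t> Out;
    for (uint64_t VF = Start; VF < End; VF *= 2)
      Out.push_back(VF);
    return Out;
  }
};

int main() {
  SimpleVFRange R{2, 16}; // covers VFs 2, 4 and 8
  auto VFs = R.vfs();
  // With an enumeration helper, all_of/any_of/none_of apply directly.
  bool AllSmall = std::all_of(VFs.begin(), VFs.end(),
                              [](uint64_t VF) { return VF <= 8; });
  return AllSmall ? 0 : 1;
}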
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D147468
This patch adds a NeedsMaskForGaps field to VPInterleaveRecipe to record
whether a mask for gaps is needed. This removes a dependence on the cost
model in VPlan code-generation.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D147467
The original loop is O(MxN), since `is_contained` iterates over all
incoming values of each phi. This change iterates only over the phis
which actually use the value as an incoming value, so
it is now O(M).
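As a simplified, self-contained sketch of the before/after shape (plain containers standing in for the real IR types):

#include <algorithm>
#include <vector>

struct Phi { std::vector<int> Incoming; };

// Before: O(MxN). For each of the M phis, scan all N incoming values.
bool usedAsIncomingSlow(const std::vector<Phi> &Phis, int V) {
  return std::any_of(Phis.begin(), Phis.end(), [V](const Phi &P) {
    return std::find(P.Incoming.begin(), P.Incoming.end(), V) !=
           P.Incoming.end();
  });
}

// After: O(M). Only the phis recorded as users of V are visited; the inner
// scan over every incoming value disappears.
bool usedAsIncomingFast(const std::vector<const Phi *> &PhiUsersOfV) {
  return !PhiUsersOfV.empty();
}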
Differential Revision: https://reviews.llvm.org/D146999
Conditionally setting MaskForGaps is only needed for loads. This avoids
re-computing MaskForGaps for stores.
Suggested as independent cleanup in D147467.
In getInstructionCost, if we know a zext/sext is going to be shrunk,
we should only change the destination type and leave the
source type unchanged. For example, we may change a zext from
zext <16 x i8> %a to <16 x i32>
to
zext <16 x i8> %a to <16 x i16>
However, we were previously calculating the cost for doing
zext <16 x i16> %a to <16 x i16>
which is incorrect.
Differential Revision: https://reviews.llvm.org/D147152
(JFYI - This has been heavily reframed since original attempt at landing.)
This change updates the InductionDescriptor logic to allow matching a pointer IV with a non-constant stride, but also updates the LoopVectorizer to bailout on such descriptors by default. This preserves the default vectorizer behavior.
In review, it was pointed out that there are multiple unfortunate performance implications which need to be addressed before this can be enabled. Having a flag allows us to exercise the behavior, and write test cases for logic which is otherwise unreachable (or hard to reach).
This will also enable non-constant stride pointer recurrences for other consumers. I've audited said code, and don't see any obvious issues.
Differential Revision: https://reviews.llvm.org/D147336
LLVM has the ability to vectorize using function variants that require
a mask by creating an all-true mask, and to vectorize a conditional
call via scalarization. Now we want to join the two parts together
and use a masked variant when a mask is required.
Reviewed By: david-arm
Differential Revision: https://reviews.llvm.org/D136251
Given just how many arguments we pass to
preferPredicateOverEpilogue, and considering this list may
grow over time, I've decided to pass in a pointer to a new
TailFoldingInfo structure instead, similar to what we do
with IntrinsicCostAttributes, etc. In addition, many of the
arguments we pass in are actually available in the
LoopVectorizationLegality class, so I've managed to
reduce the set of pointers that we need to pass in the
TailFoldingInfo struct.
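A minimal sketch of the parameter-object pattern being described; the field and function names below are hypothetical, not the actual TailFoldingInfo definition:

// Stand-ins for the analyses that used to be passed individually.
struct LoopStub {};
struct LegalityStub { LoopStub *TheLoop = nullptr; };

// One struct bundles everything, so growing the list later doesn't churn
// every target's preferPredicateOverEpilogue override.
struct TailFoldingInfoSketch {
  LegalityStub *LVL = nullptr; // many former arguments are reachable from here
  LoopStub *L = nullptr;
};

// Before (roughly): a long list of Loop*, LoopInfo*, ScalarEvolution&, ... args.
// After: a single pointer argument.
bool preferPredicateOverEpilogueSketch(TailFoldingInfoSketch *TFI) {
  return TFI && TFI->LVL && TFI->L;
}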
Differential Revision: https://reviews.llvm.org/D146127
Multiple errors have been reported on
https://reviews.llvm.org/rG498aa534f472d28db893aa9a8627d0b46e17f312
Reverting until the correctness issues can be resolved.
We are also seeing a lot of performance differences from the patch. Some are
looking good, but some are looking pretty bad.
This matches the handling for integer IVs. I left the non-opaque cases alone, mostly because they're largely irrelevant today.
This doesn't actually make much difference in vectorization right now as we immediately fail on aliasing checks (which also bail on non-constant strides). Slightly surprisingly, it's the cases which *do* need runtime checks that work after this patch, as they don't use the same dependency analysis path.
This will also enable non-constant stride pointer recurrences for other consumers. I've audited said code, and don't see any obvious issues.
This one-line patch just tightens up the code added in
1c4fedfa35aeb8b456e2d8f4f826c0e026b9d863
where we try to avoid tail-folding if we know the trip count
will always be a multiple of the runtime VF.
Currently in LoopVectorize we avoid tail-folding if we can
prove the trip count is always a multiple of the maximum
fixed-width VF. This works because we know the vectoriser
only ever chooses a VF that is a power of 2. However, if
we are also considering scalable VFs then we conservatively
bail out of the optimisation because we don't know the value
of vscale, which could be an odd or prime number, etc.
This patch tries to enable the same optimisation for scalable
VFs by asking if vscale is known to be a power of 2. If so,
we can then query the maximum value of vscale and use the same
logic as we do for fixed-width VFs. I've also added a new TTI
hook called isVScaleKnownToBeAPowerOfTwo that does the same
thing as the existing TargetLowering hook.
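A small, self-contained sketch of the divisibility reasoning (simplified, not the actual LoopVectorize code):

#include <cstdint>

// With fixed-width VFs the chosen VF is always a power of 2, so proving
// TripCount % MaxVF == 0 also proves it for every smaller power-of-2 VF.
// For scalable VFs the runtime VF is KnownMinVF * vscale, so the same
// reasoning only holds when vscale is itself known to be a power of 2.
bool canSkipTailFolding(uint64_t TripCount, uint64_t KnownMinVF,
                        uint64_t MaxVScale, bool VScaleIsPowerOfTwo) {
  if (!VScaleIsPowerOfTwo)
    return false; // vscale could be 3, 6, ... so divisibility can't be inferred
  uint64_t MaxVF = KnownMinVF * MaxVScale;
  return MaxVF != 0 && TripCount % MaxVF == 0;
}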
Differential Revision: https://reviews.llvm.org/D146199
The function doesn't use anything from VPRecipeBuilder, so move the
definition to where it is actually used and turn it into a simple static
function.
It also makes the VPRecipeBuilder argument for createAndOptimizeReplicateRegions
unnecessary.
Update isVectorizedMemAccessUse to also check if the pointer is stored.
This prevents LV from incorrectly considering a pointer as uniform if it
is used both as the pointer operand and as the stored value of the same
StoreInst.
Fixes #61396.
InnerLoopVectorizer::createBitOrPointerCast only supported fixed-length
vectors since it hadn't been updated. Supporting scalable
vectors is just a matter of changing types and using ElementCount
instead of the number of elements, since there's nothing which actually
relies on knowing the exact length of the vector.
Originally written by mgabka.
Split out from D145163.
These idioms already appear in a number of places in the code, and upcoming changes to the various sanitizers continue to need more instances of the same patterns.
Differential Revision: https://reviews.llvm.org/D145945
DFAJumpThreading
JumpThreading
LibCallsShrink
LoopVectorize
SLPVectorizer
DeadStoreElimination
AggressiveDCE
CorrelatedValuePropagation
IndVarSimplify
These are part of the optimization pipeline, of which the legacy version is deprecated and being removed.
There is no need to store information about invariance in the recipe.
Replace the fields with checks of the operands using
isDefinedOutsideVectorRegions.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D144489
There is no need to store information about invariance in the recipe.
Replace the fields with checks of the operands using
isDefinedOutsideVectorRegions.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D144487
This patch adds the predicate as additional operand to VPReplicateRecipe
during initial construction. The predicated recipes are later moved into
replicate regions. This simplifies construction and some VPlan
transformations, like fixed-order recurrence handling.
It also improves codegen in some cases (e.g. for in-loop reductions),
because the recipes remain in the same block.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D143865
Check if a replicate recipe is in a replicate region when considering
whether to collect predicated instructions. This allows using IsPredicated
for recipes with a mask attached directly in D143865.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D145322
AArch64/reg-usage.ll has an issue with the output ordering due to the use
of an unordered container. This was discovered by the
-DLLVM_REVERSE_ITERATION:BOOL=ON cmake option.
This patch addresses it by making use of an ordered container.
Differential Revision: https://reviews.llvm.org/D145472/
This patch adds support for scalarizing calls to a function when
there is a vector variant that cannot be used, either because there
isn't a masked variant or because the cost model indicated a VF
without a masked variant was better.
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D134422
This work follows on from D142109 and addresses a possible regression
when we know the loop iteration counter cannot overflow.
When we know the overflow-check always evaluates to false, it's better to
use the other style of tail folding where it assumes a runtime check was
added, because that avoids having to calculate a modified trip-count.
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D142894
When using tail-folding and using the predicate for both data and control-flow
(the next vector iteration's predicate is generated with the llvm.active.lane.mask
intrinsic and then tested for the backedge), the LoopVectorizer still inserts a
runtime check to see if the 'i + VF' may at any point overflow for the given
trip-count. When it does, it falls back to a scalar epilogue loop.
We can get rid of that runtime check in the pre-header and therefore also
remove the scalar epilogue loop. This reduces code-size and avoids a runtime
check.
Consider the following loop:
void foo(char * __restrict__ dst, char *src, unsigned long N) {
  for (unsigned long i=0; i<N; ++i)
    dst[i] = src[i] + 42;
}
If 'N' is e.g. ULONG_MAX, and the VF > 1, then the loop iteration counter
will overflow when calculating the predicate for the next vector iteration
at some point, because LLVM does:
vector.ph:
%active.lane.mask.entry = tail call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 0, i64 %N)
vector.body:
%index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%active.lane.mask = phi <vscale x 16 x i1> [ %active.lane.mask.entry, %vector.ph ], [ %active.lane.mask.next, %vector.body ]
...
%index.next = add i64 %index, 16
; The add above may overflow, which would affect the lane mask and control flow. Hence a runtime check is needed.
%active.lane.mask.next = tail call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 %index.next, i64 %N)
%8 = extractelement <vscale x 16 x i1> %active.lane.mask.next, i64 0
br i1 %8, label %vector.body, label %for.cond.cleanup, !llvm.loop !7
The solution:
What we can do instead is calculate the predicate before incrementing
the loop iteration counter, such that the llvm.active.lane.mask is
calculated from 'i' to 'tripcount > VF ? tripcount - VF : 0', i.e.
vector.ph:
%active.lane.mask.entry = tail call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 0, i64 %N)
%N_minus_VF = select %N > 16 ? %N - 16 : 0
vector.body:
%index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%active.lane.mask = phi <vscale x 16 x i1> [ %active.lane.mask.entry, %vector.ph ], [ %active.lane.mask.next, %vector.body ]
...
%active.lane.mask.next = tail call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 %index, i64 %N_minus_VF)
%index.next = add i64 %index, %4
; The add above may still overflow, but this time the active.lane.mask is not affected
%8 = extractelement <vscale x 16 x i1> %active.lane.mask.next, i64 0
br i1 %8, label %vector.body, label %for.cond.cleanup, !llvm.loop !7
For N = 20, we'd then get:
vector.ph:
%active.lane.mask.entry = tail call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 0, i64 %N)
; %active.lane.mask.entry = <1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1>
%N_minus_VF = select 20 > 16 ? 20 - 16 : 0
; %N_minus_VF = 4
vector.body: (1st iteration)
... ; using <1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1> as predicate in the loop
...
%active.lane.mask.next = tail call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 0, i64 4)
; %active.lane.mask.next = <1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0>
%index.next = add i64 0, 16
; %index.next = 16
%8 = extractelement <vscale x 16 x i1> %active.lane.mask.next, i64 0
; %8 = 1
br i1 %8, label %vector.body, label %for.cond.cleanup, !llvm.loop !7
; branch to %vector.body
vector.body: (2nd iteration)
... ; using <1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0> as predicate in the loop
...
%active.lane.mask.next = tail call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 16, i64 4)
; %active.lane.mask.next = <0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0>
%index.next = add i64 16, 16
; %index.next = 32
%8 = extractelement <vscale x 16 x i1> %active.lane.mask.next, i64 0
; %8 = 0
br i1 %8, label %vector.body, label %for.cond.cleanup, !llvm.loop !7
; branch to %for.cond.cleanup
Reviewed By: fhahn, david-arm
Differential Revision: https://reviews.llvm.org/D142109
BlockFrequencyInfo should generally only be fetched in PGO builds
where a PSI profile summary is available. However, LoopVectorize
was fetching it unconditionally.
This results in a small compile-time improvement for non-PGO builds.
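A hedged sketch of the fetch-on-demand pattern described above (stub types and hypothetical names, not the actual LoopVectorize change):

#include <memory>

struct ProfileSummaryInfoStub {
  bool HasSummary = false;
  bool hasProfileSummary() const { return HasSummary; }
};
struct BlockFrequencyInfoStub { /* stands in for an expensive analysis result */ };

// Only compute BFI when a profile summary is actually present; non-PGO
// builds skip the analysis entirely.
std::unique_ptr<BlockFrequencyInfoStub>
maybeComputeBFI(const ProfileSummaryInfoStub &PSI) {
  if (!PSI.hasProfileSummary())
    return nullptr;
  return std::make_unique<BlockFrequencyInfoStub>();
}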
Differential Revision: https://reviews.llvm.org/D144953
Previously, while calculating register usage due to invariants, it was assumed that an invariant would always be part of widening
instructions. This resulted in calculating vector register types for vectors which can't be legalized (check the newly added test for more details).
An invariant might not always need a vector register. For example, an invariant might just be used for the iteration check.
This patch checks if the invariant is part of any widening instruction and considers register usage accordingly. Fixes issue 60493.
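For illustration (a made-up example, not the test from the patch), here is a loop where an invariant value only feeds the scalar iteration check and therefore never needs to be broadcast into a vector register:

// 'limit' is loop-invariant but only used by the scalar exit condition,
// so it should not be counted as needing a vector register.
void scaleUpTo(float *dst, const float *src, long n, long limit) {
  for (long i = 0; i < n && i < limit; ++i)
    dst[i] = src[i] * 2.0f;
}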
Differential Revision: https://reviews.llvm.org/D143422
To query the maximum value for vscale, the LV queries the vscale_range
attribute or a TTI hook. To avoid having to reimplement the same behaviour
for multiple uses (such as in D142894), it makes sense to move this code
to a separate function.
With assignment tracking enabled, and without this patch, a significant amount
of additional compiler run time comes from the RemoveRedundantDbgInstrs call in
InstCombine. This patch reduces compiler run time by choosing better places to
call RemoveRedundantDbgInstrs.
In non-assignment-tracking builds, RemoveRedundantDbgInstrs is called by
InstCombine if LowerDbgDeclare makes a change (i.e. it is _sometimes_
called). In assignment tracking builds LowerDbgDeclare doesn't do anything. We
still need to clean up redundant intrinsics to avoid a large performance hit
due to the number of instructions, so the current approach is to have
InstCombine _always_ call RemoveRedundantDbgInstrs.
Instrumenting the compiler to run RemoveRedundantDbgInstrs after every pass and
dump the numbers and building CTMark/tramp3d-v4 indicates that SROA and
LoopVectorize give us a bigger bang (number removed) for buck (times pass is
run).
The compile time tracker reports that this patch reduces the number of
instructions retired building CTMark projects by an average of 1.1%.
Reviewed By: scott.linder
Differential Revision: https://reviews.llvm.org/D144483
In order to allow targets to disable interleaving for scalable vectors, pass the entire VF's ElementCount to getMaxInterleaveFactor.
This is based off of the approach used here: 8d36708507
The plan would then be to disable interleaving on scalable VFs on RISC-V in a follow up patch.
See https://reviews.llvm.org/D143723#4132349
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D144474
There is no need to update the AlsoPack field when creating
VPReplicateRecipes. It can be easily computed based on the VP def-use
chains when it is needed.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D143864
The code only needs access to InvalidCosts, ORE and TheLoop, so it can
easily be moved into a helper to make selectVectorizationFactor more
compact.
Reviewed By: sdesmalen
Differential Revision: https://reviews.llvm.org/D143957
Previously, pseudo probes were dropped from a vectorized loop body during loop vectorization. This can result in the samples of the loop entry being used for the loop body, which in turn can cause undercounting of the loop iteration count. The undercounting can further prevent the loop from being vectorized in the next build. I'm fixing this by explicitly allowing pseudo probes to be kept in the vectorized loop body, and by claiming a probe instruction is not "uniform", so that the vectorizer will duplicate it by the number of vector lanes.
For one internal service, I'm seeing the change increases the size of the .pseudoprobe section by 0.7%, which should amount to around 0.2% of the whole binary size.
Reviewed By: wenlei
Differential Revision: https://reviews.llvm.org/D144066
When vectorizing code with function calls in it, if we encounter
a function which only has vectorized variants requiring a mask,
we can synthesize an all-true mask to enable us to proceed.
Since we want the mask to be represented in vplan, the pointer
to the chosen Function is now stored as part of the
VPWidenCallRecipe, and mask arguments are added at the
appropriate index to the recipe operands.
Reviewed By: david-arm, fhahn, reames
Differential Revision: https://reviews.llvm.org/D132458
Fixed an issue where 'ConstantInt::get(IndexTy, -Part)' was executed with the wrong type for Part,
e.g. IndexTy was i64, but Part was 'unsigned', which led to things like 'mul i64 .., 4294967292',
which was obviously wrong.
Also changed sve-vector-reverse.ll to be vectorized with UF>1 to test this.
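To illustrate the wrap-around (a standalone snippet, not the patched code): negating a 32-bit unsigned value and then widening it reproduces the bogus 4294967292 constant.

#include <cstdint>
#include <iostream>

int main() {
  unsigned Part = 4;
  // -Part wraps to 4294967292 in 32 bits; zero-extending keeps that value.
  uint64_t Wrong = static_cast<uint64_t>(-Part);
  // Negating after widening to a signed 64-bit type gives the intended -4.
  int64_t Right = -static_cast<int64_t>(Part);
  std::cout << Wrong << " vs " << Right << "\n"; // prints "4294967292 vs -4"
  return 0;
}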
This reverts commit 1f01cdda68614dba12af3cc3aff38541d0abcc6b.