Currently, if a loop contains loads that we can prove at compile time are dereferenceable when certain conditions are satisfied, the function isDereferenceableAndAlignedInLoop will still return false, because getSmallConstantMaxTripCount returns 0 whenever SCEV predicates are required. This patch changes getSmallConstantMaxTripCount to take an optional Predicates pointer argument so that functions such as isDereferenceableAndAlignedInLoop can consider more cases.
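For illustration, roughly how a caller might use the new argument (a sketch, assuming the added parameter is a pointer to a SmallVectorImpl of SCEVPredicate; `SE` and `L` stand for the usual ScalarEvolution and Loop objects):
```cpp
// Sketch: collect the predicates SCEV needed to assume in order to compute a
// constant max trip count, instead of getting 0 back whenever predicates are
// required.
SmallVector<const SCEVPredicate *, 4> Predicates;
unsigned MaxTripCount = SE.getSmallConstantMaxTripCount(L, &Predicates);
if (MaxTripCount != 0) {
  // The trip-count bound only holds under the collected Predicates; callers
  // like isDereferenceableAndAlignedInLoop must decide whether to accept them.
}
```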
It is almost always simpler to use {} instead of std::nullopt to
initialize an empty ArrayRef. This patch changes all occurrences I could
find in LLVM itself. In the future, the ArrayRef(std::nullopt_t) constructor could be deprecated or removed.
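For illustration, the kind of change this makes at call sites:
```cpp
#include "llvm/ADT/ArrayRef.h"
#include <optional>

void example() {
  // Before: relies on the ArrayRef(std::nullopt_t) constructor.
  llvm::ArrayRef<int> Old = std::nullopt;
  // After: an empty braced initializer gives the same empty ArrayRef.
  llvm::ArrayRef<int> New = {};
  (void)Old;
  (void)New;
}
```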
There are a few places in ScalarEvolution.cpp where we copy predicates from one list to another, and they share a similar pattern:
  for (const auto *P : ENT.Predicates)
    Predicates->push_back(P);
We can avoid the loop by writing it like this:
  Predicates->append(ENT.Predicates.begin(), ENT.Predicates.end());
which may end up being more efficient, since we only have to reserve additional space once.
Due to a reviewer request on PR #88385 I have created this patch
to add a getPredicatedExitCount function, which is similar to
getExitCount except that it uses the predicated backedge taken
information. With PR #88385 we will start to care about more
loops with multiple exits, and want the ability to query exit
counts for a particular exiting block. Such loops may require
predicates in order to be vectorised.
New tests added here:
Analysis/ScalarEvolution/predicated-exit-count.ll
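A sketch of the intended usage (assuming the new function mirrors getExitCount, with an extra out-parameter collecting the required predicates; `SE`, `L` and `ExitingBB` are the usual ScalarEvolution, Loop and exiting BasicBlock objects):
```cpp
// Sketch: query the exit count of one particular exiting block of a
// multi-exit loop, allowing SCEV predicates to be collected.
SmallVector<const SCEVPredicate *, 4> Predicates;
const SCEV *ExitCount = SE.getPredicatedExitCount(L, ExitingBB, &Predicates);
if (!isa<SCEVCouldNotCompute>(ExitCount)) {
  // ExitCount is only valid under the assumptions gathered in Predicates.
}
```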
Whilst dealing with review comments on
https://github.com/llvm/llvm-project/pull/96752
I discovered that SCEV does not know about the dereferenceable attribute on function arguments, so I have updated getRangeRef to make use of it by calling getPointerDereferenceableBytes.
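A rough sketch of the idea (the shape of the actual getRangeRef change may differ; the range bound below is only the intuition):
```cpp
// Sketch: if a pointer argument is dereferenceable for DerefBytes bytes and
// cannot be null, the bytes [Ptr, Ptr + DerefBytes) must all be addressable,
// so the pointer is non-null and bounded away from the top of the address
// space.
bool CanBeNull = false, CanBeFreed = false;
uint64_t DerefBytes =
    V->getPointerDereferenceableBytes(DL, CanBeNull, CanBeFreed);
if (DerefBytes > 0 && !CanBeNull) {
  // Roughly: the pointer value lies in [1, 2^BitWidth - DerefBytes].
}
```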
Use computeConstantDifference() instead of casting getMinusSCEV() to
SCEVConstant. This can be much faster in some cases, because
computeConstantDifference() computes the result without creating new
SCEV expressions.
This improves LTO/ThinLTO compile-time for lencod by more than 10%.
I've verified that computeConstantDifference() does not produce worse
results than the previous code for anything in llvm-test-suite. This
required raising the iteration cutoff to 6. I ended up increasing it to
8 just to be on the safe side (for code outside llvm-test-suite), and
because this doesn't materially affect compile-time anyway (we'll almost
always bail out earlier).
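For context, the pattern being replaced and its cheaper counterpart, sketched (computeConstantDifference() hands back the difference as an APInt wrapped in std::optional; `More` and `Less` are the two SCEVs being compared):
```cpp
// Before: materializes a subtraction SCEV just to check for a constant result.
if (const auto *C = dyn_cast<SCEVConstant>(SE.getMinusSCEV(More, Less))) {
  // use C->getAPInt()
}

// After: walks the two expressions directly, creating no new SCEV nodes.
if (std::optional<APInt> Diff = SE.computeConstantDifference(More, Less)) {
  // use *Diff
}
```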
Inside computeConstantDifference(), handle the case where both sides are
of the form `C * %x`, in which case we can strip off the common
multiplication (as long as we remember to multiply by it for the
following difference calculation).
There is an obvious alternative implementation here, which would be to
directly decompose multiplies inside the "Multiplicity" accumulation.
This does work, but I've found this to be both significantly slower
(because everything has to work on APInt) and more complex in
implementation (e.g. because we now need to match back the new More/Less
with an arbitrary factor) without providing more power in practice. As
such, I went for the simpler variant here.
This is the last step to make computeConstantDifference() sufficiently
powerful to replace existing uses of
`cast<SCEVConstant>(getMinusSCEV())` with it.
computeConstantDifference() can currently look through addrecs with
identical steps, and then through adds with identical operands (apart
from constants).
However, it fails to handle minor variations, such as two nested add
recs, or an outer add with an inner addrec (rather than the other way
around).
This patch supports these cases by adding a loop over the
simplifications, limited to a small number of iterations. The motivation
is the same as in #101339, to make
computeConstantDifference() powerful enough to replace existing uses of
`dyn_cast<SCEVConstant>(getMinusSCEV())` with it. Though as the IR test
diff shows, other callers may also benefit.
The canAssumeNoSelfWrap routine in howManyLessThans was doing two subtly
inter-related things. First, it was proving no-self-wrap. This exactly
duplicates the existing logic in the caller. Second, it was establishing
the precondition for the nw->nsw/nuw inference. Specifically, we need to
know that *this* exit must be taken for the inference to be sound.
Otherwise, another (possibly abnormal) exit could be taken in the
iteration where this IV would become poison.
This change moves all of that logic into the caller, and caches the
resulting nuw/nsw flags in the AddRec. This centralizes the logic in one
place, and makes it clear that it all depends on controlling the sole
exit.
We do lose a couple of cases with SCEV predication. Specifically, if SCEV
predication was able to convert e.g. zext(addrec) into an addrec(zext)
using predication, but didn't record the nuw fact on the new addrec,
then the consuming code can no longer fix this up. I don't think this
case particularly matters.
---------
Co-authored-by: Nikita Popov <github@npopov.com>
The Mul factor was zero-extended here, resulting in incorrect
results for integers larger than 64-bit.
As we currently only multiply by 1 or -1, just split this into
two cases -- there's no need for a full multiplication here.
Fixes https://github.com/llvm/llvm-project/issues/102597.
Without this patch, the constructor arguments come from
SmallVectorImpl, not ArrayRef. This patch switches them to ArrayRef
so that we can construct SmallVector with a single argument.
Note that the LLVM Programmer's Manual prefers ArrayRef to SmallVectorImpl for flexibility.
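A minimal illustration with a hypothetical class (the names are made up; the point is only the single-argument member construction that ArrayRef enables):
```cpp
#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/SmallVector.h"

// Hypothetical example type, not from the patch.
struct Record {
  llvm::SmallVector<int> Values;

  // Before: taking SmallVectorImpl forces an iterator-pair copy:
  //   Record(const llvm::SmallVectorImpl<int> &V)
  //       : Values(V.begin(), V.end()) {}

  // After: ArrayRef is more flexible for callers, and the member can be
  // constructed from it with a single argument.
  Record(llvm::ArrayRef<int> V) : Values(V) {}
};
```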
Currently, computeConstantDifference() only deals with the case where we're subtracting adds with at most one non-constant operand. This patch extends it to cancel out common operands in the subtraction of arbitrary add expressions.
The background here is that I want to replace a getMinusSCEV() call in
LAA with computeConstantDifference():
93fecc2577/llvm/lib/Analysis/LoopAccessAnalysis.cpp (L1602-L1603)
This particular call is very expensive in some cases (e.g. lencod with
LTO) and computeConstantDifference() could achieve this much more
cheaply, because it does not need to construct new SCEV expressions.
However, the current computeConstantDifference() implementation is too
weak for this and misses many basic cases. This is a step towards making
it more powerful while still keeping it pretty fast.
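A standalone illustration of the idea, not the SCEV implementation: once the non-constant operands of the two add expressions cancel pairwise, the difference is just the difference of the constant terms.
```cpp
#include <algorithm>
#include <cstdint>
#include <optional>
#include <vector>

// Toy model of an add expression: a constant plus a multiset of opaque ids
// standing in for the non-constant operands.
struct AddExpr {
  int64_t Constant;
  std::vector<int> Operands;
};

// Returns More - Less if every non-constant operand cancels, else nullopt.
std::optional<int64_t> constantDifference(AddExpr More, AddExpr Less) {
  std::sort(More.Operands.begin(), More.Operands.end());
  std::sort(Less.Operands.begin(), Less.Operands.end());
  if (More.Operands != Less.Operands)
    return std::nullopt; // a non-constant term is left over
  return More.Constant - Less.Constant;
}
```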
Add a common constantFoldAndGroupOps() helper that takes care of
constant folding and grouping transforms that are common to all nary
ops. This moves the constant folding prior to grouping, which is more
efficient, and excludes any constant from the sort.
The constant folding has hooks for folding, identity constants and
absorber constants.
This gives a compile-time improvement for SCEV-heavy workloads like
lencod.
We have existing code which reasons that a step evenly dividing the iteration space of a finite loop with a single exit implies no-self-wrap. The sign of the step doesn't affect this.
---------
Co-authored-by: Nikita Popov <github@npopov.com>
SCEV has logic for inferring wrap flags on AddRecs which are known to
control an exit based on whether the step is a power of two. This logic
only considered constants, and thus did not trigger for steps such as (4
x vscale) which are common in scalably vectorized loops.
The net effect is that we were very sensitive to the preservation of nsw/nuw flags on such IVs, and could not infer trip counts if they got lost for any reason.
---------
Co-authored-by: Nikita Popov <github@npopov.com>
The cache triggers almost never, and seems unlikely to help with
performance. However, when it does, it is likely to cause the comparator
to become inconsistent due to a bad interaction of the depth limit and
cache hits. This leads to crashes in debug builds. See the new unit test
for a reproducer.
The use-def walk in forgetValue() was skipping instructions with
non-SCEVable types. However, SCEV may look past with.overflow
intrinsics returning aggregates.
Fixes #97586.
PR #96919 caused a minor compile-time regression, mostly because
SCEV now goes through an extra out-of-line function to fetch the
data layout, and does this a lot. Cache the DataLayout in SCEV
to avoid these repeated calls.
Fixes #92554 (std::reverse will auto-vectorize now)
When calculating the number of times an exit condition containing a comparison is executed, we mostly assume that the RHS of the comparison is loop-invariant, but it may be another add-recurrence.
~In that case, we can try the computation with `LHS = LHS - RHS` and `RHS = 0`.~ (It is not valid unless it is proven that it doesn't wrap.)
**Edit:**
We can calculate the backedge count for a loop structure like:
```cpp
left = left_start
right = right_start
while (left < right) {
  // ...do something...
  left += s1;  // the stride of left is s1 (> 0)
  right -= s2; // the stride of right is -s2 (s2 > 0)
}
// left and right converge somewhere in the middle of their start values
```
We can calculate the backedge count as ceil((End - left_start) /u (s1 - (-s2))), where End = max(left_start, right_start).
**Alive2**: https://alive2.llvm.org/ce/z/ggxx58
Namespaces are terminated with a closing comment in the majority of the
codebase so do the same here for consistency. Also format code within
some namespaces to make clang-format happy.
In `computeExitLimit()`, we use `getExitingBlock()` to check if the loop has exactly one exiting block.
Since `computeExitLimit()` is only used in `computeBackedgeTakenCount()`, and `getExitingBlocks()` is already called there to collect all exiting blocks, we can simply check whether the loop has exactly one exiting block by testing whether the number of exiting blocks equals 1 in `computeBackedgeTakenCount()`, and pass the result as an argument to `computeExitLimit()`.
This change helps to improve the compile time for files containing large
loops.
nusw implies nsw offset and nuw base+offset arithmetic if offset
is non-negative. nuw implies nuw offset and base+offset arithmetic.
As usual, we can only transfer the flags if poison implies UB.
This change builds on 0a357ad which supported non-constant strides in
howFarToZero, but used only context insensitive reasoning.
This change does two things:
1) Directly use context sensitive queries to prove facts established
before the loop. Note that we technically only need facts known
at the latch, but using facts known on entry is a conservative
approximation which will cover most everything.
2) For the non-zero check, we can usually prove non-zero from the
finite assumption implied by mustprogress. This eliminates the
need to do the context sensitive query in the common case.
If we have a urem expression, emitting it as a urem is significantly better than letting the full expansion kick in. We run the risk of a udiv or mul which could previously have been shared, but losing that seems like a reasonable tradeoff for being able to round-trip a urem without modification.
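For reference, the identity the full expansion materializes, shown on plain unsigned integers; a single urem is equivalent to a udiv, a mul and a sub, which is why keeping the urem intact is cheaper:
```cpp
#include <cassert>
#include <cstdint>

int main() {
  uint64_t A = 1234, B = 7;
  uint64_t Expanded = A - (A / B) * B; // the udiv + mul + sub expansion
  assert(Expanded == A % B);           // what a single urem computes
  return 0;
}
```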
VF * vscale is the canonical step for a scalably vectorized loop, and
LFTR canonicalizes to NE loop tests, so having our trip count logic be
unable to compute trip counts for such loops is unfortunate.
The existing code needed minimal generalization to handle non-constant
strides. The tricky cases to be sure we handle correctly are: zero, and
-1 (due to the special case of abs(-1) being non-positive).
This patch does the full generalization in terms of code structure, but in practice this seems unlikely to benefit anything beyond the (C * vscale) case. I did some quick investigation, and it seems the context-free non-zero and sign checks are basically never disproved for arbitrary scales. I think we have alternate tactics available for these, but I'm going to return to that in a separate patch.
SCEVLoopGuardRewriter only replaces operands with equivalent values, so
we should be able to transfer the flags from the original expression.
PR: https://github.com/llvm/llvm-project/pull/91472
This patch adds a predicated version of
getSymbolicMaxBackedgeTakenCount.
The intended use for this is loop access analysis for loops with
uncountable exits. When analyzing dependences and computing runtime
checks, we need the smallest upper bound on the number of iterations. In
terms of memory safety, it shouldn't matter if any uncomputable exits
leave the loop, as long as we prove that there are no dependences given
the minimum of the countable exits. The same should apply also for
generating runtime checks.
PR: https://github.com/llvm/llvm-project/pull/93498
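A sketch of the intended query from a client such as loop access analysis (assuming the new function follows the other predicated queries in taking an out-parameter for the predicates):
```cpp
// Sketch: ask for the symbolic max backedge-taken count while allowing SCEV
// predicates. This bounds the countable exits; uncountable exits can only
// make the loop finish earlier, which is fine for memory-safety reasoning.
SmallVector<const SCEVPredicate *, 4> Predicates;
const SCEV *SymbolicMaxBTC =
    SE.getPredicatedSymbolicMaxBackedgeTakenCount(L, Predicates);
if (!isa<SCEVCouldNotCompute>(SymbolicMaxBTC)) {
  // Only valid under Predicates, which must be checked at runtime.
}
```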
Update SCEVUnionPredicate::add to only add predicates from another union predicate if they aren't already implied by the union predicate we add them to.
Note that there exists logic elsewhere to avoid adding predicates if
they are already implied, but this logic misses cases when only some
predicates of a union predicate are implied by the current set of
predicates.
PR: https://github.com/llvm/llvm-project/pull/93397
If we have an exit condition of the form IV <= Limit, we will first try
to convert it into IV < Limit+1 or IV-1 < Limit based on range info (in
icmp simplification). If that fails, we try to convert it to IV < Limit
+ 1 based on controlling exits in non-infinite loops.
However, if all else fails, we can still determine the exit count by
rewriting to ext(IV) < ext(Limit) + 1, where the zero/sign extension
ensures that the addition does not overflow.
Proof: https://alive2.llvm.org/ce/z/iR-iYd
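A small concrete illustration of why the extension is needed, using plain C types in place of the IR types: with an 8-bit IV and Limit == 255, IV < Limit + 1 overflows in 8 bits, but the comparison performed after zero-extending both sides does not.
```cpp
#include <cassert>
#include <cstdint>

int main() {
  uint8_t Limit = 255;
  uint8_t IV = 10;
  assert(static_cast<uint8_t>(Limit + 1) == 0); // i8 Limit + 1 wraps to 0
  assert(static_cast<uint16_t>(IV) <
         static_cast<uint16_t>(Limit) + 1);     // zext'd comparison is exact
  return 0;
}
```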
When calculating the exit count exhaustively, if any of the involved
operations is non-deterministic, the exit count we compute at
compile-time and the exit count at run-time may differ. Using these
non-deterministic constant folding results is only correct if we
actually replace all uses of the instruction with the value. SCEV (or
its consumers) generally don't do this.
Handle this by adding a new AllowNonDeterministic flag to the constant
folding API, and disabling it in SCEV. If non-deterministic results are
not allowed, do not fold FP lib calls in general, and FP operations
returning NaNs in particular. This could be made more precise (some FP libcalls like fabs are fully deterministic), but I don't think that precise handling is worthwhile here.
Fixes the interesting part of
https://github.com/llvm/llvm-project/issues/89885.
The argument order to MatchBinaryAddToConst doesn't match the comment and is also counter-intuitive (passing RHS before LHS, C2 before C1).
This patch adjusts the order to be in line with the calls above, which should be equivalent but more natural:
https://alive2.llvm.org/ce/z/ZWGp-Z
PR: https://github.com/llvm/llvm-project/pull/91945
Use ICmpInst::compare() where possible, ConstantFoldCompareInstOperands
in other places. This only changes places where either the fold is guaranteed to succeed, or the code doesn't use the resulting compare if we fail to fold.
Rename has/dropPoisonGeneratingFlagsOrMetadata to
has/dropPoisonGeneratingAnnotations and make it also handle
nonnull, align and range return attributes on calls, similar
to the existing handling for !nonnull, !align and !range metadata.
Prior to #85863, the required parameters of llvm::isKnownNonZero were
Value and DataLayout. After, they are Value, Depth, and SimplifyQuery,
where SimplifyQuery is implicitly constructible from DataLayout. The
change to move Depth before SimplifyQuery needed callers to be updated
unnecessarily, and as commented in #85863, we actually want Depth to be
after SimplifyQuery anyway so that it can be defaulted and the caller
does not need to specify it.
BinomialCoefficient computes the value of a W-bit IV at iteration It of a loop. When W is 1, we can end up calling the multiplicative inverse on 0, which triggers an assert since 1b76120.
Since the arithmetic is supposed to wrap if It or K does not fit in W bits, do the truncation into W bits after we do the shift.
Fixes #87798