This change is mostly about getting rid of some "uninteresting" cases in a follow on deeper heuristic change. If anyone sees actually interesting code differences out of this, please let me know. I'm not expecting this to have much impact at all.
Case 1 - With the single deoptimize non-latch exit, we can't have two exiting blocks sharing an exit block. We can only hit this with a poorly documented debug flag.
Case 2 - Why should we treat epilog cases differently from prolog cases? Or to say it differently, why should starting with a constant control whether a multiple exit loop gets unrolled?
Sorry for the lack of tests here. These are both *exceedingly* narrow cases in practice, and after a while trying, I couldn't come up with a test which did anything "useful" as opposed to simply exercise a random combination of force flags. Note that the legality cases for each are already exercised with force flags.
All of the interesting logic from this routine has been removed, inline the single check into the sole non-assert caller. The assert use has little value with the restructured code and is simply dropped.
This reverts commit c35e8185d8c170c20e28956e0c9f3c1be895fefb.
mstorsjo reported in the revision thread that one VNCoercion assertion
is violated and seemly in relate to this commit. As per "If a test case
that demonstrates a problem is reported in the commit thread, please
revert and investigate offline", this commit is reverted.
ProfileCount could model invalid values, but a user had no indication
that the getCount method could return bogus data. Optional<ProfileCount>
addresses that, because the user must dereference the optional. In
addition, the patch removes concept duplication.
Differential Revision: https://reviews.llvm.org/D113839
The if-check above deleted part guarantees that StoreOffset <= LoadOffset
and that StoreOffset + StoreSize >= LoadOffset + LoadSize, and given that
LoadOffset + LoadSize > LoadOffset when LoadSize > 0. Thus, this shows
StoreOffset + StoreSize > LoadOffset is guaranteed given LoadSize > 0,
while it could be meaningless to have a type with nonpositive size, so that
the check could be removed.
Part of revision D100179
Reviewed By: nikic
This is one of those wonderful "in theory X doesn't matter, but in practice is does" changes. In this particular case, we shift the IVs inserted by the runtime unroller to clamp iteration count of the loops* from decrementing to incrementing.
Why does this matter? A couple of reasons:
* SCEV doesn't have a native subtract node. Instead, all subtracts (A - B) are represented as A + -1 * B and drops any flags invalidated by such. As a result, SCEV is slightly less good at reasoning about edge cases involving decrementing addrecs than incrementing ones. (You can see this in the inferred flags in some of the test cases.)
* Other parts of the optimizer produce incrementing IVs, and they're common in idiomatic source language. We do have support for reversing IVs, but in general if we produce one of each, the pair will persist surprisingly far through the optimizer before being coalesced. (You can see this looking at nearby phis in the test cases.)
Note that if the hardware prefers decrementing (i.e. zero tested) loops, LSR should convert back immediately before codegen.
* Mostly irrelevant detail: The main loop of the prolog case is handled independently and will simple use the original IV with a changed start value. We could in theory use this scheme for all iteration clamping, but that's a larger and more invasive change.
The unrolling code was previously inserting new cloned blocks at the end of the function. The result of this with typical loop structures is that the new iterations are placed far from the initial iteration.
With unrolling, the general assumption is that the a) the loop is reasonable hot, and b) the first Count-1 copies of the loop are rarely (if ever) loop exiting. As such, placing Count-1 copies out of line is a fairly poor code placement choice. We'd much rather fall through into the hot (non-exiting) path. For code with branch profiles, later layout would fix this, but this may have a positive impact on non-PGO compiled code.
However, the real motivation for this change isn't performance. Its readability and human understanding. Having to jump around long distances in an IR file to trace an unrolled loop structure is error prone and tedious.
Without this patch, passingValueIsAlwaysUndefined will iterate over all
instructions from I to the end of the basic block, even if the use is
outside the block.
This patch adds an early bail out, if the use instruction is outside I's
BB. This can greatly reduce compile-time in cases where very large basic
blocks are involved, with a large number of PHI nodes and incoming
values.
Note that the refactoring makes the handling of the case where I is a
phi and Use is in PHI more explicit as well: for phi nodes, we can also
directly bail out. In the existing code, we would iterate until we reach
the end and return false.
Based on an earlier patch by Matt Wala.
Reviewed By: lebedev.ri
Differential Revision: https://reviews.llvm.org/D113293
This is a fix for test failures on expensive checks build caused by db289340c841990055a164e8eb2a3b5ff25677bf.
With LLVM_ENABLE_EXPENSIVE_CHECKS enabled the llvm::sort shuffles the given container.
However, the sort is only called when the TTI is passed to replaceCongruentIVs.
In the mentioned patch we pass it TTI, so the sort happens. But due to shuffling
equivalent Phis may appear in different order from run to run.
With the stable_sort instead of sort this is impossible - the order of sorted Phis
is preserved.
Extended value is known to be inside range smaller than full one.
Prevent SCCP to mark such value as overdefined.
Fixes PR52253
Differential Revision: https://reviews.llvm.org/D112721
Added support for peeling loops with exits that are followed either by an
unreachable-terminated block or block that has a terminatnig deoptimize call.
All blocks in the sequence must have an unique successor, maybe except
for the last one.
Reviewed By: mkazantsev
Differential Revision: https://reviews.llvm.org/D110922
It's a no-op, no overflow happens ever: https://alive2.llvm.org/ce/z/Zw89rZ
While generally i don't like such hacks,
we have a very good reason to do this: here we are expanding
a run-time correctness check for the vectorization,
and said `umul_with_overflow` will not be optimized out
before we query the cost of the checks we've generated.
Which means, the cost of run-time checks would be artificially inflated,
and after https://reviews.llvm.org/D109368 that will affect
the minimal trip count for which these checks are even evaluated.
And if they aren't even evaluated, then the vectorized code
certainly won't be run.
We could consider doing this in IRBuilder, but then we'd need to
also teach `CreateExtractValue()` to look into chain of `insertvalue`'s,
and i'm not sure there's precedent for that.
Refs. https://reviews.llvm.org/D109368#3089809
The function simplifyOnce only calls simplifyOnceImpl and does nothing else.
Having this separate helper makes no sense. Removing it.
Patch by Dmitry Bakunevich!
Differential Revision: https://reviews.llvm.org/D112517
Reviewed By: mkazantsev
When peeling a loop, we assume that the latch has a `br` terminator and that
all loop exits are either terminated with an `unreachable` or have a terminating
deoptimize call. So when we peel off the 1st iteration, we change the IDom of
all loop exits to the peeled copy of `NCD(IDom(Exit), Latch)`. This works now,
but if we add logic to support loops with exits that are followed by a block
with an `unreachable` or a terminating deoptimize call, changing the exit's idom
wouldn't be enough and DT would be broken.
For example, let `Exit1` and `Exit2` are loop exits, and each of them
unconditionally branches to the same `unreachable` terminated block. So neither
of the exits dominates this unreachable block. If we change the IDoms of the
exits to some peeled loop block, we don't update the dominators of the unreachable
block. Currently we just don't get to the peeling logic, saying that we can't peel
such loops.
Previously we stored exits' IDoms in a map before peeling a loop and then, after
peeling off one iteration, we changed their IDoms.
Now we use the same logic not only for exits but for all non-loop blocks dominated
by the loop.
So when we add logic to support peeling loops with exits which branch, for example,
to an unreachable-terminated block, we would update the IDoms not only for exits,
but for their successors.
Patch by Dmitry Makogon!
Differential Revision: https://reviews.llvm.org/D111611
Reviewed By: mkazantsev, nikic
Always insert values into ExprValueMap, and instead skip using them
in SCEVExpander if poison-generating flags have been lost. This
ensures that all values that are in ValueExprMap are also in
ExprValueMap, so we can use the latter to invalidate the former.
This change is probably not entirely NFC for the case where
originally the SCEV had no nowrap flags but they were inferred
later, in which case that would now allow reusing the existing
value for expansion.
Differential Revision: https://reviews.llvm.org/D112389
As this API is now internally offset-based, we can accept a starting
offset and remove the need to create a temporary bitcast+gep
sequence to perform an offset load. The API now mirrors the
ConstantFoldLoadFromConst() API.
As discussed in D112016, our current requirement of speculatability
for ephemeral is overly strict: What we really care about is that
the instruction will be DCEd once the assume is dropped. For that
it is sufficient that the instruction is side-effect free and not
a terminator.
In particular, this allows non-dereferenceable loads to be ephemeral
values.
Differential Revision: https://reviews.llvm.org/D112179
At the moment, rewriteLoopExitValue forgets the current phi node in the
loop that collects phis to rewrite. A few lines after the value is
forgotten, SCEV is used again to analyze incoming values and
potentially expand SCEV expression. This means that another SCEV is
created for PN, before the IR is actually updated in the next loop.
This leads to accessing invalid cached expression in combination with
D71539.
PN should only be changed once the actual incoming exit value is set in
the next loop. Moving invalidation there should ensure that PN is
invalidated in all relevant cases.
Reviewed By: mkazantsev
Differential Revision: https://reviews.llvm.org/D111495
As discussed in:
* https://reviews.llvm.org/D94166
* https://lists.llvm.org/pipermail/llvm-dev/2020-September/145031.html
The GlobalIndirectSymbol class lost most of its meaning in
https://reviews.llvm.org/D109792, which disambiguated getBaseObject
(now getAliaseeObject) between GlobalIFunc and everything else.
In addition, as long as GlobalIFunc is not a GlobalObject and
getAliaseeObject returns GlobalObjects, a GlobalAlias whose aliasee
is a GlobalIFunc cannot currently be modeled properly. Creating
aliases for GlobalIFuncs does happen in the wild (e.g. glibc). In addition,
calling getAliaseeObject on a GlobalIFunc will currently return nullptr,
which is undesirable because it should return the object itself for
non-aliases.
This patch refactors the GlobalIFunc class to inherit directly from
GlobalObject, and removes GlobalIndirectSymbol (while inlining the
relevant parts into GlobalAlias and GlobalIFunc). This allows for
calling getAliaseeObject() on a GlobalIFunc to return the GlobalIFunc
itself, making getAliaseeObject() more consistent and enabling
alias-to-ifunc to be properly modeled in the IR.
I exercised some judgement in the API clients of GlobalIndirectSymbol:
some were 'monomorphized' for GlobalAlias and GlobalIFunc, and
some remained shared (with the type adapted to become GlobalValue).
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D108872
This simplifies the return value of addRuntimeCheck from a pair of
instructions to a single `Value *`.
The existing users of addRuntimeChecks were ignoring the first element
of the pair, hence there is not reason to track FirstInst and return
it.
Additionally all users of addRuntimeChecks use the second returned
`Instruction *` just as `Value *`, so there is no need to return an
`Instruction *`. Therefore there is no need to create a redundant
dummy `and X, true` instruction any longer.
Effectively this change should not impact the generated code because the
redundant AND will be folded by later optimizations. But it is easy to
avoid creating it in the first place and it allows more accurately
estimating the cost of the runtime checks.
This reverts commit fa16329ae0721023376f24c7577b9020d438df1a.
See comments in discussion. Merged by mistake, not entirely getting what
the problem was.
When peeling a loop, we assume that the latch has a `br` terminator and
that all loop exits are either terminated with an `unreachable` or have
a terminating deoptimize call. So when we peel off the 1st iteration, we
change the IDom of all loop exits to the peeled copy of
`NCD(IDom(Exit), Latch)`. This works now, but if we add logic to support
loops with exits that are followed by a block with an `unreachable` or a
terminating deoptimize call, changing the exit's idom wouldn't be enough
and DT would be broken.
For example, let `Exit1` and `Exit2` are loop exits, and each of them
unconditionally branches to the same `unreachable` terminated block. So
neither of the exits dominates this unreachable block. If we change the
IDoms of the exits to some peeled loop block, we don't update the
dominators of the unreachable block. Currently we just don't get to the
peeling logic, saying that we can't peel such loops.
With this NFC we just insert edges from cloned exiting blocks to their
exits after peeling each iteration (we accumulate the insertion updates
and then after peeling apply the updates to DT).
This patch was a part of D110922.
Patch by Dmitry Makogon!
Differential Revision: https://reviews.llvm.org/D111611
Reviewed By: mkazantsev
Fixes: https://bugs.llvm.org/show_bug.cgi?id=51841
This patch places an arbitrary limit on the size of DIExpressions that
we will produce via salvaging, for performance reasons. This helps to
fix a performance issue observed in the bug above, in which debug values
would be salvaged hundreds of times, producing expressions with over
1000 elements and causing the compiler to hang. Limiting the size of
debug values that we will produce to 128 largely fixes this issue.
Reviewed By: dblaikie, jmorse
Differential Revision: https://reviews.llvm.org/D110332
Rather than checking for loop nest preheaders upfront in IVUsers,
move this requirement into isSafeToExpand() from SCEVExpander.
Historically, LSR did not check whether SCEVs are safe to expand
and fully relied on IVUsers to validate this. Later, support for
non-expandable SCEVs was added via rigid formulas.
Checking this in isSafeToExpand() makes it more obvious what
exactly this check is guarding against, and avoids the awkward
loop nest scan.
This is a followup to https://reviews.llvm.org/D111493#3055286.
Differential Revision: https://reviews.llvm.org/D111681
This patch continues unblocking optimizations that are blocked by pseudo probe instrumentation.
Not exactly like DbgIntrinsics, PseudoProbe intrinsic has other attributes (such as mayread, maywrite, mayhaveSideEffect) that can block optimizations. The issues fixed are:
- Flipped default param of getFirstNonPHIOrDbg API to skip pseudo probes
- Unblocked CSE by avoiding pseudo probe from clobbering memory SSA
- Unblocked induction variable simpliciation
- Allow empty loop deletion by treating probe intrinsic isDroppable
- Some refactoring.
Reviewed By: wenlei
Differential Revision: https://reviews.llvm.org/D110847
This patch adds a new cost heuristic that allows peeling a single
iteration off read-only loops, if the loop contains a load that
1. is feeding an exit condition,
2. dominates the latch,
3. is not already known to be dereferenceable,
4. and has a loop invariant address.
If all non-latch exits are terminated with unreachable, such loads
in the loop are guaranteed to be dereferenceable after peeling,
enabling hoisting/CSE'ing them.
This enables vectorization of loops with certain runtime-checks, like
multiple calls to `std::vector::at` if the vector is passed as pointer.
Reviewed By: mkazantsev
Differential Revision: https://reviews.llvm.org/D108114
This patch adds further support for vectorisation of loops that involve
selecting an integer value based on a previous comparison. Consider the
following C++ loop:
int r = a;
for (int i = 0; i < n; i++) {
if (src[i] > 3) {
r = b;
}
src[i] += 2;
}
We should be able to vectorise this loop because all we are doing is
selecting between two states - 'a' and 'b' - both of which are loop
invariant. This just involves building a vector of values that contain
either 'a' or 'b', where the final reduced value will be 'b' if any lane
contains 'b'.
The IR generated by clang typically looks like this:
%phi = phi i32 [ %a, %entry ], [ %phi.update, %for.body ]
...
%pred = icmp ugt i32 %val, i32 3
%phi.update = select i1 %pred, i32 %b, i32 %phi
We already detect min/max patterns, which also involve a select + cmp.
However, with the min/max patterns we are selecting loaded values (and
hence loop variant) in the loop. In addition we only support certain
cmp predicates. This patch adds a new pattern matching function
(isSelectCmpPattern) and new RecurKind enums - SelectICmp & SelectFCmp.
We only support selecting values that are integer and loop invariant,
however we can support any kind of compare - integer or float.
Tests have been added here:
Transforms/LoopVectorize/AArch64/sve-select-cmp.ll
Transforms/LoopVectorize/select-cmp-predicated.ll
Transforms/LoopVectorize/select-cmp.ll
Differential Revision: https://reviews.llvm.org/D108136
This factors out utilities for scanning a bounded block of instructions since we have this code repeated in a bunch of places. The change to InlineFunction isn't strictly NFC as the limit mechanism there didn't handle debug instructions correctly.
Removed obsolete DT verification that should not be there because the
strategy of DT updates has changed.
Differential Revision: https://reviews.llvm.org/D110922
Added support for peeling loops with "deoptimizing" exits -
such exits that it or any of its children (or any of their
children, etc) either has a @llvm.experimental.deoptimize call
prior to the terminating return instruction of this basic block
or is terminated with unreachable. All blocks in the the
sequence must have a single successor, maybe except for the last
one.
Previously we only checked the exit block for being deoptimizing.
Now we check if the last reachable block from the exit is deoptimizing.
Patch by Dmitry Makogon!
Differential Revision: https://reviews.llvm.org/D110922
Reviewed By: mkazantsev