std::hash returns a 64bit hash code while previously we were using only lower 32 bits which caused hash collision for large workloads.
Reviewed By: wenlei, wlei
Differential Revision: https://reviews.llvm.org/D113688
No need to count the final shuffle cost for the constants, gathering of
the constants is just a constant vector + extra inserts, if required.
Differential Revision: https://reviews.llvm.org/D113770
Need to adjust the types of GEPs indices when building the tree
entries/operands. Otherwise some of the nodes might differ and
vectorizer is unable to correctly find them and count their cost.
Differential Revision: https://reviews.llvm.org/D113792
rGf39978b84f1d3a1da6c32db48f64c8daae64b3ad led to and/or exposed
an issue with IndVarSimplification for a loop where a i32 phi node is
no longer replaced by a widened (i64) phi node, because the SCEVs of a
sign-extend no longer folded the same way. I'm unsure how to properly
explain this because it's all rather complicated, but in short: SCEVs
don't fold as nicely as they used to and this caused a difference.
While investigating this, I found that IndVarSimplify can actually
optimise the case in the way we want to if it chooses the widened IV to
be 'signed' (the i32 IV is both sign and zero-extended). Oddly enough,
there is some level of indeterminism in the way the algorithm works,
it just picks the sign of the 'first' zext/sext user, where the order of
the users-iterator is not guaranteed to be the same on each invocation
of the pass (e.g. shown by first running loop-rotate, which puts the
users in a different order).
While I think the fix is valid in the sense that consistently picking
_any_ order is better than having an nondeterministic order, I can
use a bit of advice from people more familiar in this area of the
code-base.
For example, I'm not sure if this fix is hiding another issue where the
IndVarSimplify pass could actually draw the same conclusions (i.e. that
it only needs an i64 phi node) if it does a bit more work, regardless
of whether it chooses the induction variable to be signed or unsigned.
I'm also not sure if choosing signed is better than unsigned, or whether
that just happens to be beneficial only in this individual case.
Any feedback would be much appreciated!
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D112573
Previously, any change in any function in an SCC would cause all
analyses for all functions in the SCC to be invalidated. With this
change, we now manually invalidate analyses for functions we modify,
then let the pass manager know that all function analyses should be
preserved since we've already handled function analysis invalidation.
So far this only touches the inliner, argpromotion, function-attrs, and
updateCGAndAnalysisManager(), since they are the most used.
This is part of an effort to investigate running the function
simplification pipeline less on functions we visit multiple times in the
inliner pipeline.
However, this causes major memory regressions especially on larger IR.
To counteract this, turn on the option to eagerly invalidate function
analyses. This invalidates analyses on functions immediately after
they're processed in a module or scc to function adaptor for specific
parts of the pipeline.
Within an SCC, if a pass only modifies one function, other functions in
the SCC do not have their analyses invalidated, so in later function
passes in the SCC pass manager the analyses may still be cached. It is
only after the function passes that the eager invalidation takes effect.
For the default pipelines this makes sense because the inliner pipeline
runs the function simplification pipeline after all other SCC passes
(except CoroSplit which doesn't request any analyses).
Overall this has mostly positive effects on compile time and positive effects on memory usage.
https://llvm-compile-time-tracker.com/compare.php?from=7f627596977624730f9298a1b69883af1555765e&to=39e824e0d3ca8a517502f13032dfa67304841c90&stat=instructionshttps://llvm-compile-time-tracker.com/compare.php?from=7f627596977624730f9298a1b69883af1555765e&to=39e824e0d3ca8a517502f13032dfa67304841c90&stat=max-rss
D113196 shows that we slightly regressed compile times in exchange for
some memory improvements when turning on eager invalidation. D100917
shows that we slightly improved compile times in exchange for major
memory regressions in some cases when invalidating less in SCC passes.
Turning these on at the same time keeps the memory improvements while
keeping compile times neutral/slightly positive.
Reviewed By: asbirlea, nikic
Differential Revision: https://reviews.llvm.org/D113304
This change is mostly about getting rid of some "uninteresting" cases in a follow on deeper heuristic change. If anyone sees actually interesting code differences out of this, please let me know. I'm not expecting this to have much impact at all.
Case 1 - With the single deoptimize non-latch exit, we can't have two exiting blocks sharing an exit block. We can only hit this with a poorly documented debug flag.
Case 2 - Why should we treat epilog cases differently from prolog cases? Or to say it differently, why should starting with a constant control whether a multiple exit loop gets unrolled?
Sorry for the lack of tests here. These are both *exceedingly* narrow cases in practice, and after a while trying, I couldn't come up with a test which did anything "useful" as opposed to simply exercise a random combination of force flags. Note that the legality cases for each are already exercised with force flags.
All of the interesting logic from this routine has been removed, inline the single check into the sole non-assert caller. The assert use has little value with the restructured code and is simply dropped.
A bunch of scalars can be treated as a splat not only if all elements
are the same but also if some of them are undefvalues.
Differential Revision: https://reviews.llvm.org/D113774
This reverts commit c35e8185d8c170c20e28956e0c9f3c1be895fefb.
mstorsjo reported in the revision thread that one VNCoercion assertion
is violated and seemly in relate to this commit. As per "If a test case
that demonstrates a problem is reported in the commit thread, please
revert and investigate offline", this commit is reverted.
If the vector intrinsic has scalar argument, we currently still create
a tree entry for this argument. This entry is not used, just consumes
resources and increases the cost of the tree.
Differential Revision: https://reviews.llvm.org/D113806
The interface is a convenience function to ask if a block requires
predication when widening, but it's important that there are two
separate concepts to consider:
(A) The block was predicated in the original loop.
(B) The block was unpredicated in the original loop, but requires
predication because of tail folding.
In the case of (B) we know that at least one lane of the vector will
be executed, which means we can implementing a load from a uniform address
with a scalar load + splat (D112552). In the case of predication because
of (A), we cannot do this, because the scalar load itself requires
predication.
The name 'blockNeedsPredication' does not make the distinction between
(A) and (B), hence the reason to rename it.
Reviewed By: david-arm
Differential Revision: https://reviews.llvm.org/D113392
ProfileCount could model invalid values, but a user had no indication
that the getCount method could return bogus data. Optional<ProfileCount>
addresses that, because the user must dereference the optional. In
addition, the patch removes concept duplication.
Differential Revision: https://reviews.llvm.org/D113839
The if-check above deleted part guarantees that StoreOffset <= LoadOffset
and that StoreOffset + StoreSize >= LoadOffset + LoadSize, and given that
LoadOffset + LoadSize > LoadOffset when LoadSize > 0. Thus, this shows
StoreOffset + StoreSize > LoadOffset is guaranteed given LoadSize > 0,
while it could be meaningless to have a type with nonpositive size, so that
the check could be removed.
Part of revision D100179
Reviewed By: nikic
This is one of those wonderful "in theory X doesn't matter, but in practice is does" changes. In this particular case, we shift the IVs inserted by the runtime unroller to clamp iteration count of the loops* from decrementing to incrementing.
Why does this matter? A couple of reasons:
* SCEV doesn't have a native subtract node. Instead, all subtracts (A - B) are represented as A + -1 * B and drops any flags invalidated by such. As a result, SCEV is slightly less good at reasoning about edge cases involving decrementing addrecs than incrementing ones. (You can see this in the inferred flags in some of the test cases.)
* Other parts of the optimizer produce incrementing IVs, and they're common in idiomatic source language. We do have support for reversing IVs, but in general if we produce one of each, the pair will persist surprisingly far through the optimizer before being coalesced. (You can see this looking at nearby phis in the test cases.)
Note that if the hardware prefers decrementing (i.e. zero tested) loops, LSR should convert back immediately before codegen.
* Mostly irrelevant detail: The main loop of the prolog case is handled independently and will simple use the original IV with a changed start value. We could in theory use this scheme for all iteration clamping, but that's a larger and more invasive change.
The unrolling code was previously inserting new cloned blocks at the end of the function. The result of this with typical loop structures is that the new iterations are placed far from the initial iteration.
With unrolling, the general assumption is that the a) the loop is reasonable hot, and b) the first Count-1 copies of the loop are rarely (if ever) loop exiting. As such, placing Count-1 copies out of line is a fairly poor code placement choice. We'd much rather fall through into the hot (non-exiting) path. For code with branch profiles, later layout would fix this, but this may have a positive impact on non-PGO compiled code.
However, the real motivation for this change isn't performance. Its readability and human understanding. Having to jump around long distances in an IR file to trace an unrolled loop structure is error prone and tedious.
Need to fix ther cost estimation for split loads, since we look at the
subregs already, no need to permute them, need just to estimate
subregister insert, if it is smaller than the real register. Also, using
split loads, it might be profitable already to vectorize smaller trees
with gathering of the loads.
Differential Revision: https://reviews.llvm.org/D107188
For the scalar/splat case, this fold is subsumed by
foldLogOpOfMaskedICmps(). However, the conjugated fold for "or"
also supports splats with undef. Make both code paths consistent
by using m_ZeroInt() for the "and" implementation as well.
https://alive2.llvm.org/ce/z/tN63cuhttps://alive2.llvm.org/ce/z/ufB_Ue
When folding and/or of icmps, look through add of a constant and
adjust the icmp range instead. Effectively, this decomposes
X + C1 < C2 style range checks back into a normal range. This allows
us to fold comparisons involving two range checks or one range check
and some other condition. We had a fold for a really specific case
of this (or of range check and eq, and only one one side!) while
this handles it in fully generality.
Differential Revision: https://reviews.llvm.org/D113510
Since there is just a single check for LHS in ~(A | B) & C | ...
transforms and multiple RHS checks inside with more coming I am
removing m_OneUse checks for LHS and adding new checks for RHS.
This is non essential as long as there is total benefit.
In addition (~(A | B) & C) | (~(A | C) & B) --> (B ^ C) & ~A
checks were overly restrictive, it should be good without any
additional checks.
Differential Revision: https://reviews.llvm.org/D113141
A problem was noted in the post-commit review for
c36b7e21bd8f04a44d6 / D113035 :
If the source type is not integer or integer vector,
then we could crash when trying to ComputeNumSignBits().
Unfortunately sinking recipes for first-order recurrences relies on
the original position of recipes. So if a recipes needs to be sunk after
an optimized induction, it needs to stay in the original position, until
sinking is done. This is causing PR52460.
To fix the crash, keep the recipes in the original position until
sink-after is done.
Post-commit follow-up to c45045bfd04af9 to address PR52460.
This reverts commit 0d748b4d32cbddf58a1ff83f3ff178ec1ad49edc.
This is causing some failures when building Spec2017 with scalable
vectors. Reverting to investigate.
This reverts commit 7cd273c339cfe8427404f881ae280bd9fae6ff78.
Several patches with tests fixes have been applied:
0cada82f0a30e5ae22dce66b58604ab9b47a3897 "[Test] Remove incorrect test in GVN"
97cb13615d6d9df254e3c0f3deef9eaedfe189b6 "[Test] Separate IndVars test into AArch64 and X86 parts"
985cc490f17d28b20392ee214895d947b85120ef "[Test] Remove separated test in IndVars",
and test failures caused by 5ec2386 should be resolved now.
Previously, InstCombine detected a pair of llvm.stacksave/stackrestore
instructions that are adjacent modulo debug instructions in order to
eliminate the llvm.stackrestore. This precludes situations where
intervening instructions (e.g. loads) preclude the llvm.stacksave and
llvm.stackrestore from becoming adjacent. This commit extends the logic
and allows for eliminating the llvm.stackrestore when the range of
instructions between them does not include any alloca or side-effect
causing instructions.
Signed-off-by: Itay Bookstein <itay.bookstein@nextsilicon.com>
Reviewed By: lebedev.ri
Differential Revision: https://reviews.llvm.org/D113105
Hoist the instruction classification logic outside the loop
in preparation for reuse in a future commit.
Signed-off-by: Itay Bookstein <itay.bookstein@nextsilicon.com>
Reviewed By: lebedev.ri
Differential Revision: https://reviews.llvm.org/D113464
This patch uses the abstract attributor introduced in D111054 to get the
assumption values instead of the `hasAssumption` function. This also
calls it so assumption information should propagate throug the device
where applicabile.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D111445
This patch introduces a new abstract attributor instance that propagates
assumption information from functions. Conceptually, if a function is
only called by functions that have certain assumptions, then we can
apply the same assumptions to that function. This problem is similar to
calculating the dominator set, but the assumptions are merged instead of
nodes.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D111054
add tracing for loads and stores.
The primary goal is to have more options for data-flow-guided fuzzing,
i.e. use data flow insights to perform better mutations or more agressive corpus expansion.
But the feature is general puspose, could be used for other things too.
Pipe the flag though clang and clang driver, same as for the other SanitizerCoverage flags.
While at it, change some plain arrays into std::array.
Tests: clang flags test, LLVM IR test, compiler-rt executable test.
Reviewed By: morehouse
Differential Revision: https://reviews.llvm.org/D113447
The implementation for and/or is the same, apart from the choice
of exactIntersectWith() vs exactUnionWith(). Extract a common
function to make future extension easier.
Rather than testing for many specific combinations of predicates
and values, compute the exact icmp regions for both comparisons
and check whether they union/intersect exactly. If they do,
construct the equivalent icmp for the new range. Assuming that the
existing code handled all possible cases, this should be NFC.
Differential Revision: https://reviews.llvm.org/D113367