This patch adds metadata to disable runtime unrolling to the vectorized
loop. If runtime unrolling/interleaving is considered profitable, LV
will interleave the loop directly. There should be no need to perform
runtime unrolling at a later stage.
Note that we already add metadata to disable runtime unrolling to the
scalar loop after vectorization.
The additional unrolling unnecessarily increases code size and compile
time. In addition to that we have several bug reports of unncessary
runtime unrolling for vectorized loops, e.g. PR40961
Compile-time improvements:
NewPM-O3: -1.04%
NewPM-ReleaseThinLTO: -0.59%
NewPM-ReleaseLTO-g: -0.97%
https://llvm-compile-time-tracker.com/compare.php?from=ce1be13a868d0f8afa367975558c1a6175cce33a&to=78bc2e67f22e9e10e61cdb6cdac4bb857d95eb1b&stat=instructions:uFixes#40306.
Reviewed By: lebedev.ri, nikic
Differential Revision: https://reviews.llvm.org/D115261
This was originally part of D133788. There are no visible
regressions. All of the diffs show a large unsigned constant
becoming a small negative constant. This should be better
for analysis (and slightly less compile-time) and codegen.
Don't demand low order bits from the LHS of an Add if:
- they are not demanded in the result, and
- they are known to be zero in the RHS, so they can't possibly
overflow and affect higher bit positions
This is intended to avoid a regression from a future patch to change
the order of canonicalization of ADD and AND.
Differential Revision: https://reviews.llvm.org/D130075
We call tail-call-elim near the beginning of the pipeline,
but that is too early to annotate calls that get added later.
In the motivating case from issue #47852, the missing 'tail'
on memset leads to sub-optimal codegen.
I experimented with removing the early instance of
tail-call-elim instead of just adding another pass, but that
appears to be slightly worse for compile-time:
+0.15% vs. +0.08% time.
"tailcall" shows adding the pass; "tailcall2" shows moving
the pass to later, then adding the original early pass back
(so 1596886802 is functionally equivalent to 180b0439dc ):
https://llvm-compile-time-tracker.com/index.php?config=NewPM-O3&stat=instructions&remote=rotateright
Note that there was an effort to split the tail call functionality
into 2 passes - that could help reduce compile-time if we find
that this change costs more in compile-time than expected based
on the preliminary testing:
D60031
Differential Revision: https://reviews.llvm.org/D130374
This is one of those wonderful "in theory X doesn't matter, but in practice is does" changes. In this particular case, we shift the IVs inserted by the runtime unroller to clamp iteration count of the loops* from decrementing to incrementing.
Why does this matter? A couple of reasons:
* SCEV doesn't have a native subtract node. Instead, all subtracts (A - B) are represented as A + -1 * B and drops any flags invalidated by such. As a result, SCEV is slightly less good at reasoning about edge cases involving decrementing addrecs than incrementing ones. (You can see this in the inferred flags in some of the test cases.)
* Other parts of the optimizer produce incrementing IVs, and they're common in idiomatic source language. We do have support for reversing IVs, but in general if we produce one of each, the pair will persist surprisingly far through the optimizer before being coalesced. (You can see this looking at nearby phis in the test cases.)
Note that if the hardware prefers decrementing (i.e. zero tested) loops, LSR should convert back immediately before codegen.
* Mostly irrelevant detail: The main loop of the prolog case is handled independently and will simple use the original IV with a changed start value. We could in theory use this scheme for all iteration clamping, but that's a larger and more invasive change.
Right now when we see -O# we add the corresponding 'default<O#>' into
the list of passes to run when translating legacy -pass-name. This has
the side effect of not using the default AA pipeline.
Instead, treat -O# as -passes='default<O#>', but don't allow any other
-passes or -pass-name. I think we can keep `opt -O#` as shorthand for
`opt -passes='default<O#>` but disallow anything more than just -O#.
Tests need to be updated to not use `opt -O# -pass-name`.
Reviewed By: asbirlea
Differential Revision: https://reviews.llvm.org/D112036
This patch marks the induction increment of the main induction variable
of the vector loop as NUW when not folding the tail.
If the tail is not folded, we know that End - Start >= Step (either
statically or through the minimum iteration checks). We also know that both
Start % Step == 0 and End % Step == 0. We exit the vector loop if %IV +
%Step == %End. Hence we must exit the loop before %IV + %Step unsigned
overflows and we can mark the induction increment as NUW.
This should make SCEV return more precise bounds for the created vector
loops, used by later optimizations, like late unrolling.
At the moment quite a few tests still need to be updated, but before
doing so I'd like to get initial feedback to make sure I am not missing
anything.
Note that this could probably be further improved by using information
from the original IV.
Attempt of modeling of the assumption in Alive2:
https://alive2.llvm.org/ce/z/H_DL_g
Part of a set of fixes required for PR50412.
Reviewed By: mkazantsev
Differential Revision: https://reviews.llvm.org/D103255
This makes the induction part of the loop vectorizer match the reduction part.
We do not need all of the fast-math-flags. For example, there are some that
clearly are not in play like arcp or afn.
If we want to make FMF constraints consistent across the IR optimizer, we
might want to add nsz too, but that's up for debate (users can't expect
associative FP math and preservation of sign-of-zero at the same time?).
The calling code was fixed to avoid miscompiles with:
1bee549737ac
Differential Revision: https://reviews.llvm.org/D98708
These intrinsics, not the icmp+select are the canonical form nowadays,
so we might as well directly emit them.
This should not cause any regressions, but if it does,
then then they would needed to be fixed regardless.
Note that this doesn't deal with `SCEVExpander::isHighCostExpansion()`,
but that is a pessimization, not a correctness issue.
Additionally, the non-intrinsic form has issues with undef,
see https://reviews.llvm.org/D88287#2587863
I am trying to untangle the fast-math-flags propagation logic
in the vectorizers (see a6f022127 for SLP).
The loop vectorizer has a mix of checking FP function attributes,
IR-level FMF, and just wrong assumptions.
I am trying to avoid regressions while fixing this, and I think
the IR-level logic is good enough for that, but it's hard to say
for sure. This would be the 1st step in the clean-up.
The existing test that I changed to include 'fast' actually shows
a miscompile: the function only had the equivalent of nnan, but we
created new instructions that had fast (all FMF set). This is
similar to the example in https://llvm.org/PR35538
Differential Revision: https://reviews.llvm.org/D95452
... so just ensure that we pass DomTreeUpdater it into it.
Fixes DomTree preservation for a large number of tests,
all of which are marked as such so that they do not regress.
I noticed an add example like the one from D91343, so here's a similar patch.
The logic is based on existing code for the single-use demanded bits fold.
But I only matched a constant instead of using compute known bits on the
operands because that was the motivating patterni that I noticed.
I think this will allow removing a special-case (but incomplete) dedicated
fold within visitAnd(), but I need to untangle the existing code to be sure.
https://rise4fun.com/Alive/V6fP
Name: add with low mask
Pre: (C1 & (-1 u>> countLeadingZeros(C2))) == 0
%a = add i8 %x, C1
%r = and i8 %a, C2
=>
%r = and i8 %x, C2
Differential Revision: https://reviews.llvm.org/D91415
Summary:
Previosly we simply always said that `SCEVMinMaxExpr` is too costly to expand.
But this isn't really true, it expands into just a comparison+swap pair.
And again much like with add/mul, there will be one less such pair
than the number of operands. And we need to count the cost of operands themselves.
This does change a number of testcases, and as far as i can tell,
all of these changes are improvements, in the sense that
we fixed up more latches to do the [in]equality comparison.
This concludes cost-modelling changes, no other SCEV expressions exist as of now.
This is a part of addressing [[ https://bugs.llvm.org/show_bug.cgi?id=44668 | PR44668 ]].
Reviewers: reames, mkazantsev, wmi, sanjoy
Reviewed By: mkazantsev
Subscribers: hiraditya, javed.absar, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73744
Summary:
Currently, `SCEVExpander::isHighCostExpansionHelper()` has the following logic:
```
if (auto *UDivExpr = dyn_cast<SCEVUDivExpr>(S)) {
// If the divisor is a power of two and the SCEV type fits in a native
// integer (and the LHS not expensive), consider the division cheap
// irrespective of whether it occurs in the user code since it can be
// lowered into a right shift.
if (auto *SC = dyn_cast<SCEVConstant>(UDivExpr->getRHS()))
if (SC->getAPInt().isPowerOf2()) {
if (isHighCostExpansionHelper(UDivExpr->getLHS(), L, At,
BudgetRemaining, TTI, Processed))
return true;
const DataLayout &DL =
L->getHeader()->getParent()->getParent()->getDataLayout();
unsigned Width = cast<IntegerType>(UDivExpr->getType())->getBitWidth();
return DL.isIllegalInteger(Width);
}
```
Since this test does not have a datalayout specified,
`SCEVExpander::isHighCostExpansionHelper()` says that
`[[TMP2:%.*]] = lshr exact i64 [[TMP1]], 5` is high-cost, and didn't perform it.
But future patches will change that logic to solely rely on cost-model,
without any such datalayout checks, so i think it is best to show
that that change is ephemeral, and can already happen without costmodel changes.
Reviewers: reames, fhahn, sanjoy, craig.topper, RKSimon
Reviewed By: RKSimon
Subscribers: javed.absar, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73717
As it's causing some bot failures (and per request from kbarton).
This reverts commit r358543/ab70da07286e618016e78247e4a24fcb84077fda.
llvm-svn: 358546
This is no-functional-change-intended.
This is repackaging the functionality of D30333 (defer switch-to-lookup-tables) and
D35411 (defer folding unconditional branches) with pass parameters rather than a named
"latesimplifycfg" pass. Now that we have individual options to control the functionality,
we could decouple when these fire (but that's an independent patch if desired).
The next planned step would be to add another option bit to disable the sinking transform
mentioned in D38566. This should also make it clear that the new pass manager needs to
be updated to limit simplifycfg in the same way as the old pass manager.
Differential Revision: https://reviews.llvm.org/D38631
llvm-svn: 316835
Summary:
When simplifying unconditional branches from empty blocks, we pre-test if the
BB belongs to a set of loop headers and keep the block to prevent passes from
destroying canonical loop structure. However, the current algorithm fails if
the destination of the branch is a loop header. Especially when such a loop's
latch block is folded into loop header it results in additional backedges and
LoopSimplify turns it into a nested loop which prevent later optimizations
from being applied (e.g., loop unrolling and loop interleaving).
This patch augments the existing algorithm by further checking if the
destination of the branch belongs to a set of loop headers and defer
eliminating it if yes to LateSimplifyCFG.
Fixes PR33605: https://bugs.llvm.org/show_bug.cgi?id=33605
Reviewers: efriedma, mcrosier, pacxx, hsung, davidxl
Reviewed By: efriedma
Subscribers: ashutosh.nema, gberry, javed.absar, llvm-commits
Differential Revision: https://reviews.llvm.org/D35411
llvm-svn: 308422
Generate a single test to decide if there are enough iterations to jump to the
vectorized loop, or else go to the scalar remainder loop. This test compares the
Scalar Trip Count: if STC < VF * UF go to the scalar loop. If
requiresScalarEpilogue() holds, at-least one iteration must remain scalar; the
rest can be used to form vector iterations. So in this case the test checks
instead if (STC - 1) < VF * UF by comparing STC <= VF * UF, and going to the
scalar loop if so. Otherwise the vector loop is entered for at-least one vector
iteration.
This test covers the case where incrementing the backedge-taken count will
overflow leading to an incorrect trip count of zero. In this (rare) case we will
also avoid the vector loop and jump to the scalar loop.
This patch simplifies the existing tests and effectively removes the basic-block
originally named "min.iters.checked", leaving the single test in block
"vector.ph".
Original observation and initial patch by Evgeny Stupachenko.
Differential Revision: https://reviews.llvm.org/D34150
llvm-svn: 308421
Allowed loop vectorization with secondary FP IVs. Like this:
float *A;
float x = init;
for (int i=0; i < N; ++i) {
A[i] = x;
x -= fp_inc;
}
The auto-vectorization is possible when the induction binary operator is "fast" or the function has "unsafe" attribute.
Differential Revision: https://reviews.llvm.org/D21330
llvm-svn: 276554