29 Commits

Author SHA1 Message Date
Florian Hahn
68469a80cb
[LV] Disable runtime unrolling for vectorized loops.
This patch adds metadata to disable runtime unrolling to the vectorized
loop. If runtime unrolling/interleaving is considered profitable, LV
will interleave the loop directly. There should be no need to perform
runtime unrolling at a later stage.

Note that we already add metadata to disable runtime unrolling to the
scalar loop after vectorization.

The additional unrolling unnecessarily increases code size and compile
time. In addition to that we have several bug reports of unncessary
runtime unrolling for vectorized loops, e.g. PR40961

Compile-time improvements:

  NewPM-O3: -1.04%
  NewPM-ReleaseThinLTO: -0.59%
  NewPM-ReleaseLTO-g: -0.97%

https://llvm-compile-time-tracker.com/compare.php?from=ce1be13a868d0f8afa367975558c1a6175cce33a&to=78bc2e67f22e9e10e61cdb6cdac4bb857d95eb1b&stat=instructions:u

Fixes #40306.

Reviewed By: lebedev.ri, nikic

Differential Revision: https://reviews.llvm.org/D115261
2023-01-06 10:56:17 +00:00
Nikita Popov
7d7577256b [LoopVectorize] Convert some tests to opaque pointers (NFC) 2022-12-14 15:16:59 +01:00
Simon Pilgrim
09cb9fdef9 [InstCombine] Fold ult(add(x,-1),c) -> ule(x,c) iff x != 0 (PR57635)
Alive2: https://alive2.llvm.org/ce/z/sZ6wwS

As detailed on Issue #57635 and #37628 - for unsigned comparisons, we can compare prior to a decrement iff the value is known never to be zero.

Differential Revision: https://reviews.llvm.org/D134172
2022-09-20 16:44:41 +01:00
Simon Pilgrim
7e626d7a89 [LoopVectorize][X86] Use quotes around the pass list to appease DOS cmd evaluation
DOS can't handle -passes='default<O3>' correctly
2022-09-19 10:24:37 +01:00
Sanjay Patel
d6498abc24 [InstCombine] remove multi-use add demanded constant fold
This was originally part of D133788. There are no visible
regressions. All of the diffs show a large unsigned constant
becoming a small negative constant. This should be better
for analysis (and slightly less compile-time) and codegen.
2022-09-18 14:23:43 -04:00
Jay Foad
2754ff883d [InstCombine] Try not to demand low order bits for Add
Don't demand low order bits from the LHS of an Add if:
- they are not demanded in the result, and
- they are known to be zero in the RHS, so they can't possibly
  overflow and affect higher bit positions

This is intended to avoid a regression from a future patch to change
the order of canonicalization of ADD and AND.

Differential Revision: https://reviews.llvm.org/D130075
2022-08-22 20:03:53 +01:00
Sanjay Patel
bfb9b8e075 [Passes] add a tail-call-elim pass near the end of the opt pipeline
We call tail-call-elim near the beginning of the pipeline,
but that is too early to annotate calls that get added later.

In the motivating case from issue #47852, the missing 'tail'
on memset leads to sub-optimal codegen.

I experimented with removing the early instance of
tail-call-elim instead of just adding another pass, but that
appears to be slightly worse for compile-time:
+0.15% vs. +0.08% time.
"tailcall" shows adding the pass; "tailcall2" shows moving
the pass to later, then adding the original early pass back
(so 1596886802 is functionally equivalent to 180b0439dc ):
https://llvm-compile-time-tracker.com/index.php?config=NewPM-O3&stat=instructions&remote=rotateright

Note that there was an effort to split the tail call functionality
into 2 passes - that could help reduce compile-time if we find
that this change costs more in compile-time than expected based
on the preliminary testing:
D60031

Differential Revision: https://reviews.llvm.org/D130374
2022-07-25 15:25:47 -04:00
Philip Reames
37ead201e6 [runtime-unroll] Use incrementing IVs instead of decrementing ones
This is one of those wonderful "in theory X doesn't matter, but in practice is does" changes. In this particular case, we shift the IVs inserted by the runtime unroller to clamp iteration count of the loops* from decrementing to incrementing.

Why does this matter?  A couple of reasons:
* SCEV doesn't have a native subtract node.  Instead, all subtracts (A - B) are represented as A + -1 * B and drops any flags invalidated by such.  As a result, SCEV is slightly less good at reasoning about edge cases involving decrementing addrecs than incrementing ones.  (You can see this in the inferred flags in some of the test cases.)
* Other parts of the optimizer produce incrementing IVs, and they're common in idiomatic source language.  We do have support for reversing IVs, but in general if we produce one of each, the pair will persist surprisingly far through the optimizer before being coalesced.  (You can see this looking at nearby phis in the test cases.)

Note that if the hardware prefers decrementing (i.e. zero tested) loops, LSR should convert back immediately before codegen.

* Mostly irrelevant detail: The main loop of the prolog case is handled independently and will simple use the original IV with a changed start value.  We could in theory use this scheme for all iteration clamping, but that's a larger and more invasive change.
2021-11-12 15:44:58 -08:00
Arthur Eubanks
15fefcb9eb [opt] Directly translate -O# to -passes='default<O#>'
Right now when we see -O# we add the corresponding 'default<O#>' into
the list of passes to run when translating legacy -pass-name. This has
the side effect of not using the default AA pipeline.

Instead, treat -O# as -passes='default<O#>', but don't allow any other
-passes or -pass-name. I think we can keep `opt -O#` as shorthand for
`opt -passes='default<O#>` but disallow anything more than just -O#.

Tests need to be updated to not use `opt -O# -pass-name`.

Reviewed By: asbirlea

Differential Revision: https://reviews.llvm.org/D112036
2021-10-18 16:48:10 -07:00
Florian Hahn
23c2f2e6b2
[LV] Mark increment of main vector loop induction variable as NUW.
This patch marks the induction increment of the main induction variable
of the vector loop as NUW when not folding the tail.

If the tail is not folded, we know that End - Start >= Step (either
statically or through the minimum iteration checks). We also know that both
Start % Step == 0 and End % Step == 0. We exit the vector loop if %IV +
%Step == %End. Hence we must exit the loop before %IV + %Step unsigned
overflows and we can mark the induction increment as NUW.

This should make SCEV return more precise bounds for the created vector
loops, used by later optimizations, like late unrolling.

At the moment quite a few tests still need to be updated, but before
doing so I'd like to get initial feedback to make sure I am not missing
anything.

Note that this could probably be further improved by using information
from the original IV.

Attempt of modeling of the assumption in Alive2:
https://alive2.llvm.org/ce/z/H_DL_g

Part of a set of fixes required for PR50412.

Reviewed By: mkazantsev

Differential Revision: https://reviews.llvm.org/D103255
2021-06-07 10:47:52 +01:00
Sanjay Patel
c8893f3b78 [LoopVectorize] relax FMF constraint for FP induction
This makes the induction part of the loop vectorizer match the reduction part.
We do not need all of the fast-math-flags. For example, there are some that
clearly are not in play like arcp or afn.

If we want to make FMF constraints consistent across the IR optimizer, we
might want to add nsz too, but that's up for debate (users can't expect
associative FP math and preservation of sign-of-zero at the same time?).

The calling code was fixed to avoid miscompiles with:
1bee549737ac

Differential Revision: https://reviews.llvm.org/D98708
2021-03-18 08:11:22 -04:00
Sanjay Patel
d2eae990a1 [LoopVectorize] add FP induction test with minimal FMF; NFC 2021-03-16 12:05:34 -04:00
Roman Lebedev
b46c085d2b
[NFCI] SCEVExpander: emit intrinsics for integral {u,s}{min,max} SCEV expressions
These intrinsics, not the icmp+select are the canonical form nowadays,
so we might as well directly emit them.

This should not cause any regressions, but if it does,
then then they would needed to be fixed regardless.

Note that this doesn't deal with `SCEVExpander::isHighCostExpansion()`,
but that is a pessimization, not a correctness issue.

Additionally, the non-intrinsic form has issues with undef,
see https://reviews.llvm.org/D88287#2587863
2021-03-06 21:52:46 +03:00
Sanjay Patel
ab93c18c12 [LoopVectorize] use IR fast-math-flags exclusively (not FP function attributes)
I am trying to untangle the fast-math-flags propagation logic
in the vectorizers (see a6f022127 for SLP).

The loop vectorizer has a mix of checking FP function attributes,
IR-level FMF, and just wrong assumptions.

I am trying to avoid regressions while fixing this, and I think
the IR-level logic is good enough for that, but it's hard to say
for sure. This would be the 1st step in the clean-up.

The existing test that I changed to include 'fast' actually shows
a miscompile: the function only had the equivalent of nnan, but we
created new instructions that had fast (all FMF set). This is
similar to the example in https://llvm.org/PR35538

Differential Revision: https://reviews.llvm.org/D95452
2021-01-27 14:17:11 -05:00
Roman Lebedev
5cce4aff18
[SimplifyCFG] TryToSimplifyUncondBranchFromEmptyBlock() already knows how to preserve DomTree
... so just ensure that we pass DomTreeUpdater it into it.

Fixes DomTree preservation for a large number of tests,
all of which are marked as such so that they do not regress.
2020-12-17 01:03:49 +03:00
Sanjay Patel
4e68bc0999 Revert "[InstCombine] add multi-use demanded bits fold for add with low-bit mask"
This reverts commit e56103d25016c9ce4e98f652ac1a09379793ccf5.
There is a stage2 msan failure blamed on this commit:
http://lab.llvm.org:8011/#/builders/74/builds/888/steps/9/logs/stdio
2020-11-16 14:48:09 -05:00
Sanjay Patel
e56103d250 [InstCombine] add multi-use demanded bits fold for add with low-bit mask
I noticed an add example like the one from D91343, so here's a similar patch.
The logic is based on existing code for the single-use demanded bits fold.
But I only matched a constant instead of using compute known bits on the
operands because that was the motivating patterni that I noticed.

I think this will allow removing a special-case (but incomplete) dedicated
fold within visitAnd(), but I need to untangle the existing code to be sure.

https://rise4fun.com/Alive/V6fP

  Name: add with low mask
  Pre: (C1 & (-1 u>> countLeadingZeros(C2))) == 0
  %a = add i8 %x, C1
  %r = and i8 %a, C2
  =>
  %r = and i8 %x, C2

Differential Revision: https://reviews.llvm.org/D91415
2020-11-15 15:09:49 -05:00
Sanjay Patel
9e0c35655b [LoopVectorize] regenerate test checks; NFC 2020-11-12 17:15:46 -05:00
Roman Lebedev
1badf7c33a
[InstComine] Forego of one-use check in (X - (X & Y)) --> (X & ~Y) if Y is a constant
Summary:
This is potentially more friendly for further optimizations,
analysies, e.g.: https://godbolt.org/z/G24anE

This resolves phase-ordering bug that was introduced
in D75145 for https://godbolt.org/z/2gBwF2
https://godbolt.org/z/XvgSua

Reviewers: spatel, nikic, dmgreen, xbolva00

Reviewed By: nikic, xbolva00

Subscribers: hiraditya, zzheng, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D75757
2020-03-06 21:39:07 +03:00
Roman Lebedev
d6f47aeb51
[SCEV] SCEVExpander::isHighCostExpansionHelper(): cost-model min/max (PR44668)
Summary:
Previosly we simply always said that `SCEVMinMaxExpr` is too costly to expand.
But this isn't really true, it expands into just a comparison+swap pair.
And again much like with add/mul, there will be one less such pair
than the number of operands. And we need to count the cost of operands themselves.

This does change a number of testcases, and as far as i can tell,
all of these changes are improvements, in the sense that
we fixed up more latches to do the [in]equality comparison.

This concludes cost-modelling changes, no other SCEV expressions exist as of now.

This is a part of addressing [[ https://bugs.llvm.org/show_bug.cgi?id=44668 | PR44668 ]].

Reviewers: reames, mkazantsev, wmi, sanjoy

Reviewed By: mkazantsev

Subscribers: hiraditya, javed.absar, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D73744
2020-02-25 23:05:59 +03:00
Roman Lebedev
3bd33ccfdf
[NFC?][SCEV][LoopVectorize] Add datalayout to the X86/float-induction-x86.ll test
Summary:
Currently, `SCEVExpander::isHighCostExpansionHelper()` has the following logic:
```
  if (auto *UDivExpr = dyn_cast<SCEVUDivExpr>(S)) {
    // If the divisor is a power of two and the SCEV type fits in a native
    // integer (and the LHS not expensive), consider the division cheap
    // irrespective of whether it occurs in the user code since it can be
    // lowered into a right shift.
    if (auto *SC = dyn_cast<SCEVConstant>(UDivExpr->getRHS()))
      if (SC->getAPInt().isPowerOf2()) {
        if (isHighCostExpansionHelper(UDivExpr->getLHS(), L, At,
                                      BudgetRemaining, TTI, Processed))
          return true;
        const DataLayout &DL =
            L->getHeader()->getParent()->getParent()->getDataLayout();
        unsigned Width = cast<IntegerType>(UDivExpr->getType())->getBitWidth();
        return DL.isIllegalInteger(Width);
      }
```

Since this test does not have a datalayout specified,
`SCEVExpander::isHighCostExpansionHelper()` says that
`[[TMP2:%.*]] = lshr exact i64 [[TMP1]], 5` is high-cost, and didn't perform it.

But future patches will change that logic to solely rely on cost-model,
without any such datalayout checks, so i think it is best to show
that that change is ephemeral, and can already happen without costmodel changes.

Reviewers: reames, fhahn, sanjoy, craig.topper, RKSimon

Reviewed By: RKSimon

Subscribers: javed.absar, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D73717
2020-02-12 12:27:38 +03:00
Roman Lebedev
7bca4a28f5
[NFC][LoopVectorize] Autogenerate tests affected by isHighCostExpansionHelper() cost modelling (PR44668) 2020-01-27 23:34:30 +03:00
Eric Christopher
cee313d288 Revert "Temporarily Revert "Add basic loop fusion pass.""
The reversion apparently deleted the test/Transforms directory.

Will be re-reverting again.

llvm-svn: 358552
2019-04-17 04:52:47 +00:00
Eric Christopher
a863435128 Temporarily Revert "Add basic loop fusion pass."
As it's causing some bot failures (and per request from kbarton).

This reverts commit r358543/ab70da07286e618016e78247e4a24fcb84077fda.

llvm-svn: 358546
2019-04-17 02:12:23 +00:00
Sanjay Patel
b049173157 [SimplifyCFG] use pass options and remove the latesimplifycfg pass
This is no-functional-change-intended.

This is repackaging the functionality of D30333 (defer switch-to-lookup-tables) and 
D35411 (defer folding unconditional branches) with pass parameters rather than a named
"latesimplifycfg" pass. Now that we have individual options to control the functionality,
we could decouple when these fire (but that's an independent patch if desired). 

The next planned step would be to add another option bit to disable the sinking transform
mentioned in D38566. This should also make it clear that the new pass manager needs to
be updated to limit simplifycfg in the same way as the old pass manager.

Differential Revision: https://reviews.llvm.org/D38631

llvm-svn: 316835
2017-10-28 18:43:07 +00:00
Balaram Makam
b05a55787a [SimplifyCFG] Defer folding unconditional branches to LateSimplifyCFG if it can destroy canonical loop structure.
Summary:
When simplifying unconditional branches from empty blocks, we pre-test if the
BB belongs to a set of loop headers and keep the block to prevent passes from
destroying canonical loop structure. However, the current algorithm fails if
the destination of the branch is a loop header. Especially when such a loop's
latch block is folded into loop header it results in additional backedges and
LoopSimplify turns it into a nested loop which prevent later optimizations
from being applied (e.g., loop  unrolling and loop interleaving).

This patch augments the existing algorithm by further checking if the
destination of the branch belongs to a set of loop headers and defer
eliminating it if yes to LateSimplifyCFG.

Fixes PR33605: https://bugs.llvm.org/show_bug.cgi?id=33605

Reviewers: efriedma, mcrosier, pacxx, hsung, davidxl

Reviewed By: efriedma

Subscribers: ashutosh.nema, gberry, javed.absar, llvm-commits

Differential Revision: https://reviews.llvm.org/D35411

llvm-svn: 308422
2017-07-19 08:53:34 +00:00
Ayal Zaks
8c452d76ed [LV] Test once if vector trip count is zero, instead of twice
Generate a single test to decide if there are enough iterations to jump to the
vectorized loop, or else go to the scalar remainder loop. This test compares the
Scalar Trip Count: if STC < VF * UF go to the scalar loop. If
requiresScalarEpilogue() holds, at-least one iteration must remain scalar; the
rest can be used to form vector iterations. So in this case the test checks
instead if (STC - 1) < VF * UF by comparing STC <= VF * UF, and going to the
scalar loop if so. Otherwise the vector loop is entered for at-least one vector
iteration.

This test covers the case where incrementing the backedge-taken count will
overflow leading to an incorrect trip count of zero. In this (rare) case we will
also avoid the vector loop and jump to the scalar loop.

This patch simplifies the existing tests and effectively removes the basic-block
originally named "min.iters.checked", leaving the single test in block
"vector.ph".

Original observation and initial patch by Evgeny Stupachenko.

Differential Revision: https://reviews.llvm.org/D34150

llvm-svn: 308421
2017-07-19 05:16:39 +00:00
Matthew Simpson
9eed0bee3d [LV] Handle external uses of floating-point induction variables
Reference: https://bugs.llvm.org/show_bug.cgi?id=32758
Differential Revision: https://reviews.llvm.org/D32445

llvm-svn: 301428
2017-04-26 16:23:02 +00:00
Elena Demikhovsky
376a18bd92 [Loop Vectorizer] Handling loops FP induction variables.
Allowed loop vectorization with secondary FP IVs. Like this:
float *A;
float x = init;
for (int i=0; i < N; ++i) {
  A[i] = x;
  x -= fp_inc;
}

The auto-vectorization is possible when the induction binary operator is "fast" or the function has "unsafe" attribute.

Differential Revision: https://reviews.llvm.org/D21330

llvm-svn: 276554
2016-07-24 07:24:54 +00:00