117 Commits

Author SHA1 Message Date
Ryotaro Kasuga
b75530ff03
[LoopInterchange] Consider forward/backward dependency in vectorize heuristic (#133672)
The vectorization heuristic of LoopInterchange attempts to move a
vectorizable loop to the innermost position. Before this patch, a loop
was deemed vectorizable if there are no loop-carried dependencies
induced by the loop.
This patch extends the vectorization heuristic by introducing the
concept of forward and backward dependencies, inspired by
LoopAccessAnalysis. Specifically, an additional element is appended to
each direction vector to indicate whether it represents a forward
dependency (`<`) or not (`*`). Among these, only the forward
dependencies (i.e., those whose last element is `<`) affect the
vectorization heuristic. Accordingly, the check is conservative, and
dependencies are considered forward only when this can be proven.
Currently, we only support perfectly nested loops whose body consists of
a single basic block. For other cases, dependencies are pessimistically
treated as non-forward.
2025-07-25 22:37:20 +09:00
Sjoerd Meijer
a9f8143072
[LoopInterchange] Ignore the cost-model, force interchange if legal (#148858)
This is and has been proven useful for testing purposes, to get more
test coverage.
2025-07-18 13:42:25 +01:00
Madhur Amilkanthwar
fb81a0dd9e
[LoopInterchange][NFCI] Split reductions-non-wrapped-operations.ll (#149449)
This test has grown too big. Having one for float for int would be more
manageable.
2025-07-18 10:06:12 +05:30
Madhur Amilkanthwar
4f8597f071
[LoopInterchange] Add test for floating point math flags (#149090)
Adding a test where both `ninf` and `reassoc` flags are present on the
instruction. We don't know yet if it is legal to interchange. Prima
facie, it does not look like it should be legal but more analysis is
needed.
2025-07-16 21:46:12 +05:30
Ryotaro Kasuga
b3c293c5b9
[LoopInterchange] Drop nuw/nsw flags from reduction ops when interchanging (#148612)
Before this patch, when a reduction exists in the loop, the legality
check of LoopInterchange only verified if there exists a
non-reassociative floating-point instruction in the reduction
calculation. However, it is insufficient, because reordering integer
reductions can also lead to incorrect transformations. Consider the
following example:

```c
int A[2][2] = {
  { INT_MAX, INT_MAX },
  { INT_MIN, INT_MIN },
};

int sum = 0;
for (int i = 0; i < 2; i++)
  for (int j = 0; j < 2; j++)
    sum += A[j][i];
```

To make this exchange legal, we must drop nuw/nsw flags from the
instructions involved in the reduction operations.

This patch extends the legality check to correctly handle such cases. In
particular, for integer addition and multiplication, it verifies that
the nsw and nuw flags are set on involved instructions, and drop them
when the transformation actually performed. This patch also introduces
explicit checks for the kind of reduction and permits only those that
are known to be safe for interchange. Consequently, some "unknown"
reductions (at the moment, `FindFirst*` and `FindLast*`) are rejected.

Fix #148228
2025-07-15 22:04:16 +09:00
Ryotaro Kasuga
a8280c4be4
[LoopInterchange] Fix incorrect GEPs in tests (NFC) (#147223)
These tests were missing the leading zero(s) in the GEP.
2025-07-09 10:51:50 +09:00
Ryotaro Kasuga
29487759e3
[LoopInterchange] Defer CacheCost calculation until needed (#146874)
LoopInterchange currently stop and exit its process when the number of
loads/stores in the loop is greater than `MaxMemInstrCount`. This
prevents excessive querying to DependenceAnalysis. However, computing
`CacheCost` also involves DependenceAnalysis queries, and their number
can grow to `O(N^2)` in the worst case, where `N` is the number of
loads/stores in the loop. Therefore, we should also avoid calculating it
if the loads/stores count exceeds `MaxMemInstrCount`.
This patch defers the calculation of `CacheCost` until it is actually
needed to reduce compile time. This avoids computing `CacheCost` when
the number of loads/stores is large. Additionally, since this patch
delays its calculation as much as possible, it is also effective in
other scenarios, e.g., when there are no legal loop pairs to exchange.
2025-07-08 17:44:34 +09:00
Ryotaro Kasuga
3099b7eb5d
[Passes] Move LoopInterchange into optimization pipeline (#145503)
As mentioned in https://github.com/llvm/llvm-project/pull/145071,
LoopInterchange should be part of the optimization pipeline rather than
the simplification pipeline. This patch moves LoopInterchange into the
optimization pipeline.

More contexts:

- By default, LoopInterchange attempts to improve data locality,
however, it also takes increasing vectorization opportunities into
account. Given that, it is reasonable to run it as close to
vectorization as possible.
- I looked into previous changes related to the placement of
LoopInterchange, but couldn’t find any strong motivation suggesting that
it benefits other simplifications.
- As far as I tried some tests (including llvm-test-suite), removing
LoopInterchange from the simplification pipeline does not affect other
simplifications. Therefore, there doesn't seem to be much value in
keeping it there.
- The new position reduces compile-time for ThinLTO, probably because it
only runs once per function in post-link optimization, rather than both
in pre-link and post-link optimization.

I haven't encountered any cases where the positional difference affects
optimization results, so please feel free to revert if you run into any issues.
2025-07-04 20:06:53 +09:00
Ryotaro Kasuga
1e5f7f64b0
[LoopInterchange] Handle confused dependence correctly (#140709)
This patch fixes the handling of a confused `Dependence` object. Such an
object doesn’t contain any information about dependencies, so we must
process it conservatively. However, it was converted into a direction
vector like `[I I ... I]`. As a result, it was treated as if there are
no loop-carried dependencies, which can lead to illegal loop exchanges.

Fixes #140238
2025-06-05 16:44:02 +09:00
Ryotaro Kasuga
e99ca74dc1
[LoopInterchange] Relax the legality check to accept more patterns (#139690)
When proving the legality of exchanging two loops, it doesn't need to
check the elements of the direction vectors associated with the loops
outside of the two target loops. Before this patch, the legality check
looked at all elements of a direction vector to calculate the
lexicographically order of the vector, which may reject some legal
exchanges. For example, if a direction vector is `[* < =]`, it is safe
to swap the last two loops because the corresponding subsequence of the
vector (`[< =]`) is lexicographically positive for both before and after
the exchange. However, the its order is unknown if we don't drop the
prefix since the first element is `*`. This patch improves the logic of
legality check to ignore such unrelated prefixes of direction vectors.
2025-05-13 21:14:14 +09:00
Ryotaro Kasuga
91f3965be4
[LoopInterchange] Fix the vectorizable check for a loop (#133667)
In the profitability check for vectorization, the dependency matrix was
not handled correctly. This can result to make a wrong decision: It may
say "this loop can be vectorized" when in fact it cannot. The root cause
of this is that the check process early returns when it finds '=' or 'I'
in the dependency matrix. To make sure that we can actually vectorize
the loop, we need to check all the rows of the matrix. This patch fixes
the process of checking whether we can vectorize the loop or not. Now it
won't make a wrong decision for a loop that cannot be vectorized.

Related: #131130
2025-04-03 16:21:19 +09:00
Ryotaro Kasuga
cf976bfdeb
[LoopInterchange] Add tests for the vectorization profitability (NFC) (#133665)
There is a problem with the current profitability check for
vectorization in LoopInterchange. There are both false positives and
false negatives. The former means that the heuristic may say that "an
exchange is necessary to vectorize the innermost loop" even though it's
already possible. The latter means that the heuristic may miss a case
where an exchange is necessary to vectorize the innermost loop. Note
that this is not a dependency analysis problem. This is caused by
incorrect handling of the dependency matrix in the profitability check,
so these problems can occur even if the analysis is accurate (no
overestimation).

This patch adds tests to clarify the cases that should be fixed. The
root cause of these cases is that the heuristic doesn't handle the
direction of a dependency correctly.
2025-04-02 21:02:30 +09:00
Ryotaro Kasuga
528e408b94
[LoopInterchange] Add an option to control the cost heuristics applied (#133664)
LoopInterchange has several heuristic functions to determine if
exchanging two loops is profitable or not. Whether or not to use each
heuristic and the order in which to use them were fixed, but #125830
allows them to be changed internally at will. This patch adds a new
option to control them via the compiler option.

The previous patch also added an option to prioritize the vectorization
heuristic. This patch also removes it to avoid conflicts between it and
the newly introduced one, e.g., both
`-loop-interchange-prioritize-vectorization=1` and
`-loop-interchange-profitabilities='cache,vectorization'` are specified.
2025-04-02 15:41:40 +09:00
Ryotaro Kasuga
281028e37c
[LoopInterchange] Prevent from undoing its own transformation (#127473)
LoopInterchange uses the bubble-sort fashion algorithm to sort the loops
in a loop nest. For two loops `A` and `B`, it calls `isProfitable(A, B)`
to determine whether these loops should be exchanged. The result of
`isProfitable(A, B)` should be conservative, that is, it should return
true only when we are sure exchanging `A` and `B` will improve
performance. If it's not conservative enough, it's possible that a
subsequent `isProfitable(B, A)` will also return true, in which case
LoopInterchange will undo its previous transformation. To avoid such
cases, `isProfitable(B, A)` must not return true if `isProfitable(A, B)`
returned true in the past. However, the current implementation can be in
such a situation. This patch resolves it by correcting the handling of
two loops that have the same cache cost.

This resolve the problem mentioned in
https://github.com/llvm/llvm-project/pull/118267#issuecomment-2510759354.
2025-03-21 21:42:49 +09:00
Ryotaro Kasuga
17b202fc17
[LoopInterchange] Add an option to prioritize vectorization (#131988)
The LoopInterchange cost-model consists of several decision rules. They
are called one by one, and if some rule can determine the profitability,
then the subsequent rules aren't called. In the current implementation,
the rule for `CacheCostAnalysis` is called first, and if it fails to
determine the profitability, then the rule for vectorization is called.
However, there are cases where interchanging loops for vectorization
makes the code faster even if such exchanges are detrimental to the
cache. For example, exchanging the inner two loops in the following
example looks about x3 faster in my local (compiled with `-O3
-mcpu=neoverse-v2 -mllvm -cache-line-size=64`), even though it's
rejected by the rule based on cache cost. (NOTE: LoopInterchange cannot
exchange these loops due to legality checks. This should also be
improved.)

```c
__attribute__((aligned(64))) float aa[256][256],bb[256][256],cc[256][256],
                                   dd[256][256],ee[256][256],ff[256][256];

// Alternative of TSVC s231 with more array accesses than the original.
void s231_alternative() {
  for (int nl = 0; nl < 100*(100000/256); nl++) {
    for (int i = 0; i < 256; ++i) {
      for (int j = 1; j < 256; j++) {
        aa[j][i] = aa[j-1][i] + bb[j][i] + cc[i][j]
                 + dd[i][j] + ee[i][j] + ff[i][j];
      }
    }
  }
}
```

This patch introduces a new option to prioritize the vectorization rule
over the cache cost rule.

Related issue: #131130

---------

Co-authored-by: Florian Hahn <flo@fhahn.com>
2025-03-21 17:03:06 +09:00
Madhur Amilkanthwar
0074a462f1
[LoopInterchange] Hoist isComputableLoopNest() in the control flow (#124247)
The profiling of the LLVM Test-suite reveals that a significant portion,
specifically 14,090 out of 139,323, loop nests were identified as
non-viable candidates for transformation, leading to the transform
exiting from isComputableLoopNest() without any action.

More importantly, dependence information was computed for these loop
nests before reaching the function isComputableLoopNest(), which does
not require DI and relies solely on scalar evolution (SE).

To enhance compile-time efficiency, this patch moves the call to
isComputableLoopNest() earlier in the control-flow, thereby avoiding
unnecessary dependence calculations.

The impact of this change is evident on the compile-time-tracker, with
the overall geometric mean improvement recorded at 0.11%, while the
lencode benchmark gets a more substantial benefit of 0.44%.
This improvement can be tracked in the isc-ln-exp-2 branch under my
repo.
2025-02-05 13:50:17 +05:30
Ryotaro Kasuga
e82f93890d
[LoopInterchange] Add tests of 'S' deps (NFC) (#125214)
The incorrect handling of scalar dependencies in LoopInterchange was
fixed by #119345. This patch adds tests that are relative to the issues
fixed by #119345.
2025-02-03 20:48:37 +09:00
Nikita Popov
29441e4f5f
[IR] Convert from nocapture to captures(none) (#123181)
This PR removes the old `nocapture` attribute, replacing it with the new
`captures` attribute introduced in #116990. This change is
intended to be essentially NFC, replacing existing uses of `nocapture`
with `captures(none)` without adding any new analysis capabilities.
Making use of non-`none` values is left for a followup.

Some notes:
* `nocapture` will be upgraded to `captures(none)` by the bitcode
   reader.
* `nocapture` will also be upgraded by the textual IR reader. This is to
   make it easier to use old IR files and somewhat reduce the test churn in
   this PR.
* Helper APIs like `doesNotCapture()` will check for `captures(none)`.
* MLIR import will convert `captures(none)` into an `llvm.nocapture`
   attribute. The representation in the LLVM IR dialect should be updated
   separately.
2025-01-29 16:56:47 +01:00
Ryotaro Kasuga
690f251063
[LoopInterchange] Handle LE and GE correctly (#124901)
LoopInterchange have converted `DVEntry::LE` and `DVEntry::GE` in
direction vectors to '<' and '>' respectively. This handling is
incorrect because the information about the '=' it lost. This leads to
miscompilation in some cases. To resolve this issue, convert them to '*'
instead.

Resolve #123920
2025-01-29 19:30:54 +09:00
Madhur Amilkanthwar
d15f3e828d
[LoopInterchange] Constrain LI within supported loop nest depth (#118656)
This patch is an extension to #115128.

After profiling LLVM test-suite, I see a lot of loop nest of depth more
than `MaxLoopNestDepth` which is 10. Early exit for them would save
compile-time as it would avoid computing DependenceInfo and CacheCost.

Please see 'bound-max-depth' branch on compile-time-tracker.
2025-01-23 10:41:54 +05:30
Madhur Amilkanthwar
5d281a480e
[LoopInterchange] Constrain number of load/stores in a loop (#118973)
In the current state of the code, the transform computes entries for the
dependency matrix until `MaxMemInstrCount` which is 100. After 99th
entry, it terminates and thus overall wastes compile-time.

It would be nice if we can compute total number of entries upfront and
early exit if the number of entries > 100. However, computing the number
of entries is not always possible as it depends on two factors:
1. Number of load-store pairs in a loop.
2. Number of common loop levels for each of the pair.

This patch constrains the whole computation on the number of loads and
stores instructions in the loop.

In another approach, I experimented with computing 1 and constraining
the number of pairs, but that did not lead to any additional benefit in
terms of compile time. However, when other issues are fixed, I can
revisit this approach.
2025-01-21 10:49:19 +05:30
Sjoerd Meijer
456ec1c2f4
[LoopInterchange] Remove 'S' Scalar Dependencies (#119345)
We are not handling 'S' scalar dependencies correctly and have at least
the following miscompiles related to that:

[LoopInterchange] incorrect handling of scalar dependencies and dependence vectors starting with ">" #54176
[LoopInterchange] Interchange breaks program correctness #46867
[LoopInterchange] Loops should not interchanged due to dependencies #47259
[LoopInterchange] Loops should not interchanged due to control flow #47401

This patch does no longer insert the "S" dependency/direction into the
dependency matrix, so a dependency is never "S". We seem to have
forgotten what the exact meaning is of this dependency type, and don't
see why it should be treated differently.

We prefer correctness over incorrect and more aggressive results. I.e.,
this prevents the miscompiles at the expense of handling less cases,
i.e. making interchange more pessimistic. However, some of the cases
that are now rejected for dependence analysis reasons, were rejected
before too but for other reasons (e.g. profitability). So at least for
the llvm regression tests, the number of regression are very reasonable.
This should be a stopgap. We would like to get interchange enabled by
default and thus prefer correctness over unsafe transforms, and later
see if we can get solve the regressions.
2025-01-20 13:04:58 +00:00
Sjoerd Meijer
77803e461c
[loop-interchange] Move tests over to use remarks (#123053)
Checking the remark message if interchange did or didn't happen is more
straight forward than the full IR for these cases. This comment was also
made when I moved some tests away from relying on debug builds in change
#116780, and this is a prep step for #119345 that is going to change
these test cases.
2025-01-16 15:13:18 +00:00
Lee Wei
abb9f9fa06
[llvm] Remove br i1 undef from some regression tests [NFC] (#117112)
This PR removes tests with `br i1 undef` under
`llvm/tests/Transforms/Loop*, Lower*`.
2024-11-21 08:06:56 +00:00
Sjoerd Meijer
edf56f1fa2
[LoopInterchange] Don't rely on ASSERTS build for tests. NFC. (#116780)
A lot of interchange tests unnecessary relied on a build with ASSERTS
enabled. Instead, simply check the IR output for both negative and
positive tests so that we don't rely on debug messages. This increases
test coverage as these tests will now also run with non-assert builds.
For a couple of files keeping some of the debug tests was useful, so
separated out them out and moved them to a similarly named *-remarks.ll
file.
2024-11-19 13:56:55 +00:00
Sjoerd Meijer
cac6f21149
[LoopInterchange] Make the entries of the Dependency Matrix unique (#116195)
The entries in the dependency matrix can contain a lot of duplicates,
which is unnecessary and results in more checks that we can avoid, and
this patch adds that.
2024-11-19 11:40:53 +00:00
Madhur Amilkanthwar
1eaa17975d
[LoopInterchange] Bail out early if minimum loop nest is not met (#115128)
This patch bails out early if minimum depth
is not met. As it stands today, the pass computes
CacheCost before it attempts to do the transform.
This is not needed if minimum depth is not met.
This handles basic cases where depth is typically 1.

As the patch avoids unnecessary computation, it is aimed to improve
compile-time.
2024-11-19 09:58:20 +05:30
Sjoerd Meijer
cb64c3c573
[LoopInterchange] Precommit tests for scalar dependencies. NFC. (#115900)
We are miscompiling and incorrectly interchanging loops with scalar
dependencies that are live-out and conditionally set. This precommits
some tests demonstrating this. This is based on the tests in
https://reviews.llvm.org/D87879 by `mdchen`.
2024-11-14 09:13:10 +00:00
Rouzbeh
670259466b
[LoopCacheAnalysis] Fix loop cache cost to always round the cost up to the nearest integer number (#88915)
Currently loop cache analysis uses following formula to evaluate cost of
an RefGroup for a consecutive memory access:

`RefCost=(TripCount*Stride)/CLS`

This cost evaluates to zero when `TripCount*Stride` is smaller than
cache-line-size. This results in wrong cost value for a loop and
misleads loopInterchange decisions as shown in [this
case](https://llvm.godbolt.org/z/jTz1vn4hn).

This patch fixes the problem by rounding the cost to 1 once this problem
happens.
2024-05-27 09:54:39 -04:00
Nikita Popov
2d69827c5c [Transforms] Convert tests to opaque pointers (NFC) 2024-02-05 11:57:34 +01:00
Nikita Popov
eecb99c5f6 [Tests] Add disjoint flag to some tests (NFC)
These tests rely on SCEV looking recognizing an "or" with no common
bits as an "add". Add the disjoint flag to relevant or instructions
in preparation for switching SCEV to use the flag instead of the
ValueTracking query. The IR with disjoint flag matches what
InstCombine would produce.
2023-12-05 14:09:36 +01:00
Joshua Cao
a400c6ac03 [LoopInterchange] Add GEP with 3 indices test for pr57148
Motivated by https://reviews.llvm.org/D147117. Need a test with
BackedgeTakenCount=False and a 3 index GEP.
2023-04-02 00:19:49 -07:00
Ram-NK
ee7188c8b2 [LoopInterchange] Correcting the profitability check
Before D135808, There would be endless loop interchange posibility (no
proper priority was there in profitability check. Any profitable check
may leads to loop-interchange). With this patch, there  is no endless
interchange (priority in profitable check is defined. Order of decision
is 'Cache cost' check, 'InstrOrderCost', 'Vectorization'). Corrected the
dependency checking inside isProfitableForVectorization(), corrected the
checking of bad order loops in isProfitablePerInstrOrderCost().

Reviewed By: Meinersbur, bmahjour, #loopoptwg

Differential Revision: https://reviews.llvm.org/D135808
2023-01-16 14:36:06 -05:00
Nikita Popov
055fb7795a [Transforms] Convert some tests to opaque pointers (NFC)
These are all tests where conversion worked automatically, and
required no manual fixup.
2023-01-05 12:43:45 +01:00
Roman Lebedev
25a87862a0
[NFC] Port all LoopInterchange tests to -passes= syntax 2022-12-08 02:38:46 +03:00
Congzhe Cao
75b33d6bd5 [LoopInterchange] Check phis in all subloops
This is the bugfix to the miscompile mentioned in
https://reviews.llvm.org/D132055#3814831. The IR
that reproduced the bug is added as the test case in
this patch.

What this patch does is that, during legality phase
instead of checking the phi nodes only in `InnerLoop`
and `OuterLoop`, we check phi nodes in all subloops
of the `OuterLoop`. Suppose if the loop nest is triply
nested, and `InnerLoop` and `OuterLoop` is the middle
loop and the outermost loop respectively, we'll check
phi nodes in the innermost loop as well, in addition to
the ones in the middle and outermost loops.

Reviewed By: Meinersbur, #loopoptwg

Differential Revision: https://reviews.llvm.org/D134930
2022-11-04 00:20:52 -04:00
Arthur Eubanks
f3a928e233 [opt] Don't translate legacy -analysis flag to require<analysis>
Tests relying on this should explicitly use -passes='require<analysis>,foo'.
2022-10-07 14:54:34 -07:00
Congzhe Cao
22c91df52c [LoopInterchange][PR57148] Ensure the correct form of IR after transformation
This is a bugfix patch that resolves the following two bugs in loop interchange:

1. PR57148 which is an assertion error due to of loss of LCSSA form after interchange,
   as referred to test1() in pr57148.ll.
2. Use before def for the outermost loop induction variables after interchange,
   as referred to test2() in pr57148.ll.

The fix in this patch is that:

1. In cases where the LCSSA form is not maintained after interchange, we update the IR
   to the LCSSA form again.
2. We split the phi nodes in the inner loop header into a separate basic block to avoid
   the situation where use of the outer indvar appears before its def after interchange.
   Previously we already did this for innermost loops, now we do it for non-innermost
   loops (e.g., middle loops) as well.

Reviewed By: bmahjour, Meinersbur, #loopoptwg

Differential Revision: https://reviews.llvm.org/D132055
2022-09-22 00:20:53 -04:00
Congzhe Cao
6782d71680 [LoopPassManager] Ensure to construct loop nests with the outermost loop
This patch is to resolve the bug reported and discussed in
https://reviews.llvm.org/D124926#3718761 and https://reviews.llvm.org/D124926#3719876.

The problem is that loop interchange is a loopnest pass under the new pass manager,
but the loop nest may not be constructed correctly by the loop pass manager after
running loop interchange and before running the next pass, which might cause problems
when it continues running the next pass.

The reason that the loop nest is constructed incorrectly is that the outermost
loop might have changed after interchange, and what was the original outermost
loop is not the current outermost loop anymore. Constructing the loop nest based
on the original outermost loop would generate an invalid loop nest.

The fix in this patch is that, in the loop pass manager before running each loopnest
pass, we re-cosntruct the loop nest based on the current outermost loop, if LPMUpdater
notifies the loop pass manager that the previous loop nest has been invalidated by passes
like loop interchange.

Reviewed By: aeubanks

Differential Revision: https://reviews.llvm.org/D132199
2022-09-21 23:59:26 -04:00
Congzhe Cao
8dc4b2edfa [LoopInterchange][PR56275] Fix legality with negative dependence vectors
This is the 2nd patch of the two-patch series (D130188, D130189) that
fix PR56275 (https://github.com/llvm/llvm-project/issues/56275) which
is a missed opportunity for loop interchange.

As follow-up on the dependence analysis (DA) patch D130188, this patch
normalizes DA results in loop interchange, such that negative dependence
vectors queried by loop interchange are reversed to be non-negative.

Now all tests in PR56275 can get interchanged. Those tests are added
in lit test as `pr56275.ll`.

Reviewed By: kawashima-fj, bmahjour, Meinersbur, #loopoptwg

Differential Revision: https://reviews.llvm.org/D130189
2022-08-03 19:59:01 -04:00
Congzhe Cao
b941857b40 [LoopInterchange] New cost model for loop interchange
This is another attempt to land this patch.

The patch proposed to use a new cost model for loop interchange,
which is obtained from loop cache analysis.

Given a loopnest, what loop cache analysis returns is a vector of
loops [loop0, loop1, loop2, ...] where loop0 should be replaced as
the outermost loop, loop1 should be placed one more level inside, and
loop2 one more level inside, etc. What loop cache analysis does is not
only more comprehensive than the current cost model, it is also a "one-shot"
query which means that we only need to query it once during the entire
loop interchange pass, which is better than the current cost model where
we query it every time we check whether it is profitable to interchange
two loops. Thus complexity is reduced, especially after D120386 where we
do more interchanges to get the globally optimal loop access pattern.

Updates made to test cases are mostly minor changes and some
corrections. One change that applies to all tests is that we added an option
`-cache-line-size=64` to the RUN lines. This is ensure that loop
cache analysis receives a valid number of cache line size for correct
analysis. Test coverage for loop interchange is not reduced.

Currently we did not completely remove the legacy cost model, but
keep it as fall-back in case the new cost model did not run successfully.
This is because currently we have some limitations in delinearization, which
sometimes makes loop cache analysis bail out. The longer term goal is to
enhance delinearization and eventually remove the legacy cost model
compeletely.

Reviewed By: bmahjour, #loopoptwg

Differential Revision: https://reviews.llvm.org/D124926
2022-06-28 00:08:37 -04:00
Evgenii Stepanov
878309cc54 Revert "[LoopInterchange] New cost model for loop interchange"
llvm/lib/Analysis/LoopCacheAnalysis.cpp:702:30: runtime error: signed
integer overflow: 6148914691236517209 * 100 cannot be represented in
type 'long'

https://lab.llvm.org/buildbot/#/builders/5/builds/25185

This reverts commit 1b24fe34b06cd9f2337313f513a8b19f9a37c5de.
2022-06-23 16:10:53 -07:00
Congzhe Cao
1b24fe34b0 [LoopInterchange] New cost model for loop interchange
This is the second attempt to land this patch.

The patch proposed to use a new cost model for loop interchange,
which is obtained from loop cache analysis.

Given a loopnest, what loop cache analysis returns is a vector of
loops [loop0, loop1, loop2, ...] where loop0 should be replaced as the
outermost loop, loop1 should be placed one more level inside, and loop2
one more level inside, etc. What loop cache analysis does is not only more
comprehensive than the current cost model, it is also a "one-shot" query
which means that we only need to query it once during the entire loop
interchange pass, which is better than the current cost model where we
query it every time we check whether it is profitable to interchange two
loops. Thus complexity is reduced, especially after D120386 where we do
more interchanges to get the globally optimal loop access pattern.

Updates made to test cases are mostly minor changes and some corrections.
One change that applies to all tests is that we added an option
`-cache-line-size=64` to the RUN lines. This is ensure that loop cache
analysis receives a valid number of cache line size for correct analysis.
Test coverage for loop interchange is not reduced.

Currently we did not completely remove the legacy cost model, but keep it
as fall-back in case the new cost model did not run successfully. This is
because currently we have some limitations in delinearization, which sometimes
makes loop cache analysis bail out. The longer term goal is to enhance
delinearization and eventually remove the legacy cost model compeletely.

Reviewed By: bmahjour, #loopoptwg

Differential Revision: https://reviews.llvm.org/D124926
2022-06-23 16:34:57 -04:00
Daniil Suchkov
f1940a5895 Revert "[LoopInterchange] New cost model for loop interchange"
Reverting the commit due to numerous buildbot failures.

This reverts commit 006334470d8d1b5d8f630890336fcb45795749d1.
2022-06-03 00:52:08 +00:00
Congzhe Cao
006334470d [LoopInterchange] New cost model for loop interchange
This patch proposed to use a new cost model for loop interchange, which
is obtained from loop cache analysis.

Given a loopnest, what loop cache analysis returns is a vector of loops
[loop0, loop1, loop2, ...] where loop0 should be replaced as the outermost
loop, loop1 should be placed one more level inside, and loop2 one more level
inside, etc. What loop cache analysis does is not only more comprehensive than
the current cost model, it is also a "one-shot" query which means that we only
need to query it once during the entire loop interchange pass, which is better
than the current cost model where we query it every time we check whether it is
profitable to interchange two loops. Thus complexity is reduced, especially after
D120386 where we do more interchanges to get the globally optimal loop access pattern.

Updates made to test cases are mostly minor changes and some corrections.
Test coverage for loop interchange is not reduced.

Currently we did not completely remove the legacy cost model, but keep it as
fall-back in case the new cost model did not run successfully. This is because
currently we have some limitations in delinearization, which sometimes makes
loop cache analysis bail out. The longer term goal is to enhance delinearization
and eventually remove the legacy cost model compeletely.

Reviewed By: bmahjour, #loopoptwg

Differential Revision: https://reviews.llvm.org/D124926
2022-06-02 19:07:14 -04:00
Congzhe Cao
eac3487510 [LoopInterchange] Try to achieve the most optimal access pattern after interchange
Motivated by pr43326 (https://bugs.llvm.org/show_bug.cgi?id=43326), where a slightly
modified case is as follows.

 void f(int e[10][10][10], int f[10][10][10]) {
   for (int a = 0; a < 10; a++)
     for (int b = 0; b < 10; b++)
       for (int c = 0; c < 10; c++)
         f[c][b][a] = e[c][b][a];
 }

The ideal optimal access pattern after running interchange is supposed to be the following

 void f(int e[10][10][10], int f[10][10][10]) {
   for (int c = 0;  c < 10; c++)
     for (int b = 0; b < 10; b++)
       for (int a = 0; a < 10; a++)
         f[c][b][a] = e[c][b][a];
 }

Currently loop interchange is limited to picking up the innermost loop and finding an order
that is locally optimal for it. However, the pass failed to produce the globally optimal
loop access order. For more complex examples what we get could be quite far from the
globally optimal ordering.

What is proposed in this patch is to do a "bubble-sort" fashion when doing interchange.
By comparing neighbors in `LoopList` in each iteration, we would be able to move each loop
onto a most appropriate place, hence this is an approach that tries to achieve the
globally optimal ordering.

The motivating example above is added as a test case.

Reviewed By: Meinersbur

Differential Revision: https://reviews.llvm.org/D120386
2022-04-06 15:31:56 -04:00
Congzhe Cao
abc8ca65c3 [LoopInterchange] Detect output dependency of a store instruction with itself
This patch is motivated by pr48057 where an output dependency is not detected
since loop interchange did not check a store instruction with itself.
Fixed that deficiency.

Reviewed By: bmahjour, Meinersbur, #loopoptwg

Differential Revision: https://reviews.llvm.org/D118102
2022-03-09 15:50:27 -05:00
Congzhe Cao
1ef04326ec [LoopInterchange] Support loop interchange with floating point reductions
Enabled loop interchange support for floating point reductions
if it is allowed to reorder floating point operations.

Previously when we encouter a floating point PHI node in the
outer loop exit block, we bailed out since we could not detect
floating point reductions in the early days. Now we remove this
limiation since we are able to detect floating point reductions.

Reviewed By: #loopoptwg, Meinersbur

Differential Revision: https://reviews.llvm.org/D117450
2022-02-06 17:04:47 -05:00
Igor Kirillov
d3932c690d [LoopVectorize] Add tests with reductions that are stored in invariant address
This patch adds tests for functionality that is to be implemented in D110235.

Differential Revision: https://reviews.llvm.org/D117213
2022-01-24 21:26:38 +00:00
Congzhe Cao
fa6a2876c7 [LoopInterchange] Enable interchange with multiple inner loop indvars
Currently loop interchange only supports loops with one inner loop
induction variable. This patch adds support for transformation with
more than one inner loop induction variables. The induction PHIs and
induction increment instructions are moved/duplicated properly to the
new outer header and the new outer latch, respectively.

Reviewed By: bmahjour

Differential Revision: https://reviews.llvm.org/D114917
2022-01-14 16:28:41 -05:00