The vectorization heuristic of LoopInterchange attempts to move a
vectorizable loop to the innermost position. Before this patch, a loop
was deemed vectorizable if there are no loop-carried dependencies
induced by the loop.
This patch extends the vectorization heuristic by introducing the
concept of forward and backward dependencies, inspired by
LoopAccessAnalysis. Specifically, an additional element is appended to
each direction vector to indicate whether it represents a forward
dependency (`<`) or not (`*`). Among these, only the forward
dependencies (i.e., those whose last element is `<`) affect the
vectorization heuristic. Accordingly, the check is conservative, and
dependencies are considered forward only when this can be proven.
Currently, we only support perfectly nested loops whose body consists of
a single basic block. For other cases, dependencies are pessimistically
treated as non-forward.
Before this patch, when a reduction exists in the loop, the legality
check of LoopInterchange only verified if there exists a
non-reassociative floating-point instruction in the reduction
calculation. However, it is insufficient, because reordering integer
reductions can also lead to incorrect transformations. Consider the
following example:
```c
int A[2][2] = {
{ INT_MAX, INT_MAX },
{ INT_MIN, INT_MIN },
};
int sum = 0;
for (int i = 0; i < 2; i++)
for (int j = 0; j < 2; j++)
sum += A[j][i];
```
To make this exchange legal, we must drop nuw/nsw flags from the
instructions involved in the reduction operations.
This patch extends the legality check to correctly handle such cases. In
particular, for integer addition and multiplication, it verifies that
the nsw and nuw flags are set on involved instructions, and drop them
when the transformation actually performed. This patch also introduces
explicit checks for the kind of reduction and permits only those that
are known to be safe for interchange. Consequently, some "unknown"
reductions (at the moment, `FindFirst*` and `FindLast*`) are rejected.
Fix#148228
LoopInterchange currently stop and exit its process when the number of
loads/stores in the loop is greater than `MaxMemInstrCount`. This
prevents excessive querying to DependenceAnalysis. However, computing
`CacheCost` also involves DependenceAnalysis queries, and their number
can grow to `O(N^2)` in the worst case, where `N` is the number of
loads/stores in the loop. Therefore, we should also avoid calculating it
if the loads/stores count exceeds `MaxMemInstrCount`.
This patch defers the calculation of `CacheCost` until it is actually
needed to reduce compile time. This avoids computing `CacheCost` when
the number of loads/stores is large. Additionally, since this patch
delays its calculation as much as possible, it is also effective in
other scenarios, e.g., when there are no legal loop pairs to exchange.
This patch fixes the handling of a confused `Dependence` object. Such an
object doesn’t contain any information about dependencies, so we must
process it conservatively. However, it was converted into a direction
vector like `[I I ... I]`. As a result, it was treated as if there are
no loop-carried dependencies, which can lead to illegal loop exchanges.
Fixes#140238
When proving the legality of exchanging two loops, it doesn't need to
check the elements of the direction vectors associated with the loops
outside of the two target loops. Before this patch, the legality check
looked at all elements of a direction vector to calculate the
lexicographically order of the vector, which may reject some legal
exchanges. For example, if a direction vector is `[* < =]`, it is safe
to swap the last two loops because the corresponding subsequence of the
vector (`[< =]`) is lexicographically positive for both before and after
the exchange. However, the its order is unknown if we don't drop the
prefix since the first element is `*`. This patch improves the logic of
legality check to ignore such unrelated prefixes of direction vectors.
The legality check in LoopInterchange allows two loops to be exchanged
if all direction vectors are lexicographically positive (or zero) for
both before and after the swap. The current implementation performs this
routine naively. However, if a direction vector is lexicographically
positive due to an element corresponding to a loop that is outside the
given two loops (i.e., if there is an element `<` before the loops we
are trying to interchange), then obviously it is also positive after
exchanging them. For example, for a direction vector `[< < >]`, swapping
the last two elements doesn't make it lexicographically negative because
the first element is `<`.
This patch adds a code to skip legality check if surrounding loops
already guarantee that the direction vector is lexicographically
positive. Note that this is only a small improvement on its own, but
it's necessary to relax the legality check I'm working on.
Split off from #118267
---------
Co-authored-by: Michael Kruse <llvm-project@meinersbur.de>
In the profitability check for vectorization, the dependency matrix was
not handled correctly. This can result to make a wrong decision: It may
say "this loop can be vectorized" when in fact it cannot. The root cause
of this is that the check process early returns when it finds '=' or 'I'
in the dependency matrix. To make sure that we can actually vectorize
the loop, we need to check all the rows of the matrix. This patch fixes
the process of checking whether we can vectorize the loop or not. Now it
won't make a wrong decision for a loop that cannot be vectorized.
Related: #131130
LoopInterchange has several heuristic functions to determine if
exchanging two loops is profitable or not. Whether or not to use each
heuristic and the order in which to use them were fixed, but #125830
allows them to be changed internally at will. This patch adds a new
option to control them via the compiler option.
The previous patch also added an option to prioritize the vectorization
heuristic. This patch also removes it to avoid conflicts between it and
the newly introduced one, e.g., both
`-loop-interchange-prioritize-vectorization=1` and
`-loop-interchange-profitabilities='cache,vectorization'` are specified.
LoopInterchange uses the bubble-sort fashion algorithm to sort the loops
in a loop nest. For two loops `A` and `B`, it calls `isProfitable(A, B)`
to determine whether these loops should be exchanged. The result of
`isProfitable(A, B)` should be conservative, that is, it should return
true only when we are sure exchanging `A` and `B` will improve
performance. If it's not conservative enough, it's possible that a
subsequent `isProfitable(B, A)` will also return true, in which case
LoopInterchange will undo its previous transformation. To avoid such
cases, `isProfitable(B, A)` must not return true if `isProfitable(A, B)`
returned true in the past. However, the current implementation can be in
such a situation. This patch resolves it by correcting the handling of
two loops that have the same cache cost.
This resolve the problem mentioned in
https://github.com/llvm/llvm-project/pull/118267#issuecomment-2510759354.
The LoopInterchange cost-model consists of several decision rules. They
are called one by one, and if some rule can determine the profitability,
then the subsequent rules aren't called. In the current implementation,
the rule for `CacheCostAnalysis` is called first, and if it fails to
determine the profitability, then the rule for vectorization is called.
However, there are cases where interchanging loops for vectorization
makes the code faster even if such exchanges are detrimental to the
cache. For example, exchanging the inner two loops in the following
example looks about x3 faster in my local (compiled with `-O3
-mcpu=neoverse-v2 -mllvm -cache-line-size=64`), even though it's
rejected by the rule based on cache cost. (NOTE: LoopInterchange cannot
exchange these loops due to legality checks. This should also be
improved.)
```c
__attribute__((aligned(64))) float aa[256][256],bb[256][256],cc[256][256],
dd[256][256],ee[256][256],ff[256][256];
// Alternative of TSVC s231 with more array accesses than the original.
void s231_alternative() {
for (int nl = 0; nl < 100*(100000/256); nl++) {
for (int i = 0; i < 256; ++i) {
for (int j = 1; j < 256; j++) {
aa[j][i] = aa[j-1][i] + bb[j][i] + cc[i][j]
+ dd[i][j] + ee[i][j] + ff[i][j];
}
}
}
}
```
This patch introduces a new option to prioritize the vectorization rule
over the cache cost rule.
Related issue: #131130
---------
Co-authored-by: Florian Hahn <flo@fhahn.com>
Some post-processing involved in exchanging a pair of loops has been
done separately from `processLoop`, which is a main function that does
the transformation. It's better to consolidate these processes into the
same function. This patch is a preparation of #127474.
Parameter PossiblyLoopIndependent has lost its intended purpose. This
flag is always set to true in all cases when depends() is called, hence
we want to reconsider the utility of this variable and remove it from
the function signature entirely. This is an NFC patch.
The profiling of the LLVM Test-suite reveals that a significant portion,
specifically 14,090 out of 139,323, loop nests were identified as
non-viable candidates for transformation, leading to the transform
exiting from isComputableLoopNest() without any action.
More importantly, dependence information was computed for these loop
nests before reaching the function isComputableLoopNest(), which does
not require DI and relies solely on scalar evolution (SE).
To enhance compile-time efficiency, this patch moves the call to
isComputableLoopNest() earlier in the control-flow, thereby avoiding
unnecessary dependence calculations.
The impact of this change is evident on the compile-time-tracker, with
the overall geometric mean improvement recorded at 0.11%, while the
lencode benchmark gets a more substantial benefit of 0.44%.
This improvement can be tracked in the isc-ln-exp-2 branch under my
repo.
LoopInterchange have converted `DVEntry::LE` and `DVEntry::GE` in
direction vectors to '<' and '>' respectively. This handling is
incorrect because the information about the '=' it lost. This leads to
miscompilation in some cases. To resolve this issue, convert them to '*'
instead.
Resolve#123920
As part of the "RemoveDIs" project, BasicBlock::iterator now carries a
debug-info bit that's needed when getFirstNonPHI and similar feed into
instruction insertion positions. Call-sites where that's necessary were
updated a year ago; but to ensure some type safety however, we'd like to
have all calls to getFirstNonPHI use the iterator-returning version.
This patch changes a bunch of call-sites calling getFirstNonPHI to use
getFirstNonPHIIt, which returns an iterator. All these call sites are
where it's obviously safe to fetch the iterator then dereference it. A
follow-up patch will contain less-obviously-safe changes.
We'll eventually deprecate and remove the instruction-pointer
getFirstNonPHI, but not before adding concise documentation of what
considerations are needed (very few).
---------
Co-authored-by: Stephen Tozer <Melamoto@gmail.com>
As part of the "RemoveDIs" project, BasicBlock::iterator now carries a
debug-info bit that's needed when getFirstNonPHI and similar feed into
instruction insertion positions. Call-sites where that's necessary were
updated a year ago; but to ensure some type safety however, we'd like to
have all calls to moveBefore use iterators.
This patch adds a (guaranteed dereferenceable) iterator-taking
moveBefore, and changes a bunch of call-sites where it's obviously safe
to change to use it by just calling getIterator() on an instruction
pointer. A follow-up patch will contain less-obviously-safe changes.
We'll eventually deprecate and remove the instruction-pointer
insertBefore, but not before adding concise documentation of what
considerations are needed (very few).
This patch is an extension to #115128.
After profiling LLVM test-suite, I see a lot of loop nest of depth more
than `MaxLoopNestDepth` which is 10. Early exit for them would save
compile-time as it would avoid computing DependenceInfo and CacheCost.
Please see 'bound-max-depth' branch on compile-time-tracker.
In the current state of the code, the transform computes entries for the
dependency matrix until `MaxMemInstrCount` which is 100. After 99th
entry, it terminates and thus overall wastes compile-time.
It would be nice if we can compute total number of entries upfront and
early exit if the number of entries > 100. However, computing the number
of entries is not always possible as it depends on two factors:
1. Number of load-store pairs in a loop.
2. Number of common loop levels for each of the pair.
This patch constrains the whole computation on the number of loads and
stores instructions in the loop.
In another approach, I experimented with computing 1 and constraining
the number of pairs, but that did not lead to any additional benefit in
terms of compile time. However, when other issues are fixed, I can
revisit this approach.
We are not handling 'S' scalar dependencies correctly and have at least
the following miscompiles related to that:
[LoopInterchange] incorrect handling of scalar dependencies and dependence vectors starting with ">" #54176
[LoopInterchange] Interchange breaks program correctness #46867
[LoopInterchange] Loops should not interchanged due to dependencies #47259
[LoopInterchange] Loops should not interchanged due to control flow #47401
This patch does no longer insert the "S" dependency/direction into the
dependency matrix, so a dependency is never "S". We seem to have
forgotten what the exact meaning is of this dependency type, and don't
see why it should be treated differently.
We prefer correctness over incorrect and more aggressive results. I.e.,
this prevents the miscompiles at the expense of handling less cases,
i.e. making interchange more pessimistic. However, some of the cases
that are now rejected for dependence analysis reasons, were rejected
before too but for other reasons (e.g. profitability). So at least for
the llvm regression tests, the number of regression are very reasonable.
This should be a stopgap. We would like to get interchange enabled by
default and thus prefer correctness over unsafe transforms, and later
see if we can get solve the regressions.
The entries in the dependency matrix can contain a lot of duplicates,
which is unnecessary and results in more checks that we can avoid, and
this patch adds that.
This patch bails out early if minimum depth
is not met. As it stands today, the pass computes
CacheCost before it attempts to do the transform.
This is not needed if minimum depth is not met.
This handles basic cases where depth is typically 1.
As the patch avoids unnecessary computation, it is aimed to improve
compile-time.
As outlined in my proposal of how to get rid of debug intrinsics, this
patch adds a moveBefore method that signals the caller /intends/ the order
of moved instructions is to stay the same. This semantic difference has an
effect on debug-info, as it signals whether debug-info needs to move with
instructions or not.
The patch just replaces a few calls to moveBefore with calls to
moveBeforePreserving -- and the latter just calls the former, so it's all
NFC right now. A future patch will add an implementation of
moveBeforePreserving that takes action to correctly preserve debug-info,
but that's tightly coupled with our non-instruction debug-info
representation that's still being reviewed.
[0] https://discourse.llvm.org/t/rfc-instruction-api-changes-needed-to-eliminate-debug-intrinsics-from-ir/68939
Differential Revision: https://reviews.llvm.org/D156369
SCEVExpander keeps track of all instructions it inserted. However,
it currently misses some phi nodes created during LCSSA construction.
Fix this by collecting these into another argument.
This also removes the IRBuilder argument, which was added for
essentially the same purpose, but only handles the root LCSSA nodes,
not those inserted by SSAUpdater.
This was reported as a regression on D149344, but the reduced test
case also reproduces without it.
Differential Revision: https://reviews.llvm.org/D150681
Before D135808, There would be endless loop interchange posibility (no
proper priority was there in profitability check. Any profitable check
may leads to loop-interchange). With this patch, there is no endless
interchange (priority in profitable check is defined. Order of decision
is 'Cache cost' check, 'InstrOrderCost', 'Vectorization'). Corrected the
dependency checking inside isProfitableForVectorization(), corrected the
checking of bad order loops in isProfitablePerInstrOrderCost().
Reviewed By: Meinersbur, bmahjour, #loopoptwg
Differential Revision: https://reviews.llvm.org/D135808
We don't need to explicitly forget subloops because forgetting parent
loops will automatically forget their subloops
Differential Revision: https://reviews.llvm.org/D141029
The current code of validDepInterchange() enumerates cases that are
legal for interchange. This could be simplified by checking
lexicographically order of the swapped direction matrix.
Reviewed By: congzhe, Meinersbur, bmahjour
Differential Revision: https://reviews.llvm.org/D137461
This is the bugfix to the miscompile mentioned in
https://reviews.llvm.org/D132055#3814831. The IR
that reproduced the bug is added as the test case in
this patch.
What this patch does is that, during legality phase
instead of checking the phi nodes only in `InnerLoop`
and `OuterLoop`, we check phi nodes in all subloops
of the `OuterLoop`. Suppose if the loop nest is triply
nested, and `InnerLoop` and `OuterLoop` is the middle
loop and the outermost loop respectively, we'll check
phi nodes in the innermost loop as well, in addition to
the ones in the middle and outermost loops.
Reviewed By: Meinersbur, #loopoptwg
Differential Revision: https://reviews.llvm.org/D134930
isOuterMostDepPositive()
The function isOuterMostDepPositive() is checked after negative dependence
vectors are normalized to be non-negative, so there will not be any negative
dependency ('>' as the outermost non-equal sign) after normalization. And
therefore the check in isOuterMostDepPositive() is irrelevent and redundant.
Reviewed By: congzhe
Differential Revision: https://reviews.llvm.org/D132982
This is a bugfix patch that resolves the following two bugs in loop interchange:
1. PR57148 which is an assertion error due to of loss of LCSSA form after interchange,
as referred to test1() in pr57148.ll.
2. Use before def for the outermost loop induction variables after interchange,
as referred to test2() in pr57148.ll.
The fix in this patch is that:
1. In cases where the LCSSA form is not maintained after interchange, we update the IR
to the LCSSA form again.
2. We split the phi nodes in the inner loop header into a separate basic block to avoid
the situation where use of the outer indvar appears before its def after interchange.
Previously we already did this for innermost loops, now we do it for non-innermost
loops (e.g., middle loops) as well.
Reviewed By: bmahjour, Meinersbur, #loopoptwg
Differential Revision: https://reviews.llvm.org/D132055
This patch is to resolve the bug reported and discussed in
https://reviews.llvm.org/D124926#3718761 and https://reviews.llvm.org/D124926#3719876.
The problem is that loop interchange is a loopnest pass under the new pass manager,
but the loop nest may not be constructed correctly by the loop pass manager after
running loop interchange and before running the next pass, which might cause problems
when it continues running the next pass.
The reason that the loop nest is constructed incorrectly is that the outermost
loop might have changed after interchange, and what was the original outermost
loop is not the current outermost loop anymore. Constructing the loop nest based
on the original outermost loop would generate an invalid loop nest.
The fix in this patch is that, in the loop pass manager before running each loopnest
pass, we re-cosntruct the loop nest based on the current outermost loop, if LPMUpdater
notifies the loop pass manager that the previous loop nest has been invalidated by passes
like loop interchange.
Reviewed By: aeubanks
Differential Revision: https://reviews.llvm.org/D132199