We currently set `UP.Count` to `TripCount` and `MaxTripCount` prior to
full and upper bound unrolling, respectively. This was likely done to
ensure that calls to `UCE.getUnrolledLoopSize(UP)` use the appropriate
trip count. However, we can use `UCE.getUnrolledLoopSize(UP,
FullUnrollTripCount)` instead.
To prevent unintentional unrolling, we set `UP.Count = 0` when
early-exiting `computeUnrollCount()`. (Note: this does not occur
[here](eb687fb106/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp (L1190-L1198)).
This seems like a bug.)
We only perform early exits when evaluating runtime unrolling. At that
point, [we know `TripCount` is
zero](3fb31e7b06/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp (L1157-L1158)),
and thus we could not have leaked `TripCount`. However, we [could've
leaked
`MaxTripCount`](eb687fb106/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp (L1102-L1110)).
It seems like:
eb687fb106/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp (L1181-L1188)
was supposed to handle this case. However:
- It uses `<` instead of `<=`. This breaks the existing convention
[[1]](eb687fb106/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp (L869))
[[2]](eb687fb106/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp (L1103))
for how `UP.MaxUpperBound` is treated.
- It's ignored when a target sets `UP.Force = true`.
Thus:
- When `UP.Force == false`, we leak `MaxTripCount` into runtime
unrolling when `MaxTripCount && (UP.UpperBound || MaxOrZero) &&
MaxTripCount == UP.MaxUpperBound`
- When `UP.Force == true`, we leak `MaxTripCount` into runtime unrolling
when `MaxTripCount && (UP.UpperBound || MaxOrZero) && MaxTripCount <=
UP.MaxUpperBound`.
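
The two cases above can be checked with a small standalone model. This is only a sketch: `Prefs` and all three functions are hypothetical names, and the shape of the guard (skipped when `Force` is set, comparing with `<` or `<=`) is reconstructed from the description above rather than copied from `LoopUnrollPass.cpp`.

```cpp
#include <cassert>

// Hypothetical stand-in for the few unrolling preferences discussed above.
struct Prefs {
  bool UpperBound;
  bool Force;
  unsigned MaxUpperBound;
};

// Condition under which upper bound unrolling stores MaxTripCount in UP.Count.
bool setsCountToMaxTripCount(const Prefs &UP, unsigned MaxTripCount,
                             bool MaxOrZero) {
  return MaxTripCount != 0 && (UP.UpperBound || MaxOrZero) &&
         MaxTripCount <= UP.MaxUpperBound;
}

// Assumed shape of the guard before runtime unrolling. It is skipped when
// Force is set. Inclusive selects `<=` instead of the existing `<`.
bool guardClearsCount(const Prefs &UP, unsigned MaxTripCount, bool Inclusive) {
  if (MaxTripCount == 0 || UP.Force)
    return false;
  return Inclusive ? MaxTripCount <= UP.MaxUpperBound
                   : MaxTripCount < UP.MaxUpperBound;
}

// MaxTripCount leaks into runtime unrolling if it was stored and not cleared.
bool leaksMaxTripCount(const Prefs &UP, unsigned MaxTripCount, bool MaxOrZero,
                       bool Inclusive) {
  return setsCountToMaxTripCount(UP, MaxTripCount, MaxOrZero) &&
         !guardClearsCount(UP, MaxTripCount, Inclusive);
}
```

With `Force == false`, the model leaks only when `MaxTripCount == MaxUpperBound`, and switching the guard to `<=` closes that gap. With `Force == true` the guard never fires, so the comparison change alone does not help in this model.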
This PR:
- Uses `UCE.getUnrolledLoopSize(UP, FullUnrollTripCount)`
- Stops setting `TripCount` and `MaxTripCount` prior to calling
`shouldFullUnroll()`
- Removes the `UP.Count = 0` safeguards
- Swaps `<` with `<=`, to address the `UP.Force == false` case
- Adds a test to document the behavior change (no longer leaking
`MaxTripCount`) in the `UP.Force == true` case.
This PR is intended as an AMDGPU-specific solution for #181267 while
discussions on changing the default behavior for all targets continue in
that PR.
Problem:
Loops with an explicit unroll pragma (`#pragma unroll` / `#pragma clang
loop unroll(enable)`) that have an expensive runtime trip count currently
don't get unrolled because `UP.AllowExpensiveTripCount` defaults to false.
The pragma is silently ignored. This is not the case when an unroll
factor is specified (`PragmaCount > 0`), where the pass sets
`UP.AllowExpensiveTripCount = true`.
Solution:
Set `UP.AllowExpensiveTripCount` to true for loops that have an unroll
pragma in the AMDGPU TTI implementation.
I've added a new lit test expensive-tripcount.ll that verifies
pragma-driven unrolling with expensive trip counts will work as
expected.
The change showed no meaningful regressions across a few different
workloads from Composable Kernels (CK) and llama.cpp as well as PyTorch
kernels on AMDGPU gfx950. Additionally, the change improves the
performance of PyTorch reduction loops on AMDGPU targets.
Canonicalize GEP source element types from `%T` to `[sizeof(%T) x i8]`.
This is intended to flush out any remaining places that rely on GEP
element types, as part of the `ptradd` migration. The impact of this
change is expected to be fairly minimal (we might enable a few more
hoist/sink style folds that depend on equal GEP types).
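
As a source-level analogy (a sketch, not LLVM code), indexing by element type and indexing by a byte stride of `sizeof(T)` compute the same address. `Elem`, `typedIndex`, `byteIndex` and `sameAddresses` are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical element type standing in for %T.
struct Elem {
  int A;
  double B;
};

// Typed indexing, analogous to `getelementptr %T, ptr %P, i64 %I`.
const Elem *typedIndex(const Elem *P, std::size_t I) { return P + I; }

// Byte-based indexing with a stride of sizeof(Elem), analogous to
// `getelementptr [sizeof(%T) x i8], ptr %P, i64 %I`.
const Elem *byteIndex(const Elem *P, std::size_t I) {
  const unsigned char *Base = reinterpret_cast<const unsigned char *>(P);
  return reinterpret_cast<const Elem *>(Base + I * sizeof(Elem));
}

// Returns true if both forms yield the same address for every index below 8.
bool sameAddresses() {
  static const Elem Array[8] = {};
  for (std::size_t I = 0; I < 8; ++I)
    if (typedIndex(Array, I) != byteIndex(Array, I))
      return false;
  return true;
}
```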
Ensure that frontends can request both `PragmaEnable` and `PragmaFull`
semantics on a loop.
FYI: to the best of my knowledge, it's not possible to toggle both
`PragmaEnable` and `PragmaFull` via `clang`.
Remove the redundant debug message.
While we're here, adopt the same debug message language that's used in
#178476 and use an `if` instead of a single `case` `switch` statement.
Add a new helper function `matchEquivZeroRHS()` that recognizes
comparisons with constants that are equivalent to comparisons with zero,
and transforms the predicate accordingly.
This handles the following transformations:
- icmp sgt X, -1 --> icmp sge X, 0
- icmp sle X, -1 --> icmp slt X, 0
- icmp [us]ge X, 1 --> icmp [us]gt X, 0
- icmp [us]lt X, 1 --> icmp [us]le X, 0
This enables more optimization opportunities in `simplifyICmpWithZero`,
such as folding icmp sgt X, -1 when X is known to be non-negative.
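
The four rewrites can be modeled outside of LLVM as follows. This is a minimal sketch: the `Pred` enum, `evalICmp`, `matchEquivZeroRHSModel` and `agreesWithZeroForm` are hypothetical and do not reflect the signature of the real helper. The model works on 64-bit values only.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical subset of integer comparison predicates.
enum class Pred { SGT, SGE, SLT, SLE, UGT, UGE, ULT, ULE };

// Evaluates `icmp P X, C` on 64-bit values.
bool evalICmp(Pred P, int64_t X, int64_t C) {
  uint64_t UX = static_cast<uint64_t>(X), UC = static_cast<uint64_t>(C);
  switch (P) {
  case Pred::SGT: return X > C;
  case Pred::SGE: return X >= C;
  case Pred::SLT: return X < C;
  case Pred::SLE: return X <= C;
  case Pred::UGT: return UX > UC;
  case Pred::UGE: return UX >= UC;
  case Pred::ULT: return UX < UC;
  case Pred::ULE: return UX <= UC;
  }
  return false;
}

// If `icmp P X, C` is equivalent to a comparison with zero, stores the
// adjusted predicate in NewP and returns true.
bool matchEquivZeroRHSModel(Pred P, int64_t C, Pred &NewP) {
  if (C == -1) {
    if (P == Pred::SGT) { NewP = Pred::SGE; return true; }
    if (P == Pred::SLE) { NewP = Pred::SLT; return true; }
  } else if (C == 1) {
    switch (P) {
    case Pred::SGE: NewP = Pred::SGT; return true;
    case Pred::UGE: NewP = Pred::UGT; return true;
    case Pred::SLT: NewP = Pred::SLE; return true;
    case Pred::ULT: NewP = Pred::ULE; return true;
    default: break;
    }
  }
  return false;
}

// Returns true if the pair matches and both forms agree for X in [-4, 4].
bool agreesWithZeroForm(Pred P, int64_t C) {
  Pred NewP = P;
  if (!matchEquivZeroRHSModel(P, C, NewP))
    return false;
  for (int64_t X = -4; X <= 4; ++X)
    if (evalICmp(P, X, C) != evalICmp(NewP, X, 0))
      return false;
  return true;
}
```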
---
- IR Impact: https://github.com/dtcxzyw/llvm-opt-benchmark/pull/3414
Unify the ad-hoc use of whitespace in `LLVM_DEBUG()` messages.
This approach should also make it easier to see which loop each debug
message corresponds to and which part of the loop unrolling heuristics
produced it.
Adds `-unroll-runtime-other-exit-predictable=false` option to
`unroll-multi-exit-loop-heuristics.ll` test for stability reasons.
This is a followup to a discussion in #164799 and a similar patch
https://reviews.llvm.org/D98098. Since this option is false by default,
this is an NFC.
This patch improves multi-exit loop unrolling by taking branch
probability into account, rather than only checking whether the other
exit is a deoptimizing one.
The implementation reads branch metadata directly because BPI is in an
unstable state in this part of the code (runtime unrolling invalidates
the state of its map, and using BPI caused errors in my tests).
If branch probability metadata is not present, the current deopt
heuristic is still used.
---------
Co-authored-by: Marek Sedlacek <msedlacek@azul.com>
Refactor a number of recent tests in
`llvm/test/Transforms/LoopUnroll/branch-weights-freq` to make it easier
to understand and extend them.
The changes mostly resemble the refactoring I recently did in PR #165635
in response to reviewer comments:
- For each case (e.g., each `-unroll-count` value in
`unroll-epilog.ll`), group all FileCheck directives together. That way,
while digesting a single case, the reader does not need to sift through
all other cases and a complex FileCheck prefix scheme.
- Reduce CFG testing. Drop many FileCheck directives that check for all
basic block labels and branches, and drop the cryptic
`-implicit-check-not` that excludes others. Instead, just use positive
checks for every loop body (represented by `call void @f`), for relevant
metadata, and for the branch instructions to which the metadata is
attached, and use simple negative checks (e.g.,
`-implicit-check-not='!prof'`) to be sure we have not missed any.
- Better document what the test intends to cover.
The result is sometimes longer tests due to comments and repetition, but
I believe they are easier to maintain this way.
When LoopUnroll copies the original loop's latch to the corresponding
non-latch branch in an unrolled iteration, any `!llvm.loop` is copied
along with it, but `!llvm.loop` is useless and misleading there. This
patch discards it.
e06831a3b29d did the same for LoopPeel.
Based on Michael Maitland's previous work:
https://github.com/llvm/llvm-project/pull/121222
This PR uses the existing recurrences code instead of introducing a
new pass just for CSA autovec. I've also made recipes that are more
generic.
Reland #171963, #172639 and #173444. They were reverted in
86b9f90b9574b3a7d15d28a91f6316459dcfa046 because they introduced
non-determinism in compiles.
The non-determinism has been fixed in
9b8addffa70cee5b2acc5454712d9cf78ce45710.
This causes non-determinism in compiles.
From nikic: "FYI the non-determinism is also visible on
llvm-opt-benchmark. Maybe repeatedly running test cases from
299446d99f
could reproduce the issue..."
Also revert dependent 796fafeff92fe5d2d20594859e92607116e30a16 and
e135447bda617125688b71d33480d131d1076a72.
We disabled unrolling for vectorized loops in #151525, but that
PR only checked the instruction type.
For some loops, no instruction has a vector type, but the loop
still performs vector operations (like the memset zero test in the
precommit test).
Here we check the operands as well to cover these cases.
Add a new `isAppleMLike` helper that returns true if the core is part of
the Apple M core family or is Apple A14 or later. It is used to apply cost
decisions consistently to those groups of cores.
The function is now a single place to update when new cores are added.
It also makes sure we apply unrolling decisions for newer Apple cores to
Apple A17.
PR: https://github.com/llvm/llvm-project/pull/170553
LoopPeel sometimes proves that, when reached, the original loop always
executes at least two iterations. LoopPeel then unconditionally executes
both the remaining loop's initial iteration and the peeled final
iteration. But that increases the latter's frequency above its frequency
in the original loop. To maintain the total frequency, this patch
compensates by decreasing the remaining loop's latch probability.
This is another step in issue #135812 and was discussed at
<https://github.com/llvm/llvm-project/pull/166858#discussion_r2528968542>.
- Introduce the -aarch64-force-unroll-threshold option; when a loop’s
cost is below this value we set UP.Force = true (default 0 keeps current
behaviour)
- Add an AArch64 loop-unroll regression test that runs once at the
default threshold and once with the flag raised, confirming forced
unrolling
PR #159163's probability computation for epilogue loops does not handle
the possibility of an original loop probability of one. Runtime loop
unrolling does not make sense for such an infinite loop, and a division
by zero results. This patch works around that case.
Issue #165998.
Currently, `LoopFullUnrollPass` incorrectly performs partial unrolling
when `#pragma unroll` is specified and both `TripCount` and
`MaxTripCount` are unknown. This patch adds a check to prevent partial
unrolling when the `OnlyFullUnroll` parameter is true and both trip count
values are zero.
As another step in issue #135812, this patch fixes block frequencies for
partial loop unrolling with an epilogue remainder loop. It does not
fully handle the case when the epilogue loop itself is unrolled. That
will be handled in the next patch.
For the guard and latch of each of the unrolled loop and epilogue loop,
this patch sets branch weights derived directly from the original loop
latch branch weights. The total frequency of the original loop body,
summed across all its occurrences in the unrolled loop and epilogue
loop, is the same as in the original loop. This patch also sets
`llvm.loop.estimated_trip_count` for the epilogue loop instead of
relying on the epilogue's latch branch weights to imply it.
This patch fixes branch weights in tests that PR #157754 adversely
affected.
This patch implements the LoopUnroll changes discussed in [[RFC] Fix
Loop Transformations to Preserve Block
Frequencies](https://discourse.llvm.org/t/rfc-fix-loop-transformations-to-preserve-block-frequencies/85785)
and is thus another step in addressing issue #135812.
In summary, for the case of partial loop unrolling without a remainder
loop, this patch changes LoopUnroll to:
- Maintain branch weights consistently with the original loop for the
sake of preserving the total frequency of the original loop body.
- Store the new estimated trip count in the
`llvm.loop.estimated_trip_count` metadata, introduced by PR #148758.
- Correct the new estimated trip count (e.g., 3 instead of 2) when the
original estimated trip count (e.g., 10) divided by the unroll count
(e.g., 4) leaves a remainder (e.g., 2).
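
The rounding in the last bullet can be written as a ceiling division. This is a sketch of the arithmetic only; `estimatedTripCountAfterUnroll` is a hypothetical name, and the patch may compute the value differently.

```cpp
#include <cassert>

// Divides the original estimated trip count by the unroll count, rounding up
// so that a partial final pass through the unrolled body still counts as one
// iteration of the unrolled loop.
unsigned estimatedTripCountAfterUnroll(unsigned OrigTripCount,
                                       unsigned UnrollCount) {
  assert(UnrollCount != 0 && "unroll count must be positive");
  unsigned Quotient = OrigTripCount / UnrollCount;
  unsigned Remainder = OrigTripCount % UnrollCount;
  return Remainder != 0 ? Quotient + 1 : Quotient;
}
```

For the numbers used above, 10 divided by 4 gives a quotient of 2 and a remainder of 2, so the result is 3 rather than 2.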
There are loop unrolling cases this patch does not fully fix, such as
partial unrolling with a remainder loop and complete unrolling, and
there are two associated tests whose branch weights this patch adversely
affects. They will be addressed in future patches that should land with
this patch.
Followup metadata for remainder loops is handled by two implementations,
both added by 7244852557ca6:
1. `tryToUnrollLoop` in `LoopUnrollPass.cpp`.
2. `CloneLoopBlocks` in `LoopUnrollRuntime.cpp`.
As far as I can tell, 2 is useless: I added `assert(!NewLoopID)` for the
`NewLoopID` returned by the `makeFollowupLoopID` call, and it never
fails throughout check-all for my build.
Moreover, if 2 were useful, it appears it would have a bug caused by
7cd826a321d9. That commit skips adding loop metadata to a new remainder
loop if the remainder loop itself is to be completely unrolled because
it will then no longer be a loop. However, that commit incorrectly
assumes that `UnrollRemainder` dictates complete unrolling of a
remainder loop, and thus it skips adding loop metadata even if the
remainder loop will be only partially unrolled.
To avoid further confusion here, this patch removes 2. check-all
continues to pass for my build. If 2 actually is useful, please advise
so we can create a test that covers that usage.
Near 2, this patch retains the `UnrollRemainder` guard on the
`setLoopAlreadyUnrolled` call, which adds `llvm.loop.unroll.disable` to
the remainder loop. That behavior exists both before and after
7cd826a321d9. The logic appears to be that remainder loop unrolling
(whether complete or partial) is opt-in. That is, unless
`UnrollRemainder` is true, `UnrollRuntimeLoopRemainder` skips running
remainder loop unrolling, and `llvm.loop.unroll.disable` suppresses any
later attempt at it.
This patch also extends testing of remainder loop followup metadata to
be sure remainder loop partial unrolling is handled correctly by 1.
The `masked.load`, `masked.store`, `masked.gather` and `masked.scatter`
intrinsics currently accept a separate alignment immarg. Replace this
with an `align` attribute on the pointer / vector of pointers argument.
This is the standard representation for alignment information on
intrinsics, and is already used by all other memory intrinsics. This
means the signatures now match llvm.expandload, llvm.vp.load, etc.
(Things like llvm.memcpy used to have a separate alignment argument as
well, but were already migrated a long time ago.)
It's worth noting that the masked.gather and masked.scatter intrinsics
previously accepted a zero alignment to indicate the ABI type alignment
of the element type. This special case is gone now: If the align
attribute is omitted, the implied alignment is 1, as usual. If ABI
alignment is desired, it needs to be explicitly emitted (which the
IRBuilder API already requires anyway).
The original loop (OL) that serves as input to LoopUnroll has basic
blocks that are arranged as follows:
```
OLPreHeader
OLHeader <-.
... |
OLLatch ---'
OLExit
```
In this depiction, every block has an implicit edge to the next block
below, so any explicit edge indicates a conditional branch.
Given OL and unroll count N, LoopUnroll sometimes creates an unrolled
loop (UL) with a remainder loop (RL) epilogue arranged like this:
```
,-- ULGuard
| ULPreHeader
| ULHeader <-.
| ... |
| ULLatch ---'
| ULExit
`-> RLGuard -----.
RLPreHeader |
,-> RLHeader |
| ... |
`-- RLLatch |
RLExit |
OLExit <-----'
```
Each UL iteration executes N OL iterations, but each RL iteration
executes 1 OL iteration. ULGuard or RLGuard checks whether the first
iteration of UL or RL should execute, respectively. If so, ULLatch or
RLLatch checks whether to execute each subsequent iteration.
Once reached, OL always executes its first iteration but not necessarily
the next N-1 iterations. Thus, ULGuard is always required before the
first UL iteration. However, when control flows from ULGuard directly to
RLGuard, the first OL iteration has yet to execute, so RLGuard is then
redundant before the first RL iteration.
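
A source-level analogy of this control flow, for N = 4, is sketched below. It is not what LoopUnroll emits: the function names and the sample data are hypothetical, and the guard conditions are simplified. The labels mirror the block names in the diagram, and the `goto` from ULGuard jumps straight to RLPreHeader because at least one OL iteration is still pending on that path.

```cpp
#include <cassert>

// Original loop (OL): once reached, the body always runs at least once,
// so callers must pass N >= 1.
unsigned sumOL(const unsigned *A, unsigned N) {
  unsigned Sum = 0, I = 0;
  do {
    Sum += A[I];
    ++I;
  } while (I < N);
  return Sum;
}

// Unrolled loop (UL) with a remainder loop (RL) epilogue, unroll count 4.
unsigned sumULWithRL(const unsigned *A, unsigned N) {
  unsigned Sum = 0, I = 0;
  // ULGuard: skip UL if fewer than 4 iterations remain. The first OL
  // iteration has not run yet, so RL must run and RLGuard can be bypassed.
  if (N < 4)
    goto RLPreHeader;
  do { // ULHeader ... ULLatch
    Sum += A[I] + A[I + 1] + A[I + 2] + A[I + 3];
    I += 4;
  } while (N - I >= 4);
  // ULExit, doubling as RLGuard: skip RL if nothing is left.
  if (I == N)
    goto OLExit;
RLPreHeader:
  do { // RLHeader ... RLLatch
    Sum += A[I];
    ++I;
  } while (I < N);
OLExit:
  return Sum;
}

// Compares both versions for every trip count from 1 to MaxN (at most 16).
bool matchesOLUpTo(unsigned MaxN) {
  static const unsigned Data[16] = {3, 1, 4, 1, 5, 9, 2, 6,
                                    5, 3, 5, 8, 9, 7, 9, 3};
  assert(MaxN <= 16);
  for (unsigned N = 1; N <= MaxN; ++N)
    if (sumOL(Data, N) != sumULWithRL(Data, N))
      return false;
  return true;
}
```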
Thus, this patch makes the following changes:
- Adjust ULGuard to branch to RLPreHeader instead of RLGuard, thus
eliminating RLGuard's unnecessary branch instruction for that path.
- Eliminate the creation of RLGuard phi node poison values. Without this
patch, RLGuard has such a phi node for each value that is defined by any
OL iteration and used in OLExit. The poison value is required where
ULGuard is the predecessor. The poison value indicates that control flow
from ULGuard to RLGuard to OLExit has no counterpart in OL because the
first OL iteration must execute either in UL or RL.
- Simplify the CFG by not splitting ULExit and RLGuard because, without
the ULGuard predecessor, the single block can now be a dedicated UL
exit.
- To RLPreHeader, add an `llvm.assume` call that asserts the RL trip
count is non-zero. Without this patch, RLPreHeader is reachable only
when RLGuard guarantees that assertion is true. With this patch, RLGuard
guarantees it only when RLGuard is the predecessor, and the OL structure
guarantees it when ULGuard is the predecessor. If RL itself is unrolled
later, this guarantee somehow prevents ScalarEvolution from giving up
when trying to compute a maximum trip count for RL. That maximum trip
count enables the branch instruction in the final unrolled instance of
RLLatch to be eliminated. Without the `llvm.assume` call, some existing
unroll tests start to fail because that instruction is not eliminated.
The original motivation for this patch is to facilitate later patches
that fix LoopUnroll's computation of branch weights so that they
maintain the block frequency of OL's body (see #135812). Specifically,
this patch ensures RLGuard's branch weights do not affect RL's
contribution to the block frequency of OL's body in the case that
ULGuard skips UL.
[LoopPeel] Fix branch weights' effect on block frequencies
This patch implements the LoopPeel changes discussed in [[RFC] Fix Loop
Transformations to Preserve Block
Frequencies](https://discourse.llvm.org/t/rfc-fix-loop-transformations-to-preserve-block-frequencies/85785).
In summary, a loop's latch block can have branch weight metadata that
encodes an estimated trip count that is derived from application profile
data. Initially, the loop body's block frequencies agree with the
estimated trip count, as expected. However, sometimes loop
transformations adjust those branch weights in a way that correctly
maintains the estimated trip count but that corrupts the block
frequencies. This patch addresses that problem in LoopPeel, which it
changes to:
- Maintain branch weights consistently with the original loop for the
sake of preserving the total frequency of the original loop body.
- Store the new estimated trip count in the
`llvm.loop.estimated_trip_count` metadata, introduced by PR #148758.
Add test where a branch can be removed after peeling by applying info
from loop guards. It unfortunately requires running IndVars first, to
strengthen flags of the induction.
When partially or runtime unrolling loops with reductions, currently the
reductions are performed in-order in the loop, negating most benefits
from unrolling such loops.
This patch extends unrolling code-gen to keep a parallel reduction phi
per unrolled iteration and combine the final result after the loop.
For out-of-order CPUs, this allows executing multiple reduction chains
in parallel.
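
At the source level, the idea looks roughly like the following sketch for an unroll count of 4. The function names and sample data are hypothetical, and the patch itself operates on IR phis rather than on source code. Unsigned integer addition is used so that reassociating the sum cannot change the result.

```cpp
#include <cassert>

// In-order reduction: every addition depends on the previous one.
unsigned sumInOrder(const unsigned *A, unsigned N) {
  unsigned Sum = 0;
  for (unsigned I = 0; I < N; ++I)
    Sum += A[I];
  return Sum;
}

// Unrolled by 4 with one accumulator per unrolled iteration. The four
// chains are independent inside the loop and are combined after it.
unsigned sumFourChains(const unsigned *A, unsigned N) {
  unsigned S0 = 0, S1 = 0, S2 = 0, S3 = 0;
  unsigned I = 0;
  for (; N - I >= 4; I += 4) {
    S0 += A[I];
    S1 += A[I + 1];
    S2 += A[I + 2];
    S3 += A[I + 3];
  }
  unsigned Sum = (S0 + S1) + (S2 + S3);
  // Remainder iterations, handled in order.
  for (; I < N; ++I)
    Sum += A[I];
  return Sum;
}

// Compares both versions for every length from 0 to MaxN (at most 16).
bool chainsMatchInOrder(unsigned MaxN) {
  static const unsigned Data[16] = {2, 7, 1, 8, 2, 8, 1, 8,
                                    2, 8, 4, 5, 9, 0, 4, 5};
  assert(MaxN <= 16);
  for (unsigned N = 0; N <= MaxN; ++N)
    if (sumInOrder(Data, N) != sumFourChains(Data, N))
      return false;
  return true;
}
```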
For now, the initial transformation is restricted to cases where we
unroll a small number of iterations (hard-coded to 4, but should maybe
be capped by TTI depending on the execution units), to avoid introducing
an excessive amount of parallel phis.
It also requires single block loops for now, where the unrolled
iterations are known to not exit the loop (either due to runtime
unrolling or partial unrolling). This ensures that the unrolled loop
will have a single basic block, with a single exit block where we can
place the final reduction value computation.
The initial implementation also only supports parallelizing loops with a
single reduction and only integer reductions. Those restrictions are
just to keep the initial implementation simpler, and can easily be
lifted as follow-ups.
With corresponding changes to the AArch64 TTI unrolling preferences, which
I will also share soon, this triggers in ~300 loops across a wide range of
workloads, including LLVM itself, ffmpeg, av1aom, sqlite, blender,
brotli, zstd and more.
PR: https://github.com/llvm/llvm-project/pull/149470
LoopPeel currently considers PHI nodes that become loop invariants
through peeling. However, in some cases, peeling transforms PHI nodes
into induction variables (IVs), potentially enabling further
optimizations such as loop vectorization. For example:
```c
// TSVC s292
int im = N-1;
for (int i=0; i<N; i++) {
a[i] = b[i] + b[im];
im = i;
}
```
In this case, peeling one iteration converts `im` into an IV, allowing
it to be handled by the loop vectorizer.
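
The effect of peeling one iteration can be sketched by hand as follows. The element type `float` and the function names are assumptions made for illustration. After the peeled iteration, `im` always equals `i - 1`, so it can be expressed in terms of the loop counter.

```cpp
#include <cassert>

// The loop from the example above, written as a function.
void s292Original(float *A, const float *B, int N) {
  int Im = N - 1;
  for (int I = 0; I < N; ++I) {
    A[I] = B[I] + B[Im];
    Im = I;
  }
}

// Hand-peeled version: the first iteration uses Im == N - 1, and every
// later iteration uses Im == I - 1, which is an induction variable.
void s292Peeled(float *A, const float *B, int N) {
  if (N <= 0)
    return;
  A[0] = B[0] + B[N - 1];
  for (int I = 1; I < N; ++I)
    A[I] = B[I] + B[I - 1];
}

// Runs both versions on the same input of length N (0 to 8) and compares.
bool peeledMatchesOriginal(int N) {
  assert(N >= 0 && N <= 8);
  const float B[8] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f};
  float A1[8] = {};
  float A2[8] = {};
  s292Original(A1, B, N);
  s292Peeled(A2, B, N);
  for (int I = 0; I < N; ++I)
    if (A1[I] != A2[I])
      return false;
  return true;
}
```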
This patch adds a new feature that peels loops in order to convert PHIs
into IVs. At the moment this feature is disabled by default.
Enabling it allows the above example to be vectorized. I measured it on
neoverse-v2 and observed a speedup of more than 60% (options: `-O3
-ffast-math -mcpu=neoverse-v2 -mllvm -enable-peeling-for-iv`).
This PR is taken over from #94900
Related #81851