712 Commits

Author SHA1 Message Date
Justin Fargnoli
cb32b8bffb
[LoopUnrollPass] Don't pre-set UP.Count before legality checks in computeUnrollCount() (#185979)
We currently set `UP.Count` to `TripCount` and `MaxTripCount` prior to
full and upper bound unrolling, respectively. This was likely done to
ensure that calls to `UCE.getUnrolledLoopSize(UP)` use the appropriate
trip count. However, we can use `UCE.getUnrolledLoopSize(UP,
FullUnrollTripCount)` instead.

To prevent unintentional unrolling, we set `UP.Count = 0` when
early-exiting `computeUnrollCount()`. (Note: this does not occur
[here](eb687fb106/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp (L1190-L1198)).
This seems like a bug.)

We only perform early exits when evaluating runtime unrolling. At that
point, [we know `TripCount` is
false](3fb31e7b06/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp (L1157-L1158)),
and thus we could not have leaked `TripCount`. However, we [could've
leaked
`MaxTripCount`](eb687fb106/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp (L1102-L1110)).

It seems like:
eb687fb106/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp (L1181-L1188)

was supposed to handle this case. However:

- It uses `<` instead of `<=`. This breaks the existing convention
[[1]](eb687fb106/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp (L869))
[[2]](eb687fb106/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp (L1103))
for how `UP.MaxUpperBound` is treated.
- It's ignored when a target sets `UP.Force = true`.

Thus:
- When `UP.Force == false`, we leak `MaxTripCount` into runtime
unrolling when `MaxTripCount && (UP.UpperBound || MaxOrZero) &&
MaxTripCount == UP.MaxUpperBound`
- When `UP.Force == true`, we leak `MaxTripCount` into runtime unrolling
when `MaxTripCount && (UP.UpperBound || MaxOrZero) && MaxTripCount <=
UP.MaxUpperBound`.

This PR:
- Uses `UCE.getUnrolledLoopSize(UP, FullUnrollTripCount)`
- Stops setting `TripCount` and `MaxTripCount` prior to calling
`shouldFullUnroll()`
- Removes the `UP.Count = 0` safeguards
- Swaps `<` with `<=`, to address the `UP.Force == false` case
- Adds a test to document the behavior change (no longer leaking
`MaxTripCount`) in the `UP.Force == true` case.
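
The two leak conditions listed above can be restated as a single predicate. This is only a minimal sketch of the pre-patch behavior as this message describes it: `Prefs` is a hypothetical stand-in for the few fields mentioned here (it is not LLVM's `UnrollingPreferences`), and `leakedMaxTripCount` is an illustrative name, not a function in `LoopUnrollPass.cpp`.

```cpp
// Hypothetical stand-in for the fields this message refers to; this is not
// LLVM's TargetTransformInfo::UnrollingPreferences.
struct Prefs {
  bool Force;
  bool UpperBound;
  unsigned MaxUpperBound;
};

// Returns true when, per the conditions listed above, the pre-patch code
// would have carried MaxTripCount into runtime unrolling.
bool leakedMaxTripCount(const Prefs &UP, unsigned MaxTripCount,
                        bool MaxOrZero) {
  const bool Eligible = MaxTripCount != 0 && (UP.UpperBound || MaxOrZero);
  if (!Eligible)
    return false;
  // With Force set, the `<` guard was ignored, so any count within the upper
  // bound leaked; otherwise only the boundary value slipped past `<`.
  return UP.Force ? MaxTripCount <= UP.MaxUpperBound
                  : MaxTripCount == UP.MaxUpperBound;
}
```

Under this model, swapping `<` for `<=` closes the `UP.Force == false` case, which matches the fix described in the list above.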
2026-03-31 19:50:52 +00:00
Justin Fargnoli
a2891ff85c
Reapply "[LoopUnroll] Remove computeUnrollCount()'s return value" (#187104)
Address
https://github.com/llvm/llvm-project/pull/184529#issuecomment-4074393657
by checking the loop's metadata prior to unrolling.
2026-03-18 17:55:56 +00:00
Adel Ejjeh
49250284cf
[AMDGPU][LoopUnroll] Enable allowexpensivetripcount for amdgpu when unroll pragmas are present (#181241)
This PR is intended as an AMDGPU-specific solution for #181267 while
discussions on changing the default behavior for all targets continue in
that PR.

Problem:
Loops with an explicit unroll pragma (#pragma unroll / #pragma clang
loop unroll(enable)) that have an expensive runtime trip count currently
don't get unrolled because UP.AllowExpensiveTripCount defaults to false.
The pragma is silently ignored. This is not the case when an unroll
factor is specified (PragmaCount > 0), where the pass sets
UP.AllowExpensiveTripCount = true.

Solution:
Set UP.AllowExpensiveTripCount to true for loops that have an unroll
pragma in the AMDGPU TTI implementation.
I've added a new lit test expensive-tripcount.ll that verifies
pragma-driven unrolling with expensive trip counts will work as
expected.

The change showed no meaningful regressions across a few different
workloads from Composable Kernels (CK) and llama.cpp as well as PyTorch
kernels on AMDGPU gfx950. Additionally, the change improves the
performance of PyTorch reduction loops on AMDGPU targets.
2026-03-13 09:56:20 -07:00
Justin Fargnoli
40fca74d80
[LoopUnrollPass] Trace loop unroll count heuristics with LLVM_DEBUG (#182981)
2026-03-12 16:46:56 +00:00
Nikita Popov
6ecbc0c96e
[InstCombine] Canonicalize GEP source element types (#180745)
Canonicalize GEP source element types from `%T` to `[sizeof(%T) x i8]`.

This is intended to flush out any remaining places that rely on GEP
element types, as part of the `ptradd` migration. The impact of this
change is expected to be fairly minimal (we might enable a few more
hoist/sink style folds that depend on equal GEP types).
2026-03-06 14:48:01 +00:00
Justin Fargnoli
f0265ccb60
[LoopUnroll] Ensure we can accept both llvm.loop.unroll.full and llvm.loop.unroll.enable metadata on the same loop (NFC) (#182381)
Ensure that frontends can request both `PragmaEnable` and `PragmaFull`
semantics on a loop.

FYI: to the best of my knowledge, it's not possible to toggle both
`PragmaEnable` and `PragmaFull` via `clang`.
2026-03-05 16:44:25 +00:00
theRonShark
8f9c926868
Revert "AMDGPU: Fix runtime unrolling when cascaded GEPs present (#14… (#183641)
…7700)"

slows down llama.cpp

This reverts commit cff4a00d3f7d91c0dd3a93eb81db66be178273d3.
2026-02-26 21:59:04 -05:00
Justin Fargnoli
f9e0a999bc
[LoopUnrollPass] Fix capitalization in LLVM_DEBUG (#182429)
Carve out changes to existing debug messages from #178476.
2026-02-24 08:40:07 -08:00
Justin Fargnoli
413cafa462
[LoopUnrollPass] Remove redundant debug message in tryToUnrollLoop() (#181954)
Remove the redundant debug message. 

While we're here, adopt the same debug message language that's used in
#178476 and use an `if` instead of a single `case` `switch` statement.
2026-02-19 20:02:10 +00:00
Kunqiu Chen
85e07bad93
[InstructionSimplify] Extend simplifyICmpWithZero to handle equivalent zero RHS (#179055)
Add a new helper function `matchEquivZeroRHS()` that recognizes
comparisons with constants that are equivalent to comparisons with zero,
and transforms the predicate accordingly.

This handles the following transformations:
- icmp sgt X, -1 --> icmp sge X, 0
- icmp sle X, -1 --> icmp slt X, 0
- icmp [us]ge X, 1 --> icmp [us]gt X, 0
- icmp [us]lt X, 1 --> icmp [us]le X, 0

This enables more optimization opportunities in `simplifyICmpWithZero`,
such as folding icmp sgt X, -1 when X is known to be non-negative.
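
The four rewrites listed above can be checked exhaustively on 8-bit values. The sketch below is not the patch's `matchEquivZeroRHS()`; `Pred`, `equivZeroPredicate`, `eval`, and `agreesForAllI8` are hypothetical names used only to model the predicate mapping and confirm it by brute force.

```cpp
#include <cstdint>

enum class Pred { SGT, SGE, SLT, SLE, UGT, UGE, ULT, ULE };

// Hypothetical helper: if `icmp P X, C` is one of the four patterns listed
// above, stores the predicate of the equivalent comparison against zero in
// Out and returns true.
bool equivZeroPredicate(Pred P, int C, Pred &Out) {
  if (C == -1 && P == Pred::SGT) { Out = Pred::SGE; return true; }
  if (C == -1 && P == Pred::SLE) { Out = Pred::SLT; return true; }
  if (C == 1) {
    switch (P) {
    case Pred::SGE: Out = Pred::SGT; return true;
    case Pred::UGE: Out = Pred::UGT; return true;
    case Pred::SLT: Out = Pred::SLE; return true;
    case Pred::ULT: Out = Pred::ULE; return true;
    default: break;
    }
  }
  return false;
}

// Evaluates `icmp P X, C` on i8, reinterpreting the bits for unsigned forms.
bool eval(Pred P, int8_t X, int8_t C) {
  const uint8_t UX = static_cast<uint8_t>(X);
  const uint8_t UC = static_cast<uint8_t>(C);
  switch (P) {
  case Pred::SGT: return X > C;
  case Pred::SGE: return X >= C;
  case Pred::SLT: return X < C;
  case Pred::SLE: return X <= C;
  case Pred::UGT: return UX > UC;
  case Pred::UGE: return UX >= UC;
  case Pred::ULT: return UX < UC;
  case Pred::ULE: return UX <= UC;
  }
  return false;
}

// Returns true if the pattern is recognized and the rewritten comparison
// against zero agrees with the original for every i8 value of X.
bool agreesForAllI8(Pred P, int C) {
  Pred Q;
  if (!equivZeroPredicate(P, C, Q))
    return false;
  for (int V = -128; V <= 127; ++V)
    if (eval(P, static_cast<int8_t>(V), static_cast<int8_t>(C)) !=
        eval(Q, static_cast<int8_t>(V), 0))
      return false;
  return true;
}
```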

---

- IR Impact: https://github.com/dtcxzyw/llvm-opt-benchmark/pull/3414
2026-02-13 00:06:32 +08:00
Justin Fargnoli
45412b6790
[LoopUnrollPass] Indent LLVM_DEBUG() messages based on our depth in the tryToUnrollLoop() call graph (#178945)
Unify the ad-hoc use of whitespace in `LLVM_DEBUG()` messages. 

This approach should also make it easier to see which loop each debug
message corresponds to and which part of the loop unrolling heuristics
it comes from.
2026-02-11 17:59:42 +00:00
Andreas Jonson
faa4b97b10
[InstCombine] fold icmp ne (and X, 1), 0 --> trunc X to i1 (#178977)
Remove the vector check so this fold is always done.

proof: https://alive2.llvm.org/ce/z/oabD6J
closes #172888
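
As a small illustration of why the fold is sound (the Alive2 proof above is the authoritative check), the sketch below compares both forms on every 8-bit value. The function names and the 1-bit bitfield used to model `trunc ... to i1` are assumptions made for this example.

```cpp
#include <cstdint>

// Original form: icmp ne (and X, 1), 0.
bool viaAndIcmp(uint8_t X) { return (X & 1u) != 0; }

// Folded form: trunc X to i1, modeled with a 1-bit unsigned bitfield, which
// keeps only the lowest bit of the assigned value.
bool viaTrunc(uint8_t X) {
  struct I1 {
    unsigned Bit : 1;
  };
  I1 T{};
  T.Bit = X;
  return T.Bit != 0;
}

// Returns true if both forms agree for every i8 input.
bool foldAgreesForAllI8() {
  for (unsigned V = 0; V < 256; ++V)
    if (viaAndIcmp(static_cast<uint8_t>(V)) !=
        viaTrunc(static_cast<uint8_t>(V)))
      return false;
  return true;
}
```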
2026-02-03 19:14:27 +01:00
Justin Fargnoli
7889f729ac
[LoopUnroll] Remove preceding whitespace in loop peeling optimization remark (#178951)
2026-02-02 09:32:57 -08:00
Marek Sedláček
fa675aabc7
[NFC][LoopUnroll] Add -unroll-runtime-other-exit-predictable=false to unroll-multi-exit-loop-heuristics.ll (#179198)
Adds `-unroll-runtime-other-exit-predictable=false` option to
`unroll-multi-exit-loop-heuristics.ll` test for stability reasons.

This is a followup to a discussion in #164799 and a similar patch
https://reviews.llvm.org/D98098. Since this option is false by default,
this is an NFC.
2026-02-02 11:11:41 -05:00
Marek Sedláček
362c39d36d
[LoopUnroll] Use branch probability in multi-exit loop unrolling (#164799)
This patch improves multi-exit loop unrolling by taking branch
probability into account, rather than only checking whether the other
exit is a deoptimizing one.

This implementation uses branch metadata directly because BPI is in an
unstable state in this part of the code (runtime unrolling invalidates
the state of the map, and using BPI in my tests caused errors).
If branch probability metadata is not present, the current deopt
heuristic is still used.

---------

Co-authored-by: Marek Sedlacek <msedlacek@azul.com>
2026-01-28 11:12:34 -05:00
Joel E. Denny
b1698d3ac0
[LoopUnroll][NFC] Simplify recent block frequency tests (#177025)
Refactor a number of recent tests in
`llvm/test/Transforms/LoopUnroll/branch-weights-freq` to make it easier
to understand and extend them.

The changes mostly resemble the refactoring I recently did in PR #165635
in response to reviewer comments:
- For each case (e.g., each `-unroll-count` value in
`unroll-epilog.ll`), group all FileCheck directives together. That way,
while digesting a single case, the reader does not need to sift through
all other cases and a complex FileCheck prefix scheme.
- Reduce CFG testing. Drop many FileCheck directives that check for all
basic block labels and branches, and drop the cryptic
`-implicit-check-not` that excludes others. Instead, just use positive
checks for every loop body (represented by `call void @f`), for relevant
metadata, and for the branch instructions to which the metadata is
attached, and use simple negative checks (e.g.,
`-implicit-check-not='!prof'`) to be sure we have not missed any.
- Better document what the test intends to cover.

The result is sometimes longer tests due to comments and repetition, but
I believe they are easier to maintain this way.
2026-01-21 10:40:10 -05:00
Joel E. Denny
565591d8a4
[LoopUnroll] Do not copy !llvm.loop from latch to non-latch (#165635)
When LoopUnroll copies the original loop's latch to the corresponding
non-latch branch in an unrolled iteration, any `!llvm.loop` is copied
along with it, but `!llvm.loop` is useless and misleading there. This
patch discards it.

e06831a3b29d did the same for LoopPeel.
2026-01-14 10:40:31 -05:00
Graham Hunter
2abd6d6d7a
[LV] Vectorize conditional scalar assignments (#158088)
Based on Michael Maitland's previous work:
https://github.com/llvm/llvm-project/pull/121222

This PR uses the existing recurrences code instead of introducing a
new pass just for CSA autovec. I've also made recipes that are more
generic.
2026-01-14 14:59:18 +00:00
Mingjie Xu
fac9472593
[IR] Reland Optimize PHINode::removeIncomingValue() and PHINode::removeIncomingValueIf() to use the swapping strategy. (#174274)
Reland #171963, #172639 and #173444, which were reverted in
86b9f90b9574b3a7d15d28a91f6316459dcfa046 because they introduced
non-determinism in compiles.
The non-determinism has been fixed in
9b8addffa70cee5b2acc5454712d9cf78ce45710.
2026-01-04 09:24:53 +08:00
Walter Lee
86b9f90b95
Revert 159f1c048e08a8780d92858cfc80e723c90235e3 (#173893)
This causes non-determinism in compiles.

From nikic: "FYI the non-determinism is also visible on
llvm-opt-benchmark. Maybe repeatedly running test cases from
299446d99f
could reproduce the issue..."

Also revert dependent 796fafeff92fe5d2d20594859e92607116e30a16 and
e135447bda617125688b71d33480d131d1076a72.
2025-12-29 20:23:13 -05:00
Mingjie Xu
159f1c048e
[IR] Optimize PHINode::removeIncomingValue() by swapping removed incoming value with the last incoming value. (#171963)
Current implementation uses `std::copy` to shift all incoming values
after the removed index. This patch optimizes
`PHINode::removeIncomingValue()` by replacing the linear shift of
incoming values with a swap-with-last strategy.

After this change, the relative order of incoming values after removal
is not preserved.

This improves compile-time for PHI nodes with many predecessors.
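
The trade-off described above can be sketched on a plain `std::vector`. This is not the `PHINode` implementation: `removeBySwapWithLast` and `removeByShift` are hypothetical helpers, and `std::vector::erase` stands in for the `std::copy`-based shift.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Removes Values[Idx] in O(1) by moving the last element into its slot.
// The relative order of the remaining elements is not preserved.
// Idx must be a valid index.
template <typename T>
void removeBySwapWithLast(std::vector<T> &Values, std::size_t Idx) {
  if (Idx + 1 != Values.size())
    Values[Idx] = std::move(Values.back());
  Values.pop_back();
}

// Order-preserving alternative: shifts every later element left, O(n).
// Idx must be a valid index.
template <typename T>
void removeByShift(std::vector<T> &Values, std::size_t Idx) {
  Values.erase(Values.begin() + static_cast<std::ptrdiff_t>(Idx));
}
```

Removing index 1 from `{10, 20, 30, 40}` gives `{10, 40, 30}` with the swap strategy and `{10, 30, 40}` with the shift, which is the ordering difference the message warns about.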

Depends:
https://github.com/llvm/llvm-project/pull/171955
https://github.com/llvm/llvm-project/pull/171956
https://github.com/llvm/llvm-project/pull/171960
https://github.com/llvm/llvm-project/pull/171962
2025-12-17 19:44:01 +08:00
Craig Topper
ef21740781
[LoopPeel] Check for onlyAccessesInaccessibleMemory instead of llvm.assume in peelToTurnInvariantLoadsDereferenceable. (#171910)
onlyAccessesInaccessibleMemory can't alias with a load. This allows us
to ignore more intrinsics than llvm.assume.

Follow up from #171547
2025-12-12 10:45:41 -08:00
Craig Topper
ccc3835ffa
[LoopPeel] Ignore assume intrinsics for the mayWriteToMemory check in peelToTurnInvariantLoadsDereferenceable. (#171547)
llvm.assume intrinsics have the mayWriteToMemory property, but
won't prevent the load from becoming dereferenceable.
2025-12-10 13:14:19 -08:00
Pengcheng Wang
a0b6638c85
[RISCV] Don't unroll vectorized loops with vector operands (#171089)
We disabled unrolling for vectorized loops in #151525, but that
PR only checked the instruction type.

For some loops, there is no instruction with vector type but they
are still vector operations (just like the memset zero test in the
precommit test).

Here we check the operands as well to cover these cases.
2025-12-09 12:42:41 +08:00
Pengcheng Wang
893479adcc
[RISCV] Precommit test for unrolling loops with vector operands
2025-12-09 11:51:33 +08:00
Florian Hahn
7470d721c6
[AArch64] Add isAppleMLike helper to check for M cores and aligned CPUs. (#170553)
Add a new isAppleMLike helper, that returns true if the core is part of
the Apple M core family or Apple A14 or later. Used to apply cost
decisions consistently to those groups of cores.

The function is now a single place to update when new cores are added.
It also makes sure we apply unrolling decisions for newer Apple cores to
Apple A17.

PR: https://github.com/llvm/llvm-project/pull/170553
2025-12-05 20:05:29 +00:00
Florian Hahn
c5e6f4e99d
[AArch64] Add unrolling test with -mcpu=apple-a17.
Currently Apple unrolling preferences are not applied to apple-a17.
2025-12-03 20:15:58 +00:00
Philip Reames
c752bb9203
[IndVars] Strengthen inference of samesign flags (#170363)
When reviewing another change, I noticed that we were failing to infer
samesign for two cases: 1) an unsigned comparison, and 2) when both
arguments were known negative.

Using CVP and InstCombine as a reference, we need to be careful to not
allow eq/ne comparisons. I'm a bit unclear on the why of that, and for
now am going with the low risk change. I may return to investigate that
in a follow up.

Compile time results look like noise to me, see:
https://llvm-compile-time-tracker.com/compare.php?from=49a978712893fcf9e5f40ac488315d029cf15d3d&to=2ddb263604fd7d538e09dc1f805ebc30eb3ffab0&stat=instructions:u
2025-12-03 16:16:22 +00:00
Philip Reames
49a9787128
[SCEV] Regenerate a subset of auto updated tests
Reducing spurious diff in an upcoming change.
2025-12-02 12:16:53 -08:00
Julian Nagele
b641509637
[LoopUnroll] Introduce parallel accumulators when unrolling FP reductions. (#166630)
This is building on top of
https://github.com/llvm/llvm-project/pull/149470, also introducing
parallel accumulator PHIs when the reduction is for floating points,
provided we have the reassoc flag. See also
https://github.com/llvm/llvm-project/pull/166353, which aims to
introduce parallel accumulators for reductions with vector instructions.
2025-11-27 15:03:36 +00:00
Julian Nagele
c73de9777e
[IVDesciptors] Support detecting reductions with vector instructions. (#166353)
In combination with https://github.com/llvm/llvm-project/pull/149470
this will introduce parallel accumulators when unrolling reductions with
vector instructions. See also
https://github.com/llvm/llvm-project/pull/166630, which aims to
introduce parallel accumulators for FP reductions.
2025-11-24 11:12:06 +00:00
Joel E. Denny
21fedcbf89
[LoopPeel] Fix BFI when peeling last iteration without guard (#168250)
LoopPeel sometimes proves that, when reached, the original loop always
executes at least two iterations. LoopPeel then unconditionally executes
both the remaining loop's initial iteration and the peeled final
iteration. But that increases the latter's frequency above its frequency
in the original loop. To maintain the total frequency, this patch
compensates by decreasing the remaining loop's latch probability.

This is another step in issue #135812 and was discussed at
<https://github.com/llvm/llvm-project/pull/166858#discussion_r2528968542>.
2025-11-20 10:45:53 -05:00
Vladi Krapp
42a1184e42
[AArch64] Allow forcing unrolling of small loops (#167488)
- Introduce the -aarch64-force-unroll-threshold option; when a loop’s
cost is below this value we set UP.Force = true (default 0 keeps current
behaviour)
- Add an AArch64 loop-unroll regression test that runs once at the
default threshold and once with the flag raised, confirming forced
unrolling
2025-11-17 08:59:44 +00:00
Mircea Trofin
358e9a56af
[LP] Assign weights when peeling last iteration. (#166858)
2025-11-15 10:01:04 -08:00
Joel E. Denny
1aa86ca521
[LoopUnroll] Fix division by zero (#166258)
PR #159163's probability computation for epilogue loops does not handle
the possibility of an original loop probability of one. Runtime loop
unrolling does not make sense for such an infinite loop, and a division
by zero results. This patch works around that case.

Issue #165998.
2025-11-04 12:49:33 -05:00
Ivan Kelarev
37825ad4f6
[LoopUnroll] Prevent LoopFullUnrollPass from performing partial unrolling when trip counts are unknown (#165013)
Currently, `LoopFullUnrollPass` incorrectly performs partial unrolling
when `#pragma unroll` is specified and both `TripCount` and
`MaxTripCount` are unknown. This patch adds a check to prevent partial
unrolling when `OnlyFullUnroll` parameter is true and both trip count
values are zero.
2025-11-04 09:20:01 -08:00
Joel E. Denny
bb9bd5f263
[LoopUnroll] Fix assert fail on zeroed branch weights (#165938)
BranchProbability fails an assert when its denominator is zero.

Reported at
<https://github.com/llvm/llvm-project/pull/159163#pullrequestreview-3406318423>.
2025-11-03 10:19:12 -05:00
Joel E. Denny
cc8ff73fba
[LoopUnroll] Fix block frequencies for epilogue (#159163)
As another step in issue #135812, this patch fixes block frequencies for
partial loop unrolling with an epilogue remainder loop. It does not
fully handle the case when the epilogue loop itself is unrolled. That
will be handled in the next patch.

For the guard and latch of each of the unrolled loop and epilogue loop,
this patch sets branch weights derived directly from the original loop
latch branch weights. The total frequency of the original loop body,
summed across all its occurrences in the unrolled loop and epilogue
loop, is the same as in the original loop. This patch also sets
`llvm.loop.estimated_trip_count` for the epilogue loop instead of
relying on the epilogue's latch branch weights to imply it.

This patch fixes branch weights in tests that PR #157754 adversely
affected.
2025-10-31 11:01:42 -04:00
Joel E. Denny
24557cce40
[LoopUnroll] Fix block frequencies when no runtime (#157754)
This patch implements the LoopUnroll changes discussed in [[RFC] Fix
Loop Transformations to Preserve Block
Frequencies](https://discourse.llvm.org/t/rfc-fix-loop-transformations-to-preserve-block-frequencies/85785)
and is thus another step in addressing issue #135812.

In summary, for the case of partial loop unrolling without a remainder
loop, this patch changes LoopUnroll to:

- Maintain branch weights consistently with the original loop for the
sake of preserving the total frequency of the original loop body.
- Store the new estimated trip count in the
`llvm.loop.estimated_trip_count` metadata, introduced by PR #148758.
- Correct the new estimated trip count (e.g., 3 instead of 2) when the
original estimated trip count (e.g., 10) divided by the unroll count
(e.g., 4) leaves a remainder (e.g., 2).
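
The trip count correction in the last bullet amounts to rounding the division up. A minimal sketch, assuming a hypothetical helper name; the patch's actual computation may be structured differently.

```cpp
#include <cstdint>

// Rounds up so that a partial final pass through the unrolled body still
// counts as one iteration. UnrollCount must be non-zero.
uint64_t estimatedTripCountAfterUnroll(uint64_t OrigTripCount,
                                       uint64_t UnrollCount) {
  return OrigTripCount / UnrollCount +
         (OrigTripCount % UnrollCount != 0 ? 1 : 0);
}
```

With the numbers from the bullet, an original estimate of 10 and an unroll count of 4 yield 3 rather than the truncated 2.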

There are loop unrolling cases this patch does not fully fix, such as
partial unrolling with a remainder loop and complete unrolling, and
there are two associated tests whose branch weights this patch adversely
affects. They will be addressed in future patches that should land with
this patch.
2025-10-31 10:44:27 -04:00
Joel E. Denny
8d186e2195
[LoopUnroll][NFCI] Clean up remainder followup metadata handling (#165272)
Followup metadata for remainder loops is handled by two implementations,
both added by 7244852557ca6:

1. `tryToUnrollLoop` in `LoopUnrollPass.cpp`.
2. `CloneLoopBlocks` in `LoopUnrollRuntime.cpp`.

As far as I can tell, 2 is useless: I added `assert(!NewLoopID)` for the
`NewLoopID` returned by the `makeFollowupLoopID` call, and it never
fails throughout check-all for my build.

Moreover, if 2 were useful, it appears it would have a bug caused by
7cd826a321d9. That commit skips adding loop metadata to a new remainder
loop if the remainder loop itself is to be completely unrolled because
it will then no longer be a loop. However, that commit incorrectly
assumes that `UnrollRemainder` dictates complete unrolling of a
remainder loop, and thus it skips adding loop metadata even if the
remainder loop will be only partially unrolled.

To avoid further confusion here, this patch removes 2. check-all
continues to pass for my build. If 2 actually is useful, please advise
so we can create a test that covers that usage.

Near 2, this patch retains the `UnrollRemainder` guard on the
`setLoopAlreadyUnrolled` call, which adds `llvm.loop.unroll.disable` to
the remainder loop. That behavior exists both before and after
7cd826a321d9. The logic appears to be that remainder loop unrolling
(whether complete or partial) is opt-in. That is, unless
`UnrollRemainder` is true, `UnrollRuntimeLoopRemainder` skips running
remainder loop unrolling, and `llvm.loop.unroll.disable` suppresses any
later attempt at it.

This patch also extends testing of remainder loop followup metadata to
be sure remainder loop partial unrolling is handled correctly by 1.
2025-10-30 10:57:27 -04:00
paperchalice
249883d0c5
[test][Transforms] Remove unsafe-fp-math uses part 2 (NFC) (#164786)
Post cleanup for #164534.
2025-10-23 20:31:31 +08:00
Nikita Popov
573ca36753
[IR] Replace alignment argument with attribute on masked intrinsics (#163802)
The `masked.load`, `masked.store`, `masked.gather` and `masked.scatter`
intrinsics currently accept a separate alignment immarg. Replace this
with an `align` attribute on the pointer / vector of pointers argument.

This is the standard representation for alignment information on
intrinsics, and is already used by all other memory intrinsics. This
means the signatures now match llvm.expandload, llvm.vp.load, etc.
(Things like llvm.memcpy used to have a separate alignment argument as
well, but were already migrated a long time ago.)

It's worth noting that the masked.gather and masked.scatter intrinsics
previously accepted a zero alignment to indicate the ABI type alignment
of the element type. This special case is gone now: If the align
attribute is omitted, the implied alignment is 1, as usual. If ABI
alignment is desired, it needs to be explicitly emitted (which the
IRBuilder API already requires anyway).
2025-10-20 08:50:09 +00:00
Florian Hahn
2d027260b0
[SCEV] Collect guard info for ICMP NE w/o constants. (#160500)
When collecting information from loop guards, use UMax(1, %b - %a) for
ICMP NE %a, %b, if neither are constant.
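
The rewrite is sound because, under the guard `%a != %b`, the wrapping difference `%b - %a` can never be zero, so taking the unsigned maximum with 1 does not change its value. The sketch below checks this exhaustively on 8-bit values; the function name is an assumption for this example and nothing here uses the SCEV API.

```cpp
#include <algorithm>
#include <cstdint>

// For all i8 pairs, checks that A != B implies umax(1, B - A) == B - A under
// wrapping arithmetic, and that the difference is 0 exactly when A == B.
bool umaxIsNoOpUnderNeGuard() {
  for (unsigned A = 0; A < 256; ++A) {
    for (unsigned B = 0; B < 256; ++B) {
      const uint8_t Diff = static_cast<uint8_t>(B - A); // wraps modulo 2^8
      if (A != B && std::max<uint8_t>(1, Diff) != Diff)
        return false;
      if (A == B && Diff != 0)
        return false;
    }
  }
  return true;
}
```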

This improves results in some cases, and will be even more useful
together with
 * https://github.com/llvm/llvm-project/pull/160012
 * https://github.com/llvm/llvm-project/pull/159942

https://alive2.llvm.org/ce/z/YyBvoT

PR: https://github.com/llvm/llvm-project/pull/160500
2025-10-14 14:20:34 +00:00
Joel E. Denny
6d44b9082e
[LoopUnroll] Skip remainder loop guard if skip unrolled loop (#156549)
The original loop (OL) that serves as input to LoopUnroll has basic
blocks that are arranged as follows:

```
OLPreHeader
OLHeader <-.
...        |
OLLatch ---'
OLExit
```

In this depiction, every block has an implicit edge to the next block
below, so any explicit edge indicates a conditional branch.

Given OL and unroll count N, LoopUnroll sometimes creates an unrolled
loop (UL) with a remainder loop (RL) epilogue arranged like this:

```
,-- ULGuard
|   ULPreHeader
|   ULHeader <-.
|   ...        |
|   ULLatch ---'
|   ULExit
`-> RLGuard -----.
    RLPreHeader  |
,-> RLHeader     |
|   ...          |
`-- RLLatch      |
    RLExit       |
    OLExit <-----'
```

Each UL iteration executes N OL iterations, but each RL iteration
executes 1 OL iteration. ULGuard or RLGuard checks whether the first
iteration of UL or RL should execute, respectively. If so, ULLatch or
RLLatch checks whether to execute each subsequent iteration.

Once reached, OL always executes its first iteration but not necessarily
the next N-1 iterations. Thus, ULGuard is always required before the
first UL iteration. However, when control flows from ULGuard directly to
RLGuard, the first OL iteration has yet to execute, so RLGuard is then
redundant before the first RL iteration.

Thus, this patch makes the following changes:
- Adjust ULGuard to branch to RLPreHeader instead of RLGuard, thus
eliminating RLGuard's unnecessary branch instruction for that path.
- Eliminate the creation of RLGuard phi node poison values. Without this
patch, RLGuard has such a phi node for each value that is defined by any
OL iteration and used in OLExit. The poison value is required where
ULGuard is the predecessor. The poison value indicates that control flow
from ULGuard to RLGuard to Exit has no counterpart in OL because the
first OL iteration must execute either in UL or RL.
- Simplify the CFG by not splitting ULExit and RLGuard because, without
the ULGuard predecessor, the single block can now be a dedicated UL
exit.
- To RLPreHeader, add an `llvm.assume` call that asserts the RL trip
count is non-zero. Without this patch, RLPreHeader is reachable only
when RLGuard guarantees that assertion is true. With this patch, RLGuard
guarantees it only when RLGuard is the predecessor, and the OL structure
guarantees it when ULGuard is the predecessor. If RL itself is unrolled
later, this guarantee somehow prevents ScalarEvolution from giving up
when trying to compute a maximum trip count for RL. That maximum trip
count enables the branch instruction in the final unrolled instance of
RLLatch to be eliminated. Without the `llvm.assume` call, some existing
unroll tests start to fail because that instruction is not eliminated.

The original motivation for this patch is to facilitate later patches
that fix LoopUnroll's computation of branch weights so that they
maintain the block frequency of OL's body (see #135812). Specifically,
this patch ensures RLGuard's branch weights do not affect RL's
contribution to the block frequency of OL's body in the case that
ULGuard skips UL.
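
The patched control flow can be modeled with a counter-based sketch. It assumes the original loop runs `TripCount >= 1` iterations and uses simplified guard conditions, so it tracks only which blocks execute, not the real IR that LoopUnroll emits. `Counts` and `runUnrolled` are hypothetical names.

```cpp
#include <cstdint>

struct Counts {
  uint64_t BodyRuns = 0;      // OL iterations executed, in UL or RL
  uint64_t RLGuardChecks = 0; // number of times RLGuard was evaluated
};

// Models the patched layout for a loop that runs TripCount >= 1 iterations,
// unrolled by N >= 1 with a remainder-loop epilogue.
Counts runUnrolled(uint64_t TripCount, uint64_t N) {
  Counts C;
  uint64_t Remaining = TripCount;
  if (Remaining >= N) {       // ULGuard
    do {
      C.BodyRuns += N;        // one UL iteration covers N OL iterations
      Remaining -= N;
    } while (Remaining >= N); // ULLatch
    ++C.RLGuardChecks;        // RLGuard, now reached only from ULExit
    if (Remaining == 0)
      return C;               // straight to OLExit
  }
  // RLPreHeader: Remaining != 0 holds on both incoming paths, which is the
  // fact the added llvm.assume call records.
  do {
    ++C.BodyRuns;             // one RL iteration covers 1 OL iteration
    --Remaining;
  } while (Remaining != 0);   // RLLatch
  return C;
}
```

In this model, `runUnrolled(3, 4)` skips UL and never evaluates RLGuard, while `runUnrolled(10, 4)` evaluates it once after two UL iterations; both execute the body exactly `TripCount` times.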
2025-10-07 10:45:49 -04:00
Joel E. Denny
afb262855e
[LoopPeel] Fix branch weights' effect on block frequencies (#128785)
This patch implements the LoopPeel changes discussed in [[RFC] Fix Loop
Transformations to Preserve Block
Frequencies](https://discourse.llvm.org/t/rfc-fix-loop-transformations-to-preserve-block-frequencies/85785).

In summary, a loop's latch block can have branch weight metadata that
encodes an estimated trip count that is derived from application profile
data. Initially, the loop body's block frequencies agree with the
estimated trip count, as expected. However, sometimes loop
transformations adjust those branch weights in a way that correctly
maintains the estimated trip count but that corrupts the block
frequencies. This patch addresses that problem in LoopPeel, which it
changes to:

- Maintain branch weights consistently with the original loop for the
sake of preserving the total frequency of the original loop body.
- Store the new estimated trip count in the
`llvm.loop.estimated_trip_count` metadata, introduced by PR #148758.
2025-10-02 16:07:55 +00:00
Florian Hahn
3c4f611791
[LoopPeel] Add test with branch that can be simplified with guards.
Add test where a branch can be removed after peeling by applying info
from loop guards. It unfortunately requires running IndVars first, to
strengthen flags of the induction.
2025-09-24 11:51:55 +01:00
Florian Hahn
8693ef16f6
[SCEV] Add tests that benefit from rewriting SCEVAddExpr with guards.
Add additional tests benefiting from rewriting existing SCEVAddExprs with
guards.
2025-09-20 19:24:19 +01:00
Florian Hahn
3ea089ba19
[AArch64] Enable RT and partial unrolling with reductions for Apple CPUs. (#149699)
Update unrolling preferences for Apple Silicon CPUs to enable partial
unrolling and runtime unrolling for small loops with reductions.

This builds on top of unroller changes to introduce parallel reduction
phis, if possible: https://github.com/llvm/llvm-project/pull/149470.

PR: https://github.com/llvm/llvm-project/pull/149699
2025-09-09 13:23:30 +00:00
Florian Hahn
2d9e452ab0
[LoopUnroll] Introduce parallel reduction phis when unrolling. (#149470)
When partially or runtime unrolling loops with reductions, currently the
reductions are performed in-order in the loop, negating most benefits
from unrolling such loops.

This patch extends unrolling code-gen to keep a parallel reduction phi
per unrolled iteration and combine the final result after the loop.
For out-of-order CPUs, this allows executing multiple reduction chains
in parallel.
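
At the source level, the effect is roughly the difference between the two functions below. This is a hand-written sketch, not compiler output: the function names are hypothetical and the unroll factor of 4 is picked to match the cap mentioned in this message. Unsigned integer addition wraps, so regrouping the adds cannot change the result.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// In-order reduction: every add depends on the previous one.
uint32_t sumSerial(const std::vector<uint32_t> &V) {
  uint32_t Sum = 0;
  for (uint32_t X : V)
    Sum += X;
  return Sum;
}

// Unrolled by 4 with one accumulator per unrolled iteration. The four chains
// are independent and are combined once after the loop.
uint32_t sumParallel4(const std::vector<uint32_t> &V) {
  uint32_t S0 = 0, S1 = 0, S2 = 0, S3 = 0;
  std::size_t I = 0;
  for (; I + 4 <= V.size(); I += 4) {
    S0 += V[I];
    S1 += V[I + 1];
    S2 += V[I + 2];
    S3 += V[I + 3];
  }
  uint32_t Sum = (S0 + S1) + (S2 + S3);
  for (; I < V.size(); ++I) // remainder iterations
    Sum += V[I];
  return Sum;
}
```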

For now, the initial transformation is restricted to cases where we
unroll a small number of iterations (hard-coded to 4, but should maybe
be capped by TTI depending on the execution units), to avoid introducing
an excessive amount of parallel phis.

It also requires single block loops for now, where the unrolled
iterations are known to not exit the loop (either due to runtime
unrolling or partial unrolling). This ensures that the unrolled loop
will have a single basic block, with a single exit block where we can
place the final reduction value computation.

The initial implementation also only supports parallelizing loops with a
single reduction and only integer reductions. Those restrictions are
just to keep the initial implementation simpler, and can easily be
lifted as follow-ups.

With corresponding TTI changes to the AArch64 unrolling preferences,
which I will also share soon, this triggers in ~300 loops across a wide
range of workloads, including LLVM itself, ffmpeg, av1aom, sqlite,
blender, brotli, zstd and more.

PR: https://github.com/llvm/llvm-project/pull/149470
2025-09-04 20:54:09 +01:00
Ryotaro Kasuga
2330fd2f73
[LoopPeel] Add new option to peeling loops to convert PHI into IV (#121104)
LoopPeel currently considers PHI nodes that become loop invariants
through peeling. However, in some cases, peeling transforms PHI nodes
into induction variables (IVs), potentially enabling further
optimizations such as loop vectorization. For example:

```c
// TSVC s292
int im = N-1;
for (int i=0; i<N; i++) {
  a[i] = b[i] + b[im];
  im = i;
}
```

In this case, peeling one iteration converts `im` into an IV, allowing
it to be handled by the loop vectorizer.
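
Peeling by hand shows why `im` becomes an induction variable. The sketch below is a source-level illustration only, with hypothetical function names and `std::vector` in place of the TSVC globals; both functions assume `A` is at least as large as `B`.

```cpp
#include <vector>

// Original shape of the loop: `Im` is N-1 on the first iteration and I-1 on
// every later one, so it is a PHI rather than an induction variable.
void s292(std::vector<float> &A, const std::vector<float> &B) {
  const int N = static_cast<int>(B.size());
  int Im = N - 1;
  for (int I = 0; I < N; ++I) {
    A[I] = B[I] + B[Im];
    Im = I;
  }
}

// Hand-peeled shape: once the first iteration is split off, `Im` is always
// I - 1 inside the loop, which is an ordinary induction variable.
void s292Peeled(std::vector<float> &A, const std::vector<float> &B) {
  const int N = static_cast<int>(B.size());
  if (N == 0)
    return;
  A[0] = B[0] + B[N - 1]; // peeled first iteration
  for (int I = 1; I < N; ++I)
    A[I] = B[I] + B[I - 1];
}
```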

This patch adds a new feature that peels loops when doing so converts
PHIs into IVs. At the moment this feature is disabled by default.

Enabling it allows the above example to be vectorized. I have measured on
neoverse-v2 and observed a speedup of more than 60% (options: `-O3
-ffast-math -mcpu=neoverse-v2 -mllvm -enable-peeling-for-iv`).

This PR is taken over from #94900
Related #81851
2025-08-20 13:44:56 +00:00