llvm-project

Author	SHA1	Message	Date
Justin Fargnoli	cb32b8bffb	[LoopUnrollPass] Don't pre-set `UP.Count` before legality checks in `computeUnrollCount()` (#185979 ) We currently set `UP.Count` to `TripCount` and `MaxTripCount` prior to full and upper bound unrolling, respectively. This was likely done to ensure that calls to `UCE.getUnrolledLoopSize(UP)` use the appropriate trip count. However, we can use `UCE.getUnrolledLoopSize(UP, FullUnrollTripCount)` instead. To prevent unintentional unrolling, we set `UP.Count = 0` when early-exiting `computeUnrollCount()`. (Note: this does not occur [here](`eb687fb106/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp (L1190-L1198)`). This seems like a bug.) We only perform early exits when evaluating runtime unrolling. At that point, [we know `TripCount` is false](`3fb31e7b06/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp (L1157-L1158)`), and thus we could not have leaked `TripCount`. However, we [could've leaked `MaxTripCount`](`eb687fb106/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp (L1102-L1110)`). It seems like: `eb687fb106/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp (L1181-L1188)` was supposed to handle this case. However: - It uses `<` instead of `<=`. This breaks the existing convention [[1]](`eb687fb106/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp (L869)`) [[2]](`eb687fb106/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp (L1103)`) for how `UP.MaxUpperBound` is treated. - It's ignored when a target sets `UP.Force = true`. Thus: - When `UP.Force == false`, we leak `MaxTripCount` into runtime unrolling when `MaxTripCount && (UP.UpperBound || MaxOrZero) && MaxTripCount == UP.MaxUpperBound` - When `UP.Force == true`, we leak `MaxTripCount` into runtime unrolling when `MaxTripCount && (UP.UpperBound || MaxOrZero) && MaxTripCount <= UP.MaxUpperBound`.
This PR: - Uses `UCE.getUnrolledLoopSize(UP, FullUnrollTripCount)` - Stops setting `TripCount` and `MaxTripCount` prior to calling `shouldFullUnroll()` - Removes the `UP.Count = 0` safeguards - Swaps `<` with `<=`, to address the `UP.Force == false` case - Adds a test to document the behavior change (no longer leaking `MaxTripCount`) in the `UP.Force == true` case.	2026-03-31 19:50:52 +00:00
Justin Fargnoli	a2891ff85c	Reapply "[LoopUnroll] Remove computeUnrollCount()'s return value" (#187104 ) Address https://github.com/llvm/llvm-project/pull/184529#issuecomment-4074393657 by checking the loop's metadata prior to unrolling.	2026-03-18 17:55:56 +00:00
Adel Ejjeh	49250284cf	[AMDGPU][LoopUnroll] Enable allowexpensivetripcount for amdgpu when unroll pragmas are present (#181241 ) This PR is intended as an AMDGPU-specific solution for #181267 while discussions on changing the default behavior for all targets continue in that PR. Problem: Loops with an explicit unroll pragma (#pragma unroll / #pragma clang loop unroll(enable)) that have an expensive runtime trip count currently don't get unrolled because UP.AllowExpensiveTripCount defaults to false. The pragma is silently ignored. This is not the case when an unroll factor is specified (PragmaCount > 0), where the pass sets UP.AllowExpensiveTripCount = true. Solution: Set UP.AllowExpensiveTripCount to true for loops that have an unroll pragma in the AMDGPU TTI implementation. I've added a new lit test expensive-tripcount.ll that verifies pragma-driven unrolling with expensive trip counts will work as expected. The change showed no meaningful regressions across a few different workloads from Composable Kernels (CK) and llama.cpp as well as PyTorch kernels on AMDGPU gfx950. Additionally, the change improves the performance of PyTorch reduction loops on AMDGPU targets.	2026-03-13 09:56:20 -07:00
Justin Fargnoli	40fca74d80	[LoopUnrollPass] Trace loop unroll count heuristics with `LLVM_DEBUG` (#182981 )	2026-03-12 16:46:56 +00:00
Nikita Popov	6ecbc0c96e	[InstCombine] Canonicalize GEP source element types (#180745 ) Canonicalize GEP source element types from `%T` to `[sizeof(%T) x i8]`. This is intended to flush out any remaining places that rely on GEP element types, as part of the `ptradd` migration. The impact of this change is expected to be fairly minimal (we might enable a few more hoist/sink style folds that depend on equal GEP types).	2026-03-06 14:48:01 +00:00
Justin Fargnoli	f0265ccb60	[LoopUnroll] Ensure we can accept both `llvm.loop.unroll.full` and `llvm.loop.unroll.enable` metadata on the same loop (NFC) (#182381 ) Ensure that frontends can request both `PragmaEnable` and `PragmaFull` semantics on a loop. FYI: to the best of my knowledge, it's not possible to toggle both `PragmaEnable` and `PragmaFull` via `clang`.	2026-03-05 16:44:25 +00:00
theRonShark	8f9c926868	Revert "AMDGPU: Fix runtime unrolling when cascaded GEPs present (#147700)" (#183641 ) slows down llama.cpp This reverts commit cff4a00d3f7d91c0dd3a93eb81db66be178273d3.	2026-02-26 21:59:04 -05:00
Justin Fargnoli	f9e0a999bc	[LoopUnrollPass] Fix capitalization in `LLVM_DEBUG` (#182429 ) Carve out changes to existing debug messages from #178476.	2026-02-24 08:40:07 -08:00
Justin Fargnoli	413cafa462	[LoopUnrollPass] Remove redundant debug message in `tryToUnrollLoop()` (#181954 ) Remove the redundant debug message. While we're here, adopt the same debug message language that's used in #178476 and use an `if` instead of a single `case` `switch` statement.	2026-02-19 20:02:10 +00:00
Kunqiu Chen	85e07bad93	[InstructionSimplify] Extend simplifyICmpWithZero to handle equivalent zero RHS (#179055 ) Add a new helper function `matchEquivZeroRHS()` that recognizes comparisons with constants that are equivalent to comparisons with zero, and transforms the predicate accordingly. This handles the following transformations: - icmp sgt X, -1 --> icmp sge X, 0 - icmp sle X, -1 --> icmp slt X, 0 - icmp [us]ge X, 1 --> icmp [us]gt X, 0 - icmp [us]lt X, 1 --> icmp [us]le X, 0 This enables more optimization opportunities in `simplifyICmpWithZero`, such as folding icmp sgt X, -1 when X is known to be non-negative. --- - IR Impact: https://github.com/dtcxzyw/llvm-opt-benchmark/pull/3414	2026-02-13 00:06:32 +08:00
Justin Fargnoli	45412b6790	[LoopUnrollPass] Indent `LLVM_DEBUG()` messages based on our depth in the `tryToUnrollLoop()` call graph (#178945 ) Unify the ad-hoc use of whitespace in `LLVM_DEBUG()` messages. This approach should also make it easier to see which loop debug messages correspond to and which part of the loop unrolling heuristics each message corresponds to.	2026-02-11 17:59:42 +00:00
Andreas Jonson	faa4b97b10	[InstCombine] fold icmp ne (and X, 1), 0 --> trunc X to i1 (#178977 ) Remove vector check so this fold always is done. proof: https://alive2.llvm.org/ce/z/oabD6J closes #172888	2026-02-03 19:14:27 +01:00
Justin Fargnoli	7889f729ac	[LoopUnroll] Remove preceding whitespace in loop peeling optimization remark (#178951 )	2026-02-02 09:32:57 -08:00
Marek Sedláček	fa675aabc7	[NFC][LoopUnroll] Add `-unroll-runtime-other-exit-predictable=false` to `unroll-multi-exit-loop-heuristics.ll` (#179198 ) Adds `-unroll-runtime-other-exit-predictable=false` option to `unroll-multi-exit-loop-heuristics.ll` test for stability reasons. This is a followup to a discussion in #164799 and a similar patch https://reviews.llvm.org/D98098. Since this option is false by default, this is an NFC.	2026-02-02 11:11:41 -05:00
Marek Sedláček	362c39d36d	[LoopUnroll] Use branch probability in multi-exit loop unrolling (#164799 ) This patch improves multi-exit loop unrolling by taking branch probability into account, rather than only checking whether the other exit is a deoptimizing one. This implementation uses branch metadata directly because of the unstable state of BPI in this part of the code (runtime unrolling invalidates the state of the map, and using BPI in my tests caused errors). If branch probability metadata is not present, the current deopt heuristic is still used. --------- Co-authored-by: Marek Sedlacek <msedlacek@azul.com>	2026-01-28 11:12:34 -05:00
Joel E. Denny	b1698d3ac0	[LoopUnroll][NFC] Simplify recent block frequency tests (#177025 ) Refactor a number of recent tests in `llvm/test/Transforms/LoopUnroll/branch-weights-freq` to make it easier to understand and extend them. The changes mostly resemble the refactoring I recently did in PR #165635 in response to reviewer comments: - For each case (e.g., each `-unroll-count` value in `unroll-epilog.ll`), group all FileCheck directives together. That way, while digesting a single case, the reader does not need to sift through all other cases and a complex FileCheck prefix scheme. - Reduce CFG testing. Drop many FileCheck directives that check for all basic block labels and branches, and drop the cryptic `-implicit-check-not` that excludes others. Instead, just use positive checks for every loop body (represented by `call void @f`), for relevant metadata, and for the branch instructions to which the metadata is attached, and use simple negative checks (e.g., `-implicit-check-not='!prof'`) to be sure we have not missed any. - Better document what the test intends to cover. The result is sometimes longer tests due to comments and repetition, but I believe they are easier to maintain this way.	2026-01-21 10:40:10 -05:00
Joel E. Denny	565591d8a4	[LoopUnroll] Do not copy !llvm.loop from latch to non-latch (#165635 ) When LoopUnroll copies the original loop's latch to the corresponding non-latch branch in an unrolled iteration, any `!llvm.loop` is copied along with it, but `!llvm.loop` is useless and misleading there. This patch discards it. e06831a3b29d did the same for LoopPeel.	2026-01-14 10:40:31 -05:00
Graham Hunter	2abd6d6d7a	[LV] Vectorize conditional scalar assignments (#158088 ) Based on Michael Maitland's previous work: https://github.com/llvm/llvm-project/pull/121222 This PR uses the existing recurrences code instead of introducing a new pass just for CSA autovec. I've also made recipes that are more generic.	2026-01-14 14:59:18 +00:00
Mingjie Xu	fac9472593	[IR] Reland Optimize PHINode::removeIncomingValue() and PHINode::removeIncomingValueIf() to use the swapping strategy. (#174274 ) Reland #171963, #172639 and #173444, which were reverted in 86b9f90b9574b3a7d15d28a91f6316459dcfa046 because they introduced non-determinism in compiles. The non-determinism has been fixed in 9b8addffa70cee5b2acc5454712d9cf78ce45710.	2026-01-04 09:24:53 +08:00
Walter Lee	86b9f90b95	Revert 159f1c048e08a8780d92858cfc80e723c90235e3 (#173893 ) This causes non-determinism in compiles. From nikic: "FYI the non-determinism is also visible on llvm-opt-benchmark. Maybe repeatedly running test cases from `299446d99f` could reproduce the issue..." Also revert dependent 796fafeff92fe5d2d20594859e92607116e30a16 and e135447bda617125688b71d33480d131d1076a72.	2025-12-29 20:23:13 -05:00
Mingjie Xu	159f1c048e	[IR] Optimize PHINode::removeIncomingValue() by swapping removed incoming value with the last incoming value. (#171963 ) Current implementation uses `std::copy` to shift all incoming values after the removed index. This patch optimizes `PHINode::removeIncomingValue()` by replacing the linear shift of incoming values with a swap-with-last strategy. After this change, the relative order of incoming values after removal is not preserved. This improves compile-time for PHI nodes with many predecessors. Depends: https://github.com/llvm/llvm-project/pull/171955 https://github.com/llvm/llvm-project/pull/171956 https://github.com/llvm/llvm-project/pull/171960 https://github.com/llvm/llvm-project/pull/171962	2025-12-17 19:44:01 +08:00
Craig Topper	ef21740781	[LoopPeel] Check for onlyAccessesInaccessibleMemory instead of llvm.assume in peelToTurnInvariantLoadsDereferenceable. (#171910 ) onlyAccessesInaccessibleMemory can't alias with a load. This allows us to ignore more intrinsics than llvm.assume. Follow up from #171547	2025-12-12 10:45:41 -08:00
Craig Topper	ccc3835ffa	[LoopPeel] Ignore assume intrinsics for the mayWriteToMemory check in peelToTurnInvariantLoadsDereferenceable. (#171547 ) llvm.assume intrinsics have the mayWriteToMemory property, but won't prevent the load from becoming dereferenceable.	2025-12-10 13:14:19 -08:00
Pengcheng Wang	a0b6638c85	[RISCV] Don't unroll vectorized loops with vector operands (#171089 ) We disabled unrolling for vectorized loops in #151525, but that PR only checked the instruction type. For some loops, there is no instruction with vector type but they are still vector operations (just like the memset zero test in the precommit test). Here we check the operands as well to cover these cases.	2025-12-09 12:42:41 +08:00
Pengcheng Wang	893479adcc	[RISCV] Precommit test for unrolling loops with vector operands	2025-12-09 11:51:33 +08:00
Florian Hahn	7470d721c6	[AArch64] Add isAppleMLike helper to check for M cores and aligned CPUs. (#170553 ) Add a new isAppleMLike helper, that returns true if the core is part of the Apple M core family or Apple A14 or later. Used to apply cost decisions consistently to those groups of cores. The function is now a single place to update when new cores are added. It also makes sure we apply unrolling decisions for newer Apple cores to Apple A17. PR: https://github.com/llvm/llvm-project/pull/170553	2025-12-05 20:05:29 +00:00
Florian Hahn	c5e6f4e99d	[AArch64] Add unrolling test with -mcpu=apple-a17. Currently Apple unrolling preferences are not applied to apple-a17.	2025-12-03 20:15:58 +00:00
Philip Reames	c752bb9203	[IndVars] Strengthen inference of samesign flags (#170363 ) When reviewing another change, I noticed that we were failing to infer samesign for two cases: 1) an unsigned comparison, and 2) when both arguments were known negative. Using CVP and InstCombine as a reference, we need to be careful to not allow eq/ne comparisons. I'm a bit unclear on the why of that, and for now am going with the low risk change. I may return to investigate that in a follow up. Compile time results look like noise to me, see: https://llvm-compile-time-tracker.com/compare.php?from=49a978712893fcf9e5f40ac488315d029cf15d3d&to=2ddb263604fd7d538e09dc1f805ebc30eb3ffab0&stat=instructions:u	2025-12-03 16:16:22 +00:00
Philip Reames	49a9787128	[SCEV] Regenerate a subset of auto updated tests Reducing spurious diff in an upcoming change.	2025-12-02 12:16:53 -08:00
Julian Nagele	b641509637	[LoopUnroll] Introduce parallel accumulators when unrolling FP reductions. (#166630 ) This is building on top of https://github.com/llvm/llvm-project/pull/149470, also introducing parallel accumulator PHIs when the reduction is for floating points, provided we have the reassoc flag. See also https://github.com/llvm/llvm-project/pull/166353, which aims to introduce parallel accumulators for reductions with vector instructions.	2025-11-27 15:03:36 +00:00
Julian Nagele	c73de9777e	[IVDesciptors] Support detecting reductions with vector instructions. (#166353 ) In combination with https://github.com/llvm/llvm-project/pull/149470 this will introduce parallel accumulators when unrolling reductions with vector instructions. See also https://github.com/llvm/llvm-project/pull/166630, which aims to introduce parallel accumulators for FP reductions.	2025-11-24 11:12:06 +00:00
Joel E. Denny	21fedcbf89	[LoopPeel] Fix BFI when peeling last iteration without guard (#168250 ) LoopPeel sometimes proves that, when reached, the original loop always executes at least two iterations. LoopPeel then unconditionally executes both the remaining loop's initial iteration and the peeled final iteration. But that increases the latter's frequency above its frequency in the original loop. To maintain the total frequency, this patch compensates by decreasing the remaining loop's latch probability. This is another step in issue #135812 and was discussed at <https://github.com/llvm/llvm-project/pull/166858#discussion_r2528968542>.	2025-11-20 10:45:53 -05:00
Vladi Krapp	42a1184e42	[AArch64] Allow forcing unrolling of small loops (#167488 ) - Introduce the -aarch64-force-unroll-threshold option; when a loop’s cost is below this value we set UP.Force = true (default 0 keeps current behaviour) - Add an AArch64 loop-unroll regression test that runs once at the default threshold and once with the flag raised, confirming forced unrolling	2025-11-17 08:59:44 +00:00
Mircea Trofin	358e9a56af	[LP] Assign weights when peeling last iteration. (#166858 )	2025-11-15 10:01:04 -08:00
Joel E. Denny	1aa86ca521	[LoopUnroll] Fix division by zero (#166258 ) PR #159163's probability computation for epilogue loops does not handle the possibility of an original loop probability of one. Runtime loop unrolling does not make sense for such an infinite loop, and a division by zero results. This patch works around that case. Issue #165998.	2025-11-04 12:49:33 -05:00
Ivan Kelarev	37825ad4f6	[LoopUnroll] Prevent LoopFullUnrollPass from performing partial unrolling when trip counts are unknown (#165013 ) Currently, `LoopFullUnrollPass` incorrectly performs partial unrolling when `#pragma unroll` is specified and both `TripCount` and `MaxTripCount` are unknown. This patch adds a check to prevent partial unrolling when `OnlyFullUnroll` parameter is true and both trip count values are zero.	2025-11-04 09:20:01 -08:00
Joel E. Denny	bb9bd5f263	[LoopUnroll] Fix assert fail on zeroed branch weights (#165938 ) BranchProbability fails an assert when its denominator is zero. Reported at <https://github.com/llvm/llvm-project/pull/159163#pullrequestreview-3406318423>.	2025-11-03 10:19:12 -05:00
Joel E. Denny	cc8ff73fba	[LoopUnroll] Fix block frequencies for epilogue (#159163 ) As another step in issue #135812, this patch fixes block frequencies for partial loop unrolling with an epilogue remainder loop. It does not fully handle the case when the epilogue loop itself is unrolled. That will be handled in the next patch. For the guard and latch of each of the unrolled loop and epilogue loop, this patch sets branch weights derived directly from the original loop latch branch weights. The total frequency of the original loop body, summed across all its occurrences in the unrolled loop and epilogue loop, is the same as in the original loop. This patch also sets `llvm.loop.estimated_trip_count` for the epilogue loop instead of relying on the epilogue's latch branch weights to imply it. This patch fixes branch weights in tests that PR #157754 adversely affected.	2025-10-31 11:01:42 -04:00
Joel E. Denny	24557cce40	[LoopUnroll] Fix block frequencies when no runtime (#157754 ) This patch implements the LoopUnroll changes discussed in [[RFC] Fix Loop Transformations to Preserve Block Frequencies](https://discourse.llvm.org/t/rfc-fix-loop-transformations-to-preserve-block-frequencies/85785) and is thus another step in addressing issue #135812. In summary, for the case of partial loop unrolling without a remainder loop, this patch changes LoopUnroll to: - Maintain branch weights consistently with the original loop for the sake of preserving the total frequency of the original loop body. - Store the new estimated trip count in the `llvm.loop.estimated_trip_count` metadata, introduced by PR #148758. - Correct the new estimated trip count (e.g., 3 instead of 2) when the original estimated trip count (e.g., 10) divided by the unroll count (e.g., 4) leaves a remainder (e.g., 2). There are loop unrolling cases this patch does not fully fix, such as partial unrolling with a remainder loop and complete unrolling, and there are two associated tests whose branch weights this patch adversely affects. They will be addressed in future patches that should land with this patch.	2025-10-31 10:44:27 -04:00
Joel E. Denny	8d186e2195	[LoopUnroll][NFCI] Clean up remainder followup metadata handling (#165272 ) Followup metadata for remainder loops is handled by two implementations, both added by 7244852557ca6: 1. `tryToUnrollLoop` in `LoopUnrollPass.cpp`. 2. `CloneLoopBlocks` in `LoopUnrollRuntime.cpp`. As far as I can tell, 2 is useless: I added `assert(!NewLoopID)` for the `NewLoopID` returned by the `makeFollowupLoopID` call, and it never fails throughout check-all for my build. Moreover, if 2 were useful, it appears it would have a bug caused by 7cd826a321d9. That commit skips adding loop metadata to a new remainder loop if the remainder loop itself is to be completely unrolled because it will then no longer be a loop. However, that commit incorrectly assumes that `UnrollRemainder` dictates complete unrolling of a remainder loop, and thus it skips adding loop metadata even if the remainder loop will be only partially unrolled. To avoid further confusion here, this patch removes 2. check-all continues to pass for my build. If 2 actually is useful, please advise so we can create a test that covers that usage. Near 2, this patch retains the `UnrollRemainder` guard on the `setLoopAlreadyUnrolled` call, which adds `llvm.loop.unroll.disable` to the remainder loop. That behavior exists both before and after 7cd826a321d9. The logic appears to be that remainder loop unrolling (whether complete or partial) is opt-in. That is, unless `UnrollRemainder` is true, `UnrollRuntimeLoopRemainder` skips running remainder loop unrolling, and `llvm.loop.unroll.disable` suppresses any later attempt at it. This patch also extends testing of remainder loop followup metadata to be sure remainder loop partial unrolling is handled correctly by 1.	2025-10-30 10:57:27 -04:00
paperchalice	249883d0c5	[test][Transforms] Remove unsafe-fp-math uses part 2 (NFC) (#164786 ) Post cleanup for #164534.	2025-10-23 20:31:31 +08:00
Nikita Popov	573ca36753	[IR] Replace alignment argument with attribute on masked intrinsics (#163802 ) The `masked.load`, `masked.store`, `masked.gather` and `masked.scatter` intrinsics currently accept a separate alignment immarg. Replace this with an `align` attribute on the pointer / vector of pointers argument. This is the standard representation for alignment information on intrinsics, and is already used by all other memory intrinsics. This means the signatures now match llvm.expandload, llvm.vp.load, etc. (Things like llvm.memcpy used to have a separate alignment argument as well, but were already migrated a long time ago.) It's worth noting that the masked.gather and masked.scatter intrinsics previously accepted a zero alignment to indicate the ABI type alignment of the element type. This special case is gone now: If the align attribute is omitted, the implied alignment is 1, as usual. If ABI alignment is desired, it needs to be explicitly emitted (which the IRBuilder API already requires anyway).	2025-10-20 08:50:09 +00:00
Florian Hahn	2d027260b0	[SCEV] Collect guard info for ICMP NE w/o constants. (#160500 ) When collecting information from loop guards, use UMax(1, %b - %a) for ICMP NE %a, %b, if neither are constant. This improves results in some cases, and will be even more useful together with * https://github.com/llvm/llvm-project/pull/160012 * https://github.com/llvm/llvm-project/pull/159942 https://alive2.llvm.org/ce/z/YyBvoT PR: https://github.com/llvm/llvm-project/pull/160500	2025-10-14 14:20:34 +00:00
Joel E. Denny	6d44b9082e	[LoopUnroll] Skip remainder loop guard if skip unrolled loop (#156549 ) The original loop (OL) that serves as input to LoopUnroll has basic blocks that are arranged as follows:
```
OLPreHeader
OLHeader <-.
...        |
OLLatch ---'
OLExit
```
In this depiction, every block has an implicit edge to the next block below, so any explicit edge indicates a conditional branch. Given OL and unroll count N, LoopUnroll sometimes creates an unrolled loop (UL) with a remainder loop (RL) epilogue arranged like this:
```
,-- ULGuard
|   ULPreHeader
|   ULHeader <-.
|   ...        |
|   ULLatch ---'
|   ULExit
`-> RLGuard -----.
    RLPreHeader  |
,-> RLHeader     |
|   ...          |
`-- RLLatch      |
    RLExit       |
    OLExit <-----'
```
Each UL iteration executes N OL iterations, but each RL iteration executes 1 OL iteration. ULGuard or RLGuard checks whether the first iteration of UL or RL should execute, respectively. If so, ULLatch or RLLatch checks whether to execute each subsequent iteration. Once reached, OL always executes its first iteration but not necessarily the next N-1 iterations. Thus, ULGuard is always required before the first UL iteration. However, when control flows from ULGuard directly to RLGuard, the first OL iteration has yet to execute, so RLGuard is then redundant before the first RL iteration. Thus, this patch makes the following changes:
- Adjust ULGuard to branch to RLPreHeader instead of RLGuard, thus eliminating RLGuard's unnecessary branch instruction for that path.
- Eliminate the creation of RLGuard phi node poison values. Without this patch, RLGuard has such a phi node for each value that is defined by any OL iteration and used in OLExit. The poison value is required where ULGuard is the predecessor. The poison value indicates that control flow from ULGuard to RLGuard to Exit has no counterpart in OL because the first OL iteration must execute either in UL or RL.
- Simplify the CFG by not splitting ULExit and RLGuard because, without the ULGuard predecessor, the single block can now be a dedicated UL exit.
- To RLPreHeader, add an `llvm.assume` call that asserts the RL trip count is non-zero. Without this patch, RLPreHeader is reachable only when RLGuard guarantees that assertion is true. With this patch, RLGuard guarantees it only when RLGuard is the predecessor, and the OL structure guarantees it when ULGuard is the predecessor. If RL itself is unrolled later, this guarantee somehow prevents ScalarEvolution from giving up when trying to compute a maximum trip count for RL. That maximum trip count enables the branch instruction in the final unrolled instance of RLLatch to be eliminated. Without the `llvm.assume` call, some existing unroll tests start to fail because that instruction is not eliminated.
The original motivation for this patch is to facilitate later patches that fix LoopUnroll's computation of branch weights so that they maintain the block frequency of OL's body (see #135812). Specifically, this patch ensures RLGuard's branch weights do not affect RL's contribution to the block frequency of OL's body in the case that ULGuard skips UL.	2025-10-07 10:45:49 -04:00
Joel E. Denny	afb262855e	[LoopPeel] Fix branch weights' effect on block frequencies (#128785 ) This patch implements the LoopPeel changes discussed in [[RFC] Fix Loop Transformations to Preserve Block Frequencies](https://discourse.llvm.org/t/rfc-fix-loop-transformations-to-preserve-block-frequencies/85785). In summary, a loop's latch block can have branch weight metadata that encodes an estimated trip count that is derived from application profile data. Initially, the loop body's block frequencies agree with the estimated trip count, as expected. However, sometimes loop transformations adjust those branch weights in a way that correctly maintains the estimated trip count but that corrupts the block frequencies. This patch addresses that problem in LoopPeel, which it changes to: - Maintain branch weights consistently with the original loop for the sake of preserving the total frequency of the original loop body. - Store the new estimated trip count in the `llvm.loop.estimated_trip_count` metadata, introduced by PR #148758.	2025-10-02 16:07:55 +00:00
Florian Hahn	3c4f611791	[LoopPeel] Add test with branch that can be simplified with guards. Add test where a branch can be removed after peeling by applying info from loop guards. It unfortunately requires running IndVars first, to strengthen flags of the induction.	2025-09-24 11:51:55 +01:00
Florian Hahn	8693ef16f6	[SCEV] Add tests that benefit from rewriting SCEVAddExpr with guards. Add additional tests benefiting from rewriting existing SCEVAddExprs with guards.	2025-09-20 19:24:19 +01:00
Florian Hahn	3ea089ba19	[AArch64] Enable RT and partial unrolling with reductions for Apple CPUs. (#149699 ) Update unrolling preferences for Apple Silicon CPUs to enable partial unrolling and runtime unrolling for small loops with reductions. This builds on top of unroller changes to introduce parallel reduction phis, if possible: https://github.com/llvm/llvm-project/pull/149470. PR: https://github.com/llvm/llvm-project/pull/149699	2025-09-09 13:23:30 +00:00
Florian Hahn	2d9e452ab0	[LoopUnroll] Introduce parallel reduction phis when unrolling. (#149470 ) When partially or runtime unrolling loops with reductions, currently the reductions are performed in-order in the loop, negating most benefits from unrolling such loops. This patch extends unrolling code-gen to keep a parallel reduction phi per unrolled iteration and combine the final result after the loop. For out-of-order CPUs, this allows executing multiple reduction chains in parallel. For now, the initial transformation is restricted to cases where we unroll a small number of iterations (hard-coded to 4, but should maybe be capped by TTI depending on the execution units), to avoid introducing an excessive amount of parallel phis. It also requires single block loops for now, where the unrolled iterations are known to not exit the loop (either due to runtime unrolling or partial unrolling). This ensures that the unrolled loop will have a single basic block, with a single exit block where we can place the final reduction value computation. The initial implementation also only supports parallelizing loops with a single reduction and only integer reductions. Those restrictions are just to keep the initial implementation simpler, and can easily be lifted as follow-ups. With corresponding TTI changes to the AArch64 unrolling preferences, which I will also share soon, this triggers in ~300 loops across a wide range of workloads, including LLVM itself, ffmpeg, av1aom, sqlite, blender, brotli, zstd and more. PR: https://github.com/llvm/llvm-project/pull/149470	2025-09-04 20:54:09 +01:00
Ryotaro Kasuga	2330fd2f73	[LoopPeel] Add new option to peeling loops to convert PHI into IV (#121104 ) LoopPeel currently considers PHI nodes that become loop invariants through peeling. However, in some cases, peeling transforms PHI nodes into induction variables (IVs), potentially enabling further optimizations such as loop vectorization. For example: ```c /* TSVC s292 */ int im = N-1; for (int i=0; i<N; i++) { a[i] = b[i] + b[im]; im = i; } ``` In this case, peeling one iteration converts `im` into an IV, allowing it to be handled by the loop vectorizer. This patch adds a new feature that peels loops in order to convert PHIs into IVs. At the moment this feature is disabled by default. Enabling it allows the above example to be vectorized. I have measured on neoverse-v2 and observed a speedup of more than 60% (options: `-O3 -ffast-math -mcpu=neoverse-v2 -mllvm -enable-peeling-for-iv`). This PR is taken over from #94900. Related: #81851	2025-08-20 13:44:56 +00:00
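
For readers unfamiliar with the transformation described in the last entry (2330fd2f73), the following is a hand-written C++ sketch of what peeling one iteration of the TSVC s292 loop achieves at the source level. It is not the LoopPeel implementation, which operates on LLVM IR; the function names `originalLoop`, `peeledLoop`, and `checkPeelingEquivalence` are hypothetical and exist only for this illustration. After the first iteration is peeled, `im` always equals `i - 1`, so it can be expressed as an induction variable.

```cpp
#include <cassert>
#include <vector>

// Original shape of the loop (TSVC s292). `im` is a loop-carried value:
// it starts at N-1 and then trails `i` by one iteration, so it is a PHI
// that is not an induction variable.
std::vector<int> originalLoop(const std::vector<int>& b) {
    const int n = static_cast<int>(b.size());
    std::vector<int> a(n);
    int im = n - 1;
    for (int i = 0; i < n; ++i) {
        a[i] = b[i] + b[im];
        im = i;
    }
    return a;
}

// Hand-peeled equivalent (illustrative only). The first iteration is
// executed separately; in the remaining loop `im` is always `i - 1`,
// i.e. a plain induction expression.
std::vector<int> peeledLoop(const std::vector<int>& b) {
    const int n = static_cast<int>(b.size());
    std::vector<int> a(n);
    if (n > 0) {
        a[0] = b[0] + b[n - 1];  // peeled iteration: im == N-1
    }
    for (int i = 1; i < n; ++i) {
        a[i] = b[i] + b[i - 1];  // im == i-1 for every remaining iteration
    }
    return a;
}

// Asserts that both versions compute the same result for a given input.
void checkPeelingEquivalence(const std::vector<int>& b) {
    assert(originalLoop(b) == peeledLoop(b));
}
```

Calling `checkPeelingEquivalence` on a few inputs (including an empty vector) is a quick way to convince yourself that the two forms agree. Whether the remaining loop is then actually vectorized depends on the compiler and flags in use.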

712 Commits