When LoopUnroll copies the original loop's latch to the corresponding
non-latch branch in an unrolled iteration, any `!llvm.loop` is copied
along with it, but `!llvm.loop` is useless and misleading there. This
patch discards it.
e06831a3b29d did the same for LoopPeel.
Based on Michael Maitland's previous work:
https://github.com/llvm/llvm-project/pull/121222
This PR uses the existing recurrences code instead of introducing a
new pass just for CSA autovec. I've also made recipes that are more
generic.
Add support for vectorizing loops that select the index of the minimum
or maximum element. The patch implements vectorizing those patterns by
combining Min/Max and FindFirstIV reductions.
It extends matching Min/Max reductions to allow in-loop users that are
FindLastIV reductions. It records a flag indicating that the Min/Max
reduction is used by another reduction. The extra user is then check as
part of the new `handleMultiUseReductions` VPlan transformation.
It processes any reduction that has other reduction users. The reduction
using the min/max reduction currently must be a FindLastIV reduction,
which needs adjusting to compute the correct result:
1. We need to find the last IV for which the condition based on the
min/max reduction is true,
2. Compare the partial min/max reduction result to its final value and,
3. Select the lanes of the partial FindLastIV reductions which
correspond to the lanes matching the min/max reduction result.
Depends on https://github.com/llvm/llvm-project/pull/140451
PR: https://github.com/llvm/llvm-project/pull/141431
As another step in issue #135812, this patch fixes block frequencies for
partial loop unrolling with an epilogue remainder loop. It does not
fully handle the case when the epilogue loop itself is unrolled. That
will be handled in the next patch.
For the guard and latch of each of the unrolled loop and epilogue loop,
this patch sets branch weights derived directly from the original loop
latch branch weights. The total frequency of the original loop body,
summed across all its occurrences in the unrolled loop and epilogue
loop, is the same as in the original loop. This patch also sets
`llvm.loop.estimated_trip_count` for the epilogue loop instead of
relying on the epilogue's latch branch weights to imply it.
This patch fixes branch weights in tests that PR #157754 adversely
affected.
This patch implements the LoopUnroll changes discussed in [[RFC] Fix
Loop Transformations to Preserve Block
Frequencies](https://discourse.llvm.org/t/rfc-fix-loop-transformations-to-preserve-block-frequencies/85785)
and is thus another step in addressing issue #135812.
In summary, for the case of partial loop unrolling without a remainder
loop, this patch changes LoopUnroll to:
- Maintain branch weights consistently with the original loop for the
sake of preserving the total frequency of the original loop body.
- Store the new estimated trip count in the
`llvm.loop.estimated_trip_count` metadata, introduced by PR #148758.
- Correct the new estimated trip count (e.g., 3 instead of 2) when the
original estimated trip count (e.g., 10) divided by the unroll count
(e.g., 4) leaves a remainder (e.g., 2).
There are loop unrolling cases this patch does not fully fix, such as
partial unrolling with a remainder loop and complete unrolling, and
there are two associated tests whose branch weights this patch adversely
affects. They will be addressed in future patches that should land with
this patch.
When partially or runtime unrolling loops with reductions, currently the
reductions are performed in-order in the loop, negating most benefits
from unrolling such loops.
This patch extends unrolling code-gen to keep a parallel reduction phi
per unrolled iteration and combining the final result after the loop.
For out-of-order CPUs, this allows executing mutliple reduction chains
in parallel.
For now, the initial transformation is restricted to cases where we
unroll a small number of iterations (hard-coded to 4, but should maybe
be capped by TTI depending on the execution units), to avoid introducing
an excessive amount of parallel phis.
It also requires single block loops for now, where the unrolled
iterations are known to not exit the loop (either due to runtime
unrolling or partial unrolling). This ensures that the unrolled loop
will have a single basic block, with a single exit block where we can
place the final reduction value computation.
The initial implementation also only supports parallelizing loops with a
single reduction and only integer reductions. Those restrictions are
just to keep the initial implementation simpler, and can easily be
lifted as follow-ups.
With corresponding TTI to the AArch64 unrolling preferences which I will
also share soon, this triggers in ~300 loops across a wide range of
workloads, including LLVM itself, ffmgep, av1aom, sqlite, blender,
brotli, zstd and more.
PR: https://github.com/llvm/llvm-project/pull/149470
These are identified by misc-include-cleaner. I've filtered out those
that break builds. Also, I'm staying away from llvm-config.h,
config.h, and Compiler.h, which likely cause platform- or
compiler-specific build failures.
We can use *Set::insert_range to collapse:
for (auto Elem : Range)
Set.insert(E);
down to:
Set.insert_range(Range);
In some cases, we can further fold that into the set declaration.
Add an extra knob to RuntimeUnrollMultiExit to let backends control
whether to allow multi-exit unrolling on a per-loop basis.
This gives backends more fine-grained control on deciding if multi-exit
unrolling is profitable for a given loop and uarch. Similar to
4226e0a0c75.
PR: https://github.com/llvm/llvm-project/pull/124462
Add an extra know to UnrollingPreferences to let backends control the
maximum budget for SCEV expansions.
This gives backends more fine-grained control on the cost of the runtime
checks for runtime unrolling.
PR: https://github.com/llvm/llvm-project/pull/118316
This patch fixes a simple error where as part of loop unrolling we
optimize conditional loop-exiting branches into unconditional branches
when we know that they will or won't exit the loop, but does not
propagate the source location of the original branch to the new one.
Found using https://github.com/llvm/llvm-project/pull/107279.
Use the more aggressive invalidation method in a number of places
that add incoming values to lcssa phi nodes. It is likely that
it's possible to construct cases with incorrect SCEV preservation
similar to https://github.com/llvm/llvm-project/issues/109333 for
these.
This is a helper to avoid writing `getModule()->getDataLayout()`. I
regularly try to use this method only to remember it doesn't exist...
`getModule()->getDataLayout()` is also a common (the most common?)
reason why code has to include the Module.h header.
- There is no restriction on a loop with controlled convergent
operations when
the relevant tokens are defined and used within the loop.
- When a token defined outside a loop is used inside (also called a loop
convergence heart), unrolling is allowed only in the absence of
remainder or
runtime checks.
- When a token defined inside a loop is used outside, such a loop is
said to be
"extended". This loop can only be unrolled by also duplicating the
extended part
lying outside the loop. Such unrolling is disabled for now.
- Clean up loop hearts: When unrolling a loop with a heart, duplicating
the
heart will introduce multiple static uses of a convergence control token
in a
cycle that does not contain its definition. This violates the static
rules for
tokens, and needs to be cleaned up into a single occurrence of the
intrinsic.
- Spell out the initializer for UnrollLoopOptions to improve
readability.
Original implementation [D85605] by Nicolai Haehnle
<nicolai.haehnle@amd.com>.
Remove redundant debug instructions after blocks have been merged into
the predecessor, It can reduce some compile time in some cases.
This change only fixes the situation of loop unrolling, and other
situations are not considered. "RemoveRedundantDbgInstrs" seems to be
very time-consuming. Thus, we just add here after the "Dest" has been
merged into the "Fold", this may be a more targeted solution!!!
fixes: https://github.com/llvm/llvm-project/issues/89073
I'm planning to remove StringRef::equals in favor of
StringRef::operator==.
- StringRef::operator==/!= outnumber StringRef::equals by a factor of
31 under llvm/ in terms of their usage.
- The elimination of StringRef::equals brings StringRef closer to
std::string_view, which has operator== but not equals.
- S == "foo" is more readable than S.equals("foo"), especially for
!Long.Expression.equals("str") vs Long.Expression != "str".
This patch adds loadCSE support to simplifyLoopAfterUnroll. It is based
on EarlyCSE's implementation using ScopeHashTable and is using SCEV for
accessed pointers to check to find redundant loads after unrolling.
This applies to the late unroll pass only, for full unrolling those
redundant loads will be cleaned up by the regular pipeline.
The current approach constructs MSSA on-demand per-loop, but there is
still small but notable compile-time impact:
stage1-O3 +0.04%
stage1-ReleaseThinLTO +0.06%
stage1-ReleaseLTO-g +0.05%
stage1-O0-g +0.02%
stage2-O3 +0.09%
stage2-O0-g +0.04%
stage2-clang +0.02%
https://llvm-compile-time-tracker.com/compare.php?from=c089fa5a729e217d0c0d4647656386dac1a1b135&to=ec7c0f27cb5c12b600d9adfc8543d131765ec7be&stat=instructions:u
This benefits some workloads with runtime-unrolling disabled,
where users use pragmas to force unrolling, as well as with
runtime unrolling enabled.
On SPEC/MultiSource, this removes a number of loads after unrolling
on AArch64 with runtime unrolling enabled.
```
External/S...te/526.blender_r/526.blender_r 96
MultiSourc...rks/mediabench/gsm/toast/toast 39
SingleSource/Benchmarks/Misc/ffbench 4
External/SPEC/CINT2006/403.gcc/403.gcc 18
MultiSourc.../Applications/JM/ldecod/ldecod 4
MultiSourc.../mediabench/jpeg/jpeg-6a/cjpeg 6
MultiSourc...OE-ProxyApps-C/miniGMG/miniGMG 9
MultiSourc...e/Applications/ClamAV/clamscan 4
MultiSourc.../MallocBench/espresso/espresso 3
MultiSourc...dence-flt/LinearDependence-flt 2
MultiSourc...ch/office-ispell/office-ispell 4
MultiSourc...ch/consumer-jpeg/consumer-jpeg 6
MultiSourc...ench/security-sha/security-sha 11
MultiSourc...chmarks/McCat/04-bisect/bisect 3
SingleSour...tTests/2020-01-06-coverage-009 12
MultiSourc...ench/telecomm-gsm/telecomm-gsm 39
MultiSourc...lds-flt/CrossingThresholds-flt 24
MultiSourc...dence-dbl/LinearDependence-dbl 2
External/S...C/CINT2006/445.gobmk/445.gobmk 6
MultiSourc...enchmarks/mafft/pairlocalalign 53
External/S...31.deepsjeng_r/531.deepsjeng_r 3
External/S...rate/510.parest_r/510.parest_r 58
External/S...NT2006/464.h264ref/464.h264ref 29
External/S...NT2017rate/502.gcc_r/502.gcc_r 45
External/S...C/CINT2006/456.hmmer/456.hmmer 6
External/S...te/538.imagick_r/538.imagick_r 18
External/S.../CFP2006/447.dealII/447.dealII 4
MultiSourc...OE-ProxyApps-C++/miniFE/miniFE 12
External/S...2017rate/525.x264_r/525.x264_r 36
MultiSourc...Benchmarks/7zip/7zip-benchmark 33
MultiSourc...hmarks/ASC_Sequoia/AMGmk/AMGmk 2
MultiSourc...chmarks/VersaBench/8b10b/8b10b 1
MultiSourc.../Applications/JM/lencod/lencod 116
MultiSourc...lds-dbl/CrossingThresholds-dbl 24
MultiSource/Benchmarks/McCat/05-eks/eks 15
```
PR: https://github.com/llvm/llvm-project/pull/83860
I'd reverted this in 6c7805d5d1 after a bad stage. Original commit
messsage follows:
[NFC][RemoveDIs] Bulk update utilities to insert with iterators
As part of the RemoveDIs project we need LLVM to insert instructions using
iterators wherever possible, so that the iterators can carry a bit of
debug-info. This commit implements some of that by updating the contents of
llvm/lib/Transforms/Utils to always use iterator-versions of instruction
constructors.
There are two general flavours of update:
* Almost all call-sites just call getIterator on an instruction
* Several make use of an existing iterator (scenarios where the code is
actually significant for debug-info)
The underlying logic is that any call to getFirstInsertionPt or similar
APIs that identify the start of a block need to have that iterator passed
directly to the insertion function, without being converted to a bare
Instruction pointer along the way.
I've also switched DemotePHIToStack to take an optional iterator: it needs
to take an iterator, and having a no-insert-location behaviour appears to
be important. The constructors for ICmpInst and FCmpInst have been updated
too. They're the only instructions that take block _references_ rather than
pointers for certain calls, and a future patch is going to make use of
default-null block insertion locations.
All of this should be NFC.
As part of the RemoveDIs project we need LLVM to insert instructions using
iterators wherever possible, so that the iterators can carry a bit of
debug-info. This commit implements some of that by updating the contents of
llvm/lib/Transforms/Utils to always use iterator-versions of instruction
constructors.
There are two general flavours of update:
* Almost all call-sites just call getIterator on an instruction
* Several make use of an existing iterator (scenarios where the code is
actually significant for debug-info)
The underlying logic is that any call to getFirstInsertionPt or similar
APIs that identify the start of a block need to have that iterator passed
directly to the insertion function, without being converted to a bare
Instruction pointer along the way.
I've also switched DemotePHIToStack to take an optional iterator: it needs
to take an iterator, and having a no-insert-location behaviour appears to
be important. The constructors for ICmpInst and FCmpInst have been updated
too. They're the only instructions that take block _references_ rather than
pointers for certain calls, and a future patch is going to make use of
default-null block insertion locations.
All of this should be NFC.
C++20 comes with std::erase to erase a value from std::vector. This
patch renames llvm::erase_value to llvm::erase for consistency with
C++20.
We could make llvm::erase more similar to std::erase by having it
return the number of elements removed, but I'm not doing that for now
because nobody seems to care about that in our code base.
Since there are only 50 occurrences of erase_value in our code base,
this patch replaces all of them with llvm::erase and deprecates
llvm::erase_value.
Loop unrolling tends to produce chains of
`%x1 = add %x0, 1; %x2 = add %x1, 1; ...` with one add per unrolled
iteration. This patch simplifies these adds to `%xN = add %x0, N`
directly during unrolling, rather than waiting for InstCombine to do so.
The motivation for this is that having a single add (rather than
an add chain) on the induction variable makes it a simple recurrence,
which we specially recognize in a number of places. This allows
InstCombine to directly perform folds with that knowledge, instead
of first folding the add chains, and then doing other folds in another
InstCombine iteration.
Due to the reduced number of InstCombine iterations, this also
results in a small compile-time improvement.
Differential Revision: https://reviews.llvm.org/D153540
A pseudo probe is created with dwarf line information shared with its nearest instruction. If the instruction comes with a dwarf discriminator, it will be shared with the probe as well. This can confuse the later FS-AFDO discriminator assignment pass. To fix this, I'm cleaning up the discriminator fields for probes when they are inserted.
I also notice another possibility to change the discriminator field of pseudo probes in the pipeline before the FS discriminator assignment pass. That is the loop unroller, which assigns duplication factor to instruction being vectorized. I'm disabling that for pseudo probe intrinsics specifically, also for callsites with probes.
Reviewed By: wenlei
Differential Revision: https://reviews.llvm.org/D148569
This reverts commit c5ea42bcf48c8f3d3e35a6bff620b06d2a499108.
Recommit the patch with a fix for loops where the exiting terminator is
not a branch instruction. In that case, ExitInfos may be empty. In
addition to checking if there's a single exiting block also check if
there's a single ExitInfo.
A test case has been added in f92b35392ed8e4631.
The scope of DT updates are very limited when unrolling loops: the DT
should only need updating for
* new blocks added
* exiting blocks we simplified branches
This can be done manually without too much extra work.
MergeBlockIntoPredecessor also needs to be updated to support direct
DT updates.
This fixes excessive time spent in DTU for same cases. In an internal
example, time spent in LoopUnroll with this patch goes from ~200s to 2s.
It also is slightly positive for CTMark:
* NewPM-O3: -0.13%
* NewPM-ReleaseThinLTO: -0.11%
* NewPM-ReleaseLTO-g: -0.13%
Notable improvements are mafft (~ -0.50%) and lencod (~ -0.30%), with no
workload regressed.
https://llvm-compile-time-tracker.com/compare.php?from=78a9ee7834331fb4360457cc565fa36f5452f7e0&to=687e08d011b0dc6d3edd223612761e44225c7537&stat=instructions:u
Reviewed By: kuhar
Differential Revision: https://reviews.llvm.org/D141487
There is no need to update the DT here, because there must be a unique
latch. Hence if the latch is not exiting it must directly branch back
to the original loop header and does not dominate any nodes.
Skipping a DT update here simplifies D141487.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D141810
The function name was misleading - the expectation set both by the name
and by other members of Function (like isDeclaration or isIntrinsic)
would be that the function somehow would "be" "debug info for
profiling". But that's not the case - the property indicates (as the
comment over the declaration also explains) whether debug info should be
emitted (for profiling).
This patch mechanically replaces None with std::nullopt where the
compiler would warn if None were deprecated. The intent is to reduce
the amount of manual work required in migrating from Optional to
std::optional.
This is part of an effort to migrate from llvm::Optional to
std::optional:
https://discourse.llvm.org/t/deprecating-llvm-optional-x-hasvalue-getvalue-getvalueor/63716
When unrolling, the exit values in LCSSA phis will get updated.
Invalidate cached SCEV values for those phis in case SCEV looked through
a exit phi.
Fixes#58340.
After unrolling a loop, the block and loop dispositions need to be
cleared. As we don't know which SCEVs in the loop/blocks may be
impacted, completely clear the cache. This should also fix some cases
where deleted loops remained in the LoopDispositions cache.
This fixes a verification failure surfaced by D134531.
I am planning on reviewing/updating the existing uses of
forgetLoopDispositions to check if they should be replaced by
forgetBlockAndLoopDispositions.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D134612
This patch replaces calls to GreatestCommonDivisor64 with std::gcd
where both arguments are known to be of unsigned types no larger than
64 bits in size.