- There is no restriction on a loop with controlled convergent
operations when
the relevant tokens are defined and used within the loop.
- When a token defined outside a loop is used inside (also called a loop
convergence heart), unrolling is allowed only in the absence of
remainder or
runtime checks.
- When a token defined inside a loop is used outside, such a loop is
said to be
"extended". This loop can only be unrolled by also duplicating the
extended part
lying outside the loop. Such unrolling is disabled for now.
- Clean up loop hearts: When unrolling a loop with a heart, duplicating
the
heart will introduce multiple static uses of a convergence control token
in a
cycle that does not contain its definition. This violates the static
rules for
tokens, and needs to be cleaned up into a single occurrence of the
intrinsic.
- Spell out the initializer for UnrollLoopOptions to improve
readability.
Original implementation [D85605] by Nicolai Haehnle
<nicolai.haehnle@amd.com>.
This patch adds loadCSE support to simplifyLoopAfterUnroll. It is based
on EarlyCSE's implementation using ScopeHashTable and is using SCEV for
accessed pointers to check to find redundant loads after unrolling.
This applies to the late unroll pass only, for full unrolling those
redundant loads will be cleaned up by the regular pipeline.
The current approach constructs MSSA on-demand per-loop, but there is
still small but notable compile-time impact:
stage1-O3 +0.04%
stage1-ReleaseThinLTO +0.06%
stage1-ReleaseLTO-g +0.05%
stage1-O0-g +0.02%
stage2-O3 +0.09%
stage2-O0-g +0.04%
stage2-clang +0.02%
https://llvm-compile-time-tracker.com/compare.php?from=c089fa5a729e217d0c0d4647656386dac1a1b135&to=ec7c0f27cb5c12b600d9adfc8543d131765ec7be&stat=instructions:u
This benefits some workloads with runtime-unrolling disabled,
where users use pragmas to force unrolling, as well as with
runtime unrolling enabled.
On SPEC/MultiSource, this removes a number of loads after unrolling
on AArch64 with runtime unrolling enabled.
```
External/S...te/526.blender_r/526.blender_r 96
MultiSourc...rks/mediabench/gsm/toast/toast 39
SingleSource/Benchmarks/Misc/ffbench 4
External/SPEC/CINT2006/403.gcc/403.gcc 18
MultiSourc.../Applications/JM/ldecod/ldecod 4
MultiSourc.../mediabench/jpeg/jpeg-6a/cjpeg 6
MultiSourc...OE-ProxyApps-C/miniGMG/miniGMG 9
MultiSourc...e/Applications/ClamAV/clamscan 4
MultiSourc.../MallocBench/espresso/espresso 3
MultiSourc...dence-flt/LinearDependence-flt 2
MultiSourc...ch/office-ispell/office-ispell 4
MultiSourc...ch/consumer-jpeg/consumer-jpeg 6
MultiSourc...ench/security-sha/security-sha 11
MultiSourc...chmarks/McCat/04-bisect/bisect 3
SingleSour...tTests/2020-01-06-coverage-009 12
MultiSourc...ench/telecomm-gsm/telecomm-gsm 39
MultiSourc...lds-flt/CrossingThresholds-flt 24
MultiSourc...dence-dbl/LinearDependence-dbl 2
External/S...C/CINT2006/445.gobmk/445.gobmk 6
MultiSourc...enchmarks/mafft/pairlocalalign 53
External/S...31.deepsjeng_r/531.deepsjeng_r 3
External/S...rate/510.parest_r/510.parest_r 58
External/S...NT2006/464.h264ref/464.h264ref 29
External/S...NT2017rate/502.gcc_r/502.gcc_r 45
External/S...C/CINT2006/456.hmmer/456.hmmer 6
External/S...te/538.imagick_r/538.imagick_r 18
External/S.../CFP2006/447.dealII/447.dealII 4
MultiSourc...OE-ProxyApps-C++/miniFE/miniFE 12
External/S...2017rate/525.x264_r/525.x264_r 36
MultiSourc...Benchmarks/7zip/7zip-benchmark 33
MultiSourc...hmarks/ASC_Sequoia/AMGmk/AMGmk 2
MultiSourc...chmarks/VersaBench/8b10b/8b10b 1
MultiSourc.../Applications/JM/lencod/lencod 116
MultiSourc...lds-dbl/CrossingThresholds-dbl 24
MultiSource/Benchmarks/McCat/05-eks/eks 15
```
PR: https://github.com/llvm/llvm-project/pull/83860
Get more precise cost of instruction after LoopUnroll considering that
some operands of it can be simplified, e.g. induction variable will be
replaced by constant after full unrolling.
Fixes [PR77842](https://github.com/llvm/llvm-project/issues/77842) where
UBSAN causes pragma full unroll to try and unroll INT_MAX times. This
sets a cap to make sure we don't attempt this and crash the compiler.
Testing:
ninja check-all with new test
---------
Co-authored-by: Nikita Popov <github@npopov.com>
The UnrollMaxUpperBound should be target dependent, since different
chips provide different register set which brings different ability of
storing more temporary values of a program. So I add a MaxUpperBound
value in UnrollingPreference which can be override by targets. All uses
of UnrollMaxUpperBound are replaced with UP.MaxUpperBound.
The default value is still 8 and the command line argument
'--unroll-max-upperbound' takes final effect if provided.
Fix loop unroll fail caused by branches folding.
For example:
SimplifyCFG foldloop branches then cause loop unroll failed for "#program unroll" loop.
```
#program unroll
for (int I = 0; I < ConstNum; ++I) { // folding "I < ConstNum" and "Cond2"
if (Cond2) {
break;
}
xxx loop body;
}
```
The pragma unroll metadata only takes effect if there is an exact trip
count, but not if there is an upper bound trip count. This patch make it
work with an upper bound trip count as well in shouldPragmaUnroll().
Loop unroll is important in stack nervous devices (e.g. GPU, and that is
why a lot of GPU code mark loop with "#program unroll").
It usually much simplify the address (offset) calculations in old
iterations, then we can do a lot of others optimizations, e.g, SROA, for
these simplifed address (escape alloca the whole aggregates).
Instead of having ApproximateLoopSize() use a bunch of out parameters,
from which we later construct an UnrollCostEstimator, directly
construct UnrollCostEstimator which holds all the information
derived from loop analysis. This makes it easier to add additional
metrics in the future.
The last use was removed by:
commit d623b2f95fd559901f008a0588dddd0949a8db01
Author: Arthur Eubanks <aeubanks@google.com>
Date: Fri Mar 10 17:24:19 2023 -0800
FullLoopUnroll was performing runtime unrolling in certain cases when
'#pragma unroll' was specified. Patch to fix this by introducing new parameter
to tryToUnrollLoop() to differentiate between LoopUnrollPass and
FullLoopUnrollPass. Based on the discussion here
(https://discourse.llvm.org/t/loop-unroller-fails-to-unroll-loop/69834)
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D148071
The value map of last peeled iteration is computed within peelLoop API.
This patch exposes it for callers of peelLoop.
While this is not currently used by upstream passes, we have a usecase
downstream which benefits from this API update. Future users of peelLoop
can also use the ValueMap if needed.
Similar value maps are exposed by other loop utilities such as loop
cloning.
Differential Revision: https://reviews.llvm.org/D138228
value() has undesired exception checking semantics and calls
__throw_bad_optional_access in libc++. Moreover, the API is unavailable without
_LIBCPP_NO_EXCEPTIONS on older Mach-O platforms (see
_LIBCPP_AVAILABILITY_BAD_OPTIONAL_ACCESS).
This patch mechanically replaces None with std::nullopt where the
compiler would warn if None were deprecated. The intent is to reduce
the amount of manual work required in migrating from Optional to
std::optional.
This is part of an effort to migrate from llvm::Optional to
std::optional:
https://discourse.llvm.org/t/deprecating-llvm-optional-x-hasvalue-getvalue-getvalueor/63716
Using these queries with a context instruction and without a cache
seems to be about 2x slower than with it so this theoretically
improves compile time.
* Replace getUserCost with getInstructionCost, covering all cost kinds.
* Remove getInstructionLatency, it's not implemented by any backends, and we should fold the functionality into getUserCost (now getInstructionCost) to make it easier for targets to handle the cost kinds with their existing cost callbacks.
Original Patch by @samparker (Sam Parker)
Differential Revision: https://reviews.llvm.org/D79483
Teach the unroller(s) how to handle an invalid cost. This avoids crashes when the backend can't provide a cost due to either a fundemental limitation or an unimplemented cost model case.
Differential Revision: https://reviews.llvm.org/D127305
Per the documentation in Support/InstructionCost.h, the purpose of an invalid cost is so that clients can change behavior on impossible to cost inputs. CodeMetrics was instead asserting that invalid costs never occurred.
On a target with an incomplete cost model - e.g. RISCV - this means that transformations would crash on (falsely) invalid constructs - e.g. scalable vectors. While we certainly should improve the cost model - and I plan to do so in the near future - we also shouldn't be crashing. This violates the explicitly stated purpose of an invalid InstructionCost.
I updated all of the "easy" consumers where bailouts were locally obvious. I plan to follow up with loop unroll in a following change.
Differential Revision: https://reviews.llvm.org/D127131
Some cl::ZeroOrMore were added to avoid the `may only occur zero or one times!`
error. More were added due to cargo cult. Since the error has been removed,
cl::ZeroOrMore is unneeded.
Also remove cl::init(false) while touching the lines.
There are a few places where we use report_fatal_error when the input is broken.
Currently, this function always crashes LLVM with an abort signal, which
then triggers the backtrace printing code.
I think this is excessive, as wrong input shouldn't give a link to
LLVM's github issue URL and tell users to file a bug report.
We shouldn't print a stack trace either.
This patch changes report_fatal_error so it uses exit() rather than
abort() when its argument GenCrashDiag=false.
Reviewed by: nikic, MaskRay, RKSimon
Differential Revision: https://reviews.llvm.org/D126550
Similar to c515b2f39e77, If there are no loops in the function as seen
through LI, we should avoid computing the remaining expensive analyses
(such as SCEV, BPI). Reordered the analyses requests and early return
if there are no loops.
The logic of avoiding expensive analyses is applied to LoopVectorizer,
LoopLoadElimination and LoopUnrollPass, i.e. all function passes which operate
on loops.
This is an NFC with compile time improvement.
Differential Revision: https://reviews.llvm.org/D124529
IMO when user provide unroll pragma, compiler should always respect it.
It is not clear to me why loop unroll pass currently ensure that the
unrolled loop size is limited by PragmaUnrollThreshold.
Reviewed By: Meinersbur
Differential Revision: https://reviews.llvm.org/D119148
Cleanup code in peelLoop API. We already have usage of DT without guarding
against a null DT, so this change constant folds the remaining null DT
checks.
Also make the argument a reference so that it is clear the argument is
a nonnull DT.
Extracted from D118472.
If a loop isn't forced to be unrolled, we want to avoid unrolling it when there
is an explicit unroll-and-jam pragma. This is to prevent automatic unrolling
from interfering with the user requested transformation.
Differential Revision: https://reviews.llvm.org/D114886
5c77aa2b917c [unroll] Use early return in shouldFullUnroll [nfc]
wasn't quite NFC since !(x <= y) is x > y rather than x >= y
Credit to Justin Bogner for spotting the bug
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D114894
This change should be NFC. It's posted for review mostly to make sure others are happy with the names I'm introducing for "exact full unroll" and "bounded full unroll". The motivation here is that our cost model for bounded unrolling is too aggressive - it gives benefits for exits we aren't going to prune - but I also just think the new version of the code is a lot easier to follow.
Differential Revision: https://reviews.llvm.org/D114453
These variables are not out-params, and we immediately return after assigning them. Thus, the assignments are dead and just confusing.
I believe these used to be out-params, but they're not any more.
This patch adds a new cost heuristic that allows peeling a single
iteration off read-only loops, if the loop contains a load that
1. is feeding an exit condition,
2. dominates the latch,
3. is not already known to be dereferenceable,
4. and has a loop invariant address.
If all non-latch exits are terminated with unreachable, such loads
in the loop are guaranteed to be dereferenceable after peeling,
enabling hoisting/CSE'ing them.
This enables vectorization of loops with certain runtime-checks, like
multiple calls to `std::vector::at` if the vector is passed as pointer.
Reviewed By: mkazantsev
Differential Revision: https://reviews.llvm.org/D108114
Added '-print-pipeline-passes' printing of parameters for those passes
declared with *_WITH_PARAMS macro in PassRegistry.def.
Note that it only prints the parameters declared inside *_WITH_PARAMS as
in a few cases there appear to be additional parameters not parsable.
The following passes are now covered (i.e. all of those with *_WITH_PARAMS in
PassRegistry.def).
LoopExtractorPass - loop-extract
HWAddressSanitizerPass - hwsan
EarlyCSEPass - early-cse
EntryExitInstrumenterPass - ee-instrument
LowerMatrixIntrinsicsPass - lower-matrix-intrinsics
LoopUnrollPass - loop-unroll
AddressSanitizerPass - asan
MemorySanitizerPass - msan
SimplifyCFGPass - simplifycfg
LoopVectorizePass - loop-vectorize
MergedLoadStoreMotionPass - mldst-motion
GVN - gvn
StackLifetimePrinterPass - print<stack-lifetime>
SimpleLoopUnswitchPass - simple-loop-unswitch
Differential Revision: https://reviews.llvm.org/D109310