Running ConstraintEliminiation after the first InstCombine run results
in slightly more simplifications on average.
There are is a tiny number of regressions, mostly due to CVP eliminating
a condition that ConstraintElimination would use, but in most cases
there's a slight improvement or no change.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D140853
This reverts commit 2656572d485127cc30b8fe9752024d2a0f1c50db.
It looks like CINT2017rate/502.gcc_r gets mis-compiled with LTO + PGO on
AArch64 with function specialization.
This patch enables Function Specialization by default at all
optimization levels except Os, Oz.
Compilation Time Overhead:
--------------------------
Measured the Instruction Count increase (Geomean) for CTMark from
the llvm-testsuite as in https://llvm-compile-time-tracker.com.
* {-O3, Non-LTO}: +0.136% Instruction Count
* {-O3, LTO}: +0.346% Instruction Count
Performance Uplift:
-------------------
Measured +9.121% score increase for 505.mcf_r from SPEC Int 2017
(Tested on Neoverse N1 with -O3 + LTO)
Correctness Testing:
--------------------
* Passes bootstrap Clang with ASAN + LTO + FuncSpec aggressive options:
{ MaxClonesThreshold=10,
SmallFunctionThreshold=10,
AvgLoopIterationCount=30,
SpecializeOnAddresses=true,
EnableSpecializationForLiteralConstant=true,
FuncSpecializationMaxIters=10 }
* Builds Chromium and passes its unittests with the above options + ThinLTO.
For more info please refer to
https://discourse.llvm.org/t/rfc-should-we-enable-function-specialization/61518
Differential Revision: https://reviews.llvm.org/D140210
Reland 877a9f9abec61f06e39f1cd872e37b828139c2d1 since D138654 (parent)
has been fixed with 9ebaf4fef4aac89d4eff08e48185d61bc893f14e and with
8f1e11c5a7d70f96943a72649daa69f152d73e90.
Differential Revision: https://reviews.llvm.org/D126455
Currently, SROA is CFG-preserving.
Not doing so does not affect any pipeline test. (???)
Internally, SROA requires Dominator Tree, and uses it solely for the final `-mem2reg` call.
By design, we can't really SROA alloca if their address escapes somehow,
but we have logic to deal with `load` of `select`/`PHI`,
where at least one of the possible addresses prevents promotion,
by speculating the `load`s and `select`ing between loaded values.
As one would expect, that requires ensuring that the speculation is actually legal.
Even ignoring complexity bailouts, that logic does not deal with everything,
e.g. `isSafeToLoadUnconditionally()` does not recurse into hands of `select`.
There can also be cases where the load is genuinely non-speculate.
So if we can't prove that the load can be speculated,
unfold the select, produce two-entry phi node, and perform predicated load.
Now, that transformation must obviously update Dominator Tree,
since we require it later on. Doing so is trivial.
Additionally, we don't want to do this for the final SROA invocation (D136806).
In the end, this ends up having negative (!) compile-time cost:
https://llvm-compile-time-tracker.com/compare.php?from=c6d7e80ec4c17a415673b1cfd25924f98ac83608&to=ddf9600365093ea50d7e278696cbfa01641c959d&stat=instructions:u
Though indeed, this only deals with `select`s, `PHI`s are still using speculation.
Should we update some more analysis?
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D138238
This reverts commit 739611870d3b06605afe25cc07833f6a62de9545,
and recommits 03e6d9d9d1d48e43f3efc35eb75369b90d4510d5
with a fixed assertion - we should check that DTU is there,
not just assert false...
The assertion about not modifying the CFG seems to not hold,
will recommit in a bit.
https://lab.llvm.org/buildbot#builders/139/builds/32412
This reverts commit 03e6d9d9d1d48e43f3efc35eb75369b90d4510d5.
This reverts commit 4f90f4ada33718f9025d0870a4fe3fe88276b3da.
Currently, SROA is CFG-preserving.
Not doing so does not affect any pipeline test. (???)
Internally, SROA requires Dominator Tree, and uses it solely for the final `-mem2reg` call.
By design, we can't really SROA alloca if their address escapes somehow,
but we have logic to deal with `load` of `select`/`PHI`,
where at least one of the possible addresses prevents promotion,
by speculating the `load`s and `select`ing between loaded values.
As one would expect, that requires ensuring that the speculation is actually legal.
Even ignoring complexity bailouts, that logic does not deal with everything,
e.g. `isSafeToLoadUnconditionally()` does not recurse into hands of `select`.
There can also be cases where the load is genuinely non-speculate.
So if we can't prove that the load can be speculated,
unfold the select, produce two-entry phi node, and perform predicated load.
Now, that transformation must obviously update Dominator Tree,
since we require it later on. Doing so is trivial.
Additionally, we don't want to do this for the final SROA invocation (D136806).
In the end, this ends up having negative (!) compile-time cost:
https://llvm-compile-time-tracker.com/compare.php?from=c6d7e80ec4c17a415673b1cfd25924f98ac83608&to=ddf9600365093ea50d7e278696cbfa01641c959d&stat=instructions:u
Though indeed, this only deals with `select`s, `PHI`s are still using speculation.
Should we update some more analysis?
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D138238
This reverts commit 877a9f9abec61f06e39f1cd872e37b828139c2d1.
It depends on the parent revision 42c2dc401742266da3e0251b6c1ca491f4779963
which needs to be reverted as it broke some buildbots, so reverting both.
The aim of this patch is to minimize the compilation time overhead of
running Function Specialization. It is about 40% slower to run as a
standalone pass (IPSCCP + FuncSpec vs IPSCCP with FuncSpec) according
to my measurements. I compiled the llvm testsuite with NewPM-O3 + LTO
and measured single threaded [user + system] time of IPSCCP and FuncSpec
by passing the '-time-passes' option to lld. Then I compared the two
configurations in terms of Instruction Count of the total compilation
(not of the individual passes) as in https://llvm-compile-time-tracker.com.
Geomean for non-LTO builds is -0.25% and LTO is -0.5% approximately.
You can find more info below:
https://discourse.llvm.org/t/rfc-should-we-enable-function-specialization/61518
Differential Revision: https://reviews.llvm.org/D126455
ControlHeightReduction (CHR) clones the code region to reduce the
branches in the hot code path. The number of clones is linear to the
depth of the region.
Currently it does not have control over the code size increase. We are
seeing one ~9000 BB functions get expanded to ~250000 BBs, an 25x
increase. This creates a big compile time issue for the downstream
optimizations.
This patch adds a cap for number of clones for one region.
Differential Revision: https://reviews.llvm.org/D138333
The option was added with https://reviews.llvm.org/D102496,
and currently the name is accurate, but I am hoping to add
a load transform that is not a scalarization. See issue #17113.
Move these to the new PM if they're used there.
Part of removing the legacy pass manager for optimization pipeline.
Reland with UseNewGVN usage in clang removed.
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D137915
Move these to the new PM if they're used there.
Part of removing the legacy pass manager for optimization pipeline.
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D137915
The legacy PM allowed you to set a custom inliner threshold via
builder.Inliner = llvm::createFunctionInliningPass(inline_threshold);
This allows the same thing to be done with the new PM optimization pipelines.
Reviewed By: asbirlea
Differential Revision: https://reviews.llvm.org/D137038
The ArgumentPromotion pass uses Mem2Reg promotion at the end to cutting
down generated `alloca` instructions as well as meaningless `store`s and
this behavior can leave unused (dead) arguments. To eliminate the dead
arguments and therefore let the DeadCodeElimination remove becoming dead
inserted `GEP`s as well as `load`s and `cast`s in the callers, the
DeadArgumentElimination pass should be run after the ArgumentPromotion
one.
Differential Revision: https://reviews.llvm.org/D128830
Alive2 doesn't support verification of optimizations that use inter-procedural analyses.
Right now, clang uses GlobalsAA by default and there's no way to disable it.
This leads to Alive2 producing false positives.
The added flag allows us to skip global analyses altogether.
Differential Revision: https://reviews.llvm.org/D134139
Similar to OptimizerLastEPCallbacks workaround
added D96320.
Probably NFC as-is, I don't see anything hooked with this callbacks yet,
but I we are looking to move sanitizers.
Reviewed By: aeubanks, MaskRay
Differential Revision: https://reviews.llvm.org/D133333
This reverts commit b10a341aa5b0b93b9175a8f11efc9a0955ab361e.
This commit exposes the pre-existing https://github.com/llvm/llvm-project/issues/56503 in some edge cases. Will fix that and then reland this.
The ArgumentPromotion pass uses Mem2Reg promotion at the end to cutting
down generated `alloca` instructions as well as meaningless `store`s and
this behavior can leave unused (dead) arguments. To eliminate the dead
arguments and therefore let the DeadCodeElimination remove becoming dead
inserted `GEP`s as well as `load`s and `cast`s in the callers, the
DeadArgumentElimination pass should be run after the ArgumentPromotion
one.
Differential Revision: https://reviews.llvm.org/D128830
The commit breaks the compiler when a function is used as a function
parameter (hm... for a function from the standard C library?):
```
static float strtof(char *, char *) {}
void a() { strtof(a, 0); }
```
This reverts commit 879f5118fc74657e4a5c4eff6810098e1eed75ac.
The ArgumentPromotion pass uses Mem2Reg promotion at the end to cutting
down generated `alloca` instructions as well as meaningless `store`s and
this behavior can leave unused (dead) arguments. To eliminate the dead
arguments and therefore let the DeadCodeElimination remove becoming dead
inserted `GEP`s as well as `load`s and `cast`s in the callers, the
DeadArgumentElimination pass should be run after the ArgumentPromotion
one.
Differential Revision: https://reviews.llvm.org/D128830
The ArgumentPromotion pass uses Mem2Reg promotion at the end to cutting
down generated `alloca` instructions as well as meaningless `store`s and
this behavior can leave unused (dead) arguments. To eliminate the dead
arguments and therefore let the DeadCodeElimination remove becoming dead
inserted `GEP`s as well as `load`s and `cast`s in the callers, the
DeadArgumentElimination pass should be run after the ArgumentPromotion
one.
Differential Revision: https://reviews.llvm.org/D128830
Add the `-enable-post-pgo-loop-rotation` option to enable or disable the loop rotation transformation [1]. With some instrumentations, e.g., function entry coverage [2], loop rotation is not necessary and can lead to some surprise differences in codegen, even for functions where instrumentation is blocked with `noprofile` or `skipprofile`. The default value is `true` so the default behavior does not change.
[1] https://www.llvm.org/docs/LoopTerminology.html#loop-terminology-loop-rotate
[2] https://reviews.llvm.org/D116180
Reviewed By: phosek
Differential Revision: https://reviews.llvm.org/D131817
We call tail-call-elim near the beginning of the pipeline,
but that is too early to annotate calls that get added later.
In the motivating case from issue #47852, the missing 'tail'
on memset leads to sub-optimal codegen.
I experimented with removing the early instance of
tail-call-elim instead of just adding another pass, but that
appears to be slightly worse for compile-time:
+0.15% vs. +0.08% time.
"tailcall" shows adding the pass; "tailcall2" shows moving
the pass to later, then adding the original early pass back
(so 1596886802 is functionally equivalent to 180b0439dc ):
https://llvm-compile-time-tracker.com/index.php?config=NewPM-O3&stat=instructions&remote=rotateright
Note that there was an effort to split the tail call functionality
into 2 passes - that could help reduce compile-time if we find
that this change costs more in compile-time than expected based
on the preliminary testing:
D60031
Differential Revision: https://reviews.llvm.org/D130374
This patch turns on the flag `-enable-no-rerun-simplification-pipeline`, which means the simplification pipeline will not be rerun on unchanged functions in the CGSCCPass Manager.
Compile time improvement:
https://llvm-compile-time-tracker.com/compare.php?from=17457be1c393ff691cca032b04ea1698fedf0301&to=882301ebb893c8ef9f09fe1ea871f7995426fa07&stat=instructions
No meaningful run time regressions observed in the llvm test suite and
in additional internal workloads at this time.
The example test in `test/Other/no-rerun-function-simplification-pipeline.ll` is a good means to understand the effect of this change:
```
define void @f1(void()* %p) alwaysinline {
call void %p()
ret void
}
define void @f2() #0 {
call void @f1(void()* @f2)
call void @f3()
ret void
}
define void @f3() #0 {
call void @f2()
ret void
}
```
There are two SCCs formed by the ModuleToPostOrderCGSCCAdaptor: (f1) and (f2, f3).
The pass manager runs on the first SCC, leading to running the simplification pipeline (function and loop passes) on f1. With the flag on, after this, the output will have `Running analysis: ShouldNotRunFunctionPassesAnalysis on f1`.
Next, the pass manager runs on the second SCC: (f2, f3). Since f1() was inlined, f2() now calls itself, and also calls f3(), while f3() only calls f2().
So the pass manager for the SCC first runs the Inliner on (f2, f3), then the simplification pipeline on f2.
With the flag on, the output will have `Running analysis: ShouldNotRunFunctionPassesAnalysis on f2`; unless the inliner makes a change, this analysis remains preserved which means there's no reason to rerun the simplification pipeline. With the flag off, there is a second run of the simplification pipeline run on f2.
Next, the same flow occurs for f3. The simplification pipeline is run on f3 a single time with the flag on, along with `ShouldNotRunFunctionPassesAnalysis on f3`, and twice with the flag off.
The reruns occur only on f2 and f3 due to the additional ref edges.
The CGProfilePass needs to be run during FullLTO compilation at link
time to emit the .llvm.call-graph-profile section to the compiled LTO
object file. Currently, it is being run only during the initial
LTO-prelink compilation stage (to produce the bitcode files to be
consumed by the linker) and so the section is not produced.
ThinLTO is not affected because:
- For ThinLTO-prelink compilation the CGProfilePass pass is not run
because ThinLTO-prelink passes are added via
buildThinLTOPreLinkDefaultPipeline. Normal and FullLTO-prelink
passes are both added via buildPerModuleDefaultPipeline which uses
the LTOPreLink parameter to customize its behavior for the
FullLTO-prelink pass differences.
- ThinLTO backend compilation phase adds the CGProfilePass (see:
buildModuleOptimizationPipeline).
Adjust when the pass is run so that the .llvm.call-graph-profile
section is produced correctly for FullLTO.
Fixes#56185 (https://github.com/llvm/llvm-project/issues/56185)
Some cl::ZeroOrMore were added to avoid the `may only occur zero or one times!`
error. More were added due to cargo cult. Since the error has been removed,
cl::ZeroOrMore is unneeded.
Also remove cl::init(false) while touching the lines.
All callers pass true.
select-unfold-freeze.ll is now a subset of select.ll so delete it.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D126501
CoroCleanup is designed to lowering all the remaining coroutine
intrinsics. It is required to run after CoroSplit only. However, the
position of CoroCleanup now is far too late. The downside here is that
the unlowered coroutine instrincs might blocking other optimizations
too. So it should be a pure win to hoist the position of CoroCleanup.
Reviewed By: aeubanks
Differential Revision: https://reviews.llvm.org/D124360
This change could reduce the time we call `declaresCoroEarlyIntrinsics`.
And it is helpful for future changes.
Reviewed By: aeubanks
Differential Revision: https://reviews.llvm.org/D123925
VectorizerStart extension is module callback in old PM, but is function
callback in new PM. We lack a module extension point between end of
buildModuleSimplificationPipeline and the function optimization
(including vectorizer) pipeline. So this patch adds a new module
extension point before the function optimization pipeline.
Reviewed By: aeubanks
Differential Revision: https://reviews.llvm.org/D122296
CoroSplit lowers various coroutine intrinsics. It's a CGSCC pass and
CGSCC passes don't run on unreachable functions. Normally GlobalDCE will
come along and delete unreachable functions, but we don't run GlobalDCE
under -O0, so an unreachable function with coroutine intrinsics may
never have CoroSplit run on it.
This patch adds GlobalDCE when coroutines intrinsics are present. It
also now runs all coroutine passes conditional when coroutine intrinsics
are present. This should also solve the -O0 regression reported in
D105877 due to LazyCallGraph construction.
Fixes https://github.com/llvm/llvm-project/issues/54117
Reviewed By: ChuanqiXu
Differential Revision: https://reviews.llvm.org/D122275
RequireAnalysis<GlobalsAA> doesn't actually recompute GlobalsAA.
GlobalsAA isn't invalidated (unless specifically invalidated) because
it's self-updating via ValueHandles, but can be imprecise during the
self-updates.
Rather than invalidating GlobalsAA, which would invalidate AAManager and
any analyses that use AAManager, create a new pass that recomputes
GlobalsAA.
Fixes#53131.
Differential Revision: https://reviews.llvm.org/D121167
This PR adds two extension points to the default LTO pipeline in PassBuilder, one at the beginning and one at the end. These two extension points already existed in the old pass manager, the aim is to replicate the same functionality in the new one.
Reviewed By: aeubanks
Differential Revision: https://reviews.llvm.org/D120491
LICM will speculatively hoist code outside of loops. This requires removing information, like alias analysis (https://github.com/llvm/llvm-project/issues/53794), range information (https://bugs.llvm.org/show_bug.cgi?id=50550), among others. Prior to https://reviews.llvm.org/D99249 , LICM would only be run after LoopRotate. Running Loop Rotate prior to LICM prevents a instruction hoist from being speculative, if it was conditionally executed by the iteration (as is commonly emitted by clang and other frontends). Adding the additional LICM pass first, however, forces all of these instructions to be considered speculative, even if they are not speculative after LoopRotate. This destroys information, resulting in performance losses for discarding this additional information.
This PR modifies LICM to accept a ``speculative'' parameter which allows LICM to be set to perform information-loss speculative hoists or not. Phase ordering is then modified to not perform the information-losing speculative hoists until after loop rotate is performed, preserving this additional information.
Reviewed By: lebedev.ri
Differential Revision: https://reviews.llvm.org/D119965