243 Commits

Author SHA1 Message Date
Joel E. Denny
37e03b56b8
Revert "[PGO] Add llvm.loop.estimated_trip_count metadata" (#151585)
Reverts llvm/llvm-project#148758

[As
requested.](https://github.com/llvm/llvm-project/pull/148758#pullrequestreview-3076627201)
2025-07-31 15:56:31 -04:00
Joel E. Denny
f7b65011de
[PGO] Add llvm.loop.estimated_trip_count metadata (#148758)
This patch implements the `llvm.loop.estimated_trip_count` metadata
discussed in [[RFC] Fix Loop Transformations to Preserve Block
Frequencies](https://discourse.llvm.org/t/rfc-fix-loop-transformations-to-preserve-block-frequencies/85785).
As [suggested in the RFC
comments](https://discourse.llvm.org/t/rfc-fix-loop-transformations-to-preserve-block-frequencies/85785/4),
it adds the new metadata to all loops at the time of profile ingestion
and estimates each trip count from the loop's `branch_weights` metadata.
As [suggested in the PR #128785
review](https://github.com/llvm/llvm-project/pull/128785#discussion_r2151091036),
it does so via a new `PGOEstimateTripCountsPass` pass, which creates the
new metadata for each loop but omits the value if it cannot estimate a
trip count due to the loop's form.

An important observation not previously discussed is that
`PGOEstimateTripCountsPass` *often* cannot estimate a loop's trip count,
but later passes can sometimes transform the loop in a way that makes it
possible. Currently, such passes do not necessarily update the metadata,
but eventually that should be fixed. Until then, if the new metadata has
no value, `llvm::getLoopEstimatedTripCount` disregards it and tries
again to estimate the trip count from the loop's current
`branch_weights` metadata.
2025-07-31 12:28:25 -04:00
Ryotaro Kasuga
3099b7eb5d
[Passes] Move LoopInterchange into optimization pipeline (#145503)
As mentioned in https://github.com/llvm/llvm-project/pull/145071,
LoopInterchange should be part of the optimization pipeline rather than
the simplification pipeline. This patch moves LoopInterchange into the
optimization pipeline.

More contexts:

- By default, LoopInterchange attempts to improve data locality,
however, it also takes increasing vectorization opportunities into
account. Given that, it is reasonable to run it as close to
vectorization as possible.
- I looked into previous changes related to the placement of
LoopInterchange, but couldn’t find any strong motivation suggesting that
it benefits other simplifications.
- As far as I tried some tests (including llvm-test-suite), removing
LoopInterchange from the simplification pipeline does not affect other
simplifications. Therefore, there doesn't seem to be much value in
keeping it there.
- The new position reduces compile-time for ThinLTO, probably because it
only runs once per function in post-link optimization, rather than both
in pre-link and post-link optimization.

I haven't encountered any cases where the positional difference affects
optimization results, so please feel free to revert if you run into any issues.
2025-07-04 20:06:53 +09:00
Mircea Trofin
daa2a587cc
[TRE] Adjust function entry count when using instrumented profiles (#143987)
The entry count of a function needs to be updated after a callsite is elided by TRE: before elision, the entry count accounted for the recursive call at that callsite. After TRE, we need to remove that callsite's contribution.

This patch enables this for instrumented profiling cases because, there, we know the function entry count captured entries before TRE. We cannot currently address this for sample-based (because we don't know whether this function was TRE-ed in the binary that donated samples)
2025-06-23 08:07:31 -07:00
Nikita Popov
ae8c85c9ce
[Passes] Remove LoopInterchange from O1 pipeline (#145071)
This is a fairly exotic pass, I don't think it makes a lot of sense to
run it at O1, esp. as vectorization wouldn't run at O1 anyway.
2025-06-23 09:11:03 +02:00
Peter Collingbourne
3fa231f47c
Add SimplifyTypeTests pass.
This pass figures out whether inlining has exposed a constant address to
a lowered type test, and remove the test if so and the address is known
to pass the test. Unfortunately this pass ends up needing to reverse
engineer what LowerTypeTests did; this is currently inherent to the design
of ThinLTO importing where LowerTypeTests needs to run at the start.

Reviewers: teresajohnson

Reviewed By: teresajohnson

Pull Request: https://github.com/llvm/llvm-project/pull/141327
2025-06-05 11:09:20 -07:00
Snehasish Kumar
16c7b3c9f5
[MemProf] Split MemProfiler into Instrumentation and Use. (#142811)
Most of the recent development on the MemProfiler has been on the Use part. The instrumentation has been quite stable for a while. As the complexity of the use grows (with undrifting, diagnostics etc) I figured it would be good to separate these two implementations.
2025-06-05 07:36:50 -07:00
Kazu Hirata
228f66807d
[llvm] Remove unused includes (NFC) (#142733)
These are identified by misc-include-cleaner.  I've filtered out those
that break builds.  Also, I'm staying away from llvm-config.h,
config.h, and Compiler.h, which likely cause platform- or
compiler-specific build failures.
2025-06-04 12:30:52 -07:00
Min-Yih Hsu
0ab67ec191
[LV][EVL] Introduce the EVLIndVarSimplify Pass for EVL-vectorized loops (#131005)
When we enable EVL-based loop vectorization w/ predicated tail-folding,
each vectorized loop has effectively two induction variables: one
calculates the step using (VF x vscale) and the other one increases the
IV by values returned from experiment.get.vector.length. The former,
also known as canonical IV, is more favorable for analyses as it's
"countable" in the sense of SCEV; the latter (EVL-based IV), however, is
more favorable to codegen, at least for those that support scalable
vectors like AArch64 SVE and RISC-V.

The idea is that we use canonical IV all the way until the end of all
vectorizers, where we replace it with EVL-based IV using EVLIVSimplify
introduced here. Such that we can have the best from both worlds.

This Pass is enabled by default in RISC-V. However, since we haven't
really vectorize loops with predicate tail-folding by default, this Pass
is no-op at this moment.
2025-05-14 13:49:50 -07:00
Chengjun
94d933676c
[AA] Move Target Specific AA before BasicAA (#125965)
In this change, NVPTX AA is moved before Basic AA to potentially improve
compile time. Additionally, it introduces a flag in the
`ExternalAAWrapper` that allows other backends to run their
target-specific AA passes before Basic AA, if desired.

The change works for both New Pass Manager and Legacy Pass Manager.

Original implementation by Princeton Ferro <pferro@nvidia.com>
2025-05-07 15:25:48 -07:00
Tianle Liu
e038c5401c
[LTO][Pipelines] Add 0 hot-caller threshold for SamplePGO + FullLTO (#135152)
If a hot callsite function is not inlined in the 1st build, inlining the
hot callsite in pre-link stage of SPGO 2nd build may lead to Function
Sample not found in profile file in link stage. It will miss some
profile info.
ThinLTO has already considered and dealed with it by setting
HotCallSiteThreshold to 0 to stop the inline. This patch just adds the
same processing for FullLTO.
2025-04-14 11:21:08 +08:00
Mircea Trofin
4c90d977db
[ctxprof] Use the flattened contextual profile pre-thinlink (#134723)
Flatten the profile pre-thinlink so that ThinLTO has something to work with for the parts of the binary that aren't covered by contextual profiles. Post-thinlink, the flattener is re-run and will actually change profile info, but just for the modules containing contextual trees ("specialized modules"). For the rest, the flattener just yanks out the instrumentation.
2025-04-08 17:30:49 -07:00
Paul Kirth
268c065eab
[fatlto] Add coroutine passes when using FatLTO with ThinLTO (#134434)
When coroutines are used w/ both -ffat-lto-objects and -flto=thin,
the coroutine passes are not added to the optimization pipelines.
Ensure they are added before ModuleOptimization to generate a
working ELF object.

Fixes #134409.
2025-04-07 08:41:49 -07:00
Ellis Hoag
2044dd07da
[InstrProf] Remove -forder-file-instrumentation (#130192) 2025-03-13 08:28:16 -07:00
Vitaly Buka
3ccacc4e44
Revert "[LTO][Pipelines][Coro] De-duplicate Coro passes" (#129977)
Reverts llvm/llvm-project#128654

Breaks FatLTO
https://github.com/llvm/llvm-project/pull/128654#issuecomment-2700053700
2025-03-06 07:57:30 -08:00
Mircea Trofin
eb1c3ace39
[ctxprof] Override type of instrumentation if -profile-context-root is specified (#128940)
This patch makes it easy to enable ctxprof instrumentation for targets where the build has a bunch of defaults for instrumented PGO that we want to inherit for ctxprof.

This is switching experimental defaults: we'll eventually enable ctxprof instrumentation through `PGOOpt` but that type is currently quite entangled and, for the time being, no point adding to that.
2025-02-26 19:56:59 -08:00
Mircea Trofin
f6703a4ff5
[ctxprof] don't inline weak symbols after instrumentation (#128811)
Contextual profiling identifies functions by GUID. Functions that may get overridden by the linker with a prevailing copy may have, during instrumentation, different variants in different modules. If these variants get inlined before linking (here I assume thinlto), they will identify themselves to the ctxprof runtime as their GUID, leading to issues - they may have different counter counts, for instance.

If we block their inlining in the pre-thinlink compilation, only the prevailing copy will survive post-thinlink and the confusion is avoided.

The change introduces a small pass just for this purpose, which marks any symbols that could be affected by the above as `noinline` (even if they were `alwaysinline`). We already carried out some inlining (via the preinliner), before instrumenting, so technically the `alwaysinline` directives were honored.

We could later (different patch) choose to mark them back to their original attribute (none or `alwaysinline`) post-thinlink, if we want to - but experimentally that doesn't really change much of the performance of the instrumented binary.
2025-02-26 11:01:37 -08:00
Vitaly Buka
31897e651a
[LTO][Pipelines][Coro] De-duplicate Coro passes (#128654)
```
if (!isLTOPostLink(Phase))
    CoroPM.addPass(CoroEarlyPass());
if (!isLTOPreLink(Phase))
    // Other Coro passes
```

Followup to #126168.
2025-02-25 20:14:19 -08:00
Vitaly Buka
30a7c816ee
[LTO][Pipelines][NFC] Exctract isLTOPostLink (#128653) 2025-02-25 14:56:51 -08:00
Paul Kirth
63c1be7249
[llvm][fatlto] Add FatLTOCleanup pass (#125911)
When using FatLTO, it is common to want to enable certain types of whole
program optimizations (WPD) or security transforms (CFI), so that they
can be made available when performing LTO. However, these transforms
should not be used when compiling the non-LTO object code. Since the
frontend must emit different IR, we cannot simply clone the module and
optimize the LTO section and non-LTO section differently to work around
this. Instead, we need to remove any problematic instruction sequences.

This patch adds a new pass whose responsibility is to clean up the IR
in the FatLTO pipeline after creating the bitcode section, which is
after running the pre-link pipeline but before running module
optimization. This allows us to safely drop any conflicting instructions
or IR constructs that are inappropriate for non-LTO compilation.
2025-02-13 11:39:02 -08:00
Vitaly Buka
1032df6f60
[LTO][Pipelines][Coro] Handle coroutines in LTO pipeline (#126168)
ThinLTO delays handling of coroutines to ThinLTO backend.
However it's usually possible to use ThinLTO prelink objects for FullLTO.

In this case we have left-over coroutines which crash in codegen.

Issue #104525.
2025-02-12 21:39:32 -08:00
Vitaly Buka
be98428374
[NFC][Pipelines] Extract buildCoroConditionalWrapper (#126860)
Helper for #126168.

`Phase` will be used in followup patches.
2025-02-11 23:54:07 -08:00
Sjoerd Meijer
612df14c00
[Clang][Driver] Add an option to control loop-interchange (#125830)
This introduces options `-floop-interchange` and `-fno-loop-interchange`
to enable/disable the loop-interchange pass. This is part of the work
that tries to get that pass enabled by default (#124911), where it was
remarked that a user facing option to control this would be convenient
to have. The option name is the same as GCC's.
2025-02-07 10:31:24 +00:00
Joseph Huber
d9500f5032 [OpenMP] Fix the OpenMPOpt pass incorrectly optimizing if definition was missing
Summary:
This code is intended to block transformations if the call isn't
present, however the way it's coded it silently lets it pass if the
definition doesn't exist at all. This previously was always valid since
we included the runtime as one giant blob so everything was always
there, but now that we want to move towards separate ones, it's not
quite correct.
2025-02-06 21:38:36 -06:00
Axel Sorenson
d3161defd6
[PassBuilder] VectorizerEnd Extension Points (#123494)
Added an extension point after vectorizer passes in the PassBuilder.
Additionally, added extension points before and after vectorizer passes
in `buildLTODefaultPipeline`. Credit goes to @mshockwave for guiding me
through my first LLVM contribution (and my first open source
contribution in general!) :)
- Implemented `registerVectorizerEndEPCallback`
- Implemented `invokeVectorizerEndEPCallbacks`
- Added `VectorizerEndEPCallbacks` SmallVector
- Added a command line option `passes-ep-vectorizer-end` to
`NewPMDriver.cpp`
- `buildModuleOptimizationPipeline` now calls
`invokeVectorizerEndEPCallbacks`
- `buildO0DefaultPipeline` now calls `invokeVectorizerEndEPCallbacks`
- `buildLTODefaultPipeline` now calls BOTH
`invokeVectorizerStartEPCallbacks` and `invokeVectorizerEndEPCallbacks`
- Added LIT tests to `new-pm-defaults.ll`, `new-pm-lto-defaults.ll`,
`new-pm-O0-ep-callbacks.ll`, and `pass-pipeline-parsing.ll`
- Renamed `CHECK-EP-Peephole` to `CHECK-EP-PEEPHOLE` in
`new-pm-lto-defaults.ll` for consistency.

This code is intended for developers that wish to implement and run
custom passes after the vectorizer passes in the PassBuilder pipeline.
For example, in #91796, a pass was created that changed the induction
variables of vectorized code. This is right after the vectorization
passes.
2025-01-29 11:24:03 -08:00
gulfemsavrun
38902153fe
[PassBuilder] Add RelLookupTableConverterPass to LTO (#124053)
[PassBuilder] Add RelLookupTableConverterPass to LTO

This patch adds RelLookupTableConverterPass into the LTO
post-link optimization pass pipeline. This optimization
converts lookup tables to relative lookup tables to make
them PIC-friendly, which is already included in the non-LTO
pass pipeline. This patch adds this optimization to the
post-link optimization pipeline to discover more
opportunities in the LTO context.
2025-01-28 15:08:03 -08:00
Ryan Mansfield
67efbd0bf1
[LLVM] Fix various cl::desc typos and whitespace issues (NFC) (#121955) 2025-01-08 11:07:23 +01:00
Florian Hahn
9e66206638
[Passes] Generalize ShouldRunExtraVectorPasses to allow re-use (NFCI). (#118323)
Generalize ShouldRunExtraVectorPasses to ShouldRunExtraPasses, to allow
re-use for other transformations.

PR: https://github.com/llvm/llvm-project/pull/118323
2024-12-04 16:55:06 +00:00
Kyungwoo Lee
d23c5c2d65
[CGData] Global Merge Functions (#112671)
This implements a global function merging pass. Unlike traditional
function merging passes that use IR comparators, this pass employs a
structurally stable hash to identify similar functions while ignoring
certain constant operands. These ignored constants are tracked and
encoded into a stable function summary. When merging, instead of
explicitly folding similar functions and their call sites, we form a
merging instance by supplying different parameters via thunks. The
actual size reduction occurs when identically created merging instances
are folded by the linker.

Currently, this pass is wired to a pre-codegen pass, enabled by the
`-enable-global-merge-func` flag.
In a local merging mode, the analysis and merging steps occur
sequentially within a module:
- `analyze`: Collects stable function hashes and tracks locations of
ignored constant operands.
- `finalize`: Identifies merge candidates with matching hashes and
computes the set of parameters that point to different constants.
- `merge`: Uses the stable function map to optimistically create a
merged function.

We can enable a global merging mode similar to the global function
outliner
(https://discourse.llvm.org/t/rfc-enhanced-machine-outliner-part-2-thinlto-nolto/78753/),
which will perform the above steps separately.
- `-codegen-data-generate`: During the first round of code generation,
we analyze local merging instances and publish their summaries.
- Offline using `llvm-cgdata` or at link-time, we can finalize all these
merging summaries that are combined to determine parameters.
- `-codegen-data-use`: During the second round of code generation, we
optimistically create merging instances within each module, and finally,
the linker folds identically created merging instances.

Depends on #112664
This is a patch for
https://discourse.llvm.org/t/rfc-global-function-merging/82608.
2024-11-13 17:34:07 -08:00
Lei Wang
bc1aa2863b
[SampleFDO] Support enabling sample loader pass in O0 mode (#113985)
Add support for enabling sample loader pass in O0 mode(under
`-fsample-profile-use`). This can help verify PGO raw profile count
quality or provide a more accurate performance proxy(predictor), as O0
mode has minimal or no compiler optimizations that might otherwise
impact profile count accuracy.
- Explicitly disable the sample loader inlining to ensure it only emits
sampling annotation.
- Use flattened profile for O0 mode.
- Add the pass after `AddDiscriminatorsPass` pass to work with
`-fdebug-info-for-profiling`.
2024-11-08 15:29:44 -08:00
Yuxuan Chen
c6414970d7
[Coroutines] Inline the .noalloc ramp function marked coro_safe_elide (#114004) 2024-11-07 22:41:32 -08:00
Hari Limaye
fbd89bcc66
Reland "[LTO] Run Argument Promotion before IPSCCP" (#111853)
Run ArgumentPromotion before IPSCCP in the LTO pipeline, to expose more
constants to be propagated. We also run PostOrderFunctionAttrs to
improve the information available to ArgumentPromotion's alias analysis,
and SROA to clean up allocas.

Relands #111163.
2024-11-06 13:54:48 +00:00
Shilei Tian
390300d9f4
[PassBuilder] Add ThinOrFullLTOPhase to optimizer pipeline (#114577) 2024-11-03 23:25:29 -05:00
Shilei Tian
dc45ff1d2a
[PassBuilder] Add ThinOrFullLTOPhase to early simplication EP call backs (#114547)
The early simplication pipeline is used in non-LTO and (Thin/Full)LTO
pre-link
stage. There are some passes that we want them in non-LTO mode, but not
at LTO
pre-link stage. The control is missing currently. This PR adds the
support. To
demonstrate the use, we only enable the internalization pass in non-LTO
mode for
AMDGPU because having it run in pre-link stage causes some issues.
2024-11-03 23:24:10 -05:00
Shilei Tian
5445edb5d6
[PassBuilder] Replace bool LTOPreLink with ThinOrFullLTOPhase Phase (#114564)
This will allow more fine-grained control in the future.
2024-11-01 14:56:35 -04:00
Lei Wang
bef3b54ea1
[InstrPGO] Avoid using global variable to fix potential data race (#114364)
In https://github.com/llvm/llvm-project/pull/109837, it sets a global
variable(`PGOInstrumentColdFunctionOnly`) in PassBuilderPipelines.cpp
which introduced a data race detected by TSan. To fix this, I decouple
the flag setting, the flags are now set
separately(`instrument-cold-function-only-path` is required to be used
with `--pgo-instrument-cold-function-only`).
2024-10-31 21:28:13 -07:00
Paul Kirth
913cd11f94
[llvm][fatlto] Drop any CFI related instrumentation after emitting bitcode (#112788)
We want to support CFI instrumentation for the bitcode section, without
miscompiling the object code portion of a FatLTO object. We can reuse
the existing mechanisms in the LowerTypeTestsPass to do that, by just
adding the pass to the FatLTO pipeline after the EmbedBitcodePass with
the correct options set.

Fixes #112053
2024-10-31 12:40:21 -07:00
Dmitry Chernenkov
d924a9ba03 Revert "[InstrPGO] Support cold function coverage instrumentation (#109837)"
This reverts commit e517cfc531886bf6ed64b4e7109bb3141ac7f430.
2024-10-31 10:55:17 +00:00
Paul Kirth
b01e2a8b56
[llvm] Allow always dropping all llvm.type.test sequences
Currently, the `DropTypeTests` parameter only fully works with phi nodes
and llvm.assume instructions. However, we'd like CFI to work in
conjunction with FatLTO, in so far as the bitcode section should be able
to contain the CFI instrumentation, while any incompatible bits are
dropped when compiling the object code.

To do that, we need to drop the llvm.type.test instructions everywhere,
and not just their uses in phi nodes. This patch updates the
LowerTypeTest pass so that uses are removed, and replaced with `true` in
all cases, and not just in phi nodes.

Addressing this will allow us to fix #112053 by modifying the FatLTO
pipeline.

Reviewers: pcc, nikic

Reviewed By: pcc

Pull Request: https://github.com/llvm/llvm-project/pull/112787
2024-10-30 16:56:30 -07:00
Lei Wang
e517cfc531
[InstrPGO] Support cold function coverage instrumentation (#109837)
This patch adds support for cold function coverage instrumentation based
on sampling PGO counts. The major motivation is to detect dead functions
for the services that are optimized with sampling PGO. If a function is
covered by sampling profile count (e.g., those with an entry count > 0),
we choose to skip instrumenting those functions, which significantly
reduces the instrumentation overhead.

More details about the implementation and flags:
- Added a flag `--pgo-instrument-cold-function-only` in
`PGOInstrumentation.cpp` as the main switch to control skipping the
instrumentation.
- Built the extra instrumentation passes(a bundle of passes in
`addPGOInstrPasses`) under sampling PGO pipeline. This is controlled by
`--instrument-cold-function-only-path` flag.
- Added a driver flag `-fprofile-generate-cold-function-coverage`: 
- 1) Config the flags in one place, i,e. adding
`--instrument-cold-function-only-path=<...>` and
`--pgo-function-entry-coverage`. Note that the instrumentation file path
is passed through `--instrument-sample-cold-function-path`, because we
cannot use the `PGOOptions.ProfileFile` as it's already used by
`-fprofile-sample-use=<...>`.
- 2) makes linker to link `compiler_rt.profile` lib(see
[ToolChain.cpp#L1125-L1131](https://github.com/llvm/llvm-project/blob/main/clang/lib/Driver/ToolChain.cpp#L1125-L1131)
).
- Added a flag(`--pgo-cold-instrument-entry-threshold`) to config entry
count to determine cold function.

Overall, the full command is like:

```
clang++ -O2 -fprofile-generate-cold-function-coverage=<...> -fprofile-sample-use=<...>  code.cc -o code
```
2024-10-28 10:13:45 -07:00
Teresa Johnson
1de71652fd
[MemProf] Support cloning for indirect calls with ThinLTO (#110625)
This patch enables support for cloning in indirect callsites.

This is done by synthesizing callsite records for each virtual call
target from the profile metadata. In the thin link all the synthesized
records for a particular indirect callsite initially share the same
context node, but support is added to partition the callsites and
outgoing edges based on the callee function, creating a separate node
for each target.

In the LTO backend, when cloning is needed we first perform indirect
call promotion, then change the target of the new direct call to the
desired clone.

Note this is ThinLTO-specific, since for regular LTO indirect call
promotion should have already occurred.
2024-10-11 13:53:35 -07:00
Arthur Eubanks
e34d614e7d
[Passes] Remove -enable-infer-alignment-pass flag (#111873)
This flag has been on for a while without any complaints.
2024-10-10 12:28:46 -07:00
Hari Limaye
0a0f100a70
Revert "[LTO] Run Argument Promotion before IPSCCP" (#111839)
Reverts llvm/llvm-project#111163, as this was merged prematurely.
2024-10-10 15:03:01 +01:00
Hari Limaye
b9754e9d28
[LTO] Run Argument Promotion before IPSCCP (#111163)
Run ArgumentPromotion before IPSCCP in the LTO pipeline, to expose more
constants to be propagated. We also run PostOrderFunctionAttrs to
improve the information available to ArgumentPromotion's alias analysis,
and SROA to clean up allocas.
2024-10-10 06:08:27 -04:00
Mircea Trofin
9c0ba62010
[ctx_prof] Relax the "profile use" case around PGOOpt (#108265)
`PGOOpt` could have a value if, for instance, debug info for profiling
is requested. Relaxing the requirement, for now, following that
eventually we would factor `PGOOpt` to better capture the supported
interplay between the various profiling options.
2024-09-11 13:39:50 -07:00
Yuxuan Chen
761bf333e3
[LLVM][Coroutines] Switch CoroAnnotationElidePass to a FunctionPass (#107897)
After landing https://github.com/llvm/llvm-project/pull/99285 we found
that the call graph update was causing the following crash when
expensive checks are turned on
```
llvm-project/llvm/lib/Analysis/CGSCCPassManager.cpp:982: LazyCallGraph::SCC &updateCGAndAnalysisManagerForPass(LazyCallGraph &, LazyCallGraph::SCC &, LazyCallGraph::Node &, CGSCCAnalysisManager &, CGSCCUpdateResult &, FunctionAnalysisManager &, bool): Assertion `(RC == &TargetRC || RC->isAncestorOf(Targe
tRC)) && "New call edge is not trivial!"' failed.                                                                                                                                                                                                                                                                               
```
I have to admit I believe that the call graph update process I did for
that patch could be wrong.

After reading the code in `CGSCCToFunctionPassAdaptor`, I am convinced
that `CoroAnnotationElidePass` can be a FunctionPass and rely on the
adaptor to update the call graph for us, so long as we properly
invalidate the caller's analyses.

After this patch,
`llvm/test/Transforms/Coroutines/coro-transform-must-elide.ll` no longer
fails under expensive checks.
2024-09-09 18:57:39 -07:00
Mircea Trofin
3b22618094
[ctx_prof] Insert the ctx prof flattener after the module inliner (#107499)
This patch enables experimenting with the contextual profile. ICP is currently disabled in this case - will reenable it subsequently. Also subsequently the inline cost model / decision making would be updated to be context-aware. Right now, this just achieves "complete use" of the profile, in that it's ingested, maintained, and sunk to a flat profile when not needed anymore.

Issue [#89287](https://github.com/llvm/llvm-project/issues/89287)
2024-09-09 18:16:24 -07:00
Yuxuan Chen
a416267a5f
[LLVM][Coroutines] Transform "coro_elide_safe" calls to switch ABI coroutines to the noalloc variant (#99285)
This patch is episode three of the middle end implementation for the
coroutine HALO improvement project published on discourse:
https://discourse.llvm.org/t/language-extension-for-better-more-deterministic-halo-for-c-coroutines/80044

After we attribute the calls to some coroutines as "coro_elide_safe" in
the C++ FE and creating a `noalloc` ramp function, we use a new middle
end pass to move the call to coroutines to the noalloc variant.

This pass should be run after CoroSplit. For each node we process in
CoroSplit, we look for its callers and replace the attributed ones in
presplit coroutines to the noalloc one. The transformed `noalloc` ramp
function will also require a frame pointer to a block of memory it can
use as an activation frame. We allocate this on the caller's frame with
an alloca.

Please note that we cannot safely transform such attributed calls in
post-split coroutines due to memory lifetime reasons. The CoroSplit pass
is responsible for creating the coroutine frame spills for all the
allocas in the coroutine. Therefore it will be unsafe to create new
allocas like this one in post-split coroutines. This happens relatively
rarely because CGSCC performs the passes on the callees before the
caller. However, if multiple coroutines coexist in one SCC, this
situation does happen (and prevents us from having potentially unbound
frame size due to recursion.)

You can find episode 1: Clang FE of this patch series at
https://github.com/llvm/llvm-project/pull/99282
Episode 2: CoroSplit at https://github.com/llvm/llvm-project/pull/99283
2024-09-08 23:09:40 -07:00
Mingming Liu
d4ddf06b0c
[NFCI]Remove EntryCount from FunctionSummary and clean up surrounding synthetic count passes. (#107471)
The primary motivation is to remove `EntryCount` from `FunctionSummary`.
This frees 8 bytes out of `sizeof(FunctionSummary)` (136 bytes as of
64498c5483).

While I'm at it, this PR clean up {SummaryBasedOptimizations,
SyntheticCountsPropagation} since they were not used and there are no
plans to further invest on them.

With this patch, bitcode writer writes a placeholder 0 at the byte
offset of `EntryCount` and bitcode reader can parse the function entry
count at the correct byte offset. Added a TODO to stop writing
`EntryCount` and bump bitcode version
2024-09-06 16:38:17 -07:00
Mircea Trofin
775c50709c
[ctx_prof] Flattened profile lowering pass (#107329)
Pass to flatten and lower the contextual profile to profile (i.e. `MD_prof`) metadata. This is expected to be used after all IPO transformations have happened.

Prior to lowering, the instrumentation is maintained during IPO and the contextual profile is kept in sync (see PRs #105469, #106154). Flattening (#104539) sums up all the counters belonging to all a function's context nodes.

We first propagate counter values (from the flattened profile) using the same propagation algorithm as `PGOUseFunc::populateCounters`, then map the edge values to `branch_weights`. Functions. in the module that don't have an entry in the flattened profile are deemed cold, and any `MD_prof` metadata they may have is reset. The profile summary is also reset at this point.

Issue [#89287](https://github.com/llvm/llvm-project/issues/89287)
2024-09-06 13:47:08 -07:00