391 Commits

Author SHA1 Message Date
Nikita Popov
c23b4fbdbb
[IR] Remove size argument from lifetime intrinsics (#150248)
Now that #149310 has restricted lifetime intrinsics to only work on
allocas, we can also drop the explicit size argument. Instead, the size
is implied by the alloca.

This removes the ability to only mark a prefix of an alloca alive/dead.
We never used that capability, so we should remove the need to handle
that possibility everywhere (though many key places, including stack
coloring, did not actually respect this).
2025-08-08 11:09:34 +02:00
Lei Wang
8f3f93cd78
[SampleFDO] Match functions with the same base function name (#126688)
Sometimes, there may be no matched anchors but the functions still
match. e.g. if the function’s template typename changes, all the
callsites that use the type are mismatched and the caller function that
contains those callsite are mismatched. Introduce a check to match the
functions if their demangled base names are the same.
2025-03-24 10:11:55 -07:00
Lei Wang
0454dd8c48
[StaleProfileMatching] Use only profile anchor size for similarity calculation (#126783)
We observed that the number of IR anchors is usually greater than the
number of profile anchors, because IR anchors can be optimized away
later and llvm-profgen might not generate profiles for cold callsites.
This can cause missing function matches. I’m changing the similarity
calculation to use only the profile anchor size. In another point of
view, It also makes sense to reuse as many profile anchors as possible
regardless of the new functions in the IR.
2025-02-13 15:38:08 -08:00
Mingming Liu
5399782508
[IR] Generalize Function's {set,get}SectionPrefix to GlobalObjects, the base class of {Function, GlobalVariable, IFunc} (#125757)
This is a split of https://github.com/llvm/llvm-project/pull/125756
2025-02-06 14:51:13 -08:00
Lei Wang
068d0c0f4b
[CSSPGO] Turn on call-graph matching by default for CSSPGO (#125938)
Tested call-graph matching on some of Meta's large services, it works to
reuse some renamed function profiles, no negative perf or significant
build speed regression observed. Turned it on by default for CSSPGO
mode.
2025-02-06 11:54:59 -08:00
Haohai Wen
ccc8e45404
[PseudoProbe] Fix cleanup for pseudo probe after annotation (#119660)
When using -sample-profile-remove-probe, pseudo probe desc should
also be removed and dwarf discriminator for call instruction should
be restored.
2024-12-13 17:05:03 +08:00
Nikita Popov
462cb3cd6c
[InstCombine] Infer nusw + nneg -> nuw for getelementptr (#111144)
If the gep is nusw (usually via inbounds) and the offset is
non-negative, we can infer nuw.

Proof: https://alive2.llvm.org/ce/z/ihztLy
2024-12-05 14:36:40 +01:00
Lei Wang
6e60330af5
[SampleFDO] Read call-graph matching recovered top-level function profile (#101053)
With extbinary profile format, initial profile loading only reads
profile based on current function names in the module. However, if a
function is renamed, sample loader skips to load its original
profile(which has a different name), we will miss this case. To address
this, we load the top-level profile candidate explicitly for the
matching. If a match is found later, the function profile will be
further preserved for use by the sample loader.
2024-09-04 11:08:40 -07:00
Lei Wang
0f28aa632c
[CSSPGO] Return error_code for missing probe profile (#102085)
We should undo this https://reviews.llvm.org/D102007 after undoing
https://reviews.llvm.org/D104477 to use missing probe to represent
unknown count, zero count and unknown count are different to profile
inference.

It only affects post-inline(linker) pipeline(with
`--overwrite-existing-weights` on ) and flow-sensitive FDO.
2024-08-13 08:12:04 -07:00
Lei Wang
18cdfa72e0
[SampleFDO] Stale profile call-graph matching (#95135)
Profile staleness could be due to function renaming. Given that sample
profile loader relies on exact string matching, a trivial change in the
function signature( such as `int foo()` --> `long foo()` ) can make the
mangled name different, the function profile(including all nested
children profile) becomes unavailable.

This patch introduces stale profile call-graph level matching, targeting
at identifying the trivial function renaming and reusing the old
function profile.

Some noteworthy details:

1. Extend the LCS based CFG level matching to identify new function. 
- Extend to match function and profile have different name instead of
the exact function name matching. This leverages LCS, i.e during the
finding of callsite anchor matching, when two function name are
different, try matching the functions instead of return.
- In LCS, the equal function check is replaced by
`functionMatchesProfile`.
- Only try matching functions that are new functions(neither appears on
each side). This reduces the matching scope as we don't need to match
the originally matched function.
2.  Determine the matching by call-site anchor similarity check.
- A new function `functionMatchesProfile(IRFunc, ProfFunc)` is used to
check the renaming for the possible <IRFunc, ProfFunc> pair, use the
LCS(diff) matching to compute the equal set and we define: `Similarity =
|equalSet * 2| / (|A| + |B|)`. The profile name is marked as renamed if
the similarity is above a
threshold(`-func-profile-similarity-threshold`)

3.  Process the matching in top-down function order 
- when a caller's is done matching, the new function names are saved for
later use, using top-down order will maximize the reused results.
- `ProfileNameToFuncMap` is used to save or cache the matching result.
4. Update the original profile at the end using `ProfileNameToFuncMap`.

5. Added a new switch --salvage-unused-profile to control this, default
is false.

Verified on one Meta's internal big service, confirmed 90%+ of the found
renaming pair is good. (There could be incorrect renaming pair if the
num of the anchor is small, but checked that those functions are simple
cold function)
2024-07-17 10:33:00 -07:00
Krzysztof Pszeniczny
21ba91c439
[SamplePGO] Add a cutoff for number of profile matching anchors (#95542)
The algorithm added by PR #87375 can be potentially quadratic in the
number of anchors. This is almost never a problem because normally
functions have a reasonable number of function calls.

However, in some rare cases of auto-generated code we observed very
large functions that trigger quadratic behaviour here (resulting in
>130GB of peak heap memory usage for clang). Let's add a knob for
controlling the max number of callsites in a function above which stale
profile matching won't be performed.
2024-06-17 19:50:58 +02:00
Lei Wang
e20b904721
[PseudoProbe] Make probe discriminator compatible with dwarf base discriminator (#94506)
It's useful if the probe-based build can consume a dwarf based
profile(e.g. the profile transition), before there is a conflict for the
discriminator, this change tries to mitigate the issue by encoding the
dwarf base discriminator into the probe discriminator.
As the num of probe id(num of basic block and calls) starts from 1,
there are some unused space. We try to reuse some bit of the probe id.
The new encode rule is:
- Use a bit to [28:28] to indicate whether dwarf base discriminator is
encoded.(fortunately we can borrow this bit from the `PseudoProbeType`)
- If the bit is set, use [15:3] for probe id, [18:16] for dwarf base
discriminator. Otherwise, still use [18:3] for probe id.

Note that these doesn't affect the original probe id capacity, we still
prioritize probe id encoding, i.e. the base discriminator is not encoded
when probe id is bigger than [15:3].
 
Then adjust `getBaseDiscriminatorFromDiscriminator` to use the base
discriminator from the probe discriminator.
2024-06-07 11:37:49 -07:00
William Junda Huang
5a23d31c50
[Sample Profile] Check hot callsite threshold when inlining a function with a sample profile (#93286)
Currently if a callsite is hot as determined by the sample profile, it
is unconditionally inlined barring invalid cases (such as recursion).
Inline cost check should still apply because a function's hotness and
its inline cost are two different things.
For example if a function is calling another very large function
multiple times (at different code paths), the large function should not
be inlined even if its hot.
2024-05-28 16:41:53 -04:00
Lei Wang
5b6f151104
[SampleFDO] Improve stale profile matching by diff algorithm (#87375)
This change improves the matching algorithm by using the diff algorithm,
the current matching algorithm only processes the callsites grouped by
the same name functions, it doesn't consider the order relationships
between different name functions, this sometimes fails to handle this
ambiguous anchor case. For example. (`Foo:1` means a
calliste[callee_name: callsite_location])
```
IR :      foo:1  bar:2  foo:4  bar:5 
Profile :        bar:3  foo:5  bar:6
```
The `foo:1` is matched to the 2nd `foo:5` and using the diff
algorithm(finding longest common subsequence ) can help on this issue.
One well-known diff algorithm is the Myers diff algorithm(paper "An
O(ND) Difference Algorithm and Its Variations∗" Eugene W. Myers), its
variations have been implemented and used in many famous tools, like the
GNU diff or git diff. It provides an efficient way to find the longest
common subsequence or the shortest edit script through graph searching.
There are several variations/refinements for the algorithm, but as in
our case, the num of function callsites is usually very small, so we
implemented the basic greedy version in this change which should be good
enough.
We observed better matchings and positive perf improvement on our
internal services.
2024-05-13 16:01:29 -07:00
Nabeel Omer
686a206b26
[SampleProfileLoader] Fix integer overflow in generateMDProfMetadata (#90217)
This patch fixes an integer overflow in the SampleProfileLoader pass.
The issue occurs when weights are saturated and Profi isn't being used.

This patch also adds a newline to a debug message to make it more
readable.
2024-05-08 14:32:56 +01:00
Krzysztof Pszeniczny
e06d6ed1ef
[SamplePGO] Handle FS discriminators in SampleProfileMatcher (#90858)
Currently the code uses FunctionSamples::getCallSiteIdentifier which
will sometimes incorrectly guess that FSAFDO discriminators are probe
based and will convert them incorrectly.

This change doesn't affect builds which don't use FSAFDO, it only fixes
sample profile matching with FS discriminators.

The test for this is manually updated to use discriminator value 15,
which is a perfectly valid base discriminator in the FS world, but
satisfies `isPseudoProbeDiscriminator`, so
`getBaseDiscriminatorFromDiscriminator` will incorrectly extract the
probe index from it.

Note: this change only affects how the base discriminators will be
extracted when doing stale profile matching in the IR-level sample
profile loader. It doesn't add stale profile matching to the MIR-level
FS profile loader pass.
2024-05-02 10:32:37 -07:00
Lei Wang
b7248d5363
[PseudoProbe] Add an option to remove pseudo probes after profile annotation (#90293)
This can be used for testing perf overhead of pseudo-probe.
2024-04-29 09:27:33 -07:00
Lei Wang
5bbce06ac6
[PseudoProbe] Mix block and call probe ID in lexical order (#75092)
Before all the call probe ids are after block ids, in this change, it
mixed the call probe and block probe by reordering them in
lexical(line-number) order. For example:
```
main():
BB1
if(...) 
  BB2 foo(..);   
else 
  BB3 bar(...);
BB4
```
Before the profile is
```
main
 1: ..
 2: ..
 3: ...
 4: ...
 5: foo ...
 6: bar ...
 ```
 Now the new order is
```
 main
 1: ..
 2: ..
 3: foo ...
 4: ...
 5: bar ...
 6: ...
```
This can potentially make it more tolerant of profile mismatch, either from stale profile or frontend change. e.g. before if we add one block, even the block is the last one, all the call probes are shifted and mismatched. Moreover, this makes better use of call-anchor based stale profile matching. Blocks are matched based on the closest anchor, there would be more anchors used for the matching, reduce the mismatch scope.
2024-04-03 11:18:29 -07:00
Lei Wang
a94a3cd3d6
Always check the function attribute to determine checksum mismatch for available_externally functions (#87279)
This is to fix an assertion error. Apparently, `pseudo_probe_desc` could
still be available for import functions, and its checksum mismatch state
can be different from import function's `profile-checksum-mismatch`
attr. This happens when unstable IR or ODR violation issue occurs, the
definitions of the same function across different translation units
could be different and result in different checksums. During link time
deduplication, the internal function definition (the checksum in desc is
computed based on) is substituted by the `available_externally`
definition, which cause the inconsistency. Hence, we fix it to by always
checking the state for the new `available_externally` definition, which
is saved in the function attribute.
2024-04-03 10:58:17 -07:00
Krzysztof Pszeniczny
17642c7602
[SamplePGO] Support -salvage-stale-profile without probes too (#86116)
Currently -salvage-stale-profile is a no-op if the profile is not
probe-based. We observed that it can help for regular, non-probe- based
profiles too: some of our internal benchmarks show 0.2-0.3% QPS
improvement.

There seems to be no good reason to limit this flag to only work for
probe-based profiles.
2024-04-03 09:46:00 -07:00
Lei Wang
b8cc3ba409
[PseudoProbe] Extend to skip instrumenting probe into the dests of invoke (#79919)
As before we only skip instrumenting probe of `unwind`(`KnownColdBlock`)
block, this PR extends to skip the both EH flow from `invoke`, i.e. also
skip the `normal` dest. For more contexts: when doing call-to-invoke
conversion, the block is split by the `invoke` and two extra
blocks(`normal` and `unwind`) are added. With this PR, the
instrumentation is the same as the one before the call-to-invoke
conversion.
One significant benefit is this can help mitigate the "unstable IR"
issue(https://discourse.llvm.org/t/ipo-for-linkonce-odr-functions/69404),
the two versions now are on the same probe instrumentation, expected to
be the same checksum.
To achieve the same checksum, some tweaks is needed:
-  Now it also skips incrementing the probe ID for the skipped probe.
- The checksum is also computed based on the CFG that skips the EH
edges.

We observed this fixes ~5% mismatched samples.
2024-04-01 13:54:54 -07:00
Lei Wang
1d99d7a6f8
[SampleFDO][NFC] Refactoring SampleProfileMatcher (#86988)
Move all the stale profile matching stuffs into new files so that it can
be shared for unit testing.
2024-03-28 20:03:03 -07:00
Xiangyang (Mark) Guo
1607e8212c
[InlineCost] Disable cost-benefit when sample based PGO is used (#86626)
#66457 makes InlineCost to use cost-benefit by default, which causes
0.4-0.5% performance regression on multiple internal workloads. See
discussions https://github.com/llvm/llvm-project/pull/66457. This pull
request reverts it.

Co-authored-by: helloguo <helloguo@meta.com>
2024-03-28 10:11:57 -07:00
Lei Wang
f8bab38b6d
[CSSPGO] Fix the issue of missing callee profile matches (#85715)
Two fixes related to the callee/inlinee profile:

1. Fix the bug that the matching results are missing to distribute to
the callee profiles (should be pass-by-reference).
2. Narrow imported function matching to checksum mismatched functions. 

More context: before we run matchings for all imported functions even
checksums are matched, however, after we fix 1), we got a regression,
it's likely due to the matching is not no-op for checksum matched
function, so we want to make it consistent to only run matching for
checksum mismatched (imported)functions. Since the
metadata(pseudo_probe_desc) are dropped for imported function, we
leverage the function attribute mechanism and add a new function
attribute(`profile-checksum-mismatch`) to transfer the info from
pre-link to post-link.
2024-03-27 22:27:22 -07:00
Lei Wang
2598aa67c8
[CSSPGO] Reject high checksum mismatched profile (#84097)
Error out the build if the checksum mismatch is extremely high, it's
better to drop the profile rather than apply the bad profile.
Note that the check is on a module level, the user could make big
changes to functions in one single module but those changes might not be
performance significant to the whole binary, so we want to be
conservative, only expect to catch big perf regression. To do this, we
select a set of the "hot" functions for the check. We use two
parameter(`hot-func-cutoff-for-staleness-error` and
`min-functions-for-staleness-error`) to control the function selection
to make sure the selected are hot enough and the num of function is not
small.
Tuned the parameters on our internal services, it works to catch big
perf regression due to the high mismatch .
2024-03-27 11:14:21 -07:00
Lei Wang
12a2bc301f
[CSSPGO] Fix the issue of preinliner import function list (#85719)
By design, when the nested profile is pre-inliner based, we should fully
honor pre-inliner decision, fix it by setting threshold to zero. We
observed a perf win on one internal service, no negative impact for
other big services.
2024-03-19 16:50:48 -07:00
Lei Wang
c98da372cb
[CSSPGO] Compute and report profile matching recovered callsites and samples (#79090)
This change adds the support to compute and report the staleness metrics
after stale profile matching so that we can know how effective the fuzzy
matching is, i. e. how many callsites and samples are recovered by the
matching.
Some implementation notes:
- The function checksum mismatch metrics are not applicable here as it's
function-level metrics, checksum mismatch remains the same before and
after matching, so we need to compute based on the callsite samples.
- Added two new counters `NumRecoveredCallsites`,
`RecoveredCallsiteSamples` for this and removed `TotalCallsiteSamples`
as now the we can use the `TotalFuncHashSamples` as base, and renamed
some counters.
- In profile matching, we changed to use a state machine to represent
the callsite's matching state changes. See the `MatchState` for the
state, and used a new function `recordCallsiteMatchStates` to compute
and record the callsite's match states changes before and after the
matching, , the result is compressed and saved into a
`FuncCallsiteMatchStates` map for later counting use.
- Changed the counting function to run on module-level and moved it to
the end of the whole process(`computeAndReportProfileStaleness`). The
reason is before the callsite is only counted on top-level function,
this change extends it to count(recursively) on the inlined functions
and samples, which is more accurate.
2024-02-19 11:36:20 -08:00
Nikita Popov
90ba33099c
[InstCombine] Canonicalize constant GEPs to i8 source element type (#68882)
This patch canonicalizes getelementptr instructions with constant
indices to use the `i8` source element type. This makes it easier for
optimizations to recognize that two GEPs are identical, because they
don't need to see past many different ways to express the same offset.

This is a first step towards
https://discourse.llvm.org/t/rfc-replacing-getelementptr-with-ptradd/68699.
This is limited to constant GEPs only for now, as they have a clear
canonical form, while we're not yet sure how exactly to deal with
variable indices.

The test llvm/test/Transforms/PhaseOrdering/switch_with_geps.ll gives
two representative examples of the kind of optimization improvement we
expect from this change. In the first test SimplifyCFG can now realize
that all switch branches are actually the same. In the second test it
can convert it into simple arithmetic. These are representative of
common optimization failures we see in Rust.

Fixes https://github.com/llvm/llvm-project/issues/69841.
2024-01-24 15:25:29 +01:00
Nick Anderson
f1ec0d12bb
Port CodeGenPrepare to new pass manager (and BasicBlockSectionsProfil… (#77182)
Port CodeGenPrepare to new pass manager and dependency
BasicBlockSectionsProfileReader
Fixes: #75380

Co-authored-by: Krishna-13-cyber <84722531+Krishna-13-cyber@users.noreply.github.com>
2024-01-09 13:32:59 +07:00
Simon Pilgrim
7648371c25 Revert 4d7c5ad58467502fcbc433591edff40d8a4d697d "[NewPM] Update CodeGenPreparePass reference in CodeGenPassBuilder (#77054)"
Revert e0c554ad87d18dcbfcb9b6485d0da800ae1338d1 "Port CodeGenPrepare to new pass manager (and BasicBlockSectionsProfil… (#75380)"

Revert #75380 and #77054 as they were breaking EXPENSIVE_CHECKS buildbots: https://lab.llvm.org/buildbot/#/builders/104
2024-01-05 12:28:10 +00:00
Nick Anderson
e0c554ad87
Port CodeGenPrepare to new pass manager (and BasicBlockSectionsProfil… (#75380)
Port CodeGenPrepare to new pass manager and dependency
BasicBlockSectionsProfileReader
Fixes: #64560

Co-authored-by: Krishna-13-cyber <84722531+Krishna-13-cyber@users.noreply.github.com>
2024-01-05 13:47:56 +07:00
Yingwei Zheng
1228becf7d
[FuncAttrs] Deduce noundef attributes for return values (#76553)
This patch deduces `noundef` attributes for return values.
IIUC, a function returns `noundef` values iff all of its return values
are guaranteed not to be `undef` or `poison`.
Definition of `noundef` from LangRef:
```
noundef
This attribute applies to parameters and return values. If the value representation contains any 
undefined or poison bits, the behavior is undefined. Note that this does not refer to padding 
introduced by the type’s storage representation.
```
Alive2: https://alive2.llvm.org/ce/z/g8Eis6

Compile-time impact: http://llvm-compile-time-tracker.com/compare.php?from=30dcc33c4ea3ab50397a7adbe85fe977d4a400bd&to=c5e8738d4bfbf1e97e3f455fded90b791f223d74&stat=instructions:u
|stage1-O3|stage1-ReleaseThinLTO|stage1-ReleaseLTO-g|stage1-O0-g|stage2-O3|stage2-O0-g|stage2-clang|
|--|--|--|--|--|--|--|
|+0.01%|+0.01%|-0.01%|+0.01%|+0.03%|-0.04%|+0.01%|

The motivation of this patch is to reduce the number of `freeze` insts
and enable more optimizations.
2023-12-31 20:44:48 +08:00
Mircea Trofin
bb6497ffa6
[BPI] Reuse the AsmWriter's BB naming scheme in BranchProbabilityPrinterPass (#73593)
When using `BranchProbabilityPrinterPass`, if a BB has no name, we get pretty unusable information like `edge -> has probability...` (i.e. we have no idea what the vertices of that edge are).

This patch uses `printAsOperand`, which uses the same naming scheme as `Function::dump`, so for example during debugging sessions, the IR obtained from a function and the names used by `BranchProbabilityPrinterPass` will match.

A shortcoming is that `printAsOperand` will result in the numbering algorithm re-running for every edge and every vertex (when `BranchProbabilityPrinterPass` is run on a function). If, for the given scenario, this is a problem, we can revisit this subsequently.

Another nuance is that the entry basic block will be numbered, which may be slightly confusing when it's anonymous, but it's easily identifiable - the first edge would have it as source (and the number should be easily recognizable)
2023-12-02 13:01:48 -08:00
William Junda Huang
683f2df6e5
[SampleProfile] Fix bug where remapper returns empty string and crashing Sample Profile loader (#71479)
Normally SampleContext does not allow using an empty StirngRef to
construct an object, this is to prevent bugs reading the profile.
However empty names may be emitted by a function which its name is
intentionally set to empty, or a bug in the remapper that returns an
empty string. Regardless, converting it to FunctionId first will prevent
the assert, and that assert check is unnecessary, which will be
addressed in another patch
2023-11-10 21:38:13 +00:00
Matthias Braun
e3cf80c5c1
BlockFrequencyInfoImpl: Avoid big numbers, increase precision for small spreads
BlockFrequencyInfo calculates block frequencies as Scaled64 numbers but as a last step converts them to unsigned 64bit integers (`BlockFrequency`). This improves the factors picked for this conversion so that:

* Avoid big numbers close to UINT64_MAX to avoid users overflowing/saturating when adding multiply frequencies together or when multiplying with integers. This leaves the topmost 10 bits unused to allow for some room.
* Spread the difference between hottest/coldest block as much as possible to increase precision.
* If the hot/cold spread cannot be represented loose precision at the lower end, but keep the frequencies at the upper end for hot blocks differentiable.
2023-10-24 20:27:39 -07:00
Fangrui Song
001af0f894 [MC] Actually make .pseudoprobe created sections deterministic
Fix a18ee8b7c95c6dfa410c6acaaf8cffcfde1220b5 to use a comparator
that actually works: assign an ordinal to registered section.
2023-09-20 22:41:28 -07:00
HaohaiWen
954979d681 Reland [InlineCost] Enable the cost benefit analysis for Sample PGO (#66457)
Enables the cost-benefit-analysis-based inliner by default if we have
sample profile.

No extra fix is required.
2023-09-21 12:44:24 +08:00
Haohai Wen
486fc81583 Revert "[InlineCost] Enable the cost benefit analysis for Sample PGO (#66457)"
This reverts commit 2f2319cf2406d9830a331cbf015881c55ae78806.
2023-09-21 10:30:39 +08:00
HaohaiWen
2f2319cf24
[InlineCost] Enable the cost benefit analysis for Sample PGO (#66457)
Enables the cost-benefit-analysis-based inliner by default if we have
sample profile.
2023-09-21 09:21:55 +08:00
Fangrui Song
a18ee8b7c9 [MC] Make .pseudo_probe created sections deterministic after D91878
MCPseudoProbeSections::emit iterates over MCProbeDivisions and creates sections.
When the map key is MCSymbol *, the iteration order is not stable. The
underlying BumpPtrAllocator largely decreases the flakiness. That said, two
elements may sit in two different allocations from BumpPtrAllocator, with
an unpredictable order. Under tcmalloc,
llvm/test/Transforms/SampleProfile/pseudo-probe-emit.ll fails about 7 times per
1000 runs.
2023-09-20 18:11:14 -07:00
Simon Pilgrim
e6b85c3027 [DAG] FoldSetCC - add missing icmp(X,undef) -> isTrueWhenEqual case (REAPPLIED)
Followup to D59363 which failed to handle the icmp(X,undef) -> isTrueWhenEqual case - similar to llvm::ConstantFoldCompareInstruction

As discussed on the review, this is affecting some previously reduced test cases, but will also prevent reductions from relying on this inconsistent behaviour in the future.

Reapplied after reversion at e1e3c75c7dad72 with a tweak to the pseudo-probe-peep.ll test

Differential Revision: https://reviews.llvm.org/D158068
2023-09-13 12:33:39 +01:00
Hongtao Yu
6b856abc6f
[PseudoProbe] Use probe id as the base dwarf discriminator for callsites (#65685)
With `-fpseudo-probe-for-profiling`, the dwarf discriminator for a
callsite will be overwritten to pseudo probe related information for
that callsite. The probe information is encoded in a special format
(i.e., with all lowest three digits be one) in order to be distinguished
from regular dwarf discriminator. The special encoding format will be
decoded to zero by the regular discriminator logic. This means all
callsites would have a zero discriminator in both the sample profile and
the compiler, for classic AutoFDO. This is inconvenient in that no
decent classic AutoFDO can be generated from a pseudo probe build. I'm
mitigating the issue by allowing callsite probe id to be used as the
base dwarf discriminator for classic AutoFDO, since probe id is also
unique and can be used to differentiate callsites on the same source
line.
2023-09-08 09:49:54 -07:00
wlei
4bb6bbb9bf [CSSPGO] Skip reporting staleness metrics for imported functions
Accumulating the staleness metrics from per-link is less accurate than doing it from post-link time(assuming we use the offline profile mismatch as baseline), the reason is that there are some duplicated reports for the same functions, for example, one template function could be included in multiple TUs, but in post thin link time, only one function are kept(linkonce_odr) and others are marked as available-externally function. Hence, this change skips reporting the metrics for imported functions(available-externally).

I saw the post-link number is now very close to the offline number(dump the mismatched functions and count the metrics offline based on the entire profile), sightly smaller than offline number due to some missing inlined functions.

Reviewed By: hoy, wenlei

Differential Revision: https://reviews.llvm.org/D156725
2023-08-30 18:00:23 -07:00
wlei
3365cd4544 [CSSPGO] Compute checksum mismatch recursively on nested profile
Follow-up diff for https://reviews.llvm.org/D158891. Compute the checksum mismatch based on the original nested profile. Additionally, use a recursive way to compute the children mismatched samples in the nested tree even the top-level func checksum is matched.

Reviewed By: hoy, wenlei

Differential Revision: https://reviews.llvm.org/D158900
2023-08-30 18:00:23 -07:00
wlei
62a3f6c96e [CSSPGO] Retire FlattenProfileForMatching
- Always use flattened profile to find the profile anchors. Since profile under different contexts may have different inlined callsites, to get more profile anchors, we use a merged profile from all the contexts(the flattened profile) to find callsite anchors.

- Compute the staleness metrics based on the original nested profile, as currently once a callsite is mismatched, all its children profile are dropped.(TODO: in future, we can improve to reuse the children valid profile)

Reviewed By: hoy, wenlei

Differential Revision: https://reviews.llvm.org/D158891
2023-08-30 18:00:23 -07:00
wlei
062af2e763 [CSSPGO] Support stale profile matching for LTO
As in per-link time, callsites could be optimized out by inlining, we don't have those original call targets in the IR in LTO time. Additionally, the inlined code doesn't actually belong to the original function, the IR locations or pseudo probe parsed from it are incorrect and could mislead the matching later.

This change adds the support to extract the original IR location info from the inlined code, specifically, it make sure to skip all the inlined code that doesn't belong the original function, but before that, it processes the inline frames of the debug info to extract the base frame and recover its callsite and callee target(name).

Measured on some stale profile instances, all showed some perf improvements.

Reviewed By: hoy, wenlei

Differential Revision: https://reviews.llvm.org/D156722
2023-08-30 18:00:23 -07:00
wlei
148cceb0d6 [CSSPGO] Refactoring SampleProfileMatcher::runOnFunction
- rename `IRLocation` --> `IRAnchors`,  `ProfileLocation` --> `ProfileAnchors`
- reorganize runOnFunction, fact out the finding IR anchors code into `findIRAnchors`
- introduce a new function `findProfileAnchors` to populate the profile related anchors, the result is saved into `ProfileAnchors`, it's later used for both mismatch report and matching, this can avoid to parse the `getBodySamples` and `getCallsiteSamples` for multiple times.
- move the `MatchedCallsiteLocs` stuffs from `findIRAnchors` to `countProfileMismatches` so that all the staleness metrics report are computed in one function.
- move all matching related into `runStaleProfileMatching`, and move all mismatching report into `countProfileMismatches`

Reviewed By: wenlei

Differential Revision: https://reviews.llvm.org/D158817
2023-08-30 18:00:23 -07:00
Sameer Sahasrabuddhe
f7031c41ec [NFC] strengthen some CHECK-NOT lines
The affected lit tests failed when they were run in a path that contained the
word "call". CHECK-NOT lines that were supposed to match only the IR ended up
matching the path printed in the output. Fixed this by checking for "call void"
instead.
2023-08-07 16:50:03 +05:30
wlei
bfefeeb139 [SamplePGO] Fix ICE that callee samples returns null while finding import functions
We found that in a special condition, the input callee `Samples` is null for `findExternalInlineCandidate`, which caused an ICE.

In some rare cases, call instruction could be changed after being pushed into inline candidate queue, this is because earlier inlining may expose constant propagation which can change indirect call to direct call. When this happens, we may fail to find matching function samples for the candidate later(for example if the profile is stale), even if a match was found when the candidate was enqueued.

See this reduced program:

file1.c:
```
 int bar(int x);

 int(*foo())()  {
  return bar;
 };

void func()
 {
     int (*fptr)(int);
     fptr = foo();
     a +=  (*fptr)(10);
 }
```
file2.c:
```
int bar(int x) { return x + 1;}
```

The two CALL: `foo` and `(*ptr)` are pushed into the queue at the beginning, say `foo` is hotter and popped first for inlining. During the inlining of `foo`, it performs the constant propagation for the function pointer `bar` and then changed `(*ptr)` to a direct call `bar(..)`. Note that at this time, `(*ptr)/bar` is still in the queue, later while it's popped out for inlining, it use the a different target name(bar) to look for the callee samples. At the same time, if the profile is stale and the new function is different from the old function in the profile, then this led the return of the null callee sample.

Reviewed By: hoy, wenlei

Differential Revision: https://reviews.llvm.org/D154637
2023-07-07 14:56:21 -07:00
Fangrui Song
2cb8d5ca3a [Pseudo Probe] Do not place functions in nodeduplicate COMDATs
For a function not in an IR COMDAT, currently we place it into a nodeduplicate IR
COMDAT so that its text section and its associated .pseudo_probe section will be
in the same section group, which can be retained or discarded by the linker as a
unit. However, the section group wastes space.

After D153189 uses SHF_LINK_ORDER to ensure a .pseudo_probe section will be
discarded when its associated text section is discarded, we can remove the
nodeduplicate IR change.

In the following example, the .pseudo_probe associated with .text.f is discarded as expected.
```
clang -c -ffunction-sections -fpseudo-probe-for-profiling -xc =(printf 'void _start(){} void f(){}') -o a.o
ld.lld --gc-sections --print-gc-sections a.o
```

Reviewed By: hoy

Differential Revision: https://reviews.llvm.org/D153191
2023-06-17 15:40:20 -07:00