Add support for enabling the sample loader pass in O0 mode (under
`-fsample-profile-use`). This can help verify PGO raw profile count
quality or provide a more accurate performance proxy (predictor), as O0
mode has minimal or no compiler optimizations that might otherwise
impact profile count accuracy.
- Explicitly disable the sample loader inlining to ensure it only emits
sampling annotation.
- Use flattened profile for O0 mode.
- Add the pass after `AddDiscriminatorsPass` to work with
`-fdebug-info-for-profiling`.
Profile staleness can be caused by function renaming. Because the sample
profile loader relies on exact string matching, a trivial change in the
function signature (such as `int foo()` --> `long foo()`) can change the
mangled name, making the function profile (including all nested
children profiles) unavailable.
This patch introduces stale profile call-graph level matching, aimed at
identifying trivial function renames and reusing the old function
profile.
Some noteworthy details:
1. Extend the LCS-based CFG-level matching to identify new functions.
- Allow matching a function against a profile with a different name
instead of requiring an exact function name match. This leverages LCS,
i.e. while finding callsite anchor matches, when two function names
differ, try matching the functions instead of returning.
- In LCS, the function equality check is replaced by
`functionMatchesProfile`.
- Only try matching new functions (functions whose name appears on only
one side, either the IR or the profile). This reduces the matching scope,
as we don't need to match functions that were already matched.
2. Determine the matching by a call-site anchor similarity check.
- A new function `functionMatchesProfile(IRFunc, ProfFunc)` checks
whether a candidate <IRFunc, ProfFunc> pair is a rename. It uses the
LCS (diff) matching to compute the equal set, and we define `Similarity =
2 * |equalSet| / (|A| + |B|)`, where A and B are the IR and profile
anchor lists. The profile name is marked as renamed if the similarity is
above a threshold (`-func-profile-similarity-threshold`).
3. Process the matching in top-down function order.
- When a caller is done matching, the new function names are saved for
later use; using top-down order maximizes reuse of these results.
- `ProfileNameToFuncMap` is used to save or cache the matching result.
4. Update the original profile at the end using `ProfileNameToFuncMap`.
5. Add a new switch `--salvage-unused-profile` to control this; the
default is false.
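The similarity check in item 2 can be sketched as follows. This is a minimal illustration, not the patch's code: `lcsLength`, `anchorSimilarity`, and `looksRenamed` are hypothetical names, anchors are modeled as plain callee-name strings, and a textbook dynamic-programming LCS stands in for the diff-style LCS used by the matcher.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Length of the longest common subsequence of two anchor sequences.
// Textbook O(|A| * |B|) dynamic programming (an assumption of this sketch).
std::size_t lcsLength(const std::vector<std::string> &A,
                      const std::vector<std::string> &B) {
  std::vector<std::vector<std::size_t>> DP(
      A.size() + 1, std::vector<std::size_t>(B.size() + 1, 0));
  for (std::size_t I = 1; I <= A.size(); ++I)
    for (std::size_t J = 1; J <= B.size(); ++J)
      DP[I][J] = A[I - 1] == B[J - 1]
                     ? DP[I - 1][J - 1] + 1
                     : std::max(DP[I - 1][J], DP[I][J - 1]);
  return DP[A.size()][B.size()];
}

// Similarity = 2 * |equalSet| / (|A| + |B|), a value in [0, 1].
double anchorSimilarity(const std::vector<std::string> &A,
                        const std::vector<std::string> &B) {
  if (A.empty() && B.empty())
    return 0.0; // No anchors means no evidence of a rename.
  return 2.0 * static_cast<double>(lcsLength(A, B)) /
         static_cast<double>(A.size() + B.size());
}

// A pair is treated as a rename when the similarity is above the threshold.
bool looksRenamed(const std::vector<std::string> &IRAnchors,
                  const std::vector<std::string> &ProfAnchors,
                  double Threshold) {
  return anchorSimilarity(IRAnchors, ProfAnchors) > Threshold;
}
```

With few anchors a single coincidental match moves the ratio a lot, which is consistent with the caveat that incorrect pairs tend to show up when the number of anchors is small.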
Verified on one of Meta's large internal services; confirmed that 90%+
of the renaming pairs found are good. (There can be incorrect renaming
pairs if the number of anchors is small, but we checked that those
functions are simple cold functions.)
Also some control flow simplifications.
Notably, this doesn't address `sampleprof_error`. I *think* the style
there tries to match `std::error_category`.
Also left `hash_value` as-is, because it matches what we do in Hashing.h
Note that the version of getValueProfDataFromInst that returns bool
has been "deprecated" since:
commit 1e15371dd8843dfc52b9435afaa133997c1773d8
Author: Mingming Liu <mingmingl@google.com>
Date: Mon Apr 1 15:14:49 2024 -0700
…f weights" #95136
Reverts #95060, and relands #86609, with the unintended code generation
changes addressed.
This patch implements the changes to LLVM IR discussed in
https://discourse.llvm.org/t/rfc-update-branch-weights-metadata-to-allow-tracking-branch-weight-origins/75032
In this patch, we add an optional field to MD_prof metadata nodes for
branch weights, which can be used to distinguish weights added from
`llvm.expect*` intrinsics from those added via other methods, e.g. from
profiles or inserted by the compiler.
One of the major motivations is for use with MisExpect diagnostics,
which need to know if branch_weight metadata originates from an
llvm.expect intrinsic. Without that information, we end up checking
branch weights multiple times in the case of ThinLTO + SampleProfiling,
leading to some inaccuracy in how we report MisExpect-related
diagnostics to users.
Since we change the format of MD_prof metadata in a fundamental way, we
need to update code handling branch weights in a number of places.
We also update the LangRef for branch weights to reflect the change.
Currently if a callsite is hot as determined by the sample profile, it
is unconditionally inlined barring invalid cases (such as recursion).
Inline cost check should still apply because a function's hotness and
its inline cost are two different things.
For example, if a function calls another very large function multiple
times (on different code paths), the large function should not be
inlined even if it's hot.
This patch fixes an integer overflow in the SampleProfileLoader pass.
The issue occurs when weights are saturated and Profi isn't being used.
This patch also adds a newline to a debug message to make it more
readable.
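The message does not spell out the fix, so the following is only a sketch of the general failure mode, not the patch's code. When a weight is already saturated at the maximum value, adding to it with plain unsigned arithmetic wraps around; a saturating add (the helper name `saturatingAdd` is hypothetical) clamps instead.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Adds two unsigned weights and clamps to the maximum value instead of
// wrapping around when the mathematical sum does not fit in 64 bits.
uint64_t saturatingAdd(uint64_t A, uint64_t B) {
  uint64_t Sum = A + B; // Unsigned overflow wraps modulo 2^64.
  if (Sum < A)          // A wrapped sum is smaller than either operand.
    return std::numeric_limits<uint64_t>::max();
  return Sum;
}
```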
Two fixes related to the callee/inlinee profile:
1. Fix a bug where the matching results were not distributed to the
callee profiles (the map should be passed by reference).
2. Narrow imported function matching to checksum-mismatched functions.
More context: previously we ran matching for all imported functions even
when their checksums matched. However, after fixing 1), we saw a
regression, likely because matching is not a no-op for checksum-matched
functions, so for consistency we now only run matching for
checksum-mismatched (imported) functions. Since the metadata
(`pseudo_probe_desc`) is dropped for imported functions, we leverage the
function attribute mechanism and add a new function attribute
(`profile-checksum-mismatch`) to transfer the info from pre-link to
post-link.
Error out of the build if the checksum mismatch is extremely high; it's
better to drop the profile than to apply a bad one.
Note that the check is at module level: the user could make big changes
to functions in a single module, but those changes might not be
performance-significant for the whole binary, so we want to be
conservative and only expect to catch big perf regressions. To do this,
we select a set of "hot" functions for the check. We use two parameters
(`hot-func-cutoff-for-staleness-error` and
`min-functions-for-staleness-error`) to control the function selection,
making sure the selected functions are hot enough and the number of
functions is not small.
We tuned the parameters on our internal services; the check catches big
perf regressions caused by high mismatch.
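The selection logic described above can be sketched roughly as below. All names here (`FuncStat`, `shouldErrorOnStaleness`, and its parameters) are hypothetical, the hotness cutoff is simplified to a raw sample count, and the mismatch-ratio limit is an assumed extra parameter that the message does not name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Per-function summary used by the check (hypothetical structure).
struct FuncStat {
  uint64_t Samples;        // Total samples attributed to the function.
  bool ChecksumMismatched; // True if the profile checksum disagrees with IR.
};

// Returns true when the module's profile looks too stale to apply.
bool shouldErrorOnStaleness(const std::vector<FuncStat> &Funcs,
                            uint64_t HotSampleCutoff,
                            std::size_t MinHotFunctions,
                            double MaxMismatchRatio) {
  std::size_t Hot = 0, Mismatched = 0;
  for (const FuncStat &F : Funcs) {
    if (F.Samples < HotSampleCutoff)
      continue; // Only hot functions take part in the decision.
    ++Hot;
    if (F.ChecksumMismatched)
      ++Mismatched;
  }
  // Stay conservative: too few hot functions is not enough evidence.
  if (Hot == 0 || Hot < MinHotFunctions)
    return false;
  return static_cast<double>(Mismatched) / static_cast<double>(Hot) >
         MaxMismatchRatio;
}
```

Requiring a minimum number of hot functions keeps one heavily edited but small module from failing the whole build.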
By design, when the nested profile is pre-inliner based, we should fully
honor the pre-inliner's decisions; fix this by setting the threshold to
zero. We observed a perf win on one internal service and no negative
impact on other big services.
This change adds support for computing and reporting the staleness
metrics after stale profile matching, so that we can tell how effective
the fuzzy matching is, i.e. how many callsites and samples are recovered
by the matching.
Some implementation notes:
- The function checksum mismatch metrics are not applicable here because
they are function-level metrics: the checksum mismatch remains the same
before and after matching, so we need to compute based on the callsite
samples.
- Added two new counters, `NumRecoveredCallsites` and
`RecoveredCallsiteSamples`, for this, removed `TotalCallsiteSamples`
since we can now use `TotalFuncHashSamples` as the base, and renamed
some counters.
- In profile matching, we now use a state machine to represent the
callsite's matching state changes. See `MatchState` for the states. A
new function, `recordCallsiteMatchStates`, computes and records each
callsite's match state change before and after the matching; the result
is compressed and saved into a `FuncCallsiteMatchStates` map for later
counting.
- Changed the counting function to run at module level and moved it to
the end of the whole process (`computeAndReportProfileStaleness`).
Previously, callsites were only counted in top-level functions; this
change extends the counting (recursively) to the inlined functions and
samples, which is more accurate.
Normally SampleContext does not allow constructing an object from an
empty StringRef; this is to prevent bugs when reading the profile.
However, empty names may be emitted by a function whose name is
intentionally set to empty, or by a bug in the remapper that returns an
empty string. Regardless, converting it to FunctionId first will prevent
the assert; that assert check is unnecessary and will be addressed in
another patch.
This is phase 2 of the MD5 refactoring on Sample Profile following
https://reviews.llvm.org/D147740
In the previous implementation, when an MD5 Sample Profile is read, the
reader first converts the MD5 values to strings and then creates a
StringRef as if the numerical strings were regular function names; later
on, IPO transformation passes perform string comparisons over these
numerical strings for profile matching. This is inefficient since it
causes many small heap allocations.
In this patch I created a class `ProfileFuncRef` that is similar to
`StringRef` but can represent a hash value directly without any
conversion, and it is more efficient (I will attach some benchmark
results later) when used in associative containers.
ProfileFuncRef guarantees that the same function name in string form or
in MD5 form has the same hash value, which also fixes a few issues in IPO
passes where function matching/lookup only checks the function name
string and returns a no-match if the profile is MD5.
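The "same hash for string form and MD5 form" guarantee can be illustrated with a small sketch. This is not the actual `ProfileFuncRef`: the class `FuncRef`, the helper `hashName`, and `FuncRefHash` are hypothetical, `std::hash` stands in for the MD5-based hash, and equality is simplified to comparing hash codes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string_view>
#include <unordered_map>

// Stand-in for the MD5-based name hash (assumption: std::hash is used here
// only to keep the sketch self-contained).
inline uint64_t hashName(std::string_view Name) {
  return static_cast<uint64_t>(std::hash<std::string_view>{}(Name));
}

// Refers to a function either by name or directly by its hash value.
class FuncRef {
  std::string_view Name; // Empty when only the hash is known.
  uint64_t Hash;

public:
  explicit FuncRef(std::string_view N) : Name(N), Hash(hashName(N)) {}
  explicit FuncRef(uint64_t H) : Hash(H) {}

  // Same value whether the object was built from a name or from its hash.
  uint64_t getHashCode() const { return Hash; }
  bool hasName() const { return !Name.empty(); }

  // Simplification of this sketch: two refs are equal if their hashes match.
  bool operator==(const FuncRef &Other) const { return Hash == Other.Hash; }
};

// Hasher so FuncRef can be used as a key in unordered containers without
// converting hash values back to strings.
struct FuncRefHash {
  std::size_t operator()(const FuncRef &F) const {
    return static_cast<std::size_t>(F.getHashCode());
  }
};
```

Because the key carries the hash directly, a lookup by name finds an entry that was stored from an MD5 profile, which is the mismatch the IPO passes previously ran into.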
When testing on a large internal profile (> 1 GB, with more than 10
million functions), the full profile load time is reduced from 28 sec to
25 sec on average, and reading the function offset table from 0.78s to
0.7s.
Address feedback in https://reviews.llvm.org/D158817. Since `extractProbe` can be used for both callsite and BB probes, we can leverage this to unify the callsite handling code.
Reviewed By: hoy, wenlei
Differential Revision: https://reviews.llvm.org/D159169
/data/home/jiefu/llvm-project/llvm/lib/Transforms/IPO/SampleProfile.cpp:2189:8: error: variable 'IsFuncHashMismatch' set but not used [-Werror,-Wunused-but-set-variable]
bool IsFuncHashMismatch = false;
^
1 error generated.
Accumulating the staleness metrics at pre-link time is less accurate than doing it at post-link time (assuming we use the offline profile mismatch as the baseline). The reason is that there are duplicated reports for the same functions: for example, one template function could be included in multiple TUs, but at post-thin-link time only one function is kept (linkonce_odr) and the others are marked as available-externally functions. Hence, this change skips reporting the metrics for imported (available-externally) functions.
I saw that the post-link number is now very close to the offline number (dump the mismatched functions and count the metrics offline based on the entire profile), slightly smaller than the offline number due to some missing inlined functions.
Reviewed By: hoy, wenlei
Differential Revision: https://reviews.llvm.org/D156725
Follow-up diff for https://reviews.llvm.org/D158891. Compute the checksum mismatch based on the original nested profile. Additionally, compute the children's mismatched samples recursively in the nested tree even if the top-level function checksum is matched.
Reviewed By: hoy, wenlei
Differential Revision: https://reviews.llvm.org/D158900
- Always use flattened profile to find the profile anchors. Since profile under different contexts may have different inlined callsites, to get more profile anchors, we use a merged profile from all the contexts(the flattened profile) to find callsite anchors.
- Compute the staleness metrics based on the original nested profile, as currently once a callsite is mismatched, all its children profiles are dropped. (TODO: in future, we can improve this to reuse the children's valid profiles.)
Reviewed By: hoy, wenlei
Differential Revision: https://reviews.llvm.org/D158891
Because callsites can be optimized out by inlining at pre-link time, we don't have those original call targets in the IR at LTO time. Additionally, the inlined code doesn't actually belong to the original function, so the IR locations or pseudo probes parsed from it are incorrect and could mislead the matching later.
This change adds support for extracting the original IR location info from the inlined code. Specifically, it makes sure to skip all the inlined code that doesn't belong to the original function, but before that, it processes the inline frames of the debug info to extract the base frame and recover its callsite and callee target (name).
Measured on some stale profile instances; all showed some perf improvements.
Reviewed By: hoy, wenlei
Differential Revision: https://reviews.llvm.org/D156722
- rename `IRLocation` --> `IRAnchors`, `ProfileLocation` --> `ProfileAnchors`
- reorganize runOnFunction, factor out the code that finds IR anchors into `findIRAnchors`
- introduce a new function `findProfileAnchors` to populate the profile-related anchors; the result is saved into `ProfileAnchors` and later used for both the mismatch report and matching, which avoids parsing `getBodySamples` and `getCallsiteSamples` multiple times.
- move the `MatchedCallsiteLocs` handling from `findIRAnchors` to `countProfileMismatches` so that all the staleness metrics are computed in one function.
- move all matching-related code into `runStaleProfileMatching`, and move all mismatch reporting into `countProfileMismatches`
Reviewed By: wenlei
Differential Revision: https://reviews.llvm.org/D158817
SampleProfileLoader::promoteMergeNotInlinedContextSample adds
certain uninlined functions to the sample profile map (unordered_map, which is
previously read from a profile file). This action may cause the map to
be rehashed, invalidating all pointers to FunctionSamples used by many
members of SampleProfileLoader, while the existing code did nothing to
guard against that. This bug is theoretical since adding a few
new functions to a large profile usually won't trigger a rehash, or even
if there's a rehash std::unordered_map tries its best to expand its
capacity in-place.
This bug will trigger if the container type of sample profile map is
changed to llvm::DenseMap or other implementation, such as in D147740,
for SampleProfReader's performance reason.
Reviewed By: wenlei
Differential Revision: https://reviews.llvm.org/D157061
We found that in a special condition, the input callee `Samples` is null for `findExternalInlineCandidate`, which caused an ICE.
In some rare cases, call instruction could be changed after being pushed into inline candidate queue, this is because earlier inlining may expose constant propagation which can change indirect call to direct call. When this happens, we may fail to find matching function samples for the candidate later(for example if the profile is stale), even if a match was found when the candidate was enqueued.
See this reduced program:
file1.c:
```
int bar(int x);
int a;
int (*foo(void))(int) {
  return bar;
}
void func(void)
{
  int (*fptr)(int);
  fptr = foo();
  a += (*fptr)(10);
}
```
file2.c:
```
int bar(int x) { return x + 1;}
```
The two calls, `foo` and `(*fptr)`, are pushed into the queue at the beginning. Say `foo` is hotter and is popped first for inlining. Inlining `foo` performs constant propagation for the function pointer `bar` and changes `(*fptr)` to a direct call `bar(..)`. Note that at this point `(*fptr)`/`bar` is still in the queue; later, when it is popped for inlining, it uses a different target name (`bar`) to look for the callee samples. If, at the same time, the profile is stale and the new function differs from the old function in the profile, the lookup returns null callee samples.
Reviewed By: hoy, wenlei
Differential Revision: https://reviews.llvm.org/D154637
We tested the stale profile matching on several of Meta's internal services, and all results are positive. For instance, in one service that refreshes its profile every one or two weeks, it consistently gave a 1~2% performance improvement. We also observed an instance where a trivial refactoring caused a 2% regression and the matching successfully recovered the whole regression. Therefore, we'd like to turn it on by default for CSSPGO.
Reviewed By: hoy, wenlei
Differential Revision: https://reviews.llvm.org/D154027
Parametrize SampleProfileInference and SampleProfileLoaderBaseImpl by function
type (Function/MachineFunction) instead of block type
(BasicBlock/MachineBasicBlock). Move out specializations to appropriate
locations.
This change makes it possible to use GraphTraits instead of a custom TypeMap and
make SampleProfileInference not dependent on LLVM types, paving the way for
generalizing SampleProfileInference interfaces to BOLT IR types
(BinaryFunction/BinaryBasicBlock) in stale profile matching (D144500).
Reviewed By: hoy
Differential Revision: https://reviews.llvm.org/D152187
This change enables loading pseudo-probe based profiles on MIR. Unlike the IR profile loader, callsites are excluded from MIR profile loading since they are not assigned an FS discriminator. Using zero as the discriminator is not accurate and would undo the distribution work done by the IR loader based on the pseudo probe distribution factor. We rely on block probes only for FS profile loading.
Some refactoring is done to the IR profile loader so that `getProbeWeight` can be shared by both loaders.
Reviewed By: wenlei
Differential Revision: https://reviews.llvm.org/D148584
Part 2 of https://reviews.llvm.org/D147456
Use callee name on IR as an anchor to match the call target/inlinee name in the profile. The advantages of this in particular:
- Different from the traditional way of encoding hash signatures to every block that would affect binary/profile size and build speed, it doesn't require any additional information for this, all the data is already in the IR and profiles.
- Effective for current nested profile layout in which once a callsite is mismatched all the inlinee's profiles are dropped.
**The input of the algorithm:**
- IR locations: the anchor is the callee name of direct callsite.
- Profile locations: the anchor is the call target name for `BodySample`s or inlinee's profile name for `CallsiteSamples`.
The two lists are populated by parsing the IR and profile and both can be generalized as a sequence of locations with an optional anchor.
For example: say location `1.2(foo)` refers to a callsite at `1.2` with callee name `foo` and `1.3` refers to a non-direct-call location `1.3`.
```
// The current build source code:
int main() {
1. ...
2. foo();
3. ...
4 ...
5. ...
6. bar();
7. ...
}
```
IR locations are populated and simplified as: `[1, 2(foo), 3, 5, 6(bar), 7]`.
```
; The "stale" profile:
main:350:1
1: 1
2: 3
3: 100 foo:100
4: 2
7: 2
8: 200 bar:200
9: 30
```
Profile locations are populated and simplified as `[1, 2, 3(foo), 4, 7, 8(bar), 9]`
**Matching heuristic:**
- Match all the anchors in lexical order first.
- Match non-anchors evenly between two anchors: Split the non-anchor range, the first half is matched based on the start anchor, the second half is matched based on the end anchor.
So the example above is matched like:
```
[1, 2(foo), 3, 5, 6(bar), 7]
| | | | | |
[1, 2, 3(foo), 4, 7, 8(bar), 9]
```
3 -> 4 matching is based on anchor `foo`, 5 -> 7 matching is based on anchor `bar`.
The output mapping of matching is [2->3, 3->4, 5->7, 6->8, 7->9].
For the implementation, the anchors are saved in a map for fast look-up. The result mapping is saved into `IRToProfileLocationMap` (see https://reviews.llvm.org/D147456) and distributed to all FunctionSamples (`distributeIRToProfileLocationMap`).
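The heuristic above can be sketched as a standalone function. This is a simplified illustration rather than the patch's implementation: `Location` and `matchLocations` are hypothetical names, locations are reduced to a line offset plus an optional callee name (discriminators are ignored), and identity mappings are omitted from the result, as in the output mapping shown above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// A location: line offset plus an optional anchor (callee name, "" if none).
using Location = std::pair<uint32_t, std::string>;

// Maps IR line offsets to profile line offsets; identity pairs are omitted.
std::map<uint32_t, uint32_t>
matchLocations(const std::vector<Location> &IRLocs,
               const std::vector<Location> &ProfileLocs) {
  // Profile anchors grouped by callee name, kept in lexical order.
  std::map<std::string, std::vector<uint32_t>> ProfileAnchors;
  for (const Location &P : ProfileLocs)
    if (!P.second.empty())
      ProfileAnchors[P.second].push_back(P.first);
  std::map<std::string, std::size_t> NextUnused; // Cursor per callee name.

  std::map<uint32_t, uint32_t> Result;
  auto Insert = [&Result](uint32_t From, uint32_t To) {
    if (From == To)
      Result.erase(From); // Identity mapping carries no information.
    else
      Result[From] = To;
  };

  int64_t Delta = 0;             // Offset implied by the last matched anchor.
  std::vector<uint32_t> Pending; // Non-anchors seen since that anchor.
  for (const Location &IR : IRLocs) {
    const uint32_t Line = IR.first;
    const std::string &Callee = IR.second;
    bool MatchedAnchor = false;
    if (!Callee.empty()) {
      auto It = ProfileAnchors.find(Callee);
      std::size_t &Next = NextUnused[Callee];
      if (It != ProfileAnchors.end() && Next < It->second.size()) {
        const uint32_t Candidate = It->second[Next++];
        Insert(Line, Candidate);
        Delta = static_cast<int64_t>(Candidate) - static_cast<int64_t>(Line);
        // Second half of the pending non-anchors follows the new anchor.
        for (std::size_t I = (Pending.size() + 1) / 2; I < Pending.size(); ++I)
          Insert(Pending[I], static_cast<uint32_t>(Pending[I] + Delta));
        Pending.clear();
        MatchedAnchor = true;
      }
    }
    if (!MatchedAnchor) {
      // First guess: follow the previous anchor (forward matching).
      Insert(Line, static_cast<uint32_t>(Line + Delta));
      Pending.push_back(Line);
    }
  }
  return Result;
}
```

Feeding it the IR list `[1, 2(foo), 3, 5, 6(bar), 7]` and the profile list `[1, 2, 3(foo), 4, 7, 8(bar), 9]` from the example yields the mapping 2->3, 3->4, 5->7, 6->8, 7->9.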
**Clang-self build benchmark:**
Current build version: clang-10
The profiled version: clang-9
Results are compared to a refreshed profile (profile collected on clang-10). To be fair, we invalidated new functions' profiles (both the refreshed and stale profiles use the same profile list).
1) Regression to using refresh profile with this off : -3.93%
2) Regression to using refresh profile with this on : -1.1%
So this algorithm can recover ~72% of the regression.
**Internal (Meta) large-scale services:**
We saw one real instance of a 3-week-stale profile, where it delivered a ~1.8% win.
**Notes or future work:**
- Classic AutoFDO support: the current version only supports pseudo-probe, but I believe it's not hard to extend to classic line-number based AutoFDO, since pseudo-probe and line-number share the LineLocation structure.
- The fuzzy matching is an open-ended area and there could be more heuristics to try out, but since the current version already recovers a reasonable percentage of the regression (with some pseudo probe order change, it can recover close to 90%), I'm submitting the patch for review and we will try more heuristics in future.
- Profile call target names are only available when the call is hit by samples, and a missing anchor might mislead the matching. This can be mitigated in llvm-profgen by generating the call target for zero-sample callsites.
- This doesn't handle function name mismatches; we plan to solve that in future.
Reviewed By: hoy, wenlei
Differential Revision: https://reviews.llvm.org/D147545
AutoFDO/CSSPGO often has to deal with stale profiles collected on binaries built from several revisions behind release. Using a stale profile is likely to produce incorrect profile annotations, which results in unstable or low-performing binaries. Currently, for source-location-based profiles, once a code change causes a profile mismatch, all the locations afterward are mismatched and the affected samples or inlining info are lost. If we can provide a matching framework to reuse parts of the mismatched profile - aka incremental PGO - it will make PGO more stable, increase the optimization coverage, and boost the performance of the binary.
This patch is the part 1 of stale profile matching, summary of the implementation:
- Added a structure for the matching result:`LocToLocMap`, which is a location to location map meaning the location of current build is matched to the location of the previous build(to be used to query the “stale” profile).
- In order to use the matching results for sample query, we need to pass them to all the location queries. For code cleanliness, we added a new pointer field(`IRToProfileLocationMap`) to `FunctionSamples`.
- Added a wrapper(`mapIRLocToProfileLoc`) for the query to the location, the location from input IR will be remapped to the matched profile location.
- Added a new switch `--salvage-stale-profile`.
- Some refactoring for the staleness detection.
Test case is in part 2 with the matching algorithm.
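How the remapped query fits together can be sketched as below. This is an illustration under assumptions, not the patch's code: `FunctionSamplesSketch` and `samplesAt` are hypothetical, `LineLocation` is reduced to a (line offset, discriminator) pair, and only the names `LocToLocMap`, `IRToProfileLocationMap`, and `mapIRLocToProfileLoc` come from the summary above.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>

// (line offset, discriminator); a simplified stand-in for LineLocation.
using LineLocation = std::pair<uint32_t, uint32_t>;
// Location in the current build -> location in the previous ("stale") build.
using LocToLocMap = std::map<LineLocation, LineLocation>;

struct FunctionSamplesSketch {
  std::map<LineLocation, uint64_t> BodySamples; // Keyed by profile location.
  const LocToLocMap *IRToProfileLocationMap = nullptr; // Null if no matching.

  // Remaps an IR location to its matched profile location, if there is one.
  LineLocation mapIRLocToProfileLoc(const LineLocation &IRLoc) const {
    if (!IRToProfileLocationMap)
      return IRLoc;
    auto It = IRToProfileLocationMap->find(IRLoc);
    return It == IRToProfileLocationMap->end() ? IRLoc : It->second;
  }

  // Every sample query goes through the wrapper before touching the profile.
  uint64_t samplesAt(const LineLocation &IRLoc) const {
    auto It = BodySamples.find(mapIRLocToProfileLoc(IRLoc));
    return It == BodySamples.end() ? 0 : It->second;
  }
};
```

Keeping the map behind a pointer field means existing query call sites do not need an extra parameter, which matches the code-cleanliness motivation given above.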
Reviewed By: wenlei
Differential Revision: https://reviews.llvm.org/D147456
For the profile staleness report, previously only the top-level function samples in the nested profile were counted and the samples in the inlinees were ignored. This could affect the quality of the metrics when there are heavily inlined functions. This change adds a feature to flatten the nested profile, and we now use the flattened profile as the input for stale profile detection and matching.
Example for profile flattening:
```
Original profile:
_Z3bazi:20301:1000
 1: 1000
 3: 2000
 5: inline1:1600
  1: 600
  3: inline2:500
   1: 500

Flattened profile:
_Z3bazi:18701:1000
 1: 1000
 3: 2000
 5: 600 inline1:600
inline1:1100:600
 1: 600
 3: 500 inline2:500
inline2:500:500
 1: 500
```
This feature could be useful for offline analysis, like understanding the hotness of each individual function. So I'm adding the support to `llvm-profdata merge` under `--gen-flattened-profile`.
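The arithmetic in the flattening example can be reproduced with a short sketch. It is not the implementation behind `--gen-flattened-profile`: `NestedProfile`, `FlatProfile`, and `flatten` are hypothetical names, call targets are not modeled, and each inlinee's head (entry) count is assumed to be given as input rather than estimated.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// A function profile that may carry nested inlinee profiles.
struct NestedProfile {
  std::string Name;
  uint32_t CallsiteLine = 0; // Line in the parent (unused for top level).
  uint64_t TotalSamples = 0; // Includes the samples of all inlinees.
  uint64_t HeadSamples = 0;  // Entry count (assumed known in this sketch).
  std::map<uint32_t, uint64_t> Body; // Line -> samples.
  std::vector<NestedProfile> Inlinees;
};

// A function profile with no nesting.
struct FlatProfile {
  uint64_t TotalSamples = 0;
  uint64_t HeadSamples = 0;
  std::map<uint32_t, uint64_t> Body;
};

// Splits a nested profile into one flat profile per function, merging
// profiles that end up with the same name.
void flatten(const NestedProfile &P, std::map<std::string, FlatProfile> &Out) {
  FlatProfile &Flat = Out[P.Name]; // std::map references survive inserts.
  Flat.TotalSamples += P.TotalSamples;
  Flat.HeadSamples += P.HeadSamples;
  for (const auto &LineAndCount : P.Body)
    Flat.Body[LineAndCount.first] += LineAndCount.second;
  for (const NestedProfile &Child : P.Inlinees) {
    // The inlinee's samples move to its own top-level entry...
    Flat.TotalSamples -= Child.TotalSamples;
    // ...and the callsite becomes a regular body line with the entry count.
    Flat.Body[Child.CallsiteLine] += Child.HeadSamples;
    flatten(Child, Out);
  }
}
```

For the example above, `_Z3bazi` drops from 20301 to 18701 total samples (20301 - 1600), `inline1` becomes a top-level profile with 1100 (1600 - 500), and `inline2` keeps its 500.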
Reviewed By: hoy, wenlei
Differential Revision: https://reviews.llvm.org/D146452
The legacy pass is only used in AMDGPU codegen, which doesn't care about running it in call graph order (it actually has to work around that fact).
Make the legacy pass a module pass and share code with the new pass.
This allows us to remove the legacy inliner infrastructure.
Reviewed By: mtrofin
Differential Revision: https://reviews.llvm.org/D146446
The function order in some tests had to be changed because they relied on ordering of functions returned in an SCC which is consistent but unspecified.
Make access to profile data go through the virtual file system so the
inputs can be remapped. In the context of caching, this ensures we
capture the inputs and provide an immutable input as profile data.
Reviewed By: akyrtzi, benlangmuir
Differential Revision: https://reviews.llvm.org/D139052
value() has undesired exception checking semantics and calls
__throw_bad_optional_access in libc++. Moreover, the API is unavailable without
_LIBCPP_NO_EXCEPTIONS on older Mach-O platforms (see
_LIBCPP_AVAILABILITY_BAD_OPTIONAL_ACCESS).
This fixes clang.
Fix two issues for profile staleness report.
1) It should be more accurate to use the sum of all entry counts (`getHeadSamplesEstimate`) for the callsite samples rather than the total samples: even when the top-level callsite is mismatched, which does affect inlining, its profile can still be merged into the base profile and used later.
2) I accidentally forgot to persist the number of mismatched callsites into the binary.
Also added an asm test for decoding the section.
Reviewed By: hoy, wenlei
Differential Revision: https://reviews.llvm.org/D140063