Before all the call probe ids are after block ids, in this change, it
mixed the call probe and block probe by reordering them in
lexical(line-number) order. For example:
```
main():
BB1
if(...)
BB2 foo(..);
else
BB3 bar(...);
BB4
```
Before the profile is
```
main
1: ..
2: ..
3: ...
4: ...
5: foo ...
6: bar ...
```
Now the new order is
```
main
1: ..
2: ..
3: foo ...
4: ...
5: bar ...
6: ...
```
This can potentially make it more tolerant of profile mismatch, either from stale profile or frontend change. e.g. before if we add one block, even the block is the last one, all the call probes are shifted and mismatched. Moreover, this makes better use of call-anchor based stale profile matching. Blocks are matched based on the closest anchor, there would be more anchors used for the matching, reduce the mismatch scope.
This is to fix an assertion error. Apparently, `pseudo_probe_desc` could
still be available for import functions, and its checksum mismatch state
can be different from import function's `profile-checksum-mismatch`
attr. This happens when unstable IR or ODR violation issue occurs, the
definitions of the same function across different translation units
could be different and result in different checksums. During link time
deduplication, the internal function definition (the checksum in desc is
computed based on) is substituted by the `available_externally`
definition, which cause the inconsistency. Hence, we fix it to by always
checking the state for the new `available_externally` definition, which
is saved in the function attribute.
Currently -salvage-stale-profile is a no-op if the profile is not
probe-based. We observed that it can help for regular, non-probe- based
profiles too: some of our internal benchmarks show 0.2-0.3% QPS
improvement.
There seems to be no good reason to limit this flag to only work for
probe-based profiles.
As before we only skip instrumenting probe of `unwind`(`KnownColdBlock`)
block, this PR extends to skip the both EH flow from `invoke`, i.e. also
skip the `normal` dest. For more contexts: when doing call-to-invoke
conversion, the block is split by the `invoke` and two extra
blocks(`normal` and `unwind`) are added. With this PR, the
instrumentation is the same as the one before the call-to-invoke
conversion.
One significant benefit is this can help mitigate the "unstable IR"
issue(https://discourse.llvm.org/t/ipo-for-linkonce-odr-functions/69404),
the two versions now are on the same probe instrumentation, expected to
be the same checksum.
To achieve the same checksum, some tweaks is needed:
- Now it also skips incrementing the probe ID for the skipped probe.
- The checksum is also computed based on the CFG that skips the EH
edges.
We observed this fixes ~5% mismatched samples.
#66457 makes InlineCost to use cost-benefit by default, which causes
0.4-0.5% performance regression on multiple internal workloads. See
discussions https://github.com/llvm/llvm-project/pull/66457. This pull
request reverts it.
Co-authored-by: helloguo <helloguo@meta.com>
Two fixes related to the callee/inlinee profile:
1. Fix the bug that the matching results are missing to distribute to
the callee profiles (should be pass-by-reference).
2. Narrow imported function matching to checksum mismatched functions.
More context: before we run matchings for all imported functions even
checksums are matched, however, after we fix 1), we got a regression,
it's likely due to the matching is not no-op for checksum matched
function, so we want to make it consistent to only run matching for
checksum mismatched (imported)functions. Since the
metadata(pseudo_probe_desc) are dropped for imported function, we
leverage the function attribute mechanism and add a new function
attribute(`profile-checksum-mismatch`) to transfer the info from
pre-link to post-link.
Error out the build if the checksum mismatch is extremely high, it's
better to drop the profile rather than apply the bad profile.
Note that the check is on a module level, the user could make big
changes to functions in one single module but those changes might not be
performance significant to the whole binary, so we want to be
conservative, only expect to catch big perf regression. To do this, we
select a set of the "hot" functions for the check. We use two
parameter(`hot-func-cutoff-for-staleness-error` and
`min-functions-for-staleness-error`) to control the function selection
to make sure the selected are hot enough and the num of function is not
small.
Tuned the parameters on our internal services, it works to catch big
perf regression due to the high mismatch .
By design, when the nested profile is pre-inliner based, we should fully
honor pre-inliner decision, fix it by setting threshold to zero. We
observed a perf win on one internal service, no negative impact for
other big services.
This change adds the support to compute and report the staleness metrics
after stale profile matching so that we can know how effective the fuzzy
matching is, i. e. how many callsites and samples are recovered by the
matching.
Some implementation notes:
- The function checksum mismatch metrics are not applicable here as it's
function-level metrics, checksum mismatch remains the same before and
after matching, so we need to compute based on the callsite samples.
- Added two new counters `NumRecoveredCallsites`,
`RecoveredCallsiteSamples` for this and removed `TotalCallsiteSamples`
as now the we can use the `TotalFuncHashSamples` as base, and renamed
some counters.
- In profile matching, we changed to use a state machine to represent
the callsite's matching state changes. See the `MatchState` for the
state, and used a new function `recordCallsiteMatchStates` to compute
and record the callsite's match states changes before and after the
matching, , the result is compressed and saved into a
`FuncCallsiteMatchStates` map for later counting use.
- Changed the counting function to run on module-level and moved it to
the end of the whole process(`computeAndReportProfileStaleness`). The
reason is before the callsite is only counted on top-level function,
this change extends it to count(recursively) on the inlined functions
and samples, which is more accurate.
This patch canonicalizes getelementptr instructions with constant
indices to use the `i8` source element type. This makes it easier for
optimizations to recognize that two GEPs are identical, because they
don't need to see past many different ways to express the same offset.
This is a first step towards
https://discourse.llvm.org/t/rfc-replacing-getelementptr-with-ptradd/68699.
This is limited to constant GEPs only for now, as they have a clear
canonical form, while we're not yet sure how exactly to deal with
variable indices.
The test llvm/test/Transforms/PhaseOrdering/switch_with_geps.ll gives
two representative examples of the kind of optimization improvement we
expect from this change. In the first test SimplifyCFG can now realize
that all switch branches are actually the same. In the second test it
can convert it into simple arithmetic. These are representative of
common optimization failures we see in Rust.
Fixes https://github.com/llvm/llvm-project/issues/69841.
Port CodeGenPrepare to new pass manager and dependency
BasicBlockSectionsProfileReader
Fixes: #75380
Co-authored-by: Krishna-13-cyber <84722531+Krishna-13-cyber@users.noreply.github.com>
Revert e0c554ad87d18dcbfcb9b6485d0da800ae1338d1 "Port CodeGenPrepare to new pass manager (and BasicBlockSectionsProfil… (#75380)"
Revert #75380 and #77054 as they were breaking EXPENSIVE_CHECKS buildbots: https://lab.llvm.org/buildbot/#/builders/104
Port CodeGenPrepare to new pass manager and dependency
BasicBlockSectionsProfileReader
Fixes: #64560
Co-authored-by: Krishna-13-cyber <84722531+Krishna-13-cyber@users.noreply.github.com>
This patch deduces `noundef` attributes for return values.
IIUC, a function returns `noundef` values iff all of its return values
are guaranteed not to be `undef` or `poison`.
Definition of `noundef` from LangRef:
```
noundef
This attribute applies to parameters and return values. If the value representation contains any
undefined or poison bits, the behavior is undefined. Note that this does not refer to padding
introduced by the type’s storage representation.
```
Alive2: https://alive2.llvm.org/ce/z/g8Eis6
Compile-time impact: http://llvm-compile-time-tracker.com/compare.php?from=30dcc33c4ea3ab50397a7adbe85fe977d4a400bd&to=c5e8738d4bfbf1e97e3f455fded90b791f223d74&stat=instructions:u
|stage1-O3|stage1-ReleaseThinLTO|stage1-ReleaseLTO-g|stage1-O0-g|stage2-O3|stage2-O0-g|stage2-clang|
|--|--|--|--|--|--|--|
|+0.01%|+0.01%|-0.01%|+0.01%|+0.03%|-0.04%|+0.01%|
The motivation of this patch is to reduce the number of `freeze` insts
and enable more optimizations.
When using `BranchProbabilityPrinterPass`, if a BB has no name, we get pretty unusable information like `edge -> has probability...` (i.e. we have no idea what the vertices of that edge are).
This patch uses `printAsOperand`, which uses the same naming scheme as `Function::dump`, so for example during debugging sessions, the IR obtained from a function and the names used by `BranchProbabilityPrinterPass` will match.
A shortcoming is that `printAsOperand` will result in the numbering algorithm re-running for every edge and every vertex (when `BranchProbabilityPrinterPass` is run on a function). If, for the given scenario, this is a problem, we can revisit this subsequently.
Another nuance is that the entry basic block will be numbered, which may be slightly confusing when it's anonymous, but it's easily identifiable - the first edge would have it as source (and the number should be easily recognizable)
Normally SampleContext does not allow using an empty StirngRef to
construct an object, this is to prevent bugs reading the profile.
However empty names may be emitted by a function which its name is
intentionally set to empty, or a bug in the remapper that returns an
empty string. Regardless, converting it to FunctionId first will prevent
the assert, and that assert check is unnecessary, which will be
addressed in another patch
BlockFrequencyInfo calculates block frequencies as Scaled64 numbers but as a last step converts them to unsigned 64bit integers (`BlockFrequency`). This improves the factors picked for this conversion so that:
* Avoid big numbers close to UINT64_MAX to avoid users overflowing/saturating when adding multiply frequencies together or when multiplying with integers. This leaves the topmost 10 bits unused to allow for some room.
* Spread the difference between hottest/coldest block as much as possible to increase precision.
* If the hot/cold spread cannot be represented loose precision at the lower end, but keep the frequencies at the upper end for hot blocks differentiable.
MCPseudoProbeSections::emit iterates over MCProbeDivisions and creates sections.
When the map key is MCSymbol *, the iteration order is not stable. The
underlying BumpPtrAllocator largely decreases the flakiness. That said, two
elements may sit in two different allocations from BumpPtrAllocator, with
an unpredictable order. Under tcmalloc,
llvm/test/Transforms/SampleProfile/pseudo-probe-emit.ll fails about 7 times per
1000 runs.
Followup to D59363 which failed to handle the icmp(X,undef) -> isTrueWhenEqual case - similar to llvm::ConstantFoldCompareInstruction
As discussed on the review, this is affecting some previously reduced test cases, but will also prevent reductions from relying on this inconsistent behaviour in the future.
Reapplied after reversion at e1e3c75c7dad72 with a tweak to the pseudo-probe-peep.ll test
Differential Revision: https://reviews.llvm.org/D158068
With `-fpseudo-probe-for-profiling`, the dwarf discriminator for a
callsite will be overwritten to pseudo probe related information for
that callsite. The probe information is encoded in a special format
(i.e., with all lowest three digits be one) in order to be distinguished
from regular dwarf discriminator. The special encoding format will be
decoded to zero by the regular discriminator logic. This means all
callsites would have a zero discriminator in both the sample profile and
the compiler, for classic AutoFDO. This is inconvenient in that no
decent classic AutoFDO can be generated from a pseudo probe build. I'm
mitigating the issue by allowing callsite probe id to be used as the
base dwarf discriminator for classic AutoFDO, since probe id is also
unique and can be used to differentiate callsites on the same source
line.
Accumulating the staleness metrics from per-link is less accurate than doing it from post-link time(assuming we use the offline profile mismatch as baseline), the reason is that there are some duplicated reports for the same functions, for example, one template function could be included in multiple TUs, but in post thin link time, only one function are kept(linkonce_odr) and others are marked as available-externally function. Hence, this change skips reporting the metrics for imported functions(available-externally).
I saw the post-link number is now very close to the offline number(dump the mismatched functions and count the metrics offline based on the entire profile), sightly smaller than offline number due to some missing inlined functions.
Reviewed By: hoy, wenlei
Differential Revision: https://reviews.llvm.org/D156725
Follow-up diff for https://reviews.llvm.org/D158891. Compute the checksum mismatch based on the original nested profile. Additionally, use a recursive way to compute the children mismatched samples in the nested tree even the top-level func checksum is matched.
Reviewed By: hoy, wenlei
Differential Revision: https://reviews.llvm.org/D158900
- Always use flattened profile to find the profile anchors. Since profile under different contexts may have different inlined callsites, to get more profile anchors, we use a merged profile from all the contexts(the flattened profile) to find callsite anchors.
- Compute the staleness metrics based on the original nested profile, as currently once a callsite is mismatched, all its children profile are dropped.(TODO: in future, we can improve to reuse the children valid profile)
Reviewed By: hoy, wenlei
Differential Revision: https://reviews.llvm.org/D158891
As in per-link time, callsites could be optimized out by inlining, we don't have those original call targets in the IR in LTO time. Additionally, the inlined code doesn't actually belong to the original function, the IR locations or pseudo probe parsed from it are incorrect and could mislead the matching later.
This change adds the support to extract the original IR location info from the inlined code, specifically, it make sure to skip all the inlined code that doesn't belong the original function, but before that, it processes the inline frames of the debug info to extract the base frame and recover its callsite and callee target(name).
Measured on some stale profile instances, all showed some perf improvements.
Reviewed By: hoy, wenlei
Differential Revision: https://reviews.llvm.org/D156722
- rename `IRLocation` --> `IRAnchors`, `ProfileLocation` --> `ProfileAnchors`
- reorganize runOnFunction, fact out the finding IR anchors code into `findIRAnchors`
- introduce a new function `findProfileAnchors` to populate the profile related anchors, the result is saved into `ProfileAnchors`, it's later used for both mismatch report and matching, this can avoid to parse the `getBodySamples` and `getCallsiteSamples` for multiple times.
- move the `MatchedCallsiteLocs` stuffs from `findIRAnchors` to `countProfileMismatches` so that all the staleness metrics report are computed in one function.
- move all matching related into `runStaleProfileMatching`, and move all mismatching report into `countProfileMismatches`
Reviewed By: wenlei
Differential Revision: https://reviews.llvm.org/D158817
The affected lit tests failed when they were run in a path that contained the
word "call". CHECK-NOT lines that were supposed to match only the IR ended up
matching the path printed in the output. Fixed this by checking for "call void"
instead.
We found that in a special condition, the input callee `Samples` is null for `findExternalInlineCandidate`, which caused an ICE.
In some rare cases, call instruction could be changed after being pushed into inline candidate queue, this is because earlier inlining may expose constant propagation which can change indirect call to direct call. When this happens, we may fail to find matching function samples for the candidate later(for example if the profile is stale), even if a match was found when the candidate was enqueued.
See this reduced program:
file1.c:
```
int bar(int x);
int(*foo())() {
return bar;
};
void func()
{
int (*fptr)(int);
fptr = foo();
a += (*fptr)(10);
}
```
file2.c:
```
int bar(int x) { return x + 1;}
```
The two CALL: `foo` and `(*ptr)` are pushed into the queue at the beginning, say `foo` is hotter and popped first for inlining. During the inlining of `foo`, it performs the constant propagation for the function pointer `bar` and then changed `(*ptr)` to a direct call `bar(..)`. Note that at this time, `(*ptr)/bar` is still in the queue, later while it's popped out for inlining, it use the a different target name(bar) to look for the callee samples. At the same time, if the profile is stale and the new function is different from the old function in the profile, then this led the return of the null callee sample.
Reviewed By: hoy, wenlei
Differential Revision: https://reviews.llvm.org/D154637
For a function not in an IR COMDAT, currently we place it into a nodeduplicate IR
COMDAT so that its text section and its associated .pseudo_probe section will be
in the same section group, which can be retained or discarded by the linker as a
unit. However, the section group wastes space.
After D153189 uses SHF_LINK_ORDER to ensure a .pseudo_probe section will be
discarded when its associated text section is discarded, we can remove the
nodeduplicate IR change.
In the following example, the .pseudo_probe associated with .text.f is discarded as expected.
```
clang -c -ffunction-sections -fpseudo-probe-for-profiling -xc =(printf 'void _start(){} void f(){}') -o a.o
ld.lld --gc-sections --print-gc-sections a.o
```
Reviewed By: hoy
Differential Revision: https://reviews.llvm.org/D153191
* Add the SHF_LINK_ORDER flag so that the .pseudo_probe section is discarded when the associated text section is discarded.
* Add unique ID so that with `clang -ffunction-sections -fno-unique-section-names`, there is one separate .pseudo_probe for each text section (disambiguated by `.section ....,unique,id` in assembly)
The changes allow .pseudo_probe GC even if we don't place instrumented functions
in an IR comdat (see `getOrCreateFunctionComdat` in SampleProfileProbe.cpp).
Reviewed By: hoy
Differential Revision: https://reviews.llvm.org/D153189
This reverts commit 0c03f48480f69b854f86d31235425b5cb71ac921.
Going to fix forward size regression instead due to more dependent patches needing to be reverted otherwise.
Unlike every other analysis and transform, simplifyInstruction
permitted operating on instructions which are not inserted
into a function. This created an edge case no other code needs
to really worry about, and limited transforms in cases that
can make use of the context function. Only the inliner and a handful
of other utilities were making use of this, so just fix up these
edge cases. Results in some IR ordering differences since
cloned blocks are inserted eagerly now. Plus some additional
simplifications trigger (e.g. some add 0s now folded out that
previously didn't).
For pseudo probes we would like to keep their original dwarf discriminator (either a zero or null) until the first FS-discriminator pass. The inliner is a violation of that, given that it assigns inlinee instructions with no debug info with the that of the callsite. This is being disabled in this patch.
Reviewed By: wenlei
Differential Revision: https://reviews.llvm.org/D151568
This caused compiler assertions, see comment on
https://reviews.llvm.org/D150107.
This also reverts the dependent follow-up change:
> [X86] Remove patterns for ADD/AND/OR/SUB/XOR/CMP with immediate 8 and optimize during MC lowering, NFCI
>
> This is follow-up of D150107.
>
> In addition, the function `X86::optimizeToFixedRegisterOrShortImmediateForm` can be
> shared with project bolt and eliminates the code in X86InstrRelaxTables.cpp.
>
> Differential Revision: https://reviews.llvm.org/D150949
This reverts commit 2ef8ae134828876ab3ebda4a81bb2df7b095d030 and
5586bc539acb26cb94e461438de01a5080513401.
This is follow-up of D150107.
In addition, the function `X86::optimizeToFixedRegisterOrShortImmediateForm` can be
shared with project bolt and eliminates the code in X86InstrRelaxTables.cpp.
Differential Revision: https://reviews.llvm.org/D150949
This change enables loading pseudo-probe based profile on MIR. Different from the IR profile loader, callsites are excluded from MIR profile loading since they are not assinged a FS discriminator. Using zero as the discriminator is not accurate and would undo the distribution work done by the IR loader based on pseudo probe distribution factor. We reply on block probes only for FS profile loading.
Some refactoring is done to the IR profile loader so that `getProbeWeight` can be shared by both loaders.
Reviewed By: wenlei
Differential Revision: https://reviews.llvm.org/D148584
A pseudo probe is created with dwarf line information shared with its nearest instruction. If the instruction comes with a dwarf discriminator, it will be shared with the probe as well. This can confuse the later FS-AFDO discriminator assignment pass. To fix this, I'm cleaning up the discriminator fields for probes when they are inserted.
I also notice another possibility to change the discriminator field of pseudo probes in the pipeline before the FS discriminator assignment pass. That is the loop unroller, which assigns duplication factor to instruction being vectorized. I'm disabling that for pseudo probe intrinsics specifically, also for callsites with probes.
Reviewed By: wenlei
Differential Revision: https://reviews.llvm.org/D148569
Part 2 of https://reviews.llvm.org/D147456
Use callee name on IR as an anchor to match the call target/inlinee name in the profile. The advantages of this in particular:
- Different from the traditional way of encoding hash signatures to every block that would affect binary/profile size and build speed, it doesn't require any additional information for this, all the data is already in the IR and profiles.
- Effective for current nested profile layout in which once a callsite is mismatched all the inlinee's profiles are dropped.
**The input of the algorithm:**
- IR locations: the anchor is the callee name of direct callsite.
- Profile locations: the anchor is the call target name for `BodySample`s or inlinee's profile name for `CallsiteSamples`.
The two lists are populated by parsing the IR and profile and both can be generalized as a sequence of locations with an optional anchor.
For example: say location `1.2(foo)` refers to a callsite at `1.2` with callee name `foo` and `1.3` refers to a non-directcall location `1.3`.
```
// The current build source code:
int main() {
1. ...
2. foo();
3. ...
4 ...
5. ...
6. bar();
7. ...
}
```
IR locations are populated and simplified as: `[1, 2(foo), 3, 5, 6(bar), 7]`.
```
; The "stale" profile:
main:350:1
1: 1
2: 3
3: 100 foo:100
4: 2
7: 2
8: 200 bar:200
9: 30
```
Profile locations are populated and simplified as `[1, 2, 3(foo), 4, 7, 8(bar), 9]`
**Matching heuristic:**
- Match all the anchors in lexical order first.
- Match non-anchors evenly between two anchors: Split the non-anchor range, the first half is matched based on the start anchor, the second half is matched based on the end anchor.
So the example above is matched like:
```
[1, 2(foo), 3, 5, 6(bar), 7]
| | | | | |
[1, 2, 3(foo), 4, 7, 8(bar), 9]
```
3 -> 4 matching is based on anchor `foo`, 5 -> 7 matching is based on anchor `bar`.
The output mapping of matching is [2->3, 3->4, 5->7, 6->8, 7->9].
For the implementation, the anchors are saved in a map for fast look-up. The result mapping is saved into `IRToProfileLocationMap`(see https://reviews.llvm.org/D147456) and distributed to all FunctionSamples(`distributeIRToProfileLocationMap`)
**Clang-self build benchmark: **
Current build version: clang-10
The profiled version: clang-9
Results compared to a refresh profile(collected profile on clang-10) and to be fair, we invalidated new functions' profiles(both refresh and stale profile use the same profile list).
1) Regression to using refresh profile with this off : -3.93%
2) Regression to using refresh profile with this on : -1.1%
So this algorithm can recover ~72% of the regression.
**Internal(Meta) large-scale services.**
we saw one real instance of a 3 week stale profile., it delivered a ~1.8% win.
**Notes or future work:**
- Classic AutoFDO support: the current version only supports pseudo-probe, but I believe it's not hard to extend to classic line-number based AutoFDO since pseudo-probe and line-number are shared the LineLocation structure.
- The fuzzy matching is an open-ended area and there could be more heuristics to try out, but since the current version already recovers a reasonable percentage of regression(with some pseudo probe order change, it can recover close to 90%), I'm submitting the patch for review and we will try more heuristics in future.
- Profile call target name are only available when the call is hit by samples, the missing anchor might mislead the matching, this can be mitigated in llvm-profgen to generate the call target for the zero samples.
- This doesn't handle function name mismatch, we plan to solve it in future.
Reviewed By: hoy, wenlei
Differential Revision: https://reviews.llvm.org/D147545
For profile staleness report, before it only counts for the top-level function samples in the nested profile, the samples in the inlinees are ignored. This could affect the quality of the metrics when there are heavily inlined functions. This change adds a feature to flatten the nested profile and we're changing to use flatten profile as the input for stale profile detection and matching.
Example for profile flattening:
```
Original profile:
_Z3bazi:20301:1000
1: 1000
3: 2000
5: inline1:1600
1: 600
3: inline2:500
1: 500
Flattened profile:
_Z3bazi:18701:1000
1: 1000
3: 2000
5: 600 inline1:600
inline1:1100:600
1: 600
3: 500 inline2: 500
inline2:500:500
1: 500
```
This feature could be useful for offline analysis, like understanding the hotness of each individual function. So I'm adding the support to `llvm-profdata merge` under `--gen-flattened-profile`.
Reviewed By: hoy, wenlei
Differential Revision: https://reviews.llvm.org/D146452
The function order in some tests had to be changed because they relied on ordering of functions returned in an SCC which is consistent but unspecified.