Record the number of function invocations from external code - code
outside the binary, which may include JIT code and DSOs. Accounting for
external entry counts improves the fidelity of call graph flow
conservation analysis.
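A minimal sketch of how external entry counts could feed a call graph flow check; the struct, helper names, and tolerance below are illustrative assumptions, not BOLT's actual API:
```
#include <algorithm>
#include <cstdint>

// Hypothetical per-function profile record, for illustration only.
struct FuncProfile {
  uint64_t ExternalEntryCount = 0; // invocations from JIT code / DSOs
  uint64_t InternalCallCount = 0;  // invocations from functions inside the binary
  uint64_t EntryCount = 0;         // times the function's entry block executed
};

// Attribute a recorded call edge to internal or external entries.
void recordCallEdge(FuncProfile &Callee, bool CallerIsInBinary, uint64_t Count) {
  if (CallerIsInBinary)
    Callee.InternalCallCount += Count;
  else
    Callee.ExternalEntryCount += Count; // without this, inflow is undercounted
}

// Call graph flow conservation: entries into a function should roughly equal
// the calls reaching it from all sources, inside and outside the binary.
bool flowIsConserved(const FuncProfile &F, double Tolerance = 0.1) {
  uint64_t Inflow = F.InternalCallCount + F.ExternalEntryCount;
  uint64_t Diff =
      Inflow > F.EntryCount ? Inflow - F.EntryCount : F.EntryCount - Inflow;
  return Diff <= Tolerance * std::max<uint64_t>(F.EntryCount, 1);
}
```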
Test Plan: updated shrinkwrapping.test
"Sample" is a general term covering both basic (IP) and branch (LBR)
profiles. Find and replace ambiguous uses of "sample" where a basic
sample is meant.
Rename `RawBranchCount` to `RawSampleCount`, reflecting its use for
both kinds of profile.
Rename the `PF_LBR` profile type to `PF_BRANCH`, reflecting branch
profiles that are not LBR-based (non-brstack SPE, synthesized brstack
from ETM/PT).
Follow-up to #137644.
Test Plan: NFC
Match inline trees first between the profile and the binary: by GUID,
checksum, parent, and inline site for inlined functions. Map profile
probes to binary probes via matched inline tree nodes. Each binary probe
has an associated binary basic block. If all probes from one profile
basic block map to the same binary basic block, it's an exact match;
otherwise, the block is determined by majority vote and reported as a
loose match.
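A rough sketch of the exact/loose decision described above, using simplified stand-in types rather than BOLT's actual data structures:
```
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <optional>
#include <unordered_map>
#include <utility>
#include <vector>

using BinaryBlockId = uint64_t;

// For one profile basic block, take the binary block that each of its mapped
// probes landed in, and decide between an exact and a loose match.
std::optional<std::pair<BinaryBlockId, bool /*IsExactMatch*/>>
mapProfileBlock(const std::vector<BinaryBlockId> &MappedProbeBlocks) {
  if (MappedProbeBlocks.empty())
    return std::nullopt;

  // Count how many probes landed in each binary block.
  std::unordered_map<BinaryBlockId, size_t> Votes;
  for (BinaryBlockId Block : MappedProbeBlocks)
    ++Votes[Block];

  // Exact match: every probe maps to the same binary block.
  if (Votes.size() == 1)
    return std::make_pair(MappedProbeBlocks.front(), /*IsExactMatch=*/true);

  // Loose match: pick the binary block with the most probes (majority vote).
  auto Best = Votes.begin();
  for (auto It = std::next(Votes.begin()); It != Votes.end(); ++It)
    if (It->second > Best->second)
      Best = It;
  return std::make_pair(Best->first, /*IsExactMatch=*/false);
}
```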
Pseudo probe matching happens between exact hash matching and call/loose
matching.
Introduce ProbeMatchSpec - a mechanism to match probes belonging to
another binary function. For example, given functions foo and bar:
```
void foo() {
  bar();
}
```
Profiled binary: bar is not inlined => the profile has a top-level
function bar.
New binary the profile is applied to: bar is inlined into foo.
Currently, BOLT does 1:1 matching between profile functions and binary
functions based on the name. #100446 will extend this to N:M where
multiple profiles can be matched to one binary function (as in the
example above where binary function foo would use profiles for foo and
bar), and one profile can be matched to multiple binary functions (e.g.
if bar was inlined into multiple functions).
In this diff, ProbeMatchSpecs only contain one BinaryFunctionProfile
(the existing name-based matching).
Test Plan: Added match-blocks-with-pseudo-probes.test
Performance test:
- Setup:
- Baseline no-BOLT: Clang with pseudo probes, ThinLTO + CSSPGO
(#79942)
- BOLT fresh: BOLTed Clang using fresh profile,
- BOLT stale (hash): BOLTed Clang using stale profile (collected on
Clang 10K commits back), `-infer-stale-profile` (hash+call block
matching)
- BOLT stale (+probe): BOLTed Clang using stale profile,
`-infer-stale-profile` with `-stale-matching-with-pseudo-probes`
(hash+call+pseudo probe block matching)
- 2S Intel SKX Xeon 6138 with 40C/80T and 256GB RAM, using 20C/40T for
build,
- BOLT profiles are collected on Clang compiling large preprocessed
C++ file.
- Benchmark: building Clang (average of 5 runs), see driver in
aaupov/llvm-devmtg-2022
- Results, wall time, lower is better:
- Baseline no-BOLT: 429.52 +- 2.61s,
- BOLT stale (hash): 413.21 +- 2.19s,
- BOLT stale (+probe): 409.69 +- 1.41s,
- BOLT fresh: 384.50 +- 1.80s.
---------
Co-authored-by: Amir Ayupov <aaupov@fb.com>
YAML function profiles have sparse function IDs, assigned from
sequential function IDs in the profiled binary. For example, for one
large binary, the YAML profile has 15K functions, but the highest ID is
~600K, close to the number of functions in the profiled binary.
In `matchProfileToFunction`, the `YamlProfileToFunction` vector was
resized to match the function ID, which entails a 40X overcommit. Change
the type of `YamlProfileToFunction` to DenseMap to reduce memory
utilization.
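A simplified illustration of the before/after; `BinaryFunction` and the profile record are stand-ins, and std::unordered_map is used in place of llvm::DenseMap to keep the sketch self-contained:
```
#include <cstdint>
#include <unordered_map>
#include <vector>

struct BinaryFunction;              // opaque stand-in for BOLT's class
struct YamlFunctionProfile {        // illustrative subset of a profile entry
  uint64_t Id;                      // sparse ID, may be far larger than the entry count
  BinaryFunction *BF;
};

// Before: indexing a vector by the sparse function ID forces it to grow to
// the largest ID (~600K slots for ~15K profiled functions).
void matchDense(std::vector<BinaryFunction *> &YamlProfileToFunction,
                const YamlFunctionProfile &YamlBF) {
  if (YamlProfileToFunction.size() <= YamlBF.Id)
    YamlProfileToFunction.resize(YamlBF.Id + 1); // ~40X overcommit in practice
  YamlProfileToFunction[YamlBF.Id] = YamlBF.BF;
}

// After: a hash map stores only the IDs that actually occur in the profile.
void matchSparse(
    std::unordered_map<uint64_t, BinaryFunction *> &YamlProfileToFunction,
    const YamlFunctionProfile &YamlBF) {
  YamlProfileToFunction[YamlBF.Id] = YamlBF.BF;
}
```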
#99891 makes use of it for profile lookup associated with a given binary
function.
The flag currently controls writing of probe information to the YAML
profile. #99891 adds a separate flag to use probe information for stale
profile matching. Thus `profile-use-pseudo-probes` becomes a misnomer,
and `profile-write-pseudo-probes` better captures the intent.
Reviewers: maksfb, WenleiHe, ayermolo, rafaelauler, dcci
Reviewed By: rafaelauler
Pull Request: https://github.com/llvm/llvm-project/pull/106364
This ensures forward compatibility: old BOLT versions can consume a
profile created by newer versions that contains extra keys.
Test Plan: added yaml-unknown-keys.test
Implemented call graph function matching. First, call graphs are
constructed for both the profiled and the binary functions. Then functions are
hashed based on the names of their callee/caller functions. Finally,
functions are matched based on these neighbor hashes and the
longest common prefix of their names. The `match-with-call-graph`
flag turns this matching on.
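A hedged sketch of the neighbor hashing and the prefix tie-breaker; the node layout and the use of std::hash are assumptions for illustration:
```
#include <algorithm>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Illustrative call-graph node: a function plus the names of its neighbors.
struct CGNode {
  std::string Name;
  std::vector<std::string> Callers;
  std::vector<std::string> Callees;
};

// Hash a function by the sorted names of its callers and callees, so a
// renamed function with the same call-graph neighborhood hashes identically.
uint64_t neighborHash(const CGNode &Node) {
  std::vector<std::string> Names = Node.Callers;
  Names.insert(Names.end(), Node.Callees.begin(), Node.Callees.end());
  std::sort(Names.begin(), Names.end());
  std::string Blob;
  for (const std::string &N : Names)
    Blob += N + '\0';                    // separator to avoid accidental collisions
  return std::hash<std::string>{}(Blob); // stand-in for the hash BOLT uses
}

// Tie-breaker when several candidates share a neighbor hash: prefer the
// candidate whose name shares the longest common prefix with the profile name.
size_t longestCommonPrefix(const std::string &A, const std::string &B) {
  size_t I = 0;
  while (I < A.size() && I < B.size() && A[I] == B[I])
    ++I;
  return I;
}
```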
Test Plan: Added match-with-call-graph.test. Matched 164 functions
in a large binary with 10171 profiled functions.
Added another hash level – call hash – following opcode hash matching
for stale block matching. Call hash strings are the concatenation of the
lexicographically ordered names of each block's called functions. This
change bolsters block matching in cases where some instructions have
been removed or added but calls remain constant.
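A small sketch of what such a call hash could look like; std::hash stands in for the string hash actually used:
```
#include <algorithm>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Build the call hash of a basic block: a hash of the lexicographically
// ordered names of the functions it calls. Blocks whose non-call instructions
// changed but whose calls are unchanged still agree at this hash level.
uint64_t blockCallHash(std::vector<std::string> CalleeNames) {
  std::sort(CalleeNames.begin(), CalleeNames.end());
  std::string Concat;
  for (const std::string &Name : CalleeNames)
    Concat += Name;
  return std::hash<std::string>{}(Concat);
}
```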
Test Plan: added match-functions-with-calls-as-anchors.test.
Moved function matching techniques into separate helper functions for
ease of understanding and to make space for additional function
matching techniques to be added (e.g. call graph function matching).
A mapping from namespace to associated binary functions is used to
match function profiles to binary functions based on the edit distance
threshold set by the '--name-similarity-function-matching-threshold'
flag. The flag is set to 0 (exact name matching) by default, as the
matching is expensive, requiring the processing of all BFs.
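A minimal sketch of the idea, assuming a hypothetical namespace-to-functions map and plain Levenshtein distance; the actual implementation details may differ:
```
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Classic Levenshtein edit distance (dynamic programming, two rows).
size_t editDistance(const std::string &A, const std::string &B) {
  std::vector<size_t> Prev(B.size() + 1), Cur(B.size() + 1);
  for (size_t J = 0; J <= B.size(); ++J)
    Prev[J] = J;
  for (size_t I = 1; I <= A.size(); ++I) {
    Cur[0] = I;
    for (size_t J = 1; J <= B.size(); ++J)
      Cur[J] = std::min({Prev[J] + 1, Cur[J - 1] + 1,
                         Prev[J - 1] + (A[I - 1] == B[J - 1] ? 0 : 1)});
    std::swap(Prev, Cur);
  }
  return Prev[B.size()];
}

// Only compare a profile name against binary functions in the same namespace,
// and accept the closest candidate within the configured threshold.
std::string matchBySimilarity(
    const std::string &Namespace, const std::string &ProfileName,
    const std::map<std::string, std::vector<std::string>> &NamespaceToFunctions,
    size_t Threshold) {
  auto It = NamespaceToFunctions.find(Namespace);
  if (It == NamespaceToFunctions.end())
    return {};
  std::string Best;
  size_t BestDist = Threshold + 1;
  for (const std::string &Candidate : It->second) {
    size_t Dist = editDistance(ProfileName, Candidate);
    if (Dist < BestDist) {
      BestDist = Dist;
      Best = Candidate;
    }
  }
  return Best; // empty if nothing is within the threshold
}
```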
Test Plan: Added name-similarity-function-matching.test. On a binary
with 5M functions, rewrite passes took ~520s without the flag and
~2018s with the flag set to 20.
Added the '--match-profile-with-function-hash' flag to match functions
based on exact hash. After identical and LTO name matching, more
functions can be recovered for inference with exact hash matching, e.g.,
when a function was renamed without functional changes. Collisions are
possible in the unlikely case where multiple functions share the same
exact hash. The flag is off by default as it requires processing all
binary functions and is therefore expensive.
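A hedged sketch of the mechanics, with simplified stand-in records; for simplicity this version skips ambiguous hashes rather than matching them:
```
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Illustrative records; the actual classes and hash are more involved.
struct BinaryFunc  { std::string Name; uint64_t Hash; bool HasProfile = false; };
struct ProfileFunc { std::string Name; uint64_t Hash; BinaryFunc *Match = nullptr; };

// Index all still-unmatched binary functions by hash, then match the remaining
// profile functions whose hash hits exactly one entry.
void matchByExactHash(std::vector<BinaryFunc> &BinaryFuncs,
                      std::vector<ProfileFunc> &ProfileFuncs) {
  std::unordered_multimap<uint64_t, BinaryFunc *> HashToBF;
  for (BinaryFunc &BF : BinaryFuncs)
    if (!BF.HasProfile)
      HashToBF.emplace(BF.Hash, &BF);

  for (ProfileFunc &PF : ProfileFuncs) {
    if (PF.Match)
      continue;                        // already matched by name
    if (HashToBF.count(PF.Hash) != 1)
      continue;                        // no hit, or an ambiguous collision
    BinaryFunc *BF = HashToBF.find(PF.Hash)->second;
    PF.Match = BF;
    BF->HasProfile = true;
  }
}
```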
Test Plan: added hashing-based-function-matching.test.
We compute BF hashes in `YAMLProfileReader::readProfile` when first
matching profile functions with binary functions, and a second time in
`YAMLProfileReader::parseFunctionProfile` during profile assignment
(needed to account for LTO private functions with a mismatching
suffix).
Avoid recomputing the hash if it's been set.
A YAML profile for non-simple functions without a CFG:
1) is useless for optimizations,
2) can't be attached, similar to an fdata profile,
3) would be reported as invalid/stale even if the profile is valid.
Don't attempt to attach the profile in this case, aligning the behavior
with DataReader.
Test Plan: added yaml-non-simple.test
This patch replaces uses of StringRef::{starts,ends}with with
StringRef::{starts,ends}_with for consistency with
std::{string,string_view}::{starts,ends}_with in C++20.
I'm planning to deprecate and eventually remove
StringRef::{starts,ends}with.
Provide backwards compatibility for YAML profiles that use `std::hash`:
the xxh3 hash is the default for newly produced profiles (which set
`std-hash: false`), whereas a profile that doesn't specify `std-hash`
will be treated as `std-hash: true`, preserving the old behavior.
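A small sketch of the reader-side defaulting; the helper names are hypothetical and the xxh3 call is a placeholder rather than LLVM's real entry point:
```
#include <cstdint>
#include <functional>
#include <optional>
#include <string>

// Placeholder for an xxh3 implementation (LLVM ships one); not the real hash.
uint64_t xxh3Hash64(const std::string &Data) {
  return std::hash<std::string>{}(Data) ^ 0x9e3779b97f4a7c15ULL;
}

// A profile without a `std-hash` key predates the xxh3 switch, so it defaults
// to the legacy std::hash behavior; new profiles write `std-hash: false`.
bool useStdHash(std::optional<bool> StdHashKeyFromHeader) {
  return StdHashKeyFromHeader.value_or(true);
}

// Pick the hash function according to the profile header.
uint64_t hashBlockContents(const std::string &Data, bool UseStdHash) {
  if (UseStdHash)
    return std::hash<std::string>{}(Data); // legacy profiles
  return xxh3Hash64(Data);                 // default for newly produced profiles
}
```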
D159460 regressed the bugfix in D156644. Fix that and emit a warning.
Add a test case.
Reviewed By: #bolt, maksfb
Differential Revision: https://reviews.llvm.org/D159529
Reduce YAML profile processing times:
- preprocessProfile: speed up buildNameMaps by replacing the ProfileNameToProfile
mapping with a ProfileFunctionNames set and a ProfileBFs vector.
Pre-look up the YamlBF->BF correspondence and memoize it in ProfileBFs
(see the sketch below).
- readProfile: replace iteration over all functions in the binary with iteration
over profile functions (strict match and LTO name match).
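A sketch of the memoization, assuming a simple name-keyed lookup map; the actual lookup involves more name handling than shown here:
```
#include <string>
#include <unordered_map>
#include <vector>

struct BinaryFunction;                          // opaque stand-in for BOLT's class
struct YamlFunctionProfile { std::string Name; };

// Resolve each profile entry to its binary function once, during
// pre-processing, and keep the result in a vector parallel to the profile.
// Later passes index ProfileBFs instead of re-scanning all binary functions.
std::vector<BinaryFunction *> buildProfileBFs(
    const std::vector<YamlFunctionProfile> &YamlFunctions,
    const std::unordered_map<std::string, BinaryFunction *> &NameToBF) {
  std::vector<BinaryFunction *> ProfileBFs;
  ProfileBFs.reserve(YamlFunctions.size());
  for (const YamlFunctionProfile &YamlBF : YamlFunctions) {
    auto It = NameToBF.find(YamlBF.Name);
    ProfileBFs.push_back(It == NameToBF.end() ? nullptr : It->second);
  }
  return ProfileBFs;
}
```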
On a large binary (1.9M functions) and a large YAML profile (121MB, 30K functions),
this reduces the profile step runtimes:
pre-process profile data: 12.4953s -> 10.7123s
process profile data: 9.8195s -> 5.6639s
Compared to fdata profile reading:
pre-process profile data: 8.0268s
process profile data: 1.0265s
process profile data pre-CFG: 0.1644s
Reviewed By: #bolt, maksfb
Differential Revision: https://reviews.llvm.org/D159460
Work around the issue of multiple profiles per function. This can
happen with a stale profile that has separate profiles for functions
that were merged and became aliases in a new binary.
Reviewed By: #bolt, maksfb
Differential Revision: https://reviews.llvm.org/D156644
1. Using ADT/Bitfields.h for hash computation; this is equivalent but shorter than the existing implementation
2. Getting rid of Layout indices for stale matching; using BB->getIndex for indexing
Reviewed By: Amir
Differential Revision: https://reviews.llvm.org/D155748
Specify the block order used in the YAML profile. Needed to ensure profile backwards
compatibility with the pre-D155514 DFS order by default.
Reviewed By: #bolt, maksfb
Differential Revision: https://reviews.llvm.org/D156176
Use layout order in YAML profile reading/writing. Preserve the old behavior (DFS order)
under the `-profile-use-dfs` option.
Reviewed By: spupyrev
Differential Revision: https://reviews.llvm.org/D155514
BOLT often has to deal with profiles collected on binaries built several
revisions behind the release. As a result, a certain percentage of functions is
considered stale and not optimized. This diff adds the ability to match the
profile to functions that are not 100% binary identical, which increases the
optimization coverage and boosts the performance of applications.
The algorithm consists of two phases, matching and inference:
- At the matching phase, we try to "guess" as many block and jump counts from
the stale profile as possible. To this end, the content of each basic block
is hashed and stored in the (yaml) profile. When BOLT optimizes a binary,
it computes block hashes and identifies the corresponding entries in the
stale profile. It yields a partial profile for every CFG in the binary.
- At the inference phase, we employ a network flow-based algorithm (profi) to
reconstruct "realistic" block and jump counts from the partial profile
generated at the first stage. In practice, we don't always produce proper
profile data but the majority (e.g., >90%) of CFGs get the correct counts.
This is the first part of the change; the next stacked diff extends the block
hashing and provides performance evaluation numbers.
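A simplified sketch of the matching phase only, with stand-in types; the real block hashes are structured rather than a single value, and the inference phase is not shown:
```
#include <cstdint>
#include <unordered_map>
#include <vector>

// Illustrative types: a stale-profile block carries a content hash and a
// count; a binary block only has its recomputed content hash.
struct StaleProfileBlock { uint64_t Hash; uint64_t Count; };
struct BinaryBlock       { uint64_t Hash; };

// Matching phase: look up each binary block's hash in the stale profile to
// "guess" its count. Unmatched blocks stay at 0 and are left for the
// inference phase (profi) to reconstruct.
std::vector<uint64_t>
matchBlockCounts(const std::vector<BinaryBlock> &BinaryBlocks,
                 const std::vector<StaleProfileBlock> &ProfileBlocks) {
  std::unordered_map<uint64_t, uint64_t> HashToCount;
  for (const StaleProfileBlock &PB : ProfileBlocks)
    HashToCount[PB.Hash] += PB.Count;

  std::vector<uint64_t> Counts(BinaryBlocks.size(), 0);
  for (size_t I = 0; I < BinaryBlocks.size(); ++I) {
    auto It = HashToCount.find(BinaryBlocks[I].Hash);
    if (It != HashToCount.end())
      Counts[I] = It->second; // partial profile for this CFG block
  }
  return Counts;
}
```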
Reviewed By: maksfb
Differential Revision: https://reviews.llvm.org/D144500
`Function.RawBranchCount` is initialized for the fdata profile but not for the
YAML one. This diff adds the computation of the field for YAML profiles.
Reviewed By: Amir
Differential Revision: https://reviews.llvm.org/D144211
Summary:
Refactor bolt/*/Profile to follow the braces rule for if/else/loop from
[LLVM Coding Standards](https://llvm.org/docs/CodingStandards.html).
(cherry picked from FBD33345741)
Summary:
Switched members of BinaryFunction to ADT containers where it was
possible and made sense. As a result, the size of BinaryFunction on
x86-64 Linux was reduced from 1624 bytes to 1448 bytes.
(cherry picked from FBD32981555)
Summary:
Moves source files into separate components and makes component
dependencies explicit, so the LLVM build system knows how to build BOLT
with BUILD_SHARED_LIBS=ON.
Please use the -c merge.renamelimit=230 git option when rebasing your
work on top of this change.
To achieve this, we create a new library to hold core IR files (most
classes beginning with Binary in their names), a new library to hold
Utils, some command line options shared across both RewriteInstance
and core IR files, a new library called Rewrite to hold most classes
concerned with running top-level functions coordinating the binary
rewriting process, and a new library called Profile to hold classes
dealing with profile reading and writing.
To remove the dependency from BinaryContext into X86-specific classes,
we do some refactoring on the BinaryContext constructor to receive a
reference to the specific backend directly from RewriteInstance. Then,
the dependency on X86 or AArch64-specific classes is transfered to the
Rewrite library. We can't have the Core library depend on targets
because targets depend on Core (which would create a cycle).
Files implementing the entry point of a tool are transferred to the
tools/ folder. All header files are transferred to the include/
folder. The src/ folder was renamed to lib/.
(cherry picked from FBD32746834)