Reuse the definition of profile density from llvm-profgen (#92144):
- the density is computed in perf2bolt using raw samples (perf.data or
  pre-aggregated data),
- function density is the ratio of dynamically executed function bytes
  to the static function size in bytes,
- profile density:
  - functions are sorted by density in decreasing order, accumulating
    their respective sample counts,
  - profile density is the smallest density covering 99% of total sample
    count.
In other words, BOLT binary profile density is the minimum amount of
profile information per function (excluding functions in tail 1% sample
count) which is sufficient to optimize the binary well.
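The definition above can be sketched in a few lines of Python. This is an illustration only, not the perf2bolt code: the name `profile_density`, the tuple layout, and the sample numbers are assumptions made for the example.

```python
def profile_density(functions, coverage=0.99):
    """functions: iterable of (executed_bytes, size_bytes, sample_count).

    Hypothetical helper; returns the smallest function density needed
    to cover `coverage` of the total sample count.
    """
    # Function density = dynamically executed bytes / static size in bytes.
    ranked = sorted(
        ((executed / size, samples)
         for executed, size, samples in functions if size > 0),
        key=lambda item: item[0],
        reverse=True,
    )
    total = sum(samples for _, samples in ranked)
    if total == 0:
        return 0.0
    accumulated = 0
    for density, samples in ranked:
        accumulated += samples
        # Stop at the first (i.e. smallest) density reaching the coverage.
        if accumulated >= coverage * total:
            return density
    return ranked[-1][0]


# Densities are 100, 50 and 1; the first two functions hold 995 of the
# 1000 samples, so the coldest function falls into the ignored 1% tail.
example = [(1000, 10, 900), (500, 10, 95), (10, 10, 5)]
density = profile_density(example)
```

Comparing the result against the threshold is then a single comparison; how the suggested sample multiplier is derived is not covered by this sketch.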
The density threshold of 60 was determined through experiments with
large binaries by reducing the sample count and checking resulting
profile density and performance. The threshold is conservative.
perf2bolt prints a warning if the density is below the threshold and
suggests increasing the sampling duration and/or frequency to reach
a given density, e.g.:
```
BOLT-WARNING: BOLT is estimated to optimize better with 2.8x more samples.
```
Test Plan: updated pre-aggregated-perf.test
Reviewers: maksfb, wlei-llvm, rafaelauler, ayermolo, dcci, WenleiHe
Reviewed By: WenleiHe, wlei-llvm
Pull Request: https://github.com/llvm/llvm-project/pull/101094
For a large binary with BAT section of size 38 MB with ~170k maps,
reduces writeMaps time from 70s down to 1s.
The inefficiency was in the use of std::distance with std::map::iterator
which doesn't provide random access. Use sorted vector for lookups.
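The idea can be illustrated with a Python analogue (the patch itself is C++). Finding an element's position by walking a node-based container is linear per lookup, while binary search over a sorted array is logarithmic. The function names below are hypothetical.

```python
from bisect import bisect_left


def build_index(addresses):
    # Sort once up front; later lookups reuse the sorted array.
    return sorted(addresses)


def index_of(sorted_addresses, address):
    # Binary search: O(log n) instead of walking from the beginning.
    pos = bisect_left(sorted_addresses, address)
    if pos < len(sorted_addresses) and sorted_addresses[pos] == address:
        return pos
    raise KeyError(address)


index = build_index([0x30, 0x10, 0x20])
position = index_of(index, 0x20)
```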
Test Plan: NFC
Reviewers: maksfb, rafaelauler, dcci, ayermolo
Reviewed By: maksfb
Pull Request: https://github.com/llvm/llvm-project/pull/112061
In a perfect profile, each positive-execution-count block in the
function’s CFG should be reachable from a positive-execution-count
function entry block through a positive-execution-count path. This new
pass checks how well the BOLT input profile satisfies this “CFG
continuity” property.
More specifically, for each of the hottest 1000 functions, the pass
calculates the function’s fraction of basic block execution counts that
is “unreachable”. It then reports the 95th percentile of the
distribution of the 1000 unreachable fractions in a single BOLT-INFO
line. The smaller the reported value is, the better the BOLT profile
satisfies the CFG continuity property.
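A minimal sketch of the per-function metric, assuming the CFG is given as plain dictionaries. The name `unreachable_fraction` and the input layout are hypothetical, and the sketch looks at block counts only; the actual pass may also take edge counts into account.

```python
from collections import deque


def unreachable_fraction(counts, successors, entries):
    """counts: block -> execution count; successors: block -> list of blocks;
    entries: list of function entry blocks."""
    # Start from entry blocks that have a positive execution count.
    queue = deque(b for b in entries if counts.get(b, 0) > 0)
    reached = set(queue)
    while queue:
        block = queue.popleft()
        for succ in successors.get(block, ()):
            # Only walk through blocks with a positive count.
            if counts.get(succ, 0) > 0 and succ not in reached:
                reached.add(succ)
                queue.append(succ)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    unreachable = sum(c for b, c in counts.items()
                      if c > 0 and b not in reached)
    return unreachable / total


# C is hot but can only be reached through B, whose count is zero.
counts = {"A": 100, "B": 0, "C": 50, "D": 50}
successors = {"A": ["B", "D"], "B": ["C"], "D": []}
fraction = unreachable_fraction(counts, successors, ["A"])
```

In this example 50 of the 200 counted executions belong to a block that no positive-count path reaches, so the fraction is 0.25.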
The default value of 1000 above can be changed via the hidden BOLT
option `-num-functions-for-continuity-check=[N]`. If more detailed stats
are needed, `-v=1` can be added to the BOLT invocation: the hottest N
functions will be grouped into 5 equally-sized buckets, from the hottest
to the coldest; for each bucket, various summary statistics of the
distribution of the fractions and the raw unreachable execution counts
will be reported.
Don't call raw_string_ostream::flush(), which is essentially a no-op.
As specified in the docs, raw_string_ostream is always unbuffered.
(See 65b13610a5226b84889b923bae884ba395ad084d for further reference.)
… segments in ELF binary.
The heuristic is improved by also taking into account that only
executable segments should contain instructions.
Fixes #109384.
(this is the part related to bolt, lld and mlir)
Without these explicit includes, removing other headers that implicitly
include llvm-config.h may have non-trivial side effects. For example,
`clangd` may report even `llvm-config.h` as "not used" when it defines
a macro that is explicitly used with #ifdef. The problem is amplified
by different build configs, which use different sets of macros.
Add probe inline tree information to YAML profile, at function level:
- function GUID,
- checksum,
- parent node id,
- call site in the parent.
This information is used for pseudo probe block matching (#99891).
The encoding adds/changes probe information in multiple levels of
YAML profile:
- BinaryProfile: add pseudo_probe_desc with GUIDs and Hashes, which
  permits deduplication of data:
  - many GUIDs are duplicate as the same callee is commonly inlined
    into multiple callers,
  - hashes are also very repetitive, especially for functions with
    low block counts.
- FunctionProfile: add inline tree (see above). Top-level function
  is included as root of function inline tree, which makes guid and
  pseudo_probe_desc_hash fields redundant.
- BlockProfile: densely-encoded block probe information:
  - probes reference their containing inline tree node,
  - separate lists for block, call, indirect call probes,
  - block probe encoding is specialized: ids are encoded as bitset
    in uint64_t. If only block probe with id=1 is present, it's
    encoded as implicit entry (id=0, omitted).
  - inline tree nodes with identical probes share probe description
    where node indices are combined into a list.
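The bitset encoding of block probe ids can be illustrated as follows. This is a sketch under stated assumptions, not the YAML writer's code: the bit position chosen for each id (bit `id - 1`), the restriction to ids 1..64, and both function names are assumptions made for the example.

```python
def encode_block_probes(ids):
    """Pack block probe ids into a 64-bit mask (assumed: bit id-1 per id)."""
    ids = set(ids)
    if not ids:
        raise ValueError("expected at least one block probe id")
    if ids == {1}:
        # Only the entry probe is present: implicit entry, value 0 (omitted).
        return 0
    mask = 0
    for probe_id in ids:
        if not 1 <= probe_id <= 64:
            raise ValueError(f"probe id {probe_id} does not fit in uint64_t")
        mask |= 1 << (probe_id - 1)
    return mask


def decode_block_probes(mask):
    """Inverse of encode_block_probes under the same assumptions."""
    if mask == 0:
        return [1]
    return [bit + 1 for bit in range(64) if (mask >> bit) & 1]


mask = encode_block_probes([1, 3])
```

A single integer replaces a list of ids, which is where the density of the encoding comes from.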
On top of #107970, profile with new probe encoding has the following
characteristics (profile for a large binary):
- Profile without probe information: 33MB, 3.8MB compressed (baseline).
- Profile with inline tree information: 92MB, 14MB compressed.
Profile processing time (YAML parsing, inference, attaching steps):
- profile without pseudo probes: 5s,
- profile with pseudo probes, without pseudo probe matching: 11s,
- with pseudo probe matching: 12.5s.
Test Plan: updated pseudoprobe-decoding-inline.test
Reviewers: wlei-llvm, ayermolo, rafaelauler, dcci, maksfb
Reviewed By: wlei-llvm, rafaelauler
Pull Request: https://github.com/llvm/llvm-project/pull/107137
Pseudo probe function records contain GUIDs assigned by the compiler
using an IR function name. Thus suffixes added later (e.g. `.llvm.`
for internal symbols, `.destroy`/`.resume` for coroutine fragments,
and `.cold`/`.warm` for split fragments) cause GUID mismatch.
Address that by dropping those suffixes using `getCommonName` which is
a parametrized form of `getLTOCommonName`.
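The effect of dropping suffixes can be sketched as below. The Python function `common_name` is a hypothetical stand-in, not the signature or exact behavior of BOLT's `getCommonName`; it simply truncates a name at the first of the listed suffixes.

```python
# Suffix markers taken from the description above.
SUFFIXES = (".llvm.", ".destroy", ".resume", ".cold", ".warm")


def common_name(name, suffixes=SUFFIXES):
    """Return `name` truncated at the earliest occurrence of any suffix."""
    cut = len(name)
    for suffix in suffixes:
        pos = name.find(suffix)
        if pos != -1:
            cut = min(cut, pos)
    return name[:cut]


stripped = common_name("foo.llvm.1234")
```

With the suffix removed, the name matches the IR function name the compiler used when assigning the GUID.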
This patch aborts BOLT execution if it finds an out-of-section (section
end) symbol in the GOT. To handle such situations properly in the
future, we would need an arch-dependent way to analyze relocations or
their sequences, e.g., for ARM it would probably be ADRP + LDR analysis
to get the GOT entry address. Currently, this is also challenging
because GOT-related relocation symbols are replaced with
__BOLT_got_zero. In any case, this seems to be quite a rare case that
appears to be related only to static binaries. For the most part, it
should be handled at the link stage, since a static binary should not
have a GOT at all. LLD with relaxations enabled would redirect
instruction addresses from the GOT directly to target symbols, which
eliminates the problem.
To detect such cases, this patch fixes a few things in BOLT:
1. For the end symbols, we now use the section provided by the ELF
binary. Previously, such a symbol would be tied to a wrong section found
by symbol address.
2. The end symbols have limited registration: we only add them to the
name->data GlobalSymbols map, since using the address->data
BinaryDataMap map would likely be impossible due to the address duality
of such symbols.
3. The outdated BD->getSection check (it currently returns a reference,
not a pointer) in postProcessSymbolTable is replaced by a getSize check
to allow zero-sized top-level symbols if they are located in
zero-sized sections. For the most part, such things could only be found
in tests, but I don't see a reason not to handle such cases.
4. Updated the section-end-sym test and removed the x86_64 requirement,
since there is no reason for it (tested on aarch64 linux).
The test was provided by peterwaller-arm (thank you) in #100096 and
slightly modified by me.
Continues #87196: the author did not have much time, so I have taken
over work on this PR. We would like to have this so that it is easier to
package for Nix.
This can be tested by copying the cmake, bolt, third-party, and llvm
directories out into their own directory with this PR applied and then
building bolt.
---------
Co-authored-by: pca006132 <john.lck40@gmail.com>
Three-way splitting can create references between split fragments (warm
to cold or vice versa) that are not handled by
`isChildOf/isParentOf/isChildOrParentOf`. Generalize fragment
relationships to allow checking if two functions belong to one group,
potentially in presence of ICF which can join multiple groups.
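One generic way to answer "do these two functions belong to one group" is a disjoint-set (union-find) structure, sketched below. This is a swapped-in textbook technique used for illustration, not the data structure the patch uses; the class and method names are hypothetical.

```python
class FragmentGroups:
    """Disjoint-set over function names (illustrative only)."""

    def __init__(self):
        self.parent = {}

    def find(self, name):
        self.parent.setdefault(name, name)
        root = name
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited node at the root.
        while self.parent[name] != root:
            self.parent[name], name = root, self.parent[name]
        return root

    def join(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a

    def same_group(self, a, b):
        return self.find(a) == self.find(b)


groups = FragmentGroups()
groups.join("foo", "foo.cold")
groups.join("foo", "foo.warm")
groups.join("bar", "bar.cold")
```

Joining `foo` and `bar` afterwards, as ICF folding two parents together would, merges both groups, so `foo.warm` and `bar.cold` then report as related.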
Test Plan: NFC for existing tests
Reviewers: maksfb, ayermolo, rafaelauler, dcci
Reviewed By: rafaelauler
Pull Request: https://github.com/llvm/llvm-project/pull/99979
Implemented call graph function matching. First, two call graphs are
constructed for both profiled and binary functions. Then functions are
hashed based on the names of their callee/caller functions. Finally,
functions are matched based on these neighbor hashes and the
longest common prefix of their names. The `match-with-call-graph`
flag turns this matching on.
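The three steps can be sketched as follows, assuming each call graph is a dictionary from function name to its caller and callee names. The hash construction, tie-breaking, and all names here are assumptions made for the example rather than BOLT's implementation.

```python
import hashlib


def neighbor_hash(callers, callees):
    # Hash a function by the sorted names of its callers and callees.
    text = "|".join(sorted(callers)) + "#" + "|".join(sorted(callees))
    return hashlib.sha256(text.encode()).hexdigest()


def common_prefix_len(a, b):
    length = 0
    for x, y in zip(a, b):
        if x != y:
            break
        length += 1
    return length


def match_functions(profile_graph, binary_graph):
    """Both graphs map name -> (callers, callees)."""
    by_hash = {}
    for name, (callers, callees) in binary_graph.items():
        by_hash.setdefault(neighbor_hash(callers, callees), []).append(name)
    matches = {}
    for name, (callers, callees) in profile_graph.items():
        candidates = by_hash.get(neighbor_hash(callers, callees), [])
        if candidates:
            # Among equal neighbor hashes, prefer the longest common prefix.
            matches[name] = max(
                candidates, key=lambda c: common_prefix_len(name, c))
    return matches


profile = {"process_v1": (["main"], ["read", "write"])}
binary = {
    "process_v2": (["main"], ["write", "read"]),
    "other": (["main"], ["exit"]),
}
matched = match_functions(profile, binary)
```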
Test Plan: Added match-with-call-graph.test. Matched 164 functions
in a large binary with 10171 profiled functions.
Read pseudo probes in regular and BAT YAML profile generation, and
attach them to YAML profile basic blocks. This exposes GUID, probe id,
and probe type in profile for future use in stale profile matching.
Test Plan: updated pseudoprobe-decoding-inline.test
Reviewers: dcci, rafaelauler, ayermolo, maksfb
Reviewed By: rafaelauler
Pull Request: https://github.com/llvm/llvm-project/pull/99554
Add a BinaryFunction field for pseudo probe function GUID.
Populate it during pseudo probe section parsing, and emit it in YAML
profile (both regular and BAT), along with function checksum.
To be used for stale function matching.
Test Plan: update pseudoprobe-decoding-inline.test
Detect and support fixed PIC indirect jumps of the following form:
```
movslq En(%rip), %r1
leaq PIC_JUMP_TABLE(%rip), %r2
addq %r2, %r1
jmpq *%r1
```
with PIC_JUMP_TABLE that looks like the following:
```
  JT:  ----------
   E1:| L1 - JT  |
      |----------|
   E2:| L2 - JT  |
      |----------|
      |          |
         ......
   En:| Ln - JT  |
       ----------
```
Compilers can produce such code, see
https://github.com/llvm/llvm-project/issues/91648.
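Since every entry stores `Ln - JT`, a target is recovered by adding the table address back to the entry. The sketch below assumes 4-byte little-endian signed entries, matching the sign-extending 32-bit load done by `movslq` on x86-64; the function name and addresses are hypothetical.

```python
import struct


def decode_pic_jump_table(data, table_address, num_entries):
    """Return absolute targets for a PIC jump table stored in `data`."""
    # Each entry is a signed 32-bit offset relative to the table start.
    offsets = struct.unpack_from(f"<{num_entries}i", data)
    return [table_address + offset for offset in offsets]


# Build a table for three labels, two of them located before the table.
JT = 0x2000
labels = [0x1000, 0x1010, 0x2400]
raw = struct.pack("<3i", *(label - JT for label in labels))
targets = decode_pic_jump_table(raw, JT, 3)
```

In the fixed form, the instruction sequence references one specific entry `En` directly, so only that single entry needs to be decoded.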
Test Plan: updated jump-table-fixed-ref-pic.test
Reviewers: maksfb, ayermolo, dcci, rafaelauler
Reviewed By: rafaelauler
Pull Request: https://github.com/llvm/llvm-project/pull/91667
With aggressive ICF, it's possible for different local symbols
(under different FILE symbols) to be mapped to the same address.
FileSymRefs only keeps a single SymbolRef per address, which prevents
fragment matching from finding the correct symbol to perform parent
function lookup.
Work around this issue by switching FileSymRefs to a multimap. In
future, uses of FileSymRefs can be replaced with SortedSymbols which
keeps essentially the same information.
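The difference between a map and a multimap keyed by address can be shown with a small Python model. It is a simplification: the real FileSymRefs stores SymbolRef objects, and the pairing of each symbol with a file name here is an assumption made to keep the example self-contained.

```python
from collections import defaultdict

# address -> list of (file_name, symbol_name); several entries may share
# one address, which a plain one-value-per-key map cannot represent.
file_sym_refs = defaultdict(list)


def add_symbol(address, file_name, symbol_name):
    file_sym_refs[address].append((file_name, symbol_name))


def find_symbol(address, file_name):
    # Pick the symbol at this address that belongs to the requested file.
    for owner, symbol_name in file_sym_refs.get(address, ()):
        if owner == file_name:
            return symbol_name
    return None


# Two local symbols from different files folded to the same address.
add_symbol(0x1000, "a.c", "foo")
add_symbol(0x1000, "b.c", "bar")
```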
Test Plan: added ambiguous_fragment.test
Reviewers: dcci, ayermolo, maksfb, rafaelauler
Reviewed By: rafaelauler
Pull Request: https://github.com/llvm/llvm-project/pull/98992
`createDummyReturnFunction` does not create a function, only
a function body that is simply a return statement.
This patch renames it to `createReturnInstructionList`.
AArch64 needs this function when instrumenting statically-linked binaries.
Sample commands:
```bash
clang -Wl,-q test.c -static -o out
llvm-bolt -instrument -instrumentation-sleep-time=5 out -o out.instr
```
Added another hash level – call hash – following opcode hash matching
for stale block matching. Call hash strings are the concatenation of the
lexicographically ordered names of each block’s called functions. This
change bolsters block matching in cases where some instructions have
been removed or added but calls remain constant.
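A minimal sketch of building such a call hash string, assuming a block is represented by the list of function names it calls. Whether repeated callees are deduplicated and which hash function is applied are not stated above, so the choices here (keep duplicates, SHA-256) are assumptions.

```python
import hashlib


def call_hash_string(called_functions):
    # Concatenate the callee names in lexicographic order.
    return "".join(sorted(called_functions))


def call_hash(called_functions):
    text = call_hash_string(called_functions)
    return hashlib.sha256(text.encode()).hexdigest()


# The same callees in a different order produce the same hash, no matter
# which other (non-call) instructions surround them in the block.
old_block_calls = ["puts", "malloc", "free"]
new_block_calls = ["malloc", "free", "puts"]
same = call_hash(old_block_calls) == call_hash(new_block_calls)
```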
Test Plan: added match-functions-with-calls-as-anchors.test.
Moved function matching techniques into separate helper functions for
ease of understanding and to make space for additional function
matching techniques to be added (e.g. call graph function matching).
There could be multiple TUs with the same hash in various DWO files. In
bigger binaries this could be in the thousands. Although they could be
structurally different and we need to output Entries for all of them,
for the purposes of figuring out a TU hash we only need one entry in
Foreign TU list.
Refactors legacy ranges writers to create a writer for each instance of
a DWO file.
We now write everything out into .debug_ranges after all the DWO
files are processed. This also changes the order in which ranges are
written out: before, we wrote them out while in the main CU processing
loop; now we iterate through the CU buckets created by partitionCUs
after the main processing loop.
A mapping from namespace to associated binary functions is used to
match function profiles to the binary based on the edit distance
threshold set by the '--name-similarity-function-matching-threshold'
flag. The flag is set to 0 (exact name matching) by default, as the
matching is expensive, requiring the processing of all BFs.
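The matching can be sketched with a standard Levenshtein distance restricted to candidates in the same namespace. The way the namespace is derived (text before the last `::`) and the function names are assumptions made for the example, not BOLT's implementation.

```python
def edit_distance(a, b):
    """Levenshtein distance using a rolling row of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def match_by_name(profile_name, namespace_to_functions, threshold):
    # Only compare against binary functions from the same namespace.
    namespace = profile_name.rpartition("::")[0]
    best = None
    for candidate in namespace_to_functions.get(namespace, []):
        distance = edit_distance(profile_name, candidate)
        if distance <= threshold and (best is None or distance < best[0]):
            best = (distance, candidate)
    return best[1] if best else None


functions = {"ns": ["ns::foo_v2", "ns::unrelated"]}
match = match_by_name("ns::foo_v1", functions, threshold=2)
```

Grouping by namespace keeps the number of pairwise distance computations down, though building the map still requires visiting every binary function.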
Test Plan: Added name-similarity-function-matching.test. On a binary
with 5M functions, rewrite passes took ~520s without the flag and
~2018s with the flag set to 20.
9d0754ada5dbbc0c009bcc2f7824488419cc5530 dropped MC support required for
optimal macro-fusion alignment in BOLT. Remove the support in BOLT as
performance measurements with large binaries didn't show a significant
improvement.
Test Plan:
macro-fusion alignment was never upstreamed, so no upstream tests are
affected.
`isUnsupportedBranch` was renamed (and inverted) to `isReversibleBranch`, as that was how it was being used. But one use in `BinaryFunction::disassemble` relied on the original meaning to detect unsupported branches, and `isUnsupportedBranch` contained two separate semantic checks.
Move the unsupported branch check from `isReversibleBranch` to a new entry point: `isUnsupportedInstruction`. Call that from `BinaryFunction::disassemble`.
Move the dynamic branch check from X86's isReversibleBranch to the base class, as it is not an architecture-specific check.
Remove unnecessary `isReversibleBranch` calls from Instrumentation and X86 MCPlusBuilder.
Alternative instruction sequences in the Linux kernel can modify the
stack and thus they need their own ORC unwind entries. Since there's
only one ORC table, it has to be "shared" among multiple instruction
sequences. The kernel achieves this by putting a restriction on
instruction boundaries. If ORC state changes at a given IP, only one of
the alternative sequences can have an instruction starting/ending at
this IP. Then, developers can insert NOPs to guarantee the above
requirement is met.
The most common use of ORC with alternatives is "pushf; pop %rax"
sequence used for paravirtualization. Note that newer kernel versions
no longer use .parainstructions; instead, they utilize alternatives for
the same purpose.
Until we implement better support for alternatives, we can safely
skip ORC entries associated with them.
Fixes #87052.
`convertCallToIndirectCall` applies the PLTCall optimization and returns
an (updated if needed) iterator to the converted call instruction. Since
AArch64 requires injecting additional instructions to implement this
pass, the relevant BasicBlock and an iterator are passed to
`convertCallToIndirectCall`.
`NumCallsOptimized` is updated only on successful application of the
pass.
Tests:
- Inputs/plt-tailcall.c: an example of a tail call optimized PLT call.
- AArch64/plt-call.test: the actual A64 test, which runs the
PLTCall optimization on the above input file and verifies the
application of the pass to the calls: 'printf' and 'puts'.
Both `reverseBranchCondition` and `replaceBranchTarget` return a success boolean. But all callers except one ignore the return value, and the exception emits a fatal error on failure.
Thus, just return nothing.