21 Commits

Author SHA1 Message Date
chrisPyr
038fff3f24
[NFC][BOLT] Make file-local cl::opt global variables static (#126472)
#125983
2025-03-05 22:11:05 -08:00
Kazu Hirata
06e0869624 [BOLT] Fix warnings
This patch fixes:

  bolt/lib/Profile/StaleProfileMatching.cpp:694:24: error: unused
  variable 'BinHash' [-Werror,-Wunused-variable]

  bolt/lib/Profile/YAMLProfileWriter.cpp:206:61: error: missing field
  'GUID' initializer [-Werror,-Wmissing-field-initializers]

  bolt/lib/Profile/YAMLProfileReader.cpp:840:16: error: unused
  variable 'MatchedWithPseudoProbes' [-Werror,-Wunused-variable]
2024-11-12 09:39:57 -08:00
Shaw Young
9a9af0a23f
[BOLT] Match blocks with pseudo probes (#99891)
Match inline trees first between profile and the binary: by GUID,
checksum, parent, and inline site for inlined functions. Map profile
probes to binary probes via matched inline tree nodes. Each binary probe
has an associated binary basic block. If all probes from one profile
basic block map to the same binary basic block, it’s an exact match,
otherwise the block is determined by majority vote and reported as loose
match.
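
The exact/loose decision described above can be sketched as a majority vote. This is a minimal illustration, not the code from the patch: `BlockMatch`, `matchByProbes`, and the tie-breaking rule are hypothetical names and choices made for this example.

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <vector>

// Hypothetical result type: the chosen binary block and whether every
// probe agreed on it.
struct BlockMatch {
  uint32_t BinaryBlock;
  bool Exact;
};

// ProbeTargets holds, for one profile basic block, the index of the binary
// basic block that each of its probes mapped to.
std::optional<BlockMatch>
matchByProbes(const std::vector<uint32_t> &ProbeTargets) {
  if (ProbeTargets.empty())
    return std::nullopt;

  std::map<uint32_t, unsigned> Votes;
  for (uint32_t Block : ProbeTargets)
    ++Votes[Block];

  // Pick the block with the most votes. std::map iterates in ascending key
  // order, so ties go to the lowest block index (an assumption of this
  // sketch, not something the commit message specifies).
  uint32_t Best = Votes.begin()->first;
  unsigned BestCount = Votes.begin()->second;
  for (const auto &[Block, Count] : Votes) {
    if (Count > BestCount) {
      Best = Block;
      BestCount = Count;
    }
  }
  // A single distinct target means all probes agree: an exact match.
  return BlockMatch{Best, Votes.size() == 1};
}
```

With targets `{3, 3, 3}` the sketch reports an exact match to block 3; with `{1, 2, 2}` it reports a loose match to block 2.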

Pseudo probe matching happens between exact hash matching and call/loose
matching.

Introduce ProbeMatchSpec - a mechanism to match probes belonging to
another binary function. For example, given functions foo and bar:
```
void foo() {
  bar();
}
```
profiled binary: bar is not inlined => have top-level function bar
new binary where the profile is applied to: bar is inlined into foo.

Currently, BOLT does 1:1 matching between profile functions and binary
functions based on the name. #100446 will extend this to N:M where
multiple profiles can be matched to one binary function (as in the
example above where binary function foo would use profiles for foo and
bar), and one profile can be matched to multiple binary functions (e.g.
if bar was inlined into multiple functions).

In this diff, ProbeMatchSpecs would only have one BinaryFunctionProfile
(existing name-based matching). 

Test Plan: Added match-blocks-with-pseudo-probes.test

Performance test:
- Setup:
  - Baseline no-BOLT: Clang with pseudo probes, ThinLTO + CSSPGO
  (#79942)
  - BOLT fresh: BOLTed Clang using fresh profile,
  - BOLT stale (hash): BOLTed Clang using stale profile (collected on
    Clang 10K commits back), `-infer-stale-profile` (hash+call block
    matching)
  - BOLT stale (+probe): BOLTed Clang using stale profile,
    `-infer-stale-profile` with `-stale-matching-with-pseudo-probes`
    (hash+call+pseudo probe block matching)
  - 2S Intel SKX Xeon 6138 with 40C/80T and 256GB RAM, using 20C/40T for
    build,
  - BOLT profiles are collected on Clang compiling large preprocessed
    C++ file.
- Benchmark: building Clang (average of 5 runs), see driver in
  aaupov/llvm-devmtg-2022
- Results, wall time, lower is better:
  - Baseline no-BOLT: 429.52 +- 2.61s,
  - BOLT stale (hash): 413.21 +- 2.19s,
  - BOLT stale (+probe): 409.69 +- 1.41s,
  - BOLT fresh: 384.50 +- 1.80s.

---------

Co-authored-by: Amir Ayupov <aaupov@fb.com>
2024-11-12 07:21:03 -08:00
Shaw Young
131eb30584
[BOLT] Match blocks with calls as anchors (#96596)
Added another hash level – call hash – following opcode hash matching
for stale block matching. Call hash strings are the concatenation of the
lexicographically ordered names of each block’s called functions. This
change bolsters block matching in cases where some instructions have
been removed or added but calls remain constant.
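
A minimal sketch of how such a call hash string could be built, assuming callee names are available as plain strings. `buildCallHashString` is a hypothetical helper; the hashing of the resulting string is left out.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Build the string that would be hashed for one basic block: callee names
// sorted lexicographically and concatenated. The argument is taken by value
// so the caller's order is left untouched.
std::string buildCallHashString(std::vector<std::string> CalleeNames) {
  std::sort(CalleeNames.begin(), CalleeNames.end());
  std::string Result;
  for (const std::string &Name : CalleeNames)
    Result += Name;
  return Result;
}
```

Because the names are sorted first, two blocks calling the same functions in a different order produce the same string.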

Test Plan: added match-functions-with-calls-as-anchors.test.
2024-07-10 15:46:47 -07:00
shaw young
753498eed1
[BOLT] Add sink block to flow CFG in profile inference (#95047)
Summary: Construct an artificial sink block for the
flow CFG in stale profile inference so that profile
inference can run on CFGs with blocks that terminate
and have successors.
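
As a rough illustration of the idea, the sketch below appends a sink to a toy CFG and routes every exit block into it. `FlowBlock` and `addSinkBlock` are hypothetical and much simpler than the flow structures used by profile inference.

```cpp
#include <cstddef>
#include <vector>

// Toy CFG node for this sketch only.
struct FlowBlock {
  bool IsExit = false;            // block terminates the function
  std::vector<std::size_t> Succs; // indices of successor blocks
};

// Appends an artificial sink and connects every exit block to it, including
// exit blocks that already have successors (e.g. a landing pad).
// Returns the index of the new sink.
std::size_t addSinkBlock(std::vector<FlowBlock> &Blocks) {
  const std::size_t Sink = Blocks.size();
  for (FlowBlock &Block : Blocks)
    if (Block.IsExit)
      Block.Succs.push_back(Sink);
  Blocks.push_back(FlowBlock{});
  return Sink;
}
```

After the call, all flow leaving the function ends at a single block with no successors.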

Testing Plan: Added infer_no_exits.test to verify that 
functions with exit blocks with a landing pad are 
covered by stale profile inference.

---------

Co-authored-by: Amir Ayupov <fads93@gmail.com>
2024-06-17 16:58:26 -07:00
shaw young
68fc8dffe4 [BOLT] Drop high discrepancy profiles in matching (#95156)
Summary: Functions with high discrepancy (measured by matched
function blocks) can be ignored with an added command line
argument for better performance.
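
The check could look roughly like the sketch below. `shouldDropProfile` and `MinMatchedPercent` are hypothetical names standing in for the new command line argument; the actual option name and semantics are not given in this message.

```cpp
// Returns true when the share of matched blocks is below the threshold,
// i.e. the function's profile would be ignored.
bool shouldDropProfile(unsigned MatchedBlocks, unsigned TotalBlocks,
                       unsigned MinMatchedPercent) {
  // Compare MatchedBlocks / TotalBlocks < MinMatchedPercent / 100 using
  // integer arithmetic only; widen first to avoid overflow.
  const auto Lhs = 100ULL * MatchedBlocks;
  const auto Rhs =
      static_cast<unsigned long long>(MinMatchedPercent) * TotalBlocks;
  return Lhs < Rhs;
}
```

A threshold of 0 keeps every function under this sketch.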

Test Plan: Added stale-matching-min-matched-block.test

---------

Co-authored-by: Amir Ayupov <aaupov@fb.com>
2024-06-17 15:14:35 -07:00
shaw young
96378b3da8
[BOLT] Add NamedRegionTimer to inferStaleProfile (#93078)
2024-05-22 11:04:12 -07:00
shaw young
c8fc234ee2
[BOLT][NFC] Eliminate uses of throwing std::map::at (#92950)
Remove calls to std::unordered_map::at, std::map::at, and
std::vector::at.
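
One common non-throwing replacement is a `find`-based lookup, sketched below. The message does not say which replacement the patch uses at each call site, so `lookupCount` is only an illustration.

```cpp
#include <map>
#include <optional>
#include <string>

// Non-throwing lookup: returns std::nullopt for a missing key, where
// std::map::at would throw std::out_of_range.
std::optional<int> lookupCount(const std::map<std::string, int> &Counts,
                               const std::string &Key) {
  auto It = Counts.find(Key);
  if (It == Counts.end())
    return std::nullopt;
  return It->second;
}
```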
2024-05-22 09:27:14 -07:00
Amir Ayupov
32c9d5ef4f Revert "[BOLT] Add NamedRegionTimer to inferStaleProfile (#92621)"
This reverts commit 9f2313829fd210f9923375e93bc11fe9685c26d5.

Creates a dependency cycle: lib/Rewrite depends on lib/Profile.
2024-05-21 13:55:32 -07:00
shaw young
9f2313829f
[BOLT] Add NamedRegionTimer to inferStaleProfile (#92621)
2024-05-21 13:26:57 -07:00
Amir Ayupov
b431546d41
[BOLT] Check BF state in stale matching (#85339)
Only apply stale matching if the binary function is in CFG state, i.e.
has basic blocks.

Test Plan:
Updated bolt/test/X86/reader-stale-yaml.test
2024-03-15 10:55:53 -07:00
Amir Ayupov
3c64b24ed3
[BOLT] Add extra staleness logging (#80225)
Report two extra metrics:
- # of stale functions with matching block count,
- # of stale blocks with matching instruction count.
2024-02-01 07:16:40 -08:00
Amir Ayupov
b039ccc684
[BOLT] Provide backwards compatibility for YAML profile with std::hash (#74253)
Provide backwards compatibility for YAML profile that uses `std::hash`:
xxh3 hash is the default for newly produced profile (sets
`std-hash: false`), whereas the profile that doesn't specify `std-hash`
will be treated as `std-hash: true`, preserving old behavior.
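
The defaulting rule can be sketched as follows, modelling the optional `std-hash` key as a `std::optional<bool>`. `HashFunction` and `selectHashFunction` are hypothetical names; the real reader works on the parsed YAML header.

```cpp
#include <optional>

enum class HashFunction { StdHash, XXH3 };

// StdHashField models the optional `std-hash` key of a profile header;
// std::nullopt means the key is absent.
HashFunction selectHashFunction(std::optional<bool> StdHashField) {
  // A missing key is treated as `std-hash: true` so old profiles keep
  // their previous behavior.
  return StdHashField.value_or(true) ? HashFunction::StdHash
                                     : HashFunction::XXH3;
}
```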
2023-12-11 12:27:32 -08:00
spupyrev
e7dd596c68
[BOLT] Use deterministic xxh3 for computing BF/BB hashes (#72542)
std::hash and ADT/Hashing::hash_value are non-deterministic functions
whose results might vary across implementation/process/execution. Use
xxh3 instead for computing hashes of BinaryFunctions and
BinaryBasicBlocks for stale profile matching.
(A possible alternative is to use ADT/StableHashing.h based on FNV
hashing but xxh3 seems to be more popular in LLVM)
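
To illustrate what "deterministic" means here, the sketch below implements 64-bit FNV-1a, the family mentioned as the alternative. It is a stand-in for illustration only: the patch uses xxh3, and `fnv1a64` is a hypothetical helper, not the ADT/StableHashing.h API.

```cpp
#include <cstdint>
#include <string_view>

// 64-bit FNV-1a. The result depends only on the input bytes, so it is the
// same across processes, runs, and standard library implementations.
constexpr uint64_t fnv1a64(std::string_view Bytes) {
  uint64_t Hash = 0xcbf29ce484222325ULL; // FNV offset basis
  for (char C : Bytes) {
    Hash ^= static_cast<unsigned char>(C);
    Hash *= 0x100000001b3ULL; // FNV prime
  }
  return Hash;
}
```

A value like this can be written into a profile by one run and compared by a later run, which the standard does not guarantee for `std::hash`.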

This is to address https://github.com/llvm/llvm-project/issues/65241.
2023-11-27 14:45:46 -08:00
spupyrev
42da84fda9 [BOLT] Always match stale entry blocks
Two (minor) improvements for stale matching:
- always match entry blocks to each other, even if there is a hash mismatch;
- ignore nops in (loose) hash computation.
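
Both rules can be sketched on a toy block representation. `SketchBlock`, `looseHashInput`, and `blocksMatch` are hypothetical; the real code compares hash values rather than opcode strings.

```cpp
#include <string>
#include <vector>

// Toy block for this sketch only.
struct SketchBlock {
  bool IsEntry = false;
  std::vector<std::string> Opcodes; // mnemonics only, no operands
};

// Input to a loose hash: opcode mnemonics joined with ';', skipping nops.
std::string looseHashInput(const SketchBlock &Block) {
  std::string Result;
  for (const std::string &Opcode : Block.Opcodes) {
    if (Opcode == "nop")
      continue;
    Result += Opcode;
    Result += ';';
  }
  return Result;
}

bool blocksMatch(const SketchBlock &Profile, const SketchBlock &Binary) {
  // Entry blocks always match each other, even on a hash mismatch.
  if (Profile.IsEntry && Binary.IsEntry)
    return true;
  return looseHashInput(Profile) == looseHashInput(Binary);
}
```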

I recorded a small improvement in inference quality on my benchmarks. Tests are not affected.

Reviewed By: Amir

Differential Revision: https://reviews.llvm.org/D159488
2023-09-08 15:46:20 -07:00
spupyrev
1256ef274c [BOLT] Fine-tuning hash computation for stale matching
Fine-tuning hash computation for stale matching:
- introducing a new "loose" basic block hash that allows matching many more blocks than before;
- tweaking params of the inference algorithm to find (slightly) better solutions;
- adding more meaningful tests for stale matching.

Tested the changes on several open-source benchmarks (clang, rocksdb, chrome)
and one prod workload using different compiler modes (LTO/PGO etc). There is
always an improvement in the quality of inferred profiles.
(The current implementation is still not optimal but the diff is a step forward;
I am open to further suggestions)

Reviewed By: Amir

Differential Revision: https://reviews.llvm.org/D156278
2023-08-31 07:29:02 -07:00
spupyrev
6d1502c654 [BOLT] (Minor) Changes in stale inference
1. Using ADT/Bitfields.h for hash computation; this is equivalent to but shorter than the existing implementation
2. Getting rid of Layout indices for stale matching; using BB->getIndex for indexing

Reviewed By: Amir

Differential Revision: https://reviews.llvm.org/D155748
2023-07-27 15:29:03 -07:00
spupyrev
31e8a9f4d9 [BOLT] Add stale-related logging
Adding some logs related to stale profile matching. The new data can be helpful
to understand how "stale" the input profile is and how well the inference is
able to utilize the stale data.

Example of outputs on clang-10 built with LTO (profile collected on a year-old release):
```
BOLT-INFO: inferred profile for 2101 (18.52% of profiled, 100.00% of stale) functions responsible for 30.95% samples (14754697 out of 47670654)
BOLT-INFO: stale inference matched 89.42% of basic blocks (79052 out of 88402 stale) responsible for 76.99% samples (645737 out of 838719 stale)
```

LTO+AutoFDO:
```
BOLT-INFO: inferred profile for 6146 (57.57% of profiled, 100.00% of stale) functions responsible for 90.34% samples (50891403 out of 56330313)
BOLT-INFO: stale inference matched 74.55% of basic blocks (191295 out of 256589 stale) responsible for 57.30% samples (1288632 out of 2248799 stale)
```

Reviewed By: Amir, maksfb

Differential Revision: https://reviews.llvm.org/D154737
2023-07-27 08:56:57 -07:00
Kazu Hirata
b188f9f597 [BOLT] Use {StringMap,DenseMapBase}::lookup (NFC)
2023-06-16 07:48:19 -07:00
spupyrev
2316a10fe5 [BOLT] stale profile matching [part 2 out of 2]
This is the first "serious" version of stale profile matching in BOLT. This diff
extends the hash computation for basic blocks so that we can apply a fuzzy
hash-based matching. The idea is to compute several "versions" of a hash value
for a basic block. A loose version of a hash (computed by ignoring instruction
operands) allows matching blocks in functions whose content has been changed,
while stricter hash values (considering instruction opcodes with operands and
even based on hashes of block's successors/predecessors) allow resolving
collisions. In order to save space and build time, individual hash components
are blended into a single uint64_t.
There are likely numerous ways of improving hash computation but already this
simple variant provides significant perf benefits.
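
A minimal sketch of blending several hash components into one `uint64_t`. The field names and the four 16-bit widths are assumptions made for this example and need not match the layout used in the diff.

```cpp
#include <cstdint>

// Hypothetical layout: four 16-bit components packed into 64 bits.
struct BlendedHashSketch {
  uint16_t Offset;       // position of the block within the function
  uint16_t LooseHash;    // opcodes only, operands ignored
  uint16_t StrictHash;   // opcodes with operands
  uint16_t NeighborHash; // derived from successors/predecessors

  uint64_t combine() const {
    return static_cast<uint64_t>(Offset) |
           (static_cast<uint64_t>(LooseHash) << 16) |
           (static_cast<uint64_t>(StrictHash) << 32) |
           (static_cast<uint64_t>(NeighborHash) << 48);
  }

  static BlendedHashSketch split(uint64_t Value) {
    return {static_cast<uint16_t>(Value & 0xffff),
            static_cast<uint16_t>((Value >> 16) & 0xffff),
            static_cast<uint16_t>((Value >> 32) & 0xffff),
            static_cast<uint16_t>((Value >> 48) & 0xffff)};
  }
};
```

A matcher can then compare only the loose field first and fall back to the stricter fields to break ties between candidates.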

**Perf testing** on the clang binary: collecting data on clang-10 and using it
to optimize clang-11 (with ~1 year of commits in between). Next, we compare
- //stale_clang// (clang-11 optimized with profile collected on clang-10 with **infer-stale-profile=0**)
- //opt_clang// (clang-11 optimized with profile collected on clang-11)
- //infer_clang// (clang-11 optimized with profile collected on clang-10 with **infer-stale-profile=1**)

`LTO-only` mode:
//stale_clang// vs //opt_clang//: task-clock [delta(%): 9.4252 ± 1.6582, p-value: 0.000002]
(That is, there is a ~9.5% perf regression)
//infer_clang// vs //opt_clang//: task-clock [delta(%): 2.1834 ± 1.8158, p-value: 0.040702]
(That is, the regression is reduced to ~2%)
Related BOLT logs:
```
BOLT-INFO: identified 2114 (18.61%) stale functions responsible for 30.96% samples
BOLT-INFO: inferred profile for 2101 (18.52% of all profiled) functions responsible for 30.95% samples
```

`LTO+AutoFDO` mode:
//stale_clang// vs //opt_clang//: task-clock [delta(%): 19.1293 ± 1.4131, p-value: 0.000002]
//infer_clang// vs //opt_clang//: task-clock [delta(%): 7.4364 ± 1.3343, p-value: 0.000002]
Related BOLT logs:
```
BOLT-INFO: identified 5452 (50.27%) stale functions responsible for 85.34% samples
BOLT-INFO: inferred profile for 5442 (50.23% of all profiled) functions responsible for 85.33% samples
```

Reviewed By: Amir

Differential Revision: https://reviews.llvm.org/D146661
2023-06-08 14:42:41 -07:00
spupyrev
44268271f6 [BOLT] stale profile matching [part 1 out of 2]
BOLT often has to deal with profiles collected on binaries built several
revisions behind the release. As a result, a certain percentage of functions is
considered stale and not optimized. This diff adds the ability to match a profile
to functions that are not 100% binary identical, which increases the
optimization coverage and boosts the performance of applications.

The algorithm consists of two phases, matching and inference:
- At the matching phase, we try to "guess" as many block and jump counts from
  the stale profile as possible. To this end, the content of each basic block
  is hashed and stored in the (yaml) profile. When BOLT optimizes a binary,
  it computes block hashes and identifies the corresponding entries in the
  stale profile. It yields a partial profile for every CFG in the binary.
- At the inference phase, we employ a network flow-based algorithm (profi) to
  reconstruct "realistic" block and jump counts from the partial profile
  generated at the matching phase. In practice, we don't always produce proper
  profile data but the majority (e.g., >90%) of CFGs get the correct counts.
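
The matching phase can be sketched as a hash lookup that yields a partial profile, with unmatched blocks left for inference to fill in. `ProfileBlock` and `matchBlockCounts` are hypothetical names, and the inference phase (profi) is not shown.

```cpp
#include <cstdint>
#include <optional>
#include <unordered_map>
#include <vector>

// One block entry from the stale profile (sketch only).
struct ProfileBlock {
  uint64_t Hash;
  uint64_t Count;
};

// For each block of the binary function, given by its hash, return the
// count of the profile block with the same hash, or std::nullopt if no
// profile block matches.
std::vector<std::optional<uint64_t>>
matchBlockCounts(const std::vector<uint64_t> &BinaryHashes,
                 const std::vector<ProfileBlock> &Profile) {
  std::unordered_map<uint64_t, uint64_t> CountByHash;
  for (const ProfileBlock &PB : Profile)
    CountByHash.emplace(PB.Hash, PB.Count); // first entry wins on collision

  std::vector<std::optional<uint64_t>> Partial;
  Partial.reserve(BinaryHashes.size());
  for (uint64_t Hash : BinaryHashes) {
    auto It = CountByHash.find(Hash);
    if (It == CountByHash.end())
      Partial.push_back(std::nullopt);
    else
      Partial.push_back(It->second);
  }
  return Partial;
}
```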

This is the first part of the change; the next stacked diff extends the block hashing
and provides perf evaluation numbers.

Reviewed By: maksfb

Differential Revision: https://reviews.llvm.org/D144500
2023-06-06 12:13:52 -07:00