shylie/llvm-project

Commit Graph

Author	SHA1	Message	Date
Kazu Hirata	b188f9f597	[BOLT] Use {StringMap,DenseMapBase}::lookup (NFC)	2023-06-16 07:48:19 -07:00
spupyrev	2316a10fe5	[BOLT] stale profile matching [part 2 out of 2]	2023-06-08 14:42:41 -07:00
spupyrev	44268271f6	[BOLT] stale profile matching [part 1 out of 2]	2023-06-06 12:13:52 -07:00

Kazu Hirata

b188f9f597

[BOLT] Use {StringMap,DenseMapBase}::lookup (NFC)

2023-06-16 07:48:19 -07:00

spupyrev

2316a10fe5

[BOLT] stale profile matching [part 2 out of 2]

This is the first "serious" version of stale profile matching in BOLT. This diff
extends the hash computation for basic blocks so that we can apply fuzzy
hash-based matching. The idea is to compute several "versions" of a hash value
for a basic block. A loose version of the hash (computed by ignoring instruction
operands) allows matching blocks in functions whose content has changed,
while stricter hash values (which consider instruction opcodes with operands and
even the hashes of the block's successors/predecessors) allow resolving
collisions. To save space and build time, the individual hash components
are blended into a single uint64_t.
There are likely numerous ways to improve the hash computation, but even this
simple variant provides significant perf benefits.
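
A minimal sketch of the blending idea, not the code from this diff. The
`Instr` model, the FNV-1a helper, the function names, and the 16/32/16-bit
field layout are all assumptions made for illustration; the actual layout
used by BOLT may differ.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical instruction model: an opcode plus its operands as text.
struct Instr {
  std::string Opcode;
  std::string Operands;
};

// FNV-1a constants; the hash is chosen only because it is short and
// deterministic, not because BOLT uses it.
constexpr uint64_t FNVOffset = 14695981039346656037ULL;
constexpr uint64_t FNVPrime = 1099511628211ULL;

// Continue an FNV-1a hash H over the bytes of S.
inline uint64_t fnv1a(const std::string &S, uint64_t H) {
  for (unsigned char C : S) {
    H ^= C;
    H *= FNVPrime;
  }
  return H;
}

// Loose component: opcodes only, so operand changes do not affect it.
inline uint16_t looseHash(const std::vector<Instr> &Block) {
  uint64_t H = FNVOffset;
  for (const Instr &I : Block)
    H = fnv1a(I.Opcode + ";", H);
  return static_cast<uint16_t>(H ^ (H >> 16) ^ (H >> 32) ^ (H >> 48));
}

// Strict component: opcodes together with their operands.
inline uint32_t strictHash(const std::vector<Instr> &Block) {
  uint64_t H = FNVOffset;
  for (const Instr &I : Block)
    H = fnv1a(I.Opcode + " " + I.Operands + ";", H);
  return static_cast<uint32_t>(H ^ (H >> 32));
}

// Neighbor component: folds the loose hashes of successors/predecessors.
inline uint16_t neighborHash(const std::vector<uint16_t> &NeighborLoose) {
  uint64_t H = FNVOffset;
  for (uint16_t N : NeighborLoose) {
    H ^= N;
    H *= FNVPrime;
  }
  return static_cast<uint16_t>(H ^ (H >> 16) ^ (H >> 32) ^ (H >> 48));
}

// All components packed into one 64-bit value (assumed layout:
// bits 63..48 loose, bits 47..16 strict, bits 15..0 neighbor).
struct BlendedHash {
  uint16_t Loose;
  uint32_t Strict;
  uint16_t Neighbor;

  uint64_t combine() const {
    return (static_cast<uint64_t>(Loose) << 48) |
           (static_cast<uint64_t>(Strict) << 16) |
           static_cast<uint64_t>(Neighbor);
  }

  static BlendedHash split(uint64_t V) {
    BlendedHash H;
    H.Loose = static_cast<uint16_t>(V >> 48);
    H.Strict = static_cast<uint32_t>((V >> 16) & 0xFFFFFFFFULL);
    H.Neighbor = static_cast<uint16_t>(V & 0xFFFFULL);
    return H;
  }
};

// Compute all components for one block (the sketch assumes non-empty blocks).
inline BlendedHash blend(const std::vector<Instr> &Block,
                         const std::vector<uint16_t> &NeighborLoose) {
  assert(!Block.empty() && "sketch assumes a non-empty basic block");
  BlendedHash H;
  H.Loose = looseHash(Block);
  H.Strict = strictHash(Block);
  H.Neighbor = neighborHash(NeighborLoose);
  return H;
}
```

Two blocks that differ only in operands share the same `Loose` field, so they
can still be paired; the `Strict` and `Neighbor` fields are then available to
pick among several candidates with the same loose value.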

**Perf testing** on the clang binary: we collect a profile on clang-10 and use it
to optimize clang-11 (with ~1 year of commits in between). We then compare:
- //stale_clang// (clang-11 optimized with profile collected on clang-10 with **infer-stale-profile=0**)
- //opt_clang// (clang-11 optimized with profile collected on clang-11)
- //infer_clang// (clang-11 optimized with profile collected on clang-10 with **infer-stale-profile=1**)

`LTO-only` mode:
//stale_clang// vs //opt_clang//: task-clock [delta(%): 9.4252 ± 1.6582, p-value: 0.000002]
(That is, there is a ~9.4% perf regression)
//infer_clang// vs //opt_clang//: task-clock [delta(%): 2.1834 ± 1.8158, p-value: 0.040702]
(That is, the regression is reduced to ~2%)
Related BOLT logs:
```
BOLT-INFO: identified 2114 (18.61%) stale functions responsible for 30.96% samples
BOLT-INFO: inferred profile for 2101 (18.52% of all profiled) functions responsible for 30.95% samples
```

`LTO+AutoFDO` mode:
//stale_clang// vs //opt_clang//: task-clock [delta(%): 19.1293 ± 1.4131, p-value: 0.000002]
//infer_clang// vs //opt_clang//: task-clock [delta(%): 7.4364 ± 1.3343, p-value: 0.000002]
Related BOLT logs:
```
BOLT-INFO: identified 5452 (50.27%) stale functions responsible for 85.34% samples
BOLT-INFO: inferred profile for 5442 (50.23% of all profiled) functions responsible for 85.33% samples
```
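
As a rough reading of the task-clock deltas above, one can ask what share of
the stale-profile regression the inference removes. The helper below is a
sketch: `recoveredFraction` is a hypothetical name, and it uses the point
estimates only, ignoring the reported ± intervals.

```cpp
#include <cassert>

// Share of the stale-profile slowdown removed by inference, computed from
// the two task-clock deltas (both in percent, both relative to opt_clang).
inline double recoveredFraction(double StaleDeltaPct, double InferDeltaPct) {
  assert(StaleDeltaPct > 0.0 && "expects a measurable regression");
  return (StaleDeltaPct - InferDeltaPct) / StaleDeltaPct;
}
```

With the numbers above, `recoveredFraction(9.4252, 2.1834)` is about 0.77 for
`LTO-only` and `recoveredFraction(19.1293, 7.4364)` is about 0.61 for
`LTO+AutoFDO`.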

Reviewed By: Amir

Differential Revision: https://reviews.llvm.org/D146661

2023-06-08 14:42:41 -07:00

spupyrev

44268271f6

[BOLT] stale profile matching [part 1 out of 2]

BOLT often has to deal with profiles collected on binaries built several
revisions behind the release. As a result, a certain percentage of functions is
considered stale and is not optimized. This diff adds the ability to match a profile
to functions that are not 100% binary identical, which increases
optimization coverage and boosts application performance.

The algorithm consists of two phases, matching and inference:
- In the matching phase, we try to "guess" as many block and jump counts from
  the stale profile as possible. To this end, the content of each basic block
  is hashed and stored in the (yaml) profile. When BOLT optimizes a binary,
  it computes block hashes and identifies the corresponding entries in the
  stale profile. This yields a partial profile for every CFG in the binary.
- In the inference phase, we employ a network flow-based algorithm (profi) to
  reconstruct "realistic" block and jump counts from the partial profile
  generated in the matching phase. In practice, we don't always produce proper
  profile data, but the majority (e.g., >90%) of CFGs get the correct counts.
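
The matching phase can be sketched as a hash lookup that leaves unmatched
blocks without a count. This is an illustration only, not the code from this
diff: `ProfiledBlock`, `MatchedCount`, `matchBlocks`, and the convention that
the top 16 bits of a hash form an operand-insensitive "loose" part are all
assumptions. The inference phase (profi) is not shown.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

// One basic block entry from the stale profile (hypothetical shape).
struct ProfiledBlock {
  uint64_t Hash;
  uint64_t Count;
};

// Result for one block of the binary: a count if a match was found.
struct MatchedCount {
  bool Matched;
  uint64_t Count;
};

// Assumed convention: the top 16 bits are the loose part of the hash.
inline uint16_t looseBits(uint64_t Hash) {
  return static_cast<uint16_t>(Hash >> 48);
}

// Build a partial profile: for every block hash in the binary, try an exact
// match first, then fall back to the loose part. On collisions the first
// profile entry wins, which is a simplification of this sketch.
inline std::vector<MatchedCount>
matchBlocks(const std::vector<uint64_t> &BinaryHashes,
            const std::vector<ProfiledBlock> &StaleProfile) {
  std::unordered_map<uint64_t, uint64_t> ByFullHash;
  std::unordered_map<uint16_t, uint64_t> ByLooseHash;
  for (const ProfiledBlock &PB : StaleProfile) {
    ByFullHash.emplace(PB.Hash, PB.Count);
    ByLooseHash.emplace(looseBits(PB.Hash), PB.Count);
  }

  std::vector<MatchedCount> Counts;
  Counts.reserve(BinaryHashes.size());
  for (uint64_t H : BinaryHashes) {
    MatchedCount MC = {false, 0};
    auto Full = ByFullHash.find(H);
    if (Full != ByFullHash.end()) {
      MC.Matched = true;
      MC.Count = Full->second;
    } else {
      auto Loose = ByLooseHash.find(looseBits(H));
      if (Loose != ByLooseHash.end()) {
        MC.Matched = true;
        MC.Count = Loose->second;
      }
    }
    Counts.push_back(MC);
  }
  assert(Counts.size() == BinaryHashes.size());
  return Counts;
}
```

Blocks left with `Matched == false` are the gaps that the inference phase is
expected to fill in from the surrounding counts.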

This is the first part of the change; the next stacked diff extends the block hashing
and provides perf evaluation numbers.

Reviewed By: maksfb

Differential Revision: https://reviews.llvm.org/D144500

2023-06-06 12:13:52 -07:00

3 Commits