239 Commits

Rafael Auler
47f54e4992
Revert "[BOLT][NFC] Register profiled functions once (#150622)" (#152597)
In perf2bolt, we are observing sporadic crashes in the recently added
registerProfiledFunctions from #150622. Addresses provided by the
hardware (from LBR) might be -1, which clashes with the reserved empty
and tombstone keys LLVM uses in DenseSet. This causes DenseSet to assert
with "can't insert empty tombstone into map" when ingesting this
data. Revert this change for now to unbreak perf2bolt.
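A minimal sketch of the clash, assuming the default DenseMapInfo<uint64_t>
traits (which reserve ~0ULL as the empty key and ~0ULL - 1 as the tombstone):
```cpp
#include "llvm/ADT/DenseSet.h"
#include <cstdint>

void ingest() {
  llvm::DenseSet<uint64_t> Addresses;
  Addresses.insert(0x400000);         // a normal address: fine
  uint64_t FromHardware = UINT64_MAX; // -1 as reported by the LBR
  Addresses.insert(FromHardware);     // asserts: equals the empty key
}
```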
2025-08-07 15:28:22 -07:00
Amir Ayupov
1b657c6d6b
[BOLT][NFC] Register profiled functions once (#150622)
While registering profiled functions, only handle each address once.
Speeds up `DataAggregator::preprocessProfile`.
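A hedged sketch of the idea (illustrative names; `registerFunction` and
`getBinaryFunctionContainingAddress` stand in for the actual lookup and
registration steps):
```cpp
#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/DenseSet.h"
#include <cstdint>

struct BinaryFunction;                                        // illustrative
BinaryFunction *getBinaryFunctionContainingAddress(uint64_t); // hypothetical
void registerFunction(BinaryFunction &);                      // hypothetical

void registerProfiledFunctions(llvm::ArrayRef<uint64_t> Addresses) {
  llvm::DenseSet<uint64_t> Seen;
  for (uint64_t Addr : Addresses)
    if (Seen.insert(Addr).second) // handle each address only once
      if (BinaryFunction *BF = getBinaryFunctionContainingAddress(Addr))
        registerFunction(*BF);
}
```
(The revert above shows the pitfall: raw hardware addresses such as -1 must
be sanitized before being used as DenseSet keys.)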

Test Plan:
For an intermediate-size pre-aggregated profile (10MB), this reduces
parsing time from ~0.41s down to ~0.16s.
2025-07-28 13:29:55 +02:00
Amir Ayupov
a850912de1
[BOLT] Require CFG in BAT mode (#150488)
`getFallthroughsInTrace` requires CFG for functions not covered by BAT,
even in BAT/fdata mode. BAT-covered functions go through special
handling in fdata (`BAT->getFallthroughsInTrace`) and YAML
(`DataAggregator::writeBATYAML`) modes.

Since all modes (BAT/no-BAT, YAML/fdata) now need disassembly/CFG
construction:
- drop special BAT/fdata handling that omitted disassembly/CFG in
  `RewriteInstance::run`, enabling *CFG for all non-BAT functions*,
- switch `getFallthroughsInTrace` to check if a function has CFG
  (sketched below),
- which *allows emitting profile for non-simple functions* in all modes.

Previously, traces in non-simple functions were reported as invalid
(mismatching the disassembled function contents). This change reduces the
number of such invalid traces and increases the number of profiled
functions. These functions may participate in function reordering via
call graph profile.
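A hedged sketch of the new gating (`BinaryFunction::hasCFG` exists in BOLT,
but the surrounding logic is paraphrased with illustrative types):
```cpp
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

struct BinaryFunction { // illustrative stand-in
  bool HasCFG = false;
  bool hasCFG() const { return HasCFG; }
};

// Paraphrase: fall-throughs in a trace can only be computed and validated
// for functions with a CFG, regardless of BAT coverage or "simple" status.
std::optional<std::vector<std::pair<uint64_t, uint64_t>>>
getFallthroughsInTrace(const BinaryFunction &BF) {
  if (!BF.hasCFG())
    return std::nullopt; // no CFG: the trace cannot be checked
  return std::vector<std::pair<uint64_t, uint64_t>>{}; // CFG walk elided
}
```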

Test Plan: updated unclaimed-jt-entries.s
2025-07-25 13:54:37 +02:00
Amir Ayupov
00311cf604
[BOLT] Impute missing trace fall-through (#145258) 2025-07-12 12:25:58 -07:00
Amir Ayupov
46e3ec0244
[BOLT][NFCI] Report perf script time (#147232)
Leverage `sys::ProcessStatistics` to report the run time and memory
usage of perf script processes launched when reading perf data.
The reporting is enabled in debug mode with `-debug-only=aggregator`.

Switch the buildid-list command to the non-waiting `launchPerfProcess` to
get its runtime as well, unifying it with the rest of the perf script
processes.
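A hedged sketch of the reporting, assuming the `llvm::sys::ExecuteAndWait`
overload with a `ProcessStatistics` out-parameter from
llvm/Support/Program.h:
```cpp
#include "llvm/Support/Program.h"
#include "llvm/Support/raw_ostream.h"
#include <optional>
#include <string>

// Launch a perf script process and report its run time and peak RSS.
void runAndReport(llvm::StringRef PerfPath,
                  llvm::ArrayRef<llvm::StringRef> Args) {
  std::optional<llvm::sys::ProcessStatistics> ProcStat;
  std::string ErrMsg;
  llvm::sys::ExecuteAndWait(PerfPath, Args, /*Env=*/std::nullopt,
                            /*Redirects=*/{}, /*SecondsToWait=*/0,
                            /*MemoryLimit=*/0, &ErrMsg,
                            /*ExecutionFailed=*/nullptr, &ProcStat);
  if (ProcStat) // filled in on success; units per Program.h docs
    llvm::errs() << "perf script: " << ProcStat->TotalTime.count()
                 << " us, peak RSS " << ProcStat->PeakMemory << " KiB\n";
}
```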

Test Plan: NFC
2025-07-07 06:05:48 -07:00
Amir Ayupov
f6973baf28
[BOLT][NFC] Split out parsePerfData (#145248) 2025-06-24 09:39:51 -07:00
Amir Ayupov
6d8c6ef90c
[BOLT][NFC] Simplify doTrace in BAT mode (#143233)
`BoltAddressTranslation::getFallthroughsInTrace` iterates over address
translation map entries and therefore has direct access to both original
and translated offsets. Return the translated offsets in the fall-throughs
list to avoid duplicate address translation inside `doTrace`.

Test Plan: NFC
2025-06-20 12:45:21 -07:00
Amir Ayupov
7085065c02
[BOLT] Support pre-aggregated returns (#143296)
Intel's Architectural LBR supports capturing branch type information
as part of the LBR stack (SDM Vol 3B, part 2, October 2024):
```
20.1.3.2 Branch Types
The IA32_LBR_x_INFO.BR_TYPE and IA32_LER_INFO.BR_TYPE fields encode
the branch types as shown in Table 20-3.

Table 20-3. IA32_LBR_x_INFO and IA32_LER_INFO Branch Type Encodings

Encoding | Branch Type
  0000B  | COND
  0001B  | NEAR_IND_JMP
  0010B  | NEAR_REL_JMP
  0011B  | NEAR_IND_CALL
  0100B  | NEAR_REL_CALL
  0101B  | NEAR_RET
  011xB  | Reserved
  1xxxB  | OTHER_BRANCH

For a list of branch operations that fall into the categories above,
see Table 20-2. 

Table 20-2. Branch Type Filtering Details
Branch Type   | Operations Recorded
COND          | Jcc, J*CXZ, and LOOP*
NEAR_IND_JMP  | JMP r/m*
NEAR_REL_JMP  | JMP rel*
NEAR_IND_CALL | CALL r/m*
NEAR_REL_CALL | CALL rel* (excluding CALLs to the next sequential IP)
NEAR_RET      | RET (0C3H)
OTHER_BRANCH  | JMP/CALL ptr*, JMP/CALL m*, RET (0C8H), SYS*, 
interrupts, exceptions (other than debug exceptions), IRET, INT3, 
INTn, INTO, TSX Abort, EENTER, ERESUME, EEXIT, AEX, INIT, SIPI, RSM
```

The Linux kernel can preserve the branch type when `save_type` is enabled,
even if the CPU does not support Architectural LBR:

f09079bd04/tools/perf/Documentation/perf-record.txt (L457-L460)

> - save_type: save branch type during sampling in case binary is not
available later.
For the platforms with Intel Arch LBR support (12th-Gen+ client or
4th-Gen Xeon+ server), the save branch type is unconditionally enabled
when the taken branch stack sampling is enabled.

Kernel-reported branch type values:

8c6bc74c7f/include/uapi/linux/perf_event.h (L251-L269)

This information is needed to disambiguate external returns (from
DSO/JIT) to an entry point or a landing pad, when BOLT can't
disassemble the branch source.

This patch adds new pre-aggregated types:
- return trace (R),
- external return fall-through (r).

For such types, the checks for fall-through start (not an entry or
a landing pad) are relaxed.
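A hedged sketch of the relaxation (illustrative names, not the actual
DataAggregator code):
```cpp
#include <cstdint>

struct BinaryFunctionStub { // illustrative stand-in
  bool isEntryPoint(uint64_t Offset) const;
  bool isLandingPad(uint64_t Offset) const;
};

// For return traces ('R') and external return fall-throughs ('r'), the
// fall-through may legitimately start at an entry point or landing pad,
// so the usual start checks are skipped.
bool isValidFallthroughStart(const BinaryFunctionStub &BF, uint64_t Offset,
                             char RecordType) {
  if (RecordType == 'R' || RecordType == 'r')
    return true;
  return !BF.isEntryPoint(Offset) && !BF.isLandingPad(Offset);
}
```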

Depends on #143295.

Test Plan: updated callcont-fallthru.s
2025-06-20 03:17:08 -07:00
Ádám Kallai
f75973949b
[BOLT][AArch64] Add support for SPE brstack format (#129231)
Since Linux 6.14, perf has been able to report SPE branch events using
the `brstack` format, which matches the layout of LBR/BRBE.

This patch reuses the existing LBR parsing logic to support SPE.

Example SPE brstack format:
```bash
perf script -i perf.data -F pid,brstack --itrace=bl
```
```
  PID       FROM / TO / PREDICTED

16984  0x72e342e5f4/0x72e36192d0/M/-/-/11/RET/-
16984  0x72e7b8b3b4/0x72e7b8b3b8/PN/-/-/11/COND/-
16984  0x72e7b92b48/0x72e7b92b4c/PN/-/-/8/COND/-
16984  0x72eacc6b7c/0x760cc94b00/P/-/-/9/RET/-
16984  0x72e3f210fc/0x72e3f21068/P/-/-/4//-
16984  0x72e39b8c5c/0x72e3627b24/P/-/-/4//-
16984  0x72e7b89d20/0x72e7b92bbc/P/-/-/4/RET/-
```

SPE brstack flags can be two characters long: `PN` or `MN`:
- `P` = predicted branch
- `M` = mispredicted branch
- `N` = optionally appears when the branch is NOT-TAKEN
    - the flag is relevant only to conditional branches


Example of usage with BOLT:

1. Capture SPE branch events:
```bash
perf record -e 'arm_spe_0/branch_filter=1/u' -- binary
```

2. Convert profile for BOLT:
```bash
perf2bolt -p perf.data -o perf.fdata --spe binary
```

3. Run BOLT optimization:
```bash
llvm-bolt binary -o binary.bolted --data perf.fdata ...
```

A unit test verifies the parsing of the SPE brstack format.

---------

Co-authored-by: Paschalis Mpeis <paschalis.mpeis@arm.com>
2025-06-20 10:40:35 +01:00
Amir Ayupov
9fed480f18
[BOLT] Explicitly check for returns when extending call continuation profile (#143295)
Call continuation logic relies on assumptions about fall-through origin:
- the branch is external to the function,
- fall-through start is at the beginning of the block,
- the block is not an entry point or a landing pad.

Leverage trace information to explicitly check whether the origin is a
return instruction, and defer to the checks above only in the case of a
DSO-external branch source.
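A hedged sketch of that check (hypothetical helpers, not the actual
DataAggregator API):
```cpp
#include <cstdint>

struct TraceStub { uint64_t From, To; }; // illustrative

const void *lookupInstruction(uint64_t Addr); // hypothetical helpers
bool isReturnInstruction(const void *Inst);
bool passesCallContinuationHeuristics(const TraceStub &T);

// If the branch source disassembles to an instruction inside the binary,
// ask it directly whether it is a return; only DSO-external sources
// defer to the positional heuristics listed above.
bool isReturnOrigin(const TraceStub &T) {
  if (const void *Inst = lookupInstruction(T.From))
    return isReturnInstruction(Inst);
  return passesCallContinuationHeuristics(T);
}
```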

This covers both regular and BAT cases, addressing call continuation
fall-through undercounting in the latter mode, which improves BAT
profile quality metrics. For example, for one large binary:
- CFG discontinuity 21.83% -> 0.00%,
- CFG flow imbalance 10.77%/100.00% -> 3.40%/13.82% (weighted/worst),
- CG flow imbalance 8.49% -> 8.49%.

Depends on #143289.

Test Plan: updated callcont-fallthru.s
2025-06-17 06:28:27 -07:00
Amir Ayupov
7e6c1bd3ed
[BOLT][NFCI] Simplify DataAggregator using traces (#143289)
Consistently apply traces as defined in #127125 for branch profile
aggregation. This combines branches and fall-through records into one.

With large input binaries/profiles, the speed up in aggregation time
(`-time-aggr`, wall time):
- perf.data, pre-BOLT input: 154.5528s -> 144.0767s
- pre-aggregated data, pre-BOLT input: 15.1026s -> 9.0711s
- pre-aggregated data, BOLTed input: 15.4871s -> 10.0077s

Test Plan: NFC
2025-06-16 23:54:40 -07:00
Amir Ayupov
902a991e12
[BOLT] Make memory profile parsing optional (#129585)
Introduce the `parse-mem-profile` option to limit the overhead of
processing tracing data (Intel PT or ARM ETM). By default, it is enabled
for perf data (existing behavior), unless `itrace` is passed to parse
tracing data, where it is extremely expensive. In that case, the flag
needs to be set explicitly if needed.
2025-06-12 14:46:37 -07:00
Amir Ayupov
0c77468288
[BOLT] Expose external entry count for functions (#141674)
Record the number of function invocations from external code, i.e. code
outside the binary, which may include JIT code and DSOs. Accounting for
external entry counts improves the fidelity of call graph flow
conservation analysis.

Test Plan: updated shrinkwrapping.test
2025-06-10 14:31:22 -07:00
Kazu Hirata
8da9eb235d
[BOLT] Use std::tie to implement operator< (NFC) (#143560)
std::tie facilitates lexicographical comparisons through std::tuple's
built-in operator<.
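The canonical pattern, with illustrative members:
```cpp
#include <cstdint>
#include <string>
#include <tuple>

struct Location {
  std::string FuncName;
  uint64_t Offset;

  // std::tie builds tuples of references; std::tuple's operator< then
  // compares FuncName first and uses Offset as the tie-breaker.
  bool operator<(const Location &RHS) const {
    return std::tie(FuncName, Offset) < std::tie(RHS.FuncName, RHS.Offset);
  }
};
```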
2025-06-10 11:31:46 -07:00
Amir Ayupov
03bbd04bb7
[BOLT][NFCI] Skip validation in parseLBRSample (#143288)
Parsed branches and fall-throughs are validated in `doBranch` and
`doTrace` respectively. Simplify parseLBRSample by omitting the
validation. This also speeds up perf data processing, as checks are only
done once per aggregated branch/fall-through and not per individual LBR
entry.

Since invalid/external addresses are no longer sanitized during parsing,
sanitize them in `doBranch`.

Test Plan: updated X86/pre-aggregated-perf.test
2025-06-08 17:50:02 -07:00
Amir Ayupov
dcd2ac7ef2
[BOLT] Sort EntryData (#143308)
Aggregated branch data has two containers: `Data` for local branches,
and `EntryData` for external branches. Fix the omission and sort
`EntryData` to ensure stable output fdata profiles.

Test Plan: updated pre-aggregated-perf.test
2025-06-08 17:43:44 -07:00
Amir Ayupov
c480dcddd9
[BOLT][NFC] Move LBREntry from DataReader to DataAggregator (#143287)
LBREntry is only used in DataAggregator.

Test Plan: NFC
2025-06-08 17:41:46 -07:00
Amir Ayupov
dc513fa8dc
[BOLT] Zero initialize pre-aggregated counters (#142698)
#140196 introduced UB by using an uninitialized misprediction count for
pre-aggregated traces. Fix by zero initializing both counters.
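A minimal sketch of the fix via default member initializers (the struct is
paraphrased, not the exact BOLT definition):
```cpp
#include <cstdint>

// Both counters start at zero, so a pre-aggregated record that omits the
// misprediction count no longer reads an indeterminate value.
struct TakenBranchInfo {
  uint64_t TakenCount = 0;
  uint64_t MispredCount = 0;
};
```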

Test Plan: updated entry-point-fallthru.s
2025-06-03 20:50:19 -07:00
Amir Ayupov
18e51314c4
[BOLT] Support pre-aggregated basic sample profile (#140196)
Define a pre-aggregated basic sample format:
```
E <event name>
S <location> <count>
```

`-nl` flag is required to use parsed basic samples.
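For illustration, a hypothetical pre-aggregated basic sample file following
this format (event name, then sampled locations with counts):
```
E cycles
S 0x400510 2400
S 0x4517e0 100
```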

Test Plan: updated pre-aggregated-perf.test
2025-06-02 11:43:48 -07:00
Amir Ayupov
5047a33cd8
[BOLT][heatmap] Produce zoomed-out heatmaps (#140153)
Add a capability to produce multiple heatmaps with given bucket sizes.

The default heatmap block size (64B) could be too fine-grained for
large binaries. Extend the option `block-size` to accept a list of
bucket sizes for additional heatmaps with coarser granularity. The
heatmap is simply rescaled, so the provided sizes should be multiples of
each other. Human-readable suffixes can be used, e.g. 4K, 16kb, 1MiB.

New defaults: 64B (base bucket size), 4KB (default page size),
256KB (for large binaries).

Test Plan: updated heatmap-preagg.test
2025-05-30 16:20:19 -07:00
Kazu Hirata
a0c33e535b
[BOLT] Use llvm::find (NFC) (#141520) 2025-05-26 15:12:59 -07:00
Kazu Hirata
465e0daa6c
[BOLT] Avoid repeated hash lookups (NFC) (#140426)
We can use try_emplace to succinctly implement GetOrCreateFuncEntry
and GetOrCreateFuncMemEntry. Since it's a bit of a mouthful to say
FuncBasicSampleData::ContainerTy(), this patch makes the second
parameters default ones.
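A hedged sketch of the try_emplace pattern with an illustrative value type:
```cpp
#include "llvm/ADT/StringMap.h"
#include "llvm/ADT/StringRef.h"

llvm::StringMap<int> Counts;

// try_emplace does a single hash lookup and value-initializes the mapped
// value only when the key is absent; no separate find() + insert().
int &getOrCreateEntry(llvm::StringRef Name) {
  return Counts.try_emplace(Name).first->second;
}
```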
2025-05-17 19:43:55 -07:00
Amir Ayupov
9d5d715330
[BOLT][heatmap] Add synthetic hot text section (#139824)
In heatmap mode, report samples and utilization of the section(s)
between hot text markers `[__hot_start, __hot_end)`.

The intended use is with multi-way splitting where there are several
sections that contain "hot" code (e.g. `.text.warm` with CDSplit).

Addresses the comment on #139193

https://github.com/llvm/llvm-project/pull/139193#pullrequestreview-2835274682

Test Plan: updated heatmap-preagg.test
2025-05-14 09:47:14 -07:00
Amir Ayupov
616489e2ee
[BOLT] Drop perf2bolt cold samples diagnostic (#139337)
Cold samples diagnostics in perf2bolt are superseded by the
`perf2bolt --heatmap` option (#139194), which provides a superset of the
stats and works without the BAT section (not emitted by default).

Test Plan: NFC
2025-05-13 13:24:56 -07:00
Amir Ayupov
0289ca09be
[BOLT] Print heatmap from perf2bolt (#139194)
Add perf2bolt `--heatmap` option to produce heatmaps during profile
aggregation.

Distinguish exclusive mode (`llvm-bolt-heatmap`) from optional mode
(`perf2bolt --heatmap`), which impacts perf.data handling:
exclusive mode covers all addresses, whereas optional mode consumes the
attached profile, which only covers function addresses.

Test Plan: updated perf2bolt tests:
- pre-aggregated-perf.test: pre-aggregated data,
- bolt-address-translation-yaml.test: pre-aggregated + BOLTed input,
- perf_test.test: no-LBR perf data.
2025-05-13 13:23:18 -07:00
Amir Ayupov
7f4febde10
[BOLT][heatmap] Compute section utilization and partition score (#139193)
Heatmap groups samples into buckets of configurable size (the
`--block-size` flag, 64 bytes by default, i.e. the x86 cache line size).
Buckets are mapped to containing sections; buckets that cover multiple
sections are attributed to the first overlapping section. Buckets not
mapped to any section are reported as unmapped.

Heatmap reports **section hotness**, which is the percentage of samples
attributed to the section.

Define **section utilization** as a percentage of buckets with non-zero
samples relative to the total number of section buckets.

Also define section **partition score** as a product of section hotness
(where total excludes unmapped buckets) and mapped utilization, ranging 
from 0 to 1 (higher is better).
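In formula form (a reconstruction from the definitions above, written as
fractions rather than percentages; the partition-score hotness denominator
excludes unmapped buckets), for a section s:
```latex
\mathrm{Hotness}(s) =
  \frac{\mathrm{samples}(s)}{\sum_{t \in \mathrm{mapped\ sections}} \mathrm{samples}(t)}
\qquad
\mathrm{Utilization}(s) =
  \frac{\lvert \{ b \in \mathrm{buckets}(s) : \mathrm{samples}(b) > 0 \} \rvert}
       {\lvert \mathrm{buckets}(s) \rvert}

\mathrm{PartitionScore}(s) =
  \mathrm{Hotness}(s) \times \mathrm{Utilization}(s) \in [0, 1]
```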

The intended use of new metrics is with **production profile** collected
from BOLT-optimized binary. In this case the partition score of .text
(hot text if function splitting is enabled) reflects **optimization
profile** representativeness and the quality of hot-cold splitting.
Partition score of 1 means that all samples fall into hot text, and all
buckets (cache lines) in hot text are exercised, equivalent to perfect
hot-cold splitting.

Test Plan: updated heatmap-preagg.test
2025-05-13 13:20:13 -07:00
Kazu Hirata
383a825d6d
[BOLT] Use StringRef::contains (NFC) (#139658)
Once we convert EventNames to StringRef, which is cheap, we can call
StringRef::contains without creating a temporary instance of
std::string.
2025-05-12 22:59:26 -07:00
Amir Ayupov
fbdb5aeff6
[BOLT] Build heatmap with pre-aggregated data (#138798)
Reuse data structures used by perf data reader for pre-aggregated data.
Combined with #136531 this allows using pre-aggregated data for heatmap.

Test Plan: heatmap-preagg.test
2025-05-12 18:04:10 -07:00
Amir Ayupov
f2351d9e7f
[BOLT][heatmap] Use parsed basic/branch events (#136531)
Remove duplicate profile parsing in heatmap construction, switching to
the parsed profile. #138798 adds support for using pre-aggregated
profile for heatmap construction.

Test Plan: added heatmap.test in
0868850a15
2025-05-12 17:33:30 -07:00
Amir Ayupov
e039d16ee5
[BOLT][NFC] Disambiguate sample as basic sample (#139350)
Sample is a general term covering both basic (IP) and branch (LBR)
profiles. Replace ambiguous uses of "sample" where the basic-sample
sense is meant.

Rename `RawBranchCount` into `RawSampleCount` reflecting its use for
both kinds of profile.

Rename `PF_LBR` profile type as `PF_BRANCH` reflecting non-LBR based
branch profiles (non-brstack SPE, synthesized brstack ETM/PT).

Follow-up to #137644.

Test Plan: NFC
2025-05-12 17:15:16 -07:00
Kazu Hirata
2a0e8863d4
[BOLT] Use StringRef::consume_front (NFC) (#139432) 2025-05-10 22:51:27 -07:00
Amir Ayupov
8f31c6dde7
[BOLT] Support profile density with basic samples (#137644)
For profiles with LBR samples, binary function profile density is
computed as the ratio of executed bytes to function size in bytes.

For profiles with IP samples, the size of the basic block containing the
sample IP is used as the numerator.

Test Plan: updated perf_test.test
2025-05-10 21:01:49 -07:00
Kazu Hirata
193135c800
[BOLT] Remove an unused local variable (NFC) (#139392) 2025-05-10 10:21:59 -07:00
Amir Ayupov
54aa16d293
[BOLT] Drop converting return profile to call cont (#129477)
The workaround was not implemented for the BAT case, and it is no longer
needed with pre-aggregated traces; alternatively, the effect can be
achieved with `infer-fall-throughs` with the old pre-aggregated format
(branches + ranges).

Test Plan: updated callcont-fallthru.s
2025-05-06 22:21:53 -07:00
Amir Ayupov
5d0afacd1b
[BOLT][NFCI] Emit uniform diagnostics in DataAggregator (#136530)
DataAggregator supports reading different kinds of profile data:
- perf data: branch records or IP samples,
- pre-aggregated branch data.

Make profile quality reporting uniform across all kinds of input:
- out-of-range and mismatching samples,
- samples in cold code in BAT mode (profiled BOLTed binary).

Test Plan: NFCI
2025-04-24 13:51:18 -07:00
Kazu Hirata
a8644b3d88
[BOLT] Call hash_combine_range with ranges (NFC) (#136524) 2025-04-20 19:41:26 -07:00
Amir Ayupov
fa4ac19f0f
[BOLT] Accept PLT fall-throughs as valid traces (#129481)
We used to report PLT traces as invalid (mismatching disassembled
function contents) because PLT functions are marked as pseudo and
ignored, and thus have no CFG. However, such traces do not mismatch
the function contents. Accept them without attaching the profile.

Test Plan: updated callcont-fallthru.s
2025-04-11 21:26:19 -07:00
chrisPyr
038fff3f24
[NFC][BOLT] Make file-local cl::opt global variables static (#126472)
#125983
2025-03-05 22:11:05 -08:00
Amir Ayupov
f567524399
[BOLT] Fix doTrace in BAT mode (#128546)
When processing BOLTed binaries with a BAT section, we used to
indiscriminately use `BAT->getFallthroughsInTrace` to record
fall-throughs, even if the function is not covered by BAT.

Fix that by using non-BAT CFG-based `getFallthroughsInTrace` if the
function is not in BAT.

Test Plan: updated bolt-address-translation-yaml.test
2025-02-25 10:56:13 -08:00
Amir Ayupov
61acfb07e8
[BOLT] Add pre-aggregated trace support (#127125)
Traces are triplets of branch source, target, and fall-through end (next
branch).
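For illustration, a hypothetical trace record in this format (branch
source, branch target, fall-through end, count):
```
T 0x4005f0 0x400610 0x40062c 42
```
Read as: a branch from 0x4005f0 to 0x400610 whose fall-through ran to the
next branch at 0x40062c, observed 42 times.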

Traces simplify the differentiation of fall-throughs into local- and
external-origin, which improves performance over a profile with
undifferentiated fall-throughs by eliminating profile discontinuity in
call-to-continuation fall-throughs. This makes it possible to avoid
converting the return profile into a call-to-continuation profile, which
may introduce statistical biases.

The existing format makes provisions for local- (F) and external- (f)
origin fall-throughs, but the profile producer needs to know function
boundaries. BOLT has that information readily available, so providing
the origin branch of a fall-through is a functional replacement of the
fall-through kind (f or F). This also has the effect of combining
branches and fall-throughs into a single record.

As traces subsume other pre-aggregated profile kinds, BOLT may drop
support for them soon. Users of the pre-aggregated profile format are
advised to migrate to the trace format.

Test Plan: Updated callcont-fallthru.s
2025-02-13 15:14:56 -08:00
Amir Ayupov
e6c9cd9c06
[BOLT] Drop parsing sample PC when processing LBR perf data (#123420)
Remove options to generate autofdo data (unused) and `use-event-pc`
(not beneficial).

Cuts down perf2bolt time for 11GB perf.data by 40s (11:10->10:30).
2025-01-21 09:04:49 -08:00
Nikita Popov
e7244d8659
[BOLT][CMake] Don't export bolt libraries in LLVMExports.cmake (#121936)
Bolt makes use of add_llvm_library and as such ends up exporting its
libraries from LLVMExports.cmake, which is not correct.

Bolt doesn't have its own exports file, and I assume that there is no
desire to have one either -- Bolt libraries are not intended to be
consumed as a cmake module, right?

As such, this PR adds a NO_EXPORT option to simply exclude these
libraries from the exports file.
2025-01-08 09:41:09 +01:00
Paschalis Mpeis
51003076eb
Reapply [BOLT] DataAggregator support for binaries with multiple text segments (#118023)
When a binary has multiple text segments, the Size is computed as the
difference between the last address of these segments and the BaseAddress.
The base addresses of all text segments must be the same.

Introduces flag 'perf-script-events' for testing, which allows passing
perf events without BOLT having to parse them by invoking 'perf script'.
The flag is used to pass a mock perf profile that has two memory
mappings for a mock binary that has two text segments. The mapping
size is updated as `parseMMapEvents` now processes all text segments.
2024-12-02 09:20:40 +00:00
Hans Wennborg
537343dea4 Revert "[BOLT] DataAggregator support for binaries with multiple text segments (#92815)"
This caused test failures, see comment on the PR:

  Failed Tests (2):
    BOLT-Unit :: Core/./CoreTests/AArch64/MemoryMapsTester/MultipleSegmentsMismatchedBaseAddress/0
    BOLT-Unit :: Core/./CoreTests/X86/MemoryMapsTester/MultipleSegmentsMismatchedBaseAddress/0

> When a binary has multiple text segments, the Size is computed as the
> difference of the last address of these segments from the BaseAddress.
> The base addresses of all text segments must be the same.
>
> Introduces flag 'perf-script-events' for testing. It allows passing perf events
> without BOLT having to parse them using 'perf script'. The flag is used to
> pass a mock perf profile that has two memory mappings for a mock binary
> that has two text segments. The size of the mapping is updated as this
> change `parseMMapEvents` processes all text segments.

This reverts commit 4b71b3782d217db0138b701c4514bd2168ca1659.
2024-11-26 14:59:30 +01:00
Paschalis Mpeis
4b71b3782d
[BOLT] DataAggregator support for binaries with multiple text segments (#92815)
When a binary has multiple text segments, the Size is computed as the
difference between the last address of these segments and the BaseAddress.
The base addresses of all text segments must be the same.

Introduces flag 'perf-script-events' for testing. It allows passing perf events
without BOLT having to parse them using 'perf script'. The flag is used to
pass a mock perf profile that has two memory mappings for a mock binary
that has two text segments. The size of the mapping is updated as this
change `parseMMapEvents` processes all text segments.
2024-11-25 13:12:43 +00:00
Kazu Hirata
06e0869624 [BOLT] Fix warnings
This patch fixes:

  bolt/lib/Profile/StaleProfileMatching.cpp:694:24: error: unused
  variable 'BinHash' [-Werror,-Wunused-variable]

  bolt/lib/Profile/YAMLProfileWriter.cpp:206:61: error: missing field
  'GUID' initializer [-Werror,-Wmissing-field-initializers]

  bolt/lib/Profile/YAMLProfileReader.cpp:840:16: error: unused
  variable 'MatchedWithPseudoProbes' [-Werror,-Wunused-variable]
2024-11-12 09:39:57 -08:00
Shaw Young
9a9af0a23f
[BOLT] Match blocks with pseudo probes (#99891)
Match inline trees first between profile and the binary: by GUID,
checksum, parent, and inline site for inlined functions. Map profile
probes to binary probes via matched inline tree nodes. Each binary probe
has an associated binary basic block. If all probes from one profile
basic block map to the same binary basic block, it’s an exact match,
otherwise the block is determined by majority vote and reported as a
loose match.

Pseudo probe matching happens between exact hash matching and call/loose
matching.

Introduce ProbeMatchSpec - a mechanism to match probes belonging to
another binary function. For example, given functions foo and bar:
```
void foo() {
  bar();
}
```
In the profiled binary, bar is not inlined => there is a top-level
function bar. In the new binary the profile is applied to, bar is
inlined into foo.

Currently, BOLT does 1:1 matching between profile functions and binary
functions based on the name. #100446 will extend this to N:M where
multiple profiles can be matched to one binary function (as in the
example above where binary function foo would use profiles for foo and
bar), and one profile can be matched to multiple binary functions (e.g.
if bar was inlined into multiple functions).

In this diff, ProbeMatchSpecs would only have one BinaryFunctionProfile
(existing name-based matching). 

Test Plan: Added match-blocks-with-pseudo-probes.test

Performance test:
- Setup:
  - Baseline no-BOLT: Clang with pseudo probes, ThinLTO + CSSPGO
  (#79942)
  - BOLT fresh: BOLTed Clang using fresh profile,
  - BOLT stale (hash): BOLTed Clang using stale profile (collected on
    Clang 10K commits back), `-infer-stale-profile` (hash+call block
    matching)
  - BOLT stale (+probe): BOLTed Clang using stale profile,
    `-infer-stale-profile` with `-stale-matching-with-pseudo-probes`
    (hash+call+pseudo probe block matching)
  - 2S Intel SKX Xeon 6138 with 40C/80T and 256GB RAM, using 20C/40T for
    build,
  - BOLT profiles are collected on Clang compiling large preprocessed
    C++ file.
- Benchmark: building Clang (average of 5 runs), see driver in
  aaupov/llvm-devmtg-2022
- Results, wall time, lower is better:
  - Baseline no-BOLT: 429.52 +- 2.61s,
  - BOLT stale (hash): 413.21 +- 2.19s,
  - BOLT stale (+probe): 409.69 +- 1.41s,
  - BOLT fresh: 384.50 +- 1.80s.

---------

Co-authored-by: Amir Ayupov <aaupov@fb.com>
2024-11-12 07:21:03 -08:00
Amir Ayupov
d936924f5e
[BOLT][NFC] Make YamlProfileToFunction a DenseMap (#108712)
YAML function profiles have sparse function IDs, assigned from
sequential function IDs in the profiled binary. For example, for one
large binary, the YAML profile has 15K functions, but the highest ID is
~600K, close to the number of functions in the profiled binary.

In `matchProfileToFunction`, the `YamlProfileToFunction` vector was
resized to match the highest function ID, which entails a 40X overcommit.
Change the type of `YamlProfileToFunction` to DenseMap to reduce memory
utilization.
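A hedged sketch contrasting the two layouts (illustrative types and the
numbers from the example above):
```cpp
#include "llvm/ADT/DenseMap.h"
#include <cstdint>
#include <vector>

struct BinaryFunctionProfile; // illustrative

// Vector indexed by function ID: ~600K slots allocated while only ~15K
// are ever used, a ~40X overcommit.
std::vector<BinaryFunctionProfile *> ByVector(600000);

// DenseMap keyed by function ID: capacity tracks the ~15K live entries.
llvm::DenseMap<uint64_t, BinaryFunctionProfile *> ById;
```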

#99891 makes use of it for profile lookup associated with a given binary
function.
2024-11-08 15:24:48 -08:00
Amir Ayupov
74e6478f81
[BOLT] Set call to continuation count in pre-aggregated profile
#109683 identified an issue with the pre-aggregated profile where a
call-to-continuation fall-through edge count is missing (profile
discontinuity).

This issue only affects the pre-aggregated profile but not perf data,
since the LBR stack has the necessary information to determine whether
the trace (fall-through) starts at a call continuation, whereas a
pre-aggregated fall-through lacks this information.

The solution is to look at branch records in pre-aggregated profiles
that correspond to returns and assign counts to the call-to-continuation
fall-through (sketched below) when:
- BranchFrom is in another function or DSO,
- BranchTo may be a call continuation site:
  - not an entry point/landing pad.

Note that we can't directly check whether BranchFrom corresponds to a
return instruction when it's in an external DSO.
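A hedged sketch of the branch-record check (illustrative names, not the
actual DataAggregator code):
```cpp
#include <cstdint>

struct FuncStub { // illustrative stand-in
  bool containsAddress(uint64_t Addr) const;
  bool isEntryPoint(uint64_t Offset) const;
  bool isLandingPad(uint64_t Offset) const;
};

// A branch whose source lies outside the function (another function or a
// DSO) and whose target is neither an entry point nor a landing pad is
// treated as a return to a call continuation site.
bool mayBeReturnToCallContinuation(const FuncStub &ToFunc, uint64_t From,
                                   uint64_t ToOffset) {
  if (ToFunc.containsAddress(From))
    return false; // local branch, not an external return
  return !ToFunc.isEntryPoint(ToOffset) && !ToFunc.isLandingPad(ToOffset);
}
```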

Keep call continuation handling for perf data (`getFallthroughsInTrace`)
[1] as-is due to marginally better performance. The difference is that a
return-converted call-to-continuation fall-through is slightly more
frequent than other fall-throughs, since the former only requires one LBR
address belonging to the profiled binary while the latter needs two.
Hence return-converted fall-throughs have a larger "weight", which
affects code layout.

[1] `DataAggregator::getFallthroughsInTrace`
fea18afeed/bolt/lib/Profile/DataAggregator.cpp (L906-L915)

Test Plan: added callcont-fallthru.s

Reviewers: maksfb, ayermolo, ShatianWang, dcci

Reviewed By: maksfb, ShatianWang

Pull Request: https://github.com/llvm/llvm-project/pull/109486
2024-11-07 16:20:19 -08:00
Amir Ayupov
6ee5ff95ab
[BOLT] Add profile density computation
Reuse the definition of profile density from llvm-profgen (#92144):
- the density is computed in perf2bolt using raw samples (perf.data or
  pre-aggregated data),
- function density is the ratio of dynamically executed function bytes
  to the static function size in bytes,
- profile density:
  - functions are sorted by density in decreasing order, accumulating
    their respective sample counts,
  - profile density is the smallest density covering 99% of total sample
    count.

In other words, BOLT binary profile density is the minimum amount of
profile information per function (excluding functions in the tail 1% of
sample count) that is sufficient to optimize the binary well.
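A hedged sketch of the computation described above (illustrative types,
not the actual perf2bolt code):
```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct FuncStats {
  uint64_t ExecutedBytes; // dynamically executed bytes in the function
  uint64_t SampleCount;   // raw samples attributed to the function
  uint64_t Size;          // static function size in bytes
  double density() const { return double(ExecutedBytes) / Size; }
};

double profileDensity(std::vector<FuncStats> Funcs) {
  // Sort functions by density in decreasing order.
  std::sort(Funcs.begin(), Funcs.end(),
            [](const FuncStats &A, const FuncStats &B) {
              return A.density() > B.density();
            });
  uint64_t Total = 0;
  for (const FuncStats &F : Funcs)
    Total += F.SampleCount;
  // Accumulate sample counts; the density of the function that crosses
  // 99% of the total sample count is the profile density.
  uint64_t Acc = 0;
  for (const FuncStats &F : Funcs) {
    Acc += F.SampleCount;
    if (Acc >= 0.99 * Total)
      return F.density();
  }
  return 0.0;
}
```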

The density threshold of 60 was determined through experiments with
large binaries by reducing the sample count and checking resulting
profile density and performance. The threshold is conservative.

perf2bolt prints a warning if the density is below the threshold and
suggests increasing the sampling duration and/or frequency to reach a
given density, e.g.:
```
BOLT-WARNING: BOLT is estimated to optimize better with 2.8x more samples.
```

Test Plan: updated pre-aggregated-perf.test

Reviewers: maksfb, wlei-llvm, rafaelauler, ayermolo, dcci, WenleiHe

Reviewed By: WenleiHe, wlei-llvm

Pull Request: https://github.com/llvm/llvm-project/pull/101094
2024-10-24 18:30:59 -07:00