llvm-project

Author	SHA1	Message	Date
Rafael Auler	47f54e4992	Revert "[BOLT][NFC] Register profiled functions once (#150622 )" (#152597 ) In perf2bolt, we are observing sporadic crashes in the recently added registerProfiledFunctions from #150622. Addresses provided by the hardware (from LBR) might be -1, which clashes with what LLVM uses in DenseSet as empty tombstones records. This causes DenseSet to assert with "can't insert empty tombstone into map" when ingesting this data. Revert this change for now to unbreak perf2bolt.	2025-08-07 15:28:22 -07:00
Amir Ayupov	1b657c6d6b	[BOLT][NFC] Register profiled functions once (#150622 ) While registering profiled functions, only handle each address once. Speeds up `DataAggregator::preprocessProfile`. Test Plan: For intermediate size pre-aggregated profile (10MB), reduces parsing time from ~0.41s down to ~0.16s.	2025-07-28 13:29:55 +02:00
Amir Ayupov	a850912de1	[BOLT] Require CFG in BAT mode (#150488 ) `getFallthroughsInTrace` requires CFG for functions not covered by BAT, even in BAT/fdata mode. BAT-covered functions go through special handling in fdata (`BAT->getFallthroughsInTrace`) and YAML (`DataAggregator::writeBATYAML`) modes. Since all modes (BAT/no-BAT, YAML/fdata) now need disassembly/CFG construction: - drop special BAT/fdata handling that omitted disassembly/CFG in `RewriteInstance::run`, enabling CFG for all non-BAT functions, - switch `getFallthroughsInTrace` to check if a function has CFG, - which allows emitting profile for non-simple functions in all modes. Previously, traces in non-simple functions were reported as invalid/ mismatching disassembled function contents. This change reduces the number of such invalid traces and increases the number of profiled functions. These functions may participate in function reordering via call graph profile. Test Plan: updated unclaimed-jt-entries.s	2025-07-25 13:54:37 +02:00
Amir Ayupov	00311cf604	[BOLT] Impute missing trace fall-through (#145258 )	2025-07-12 12:25:58 -07:00
Amir Ayupov	46e3ec0244	[BOLT][NFCI] Report perf script time (#147232 ) Leverage `sys::ProcessStatistics` to report the run time and memory usage of perf script processes launched when reading perf data. The reporting is enabled in debug mode with `-debug-only=aggregator`. Switch buildid-list command to non-waiting `launchPerfProcess` to get its runtime as well, unifying it with the rest of perf script processes. Test Plan: NFC	2025-07-07 06:05:48 -07:00
Amir Ayupov	f6973baf28	[BOLT][NFC] Split out parsePerfData (#145248 )	2025-06-24 09:39:51 -07:00
Amir Ayupov	6d8c6ef90c	[BOLT][NFC] Simplify doTrace in BAT mode (#143233 ) `BoltAddressTranslation::getFallthroughsInTrace` iterates over address translation map entries and therefore has direct access to both original and translated offsets. Return the translated offsets in fall-throughs list to avoid duplicate address translation inside `doTrace`. Test Plan: NFC	2025-06-20 12:45:21 -07:00
Amir Ayupov	7085065c02	[BOLT] Support pre-aggregated returns (#143296 ) Intel's Architectural LBR supports capturing branch type information as part of LBR stack (SDM Vol 3B, part 2, October 2024): ``` 20.1.3.2 Branch Types The IA32_LBR_x_INFO.BR_TYPE and IA32_LER_INFO.BR_TYPE fields encode the branch types as shown in Table 20-3. Table 20-3. IA32_LBR_x_INFO and IA32_LER_INFO Branch Type Encodings Encoding \| Branch Type 0000B \| COND 0001B \| NEAR_IND_JMP 0010B \| NEAR_REL_JMP 0011B \| NEAR_IND_CALL 0100B \| NEAR_REL_CALL 0101B \| NEAR_RET 011xB \| Reserved 1xxxB \| OTHER_BRANCH For a list of branch operations that fall into the categories above, see Table 20-2. Table 20-2. Branch Type Filtering Details Branch Type \| Operations Recorded COND \| Jcc, JCXZ, and LOOP NEAR_IND_JMP \| JMP r/m* NEAR_REL_JMP \| JMP rel* NEAR_IND_CALL \| CALL r/m* NEAR_REL_CALL \| CALL rel* (excluding CALLs to the next sequential IP) NEAR_RET \| RET (0C3H) OTHER_BRANCH \| JMP/CALL ptr, JMP/CALL m, RET (0C8H), SYS*, interrupts, exceptions (other than debug exceptions), IRET, INT3, INTn, INTO, TSX Abort, EENTER, ERESUME, EEXIT, AEX, INIT, SIPI, RSM ``` Linux kernel can preserve branch type when `save_type` is enabled, even if CPU does not support Architectural LBR: `f09079bd04/tools/perf/Documentation/perf-record.txt (L457-L460)` > - save_type: save branch type during sampling in case binary is not available later. For the platforms with Intel Arch LBR support (12th-Gen+ client or 4th-Gen Xeon+ server), the save branch type is unconditionally enabled when the taken branch stack sampling is enabled. Kernel-reported branch type values: `8c6bc74c7f/include/uapi/linux/perf_event.h (L251-L269)` This information is needed to disambiguate external returns (from DSO/JIT) to an entry point or a landing pad, when BOLT can't disassemble the branch source. This patch adds new pre-aggregated types: - return trace (R), - external return fall-through (r). For such types, the checks for fall-through start (not an entry or a landing pad) are relaxed. Depends on #143295. Test Plan: updated callcont-fallthru.s	2025-06-20 03:17:08 -07:00
Ádám Kallai	f75973949b	[BOLT][AArch64] Add support for SPE brstack format (#129231 ) Since Linux 6.14, Perf gained the ability to report SPE branch events using the `brstack` format, which matches the layout of LBR/BRBE. This patch reuses the existing LBR parsing logic to support SPE. Example SPE brstack format: ```bash perf script -i perf.data -F pid,brstack --itrace=bl ``` ``` PID FROM / TO / PREDICTED 16984 0x72e342e5f4/0x72e36192d0/M/-/-/11/RET/- 16984 0x72e7b8b3b4/0x72e7b8b3b8/PN/-/-/11/COND/- 16984 0x72e7b92b48/0x72e7b92b4c/PN/-/-/8/COND/- 16984 0x72eacc6b7c/0x760cc94b00/P/-/-/9/RET/- 16984 0x72e3f210fc/0x72e3f21068/P/-/-/4//- 16984 0x72e39b8c5c/0x72e3627b24/P/-/-/4//- 16984 0x72e7b89d20/0x72e7b92bbc/P/-/-/4/RET/- ``` SPE brstack flags can be two characters long: `PN` or `MN`: - `P` = predicted branch - `M` = mispredicted branch - `N` = optionally appears when the branch is NOT-TAKEN - flag is relevant only to conditional branches Example of usage with BOLT: 1. Capture SPE branch events: ```bash perf record -e 'arm_spe_0/branch_filter=1/u' -- binary ``` 2. Convert profile for BOLT: ```bash perf2bolt -p perf.data -o perf.fdata --spe binary ``` 3. Run BOLT Optimization: ```bash llvm-bolt binary -o binary.bolted --data perf.fdata ... ``` A unit test verifies the parsing of the 'SPE brstack format'. --------- Co-authored-by: Paschalis Mpeis <paschalis.mpeis@arm.com>	2025-06-20 10:40:35 +01:00
Amir Ayupov	9fed480f18	[BOLT] Explicitly check for returns when extending call continuation profile (#143295 ) Call continuation logic relies on assumptions about fall-through origin: - the branch is external to the function, - fall-through start is at the beginning of the block, - the block is not an entry point or a landing pad. Leverage trace information to explicitly check whether the origin is a return instruction, and defer to checks above only in case of DSO-external branch source. This covers both regular and BAT cases, addressing call continuation fall-through undercounting in the latter mode, which improves BAT profile quality metrics. For example, for one large binary: - CFG discontinuity 21.83% -> 0.00%, - CFG flow imbalance 10.77%/100.00% -> 3.40%/13.82% (weighted/worst) - CG flow imbalance 8.49% —> 8.49%. Depends on #143289. Test Plan: updated callcont-fallthru.s	2025-06-17 06:28:27 -07:00
Amir Ayupov	7e6c1bd3ed	[BOLT][NFCI] Simplify DataAggregator using traces (#143289 ) Consistently apply traces as defined in #127125 for branch profile aggregation. This combines branches and fall-through records into one. With large input binaries/profiles, the speed up in aggregation time (`-time-aggr`, wall time): - perf.data, pre-BOLT input: 154.5528s -> 144.0767s - pre-aggregated data, pre-BOLT input: 15.1026s -> 9.0711s - pre-aggregated data, BOLTed input: 15.4871s -> 10.0077s Test Plan: NFC	2025-06-16 23:54:40 -07:00
Amir Ayupov	902a991e12	[BOLT] Make memory profile parsing optional (#129585 ) Introduce `parse-mem-profile` option to limit overheads processing tracing data (Intel PT or ARM ETM). By default, it's enabled for perf data (existing behavior), unless `itrace` is passed to parse tracing data where it's extremely expensive. In this case, the flag needs to be set explicitly if needed.	2025-06-12 14:46:37 -07:00
Amir Ayupov	0c77468288	[BOLT] Expose external entry count for functions (#141674 ) Record the number of function invocations from external code - code outside the binary, which may include JIT code and DSOs. Accounting external entry counts improves the fidelity of call graph flow conservation analysis. Test Plan: updated shrinkwrapping.test	2025-06-10 14:31:22 -07:00
Amir Ayupov	03bbd04bb7	[BOLT][NFCI] Skip validation in parseLBRSample (#143288 ) Parsed branches and fall-throughs are validated in `doBranch` and `doTrace` respectively. Simplify parseLBRSample by omitting the validation. This also speeds up perf data processing as checks are only done once for aggregated branches/fall-throughs and not individual LBR entries. Since invalid/external addresses are no longer sanitized during parsing, sanitize them in `doBranch`. Test Plan: updated X86/pre-aggregated-perf.test	2025-06-08 17:50:02 -07:00
Amir Ayupov	dcd2ac7ef2	[BOLT] Sort EntryData (#143308 ) Aggregated branch data has two containers: `Data` for local branches, and `EntryData` for external branches. Fix the omission and sort `EntryData` to ensure stable output fdata profiles. Test Plan: updated pre-aggregated-perf.test	2025-06-08 17:43:44 -07:00
Amir Ayupov	c480dcddd9	[BOLT][NFC] Move LBREntry from DataReader to DataAggregator (#143287 ) LBREntry is only used in DataAggregator. Test Plan: NFC	2025-06-08 17:41:46 -07:00
Amir Ayupov	dc513fa8dc	[BOLT] Zero initialize pre-aggregated counters (#142698 ) #140196 introduced UB by using uninitialized misprediction count for pre-aggregated traces. Fix by zero initializing both counters. Test Plan: updated entry-point-fallthru.s	2025-06-03 20:50:19 -07:00
Amir Ayupov	18e51314c4	[BOLT] Support pre-aggregated basic sample profile (#140196 ) Define a pre-aggregated basic sample format: ``` E <event name> S <location> <count> ``` `-nl` flag is required to use parsed basic samples. Test Plan: update pre-aggregated-perf.test	2025-06-02 11:43:48 -07:00
Amir Ayupov	5047a33cd8	[BOLT][heatmap] Produce zoomed-out heatmaps (#140153 ) Add a capability to produce multiple heatmaps with given bucket sizes. The default heatmap block size (64B) could be too fine-grained for large binaries. Extend the option `block-size` to accept a list of bucket sizes for additional heatmaps with coarser granularity. The heatmap is simply rescaled so provided sizes should be multiples of each other. Human-readable suffixes can be used, e.g. 4K, 16kb, 1MiB. New defaults: 64B (base bucket size), 4KB (default page size), 256KB (for large binaries). Test Plan: updated heatmap-preagg.test	2025-05-30 16:20:19 -07:00
Amir Ayupov	9d5d715330	[BOLT][heatmap] Add synthetic hot text section (#139824 ) In heatmap mode, report samples and utilization of the section(s) between hot text markers `[__hot_start, __hot_end)`. The intended use is with multi-way splitting where there are several sections that contain "hot" code (e.g. `.text.warm` with CDSplit). Addresses the comment on #139193 https://github.com/llvm/llvm-project/pull/139193#pullrequestreview-2835274682 Test Plan: updated heatmap-preagg.test	2025-05-14 09:47:14 -07:00
Amir Ayupov	616489e2ee	[BOLT] Drop perf2bolt cold samples diagnostic (#139337 ) Cold samples diagnostics in perf2bolt are superseded by `perf2bolt --heatmap` option (#139194). It provides a superset of stats and works without BAT section which is not emitted by default. Test Plan: NFC	2025-05-13 13:24:56 -07:00
Amir Ayupov	0289ca09be	[BOLT] Print heatmap from perf2bolt (#139194 ) Add perf2bolt `--heatmap` option to produce heatmaps during profile aggregation. Distinguish exclusive mode (`llvm-bolt-heatmap`) and optional mode (`perf2bolt --heatmap`), which impacts perf.data handling: exclusive mode covers all addresses, whereas optional mode consumes attached profile only covering function addresses. Test Plan: updated per2bolt tests: - pre-aggregated-perf.test: pre-aggregated data, - bolt-address-translation-yaml.test: pre-aggregated + BOLTed input, - perf_test.test: no-LBR perf data.	2025-05-13 13:23:18 -07:00
Amir Ayupov	fbdb5aeff6	[BOLT] Build heatmap with pre-aggregated data (#138798 ) Reuse data structures used by perf data reader for pre-aggregated data. Combined with #136531 this allows using pre-aggregated data for heatmap. Test Plan: heatmap-preagg.test	2025-05-12 18:04:10 -07:00
Amir Ayupov	f2351d9e7f	[BOLT][heatmap] Use parsed basic/branch events (#136531 ) Remove duplicate profile parsing in heatmap construction, switching to using parsed profile. #138798 adds support for using pre-aggregated profile for heatmap construction. Test Plan: added heatmap.test in `0868850a15`	2025-05-12 17:33:30 -07:00
Amir Ayupov	e039d16ee5	[BOLT][NFC] Disambiguate sample as basic sample (#139350 ) Sample is a general term covering both basic (IP) and branch (LBR) profiles. Find and replace ambiguous uses of sample in a basic sample sense. Rename `RawBranchCount` into `RawSampleCount` reflecting its use for both kinds of profile. Rename `PF_LBR` profile type as `PF_BRANCH` reflecting non-LBR based branch profiles (non-brstack SPE, synthesized brstack ETM/PT). Follow-up to #137644. Test Plan: NFC	2025-05-12 17:15:16 -07:00
Amir Ayupov	8f31c6dde7	[BOLT] Support profile density with basic samples (#137644 ) For profile with LBR samples, binary function profile density is computed as a ratio of executed bytes to function size in bytes. For profile with IP samples, use the size of basic block containing the sample IP as a numerator. Test Plan: updated perf_test.test	2025-05-10 21:01:49 -07:00
Kazu Hirata	193135c800	[BOLT] Remove an unused local variable (NFC) (#139392 )	2025-05-10 10:21:59 -07:00
Amir Ayupov	54aa16d293	[BOLT] Drop converting return profile to call cont (#129477 ) The workaround was not implemented for BAT case, and it is no longer needed with pre-aggregated traces, alternatively, the effect can be achieved with `infer-fall-throughs` with old pre-aggregated format (branches + ranges). Test Plan: updated callcont-fallthru.s	2025-05-06 22:21:53 -07:00
Amir Ayupov	5d0afacd1b	[BOLT][NFCI] Emit uniform diagnostics in DataAggregator (#136530 ) DataAggregator supports reading different kinds of profile data: - perf data: branch records or IP samples, - pre-aggregated branch data. Make profile quality reporting uniform across all kinds of input: - out-of-range and mismatching samples, - samples in cold code in BAT mode (profiled BOLTed binary). Test Plan: NFCI	2025-04-24 13:51:18 -07:00
Amir Ayupov	fa4ac19f0f	[BOLT] Accept PLT fall-throughs as valid traces (#129481 ) We used to report PLT traces as invalid (mismatching disassembled function contents) because PLT functions are marked as pseudo and ignored, thus missing CFG. However, such traces are not mismatching the function contents. Accept them without attaching the profile. Test Plan: updated callcont-fallthru.s	2025-04-11 21:26:19 -07:00
Amir Ayupov	f567524399	[BOLT] Fix doTrace in BAT mode (#128546 ) When processing BOLTed binaries with BAT section, we used to indiscriminately use `BAT->getFallthroughsInTrace` to record fall-throughs, even if the function is not covered by BAT. Fix that by using non-BAT CFG-based `getFallthroughsInTrace` if the function is not in BAT. Test Plan: updated bolt-address-translation-yaml.test	2025-02-25 10:56:13 -08:00
Amir Ayupov	61acfb07e8	[BOLT] Add pre-aggregated trace support (#127125 ) Traces are triplets of branch source, target, and fall-through end (next branch). Traces simplify differentiation of fall-throughs into local- and external-origin, which improves performance over profile with undifferentiated fall-throughs by eliminating profile discontinuity in call to continuation fall-throughs. This makes it possible to avoid converting return profile into call to continuation profile which may introduce statistical biases. The existing format makes provisions for local- (F) and external- (f) origin fall-throughs, but the profile producer needs to know function boundaries. BOLT has that information readily available, so providing the origin branch of a fall-through is a functional replacement of the fall-through kind (f or F). This also has an effect of combining branches and fall-throughs into a single record. As traces subsume other pre-aggregated profile kinds, BOLT may drop support for them soon. Users of pre-aggregated profile format are advised to migrate to the trace format. Test Plan: Updated callcont-fallthru.s	2025-02-13 15:14:56 -08:00
Amir Ayupov	e6c9cd9c06	[BOLT] Drop parsing sample PC when processing LBR perf data (#123420 ) Remove options to generate autofdo data (unused) and `use-event-pc` (not beneficial). Cuts down perf2bolt time for 11GB perf.data by 40s (11:10->10:30).	2025-01-21 09:04:49 -08:00
Paschalis Mpeis	51003076eb	Reapply [BOLT] DataAggregator support for binaries with multiple text segments (#118023 ) When a binary has multiple text segments, the Size is computed as the difference of the last address of these segments from the BaseAddress. The base addresses of all text segments must be the same. Introduces flag 'perf-script-events' for testing, which allows passing perf events without BOLT having to parse them by invoking 'perf script'. The flag is used to pass a mock perf profile that has two memory mappings for a mock binary that has two text segments. The mapping size is updated as `parseMMapEvents` now processes all text segments.	2024-12-02 09:20:40 +00:00
Hans Wennborg	537343dea4	Revert "[BOLT] DataAggregator support for binaries with multiple text segments (#92815 )" This caused test failures, see comment on the PR: Failed Tests (2): BOLT-Unit :: Core/./CoreTests/AArch64/MemoryMapsTester/MultipleSegmentsMismatchedBaseAddress/0 BOLT-Unit :: Core/./CoreTests/X86/MemoryMapsTester/MultipleSegmentsMismatchedBaseAddress/0 > When a binary has multiple text segments, the Size is computed as the > difference of the last address of these segments from the BaseAddress. > The base addresses of all text segments must be the same. > > Introduces flag 'perf-script-events' for testing. It allows passing perf events > without BOLT having to parse them using 'perf script'. The flag is used to > pass a mock perf profile that has two memory mappings for a mock binary > that has two text segments. The size of the mapping is updated as this > change `parseMMapEvents` processes all text segments. This reverts commit 4b71b3782d217db0138b701c4514bd2168ca1659.	2024-11-26 14:59:30 +01:00
Paschalis Mpeis	4b71b3782d	[BOLT] DataAggregator support for binaries with multiple text segments (#92815 ) When a binary has multiple text segments, the Size is computed as the difference of the last address of these segments from the BaseAddress. The base addresses of all text segments must be the same. Introduces flag 'perf-script-events' for testing. It allows passing perf events without BOLT having to parse them using 'perf script'. The flag is used to pass a mock perf profile that has two memory mappings for a mock binary that has two text segments. The size of the mapping is updated as this change `parseMMapEvents` processes all text segments.	2024-11-25 13:12:43 +00:00
Amir Ayupov	74e6478f81	[BOLT] Set call to continuation count in pre-aggregated profile #109683 identified an issue with pre-aggregated profile where a call to continuation fallthrough edge count is missing (profile discontinuity). This issue only affects pre-aggregated profile but not perf data since LBR stack has the necessary information to determine if the trace (fall- through) starts at call continuation, whereas pre-aggregated fallthrough lacks this information. The solution is to look at branch records in pre-aggregated profiles that correspond to returns and assign counts to call to continuation fallthrough: - BranchFrom is in another function or DSO, - BranchTo may be a call continuation site: - not an entry point/landing pad. Note that we can't directly check if BranchFrom corresponds to a return instruction if it's in external DSO. Keep call continuation handling for perf data (`getFallthroughsInTrace`) [1] as-is due to marginally better performance. The difference is that return-converted call to continuation fallthrough is slightly more frequent than other fallthroughs since the former only requires one LBR address while the latter need two that belong to the profiled binary. Hence return-converted fallthroughs have larger "weight" which affects code layout. [1] `DataAggregator::getFallthroughsInTrace` `fea18afeed/bolt/lib/Profile/DataAggregator.cpp (L906-L915)` Test Plan: added callcont-fallthru.s Reviewers: maksfb, ayermolo, ShatianWang, dcci Reviewed By: maksfb, ShatianWang Pull Request: https://github.com/llvm/llvm-project/pull/109486	2024-11-07 16:20:19 -08:00
Amir Ayupov	6ee5ff95ab	[BOLT] Add profile density computation Reuse the definition of profile density from llvm-profgen (#92144): - the density is computed in perf2bolt using raw samples (perf.data or pre-aggregated data), - function density is the ratio of dynamically executed function bytes to the static function size in bytes, - profile density: - functions are sorted by density in decreasing order, accumulating their respective sample counts, - profile density is the smallest density covering 99% of total sample count. In other words, BOLT binary profile density is the minimum amount of profile information per function (excluding functions in tail 1% sample count) which is sufficient to optimize the binary well. The density threshold of 60 was determined through experiments with large binaries by reducing the sample count and checking resulting profile density and performance. The threshold is conservative. perf2bolt would print the warning if the density is below the threshold and suggest to increase the sampling duration and/or frequency to reach a given density, e.g.: ``` BOLT-WARNING: BOLT is estimated to optimize better with 2.8x more samples. ``` Test Plan: updated pre-aggregated-perf.test Reviewers: maksfb, wlei-llvm, rafaelauler, ayermolo, dcci, WenleiHe Reviewed By: WenleiHe, wlei-llvm Pull Request: https://github.com/llvm/llvm-project/pull/101094	2024-10-24 18:30:59 -07:00
Amir Ayupov	08916cef7e	[BOLT] Set RawBranchCount in DataAggregator Align DataAggregator (Linux perf and pre-aggregated profile reader) to DataReader (fdata profile reader) behavior: set BF->RawBranchCount which is used in profile density computation (#101094). Reviewers: ayermolo, maksfb, dcci, rafaelauler, WenleiHe Reviewed By: WenleiHe Pull Request: https://github.com/llvm/llvm-project/pull/101093	2024-10-24 18:28:44 -07:00
Kristof Beyls	6d216fb7b8	[perf2bolt] Improve heuristic to map in-process addresses to specific… (#109397 ) … segments in Elf binary. The heuristic is improved by also taking into account that only executable segments should contain instructions. Fixes #109384.	2024-09-23 15:14:51 +02:00
Amir Ayupov	c00c62c113	[BOLT] Add pseudo probe inline tree to YAML profile Add probe inline tree information to YAML profile, at function level: - function GUID, - checksum, - parent node id, - call site in the parent. This information is used for pseudo probe block matching (#99891). The encoding adds/changes probe information in multiple levels of YAML profile: - BinaryProfile: add pseudo_probe_desc with GUIDs and Hashes, which permits deduplication of data: - many GUIDs are duplicate as the same callee is commonly inlined into multiple callers, - hashes are also very repetitive, especially for functions with low block counts. - FunctionProfile: add inline tree (see above). Top-level function is included as root of function inline tree, which makes guid and pseudo_probe_desc_hash fields redundant. - BlockProfile: densely-encoded block probe information: - probes reference their containing inline tree node, - separate lists for block, call, indirect call probes, - block probe encoding is specialized: ids are encoded as bitset in uint64_t. If only block probe with id=1 is present, it's encoded as implicit entry (id=0, omitted). - inline tree nodes with identical probes share probe description where node indices are combined into a list. On top of #107970, profile with new probe encoding has the following characteristics (profile for a large binary): - Profile without probe information: 33MB, 3.8MB compressed (baseline). - Profile with inline tree information: 92MB, 14MB compressed. Profile processing time (YAML parsing, inference, attaching steps): - profile without pseudo probes: 5s, - profile with pseudo probes, without pseudo probe matching: 11s, - with pseudo probe matching: 12.5s. Test Plan: updated pseudoprobe-decoding-inline.test Reviewers: wlei-llvm, ayermolo, rafaelauler, dcci, maksfb Reviewed By: wlei-llvm, rafaelauler Pull Request: https://github.com/llvm/llvm-project/pull/107137	2024-09-12 20:51:35 -07:00
Amir Ayupov	ccc7a072db	[BOLT] Drop blocks without profile in BAT YAML (#107970 ) Align BAT YAML (DataAggregator) to YAMLProfileWriter which drops blocks without profile: `61372fc5db/bolt/lib/Profile/YAMLProfileWriter.cpp (L162-L176)` Test Plan: NFCI	2024-09-11 16:36:47 -07:00
Amir Ayupov	c820bd3e33	[BOLT][NFC] Rename profile-use-pseudo-probes The flag currently controls writing of probe information in YAML profile. #99891 adds a separate flag to use probe information for stale profile matching. Thus `profile-use-pseudo-probes` becomes a misnomer and `profile-write-pseudo-probes` better captures the intent. Reviewers: maksfb, WenleiHe, ayermolo, rafaelauler, dcci Reviewed By: rafaelauler Pull Request: https://github.com/llvm/llvm-project/pull/106364	2024-09-11 16:27:33 -07:00
Amir Ayupov	ee09f7d1fc	[MC][NFC] Reduce Address2ProbesMap size Replace the map from addresses to list of probes with a flat vector containing probe references sorted by their addresses. Reduces pseudo probe parsing time from 9.56s to 8.59s and peak RSS from 9.66 GiB to 9.08 GiB as part of perf2bolt processing a large binary. Test Plan: ``` bin/llvm-lit -sv test/tools/llvm-profgen ``` Reviewers: maksfb, rafaelauler, dcci, ayermolo, wlei-llvm Reviewed By: wlei-llvm Pull Request: https://github.com/llvm/llvm-project/pull/102904	2024-08-26 09:14:35 -07:00
Amir Ayupov	4d19676de4	[BOLT] Add profile-use-pseudo-probes option Move pseudo probe profile generation under --profile-use-pseudo-probes option. Note that updating pseudo probes is independent from this flag. Test Plan: updated pseudoprobe-decoding-inline.test Reviewers: maksfb, rafaelauler, ayermolo, dcci, WenleiHe Reviewed By: WenleiHe Pull Request: https://github.com/llvm/llvm-project/pull/100299	2024-07-24 07:31:01 -07:00
Amir Ayupov	c905db67a0	[BOLT] Attach pseudo probes to blocks in YAML profile Read pseudo probes in regular and BAT YAML profile generation, and attach them to YAML profile basic blocks. This exposes GUID, probe id, and probe type in profile for future use in stale profile matching. Test Plan: updated pseudoprobe-decoding-inline.test Reviewers: dcci, rafaelauler, ayermolo, maksfb Reviewed By: rafaelauler Pull Request: https://github.com/llvm/llvm-project/pull/99554	2024-07-18 21:01:40 -07:00
Amir Ayupov	9b007a199d	[BOLT] Expose pseudo probe function checksum and GUID (#99389 ) Add a BinaryFunction field for pseudo probe function GUID. Populate it during pseudo probe section parsing, and emit it in YAML profile (both regular and BAT), along with function checksum. To be used for stale function matching. Test Plan: update pseudoprobe-decoding-inline.test	2024-07-18 20:58:16 -07:00
Amir Ayupov	d1d9545ed3	[BOLT][BAT] Add entries for deleted basic blocks Deleted basic blocks are required for correct mapping of branches modified by SCTC. Increases BAT size, bytes: - large binary: 8622496 -> 8703244. - small binary (X86/bolt-address-translation.test): 928 -> 940. Test Plan: updated bb-with-two-tail-calls.s Reviewers: ayermolo, dcci, maksfb, rafaelauler Reviewed By: rafaelauler Pull Request: https://github.com/llvm/llvm-project/pull/91906	2024-05-23 19:19:07 -07:00
Amir Ayupov	465bfd41fa	[BOLT][NFC] Simplify BBHashMapTy (#91812 )	2024-05-22 16:00:51 -07:00
Amir Ayupov	1529ec085a	[BOLT][NFC] Move out PrintProgramStats from Profile into Rewrite (#93075 ) Eliminate the dependence of Profile on Passes. Test Plan: NFC	2024-05-22 13:53:41 -07:00

1 2 3

126 Commits