750 Commits

Author SHA1 Message Date
Franklin
6e8a1a45a7
[BOLT] Detect Linux kernel version if the binary is a Linux kernel (#119088)
This makes it easier to handle differences (e.g. of exception table
entry size) between versions of Linux kernel
2024-12-26 09:54:23 -08:00
Kristof Beyls
4111841f88
[BOLT] Correctly print preferred disassembly for annotated instructions (#120564)
This patch makes sure that `BinaryContext::printInstruction` prints the
preferred disassembly. Preferred disassembly only gets printed when
there are no annotations on the MCInst. Therefore, this patch
temporarily removes the annotations before printing it.

A few examples of before and after on AArch64 instructions are as
follows:

```
  BEFORE                     AFTER
                             (preferred disassembly)

  ret   x30                  ret
  orr   x30, xzr, x0         mov   x30, x0
  hint  #29                  autiasp
  hint  #12                  autia1716
```

Clearly, the preferred disassembly is easier for developers to read, and
is the disassembly that tools should be printing.

This patch is motivated as part of future work on the
llvm-bolt-binary-analysis tool, making sure that the reports it prints
do use preferred disassembly.

This patch was cherry-picked from
https://github.com/kbeyls/llvm-project/tree/bolt-gadget-scanner-prototype.

In this current patch, this only affects existing RISCV test cases.

This patch also does improve test cases in future patches that will
introduce a binary analysis for llvm-bolt-binary-analysis that checks
for correct application of pac-ret (pointer authentication on return
addresses).
2024-12-20 08:54:07 +00:00
Alexander Yermolovich
3c357a49d6
[BOLT] Add support for safe-icf (#116275)
Identical Code Folding (ICF) folds functions that are identical into one
function, and updates symbol addresses to the new address. This reduces
the size of a binary, but can lead to problems. For example when
function pointers are compared. This can be done either explicitly in
the code or generated IR by optimization passes like Indirect Call
Promotion (ICP). After ICF what used to be two different addresses
become the same address. This can lead to a different code path being
taken.

This is where safe ICF comes in. Linker (LLD) does it using address
significant section generated by clang. If symbol is in it, or an object
doesn't have this section symbols are not folded.

BOLT does not have the information regarding which objects do not have
this section, so can't re-use this mechanism.

This implementation scans code section and conservatively marks
functions symbols as unsafe. It treats symbols as unsafe if they are
used in non-control flow instruction. It also scans through the data
relocation sections and does the same for relocations that reference a
function symbol. The latter handles the case when function pointer is
stored in a local or global variable, etc. If a relocation address
points within a vtable these symbols are skipped.
2024-12-16 21:49:53 -08:00
Alexander Yermolovich
0a7e048667
[BOLT][DWARF][NFC] Minimize dwarf5-debug-names-gnu-push-tls-address.s (#120103)
Removed unnecessary parts from the .text section.
2024-12-16 09:54:00 -08:00
Nicholas
671095b452
[BOLT][AArch64] Check Last Element Instead of Returning nullptr in lookupStubFromGroup (#114015)
The current implementation of `lookupStubFromGroup` is incorrect. The
function is intended to find and return the closest stub using
`lower_bound`, which identifies the first element in a sorted range that
is not less than a specified value. However, if such an element is not
found within `Candidates` and the list is not empty, the function
returns `nullptr`. Instead, it should check whether the last element
satisfies the condition.
2024-12-16 12:14:11 +00:00
Maksim Panchenko
f86f4574bb
[BOLT][Linux] Fix static keys test case (#119771)
The key address in the static keys jump table was incorrectly encoded as
an absolute value instead of PC-relative causing incorrect
interpretation of the "likely" property of the key.
2024-12-15 17:13:04 -08:00
Amir Ayupov
8652608404
[BOLT] Fix counts aggregation in merge-fdata (#119652)
merge-fdata used to consider misprediction count as part of "signature",
or the aggregation key. This prevented it from collapsing profile lines
with different misprediction counts, which resulted in duplicate
`(from, to)` pairs with different misprediction and execution counts.

Fix that by splitting out misprediction count and accumulating it
separately.

Test Plan: updated bolt/test/merge-fdata-lbr-mode.test
2024-12-14 22:38:24 -08:00
Amir Ayupov
97f43364cc
[BOLT][NFC] Speedup merge-fdata (#119942)
Eliminate splitting the buffer into lines, and use `std::getline`
directly. Simplify no_lbr and boltedcollection handling as well.

Test Plan: For a large fdata file (200MB), fstream version is ~10%
faster.
2024-12-14 22:26:20 -08:00
Alexander Yermolovich
331c2dd8b4
[BOLT][DWARF] Add support for DW_OP_GNU_push_tls_address to .debug_names (#119939)
Added support to BOLT for DW_OP_GNU_push_tls_address. So now
DW_TAG_variable with this OP in DW_AT_location will appear in debug
names acceleration table. Although not in the DWARF 5 spec it is similar
to DW_OP_form_tls_address. Without this support llvm-dwarfdump --verify
--debug-names will report errors.
2024-12-14 09:30:25 -08:00
Tibor Dusnoki
5225f1b435
[BOLT][merge-fdata] Fix basic sample profile aggregation without LBR info (#118481)
When a basic sample profile is gathered without LBR info, the generated
profile contains a "no-lbr" tag in the first line of the fdata file.
This PR fixes merge-fdata to recognize and save this tag to the output
file.
2024-12-13 16:28:37 +00:00
Aiden Grossman
2bf3ef1847
[BOLT] Require non root user for unreadable-profile.test (#119816)
This patch adds a requirement for a non root user in
unreadable-profile.test. This test fails if run as a root user (like in
a container without explicitly changing the user), which can lead to
some CI test failures.
2024-12-12 22:14:41 -08:00
Kristof Beyls
ceb7214be0
[BOLT] Introduce binary analysis tool based on BOLT (#115330)
This initial commit does not add any specific binary analyses yet, it
merely contains the boilerplate to introduce a new BOLT-based tool.

This basically combines the 4 first patches from the prototype pac-ret
and stack-clash binary analyzer discussed in RFC
https://discourse.llvm.org/t/rfc-bolt-based-binary-analysis-tool-to-verify-correctness-of-security-hardening/78148
and published at
https://github.com/llvm/llvm-project/compare/main...kbeyls:llvm-project:bolt-gadget-scanner-prototype

The introduction of such a BOLT-based binary analysis tool was proposed
and discussed in at least the following places:
- The RFC pointed to above
- EuroLLVM 2024 round table
https://discourse.llvm.org/t/summary-of-bolt-as-a-binary-analysis-tool-round-table-at-eurollvm/78441
The round table showed quite a few people interested in being able to
build a custom binary analysis quickly with a tool like this.
- Also at the US LLVM dev meeting a few weeks ago, I heard interest from
a few people, asking when the tool would be available upstream.
- The presentation "Adding Pointer Authentication ABI support for your
ELF platform"
(https://llvm.swoogo.com/2024devmtg/session/2512720/adding-pointer-authentication-abi-support-for-your-elf-platform)
explicitly mentioned interest to extend the prototype tool to verify
correct implementation of pauthabi.
2024-12-12 10:06:27 +00:00
Alexander Yermolovich
4b825c7417
[BOLT][DWARF] Add support for transitive DW_AT_name/DW_AT_linkage_name resolution for DW_AT_name/DW_AT_linkage_name. (#119493)
This fix handles a case where a DIE that does not have
DW_AT_name/DW_AT_linkage_name, but has a reference to another DIE using
DW_AT_abstract_origin/DW_AT_specification. It also fixes a bug where
there are cross CU references for those attributes. Previously it would
use a DWARF Unit of a DIE which was being processed The
warf5-debug-names-cross-cu.s test just happened to work because how it
was constructed where string section was shared by both DWARF Units.

To resolve DW_AT_name/DW_AT_linkage_name this patch iterates over
references until it either reaches the final DIE or finds both of those
names.
2024-12-11 14:27:56 -08:00
Peter Waller
14dcf8214f
[BOLT] Add --pad-funcs-before=func:n (#117924)
This complements --pad-funcs, and by using both simultaneously, enables
moving a specific function through the address space without modifying
any code
other than the targeted function (and references to it) by doing
(before+after=constant).

See also: proposed functionality to enable inserting random padding in

https://discourse.llvm.org/t/rfc-lld-feature-for-controlling-for-code-size-dependent-measurement-bias
and https://github.com/llvm/llvm-project/pull/117653
2024-12-11 09:58:52 +00:00
Alexander Yermolovich
50c0e679b9
[BOLT][DWARF] Add support for DW_TAG_union_type to DebugNames. (#119023)
Adding support for DW_TAG_union_type for DebugNames acceleration tables.
2024-12-06 15:45:52 -08:00
Maksim Panchenko
d5956fb8f9
[BOLT][AArch64] Add support for short LLD thunks/veneers (#118422)
When a callee function is closer than 256MB from its call site, LLD
linker can strategically create a short thunk for the function with a
single branch instruction (that covers +/-128MB). Detect and convert
such thunks into direct calls in BOLT.
2024-12-03 13:44:51 -08:00
Peter Waller
b5ed375f9d
[BOLT] Skip _init; avoiding GOT breakage for static binaries (#117751)
_init is used during startup of binaires. Unfortunately, its
address can be shared (at least on AArch64 glibc static binaries) with a
data
reference that lives in the GOT. The GOT rewriting is currently unable
to distinguish between data addresses and function addresses. This leads
to the data address being incorrectly rewritten, causing a crash on
startup of the binary:

  Unexpected reloc type in static binary.

To avoid this, don't consider _init for being moved, by skipping it.

~We could add further conditions to narrow the skipped case for known
crashes, but as a straw man I thought it'd be best to keep the condition
as simple as possible and see if there any objections to this.~
(Edit: this broke the test
bolt/test/runtime/X86/retpoline-synthetic.test,
because _init was skipped from the retpoline pass and it has an indirect
call in it, so I include a check for static binaries now, which avoids
the test failure,
but perhaps this could/should be narrowed further?)

For now, skip _init for static binaries on any architecture; we could
add further conditions to narrow the skipped case for known crashes, but
as a straw man I thought it'd be best to keep the condition as simple as
possible and see if there any objections to this.

Updates #100096.
2024-11-28 14:59:07 +00:00
Raul Tambre
003b48e0cb
[BOLT][test] enable GNU extensions, use C++ compiler, remove unnecessary target (#117043)
1. With a Clang that doesn't default to GNU extensions they need to be enabled explicitly.
2. The X86 directory lit config sets it already, there's no reason for this test to do it by itself.
3. The C frontend executable will fail if there's for example a Clang resource file for the C++ mode that sets C++-specific options:
```
+ /home/tambre/dev/llvm/build/bin/clang --target=x86_64-unknown-linux-gnu -fPIE -fuse-ld=lld -Wl,--unresolved-symbols=ignore-all -pie -fPIC -shared /home/tambre/dev/llvm/bolt/test/R_ABS.pic.lld.cpp -o /home/tambre/dev/llvm/build/tools/bolt/test/Output/R_ABS.pic.lld.cpp.tmp.so -Wl,-q -fuse-ld=lld
clang: warning: argument unused during compilation: '-pie' [-Wunused-command-line-argument]
error: invalid argument '-std=c23' not allowed with 'C++'
```
2024-11-27 00:14:00 +02:00
Maksim Panchenko
92301180f7
[BOLT] Use compact EH format for fixed-address executables (#117274)
Use ULEB128 format for emitting LSDAs for fixed-address executables,
similar to what we use for PIEs/DSOs. Main difference is that we don't
use landing pad trampolines when landing pads are not contained in a
single fragment. Instead, we fallback to emitting larger fixed-address
LSDAs, which is still better than adding trampoline instructions.
2024-11-22 00:28:55 -08:00
Maksim Panchenko
105ecd8bb2
[BOLT] Avoid EH trampolines for PIEs/DSOs (#117106)
We used to emit EH trampolines for PIE/DSO whenever a function fragment
contained a landing pad outside of it. However, it is common to have all
landing pads in a cold fragment even when their throwers are in a hot
one.

To reduce the number of trampolines, analyze landing pads for any given
function fragment, and if they all belong to the same (possibly
different) fragment, designate that fragment as a landing pad fragment
for the "thrower" fragment. Later, emit landing pad fragment symbol as
an LPStart for the thrower LSDA.
2024-11-21 18:18:30 -08:00
Maksim Panchenko
996553228f
[BOLT] Overwrite .eh_frame and .gcc_except_table (#116755)
Under --use-old-text or --strict, we completely rewrite contents of EH
frames and exception tables sections. If new contents of either section
do not exceed the size of the original section, rewrite the section
in-place.
2024-11-19 12:59:05 -08:00
Maksim Panchenko
08ef939637
[BOLT] Overwrite .eh_frame_hdr in-place (#116730)
If the new EH frame header can fit into the original .eh_frame_hdr
section, overwrite it in-place and pad with zeroes.
2024-11-18 20:42:38 -08:00
Maksim Panchenko
be89e794f7
[BOLT][AArch64] Add support for long absolute LLD thunks/veneers (#113408)
Absolute thunks generated by LLD reference function addresses recorded
as data in code. Since they are generated by the linker, they don't have
relocations associated with them and thus the addresses are left
undetected. Use pattern matching to detect such thunks and handle them
in VeneerElimination pass.
2024-11-12 11:27:08 -08:00
Shaw Young
9a9af0a23f
[BOLT] Match blocks with pseudo probes (#99891)
Match inline trees first between profile and the binary: by GUID,
checksum, parent, and inline site for inlined functions. Map profile
probes to binary probes via matched inline tree nodes. Each binary probe
has an associated binary basic block. If all probes from one profile
basic block map to the same binary basic block, it’s an exact match,
otherwise the block is determined by majority vote and reported as loose
match.

Pseudo probe matching happens between exact hash matching and call/loose
matching.

Introduce ProbeMatchSpec - a mechanism to match probes belonging to
another binary function. For example, given functions foo and bar:
```
void foo() {
  bar();
}
```
profiled binary: bar is not inlined => have top-level function bar
new binary where the profile is applied to: bar is inlined into foo.

Currently, BOLT does 1:1 matching between profile functions and binary
functions based on the name. #100446 will extend this to N:M where
multiple profiles can be matched to one binary function (as in the
example above where binary function foo would use profiles for foo and
bar), and one profile can be matched to multiple binary functions (e.g.
if bar was inlined into multiple functions).

In this diff, ProbeMatchSpecs would only have one BinaryFunctionProfile
(existing name-based matching). 

Test Plan: Added match-blocks-with-pseudo-probes.test

Performance test:
- Setup:
  - Baseline no-BOLT: Clang with pseudo probes, ThinLTO + CSSPGO
  (#79942)
  - BOLT fresh: BOLTed Clang using fresh profile,
  - BOLT stale (hash): BOLTed Clang using stale profile (collected on
    Clang 10K commits back), `-infer-stale-profile` (hash+call block
    matching)
  - BOLT stale (+probe): BOLTed Clang using stale profile,
    `-infer-stale-profile` with `-stale-matching-with-pseudo-probes`
    (hash+call+pseudo probe block matching)
  - 2S Intel SKX Xeon 6138 with 40C/80T and 256GB RAM, using 20C/40T for
    build,
  - BOLT profiles are collected on Clang compiling large preprocessed
    C++ file.
- Benchmark: building Clang (average of 5 runs), see driver in
  aaupov/llvm-devmtg-2022
- Results, wall time, lower is better:
  - Baseline no-BOLT: 429.52 +- 2.61s,
  - BOLT stale (hash): 413.21 +- 2.19s,
  - BOLT stale (+probe): 409.69 +- 1.41s,
  - BOLT fresh: 384.50 +- 1.80s.

---------

Co-authored-by: Amir Ayupov <aaupov@fb.com>
2024-11-12 07:21:03 -08:00
Amir Ayupov
74e6478f81
[BOLT] Set call to continuation count in pre-aggregated profile
#109683 identified an issue with pre-aggregated profile where a call to
continuation fallthrough edge count is missing (profile discontinuity).

This issue only affects pre-aggregated profile but not perf data since
LBR stack has the necessary information to determine if the trace (fall-
through) starts at call continuation, whereas pre-aggregated fallthrough
lacks this information.

The solution is to look at branch records in pre-aggregated profiles
that correspond to returns and assign counts to call to continuation
fallthrough:
- BranchFrom is in another function or DSO,
- BranchTo may be a call continuation site:
  - not an entry point/landing pad.

Note that we can't directly check if BranchFrom corresponds to a return
instruction if it's in external DSO.

Keep call continuation handling for perf data (`getFallthroughsInTrace`)
[1] as-is due to marginally better performance. The difference is that
return-converted call to continuation fallthrough is slightly more
frequent than other fallthroughs since the former only requires one LBR
address while the latter need two that belong to the profiled binary.
Hence return-converted fallthroughs have larger "weight" which affects
code layout.

[1] `DataAggregator::getFallthroughsInTrace`
fea18afeed/bolt/lib/Profile/DataAggregator.cpp (L906-L915)

Test Plan: added callcont-fallthru.s

Reviewers: maksfb, ayermolo, ShatianWang, dcci

Reviewed By: maksfb, ShatianWang

Pull Request: https://github.com/llvm/llvm-project/pull/109486
2024-11-07 16:20:19 -08:00
Maksim Panchenko
49ee6069db
[BOLT][AArch64] Add support for compact code model (#112110)
Add `--compact-code-model` option that executes alternative branch
relaxation with an assumption that the resulting binary has less than
128MB of code. The relaxation is done in `relaxLocalBranches()`, which
operates on a function level and executes on multiple functions in
parallel.

Running the new option on AArch64 Clang binary produces slightly smaller
code and the relaxation finishes in about 1/10th of the time.

Note that the new `.text` has to be smaller than 128MB, *and* `.plt` has
to be closer than 128MB to `.text`.
2024-11-07 14:51:12 -08:00
Jacob Bramley
16cd5cdf4d
[BOLT] Ignore AArch64 markers outside their sections. (#74106)
AArch64 uses $d and $x symbols to delimit data embedded in code.
However, sometimes we see $d symbols, typically in .eh_frame, with
addresses that belong to different sections. These occasionally fall
inside .text functions and cause BOLT to stop disassembling, which in
turn causes DWARF CFA processing to fail.

As a workaround, we just ignore symbols with addresses outside the
section they belong to. This behaviour is consistent with objdump and
similar tools.
2024-11-07 15:16:14 +03:00
Amir Ayupov
cafd3e10c3
[BOLT][test] Fix NFC check with pre-aggregated-perf.test (#113944)
NFC checks have been failing starting with
https://lab.llvm.org/buildbot/#/builders/92/builds/8567.

NFC testing wrapper (llvm-bolt-wrapper) replaces the call of `perf2bolt`
with `llvm-bolt --aggregate-only --ignore-build-id`.

`show-density` is automatically enabled for perf2bolt only but not for
`llvm-bolt --aggregate-only`. Add the flag to the test to work around
the issue.

Test Plan:
```
cd build
../llvm-project/bolt/utils/nfc-check-setup.py --switch-back --verbose
bin/llvm-lit -a tools/bolt/test/X86/pre-aggregated-perf.test
```
2024-10-28 11:30:30 -07:00
Amir Ayupov
6ee5ff95ab
[BOLT] Add profile density computation
Reuse the definition of profile density from llvm-profgen (#92144):
- the density is computed in perf2bolt using raw samples (perf.data or
  pre-aggregated data),
- function density is the ratio of dynamically executed function bytes
  to the static function size in bytes,
- profile density:
  - functions are sorted by density in decreasing order, accumulating
    their respective sample counts,
  - profile density is the smallest density covering 99% of total sample
    count.

In other words, BOLT binary profile density is the minimum amount of
profile information per function (excluding functions in tail 1% sample
count) which is sufficient to optimize the binary well.

The density threshold of 60 was determined through experiments with
large binaries by reducing the sample count and checking resulting
profile density and performance. The threshold is conservative.

perf2bolt would print the warning if the density is below the threshold
and suggest to increase the sampling duration and/or frequency to reach
a given density, e.g.:
```
BOLT-WARNING: BOLT is estimated to optimize better with 2.8x more samples.
```

Test Plan: updated pre-aggregated-perf.test

Reviewers: maksfb, wlei-llvm, rafaelauler, ayermolo, dcci, WenleiHe

Reviewed By: WenleiHe, wlei-llvm

Pull Request: https://github.com/llvm/llvm-project/pull/101094
2024-10-24 18:30:59 -07:00
Paschalis Mpeis
cb9bacf57d
[AArch64][BOLT] Ensure tentative code layout for cold BBs runs. (#96609)
When split functions is used, BOLT may skip tentative code layout
estimation in some cases, like:
- when there is no profile data for some blocks (ie cold blocks)
- when there are cold functions in lite mode
- when skip functions is used
     
However, when rewriting the binary we still need to compute PC-relative
distances between hot and cold basic blocks. Without cold layout
estimation, BOLT uses '0x0' as the address of the first cold block,
leading to incorrect estimations of any PC-relative addresses.
 
This affects large binaries as the relaxStub method expands more
branches than necessary using the short-jump sequence, at it wrongly
believes it has exceeded the branch distance boundary.
 
This increases code size with both a larger and slower sequence;
however,
performance regression is expected to be minimal since this only affects
any called cold code.
 
Example of such an unnecessary relaxation:
from:
```armasm
b       .Ltmp1234
```
 
to:
```armasm
adrp    x16, .Ltmp1234
add     x16, x16, :lo12:.Ltmp1234
br      x16
```
2024-10-17 08:59:05 +01:00
Maksim Panchenko
0e86e5214c
[BOLT][AArch64] Reduce the number of ADR relaxations (#111577)
If ADR instruction references the same function, we can skip relaxation
even if the function is split but ADR is in the main fragment.
2024-10-08 16:15:00 -07:00
ShatianWang
4cab01f072
[BOLT] Profile quality stats -- CFG discontinuity (#109683)
In a perfect profile, each positive-execution-count block in the
function’s CFG should be reachable from a positive-execution-count
function entry block through a positive-execution-count path. This new
pass checks how well the BOLT input profile satisfies this “CFG
continuity” property.

More specifically, for each of the hottest 1000 functions, the pass
calculates the function’s fraction of basic block execution counts that
is “unreachable”. It then reports the 95th percentile of the
distribution of the 1000 unreachable fractions in a single BOLT-INFO
line. The smaller the reported value is, the better the BOLT profile
satisfies the CFG continuity property.

The default value of 1000 above can be changed via the hidden BOLT
option `-num-functions-for-continuity-check=[N]`. If more detailed stats
are needed, `-v=1` can be added to the BOLT invocation: the hottest N
functions will be grouped into 5 equally-sized buckets, from the hottest
to the coldest; for each bucket, various summary statistics of the
distribution of the fractions and the raw unreachable execution counts
will be reported.
2024-10-08 19:07:43 -04:00
Tex Riddell
e237d8aac8
[BOLT] Fix tests broken by abe0dd1 (#110071)
abe0dd195a3b2630afdc5c1c233eb2a068b2d72f (#109553) changed default
llvm-objdump output for consecutive zeros.

This broke two tests:
BOLT :: AArch64/constant_island_pie_update.s
BOLT :: AArch64/update-weak-reference-symbol.s

This fixes the test failures by adding -z to llvm-objdump in RUN line.
2024-09-25 19:34:57 -07:00
Maksim Panchenko
4db0cc4c55
[BOLT] Allow sections in --print-only flag (#109622)
While printing functions, expand --print-only flag to accept section
names. E.g., "--print-only=\.init" will only print functions from
".init" section.
2024-09-25 23:44:06 +02:00
Maksim Panchenko
6fb39ac77b
[BOLT][merge-fdata] Initialize YAML profile header (#109613)
While merging profiles, some fields in the input header, e.g.
HashFunction, could be uninitialized leading to a UMR. Initialize merged
header with the first input header.

Fixes #109592
2024-09-25 23:18:34 +02:00
Amir Ayupov
300051159b [BOLT][test] Update log.test and perf_test
Address noisy tests by:
- perf_test: bumping sampling frequency to maximum,
- log.test: matching Binary Function "main"
2024-09-23 15:47:19 -07:00
sinan
31ac3d092b
[BOLT] Add .iplt support to x86 (#106513)
Add X86 support for parsing .iplt section and symbols.
2024-09-23 18:22:43 +08:00
Tom Stellard
773353b20a
[bolt][tests] Skip tests that use perf when perf counters are unavailable (#107892)
On the GitHub Action runners, perf always fails with the error below ,
so we need to skip the perf tests on platforms like this that have
limited access to the perf counters.

```
Access to performance monitoring and observability operations is limited.
Consider adjusting /proc/sys/kernel/perf_event_paranoid setting to open
access to performance monitoring and observability operations for processes
without CAP_PERFMON, CAP_SYS_PTRACE or CAP_SYS_ADMIN Linux capability.
More information can be found at 'Perf events and tool security' document:
https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html
perf_event_paranoid setting is 4:
  -1: Allow use of (almost) all events by all users
      Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK
>= 0: Disallow raw and ftrace function tracepoint access
>= 1: Disallow CPU event access
>= 2: Disallow kernel profiling
To make the adjusted perf_event_paranoid setting permanent preserve it
in /etc/sysctl.conf (e.g. kernel.perf_event_paranoid = <setting>)
```
2024-09-17 17:07:35 -07:00
Nikita Popov
827dd1ef2f
[Bolt] Explicitly request PIE in tests (#108818)
When clang is built with `-DCLANG_DEFAULT_PIE_ON_LINUX=OFF`, a number of
bolt tests fail:

    BOLT :: AArch64/build_id.c
    BOLT :: AArch64/plt-call.test
BOLT :: X86/dwarf5-dwarf4-types-backward-forward-cross-reference.test
    BOLT :: X86/dwarf5-locexpr-referrence.test
    BOLT :: X86/internal-call-instrument.s
    BOLT :: X86/linux-static-keys.s
    BOLT :: X86/plt-call.test

Avoid this by explicitly adding `-fPIE` and `-pie` to the default flags
in tests, so we don't depend on the clang-side default.
2024-09-17 08:58:49 +02:00
Amir Ayupov
c00c62c113
[BOLT] Add pseudo probe inline tree to YAML profile
Add probe inline tree information to YAML profile, at function level:
- function GUID,
- checksum,
- parent node id,
- call site in the parent.

This information is used for pseudo probe block matching (#99891).

The encoding adds/changes probe information in multiple levels of
YAML profile:
- BinaryProfile: add pseudo_probe_desc with GUIDs and Hashes, which
  permits deduplication of data:
  - many GUIDs are duplicate as the same callee is commonly inlined
    into multiple callers,
  - hashes are also very repetitive, especially for functions with
    low block counts.
- FunctionProfile: add inline tree (see above). Top-level function
  is included as root of function inline tree, which makes guid and
  pseudo_probe_desc_hash fields redundant.
- BlockProfile: densely-encoded block probe information:
  - probes reference their containing inline tree node,
  - separate lists for block, call, indirect call probes,
  - block probe encoding is specialized: ids are encoded as bitset
    in uint64_t. If only block probe with id=1 is present, it's
    encoded as implicit entry (id=0, omitted).
  - inline tree nodes with identical probes share probe description
    where node indices are combined into a list.

On top of #107970, profile with new probe encoding has the following
characteristics (profile for a large binary):

- Profile without probe information: 33MB, 3.8MB compressed (baseline).
- Profile with inline tree information: 92MB, 14MB compressed.

Profile processing time (YAML parsing, inference, attaching steps):
- profile without pseudo probes: 5s,
- profile with pseudo probes, without pseudo probe matching: 11s,
- with pseudo probe matching: 12.5s.

Test Plan: updated pseudoprobe-decoding-inline.test

Reviewers: wlei-llvm, ayermolo, rafaelauler, dcci, maksfb

Reviewed By: wlei-llvm, rafaelauler

Pull Request: https://github.com/llvm/llvm-project/pull/107137
2024-09-12 20:51:35 -07:00
Amir Ayupov
c820bd3e33
[BOLT][NFC] Rename profile-use-pseudo-probes
The flag currently controls writing of probe information in YAML
profile. #99891 adds a separate flag to use probe information for stale
profile matching. Thus `profile-use-pseudo-probes` becomes a misnomer
and `profile-write-pseudo-probes` better captures the intent.

Reviewers: maksfb, WenleiHe, ayermolo, rafaelauler, dcci

Reviewed By: rafaelauler

Pull Request: https://github.com/llvm/llvm-project/pull/106364
2024-09-11 16:27:33 -07:00
Amir Ayupov
15fa3ba547
[BOLT][YAML] Allow unknown keys in the input (#100824)
This ensures forward compatibility, where old BOLT versions can consume
the profile created by newer versions with extra keys.

Test Plan: added yaml-unknown-keys.test
2024-09-03 11:27:57 -07:00
Maksim Panchenko
abd69b3653
[BOLT] Handle internal calls in ValidateInternalCalls (#105736)
Move handling of all internal calls into the designated pass. Preserve
NOPs and mark functions as non-simple on non-X86 platforms.
2024-08-27 11:31:32 -07:00
Harini0924
7f3793207b
[BOLT][test] Removed the use of parentheses in BOLT tests with lit internal shell (#105720)
This patch addresses compatibility issues with the lit internal shell by
removing the use of subshell execution (parentheses and subshell syntax)
in the `BOLT` tests. The lit internal shell does not support
parentheses, so the tests have been refactored to use separate command
invocations, with outputs redirected to temporary files where necessary.

This change is relevant for enabling the lit internal shell by default,
as outlined in [[RFC] Enabling the Lit Internal Shell by
Default](https://discourse.llvm.org/t/rfc-enabling-the-lit-internal-shell-by-default/80179)

fixes: #102401
2024-08-23 08:20:11 -07:00
ShatianWang
cbd302410e
[BOLT] Improve BinaryFunction::inferFallThroughCounts() (#105450)
This PR improves how basic block execution count is updated when using
the BOLT option `-infer-fall-throughs`. Previously, if a 0-count
fall-through edge is assigned a positive inferred count N, then the
successor block's execution count will be incremented by N. Since the
successor's execution count is calculated using information besides
inflow sum (such as outflow sum), it likely is already correct, and
incrementing it by an additional N would be wrong. This PR improves how
the successor's execution count is updated by using the max over its
current count and N.
2024-08-21 00:35:07 -04:00
Harini0924
4f5d866af7
[llvm-lit] Add REQUIRES: shell to BOLT permission test for lit internal shell (#103012)
This patch adds the `REQUIRES: shell` directive to the BOLT permission
test to ensure it only runs in environments with a full-featured
Unix-like shell. This change is necessary because the test relies on
advanced shell capabilities that are not supported by lit's internal
shell.

**Reasoning:** The BOLT permission test uses features like running
commands in the background with `&`, performing arithmetic operations,
and handling special number formats (octal). These features require a
more capable shell than what lit's internal shell provides. Without a
proper shell, the test could fail or behave unpredictably.

This change is relevant for enabling the lit internal shell by default,
as outlined in [[RFC] Enabling the Lit Internal Shell by
Default](https://discourse.llvm.org/t/rfc-enabling-the-lit-internal-shell-by-default/80179)
2024-08-13 19:58:59 -07:00
Connie
887f7002b6
[NFC][bolt][test] Change '|&' to '2>&1 |' for lit internal shell support (#102402)
This patches changes all references to '|&' in bolt tests to instead use
the '2>&1 |' syntax for better consistency across testing and so that
lit's internal shell can be used to run these tests. This addresses a
suggestion made in the comments of this RFC:
https://discourse.llvm.org/t/rfc-enabling-the-lit-internal-shell-by-default/80179.

Fixes https://github.com/llvm/llvm-project/issues/102388
2024-08-12 17:18:17 -07:00
Sayhaan Siddiqui
6aad62cf5b
[BOLT][DWARF] Add parallelization for processing of DWO debug information (#100282)
Enables parallelization for the processing of DWO CUs.
2024-08-08 16:41:51 -07:00
Davide Italiano
e49549ff19 Revert "[BOLT] Abort on out-of-section symbols in GOT (#100801)"
This reverts commit a4900f0d936f0e86bbd04bd9de4291e1795f1768.
2024-08-07 20:52:19 -07:00
Vladislav Khmelevsky
a4900f0d93
[BOLT] Abort on out-of-section symbols in GOT (#100801)
This patch aborts BOLT execution if it finds out-of-section (section
end) symbol in GOT table. In order to handle such situations properly in
future, we would need to have an arch-dependent way to analyze
relocations or its sequences, e.g., for ARM it would probably be ADRP +
LDR analysis in order to get GOT entry address. Currently, it is also
challenging because GOT-related relocation symbols are replaced to
__BOLT_got_zero. Anyway, it seems to be quite a rare case, which seems
to be only? related to static binaries. For the most part, it seems that
it should be handled on the linker stage, since static binary should not
have GOT table at all. LLD linker with relaxations enabled would replace
instruction addresses from GOT directly to target symbols, which
eliminates the problem.

Anyway, in order to achieve detection of such cases, this patch fixes a
few things in BOLT:
1. For the end symbols, we're now using the section provided by ELF
binary. Previously it would be tied with a wrong section found by symbol
address.
2. The end symbols would have limited registration we would only
add them in name->data GlobalSymbols map, since using address->data
BinaryDataMap map would likely be impossible due to address duality of
such symbols.
3. The outdated BD->getSection (currently returning refence, not
pointer) check in postProcessSymbolTable is replaced by getSize check in
order to allow zero-sized top-level symbols if they are located in
zero-sized sections. For the most part, such things could only be found
in tests, but I don't see a reason not to handle such cases.
4. Updated section-end-sym test and removed x86_64 requirement since
there is no reason for this (tested on aarch64 linux)

The test was provided by peterwaller-arm (thank you) in #100096 and
slightly modified by me.
2024-08-07 16:26:12 +04:00