After porting BOLT to RISCV some of the relocations were broken on both
AArch64 and X86.
On AArch64 the example of broken relocations would be GOT, during
handling them, we should replace the symbol to __BOLT_got_zero in order
to address GOT entry, not the symbol that addresses this entry. This is
done further in code, so it is too early to add rel here.
On X86 it is a mistake to add relocations without addend. This is the
exact problem that is raised on #97937. Due to different code generation
I had to use gcc-generated yaml test, since with clang I wasn't able to
reproduce problem.
Added tests for both architectures and made the problematic condition
riscV-specific.
Take a common weak reference pattern for example
```
__attribute__((weak)) void undef_weak_fun();
if (&undef_weak_fun)
undef_weak_fun();
```
In this case, an undefined weak symbol `undef_weak_fun` has an address
of zero, and Bolt incorrectly changes the relocation for the
corresponding symbol to symbol@PLT, leading to incorrect runtime
behavior.
When BOLT is run in AggregateOnly mode (perf2bolt), it exits with code
zero so destructors are not run thus TimerGroup never prints the timers.
Add explicit printing just before the exit to honor options requesting
timers (`--time-rewrite`, `--time-aggr`).
Test Plan: updated bolt/test/timers.c
Reviewers: ayermolo, maksfb, rafaelauler, dcci
Reviewed By: dcci
Pull Request: https://github.com/llvm/llvm-project/pull/101270
Continue from #87196 as author did not have much time, I have taken over
working on this PR. We would like to have this so it'll be easier to
package for Nix.
Can be tested by copying cmake, bolt, third-party, and llvm directories
out into their own directory with this PR applied and then build bolt.
---------
Co-authored-by: pca006132 <john.lck40@gmail.com>
Multi-way splitting can cause multiple fragments to access the same jump
table. Relax the assumption that a jump table can only have up to two
parents.
Test Plan: added bolt/test/X86/three-way-split-jt.s
Reviewers: ayermolo, dcci, rafaelauler, maksfb
Reviewed By: rafaelauler, dcci
Pull Request: https://github.com/llvm/llvm-project/pull/99988
from PEP8
(https://peps.python.org/pep-0008/#programming-recommendations):
> Comparisons to singletons like None should always be done with is or
is not, never the equality operators.
Co-authored-by: Eisuke Kawashima <e-kwsm@users.noreply.github.com>
Implemented call graph function matching. First, two call graphs are
constructed for both profiled and binary functions. Then functions are
hashed based on the names of their callee/caller functions. Finally,
functions are matched based on these neighbor hashes and the
longest common prefix of their names. The `match-with-call-graph`
flag turns this matching on.
Test Plan: Added match-with-call-graph.test. Matched 164 functions
in a large binary with 10171 profiled functions.
Read pseudo probes in regular and BAT YAML profile generation, and
attach them to YAML profile basic blocks. This exposes GUID, probe id,
and probe type in profile for future use in stale profile matching.
Test Plan: updated pseudoprobe-decoding-inline.test
Reviewers: dcci, rafaelauler, ayermolo, maksfb
Reviewed By: rafaelauler
Pull Request: https://github.com/llvm/llvm-project/pull/99554
Add a BinaryFunction field for pseudo probe function GUID.
Populate it during pseudo probe section parsing, and emit it in YAML
profile (both regular and BAT), along with function checksum.
To be used for stale function matching.
Test Plan: update pseudoprobe-decoding-inline.test
Detect and support fixed PIC indirect jumps of the following form:
```
movslq En(%rip), %r1
leaq PIC_JUMP_TABLE(%rip), %r2
addq %r2, %r1
jmpq *%r1
```
with PIC_JUMP_TABLE that looks like following:
```
JT: ----------
E1:| L1 - JT |
|----------|
E2:| L2 - JT |
|----------|
| |
......
En:| Ln - JT |
----------
```
The code could be produced by compilers, see
https://github.com/llvm/llvm-project/issues/91648.
Test Plan: updated jump-table-fixed-ref-pic.test
Reviewers: maksfb, ayermolo, dcci, rafaelauler
Reviewed By: rafaelauler
Pull Request: https://github.com/llvm/llvm-project/pull/91667
With aggressive ICF, it's possible to have different local symbols
(under different FILE symbols) to be mapped to the same address.
FileSymRefs only keeps a single SymbolRef per address, which prevents
fragment matching from finding the correct symbol to perform parent
function lookup.
Work around this issue by switching FileSymRefs to a multimap. In
future, uses of FileSymRefs can be replaced with SortedSymbols which
keeps essentially the same information.
Test Plan: added ambiguous_fragment.test
Reviewers: dcci, ayermolo, maksfb, rafaelauler
Reviewed By: rafaelauler
Pull Request: https://github.com/llvm/llvm-project/pull/98992
AArch64 needs this function when instrumenting statically-linked binaries.
Sample commands:
```bash
clang -Wl,-q test.c -static -o out
llvm-bolt -instrument -instrumentation-sleep-time=5 out -o out.instr
```
Added another hash level – call hash – following opcode hash matching
for stale block matching. Call hash strings are the concatenation of the
lexicographically ordered names of each blocks’ called functions. This
change bolsters block matching in cases where some instructions have
been removed or added but calls remain constant.
Test Plan: added match-functions-with-calls-as-anchors.test.
In three-way split functions, if only .warm fragment is present, BAT
incorrectly overwrites the map for .warm fragment by empty .cold
fragment.
Test Plan: updated register-fragments-bolt-symbols.s
BOLT only checks for the most common indirect branch pattern during the
branch analyzation.
Extended the logic with two other indirect patterns which slightly
differ from the expected one.
Those patterns may be hit when statically linking libc (pattern 2
requires 'lld' linker).
As a workaround mark them as UNKNOWN branch for now.
Fixes: #83114
There could be multiple TUs with the same hash in various DWO files. In
bigger binaries this could be in the thousands. Although they could be
structurally different and we need to output Entries for all of them,
for the purposes of figuring out a TU hash we only need one entry in
Foreign TU list.
Refactors legacy ranges writers to create a writer for each instance of
a DWO file.
We now write out everything into .debug_ranges after the all the DWO
files are processed. This also changes the order that ranges is written
out in, as before we wrote out while in the main CU processing loop and
we now iterate through the CU buckets created by partitionCUs, after the
main processing loop.
A mapping - from namespace to associated binary functions - is used to
match function profiles to binary based on the
'--name-similarity-function-matching-threshold' flag set edit distance
threshold. The flag is set to 0 (exact name matching) by default as it is
expensive, requiring the processing of all BFs.
Test Plan: Added name-similarity-function-matching.test. On a binary
with 5M functions, rewrite passes took ~520s without the flag and
~2018s with the flag set to 20.
This test from #74253 relies on particular hash values from Hashing.h.
The test fails in LLVM_ENABLE_ABI_BREAKING_CHECKS=on modes (#96282) or
whenever Hashing.h implementation changes.
Added flag '--match-profile-with-function-hash' to match functions
based on exact hash. After identical and LTO name matching, more
functions can be recovered for inference with exact hash, in the case
of function renaming with no functional changes. Collisions are
possible in the unlikely case where multiple functions share the same
exact hash. The flag is off by default as it requires the processing of
all binary functions and subsequently is expensive.
Test Plan: added hashing-based-function-matching.test.
Alternative instruction sequences in the Linux kernel can modify the
stack and thus they need their own ORC unwind entries. Since there's
only one ORC table, it has to be "shared" among multiple instruction
sequences. The kernel achieves this by putting a restriction on
instruction boundaries. If ORC state changes at a given IP, only one of
the alternative sequences can have an instruction starting/ending at
this IP. Then, developers can insert NOPs to guarantee the above
requirement is met.
The most common use of ORC with alternatives is "pushf; pop %rax"
sequence used for paravirtualization. Note that newer kernel versions
no longer use .parainstructions; instead, they utilize alternatives for
the same purpose.
Before we implement a better support for alternatives, we can safely
skip ORC entries associated with them.
Fixes#87052.
Alternative instructions in the Linux kernel may modify control flow in
a function. As such, it is unsafe to optimize functions with alternative
instructions until we properly support CFG alternatives.
Previously, we marked functions with alt instructions before the
emission, but that could be too late if we remove or replace
instructions with alternatives. We could have marked functions as
non-simple immediately after reading .altinstructions, but it's nice to
be able to view functions after CFG is built. Thus assign the non-simple
status after building CFG.
Summary: Constructing an artificial sink block for the
flow CFG in stale profile inference to allow profile
inference to be run on CFGs with blocks that terminate
and have successors.
Testing Plan: Added infer_no_exits.test to verify that
functions with exit blocks with a landing pad are
covered by stale profile inference.
---------
Co-authored-by: Amir Ayupov <fads93@gmail.com>
Summary: Functions with high discrepancy
(measured by matched function blocks)
can be ignored with an added command line
argument for better performance.
Test Plan: Added
stale-matching-min-matched-block.test
---------
Co-authored-by: Amir Ayupov <aaupov@fb.com>
`convertCallToIndirectCall` applies the PLTCall optimization and returns
an (updated if needed) iterator to the converted call instruction. Since
AArch64 requires to inject additional instructions to implement this
pass, the relevant BasicBlock and an iterator was passed to the
`convertCallToIndirectCall`.
`NumCallsOptimized` is updated only on successful application of the
pass.
Tests:
- Inputs/plt-tailcall.c: an example of a tail call optimized PLT call.
- AArch64/plt-call.test: it is the actual A64 test, that runs the
PLTCall optimization on the above input file and verifies the
application of the pass to the calls: 'printf' and 'puts'.
.altinstructions section contains a list of structures where fields can
have different sizes while other fields could be present or not
depending on the kernel version. Add automatic detection of such
variations and use it by default. The user can still overwrite the
automatic detection with `--alt-inst-has-padlen` and
`--alt-inst-feature-size` options.
Previously when an entry was skipped in parent chain a child will point
to the next valid entry in the chain. After discussion in
https://github.com/llvm/llvm-project/pull/91808 this is not very useful.
Changed implemenation so that all the children of the entry that is
skipped won't have DW_IDX_parent.
In ValidateMemRefs pass, when we validate references in the form of
`Symbol + Addend`, we should check `Symbol` not `Symbol + Addend`
against aliasing a jump table.
Recommitting with a modified test case:
https://github.com/llvm/llvm-project/pull/88838
Co-authored-by: sinan <sinan.lin@linux.alibaba.com>