In this patch I am adding the missing target hooks required for the
liveness analysis to run on AArch64. These are
- getFlagsReg()
- getRegsUsedAsParams()
- getDefaultLiveOut()
- getGPRegs()
- isCleanRegXOR()
I am also introducing the following API in LivenessAnalysis
- BitVector getLiveIn/Out(const MCInst &)
- MCPhysReg scavengeRegFromState(BitVector &)
My intention is to allow the LongJmp pass to scavenge usable registers
when injecting code.
When the compact-code-model is used, LongJmpPass::relaxLocalBranches
attempts to reverseBranchCondition without calling isReversibleBranch,
resulting in a runtime error. With this patch I am adding an additional
trampoline to handle irreversible FEAT_CMPBR branches.
In the future the plan is to use liveness analysis and replace the
irreversible branch with a compare followed by a branch (see #185731)
as long as the condition flags are dead, or emit the additional
trampoline otherwise.
Sample addresses belonging to external DSOs (buildid doesn't match the
current file) are treated as external (0).
Buildid for the main binary is expected to be omitted.
Test Plan:
added pre-aggregated-perf-buildid.test
Reviewers:
paschalis-mpeis, maksfb, yavtuk, ayermolo, yozhu, rafaelauler, yota9
Reviewed By: paschalis-mpeis
Pull Request: https://github.com/llvm/llvm-project/pull/186931
Create bolt/docs/profiles.md documenting all accepted profile formats:
perf.data, fdata, YAML, and pre-aggregated. Covers collection methods,
format syntax, examples, and known limitations.
Add reference from bolt/docs/index.rst.
Adding a generator to Perf2bolt is the initial step to support the
large end-to-end tests for Arm SPE. This functionality provides a unified
pre-parsed profile format that Perf2bolt is able to consume.
Why does the test need to have a textual format SPE profile?
* Collecting an Arm SPE profile with Linux Perf requires an Arm
development device with SPE support.
* Decoding SPE data also requires a suitable version of Linux Perf.
* The minimum required version of Linux Perf is v6.15.
To bypass these technical difficulties, it is easier to provide
a pre-generated textual profile.
The generator relies on the aggregator to spawn the required
perf-script jobs based on the aggregation type, and merges the
results of the perf-script jobs into a single file.
This hybrid profile will contain all required events such as BuildID,
MMAP, TASK, BRSTACK, or MEM event for the aggregation.
Below are two examples of how to generate pre-parsed perf data.
As input for Arm SPE aggregation:
`perf2bolt -p perf.data BINARY -o perf.text --spe
--generate-perf-script`
Or for basic aggregation:
`perf2bolt -p perf.data BINARY -o perf.text --ba --generate-perf-script`
Remove some unused code in BOLT:
- `RewriteInstance::linkRuntime` is declared but not defined
- `BranchContext` typedef is never used
- `FuncBranchData::getBranch` is defined but never used
- `FuncBranchData::getDirectCallBranch` is defined but never used
The assert condition (function is not split or split
into less than three fragments) is not always true now
that we will emit more local symbols due to #184074.
This commit enables compatibility of instrumentation-file-append-pid and
instrumentation-sleep-time options. It also requires keeping the
counters mapping between the watcher process and the instrumented binary
process in shared mode. This is useful when we instrument a shared
library that is used by several tasks running on the target system.
When we cannot wait for every task to complete, we must use the
sleep-time option. Without the append-pid option, profiles collected
from different tasks would overwrite each other at the same path,
leading to unexpected or suboptimal optimization effects.
Co-authored-by: Vasily Leonenko <vasily.leonenko@huawei.com>
Allow `--function-order` to be combined with `--reorder-functions`
algorithms. Functions listed in the order file are pinned first
(indices 0..N-1), then the selected algorithm orders remaining
functions starting at index N.
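The index assignment described above can be sketched as follows. This is a minimal illustration, not BOLT's implementation: `assignIndices`, its parameters, and the use of plain strings as function names are hypothetical, and `AlgorithmOrder` stands in for whatever order the selected `--reorder-functions` algorithm produces.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: maps a function name to its layout index.
// Functions from the order file are pinned first (indices 0..N-1); the
// remaining functions follow in the algorithm's order, starting at index N.
std::unordered_map<std::string, std::size_t>
assignIndices(const std::vector<std::string> &OrderFile,
              const std::vector<std::string> &AlgorithmOrder) {
  std::unordered_map<std::string, std::size_t> Index;
  std::size_t Next = 0;
  for (const std::string &Name : OrderFile)
    if (Index.emplace(Name, Next).second) // skip duplicate entries
      ++Next;
  for (const std::string &Name : AlgorithmOrder)
    if (Index.emplace(Name, Next).second) // pinned functions keep their index
      ++Next;
  assert(Index.size() == Next && "indices must be dense");
  return Index;
}
```

A function that appears in both inputs keeps its pinned index, so the algorithm's order only affects functions absent from the order file.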
Add separate options to enable each of the available gadget detectors.
Furthermore, add two meta-options enabling all PtrAuth scanners and all
available scanners of any type (which is only PtrAuth for now, though).
This commit renames the `pacret` option to `ptrauth-pac-ret` and
`pauth` to `ptrauth-all`.
Currently LongJmpPass::relaxLocalBranches bails early if the estimated
size of a binary function is less than 32KB assuming that the shortest
branches are 16 bits. Therefore the fixup value for the cold branch
target may go out of range if the function is larger than 1KB.
I am decreasing ShortestJumpSpan from 32KB to 1KB, since FEAT_CMPBR
branches are 11 bits.
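The arithmetic behind the two thresholds can be checked with a small sketch. It assumes the bit counts above refer to a signed byte offset; `jumpSpan` and `fitsInBranch` are hypothetical names, not BOLT functions.

```cpp
#include <cassert>
#include <cstdint>

// Distance in bytes reachable in one direction by a branch whose byte offset
// is a signed value of OffsetBits bits (assumption: offset already scaled).
constexpr std::uint64_t jumpSpan(unsigned OffsetBits) {
  return std::uint64_t{1} << (OffsetBits - 1);
}

// Hypothetical range check for a signed byte offset.
constexpr bool fitsInBranch(std::int64_t Offset, unsigned OffsetBits) {
  return Offset >= -static_cast<std::int64_t>(jumpSpan(OffsetBits)) &&
         Offset < static_cast<std::int64_t>(jumpSpan(OffsetBits));
}

static_assert(jumpSpan(16) == 32 * 1024, "16-bit offsets span 32KB");
static_assert(jumpSpan(11) == 1024, "11-bit offsets span 1KB");
```

Under this model, a function between 1KB and 32KB in size could contain an 11-bit branch whose target is out of range, which is the case the old 32KB early exit skipped.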
Launch this perf job with the others at the beginning of the aggregation
process.
Extracting buildid-list from perf data is not a costly process, so it
can be performed by default. This provides a distinct advantage when
this dataset is required in other perf2bolt stages as well.
Please see PR #171144.
Some binaries are built using `-gz=zstd`, but when using
`--update-debug-sections` on said binaries BOLT crashes.
This patch fixes this issue by recognising compressed debug sections in
binaries via their flag `SHF_COMPRESSED` and appropriately erroring out.
Legacy GNU-style compression is not handled.
The "private global" terminology, which likely came from
llvm/lib/IR/Mangler.cpp, is misleading: "private" is the opposite of
"global", and these prefixed symbols are not global in the object file
format sense (e.g. ELF has STB_GLOBAL while these symbols are always
STB_LOCAL). The term "internal symbol" better describes their purpose:
symbols for internal use by compilers and assemblers, not meant to be
visible externally.
This rename is a step toward adopting the "internal symbol prefix"
terminology agreed with GNU as
(https://sourceware.org/pipermail/binutils/2026-March/148448.html).
Use `%cxxflags`, so that `-fPIE -pie` get passed in order to ensure the
test behavior is the same regardless of cmake configuration. We do
the same in many other BOLT tests.
There are cases in which `getEntryIDForSymbol` is called, where the
given Symbol is in a constant island, and so BOLT cannot find its
function. This causes BOLT to reach `llvm_unreachable("symbol not
found")` and crash. This patch adds a check that avoids this crash.
BOLT currently strips all STT_NOTYPE STB_LOCAL zero-sized symbols
that fall inside function bodies. Certain such symbols are named
labels (loop markers and subroutine entry points) or local function
symbols in hand-written assembly. We now keep them in the local symbol
table of BOLT-processed binaries for better symbolication.
`BinaryFunction::translateInputToOutputAddress()` contains fallback
logic in case that querying `IOAddressMap` doesn't yield an output
address. Because this function could be called in scenarios where
`IOAddressMap` won't be set up, we should check if the map actually
exists before lookup.
Disable all stderr diagnostic output on Android since there is typically
no terminal on which to read diagnostic messages. The `noinline`
annotation keeps the inlining decisions the same before and after this
change. On AArch64 the `.text` section in the instr runtime library is
now ~4.8 KB smaller.
Use `GlobalWriteProfileMutex` to synchronize between data reset and
dump. Between static counter reset and increment, we use an atomic store
in the counter reset: the counter increment sequence inserted within user
code already takes care of thread safety, so we just need to make sure
the counter reset code is also thread safe (no torn write to a counter).
Similar to #132569 for RISC-V, replace the unofficial `@plt` and
`@gotpcrel` relocation specifiers, currently only used by clang
-fexperimental-relative-c++-abi-vtables, with `%pltpcrel` and
`%gotpcrel`. The syntax is not used in hand-written assembly code, and
is not supported by the GNU assembler.
Also replace the recent `@funcinit` with `%funcinit(x)`.
Checks that isReversibleBranch() returns false
- when the immediate value is 63 and needs +1 adjustment
- when the immediate value is 0 and needs -1 adjustment
Checks that reverseBranchCondition() adjusts
- the opcode
- the immediate operand if necessary (+/-1)
- the register operands if necessary (swap)
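The immediate adjustment these tests exercise can be sketched with a simplified model. Everything here is an assumption made for illustration: the `Cond` enum, the `CmpBranchImm` struct, and the 0..63 immediate range (inferred from the boundary cases listed above) are hypothetical and do not mirror the actual opcodes or the MCPlusBuilder interface. Register operand swapping is left out.

```cpp
#include <cassert>
#include <optional>

enum class Cond { EQ, NE, GT, LT }; // simplified, hypothetical condition set

struct CmpBranchImm {
  Cond CC;
  int Imm; // assumed encodable range: 0..63
};

constexpr int MinImm = 0;
constexpr int MaxImm = 63;

// Returns the reversed branch, or nothing if the adjusted immediate would
// leave the encodable range.
//   !(x > imm)  ==  x < imm + 1   (needs +1)
//   !(x < imm)  ==  x > imm - 1   (needs -1)
std::optional<CmpBranchImm> reverseCondition(const CmpBranchImm &B) {
  assert(B.Imm >= MinImm && B.Imm <= MaxImm && "immediate out of range");
  switch (B.CC) {
  case Cond::EQ:
    return CmpBranchImm{Cond::NE, B.Imm};
  case Cond::NE:
    return CmpBranchImm{Cond::EQ, B.Imm};
  case Cond::GT:
    if (B.Imm == MaxImm)
      return std::nullopt; // 63 + 1 does not fit
    return CmpBranchImm{Cond::LT, B.Imm + 1};
  case Cond::LT:
    if (B.Imm == MinImm)
      return std::nullopt; // 0 - 1 does not fit
    return CmpBranchImm{Cond::GT, B.Imm - 1};
  }
  return std::nullopt;
}

bool isReversible(const CmpBranchImm &B) {
  return reverseCondition(B).has_value();
}
```

In this model a caller would check `isReversible` first and fall back to another strategy, such as a trampoline, when it returns false.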
A raw profile data file may contain lines truncated by an unexpected
app exit. This change makes merge_fdata check the number of fields
in each line of the raw profile data file and ignore any line whose
field count is not the expected one.
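A minimal sketch of such a field-count filter, assuming whitespace-separated fields. `countFields`, `dropTruncatedLines`, and the `Expected` parameter are hypothetical names; the real check lives in merge_fdata and may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Counts whitespace-separated fields in one line.
std::size_t countFields(const std::string &Line) {
  std::istringstream SS(Line);
  std::string Field;
  std::size_t N = 0;
  while (SS >> Field)
    ++N;
  return N;
}

// Keeps only lines with exactly Expected fields; truncated lines that ended
// up with fewer fields are dropped.
std::vector<std::string>
dropTruncatedLines(const std::vector<std::string> &Lines,
                   std::size_t Expected) {
  assert(Expected > 0 && "a valid line has at least one field");
  std::vector<std::string> Kept;
  for (const std::string &Line : Lines)
    if (countFields(Line) == Expected)
      Kept.push_back(Line);
  return Kept;
}
```

A line cut in the middle of its last field still has the expected count, so a check like this catches only truncation that removes whole fields.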
When applying BTI fixups to indirect branch targets, ignored functions
are considered as a special case:
- these hold no instructions,
- have no CFG,
- and are not emitted in the new text section.
The solution is to patch the entry points in the original location.
If such a situation occurs in a binary, recompilation using the
-fpatchable-function-entry flag is required. This will place a nop at
all function starts, which BOLT can use to patch the original section.
Without the extra nop, BOLT cannot safely patch the original .text
section.
An alternative solution could be to also ignore the function from which
the stub starts. This has not been tried as the LongJmp pass, where most
stubs are inserted, is currently not equipped to ignore functions.
Testing: both the success and failure cases are covered with lit tests.
Insert new PT_LOAD segments right after the last existing PT_LOAD in the
program header table, instead of before PT_DYNAMIC or at the end. This
maintains the ascending p_vaddr order required by the ELF specification.
Previously, new segments could end up breaking PT_LOAD p_vaddr order
when PT_LOAD segments followed PT_DYNAMIC or PT_GNU_STACK. This led to
the runtime loader incorrectly assessing the dynamic object size and
silently corrupting memory.
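The placement rule can be illustrated with a toy program header table. This is a sketch under stated assumptions, not the rewriter's code: `Phdr` keeps only two fields, `insertAfterLastLoad` is a hypothetical helper, and the new segments are assumed to have higher addresses than every existing PT_LOAD. The numeric type values are the standard ELF ones.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::uint32_t kPtLoad = 1;              // PT_LOAD
constexpr std::uint32_t kPtDynamic = 2;           // PT_DYNAMIC
constexpr std::uint32_t kPtGnuStack = 0x6474e551; // PT_GNU_STACK

struct Phdr {
  std::uint32_t Type;
  std::uint64_t VAddr;
};

// Inserts NewLoads right after the last existing PT_LOAD entry, so that all
// PT_LOAD entries stay contiguous in ascending p_vaddr order. Falls back to
// the end of the table if it contains no PT_LOAD at all.
void insertAfterLastLoad(std::vector<Phdr> &Table,
                         const std::vector<Phdr> &NewLoads) {
  std::size_t Pos = Table.size();
  for (std::size_t I = 0; I < Table.size(); ++I)
    if (Table[I].Type == kPtLoad)
      Pos = I + 1;
  assert(Pos <= Table.size());
  Table.insert(Table.begin() + static_cast<std::ptrdiff_t>(Pos),
               NewLoads.begin(), NewLoads.end());
}
```

With a table where a PT_LOAD follows PT_DYNAMIC, inserting before PT_DYNAMIC would put the new, higher-address segment ahead of an existing lower-address PT_LOAD; inserting after the last PT_LOAD avoids that.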
Summary:
When the .bolt_reserved section is defined in the linker script, there's
no way to mark the containing segment executable other than via the
PHDRS command, which overrides program headers entirely and is therefore
impractical. Since .bolt_reserved contains executable code, mark the
segment executable in BOLT.
Test Plan: bolt-reserved.test
The instrument-ind-call test checks the correctness of the instrumented
snippet by the set of registers that are used. The call id value is
meaningless (platform dependent) and should be excluded from the test.
The Armv9.6-A compare-and-branch instructions use a short range 9-bit
immediate value. They do not have a corresponding relocation type in the
ABI. For now we only support them in compact code model, with
diagnostics added in the LongJmp pass to ensure this condition. Some
interesting edge cases we cover:
- function splitting works when target is within or beyond the 1KB range
of those instructions,
- but doesn't work beyond the 128MB limit of the compact code model
- branch inversion works with block reordering so long as the immediate
value adjustments remain in bounds
This patch moves the applyBTIFixup from LongJmp pass to MCPlusBuilder.
This refactor allows applyBTIFixup to be called from other passes
inserting indirect branches, such as:
- Hugify,
- PatchEntries.
As different passes have different information about their targets (e.g.
target BasicBlock, target Symbol, target Function), specialized versions
are created (applyBTIFixupToSymbol, applyBTIFixupToTarget), and each
calls applyBTIFixupCommon, which implements the original logic from
before.
Names of related lit tests are updated to have the "bti" prefix.
After ICF folds functions, FoldedIntoFunction may point to a function
that was also folded. Add a post-processing step at the end of ICF to
flatten all chains so FoldedIntoFunction always points to the ultimate
root parent (a function that is not itself folded).
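The flattening step can be sketched on a plain name-to-parent map. This is a simplified illustration: `FoldMap`, `findRoot`, and `flattenChains` are hypothetical, and BOLT tracks the parent on the function object rather than in a string map.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical model: folded function name -> function it was folded into.
// A function that is not a key in the map is not folded.
using FoldMap = std::unordered_map<std::string, std::string>;

// Follows the chain from Name until reaching a function that is not folded.
std::string findRoot(const FoldMap &FoldedInto, std::string Name) {
  std::size_t Steps = 0;
  auto It = FoldedInto.find(Name);
  while (It != FoldedInto.end()) {
    assert(Steps++ < FoldedInto.size() && "cycle in fold chain");
    Name = It->second;
    It = FoldedInto.find(Name);
  }
  return Name;
}

// Post-processing step: make every entry point directly at its root parent.
void flattenChains(FoldMap &FoldedInto) {
  for (auto &Entry : FoldedInto)
    Entry.second = findRoot(FoldedInto, Entry.second);
}
```

After `flattenChains`, a chain such as C folded into B and B folded into A becomes C into A and B into A, so a single lookup always yields a function that is not itself folded.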
In relocation mode, keep folded functions in the BinaryFunctions map
instead of erasing them. Mark them as folded using setFolded() and skip
emitting them.
Currently, many unnecessary samples are populated into MemSamples,
including zero-initialized samples and samples in which the PC address
is not contained in any BinaryFunction. But these samples are totally
skipped during processing and the whole MemSamples vector is cleared
immediately after processing. So, we could just stop populating these
samples into MemSamples, which would reduce maximum resident set size
when processing a large perf.data.
Hot text mover functions are placed in special sections (e.g.,
.never_hugify) to avoid being placed on hot/huge pages. Folding them
with functions from other sections could defeat this purpose.
Add a check in ICF's isIdenticalWith() to prevent folding when either
function is a hot text mover.
When handling a relocation in one function that references code or
data defined in another function, we should check whether the
relocation target is a constant island, and get the referenced symbol
accordingly in both cases.
On x86-64, PLT optimization does not require the binary to be linked
with -znow because indirect calls through GOT work correctly with lazy
binding. At runtime, the dynamic linker's resolver will populate the GOT
entry on the first call, just like with a regular PLT call.
This change removes the -znow requirement specifically for x86-64 while
keeping it for other architectures. I haven't checked RISC-V, but it's
still necessary on AArch64.
This patch adds comprehensive assembler (MC layer) support for the
Mach-O object file format on RISC-V targets, enabling assembly and
disassembly of RISC-V code targeting Apple platforms.
Key changes:
- Define RISC-V-specific Mach-O relocation types in BinaryFormat/MachO.h
- Implement RISCVMachObjectWriter with full relocation handling for:
  - PCREL_HI/LO pairs for PC-relative addressing
  - GOT relocations for external symbols
  - Branch relocations (CALL, unconditional/conditional branches)
  - Data section relocations
Test files include llvm-otool dumps to verify the generated relocations.
This code is based on code originally written by Tim Northover.
Having it in the X86 subdirectory only affects tests in that directory.
However, that is not sufficient: for example,
runtime/X86/pie-exceptions-split.test is affected but isn't located in
the X86 directory.
This essentially fixes the fix for the original commit by guarding it
properly for when the X86 target has been built and the flag is
recognized.
Fixes: 6c48fbc1dcfbd44a47f126f21e575340b67aac06