Allow `parseString()` to return an empty `StringRef` when the delimiter
appears at position 0. This enables parsing pre-aggregated profile
addresses with an omitted buildid but preserved colon (`:addr` format),
where the empty buildid corresponds to the main binary.
Previously, `parseString()` rejected zero-length fields by treating
`StringEnd == 0` the same as `StringRef::npos` (delimiter not found).
These are distinct situations: `npos` means no delimiter exists, while
`0` means the field before the delimiter is empty. The fix removes the
`StringEnd == 0` sub-condition so only the missing-delimiter case
errors.
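A minimal sketch of the distinction, using `std::string_view` in place of `StringRef`. The name `parseField` is hypothetical and is not BOLT's `parseString()`; it only illustrates why position 0 and `npos` call for different outcomes.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical stand-in for the field-splitting step, not BOLT's parseString().
// Returns the text before the first delimiter, or std::nullopt if there is none.
std::optional<std::string_view> parseField(std::string_view Buf, char Delim) {
  const std::size_t End = Buf.find(Delim);
  // Only a missing delimiter is an error; End == 0 is a valid empty field.
  if (End == std::string_view::npos)
    return std::nullopt;
  std::string_view Field = Buf.substr(0, End);
  assert(Field.find(Delim) == std::string_view::npos &&
         "field must not contain the delimiter");
  return Field;
}
```

With this shape, an input such as `:4f0` yields an empty field, while `4f0` (no colon) is reported as a missing delimiter.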
The existing test for buildid-prefixed addresses is extended to also
verify that `:addr` input produces identical output to the plain-address
and non-empty-buildid variants.
Test Plan:
Added empty-buildid input file and extended
`pre-aggregated-perf-buildid.test` to run perf2bolt with `:addr` format
and diff the fdata output against the existing buildid-prefixed result.
perf2bolt can generate an empty fdata file for a small binary. Right now
BOLT detects this only while parsing, by checking `!hasBranchData() &&
!hasMemData()`. Instead, exit with an error message as soon as the
buffer finishes reading the data file.
Template patchELFPHDRTable, rewriteNoteSections, markGnuRelroSections,
and discoverStorage to support both ELF32LE and ELF64LE binaries.
Previously these functions were hardcoded for ELF64LE, causing crashes
when processing 32-bit ELF binaries.
The RewriteInstance constructor now accepts ELF32LE objects in addition
to ELF64LE. The ELF_FUNCTION macro is reused (and moved earlier in the
header) to dispatch to the correct template instantiation.
These changes prepare for adding Hexagon architecture support to BOLT.
Summary:
When the disk runs out of space during output file writing, BOLT would
crash with SIGSEGV/SIGABRT because raw_fd_ostream silently records write
errors and only reports them via abort() in its destructor. This made it
difficult to distinguish real BOLT bugs from infrastructure issues in
production monitoring.
Add an explicit error check on the output stream before calling
Out->keep(), so BOLT exits cleanly with exit code 1 and a clear error
message instead.
Test: manually verified with a full filesystem that BOLT now prints
"BOLT-ERROR: failed to write output file: No space left on device" and
exits with code 1.
In this patch I am adding the missing target hooks required for the
liveness analysis to run on AArch64. These are:
- getFlagsReg()
- getRegsUsedAsParams()
- getDefaultLiveOut()
- getGPRegs()
- isCleanRegXOR()
I am also introducing the following API in LivenessAnalysis:
- BitVector getLiveIn/Out(const MCInst &)
- MCPhysReg scavengeRegFromState(BitVector &)
My intention is to allow the LongJmp pass to scavenge usable registers
when injecting code.
When the compact code model is used, LongJmpPass::relaxLocalBranches
attempts to reverseBranchCondition without calling isReversibleBranch,
which results in a runtime error. With this patch I am adding an
additional trampoline to handle irreversible FEAT_CMPBR branches.
In the future the plan is to use liveness analysis and replace the
irreversible branch with compare followed by branch (see #185731) as
long as the condition flags are dead, or emit the additional trampoline
otherwise.
Sample addresses belonging to external DSOs (buildid doesn't match the
current file) are treated as external (0).
Buildid for the main binary is expected to be omitted.
Test Plan:
added pre-aggregated-perf-buildid.test
Reviewers:
paschalis-mpeis, maksfb, yavtuk, ayermolo, yozhu, rafaelauler, yota9
Reviewed By: paschalis-mpeis
Pull Request: https://github.com/llvm/llvm-project/pull/186931
Create bolt/docs/profiles.md documenting all accepted profile formats:
perf.data, fdata, YAML, and pre-aggregated. Covers collection methods,
format syntax, examples, and known limitations.
Add reference from bolt/docs/index.rst.
Adding a generator to perf2bolt is the first step toward supporting
large end-to-end tests for Arm SPE. It provides a unified pre-parsed
profile format that perf2bolt is able to consume.
Why does the test need a textual SPE profile?
* Collecting an Arm SPE profile with Linux perf requires an Arm
development device with SPE support.
* Decoding SPE data also requires a suitable version of Linux perf.
* The minimum required version of Linux perf is v6.15.
To bypass these technical difficulties, it is easier to provide a
pre-generated textual profile.
The generator relies on the aggregator to spawn the required
perf-script jobs based on the aggregation type, and merges the
results of the perf-script jobs into a single file.
This hybrid profile will contain all events required for the
aggregation, such as BuildID, MMAP, TASK, BRSTACK, or MEM events.
The two examples below generate pre-parsed perf data. As input for
Arm SPE aggregation:
`perf2bolt -p perf.data BINARY -o perf.text --spe --generate-perf-script`
Or for basic aggregation:
`perf2bolt -p perf.data BINARY -o perf.text --ba --generate-perf-script`
Remove some unused code in BOLT:
- `RewriteInstance::linkRuntime` is declared but not defined
- `BranchContext` typedef is never used
- `FuncBranchData::getBranch` is defined but never used
- `FuncBranchData::getDirectCallBranch` is defined but never used
The assert condition (the function is not split, or is split
into fewer than three fragments) no longer always holds now
that we emit more local symbols due to #184074.
This commit makes the instrumentation-file-append-pid and
instrumentation-sleep-time options compatible. This also requires
keeping the counters mapping between the watcher process and the
instrumented binary process in shared mode. This is useful when we
instrument a shared library that is used by several tasks running on the
target system. When we cannot wait for every task to complete, we must
use the sleep-time option. Without the append-pid option, profiles
collected from different tasks would overwrite each other at the same
path, leading to unexpected or suboptimal optimization effects.
Co-authored-by: Vasily Leonenko <vasily.leonenko@huawei.com>
Allow `--function-order` to be combined with `--reorder-functions`
algorithms. Functions listed in the order file are pinned first
(indices 0..N-1), then the selected algorithm orders remaining
functions starting at index N.
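A minimal sketch of the resulting index assignment. The function `assignIndices` and its inputs are hypothetical: `Pinned` models the order file and `Ranked` models the order produced by the selected algorithm over all functions. This is not BOLT's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical model: returns a map from function name to layout index.
std::unordered_map<std::string, std::size_t>
assignIndices(const std::vector<std::string> &Pinned,
              const std::vector<std::string> &Ranked) {
  std::unordered_map<std::string, std::size_t> Index;
  // Functions from the order file take indices 0..N-1.
  for (const std::string &Name : Pinned)
    Index.emplace(Name, Index.size());
  // Remaining functions follow in algorithm order, starting at index N.
  // emplace() leaves already-pinned names untouched.
  for (const std::string &Name : Ranked)
    Index.emplace(Name, Index.size());
  assert(Index.size() <= Pinned.size() + Ranked.size());
  return Index;
}
```

A function that appears in both lists keeps its pinned index, so the algorithm only decides the order of the functions the file does not mention.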
Add separate options to enable each of the available gadget detectors.
Furthermore, add two meta-options enabling all PtrAuth scanners and all
available scanners of any type (which is only PtrAuth for now, though).
This commit renames `pacret` option to `ptrauth-pac-ret` and `pauth` to
`ptrauth-all`.
Currently LongJmpPass::relaxLocalBranches bails early if the estimated
size of a binary function is less than 32KB assuming that the shortest
branches are 16 bits. Therefore the fixup value for the cold branch
target may go out of range if the function is larger than 1KB.
I am decreasing ShortestJumpSpan from 32KB to 1KB, since FEAT_CMPBR
branches are 11 bits.
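The arithmetic behind the two thresholds can be sketched as follows, assuming a signed immediate that counts 4-byte instructions. Under that assumption a 14-bit immediate gives a 16-bit byte offset (±32KB) and a 9-bit immediate gives an 11-bit byte offset (±1KB). The helper `fitsBranchRange` is hypothetical and is not part of BOLT.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: the branch encodes a signed ImmBits-wide immediate that is
// scaled by the 4-byte instruction size.
constexpr bool fitsBranchRange(int64_t ByteOffset, unsigned ImmBits) {
  assert(ImmBits >= 1 && ImmBits < 60);
  // Number of bytes reachable in the backward direction (inclusive bound).
  const int64_t Span = (int64_t{1} << (ImmBits - 1)) * 4;
  return ByteOffset % 4 == 0 && ByteOffset >= -Span && ByteOffset < Span;
}

// 9-bit immediate: offsets from -1024 up to +1020 bytes are encodable.
static_assert(fitsBranchRange(1020, 9), "last forward slot within 1KB");
static_assert(!fitsBranchRange(1024, 9), "1KB forward is out of range");
// 14-bit immediate: the span grows to 32KB.
static_assert(!fitsBranchRange(32768, 14), "32KB forward is out of range");
```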
Launch this perf job with the others at the beginning of the aggregation
process.
Extracting buildid-list from perf data is not a costly process, so it
can be performed by default. This provides a distinct advantage when
this dataset is required in other perf2bolt stages as well.
Please see PR #171144.
Some binaries are built using `-gz=zstd`, but when using
`--update-debug-sections` on said binaries BOLT crashes.
This patch fixes this issue by recognising compressed debug sections in
binaries via their flag `SHF_COMPRESSED` and appropriately erroring out.
Legacy GNU-style compression is not handled.
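For illustration, the flag test amounts to a single bit check on the section header's flags. The value `0x800` is `SHF_COMPRESSED` from the ELF gABI; the helper name `isCompressedSection` is hypothetical and this is not the code added by the patch.

```cpp
#include <cstdint>

// SHF_COMPRESSED as defined by the ELF gABI.
constexpr uint64_t ShfCompressed = 0x800;

// Hypothetical helper: true if the section flags mark the contents as
// compressed. Legacy GNU-style compression does not set this flag, so it is
// not detected here.
constexpr bool isCompressedSection(uint64_t SectionFlags) {
  return (SectionFlags & ShfCompressed) != 0;
}
```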
The "private global" terminology, likely derived from
llvm/lib/IR/Mangler.cpp, is misleading: "private" is the opposite of
"global", and these prefixed symbols are not global in the object file
format sense (e.g. ELF has STB_GLOBAL while these symbols are always
STB_LOCAL). The term "internal symbol" better describes their purpose:
symbols for internal use by compilers and assemblers, not meant to be
visible externally.
This rename is a step toward adopting the "internal symbol prefix"
terminology agreed with GNU as
(https://sourceware.org/pipermail/binutils/2026-March/148448.html).
Use `%cxxflags` so that `-fPIE -pie` get passed, ensuring the test
behaves the same regardless of the cmake configuration. We do the same
in many other BOLT tests.
There are cases in which `getEntryIDForSymbol` is called with a Symbol
that is in a constant island, so BOLT cannot find its function. This
causes BOLT to reach `llvm_unreachable("symbol not found")` and crash.
This patch adds a check that avoids this crash.
BOLT currently strips all STT_NOTYPE STB_LOCAL zero-sized symbols
that fall inside function bodies. Certain such symbols are named
labels (loop markers and subroutine entry points) or local function
symbols in hand-written assembly. We now keep them in the local symbol
table of BOLT-processed binaries for better symbolication.
`BinaryFunction::translateInputToOutputAddress()` contains fallback
logic in case that querying `IOAddressMap` doesn't yield an output
address. Because this function could be called in scenarios where
`IOAddressMap` won't be set up, we should check if the map actually
exists before lookup.
Disable all stderr diagnostic output on Android since there is typically
no terminal on which to read diagnostic messages. The `noinline`
annotation keeps the inlining decisions the same before and after this
change. On AArch64 the `.text` section of the instrumentation runtime
library is now ~4.8 KB smaller.
Use `GlobalWriteProfileMutex` to synchronize data reset and dump. To
synchronize static counter reset with counter increment, use an atomic
store in the reset code: the counter increment sequence inserted into
user code already takes care of thread safety, so we only need to make
sure the reset code is also thread safe (no torn write to a counter).
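A standard-library analogue of this scheme, as a sketch only. The actual instrumentation runtime does not use the C++ standard library, and `WriteProfileMutex`, `resetCounters`, and `dumpCounters` are hypothetical names chosen for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>
#include <vector>

// Hypothetical stand-in for the mutex shared by the reset and dump paths.
std::mutex WriteProfileMutex;

// Zero every counter while holding the lock shared with the dump path.
void resetCounters(std::vector<std::atomic<uint64_t>> &Counters) {
  std::lock_guard<std::mutex> Lock(WriteProfileMutex);
  for (std::atomic<uint64_t> &C : Counters)
    C.store(0, std::memory_order_relaxed); // atomic store: no torn write
}

// Snapshot all counters under the same lock, so a dump never interleaves
// with a reset.
std::vector<uint64_t>
dumpCounters(const std::vector<std::atomic<uint64_t>> &Counters) {
  std::lock_guard<std::mutex> Lock(WriteProfileMutex);
  std::vector<uint64_t> Snapshot;
  Snapshot.reserve(Counters.size());
  for (const std::atomic<uint64_t> &C : Counters)
    Snapshot.push_back(C.load(std::memory_order_relaxed));
  assert(Snapshot.size() == Counters.size());
  return Snapshot;
}
```

Increments from instrumented code would use an atomic read-modify-write such as `fetch_add` and do not take the mutex, which is why the reset itself must be an atomic store.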
Similar to #132569 for RISC-V, replace the unofficial `@plt` and
`@gotpcrel` relocation specifiers, currently only used by clang's
`-fexperimental-relative-c++-abi-vtables`, with `%pltpcrel` and
`%gotpcrel`. The syntax is not used in hand-written assembly code, and
is not supported by the GNU assembler.
Also replace the recent `@funcinit` with `%funcinit(x)`.
Checks that isReversibleBranch() returns false
- when the immediate value is 63 and needs +1 adjustment
- when the immediate value is 0 and needs -1 adjustment
Checks that reverseBranchCondition() adjusts
- the opcode
- the immediate operand if necessary (+/-1)
- the register operands if necessary (swap)
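The immediate adjustment can be sketched with a simplified model. It assumes an immediate range of 0..63 (taken from the checks above) and covers only a signed greater-than/less-than pair; the types `Cond` and `CmpBr` and the function `reverseCondition` are hypothetical and do not correspond to the MC opcodes or to BOLT's API.

```cpp
#include <cassert>
#include <optional>

// Hypothetical model of a compare-and-branch against an immediate.
enum class Cond { GT, LT };
struct CmpBr {
  Cond CC;
  int Imm; // assumed encodable range: 0..63
};

constexpr int MaxImm = 63;

// Uses !(x > imm) == (x < imm + 1) and !(x < imm) == (x > imm - 1).
// Returns std::nullopt when the adjusted immediate would leave the range.
std::optional<CmpBr> reverseCondition(CmpBr B) {
  assert(B.Imm >= 0 && B.Imm <= MaxImm);
  if (B.CC == Cond::GT) {
    if (B.Imm == MaxImm)
      return std::nullopt; // would need imm + 1 == 64
    return CmpBr{Cond::LT, B.Imm + 1};
  }
  if (B.Imm == 0)
    return std::nullopt; // would need imm - 1 == -1
  return CmpBr{Cond::GT, B.Imm - 1};
}
```

In this model a branch is reversible exactly when `reverseCondition` returns a value, which mirrors the two irreversible cases listed above.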
A raw profile data file may contain lines truncated by an unexpected
application exit. With this change, merge_fdata checks the number of
fields in each line of the raw profile data file and ignores any line
whose field count is not the expected one.
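A minimal sketch of such a filter, assuming whitespace-separated fields. The helpers `countFields` and `dropMalformedLines` and the `ExpectedFields` parameter are hypothetical; the field count that merge_fdata actually expects is not stated here.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Count whitespace-separated fields in one line.
std::size_t countFields(const std::string &Line) {
  std::istringstream Stream(Line);
  std::string Token;
  std::size_t Count = 0;
  while (Stream >> Token)
    ++Count;
  return Count;
}

// Keep only lines with exactly ExpectedFields fields; truncated lines are
// silently skipped.
std::vector<std::string>
dropMalformedLines(const std::vector<std::string> &Lines,
                   std::size_t ExpectedFields) {
  assert(ExpectedFields > 0 && "a valid line has at least one field");
  std::vector<std::string> Kept;
  for (const std::string &Line : Lines)
    if (countFields(Line) == ExpectedFields)
      Kept.push_back(Line);
  return Kept;
}
```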
When applying BTI fixups to indirect branch targets, ignored functions
are considered as a special case:
- these hold no instructions,
- have no CFG,
- and are not emitted in the new text section.
The solution is to patch the entry points in the original location.
If such a situation occurs in a binary, recompilation using the
-fpatchable-function-entry flag is required. This will place a nop at
all function starts, which BOLT can use to patch the original section.
Without the extra nop, BOLT cannot safely patch the original .text
section.
An alternative solution could be to also ignore the function from which
the stub starts. This has not been tried, as the LongJmp pass (where
most stubs are inserted) is currently not equipped to ignore functions.
Testing: both the success and failure cases are covered with lit tests.
Insert new PT_LOAD segments right after the last existing PT_LOAD in the
program header table, instead of before PT_DYNAMIC or at the end. This
maintains the ascending p_vaddr order required by the ELF specification.
Previously, new segments could end up breaking PT_LOAD p_vaddr order
when PT_LOAD segments followed PT_DYNAMIC or PT_GNU_STACK. This led to
the runtime loader incorrectly assessing the dynamic object size and
silently corrupting memory.
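The placement rule can be sketched on a simplified program header table. The `Phdr` struct and both functions are hypothetical; the type constants are the standard ELF values. The sketch assumes the new segments have higher addresses than every existing PT_LOAD, which is what keeps the order ascending.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr uint32_t PtLoad = 1;              // PT_LOAD
constexpr uint32_t PtDynamic = 2;           // PT_DYNAMIC
constexpr uint32_t PtGnuStack = 0x6474e551; // PT_GNU_STACK

// Simplified program header: only the fields needed for the illustration.
struct Phdr {
  uint32_t Type;
  uint64_t VAddr;
};

// Index one past the last PT_LOAD, or the table size if there is none.
std::size_t insertionIndex(const std::vector<Phdr> &Table) {
  std::size_t Pos = Table.size();
  for (std::size_t I = 0; I < Table.size(); ++I)
    if (Table[I].Type == PtLoad)
      Pos = I + 1;
  return Pos;
}

// Insert the new PT_LOAD entries right after the last existing PT_LOAD.
void insertLoadSegments(std::vector<Phdr> &Table,
                        const std::vector<Phdr> &NewLoads) {
  for (const Phdr &P : NewLoads)
    assert(P.Type == PtLoad && "only PT_LOAD entries are inserted");
  const auto Offset = static_cast<std::ptrdiff_t>(insertionIndex(Table));
  Table.insert(Table.begin() + Offset, NewLoads.begin(), NewLoads.end());
}
```

Inserting before PT_DYNAMIC in a table where a PT_LOAD follows PT_DYNAMIC would place the new, higher-address segment ahead of a lower-address one, which is the ordering problem described above.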
Summary:
When the .bolt_reserved section is defined in the linker script, there's
no way to mark the containing segment executable other than via the
PHDRS command, which overrides the program headers entirely and is
therefore impractical. Since .bolt_reserved contains executable code,
mark the segment executable in BOLT.
Test Plan: bolt-reserved.test
The instrument-ind-call test checks the correctness of the instrumented
snippet by the set of registers used. The call ID value is meaningless
(platform-dependent) and should be excluded from the test.
The Armv9.6-A compare-and-branch instructions use a short range 9-bit
immediate value. They do not have a corresponding relocation type in the
ABI. For now we only support them in compact code model, with
diagnostics added in the LongJmp pass to ensure this condition. Some
interesting edge cases we cover:
- function splitting works when target is within or beyond the 1KB range
of those instructions,
- but doesn't work beyond the 128MB limit of the compact code model
- branch inversion works with block reordering so long as the immediate
value adjustments remain in bounds
This patch moves the applyBTIFixup from LongJmp pass to MCPlusBuilder.
This refactor allows applyBTIFixup to be called from other passes
inserting indirect branches, such as:
- Hugify,
- PatchEntries.
As different passes have different information about their targets (e.g.
target BasicBlock, target Symbol, target Function), specialized versions
are created (applyBTIFixupToSymbol, applyBTIFixupToTarget), and each
calls applyBTIFixupCommon, which implements the original logic from
before.
Names of related lit tests are updated to have the "bti" prefix.
After ICF folds functions, FoldedIntoFunction may point to a function
that was also folded. Add a post-processing step at the end of ICF to
flatten all chains so FoldedIntoFunction always points to the ultimate
root parent (a function that is not itself folded).
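A minimal sketch of the flattening step on an index-based model. Here `Parent[I]` stands in for FoldedIntoFunction of function `I`, and `NotFolded` marks a function that was not folded; these names and the representation are hypothetical. The sketch assumes the fold relation contains no cycles.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sentinel for a function that is not folded into any other function.
constexpr std::size_t NotFolded = static_cast<std::size_t>(-1);

// Rewrite every entry so it points at the ultimate root parent.
// Precondition: following Parent links never forms a cycle.
void flattenFoldChains(std::vector<std::size_t> &Parent) {
  for (std::size_t I = 0; I < Parent.size(); ++I) {
    if (Parent[I] == NotFolded)
      continue;
    std::size_t Root = Parent[I];
    assert(Root < Parent.size() && "parent index out of range");
    // Walk up until reaching a function that is not itself folded.
    while (Parent[Root] != NotFolded) {
      Root = Parent[Root];
      assert(Root < Parent.size() && "parent index out of range");
    }
    Parent[I] = Root;
  }
}
```

After this pass a single lookup suffices to find the function that is actually emitted, regardless of the order in which folds happened.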
In relocation mode, keep folded functions in the BinaryFunctions map
instead of erasing them. Mark them as folded using setFolded() and skip
emitting them.
Currently, many unnecessary samples are populated into MemSamples,
including zero-initialized samples and samples in which the PC address
is not contained in any BinaryFunction. But these samples are totally
skipped during processing and the whole MemSamples vector is cleared
immediately after processing. So, we could just stop populating these
samples into MemSamples, which would reduce maximum resident set size
when processing a large perf.data.
Hot text mover functions are placed in special sections (e.g.,
.never_hugify) to avoid being placed on hot/huge pages. Folding them
with functions from other sections could defeat this purpose.
Add a check in ICF's isIdenticalWith() to prevent folding when either
function is a hot text mover.
When handling a relocation in one function that references code or
data defined in another function, we should check whether the
relocation target is a constant island, and get the referenced symbol
accordingly in either case.