329 Commits

Author SHA1 Message Date
Maksim Panchenko
2db9b6a93f
[BOLT] Make instruction size a first-class annotation (#72167)
When NOP instructions are used to reserve space in the code, e.g. for
patching, it becomes critical to preserve their original size while
emitting the code. On x86, we rely on "Size" annotation for NOP
instructions size, as the original instruction size is lost in the
disassembly/assembly process.

This change makes instruction size a first-class annotation and is
affectively NFCI. A follow-up diff will use the annotation for code
emission.
2023-11-13 14:33:39 -08:00
Vladislav Khmelevsky
cf18f142c0
[BOLT] Read .rela.dyn in static non-pie binary (#71635)
Static non-pie binary doesn't have DYNAMIC segment and BOLT skips
reading .rela.dyn section because of it. But such binaries might have
this section for example to store IFUNC relocation which is resolved
by linked-in startup files, so force reading this section for static
executables.
2023-11-10 11:47:12 +04:00
Vladislav Khmelevsky
c6c04a83a7
[BOLT] Run EliminateUnreachableBlocks in parallel (#71299)
The wall time for this pass decreased on my laptop from ~80 sec to 5
sec processing the clang.
2023-11-10 00:46:04 +04:00
spaette
1a2f83366b
[BOLT] Fix typos (#68121)
Closes https://github.com/llvm/llvm-project/issues/63097

Before merging please make sure the change to
bolt/include/bolt/Passes/StokeInfo.h is correct.

bolt/include/bolt/Passes/StokeInfo.h

```diff
  //  This Pass solves the two major problems to use the Stoke program without
- //  proting its code:
+ //  probing its code:
```

I'm still not happy about the awkward wording in this comment.

bolt/include/bolt/Passes/FixRelaxationPass.h

```
$ ed -s bolt/include/bolt/Passes/FixRelaxationPass.h <<<'9,12p'
// This file declares the FixRelaxations class, which locates instructions with
// wrong targets and fixes them. Such problems usually occures when linker
// relaxes (changes) instructions, but doesn't fix relocations types properly
// for them.
$
```


bolt/docs/doxygen.cfg.in
bolt/include/bolt/Core/BinaryContext.h
bolt/include/bolt/Core/BinaryFunction.h
bolt/include/bolt/Core/BinarySection.h
bolt/include/bolt/Core/DebugData.h
bolt/include/bolt/Core/DynoStats.h
bolt/include/bolt/Core/Exceptions.h
bolt/include/bolt/Core/MCPlusBuilder.h
bolt/include/bolt/Core/Relocation.h
bolt/include/bolt/Passes/FixRelaxationPass.h
bolt/include/bolt/Passes/InstrumentationSummary.h
bolt/include/bolt/Passes/ReorderAlgorithm.h
bolt/include/bolt/Passes/StackReachingUses.h
bolt/include/bolt/Passes/StokeInfo.h
bolt/include/bolt/Passes/TailDuplication.h
bolt/include/bolt/Profile/DataAggregator.h
bolt/include/bolt/Profile/DataReader.h
bolt/lib/Core/BinaryContext.cpp
bolt/lib/Core/BinarySection.cpp
bolt/lib/Core/DebugData.cpp
bolt/lib/Core/DynoStats.cpp
bolt/lib/Core/Relocation.cpp
bolt/lib/Passes/Instrumentation.cpp
bolt/lib/Passes/JTFootprintReduction.cpp
bolt/lib/Passes/ReorderData.cpp
bolt/lib/Passes/RetpolineInsertion.cpp
bolt/lib/Passes/ShrinkWrapping.cpp
bolt/lib/Passes/TailDuplication.cpp
bolt/lib/Rewrite/BoltDiff.cpp
bolt/lib/Rewrite/DWARFRewriter.cpp
bolt/lib/Rewrite/RewriteInstance.cpp
bolt/lib/Utils/CommandLineOpts.cpp
bolt/runtime/instr.cpp
bolt/test/AArch64/got-ld64-relaxation.test
bolt/test/AArch64/unmarked-data.test
bolt/test/X86/Inputs/dwarf5-cu-no-debug-addr-helper.s
bolt/test/X86/Inputs/linenumber.cpp
bolt/test/X86/double-jump.test
bolt/test/X86/dwarf5-call-pc-function-null-check.test
bolt/test/X86/dwarf5-split-dwarf4-monolithic.test
bolt/test/X86/dynrelocs.s
bolt/test/X86/fallthrough-to-noop.test
bolt/test/X86/tail-duplication-cache.s
bolt/test/runtime/X86/instrumentation-ind-calls.s
2023-11-09 11:29:46 -08:00
Job Noorman
96b5e092dc
[BOLT] Support instrumentation hook via DT_FINI_ARRAY (#67348)
BOLT currently hooks its its instrumentation finalization function via
`DT_FINI`. However, this method of calling finalization routines is not
supported anymore on newer ABIs like RISC-V. `DT_FINI_ARRAY` is
preferred there.

This patch adds support for hooking into `DT_FINI_ARRAY` instead if the
binary does not have a `DT_FINI` entry. If it does, `DT_FINI` takes
precedence so this patch should not change how the currently supported
instrumentation targets behave.

`DT_FINI_ARRAY` points to an array in memory of `DT_FINI_ARRAYSZ` bytes.
It consists of pointer-length entries that contain the addresses of
finalization functions. However, the addresses are only filled-in by the
dynamic linker at load time using relative relocations. This makes
hooking via `DT_FINI_ARRAY` a bit more complicated than via `DT_FINI`.

The implementation works as follows:
- While scanning the binary: find the section where `DT_FINI_ARRAY`
points to, read its first dynamic relocation and use its addend to find
the address of the fini function we will use to hook;
- While writing the output file: overwrite the addend of the dynamic
relocation with the address of the runtime library's fini function.

Updating the dynamic relocation required a bit of boiler plate: since
dynamic relocations are stored in a `std::multiset` which doesn't
support getting mutable references to its items, functions were added to
`BinarySection` to take an existing relocation and insert a new one.
2023-11-08 11:01:10 +00:00
Vladislav Khmelevsky
e2f1a95f2a
[BOLT][AArch64] Handle IFUNCS properly (#71104)
Currently we were testing only the binaries compiled with O0, which
results in indirect call to the IFUNC trampoline and the trampoline has
associated IFUNC symbol with it. Compile with O3 results in direct
calling the IFUNC trampoline and no symbols are associated with it, the
IFUNC symbol address becomes the same as IFUNC resolver address. Since
no symbol was associated the BF was not created before PLT analyze and
be the algorithm we're going to analyze target relocation. As we're
expecting the JUMP relocation we're also expecting the associated symbol
with it to be presented. But for IFUNC relocation the IRELATIVE
relocation is used and no symbol is associated with it, the addend value
is pointing on the target symbol, so we need to find BF using it and use
it's symbol in this situation. Currently this is checked only for
AArch64 platform, so I've limited it in code to use this logic only for
this platform, although I wouldn't be surprised if other platforms needs
to activate this logic too.
2023-11-08 11:41:43 +04:00
Maksim Panchenko
0df154671b
[BOLT] Use Label annotation instead of EHLabel pseudo. NFCI. (#70179)
When we need to attach EH label to an instruction, we can now use Label
annotation instead of EHLabel pseudo instruction.
2023-11-06 14:43:14 -08:00
Maksim Panchenko
b336d741d0
[BOLT] Use direct storage for Label annotations. NFCI. (#70147)
Store the Label annotation directly in the operand and avoid the extra
allocation and indirection overheads associated with MCSimpleAnnotation.
2023-11-06 14:24:55 -08:00
maksfb
74e0a26fd1
[BOLT] Modify MCPlus annotation internals. NFCI. (#70412)
When annotating MCInst instructions, attach extra annotation operands
directly to the annotated instruction, instead of attaching them to an
instruction pointed to by a special kInst operand.

With this change, it's no longer necessary to allocate MCInst and most
of the first-class annotations come with free memory as currently MCInst
is declared with:

    SmallVector<MCOperand, 10> Operands;

i.e. more operands than are normally being used.

We still create a kInst operand with a nullptr instruction value to
designate the beginning of annotation operands. However, this special
operand might not be needed if we can rely on MCInstrDesc::NumOperands.
2023-11-06 12:14:22 -08:00
maksfb
e28c393bd1
[BOLT] Reduce the number of emitted symbols. NFCI. (#70175)
We emit a symbol before an instruction for a number of reasons, e.g. for
tracking LocSyms, debug line, or if the instruction has a label
annotation. Currently, we may emit multiple symbols per instruction.

Reuse the same label instead of creating and emitting new ones when
possible. I'm planning to refactor EH labels as well in a separate diff.

Change getLabel() to return a pointer instead of std::optional<> since
an empty label should be treated identically to no label.
2023-11-06 11:41:47 -08:00
maksfb
6e26246c22
[BOLT][DWARF] Refactor address ranges processing (#71225)
Create BinaryFunction::translateInputToOutputRange() and use it for
updating DWARF debug ranges and location lists while de-duplicating the
existing code. Additionally, move DWARF-specific code out of
BinaryFunction and add print functions to facilitate debugging.

Note that this change is deliberately kept "bug-level" compatible with
the existing solution to keep it NFCI and make it easier to track any
possible regressions in the future updates to the ranges-handling code.
2023-11-06 11:10:20 -08:00
Vladislav Khmelevsky
888742a121
[BOLT][AArch64] Handle .plt.got section (#71216)
It seems that currently this section is only created by the mold linker
if 2 conditions are met: 1. The PLT function was called directly. 2. The
indirect access to PLT function was found (e.g. through ADRP
relocation). Although mold created symbol for every plt entry I've
removed them in yaml file to check that .plt.got was truly disassembled
by bolt.
2023-11-04 00:47:24 +04:00
spupyrev
287fcd38a1
[BOLT] Rename cds to cdsort (#69966)
Unify naming for the layout algorithms by renaming "cds" to "cdsort".
This is
NFC unless someone is already using the new algorithm (which is
unlikely).
2023-11-02 12:46:36 -07:00
maksfb
8244ff6739
[BOLT] Fix incorrect basic block output addresses (#70000)
Some optimization passes may duplicate basic blocks and assign the same
input offset to a number of different blocks in a function. This is done
e.g. to correctly map debugging ranges for duplicated code.

However, duplicate input offsets present a problem when we use
AddressMap to generate new addresses for basic blocks. The output
address is calculated based on the input offset and will be the same for
blocks with identical offsets. The result is potentially incorrect debug
info and BAT records.

To address the issue, we have to eliminate the dependency on input
offsets while generating output addresses for a basic block. Each block
has a unique label, hence we extend AddressMap to include address lookup
based on MCSymbol and use the new functionality to update block
addresses.
2023-10-24 12:22:43 -07:00
Job Noorman
b6b492880f
[BOLT][RISCV] Set minimum function alignment to 2 for RVC (#69837)
In #67707, the minimum function alignment on RISC-V was set to 4. When
RVC (compressed instructions) is enabled, the minimum alignment can be
reduced to 2.

This patch implements this by delegating the choice of minimum alignment
to a new `MCPlusBuilder::getMinFunctionAlignment` function. This way,
the target-dependent code in `BinaryFunction` is minimized.
2023-10-23 08:09:11 +00:00
Vladislav Khmelevsky
b7944f7c04
[BOLT] Return proper minimal alignment from BF (#67707)
Currently minimal alignment of function is hardcoded to 2 bytes.
Add 2 more cases:
1. In case BF is data in code return the alignment of CI as minimal
alignment
2. For aarch64 and riscv platforms return the minimal value of 4 (added
test for aarch64)
Otherwise fallback to returning the 2 as it previously was.
2023-10-12 09:33:08 +04:00
Job Noorman
da37139ac9
[BOLT][NFC] Add allocator id to MCPlusBuilder::setLabel (#68707)
This will be needed for some RISC-V instrumentation functions and is
also consistent with other annotation setters.
2023-10-11 07:25:46 +00:00
Job Noorman
ff5e2babcb
[BOLT] Improve handling of relocations targeting specific instructions (#66395)
On RISC-V, there are certain relocations that target a specific
instruction instead of a more abstract location like a function or basic
block. Take the following example that loads a value from symbol `foo`:

```
nop
1: auipc t0, %pcrel_hi(foo)
ld t0, %pcrel_lo(1b)(t0)
```

This results in two relocation:
- auipc: `R_RISCV_PCREL_HI20` referencing `foo`;
- ld: `R_RISCV_PCREL_LO12_I` referencing to local label `1` which points
to the auipc instruction.

It is of utmost importance that the `R_RISCV_PCREL_LO12_I` keeps
referring to the auipc instruction; if not, the program will fail to
assemble. However, BOLT currently does not guarantee this.

BOLT currently assumes that all local symbols are jump targets and
always starts a new basic block at symbol locations. The example above
results in a CFG the looks like this:

```
.BB0:
    nop
.BB1:
    auipc t0, %pcrel_hi(foo)
    ld t0, %pcrel_lo(.BB1)(t0)
```

While this currently works (i.e., the `R_RISCV_PCREL_LO12_I` relocation
points to the correct instruction), it has two downsides:
- Too many basic blocks are created (the example above is logically only
  one yet two are created);
- If instructions are inserted in `.BB1` (e.g., by instrumentation),
  things will break since the label will not point to the auipc anymore.

This patch proposes to fix this issue by teaching BOLT to track labels
that should always point to a specific instruction. This is implemented
as follows:
- Add a new annotation type (`kLabel`) that allows us to annotate
  instructions with an `MCSymbol *`;
- Whenever we encounter a relocation type that is used to refer to a
  specific instruction (`Relocation::isInstructionReference`), we
  register it without a symbol;
- During disassembly, whenever we encounter an instruction with such a
  relocation, create a symbol for its target and store it in an offset
  to symbol map (to ensure multiple relocations referencing the same
  instruction use the same label);
- After disassembly, iterate this map to attach labels to instructions
  via the new annotation type;
- During emission, emit these labels right before the instruction.

I believe the use of annotations works quite well for this use case as
it allows us to reliably track instruction labels. If we were to store
them as offsets in basic blocks, it would be error prone to keep them
updated whenever instructions are inserted or removed.

I have chosen to add labels as first-class annotations (as opposed to a
generic one) because the documentation of `MCAnnotation` suggests that
generic annotations are to be used for optional metadata that can be
discarded without affecting correctness. As this is not the case for
labels, a first-class annotation seemed more appropriate.
2023-10-06 06:46:16 +00:00
Job Noorman
8fb83bf5f1
[BOLT][NFC] Add MCSubtargetInfo to MCPlusBuilder (#68223)
On RISC-V, it's helpful to have access to `MCSubtargetInfo` while
generating instructions in `MCPlusBuilder`. For example, a return
instruction might be generated differently based on if the target
supports compressed instructions (`c.jr ra`) or not (`jalr ra`).
2023-10-06 06:39:58 +00:00
Rafael Auler
853e126ce3 [BOLT] Support input binaries that use R_X86_GOTPC64
In large code model, the address of GOT is calculated by the
static linker via R_X86_GOTPC64 reloc applied against a MOVABSQ
instruction. In the final binary, it can be disassembled as a regular
immediate, but because such immediate is the result of PC-relative
pointer arithmetic, we need to parse this relocation and update this
calculation whenever we move code, otherwise we break the code trying
to read GOT.

A test case showing how GOT is accessed was provided.

Reviewed By: #bolt, maksfb

Differential Revision: https://reviews.llvm.org/D158911
2023-10-02 23:12:44 -07:00
Vladislav Khmelevsky
846eb76761 [BOLT][AArch64] Fix instrumentation deadloop
According to ARMv8-a architecture reference manual B2.10.5 software
must avoid having any explicit memory accesses between exclusive load
and associated store instruction. Otherwise exclusive monitor might
clear the exclusivity without application-related cause which may
result in the deadloop. Disable instrumentation for such functions,
since between exclusive load and store there might be branches and we
would insert instrumentation snippet which contains loads and stores.

The better solution would be to analyze with BFS finding the exact BBs
between load and store and not instrumenting them. Or even better to
recognize such sequences and replace them with more complex one, e.g.
loading value non exclusively, and for the brach where exclusive store
is made make exclusive load and store sequentially, but for now just
disable instrumentation for such functions completely.

Differential Revision: https://reviews.llvm.org/D159520
2023-09-22 00:58:01 +04:00
Kristof Beyls
8fb28e45ce
[BOLT] Fix data race in MCPlusBuilder::getOrCreateAnnotationIndex (#67004)
MCPlusBuilder::getOrCreateAnnotationIndex(Name) can be called from
different threads, for example when making use of
ParallelUtilities::runOnEachFunctionWithUniqueAllocId.

The race occurs when an Index for a particular annotation Name needs to
be created for the first time.

For example, this can easily happen when multiple "copies" of an
analysis pass run on different BinaryFunctions, and the analysis pass
creates a new Annotation Index to be able to store analysis results as
annotations.

This was found by using the ThreadSanitizer.

No regression test was added; I don't think there is good way to write
regression tests that verify the absence of data races?

---------

Co-authored-by: Amir Ayupov <fads93@gmail.com>
2023-09-21 19:53:09 +02:00
Job Noorman
dc925be68b
[BOLT][RISCV] Carry-over annotations when fixing calls (#66763)
`FixRISCVCallsPass` changes all different forms of calls to `PseudoCALL`
instructions. However, the original call's annotations were lost in the
process.

This patch fixes this by moving all annotations from the old to the new
call. `MCPlusBuilder::moveAnnotations` had to be made public for this.
2023-09-21 06:37:47 +00:00
Job Noorman
c5ba61978c [BOLT][RISCV] Add support for linker relaxation
Calls on RISC-V are typically compiled to `auipc`/`jalr` pairs to allow
a maximum target range (32-bit pc-relative). In order to optimize calls
to near targets, linker relaxation may replace those pairs with, for
example, single `jal` instructions.

To allow BOLT to freely reassign function addresses in relaxed binaries,
this patch proposes the following approach:
- Expand all relaxed calls back to `auipc`/`jalr`;
- Rely on JITLink to relax those back to shorter forms where possible.

This is implemented by detecting all possible call instructions and
replacing them with `PseudoCALL` (or `PseudoTAIL`) instructions. The
RISC-V backend then expands those and adds the necessary relocations for
relaxation.

Since BOLT generally ignores pseudo instruction, this patch makes
`MCPlusBuilder::isPseudo` virtual so that `RISCVMCPlusBuilder` can
override it to exclude `PseudoCALL` and `PseudoTAIL`.

To ensure JITLink knows about the correct section addresses while
relaxing, reassignment of addresses has been moved to a post-allocation
pass. Note that this is probably the time it had to be done in the
first place since in `notifyResolved` (where it was done before), all
symbols are supposed to be resolved already.

Depends on D159082

Reviewed By: maksfb

Differential Revision: https://reviews.llvm.org/D159089
2023-09-15 11:57:28 +02:00
Amir Ayupov
7b750943d7 [BOLT][NFC] Speedup YAML profile processing
Reduce YAML profile processing times:
- preprocessProfile: speed up buildNameMaps by replacing ProfileNameToProfile
  mapping with ProfileFunctionNames set and ProfileBFs vector.
  Pre-look up YamlBF->BF correspondence, memoize in ProfileBFs.
- readProfile: replace iteration over all functions in the binary by iteration
  over profile functions (strict match and LTO name match).

On a large binary (1.9M functions) and large YAML profile (121MB, 30k functions)
reduces profile steps runtime:
pre-process profile data: 12.4953s -> 10.7123s
process profile data: 9.8195s -> 5.6639s

Compared to fdata profile reading:
pre-process profile data: 8.0268s
process profile data: 1.0265s
process profile data pre-CFG: 0.1644s

Reviewed By: #bolt, maksfb

Differential Revision: https://reviews.llvm.org/D159460
2023-09-11 16:07:57 -07:00
Job Noorman
eafe4ee2e8 [BOLT] Rename isLoad/isStore to mayLoad/mayStore
As discussed in D159266, for some instructions it's impossible to know
statically if they will load/store (e.g., predicated instructions).
Therefore, mayLoad/mayStore are more appropriate names.
2023-09-01 09:36:05 +02:00
Job Noorman
76f040bda6 [BOLT] Provide generic implementations for isLoad/isStore
`MCInstrDesc` provides the `mayLoad` and `mayStore` flags that seem
appropriate to use as a target-independent way to implement `isLoad` and
`isStore`.

I believe this is currently good enough to use for the RISC-V target as
well. I've provided a test for this that checks the generated dyno
stats (which seems to be the only thing both `isLoad` and `isStore` are
used for).

Reviewed By: maksfb

Differential Revision: https://reviews.llvm.org/D159266
2023-09-01 09:36:05 +02:00
spupyrev
1256ef274c [BOLT] Fine-tuning hash computation for stale matching
Fine-tuning hash computation for stale matching:
- introducing a new "loose" basic block hash that allows to match many more blocks than before;
- tweaking params of the inference algorithm that find (slightly) better solutions;
- added more meaningful tests for stale matching.

Tested the changes on several open-source benchmarks (clang, rocksdb, chrome)
and one prod workload using different compiler modes (LTO/PGO etc). There is
always an improvement in the quality of inferred profiles.
(The current implementation is still not optimal but the diff is a step forward;
I am open to further suggestions)

Reviewed By: Amir

Differential Revision: https://reviews.llvm.org/D156278
2023-08-31 07:29:02 -07:00
Job Noorman
475a93a07a [BOLT] Calculate output values using BOLTLinker
BOLT uses `MCAsmLayout` to calculate the output values of functions and
basic blocks. This means output values are calculated based on a
pre-linking state and any changes to symbol values during linking will
cause incorrect values to be used.

This issue can be triggered by enabling linker relaxation on RISC-V.
Since linker relaxation can remove instructions, symbol values may
change. This causes, among other things, the symbol table created by
BOLT in the output executable to be incorrect.

This patch solves this issue by using `BOLTLinker` to get symbol values
instead of `MCAsmLayout`. This way, output values are calculated based
on a post-linking state. To make sure the linker can update all
necessary symbols, this patch also makes sure all these symbols are not
marked as temporary so that they end-up in the object file's symbol
table.

Note that this patch only deals with symbols of binary functions
(`BinaryFunction::updateOutputValues`). The technique described above
turned out to be too expensive for basic block symbols so those are
handled differently in D155604.

Reviewed By: maksfb

Differential Revision: https://reviews.llvm.org/D154604
2023-08-28 10:13:07 +02:00
Kazu Hirata
d791fa26a9 [BOLT] Use SmallPtrSet::contains (NFC) 2023-08-27 13:18:38 -07:00
Elvina Yakubova
6e4c230525 [BOLT][Instrumentation] Initial instrumentation support for AArch64
This commit adds code generation for AArch64 instrumentation,
including direct and indirect calls support.

Reviewed By: rafauler, yota9

Differential Revision: https://reviews.llvm.org/D151899
2023-08-24 19:34:57 +03:00
Denis Revunov
28fd2ca142 [BOLT] Fix trap value for non-X86
The trap value used by BOLT was assumed to be single-byte instruction.
It made some functions unaligned on AArch64(e.g exceptions-instrumentation test)
and caused emission failures. Fix that by changing fill value to StringRef.

Reviewed By: rafauler

Differential Revision: https://reviews.llvm.org/D158191
2023-08-24 01:29:41 +03:00
zhoujiapeng
9fee2ac044 [BOLT][NFC] Split createRelocation in X86 and share the second part
This commit splits the createRelocation function for the X86 architecture
into two parts, retaining the first half and moving the second half to a
new function called extractFixupExpr. The purpose of this change is to make
extractFixupExpr a shared function between AArch64 and X86 architectures,
increasing code reusability and maintainability.

Child revision: https://reviews.llvm.org/D156018

Reviewed By: Amir

Differential Revision: https://reviews.llvm.org/D157217
2023-08-23 00:29:25 +08:00
Job Noorman
23c8d38258 [BOLT] Calculate input to output address map using BOLTLinker
BOLT uses MCAsmLayout to calculate the output values of basic blocks.
This means output values are calculated based on a pre-linking state and
any changes to symbol values during linking will cause incorrect values
to be used.

This issue was first addressed in D154604 by adding all basic block
symbols to the symbol table for the linker to resolve them. However, the
runtime overhead of handling this huge symbol table turned out to be
prohibitively large.

This patch solves the issue in a different way. First, a temporary
section containing [input address, output symbol] pairs is emitted to the
intermediary object file. The linker will resolve all these references
so we end up with a section of [input address, output address] pairs.
This section is then parsed and used to:
- Replace BinaryBasicBlock::OffsetTranslationTable
- Replace BinaryFunction::InputOffsetToAddressMap
- Update BinaryBasicBlock::OutputAddressRange

Note that the reason this is more performant than the previous attempt
is that these symbol references do not cause entries to be added to the
symbol table. Instead, section-relative references are used for the
relocations.

Reviewed By: maksfb

Differential Revision: https://reviews.llvm.org/D155604
2023-08-21 10:36:20 +02:00
hezuoqiang
a37e8a4bdc [BOLT] Consider Code Fragments during regreassign
During register swapping, the code fragments associated with the
function need to be swapped together (which may be generated during
PGO optimization).

Fix https://github.com/llvm/llvm-project/issues/59730

Reviewed By: rafauler
Differential Revision: https://reviews.llvm.org/D141931
2023-08-18 16:46:18 +08:00
Alexander Yermolovich
2c784f7d26 [BOLT][DWARF] Fix handling of invalid DIE references
Compiler can generate DIE References that are invalid. Previously BOLT could
assert when writing out IR to .debug_info. Changed where DIE offsets are changed
so that it's always done. Thus making sure that assert is not triggered.

Added more specific warnings, and ability to print out invalid referenced DIE
offset when verbosity >=1.

Reviewed By: Amir

Differential Revision: https://reviews.llvm.org/D157746
2023-08-14 17:28:24 -07:00
Alexander Yermolovich
43fe9dcb71 [BOLT][DWARF][NFC] Remove addIndexAddress
Removed unused API DebugAddrWriter::addIndexAddress.

Reviewed By: Amir

Differential Revision: https://reviews.llvm.org/D157357
2023-08-08 18:23:04 -07:00
spupyrev
b402487b74 [BOLT] A new code layout algorithm for function reordering [3b/3]
This is a new algorithm for function layout (reordering) based on the call graph
extracted from a profile data; see diffs down the stack for more details.

This layout is very similar to the existing hfsort+, but perhaps a little better
on some benchmarks. The goals of the change is as follows:

(i) rename and replace hfsort+ with a newer (hopefully better) implementation.
I'd prefer to keep both algs together for some time to simplify evaluation and
transition, but do want to remove hfsort+ once we're confident that there are
no regressions.

(ii) unify the implementation of code layout algorithms across LLVM. Currently
Passes/HfsortPlus.cpp and Utils/CodeLayout.cpp share many implementation-specific
details; this diff unifies the code.

Reviewed By: Amir

Differential Revision: https://reviews.llvm.org/D153039
2023-07-31 10:49:06 -07:00
Alexander Yermolovich
75f770a68f [BOLT][DWARF] Update handling of size 1 ranges and fix sub-programs with ranges
When output range is only one entry, and input is low_pc/high_pc do not convert
to ranges. This helps with size of .debug_ranges/.debug_rnglists. It also helps
when either low_pc/high_pc is 0. We not generating potentially invalid ranges
that result in LLDB error.

Also fixed handling of DW_AT_subprogram with ranges. This can be created with
-fbasic-block-sections=all.

Reviewed By: maksfb

Differential Revision: https://reviews.llvm.org/D156374
2023-07-30 17:32:32 -07:00
spupyrev
31e8a9f4d9 [BOLT] Add stale-related logging
Adding some logs related to stale profile matching. The new data can be helpful
to understand how "stale" the input profile is and how well the inference is
able to utilize the stale data.

Example of outputs on clang-10 built with LTO (profile collected on a year-old release):
```
BOLT-INFO: inferred profile for 2101 (18.52% of profiled, 100.00% of stale) functions responsible for 30.95% samples (14754697 out of 47670654)
BOLT-INFO: stale inference matched 89.42% of basic blocks (79052 out of 88402 stale) responsible for 76.99% samples (645737 out of 838719 stale)
```

LTO+AutoFDO:
```
BOLT-INFO: inferred profile for 6146 (57.57% of profiled, 100.00% of stale) functions responsible for 90.34% samples (50891403 out of 56330313)
BOLT-INFO: stale inference matched 74.55% of basic blocks (191295 out of 256589 stale) responsible for 57.30% samples (1288632 out of 2248799 stale)
```

Reviewed By: Amir, maksfb

Differential Revision: https://reviews.llvm.org/D154737
2023-07-27 08:56:57 -07:00
Amir Ayupov
e8a75c3f6e [BOLT][NFC] Simplify YAMLProfileReader
- Add `FunctionSet` type alias.
- Use any_of
- Use ErrorOr handling pattern

Reviewed By: #bolt, maksfb

Differential Revision: https://reviews.llvm.org/D156043
2023-07-26 08:26:16 -07:00
Amir Ayupov
1e0d08e872 [BOLT] Add blocks order kind to YAML profile header
Specify blocks order used in YAML profile. Needed to ensure profile backwards
compatibility with pre-D155514 DFS order by default.

Reviewed By: #bolt, maksfb

Differential Revision: https://reviews.llvm.org/D156176
2023-07-24 21:33:05 -07:00
Maksim Panchenko
dd630d831c [BOLT][NFC] Add post-CFG processing to MetadataRewriter interface
Add MetadataRewriter::postCFGInitializer().

Reviewed By: jobnoorman

Differential Revision: https://reviews.llvm.org/D155153
2023-07-13 11:09:57 -07:00
Maksim Panchenko
e6724cbd8a [BOLT] Add reading support for Linux ORC sections
Read ORC (oops rewind capability) info used for unwinding the stack by
Linux Kernel. The info is stored in .orc_unwind and .orc_unwind_ip
sections. There is also a related .orc_lookup section that is being
populated by the kernel during runtime. Contents of the sections are
sorted for quicker lookup by a post-link objtool.

Unless we modify stack access instructions, we don't have to change ORC
info attributed to instructions in the binary. However, we need to
update instruction addresses and sort both sections based on the new
layout.

For pretty printing, we add "--print-orc" option that prints ORC info
next to instructions in code dumps.

Reviewed By: Amir

Differential Revision: https://reviews.llvm.org/D154815
2023-07-13 11:07:29 -07:00
Shoaib Meenai
caf5b6a212 [bolt] Fix MSVC builds
We need to explicitly mark DWARFUnitInfo as non-copyable since MSVC's
STL has a `noexcept(false)` move constructor for `unordered_map`; see
the added comment for more details.

An alternative might be using SmallVector instead of std::vector, since
that never tries to copy elements [1]. That would result in a bunch of
API changes though, so I figured a smaller targeted fix was better.

[1] https://llvm.org/docs/ProgrammersManual.html#llvm-adt-smallvector-h

Reviewed By: ayermolo, maksfb

Differential Revision: https://reviews.llvm.org/D154924
2023-07-11 09:39:25 -07:00
Alexander Yermolovich
dcfa2ab534 [BOLT][DWARF] Change to process and write out TUs first then CUs in batches
To reduce memory footprint changed so that we process and write out TUs first,
reset DIEBuilder and process CUs. CUs are processed in buckets. First bucket
contains all the CUs with cross CU references. Rest processd one at a time.

clang-17 build in debug mode, by clang-17.
before
8:25.81 real, 834.37 user, 86.03 sys, 0 amem, 79525064 mmem
8:02.20 real, 820.46 user, 81.81 sys, 0 amem, 79501616 mmem
7:52.69 real, 802.01 user, 83.99 sys, 0 amem, 79534392 mmem

after
7:49.35 real, 822.04 user, 66.19 sys, 0 amem, 34934260 mmem
7:42.16 real, 825.46 user, 63.52 sys, 0 amem, 34951660 mmem
7:46.71 real, 821.11 user, 63.14 sys, 0 amem, 34981164 mmem

Reviewed By: maksfb

Differential Revision: https://reviews.llvm.org/D151909
2023-07-10 14:42:04 -07:00
Alexander Yermolovich
8362418748 [BOLT][DWARF] Output DWO files as they are being processed
Changed how we handle writing out .dwo and .dwp files. We now write out DWO
sections sooner and destroy DIEBuilder. This should decrease memory footprint.

Ran on clang-17 build in debug mode with split-dwarf.
before
8:07.49 real,   664.62 user,    69.00 sys,      0 amem, 41601612 mmem
8:07.06 real,   669.60 user,    68.75 sys,      0 amem, 41822588 mmem
8:00.36 real,   664.14 user,    66.36 sys,      0 amem, 41561548 mmem

after
8:21.85 real,   682.23 user,    69.64 sys,      0 amem, 39379880 mmem
8:04.58 real,   671.62 user,    66.50 sys,      0 amem, 39735800 mmem
8:10.02 real,   680.67 user,    67.24 sys,       0 amem, 39662888 mmem

Reviewed By: maksfb

Differential Revision: https://reviews.llvm.org/D151908
2023-07-10 14:42:04 -07:00
Alexander Yermolovich
c33536e9c3 [BOLT][DWARF] Numerous fixes for a new DWARFRewriter
* Some cleanup and minor fixes for the new debug information re-writer before moving on
to productatization.

* The new rewriter wasn't handling binary with DWARF5 and DWARF4 with
-fdebug-types-sections.

* Removed dead cross cu reference code.

* Added support for DW_AT_sibling.

* With the new re-writer abbrev number can change which can lead to offset of Type
Units changing. Before we would just copy raw data. Changed to write out Type
Unit List. This is generated by gdb-add-index.

* Fixed how bolt handles gdb-index generated by gdb-11 with types sections.
Simplified logic that handles variations of gdb-index.

* Clang can generate two type units with the same hash, but different content. LLD
does not de-duplicate when ThinLTO is involved. Changed so that TU hash and
offset are used to make TU's unique.

* It is possible to have references within location expression to another DIE.
Fixed it so that relative offset is updated correctly.

* Removed all the code related to patching.

* Removed dead code. Changed how we handling writting out TUs and TU Index. It now
  should fully work for DWARF4 and DWARF5.

* Removed unused arguments from some APIs, changed return type to void, and other
small cleanups.

Reviewed By: maksfb

Differential Revision: https://reviews.llvm.org/D151906
2023-07-10 14:42:03 -07:00
Rui Zhong
87fb0ea27e [BOLT][DWARF] Implement new mechanism for DWARFRewriter
This revision implement new mechanism for DWARFRewriter.
In the new mechanism, we adopt the same way with DWARFLinker did.
By parsing Debug information into IR, we are allowed to handle debug information more flexible.
Now the debug information updating process relies on IR and IR will be written out to binary once the updating finished.

A new class was added: DIEBuilder. This class is responsible for parsing debug information and raising it to the IR level.
This class is also used to write out the .debug_info and .debug_abbrev sections.
Since we output brand new Abbrev section we won't need to always convert low_pc/high_pc into ranges.
When conversion does happen we can also remove low_pc entry.

Reviewed By: maksfb, ayermolo

Differential Revision: https://reviews.llvm.org/D130315
2023-07-10 14:42:03 -07:00
Nico Weber
de7781ea42 Revert "[DWARF][BOLT] Implement new mechanism for DWARFRewriter"
This reverts commit 460a2244430fae192298a5fd9fa2a269e540e8c1.
It breaks building on macOS, and it was landed with a review URL
pointing to some Facebook-internal service.

Also reverts a bunch of follow-ups:

Revert "[BOLT][DWARF] Don't check string offsets"
This reverts commit f9d6f48c8bf5acaac07502403c41cf0b0d89c8d2.

Revert "[BOLT][DWARF] Change to process and write out TUs first then CUs in batches"
This reverts commit 88e95c1e4bb6e2ad3bfd185b96341ad5c09eff6b.

Revert "[BOLT][DWARF] Output DWO files as they are being processed"
This reverts commit 46ca2e3fcd419b1246357ed3b9cd36630f16e64d.

Revert "[BOLT][DWARF] Don't check string offsets"
This reverts commit cfe4a4b04f219a9dbb4e3fc01883437b6ff0e702.

Revert "[BOLT][DWARF] Numerous fixes for a new DWARFRewriter"
This reverts commit 2701a661daa393ad5901ac88d420d7aa931eda0d.
2023-07-07 08:07:01 -04:00