This provides a general mechanism for COFF, similar to /DISCARD/ in ELF
linker scripts, though the intent is specifically to discard the
.llvmbc and .llvmcmd sections. (See the discussion in #150897 and
#188398 for more details.)
The `bool serial` condition in scanRelocations disabled parallelism for
three cases: -z nocombreloc, MIPS, and PPC64. Resolve two cases:
- nocombreloc: .rela.dyn is now always created with combreloc=true so
non-relative relocations are sorted deterministically. Since
#187964 already separates relative relocations unconditionally,
the only remaining effect of -z nocombreloc is suppressing
DT_RELACOUNT (gated on ctx.arg.zCombreloc in DynamicSection).
- PPC64: After #181496 moved scanning into scanSectionImpl, the
sole thread-unsafe access is ctx.ppc64noTocRelax (DenseSet::insert).
Protect it with ctx.relocMutex, which is already used for rare
operations during parallel scanning.
MIPS retains serial scanning due to `MipsGotSection` mutations.
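A minimal sketch of the PPC64 fix, assuming a simplified context type. `Ctx`, `noteNoTocRelax`, and the use of `std::unordered_set` in place of `DenseSet` are illustrative stand-ins, not lld's actual declarations:

```cpp
#include <cstdint>
#include <mutex>
#include <unordered_set>

// Hypothetical stand-in for the linker context shared by scanning threads.
struct Ctx {
  std::mutex relocMutex;                   // guards rare shared mutations
  std::unordered_set<uint64_t> noTocRelax; // stand-in for ppc64noTocRelax
};

// Records a key from any scanning thread. Returns true if the key was
// newly inserted. The lock is held only for the duration of the insert.
bool noteNoTocRelax(Ctx &ctx, uint64_t key) {
  std::lock_guard<std::mutex> lock(ctx.relocMutex);
  return ctx.noTocRelax.insert(key).second;
}
```

Because the insert is rare, a single shared mutex is unlikely to become a bottleneck for the parallel scan.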
This patch fixes https://github.com/llvm/llvm-project/issues/187033
In BE8 mode, instruction bytes are reversed for sections containing
code. This logic currently assumes that Arm mapping symbols (e.g. $a,
$t, $d) are always associated with InputSections.
However, mapping symbols can also be defined in other section types such
as mergeable sections (SHF_MERGE). These are not represented as
InputSection, and attempting to cast them using
cast_if_present<InputSection> results in an assertion failure.
Add `markParallel` using level-synchronized `parallelFor`. Each BFS
level is processed in parallel; newly discovered sections are collected
in per-thread queues and merged for the next level.
The parallel path is used when `!TrackWhyLive && partitions.size()==1`.
`parallelFor` naturally degrades to serial when `--threads=1`.
Uses depth-limited inline recursion (depth<3) and optimistic
load-then-exchange dedup for best performance.
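The level structure and the load-then-exchange dedup can be sketched as follows. This is an illustration under stated assumptions, not the patch itself: `Graph`, `tryMark`, and `markLevels` are hypothetical names, the shards stand in for worker threads and run in a plain loop here, and the depth-limited inline recursion is omitted.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Hypothetical graph: edges[i] lists the sections that section i references.
using Graph = std::vector<std::vector<size_t>>;

// Optimistic dedup: a cheap load filters out already-live sections, so the
// more expensive exchange runs only when the flag still looks unset.
bool tryMark(std::atomic<bool> &live) {
  if (live.load(std::memory_order_relaxed))
    return false;
  return !live.exchange(true, std::memory_order_relaxed);
}

// Level-synchronized mark; numShards must be at least 1. Each shard models
// one worker thread with its own output queue.
std::vector<bool> markLevels(const Graph &edges,
                             const std::vector<size_t> &roots,
                             size_t numShards) {
  std::vector<std::atomic<bool>> live(edges.size());
  for (auto &flag : live)
    flag.store(false, std::memory_order_relaxed);
  std::vector<size_t> frontier;
  for (size_t r : roots)
    if (tryMark(live[r]))
      frontier.push_back(r);
  while (!frontier.empty()) {
    std::vector<std::vector<size_t>> next(numShards); // per-shard queues
    for (size_t shard = 0; shard < numShards; ++shard)
      for (size_t i = shard; i < frontier.size(); i += numShards)
        for (size_t succ : edges[frontier[i]])
          if (tryMark(live[succ]))
            next[shard].push_back(succ);
    frontier.clear();
    for (auto &queue : next) // merge the queues into the next level
      frontier.insert(frontier.end(), queue.begin(), queue.end());
  }
  std::vector<bool> result(edges.size());
  for (size_t i = 0; i < edges.size(); ++i)
    result[i] = live[i].load(std::memory_order_relaxed);
  return result;
}
```

Only the thread whose exchange flips the flag enqueues a section, so no section appears twice in a frontier even when several shards discover it in the same level.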
Linking a Release+Asserts clang (--gc-sections, --time-trace) on an old
x86-64:
8 threads: markLive 315ms -> 82ms (-234ms). Total 1562ms -> 1350ms
(1.16x).
16 threads: markLive 199ms -> 50ms (-149ms). Total 1017ms -> 862ms
(1.18x).
and on Apple M4: markLive 61ms -> 13ms. Total 317.3ms -> 272.7ms
(1.16x).
... out of the per-relocation resolveReloc and into a post-GC scan of
global symbols. This decouples the --as-needed logic from the mark
algorithm, simplifying the imminent parallel GC mark.
The generated assembly looks more optimized. In addition, this avoids a
widened load, which would cause a TSan-detected data race with parallel
--gc-sections (#189321).
Nested TaskGroups run serially to prevent deadlock, as documented by
https://reviews.llvm.org/D61115 and refined by
https://reviews.llvm.org/D148984 to use threadIndex.
Enable nested parallelism by having worker threads actively execute
tasks from the work queue while waiting (work-stealing), instead of
just blocking. Root-level TaskGroups (main thread) keep the efficient
blocking Latch::sync(), so there is no overhead for the common
non-nested case.
In lld, https://reviews.llvm.org/D131247 worked around the limitation
by passing a single root TaskGroup into OutputSection::writeTo and
spawning 4MB-chunked tasks into it. However, SyntheticSection::writeTo
calls with internal parallelism (e.g. GdbIndexSection,
MergeNoTailSection) still ran serially on worker threads. With this
change, their internal parallelFor/parallelForEach calls parallelize
automatically via helpSync work-stealing.
The increased parallelism can reorder error messages from parallel
phases (e.g. relocation processing during section writes), so one lld
test is updated to use --threads=1 for deterministic output.
Add
--bp-compression-sort-section=<glob>[=<layout_priority>[=<match_priority>]]
to let users split input sections into multiple compression groups, run
balanced partitioning independently per group, and leave out sections
that are poor candidates for BP. This replaces the old coarse
--bp-compression-sort with a more explicit, user-controlled one.
In ELF, the glob matches input section names (.text.unlikely.cold1). In
Mach-O, it matches the concatenated segment+section name (__TEXT__text).
layout_priority controls group placement in the final layout.
match_priority resolves conflicts when multiple globs match the same
section: explicit priority beats positional matching, and among
positional specs the last match wins.
A CRTP hook getCompressionSubgroupKey() allows backends to further
subdivide glob groups into independent BP instances. This allows the
Mach-O backend to separate cold functions via N_COLD_FUNC in the future.
The deprecated --bp-compression-sort option keeps its existing
function/data behavior by assigning sections to fixed legacy groups.
Two optimizations to make getSectionPiece O(1) for common cases:
1. For non-string fixed-size merge sections, use direct computation
(offset / entsize) instead of binary search.
2. Pre-resolve piece indices for non-section Defined symbols during
splitSections. The piece index and intra-piece offset are packed
into Defined::value as ((pieceIdx+1) << 32) | intraPieceOffset,
replacing repeated binary searches (MarkLive, includeInSymtab,
getRelocTargetVA) with a single upfront resolution.
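Both optimizations reduce to a few integer operations. The sketch below uses hypothetical helper names (`packPiece`, `pieceIndex`, `intraOffset`, `fixedPieceIndex`); only the packing formula itself comes from the description above. The remark about why `pieceIdx + 1` is stored is an assumption.

```cpp
#include <cstdint>

// Packs a piece index and an intra-piece offset as
// ((pieceIdx + 1) << 32) | intraPieceOffset.
// Assumption: storing pieceIdx + 1 keeps the upper half nonzero, which
// presumably lets a pre-resolved value be told apart from a plain offset.
// pieceIdx must be below 2^32 - 1 so the increment does not wrap.
constexpr uint64_t packPiece(uint32_t pieceIdx, uint32_t intraPieceOffset) {
  return ((uint64_t(pieceIdx) + 1) << 32) | intraPieceOffset;
}

// Recovers the piece index from a packed value.
constexpr uint32_t pieceIndex(uint64_t packed) {
  return uint32_t(packed >> 32) - 1;
}

// Recovers the offset within the piece from a packed value.
constexpr uint32_t intraOffset(uint64_t packed) {
  return uint32_t(packed);
}

// Optimization 1: fixed-size, non-string entries need no search at all.
constexpr uint64_t fixedPieceIndex(uint64_t offset, uint64_t entsize) {
  return offset / entsize;
}
```

For example, `packPiece(0, 5)` yields `0x100000005`, and an offset of 24 in a section with 8-byte entries maps to piece 3.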
On x86-64, references to mergeable strings use local labels:
leaq .LC0(%rip), %rax # R_X86_64_PC32 .LC0-4
The relocations use non-section symbols and benefit from optimization 2.
On many other targets (e.g. AArch64), the addend is 0 and the assembler
adjusts such relocations to reference section symbols, which still use
binary search.
On a clang link (clang-relassert reproduce tarball, x86-64):
- --gc-sections: 1.05x as fast
Clang plans to emit R_AARCH64_TLS_DTPREL64 in .debug_info (see PR
#146572). LLD currently fails to recognize this relocation.
This prevents the debugger from correctly locating TLS variables when
using the DWARF DW_OP_GNU_push_tls_address or DW_AT_location with DTPREL
offsets.
This patch adds support for R_AARCH64_TLS_DTPREL64 by mapping it to
R_DTPREL.
In LTO, part of LLVM's middle-end runs after linking has finished. LTO's
semantics depend on the complete set of extracted bitcode files being
known at this time. If the middle-end inserts new calls to library
functions (libfuncs) that are implemented in bitcode, this could extract
new bitcode object files into the link. These cannot be compiled,
leading to undefined symbol references.
Additionally, the middle-end in LTO may reason that such library
functions have no references, and it may internalize them, then
manipulate their API or even delete them. Afterwards, it may emit a call
to them, again producing undefined symbol references.
This patch resolves the former issue by ensuring that the middle end
emits no new references to symbols defined in bitcode, and it resolves
the latter issue by ensuring that extracted bitcode for libfuncs is
considered external, since new calls may be emitted to them at any time.
The new semantics are not yet established for MachO LLD, which does not
yet appear to have any special handling for libcalls in LTO. It also
does not yet support distributed ThinLTO; doing so would require
additional (de)serialization work.
This is the patch referenced in @ilovepi's and my talk at the last LLVM
devmeeting: "LT-Uh-Oh"
Gemini 3.1 was used in porting to COFF and WASM LLDs.
Linking large Hexagon binaries (e.g. ASan runtime with >8 MiB of text)
fails with R_HEX_B22_PCREL / R_HEX_PLT_B22_PCREL relocation overflow on
calls to PLT entries, even though the thunk infrastructure exists and
needsThunks is set.
needsThunk() always used s.getVA() to compute the branch destination,
even for PLT calls where the actual destination is the PLT entry. This
meant the distance check used the wrong address and failed to create
thunks when the PLT entry was out of B22_PCREL range.
Fix by using s.getPltVA() when expr == R_PLT_PC. Also override
getThunkSectionSpacing() so ThunkSections are pre-created at appropriate
intervals for large binaries.
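A simplified sketch of the corrected distance check. The function names and the exact range constant are assumptions for illustration: the range is modeled as a 22-bit signed word offset, i.e. [-2^23, 2^23) bytes, and `isPltCall` stands in for `expr == R_PLT_PC`.

```cpp
#include <cstdint>

// Assumed reach of a 22-bit signed word offset: [-2^23, 2^23) bytes.
constexpr int64_t kB22Range = int64_t(1) << 23;

// Returns true if a branch at src can reach dst directly.
bool inB22Range(uint64_t src, uint64_t dst) {
  int64_t dist = int64_t(dst - src);
  return dist >= -kB22Range && dist < kB22Range;
}

// The destination of a PLT call is the PLT entry, not the symbol itself,
// so the range check must use the PLT address in that case.
bool needsThunkSketch(uint64_t branchAddr, uint64_t symVA, uint64_t pltVA,
                      bool isPltCall) {
  uint64_t dst = isPltCall ? pltVA : symVA;
  return !inB22Range(branchAddr, dst);
}
```

With a nearby symbol address and a distant PLT entry, the check reports that a thunk is needed only when the call is treated as a PLT call, which is the case the old code missed.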
Previously, R_HEX_GD_PLT_* relocations would create PLT entries for TLS
symbols like 'foo' in addition to __tls_get_addr.
This fix skips NEEDS_PLT on TLS symbols with R_HEX_GD_PLT_*, creates the
__tls_get_addr symbol earlier with NEEDS_PLT, and changes
hexagonTLSSymbolUpdate to only rebind relocations.
Also add a test for the edge case where a GD_PLT relocation directly
references __tls_get_addr, which previously caused a crash due to
duplicate PLT entry creation.
---------
Co-authored-by: Fangrui Song <i@maskray.me>
Removes the patches introduced by #150897, which broke the documented
LTO embed feature for creating whole-program-bitcode representations of
executables, used in production analysis/rewriting toolsets. The feature
was available up to 21.1.8 and was broken by the 22.x release.
It previously allowed users to have a whole-program-bitcode section
`.llvmbc` embedded inside the final executable.
The in-process ThinLTO backend typically generates object files in
memory and adds them directly to the link, except when the ThinLTO cache
is in use. DTLTO is unusual in that it adds files to the link from disk
in all cases.
When the ThinLTO cache is not in use, ThinLTO adds files via an
`AddStreamFn` callback provided by the linker, which ultimately appends
to a `SmallVector` in LLD. When the cache is in use, the linker supplies
an `AddBufferFn` callback that adds files more efficiently (by moving
`MemoryBuffer` ownership).
This patch adds a mandatory `AddBufferFn` to the DTLTO ThinLTO backend.
The backend uses this to add files to the link more efficiently.
Additionally:
- Move AddStream from CGThinBackend to InProcessThinBackend, for reader
clarity.
- Modify linker comments that implied the AddBuffer path is
cache-specific.
For a Clang link (Debug build with sanitizers and instrumentation) using
an optimized toolchain (PGO non-LTO, llvmorg-22.1.0), measuring the mean
`Add DTLTO files to the link` time trace scope duration:
- On Windows (Windows 11 Pro Build 26200, AMD Family 25 @ ~4.5 GHz, 16
cores/32 threads, 64 GB RAM), this patch reduces the mean from
2799.148 ms to 157.972 ms.
- On Linux (Ubuntu 24.04.3 LTS Kernel 6.14, Ryzen 9 5950X, 16
cores/32 threads, boost up to 5.09 GHz, 64 GB RAM), this patch reduces
the mean from 255.291 ms to 41.630 ms.
Based on work by @romanova-ekaterina and @kbelochapka.
MachO --icf=safe and --icf=safe_thunks used to keep folding code from
object files that did not contain __llvm_addrsig, which was inconsistent
with the conservative ELF/COFF behavior. Mark all symbols in such
objects as address-significant instead, and add regression coverage for
both safe ICF modes with and without addrsig.
Move the "offset is outside the section" error for merge sections from
getSectionPiece to getSymVA, where we know the offset comes from a
section symbol + addend. Include the offset value in the diagnostic.
Accept offset == section_size (one-past-end) to match GNU ld behavior,
while rejecting offset > section_size. Skip out-of-bounds offsets in
MarkLive to avoid assertion failures in getSectionPiece.
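The accepted and rejected offsets can be summarized in a small classifier. `OffsetCheck` and `classifyMergeOffset` are hypothetical names for this sketch, not the functions touched by the patch:

```cpp
#include <cstdint>

enum class OffsetCheck { InBounds, OnePastEnd, OutOfBounds };

// Classifies an offset (section symbol value plus addend) against the
// size of a merge section.
constexpr OffsetCheck classifyMergeOffset(uint64_t offset,
                                          uint64_t sectionSize) {
  if (offset < sectionSize)
    return OffsetCheck::InBounds;
  if (offset == sectionSize)
    return OffsetCheck::OnePastEnd; // accepted, matching GNU ld
  return OffsetCheck::OutOfBounds;  // diagnosed; skipped during marking
}
```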
This change eliminates the Android-specific --android-memtag-* flags
from lld, replacing them with -z memtag-* generic equivalents. With
these generic flags, the linker will emit only the dynamic array tags
specified in the "Memtag ABI Extension to ELF", but no Android-specific
memtag note.
In addition, this change adds an --android-memtag-note flag which should
be used when the Android-specific memtag note should be emitted.
This change also modifies the clang driver to make use of the new flags.
In addOrphanSections, getRelocatedSection() only returns non-null for -r
or --emit-relocs links. Guard code blocks with `copyRelocs` to skip
unnecessary dyn_cast + getRelocatedSection calls per section in the
common case. Hoist copyRelocs and relocatable to local variables so the
compiler does not reload them through ctx on every loop iteration.
"Assign sections" decreases by 1ms.
When compiling WebAssembly with ThinLTO, functions are partitioned into
isolated `.bc` modules and dispatched to individual LTO backend threads.
During code generation, the `CoalesceFeaturesAndStripAtomics` pass
iterates over the module to gather the union of target features (like
`+atomics`) attached to defined functions. In particular when not using
threads, it lowers away atomics and TLS variables to their
single-threaded equivalents.
However, if a partitioned module only contains globally defined TLS
variables (e.g. there are no functions, or all functions were fully
inlined or stripped by dropDeadSymbols before ThinLTO optimization), the
module becomes completely devoid of function definitions. The coalescing
pass then falls back to fetching features from the `TargetMachine`.
Because in LTO the `TargetMachine` defaults to a generic target without
atomics enabled, the TLS is lowered away and the `wasm-feature-atomics`
flag is omitted from the resulting ThinLTO object partition, causing
`wasm-ld` to immediately reject it.
To fix this we take advantage of the fact that the linker always knows
whether threads are being used (via the --shared-memory flag). When
using shared memory, we enable +atomics and +bulk-memory in the
TargetMachine that is used for the backend, and the feature coalescing
pass will correctly detect the use of threads.
This only makes sense for atomics because of the global linker
configuration; for other features we wouldn't be able to do this, but we
don't rewrite away any other features anyway.
Use parallelFor to process files in parallel, collecting Symbol*
pointers per-file, then merge into the symbol table serially.
Linking clang-14 (208K .symtab entries) is 1.04x as fast.
Remove the combreloc guard from addReloc and mergeRels so that
relative relocations are always routed to relativeRelocs, even with -z
nocombreloc or --pack-dyn-relocs=android.
Update AndroidPackedRelocationSection::updateAllocSize to iterate
both relativeRelocs and relocs.
Previously, the flow was:
1. Parallel scan adds relative relocs to per-thread `relocsVec`
2. `mergeRels()` copies all into `relocs`
3. `partitionRels()` uses `stable_partition` to separate
Now, relative relocs are routed at `addReloc` time by checking
`reloc.type == relativeRel`. In `mergeRels`, sharded entries are
classified through the same `addReloc` path rather than blindly
appended. `relocsVec` may contain non-relative entries like
`R_AARCH64_AUTH_RELATIVE`.
This eliminates the `stable_partition` on the full relocation vector
(543K entries for clang) and avoids copying relative relocations into
`relocs` only to move them out again.
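A reduced model of the new flow, assuming simplified types. `DynReloc` and `RelocSection` are illustrative stand-ins that borrow the member names used above; they are not lld's class definitions.

```cpp
#include <cstdint>
#include <vector>

struct DynReloc {
  uint32_t type;
  uint64_t offset;
};

struct RelocSection {
  uint32_t relativeRel;                         // target's relative type
  std::vector<DynReloc> relativeRelocs;         // relative relocations
  std::vector<DynReloc> relocs;                 // everything else
  std::vector<std::vector<DynReloc>> relocsVec; // per-thread shards

  // Classify at insertion time instead of partitioning afterwards.
  void addReloc(const DynReloc &r) {
    (r.type == relativeRel ? relativeRelocs : relocs).push_back(r);
  }

  // Sharded entries go through the same classification path.
  void mergeRels() {
    for (auto &shard : relocsVec) {
      for (const DynReloc &r : shard)
        addReloc(r);
      shard.clear();
    }
  }
};
```

Since every entry is classified exactly once, no later pass has to scan the combined vector to pull the relative relocations back out.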
Linking an x86_64 release+assertions build of clang is 1.04x as fast.
`numRelativeRelocs` caches `relativeRelocs.size()` at `finalizeContents`
time for `DT_RELACOUNT`. Using a live `relativeRelocs.size()` would
cause `DynamicSection::writeTo` to emit an extra entry when thunks add
relocs after `.dynamic` is sized, overflowing into adjacent sections.
Tested by ppc64-long-branch-rel14.s.
Previously, the implicit warnings from force-bti (or gcs=always) weren't
possible to silence.
The force-ibt/cet-report flags could also be handled the same way, but I
haven't checked with GNU ld how they behave. And there, the force-ibt
flag only produces warnings if the IBT bit is missing, while cet-report
warns if either IBT or SHSTK are missing - but force-ibt probably
shouldn't implicitly start warning for missing SHSTK.
This addresses a discrepancy with GNU ld that was noted in #186173.
In LoongArch and RISC-V, the relaxation pass iterates over input sections
within executable output sections. When a linker script places a synthetic
section (e.g., .got) into such an output section, the linker would crash
because synthetic sections do not have the relaxAux field initialized.
The relaxAux data structure is only allocated for non-synthetic sections
in initSymbolAnchors. This patch adds the necessary null checks in the
relaxation loops (relaxOnce and finalizeRelax) to skip sections that
do not require relaxation.
A null check is also added to elf::initSymbolAnchors to ensure the
subsequent sorting of anchors is safe.
Fixes: #184757
Reviewers: MaskRay
Pull Request: https://github.com/llvm/llvm-project/pull/184758
This matches GNU ld, where gcs=always makes it implicitly warn about
missing GCS flags, by matching the existing code pattern used for BTI
and IBT.
Also test that warnings can be printed for both missing BTI and GCS for
the same object file.
This fixes #186173.
-u creates an Undefined with STT_NOTYPE. When an object file provides
another Undefined with STT_TLS for the same symbol, Symbol::resolve
only updated binding, leaving type as STT_NOTYPE. This caused
sym.isTls() to return false in postScanRelocations, skipping TLS GOT
entry creation and leading to an out-of-range R_X86_64_GOTTPOFF error.
Fix: in resolve(Undefined), when the existing type is STT_NOTYPE,
adopt the incoming type.
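The rule can be illustrated with a reduced symbol model. `UndefSym` and `resolveUndefined` are hypothetical names; the constants use the ELF values for STT_NOTYPE (0) and STT_TLS (6).

```cpp
#include <cstdint>

constexpr uint8_t kSttNotype = 0; // ELF STT_NOTYPE
constexpr uint8_t kSttTls = 6;    // ELF STT_TLS

// Reduced stand-in for an undefined symbol table entry.
struct UndefSym {
  uint8_t type;
  bool isTls() const { return type == kSttTls; }
};

// When an Undefined meets another Undefined, adopt the incoming type only
// if the existing one carries no type information.
void resolveUndefined(UndefSym &existing, uint8_t incomingType) {
  if (existing.type == kSttNotype)
    existing.type = incomingType;
}
```

A symbol first created by -u and later referenced as STT_TLS then reports `isTls()` as true, so TLS GOT entry creation is no longer skipped.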
Back in 6474d1b20 this test was updated, removing the NORMAL vs SHARED
distinction in the output checking. However many of the NORMAL-NEXT
lines were left unmodified, making them effectively disabled.
This restores and updates the expectations.
After ICF, multiple symbols may resolve to the same address but remain
as distinct Symbol pointers. When used as keys in thunkMap, this caused
redundant branch-extension thunks to be created for the same target. Fix
this by providing a custom DenseMapInfo for thunkMap that hashes and
compares Defined symbols by (isec, value) instead of pointer identity.
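The idea can be shown with the standard library in place of DenseMap. This is a sketch under that substitution: `Section`, `DefinedSym`, `SymLocHash`, `SymLocEq`, and the `int` mapped type are placeholders, and the real change supplies a DenseMapInfo specialization rather than hash and equality functors.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>

struct Section {};

// Reduced stand-in for a Defined symbol: a section plus an offset in it.
struct DefinedSym {
  const Section *isec;
  uint64_t value;
};

// Hash on the location the symbol resolves to, not on the pointer.
struct SymLocHash {
  size_t operator()(const DefinedSym *s) const {
    size_t h1 = std::hash<const Section *>{}(s->isec);
    size_t h2 = std::hash<uint64_t>{}(s->value);
    return h1 ^ (h2 + 0x9e3779b9u + (h1 << 6) + (h1 >> 2));
  }
};

// Two symbols are the same key if they name the same (isec, value).
struct SymLocEq {
  bool operator()(const DefinedSym *a, const DefinedSym *b) const {
    return a->isec == b->isec && a->value == b->value;
  }
};

using ThunkMap =
    std::unordered_map<const DefinedSym *, int, SymLocHash, SymLocEq>;
```

Two distinct symbol objects that ICF folded onto the same section and offset now find the same map entry, so only one thunk is recorded for that target.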
Without this change, passing -fthinlto-index causes -fpass-plugin
arguments to be ignored. We want to be able to use plugins with
distributed ThinLTO, so add support for this.
Implement RISCV::scanSectionImpl, following the pattern established
for x86 (#178846) and AArch64 (#181099). This merges the getRelExpr
and TLS handling for SHF_ALLOC sections into the target-specific
scanner, enabling devirtualization and eliminating abstraction
overhead.
- Inline relocation classification into scanSectionImpl with a switch
on relocation type, replacing the generic rs.scan() path.
- Use processR_PC/processR_PLT_PC for common PC-relative and PLT
relocations.
- Handle TLS IE and GD directly (RISC-V does not optimize GD/LD/IE).
- Replace TLS-optimization-specific expressions for TLSDESC, following
the x86 pattern: R_RELAX_TLS_GD_TO_IE -> R_GOT_PC,
R_RELAX_TLS_GD_TO_LE -> R_TPREL. Update relocateAlloc and relax()
to dispatch on relocation type instead of RelExpr for TLSDESC.
- Simplify getRelExpr to only handle relocations needed by
relocateNonAlloc and preprocessRelocs.
- Remove RISC-V-specific checks from handleTlsRelocation (isRISCV
variable, TLSDESC label special cases).
- Move R_RISCV_VENDOR handling into the relocation type switch. An
undefined vendor symbol now gets the standard undefined symbol error
instead of a vendor-specific diagnostic.
For a Win32 DLL, a .def file can have a custom executable base:
```
LIBRARY "stub.dll" BASE=0x10000000
```
Currently the parser enforces Base 10, but [Microsoft's
documentation](https://learn.microsoft.com/en-us/cpp/build/reference/rules-for-module-definition-statements?view=msvc-170)
states "Numeric arguments are specified in base 10 or hexadecimal".
This fixes that, and also HEAPSIZE and STACKSIZE (which use the same
function).
There are a few more instances of `getAsInteger` that expect base 10,
for ordinals and the VERSION directive. Since I don't have an
in-the-wild example of a .def file using hexadecimal for these, I am
wary about changing those too.
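A standalone sketch of the "base 10 or hexadecimal" rule, written against the C++17 standard library rather than lld's parser. `parseDefNumber` is a hypothetical helper, and treating a leading `0x`/`0X` as the hexadecimal marker is an assumption based on the BASE=0x10000000 example above.

```cpp
#include <charconv>
#include <cstdint>
#include <optional>
#include <string_view>

// Parses a numeric .def argument as decimal, or as hexadecimal when it
// carries a 0x/0X prefix. Returns std::nullopt on any malformed input.
std::optional<uint64_t> parseDefNumber(std::string_view s) {
  int base = 10;
  if (s.size() > 2 && s[0] == '0' && (s[1] == 'x' || s[1] == 'X')) {
    base = 16;
    s.remove_prefix(2);
  }
  if (s.empty())
    return std::nullopt;
  uint64_t value = 0;
  const char *end = s.data() + s.size();
  auto [ptr, ec] = std::from_chars(s.data(), end, value, base);
  // Reject overflow, non-digits, and trailing characters.
  if (ec != std::errc() || ptr != end)
    return std::nullopt;
  return value;
}
```

Unlike automatic base detection, this keeps a leading zero decimal, so `010` parses as ten rather than as octal eight.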
This both improves and simplifies `Inputs/weak_alias`. With the `.ll`
version we ended up using memory, `__stack_pointer`, and locals, but
LLVM ended up generating `call` rather than `call_indirect` for
`call_alias_ptr` and `call_direct_ptr`. With the assembly tests we can
ensure the usage of `call_indirect` while avoiding all the other
stuff.
RFC
https://discourse.llvm.org/t/rfc-dwarfdebug-fix-and-improve-handling-imported-entities-types-and-static-local-in-subprogram-and-lexical-block-scopes/68544
This patch moves the emission of global variables from
`DwarfDebug::beginModule()` to `DwarfDebug::endModule()`.
It has the following effects:
1. The order of debug entities in the resulting DWARF changes.
2. Currently, if a DISubprogram requires emission of both concrete
out-of-line and inlined subprogram DIEs, and such a subprogram contains
a static local variable, the DIE for the variable is emitted into the
concrete out-of-line subprogram DIE. As a result, the variable is not
available in the debugger when breaking at the inlined function instance.
It happens because static locals are emitted in
`DwarfDebug::beginModule()`, but abstract DIEs for functions that are
not completely inlined away are created only later during
`DwarfDebug::endFunctionImpl()` calls.
With this patch, DIEs for static local variables of subprograms that
have both inlined and concrete out-of-line instances are placed into
abstract subprogram DIEs. They become visible in the debugger when
breaking at concrete out-of-line and inlined function instances.
`llvm/test/DebugInfo/Generic/inlined-static-var.ll` illustrates that.
3. It will allow simplifying abstract subprogram DIE creation by
reverting https://github.com/llvm/llvm-project/pull/159104 later.
This is needed to simplify DWARF emission in the context of proper support
of function-local static variables which comes in the next patch
(https://reviews.llvm.org/D144008), making all function-local entities
handled in `DwarfDebug::endModuleImpl()`.
Authored-by: Kristina Bessonova <kbessonova@accesssoftek.com>
Co-authored-by: David Blaikie <dblaikie@gmail.com>
Co-authored-by: Vladislav Dzhidzhoev <vdzhidzhoev@accesssoftek.com>
findMaskR8() lacked an isDuplex() check, unlike findMaskR6(),
findMaskR11(), and findMaskR16() which all handle duplex instructions.
When the assembler generates R_HEX_8_X on a duplex SA1_addi instruction
(e.g. `{ r0 = add(r0, ##target); memw(r1+#0) = r2 }`), the wrong mask
0x00001fe0 placed relocation bits at [12:5] instead of [25:20],
corrupting the low sub-instruction (e.g. memw became memb).
Add the isDuplex() check returning 0x03f00000, and add a comprehensive
test covering all duplex instruction x relocation type combinations
across findMaskR6, findMaskR8, findMaskR11, and findMaskR16.
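To see why the mask choice matters, the sketch below deposits a relocation value into the bit positions selected by each of the two masks named above. `scatterBits` and `maskR8Sketch` are hypothetical names; the real findMaskR8 inspects the instruction word and handles additional encodings not modeled here.

```cpp
#include <cstdint>

// Deposits the low bits of value into the set bit positions of mask,
// lowest position first.
uint32_t scatterBits(uint32_t mask, uint32_t value) {
  uint32_t result = 0;
  for (uint32_t bit = 0; bit < 32; ++bit) {
    if ((mask >> bit) & 1) {
      result |= (value & 1) << bit;
      value >>= 1;
    }
  }
  return result;
}

// Only the two masks from the description: duplex instructions keep the
// field at bits [25:20], the fallback mask covers bits [12:5].
uint32_t maskR8Sketch(bool duplex) {
  return duplex ? 0x03f00000u : 0x00001fe0u;
}
```

With the fallback mask, the value lands in bits [12:5], which in a duplex word belong to the low sub-instruction; with the duplex mask, the same value lands in bits [25:20] and leaves those low bits untouched.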
Move the ArmCmseSGVeneer and ArmCmseSGSection class definitions from
SyntheticSections.h into the anonymous namespace in Arch/ARM.cpp, where
the implementations already reside. Rename ArmCmseSGVeneer to
CmseSGVeneer as it no longer needs the Arm prefix for disambiguation.
Implement LoongArch::scanSectionImpl, following the pattern established
for x86, PPC64, SystemZ, AArch64. This merges the getRelExpr and TLS
handling for SHF_ALLOC sections into the target-specific scanner,
enabling devirtualization and eliminating abstraction overhead.
- Inline relocation classification into scanSectionImpl with a switch
on relocation type, replacing the generic rs.scan() path.
- Use processR_PC/processR_PLT_PC for common PC-relative and PLT
relocations.
- Inline TLS handling: IE->LE optimization for _PC_ variants only (not
_PCADD_ or absolute), TLSDESC->IE/LE for non-extreme code model,
GD/LD flag setting without going through generic handleTlsRelocation.
- Remove adjustTlsExpr by inlining its logic into scanSectionImpl.
- Remove LoongArch-specific code from Relocations.cpp:
handleTlsRelocation, execOptimizeInLoongArch, and the sort condition.
- Simplify getRelExpr to only handle relocations needed by
relocateNonAlloc, scanEhSection, and the extreme code model fallback
in relocateAlloc.
The only expectations change here is that `__stack_pointer` is
no longer exported in the `archive-export.test` test. This is because
we don't enable the mutable-globals feature (since the assembly files
don't contain all the now-default features of the generic CPU).