When `-import-declaration` option is enabled, declaration import is
supported for functions. https://github.com/llvm/llvm-project/pull/88024
has the context for this option.
This patch supports declaration import for global variables in
distributed ThinLTO. The motivating use case is to propagate `dso_local`
attribute of global variables across modules, to optimize global
variable access when a binary is built with
`-fno-direct-access-external-data`.
* With `-fdirect-access-external-data`, non thread-local global
variables will [have `dso_local`
attributes](fe3c23b439/clang/lib/CodeGen/CodeGenModule.cpp (L1730-L1746)).
This optimizes the global variable access as shown by
https://gcc.godbolt.org/z/vMzWcKdh3
`DenseMap::lookup` returns by value (because it default-creates the
returned value if the key isn't present in the map), which means that we
do a lot of copying here. Since we assert that something is present in
the returned value two lines below this call, it's safe to use `.at`
here instead.
Copying and then destroying dense maps here is responsible for 60% of
the time spent in LTO indexing in a large internal build.
We've noticed that for large builds executing thin-link can take on the
order of 10s of minutes. We are only using a single thread to write the
sharded indices and import files for each input bitcode file. While we
need to ensure the index file produced lists modules in a deterministic
order, that doesn't prevent us from executing the rest of the work in
parallel.
In this change we use a thread pool to execute as much of the backend's
work as possible in parallel. In local testing on a machine with 80
cores, this change makes a thin-link for ~100,000 input files run in ~2
minutes. Without this change it takes upwards of 10 minutes.
---------
Co-authored-by: Nuri Amari <nuriamari@fb.com>
- the set used for targets under a callsite is simpler to use if iterators
are stable (it gets manipulated during updates)
- the set used to fetch the transitive closure of GUIDs under a node can
be left as a choice to the user.
This patch reduces the memory usage for import lists by employing
memory-efficient data structures.
With this patch, an import list for a given destination module is
basically DenseSet<uint32_t> with each element indexing into the
deduplication table containing tuples of:
{SourceModule, GUID, Definition/Declaration}
In one of our large applications, the peak memory usage goes down by
9.2% from 6.120GB to 5.555GB during the LTO indexing step.
This patch addresses several sources of space inefficiency associated
with std::unordered_map:
- std::unordered_map<GUID, ImportKind> takes up 16 bytes because of
padding even though ImportKind only carries one bit of information.
- std::unordered_map uses pointers to elements, both in the hash table
proper and for collision chains.
- We allocate an instance of std::unordered_map for each
{Destination Module, Source Module} pair for which we have at least
one import. Most import lists have less than 10 imports, so the
metadata like the size of std::unordered_map and the pointer to the
hash table costs a lot relative to the actual contents.
I'm planning to reduce the memory footprint of ThinLTO indexing by
changing ImportMapTy. A look-up of the import type will involve data
private to ImportMapTy, so it must be done by a member function of
ImportMapTy. This patch turns getImportType into a member function so
that a subsequent "real" change will just have to update the
implementation of the function in place.
The background is as follows. I'm planning to reduce the memory
footprint of ThinLTO indexing by changing ImportMapTy, the data
structure used for an import list. Once this patch lands, I'm
planning to change the type slightly. The new type alias allows us to
update the type without touching many places.
This patch forward ports the heterogeneous std::map::operator[]() from
C++26 so that we can look up the map without allocating an instance of
std::string when the key-value pair exists in the map.
The background is as follows. I'm planning to reduce the memory
footprint of ThinLTO indexing by changing ImportMapTy, the data
structure used for an import list. The new list will be a hash set of
tuples (SourceModule, GUID, ImportType) represented in a space
efficient manner. That means that as we iterate over the hash set, we
encounter SourceModule as many times as GUID. We don't want to create
a temporary instance of std::string every time we look up
ModuleToSummariesForIndex like:
auto &SummariesForIndex =
ModuleToSummariesForIndex[std::string(ILI.first)];
This patch removes the need to create the temporaries by enabling the
hetegeneous lookup with std::set<K, V, std::less<>> and forward
porting std::map::operator[]() from C++26.
This patch introduces a helper function collectImportStatistics. The
new function computes statistics of imports for
ComputeCrossModuleImport and dumpImportListForModule with no
functional change.
The background is as follows. I'm planning to reduce the memory
footprint of ThinLTO indexing by changing ImportMapTy, the data
structure used for an import list. The new list will be a hash set of
tuples (SourceModule, GUID, ImportType) represented in a space
efficient manner. That means that obtaining statistics like the
number of definitions per source module requires us to go through the
entire import list (for a given destination module).
Introducing a helper function now makes the callers more independent
of the underlying data structures used in ImportMapT.
This patch introduces getSourceModules to compute the list of source
modules in the ascending alphabetical order. The new function is
intended to hide implementation details of ImportMapTy while
simplifying FunctionImporter::importFunctions a little bit.
This patch introduces type alias ModuleToSummariesForIndexTy.
I'm planning to change the type slightly to allow heterogeneous lookup
(that is, std::map<K, V, std::less<>>) in a subsequent patch. The
problem is that changing the type affects many places. Using a type
alias reduces the impact.
This patch turns type alias ImportMapTy into a proper class to provide
a more intuitive interface like:
ImportList.addDefinition(...)
as opposed to:
FunctionImporter::addDefinition(ImportList, ...)
Also, this patch requires all non-const accesses to go through
addDefinition, maybeAddDeclaration, and addGUID while providing const
accesses via:
const ImportMapTyImpl &getImportMap() const { return ImportMap; }
I realize ImportMapTy may not be the best name as a class (maybe OK as
a type alias). I am not renaming ImportMapTy in this patch at least
because there are 47 mentions of ImportMapTy under llvm/.
I missed this one when I introduced helper functions in:
commit 3082a381f57ef2885c270f41f2955e08c79634c5
Author: Kazu Hirata <kazu@google.com>
Date: Thu Aug 22 12:06:47 2024 -0700
The new helper functions make the intent clearer while hiding
implementation details, including how we handle previously added
entries. Note that:
- If we are adding a GUID as a GlobalValueSummary::Definition, then we
override a previously added GlobalValueSummary::Declaration entry
for the same GUID.
- If we are adding a GUID as a GlobalValueSummary::Declaration, then a
previously added GlobalValueSummary::Definition entry for the same
GUID takes precedence, and no change is made.
Keeping the json-based input as it's useful for diagnostics or for driving the import by other means than contextual composition.
The support for the contextual profile is just another modality for constructing the import list (`WorkloadImportsManager::Workloads`).
Everything else - i.e. the actual importing logic - is already independent from how that list was obtained.
VTable value profiling can create reference edges from `mod1:func_foo`
to `mod2:local-vtable`. Indirect call profiling can create reference
edges from `mod1:func_foo` to `mod2:local_func_bar`.
Given a ref chain `mod1:func_foo -> mod2:local-var`,`local-var` doesn't
get imported by default.
Compiler checks / requires the module of 'local-var' is the same as the
function that referenced it(`mod1:func_foo`). This is to prevent
mis-compilation when both `mod1` and `mod2` has `local-var` of the same
name, and cpp files are compiled without full path.
This patch allows the import when one of the following conditions
happen:
1) Introduce an option `import-assume-local-unique`. When the compiler
user can guarantee that all files are compiled with full paths, they can
set this option.
2) When there is one instance of value summary.
Test:
* A/B testing this option alone gives -0.16% statistically consistent
cpu cycle reduction on one search workload (no throughput increase)
* Testing it together with existing more-efficient ICP bumps the
throughput increase by a margin (0.05%~0.1%)
* No regressions observed.
https://github.com/llvm/llvm-project/pull/87600 was reverted in order to
revert
6262763341.
Now https://github.com/llvm/llvm-project/pull/95482 is fix forward for
6262763341.
This patch is a reland for
https://github.com/llvm/llvm-project/pull/87600
**Changes on top of original patch**
In `llvm/include/llvm/IR/ModuleSummaryIndex.h`, make the type of
`GVSummaryPtrSet` an `unordered_set` which is more memory efficient when
the number of elements is smaller than 128 [1]
**Original commit message**
For distributed ThinLTO, the LTO indexing step generates combined
summary for each module, and postlink pipeline reads the combined
summary which stores the information for link-time optimization.
This patch populates the 'import type' of a summary in bitcode, and
updates bitcode reader to parse the bit correctly.
[1]
393eff4e02/llvm/lib/Support/SmallPtrSet.cpp (L43)
https://github.com/llvm/llvm-project/pull/95482 is a reland of
https://github.com/llvm/llvm-project/pull/88024.
https://github.com/llvm/llvm-project/pull/95482 keeps indexing memory
usage reasonable by using unordered_map and doesn't make other changes
to originally reviewed code.
While discussing possible ways to minimize indexing memory usage, Teresa
asked whether I need `ExportSetTy` as a map or a set is sufficient. This
PR implements the idea. It uses a set rather than a map to track exposed
ValueInfos.
Currently, `ExportLists` has two use cases, and neither needs to track a
ValueInfo's import/export status. So using a set is sufficient and
correct.
1) In both in-process and distributed ThinLTO, it's used to decide if a
function or global variable is visible [1] from another module after importing
creates additional cross-module references.
* If a cross-module call edge is seen today, the callee must be visible
to another module without keeping track of its export status already.
For instance, this [2] is how callees of direct calls get exported.
2) For in-process ThinLTO [3], it's used to compute lto cache key.
* The cache key computation already hashes [4] 'ImportList' , and 'ExportList' is
determined by 'ImportList'. So it's fine to not track 'import type' for export list.
[1] 66cd8ec4c0/llvm/lib/LTO/LTO.cpp (L1815-L1819)
[2] 66cd8ec4c0/llvm/lib/LTO/LTO.cpp (L1783-L1794)
[3] 66cd8ec4c0/llvm/lib/LTO/LTO.cpp (L1494-L1496)
[4] b76100e220/llvm/lib/LTO/LTO.cpp (L194-L222)
This reverts commit e33db249b53fb70dce62db3ebd82d42239bd1d9d.
The change from *set to *map increases memory usage, and caused indexing
OOM in some applications. Need to profile offline to bring the memory
usage down.
For distributed ThinLTO, the LTO indexing step generates combined
summary for each module, and postlink pipeline reads the combined
summary which stores the information for link-time optimization.
This patch populates the 'import type' of a summary in bitcode, and
updates bitcode reader to parse the bit correctly.
Originally, when `EnableImportMetadata` enabled, `SourceFileName` will
be recorded as `thinlto_src_module`. Now `SourceFileName` will be
recorded as `thinlto_src_file` and `ModuleIdentifier` will be recorded
as `thinlto_src_module`.
An example of a "workload definition" would be "the transitive closure of functions actually called to satisfy a RPC request", i.e. a (typically significantly) smaller subset of the transitive closure (static + possible indirect call targets) of callees. This means this workload definition is a type of flat dynamic profile.
Producing one is not in scope - it can be produced offline from traces, or from sample-based profiles, etc.
This patch adds awareness to ThinLTO of such a concept. A workload is defined as a root and a list of functions. All function references are by-name (more readable than GUIDs). In the case of aliases, the expectation is the list contains all the alternative names.
The workload definitions are presented to the linker as a json file, containing a dictionary. The keys are the roots, the values are the list of functions.
The import list for a module defining a root will be the functions listed for it in the profile.
Using names this way assumes unique names for internal functions, i.e. clang's `-funique-internal-linkage-names`.
Note that the behavior affects the entire module where a root is defined (i.e. different workloads best be defined in different modules), and does not affect modules that don't define roots.
Added a class to hold such common state. The goal is to both reduce the argument list of other utilities used by `computeImportForModule` (which will be brought as members in a subsequent patch), and to make it easy to extend such state later.
Also removed them from the header. They are there for test-only. This
simplifies further refactoring (as well as code comprehension)
Differential Revision: https://reviews.llvm.org/D159308
The import/export maps, and the ModuleToDefinedGVSummaries map, are all
indexed by module paths, which are StringRef obtained from the module
summary index, which already has a data structure than owns these
strings (the ModulePathStringTable). Because these other maps are also
StringMap, which makes a copy of the string key, we were keeping
multiple extra copies of the module paths, leading to memory overhead.
Change these to DenseMap keyed by StringRef, and document that the
strings are owned by the index.
The only exception is the llvm-link tool which synthesizes an import list
from command line options, and I have added a string cache to maintain
ownership there.
I measured around 5% memory reduction in the thin link of a large
binary.
Differential Revision: https://reviews.llvm.org/D156580
Since the symbols in the ThinLTO summary are indexed by GUID we can end
up in corner cases where a callee edge in the combined index goes to a
summary for a global variable. This could happen in the case of hash
collisions, and in the case of SamplePGO profiles could potentially happen
due to code changes (since we synthesize call edges to GUIDs that were
inlined callees in the profiled code).
Handle this by simply ignoring any non-FunctionSummary callees.
Differential Revision: https://reviews.llvm.org/D152406
After importing variables, we do some checking to ensure that variables
marked read or write only, which have been marked exported (e.g.
because a referencing function has been exported), are on at least one
module's imports list. This is because the read or write only variables
will be internalized, so we need a copy any any module that references
it.
This checking is overly conservative in the case of linkonce_odr or
other linkage types where there can already be a duplicate copy in
existence in the importing module, which therefore wouldn't need to
import it. Loosen up the checking for these linkage types.
Fixes https://github.com/llvm/llvm-project/issues/62468.
Differential Revision: https://reviews.llvm.org/D149630
This makes the logic for referenced globals reusable for import criteria
that don't use thresholds - in fact, we currently didn't consider any
thresholds when importing.
Differential Revision: https://reviews.llvm.org/D149298
This makes it easier to reuse the legality part for other import
policies that wouldn't use thresholds.
Importing un-inlinable functions is also legal, because they could be
further specialized in a context-specific way, without inlining.
Differential Revision: https://reviews.llvm.org/D148838
This logic was added in https://reviews.llvm.org/D95943 specifically to
handle an issue for non-prevailing global variables. It turns out that
it adds a new issue for prevailing glboal variables, since those could
be replaced by an available_externally definition and hence incorrectly
omitted from the output object file. Limit the import to non-prevailing
global variables to fix this, as suggested by @tejohnson.
The bulk of the diff is mechanical changes to thread isPrevailing
through to where it's needed and ensure it's available before the
relevant calls; the actual logic change itself is straightforward.
Fixes https://github.com/llvm/llvm-project/issues/61677
Reviewed By: tejohnson
Differential Revision: https://reviews.llvm.org/D146876
Conflicting module flags leads to a proper error for regular LTO but a crash
(report_fatal_error) for ThinLTO. Switch to createStringError to fix the crash
and match regular LTO.
For a local linkage GlobalObject in a non-prevailing COMDAT, it remains defined while its
leader has been made available_externally. This violates the COMDAT rule that
its members must be retained or discarded as a unit.
To fix this, update the regular LTO change D34803 to track local linkage
GlobalValues, and port the code to ThinLTO (GlobalAliases are not handled.)
This fixes two problems.
(a) `__cxx_global_var_init` in a non-prevailing COMDAT group used to
linger around (unreferenced, hence benign), and is now correctly discarded.
```
int foo();
inline int v = foo();
```
(b) Fix https://github.com/llvm/llvm-project/issues/58215:
as a size optimization, we place private `__profd_` in a COMDAT with a
`__profc_` key. When FuncImport.cpp makes `__profc_` available_externally due to
a non-prevailing COMDAT, `__profd_` incorrectly remains private. This change
makes the `__profd_` available_externally.
```
cat > c.h <<'eof'
extern void bar();
inline __attribute__((noinline)) void foo() {}
eof
cat > m1.cc <<'eof'
#include "c.h"
int main() {
bar();
foo();
}
eof
cat > m2.cc <<'eof'
#include "c.h"
__attribute__((noinline)) void bar() {
foo();
}
eof
clang -O2 -fprofile-generate=./t m1.cc m2.cc -flto -fuse-ld=lld -o t_gen
rm -fr t && ./t_gen && llvm-profdata show -function=foo t/default_*.profraw
clang -O2 -fprofile-generate=./t m1.cc m2.cc -flto=thin -fuse-ld=lld -o t_gen
rm -fr t && ./t_gen && llvm-profdata show -function=foo t/default_*.profraw
```
If a GlobalAlias references a GlobalValue which is just changed to
available_externally, change the GlobalAlias as well (e.g. C5/D5 comdats due to
cc1 -mconstructor-aliases). The GlobalAlias may be referenced by other
available_externally functions, so it cannot easily be removed.
Depends on D137441: we use available_externally to mark a GlobalAlias in a
non-prevailing COMDAT, similar to how we handle GlobalVariable/Function.
GlobalAlias may refer to a ConstantExpr, not changing GlobalAlias to
GlobalVariable gives flexibility for future extensions (the use case is niche.
For simplicity we don't handle it yet). In addition, available_externally
GlobalAlias is the most straightforward implementation and retains the aliasee
information to help optimizers.
See windows-vftable.ll: Windows vftable uses an alias pointing to a
private constant where the alias is the COMDAT leader. The COMDAT use case
is skeptical and ThinLTO does not discard the alias in the non-prevailing COMDAT.
This patch retains the behavior.
See new tests ctor-dtor-alias2.ll: depending on whether the complete object
destructor emitted, when ctor/dtor aliases are used, we may see D0/D2 COMDATs in
one TU and D0/D1/D2 in a D5 COMDAT in another TU.
Allow such a mix-and-match with `if (GO->getComdat()->getName() == GO->getName()) NonPrevailingComdats.insert(GO->getComdat());`
GlobalAlias handling in ThinLTO is still weird, but this patch should hopefully
improve the situation for at least all cases I can think of.
Reviewed By: tejohnson
Differential Revision: https://reviews.llvm.org/D135427