Modules may contain a mix of functions that participate or don't participate in callgraphs covered by a contextual profile. We currently have been importing all the functions under a context root in the module defining that root, but if the other functions there are covered by flat profiles, the result is difficult to reason about.
This patch allows moving everything under a context root (and that root) in its own module. For now, we expect a module with a filename matching the GUID of the function be present in the set of modules known by the linker. This mechanism can be improved in a later patch.
Subsequent patches will handle implementing "move" instead of "import" semantics for the root function (because we want to make sure only one version of the root exists - so the optimizations we perform are actually the ones being observed at runtime).
We can use *Set::insert_range to collapse:
for (auto Elem : Range)
Set.insert(E);
down to:
Set.insert_range(Range);
In some cases, we can further fold that into the set declaration.
DenseSet, SmallPtrSet, SmallSet, SetVector, and StringSet recently
gained C++23-style insert_range. This patch replaces:
Dest.insert(Src.begin(), Src.end());
with:
Dest.insert_range(Src);
This patch does not touch custom begin like succ_begin for now.
When `-import-declaration` option is enabled, declaration import is
supported for functions. https://github.com/llvm/llvm-project/pull/88024
has the context for this option.
This patch supports declaration import for global variables in
distributed ThinLTO. The motivating use case is to propagate `dso_local`
attribute of global variables across modules, to optimize global
variable access when a binary is built with
`-fno-direct-access-external-data`.
* With `-fdirect-access-external-data`, non thread-local global
variables will [have `dso_local`
attributes](fe3c23b439/clang/lib/CodeGen/CodeGenModule.cpp (L1730-L1746)).
This optimizes the global variable access as shown by
https://gcc.godbolt.org/z/vMzWcKdh3
`DenseMap::lookup` returns by value (because it default-creates the
returned value if the key isn't present in the map), which means that we
do a lot of copying here. Since we assert that something is present in
the returned value two lines below this call, it's safe to use `.at`
here instead.
Copying and then destroying dense maps here is responsible for 60% of
the time spent in LTO indexing in a large internal build.
We've noticed that for large builds executing thin-link can take on the
order of 10s of minutes. We are only using a single thread to write the
sharded indices and import files for each input bitcode file. While we
need to ensure the index file produced lists modules in a deterministic
order, that doesn't prevent us from executing the rest of the work in
parallel.
In this change we use a thread pool to execute as much of the backend's
work as possible in parallel. In local testing on a machine with 80
cores, this change makes a thin-link for ~100,000 input files run in ~2
minutes. Without this change it takes upwards of 10 minutes.
---------
Co-authored-by: Nuri Amari <nuriamari@fb.com>
- the set used for targets under a callsite is simpler to use if iterators
are stable (it gets manipulated during updates)
- the set used to fetch the transitive closure of GUIDs under a node can
be left as a choice to the user.
This patch reduces the memory usage for import lists by employing
memory-efficient data structures.
With this patch, an import list for a given destination module is
basically DenseSet<uint32_t> with each element indexing into the
deduplication table containing tuples of:
{SourceModule, GUID, Definition/Declaration}
In one of our large applications, the peak memory usage goes down by
9.2% from 6.120GB to 5.555GB during the LTO indexing step.
This patch addresses several sources of space inefficiency associated
with std::unordered_map:
- std::unordered_map<GUID, ImportKind> takes up 16 bytes because of
padding even though ImportKind only carries one bit of information.
- std::unordered_map uses pointers to elements, both in the hash table
proper and for collision chains.
- We allocate an instance of std::unordered_map for each
{Destination Module, Source Module} pair for which we have at least
one import. Most import lists have less than 10 imports, so the
metadata like the size of std::unordered_map and the pointer to the
hash table costs a lot relative to the actual contents.
I'm planning to reduce the memory footprint of ThinLTO indexing by
changing ImportMapTy. A look-up of the import type will involve data
private to ImportMapTy, so it must be done by a member function of
ImportMapTy. This patch turns getImportType into a member function so
that a subsequent "real" change will just have to update the
implementation of the function in place.
The background is as follows. I'm planning to reduce the memory
footprint of ThinLTO indexing by changing ImportMapTy, the data
structure used for an import list. Once this patch lands, I'm
planning to change the type slightly. The new type alias allows us to
update the type without touching many places.
This patch forward ports the heterogeneous std::map::operator[]() from
C++26 so that we can look up the map without allocating an instance of
std::string when the key-value pair exists in the map.
The background is as follows. I'm planning to reduce the memory
footprint of ThinLTO indexing by changing ImportMapTy, the data
structure used for an import list. The new list will be a hash set of
tuples (SourceModule, GUID, ImportType) represented in a space
efficient manner. That means that as we iterate over the hash set, we
encounter SourceModule as many times as GUID. We don't want to create
a temporary instance of std::string every time we look up
ModuleToSummariesForIndex like:
auto &SummariesForIndex =
ModuleToSummariesForIndex[std::string(ILI.first)];
This patch removes the need to create the temporaries by enabling the
hetegeneous lookup with std::set<K, V, std::less<>> and forward
porting std::map::operator[]() from C++26.
This patch introduces a helper function collectImportStatistics. The
new function computes statistics of imports for
ComputeCrossModuleImport and dumpImportListForModule with no
functional change.
The background is as follows. I'm planning to reduce the memory
footprint of ThinLTO indexing by changing ImportMapTy, the data
structure used for an import list. The new list will be a hash set of
tuples (SourceModule, GUID, ImportType) represented in a space
efficient manner. That means that obtaining statistics like the
number of definitions per source module requires us to go through the
entire import list (for a given destination module).
Introducing a helper function now makes the callers more independent
of the underlying data structures used in ImportMapT.
This patch introduces getSourceModules to compute the list of source
modules in the ascending alphabetical order. The new function is
intended to hide implementation details of ImportMapTy while
simplifying FunctionImporter::importFunctions a little bit.
This patch introduces type alias ModuleToSummariesForIndexTy.
I'm planning to change the type slightly to allow heterogeneous lookup
(that is, std::map<K, V, std::less<>>) in a subsequent patch. The
problem is that changing the type affects many places. Using a type
alias reduces the impact.
This patch turns type alias ImportMapTy into a proper class to provide
a more intuitive interface like:
ImportList.addDefinition(...)
as opposed to:
FunctionImporter::addDefinition(ImportList, ...)
Also, this patch requires all non-const accesses to go through
addDefinition, maybeAddDeclaration, and addGUID while providing const
accesses via:
const ImportMapTyImpl &getImportMap() const { return ImportMap; }
I realize ImportMapTy may not be the best name as a class (maybe OK as
a type alias). I am not renaming ImportMapTy in this patch at least
because there are 47 mentions of ImportMapTy under llvm/.
I missed this one when I introduced helper functions in:
commit 3082a381f57ef2885c270f41f2955e08c79634c5
Author: Kazu Hirata <kazu@google.com>
Date: Thu Aug 22 12:06:47 2024 -0700
The new helper functions make the intent clearer while hiding
implementation details, including how we handle previously added
entries. Note that:
- If we are adding a GUID as a GlobalValueSummary::Definition, then we
override a previously added GlobalValueSummary::Declaration entry
for the same GUID.
- If we are adding a GUID as a GlobalValueSummary::Declaration, then a
previously added GlobalValueSummary::Definition entry for the same
GUID takes precedence, and no change is made.
Keeping the json-based input as it's useful for diagnostics or for driving the import by other means than contextual composition.
The support for the contextual profile is just another modality for constructing the import list (`WorkloadImportsManager::Workloads`).
Everything else - i.e. the actual importing logic - is already independent from how that list was obtained.
VTable value profiling can create reference edges from `mod1:func_foo`
to `mod2:local-vtable`. Indirect call profiling can create reference
edges from `mod1:func_foo` to `mod2:local_func_bar`.
Given a ref chain `mod1:func_foo -> mod2:local-var`,`local-var` doesn't
get imported by default.
Compiler checks / requires the module of 'local-var' is the same as the
function that referenced it(`mod1:func_foo`). This is to prevent
mis-compilation when both `mod1` and `mod2` has `local-var` of the same
name, and cpp files are compiled without full path.
This patch allows the import when one of the following conditions
happen:
1) Introduce an option `import-assume-local-unique`. When the compiler
user can guarantee that all files are compiled with full paths, they can
set this option.
2) When there is one instance of value summary.
Test:
* A/B testing this option alone gives -0.16% statistically consistent
cpu cycle reduction on one search workload (no throughput increase)
* Testing it together with existing more-efficient ICP bumps the
throughput increase by a margin (0.05%~0.1%)
* No regressions observed.
https://github.com/llvm/llvm-project/pull/87600 was reverted in order to
revert
6262763341.
Now https://github.com/llvm/llvm-project/pull/95482 is fix forward for
6262763341.
This patch is a reland for
https://github.com/llvm/llvm-project/pull/87600
**Changes on top of original patch**
In `llvm/include/llvm/IR/ModuleSummaryIndex.h`, make the type of
`GVSummaryPtrSet` an `unordered_set` which is more memory efficient when
the number of elements is smaller than 128 [1]
**Original commit message**
For distributed ThinLTO, the LTO indexing step generates combined
summary for each module, and postlink pipeline reads the combined
summary which stores the information for link-time optimization.
This patch populates the 'import type' of a summary in bitcode, and
updates bitcode reader to parse the bit correctly.
[1]
393eff4e02/llvm/lib/Support/SmallPtrSet.cpp (L43)
https://github.com/llvm/llvm-project/pull/95482 is a reland of
https://github.com/llvm/llvm-project/pull/88024.
https://github.com/llvm/llvm-project/pull/95482 keeps indexing memory
usage reasonable by using unordered_map and doesn't make other changes
to originally reviewed code.
While discussing possible ways to minimize indexing memory usage, Teresa
asked whether I need `ExportSetTy` as a map or a set is sufficient. This
PR implements the idea. It uses a set rather than a map to track exposed
ValueInfos.
Currently, `ExportLists` has two use cases, and neither needs to track a
ValueInfo's import/export status. So using a set is sufficient and
correct.
1) In both in-process and distributed ThinLTO, it's used to decide if a
function or global variable is visible [1] from another module after importing
creates additional cross-module references.
* If a cross-module call edge is seen today, the callee must be visible
to another module without keeping track of its export status already.
For instance, this [2] is how callees of direct calls get exported.
2) For in-process ThinLTO [3], it's used to compute lto cache key.
* The cache key computation already hashes [4] 'ImportList' , and 'ExportList' is
determined by 'ImportList'. So it's fine to not track 'import type' for export list.
[1] 66cd8ec4c0/llvm/lib/LTO/LTO.cpp (L1815-L1819)
[2] 66cd8ec4c0/llvm/lib/LTO/LTO.cpp (L1783-L1794)
[3] 66cd8ec4c0/llvm/lib/LTO/LTO.cpp (L1494-L1496)
[4] b76100e220/llvm/lib/LTO/LTO.cpp (L194-L222)
This reverts commit e33db249b53fb70dce62db3ebd82d42239bd1d9d.
The change from *set to *map increases memory usage, and caused indexing
OOM in some applications. Need to profile offline to bring the memory
usage down.
For distributed ThinLTO, the LTO indexing step generates combined
summary for each module, and postlink pipeline reads the combined
summary which stores the information for link-time optimization.
This patch populates the 'import type' of a summary in bitcode, and
updates bitcode reader to parse the bit correctly.
Originally, when `EnableImportMetadata` enabled, `SourceFileName` will
be recorded as `thinlto_src_module`. Now `SourceFileName` will be
recorded as `thinlto_src_file` and `ModuleIdentifier` will be recorded
as `thinlto_src_module`.
An example of a "workload definition" would be "the transitive closure of functions actually called to satisfy a RPC request", i.e. a (typically significantly) smaller subset of the transitive closure (static + possible indirect call targets) of callees. This means this workload definition is a type of flat dynamic profile.
Producing one is not in scope - it can be produced offline from traces, or from sample-based profiles, etc.
This patch adds awareness to ThinLTO of such a concept. A workload is defined as a root and a list of functions. All function references are by-name (more readable than GUIDs). In the case of aliases, the expectation is the list contains all the alternative names.
The workload definitions are presented to the linker as a json file, containing a dictionary. The keys are the roots, the values are the list of functions.
The import list for a module defining a root will be the functions listed for it in the profile.
Using names this way assumes unique names for internal functions, i.e. clang's `-funique-internal-linkage-names`.
Note that the behavior affects the entire module where a root is defined (i.e. different workloads best be defined in different modules), and does not affect modules that don't define roots.
Added a class to hold such common state. The goal is to both reduce the argument list of other utilities used by `computeImportForModule` (which will be brought as members in a subsequent patch), and to make it easy to extend such state later.
Also removed them from the header. They are there for test-only. This
simplifies further refactoring (as well as code comprehension)
Differential Revision: https://reviews.llvm.org/D159308
The import/export maps, and the ModuleToDefinedGVSummaries map, are all
indexed by module paths, which are StringRef obtained from the module
summary index, which already has a data structure than owns these
strings (the ModulePathStringTable). Because these other maps are also
StringMap, which makes a copy of the string key, we were keeping
multiple extra copies of the module paths, leading to memory overhead.
Change these to DenseMap keyed by StringRef, and document that the
strings are owned by the index.
The only exception is the llvm-link tool which synthesizes an import list
from command line options, and I have added a string cache to maintain
ownership there.
I measured around 5% memory reduction in the thin link of a large
binary.
Differential Revision: https://reviews.llvm.org/D156580
Since the symbols in the ThinLTO summary are indexed by GUID we can end
up in corner cases where a callee edge in the combined index goes to a
summary for a global variable. This could happen in the case of hash
collisions, and in the case of SamplePGO profiles could potentially happen
due to code changes (since we synthesize call edges to GUIDs that were
inlined callees in the profiled code).
Handle this by simply ignoring any non-FunctionSummary callees.
Differential Revision: https://reviews.llvm.org/D152406
After importing variables, we do some checking to ensure that variables
marked read or write only, which have been marked exported (e.g.
because a referencing function has been exported), are on at least one
module's imports list. This is because the read or write only variables
will be internalized, so we need a copy any any module that references
it.
This checking is overly conservative in the case of linkonce_odr or
other linkage types where there can already be a duplicate copy in
existence in the importing module, which therefore wouldn't need to
import it. Loosen up the checking for these linkage types.
Fixes https://github.com/llvm/llvm-project/issues/62468.
Differential Revision: https://reviews.llvm.org/D149630
This makes the logic for referenced globals reusable for import criteria
that don't use thresholds - in fact, we currently didn't consider any
thresholds when importing.
Differential Revision: https://reviews.llvm.org/D149298
This makes it easier to reuse the legality part for other import
policies that wouldn't use thresholds.
Importing un-inlinable functions is also legal, because they could be
further specialized in a context-specific way, without inlining.
Differential Revision: https://reviews.llvm.org/D148838