See https://discourse.llvm.org/t/rfc-keep-globalvalue-guids-stable/84801
for context.
This is a non-functional change which just changes the interface of
GlobalValue, in preparation for future functional changes. This part
touches a fair few users, so is split out for ease of review. Future
changes to the GlobalValue implementation can then be focused purely on
that class.
This does the following:
* Rename GlobalValue::getGUID(StringRef) to
getGUIDAssumingExternalLinkage. This is simply making explicit at the
callsite what is currently implicit.
* Where possible, migrate users to directly calling getGUID on a
GlobalValue instance.
* Otherwise, where possible, have them call the newly renamed
getGUIDAssumingExternalLinkage, to make the assumption explicit.
There are a few cases where neither of the above are possible, as the
caller saves and reconstructs the necessary information to compute the
GUID themselves. We want to migrate these callers eventually, but for
this first step we leave them be.
We can use *Set::insert_range to collapse:
for (auto Elem : Range)
Set.insert(E);
down to:
Set.insert_range(Range);
In some cases, we can further fold that into the set declaration.
This was disabled by default earlier while some failures were
investigated and ultimately fixed. It has been tested more extensively
since and can be enabled by default.
Improve the information printed when -memprof-report-hinted-sizes is
enabled. Now print the full context hash computed from the original
profile, similar to what we do when reporting matching statistics. This
will make it easier to correlate with the profile.
Note that the full context hash must be computed at profile match time
and saved in the metadata and summary, because we may trim the context
during matching when it isn't needed for distinguishing hotness.
Similarly, due to the context trimming, we may have more than one full
context id and total size pair per MIB in the metadata and summary,
which now get a list of these pairs.
Remove the old aggregate size from the metadata and summary support.
One other change from the prior support is that we no longer write the
size information into the combined index for the LTO backends, which
don't use this information, which reduces unnecessary bloat in
distributed index files.
The -enable-memprof-indirect-call-support meant to guard the recently
added memprof ICP support was not used in enough places. Specifically,
it was not checked in mayHaveMemprofSummary, which is called from the
ThinLTO backend applyImports. This led to failures when checking the
callsite records, as we incorrectly expected records for indirect calls.
Fix the option to be checked in all necessary locations, and add
testing.
This patch enables support for cloning in indirect callsites.
This is done by synthesizing callsite records for each virtual call
target from the profile metadata. In the thin link all the synthesized
records for a particular indirect callsite initially share the same
context node, but support is added to partition the callsites and
outgoing edges based on the callee function, creating a separate node
for each target.
In the LTO backend, when cloning is needed we first perform indirect
call promotion, then change the target of the new direct call to the
desired clone.
Note this is ThinLTO-specific, since for regular LTO indirect call
promotion should have already occurred.
During the ThinLTO indexing step for one of our large applications, we
create 4 million instances of FunctionSummary.
Changing:
std::vector<EdgeTy> CallGraphEdgeList;
to:
SmallVector<EdgeTy, 0> CallGraphEdgeList;
in FunctionSummary reduces the size of each instance by 8 bytes. The
rest of the patch makes the same change to other places so that the
types stay compatible across function boundaries.
The primary motivation is to remove `EntryCount` from `FunctionSummary`.
This frees 8 bytes out of `sizeof(FunctionSummary)` (136 bytes as of
64498c5483).
While I'm at it, this PR clean up {SummaryBasedOptimizations,
SyntheticCountsPropagation} since they were not used and there are no
plans to further invest on them.
With this patch, bitcode writer writes a placeholder 0 at the byte
offset of `EntryCount` and bitcode reader can parse the function entry
count at the correct byte offset. Added a TODO to stop writing
`EntryCount` and bump bitcode version
During the ThinLTO indexing step for one of our large applications, we
create 7.5 million instances of GlobalValueSummary.
Changing:
std::vector<ValueInfo> RefEdgeList;
to:
SmallVector<ValueInfo, 0> RefEdgeList;
in GlobalValueSummary reduces the size of each instance by 8 bytes.
The rest of the patch makes the same change to other places so that
the types stay compatible across function boundaries.
If requested, via the -memprof-report-hinted-sizes option, track the
total profiled size of each MIB through the thin link, then report on
the corresponding allocation coldness after all cloning is complete.
To save size, a different bitcode record type is used for the allocation
info when the option is specified, and the sizes are kept separate from
the MIBs in the index.
This patch adds a variant of getValueProfDataFromInst that returns
std::vector<InstrProfValueData> instead of
std::unique<InstrProfValueData[]>. The new return type carries the
length with it, so we can drop out parameter ActualNumValueData.
Also, the caller can directly feed the return value into a range-based
for loop as shown in the patch.
I'm planning to migrate other callers of getValueProfDataFromInst to
the new variant in follow-up patches.
Callers of getPromotionCandidatesForInstruction pass NumVals as an out
parameter for the number of value-count pairs of the value profiling
data, but nobody uses the out parameter.
This patch removes the parameter and updates the callers. Note that
the number of value-count pairs is still available via
getPromotionCandidatesForInstruction(...).size().
If an ifunc has local linkage, do not add it into ref edges and mark its
referencer (a function or global variable) not eligible for import. An
ifunc doesn't have summary and ThinLTO cannot promote it. Importing the
referencer may cause linkage errors.
To reference a similar fix, https://reviews.llvm.org/D158961 marks
callers of local ifunc not eligible for import to fix
https://github.com/llvm/llvm-project/issues/58740
Also reapply "[llvm] Teach whole program devirtualization about
relative vtables"
This reverts commit 1c604a9780fcfe92a99d539913553f0835b81de3 and
474f5efebed24547e76d022f0c5ffcc9db97ce6f.
The motivating use case is to support import the function declaration
across modules to construct call graph edges for indirect calls [1]
when importing the function definition costs too much compile time
(e.g., the function is too large has no `noinline` attribute).
1. Currently, when the compiled IR module doesn't have a function
definition but its postlink combined summary contains the function
summary or a global alias summary with this function as aliasee, the
function definition will be imported from source module by IRMover. The
implementation is in FunctionImporter::importFunctions [2]
2. In order for FunctionImporter to import a declaration of a function,
both function summary and alias summary need to carry the def / decl
state. Specifically, all existing summary fields doesn't differ across
import modules, but the def / decl state of is decided by
`<ImportModule, Function>`.
This change encodes the def/decl state in `GlobalValueSummary::GVFlags`.
In the subsequent changes
1. The indexing step `computeImportForModule` [3]
will compute the set of definitions and the set of declarations for each
module, and passing on the information to bitcode writer.
2. Bitcode writer will look up the def/decl state and sets the state
when it writes out the flag value. This is demonstrated in
https://github.com/llvm/llvm-project/pull/87600
3. Function importer will read the def/decl state when reading the
combined summary to figure out two sets of global values, and IRMover
will be updated to import the declaration (aka linkGlobalValuePrototype [4])
into the destination module.
- The next change is https://github.com/llvm/llvm-project/pull/87600
[1] mentioned in rfc https://discourse.llvm.org/t/rfc-for-better-call-graph-sort-build-a-more-complete-call-graph-by-adding-more-indirect-call-edges/74029#support-cross-module-function-declaration-import-5
[2] 3b337242ee/llvm/lib/Transforms/IPO/FunctionImport.cpp (L1608-L1764)
[3] 3b337242ee/llvm/lib/Transforms/IPO/FunctionImport.cpp (L856)
[4] 3b337242ee/llvm/lib/Linker/IRMover.cpp (L605)
Add annotated vtable GUID as referenced variables in per function
summary, and update bitcode writer to create value-ids for these
referenced vtables.
- This is the part3 of type profiling work, and described in the "Virtual Table Definition Import" [1] section of the
RFC.
[1] https://github.com/llvm/llvm-project/pull/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
This adds support for a HasTailCall flag on function call edges in the
ThinLTO summary. It is intended for use in aiding discovery of missing
frames from tail calls in profiled call stacks for MemProf of profiled
binaries that did not disable tail call elimination. A follow on change
will add the use of this new flag during MemProf context disambiguation.
The new flag is encoded in the bitcode along with either the hotness
flag from the profile, or the relative block frequency under the
-write-relbf-to-summary flag when there is no profile data.
Because we now will always have some additional call edge information, I
have removed the non-profile function summary record format, and we
simply encode the tail call flag along with a hotness type of none when
there is no profile information or relative block frequency. The change
of record format and name caused most of the test case changes.
I have added explicit testing of generation of the new tail call flag
into the bitcode and IR assembly format as part of the changes to
llvm/test/Bitcode/thinlto-function-summary-refgraph.ll. I have also
added round trip testing through assembly and bitcode to
llvm/test/Assembler/thinlto-summary.ll.
The `BlockFrequency` class abstracts `uint64_t` frequency values. Use it
more consistently in various APIs and disable implicit conversion to
make usage more consistent and explicit.
- Use `BlockFrequency Freq` parameter for `setBlockFreq`,
`getProfileCountFromFreq` and `setBlockFreqAndScale` functions.
- Return `BlockFrequency` in `getEntryFreq()` functions.
- While on it change some `const BlockFrequency& Freq` parameters to
plain `BlockFreqency Freq`.
- Mark `BlockFrequency(uint64_t)` constructor as explicit.
- Add missing `BlockFrequency::operator!=`.
- Remove `uint64_t BlockFreqency::getMaxFrequency()`.
- Add `BlockFrequency BlockFrequency::max()` function.
Fix https://github.com/llvm/llvm-project/issues/58740
The `target_clones` attribute results in ifunc on eligible targets
(Linux glibc/Android or FreeBSD). If the function has internal linkage,
we will get an internal linkage ifunc.
```
__attribute__((target_clones("popcnt", "default")))
static int foo(int n) { return __builtin_popcount(n); }
int use(int n) { return foo(n); }
@foo.ifunc = internal ifunc i32 (i32), ptr @foo.resolver
define internal nonnull ptr @foo.resolver() comdat {
; local linkage comdat is another issue that should be fixed
...
select i1 %.not, ptr @foo.default.1, ptr @foo.popcnt.0
...
}
define internal i32 @foo.default.1(i32 noundef %n)
```
ifuncs are not included in module summaries, so LTO doesn't know the
local linkage `foo.default.1` referenced by `foo.resolver`
should be promoted. If a caller of `foo` (e.g. `use`) is imported,
the local linkage `foo.resolver` will be cloned as a definition
(IRLinker::shouldLink), leading to linker errors.
```
ld.lld: error: undefined hidden symbol: foo.default.1.llvm.8017227050314953235
>>> referenced by bar.c
>>> lto.tmp:(foo.ifunc)
```
As a simple fix, just mark `use` as not eligible for import. Non-local
linkage ifuncs do not have the problem, because they are not imported,
and not cloned when a caller is imported.
---
https://reviews.llvm.org/D82745 contains a more involved fix, though the
original bug it intended to fix
(https://github.com/llvm/llvm-project/issues/45833) now works.
Note: importing ifunc is tricky.
If we import an ifunc, we need to make sure the resolver and the
implementation are in the translation unit, as required by
https://sourceware.org/glibc/wiki/GNU_IFUNC
> Requirement (a): Resolver must be defined in the same translation unit as the implementations.
This is infeasible if the implementation is changed to
available_externally.
In addition, the imported ifunc may be referenced by two translation
units. This doesn't work with PowerPC32 -msecure-plt
(https://maskray.me/blog/2021-01-18-gnu-indirect-function).
At the very least, every referencing translation unit needs one extra
IRELATIVE dynamic relocation.
At least for the local linkage ifunc case, it doesn't have much use
outside of `target_clones`, as a global pointer is usually a better
replacement.
I think ifuncs just have too many pitfalls to design more IR features
around it to optimize them.
Reviewed By: tejohnson
Differential Revision: https://reviews.llvm.org/D158961
Similar to D156016 for MapVector.
This brings back commit fae7b98c221b5b28797f7b56b656b6b819d99f27 with a
fix to llvm/unittests/Support/ThreadPool.cpp's `_WIN32` code path.
This is failing on Windows MSVC builds:
llvm\unittests\Support\ThreadPool.cpp(380): error C2440: 'return': cannot convert from 'Vector' to 'std::vector<llvm::BitVector,std::allocator<llvm::BitVector>>'
with
[
Vector=llvm::SmallVector<llvm::BitVector,0>
]
SmallVector<*, 0> is often a better replacement for std::vector :
both the object size and the code size are smaller.
(SmallMapVector uses SmallVector as well, but it is not common.)
clang size decreases by 0.0226%.
instructions:u decreases 0.037% when compiling a sqlite3 amalgram.
Reviewed By: JDevlieghere
Differential Revision: https://reviews.llvm.org/D156016
Here's a high level summary of the changes in this patch. For more
information on rational, see the RFC.
(https://discourse.llvm.org/t/rfc-a-unified-lto-bitcode-frontend/61774).
- Add config parameter to LTO backend, specifying which LTO mode is
desired when using unified LTO.
- Add unified LTO flag to the summary index for efficiency. Unified
LTO modules can be detected without parsing the module.
- Make sure that the ModuleID is generated by incorporating more types
of symbols.
Differential Revision: https://reviews.llvm.org/D123803
This adds a type_checked_load_relative intrinsic whose semantics it is to
load a relative function pointer.
A relative function pointer is a pointer to a 32bit value that when
added to its address yields the address of the function.
Differential Revision: https://reviews.llvm.org/D143204
Applies ThinLTO cloning decisions made during the thin link and
recorded in the summary index to the IR during the ThinLTO backend.
Depends on D141077.
Differential Revision: https://reviews.llvm.org/D149117
As pointed out in
https://discourse.llvm.org/t/undeterministic-thin-index-file/69985, the
block count added to distributed ThinLTO index files breaks incremental
builds on ThinLTO - if any linked file has a different number of BBs,
then the accumulated sum placed in the index files will change, causing
all ThinLTO backend compiles to be redone.
The block count is only used for scaling of partial sample profiles, and
was added in D80403 for D79831.
This patch simply removes this field from the index files of non partial
sample profile compiles, which is NFC on the output of the compiler.
We subsequently need to see if this can be removed for partial sample
profiles without signficant performance loss, or redesigned in a way
that does not destroy caching.
Differential Revision: https://reviews.llvm.org/D148746
Prior to this patch, WPD was not acting on relative-vtables in C++. This
involves teaching WPD about these things:
- llvm.load.relative which is how relative-vtables are indexed (instead of GEP)
- dso_local_equivalent which is used in the vtable itself when taking the
offset between a virtual function and vtable
- Update llvm/test/ThinLTO/X86/devirt.ll to use opaque pointers and add
equivalent tests for RV
Differential Revision: https://reviews.llvm.org/D134320
We were not summarizing a function alias in the vtable, leading to
incorrect WPD in some cases, and missing WPD in others.
Specifically, we would end up ignoring function aliases as they aren't
summarized, so we could incorrectly devirtualize if there was a single
other non-alias function in a compatible vtable. And if there was only
one implementation, but it was an alias, we would not be able to
identify and perform the single implementation devirtualization.
Handling the alias summary correctly also required fixing the handling
in mustBeUnreachableFunction, so that it is not incorrectly ignored.
Regular LTO is conservatively correct because it will skip
devirtualizing when any pointer within a vtable is not a function.
However, it needs additional work to be able to take advantage of
function alias within the vtable that is in fact the only
implementation. For that reason, the Regular LTO testing in the second
test case is currently disabled, and will be enabled along with a follow
on enhancement fix for Regular LTO WPD.
Differential Revision: https://reviews.llvm.org/D144209
This restores commit 98ed423361de2f9dc0113a31be2aa04524489ca9 and
follow on fix 00c22351ba697dbddb4b5bf0ad94e4bcea4b316b, which were
reverted in 5d938eb6f79b16f55266dd23d5df831f552ea082 due to an
MSVC bot failure. I've included a fix for that failure.
Differential Revision: https://reviews.llvm.org/D135714
This reverts commit 00c22351ba697dbddb4b5bf0ad94e4bcea4b316b.
This reverts commit 98ed423361de2f9dc0113a31be2aa04524489ca9.
Seemingly MSVC has some kind of issue with this patch, in terms of linking:
https://lab.llvm.org/buildbot/#/builders/123/builds/14137
I'll post more detail on D135714 momentarily.
This restores 47459455009db4790ffc3765a2ec0f8b4934c2a4, which was
reverted in commit 452a14efc84edf808d1e2953dad2c694972b312f, along with
fixes for a couple of bot failures.
Implements the ThinLTO summary support for memprof related metadata.
This includes support for the assembly format, and for building the
summary from IR during ModuleSummaryAnalysis.
To reduce space in both the bitcode format and the in memory index,
we do 2 things:
1. We keep a single vector of all uniq stack id hashes, and record the
index into this vector in the callsite and allocation memprof
summaries.
2. When building the combined index during the LTO link, the callsite
and allocation memprof summaries are only kept on the FunctionSummary
of the prevailing copy.
Differential Revision: https://reviews.llvm.org/D135714
This reverts commit bf8381a8bce28fc69857645cc7e84a72317e693e.
There is a layering violation: LLVMAnalysis depends on LLVMCore, so
LLVMCore should not include LLVMAnalysis header
llvm/Analysis/ModuleSummaryAnalysis.h
Enable using -module-summary with -S
(similarly to what currently can be achieved with opt <input> -o - | llvm-dis).
This is a recommit of ef9e62469.
Test plan: ninja check-all
Differential revision: https://reviews.llvm.org/D137768
This makes sure that this code continue working when switching to
the memory attribute.
A caveat here is that onlyReadsMemory() will also true for readnone.
To be conservative, I'm explicitly excluding that case here.
Turning on opaque pointers has uncovered an issue with WPD where we currently pattern match away `assume(type.test)` in WPD so that a later LTT doesn't resolve the type test to undef and introduce an `assume(false)`. The pattern matching can fail in cases where we transform two `assume(type.test)`s into `assume(phi(type.test.1, type.test.2))`.
Currently we create `assume(type.test)` for all virtual calls that might be devirtualized. This is to support `-Wl,--lto-whole-program-visibility`.
To prevent this, all virtual calls that may not be in the same LTO module instead use a new `llvm.public.type.test` intrinsic in place of the `llvm.type.test`. Then when we know if `-Wl,--lto-whole-program-visibility` is passed or not, we can either replace all `llvm.public.type.test` with `llvm.type.test`, or replace all `llvm.public.type.test` with `true`. This prevents WPD from trying to pattern match away `assume(type.test)` for public virtual calls when failing the pattern matching will result in miscompiles.
Reviewed By: tejohnson
Differential Revision: https://reviews.llvm.org/D128955