We weren't performing node merging on newly created nodes in some cases.
Use a simple iteration over the node and its callers until no more
opportunities are found. I confirmed that for several large applications
the maximum number of iterations is 3 (meaning we only needed to do any
work on the first 2, as expected). This can potentially be made more
elegant in the future, but it is a simple and effective solution.
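As a rough sketch of the shape of that loop (the merge step callback stands
in for the pass's actual merging logic, which is not shown here):
  #include <functional>

  // Minimal fixpoint loop: keep applying the merge step until a full pass
  // makes no change. Returns how many passes were needed.
  unsigned iterateToFixpoint(const std::function<bool()> &MergeStep) {
    unsigned Iterations = 0;
    bool Changed = true;
    while (Changed) {
      ++Iterations;
      Changed = MergeStep(); // true if any merge happened on this pass
    }
    return Iterations;
  }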
Also fix a bug, exposed by the test case, in getting the function for a
call instruction in the FullLTO handling, by using an existing method to
look through aliases if needed.
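A hedged sketch of what looking through aliases amounts to (the helper here
is illustrative; the patch reuses an existing method rather than adding this
one):
  #include "llvm/IR/Function.h"
  #include "llvm/IR/InstrTypes.h"
  #include "llvm/Support/Casting.h"
  using namespace llvm;

  Function *getCalleeLookingThroughAliases(const CallBase &CB) {
    // getCalledFunction() returns null when the callee is an alias or a
    // bitcast, so strip those first and then check for a Function.
    return dyn_cast_or_null<Function>(
        CB.getCalledOperand()->stripPointerCastsAndAliases());
  }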
There is no reason to use std::map for the call maps maintained for
function clones during function clone assignment, as we don't iterate
over them and don't need deterministic ordering, so use the more
efficient DenseMap.
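Illustrative shape of the change (key/value types here are stand-ins for the
actual call info types):
  #include "llvm/ADT/DenseMap.h"
  #include "llvm/IR/Instruction.h"
  using namespace llvm;

  // The per-clone call maps are only queried, never iterated, so no
  // deterministic ordering is needed and a hash-based DenseMap avoids
  // std::map's per-node allocations and tree rebalancing.
  using CloneCallMap = DenseMap<Instruction *, Instruction *>;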
This patch fixes:
llvm/lib/Transforms/IPO/MemProfContextDisambiguation.cpp:4771:9:
error: non-void lambda does not return a value in all control paths
[-Werror,-Wreturn-type]
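For illustration only (not the actual lambda in the file), the diagnostic
fires when a non-void lambda has a control path without a return, and the
fix is to return a value on every path:
  auto IsCold = [](int AllocType) -> bool {
    if (AllocType == 1)
      return true;
    return false; // without this, -Wreturn-type fires on the fall-through path
  };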
This reverts commit 314e22bcab2b0f3d208708431a14215058f0718f, reapplying
PR150735 with a fix for the unstable iteration order exposed by the new
tests (PR151039).
We iterate over a std::map indexed by FuncInfo, which is a pair of a
pointer and a clone number. In the ThinLTO case, this isn't an issue as
the function pointer always points to the same FunctionSummary object.
However, for regular LTO, this is a pointer to a Function object, which
is different for each clone. This will lead to unstable iteration order.
This was exposed in a test case added for PR150735, which added a new
instance of iteration over this map.
Since these function clones are added and numbered sequentially, change
this to a vector indexed by clone number, which points to a structure
containing the clone FuncInfo and the call map (the old map's key and
value, respectively).
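A hedged sketch of the resulting data structure (names and element types are
illustrative stand-ins, not the exact ones in the patch):
  #include "llvm/ADT/DenseMap.h"
  #include <utility>
  #include <vector>
  using namespace llvm;

  // Stand-ins: FuncInfo is a {function pointer, clone number} pair and
  // CallInfo wraps a call; the real types live in the pass.
  using FuncInfo = std::pair<void *, unsigned>;
  using CallInfo = void *;

  struct FuncCloneInfo {
    FuncInfo Clone;                       // the old map's key
    DenseMap<CallInfo, CallInfo> CallMap; // the old map's value
  };

  // Clones are created and numbered sequentially, so indexing by clone
  // number gives a stable order that doesn't depend on pointer values.
  std::vector<FuncCloneInfo> FuncCloneInfos;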
Fix a bug in function assignment where we were not assigning all
callsite clones to a function clone. This led to incorrect call updates
because multiple callsite clones could look like they were assigned to
the same function clone.
Add a stat and a debug message to help identify and debug cases where
this is still happening.
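The stat and debug message follow the usual LLVM pattern; the names and
wording below are placeholders, not the ones added by the patch:
  #include "llvm/ADT/Statistic.h"
  #include "llvm/Support/Debug.h"
  #include "llvm/Support/raw_ostream.h"
  using namespace llvm;

  #define DEBUG_TYPE "memprof-context-disambiguation"

  STATISTIC(UnassignedCallsiteClones,
            "Number of callsite clones not assigned to a function clone");

  void noteUnassignedCallsiteClone() {
    UnassignedCallsiteClones++;
    LLVM_DEBUG(dbgs() << "Callsite clone not assigned to a function clone\n");
  }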
We already included the assigned clone of the callsite node's callee in
the dot graph after function assignment. This adds the same information
for the enclosing caller function to aid debugging.
In rare cases the declaration of a function may not match its callsite
after function importing, when the declaration was imported from a
module where the function had void return type (presumably due to
incomplete types). Instead of using setCalledFunction() to change a call
to call its clone, which also updates the call's function type, just call
setCalledOperand() directly so that the only thing changed is the
function target.
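A minimal sketch of the distinction (the helper is illustrative;
CallBase::setCalledFunction and setCalledOperand are the relevant APIs):
  #include "llvm/IR/Function.h"
  #include "llvm/IR/InstrTypes.h"
  using namespace llvm;

  void redirectCallToClone(CallBase &CB, Function &Clone) {
    // CB.setCalledFunction(&Clone) would also overwrite the callsite's
    // function type with Clone's type, which may not match here.
    // setCalledOperand only swaps the call target.
    CB.setCalledOperand(&Clone);
  }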
Note this can't happen for the other places where we call
setCalledFunction: FullLTO requires the cloned callee to be defined in
the same FullLTO merged module; ThinLTO memprof ICP calls an ICP
facility to first perform the promotion and that will be blocked if the
function type doesn't match the callsite (the new test explicitly tests
this latter case).
Allow users to set the minimum absolute count for inlining of indirect
calls promoted during cloning. This is primarily meant to enable
generation of synthetic vp metadata introduced in PR141164 when
profiling memprof-optimized binaries.
g++-13 warned that:
llvm/lib/Transforms/IPO/MemProfContextDisambiguation.cpp:1645:8: warning: variable ‘PrevIterCreatedNode’ set but not used [-Wunused-but-set-variable]
1645 | bool PrevIterCreatedNode = false;
| ^~~~~~~~~~~~~~~~~~~
This warning only fired when asserts were not enabled.
To aid in debugging, (optionally) dump the dot graph immediately after
the stack update phase (which matches nodes to interior callsites) and
before we clean up mismatched callee edges (either via tail call fixup,
indirect call fixup, or nulling otherwise).
Reapply PR142507 with fix for test: add in the same x86_64-linux
requirement as other tests as the stack ids are currently computed
differently on big endian systems. This will be investigated separately.
In order to allow selective reporting of context hinting during the LTO
link, and in the future to allow selective, more aggressive cloning, add
an option to specify a minimum percent of the max cold size in the
profile summary. Contexts that meet that threshold will get context size
info metadata (and ThinLTO summary information) on the associated
allocations.
Specifying -memprof-report-hinted-sizes during the pre-LTO compile step
will continue to cause all contexts to receive this metadata. But
specifying -memprof-report-hinted-sizes only during the LTO link will
cause only those that meet the new threshold and have the metadata to
get reported.
To support this, because the alloc info summary and associated bitcode
require the context size information to be in the same order as the
other context information, 0s are inserted for contexts without this
metadata. The bitcode writer uses a more compact format for the context
ids to allow better compression of the 0s.
As part of this change several helper methods are added to query whether
metadata contains context size info on any or all contexts.
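A hedged sketch of the threshold check (parameter and helper names are
illustrative; MaxColdSize comes from the profile summary and the percent
value from the new option):
  #include <cstdint>

  bool keepContextSizeInfo(uint64_t ColdBytes, uint64_t MaxColdSize,
                           unsigned MinPercentMaxColdSize) {
    // A context gets context size info metadata (and ThinLTO summary
    // information) only if its cold bytes reach the configured percentage
    // of the max cold size from the profile summary.
    return ColdBytes * 100 >= uint64_t(MinPercentMaxColdSize) * MaxColdSize;
  }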
Guard a loop that only exists to do assertion checking of stack ids on
memprof metadata so that it isn't compiled and executed under NDEBUG.
This is similar to how callsite metadata stack id verification is
guarded further below.
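The guard follows the standard pattern; the sketch below uses illustrative
container types rather than the pass's actual data structures:
  #include <cassert>
  #include <cstdint>
  #include <set>
  #include <vector>

  void verifyStackIds(const std::vector<uint64_t> &MetadataStackIds,
                      const std::set<uint64_t> &SummaryStackIds) {
  #ifndef NDEBUG
    // Verification-only loop: compiled and executed only when asserts are
    // enabled, mirroring the existing callsite metadata guard.
    for (uint64_t Id : MetadataStackIds)
      assert(SummaryStackIds.count(Id) && "stack id missing from summary");
  #else
    (void)MetadataStackIds;
    (void)SummaryStackIds;
  #endif
  }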
See https://discourse.llvm.org/t/rfc-keep-globalvalue-guids-stable/84801
for context.
This is a non-functional change which just changes the interface of
GlobalValue, in preparation for future functional changes. This part
touches a fair few users, so is split out for ease of review. Future
changes to the GlobalValue implementation can then be focused purely on
that class.
This does the following:
* Rename GlobalValue::getGUID(StringRef) to
getGUIDAssumingExternalLinkage. This is simply making explicit at the
callsite what is currently implicit.
* Where possible, migrate users to directly calling getGUID on a
GlobalValue instance.
* Otherwise, where possible, have them call the newly renamed
getGUIDAssumingExternalLinkage, to make the assumption explicit.
There are a few cases where neither of the above are possible, as the
caller saves and reconstructs the necessary information to compute the
GUID themselves. We want to migrate these callers eventually, but for
this first step we leave them be.
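A small sketch of the two call styles after this change (the wrapper
functions themselves are illustrative, not part of the patch):
  #include "llvm/ADT/StringRef.h"
  #include "llvm/IR/GlobalValue.h"
  using namespace llvm;

  GlobalValue::GUID guidOf(const GlobalValue &GV) {
    // Preferred: query the instance, which accounts for its actual linkage.
    return GV.getGUID();
  }

  GlobalValue::GUID guidOfExternalName(StringRef Name) {
    // Name-only path: the external-linkage assumption is now explicit in
    // the API name rather than implicit at the callsite.
    return GlobalValue::getGUIDAssumingExternalLinkage(Name);
  }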
We perform cloning for each allocation node separately. However, this
sometimes results in a situation where the same node calls multiple
clones of the same callee, created for different allocations. This
causes issues when assigning functions to these clones, as each node can
in reality only call a single callee clone.
To address this, before assigning functions, merge callee clone nodes as
needed using a post order traversal from the allocations. We attempt to
use existing clones as the merge node when legal, and to share them
among callers with the same properties (callers calling the same set of
callee clone nodes for the same allocations).
Without this fix, in some cases incorrect function assignment will lead
to calling the wrong allocation clone. In fact, this showed up in an
existing test, which I hadn't noticed because the test existed to check
earlier parts of the cloning process.
If we are replacing a sequence of stack nodes with a single node
representing inlined IR, and the stack id sequence contains recursion,
we may have already removed some edges. Handle this case correctly by
skipping the now removed edge.
DenseSet, SmallPtrSet, SmallSet, SetVector, and StringSet recently
gained C++23-style insert_range. This patch replaces:
Dest.insert(Src.begin(), Src.end());
with:
Dest.insert_range(Src);
This patch does not touch ranges with custom begin functions, like
succ_begin, for now.
When we need to reclone other callees of a caller node during function
assignment due to the creation of a new function clone, we need to skip
recursive edges on that caller. We don't want to reclone the callee in
that case (the callee being the caller itself), as that isn't necessary
and also isn't correct from a graph update perspective. Doing so resulted
in an assertion failure, and in an NDEBUG build it caused an infinite
loop.
To simplify debugging and analysis, particularly for very large
applications with large graphs, this patch adds support for highlighting
the nodes/edges of a single context id or of a single allocation's
context ids, and/or for exporting only those nodes/edges. When
highlighting, the specified nodes and edges are drawn in a brighter color
and larger.
This is controlled by the new -memprof-dot-scope={all,alloc,context}
flag, which determines how much to export, along with two companion
flags:
-memprof-dot-alloc-id=ID
-memprof-dot-context-id=ID
These two are interpreted differently depending on the value of
-memprof-dot-scope (where "all" is the default).
If exporting all, one of the above flags can optionally be passed to
highlight the nodes/edges for the given context id or allocation's
context ids.
If exporting alloc scope, an alloc id must be provided. A context id can
optionally be provided to highlight that context.
If exporting context scope, a context id must be provided.
The ids to use can be obtained either by looking at the full graph, or a
context id can be identified from the -memprof-report-hinted-sizes
output after PR128188 is merged.
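For example (illustrative id values; this assumes dot emission is already
enabled via the existing dot export options):
  -memprof-dot-scope=alloc -memprof-dot-alloc-id=2
  -memprof-dot-scope=context -memprof-dot-context-id=7
  -memprof-dot-scope=all -memprof-dot-context-id=7 (highlight only)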
During the whole program reporting of contexts when hinted byte
reporting is enabled via -memprof-report-hinted-sizes, also print the
internal context id. This is useful for debugging, as well as for
guiding the dot file dumping with some upcoming changes that will
accept a context id to focus the graph on a context of interest.
Invoke the backedge computation (refactored as a new method) at the end
of the graph construction, instead of at the start of cloning. That
makes more logical sense, and it also makes it easier to look at the
results in the postbuild dot graph with a follow-on change to display
those differently.
Two miscellaneous cleanups/improvements to the dot printing.
Remove a redundant "style=filled" in the Node attributes. No effect on
resulting graph.
Add a "color" attribute to the Edge, with the same color name as
"fillcolor". The latter only fills in the arrowhead, and the former is
what affects the line. This makes the edge colors more visible;
previously each edge was drawn in black with a colored-in arrowhead.
For the second change, I added the new Edge color attributes to the
checking in the two "basic.ll" tests, so we get some testing coverage of
the full printing. For the other affected tests I removed the final "]'"
after the fillcolor so it matches up through that attribute and ignores
the rest of the line.
In order to facilitate cloning of recursive cycles, we first identify
backedges using a standard DFS search from the root callers, then
initially defer recursively invoking the cloning function via those
edges. This is because the cloning opportunity along the backedge may
not be exposed until the current node is cloned for other non-backedge
callers that are cold after the earlier recursive cloning, resulting
in a cold predecessor of the backedge. So we recursively invoke the
cloning function for the backedges during the cloning of the current
node for its caller edges (which were sorted to enable handling cold
callers first).
There was no significant time or memory overhead measured for several
large applications.
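A self-contained sketch of the standard DFS backedge identification referred
to above (the adjacency-list graph and types are illustrative, not the
pass's node/edge classes):
  #include <cstddef>
  #include <utility>
  #include <vector>

  enum class Color { White, Grey, Black };

  static void dfs(size_t U, const std::vector<std::vector<size_t>> &Adj,
                  std::vector<Color> &State,
                  std::vector<std::pair<size_t, size_t>> &BackEdges) {
    State[U] = Color::Grey; // on the current DFS stack
    for (size_t V : Adj[U]) {
      if (State[V] == Color::Grey)
        BackEdges.emplace_back(U, V); // edge to an ancestor: a backedge
      else if (State[V] == Color::White)
        dfs(V, Adj, State, BackEdges);
    }
    State[U] = Color::Black; // fully processed
  }

  // Walk from the root callers; the edges collected here are the ones whose
  // recursive cloning invocation is initially deferred.
  std::vector<std::pair<size_t, size_t>>
  findBackEdges(const std::vector<std::vector<size_t>> &Adj,
                const std::vector<size_t> &Roots) {
    std::vector<Color> State(Adj.size(), Color::White);
    std::vector<std::pair<size_t, size_t>> BackEdges;
    for (size_t R : Roots)
      if (State[R] == Color::White)
        dfs(R, Adj, State, BackEdges);
    return BackEdges;
  }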
We were previously checking this after recursing on all callers, but if
we already have a single allocation type there is no need to even look
at any callers. This didn't show a significant improvement overall, but
it does reduce the number of times we enter identifyClones and do other
checks.
This change addresses a number of issues with the support added by
PR121985 which were exposed through more exhaustive testing,
specifically places that needed changes to perform correct graph updates
in the presence of cycles.
A new test case is added that reproduces these issues, and the default
is flipped back to enabling this handling.
When we apply cloning decisions in the ThinLTO backend, we need to find
the corresponding summary for each function in the IR, and in some cases
for callee functions. This is complicated when the function was a
promoted local, in which case the GUID was formed from the hash of the
original source file prepended to the function name. Those functions
can be identified by the fact that they were given a ".llvm." suffix
during promotion.
We previously didn't do this correctly for promoted locals imported from
other modules, as we only tried the current module source name. This led
to crashes, in particular when the current module also had a local
function with the same original name. Specifically, we were attempting to
iterate through the wrong summary's callsites, and there were fewer than
in the actual function, so we accessed data off the end (in a release
build with assertion checking off; with assertion checking on we double
check the stack ids, and that would have failed). Even if we hadn't
crashed or hit an assert, we could have applied the wrong cloning
decisions, leading to unsats at link time.
Luckily, function importing attaches thinlto_src_file metadata
containing the original source file name to all imported functions. It
doesn't do this by default; however, it always does if MemProf
context disambiguation is enabled. Therefore, we can just look to see if
the function contains this metadata and if so use it to recreate the
original GUID.
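A hedged sketch of that GUID recreation (the pass's actual code differs; the
metadata access and the split on the ".llvm." promotion suffix are the key
ingredients):
  #include "llvm/ADT/StringRef.h"
  #include "llvm/IR/Function.h"
  #include "llvm/IR/GlobalValue.h"
  #include "llvm/IR/Metadata.h"
  #include "llvm/IR/Module.h"
  #include "llvm/Support/Casting.h"
  using namespace llvm;

  GlobalValue::GUID originalGUID(const Function &F) {
    // Strip any ".llvm.<hash>" suffix added during promotion.
    StringRef OrigName = F.getName().split(".llvm.").first;
    // Prefer the source file recorded by the importer, falling back to the
    // current module's source file name.
    StringRef SrcFile = F.getParent()->getSourceFileName();
    if (MDNode *MD = F.getMetadata("thinlto_src_file"))
      SrcFile = cast<MDString>(MD->getOperand(0))->getString();
    // Locals are hashed on their global identifier, which prepends the
    // source file name; reuse the existing helper to build it.
    return GlobalValue::getGUIDAssumingExternalLinkage(
        GlobalValue::getGlobalIdentifier(OrigName,
                                         GlobalValue::InternalLinkage,
                                         SrcFile));
  }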
A similar issue can occur when looking for the ValueInfo / GUID of
a direct tail call to see if we synthesized a callsite record for a
missing tail call frame. In that case, the callee function may be a
declaration, if we imported its caller but not the callee function
definition. Because imported declarations don't get the thinlto_src_file
metadata, we instead look at its caller (which works because this
happens very early in the backend before any inlining).
Note that PointerUnion::dyn_cast has been soft deprecated in
PointerUnion.h:
// FIXME: Replace the uses of is(), get() and dyn_cast() with
// isa<T>, cast<T> and the llvm::dyn_cast<T>
Literal migration would result in dyn_cast_if_present (see the
definition of PointerUnion::dyn_cast), but this patch uses cast
because we know which alternative to expect in the ternary expression.
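For illustration, the shape of the migration when the active alternative is
known (CallsiteInfo/AllocInfo here are simplified stand-ins for the summary
types):
  #include "llvm/ADT/PointerUnion.h"
  #include "llvm/Support/Casting.h"
  using namespace llvm;

  struct CallsiteInfo { void *Opaque; }; // stand-in
  struct AllocInfo { void *Opaque; };    // stand-in

  CallsiteInfo *getCallsite(PointerUnion<CallsiteInfo *, AllocInfo *> Call) {
    // Before (member form, soft deprecated): Call.dyn_cast<CallsiteInfo *>()
    // After: the free-function cast<> asserts the expected alternative
    // instead of silently returning null.
    return cast<CallsiteInfo *>(Call);
  }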
Remove edge iterator parameters from the various helpers that move edges
onto other nodes, and their associated iterator update code, and instead
iterate over copies of the edge lists in the caller loops. This also
avoids the need to increment these iterators at every early loop
continue.
This simplifies the code, makes it less error prone when updating, and
in particular, facilitates adding handling of recursive contexts.
There were no measurable compile time or memory overhead effects for a
large target.
Note that PointerUnion::dyn_cast has been soft deprecated in
PointerUnion.h:
// FIXME: Replace the uses of is(), get() and dyn_cast() with
// isa<T>, cast<T> and the llvm::dyn_cast<T>
Literal migration would result in dyn_cast_if_present (see the
definition of PointerUnion::dyn_cast), but this patch uses cast
because we expect the arguments to be of the requested types. Note
that all these cases have assert and/or dereferences just after cast,
implying that the return value from cast must be nonnull.
---------
Co-authored-by: Nikita Popov <github@npopov.com>
We pass in a pointer to an Edge iterator to
moveEdgeToExistingCalleeClone, so that it can be correctly updated when
we remove edges during an edge iteration. We were not dereferencing this
pointer in one case, meaning we would increment the pointer and not the
iterator as intended.
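A simplified reproduction of the bug shape (types are stand-ins):
  #include <vector>
  using EdgeIter = std::vector<int>::iterator;

  void skipCurrentEdge(EdgeIter *EI) {
    // Buggy form: ++EI;  // advances the local pointer, not the iterator
    ++(*EI); // intended: advance the caller's iterator through the pointer
  }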
This did not cause any issues, as it turns out that we would simply skip
the edge on the next iteration as it was already appropriately handled.
While in theory this incurred some extra compilation time, in practice
for a large application the effect was not significant. I confirmed that
the fix had no effect on any cloning.
I plan to send a follow-up change to avoid the need to pass in an
iterator at all and to simplify / consolidate the handling in the caller,
but want to fix this separately in case something requires a revert of
that follow-on change.
IndexCall is a simple wrapper around:
PointerUnion<CallsiteInfo *, AllocInfo *>
Now, because we don't have CastInfo for IndexCall, we would have to
use getBase like so:
dyn_cast_if_present<CallsiteInfo *>(Call.getBase())
This patch adds simplify_type<IndexCall>, which in turn enables
CastInfo for IndexCall, so we can drop getBase like so:
dyn_cast_if_present<CallsiteInfo *>(Call)
Note that PointerUnion::is has been soft deprecated in
PointerUnion.h:
// FIXME: Replace the uses of is(), get() and dyn_cast() with
// isa<T>, cast<T> and the llvm::dyn_cast<T>
In this patch, I'm calling call().getBase() for an instance of
PointerUnion. call() alone would return an instance of IndexCall,
which wraps PointerUnion. Note that isa<> cannot directly accept an
instance of IndexCall, at least without defining CastInfo.
I'm not touching PointerUnion::dyn_cast for now because it's a bit
complicated; we could blindly migrate it to dyn_cast_if_present, but
we should probably use dyn_cast when the operand is known to be
non-null.