This patch adds CalleeGuids to the serialized format and increments the version number to 4. The unit tests are updated to include a new test for v4 and the YAML format is also updated to be able to roundtrip the v4 format.
See https://discourse.llvm.org/t/rfc-keep-globalvalue-guids-stable/84801
for context.
This is a non-functional change which just changes the interface of
GlobalValue, in preparation for future functional changes. This part
touches a fair few users, so is split out for ease of review. Future
changes to the GlobalValue implementation can then be focused purely on
that class.
This does the following:
* Rename GlobalValue::getGUID(StringRef) to
getGUIDAssumingExternalLinkage. This is simply making explicit at the
callsite what is currently implicit.
* Where possible, migrate users to directly calling getGUID on a
GlobalValue instance.
* Otherwise, where possible, have them call the newly renamed
getGUIDAssumingExternalLinkage, to make the assumption explicit.
There are a few cases where neither of the above are possible, as the
caller saves and reconstructs the necessary information to compute the
GUID themselves. We want to migrate these callers eventually, but for
this first step we leave them be.
* Added YAML traits for `CallSiteInfo`
* Updated the `MemProfReader` to pass `Frames` instead of the entire
`CallSiteInfo`
* Updated test cases to use `testing::Field`
* Add YAML sequence traits for CallSiteInfo in MemProfYAML
* Also extend IndexedMemProfRecord
* XFAIL the MemProfYaml round trip test until we update the profile
format
For now we only read and write the additional information from the YAML
format. The YAML round trip test will be enabled when the serialized format is updated.
Now that IndexedMemProfData::{addFrame,addCallStack} are the only
callers of Frame::hash and hashCallStack, respectively, this patch
moves those functions into IndexedMemProfData and makes them private.
With this patch, we can obtain FrameId and CallStackId only through
addFrame and addCallStack, respectively.
CallStackRadixTreeBuilder::build takes the parameter
MemProfFrameIndexes by value, involving copies:
std::optional<const llvm::DenseMap<FrameIdTy, LinearFrameId>>
MemProfFrameIndexes
Then "build" makes another copy of MemProfFrameIndexe and passes it to
encodeCallStack for every call stack, which is painfully slow.
This patch changes the type to a pointer so that we don't have to make
a copy every time we pass the argument.
Without this patch, it takes 553 seconds to run "llvm-profdata merge"
on a large MemProf raw profile. This patch shortenes that down to 67
seconds.
This patch removes two functions to verify the consistency between:
- IndexedAllocationInfo::CallStack
- IndexedAllocationInfo::CSId
Now that MemProf format Version 1 has been removed,
IndexedAllocationInfo::CallStack doesn't participate in either
serialization or deserialization, so we don't care about the
consistency between the two fields in IndexAllocationInfo.
Subsequent patches will remove uses of the old field and eventually
remove the field.
This reverts commit fdb050a5024320ec29d2edf3f2bc686c3a84abaa, and
restores ccb4702038900d82d1041ff610788740f5cef723, with a fix for build
bot failures.
Specifically, add ProfileData to the dependences of the BitWriter
library, which was causing shared library builds of LLVM to fail.
Reproduced the failure with a shared library build and confirmed this
change fixes that build failure.
Leverage the support added to represent allocation contexts in a more
compact way via a radix tree in the indexed profile to similarly reduce
sizes of the bitcode summaries.
For a large target, this reduced the size of the per-module summaries by
about 18% and in the distributed combined index files by 28%.
Prepare for usage in the bitcode reader/writer where we already have a
LinearFrameId:
- templatize input frame id type in CallStackRadixTreeBuilder
- templatize input frame id type in computeFrameHistogram
- make the map from FrameId to LinearFrameId optional
We plan to use the same radix format in the ThinLTO summary records,
where we already have a LinearFrameId.
This patch removes MemProf format Version 0 now that version 2 and 3
seem to be working well.
I'm not touching version 1 for now because some tests still rely on
version 1.
Note that Version 0 is identical to Version 1 except that the MemProf
section of the indexed format has a MemProf version field.
We call llvm::sort in a couple of places in the V3 encoding:
- We sort Frames by FrameIds for stability of the output.
- We sort call stacks in the dictionary order to maximize the length
of the common prefix between adjacent call stacks.
It turns out that we can improve the deserialization performance by
modifying the comparison functions -- without changing the format at
all. Both places take advantage of the histogram of Frames -- how
many times each Frame occurs in the call stacks.
- Frames: We serialize popular Frames in the descending order of
popularity for improved cache locality. For two equally popular
Frames, we break a tie by serializing one that tends to appear
earlier in call stacks. Here, "earlier" means a smaller index
within llvm::SmallVector<FrameId>.
- Call Stacks: We sort the call stacks to reduce the number of times
we follow pointers to parents during deserialization. Specifically,
instead of comparing two call stacks in the strcmp style -- integer
comparisons of FrameIds, we compare two FrameIds F1 and F2 with
Histogram[F1] < Histogram[F2] at respective indexes. Since we
encode from the end of the sorted list of call stacks, we tend to
encode popular call stacks first.
Since the two places use the same histogram, we compute it once and
share it in the two places.
Sorting the call stacks reduces the number of "jumps" by 74% when we
deserialize all MemProfRecords. The cycle and instruction counts go
down by 10% and 1.5%, respectively.
If we sort the Frames in addition to the call stacks, then the cycle
and instruction counts go down by 14% and 1.6%, respectively, relative
to the same baseline (that is, without this patch).
Call stacks are a huge portion of the MemProf profile, taking up 70+%
of the profile file size.
This patch implements a radix tree to compress call stacks, which are
known to have long common prefixes. Specifically,
CallStackRadixTreeBuilder, introduced in this patch, takes call stacks
in the MemProf profile, sorts them in the dictionary order to maximize
the common prefix between adjacent call stacks, and then encodes a
radix tree into a single array that is ready for serialization.
The resulting radix array is essentially a concatenation of call stack
arrays, each encoded with its length followed by the payload, except
that these arrays contain "instructions" like "skip 7 elements
forward" to borrow common prefixes from other call stacks.
This patch does not integrate with the MemProf
serialization/deserialization infrastructure yet. Once integrated,
the radix tree is expected to roughly halve the file size of the
MemProf profile.
This patch replaces llvm::SmallVector<Frame> with std::vector<Frame>.
llvm::SmallVector<Frame> sets aside one inline element. Meanwhile,
when I sort all call stacks by their lengths, the length at the first
percentile is already 2. That is, 99 percent of call stacks do not
take advantage of the inline element.
Using std::vector<Frame> reduces the cycle and instruction counts by
11% and 22%, respectively, with "llvm-profdata show" modified to
deserialize all MemProfRecords.
This patch replaces uint32_t with LinearCallStackId where appropriate.
I'm replacing uint64_t with LinearCallStackId in
writeMemProfCallStackArray, but that's OK because it's a value to be
used as LinearCallStackId anyway.
With this patch, we stop using on-disk hash tables for Frames and call
stacks. Instead, we'll write out all the Frames as a flat array while
maintaining mappings from FrameIds to the indexes into the array.
Then we serialize call stacks in terms of those indexes.
Likewise, we'll write out all the call stacks as another flat array
while maintaining mappings from CallStackIds to the indexes into the
call stack array. One minor difference from Frames is that the
indexes into the call stack array are not contiguous because call
stacks are variable-length objects.
Then we serialize IndexedMemProfRecords in terms of the indexes
into the call stack array.
Now, we describe each call stack with 32-bit indexes into the Frame
array (as opposed to the 64-bit FrameIds in Version 2). The use of
the smaller type cuts down the profile file size by about 40% relative
to Version 2. The departure from the on-disk hash tables contributes
a little bit to the savings, too.
For now, IndexedMemProfRecords refer to call stacks with 64-bit
indexes into the call stack array. As a follow-up, I'll change that
to uint32_t, including necessary updates to RecordWriterTrait.
This patch adds Version 3 for development purposes. For now, this
patch adds V3 as a copy of V2.
For the most part, this patch adds "case Version3:" wherever "case
Version2:" appears. One exception is writeMemProfV3, which is copied
from writeMemProfV2 but updated to write out memprof::Version3 to the
MemProf header. We'll incrementally modify writeMemProfV3 in
subsequent patches.
"const" being removed in this patch prevents the move semantics from
being used in:
AI.CallStack = Callback(IndexedAI.CSId);
With this patch on an indexed MemProf Version 2 profile, the cycle
count and instruction count go down by 13.3% and 26.3%, respectively,
with "llvm-profdata show" modified to deserialize all MemProfRecords.
std::move and reserve here result in a measurable speed-up in
llvm-profdata modified to deserialize all MemProfRecords. The cycle
count goes down by 7.1% while the instruction count goes down by 21%.
CallStackIdConverter sets LastUnmappedId when a mapping failure
occurs. Now, since toMemProfRecord takes an instance of
CallStackIdConverter by value, namely std::function, the caller of
toMemProfRecord never receives the mapping failure that occurs inside
toMemProfRecord. The same problem applies to FrameIdConverter.
The patch fixes the problem by passing FrameIdConverter and
CallStackIdConverter by reference, namely llvm::function_ref.
While I am it, this patch deletes the copy constructor and copy
assignment operator to avoid accidental copies.
PortableMemInfoBlock::{serialize,deserialize} take Schema into
account, allowing us to serialize/deserialize a subset of the fields.
However, PortableMemInfoBlock::serializedSize does not. That is, it
assumes that all fields are always serialized and deserialized. In
other words, if we choose to serialize/deserialize a subset of the
fields, serializedSize would claim more storage than we actually need.
This patch fixes the problem by teaching serializedSize to take Schema
into account. For now, this patch has no effect on the actual indexed
MemProf profile because we serialize/deserialize all fields, but that
might change in the future.
Aside from check-llvm, I tested this patch by verifying that
llvm-profdata generates bit-wise identical files for each version for
a large raw MemProf file I have.
We are in the process of referring to call stacks with CallStackId in
IndexedMemProfRecord and IndexedAllocationInfo instead of holding call
stacks inline (both in memory and the serialized format). Doing so
deduplicates call stacks and reduces the MemProf profile file size.
Before we can eliminate the two fields holding call stacks inline:
- IndexedAllocationInfo::CallStack
- IndexedMemProfRecord::CallSites
we need to eliminate all the read operations on them.
This patch is a step toward that direction. Specifically, we
eliminate the read operations in the context of MemProfReader and
RawMemProfReader. A subsequent patch will eliminate the read
operations during the serialization.
The first field to serialize is the size of
IndexedMemProfRecord::AllocSites. It has nothing to do with
GlobalValue::GUID. This happens to work because of:
using GUID = uint64_t;
I'm currently developing a new version of the indexed memprof format
where we deduplicate call stacks in IndexedAllocationInfo::CallStack
and IndexedMemProfRecord::CallSites. We refer to call stacks with
integer IDs, namely CallStackId, just as we refer to Frame with
FrameId. The deduplication will cut down the profile file size by 80%
in a large memprof file of mine.
As a step toward the goal, this patch teaches
IndexedMemProfRecord::{serialize,deserialize} to speak Version2. A
subsequent patch will add Version2 support to llvm-profdata.
The essense of the patch is to replace the serialization of a call
stack, a vector of FrameIDs, with that of a CallStackId. That is:
const IndexedAllocationInfo &N = ...;
...
LE.write<uint64_t>(N.CallStack.size());
for (const FrameId &Id : N.CallStack)
LE.write<FrameId>(Id);
becomes:
LE.write<CallStackId>(N.CSId);
There are two ways to create in-memory instances of
IndexedAllocationInfo -- deserialization of the raw MemProf data and
that of the indexed MemProf data.
With:
commit 74799f424063a2d751e0f9ea698db1f4efd0d8b2
Author: Kazu Hirata <kazu@google.com>
Date: Sat Mar 23 19:50:15 2024 -0700
we compute CallStackId for each call stack in IndexedAllocationInfo
while deserializing the raw MemProf data.
This patch does the same while deserilizing the indexed MemProf data.
As with the patch above, this patch does not add any use of
CallStackId yet.
The indexed MemProf file has a huge amount of redundancy. In a large
internal application, 82% of call stacks, stored in
IndexedAllocationInfo::CallStack, are duplicates.
We should work toward deduplicating call stacks by referring to them
with unique IDs with actual call stacks stored in a separate data
structure, much like we refer to memprof::Frame with memprof::FrameId.
At the same time, we need to facilitate a graceful transition from the
current version of the MemProf format to the next. We should be able
to read (but not write) the current version of the MemProf file even
after we move onto the next one.
With those goals in mind, I propose to have an integer ID next to
CallStack in IndexedAllocationInfo to refer to a call stack in a
succinct manner. We'll gradually increase the areas of the compiler
where IDs and call stacks have one-to-one correspondence and
eventually remove the existing CallStack field.
This patch adds call stack ID, named CSId, to IndexedAllocationInfo
and teaches the raw profile reader to compute unique call stack IDs
and store them in the new field. It does not introduce any user of
the call stack IDs yet, except in verifyFunctionProfileData.
The comment about the stripping of suffixes when creating the indexed
MemProf profile was partially incorrect, as we do not strip ".__uniq."
suffixes by default (by design). Update the comment accordingly.
Note that llvm::support::endianness has been renamed to
llvm::endianness while becoming an enum class. This patch replaces
{big,little,native} with llvm::endianness::{big,little,native}.
This patch completes the migration to llvm::endianness and
llvm::endianness::{big,little,native}. I'll post a separate patch to
remove the migration helpers in llvm/Support/Endian.h:
using endianness = llvm::endianness;
constexpr llvm::endianness big = llvm::endianness::big;
constexpr llvm::endianness little = llvm::endianness::little;
constexpr llvm::endianness native = llvm::endianness::native;
Canonicalize the function name (strip suffixes etc) to ensure that
function name suffixes added by late stage passes do not cause
mismatches when memprof profile data is consumed.
Reviewed By: tejohnson
Differential Revision: https://reviews.llvm.org/D159132
The current implementation of memprof information in the indexed profile
format stores the representation of each calling context fram inline.
This patch uses an interned representation where the frame contents are
stored in a separate on-disk hash table. The table is indexed via a hash
of the contents of the frame. With this patch, the compressed size of a
large memprof profile reduces by ~22%.
Reviewed By: tejohnson
Differential Revision: https://reviews.llvm.org/D123094
This reverts commit 0d362c90d335509c57c0fbd01ae1829e2b9c3765.
Reason: Causes the MSan buildbot to fail (see comments on
https://reviews.llvm.org/D121179 for more information
To ease profile annotation, each of the callsites in a function can be
annotated with profile data - "IR metadata format for MemProf" [1]. This
patch extends the on-disk serialized record format to store the debug
information for allocation callsites incl inline frames. This change is
incompatible with the existing format i.e. indexed profiles must be
regenerated, raw profiles are unaffected.
[1] https://groups.google.com/g/llvm-dev/c/aWHsdMxKAfE/m/WtEmRqyhAgAJ
Reviewed By: tejohnson
Differential Revision: https://reviews.llvm.org/D121179
This patch adds support for optional memory profile information to be
included with and indexed profile. The indexed profile header adds a new
field which points to the offset of the memory profile section (if
present) in the indexed profile. For users who do not utilize this
feature the only overhead is a 64-bit offset in the header.
The memory profile section contains (1) profile metadata describing the
information recorded for each entry (2) an on-disk hashtable containing
the profile records indexed via llvm::md5(function_name). We chose to
introduce a separate hash table instead of the existing one since the
indexing for the instrumented fdo hash table is based on a CFG hash
which itself is perturbed by memprof instrumentation.
This commit also includes the changes reviewed separately in D120093.
Differential Revision: https://reviews.llvm.org/D120103