At the time of instrumentation (and instrumentation lowering), `noreturn` is not applied uniformly. Rather than running the `FunctionAttrs` pass, we just need to use `llvm::canReturn`, exposed in PR #135650.
Functions with `musttail` calls can't be roots because we can't instrument their `ret` to release the context. This patch tags the `CtxRoot` field in their `FunctionData`. In compiler-rt we then know not to allow such functions to become roots, and also not to confuse `CtxRoot == 0x1` with there being a context root.
Currently we also lose the context tree in such cases. We can, in a subsequent patch, have the root detector search past these functions.
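The `CtxRoot == 0x1` tagging can be sketched as follows. This is a minimal illustration, not the compiler-rt code: the struct layouts are reduced to the one field that matters, and the helper names (`cannotBeRootMarker`, `canBeRoot`, `hasContextRoot`, `tryMakeRoot`) are hypothetical.

```cpp
#include <cstdint>

// Hypothetical, reduced stand-ins for the runtime structures.
struct ContextRoot {};

struct FunctionData {
  ContextRoot *CtxRoot = nullptr;
};

// The marker described above: a CtxRoot value of 0x1 means "this function
// has musttail calls and must never become a root".
inline ContextRoot *cannotBeRootMarker() {
  return reinterpret_cast<ContextRoot *>(static_cast<std::uintptr_t>(0x1));
}

inline bool canBeRoot(const FunctionData &FD) {
  return FD.CtxRoot != cannotBeRootMarker();
}

// A real root is any non-null CtxRoot that is not the marker.
inline bool hasContextRoot(const FunctionData &FD) {
  return FD.CtxRoot != nullptr && FD.CtxRoot != cannotBeRootMarker();
}

// Attaches Root to FD unless FD is tagged; returns whether FD is now a root.
inline bool tryMakeRoot(FunctionData &FD, ContextRoot &Root) {
  if (!canBeRoot(FD))
    return false;
  if (FD.CtxRoot == nullptr)
    FD.CtxRoot = &Root;
  return true;
}
```

Keeping both checks behind helpers means every place that tests for a root uses the same definition of "has a root", which is the confusion the marker could otherwise introduce.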
This is an optional mechanism that automatically detects roots. It's a best-effort mechanism, and its main goal is to *avoid* pointing at the message pump function as a root. This is the function that polls message queue(s) in an infinite loop, and is thus a bad root (it never exits).
At a high level, when collection is requested - which should happen when a server has already been set up and is handling requests - we spend a bit of time sampling all the server's threads. Each sample is a stack which we insert in a `PerThreadCallsiteTrie`. After a while, we run the root detection logic for each `PerThreadCallsiteTrie`. We then traverse all the `FunctionData`, find the ones matching the detected roots, and allocate a `ContextRoot` for them. From here, in `__llvm_ctx_profile_get_context`, we special-case `FunctionData` objects that have a `CtxRoot` and route them to `__llvm_ctx_profile_start_context`.
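To illustrate the sampling-then-detection idea, here is a simplified sketch. It assumes a heuristic that the description above does not spell out: walk down from the top of the trie while there is a single child (the path leading to and including the message pump), and treat the children at the first fan-out as roots. The class name `SampleTrie` is hypothetical, and it keys nodes by function id rather than by callsite.

```cpp
#include <cstdint>
#include <map>
#include <memory>
#include <set>
#include <vector>

using FunctionId = std::uint64_t;

// Hypothetical, simplified stand-in for a per-thread trie of sampled stacks.
class SampleTrie {
  struct Node {
    std::map<FunctionId, std::unique_ptr<Node>> Children;
  };
  Node Top;

public:
  // Inserts one sampled stack, ordered outermost frame first.
  void insert(const std::vector<FunctionId> &Stack) {
    Node *N = &Top;
    for (FunctionId F : Stack) {
      std::unique_ptr<Node> &Child = N->Children[F];
      if (!Child)
        Child = std::make_unique<Node>();
      N = Child.get();
    }
  }

  // Assumed heuristic: skip the single-child prefix, then report the
  // functions found at the first level with more than one child.
  std::set<FunctionId> detectRoots() const {
    const Node *N = &Top;
    while (N->Children.size() == 1)
      N = N->Children.begin()->second.get();
    std::set<FunctionId> Roots;
    for (const auto &KV : N->Children)
      Roots.insert(KV.first);
    return Roots;
  }
};
```

With samples such as `main -> pump -> handlerA -> ...` and `main -> pump -> handlerB -> ...`, the sketch reports `handlerA` and `handlerB` and never the pump itself. If all samples share a single path, it reports nothing, which is consistent with the mechanism being best-effort.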
For this to work, on the llvm side, we need to have all functions call `__llvm_ctx_profile_release_context` because they _might_ be roots. This comes at a slight (percentages) penalty during collection - which we can afford since the overall technique is ~5x faster than normal instrumentation. We can later explore conditionally enabling autoroot detection and avoiding this penalty, if desired.
Note that functions that `musttail call` can't have their return instrumented this way, and a subsequent patch will harden the mechanism against this case.
The mechanism could be used in combination with explicit root specification, too.
Mechanism to keep the compiler-rt and llvm view of `FunctionData` in sync. Since CtxInstrContextNode.h is exactly the same on both sides (there's an existing test, `compiler-rt/test/ctx_profile/TestCases/check-same-ctx-node.test`, checking that), we capture the structure in a macro that is then generated as `struct` fields on the compiler-rt side, and as `Type` objects on the llvm side. The macro needs to be told how to render a few kinds of fields.
If we add more fields to FunctionData that can be described by the current known types of fields, then the llvm side would automatically be updated. If we need to add more kinds of fields, which we do by adding parameters to the macro, the llvm side (if not updated) would trigger a compilation error.
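The macro technique can be sketched like this. The macro name, the field list, and the two field kinds are illustrative assumptions, not the contents of CtxInstrContextNode.h; the "llvm side" is stood in for by a list of type-kind strings rather than real `Type` objects.

```cpp
#include <string>
#include <vector>

// Hypothetical field list: one entry per field, tagged with its kind.
// Adding a new kind means adding a macro parameter, which breaks every
// expansion site that has not been updated.
#define SKETCH_FUNCTION_DATA(PTRDECL, MUTEXDECL) \
  PTRDECL(FunctionData, Next) \
  PTRDECL(void, CtxRoot) \
  PTRDECL(void, FlatCtx) \
  MUTEXDECL(Mutex)

// Runtime-side rendering: each entry becomes a struct field.
struct FunctionData {
#define SKETCH_PTR_FIELD(T, Name) T *Name = nullptr;
#define SKETCH_MUTEX_FIELD(Name) int Name = 0;
  SKETCH_FUNCTION_DATA(SKETCH_PTR_FIELD, SKETCH_MUTEX_FIELD)
#undef SKETCH_PTR_FIELD
#undef SKETCH_MUTEX_FIELD
};

// Compiler-side rendering: each entry becomes a description of its type.
inline std::vector<std::string> functionDataFieldKinds() {
#define SKETCH_PTR_KIND(T, Name) "ptr",
#define SKETCH_MUTEX_KIND(Name) "i32",
  return {SKETCH_FUNCTION_DATA(SKETCH_PTR_KIND, SKETCH_MUTEX_KIND)};
#undef SKETCH_PTR_KIND
#undef SKETCH_MUTEX_KIND
}
```

Both renderings expand the same list, so a field added with an existing kind shows up on both sides without further changes.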
`ContextRoot` and `FunctionData` are currently known by the llvm side, which has to instantiate and zero-initialize them.
This patch makes `FunctionData` the only global value that needs to be known and instantiated by the compiler. On the compiler-rt side, `ContextRoot`s are hung off `FunctionData`, when applicable.
This is for two reasons. First, it is a step towards root autodetection (in a subsequent patch). An autodetection mechanism would instantiate the `ContextRoot` for the detected roots, and then `__llvm_ctx_profile_get_context` would detect that and route to `__llvm_ctx_profile_start_context`.
The second reason is that we will hang off `ContextRoot` more complex datatypes (next patch), and we want to avoid too deep of a coupling between llvm and compiler-rt. Acting as a place to hang related data, `FunctionData` can stay simple - pointers and an (atomic) int (the mutex).
When we collect a contextual profile, we sample the threads entering its root and only collect on one at a time (see `ContextRoot::Taken`). If we want to compare contextual profiles with each other, and/or with flat profiles, we have a problem: we don't know how to compare the counter values relative to each other. To that end, we add `ContextRoot::TotalEntries`, which is incremented every time a root is entered and serves as a multiplier for the counter values collected under that root.
We expose this in the profile and leave the normalization to the user of the profile, for a few reasons:
* it's only needed if reasoning about all profiles in aggregate.
* the goal, in compiler-rt, is to flush out the profile as quickly as possible, and performing multiplications adds an overhead that may not even be necessary if the consumer of the profile doesn't care about combining profiles.
* the information itself may be interesting as an indication of relative sampling of various contexts.
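On the consumer side, the normalization left to the user could look like the sketch below. It takes the description literally (counters are multiplied by `TotalEntries`); the `RootProfile` struct and the function name are hypothetical, and a real consumer may want a different scaling.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical view of what a consumer reads for one root.
struct RootProfile {
  std::uint64_t TotalEntries;          // times the root was entered
  std::vector<std::uint64_t> Counters; // values collected under the root
};

// Scales each counter by the root's TotalEntries so that values from
// different roots become comparable.
inline std::vector<std::uint64_t> scaleByTotalEntries(const RootProfile &P) {
  std::vector<std::uint64_t> Scaled;
  Scaled.reserve(P.Counters.size());
  for (std::uint64_t C : P.Counters)
    Scaled.push_back(C * P.TotalEntries);
  return Scaled;
}
```

Doing this at profile-use time keeps the runtime's flush path free of the multiplications.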
Collect flat profiles. We only do this for function activations that aren't otherwise collectible under a context root.
This allows us to reason about the full profile without concerning ourselves with whether we are double-counting. For example, we can combine (during profile use) flattened contextual profiles with flat profiles.
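Because an activation is recorded either under a context root or in the flat profile, never both, combining the two reduces to summing counters per GUID. A minimal sketch, with hypothetical types (`ContextNode`, `FlatProfile`) that are not the actual profile format:

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

using GUID = std::uint64_t;
using Counters = std::vector<std::uint64_t>;
using FlatProfile = std::map<GUID, Counters>;

// Hypothetical contextual profile node: a function plus its callee contexts.
struct ContextNode {
  GUID Id;
  Counters Values;
  std::vector<ContextNode> Callees;
};

// Adds Values element-wise into the entry for Id, growing it if needed.
inline void accumulate(FlatProfile &Into, GUID Id, const Counters &Values) {
  Counters &Dst = Into[Id];
  if (Dst.size() < Values.size())
    Dst.resize(Values.size(), 0);
  for (std::size_t I = 0; I < Values.size(); ++I)
    Dst[I] += Values[I];
}

// Flattens a context tree: every node contributes to its function's total.
inline void flatten(const ContextNode &N, FlatProfile &Into) {
  accumulate(Into, N.Id, N.Values);
  for (const ContextNode &C : N.Callees)
    flatten(C, Into);
}

// Starts from the flat profile and adds the flattened contextual roots.
inline FlatProfile combine(const std::vector<ContextNode> &Roots,
                           const FlatProfile &Flat) {
  FlatProfile Result = Flat;
  for (const ContextNode &R : Roots)
    flatten(R, Result);
  return Result;
}
```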
Contextual profiling identifies functions by GUID. Functions that may get overridden by the linker with a prevailing copy may have, during instrumentation, different variants in different modules. If these variants get inlined before linking (assuming ThinLTO here), they will identify themselves to the ctxprof runtime by the same GUID, leading to issues - they may have different numbers of counters, for instance.
If we block their inlining in the pre-thinlink compilation, only the prevailing copy will survive post-thinlink and the confusion is avoided.
The change introduces a small pass just for this purpose, which marks any symbols that could be affected by the above as `noinline` (even if they were `alwaysinline`). We already carried out some inlining (via the preinliner), before instrumenting, so technically the `alwaysinline` directives were honored.
We could later (different patch) choose to mark them back to their original attribute (none or `alwaysinline`) post-thinlink, if we want to - but experimentally that doesn't really change much of the performance of the instrumented binary.
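The shape of such a pass can be shown on a toy model. This sketch does not use the LLVM API: `Fn`, `Linkage`, and the choice of which linkages count as "may be replaced by a prevailing copy" are assumptions made for illustration.

```cpp
#include <string>
#include <vector>

// Toy linkage model; the real set of linkages is richer.
enum class Linkage { External, Internal, LinkOnceODR, WeakODR };

// Toy function record with only the attributes relevant here.
struct Fn {
  std::string Name;
  Linkage Link;
  bool IsDeclaration;
  bool AlwaysInline;
  bool NoInline;
};

// Assumption: these are the linkages for which the linker may pick a
// different (prevailing) copy.
inline bool mayBeReplacedByPrevailingCopy(Linkage L) {
  return L == Linkage::LinkOnceODR || L == Linkage::WeakODR;
}

// Marks affected definitions noinline, dropping alwaysinline if present.
// Returns how many functions were changed.
inline unsigned blockInlining(std::vector<Fn> &Module) {
  unsigned Changed = 0;
  for (Fn &F : Module) {
    if (F.IsDeclaration || F.NoInline ||
        !mayBeReplacedByPrevailingCopy(F.Link))
      continue;
    F.AlwaysInline = false;
    F.NoInline = true;
    ++Changed;
  }
  return Changed;
}
```

Functions with local linkage are left alone in this model, since no other module can supply a competing copy of them.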
We don't need that name variable for contextual instrumentation, we just
use the function to get its GUID which we pass to the runtime, and rely
on metadata to capture it through the various optimization passes. This
change removes the need for the name global variable.
Continuing from #102084, which introduced the analysis, we now populate
it with info about functions contained in the module.
When we update the profile due to e.g. inlined callsites, we'll
ingest the callee's counters and callsites into the caller. We'll move
those to the caller's respective index spaces (counters and callsites), so
we need to know and maintain where those currently end.
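The bookkeeping this implies can be sketched as below. The struct and function names are hypothetical, and the sketch simply appends the callee's whole index ranges to the caller's; an actual implementation may remap indices more selectively.

```cpp
#include <cstdint>

// Hypothetical per-function record: one past the last used index in each
// index space.
struct FunctionInfo {
  std::uint32_t NextCounterIndex;
  std::uint32_t NextCallsiteIndex;
};

// Where the callee's indices start once moved into the caller.
struct IndexShift {
  std::uint32_t CounterBase;
  std::uint32_t CallsiteBase;
};

// Reserves room in the caller for everything the callee brings along and
// returns the offsets to apply to the callee's indices.
inline IndexShift reserveForInlinedCallee(FunctionInfo &Caller,
                                          const FunctionInfo &Callee) {
  IndexShift Shift{Caller.NextCounterIndex, Caller.NextCallsiteIndex};
  Caller.NextCounterIndex += Callee.NextCounterIndex;
  Caller.NextCallsiteIndex += Callee.NextCallsiteIndex;
  return Shift;
}

// Maps a callee-local counter index to its new index in the caller.
inline std::uint32_t remapCounter(const IndexShift &Shift,
                                  std::uint32_t CalleeIndex) {
  return Shift.CounterBase + CalleeIndex;
}
```

For example, inlining a callee with 4 counters into a caller that currently uses 3 places the callee's counter 2 at index 5 and leaves the caller's next free counter index at 7.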
We also don't need to keep profiles not pertinent to this module.
This patch also introduces an arguably much simpler way to track the
GUID of a function from the frontend compilation, through ThinLTO, and
into the post-thinlink compilation step, which doesn't rely on keeping
names around. A separate RFC and patches will discuss extending this to
the current PGO (instrumented and sampled) and other consumers as an
infrastructural component.
This adds callsite instrumentation to PGOInstrumentation, *if* contextual profiling is requested. Requesting contextual profiling also enables inserting counters in the entry basic block and disables value profiling (the latter is a point-in-time change).
This change adds the skeleton of the contextual profiling lowering pass, just so we can introduce the flag controlling that and the API to check that. The actual lowering pass will be introduced in a subsequent patch.
(Tracking Issue: #89287, RFC referenced there)