Functions with `musttail` calls can't be roots because we can't instrument their `ret` to release the context. This patch tags their `CtxRoot` field in their `FunctionData`. In compiler-rt we then know not to allow such functions become roots, and also not confuse `CtxRoot == 0x1` with there being a context root.
Currently we also lose the context tree under such cases. We can, in a subsequent patch, have the root detector search past these functions.
This is an optional mechanism that automatically detects roots. It's a best-effort mechanism, and its main goal is to *avoid* pointing at the message pump function as a root. This is the function that polls message queue(s) in an infinite loop, and is thus a bad root (it never exits).
High-level, when collection is requested - which should happen when a server has already been set up and handing requests - we spend a bit of time sampling all the server's threads. Each sample is a stack which we insert in a `PerThreadCallsiteTrie`. After a while, we run for each `PerThreadCallsiteTrie` the root detection logic. We then traverse all the `FunctionData`, find the ones matching the detected roots, and allocate a `ContextRoot` for them. From here, we special case `FunctionData` objects, in `__llvm_ctx_profile_get_context, that have a `CtxRoot` and route them to `__llvm_ctx_profile_start_context`.
For this to work, on the llvm side, we need to have all functions call `__llvm_ctx_profile_release_context` because they _might_ be roots. This comes at a slight (percentages) penalty during collection - which we can afford since the overall technique is ~5x faster than normal instrumentation. We can later explore conditionally enabling autoroot detection and avoiding this penalty, if desired.
Note that functions that `musttail call` can't have their return instrumented this way, and a subsequent patch will harden the mechanism against this case.
The mechanism could be used in combination with explicit root specification, too.
An initial patch for supporting automated root detection. The auto-detector is introduced subsequently, but this patch introduces a datastructure for capturing sampled stacks, per thread, in a trie, and inferring from such samples which functions are reasonable roots.