Consider running the following code with operator new and delete overloaded to
track allocations with call stacks:
std::thread( []({ thread_local std::string str; });
Each call stack requires a memory allocation to be performed by the profiler,
to make the stack available at a later time. When the thread is created, the
TLS block is initialized and the std::string buffer can be allocated. To track
this allocation, rpmalloc has to be initialized. This initialization also
happens within the TLS block.
Now, when the thread exits, the heap managed by rpmalloc may be released first
during the TLS block destruction (and if the destruction is performed in
reverse creation order, then it *will* be destroyed first, as rpmalloc was
initialized only after the std::string initialization, to track the allocation
performed within). The next thing to happen is destruction of std::string and
release of the memory block it contains.
The release is tracked by the profiler, and as mentioned earlier, to save the
call stack for later use, a memory allocation is needed. But the allocator is
no longer available in this thread, because rpmalloc was released just before!
As a solution to this issue, profiler will detect whether the allocator is
still available and will ignore the call stack, if it's not. The other
solution is to disable the rpmalloc thread cleanup, which may potentially
cause leak-like behavior, in case a large number of threads is spawned and
destroyed.
Note that this is not a water-tight solution. Other functions will still want
to allocate memory for call stacks, but it is rather unlikely that such calls
would be performed during TLS block destruction. It is also possible that the
event queue will run out of allocated space for events at this very moment,
and in such a case the allocator will also fail.
If call stack capture is enabled for context switch data, the 64KB buffer is
too small to work without overruns. However, if the default buffer size is
increased, then the maximum locked memory limit is hit.
This change keeps the small buffer size for all the buffers that may be used
without escalated privileges. The context switch buffer is bigger, but it does
not need to obey the limits, as the application is running as root, if it is
to be used.
This fixes issue #293. Symbols are not loaded if the module is loaded dynamically after the SymInitialize call.
This may stall the symbol resolver thread a bit the first time a module is loaded.
On Windows there is no way to distinguish callstack data coming from random
sampling and from context switches. Each callstack timestamp has to be matched
against the context switch data in order to decide its origin. This is
obviously non-trivial.
On some other platforms, the origin information may be available right away,
in which case the process of matching against the context switch data, which
possibly includes postponing callstacks for processing in the future, may be
completely omitted.