Consider running the following code with operator new and delete overloaded to
track allocations with call stacks:
std::thread( []({ thread_local std::string str; });
Each call stack requires a memory allocation to be performed by the profiler,
to make the stack available at a later time. When the thread is created, the
TLS block is initialized and the std::string buffer can be allocated. To track
this allocation, rpmalloc has to be initialized. This initialization also
happens within the TLS block.
Now, when the thread exits, the heap managed by rpmalloc may be released first
during the TLS block destruction (and if the destruction is performed in
reverse creation order, then it *will* be destroyed first, as rpmalloc was
initialized only after the std::string initialization, to track the allocation
performed within). The next thing to happen is destruction of std::string and
release of the memory block it contains.
The release is tracked by the profiler, and as mentioned earlier, to save the
call stack for later use, a memory allocation is needed. But the allocator is
no longer available in this thread, because rpmalloc was released just before!
As a solution to this issue, profiler will detect whether the allocator is
still available and will ignore the call stack, if it's not. The other
solution is to disable the rpmalloc thread cleanup, which may potentially
cause leak-like behavior, in case a large number of threads is spawned and
destroyed.
Note that this is not a water-tight solution. Other functions will still want
to allocate memory for call stacks, but it is rather unlikely that such calls
would be performed during TLS block destruction. It is also possible that the
event queue will run out of allocated space for events at this very moment,
and in such a case the allocator will also fail.
In case of CPU statistics data, this entry is created during creation of a
source location. This won't be done for GPU zones, as it would needlessly
expand the number of held entries. This is assuming the number of GPU zones
is significantly less than the number of CPU zones.
Note: Zone counts are currently being calculated, but they are not being
saved. Proper usage of this data (as is performed in the CPU counterpart)
would remove the possibility of insertion of new entries into the map in
ReconstructZoneStatistics().