Both Windows and Linux use 32-bit thread identifiers. macOS has a 64-bit
counter, but in practice it will never overflow during a profiling session, so
no false aliasing of thread identifiers will occur.
These changes apply only to the client and the network protocol. The server
still uses 64-bit thread identifiers, to enable virtual threads, etc.
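A hedged sketch of what the client-side narrowing can look like; the helper name is illustrative, not Tracy's actual function:

```cpp
#include <cstdint>
#if defined _WIN32
#  include <windows.h>
#elif defined __APPLE__
#  include <pthread.h>
#elif defined __linux__
#  include <sys/syscall.h>
#  include <unistd.h>
#endif

// Returns the thread id in the 32-bit form sent over the network protocol.
static uint32_t GetThreadId32()
{
#if defined _WIN32
    return (uint32_t)GetCurrentThreadId();     // natively 32-bit
#elif defined __APPLE__
    uint64_t id;
    pthread_threadid_np( nullptr, &id );       // 64-bit counter; truncation does not
    return (uint32_t)id;                       // alias in practice during a session
#elif defined __linux__
    return (uint32_t)syscall( SYS_gettid );    // natively 32-bit
#else
    return 0;
#endif
}
```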
Manual management of the profiler lifetime is extremely useful for ecosystems
such as Rust. There are a couple of reasons why:
1. Rust strongly advises against relying on life before/after main, as
it is difficult to reason about. Most users working in Rust will
generally be quite surprised when encountering this concept.
2. Rust and its package manager make it easy to use packages (crates)
and somewhat less straightforward to consider the implications of
including a dependency.
In the case of the `rust_tracy_client` set of packages, I currently have
to warn throughout the package's documentation that merely adding a
dependency on the bindings package is enough to potentially, and
accidentally, broadcast a lot of information about the instrumented
binary to the broader world. This seems like a major footgun given how
easy it is to forget about having added this dependency.
The ability to manually manage the lifetime of the profiler would be a great
solution to these problems.
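A minimal sketch of that manual lifetime model on the C++ side, assuming the client is built with TRACY_DELAYED_INIT and TRACY_MANUAL_LIFETIME; StartupProfiler()/ShutdownProfiler() are the entry points the C++ client exposes for this mode (the include path may differ between Tracy versions, and the Rust bindings would presumably go through the equivalent C API):

```cpp
#include "Tracy.hpp"

int main()
{
    tracy::StartupProfiler();    // nothing is collected or broadcast before this call

    {
        ZoneScopedN( "work" );
        // ... instrumented code ...
    }

    tracy::ShutdownProfiler();   // tear the profiler down before main() returns
    return 0;
}
```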
This makes sure that profiler threads are properly included in the sample data
on Linux. It previously worked because sample capture was performed
system-wide. Now samples are only captured in the client context, which
includes all spawned threads. Since this inclusion only covers threads spawned
after the trace starts, no thread may be created before the sampling setup is
done.
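A hedged sketch of the general approach (attach to the client process with one event per CPU and let the kernel's inherit flag pick up threads created afterwards), mirroring the idea rather than Tracy's exact code:

```cpp
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstring>
#include <vector>

static std::vector<int> OpenProcessSamplers()
{
    perf_event_attr attr;
    memset( &attr, 0, sizeof( attr ) );
    attr.size = sizeof( attr );
    attr.type = PERF_TYPE_SOFTWARE;
    attr.config = PERF_COUNT_SW_CPU_CLOCK;
    attr.freq = 1;
    attr.sample_freq = 10000;
    attr.sample_type = PERF_SAMPLE_TID | PERF_SAMPLE_TIME | PERF_SAMPLE_CALLCHAIN;
    attr.inherit = 1;      // also follow threads spawned after this call
    attr.disabled = 1;

    std::vector<int> fds;
    const long cpus = sysconf( _SC_NPROCESSORS_ONLN );
    for( long cpu = 0; cpu < cpus; cpu++ )
    {
        // pid = 0: the calling process (client context), one event per CPU.
        const int fd = (int)syscall( SYS_perf_event_open, &attr, 0, (int)cpu, -1, 0 );
        if( fd != -1 ) fds.push_back( fd );
    }
    return fds;
}
```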
The C++11 spec states in [basic.stc.thread] thread storage duration:
2. A variable with thread storage duration shall be initialized before its
first odr-use (3.2) and, if constructed, shall be destroyed on thread exit.
Previously Tracy relied on the TLS data being initialized:
- During thread creation (MSVC).
- Or during first use in a thread, but the initialization was performed for
the whole TLS block.
It seems that newer compilers are more granular in how they perform this
initialization, hence the rpmalloc init has to be checked before each
allocation, as it cannot be "folded" into, for example, the initialization of
the profiler itself.
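An illustrative sketch of the resulting pattern, with hypothetical names (not Tracy's actual identifiers) and rpmalloc's public API assumed to be available:

```cpp
#include <cstddef>
#include "rpmalloc.h"   // assumed include path for rpmalloc's public API

namespace tracy
{
// Trivially-initialized thread_local flag: no dynamic TLS initialization needed.
static thread_local bool s_rpmallocThreadInit = false;

static inline void EnsureRpmallocThreadInit()
{
    // Must run on every allocation path: a compiler is free to initialize each
    // thread_local variable lazily on its own first odr-use, so this check cannot
    // be folded into, e.g., the profiler's own initialization. Assumes the global
    // rpmalloc_initialize() has already been performed once.
    if( !s_rpmallocThreadInit )
    {
        rpmalloc_thread_initialize();
        s_rpmallocThreadInit = true;
    }
}

void* ProfilerMalloc( size_t size )
{
    EnsureRpmallocThreadInit();
    return rpmalloc( size );
}
}
```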
It is unclear how this is supposed to work. perf_event_open() opens the
descriptor successfully, but it produces no samples if precise_ip is not 0.
There are no such problems on ARM (where precise_ip is 3, though it is possible
the setting is simply not supported at all on that architecture), and on AMD
perf_event_open() does not succeed when precise_ip > 0.
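For context, precise_ip is the two-bit skid-constraint field of perf_event_attr (0 allows arbitrary skid, 3 requires zero skid). The hedged sketch below shows the obvious fallback-on-open-failure approach; it handles the AMD behaviour described above, but not the x86 case where the descriptor opens yet never delivers samples:

```cpp
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstring>

static int OpenCycleSampler()
{
    perf_event_attr attr;
    memset( &attr, 0, sizeof( attr ) );
    attr.size = sizeof( attr );
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.freq = 1;
    attr.sample_freq = 10000;
    attr.disabled = 1;

    for( int precise = 3; precise >= 0; precise-- )
    {
        attr.precise_ip = precise;   // retry with a weaker skid constraint on failure
        const int fd = (int)syscall( SYS_perf_event_open, &attr, 0, -1, -1, 0 );
        if( fd != -1 ) return fd;
    }
    return -1;
}
```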
Original commit a6b25497 by xavier <xavierb@gmail.com>:
add TRACY_CALLSTACK_IGNORE_INLINES to tradeoff speed vs precision in win32 DecodeCallstackPtr()
SymQueryInlineTrace() is too slow in some cases:
a backlog of 300000 queries being processed at ~70 per second is prohibitive.
(Without inline resolution, it is more like ~20000 queries per second.)
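A hedged usage sketch: the macro is a client-side build option, so defining it when compiling the Tracy client should be enough (a -DTRACY_CALLSTACK_IGNORE_INLINES compiler flag is the usual equivalent; the single-translation-unit TracyClient.cpp build is assumed here):

```cpp
// Trades callstack precision for much faster win32 symbol decoding by skipping
// SymQueryInlineTrace(); inline frames are no longer resolved.
#define TRACY_ENABLE
#define TRACY_CALLSTACK_IGNORE_INLINES
#include "TracyClient.cpp"
```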
This codepath, involving a workaround for GCC < 8.4, called 'new' and
'delete' directly, which could cause infinite recursion when
user-provided versions of those functions were themselves using Tracy
functionality.
Now, this codepath uses Tracy's internal allocator.
See issues #194, #196
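For illustration, a hedged sketch (not Tracy's code) of the kind of user-provided operator new/delete that triggered the recursion; if an internal Tracy code path allocates with a plain `new`, it re-enters this operator and can recurse back into Tracy:

```cpp
#include <cstdlib>
#include <new>
#include "Tracy.hpp"

void* operator new( std::size_t count )
{
    void* ptr = std::malloc( count );
    if( !ptr ) throw std::bad_alloc();
    TracyAlloc( ptr, count );    // reports the allocation to Tracy from inside the allocator
    return ptr;
}

void operator delete( void* ptr ) noexcept
{
    TracyFree( ptr );
    std::free( ptr );
}
```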