From c774534b47756544802b87b69673217a1769d5d6 Mon Sep 17 00:00:00 2001
From: Bartosz Taudul
Date: Sun, 20 Oct 2019 20:52:33 +0200
Subject: [PATCH] Use rdtsc instead of rdtscp.

But rdtscp is serializing!

No, it's not. Quoting the Intel Instruction Set Reference:

"The RDTSCP instruction is not a serializing instruction, but it does wait
until all previous instructions have executed and all previous loads are
globally visible. But it does not wait for previous stores to be globally
visible, and subsequent instructions may begin execution before the read
operation is performed."

"The RDTSC instruction is not a serializing instruction. It does not
necessarily wait until all previous instructions have been executed before
reading the counter. Similarly, subsequent instructions may begin execution
before the read operation is performed."

So, the difference is in waiting for prior instructions to finish executing.
Notice that even in the rdtscp case, execution of the following instructions
may commence before the time measurement is finished, and data stores may
still be pending.

But, you may say, Intel in its "How to Benchmark Code Execution Times"
document shows that using rdtscp is superior to rdtsc. Well, not exactly.
What they do show is that when a *single function* is considered, there are
ways to measure its execution time with little to no error. This is not what
Tracy is doing. In our case there is no way to determine absolute "this is
before" and "this is after" points of a zone, as we probably are already
inside another zone. Stopping the CPU execution, so that a deeply nested
zone may be measured with great precision, will skew the measurements of
all parent zones. And this is not what we want to measure, anyway. We are
not interested in how a *single function* behaves, but in how a *whole
program* behaves. The out-of-order CPU behavior may influence the
measurements? Good! We are interested in that. We want to see *how* the
code is really executed. How is *stopping* the CPU to make a timer read an
appropriate thing to do, when we want to see how a program is performing?

At least that's the theory. And besides all that, the profiling overhead is
now reduced.
---
 client/TracyProfiler.hpp | 6 ++----
 manual/tracy.tex         | 2 +-
 2 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/client/TracyProfiler.hpp b/client/TracyProfiler.hpp
index 83228637..1cca2387 100644
--- a/client/TracyProfiler.hpp
+++ b/client/TracyProfiler.hpp
@@ -109,12 +109,10 @@ public:
         return std::chrono::duration_cast<std::chrono::nanoseconds>( std::chrono::high_resolution_clock::now().time_since_epoch() ).count();
 #  endif
 #  elif defined _WIN32 || defined __CYGWIN__
-        static unsigned int dontcare;
-        const auto t = int64_t( __rdtscp( &dontcare ) );
-        return t;
+        return int64_t( __rdtsc() );
 #  elif defined __i386 || defined _M_IX86 || defined __x86_64__ || defined _M_X64
         uint32_t eax, edx;
-        asm volatile ( "rdtscp" : "=a" (eax), "=d" (edx) :: "%ecx" );
+        asm volatile ( "rdtsc" : "=a" (eax), "=d" (edx) );
         return ( uint64_t( edx ) << 32 ) + uint64_t( eax );
 #  endif
 #else
diff --git a/manual/tracy.tex b/manual/tracy.tex
index 1691ebdd..28185875 100644
--- a/manual/tracy.tex
+++ b/manual/tracy.tex
@@ -127,7 +127,7 @@ One microsecond ($\frac{1}{1000}$ of a millisecond) in our comparison equals to
 And finally, one nanosecond ($\frac{1}{1000}$ of a microsecond) would be one nanometer. The modern microprocessor transistor gate, the width of DNA helix, or the thickness of a cell membrane are in the range of 5~\si{\nano\metre}.
 In one~\si{\nano\second} the light can travel only 30~\si{\centi\meter}.
 
-Tracy can achieve single-digit nanosecond measurement resolution, due to usage of hardware timing mechanisms on the x86 and ARM architectures\footnote{In both 32 and 64~bit variants. On x86 Tracy requires a modern version of the \texttt{rdtscp} instruction (Sandy Bridge and later). On ARM-based systems Tracy will try to use the timer register (\textasciitilde 40 \si{\nano\second} resolution). If it fails (due to kernel configuration), Tracy falls back to system provided timer, which can range in resolution from 250 \si{\nano\second} to 1 \si{\micro\second}.}. Other profilers may rely on the timers provided by operating system, which do have significantly reduced resolution (about 300~\si{\nano\second} -- 1~\si{\micro\second}). This is enough to hide the subtle impact of cache access optimization, etc.
+Tracy can achieve single-digit nanosecond measurement resolution, due to usage of hardware timing mechanisms on the x86 and ARM architectures\footnote{In both 32 and 64~bit variants. On x86 Tracy requires a modern version of the \texttt{rdtsc} instruction (Sandy Bridge and later). On ARM-based systems Tracy will try to use the timer register (\textasciitilde 40 \si{\nano\second} resolution). If it fails (due to kernel configuration), Tracy falls back to system provided timer, which can range in resolution from 250 \si{\nano\second} to 1 \si{\micro\second}.}. Other profilers may rely on the timers provided by operating system, which do have significantly reduced resolution (about 300~\si{\nano\second} -- 1~\si{\micro\second}). This is enough to hide the subtle impact of cache access optimization, etc.
 
 \subsubsection{Timer accuracy}
 
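
For illustration only, and not part of the patch above: a minimal C++ sketch of the
two measurement styles the commit message contrasts. It assumes MSVC-style
intrinsics from <intrin.h>; the names FencedTimestamp and StreamTimestamp are
invented for this example and do not appear in Tracy.

#include <intrin.h>
#include <cstdint>

// Style A: serialized read, roughly the approach Intel's "How to Benchmark
// Code Execution Times" document describes for timing a single function in
// isolation. cpuid is a serializing instruction, so nothing in flight can
// leak across the read -- precise for that one region, but it stalls
// everything around it, including any enclosing measurement.
inline int64_t FencedTimestamp()
{
    int regs[4];
    __cpuid( regs, 0 );             // serialize execution before the read
    return int64_t( __rdtsc() );    // then sample the time stamp counter
}

// Style B: the plain in-stream read this patch switches Tracy to. No
// serialization, so a deeply nested zone does not skew its parent zones.
inline int64_t StreamTimestamp()
{
    return int64_t( __rdtsc() );
}

The Intel document additionally fences the end-of-region read (rdtscp followed
by cpuid); the sketch only shows the start side, which is enough to see why a
serialized read is the wrong trade-off when zones nest inside one another.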