From a13b04669839e45a2ed5ed33abac787b71eb0414 Mon Sep 17 00:00:00 2001 From: Bartosz Taudul Date: Sat, 11 Dec 2021 21:00:31 +0100 Subject: [PATCH] User manual polish pass. --- manual/tracy.tex | 1051 +++++++++++++++++++++++----------------------- 1 file changed, 524 insertions(+), 527 deletions(-) diff --git a/manual/tracy.tex b/manual/tracy.tex index f72c7ed9..7c6a1f3e 100644 --- a/manual/tracy.tex +++ b/manual/tracy.tex @@ -100,8 +100,8 @@ Hello and welcome to the Tracy Profiler user manual! Here you will find all the \begin{itemize} \item Chapter~\ref{quicklook}, \emph{\nameref{quicklook}}, gives a short description of what Tracy is and how it works. -\item Chapter~\ref{firststeps}, \emph{\nameref{firststeps}}, shows how the profiler can be integrated into your application, and how to build the graphical user interface (section~\ref{buildingserver}). At this point you will be able to establish a connection from the profiler to your application. -\item Chapter~\ref{client}, \emph{\nameref{client}}, provides information on how to instrument your application, in order to retrieve useful profiling data. This includes description of the C API (section~\ref{capi}), which enables usage of Tracy in any programming language. +\item Chapter~\ref{firststeps}, \emph{\nameref{firststeps}}, shows how you can integrate the profiler into your application and how to build the graphical user interface (section~\ref{buildingserver}). At this point, you will be able to establish a connection from the profiler to your application. +\item Chapter~\ref{client}, \emph{\nameref{client}}, provides information on how to instrument your application, in order to retrieve useful profiling data. This includes a description of the C API (section~\ref{capi}), which enables usage of Tracy in any programming language. \item Chapter~\ref{capturing}, \emph{\nameref{capturing}}, goes into more detail on how the profiling information can be captured and stored on disk. 
\item Chapter~\ref{analyzingdata}, \emph{\nameref{analyzingdata}}, guides you through the graphical user interface of the profiler. \item Chapter~\ref{csvexport}, \emph{\nameref{csvexport}}, explains how to export some zone timing statistics into a CSV format. @@ -111,7 +111,7 @@ Hello and welcome to the Tracy Profiler user manual! Here you will find all the \section*{Quick-start guide} -For Tracy to profile your application, you will need to integrate the profiler into your application, and additionally run an independent executable which will act both as a server with which your application will communicate, and as a profiling viewer. The most basic integration looks like this: +For Tracy to profile your application, you will need to integrate the profiler into your application and run an independent executable that will act both as a server with which your application will communicate and as a profiling viewer. The most basic integration looks like this: \begin{itemize} \item Add the Tracy repository to your project directory. @@ -137,13 +137,13 @@ There's much more Tracy can do, which can be explored by carefully reading this \section{A quick look at Tracy Profiler} \label{quicklook} -Tracy is a real-time, nanosecond resolution \emph{hybrid frame and sampling profiler} that can be used for remote or +Tracy is a real-time, nanosecond resolution \emph{hybrid frame and sampling profiler} that you can use for remote or embedded telemetry of games and other applications. It can profile CPU (C, C++11, Lua), GPU (OpenGL, Vulkan, Direct3D 11/12, OpenCL) and memory. It also can monitor locks held by threads and show where contention does happen. -While Tracy can perform statistical analysis of sampled call stack data, just like other \emph{statistical profilers} (such as VTune, perf or Very Sleepy), it mainly focuses on manual markup of the source code, which allows frame-by-frame inspection of the program execution. 
You will be able to see exactly which functions are called, how much time is spent in them, and how do they interact with each other in a multi-threaded environment. In contrast, the statistical analysis may show you the hot spots in your code, but it is unable to accurately pinpoint the underlying cause for semi-random frame stutter that may occur every couple of seconds. +While Tracy can perform statistical analysis of sampled call stack data, just like other \emph{statistical profilers} (such as VTune, perf, or Very Sleepy), it mainly focuses on manual markup of the source code. Such markup allows frame-by-frame inspection of the program execution. For example, you will be able to see exactly which functions are called, how much time they require, and how they interact with each other in a multi-threaded environment. In contrast, the statistical analysis may show you the hot spots in your code, but it cannot accurately pinpoint the underlying cause for semi-random frame stutter that may occur every couple of seconds. -Even though Tracy targets \emph{frame} profiling, with the emphasis on analysis of \emph{frame time} in real-time applications (i.e.~games), it does work with utilities that do not employ the concept of a frame. There's nothing that would prohibit profiling of, for example, a compression tool, or an event-driven UI application. +Even though Tracy targets \emph{frame} profiling, with the emphasis on analysis of \emph{frame time} in real-time applications (i.e.~games), it does work with utilities that do not employ the concept of a frame. There's nothing that would prohibit the profiling of, for example, a compression tool or an event-driven UI application. You may think of Tracy as the RAD Telemetry plus Intel VTune, on overdrive. @@ -152,26 +152,26 @@ You may think of Tracy as the RAD Telemetry plus Intel VTune, on overdrive. 
The concept of Tracy being a real-time profiler may be explained in a couple of different ways: \begin{enumerate} -\item The profiled application is not slowed down by profiling\footnote{See section~\ref{perfimpact} for a benchmark.}. The act of recording a profiling event has virtually zero cost -- it only takes a few nanoseconds. Even on low-power mobile devices there's no perceptible impact on execution speed. -\item The profiler itself works in real-time, without the need to process collected data in a complex way. Actually, it is quite inefficient in the way it works, as the data it presents is calculated anew each frame. And yet it can run at 60 frames per second. -\item The profiler has full functionality when the profiled application is running and the data is still being collected. You may interact with your application and then immediately switch to the profiler, when a performance drop occurs. +\item The profiled application is not slowed down by profiling\footnote{See section~\ref{perfimpact} for a benchmark.}. The act of recording a profiling event has virtually zero cost -- it only takes a few nanoseconds. Even on low-power mobile devices, there's no perceptible impact on execution speed. +\item The profiler itself works in real-time, without the need to process collected data in a complex way. Actually, it is pretty inefficient in how it works because it recalculates the data it presents each frame anew. And yet, it can run at 60 frames per second. +\item The profiler has full functionality when the profiled application is running and the data is still being collected. You may interact with your application and immediately switch to the profiler when a performance drop occurs. \end{enumerate} \subsection{Nanosecond resolution} -It is hard to imagine how long a nanosecond is. One good analogy is to compare it with a measure of length. Let's say that one second is one meter (the average doorknob is on the height of one meter). 
+It is hard to imagine how long a nanosecond is. One good analogy is to compare it with a measure of length. Let's say that one second is one meter (the average doorknob is at the height of one meter). One millisecond ($\frac{1}{1000}$ of a second) would be then the length of a millimeter. The average size of a red ant or the width of a pencil is 5 or 6~\si{\milli\metre}. A modern game running at 60 frames per second has only 16~\si{\milli\second} to update the game world and render the entire scene. -One microsecond ($\frac{1}{1000}$ of a millisecond) in our comparison equals to one micron. The diameter of a typical bacterium ranges from 1 to 10 microns. The diameter of a red blood cell, or width of strand of spider web silk is about 7~\si{\micro\metre}. +One microsecond ($\frac{1}{1000}$ of a millisecond) in our comparison equals one micron. The diameter of a typical bacterium ranges from 1 to 10 microns. The diameter of a red blood cell or width of a strand of spider web silk is about 7~\si{\micro\metre}. -And finally, one nanosecond ($\frac{1}{1000}$ of a microsecond) would be one nanometer. The modern microprocessor transistor gate, the width of DNA helix, or the thickness of a cell membrane are in the range of 5~\si{\nano\metre}. In one~\si{\nano\second} the light can travel only 30~\si{\centi\meter}. +And finally, one nanosecond ($\frac{1}{1000}$ of a microsecond) would be one nanometer. The modern microprocessor transistor gate, the width of the DNA helix, or the thickness of a cell membrane are in the range of 5~\si{\nano\metre}. In one~\si{\nano\second} the light can travel only 30~\si{\centi\meter}. -Tracy can achieve single-digit nanosecond measurement resolution, due to usage of hardware timing mechanisms on the x86 and ARM architectures\footnote{In both 32 and 64~bit variants. On x86 Tracy requires a modern version of the \texttt{rdtsc} instruction (Sandy Bridge and later). 
Note that resolution of Time Stamp Counter readings may depend on the actual used hardware and its design decisions related to how TSC synchronization is handled between different CPU sockets, etc. On ARM-based systems Tracy will try to use the timer register (\textasciitilde 40 \si{\nano\second} resolution). If it fails (due to kernel configuration), Tracy falls back to system provided timer, which can range in resolution from 250 \si{\nano\second} to 1 \si{\micro\second}.}. Other profilers may rely on the timers provided by operating system, which do have significantly reduced resolution (about 300~\si{\nano\second} -- 1~\si{\micro\second}). This is enough to hide the subtle impact of cache access optimization, etc. +Tracy can achieve single-digit nanosecond measurement resolution due to usage of hardware timing mechanisms on the x86 and ARM architectures\footnote{In both 32 and 64~bit variants. On x86, Tracy requires a modern version of the \texttt{rdtsc} instruction (Sandy Bridge and later). Note that Time Stamp Counter readings' resolution may depend on the used hardware and its design decisions related to how TSC synchronization is handled between different CPU sockets, etc. On ARM-based systems Tracy will try to use the timer register (\textasciitilde 40 \si{\nano\second} resolution). If it fails (due to kernel configuration), Tracy falls back to system provided timer, which can range in resolution from 250 \si{\nano\second} to 1 \si{\micro\second}.}. Other profilers may rely on the timers provided by the operating system, which do have significantly reduced resolution (about 300~\si{\nano\second} -- 1~\si{\micro\second}). This is enough to hide the subtle impact of cache access optimization, etc. \subsubsection{Timer accuracy} -You may wonder why it is important to have a truly high resolution timer\footnote{Interestingly, the \texttt{std::chrono::high\_resolution\_clock} is not really a high resolution clock.}. 
After all, you only want to profile functions that have long execution times, and not some short-lived procedures, that have no impact on the application's run time. +You may wonder why it is vital to have a genuinely high resolution timer\footnote{Interestingly the \texttt{std::chrono::high\_resolution\_clock} is not really a high-resolution clock.}. After all, you only want to profile functions with long execution times and not some short-lived procedures that have no impact on the application's run time. It is wrong to think so. Optimizing a function to execute in 430~\si{\nano\second}, instead of 535~\si{\nano\second} (note that there is only a 100~\si{\nano\second} difference) results in 14 \si{\milli\second} savings if the function is executed 18000 times\footnote{This is a real optimization case. The values are median function run times and do not reflect the real execution time, which explains the discrepancy in the total reported time.}. It may not seem like a big number, but this is how much time there is to render a complete frame in a 60~FPS game. Imagine that this is your particle processing loop. @@ -218,21 +218,21 @@ Now let's take a look at the timer readings. \item The $C$ range (610~\si{\nano\second}) is only 20~\si{\nano\second} longer than the $B$ range, but it is reported as 900~\si{\nano\second}, a 600~\si{\nano\second} difference! \end{itemize} -Here you can see why it is important to use a high precision timer. While there is no escape from the measurement errors, their impact can be reduced by increasing the timer accuracy. +Here, you can see why using a high-precision timer is essential. While there is no escape from the measurement errors, a profiler can reduce their impact by increasing the timer accuracy. \subsection{Frame profiler} -Tracy is aimed at understanding the inner workings of a tight loop of a game (or any other kind of an interactive application). 
That's why it slices the execution time of a program using the \emph{frame}\footnote{A frame is used to describe a single image displayed on the screen by the game (or any other program), preferably 60 times per second to achieve smooth animation. You can also think about physics update frames, audio processing frames, etc.} as a basic work-unit\footnote{Frame usage is not required. See section~\ref{markingframes} for more information.}. The most interesting frames are the ones that took longer than the allocated time, producing visible hitches in the on-screen animation. Tracy allows inspection of such misbehavior. +Tracy aims to give you an understanding of the inner workings of a tight loop of a game (or any other kind of interactive application). That's why it slices the execution time of a program using the \emph{frame}\footnote{A frame is used to describe a single image displayed on the screen by the game (or any other program), preferably 60 times per second to achieve smooth animation. You can also think about physics update frames, audio processing frames, etc.} as a basic work-unit\footnote{Frame usage is not required. See section~\ref{markingframes} for more information.}. The most interesting frames are the ones that took longer than the allocated time, producing visible hitches in the on-screen animation. Tracy allows inspection of such misbehavior. \subsection{Sampling profiler} -Tracy is able to periodically sample what the profiled application is doing, which gives you detailed performance information at the source line/assembly instruction level. This can give you deep understanding of how the program is executed by the processor. Using this information you can get a coarse view at the call stacks, fine-tune your algorithms, or even 'steal' an optimization performed by one compiler and make it available for the others. 
+Tracy can periodically sample what the profiled application is doing, which provides detailed performance information at the source line/assembly instruction level. This can give you a deep understanding of how the processor executes the program. Using this information, you can get a coarse view at the call stacks, fine-tune your algorithms, or even 'steal' an optimization performed by one compiler and make it available for the others. -On some platforms it is possible to sample the hardware performance counters, which will give you information not only \emph{where} your program is running slowly, but also \emph{why}. +On some platforms, it is possible to sample the hardware performance counters, which will give you information not only \emph{where} your program is running slowly, but also \emph{why}. \subsection{Remote or embedded telemetry} -Tracy uses the client-server model to enable a wide range of use-cases (see figure~\ref{clientserver}). For example, a game on a mobile phone may be profiled over the wireless connection, with the profiler running on a desktop computer. Or you can run the client and server on the same machine, using a localhost connection. It is also possible to embed the visualization front-end in the profiled application, making the profiling self-contained\footnote{See section~\ref{embeddingserver} for guidelines.}. +Tracy uses the client-server model to enable a wide range of use-cases (see figure~\ref{clientserver}). For example, you may profile a game on a mobile phone over the wireless connection, with the profiler running on a desktop computer. Or you can run the client and server on the same machine, using a localhost connection. It is also possible to embed the visualization front-end in the profiled application, making the profiling self-contained\footnote{See section~\ref{embeddingserver} for guidelines.}. 
\begin{figure}[h] \centering\begin{tikzpicture} @@ -268,31 +268,31 @@ Tracy uses the client-server model to enable a wide range of use-cases (see figu \label{clientserver} \end{figure} -In Tracy terminology, the profiled application is a \emph{client} and the profiler itself is a \emph{server}. It was named this way because the client is a thin layer that just collects events and sends them for processing and long-term storage on the server. The fact that the server needs to connect to the client to begin the profiling session may be a bit confusing at first. +In Tracy terminology, the profiled application is a \emph{client}, and the profiler itself is a \emph{server}. It was named this way because the client is a thin layer that just collects events and sends them for processing and long-term storage on the server. The fact that the server needs to connect to the client to begin the profiling session may be a bit confusing at first. \subsection{Why Tracy?} -You may wonder, why should you use Tracy, when there are so many other profilers available. Here are some arguments: +You may wonder why you should use Tracy when so many other profilers are available. Here are some arguments: \begin{itemize} -\item Tracy is free and open source (BSD license), while RAD Telemetry costs about \$8000 per year. +\item Tracy is free and open-source (BSD license), while RAD Telemetry costs about \$8000 per year. \item Tracy provides out-of-the-box Lua bindings. It has been successfully integrated with other native and interpreted languages (Rust, Arma scripting language) using the C API (see chapter~\ref{capi} for reference). -\item Tracy has a wide variety of profiling options. You can profile CPU, GPU, locks, memory allocations, context switches and more. -\item Tracy is feature rich. Statistical information about zones, trace comparisons, or inclusion of inline function frames in call stacks (even in statistics of sampled stacks) are features unique to Tracy. 
-\item Tracy focuses on performance. Many tricks are used to reduce memory requirements and network bandwidth. The impact on the client execution speed is minimal, while other profilers perform heavy data processing within the profiled application (and then claim to be lightweight). +\item Tracy has a wide variety of profiling options. For example, you can profile CPU, GPU, locks, memory allocations, context switches, and more. +\item Tracy is feature-rich. For example, statistical information about zones, trace comparisons, or inclusion of inline function frames in call stacks (even in statistics of sampled stacks) are features unique to Tracy. +\item Tracy focuses on performance. It uses many tricks to reduce memory requirements and network bandwidth. As a result, the impact on the client execution speed is minimal, while other profilers perform heavy data processing within the profiled application (and then claim to be lightweight). \item Tracy uses low-level kernel APIs, or even raw assembly, where other profilers rely on layers of abstraction. -\item Tracy is multi-platform right from the very beginning. Both on the client and server side. Other profilers tend to have Windows-specific graphical interfaces. +\item Tracy is multi-platform right from the very beginning. Both on the client and server-side. Other profilers tend to have Windows-specific graphical interfaces. \item Tracy can handle millions of frames, zones, memory events, and so on, while other profilers tend to target very short captures. -\item Tracy doesn't require manual markup of interesting areas in your code to start profiling. You may rely on automated call stack sampling and add instrumentation later, when you know where it's needed. -\item Tracy provides mapping of source code to the assembly, with detailed information about cost of executing each instruction on the CPU. +\item Tracy doesn't require manual markup of interesting areas in your code to start profiling. 
Instead, you may rely on automated call stack sampling and add instrumentation later when you know where it's needed. +\item Tracy provides a mapping of source code to the assembly, with detailed information about the cost of executing each instruction on the CPU. \end{itemize} \subsection{Performance impact} \label{perfimpact} -To check how much slowdown is introduced by using Tracy, let's profile an example application. For this purpose we have used etcpak\footnote{\url{https://github.com/wolfpld/etcpak}}. The input data was a $16384 \times 16384$ pixels test image and the $4 \times 4$ pixel block compression function was selected to be instrumented. The image was compressed on 12 parallel threads, and the timing data represents a mean compression time of a single image. +Let's profile an example application to check how much slowdown is introduced by using Tracy. For this purpose we have used etcpak\footnote{\url{https://github.com/wolfpld/etcpak}}. The input data was a $16384 \times 16384$ pixels test image, and the $4 \times 4$ pixel block compression function was selected to be instrumented. The image was compressed on 12 parallel threads, and the timing data represents a mean compression time of a single image. -The results are presented in table~\ref{PerformanceImpact}. Dividing the average of run time differences (37.7 \si{\milli\second}) by a number of captured zones per single image (\num{16777216}) shows us that the impact of profiling is only 2.25 \si{\nano\second} per zone (this includes two events: start and end of a zone). +The results are presented in table~\ref{PerformanceImpact}. Dividing the average of run time differences (37.7 \si{\milli\second}) by the count of captured zones per single image (\num{16777216}) shows us that the impact of profiling is only 2.25 \si{\nano\second} per zone (this includes two events: start and end of a zone). 
\begin{table}[h] \centering @@ -307,7 +307,7 @@ ETC2 & \num{201326592} & \num{16777216} & 212.4 \si{\milli\second} & 250.5 \si{\ \subsubsection{Assembly analysis} -To see how such small overhead (only 2.25 \si{\nano\second}) is achieved, let's take a look at the assembly. The following x64 code is responsible for logging start of a zone. Do note that it is generated by compiling fully portable C++. +To see how Tracy achieves such small overhead (only 2.25 \si{\nano\second}), let's take a look at the assembly. The following x64 code is responsible for logging the start of a zone. Do note that it is generated by compiling fully portable C++. \begin{lstlisting}[language={[x86masm]Assembler}] mov byte ptr [rsp+0C0h],1 ; store zone activity information @@ -335,11 +335,11 @@ lea rax,[rbp+1] ; increment buffer counter mov qword ptr [rdi+28h],rax ; write buffer counter \end{lstlisting} -The second code block, responsible for ending a zone, is similar, but smaller, as it can reuse some variables retrieved in the above code. +The second code block, responsible for ending a zone, is similar but smaller, as it can reuse some variables retrieved in the above code. \subsection{Examples} -To see how Tracy can be integrated into an application, you may look at example programs in the \texttt{examples} directory. Looking at the commit history might be the best way to do that. +To see how to integrate Tracy into your application, you may look at example programs in the \texttt{examples} directory. Looking at the commit history might be the best way to do that. \subsection{On the web} @@ -354,16 +354,16 @@ Tracy can be found at the following web addresses: \subsubsection{Binary distribution} -The version releases of the profiler are provided as precompiled Windows binaries to be downloaded at \url{https://github.com/wolfpld/tracy/releases}, along with the user manual. You will need to install the latest Visual C++ redistributable package to use them. 
+The version releases of the profiler are provided as precompiled Windows binaries for download at \url{https://github.com/wolfpld/tracy/releases}, along with the user manual. You will need to install the latest Visual C++ redistributable package to use them. -Development builds of both Windows binaries and the user manual are provided as artifacts created by the automated Continuous Integration system on GitHub. +Development builds of both Windows binaries and the user manual are available as artifacts created by the automated Continuous Integration system on GitHub. Note that these binary releases require AVX2 instruction set support on the processor. If you have an older CPU, you will need to set a proper instruction set architecture in the project properties and build the executables yourself. \section{First steps} \label{firststeps} -Tracy Profiler supports MSVC, gcc and clang. A reasonably recent version of the compiler is needed, due to C++11 requirement. The following platforms are confirmed to be working (this is not a complete list): +Tracy Profiler supports MSVC, GCC, and clang. You will need to use a reasonably recent version of the compiler due to the C++11 requirement. The following platforms are confirmed to be working (this is not a complete list): \begin{itemize} \item Windows (x86, x64) @@ -375,7 +375,7 @@ Tracy Profiler supports MSVC, gcc and clang. A reasonably recent version of the \item iOS (ARM, ARM64) \end{itemize} -Moreover, the following platforms are not supported due to how secretive their owners are, but were reported to be working after extending the system integration layer: +Moreover, the following platforms are not supported due to how secretive their owners are but were reported to be working after extending the system integration layer: \begin{itemize} \item PlayStation 4 @@ -389,26 +389,26 @@ You may also try your luck with Mingw, but don't get your hopes too high. 
This p \subsection{Initial client setup} \label{initialsetup} -The recommended way to integrate Tracy into an application is to create a git submodule in the repository (assuming that git is used for version control). This way it is very easy to update Tracy to newly released versions. If that's not an option, copy all files from the Tracy checkout directory to your project. +The recommended way to integrate Tracy into an application is to create a git submodule in the repository (assuming that you use git for version control). This way, it is straightforward to update Tracy to newly released versions. If that's not an option, copy all files from the Tracy checkout directory to your project. \begin{bclogo}[ noborder=true, couleur=black!5, logo=\bclampe ]{What revision should I use?} -When deciding on the Tracy Profiler version you want to use, you have basically two options. Take into consideration the following pros and cons: +You have two options when deciding on the Tracy Profiler version you want to use. Take into consideration the following pros and cons: \begin{itemize} -\item Using the last-version-tagged revision will give you a stable platform to work with. You won't experience any breakages, major UI overhauls or network protocol changes. Unfortunately, you also won't be getting any bug fixes. +\item Using the last-version-tagged revision will give you a stable platform to work with. You won't experience any breakages, major UI overhauls, or network protocol changes. Unfortunately, you also won't be getting any bug fixes. \item Working with the bleeding edge \texttt{master} development branch will give you access to all the new improvements and features added to the profiler. While it is generally expected that \texttt{master} should always be usable, \textbf{there are no guarantees that it will be so.} \end{itemize} Do note that all bug fixes and pull requests are made against the \texttt{master} branch. 
\end{bclogo} -With the source code included in your project, add the \texttt{tracy/TracyClient.cpp} source file to the IDE project and/or makefile. You're done. Tracy is now integrated into the application. +With the source code included in your project, add the \texttt{tracy/TracyClient.cpp} source file to the IDE project or makefile. You're done. Tracy is now integrated into the application. -In the default configuration Tracy is disabled. This way you don't have to worry that the production builds will perform collection of profiling data. You will probably want to create a separate build configuration, with the \texttt{TRACY\_ENABLE} define, which enables profiling. +In the default configuration, Tracy is disabled. This way, you don't have to worry that the production builds will collect profiling data. To enable profiling, you will probably want to create a separate build configuration, with the \texttt{TRACY\_ENABLE} define. \begin{bclogo}[ noborder=true, @@ -417,20 +417,20 @@ logo=\bcbombe ]{Important} \begin{itemize} \item Double-check that the define name is entered correctly (as \texttt{TRACY\_ENABLE}), don't make a mistake of adding an additional \texttt{D} at the end. Make sure that this macro is defined for all files across your project (e.g. it should be specified in the \texttt{CFLAGS} variable, which is always passed to the compiler, or in an equivalent way), and \emph{not} as a \texttt{\#define} in just some of the source files. -\item The value of the define is not taken into consideration by Tracy, only the fact if the macro is defined, or not (unless specified otherwise). Be careful not to make a mistake of assigning numeric values to Tracy defines, which could lead you to being puzzled why constructs such as \texttt{TRACY\_ENABLE=0} don't work as you expect them to do. +\item Tracy does not consider the value of the define, only whether the macro is defined or not (unless specified otherwise). 
Be careful not to make the mistake of assigning numeric values to Tracy defines, which could lead you to be puzzled why constructs such as \texttt{TRACY\_ENABLE=0} don't work as you expect them to do. \end{itemize} \end{bclogo} -The application you want to profile should be compiled with all the usual optimization options enabled (i.e.~make a release build). It makes no sense to profile debugging builds, as the unoptimized code and additional checks (asserts, etc.) completely change how the program behaves. You should enable usage of the native architecture of your CPU (e.g.~\texttt{-march=native}), to leverage the expanded instruction sets, which may not be available in the default baseline target configuration. +You should compile the application you want to profile with all the usual optimization options enabled (i.e.~make a release build). Profiling debugging builds makes little sense, as the unoptimized code and additional checks (asserts, etc.) completely change how the program behaves. In addition, you should enable usage of the native architecture of your CPU (e.g.~\texttt{-march=native}) to leverage the expanded instruction sets, which may not be available in the default baseline target configuration. -Finally, on Unix make sure that the application is linked with libraries \texttt{libpthread} and \texttt{libdl}. BSD systems will also need to be linked with \texttt{libexecinfo}. +Finally, on Unix, make sure that the application is linked with libraries \texttt{libpthread} and \texttt{libdl}. BSD systems will also need to be linked with \texttt{libexecinfo}. \begin{bclogo}[ noborder=true, couleur=black!5, logo=\bclampe ]{CMake integration} -You can integrate Tracy easily with CMake by adding the git submodule folder as a subdirectory. +You can integrate Tracy with CMake by adding the git submodule folder as a subdirectory. 
\begin{lstlisting} # set options before add_subdirectory @@ -452,7 +452,7 @@ noborder=true, couleur=black!5, logo=\bclampe ]{CMake FetchContent} -When using CMake 3.11 or newer, you can use Tracy via CMake FetchContent. In this case, you do not need to manually add a git submodule for Tracy. Add this to your CMakeLists.txt: +When using CMake 3.11 or newer, you can use Tracy via CMake FetchContent. In this case, you do not need to add a git submodule for Tracy manually. Add this to your CMakeLists.txt: \begin{lstlisting} FetchContent_Declare( @@ -475,48 +475,48 @@ target_link_libraries( PUBLIC TracyClient) \subsubsection{Short-lived applications} -In case you want to profile a short-lived program (for example, a compression utility that finishes its work in one second), set the \texttt{TRACY\_NO\_EXIT} environment variable to $1$. With this option enabled, Tracy will not exit until an incoming connection is made, even if the application has already finished executing. If your platform doesn't support easy setup of environment variables, you may also add the \texttt{TRACY\_NO\_EXIT} define to your build configuration, which has the same effect. +In case you want to profile a short-lived program (for example, a compression utility that finishes its work in one second), set the \texttt{TRACY\_NO\_EXIT} environment variable to $1$. With this option enabled, Tracy will not exit until an incoming connection is made, even if the application has already finished executing. If your platform doesn't support an easy setup of environment variables, you may also add the \texttt{TRACY\_NO\_EXIT} define to your build configuration, which has the same effect. \subsubsection{On-demand profiling} \label{ondemand} -By default Tracy will begin profiling even before the program enters the \texttt{main} function. 
If you don't want to perform a full capture of application lifetime, you may define the \texttt{TRACY\_ON\_DEMAND} macro, which will enable profiling only when there's an established connection with the server. +By default, Tracy will begin profiling even before the program enters the \texttt{main} function. However, suppose you don't want to perform a full capture of the application lifetime. In that case, you may define the \texttt{TRACY\_ON\_DEMAND} macro, which will enable profiling only when there's an established connection with the server. -It should be noted, that if on-demand profiling is \emph{disabled} (which is the default), then the recorded events will be stored in the system memory until a server connection is made and the data can be uploaded\footnote{This memory is never released, but it is reused for collection of further events.}. Depending on the amount of the things profiled, the requirements for event storage can easily grow up to a couple of gigabytes. Since this data is cleared after the initial connection is made, you won't be able to perform a second connection to a client, unless the on-demand mode is used. +You should note that if on-demand profiling is \emph{disabled} (which is the default), then the recorded events will be stored in the system memory until a server connection is made and the data can be uploaded\footnote{This memory is never released, but the profiler reuses it for collection of other events.}. Depending on the amount of the things profiled, the requirements for event storage can quickly grow up to a couple of gigabytes. Furthermore, since this data is no longer available after the initial connection, you won't be able to perform a second connection to a client unless the on-demand mode is used. 
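As a minimal sketch of enabling on-demand mode, the fragment below passes both defines on the compiler command line so that they reach every source file. The \texttt{CXXFLAGS} variable name and the GCC/Clang-style \texttt{-D} syntax are assumptions about your build system; adapt them as needed.

\begin{lstlisting}
# Makefile fragment (assumed GCC/Clang-style flags).
# TRACY_ENABLE turns profiling on; TRACY_ON_DEMAND limits
# profiling to the time when a server connection exists.
CXXFLAGS += -DTRACY_ENABLE -DTRACY_ON_DEMAND
\end{lstlisting}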
\begin{bclogo}[ noborder=true, couleur=black!5, logo=\bcattention ]{Caveats} -The client with on-demand profiling enabled needs to perform additional bookkeeping, in order to present a coherent application state to the profiler. This incurs additional time cost for each profiling event. +The client with on-demand profiling enabled needs to perform additional bookkeeping to present a coherent application state to the profiler. This incurs additional time costs for each profiling event. \end{bclogo} \subsubsection{Client discovery} -By default Tracy client will announce its presence to the local network\footnote{Additional configuration may be required to achieve full functionality, depending on your network layout. Read about UDP broadcasts for more information.}. If you want to disable this feature, define the \texttt{TRACY\_NO\_BROADCAST} macro. +By default, the Tracy client will announce its presence to the local network\footnote{Additional configuration may be required to achieve full functionality, depending on your network layout. Read about UDP broadcasts for more information.}. If you want to disable this feature, define the \texttt{TRACY\_NO\_BROADCAST} macro. \subsubsection{Client network interface} -By default Tracy client will listen on all network interfaces. If you want to restrict it to only listening on the localhost interface, define the \texttt{TRACY\_ONLY\_LOCALHOST} macro at compile time, or set the \texttt{TRACY\_ONLY\_LOCALHOST} environment variable to $1$ at runtime. +By default, the Tracy client will listen on all network interfaces. If you want to restrict it to only listening on the localhost interface, define the \texttt{TRACY\_ONLY\_LOCALHOST} macro at compile-time, or set the \texttt{TRACY\_ONLY\_LOCALHOST} environment variable to $1$ at runtime. -By default Tracy client will listen on IPv6 interfaces, falling back to IPv4 only if IPv6 is not available. 
If you want to restrict it to only listening on IPv4 interfaces, define the \texttt{TRACY\_ONLY\_IPV4} macro at compile time, or set the \texttt{TRACY\_ONLY\_IPV4} environment variable to $1$ at runtime. +By default, the Tracy client will listen on IPv6 interfaces, falling back to IPv4 only if IPv6 is unavailable. If you want to restrict it to only listening on IPv4 interfaces, define the \texttt{TRACY\_ONLY\_IPV4} macro at compile-time, or set the \texttt{TRACY\_ONLY\_IPV4} environment variable to $1$ at runtime. \subsubsection{Setup for multi-DLL projects} -In projects that consist of multiple DLLs/shared objects things are a bit different. Compiling \texttt{TracyClient.cpp} into every DLL is not an option because this would result in several instances of Tracy objects lying around in the process. We rather need to pass the instances of them to the different DLLs to be reused there. +Things are a bit different in projects that consist of multiple DLLs/shared objects. Compiling \texttt{TracyClient.cpp} into every DLL is not an option because this would result in several instances of Tracy objects lying around in the process. We instead need to pass their instances to the different DLLs to be reused there. -For that you need a \emph{profiler DLL} to which your executable and the other DLLs link. If that doesn't exist you have to create one explicitly for Tracy\footnote{You may also look at the \texttt{library} directory in the profiler source tree.}. This library should contain the \texttt{tracy/TracyClient.cpp} source file. Link the executable and all DLLs which you want to profile to this DLL. +For that, you need a \emph{profiler DLL} to which your executable and the other DLLs link. If that doesn't exist, you have to create one explicitly for Tracy\footnote{You may also look at the \texttt{library} directory in the profiler source tree.}. This library should contain the \texttt{tracy/TracyClient.cpp} source file. 
Link the executable and all DLLs you want to profile to this DLL. If you are targeting Windows with Microsoft Visual Studio or MinGW, add the \texttt{TRACY\_IMPORTS} define to your application. If you are experiencing crashes or freezes when manually loading/unloading a separate DLL with Tracy integration, you might want to try defining both \texttt{TRACY\_DELAYED\_INIT} and \texttt{TRACY\_MANUAL\_LIFETIME} macros. -\texttt{TRACY\_DELAYED\_INIT} enables a path where profiler data is gathered into one structure and initialized on the first request rather than statically at the DLL load at the expense of atomic load on each request to the profiler data. \texttt{TRACY\_MANUAL\_LIFETIME} flag augments this behavior to provide manual \texttt{StartupProfiler} and \texttt{ShutdownProfiler} functions that allow you to manually create and destroy the profiler data, removing the need to do an atomic load on each call, as well as letting you define an appropriate place to free the resources. +\texttt{TRACY\_DELAYED\_INIT} enables a path where profiler data is gathered into one structure and initialized on the first request rather than statically at the DLL load at the expense of atomic load on each request to the profiler data. \texttt{TRACY\_MANUAL\_LIFETIME} flag augments this behavior to provide manual \texttt{StartupProfiler} and \texttt{ShutdownProfiler} functions that allow you to create and destroy the profiler data manually. This manual management removes the need to do an atomic load on each call and lets you define an appropriate place to free the resources. \subsubsection{Problematic platforms} -In case of some programming environments you may need to take extra steps to ensure Tracy is able to work correctly. +In the case of some programming environments, you may need to take extra steps to ensure Tracy can work correctly. 
\paragraph{Microsoft Visual Studio} @@ -535,9 +535,9 @@ Because Apple \emph{has} to be \emph{think different}, there are some problems w \paragraph{Android lunacy} \label{androidlunacy} -Starting with Android 8.0 you are no longer allowed to use the \texttt{/proc} file system. One of the consequences of this change is inability to check system CPU usage. +Starting with Android 8.0, you are no longer allowed to use the \texttt{/proc} file system. One of the consequences of this change is the inability to check system CPU usage. -This is apparently a security enhancement. In its infinite wisdom Google has decided to not give you any option to bypass this restriction. +This is apparently a security enhancement. Unfortunately, in its infinite wisdom, Google has decided not to give you an option to bypass this restriction. To workaround this limitation, you will need to have a rooted device. Execute the following commands using \texttt{root} shell: @@ -548,24 +548,24 @@ echo 0 > /proc/sys/kernel/perf_event_paranoid echo 0 > /proc/sys/kernel/kptr_restrict \end{lstlisting} -The first command will allow access to system CPU statistics. The second one will allow inspection of foreign processes (which is required for context switch capture). The third one will lower restrictions on access to performance counters. The last one will allow retrieval of kernel symbol pointers. \emph{Be sure that you are fully aware of the consequences of making these changes.} +The first command will allow access to system CPU statistics. The second one will enable inspection of foreign processes (required for context switch capture). The third one will lower restrictions on access to performance counters. The last one will allow retrieval of kernel symbol pointers. \emph{Be sure that you are fully aware of the consequences of making these changes.} \paragraph{Virtual machines} -The best way to run Tracy is on bare metal. 
Avoid profiling applications in virtualized environments, including services provided in the cloud. Virtualization interferes with the critical facilities needed for the profiler to work, which will influence the results you get. Possible problems may vary, depending on the configuration of the VM, and include: +The best way to run Tracy is on bare metal. Avoid profiling applications in virtualized environments, including services provided in the cloud. Virtualization interferes with the critical facilities needed for the profiler to work, influencing the results you get. Possible problems may vary, depending on the configuration of the VM, and include: \begin{itemize} \item Reduced precision of time stamps. -\item Inability to obtain precise time stamps, resulting in error messages such as \emph{CPU doesn't support RDTSC instruction}, or \emph{CPU doesn't support invariant TSC}. On Windows this can be worked around by rebuilding the profiled application with the \texttt{TRACY\_TIMER\_QPC} define, which severely lowers resolution of time readings. +\item Inability to obtain precise time stamps, resulting in error messages such as \emph{CPU doesn't support RDTSC instruction}, or \emph{CPU doesn't support invariant TSC}. On Windows, you can work around this by rebuilding the profiled application with the \texttt{TRACY\_TIMER\_QPC} define, which severely lowers the resolution of time readings. \item Frequency of call stack sampling may be reduced. -\item Call stack sampling might lack time stamps. While such reduced data set can be used to perform statistical analysis, you won't be able to limit the time range, or see the sampling zones on the timeline. +\item Call stack sampling might lack time stamps. While you can use such a reduced data set to perform statistical analysis, you won't be able to limit the time range or see the sampling zones on the timeline. 
\end{itemize} \subsubsection{Changing network port} -Network communication between the client and the server by default is performed using network port 8086. The profiling session utilizes the TCP protocol and client broadcasts are done over UDP. +By default, the client and server communicate on the network using port 8086. The profiling session utilizes the TCP protocol, and the client sends presence announcement broadcasts over UDP. -If for some reason you want to use another port\footnote{For example, other programs may already be using it, or you may have overzealous firewall rules, or you may want to run two clients on the same IP address.}, you can change it using the \texttt{TRACY\_DATA\_PORT} macro for the data connection, and \texttt{TRACY\_BROADCAST\_PORT} macro for client broadcasts. Alternatively, both ports may be changed at the same time by declaring the \texttt{TRACY\_PORT} macro (specific macros listed before have higher priority). The data connection port may be also changed without recompiling the client application, by setting the \texttt{TRACY\_PORT} environment variable. +Suppose for some reason you want to use another port\footnote{For example, other programs may already be using it, or you may have overzealous firewall rules, or you may want to run two clients on the same IP address.}. In that case, you can change it using the \texttt{TRACY\_DATA\_PORT} macro for the data connection and \texttt{TRACY\_BROADCAST\_PORT} macro for client broadcasts. Alternatively, you may change both ports at the same time by declaring the \texttt{TRACY\_PORT} macro (specific macros listed before have higher priority). You may also change the data connection port without recompiling the client application by setting the \texttt{TRACY\_PORT} environment variable. If a custom port is not specified and the default listening port is already occupied, the profiler will automatically try to listen on a number of other ports. 
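As a rough illustration of the options above, the commands below first build the client with a custom port and then override the data port at run time. The port numbers, the compiler invocation, and the \texttt{./application} name are placeholders chosen for this example, not values required by Tracy. Remember that the define should be passed to every source file in your project, not only to the client.

\begin{lstlisting}
# Compile time: TRACY_PORT changes both the data and broadcast ports
# (assumed GCC/Clang-style invocation, arbitrary port number).
c++ -O2 -DTRACY_ENABLE -DTRACY_PORT=9000 -c tracy/TracyClient.cpp
# Run time: the TRACY_PORT environment variable changes the data port only.
TRACY_PORT=9001 ./application
\end{lstlisting}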
@@ -582,95 +582,95 @@ To enable network communication, Tracy needs to open a listening port. Make sure When using Tracy Profiler, keep in mind the following requirements: \begin{itemize} -\item Each lock may be used in no more than 64 unique threads. -\item There can be no more than 65534 unique source locations\footnote{A source location is a place in the code, which is identified by source file name and line number, for example when you markup a zone.}. This number is further split in half between native code source locations and dynamic source locations (for example, when Lua instrumentation is used). +\item The application may use each lock in no more than 64 unique threads. +\item There can be no more than 65534 unique source locations\footnote{A source location is a place in the code, which is identified by source file name and line number, for example, when you mark up a zone.}. This number is further split in half between native code source locations and dynamic source locations (for example, when Lua instrumentation is used). \item Profiling session cannot be longer than 1.6 days ($2^{47}$ \si{\nano\second}). This also includes on-demand sessions. \item No more than 4 billion ($2^{32}$) memory free events may be recorded. \item No more than 16 million ($2^{24}$) unique call stacks can be captured. \end{itemize} -The following conditions also need apply, but don't trouble yourself with them too much. You would probably already knew, if you'd be breaking any. +The following conditions also need to apply, but don't trouble yourself with them too much. You would probably already know if you were breaking any. \begin{itemize} \item Only little-endian CPUs are supported. \item Virtual address space must be limited to 48 bits. -\item Tracy server requires CPU which is able to handle misaligned memory accesses. +\item The Tracy server requires a CPU that can handle misaligned memory accesses. 
\end{itemize} \subsection{Check your environment} -It is not an easy task to reliably measure performance of an application on modern machines. There are many factors affecting program execution characteristics, some of which you will be able to minimize, and others you will have to live with. It is critically important that you understand how these variables impact profiling results, as it is key to understanding the data you get. +It is not an easy task to reliably measure the performance of an application on modern machines. There are many factors affecting program execution characteristics, some of which you will be able to minimize and others you will have to live with. It is critically important that you understand how these variables impact profiling results, as it is key to understanding the data you get. \subsubsection{Operating system} \label{checkenvironmentos} -In a multitasking operating system applications compete for system resources with each other. This has a visible effect on the measurements performed by the profiler, which you may, or may not accept. +In a multitasking operating system, applications compete for system resources with each other. This has a visible effect on the measurements performed by the profiler, which you may or may not accept. -In order to get the most accurate profiling results you should minimize interference caused by other programs running on the same machine. Before starting a profile session close all web browsers, music players, instant messengers, and all other non-essential applications like Steam, Uplay, etc. Make sure you don't have the debugger hooked into the profiled program, as it also has impact on the timing results. +To get the most accurate profiling results, you should minimize interference caused by other programs running on the same machine. Before starting a profile session, close all web browsers, music players, instant messengers, and all other non-essential applications like Steam, Uplay, etc. 
Make sure you don't have the debugger hooked into the profiled program, as it also impacts the timing results. -Interference caused by other programs can be seen in the profiler, if context switch capture (section~\ref{contextswitches}) is enabled. +Interference caused by other programs can be seen in the profiler if context switch capture (section~\ref{contextswitches}) is enabled. \begin{bclogo}[ noborder=true, couleur=black!5, logo=\bclampe ]{Debugger in Visual Studio} -In MSVC you would typically run your program using the \emph{Start Debugging} menu option, which is conveniently available as a \keys{F5} shortcut. You should instead use the \emph{Start Without Debugging} option, available as \keys{\ctrl + F5} shortcut. +In MSVC, you would typically run your program using the \emph{Start Debugging} menu option, which is conveniently available as a \keys{F5} shortcut. You should instead use the \emph{Start Without Debugging} option, available as \keys{\ctrl + F5} shortcut. \end{bclogo} \subsubsection{CPU design} \label{checkenvironmentcpu} -Where to even begin here? Modern processors are such a complex beasts, that it's almost impossible to surely say anything about how they will behave. Cache configuration, prefetcher logic, memory timings, branch predictor, execution unit counts are all the drivers of instructions-per-cycle uplift nowadays, after the megahertz race had hit the wall. +Where to even begin here? Modern processors are such complex beasts that it's almost impossible to say anything certain about how they will behave. Cache configuration, prefetcher logic, memory timings, branch predictor, execution unit counts are all the drivers of instructions-per-cycle uplift nowadays, after the megahertz race had hit the wall. 
Not only is it challenging to reason about, but you also need to take into account how the CPU topology affects things, which is described in more detail in section~\ref{cputopology}. -Nevertheless, let's take a look on the ways we can try to stabilize the profiling data. +Nevertheless, let's look at how we can try to stabilize the profiling data. \paragraph{Superscalar out-of-order speculative execution} -Also known as: the \emph{spectre} thing we have to dealt with now. +Also known as: the \emph{spectre} thing we have to deal with now. -You must be aware that most processors available on the market\footnote{With the exception of low-cost ARM CPUs.} \emph{do not} execute machine code in a linear way, as laid out in the source code. This can lead to counterintuivive timing results reported by Tracy. Trying to get more 'reliable' readings\footnote{And by saying 'reliable' you do in reality mean: behaving in a way you expect it to.} would require a change in the behavior of the code and this is not a thing a profiler should do. Instead, Tracy shows you what the hardware is \emph{really} doing. +You must be aware that most processors available on the market\footnote{Except for low-cost ARM CPUs.} \emph{do not} execute machine code linearly, as laid out in the source code. This can lead to counterintuitive timing results reported by Tracy. Trying to get more 'reliable' readings\footnote{And by saying 'reliable,' you do in reality mean: behaving in a way you expect it to.} would require a change in the behavior of the code, and this is not a thing a profiler should do. So instead, Tracy shows you what the hardware is \emph{really} doing. -This is a complex subject and the details vary from one CPU to another. You can read a brief rundown of the topic at the following address: \url{https://travisdowns.github.io/blog/2019/06/11/speed-limits.html}. +This is a complex subject, and the details vary from one CPU to another. 
You can read a brief rundown of the topic at the following address: \url{https://travisdowns.github.io/blog/2019/06/11/speed-limits.html}. \paragraph{Simultaneous multithreading} Also known as: Hyper-threading. Typically present on Intel and AMD processors. -To get the most reliable results you should have whole CPU core resources dedicated to a single thread of your program. Otherwise you're no longer measuring the behavior of your code, but rather how it keeps up when its computing resources are randomly taken away by some other thing running on another pipeline within the same physical core. +To get the most reliable results, you should have all the CPU core resources dedicated to a single thread of your program. Otherwise, you're no longer measuring the behavior of your code but rather how it keeps up when its computing resources are randomly taken away by some other thing running on another pipeline within the same physical core. -Note that you might \emph{want} to observe this behavior, if you plan to deploy your application on a machine with simultaneous multithreading enabled. This would require careful examination of what else is running on the machine, or even how the threads of your own program are scheduled by the operating system, as various combinations of competing workloads (e.g. integer/floating point operations) will be impacted differently. +Note that you might \emph{want} to observe this behavior if you plan to deploy your application on a machine with simultaneous multithreading enabled. This would require careful examination of what else is running on the machine, or even how the operating system schedules the threads of your own program, as various combinations of competing workloads (e.g., integer/floating-point operations) will be impacted differently. \paragraph{Turbo mode frequency scaling} Also known as: Turbo Boost (Intel), Precision Boost (AMD). 
-While the CPU is more-or-less designed to always be able to work at the advertised \emph{base} frequency, there is usually some headroom left, which allows usage of the built-in automatic overclocking. There are no guarantees that the turbo frequencies can be attained, or how long they will be held, as there are many things to take into consideration: +While the CPU is more or less designed to always be able to work at the advertised \emph{base} frequency, there is usually some headroom left, which allows usage of the built-in automatic overclocking. There are no guarantees that the CPU can attain the turbo frequencies or how long it will sustain them, as there are many things to take into consideration: \begin{itemize} -\item How many cores are being used? Just one, or all 8? All 16? -\item What type of work is being performed? Integer? Floating point? 128-wide SIMD? 256-wide SIMD? 512-wide SIMD? -\item Were you lucky in the silicon lottery? Some dies are simply better made and are able to achieve higher frequencies. -\item Are you running on the best-rated core, or at the worst-rated core? Some cores may be unable to match the performance of other cores in the same processor. -\item What kind of cooling solution are you using? The cheap one bundled with the CPU, or a beefy chunk of metal that has no problem with heat dissipation? +\item How many cores are in use? Just one, or all 8? All 16? +\item What type of work is being performed? Integer? Floating-point? 128-wide SIMD? 256-wide SIMD? 512-wide SIMD? +\item Were you lucky in the silicon lottery? Some dies are just better made and can achieve higher frequencies. +\item Are you running on the best-rated core or on the worst-rated core? Some cores may be unable to match the performance of other cores in the same processor. +\item What kind of cooling solution are you using? The cheap one bundled with the CPU or a hefty chunk of metal that has no problem with heat dissipation? 
\item Do you have complete control over the power profile? Spoiler alert: no. The operating system may run anything at any time on any of the other cores, which will impact the turbo frequency you're able to achieve. \end{itemize} -As you can see, this feature basically screams 'unreliable results!' Best keep it disabled and run at the base frequency. Otherwise your timings won't make much sense. A true example: branchless compression function executing multiple times with the same input data was measured executing at \emph{four} different speeds. +As you can see, this feature basically screams 'unreliable results!' Best keep it disabled and run at the base frequency. Otherwise, your timings won't make much sense. A true example: a branchless compression function executing multiple times with the same input data was measured executing at \emph{four} different speeds. -Keep in mind that even at the base frequency you may hit thermal limits of the silicon and be downthrottled. +Keep in mind that even at the base frequency, you may hit the thermal limits of the silicon and be throttled down. \paragraph{Power saving} -This is basically the same as turbo mode, but in reverse. While unused, processor cores are kept at lower frequencies (or even completely disabled) to reduce power usage. When your code starts running\footnote{Not necessarily when the application is started, but also when, for example, a blocking mutex becomes released by other thread and is acquired.} the core frequency needs to ramp up, which may be visible in the measurements. +This is, in essence, the same as turbo mode, but in reverse. While unused, processor cores are kept at lower frequencies (or even wholly disabled) to reduce power usage. When your code starts running\footnote{Not necessarily when the application is started, but also when, for example, a blocking mutex is released by another thread and then acquired.}, the core frequency needs to ramp up, which may be visible in the measurements. 
-What's even worse, if your code doesn't do a lot of work (for example, because it is waiting for the GPU to finish rendering the frame), the core frequency might not be ramped up to 100\%, which will skew the results. +Even worse, if your code doesn't do a lot of work (for example, because it is waiting for the GPU to finish rendering the frame), the CPU might not ramp up the core frequency to 100\%, which will skew the results. Again, to get the best results, keep this feature disabled. \paragraph{AVX offset and power licenses} -Intel CPUs are unable to run at their advertised frequencies when wide SIMD operations are performed due to increased power requirements\footnote{AMD processors are not affected by this issue.}. Depending on the width \emph{and} type of operations performed, the core operating frequency will be reduced, in some cases quite drastically\footnote{\url{https://en.wikichip.org/wiki/intel/xeon_gold/5120\#Frequencies}}. To make things even better, \emph{some} part of the workload will be executed within the available power license, at twice reduced processing rate, then the CPU may be stopped for some time, so that the wide parts of executions units may be powered up, then the work will continue at full processing rate, but at reduced frequency. +Intel CPUs are unable to run at their advertised frequencies when they perform wide SIMD operations due to increased power requirements\footnote{AMD processors are not affected by this issue.}. Therefore, depending on the width \emph{and} type of operations executed, the core operating frequency will be reduced, in some cases quite drastically\footnote{\url{https://en.wikichip.org/wiki/intel/xeon_gold/5120\#Frequencies}}. To make things even better, \emph{some} parts of the workload will execute within the available power license, at a processing rate reduced by a factor of two. After that, the CPU may be stopped for some time so that the wide parts of the execution units can be powered up. 
Then the work will continue at full processing rate but at a reduced frequency. Be very careful when using AVX2 or AVX512. @@ -679,7 +679,7 @@ More information can be found at \url{https://travisdowns.github.io/blog/2020/01 \paragraph{Summing it up} \label{ryzen} -Power management schemes employed in various CPUs make it hard to reason about true performance of the code. For example, figure~\ref{ryzenimage} contains a histogram of function execution times (as described in chapter~\ref{findzone}), as measured on an AMD Ryzen CPU. The results ranged from 13.05~\si{\micro\second} to 61.25~\si{\micro\second} (extreme outliers were not included on the graph, limiting the longest displayed time to 36.04~\si{\micro\second}). +Power management schemes employed in various CPUs make it hard to reason about the true performance of the code. For example, figure~\ref{ryzenimage} contains a histogram of function execution times (as described in chapter~\ref{findzone}), as measured on an AMD Ryzen CPU. The results ranged from 13.05~\si{\micro\second} to 61.25~\si{\micro\second} (extreme outliers were not included on the graph, limiting the longest displayed time to 36.04~\si{\micro\second}). \begin{figure}[h] \centering @@ -688,18 +688,18 @@ Power management schemes employed in various CPUs make it hard to reason about t \label{ryzenimage} \end{figure} -We can immediately see that there are two distinct peaks, at 13.4~\si{\micro\second} and 15.3~\si{\micro\second}. A reasonable assumption would be that there are two paths in the code, one that can omit some work, and the second one which must do some additional job. But here's a catch -- the measured code is actually branchless and is always executed the same way. The two peaks represent two turbo frequencies between which the CPU was aggressively switching. +We can immediately see that there are two distinct peaks, at 13.4~\si{\micro\second} and 15.3~\si{\micro\second}. 
A reasonable assumption would be that there are two paths in the code, one that can omit some work, and the second one which must do some additional job. But here's a catch -- the measured code is actually branchless and always executes the same way. The two peaks represent two turbo frequencies between which the CPU was aggressively switching. -We can also see that the graph gradually falls off to the right (representing longer times), with a small bump near the end. This can be attributed to running in power saving mode, with differing reaction times to the required operating frequency boost to full power. +We can also see that the graph gradually falls off to the right (representing longer times), with a slight bump near the end. Again, this can be attributed to running in power-saving mode, with different reaction times to the required operating frequency boost to full power. \subsection{Building the server} \label{buildingserver} -The easiest way to get going is to build the data analyzer, available in the \texttt{profiler} directory. With it you can connect to localhost or remote clients and view the collected data right away. +The easiest way to get going is to build the data analyzer, available in the \texttt{profiler} directory. Then, you can connect to localhost or remote clients and view the collected data right away with it. -If you prefer to inspect the data only after a trace has been performed, you may use the command line utility in the \texttt{capture} directory. It will save a data dump that may be later opened in the graphical viewer application. +If you prefer to inspect the data only after a trace has been performed, you may use the command-line utility in the \texttt{capture} directory. It will save a data dump that you may later open in the graphical viewer application. -Note that ideally you should be using the same version of the Tracy profiler on both client and server. 
The network protocol may change in between versions, in which case you won't be able to make a connection. +Ideally, it would be best to use the same version of the Tracy profiler on both client and server. The network protocol may change in-between releases, in which case you won't be able to make a connection. See section~\ref{capturing} for more information about performing captures. @@ -708,7 +708,7 @@ noborder=true, couleur=black!5, logo=\bcbombe ]{Important} -Due to the memory requirements for data storage, Tracy server is only supposed to run on 64-bit platforms. While there is nothing preventing the program from building and executing in a 32-bit environment, doing so is not supported. +Due to the memory requirements for data storage, the Tracy server is only supposed to run on 64-bit platforms. While nothing prevents the program from building and executing in a 32-bit environment, doing so is not supported. \end{bclogo} \subsubsection{Required libraries} @@ -719,7 +719,7 @@ To build the application contained in the \texttt{profiler} directory, you will \paragraph{Windows} -On Windows you will need to use the \texttt{vcpkg} utility. If you are not familiar with this tool, please read the description at the following address: \url{https://docs.microsoft.com/en-us/cpp/build/vcpkg}. +On Windows, you will need to use the \texttt{vcpkg} utility. If you are not familiar with this tool, please read the description at the following address: \url{https://docs.microsoft.com/en-us/cpp/build/vcpkg}. There are two ways you can run \texttt{vcpkg} to install the dependencies for Tracy: @@ -751,37 +751,37 @@ On Windows navigate to the \texttt{build/win32} directory and open the solution \subsubsection{Embedding the server in profiled application} \label{embeddingserver} -While not officially supported, it is possible to embed the server in your application, the same one which is running the client part of Tracy. This is left up for you to figure out. 
+While not officially supported, it is possible to embed the server in your application, the same one running the client part of Tracy. How to make this work is left up for you to figure out. Note that most libraries bundled with Tracy are modified in some way and contained in the \texttt{tracy} namespace. The one exception is Dear ImGui, which can be freely replaced. -Be aware that while the Tracy client uses its own separate memory allocator, the server part of Tracy will use global memory allocation facilities, shared with the rest of your application. This will affect both the memory usage statistics and Tracy memory profiling. +Be aware that while the Tracy client uses its own separate memory allocator, the server part of Tracy will use global memory allocation facilities shared with the rest of your application. This will affect both the memory usage statistics and Tracy memory profiling. The following defines may be of interest: \begin{itemize} \item \texttt{TRACY\_NO\_FILESELECTOR} -- controls whether a system load/save dialog is compiled in. If it's enabled, the saved traces will be named \texttt{trace.tracy}. -\item \texttt{TRACY\_NO\_STATISTICS} -- Tracy will perform statistical data collection on the fly, if this macro is \emph{not} defined. This allows extended analysis of the trace (for example, you can perform a live search for matching zones) at a small CPU processing cost and a considerable memory usage increase (at least 8 bytes per zone). -\item \texttt{TRACY\_NO\_ROOT\_WINDOW} -- the main profiler view won't occupy whole window if this macro is defined. Additional setup is required for this to work. If you are embedding the server into your application you probably want to enable this option. +\item \texttt{TRACY\_NO\_STATISTICS} -- Tracy will perform statistical data collection on the fly, if this macro is \emph{not} defined. 
This allows extended trace analysis (for example, you can perform a live search for matching zones) at a small CPU processing cost and a considerable memory usage increase (at least 8 bytes per zone). +\item \texttt{TRACY\_NO\_ROOT\_WINDOW} -- the main profiler view won't occupy the whole window if this macro is defined. Additional setup is required for this to work. If you want to embed the server into your application, you probably should enable this option. \end{itemize} \subsubsection{DPI scaling} -The graphic server application will adapt to the system DPI scaling. If for some reason this doesn't work in your case, you may try setting the \texttt{TRACY\_DPI\_SCALE} environment variable to a scale fraction, where a value of 1 indicates no scaling. +The graphic server application will adapt to the system DPI scaling. If for some reason, this doesn't work in your case, you may try setting the \texttt{TRACY\_DPI\_SCALE} environment variable to a scale fraction, where a value of 1 indicates no scaling. \subsection{Naming threads} \label{namingthreads} Remember to set thread names for proper identification of threads. You should do so by using the function \texttt{tracy::SetThreadName(name)} exposed in the \texttt{tracy/common/TracySystem.hpp} header, as the system facilities typically have limited functionality. -If context switch capture is active, Tracy will try to capture thread names through operating system data. This is only a fallback mechanism and it shouldn't be relied upon. +Tracy will try to capture thread names through operating system data if context switch capture is active. However, this is only a fallback mechanism, and it shouldn't be relied upon. \subsection{Crash handling} \label{crashhandling} -On selected platforms (see section~\ref{featurematrix}) Tracy will intercept application crashes\footnote{For example, invalid memory accesses ('segmentation faults', 'null pointer exceptions'), divisions by zero, etc.}. This serves two purposes. 
First, the client application will be able to send the remaining profiling data to the server. Second, the server will receive a crash report with information about the crash reason, call stack at the time of crash, etc. +On selected platforms (see section~\ref{featurematrix}) Tracy will intercept application crashes\footnote{For example, invalid memory accesses ('segmentation faults', 'null pointer exceptions'), divisions by zero, etc.}. This serves two purposes. First, the client application will be able to send the remaining profiling data to the server. Second, the server will receive a crash report with the crash reason, call stack at the time of the crash, etc. -This is an automatic process and it doesn't require user interaction. +This is an automatic process, and it doesn't require user interaction. \begin{bclogo}[ noborder=true, @@ -790,7 +790,7 @@ logo=\bcattention ]{Caveats} \begin{itemize} \item On MSVC the debugger has priority over the application in handling exceptions. If you want to finish the profiler data collection with the debugger hooked-up, select the \emph{continue} option in the debugger pop-up dialog. -\item On Linux crashes are handled with signals. Tracy needs to have \texttt{SIGPWR} available, which is rather rarely used, but the program you are profiling may expect to employ it for its own purposes, which would cause a conflict\footnote{For example, it may be used by Mono to trigger garbage collection.}. To workaround such cases you may set the \texttt{TRACY\_CRASH\_SIGNAL} macro value to some other signal (see \texttt{man 7 signal} for a list of signals). Make sure that you avoid conflicts by selecting a signal that the application wouldn't normally receive or emit. +\item On Linux, crashes are handled with signals. Tracy needs to have \texttt{SIGPWR} available, which is rather rarely used. 
Still, the program you are profiling may expect to employ it for its own purposes, which would cause a conflict\footnote{For example, Mono may use it to trigger garbage collection.}. To work around such cases, you may set the \texttt{TRACY\_CRASH\_SIGNAL} macro value to some other signal (see \texttt{man 7 signal} for a list of signals). Ensure that you avoid conflicts by selecting a signal that the application wouldn't usually receive or emit. \end{itemize} \end{bclogo} @@ -833,34 +833,34 @@ VSync capture & \faCheck & \faTimes & \faTimes & \faTimes & \faTimes & \faTimes \section{Client markup} \label{client} -With the aforementioned steps you will be able to connect to the profiled program, but there probably won't be any data collection performed\footnote{With some small exceptions, see section~\ref{automated}.}. Unless you're able to perform automatic call stack sampling (see chapter~\ref{sampling}), you will have to manually instrument the application. All the user-facing interface is contained in the \texttt{tracy/Tracy.hpp} header file. +With the steps mentioned above, you will be able to connect to the profiled program, but there probably won't be any data collection performed\footnote{With some small exceptions, see section~\ref{automated}.}. Unless you're able to perform automatic call stack sampling (see chapter~\ref{sampling}), you will have to instrument the application manually. All the user-facing interface is contained in the \texttt{tracy/Tracy.hpp} header file. -Manual instrumentation is best started with adding markup to the main loop of the application, along with a few function that are called there. This will give you a rough outline of the function's time cost, which you may then further refine by instrumenting functions deeper in the call stack. Alternatively, automated sampling might guide you more quickly to places of interest.
+Manual instrumentation is best started with adding markup to the application's main loop, along with a few functions that the loop calls. Such an approach will give you a rough outline of the function's time cost, which you may then further refine by instrumenting functions deeper in the call stack. Alternatively, automated sampling might guide you more quickly to places of interest. \subsection{Handling text strings} \label{textstrings} -When dealing with Tracy macros, you will encounter two ways of providing string data to the profiler. In both cases you should pass \texttt{const char*} pointers, but there are differences in expected lifetime of the pointed data. +When dealing with Tracy macros, you will encounter two ways of providing string data to the profiler. In both cases, you should pass \texttt{const char*} pointers, but there are differences in the expected lifetime of the pointed data. \begin{enumerate} \item When a macro only accepts a pointer (for example: \texttt{TracyMessageL(text)}), the provided string data must be accessible at any time in program execution (\emph{this also includes the time after exiting the \texttt{main} function}). The string also cannot be changed. This basically means that the only option is to use a string literal (e.g.: \texttt{TracyMessageL("Hello")}). -\item If there's a string pointer with a size parameter (for example: \texttt{TracyMessage(text, size)}), the profiler will allocate an internal temporary buffer to store the data. The \texttt{size} count should not include the terminating null character, using \texttt{strlen(text)} is fine. The pointed-to data is not used afterwards. Remember that allocating and copying memory involved in this operation has a small time cost. +\item If there's a string pointer with a size parameter (for example \texttt{TracyMessage(text, size)}), the profiler will allocate a temporary internal buffer to store the data. 
The \texttt{size} count should not include the terminating null character, using \texttt{strlen(text)} is fine. The pointed-to data is not used afterward. Remember that allocating and copying memory involved in this operation has a small time cost. \end{enumerate} -Be aware that each single instance of text string data passed to the profiler can't be larger than 64 KB. +Be aware that every single instance of text string data passed to the profiler can't be larger than 64 KB. \subsubsection{Program data lifetime} \label{datalifetime} -Take extra care to consider the lifetime of program code (which includes string literals) in your application. If you're dynamically adding and removing modules (i.e. DLLs, shared objects) during the runtime, text data will be only present when the module is loaded. Additionally, when a module is unloaded, another one can be placed in its space in process memory map, which can result in aliasing of text strings. This leads to all sorts of confusion and potential crashes. +Take extra care to consider the lifetime of program code (which includes string literals) in your application. For example, if you dynamically add and remove modules (i.e., DLLs, shared objects) during the runtime, text data will only be present when the module is loaded. Additionally, when a module is unloaded, the operating system can place another one in its space in the process memory map, resulting in the aliasing of text strings. This leads to all sorts of confusion and potential crashes. -Note that string literals are available as the only option in many parts of the Tracy API. For example, take a look at how frame or plot names are specified. You cannot unload modules that contain string literals which were passed to the profiler\footnote{If you really do must unload a module, manually allocating a \texttt{char} buffer, as described in section~\ref{uniquepointers}, will give you a persistent string in memory.}. 
+Note that string literals are the only option in many parts of the Tracy API. For example, look at how frame or plot names are specified. You cannot unload modules that contain string literals that you passed to the profiler\footnote{If you really must unload a module, manually allocating a \texttt{char} buffer, as described in section~\ref{uniquepointers}, will give you a persistent string in memory.}. \subsubsection{Unique pointers} \label{uniquepointers} -In some cases, which will be clearly marked in the manual, Tracy expects that you provide an unique pointer in each case the same string literal is used. This can be exemplified in the following listing: +In some cases, which are clearly marked in the manual, Tracy expects you to pass the same, unique pointer every time a given string literal is used. This can be exemplified in the following listing: \begin{lstlisting} FrameMarkStart("Audio processing"); @@ -868,15 +868,15 @@ FrameMarkStart("Audio processing"); FrameMarkEnd("Audio processing"); \end{lstlisting} -Here you can see that two string literals with identical contents are passed to two different macros. It is entirely up to the compiler to decide if these two strings will be pooled into one pointer, or if there will be two instances present in the executable image\footnote{\cite{ISO:2012:III} \S 2.14.5.12: "Whether all string literals are distinct (that is, are stored in nonoverlapping objects) is implementation-defined".}. For example, on MSVC this is controlled by \menu[,]{Configuration Properties,C/C++,Code Generation,Enable String Pooling} option in the project properties (it is automatically enabled in optimized builds). Note that even if string pooling is enabled on the compilation unit level, it is still up to the linker to implement pooling across object files. +Here, we pass two string literals with identical contents to two different macros.
It is entirely up to the compiler to decide if it will pool these two strings into one pointer or if there will be two instances present in the executable image\footnote{\cite{ISO:2012:III} \S 2.14.5.12: "Whether all string literals are distinct (that is, are stored in nonoverlapping objects) is implementation-defined."}. For example, on MSVC, this is controlled by \menu[,]{Configuration Properties,C/C++,Code Generation,Enable String Pooling} option in the project properties (optimized builds enable it automatically). Note that even if string pooling is used on the compilation unit level, it is still up to the linker to implement pooling across object files. -As you can see making sure that string literals are properly pooled can be surprisingly tricky. To workaround this problem you may employ the following technique. In \emph{one} source file create the unique pointer for a string literal, for example: +As you can see, making sure that string literals are properly pooled can be surprisingly tricky. To work around this problem, you may employ the following technique. In \emph{one} source file create the unique pointer for a string literal, for example: \begin{lstlisting} const char* const sl_AudioProcessing = "Audio processing"; \end{lstlisting} -Then in each file where the literal is to be used, use the variable name instead. Notice that with such approach if you'd want to change a name passed to Tracy, you'd need to do it only in one place. +Then in each file where you want to use the literal, use the variable name instead. Notice that if you'd like to change a name passed to Tracy, you'd need to do it only in one place with such an approach. \begin{lstlisting} extern const char* const sl_AudioProcessing; @@ -886,7 +886,7 @@ FrameMarkStart(sl_AudioProcessing); FrameMarkEnd(sl_AudioProcessing); \end{lstlisting} -In some cases you may want to have semi-dynamic strings, for example you may want to enumerate workers, but don't know how many will be used. 
This can be handled by allocating a never-freed \texttt{char} buffer, which can be then propagated where it's needed. For example: +In some cases, you may want to have semi-dynamic strings. For example, you may want to enumerate workers but don't know how many will be used. You can handle this by allocating a never-freed \texttt{char} buffer, which you can then propagate where it's needed. For example: \begin{lstlisting} char* workerId = new char[16]; @@ -895,13 +895,11 @@ snprintf(workerId, 16, "Worker %i", id); FrameMarkStart(workerId); \end{lstlisting} -You have to make sure it's initialized only once, before passing it to any Tracy API, that it is not overwritten by new data, etc. In the end this is just a pointer to character-string data, and it doesn't matter if it was loaded from program image, or if it was allocated on the heap. - -Proper support for non-unique same-content string literals may be implemented in future. +You have to make sure it's initialized only once, before passing it to any Tracy API, that it is not overwritten by new data, etc. In the end, this is just a pointer to character-string data. It doesn't matter if the memory was loaded from the program image or allocated on the heap. \subsection{Specifying colors} -In some cases you will want to provide your own colors to be displayed by the profiler. In all such places you should use a hexadecimal \texttt{0xRRGGBB} notation. +In some cases, you will want to provide your own colors to be displayed by the profiler. You should use a hexadecimal \texttt{0xRRGGBB} notation in all such places. Alternatively you may use named colors predefined in \texttt{common/TracyColor.hpp} (included by \texttt{Tracy.hpp}). Visual reference: \url{https://en.wikipedia.org/wiki/X11_color_names}. 
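As a minimal sketch of the \texttt{0xRRGGBB} notation described above, the following shows how the three 8-bit channels map onto the value. \texttt{MakeColor} is a hypothetical helper introduced only for illustration. It is not part of the Tracy API.

```cpp
#include <cstdint>

// Hypothetical helper (NOT part of Tracy): packs 8-bit red, green and blue
// channels into a single value laid out as 0xRRGGBB.
constexpr std::uint32_t MakeColor(std::uint8_t r, std::uint8_t g, std::uint8_t b)
{
    return (std::uint32_t(r) << 16) | (std::uint32_t(g) << 8) | std::uint32_t(b);
}

// Red occupies the most significant of the three bytes, blue the least.
static_assert(MakeColor(0xFF, 0x80, 0x00) == 0xFF8000, "channels are packed as RRGGBB");
```

Because the helper is \texttt{constexpr}, its result is a compile-time constant, just like a hand-written hexadecimal literal.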
@@ -910,7 +908,7 @@ Do not use \texttt{0x000000} if you want to specify black color, as zero is a sp \subsection{Marking frames} \label{markingframes} -To slice the program's execution recording into frame-sized chunks\footnote{Each frame starts immediately after the previous has ended.}, put the \texttt{FrameMark} macro after you have completed rendering the frame. Ideally that would be right after the swap buffers command. +To slice the program's execution recording into frame-sized chunks\footnote{Each frame starts immediately after the previous one has ended.}, put the \texttt{FrameMark} macro after you have completed rendering the frame. Ideally, that would be right after the swap buffers command. \begin{bclogo}[ noborder=true, @@ -923,11 +921,11 @@ This step is optional, as some applications do not use the concept of a frame. \subsubsection{Secondary frame sets} \label{secondaryframeset} -In some cases you may want to track more than one set of frames in your program. To do so, you may use the \texttt{FrameMarkNamed(name)} macro, which will create a new set of frames for each unique name you provide. Make sure the passed string literal is properly pooled, as described in section~\ref{uniquepointers}. +In some cases, you may want to track more than one set of frames in your program. To do so, you may use the \texttt{FrameMarkNamed(name)} macro, which will create a new set of frames for each unique name you provide. Make sure the passed string literal is correctly pooled, as described in section~\ref{uniquepointers}. \subsubsection{Discontinuous frames} -Some types of frames are discontinuous by nature. For example, a physics processing step in a game loop, or an audio callback running on a separate thread. These kinds of workloads are executed periodically, with a pause between each run. Tracy can also track these kind of frames. +Some types of frames are discontinuous by their nature -- they are executed periodically, with a pause between each run.
Examples of such frames are a physics processing step in a game loop or an audio callback running on a separate thread. Tracy can also track this kind of frame. To mark the beginning of a discontinuous frame use the \texttt{FrameMarkStart(name)} macro. After the work is finished, use the \texttt{FrameMarkEnd(name)} macro. @@ -948,11 +946,11 @@ logo=\bcbombe It is possible to attach a screen capture of your application to any frame in the main frame set. This can help you see the context of what's happening in various places in the trace. You need to implement retrieval of the image data from GPU by yourself. -Images are sent using the \texttt{FrameImage(image, width, height, offset, flip)} macro, where \texttt{image} is a pointer to RGBA\footnote{Alpha value is ignored, but leaving it out wouldn't map well to the way graphics hardware works.} pixel data, \texttt{width} and \texttt{height} are the image dimensions, which \emph{must be divisible by 4}, \texttt{offset} specifies how much frame lag was there for the current image (see chapter~\ref{screenshotcode}), and \texttt{flip} should be set, if the graphics API stores images upside-down\footnote{For example, OpenGL flips images, but Vulkan does not.}. The image data is copied by the profiler and doesn't need to be retained. +Images are sent using the \texttt{FrameImage(image, width, height, offset, flip)} macro, where \texttt{image} is a pointer to RGBA\footnote{Alpha value is ignored, but leaving it out wouldn't map well to the way graphics hardware works.} pixel data, \texttt{width} and \texttt{height} are the image dimensions, which \emph{must be divisible by 4}, \texttt{offset} specifies how much frame lag there was for the current image (see chapter~\ref{screenshotcode}), and \texttt{flip} should be set, if the graphics API stores images upside-down\footnote{For example, OpenGL flips images, but Vulkan does not.}. The profiler copies the image data, so you don't need to retain it.
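The sizing rules for \texttt{FrameImage} stated above can be sketched as follows. This is only an illustration: \texttt{AlignDown4} and \texttt{RgbaSize} are hypothetical helpers that are not part of Tracy, and the sketch assumes a tightly packed RGBA buffer with 4 bytes per pixel.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (NOT part of Tracy): rounds an image dimension down
// to the nearest multiple of 4, as the width and height must be divisible by 4.
constexpr std::uint32_t AlignDown4(std::uint32_t v)
{
    return v - (v % 4);
}

// Hypothetical helper (NOT part of Tracy): number of bytes in a tightly
// packed RGBA image, assuming 4 bytes per pixel and no row padding.
constexpr std::size_t RgbaSize(std::uint32_t width, std::uint32_t height)
{
    return std::size_t(width) * height * 4;
}
```

For instance, a requested capture width of 321 pixels would be trimmed to 320, and the buffer for a $320\times180$ image would hold \texttt{RgbaSize(320, 180)} bytes.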
-Handling image data requires a lot of memory and bandwidth\footnote{One uncompressed 1080p image takes 8 MB.}. To achieve sane memory usage you should scale down taken screen shots to a sensible size, e.g. $320\times180$. +Handling image data requires a lot of memory and bandwidth\footnote{One uncompressed 1080p image takes 8 MB.}. To achieve sane memory usage, you should scale down taken screenshots to a suitable size, e.g., $320\times180$. -To further reduce image data size, frame images are internally compressed using the DXT1 Texture Compression technique\footnote{\url{https://en.wikipedia.org/wiki/S3_Texture_Compression}}, which significantly reduces data size\footnote{One pixel is stored in a nibble (4 bits) instead of 32 bits.}, at a small quality decrease. The compression algorithm is very fast and can be made even faster by enabling SIMD processing, as indicated in table~\ref{EtcSimd}. +To further reduce image data size, frame images are internally compressed using the DXT1 Texture Compression technique\footnote{\url{https://en.wikipedia.org/wiki/S3_Texture_Compression}}, which significantly reduces data size\footnote{One pixel is stored in a nibble (4 bits) instead of 32 bits.}, at a slight quality decrease. The compression algorithm is high-speed and can be made even faster by enabling SIMD processing, as indicated in table~\ref{EtcSimd}. \begin{table}[h] \centering @@ -978,20 +976,20 @@ couleur=black!5, logo=\bcattention ]{Caveats} \begin{itemize} -\item Frame images are compressed on a second client profiler thread\footnote{Small part of compression task is performed on the server.}, to reduce memory usage of queued images. This might have impact on the performance of the profiled application. +\item Frame images are compressed on a second client profiler thread\footnote{Small part of compression task is offloaded to the server.}, to reduce memory usage of queued images. This might have an impact on the performance of the profiled application. 
\item This second thread will be periodically woken up, even if there are no frame images to compress\footnote{This way of doing things is required to prevent a deadlock in specific circumstances.}. If you are not using the frame image capture functionality and you don't wish this thread to be running, you can define the \texttt{TRACY\_NO\_FRAME\_IMAGE} macro. -\item Due to implementation details of the network buffer, single frame image cannot be greater than 256 KB after compression. Note that a $960\times540$ image fits in this limit. +\item Due to implementation details of the network buffer, a single frame image cannot be greater than 256 KB after compression. Note that a $960\times540$ image fits in this limit. \end{itemize} \end{bclogo} \paragraph{OpenGL screen capture code example} \label{screenshotcode} -There are many pitfalls associated with retrieving screen contents in an efficient way. For example, using \texttt{glReadPixels} and then resizing the image using some library is terrible for performance, as it forces synchronization of the GPU to CPU and performs the downscaling in software. To do things properly we need to scale the image using the graphics hardware and transfer data asynchronously, which allows the GPU to run independently of CPU. +There are many pitfalls associated with efficiently retrieving screen content. For example, using \texttt{glReadPixels} and then resizing the image using some library is terrible for performance, as it forces synchronization of the GPU to CPU and performs the downscaling in software. To do things properly, we need to scale the image using the graphics hardware and transfer data asynchronously, which allows the GPU to run independently of the CPU. -The following example shows how this can be achieved using OpenGL 3.2. More recent OpenGL versions allow doing things even better (for example by using persistent buffer mapping), but it won't be covered here. 
+The following example shows how this can be achieved using OpenGL 3.2. Of course, more recent OpenGL versions allow doing things even better (for example, using persistent buffer mapping), but this manual won't cover it here. -Let's begin by defining the required objects. We need a \emph{texture} to store the resized image, a \emph{framebuffer object} to be able to write to the texture, a \emph{pixel buffer object} to store the image data for access by the CPU and a \emph{fence} to know when the data is ready for retrieval. We need everything in \emph{at least} three copies (we'll use four), because the rendering, as seen in program, may be ahead of the GPU by a couple frames. We need an index to access the appropriate data set in a ring-buffer manner. And finally, we need a queue to store indices to data sets that we are still waiting for. +Let's begin by defining the required objects. First, we need a \emph{texture} to store the resized image, a \emph{framebuffer object} to be able to write to the texture, a \emph{pixel buffer object} to store the image data for access by the CPU, and a \emph{fence} to know when the data is ready for retrieval. We need everything in \emph{at least} three copies (we'll use four) because the rendering, as seen in the program, can run ahead of the GPU by a couple of frames. Next, we need an index to access the appropriate data set in a ring-buffer manner. And finally, we need a queue to store indices to data sets that we are still waiting for. \begin{lstlisting} GLuint m_fiTexture[4]; @@ -1002,7 +1000,7 @@ int m_fiIdx = 0; std::vector m_fiQueue; \end{lstlisting} -Everything needs to be properly initialized (the cleanup is left for the reader to figure out). +Everything needs to be correctly initialized (the cleanup is left for the reader to figure out). 
\begin{lstlisting} glGenTextures(4, m_fiTexture); @@ -1024,7 +1022,7 @@ for(int i=0; i<4; i++) } \end{lstlisting} -We will now setup a screen capture, which will downscale the screen contents to $320\times180$ pixels and copy the resulting image to a buffer which will be accessible by the CPU when the operation is done. This should be placed right before \emph{swap buffers} or \emph{present} call. +We will now set up a screen capture, which will downscale the screen contents to $320\times180$ pixels and copy the resulting image to a buffer accessible by the CPU when the operation is done. This should be placed right before the \emph{swap buffers} or \emph{present} call. \begin{lstlisting} assert(m_fiQueue.empty() || m_fiQueue.front() != m_fiIdx); // check for buffer overrun @@ -1040,7 +1038,7 @@ m_fiQueue.emplace_back(m_fiIdx); m_fiIdx = (m_fiIdx + 1) % 4; \end{lstlisting} -And lastly, just before the capture setup code that was just added\footnote{Yes, before. We are handling past screen captures here.} we need to have the image retrieval code. We are checking if the capture operation has finished and if it has, we map the \emph{pixel buffer object} to memory, inform the profiler that there's image data to be handled, unmap the buffer and go to check the next queue item. If a capture is still pending, we break out of the loop and wait until the next frame to check if the GPU has finished the capture. +And lastly, just before the capture setup code that was just added\footnote{Yes, before. We are handling past screen captures here.} we need to have the image retrieval code. We are checking if the capture operation has finished. If it has, we map the \emph{pixel buffer object} to memory, inform the profiler that there is image data to be handled, unmap the buffer and go to check the next queue item. If a capture is still pending, we break out of the loop. We will have to wait until the next frame to check if the GPU has finished performing the capture.
\begin{lstlisting} while(!m_fiQueue.empty()) @@ -1056,29 +1054,29 @@ while(!m_fiQueue.empty()) } \end{lstlisting} -Notice that in the call to \texttt{FrameImage} we are passing the remaining queue size as the \texttt{offset} parameter. Queue size represents how many frames ahead our program is relative to the GPU. Since we are sending past frame images we need to specify how many frames behind the images are. Of course if this would be a synchronous capture (without use of fences and with retrieval code after the capture setup), we would set \texttt{offset} to zero, as there would be no frame lag. +Notice that in the call to \texttt{FrameImage} we are passing the remaining queue size as the \texttt{offset} parameter. Queue size represents how many frames ahead our program is relative to the GPU. Since we are sending past frame images, we need to specify how many frames behind the images are. Of course, if this would be synchronous capture (without the use of fences and with retrieval code after the capture setup), we would set \texttt{offset} to zero, as there would be no frame lag. \subparagraph{High quality capture} -The code above uses \texttt{glBlitFramebuffer} function, which can only use nearest neighbor filtering. This can result in low-quality screen shots, as shown on figure~\ref{lowqualityss}. With a bit more work it is possible to obtain much nicer looking screen shots, as presented on figure~\ref{highqualityss}. Unfortunately, you will need to setup a complete rendering pipeline for this to work. +The code above uses \texttt{glBlitFramebuffer} function, which can only use nearest neighbor filtering. The use of such filtering can result in low-quality screenshots, as shown in figure~\ref{lowqualityss}. However, with a bit more work, it is possible to obtain nicer-looking screenshots, as presented in figure~\ref{highqualityss}. Unfortunately, you will need to set up a complete rendering pipeline for this to work. 
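The frame-lag bookkeeping behind the \texttt{offset} parameter can be modelled without any GPU calls. The following is a minimal sketch, not code from the manual: \texttt{CaptureRing}, \texttt{Retrieve} and \texttt{ready} are hypothetical names, the fence query is replaced by a plain counter of finished captures, and the sketch assumes the queue size is read before the finished entry is removed, so the reported value counts the image being sent.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical CPU-only model of the four-slot capture ring described above.
// It makes no OpenGL or Tracy calls; it only tracks which slots are pending.
struct CaptureRing
{
    static constexpr int Slots = 4;
    int idx = 0;            // next slot to record into (m_fiIdx in the text)
    std::deque<int> queue;  // slots still waiting for the GPU (m_fiQueue)

    // Retrieval step, run before the capture step of the same frame.
    // `ready` stands in for the fence check: it is the number of oldest
    // pending captures that the GPU has already finished.
    // Returns the offset that would accompany each retrieved image.
    std::vector<int> Retrieve(std::size_t ready)
    {
        std::vector<int> offsets;
        while(!queue.empty() && ready > 0)
        {
            // Assumption: the size is read before the entry is removed.
            offsets.push_back(static_cast<int>(queue.size()));
            queue.pop_front();
            --ready;
        }
        return offsets;
    }

    // Capture step: claim the current slot and advance the ring index.
    void Capture()
    {
        // Same overrun check as in the text: the slot must not be pending.
        assert(queue.empty() || queue.front() != idx);
        queue.push_back(idx);
        idx = (idx + 1) % Slots;
    }
};
```

With two captures queued and both finished by the third frame, \texttt{Retrieve(2)} reports offsets 2 and 1, oldest image first, which matches the idea that older images lag further behind the current frame.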
-First, you need to allocate additional set of intermediate frame buffers and textures, sized the same as the screen. These new textures should have minification filter set to \texttt{GL\_LINEAR\_MIPMAP\_LINEAR}. You will also need to setup everything needed to render a full-screen quad: a simple texturing shader and vertex buffer with appropriate data. Since this vertex buffer will be used to render to the scaled-down framebuffer, you may prepare its contents beforehand and update it only when the aspect ratio would change. +First, you need to allocate an additional set of intermediate frame buffers and textures, sized the same as the screen. These new textures should have a minification filter set to \texttt{GL\_LINEAR\_MIPMAP\_LINEAR}. You will also need to set up everything needed to render a full-screen quad: a simple texturing shader and vertex buffer with appropriate data. Since you will use this vertex buffer to render to the scaled-down frame buffer, you may prepare its contents beforehand and update it only when the aspect ratio changes. -With all this done, the screen capture can be performed as follows: +With all this done, you can perform the screen capture as follows: \begin{itemize} \item Setup vertex buffer configuration for the full-screen quad buffer (you only need position and uv~coordinates). -\item Blit the screen contents to the full-sized framebuffer. -\item Bind the texture backing the full-sized framebuffer. -\item Generate mip-maps using \texttt{glGenerateMipmap}. +\item Blit the screen contents to the full-sized frame buffer. +\item Bind the texture backing the full-sized frame buffer. +\item Generate mipmaps using \texttt{glGenerateMipmap}. \item Set viewport to represent the scaled-down image size. \item Bind vertex buffer data, shader, setup the required uniforms. -\item Draw full-screen quad to the scaled-down framebuffer. -\item Retrieve framebuffer contents, as in the code above. 
+\item Draw full-screen quad to the scaled-down frame buffer. +\item Retrieve frame buffer contents, as in the code above. \item Restore viewport, vertex buffer configuration, bound textures, etc. \end{itemize} -While this approach is much more complex than the previously discussed one, the resulting image quality increase makes it worth it. +While this approach is much more complex than the previously discussed one, the resulting image quality increase makes it worthwhile. \begin{figure}[h] \centering @@ -1096,7 +1094,7 @@ While this approach is much more complex than the previously discussed one, the \end{minipage} \end{figure} -You can see the performance results you may expect in a simple application in table~\ref{asynccapture}. The na\"ive capture performs synchronous retrieval of full screen image and resizes it using \emph{stb\_image\_resize}. The proper and high quality captures do things as described in this chapter. +You can see the performance results you may expect in a simple application in table~\ref{asynccapture}. The na\"ive capture performs synchronous retrieval of full-screen image and resizes it using \emph{stb\_image\_resize}. The proper and high-quality captures do things as described in this chapter. \begin{table}[h] \centering @@ -1112,25 +1110,25 @@ $2560\times1440$ & 23~FPS & 3300~FPS & 1600~FPS \subsection{Marking zones} \label{markingzones} -To record a zone's\footnote{A \texttt{zone} represents the lifetime of a special on-stack profiler variable. Typically it would exist for the duration of a whole scope of the profiled function, but you also can measure time spent in scopes of a for-loop, or an if-branch.} execution time add the \texttt{ZoneScoped} macro at the beginning of the scope you want to measure. This will automatically record function name, source file name and location. Optionally you may use the \texttt{ZoneScopedC(color)} macro to set a custom color for the zone. 
Note that the color value will be constant in the recording (don't try to parametrize it). You may also set a custom name for the zone, using the \texttt{ZoneScopedN(name)} macro. Color and name may be combined by using the \texttt{ZoneScopedNC(name, color)} macro. +To record a zone's\footnote{A \texttt{zone} represents the lifetime of a special on-stack profiler variable. Typically it would exist for the duration of a whole scope of the profiled function, but you also can measure time spent in scopes of a for-loop or an if-branch.} execution time add the \texttt{ZoneScoped} macro at the beginning of the scope you want to measure. This will automatically record function name, source file name, and location. Optionally you may use the \texttt{ZoneScopedC(color)} macro to set a custom color for the zone. Note that the color value will be constant in the recording (don't try to parametrize it). You may also set a custom name for the zone, using the \texttt{ZoneScopedN(name)} macro. Color and name may be combined by using the \texttt{ZoneScopedNC(name, color)} macro. -Use the \texttt{ZoneText(text, size)} macro to add a custom text string that will be displayed along the zone information (for example, name of the file you are opening). Multiple text strings can be attached to any single zone. The dynamic color of a zone can be specified with the \texttt{ZoneColor(uint32\_t)} macro to override the source location color. If you want to send a numeric value and don't want to pay the cost of converting it to a string, you may use the \texttt{ZoneValue(uint64\_t)} macro. You can check if the current zone is active with the \texttt{ZoneIsActive} macro. +Use the \texttt{ZoneText(text, size)} macro to add a custom text string that the profiler will display along with the zone information (for example, name of the file you are opening). Multiple text strings can be attached to any single zone. 
The dynamic color of a zone can be specified with the \texttt{ZoneColor(uint32\_t)} macro to override the source location color. If you want to send a numeric value and don't want to pay the cost of converting it to a string, you may use the \texttt{ZoneValue(uint64\_t)} macro. Finally, you can check if the current zone is active with the \texttt{ZoneIsActive} macro. -If you want to set zone name on a per-call basis, you may do so using the \texttt{ZoneName(text, size)} macro. This name won't be used in the process of grouping the zones for statistical purposes (sections~\ref{statistics} and~\ref{findzone}). +If you want to set zone name on a per-call basis, you may do so using the \texttt{ZoneName(text, size)} macro. However, this name won't be used in the process of grouping the zones for statistical purposes (sections~\ref{statistics} and~\ref{findzone}). \begin{bclogo}[ noborder=true, couleur=black!5, logo=\bcbombe ]{Important} -Zones are identified using static data structures embedded in program code. You need to consider the lifetime of code in your application, as discussed in section~\ref{datalifetime}, to make sure that the profiler is able to access this data at any time during the program lifetime. +Zones are identified using static data structures embedded in program code. Therefore, you need to consider the lifetime of code in your application, as discussed in section~\ref{datalifetime}, to make sure that the profiler can access this data at any time during the program lifetime. -If this requirement can't be fulfilled, you must use transient zones, described in section~\ref{transientzones}. +If you can't fulfill this requirement, you must use transient zones, described in section~\ref{transientzones}. \end{bclogo} \subsubsection{Manual management of zone scope} -The zone markup macros automatically report when they end, through the RAII mechanism\footnote{\url{https://en.cppreference.com/w/cpp/language/raii}}. 
This is very helpful, but sometimes you may want to mark the zone start and end points yourself, for example if you want to have a zone that crosses the function's boundary. This can be achieved by using the C API, which is described in section~\ref{capi}. +The zone markup macros automatically report when they end, through the RAII mechanism\footnote{\url{https://en.cppreference.com/w/cpp/language/raii}}. This is very helpful, but sometimes you may want to mark the zone start and end points yourself, for example, if you want to have a zone that crosses the function's boundary. You can achieve this by using the C API, which is described in section~\ref{capi}. \subsubsection{Multiple zones in one scope} \label{multizone} @@ -1146,7 +1144,7 @@ noborder=true, couleur=black!5, logo=\bcattention ]{Zone stack} -The \texttt{ZoneScoped} macros are imposing creation and usage of an implicit zone stack. You must follow the rules of this stack also when you are using the named macros, which give you some more leeway in doing things. For example, you can only set the text for the zone which is on top of the stack, as you only could do with the \texttt{ZoneText} macro. It doesn't matter that you can call the \texttt{Text} method of a non-top zone which is accessible through a variable. Take a look at the following code: +The \texttt{ZoneScoped} macros are imposing the creation and usage of an implicit zone stack. You must also follow the rules of this stack when using the named macros, which give you some more leeway in doing things. For example, you can only set the text for the zone which is on top of the stack, as you only could do with the \texttt{ZoneText} macro. It doesn't matter that you can call the \texttt{Text} method of a non-top zone which is accessible through a variable. 
Take a look at the following code: \begin{lstlisting} { @@ -1161,17 +1159,16 @@ The \texttt{ZoneScoped} macros are imposing creation and usage of an implicit zo \end{lstlisting} It is valid to set the \texttt{Zone1} text or name \emph{only} in places \circled{a} or \circled{c}. After \texttt{Zone2} is created at \circled{b} you can no longer perform operations on \texttt{Zone1}, until \texttt{Zone2} is destroyed. - \end{bclogo} \subsubsection{Filtering zones} \label{filteringzones} -Zone logging can be disabled on a per zone basis, by making use of the \texttt{ZoneNamed} macros. Each of the macros takes an \texttt{active} argument ('\texttt{true}' in the example in section~\ref{multizone}), which will determine whether the zone should be logged. +Zone logging can be disabled on a per-zone basis by making use of the \texttt{ZoneNamed} macros. Each of the macros takes an \texttt{active} argument ('\texttt{true}' in the example in section~\ref{multizone}), which will determine whether the zone should be logged. -Note that this parameter may be a run-time variable, for example an user controlled switch to enable profiling of a specific part of code only when required. +Note that this parameter may be a run-time variable, such as a user-controlled switch to enable profiling of a specific part of code only when required. -If the condition is constant at compile-time, the resulting code will not contain a branch (the profiling code will either be always enabled, or won't be there at all). The following listing presents how profiling of specific application subsystems might be implemented: +If the condition is constant at compile-time, the resulting code will not contain a branch (the profiling code will either be always enabled or won't be there at all). 
The following listing presents how you might implement profiling of specific application subsystems: \begin{lstlisting} enum SubSystems @@ -1204,7 +1201,7 @@ void Graphics::Render() \subsubsection{Transient zones} \label{transientzones} -In order to prevent problems caused by unloadable code, described in section~\ref{datalifetime}, transient zones copy the source location data to an on-heap buffer. This way the requirement on the string literal data being accessible for the rest of program lifetime is relaxed, at the cost of increased memory usage. +In order to prevent problems caused by unloadable code, described in section~\ref{datalifetime}, transient zones copy the source location data to an on-heap buffer. This way, the requirement on the string literal data being accessible for the rest of the program lifetime is relaxed, at the cost of increased memory usage. Transient zones can be declared through the \texttt{ZoneTransient} and \texttt{ZoneTransientN} macros, with the same set of parameters as the \texttt{ZoneNamed} macros. See section~\ref{multizone} for details and make sure that you observe the requirements outlined there. @@ -1229,15 +1226,15 @@ This doesn't stop some compilers from dispensing \emph{fashion advice} about var \subsubsection{Exiting program from within a zone} -At the present time exiting the profiled application from inside a zone is not supported. When the client calls \texttt{exit()}, the profiler will wait for all zones to end, before a program can be truly terminated. If program execution stopped inside a zone, this will never happen, and the profiled application will seemingly hang up. At this point you will need to manually terminate the program (or simply disconnect the profiler server). +Exiting the profiled application from inside a zone is not supported. When the client calls \texttt{exit()}, the profiler will wait for all zones to end before a program can be truly terminated. 
If program execution stops inside a zone, this will never happen, and the profiled application will seemingly hang up. At this point, you will need to manually terminate the program (or disconnect the profiler server). As a workaround, you may add a \texttt{try}/\texttt{catch} pair at the bottom of the function stack (for example in the \texttt{main()} function) and replace \texttt{exit()} calls with throwing a custom exception. When this exception is caught, you may call \texttt{exit()}, knowing that the application's data structures (including profiling zones) were properly cleaned up. \subsection{Marking locks} -Modern programs must use multi-threading to achieve full performance capability of the CPU. Correct execution requires claiming exclusive access to data shared between threads. When many threads want to enter the same critical section at once, the application's multi-threaded performance advantage is nullified. To help solve this problem, Tracy can collect and display lock interactions in threads. +Modern programs must use multi-threading to achieve the full performance capability of the CPU. However, correct execution requires claiming exclusive access to data shared between threads. When many threads want to simultaneously enter the same critical section, the application's multi-threaded performance advantage is nullified. To help solve this problem, Tracy can collect and display lock interactions in threads. -To mark a lock (mutex) for event reporting, use the \texttt{TracyLockable(type, varname)} macro. Note that the lock must implement the Mutex requirement\footnote{\url{https://en.cppreference.com/w/cpp/named_req/Mutex}} (i.e.\ there's no support for timed mutices). For a concrete example, you would replace the line +To mark a lock (mutex) for event reporting, use the \texttt{TracyLockable(type, varname)} macro.
Note that the lock must implement the Mutex requirement\footnote{\url{https://en.cppreference.com/w/cpp/named_req/Mutex}} (i.e.,\ there's no support for timed mutexes). For a concrete example, you would replace the line \begin{lstlisting} std::mutex m_lock; @@ -1257,7 +1254,7 @@ The standard \texttt{std::lock\_guard} and \texttt{std::unique\_lock} wrappers s std::lock_guard<LockableBase(std::mutex)> lock(m_lock); \end{lstlisting} -To mark the location of a lock being held, use the \texttt{LockMark(varname)} macro, after you have obtained the lock. Note that the \texttt{varname} must be a lock variable (a reference is also valid). This step is optional. +To mark the location of a lock being held, use the \texttt{LockMark(varname)} macro after you have obtained the lock. Note that the \texttt{varname} must be a lock variable (a reference is also valid). This step is optional. Similarly, you can use \texttt{TracySharedLockable}, \texttt{TracySharedLockableN} and \texttt{SharedLockableBase} to mark locks implementing the SharedMutex requirement\footnote{\url{https://en.cppreference.com/w/cpp/named_req/SharedMutex}}. Note that while there's no support for timed mutexes in Tracy, both \texttt{std::shared\_mutex} and \texttt{std::shared\_timed\_mutex} may be used\footnote{Since \texttt{std::shared\_mutex} was added in C++17, using \texttt{std::shared\_timed\_mutex} is the only way to have shared mutex functionality in C++14.}. @@ -1274,7 +1271,7 @@ noborder=true, couleur=black!5, logo=\bcattention ]{Caveats} -Due to limits of internal bookkeeping in the profiler, each lock may be used in no more than 64 unique threads. If you have many short lived temporary threads, consider using a thread pool to limit the numbers of created threads. +Due to the limits of internal bookkeeping in the profiler, you may use each lock in no more than 64 unique threads. If you have many short-lived temporary threads, consider using a thread pool to limit the number of created threads.
\end{bclogo} \subsubsection{Custom locks} @@ -1284,7 +1281,7 @@ If using the \texttt{TracyLockable} or \texttt{TracySharedLockable} wrappers doe \subsection{Plotting data} \label{plottingdata} -Tracy is able to capture and draw numeric value changes over time. You may use it to analyze draw call counts, number of performed queries, etc. To report data, use the \texttt{TracyPlot(name, value)} macro. +Tracy can capture and draw numeric value changes over time. You may use it to analyze draw call counts, number of performed queries, etc. To report data, use the \texttt{TracyPlot(name, value)} macro. To configure how plot values are presented by the profiler, you may use the \texttt{TracyPlotConfig(name, format)} macro, where \texttt{format} is one of the following options: @@ -1294,31 +1291,31 @@ To configure how plot values are presented by the profiler, you may use the \tex \item \texttt{tracy::PlotFormatType::Percentage} -- values will be displayed as percentage (with value $100$ being equal to $100\%$). \end{itemize} -It is beneficial, but not required to use unique pointer for name string literal (see section~\ref{uniquepointers} for more details). +It is beneficial but not required to use a unique pointer for name string literal (see section~\ref{uniquepointers} for more details). \subsection{Message log} \label{messagelog} -Fast navigation in large data sets and correlating zones with what was happening in application may be difficult. To ease these issues Tracy provides a message log functionality. You can send messages (for example, your typical debug output) using the \texttt{TracyMessage(text, size)} macro. Alternatively, use \texttt{TracyMessageL(text)} for string literal messages. +Fast navigation in large data sets and correlating zones with what was happening in the application may be difficult. To ease these issues, Tracy provides a message log functionality. 
You can send messages (for example, your typical debug output) using the \texttt{TracyMessage(text, size)} macro. Alternatively, use \texttt{TracyMessageL(text)} for string literal messages. If you want to include color coding of the messages (for example to make critical messages easily visible), you can use \texttt{TracyMessageC(text, size, color)} or \texttt{TracyMessageLC(text, color)} macros. \subsubsection{Application information} \label{appinfo} -Tracy can collect additional information about the profiled application, which will be available in the trace description. This can include data such as the source repository revision, the environment in which application is running (dev/prod), etc. +Tracy can collect additional information about the profiled application, which will be available in the trace description. This can include data such as the source repository revision, the application's environment (dev/prod), etc. Use the \texttt{TracyAppInfo(text, size)} macro to report the data. \subsection{Memory profiling} \label{memoryprofiling} -Tracy can monitor memory usage of your application. Knowledge about each performed memory allocation enables the following: +Tracy can monitor the memory usage of your application. Knowledge about each performed memory allocation enables the following: \begin{itemize} \item Memory usage graph (like in massif, but fully interactive). \item List of active allocations at program exit (memory leaks). -\item Visualization of memory map. +\item Visualization of the memory map. \item Ability to rewind view of active allocations and memory map to any point of program execution. \item Information about memory statistics of each zone. \item Memory allocation hot-spot tree. @@ -1341,24 +1338,24 @@ void operator delete(void* ptr) noexcept } \end{lstlisting} -In some rare cases (e.g. destruction of TLS block), events may be reported after the profiler is no longer available, which would lead to a crash. 
To workaround this issue, you may use \texttt{TracySecureAlloc} and \texttt{TracySecureFree} variants of the macros. +In some rare cases (e.g., destruction of TLS block), events may be reported after the profiler is no longer available, which would lead to a crash. To work around this issue, you may use \texttt{TracySecureAlloc} and \texttt{TracySecureFree} variants of the macros. \begin{bclogo}[ noborder=true, couleur=black!5, logo=\bcbombe ]{Important} -Each tracked memory free event must also have a corresponding memory allocation event. Tracy will terminate the profiling session if this assumption is broken (see section~\ref{instrumentationfailures}). If you encounter this issue, you may want to check for: +Each tracked memory-free event must also have a corresponding memory allocation event. Tracy will terminate the profiling session if this assumption is broken (see section~\ref{instrumentationfailures}). If you encounter this issue, you may want to check for: \begin{itemize} \item Mismatched \texttt{malloc}/\texttt{new} or \texttt{free}/\texttt{delete}. -\item Reporting the same memory address being allocated twice (without a free between two allocs). +\item Reporting the same memory address being allocated twice (without a free between two allocations). \item Double freeing the memory. -\item Untracked allocations made in external libraries, that are freed in the application. +\item Untracked allocations made in external libraries that are freed in the application. \item Places where the memory is allocated, but profiling markup is not added. \end{itemize} -This requirement is relaxed in the on-demand mode (section~\ref{ondemand}), because the memory allocation event might have happened before the connection was made. +This requirement is relaxed in the on-demand mode (section~\ref{ondemand}) because the memory allocation event might have happened before the server made the connection.
\end{bclogo} \begin{bclogo}[ @@ -1366,29 +1363,29 @@ noborder=true, couleur=black!5, logo=\bclampe ]{Non-stable memory addresses} -Note that the pointer data you provide to the profiler does not have to reflect the real memory layout, which in some cases may not be known. This includes possibility of having multiple overlapping memory allocation regions. For example, you may want to track GPU memory, which may be mapped to different locations in the program address space during allocation and freeing. Or maybe you use some sort of memory defragmentation scheme, which by its very design moves pointers around. In such cases you may rather use unique numeric identifiers as a way of identifying allocated objects. This will make some profiler facilities unavailable, for example the memory map won't have much sense anymore. +Note that the pointer data you provide to the profiler does not have to reflect the actual memory layout, which you may not know in some cases. This includes the possibility of having multiple overlapping memory allocation regions. For example, you may want to track GPU memory, which may be mapped to different locations in the program address space during allocation and freeing. Or maybe you use some memory defragmentation scheme, which by its very design moves pointers around. You may instead use unique numeric identifiers to identify allocated objects in such cases. This will make some profiler facilities unavailable. For example, the memory map won't have much sense anymore. \end{bclogo} \subsubsection{Memory pools} \label{memorypools} -Sometimes an application will use more than one memory pool. For example, in addition to tracking the \texttt{malloc}/\texttt{free} heap you may also be interested in memory usage of a graphic API, such as Vulkan. Or maybe you want to see how your scripting language is managing memory. +Sometimes an application will use more than one memory pool. 
For example, in addition to tracking the \texttt{malloc}/\texttt{free} heap, you may also be interested in memory usage of a graphic API, such as Vulkan. Or maybe you want to see how your scripting language is managing memory. To mark that a separate memory pool is to be tracked you should use the named version of memory macros, for example \texttt{TracyAllocN(ptr, size, name)} and \texttt{TracyFreeN(ptr, name)}, where \texttt{name} is a unique pointer to a string literal (section~\ref{uniquepointers}) identifying the memory pool. \subsection{GPU profiling} \label{gpuprofiling} -Tracy provides bindings for profiling OpenGL, Vulkan, Direct3D 11, Direct3D 12 and OpenCL execution time on GPU. +Tracy provides bindings for profiling OpenGL, Vulkan, Direct3D 11, Direct3D 12, and OpenCL execution time on GPU. -Note that the CPU and GPU timers may be not synchronized, unless a calibrated context is created. Since availability of calibrated contexts is limited, you can correct the desynchronization of uncalibrated contexts in the profiler's options (section~\ref{options}). +Note that the CPU and GPU timers may be unsynchronized unless you create a calibrated context, but the availability of calibrated contexts is limited. You can try to correct the desynchronization of uncalibrated contexts in the profiler's options (section~\ref{options}). \begin{bclogo}[ noborder=true, couleur=black!5, logo=\bclampe ]{Check the scope} -If the graphic API you are using requires explicitly stating that you start and finish recording of command buffers, remember that the instrumentation macros requirements have to be satisfied both during the construction and destruction of the zone. For example, in the following code the zone destructor will be executed after buffer recording has ended, which is an error.
+If the graphic API you are using requires explicitly stating that you start and finish the recording of command buffers, remember that the instrumentation macros requirements must be satisfied during the zone's construction and destruction. For example, the zone destructor will be executed in the following code after buffer recording has ended, which is an error. \begin{lstlisting} { @@ -1398,7 +1395,7 @@ If the graphic API you are using requires explicitly stating that you start and } \end{lstlisting} -To fix such issues, add a nested scope, encompassing the command buffer recording section. +Add a nested scope encompassing the command buffer recording section to fix such issues. \end{bclogo} \begin{bclogo}[ @@ -1406,20 +1403,20 @@ noborder=true, couleur=black!5, logo=\bcattention ]{Caveat emptor} -The profiling results you will get can be unreliable, or just plainly wrong. It all depends on the quality of graphics drivers and how the underlying hardware implements timers. While Tracy employs a number of heuristics in order to make things as reliable as possible, in the end it has to talk to the GPU through the commonly unreliable API calls. +The profiling results you will get can be unreliable or plainly wrong. It all depends on the quality of graphics drivers and how the underlying hardware implements timers. While Tracy employs some heuristics to make things as reliable as possible, it must talk to the GPU through the commonly unreliable API calls. -For example, on Linux the Intel GPU driver will report 64-bit precision of time stamps. This is not true, as the driver will only provide time stamps with 36-bit precision, rolling over the exceeding values. Tracy is able to detect this and employ workarounds. This is, sadly, not enough to make the readings reliable, as this timer we can access through the API is not a real one. Deep down the driver has access to the true timer, which is used to provide the virtual values we can get. 
This hardware timer has a period which \emph{does not match} the period of the API timer. In result, the virtual timer will sometimes overflow \emph{in midst} of a cycle, making the reported time values jump forward. This is a problem that only the driver vendor can fix. +For example, on Linux, the Intel GPU driver will report 64-bit precision of time stamps. Unfortunately, this is not true, as the driver will only provide timestamps with 36-bit precision, rolling over the exceeding values. Tracy can detect such problems and employ workarounds. This is, sadly, not enough to make the readings reliable, as this timer we can access through the API is not a real one. Deep down, the driver has access to the actual timer, which it uses to provide the virtual values we can get. Unfortunately, this hardware timer has a period which \emph{does not match} the period of the API timer. As a result, the virtual timer will sometimes overflow \emph{in midst} of a cycle, making the reported time values jump forward. This is a problem that only the driver vendor can fix. -If you experience crippling problems while profiling the GPU, you might get better results with different driver, different operating system, or different hardware. +If you experience crippling problems while profiling the GPU, you might get better results with a different driver, different operating system, or different hardware. \end{bclogo} \subsubsection{OpenGL} -You will need to include the \texttt{tracy/TracyOpenGL.hpp} header file and declare each of your rendering contexts using the \texttt{TracyGpuContext} macro (typically you will only have one context). Tracy expects no more than one context per thread and no context migration. To set a custom name for the context, use the \texttt{TracyGpuContextName(name, size)} macro. 
+You will need to include the \texttt{tracy/TracyOpenGL.hpp} header file and declare each of your rendering contexts using the \texttt{TracyGpuContext} macro (typically, you will only have one context). Tracy expects no more than one context per thread and no context migration. To set a custom name for the context, use the \texttt{TracyGpuContextName(name, size)} macro. To mark a GPU zone use the \texttt{TracyGpuZone(name)} macro, where \texttt{name} is a string literal name of the zone. Alternatively you may use \texttt{TracyGpuZoneC(name, color)} to specify zone color. -You also need to periodically collect the GPU events using the \texttt{TracyGpuCollect} macro. A good place to do it is after the swap buffers function call. +You also need to periodically collect the GPU events using the \texttt{TracyGpuCollect} macro. An excellent place to do it is after the swap buffers function call. \begin{bclogo}[ noborder=true, @@ -1437,11 +1434,11 @@ logo=\bcattention Similarly, for Vulkan support you should include the \texttt{tracy/TracyVulkan.hpp} header file. Tracing Vulkan devices and queues is a bit more involved, and the Vulkan initialization macro \texttt{TracyVkContext(physdev, device, queue, cmdbuf)} returns an instance of \texttt{TracyVkCtx} object, which tracks an associated Vulkan queue. Cleanup is performed using the \texttt{TracyVkDestroy(ctx)} macro. You may create multiple Vulkan contexts. To set a custom name for the context, use the \texttt{TracyVkContextName(ctx, name, size)} macro. -The physical device, logical device, queue and command buffer must relate with each other. The queue must support graphics or compute operations. The command buffer must be in the initial state and be able to be reset. It will be rerecorded and submitted to the queue multiple times and it will be in the executable state on exit from the initialization function. +The physical device, logical device, queue, and command buffer must relate to each other. 
The queue must support graphics or compute operations. The command buffer must be in the initial state and be able to be reset. The profiler will rerecord and submit it to the queue multiple times, and it will be in the executable state on exit from the initialization function. -To mark a GPU zone use the \texttt{TracyVkZone(ctx, cmdbuf, name)} macro, where \texttt{name} is a string literal name of the zone. Alternatively you may use \texttt{TracyVkZoneC(ctx, cmdbuf, name, color)} to specify zone color. The provided command buffer must be in the recording state and it must be created within the queue that is associated with \texttt{ctx} context. +To mark a GPU zone use the \texttt{TracyVkZone(ctx, cmdbuf, name)} macro, where \texttt{name} is a string literal name of the zone. Alternatively you may use \texttt{TracyVkZoneC(ctx, cmdbuf, name, color)} to specify zone color. The provided command buffer must be in the recording state, and it must be created within the queue that is associated with \texttt{ctx} context. -You also need to periodically collect the GPU events using the \texttt{TracyVkCollect(ctx, cmdbuf)} macro\footnote{It is considerably faster than the OpenGL's \texttt{TracyGpuCollect}.}. The provided command buffer must be in the recording state and outside of a render pass instance. +You also need to periodically collect the GPU events using the \texttt{TracyVkCollect(ctx, cmdbuf)} macro\footnote{It is considerably faster than the OpenGL's \texttt{TracyGpuCollect}.}. The provided command buffer must be in the recording state and outside a render pass instance. \subparagraph{Calibrated context} @@ -1455,19 +1452,19 @@ To enable Direct3D 11 support, include the \texttt{tracy/TracyD3D11.hpp} header To mark a GPU zone, use the \texttt{TracyD3D11Zone(name)} macro, where \texttt{name} is a string literal name of the zone. Alternatively you may use \texttt{TracyD3D11ZoneC(name, color)} to specify zone color. 
-You also need to periodically collect the GPU events using the \texttt{TracyD3D11Collect} macro. A good place to do it is after the swap chain present function. +You also need to periodically collect the GPU events using the \texttt{TracyD3D11Collect} macro. A good place to do it is after the swap chain present function. \subsubsection{Direct3D 12} To enable Direct3D 12 support, include the \texttt{tracy/TracyD3D12.hpp} header file. Tracing Direct3D 12 queues is nearly on par with the Vulkan implementation, where a \texttt{TracyD3D12Ctx} is returned from a call to \texttt{TracyD3D12Context(device, queue)}, which should be later cleaned up with the \texttt{TracyD3D12Destroy(ctx)} macro. Multiple contexts can be created, each with any queue type. To set a custom name for the context, use the \texttt{TracyD3D12ContextName(ctx, name, size)} macro. -The queue must have been created through the specified device, however a command list is not needed for this stage. +The queue must have been created through the specified device; however, a command list is not needed for this stage. Using GPU zones is the same as the Vulkan implementation, where the \texttt{TracyD3D12Zone(ctx, cmdList, name)} macro is used, with \texttt{name} as a string literal. \texttt{TracyD3D12ZoneC(ctx, cmdList, name, color)} can be used to create a custom-colored zone. The given command list must be in an open state. The macro \texttt{TracyD3D12NewFrame(ctx)} is used to mark a new frame, and should appear before or after recording command lists, similar to \texttt{FrameMark}. This macro is a key component that enables automatic query data synchronization, so the user doesn't have to worry about synchronizing GPU execution before invoking a collection. Event data can then be collected and sent to the profiler using the \texttt{TracyD3D12Collect(ctx)} macro. -Note that due to artifacts from dynamic frequency scaling, GPU profiling may be slightly inaccurate. 
To counter this, \texttt{ID3D12Device::SetStablePowerState()} can be used to enable accurate profiling, at the expense of some performance. If the machine is not in developer mode, the device will be removed upon calling. Do not use this in shipping code. +Note that GPU profiling may be slightly inaccurate due to artifacts from dynamic frequency scaling. To counter this, \texttt{ID3D12Device::SetStablePowerState()} can be used to enable accurate profiling, at the expense of some performance. If the machine is not in developer mode, the operating system will remove the device upon calling. Do not use this in shipping code. Direct3D 12 contexts are always calibrated. @@ -1475,11 +1472,11 @@ Direct3D 12 contexts are always calibrated. OpenCL support is achieved by including the \texttt{tracy/TracyOpenCL.hpp} header file. Tracing OpenCL requires the creation of a Tracy OpenCL context using the macro \texttt{TracyCLContext(context, device)}, which will return an instance of \texttt{TracyCLCtx} object that must be used when creating zones. The specified \texttt{device} must be part of the \texttt{context}. Cleanup is performed using the \texttt{TracyCLDestroy(ctx)} macro. Although not common, it is possible to create multiple OpenCL contexts for the same application. To set a custom name for the context, use the \texttt{TracyCLContextName(ctx, name, size)} macro. -To mark an OpenCL zone one must make sure that a valid OpenCL \texttt{cl\_event} object is available. The event will be the object that Tracy will use to query profiling information from the OpenCL driver. For this to work, all OpenCL queues must be created with the \texttt{CL\_QUEUE\_PROFILING\_ENABLE} property. +To mark an OpenCL zone one must make sure that a valid OpenCL \texttt{cl\_event} object is available. The event will be the object that Tracy will use to query profiling information from the OpenCL driver. 
For this to work, you must create all OpenCL queues with the \texttt{CL\_QUEUE\_PROFILING\_ENABLE} property. OpenCL zones can be created with the \texttt{TracyCLZone(ctx, name)} where \texttt{name} will usually be a descriptive name for the operation represented by the \texttt{cl\_event}. Within the scope of the zone, you must call \texttt{TracyCLSetEvent(event)} for the event to be registered in Tracy. -Similarly to Vulkan and OpenGL, you also need to periodically collect the OpenCL events using the \texttt{TracyCLCollect(ctx)} macro. A good place to perform this operation is after a \texttt{clFinish}, since this will ensure that any previous queued OpenCL commands will have finished by this point. +As with Vulkan and OpenGL, you also need to periodically collect the OpenCL events using the \texttt{TracyCLCollect(ctx)} macro. A good place to perform this operation is after a \texttt{clFinish}, since this will ensure that any previously queued OpenCL commands will have finished by this point. \subsubsection{Multiple zones in one scope} @@ -1487,7 +1484,7 @@ Putting more than one GPU zone macro in a single scope features the same issue a To solve this problem, in case of OpenGL use the \texttt{TracyGpuNamedZone} macro in place of \texttt{TracyGpuZone} (or the color variant). The same applies to Vulkan and Direct3D 11/12 -- replace \texttt{TracyVkZone} with \texttt{TracyVkNamedZone} and \texttt{TracyD3D11Zone}/\texttt{TracyD3D12Zone} with \texttt{TracyD3D11NamedZone}/\texttt{TracyD3D12NamedZone}. -Remember that you need to provide your own name for the created stack variable as the first parameter to the macros. +Remember to provide your own name for the created stack variable as the first parameter to the macros. 
\subsubsection{Transient GPU zones} @@ -1496,17 +1493,17 @@ Transient zones (see section~\ref{transientzones} for details) are available in \subsection{Fibers} \label{fibers} -Fibers are lightweight threads, which are not under control of the operating system and need to be manually scheduled by the application. There are other cooperative multitasking primitives, like coroutines, or green threads, which also fall under this umbrella, as far as Tracy is concerned. +Fibers are lightweight threads, which are not under the operating system's control and need to be manually scheduled by the application. As far as Tracy is concerned, there are other cooperative multitasking primitives, like coroutines, or green threads, which also fall under this umbrella. -To enable fiber support in the client code you will need to add the \texttt{TRACY\_FIBERS} define to your project. You need to do this explicitly, as there is a small performance hit due to the required additional processing. +To enable fiber support in the client code, you will need to add the \texttt{TRACY\_FIBERS} define to your project. You need to do this explicitly, as there is a small performance hit due to additional processing. -In order to properly instrument fibers you will need to modify the fiber dispatch code in your program. You will need to insert the \texttt{TracyFiberEnter(fiber)} macro every time a fiber starts or resumes execution. You will also need to insert the \texttt{TracyFiberLeave} macro when the execution control in a thread returns to the non-fiber part of the code. Note that you can safely call \texttt{TracyFiberEnter} multiple times in succession, without an intermediate \texttt{TracyFiberLeave}, if one fiber is directly switching to another, without returning control to the fiber dispatch worker. +To properly instrument fibers, you will need to modify the fiber dispatch code in your program. 
You will need to insert the \texttt{TracyFiberEnter(fiber)} macro every time a fiber starts or resumes execution. You will also need to insert the \texttt{TracyFiberLeave} macro when the execution control in a thread returns to the non-fiber part of the code. Note that you can safely call \texttt{TracyFiberEnter} multiple times in succession, without an intermediate \texttt{TracyFiberLeave} if one fiber is directly switching to another, without returning control to the fiber dispatch worker. Fibers are identified by unique \texttt{const char*} string names. Remember that you should observe the rules laid out in section~\ref{uniquepointers} while handling such strings. -No additional instrumentation is needed in other parts of the code. Zones, messages and other such events will be properly attributed to the currently running fiber in its own separate track. +No additional instrumentation is needed in other parts of the code. Zones, messages, and other such events will be properly attributed to the currently running fiber in its own separate track. -A very simple example, which is not actually using any OS fiber functionality, is presented below: +A straightforward example, which is not actually using any OS fiber functionality, is presented below: \begin{lstlisting} const char* fiber = "job1"; @@ -1533,7 +1530,7 @@ int main() } \end{lstlisting} -As you can see, there are two threads, \texttt{t1} and \texttt{t2}, which are simulating worker threads which would be used by a real fiber library. A C API zone is created in thread \texttt{t1} and is ended in thread \texttt{t2}. Without the fiber markup this would be an invalid operation, but with fibers the zone is attributed to fiber \texttt{job1}, and not to thread \texttt{t1} or \texttt{t2}. +As you can see, there are two threads, \texttt{t1} and \texttt{t2}, which are simulating worker threads that a real fiber library would use. A C API zone is created in thread \texttt{t1} and is ended in thread \texttt{t2}. 
Without the fiber markup, this would be an invalid operation, but with fibers, the zone is attributed to fiber \texttt{job1}, and not to thread \texttt{t1} or \texttt{t2}. \subsection{Collecting call stacks} \label{collectingcallstacks} @@ -1586,9 +1583,9 @@ Be aware that call stack collection is a relatively slow operation. Table~\ref{C You can force call stack capture in the non-\texttt{S} postfixed macros by adding the \texttt{TRACY\_CALLSTACK} define, set to the desired call stack capture depth. This setting doesn't affect the explicit call stack macros. -The maximum call stack depth that can be retrieved is 62 frames. This is a restriction at the level of operating system. +The maximum call stack depth that the profiler can retrieve is 62 frames. This is a restriction at the level of the operating system. -Tracy will automatically exclude certain uninteresting functions from the captured call stacks. For example, the pass-through intrinsic wrapper functions won't be reported. +Tracy will automatically exclude certain uninteresting functions from the captured call stacks. So, for example, the pass-through intrinsic wrapper functions won't be reported. \begin{bclogo}[ noborder=true, @@ -1600,12 +1597,12 @@ Collecting call stack data will also trigger retrieval of profiled program's exe \subsubsection{Debugging symbols} -To have proper call stack information, the profiled application must be compiled with debugging symbols enabled. You can achieve that in the following way: +You must compile the profiled application with debugging symbols enabled to have correct call stack information. You can achieve that in the following way: \begin{itemize} -\item On MSVC open the project properties and go to \menu[,]{Linker,Debugging,Generate Debug Info}, where the \emph{Generate Debug Information} option should be selected. 
+\item On MSVC, open the project properties and go to \menu[,]{Linker,Debugging,Generate Debug Info}, where you should select the \emph{Generate Debug Information} option. \item On gcc or clang remember to specify the debugging information \texttt{-g} parameter during compilation and \emph{do not} add the strip symbols \texttt{-s} parameter. Additionally, omitting frame pointers will severely reduce the quality of stack traces, which can be fixed by adding the \texttt{-fno-omit-frame-pointer} parameter. Link the executable with an additional option \texttt{-rdynamic} (or \texttt{-{}-export-dynamic}, if you are passing parameters directly to the linker). -\item On OSX you may need to run \texttt{dsymutil} to extract the debugging data out of the executable binary. +\item On OSX, you may need to run \texttt{dsymutil} to extract the debugging data out of the executable binary. \item On iOS you will have to add a \emph{New Run Script Phase} to your XCode project, which shall execute the following shell script: \begin{lstlisting}[language=sh] @@ -1621,22 +1618,22 @@ You will also need to setup proper dependencies, by setting the following input You may also be interested in symbols from external libraries, especially if you have sampling profiling enabled (section~\ref{sampling}). In MSVC you can retrieve such symbols by going to \menu[,]{Tools,Options,Debugging,Symbols} and selecting appropriate \emph{Symbol file (.pdb) location} servers. Note that additional symbols may significantly increase application startup times. -Libraries built with vcpkg typically also provide pdb symbol files, even for release builds. Using vcpkg to obtain libraries has the extra benefit that everything is built using local source files, which allows Tracy to provide source view not only of your application, but also the libraries you use. +Libraries built with vcpkg typically provide PDB symbol files, even for release builds. 
Using vcpkg to obtain libraries has the extra benefit that everything is built using local source files, which allows Tracy to provide a source view not only of your application but also the libraries you use. \paragraph{Refreshing symbols list on Windows} -If your application loads shared libraries during runtime, you will need to notify the debug symbol library (dbghelp) that it needs to refresh its internal symbols database. This database is filled during application initialization and is not automatically updated when you load a library. The symbol list can be manually updated by using functions such as \texttt{SymRefreshModuleList()} or \texttt{SymLoadModule()}. Note that \texttt{LdrRegisterDllNotification()} may be used to register callback to be executed when a DLL is loaded. +If your application loads shared libraries during runtime, you will need to notify the debug symbol library (dbghelp) to refresh its internal symbols database. This database is filled during application initialization and is not automatically updated when you load a library. The symbol list can be manually updated by using functions such as \texttt{SymRefreshModuleList()} or \texttt{SymLoadModule()}. Note that \texttt{LdrRegisterDllNotification()} may be used to register callback to be executed when a DLL is loaded. -Since dbghelp functions are not thread safe, you must take extra steps to make sure your calls to the \texttt{Sym*} family of functions are not colliding with calls made by Tracy. To do so, perform the following steps: +Since dbghelp functions are not thread-safe, you must take extra steps to make sure your calls to the \texttt{Sym*} family of functions are not colliding with calls made by Tracy. To do so, perform the following steps: \begin{enumerate} \item Add a \texttt{TRACY\_DBGHELP\_LOCK} define, with the value set to prefix of lock-handling functions (for example: \texttt{TRACY\_DBGHELP\_LOCK=DbgHelp}). -\item Create a dbghelp lock (i.e. mutex) in your application. 
+\item Create a dbghelp lock (i.e., mutex) in your application. \item Provide a set of \texttt{Init}, \texttt{Lock} and \texttt{Unlock} functions, including the provided prefix name, which will operate on the lock. These functions must be defined using the C linkage. Notice that there's no cleanup function. -\item Remember to appropriately protect access to dbghelp in your code! +\item Remember to protect access to dbghelp in your code appropriately! \end{enumerate} -An example implementation of such lock interface is provided below, as a reference: +An example implementation of such a lock interface is provided below, as a reference: \begin{lstlisting} extern "C" @@ -1651,15 +1648,15 @@ void DbgHelpUnlock() { ReleaseMutex(dbgHelpLock); } \paragraph{Disabling resolution of inline frames} -Inline frames retrieval on Windows can be multiple orders of magnitude slower than just performing basic symbol resolution. This is being manifested as profiler seemingly being stuck for a long time, having hundred thousands of query backlog entries queued, which are slowly trickling down. If your use case requires speed of operation rather than having call stacks with inline frames included, you may define the \texttt{TRACY\_NO\_CALLSTACK\_INLINES} macro, which will make the profiler stick to the basic but fast frame resolution mode. +Inline frames retrieval on Windows can be multiple orders of magnitude slower than just performing basic symbol resolution. This manifests as the profiler seemingly being stuck for a long time, having hundreds of thousands of query backlog entries queued, which are slowly trickling down. If your use case requires speed of operation rather than having call stacks with inline frames included, you may define the \texttt{TRACY\_NO\_CALLSTACK\_INLINES} macro, which will make the profiler stick to the basic but fast frame resolution mode. 
\subsection{Lua support} To profile Lua code using Tracy, include the \texttt{tracy/TracyLua.hpp} header file in your Lua wrapper and execute \texttt{tracy::LuaRegister(lua\_State*)} function to add instrumentation support. -In the Lua code, add \texttt{tracy.ZoneBegin()} and \texttt{tracy.ZoneEnd()} calls to mark execution zones. You need to call the \texttt{ZoneEnd} method, because there is no automatic destruction of variables in Lua and we don't know when the garbage collection will be performed. \emph{Double check if you have included all return paths!} +In the Lua code, add \texttt{tracy.ZoneBegin()} and \texttt{tracy.ZoneEnd()} calls to mark execution zones. You need to call the \texttt{ZoneEnd} method because there is no automatic destruction of variables in Lua, and we don't know when the garbage collection will be performed. \emph{Double check if you have included all return paths!} -Use \texttt{tracy.ZoneBeginN(name)} if you want to set a custom zone name\footnote{While technically this name doesn't need to be constant, like in the \texttt{ZoneScopedN} macro, it should be, as it is used to group the zones together. This grouping is then used to display various statistics in the profiler. You may still set the per-call name using the \texttt{tracy.ZoneName} method.}. +Use \texttt{tracy.ZoneBeginN(name)} if you want to set a custom zone name\footnote{While technically this name doesn't need to be constant, like in the \texttt{ZoneScopedN} macro, it should be, as it is used to group the zones. This grouping is then used to display various statistics in the profiler. You may still set the per-call name using the \texttt{tracy.ZoneName} method.}. Use \texttt{tracy.ZoneText(text)} to set zone text. 
@@ -1673,9 +1670,9 @@ Lua instrumentation needs to perform additional work (including memory allocatio To collect Lua call stacks (see section~\ref{collectingcallstacks}), replace \texttt{tracy.ZoneBegin()} calls with \texttt{tracy.ZoneBeginS(depth)}, and \texttt{tracy.ZoneBeginN(name)} calls with \texttt{tracy.ZoneBeginNS(name, depth)}. Using the \texttt{TRACY\_CALLSTACK} macro automatically enables call stack collection in all zones. -Be aware that for Lua call stack retrieval to work, you need to be on a platform which supports collection of native call stacks. +Be aware that for Lua call stack retrieval to work, you need to be on a platform that supports the collection of native call stacks. -Cost of performing Lua call stack capture is presented in table~\ref{CallstackTimesLua} and figure~\ref{CallstackPlotLua}. Lua call stacks include native call stacks, which have a capture cost of their own (table~\ref{CallstackTimes}) and the \texttt{depth} parameter is applied for both captures. The presented data was captured with full Lua stack depth, but only 13 frames were available on the native call stack. Hence, to explain the non-linearity of the graph you need to consider what was really measured: +Cost of performing Lua call stack capture is presented in table~\ref{CallstackTimesLua} and figure~\ref{CallstackPlotLua}. Lua call stacks include native call stacks, which have a capture cost of their own (table~\ref{CallstackTimes}), and the \texttt{depth} parameter is applied for both captures. The presented data were captured with full Lua stack depth, but only 13 frames were available on the native call stack. 
Hence, to explain the non-linearity of the graph, you need to consider what was truly measured: \begin{displaymath} \text{Cost}_{\text{total}}(\text{depth}) = @@ -1724,14 +1721,14 @@ Cost of performing Lua call stack capture is presented in table~\ref{CallstackTi \subsubsection{Instrumentation cleanup} -Even if Tracy is disabled, you still have to pay the no-op function call cost. To prevent that you may want to use the \texttt{tracy::LuaRemove(char* script)} function, which will replace instrumentation calls with white-space. This function does nothing if profiler is enabled. +Even if Tracy is disabled, you still have to pay the no-op function call cost. To prevent that, you may want to use the \texttt{tracy::LuaRemove(char* script)} function, which will replace instrumentation calls with white-space. This function does nothing if the profiler is enabled. \subsection{C API} \label{capi} -In order to profile code written in C programming language, you will need to include the \texttt{tracy/TracyC.h} header file, which exposes the C API. +To profile code written in C programming language, you will need to include the \texttt{tracy/TracyC.h} header file, which exposes the C API. -At the moment there's no support for C API based markup of locks, GPU zones, or Lua. +At the moment, there's no support for C API based markup of locks, GPU zones, or Lua. \begin{bclogo}[ noborder=true, @@ -1769,16 +1766,16 @@ The following macros mark the beginning of a zone: \item \texttt{TracyCZoneNC(ctx, name, color, active)} \end{itemize} -Refer to sections~\ref{markingzones} and~\ref{multizone} for description of macro variants and parameters. The \texttt{ctx} parameter specifies the name of a data structure, which will be created on stack to hold the internal zone data. +Refer to sections~\ref{markingzones} and~\ref{multizone} for description of macro variants and parameters. 
The \texttt{ctx} parameter specifies the name of a data structure, which the macro will create on the stack to hold the internal zone data. -Unlike C++, there's no automatic destruction mechanism in C, so you will need to manually mark where the zone ends. To do so use the \texttt{TracyCZoneEnd(ctx)} macro. +Unlike C++, there's no automatic destruction mechanism in C, so you will need to mark where the zone ends manually. To do so use the \texttt{TracyCZoneEnd(ctx)} macro. Zone text and name may be set by using the \texttt{TracyCZoneText(ctx, txt, size)}, \texttt{TracyCZoneValue(ctx, value)} and \texttt{TracyCZoneName(ctx, txt, size)} macros. Make sure you are following the zone stack rules, as described in section~\ref{multizone}! \paragraph{Zone context data structure} \label{zonectx} -In typical use cases the zone context data structure is hidden from your view, requiring only to specify its name for the \texttt{TracyCZone} and \texttt{TracyCZoneEnd} macros. However, it is possible to use it in advanced scenarios, for example if you want to start a zone in one function, then end it in another one. To do so you will need to forward the data structure either through a function parameter, or as a return value, or place it in a thread-local stack structure. To accomplish this you need to keep in mind the following rules: +In typical use cases the zone context data structure is hidden from your view, requiring only to specify its name for the \texttt{TracyCZone} and \texttt{TracyCZoneEnd} macros. However, it is possible to use it in advanced scenarios, for example, if you want to start a zone in one function, then end it in another one. To do so, you will need to forward the data structure either through a function parameter or as a return value or place it in a thread-local stack structure. To accomplish this, you need to keep in mind the following rules: \begin{itemize} \item The created variable name is exactly what you pass as the \texttt{ctx} parameter. 
@@ -1790,9 +1787,9 @@ In typical use cases the zone context data structure is hidden from your view, r \paragraph{Zone validation} -Since all instrumentation using the C API has to be done by hand, it is possible to miss some code paths where a zone should be started or ended. Tracy will perform additional validation of instrumentation correctness to prevent bad profiling runs. Read section~\ref{instrumentationfailures} for more information. +Since all C API instrumentation has to be done by hand, it is possible to miss some code paths where a zone should be started or ended. Tracy will perform additional validation of instrumentation correctness to prevent bad profiling runs. Read section~\ref{instrumentationfailures} for more information. -The validation comes with a performance cost though, which you may not want to pay. If you are \emph{completely sure} that the instrumentation is not broken in any way, you may use the \texttt{TRACY\_NO\_VERIFY} macro, which will disable the validation code. +However, the validation comes with a performance cost, which you may not want to pay. Therefore, if you are \emph{entirely sure} that the instrumentation is not broken in any way, you may use the \texttt{TRACY\_NO\_VERIFY} macro, which will disable the validation code. \paragraph{Transient zones in C API} @@ -1809,13 +1806,13 @@ Use the following macros in your implementations of \texttt{malloc} and \texttt{ \item \texttt{TracyCSecureFree(ptr)} \end{itemize} -Using this functionality in a proper way can be quite tricky, as you also will need to handle all the memory allocations made by external libraries (which typically allow usage of custom memory allocation functions), but also the allocations made by system functions. If such an allocation can't be tracked, you will need to make sure freeing is not reported\footnote{It's not uncommon to see a pattern where a system function returns some allocated memory, which you then need to free.}. 
+Using this functionality correctly can be quite tricky. You will also need to handle all the memory allocations made by external libraries (which typically allow usage of custom memory allocation functions) and the allocations made by system functions. If you can't track such an allocation, you will need to make sure freeing is not reported\footnote{It's not uncommon to see a pattern where a system function returns some allocated memory, which you then need to release.}. -There is no explicit support for \texttt{realloc} function. You will need to handle it by marking memory allocations and frees, according to the system manual describing behavior of this routine. +There is no explicit support for the \texttt{realloc} function. You will need to handle it by marking memory allocations and frees, according to the system manual describing the behavior of this routine. Memory pools (section~\ref{memorypools}) are supported through macros with \texttt{N} postfix. -For more information about memory profiling refer to section~\ref{memoryprofiling}. +For more information about memory profiling, refer to section~\ref{memoryprofiling}. \subsubsection{Plots and messages} @@ -1834,9 +1831,9 @@ Consult sections~\ref{plottingdata} and~\ref{messagelog} for more information. \subsubsection{GPU zones} -Hooking up support for GPU zones requires a bit more work than usual. The C API provides a low-level interface which you can use to submit the data, but there are no facilities to help you with timestamp processing. +Hooking up support for GPU zones requires a bit more work than usual. The C API provides a low-level interface that you can use to submit the data, but there are no facilities to help you with timestamp processing. -Moreover, there are two sets of functions described below. The standard set sends data asynchronously, while the \texttt{\_serial} one ensures proper ordering of all events, regardless of the originating thread. 
Generally speaking, you should be using the asynchronous functions only in case of APIs which are strictly single-threaded, like OpenGL. +Moreover, there are two sets of functions described below. The standard set sends data asynchronously, while the \texttt{\_serial} one ensures proper ordering of all events, regardless of the originating thread. Generally speaking, you should be using the asynchronous functions only in the case of strictly single-threaded APIs, like OpenGL. A GPU context can be created with the \texttt{\_\_\_tracy\_emit\_gpu\_new\_context} function (or the serialized variant). You'll need to specify: @@ -1856,7 +1853,7 @@ GPU zones are ended via \texttt{\_\_\_tracy\_emit\_gpu\_zone\_end}. When the timestamps are fetched from the GPU, they must then be emitted via the \texttt{\_\_\_tracy\_emit\_gpu\_time} function. After all timestamps for a frame are emitted, \texttt{queryIds} may be re-used. -To see how this API should be used you should look at the reference implementation contained in API-specific C++ headers provided by Tracy. For example, to see how to write your instrumentation of OpenGL, you should closely follow contents of the \texttt{TracyOpenGL.hpp} implementation. +To see how you should use this API, you should look at the reference implementation contained in API-specific C++ headers provided by Tracy. For example, to see how to write your instrumentation of OpenGL, you should closely follow the contents of the \texttt{TracyOpenGL.hpp} implementation. \subsubsection{Fibers} @@ -1873,7 +1870,7 @@ You can collect call stacks of zones and memory allocation events, as described \subsubsection{Using the C API to implement bindings} \label{capibindings} -Tracy C API exposes functions with the \texttt{\_\_\_tracy} prefix that may be used to write bindings to other programming languages. Most of the functions available are a counterpart to macros described in section~\ref{capi}. 
Some functions do not have macro equivalents and are dedicated expressly for binding implementation purposes. This includes the following: +Tracy C API exposes functions with the \texttt{\_\_\_tracy} prefix that you may use to write bindings to other programming languages. Most of the functions available are a counterpart to macros described in section~\ref{capi}. However, some functions do not have macro equivalents and are dedicated expressly for binding implementation purposes. This includes the following: \begin{itemize} \item \texttt{\_\_\_tracy\_startup\_profiler(void)} @@ -1889,8 +1886,8 @@ name, by providing it in the \texttt{name} variable, and specifying its size in The \texttt{\_\_\_tracy\_alloc\_srcloc} and \texttt{\_\_\_tracy\_alloc\_srcloc\_name} functions return an \texttt{uint64\_t} source location identifier corresponding to an \emph{allocated source -location}. As these functions do not require for the provided string data to be available after they -return, calling code is free to deallocate them at any time afterwards. This way the string +location}. As these functions do not require the provided string data to be available after they +return, the calling code is free to deallocate them at any time afterward. This way, the string lifetime requirements described in section~\ref{textstrings} are relaxed. The \texttt{uint64\_t} return value from allocation functions must be passed to one of the zone @@ -1913,27 +1910,27 @@ couleur=black!5, logo=\bcbombe ]{Important} Since you are directly calling the profiler functions here, you will need to take care of manually -disabling the code, if the \texttt{TRACY\_ENABLE} macro is not defined. +disabling the code if the \texttt{TRACY\_ENABLE} macro is not defined. \end{bclogo} \subsection{Automated data collection} \label{automated} -Tracy will perform automatic collection of system data without user intervention. This behavior is platform specific and may not be available everywhere. 
Refer to section~\ref{featurematrix} for more information. +Tracy will automatically collect system data without user intervention. This behavior is platform-specific and may not be available everywhere. Refer to section~\ref{featurematrix} for more information. \subsubsection{Privilege elevation} \label{privilegeelevation} -Some profiling data can be only retrieved using the kernel facilities, which are not available to users with normal privilege level. To collect such data you will need to elevate your rights to admin level, either by running the profiled program from the \texttt{root} account on Unix, or through the \emph{Run as administrator} option on Windows\footnote{To make this easier, you can run MSVC with admin privileges, which will be inherited by your program when you start it from within the IDE.}. On Android you will need to have a rooted device (see section~\ref{androidlunacy} for additional information). +Some profiling data can only be retrieved using the kernel facilities, which are not available to users with a normal privilege level. To collect such data, you will need to elevate your rights to the administrator level. You can do so either by running the profiled program from the \texttt{root} account on Unix or through the \emph{Run as administrator} option on Windows\footnote{To make this easier, you can run MSVC with admin privileges, which will be inherited by your program when you start it from within the IDE.}. On Android, you will need to have a rooted device (see section~\ref{androidlunacy} for additional information). -As this system-level tracing functionality is part of the automated collection process, no user intervention is necessary to enable it (assuming that the program was granted the necessary rights). If for some reason you would want to prevent your application from trying to access kernel data, you may recompile your program with the \texttt{TRACY\_NO\_SYSTEM\_TRACING} define.
+As this system-level tracing functionality is part of the automated collection process, no user intervention is necessary to enable it (assuming that the program was granted the rights needed). However, if, for some reason, you would want to prevent your application from trying to access kernel data, you may recompile your program with the \texttt{TRACY\_NO\_SYSTEM\_TRACING} define. \begin{bclogo}[ noborder=true, couleur=black!5, logo=\bcattention ]{Caveats} -Data retrieval on Android requires spawning an elevated process to read the information provided by the kernel. While the standard \texttt{cat} utility can be used for this task, the resulting CPU usage is not acceptable, due to how the kernel handles blocking reads. As a workaround, Tracy will inject a specialized kernel data reader program at \texttt{/data/tracy\_systrace}, which has more acceptable resource requirements. +Data retrieval on Android requires spawning an elevated process to read the information provided by the kernel. While the standard \texttt{cat} utility can be used for this task, the resulting CPU usage is not acceptable due to how the kernel handles blocking reads. As a workaround, Tracy will inject a specialized kernel data reader program at \texttt{/data/tracy\_systrace}, which has more acceptable resource requirements. \end{bclogo} \begin{bclogo}[ @@ -1941,57 +1938,57 @@ noborder=true, couleur=black!5, logo=\bclampe ]{What should be granted privileges?} -Sometimes it may be confusing which program should be given the admin access. After all, some other profilers have to run elevated to access all their capabilities. +Sometimes it may be confusing which program should be given admin access. After all, some other profilers have to run elevated to access all their capabilities. -In case of Tracy the administrative rights should be given to \emph{the profiled application}. 
Remember that the server part of the profiler (where the data is collected and displayed) may be running on another machine, and thus it can't be used to access kernel data. +In the case of Tracy, you should give the administrative rights to \emph{the profiled application}. Remember that the server part of the profiler (where the data is collected and displayed) may be running on another machine, and thus you can't use it to access kernel data. \end{bclogo} \subsubsection{CPU usage} -System-wide CPU load is gathered with relatively high granularity (one reading every 100 \si{\milli\second}). The readings are available as a plot (see section~\ref{plots}). Note that this parameter takes into account all applications running on the system, not only the profiled program. +System-wide CPU load is gathered with relatively high granularity (one reading every 100 \si{\milli\second}). The readings are available as a plot (see section~\ref{plots}). Note that this parameter considers all applications running on the system, not only the profiled program. \subsubsection{Context switches} \label{contextswitches} -Since the profiled program is executing simultaneously with other applications, you can't have exclusive access to the CPU. The multitasking operating system's scheduler is giving threads waiting to execute short time slices, where part of the work can be done. Afterwards threads are preempted to give other threads a chance to run. This ensures that each program running in the system has a fair environment and no program can hog the system resources for itself. +Since the profiled program is executing simultaneously with other applications, you can't have exclusive access to the CPU. Instead, the multitasking operating system's scheduler gives threads waiting to execute short time slices to do part of their work. Afterward, threads are preempted to give other threads a chance to run. 
This ensures that each program running in the system has a fair environment, and no program can hog the system resources for itself. -As a corollary, it is often not enough to know how long it took to execute a zone. The thread in which a zone was running might have been suspended by the system, which artificially increases the time readings. +As a corollary, it is often not enough to know how long it took to execute a zone. For example, the thread in which a zone was running might have been suspended by the system. This would have artificially increased the time readings. -To solve this problem, Tracy collects context switch\footnote{A context switch happens when any given CPU core stops executing one thread and starts running another one.} information. This data can be then used to see when a zone was in the executing state and where it was waiting to be resumed. +To solve this problem, Tracy collects context switch\footnote{A context switch happens when any given CPU core stops executing one thread and starts running another one.} information. This data can then be used to see when a zone was in the executing state and where it was waiting to be resumed. -Context switch data capture may be disabled by adding the \texttt{TRACY\_NO\_CONTEXT\_SWITCH} define to the client. It needs privilege elevation, which is described in section~\ref{privilegeelevation}. +You may disable context switch data capture by adding the \texttt{TRACY\_NO\_CONTEXT\_SWITCH} define to the client. It needs privilege elevation, which is described in section~\ref{privilegeelevation}. \subsubsection{CPU topology} \label{cputopology} -Tracy may perform discovery of CPU topology data in order to provide further information about program performance characteristics. It is very useful when combined with context switches (section~\ref{contextswitches}). +Tracy may discover CPU topology data to provide further information about program performance characteristics. 
It is handy when combined with context switch information (section~\ref{contextswitches}). -In essence, the topology information gives you context about what any given \emph{logical CPU} really is and how it relates to other logical CPUs. The topology hierarchy consists of packages, cores and threads. +In essence, the topology information gives you context about what any given \emph{logical CPU} really is and how it relates to other logical CPUs. The topology hierarchy consists of packages, cores, and threads. Packages contain cores and shared resources, such as memory controller, L3 cache, etc. A store-bought CPU is an example of a package. While you may think that multi-package configurations would be a domain of servers, they are actually quite common in the mobile devices world, with many platforms using the \emph{big.LITTLE} arrangement of two packages in one silicon chip. Cores contain at least one thread and shared resources: execution units, L1 and L2 cache, etc. -Threads (or \emph{logical CPUs}; not to be confused with program threads) are basically the processor instruction pipelines. A pipeline might become stalled, for example due to pending memory access, leaving core resources unused. To reduce this bottleneck, some CPUs may use simultaneous multithreading\footnote{Commonly known as Hyper-threading.}, in which more than one pipeline will be using a single physical core resources. +Threads (or \emph{logical CPUs}; not to be confused with program threads) are basically the processor instruction pipelines. A pipeline might become stalled, for example, due to pending memory access, leaving core resources unused. To reduce this bottleneck, some CPUs may use simultaneous multithreading\footnote{Commonly known as Hyper-threading.}, in which more than one pipeline will be using the resources of a single physical core. -Knowing which package and core any logical CPU belongs to enables many insights.
For example, two threads scheduled to run on the same core will compete for shared execution units and cache, resulting in reduced performance. Or, a migration of a program thread from one core to another core will invalidate L1 and L2 cache, which is less costly than a migration from one package to another, which also invalidates L3 cache. +Knowing which package and core any logical CPU belongs to enables many insights. For example, two threads scheduled to run on the same core will compete for shared execution units and cache, resulting in reduced performance. Or, migrating a program thread from one core to another will invalidate the L1 and L2 cache. However, such invalidation is less costly than migration from one package to another, which also invalidates the L3 cache. \begin{bclogo}[ noborder=true, couleur=black!5, logo=\bcbombe ]{Important} -In this manual, the word \emph{core} is typically used as a short term for \emph{logical CPU}. Do not confuse it with physical processor cores. +In this manual, the word \emph{core} is typically used as a short term for \emph{logical CPU}. Please do not confuse it with physical processor cores. \end{bclogo} \subsubsection{Call stack sampling} \label{sampling} -Manual markup of zones doesn't cover every function existing in a program and cannot be performed in system libraries, or in kernel. This can leave blank spaces on the trace, leaving you with no clue what the application was doing. Tracy is able to periodically inspect state of running threads, providing you with a snapshot of call stack at the time when sampling was performed. While this information doesn't have the fidelity of manually inserted zones, it can sometimes give you an insight where to go next. +Manual markup of zones doesn't cover every function existing in a program and cannot be performed in system libraries or the kernel. This can leave blank spaces on the trace, leaving you no clue what the application was doing. 
However, Tracy can periodically inspect the state of running threads, providing you with a snapshot of the call stack at the time when sampling was performed. While this information doesn't have the fidelity of manually inserted zones, it can sometimes give you an insight into where to go next. This feature requires privilege elevation, as described in chapter~\ref{privilegeelevation}. Proper setup of the required program debugging data is described in chapter~\ref{collectingcallstacks}. -By default sampling is performed at 8 kHz frequency on Windows (which is the maximum possible value). On Linux and Android it is performed at 10 kHz\footnote{The maximum sampling frequency is limited by the \texttt{kernel.perf\_event\_max\_sample\_rate} sysctl parameter.}. This value can be changed by providing the sampling frequency (in Hz) through the \texttt{TRACY\_SAMPLING\_HZ} macro. +By default, sampling is performed at 8 kHz frequency on Windows (the maximum possible value). On Linux and Android, it is performed at 10 kHz\footnote{The maximum sampling frequency is limited by the \texttt{kernel.perf\_event\_max\_sample\_rate} sysctl parameter.}. You can change this value by providing the sampling frequency (in Hz) through the \texttt{TRACY\_SAMPLING\_HZ} macro. Call stack sampling may be disabled by using the \texttt{TRACY\_NO\_SAMPLING} define. @@ -2000,7 +1997,7 @@ noborder=true, couleur=black!5, logo=\bcbombe ]{Linux sampling rate limits} -The operating system may decide that sampling is taking too much CPU time and reduce the allowed sampling rate. This can be seen in \texttt{dmesg} output as: +The operating system may decide that sampling takes too much CPU time and reduce the allowed sampling rate. This can be seen in \texttt{dmesg} output as: \texttt{perf: interrupt took too long, lowering kernel.perf\_event\_max\_sample\_rate to \emph{value}}. 
@@ -2012,66 +2009,66 @@ Should you want to disable this mechanism, you can set the \texttt{kernel.perf\_ \paragraph{Wait stacks} \label{waitstacks} -On Windows sampling functionality also captures call stacks for context switch events. Such call stacks will show you what the application was doing when the thread was suspended and subsequently resumed, hence the name. We can categorize wait stacks into the following categories: +On Windows, sampling functionality also captures call stacks for context switch events. Such call stacks will show you what the application was doing when the thread was suspended and subsequently resumed, hence the name. We can categorize wait stacks into the following categories: \begin{enumerate} \item Random preemptive multitasking events, which are expected and do not have any significance. \item Expected waits, which may be caused by issuing sleep commands, waiting for a lock to become available, performing I/O, and so on. Quantitative analysis of such events may (but probably won't) direct you to some problems in your code. -\item Unexpected waits, which should be immediately taken care of. After all, what's the point of profiling and optimizing your program, if it is constantly waiting? An example of such unexpected wait may be some anti-virus service interfering with each of your file reads, when your assumption was that the operating system will buffer a large chunk of the data after the first read and make it immediately available to the application in the following calls. +\item Unexpected waits, which should be immediately taken care of. After all, what's the point of profiling and optimizing your program if it is constantly waiting for something? An example of such an unexpected wait may be some anti-virus service interfering with each of your file read operations. 
In this case, you may have assumed that the operating system would buffer a large chunk of the data after the first read and make it immediately available to the application in the following calls. \end{enumerate} \subsubsection{Hardware sampling} \label{hardwaresampling} -While the call stack sampling is a generic software-implemented functionality of the operating system, there's another way of sampling program execution patterns. Modern processors host a wide array of different hardware performance counters, which increase when some event in a CPU core happens. These could be as simple as counting each clock cycle, or as implementation specific as counting 'retired instructions that are delivered to the back-end after the front-end had at least 1 bubble-slot for a period of 2 cycles'. +While the call stack sampling is a generic software-implemented functionality of the operating system, there's another way of sampling program execution patterns. Modern processors host a wide array of different hardware performance counters, which increase when some event in a CPU core happens. These could be as simple as counting each clock cycle or as implementation-specific as counting 'retired instructions that are delivered to the back-end after the front-end had at least 1 bubble-slot for a period of 2 cycles'. -Tracy is able to use these counters to present you the following three statistics, which may help guide you in discovery why your code is not as fast as possible: +Tracy can use these counters to present the following three statistics, which may help guide you in discovering why your code is not as fast as possible: \begin{enumerate} -\item \emph{Instructions Per Cycle (IPC)} -- shows how many instructions were executing concurrently within a single core cycle. Higher values are better. The maximum achievable value depends on the design of CPU, including things such as the number of execution units and their individual capabilities.
Calculated as $\frac{\text{\#instructions retired}}{\text{\#cycles}}$. Can be disabled with the \texttt{TRACY\_NO\_SAMPLE\_RETIREMENT} macro. -\item \emph{Branch miss rate} -- shows how frequently the CPU branch predictor makes a wrong choice. Lower values are better. Calculated as $\frac{\text{\#branch misses}}{\text{\#branch instructions}}$. Can be disabled with the \texttt{TRACY\_NO\_SAMPLE\_BRANCH} macro. -\item \emph{Cache miss rate} -- shows how frequently the CPU has to retrieve data from memory. Lower values are better. The specifics of which cache level is taken into account here vary from one implementation to another. Calculated as $\frac{\text{\#cache misses}}{\text{\#cache references}}$. Can be disabled with the \texttt{TRACY\_NO\_SAMPLE\_CACHE} macro. +\item \emph{Instructions Per Cycle (IPC)} -- shows how many instructions were executing concurrently within a single core cycle. Higher values are better. The maximum achievable value depends on the design of the CPU, including things such as the number of execution units and their individual capabilities. Calculated as $\frac{\text{\#instructions retired}}{\text{\#cycles}}$. You can disable it with the \texttt{TRACY\_NO\_SAMPLE\_RETIREMENT} macro. +\item \emph{Branch miss rate} -- shows how frequently the CPU branch predictor makes a wrong choice. Lower values are better. Calculated as $\frac{\text{\#branch misses}}{\text{\#branch instructions}}$. You can disable it with the \texttt{TRACY\_NO\_SAMPLE\_BRANCH} macro. +\item \emph{Cache miss rate} -- shows how frequently the CPU has to retrieve data from memory. Lower values are better. The specifics of which cache level is taken into account here vary from one implementation to another. Calculated as $\frac{\text{\#cache misses}}{\text{\#cache references}}$. You can disable it with the \texttt{TRACY\_NO\_SAMPLE\_CACHE} macro. \end{enumerate} -Each performance counter has to be collected by a dedicated Performance Monitoring Unit (PMU). 
The availability of PMUs is very limited, so you may not be able to capture all the statistics mentioned above at the same time (as each requires capture of two different counters). In such case, you will need to manually select what needs to be sampled, with the macros specified above. +Each performance counter has to be collected by a dedicated Performance Monitoring Unit (PMU). However, the availability of PMUs is very limited, so you may not be able to capture all the statistics mentioned above at the same time (as each requires the capture of two different counters). In such a case, you will need to manually select what needs to be sampled with the macros specified above. If the provided measurements are not specific enough for your needs, you will need to use a profiler better tailored to the hardware you are using, such as Intel VTune, or AMD \si{\micro\relax}Prof. -Another problem to consider here is the measurement skid. It is quite hard to accurately pinpoint the exact assembly instruction which has caused the counter to trigger. Due to this the results you'll get may look a bit nonsense at times. For example, a branch miss may be attributed to the multiply instruction. Not much can be done with that, as this is exactly what the hardware is reporting. The amount of skid you will encounter depends on the specific implementation of a processor, and each vendor has their own solution to minimize it. Intel uses Precise Event Based Sampling (PEBS), which is rather good, but it still can, for example, blend the branch statistics across the comparison instruction and the following jump instruction. AMD employs their own Instruction Based Sampling (IBS), which tends to provide worse results in comparison. +Another problem to consider here is the measurement skid. It is pretty hard to accurately pinpoint the exact assembly instruction which has caused the counter to trigger. Due to this, the results you'll get may look a bit nonsensical at times.
For example, a branch miss may be attributed to the multiply instruction. Unfortunately, not much can be done with that, as this is exactly what the hardware is reporting. The amount of skid you will encounter depends on the specific implementation of a processor, and each vendor has its own solution to minimize it. Intel uses Precise Event Based Sampling (PEBS), which is rather good, but it still can, for example, blend the branch statistics across the comparison instruction and the following jump instruction. AMD employs its own Instruction Based Sampling (IBS), which tends to provide worse results in comparison. -Do note that the statistics presented by Tracy are a combination of two randomly sampled counters, so you should take them with a grain of salt. The random nature of sampling\footnote{The hardware counters in practice can be triggered only once per million-or-so events happening.} makes it fully possible to count more branch misses than branch instructions, or some other similar sillyness. You should always cross-check this data with the count of sampled events, in order to decide if the provided values can be reliably acted upon. +Do note that the statistics presented by Tracy are a combination of two randomly sampled counters, so you should take them with a grain of salt. The random nature of sampling\footnote{The hardware counters in practice can be triggered only once per million-or-so events happening.} makes it entirely possible to count more branch misses than branch instructions or some other similar silliness. You should always cross-check this data with the count of sampled events to decide if you can reliably act upon the provided values. \subparagraph{Availability} -Currently the hardware performance counter readings are only available on Linux, which also includes the WSL2 layer on Windows\footnote{You may need Windows 11 and the WSL preview from Microsoft Store for this to work.}. 
Access to them is performed using the kernel-provided infrastructure, so what you get may depend on how your kernel was configured. This also means that the exact set of supported hardware is not known, as it depends on what has been implemented in the Linux itself. At this point the x86 hardware is fully supported (including features such as PEBS or IBS), and there's PMU support on a selection of ARM designs. +Currently, the hardware performance counter readings are only available on Linux, which also includes the WSL2 layer on Windows\footnote{You may need Windows 11 and the WSL preview from Microsoft Store for this to work.}. Access to them is performed using the kernel-provided infrastructure, so what you get may depend on how your kernel was configured. This also means that the exact set of supported hardware is not known, as it depends on what has been implemented in Linux itself. At this point, the x86 hardware is fully supported (including features such as PEBS or IBS), and there's PMU support on a selection of ARM designs. \subsubsection{Executable code retrieval} \label{executableretrieval} -To enable deep insight into program execution, Tracy will capture small chunks of the executable image during profiling. The retrieved code can be subsequently disassembled to be inspected in detail. This functionality will be performed only for functions that are no larger than 128 KB and only if symbol information is present. +Tracy will capture small chunks of the executable image during profiling to enable deep insight into program execution. The retrieved code can be subsequently disassembled to be inspected in detail. The profiler will perform this functionality only for functions no larger than 128 KB and only if symbol information is present. -Discovery of previously unseen executable code may result in reduced performance of real-time capture. This is especially true when the profiling session had just started. 
Such behavior is expected and will go back to normal after a couple of moments. +The discovery of previously unseen executable code may result in reduced performance of real-time capture. This is especially true when the profiling session has just started. However, such behavior is expected and will go back to normal after several moments. -You should be extra careful when working with non-public code, as parts of your program will be embedded in the captured trace. Disabling collection of program code can be achieved by compiling the profiled application with the \texttt{TRACY\_NO\_CODE\_TRANSFER} define. You can also strip the code from a saved trace using the \texttt{update} utility (section~\ref{dataremoval}). +You should be extra careful when working with non-public code, as parts of your program will be embedded in the captured trace. You can disable the collection of program code by compiling the profiled application with the \texttt{TRACY\_NO\_CODE\_TRANSFER} define. You can also strip the code from a saved trace using the \texttt{update} utility (section~\ref{dataremoval}). \begin{bclogo}[ noborder=true, couleur=black!5, logo=\bcbombe ]{Important} -For proper program code retrieval no module used by the application can be unloaded during the runtime. See section~\ref{datalifetime} for an explanation. +For proper program code retrieval, no module used by the application may be unloaded at runtime. See section~\ref{datalifetime} for an explanation. \end{bclogo} \subsubsection{Vertical synchronization} -On Windows Tracy will automatically capture hardware Vsync events, if running with elevated privileges (see section~\ref{privilegeelevation}). These events will be reported as '\texttt{[x] Vsync}' frame sets, where \texttt{x} is the identifier of a specific monitor. Note that hardware vertical synchronization might not correspond to the one seen by your application, due to desktop composition, command queue buffering, etc.
+On Windows, Tracy will automatically capture hardware Vsync events if running with elevated privileges (see section~\ref{privilegeelevation}). These events will be reported as '\texttt{[x] Vsync}' frame sets, where \texttt{x} is the identifier of a specific monitor. Note that hardware vertical synchronization might not correspond to the one seen by your application due to desktop composition, command queue buffering, etc. Use the \texttt{TRACY\_NO\_VSYNC\_CAPTURE} macro to disable capture of Vsync events. \subsection{Trace parameters} \label{traceparameters} -Sometimes it is desired to change how the profiled application is behaving during the profiling run, for example you may want to enable or disable capture of frame images without recompiling and restarting your program. To be able to do so you must register a callback function using the \texttt{TracyParameterRegister(callback)} macro, where \texttt{callback} is a function conforming to the following signature: +Sometimes it is desired to change how the profiled application behaves during the profiling run. For example, you may want to enable or disable the capture of frame images without recompiling and restarting your program. To be able to do so you must register a callback function using the \texttt{TracyParameterRegister(callback)} macro, where \texttt{callback} is a function conforming to the following signature: \begin{lstlisting} void Callback(uint32_t idx, int32_t val) @@ -2086,7 +2083,7 @@ noborder=true, couleur=black!5, logo=\bcbombe ]{Important} -Usage of trace parameters makes profiling runs dependent on user interaction with the profiler, and thus it's not recommended to be employed if a consistent profiling environment is desired. Furthermore, interaction with the parameters is only possible in the graphical profiling application, and not in the command line capture utility. 
+Usage of trace parameters makes profiling runs dependent on user interaction with the profiler, and thus it's not recommended to be employed if a consistent profiling environment is desired. Furthermore, interaction with the parameters is only possible in the graphical profiling application but not in the command line capture utility. \end{bclogo} \subsection{Connection status} @@ -2097,7 +2094,7 @@ To determine if a connection is currently established between the client and the \section{Capturing the data} \label{capturing} -After the client application has been instrumented, you will want to connect to it using a server, which is available either as a headless capture-only utility, or as a full-fledged graphical profiling interface. +After the client application has been instrumented, you will want to connect to it using a server, available either as a headless capture-only utility or as a full-fledged graphical profiling interface. \subsection{Command line} @@ -2111,7 +2108,7 @@ You can capture a trace using a command line utility contained in the \texttt{ca \item \texttt{-s seconds} -- number of seconds to capture before automatically disconnecting (optional). \end{itemize} -If there is no client running at the given address, the server will wait until a connection can be made. During the capture the following information will be displayed: +If no client is running at the given address, the server will wait until it can make a connection. During the capture, the utility will display the following information: \begin{verbatim} % ./capture -a 127.0.0.1 -o trace @@ -2121,7 +2118,7 @@ Timer resolution: 3 ns 1.33 Mbps / 40.4% = 3.29 Mbps | Net: 64.42 MB | Mem: 283.03 MB | Time: 10.6 s \end{verbatim} -The \emph{queue delay} and \emph{timer resolution} parameters are calibration results of timers used by the client. 
The next line is a status bar, which displays: network connection speed, connection compression ratio, and the resulting uncompressed data rate; total amount of data transferred over the network; memory usage of the capture utility; time extent of the captured data. +The \emph{queue delay} and \emph{timer resolution} parameters are calibration results of timers used by the client. The following line is a status bar, which displays: network connection speed, connection compression ratio, and the resulting uncompressed data rate; the total amount of data transferred over the network; memory usage of the capture utility; time extent of the captured data. You can disconnect from the client and save the captured trace by pressing \keys{\ctrl + C}. If you prefer to disconnect after a fixed time, use the \texttt{-s seconds} parameter. @@ -2134,7 +2131,7 @@ The client \emph{address entry} field and the \faWifi{}~\emph{Connect} button ar If you want to open a trace that you have stored on the disk, you can do so by pressing the \faFolderOpen{}~\emph{Open saved trace} button. -The \emph{discovered clients} list is only displayed if there are clients broadcasting their presence on the local network\footnote{Only on IPv4 networks and only within the broadcast domain.}. Each entry shows the address\footnote{Either as an IP address, or as a host name, if able to resolve.} of the client (and optionally port, if different from the default one), how long the client has been running, and the name of the application that is profiled. Clicking on an entry will connect to the client. Incompatible clients are grayed-out and can't be connected to. Clicking on the \emph{\faFilter{}~Filter} toggle button will display client filtering input fields, allowing removal of the displayed entries, according to their address, port number, or program name. If filters are active, a yellow \faExclamationTriangle{}~warning icon will be displayed. 
+The \emph{discovered clients} list is only displayed if clients are broadcasting their presence on the local network\footnote{Only on IPv4 network and only within the broadcast domain.}. Each entry shows the client's address\footnote{Either as an IP address or as a hostname, if able to resolve.} (and port, if different from the default one), how long the client has been running, and the name of the profiled application. Clicking on an entry will connect to the client. Incompatible clients are grayed out and can't be connected to. Clicking on the \emph{\faFilter{}~Filter} toggle button will display client filtering input fields, allowing removal of the displayed entries according to their address, port number, or program name. If filters are active, a yellow \faExclamationTriangle{}~warning icon will be displayed. \begin{figure}[h] \centering\begin{tikzpicture} @@ -2162,9 +2159,9 @@ Both connecting to a client and opening a saved trace will present you with the \subsubsection{Connection information pop-up} \label{connectionpopup} -If this is a real-time capture, you will also have access to the connection information pop-up (figure~\ref{connectioninfo}) through the \emph{\faWifi{}~Connection} button, with the capture status similar to the one displayed by the command line utility. This dialog also displays the connection speed graphed over time and the profiled application's current frames per second and frame time measurements. The \emph{Query backlog} consists of two numbers. The first one represents the number of queries that were held back due to the bandwidth volume overwhelming the available network send buffer. The second one shows how many queries are in-flight, meaning requests which were sent to the client, but weren't yet answered. While these numbers drains down to zero, the performance of real time profiling may be temporarily compromised. The circle displayed next to the bandwidth graph signals the connection status. 
If it's red, the connection is active. If it's gray, the client has disconnected. +If this is a real-time capture, you will also have access to the connection information pop-up (figure~\ref{connectioninfo}) through the \emph{\faWifi{}~Connection} button, with the capture status similar to the one displayed by the command-line utility. This dialog also shows the connection speed graphed over time and the profiled application's current frames per second and frame time measurements. The \emph{Query backlog} consists of two numbers. The first represents the number of queries that were held back due to the bandwidth volume overwhelming the available network send buffer. The second one shows how many queries are in-flight, meaning requests sent to the client but not yet answered. While these numbers drain down to zero, the performance of real time profiling may be temporarily compromised. The circle displayed next to the bandwidth graph signals the connection status. If it's red, the connection is active. If it's gray, the client has disconnected. -You can use the \faSave{}~\emph{Save trace} button to save the current profile data to a file\footnote{This should be taken literally. If a live capture is in progress and a save is performed, some data may be missing from the capture and won't be saved.}. The available compression modes are discussed in sections~\ref{archival} and~\ref{fidict}. Use the \faPlug{}~\emph{Stop} button to disconnect from the client\footnote{While requesting disconnect stops retrieval of any new events, the profiler will wait for any data that is still pending for the current set of events.}. The \faExclamationTriangle{}~\emph{Discard} button is used to discard current trace. +You can use the \faSave{}~\emph{Save trace} button to save the current profile data to a file\footnote{You should take this literally. If a live capture is in progress and a save is performed, some data may be missing from the capture and won't be saved.}. 
The available compression modes are discussed in sections~\ref{archival} and~\ref{fidict}. Use the \faPlug{}~\emph{Stop} button to disconnect from the client\footnote{While requesting disconnect stops retrieval of any new events, the profiler will wait for any data that is still pending for the current set of events.}. The \faExclamationTriangle{}~\emph{Discard} button discards the current trace. \begin{figure}[h] \centering\begin{tikzpicture} @@ -2187,29 +2184,29 @@ You can use the \faSave{}~\emph{Save trace} button to save the current profile d If frame image capture has been implemented (chapter~\ref{frameimages}), a thumbnail of the last received frame image will be provided for reference. -If the profiled application opted to provide trace parameters (see section~\ref{traceparameters}) and the connection is still active, this pop-up will also contain a \emph{trace parameters} section, listing all the provided options. When you change any value here, a callback function will be executed on the client. +If the profiled application opted to provide trace parameters (see section~\ref{traceparameters}) and the connection is still active, this pop-up will also contain a \emph{trace parameters} section, listing all the provided options. A callback function will be executed on the client when you change any value here. \subsubsection{Automatic loading or connecting} -You can pass trace file name as an argument to the profiler application to open the capture, skipping the welcome dialog. You can also use the \texttt{-a address} argument to automatically connect to the given address. To specify the network port, pass the \texttt{-p port} parameter. It will be used for connections to client (overridable in the UI) and for listening to client discovery broadcasts. +You can pass the trace file name as an argument to the profiler application to open the capture, skipping the welcome dialog.
You can also use the \texttt{-a address} argument to connect to the given address automatically. Finally, to specify the network port, pass the \texttt{-p port} parameter. The profiler will use it for client connections (overridable in the UI) and for listening to client discovery broadcasts. \subsection{Connection speed} -Tracy network bandwidth requirements depend on the amount of data collection the profiled application is performing. In typical use case scenarios, you may expect anything between 1~Mbps and 100~Mbps data transfer rate. +Tracy network bandwidth requirements depend on the amount of data collection the profiled application performs. You may expect anything between 1~Mbps and 100~Mbps data transfer rate in typical use case scenarios. -The maximum attainable connection speed is determined by the ability of the client to provide data and the ability of the server to process the received data. In an extreme conditions test performed on an i7~8700K, the maximum transfer rate peaked at 950~Mbps. In each second the profiler was able to process 27~million zones and consume 1~GB of RAM. +The maximum attainable connection speed is determined by the ability of the client to provide data and the ability of the server to process the received data. In an extreme conditions test performed on an i7~8700K, the maximum transfer rate peaked at 950~Mbps. In each second, the profiler could process 27~million zones and consume 1~GB of RAM. \subsection{Memory usage} -The captured data is stored in RAM and only written to the disk, when the capture finishes. This can result in memory exhaustion when you are capturing massive amounts of profile data, or even in normal usage situations, when the capture is performed over a long stretch of time. The recommended usage pattern is to perform moderate instrumentation of the client code and limit capture time to the strict necessity. +The captured data is stored in RAM and only written to the disk when the capture finishes. 
This can result in memory exhaustion when you capture massive amounts of profile data or even in typical usage situations when the capture is performed over a long time. Therefore, the recommended usage pattern is to perform moderate instrumentation of the client code and limit capture time to what is strictly necessary. -In some cases it may be useful to perform an \emph{on-demand} capture, as described in section~\ref{ondemand}. In such case you will be able to profile only the interesting case (e.g.\ behavior during loading of a level in a game), ignoring all the unneeded data. +In some cases, it may be helpful to perform an \emph{on-demand} capture, as described in section~\ref{ondemand}. In such a case, you will be able to profile only the interesting case (e.g.,\ behavior during loading of a level in a game), ignoring all the unneeded data. -If you truly need to capture large traces, you have two options. Either buy more RAM, or use a large swap file on a fast disk drive\footnote{The operating system is able to manage memory paging much better than Tracy would be ever able to.}. +If you genuinely need to capture large traces, you have two options. Either buy more RAM or use a large swap file on a fast disk drive\footnote{The operating system can manage memory paging much better than Tracy would ever be able to.}. \subsection{Trace versioning} -Each new release of Tracy changes the internal format of trace files. While there is a backwards compatibility layer, allowing loading of traces created by previous versions of Tracy in new releases, it won't be there forever. You are thus advised to upgrade your traces using the utility contained in the \texttt{update} directory. +Each new release of Tracy changes the internal format of trace files. While there is a backward compatibility layer, which allows traces created by previous versions of Tracy to be loaded in new releases, it won't be there forever.
You are thus advised to upgrade your traces using the utility contained in the \texttt{update} directory. To use it, you will need to provide the input file and the output file. The program will print a short summary when it finishes, with information about trace file versions, their respective sizes and the output trace file compression ratio: @@ -2218,12 +2215,12 @@ To use it, you will need to provide the input file and the output file. The prog old.tracy (0.3.0) {916.4 MB} -> new.tracy (0.4.0) {349.4 MB, 31.53%} 9.7 s, 38.13% change \end{verbatim} -The new file contains the same data as the old one, but in the updated internal representation. Note that to perform an upgrade, whole trace needs to be loaded to memory. +The new file contains the same data as the old one but with an updated internal representation. Note that the whole trace needs to be loaded to memory to perform an upgrade. \subsubsection{Archival mode} \label{archival} -The \texttt{update} utility supports optional higher levels of data compression, which reduce disk size of traces, at the cost of increased compression times. With the default settings, the output files have a reasonable size and are quick to save and load. A list of available compression modes and their respective results is available in table~\ref{compressiontimes} and figures~\ref{savesize}, \ref{savetime} and~\ref{loadtime}. Compression mode selection is controlled by the following command line options: +The \texttt{update} utility supports optional higher levels of data compression, which reduce disk size of traces at the cost of increased compression times. The output files have a reasonable size and are quick to save and load with the default settings. A list of available compression modes and their respective results is available in table~\ref{compressiontimes} and figures~\ref{savesize}, \ref{savetime} and~\ref{loadtime}. 
The following command-line options control compression mode selection: \begin{itemize} \item \texttt{-h} -- enables LZ4 HC compression. @@ -2330,14 +2327,14 @@ The \texttt{update} utility supports optional higher levels of data compression, Trace files created using the \emph{default}, \emph{hc} and \emph{extreme} modes are optimized for fast decompression and can be further compressed using file compression utilities. For example, using 7-zip results in archives of the following sizes: 77.2 MB, 54.3 MB, 52.4 MB. -For archival purposes it is however much better to use the \emph{zstd} compression modes, which are faster, compress trace files more tightly, and are directly loadable by the profiler, without the intermediate decompression step. +For archival purposes, it is, however, much better to use the \emph{zstd} compression modes, which are faster, compress trace files more tightly, and are directly loadable by the profiler, without the intermediate decompression step. \subsubsection{Frame images dictionary} \label{fidict} -Frame images have to be compressed individually, so that there are no delays during random access to contents of any image. Unfortunately, because of this there is no reuse of compression state between similar (or even identical) images, which leads to increased memory consumption. This can be partially remedied by enabling calculation of an optional frame images dictionary with the \texttt{-d} command line parameter. +Frame images have to be compressed individually so that there are no delays during random access to the contents of any image. Unfortunately, because of this, there is no reuse of compression state between similar (or even identical) images, which leads to increased memory consumption. The profiler can partially remedy this by enabling the calculation of an optional frame images dictionary with the \texttt{-d} command line parameter. 
-Saving a trace with frame images dictionary enabled will need some extra time, which will depend on the amount of image data you have captured. Loading such trace will also be slower, but not by much. How much RAM will be saved by the dictionary depends on the similarity of frame images. Be aware that post-processing effects such as artificial film grain have a subtle effect on image contents, which is significant in this case. +Saving a trace with the frame images dictionary enabled will need some extra time, depending on the amount of image data you have captured. Loading such a trace will also be slower, but not by much. How much RAM the dictionary will save depends on the similarity of frame images. Be aware that post-processing effects such as artificial film grain have a subtle impact on image contents, which is significant in this case. The dictionary cannot be used when you are capturing a trace. @@ -2358,31 +2355,31 @@ In some cases you may want to share just a portion of the trace file, omitting s \item \texttt{S} -- source file cache. \end{itemize} -Flags can be concatenated, for example specifying \texttt{-s CSi} will remove symbol code, source file cache and frame images in the destination trace file. +Flags can be concatenated. For example, specifying \texttt{-s CSi} will remove symbol code, source file cache, and frame images in the destination trace file. \subsection{Instrumentation failures} \label{instrumentationfailures} -In some cases your program may be incorrectly instrumented, for example you could have unbalanced zone begin and end events, or you could report a memory free event without first reporting a memory allocation event. When Tracy detects such misbehavior it immediately terminates connection with the client and displays an error message. +In some cases, your program may be incorrectly instrumented.
For example, you could have unbalanced zone begin and end events or report a memory free event without first reporting a memory allocation event. When Tracy detects such misbehavior, it immediately terminates the connection with the client and displays an error message. \section{Analyzing captured data} \label{analyzingdata} -You have instrumented your application and you have captured a profiling trace. Now you want to look at the collected data. You can do this in the application contained in the \texttt{profiler} directory. +You have instrumented your application, and you have captured a profiling trace. Now you want to look at the collected data. You can do this in the application contained in the \texttt{profiler} directory. -The workflow is identical, whether you are viewing a previously saved trace, or if you're performing a live capture, as described in section~\ref{interactiveprofiling}. +The workflow is identical, whether you are viewing a previously saved trace or performing a live capture, as described in section~\ref{interactiveprofiling}. \subsection{Time display} In most cases Tracy will display an approximation of time value, depending on how big it is. For example, a short time range will be displayed as 123~\si{\nano\second}, and some longer ones will be shortened to 123.45~\si{\micro\second}, 123.45~\si{\milli\second}, 12.34~\si{\second}, 1:23.4, 12:34:56, or even 1d12:34:56 to indicate more than a day has passed. -While such presentation makes time values easy to read, it is not always appropriate. For example, you may have multiple events happen at a time approximated to 1:23.4, giving you a precision of only $\sfrac{1}{10}$ of a second. There's certainly a lot that can happen in 100~\si{\milli\second}. +While such a presentation makes time values easy to read, it is not always appropriate. For example, you may have multiple events happen at a time approximated to 1:23.4, giving you a precision of only $\sfrac{1}{10}$ of a second.
And there's certainly a lot that can happen in 100~\si{\milli\second}. -To solve this problem, an alternative time display is used in appropriate places. It combines a day--hour--minute--second value with full nanosecond resolution, resulting in values such as 1:23~456,789,012~\si{\nano\second}. +An alternative time display is used in appropriate places to solve this problem. It combines a day--hour--minute--second value with full nanosecond resolution, resulting in values such as 1:23~456,789,012~\si{\nano\second}. \subsection{Main profiler window} -The main profiler window is split into three sections, as seen on figure~\ref{mainwindow}: the control menu, the frame time graph and the timeline display. +The main profiler window is split into three sections, as seen in figure~\ref{mainwindow}: the control menu, the frame time graph, and the timeline display. \begin{figure}[h] \centering\begin{tikzpicture} @@ -2408,19 +2405,19 @@ The main profiler window is split into three sections, as seen on figure~\ref{ma \draw (0.1, -1.3) rectangle+(15.9, -1) node [midway] {Frame time graph}; \draw (0.1, -2.4) rectangle+(15.9, -3) node [midway] {Timeline view}; \end{tikzpicture} -\caption{Main profiler window. Note that the top line of buttons has been split into two rows in this manual.} +\caption{Main profiler window. Note that this manual has split the top line of buttons into two rows.} \label{mainwindow} \end{figure} \subsubsection{Control menu} \label{controlmenu} -The control menu (top row of buttons) provides access to various features of the profiler. The buttons perform the following actions: +The control menu (top row of buttons) provides access to various profiler features. The buttons perform the following actions: \begin{itemize} \item \emph{\faWifi{}~Connection} -- Opens the connection information popup (see section~\ref{connectionpopup}). Only available when live capture is in progress. 
\item \emph{\faPowerOff{} Close} -- This button unloads the current profiling trace and returns to the welcome menu, where another trace can be loaded. In live captures it is replaced by \emph{\faPause{}~Pause}, \emph{\faPlay{}~Resume} and \emph{\faSquare{}~Stopped} buttons. -\item \emph{\faPause{} Pause} -- While a live capture is in progress, the profiler will display recent events, as either the last three fully captured frames, or a certain time range. This can be used to see the current behavior of the program. The pause button\footnote{Or perform any action on the timeline view, apart from changing the zoom level.} will stop the automatic updates of the timeline view (the capture will be still progressing). +\item \emph{\faPause{} Pause} -- While a live capture is in progress, the profiler will display recent events, as either the last three fully captured frames, or a certain time range. You can use this to see the current behavior of the program. The pause button\footnote{Or perform any action on the timeline view, apart from changing the zoom level.} will stop the automatic updates of the timeline view (the capture will still be progressing). \item \emph{\faPlay{} Resume} -- This button allows to resume following the most recent events in a live capture. You will have selection of one of the following options: \emph{\faSearchPlus{}~Newest three frames}, or \emph{\faRulerHorizontal{}~Use current zoom level}. \item \emph{\faSquare{} Stopped} -- Inactive button used to indicate that the client application was terminated. \item \emph{\faCog{} Options} -- Toggles the settings menu (section~\ref{options}). @@ -2441,15 +2438,15 @@ The control menu (top row of buttons) provides access to various features of the \item \emph{\faSearchPlus{}~Display scale} -- Enables run-time resizing of the displayed content. This may be useful in environments with potentially reduced visibility, e.g. during a presentation. 
Note that this setting is independent to the UI scaling coming from the system DPI settings. \end{itemize} -The frame information block consists of four elements: the current frame set name along with the number of captured frames (click on it with the \LMB{}~left mouse button to go to a specified frame), the two navigational buttons \faCaretLeft{} and \faCaretRight{}, which allow you to focus the timeline view on the previous or next frame, and the frame set selection button \faCaretDown{}, which is used to switch to a another frame set\footnote{See section~\ref{framesets} for another way to change the active frame set.}. For more information about marking frames, see section~\ref{markingframes}. +The frame information block consists of four elements: the current frame set name along with the number of captured frames (click on it with the \LMB{}~left mouse button to go to a specified frame), the two navigational buttons \faCaretLeft{} and \faCaretRight{}, which allow you to focus the timeline view on the previous or next frame, and the frame set selection button \faCaretDown{}, which is used to switch to another frame set\footnote{See section~\ref{framesets} for another way to change the active frame set.}. For more information about marking frames, see section~\ref{markingframes}. -The next three items show the \emph{\faEye{}~view time range}, the \emph{\faDatabase{}~time span} of the whole capture (clicking on it with the \MMB{} middle mouse button will set the view range to the entire capture), and the \emph{\faMemory{}~memory usage} of the profiler. +The following three items show the \emph{\faEye{}~view time range}, the \emph{\faDatabase{}~time span} of the whole capture (clicking on it with the \MMB{} middle mouse button will set the view range to the entire capture), and the \emph{\faMemory{}~memory usage} of the profiler. 
\paragraph{Notification area} -The notification area is used to display informational notices, for example how long it took to load a trace from disk. A pulsating dot next to the \faTasks~icon indicates that some background tasks are being performed, that may need to be completed before full capabilities of the profiler are available. If a crash was captured during profiling (section~\ref{crashhandling}), a \emph{\faSkull{}~crash} icon will be displayed. The red \faSatelliteDish{}~icon indicates that queries are currently being backlogged, while the same yellow icon indicates that some queries are currently in-flight (see chapter~\ref{connectionpopup} for more information). +The notification area displays informational notices, for example, how long it took to load a trace from the disk. A pulsating dot next to the \faTasks~icon indicates that some background tasks are being performed that may need to be completed before full capabilities of the profiler are available. If a crash was captured during profiling (section~\ref{crashhandling}), a \emph{\faSkull{}~crash} icon will be displayed. The red \faSatelliteDish{}~icon indicates that queries are currently being backlogged, while the same yellow icon indicates that some queries are currently in-flight (see chapter~\ref{connectionpopup} for more information). -If drawing of timeline elements was disabled in the options menu (section~\ref{options}), the following orange icons will be used to remind the user about that fact. Click on the icons to enable drawing of the selected elements. Note that collapsed labels (section~\ref{zoneslocksplots}) are not taken into account here. +If the drawing of timeline elements was disabled in the options menu (section~\ref{options}), the profiler will use the following orange icons to remind you about that fact. Click on the icons to enable drawing of the selected elements. Note that collapsed labels (section~\ref{zoneslocksplots}) are not taken into account here. 
\begin{itemize} \item \faExpand{} -- Display of empty labels is enabled. @@ -2466,7 +2463,7 @@ If drawing of timeline elements was disabled in the options menu (section~\ref{o \subsubsection{Frame time graph} \label{frametimegraph} -The graph of currently selected frame set (figure~\ref{frametime}) provides an outlook on the time spent in each frame, allowing you to see where the problematic frames are and to quickly navigate to them. +The graph of the currently selected frame set (figure~\ref{frametime}) provides an outlook on the time spent in each frame, allowing you to see where the problematic frames are and to navigate to them quickly. \begin{figure}[h] \centering\begin{tikzpicture} @@ -2513,35 +2510,35 @@ The graph of currently selected frame set (figure~\ref{frametime}) provides an o \label{frametime} \end{figure} -Each bar displayed on the graph represents an unique frame in the current frame set\footnote{Unless the view is zoomed out and multiple frames are merged into one column.}. The progress of time is in the right direction. The height of the bar indicates the time spent in frame, complemented with the color information: +Each bar displayed on the graph represents a unique frame in the current frame set\footnote{Unless the view is zoomed out and multiple frames are merged into one column.}. The progress of time is in the right direction. The height of the bar indicates the time spent in the frame, complemented with the color information: \begin{itemize} \item If the bar is \emph{blue}, then the frame met the \emph{best} time of 143 FPS, or 6.99 \si{\milli\second}\footnote{The actual target is 144 FPS, but one frame leeway is allowed to account for timing inaccuracies.} (represented by blue target line). \item If the bar is \emph{green}, then the frame met the \emph{good} time of 59 FPS, or 16.94 \si{\milli\second} (represented by green target line). 
\item If the bar is \emph{yellow}, then the frame met the \emph{bad} time of 29 FPS, or 34.48 \si{\milli\second} (represented by yellow target line). -\item If the bar is \emph{red}, then the frame didn't met any time limits. +\item If the bar is \emph{red}, then the frame didn't meet any time limits. \end{itemize} The frames visible on the timeline are marked with a violet box drawn over them. When a zone is displayed in the find zone window (section~\ref{findzone}), the coloring of frames may be changed, as described in section~\ref{frametimefindzone}. -Moving the \faMousePointer{} mouse cursor over the frames displayed on the graph will display tooltip with information about frame number, frame time, frame image (if available, see chapter~\ref{frameimages}), etc. Such tooltips are common for many UI elements in the profiler and won't be mentioned later in the manual. +Moving the \faMousePointer{} mouse cursor over the frames displayed on the graph will display a tooltip with information about frame number, frame time, frame image (if available, see chapter~\ref{frameimages}), etc. Such tooltips are common for many UI elements in the profiler and won't be mentioned later in the manual. -The timeline view may be focused on the frames, by clicking or dragging the \LMB{}~left mouse button on the graph. The graph may be scrolled left and right by dragging the \RMB{}~right mouse button over the graph. The view may be zoomed in and out by using the \Scroll{}~mouse wheel. If the view is zoomed out, so that multiple frames are merged into one column, the highest frame time will be used to represent the given column. +You may focus the timeline view on the frames by clicking or dragging the \LMB{}~left mouse button on the graph. The graph may be scrolled left and right by dragging the \RMB{}~right mouse button over the graph. Finally, you may zoom the view in and out by using the \Scroll{}~mouse wheel. 
If the view is zoomed out, so that multiple frames are merged into one column, the profiler will use the highest frame time to represent the given column. Clicking the \LMB{}~left mouse button on the graph while the \keys{\ctrl}~key is pressed will open the frame image playback window (section~\ref{playback}) and set the playback to the selected frame. See section~\ref{frameimages} for more information about frame images. \subsubsection{Timeline view} -The timeline is the most important element of the profiler UI. All the captured data is displayed there, laid out on the horizontal axis, according to the flow of time. Where there was no profiling performed, the timeline is dimmed out. The view is split into three parts: the time scale, the frame sets and the combined zones, locks and plots display. +The timeline is the most crucial element of the profiler UI. All the captured data is displayed there, laid out on the horizontal axis, according to time flow. Where there was no profiling performed, the timeline is dimmed out. The view is split into three parts: the time scale, the frame sets, and the combined zones, locks, and plots display. \subparagraph{Collapsed items} \label{collapseditems} -Due to extreme differences in time scales, you will almost constantly see events that are too small to be displayed on the screen. Such events have preset minimum size (so they can be seen) and are marked with a zig-zag pattern, to indicate that you need to zoom-in to see more detail. +Due to extreme differences in time scales, you will almost constantly see events too small to be displayed on the screen. Such events have preset minimum size (so they can be seen) and are marked with a zig-zag pattern to indicate that you need to zoom in to see more detail. -The zig-zag pattern can be seen applied to frame sets on figure~\ref{framesetsfig}, and to zones on figure~\ref{zoneslocks}. 
+The zig-zag pattern can be seen applied to frame sets in figure~\ref{framesetsfig} and to zones in figure~\ref{zoneslocks}. \paragraph{Time scale} @@ -2573,9 +2570,9 @@ The time scale is a quick aid in determining the relation between screen space a \label{timescale} \end{figure} -The leftmost value on the scale represents the time at which the timeline starts. The rest of numbers label the notches on the scale, with some numbers omitted, if there's no space to display them. +The leftmost value on the scale represents the time at which the timeline starts. The rest of the numbers label the notches on the scale, with some numbers omitted if there's no space to display them. -Hovering the \faMousePointer{}~mouse pointer over the time scale will display tooltip with exact timestamp at the position of mouse cursor. +Hovering the \faMousePointer{}~mouse pointer over the time scale will display a tooltip with the exact timestamp at the position of the mouse cursor. \paragraph{Frame sets} \label{framesets} @@ -2611,20 +2608,20 @@ Frames from each frame set are displayed directly underneath the time scale. Eac \label{framesetsfig} \end{figure} -On figure~\ref{framesetsfig} we can see the fully described frames~312 and 347. The description consists of the frame name, which is \emph{Frame} for the default frame set (section~\ref{markingframes}) or the name you used for the secondary name set (section~\ref{secondaryframeset}), the frame number and the frame time. Since frame~348 is too small to be fully labeled, only the frame time is shown. Frame~349 is even smaller, with no space for any text. Moreover, frames~313~to~346 are too small to be displayed individually, so they are replaced with a zig-zag pattern, as described in section~\ref{collapseditems}. +In figure~\ref{framesetsfig} we can see the fully described frames~312 and 347.
The description consists of the frame name, which is \emph{Frame} for the default frame set (section~\ref{markingframes}) or the name you used for the secondary frame set (section~\ref{secondaryframeset}), the frame number, and the frame time. Since frame~348 is too small to be fully labeled, only the frame time is shown. Frame~349 is even smaller, with no space for any text. Moreover, frames~313~to~346 are too small to be displayed individually, so they are replaced with a zig-zag pattern, as described in section~\ref{collapseditems}. -You can also see that there are frame separators, projected down to the rest of the timeline view. Note that only the separators for the currently selected frame set are displayed. You can make a frame set active by clicking the \LMB{}~left mouse button on a frame set row you want to select (also see section~\ref{controlmenu}). +You can also see that frame separators are projected down to the rest of the timeline view. Note that only the separators for the currently selected frame set are displayed. You can make a frame set active by clicking the \LMB{}~left mouse button on a frame set row you want to select (also see section~\ref{controlmenu}). Clicking the \MMB{} middle mouse button on a frame will zoom the view to the extent of the frame. -If a frame has an associated frame image (see chapter~\ref{frameimages}), you can hold the \keys{\ctrl} key and click the \LMB{}~left mouse button on the frame, to open the frame image playback window (see chapter~\ref{playback}) and set the playback to the selected frame. +If a frame has an associated frame image (see chapter~\ref{frameimages}), you can hold the \keys{\ctrl} key and click the \LMB{}~left mouse button on the frame to open the frame image playback window (see chapter~\ref{playback}) and set the playback to the selected frame.
If the \emph{\faFlagCheckered{}~Draw frame targets} option is enabled (see section~\ref{options}), time regions in frames exceeding the set target value will be marked with a red background. \paragraph{Zones, locks and plots display} \label{zoneslocksplots} -On this combined view you will find the zones with locks and their associated threads. The plots are graphed right below. +You will find the zones with locks and their associated threads on this combined view. The plots are graphed right below. \begin{figure}[h] \centering\begin{tikzpicture} @@ -2683,26 +2680,26 @@ On this combined view you will find the zones with locks and their associated th \label{zoneslocks} \end{figure} -The left hand side \emph{index area} of the timeline view displays various labels (threads, locks), which can be categorized in the following way: +The left-hand side \emph{index area} of the timeline view displays various labels (threads, locks), which can be categorized in the following way: \begin{itemize} -\item \emph{Light blue label} -- GPU context. Multi-threaded Vulkan, OpenCL and Direct3D 12 contexts are additionally split into separate threads. +\item \emph{Light blue label} -- GPU context. Multi-threaded Vulkan, OpenCL, and Direct3D 12 contexts are additionally split into separate threads. \item \emph{Pink label} -- CPU data graph. -\item \emph{White label} -- A CPU thread. Will be replaced by a bright red label in a thread that has crashed (section~\ref{crashhandling}). If automated sampling was performed, clicking the~\LMB{}~left mouse button on the \emph{\faGhost{}~ghost zones} button will switch zone display mode between 'instrumented' and 'ghost'. -\item \emph{Green label} -- Fiber, coroutine, or any other sort of cooperative multitasking 'green thread'. +\item \emph{White label} -- A CPU thread. It will be replaced by a bright red label in a thread that has crashed (section~\ref{crashhandling}). 
If automated sampling was performed, clicking the~\LMB{}~left mouse button on the \emph{\faGhost{}~ghost zones} button will switch zone display mode between 'instrumented' and 'ghost'. +\item \emph{Green label} -- Fiber, coroutine, or any other sort of cooperative multitasking 'green thread'. \item \emph{Light red label} -- Indicates a lock. \item \emph{Yellow label} -- Plot. \end{itemize} -Labels accompanied by the \faCaretDown{}~symbol can be collapsed out of the view, to reduce visual clutter. Hover the~\faMousePointer{}~mouse pointer over the label to display additional information. Click the \MMB{}~middle mouse button on a label to zoom the view to the extent of the label contents. +Labels accompanied by the \faCaretDown{}~symbol can be collapsed out of the view to reduce visual clutter. Hover the~\faMousePointer{}~mouse pointer over the label to display additional information. Click the \MMB{}~middle mouse button on a label to zoom the view to the extent of the label contents. \subparagraph{Zones} -In an example on figure~\ref{zoneslocks} you can see that there are two threads: \emph{Main thread} and \emph{Streaming thread}\footnote{By clicking on a thread name you can temporarily disable display of the zones in this thread.}. We can see that the \emph{Main thread} has two root level zones visible: \emph{Update} and \emph{Render}. The \emph{Update} zone is split into further sub-zones, some of which are too small to be displayed at the current zoom level. This is indicated by drawing a zig-zag pattern over the merged zones box (section~\ref{collapseditems}), with the number of collapsed zones printed in place of zone name. We can also see that the \emph{Physics} zone acquires the \emph{Physics lock} mutex for the most of its run time.
+In an example in figure~\ref{zoneslocks} you can see that there are two threads: \emph{Main thread} and \emph{Streaming thread}\footnote{By clicking on a thread name, you can temporarily disable the display of the zones in this thread.}. We can see that the \emph{Main thread} has two root level zones visible: \emph{Update} and \emph{Render}. The \emph{Update} zone is split into further sub-zones, some of which are too small to be displayed at the current zoom level. This is indicated by drawing a zig-zag pattern over the merged zones box (section~\ref{collapseditems}), with the number of collapsed zones printed in place of the zone name. We can also see that the \emph{Physics} zone acquires the \emph{Physics lock} mutex for most of its run time. -Meanwhile the \emph{Streaming thread} is performing some \emph{Streaming jobs}. The first \emph{Streaming job} sent a message (section~\ref{messagelog}), which in addition to being listed in the message log is being indicated by a triangle over the thread separator. When there are multiple messages in one place, the triangle outline shape changes to a filled triangle. +Meanwhile, the \emph{Streaming thread} is performing some \emph{Streaming jobs}. The first \emph{Streaming job} sent a message (section~\ref{messagelog}). In addition to being listed in the message log, it is indicated by a triangle over the thread separator. When multiple messages are in one place, the triangle outline shape changes to a filled triangle. -At high zoom levels, the zones will be displayed with additional markers, as presented on figure~\ref{inaccuracy}. The red regions at the start and end of a zone indicate the cost associated with recording an event (\emph{Queue delay}). The error bars show the timer inaccuracy (\emph{Timer resolution}). 
Note that these markers are only \emph{approximations}, as there are many factors that can impact the true cost of capturing a zone, for example cache effects, or CPU frequency scaling, which is unaccounted for (see section~\ref{checkenvironmentcpu}). +At high zoom levels, the zones will be displayed with additional markers, as presented in figure~\ref{inaccuracy}. The red regions at the start and end of a zone indicate the cost associated with recording an event (\emph{Queue delay}). The error bars show the timer inaccuracy (\emph{Timer resolution}). Note that these markers are only \emph{approximations}, as many factors can impact the actual cost of capturing a zone, for example, cache effects or CPU frequency scaling, which is unaccounted for (see section~\ref{checkenvironmentcpu}). \begin{figure}[h] \centering\begin{tikzpicture} @@ -2724,40 +2721,40 @@ At high zoom levels, the zones will be displayed with additional markers, as pre The GPU zones are displayed just like CPU zones, with an OpenGL/Vulkan/Direct3D/OpenCL context in place of a thread name. -Hovering the \faMousePointer{} mouse pointer over a zone will highlight all other zones that have the same source location with a white outline. Clicking the \LMB{}~left mouse button on a zone will open zone information window (section~\ref{zoneinfo}). Holding the \keys{\ctrl} key and clicking the \LMB{}~left mouse button on a zone will open zone statistics window (section~\ref{findzone}). Clicking the \MMB{}~middle mouse button on a zone will zoom the view to the extent of the zone. +Hovering the \faMousePointer{} mouse pointer over a zone will highlight all other zones that have the same source location with a white outline. Clicking the \LMB{}~left mouse button on a zone will open the zone information window (section~\ref{zoneinfo}). Holding the \keys{\ctrl} key and clicking the \LMB{}~left mouse button on a zone will open the zone statistics window (section~\ref{findzone}).
Clicking the \MMB{}~middle mouse button on a zone will zoom the view to the extent of the zone. \subparagraph{Ghost zones} -Display of ghost zones (not pictured on figure~\ref{zoneslocks}, but similar to normal zones view) can be enabled by clicking on the \emph{\faGhost{}~ghost zones} icon next to thread label, available if automated sampling (see chapter~\ref{sampling}) was performed. Ghost zones will also be displayed by default, if no instrumented zones are available for a given thread, to help with pinpointing functions that should be instrumented. +You can enable the view of ghost zones (not pictured on figure~\ref{zoneslocks}, but similar to standard zones view) by clicking on the \emph{\faGhost{}~ghost zones} icon next to the thread label, available if automated sampling (see chapter~\ref{sampling}) was performed. Ghost zones will also be displayed by default if no instrumented zones are available for a given thread to help with pinpointing functions that should be instrumented. -Ghost zones represent true function calls in the program, periodically reported by the operating system. Due to the limited resolution of sampling, you need to take great care when looking at reported timing data. While it may be apparent that some small function requires a relatively long time to execute, for example 125~\si{\micro\second} (8~kHz~sampling rate), in reality this time represents a period between taking two distinct samples, not the actual function run time. Similarly, two (or more) distinct function calls may be represented as a single ghost zone, because the profiler doesn't have the information needed to know about true lifetime of a sampled function. +Ghost zones represent true function calls in the program, periodically reported by the operating system. Due to the limited sampling resolution, you need to take great care when looking at reported timing data. 
While it may be apparent that some small function requires a relatively long time to execute, for example, 125~\si{\micro\second} (8~kHz~sampling rate), in reality, this time represents a period between taking two distinct samples, not the actual function run time. Similarly, two (or more) separate function calls may be represented as a single ghost zone because the profiler doesn't have the information needed to know about the actual lifetime of a sampled function. -Another common pitfall to watch for is the order of presented functions. \emph{It is not what you expect it to be!} Read chapter~\ref{readingcallstacks} for a critical insight on how call stacks might seem nonsensical at first, and why they aren't. +Another common pitfall to watch for is the order of presented functions. \emph{It is not what you expect it to be!} Read chapter~\ref{readingcallstacks} for critical insight on how call stacks might seem nonsensical at first and why they aren't. -The available information about ghost zones is quite limited, but it's enough to give you a rough outlook on the execution of your application. The timeline view alone is more than any other statistical profiler is able to present. In addition to that, Tracy properly handles inlined function calls, which are indicated by darker background of ghost zones. Zones representing kernel-mode functions are displayed with red function names. +The available information about ghost zones is quite limited, but it's enough to give you a rough outlook on the execution of your application. The timeline view alone is more than any other statistical profiler can present. In addition, Tracy correctly handles inlined function calls, which are indicated by a darker background of ghost zones. Lastly, zones representing kernel-mode functions are displayed with red function names. Clicking the \LMB{}~left mouse button on a ghost zone will open the corresponding source file location, if able (see chapter~\ref{sourceview} for conditions). 
There are three ways in which source locations can be assigned to a ghost zone: \begin{enumerate} \item If the selected ghost zone is \emph{not} an inline frame and its symbol data has been retrieved, the source location points to the function entry location (first line of the function). \item If the selected ghost zone is \emph{not} an inline frame, but its symbol data is not available, the source location will point to a semi-random location within the function body (i.e. to one of the sampled addresses in the program, but not necessarily the one representing the selected time stamp, as multiple samples with different addresses may be merged into one ghost zone). -\item If the selected ghost zone \emph{is} an inline frame, the source location will point to a semi-random location within the inlined function body (see details in the above point). It is not possible to go to the entry location of such function, as it doesn't exist in the program binary. Inlined functions begin in the parent function. +\item If the selected ghost zone \emph{is} an inline frame, the source location will point to a semi-random location within the inlined function body (see details in the above point). It is impossible to go to such a function's entry location, as it doesn't exist in the program binary. Inlined functions begin in the parent function. \end{enumerate} \subparagraph{Call stack samples} -The row of dots right below the \emph{Main thread} label shows call stack sample points, which may have been automatically captured (see chapter~\ref{sampling} for more detail). Hovering the \faMousePointer{}~mouse pointer over each dot will display a short call stack summary, while clicking on a dot with the \LMB{}~left mouse button will open a more detailed call stack information window (see section~\ref{callstackwindow}). 
+The row of dots right below the \emph{Main thread} label shows call stack sample points, which may have been automatically captured (see chapter~\ref{sampling} for more detail). Hovering the \faMousePointer{}~mouse pointer over each dot will display a short call stack summary while clicking on the dot with the \LMB{}~left mouse button will open a more detailed call stack information window (see section~\ref{callstackwindow}). \subparagraph{Context switches} -The thick line right below the samples represents context switch data (see section~\ref{contextswitches}). We can see that the main thread, as displayed, starts in a suspended state, represented by the dotted region. Then it is woken up and starts execution of the \texttt{Update} zone. In midst of the physics processing it is preempted, which explains why there is an empty space between child zones. Then it is resumed again and continues execution into the \texttt{Render} zone, where it is preempted again, but for a shorter time. After rendering is done, the thread sleeps again, presumably waiting for the vertical blanking, to indicate next frame. Similar information is also available for the streaming thread. +The thick line right below the samples represents context switch data (see section~\ref{contextswitches}). We can see that the main thread, as displayed, starts in a suspended state, represented by the dotted region. Then it is woken up and starts execution of the \texttt{Update} zone. It is preempted amid the physics processing, which explains why there is an empty space between child zones. Then it is resumed again and continues execution into the \texttt{Render} zone, where it is preempted again, but for a shorter time. After rendering is done, the thread sleeps again, presumably waiting for the vertical blanking to indicate the next frame. Similar information is also available for the streaming thread. 
Context switch regions are using the following color key: \begin{itemize} \item \emph{Green} -- Thread is running. -\item \emph{Red} -- Thread is waiting to be resumed by the scheduler. There are many reasons why a thread may be in the waiting state. Hovering the \faMousePointer{}~mouse pointer over the region will display more information. If sampling was performed, a wait stack may be displayed. See section~\ref{waitstacks} for additional details. -\item \emph{Blue} -- Thread is waiting to be resumed and is migrating to another CPU core. This might have visible performance effects, because low level CPU caches are not shared between cores, which may result in additional cache misses. To avoid this problem, you may pin a thread to a specific core, by setting its affinity. +\item \emph{Red} -- Thread is waiting to be resumed by the scheduler. There are many reasons why a thread may be in the waiting state. Hovering the \faMousePointer{}~mouse pointer over the region will display more information. If sampling was performed, the profiler might display a wait stack. See section~\ref{waitstacks} for additional details. +\item \emph{Blue} -- Thread is waiting to be resumed and is migrating to another CPU core. This might have visible performance effects because low-level CPU caches are not shared between cores, which may result in additional cache misses. To avoid this problem, you may pin a thread to a specific core by setting its affinity. \item \emph{Bronze} -- Thread has been placed in the scheduler's run queue and is about to be resumed. \end{itemize} @@ -2765,38 +2762,38 @@ Fiber work and yield states are presented in the same way as context switch regi \subparagraph{CPU data} -This label is only available if context switch data was collected. It is split into two parts: a graph of CPU load by various threads running in the system, and a per-core thread execution display. +This label is only available if the profiler collected context switch data. 
It is split into two parts: a graph of CPU load by various threads running in the system and a per-core thread execution display. -The CPU load graph is showing how much CPU resources were used at any given time during program execution. The green part of the graph represents threads belonging to the profiled application and the gray part of the graph shows all other programs running in the system. Hovering the \faMousePointer{}~mouse pointer over the graph will display a list of threads running on the CPU at the given time. +The CPU load graph shows how much CPU resources were used at any given time during program execution. The green part of the graph represents threads belonging to the profiled application, and the gray part of the graph shows all other programs running in the system. Hovering the \faMousePointer{}~mouse pointer over the graph will display a list of threads running on the CPU at the given time. Each line in the thread execution display represents a separate logical CPU thread. If CPU topology data is available (see section~\ref{cputopology}), package and core assignment will be displayed in brackets, in addition to numerical processor identifier (i.e. \texttt{[\emph{package}:\emph{core}] CPU \emph{thread}}). When a core is busy executing a thread, a zone will be drawn at the appropriate time. Zones are colored according to the following key: \begin{itemize} \item \emph{Bright color} -- or \emph{orange} if dynamic thread colors are disabled -- Thread tracked by the profiler. -\item \emph{Dark blue} -- Thread existing in the profiled application, but not known to the profiler. This may include internal profiler threads, helper threads created by external libraries, etc. +\item \emph{Dark blue} -- Thread existing in the profiled application but not known to the profiler. This may include internal profiler threads, helper threads created by external libraries, etc. \item \emph{Gray} -- Threads assigned to other programs running in the system. 
\end{itemize} -When the \faMousePointer{}~mouse pointer is hovered over either the CPU data zone, or the thread timeline label, Tracy will display a line connecting all zones associated with the selected thread. This can be used to easily see how the thread was migrating across the CPU cores. +When the \faMousePointer{}~mouse pointer is hovered over either the CPU data zone or the thread timeline label, Tracy will display a line connecting all zones associated with the selected thread. This can be used to quickly see how the thread migrated across the CPU cores. -Careful examination of the data presented on this graph may allow you to determine areas where the profiled application was fighting for system resources with other programs (see section~\ref{checkenvironmentos}), or give you a hint to add more instrumentation macros. +Careful examination of the data presented on this graph may allow you to determine areas where the profiled application was fighting for system resources with other programs (see section~\ref{checkenvironmentos}) or give you a hint to add more instrumentation macros. \subparagraph{Locks} -Mutual exclusion zones are displayed in each thread that tries to acquire them. There are three color-coded kinds of lock event regions that may be displayed. Note that when the timeline view is zoomed out, the contention regions are always displayed over the uncontented ones. +Mutual exclusion zones are displayed in each thread that tries to acquire them. There are three color-coded kinds of lock event regions that may be displayed. Note that when the timeline view is zoomed out, the contention regions are always displayed over the uncontended ones. \begin{itemize} -\item \emph{Green region\footnote{This region type is disabled by default and needs to be enabled in options (section~\ref{options}).}} -- The lock is being held solely by one thread and no other thread tries to access it.
In case of shared locks it is possible that multiple threads hold the read lock, but no thread requires a write lock. -\item \emph{Yellow region} -- The lock is being owned by this thread and some other thread also wants to acquire the lock. -\item \emph{Red region} -- The thread wants to acquire the lock, but is blocked by other thread, or threads in case of shared lock. +\item \emph{Green region\footnote{This region type is disabled by default and needs to be enabled in options (section~\ref{options}).}} -- The lock is being held solely by one thread, and no other thread tries to access it. In the case of shared locks, multiple threads may hold the read lock, but no thread requires a write lock. +\item \emph{Yellow region} -- The lock is owned by this thread, and some other thread also wants to acquire it. +\item \emph{Red region} -- The thread wants to acquire the lock but is blocked by another thread, or by several threads in the case of a shared lock. \end{itemize} -Hovering the \faMousePointer{}~mouse pointer over a lock timeline will highlight the lock in all threads to help reading the lock behavior. Hovering the \faMousePointer{}~mouse pointer over a lock event will display important information, for example a list of threads that are currently blocking, or which are blocked by the lock. Clicking the \LMB{}~left mouse button on a lock event or a lock label will open the lock information window, as described in section~\ref{lockwindow}. Clicking the \MMB{}~middle mouse button on a lock event will zoom the view to the extent of the event. +Hovering the \faMousePointer{}~mouse pointer over a lock timeline will highlight the lock in all threads to help you read the lock behavior. Hovering the \faMousePointer{}~mouse pointer over a lock event will display important information, for example, a list of threads that are currently blocking or which are blocked by the lock.
Clicking the \LMB{}~left mouse button on a lock event or a lock label will open the lock information window, as described in section~\ref{lockwindow}. Clicking the \MMB{}~middle mouse button on a lock event will zoom the view to the extent of the event. \subparagraph{Plots} \label{plots} -The numerical data values (figure~\ref{plot}) are plotted right below the zones and locks. Note that the minimum and maximum values currently displayed on the plot are visible on the screen, along with the y range of the plot and number of drawn data points. The discrete data points are indicated with little rectangles. Multiple data points are indicated by a filled rectangle. +The numerical data values (figure~\ref{plot}) are plotted right below the zones and locks. Note that the minimum and maximum values currently displayed on the plot are visible on the screen, along with the y range of the plot and the number of drawn data points. The discrete data points are indicated with little rectangles. A filled rectangle indicates multiple data points. \begin{figure}[h] \centering\begin{tikzpicture} @@ -2814,24 +2811,24 @@ The numerical data values (figure~\ref{plot}) are plotted right below the zones \label{plot} \end{figure} -When memory profiling (section~\ref{memoryprofiling}) is enabled, Tracy will automatically generate a \emph{\faMemory{}~Memory usage} plot, which has extended capabilities. Hovering over a data point (memory allocation event) will visually display duration of the allocation. Clicking the \LMB{} left mouse button on the data point will open the memory allocation information window, which will display the duration of the allocation as long as the window is open. +When memory profiling (section~\ref{memoryprofiling}) is enabled, Tracy will automatically generate a \emph{\faMemory{}~Memory usage} plot, which has extended capabilities. For example, hovering over a data point (memory allocation event) will visually display the allocation duration. 
Clicking the \LMB{} left mouse button on the data point will open the memory allocation information window, which will show the duration of the allocation as long as the window is open. -Another plot that is automatically provided by Tracy is the \emph{\faTachometer*{}~CPU usage} plot, which represents the total system CPU usage percentage (it is not limited to the profiled application). +Another plot that Tracy automatically provides is the \emph{\faTachometer*{}~CPU usage} plot, which represents the total system CPU usage percentage (it is not limited to the profiled application). \subsubsection{Navigating the view} -Hovering the \faMousePointer{} mouse pointer over the timeline view will display a vertical line that can be used to visually line-up events in multiple threads. Dragging the \LMB{} left mouse button will display time measurement of the selected region. +Hovering the \faMousePointer{} mouse pointer over the timeline view will display a vertical line that you can use to line up events in multiple threads visually. Dragging the \LMB{} left mouse button will display the time measurement of the selected region. -The timeline view may be scrolled both vertically and horizontally by dragging the \RMB{} right mouse button. Note that only the zones, locks and plots scroll vertically, while the time scale and frame sets always stay on the top. +The timeline view may be scrolled both vertically and horizontally by dragging the \RMB{} right mouse button. Note that only the zones, locks, and plots scroll vertically, while the time scale and frame sets always stay on the top. -You can zoom in and out the timeline view by using the \Scroll{}~mouse wheel. Pressing the \keys{\ctrl} key will make zooming more precise, while pressing the \keys{\shift} key will make it faster. You can select a range to which you want to zoom-in by dragging the \MMB{} middle mouse button. Dragging the \MMB{} middle mouse button while the \keys{\ctrl} key is pressed will zoom-out. 
+You can zoom in and out the timeline view by using the \Scroll{}~mouse wheel. Pressing the \keys{\ctrl} key will make zooming more precise while pressing the \keys{\shift} key will make it faster. You can select a range to which you want to zoom in by dragging the \MMB{} middle mouse button. Dragging the \MMB{} middle mouse button while the \keys{\ctrl} key is pressed will zoom out. \subsection{Time ranges} \label{timeranges} -Sometimes you may want to specify a time range, for example to limit some statistics to a specific part of your program execution, or to mark interesting places. +Sometimes, you may want to specify a time range, such as limiting some statistics to a specific part of your program execution or marking interesting places. -To define a time range, drag the \LMB{}~left mouse button over the timeline view, while holding the \keys{\ctrl} key. When the mouse key is released, the selected time extent will be marked with a blue striped pattern and a context menu will be displayed with the following options: +To define a time range, drag the \LMB{}~left mouse button over the timeline view while holding the \keys{\ctrl} key. When the mouse key is released, the profiler will mark the selected time extent with a blue striped pattern, and it will display a context menu with the following options: \begin{itemize} \item \emph{\faSearch{}~Limit find zone time range} -- this will limit find zone results. See chapter~\ref{findzone} for more details. @@ -2843,18 +2840,18 @@ To define a time range, drag the \LMB{}~left mouse button over the timeline view Alternatively, you may specify the time range by clicking the \RMB{}~right mouse button on a zone or a frame. The resulting time extent will match the selected item. -In order to reduce clutter, time range regions are only displayed if the windows they affect are open, or if the time range limits control window is open (section~\ref{timerangelimits}). 
Time range limits window can be accessed through the \emph{\faTools{} Tools} button on the control menu. +To reduce clutter, time range regions are only displayed if the windows they affect are open or if the time range limits control window is open (section~\ref{timerangelimits}). You can access the time range limits window through the \emph{\faTools{} Tools} button on the control menu. -Each time range can be freely adjusted on the timeline by clicking the \LMB{}~left mouse button on the range's edge and dragging the mouse. +You can freely adjust each time range on the timeline by clicking the \LMB{}~left mouse button on the range's edge and dragging the mouse. \subsubsection{Annotating the trace} \label{annotatingtrace} -Tracy allows adding custom notes to the trace. For example, you may want to mark a region to ignore, because the application was out-of-focus, or a region where a new user was connecting to the game, which resulted in a frame drop that needs to be investigated. +Tracy allows adding custom notes to the trace. For example, you may want to mark a region to ignore because the application was out-of-focus or a region where a new user was connecting to the game, which resulted in a frame drop that needs to be investigated. -Methods of specifying the annotation region are described in section~\ref{timeranges}. When a new annotation is added a settings window is displayed (section~\ref{annotationsettings}), allowing you to enter a description. +Methods of specifying the annotation region are described in section~\ref{timeranges}. When a new annotation is added, a settings window is displayed (section~\ref{annotationsettings}), allowing you to enter a description. -Annotations are displayed on the timeline, as presented on figure~\ref{annotation}. Clicking on the circle next to the text description will open the annotation settings window, in which you can modify or remove the region. 
List of all annotations in the trace is available in the annotations list window described in section~\ref{annotationlist}, which is accessible through the \emph{\faTools{} Tools} button on the control menu. +Annotations are displayed on the timeline, as presented in figure~\ref{annotation}. Clicking on the circle next to the text description will open the annotation settings window, in which you can modify or remove the region. A list of all annotations in the trace is available in the annotations list window described in section~\ref{annotationlist}, which is accessible through the \emph{\faTools{} Tools} button on the control menu. \begin{figure}[h] \centering\begin{tikzpicture} @@ -2868,12 +2865,12 @@ Annotations are displayed on the timeline, as presented on figure~\ref{annotatio \label{annotation} \end{figure} -Please note that while the annotations persist between profiling sessions, they are not saved in the trace, but in the user data files, as described in section~\ref{tracespecific}. +Please note that while the annotations persist between profiling sessions, they are not saved in the trace but in the user data files, as described in section~\ref{tracespecific}. \subsection{Options menu} \label{options} -In this window you can set various trace-related options. The timeline view might sometimes become overcrowded, in which case disabling display of some profiling events can increase readability. +In this window, you can set various trace-related options. For example, the timeline view might sometimes become overcrowded, in which case disabling the display of some profiling events can increase readability. \begin{itemize} \item \emph{\faExpand{} Draw empty labels} -- By default threads that don't have anything to display at the current zoom level are hidden. Enabling this option will show them anyway.
@@ -2904,82 +2901,82 @@ Enabling the \emph{Ignore custom} option will force usage of the selected zone c \item \emph{None} -- Namespaces are completely omitted (e.g.\ \texttt{sort}). \end{itemize} \end{itemize} -\item \emph{\faLock{} Draw locks} -- Controls the display of locks. If the \emph{Only contended} option is selected, the non-blocking regions of locks won't be displayed (see section~\ref{zoneslocksplots}). The \emph{Locks} drop-down allows disabling display of locks on a per-lock basis. As a convenience, the list of locks is split into the single-threaded and multi-threaded (contended and uncontended) categories. Clicking the \RMB{}~right mouse button on a lock label opens the lock information window (section~\ref{lockwindow}). +\item \emph{\faLock{} Draw locks} -- Controls the display of locks. If the \emph{Only contended} option is selected, the profiler won't display the non-blocking regions of locks (see section~\ref{zoneslocksplots}). The \emph{Locks} drop-down allows disabling the display of locks on a per-lock basis. As a convenience, the list of locks is split into the single-threaded and multi-threaded (contended and uncontended) categories. Clicking the \RMB{}~right mouse button on a lock label opens the lock information window (section~\ref{lockwindow}). \item \emph{\faSignature{} Draw plots} -- Allows disabling display of plots. Individual plots can be disabled in the \emph{Plots} drop-down. -\item \emph{\faRandom{} Visible threads} -- Here you can select which threads are visible on the timeline. Display order of threads can be changed by dragging thread labels. -\item \emph{\faImages{} Visible frame sets} -- Frame set display can be enabled or disabled here. Note that disabled frame sets are still available for selection in the frame set selection drop-down (section~\ref{controlmenu}), but are marked with a dimmed font. +\item \emph{\faRandom{} Visible threads} -- Here you can select which threads are visible on the timeline. 
You can change the display order of threads by dragging thread labels. +\item \emph{\faImages{} Visible frame sets} -- Frame set display can be enabled or disabled here. Note that disabled frame sets are still available for selection in the frame set selection drop-down (section~\ref{controlmenu}) but are marked with a dimmed font. \end{itemize} -Disabling display of some events is especially recommended when the profiler performance drops below acceptable levels for interactive usage. +Disabling the display of some events is especially recommended when the profiler performance drops below acceptable levels for interactive usage. \subsection{Messages window} \label{messages} -In this window you can see all the messages that were sent by the client application, as described in section~\ref{messagelog}. The window is split into four columns: \emph{time}, \emph{thread}, \emph{message} and \emph{call stack}. Hovering the \faMousePointer{}~mouse cursor over a message will highlight it on the timeline view. Clicking the \LMB{} left mouse button on a message will center the timeline view on the selected message. +In this window, you can see all the messages that were sent by the client application, as described in section~\ref{messagelog}. The window is split into four columns: \emph{time}, \emph{thread}, \emph{message} and \emph{call stack}. Hovering the \faMousePointer{}~mouse cursor over a message will highlight it on the timeline view. Clicking the \LMB{} left mouse button on a message will center the timeline view on the selected message. -The \emph{call stack} column is filled only if a call stack capture was requested, as described in section~\ref{collectingcallstacks}. A single entry consists of the \emph{\faAlignJustify{}~Show} button, which opens the call stack information window (chapter~\ref{callstackwindow}) and of an abbreviated information about call path. 
+The \emph{call stack} column is filled only if a call stack capture was requested, as described in section~\ref{collectingcallstacks}. A single entry consists of the \emph{\faAlignJustify{}~Show} button, which opens the call stack information window (chapter~\ref{callstackwindow}), and abbreviated information about the call path. -If the \emph{\faImage{}~Show frame images} option is selected, hovering the \faMousePointer{}~mouse cursor over a message will show a tooltip containing frame image (see section~\ref{frameimages}) associated with frame in which the message was issued, if available. +If the \emph{\faImage{}~Show frame images} option is selected, hovering the \faMousePointer{}~mouse cursor over a message will show a tooltip containing the frame image (see section~\ref{frameimages}) associated with the frame in which the message was issued, if available. -In a live capture, the message list will automatically scroll down to display the most recent message. This behavior can be disabled by manually scrolling the message list up. When the view is scrolled down to display the last message, the auto-scrolling feature will be enabled again. +The message list will automatically scroll down to display the most recent message during live capture. You can disable this behavior by manually scrolling the message list up. The auto-scrolling feature will be enabled again when the view is scrolled down to display the last message. -The message list can be filtered in the following ways: +You can filter the message list in the following ways: \begin{itemize} \item By the originating thread in the \emph{\faRandom{} Visible threads} drop-down. -\item By matching the message text to the expression in the \emph{\faFilter{}~Filter messages} entry field. Multiple filter expressions can be comma-separated (e.g. 'warn, info' will match messages containing strings 'warn' \emph{or} 'info'). Matches can be excluded by preceding the term with a minus character (e.g.
'-debug' will hide all messages containing string 'debug'). +\item By matching the message text to the expression in the \emph{\faFilter{}~Filter messages} entry field. Multiple filter expressions can be comma-separated (e.g. 'warn, info' will match messages containing strings 'warn' \emph{or} 'info'). You can exclude matches by preceding the term with a minus character (e.g., '-debug' will hide all messages containing the string 'debug'). \end{itemize} \subsection{Statistics window} \label{statistics} -Looking at the timeline view gives you a very localized outlook on things. Sometimes you want to take a look at the general overview of the program's behavior, for example you want to know which function takes the most of application's execution time. The statistics window provides you exactly that information. +Looking at the timeline view gives you a very localized outlook on things. However, sometimes you want to look at the general overview of the program's behavior. For example, you want to know which function takes the most of the application's execution time. The statistics window provides you with exactly that information. -If the trace capture was performed with call stack sampling enabled (as described in chapter~\ref{sampling}), you will be presented with an option to switch between \emph{\faSyringe{}~Instrumentation} and \emph{\faEyeDropper{}~Sampling} modes. If no sampling data was collected, but symbols were retrieved, the second mode will be displayed as \emph{\faPuzzlePiece{}~Symbols}, enabling you to list available symbols. Otherwise only the instrumentation view will be present. +If the trace capture was performed with call stack sampling enabled (as described in chapter~\ref{sampling}), you will be presented with an option to switch between \emph{\faSyringe{}~Instrumentation} and \emph{\faEyeDropper{}~Sampling} modes. 
If the profiler collected no sampling data, but it retrieved symbols, the second mode will be displayed as \emph{\faPuzzlePiece{}~Symbols}, enabling you to list available symbols. Otherwise, only the instrumentation view will be present. \subsubsection{Instrumentation mode} -Here you will find a multi-column display of captured zones, which contains: the zone \emph{name} and \emph{location}, \emph{total time} spent in the zone, the \emph{count} of zone executions and the \emph{mean time spent in the zone per call}. The view may be sorted according to the three displayed values. +Here you will find a multi-column display of captured zones, which contains: the zone \emph{name} and \emph{location}, \emph{total time} spent in the zone, the \emph{count} of zone executions and the \emph{mean time spent in the zone per call}. You may sort the view according to the three displayed values. -In the \emph{~Timing} menu, the \emph{~With children} selection displays inclusive measurements, that is, containing execution time of zone's children. The \emph{~Self only} selection switches the measurement to exclusive, displaying just the time spent in zone, subtracting the child calls. The \emph{~Non-reentrant} selection displays inclusive time, but counting only the first appearance of a given zone on a thread's stack. +In the \emph{~Timing} menu, the \emph{~With children} selection displays inclusive measurements, that is, containing execution time of zone's children. The \emph{~Self only} selection switches the measurement to exclusive, displaying just the time spent in the zone, subtracting the child calls. Finally, the \emph{~Non-reentrant} selection shows inclusive time but counts only the first appearance of a given zone on a thread's stack. Clicking the \LMB{} left mouse button on a zone will open the individual zone statistics view in the find zone window (section~\ref{findzone}). 
You can filter the displayed list of zones by matching the zone name to the expression in the \emph{\faFilter{}~Filter zones} entry field. Refer to section~\ref{messages} for a more detailed description of the expression syntax. -To limit the statistics to a specific time extent, you may enable the \emph{Limit range} option (chapter~\ref{timeranges}). The inclusion region will be marked with a red striped pattern. Note that a zone must be fully inside the region to be counted. More options can be accessed through the \emph{\faRuler{}~Limits} button, which will open the time range limits window, described in section~\ref{timerangelimits}. +To limit the statistics to a specific time extent, you may enable the \emph{Limit range} option (chapter~\ref{timeranges}). The inclusion region will be marked with a red striped pattern. Note that a zone must be entirely inside the region to be counted. You can access more options through the \emph{\faRuler{}~Limits} button, which will open the time range limits window, described in section~\ref{timerangelimits}. \subsubsection{Sampling mode} \label{statisticssampling} -Data displayed in this mode is in essence very similar to the instrumentation one. Here you will find function names, their locations in source code and time measurements. There are, however, some very important differences. +Data displayed in this mode is, in essence, very similar to the instrumentation one. Here you will find function names, their locations in source code, and time measurements. There are, however, some significant differences. -First and foremost, the presented information is constructed from a number of call stack samples, which represent real addresses in the application's binary code, mapped to the line numbers in the source files. This reverse mapping may not be always possible, or may be erroneous. Furthermore, due to the nature of the sampling process, it is impossible to obtain exact time measurement. 
Instead, time values are guesstimated by multiplying number of sample counts by mean time between two distinct samples. +First and foremost, the presented information is constructed from many call stack samples, which represent real addresses in the application's binary code, mapped to the line numbers in the source files. This reverse mapping may not always be possible or could be erroneous. Furthermore, due to the nature of the sampling process, it is impossible to obtain exact time measurements. Instead, time values are guesstimated by multiplying the number of samples by the mean time between two different samples. -The \emph{Name} column contains name of the function in which the sampling was done. Kernel-mode function samples are distinguished with the red color. If the \emph{\faSitemap{}~Inlines} option is enabled, functions which were inlined will be preceded with a '\faCaretRight{}' symbol and additionally display their parent function name in parenthesis. Otherwise, only non-inlined functions are listed, with count of inlined functions in parenthesis. Any entry containing inlined function may be expanded to display the corresponding functions list (some functions may be hidden if the \emph{\faPuzzlePiece{}~Show all} option is disabled, due to lack of sampling data). Clicking on a function name will open the sample entry call stacks window (see chapter~\ref{sampleparents}). Note that if inclusive times are displayed, listed functions will be partially or completely coming from mid-stack frames, which will prevent, or limit the capability to display parent call stacks. +The \emph{Name} column contains the name of the function in which the sampling was done. Kernel-mode function samples are distinguished with the red color. If the \emph{\faSitemap{}~Inlines} option is enabled, functions which were inlined will be preceded with a '\faCaretRight{}' symbol and additionally display their parent function name in parentheses.
Otherwise, only non-inlined functions are listed, with a count of inlined functions in parentheses. You may expand any entry containing an inlined function to display the corresponding functions list (some functions may be hidden if the \emph{\faPuzzlePiece{}~Show all} option is disabled due to lack of sampling data). Clicking on a function name will open the sample entry call stacks window (see chapter~\ref{sampleparents}). Note that if inclusive times are displayed, listed functions will be partially or completely coming from mid-stack frames, preventing or limiting the capability to display parent call stacks. -The \emph{Location} column displays the corresponding source file name and line number. Depending on the \emph{Location} option selection it can either show function entry address, or the instruction at which the sampling was performed. The \emph{Entry} mode points at the beginning of a non-inlined function, or at the place where inlined function was inserted in its parent function. The \emph{Sample} mode is not useful for non-inlined functions, as it points to one randomly selected sampling point out of many that were captured. However, in case of inlined functions, this random sampling point is within the inlined function body. Using these options in tandem enable you to look at both the inlined function code and the place where it was inserted. If the \emph{Smart} location is selected, profiler will display entry point position for non-inlined functions and sample location for inlined functions. Selecting the \emph{\faAt{}~Address} option will instead print the symbol address. +The \emph{Location} column displays the corresponding source file name and line number. Depending on the \emph{Location} option selection, it can either show the function entry address or the instruction at which the sampling was performed.
The \emph{Entry} mode points at the beginning of a non-inlined function or at the place where the compiler inserted an inlined function in its parent function. The \emph{Sample} mode is not useful for non-inlined functions, as it points to one randomly selected sampling point out of many that were captured. However, in the case of inlined functions, this random sampling point is within the inlined function body. Using these options in tandem lets you look at both the inlined function code and the place where it was inserted. If the \emph{Smart} location is selected, the profiler will display the entry point position for non-inlined functions and sample location for inlined functions. Selecting the \emph{\faAt{}~Address} option will instead print the symbol address. The location data is complemented by the originating executable image name, contained in the \emph{Image} column. -Some function locations may not be found, due to insufficient debugging data available on the client side. To filter out such entries, use the \emph{\faEyeSlash{}~Hide unknown} option. +The profiler may not find some function locations due to insufficient debugging data available on the client side. To filter out such entries, use the \emph{\faEyeSlash{}~Hide unknown} option. -The \emph{Time} or \emph{Count} column (depending on the \emph{\faStopwatch{}~Show time} option selection) shows number of taken samples, either as a raw count, or in an easier to understand time format. Note that the percentage value of time is calculated relative to the wall-clock time, and the percentage value of sample counts is relative to total number of collected samples. +The \emph{Time} or \emph{Count} column (depending on the \emph{\faStopwatch{}~Show time} option selection) shows the number of samples taken, either as a raw count or in an easier-to-understand time format. Note that the percentage value of time is calculated relative to the wall-clock time.
The percentage value of sample counts is relative to the total number of collected samples. -The last column, \emph{Code size}, displays the size of symbol in the executable image of the program. Since inlined routines are directly embedded into other functions, their symbol size will be based on the parent symbol, and displayed as 'less than'. In some cases this data won't be available. If the symbol code has been retrieved\footnote{Symbols larger than 64~KB are not captured.}, symbol size will be prepend with the \texttt{\faDatabase}~icon, and clicking the \RMB{}~right mouse button on the location column entry will open symbol view window (section~\ref{symbolview}). +The last column, \emph{Code size}, displays the size of the symbol in the executable image of the program. Since inlined routines are directly embedded into other functions, their symbol size will be based on the parent symbol and displayed as 'less than'. In some cases, this data won't be available. If the symbol code has been retrieved\footnote{Symbols larger than 64~KB are not captured.}, the symbol size will be prepended with the \texttt{\faDatabase}~icon, and clicking the \RMB{}~right mouse button on the location column entry will open the symbol view window (section~\ref{symbolview}). -Finally, the list can be filtered using the \emph{\faFilter{}~Filter symbols} entry field, just like in the instrumentation mode case. Additionally, you can also filter results by the originating image name of the symbol. Display of kernel symbols may be disabled with the \emph{\faHatWizard{}~Include kernel} switch. The exclusive/inclusive time counting mode can be switched using the \emph{~Timing} menu (non-reentrant timing is not available in the Sampling view). Limiting the time range is also available, but is restricted to self time.
If the \emph{\faPuzzlePiece{}~Show all} option is selected, the list will include not only call stack samples, but also all other symbols collected during the profiling process (this is enabled by default, if no sampling was performed). +Finally, the list can be filtered using the \emph{\faFilter{}~Filter symbols} entry field, just like in the instrumentation mode case. Additionally, you can also filter results by the originating image name of the symbol. You may disable the display of kernel symbols with the \emph{\faHatWizard{}~Include kernel} switch. The exclusive/inclusive time counting mode can be switched using the \emph{~Timing} menu (non-reentrant timing is not available in the Sampling view). Limiting the time range is also available but is restricted to self-time. If the \emph{\faPuzzlePiece{}~Show all} option is selected, the list will include not only the call stack samples but also all other symbols collected during the profiling process (this is enabled by default if no sampling was performed). \subsection{Find zone window} \label{findzone} -The individual behavior of zones may be influenced by many factors, like CPU cache effects, access times amortized by the disk cache, thread context switching, etc. Sometimes the execution time depends on the internal data structures and their response to different inputs. In other words, it is hard to determine the true performance characteristics by looking at any single zone. +The individual behavior of zones may be influenced by many factors, like CPU cache effects, access times amortized by the disk cache, thread context switching, etc. Moreover, sometimes the execution time depends on the internal data structures and their response to different inputs. In other words, it is hard to determine the actual performance characteristics by looking at any single zone. -Tracy gives you the ability to display an execution time histogram of all occurrences of a zone. 
On this view you can see how the function behaves in general. You can inspect how various data inputs influence the execution time and you can filter the data to eventually drill down to the individual zone calls, so that you can see the environment in which they were called. +Tracy gives you the ability to display an execution time histogram of all occurrences of a zone. On this view, you can see how the function behaves in general. You can inspect how various data inputs influence the execution time. You can filter the data to eventually drill down to the individual zone calls to see the environment in which they were called. -You start by entering a search query, which will be matched against known zone names (see section~\ref{markingzones} for information on the grouping of zone names). If the search found some results, you will be presented with a list of zones in the \emph{matched source locations} drop-down. The selected zone's graph is displayed on the \emph{histogram} drop-down and also the matching zones are highlighted on the timeline view. Clicking the \RMB{} right mouse button on the source file location will open the source file view window (if applicable, see section~\ref{sourceview}). +You start by entering a search query, which will be matched against known zone names (see section~\ref{markingzones} for information on the grouping of zone names). If the search found some results, you will be presented with a list of zones in the \emph{matched source locations} drop-down. The selected zone's graph is displayed on the \emph{histogram} drop-down, and also the matching zones are highlighted on the timeline view. Clicking the \RMB{} right mouse button on the source file location will open the source file view window (if applicable, see section~\ref{sourceview}). -An example histogram is presented on figure~\ref{findzonehistogram}. 
Here you can see that the majority of zone calls (by count) are clustered in the 300~\si{\nano\second} group, closely followed by the 10~\si{\micro\second} cluster. There are some outliers at the 1~and~10~\si{\milli\second} marks, which can be ignored on most occasions, as these are single occurrences. +An example histogram is presented in figure~\ref{findzonehistogram}. Here you can see that the majority of zone calls (by count) are clustered in the 300~\si{\nano\second} group, closely followed by the 10~\si{\micro\second} cluster. There are some outliers at the 1~and~10~\si{\milli\second} marks, which can be ignored on most occasions, as these are single occurrences. \begin{figure}[h] \centering\begin{tikzpicture} @@ -3022,18 +3019,18 @@ An example histogram is presented on figure~\ref{findzonehistogram}. Here you ca \label{findzonehistogram} \end{figure} -The histogram is accompanied by various data statistics about displayed data, for example the \emph{total time} of the displayed samples, or the \emph{maximum number of counts} in histogram bins. The following options control how the data is presented: +Various data statistics about displayed data accompany the histogram, for example, the \emph{total time} of the displayed samples or the \emph{maximum number of counts} in histogram bins. The following options control how the data is presented: \begin{itemize} \item \emph{Log values} -- Switches between linear and logarithmic scale on the y~axis of the graph, representing the call counts\footnote{Or time, if the \emph{cumulate time} option is enabled.}. \item \emph{Log time} -- Switches between linear and logarithmic scale on the x~axis of the graph, representing the time bins. -\item \emph{Cumulate time} -- Changes how the histogram bin values are calculated. By default the vertical bars on the graph represent the \emph{call counts} of zones that fit in the given time bin. If this option is enabled, the bars represent the \emph{time spent} in the zones. 
For example, on graph presented on figure~\ref{findzonehistogram} the 10~\si{\micro\second} cluster is the dominating one, if we look at the time spent in zone, even if the 300~\si{\nano\second} cluster has greater number of call counts. -\item \emph{Self time} -- Removes children time from the analysed zones, which results in displaying only the time spent in the zone itself (or in non-instrumented function calls). Cannot be selected when \emph{Running time} is active. -\item \emph{Running time} -- Removes time when zone's thread execution was suspended by the operating system due to preemption by other threads, waiting for system resources, lock contention, etc. Available only when context switch capture was performed (section~\ref{contextswitches}). Cannot be selected when \emph{Self time} is active. -\item \emph{Minimum values in bin} -- Excludes display of bins which do not hold enough values at both ends of the time range. Increasing this parameter will eliminate outliers, allowing to concentrate on the interesting part of the graph. +\item \emph{Cumulate time} -- Changes how the histogram bin values are calculated. By default, the vertical bars on the graph represent the \emph{call counts} of zones that fit in the given time bin. If this option is enabled, the bars represent the \emph{time spent} in the zones. For example, on the graph presented in figure~\ref{findzonehistogram} the 10~\si{\micro\second} cluster is the dominating one, if we look at the time spent in the zone, even if the 300~\si{\nano\second} cluster has a greater number of call counts. +\item \emph{Self time} -- Removes children time from the analyzed zones, which results in displaying only the time spent in the zone itself (or in non-instrumented function calls). It cannot be selected when \emph{Running time} is active. 
+\item \emph{Running time} -- Removes time when zone's thread execution was suspended by the operating system due to preemption by other threads, waiting for system resources, lock contention, etc. Available only when the profiler performed context switch capture (section~\ref{contextswitches}). It cannot be selected when \emph{Self time} is active. +\item \emph{Minimum values in bin} -- Excludes display of bins that do not hold enough values at both ends of the time range. Increasing this parameter will eliminate outliers, allowing us to concentrate on the interesting part of the graph. \end{itemize} -You can drag the \LMB{} left mouse button over the histogram to select a time range that you want to closely look at. This will display the data in the histogram info section and it will also filter zones displayed in the \emph{found zones} section. This is quite useful, if you want to actually look at the outliers, i.e.\ where did they originate from, what the program was doing at the moment, etc\footnote{More often than not you will find out, that the application was just starting, or an access to a cold file was required and there's not much you can do to optimize that particular case.}. You can reset the selection range by pressing the \RMB{} right mouse button on the histogram. +You can drag the \LMB{} left mouse button over the histogram to select a time range that you want to look at closely. This will display the data in the histogram info section, and it will also filter zones shown in the \emph{found zones} section. This is quite useful if you actually want to look at the outliers, i.e.,\ where did they originate from, what the program was doing at the moment, etc\footnote{More often than not you will find out, that the application was just starting, or access to a cold file was required and there's not much you can do to optimize that particular case.}. You can reset the selection range by pressing the \RMB{} right mouse button on the histogram. 
The \emph{found zones} section displays the individual zones grouped according to the following criteria: @@ -3041,14 +3038,14 @@ The \emph{found zones} section displays the individual zones grouped according t \item \emph{Thread} -- In this mode you can see which threads were executing the zone. \item \emph{User text} -- Splits the zones according to the custom user text (see section~\ref{markingzones}). \item \emph{Zone name} -- Groups zones by the name set on a per-call basis (see section~\ref{markingzones}). -\item \emph{Call stacks} -- Zones are grouped by the originating call stack (see section~\ref{collectingcallstacks}). Note that two call stacks may sometimes appear identical, even if they are not, due to an easy to overlook difference in the source line numbers. -\item \emph{Parent} -- Groups zones according to the parent zone. This mode relies on the zone hierarchy, and \emph{not} on the call stack information. -\item \emph{No grouping} -- Disables zone grouping. May be useful in cases when you just want to see zones in order as they appeared. +\item \emph{Call stacks} -- Zones are grouped by the originating call stack (see section~\ref{collectingcallstacks}). Note that two call stacks may sometimes appear identical, even if they are not, due to an easily overlooked difference in the source line numbers. +\item \emph{Parent} -- Groups zones according to the parent zone. This mode relies on the zone hierarchy and \emph{not} on the call stack information. +\item \emph{No grouping} -- Disables zone grouping. It may be useful when you want to see zones in order as they appear. \end{itemize} -Each group may be sorted according to the \emph{order} in which it appeared, the call \emph{count}, the total \emph{time} spent in the group, or the \emph{mean time per call}. Expanding the group view will display individual occurrences of the zone, which can be sorted by application's time, execution time or zone's name. 
Clicking the \LMB{} left mouse button on a zone will open the zone information window (section~\ref{zoneinfo}). Clicking the \MMB{} middle mouse button on a zone will zoom the timeline view to the zone's extent. +You may sort each group according to the \emph{order} in which it appeared, the call \emph{count}, the total \emph{time} spent in the group, or the \emph{mean time per call}. Expanding the group view will display individual occurrences of the zone, which can be sorted by application's time, execution time, or zone's name. Clicking the \LMB{} left mouse button on a zone will open the zone information window (section~\ref{zoneinfo}). Clicking the \MMB{} middle mouse button on a zone will zoom the timeline view to the zone's extent. -Clicking the \LMB{} left mouse button on group name will highlight the group time data on the histogram (figure~\ref{findzonehistogramgroup}). This function provides a quick insight about the impact of the originating thread, or input data on the zone performance. Clicking on the \emph{\faBackspace~Clear} button will reset the group selection. +Clicking the \LMB{} left mouse button on the group name will highlight the group time data on the histogram (figure~\ref{findzonehistogramgroup}). This function provides a quick insight into the impact of the originating thread or input data on the zone performance. Clicking on the \emph{\faBackspace~Clear} button will reset the group selection. \begin{figure}[h] \centering\begin{tikzpicture} @@ -3082,11 +3079,11 @@ Clicking the \LMB{} left mouse button on group name will highlight the group tim \label{findzonehistogramgroup} \end{figure} -The call stack grouping mode has a different way of listing groups. Here only one group is displayed at any time, due to need to display the call stack frames. You can switch between call stack groups by using the~\faCaretLeft{}~and~\faCaretRight{} buttons. The group can be selected by clicking on the~\emph{\faCheck{}~Select} button. 
You can open the call stack window (section~\ref{callstackwindow}) by pressing the~\emph{\faAlignJustify{}~Call~stack} button. +The call stack grouping mode has a different way of listing groups. Here only one group is displayed at any time due to the need to display the call stack frames. You can switch between call stack groups by using the~\faCaretLeft{}~and~\faCaretRight{} buttons. You can select the group by clicking on the~\emph{\faCheck{}~Select} button. You can open the call stack window (section~\ref{callstackwindow}) by pressing the~\emph{\faAlignJustify{}~Call~stack} button. -Tracy displays a variety of statistical values regarding the selected function: mean (average value), median (middle value), mode (most common value, quantized using histogram bins), and \textsigma{} (standard deviation). The mean and median zone times are also displayed on the histogram as a red (mean) and blue (median) vertical bars. When a group is selected, additional bars will indicate the mean group time (orange) and median group time (green). You can disable drawing of either set of markers by clicking on the check-box next to the color legend. +Tracy displays a variety of statistical values regarding the selected function: mean (average value), median (middle value), mode (most common value, quantized using histogram bins), and \textsigma{} (standard deviation). The mean and median zone times are also displayed on the histogram as red (mean) and blue (median) vertical bars. When a group is selected, additional bars will indicate the mean group time (orange) and median group time (green). You can disable the drawing of either set of markers by clicking on the check-box next to the color legend. -Hovering the \faMousePointer{}~mouse cursor over a zone on the timeline, which is currently selected in the find zone window, will display a pulsing vertical bar on the histogram, highlighting the bin to which the hovered zone has been assigned. Zone entry on the zone list will also be highlighted.
+Hovering the \faMousePointer{}~mouse cursor over a zone on the timeline, which is currently selected in the find zone window, will display a pulsing vertical bar on the histogram, highlighting the bin to which the hovered zone has been assigned. The zone entry on the zone list will also be highlighted. \begin{bclogo}[ noborder=true, @@ -3101,50 +3098,50 @@ noborder=true, couleur=black!5, logo=\bcattention ]{Caveats} -When using the execution times histogram you must be aware about the hardware peculiarities. Read section~\ref{checkenvironmentcpu} for more detail. +When using the execution times histogram, you must be aware of the hardware peculiarities. Read section~\ref{checkenvironmentcpu} for more detail. \end{bclogo} \subsubsection{Timeline interaction} -When the zone statistics are displayed in the find zone menu, matching zones will be highlighted on the timeline display. Highlight colors match the histogram display. Bright blue highlight is used to indicate that a zone is in the optional selection range, while the yellow highlight is used for the rest of zones. +The profiler will highlight matching zones on the timeline display when the zone statistics are displayed in the find zone menu. Highlight colors match the histogram display. A bright blue highlight indicates that a zone is in the optional selection range, while the yellow highlight is used for the rest of the zones. \subsubsection{Frame time graph interaction} \label{frametimefindzone} -The frame time graph (section~\ref{frametimegraph}) behavior is altered when a zone is displayed in the find zone window and the \emph{Show zone time in frames} option is selected. Instead of coloring the frame bars according to the frame time targets, an accumulated zone execution time is shown. +The frame time graph (section~\ref{frametimegraph}) behavior is altered when a zone is displayed in the find zone window and the \emph{Show zone time in frames} option is selected.
An accumulated zone execution time is shown instead of coloring the frame bars according to the frame time targets. Each bar is drawn in gray color, with the white part accounting for the zone time. If the execution time is greater than the frame time (this is possible if more than one thread was executing the same zone), the overflow will be displayed using red color. -Enabling \emph{Self time} option has an effect on the displayed values, but \emph{Running time} has not. +Enabling \emph{Self time} option affects the displayed values, but \emph{Running time} does not. \begin{bclogo}[ noborder=true, couleur=black!5, logo=\bcattention ]{Caveats} -The displayed data might not be calculated correctly and some zones may not be included in the reported times. +The profiler might not calculate the displayed data correctly, and it may not include some zones in the reported times. \end{bclogo} \subsubsection{Limiting zone time range} -If the \emph{Limit range} option is selected, only the zones within the specified time range (chapter~\ref{timeranges}) will be included in the data. The inclusion region will be marked with a green striped pattern. Note that a zone must be fully inside the region to be counted. More options can be accessed through the \emph{\faRuler{}~Limits} button, which will open the time range limits window, described in section~\ref{timerangelimits}. +If the \emph{Limit range} option is selected, the profiler will include only the zones within the specified time range (chapter~\ref{timeranges}) in the data. The inclusion region will be marked with a green striped pattern. Note that a zone must be entirely inside the region to be counted. You can access more options through the \emph{\faRuler{}~Limits} button, which will open the time range limits window, described in section~\ref{timerangelimits}. 
\subsubsection{Zone samples} -If sampling data has been captured (see section~\ref{sampling}), an additional expandable \emph{\faEyeDropper{}~Samples} section will be displayed. This section contains only the sample data that can be attributed to the displayed zone. Looking at this list may give you additional insight into what is happening within the zone. Refer to section~\ref{statisticssampling} for more information about this view. +If sampling data has been captured (see section~\ref{sampling}), an additional expandable \emph{\faEyeDropper{}~Samples} section will be displayed. This section contains only the sample data attributed to the displayed zone. Looking at this list may give you additional insight into what is happening within the zone. Refer to section~\ref{statisticssampling} for more information about this view. -The list of samples can be further narrowed down by selecting a time range on the histogram and/or by selecting a group in the \emph{Found zones} section. Do note that the random nature of sampling makes it highly unlikely that short-lived zones (i.e. left part of the histogram) will have any sample data collected. +You can further narrow down the list of samples by selecting a time range on the histogram or by choosing a group in the \emph{Found zones} section. However, do note that the random nature of sampling makes it highly unlikely that short-lived zones (i.e., left part of the histogram) will have any sample data collected. \subsection{Compare traces window} \label{compare} -Comparing the performance impact of the optimization work is not an easy thing to do. Benchmarking is often inconclusive, if even possible, in case of interactive applications, where the benchmarked function might not have a visible impact on frame render time. Doing isolated micro-benchmarks loses the execution environment of the application, in which many different functions compete for limited system resources. 
+Comparing the performance impact of the optimization work is not an easy thing to do. Benchmarking is often inconclusive, if even possible, in the case of interactive applications, where the benchmarked function might not have a visible impact on frame render time. Furthermore, doing isolated micro-benchmarks loses the application's execution environment, in which many different parts compete for limited system resources. -Tracy solves this problem by providing a compare traces functionality, very similar to the find zone window, described in section~\ref{findzone}. Traces can be compared either by zone or frame timing data. +Tracy solves this problem by providing a compare traces functionality, very similar to the find zone window, described in section~\ref{findzone}. You can compare traces either by zone or frame timing data. -You would begin your work by recording a reference trace that represents the usual behavior of the program. Then, after the optimization of the code is completed, you record another trace, doing roughly what you did for the reference one. Having the optimized trace open you select the \emph{\faFolderOpen{}~Open second trace} option in the compare traces window and load the reference trace. +You would begin your work by recording a reference trace that represents the usual behavior of the program. Then, after the optimization of the code is completed, you record another trace, doing roughly what you did for the reference one. Finally, having the optimized trace open, you select the \emph{\faFolderOpen{}~Open second trace} option in the compare traces window and load the reference trace. -Now things start to get familiar. You search for a zone, similarly like in the find zone window, choose the one you want in the \emph{matched source locations} drop-down, and then you look at the histogram\footnote{When comparing frame times you are presented with a list of available frame sets, without the search box.}. 
This time there are two overlaid graphs, one representing the current trace, and the second one representing the external (reference) trace (figure~\ref{comparehistogram}). You can easily see how the performance characteristics of the zone were affected by your modifications. +Now things start to get familiar. You search for a zone, similarly like in the find zone window, choose the one you want in the \emph{matched source locations} drop-down, and then you look at the histogram\footnote{When comparing frame times you are presented with a list of available frame sets, without the search box.}. This time there are two overlaid graphs, one representing the current trace and the second one representing the external (reference) trace (figure~\ref{comparehistogram}). You can easily see how the performance characteristics of the zone were affected by your modifications. \begin{figure}[h] \centering\begin{tikzpicture} @@ -3175,32 +3172,32 @@ Now things start to get familiar. You search for a zone, similarly like in the f \label{comparehistogram} \end{figure} -Note that the traces are color and symbol coded. The current trace is marked by a yellow \faLemon{} symbol, and the external one is marked by a red \faGem{} symbol. +Note that the traces are color and symbol-coded. The current trace is marked by a yellow \faLemon{} symbol, and the external one is marked by a red \faGem{} symbol. -When searching for source locations it's not uncommon to match more than one zone (for example a search for \texttt{Draw} may result in \texttt{DrawCircle} and \texttt{DrawRectangle} matches). Typically you wouldn't want to compare execution profiles of two unrelated functions, which is prevented by the \emph{link selection} option, which ensures that when you choose a source location in one trace, the corresponding one is also selected in second trace. Be aware that this may still result in a mismatch, for example if you have overloaded functions. 
In such case you will need to manually select the appropriate function in the other trace. +When searching for source locations it's not uncommon to match more than one zone (for example a search for \texttt{Draw} may result in \texttt{DrawCircle} and \texttt{DrawRectangle} matches). Typically you wouldn't want to compare execution profiles of two unrelated functions, which is prevented by the \emph{link selection} option, which ensures that when you choose a source location in one trace, the corresponding one is also selected in the second trace. Be aware that this may still result in a mismatch, for example, if you have overloaded functions. In such a case, you will need to select the appropriate function in the other trace manually. -It may be difficult, if not impossible, to perform identical runs of a program. This means that the number of collected zones may differ in both traces, which would influence the displayed results. To fix this problem enable the \emph{Normalize values} option, which will adjust the displayed results as-if both traces had the same number of recorded zones. +It may be difficult, if not impossible, to perform identical runs of a program. This means that the number of collected zones may differ in both traces, influencing the displayed results. To fix this problem, enable the \emph{Normalize values} option, which will adjust the displayed results as if both traces had the same number of recorded zones. \begin{bclogo}[ noborder=true, couleur=black!5, logo=\bclampe ]{Trace descriptions} -Set custom trace descriptions (see section~\ref{traceinfo}) to easily differentiate the two loaded traces. If no trace description is set, a name of the profiled program will be displayed along with the capture time. +Set custom trace descriptions (see section~\ref{traceinfo}) to easily differentiate the two loaded traces. If no trace description is set, the name of the profiled program will be displayed along with the capture time. 
\end{bclogo} \subsection{Memory window} \label{memorywindow} -The data gathered by profiling memory usage (section~\ref{memoryprofiling}) can be viewed in the memory window. If more than one memory pool was tracked during the capture, you will be able to select which pool you want to look at, using the \emph{\faArchive{}~Memory pool} selection box. +You can view the data gathered by profiling memory usage (section~\ref{memoryprofiling}) in the memory window. If the profiler tracked more than one memory pool during the capture, you will be able to select which pool you want to look at, using the \emph{\faArchive{}~Memory pool} selection box. The top row contains statistics, such as \emph{total allocations} count, number of \emph{active allocations}, current \emph{memory usage} and process \emph{memory span}\footnote{Memory span describes the address space consumed by the program. It is calculated as a difference between the maximum and minimum observed in-use memory address.}. -The lists of captured memory allocations are displayed in a common multi-column format thorough the profiler. The first column specifies the memory address of an allocation, or an address and an offset, if the address is not at the start of the allocation. Clicking the \LMB{} left mouse button on an address will open the memory allocation information window\footnote{While the allocation information window is opened, the address will be highlighted on the list.} (see section~\ref{memallocinfo}). Clicking the \MMB{}~middle mouse button on an address will zoom the timeline view to memory allocation's range. The next column contains the allocation size. +The lists of captured memory allocations are displayed in a common multi-column format throughout the profiler. The first column specifies the memory address of an allocation, or an address and an offset if the address is not at the start of the allocation.
Clicking the \LMB{} left mouse button on an address will open the memory allocation information window\footnote{While the allocation information window is opened, the address will be highlighted on the list.} (see section~\ref{memallocinfo}). Clicking the \MMB{}~middle mouse button on an address will zoom the timeline view to memory allocation's range. The next column contains the allocation size. The allocation's timing data is contained in two columns: \emph{appeared at} and \emph{duration}. Clicking the \LMB{}~left mouse button on the first one will center the timeline view at the beginning of allocation, and likewise, clicking on the second one will center the timeline view at the end of allocation. Note that allocations that have not yet been freed will have their duration displayed in green color. -The memory event location in the code is displayed in the last four columns. The \emph{thread} column contains the thread where the allocation was made and freed (if applicable), or an \emph{alloc / free} pair of threads, if it was allocated in one thread and freed in another. The \emph{zone alloc} contains the zone in which the allocation was performed\footnote{The actual allocation is typically a couple functions deeper in the call stack.}, or \texttt{-} if there was no active zone in the given thread at the time of allocation. Clicking the \LMB{}~left mouse button on the zone name will open the zone information window (section~\ref{zoneinfo}). Similarly, the \emph{zone free} column displays the zone which freed the allocation, which may be colored yellow, if it is the same exact zone that did the allocation. Alternatively, if the zone has not yet been freed, a green \emph{active} text is displayed. The last column contains the \emph{alloc} and \emph{free} call stack buttons, or their placeholders, if no call stack is available (see section~\ref{collectingcallstacks} for more information). 
Clicking on either of the buttons will open the call stack window (section~\ref{callstackwindow}). Note that the call stack buttons that match the information window will be highlighted. +The memory event location in the code is displayed in the last four columns. The \emph{thread} column contains the thread where the allocation was made and freed (if applicable), or an \emph{alloc / free} pair of threads if it was allocated in one thread and freed in another. The \emph{zone alloc} column contains the zone in which the allocation was performed\footnote{The actual allocation is typically a couple of functions deeper in the call stack.}, or \texttt{-} if there was no active zone in the given thread at the time of allocation. Clicking the \LMB{}~left mouse button on the zone name will open the zone information window (section~\ref{zoneinfo}). Similarly, the \emph{zone free} column displays the zone which freed the allocation, which may be colored yellow, if it is the same zone that did the allocation. Alternatively, if the allocation has not yet been freed, a green \emph{active} text is displayed. The last column contains the \emph{alloc} and \emph{free} call stack buttons, or their placeholders, if no call stack is available (see section~\ref{collectingcallstacks} for more information). Clicking on either of the buttons will open the call stack window (section~\ref{callstackwindow}). Note that the call stack buttons that match the information window will be highlighted. The memory window is split into the following sections: @@ -3210,28 +3207,28 @@ The \emph{\faAt{} Allocations} pane allows you to search for the specified addre \subsubsection{Active allocations} -The \emph{\faHeartbeat{} Active allocations} pane displays a list of currently active memory allocations and their total memory usage. Here you can see where exactly your program did allocate memory it is currently using. If the application has already exited, this becomes a list of leaked memory.
+The \emph{\faHeartbeat{} Active allocations} pane displays a list of currently active memory allocations and their total memory usage. Here, you can see where your program allocated memory it is now using. If the application has already exited, this becomes a list of leaked memory. \subsubsection{Memory map} -On the \emph{\faMap{} Memory map} pane you can see the graphical representation of your program's address space. Active allocations are displayed as green lines, while the freed memory is marked as red lines. The brightness of the color indicates how much time has passed since the last memory event at the given location -- the most recent events are the most vibrant. +On the \emph{\faMap{} Memory map} pane, you can see the graphical representation of your program's address space. Active allocations are displayed as green lines, while the freed memory is red. The brightness of the color indicates how much time has passed since the last memory event at the given location -- the most recent events are the most vibrant. -This view may be helpful in assessing the general memory behavior of the application, or in debugging the problems resulting from address space fragmentation. +This view may help assess the general memory behavior of the application or in debugging the problems resulting from address space fragmentation. \subsubsection{Bottom-up call stack tree} \label{callstacktree} -The \emph{\faTree{}~Bottom-up call stack tree} pane is only available, if the memory events were collecting the call stack data (section~\ref{collectingcallstacks}). In this view you are presented with a tree of memory allocations, starting at the call stack entry point and going up to the allocation's pinpointed place. Each level of the tree is sorted according to the number of bytes allocated in given branch. +The \emph{\faTree{}~Bottom-up call stack tree} pane is only available, if the memory events were collecting the call stack data (section~\ref{collectingcallstacks}). 
In this view, you are presented with a tree of memory allocations, starting at the call stack entry point and going up to the allocation's pinpointed place. Each tree level is sorted according to the number of bytes allocated in the given branch. -Each tree node consists of three elements: the function name, the source file location and the memory allocation data. The memory allocation data is either yellow \emph{inclusive} events count (allocations performed by children), or the cyan \emph{exclusive} events count (allocations that took place in the node)\footnote{Due to the way call stacks work there is no possibility for an entry to have both inclusive and exclusive counts, in a properly instrumented program.}. There are two values that are counted: total memory size and number of allocations. +Each tree node consists of the function name, the source file location, and the memory allocation data. The memory allocation data is either yellow \emph{inclusive} events count (allocations performed by children) or the cyan \emph{exclusive} events count (allocations that took place in the node)\footnote{Due to the way call stacks work, there is no possibility for an entry to have both inclusive and exclusive counts, in an adequately instrumented program.}. Two values are counted: total memory size and number of allocations. -The \emph{Group by function name} option controls how tree nodes are grouped. If it is disabled, then the grouping is performed at a machine instruction level granularity. This may result in very verbose output, but the displayed source locations are precise. To make the tree more readable you may opt to perform grouping at the function name level, which will result in less valid source file locations, as multiple entries are collapsed into one. +The \emph{Group by function name} option controls how tree nodes are grouped. If it is disabled, the grouping is performed at a machine instruction-level granularity. 
This may result in a very verbose output, but the displayed source locations are precise. To make the tree more readable, you may opt to perform grouping at the function name level, which will result in less valid source file locations, as multiple entries are collapsed into one. -Enabling the \emph{Only active allocations} option will limit the call stack tree to only display active allocations. +Enabling the \emph{Only active allocations} option will limit the call stack tree to display only active allocations. -Clicking the \RMB{}~right mouse button on the function name will open allocations list window (see section \ref{alloclist}), which list all the allocations included at the current call stack tree level. Clicking the \RMB{}~right mouse button on the source file location will open the source file view window (if applicable, see section~\ref{sourceview}). +Clicking the \RMB{}~right mouse button on the function name will open the allocations list window (see section~\ref{alloclist}), which lists all the allocations included at the current call stack tree level. Likewise, clicking the \RMB{}~right mouse button on the source file location will open the source file view window (if applicable, see section~\ref{sourceview}). -Some function names may be too long to be properly displayed, with the events count data at the end. In such cases, you may press the \emph{control} button, which will display events count tooltip. +Some function names may be too long to be displayed correctly, with the events count data at the end. In such cases, you may press the \emph{control} button, which will display the events count tooltip. \subsubsection{Top-down call stack tree} @@ -3239,7 +3236,7 @@ This pane is identical in functionality to the \emph{Bottom-up call stack tree}, \subsubsection{Looking back at the memory history} -By default the memory window displays the memory data at the current point of program execution.
It is however possible to view the historical data by enabling the \emph{\faRuler{}~Limits} option. Only the memory events included in the time range will be taken into consideration in the displayed results. See section~\ref{timerangelimits} for more information. +By default, the memory window displays the memory data at the current point of program execution. It is, however, possible to view the historical data by enabling the \emph{\faRuler{}~Limits} option. The profiler will consider only the memory events within the time range in the displayed results. See section~\ref{timerangelimits} for more information. \subsection{Allocations list window} \label{alloclist} @@ -3249,27 +3246,27 @@ This window displays the list of allocations included at the selected call stack \subsection{Memory allocation information window} \label{memallocinfo} -The information about the selected memory allocation is displayed in this window. It lists the allocation's address and size, along with the time, thread and zone data of the allocation and free events. Clicking the \emph{\faMicroscope{}~Zoom to allocation} button will zoom the timeline view to the allocation's extent. +The information about the selected memory allocation is displayed in this window. It lists the allocation's address and size, along with the time, thread, and zone data of the allocation and free events. Clicking the \emph{\faMicroscope{}~Zoom to allocation} button will zoom the timeline view to the allocation's extent. \subsection{Trace information window} \label{traceinfo} -This window contains information about the current trace: captured program name, time of the capture, profiler version which performed the capture and a custom trace description, which you can fill in. +This window contains information about the current trace: captured program name, time of the capture, profiler version which performed the capture, and a custom trace description, which you can fill in. 
Open the \emph{Trace statistics} section to see information about the trace, such as achieved timer resolution, number of captured zones, lock events, plot data points, memory allocations, etc. -There's also a section containing the selected frame set timing statistics and histogram\footnote{See section~\ref{findzone} for a description of the histogram. Note that there are subtle differences in the available functionality.}. As a convenience you can switch the active frame set here and limit the displayed frame statistics to the frame range visible on the screen. +There's also a section containing the selected frame set timing statistics and histogram\footnote{See section~\ref{findzone} for a description of the histogram. Note that there are subtle differences in the available functionality.}. As a convenience, you can switch the active frame set here and limit the displayed frame statistics to the frame range visible on the screen. -If \emph{CPU topology} data is available (see section~\ref{cputopology}), you will be able to view the package, core and thread hierarchy. +If \emph{CPU topology} data is available (see section~\ref{cputopology}), you will be able to view the package, core, and thread hierarchy. -The \emph{Source location substitutions} section allows adapting the source file paths, as captured by the profiler to the actual on-disk locations\footnote{This has no effect on source files cached during the profiling run.}. You can create a new substitution by clicking the \emph{Add new substitution} button. This will add a new entry, with input fields for ECMAScript-conforming regular expression pattern and its corresponding replacement string. The outcome of substitutions can be quickly tested in the \emph{example source location} input field, which will be transformed and displayed below, as \emph{result}. 
+The \emph{Source location substitutions} section allows adapting the source file paths, as captured by the profiler, to the actual on-disk locations\footnote{This does not affect source files cached during the profiling run.}. You can create a new substitution by clicking the \emph{Add new substitution} button. This will add a new entry, with input fields for an ECMAScript-conforming regular expression pattern and its corresponding replacement string. You can quickly test the outcome of substitutions in the \emph{example source location} input field, which will be transformed and displayed below, as \emph{result}. \begin{bclogo}[ noborder=true, couleur=black!5, logo=\bclampe ]{Quick example} -Let's say we have an unix-based operating system with program sources in \texttt{/home/user/program/src/} directory. We have also performed a capture of an application running under Windows, with sources in \texttt{C:\textbackslash{}Users\textbackslash{}user\textbackslash{}Desktop\textbackslash{}program\textbackslash{}src} directory. Obviously, the source locations don't match and the profiler can't access the source files we have on our disk. We can fix that by adding two substitution patterns: +Let's say we have a Unix-based operating system with program sources in \texttt{/home/user/program/src/} directory. We have also performed a capture of an application running under Windows, with sources in \texttt{C:\textbackslash{}Users\textbackslash{}user\textbackslash{}Desktop\textbackslash{}program\textbackslash{}src} directory. The source locations don't match, and the profiler can't access the source files on our disk.
We can fix that by adding two substitution patterns: \begin{itemize} \item \texttt{\^{}C:\textbackslash{}\textbackslash{}Users\textbackslash{}\textbackslash{}user\textbackslash{}\textbackslash{}Desktop} \hspace{1em}\textrightarrow\hspace{1em} \texttt{/home/user} @@ -3277,24 +3274,24 @@ Let's say we have an unix-based operating system with program sources in \texttt \end{itemize} \end{bclogo} -In this window you can view the information about the machine on which the profiled application was running. This includes the operating system, used compiler, CPU name, amount of total available RAM, etc. If application information was provided (see section~\ref{appinfo}), it will also be displayed here. +In this window, you can view the information about the machine on which the profiled application was running. This includes the operating system, used compiler, CPU name, total available RAM, etc. In addition, if application information was provided (see section~\ref{appinfo}), it will also be displayed here. -If an application should crash during profiling (section~\ref{crashhandling}), the crash information will be displayed in this window. It provides you information about the thread that has crashed, the crash reason and the crash call stack (section~\ref{callstackwindow}). +If an application should crash during profiling (section~\ref{crashhandling}), the profiler will display the crash information in this window. It provides you information about the thread that has crashed, the crash reason, and the crash call stack (section~\ref{callstackwindow}). \subsection{Zone information window} \label{zoneinfo} -The zone information window displays detailed information about a single zone. There can be only one zone information window open at any time. While the window is open the zone will be highlighted on the timeline view with a green outline. The following data is presented: +The zone information window displays detailed information about a single zone. 
There can be only one zone information window open at any time. While the window is open, the profiler will highlight the zone on the timeline view with a green outline. The following data is presented: \begin{itemize} -\item Basic source location information: function name, source file location and the thread name. +\item Basic source location information: function name, source file location, and the thread name. \item Timing information. -\item If context switch capture was performed (section~\ref{contextswitches}) and a thread was suspended during zone execution, a list of wait regions will be displayed, with complete information about timing, CPU migrations and wait reasons. If CPU topology data is available (section~\ref{cputopology}), zone migrations across cores will be marked with 'C', and migrations across packages -- with 'P'. In some cases context switch data might be incomplete\footnote{For example, when a capture is ongoing and context switch information has not yet been received.}, in which case a warning message will be displayed. +\item If the profiler performed context switch capture (section~\ref{contextswitches}) and a thread was suspended during zone execution, a list of wait regions will be displayed, with complete information about the timing, CPU migrations, and wait reasons. If CPU topology data is available (section~\ref{cputopology}), the profiler will mark zone migrations across cores with 'C' and migrations across packages -- with 'P'. In some cases, context switch data might be incomplete\footnote{For example, when a capture is ongoing and context switch information has not yet been received.}, in which case a warning message will be displayed. \item Memory events list, both summarized and a list of individual allocation/free events (see section~\ref{memorywindow} for more information on the memory events list). -\item List of messages that were logged in the zone's scope (including its children).
-\item Zone trace, taking into account the zone tree and call stack information (section~\ref{collectingcallstacks}), trying to reconstruct a combined zone + call stack trace\footnote{Reconstruction is only possible, if all zones have full call stack capture data available. In case where that's not available, an \emph{unknown frames} entry will be present.}. Captured zones are displayed as normal text, while functions that were not instrumented are dimmed. Hovering the \faMousePointer{}~mouse pointer over a zone will highlight it on the timeline view with a red outline. Clicking the \LMB{}~left mouse button on a zone will switch the zone info window to that zone. Clicking the \MMB{}~middle mouse button on a zone will zoom the timeline view to the zone's extent. Clicking the \RMB{}~right mouse button on a source file location will open the source file view window (if applicable, see section~\ref{sourceview}). +\item List of messages logged in the zone's scope (including its children). +\item Zone trace, taking into account the zone tree and call stack information (section~\ref{collectingcallstacks}), trying to reconstruct a combined zone + call stack trace\footnote{Reconstruction is only possible if all zones have complete call stack capture data available. In the case where that's not available, an \emph{unknown frames} entry will be present.}. Captured zones are displayed as standard text, while functions that were not instrumented are dimmed. Hovering the \faMousePointer{}~mouse pointer over a zone will highlight it on the timeline view with a red outline. Clicking the \LMB{}~left mouse button on a zone will switch the zone info window to that zone. Clicking the \MMB{}~middle mouse button on a zone will zoom the timeline view to the zone's extent. Clicking the \RMB{}~right mouse button on a source file location will open the source file view window (if applicable, see section~\ref{sourceview}).
\item Child zones list, showing how the current zone's execution time was used. Zones on this list can be grouped according to their source location. Each group can be expanded to show individual entries. All the controls from the zone trace are also available here. -\item Time distribution in child zones, which expands the information provided in the child zones list by processing \emph{all} zone children (including multiple levels of grandchildren). This results in a statistical list of zones that were really doing the work in the current zone's time span. If a group of zones is selected on this list, the find zone window (section~\ref{findzone}) will open, with time range limited to show only the children of the current zone. +\item Time distribution in child zones, which expands the information provided in the child zones list by processing \emph{all} zone children (including multiple levels of grandchildren). This results in a statistical list of zones that were really doing the work in the current zone's time span. If a group of zones is selected on this list, the find zone window (section~\ref{findzone}) will open, with a time range limited to show only the children of the current zone. \end{itemize} The zone information window has the following controls available: @@ -3303,9 +3300,9 @@ The zone information window has the following controls available: \item \emph{\faMicroscope{} Zoom to zone} -- Zooms the timeline view to the zone's extent. \item \emph{\faArrowUp{} Go to parent} -- Switches the zone information window to display current zone's parent zone (if available). \item \emph{\faChartBar{} Statistics} -- Displays the zone general performance characteristics in the find zone window (section~\ref{findzone}). -\item \emph{\faAlignJustify{} Call stack} -- Views the current zone's call stack in the call stack window (section~\ref{callstackwindow}). The button will be highlighted, if the call stack window shows the zone's call stack. 
Only available if zone had captured call stack data (section~\ref{collectingcallstacks}). -\item \emph{\faFile*{} Source} -- Display source file view window with the zone source code (only available if applicable, see section~\ref{sourceview}). Button will be highlighted, if the source file is being currently displayed (but the focused source line might be different). -\item \emph{\faArrowLeft{} Go back} -- Returns to the previously viewed zone. The viewing history is lost when the zone information window is closed, or when the type of displayed zone changes (from CPU to GPU or vice versa). +\item \emph{\faAlignJustify{} Call stack} -- Views the current zone's call stack in the call stack window (section~\ref{callstackwindow}). The button will be highlighted if the call stack window shows the zone's call stack. Only available if zone had captured call stack data (section~\ref{collectingcallstacks}). +\item \emph{\faFile*{} Source} -- Display source file view window with the zone source code (only available if applicable, see section~\ref{sourceview}). The button will be highlighted if the source file is displayed (but the focused source line might be different). +\item \emph{\faArrowLeft{} Go back} -- Returns to the previously viewed zone. The viewing history is lost when the zone information window is closed or when the type of displayed zone changes (from CPU to GPU or vice versa). \end{itemize} Clicking on the \emph{\faClipboard{}~Copy to clipboard} buttons will copy the appropriate data to the clipboard. @@ -3313,20 +3310,20 @@ Clicking on the \emph{\faClipboard{}~Copy to clipboard} buttons will copy the ap \subsection{Call stack window} \label{callstackwindow} -This window shows the frames contained in the selected call stack. Each frame is described by a function name, source file location and originating image\footnote{Executable images are called \emph{modules} by Microsoft.} name. Function frames originating from kernel are marked with a red color. 
Clicking the \LMB{}~left mouse button on either the function name of source file location will copy the name to the clipboard. Clicking the \RMB{}~right mouse button on the source file location will open the source file view window (if applicable, see section~\ref{sourceview}). +This window shows the frames contained in the selected call stack. Each frame is described by a function name, source file location, and originating image\footnote{Executable images are called \emph{modules} by Microsoft.} name. Function frames originating from the kernel are marked with a red color. Clicking the \LMB{}~left mouse button on either the function name or source file location will copy the name to the clipboard. Clicking the \RMB{}~right mouse button on the source file location will open the source file view window (if applicable, see section~\ref{sourceview}). -A single stack frame may have multiple function call places associated with it. This happens in case of inlined function calls. Such entries will be displayed in the call stack window, with \emph{inline} in place of frame number\footnote{Or '\faCaretRight{}'~icon in case of call stack tooltips.}. +A single stack frame may have multiple function call places associated with it. This happens in the case of inlined function calls. Such entries will be displayed in the call stack window, with \emph{inline} in place of frame number\footnote{Or '\faCaretRight{}'~icon in case of call stack tooltips.}. Stack frame location may be displayed in the following number of ways, depending on the \emph{\faAt{}~Frame location} option selection: \begin{itemize} \item \emph{Source code} -- displays source file and line number associated with the frame. \item \emph{Entry point} -- source code at the beginning of the function containing selected frame, or function call place in case of inline frames. -\item \emph{Return address} -- shows return address, which may be used to pinpoint the exact instruction in the disassembly.
+\item \emph{Return address} -- shows return address, which you may use to pinpoint the exact instruction in the disassembly. \item \emph{Symbol address} -- displays begin address of the function containing the frame address. \end{itemize} -In some cases it may be not possible to properly decode stack frame address. Such frames will be presented with a dimmed '\texttt{[ntdll.dll]}' name of the image containing the frame address, or simply '\texttt{[unknown]}' if even this information cannot be retrieved. Additionally, '\texttt{[kernel]}' is used to indicate unknown stack frames within the operating system internal routines. +In some cases, it may not be possible to decode stack frame addresses correctly. Such frames will be presented with a dimmed '\texttt{[ntdll.dll]}' name of the image containing the frame address, or simply '\texttt{[unknown]}' if the profiler cannot retrieve even this information. Additionally, '\texttt{[kernel]}' is used to indicate unknown stack frames within the operating system's internal routines. If the displayed call stack is a sampled call stack (chapter~\ref{sampling}), an additional button will be available, \emph{\faDoorOpen{}~Global entry statistics}. Clicking it will open the sample entry call stacks window (chapter~\ref{sampleparents}) for the current call stack. @@ -3335,7 +3332,7 @@ Clicking on the \emph{\faClipboard{}~Copy to clipboard} button will copy call st \subsubsection{Reading call stacks} \label{readingcallstacks} -You need to take special care when reading call stacks. Contrary to their name, call stacks do not show \emph{function call stacks}, but rather \emph{function return stacks}. This might be a bit confusing at first, but this is how programs do work. Consider the following source code: +You need to take special care when reading call stacks. Contrary to their name, call stacks do not show \emph{function call stacks}, but rather \emph{function return stacks}. 
This might not be very clear at first, but this is how programs do work. Consider the following source code: \begin{lstlisting} int main() @@ -3358,12 +3355,12 @@ Let's say you are looking at the call stack of some function called within \text At the first glance it may look like \texttt{unique\_ptr::reset} was the \emph{call site} of the \texttt{Application::Run}, which would make no sense, but this is not the case here. When you remember these are the \emph{function return points}, it becomes much more clear what is happening. As an optimization, \texttt{Application::Run} is returning directly into \texttt{unique\_ptr::reset}, skipping the return to \texttt{main} and an unnecessary \texttt{reset} function call. -Moreover, the linker may determine in some rare cases that any two functions in your program are identical\footnote{For example if all they do is zero-initialize a region of memory. As some constructors would do.}. As a result, only one copy of the binary code will be provided in the executable for both functions to share. While this optimization produces more compact programs, it also means that there's no way to distinguish the two functions apart in the resulting machine code. In effect, some call stacks may look nonsensical until you perform a small investigation. +Moreover, the linker may determine in some rare cases that any two functions in your program are identical\footnote{For example, if all they do is zero-initialize a region of memory. As some constructors would do.}. As a result, only one copy of the binary code will be provided in the executable for both functions to share. While this optimization produces more compact programs, it also means that there's no way to distinguish the two functions apart in the resulting machine code. In effect, some call stacks may look nonsensical until you perform a small investigation. 
\subsection{Sample entry call stacks window} \label{sampleparents} -This window displays statistical information about the selected symbol. All sampled call stacks (chapter~\ref{sampling}) leading to the symbol are counted and displayed in descending order. You can select the displayed call stack using the \emph{entry call stack} controls, which also display time spent in the chosen call stack. Alternatively, sample counts may be shown by disabling the \emph{\faStopwatch{}~Show time} option, which is described in more detail in chapter~\ref{statisticssampling}. +This window displays statistical information about the selected symbol. All sampled call stacks (chapter~\ref{sampling}) leading to the symbol are counted and displayed in descending order. You can choose the displayed call stack using the \emph{entry call stack} controls, which also display time spent in the selected call stack. Alternatively, sample counts may be shown by disabling the \emph{\faStopwatch{}~Show time} option, which is described in more detail in chapter~\ref{statisticssampling}. The layout of frame list and the \emph{\faAt{}~Frame location} option selection is similar to the call stack window, described in chapter~\ref{callstackwindow}. @@ -3374,30 +3371,30 @@ This window can operate in one of the two modes. The first one is quite simple, \subsubsection{Source file view} -In source view mode you can view the source code of the profiled application, to take a quick glance at the context of the function behavior you are analyzing. The selected line (for example, a location of a profiling zone) will be highlighted both in the source code listing and on the scroll bar. +In source view mode, you can view the source code of the profiled application to take a quick glance at the context of the function behavior you are analyzing. The profiler will highlight the selected line (for example, a location of a profiling zone) both in the source code listing and on the scroll bar. 
\begin{bclogo}[ noborder=true, couleur=black!5, logo=\bcbombe ]{Important} -In order to be able to display source files, Tracy has to somehow gain access to them. Since having the source code is not needed for the profiled application to run, this can be a bit problematic in some cases. The source files search order is as follows: +To display source files, Tracy has to gain access to them somehow. Since having the source code is not needed for the profiled application to run, this can be problematic in some cases. The source files search order is as follows: \begin{enumerate} -\item Discovery is performed on server side. Found files are cached in the trace. \emph{This is appropriate when the client and the server run on the same machine, or if you're deploying your application to the target device and then run the profiler on the same workstation.} -\item If not found, discovery is performed on client side. Found files are cached in the trace. \emph{This is appropriate when you are developing your code on another machine, for example you may be working on a dev-board through a SSH connection.} -\item If not found, Tracy will try to open source files which you might have on your disk later on. These files won't be stored in the trace. You may provide custom file path substitution rules to redirect this search to the right place (see section~\ref{traceinfo}). +\item Discovery is performed on the server side. Found files are cached in the trace. \emph{This is appropriate when the client and the server run on the same machine or if you're deploying your application to the target device and then run the profiler on the same workstation.} +\item If not found, discovery is performed on the client side. Found files are cached in the trace.
\emph{This is appropriate when you are developing your code on another machine, for example, you may be working on a dev-board through an SSH connection.} +\item If not found, Tracy will try to open source files that you might have on your disk later on. The profiler won't store these files in the trace. You may provide custom file path substitution rules to redirect this search to the right place (see section~\ref{traceinfo}). \end{enumerate} -Note that the discovery process not only looks for a file on the disk, but it also checks its time stamp and validates it against the executable image time stamp, or, if it's not available, the time of the performed capture. This will prevent use of source files that are newer (i.e. were changed) than the program you're profiling. +Note that the discovery process not only looks for a file on the disk but also checks its timestamp and validates it against the executable image timestamp or, if it's not available, the time of the performed capture. This will prevent the use of source files that are newer (i.e., were changed) than the program you're profiling. -Nevertheless, \textbf{the displayed source files might still not reflect the code that was profiled!} It is up to you to verify that you don't have a modified version of the code, with regards to the trace. +Nevertheless, \textbf{the displayed source files might still not reflect the code that you profiled!} It is up to you to verify that you don't have a modified version of the code with regards to the trace. \end{bclogo} \subsubsection{Symbol view} \label{symbolview} -If the inspected source location has an associated symbol context (i.e. if it comes from a call stack capture, from call stack sampling, etc.), a much more capable symbol view mode is available. A symbol is an unit of machine code, basically a callable function. It may be generated using multiple source files and may consist of multiple inlined functions.
A list of all captured symbols is available in the statistics window, as described in chapter~\ref{statisticssampling}. +A much more capable symbol view mode is available if the inspected source location has an associated symbol context (i.e., if it comes from a call stack capture, from call stack sampling, etc.). A symbol is a unit of machine code, basically a callable function. It may be generated using multiple source files and may consist of numerous inlined functions. A list of all captured symbols is available in the statistics window, as described in chapter~\ref{statisticssampling}. The header of symbol view window contains a name of the selected \emph{\faPuzzlePiece{}~symbol}, a list of \emph{\faSitemap{}~functions} that contribute to the symbol, and information such as \emph{\faWeightHanging{}~Code size} in the program, or count of probed \emph{\faEyeDropper{}~Samples}. @@ -3409,48 +3406,48 @@ Additionally, you may use the \emph{Mode} selector to decide what content should \item \emph{Combined} -- both source code and disassembly will be listed next to each other. \end{itemize} -In some circumstances (missing or outdated source files, lack of machine code) some modes may be unavailable. +Some modes may be unavailable in some circumstances (missing or outdated source files, lack of machine code). \paragraph{Source mode} -This is pretty much the original source file view window, but with the ability to select one of the source files that were used to build the symbol. Additionally, each source file line which produced machine code in the symbol will show count of associated assembly instructions, displayed with an '\texttt{@}' prefix, and will be marked with a grey color on the scroll bar. Due to the way optimizing compilers work, some lines may seemingly not produce any machine code, for example because iterating a loop counter index might have been reduced to advancing a data pointer. 
Some other lines may have disproportionate amount of associated instructions, e.g. when a loop unrolling optimization is applied. This varies from case to case and from compiler to compiler. +This is pretty much the source file view window, but with the ability to select one of the source files that the compiler used to build the symbol. Additionally, each source file line that produced machine code in the symbol will show a count of associated assembly instructions, displayed with an '\texttt{@}' prefix, and will be marked with grey color on the scroll bar. Due to how optimizing compilers work, some lines may seemingly not produce any machine code, for example, because iterating a loop counter index might have been reduced to advancing a data pointer. Some other lines may have a disproportionate amount of associated instructions, e.g., when the compiler applied a loop unrolling optimization. This varies from case to case and from compiler to compiler. \paragraph{Assembly mode} -This mode shows the disassembly of the symbol machine code. If only one inline function is selected through the \emph{\faSitemap{}~Function} selector, assembly instructions outside of this function will be dimmed out. Each assembly instruction is displayed listed with its location in the program memory during execution. If the \emph{\faSearchLocation{}~Relative locations} option is selected, an offset from the symbol beginning will be printed instead. Clicking the \LMB{}~left mouse button on the address/offset will switch to counting line numbers, using the selected one as origin (i.e. zero value). Line numbers are displayed inside \texttt{[]} brackets. This display mode can be useful to correlate lines with output of external tools, such as \texttt{llvm-mca}. To disable line numbering click the \RMB{}~right mouse button on a line number. +This mode shows the disassembly of the symbol machine code. 
If only one inline function is selected through the \emph{\faSitemap{}~Function} selector, assembly instructions outside of this function will be dimmed out. Each assembly instruction is listed with its location in the program memory during execution. If the \emph{\faSearchLocation{}~Relative locations} option is selected, the profiler will print an offset from the symbol beginning instead. Clicking the \LMB{}~left mouse button on the address/offset will switch to counting line numbers, using the selected one as the origin (i.e., zero value). Line numbers are displayed inside \texttt{[]} brackets. This display mode can be useful to correlate lines with the output of external tools, such as \texttt{llvm-mca}. To disable line numbering, click the \RMB{}~right mouse button on a line number. -If the \emph{\faFileImport{}~Source locations} option is selected, each line of the assembly code will also contain information about the originating source file name and line number. For easier differentiation between different source files, each file is assigned its own color. Clicking the \LMB{}~left mouse button on a displayed source location will switch the source file, if necessary, and focus the source view on selected line. Additionally, hovering the \faMousePointer{}~mouse cursor over the presented location will show a tooltip containing the name of a function the instruction originates from, along with an appropriate source code fragment. +If the \emph{\faFileImport{}~Source locations} option is selected, each line of the assembly code will also contain information about the originating source file name and line number. Each file is assigned its own color for easier differentiation between different source files. Clicking the \LMB{}~left mouse button on a displayed source location will switch the source file, if necessary, and focus the source view on the selected line.
Additionally, hovering the \faMousePointer{}~mouse cursor over the presented location will show a tooltip containing the name of a function the instruction originates from, along with an appropriate source code fragment. -Selecting the \emph{\faCogs{}~Machine code} option will enable display of raw machine code bytes for each line. +Selecting the \emph{\faCogs{}~Machine code} option will enable the display of raw machine code bytes for each line. -If any instruction would jump to a predefined address, symbolic name of the jump target will be additionally displayed. If the destination location is within the currently displayed symbol an \texttt{->}~arrow will be prepended to the name. Hovering the \faMousePointer{}~mouse pointer over such symbol name will highlight the target location. Clicking on it with the \LMB{}~left mouse button will focus the view on the destination instruction, or switch view to the destination symbol. +If any instruction would jump to a predefined address, the symbolic name of the jump target will be additionally displayed. If the destination location is within the currently displayed symbol, an \texttt{->}~arrow will be prepended to the name. Hovering the \faMousePointer{}~mouse pointer over such symbol name will highlight the target location. Clicking on it with the \LMB{}~left mouse button will focus the view on the destination instruction or switch view to the destination symbol. -Enabling the \emph{\faShare{}~Jumps} option will show jumps within the symbol code as a series of arrows from the jump source to the jump target. Hovering the \faMousePointer{}~mouse pointer over a jump arrow will display jump information tooltip and will also draw the jump range on the scroll bar, as a green line. Jump target location will be marked by a horizontal green line. Clicking on a jump arrow with the \LMB{}~left mouse button will focus the view on the target location. 
Jumps going out of the symbol\footnote{This includes jumps, procedure calls and returns. For example, in x86 assembly the respective operand names can be: \texttt{jmp}, \texttt{call}, \texttt{ret}.} will be indicated by a smaller arrow pointing away from the code. +Enabling the \emph{\faShare{}~Jumps} option will show jumps within the symbol code as a series of arrows from the jump source to the jump target. Hovering the \faMousePointer{}~mouse pointer over a jump arrow will display a jump information tooltip and draw the jump range on the scroll bar as a green line. A horizontal green line will mark the jump target location. Clicking on a jump arrow with the \LMB{}~left mouse button will focus the view on the target location. Jumps going out of the symbol\footnote{This includes jumps, procedure calls, and returns. For example, in x86 assembly the respective instruction mnemonics can be: \texttt{jmp}, \texttt{call}, \texttt{ret}.} will be indicated by a smaller arrow pointing away from the code. The \emph{AT\&T} switch can be used to select between \emph{Intel} and \emph{AT\&T} assembly syntax. Beware that microarchitecture data is only available if Intel syntax is selected. -Portions of the executable used to show the symbol view are stored within the captured profile and don't rely on the local disk files being available. +Portions of the executable used to show the symbol view are stored within the captured profile and don't rely on local disk files being available. \subparagraph{Exploring microarchitecture} -If the listed assembly code targets x86 or x64 instruction set architectures, hovering \faMousePointer{}~mouse pointer over an instruction will display a tooltip with microarchitectural data, based on measurements made in \cite{Abel19a}.
\emph{This information is retrieved from instruction cycle tables, and does not represent true behavior of the profiled code.} Reading the cited article will give you a detailed definition of the presented data, but here's a quick (and inaccurate) explanation: +If the listed assembly code targets x86 or x64 instruction set architectures, hovering the \faMousePointer{}~mouse pointer over an instruction will display a tooltip with microarchitectural data, based on measurements made in \cite{Abel19a}. \emph{This information is retrieved from instruction cycle tables and does not represent the true behavior of the profiled code.} Reading the cited article will give you a detailed definition of the presented data, but here's a quick (and inaccurate) explanation: \begin{itemize} -\item \emph{Throughput} -- How many cycles are required to execute an instruction in a stream of independent same instructions. For example, if two independent \texttt{add} instructions may be executed simultaneously on different execution units, then the throughput (cycle cost per instruction) is 0.5. +\item \emph{Throughput} -- How many cycles are required to execute an instruction in a stream of the same independent instructions. For example, if the CPU may execute two independent \texttt{add} instructions simultaneously on different execution units, then the throughput (cycle cost per instruction) is 0.5. \item \emph{Latency} -- How many cycles it takes for an instruction to finish executing. This is reported as a min-max range, as some output values may be available earlier than the rest. \item \emph{\textmu{}ops} -- How many microcode operations have to be dispatched for an instruction to retire. For example, adding a value from memory to a register may consist of two microinstructions: first load the value from memory, then add it to the register. -\item \emph{Ports} -- Which ports (execution units) are required for dispatch of microinstructions.
For example, \texttt{2*p0+1*p015} would mean that out of the three microinstructions implementing the assembly instruction, two can only be executed on port 0, and one microinstruction can be executed on ports 0, 1, or 5. Number of available ports and their capabilities vary between different processors architectures. Refer to \url{https://wikichip.org/} for more information. +\item \emph{Ports} -- Which ports (execution units) are required for dispatch of microinstructions. For example, \texttt{2*p0+1*p015} would mean that out of the three microinstructions implementing the assembly instruction, two can only be executed on port 0, and one microinstruction can be executed on ports 0, 1, or 5. The number of available ports and their capabilities vary between different processor architectures. Refer to \url{https://wikichip.org/} for more information. \end{itemize} -Selection of the CPU microarchitecture can be performed using the \emph{\faMicrochip{}~\textmu{}arch} drop-down. Each architecture is accompanied with a name of an example CPU implementing it. If the current selection matches the microarchitecture on which the profiled application was running on, the \faMicrochip{}~icon will be green\footnote{Comparing sampled instruction counts with microarchitectural details only makes sense when this selection is properly matched.}. Otherwise it will be red\footnote{This can be used to gain insight into how the code \emph{may} behave on other processors.}. +Selection of the CPU microarchitecture can be performed using the \emph{\faMicrochip{}~\textmu{}arch} drop-down. Each architecture is accompanied by the name of an example CPU implementing it. If the current selection matches the microarchitecture on which the profiled application was running, the \faMicrochip{}~icon will be green\footnote{Comparing sampled instruction counts with microarchitectural details only makes sense when this selection is properly matched.}.
Otherwise, it will be red\footnote{You can use this to gain insight into how the code \emph{may} behave on other processors.}. -Enabling the \emph{\faTruckLoading{}~Latency} option will display graphical representation of instruction latencies on the listing. Minimum latency of an instruction is represented with a red bar, while the maximum latency is represented by a yellow bar. +Enabling the \emph{\faTruckLoading{}~Latency} option will display a graphical representation of instruction latencies on the listing. The minimum latency of an instruction is represented by a red bar, while the maximum latency is represented by a yellow bar. -Clicking on the \emph{\faFileImport{}~Save} button lets you write the disassembly listing to a file. You can then manually extract some critical loop kernel and pass it to a CPU simulator, such as \emph{LLVM Machine Code Analyzer} (\texttt{llvm-mca})\footnote{\url{https://llvm.org/docs/CommandGuide/llvm-mca.html}}, in order to see how the code is executed and if there are any pipeline bubbles. Consult the \texttt{llvm-mca} documentation for more details. Alternatively, you might click the \RMB{}~right mouse button on a jump arrow and save only the instructions within the jump range, using the \emph{\faFileImport{}~Save jump range} button. +Clicking on the \emph{\faFileImport{}~Save} button lets you write the disassembly listing to a file. You can then manually extract some critical loop kernel and pass it to a CPU simulator, such as \emph{LLVM Machine Code Analyzer} (\texttt{llvm-mca})\footnote{\url{https://llvm.org/docs/CommandGuide/llvm-mca.html}}, to see how the code is executed and if there are any pipeline bubbles. Consult the \texttt{llvm-mca} documentation for more details. Alternatively, you might click the \RMB{}~right mouse button on a jump arrow and save only the instructions within the jump range, using the \emph{\faFileImport{}~Save jump range} button.
\subparagraph{Instruction dependencies} -Assembly instructions may read values stored in registers and may also write values to registers. A dependency between two instructions is created when one produces some result, which is then consumed by the other one. Combining this dependency graph with information about instruction latencies may give deep understanding of the bottlenecks in code performance. +Assembly instructions may read values stored in registers and may also write values to registers. As a result, a dependency between two instructions is created when one produces some result, which the other then consumes. Combining this dependency graph with information about instruction latencies may give a deep understanding of the bottlenecks in code performance. Clicking the \LMB{}~left mouse button on any assembly instruction will mark it as a target for resolving register dependencies between instructions. To cancel this selection, click on any assembly instruction with \RMB{}~right mouse button. @@ -3460,71 +3457,71 @@ The selected instruction will be highlighted in red, while its dependencies will \item \emph{Green} -- Register value is read (is a dependency \emph{after} target instruction). \item \emph{Red} -- A value is written to a register (is a dependency \emph{before} target instruction). \item \emph{Yellow} -- Register is read and then modified. -\item \emph{Grey} -- Value in a register is either discarded (overwritten), or was already consumed by an earlier instruction (i.e. it is readily available\footnote{This is actually a bit of simplification. Run a pipeline simulator, e.g. \texttt{llvm-mca} for a better analysis.}). Dependency will be not followed further. +\item \emph{Grey} -- Value in a register is either discarded (overwritten) or was already consumed by an earlier instruction (i.e., it is readily available\footnote{This is actually a bit of simplification. Run a pipeline simulator, e.g., \texttt{llvm-mca} for a better analysis.}). 
The profiler will not follow the dependency chain further. \end{itemize} -Search for dependencies follows program control flow, so there may be multiple producers and consumers for any single register. While the \emph{after} and \emph{before} guidelines mentioned above hold in general case, things may be more complicated when there's a large amount of conditional jumps in the code. Note that dependencies further away than 64 instructions are not displayed. +Search for dependencies follows program control flow, so there may be multiple producers and consumers for any single register. While the \emph{after} and \emph{before} guidelines mentioned above hold in the general case, things may be more complicated when there's a large number of conditional jumps in the code. Note that dependencies further away than 64 instructions are not displayed. -For easier navigation, dependencies are also marked on the left side of the scroll bar, following the green, red and yellow convention. The selected instruction is marked in blue. +For more straightforward navigation, dependencies are also marked on the left side of the scroll bar, following the green, red and yellow conventions. The selected instruction is marked in blue. \paragraph{Combined mode} -In this mode both the source and assembly panes will be displayed together, providing the best way to gain insight into the code. Hovering the \faMousePointer{}~mouse pointer over the source file line, or the location of the assembly line will highlight the corresponding lines in the second pane (both in the listing and on the scroll bar). Clicking the \LMB{}~left mouse button on a line will select it in both panes. Do note that while an assembly line always has only one corresponding source line, a single source line may have many associated assembly lines, not necessarily next to each other. 
Clicking the \RMB{}~right mouse button will perform the same action as left mouse button, but it will also focus the secondary view on the selected line. Clicking on the same \emph{source} line more than once will focus the \emph{assembly} view on the next associated instructions block. +In this mode, both the source and assembly panes will be displayed together, providing the best way to gain insight into the code. Hovering the \faMousePointer{}~mouse pointer over the source file line or the location of the assembly line will highlight the corresponding lines in the second pane (both in the listing and on the scroll bar). Clicking the \LMB{}~left mouse button on a line will select it in both panes. Note that while an assembly line always has only one corresponding source line, a single source line may have many associated assembly lines, not necessarily next to each other. Clicking the \RMB{}~right mouse button will perform the same action as the left mouse button, but it will also focus the secondary view on the selected line. Clicking on the same \emph{source} line more than once will focus the \emph{assembly} view on the next associated instructions block. \paragraph{Instruction pointer cost statistics} -If automated call stack sampling (see chapter~\ref{sampling}) was performed, additional profiling information will be available. The first column of source and assembly views will contain percentage counts of collected instruction pointer samples for each displayed line, both in numerical and graphical bar form. This information can be used to determine which line of the function takes the most time. The displayed percentage values are heat map color coded, with the lowest values mapped to dark red, and the highest values mapped to bright yellow. The color code will appear next to the percentage value, and on the scroll bar, so that 'hot' places in code can be identified at a glance. 
+If automated call stack sampling (see chapter~\ref{sampling}) was performed, additional profiling information will be available. The first column of source and assembly views will contain percentage counts of collected instruction pointer samples for each displayed line, both in numerical and graphical bar form. You can use this information to determine which line of the function takes the most time. The displayed percentage values are heat map color-coded, with the lowest values mapped to dark red and the highest to bright yellow. The color code will appear next to the percentage value and on the scroll bar so that you can identify 'hot' places in the code at a glance. -By default samples are displayed only from within the selected symbol, in isolation. In some cases you may however want to include samples from functions that were called. To do so, enable the \emph{\faSignOut*{}~Child calls} option, which may also be temporarily toggled by holding the \keys{Z} key. You can also click the~\faCaretDown{}~drop down control to display a child call distribution list, which shows each known function\footnote{You should remember that these are results of random sampling. Some function calls may be missing here.} that the symbol called. Make sure to familiarize yourself with section~\ref{readingcallstacks} to be able to properly read the results. +By default, samples are displayed only from within the selected symbol, in isolation. In some cases, however, you may want to include samples from functions that the selected symbol called. To do so, enable the \emph{\faSignOut*{}~Child calls} option, which you may also temporarily toggle by holding the \keys{Z} key. You can also click the~\faCaretDown{}~drop down control to display a child call distribution list, which shows each known function\footnote{You should remember that these are results of random sampling. Some function calls may be missing here.} that the symbol called.
Make sure to familiarize yourself with section~\ref{readingcallstacks} to be able to read the results correctly. -Instruction timings can be viewed as a group. To begin constructing such group, click the \LMB{}~left mouse button on the percentage value. Additional instructions can be added using the \keys{\ctrl}~key, while holding the \keys{\shift}~key will allow selection of a range. To cancel the selection, click the \RMB{}~right mouse button on a percentage value. Group statistics can be seen at the bottom of the pane. +Instruction timings can be viewed as a group. To begin constructing such a group, click the \LMB{}~left mouse button on the percentage value. Additional instructions can be added using the \keys{\ctrl}~key, while holding the \keys{\shift}~key will allow selection of a range. To cancel the selection, click the \RMB{}~right mouse button on a percentage value. Group statistics can be seen at the bottom of the pane. -Clicking the \MMB{}~middle mouse button on the percentage value of an assembly instruction will display entry call stacks of the selected sample (see chapter~\ref{sampleparents}). This functionality is only available for instructions that have collected sampling data, and only in the assembly view, as the source code may be inlined multiple times, which would result in ambiguous location data. Note that number of entry call stacks is displayed in a tooltip, for a quick reference. +Clicking the \MMB{}~middle mouse button on the percentage value of an assembly instruction will display entry call stacks of the selected sample (see chapter~\ref{sampleparents}). This functionality is only available for instructions that have collected sampling data and only in the assembly view, as the source code may be inlined multiple times, which would result in ambiguous location data. Note that the number of entry call stacks is displayed in a tooltip for quick reference.
\begin{bclogo}[ noborder=true, couleur=black!5, logo=\bclampe ]{How did I get here?} -In some cases it may be difficult to understand what is being displayed in the disassembly. For example, calling the \texttt{std::lower\_bound} function may generate multiple level of inlined functions: first we enter the search algorithm, then the comparison functions, which in turn may be lambdas that call even more external code, and so on. In such event you will most likely see that some external code is taking a long time to execute and you will be none the wiser on how to improve things. +In some cases, it may be challenging to understand what is being displayed in the disassembly. For example, calling the \texttt{std::lower\_bound} function may generate multiple levels of inlined functions: first, we enter the search algorithm, then the comparison functions, which in turn may be lambdas that call even more external code, and so on. In such an event, you will most likely see that some external code is taking a long time to execute, and you will be none the wiser on how to improve things. -Using the entry call stacks view can be very helpful in such cases, as you will be able to see the call stack of inline functions, originating from a call site in the code you are familiar with. With this critical piece of information you will be able to make a connection between functions you call and the instructions that are executed. +Using the entry call stacks view can be very helpful in such cases, as you will be able to see the call stack of inline functions originating from a call site in the code you are familiar with. With this critical piece of information, you will be able to connect the functions you call and the executed instructions. \end{bclogo} -Sample data source is controlled by the \emph{\faSitemap{}~Function} control, in the window header. If this option should be disabled, sample data will represent the whole symbol.
If it is enabled, then the sample data will only include the selected function. The currently selected function can be changed by opening the drop-down box, which includes time statistics. The time percentage values of each contributing function are calculated relative to total number of samples collected within the symbol. +The sample data source is controlled by the \emph{\faSitemap{}~Function} control in the window header. If this option is disabled, sample data will represent the whole symbol. If it is enabled, then the sample data will only include the selected function. You can change the currently selected function by opening the drop-down box, which includes time statistics. The time percentage values of each contributing function are calculated relative to the total number of samples collected within the symbol. -Selecting the \emph{Limit range} option will restrict counted samples to the time extent shared with the statistics view (displayed as a red striped region on the timeline). See section~\ref{timeranges} for more detail. +Selecting the \emph{Limit range} option will restrict counted samples to the time extent shared with the statistics view (displayed as a red-striped region on the timeline). See section~\ref{timeranges} for more detail. \begin{bclogo}[ noborder=true, couleur=black!5, logo=\bcbombe ]{Important} -Be aware that the data is not fully accurate, as it is the result of random sampling of program execution. Furthermore, undocumented implementation details of an out-of-order CPU architecture will highly impact the measurement. Read chapter~\ref{checkenvironmentcpu} to see the tip of an iceberg. +Be aware that the data is not entirely accurate, as it results from random sampling of program execution. Furthermore, undocumented implementation details of an out-of-order CPU architecture will highly impact the measurement. Read chapter~\ref{checkenvironmentcpu} to see the tip of the iceberg.
\end{bclogo} \paragraph{Inspecting hardware samples} -As described in chapter~\ref{hardwaresampling}, on some platforms Tracy is able to capture the internal statistics counted by the CPU hardware. If this data has been collected, the \emph{\faHighlighter{}~Cost} selection list will be available. It allows changing what is taken into consideration for display by the cost statistics. The following options can be selected: +As described in chapter~\ref{hardwaresampling}, on some platforms, Tracy can capture the internal statistics counted by the CPU hardware. If this data has been collected, the \emph{\faHighlighter{}~Cost} selection list will be available. It allows changing what is taken into consideration for display by the cost statistics. You can select the following options: \begin{itemize} \item \emph{Sample count} -- this selects the instruction pointer statistics, collected by call stack sampling performed by the operating system. This is the default data shown when hardware samples have not been captured. \item \emph{Cycles} -- an option very similar to the \emph{sample count}, but the data is collected directly by the CPU hardware counters. This may make the results more reliable. -\item \emph{Branch impact} -- indicates places where many branch instructions are issued, and at the same time, incorrectly predicted. Calculated as $\sqrt{\text{\#branch instructions}*\text{\#branch misses}}$. This is more useful than the raw branch miss rate, as it takes into account the number of events taking place. -\item \emph{Cache impact} -- similar to \emph{branch impact}, but it shows cache miss data instead. These values are calculated as $\sqrt{\text{\#cache references}*\text{\#cache misses}}$, and will highlight places with lots of cache accesses that also miss. -\item The rest of available selections just show raw values gathered from the hardware counters. 
These are: \emph{Retirements}, \emph{Branches taken}, \emph{Branch miss}, \emph{Cache access} and \emph{Cache miss}. +\item \emph{Branch impact} -- indicates places where many branch instructions are issued, and at the same time, incorrectly predicted. Calculated as $\sqrt{\text{\#branch instructions}*\text{\#branch misses}}$. This is more useful than the raw branch miss rate, as it considers the number of events taking place. +\item \emph{Cache impact} -- similar to \emph{branch impact}, but it shows cache miss data instead. These values are calculated as $\sqrt{\text{\#cache references}*\text{\#cache misses}}$ and will highlight places with lots of cache accesses that also miss. +\item The rest of the available selections just show raw values gathered from the hardware counters. These are: \emph{Retirements}, \emph{Branches taken}, \emph{Branch miss}, \emph{Cache access} and \emph{Cache miss}. \end{itemize} -If the \emph{\faHammer{}~Hardware samples} switch is enabled, the cost percentages column will be supplemented with three additional columns. The first added column always displays the instructions per cycle (IPC) value. The two remaining columns show branch and cache data, as described below. The displayed values are color coded, with green color indicating good execution performance, and red color indicating that the CPU pipeline was stalled due to one reason or another. +If the \emph{\faHammer{}~Hardware samples} switch is enabled, the profiler will supplement the cost percentages column with three additional columns. The first added column displays the instructions per cycle (IPC) value. The two remaining columns show branch and cache data, as described below. The displayed values are color-coded, with green indicating good execution performance and red indicating that the code stalled the CPU pipeline for one reason or another. 
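As a rough worked example of the \emph{cache impact} formula from the list above, consider two hypothetical instructions (the counts are illustrative assumptions, not measured data): instruction $A$ misses 10 out of 10 cache accesses, and instruction $B$ misses 1000 out of 2000.
\[
\text{miss rate}_A = \frac{10}{10} = 100\%, \qquad \text{cache impact}_A = \sqrt{10 \cdot 10} = 10
\]
\[
\text{miss rate}_B = \frac{1000}{2000} = 50\%, \qquad \text{cache impact}_B = \sqrt{2000 \cdot 1000} \approx 1414
\]
Although $A$ has the higher raw miss rate, the impact value of $B$ is roughly 141 times larger, because the formula accounts for how many events took place. The \emph{branch impact} value behaves the same way, with branch instructions and branch misses in place of cache references and cache misses.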
-If the \emph{\faCarCrash{}~Impact} switch is enabled, the branch and cache columns will show how much impact the branch mispredictions and cache misses have. The way these statistics are calculated is described in the list above. In the other case, the columns will show the raw branch and cache miss rate ratios, isolated to their respective source and assembly lines, and not relative to the whole symbol. +If the \emph{\faCarCrash{}~Impact} switch is enabled, the branch and cache columns will show how much impact the branch mispredictions and cache misses have. The way these statistics are calculated is described in the list above. In the other case, the columns will show the raw branch and cache miss rate ratios, isolated to their respective source and assembly lines and not relative to the whole symbol. \begin{bclogo}[ noborder=true, couleur=black!5, logo=\bcattention ]{Isolated values} -The percentage values when \emph{\faCarCrash{}~Impact} option is not selected will not take into account the relative count of events. For example, you may see 100\% cache miss rate when some instruction missed 10 out of 10 cache accesses. While not ideal, this is not as important as a seemingly better 50\% cache miss rate instruction, which actually has missed 1000 out of 2000 accesses. You should always cross-check the presented information with the respective event counts. To help a bit with this, Tracy will dim values that are statistically unimportant. +The percentage values when \emph{\faCarCrash{}~Impact} option is not selected will not take into account the relative count of events. For example, you may see a 100\% cache miss rate when some instruction missed 10 out of 10 cache accesses. While not ideal, this is not as important as a seemingly better 50\% cache miss rate instruction, which actually has missed 1000 out of 2000 accesses. Therefore, you should always cross-check the presented information with the respective event counts. 
To help with this, Tracy will dim statistically unimportant values. \end{bclogo} \subsection{Wait stacks window} @@ -3533,42 +3530,42 @@ The percentage values when \emph{\faCarCrash{}~Impact} option is not selected wi If wait stack information has been captured (chapter~\ref{waitstacks}), here you will be able to inspect the collected data. There are three different views available: \begin{itemize} -\item \emph{\faTable{}~List} -- shows all unique wait stacks, sorted by number of times they were observed. -\item \emph{\faTree{}~Bottom-up tree} -- displays wait stacks in form of a collapsible tree, which starts at the bottom of the call stack. -\item \emph{\faTree{}~Top-down tree} -- displays wait stacks in form of a collapsible tree, which starts at the top of the call stack. +\item \emph{\faTable{}~List} -- shows all unique wait stacks, sorted by the number of times they were observed. +\item \emph{\faTree{}~Bottom-up tree} -- displays wait stacks in the form of a collapsible tree, which starts at the bottom of the call stack. +\item \emph{\faTree{}~Top-down tree} -- displays wait stacks in the form of a collapsible tree, which starts at the top of the call stack. \end{itemize} -Displayed data may be narrowed down to a specific time range and/or to include only selected threads. +Displayed data may be narrowed down to a specific time range or to include only selected threads. \subsection{Lock information window} \label{lockwindow} -This window presents information and statistics about a lock. The lock events count represents the total number collected of wait, obtain and release events. The announce, termination and lock lifetime measure the time from the lockable construction until destruction. +This window presents information and statistics about a lock. The lock events count represents the total number collected of wait, obtain and release events. The announce, termination, and lock lifetime measure the time from the lockable construction until destruction. 
\subsection{Frame image playback window} \label{playback} -You may view a live replay of the profiled application screen captures (see section~\ref{frameimages}) using this window. Playback is controlled by the \emph{\faPlay~Play} and \emph{\faPause~Pause} buttons and the \emph{Frame image} slider can be used to scrub to the desired time stamp. Alternatively you may use the \emph{\faCaretLeft} and \emph{\faCaretRight} buttons to change single frame back or forward. +You may view a live replay of the profiled application screen captures (see section~\ref{frameimages}) using this window. Playback is controlled by the \emph{\faPlay~Play} and \emph{\faPause~Pause} buttons, and the \emph{Frame image} slider can be used to scrub to the desired timestamp. Alternatively, you may use the \emph{\faCaretLeft} and \emph{\faCaretRight} buttons to step a single frame back or forward. -If the \emph{Sync timeline} option is selected, the timeline view will be focused on the frame corresponding to the currently displayed screen shot. The \emph{Zoom 2$\times$} option enlarges the image, for easier viewing. +If the \emph{Sync timeline} option is selected, the profiler will focus the timeline view on the frame corresponding to the currently displayed screenshot. The \emph{Zoom 2$\times$} option enlarges the image for easier viewing. -Each displayed frame image is also accompanied by the following parameters: \emph{timestamp}, showing at which time the image was captured, \emph{frame}, displaying the numerical value of corresponding frame, and \emph{ratio}, telling how well the in-memory loss-less compression was able to reduce the image data size. +The following parameters also accompany each displayed frame image: \emph{timestamp}, showing at which time the image was captured, \emph{frame}, displaying the numerical value of the corresponding frame, and \emph{ratio}, telling how well the in-memory lossless compression was able to reduce the image data size.
 \subsection{CPU data window}
 \label{cpudata}

-Statistical data about all processes running on the system during the capture is available in this window, if context switch capture (section~\ref{contextswitches}) was performed.
+Statistical data about all processes running on the system during the capture is available in this window if the profiler performed context switch capture (section~\ref{contextswitches}).

-Each running program has an assigned process identifier (PID), which is displayed in the first column. If a program entry is expanded, a list of thread identifiers (TIDs) will also be displayed.
+Each running program has an assigned process identifier (PID), which is displayed in the first column. The profiler will also display a list of thread identifiers (TIDs) if a program entry is expanded.

-The \emph{running time} column shows how much processor time was used by a process or thread. The percentage may be over 100\%, as it is scaled to trace length and multiple threads belonging to a single program may be executing simultaneously. The \emph{running regions} column displays how many times a given entry was in the \emph{running} state and the \emph{CPU migrations} shows how many times an entry was moved from one CPU core to another, when an entry was suspended by the system scheduler.
+The \emph{running time} column shows how much processor time was used by a process or thread. The percentage may be over 100\%, as it is scaled to trace length, and multiple threads belonging to a single program may be executing simultaneously. The \emph{running regions} column displays how many times a given entry was in the \emph{running} state, and the \emph{CPU migrations} shows how many times an entry was moved from one CPU core to another when the system scheduler suspended an entry.

-The profiled program is highlighted using green color. Furthermore, yellow highlight indicates threads which are known to the profiler (that is, which sent events due to instrumentation).
+The profiled program is highlighted using green color. Furthermore, the yellow highlight indicates threads known to the profiler (that is, which sent events due to instrumentation).

 \subsection{Annotation settings window}
 \label{annotationsettings}

-In this window you may modify how a timeline annotation (section~\ref{annotatingtrace}) is presented by setting its text description, or selecting region highlight color. If the note is no longer needed, it may also be removed here.
+In this window, you may modify how a timeline annotation (section~\ref{annotatingtrace}) is presented by setting its text description or selecting region highlight color. If the note is no longer needed, you may also remove it here.

 \subsection{Annotation list window}
 \label{annotationlist}
@@ -3610,13 +3607,13 @@ This window displays information about time range limits (section~\ref{timerange
 \item \emph{\faMemory{}~Copy from memory} -- Copies the memory time range limit.
 \end{itemize}

-Note that ranges displayed in the window have color hints that match color of the striped regions on the timeline.
+Note that ranges displayed in the window have color hints that match the color of the striped regions on the timeline.

 \section{Exporting zone statistics to CSV}
 \label{csvexport}

-You can use a command-line utility in the \texttt{csvexport} directory to export basic zone statistics from a saved trace into a CSV format.
-The tool requires a single .tracy file as an argument and prints the result into the standard output (stdout) from where you can redirect it into a file or use it as an input into another tool.
+You can use a command-line utility in the \texttt{csvexport} directory to export primary zone statistics from a saved trace into a CSV format.
+The tool requires a single .tracy file as an argument and prints the result into the standard output (stdout), from where you can redirect it into a file or use it as an input into another tool.

 By default, the utility will list all zones with the following columns:
 \begin{itemize}
@@ -3646,7 +3643,7 @@ You can customize the output with the following command line options:
 \section{Importing external profiling data}
 \label{importingdata}

-Tracy can import data generated by other profilers. This external data cannot be directly loaded, but must be converted first. Currently there's only support for converting chrome:tracing data, through the \texttt{import-chrome} utility.
+Tracy can import data generated by other profilers. This external data cannot be directly loaded but must be converted first. Currently, there's only support for converting chrome:tracing data through the \texttt{import-chrome} utility.

 \begin{bclogo}[
 noborder=true,
@@ -3661,7 +3658,7 @@ noborder=true,
 couleur=black!5,
 logo=\bclampe
 ]{Source locations}
-Chrome tracing format doesn't document a way to provide source location data. The \texttt{import-chrome} utility will however recognize a custom \texttt{loc} tag in the root of zone begin events. You should be formatting this data in the usual \texttt{filename:line} style, for example: \texttt{hello.c:42}. Providing the line number (including a colon) is optional, but highly recommended.
+Chrome tracing format doesn't document a way to provide source location data. The \texttt{import-chrome} utility will however recognize a custom \texttt{loc} tag in the root of zone begin events. You should be formatting this data in the usual \texttt{filename:line} style, for example: \texttt{hello.c:42}. Providing the line number (including a colon) is optional but highly recommended.
 \end{bclogo}

 \begin{bclogo}[
@@ -3670,15 +3667,15 @@ couleur=black!5,
 logo=\bcattention
 ]{Limitations}
 \begin{itemize}
-\item Tracy is a single-process profiler. Should the imported trace contain PID entries, each PID+TID pair will create a new \emph{pseudo-TID} number, which will be then decoded into a PID+TID pair in thread labels. If you want to preserve the original TID numbers, your traces should omit PID entries.
-\item The imported data may be severely limited, either by not mapping directly to the data structures used by Tracy, or by following undocumented practices.
+\item Tracy is a single-process profiler. Should the imported trace contain PID entries, each PID+TID pair will create a new \emph{pseudo-TID} number, which the profiler will then decode into a PID+TID pair in thread labels. If you want to preserve the original TID numbers, your traces should omit PID entries.
+\item The imported data may be severely limited, either by not mapping directly to the data structures used by Tracy or by following undocumented practices.
 \end{itemize}
 \end{bclogo}

 \section{Configuration files}
 \label{configurationfiles}

-While the client part doesn't read or write anything to the disk (with the exception of accessing the \texttt{/proc} filesystem on Linux), the server part has to keep some persistent state. The naming conventions or internal data format of the files are not meant to be known by profiler users, but you may want to do a backup of the configuration, or move it to another machine.
+While the client part doesn't read or write anything to the disk (except for accessing the \texttt{/proc} filesystem on Linux), the server part has to keep some persistent state. The naming conventions or internal data format of the files are not meant to be known by profiler users, but you may want to do a backup of the configuration or move it to another machine.

 On Windows settings are stored in the \texttt{\%APPDATA\%/tracy} directory. All other platforms use the \texttt{\$XDG\_CONFIG\_HOME/tracy} directory, or \texttt{\$HOME/.config/tracy} if the \texttt{XDG\_CONFIG\_HOME} environment variable is not set.

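Editorial illustration (not part of the patch): the directory lookup rule stated in the context line above can be sketched in a few lines of Python. This is only a reading of the manual text, not Tracy's actual implementation; the function name `tracy_config_dir` and its parameters are hypothetical, and the environment is passed in as a plain mapping so the rule can be checked without touching the real system.

```python
import os
from pathlib import PurePosixPath, PureWindowsPath


def tracy_config_dir(env, is_windows):
    """Resolve the configuration root as the manual describes it.

    `env` is any mapping of environment variables (hypothetical helper
    parameter); `is_windows` selects the Windows rule.
    """
    if is_windows:
        # Windows: %APPDATA%/tracy
        return str(PureWindowsPath(env["APPDATA"]) / "tracy")
    if "XDG_CONFIG_HOME" in env:
        # Other platforms, variable set: $XDG_CONFIG_HOME/tracy
        return str(PurePosixPath(env["XDG_CONFIG_HOME"]) / "tracy")
    # Other platforms, variable not set: $HOME/.config/tracy
    return str(PurePosixPath(env["HOME"]) / ".config" / "tracy")


if __name__ == "__main__":
    # Usage: resolve the directory for the machine running this script.
    print(tracy_config_dir(os.environ, os.name == "nt"))
```

Knowing the resolved path is enough to back up the configuration or copy it to another machine, which is the use case the manual mentions.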
@@ -3689,11 +3686,11 @@ Various files at the root configuration directory store common profiler state su
 \subsection{Trace specific settings}
 \label{tracespecific}

-Trace files saved on disk are immutable and can't be changed, but it may be desirable to store additional per-trace information to be used by the profiler, for example a custom description of the trace, or the timeline view position used in the previous profiling session.
+Trace files saved on disk are immutable and can't be changed. Still, it may be desirable to store additional per-trace information to be used by the profiler, for example, a custom description of the trace or the timeline view position used in the previous profiling session.

-This external data is stored in the \texttt{user/[letter]/[program]/[week]/[epoch]} directory, relative to the configuration's root directory. The \texttt{program} part is the name of the profiled application (for example \texttt{program.exe}). The \texttt{letter} part is a first letter of the profiled application's name. The \texttt{week} part is a number of weeks since the unix epoch, and the \texttt{epoch} part is a number of seconds since unix epoch. This rather unusual convention prevents creation of directories with hundreds of entries.
+This external data is stored in the \texttt{user/[letter]/[program]/[week]/[epoch]} directory, relative to the configuration's root directory. The \texttt{program} part is the name of the profiled application (for example \texttt{program.exe}). The \texttt{letter} part is the first letter of the profiled application's name. The \texttt{week} part is a count of weeks since the Unix epoch, and the \texttt{epoch} part is a count of seconds since the Unix epoch. This rather unusual convention prevents the creation of directories with hundreds of entries.

-User settings are never pruned by the profiler.
+The profiler never prunes user settings.

 \newpage
 \appendix
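
Editorial illustration (not part of the patch): the `user/[letter]/[program]/[week]/[epoch]` layout described in the hunk above can be sketched as follows. This is a minimal Python sketch under stated assumptions, not Tracy's code: the function name `trace_settings_dir` is hypothetical, the week count is assumed to be the epoch value divided by the number of seconds in a week and rounded down, and the manual does not say which timestamp of the trace supplies the epoch value, so it is taken as a parameter.

```python
SECONDS_PER_WEEK = 7 * 24 * 60 * 60  # 604800


def trace_settings_dir(program, epoch_seconds):
    """Build the per-trace settings path relative to the configuration root.

    program       -- name of the profiled application, e.g. "program.exe"
    epoch_seconds -- seconds since the Unix epoch (which moment of the
                     trace this refers to is an assumption left to the caller)
    """
    if not program:
        raise ValueError("program name must not be empty")
    letter = program[0]                        # first letter of the name
    week = epoch_seconds // SECONDS_PER_WEEK   # whole weeks since the epoch
    return f"user/{letter}/{program}/{week}/{epoch_seconds}"


if __name__ == "__main__":
    print(trace_settings_dir("program.exe", 1639252831))
```

Grouping by first letter and by week keeps each directory level small: traces of one program only share a `[week]` directory if they fall in the same seven-day window.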