% !TeX spellcheck = en_US
\documentclass[hidelinks,titlepage,a4paper]{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{lmodern}
\usepackage{newpxtext,newpxmath}
\linespread{1.05} % Line spacing - Palatino needs more space between lines
\usepackage{microtype}
\usepackage[group-separator={,}]{siunitx}
\usepackage[tikz]{bclogo}
\usepackage{appendix}
\usepackage{verbatim}
\usepackage[hyphens]{url}
\usepackage{hyperref} % For hyperlinks in the PDF
\usepackage{fontawesome}
\usepackage[hmarginratio=1:1,top=32mm,columnsep=20pt]{geometry} % Document margins
\geometry{a4paper,textwidth=6.5in,hmarginratio=1:1,
textheight=9in,vmarginratio=1:1,heightrounded}
\usepackage{fancyhdr} % Headers and footers
\pagestyle{fancy} % All pages have headers and footers
\fancyhead{} % Blank out the default header
\fancyfoot{} % Blank out the default footer
\fancyhead[L]{Tracy Profiler}
\fancyhead[R]{The user manual}
\fancyfoot[RO]{\thepage} % Custom footer text
\usepackage{listings}
\usepackage{xcolor}
\usepackage{float}
\lstset{language=C++}
\lstset{
basicstyle=\footnotesize\ttfamily,
tabsize=4,
extendedchars=true,
breaklines=true,
stringstyle=\ttfamily,
showspaces=false,
xleftmargin=17pt,
framexleftmargin=17pt,
framexrightmargin=5pt,
framexbottommargin=4pt,
showstringspaces=false
}
\usepackage[hang, small,labelfont=bf,up,textfont=it,up]{caption} % Custom captions under/above floats in tables or figures
\begin{document}
\begin{titlepage}
\centering
{\fontsize{120}{140}\selectfont Tracy Profiler}
\vspace{50pt} {\Huge\fontfamily{lmtt}\selectfont The user manual}
\vfill
\large\textbf{Bartosz Taudul} \href{mailto:wolf.pld@gmail.com}{<wolf.pld@gmail.com>}
\vspace{10pt}
\today
\vfill
\url{https://bitbucket.org/wolfpld/tracy}
\end{titlepage}
\tableofcontents
\newpage
\section{A quick look at Tracy}
Tracy is a real-time, nanosecond resolution \emph{frame profiler} that can be used for remote or embedded telemetry of applications. It can profile CPU (C++, Lua), GPU (OpenGL, Vulkan) and memory. It can also monitor locks held by threads and show where contention happens.
In contrast with \emph{statistical profilers} (such as VTune, perf or Very Sleepy), Tracy does require manual markup of the source code. In return, it allows frame-by-frame inspection of the program execution. You will be able to see exactly which functions are called, how much time is spent in them, and how they interact with each other in a multi-threaded environment. This feat is, by design, impossible to achieve in statistical profilers, which work by periodically sampling the \emph{program counter} register to see which part of the code is executing.
Even though Tracy is a \emph{frame} profiler, with the emphasis on analysis of \emph{frame time} in real-time applications, it does work with utilities that do not employ the concept of a frame. There's nothing that would prohibit profiling of, for example, a compression tool, or an event-driven UI application.
Close analogues of Tracy include RAD Telemetry, Brofiler and microprofile.
Now let's take a close look at the marketing blurb.
\subsection{Real-time}
This claim can be described in the following ways:
\begin{enumerate}
\item The profiled application is not slowed down by profiling\footnote{See section~\ref{perfimpact} for a benchmark.}. The act of recording a profiling event has virtually zero cost -- it only takes \textasciitilde 8~\si{\nano\second}. Even on low-power mobile devices there's no perceptible impact on execution speed.
\item The profiler itself works in real-time, without the need to process collected data in a complex way. Actually, it is quite inefficient in the way it works, as the data it presents is calculated anew each frame. And yet it can run at 60 frames per second.
\item The profiler has full functionality while the profiled application is running and the data is being captured. You may interact with your application and then immediately switch to the profiler when a performance drop occurs.
\end{enumerate}
\subsection{Nanosecond resolution}
It is hard to imagine how long a nanosecond is. One good analogy is to compare it with a measure of length. Let's say that one second is one meter (the average doorknob is at a height of one meter).
One millisecond ($\frac{1}{1000}$ of a second) would then be one millimeter. The average size of a red ant or the width of a pencil is 5 or 6~\si{\milli\metre}. A modern game running at 60 frames per second has only 16~\si{\milli\second} to update the game world and render the entire scene.
One microsecond ($\frac{1}{1000}$ of a millisecond) in our comparison equals one micron. The diameter of a typical bacterium ranges from 1 to 10 microns. The diameter of a red blood cell, or the width of a strand of spider silk, is about 7~\si{\micro\metre}.
And finally, one nanosecond ($\frac{1}{1000}$ of a microsecond) would be one nanometer. A modern microprocessor transistor gate, the width of the DNA helix, or the thickness of a cell membrane are all in the range of 5~\si{\nano\metre}. In one~\si{\nano\second} light can travel only 30~\si{\centi\meter}.
Tracy can achieve single-digit nanosecond measurement resolution, thanks to its use of hardware timing mechanisms on the x86 and ARM architectures\footnote{In both 32 and 64~bit variants. On x86 Tracy requires a modern version of the \texttt{rdtscp} instruction (Sandy Bridge and later). On ARM-based systems Tracy will try to use the timer register (\textasciitilde 40~\si{\nano\second} resolution). If that fails (due to kernel configuration), Tracy falls back to the system-provided timer, which can range in resolution from 250~\si{\nano\second} to 1~\si{\micro\second}.}. Other profilers may rely on timers provided by the operating system, which have significantly lower resolution (about 300~\si{\nano\second} -- 1~\si{\micro\second}). This is enough to hide the subtle impact of cache access optimizations, etc.
\subsection{Frame profiler}
Tracy is aimed at understanding the inner workings of a tight game (or interactive application) loop. That's why it slices the execution time of a program using the \emph{frame}\footnote{A frame is used to describe a single image displayed on the screen by the game (or any other program), preferably 60 times per second to achieve smooth animation.} as a basic work-unit\footnote{Frame usage is not required. See section~\ref{markingframes} for more information.}. The most interesting frames are the ones that took longer than the allocated time, producing visible hitches in the on-screen animation. Tracy allows inspection of such misbehavior.
\subsection{Remote or embedded telemetry}
Tracy uses the client-server model to enable a wide range of use-cases. For example, a game on a mobile phone may be profiled over the wireless connection, with the profiler running on a desktop computer. It is also possible to embed the visualization front-end in the profiled application, making the profiling self-contained\footnote{See section~\ref{embeddingserver} for guidelines.}.
In the Tracy terminology, the profiled application is the \emph{client} and the profiler itself is the \emph{server}. It was named this way because the client is a thin layer that just collects events and sends them for processing and long-term storage on the server. The fact that the server needs to connect to the client to begin the profiling session may be a bit confusing at first.
\subsection{Performance impact}
\label{perfimpact}
To check how much slowdown is introduced by using Tracy, let's profile an example application. For this purpose we will use etcpak\footnote{\url{https://bitbucket.org/wolfpld/etcpak}}. Let's use an $8192 \times 8192$ pixel test image as input data and instrument everything down to the $4 \times 4$ pixel block compression function (that's over 4 million blocks to compress).
The resulting timings are presented in table~\ref{PerformanceImpact}. As can be seen, the cost of a single-zone capture (consisting of the zone begin and zone end events) is \textasciitilde 15~\si{\nano\second}.
\begin{table}[h]
\centering
\begin{tabular}[h]{c|c|c|c|c}
\textbf{Output} & \textbf{Zones} & \textbf{Clean run} & \textbf{Profiling run} & \textbf{Difference} \\ \hline
ETC1 & \num{4194568} & 0.94 \si{\second} & 1.003 \si{\second} & +0.063 \si{\second} \\
ETC2 + mip-maps & \num{5592822} & 1.034 \si{\second} & 1.119 \si{\second} & +0.085 \si{\second}
\end{tabular}
\caption{Zone capture time cost.}
\label{PerformanceImpact}
\end{table}
It should be noted that Tracy has a constant initialization cost, needed to perform timer calibration. This cost was subtracted from the profiling run times, as it is irrelevant to the single-zone capture time.
\section{First steps}
Tracy requires compiler support for C++11, Thread Local Storage and a way to work around the static initialization order fiasco. There are no other requirements. The following platforms are confirmed to be working (this is not a complete list):
\begin{itemize}
\item Windows (x86, x64)
\item Linux (x86, x64, ARM, ARM64)
\item Android (ARM, x86)
\item FreeBSD (x64)
\item Cygwin (x64)
\item WSL (x64)
\item OSX (x64)\footnote{In the Apple world everything has to be \emph{think different}. Support for Thread Local Storage is only available since Xcode 8 and not before iOS 9. There's no way to handle static initialization order fiasco, so you will have to make your own workarounds. Zone filtering described in section~\ref{filteringzones} may be of help.}
\end{itemize}
The following compilers are supported:
\begin{itemize}
\item MSVC
\item gcc
\item clang
\end{itemize}
\subsection{Initial client setup}
The recommended way to integrate Tracy into an application is to create a git submodule in the repository (assuming that git is used for version control). This way it is very easy to update Tracy to newly released versions.
If that's not an option, copy all files from the \texttt{tracy/client} and \texttt{tracy/common} directories, along with the source files in Tracy's root directory to your project. Next, add the \texttt{tracy/TracyClient.cpp} source file to the IDE project and/or makefile. That's all. Tracy is now integrated into the application.
In the default configuration Tracy is disabled. This way you don't have to worry about production builds collecting profiling data. You will probably want to create a separate build configuration with the \texttt{TRACY\_ENABLE} define, which enables profiling.
Finally, on Unix make sure that the application is linked with the \texttt{libpthread} and \texttt{libdl} libraries.
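For reference, a minimal Unix build of an instrumented application might look as follows (the file names and paths are illustrative; adapt them to your build system):
\begin{verbatim}
g++ -std=c++11 -O2 -DTRACY_ENABLE \
    main.cpp tracy/TracyClient.cpp -o app -lpthread -ldl
\end{verbatim}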
\subsubsection{Short-lived applications}
In case you want to profile a short-lived program (for example, a compression utility that finishes its work in one second), add the \texttt{TRACY\_NO\_EXIT} define to the build configuration. With this option enabled, Tracy will not exit until an incoming connection is made, even if the application has already finished executing. This mode of operation can also be turned on by setting the \texttt{TRACY\_NO\_EXIT} environment variable to $1$.
\subsubsection{On-demand profiling}
\label{ondemand}
By default Tracy will begin profiling even before the program enters the \texttt{main} function. If you don't want to perform a full capture of application life-time, you may define the \texttt{TRACY\_ON\_DEMAND} macro, which will enable profiling only when there's an established connection with the server.
It should be noted that if on-demand profiling is \emph{disabled} (which is the default), the recorded events will be stored in system memory until a server connection is made and the data can be uploaded\footnote{This memory is never released, but it is reused for collection of further events.}. Depending on the amount of things profiled, the requirements for event storage can easily grow to a couple of gigabytes.
\begin{bclogo}[
noborder=true,
couleur=black!5,
logo=\bcattention
]{Caveats}
The client with on-demand profiling enabled needs to perform additional bookkeeping, in order to present a coherent application state to the profiler. This incurs additional time cost for each profiling event.
\end{bclogo}
\subsubsection{Setup for multi-DLL projects}
In projects that consist of multiple DLLs/shared objects things are a bit different. Compiling \texttt{TracyClient.cpp} into every DLL is not an option, because this would result in several instances of Tracy objects lying around in the process. Instead, we need to pass these instances to the different DLLs so they can be reused there.
For that you need a \emph{main DLL} to which your executable and the other DLLs link. If no such DLL exists, you have to create one explicitly for Tracy. Link the executable and all DLLs which you want to profile to this DLL.
You should compile the main library with the \texttt{tracy/TracyClient.cpp} source file and then add the \texttt{tracy/TracyClientDLL.cpp} file to the source files lists of the executable and the other DLLs.
\subsection{Running the server}
The easiest way to get going is to build the data analyzer, available in the \texttt{profiler} directory. With it you can connect to localhost or remote clients and view the collected data right away.
If you prefer to inspect the data only after a trace has been performed, you may use the command line utility in the \texttt{capture} directory. It will save a data dump that may be later opened in the graphical viewer application.
See section~\ref{capturing} for more information.
\begin{bclogo}[
noborder=true,
couleur=black!5,
logo=\bcbombe
]{Important}
You must use the same version of the Tracy profiler on both client and server! Network protocol mismatch will most likely lead to crashes. Tracy \emph{will not warn} about this!
\end{bclogo}
\subsubsection{Embedding the server in profiled application}
\label{embeddingserver}
While not officially supported, it is possible to embed the server in the same application that is running the client part of Tracy. The details are up to you to figure out.
The following defines may be of interest:
\begin{itemize}
\item \texttt{TRACY\_FILESELECTOR} -- controls whether a system load/save dialog is compiled in. If it's left out, the saved traces will be named \texttt{trace.tracy}.
\item \texttt{TRACY\_NO\_STATISTICS} -- Tracy will perform statistical data collection on the fly, if this macro is \emph{not} defined. This allows extended analysis of the trace (for example, you can perform a live search for matching zones) at a small CPU processing cost and a considerable memory usage increase (at least 10 bytes per zone).
\item \texttt{TRACY\_EXTENDED\_FONT} -- use this define, if you have loaded extra symbol ranges in your font and added icons\footnote{See the \texttt{profiler} utility source code for reference.}. Otherwise, some characters will be replaced with an ASCII compatible version. For example, the micro (\si\micro) symbol will be replaced with \texttt{u}, and \faWarning{} icon will be replaced with \texttt{/!\textbackslash}.
\item \texttt{TRACY\_ROOT\_WINDOW} -- the main profiler view will occupy the whole window if this macro is defined. Additional setup is required for this to work. If you are embedding the server into your application you probably do \emph{not} want this.
\end{itemize}
\subsection{Naming threads}
Remember to set thread names for proper identification of threads. You should use the functions exposed in the \texttt{tracy/common/TracySystem.hpp} header to do so.
Be aware that even if you already have thread naming functionality implemented, some platforms\footnote{Basically everything but the recent Windows releases.} have inadequate system-level capabilities for storing thread names (or none at all), in which case Tracy uses its own internal thread name storage.
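For example, a sketch of naming a worker thread (the exact set of \texttt{SetThreadName} overloads may differ between Tracy versions -- check the header):
\begin{lstlisting}
#include <thread>
#include "tracy/common/TracySystem.hpp"

std::thread worker( [] { /* ... */ } );
// Assumed overload taking a std::thread reference and a name.
tracy::SetThreadName( worker, "Worker thread" );
\end{lstlisting}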
\section{Client markup}
\label{client}
With the aforementioned steps you will be able to connect to the profiled program, but no data collection will be performed. In order to begin profiling, Tracy requires that you manually instrument the application\footnote{Automatic tracing of every entered function is not feasible due to the amount of data it would generate.}. The entire user-facing interface is contained in the \texttt{tracy/Tracy.hpp} header file.
The best way to start is to add markup to the main loop of the application, along with a few functions that are called there. This will give you a rough outline of the time cost of these functions, which you may then further refine by instrumenting functions deeper in the call stack.
\subsection{Handling text strings}
When dealing with Tracy macros, you will encounter two ways of providing string data to the profiler. In both cases you should pass \texttt{const char*} pointers, but there are differences in the expected lifetime of the pointed-to data.
\begin{enumerate}
\item When a macro only accepts a pointer (for example: \texttt{TracyMessageL(text)}), the provided string data must be accessible at any time in program execution (\emph{this also includes the time after exiting the \texttt{main} function}). The string also cannot be changed. This basically means that the only option is to use a string literal (e.g.: \texttt{TracyMessageL("Hello")}).
\item If there's a string pointer with a size parameter (for example: \texttt{TracyMessage(text, size)}), the profiler will allocate an internal temporary buffer to store the data. The pointed-to data is not used afterwards. You should be aware that allocating and copying memory involved in this operation has a small time cost.
\end{enumerate}
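To illustrate both cases (the \texttt{assetId} variable is hypothetical):
\begin{lstlisting}
// A string literal lives for the whole program duration,
// so the pointer-only macro may be used.
TracyMessageL( "Loading assets" );

// Dynamically built text must be passed together with its size,
// so the profiler can copy it to an internal buffer.
char buf[64];
const int len = snprintf( buf, sizeof( buf ), "Loading asset %i", assetId );
TracyMessage( buf, len );
\end{lstlisting}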
\subsection{Marking frames}
\label{markingframes}
\begin{bclogo}[
noborder=true,
couleur=black!5,
logo=\bclampe
]{Do I need this?}
This step is optional, as some applications do not use the concept of a frame.
\end{bclogo}
To slice the program's execution recording into frame-sized chunks\footnote{Each frame starts immediately after the previous has ended.}, put the \texttt{FrameMark} macro after you have completed rendering the frame. Ideally that would be right after the swap buffers command.
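For example, in a typical render loop (the update, render and swap calls are illustrative):
\begin{lstlisting}
while( running )
{
    UpdateWorld();
    RenderScene();
    SwapBuffers();
    FrameMark;   // the frame has just been presented
}
\end{lstlisting}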
\subsubsection{Secondary frame sets}
In some cases you may want to track more than one set of frames in your program. To do so, you may use the \texttt{FrameMarkNamed(name)} macro, which will create a new set of frames for each unique name you provide.
\subsubsection{Discontinuous frames}
Some types of frames are discontinuous by nature. For example, a physics processing step in a game loop, or an audio callback running on a separate thread. These kinds of workloads are executed periodically, with a pause between each run. Tracy can also track these kinds of frames.
To mark the beginning of a discontinuous frame use the \texttt{FrameMarkStart(name)} macro. After the work is finished, use the \texttt{FrameMarkEnd(name)} macro.
\begin{bclogo}[
noborder=true,
couleur=black!5,
logo=\bcbombe
]{Important}
\begin{itemize}
\item Frame types \emph{must not} be mixed. For each frame set, identified by a unique name, use either continuous or discontinuous frames only!
\item You \emph{must} issue the \texttt{FrameMarkStart} and \texttt{FrameMarkEnd} macros in proper order. Be extra careful, especially if multi-threading is involved. Note that the profiler event data is unordered between threads, so you can't start a frame in one thread and end it in another one.
\end{itemize}
\end{bclogo}
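For example, a physics step executed periodically might be marked like this (the work functions are illustrative):
\begin{lstlisting}
void PhysicsStep()
{
    FrameMarkStart( "Physics" );
    UpdateRigidBodies();
    ResolveCollisions();
    FrameMarkEnd( "Physics" );
}
\end{lstlisting}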
\subsection{Marking zones}
To record a zone's\footnote{A \texttt{zone} represents the life-time of a special on-stack profiler variable. Typically it would exist for the duration of a whole scope of the profiled function, but you can also measure time spent in scopes of a for-loop or an if-branch.} execution time add the \texttt{ZoneScoped} macro at the beginning of the scope you want to measure. This will automatically record the function name, source file name and location. Optionally you may use the \texttt{ZoneScopedC(0xRRGGBB)} macro to set a custom color for the zone. Note that the color value will be constant in the recording (don't try to parametrize it). You may also set a custom name for the zone, using the \texttt{ZoneScopedN(name)} macro. Color and name may be combined by using the \texttt{ZoneScopedNC(name, color)} macro.
Use the \texttt{ZoneText(text, size)} macro to add a custom text string that will be displayed along the zone information (for example, name of the file you are opening).
If you want to set the zone name on a per-call basis, you may do so using the \texttt{ZoneName(text, size)} macro. This name won't be used in the process of grouping the zones for statistical purposes.
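A function instrumented with these macros might look as follows (the color value and the loading code are illustrative):
\begin{lstlisting}
#include <cstring>
#include "tracy/Tracy.hpp"

void LoadTexture( const char* path )
{
    ZoneScopedNC( "Load texture", 0x3377ff );   // named zone with a custom color
    ZoneText( path, strlen( path ) );           // attach the file name to this zone
    // ... perform the actual loading ...
}
\end{lstlisting}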
\begin{bclogo}[
noborder=true,
couleur=black!5,
logo=\bclampe
]{Color palette}
You may use named colors predefined in \texttt{common/TracyColor.hpp} (included by \texttt{Tracy.hpp}). Visual reference: \url{https://en.wikipedia.org/wiki/X11_color_names}.
\end{bclogo}
\subsubsection{Multiple zones in one scope}
\label{multizone}
Using the \texttt{ZoneScoped} family of macros creates a stack variable named \texttt{\_\_\_tracy\_scoped\_zone}. If you want to measure more than one zone in the same scope, you will need to use the \texttt{ZoneNamed} macros, which require that you provide a name for the created variable. For example, instead of \texttt{ZoneScopedN("Zone name")}, you would use \texttt{ZoneNamedN(variableName, "Zone name", true)}\footnote{The last parameter is explained in section~\ref{filteringzones}.}.
The \texttt{ZoneText} and \texttt{ZoneName} macros work only for the zones created using the \texttt{ZoneScoped} macros. For the \texttt{ZoneNamed} macros, you will need to invoke the methods \texttt{Text} or \texttt{Name} of the variable you have created.
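For example, two zones in a single scope, with per-call text set through the variable's method (the \texttt{itemName} variable is hypothetical):
\begin{lstlisting}
{
    ZoneNamedN( workZone, "Work item", true );
    workZone.Text( itemName, strlen( itemName ) );   // equivalent of ZoneText

    ZoneNamedN( auxZone, "Bookkeeping", true );
    // ...
}
\end{lstlisting}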
\subsubsection{Filtering zones}
\label{filteringzones}
Zone logging can be disabled on a per-zone basis by making use of the \texttt{ZoneNamed} macros. Each of these macros takes an \texttt{active} argument (\texttt{true} in the example above), which determines whether the zone should be logged.
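Since the argument is evaluated at run time, it may be used to log only the interesting cases, for example (the size check is hypothetical):
\begin{lstlisting}
ZoneNamedN( zone, "Process request", request.size > 1024 );
\end{lstlisting}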
\subsection{Marking locks}
Modern programs must use multi-threading to achieve the full performance capability of the CPU. Correct execution requires claiming exclusive access to data shared between threads. When many threads want to enter the same critical section at once, the application's multi-threaded performance advantage is nullified. To address this problem, Tracy can collect and display lock interactions between threads.
To mark a lock (mutex) for event reporting, use the \texttt{TracyLockable(type, varname)} macro. Note that the lock must implement the Mutex requirement\footnote{\url{https://en.cppreference.com/w/cpp/named_req/Mutex}} (i.e.\ there's no support for timed mutexes). For a concrete example, you would replace the line
\begin{lstlisting}
std::mutex m_lock;
\end{lstlisting}
with
\begin{lstlisting}
TracyLockable(std::mutex, m_lock);
\end{lstlisting}
Alternatively, you may use \texttt{TracyLockableN(type, varname, description)} to provide a custom lock name.
The standard \texttt{std::lock\_guard} and \texttt{std::unique\_lock} wrappers should use the \texttt{LockableBase(type)} macro for their template parameter (unless you're using C++17, with improved template argument deduction). For example:
\begin{lstlisting}
std::lock_guard<LockableBase(std::mutex)> lock(m_lock);
\end{lstlisting}
To mark the location where a lock is held, use the \texttt{LockMark(varname)} macro after you have obtained the lock. Note that \texttt{varname} must be the lock variable (a reference is also valid). This step is optional.
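Putting it together with the \texttt{m\_lock} variable declared above:
\begin{lstlisting}
std::lock_guard<LockableBase(std::mutex)> lock( m_lock );
LockMark( m_lock );   // records where the lock is held
// ... access the shared data ...
\end{lstlisting}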
Similarly, you can use \texttt{TracySharedLockable}, \texttt{TracySharedLockableN} and \texttt{SharedLockableBase} to mark locks implementing the SharedMutex requirement\footnote{\url{https://en.cppreference.com/w/cpp/named_req/SharedMutex}}. Note that while there's no support for timed mutexes in Tracy, both \texttt{std::shared\_mutex} and \texttt{std::shared\_timed\_mutex} may be used\footnote{Since \texttt{std::shared\_mutex} was only added in C++17, using \texttt{std::shared\_timed\_mutex} is the only way to have shared mutex functionality in C++14.}.
\subsection{Plotting data}
Tracy is able to capture and draw numeric value changes over time. You may use it to analyze draw call counts, number of performed queries, etc. To report data, use the \texttt{TracyPlot(name, value)} macro.
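For example, to plot per-frame counters (the counter variables are hypothetical):
\begin{lstlisting}
// At the end of each frame:
TracyPlot( "Draw calls", int64_t( drawCallCount ) );
TracyPlot( "Queries", int64_t( queryCount ) );
\end{lstlisting}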
\subsection{Message log}
Fast navigation in large data sets and correlating zones with what was happening in the application may be difficult. To ease these issues Tracy provides a message log functionality. You can send messages (for example, your typical debug output) using the \texttt{TracyMessage(text, size)} macro. Alternatively, use \texttt{TracyMessageL(text)} for string literal messages.
\subsection{Memory profiling}
Tracy can monitor memory usage of your application. Knowledge about each performed memory allocation enables the following:
\begin{itemize}
\item Memory usage graph (like in massif, but fully interactive).
\item List of active allocations at program exit (leak list).
\item Visualization of memory map.
\item Ability to rewind view of active allocations and memory map to any point of program execution.
\item Information about memory statistics of each zone.
\item Memory allocation hot-spot tree.
\end{itemize}
To mark memory events, use the \texttt{TracyAlloc(ptr, size)} and \texttt{TracyFree(ptr)} macros. Typically you would do that in overloads of \texttt{operator new} and \texttt{operator delete}.
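A minimal sketch of global operator overloads reporting to Tracy (error handling is omitted for brevity):
\begin{lstlisting}
#include <cstdlib>
#include "tracy/Tracy.hpp"

void* operator new( std::size_t count )
{
    void* ptr = malloc( count );
    TracyAlloc( ptr, count );
    return ptr;
}

void operator delete( void* ptr ) noexcept
{
    TracyFree( ptr );
    free( ptr );
}
\end{lstlisting}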
\subsection{Lua support}
To profile Lua code using Tracy, include the \texttt{tracy/TracyLua.hpp} header file in your Lua wrapper and call the \texttt{tracy::LuaRegister(lua\_State*)} function to add instrumentation support.
In the Lua code, add \texttt{tracy.ZoneBegin()} and \texttt{tracy.ZoneEnd()} calls to mark execution zones. Calling \texttt{ZoneEnd} is required, because there is no automatic destruction of variables in Lua and we don't know when the garbage collection will be performed. \emph{Double check that you have covered all return paths!}
Use \texttt{tracy.ZoneBeginN(name)} if you want to set a custom zone name\footnote{While technically this name doesn't need to be constant, like in the \texttt{ZoneScopedN} macro, it should be, as it is used to group the zones together. This grouping is then used to display various statistics in the profiler. You may still set the per-call name using the \texttt{tracy.ZoneName} method.}.
Use \texttt{tracy.ZoneText(text)} to set zone text.
Use \texttt{tracy.Message(text)} to send messages.
Use \texttt{tracy.ZoneName(text)} to set zone name on a per-call basis.
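A sketch of the C++ side, assuming the standard Lua C API (\texttt{luaL\_newstate}, \texttt{luaL\_openlibs}, \texttt{luaL\_dostring}):
\begin{lstlisting}
#include <lua.hpp>
#include "tracy/TracyLua.hpp"

void RunScript()
{
    lua_State* L = luaL_newstate();
    luaL_openlibs( L );
    tracy::LuaRegister( L );   // adds the tracy.* instrumentation functions

    luaL_dostring( L,
        "tracy.ZoneBeginN('Lua work')\n"
        "-- ... do some work ...\n"
        "tracy.ZoneEnd()" );

    lua_close( L );
}
\end{lstlisting}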
Lua instrumentation needs to perform additional work (including memory allocation) to store source location. This approximately doubles the data collection cost.
Even if Tracy is disabled, you still have to pay the no-op function call cost. To prevent that you may want to use the \texttt{tracy::LuaRemove(char* script)} function, which will replace the instrumentation calls with white-space. This function does nothing if the profiler is enabled.
\subsection{GPU profiling}
Tracy provides bindings for profiling OpenGL and Vulkan execution time on GPU.
Note that the CPU and GPU timers may not be synchronized. You can correct the resulting desynchronization in the profiler's options.
\subsubsection{OpenGL}
You will need to include the \texttt{tracy/TracyOpenGL.hpp} header file and declare each of your rendering contexts using the \texttt{TracyGpuContext} macro (typically you will only have one context). Tracy expects no more than one context per thread and no context migration.
To mark a GPU zone use the \texttt{TracyGpuZone(name)} macro, where \texttt{name} is a string literal name of the zone. Alternatively you may use \texttt{TracyGpuZoneC(name, color)} to specify zone color.
You also need to periodically collect the GPU events using the \texttt{TracyGpuCollect} macro. A good place to do it is after the swap buffers function call.
\begin{bclogo}[
noborder=true,
couleur=black!5,
logo=\bcattention
]{Caveats}
\begin{itemize}
\item GPU profiling is not supported on OSX or iOS\footnote{Because Apple is unable to implement standards properly.}.
\item Android devices do work, provided the GPU drivers are not broken. Disjoint events are not currently handled, so some readings may be a bit spotty.
\item Nvidia drivers are unable to provide consistent timing results when two OpenGL contexts are used simultaneously.
\item Calling the \texttt{TracyGpuCollect} macro is a fairly slow operation (a couple of \si{\micro\second}).
\end{itemize}
\end{bclogo}
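Putting the above together, a sketch of the OpenGL instrumentation points (context creation, drawing and buffer swapping are illustrative):
\begin{lstlisting}
#include "tracy/TracyOpenGL.hpp"

// Once, after the rendering context has been created and made current:
TracyGpuContext;

// In the render loop:
{
    TracyGpuZone( "Render scene" );
    DrawScene();
}
SwapBuffers();
TracyGpuCollect;
\end{lstlisting}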
\subsubsection{Vulkan}
Similarly, for Vulkan support you should include the \texttt{tracy/TracyVulkan.hpp} header file and initialize the Vulkan instance using the \texttt{TracyVkContext(physdev, device, queue, cmdbuf)} macro. Cleanup is performed using the \texttt{TracyVkDestroy()} macro. Currently you can't track more than one instance.
The physical device, logical device, queue and command buffer must be related to each other. The queue must support graphics or compute operations. The command buffer must be in the initial state and must be resettable. It will be rerecorded and submitted to the queue multiple times, and it will be in the executable state on exit from the initialization function.
To mark a GPU zone use the \texttt{TracyVkZone(cmdbuf, name)} macro, where \texttt{name} is a string literal name of the zone. Alternatively you may use \texttt{TracyVkZoneC(cmdbuf, name, color)} to specify zone color. The provided command buffer must be in the recording state.
You also need to periodically collect the GPU events using the \texttt{TracyVkCollect(cmdbuf)} macro\footnote{It is considerably faster than the OpenGL's \texttt{TracyGpuCollect}.}. The provided command buffer must be in the recording state and outside of a render pass instance.
\begin{bclogo}[
noborder=true,
couleur=black!5,
logo=\bcattention
]{Caveats}
Vulkan support is very bare at the moment. Multi-threaded submission of commands to command buffers is currently not supported.
\end{bclogo}
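A sketch following the macros described above (the Vulkan handles are illustrative and their creation is omitted):
\begin{lstlisting}
#include "tracy/TracyVulkan.hpp"

// Once, after the device, queue and a resettable command buffer are available:
TracyVkContext( physicalDevice, device, queue, initCmdBuffer );

// While recording a command buffer:
{
    TracyVkZone( cmdBuffer, "Render pass" );
    // ... vkCmd* calls ...
}

// Periodically, outside of a render pass instance:
TracyVkCollect( cmdBuffer );

// At shutdown:
TracyVkDestroy();
\end{lstlisting}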
\subsubsection{Multiple zones in one scope}
Putting more than one GPU zone macro in a single scope poses the same problem as with the \texttt{ZoneScoped} macros, described in section~\ref{multizone} (but this time the variable name is \texttt{\_\_\_tracy\_gpu\_zone}).
To solve this problem, in case of OpenGL use the \texttt{TracyGpuNamedZone} macro in place of \texttt{TracyGpuZone} (or the color variant). The same applies to Vulkan -- replace \texttt{TracyVkZone} with \texttt{TracyVkNamedZone}.
Remember that you need to provide your own name for the created stack variable as the first parameter to the macros.
\subsection{Collecting call stacks}
Tracy can capture true call stacks on selected platforms (Windows, Linux, Android). This is done by using macros with the \texttt{S} postfix, which require an additional parameter specifying the depth of the call stack to be captured. The greater the depth, the longer it will take to perform the capture. Currently you can use the following macros: \texttt{ZoneScopedS}, \texttt{ZoneScopedNS}, \texttt{ZoneScopedCS}, \texttt{ZoneScopedNCS}, \texttt{TracyAllocS}, \texttt{TracyFreeS}, \texttt{TracyGpuZoneS}, \texttt{TracyGpuZoneCS}, \texttt{TracyVkZoneS}, \texttt{TracyVkZoneCS}, and the named variants.
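For example, to capture a zone together with up to ten frames of the call stack leading to it, or to record an allocation with its call stack (the allocation itself is illustrative):
\begin{lstlisting}
ZoneScopedS( 10 );

void* ptr = malloc( count );
TracyAllocS( ptr, count, 12 );   // allocation event with a 12-frame call stack
\end{lstlisting}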
Be aware that call stack collection is a relatively slow operation. Table~\ref{CallstackTimes} shows how long it took to perform a single capture of varying depth on multiple CPU architectures.
\begin{table}[h]
\centering
\begin{tabular}[h]{c|c|c}
\textbf{Depth} & \textbf{x86} & \textbf{x64} \\ \hline
1 & 37 \si{\nano\second} & 97 \si{\nano\second} \\
5 & 51 \si{\nano\second} & 312 \si{\nano\second} \\
10 & 71 \si{\nano\second} & 468 \si{\nano\second} \\
20 & 84 \si{\nano\second} & 517 \si{\nano\second}
\end{tabular}
\caption{Call stack capture times.}
\label{CallstackTimes}
\end{table}
\begin{bclogo}[
noborder=true,
couleur=black!5,
logo=\bcattention
]{Debugging symbols}
To have proper call stack information, the profiled application must be compiled with debugging symbols enabled. You can achieve that in the following way:
\begin{itemize}
\item On MSVC open the project properties and go to \emph{Linker\textrightarrow Debugging\textrightarrow Generate Debug Info}, where the \emph{Generate Debug Information} option should be selected.
\item On gcc or clang remember to specify the debugging information \texttt{-g} parameter during compilation and omit the strip symbols \texttt{-s} parameter. Link the executable with an additional option \texttt{-rdynamic} (or \texttt{-{}-export-dynamic}, if you are passing parameters directly to the linker).
\end{itemize}
\end{bclogo}
\section{Capturing the data}
\label{capturing}
After the client application has been instrumented, you will want to connect to it using a server.
\subsection{Command line}
You can capture a trace using a command line utility contained in the \texttt{capture} directory. To use it you will need to provide two parameters:
\begin{itemize}
\item \texttt{-a address} -- specifies the IP address (or a domain name) of the client application.
\item \texttt{-o output.tracy} -- the file name of the resulting trace.
\end{itemize}
If there is no client running at the given address, the server will wait until a connection can be made. During the capture the following information will be displayed:
\begin{verbatim}
% ./capture -a 127.0.0.1 -o trace
Connecting to 127.0.0.1...
Queue delay: 9 ns
Timer resolution: 6 ns
1.90 Mbps | Ratio: 40.8% | Real: 4.67 Mbps | Mem: 77.57 MB
\end{verbatim}
The \emph{queue delay} and \emph{timer resolution} parameters are calibration results of timers used by the client. The next line is a status bar, which presents: network connection speed, connection compression ratio, the resulting uncompressed data rate and total memory usage of the utility.
\subsection{Interactive profiling}
If you want to look at the profile data in real-time (or load a saved trace file), you can use the data analysis utility contained in the \texttt{profiler} directory.
After starting the application, you will be greeted with a welcome dialog, presenting a bunch of useful links (\faBook{}~\emph{User Manual}, \faGlobe{}~\emph{Homepage} and \faVideoCamera{}~\emph{Tutorial}), along with the client address entry field and the \faWifi{}~\emph{Connect} button. A saved trace can be loaded from disk using the \faFolderOpen{}~\emph{Open saved trace} button.
Both connecting to a client and opening a saved trace will present you with the main profiler view, which you can use to analyze the data.
If this is a real-time capture, you will also see the connection window, with the capture status similar to the one displayed by the command line utility. You can use the \faSave{}~\emph{Save trace} button to save the current profile data to a file.
\subsection{Connection speed}
Tracy will happily saturate a 1~Gbps network connection, as it can process up to 6~Gbps of uncompressed data. Note that at such data rates, the resulting capture will need to allocate about 1~GB of RAM per second.
\subsection{Memory usage}
The captured data is stored in RAM and only written to disk when the capture finishes. This can result in memory exhaustion when you are capturing massive amounts of profile data, or even in normal usage situations when the capture is performed over a long stretch of time. The recommended usage pattern is to perform moderate instrumentation of the client code and to limit the capture time to what is strictly necessary.
In some cases it may be useful to perform an \emph{on-demand} capture, as described in section~\ref{ondemand}. In such a case you will be able to profile only the interesting scenario (e.g.\ behavior during loading of a level in a game), ignoring all the unneeded data.
If you truly need to capture large traces, you have two options. Either buy more RAM, or use a large swap file on a fast disk drive\footnote{The operating system is able to manage memory paging much better than Tracy would be ever able to.}.
\subsection{Trace versioning}
Each new release of Tracy changes the internal format of trace files. While there is a backwards compatibility layer, allowing loading of traces created by previous versions of Tracy in new releases, it won't be there forever. You are thus advised to upgrade your traces using the utility contained in the \texttt{update} directory.
To use it, you will need to provide the input file and the output file. The program will print a short summary when it finishes:
\begin{verbatim}
% ./update old.tracy new.tracy
old.tracy (0.3.0) -> new.tracy (0.4.0)
\end{verbatim}
The new file contains the same data as the old one, but in the updated internal representation. Note that to perform an upgrade, the whole trace needs to be loaded into memory.
\section{Practical considerations}
While the data collection is very lightweight, it is not completely free. Each recorded zone event has a cost, which Tracy tries to calculate and display on the time-line view, as a red zone. Note that this is an approximation of the real cost, which ignores many important factors. For example, you can't determine the impact of cache effects. The CPU frequency may be reduced in some situations, which will increase the recorded time, but the displayed profiler cost will not compensate for that.
\newpage
\appendix
\appendixpage
\section{License}
\verbatiminput{../LICENSE.}
\section{List of contributors}
\verbatiminput{../AUTHORS.}
\end{document}