Initial draft of the user manual.

This commit is contained in:
Bartosz Taudul 2018-08-01 23:45:40 +02:00
parent 1b42946e25
commit a211e57e3c
2 changed files with 230 additions and 0 deletions

.gitignore vendored

@ -12,3 +12,9 @@ imgui.ini
test/tracy_test
test/tracy_test.exe
*/build/unix/*-*
manual/tracy.aux
manual/tracy.log
manual/tracy.out
manual/tracy.pdf
manual/tracy.synctex.gz
manual/tracy.toc

manual/tracy.tex Normal file

@ -0,0 +1,224 @@
% !TeX spellcheck = en_US
\documentclass[hidelinks,titlepage,a4paper]{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{lmodern}
\usepackage{newpxtext,newpxmath}
\linespread{1.05} % Line spacing - Palatino needs more space between lines
\usepackage{microtype}
\usepackage{siunitx}
\usepackage[hyphens]{url}
\usepackage{hyperref} % For hyperlinks in the PDF
\usepackage[hmarginratio=1:1,top=32mm,columnsep=20pt]{geometry} % Document margins
\geometry{a4paper,textwidth=6.5in,hmarginratio=1:1,
textheight=9in,vmarginratio=1:1,heightrounded}
\usepackage{fancyhdr} % Headers and footers
\pagestyle{fancy} % All pages have headers and footers
\fancyhead{} % Blank out the default header
\fancyfoot{} % Blank out the default footer
\fancyhead[L]{Tracy Profiler}
\fancyhead[R]{The user manual}
\fancyfoot[RO]{\thepage} % Custom footer text
\begin{document}
\begin{titlepage}
\centering
{\fontsize{120}{140}\selectfont Tracy Profiler}
\vspace{50pt} {\Huge\fontfamily{lmtt}\selectfont The user manual}
\vfill
\large\textbf{Bartosz Taudul}
\vfill
\url{https://bitbucket.org/wolfpld/tracy}
\end{titlepage}
\tableofcontents
\newpage
\section{A quick look at Tracy}
Tracy is a real-time, nanosecond resolution \emph{frame profiler} that can be used for remote or embedded telemetry of applications. It can profile CPU (C++, Lua), GPU (OpenGL, Vulkan) and memory. It can also monitor locks held by threads and show where contention occurs.
In contrast with \emph{statistical profilers} (such as VTune, perf or Very Sleepy), Tracy requires manual markup of the source code. In return, it allows frame-by-frame inspection of the program execution. You will be able to see exactly which functions are called, how much time is spent in them, and how they interact with each other in a multi-threaded environment. This is impossible by design in statistical profilers, which work by periodically sampling the \emph{program counter} register to see which part of the code is executing.
Even though Tracy is a \emph{frame} profiler, with the emphasis on analysis of \emph{frame time} in real-time applications, it does work with utilities that do not employ the concept of a frame. Nothing prevents profiling of, for example, a compression tool or an event-driven UI application.
Close analogues of Tracy are RAD Telemetry, Brofiler and microprofile.
Now let's take a closer look at the marketing blurb.
\subsection{Real-time}
This claim can be described in the following two ways:
\begin{enumerate}
\item The profiled application is not slowed down by profiling. The act of recording a profiling event has virtually zero cost -- it only takes \textasciitilde 8~\si{\nano\second}. Even on low-power mobile devices there's no perceptible impact on execution speed.
\item The profiler itself works in real-time, without the need to process collected data in a complex way. Actually, it is quite inefficient in the way it works, as the data it presents is calculated anew each frame. And yet it can run at 60 frames per second.
\end{enumerate}
\subsection{Nanosecond resolution}
It is hard to imagine how long a nanosecond is. One good analogy is to compare it with a measure of length. Let's say that one second is one meter (the average doorknob is at a height of about one meter).
One millisecond ($\frac{1}{1000}$ of a second) would then be the length of a millimeter. The average size of a red ant or the width of a pencil is 5 or 6~\si{\milli\metre}. A modern game running at 60 frames per second has only about 16~\si{\milli\second} to update the game world and render the entire scene.
One microsecond ($\frac{1}{1000}$ of a millisecond) in our comparison equals one micron. The diameter of a typical bacterium ranges from 1 to 10 microns. The diameter of a red blood cell, or the width of a strand of spider web silk, is about 7~\si{\micro\metre}.
And finally, one nanosecond ($\frac{1}{1000}$ of a microsecond) would be one nanometer. A modern microprocessor transistor gate, the width of a DNA helix, or the thickness of a cell membrane are in the range of 5~\si{\nano\metre}. In one~\si{\nano\second} light can travel only 30~\si{\centi\metre}.
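As a rough check of two figures used in this comparison, the frame budget at 60 frames per second and the distance covered by light in one nanosecond can be worked out as follows. The speed of light is taken here as the rounded value of $3 \times 10^{8}$ metres per second, which is an assumption of this estimate.
\[
t_{\mathrm{frame}} = \frac{\SI{1}{\second}}{60} \approx \SI{16.7}{\milli\second},
\qquad
d = c \cdot t \approx \SI{3e8}{\metre\per\second} \times \SI{1e-9}{\second} = \SI{0.3}{\metre} = \SI{30}{\centi\metre}.
\]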
Tracy can achieve single-digit nanosecond measurement resolution by using hardware timing mechanisms on the x86 and ARM architectures\footnote{In both 32 and 64~bit variants.}. Other profilers may rely on the timers provided by the operating system, which have significantly lower resolution (about 300~\si{\nano\second} -- 1~\si{\micro\second}). Such coarse resolution is enough to hide the subtle impact of cache access optimization, etc.
\subsection{Frame profiler}
Tracy is aimed at understanding the inner workings of a tight game (or interactive application) loop. That's why it slices the execution time of a program using the \emph{frame}\footnote{A frame is used to describe a single image displayed on the screen by the game, preferably 60 times per second to achieve smooth animation.} as a basic work-unit. The most interesting frames are the ones that took longer than the allocated time, producing visible hitches in the on-screen animation. Tracy allows inspection of such misbehavior.
\subsection{Remote or embedded telemetry}
Tracy uses the client-server model to enable a wide range of use-cases. For example, a game on a mobile phone may be profiled over the wireless connection, with the profiler running on a desktop computer. It is also possible to embed the visualization front-end in the profiled application, making the profiling self-contained.
In the Tracy terminology, the profiled application is the \emph{client} and the profiler itself is the \emph{server}. It was named this way because the client is a thin layer that just collects events and sends them for processing and long-term storage on the server. The fact that the server needs to connect to the client to begin the profiling session may be a bit confusing at first.
\section{First steps}
\subsection{Initial client setup}
The recommended way to integrate Tracy into an application is to create a git submodule in the repository (assuming that git is used for version control). This way it is very easy to update Tracy to newly released versions.
If that's not an option, copy the files from the \texttt{tracy/client} and \texttt{tracy/common} directories, along with the source files in Tracy's root directory, to your project. Next, add the \texttt{tracy/TracyClient.cpp} source file to the IDE project and/or makefile. That's all. Tracy is now integrated into the application.
In the default configuration Tracy is disabled. This way you don't have to worry that the production builds will perform profiling data collection. You will probably want to create a separate build configuration, with the \texttt{TRACY\_ENABLE} define, which enables profiling.
In case you want to profile a short-lived program (for example, a compression utility that finishes its work in one second), add the \texttt{TRACY\_NO\_EXIT} define to the build configuration. With this option Tracy will not exit until an incoming connection is made, even if the application has already finished executing. This mode of operation can also be achieved by setting the \texttt{TRACY\_NO\_EXIT} environment variable to $1$.
By default Tracy will begin profiling even before the program enters the \texttt{main} function. If you don't want to perform a complete application life-time capture, you may define the \texttt{TRACY\_ON\_DEMAND} macro, which will enable profiling only when there's an established connection with the server.
Finally, on Unix make sure that the application is linked with the \texttt{libpthread} and \texttt{libdl} libraries.
\subsubsection{Setup for multi-DLL projects}
In projects that consist of multiple DLLs / shared objects things are a bit different. Compiling \texttt{TracyClient.cpp} into every DLL is not an option, because this would result in several instances of Tracy objects lying around in the process. Instead, a single set of instances needs to be passed to the different DLLs and reused there.
For that you need a main DLL to which your executable and the other DLLs link. If that doesn't exist you have to create one explicitly for Tracy. Link the executable and all DLLs which you want to profile to this DLL.
Add the \texttt{tracy/TracyClient.cpp} file to the source files list of the main DLL and the \texttt{tracy/TracyClientDLL.cpp} to the source files lists of the executable and the other DLLs.
\subsection{Running the server}
The easiest way to get going is to build the standalone server, available in the \texttt{standalone} directory. You can connect to localhost or remote clients and view the collected data right away.
If you prefer to inspect the data only after a trace has been performed, you may use the command line utility in the \texttt{capture} directory. It will save a data dump that may be later opened in the graphical viewer application.
Alternatively, you may want to embed the server in your application, the same one that is running the client part of Tracy.
\section{Client markup}
With the aforementioned steps you will be able to connect to the profiled program, but there won't be any data collection performed. In order to begin profiling, Tracy requires that you manually instrument the application\footnote{Automatic tracing of every entered function is not feasible due to the amount of data that would generate.}. The entire user-facing interface is contained in the \texttt{tracy/Tracy.hpp} header file.
\subsection{Marking frames}
To slice the program's execution recording into frame-sized chunks, put the \texttt{FrameMark} macro after you have completed rendering the frame. Ideally that would be right after the swap buffers command.
Note that this step is optional, as some applications do not use the concept of a frame.
\subsection{Marking zones}
To record a zone's execution time, add the \texttt{ZoneScoped} macro at the beginning of the scope you want to measure. This will automatically record the function name, source file name and location. Optionally you may use the \texttt{ZoneScopedC(0xRRGGBB)} macro to set a custom color for the zone. Note that the color value will be constant in the recording (don't try to parametrize it). You may also set a custom name for the zone, using the \texttt{ZoneScopedN(name)} macro, where name is a string literal. Color and name may be combined by using the \texttt{ZoneScopedNC(name, color)} macro.
Use the \texttt{ZoneText(const char* text, size\_t size)} macro to add a custom text string that will be displayed alongside the zone information (for example, the name of the file you are opening). Note that every time \texttt{ZoneText} is invoked, a memory allocation is performed to store an internal copy of the data. The provided string is not used by Tracy after \texttt{ZoneText} returns.
If you want to set the zone name on a per-call basis, you may do so using the \texttt{ZoneName(text, size)} macro.
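The frame and zone macros described above could be combined roughly as follows. This is a minimal sketch, not code taken from Tracy: the \texttt{LoadAsset} and \texttt{RunFrames} functions and the \texttt{level.dat} file name are hypothetical, and the fallback no-op macro definitions are an assumption added only so that the example compiles when the Tracy header is not available.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Use the real header when it is on the include path. Otherwise fall back to
// no-op stand-ins (an assumption of this sketch, not part of Tracy itself).
#if defined(__has_include)
#  if __has_include("tracy/Tracy.hpp")
#    include "tracy/Tracy.hpp"
#    define SKETCH_HAS_TRACY
#  endif
#endif
#ifndef SKETCH_HAS_TRACY
#  define FrameMark
#  define ZoneScoped
#  define ZoneScopedN(name)
#  define ZoneText(text, size)
#endif

// Hypothetical workload: pretends to load a file, returns the path length.
std::size_t LoadAsset(const std::string& path)
{
    ZoneScoped;                           // zone named after this function
    ZoneText(path.c_str(), path.size());  // attach the file name to the zone
    return path.size();
}

// Hypothetical main loop: runs a fixed number of frames.
int RunFrames(int frameCount)
{
    assert(frameCount >= 0);
    int rendered = 0;
    for (int i = 0; i < frameCount; i++)
    {
        {
            ZoneScopedN("Update");        // custom name, must be a literal
            LoadAsset("level.dat");
        }
        {
            ZoneScopedN("Render");
            rendered++;
        }
        // The swap buffers call of the real application would go here.
        FrameMark;                        // marks the end of the frame
    }
    return rendered;
}
```

Each zone lives in its own scope, so its recorded time ends at the closing brace of that scope.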
\subsection{Marking locks}
Tracy can collect and display lock interactions in threads. To mark a lock (mutex) for event reporting, use the \texttt{TracyLockable(type, varname)} macro. Note that the lock must implement the Lockable concept (i.e. there's no support for timed mutexes). For a concrete example, you would replace the line \texttt{std::mutex m\_lock} with \texttt{TracyLockable(std::mutex, m\_lock)}. You may use \texttt{TracyLockableN(type, varname, description)} to provide a custom lock name.
The standard \texttt{std::lock\_guard} and \texttt{std::unique\_lock} wrappers should use the \texttt{LockableBase(type)} macro for their template parameter (unless you're using C++17, with improved template argument deduction). For example, \texttt{std::lock\_guard<LockableBase(std::mutex)> lock(m\_lock)}.
To mark the location where a lock is held, use the \texttt{LockMark(varname)} macro after you have obtained the lock. Note that \texttt{varname} must be a lock variable (a reference is also valid). This step is optional.
Similarly, you can use \texttt{TracySharedLockable}, \texttt{TracySharedLockableN} and \texttt{SharedLockableBase} to mark locks implementing the SharedMutex concept. Note that while there's no support for timed mutexes in Tracy, both \texttt{std::shared\_mutex} and \texttt{std::shared\_timed\_mutex} may be used.
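The lock markup described in this subsection might look as follows in a small class. This is only a sketch: the \texttt{Counter} class and the \texttt{CountTo} helper are hypothetical, and the fallback macro definitions are an assumption that lets the example compile without the Tracy header.

```cpp
#include <cassert>
#include <mutex>

// Use the real header when it is on the include path. Otherwise fall back to
// plain stand-ins (an assumption of this sketch, not part of Tracy itself).
#if defined(__has_include)
#  if __has_include("tracy/Tracy.hpp")
#    include "tracy/Tracy.hpp"
#    define SKETCH_HAS_TRACY
#  endif
#endif
#ifndef SKETCH_HAS_TRACY
#  define TracyLockable(type, varname) type varname
#  define LockableBase(type) type
#  define LockMark(varname) (void)varname
#endif

// Hypothetical class guarding a counter with an instrumented mutex.
class Counter
{
public:
    void Increment()
    {
        // The wrapper's template parameter goes through LockableBase.
        std::lock_guard<LockableBase(std::mutex)> lock(m_lock);
        LockMark(m_lock);  // optional: records where the lock is held
        m_value++;
    }

    int Value()
    {
        std::lock_guard<LockableBase(std::mutex)> lock(m_lock);
        return m_value;
    }

private:
    TracyLockable(std::mutex, m_lock);  // replaces "std::mutex m_lock;"
    int m_value = 0;
};

// Hypothetical helper: increments a fresh counter n times.
int CountTo(int n)
{
    assert(n >= 0);
    Counter counter;
    for (int i = 0; i < n; i++) counter.Increment();
    return counter.Value();
}
```

The sketch runs on a single thread for brevity. Contention would only show up in the profiler once several threads call \texttt{Increment} concurrently.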
\subsection{Plotting data}
Tracy is able to capture and draw value changes over time. You may use it to analyze draw call count, number of performed queries, etc. To report data, use the \texttt{TracyPlot(name, value)} macro.
\subsection{Message log}
Fast navigation in a large data set and correlation of zones with what was happening in the application may be difficult. To ease these issues Tracy provides message log functionality. You can send messages (for example, your typical debug output) using the \texttt{TracyMessage(text, size)} macro (Tracy will allocate memory for message storage). Alternatively, use \texttt{TracyMessageL(text)} for string literal messages. Messages are displayed in a chronological list and in the zone view.
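The plotting and message macros could be used along the following lines. This is a sketch under stated assumptions: the \texttt{SubmitDrawCalls} and \texttt{ReportLevelLoaded} functions, the plot name and the message texts are hypothetical, and the fallback macro definitions exist only so that the example compiles without the Tracy header. The plotted value is stored in a 64-bit integer here as a precaution, in case the macro does not accept every numeric type.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Use the real header when it is on the include path. Otherwise fall back to
// no-op stand-ins (an assumption of this sketch, not part of Tracy itself).
#if defined(__has_include)
#  if __has_include("tracy/Tracy.hpp")
#    include "tracy/Tracy.hpp"
#    define SKETCH_HAS_TRACY
#  endif
#endif
#ifndef SKETCH_HAS_TRACY
#  define TracyPlot(name, value)
#  define TracyMessage(text, size)
#  define TracyMessageL(text)
#endif

// Hypothetical renderer step: counts draw calls and reports the total.
int SubmitDrawCalls(int objectCount)
{
    assert(objectCount >= 0);
    std::int64_t drawCalls = 0;
    for (int i = 0; i < objectCount; i++) drawCalls++;  // one call per object
    TracyPlot("Draw calls", drawCalls);                 // one data point
    return static_cast<int>(drawCalls);
}

// Hypothetical loader hook: logs a literal and a dynamically built message.
std::size_t ReportLevelLoaded(const std::string& levelName)
{
    TracyMessageL("Level loading finished");   // string literal variant
    const std::string msg = "Loaded level: " + levelName;
    TracyMessage(msg.c_str(), msg.size());     // Tracy stores its own copy
    return msg.size();
}
```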
\subsection{Memory profiling}
Tracy can monitor memory usage of your application. Knowledge about each performed memory allocation enables the following:
\begin{itemize}
\item Memory usage graph (like in massif, but fully interactive).
\item List of active allocations at program exit (leak list).
\item Visualization of memory map.
\item Ability to rewind view of active allocations and memory map to any point of program execution.
\item Information about memory statistics of each zone.
\end{itemize}
To mark memory events, use the \texttt{TracyAlloc(ptr, size)} and \texttt{TracyFree(ptr)} macros. Typically you would do that in overloads of operator new and operator delete.
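Overloads of the global allocation operators might be instrumented roughly like this. It is a minimal sketch: it only replaces the basic forms of \texttt{operator new} and \texttt{operator delete}, uses \texttt{malloc} and \texttt{free} as the underlying allocator by assumption, and defines fallback no-op macros so that it compiles without the Tracy header.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Use the real header when it is on the include path. Otherwise fall back to
// no-op stand-ins (an assumption of this sketch, not part of Tracy itself).
#if defined(__has_include)
#  if __has_include("tracy/Tracy.hpp")
#    include "tracy/Tracy.hpp"
#    define SKETCH_HAS_TRACY
#  endif
#endif
#ifndef SKETCH_HAS_TRACY
#  define TracyAlloc(ptr, size)
#  define TracyFree(ptr)
#endif

// Replacement for the global allocation function.
void* operator new(std::size_t count)
{
    if (count == 0) count = 1;        // malloc(0) may return a null pointer
    void* ptr = std::malloc(count);
    if (!ptr) throw std::bad_alloc();
    TracyAlloc(ptr, count);           // report the new allocation
    return ptr;
}

// Replacement for the global deallocation function.
void operator delete(void* ptr) noexcept
{
    TracyFree(ptr);                   // report the release before freeing
    std::free(ptr);
}
```

With these definitions in place, ordinary \texttt{new} and \texttt{delete} expressions in the program are reported without further changes to the calling code.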
\subsection{Lua support}
To profile Lua code using Tracy, include the \texttt{tracy/TracyLua.hpp} header file in your Lua wrapper and call the \texttt{tracy::LuaRegister(lua\_State*)} function to add instrumentation support. In your Lua code, add \texttt{tracy.ZoneBegin()} and \texttt{tracy.ZoneEnd()} calls to mark execution zones. \emph{Double check that you have covered all return paths!} Use \texttt{tracy.ZoneBeginN(name)} to set a custom zone name. Use \texttt{tracy.ZoneText(text)} to set zone text. Use \texttt{tracy.Message(text)} to send messages. Use \texttt{tracy.ZoneName(text)} to set the zone name on a per-call basis.
Even if Tracy is disabled, you still have to pay the no-op function call cost. To prevent that you may want to use the \texttt{tracy::LuaRemove(char* script)} function, which will replace instrumentation calls with white-space. This function does nothing if the profiler is enabled.
\subsection{GPU profiling}
Tracy provides bindings for profiling OpenGL and Vulkan execution time on GPU.
\subsubsection{OpenGL}
You will need to include the \texttt{tracy/TracyOpenGL.hpp} header file and declare each of your rendering contexts using the \texttt{TracyGpuContext} macro (typically you will only have one context). Tracy expects no more than one context per thread and no context migration.
To mark a GPU zone use the \texttt{TracyGpuZone(name)} macro, where name is a string literal name of the zone. Alternatively you may use \texttt{TracyGpuZoneC(name, color)} to specify zone color.
You also need to periodically collect the GPU events using the \texttt{TracyGpuCollect} macro. A good place to do it is after the swap buffers function call.
GPU profiling is not supported on OSX and iOS\footnote{Because Apple is unable to implement standards properly.}. Android devices do work, provided the GPU drivers are not broken. Disjoint events are not currently handled, so some readings may be a bit spotty. Nvidia drivers are unable to provide consistent timing results when two OpenGL contexts are used simultaneously.
\subsubsection{Vulkan}
Include the \texttt{tracy/TracyVulkan.hpp} header file and initialize the Vulkan instance using the \texttt{TracyVkContext(physdev, device, queue, cmdbuf)} macro. Cleanup is performed using the \texttt{TracyVkDestroy()} macro. Currently you can't track more than one instance.
The physical device, logical device, queue and command buffer must be related to each other. The queue must support graphics or compute operations. The command buffer must be in the initial state and be able to be reset. It will be rerecorded and submitted to the queue multiple times and it will be in the executable state on exit from the initialization function.
To mark a GPU zone use the \texttt{TracyVkZone(cmdbuf, name)} macro, where name is a string literal name of the zone. Alternatively you may use \texttt{TracyVkZoneC(cmdbuf, name, color)} to specify zone color. The provided command buffer must be in the recording state.
You also need to periodically collect the GPU events using the \texttt{TracyVkCollect(cmdbuf)} macro. The provided command buffer must be in the recording state and outside of a render pass instance.
\subsection{Collecting call stacks}
Tracy can capture true call stacks on selected platforms (Windows, Linux, Android). This is done by using macros with the \texttt{S} postfix, which require an additional parameter specifying the depth of the call stack to be captured. The greater the depth, the longer the capture will take. Currently you can use the following macros: \texttt{ZoneScopedS}, \texttt{ZoneScopedNS}, \texttt{ZoneScopedCS}, \texttt{ZoneScopedNCS}, \texttt{TracyAllocS}, \texttt{TracyFreeS}, \texttt{TracyGpuZoneS}, \texttt{TracyGpuZoneCS}, \texttt{TracyVkZoneS}, \texttt{TracyVkZoneCS}.
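A zone with call stack capture could be declared as in the sketch below. The \texttt{SumOfSquares} function and the depth of 8 are arbitrary choices made for illustration, and the fallback no-op macro definition is an assumption that lets the example compile without the Tracy header.

```cpp
#include <cassert>

// Use the real header when it is on the include path. Otherwise fall back to
// a no-op stand-in (an assumption of this sketch, not part of Tracy itself).
#if defined(__has_include)
#  if __has_include("tracy/Tracy.hpp")
#    include "tracy/Tracy.hpp"
#    define SKETCH_HAS_TRACY
#  endif
#endif
#ifndef SKETCH_HAS_TRACY
#  define ZoneScopedS(depth)
#endif

// Hypothetical workload: sums the squares of 1..n.
long SumOfSquares(int n)
{
    assert(n >= 0);
    ZoneScopedS(8);  // like ZoneScoped, plus a call stack of depth 8
    long sum = 0;
    for (int i = 1; i <= n; i++) sum += static_cast<long>(i) * i;
    return sum;
}
```

Since a deeper capture takes longer, a small depth is usually a reasonable starting point for frequently entered zones.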
\section{Good practices}
Remember to set thread names for proper identification of threads. You may use the functions exposed in the \texttt{tracy/common/TracySystem.hpp} header to do so.
\section{Practical considerations}
Tracy's time measurement precision is not infinite. It's only as good as the system-provided timers are.
\begin{itemize}
\item On x86 the time resolution depends on the hardware implementation of the \texttt{rdtscp} instruction and typically is a couple of nanoseconds. This may vary from one micro-architecture to another and requires a fairly modern (Sandy Bridge) processor for reliable results.
\item On ARM-based systems Tracy will try to use the timer register (\textasciitilde 40~\si{\nano\second} resolution). If it fails, Tracy falls back to the system-provided timer, which can range in resolution from 250~\si{\nano\second} to 1~\si{\micro\second}.
\end{itemize}
While the data collection is very lightweight, it is not completely free. Each recorded zone event has a cost, which Tracy tries to calculate and display on the time-line view, as a red zone. Note that this is an approximation of the real cost, which ignores many important factors. For example, you can't determine the impact of cache effects. The CPU frequency may be reduced in some situations, which will increase the recorded time, but the displayed profiler cost will not compensate for that.
Lua instrumentation needs to perform additional work (including memory allocation) to store source location. This approximately doubles the data collection cost.
You may use named colors predefined in \texttt{common/TracyColor.hpp} (included by \texttt{Tracy.hpp}). Visual reference: \url{https://en.wikipedia.org/wiki/X11_color_names}.
The Tracy server will perform statistical data collection on the fly, unless the \texttt{TRACY\_NO\_STATISTICS} macro is defined. This allows extended analysis of the trace (for example, you can perform a live search for matching zones) at a small CPU processing cost and a considerable memory usage increase (at least 10 bytes per zone).
\end{document}