Update manual.
commit 76379a761a (parent 1954caa806)

@@ -421,6 +421,101 @@ logo=\bcbombe
\end{itemize}
\end{bclogo}

\subsubsection{Frame images}
\label{frameimages}

It is possible to attach a screen capture of your application to any frame in the main frame set. This can help you see the context of what's happening in various places in the trace. You need to implement retrieval of the image data from the GPU yourself.

Images are sent using the \texttt{FrameImage(image, width, height, offset)} macro, where \texttt{image} is a pointer to BGRA\footnote{The alpha value is ignored, but leaving it out wouldn't map well to the way graphics hardware works.} pixel data, \texttt{width} and \texttt{height} are the image dimensions, which \emph{must be divisible by 4}, and \texttt{offset} specifies the frame lag of the image, i.e.\ how many frames behind the current frame it was captured (see chapter~\ref{screenshotcode}).

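For orientation, a minimal sketch of such a call might look as follows. The \texttt{pixels} buffer is hypothetical and is assumed to already hold a downscaled BGRA image; because no asynchronous transfer is involved in this sketch, the frame lag is zero (the realistic, asynchronous case is shown in chapter~\ref{screenshotcode}).

\begin{lstlisting}
// Hypothetical buffer, assumed to already contain a downscaled 320x180
// BGRA image. Both dimensions must be divisible by 4.
std::vector<uint32_t> pixels(320*180);
// ...fill the buffer with the captured frame...

// The image belongs to the frame that has just ended and there is no
// frame lag, so offset is 0.
FrameImage(pixels.data(), 320, 180, 0);
\end{lstlisting}
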
Handling image data requires a lot of memory and bandwidth\footnote{One uncompressed 1080p image takes 8~MB.}. To keep memory usage sane, you should scale down the captured screen shots to a sensible size, e.g.\ $320\times180$.

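For reference, at 4 bytes per uncompressed BGRA pixel the sizes work out to
\[
1920 \times 1080 \times 4~\mathrm{B} \approx 8.3~\mathrm{MB}, \qquad 320 \times 180 \times 4~\mathrm{B} \approx 0.23~\mathrm{MB}.
\]
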
To further reduce the image data size, frame images are internally compressed using the Ericsson Texture Compression (ETC1) technique\footnote{One pixel is stored in a nibble (4 bits) instead of 32 bits.}, at a small decrease in quality. The compression algorithm is very fast and can be made even faster by enabling SIMD processing, as indicated in table~\ref{EtcSimd}.

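At a nibble per pixel, the compressed $320\times180$ frame image shrinks accordingly:
\[
320 \times 180 \times 0.5~\mathrm{B} = 28.8~\mathrm{kB},
\]
roughly an eighth of the uncompressed 230~kB.
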
\begin{table}[h]
\centering
\begin{tabular}{c|c|c|c}
\textbf{Implementation} & \textbf{Required define} & \textbf{V-sync on (cold cache)} & \textbf{V-sync off (hot cache)} \\ \hline
Reference & --- & 1.45 \si{\milli\second} & 778 \si{\micro\second} \\
x86 SSE4.1 & \texttt{\_\_SSE4\_1\_\_} & 423 \si{\micro\second} & 245 \si{\micro\second}
\end{tabular}
\caption{Compression time of a $320\times180$ image on an i7 8700K}
\label{EtcSimd}
\end{table}

\paragraph{OpenGL screen capture code example}
\label{screenshotcode}

There are many pitfalls associated with retrieving the screen contents efficiently. For example, using \texttt{glReadPixels} and then resizing the image with some library (e.g. \emph{stb\_image\_resize}) is terrible for performance, as it forces a GPU-to-CPU synchronization and performs the downscaling in software. To do things properly, we need to scale the image using the graphics hardware and transfer the data asynchronously, which allows the GPU to run independently of the CPU.

The following example shows how this can be achieved using OpenGL 3.2. More recent OpenGL versions allow doing things even better (for example, by using persistent buffer mapping), but that won't be covered here.

Let's begin by defining the required objects. We need a \emph{texture} to store the resized image, a \emph{framebuffer object} to be able to write to the texture, a \emph{pixel buffer object} to make the image data accessible to the CPU, and a \emph{fence} to know when the data are ready for retrieval. We need everything in \emph{at least} four copies, because the rendering, as we see it in the program, may be ahead of the GPU by a couple of frames. We need an index to access the appropriate data set in a ring-buffer manner. Finally, we need a queue of indices to the data sets we are still waiting for.

\begin{lstlisting}
GLuint m_fiTexture[4];      // downscaled frame image
GLuint m_fiFramebuffer[4];  // FBO used to write into the texture
GLuint m_fiPbo[4];          // PBO for asynchronous read-back by the CPU
GLsync m_fiFence[4];        // signaled when the capture has finished
int m_fiIdx = 0;            // current ring-buffer slot
std::vector<int> m_fiQueue; // slots still waiting for the GPU
\end{lstlisting}

Everything needs to be properly initialized (the cleanup is left for the reader to figure out). Notice that we are using the BGRA pixel format, not RGBA.

\begin{lstlisting}
glGenTextures(4, m_fiTexture);
glGenFramebuffers(4, m_fiFramebuffer);
glGenBuffers(4, m_fiPbo);
for(int i=0; i<4; i++)
{
    // Texture that will receive the downscaled image.
    glBindTexture(GL_TEXTURE_2D, m_fiTexture[i]);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_CLAMP_TO_EDGE);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_CLAMP_TO_EDGE);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, 320, 180, 0, GL_BGRA, GL_UNSIGNED_BYTE, nullptr);

    // Framebuffer object that lets us blit into the texture.
    glBindFramebuffer(GL_FRAMEBUFFER, m_fiFramebuffer[i]);
    glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_TEXTURE_2D,
        m_fiTexture[i], 0);

    // Pixel buffer object that the image will be read back into.
    glBindBuffer(GL_PIXEL_PACK_BUFFER, m_fiPbo[i]);
    glBufferData(GL_PIXEL_PACK_BUFFER, 320*180*4, nullptr, GL_STREAM_READ);
}
\end{lstlisting}

We will now set up the screen capture, which will downscale the screen contents to $320\times180$ pixels and copy the resulting image to a buffer that will be accessible to the CPU when the operation is done. This should be done right before the \emph{swap buffers} or \emph{present} call.

\begin{lstlisting}
// Let the GPU downscale the rendered frame into the 320x180 texture.
glBindFramebuffer(GL_DRAW_FRAMEBUFFER, m_fiFramebuffer[m_fiIdx]);
glBlitFramebuffer(0, 0, res.x, res.y, 0, 0, 320, 180, GL_COLOR_BUFFER_BIT, GL_LINEAR);
glBindFramebuffer(GL_DRAW_FRAMEBUFFER, 0);

// Asynchronously copy the downscaled image into the PBO.
glBindFramebuffer(GL_READ_FRAMEBUFFER, m_fiFramebuffer[m_fiIdx]);
glBindBuffer(GL_PIXEL_PACK_BUFFER, m_fiPbo[m_fiIdx]);
glReadPixels(0, 0, 320, 180, GL_BGRA, GL_UNSIGNED_BYTE, nullptr);
glBindFramebuffer(GL_READ_FRAMEBUFFER, 0);

// The fence will be signaled when the copy has finished.
m_fiFence[m_fiIdx] = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
m_fiQueue.emplace_back(m_fiIdx);
m_fiIdx = (m_fiIdx + 1) % 4;
\end{lstlisting}

Finally, just before the capture setup code\footnote{Yes, before. We are handling past screen captures here.}, we need the image retrieval code. We check whether the capture operation has finished and, if it has, we map the \emph{pixel buffer object} to memory, inform the profiler that there's image data to be handled, unmap the buffer, and go on to check the next queue item. If a capture is still pending, we break out of the loop and wait until the next frame to check whether the GPU has finished the capture.

\begin{lstlisting}
while(!m_fiQueue.empty())
{
    const auto fiIdx = m_fiQueue.front();
    assert(fiIdx != m_fiIdx);

    // Stop at the first capture the GPU hasn't finished yet.
    if(glClientWaitSync(m_fiFence[fiIdx], 0, 0) == GL_TIMEOUT_EXPIRED) break;
    glDeleteSync(m_fiFence[fiIdx]);    // the fence is no longer needed

    // Map the PBO and hand the image over to the profiler. The remaining
    // queue size is the frame lag of this image.
    glBindBuffer(GL_PIXEL_PACK_BUFFER, m_fiPbo[fiIdx]);
    auto ptr = glMapBufferRange(GL_PIXEL_PACK_BUFFER, 0, 320*180*4, GL_MAP_READ_BIT);
    FrameImage(ptr, 320, 180, m_fiQueue.size());
    glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
    m_fiQueue.erase(m_fiQueue.begin());
}
\end{lstlisting}

Notice that in the call to \texttt{FrameImage} we are passing the remaining queue size as the \texttt{offset} parameter. The queue size represents how many frames ahead our program is relative to the GPU. Since we are sending images of past frames, we need to specify how many frames back each image is. Of course, if this were a synchronous capture (without the use of fences), we would set \texttt{offset} to zero, as there would be no frame lag.

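For comparison, a synchronous capture might look like the following sketch. This is for illustration only and assumes the blit into \texttt{m\_fiFramebuffer[m\_fiIdx]} from the capture listing above has already been issued; reading into client memory (with no pixel buffer object bound) stalls the CPU until the GPU has finished, so the image belongs to the frame that has just been rendered and \texttt{offset} is zero.

\begin{lstlisting}
// Illustration only: a synchronous capture, assuming the blit into
// m_fiFramebuffer[m_fiIdx] has already been issued this frame.
std::vector<uint32_t> pixels(320*180);
glBindFramebuffer(GL_READ_FRAMEBUFFER, m_fiFramebuffer[m_fiIdx]);
// Make sure no PBO is bound, so glReadPixels() writes to client memory
// and blocks until the data is available.
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
glReadPixels(0, 0, 320, 180, GL_BGRA, GL_UNSIGNED_BYTE, pixels.data());
glBindFramebuffer(GL_READ_FRAMEBUFFER, 0);
// No frame lag in the synchronous case.
FrameImage(pixels.data(), 320, 180, 0);
\end{lstlisting}
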
\subsection{Marking zones}
\label{markingzones}

@@ -1087,7 +1182,7 @@ Each bar displayed on the graph represents an unique frame in the current frame

The frames visible on the timeline are marked with a violet box drawn over them (presented as a dotted box on figure~\ref{frametime}).

Moving the \faMousePointer{} mouse cursor over the frames displayed on the graph will display a tooltip with information about the frame number, frame time, frame image (if available, see chapter~\ref{frameimages}), etc. Such tooltips are common for many UI elements in the profiler and won't be mentioned again in the manual.

The timeline view may be focused on the frames by clicking or dragging the \LMB{} left mouse button on the graph. The graph may be scrolled left and right by dragging the \RMB{} right mouse button over the graph. The view may be zoomed in and out by using the \Scroll{} mouse scroll. If the view is zoomed out so that multiple frames are merged into one column, the highest frame time will be used to represent the given column.

@@ -1667,6 +1762,7 @@ The following libraries are included with and used by the Tracy Profiler:
\begin{itemize}
\item getopt\_port -- \url{https://github.com/kimgr/getopt\_port}
\item libbacktrace -- \url{https://github.com/ianlancetaylor/libbacktrace}
\item etcpak -- \url{https://bitbucket.org/wolfpld/etcpak}
\end{itemize}

\item 2-clause BSD license