Leandro Lacerda 75bf739208

[libc][gpu] Disable loop unrolling in the throughput benchmark loop (#153971)

This patch makes GPU throughput benchmark results more comparable across
targets by disabling loop unrolling in the benchmark loop.

Motivation:
* PTX (post-LTO) evidence on NVPTX: for libc `sin`, the generated PTX
shows the `throughput` loop unrolled 8x at `N=128` (one iteration
advances the input pointer by 64 bytes = 8 doubles), interleaving eight
independent chains before the back-edge. This hides latency and
significantly reduces cycles/call as the batch size `N` grows.
* Observed scaling (NVPTX measurements): with unrolling enabled, `sin`
dropped from ~3,100 cycles/call at `N=1` to ~360 at `N=128`. After
enforcing `#pragma clang loop unroll(disable)`, results became far less
sensitive to `N` (e.g., ~3,100 cycles/call at `N=1` vs. ~2,700 at `N=128`).
* libdevice contrast: the libdevice `sin` path did not exhibit a similar
drop in our measurements, and the PTX appears as compact internal calls
rather than a long FMA chain, leaving less ILP for the outer loop to
extract.

What this change does:
* Applies `#pragma clang loop unroll(disable)` to the GPU `throughput()`
loop in both NVPTX and AMDGPU backends.

Leaving unrolling entirely to the optimizer skews comparisons that are
meant to be apples-to-apples (e.g., libc vs. vendor). Disabling unrolling
yields fairer, more consistent numbers.

2025-08-16 20:14:26 +00:00

README.md

Libc mem* benchmarks

This framework has been designed to evaluate and compare relative performance of memory function implementations on a particular machine.

It relies on:

libc.src.string.<mem_function>_benchmark, which runs the benchmarks for a particular <mem_function>.
libc-benchmark-analysis.py3, a tool that processes the measurements into reports.

Benchmarking tool

Setup

cd llvm-project
cmake -B/tmp/build -Sllvm -DLLVM_ENABLE_PROJECTS='clang;clang-tools-extra;libc' -DCMAKE_BUILD_TYPE=Release -DLIBC_INCLUDE_BENCHMARKS=Yes -G Ninja
ninja -C /tmp/build libc.src.string.<mem_function>_benchmark

Note: The machine should run in performance mode. This is achieved by running:

cpupower frequency-set --governor performance

Usage

The benchmark can run in two modes:

stochastic mode (the default) returns the average time per call for a particular size distribution,
sweep mode returns the average time per size over a range of sizes.

Each benchmark requires --study-name to be set. This name identifies a run and provides its label during analysis. In stochastic mode, you must also provide --size-distribution-name to pick one of the available MemorySizeDistribution entries.

It also provides optional flags:

--num-trials: repeats the benchmark the given number of times. The analysis tool can use the repeated trials to compute confidence intervals.
--output: specifies the file to write the report to. If not set, the report goes to standard output.

Stochastic mode

This is the preferred mode. The function parameters are randomized, so the branch predictor is less likely to learn the call pattern and skew the results.

/tmp/build/bin/libc.src.string.memcpy_benchmark \
    --study-name="new memcpy" \
    --size-distribution-name="memcpy Google A" \
    --num-trials=30 \
    --output=/tmp/benchmark_result.json

The --size-distribution-name flag is mandatory in this mode and selects one of the predefined distributions.

Note: These distributions are gathered from several important binaries at Google (servers, databases, realtime and batch jobs) and reflect the importance of focusing on small sizes.

Using a profiler to observe size distributions for calls into libc functions, it was found most operations act on a small number of bytes.

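| Function | % of calls with size ≤ 128 | % of calls with size ≤ 1024 |
|----------|----------------------------|-----------------------------|
| memcpy   | 96%                        | 99%                         |
| memset   | 91%                        | 99.9%                       |
| memcmp¹  | 99.5%                      | ~100%                       |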

¹ - The size refers to the size of the buffers to compare and not the number of bytes until the first difference.

Sweep mode

This mode measures call latency per size over a range of sizes. Because it exercises the same size over and over again, the branch predictor can learn the pattern. It is still useful for comparing the strengths and weaknesses of particular implementations.

/tmp/build/bin/libc.src.string.memcpy_benchmark \
    --study-name="new memcpy" \
    --sweep-mode \
    --sweep-max-size=128 \
    --output=/tmp/benchmark_result.json
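
To make the difference between the two modes concrete, here is a minimal Python sketch of how each one might choose the sizes it exercises. It is only an illustration and not the framework's implementation: the function names, the representation of a distribution as a list of per-size weights, and the inclusive 0..max sweep range are all assumptions.

```python
import random

def stochastic_sizes(weights, count, seed=0):
    """Draw `count` sizes at random; weights[i] is the relative frequency of size i.

    Hypothetical helper: the real framework uses its own predefined
    MemorySizeDistribution data, not this function.
    """
    rng = random.Random(seed)
    sizes = list(range(len(weights)))
    return rng.choices(sizes, weights=weights, k=count)

def sweep_sizes(max_size):
    """Visit every size in order (assumed here to be 0..max_size inclusive)."""
    return list(range(max_size + 1))

# Hypothetical distribution that favors small sizes.
weights = [5, 40, 30, 15, 10]
print(stochastic_sizes(weights, count=8))  # irregular sequence of sizes
print(sweep_sizes(4))                      # [0, 1, 2, 3, 4]
```

The stochastic sequence gives the branch predictor little to lock onto, while the sweep repeats a predictable pattern. That is why stochastic results tend to be closer to real workloads.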

Analysis tool

Setup

Make sure matplotlib, pandas and seaborn are installed:

apt-get install python3-pip
pip3 install matplotlib pandas seaborn

You may need python3-gtk or a similar package to display the graphs.

Usage

python3 libc/benchmarks/libc-benchmark-analysis.py3 /tmp/benchmark_result.json ...

When given Sweep Mode data that contains multiple trials, the tool displays the 95% confidence interval.
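
As a rough illustration of what a 95% confidence interval over repeated trials means, the sketch below computes one for the mean of a few timings using a normal approximation. This is an assumption made for the example: the analysis tool may compute its interval differently (for instance by bootstrapping), and the sample values are hypothetical.

```python
import math
import statistics

def confidence_interval_95(samples):
    """Return (low, high) bounds for the mean, using a normal approximation."""
    if len(samples) < 2:
        raise ValueError("need at least two trials to estimate a spread")
    mean = statistics.mean(samples)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    sem = statistics.stdev(samples) / math.sqrt(len(samples))
    margin = 1.96 * sem  # 1.96 is the two-sided 95% quantile of the normal distribution
    return mean - margin, mean + margin

# Hypothetical per-trial timings (nanoseconds per call) for one size.
trial_times_ns = [10.2, 9.8, 10.1, 10.0, 9.9]
low, high = confidence_interval_95(trial_times_ns)
print(f"mean is within [{low:.2f}, {high:.2f}] ns with ~95% confidence")
```

More trials (a larger --num-trials) shrink the standard error and therefore narrow the interval.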

When provided with multiple reports at the same time, the tool displays all the graphs from the same machine side by side for comparison.

The Y-axis unit can be changed via the --mode flag:

time displays the measured time (this is the default),
cycles displays the number of cycles computed from the CPU frequency,
bytespercycle displays the number of bytes per cycle (for Sweep Mode reports only).
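
The relationship between the three units can be sketched in a few lines of Python. The function names and the exact formulas are assumptions made for illustration, not the tool's actual code. The idea is simply that cycles are elapsed time multiplied by the CPU frequency, and bytes per cycle divides the size by that cycle count.

```python
def seconds_to_cycles(seconds, cpu_frequency_hz):
    """Convert a measured duration into CPU cycles at a fixed frequency."""
    return seconds * cpu_frequency_hz

def bytes_per_cycle(size_bytes, seconds, cpu_frequency_hz):
    """Throughput of one call that processed `size_bytes` bytes."""
    return size_bytes / seconds_to_cycles(seconds, cpu_frequency_hz)

# Hypothetical measurement: a 64-byte call taking 10 ns on a 3 GHz CPU.
cycles = seconds_to_cycles(10e-9, 3e9)
throughput = bytes_per_cycle(64, 10e-9, 3e9)
print(f"{cycles:.1f} cycles, {throughput:.2f} bytes/cycle")
```

This conversion only makes sense if the CPU frequency is stable during the run, which is one reason to set the performance governor during setup.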

Under the hood

To learn more about the design decisions behind the benchmarking framework, have a look at the RATIONALE.md file.