This patch makes GPU throughput benchmark results more comparable across
targets by disabling loop unrolling in the benchmark loop.
Motivation:
* PTX (post-LTO) evidence on NVPTX: for libc `sin`, the generated PTX
shows the `throughput` loop unrolled 8x at `N=128` (one iteration
advances the input pointer by 64 bytes = 8 doubles), interleaving eight
independent chains before the back-edge. This hides latency and
significantly reduces cycles/call as the batch size `N` grows.
* Observed scaling (NVPTX measurements): with unrolling enabled, `sin`
dropped from ~3,100 cycles/call at `N=1` to ~360 at `N=128`. After
enforcing `#pragma clang loop unroll(disable)`, results stabilized
(e.g., from ~3,100 cycles/call at `N=1` to ~2,700 at `N=128`).
* libdevice contrast: the libdevice `sin` path did not exhibit a similar
drop in our measurements, and the PTX appears as compact internal calls
rather than a long FMA chain, leaving less ILP for the outer loop to
extract.
What this change does:
* Applies `#pragma clang loop unroll(disable)` to the GPU `throughput()`
loop in both NVPTX and AMDGPU backends.
Leaving unrolling entirely to the optimizer makes apples-to-apples
comparisons uneven (e.g., libc vs. vendor). Disabling unrolling yields
fairer, more consistent numbers.
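For reference, the annotated loop looks roughly like the sketch below. This is simplified: the real `throughput()` helpers in the timing backends also read the cycle counter and plumb per-call inputs; only the unrolling annotation matters here.

```cpp
#include <cstddef>

// Simplified sketch of the change: the pragma keeps the benchmark loop in
// the same shape on every target instead of letting the optimizer unroll it
// (e.g., 8x on NVPTX at N=128).
template <typename Func, typename T>
[[gnu::noinline]] void throughput_loop(Func f, const T *inputs,
                                       std::size_t n) {
  volatile T sink; // keep each result observable so the call is not elided
#pragma clang loop unroll(disable)
  for (std::size_t i = 0; i < n; ++i)
    sink = f(inputs[i]);
  (void)sink;
}
```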
This patch provides cleanups and improvements for the GPU benchmarking
infrastructure. The key changes are:
- Fix benchmark convergence bug: round up the scaled iteration count
(ceil) so that it keeps growing; the previous truncation logic caused
the iteration count to get stuck (see the sketch after this list).
- Resolve remaining compiler warning.
- Remove unused `BenchmarkLogger` files: This is dead code that added
maintenance and cognitive overhead without providing functionality.
- Improve build hygiene: Clean up headers and CMake dependencies to
strictly follow the 'include what you use' (IWYU) principle.
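To make the convergence fix concrete, here is a hedged sketch; the helper name and exact growth policy are hypothetical, not the in-tree code.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical illustration: with truncation, a small count such as 1
// scaled by 1.5 truncates back to 1 and the benchmark never accumulates
// more iterations; rounding up (and forcing at least +1) guarantees that
// the iteration count keeps growing toward convergence.
uint32_t next_iteration_count(uint32_t current, double scale) {
  uint32_t next =
      static_cast<uint32_t>(std::ceil(static_cast<double>(current) * scale));
  return next > current ? next : current + 1;
}
```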
This patch improves the GPU benchmarking infrastructure in the following ways:
* Replace `rand`/`srand` with a deterministic per-thread RNG seeded by
`call_index`: reproducible, apples-to-apples libc vs. vendor comparisons
(sketched after this list).
* Fix input generation: sample the unbiased exponent uniformly in
`[min_exp, max_exp]`, clamp bounds, and skip `Inf`, `NaN`, `-0.0`, and
`+0.0`.
* Fix standard deviation: use an explicit estimator from sums and
sums-of-squares (`sqrt(E[x^2] − E[x]^2)`) across samples.
* Fix throughput overhead: subtract a loop-only baseline inside the
NVPTX/AMDGPU timing backends so `benchmark()` gets cycles-per-call
already corrected (no `overhead()` call; see the sketch after this list).
* Adapt existing math benchmarks to the new RNG/timing plumbing (plumb
`call_index`, drop `rand/srand`, clean includes).
* Correct inter-thread aggregation: use iteration-weighted pooling to
compute the global mean/variance, ensuring statistically sound `Cycles
(Mean)` and `Stddev` (see the pooled-statistics sketch after this list).
* Remove `Time / Iteration` column from the results table: it reported
per-thread convergence time (not per-call latency) and was
redundant/misleading next to `Cycles (Mean)`.
* Remove unused `BenchmarkLogger` files: dead code that added
maintenance and cognitive overhead without providing functionality.
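Three sketches follow for the bullets above. First, input generation: a deterministic generator derived from the thread id and `call_index`, plus log2-uniform exponent sampling that cannot produce `Inf`, `NaN`, or zeros. The mixing constants and helper names are illustrative, not the exact in-tree implementation.

```cpp
#include <cstdint>
#include <cstring>

// Illustrative counter-based generator (splitmix64-style), seeded from the
// thread id and call_index so every thread and every run sees the same
// reproducible sequence without shared rand() state.
uint64_t pseudo_random_bits(uint64_t thread_id, uint64_t call_index) {
  uint64_t z = (thread_id << 32) ^ call_index;
  z += 0x9e3779b97f4a7c15u;
  z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9u;
  z = (z ^ (z >> 27)) * 0x94d049bb133111ebu;
  return z ^ (z >> 31);
}

// Illustrative sampling of a finite, nonzero double whose unbiased exponent
// is uniform in [min_exp, max_exp]. The biased exponent stays in [1, 2046],
// so Inf, NaN, -0.0 and +0.0 can never be produced.
double sample_log2_uniform(uint64_t bits, int min_exp, int max_exp) {
  min_exp = min_exp < -1022 ? -1022 : min_exp; // clamp to the normal range
  max_exp = max_exp > 1023 ? 1023 : max_exp;
  int exp = min_exp + static_cast<int>(
                          bits % static_cast<uint64_t>(max_exp - min_exp + 1));
  uint64_t mantissa = (bits >> 12) & ((1ull << 52) - 1);
  // Sign bit left at zero, so only positive values are produced here.
  uint64_t word = (static_cast<uint64_t>(exp + 1023) << 52) | mantissa;
  double result;
  std::memcpy(&result, &word, sizeof(result));
  return result;
}
```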
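Next, the overhead correction: the timing backend measures a loop-only baseline and subtracts it, so the value handed back to `benchmark()` is already cycles-per-call. The names below are hypothetical.

```cpp
#include <cstdint>

// Hypothetical correction applied inside the timing backend: subtract the
// cycles of an identical loop with the call removed, then normalize by the
// number of calls, so callers never need a separate overhead() step.
uint64_t corrected_cycles_per_call(uint64_t loop_cycles,
                                   uint64_t baseline_cycles, uint64_t calls) {
  uint64_t corrected =
      loop_cycles > baseline_cycles ? loop_cycles - baseline_cycles : 0;
  return calls == 0 ? 0 : corrected / calls;
}
```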
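Finally, the statistics: each thread keeps its iteration count, sum, and sum of squares; pooling these sums weighted by iterations gives the global mean and the `sqrt(E[x^2] − E[x]^2)` standard deviation. Field names are hypothetical.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical per-thread accumulator and iteration-weighted pooling.
struct ThreadResult {
  uint64_t iterations;
  double sum;         // sum of per-call cycle samples
  double sum_squares; // sum of squared per-call cycle samples
};

void pool(const ThreadResult *results, int num_threads, double &mean,
          double &stddev) {
  uint64_t total_iters = 0;
  double total_sum = 0.0, total_sq = 0.0;
  for (int i = 0; i < num_threads; ++i) {
    total_iters += results[i].iterations;
    total_sum += results[i].sum;
    total_sq += results[i].sum_squares;
  }
  if (total_iters == 0) {
    mean = stddev = 0.0;
    return;
  }
  mean = total_sum / static_cast<double>(total_iters);
  double mean_sq = total_sq / static_cast<double>(total_iters);
  double variance = mean_sq - mean * mean; // E[x^2] - E[x]^2
  stddev = std::sqrt(variance > 0.0 ? variance : 0.0);
}
```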
---
## TODO (before merge)
* [ ] Investigate compiler warnings and address their root causes.
* [x] Review how per-thread results are aggregated into the overall
result.
## Follow-ups (future PRs)
* Add support to run throughput benchmarks with uniform (linear) input
distributions, alongside the current log2-uniform scheme.
* Review/adjust the configuration and coverage of existing math
benchmarks.
* Add more math benchmarks (e.g., `exp`/`expf`, others).
This PR adds benchmarking for `atan2()`, `__nv_atan2()`, and
`__ocml_atan2_f64()` using the same setup as `sin()`. It also adds
support for throughput benchmarking of functions that take two inputs
(see the sketch below).
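A rough sketch of the two-input throughput loop; this is simplified, and the real helper also handles cycle counting and result sinking on the GPU.

```cpp
#include <cstddef>

// Simplified two-input variant: each iteration feeds an independent (x, y)
// pair to the function under test, e.g. atan2, __nv_atan2, __ocml_atan2_f64.
template <typename Func, typename T>
[[gnu::noinline]] void throughput_two_inputs(Func f, const T *xs, const T *ys,
                                             std::size_t n) {
  volatile T sink;
  for (std::size_t i = 0; i < n; ++i)
    sink = f(xs[i], ys[i]);
  (void)sink;
}
```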
Previously, `AmdgpuSinTwoPow_128` and other benchmark names were too
large for their table cells. This PR shortens such names to `AmdSin...`
Some `-` were also missing from the separator row. This PR instead
builds the separator string from the length of the header row (see the
sketch below).
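The separator fix amounts to deriving the dash count from the rendered header row instead of hard-coding it; a host-side sketch (the benchmark code builds its strings with its own utilities):

```cpp
#include <string>

// Build the separator from the header text itself so the two can never get
// out of sync, rather than hard-coding a fixed run of '-'.
std::string make_separator(const std::string &header_row) {
  return std::string(header_row.size(), '-');
}
```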
This PR implements
2a158426d4
to provide better throughput benchmarking for libc `sin()` and
`__nv_sin()`.
These changes have not been tested on AMDGPU yet, only compiled.
Previously, the time field was the total time taken to run all iterations
of the benchmark. This PR changes the displayed value to the average
time taken by each iteration.
This PR adds support for benchmarking `__ocml_sin_f64()` against
`sin()`. This PR is currently a draft because I do not have access to an
AMD GPU and was not able to test it, but the code compiled when I
ran `ninja gpu-benchmark` from `runtimes-amdgcn-amd-amdhsa-bins`.
Co-authored-by: Joseph Huber <huberjn@outlook.com>
This PR adds minimums (50 iterations, 500 us, and an epsilon of 0.0001)
to ensure that all benchmarks run at least a set number of times before
outputting a final measurement (see the sketch below).
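A hedged sketch of the resulting stopping rule; the field names and the exact epsilon comparison are hypothetical, while the floors are the ones listed above.

```cpp
#include <cstdint>

// Hypothetical stopping rule: never accept a result before the floors of
// 50 iterations and 500 us are reached, and only stop once the running mean
// has stabilized to within the 0.0001 epsilon.
bool has_converged(uint64_t iterations, uint64_t elapsed_us, double prev_mean,
                   double curr_mean) {
  constexpr uint64_t MIN_ITERATIONS = 50;
  constexpr uint64_t MIN_TIME_US = 500;
  constexpr double EPSILON = 0.0001;
  if (iterations < MIN_ITERATIONS || elapsed_us < MIN_TIME_US)
    return false;
  double change =
      prev_mean > curr_mean ? prev_mean - curr_mean : curr_mean - prev_mean;
  return change < EPSILON * curr_mean;
}
```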
Summary:
This value is a `uint32_t` but is printed as a `uint64_t`, leading to
invalid offsets on AMDGPU, whose packed format extends the read past
the buffer.
This PR adds a `BENCHMARK_N_THREADS()` helper to register benchmarks
with a specific number of threads, replacing the flags originally used
to allow an arbitrary number of threads.
`libc/benchmarks/gpu/timing/CMakeLists.txt` did not correctly build the
`amdgpu` utils. This PR fixes that issue by adding `amdgpu` to the loop
that adds the correct subdirectories.
This PR runs benchmarks on 32 threads (a single warp on NVPTX) by
default and adds an option for single-threaded benchmarks. We can
specify that a benchmark should run on a single thread using the
`SINGLE_THREADED_BENCHMARK()` macro.
I chose to use a flag here so that other options could be added in the
future.
This PR replaces our old array-based method of reducing the benchmark
results with atomics (see the sketch below). This should help us
implement single-threaded benchmarks.
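A minimal sketch of the atomic-reduction idea, using standard C++ atomics for illustration; the in-tree code uses the project's own GPU atomic utilities and richer result fields.

```cpp
#include <atomic>
#include <cstdint>

// Illustrative reduction: instead of each thread writing into its own slot
// of a fixed-size array, every thread folds its result into shared atomics.
// The same path works unchanged when only one thread runs the benchmark.
std::atomic<uint64_t> total_cycles{0};
std::atomic<uint64_t> total_iterations{0};

void report_thread_result(uint64_t cycles, uint64_t iterations) {
  total_cycles.fetch_add(cycles, std::memory_order_relaxed);
  total_iterations.fetch_add(iterations, std::memory_order_relaxed);
}
```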
There was previously an issue where registering multiple benchmarks in
the same file would only give the results for the last benchmark to run.
This PR fixes the issue.
@jhuber6
PR for adding AMDGPU timing utils for benchmarking.
I was not able to test this code since I do not have an AMD GPU, but it
compiled successfully with
`-DRUNTIMES_amdgcn-amd-amdhsa_LIBC_GPU_TEST_ARCHITECTURE=gfx90a`
`-DRUNTIMES_amdgcn-amd-amdhsa_LIBC_GPU_LOADER_EXECUTABLE=echo`
`-DRUNTIMES_amdgcn-amd-amdhsa_LIBC_GPU_TARGET_ARCHITECTURE=gfx90a`
to force the build without an AMD GPU on my machine.
@jhuber6
PR for adding microbenchmarking infrastructure for NVPTX. `nvlink`
cannot perform LTO, so we cannot inline `libc` functions, and this
function-call overhead is not adjusted for during microbenchmarking.