This PR adds benchmarking for `atan2()`, `__nv_atan2()`, and
`__ocml_atan2_f64()` using the same setup as `sin()`. This PR also adds
support for throughout bencmarking for functions with 2 inputs.
This PR implements
2a158426d4
to provide better throughput benchmarking for libc `sin()` and
`__nv_sin()`.
These changes have not been tested on AMDGPU yet, only compiled.
This PR adds minimums (50 iterations, 500 us, and epsilon of 0.0001) to
ensure that all benchmarks run at least a set number of times before
outputting a final measurement.
This PR adds a `BENCHMARK_N_THREADS()` helper to register benchmarks
with a specific number of threads. This PR replaces the flags used
originally to allow any amount of threads.
This PR runs benchmarks on a 32 threads (A single warp on NVPTX) by
default, adding the option for single threaded benchmarks. We can
specify that a benchmark should be run on a single thread using the
`SINGLE_THREADED_BENCHMARK()` macro.
I chose to use a flag here so that other options could be added in the
future.
This PR replaces our old method of reducing the benchmark results by
using an array to using atomics instead. This should help us implement
single threaded benchmarks.
PR for adding microbenchmarking infrastructure for NVPTX. `nvlink`
cannot perform LTO, so we cannot inline `libc` functions and this
function call overhead is not adjusted for during microbenchmarking.