11 Commits

Author SHA1 Message Date
Leandro Lacerda
b1f5da4328
[libc][gpu] Add exp/log benchmarks and flexible input generation (#155727)
This patch adds GPU benchmarks for the exp (`exp`, `expf`, `expf16`) and
log (`log`, `logf`, `logf16`) families of math functions.

Adding these benchmarks revealed a key limitation in the existing
framework: the input generation mechanism was hardcoded to a single
strategy that sampled numbers with a uniform distribution of their
unbiased exponents.

While this strategy is effective for values spanning multiple orders of
magnitude, it is not suitable for linear ranges. The previous framework
offered no way to plug in a different sampling strategy.

### Summary of Changes

**1. Framework Refactoring for Flexible Input Sampling:**
The GPU benchmark framework was refactored to support multiple,
pluggable input sampling strategies.

* **`Random.h`:** A new header was created to house the
`RandomGenerator` and the new distribution classes.
* **Distribution Classes:** Two sampling strategies were implemented:
* `UniformExponent`: Formalizes the previous logic of sampling numbers
with a uniform distribution of their unbiased exponents. It can now also
be configured to produce only positive values, which is essential for
functions like `log`.
* `UniformLinear`: A new strategy that samples numbers from a uniform
distribution over a linear interval `[min, max)`.
* **`MathPerf` Update:** The `MathPerf` class was updated with a generic
`run_throughput` method that is templated on a distribution object. This
makes the framework extensible to future sampling strategies.
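
As a rough illustration of the two strategies above, here is a minimal host-side sketch in standard C++. It is not the code from `Random.h`: the `SplitMix64` generator, the `*Sketch` class names, the constructor signatures, and the `sample_mean` helper are all assumptions made for this example, and the real GPU implementation may differ in every detail.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical stand-in for the patch's RandomGenerator: one SplitMix64 stream.
struct SplitMix64 {
  uint64_t state;
  explicit SplitMix64(uint64_t seed) : state(seed) {}

  uint64_t next() {
    uint64_t z = (state += 0x9E3779B97F4A7C15ULL);
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
  }

  // Uniform double in [0, 1), built from the top 53 bits.
  double next_unit() { return static_cast<double>(next() >> 11) * 0x1.0p-53; }
};

// Samples doubles whose unbiased exponent is (approximately) uniform in
// [min_exp, max_exp]. The modulo step has a negligible bias for small spans.
struct UniformExponentSketch {
  int min_exp, max_exp;
  bool positive_only;

  UniformExponentSketch(int lo, int hi, bool positive = false)
      : min_exp(lo), max_exp(hi), positive_only(positive) {
    assert(lo <= hi);
  }

  double operator()(SplitMix64 &rng) const {
    uint64_t span = static_cast<uint64_t>(max_exp - min_exp) + 1;
    int exp = min_exp + static_cast<int>(rng.next() % span);
    double mantissa = 1.0 + rng.next_unit(); // in [1, 2)
    double value = std::ldexp(mantissa, exp); // in [2^exp, 2^(exp+1))
    bool negative = !positive_only && (rng.next() & 1);
    return negative ? -value : value;
  }
};

// Samples doubles uniformly over the linear interval [min, max).
struct UniformLinearSketch {
  double min, max;

  UniformLinearSketch(double lo, double hi) : min(lo), max(hi) {
    assert(lo < hi);
  }

  double operator()(SplitMix64 &rng) const {
    double v = min + (max - min) * rng.next_unit();
    // Guard against rounding up to the excluded upper bound.
    return v < max ? v : std::nextafter(max, min);
  }
};

// Any callable distribution can be plugged in, mirroring the idea of a
// method templated on a distribution object.
template <typename Dist>
double sample_mean(const Dist &dist, uint64_t seed, int n) {
  SplitMix64 rng(seed);
  double sum = 0.0;
  for (int i = 0; i < n; ++i)
    sum += dist(rng);
  return sum / n;
}
```

With `positive_only` set, `UniformExponentSketch` never produces a negative input, which is the property the message describes as essential for functions like `log`.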

**2. New Benchmarks for `exp` and `log`:**
Using the newly refactored framework, benchmarks were added for `exp`,
`expf`, `expf16`, `log`, `logf`, and `logf16`. The test intervals were
carefully chosen to measure the performance of distinct behavioral
regions of each function.
2025-08-27 21:05:10 -05:00
Leandro Lacerda
08ff017fb0
[libc] Improve GPU benchmarking (#153512)
This patch improves the GPU benchmarking in the following ways:

* Replace `rand`/`srand` with a deterministic per-thread RNG seeded by
`call_index`: reproducible, apples-to-apples libc vs vendor comparisons.
* Fix input generation: sample the unbiased exponent uniformly in
`[min_exp, max_exp]`, clamp bounds, and skip `Inf`, `NaN`, `-0.0`, and
`+0.0`.
* Fix standard deviation: use an explicit estimator from sums and
sums-of-squares (`sqrt(E[x^2] − E[x]^2)`) across samples.
* Fix throughput overhead: subtract a loop-only baseline inside
NVPTX/AMDGPU timing backends so `benchmark()` gets cycles-per-call
already corrected (no `overhead()` call).
* Adapt existing math benchmarks to the new RNG/timing plumbing (plumb
`call_index`, drop `rand/srand`, clean includes).
* Correct inter-thread aggregation: use iteration-weighted pooling to
compute the global mean/variance, ensuring statistically sound `Cycles
(Mean)` and `Stddev`.
* Remove `Time / Iteration` column from the results table: it reported
per-thread convergence time (not per-call latency) and was
redundant/misleading next to `Cycles (Mean)`.
* Remove unused `BenchmarkLogger` files: dead code that added
maintenance and cognitive overhead without providing functionality.
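
The iteration-weighted pooling and the `sqrt(E[x^2] − E[x]^2)` estimator described above can be sketched as follows. This is a host-side illustration only: the `ThreadStats` and `PooledStats` types and the `pool` function are hypothetical names, and the patch itself may accumulate these quantities differently.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical per-thread summary: how many iterations the thread ran and
// the mean/variance (population form) of its cycle counts.
struct ThreadStats {
  uint64_t iterations;
  double mean;
  double variance;
};

struct PooledStats {
  double mean;
  double stddev;
};

// Combines per-thread results by weighting each thread by its iteration
// count, so a thread that ran more iterations contributes proportionally more.
PooledStats pool(const std::vector<ThreadStats> &threads) {
  double total = 0.0, sum = 0.0, sum_sq = 0.0;
  for (const ThreadStats &t : threads) {
    double n = static_cast<double>(t.iterations);
    total += n;
    sum += n * t.mean;                            // recovers sum of x
    sum_sq += n * (t.variance + t.mean * t.mean); // recovers sum of x^2
  }
  assert(total > 0.0);
  double mean = sum / total;
  double variance = sum_sq / total - mean * mean; // E[x^2] - E[x]^2
  if (variance < 0.0)
    variance = 0.0; // clamp tiny negative values caused by rounding
  return {mean, std::sqrt(variance)};
}
```

For example, pooling one thread with 1 iteration at mean 10 and another with 3 iterations at mean 20 gives a global mean of 17.5, whereas a plain average of the two per-thread means would give 15.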

---

## TODO (before merge)

* [ ] Investigate compiler warnings and address their root causes.
* [x] Review how per-thread results are aggregated into the overall
result.

## Follow-ups (future PRs)

* Add support to run throughput benchmarks with uniform (linear) input
distributions, alongside the current log2-uniform scheme.
* Review/adjust the configuration and coverage of existing math
benchmarks.
* Add more math benchmarks (e.g., `exp`/`expf`, others).
2025-08-15 11:00:17 -05:00
lntue
66603dd1f1
[libc][NFC] Add stdint.h proxy header to fix dependency issue with <stdint.h> includes. (#150303)
https://github.com/llvm/llvm-project/issues/149993
2025-07-23 20:19:52 -04:00
Joseph Huber
de59e7b86c
[libc] Fix GPU benchmarking
2025-07-18 14:36:23 -05:00
jameshu15869
deb6b45c32
[libc][gpu] Add Atan2 Benchmarks (#104708)
This PR adds benchmarking for `atan2()`, `__nv_atan2()`, and
`__ocml_atan2_f64()` using the same setup as `sin()`. This PR also adds
support for throughput benchmarking of functions with 2 inputs.
2024-08-18 12:50:30 -05:00
jameshu15869
2b592b16c1
[libc][gpu] Add Sinf Benchmarks (#102532)
This PR adds benchmarking for `sinf()` using the same setup as `sin()`,
but with a smaller range for floats.
2024-08-08 16:26:26 -05:00
jameshu15869
1248698e9b
[libc] [gpu] Fix Minor Benchmark UI Issues (#102529)
Previously, `AmdgpuSinTwoPow_128` and others were too large for their
table cells. This PR shortens the name to `AmdSin...`

There were also some `-` missing in the separator. This PR instead
creates the separator string using the length of the headers.
2024-08-08 15:32:20 -05:00
jameshu15869
9a070d6d0f
[libc] [gpu] Add Generic, NvSin, and OcmlSinf64 Throughput Benchmark (#101917)
This PR implements 2a158426d4 to provide better throughput benchmarking
for libc `sin()` and `__nv_sin()`.

These changes have not been tested on AMDGPU yet, only compiled.
2024-08-08 15:05:34 -05:00
Joseph Huber
ebdcb76d1a
[libc] Only link in the appropriate architecture's device libs
2024-07-30 18:36:41 -05:00
jameshu15869
8f7910a4fc
[libc] Add AMDGPU Sin Benchmark (#101120)
This PR adds support for benchmarking `__ocml_sin_f64()` against
`sin()`. This PR is currently a draft because I do not have access to an
AMD GPU and was not able to test the PR, but the code compiled when I
ran `ninja gpu-benchmark` from `runtimes-amdgcn-amd-amdhsa-bins`

Co-authored-by: Joseph Huber <huberjn@outlook.com>
2024-07-30 10:19:48 -05:00
jameshu15869
677796cab3
[libc] Add Generic and NVPTX Sin Benchmark (#99795)
This PR adds sin benchmarking for a range of values and on a
pregenerated random distribution.
2024-07-29 22:09:11 -05:00