This patch makes GPU throughput benchmark results more comparable across
targets by disabling loop unrolling in the benchmark loop.
Motivation:
* PTX (post-LTO) evidence on NVPTX: for libc `sin`, the generated PTX
shows the `throughput` loop unrolled 8x at `N=128` (one iteration
advances the input pointer by 64 bytes = 8 doubles), interleaving eight
independent chains before the back-edge. This hides latency and
significantly reduces cycles/call as the batch size `N` grows.
* Observed scaling (NVPTX measurements): with unrolling enabled, `sin`
dropped from ~3,100 cycles/call at `N=1` to ~360 at `N=128`. After
enforcing `#pragma clang loop unroll(disable)`, results stabilized
(e.g., from ~3,100 cycles/call at `N=1` to ~2,700 at `N=128`).
* libdevice contrast: the libdevice `sin` path did not exhibit a similar
drop in our measurements, and the PTX appears as compact internal calls
rather than a long FMA chain, leaving less ILP for the outer loop to
extract.
What this change does:
* Applies `#pragma clang loop unroll(disable)` to the GPU `throughput()`
loop in both NVPTX and AMDGPU backends.
Leaving unrolling entirely to the optimizer makes apples-to-apples
comparisons uneven (e.g., libc vs. vendor). Disabling unrolling yields
fairer, more consistent numbers.
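For reference, the annotated loop looks roughly like the sketch below. This is simplified: the real `throughput()` helpers in the timing backends also read the cycle counter and plumb per-call inputs; only the unrolling annotation matters here.

```cpp
#include <cstddef>

// Simplified sketch of the change: the pragma keeps the benchmark loop in
// the same shape on every target instead of letting the optimizer unroll it
// (e.g., 8x on NVPTX at N=128).
template <typename Func, typename T>
[[gnu::noinline]] void throughput_loop(Func f, const T *inputs,
                                       std::size_t n) {
  volatile T sink; // keep each result observable so the call is not elided
#pragma clang loop unroll(disable)
  for (std::size_t i = 0; i < n; ++i)
    sink = f(inputs[i]);
  (void)sink;
}
```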
This patch provides cleanups and improvements for the GPU benchmarking
infrastructure. The key changes are:
- Fix benchmark convergence bug: round up the scaled iteration count
(ceil) so that it keeps growing; the previous truncation logic caused
the iteration count to get stuck (see the sketch after this list).
- Resolve remaining compiler warning.
- Remove unused `BenchmarkLogger` files: This is dead code that added
maintenance and cognitive overhead without providing functionality.
- Improve build hygiene: Clean up headers and CMake dependencies to
strictly follow the 'include what you use' (IWYU) principle.
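To make the convergence fix concrete, here is a hedged sketch; the helper name and exact growth policy are hypothetical, not the in-tree code.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical illustration: with truncation, a small count such as 1
// scaled by 1.5 truncates back to 1 and the benchmark never accumulates
// more iterations; rounding up (and forcing at least +1) guarantees that
// the iteration count keeps growing toward convergence.
uint32_t next_iteration_count(uint32_t current, double scale) {
  uint32_t next =
      static_cast<uint32_t>(std::ceil(static_cast<double>(current) * scale));
  return next > current ? next : current + 1;
}
```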
This patch improves the GPU benchmarking infrastructure in the following ways:
* Replace `rand`/`srand` with a deterministic per-thread RNG seeded by
`call_index`: reproducible, apples-to-apples libc vs. vendor comparisons
(sketched after this list).
* Fix input generation: sample the unbiased exponent uniformly in
`[min_exp, max_exp]`, clamp bounds, and skip `Inf`, `NaN`, `-0.0`, and
`+0.0`.
* Fix standard deviation: use an explicit estimator from sums and
sums-of-squares (`sqrt(E[x^2] − E[x]^2)`) across samples.
* Fix throughput overhead: subtract a loop-only baseline inside the
NVPTX/AMDGPU timing backends so `benchmark()` gets cycles-per-call
already corrected (no `overhead()` call; see the sketch after this list).
* Adapt existing math benchmarks to the new RNG/timing plumbing (plumb
`call_index`, drop `rand/srand`, clean includes).
* Correct inter-thread aggregation: use iteration-weighted pooling to
compute the global mean/variance, ensuring statistically sound `Cycles
(Mean)` and `Stddev` (see the pooled-statistics sketch after this list).
* Remove `Time / Iteration` column from the results table: it reported
per-thread convergence time (not per-call latency) and was
redundant/misleading next to `Cycles (Mean)`.
* Remove unused `BenchmarkLogger` files: dead code that added
maintenance and cognitive overhead without providing functionality.
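Three sketches follow for the bullets above. First, input generation: a deterministic generator derived from the thread id and `call_index`, plus log2-uniform exponent sampling that cannot produce `Inf`, `NaN`, or zeros. The mixing constants and helper names are illustrative, not the exact in-tree implementation.

```cpp
#include <cstdint>
#include <cstring>

// Illustrative counter-based generator (splitmix64-style), seeded from the
// thread id and call_index so every thread and every run sees the same
// reproducible sequence without shared rand() state.
uint64_t pseudo_random_bits(uint64_t thread_id, uint64_t call_index) {
  uint64_t z = (thread_id << 32) ^ call_index;
  z += 0x9e3779b97f4a7c15u;
  z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9u;
  z = (z ^ (z >> 27)) * 0x94d049bb133111ebu;
  return z ^ (z >> 31);
}

// Illustrative sampling of a finite, nonzero double whose unbiased exponent
// is uniform in [min_exp, max_exp]. The biased exponent stays in [1, 2046],
// so Inf, NaN, -0.0 and +0.0 can never be produced.
double sample_log2_uniform(uint64_t bits, int min_exp, int max_exp) {
  min_exp = min_exp < -1022 ? -1022 : min_exp; // clamp to the normal range
  max_exp = max_exp > 1023 ? 1023 : max_exp;
  int exp = min_exp + static_cast<int>(
                          bits % static_cast<uint64_t>(max_exp - min_exp + 1));
  uint64_t mantissa = (bits >> 12) & ((1ull << 52) - 1);
  // Sign bit left at zero, so only positive values are produced here.
  uint64_t word = (static_cast<uint64_t>(exp + 1023) << 52) | mantissa;
  double result;
  std::memcpy(&result, &word, sizeof(result));
  return result;
}
```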
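Next, the overhead correction: the timing backend measures a loop-only baseline and subtracts it, so the value handed back to `benchmark()` is already cycles-per-call. The names below are hypothetical.

```cpp
#include <cstdint>

// Hypothetical correction applied inside the timing backend: subtract the
// cycles of an identical loop with the call removed, then normalize by the
// number of calls, so callers never need a separate overhead() step.
uint64_t corrected_cycles_per_call(uint64_t loop_cycles,
                                   uint64_t baseline_cycles, uint64_t calls) {
  uint64_t corrected =
      loop_cycles > baseline_cycles ? loop_cycles - baseline_cycles : 0;
  return calls == 0 ? 0 : corrected / calls;
}
```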
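Finally, the statistics: each thread keeps its iteration count, sum, and sum of squares; pooling these sums weighted by iterations gives the global mean and the `sqrt(E[x^2] − E[x]^2)` standard deviation. Field names are hypothetical.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical per-thread accumulator and iteration-weighted pooling.
struct ThreadResult {
  uint64_t iterations;
  double sum;         // sum of per-call cycle samples
  double sum_squares; // sum of squared per-call cycle samples
};

void pool(const ThreadResult *results, int num_threads, double &mean,
          double &stddev) {
  uint64_t total_iters = 0;
  double total_sum = 0.0, total_sq = 0.0;
  for (int i = 0; i < num_threads; ++i) {
    total_iters += results[i].iterations;
    total_sum += results[i].sum;
    total_sq += results[i].sum_squares;
  }
  if (total_iters == 0) {
    mean = stddev = 0.0;
    return;
  }
  mean = total_sum / static_cast<double>(total_iters);
  double mean_sq = total_sq / static_cast<double>(total_iters);
  double variance = mean_sq - mean * mean; // E[x^2] - E[x]^2
  stddev = std::sqrt(variance > 0.0 ? variance : 0.0);
}
```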
---
## TODO (before merge)
* [ ] Investigate compiler warnings and address their root causes.
* [x] Review how per-thread results are aggregated into the overall
result.
## Follow-ups (future PRs)
* Add support to run throughput benchmarks with uniform (linear) input
distributions, alongside the current log2-uniform scheme.
* Review/adjust the configuration and coverage of existing math
benchmarks.
* Add more math benchmarks (e.g., `exp`/`expf`, others).
This PR adds benchmarking for `atan2()`, `__nv_atan2()`, and
`__ocml_atan2_f64()` using the same setup as `sin()`. It also adds
support for throughput benchmarking of functions that take two inputs
(see the sketch below).
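A rough sketch of the two-input throughput loop; this is simplified, and the real helper also handles cycle counting and result sinking on the GPU.

```cpp
#include <cstddef>

// Simplified two-input variant: each iteration feeds an independent (x, y)
// pair to the function under test, e.g. atan2, __nv_atan2, __ocml_atan2_f64.
template <typename Func, typename T>
[[gnu::noinline]] void throughput_two_inputs(Func f, const T *xs, const T *ys,
                                             std::size_t n) {
  volatile T sink;
  for (std::size_t i = 0; i < n; ++i)
    sink = f(xs[i], ys[i]);
  (void)sink;
}
```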
Previously, `AmdgpuSinTwoPow_128` and other benchmark names were too
large for their table cells. This PR shortens such names to `AmdSin...`
Some `-` were also missing from the separator row. This PR instead
builds the separator string from the length of the header row (see the
sketch below).
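The separator fix amounts to deriving the dash count from the rendered header row instead of hard-coding it; a host-side sketch (the benchmark code builds its strings with its own utilities):

```cpp
#include <string>

// Build the separator from the header text itself so the two can never get
// out of sync, rather than hard-coding a fixed run of '-'.
std::string make_separator(const std::string &header_row) {
  return std::string(header_row.size(), '-');
}
```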
This PR implements
2a158426d4
to provide better throughput benchmarking for libc `sin()` and
`__nv_sin()`.
These changes have not been tested on AMDGPU yet, only compiled.
Previously, the time field was the total time taken to run all iterations
of the benchmark. This PR changes the displayed value to the average
time taken by each iteration.
This PR adds support for benchmarking `__ocml_sin_f64()` against
`sin()`. This PR is currently a draft because I do not have access to an
AMD GPU and was not able to test it, but the code compiled when I
ran `ninja gpu-benchmark` from `runtimes-amdgcn-amd-amdhsa-bins`.
Co-authored-by: Joseph Huber <huberjn@outlook.com>
This PR adds minimums (50 iterations, 500 us, and an epsilon of 0.0001)
to ensure that all benchmarks run at least a set number of times before
outputting a final measurement (see the sketch below).
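A hedged sketch of the resulting stopping rule; the field names and the exact epsilon comparison are hypothetical, while the floors are the ones listed above.

```cpp
#include <cstdint>

// Hypothetical stopping rule: never accept a result before the floors of
// 50 iterations and 500 us are reached, and only stop once the running mean
// has stabilized to within the 0.0001 epsilon.
bool has_converged(uint64_t iterations, uint64_t elapsed_us, double prev_mean,
                   double curr_mean) {
  constexpr uint64_t MIN_ITERATIONS = 50;
  constexpr uint64_t MIN_TIME_US = 500;
  constexpr double EPSILON = 0.0001;
  if (iterations < MIN_ITERATIONS || elapsed_us < MIN_TIME_US)
    return false;
  double change =
      prev_mean > curr_mean ? prev_mean - curr_mean : curr_mean - prev_mean;
  return change < EPSILON * curr_mean;
}
```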
Summary:
This value is a `uint32_t` but is printed as a `uint64_t`, leading to
invalid offsets on AMDGPU, whose packed format extends the read past
the buffer.
This PR adds a `BENCHMARK_N_THREADS()` helper to register benchmarks
with a specific number of threads, replacing the flags originally used
to allow an arbitrary number of threads.
`libc/benchmarks/gpu/timing/CMakeLists.txt` did not correctly build the
`amdgpu` utils. This PR fixes that issue by adding `amdgpu` to the loop
that adds the correct subdirectories.
This PR runs benchmarks on 32 threads (a single warp on NVPTX) by
default and adds an option for single-threaded benchmarks. We can
specify that a benchmark should run on a single thread using the
`SINGLE_THREADED_BENCHMARK()` macro.
I chose to use a flag here so that other options could be added in the
future.
This PR replaces our old array-based method of reducing the benchmark
results with atomics (see the sketch below). This should help us
implement single-threaded benchmarks.
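A minimal sketch of the atomic-reduction idea, using standard C++ atomics for illustration; the in-tree code uses the project's own GPU atomic utilities and richer result fields.

```cpp
#include <atomic>
#include <cstdint>

// Illustrative reduction: instead of each thread writing into its own slot
// of a fixed-size array, every thread folds its result into shared atomics.
// The same path works unchanged when only one thread runs the benchmark.
std::atomic<uint64_t> total_cycles{0};
std::atomic<uint64_t> total_iterations{0};

void report_thread_result(uint64_t cycles, uint64_t iterations) {
  total_cycles.fetch_add(cycles, std::memory_order_relaxed);
  total_iterations.fetch_add(iterations, std::memory_order_relaxed);
}
```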
There was previously an issue where registering multiple benchmarks in
the same file would only give the results for the last benchmark to run.
This PR fixes the issue.
@jhuber6
PR for adding AMDGPU timing utils for benchmarking.
I was not able to test this code since I do not have an AMD GPU, but it
compiled successfully with
`-DRUNTIMES_amdgcn-amd-amdhsa_LIBC_GPU_TEST_ARCHITECTURE=gfx90a`
`-DRUNTIMES_amdgcn-amd-amdhsa_LIBC_GPU_LOADER_EXECUTABLE=echo`
`-DRUNTIMES_amdgcn-amd-amdhsa_LIBC_GPU_TARGET_ARCHITECTURE=gfx90a`
to force the build without an AMD GPU on my machine.
@jhuber6
PR for adding microbenchmarking infrastructure for NVPTX. `nvlink`
cannot perform LTO, so we cannot inline `libc` functions, and this
function-call overhead is not adjusted for during microbenchmarking.