This patch improves the GPU benchmarking in this way:
* Replace `rand`/`srand` with a deterministic per-thread RNG seeded by
`call_index`: reproducible, apples-to-apples libc vs vendor comparisons.
* Fix input generation: sample the unbiased exponent uniformly in
`[min_exp, max_exp]`, clamp bounds, and skip `Inf`, `NaN`, `-0.0`, and
`+0.0`.
* Fix standard deviation: use an explicit estimator from sums and
sums-of-squares (`sqrt(E[x^2] − E[x]^2)`) across samples.
* Fix throughput overhead: subtract a loop-only baseline inside
NVPTX/AMDGPU timing backends so `benchmark()` gets cycles-per-call
already corrected (no `overhead()` call).
* Adapt existing math benchmarks to the new RNG/timing plumbing (plumb
`call_index`, drop `rand/srand`, clean includes).
* Correct inter-thread aggregation: use iteration-weighted pooling to
compute the global mean/variance, ensuring statistically sound `Cycles
(Mean)` and `Stddev`.
* Remove `Time / Iteration` column from the results table: it reported
per-thread convergence time (not per-call latency) and was
redundant/misleading next to `Cycles (Mean)`.
* Remove unused `BenchmarkLogger` files: dead code that added
maintenance and cognitive overhead without providing functionality.
---
## TODO (before merge)
* [ ] Investigate compiler warnings and address their root causes.
* [x] Review how per-thread results are aggregated into the overall
result.
## Follow-ups (future PRs)
* Add support to run throughput benchmarks with uniform (linear) input
distributions, alongside the current log2-uniform scheme.
* Review/adjust the configuration and coverage of existing math
benchmarks.
* Add more math benchmarks (e.g., `exp`/`expf`, others).
Previously, we were trying to memset a pointer that wasn't being
initialized, and the test would randomly fail.
This PR replaces the pointers with actual objects.
These variants require a different exception table that requires a bit
of initialisation.
This allows us to enable testing for these variants downstream.
This PR adds the following basic math functions for BFloat16 type along
with the tests:
- bf16fma
- bf16fmaf
- bf16fmal
- bf16fmaf128
---------
Signed-off-by: Krishna Pandey <kpandey81930@gmail.com>
The PR is going to improve the readability for the files under
`llvm-project/libc/src/wchar` directory.
---------
Co-authored-by: Jin Huang <jingold@google.com>
Use LIBC_ERRNO_MODE_SYSTEM_INLINE instead as the default for the "public
packaging" (i.e. release mode) of an overlay build. The Bazel build has
already switched to use it by default in
5ccc734fa0355f971f8f515457a0bece33ab6642. This should be a safe change,
as LIBC_ERRNO_MODE_SYSTEM_INLINE works a drop-in (but simpler)
LIBC_ERRNO_MODE_SYSTEM replacement. Remove the associated code paths and
config settings.
Fixes issue #143454.
This PR adds the following basic math functions for BFloat16 type along
with the tests:
- bf16div
- bf16divf
- bf16divl
- bf16divf128
---------
Signed-off-by: Krishna Pandey <kpandey81930@gmail.com>
This PR adds the following basic math functions for BFloat16 type along
with the tests:
- fmaximumbf16
- fmaximum_magbf16
- fmaximum_mag_numbf16
- fmaximum_numbf16
- fminimumbf16
- fminimum_magbf16
- fminimum_mag_numbf16
- fminimum_numbf16
---------
Signed-off-by: Krishna Pandey <kpandey81930@gmail.com>
Summary:
This moves the waiting to be done inside of the `try_lock` routine
instead. This makes the logic much simpler since it's just a single loop
on a load. We should have the same effect here, and since we don't care
about this being a generic interface it shouldn't matter that it waits
abit. Still wait free since it's guaranteed to make progress
*eventually*.
This PR adds the following basic math functions for BFloat16 type along
with the tests:
- bf16mul
- bf16mulf
- bf16mull
- bf16mulf128
---------
Signed-off-by: Krishna Pandey <kpandey81930@gmail.com>
Summary:
This patch introduces a lock-free stack used to store a fixed number of
slabs. Instead of going directly through RPC memory, we instead can
consult the cache and use that. Currently, this means that ~64 MiB of
memory will remain in-use if the user completely fills the cache.
However, because we always fully destroy the object, the chunk size can
be reset so they can be fully reused.
This greatly improves performance in cases where the user has previously
accessed malloc, lowering the difference between an implementation that
does not free slabs at all and one that does.
We can also skip the expensive zeroing step if the old chunk size was
smaller than the previous one. Smaller chunk sizes need a larger
bitfield, and because we know for a fact that the number of users
remaining in this slab is zero thanks to the reference counting we can
guarantee that the bitfield is all zero like when it was initialized.
This PR adds implements following basic math functions for BFloat16 type
along with the tests:
- bf16add
- bf16addf
- bf16addl
- bf16addf128
- bf16sub
- bf16subf
- bf16subl
- bf16subf128
---------
Signed-off-by: Krishna Pandey <kpandey81930@gmail.com>
This would ensure that errno value is cleared out before test execution
and tests pass even when LIBC_ERRNO_MODE_SYSTEM_INLINE is specified (and
errno may be clobbered before test execution).
A lot of the tests would fail, however, since errno would end up getting
set to EDOM or ERANGE during test execution and never validated before
the end of the test. This should be fixed - and errno should be
explicitly checked or ignored in all of those cases, but for now add a
TODO to address it later (see open issue #135320) and clear out errno in
test fixture to avoid test failures.
This PR implements the following basic math functions for BFloat16 type
along with the tests:
- ceilbf16
- floorbf16
- roundbf16
- roundevenbf16
- truncbf16
---------
Signed-off-by: Krishna Pandey <kpandey81930@gmail.com>
This removes an extraneous ',' in the generated dlfcn header.
This also adds `__restrict` to `dladdr`'s declaration per POSIX. Another
fix is made: the C standard `restrict` keyword is removed from
dlinfo.cpp/dlinfo.h (but note that dlfcn.yaml still annotates
`__restrict` for dlinfo's decl).
Addressed TODO to template the StringConverter pop functions to have a
single implementation (combine popUTF8 and popUTF32 into a single
templated pop function)
The iswalpha function is only working for characters in the ascii range
right now, which isn't ideal. Also it was returning a bool when the
function is supposed to return an int.
When single-threaded mode is selected, all instances of the keyword
`LIBC_THREAD_LOCAL` are stubbed out, similar to how it currently works
on the GPU. This allows baremetal builds to avoid using thread_local.
However, libcxx uses shared headers, so we need to be careful there.
Thankfully, there is already an option to disable multithreading in
libcxx, so a flag is added such that single-threaded mode is propagated
down to libc.
Summary:
We need `malloc` to return a larger size now that it's aligned properly
and we use a bunch of threads. Also the `match_any` test was wrong
because it assumed a 32-bit lanemask.
A initial commit for #97929, this adds a stub implementation for
`dladdr` and includes the definition for the `DL_info` type used as one
of its arguments.
While the `dladdr` implementation relies on dynamic linker support, this
patch will add its prototype in the generated `dlfcn.h` header so that
it can be used by downstream platforms that have their own `dladdr`
implementation.
An initial commit for #149911, this adds a stub implementation for
dlinfo and the enums list of `RTLD_DI_*` values.
While the dlinfo implementation relies on dynamic linker support, this
patch will add its prototype in the generated dlfcn.h header so that it
can be used by downstream platforms that have their own dlinfo
implementation.