This is required for hermetic testing downstream. It is not complete and
will not work on hardware; however, it runs on QEMU and can report
pass/fail for our tests.
These functions turned out to have the same bug that was in wcstok()
(fixed by 4fc9801), so add the missing tests and fix the code in a way
that matches wcstok().
Also fix incorrect test expectations in existing tests.
Also update the BUILD.bazel files to actually build the strsep() test.
This PR modifies the static_asserts checking the expected sizes in
__barrier_type.h, so that we can guarantee that our internal
implementation fits the public header.
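
A minimal sketch of the kind of check described above. The type names and
field layout (`PublicBarrier`, `InternalBarrier`) are hypothetical stand-ins
and are not taken from `__barrier_type.h`:

```cpp
#include <cstddef>

// Hypothetical stand-in for the opaque storage exposed by the public header.
struct PublicBarrier {
  alignas(8) char storage[32];
};

// Hypothetical stand-in for the internal implementation type.
struct InternalBarrier {
  unsigned expected;
  unsigned waiting;
  bool blocking;
};

// Fail the build if the internal type stops fitting in the public storage.
static_assert(sizeof(InternalBarrier) <= sizeof(PublicBarrier),
              "internal barrier must fit in the public type");
static_assert(alignof(InternalBarrier) <= alignof(PublicBarrier),
              "internal barrier must not need stricter alignment");

// Same condition as a constexpr helper, so it can also be checked in tests.
constexpr bool internal_fits_public() {
  return sizeof(InternalBarrier) <= sizeof(PublicBarrier) &&
         alignof(InternalBarrier) <= alignof(PublicBarrier);
}
```

Using `<=` rather than `==` is a design choice in this sketch: it allows the
public type to reserve spare bytes.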
Summary:
This patch changes the Linux build to use wide reads in the memory
operations by default. These memory functions may now read outside of
the bounds explicitly allowed by the current function. While this is
technically undefined behavior in the standard, plenty of C library
implementations do it. It will not cause a segmentation fault on Linux
as long as the read does not cross a page boundary, and because we are
only *reading* memory it should not have atomic effects.
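
To illustrate the idea, here is a minimal sketch of a word-at-a-time string
length routine. It is not the libc implementation; `wide_strlen` and
`has_zero_byte` are hypothetical names, and the 8-byte word size is an
assumption. Because every wide load is aligned to 8 bytes and page sizes are
multiples of 8, a load may read past the terminator but never crosses into
the next page:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// True if any of the eight bytes in `word` is zero (classic bit trick).
inline bool has_zero_byte(std::uint64_t word) {
  constexpr std::uint64_t low = 0x0101010101010101ULL;
  constexpr std::uint64_t high = 0x8080808080808080ULL;
  return ((word - low) & ~word & high) != 0;
}

// Hypothetical word-at-a-time strlen. The aligned 8-byte loads may read
// bytes after the terminator, which is the "wide read" described above.
std::size_t wide_strlen(const char *s) {
  const char *p = s;
  // Advance byte by byte until `p` is 8-byte aligned.
  while (reinterpret_cast<std::uintptr_t>(p) % sizeof(std::uint64_t) != 0) {
    if (*p == '\0')
      return static_cast<std::size_t>(p - s);
    ++p;
  }
  // Aligned wide loads: stop at the first word that contains a zero byte.
  for (;;) {
    std::uint64_t word;
    std::memcpy(&word, p, sizeof(word));
    if (has_zero_byte(word))
      break;
    p += sizeof(word);
  }
  // Locate the exact terminator inside the final word.
  while (*p != '\0')
    ++p;
  return static_cast<std::size_t>(p - s);
}
```

In this sketch the caller's buffer only needs to be valid up to the
terminator; the extra bytes touched by the last load lie in the same aligned
word and therefore in the same page.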
This PR adds the following basic math functions for BFloat16 type along
with the tests:
- fromfpbf16
- fromfpxbf16
- ufromfpbf16
- ufromfpxbf16
---------
Signed-off-by: Krishna Pandey <kpandey81930@gmail.com>
This patch adds some hdrgen yaml for ioctl(). Otherwise the function
never actually ends up being available in a full build. This is the last
thing that is needed to enable turning on LIBCXX_ENABLE_RANDOM_DEVICE.
This patch makes GPU throughput benchmark results more comparable across
targets by disabling loop unrolling in the benchmark loop.
Motivation:
* PTX (post-LTO) evidence on NVPTX: for libc `sin`, the generated PTX
shows the `throughput` loop unrolled 8x at `N=128` (one iteration
advances the input pointer by 64 bytes = 8 doubles), interleaving eight
independent chains before the back-edge. This hides latency and
significantly reduces cycles/call as the batch size `N` grows.
* Observed scaling (NVPTX measurements): with unrolling enabled, `sin`
dropped from ~3100 cycles/call at `N=1` to ~360 at `N=128`. After
enforcing `#pragma clang loop unroll(disable)`, results stabilized
(e.g., from ~3100 cycles/call at `N=1` to ~2700 at `N=128`).
* libdevice contrast: the libdevice `sin` path did not exhibit a similar
drop in our measurements, and the PTX appears as compact internal calls
rather than a long FMA chain, leaving less ILP for the outer loop to
extract.
What this change does:
* Applies `#pragma clang loop unroll(disable)` to the GPU `throughput()`
loop in both NVPTX and AMDGPU backends.
Leaving unrolling entirely to the optimizer makes apples-to-apples
comparisons uneven (e.g., libc vs. vendor). Disabling unrolling yields
fairer, more consistent numbers.
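
A minimal sketch of how the pragma is applied to a batch loop. `work` and
`run_batch` are hypothetical stand-ins for the function under test and the
`throughput()` loop; the real benchmark code is not reproduced here:

```cpp
#include <cstddef>

// Hypothetical stand-in for the math function being benchmarked.
inline double work(double x) { return x * 2.0 + 1.0; }

// Calls `work` on each input and accumulates the results so the calls
// cannot be optimized away.
double run_batch(const double *inputs, std::size_t n) {
  double sink = 0.0;
  // Ask Clang not to unroll this loop, so each iteration is one call
  // followed by the back-edge. Other compilers skip the pragma.
#if defined(__clang__)
#pragma clang loop unroll(disable)
#endif
  for (std::size_t i = 0; i < n; ++i)
    sink += work(inputs[i]);
  return sink;
}
```

The pragma has to sit directly before the loop it controls; it changes code
generation only, not the computed result.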
This patch provides cleanups and improvements for the GPU benchmarking
infrastructure. The key changes are:
- Fix benchmark convergence bug: Round up the scaled iteration count
(ceil) to ensure it grows properly. The previous truncation logic causes
the iteration count to get stuck.
- Resolve remaining compiler warning.
- Remove unused `BenchmarkLogger` files: This is dead code that added
maintenance and cognitive overhead without providing functionality.
- Improve build hygiene: Clean up headers and CMake dependencies to
strictly follow the 'include what you use' (IWYU) principle.
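
A minimal sketch of why truncation can stall the iteration count. The growth
factor `kGrowth` and both function names are assumptions made for
illustration, not values taken from the benchmark source:

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical growth factor applied between benchmark rounds.
constexpr double kGrowth = 1.4;

// Truncating scale step: 1 * 1.4 = 1.4 truncates back to 1, so the
// count never leaves 1.
std::uint64_t next_truncated(std::uint64_t iters) {
  return static_cast<std::uint64_t>(static_cast<double>(iters) * kGrowth);
}

// Rounding up guarantees the count grows by at least one each round.
std::uint64_t next_ceil(std::uint64_t iters) {
  return static_cast<std::uint64_t>(
      std::ceil(static_cast<double>(iters) * kGrowth));
}
```

Starting from one iteration, `next_truncated` returns 1 forever, while
`next_ceil` moves on to 2 and then 3.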
This patch improves the GPU benchmarking in the following ways:
* Replace `rand`/`srand` with a deterministic per-thread RNG seeded by
`call_index`: reproducible, apples-to-apples libc vs vendor comparisons.
* Fix input generation: sample the unbiased exponent uniformly in
`[min_exp, max_exp]`, clamp bounds, and skip `Inf`, `NaN`, `-0.0`, and
`+0.0`.
* Fix standard deviation: use an explicit estimator from sums and
sums-of-squares (`sqrt(E[x^2] − E[x]^2)`) across samples.
* Fix throughput overhead: subtract a loop-only baseline inside
NVPTX/AMDGPU timing backends so `benchmark()` gets cycles-per-call
already corrected (no `overhead()` call).
* Adapt existing math benchmarks to the new RNG/timing plumbing (plumb
`call_index`, drop `rand/srand`, clean includes).
* Correct inter-thread aggregation: use iteration-weighted pooling to
compute the global mean/variance, ensuring statistically sound `Cycles
(Mean)` and `Stddev`.
* Remove `Time / Iteration` column from the results table: it reported
per-thread convergence time (not per-call latency) and was
redundant/misleading next to `Cycles (Mean)`.
* Remove unused `BenchmarkLogger` files: dead code that added
maintenance and cognitive overhead without providing functionality.
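
A minimal sketch of iteration-weighted pooling combined with the
`sqrt(E[x^2] - E[x]^2)` estimator. The `ThreadStats` and `Pooled` structs and
the `pool` function are hypothetical and do not mirror the benchmark's actual
data structures:

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical per-thread summary.
struct ThreadStats {
  std::uint64_t iterations; // number of samples this thread collected
  double mean;              // mean cycles per call on this thread
  double variance;          // population variance on this thread
};

struct Pooled {
  double mean;
  double stddev;
};

// Combine per-thread results, weighting each thread by its sample count.
Pooled pool(const std::vector<ThreadStats> &threads) {
  double n = 0.0, sum = 0.0, sum_sq = 0.0;
  for (const ThreadStats &t : threads) {
    const double w = static_cast<double>(t.iterations);
    n += w;
    sum += w * t.mean;
    // Recover the thread's sum of squares from E[x^2] = Var + mean^2.
    sum_sq += w * (t.variance + t.mean * t.mean);
  }
  if (n == 0.0)
    return {0.0, 0.0};
  const double mean = sum / n;
  double variance = sum_sq / n - mean * mean;
  if (variance < 0.0) // guard against tiny negative rounding error
    variance = 0.0;
  return {mean, std::sqrt(variance)};
}
```

For example, one thread with 1 sample at 10 cycles and another with 3 samples
at 20 cycles pool to a mean of 17.5, not the unweighted 15.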
---
## TODO (before merge)
* [ ] Investigate compiler warnings and address their root causes.
* [x] Review how per-thread results are aggregated into the overall
result.
## Follow-ups (future PRs)
* Add support to run throughput benchmarks with uniform (linear) input
distributions, alongside the current log2-uniform scheme.
* Review/adjust the configuration and coverage of existing math
benchmarks.
* Add more math benchmarks (e.g., `exp`/`expf`, others).
Previously, we were trying to memset a pointer that wasn't being
initialized, and the test would randomly fail.
This PR replaces the pointers with actual objects.
These variants require a different exception table that requires a bit
of initialisation.
This allows us to enable testing for these variants downstream.
This PR adds the following basic math functions for BFloat16 type along
with the tests:
- bf16fma
- bf16fmaf
- bf16fmal
- bf16fmaf128
---------
Signed-off-by: Krishna Pandey <kpandey81930@gmail.com>
This PR improves the readability of the files under the
`llvm-project/libc/src/wchar` directory.
---------
Co-authored-by: Jin Huang <jingold@google.com>
Use LIBC_ERRNO_MODE_SYSTEM_INLINE instead as the default for the "public
packaging" (i.e. release mode) of an overlay build. The Bazel build has
already switched to use it by default in
5ccc734fa0355f971f8f515457a0bece33ab6642. This should be a safe change,
as LIBC_ERRNO_MODE_SYSTEM_INLINE works as a drop-in (but simpler)
LIBC_ERRNO_MODE_SYSTEM replacement. Remove the associated code paths and
config settings.
Fixes issue #143454.
This PR adds the following basic math functions for BFloat16 type along
with the tests:
- bf16div
- bf16divf
- bf16divl
- bf16divf128
---------
Signed-off-by: Krishna Pandey <kpandey81930@gmail.com>
This PR adds the following basic math functions for BFloat16 type along
with the tests:
- fmaximumbf16
- fmaximum_magbf16
- fmaximum_mag_numbf16
- fmaximum_numbf16
- fminimumbf16
- fminimum_magbf16
- fminimum_mag_numbf16
- fminimum_numbf16
---------
Signed-off-by: Krishna Pandey <kpandey81930@gmail.com>
Summary:
This moves the waiting inside the `try_lock` routine. This makes the
logic much simpler, since it is just a single loop on a load. The effect
should be the same, and since we don't care about this being a generic
interface, it shouldn't matter that it waits a bit. It is still
wait-free, since it is guaranteed to make progress *eventually*.
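
A generic sketch of the pattern, not the code from this patch: the sentinel
values `kEmpty` and `kBusy`, the `make` callback, and the use of
`std::atomic` are all assumptions. The thread that wins the exchange
publishes a value; every other caller waits in a single loop on a load:

```cpp
#include <atomic>
#include <cstdint>

constexpr std::uintptr_t kEmpty = 0; // nothing has been published yet
constexpr std::uintptr_t kBusy = 1;  // another caller is initializing

// Hypothetical try_lock: returns the published value. The first caller
// creates it with `make()`; later callers wait until it is visible.
template <typename Make>
std::uintptr_t try_lock(std::atomic<std::uintptr_t> &slot, Make make) {
  std::uintptr_t expected = kEmpty;
  if (slot.compare_exchange_strong(expected, kBusy,
                                   std::memory_order_acquire,
                                   std::memory_order_relaxed)) {
    // We claimed the slot: create the value and publish it.
    const std::uintptr_t value = make();
    slot.store(value, std::memory_order_release);
    return value;
  }
  // Someone else claimed it: the waiting is one loop on a load.
  std::uintptr_t seen;
  while ((seen = slot.load(std::memory_order_acquire)) == kBusy) {
  }
  return seen;
}
```

In this sketch `make()` must never return `kEmpty` or `kBusy`, since those
values are reserved as sentinels.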
This PR adds the following basic math functions for BFloat16 type along
with the tests:
- bf16mul
- bf16mulf
- bf16mull
- bf16mulf128
---------
Signed-off-by: Krishna Pandey <kpandey81930@gmail.com>
Summary:
This patch introduces a lock-free stack used to store a fixed number of
slabs. Instead of going directly through RPC memory, we can consult the
cache and use that. Currently, this means that ~64 MiB of memory will
remain in use if the user completely fills the cache. However, because
we always fully destroy the object, the chunk size can be reset so the
slabs can be fully reused.
This greatly improves performance in cases where the user has previously
accessed malloc, lowering the difference between an implementation that
does not free slabs at all and one that does.
We can also skip the expensive zeroing step if the old chunk size was
smaller than the new one. Smaller chunk sizes need a larger bitfield,
and because we know for a fact that the number of users remaining in
this slab is zero thanks to the reference counting, we can guarantee
that the bitfield is all zero, like when it was initialized.
This PR implements the following basic math functions for BFloat16 type
along with the tests:
- bf16add
- bf16addf
- bf16addl
- bf16addf128
- bf16sub
- bf16subf
- bf16subl
- bf16subf128
---------
Signed-off-by: Krishna Pandey <kpandey81930@gmail.com>