Fixes #108624
This allows `flat_map::insert(Iter, Iter)` to forward directly to the
underlying containers' `insert(Iter, Iter)`, instead of inserting one
element at a time, when the input models a "product iterator". At the moment,
`flat_map::iterator` and `zip_view::iterator` are "product iterator"s.
This gives an almost 10x speed-up in my benchmark, compiled with -O3 (for
both the before and after runs):
```cpp
Benchmark                                            Time      CPU   Time Old   Time New   CPU Old   CPU New
--------------------------------------------------------------------------------------------------------------
flat_map::insert_product_iterator_flat_map/32     -0.5028  -0.5320        149         74       149        70
flat_map::insert_product_iterator_flat_map/1024   -0.8617  -0.8618       3113        430      3112       430
flat_map::insert_product_iterator_flat_map/8192   -0.8877  -0.8877      26682       2995     26679      2995
flat_map::insert_product_iterator_flat_map/65536  -0.8769  -0.8769     226235      27844    226221     27841
flat_map::insert_product_iterator_zip/32          -0.5844  -0.5844        162         67       162        67
flat_map::insert_product_iterator_zip/1024        -0.8754  -0.8754       3427        427      3427       427
flat_map::insert_product_iterator_zip/8192        -0.8934  -0.8934      28134       3000     28132      3000
flat_map::insert_product_iterator_zip/65536       -0.8783  -0.8783     229783      27960    229767     27958
OVERALL_GEOMEAN                                   -0.8319  -0.8332          0          0         0         0
```
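For intuition, the "product iterator" forwarding described above can be sketched as follows. This is illustrative only; the names and types below are hypothetical and are not the actual libc++ internals:

```cpp
#include <vector>

// A "product iterator" is conceptually a pack of parallel iterators. If a
// range of pairs is backed by such an iterator, bulk insertion can extract
// the component iterators and forward each component range to the matching
// underlying container in one call, instead of inserting pair by pair.
template <class KeyIter, class ValueIter>
struct product_range {
  KeyIter key_first, key_last;
  ValueIter value_first, value_last;
};

template <class Keys, class Values, class KeyIter, class ValueIter>
void bulk_insert(Keys& keys, Values& values,
                 product_range<KeyIter, ValueIter> r) {
  // One ranged insert per underlying container...
  keys.insert(keys.end(), r.key_first, r.key_last);
  values.insert(values.end(), r.value_first, r.value_last);
  // ...after which a real flat_map would restore sorted order/uniqueness.
}

int main() {
  std::vector<int> keys{1, 2};
  std::vector<char> values{'a', 'b'};
  std::vector<int> new_keys{3, 4};
  std::vector<char> new_values{'c', 'd'};
  bulk_insert(keys, values,
              product_range{new_keys.begin(), new_keys.end(),
                            new_values.begin(), new_values.end()});
}
```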
---------
Co-authored-by: Louis Dionne <ldionne.2@gmail.com>
Previously, the segmented iterator optimization was limited to `std::{for_each, for_each_n}`. This patch
extends the optimization to `std::ranges::for_each` and `std::ranges::for_each_n`, ensuring consistent
optimizations across these algorithms. The patch first generalizes the `std` algorithms by introducing
a `Projection` parameter, which is set to `__identity` for the `std` algorithms. The `ranges`
algorithms then directly call their `std` counterparts with a general `__proj` argument. Benchmarks
demonstrate performance improvements of up to 21.4x for `std::deque::iterator` and 22.3x for
`join_view` of `vector<vector<char>>`.
Addresses a subtask of #102817.
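The generalization pattern described above can be sketched roughly as follows; this is a simplified illustration with invented names, not the actual libc++ internals:

```cpp
#include <functional>
#include <iostream>
#include <utility>
#include <vector>

// The classic algorithm is generalized over a projection that defaults to
// identity; the ranges overload simply forwards its real projection to the
// same shared implementation, so both share one segmented-iterator fast path.
template <class It, class Fn, class Proj = std::identity>
Fn for_each_impl(It first, It last, Fn f, Proj proj = {}) {
  // The segmented-iterator dispatch would live here, iterating chunk by
  // chunk for iterators like std::deque::iterator.
  for (; first != last; ++first)
    f(std::invoke(proj, *first));
  return f;
}

int main() {
  std::vector<std::pair<int, int>> v{{1, 2}, {3, 4}};
  // "ranges" flavor: passes a real projection (here, a member pointer).
  for_each_impl(v.begin(), v.end(), [](int x) { std::cout << x << ' '; },
                &std::pair<int, int>::first); // prints: 1 3
  std::cout << '\n';
}
```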
This PR simplifies `__bitset::__init` into a more compact and readable
form, which avoids redundant computations of a `size_t` value and
eliminates the overhead of a local array.
This patch optimizes `bitset::to_string` by replacing the existing bit-by-bit processing with a more efficient
bit traversal strategy. Instead of checking each bit sequentially, we leverage `std::__countr_zero` to efficiently
locate the next set bit, skipping over consecutive zero bits. This greatly accelerates the conversion process,
especially for sparse `bitset`s where zero bits dominate. To ensure similar improvements for dense `bitset`s, we
exploit symmetry by inverting the bit pattern, allowing us to apply the same optimized traversal technique. Even
for uniformly distributed `bitset`s, the proposed approach offers measurable performance gains over the existing
implementation.
Benchmarks demonstrate substantial improvements, achieving up to 13.5x speedup for sparse `bitset`s with
`Pr(true bit) = 0.1`, 16.1x for dense `bitset`s with `Pr(true bit) = 0.9`, and 8.3x for uniformly distributed
`bitset`s with `Pr(true bit) = 0.5`.
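A minimal sketch of the traversal idea described above, on a single storage word (illustrative, not the libc++ code):

```cpp
#include <bit>
#include <cstdint>
#include <iostream>
#include <string>

// Jump directly from one set bit to the next with std::countr_zero instead
// of testing every bit; each loop iteration handles exactly one '1'.
std::string word_to_string(std::uint64_t word) {
  std::string out(64, '0');
  while (word != 0) {
    int i = std::countr_zero(word); // index of the lowest set bit
    out[63 - i] = '1';              // most significant bit printed first
    word &= word - 1;               // clear the lowest set bit
  }
  return out;
}

int main() {
  std::cout << word_to_string(0b1010u) << '\n'; // ...00001010
}
```

For dense words, the same loop can run over `~word`, writing '0' characters into a string of '1's, which is the symmetry trick described above.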
This patch enhances the performance of `std::for_each_n` when used with
segmented iterators; the resulting speedups are summarized in the tables
below. This addresses a subtask of
https://github.com/llvm/llvm-project/issues/102817.
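To illustrate the segmented-iterator idea itself, here is a conceptual model using `vector<vector<int>>` as a stand-in for a chunked structure like `std::deque`; this is not the libc++ dispatch machinery:

```cpp
#include <iostream>
#include <vector>

// Process each contiguous segment with a tight inner loop, so the hot path
// avoids the per-element chunk-boundary check a plain iterator would pay.
template <class F>
void for_each_segmented(const std::vector<std::vector<int>>& segments, F f) {
  for (const auto& seg : segments) // outer loop: one step per segment
    for (int x : seg)              // inner loop: contiguous, vectorizable
      f(x);
}

int main() {
  std::vector<std::vector<int>> chunks{{1, 2, 3}, {4, 5}};
  int sum = 0;
  for_each_segmented(chunks, [&](int x) { sum += x; });
  std::cout << sum << '\n'; // prints: 15
}
```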
This patch adds missing benchmark coverage for partial_sort,
partial_sort_copy, is_sorted and is_sorted_until.
It also refactors the existing benchmarks for sort and stable_sort to
follow the consistent style of the new algorithm benchmarks. However,
these benchmarks were notoriously slow to run since they tested multiple
data patterns on multiple data types. To try to alleviate this, I
reduced the benchmarks to only run on integral types and on a single
non-integral type, which should faithfully show how the algorithm
behaves for anything non-integral. That said, this is technically a
reduction in coverage.
This reverts commit d1156fcb56891fb1a426c3e8331a51d47f98a1b8.
This patch fixes the reported incorrect formatting changes and adds
tests for them. The performance should be unaffected, since there are no
significant changes required to fix the bugs.
Specifically, a `>` was changed to a `>=` so that a `+` is also added in
the zero case, and we now check for zero before printing the octal and
hexadecimal prefixes.
Closes #131710
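For instance, with a conforming `std::format`, the sign option applies to zero as well, since zero is non-negative (a small illustration of the expected behavior, not part of the patch):

```cpp
#include <format>
#include <iostream>

int main() {
  // The '+' option must produce a sign for zero too.
  std::cout << std::format("{:+}", 0) << '\n'; // prints: +0
}
```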
This PR optimizes the performance of `std::ranges::rotate` for
`vector<bool>::iterator`. The optimization yields a performance
improvement of up to 2096x.
Closes #64038.
This PR optimizes the performance of `std::ranges::swap_ranges` for
`vector<bool>::iterator`, addressing a subtask outlined in issue #64038.
The optimizations yield performance improvements of up to **611x** for
aligned range swaps and **78x** for unaligned range swaps.
Additionally, comprehensive tests covering up to 4 storage words (256
bytes) with odd and even bit sizes are provided, which validate the
proposed optimizations in this patch.
This PR optimizes the performance of `std::ranges::equal` for
`vector<bool>::iterator`, addressing a subtask outlined in issue #64038.
The optimizations yield performance improvements of up to 188x for
aligned equality comparison and 82x for unaligned equality
comparison. Moreover, comprehensive tests covering up to 4 storage words
(256 bytes) with odd and even bit sizes are provided, which validate the
proposed optimizations in this patch.
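The core idea behind these `vector<bool>` optimizations can be sketched as follows (conceptual only, not the libc++ implementation):

```cpp
#include <cstddef>
#include <cstdint>

// When both bit ranges start on a storage-word boundary, equality can be
// checked one whole word at a time: 64 bits per comparison instead of one.
bool equal_aligned(const std::uint64_t* a, const std::uint64_t* b,
                   std::size_t nwords) {
  for (std::size_t i = 0; i != nwords; ++i)
    if (a[i] != b[i])
      return false;
  return true;
}
```

The unaligned case must shift and stitch adjacent words together before comparing, which is presumably why its speedup (82x) is smaller than the aligned one (188x).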
As a follow-up to #121013 (which optimized `ranges::copy`) and #121026
(which optimized `ranges::copy_backward`), this PR enhances the
performance of `std::ranges::{move, move_backward}` for
`vector<bool>::iterator`, addressing a subtask outlined in issue #64038.
The optimizations bring performance improvements analogous to those
achieved for the `{copy, copy_backward}` algorithms: up to 2000x for
aligned moves and 60x for unaligned moves. Moreover, comprehensive
tests covering up to 4 storage words (256 bytes) with odd and even bit
sizes are provided, which validate the proposed optimizations in this
patch.
We have some benchmarks that were benchmarking very specific
functionality, namely the optimizations in vector<bool>::iterator. Call
this out in the benchmarks by renaming them appropriately. In the future
we will also increase the coverage of these benchmarks to test other
containers.
This patch does not significantly change how the sequence container
benchmarks are done, but it adopts the same style as the associative
container benchmarks.
This commit does adjust how we benchmark push_back: previously, we never
really measured the overhead of its slow path (the case where we need to
reallocate).
This patch implements generic associative container benchmarks for
containers with unique keys. In doing so, it replaces the existing
std::map benchmarks, which were based on the cartesian product
infrastructure and were too slow to execute.
These new benchmarks aim to strike a balance between exhaustive coverage
of all operations in the most interesting case, while executing fairly
rapidly (~40s on my machine).
This bumps the requirement for the map benchmarks from C++17 to C++20
because the common header that provides associative container benchmarks
requires support for C++20 concepts.
Rewrite the sequence container benchmarks to only rely on the actual
operations specified in SequenceContainer requirements and add
benchmarks for std::list, which is also considered a sequence container.
One of the major goals of this refactoring is also to make these
container benchmarks run faster so that they can be run more frequently.
The existing benchmarks have the significant problem that they take so
long to run that they must basically be run overnight. This patch
reduces the size of inputs such that the rewritten benchmarks each take
at most a minute to run.
This patch doesn't touch the string benchmarks, which were not using the
generic container benchmark functions previously.
As a follow-up to #121013 (which focused on `std::ranges::copy`), this
PR optimizes the performance of `std::ranges::copy_backward` for
`vector<bool>::iterator`, addressing a subtask outlined in issue #64038.
The optimizations yield performance improvements of up to 2000x for
aligned copies and 60x for unaligned copies.
While implementing this feature and its associated LWG issues, it turned
out that:
- LWG3316 (Correctly define epoch for utc_clock / utc_timepoint) only
added non-normative wording to the standard.
Implements parts of:
- P0355 Extending <chrono> to Calendars and Time Zones
- P1361 Integration of chrono with text formatting
- LWG3359 <chrono> leap second support should allow for negative leap
seconds
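A small usage example of the leap-second machinery this work enables (assuming a complete C++20 `<chrono>` with time zone database support):

```cpp
#include <chrono>
#include <iostream>

int main() {
  using namespace std::chrono;
  // Converting sys_time to utc_time accounts for the leap seconds that
  // have been inserted since the epoch.
  auto utc = utc_clock::from_sys(sys_days{January / 1 / 2017});
  // get_leap_second_info reports whether the time point falls on an
  // inserted leap second, and how many leap seconds precede it.
  auto info = get_leap_second_info(utc);
  std::cout << info.elapsed.count() << '\n'; // 27 (leap seconds by 2017-01-01)
}
```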
As a follow-up to #113852, this PR optimizes the performance of the
`insert(const_iterator pos, InputIt first, InputIt last)` function for
`input_iterator`-pair inputs in `std::vector` for cases where
reallocation occurs during insertion. Additionally, this optimization
enhances exception safety by replacing the traditional `try-catch`
mechanism with a modern exception guard for the `insert` function.
The optimization targets cases where insertion triggers reallocation. In
scenarios without reallocation, the implementation remains unchanged.
Previous implementation
-----------------------
The previous implementation of `insert` is inefficient in reallocation
scenarios because it performs the following steps separately:
- `reserve()`: leads to a first round of relocating the old
elements to new memory;
- `rotate()`: leads to a second round of reorganizing the
existing elements;
- Move-forward: moves the elements after the insertion position to
their final positions;
- Insert: performs the actual insertion.
This approach results in a lot of redundant operations, requiring the
elements to undergo three rounds of relocations/reorganizations to be
placed in their final positions.
Proposed implementation
-----------------------
The proposed implementation jointly optimizes the above four steps of the
previous implementation such that each element is placed in its final
position in just one round of relocation. Specifically, this
optimization reduces the total cost from 2 relocations + 1 `std::rotate`
call to just 1 relocation, without needing to call `std::rotate`,
thereby significantly improving overall performance.
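The new strategy can be sketched like this. This is a simplified illustration with forward iterators and a temporary vector; the actual patch handles single-pass input iterators and uses an exception guard rather than a scratch buffer:

```cpp
#include <iterator>
#include <vector>

// Each element lands in its final slot in one pass: prefix, new elements,
// suffix, with no reserve-then-rotate double shuffle.
template <class T, class FwdIt>
void insert_single_relocation(std::vector<T>& v, std::size_t pos,
                              FwdIt first, FwdIt last) {
  std::size_t n = static_cast<std::size_t>(std::distance(first, last));
  std::vector<T> tmp;
  tmp.reserve(v.size() + n); // single allocation of the final buffer
  tmp.insert(tmp.end(), std::make_move_iterator(v.begin()),
             std::make_move_iterator(v.begin() + pos)); // 1. move prefix
  tmp.insert(tmp.end(), first, last);                   // 2. construct new
  tmp.insert(tmp.end(), std::make_move_iterator(v.begin() + pos),
             std::make_move_iterator(v.end()));         // 3. move suffix
  v = std::move(tmp);
}
```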
Since that benchmark is testing n*n inputs, the batch size reported to
GoogleBenchmark should be that amount. Otherwise, GoogleBenchmark
reports the timing for calling std::gcd on the whole sequence, which is
misleading.
The benchmark currently uses makeCartesianProductBenchmark, which
doesn't make a ton of sense, since minmax_element always goes through
every element one by one. The runtime doesn't depend on the values
of the elements.
Fixes #120758
That benchmark isn't really useful, since it doesn't benchmark anything
from libc++ (besides `operator new`). The implementation of the
benchmark also has serious problems like the fact that it allocates an
unknown amount of memory without deallocating it.
This patch refactors the tests around aligned allocation and sized
deallocation to avoid relying on passing the -fsized-deallocation or
-faligned-allocation flags by default. Since both of these features are
enabled by default in >= C++14 mode, it now makes sense to make that
assumption in the test suite.
A notable exception is MinGW and some older compilers, where sized
deallocation is still not enabled by default. We treat that as a "bug"
in the test suite and we work around it by explicitly adding
-fsized-deallocation, but only under those configurations.
This PR optimizes the input iterator overload of `assign(_InputIterator,
_InputIterator)` in `std::vector<_Tp, _Allocator>` by directly assigning
to already initialized memory, rather than first destroying existing
elements and then constructing new ones. By eliminating unnecessary
destruction and construction, the proposed algorithm improves
performance by up to 2x for trivial element types (e.g.,
`std::vector<int>`), up to 2.6x for non-trivial element types like
`std::vector<std::string>`, and up to 3.4x for more complex non-trivial
types (e.g., `std::vector<std::vector<int>>`).
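The approach can be sketched as follows (conceptual, not the libc++ code):

```cpp
#include <vector>

// Reuse the already-constructed elements through assignment, then trim or
// append the remainder, instead of destroying everything and rebuilding.
template <class T, class InputIt>
void assign_in_place(std::vector<T>& v, InputIt first, InputIt last) {
  auto cur = v.begin();
  for (; first != last && cur != v.end(); ++first, ++cur)
    *cur = *first;                  // plain assignment: no destroy+construct
  if (first == last)
    v.erase(cur, v.end());          // new range was shorter: drop the tail
  else
    v.insert(v.end(), first, last); // new range was longer: append the rest
}
```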
### Google Benchmarks
Benchmark tests (`libcxx/test/benchmarks/vector_operations.bench.cpp`)
were conducted for the `assign()` implementations before and after this
patch. The tests focused on trivial element types like
`std::vector<int>`, and non-trivial element types such as
`std::vector<std::string>` and `std::vector<std::vector<int>>`.
#### Before
```
-------------------------------------------------------------------------------------------------
Benchmark                                                  Time             CPU   Iterations
-------------------------------------------------------------------------------------------------
BM_AssignInputIterIter/vector_int/1024/1024             1157 ns         1169 ns       608188
BM_AssignInputIterIter<32>/vector_string/1024/1024     14559 ns        14710 ns        47277
BM_AssignInputIterIter<32>/vector_vector_int/1024/1024 26846 ns        27129 ns        25925
```
#### After
```
-------------------------------------------------------------------------------------------------
Benchmark                                                  Time             CPU   Iterations
-------------------------------------------------------------------------------------------------
BM_AssignInputIterIter/vector_int/1024/1024              561 ns          566 ns      1242251
BM_AssignInputIterIter<32>/vector_string/1024/1024      5604 ns         5664 ns       128365
BM_AssignInputIterIter<32>/vector_vector_int/1024/1024  7927 ns         8012 ns        88579
```
This addresses the issue uncovered by #115361. Previously, we weren't
building benchmarks in many cases due to the following block:
e58949632e/libcxx/CMakeLists.txt (L162-L172)
We need to pass through the necessary variables into the benchmarks
subbuild and use the correct syntax.
In particular, test everything with both a normal and a min_allocator,
add tests for a few corner cases and add tests with types that are
trivially relocatable. Also add tests that count the number of
assignments performed by vector::erase, since that is mandated by the
Standard.
This patch is a preparation for optimizing vector::erase.
The input-generating functions for benchmark tests in the GenerateInput.h
file can be slightly improved by invoking vector::reserve before calling
vector::push_back. This slight performance improvement could potentially
speed up all benchmark tests for containers and algorithms that use these
functions as inputs.
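The pattern in question is simply the following (illustrative; the function name is hypothetical, not the actual GenerateInput.h helper):

```cpp
#include <cstddef>
#include <vector>

std::vector<int> make_input(std::size_t n) {
  std::vector<int> v;
  v.reserve(n); // one allocation up front instead of ~log2(n) regrowths
  for (std::size_t i = 0; i != n; ++i)
    v.push_back(static_cast<int>(i));
  return v;
}
```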
This PR refines the code in `GenerateInput.h` used for benchmark testing
by implementing the following changes:
- Replaced all unqualified usages of `size_t` with `std::size_t`.
- Removed unnecessary curly braces `{}` from for loops that contain
simple single-statement bodies, in accordance with LLVM coding
standards.
Instead of building the benchmarks separately via CMake and running them
separately from the test suite, this patch merges the benchmarks into
the test suite and handles both uniformly.
As a result:
- It is now possible to run individual benchmarks like we run tests
(e.g. using libcxx-lit), which is a huge quality-of-life improvement.
- The benchmarks will be run under exactly the same configuration as
the rest of the tests, which is a nice simplification. This does
mean that one has to be careful to enable the desired optimization
flags when running benchmarks, but that is easy with e.g.
`libcxx-lit <...> --param optimization=speed`.
- Benchmarks can use the same annotations as the rest of the test
suite, such as `// UNSUPPORTED` & friends.
When running the tests via `check-cxx`, we only compile the benchmarks
because running them would be too time consuming. This introduces a bit
of complexity in the testing setup, and instead it would be better to
allow passing a --dry-run flag to GoogleBenchmark executables, which is
the topic of https://github.com/google/benchmark/issues/1827.
I am not really satisfied with the layering violation of adding the
%{benchmark_flags} substitution to cmake-bridge; however, I believe
this can be improved in the future.
This patch fixes warnings and errors that come up when running the
benchmarks as part of the test suite. It also adds the necessary Lit
annotations to make it pass in various configurations and increases the
portability of the benchmarks.