31 Commits

Author SHA1 Message Date
Joseph Huber
d85576d368
[libc] Replace RPC 'close()' mechanism with RAII handler (#181690)
Summary:
Closing ports was previously done manually, This makes the protocol more
error prone as unclosed ports will leak and eventually the locks will
run out. I believe the original fear was that the RAII portion would
negatively impact code generation but I have not noticed anything
significant.
2026-02-16 15:14:30 -06:00
Joseph Huber
739e997c3e
[libc] Remove ballot on slab find (#176606)
Summary:
This negatively impacts performance, while the other changes in the
initial PR slightly improved it. This was originally done to make Volta
independent thread scheduling work, but that doesn't seem to work
correctly all the time either so we should make this faster.
2026-01-17 17:40:06 -06:00
Joseph Huber
185f078a6f [libc] Improve SIMT control flow in the GPU allocator
Summary:
The Volta independent thread scheduling is very difficult to work with.
This is a first attempt to make the logic more sound when lanes execute
independently. This isn't all that's required, but it ends up improving
control flow for AMDGPU as well.
2026-01-12 08:24:17 -06:00
Joseph Huber
c18de24d9d
[libc] Add a config option to disable slab reclaiming (#151599)
Summary:
Without slab reclaiming this interface is much simpler and it can speed
up cases with a lot of churn. Basically, wastes memory for performance.
2025-10-10 21:05:51 -05:00
Joseph Huber
005895290d
[libc] Simplifiy slab waiting in GPU memory allocator (#152872)
Summary:
This moves the waiting to be done inside of the `try_lock` routine
instead. This makes the logic much simpler since it's just a single loop
on a load. We should have the same effect here, and since we don't care
about this being a generic interface it shouldn't matter that it waits
abit. Still wait free since it's guaranteed to make progress
*eventually*.
2025-08-11 13:11:39 -05:00
Joseph Huber
ca006898b3
[libc] Cache old slabs when allocating GPU memory (#151866)
Summary:
This patch introduces a lock-free stack used to store a fixed number of
slabs. Instead of going directly through RPC memory, we instead can
consult the cache and use that. Currently, this means that ~64 MiB of
memory will remain in-use if the user completely fills the cache.
However, because we always fully destroy the object, the chunk size can
be reset so they can be fully reused.

This greatly improves performance in cases where the user has previously
accessed malloc, lowering the difference between an implementation that
does not free slabs at all and one that does.

We can also skip the expensive zeroing step if the old chunk size was
smaller than the previous one. Smaller chunk sizes need a larger
bitfield, and because we know for a fact that the number of users
remaining in this slab is zero thanks to the reference counting we can
guarantee that the bitfield is all zero like when it was initialized.
2025-08-08 14:28:41 -05:00
Joseph Huber
b5cd8c34a7 [libc] Fix leader calculation when done in wave64 mode
Summary:
Wave 64 mode touches the upper limit of this, which had an off-by-one
error. This caused it to return the same leader which gave an invalid
view of memory.
2025-07-31 21:26:31 -05:00
Joseph Huber
1a0121cbed [libc] Start slab search at number of allocated bits
Summary:
This patch changes the slab search to start at the number of allocated
bits. Previously we would randomly search, but this gives very good
performance when doing nothing but allocating, which is a common
configuration. This will degrade performance when mixing malloc and free
close to eachother as this is more likely to fail when the counter
starts decreasing.
2025-07-30 16:05:55 -05:00
Joseph Huber
78c460bbe8 [libc] Fix incorrect count when initializing slab
Summary:
The initialization code should share the result with all of its
neighbors. Right now it sets them to the sentinel value and doesn't
shuffle them correctly. Shuffle them after initialization so we
correctly report that we succeeded in the allocation.
2025-07-29 18:50:02 -05:00
Joseph Huber
b2322772f2
[libc] Reduce reference counter to a 32-bit integer (#150961)
Summary:
This reference counter tracks how many threads are using a given slab.
Currently it's a 64-bit integer, this patch reduces it to a 32-bit
integer. The benefit of this is that we save a few registers now that we
no longer need to use two for these operations. This increases the risk
of overflow, but given that the largest value we accept for a single
slab is ~131,000 it is a long way off of the maximum of four billion or
so. Obviously we can oversubscribe the reference count by having threads
attempt to claim the lock and then try to free it, but I assert that it
is exceedingly unlikely that we will somehow have over four billion GPU
threads stalled in the same place.

A later optimization could be done to split the reference counter and
pointers into a struct of arrays, that will save 128 KiB of static
memory (as we currently use 512 KiB for the slab array).
2025-07-28 11:05:36 -05:00
Joseph Huber
a1a610a128 [libc] Increase the number of times we wait on a slab
Summary:
This wait restricts how long we wait on a slab. The only reason this
isn't an infinite loop is to prevent complete deadlocks. However, this
limit was *just* on the cusp of waiting long enough for the allocation
to be done. Just increase this to a sufficiently large value, because
this limit only exists to keep the interface wait-free in the absolute
worst case scheduling scenario. This *MASSIVELY* improved performance
for mixed allocations as we no longer shuffled around creating more than
necessary.
2025-07-28 09:23:29 -05:00
Joseph Huber
a7649007ef [libc] Rework match any use in hot allocate bitfield loop
Summary:
We previously used `match_all` as the shortcut to figure out which
threads were destined for which slots. This lowers to a for-loop, which
even if it often only executes once still causes some slowdown
especially when divergent. Instead we use a single ballot call and then
calculate it.

Here the ballot tells us which lanes are the first in a block, either
the starting index or the barrier for a new 32-bit int. We then use some
bit magic to figure out for each lane ID its closest leader. For the
length we simply use the length calculated by the leader of the
remaining bits to be written. This removes the match any and the
shuffle, which improves the minimum number of cycles this takes by about
5%.
2025-07-28 09:23:29 -05:00
Joseph Huber
9975dfdf80 [libc] Small performance improvements to GPU allocator
Summary:
This slightly increases performance in a few places. First, we
optimistically assume the cached slab has ample space which lets us
avoid the atomic load on the highly contended counter in the case that
it is likely to succeed. Second, we no longer call `match_any` twice as
we can calculate the uniform slabs at the moment we return them.
Thirdly, we always choose a random index on a 32-bit boundary. This
means that in the fast case we fulfil the allocation with a single
`fetch_or`, and in the other case we quickly move to the free bit.

This nets around a 7.75% improvement for the fast path case.
2025-07-28 09:23:29 -05:00
Joseph Huber
5dc9937ea9
[libc] Improve starting indices for GPU allocation (#150432)
Summary:
The slots in this allocation scheme are statically allocated. All sizes
share the same array of slots, but are given different starting
locations to space them apart. The previous implementation used a
trivial linear slice. This is inefficient because it provides the more
likely allocations (1-1024 bytes) with just as much space as a highly
unlikely one (1 MiB).

This patch uses a cubic easing function to gradually shrink the gaps.
For example, we used to get around 700 free slots for a 16 byte
allocation, now we get around 2100 before it starts encroaching on the
32 byte allocation space. This could be improved further, but I think
this is sufficient.
2025-07-28 07:54:48 -05:00
Joseph Huber
ce52f9cdc4
[libc] Search empty bits after failed allocation (#149910)
Summary:
The scheme we use to find a free bit is to just do a random walk. This
works very well up until you start to completely saturate the bitfield.
Because the result of the fetch_or yields the previous value, we can
search this to go to any known empty bits as our next guess. This
effectively increases our liklihood of finding a match after two tries
by 32x since the distribution is random.

This *massively* improves performance when a lot of memory is allocated
without freeing, as it now doesn't takea one in a million shot to fill
that last bit. A further change could improve this further by only
*mostly* filling the slab, allowing 1% to be free at all times.
2025-07-23 13:19:43 -05:00
Joseph Huber
df1dd803b6
[libc] Cache the most recently used slot for a chunk size (#149751)
Summary:
This patch changes the `find_slab` logic to simply cache the most
successful slot. This means the happy fast path is now a single atomic
load on this index. I removed the SIMT shuffling logic that did slab
lookups wave-parallel. Here I am considering the actual traversal to be
comparatively unlikely, so it's not overly bad that it takes longer.
ideally one thread finds a slot and shared it with the rest so we only
pay that cost once.

---------

Co-authored-by: Shilei Tian <i@tianshilei.me>
2025-07-23 13:19:36 -05:00
Joseph Huber
50f40a5327
[libc] Fix internal alignment in allcoator (#146738)
Summary:
The allocator interface is supposed to have 16 byte alignment (to keep
it consistent with the CPU allocator. We could probably drop this to 8
if desires.) But this was not enforced because the number of bytes used
for the bitfield sometimes resulted in alignment of 8 instead of 16.
Explicitly align the number of bytes to be a multiple of 16 even if
unused.
2025-07-02 12:29:01 -05:00
Joseph Huber
24828c8c45
[libc] Efficiently implement aligned_alloc for AMDGPU (#146585)
Summary:
This patch uses the actual allocator interface to implement
`aligned_alloc`. We do this by simply rounding up the amount allocated.
Because of how index calculation works, any offset within an allocated
pointer will still map to the same chunk, so we can just adjust
internally and it will free all the same.
2025-07-02 09:25:57 -05:00
Joseph Huber
dea4f3213d
[libc] Use is aligned builtin instead of ptrtoint (#146402)
Summary:
This avoids a ptrtoint by just using the clang builtin. This is clang
specific but only clang can compile GPU code anyway so I do not bother
with a fallback.
2025-07-02 07:03:11 -05:00
Joseph Huber
10445acfa6
[libc] Efficiently implement 'realloc' for AMDGPU devices (#145960)
Summary:
Now that we have `malloc` we can implement `realloc` efficiently. This
uses the known chunk sizes to avoid unnecessary allocations. We just
return nullptr for NVPTX. I'd remove the list for the entrypoint but
then the libc++ code would stop working. When someone writes the NVPTX
support this will be trivial.
2025-06-30 08:39:40 -05:00
Joseph Huber
d34214a85e
[libc] Add and use 'cpp::launder' to guard placement new (#146123)
Summary:
In the GPU allocator we reinterpret cast from a void pointer. We know
that an actual object was constructed there according to the C++ object
model, but to make it fully standards compliant we need to 'launder' it
to forward that information to the compiler. Add this function and call
it as appropriate.
2025-06-27 14:34:33 -05:00
Joseph Huber
dc4335a2bf
[libc] Perform bitfield zero initialization wave-parallel (#143607)
Summary:
We need to set the bitfield memory to zero because the system does not
guarantee zeroed out memory. Even if fresh pages are zero, the system
allows re-use so we would need a `kfd` level API to skip this step.

Because we can't this patch updates the logic to perform the zero
initialization wave-parallel. This reduces the amount of time it takes
to allocate a fresh by up to a tenth.

This has the unfortunate side effect that the control flow is more
convoluted and we waste some extra registers, but it's worth it to
reduce the slab allocation latency.
2025-06-11 18:22:05 -05:00
Joseph Huber
f1575de4c5 [libc][NFC] Remove template from GPU allocator reference counter
Summary:
We don't need this to be generic, precommit for
https://github.com/llvm/llvm-project/pull/143607
2025-06-11 11:37:51 -05:00
Joseph Huber
59725c7486
[libc] Coalesce bitfield access in GPU malloc (#142692)
Summary:
This improves performance by reducing the amount of RMW operations we
need to do to a single slot. This improves repeated allocations without
much contention about ten percent.
2025-06-04 20:32:07 -05:00
Joseph Huber
b4bc8c6f83
[libc] Implement efficient 'malloc' on the GPU (#140156)
Summary:
This is the big patch that implements an efficient device-side `malloc`
on the GPU. This is the first pass and many improvements will be made
later.

The scheme revolves around using a global reference counted pointer to
hand out access to a dynamically created and destroyed slab interface.
The slab is simply a large bitfield with one bit for each slab. All
allocations are the same size in a slab, so different sized allocations
are done through different slabs.

Allocation is thus searching for or creating a slab for the desired
slab, reserving space, and then searching for a free bit. Freeing is
clearing the bit and then releasing the space.

This interface allows memory to dynamically grow and shrink. Future
patches will have different modes to allow fast first-time-use as well
as a non-RPC version.
2025-05-28 08:21:43 -05:00
Joseph Huber
a6ef0debb1 [libc][NFC] Rename RPC opcodes to better reflect their usage
Summary:
RPC_ is a generic prefix here, use LIBC_ to indicate that these are
opcodes used to implement the C library
2024-12-02 15:35:08 -06:00
Joseph Huber
be0c67c90e
[libc] Remove dependency on cpp::function in rpc.h (#112422)
Summary:
I'm going to attempt to move the `rpc.h` header to a separate folder
that we can install and include outside of `libc`. Before doing this I'm
going to try to trim up the file so there's not as many things I need to
copy to make it work. This dependency on `cpp::functional` is a low
hanging fruit. I only did it so that I could overload the argument of
the work function so that passing the id was optional in the lambda,
that's not a *huge* deal and it makes it more explicit I suppose.
2024-10-15 12:31:06 -07:00
Petr Hosek
5ff3ff33ff
[libc] Migrate to using LIBC_NAMESPACE_DECL for namespace declaration (#98597)
This is a part of #97655.
2024-07-12 09:28:41 -07:00
Mehdi Amini
ce9035f5bd
Revert "[libc] Migrate to using LIBC_NAMESPACE_DECL for namespace declaration" (#98593)
Reverts llvm/llvm-project#98075

bots are broken
2024-07-12 09:12:13 +02:00
Petr Hosek
3f30effe1b
[libc] Migrate to using LIBC_NAMESPACE_DECL for namespace declaration (#98075)
This is a part of #97655.
2024-07-11 12:35:22 -07:00
Joseph Huber
ea697dcc2a
[libc][NFC] Move GPU allocator implementation to common header (#84690)
Summary:
This is a NFC move preceding more radical functional changes to the
allocator implementation. We just move it to a common utility so it will
be easier to write these in tandem.
2024-03-10 15:49:44 -05:00