llvm-project

Author	SHA1	Message	Date
Joseph Huber	d85576d368	[libc] Replace RPC 'close()' mechanism with RAII handler (#181690) Summary: Closing ports was previously done manually. That makes the protocol more error prone, as unclosed ports will leak and eventually the locks will run out. I believe the original fear was that the RAII portion would negatively impact code generation, but I have not noticed anything significant.	2026-02-16 15:14:30 -06:00
Joseph Huber	739e997c3e	[libc] Remove ballot on slab find (#176606) Summary: This negatively impacts performance, while the other changes in the initial PR slightly improved it. This was originally done to make Volta independent thread scheduling work, but that doesn't seem to work correctly all the time either, so we should make this faster.	2026-01-17 17:40:06 -06:00
Joseph Huber	185f078a6f	[libc] Improve SIMT control flow in the GPU allocator Summary: The Volta independent thread scheduling is very difficult to work with. This is a first attempt to make the logic more sound when lanes execute independently. This isn't all that's required, but it ends up improving control flow for AMDGPU as well.	2026-01-12 08:24:17 -06:00
Joseph Huber	c18de24d9d	[libc] Add a config option to disable slab reclaiming (#151599) Summary: Without slab reclaiming this interface is much simpler and it can speed up cases with a lot of churn. Basically, this trades memory for performance.	2025-10-10 21:05:51 -05:00
Joseph Huber	005895290d	[libc] Simplify slab waiting in GPU memory allocator (#152872) Summary: This moves the waiting to be done inside of the `try_lock` routine instead. This makes the logic much simpler since it's just a single loop on a load. We should have the same effect here, and since we don't care about this being a generic interface it shouldn't matter that it waits a bit. Still wait-free since it's guaranteed to make progress eventually.	2025-08-11 13:11:39 -05:00
Joseph Huber	ca006898b3	[libc] Cache old slabs when allocating GPU memory (#151866) Summary: This patch introduces a lock-free stack used to store a fixed number of slabs. Instead of going directly through RPC memory, we can consult the cache and use that. Currently, this means that ~64 MiB of memory will remain in use if the user completely fills the cache. However, because we always fully destroy the object, the chunk size can be reset so they can be fully reused. This greatly improves performance in cases where the user has previously accessed malloc, lowering the difference between an implementation that does not free slabs at all and one that does. We can also skip the expensive zeroing step if the old chunk size was smaller than the new one. Smaller chunk sizes need a larger bitfield, and because we know for a fact that the number of users remaining in this slab is zero thanks to the reference counting, we can guarantee that the bitfield is all zero like when it was initialized.	2025-08-08 14:28:41 -05:00
Joseph Huber	b5cd8c34a7	[libc] Fix leader calculation when done in wave64 mode Summary: Wave 64 mode touches the upper limit of this, which had an off-by-one error. This caused it to return the same leader which gave an invalid view of memory.	2025-07-31 21:26:31 -05:00
Joseph Huber	1a0121cbed	[libc] Start slab search at number of allocated bits Summary: This patch changes the slab search to start at the number of allocated bits. Previously we would randomly search, but this gives very good performance when doing nothing but allocating, which is a common configuration. This will degrade performance when mixing malloc and free close to each other, as this is more likely to fail when the counter starts decreasing.	2025-07-30 16:05:55 -05:00
Joseph Huber	78c460bbe8	[libc] Fix incorrect count when initializing slab Summary: The initialization code should share the result with all of its neighbors. Right now it sets them to the sentinel value and doesn't shuffle them correctly. Shuffle them after initialization so we correctly report that we succeeded in the allocation.	2025-07-29 18:50:02 -05:00
Joseph Huber	b2322772f2	[libc] Reduce reference counter to a 32-bit integer (#150961) Summary: This reference counter tracks how many threads are using a given slab. Currently it's a 64-bit integer; this patch reduces it to a 32-bit integer. The benefit of this is that we save a few registers now that we no longer need to use two for these operations. This increases the risk of overflow, but given that the largest value we accept for a single slab is ~131,000 it is a long way off of the maximum of four billion or so. Obviously we can oversubscribe the reference count by having threads attempt to claim the lock and then try to free it, but I assert that it is exceedingly unlikely that we will somehow have over four billion GPU threads stalled in the same place. A later optimization could be done to split the reference counter and pointers into a struct of arrays, which will save 128 KiB of static memory (as we currently use 512 KiB for the slab array).	2025-07-28 11:05:36 -05:00
Joseph Huber	a1a610a128	[libc] Increase the number of times we wait on a slab Summary: This wait restricts how long we wait on a slab. The only reason this isn't an infinite loop is to prevent complete deadlocks. However, this limit was just on the cusp of waiting long enough for the allocation to be done. Just increase this to a sufficiently large value, because this limit only exists to keep the interface wait-free in the absolute worst case scheduling scenario. This MASSIVELY improved performance for mixed allocations as we no longer shuffled around creating more than necessary.	2025-07-28 09:23:29 -05:00
Joseph Huber	a7649007ef	[libc] Rework match any use in hot allocate bitfield loop Summary: We previously used `match_any` as the shortcut to figure out which threads were destined for which slots. This lowers to a for-loop, which even if it often only executes once still causes some slowdown, especially when divergent. Instead we use a single ballot call and then calculate it. Here the ballot tells us which lanes are the first in a block, either the starting index or the barrier for a new 32-bit int. We then use some bit magic to figure out for each lane ID its closest leader. For the length we simply use the length calculated by the leader of the remaining bits to be written. This removes the match any and the shuffle, which improves the minimum number of cycles this takes by about 5%.	2025-07-28 09:23:29 -05:00
Joseph Huber	9975dfdf80	[libc] Small performance improvements to GPU allocator Summary: This slightly increases performance in a few places. First, we optimistically assume the cached slab has ample space, which lets us avoid the atomic load on the highly contended counter in the case that it is likely to succeed. Second, we no longer call `match_any` twice as we can calculate the uniform slabs at the moment we return them. Third, we always choose a random index on a 32-bit boundary. This means that in the fast case we fulfil the allocation with a single `fetch_or`, and in the other case we quickly move to the free bit. This nets around a 7.75% improvement for the fast path case.	2025-07-28 09:23:29 -05:00
Joseph Huber	5dc9937ea9	[libc] Improve starting indices for GPU allocation (#150432) Summary: The slots in this allocation scheme are statically allocated. All sizes share the same array of slots, but are given different starting locations to space them apart. The previous implementation used a trivial linear slice. This is inefficient because it provides the more likely allocations (1-1024 bytes) with just as much space as a highly unlikely one (1 MiB). This patch uses a cubic easing function to gradually shrink the gaps. For example, we used to get around 700 free slots for a 16 byte allocation, now we get around 2100 before it starts encroaching on the 32 byte allocation space. This could be improved further, but I think this is sufficient.	2025-07-28 07:54:48 -05:00
Joseph Huber	ce52f9cdc4	[libc] Search empty bits after failed allocation (#149910) Summary: The scheme we use to find a free bit is to just do a random walk. This works very well up until you start to completely saturate the bitfield. Because the result of the fetch_or yields the previous value, we can search this to go to any known empty bits as our next guess. This effectively increases our likelihood of finding a match after two tries by 32x since the distribution is random. This massively improves performance when a lot of memory is allocated without freeing, as it now doesn't take a one in a million shot to fill that last bit. A further change could improve this further by only mostly filling the slab, allowing 1% to be free at all times.	2025-07-23 13:19:43 -05:00
Joseph Huber	df1dd803b6	[libc] Cache the most recently used slot for a chunk size (#149751) Summary: This patch changes the `find_slab` logic to simply cache the most successful slot. This means the happy fast path is now a single atomic load on this index. I removed the SIMT shuffling logic that did slab lookups wave-parallel. Here I am considering the actual traversal to be comparatively unlikely, so it's not overly bad that it takes longer. Ideally, one thread finds a slot and shares it with the rest so we only pay that cost once. --------- Co-authored-by: Shilei Tian <i@tianshilei.me>	2025-07-23 13:19:36 -05:00
Joseph Huber	50f40a5327	[libc] Fix internal alignment in allocator (#146738) Summary: The allocator interface is supposed to have 16 byte alignment (to keep it consistent with the CPU allocator; we could probably drop this to 8 if desired). But this was not enforced because the number of bytes used for the bitfield sometimes resulted in alignment of 8 instead of 16. Explicitly align the number of bytes to be a multiple of 16 even if unused.	2025-07-02 12:29:01 -05:00
Joseph Huber	24828c8c45	[libc] Efficiently implement `aligned_alloc` for AMDGPU (#146585) Summary: This patch uses the actual allocator interface to implement `aligned_alloc`. We do this by simply rounding up the amount allocated. Because of how index calculation works, any offset within an allocated pointer will still map to the same chunk, so we can just adjust internally and it will free all the same.	2025-07-02 09:25:57 -05:00
Joseph Huber	dea4f3213d	[libc] Use is aligned builtin instead of ptrtoint (#146402) Summary: This avoids a ptrtoint by just using the clang builtin. This is clang specific but only clang can compile GPU code anyway so I do not bother with a fallback.	2025-07-02 07:03:11 -05:00
Joseph Huber	10445acfa6	[libc] Efficiently implement 'realloc' for AMDGPU devices (#145960) Summary: Now that we have `malloc` we can implement `realloc` efficiently. This uses the known chunk sizes to avoid unnecessary allocations. We just return nullptr for NVPTX. I'd remove the list for the entrypoint but then the libc++ code would stop working. When someone writes the NVPTX support this will be trivial.	2025-06-30 08:39:40 -05:00
Joseph Huber	d34214a85e	[libc] Add and use 'cpp::launder' to guard placement new (#146123) Summary: In the GPU allocator we reinterpret cast from a void pointer. We know that an actual object was constructed there according to the C++ object model, but to make it fully standards compliant we need to 'launder' it to forward that information to the compiler. Add this function and call it as appropriate.	2025-06-27 14:34:33 -05:00
Joseph Huber	dc4335a2bf	[libc] Perform bitfield zero initialization wave-parallel (#143607) Summary: We need to set the bitfield memory to zero because the system does not guarantee zeroed out memory. Even if fresh pages are zero, the system allows re-use so we would need a `kfd` level API to skip this step. Because we can't, this patch updates the logic to perform the zero initialization wave-parallel. This reduces the amount of time it takes to allocate a fresh slab by up to a tenth. This has the unfortunate side effect that the control flow is more convoluted and we waste some extra registers, but it's worth it to reduce the slab allocation latency.	2025-06-11 18:22:05 -05:00
Joseph Huber	f1575de4c5	[libc][NFC] Remove template from GPU allocator reference counter Summary: We don't need this to be generic, precommit for https://github.com/llvm/llvm-project/pull/143607	2025-06-11 11:37:51 -05:00
Joseph Huber	59725c7486	[libc] Coalesce bitfield access in GPU malloc (#142692) Summary: This improves performance by reducing the number of RMW operations we need to do to a single slot. This improves repeated allocations without much contention by about ten percent.	2025-06-04 20:32:07 -05:00
Joseph Huber	b4bc8c6f83	[libc] Implement efficient 'malloc' on the GPU (#140156) Summary: This is the big patch that implements an efficient device-side `malloc` on the GPU. This is the first pass and many improvements will be made later. The scheme revolves around using a global reference counted pointer to hand out access to a dynamically created and destroyed slab interface. The slab is simply a large bitfield with one bit for each chunk. All allocations are the same size in a slab, so different sized allocations are done through different slabs. Allocation is thus searching for or creating a slab for the desired chunk size, reserving space, and then searching for a free bit. Freeing is clearing the bit and then releasing the space. This interface allows memory to dynamically grow and shrink. Future patches will have different modes to allow fast first-time-use as well as a non-RPC version.	2025-05-28 08:21:43 -05:00
Joseph Huber	a6ef0debb1	[libc][NFC] Rename RPC opcodes to better reflect their usage Summary: RPC_ is a generic prefix here, use LIBC_ to indicate that these are opcodes used to implement the C library	2024-12-02 15:35:08 -06:00
Joseph Huber	be0c67c90e	[libc] Remove dependency on `cpp::function` in `rpc.h` (#112422) Summary: I'm going to attempt to move the `rpc.h` header to a separate folder that we can install and include outside of `libc`. Before doing this I'm going to try to trim up the file so there are not as many things I need to copy to make it work. This dependency on `cpp::function` is low-hanging fruit. I only did it so that I could overload the argument of the work function so that passing the id was optional in the lambda; that's not a huge deal and it makes it more explicit I suppose.	2024-10-15 12:31:06 -07:00
Petr Hosek	5ff3ff33ff	[libc] Migrate to using LIBC_NAMESPACE_DECL for namespace declaration (#98597) This is a part of #97655.	2024-07-12 09:28:41 -07:00
Mehdi Amini	ce9035f5bd	Revert "[libc] Migrate to using LIBC_NAMESPACE_DECL for namespace declaration" (#98593) Reverts llvm/llvm-project#98075; bots are broken.	2024-07-12 09:12:13 +02:00
Petr Hosek	3f30effe1b	[libc] Migrate to using LIBC_NAMESPACE_DECL for namespace declaration (#98075) This is a part of #97655.	2024-07-11 12:35:22 -07:00
Joseph Huber	ea697dcc2a	[libc][NFC] Move GPU allocator implementation to common header (#84690) Summary: This is an NFC move preceding more radical functional changes to the allocator implementation. We just move it to a common utility so it will be easier to write these in tandem.	2024-03-10 15:49:44 -05:00

31 Commits
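
Illustrative sketch (not taken from the repository): commit ce52f9cdc4 describes reusing the value returned by `fetch_or` to pick the next guess after a failed bit claim. The host-side C++ below is a minimal approximation of that idea on a single 32-bit word, assuming `std::atomic` in place of the GPU atomics. The names `claim_bit` and `lowest_set_bit` are hypothetical and do not appear in the allocator.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical helper: index of the lowest set bit in a non-zero word.
static int lowest_set_bit(uint32_t value) {
  int index = 0;
  while ((value & 1u) == 0u) {
    value >>= 1;
    ++index;
  }
  return index;
}

// Hypothetical sketch: try to claim one bit in `word`, starting at `guess`
// (0..31). Returns the claimed bit index, or -1 if the word was observed full.
int claim_bit(std::atomic<uint32_t> &word, int guess) {
  for (;;) {
    uint32_t mask = uint32_t{1} << guess;
    // fetch_or returns the value held *before* our bit was OR-ed in.
    uint32_t before = word.fetch_or(mask);
    if ((before & mask) == 0u)
      return guess; // The bit was clear, so this caller now owns it.

    // The bit was already taken. The previous value tells us which bits
    // were free at that moment, so jump straight to one of them.
    uint32_t free_bits = ~before;
    if (free_bits == 0u)
      return -1; // Every bit was set when we looked.
    guess = lowest_set_bit(free_bits);
  }
}
```

Starting from an empty word, `claim_bit(word, 5)` claims bit 5; a second call with the same guess finds bit 5 taken and falls back to bit 0, the lowest bit that was free in the returned value. The real allocator works across many words and lanes at once, which this sketch does not attempt to model.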