Summary:
Closing ports was previously done manually, This makes the protocol more
error prone as unclosed ports will leak and eventually the locks will
run out. I believe the original fear was that the RAII portion would
negatively impact code generation but I have not noticed anything
significant.
Summary:
This negatively impacts performance, while the other changes in the
initial PR slightly improved it. This was originally done to make Volta
independent thread scheduling work, but that doesn't seem to work
correctly all the time either so we should make this faster.
Summary:
The Volta independent thread scheduling is very difficult to work with.
This is a first attempt to make the logic more sound when lanes execute
independently. This isn't all that's required, but it ends up improving
control flow for AMDGPU as well.
Summary:
Without slab reclaiming this interface is much simpler and it can speed
up cases with a lot of churn. Basically, wastes memory for performance.
Summary:
This moves the waiting to be done inside of the `try_lock` routine
instead. This makes the logic much simpler since it's just a single loop
on a load. We should have the same effect here, and since we don't care
about this being a generic interface it shouldn't matter that it waits
abit. Still wait free since it's guaranteed to make progress
*eventually*.
Summary:
This patch introduces a lock-free stack used to store a fixed number of
slabs. Instead of going directly through RPC memory, we instead can
consult the cache and use that. Currently, this means that ~64 MiB of
memory will remain in-use if the user completely fills the cache.
However, because we always fully destroy the object, the chunk size can
be reset so they can be fully reused.
This greatly improves performance in cases where the user has previously
accessed malloc, lowering the difference between an implementation that
does not free slabs at all and one that does.
We can also skip the expensive zeroing step if the old chunk size was
smaller than the previous one. Smaller chunk sizes need a larger
bitfield, and because we know for a fact that the number of users
remaining in this slab is zero thanks to the reference counting we can
guarantee that the bitfield is all zero like when it was initialized.
Summary:
Wave 64 mode touches the upper limit of this, which had an off-by-one
error. This caused it to return the same leader which gave an invalid
view of memory.
Summary:
This patch changes the slab search to start at the number of allocated
bits. Previously we would randomly search, but this gives very good
performance when doing nothing but allocating, which is a common
configuration. This will degrade performance when mixing malloc and free
close to eachother as this is more likely to fail when the counter
starts decreasing.
Summary:
The initialization code should share the result with all of its
neighbors. Right now it sets them to the sentinel value and doesn't
shuffle them correctly. Shuffle them after initialization so we
correctly report that we succeeded in the allocation.
Summary:
This reference counter tracks how many threads are using a given slab.
Currently it's a 64-bit integer, this patch reduces it to a 32-bit
integer. The benefit of this is that we save a few registers now that we
no longer need to use two for these operations. This increases the risk
of overflow, but given that the largest value we accept for a single
slab is ~131,000 it is a long way off of the maximum of four billion or
so. Obviously we can oversubscribe the reference count by having threads
attempt to claim the lock and then try to free it, but I assert that it
is exceedingly unlikely that we will somehow have over four billion GPU
threads stalled in the same place.
A later optimization could be done to split the reference counter and
pointers into a struct of arrays, that will save 128 KiB of static
memory (as we currently use 512 KiB for the slab array).
Summary:
This wait restricts how long we wait on a slab. The only reason this
isn't an infinite loop is to prevent complete deadlocks. However, this
limit was *just* on the cusp of waiting long enough for the allocation
to be done. Just increase this to a sufficiently large value, because
this limit only exists to keep the interface wait-free in the absolute
worst case scheduling scenario. This *MASSIVELY* improved performance
for mixed allocations as we no longer shuffled around creating more than
necessary.
Summary:
We previously used `match_all` as the shortcut to figure out which
threads were destined for which slots. This lowers to a for-loop, which
even if it often only executes once still causes some slowdown
especially when divergent. Instead we use a single ballot call and then
calculate it.
Here the ballot tells us which lanes are the first in a block, either
the starting index or the barrier for a new 32-bit int. We then use some
bit magic to figure out for each lane ID its closest leader. For the
length we simply use the length calculated by the leader of the
remaining bits to be written. This removes the match any and the
shuffle, which improves the minimum number of cycles this takes by about
5%.
Summary:
This slightly increases performance in a few places. First, we
optimistically assume the cached slab has ample space which lets us
avoid the atomic load on the highly contended counter in the case that
it is likely to succeed. Second, we no longer call `match_any` twice as
we can calculate the uniform slabs at the moment we return them.
Thirdly, we always choose a random index on a 32-bit boundary. This
means that in the fast case we fulfil the allocation with a single
`fetch_or`, and in the other case we quickly move to the free bit.
This nets around a 7.75% improvement for the fast path case.
Summary:
The slots in this allocation scheme are statically allocated. All sizes
share the same array of slots, but are given different starting
locations to space them apart. The previous implementation used a
trivial linear slice. This is inefficient because it provides the more
likely allocations (1-1024 bytes) with just as much space as a highly
unlikely one (1 MiB).
This patch uses a cubic easing function to gradually shrink the gaps.
For example, we used to get around 700 free slots for a 16 byte
allocation, now we get around 2100 before it starts encroaching on the
32 byte allocation space. This could be improved further, but I think
this is sufficient.
Summary:
The scheme we use to find a free bit is to just do a random walk. This
works very well up until you start to completely saturate the bitfield.
Because the result of the fetch_or yields the previous value, we can
search this to go to any known empty bits as our next guess. This
effectively increases our liklihood of finding a match after two tries
by 32x since the distribution is random.
This *massively* improves performance when a lot of memory is allocated
without freeing, as it now doesn't takea one in a million shot to fill
that last bit. A further change could improve this further by only
*mostly* filling the slab, allowing 1% to be free at all times.
Summary:
This patch changes the `find_slab` logic to simply cache the most
successful slot. This means the happy fast path is now a single atomic
load on this index. I removed the SIMT shuffling logic that did slab
lookups wave-parallel. Here I am considering the actual traversal to be
comparatively unlikely, so it's not overly bad that it takes longer.
ideally one thread finds a slot and shared it with the rest so we only
pay that cost once.
---------
Co-authored-by: Shilei Tian <i@tianshilei.me>
Summary:
The allocator interface is supposed to have 16 byte alignment (to keep
it consistent with the CPU allocator. We could probably drop this to 8
if desires.) But this was not enforced because the number of bytes used
for the bitfield sometimes resulted in alignment of 8 instead of 16.
Explicitly align the number of bytes to be a multiple of 16 even if
unused.
Summary:
This patch uses the actual allocator interface to implement
`aligned_alloc`. We do this by simply rounding up the amount allocated.
Because of how index calculation works, any offset within an allocated
pointer will still map to the same chunk, so we can just adjust
internally and it will free all the same.
Summary:
This avoids a ptrtoint by just using the clang builtin. This is clang
specific but only clang can compile GPU code anyway so I do not bother
with a fallback.
Summary:
Now that we have `malloc` we can implement `realloc` efficiently. This
uses the known chunk sizes to avoid unnecessary allocations. We just
return nullptr for NVPTX. I'd remove the list for the entrypoint but
then the libc++ code would stop working. When someone writes the NVPTX
support this will be trivial.
Summary:
In the GPU allocator we reinterpret cast from a void pointer. We know
that an actual object was constructed there according to the C++ object
model, but to make it fully standards compliant we need to 'launder' it
to forward that information to the compiler. Add this function and call
it as appropriate.
Summary:
We need to set the bitfield memory to zero because the system does not
guarantee zeroed out memory. Even if fresh pages are zero, the system
allows re-use so we would need a `kfd` level API to skip this step.
Because we can't this patch updates the logic to perform the zero
initialization wave-parallel. This reduces the amount of time it takes
to allocate a fresh by up to a tenth.
This has the unfortunate side effect that the control flow is more
convoluted and we waste some extra registers, but it's worth it to
reduce the slab allocation latency.
Summary:
This improves performance by reducing the amount of RMW operations we
need to do to a single slot. This improves repeated allocations without
much contention about ten percent.
Summary:
This is the big patch that implements an efficient device-side `malloc`
on the GPU. This is the first pass and many improvements will be made
later.
The scheme revolves around using a global reference counted pointer to
hand out access to a dynamically created and destroyed slab interface.
The slab is simply a large bitfield with one bit for each slab. All
allocations are the same size in a slab, so different sized allocations
are done through different slabs.
Allocation is thus searching for or creating a slab for the desired
slab, reserving space, and then searching for a free bit. Freeing is
clearing the bit and then releasing the space.
This interface allows memory to dynamically grow and shrink. Future
patches will have different modes to allow fast first-time-use as well
as a non-RPC version.
Summary:
I'm going to attempt to move the `rpc.h` header to a separate folder
that we can install and include outside of `libc`. Before doing this I'm
going to try to trim up the file so there's not as many things I need to
copy to make it work. This dependency on `cpp::functional` is a low
hanging fruit. I only did it so that I could overload the argument of
the work function so that passing the id was optional in the lambda,
that's not a *huge* deal and it makes it more explicit I suppose.
Summary:
This is a NFC move preceding more radical functional changes to the
allocator implementation. We just move it to a common utility so it will
be easier to write these in tandem.