Summary:
`Status` is unfortunately heavily overloaded in practice. Things like
X11 define it as a macro. Best to just remove that possibility entirely.
Summary:
We store the locks in local device memory for performance and
simplicity. The number here needs to correspond to the maximum occupancy
so that we never have a situation where a GPU thread is blocking another
GPU thread.
The number now is sufficient for most hardware, but modern compute chips
like the MI300x are already pushing ~12000 resident waves. This has ABI
impliciations so I'd like to bump it up sooner rather than later. The
ABI change is within what OpenMP expects, LLVM major versions, and it
will be caught statically so there's no risk of silent corruption (size
doesn't match).
Summary:
One of the main disadvantages to using the RPC interface is that it
requires a server thread to spin on the mailboxes checking for work.
The vast majority of the time, there will be no work and work will come
in large bursts.
The HSA / KFD interface supports device-side interrupts and already has
handling for binding these events to an HSA signal. This means that we
can send interrupts from the GPU to wake a sleeping thread on the CPU.
The sleeping thread will be descheduled with a blocking HSA wait call
and woken up when its event ID is raised through the kernel driver's
interrupt.
This is very target-specific handling, but I believe it is valuable
enough to warrant it being in the protocol. It is completely optional,
as it is ignored if uninitialized. This should bring this support at
parity with the interface HIP expects.
Summary:
The NVIDIA ITS protocol allows lanes to diverge inside of a warp. We
previously had contingencies around this, but there were cases where
issues would still show up under highly stressed usage.
The rules state that as long as the PC is the same, threads can
reconverge. This means that we can see a 'convergent' warp even when
they took completely divergent paths to get there. This resulted in the
'index' value in the RPC port lookup loop thinking we were in a
convergent group while all the indices were different. Fix this with a
broadcast to force the expected behavior
Additionally, we did not force that the threads were actually done with
their 'work_fn'. If the work included something that caused divergence
the other threads could continue and toggle the mailbox, resulting in
the server seeing unfinished work. Fix this with an explicit sync and
have one thread do it.
Add a test to make sure this actually works.
Summary:
Recently I changed the interface to use RAII to close the ports. This
exposed a problem where the default move constructor was invoked in the
optional wrapping, this caused the destructor to fire twice on the
server, obviously causing havok. This PR changes the move destructor to
be deleted so this never happens again. Now everything is constructed
once and only references are allowed. The optional class had to be
fixed to properly set in_use so we run the destructor properly as well.
Summary:
Closing ports was previously done manually, This makes the protocol more
error prone as unclosed ports will leak and eventually the locks will
run out. I believe the original fear was that the RAII portion would
negatively impact code generation but I have not noticed anything
significant.
Summary:
Mostly NFC, replaced some inconsistent comments and replaces `class`
with `typename` to be consistent. Also fix incomplete type detection I
forgot to merge in the RPC helper PR.
Summary:
The RPC interface is useful for forwarding functions. This PR adds
helper functions for doing a completely bare forwarding of a function
from the client to the server. This is intended to facilitate
heterogenous libraries that implement host functions on the GPU (like
MPI or Fortran).
Summary:
In order for this to work with CUDA we need to declare functions as
__host__ and __device__ while also making sure we only call the GPU
functions during the CUDA / HIP compile stage.
Summary:
This patch removes much of the `llvmlibc_rpc_server` interface. This
pretty much deletes all of this code and just replaces it with including
`rpc.h` directly. We still maintain the file to let `libc` handle the
opcodes, since those depend on the `printf` impelmentation.
This will need to be cleaned up more, but I don't want to put too much
into a single patch.
Summary:
Previous patches have made the `rpc.h` header independent of the `libc`
internals. This allows us to include it directly rather than providing
an indirect C API. This patch only does the work to move the header. A
future patch will pull out the `rpc_server` interface and simply replace
it with a single function that handles the opcodes.