This patch adds the necessary support to provide timing information in
`libc` tests. This is useful for determining which tests look what
amount of time. We also can use this as a test basis for providing more
fine-grained timing when implementing things on the GPU.
The main difficulty with this is the fact that the AMDGPU fixed
frequency clock operates at an unknown frequency. We need to read this
on a per-card basis from the driver and then copy it in. NVPTX on the
other hand has a fixed clock at a resolution of 1ns. I have also
increased the resolution of the print-outs as the majority of these are
below a millisecond for me.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D154446
This patch prepares the RPC interface to be installed. We place this in
the existing `llvm-gpu-none` directory as it will also give us access to
the generated `libc` headers for the opcodes.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D153040
Switching to this interface we neglected to actually write the output
from the malloc call to the RPC buffer. Fix this so the tests pass
again.
Differential Revision: https://reviews.llvm.org/D153069
The GPU port of the LLVM C library needs to export a few extensions to
the interface such that users can interface with it. This patch adds the
necessary logic to define a GPU extension. Currently, this only exports
a `rpc_reset_client` function. This allows us to use the server in
D147054 to set up the RPC interface outside of `libc`.
Depends on https://reviews.llvm.org/D147054
Reviewed By: sivachandra
Differential Revision: https://reviews.llvm.org/D152283
This patch begins providing a generic static library that wraps around
the raw `rpc.h` interface. As discussed in the corresponding RFC,
https://discourse.llvm.org/t/rfc-libc-exporting-the-rpc-interface-for-the-gpu-libc/71030,
we want to begin exporting RPC services to external users. In order to
do this we decided to not expose the `rpc.h` header by wrapping around
its functionality. This is done with a C-interface as we make heavy use
of callbacks and allows us to provide a predictable interface.
Reviewed By: JonChesterfield, sivachandra
Differential Revision: https://reviews.llvm.org/D147054
A few of these tests were disabled due to failing on NVPTX. After
looking into it the vast majority of these cases were due to
insufficient stack memory. This can be worked around by increasing the
stack size in the loader or by reducing the memory usage in the case of
large string constants.
Reviewed By: tra
Differential Revision: https://reviews.llvm.org/D152583
A previous patch added general support for printing via the RPC
interface. we should consolidate this functionality and get rid of the
old opcode that was used for simple testing.
Reviewed By: lntue
Differential Revision: https://reviews.llvm.org/D152211
If CUDA is not found this string will expand into nothing. We need to
surround it with a string otherwise it will cause build failures.
Differential Revision: https://reviews.llvm.org/D152209
This patch adds the initial support required to support basic priting in
`stdio.h` via `puts` and `fputs`. This is done using the existing LLVM C
library `File` API. In this sense we can think of the RPC interface as
our system call to dump the character string to the file. We carry a
`uintptr_t` reference as our native "file descriptor" as it will be used
as an opaque reference to the host's version once functions like
`fopen` are supported.
For some unknown reason the declaration of the `StdIn` variable causes
both the AMDGPU and NVPTX backends to crash if I use the `READ` flag.
This is not used currently as we only support output now, but it needs
to be fixed
Reviewed By: sivachandra, lntue
Differential Revision: https://reviews.llvm.org/D151282
This patch adds support for the `malloc` and `free` functions. These
currently aren't implemented in-tree so we first add the interface
filies.
This patch provides the most basic support for a true `malloc` and
`free` by using the RPC interface. This is functional, but in the future
we will want to implement a more intelligent system and primarily use
the RPC interface more as a `brk()` or `sbrk()` interface only called
when absolutely necessary. We will need to design an intelligent
allocator in the future.
The semantics of these memory allocations will need to be checked. I am
somewhat iffy on the details. I've heard that HSA can allocate
asynchronously which seems to work with my tests at least. CUDA uses an
implicit synchronization scheme so we need to use an explicitly separate
stream from the one launching the kernel or the default stream. I will
need to test the NVPTX case.
I would appreciate if anyone more experienced with the implementation details
here could chime in for the HSA and CUDA cases.
Reviewed By: sivachandra
Differential Revision: https://reviews.llvm.org/D151735
Currently we have the `send_n` and `recv_n` routines to stream data,
such as a string to print, to the other side. The first operation is to
send the size so the other side knows the number of bytes to recieve.
However, this wasted 56 bytes that could've been sent. This meant that
small values, like the arguments to a function to call on the host for
example, needed to perform an extra send. This patch sends the first 56
bytes in the first packet and continues if necessary.
Depends on D150992
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D151041
We support asynchronous sends, that means that the kernel can issue a
send, then exit the kernel as we do with the `EXIT` syscall. Because of
the condition it's therefore possible for the kernel to exit and break
from the loop before we check the server again. This can potentially
cause us to ignore an `EXIT` call from the GPU.
Reviewed By: JonChesterfield, lntue
Differential Revision: https://reviews.llvm.org/D150456
Currently we provide the `send_n` and `recv_n` functions. These were
somewhat divergent and not tested on the GPU. This patch changes the
support to be more common. We do this my making the CPU provide an array
equal the to at least the lane size while the GPU can rely on the
private memory address of its stack variables. This allows us to send
data back and forth generically.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D150379
Small cleanup of the server code and fixes a constant name not following
the naming convention.
Differential Revision: https://reviews.llvm.org/D150361
Allows moving the pointer swap between server and client into reset.
Single allocation simplifies whatever allocates the client/server, currently
the libc loaders.
Reviewed By: jhuber6
Differential Revision: https://reviews.llvm.org/D150337
The interface exported by the RPC library allows users to simply send
and recieve fixed sized packets without worrying about the data motion
underneath. However, this was broken in the current implementation. We
can think of the send and recieve implementations in terms of waiting
for ownership of the buffer, using the buffer, and posting ownership to
the other side. Our implementation of `recv` was incorrect in the
following scenarios.
recv -> send // we still own the buffer and should give away ownership
recv -> close // The other side is not waiting for data, this will
result in multiple openings of the same port
This patch attempts to fix this with an admittedly hacky fix where we
track if the previous implementation was a recv and post conditionally.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D150327
Replaces the globals currently used. Worth changing to a bitmap
before allowing runtime number of ports >> 64. One bit per port is likely
to be cheap enough that sizing for the worst case is always fine, otherwise
in the future we can change to dynamically allocating it.
Reviewed By: jhuber6
Differential Revision: https://reviews.llvm.org/D150309
Previously we used a single port to implement the RPC. This was
sufficient for single threaded tests but can potentially cause deadlocks
when using multiple threads. The reason for this is that GPUs make no
forward progress guarantees. Therefore one group of threads waiting on
another group of threads can spin forever because there is no guarantee
that the other threads will continue executing. The typical workaround
for this is to allocate enough memory that a sufficiently large number
of work groups can make progress. As long as this number is somewhat
close to the amount of total concurrency we can obtain reliable
execution around a shared resource.
This patch enables using multiple ports by widening the arrays to a
predetermined size and indexes into them. Empty ports are currently
obtained via a trivial linker scan. This should be imporoved in the
future for performance reasons. Portions of D148191 were applied to
achieve parallel support.
Depends on D149581
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D149598
The GPU has a different execution model to standard `_start`
implementations. On the GPU, all threads are active at the start of a
kernel. In order to correctly intitialize and call the constructors we
want single threaded semantics. Previously, this was done using a
makeshift global barrier with atomics. However, it should be easier to
simply put the portions of the code that must be single threaded in
separate kernels and then call those with only one thread. Generally,
mixing global state between kernel launches makes optimizations more
difficult, similarly to calling a function outside of the TU, but for
testing it is better to be correct.
Depends on D149527 D148943
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D149581
The execution model of the GPU expects that groups of threads will
execute in lock-step in SIMD fashion. It's both important for
performance and correctness that we treat this as the smallest possible
granularity for an RPC operation. Thus, we map multiple threads to a
single larger buffer and ship that across the wire.
This patch makes the necessary changes to support executing the RPC on
the GPU with multiple threads. This requires some workarounds to mimic
the model when handling the protocol from the CPU. I'm not completely
happy with some of the workarounds required, but I think it should work.
Uses some of the implementation details from D148191.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D148943
This patch adds the necessary hacks to support global constructors and
destructors. This is an incredibly hacky process caused by the primary
fact that Nvidia does not provide any binary tools and very little
linker support. We first had to emit references to these functions and
their priority in D149451. Then we dig them out of the module once it's
loaded to manually create the list that the linker should have made for
us. This patch also contains a few Nvidia specific hacks, but it passes
the test, albeit with a stack size warning from `ptxas` for the
callback. But this should be fine given the resource usage of a common
test.
This also adds a dependency on LLVM to the NVPTX loader, which hopefully doesn't
cause problems with our CUDA buildbot.
Depends on D149451
Reviewed By: tra
Differential Revision: https://reviews.llvm.org/D149527
The implementation of the test printing currently expects a null
terminated C-string. However, the `write_to_stderr` interface uses a
string view, which doesn't need to be null terminated. This patch
changes the printing interface to directly use `fwrite` instead rather
than relying on a null terminator.
Reviewed By: sivachandra
Differential Revision: https://reviews.llvm.org/D149493
Currently, the RPC interface with the loader is only tested if the other
tests fail. This test adds a direct test that runs a simple integer
increment over the RPC handshake 10000 times.
Depends on https://reviews.llvm.org/D148288
Reviewed By: lntue
Differential Revision: https://reviews.llvm.org/D148342
This patch reworks the RPC interface to allow more generic memory
operations using the shared better. This patch decomposes the entire RPC
interface into opening a port and calling `send` or `recv` on it.
The `send` function sends a single packet of the length of the buffer.
The `recv` function is paired with the `send` call to then use the data.
So, any aribtrary combination of sending packets is possible. The only
restriction is that the client initiates the exchange with a `send`
while the server consumes it with a `recv`.
The operation of this is driven by two independent state machines that
tracks the buffer ownership during loads / stores. We keep track of two
so that we can transition between a send state and a recv state without
an extra wait. State transitions are observed via bit toggling, e.g.
This interface supports an efficient `send -> ack -> send -> ack -> send`
interface and allows for the last send to be ignored without checking
the ack.
A following patch will add some more comprehensive testing to this interface. I
I informally made an RPC call that simply incremented an integer and it took
roughly 10 microsends to complete an RPC call.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D148288
We will want to test the GPU `libc` with multiple threads in the future.
This patch adds the `--threads` and `--blocks` option to set the `x`
dimension of the kernel. Using CUDA terminology instead of OpenCL for
familiarity.
Depends on D148288 D148342
Reviewed By: jdoerfert, sivachandra, tra
Differential Revision: https://reviews.llvm.org/D148485
Summary:
This part was ignored and we just hoped that shutting down the runtime
freed these correctly. But it's best to be specific and free the memory
we've allocated.
Summary:
This implementation was buggy and inefficient. Fine-grained memory can
only be allocated on a page-level granularity. Which means that each
allocated string used about 4096 bytes. This is wasteful in general, and
also allowed for buggy behaviour. The previous copying of the
environment vector only worked because the large buffer size meant that
we would typically have a null byte after the allocated memory. However
this would break if the vector was larger than a page. This patch
allocates everything into a single buffer. It makes it easier to free,
use, and it more correct.
This patch adds the necessary build infrastructure to build and run the
integration tests on NVIDIA GPUs. The NVIDIA `nvlink` linker utility is
what is ultimately used to combine these files into a single executable
image. Unfortunately, their tool does not support static libraries. So
we need to link with every object directly instead. This could be solved
by impelementing a "wrapper" utility around `nvlink` like we used to use
for OpenMP. But for now this should be sufficient.
Reviewed By: sivachandra
Differential Revision: https://reviews.llvm.org/D146861
This patch adds the necessary code to impelement the existing RPC client
/ server interface when targeting NVPTX GPUs. This follows closely to
the implementation in the AMDGPU version. This does not yet enable unit
testing as the `nvlink` linker does not support static libraries. So
that will need to be worked around.
I am ignoring the RPC duplication between the AMDGPU and NVPTX loaders. This
will be changed completely later so there's no point unifying the code at this
stage. The implementation was tested manually with the following file and
compilation flags.
```
namespace __llvm_libc {
void write_to_stderr(const char *msg);
void quick_exit(int);
} // namespace __llvm_libc
using namespace __llvm_libc;
int main(int argc, char **argv, char **envp) {
for (int i = 0; i < argc; ++i) {
write_to_stderr(argv[i]);
write_to_stderr("\n");
}
quick_exit(255);
}
```
```
$ clang++ crt1.o rpc_client.o quick_exit.o io.o main.cpp --target=nvptx64-nvidia-cuda -march=sm_70 -o image
$ ./nvptx_loader image 1 2 3
image
1
2
3
$ echo $?
255
```
Depends on D146681
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D146846
This patch adds a loader utility targeting the CUDA driver API to launch
NVPTX images called `nvptx_loader`. This takes a GPU image on the
command line and launches the `_start` kernel with the appropriate
arguments. The `_start` kernel is provided by the already implemented
`nvptx/start.cpp`. So, an application with a `main` function can be
compiled and run as follows.
```
clang++ --target=nvptx64-nvidia-cuda main.cpp crt1.o -march=sm_70 -o image
./nvptx_loader image args to kernel
```
This implementation is not tested and does not yet support RPC. This
requires further development to work around NVIDIA specific limitations
in atomics and linking.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D146681
This patch performs the same operation to copy over the `argv` array to
the `envp` array. This allows the GPU tests to use environment
variables.
Reviewed By: sivachandra
Differential Revision: https://reviews.llvm.org/D146322
This patch enables integration tests running on the GPU. This uses the
RPC interface implemented in D145913 to compile the necessary
dependencies for the integration test object. We can then use this to
compile the objects for the GPU directly and execute them using the AMD
HSA loader combined with its RPC server. For example, the compiler is
performing the following actions to execute the integration tests.
```
$ clang++ --target=amdgcn-amd-amdhsa -mcpu=gfx1030 -nostdlib -flto -ffreestanding \
crt1.o io.o quick_exit.o test.o rpc_client.o args_test.o -o image
$ ./amdhsa_loader image 1 2 5
args_test.cpp:24: Expected 'my_streq(argv[3], "3")' to be true, but is false
```
This currently only works with a single threaded client implementation
running on AMDGPU. Further work will implement multiple clients for AMD
and the ability to run on NVPTX as well.
Depends on D145913
Reviewed By: sivachandra, JonChesterfield
Differential Revision: https://reviews.llvm.org/D146256
This patch adds initial support for an RPC client / server architecture.
The GPU is unable to perform several system utilities on its own, so in
order to implement features like printing or memory allocation we need
to be able to communicate with the executing process. This is done via a
buffer of "sharable" memory. That is, a buffer with a unified pointer
that both the client and server can use to communicate.
The implementation here is based off of Jon Chesterfields minimal RPC
example in his work. We use an `inbox` and `outbox` to communicate
between if there is an RPC request and to signify when work is done.
We use a fixed-size buffer for the communication channel. This is fixed
size so that we can ensure that there is enough space for all
compute-units on the GPU to issue work to any of the ports. Right now
the implementation is single threaded so there is only a single buffer
that is not shared.
This implementation still has several features missing to be complete.
Such as multi-threaded support and asynchrnonous calls.
Depends on D145912
Reviewed By: sivachandra
Differential Revision: https://reviews.llvm.org/D145913
This is the first attempt to get some testing support for GPUs in LLVM's
libc. We want to be able to compile for and call generic code while on
the device. This is difficult as most GPU applications also require the
support of large runtimes that may contain their own bugs (e.g. CUDA /
HIP / OpenMP / OpenCL / SYCL). The proposed solution is to provide a
"loader" utility that allows us to execute a "main" function on the GPU.
This patch implements a simple loader utility targeting the AMDHSA
runtime called `amdhsa_loader` that takes a GPU program as its first
argument. It will then attempt to load a predetermined `_start` kernel
inside that image and launch execution. The `_start` symbol is provided
by a `start` utility function that will be linked alongside the
application. Thus, this should allow us to run arbitrary code on the
user's GPU with the following steps for testing.
```
clang++ Start.cpp --target=amdgcn-amd-amdhsa -mcpu=<arch> -ffreestanding -nogpulib -nostdinc -nostdlib -c
clang++ Main.cpp --target=amdgcn-amd-amdhsa -mcpu=<arch> -nogpulib -nostdinc -nostdlib -c
clang++ Start.o Main.o --target=amdgcn-amd-amdhsa -o image
amdhsa_loader image <args, ...>
```
We determine the `-mcpu` value using the `amdgpu-arch` utility provided
either by `clang` or `rocm`. If `amdgpu-arch` isn't found or returns an
error we shouldn't run the tests as the machine does not have a valid
HSA compatible GPU. Alternatively we could make this utility in-source
to avoid the external dependency.
This patch provides a single test for this untility that simply checks
to see if we can compile an application containing a simple `main`
function and execute it.
The proposed solution in the future is to create an alternate
implementation of the LibcTest.cpp source that can be compiled and
launched using this utility. This approach should allow us to use the
same test sources as the other applications.
This is primarily a prototype, suggestions for how to better integrate
this with the existing LibC infastructure would be greatly appreciated.
The loader code should also be cleaned up somewhat. An implementation
for NVPTX will need to be written as well.
Reviewed By: sivachandra, JonChesterfield
Differential Revision: https://reviews.llvm.org/D139839