llvm-project

Author	SHA1	Message	Date
Joseph Huber	abd85cd473	[libc] Remove the optional arguments for NVPTX constructors (#69536) Summary: We call the global constructors by function pointer. For whatever reason the NVPTX architecture relies very specifically on the arguments to the function pointer invocation matching what the function is implemented as. This is problematic as most of these constructors are generated with no arguments. This patch removes the extended arguments that GNU and LLVM use for the constructors optionally so that it can support the common case.	2023-11-20 17:10:15 -06:00
Joseph Huber	dc30fa6aca	[libc][fix] Call GPU destructors in the correct order Summary: I was mistakenly iterating the list backwards. Regular semantics puts both arrays in priority order but the destructors are called backwards.	2023-11-09 09:22:41 -06:00
Guillaume Chatelet	b6bc9d72f6	[libc] Mass replace enclosing namespace (#67032) This is step 4 of https://discourse.llvm.org/t/rfc-customizable-namespace-to-allow-testing-the-libc-when-the-system-libc-is-also-llvms-libc/73079	2023-09-26 11:45:04 +02:00
Joseph Huber	59896c168a	[libc] Remove the 'rpc_reset' routine from the RPC implementation (#66700) Summary: This patch removes the `rpc_reset` function. This was previously used to initialize the RPC client on the device by setting up the pointers to communicate with the server. The purpose of this was to make it easier to initialize the device for testing. However, this prevented us from enforcing an invariant that the buffers are all read-only from the client side. The expected way to initialize the server is now to copy it from the host runtime. This will allow us to maintain that the RPC client is in the constant address space on the GPU, potentially through inference, and improving caching behaviour.	2023-09-21 11:07:09 -05:00
Joseph Huber	d3aabeb7b5	[libc] Treat the locks array as a bitfield Currently we keep an internal buffer of device memory that is used to indicate ownership of a port. Since we only use this as a single bit we can simply turn this into a bitfield. I did this manually rather than having a separate type as we need very special handling of the masks used to interact with the locks. Reviewed By: JonChesterfield Differential Revision: https://reviews.llvm.org/D155511	2023-07-21 10:49:11 -05:00
Joseph Huber	979fb95021	Revert "[libc] Treat the locks array as a bitfield" Summary: This caused test failures on the gfx90a buildbot. This works on my gfx1030 and the Nvidia buildbots, so we'll need to investigate what is going wrong here. For now revert it to get the bots green. This reverts commit 05abcc579244b68162b847a6780d27b22bd58f74.	2023-07-19 09:27:08 -05:00
Joseph Huber	05abcc5792	[libc] Treat the locks array as a bitfield Currently we keep an internal buffer of device memory that is used to indicate ownership of a port. Since we only use this as a single bit we can simply turn this into a bitfield. I did this manually rather than having a separate type as we need very special handling of the masks used to interact with the locks. Reviewed By: JonChesterfield Differential Revision: https://reviews.llvm.org/D155511	2023-07-18 11:34:21 -05:00
Joseph Huber	964a535bfa	[libc] Remove flexible array and replace with a template Currently the implementation of the RPC interface requires a flexible struct. This caused problems when compiling the RPC server with GCC as would be required if trying to export the RPC server interface. This required that we either move to the `x[1]` workaround or make it a template parameter. While just using `x[1]` would be much less noisy, this is technically undefined behavior. For this reason I elected to use templates. The downside to using templates is that the server code must now be able to handle multiple different types at runtime. I was unable to find a good solution that didn't rely on type erasure so I simply branch off of the given value. Reviewed By: JonChesterfield Differential Revision: https://reviews.llvm.org/D153304	2023-06-20 15:22:37 -05:00
Joseph Huber	30093d6be2	[libc][obvious] Fix undefined variable after name change I forgot that we still used these variables in the loaders. Differential Revision: https://reviews.llvm.org/D150362	2023-05-11 09:00:08 -05:00
Jon Chesterfield	bbeae142bf	[libc][rpc] Allocate a single block of shared memory instead of three Allows moving the pointer swap between server and client into reset. Single allocation simplifies whatever allocates the client/server, currently the libc loaders. Reviewed By: jhuber6 Differential Revision: https://reviews.llvm.org/D150337	2023-05-11 03:04:56 +01:00
Jon Chesterfield	f497611f43	[libc][rpc] Allocate locks array within process Replaces the globals currently used. Worth changing to a bitmap before allowing runtime number of ports >> 64. One bit per port is likely to be cheap enough that sizing for the worst case is always fine, otherwise in the future we can change to dynamically allocating it. Reviewed By: jhuber6 Differential Revision: https://reviews.llvm.org/D150309	2023-05-11 00:41:51 +01:00
Joseph Huber	aea866c12c	[libc] Support concurrent RPC port access on the GPU Previously we used a single port to implement the RPC. This was sufficient for single threaded tests but can potentially cause deadlocks when using multiple threads. The reason for this is that GPUs make no forward progress guarantees. Therefore one group of threads waiting on another group of threads can spin forever because there is no guarantee that the other threads will continue executing. The typical workaround for this is to allocate enough memory that a sufficiently large number of work groups can make progress. As long as this number is somewhat close to the amount of total concurrency we can obtain reliable execution around a shared resource. This patch enables using multiple ports by widening the arrays to a predetermined size and indexing into them. Empty ports are currently obtained via a trivial linear scan. This should be improved in the future for performance reasons. Portions of D148191 were applied to achieve parallel support. Depends on D149581 Reviewed By: JonChesterfield Differential Revision: https://reviews.llvm.org/D149598	2023-05-05 10:12:19 -05:00
Joseph Huber	901266dad3	[libc] Change GPU startup and loader to use multiple kernels The GPU has a different execution model to standard `_start` implementations. On the GPU, all threads are active at the start of a kernel. In order to correctly initialize and call the constructors we want single threaded semantics. Previously, this was done using a makeshift global barrier with atomics. However, it should be easier to simply put the portions of the code that must be single threaded in separate kernels and then call those with only one thread. Generally, mixing global state between kernel launches makes optimizations more difficult, similarly to calling a function outside of the TU, but for testing it is better to be correct. Depends on D149527 D148943 Reviewed By: JonChesterfield Differential Revision: https://reviews.llvm.org/D149581	2023-05-04 19:31:41 -05:00
Joseph Huber	507edb52f9	[libc] Enable multiple threads to use RPC on the GPU The execution model of the GPU expects that groups of threads will execute in lock-step in SIMD fashion. It's both important for performance and correctness that we treat this as the smallest possible granularity for an RPC operation. Thus, we map multiple threads to a single larger buffer and ship that across the wire. This patch makes the necessary changes to support executing the RPC on the GPU with multiple threads. This requires some workarounds to mimic the model when handling the protocol from the CPU. I'm not completely happy with some of the workarounds required, but I think it should work. Uses some of the implementation details from D148191. Reviewed By: JonChesterfield Differential Revision: https://reviews.llvm.org/D148943	2023-05-04 19:31:41 -05:00
Joseph Huber	2e1c0ec629	[libc] Support global constructors and destructors on NVPTX This patch adds the necessary hacks to support global constructors and destructors. This is an incredibly hacky process caused by the primary fact that Nvidia does not provide any binary tools and very little linker support. We first had to emit references to these functions and their priority in D149451. Then we dig them out of the module once it's loaded to manually create the list that the linker should have made for us. This patch also contains a few Nvidia specific hacks, but it passes the test, albeit with a stack size warning from `ptxas` for the callback. But this should be fine given the resource usage of a common test. This also adds a dependency on LLVM to the NVPTX loader, which hopefully doesn't cause problems with our CUDA buildbot. Depends on D149451 Reviewed By: tra Differential Revision: https://reviews.llvm.org/D149527	2023-05-04 07:13:00 -05:00
Joseph Huber	50445dff43	[libc] Add more utility functions for the GPU This patch adds extra intrinsics for the GPU. Some of these are unused for now but will be used later. We use these currently to update the `RPC` handling. Currently, every thread can update the RPC client, which isn't correct. This patch adds code necessary to allow a single thread to perform the write while the others wait. Feedback is welcome for the naming of these functions. I'm copying the OpenMP nomenclature where we call an AMD `wavefront` or NVIDIA `warp` a `lane`. Reviewed By: tra Differential Revision: https://reviews.llvm.org/D148810	2023-04-24 15:47:53 -05:00
Joseph Huber	d0ff5e4030	[libc] Update RPC interface for system utilities on the GPU This patch reworks the RPC interface to allow more generic memory operations using the shared buffer. This patch decomposes the entire RPC interface into opening a port and calling `send` or `recv` on it. The `send` function sends a single packet of the length of the buffer. The `recv` function is paired with the `send` call to then use the data. So, any arbitrary combination of sending packets is possible. The only restriction is that the client initiates the exchange with a `send` while the server consumes it with a `recv`. The operation of this is driven by two independent state machines that track the buffer ownership during loads / stores. We keep track of two so that we can transition between a send state and a recv state without an extra wait. State transitions are observed via bit toggling, e.g. This interface supports an efficient `send -> ack -> send -> ack -> send` interface and allows for the last send to be ignored without checking the ack. A following patch will add some more comprehensive testing to this interface. I informally made an RPC call that simply incremented an integer and it took roughly 10 microseconds to complete an RPC call. Reviewed By: jdoerfert Differential Revision: https://reviews.llvm.org/D148288	2023-04-19 20:02:31 -05:00
Joseph Huber	58f5e5e6b0	[libc] Implement the RPC client / server for NVPTX This patch adds the necessary code to implement the existing RPC client / server interface when targeting NVPTX GPUs. This follows closely to the implementation in the AMDGPU version. This does not yet enable unit testing as the `nvlink` linker does not support static libraries. So that will need to be worked around. I am ignoring the RPC duplication between the AMDGPU and NVPTX loaders. This will be changed completely later so there's no point unifying the code at this stage. The implementation was tested manually with the following file and compilation flags. ``` namespace __llvm_libc { void write_to_stderr(const char *msg); void quick_exit(int); } // namespace __llvm_libc using namespace __llvm_libc; int main(int argc, char **argv, char **envp) { for (int i = 0; i < argc; ++i) { write_to_stderr(argv[i]); write_to_stderr("\n"); } quick_exit(255); } ``` ``` $ clang++ crt1.o rpc_client.o quick_exit.o io.o main.cpp --target=nvptx64-nvidia-cuda -march=sm_70 -o image $ ./nvptx_loader image 1 2 3 image 1 2 3 $ echo $? 255 ``` Depends on D146681 Reviewed By: jdoerfert Differential Revision: https://reviews.llvm.org/D146846	2023-03-24 20:04:43 -05:00
Joseph Huber	1fce1d341b	[libc] Use `nvptx_kernel` attribute in NVPTX startup code Summary: A recent patch allowed us to emit a callable kernel from freestanding NVPTX code. This allows us to move away from using the CUDA language. This has several advantages in that it works around an entire assortment of errors I was seeing while implementing RPC for Nvidia.	2023-03-24 14:46:26 -05:00
Joseph Huber	ae63b1a576	[libc] Adjust NVPTX startup code Summary: The startup code needs to include the environment pointer so we add this to the arguments. Also we need to ensure that the `crt1.o` file is made with `-fgpu-rdc` set so we can actually use it without undefined reference errors.	2023-03-22 20:08:08 -05:00
Joseph Huber	fa34b9e032	[libc] Add startup code implementation for GPU targets This patch introduces startup code for executing `main` on a device compiled for the GPU. We will primarily use this to run standalone integration tests on the GPU. The actual execution of this routine will need to be provided by a `loader` utility to bootstrap execution on the GPU. Reviewed By: sivachandra Differential Revision: https://reviews.llvm.org/D143212	2023-02-07 11:36:16 -06:00

21 Commits
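
The "Treat the locks array as a bitfield" commits (05abcc5792, d3aabeb7b5) describe port ownership as a single bit per port, and aea866c12c mentions finding an empty port with a linear scan. The block below is a rough host-side sketch of that idea only, not the code from those commits: it assumes `std::atomic` in place of the GPU atomics, and every name in it (`PortLocks`, `NUM_PORTS`, `try_lock`, `unlock`, `acquire_any`) is hypothetical.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical sizes chosen for illustration; the real port count may differ.
constexpr uint32_t NUM_PORTS = 64;
constexpr uint32_t BITS_PER_WORD = 32;

// One ownership bit per port, packed into 32-bit words.
struct PortLocks {
  std::atomic<uint32_t> words[NUM_PORTS / BITS_PER_WORD] = {};

  // Set the port's bit; the lock is ours only if the bit was previously clear.
  bool try_lock(uint32_t port) {
    uint32_t mask = 1u << (port % BITS_PER_WORD);
    uint32_t old =
        words[port / BITS_PER_WORD].fetch_or(mask, std::memory_order_acquire);
    return (old & mask) == 0;
  }

  // Clear the port's bit so another caller can claim it.
  void unlock(uint32_t port) {
    uint32_t mask = 1u << (port % BITS_PER_WORD);
    words[port / BITS_PER_WORD].fetch_and(~mask, std::memory_order_release);
  }

  // Linear scan for a free port; returns NUM_PORTS when every port is taken.
  uint32_t acquire_any() {
    for (uint32_t port = 0; port < NUM_PORTS; ++port)
      if (try_lock(port))
        return port;
    return NUM_PORTS;
  }
};
```

A caller would take a port with `acquire_any()`, use it, and then call `unlock(port)`. The sketch ignores the per-lane masks that the commit message says need special handling on the GPU.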