39 Commits

Author SHA1 Message Date
Joseph Huber
666a3f4ed4
[libc] Stub TLS functions on the GPU temporarily (#108267)
Summary:
There's an extern weak symbol for this; we should just factor these into
a more common interface. Stub them temporarily to make the bots happy,
since PTXAS does not handle extern weak symbols.
2024-09-11 11:36:07 -07:00
Joseph Huber
5c13f9aea2
[libc] Add single threaded kernel attributes to AMDGPU startup utility (#104651)
Summary:
I fixed the errors here recently, so I can actually use these. This
shouldn't impact much; it should just hopefully make the generated code
slightly better.
2024-08-18 12:50:15 -05:00
Schrodinger ZHU Yifan
b7c7dbd473
Revert "libc: Remove extern "C" from main declarations" (#102827)
Reverts llvm/llvm-project#102825
2024-08-11 13:40:50 -07:00
David Blaikie
1b71c471c7
libc: Remove extern "C" from main declarations (#102825)
This is invalid in C++, and clang recently started warning on it as of
#101853
2024-08-11 13:17:27 -07:00
Joseph Huber
1a92cc5a0a
[libc] Implement 'getenv' on the GPU target (#102376)
Summary:
This patch implements 'getenv'. I was torn on how to implement this,
since realistically we only have access to this environment pointer in
the "loader" interface. An alternative would be to use an RPC call every
time, but I think that's overkill for what this will be used for. A
better solution is just to emit a common `DataEnvironment` that contains
all of the host visible resources to initialize. Right now this is the
`env_ptr`, `clock_freq`, and `rpc_client`.

I did this by making the `app.h` interface that Linux uses more general.
I could possibly move that into a separate patch, but I figured it's
easier to see with the usage.
2024-08-08 06:45:42 -05:00
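
A rough sketch of the shape described above, assuming a hypothetical layout for `DataEnvironment` and an illustrative `find_env` lookup; this is not the actual libc code.

```
#include <cstdint>
#include <cstring>

// Illustrative sketch only: a structure the loader fills in with the
// host-visible resources named above (env_ptr, clock_freq, rpc_client).
struct DataEnvironment {
  uintptr_t *env_ptr;  // the host's environment array
  uint64_t clock_freq; // fixed-frequency clock rate reported by the driver
  void *rpc_client;    // storage for the RPC client state
};

// Hypothetical global that the loader initializes before any kernel runs.
DataEnvironment app;

// Minimal getenv-style lookup over a NAME=VALUE environment block.
char *find_env(const char *name) {
  if (!app.env_ptr)
    return nullptr;
  std::size_t len = std::strlen(name);
  for (char **env = reinterpret_cast<char **>(app.env_ptr); *env; ++env)
    if (std::strncmp(*env, name, len) == 0 && (*env)[len] == '=')
      return *env + len + 1;
  return nullptr;
}
```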
Petr Hosek
5ff3ff33ff
[libc] Migrate to using LIBC_NAMESPACE_DECL for namespace declaration (#98597)
This is a part of #97655.
2024-07-12 09:28:41 -07:00
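
Roughly, the macro lets the namespace declaration carry a visibility attribute in one place. The expansion below is a simplified assumption of the shape, not the exact definition in the tree.

```
// Simplified assumption of what the macro enables: attach hidden
// visibility to everything declared in the libc namespace.
#define LIBC_NAMESPACE __llvm_libc
#define LIBC_NAMESPACE_DECL [[gnu::visibility("hidden")]] LIBC_NAMESPACE

namespace LIBC_NAMESPACE_DECL {
int add(int a, int b) { return a + b; }
} // namespace LIBC_NAMESPACE_DECL
```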
Mehdi Amini
ce9035f5bd
Revert "[libc] Migrate to using LIBC_NAMESPACE_DECL for namespace declaration" (#98593)
Reverts llvm/llvm-project#98075

bots are broken
2024-07-12 09:12:13 +02:00
Petr Hosek
3f30effe1b
[libc] Migrate to using LIBC_NAMESPACE_DECL for namespace declaration (#98075)
This is a part of #97655.
2024-07-11 12:35:22 -07:00
Joseph Huber
0352d5eee0
[libc][NFC] Remove redundant external clock symbol for AMDGPU (#82794)
Summary:
The AMDGPU target needs an external clock symbol so the driver can set
the frequency with the correct value. This was left over from the
previous implementation and I forgot to remove it when actually
implementing the timing utilities.
2024-02-23 11:59:46 -06:00
Joseph Huber
47b7c91abe
[libc] Rework the GPU build to be a regular target (#81921)
Summary:
This is a massive patch because it reworks the entire build and
everything that depends on it. This is not split up because various bots
would fail otherwise. I will attempt to describe the necessary changes
here.

This patch completely reworks how the GPU build is built and targeted.
Previously, we used a standard runtimes build and handled both NVPTX and
AMDGPU in a single build via multi-targeting. This added a lot of
divergence in the build system and prevented us from doing various
things like building for the CPU / GPU at the same time, or exporting
the startup libraries or running tests without a full rebuild.

The new approach is to handle the GPU builds as strict cross-compiling
runtimes. The first step required
https://github.com/llvm/llvm-project/pull/81557 to allow the `LIBC`
target to build for the GPU without touching the other targets. This
means that the GPU uses all the same handling as the other builds in
`libc`.

The new expected way to build the GPU libc is with
`LLVM_LIBC_RUNTIME_TARGETS=amdgcn-amd-amdhsa;nvptx64-nvidia-cuda`.

The second step was reworking how we generated the embedded GPU library
by moving it into the library install step. Where we previously had one
`libcgpu.a` we now have `libcgpu-amdgpu.a` and `libcgpu-nvptx.a`. This
patch includes the necessary clang / OpenMP changes to make that not
break the bots when this lands.

We unfortunately still require that the NVPTX target has an `internal`
target for tests. This is because the NVPTX target needs to do LTO for
the provided version (the offloading toolchain can handle it) but cannot
use it for the native toolchain, which is used for building the tests.

This approach is vastly superior in every way, allowing us to treat the
GPU as a standard cross-compiling target. We can now install the GPU
utilities to do things like use the offload tests and other fun things.

Certain utilities also need to be built with
`--target=${LLVM_HOST_TRIPLE}`. I think this is a fine workaround, as we
will always assume that the GPU `libc` is a cross-build with a
functioning host.

Depends on https://github.com/llvm/llvm-project/pull/81557
2024-02-22 15:29:29 -06:00
lntue
0881d0f009
[libc] Refactor _build_gpu_objects cmake function. (#80631)
2024-02-05 10:44:19 -05:00
Joseph Huber
dc30fa6aca [libc][fix] Call GPU destructors in the correct order
Summary:
I was mistakenly iterating the list backwards. Regular semantics put
both arrays in priority order, but the destructors are called in reverse.
2023-11-09 09:22:41 -06:00
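
A minimal sketch of the ordering described above, assuming hypothetical `__init_array` / `__fini_array` style symbols provided by the linker: constructors run front to back, destructors run back to front.

```
using InitFn = void (*)();

// Hypothetical array bounds provided by the linker; names are illustrative.
extern InitFn __init_array_start[];
extern InitFn __init_array_end[];
extern InitFn __fini_array_start[];
extern InitFn __fini_array_end[];

void call_init_array() {
  // Constructors run in priority order, front to back.
  for (InitFn *fn = __init_array_start; fn != __init_array_end; ++fn)
    (*fn)();
}

void call_fini_array() {
  // Destructors must run in the reverse order, back to front.
  for (InitFn *fn = __fini_array_end; fn != __fini_array_start; --fn)
    fn[-1]();
}
```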
alfredfo
f350532099
[libc] Fix accidental LIBC_NAMESPACE_clock_freq (#69620)
See-also: https://github.com/llvm/llvm-project/pull/69548
2023-10-19 19:39:02 +02:00
Guillaume Chatelet
b6bc9d72f6
[libc] Mass replace enclosing namespace (#67032)
This is step 4 of
https://discourse.llvm.org/t/rfc-customizable-namespace-to-allow-testing-the-libc-when-the-system-libc-is-also-llvms-libc/73079
2023-09-26 11:45:04 +02:00
Joseph Huber
59896c168a
[libc] Remove the 'rpc_reset' routine from the RPC implementation (#66700)
Summary:
This patch removes the `rpc_reset` function. This was previously used to
initialize the RPC client on the device by setting up the pointers to
communicate with the server. The purpose of this was to make it easier
to initialize the device for testing. However, this prevented us from
enforcing an invariant that the buffers are all read-only from the
client side.

The expected way to initialize the server is now to copy it from the
host runtime. This will allow us to maintain that the RPC client is in
the constant address space on the GPU, potentially through inference,
and improve caching behaviour.
2023-09-21 11:07:09 -05:00
Joseph Huber
76af6e77c0
[libc] Manually set the AMDGPU code object version (#65986)
Summary:
There is currently an effort to change over the default AMDGPU code object
version https://github.com/llvm/llvm-project/pull/65410. However, this
unfortunately causes problems in the LLVM LibC test suite that leads to
a hang while executing. This is most likely a bug to do with indirect
call optimization, as it can be avoided without optimizations or with
manually preventing inlining in the AMDGPU startup code.

This patch explicitly sets the AMDGPU code object version to 4 for the
LibC test suite. This should unblock the efforts to move the default
to 5 without breaking the test suite. This isn't a great solution, but
there is currently some time pressure to get COV5 landed and this seems
to be the easiest solution.
2023-09-11 13:07:56 -05:00
Joseph Huber
d3aabeb7b5 [libc] Treat the locks array as a bitfield
Currently we keep an internal buffer of device memory that is used to
indicate ownership of a port. Since we only use this as a single bit we
can simply turn this into a bitfield. I did this manually rather than
having a separate type as we need very special handling of the masks
used to interact with the locks.

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D155511
2023-07-21 10:49:11 -05:00
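
A sketch of the idea described above: pack one ownership bit per port into an array of words and use atomic mask operations to claim and release a port. Names and sizes are illustrative, not the libc implementation.

```
#include <atomic>
#include <cstdint>

// Illustrative only: one ownership bit per port, packed into 32-bit words.
constexpr uint32_t NUM_PORTS = 64;
static std::atomic<uint32_t> locks[NUM_PORTS / 32];

// Try to claim a port by atomically setting its bit; returns true on success.
bool try_lock(uint32_t port) {
  uint32_t mask = 1u << (port % 32);
  uint32_t old = locks[port / 32].fetch_or(mask, std::memory_order_acquire);
  return (old & mask) == 0;
}

// Release the port by clearing its bit.
void unlock(uint32_t port) {
  uint32_t mask = 1u << (port % 32);
  locks[port / 32].fetch_and(~mask, std::memory_order_release);
}
```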
Joseph Huber
979fb95021 Revert "[libc] Treat the locks array as a bitfield"
Summary:
This caused test failures on the gfx90a buildbot. This works on my
gfx1030 and the Nvidia buildbots, so we'll need to investigate what is
going wrong here. For now revert it to get the bots green.

This reverts commit 05abcc579244b68162b847a6780d27b22bd58f74.
2023-07-19 09:27:08 -05:00
Joseph Huber
05abcc5792 [libc] Treat the locks array as a bitfield
Currently we keep an internal buffer of device memory that is used to
indicate ownership of a port. Since we only use this as a single bit we
can simply turn this into a bitfield. I did this manually rather than
having a separate type as we need very special handling of the masks
used to interact with the locks.

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D155511
2023-07-18 11:34:21 -05:00
Joseph Huber
5db39796bf [libc] Support timing information in libc tests
This patch adds the necessary support to provide timing information in
`libc` tests. This is useful for determining how much time each test
takes. We can also use this as a basis for providing more fine-grained
timing when implementing things on the GPU.

The main difficulty with this is the fact that the AMDGPU fixed
frequency clock operates at an unknown frequency. We need to read this
on a per-card basis from the driver and then copy it in. NVPTX on the
other hand has a fixed clock at a resolution of 1ns. I have also
increased the resolution of the print-outs as the majority of these are
below a millisecond for me.

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D154446
2023-07-05 14:27:08 -05:00
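
For example, converting raw clock ticks into nanoseconds depends on the per-card frequency read from the driver on AMDGPU (NVPTX can assume a 1 ns tick). The helper below is only a sketch; the `clock_freq` value is a made-up example.

```
#include <cstdint>

// Hypothetical value copied in from the driver on AMDGPU; NVPTX would use a
// fixed 1 GHz (1 ns per tick) clock instead.
uint64_t clock_freq = 100000000; // e.g. a 100 MHz fixed-frequency counter

// Convert elapsed ticks from the fixed-frequency clock into nanoseconds
// (ignoring overflow for very long intervals).
uint64_t ticks_to_ns(uint64_t start, uint64_t end) {
  return (end - start) * 1000000000ull / clock_freq;
}
```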
Joseph Huber
ee6ace27e0 [libc] Remove disabled pass after performance improvement
This pass used to cause huge compile-time regressions. That has been
addressed, so the pass can now be re-added.

Differential Revision: https://reviews.llvm.org/D153374
2023-06-20 15:48:02 -05:00
Joseph Huber
964a535bfa [libc] Remove flexible array and replace with a template
Currently the implementation of the RPC interface requires a struct with
a flexible array member. This caused problems when compiling the RPC
server with GCC, as would be required to export the RPC server interface. This
required that we either move to the `x[1]` workaround or make it a
template parameter. While just using `x[1]` would be much less noisy,
this is technically undefined behavior. For this reason I elected to use
templates.

The downside to using templates is that the server code must now be able
to handle multiple different types at runtime. I was unable to find a
good solution that didn't rely on type erasure so I simply branch off of
the given value.

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D153304
2023-06-20 15:22:37 -05:00
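
A sketch of the trade-off described above: the flexible array member becomes a template parameter, and the server branches on the size it is given at runtime. Types and names here are illustrative only.

```
#include <cstdint>

// Not portable C++: a flexible array member at the end of a struct caused
// problems under GCC, and the x[1] workaround is technically UB.
// struct Buffer {
//   uint64_t size;
//   uint64_t data[];
// };

// Illustrative alternative: make the payload size a template parameter.
template <uint32_t NumLanes> struct Buffer {
  uint64_t data[NumLanes];
};

// The server must now branch on the lane count it was given at runtime.
uint64_t first_element(void *buffer, uint32_t num_lanes) {
  if (num_lanes == 32)
    return static_cast<Buffer<32> *>(buffer)->data[0];
  return static_cast<Buffer<64> *>(buffer)->data[0];
}
```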
Joseph Huber
5a8fc41937 [libc] Disable atomic optimizations for libc AMDGPU builds
The AMDGPU backend recently started automatically enabling a pass to optimize
atomics. This results in the LTO build taking about 10x longer in all
cases. For now we disable this by default as was the case before the
patch in D152649.

Reviewed By: lntue

Differential Revision: https://reviews.llvm.org/D153232
2023-06-19 03:25:51 -05:00
Joseph Huber
ad00a3db4d [libc][AMDGPU] Disable the AMDGPU backend's ctor/dtor lowering for libc
The AMDGPU backend has a built-in pass to lower constructors. We do this
manually in the `start.cpp` implementation so we can disable this to
keep the binaries smaller.

Differential Revision: https://reviews.llvm.org/D151213
2023-05-23 09:20:41 -05:00
Joseph Huber
30093d6be2 [libc][obvious] Fix undefined variable after name change
I forgot that we still used these variables in the loaders.

Differential Revision: https://reviews.llvm.org/D150362
2023-05-11 09:00:08 -05:00
Jon Chesterfield
bbeae142bf [libc][rpc] Allocate a single block of shared memory instead of three
Allows moving the pointer swap between server and client into reset.
A single allocation simplifies whatever allocates the client/server
(currently the libc loaders).

Reviewed By: jhuber6

Differential Revision: https://reviews.llvm.org/D150337
2023-05-11 03:04:56 +01:00
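
A rough sketch of carving one shared allocation into the regions that previously needed separate allocations; the layout, sizes, and names are assumptions for illustration only.

```
#include <cstdint>
#include <cstdlib>

// Illustrative layout: one allocation holding the inbox, outbox, and packet
// buffer regions instead of three separate allocations.
struct SharedMemory {
  uint32_t *inbox;
  uint32_t *outbox;
  void *buffer;
};

SharedMemory allocate_shared(uint64_t num_ports, uint64_t packet_size) {
  uint64_t mailbox_bytes = num_ports * sizeof(uint32_t);
  uint64_t total = 2 * mailbox_bytes + num_ports * packet_size;
  auto *base = static_cast<uint8_t *>(std::malloc(total));
  return SharedMemory{
      reinterpret_cast<uint32_t *>(base),                 // inbox
      reinterpret_cast<uint32_t *>(base + mailbox_bytes), // outbox
      base + 2 * mailbox_bytes,                           // packet buffers
  };
}
```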
Jon Chesterfield
f497611f43 [libc][rpc] Allocate locks array within process
Replaces the globals currently used. It is worth changing to a bitmap
before allowing a runtime number of ports >> 64. One bit per port is
likely cheap enough that sizing for the worst case is always fine;
otherwise, we can switch to dynamic allocation in the future.

Reviewed By: jhuber6

Differential Revision: https://reviews.llvm.org/D150309
2023-05-11 00:41:51 +01:00
Joseph Huber
aea866c12c [libc] Support concurrent RPC port access on the GPU
Previously we used a single port to implement the RPC. This was
sufficient for single threaded tests but can potentially cause deadlocks
when using multiple threads. The reason for this is that GPUs make no
forward progress guarantees. Therefore one group of threads waiting on
another group of threads can spin forever because there is no guarantee
that the other threads will continue executing. The typical workaround
for this is to allocate enough memory that a sufficiently large number
of work groups can make progress. As long as this number is somewhat
close to the total amount of concurrency, we can obtain reliable
execution around a shared resource.

This patch enables using multiple ports by widening the arrays to a
predetermined size and indexing into them. Empty ports are currently
found via a trivial linear scan. This should be improved in the
future for performance reasons. Portions of D148191 were applied to
achieve parallel support.

Depends on D149581

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D149598
2023-05-05 10:12:19 -05:00
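
A sketch of the port-claiming loop described above, using an illustrative per-port flag array and a trivial linear scan; the real implementation packs the ownership bits and lives in GPU memory.

```
#include <atomic>
#include <cstdint>

constexpr uint32_t NUM_PORTS = 64;

// Illustrative per-port ownership flags.
static std::atomic<bool> port_in_use[NUM_PORTS];

// Trivial linear scan for a free port; returns NUM_PORTS if none is free.
// Claiming must be atomic so concurrent work groups never share a port.
uint32_t open_port() {
  for (uint32_t i = 0; i < NUM_PORTS; ++i) {
    bool expected = false;
    if (port_in_use[i].compare_exchange_strong(expected, true,
                                               std::memory_order_acquire))
      return i;
  }
  return NUM_PORTS; // caller retries; enough ports must exist to avoid deadlock
}

void close_port(uint32_t i) {
  port_in_use[i].store(false, std::memory_order_release);
}
```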
Joseph Huber
901266dad3 [libc] Change GPU startup and loader to use multiple kernels
The GPU has a different execution model to standard `_start`
implementations. On the GPU, all threads are active at the start of a
kernel. In order to correctly initialize and call the constructors we
want single threaded semantics. Previously, this was done using a
makeshift global barrier with atomics. However, it should be easier to
simply put the portions of the code that must be single threaded in
separate kernels and then call those with only one thread. Generally,
mixing global state between kernel launches makes optimizations more
difficult, similarly to calling a function outside of the TU, but for
testing it is better to be correct.

Depends on D149527 D148943

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D149581
2023-05-04 19:31:41 -05:00
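
Conceptually, the startup is split into separate kernels so the single-threaded pieces can be launched with one thread. The kernel names and signatures below are assumptions for illustration; in the real code these are emitted as GPU kernels rather than plain functions.

```
// Illustrative only: in the actual startup these are GPU kernels (e.g. via
// clang's amdgpu_kernel attribute); plain functions are shown here.
int main(int argc, char **argv, char **envp);
void call_init_array();
void call_fini_array();

// Launched with a single thread: run the global constructors.
void _begin(int /*argc*/, char ** /*argv*/, char ** /*envp*/) {
  call_init_array();
}

// Launched with the full grid: every thread runs main and reports a result
// (the real code merges the per-thread results).
void _start(int argc, char **argv, char **envp, int *ret) {
  *ret = main(argc, argv, envp);
}

// Launched with a single thread again: run the global destructors.
void _end(int /*ret*/) { call_fini_array(); }
```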
Joseph Huber
507edb52f9 [libc] Enable multiple threads to use RPC on the GPU
The execution model of the GPU expects that groups of threads will
execute in lock-step in SIMD fashion. It's important for both
performance and correctness that we treat this as the smallest possible
granularity for an RPC operation. Thus, we map multiple threads to a
single larger buffer and ship that across the wire.

This patch makes the necessary changes to support executing the RPC on
the GPU with multiple threads. This requires some workarounds to mimic
the model when handling the protocol from the CPU. I'm not completely
happy with some of the workarounds required, but I think it should work.

Uses some of the implementation details from D148191.

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D148943
2023-05-04 19:31:41 -05:00
Joseph Huber
1b823abea7 [libc] Add support for global ctors / dtors for AMDGPU
This patch makes the necessary changes to support calling global
constructors and destructors on the GPU. The patch in D149340 allows the
`lld` linker to create the symbols pointing us to these globals. These
should be executed by a single thread, which is more difficult on the
GPU because all threads are active. I chose to use an atomic counter to
sync every thread on the GPU. This is very slow if you use more than a
few thousand threads, but for testing purposes it should be sufficient.

Depends on D149340 D149363

Reviewed By: sivachandra

Differential Revision: https://reviews.llvm.org/D149398
2023-04-29 08:40:20 -05:00
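
A sketch of the atomic-counter idea mentioned above, assuming hypothetical `total_threads()` and `thread_id()` helpers: every thread checks in, one thread runs the constructors, and the rest spin until it signals completion.

```
#include <atomic>
#include <cstdint>

void call_init_array();

// Hypothetical helpers for the launch size and the current thread's flat id.
uint64_t total_threads();
uint64_t thread_id();

std::atomic<uint64_t> ready_count{0};

// Makeshift global barrier: thread 0 waits for every thread to arrive, runs
// the constructors once, then resets the counter to release the others.
void run_ctors_single_threaded() {
  ready_count.fetch_add(1, std::memory_order_acq_rel);
  if (thread_id() == 0) {
    while (ready_count.load(std::memory_order_acquire) != total_threads())
      ; // wait for every thread to arrive
    call_init_array();
    ready_count.store(0, std::memory_order_release); // signal completion
  } else {
    while (ready_count.load(std::memory_order_acquire) != 0)
      ; // wait until the constructors have run
  }
}
```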
Joseph Huber
50445dff43 [libc] Add more utility functions for the GPU
This patch adds extra intrinsics for the GPU. Some of these are unused
for now but will be used later. We use these currently to update the
`RPC` handling. Currently, every thread can update the RPC client, which
isn't correct. This patch adds the code necessary to allow a single
thread to perform the write while the others wait.

Feedback is welcome for the naming of these functions. I'm copying the
OpenMP nomenclature where we call an AMD `wavefront` or NVIDIA `warp` a
`lane`.

Reviewed By: tra

Differential Revision: https://reviews.llvm.org/D148810
2023-04-24 15:47:53 -05:00
Joseph Huber
d0ff5e4030 [libc] Update RPC interface for system utilities on the GPU
This patch reworks the RPC interface to allow more generic memory
operations using the shared buffer better. This patch decomposes the entire RPC
interface into opening a port and calling `send` or `recv` on it.

The `send` function sends a single packet of the length of the buffer.
The `recv` function is paired with the `send` call to then use the data.
So, any arbitrary combination of sending packets is possible. The only
restriction is that the client initiates the exchange with a `send`
while the server consumes it with a `recv`.

The operation of this is driven by two independent state machines that
track the buffer ownership during loads / stores. We keep track of two
so that we can transition between a send state and a recv state without
an extra wait. State transitions are observed via bit toggling.

This interface supports an efficient `send -> ack -> send -> ack -> send`
interface and allows for the last send to be ignored without checking
the ack.

A following patch will add some more comprehensive testing to this
interface. I informally made an RPC call that simply incremented an
integer, and it took roughly 10 microseconds to complete.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D148288
2023-04-19 20:02:31 -05:00
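
A sketch of the usage pattern described above, with hypothetical types: the client opens a port and initiates with `send`, and the server pairs it with a `recv` that consumes the data. None of these names are the actual libc interface.

```
#include <cstdint>

// Hypothetical buffer and port types illustrating the open/send/recv shape.
struct Buffer {
  uint64_t data[8];
};

struct Port {
  Buffer buffer;

  // `send` fills a single packet; `recv` is paired with it to use the data.
  template <typename Fill> void send(Fill fill) { fill(&buffer); }
  template <typename Use> void recv(Use use) { use(&buffer); }
};

Port open_port(uint32_t opcode) { return Port{}; }

int example() {
  // Client side: initiate the exchange with a send.
  Port client = open_port(/*opcode=*/1);
  client.send([](Buffer *buf) { buf->data[0] = 42; });

  // Server side: consume the matching packet with a recv.
  int value = 0;
  Port &server = client; // stand-in: both sides share the same buffer here
  server.recv([&](Buffer *buf) { value = static_cast<int>(buf->data[0]); });
  return value;
}
```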
Joseph Huber
6bd4d717d5 [libc] Add environment variables to GPU libc test for AMDGPU
This patch performs the same operation to copy over the `argv` array to
the `envp` array. This allows the GPU tests to use environment
variables.

Reviewed By: sivachandra

Differential Revision: https://reviews.llvm.org/D146322
2023-03-20 13:16:58 -05:00
Joseph Huber
39e91098b5 [libc] Enable integration tests targeting the GPU
This patch enables integration tests running on the GPU. This uses the
RPC interface implemented in D145913 to compile the necessary
dependencies for the integration test object. We can then use this to
compile the objects for the GPU directly and execute them using the AMD
HSA loader combined with its RPC server. For example, the compiler is
performing the following actions to execute the integration tests.

```
$ clang++ --target=amdgcn-amd-amdhsa -mcpu=gfx1030 -nostdlib -flto -ffreestanding \
    crt1.o io.o quick_exit.o test.o rpc_client.o args_test.o -o image
$ ./amdhsa_loader image 1 2 5
args_test.cpp:24: Expected 'my_streq(argv[3], "3")' to be true, but is false
```

This currently only works with a single threaded client implementation
running on AMDGPU. Further work will implement multiple clients for AMD
and the ability to run on NVPTX as well.

Depends on D145913

Reviewed By: sivachandra, JonChesterfield

Differential Revision: https://reviews.llvm.org/D146256
2023-03-17 12:55:32 -05:00
Joseph Huber
8e4f9b1fcb [libc] Add initial support for an RPC mechanism for the GPU
This patch adds initial support for an RPC client / server architecture.
The GPU is unable to provide several system utilities on its own, so in
order to implement features like printing or memory allocation we need
to be able to communicate with the executing process. This is done via a
buffer of "sharable" memory. That is, a buffer with a unified pointer
that both the client and server can use to communicate.

The implementation here is based on Jon Chesterfield's minimal RPC
example in his work. We use an `inbox` and an `outbox` to communicate
whether there is an RPC request and to signify when work is done.
We use a fixed-size buffer for the communication channel. This is fixed
size so that we can ensure that there is enough space for all
compute-units on the GPU to issue work to any of the ports. Right now
the implementation is single threaded so there is only a single buffer
that is not shared.

This implementation is still missing several features needed to be
complete, such as multi-threaded support and asynchronous calls.

Depends on D145912

Reviewed By: sivachandra

Differential Revision: https://reviews.llvm.org/D145913
2023-03-17 12:55:31 -05:00
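
A sketch of the inbox/outbox idea for the single-buffer, single-threaded case described above: each side signals progress through its own mailbox and polls the other's. The counters and names are illustrative, not the protocol the library actually uses.

```
#include <atomic>
#include <cstdint>

// Illustrative mailboxes in shared memory: the client's outbox is the
// server's inbox, and vice versa.
struct Mailboxes {
  std::atomic<uint32_t> client_to_server{0};
  std::atomic<uint32_t> server_to_client{0};
};

// Client: signal that a request is in the shared buffer, then wait for the
// server to acknowledge that the work is done.
void client_post_and_wait(Mailboxes &m) {
  uint32_t ticket =
      m.client_to_server.fetch_add(1, std::memory_order_release) + 1;
  while (m.server_to_client.load(std::memory_order_acquire) != ticket)
    ; // spin until the server catches up
}

// Server: wait for a new request, handle it, then acknowledge it.
void server_handle_one(Mailboxes &m) {
  uint32_t done = m.server_to_client.load(std::memory_order_relaxed);
  while (m.client_to_server.load(std::memory_order_acquire) == done)
    ; // spin until the client posts work
  // ... process the shared buffer here ...
  m.server_to_client.store(done + 1, std::memory_order_release);
}
```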
Joseph Huber
6641f8da73 [libc] Fix amdgpu startup code flags
Summary:
Currently AMDGPU only barely supports cross-TU ELF linking. Full linking
is usually done via LTO. This requires passing the architecture to the
link job. This is done automatically via `-flto` since D144505. Add this
to the link options.
2023-02-22 11:38:26 -06:00
Joseph Huber
d51d2b5909 [libc] Support add_object_library for the GPU build
This patch unifies the handling of generating the GPU build targets
between the `add_entrypoint_library` and the `add_object_library`
functions. The `_build_gpu_objects` function will create two targets.
One contains a single object file with several GPU binaries embedded in
it, a so-called fatbinary. The other is a direct compile of the
supported target to be used internally only. This patch pulls out some
of the properties logic so that we can handle both more easily. This
patch also required adding an override `NO_GPU_BUILD` for cases when
we only want to build the source file as normal.

Reviewed By: sivachandra

Differential Revision: https://reviews.llvm.org/D144214
2023-02-20 09:37:18 -06:00
Joseph Huber
fa34b9e032 [libc] Add startup code implementation for GPU targets
This patch introduces startup code for executing `main` on a device
compiled for the GPU. We will primarily use this to run standalone
integration tests on the GPU. The actual execution of this routine will
need to be provided by a `loader` utility to bootstrap execution on the
GPU.

Reviewed By: sivachandra

Differential Revision: https://reviews.llvm.org/D143212
2023-02-07 11:36:16 -06:00