Summary:
We always want to use the detected path. The clang driver's detection
is far more sophisticated, so we should use it whenever possible. Also
update the usage so we properly fall back to the path instead of
incorrectly using `/bin` when none is provided.
CUDA 12.9 has removed the fatbinary tool's `--image` argument that we
have been using until now. `--image3` has been supported since CUDA 9,
so we do not need CUDA SDK version checks.
Summary:
This patch reworks how we create offloading toolchains. Previously we
would handle this separately for each of the different kinds. This
patch changes it to use the target triple and the offloading kind to
determine the proper toolchain. In the old case where the user only
passes `--offload-arch`, we instead infer the triple from the passed
arguments. This is a pretty major overhaul, but it currently passes all
the clang tests with only minor changes to error messages.
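For instance, something like the following (a rough sketch; the
architectures and source file are placeholders) now picks the device
toolchain purely from the `--offload-arch` value:
```console
$ clang -c -fopenmp --offload-arch=gfx90a foo.c  # triple inferred as amdgcn-amd-amdhsa
$ clang -c -fopenmp --offload-arch=sm_80 foo.c   # triple inferred as nvptx64-nvidia-cuda
```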
Summary:
These commands both do the same thing and behave like the same tool,
so they are now handled by a single tool. Invoking it as `nvptx-arch`
or `amdgpu-arch` causes it to only emit architectures for that name.
Currently there is only the -nogpuinc option for disabling the default
CUDA/HIP wrapper headers. However, there are situations where -nogpuinc
needs to be overridden to re-enable the CUDA/HIP wrapper headers. This
patch adds --[no-]offload-inc for that purpose. When both exist, the
last one wins. -nogpuinc and -nocudainc are now aliases of
--no-offload-inc.
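For example (an illustrative sketch; the HIP source and offload
architecture are placeholders), whichever option comes last takes
effect:
```console
$ clang -c foo.hip --offload-arch=gfx906 -nogpuinc --offload-inc  # wrapper headers enabled
$ clang -c foo.hip --offload-arch=gfx906 --offload-inc -nogpuinc  # wrapper headers disabled
```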
This patch moves the CommonArgs utilities into a location visible by the
Frontend Drivers, so that the Frontend Drivers may share option parsing
code with the Compiler Driver. This is useful when the Frontend Drivers
would like to verify that their incoming options are well-formed and
also not reinvent the option parsing wheel.
We already see code in the Clang/Flang Drivers that is parsing and
verifying its incoming options. E.g. OPT_ffp_contract. This option is
parsed in the Compiler Driver, Clang Driver, and Flang Driver, all with
slightly different parsing code. It would be nice if the Frontend
Drivers were not required to duplicate this Compiler Driver code. That
way there is no/low maintenance burden on keeping all these parsing
functions in sync.
Along those lines, the Frontend Drivers will now have a useful mechanism
to verify their incoming options are well-formed. Currently, the
Frontend Drivers trust that the Compiler Driver is not passing back junk
in some cases. The Frontend Drivers may even accept junk with no error
at all. E.g.:
`clang -cc1 -mprefer-vector-width=junk test.c`
With this patch, we'll now be able to tighten up incoming options to
the Frontend Drivers in a lightweight way.
---------
Co-authored-by: Cameron McInally <cmcinally@nvidia.com>
Co-authored-by: Shafik Yaghmour <shafik.yaghmour@intel.com>
These are identified by misc-include-cleaner. I've filtered out those
that break builds. Also, I'm staying away from llvm-config.h,
config.h, and Compiler.h, which likely cause platform- or
compiler-specific build failures.
Summary:
We support `-nogpulib` to disable implicit libraries. In the future we
will want to change the default linking of these libraries based on the
user's language. This patch just introduces a positive variant, so
alongside `-nogpulib` we can now use `-gpulib` to explicitly enable
them.
A later patch will make the default a variable in the ROCmToolChain
depending on the target language.
Summary:
Currently we conditionally enable NVPTX lowering depending on the
language (C/C++/OpenMP). Unfortunately this causes problems because the
option is only present if the backend was enabled, which makes this
error if you only want to emit LLVM-IR.
This patch instead makes it the only accepted lowering. The reason we
had it as opt-in before is that it is not handled by CUDA. So, this
patch also introduces diagnostics to prevent *all* creation of
device-side global constructors and destructors. We already did this
for variables, now we do it for attributes as well.
This inverts the responsibility of blocking this from the backend to
the language, as it should be, given that support for this is language
dependent.
Summary:
This patch cleans up how we query the offloading toolchain. We create a
single function that is more similar to the existing `getToolChain`
driver function and make all the offloading handlers use it.
Summary:
OpenMP was weirdly split between using the bound architecture from
`--offload-arch=` and the old `-march=` option which only worked for
single jobs. This patch removes that special handling. The main benefit
here is that we can now use `getToolchainArgs` without it throwing an
error.
I'm assuming SYCL doesn't care about this because they don't use an
architecture.
Summary:
We pass the `-nvptx-lower-global-ctor-dtor` option to support the
`libc`-like use case, which sometimes needs global constructors. This
only affects the backend. If the NVPTX target is not enabled, this
option will be unknown, which prevents you from compiling generic IR
for this target.
Summary:
We previously built this for every single architecture to deal with
incompatibility. This patch updates it to use the 'generic' IR that
`libc` and other projects use. Who knows if this will have any side
effects; it is probably worth testing more, but it passes the tests I
expect to pass on my side.
We do not have support for thread-safe statics on the GPU side.
However, we do sometimes end up with empty local static initializers,
and those happen to trigger calls to `__cxa_guard*`, which breaks
compilation.
Partially addresses https://github.com/llvm/llvm-project/issues/117023
Summary:
The C library for GPUs provides the ability to target regular C/C++
programs by providing the C library and a file containing kernels that
call the `main` function. This is mostly used for unit tests, so this
patch provides a quick way to add them without needing to know the
paths. I currently do this explicitly, but according to the libc++
contributors we don't want to require specifying these paths manually.
See the discussion in https://github.com/llvm/llvm-project/pull/104515.
I just default to `lib/` if the target-specific directory isn't found,
because the linker will give a reasonable error message if the files
aren't there. Basically, the use case looks like this.
```console
$ clang test.c --target=amdgcn-amd-amdhsa -mcpu=native -startfiles -stdlib
$ amdhsa-loader a.out
PASS!
```
When passed -nocudalib/-nogpulib, the CUDA argument handling would bail
out before handling -fcuda-short-ptr, meaning the frontend and backend
data layouts would mismatch.
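A rough reproduction sketch (the offload architecture is a
placeholder); before this fix, the frontend and backend disagreed on
the pointer sizes when these flags were combined:
```console
$ clang -c foo.cu --offload-arch=sm_80 -nocudalib -fcuda-short-ptr
```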
Summary:
I initially thought that it would be convenient to automatically link
these libraries like they are for standard C/C++ targets. However, this
created issues when trying to use C++ as a GPU target. This patch moves
the logic so the library is now implicitly passed as part of the
offloading toolchain instead, if found. This means that the user needs
to set the target toolchain for the link job for automatic detection,
but it can still be done manually via `-Xoffload-linker -lc`.
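For instance (a sketch; the offload architecture is a placeholder), an
OpenMP link can still pull the GPU `libc` in by hand:
```console
$ clang -fopenmp --offload-arch=gfx90a main.c -Xoffload-linker -lc
```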
(This is the clang-related part.)
Without these explicit includes, removing other headers that implicitly
include llvm-config.h may have non-trivial side effects. For example,
`clangd` may report even `llvm-config.h` itself as unused when it only
defines a macro that is checked with #ifdef. This is amplified by
different build configs, which use different sets of macros.
Through the new `-foffload-via-llvm` flag, CUDA kernels can now be
lowered to the LLVM/Offload API. On the Clang side, this is simply done
by using the OpenMP offload toolchain and emitting calls to `llvm*`
functions to orchestrate the kernel launch rather than `cuda*`
functions. These `llvm*` functions are implemented on top of the
existing LLVM/Offload API.
As we are about to redefine the Offload API, this will help us in the
design process as a second offload language.
We do not support any CUDA APIs yet; however, we could:
https://www.osti.gov/servlets/purl/1892137
For proper host execution we need to resurrect/rebase
https://tianshilei.me/wp-content/uploads/2021/12/llpp-2021.pdf
(which was designed for debugging).
```
❯❯❯ cat test.cu
extern "C" {
void *llvm_omp_target_alloc_shared(size_t Size, int DeviceNum);
void llvm_omp_target_free_shared(void *DevicePtr, int DeviceNum);
}
__global__ void square(int *A) { *A = 42; }
int main(int argc, char **argv) {
  int DevNo = 0;
  int *Ptr = reinterpret_cast<int *>(llvm_omp_target_alloc_shared(4, DevNo));
  *Ptr = 7;
  printf("Ptr %p, *Ptr %i\n", Ptr, *Ptr);
  square<<<1, 1>>>(Ptr);
  printf("Ptr %p, *Ptr %i\n", Ptr, *Ptr);
  llvm_omp_target_free_shared(Ptr, DevNo);
}
❯❯❯ clang++ test.cu -O3 -o test123 -foffload-via-llvm --offload-arch=native
❯❯❯ llvm-objdump --offloading test123
test123: file format elf64-x86-64
OFFLOADING IMAGE [0]:
kind elf
arch gfx90a
triple amdgcn-amd-amdhsa
producer openmp
❯❯❯ LIBOMPTARGET_INFO=16 ./test123
Ptr 0x155448ac8000, *Ptr 7
Ptr 0x155448ac8000, *Ptr 42
```
When working on very busy systems, check-offload frequently fails many
tests with this diagnostic:
```
clang: error: cannot determine amdgcn architecture: /tmp/llvm/build/bin/amdgpu-arch: Child timed out: ; consider passing it via '-march'
```
This patch accepts the environment variable
`CLANG_TOOLCHAIN_PROGRAM_TIMEOUT` to set the timeout. It also increases
the timeout from 10 to 60 seconds.
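For illustration (the value and compile line here are arbitrary), the
timeout can be raised for a single invocation via the environment:
```console
$ CLANG_TOOLCHAIN_PROGRAM_TIMEOUT=120 clang -c -fopenmp --offload-arch=native foo.c
```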
Summary:
The `nvlink-wrapper` can do LTO now, which means we can still create
some LLVM-IR without needing an architecture. In the case that we try to
invoke `nvlink` internally, that will still fail. This patch simply
defers the error until later so we can use `--lto-emit-llvm` to get the
IR without specifying an architecture.
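A rough sketch of the intended use (assuming `-Wl` options are
forwarded to `clang-nvlink-wrapper` on the standalone NVPTX toolchain);
note that no architecture is given:
```console
$ clang foo.c --target=nvptx64-nvidia-cuda -flto -Wl,--lto-emit-llvm
```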
Summary:
This is necessary for LTO when the user specifies it or has a CUDA
version new enough to support it. Previously this would just fall back
to the default.
Fixes https://github.com/llvm/llvm-project/issues/100606
Summary:
The previous patches (the other commits in this chain) allow the
offloading toolchain to directly invoke the device linker. Because of
this, we can now just have the toolchain implicitly include `-lc` and
`-lm` like a standard target does. This removes the old handling that
went through the fat binary `-lcgpu`.
Summary:
The `clang-nvlink-wrapper` is a utility that I removed a while back
during the transition to the new driver. This patch adds it back as a
new, upgraded version that does LTO + archive linking. It's not an easy
choice to reintroduce something I happily deleted, but this is the only
way to move forward with improving GPU support in LLVM.
While NVIDIA provides a linker called 'nvlink', its main interface is
very difficult to work with. It does not provide LTO or static linking,
requires all files to be named with a non-standard `.cubin` extension,
and rejects link jobs that other linkers would be fine with (i.e. empty
ones). I have spent a great deal of time hacking around this in the GPU
`libc` implementation, where I deliberately avoid LTO and static
linking and have about 100 lines of hacky CMake dedicated to storing
these files in a format that the clang-linker-wrapper accepts, just to
avoid this limitation.
The main reason I want to reintroduce this tool is that I am planning
on creating a more standard C/C++ toolchain for GPUs. This will install
files like the following.
```
<install>/lib/nvptx64-nvidia-cuda/libc.a
<install>/lib/nvptx64-nvidia-cuda/libc++.a
<install>/lib/nvptx64-nvidia-cuda/libomp.a
<install>/lib/clang/19/lib/nvptx64-nvidia-cuda/libclang_rt.builtins.a
```
Linking in these libraries will then simply require passing `-lc`, as
is already done for non-GPU toolchains. However, this doesn't work with
the currently deficient `nvlink` linker, so I consider this a blocking
issue for massively improving the state of building GPU libraries.
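With that layout in place, linking would look something like the
following (a sketch; the architecture is a placeholder):
```console
$ clang test.c --target=nvptx64-nvidia-cuda -march=sm_89 -lc
```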
In the future we may be able to convince NVIDIA to port their linker to
`ld.lld`, but for now this is the only workable solution that allows us
to hack around the weird behavior of their closed-source software.
This also copies some amount of logic from the clang-linker-wrapper,
but not enough that I feel it is worthwhile to merge them. In the
future it may be possible to delete that handling from there entirely.
Summary:
The utilities `nvptx-arch` and `amdgpu-arch` are used to support
`--offload-arch=native`, among other things, in clang. However, they
rely on the GPU drivers to query the available devices. In certain
cases these drivers can become locked up, which leads to indefinite
hangs on any compiler jobs running in the meantime.
This patch adds a ten-second timeout period for these utilities before
the job is killed and an error is emitted.
It has been noticed that arguments were being passed twice to ptxas.
This has been fixed by filtering out the arguments before appending
them to the new DAL created by CudaToolChain::TranslateArgs.
github: https://github.com/llvm/llvm-project/pull/86807
This PR adds `-march=generic` support for the NVPTX backend. This
fulfills a TODO introduced in #79873.
With this PR, users can explicitly request the "default" CUDA
architecture, which makes sure that no specific architecture is
specified.
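For example (a minimal sketch using the standalone NVPTX toolchain),
emitting code without committing to a specific architecture:
```console
$ clang -S foo.c --target=nvptx64-nvidia-cuda -march=generic
```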
This PR does not address any compatibility issues between different CUDA
versions.
---------
Co-authored-by: Joseph Huber <huberjn@outlook.com>
Summary:
The old driver embeds PTX in RDC mode and so does the `nvcc` compiler.
The new driver currently does not do this, so we should keep it
consistent in this case. This simply requires adding the assembler
output as an input to the offloading action that gets fed to fatbin.
Summary:
One recurring problem we have with the OpenMP libraries is that they
can conflict with ones found on the system; this occurs when there are
two copies and the one used for linking is not attached to the
corresponding clang compiler. LLVM already uses target-specific
directories for this, like with libc++, which are always searched
first. This patch changes the install directory to be, for example,
`lib/x86_64-unknown-linux-gnu`.
Notable changes are that users may need to adjust their LD_LIBRARY_PATH
settings, or use the default rt-rpath options. This should fix problems
where users are linking the wrong versions of static libraries.
Summary:
Recent changes to the `libc` project caused the headers to be installed
to `include/<triple>` for the GPU and the libraries to be in
`lib/<triple>`. This means we should automatically append these search
paths so they can be found by default. This allows the following to work
targeting AMDGPU.
```shell
$ clang foo.c -flto -mcpu=native --target=amdgcn-amd-amdhsa -lc <install>/lib/amdgcn-amd-amdhsa/crt1.o
$ amdhsa-loader a.out
```
Summary:
We support standalone compilation for the NVPTX architecture using
'nvlink' as our linker. Because of the special handling required to
transform input files to cubins, as nvlink expects for some reason, we
didn't use the standard AddLinkerInput method. However, this also meant
that we weren't forwarding options passed with -Wl to the linker. Add
this support in for the standalone toolchain path.
Revived from https://reviews.llvm.org/D149978
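For example (a sketch; both the architecture and the forwarded flag are
placeholders), a `-Wl` option now reaches `nvlink`:
```console
$ clang test.c --target=nvptx64-nvidia-cuda -march=sm_89 -Wl,-v
```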
Summary:
The NVPTX tools require an architecture to be used; however, if we are
creating generic LLVM-IR we should be able to leave it unspecified.
This will result in the `target-cpu` attributes not being set on the
functions, so the architecture can be chosen when the IR is linked into
other code. This allows the standalone `--target=nvptx64-nvidia-cuda`
toolchain to create LLVM-IR similar to how CUDA's deviceRTL looks from
C/C++.
Summary:
We support `--target=nvptx64-nvidia-cuda` as a way to target the NVPTX
architecture directly, like a standard CPU target. This patch simply
uses the existing support for handling `--offload-arch=native` to also
apply to the standalone toolchain.
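A sketch of the resulting usage, assuming the standalone toolchain
spells the native query as `-march=native`:
```console
$ clang -c test.c --target=nvptx64-nvidia-cuda -march=native
```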
This patch replaces uses of StringRef::{starts,ends}with with
StringRef::{starts,ends}_with for consistency with
std::{string,string_view}::{starts,ends}_with in C++20.
I'm planning to deprecate and eventually remove
StringRef::{starts,ends}with.