Summary;
Now that we use the linker to do LTO / device linking, we need to inform
the `clang` invocation to use `-flto` so it forwards arguments like
`-On` correctly.
Make it possible to do things like the following, regardless of whether
the offload target is nvptx or amdgpu:
```
$ clang -O1 -g -fopenmp --offload-arch=native test.c \
-Xoffload-linker -mllvm=-pass-remarks=inline \
-Xoffload-linker -mllvm=-force-remove-attribute=g.internalized:noinline\
-Xoffload-linker --lto-newpm-passes='forceattrs,default<O1>' \
-Xoffload-linker --lto-debug-pass-manager \
-foffload-lto
```
To accomplish that:
- In clang-linker-wrapper, do not forward options via `-Wl` if they
might have literal commas. Use `-Xlinker` instead.
- In clang-nvlink-wrapper, accept `--lto-debug-pass-manager` and
`--lto-newpm-passes`.
- In clang-nvlink-wrapper, drop `-passes` because it's inconsistent with
the interface of `lld`, which is used instead of clang-nvlink-wrapper
when the target is amdgpu. Without this patch, `-passes` is passed to
`nvlink`, producing an error anyway.
---------
Co-authored-by: Joseph Huber <huberjn@outlook.com>
Fix the builds with LLVM_TOOL_LLVM_DRIVER_BUILD enabled.
LLVM_ENABLE_EXPORTED_SYMBOLS_IN_EXECUTABLES is not completely
compatible with export_executable_symbols as the later will be ignored
if the previous is set to NO.
Fix the issue by passing if symbols need to be exported to
llvm_add_exectuable so the link flag can be determined directly
without calling export_executable_symbols_* later.
`LLVM_ENABLE_EXPORTED_SYMBOLS_IN_EXECUTABLES` is not completely
compatible with `export_executable_symbols` as the later will be ignored
if the previous is set to NO.
Fix the issue by passing if symbols need to be exported to
`llvm_add_exectuable` so the link flag can be determined directly
without calling `export_executable_symbols_*` later.
Summary:
`-Xlinker` is supposed to pass options to the linker, while
`-Xoffload-linker` instead passes it to the `clang` job. This is
unintuitive and results in unnecessarily complex command lines. Because
passing it to the clang job is still useful, I've added
`-device-compiler`. This changes the behavior of the flag, but I think
it should be fine in this case since it's likely rarely used and has a
direct replacement.
---------
Co-authored-by: Joel E. Denny <jdenny.ornl@gmail.com>
Summary:
Previously we could parse these internally as they would be used by the
embedded LTO job. Now, this LTO is passed to the linker utilities which
means these need to be forwarded. So this can now either be done with
`--offload-opt` which works in the clang job, or with `-Xoffload-linker`
manually.
Fixes https://github.com/llvm/llvm-project/issues/100212
Summary:
We have the `-Xoffload-linker=triple=arg` syntax that split the argument
meant only for a single toolchain. However this borke if it was an `a=b`
type argument. Make it only treat it like a triple if it's a valid
triple.
Summary:
The linker wrapper's job is to extract embedded device code from fat
binaries and create linked images that can then be embedded and
executed. In order to support LTO, we originally reinvented all of the
LTO handling that `ld.lld` normally does. Primarily, this was because
`nvlink` didn't support this at all, and we have special hacks required
for offloading languages interacting with archive libraries.
Now since I wrote https://github.com/llvm/llvm-project/pull/96561 we
should be able to pass all the inputs to the device linker
transparently. This has the advantage of allowing the `clang` Driver to
do its own handling. Primarily, this will be used to implicitly pass
libraries to the device link job to make it more consistent with other
toolchains.
The JIT support is a notable departure, however there is an option
called `--lto-emit-llvm` that performs the exact function where we want
the final link job to output LLVM-IR that we can then embed instead.
This patch does not fully delete the LTO handling, primarily because I
think the SPIR-V people might want it. To see only the relevant patches,
ignore the first commit of the nvlink-wrapper.
Depends on https://github.com/llvm/llvm-project/pull/96561.
When save-temps is enabled and the given offload-archs differ
only in target features with the same arch, the intermediate
postlink.bc and postopt.bc files were getting overwritten. This
fix, suffixes the intermediate file names with the complete
TargetID.
E.g. `helloworld.amdgcn-amd-amdhsa.gfx90a:xnack+.postlink.bc`
and `helloworld.amdgcn-amd-amdhsa.gfx90a:xnack+.postopt.bc`
The goal of this patch is to enable utilizing LLVM plugin passes and
remarks for GPU offload code at link time. Specifically, this patch
extends clang-linker-wrapper's `--offload-opt` (and consequently
`-mllvm`) to accept the various LLVM pass options that tools like opt
usually accept. Those options include `--passes`, `--load-pass-plugin`,
and various remarks options.
Unlike many other LLVM options that are inherited from linked code by
clang-linker-wrapper (e.g., `-pass-remarks` is already implemented in
`llvm/lib/IR/DiagnosticHandler.cpp`), these options are implemented
separately as needed by each tool (e.g., opt, llc). Fortunately, this
patch is able to handle most of the implementation by passing the option
values to `lto::Config`.
For testing plugin support, this patch uses the simple `Bye` plugin from
LLVM core, but that requires several small Clang test suite config
extensions.
SmallPtrSet.h and TimeProfiler.h are unused. CommandLine.h is only
needed for the UseNewDbgInfoFormat declare, which can be moved to the
places that need it.
DenseMap iteration order is not guaranteed to be deterministic.
Without the change, clang/test/Driver/linker-wrapper{,-libs}.c would
fail when `combineHashValue` changes (#95970).
Summary:
One of the downsides of the linker wrapper is that it made debugging
more difficult. It is very powerful in that it can resolve a lot of
input matching and library handling that could not be done before.
However, the old method allowed users to simply copy-paste the script
files to modify the output and test it.
This patch attempts to make it easier to debug changes by letting the
user override all the linker inputs. That is, we provide a user-created
binary that is treated like the final output of the device link step.
The intended use-case is for using `-save-temps` to get some IR, then
modifying the IR and sticking it back in to see if it exhibits the old
failures.
The distinction between the hip and hipv4 offload kinds is historically
based. Originally, these designations might have indicated different
versions of the code object ABI (Application Binary Interface). However,
as the system has evolved, the ABI version is now embedded directly
within the code object itself, making these historical distinctions
irrelevant during the unbundling process. Consequently, hip and hipv4
are treated as compatible in current implementations, facilitating
interchangeable handling of code objects without differentiation based
on offload kind. This change streamlines code management within the
ecosystem.
Summary:
The linker wrapper attempts to maintain consistent semantics with
existing host invocations. Static libraries by default only extract if
there are non-weak symbols that remain undefined. However, we have
situations between linkers that put different meanings on ordering. The
ld.bfd linker requires static libraries to be defined after the symbols,
while `ld.lld` relaxes this rule. The linker wrapper went with the
former as it's the easier solution, however this has caused a lot of
issues as I've had to explain this rule to several people, it also make
it difficult to include things like `libc` in the OpenMP runtime because
it would sometimes be linked before or after.
This patch reworks the logic to more or less perform the following logic
for static libraries.
1. Split library / object inputs.
2. Include every object input and record its undefined symbols
3. Repeatedly try to extract static libraries to resolve these
symbols. If a file is extracted we need to check every library
again to resolve any new undefined symbols.
This allows the following to work and will cause fewer issues when
replacing HIP, which does `--whole-archive` so it's very likely the old
logic will regress.
```console
$ clang -lfoo main.c -fopenmp --offload-arch=native
```
Summary:
The device linking phase only wants to create the necessary commands to
emit the device binary. There were issues where the user's default
config file was being used and passing incompatible arguments to the
device compilation step. Simply disable this since we do not want any
additional arguments to these clang invocations.
The plugin was not getting built as the build_generic_elf64 macro
assumes the LLVM triple processor name matches the CMake processor name,
which is unfortunately not the case for SystemZ.
Fix this by providing two separate arguments instead.
Actually building the plugin exposed a number of other issues causing
various test failures. Specifically, I've had to add the SystemZ target
to
- CompilerInvocation::ParseLangArgs
- linkDevice in ClangLinuxWrapper.cpp
- OMPContext::OMPContext (to set the device_kind_cpu trait)
- LIBOMPTARGET_ALL_TARGETS in libomptarget/CMakeLists.txt
- a check_plugin_target call in libomptarget/src/CMakeLists.txt
Finally, I've had to set a number of test cases to UNSUPPORTED on
s390x-ibm-linux-gnu; all these tests were already marked as UNSUPPORTED
for x86_64-pc-linux-gnu and aarch64-unknown-linux-gnu and are failing on
s390x for what seem to be the same reason.
In addition, this also requires support for BE ELF files in
plugins-nextgen: https://github.com/llvm/llvm-project/pull/85246
Added --offload-compression-level= option to clang and
-compression-level=
option to clang-offload-bundler for controlling compression level.
Added support of long distance matching (LDM) for llvm::zstd which is
off
by default. Enable it for clang-offload-bundler by default since it
improves compression rate in general.
Change default compression level to 3 for zstd for clang-offload-bundler
since it works well for bundle entry size from 1KB to 32MB, which should
cover most of the clang-offload-bundler usage. Users can still specify
compression level by -compression-level= option if necessary.
Summary:
The clang-offload-bundler uses an empty file to control the bundles made
for embedding. Previously this still used `/dev/null` by mistake even on
Windows.
Summary:
The very first version of the `clang-linker-wrapper` used `--` as a
separator for the host and device arguments. I moved away from this
towards a commandline parsing implementation years ago but never got
around to officially removing this.
The plugin was not getting built as the build_generic_elf64 macro
assumes the LLVM triple processor name matches the CMake processor name,
which is unfortunately not the case for SystemZ.
Fix this by providing two separate arguments instead.
Actually building the plugin exposed a number of other issues causing
various test failures. Specifically, I've had to add the SystemZ target
to
- CompilerInvocation::ParseLangArgs
- linkDevice in ClangLinuxWrapper.cpp
- OMPContext::OMPContext (to set the device_kind_cpu trait)
- LIBOMPTARGET_ALL_TARGETS in libomptarget/CMakeLists.txt
- a check_plugin_target call in libomptarget/src/CMakeLists.txt
Finally, I've had to set a number of test cases to UNSUPPORTED on
s390x-ibm-linux-gnu; all these tests were already marked as UNSUPPORTED
for x86_64-pc-linux-gnu and aarch64-unknown-linux-gnu and are failing on
s390x for what seem to be the same reason.
In addition, this also requires support for BE ELF files in
plugins-nextgen: https://github.com/llvm/llvm-project/pull/83976
Summary:
We support standalone compilation for the NVPTX architecture using
'nvlink' as our linker. Because of the special handling required to
transform input files to cubins, as nvlink expects for some reason, we
didn't use the standard AddLinkerInput method. However, this also meant
that we weren't forwarding options passed with -Wl to the linker. Add
this support in for the standalone toolchain path.
Revived from https://reviews.llvm.org/D149978
Summary:
The standard GPU compilation process embeds each intermediate object
file into the host file at the `.llvm.offloading` section so it can be
linked later. We also use a special section called something like
`omp_offloading_entries` to store all the globals that need to be
registered by the runtime. The linker-wrapper's job is to link the
embedded device code stored at this section and then emit code to
register the linked image and the kernels and globals in the offloading
entry section.
One downside to RDC linking is that it can become quite big for very
large projects that wish to make use of static linking. This patch
changes the support for relocatable linking via `-r` to support a kind
of "partial" RDC compilation for offloading languages.
This primarily requires manually editing the embedded data in the
output object file for the relocatable link. We need to rename the
output section to make it distinct from the input sections that will be
merged. We then delete the old embedded object code so it won't be
linked further. We then need to rename the old offloading section so
that it is private to the module. A runtime solution could also be done
to defer entries that don't belong to the given GPU executable, but this
is easier. Note that this does not work with COFF linking, only the ELF
method for handling offloading entries, that could be made to work
similarly.
Given this support, the following compilation path should produce two
distinct images for OpenMP offloading.
```
$ clang foo.c -fopenmp --offload-arch=native -c
$ clang foo.c -lomptarget.devicertl --offload-link -r -o merged.o
$ clang main.c merged.o -fopenmp --offload-arch=native
$ ./a.out
```
Or similarly for HIP to effectively perform non-RDC mode compilation for
a subset of files.
```
$ clang -x hip foo.c --offload-arch=native --offload-new-driver -fgpu-rdc -c
$ clang -x hip foo.c -lomptarget.devicertl --offload-link -r -o merged.o
$ clang -x hip main.c merged.o --offload-arch=native --offload-new-driver -fgpu-rdc
$ ./a.out
```
One question is whether or not this should be the default behavior of
`-r` when run through the linker-wrapper or a special option. Standard
`-r` behavior is still possible if used without invoking the
linker-wrapper and it guaranteed to be correct.
Summary:
A relocatable link through `clang -r` can go through the
clang-linker-wrapper if offloading is enabled. This will have the effect
of linking the device code and creating the wrapper module. It will then
be merged into the final file. This is useful behavior on its own, but
is likely not what is expected for a `-r` job.
This patch makes the linker wrapper ignore the device code when doing a
reloctable link. This has the effect of the linker merging the
`.llvm.offloading` sections in the output object. These will then be
parsed as normal when the executable is finally created.
Even though this doesn't actually perform a reloctable link on the
device code itself, it has a similar effect of combining multiple files
into a single one.
Summary:
The linker wrapper's job is to sort various embedded inputs into a list
of files that participate in a single link job. So far, this has been
completely 1-to-1, that is, each input file participates in exactly one
link job. However, support for AMD's target-id requires that one input
file may participate in multiple link jobs. For example, if given a
`gfx90a` static library and a `gfx90a:xnack+` object file input, we
should link the gfx90a` target into the `gfx90a:xnack+` job. These are
considered separate CPUs that can be mutually linked more or less.
This patch adds the necessary logic to make this happen. It primarily
reworks the logic to copy relevant input files into a separate list. So,
it moves construction of the final list of link jobs into the extraction
phase. We also need to copy the files in the case that it is needed more
than once, as the entire workflow expects ownership of said file.
This patch moves `clang/tools/clang-linker-wrapper/OffloadWrapper.*` to
`llvm/Frontend/Offloading` allowing them to be re-utilized by other
projects.
Additionally, it makes minor modifications to the API to make it more
flexible.
Concretely:
- The `wrap*` methods now have additional arguments `EntryArray`,
`Suffix` and `EmitSurfacesAndTextures` to specify some additional options.
- The `EntryArray` is now constructed by the caller. This change is needed to
enable JIT compilation, as ORC doesn't fully support `__start_` and `__stop_`
symbols. Thus, to JIT the code, the `EntryArray` has to be constructed explicitly in the IR.
- The `Suffix` field is used when emitting the descriptor, registration
methods, etc, to make them more readable. It is empty by default.
- The `EmitSurfacesAndTextures` field controls whether to emit surface
and texture registration code, as those functions were removed from `CUDART`
in CUDA 12. It is true by default.
- The function `getOffloadingEntryInitializer` was added to help create
the `EntryArray`, as it returns the constant initializer and not a global
variable.
Summary:
We use the OffloadBinary to contain bundled offloading objects used to
support many images / targets at the same time. The `__tgt_device_info`
struct used to contain a pointer to this underlying binary format, which
contains information about the triple and architecture. We used to parse
this in the runtime to do image verification.
Recent changes removed the need for this to be used internally, as we
just parse it out of the ELF directly. This patch sets the pointers up
so they point to the ELF without requiring any further parsing.
Summary:
The CPU target currently inherits all the libraries from the normal link
job to ensure that it has access to the same envrionment that the host
does. However, this previously was not respecting argument libraries
that are passed by name rather than `-l` as well as the whole archive
flags. This patch fixes this to allow the CPU linker to correctly pick
up the libraries associated with things like address sanitizers.
Fixes: https://github.com/llvm/llvm-project/issues/75651
This reverts commit 47d9fbc04b91fb03b6da294e82c2fb4bca6b6343.
It creates a segmentation falt on SPEChpc soma on Frontier on a GPU for
the kernel generate_new_beads
Co-authored-by: fel-cab <fel-cab@github.com>
This patch replaces uses of StringRef::{starts,ends}with with
StringRef::{starts,ends}_with for consistency with
std::{string,string_view}::{starts,ends}_with in C++20.
I'm planning to deprecate and eventually remove
StringRef::{starts,ends}with.
Summary:
We provide `-Xoffload-linker` to pass arguments directly to the link
step. Currently this uses `-Wl,` implicitly which prevents us from using
clang options that we otherwise could make use of. This patch removes
that implicit behavior as users can just as easiliy pass
`-Xoffload-linker -Wl,-foo` if needed.
Summary:
This patch adds support for registering texture / surface variables from
CUDA / HIP. Additionally, we now properly track the `extern` and `const`
flags that are also used in these runtime functions.
This does not implement the `managed` variables yet as those seem to
require some extra handling I'm not familiar with. The issue is that the
current offload entry isn't large enough to carry size and alignment
information along with an extra global.
Multiple calls to `PointerType::getUnqual(C)`, and calls to
`Type::getPointerTo(AddrSpace=0)` on them all result in the same type.
Clean them up to re-use the same `PtrTy` variable within function
`createRegisterGlobalsFunction()`.
Summary:
The linker wrapper is a utility used to create offloading programs from
single-source offloading languages such as OpenMP or CUDA. This is done
by embedding device code into the host object, then feeding it into the
linker wrapper which extracts the accelerator object files, links them,
then wraps them in registration code for the target runtime. This
previously has only worked in Linux / ELF platforms.
This patch attempts to hand Windows / COFF inputs by also accepting COFF
forms of certain linker arguments we use internally. The important
arguments are library search paths, so we can identify libraries which
may contain device code, libraries themselves, and the output name used
for intermediate output.
I am not intimately familiar with the semantics here for the semantics
in how a `lib` file is earched. I am simply treating `foo.lib` as the
GNU equivalent `-l:foo.lib` in the search logic. Similarly, I am
assuming that static libraries will be llvm-ar style libraries. I will
need to investigate the actual deficiencies later, but this should be a
good starting point along with
https://github.com/llvm/llvm-project/pull/72697
Summary:
This patch is a simple refactoring of code out of the linker wrapper
into a common location. The main motivation behind this change is to
make it easier to change the handling in the future to accept a triple
to be used to emit entries that function on that target.
Summary:
This accidentally was a dangling reference and caused issues when
actually used. Make sure that the memory is saved before the job is
created.
Summary:
Weak symbols are supposed to have the semantics that they can be
overriden by a strong (i.e. global) definition. This wasn't being
respected by the LTO pass because we simply used the first definition
that was available. This patch fixes that logic by doing a first pass
over the symbols to check for strong resolutions that could override a
weak one.
A lot of fake linker logic is ending up in the linker wrapper. If there
were an option to handle this in `lld` it would be a lot cleaner, but
unfortunately supporting NVPTX is a big restriction as their binaries
require the `nvlink` tool.