Summary:
We have the ability to schedule callbacks after certain events complete.
Currently we can register an arbitrary callback in CUDA, but can't in
AMDGPU. I am planning on using this support to move the RPC handling to
a separate thread, then using these callbacks to suspend / resume it
when no kernels are running. This is a preliminary patch to keep this
noise out of that one.
We had three `utils::` namespaces, all with different "meaning" (host,
device, hsa_utils). We should, when we can, keep "include/Shared"
accessible from host and device, thus RefCountTy has been moved to a
separate header. `hsa_utils` was introduced to make `utils::` less
overloaded. And common functionality was de-duplicated, e.g.,
`utils::advance` and `utils::advanceVoidPtr` -> `utils::advancePtr`. Type
punning now checks for the size of the result to make sure it matches
the source type.
No functional change was intended.
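As a rough illustration of the two helpers mentioned above, here is a minimal sketch. Only the name `advancePtr` comes from this patch; the signatures, the `sketch` namespace and the `convertViaPun` name are assumptions made for the example and may differ from the real code.

```cpp
#include <cstdint>
#include <cstring>

namespace sketch {

// Advance a pointer by a byte offset, regardless of the pointee type
// (works for `void *` as well, which replaces a dedicated void variant).
template <typename Ty> Ty *advancePtr(Ty *Ptr, std::int64_t Offset) {
  return reinterpret_cast<Ty *>(reinterpret_cast<std::uintptr_t>(Ptr) + Offset);
}

// Type punning that refuses to compile if source and result sizes differ.
template <typename To, typename From> To convertViaPun(From V) {
  static_assert(sizeof(To) == sizeof(From),
                "type punning requires equally sized types");
  To Result;
  std::memcpy(&Result, &V, sizeof(To));
  return Result;
}

} // namespace sketch
```

With this shape, `sketch::convertViaPun<std::uint64_t>(1.0f)` fails at compile time instead of silently reading past the source value.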
It appears that the RUNTIMES build prefers the x86_64-unknown-linux-gnu
triple notation for the host. This fixes runtime / test breakages when
compiler-rt is used as the CLANG_DEFAULT_RTLIB.
This pull request is a revised version of #76587. This pull request
fixes some build issues that were present in the previous version of
this change.
> This pull request is the first part of an ongoing effort to extend
PGO instrumentation to GPU device code. This PR makes the following
changes:
>
> - Adds blank registration functions to device RTL
> - Gives PGO globals protected visibility when targeting a supported
GPU
> - Handles any addrspace casts for PGO calls
> - Implements PGO global extraction in GPU plugins (currently only
dumps info)
>
> These changes can be tested by supplying `-fprofile-instrument=clang`
while targeting a GPU.
Since we can already track allocations, we can diagnose memory faults to
some degree. If the fault happens in a prior allocation (use after free)
or "close but outside" one, we can provide that information to the user.
Note that the fault address might be page aligned, and not all accesses
trigger a fault, especially for allocations that are backed by a
MemoryManager. Still, if people disable the MemoryManager or the
allocation is big enough, we can sometimes provide valuable feedback.
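The classification described above can be sketched roughly as follows. All names here (`AllocationTracker`, `diagnose`, the 4096-byte slack) are hypothetical and chosen for illustration; the plugin's actual bookkeeping is not shown in this message.

```cpp
#include <cstdint>
#include <iterator>
#include <map>
#include <string>

struct AllocationInfo {
  std::uintptr_t Size;
  bool Freed;
};

class AllocationTracker {
  // Allocations keyed by base address; freed entries are kept for diagnosis.
  std::map<std::uintptr_t, AllocationInfo> Allocs;

public:
  void recordAlloc(std::uintptr_t Base, std::uintptr_t Size) {
    Allocs[Base] = {Size, false};
  }
  void recordFree(std::uintptr_t Base) {
    auto It = Allocs.find(Base);
    if (It != Allocs.end())
      It->second.Freed = true;
  }

  // Classify a fault address relative to the tracked allocations.
  std::string diagnose(std::uintptr_t Fault, std::uintptr_t Slack = 4096) const {
    auto Next = Allocs.upper_bound(Fault); // first allocation starting after Fault
    if (Next != Allocs.begin()) {
      auto Prev = std::prev(Next);
      std::uintptr_t End = Prev->first + Prev->second.Size;
      if (Fault < End)
        return Prev->second.Freed ? "use after free" : "inside live allocation";
      if (Fault - End < Slack)
        return "close but outside (after) an allocation";
    }
    if (Next != Allocs.end() && Next->first - Fault <= Slack)
      return "close but outside (before) an allocation";
    return "unknown";
  }
};
```

The slack is deliberately generous because, as noted, the reported fault address may be page aligned rather than exact.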
This patch moves utilities from
`offload/plugins-nextgen/amdgpu/utils/UtilitiesRTL.h` to
`llvm/Frontend/Offloading/Utility.h` to be reused by
other projects.
Concretely the following changes were made:
- Rename `KernelMetaDataTy` to `AMDGPUKernelMetaData`.
- Remove unused fields `KernelObject`, `KernelSegmentSize`,
`ExplicitArgumentCount` and `ImplicitArgumentCount` from
`AMDGPUKernelMetaData`.
- Return the produced error if `ELFObj.sections()` failed instead of
using `cantFail`.
- Added `AGPRCount` field to `AMDGPUKernelMetaData`.
- Added a default invalid value to all the fields in
`AMDGPUKernelMetaData`.
Error: CommandLine Error: Option 'attributor-manifest-internal'
registered more than once
During the standalone debug build of offload the above error is seen at
app runtime when using a prebuilt llvm with LLVM_LINK_LLVM_DYLIB=ON.
This is caused by linking both libLLVM.so and various archives that are
found via llvm_map_components_to_libnames for jit support.
Through the new `-foffload-via-llvm` flag, CUDA kernels can now be
lowered to the LLVM/Offload API. On the Clang side, this is simply done
by using the OpenMP offload toolchain and emitting calls to `llvm*`
functions to orchestrate the kernel launch rather than `cuda*`
functions. These `llvm*` functions are implemented on top of the
existing LLVM/Offload API.
As we are about to redefine the Offload API, this will help us in the
design process as a second offload language.
We do not support any CUDA APIs yet, however, we could:
https://www.osti.gov/servlets/purl/1892137
For proper host execution we need to resurrect/rebase
https://tianshilei.me/wp-content/uploads/2021/12/llpp-2021.pdf
(which was designed for debugging).
```
❯❯❯ cat test.cu
extern "C" {
  void *llvm_omp_target_alloc_shared(size_t Size, int DeviceNum);
  void llvm_omp_target_free_shared(void *DevicePtr, int DeviceNum);
}
__global__ void square(int *A) { *A = 42; }
int main(int argc, char **argv) {
  int DevNo = 0;
  int *Ptr = reinterpret_cast<int *>(llvm_omp_target_alloc_shared(4, DevNo));
  *Ptr = 7;
  printf("Ptr %p, *Ptr %i\n", Ptr, *Ptr);
  square<<<1, 1>>>(Ptr);
  printf("Ptr %p, *Ptr %i\n", Ptr, *Ptr);
  llvm_omp_target_free_shared(Ptr, DevNo);
}
❯❯❯ clang++ test.cu -O3 -o test123 -foffload-via-llvm --offload-arch=native
❯❯❯ llvm-objdump --offloading test123
test123: file format elf64-x86-64
OFFLOADING IMAGE [0]:
kind elf
arch gfx90a
triple amdgcn-amd-amdhsa
producer openmp
❯❯❯ LIBOMPTARGET_INFO=16 ./test123
Ptr 0x155448ac8000, *Ptr 7
Ptr 0x155448ac8000, *Ptr 42
```
The kernel names for OpenMP are manually mangled and not ideal when we
report something to the user. We demangle them now, providing the
function and line number of the target region, together with the actual
kernel name.
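For illustration, a minimal sketch of splitting such a name into function and line. It assumes the usual `__omp_offloading_<device-id>_<file-id>_<function>_l<line>` layout; the helper name `parseKernelName` is hypothetical, and the real implementation may additionally demangle a C++ function name.

```cpp
#include <cstddef>
#include <optional>
#include <string>

struct TargetRegionName {
  std::string Function; // enclosing function (possibly still C++-mangled)
  unsigned Line;        // source line of the target region
};

inline std::optional<TargetRegionName> parseKernelName(const std::string &Name) {
  const std::string Prefix = "__omp_offloading_";
  if (Name.compare(0, Prefix.size(), Prefix) != 0)
    return std::nullopt;

  // Skip the two hexadecimal id fields that follow the prefix.
  std::size_t Pos = Prefix.size();
  for (int I = 0; I < 2; ++I) {
    Pos = Name.find('_', Pos);
    if (Pos == std::string::npos)
      return std::nullopt;
    ++Pos;
  }

  // The line number is the trailing "_l<digits>" suffix.
  std::size_t LinePos = Name.rfind("_l");
  if (LinePos == std::string::npos || LinePos < Pos)
    return std::nullopt;
  std::string Digits = Name.substr(LinePos + 2);
  if (Digits.empty() ||
      Digits.find_first_not_of("0123456789") != std::string::npos)
    return std::nullopt;

  return TargetRegionName{Name.substr(Pos, LinePos - Pos),
                          static_cast<unsigned>(std::stoul(Digits))};
}
```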
Similar to (de)allocation traces, we can record kernel launch stack
traces and display them in case of an error. However, the AMD GPU plugin
signal handler, which is invoked on memory faults, cannot pinpoint the
offending kernel. Instead, we print `<NUM>` traces, where `<NUM>` is
set via `OFFLOAD_TRACK_NUM_KERNEL_LAUNCH_TRACES=<NUM>`. The recording
uses a ring buffer of fixed size (for now 8).
For `trap` errors, we print the actual kernel name, and trace if
recorded.
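A fixed-size ring buffer of launch traces could look roughly like the sketch below. The class name and interface are hypothetical; only the fixed capacity of 8 is taken from the description above.

```cpp
#include <array>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

template <std::size_t N = 8> class LaunchTraceRing {
  std::array<std::string, N> Traces;
  std::size_t NumRecorded = 0;

public:
  // Overwrites the oldest entry once the buffer is full.
  void record(std::string Trace) {
    Traces[NumRecorded++ % N] = std::move(Trace);
  }

  // Returns at most Max traces, most recent first.
  std::vector<std::string> latest(std::size_t Max) const {
    std::vector<std::string> Out;
    std::size_t Available = NumRecorded < N ? NumRecorded : N;
    for (std::size_t I = 0; I < Available && I < Max; ++I)
      Out.push_back(Traces[(NumRecorded - 1 - I) % N]);
    return Out;
  }
};
```

Since the fault handler cannot tell which kernel faulted, printing the most recent entries is a heuristic: the offending launch is likely, but not guaranteed, to be among them.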
As a first step towards a GPU sanitizer we now can track allocations and
deallocations in order to report double frees, and other problems during
deallocation.
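A minimal sketch of the bookkeeping needed to flag a double free, with hypothetical names (`DeallocationChecker`, `FreeStatus`); the real tracker also stores stack traces, which are omitted here.

```cpp
#include <unordered_map>

enum class FreeStatus { Ok, DoubleFree, UnknownPointer };

class DeallocationChecker {
  // true while the pointer is live, false once it has been freed.
  std::unordered_map<const void *, bool> IsLive;

public:
  void onAlloc(const void *Ptr) { IsLive[Ptr] = true; }

  FreeStatus onFree(const void *Ptr) {
    auto It = IsLive.find(Ptr);
    if (It == IsLive.end())
      return FreeStatus::UnknownPointer; // never allocated through us
    if (!It->second)
      return FreeStatus::DoubleFree;     // already freed once
    It->second = false;
    return FreeStatus::Ok;
  }
};
```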
This pull request is the first part of an ongoing effort to extend PGO
instrumentation to GPU device code. This PR makes the following changes:
- Adds blank registration functions to device RTL
- Gives PGO globals protected visibility when targeting a supported GPU
- Handles any addrspace casts for PGO calls
- Implements PGO global extraction in GPU plugins (currently only dumps
info)
These changes can be tested by supplying `-fprofile-instrument=clang`
while targeting a GPU.
Summary:
The HSA headers existed previously in `include/hsa.h` and were moved to
`include/hsa/hsa.h` in a later ROCm version. The include headers here
were originally designed to favor a newer one. However, this
unintentionally prevented the dynamic HSA's `hsa.h` from being used if
both were present. This patch changes the order so it will be found
first.
Related to https://github.com/llvm/llvm-project/pull/95484.
Sometimes it might be beneficial to spawn more thread blocks instead of
reusing existing ones for multiple loop iterations.
**Alternatives considered:**
Make `DefaultNumBlocks` settable via an environment variable.
---------
Co-authored-by: Joseph Huber <huberjn@outlook.com>
We already used a flat array of kernel launch parameters for the AMD GPU
launch but now we also use this scheme for the NVIDIA GPU launch. The
only remaining/required use of the indirection is the host plugin (due
to ffi). This allows us to simplify the use for non-OpenMP kernel
launches.
COV3 is not supported anymore, thus we can just use ArgsSize we read
from the kernel to determine how many argument bytes we need and if
implicit kernel arguments are used.
Summary:
The old COV3 implementation of HSA used to omit the implicit arguments
from the kernel argument size. For COV4 and COV5 this is no longer the
case so we can simply use the size reported from the symbol information.
See
https://github.com/ROCm/ROCR-Runtime/issues/117#issuecomment-812758161
Summary:
Currently, we register images into a linear table according to the
logical OpenMP device identifier. We then initialize all of these images
as one block. This logic requires that images are compatible with *all*
devices instead of just the one that it can run on. This prevents us
from running on systems with heterogeneous devices (i.e. image 1 runs on
device 0 and image 0 runs on device 1).
This patch reworks the logic by instead making the compatibility check a
per-device query. We then scan every device to see if it's compatible
and do it as they come.
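The per-device query can be sketched as follows. The types and the string-based architecture comparison are hypothetical simplifications; the real check inspects the image itself.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct DeviceDesc {
  int Id;
  std::string Arch;
};
struct ImageDesc {
  std::string Arch;
};

// Hypothetical per-device compatibility query.
inline bool isImageCompatible(const DeviceDesc &D, const ImageDesc &I) {
  return D.Arch == I.Arch;
}

// Scan every device and collect the indices of the images it can run.
inline std::map<int, std::vector<std::size_t>>
matchImages(const std::vector<DeviceDesc> &Devices,
            const std::vector<ImageDesc> &Images) {
  std::map<int, std::vector<std::size_t>> Result;
  for (const DeviceDesc &D : Devices)
    for (std::size_t I = 0; I < Images.size(); ++I)
      if (isImageCompatible(D, Images[I]))
        Result[D.Id].push_back(I);
  return Result;
}
```

With devices `{0: gfx90a, 1: sm_90}` and images `{0: sm_90, 1: gfx90a}`, device 0 is matched with image 1 and device 1 with image 0, which is the heterogeneous case the linear table could not express.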
Summary:
The logic since the next-gen plugins were added was that every single
agent would get access to a memory pool we allocated. This is necessary
for things like fine-grained memory and to facilitate d2d copies.
However, there are cases where an agent cannot legally access a memory
pool. We have a debug check for this, but it would always be triggered
in these situations because both uses of the function simply passed
every agent. This patch changes the behavior by only enabling memory
pool access for agents that can access the memory pool.
Summary:
Initializing the plugins requires initializing the runtime like CUDA or
HSA. This has a considerable overhead on most platforms, so we should
only actually initialize a plugin if it is needed by any image that is
loaded.
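The idea can be sketched like this, under the assumption that compatibility can be decided cheaply before the vendor runtime is brought up. `LazyPlugin` and `initNeededPlugins` are hypothetical names.

```cpp
#include <string>
#include <vector>

struct LazyPlugin {
  std::string Arch;
  bool Initialized = false;

  // Cheap check that does not touch the vendor runtime.
  bool wants(const std::string &ImageArch) const { return ImageArch == Arch; }

  // Stands in for a costly call such as `cuInit` or `hsa_init`.
  void init() { Initialized = true; }
};

// Initialize only the plugins that at least one loaded image needs.
// Returns the number of plugins that were initialized by this call.
inline int initNeededPlugins(std::vector<LazyPlugin> &Plugins,
                             const std::vector<std::string> &ImageArchs) {
  int NumInitialized = 0;
  for (LazyPlugin &P : Plugins) {
    if (P.Initialized)
      continue;
    for (const std::string &Arch : ImageArchs) {
      if (P.wants(Arch)) {
        P.init();
        ++NumInitialized;
        break;
      }
    }
  }
  return NumInitialized;
}
```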
Summary:
Certain plugins can only be built on specific platforms. Previously this
didn't cause issues because each one was handled independently. However,
now that we link these all directly they need to be in a CMake list.
Furthermore we use this list to generate a config file. For this reason
these checks are moved to where we normalize the support.
Fixes: https://github.com/llvm/llvm-project/issues/93183
Summary:
We previously had multiple options for this, this patch replaces them
with `LIBOMPTARGET_DLOPEN_PLUGINS=` to be a list of plugins to
dynamically use. It defaults to everything right now. This ignores the
`host` plugin because the `libffi` dependency is going to be removed
soon hopefully in https://github.com/llvm/llvm-project/pull/91264.
Summary:
CUDA does its versioning by putting a redirection in the header so the
API functions remain the same while the symbol changes. These weren't
being used for some functions that required it in the dynamic cuda
version.
These functions have newer versions that should be used. These are
fairly old as far as I'm aware so we should be able to sweep backward
compatibility under the rug.
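The redirection mechanism can be illustrated with a small stand-alone sketch. The `hypoMemAlloc` functions are hypothetical stand-ins; the real header uses the same pattern, e.g. `#define cuMemAlloc cuMemAlloc_v2`.

```cpp
#include <string>

// Two hypothetical ABI versions of the same entry point.
inline int hypoMemAlloc() { return 1; }
inline int hypoMemAlloc_v2() { return 2; }

// Header-style redirection: callers keep writing `hypoMemAlloc`, but the
// symbol that is actually referenced is the versioned one.
#define hypoMemAlloc hypoMemAlloc_v2

#define STRINGIFY_IMPL(X) #X
#define STRINGIFY(X) STRINGIFY_IMPL(X)

// Resolves to hypoMemAlloc_v2 because of the redirection above.
inline int callThroughHeader() { return hypoMemAlloc(); }

// A dynamic loader has to look up the redirected name as well; expanding
// the macro before stringifying yields "hypoMemAlloc_v2".
inline std::string symbolToLookUp() { return STRINGIFY(hypoMemAlloc); }
```

If a wrapper header omits the redirection for a function, the unversioned (older) symbol is looked up instead, which is the kind of mismatch this patch addresses.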
Since #87009, libomptarget directly links all the plugins statically.
All the dependencies of plugins got exposed to libomptarget. The CUDA
plugin depends on libcuda and the amdgpu plugin depends on libhsa if not
forced using dlopen. On a cluster with different compute node
architectures, libomptarget can be built and run on different nodes. In
the build stage, if cmake finds libcuda and
`LIBOMPTARGET_FORCE_DLOPEN_LIBCUDA=OFF`, libomptarget links libcuda.so
directly. The resulting libomptarget may then fail to run on a node
without an NVIDIA driver, for example a CPU-only or AMD-GPU-only machine,
with a complaint that libcuda.so was not found.
The solution is setting `LIBOMPTARGET_FORCE_DLOPEN_LIBCUDA` and
`LIBOMPTARGET_FORCE_DLOPEN_LIBHSA` to `ON`. Preferably this should be the
default to maximize the usability of libomptarget. If cmake detects
NVIDIA or AMD software on a node that builds OS images, the resulting
libomptarget may not function on the user side because it requires the
vendor runtime libraries to be present.
Summary:
This isn't `libomptarget` anymore, and these messages were always
unnecessary because no other project uses these prefixed messages. The
effect of this is that no longer will the logs have `LIBOMPTARGET --` in
front of everything. We have a message stating when we start building
the offload project so it'll still be trivial to find.
Summary:
No other project has these in the CMake itself, and they're wildly
inconsistent even within the project. These don't really add anything so
I think they should be removed.
Summary:
Previously, the R&R support was global state initialized by a global
constructor. This is bad because it prevents us from adequately
constraining the lifetime of the library. Additionally, we want to
minimize the amount of global state floating around.
This patch moves the R&R support into a plugin member like everything
else. This means there will be multiple copies of the R&R implementation
floating around, but this was already the case given the fact that we
currently handle everything with dynamic libraries.
Summary:
Currently this is only used for the zero-copy handling. However, this
can easily be moved into `libomptarget` so that we do not need to bother
setting the requires flags in the plugin. The advantage here is that we
no longer need to do this for every device redundantly. Additionally,
these requires flags are specifically OpenMP related, so they should
live in `libomptarget`.
Summary:
The offload library supports basic JIT functionality, however we
currently link against every single target even though only AMDGPU and
NVPTX are supported. This somewhat bloats the dynamic library list, so
we should constrain it to what's actually used.
Summary:
Since the move to the statically linked plugins, we added a new way to
directly control which plugins will be added. Delete these old ones as
they will cause the build to fail and suggest the new format.
This patch overhauls the `libomptarget` and plugin interface. Currently,
we define a C API and compile each plugin as a separate shared library.
Then, `libomptarget` loads these API functions and forwards its internal
calls to them. This was originally designed to allow multiple
implementations of a library to be live. However, since then no one has
used this functionality and it prevents us from using much nicer
interfaces. If the old behavior is desired it should instead be
implemented as a separate plugin.
This patch replaces the `PluginAdaptorTy` interface with the
`GenericPluginTy` that is used by the plugins. Each plugin exports a
`createPlugin_<name>` function that is used to get the specific
implementation. This code is now shared with `libomptarget`.
There are some notable improvements to this.
1. Massively improved lifetimes of live runtime objects
2. The plugins can use a C++ interface
3. Global state does not need to be duplicated for each plugin +
libomptarget
4. Easier to use and add features and improve error handling
5. Less function call overhead / Improved LTO performance.
Additional changes in this plugin are related to contending with the
fact that state is now shared. Initialization and deinitialization is
now handled correctly and in phase with the underlying runtime, allowing
us to actually know when something is getting deallocated.
Depends on:
https://github.com/llvm/llvm-project/pull/86971
https://github.com/llvm/llvm-project/pull/86875
https://github.com/llvm/llvm-project/pull/86868
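A simplified sketch of the factory pattern described above. `PluginInterface` and `HypotheticalCudaPlugin` are stand-ins invented for this example (the real `GenericPluginTy` has a much larger interface); only the `createPlugin_<name>` naming is taken from this patch.

```cpp
#include <memory>
#include <string>

// Simplified stand-in for the shared C++ plugin interface.
struct PluginInterface {
  virtual ~PluginInterface() = default;
  virtual std::string getName() const = 0;
  virtual int getNumDevices() const = 0;
};

struct HypotheticalCudaPlugin final : PluginInterface {
  std::string getName() const override { return "cuda"; }
  int getNumDevices() const override { return 0; } // no driver is queried here
};

// One factory per plugin, following the `createPlugin_<name>` pattern.
extern "C" PluginInterface *createPlugin_cuda() {
  return new HypotheticalCudaPlugin();
}

// The caller owns the plugin object, so its lifetime is explicit.
inline std::unique_ptr<PluginInterface> loadCudaPlugin() {
  return std::unique_ptr<PluginInterface>(createPlugin_cuda());
}
```

Because the caller holds the object directly, teardown order is decided in one place rather than by library unload order.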
Summary:
This gets the target's corresponding ELF value from the preprocessor.
We use this to detect if a given ELF is compatible with the CPU
offloading implementation for OpenMP. Previously we used definitions from
CMake, but this is easier for people to understand as there may be new
users of this in the future.
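A minimal sketch of deriving the ELF machine value from the preprocessor and comparing it against an image header. The function names are hypothetical and the set of architectures is only an example; the numeric values are the standard `e_machine` constants.

```cpp
#include <cstdint>

// Map the compiler's predefined architecture macros to ELF e_machine values.
constexpr std::uint16_t hostElfMachine() {
#if defined(__x86_64__) || defined(_M_X64)
  return 62; // EM_X86_64
#elif defined(__aarch64__)
  return 183; // EM_AARCH64
#elif defined(__powerpc64__)
  return 21; // EM_PPC64
#elif defined(__riscv)
  return 243; // EM_RISCV
#else
  return 0; // EM_NONE: unknown host
#endif
}

// e_machine is a 16-bit field at byte offset 18 of the ELF header.
// Byte 5 (EI_DATA) is 2 for big-endian files and 1 for little-endian ones.
inline std::uint16_t readElfMachine(const unsigned char *Header) {
  bool BigEndian = Header[5] == 2;
  unsigned Lo = BigEndian ? Header[19] : Header[18];
  unsigned Hi = BigEndian ? Header[18] : Header[19];
  return static_cast<std::uint16_t>((Hi << 8) | Lo);
}

inline bool isHostCompatible(const unsigned char *Header) {
  return readElfMachine(Header) == hostElfMachine();
}
```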
Summary:
The `GenericDeviceTy::dataDelete` method doesn't verify the
`TargetAllocTy` of the device pointer. Because of this, it can
use the `MemoryManager` to free the ptr. However, the
`TARGET_ALLOC_HOST` and `TARGET_ALLOC_SHARED` types are not allocated
using the `MemoryManager` in the `GenericDeviceTy::dataAlloc` method.
Since the `MemoryManager` uses the `DeviceAllocatorTy::free` operation
without specifying the type of the ptr, some plugins may use incorrect
operations to free ptrs of certain types. In particular, this bug causes
the CUDA plugin to use the `cuMemFree` operation on ptrs of type
`TARGET_ALLOC_HOST`, resulting in an unchecked error, as shown in the
output snippet of the test
`offload/test/api/omp_host_pinned_memory_alloc.c`:
```
omptarget --> Notifying about an unmapping: HstPtr=0x00007c6114200000
omptarget --> Call to llvm_omp_target_free_host for device 0 and address 0x00007c6114200000
omptarget --> Call to omp_get_num_devices returning 1
omptarget --> Call to omp_get_initial_device returning 1
PluginInterface --> MemoryManagerTy::free: target memory 0x00007c6114200000.
PluginInterface --> Cannot find its node. Delete it on device directly.
TARGET CUDA RTL --> Failure to free memory: Error in cuMemFree[Host]: invalid argument
omptarget --> omp_target_free deallocated device ptr
```
This patch fixes this by adding the check of the device pointer type
before calling the appropriate operation for each type.
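The dispatch can be sketched as below. `AllocKind` and `HypotheticalDevice` are illustrative stand-ins for `TargetAllocTy` and `GenericDeviceTy`, and the string results merely label which path would be taken.

```cpp
#include <string>
#include <unordered_set>

enum class AllocKind { Device, Host, Shared };

struct HypotheticalDevice {
  // Pointers that were handed out by the memory manager.
  std::unordered_set<void *> ManagedPtrs;
  std::string LastFreePath;

  void dataDelete(void *Ptr, AllocKind Kind) {
    switch (Kind) {
    case AllocKind::Device:
      // Only device allocations may have come from the memory manager.
      LastFreePath = ManagedPtrs.erase(Ptr) ? "memory-manager" : "device-free";
      break;
    case AllocKind::Host:
      // Pinned host memory bypasses the memory manager entirely.
      LastFreePath = "host-free";
      break;
    case AllocKind::Shared:
      LastFreePath = "shared-free";
      break;
    }
  }
};
```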
Summary:
This patch removes the special-case handling for the target triple
inside of the CMake. I moved it into the implementation so it's easier
to see and modify.
Summary:
Previously we would build all of the plugins by default and then only
load some using the `LIBOMPTARGET_PLUGINS_TO_LOAD` variable. This patch
renames this to `LIBOMPTARGET_PLUGINS_TO_BUILD` and changes whether or
not it will include the plugin in CMake.
Additionally this patch creates a new `Targets.def` file that allows us
to enumerate all of the enabled plugins. This is somewhat different from
the old method, and it's done this way for future use that will need to
be shared. This follows the same method that LLVM uses for its targets,
however it does require adding an extra include path.
Depends on https://github.com/llvm/llvm-project/pull/86868
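The `.def`-style enumeration follows the X-macro technique, sketched here in a single file. The macro names and the plugin list are hypothetical; in the real build the list comes from a generated `Targets.def` that is included rather than written inline.

```cpp
#include <cstddef>
#include <string_view>

// Stand-in for the generated Targets.def: one entry per enabled plugin.
#define HYPOTHETICAL_PLUGIN_TARGETS(X) X(amdgpu) X(cuda) X(host)

// Expand the list once into a table of plugin names.
#define AS_STRING(Name) std::string_view(#Name),
constexpr std::string_view EnabledPlugins[] = {
    HYPOTHETICAL_PLUGIN_TARGETS(AS_STRING)};
#undef AS_STRING

constexpr std::size_t NumEnabledPlugins =
    sizeof(EnabledPlugins) / sizeof(EnabledPlugins[0]);
```

The same list can be expanded again with a different macro, for example to declare one factory function per plugin, which keeps all users in sync with the CMake configuration.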
Summary:
All of these are functionally the same code, just compiled for separate
architectures. We currently do not expose a way to execute these on
separate architectures as the host plugin works using `dlopen` into the
same process, and therefore cannot possibly be an incompatible
architecture. (This could work with a remote plugin, but this is not
supported yet).
This patch simply renames all of these to the same thing so we no longer
need to check around for its varying definitions.
In a nutshell, this moves our libomptarget code to populate the offload
subproject.
With this commit, users need to enable the new LLVM/Offload subproject
as a runtime in their cmake configuration.
No further changes are expected for downstream code.
Tests and other components still depend on OpenMP and have also not been
renamed. The results below are for a build in which OpenMP and Offload
are enabled runtimes. In addition to the pure `git mv`, we needed to
adjust some CMake files. Nothing is intended to change semantics.
```
ninja check-offload
```
Works with the X86 and AMDGPU offload tests
```
ninja check-openmp
```
Still works but doesn't build offload tests anymore.
```
ls install/lib
```
Shows all expected libraries, incl.
- `libomptarget.devicertl.a`
- `libomptarget-nvptx-sm_90.bc`
- `libomptarget.rtl.amdgpu.so` -> `libomptarget.rtl.amdgpu.so.18git`
- `libomptarget.so` -> `libomptarget.so.18git`
Fixes: https://github.com/llvm/llvm-project/issues/75124
---------
Co-authored-by: Saiyedul Islam <Saiyedul.Islam@amd.com>