Summary:
This patch is mostly NFC and fixes some warnings currently present in the OpenMP
offloading plugins. Specifically, it mostly removes the use of Twine
variables in favor of LLVM's SmallString. Twine variables are prone to
use-after-free, and a SmallString is a cleaner way to concatenate strings.
The next-gen plugins did not properly set the values from
`OMP_NUM_TEAMS` and `OMP_TEAMS_THREAD_LIMIT`. This is because these
maximum values are set by each plugin to its hardware maximum. This
happens *after* the previous initialization. Move it to the correct
place and then add a test.
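The ordering problem can be illustrated with a minimal sketch. The names `DeviceLimits`, `parsePositive`, and `applyUserLimits` are hypothetical and not taken from the plugin sources, and clamping with `std::min` is an assumption about how the user value and the hardware maximum combine:

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstdint>
#include <cstdlib>
#include <optional>

// Hypothetical per-device limits; not the plugin's real data structure.
struct DeviceLimits {
  uint32_t MaxNumTeams = 0;
  uint32_t MaxThreadsPerTeam = 0;
};

// Parse a positive decimal integer from an environment-style string.
// Returns nullopt for unset, empty, malformed, zero, or too-large values.
std::optional<uint32_t> parsePositive(const char *Value) {
  if (!Value || !std::isdigit(static_cast<unsigned char>(*Value)))
    return std::nullopt;
  char *End = nullptr;
  unsigned long long Parsed = std::strtoull(Value, &End, 10);
  if (*End != '\0' || Parsed == 0 || Parsed > UINT32_MAX)
    return std::nullopt;
  return static_cast<uint32_t>(Parsed);
}

// Must run AFTER the plugin has filled in the hardware maximums; running it
// before would let the plugin overwrite the user's request.
// Assumption: the user value can only lower the hardware maximum.
DeviceLimits applyUserLimits(DeviceLimits Hardware, const char *NumTeamsEnv,
                             const char *ThreadLimitEnv) {
  assert(Hardware.MaxNumTeams > 0 && Hardware.MaxThreadsPerTeam > 0 &&
         "hardware limits must be initialized first");
  DeviceLimits Result = Hardware;
  if (auto Teams = parsePositive(NumTeamsEnv))
    Result.MaxNumTeams = std::min(Hardware.MaxNumTeams, *Teams);
  if (auto Threads = parsePositive(ThreadLimitEnv))
    Result.MaxThreadsPerTeam = std::min(Hardware.MaxThreadsPerTeam, *Threads);
  return Result;
}
```

A caller would pass `std::getenv("OMP_NUM_TEAMS")` and `std::getenv("OMP_TEAMS_THREAD_LIMIT")` once the device-specific initialization has finished.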
Fixes https://github.com/llvm/llvm-project/issues/61082
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D145105
Makes the info that is printed for kernel launches configurable for
different plugins. Adds all machinery to print the detailed launch
info that the current AMD plugin provides and includes e.g. register
spill counts.
The files msgpack.cpp, msgpack.def, and msgpack.h are copied from the old plugin
and are untouched. The contents of UtilitiesHSA.cpp and .h are copied together from
various files from the old plugin. The code was originally written by
Jon Chesterfield. I updated the function and type names visible to the outside, i.e.
in headers, to respect the LLVM conventions.
Reviewed By: jhuber6
Differential Revision: https://reviews.llvm.org/D144521
This interface function does not actually need the device image type.
It's unused in the function, so it should be able to be safely removed.
The motivation for this is to facilitate downstream porting of the
amd-stg-open RPC module into the nextgen plugin so we can delete the old
plugin entirely. For that to work we need to be able to call this
function at kernel-launch time, which doesn't have the image. Also it's
cleaner.
Reviewed By: jplehr
Differential Revision: https://reviews.llvm.org/D144436
With the NPM, we're now defaulting to preserving LCSSA, so a couple
of tests have changed slightly.
Differential Revision: https://reviews.llvm.org/D140982
The NextGen plugins use the information regarding new mapping/unmappings to
lock/unlock the corresponding host buffer and speed up the host-device memory
transfers involving those buffers. The locking/unlocking is disabled by default
and can be enabled by the LIBOMPTARGET_LOCK_MAPPED_HOST_BUFFERS envar. The
envar accepts boolean values (on/off) and a special option:
- off: Do not lock mapped host buffers (default).
- on: Lock mapped host buffers automatically, but do not report lock
failures if the plugin fails to lock them.
- mandatory: Lock mapped host buffers automatically and treat locking failures
in the plugins as fatal errors. This option may be useful for
debugging purposes.
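The three settings could be modeled roughly as follows. This is a sketch only: `LockMode`, `parseLockMode`, and `isLockFailureFatal` are hypothetical names, and only the spellings listed above are recognized (the real boolean parsing may accept more):

```cpp
#include <optional>
#include <string>

// Hypothetical representation of the three documented settings.
enum class LockMode { Off, On, Mandatory };

// Map the envar text to a mode. An unset variable yields the default (Off);
// an unrecognized value yields nullopt so the caller can report it.
std::optional<LockMode> parseLockMode(const char *Value) {
  if (!Value)
    return LockMode::Off;
  const std::string Text(Value);
  if (Text == "off")
    return LockMode::Off;
  if (Text == "on")
    return LockMode::On;
  if (Text == "mandatory")
    return LockMode::Mandatory;
  return std::nullopt;
}

// Only the "mandatory" setting turns a failed lock into a fatal error.
bool isLockFailureFatal(LockMode Mode) { return Mode == LockMode::Mandatory; }
```

A caller would feed it `std::getenv("LIBOMPTARGET_LOCK_MAPPED_HOST_BUFFERS")`.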
Differential Revision: https://reviews.llvm.org/D142514
This patch implements the memory lock/unlock API, introduced in patch https://reviews.llvm.org/D139208,
in the NextGen plugins. Locked buffers feature reference counting and we allow certain overlapping. Given
an already locked buffer A, other buffers that are fully contained inside A can be locked again, even if
they are smaller than A. In this case, the reference count of locked buffer A will be incremented. However,
extending an existing locked buffer is not allowed. The original buffer is actually unlocked once all its
users have released the locked buffer and sub-buffers (i.e., the reference counter becomes zero).
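The reference-counting rules described above can be sketched as follows. `LockedBufferTracker` and its methods are hypothetical names for illustration; the sketch only tracks address ranges and does not call any real pinning API:

```cpp
#include <cstddef>
#include <cstdint>
#include <map>

// Hypothetical tracker for locked host buffers, keyed by base address.
class LockedBufferTracker {
  struct Entry {
    size_t Size;
    size_t RefCount;
  };
  std::map<uintptr_t, Entry> Entries;

  // Find the entry whose range [Base, Base + Size) contains Addr, if any.
  std::map<uintptr_t, Entry>::iterator findContaining(uintptr_t Addr) {
    auto It = Entries.upper_bound(Addr);
    if (It == Entries.begin())
      return Entries.end();
    --It;
    return Addr < It->first + It->second.Size ? It : Entries.end();
  }

public:
  // Returns false if the request would extend or partially overlap an
  // existing locked buffer; fully contained requests bump the refcount.
  bool lock(uintptr_t Addr, size_t Size) {
    auto It = findContaining(Addr);
    if (It != Entries.end()) {
      if (Addr + Size > It->first + It->second.Size)
        return false; // Extending a locked buffer is not allowed.
      ++It->second.RefCount;
      return true;
    }
    // A new buffer must not run into the next locked buffer either.
    auto Next = Entries.lower_bound(Addr);
    if (Next != Entries.end() && Addr + Size > Next->first)
      return false;
    Entries.emplace(Addr, Entry{Size, 1});
    return true;
  }

  // Returns true only when the last user released the buffer, i.e. when the
  // original buffer would actually be unlocked.
  bool unlock(uintptr_t Addr) {
    auto It = findContaining(Addr);
    if (It == Entries.end())
      return false;
    if (--It->second.RefCount > 0)
      return false;
    Entries.erase(It);
    return true;
  }

  // Current reference count of the buffer containing Addr (0 if none).
  size_t refCount(uintptr_t Addr) {
    auto It = findContaining(Addr);
    return It == Entries.end() ? 0 : It->second.RefCount;
  }
};
```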
Differential Revision: https://reviews.llvm.org/D141227
Add free functions llvm::CodeGenOpt::{getLevel,getID,parseLevel} to
provide common implementations for functionality that has been
duplicated in many places across the codebase.
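As a rough, self-contained illustration of what such helpers look like, the sketch below mimics the three functions in a hypothetical `sketch` namespace. The signatures and the enum values are assumptions, not copied from the LLVM headers:

```cpp
#include <optional>

namespace sketch {
// Assumed optimization levels, numbered like -O0 through -O3.
enum class Level { None = 0, Less = 1, Default = 2, Aggressive = 3 };
using IDType = int;

// Level -> numeric ID.
inline IDType getID(Level L) { return static_cast<IDType>(L); }

// Numeric ID -> Level, or nullopt if the ID is out of range.
inline std::optional<Level> getLevel(IDType ID) {
  if (ID < 0 || ID > 3)
    return std::nullopt;
  return static_cast<Level>(ID);
}

// Character such as '2' -> Level, or nullopt if it is not '0'..'3'.
inline std::optional<Level> parseLevel(char C) {
  if (C < '0')
    return std::nullopt;
  return getLevel(static_cast<IDType>(C - '0'));
}
} // namespace sketch
```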
Differential Revision: https://reviews.llvm.org/D141968
Dynamic memory allows users to allocate fast shared memory when a kernel
is launched. We support a single size for all kernels via the
`LIBOMPTARGET_SHARED_MEMORY_SIZE` environment variable but now we can
control it per kernel invocation, hence allow computed values.
Note: Only the nextgen plugins will allocate memory based on the clause,
the old plugins will silently miscompile.
Differential Revision: https://reviews.llvm.org/D141233
We already created a versioned `__tgt_kernel_arguments` struct but it
was only briefly used and its content was passed in isolation anyway.
This makes it hard to add more information in the future. With this
patch we fully embrace the struct as means to pass information from the
compiler to the plugin as part of a kernel launch.
The patch also extends and renames the struct, bumping the version
number to 2. Version 1 entries are auto-upgraded. This is in preparation
for "bare" kernel launches, per kernel dynamic shared memory, CUDA/HIP
lowering, etc.
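The auto-upgrade idea can be sketched like this. The struct layouts and the name `upgradeKernelArgs` are hypothetical and heavily simplified; the real struct carries many more fields:

```cpp
#include <cstdint>
#include <cstring>
#include <optional>

// Hypothetical version 1 layout (illustrative fields only).
struct KernelArgsV1 {
  uint32_t Version; // Always 1.
  uint32_t NumArgs;
};

// Hypothetical version 2 layout: same prefix plus one new field.
struct KernelArgsV2 {
  uint32_t Version; // Always 2.
  uint32_t NumArgs;
  uint64_t DynSharedMemBytes; // Assumed new field, defaulted on upgrade.
};

// Inspect the leading version field and return a version 2 struct.
// Version 1 input is upgraded with defaults; unknown versions are rejected.
std::optional<KernelArgsV2> upgradeKernelArgs(const void *Raw) {
  uint32_t Version = 0;
  std::memcpy(&Version, Raw, sizeof(Version));
  if (Version == 2) {
    KernelArgsV2 Args;
    std::memcpy(&Args, Raw, sizeof(Args));
    return Args;
  }
  if (Version == 1) {
    KernelArgsV1 Old;
    std::memcpy(&Old, Raw, sizeof(Old));
    return KernelArgsV2{2, Old.NumArgs, 0};
  }
  return std::nullopt;
}
```

Keeping the version as the first field is what lets the runtime decide how to interpret the rest of the memory.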
The `__tgt_target_kernel_nowait` interface was deprecated as it was
unused. Once we actually implement support for something like that, we
can add an appropriate API.
Note: Only plugins with the `launch_kernel` interface are now supported.
That means that a new clang won't be able to use an old runtime.
An old clang can still use the new runtime since the libomptarget
interface did not change.
Differential Revision: https://reviews.llvm.org/D141232
This patch enables storing bitcode images when JIT is enabled, for use by the record-and-replay functionality (see https://reviews.llvm.org/D138931). Credits to @jdoerfert for refactoring the code.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D141986
This patch adds functionality for recording and replaying the execution of OpenMP offload kernels, based on an original implementation by Steve Rangel. The patch extends libomptarget to extract a JSON description of the kernel, the device image binary, and a device memory snapshot before and after the execution of a recorded kernel. Kernel recording/replaying in libomptarget is controlled through environment variables (LIBOMPTARGET_RECORD, LIBOMPTARGET_REPLAY). The patch also provides a tool, llvm-omp-kernel-replay, that replays a kernel from the extracted information, can verify the replayed execution against the post-execution device memory snapshot, and supports changing the number of teams/threads for the replay.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D138931
This variable is used by the runtime. Before kernel launch we set it to
indicate several configuration options from the host. This patch renames
it to be more in line with the rest of the names exported from the
runtime. This is better because it is the only symbol visible to the
host from the runtime, so it should have a reserved name.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D141960
The JIT is a great debugging tool since we can modify the IR manually
before launching it in an existing test case. The new flags allow us to
skip optimizations, to use the exact given IR, as well as to provide a
finished object file. The latter is useful to try out different backend
options and to have complete freedom with pass pipelines.
Documentation is included. Minimal refactoring was performed to make the
second object fit in nicely.
The JIT interface was somewhat irregular as it used multiple global
functions. It also did not cache the results of the JIT, hence multiple
GPU systems would perform the work multiple times. Finally, there might
have been races on the state if we have multi-threaded initialization of
different embedded images, or one image initialized on multiple devices.
This patch tries to rectify all of the above. The JITEngine is now a
part of the GenericPluginTy and tied to one target triple. To support
multiple "ComputeUnitKind"s (previously confusingly called Arch or
[M]CPU) and to avoid re-jitting for the same ComputeUnitKind, we keep a
map of JIT results per ComputeUnitKind. All interaction with the JIT
happens through the JITEngine directly, two functions are exposed. Both
use (shared) locks to avoid races and cache the result. All JIT-related
environment variables are now defined together.
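The caching scheme can be sketched with a shared mutex and a map keyed by ComputeUnitKind. `JITCacheSketch` and `getOrCompile` are hypothetical names, and strings stand in for the compiled images; the real JITEngine interface differs:

```cpp
#include <functional>
#include <map>
#include <mutex>
#include <shared_mutex>
#include <string>

// Hypothetical per-triple cache of JIT results, one entry per
// ComputeUnitKind (e.g. "sm_70"). Strings stand in for compiled images.
class JITCacheSketch {
  std::shared_mutex Mutex;
  std::map<std::string, std::string> Results;

public:
  // Return the cached image for ComputeUnitKind, compiling it at most once.
  const std::string &
  getOrCompile(const std::string &ComputeUnitKind,
               const std::function<std::string(const std::string &)> &Compile) {
    {
      // Fast path: many devices may look up an existing result concurrently.
      std::shared_lock<std::shared_mutex> ReadLock(Mutex);
      auto It = Results.find(ComputeUnitKind);
      if (It != Results.end())
        return It->second;
    }
    // Slow path: take the exclusive lock and re-check, since another thread
    // may have compiled the same kind while we were waiting.
    std::unique_lock<std::shared_mutex> WriteLock(Mutex);
    auto It = Results.find(ComputeUnitKind);
    if (It == Results.end())
      It = Results.emplace(ComputeUnitKind, Compile(ComputeUnitKind)).first;
    return It->second;
  }
};
```

Holding the exclusive lock during compilation keeps the sketch simple but serializes compilation for different kinds; entries are never erased, so the returned references stay valid.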
Differential Revision: https://reviews.llvm.org/D141081
We can now dump the IR before and after JIT optimizations into the
files passed via `LIBOMPTARGET_JIT_PRE_OPT_IR_MODULE` and
`LIBOMPTARGET_JIT_POST_OPT_IR_MODULE`, respectively.
Similarly, users can set `LIBOMPTARGET_JIT_REPLACEMENT_MODULE` to
replace the IR in the image with a custom IR module in a file.
All options take file paths, documentation was added.
Reviewed by: tianshilei1992
Differential revision: https://reviews.llvm.org/D140945
Defaulting to Generic mode doesn't make much sense as the kernel needs
to be prepared for it. SPMD mode is the "native" execution, e.g., for
"bare" kernels. It also is the execution method for constructors and
destructors (as we might otherwise throw an extra warp onto them).
Differential Revision: https://reviews.llvm.org/D140718
This patch fixes a couple of issues:
1. Instead of using `llvm_unreachable` for those base virtual functions, an unknown
value is now returned. The previous approach could cause a runtime error for those
targets where the image is not compatible but JIT is not implemented.
2. Fixed the typo in CMake that caused the `Target` CMake variable to be undefined.
Reviewed By: ye-luo
Differential Revision: https://reviews.llvm.org/D140732
This patch adds the basic JIT support for OpenMP. Currently it only works on Nvidia GPUs.
The support for AMDGPU can be extended easily by just implementing three interface functions. However, the infrastructure requires a small extra extension (add a pre process hook) to support portability for AMDGPU because the AMDGPU backend reads target features of functions. 02bc7effcc (diff-321c2038035972ad4994ff9d85b29950ba72c08a79891db5048b8f5d46915314R432) shows how it roughly works.
As for the test, even though I added the corresponding code in CMake files, the test still cannot be triggered because some code is missing in the new plugin CMake file, which has nothing to do with this patch. It will be fixed later.
To enable JIT mode, `-foffload-lto` is needed when compiling, and `-foffload-lto -Wl,--embed-bitcode` is needed when linking. This implies that LTO is required to enable JIT mode.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D139287
This patch moves the management/tracking of host pinned buffers to the common PluginInterface
in NextGen plugins. For the moment, the management consists of tracking the host pinned
allocations into a map in each device.
Differential Revision: https://reviews.llvm.org/D140502
With this patch we:
- pick more sensible defaults for the number of teams, inspired by the
old plugin, and configured via LIBOMPTARGET_AMDGPU_TEAMS_PER_CU.
- check the input signal of a kernel launch late, after the queue lock
was taken, to avoid a barrier packet more often.
- copy the kernel arguments in one swoop into the appropriate memory.
- manually specialize the callbacks to avoid potential indirect calls.
This patch better integrates the target nowait functions with the tasking runtime. It splits the nowait execution into two stages: a dispatch stage, which triggers all the necessary asynchronous device operations and stores a set of post-processing procedures that must be executed after said ops; and a synchronization stage, responsible for synchronizing the previous operations in a non-blocking manner and running the appropriate post-processing functions. Suppose during the synchronization stage the operations are not completed. In that case, the attached hidden helper task is re-enqueued to any hidden helper thread to be later synchronized, allowing other target nowait regions to be concurrently dispatched.
Reviewed By: jdoerfert, tianshilei1992
Differential Revision: https://reviews.llvm.org/D132005
This patch refactors the CMake files related to `PluginInterface` in `plugins-nextgen` to handle LLVM dependencies in a better way.
Reviewed By: jhuber6
Differential Revision: https://reviews.llvm.org/D139371
This is still not working for me:
```
-- Configuring done
CMake Error: install(EXPORT "LLVMExports" ...) includes target "omptarget.rtl.amdgpu" which requires target "elf_common" that is not in any export set.
CMake Error: install(EXPORT "LLVMExports" ...) includes target "omptarget.rtl.cuda" which requires target "elf_common" that is not in any export set.
CMake Error: install(EXPORT "LLVMExports" ...) includes target "omptarget.rtl.x86_64" which requires target "elf_common" that is not in any export set.
CMake Error: install(EXPORT "LLVMExports" ...) includes target "omptarget.rtl.cuda.nextgen" which requires target "elf_common" that is not in any export set.
CMake Error: install(EXPORT "LLVMExports" ...) includes target "omptarget.rtl.cuda.nextgen" which requires target "PluginInterface" that is not in any export set.
CMake Error: install(EXPORT "LLVMExports" ...) includes target "omptarget.rtl.x86_64.nextgen" which requires target "elf_common" that is not in any export set.
CMake Error: install(EXPORT "LLVMExports" ...) includes target "omptarget.rtl.x86_64.nextgen" which requires target "PluginInterface" that is not in any export set.
-- Generating done
```
This reverts commit e682a76c3bf61c52628d79d6ec4db221430768c0.
This patch uses `add_llvm_library` to build the target `PluginInterface` since it can handle LLVM dependencies much better. One temporary drawback is that the LLVM CMake macro currently doesn't support object libraries very well (an attempt was made a couple of years ago but was later reverted, see 29e5722949). After switching to that macro, `CXX_VISIBILITY_PRESET` cannot be set correctly, which can cause a runtime error where a function call from one plugin resolves to another plugin. As a consequence, `PluginInterface` is built as a static library for now. I have asked the question in the CMake community (https://discourse.cmake.org/t/set-target-properties-doesnt-work-properly/7016). Once that issue is solved, I'll switch it back to an object library. Using a static library is not necessarily bad, especially since `BUILDTREE_ONLY` is already set so that `PluginInterface.a` will not be installed.
Reviewed By: jhuber6
Differential Revision: https://reviews.llvm.org/D139371
Breaks cmake regeneration for me:
```
CMake Error: install(EXPORT "LLVMExports" ...) includes target "omptarget.rtl.cuda.nextgen" which requires target "PluginInterface" that is not in any export set.
CMake Error: install(EXPORT "LLVMExports" ...) includes target "omptarget.rtl.x86_64.nextgen" which requires target "PluginInterface" that is not in any export set.
```
This reverts commit 08c4081bd3605e1b01a7ccd6accc9052c8966250.
This patch uses `add_llvm_library` to build the target `PluginInterface` since it can handle LLVM dependencies much better. One temporary drawback is that the LLVM CMake macro currently doesn't support object libraries very well (an attempt was made a couple of years ago but was later reverted, see 29e5722949). After switching to that macro, `CXX_VISIBILITY_PRESET` cannot be set correctly, which can cause a runtime error where a function call from one plugin resolves to another plugin. As a consequence, `PluginInterface` is built as a static library for now. I have asked the question in the CMake community (https://discourse.cmake.org/t/set-target-properties-doesnt-work-properly/7016). Once that issue is solved, I'll switch it back to an object library. Using a static library is not necessarily bad, especially since `BUILDTREE_ONLY` is already set so that `PluginInterface.a` will not be installed.
Reviewed By: jhuber6
Differential Revision: https://reviews.llvm.org/D139371
This patch removes the classes GenericStreamManagerTy and GenericEventManagerTy
from the PluginInterface header.
Differential Revision: https://reviews.llvm.org/D138769
This patch modifies the PluginInterface to define functions for initializing
and deinitializing GenericPluginTy instances instead of using the constructor
and destructor. This way, we can return errors from these functions. Also, it
defines some functions that each plugin should implement for creating
plugin-specific objects.
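The motivation can be shown with a small sketch: a constructor cannot return an error, but an explicit `init()` can. All class names here are hypothetical, and `std::optional<std::string>` stands in for an error type such as `llvm::Error`:

```cpp
#include <optional>
#include <string>

// Stand-in error type: nullopt means success, a string describes a failure.
using ErrorMsg = std::optional<std::string>;

// Hypothetical base class with explicit init/deinit instead of relying on
// the constructor and destructor, so failures can be reported to the caller.
class PluginSketch {
  bool Initialized = false;

public:
  virtual ~PluginSketch() = default;

  ErrorMsg init() {
    if (Initialized)
      return std::string("plugin already initialized");
    if (ErrorMsg Err = initImpl())
      return Err;
    Initialized = true;
    return std::nullopt;
  }

  ErrorMsg deinit() {
    if (!Initialized)
      return std::string("plugin not initialized");
    Initialized = false;
    return deinitImpl();
  }

protected:
  // Plugin-specific hooks that each concrete plugin implements.
  virtual ErrorMsg initImpl() = 0;
  virtual ErrorMsg deinitImpl() = 0;
};

// Hypothetical plugin whose initialization always succeeds.
class WorkingPlugin : public PluginSketch {
  ErrorMsg initImpl() override { return std::nullopt; }
  ErrorMsg deinitImpl() override { return std::nullopt; }
};

// Hypothetical plugin that cannot find any device.
class NoDevicePlugin : public PluginSketch {
  ErrorMsg initImpl() override { return std::string("no devices found"); }
  ErrorMsg deinitImpl() override { return std::nullopt; }
};
```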
This patch prepares the PluginInterface for the new AMDGPU NextGen plugin.
Differential Revision: https://reviews.llvm.org/D138625
List of fixes:
- omptarget_device_environment symbol is not mandatory in device images
- Do not synchronize in ~AsyncInfoWrapperTy() if the async info's queue is null
- GenericDeviceResourceRef's create() and destroy() require the device as parameter
Differential Revision: https://reviews.llvm.org/D138619
The OpenMP target's NextGen plugins retrieve symbol information in the ELF image
(i.e., address and size) through the ELF section and ELF symbol objects. However,
the images of CUDA programs compute the address differently from the images of
AMDGPU programs:
- Address for CUDA symbols: image begin + section's offset + symbol's st_value
- Address for AMDGPU symbols: image begin + symbol's st_value
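The two computations differ only in whether the section's offset is added. A minimal sketch with hypothetical function names, assuming the AMDGPU address is the image begin plus `st_value`:

```cpp
#include <cstdint>

// CUDA images: st_value is relative to the start of the symbol's section.
uintptr_t cudaSymbolAddress(uintptr_t ImageBegin, uint64_t SectionOffset,
                            uint64_t StValue) {
  return ImageBegin + static_cast<uintptr_t>(SectionOffset) +
         static_cast<uintptr_t>(StValue);
}

// AMDGPU images: st_value is already relative to the start of the image.
uintptr_t amdgpuSymbolAddress(uintptr_t ImageBegin, uint64_t StValue) {
  return ImageBegin + static_cast<uintptr_t>(StValue);
}
```

For an image at 0x1000, a section offset of 0x200, and an `st_value` of 0x10, the CUDA rule yields 0x1210 and the AMDGPU rule yields 0x1010.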
Differential Revision: https://reviews.llvm.org/D138604