This patch adds `olGetEventElapsedTime` to the new LLVM Offload API, as
requested in
[#185728](https://github.com/llvm/llvm-project/issues/185728), and adds
the corresponding support in `plugins-nextgen`.
A main motivation for this change is to make it possible to measure the
elapsed time of work submitted to a queue, especially kernel launches.
This is relevant to the intended use of the new Offload API for
microbenchmarking GPU libc math functions.
### Summary
The new API returns the elapsed time, in milliseconds, between two
events on the same device.
To support the common pattern `create start event → enqueue kernel →
create end event → sync end event → get elapsed time`, `olCreateEvent`
now always creates and records a backend event through the device
interface. For backends that materialize real event state, this gives
the event concrete backend state that can be used for elapsed-time
measurement. For backends that do not materialize backend event state,
`EventInfo` may still remain null and existing event operations continue
to treat such events as trivially complete.
Previously, an event created on an empty queue could be represented only
as a logical event. That representation was sufficient for sync and
completion queries, but it was not suitable for elapsed-time measurement
because there was no backend event state to timestamp. The new behavior
preserves the meaning of completion of prior work while also allowing
backends with timing support to attach real event state.
### Changes in `plugins-nextgen`
#### Common interface
Add elapsed-time support to the common device and plugin interfaces:
* `GenericPluginTy::get_event_elapsed_time`
* `GenericDeviceTy::getEventElapsedTime`
* `GenericDeviceTy::getEventElapsedTimeImpl`
#### AMDGPU
* Add the required ROCr declarations and wrappers.
* Enable queue profiling at queue creation time.
* Record events by enqueuing a real barrier marker packet on the stream.
* Retain the timing signal needed to query the recorded marker later.
* Implement `getEventElapsedTimeImpl` using
`hsa_amd_profiling_get_dispatch_time`, converting the result to
milliseconds with `HSA_SYSTEM_INFO_TIMESTAMP_FREQUENCY`.
This follows the ROCm/HIP approach of enabling queue profiling at HSA
queue creation time, while keeping the AMDGPU queue path simpler than
the lazy-enable alternative discussed during review.
#### CUDA
* Add the required CUDA driver declarations and wrappers.
* Implement `getEventElapsedTimeImpl` with `cuEventElapsedTime`.
#### Host
* Add `getEventElapsedTimeImpl` that stores `0.0f` in the output
pointer, when present, and returns success.
Reason: the host plugin does not materialize backend event state and
already treats event operations as trivially successful. Returning
`0.0f` preserves that model without introducing a new failure mode.
#### Level Zero
* Add `getEventElapsedTimeImpl`, but leave it unimplemented.
Reason: the Level Zero plugin currently does not provide standalone
backend event support for this event model. For example, `waitEventImpl`
/ `syncEventImpl` are still unimplemented there.
---------
Signed-off-by: Leandro Augusto Lacerda Campos <leandrolcampos@yahoo.com.br>
Signed-off-by: Leandro A. Lacerda Campos <leandrolcampos@yahoo.com.br>
Summary:
One of the main disadvantages to using the RPC interface is that it
requires a server thread to spin on the mailboxes checking for work.
The vast majority of the time, there will be no work and work will come
in large bursts.
The HSA / KFD interface supports device-side interrupts and already has
handling for binding these events to an HSA signal. This means that we
can send interrupts from the GPU to wake a sleeping thread on the CPU.
The sleeping thread will be descheduled with a blocking HSA wait call
and woken up when its event ID is raised through the kernel driver's
interrupt.
This is very target-specific handling, but I believe it is valuable
enough to warrant it being in the protocol. It is completely optional,
as it is ignored if uninitialized. This should bring this support at
parity with the interface HIP expects.
This commit removes the `LIBOMPTARGET_SHARED_MEMORY_SIZE` envar and
outputs a runtime warning if it is defined. Access to dynamic shared memory
should be obtained through the `dyn_groupprivate` clause (OpenMP 6.1) or
the launch arguments in liboffload kernel launch.
Summary:
We use this `dyn_ptr` argument in Clang/OpenMP to handle the
`KernelLaunchEnvironment`. This is a per-kernel argument used to share
some information. Currenetly, it's prepended to the argument list and we
generate storage for it in the runtime.
This is bad for a few reasons:
1. It changes the ABI by shifting user arguments
2. It cannot be trivially be left uninitialized if unused
3. The runtime must allocate its own memory for it
This PR changes it to be appended instead. Additionally, space for this
is always emitted. This means the OMPIRBuilder itself will provide the
storage, we simply need to populate it in the runtime if it is used.
This means that if it's unused we don't always pay the cost and it's
easier for non-OpenMP users to ignore it.
Backward compatibility is maintained by auto-upgrading the kernel
arguments. In `libomptarget` we completely allocate a new buffer to
store this in the new format. The plugins still need to respect the old
ABI of the called device object, so we simply rotate it if it's the old
version.
As discussed in #185404 we might want to provide a way for plugins to
validate images not recognized by the common layer.
This PR adds such extension and uses it to validate pure SPIRV images by
the Level Zero plugin.
Summary:
This was a regression from the original LLVM-gpu-loader. We used to
handle `-mwavefrontsize64` correctly in the loader by over-allocating
memory and just leaving the upper 32-bits masked off. In order to handle
this in offload we need to scan loaded kernels to see how much memory we
need to allocate. This should be safe, the protocol is designed to
handle an arbitrary size and worst-case this just wastes space.
This PR adds extends liboffload olMemRegister API to handle a case when
a memory block may have been mapped before calling olMemRegister to
support some use cases in libomptarget
Add liboffload asynchronous queue query API for libomptarget migration
This PR adds liboffload asynchronous queue query API that needed to make
libomptarget to use liboffload
Add a new nextgen plugin that supports GPU devices through the Intel oneAPI Level Zero library. The plugin is not enabled by default and needs to be added to LIBOMPTARGET_PLUGINS_TO_BUILD explicitely.
---------
Co-authored-by: Alexey Sachkov <alexey.sachkov@intel.com>
Co-authored-by: Nick Sarnie <nick.sarnie@intel.com>
Co-authored-by: Joseph Huber <huberjn@outlook.com>
Update debug messages based on the new method from #170425. Updated the
following files.
- plugins-nextgen/common/include/MemoryManager.h
- plugins-nextgen/common/include/PluginInterface.h
- plugins-nextgen/common/src/GlobalHandler.cpp
- plugins-nextgen/common/src/PluginInterface.cpp
- plugins-nextgen/host/dynamic_ffi/ffi.cpp
Summary:
This was a lot of code that was only used for upstream LLVM builds of
AMDGPU offloading. We have a generic and fast `malloc` in `libc` now so
just use that. Simplifies code, can be added back if we start providing
alternate forms but I don't think there's a single use-case that would
justify it yet.
Introduced in OpenMP 6.0, the device UID shall be a unique identifier of
a device on a given system. (Not necessarily a UUID.) Since it is not
guaranteed that the (U)UIDs defined by the device vendor libraries, such
as HSA, do not overlap with those of other vendors, the device UIDs in
offload are always combined with the offload plugin name. In case the
vendor library does not specify any device UID for a given device, we
fall back to the offload-internal device ID.
The device UID can be retrieved using the `llvm-offload-device-info`
tool.
Adds omp_target_is_accessible routine.
Refactors common code from omp_target_is_present to work for both
routines.
---------
Co-authored-by: Shilei Tian <i@tianshilei.me>
Summary:
This exposes the 'isDeviceCompatible' routine for checking if a binary
*can* be loaded. This is useful if people don't want to consume errors
everywhere when figuring out which image to put to what device.
I don't know if this is a good name, I was thining like `olIsCompatible`
or whatever. Let me know what you think.
Long term I'd like to be able to do something similar to what OpenMP
does where we can conditionally only initialize devices if we need them.
That's going to be support needed if we want this to be more
generic.
Summary:
Currently we have this `__tgt_device_image` indirection which just takes
a reference to some pointers. This was all find and good when the only
usage of this was from a section of GPU code that came from an ELF
constant section. However, we have expanded beyond that and now need to
worry about managing lifetimes. We have code that references the image
even after it was loaded internally. This patch changes the
implementation to instaed copy the memory buffer and manage it locally.
This PR reworks the JIT and other image handling to directly manage its
own memory. We now don't need to duplicate this behavior externally at
the Offload API level. Also we actually free these if the user unloads
them.
Upside, less likely to crash and burn. Downside, more latency when
loading an image.
Summary:
This operation is done every time we load a binary, this behavior should
be moved into OpenMP since it concerns an OpenMP specific data struct.
This is a little messy, because ideally we should only be using public
APIs, but more can be extracted later.
This is equivalent to `cuOccupancyMaxPotentialBlockSize`. It is
currently
only implemented on Cuda; AMDGPU and Host return unsupported.
---------
Co-authored-by: Callum Fare <callum@codeplay.com>
The purpose of this fence is to ensure that any `dataSubmit`s inserted
into a queue before a `dataFence` finish before finish before any
`dataSubmit`s
inserted after it begin.
This is a no-op for most queues, since they are in-order, and by design
any operations inserted into them occur in order.
But the interface is supposed to be functional for out-of-order queues.
The addition of the interface means that any operations that rely on
such ordering (like ATTACH map-type support in #149036) can invoke it,
without worrying about whether the underlying queue is in-order or
out-of-order.
Once a plugin supports out-of-order queues, the plugin can implement
this function, without requiring any change at the libomptarget level.
---------
Co-authored-by: Alex Duran <alejandro.duran@intel.com>
This sprinkles a few mutexes around the plugin interface so that the
olLaunchKernel CTS test now passes when ran on multiple threads.
Part of this also involved changing the interface for device synchronise
so that it can optionally not free the underlying queue (which
introduced a race condition in liboffload).
Add a device function to check if a device queue is empty. If liboffload
tries to create an event for an empty queue, we create an "empty" event
that is already complete.
This allows `olCreateEvent`, `olSyncEvent` and `olWaitEvent` to run
quickly for empty queues.
Enables AMD data center class GPUs to use memory manager memory pooling
up to 3GB allocation by default, up from the "1 << 13" threshold that
all plugin-nextgen devices use.
The following patch introduces a new interop interface implementation
with the following characteristics:
* It supports the new 6.0 prefer_type specification
* It supports both explicit objects (from interop constructs) and
implicit objects (from variant calls).
* Implements a per-thread reuse mechanism for implicit objects to reduce
overheads.
* It provides a plugin interface that allows selecting the supported
interop types, and managing all the backend related interop operations
(init, sync, ...).
* It enables cooperation with the OpenMP runtime to allow progress on
OpenMP synchronizations.
* It cleanups some vendor/fr_id mismatchs from the current query
routines.
* It supports extension to define interop callbacks for library cleanup.
The `unloadBinaryImpl` method on the host plugin is now implemented
properly (rather than just being a stub). When an image is unloaded,
it is deallocated and the library associated with it is closed.
GenericKernelTy has a pointer to the name that was used to create it.
However, the name passed in as an argument may not outlive the kernel.
Instead, GenericKernelTy now contains a std::string, and copies the
name into there.
This allows removal of a specific Image from a Device, rather than
requiring all image data to outlive the device they were created for.
This is required for `ol_program_handle_t`s, which now specify the
lifetime of the buffer used to create the program.
Rather than being "stringly typed", store values as a std::variant that
can hold various types. This means that liboffload doesn't have to do
any string parsing for integer/bool device info keys.
Previously, device info was returned as a queue with each element having
a "Level" field indicating its nesting level. This replaces this queue
with a more traditional tree-like structure.
This should not result in a change to the output of
`llvm-offload-device-info`.
Previously we decided to check in files that we generate with tablegen.
The justification at the time was that it helped reviewers unfamiliar
with `offload-tblgen` see the actual changes to the headers in PRs.
After trying it for a while, it's ended up causing some headaches and is
also not how tablegen is used elsewhere in LLVM.
This changes our use of tablegen to be more conventional. Where
possible, files are still clang-formatted, but this is no longer a hard
requirement. Because `OffloadErrcodes.inc` is shared with libomptarget
it now gets generated in a more appropriate place.
[Offload] Use new error code handling mechanism
This removes the old ErrorCode-less error method and requires
every user to provide a concrete error code. All calls have been
updated.
In addition, for consistency with error messages elsewhere in LLVM, all
messages have been made to start lower case.
A new ErrorCode enumeration is present in PluginInterface which can
be used when returning an llvm::Error from offload and PluginInterface
functions.
This enum must be kept up to sync with liboffload's ol_errc_t enum, so
both are automatically generated from liboffload's enum definition.
Some error codes have also been shuffled around to allow for future
work. Note that this patch only adds the machinery; actual error codes
will be added in a future patch.
~~Depends on #137339 , please ignore first commit of this MR.~~ This has
been merged.
Summary:
We treated the missing kernel environment as a unique mode, but it was
kind of this random bool that was doing the same thing and it explicitly
expects the kernel environment to be zero. It broke after the previous
change since it used to default to SPMD and didn't handle zero in any of
the other cases despite being used. This fixes that and queries for it
without needing to consume an error.
Reland #118503. Added a fix for builds with `-DBUILD_SHARED_LIBS=ON`
(see last commit). Otherwise the changes are identical.
---
### New API
Previous discussions at the LLVM/Offload meeting have brought up the
need for a new API for exposing the functionality of the plugins. This
change introduces a very small subset of a new API, which is primarily
for testing the offload tooling and demonstrating how a new API can fit
into the existing code base without being too disruptive. Exact designs
for these entry points and future additions can be worked out over time.
The new API does however introduce the bare minimum functionality to
implement device discovery for Unified Runtime and SYCL. This means that
the `urinfo` and `sycl-ls` tools can be used on top of Offload. A
(rough) implementation of a Unified Runtime adapter (aka plugin) for
Offload is available
[here](https://github.com/callumfare/unified-runtime/tree/offload_adapter).
Our intention is to maintain this and use it to implement and test
Offload API changes with SYCL.
### Demoing the new API
```sh
# From the runtime build directory
$ ninja LibomptUnitTests
$ OFFLOAD_TRACE=1 ./offload/unittests/OffloadAPI/offload.unittests
```
### Open questions and future work
* Only some of the available device info is exposed, and not all the
possible device queries needed for SYCL are implemented by the plugins.
A sensible next step would be to refactor and extend the existing device
info queries in the plugins. The existing info queries are all strings,
but the new API introduces the ability to return any arbitrary type.
* It may be sensible at some point for the plugins to implement the new
API directly, and the higher level code on top of it could be made
generic, but this is more of a long-term possibility.