This patch adds `olGetEventElapsedTime` to the new LLVM Offload API, as
requested in
[#185728](https://github.com/llvm/llvm-project/issues/185728), and adds
the corresponding support in `plugins-nextgen`.
A main motivation for this change is to make it possible to measure the
elapsed time of work submitted to a queue, especially kernel launches.
This is relevant to the intended use of the new Offload API for
microbenchmarking GPU libc math functions.
### Summary
The new API returns the elapsed time, in milliseconds, between two
events on the same device.
To support the common pattern `create start event → enqueue kernel →
create end event → sync end event → get elapsed time`, `olCreateEvent`
now always creates and records a backend event through the device
interface. For backends that materialize real event state, this gives
the event concrete backend state that can be used for elapsed-time
measurement. For backends that do not materialize backend event state,
`EventInfo` may still remain null and existing event operations continue
to treat such events as trivially complete.
Previously, an event created on an empty queue could be represented only
as a logical event. That representation was sufficient for sync and
completion queries, but it was not suitable for elapsed-time measurement
because there was no backend event state to timestamp. The new behavior
preserves the meaning of completion of prior work while also allowing
backends with timing support to attach real event state.
### Changes in `plugins-nextgen`
#### Common interface
Add elapsed-time support to the common device and plugin interfaces:
* `GenericPluginTy::get_event_elapsed_time`
* `GenericDeviceTy::getEventElapsedTime`
* `GenericDeviceTy::getEventElapsedTimeImpl`
#### AMDGPU
* Add the required ROCr declarations and wrappers.
* Enable queue profiling at queue creation time.
* Record events by enqueuing a real barrier marker packet on the stream.
* Retain the timing signal needed to query the recorded marker later.
* Implement `getEventElapsedTimeImpl` using
`hsa_amd_profiling_get_dispatch_time`, converting the result to
milliseconds with `HSA_SYSTEM_INFO_TIMESTAMP_FREQUENCY`.
This follows the ROCm/HIP approach of enabling queue profiling at HSA
queue creation time, while keeping the AMDGPU queue path simpler than
the lazy-enable alternative discussed during review.
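
The tick-to-millisecond conversion described above can be sketched as follows. This is only an illustration, not the plugin's code: `ticksToMilliseconds` and its parameter names are hypothetical, and the sketch assumes the timestamp frequency is expressed in ticks per second. Which timestamps the patch reads from each recorded marker is not shown here.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: converts the distance between two device timestamps
// (in ticks) to milliseconds. FrequencyHz is assumed to be ticks per second.
inline float ticksToMilliseconds(std::uint64_t StartTicks,
                                 std::uint64_t EndTicks,
                                 std::uint64_t FrequencyHz) {
  assert(FrequencyHz != 0 && "timestamp frequency must be non-zero");
  assert(EndTicks >= StartTicks && "end timestamp must not precede start");
  // Divide in double precision first, then narrow to the float the API returns.
  const double Seconds = static_cast<double>(EndTicks - StartTicks) /
                         static_cast<double>(FrequencyHz);
  return static_cast<float>(Seconds * 1000.0);
}
```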
#### CUDA
* Add the required CUDA driver declarations and wrappers.
* Implement `getEventElapsedTimeImpl` with `cuEventElapsedTime`.
#### Host
* Add `getEventElapsedTimeImpl` that stores `0.0f` in the output
pointer, when present, and returns success.
Reason: the host plugin does not materialize backend event state and
already treats event operations as trivially successful. Returning
`0.0f` preserves that model without introducing a new failure mode.
#### Level Zero
* Add `getEventElapsedTimeImpl`, but leave it unimplemented.
Reason: the Level Zero plugin currently does not provide standalone
backend event support for this event model. For example, `waitEventImpl`
/ `syncEventImpl` are still unimplemented there.
---------
Signed-off-by: Leandro Augusto Lacerda Campos <leandrolcampos@yahoo.com.br>
Signed-off-by: Leandro A. Lacerda Campos <leandrolcampos@yahoo.com.br>
Summary:
We create the RPC doorbell signal lazily and destroy it at the plugin
level. This means that we can't rely on the normal 'per-device' handling
so this needs to be called unconditionally. We only create the signal if
a device is registered, but deinit is called unconditionally. Just check
the handle.
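
A minimal sketch of the "just check the handle" pattern, assuming a zero handle means the signal was never created. `LazySignal` and its members are hypothetical stand-ins, not the plugin's actual types, and the counters exist only to make the behavior observable.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical plugin-level resource that is created on first use.
struct LazySignal {
  std::uint64_t Handle = 0; // 0 is assumed to mean "never created".
  int CreateCount = 0;
  int DestroyCount = 0;

  // Called when a device registers; creates the signal only once.
  void ensureCreated() {
    if (Handle != 0)
      return;
    Handle = 1; // Placeholder for the real creation call.
    ++CreateCount;
    assert(Handle != 0 && "creation must yield a valid handle");
  }

  // Called unconditionally at plugin teardown; safe if nothing was created.
  void deinit() {
    if (Handle == 0)
      return;
    ++DestroyCount; // Placeholder for the real destroy call.
    Handle = 0;
  }
};
```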
Summary:
One of the main disadvantages to using the RPC interface is that it
requires a server thread to spin on the mailboxes checking for work.
The vast majority of the time, there will be no work and work will come
in large bursts.
The HSA / KFD interface supports device-side interrupts and already has
handling for binding these events to an HSA signal. This means that we
can send interrupts from the GPU to wake a sleeping thread on the CPU.
The sleeping thread will be descheduled with a blocking HSA wait call
and woken up when its event ID is raised through the kernel driver's
interrupt.
This is very target-specific handling, but I believe it is valuable
enough to warrant it being in the protocol. It is completely optional,
as it is ignored if uninitialized. This should bring this support to
parity with the interface HIP expects.
This commit removes the `LIBOMPTARGET_SHARED_MEMORY_SIZE` envar and
outputs a runtime warning if it is defined. Access to dynamic shared memory
should be obtained through the `dyn_groupprivate` clause (OpenMP 6.1) or
the launch arguments in liboffload kernel launch.
Summary:
This was a regression from the original LLVM-gpu-loader. We used to
handle `-mwavefrontsize64` correctly in the loader by over-allocating
memory and just leaving the upper 32-bits masked off. In order to handle
this in offload we need to scan loaded kernels to see how much memory we
need to allocate. This should be safe: the protocol is designed to
handle an arbitrary size, and in the worst case this just wastes space.
Add liboffload asynchronous queue query API for libomptarget migration
This PR adds the liboffload asynchronous queue query API that is needed
for libomptarget to use liboffload.
Summary:
This was a lot of code that was only used for upstream LLVM builds of
AMDGPU offloading. We have a generic and fast `malloc` in `libc` now so
just use that. Simplifies code, can be added back if we start providing
alternate forms but I don't think there's a single use-case that would
justify it yet.
Introduced in OpenMP 6.0, the device UID shall be a unique identifier of
a device on a given system. (Not necessarily a UUID.) Since it is not
guaranteed that the (U)UIDs defined by the device vendor libraries, such
as HSA, do not overlap with those of other vendors, the device UIDs in
offload are always combined with the offload plugin name. In case the
vendor library does not specify any device UID for a given device, we
fall back to the offload-internal device ID.
The device UID can be retrieved using the `llvm-offload-device-info`
tool.
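
The combination rule can be sketched as below. The function name, the `-` separator, and the use of an empty string for "vendor library reported no UID" are all assumptions made for illustration. The actual format produced by offload is not specified in this message.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: prefixes the vendor UID with the plugin name so that
// UIDs from different vendors cannot collide. Falls back to the
// offload-internal device ID when the vendor library reports no UID.
inline std::string makeDeviceUid(const std::string &PluginName,
                                 const std::string &VendorUid,
                                 int InternalDeviceId) {
  assert(!PluginName.empty() && "plugin name is required for uniqueness");
  const std::string Suffix =
      VendorUid.empty() ? std::to_string(InternalDeviceId) : VendorUid;
  return PluginName + "-" + Suffix; // Separator is an assumption.
}
```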
Adds omp_target_is_accessible routine.
Refactors common code from omp_target_is_present to work for both
routines.
---------
Co-authored-by: Shilei Tian <i@tianshilei.me>
Summary:
This check is unnecessarily restrictive and currently incorrectly fires
for any size less than eight bytes. Just remove it, we do sanity checks
elsewhere and at some point need to trust the ABI.
This implements two pieces to restore the interop functionality (that I
broke) when the 6.0 interfaces were added:
* A set of wrappers that support the old interfaces on top of the new
ones
* The same level of interop support for the CUDA and AMD plugins
Summary:
This was originally added in as a hack to work around CUDA's limitation
on allocation. The `libc` implementation now isn't even used for CUDA so
this code is never hit. Even in this case, this code never truly worked.
A true solution would be to use CUDA's virtual memory API instead to
allocate 2MiB slabs independently from the normal memory management
done in the stream.
Summary:
I made the GPU flags accept more of the default LLVM warnings, which
triggered some new cases. Clean those up and fix some other ones while
I'm at it.
Summary:
Currently we have this `__tgt_device_image` indirection which just takes
a reference to some pointers. This was all fine and good when the only
usage of this was from a section of GPU code that came from an ELF
constant section. However, we have expanded beyond that and now need to
worry about managing lifetimes. We have code that references the image
even after it was loaded internally. This patch changes the
implementation to instead copy the memory buffer and manage it locally.
This PR reworks the JIT and other image handling to directly manage its
own memory. We now don't need to duplicate this behavior externally at
the Offload API level. Also we actually free these if the user unloads
them.
Upside, less likely to crash and burn. Downside, more latency when
loading an image.
This is equivalent to `cuOccupancyMaxPotentialBlockSize`. It is
currently only implemented on CUDA; AMDGPU and Host return unsupported.
---------
Co-authored-by: Callum Fare <callum@codeplay.com>
Add the following properties in Offload device info:
* VENDOR_ID
* NUM_COMPUTE_UNITS
* [SINGLE|DOUBLE|HALF]_FP_CONFIG
* NATIVE_VECTOR_WIDTH_[CHAR|SHORT|INT|LONG|FLOAT|DOUBLE|HALF]
* MAX_CLOCK_FREQUENCY
* MEMORY_CLOCK_RATE
* ADDRESS_BITS
* MAX_MEM_ALLOC_SIZE
* GLOBAL_MEM_SIZE
Add a bitfield option to enumerators, allowing the values to be
bit-shifted instead of incremented. Generate the per-type enums using
`foreach` to reduce code duplication.
Use macros in unit test definitions to reduce code duplication.
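
For illustration, bit-shifted enumerators of the kind the bitfield option produces might look like the following. The enum and flag names here are hypothetical and are not the generated Offload API names. The point is only that each value occupies its own bit, so values can be OR-ed into a mask.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical flag names; each enumerator is 1 shifted by its index
// rather than an incrementing integer.
enum SketchFpCapability : std::uint32_t {
  SKETCH_FP_DENORM = 1u << 0,
  SKETCH_FP_INF_NAN = 1u << 1,
  SKETCH_FP_ROUND_TO_NEAREST = 1u << 2,
  SKETCH_FP_FMA = 1u << 3,
};

// Returns true when every bit in Required is also set in Mask.
inline bool hasAllFlags(std::uint32_t Mask, std::uint32_t Required) {
  assert(Required != 0 && "query at least one flag");
  return (Mask & Required) == Required;
}
```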
The purpose of this fence is to ensure that any `dataSubmit`s inserted
into a queue before a `dataFence` finish before any `dataSubmit`s
inserted after it begin.
This is a no-op for most queues, since they are in-order, and by design
any operations inserted into them occur in order.
But the interface is supposed to be functional for out-of-order queues.
The addition of the interface means that any operations that rely on
such ordering (like ATTACH map-type support in #149036) can invoke it,
without worrying about whether the underlying queue is in-order or
out-of-order.
Once a plugin supports out-of-order queues, the plugin can implement
this function, without requiring any change at the libomptarget level.
---------
Co-authored-by: Alex Duran <alejandro.duran@intel.com>
This sprinkles a few mutexes around the plugin interface so that the
olLaunchKernel CTS test now passes when run on multiple threads.
Part of this also involved changing the interface for device synchronise
so that it can optionally not free the underlying queue (which
introduced a race condition in liboffload).
Add a device function to check if a device queue is empty. If liboffload
tries to create an event for an empty queue, we create an "empty" event
that is already complete.
This allows `olCreateEvent`, `olSyncEvent` and `olWaitEvent` to run
quickly for empty queues.
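
A rough sketch of the "empty event" idea, assuming an event with no backend state is represented by a null pointer. `SketchEvent`, `isEventComplete`, and the callback are hypothetical names used only to show the short-circuit and are not liboffload's real types.

```cpp
#include <cassert>
#include <functional>

// Hypothetical event: BackendInfo stays null when the queue was empty at
// creation time, so there is no prior work to wait for.
struct SketchEvent {
  void *BackendInfo = nullptr;
};

// Completion query that skips the backend entirely for empty events.
inline bool isEventComplete(const SketchEvent &Event,
                            const std::function<bool(void *)> &QueryBackend) {
  if (Event.BackendInfo == nullptr)
    return true; // Empty event: already complete, no backend call needed.
  assert(QueryBackend && "real events need a backend query");
  return QueryBackend(Event.BackendInfo);
}
```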
Enables AMD data center class GPUs to use memory manager memory pooling
up to 3GB allocation by default, up from the "1 << 13" threshold that
all plugin-nextgen devices use.
`MAX_WORK_GROUP_SIZE` now represents the maximum total number of work
groups the device can allocate, rather than the maximum per dimension.
`MAX_WORK_GROUP_SIZE_PER_DIMENSION` has been added, which has the old
behaviour.
The `GlobalTy` helper has been extended to make both the Size and Ptr be
optional. Now `getGlobalMetadataFromDevice`/`Image` is able to write the
size of the global to the struct, instead of just verifying it.
Summary:
There's a new one called the AIE (AI Engine). We could handle this, but
since we don't use it currently I'm just making it future-proof. Adding
the AIE check would require checking the HSA version which isn't
worthwhile just yet.
This allows removal of a specific Image from a Device, rather than
requiring all image data to outlive the device they were created for.
This is required for `ol_program_handle_t`s, which now specify the
lifetime of the buffer used to create the program.
Previously, device info was returned as a queue with each element having
a "Level" field indicating its nesting level. This replaces this queue
with a more traditional tree-like structure.
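
To illustrate the relationship between the two representations, the sketch below rebuilds a tree from a flat list whose entries carry a nesting level. `FlatEntry`, `TreeNode`, and `buildTree` are hypothetical names and are not the plugin's actual device-info types. The sketch assumes top-level entries have level 0 and that an entry is never more than one level deeper than its predecessor.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct FlatEntry {
  std::string Key;
  std::size_t Level; // Nesting depth; 0 is assumed to be top level.
};

struct TreeNode {
  std::string Key;
  std::vector<TreeNode> Children;
};

// Rebuilds the nesting implied by each entry's Level into an explicit tree.
inline TreeNode buildTree(const std::vector<FlatEntry> &Entries) {
  TreeNode Root{"<root>", {}};
  // Path[K] is the most recent node at depth K - 1; Path[0] is the root.
  std::vector<TreeNode *> Path{&Root};
  for (const FlatEntry &Entry : Entries) {
    assert(Entry.Level + 1 <= Path.size() && "entry skips a nesting level");
    Path.resize(Entry.Level + 1); // Drop nodes deeper than the new parent.
    TreeNode *Parent = Path.back();
    Parent->Children.push_back(TreeNode{Entry.Key, {}});
    Path.push_back(&Parent->Children.back());
  }
  return Root;
}
```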
This should not result in a change to the output of
`llvm-offload-device-info`.
Summary:
The size of the implicit argument struct can vary depending on
optimizations, it is not always the size as listed by the full struct.
Additionally, the implicit arguments are always aligned on a pointer
boundary. This patch updates the handling to use the correctly aligned
offset and only initialize the members if they are contained in the
reported size.
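
The two checks described above can be sketched as follows. The helper names are hypothetical and the real patch may structure this differently. The sketch only shows rounding an offset up to an alignment boundary and testing whether a member lies entirely within the size the kernel reports.

```cpp
#include <cassert>
#include <cstddef>

// Rounds Value up to the next multiple of Align.
inline std::size_t alignUp(std::size_t Value, std::size_t Align) {
  assert(Align != 0 && "alignment must be non-zero");
  return (Value + Align - 1) / Align * Align;
}

// Offset at which implicit arguments would start after the explicit ones,
// assuming pointer alignment as stated in the message.
inline std::size_t implicitArgsOffset(std::size_t ExplicitArgsSize) {
  return alignUp(ExplicitArgsSize, alignof(void *));
}

// True when the member at [Offset, Offset + Size) fits inside the reported
// size, i.e. when it is safe to initialize it.
inline bool memberFits(std::size_t Offset, std::size_t Size,
                       std::size_t ReportedSize) {
  return Offset + Size <= ReportedSize;
}
```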
Additionally, we modify the `alloc` and `free` routines to allow
`alloc(0)` and `free(nullptr)` as these are mandated by the C standard
and allow us to easily handle cases where the user calls a kernel with
no arguments.
Summary:
`malloc(0)` and `free(nullptr)` are both defined by the standard but we
currently trigger errors and assertions on them. Fix that so this works
with empty arguments.
[Offload] Use new error code handling mechanism
This removes the old ErrorCode-less error method and requires
every user to provide a concrete error code. All calls have been
updated.
In addition, for consistency with error messages elsewhere in LLVM, all
messages have been made to start lower case.
Summary:
We conditionally allocate the implicit arguments, so they possibly are
null. The flang compiler seems to hit this case, even though it
shouldn't when it's supposed to conform to the HSA code object. For now
guard this to fix the regression and cover a case in the future where
someone rolls a fully custom implementation.
Fixes: https://github.com/llvm/llvm-project/issues/132982