Summary:
`Status` is unfortunately heavily overloaded in practice. Things like
X11 define it as a macro. Best to just remove that possibility entirely.
Introduce common infrastructure for runtimes that determines compiler
resource path locations. These variables introduced are:
* RUNTIMES_OUTPUT_RESOURCE_DIR
* RUNTIMES_INSTALL_RESOURCE_PATH
That contain the location for the compiler resource path (typically
`lib/clang/<version>`) in the build tree and the install tree (the
latter relative to CMAKE_INSTALL_PREFIX).
Additionally, define
* RUNTIMES_OUTPUT_RESOURCE_LIB_DIR
* RUNTIMES_INSTALL_RESOURCE_LIB_PATH
as for the location of clang/flang version-locked libraries (typically
`lib${LLVM_LIBDIR_SUFFIX}/<targer-triple>`, but also depends on `APPLE`
and `LLVM_ENABLE_PER_TARGET_RUNTIME_DIR`). This code is moved from
flang-rt and initially becomes its only user.
Refactored out of #171610 as requested
[here](https://github.com/llvm/llvm-project/pull/171610#discussion_r2687382481).
Extracted `get_runtimes_target_libdir_common` from compiler-rt as
requested
[here](https://github.com/llvm/llvm-project/pull/171610#discussion_r2689565634).
Added TODO comments to all runtimes as requested
[here](https://github.com/llvm/llvm-project/pull/171610#issuecomment-3789598635).
This patch adds `olGetEventElapsedTime` to the new LLVM Offload API, as
requested in
[#185728](https://github.com/llvm/llvm-project/issues/185728), and adds
the corresponding support in `plugins-nextgen`.
A main motivation for this change is to make it possible to measure the
elapsed time of work submitted to a queue, especially kernel launches.
This is relevant to the intended use of the new Offload API for
microbenchmarking GPU libc math functions.
### Summary
The new API returns the elapsed time, in milliseconds, between two
events on the same device.
To support the common pattern `create start event → enqueue kernel →
create end event → sync end event → get elapsed time`, `olCreateEvent`
now always creates and records a backend event through the device
interface. For backends that materialize real event state, this gives
the event concrete backend state that can be used for elapsed-time
measurement. For backends that do not materialize backend event state,
`EventInfo` may still remain null and existing event operations continue
to treat such events as trivially complete.
Previously, an event created on an empty queue could be represented only
as a logical event. That representation was sufficient for sync and
completion queries, but it was not suitable for elapsed-time measurement
because there was no backend event state to timestamp. The new behavior
preserves the meaning of completion of prior work while also allowing
backends with timing support to attach real event state.
### Changes in `plugins-nextgen`
#### Common interface
Add elapsed-time support to the common device and plugin interfaces:
* `GenericPluginTy::get_event_elapsed_time`
* `GenericDeviceTy::getEventElapsedTime`
* `GenericDeviceTy::getEventElapsedTimeImpl`
#### AMDGPU
* Add the required ROCr declarations and wrappers.
* Enable queue profiling at queue creation time.
* Record events by enqueuing a real barrier marker packet on the stream.
* Retain the timing signal needed to query the recorded marker later.
* Implement `getEventElapsedTimeImpl` using
`hsa_amd_profiling_get_dispatch_time`, converting the result to
milliseconds with `HSA_SYSTEM_INFO_TIMESTAMP_FREQUENCY`.
This follows the ROCm/HIP approach of enabling queue profiling at HSA
queue creation time, while keeping the AMDGPU queue path simpler than
the lazy-enable alternative discussed during review.
#### CUDA
* Add the required CUDA driver declarations and wrappers.
* Implement `getEventElapsedTimeImpl` with `cuEventElapsedTime`.
#### Host
* Add `getEventElapsedTimeImpl` that stores `0.0f` in the output
pointer, when present, and returns success.
Reason: the host plugin does not materialize backend event state and
already treats event operations as trivially successful. Returning
`0.0f` preserves that model without introducing a new failure mode.
#### Level Zero
* Add `getEventElapsedTimeImpl`, but leave it unimplemented.
Reason: the Level Zero plugin currently does not provide standalone
backend event support for this event model. For example, `waitEventImpl`
/ `syncEventImpl` are still unimplemented there.
---------
Signed-off-by: Leandro Augusto Lacerda Campos <leandrolcampos@yahoo.com.br>
Signed-off-by: Leandro A. Lacerda Campos <leandrolcampos@yahoo.com.br>
Summary:
These are currently only run with check-offload-unit. Make them a part
of the other tests by putting a dependency on it. We did something like
this previously but it was reverted because the tests failed if there
were no GPUs (like in systems that only checked the CPU case) but I
think that has been fixed.
Already disabled on other GPU platforms and sporadically failing on our
builder, so this test seems not be doing too hot.
Signed-off-by: Nick Sarnie <nick.sarnie@intel.com>
Fix some tests causing hangs, one fail, and a few XPASSing. We are
seeing new passes/fails because of the named barrier changes being
merged.
Signed-off-by: Nick Sarnie <nick.sarnie@intel.com>
This change adds implementation for named barriers for SPIRV backend.
Since there is no built in API/intrinsics for named barrier in SPIRV,
the implementation loosely follows implementation for AMD
Summary:
Currently, the GPU iterates through all of the present symbols and
copies them by prefix. This is inefficient as it requires a lot of small
high-latency data transfers rather than a few large ones. Additionally,
we force every single profiling symbol to have protected visibility.
This means potentially hundreds of unnecessary symbols in the symbol
table.
This PR changes the interface to move towards the start / stop section
handling. AMDGPU supports this natively as an ELF target, so we need
little changes. Instead of overriding visibility, we use a single table
to define the bounds that we can obtain with one contiguous load.
Using a table interface should also work for the in-progress HIP
implementation for this, as it wraps the start / stop sections into
standard void pointers which will be inside of an already mapped region
of memory, so they should be accessible from the HIP API.
NVPTX is more difficult as it is an ELF platform without this support. I
have hooked up the 'Other' handling to work around this, but even then
it's a bit of a stretch. I could remove this support here, but I wanted
to demonstrate that we can share the ABI. However, NVPTX will only work
if we force LTO and change the backend to emit variables in the same
TL;DR, we now do this:
```c
struct { start1, stop1, start2, stop2, start3, stop3, version; } device;
struct host = DtoH(lookup("device"));
counters = DtoH(host.stop - host.start)
version = DtoH(host.version);
```
The existing implementation has three issues which this patch addresses.
1. The last dimension which represents the bytes in the type, has the
wrong stride and count. For example, for a 4 byte int, count=1 and
stride=4. The correct representation here is count=4 and stride=1
because there are 4 bytes (count=4) that we need to copy and we do not
skip any bytes (stride=1).
2. The size of the data copy was computed using the last dimension.
However, this is incorrect in cases where some of the final dimensions
get merged into one. In this case we need to take the combined size of
the merged dimensions, which is (Count * Stride) of the first merged
dimension.
3. The Offset into a dimension was computed as a multiple of its Stride.
However, this Stride which is in bytes, already includes the stride
multiplier given by the user. This means that when the user specified
1:3:2, i.e. elements 1, 3, 5, the runtime incorrectly copied elements 2,
4, 6. Fix this by precomputing at compile time the Offset to be in bytes
by correctly multiplying the offset by the stride of the dimension
without the user-specified multiplier.
Summary:
Right now the CMake does not follow the pattern other runtime projects
use. All this does is use the standard subdir to place libraries in a
unique location. This allows, for example, users to configure a debug
version of openmp / offload within the same CMake invocation.
---------
Co-authored-by: Michael Kruse <github@meinersbur.de>
This PR adds support in the Level Zero plugin to execute
constructors/destructors on the device code. As spirv-link has some
limitations, it mimics the CUDA plugin behavior where the RTL constructs
the device side tables before invoking the kernel that will execute
them.
The kernel and other necessary symbols to create the device tables are
created by the SPIRVCtorDtorLowering pass to be added in #187509
Additional cleanup improvements on error conditions (in addition to
those in #187597):
* Fixed incomplete cleanup in L0Context::init()
* Fixed build log leak in addModule()
* Fixed context inconsistent state in findDevices()
Disclaimer: The base of this PR was generated by Claude and adjusted by
me afterwards.
Summary:
We create the RPC doorbell signal lazily and destroy it at the plugin
level. This means that we can't rely on the normal 'per-device' handling
so this needs to be called unconditionally. We only create the signal if
a device is registered, but deinit is called unconditionally. Just check
the handle.
Summary:
One of the main disadvantages to using the RPC interface is that it
requires a server thread to spin on the mailboxes checking for work.
The vast majority of the time, there will be no work and work will come
in large bursts.
The HSA / KFD interface supports device-side interrupts and already has
handling for binding these events to an HSA signal. This means that we
can send interrupts from the GPU to wake a sleeping thread on the CPU.
The sleeping thread will be descheduled with a blocking HSA wait call
and woken up when its event ID is raised through the kernel driver's
interrupt.
This is very target-specific handling, but I believe it is valuable
enough to warrant it being in the protocol. It is completely optional,
as it is ignored if uninitialized. This should bring this support at
parity with the interface HIP expects.
Summary:
This PR changes the handling of the emitted kernels when targeting a CPU
to be a pointer struct.
The old handling emitted a standard function prototype, this
necessitated a target specific ABI to call it because the signature
differed with the number of arguments. Instead, this PR emits a void
pointer to a naturally aligned struct, this is what APIs like `pthreads`
assert.
This allows us to remove all the complexity around launching host
kernels and just pass the argument list.
Summary:
The changes in https://www.github.com/llvm/llvm-project/pull/185552
allowed us to
start building the standard `libclang_rt.profile.a` for GPU targets.
This PR expands this by adding an optimized GPU routine for counter
increment and removing the special-case handling of these functions in
the OpenMP runtime.
Vast majority of these functions are boilerplate, but we should be able
to do more interesting things with this in the future, like value or
memory profiling.
Summary:
Follow up on removal of OPENMP_STANDALONE_BUILD in openmp (#149878).
This
build method is redundant and can be accomplished via runtimes.
Removes support for:
`cmake -S <llvm-project>/offload ...`
Switches over to:
`make -S <llvm-project>/runtimes -DLLVM_ENABLE_RUNTIMES=openmp;offload
...`
Libomptarget has a dependency on libomp.so and requires the omp cmake
target to exist at build time, which is why both runtimes are listed.
Updates cmake compiler logic in offload/CMakeLists.txt to mirror openmp
changes:
[openmp] Allow testing OpenMP without a full clang build tree (#182470)
User will still need to have a separate invocation to build openmp
DeviceRTL via:
`-DLLVM_ENABLE_RUNTIMES=openmp`
`-DLLVM_DEFAULT_TARGET_TRIPLE=<amdgcn-amd-amdhsa|nvptx64-nvidia-cuda>`
In the pre-merge CI we need a top-level visible target that can be used
to build offload, i.e., libomptarget and LLVMOffload.
The related PR to include offload into pre-merge CI is here:
https://github.com/llvm/llvm-project/pull/174955
This change improves handling of errors during synchronization in Level
Zero plugin by ensuring cleanup of queues and events in case of an
synchronization error. As a result multiple tests stopped hanging.
---------
Co-authored-by: Duran, Alex <alejandro.duran@intel.com>
Queue's WaitEvent collection wasn't being cleared after synchronization
and resetting of the events. This led to hangs on subsequent host
synchronizations if not preceeded by any other operation.
This commit removes the `LIBOMPTARGET_SHARED_MEMORY_SIZE` envar and
outputs a runtime warning if it is defined. Access to dynamic shared memory
should be obtained through the `dyn_groupprivate` clause (OpenMP 6.1) or
the launch arguments in liboffload kernel launch.
Summary:
We use this `dyn_ptr` argument in Clang/OpenMP to handle the
`KernelLaunchEnvironment`. This is a per-kernel argument used to share
some information. Currenetly, it's prepended to the argument list and we
generate storage for it in the runtime.
This is bad for a few reasons:
1. It changes the ABI by shifting user arguments
2. It cannot be trivially be left uninitialized if unused
3. The runtime must allocate its own memory for it
This PR changes it to be appended instead. Additionally, space for this
is always emitted. This means the OMPIRBuilder itself will provide the
storage, we simply need to populate it in the runtime if it is used.
This means that if it's unused we don't always pay the cost and it's
easier for non-OpenMP users to ignore it.
Backward compatibility is maintained by auto-upgrading the kernel
arguments. In `libomptarget` we completely allocate a new buffer to
store this in the new format. The plugins still need to respect the old
ABI of the called device object, so we simply rotate it if it's the old
version.
Converting the current bot config to use the CMake cache file that we
use in other bots (offload/cmake/caches/AMDGPUBot.cmake). This PR
removes all CMake settings that the cache file already sets and only
leaves those that were either not set explicitly or which differ.
Thus, first load the cache file and then adjust the settings to override
existing values.
### Issue: Dimension override missing
When variable count expressions were used with stride, the constant
subsection path computed size first. This marked `ArgSizes` with byte
size semantics. Variable expression logic later triggered, but reused
`ArgSizes` assuming "bytes" semantics
`OMPIRBuilder.cpp` didn't handle dimension count for
`OMP_MAP_NON_CONTIG` flag
**Result**: `ArgSizes` wasn't overwritten with dimension count, breaking
non-contiguous mapping.
**Fixes**:
`llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp` - Expression semantics for
non-contiguous.
stride/count.
Generate 3D descriptor structures with runtime dimensions.
Fix dimension override to use dimension count instead of byte size.
Added testcases to cover stack arrays, heap pointers, struct members,
etc.
- Accept OffloadBinaries as valid images by plugins that support them in
the PluginInterface.
- Add support in L0 plugin to extract SPIRV images and their associated
metadata from an OffloadBinary image.
Depends on:
- #185663
Follow-up PRs:
- #185413 (Changes SPIRV wrapper generation to use OffloadBinary)
- #185425 (Adjusts llvm-objdump)
- #184774 (Adjusts llvm-offload-binary)
Summary:
As suggested in https://github.com/llvm/llvm-project/pull/177665, we
should build a GPU version of the compiler-rt profile library instead of
writing it in-line in the lowering. This PR does not define anything GPU
specific, it simply re-uses the baremetal handling. Later PRs will
prevent the GPU specific handling we would want to do to optimize
counter handling on the GPU.
Note that this will require using the cache file, or setting these
options
manually for existing users. Hopefully if people are using the cache
file
as they should it won't break anything.
As discussed in #185404 we might want to provide a way for plugins to
validate images not recognized by the common layer.
This PR adds such extension and uses it to validate pure SPIRV images by
the Level Zero plugin.