233 Commits

Author SHA1 Message Date
Kevin Sala Penades
00d5f660f4
[offload][CUDA] Fix DLWRAP for memory routines (#190500) 2026-04-04 19:29:25 -07:00
Matt Arsenault
6f68e58519
offload: Parse triple using to identify amdgcn-amd-amdhsa (#190319)
Avoid hardcoding the exact triple.
2026-04-03 23:22:48 +02:00
Joseph Huber
b2d3a6574c
[libc] Rename rpc::Status to rpc::RPCStatus to reduce conflicts (#190239)
Summary:
`Status` is unfortunately heavily overloaded in practice. Things like
X11 define it as a macro. Best to just remove that possibility entirely.
2026-04-02 14:55:57 -05:00
Leandro Lacerda
34028294e4
[Offload] Add support for measuring elapsed time between events (#186856)
This patch adds `olGetEventElapsedTime` to the new LLVM Offload API, as
requested in
[#185728](https://github.com/llvm/llvm-project/issues/185728), and adds
the corresponding support in `plugins-nextgen`.

A main motivation for this change is to make it possible to measure the
elapsed time of work submitted to a queue, especially kernel launches.
This is relevant to the intended use of the new Offload API for
microbenchmarking GPU libc math functions.

### Summary

The new API returns the elapsed time, in milliseconds, between two
events on the same device.

To support the common pattern `create start event → enqueue kernel →
create end event → sync end event → get elapsed time`, `olCreateEvent`
now always creates and records a backend event through the device
interface. For backends that materialize real event state, this gives
the event concrete backend state that can be used for elapsed-time
measurement. For backends that do not materialize backend event state,
`EventInfo` may still remain null and existing event operations continue
to treat such events as trivially complete.

Previously, an event created on an empty queue could be represented only
as a logical event. That representation was sufficient for sync and
completion queries, but it was not suitable for elapsed-time measurement
because there was no backend event state to timestamp. The new behavior
preserves the meaning of completion of prior work while also allowing
backends with timing support to attach real event state.

### Changes in `plugins-nextgen`

#### Common interface

Add elapsed-time support to the common device and plugin interfaces:

* `GenericPluginTy::get_event_elapsed_time`
* `GenericDeviceTy::getEventElapsedTime`
* `GenericDeviceTy::getEventElapsedTimeImpl`

#### AMDGPU

* Add the required ROCr declarations and wrappers.
* Enable queue profiling at queue creation time.
* Record events by enqueuing a real barrier marker packet on the stream.
* Retain the timing signal needed to query the recorded marker later.
* Implement `getEventElapsedTimeImpl` using
`hsa_amd_profiling_get_dispatch_time`, converting the result to
milliseconds with `HSA_SYSTEM_INFO_TIMESTAMP_FREQUENCY`.

This follows the ROCm/HIP approach of enabling queue profiling at HSA
queue creation time, while keeping the AMDGPU queue path simpler than
the lazy-enable alternative discussed during review.
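
The tick-to-millisecond conversion described above can be sketched as follows. This is only an illustration of the arithmetic, not the code from the patch: the helper name `ticksToMilliseconds` and its parameters are hypothetical, and the frequency is assumed to be in ticks per second, as reported by `HSA_SYSTEM_INFO_TIMESTAMP_FREQUENCY`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not from the patch): convert a pair of device
// timestamps, in ticks, to elapsed milliseconds. FrequencyHz is assumed to be
// the timestamp rate in ticks per second.
inline float ticksToMilliseconds(uint64_t StartTicks, uint64_t EndTicks,
                                 uint64_t FrequencyHz) {
  assert(FrequencyHz != 0 && "timestamp frequency must be non-zero");
  assert(EndTicks >= StartTicks && "end event must not precede start event");
  // Compute in double to avoid losing precision on large tick counts.
  double Seconds = static_cast<double>(EndTicks - StartTicks) /
                   static_cast<double>(FrequencyHz);
  return static_cast<float>(Seconds * 1000.0);
}
```

Doing the division in `double` before narrowing to `float` is a choice made for this sketch; the API itself only promises a result in milliseconds.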

#### CUDA

* Add the required CUDA driver declarations and wrappers.
* Implement `getEventElapsedTimeImpl` with `cuEventElapsedTime`.

#### Host

* Add `getEventElapsedTimeImpl` that stores `0.0f` in the output
pointer, when present, and returns success.

Reason: the host plugin does not materialize backend event state and
already treats event operations as trivially successful. Returning
`0.0f` preserves that model without introducing a new failure mode.

#### Level Zero

* Add `getEventElapsedTimeImpl`, but leave it unimplemented.

Reason: the Level Zero plugin currently does not provide standalone
backend event support for this event model. For example, `waitEventImpl`
/ `syncEventImpl` are still unimplemented there.

---------

Signed-off-by: Leandro Augusto Lacerda Campos <leandrolcampos@yahoo.com.br>
Signed-off-by: Leandro A. Lacerda Campos <leandrolcampos@yahoo.com.br>
2026-04-01 14:13:44 -05:00
Joseph Huber
15bfc06b6b
[Offload][NFC] Various minor changes to Offload CMake (#189029)
Summary:
Most of these just remove some redundancy or rename `openmp` ->
`offload` where the variable is purely internal.
2026-03-27 12:06:37 -05:00
Joseph Huber
ffd6a13b5f
[compiler-rt] Rework profile data handling for GPU targets (#187136)
Summary:
Currently, the GPU iterates through all of the present symbols and
copies them by prefix. This is inefficient as it requires a lot of small
high-latency data transfers rather than a few large ones. Additionally,
we force every single profiling symbol to have protected visibility.
This means potentially hundreds of unnecessary symbols in the symbol
table.

This PR changes the interface to move towards the start / stop section
handling. AMDGPU supports this natively as an ELF target, so we need
few changes. Instead of overriding visibility, we use a single table
to define the bounds that we can obtain with one contiguous load.

Using a table interface should also work for the in-progress HIP
implementation for this, as it wraps the start / stop sections into
standard void pointers which will be inside of an already mapped region
of memory, so they should be accessible from the HIP API.

NVPTX is more difficult as it is an ELF platform without this support. I
have hooked up the 'Other' handling to work around this, but even then
it's a bit of a stretch. I could remove this support here, but I wanted
to demonstrate that we can share the ABI. However, NVPTX will only work
if we force LTO and change the backend to emit variables in the same

TL;DR, we now do this:
```c
struct { start1, stop1, start2, stop2, start3, stop3, version; } device;
struct host = DtoH(lookup("device"));
counters = DtoH(host.stop - host.start)
version = DtoH(host.version);
```
2026-03-26 10:17:43 -05:00
Alex Duran
64e7c77e04
[OFFLOAD][L0] More error handling (#188496)
This PR improves cleanup/handling of errors in some memory operations,
allocating event pools, ...
2026-03-26 05:50:26 +01:00
fineg74
1dbf7c7e1b
[OFFLOAD] Improve resource management of the plugin (#187597)
This PR improves event management of the plugin by fixing potential
resource leaks and preventing a potential deadlock.
2026-03-25 09:50:38 +01:00
Alex Duran
e40062c0bd
[OFFLOAD][L0] Add support to run ctor/dtor code (#187510)
This PR adds support in the Level Zero plugin to execute
constructors/destructors on the device code. As spirv-link has some
limitations, it mimics the CUDA plugin behavior where the RTL constructs
the device side tables before invoking the kernel that will execute
them.

The kernel and other necessary symbols to create the device tables are
created by the SPIRVCtorDtorLowering pass to be added in #187509
2026-03-25 08:43:44 +01:00
Alex Duran
227bab0a62
[OFFLOAD][L0] Improve cleanup on errors (#188251)
Additional cleanup improvements on error conditions (in addition to
those in #187597):

  * Fixed incomplete cleanup in L0Context::init()
  * Fixed build log leak in addModule()
  * Fixed context inconsistent state in findDevices()

Disclaimer: The base of this PR was generated by Claude and adjusted by
me afterwards.
2026-03-24 15:36:01 +01:00
Joseph Huber
376874a345
[Offload] Fix destroying signal that was never initialized
Summary:
We create the RPC doorbell signal lazily and destroy it at the plugin
level. This means that we can't rely on the normal 'per-device' handling
so this needs to be called unconditionally. We only create the signal if
a device is registered, but deinit is called unconditionally. Just check
the handle.
2026-03-24 09:29:27 -05:00
Joseph Huber
4961700c10
[libc] Support AMDGPU device interrupts for the RPC interface (#188067)
Summary:
One of the main disadvantages to using the RPC interface is that it
requires a server thread to spin on the mailboxes checking for work.
The vast majority of the time, there will be no work and work will come
in large bursts.

The HSA / KFD interface supports device-side interrupts and already has
handling for binding these events to an HSA signal. This means that we
can send interrupts from the GPU to wake a sleeping thread on the CPU.
The sleeping thread will be descheduled with a blocking HSA wait call
and woken up when its event ID is raised through the kernel driver's
interrupt.

This is very target-specific handling, but I believe it is valuable
enough to warrant it being in the protocol. It is completely optional,
as it is ignored if uninitialized. This should bring this support to
parity with the interface HIP expects.
2026-03-24 08:48:52 -05:00
Joseph Huber
07896d44a3
[OpenMP] Emit aggregate kernel prototypes and remove libffi dependency (#186261)
Summary:
This PR changes the kernels emitted when targeting a CPU to take a
single pointer to a struct of arguments.

The old handling emitted a standard function prototype, which
necessitated a target-specific ABI to call it because the signature
differed with the number of arguments. Instead, this PR emits a void
pointer to a naturally aligned struct, which is what APIs like `pthreads`
expect.

This allows us to remove all the complexity around launching host
kernels and just pass the argument list.
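
A minimal sketch of the idea, assuming a made-up kernel and argument struct (`SaxpyArgs`, `saxpyKernel`, `HostKernelFn`, and `launchHostKernel` are all hypothetical names, not what Clang emits): every host kernel has the same `void (*)(void *)` shape, so the launcher never needs to know how many arguments there are.

```cpp
#include <cassert>

// Hypothetical argument block: every kernel receives one pointer to a
// naturally aligned struct instead of a variable-length parameter list.
struct SaxpyArgs {
  float A;
  const float *X;
  float *Y;
  int N;
};

// Uniform entry point: the signature no longer depends on the argument
// count, much like a pthread start routine taking `void *`.
using HostKernelFn = void (*)(void *);

void saxpyKernel(void *Raw) {
  auto *Args = static_cast<SaxpyArgs *>(Raw);
  for (int I = 0; I < Args->N; ++I)
    Args->Y[I] += Args->A * Args->X[I];
}

// Launching needs no per-signature ABI handling: just forward the pointer.
void launchHostKernel(HostKernelFn Kernel, void *Args) {
  assert(Kernel && "kernel must be non-null");
  Kernel(Args);
}
```

With this shape, a call such as `launchHostKernel(saxpyKernel, &Args)` works for any kernel, which is why a libffi-style dynamic call is no longer required.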
2026-03-20 13:08:23 -05:00
Bruce Changlong Xu
cbab7e65a7
[AMDGPU] Minor cleanups in offload plugin and AMDGPUEmitPrintf. NFC. (#187587)
Use empty() in assert, brace-init instead of std::make_pair in the
AMDGPU offload plugin, and fix a comment typo in AMDGPUEmitPrintf.
2026-03-19 18:16:47 -04:00
fineg74
2890f9883c
[OFFLOAD] Improve handling of synchronization errors in L0 plugin and reenable tests (#186927)
This change improves handling of errors during synchronization in Level
Zero plugin by ensuring cleanup of queues and events in case of an
synchronization error. As a result multiple tests stopped hanging.

---------

Co-authored-by: Duran, Alex <alejandro.duran@intel.com>
2026-03-18 05:50:06 +01:00
Joseph Huber
154a128c65
Reapply "[OpenMP] Move OpenMP implicit argument to the end and reformat" (#186309)
Should be working downstream now
This reverts commit 9b61ff210fdff752d5db55b128474e9990258488.
2026-03-13 15:48:37 -05:00
Piotr Balcer
1b9a4a0f72
[Offload][L0] clear completed events from a wait list (#186379)
Queue's WaitEvent collection wasn't being cleared after synchronization
and resetting of the events. This led to hangs on subsequent host
synchronizations if not preceded by any other operation.
2026-03-13 13:56:27 +00:00
theRonShark
9b61ff210f
Revert "[OpenMP] Move OpenMP implicit argument to the end and reformat" (#186309)
Reverts llvm/llvm-project#185989
2026-03-13 05:20:40 +00:00
Kevin Sala Penades
ac71b185c2
[offload] Remove LIBOMPTARGET_SHARED_MEMORY_SIZE envar (#186231)
This commit removes the `LIBOMPTARGET_SHARED_MEMORY_SIZE` envar and
outputs a runtime warning if it is defined. Access to dynamic shared memory
should be obtained through the `dyn_groupprivate` clause (OpenMP 6.1) or
the launch arguments in liboffload kernel launch.
2026-03-12 21:21:29 -07:00
Joseph Huber
4376fbd793
[OpenMP] Move OpenMP implicit argument to the end and reformat (#185989)
Summary:
We use this `dyn_ptr` argument in Clang/OpenMP to handle the
`KernelLaunchEnvironment`. This is a per-kernel argument used to share
some information. Currently, it's prepended to the argument list and we
generate storage for it in the runtime.

This is bad for a few reasons:
1. It changes the ABI by shifting user arguments
2. It cannot trivially be left uninitialized if unused
3. The runtime must allocate its own memory for it

This PR changes it to be appended instead. Additionally, space for this
is always emitted. This means the OMPIRBuilder itself will provide the
storage, we simply need to populate it in the runtime if it is used.
This means that if it's unused we don't always pay the cost and it's
easier for non-OpenMP users to ignore it.

Backward compatibility is maintained by auto-upgrading the kernel
arguments. In `libomptarget` we completely allocate a new buffer to
store this in the new format. The plugins still need to respect the old
ABI of the called device object, so we simply rotate it if it's the old
version.
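
One way to read the rotation for old device images, as a sketch only: the new layout has the implicit argument last, while an old image expects it first, so the list is rotated right by one. The helper name `rotateImplicitArgToFront` and the flat `void *` array are assumptions made for illustration, not the plugin's actual data structures.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper (names are illustrative): move the trailing implicit
// argument to the front while keeping the user arguments in order.
inline void rotateImplicitArgToFront(void **Args, std::size_t NumArgs) {
  assert((Args || NumArgs == 0) && "null argument list");
  if (NumArgs < 2)
    return; // Nothing to reorder.
  // std::rotate makes the element at `Args + NumArgs - 1` the new first one.
  std::rotate(Args, Args + NumArgs - 1, Args + NumArgs);
}
```

For an argument list `{A, B, Implicit}` this yields `{Implicit, A, B}`, which matches the prepended layout described at the top of this message.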
2026-03-12 18:08:22 -05:00
Kevin Sala Penades
1f583c6dee
[OpenMP][Offload] Add offload runtime support for dyn_groupprivate clause (#152831)
Part 3 adding offload runtime support. See
https://github.com/llvm/llvm-project/pull/152651.

---------

Co-authored-by: Krzysztof Parzyszek <Krzysztof.Parzyszek@amd.com>
2026-03-12 01:13:06 -07:00
Alex Duran
789fea83bb
[offload][l0][nfc] remove duplicated entry (#185855)
Remove a function left over by mistake from #185404
2026-03-11 11:55:30 +01:00
Alex Duran
3ff332ad0f
[Offload][L0] Add support for OffloadBinary format in L0 plugin (#185404)
- Accept OffloadBinaries as valid images by plugins that support them in
the PluginInterface.
- Add support in L0 plugin to extract SPIRV images and their associated
metadata from an OffloadBinary image.

Depends on:
- #185663

Follow-up PRs:
- #185413 (Changes SPIRV wrapper generation to use OffloadBinary)
- #185425 (Adjusts llvm-objdump)
- #184774 (Adjusts llvm-offload-binary)
2026-03-11 11:42:36 +01:00
Alex Duran
be021b8433
[OFFLOAD] Add interface to extend image validation (#185663)
As discussed in #185404 we might want to provide a way for plugins to
validate images not recognized by the common layer.

This PR adds such extension and uses it to validate pure SPIRV images by
the Level Zero plugin.
2026-03-10 18:41:23 +01:00
Joseph Huber
a9e457a82f
[Offload][AMDGPU] Fix RPC server on mixed w32 w64 workloads (#185496)
Summary:
This was a regression from the original LLVM-gpu-loader. We used to
handle `-mwavefrontsize64` correctly in the loader by over-allocating
memory and just leaving the upper 32 bits masked off. In order to handle
this in offload we need to scan loaded kernels to see how much memory we
need to allocate. This should be safe: the protocol is designed to
handle an arbitrary size, and in the worst case this just wastes space.
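
The scan can be pictured roughly like this. It is a sketch under assumptions: `KernelInfoSketch` and `maxLanesPerPort` are hypothetical names, and the real plugin obtains the wavefront size from the loaded image rather than from a plain struct.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical per-kernel metadata; the real plugin reads this from the image.
struct KernelInfoSketch {
  uint32_t WavefrontSize; // 32 or 64 lanes.
};

// Scan all loaded kernels and size the per-port buffer for the widest
// wavefront. A w32 kernel then simply leaves the upper lanes unused.
inline uint32_t maxLanesPerPort(const std::vector<KernelInfoSketch> &Kernels) {
  uint32_t Lanes = 32; // Assumed minimum when no kernel asks for more.
  for (const KernelInfoSketch &K : Kernels) {
    assert((K.WavefrontSize == 32 || K.WavefrontSize == 64) &&
           "unexpected wavefront size");
    Lanes = std::max(Lanes, K.WavefrontSize);
  }
  return Lanes;
}
```

Taking the maximum over-allocates for pure w32 workloads, which is the wasted space mentioned above.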
2026-03-09 17:13:59 -05:00
Łukasz Plewa
57614e8810
[OFFLOAD] Replace C-style casts with C++ style casts in obtainInfoImpl (#185023)
Replace C-style bool casts (bool)TmpInt with C++ functional casts
bool(TmpInt)
2026-03-06 10:28:38 -06:00
Hansang Bae
8f268e63e4
[Offload] Remove unused data type (#183840) 2026-02-27 15:46:59 -06:00
Hansang Bae
a347e1298c
[Offload] Enable memory usage printing with alloc debug type (#182938) 2026-02-23 17:19:41 -06:00
Jan Patrick Lehr
92447ed273
[Offload] Fix copy-elision warning (#182848)
This fixes a warning about a prohibited copy-elision due to the move of
a temporary object.
2026-02-23 13:58:07 +00:00
Alex Duran
7ed0aa2652
[OFFLOAD][L0] Remove leftover global constructor (#182611) (#182665)
fixes #182611
2026-02-21 18:09:46 +01:00
Joseph Huber
21b3461440
[flang-rt] Implement basic support for I/O from OpenMP GPU Offloading (#181039)
Summary:
This PR provides the minimal support for Fortran I/O coming from a GPU
in OpenMP offloading. We use the same support the `libc` uses for its
printing through the RPC server. The helper functions `rpc::dispatch`
and `rpc::invoke` help make this mostly automatic.

Because Fortran I/O is not reentrant, the vast majority of complexity
comes from needing to stitch together calls from the GPU until they can
be executed all at once. This is needed not only because of the
limitations of recursive I/O, but without this the output would all be
interleaved because of the GPU's lock-step execution.

As such, the return values from the intermediate functions are
meaningless, all returning true. The final value, however, is correct. For
cookies we create a context pointer on the server to chain these
together.

Works on both my AMD and NVIDIA GPUs.
```fortran
program hello_gpu
  implicit none

  !$omp target teams num_teams(1)
  !$omp parallel num_threads(2)
    ! Print strings
    print *, "Hello from GPU"
  !$omp end parallel
  !$omp end target teams

end program hello_gpu
```
```console
> flang hello.f90 -O2 -fopenmp --offload-arch=gfx1030 
> ./a.out 
 Hello from GPU
 Hello from GPU
> flang hello.f90 -O2 -fopenmp --offload-arch=sm_89  
> ./a.out 
 Hello from GPU
 Hello from GPU
```
2026-02-20 07:56:59 -06:00
Jan Patrick Lehr
e1e0e86e60
[Offload] Always check/consume Error (#182008)
This fixes an issue introduced in
https://github.com/llvm/llvm-project/pull/172226 where an llvm::Error is
not checked in the "good" code path.
2026-02-18 13:46:21 +01:00
fineg74
1c6d774baa
[OFFLOAD] Extend olMemRegister API to handle cases when a memory block may have been mapped outside of liboffload. (#172226)
This PR extends the liboffload olMemRegister API to handle the case where
a memory block may have been mapped before calling olMemRegister, to
support some use cases in libomptarget.
2026-02-17 20:53:00 +00:00
Joseph Huber
d85576d368
[libc] Replace RPC 'close()' mechanism with RAII handler (#181690)
Summary:
Closing ports was previously done manually. This made the protocol more
error-prone, as unclosed ports will leak and eventually the locks will
run out. I believe the original fear was that the RAII portion would
negatively impact code generation but I have not noticed anything
significant.
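
The general shape of such an RAII handler, as a sketch rather than the actual libc RPC code (`PortTable`, `ScopedPort`, and `usePort` are hypothetical stand-ins):

```cpp
#include <cassert>
#include <utility>

// Hypothetical stand-in for a port table; the real RPC types differ.
struct PortTable {
  int OpenPorts = 0;
  void open() { ++OpenPorts; }
  void close() { --OpenPorts; }
};

// RAII guard: the port is released in the destructor, so early returns or
// forgotten close() calls cannot leak it.
class ScopedPort {
  PortTable *Table;

public:
  explicit ScopedPort(PortTable &T) : Table(&T) { Table->open(); }
  ~ScopedPort() {
    if (Table)
      Table->close();
  }
  // Movable but not copyable, so exactly one owner closes the port.
  ScopedPort(ScopedPort &&Other) noexcept
      : Table(std::exchange(Other.Table, nullptr)) {}
  ScopedPort(const ScopedPort &) = delete;
  ScopedPort &operator=(const ScopedPort &) = delete;
  ScopedPort &operator=(ScopedPort &&) = delete;
};

// Returns the number of open ports observed while the guard is alive.
inline int usePort(PortTable &T) {
  ScopedPort Port(T);
  assert(T.OpenPorts > 0 && "port should be open inside the scope");
  return T.OpenPorts;
}
```

After `usePort` returns, the table is back to zero open ports without any explicit `close()` at the call site.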
2026-02-16 15:14:30 -06:00
fineg74
b58a31d3ce
[OFFLOAD] Add support for host offloading device (#177307)
The purpose of this PR is to add support for the host as an offloading
device in liboffload. Both OpenMP and SYCL support offloading to a host as
part of their normal workflow and therefore require such a capability from
the liboffload library.
2026-02-13 10:27:52 +01:00
Hansang Bae
0deb1b6e05
[Offload] Try to load Level Zero loader with version suffix (#180042)
The default Level Zero loader `libze_loader.so` may not be available on
systems that don't have the Level Zero development package. Level Zero
loaders with a major version suffix are searched in that case.
2026-02-11 15:13:26 -06:00
Alex Duran
8b9fd4803c
[OFFLOAD] Support host plugin on Windows (#180401)
Changes to make host plugin compile on Windows:
* Change IO code to be portable
* Adjust Makefiles

Allow plugin to work partially when libffi support is not found
dynamically (compilation works fine even on Windows because of the
wrapper support).
2026-02-11 08:54:47 +01:00
Joseph Huber
2f00977fea
[Offload] Make the RPC callbacks private to each running server (#178901)
Summary:
The static object mixes callbacks from different plugins because ever
since we moved to the object library target these are actually shared.
Just make it a member of the base class and make it a pointer set just
to do some basic deduplication.
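
Roughly, the per-server ownership and deduplication look like this. This sketch uses `std::set` and a made-up callback signature; `ServerSketch`, `RPCCallbackTy`, `handleA`, and `handleB` are hypothetical names, and the real code may use a different pointer-set type.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Hypothetical callback type; the real RPC callback signature may differ.
using RPCCallbackTy = unsigned (*)(void *Port, unsigned NumLanes);

// Two example handlers, used only to show the deduplication.
inline unsigned handleA(void *, unsigned) { return 0; }
inline unsigned handleB(void *, unsigned) { return 1; }

// Each server instance owns its callbacks, so plugins no longer share one
// static list. Storing raw function pointers in a set drops duplicates.
class ServerSketch {
  std::set<RPCCallbackTy> Callbacks;

public:
  // Returns true if the callback was newly inserted.
  bool registerCallback(RPCCallbackTy CB) {
    assert(CB && "callback must be non-null");
    return Callbacks.insert(CB).second;
  }
  std::size_t numCallbacks() const { return Callbacks.size(); }
};
```

Registering `handleA` twice on the same server leaves a single entry, and a second `ServerSketch` instance starts with an empty set.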
2026-02-06 08:28:57 -06:00
Alex Duran
4096cb6017
[OFFLOAD] Fix TARGET_NAME in plugins common code (#180151)
Unlike other names, it is set between quotes, which prevents our debug
macros from properly matching it.
2026-02-06 14:12:04 +01:00
Joseph Huber
1a86c146ae
[Offload] Add a function to register an RPC Server callback (#178774)
Summary:
We provide an RPC server to manage calls initiated by the device to run
on the host. This is very useful for the built-in handling we have,
however there are cases where we would want to extend this
functionality.

Cases like Fortran or MPI would be useful, but we cannot put references
to these in the core offloading runtime. This way, we can provide this
as a library interface that registers custom handlers for whatever code
people want.
2026-01-30 08:03:13 -06:00
Hansang Bae
85d64d1201
[Offload] Cast to void * in the debug message (#177019)
There are a few places where data types based on character array or
string are printed in the debug message while they do not represent
strings. Such expressions should be cast to `void *` unless they
represent actual strings. Change also includes casting from integral
type to pointer type when appropriate.
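
To illustrate why the cast matters, here is a small sketch; `formatAddress` is a hypothetical helper, not one of the debug macros touched by this change.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: format the address of a byte buffer. Without the cast
// to `const void *`, stream or "%s"-style printing would treat a `char *` as
// a NUL-terminated string and read the bytes instead of showing the address.
inline std::string formatAddress(const char *Buffer) {
  char Out[32];
  int Len = std::snprintf(Out, sizeof(Out), "%p",
                          static_cast<const void *>(Buffer));
  assert(Len > 0 && Len < static_cast<int>(sizeof(Out)));
  return std::string(Out, static_cast<std::size_t>(Len));
}
```

This is safe even when the buffer is not NUL-terminated, since only the pointer value is formatted; the exact text produced by `%p` is implementation-defined.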
2026-01-20 15:44:08 -06:00
fineg74
848d736e64
[OFFLOAD] Add asynchronous queue query API for libomptarget migration (#172231)
Add liboffload asynchronous queue query API for libomptarget migration

This PR adds the liboffload asynchronous queue query API that is needed to
make libomptarget use liboffload.
2026-01-20 10:53:32 -08:00
Hansang Bae
edd857aad8
[Offload] Remove unnecessary maybe_unused attribute (#175855)
The attribute is not necessary in the new debug messaging.
2026-01-15 14:31:58 -06:00
Hansang Bae
90b6d33755
[Offload] Small debug message fix in Level Zero plugin (#175958)
Do not include trailing zeros in the device name.
2026-01-14 09:42:19 -06:00
Alex Duran
efad3563ea
[OFFLOAD] Update CUDA and AMD plugins to new debug format (#175787) 2026-01-13 17:53:59 +01:00
Alex Duran
86e114a9b2
Revert "[OFFLOAD] Update CUDA and AMD plugins to new debug format" (#175786)
Reverts llvm/llvm-project#175757
2026-01-13 17:13:46 +01:00
Alex Duran
7c2f49373b
[OFFLOAD] Update CUDA and AMD plugins to new debug format (#175757)
This should be the last step before completely removing the DP macro.
2026-01-13 17:06:35 +01:00
Hansang Bae
13cd7003ad
[NFC][Offload] Rename a function (#175673)
Renamed a function as suggested in #175664.
2026-01-12 19:40:17 -06:00
Hansang Bae
496729fe7e
[Offload] Fix level_zero plugin build (#175664)
Build has been broken when OMPTARGET_DEBUG is undefined.
2026-01-12 16:53:23 -06:00
Hansang Bae
dae3b49cba
[Offload] Update debug message printing in the plugins (#175205)
* Prepare a set of debug types in llvm::offload::debug to be used in
plugin code
* Update debug messages in the plugins
2026-01-12 14:26:43 -06:00