llvm-project

Author	SHA1	Message	Date
Leandro Lacerda	34028294e4	[Offload] Add support for measuring elapsed time between events (#186856 ) This patch adds `olGetEventElapsedTime` to the new LLVM Offload API, as requested in [#185728](https://github.com/llvm/llvm-project/issues/185728), and adds the corresponding support in `plugins-nextgen`. A main motivation for this change is to make it possible to measure the elapsed time of work submitted to a queue, especially kernel launches. This is relevant to the intended use of the new Offload API for microbenchmarking GPU libc math functions. ### Summary The new API returns the elapsed time, in milliseconds, between two events on the same device. To support the common pattern `create start event → enqueue kernel → create end event → sync end event → get elapsed time`, `olCreateEvent` now always creates and records a backend event through the device interface. For backends that materialize real event state, this gives the event concrete backend state that can be used for elapsed-time measurement. For backends that do not materialize backend event state, `EventInfo` may still remain null and existing event operations continue to treat such events as trivially complete. Previously, an event created on an empty queue could be represented only as a logical event. That representation was sufficient for sync and completion queries, but it was not suitable for elapsed-time measurement because there was no backend event state to timestamp. The new behavior preserves the meaning of completion of prior work while also allowing backends with timing support to attach real event state. ### Changes in `plugins-nextgen` #### Common interface Add elapsed-time support to the common device and plugin interfaces: * `GenericPluginTy::get_event_elapsed_time` * `GenericDeviceTy::getEventElapsedTime` * `GenericDeviceTy::getEventElapsedTimeImpl` #### AMDGPU * Add the required ROCr declarations and wrappers. * Enable queue profiling at queue creation time. 
* Record events by enqueuing a real barrier marker packet on the stream. * Retain the timing signal needed to query the recorded marker later. * Implement `getEventElapsedTimeImpl` using `hsa_amd_profiling_get_dispatch_time`, converting the result to milliseconds with `HSA_SYSTEM_INFO_TIMESTAMP_FREQUENCY`. This follows the ROCm/HIP approach of enabling queue profiling at HSA queue creation time, while keeping the AMDGPU queue path simpler than the lazy-enable alternative discussed during review. #### CUDA * Add the required CUDA driver declarations and wrappers. * Implement `getEventElapsedTimeImpl` with `cuEventElapsedTime`. #### Host * Add `getEventElapsedTimeImpl` that stores `0.0f` in the output pointer, when present, and returns success. Reason: the host plugin does not materialize backend event state and already treats event operations as trivially successful. Returning `0.0f` preserves that model without introducing a new failure mode. #### Level Zero * Add `getEventElapsedTimeImpl`, but leave it unimplemented. Reason: the Level Zero plugin currently does not provide standalone backend event support for this event model. For example, `waitEventImpl` / `syncEventImpl` are still unimplemented there. --------- Signed-off-by: Leandro Augusto Lacerda Campos <leandrolcampos@yahoo.com.br> Signed-off-by: Leandro A. Lacerda Campos <leandrolcampos@yahoo.com.br>	2026-04-01 14:13:44 -05:00
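The millisecond conversion mentioned for AMDGPU above can be sketched as follows. This is a minimal illustration, assuming the timestamp frequency is reported in Hz; `ticksToMilliseconds` is a hypothetical helper and does not come from the patch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of the patch): converts a pair of device
// timestamps, in ticks, to elapsed milliseconds.
// Assumption: FrequencyHz is the tick rate in Hz, as a runtime might obtain
// from a query such as HSA_SYSTEM_INFO_TIMESTAMP_FREQUENCY.
inline float ticksToMilliseconds(uint64_t StartTicks, uint64_t EndTicks,
                                 uint64_t FrequencyHz) {
  assert(FrequencyHz != 0 && EndTicks >= StartTicks);
  // Do the arithmetic in double to limit rounding error, then narrow.
  double Millis = static_cast<double>(EndTicks - StartTicks) * 1000.0 /
                  static_cast<double>(FrequencyHz);
  return static_cast<float>(Millis);
}
```

With a 1 MHz tick rate, a 2000-tick interval corresponds to 2 ms.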
Johannes Doerfert	9f4636210d	[Offload] Fix type mismatch by using `uint64_t` instead of `size_t` (#183375 ) The variant uses uint64_t, so should the get.	2026-02-25 13:31:03 -08:00
Joseph Huber	6282a7b993	[Offload] Fix missing end to string in .td file	2026-02-17 15:32:17 -06:00
fineg74	1c6d774baa	[OFFLOAD] Extend olMemRegister API to handle cases when a memory block may have been mapped outside of liboffload. (#172226 ) This PR extends the liboffload olMemRegister API to handle the case where a memory block may have been mapped before olMemRegister is called, to support some use cases in libomptarget.	2026-02-17 20:53:00 +00:00
Joseph Huber	d62cd1b89d	[Offload] Add argument to 'olInit' for global configuration options (#181872 ) Summary: This PR adds a pointer argument to the initialization routine to be used for global options. Right now this is used to allow the user to constrain which backends they wish to use. If a null argument is passed, the same behavior as before is observed. This is expected to be extensible by forcing the user to encode the size of the struct. So, old executables will encode which fields they have access to. We use a macro helper to get this struct rather than a runtime call so that the current state of the size is baked into the executable rather than something looked up by the runtime. Otherwise it would just return the size that the (potentially newer) runtime would see.	2026-02-17 14:04:00 -06:00
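The size-encoding idea in the `olInit` commit above can be sketched as below. This is only an illustration of the general pattern under stated assumptions: `InitArgs`, `LegacyInitArgs`, `INIT_ARGS_DEFAULT` and `readInitArgs` are hypothetical names, not the actual liboffload definitions.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical "current" options struct; the first field records the size
// the caller saw at compile time.
struct InitArgs {
  uint64_t Size;
  uint64_t BackendMask; // Present since the first version.
  uint64_t Flags;       // Added in a later version.
};

// Hypothetical layout that an older executable was compiled against.
struct LegacyInitArgs {
  uint64_t Size;
  uint64_t BackendMask;
};

// A macro expands in the caller's translation unit, so sizeof(InitArgs) is
// baked into the executable instead of being looked up in the runtime.
#define INIT_ARGS_DEFAULT {sizeof(InitArgs), 0, 0}

// Runtime side: copy only the bytes the caller claims to have, and leave
// newer fields at their defaults. A null pointer keeps all defaults.
inline InitArgs readInitArgs(const void *Raw) {
  InitArgs Out = INIT_ARGS_DEFAULT;
  if (!Raw)
    return Out;
  uint64_t CallerSize = 0;
  std::memcpy(&CallerSize, Raw, sizeof(CallerSize));
  assert(CallerSize >= sizeof(uint64_t) && "struct must hold its size field");
  uint64_t N = CallerSize < sizeof(InitArgs) ? CallerSize : sizeof(InitArgs);
  std::memcpy(&Out, Raw, N);
  Out.Size = N;
  return Out;
}
```

An executable built against the older layout passes a smaller `Size`, so the runtime never reads past the fields that executable knows about.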
fineg74	b58a31d3ce	[OFFLOAD] Add support for host offloading device (#177307 ) The purpose of this PR is to add support for the host as an offloading device in liboffload. Both OpenMP and SYCL support offloading to the host as part of their normal workflow and therefore require this capability from the liboffload library.	2026-02-13 10:27:52 +01:00
Joseph Huber	1a86c146ae	[Offload] Add a function to register an RPC Server callback (#178774 ) Summary: We provide an RPC server to manage calls initiated by the device to run on the host. This is very useful for the built-in handling we have, however there are cases where we would want to extend this functionality. Cases like Fortran or MPI would be useful, but we cannot put references to these in the core offloading runtime. This way, we can provide this as a library interface that registers custom handlers for whatever code people want.	2026-01-30 08:03:13 -06:00
puneeth_aditya_5656	ed4aab07ec	[Offload][AMDGPU] Fix olQueryQueue uninitialized output parameter (#178464 ) ## Summary - Fix uninitialized output parameter in `olQueryQueue_impl` when `Queue->AsyncInfo->Queue` is null - Set `IsQueueWorkCompleted` to `true` when no underlying queue exists (no pending work) - Resolves test failure on AMDGPU for `olQueryQueueTest.SuccessEmptyAsyncQueueCheckResult` Fixes #178462. ## Test plan - [x] Fixed `OffloadAPI/queue.unittests/olQueryQueueTest/SuccessEmptyAsyncQueueCheckResult/AMDGPU_AMD_Radeon_RX_7700_XT_0` test - [ ] CI tests pass --------- Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com> Co-authored-by: Joseph Huber <huberjn@outlook.com>	2026-01-28 20:49:19 -06:00
fineg74	51e2c82023	[OFFLOAD] Add a check before calling dataExchange (#176853 ) Per the documentation, a call to the dataExchange API (which moves a memory block between different devices) is permitted only if the isDataExchangable() call returned true. While almost all platforms support memory transfers between different devices, attempting a transfer between devices that belong to different platforms on the same machine can lead to unexpected results. This PR adds a check for whether dataExchange can be called and, if not, works around it by routing the memory transfer through the host.	2026-01-20 11:37:35 -08:00
fineg74	848d736e64	[OFFLOAD] Add asynchronous queue query API for libomptarget migration (#172231 ) This PR adds the liboffload asynchronous queue query API that is needed to make libomptarget use liboffload.	2026-01-20 10:53:32 -08:00
fineg74	1232599032	[OFFLOAD] Add memory data locking API for libomptarget migration (#173138 ) This PR adds the liboffload memory data locking API that is needed to make libomptarget use liboffload.	2026-01-12 13:07:57 -06:00
Alex Duran	ae739a240c	[OFFLOAD] Recognize level_zero backend in liboffload (#172818 ) The code to recognize the level_zero plugin as a liboffload backend was split from #158900. This PR adds the support back. --------- Co-authored-by: Alexey Sachkov <alexey.sachkov@intel.com> Co-authored-by: Nick Sarnie <nick.sarnie@intel.com> Co-authored-by: Joseph Huber <huberjn@outlook.com>	2025-12-18 15:31:36 +00:00
Kevin Sala Penades	1a86f0aae7	[Offload] Add device info for shared memory (#167817 )	2025-11-13 11:00:12 -08:00
Robert Imschweiler	dc94f2cbad	[Offload] Add device UID (#164391 ) Introduced in OpenMP 6.0, the device UID shall be a unique identifier of a device on a given system. (Not necessarily a UUID.) Since it is not guaranteed that the (U)UIDs defined by the device vendor libraries, such as HSA, do not overlap with those of other vendors, the device UIDs in offload are always combined with the offload plugin name. In case the vendor library does not specify any device UID for a given device, we fall back to the offload-internal device ID. The device UID can be retrieved using the `llvm-offload-device-info` tool.	2025-11-04 20:15:47 +01:00
Joseph Huber	227bc5786f	Revert "[Offload] Lazily initialize platforms in the Offloading API" (#163272 ) Summary: This causes issues with CUDA's teardown order when the init is separated from the total init scope.	2025-10-14 12:46:55 -05:00
Joseph Huber	4a35c4d38a	[Offload] Lazily initialize platforms in the Offloading API (#163272 ) Summary: The Offloading library wraps around the underlying plugins. The problem is that we currently initialize all plugins we find, even if they are not needed for the program. This is very expensive for trivial uses, as fully heterogeneous usage is quite rare. In practice this means that you will always pay a 200 ms penalty for having CUDA installed. This patch changes the behavior to provide accessors into the plugins and devices that allow them to be initialized lazily. We use a once_flag; this should properly take a fast-path check while still blocking on concurrent use. Making full use of this will require a way to filter platforms more specifically. I'm thinking of what this would look like as an API. I'm thinking that we either have an extra iterate function that takes a callback on the platform, or we just provide a helper to find all the devices that can run a given image. Maybe both? Fixes: https://github.com/llvm/llvm-project/issues/159636	2025-10-14 09:35:53 -05:00
Joseph Huber	095877c12e	[Offload] Fix isValidBinary segfault on host platform Summary: Need to verify this actually has a device. We really need to rework this to point to a real implementation, or streamline it to handle this automatically.	2025-10-06 14:46:50 -05:00
Piotr Balcer	23d08af3d4	[Offload][NFC] use unique ptrs for platforms (#160888 ) Currently, devices store a raw pointer to back to their owning Platform. Platforms are stored directly inside of a vector. Modifying this vector risks invalidating all the platform pointers stored in devices. This patch allocates platforms individually, and changes devices to store a reference to its platform instead of a pointer. This is safe, because platforms are guaranteed to outlive the devices they contain.	2025-09-29 07:10:26 -05:00
Ross Brunton	ea0e5185e2	[Offload] Add olGetMemInfo with platform-less API (#159581 )	2025-09-24 12:17:57 +01:00
Joseph Huber	204580aa8e	[Offload] Don't add the unsupported host plugin to the list (#159642 ) Summary: The host plugin is basically OpenMP specific and doesn't work very well. Previously we were skipping over it in the list instead of just not adding it at all.	2025-09-23 08:31:35 -05:00
Ross Brunton	fcebe6bdbb	[Offload] Re-allocate overlapping memory (#159567 ) If olMemAlloc happens to allocate memory that was already allocated elsewhere (possibly by another device on another platform), it is now thrown away and a new allocation generated. A new `AllocBases` vector is now available, which is an ordered list of allocation start addresses.	2025-09-23 13:59:52 +01:00
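One way an ordered list of allocation start addresses can support the overlap check described above is sketched below. This is an assumption-laden illustration: `AllocTracker` and its members are hypothetical and are not the data structures used in liboffload.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <map>
#include <vector>

// Hypothetical tracker: Bases is kept sorted so neighbours can be found with
// a binary search; Sizes maps each base address to its allocation size.
struct AllocTracker {
  std::vector<uintptr_t> Bases;
  std::map<uintptr_t, std::size_t> Sizes;

  // Returns true if [Start, Start + Size) intersects a tracked allocation.
  bool overlaps(uintptr_t Start, std::size_t Size) const {
    auto Next = std::upper_bound(Bases.begin(), Bases.end(), Start);
    // The next allocation begins before the candidate range ends.
    if (Next != Bases.end() && *Next < Start + Size)
      return true;
    // The previous allocation (base <= Start) extends past Start.
    if (Next != Bases.begin()) {
      uintptr_t Prev = *std::prev(Next);
      if (Prev + Sizes.at(Prev) > Start)
        return true;
    }
    return false;
  }

  // Records the allocation unless it overlaps; the caller would then discard
  // the overlapping block and allocate again.
  bool insert(uintptr_t Start, std::size_t Size) {
    if (overlaps(Start, Size))
      return false;
    Bases.insert(std::upper_bound(Bases.begin(), Bases.end(), Start), Start);
    Sizes[Start] = Size;
    return true;
  }
};
```

Because the list is sorted, only the two neighbouring allocations need to be inspected for each candidate range.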
Joseph Huber	51e3c3d51b	[Offload] Implement 'olIsValidBinary' in offload and clean up (#159658 ) Summary: This exposes the 'isDeviceCompatible' routine for checking if a binary can be loaded. This is useful if people don't want to consume errors everywhere when figuring out which image to put to what device. I don't know if this is a good name, I was thinking like `olIsCompatible` or whatever. Let me know what you think. Long term I'd like to be able to do something similar to what OpenMP does where we can conditionally only initialize devices if we need them. That's going to be support needed if we want this to be more generic.	2025-09-19 12:15:57 -05:00
Joseph Huber	e7101dac9c	[Offload] Copy loaded images into managed storage (#158748 ) Summary: Currently we have this `__tgt_device_image` indirection which just takes a reference to some pointers. This was all fine and good when the only usage of this was from a section of GPU code that came from an ELF constant section. However, we have expanded beyond that and now need to worry about managing lifetimes. We have code that references the image even after it was loaded internally. This patch changes the implementation to instead copy the memory buffer and manage it locally. This PR reworks the JIT and other image handling to directly manage its own memory. We now don't need to duplicate this behavior externally at the Offload API level. Also we actually free these if the user unloads them. Upside, less likely to crash and burn. Downside, more latency when loading an image.	2025-09-16 08:57:28 -05:00
Ross Brunton	ffb756dff2	[Offload] Add `OL_DEVICE_INFO_MAX_WORK_SIZE[_PER_DIMENSION]` (#155823 ) This is the total number of work items that the device supports (the equivalent work group properties are for only a single work group).	2025-08-29 09:39:18 +01:00
Ross Brunton	9e5d8bd3d1	[Offload] Improve `olDestroyQueue` logic (#153041 ) Previously, `olDestroyQueue` would not actually destroy the queue, instead leaving it for the device to clean up when it was destroyed. Now, the queue is either released immediately if it is complete or put into a list of "pending" queues if it is not. Whenever we create a new queue, we check this list to see if any are now completed. If there are any we release their resources and use them instead of pulling from the pool. This prevents long running programs that create and drop many queues without syncing them from leaking memory all over the place.	2025-08-29 09:39:00 +01:00
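The destroy-then-reuse policy described above can be sketched roughly as follows. This is a simplified model under stated assumptions: `Queue`, `QueueManager` and the `Complete` flag are hypothetical stand-ins, and a real backend would query completion from the device rather than read a boolean.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a device queue.
struct Queue {
  int Id = 0;
  bool Complete = true; // In a real backend this would be a device query.
};

class QueueManager {
  std::vector<std::unique_ptr<Queue>> Pending; // Destroyed but unfinished.
  int NextId = 0;

public:
  // Prefer recycling a pending queue whose work has finished; otherwise
  // hand out a fresh one.
  std::unique_ptr<Queue> create() {
    for (auto It = Pending.begin(); It != Pending.end(); ++It) {
      if ((*It)->Complete) {
        std::unique_ptr<Queue> Reused = std::move(*It);
        Pending.erase(It);
        return Reused;
      }
    }
    auto Fresh = std::make_unique<Queue>();
    Fresh->Id = NextId++;
    return Fresh;
  }

  // Release a completed queue immediately; park an unfinished one.
  void destroy(std::unique_ptr<Queue> Q) {
    assert(Q && "destroy requires a live queue");
    if (!Q->Complete)
      Pending.push_back(std::move(Q));
    // Otherwise Q is freed when it goes out of scope here.
  }

  std::size_t pendingCount() const { return Pending.size(); }
};
```

Checking the pending list on every `create` bounds the number of parked queues by the number that are still running, which is what prevents the leak in long-running programs.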
Ross Brunton	41fed2d048	[Offload] Add PRODUCT_NAME device info (#155632 ) On my system, this will be "Radeon RX 7900 GRE" rather than "gfx1100". For Nvidia, the product name and device name are identical.	2025-08-28 15:16:17 +01:00
Callum Fare	77c5a6506f	[Offload] Fix definition of olMemFill (#154947 ) Fix regression introduced by #154102 - the way offload-tblgen handles names has changed	2025-08-22 14:48:00 +01:00
Callum Fare	0b18d2da70	[Offload] Implement olMemFill (#154102 ) Implement olMemFill to support filling device memory with arbitrary length patterns. AMDGPU support will be added in a follow-up PR.	2025-08-22 14:31:16 +01:00
Ross Brunton	4c0c295775	[Offload] `OL_EVENT_INFO_IS_COMPLETE` (#153194 ) A simple info query for events that returns whether the event is complete or not.	2025-08-22 13:40:31 +01:00
Ross Brunton	17dbb92612	[Offload][NFC] Use tablegen names rather than `name` parameter for API (#154736 )	2025-08-22 11:13:57 +01:00
Ross Brunton	2e74cc6c04	[Offload][NFC] Use a sensible order for APIGen (#154518 ) The order in which entries in the tablegen API files are iterated is not the order in which they appear in the file. To avoid any issues with the order changing in future, we now generate all definitions of a certain class before any class that can use them. This is an NFC; the definitions don't actually change, just the order in which they appear in the OffloadAPI.h header.	2025-08-21 09:38:21 +01:00
Ross Brunton	273ca1f77b	[Offload] Fix `OL_DEVICE_INFO_MAX_MEM_ALLOC_SIZE` on AMD (#154521 ) This wasn't handled with the normal info API, so needs special handling.	2025-08-21 09:37:58 +01:00
Ross Brunton	c8986d1ecb	[Offload] Guard olMemAlloc/Free with a mutex (#153786 ) Both these functions update an `AllocInfoMap` structure in the context, however they did not use any locks, causing random failures in threaded code. Now they use a mutex.	2025-08-20 13:23:57 +01:00
Ross Brunton	2c11a83691	[Offload] Add olCalculateOptimalOccupancy (#142950 ) This is equivalent to `cuOccupancyMaxPotentialBlockSize`. It is currently only implemented on Cuda; AMDGPU and Host return unsupported. --------- Co-authored-by: Callum Fare <callum@codeplay.com>	2025-08-19 15:16:47 +01:00
Rafal Bielski	9c9d9e4cb6	[Offload] Define additional device info properties (#152533 ) Add the following properties in Offload device info: * VENDOR_ID * NUM_COMPUTE_UNITS * [SINGLE\|DOUBLE\|HALF]_FP_CONFIG * NATIVE_VECTOR_WIDTH_[CHAR\|SHORT\|INT\|LONG\|FLOAT\|DOUBLE\|HALF] * MAX_CLOCK_FREQUENCY * MEMORY_CLOCK_RATE * ADDRESS_BITS * MAX_MEM_ALLOC_SIZE * GLOBAL_MEM_SIZE Add a bitfield option to enumerators, allowing the values to be bit-shifted instead of incremented. Generate the per-type enums using `foreach` to reduce code duplication. Use macros in unit test definitions to reduce code duplication.	2025-08-19 13:02:01 +01:00
Ross Brunton	30c7951136	[Offload] `olLaunchHostFunction` (#152482 ) Add an `olLaunchHostFunction` method that allows enqueueing host work to the stream.	2025-08-15 09:39:48 +01:00
Ross Brunton	3e9f29cfee	[Offload] Store globals in the program's global list rather than the kernel list (#153441 )	2025-08-13 17:18:25 +01:00
Ross Brunton	910d7e90bf	[Offload] Make olLaunchKernel test thread safe (#149497 ) This sprinkles a few mutexes around the plugin interface so that the olLaunchKernel CTS test now passes when run on multiple threads. Part of this also involved changing the interface for device synchronise so that it can optionally not free the underlying queue (which introduced a race condition in liboffload).	2025-08-08 10:57:04 +01:00
Ross Brunton	197d1c1570	[Offload] OL_QUEUE_INFO_EMPTY (#152473 ) Add a queue query that (if possible) reports whether the queue is empty	2025-08-08 10:20:45 +01:00
Ross Brunton	a44532544b	[Offload] Don't create events for empty queues (#152304 ) Add a device function to check if a device queue is empty. If liboffload tries to create an event for an empty queue, we create an "empty" event that is already complete. This allows `olCreateEvent`, `olSyncEvent` and `olWaitEvent` to run quickly for empty queues.	2025-08-07 10:16:33 +01:00
Ross Brunton	ca13c44bbc	[NFC][Offload] Clarify `olDestroyQueue` (#152132 ) This has no code changes.	2025-08-06 15:34:31 +01:00
Ross Brunton	d03692a00e	[Offload] Rework `MAX_WORK_GROUP_SIZE` (#151926 ) `MAX_WORK_GROUP_SIZE` now represents the maximum total number of work groups the device can allocate, rather than the maximum per dimension. `MAX_WORK_GROUP_SIZE_PER_DIMENSION` has been added, which has the old behaviour.	2025-08-04 15:21:24 +01:00
Ross Brunton	adb2421202	[Offload] Refactor device information queries to use new tagging (#147318 ) Instead of using strings to look up device information (which is brittle and slow), use the new tags that the plugins specify when building the nodes.	2025-07-25 14:51:51 +01:00
Ross Brunton	690c3ee5be	[Offload] Replace "EventOut" parameters with `olCreateEvent` (#150217 ) Rather than having every "enqueue"-type function have an output pointer specifically for an output event, just provide an `olCreateEvent` entrypoint which pushes an event to the queue. For example, replace: ```cpp olMemcpy(Queue, ..., EventOut); ``` with ```cpp olMemcpy(Queue, ...); olCreateEvent(Queue, EventOut); ```	2025-07-24 14:31:06 +01:00
Ross Brunton	081b74caf5	[Offload] Add olWaitEvents (#150036 ) This function causes a queue to wait until all the provided events have completed before running any future scheduled work.	2025-07-23 14:12:16 +01:00
Ross Brunton	2726b7fb1c	[Offload] Rename olWaitEvent/Queue to olSyncEvent/Queue (#150023 ) This more closely matches the nomenclature used by CUDA, AMDGPU and the plugin interface.	2025-07-23 10:52:13 +01:00
Ross Brunton	55b417a75f	[Offload] Cache symbols in program (#148209 ) When creating a new symbol, check whether it already exists. If it does, return that pointer rather than building a new symbol structure.	2025-07-16 18:32:47 +01:00
Callum Fare	47c9609a86	[Offload] Check plugins aren't already deinitialized when tearing down (#148642 ) This is a hotfix for #148615 - it fixes the issue for me locally. I think a broader issue is that in the test environment we're calling olShutDown from a global destructor in the test binaries. We should do something more controlled, either calling olInit/olShutDown in every test, or move those to a GTest global environment. I didn't do that originally because it looked like it needed changes to LLVM's GTest wrapper.	2025-07-14 16:17:10 +01:00
Ross Brunton	2fdeeefacf	[Offload] Add global variable address/size queries (#147972 ) Add two new symbol info types for getting the bounds of a global variable, as well as a number of tests for reading from and writing to it.	2025-07-11 16:12:48 +01:00
Ross Brunton	84e15d08c2	[Offload] Add `olGetSymbolInfo[Size]` (#147962 ) This mirrors the similar functions for other handles. The only implemented info at the moment is the symbol's kind.	2025-07-11 15:29:53 +01:00


94 Commits