llvm-project

Author	SHA1	Message	Date
Kevin Sala Penades	00d5f660f4	[offload][CUDA] Fix DLWRAP for memory routines (#190500 )	2026-04-04 19:29:25 -07:00
Leandro Lacerda	34028294e4	[Offload] Add support for measuring elapsed time between events (#186856 ) This patch adds `olGetEventElapsedTime` to the new LLVM Offload API, as requested in [#185728](https://github.com/llvm/llvm-project/issues/185728), and adds the corresponding support in `plugins-nextgen`. A main motivation for this change is to make it possible to measure the elapsed time of work submitted to a queue, especially kernel launches. This is relevant to the intended use of the new Offload API for microbenchmarking GPU libc math functions. ### Summary The new API returns the elapsed time, in milliseconds, between two events on the same device. To support the common pattern `create start event → enqueue kernel → create end event → sync end event → get elapsed time`, `olCreateEvent` now always creates and records a backend event through the device interface. For backends that materialize real event state, this gives the event concrete backend state that can be used for elapsed-time measurement. For backends that do not materialize backend event state, `EventInfo` may still remain null and existing event operations continue to treat such events as trivially complete. Previously, an event created on an empty queue could be represented only as a logical event. That representation was sufficient for sync and completion queries, but it was not suitable for elapsed-time measurement because there was no backend event state to timestamp. The new behavior preserves the meaning of completion of prior work while also allowing backends with timing support to attach real event state. ### Changes in `plugins-nextgen` #### Common interface Add elapsed-time support to the common device and plugin interfaces: * `GenericPluginTy::get_event_elapsed_time` * `GenericDeviceTy::getEventElapsedTime` * `GenericDeviceTy::getEventElapsedTimeImpl` #### AMDGPU * Add the required ROCr declarations and wrappers. * Enable queue profiling at queue creation time. * Record events by enqueuing a real barrier marker packet on the stream. * Retain the timing signal needed to query the recorded marker later. * Implement `getEventElapsedTimeImpl` using `hsa_amd_profiling_get_dispatch_time`, converting the result to milliseconds with `HSA_SYSTEM_INFO_TIMESTAMP_FREQUENCY`. This follows the ROCm/HIP approach of enabling queue profiling at HSA queue creation time, while keeping the AMDGPU queue path simpler than the lazy-enable alternative discussed during review. #### CUDA * Add the required CUDA driver declarations and wrappers. * Implement `getEventElapsedTimeImpl` with `cuEventElapsedTime`. #### Host * Add `getEventElapsedTimeImpl` that stores `0.0f` in the output pointer, when present, and returns success. Reason: the host plugin does not materialize backend event state and already treats event operations as trivially successful. Returning `0.0f` preserves that model without introducing a new failure mode. #### Level Zero * Add `getEventElapsedTimeImpl`, but leave it unimplemented. Reason: the Level Zero plugin currently does not provide standalone backend event support for this event model. For example, `waitEventImpl` / `syncEventImpl` are still unimplemented there. --------- Signed-off-by: Leandro Augusto Lacerda Campos <leandrolcampos@yahoo.com.br> Signed-off-by: Leandro A. Lacerda Campos <leandrolcampos@yahoo.com.br>	2026-04-01 14:13:44 -05:00
Kevin Sala Penades	ac71b185c2	[offload] Remove LIBOMPTARGET_SHARED_MEMORY_SIZE envar (#186231 ) This commit removes the `LIBOMPTARGET_SHARED_MEMORY_SIZE` envar and outputs a runtime warning if it is defined. Access to dynamic shared memory should be obtained through the `dyn_groupprivate` clause (OpenMP 6.1) or the launch arguments in liboffload kernel launch.	2026-03-12 21:21:29 -07:00
Kevin Sala Penades	1f583c6dee	[OpenMP][Offload] Add offload runtime support for dyn_groupprivate clause (#152831 ) Part 3 adding offload runtime support. See https://github.com/llvm/llvm-project/pull/152651. --------- Co-authored-by: Krzysztof Parzyszek <Krzysztof.Parzyszek@amd.com>	2026-03-12 01:13:06 -07:00
Łukasz Plewa	57614e8810	[OFFLOAD] Replace C-style casts with C++ style casts in obtainInfoImpl (#185023 ) Replace C-style bool casts (bool)TmpInt with C++ functional casts bool(TmpInt)	2026-03-06 10:28:38 -06:00
fineg74	848d736e64	[OFFLOAD] Add asynchronous queue query API for libomptarget migration (#172231 ) Add liboffload asynchronous queue query API for libomptarget migration This PR adds the liboffload asynchronous queue query API that is needed to make libomptarget use liboffload	2026-01-20 10:53:32 -08:00
Alex Duran	efad3563ea	[OFFLOAD] Update CUDA and AMD plugins to new debug format (#175787 )	2026-01-13 17:53:59 +01:00
Alex Duran	86e114a9b2	Revert "[OFFLOAD] Update CUDA and AMD plugins to new debug format" (#175786 ) Reverts llvm/llvm-project#175757	2026-01-13 17:13:46 +01:00
Alex Duran	7c2f49373b	[OFFLOAD] Update CUDA and AMD plugins to new debug format (#175757 ) This should be the last step before completely removing the DP macro.	2026-01-13 17:06:35 +01:00
Alex Duran	dbd52bd558	[OFFLOAD][OpenMP] Remove old style REPORT support (#175607 ) Fix the few remaining usages and remove the support for the old REPORT macro.	2026-01-12 19:48:40 +01:00
Kevin Sala Penades	35315a84b4	[offload] Fix CUDA args size by subtracting tail padding (#172249 ) This commit makes the cuLaunchKernel call to pass the total arguments size without tail padding.	2025-12-14 21:57:25 -08:00
Kevin Sala Penades	1a86f0aae7	[Offload] Add device info for shared memory (#167817 )	2025-11-13 11:00:12 -08:00
Joseph Huber	aaddd8d38a	[OpenMP] Fix tests relying on the heap size variable Summary: I made that an unimplemented error, but forgot that it was used for this environment variable.	2025-11-06 13:00:26 -06:00
Joseph Huber	670c453aeb	[Offload] Remove handling for device memory pool (#163629 ) Summary: This was a lot of code that was only used for upstream LLVM builds of AMDGPU offloading. We have a generic and fast `malloc` in `libc` now so just use that. Simplifies code, can be added back if we start providing alternate forms but I don't think there's a single use-case that would justify it yet.	2025-11-06 10:15:18 -06:00
Robert Imschweiler	dc94f2cbad	[Offload] Add device UID (#164391 ) Introduced in OpenMP 6.0, the device UID shall be a unique identifier of a device on a given system. (Not necessarily a UUID.) Since it is not guaranteed that the (U)UIDs defined by the device vendor libraries, such as HSA, do not overlap with those of other vendors, the device UIDs in offload are always combined with the offload plugin name. In case the vendor library does not specify any device UID for a given device, we fall back to the offload-internal device ID. The device UID can be retrieved using the `llvm-offload-device-info` tool.	2025-11-04 20:15:47 +01:00
Alex Duran	45757b9284	[OFFLOAD] Remove unused init_device_info plugin interface (#162650 ) This was used for the old interop code. It's dead code after #143491	2025-10-09 08:38:24 -05:00
Alex Duran	902fe02e87	[OFFLOAD] Restore interop functionality (#161429 ) This implements two pieces to restore the interop functionality (that I broke) when the 6.0 interfaces were added: * A set of wrappers that support the old interfaces on top of the new ones * The same level of interop support for the CUDA and AMD plugins	2025-10-02 21:48:31 +02:00
Kevin Sala Penades	01d761a776	[Offload] Use Error for allocating/deallocating in plugins (#160811 ) Co-authored-by: Joseph Huber <huberjn@outlook.com>	2025-09-26 13:50:00 -05:00
Joseph Huber	23efc67e19	[Offload] Remove non-blocking allocation type (#159851 ) Summary: This was originally added in as a hack to work around CUDA's limitation on allocation. The `libc` implementation now isn't even used for CUDA so this code is never hit. Even in this case, this code never truly worked. A true solution would be to use CUDA's virtual memory API instead to allocate 2MiB slabs independently from the normal memory management done in the stream.	2025-09-20 09:07:14 -05:00
Joseph Huber	580860e8b7	[OpenMP][NFC] Clean up a bunch of warnings and clang-tidy messages (#159831 ) Summary: I made the GPU flags accept more of the default LLVM warnings, which triggered some new cases. Clean those up and fix some other ones while I'm at it.	2025-09-19 14:09:33 -05:00
Joseph Huber	dffd7f3d9a	[LLVM] Fix offload and update CUDA ABI for all SM values (#159354 ) Summary: Turns out the new CUDA ABI now applies retroactively to all the other SMs if you upgrade to CUDA 13.0. This patch changes the scheme, keeping all the SM flags consistent but using an offset. Fixes: https://github.com/llvm/llvm-project/issues/159088	2025-09-17 14:39:39 -05:00
Joseph Huber	e7101dac9c	[Offload] Copy loaded images into managed storage (#158748 ) Summary: Currently we have this `__tgt_device_image` indirection which just takes a reference to some pointers. This was all fine and good when the only usage of this was from a section of GPU code that came from an ELF constant section. However, we have expanded beyond that and now need to worry about managing lifetimes. We have code that references the image even after it was loaded internally. This patch changes the implementation to instead copy the memory buffer and manage it locally. This PR reworks the JIT and other image handling to directly manage its own memory. We now don't need to duplicate this behavior externally at the Offload API level. Also we actually free these if the user unloads them. Upside, less likely to crash and burn. Downside, more latency when loading an image.	2025-09-16 08:57:28 -05:00
Ross Brunton	ffb756dff2	[Offload] Add `OL_DEVICE_INFO_MAX_WORK_SIZE[_PER_DIMENSION]` (#155823 ) This is the total number of work items that the device supports (the equivalent work group properties are for only a single work group).	2025-08-29 09:39:18 +01:00
Ross Brunton	41fed2d048	[Offload] Add PRODUCT_NAME device info (#155632 ) On my system, this will be "Radeon RX 7900 GRE" rather than "gfx1100". For Nvidia, the product name and device name are identical.	2025-08-28 15:16:17 +01:00
Kevin Sala Penades	0ad35d7586	[NFC][offload] Fix error message for cuFuncSetAttribute (#155655 )	2025-08-27 11:35:37 -07:00
Callum Fare	0b18d2da70	[Offload] Implement olMemFill (#154102 ) Implement olMemFill to support filling device memory with arbitrary length patterns. AMDGPU support will be added in a follow-up PR.	2025-08-22 14:31:16 +01:00
Ross Brunton	4c0c295775	[Offload] `OL_EVENT_INFO_IS_COMPLETE` (#153194 ) A simple info query for events that returns whether the event is complete or not.	2025-08-22 13:40:31 +01:00
Ross Brunton	2c11a83691	[Offload] Add olCalculateOptimalOccupancy (#142950 ) This is equivalent to `cuOccupancyMaxPotentialBlockSize`. It is currently only implemented on CUDA; AMDGPU and Host return unsupported. --------- Co-authored-by: Callum Fare <callum@codeplay.com>	2025-08-19 15:16:47 +01:00
Rafal Bielski	9c9d9e4cb6	[Offload] Define additional device info properties (#152533 ) Add the following properties in Offload device info: * VENDOR_ID * NUM_COMPUTE_UNITS * [SINGLE|DOUBLE|HALF]_FP_CONFIG * NATIVE_VECTOR_WIDTH_[CHAR|SHORT|INT|LONG|FLOAT|DOUBLE|HALF] * MAX_CLOCK_FREQUENCY * MEMORY_CLOCK_RATE * ADDRESS_BITS * MAX_MEM_ALLOC_SIZE * GLOBAL_MEM_SIZE Add a bitfield option to enumerators, allowing the values to be bit-shifted instead of incremented. Generate the per-type enums using `foreach` to reduce code duplication. Use macros in unit test definitions to reduce code duplication.	2025-08-19 13:02:01 +01:00
Abhinav Gaba	79cf877627	[Offload] Introduce dataFence plugin interface. (#153793 ) The purpose of this fence is to ensure that any `dataSubmit`s inserted into a queue before a `dataFence` finish before any `dataSubmit`s inserted after it begin. This is a no-op for most queues, since they are in-order, and by design any operations inserted into them occur in order. But the interface is supposed to be functional for out-of-order queues. The addition of the interface means that any operations that rely on such ordering (like ATTACH map-type support in #149036) can invoke it, without worrying about whether the underlying queue is in-order or out-of-order. Once a plugin supports out-of-order queues, the plugin can implement this function, without requiring any change at the libomptarget level. --------- Co-authored-by: Alex Duran <alejandro.duran@intel.com>	2025-08-15 11:49:35 -07:00
Ross Brunton	30c7951136	[Offload] `olLaunchHostFunction` (#152482 ) Add an `olLaunchHostFunction` method that allows enqueueing host work to the stream.	2025-08-15 09:39:48 +01:00
Callum Fare	aa6f591b63	[Offload] Implement hasPendingWork on CUDA (#152728 ) Following on from #152304, implement the new query in the CUDA plugin	2025-08-13 16:35:23 +01:00
Kevin Sala Penades	7de50beb52	[Offload] Fix return error with a condition (#152876 ) Adds a conditional to the error return so that it only returns if there was an error.	2025-08-10 12:03:09 -07:00
Ross Brunton	910d7e90bf	[Offload] Make olLaunchKernel test thread safe (#149497 ) This sprinkles a few mutexes around the plugin interface so that the olLaunchKernel CTS test now passes when run on multiple threads. Part of this also involved changing the interface for device synchronise so that it can optionally not free the underlying queue (which introduced a race condition in liboffload).	2025-08-08 10:57:04 +01:00
Ross Brunton	a44532544b	[Offload] Don't create events for empty queues (#152304 ) Add a device function to check if a device queue is empty. If liboffload tries to create an event for an empty queue, we create an "empty" event that is already complete. This allows `olCreateEvent`, `olSyncEvent` and `olWaitEvent` to run quickly for empty queues.	2025-08-07 10:16:33 +01:00
Ross Brunton	d03692a00e	[Offload] Rework `MAX_WORK_GROUP_SIZE` (#151926 ) `MAX_WORK_GROUP_SIZE` now represents the maximum total number of work groups the device can allocate, rather than the maximum per dimension. `MAX_WORK_GROUP_SIZE_PER_DIMENSION` has been added, which has the old behaviour.	2025-08-04 15:21:24 +01:00
Joseph Huber	b53be5f4b2	[LLVM] Update CUDA ELF flags for their new ABI (#149534 ) Summary: We rely on these flags to do things in the runtime and print the contents of binaries correctly. CUDA updated their ABI encoding recently and we didn't handle that. It's a new ABI entirely so we just select on it when it shows up. Fixes: https://github.com/llvm/llvm-project/issues/148703	2025-07-21 14:38:03 -05:00
Ross Brunton	311847be4c	[Offload] Allow "tagging" device info entries with offload keys (#147317 ) When generating the device info tree, nodes can be marked with an offload Device Info value. The nodes can also look up children based on this value.	2025-07-18 14:27:34 +01:00
Ross Brunton	a71187e976	[Offload] Return error rather than dropping it (#148609 )	2025-07-14 14:05:58 +01:00
Ross Brunton	abb878438a	[Offload] Allow querying the size of globals (#147698 ) The `GlobalTy` helper has been extended to make both the Size and Ptr be optional. Now `getGlobalMetadataFromDevice`/`Image` is able to write the size of the global to the struct, instead of just verifying it.	2025-07-10 12:05:31 +01:00
Callum Fare	fdf6ab2a53	[Offload] Implement 'Vendor Name' device info for CUDA (#147334 ) After #146345 the device info implementation requires a value for every query, rather than silently returning an empty string. This broke the test for `OL_DEVICE_INFO_VENDOR` on CUDA. Add a value to the CUDA plugin. We can quite safely hard code this one.	2025-07-08 10:04:48 +01:00
Giorgi Gvalia	5110ac4113	[Offload] Allow CUDA Kernels to use arbitrarily large shared memory (#145963 ) Previously, the user was not able to use more than 48 KB of shared memory on NVIDIA GPUs. In order to do so, setting the function attribute `CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES` is required, which was not present in the code base. With this commit, we add the ability to set this attribute, allowing the user to utilize the full power of their GPU. In order to not have to reset the function attribute for each launch of the same kernel, we keep track of the maximum memory limit (as the variable `MaxDynCGroupMemLimit`) and only set the attribute if our desired amount exceeds the limit. By default, this limit is set to 48 KB. Feedback is greatly appreciated, especially around setting the new variable as mutable. I did this because the `launchImpl` method is const and I am not able to modify my variable otherwise. --------- Co-authored-by: Giorgi Gvalia <ggvalia@login33.chn.perlmutter.nersc.gov> Co-authored-by: Giorgi Gvalia <ggvalia@login07.chn.perlmutter.nersc.gov>	2025-07-07 15:26:16 -04:00
Ross Brunton	102cf1b999	[Offload] Make CUDA Driver Version a string (#146049 ) AMD treats this value as a string, so for consistency require this in NVIDIA as well. This shouldn't change the output of the `llvm-offload-device-info` tool, but does fix an issue in liboffload when it tries to query the version.	2025-06-27 15:07:04 +01:00
Ross Brunton	0870c8838b	[Offload] Add an `unloadBinary` interface to PluginInterface (#143873 ) This allows removal of a specific Image from a Device, rather than requiring all image data to outlive the device they were created for. This is required for `ol_program_handle_t`s, which now specify the lifetime of the buffer used to create the program.	2025-06-25 14:53:18 +01:00
Ross Brunton	e6a3579653	[Offload] Replace device info queue with a tree (#144050 ) Previously, device info was returned as a queue with each element having a "Level" field indicating its nesting level. This replaces this queue with a more traditional tree-like structure. This should not result in a change to the output of `llvm-offload-device-info`.	2025-06-13 09:22:47 -05:00
Ross Brunton	050892d2f8	[Offload] Use new error code handling mechanism and lower-case messages (#139275 ) [Offload] Use new error code handling mechanism This removes the old ErrorCode-less error method and requires every user to provide a concrete error code. All calls have been updated. In addition, for consistency with error messages elsewhere in LLVM, all messages have been made to start lower case.	2025-05-20 08:50:20 -05:00
Christian Clauss	1f56bb3137	[Offload][NFC] Fix typos discovered by codespell (#125119 ) https://github.com/codespell-project/codespell % `codespell --ignore-words-list=archtype,hsa,identty,inout,iself,nd,te,ths,vertexes --write-changes`	2025-01-31 09:35:29 -06:00
Joseph Huber	bd8a818128	[Offload] Add cuLaunchHostFunc to dynamic cuda Summary: This was missing, causing non-directly linked builds to fail.	2025-01-24 11:41:20 -06:00
Joseph Huber	134401deea	[Offload] Move RPC server handling to a dedicated thread (#112988 ) Summary: Handling the RPC server requires running through the list of jobs that the device has requested to be done. Currently this is handled by the thread that does the waiting for the kernel to finish. However, this is not sound on NVIDIA architectures and only works for async launches in the OpenMP model that uses helper threads. However, we also don't want to have this thread doing work unnecessarily. For this reason we track the execution of kernels and cause the thread to sleep via a condition variable (usually backed by some kind of futex or other intelligent sleeping mechanism) so that the thread will be idle while no kernels are running.	2025-01-24 11:36:45 -06:00
Shilei Tian	92376c3ff5	[Offload][OMPX] Add the runtime support for multi-dim grid and block (#118042 )	2024-12-06 09:07:50 -05:00
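
Note on 5110ac4113: that commit caches a per-kernel shared-memory limit (`MaxDynCGroupMemLimit`, 48 KB by default) and only updates the function attribute when a launch asks for more. A minimal sketch of that idea follows. It is not the plugin's code: `KernelSketch`, `raiseLimit`, `launch` and `AttributeUpdates` are hypothetical names, and the driver call is replaced by a counter so the example runs without a GPU.

```cpp
#include <cstdint>

// Hypothetical stand-in for a CUDA kernel object; not the plugin's real type.
struct KernelSketch {
  // Largest dynamic shared-memory size already permitted for this kernel.
  // Starts at the 48 KB default mentioned in the commit message. It is
  // mutable because the launch path below is const, as in the commit.
  mutable uint32_t MaxDynCGroupMemLimit = 48 * 1024;

  // Counts simulated attribute updates (assumption: a real plugin would
  // call cuFuncSetAttribute at this point instead of bumping a counter).
  mutable unsigned AttributeUpdates = 0;

  // Raise the cached limit and record that the attribute was touched.
  void raiseLimit(uint32_t Bytes) const {
    MaxDynCGroupMemLimit = Bytes;
    ++AttributeUpdates;
  }

  // Const launch path: the attribute is only updated when the requested
  // amount exceeds the cached limit, so repeated launches stay cheap.
  void launch(uint32_t DynSharedBytes) const {
    if (DynSharedBytes > MaxDynCGroupMemLimit)
      raiseLimit(DynSharedBytes);
    // The actual kernel launch would follow here.
  }
};
```

With this sketch, launching with 16 KB leaves the attribute untouched, a first 64 KB launch updates it once, and later launches at or below 64 KB do not update it again.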

66 Commits