Summary:
I made the GPU flags accept more of the default LLVM warnings, which
triggered some new cases. Clean those up and fix some other ones while
I'm at it.
Summary:
This exposes the 'isDeviceCompatible' routine for checking if a binary
*can* be loaded. This is useful if people don't want to consume errors
everywhere when figuring out which image to put to what device.
I don't know if this is a good name, I was thining like `olIsCompatible`
or whatever. Let me know what you think.
Long term I'd like to be able to do something similar to what OpenMP
does where we can conditionally only initialize devices if we need them.
That's going to be support needed if we want this to be more
generic.
Summary:
Turns out the new CUDA ABI now applies retroactively to all the other
SMs if you upgrade to CUDA 13.0. This patch changes the scheme, keeping
all the SM flags consistent but using an offset.
Fixes: https://github.com/llvm/llvm-project/issues/159088
Currently get this error
```
offload/plugins-nextgen/common/src/PluginInterface.cpp:859:63: error: member reference type 'StringRef' is not a pointer; did you mean to use '.'?
```
We pass the full image binary now so we can't really print anything
useful here.
Seems introduced in https://github.com/llvm/llvm-project/pull/158748.
---------
Signed-off-by: Sarnie, Nick <nick.sarnie@intel.com>
Co-authored-by: Joseph Huber <huberjn@outlook.com>
Summary:
Currently we have this `__tgt_device_image` indirection which just takes
a reference to some pointers. This was all find and good when the only
usage of this was from a section of GPU code that came from an ELF
constant section. However, we have expanded beyond that and now need to
worry about managing lifetimes. We have code that references the image
even after it was loaded internally. This patch changes the
implementation to instaed copy the memory buffer and manage it locally.
This PR reworks the JIT and other image handling to directly manage its
own memory. We now don't need to duplicate this behavior externally at
the Offload API level. Also we actually free these if the user unloads
them.
Upside, less likely to crash and burn. Downside, more latency when
loading an image.
Summary:
This operation is done every time we load a binary, this behavior should
be moved into OpenMP since it concerns an OpenMP specific data struct.
This is a little messy, because ideally we should only be using public
APIs, but more can be extracted later.
Previously, `olDestroyQueue` would not actually destroy the queue,
instead leaving it for the device to clean up when it was destroyed.
Now, the queue is either released immediately if it is complete or put
into a list of "pending" queues if it is not. Whenever we create a new
queue, we check this list to see if any are now completed. If there are
any we release their resources and use them instead of pulling from
the pool.
This prevents long running programs that create and drop many queues
without syncing them from leaking memory all over the place.
This is equivalent to `cuOccupancyMaxPotentialBlockSize`. It is
currently
only implemented on Cuda; AMDGPU and Host return unsupported.
---------
Co-authored-by: Callum Fare <callum@codeplay.com>
Add the following properties in Offload device info:
* VENDOR_ID
* NUM_COMPUTE_UNITS
* [SINGLE|DOUBLE|HALF]_FP_CONFIG
* NATIVE_VECTOR_WIDTH_[CHAR|SHORT|INT|LONG|FLOAT|DOUBLE|HALF]
* MAX_CLOCK_FREQUENCY
* MEMORY_CLOCK_RATE
* ADDRESS_BITS
* MAX_MEM_ALLOC_SIZE
* GLOBAL_MEM_SIZE
Add a bitfield option to enumerators, allowing the values to be
bit-shifted instead of incremented. Generate the per-type enums using
`foreach` to reduce code duplication.
Use macros in unit test definitions to reduce code duplication.
The purpose of this fence is to ensure that any `dataSubmit`s inserted
into a queue before a `dataFence` finish before finish before any
`dataSubmit`s
inserted after it begin.
This is a no-op for most queues, since they are in-order, and by design
any operations inserted into them occur in order.
But the interface is supposed to be functional for out-of-order queues.
The addition of the interface means that any operations that rely on
such ordering (like ATTACH map-type support in #149036) can invoke it,
without worrying about whether the underlying queue is in-order or
out-of-order.
Once a plugin supports out-of-order queues, the plugin can implement
this function, without requiring any change at the libomptarget level.
---------
Co-authored-by: Alex Duran <alejandro.duran@intel.com>
This sprinkles a few mutexes around the plugin interface so that the
olLaunchKernel CTS test now passes when ran on multiple threads.
Part of this also involved changing the interface for device synchronise
so that it can optionally not free the underlying queue (which
introduced a race condition in liboffload).
Add a device function to check if a device queue is empty. If liboffload
tries to create an event for an empty queue, we create an "empty" event
that is already complete.
This allows `olCreateEvent`, `olSyncEvent` and `olWaitEvent` to run
quickly for empty queues.
Enables AMD data center class GPUs to use memory manager memory pooling
up to 3GB allocation by default, up from the "1 << 13" threshold that
all plugin-nextgen devices use.
The following patch introduces a new interop interface implementation
with the following characteristics:
* It supports the new 6.0 prefer_type specification
* It supports both explicit objects (from interop constructs) and
implicit objects (from variant calls).
* Implements a per-thread reuse mechanism for implicit objects to reduce
overheads.
* It provides a plugin interface that allows selecting the supported
interop types, and managing all the backend related interop operations
(init, sync, ...).
* It enables cooperation with the OpenMP runtime to allow progress on
OpenMP synchronizations.
* It cleanups some vendor/fr_id mismatchs from the current query
routines.
* It supports extension to define interop callbacks for library cleanup.
`MAX_WORK_GROUP_SIZE` now represents the maximum total number of work
groups the device can allocate, rather than the maximum per dimension.
`MAX_WORK_GROUP_SIZE_PER_DIMENSION` has been added, which has the old
behaviour.
When `unloadBinary` is called, any entries in the JITEngine's cache
for that binary will be cleared. This fixes a nasty issue with
liboffload program handles. If two handles happen to have had the same
address (after one was free'd, for example), the cache would be hit and
return the wrong program.
Summary:
We rely on these flags to do things in the runtime and print the
contents of binaries correctly. CUDA updated their ABI encoding recently
and we didn't handle that. it's a new ABI entirely so we just select on
it when it shows up.
Fixes: https://github.com/llvm/llvm-project/issues/148703
The `GlobalTy` helper has been extended to make both the Size and Ptr be
optional. Now `getGlobalMetadataFromDevice`/`Image` is able to write the
size of the global to the struct, instead of just verifying it.
This is a generated file which contains a macro for all Device Info
keys. This is visible to the plugin interface so that it can use the
definitions in a future patch.
The `unloadBinaryImpl` method on the host plugin is now implemented
properly (rather than just being a stub). When an image is unloaded,
it is deallocated and the library associated with it is closed.
After #146345 the device info implementation requires a value for every
query, rather than silently returning an empty string. This broke the
test for `OL_DEVICE_INFO_VENDOR` on CUDA.
Add a value to the CUDA plugin. We can quite safely hard code this one.
Previously, the user was not able to use more than 48 KB of shared
memory on NVIDIA GPUs. In order to do so, setting the function attribute
`CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK` is required, which was not
present in the code base. With this commit, we add the ability toset
this attribute, allowing the user to utilize the full power of their
GPU.
In order to not have to reset the function attribute for each launch of
the same kernel, we keep track of the maximum memory limit (as the
variable `MaxDynCGroupMemLimit`) and only set the attribute if our
desired amount exceeds the limit. By default, this limit is set to 48
KB.
Feedback is greatly appreciated, especially around setting the new
variable as mutable. I did this becuase the `launchImpl` method is const
and I am not able to modify my variable otherwise.
---------
Co-authored-by: Giorgi Gvalia <ggvalia@login33.chn.perlmutter.nersc.gov>
Co-authored-by: Giorgi Gvalia <ggvalia@login07.chn.perlmutter.nersc.gov>
GenericKernelTy has a pointer to the name that was used to create it.
However, the name passed in as an argument may not outlive the kernel.
Instead, GenericKernelTy now contains a std::string, and copies the
name into there.
AMD treats this value as a string, so for consistency require this in
NVIDIA as well. This shouldn't change the output of the
`llvm-offload-device-info` tool, but does fix an issue in liboffload
when it tries to query the version.
Summary:
There's a new one called the AIE (AI Engine). We could handle this, but
since we don't use it currently I'm just making it future-proof. Adding
the AIE check would require checking the HSA version which isn't
worthwhile just yet.
This allows removal of a specific Image from a Device, rather than
requiring all image data to outlive the device they were created for.
This is required for `ol_program_handle_t`s, which now specify the
lifetime of the buffer used to create the program.
Previously, if a binary failed to load due to failures when jit
compiling, the function would return success with nullptr. Now it
returns a new plugin error, `COMPILE_FAILURE`.
Rather than being "stringly typed", store values as a std::variant that
can hold various types. This means that liboffload doesn't have to do
any string parsing for integer/bool device info keys.
Previously, device info was returned as a queue with each element having
a "Level" field indicating its nesting level. This replaces this queue
with a more traditional tree-like structure.
This should not result in a change to the output of
`llvm-offload-device-info`.
This pull request fixes coverage mapping on GPU targets.
- It adds an address space cast to the coverage mapping generation pass.
- It reads the profiled function names from the ELF directly. Reading it
from public globals was causing issues in cases where multiple
device-code object files are linked together.
Previously we decided to check in files that we generate with tablegen.
The justification at the time was that it helped reviewers unfamiliar
with `offload-tblgen` see the actual changes to the headers in PRs.
After trying it for a while, it's ended up causing some headaches and is
also not how tablegen is used elsewhere in LLVM.
This changes our use of tablegen to be more conventional. Where
possible, files are still clang-formatted, but this is no longer a hard
requirement. Because `OffloadErrcodes.inc` is shared with libomptarget
it now gets generated in a more appropriate place.