Summary:
This PR changes how emitted kernels are handled when targeting a CPU so
that they take a pointer to a struct.
The old handling emitted a standard function prototype, which
necessitated a target-specific ABI to call it because the signature
differed with the number of arguments. Instead, this PR emits a void
pointer to a naturally aligned struct, which is what APIs like
`pthreads` expect.
This allows us to remove all the complexity around launching host
kernels and just pass the argument list.
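
As a rough illustration of the calling convention described above, the
sketch below packs the arguments into a struct and calls the kernel
through a single `void *` parameter. `KernelArgs`, `example_kernel`, and
`launch` are hypothetical names made up for this example, not what the
compiler or runtime actually emits.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical argument block. Each member sits at its natural alignment,
// mirroring the "naturally aligned struct" the PR describes.
struct KernelArgs {
  const int32_t *In;
  int32_t *Out;
  int64_t N;
};

// Stand-in for an emitted host kernel: one `void *` parameter, no matter
// how many user arguments there are.
extern "C" void example_kernel(void *Args) {
  auto *A = static_cast<KernelArgs *>(Args);
  for (int64_t I = 0; I < A->N; ++I)
    A->Out[I] = A->In[I] * 2;
}

// Every kernel shares the same signature, so the launcher needs no
// per-signature or target-specific call handling.
using KernelFn = void (*)(void *);
inline void launch(KernelFn Fn, void *Args) { Fn(Args); }
```

Because the signature is the same shape that `pthread_create` start
routines use, the same function pointer type can be used for every kernel.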
This change improves handling of errors during synchronization in the
Level Zero plugin by ensuring cleanup of queues and events in case of a
synchronization error. As a result, multiple tests stopped hanging.
---------
Co-authored-by: Duran, Alex <alejandro.duran@intel.com>
The queue's WaitEvent collection wasn't being cleared after
synchronization and resetting of the events. This led to hangs on
subsequent host synchronizations if not preceded by any other operation.
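
A minimal sketch of the idea, assuming a hypothetical `Queue` type that
tracks events to wait on (these are not the plugin's real classes): the
collection is cleared on every path out of the synchronization, so a
later synchronization never waits on stale events.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

struct Event { int Id; }; // hypothetical event handle

class Queue {
  std::vector<Event> WaitEvents;

public:
  void addWaitEvent(Event E) { WaitEvents.push_back(E); }
  std::size_t pending() const { return WaitEvents.size(); }

  // Waits on every recorded event using the caller-supplied `Wait`.
  // Returns false if any wait failed. The list is cleared either way.
  bool synchronize(const std::function<bool(const Event &)> &Wait) {
    bool Ok = true;
    for (const Event &E : WaitEvents)
      Ok = Wait(E) && Ok;
    WaitEvents.clear();
    return Ok;
  }
};
```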
This commit removes the `LIBOMPTARGET_SHARED_MEMORY_SIZE` envar and
outputs a runtime warning if it is defined. Access to dynamic shared memory
should be obtained through the `dyn_groupprivate` clause (OpenMP 6.1) or
the launch arguments in liboffload kernel launch.
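
One way such a check could look, as a sketch only: the helper name and
the warning text are assumptions for illustration and do not reproduce
the runtime's actual message.

```cpp
#include <cassert>
#include <cstdlib>
#include <ostream>
#include <sstream>
#include <string>

// Returns true and writes a warning to `OS` if the removed environment
// variable `Name` is still defined. The wording is hypothetical.
inline bool warnIfRemovedEnvSet(const char *Name, std::ostream &OS) {
  if (std::getenv(Name) == nullptr)
    return false;
  OS << "warning: " << Name << " is no longer supported and is ignored\n";
  return true;
}
```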
Summary:
We use this `dyn_ptr` argument in Clang/OpenMP to handle the
`KernelLaunchEnvironment`. This is a per-kernel argument used to share
some information. Currently, it's prepended to the argument list and we
generate storage for it in the runtime.
This is bad for a few reasons:
1. It changes the ABI by shifting user arguments
2. It cannot trivially be left uninitialized if unused
3. The runtime must allocate its own memory for it
This PR changes it to be appended instead. Additionally, space for this
is always emitted. This means the OMPIRBuilder itself will provide the
storage, we simply need to populate it in the runtime if it is used.
This means that if it's unused we don't always pay the cost and it's
easier for non-OpenMP users to ignore it.
Backward compatibility is maintained by auto-upgrading the kernel
arguments. In `libomptarget` we completely allocate a new buffer to
store this in the new format. The plugins still need to respect the old
ABI of the called device object, so we simply rotate it if it's the old
version.
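
The rotation between the two layouts can be sketched with `std::rotate`.
The helper names and the use of a `std::vector<void *>` for the argument
list are assumptions for illustration; the runtime's real data
structures differ.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// New layout: [user args..., launch environment slot].
// Old layout: [launch environment slot, user args...].

// Move the trailing slot to the front for objects built with the old ABI.
inline void toOldLayout(std::vector<void *> &Args) {
  if (Args.size() > 1)
    std::rotate(Args.begin(), Args.end() - 1, Args.end());
}

// Inverse operation: move the leading slot to the back.
inline void toNewLayout(std::vector<void *> &Args) {
  if (Args.size() > 1)
    std::rotate(Args.begin(), Args.begin() + 1, Args.end());
}
```

Note that user arguments keep their relative order in both directions;
only the position of the extra slot changes.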
- Accept OffloadBinaries as valid images by plugins that support them in
the PluginInterface.
- Add support in L0 plugin to extract SPIRV images and their associated
metadata from an OffloadBinary image.
Depends on:
- #185663
Follow-up PRs:
- #185413 (Changes SPIRV wrapper generation to use OffloadBinary)
- #185425 (Adjusts llvm-objdump)
- #184774 (Adjusts llvm-offload-binary)
As discussed in #185404 we might want to provide a way for plugins to
validate images not recognized by the common layer.
This PR adds such extension and uses it to validate pure SPIRV images by
the Level Zero plugin.
Summary:
This was a regression from the original LLVM-gpu-loader. We used to
handle `-mwavefrontsize64` correctly in the loader by over-allocating
memory and just leaving the upper 32-bits masked off. In order to handle
this in offload we need to scan loaded kernels to see how much memory we
need to allocate. This should be safe, the protocol is designed to
handle an arbitrary size and worst-case this just wastes space.
Summary:
This PR provides the minimal support for Fortran I/O coming from a GPU
in OpenMP offloading. We use the same support the `libc` uses for its
printing through the RPC server. The helper functions `rpc::dispatch`
and `rpc::invoke` help make this mostly automatic.
Because Fortran I/O is not reentrant, the vast majority of complexity
comes from needing to stitch together calls from the GPU until they can
be executed all at once. This is needed not only because of the
limitations of recursive I/O, but without this the output would all be
interleaved because of the GPU's lock-step execution.
As such, the return values from the intermediate functions are
meaningless, all returning true. The final value is correct however. For
cookies we create a context pointer on the server to chain these
together.
Works on both my AMD and NVIDIA GPUs.
```fortran
program hello_gpu
implicit none
!$omp target teams num_teams(1)
!$omp parallel num_threads(2)
! Print strings
print *, "Hello from GPU"
!$omp end parallel
!$omp end target teams
end program hello_gpu
```
```console
> flang hello.f90 -O2 -fopenmp --offload-arch=gfx1030
> ./a.out
Hello from GPU
Hello from GPU
> flang hello.f90 -O2 -fopenmp --offload-arch=sm_89
> ./a.out
Hello from GPU
Hello from GPU
```
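
The stitching described above can be pictured with the following
host-side sketch. It assumes a hypothetical three-call protocol
(`beginStatement`, `appendItem`, `endStatement`) and a hypothetical
`IoContext` used as the cookie; the real handlers and the Fortran runtime
calls they forward to are not shown.

```cpp
#include <cassert>
#include <memory>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical server-side context. Its address is handed back to the
// device as the cookie that chains the calls of one statement together.
struct IoContext {
  std::ostringstream Pending;
};

inline IoContext *beginStatement() { return new IoContext; }

// Intermediate call: only records the item. The return value is a
// placeholder and carries no information.
inline bool appendItem(IoContext *Cookie, const std::string &Item) {
  Cookie->Pending << ' ' << Item;
  return true;
}

// Final call: performs the whole statement at once, releases the
// context, and returns the only meaningful status.
inline bool endStatement(IoContext *Cookie, std::ostream &Out) {
  std::unique_ptr<IoContext> Owner(Cookie);
  Out << Owner->Pending.str() << '\n';
  return Out.good();
}
```

Deferring all output to the final call is what keeps one thread's
statement from being interleaved with another's.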
This PR extends the liboffload `olMemRegister` API to handle the case
where a memory block may already have been mapped before calling
`olMemRegister`, to support some use cases in libomptarget.
Summary:
Closing ports was previously done manually, which made the protocol more
error prone as unclosed ports will leak and eventually the locks will
run out. I believe the original fear was that the RAII portion would
negatively impact code generation but I have not noticed anything
significant.
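
A minimal sketch of the RAII pattern, assuming a hypothetical `PortTable`
that counts open ports (the real RPC port type and its locking are more
involved):

```cpp
#include <cassert>

// Hypothetical bookkeeping for how many ports are currently held.
struct PortTable {
  int Open = 0;
};

// A port that releases itself when it goes out of scope, so an early
// return cannot leak it.
class Port {
  PortTable *Table;

public:
  explicit Port(PortTable &T) : Table(&T) { ++Table->Open; }
  Port(const Port &) = delete;
  Port &operator=(const Port &) = delete;
  // Moving transfers ownership; the moved-from port releases nothing.
  Port(Port &&Other) noexcept : Table(Other.Table) { Other.Table = nullptr; }
  ~Port() {
    if (Table)
      --Table->Open;
  }
};
```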
The purpose of this PR is to add support for the host as an offloading
device to liboffload. Both OpenMP and SYCL support offloading to the host
as part of their normal workflow and therefore require this capability
from the liboffload library.
The default Level Zero loader `libze_loader.so` may not be available on
systems that don't have the Level Zero development package installed.
Level Zero loaders with a major version suffix are searched in that case.
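
The fallback search might be sketched as follows. The candidate order,
the range of major versions tried, and the `TryLoad` callback (standing
in for the actual dynamic-library open call) are assumptions for
illustration.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical candidate list: the unversioned name first, then names
// with a major version suffix.
inline std::vector<std::string> loaderCandidates(unsigned MaxMajor) {
  std::vector<std::string> Names{"libze_loader.so"};
  for (unsigned V = 1; V <= MaxMajor; ++V)
    Names.push_back("libze_loader.so." + std::to_string(V));
  return Names;
}

// Returns the first name that `TryLoad` accepts, or an empty string.
inline std::string
findLoader(const std::vector<std::string> &Names,
           const std::function<bool(const std::string &)> &TryLoad) {
  for (const std::string &Name : Names)
    if (TryLoad(Name))
      return Name;
  return {};
}
```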
Changes to make host plugin compile on Windows:
* Change IO code to be portable
* Adjust Makefiles
Allow plugin to work partially when libffi support is not found
dynamically (compilation works fine even on Windows because of the
wrapper support).
Summary:
The static object mixes callbacks from different plugins because ever
since we moved to the object library target these are actually shared.
Just make it a member of the base class and make it a pointer set just
to do some basic deduplication.
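
In sketch form, with a hypothetical `PluginBase` and callback type (not
the actual plugin interface), the change amounts to replacing a shared
static container with a per-instance set keyed on the callback pointer:

```cpp
#include <cassert>
#include <cstddef>
#include <set>

using Callback = void (*)();

// Hypothetical callbacks used only for the example.
inline void onInit() {}
inline void onFini() {}

class PluginBase {
  // Per-instance rather than static, so plugins no longer share entries.
  // A set keyed on the pointer gives basic deduplication.
  std::set<Callback> Callbacks;

public:
  // Returns false if the callback was already registered.
  bool registerCallback(Callback CB) { return Callbacks.insert(CB).second; }
  std::size_t numCallbacks() const { return Callbacks.size(); }
};
```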
Summary:
We provide an RPC server to manage calls initiated by the device to run
on the host. This is very useful for the built-in handling we have,
however there are cases where we would want to extend this
functionality.
Cases like Fortran or MPI would be useful, but we cannot put references
to these in the core offloading runtime. This way, we can provide this
as a library interface that registers custom handlers for whatever code
people want.
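
To illustrate the shape of such an interface, here is a sketch of a
handler registry keyed by opcode. `HandlerRegistry`, its methods, and the
`int(int)` handler signature are hypothetical and are not the interface
this PR adds.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <utility>

class HandlerRegistry {
  std::unordered_map<uint32_t, std::function<int(int)>> Handlers;

public:
  // Returns false if the opcode already has a handler.
  bool registerHandler(uint32_t Opcode, std::function<int(int)> Handler) {
    return Handlers.emplace(Opcode, std::move(Handler)).second;
  }

  // Runs the handler for `Opcode` if one exists. Returns false for
  // opcodes nobody registered, so the caller can fall back to the
  // built-in handling.
  bool dispatch(uint32_t Opcode, int Arg, int &Result) const {
    auto It = Handlers.find(Opcode);
    if (It == Handlers.end())
      return false;
    Result = It->second(Arg);
    return true;
  }
};
```

Keeping the registry separate from the core runtime is what lets
Fortran- or MPI-specific code live in its own library.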
There are a few places where data types based on character array or
string are printed in the debug message while they do not represent
strings. Such expressions should be cast to `void *` unless they
represent actual strings. Change also includes casting from integral
type to pointer type when appropriate.
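
The difference is easy to see with C++ streams, which print a `char *`
as a NUL-terminated string unless it is cast first. The helper names
below are made up for the example.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Prints the address held in `Buffer`, not the bytes it points to.
inline std::string formatAddress(const char *Buffer) {
  std::ostringstream OS;
  OS << static_cast<const void *>(Buffer);
  return OS.str();
}

// Prints the pointed-to characters. Only correct for real strings.
inline std::string formatString(const char *Str) {
  std::ostringstream OS;
  OS << Str;
  return OS.str();
}
```

Without the cast, a raw byte buffer that lacks a terminating NUL would be
read past its end when printed.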
Add liboffload asynchronous queue query API for libomptarget migration
This PR adds the liboffload asynchronous queue query API that is needed
for libomptarget to use liboffload.
Add liboffload memory data locking API for libomptarget migration
This PR adds the liboffload memory data locking API that is needed for
libomptarget to use liboffload.
Summary:
This is only really meaningful for the NVPTX target. Not all build
environments support host LTO and these are redundant tests, just clean
this up and make it run faster.
This PR refactors how the device image is built so we can expose the
native ELF of the device to DeviceImageTy which solves several issues
regarding symbol look up (as DeviceImageTy expects an ELF). It also
simplifies the module linking code taking into account the latest
changes in the driver (which adds "-library-compilation" when necessary).
---------
Co-authored-by: Alexey Sachkov <alexey.sachkov@intel.com>
Co-authored-by: Nick Sarnie <nick.sarnie@intel.com>
Co-authored-by: Joseph Huber <huberjn@outlook.com>
When looking for the device address of a symbol, we also need to check
whether it is a function symbol if it is not found as a global symbol on
the device.
---------
Co-authored-by: Alexey Sachkov <alexey.sachkov@intel.com>
Co-authored-by: Nick Sarnie <nick.sarnie@intel.com>
Co-authored-by: Joseph Huber <huberjn@outlook.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Support for getDebugLevel was removed as part of the new debug macros
(#165416). This PR updates such usages to use the new ODBG_* macros.
---------
Co-authored-by: Alexey Sachkov <alexey.sachkov@intel.com>
Co-authored-by: Nick Sarnie <nick.sarnie@intel.com>
Co-authored-by: Joseph Huber <huberjn@outlook.com>
Add a new nextgen plugin that supports GPU devices through the Intel oneAPI Level Zero library. The plugin is not enabled by default and needs to be added to LIBOMPTARGET_PLUGINS_TO_BUILD explicitly.
---------
Co-authored-by: Alexey Sachkov <alexey.sachkov@intel.com>
Co-authored-by: Nick Sarnie <nick.sarnie@intel.com>
Co-authored-by: Joseph Huber <huberjn@outlook.com>
Update debug messages based on the new method from #170425. Updated the
following files.
- plugins-nextgen/common/include/MemoryManager.h
- plugins-nextgen/common/include/PluginInterface.h
- plugins-nextgen/common/src/GlobalHandler.cpp
- plugins-nextgen/common/src/PluginInterface.cpp
- plugins-nextgen/host/dynamic_ffi/ffi.cpp
This PR introduces new debug macros that allow finer control over
which debug messages are output and introduce a C++ stream style for
debug messages.
Changing existing messages (except a few that I changed for testing)
will come in subsequent PRs.
I also think that we should make debug enabling OpenMP agnostic but, for
now, I prioritized maintaining the current libomptarget behavior. We
might need more changes further down the line as we decouple
libomptarget.
Summary:
We start this thread if the RPC client symbol is detected in the loaded
binary. We should make this sleep if there's no work to avoid the thread
running at high priority when the (scarcely used) RPC call is actually
required. So, right now after 25 microseconds we will assume the server
is inactive and begin sleeping. This resets once we do find work.
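
The idle policy can be sketched as below. `IdlePolicy` is a hypothetical
helper, and the clock is passed in explicitly so the logic can be
exercised without real waiting; the server thread's actual loop and its
sleep mechanism are not shown.

```cpp
#include <cassert>
#include <chrono>

class IdlePolicy {
  using Clock = std::chrono::steady_clock;
  std::chrono::microseconds Threshold;
  Clock::time_point LastWork;

public:
  IdlePolicy(std::chrono::microseconds T, Clock::time_point Now)
      : Threshold(T), LastWork(Now) {}

  // Finding work resets the idle timer.
  void noteWork(Clock::time_point Now) { LastWork = Now; }

  // True once no work has been seen for at least `Threshold`.
  bool shouldSleep(Clock::time_point Now) const {
    return Now - LastWork >= Threshold;
  }
};
```

A polling loop would call `noteWork` whenever it services a request and
stop busy-waiting once `shouldSleep` returns true.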
AMD supports a more intelligent way to do this. HSA signals can wake a
sleeping thread from the kernel, and signals can be sent from the GPU
side. This would be nice to have and I'm planning on working with it in
the future to make this infrastructure more usable with existing AMD
workloads.