This is a hotfix for #148615 - it fixes the issue for me locally.
I think a broader issue is that in the test environment we're calling
olShutDown from a global destructor in the test binaries. We should do
something more controlled, either calling olInit/olShutDown in every
test, or move those to a GTest global environment. I didn't do that
originally because it looked like it needed changes to LLVM's GTest
wrapper.
Add `OffloadDeviceTest::getPlatformBackend()` and use it to skip event
tests which currently fail on AMDGPU due to:
```
OL_ERRC_UNIMPLEMENTED: synchronize event not implemented
```
`olGetKernel` has been replaced by `olGetSymbol` which accepts a
`Kind` parameter. As well as loading information about kernels, it
can now also load information about global variables.
In the future, we want `ol_symbol_handle_t` to represent both kernels
and global variables The first step in this process is a rename and
promotion to a "typed handle".
The `GlobalTy` helper has been extended to make both the Size and Ptr be
optional. Now `getGlobalMetadataFromDevice`/`Image` is able to write the
size of the global to the struct, instead of just verifying it.
* Add spec generation to offload-tblgen tool
* This patch adds generation of Sphinx compatible reStructuedText
utilizing the C domain to document the Offload API directly from the
spec definition `.td` files.
* Add Sphinx HTML documentation target
* Introduces the `docs-offload-html` target when CMake is configured
with `LLVM_ENABLE_SPHINX=ON` and `SPHINX_OUTPUT_HTML=ON`. Utilized
`offload-tblgen -gen-spen` to generate Offload API specification docs.
Add info queries for queues and events.
`olGetQueueInfo` only supports getting the associated device. We were
already tracking this so we can implement this for free. We will likely
add other queries to it in the future (whether the queue is empty, what
flags it was created with, etc)
`olGetEventInfo` only supports getting the associated queue. This is
another thing we were already storing in the handle. We'll be able to
add other queries in future (the event type, status, etc)
Adds two "launch kernel" tests for lib offload, one testing that
global memory works and persists between different kernels, and one
verifying that `[[gnu::constructor]]` works correctly.
Since we now have tests that contain multiple kernels in the same
binary, the test framework has been updated a bit.
The Offload and Flang-RT had the ability to compile GTest themselves.
But in bootstrapping builds, LLVM_LIBRARY_OUTPUT_INTDIR points to the
same location as the stage1 build. If both are building GTest, they
everwrite each others `libllvm_gtest.a` and `libllvm_test_main.a` which
causes #143134.
This PR removes the ability for the Offload/Flang-RT runtimes to build
their own GTest and instead relies on the stage1 build of GTest. This
was already the case with LLVM_INSTALL_GTEST=ON configurations. For
LLVM_INSTALL_GTEST=OFF configurations, we now also export gtest into the
buildtree configuration. Ultimately, this reduces combinatorial
explosion of configurations in which unittests could be built
(LLVM_INSTALL_GTEST=ON, GTest built by Offload, GTest built by Flang-RT,
GTest built by Offload and also used by Flang-RT).
GTest and therefore Offload/Runtime unittests will not be available if
the runtimes are configured against an LLVM install tree. Since llvm-lit
isn't available in the install tree either, it doesn't matter.
Note that compiler-rt and libc also use GTest in non-default
configrations. libc also depends on LLVM's GTest build (and would
error-out if unavailable), but compiler-rt builds it completely
different.
Fixes#143134
This is a generated file which contains a macro for all Device Info
keys. This is visible to the plugin interface so that it can use the
definitions in a future patch.
The `unloadBinaryImpl` method on the host plugin is now implemented
properly (rather than just being a stub). When an image is unloaded,
it is deallocated and the library associated with it is closed.
The output of the compile-and-run tests is incorrect. These will be used
for reference in future commits that resolve the issues.
Also updated the existing clang LIT test,
target_map_both_pointer_pointee_codegen.cpp, with more constructs and
fewer CHECKs (through more update_cc_test_checks filters).
After #146345 the device info implementation requires a value for every
query, rather than silently returning an empty string. This broke the
test for `OL_DEVICE_INFO_VENDOR` on CUDA.
Add a value to the CUDA plugin. We can quite safely hard code this one.
Previously, the user was not able to use more than 48 KB of shared
memory on NVIDIA GPUs. In order to do so, setting the function attribute
`CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK` is required, which was not
present in the code base. With this commit, we add the ability toset
this attribute, allowing the user to utilize the full power of their
GPU.
In order to not have to reset the function attribute for each launch of
the same kernel, we keep track of the maximum memory limit (as the
variable `MaxDynCGroupMemLimit`) and only set the attribute if our
desired amount exceeds the limit. By default, this limit is set to 48
KB.
Feedback is greatly appreciated, especially around setting the new
variable as mutable. I did this becuase the `launchImpl` method is const
and I am not able to modify my variable otherwise.
---------
Co-authored-by: Giorgi Gvalia <ggvalia@login33.chn.perlmutter.nersc.gov>
Co-authored-by: Giorgi Gvalia <ggvalia@login07.chn.perlmutter.nersc.gov>
GenericKernelTy has a pointer to the name that was used to create it.
However, the name passed in as an argument may not outlive the kernel.
Instead, GenericKernelTy now contains a std::string, and copies the
name into there.
- Update the main README to reflect the current project status
- Rework the main API generation documentation. General fixes/tidying,
but also spell out explicitly how to make API changes at the top of the
document since this is what most people will care about.
---------
Co-authored-by: Martin Grant <martingrant@outlook.com>
Summary:
This adds a basic outline for adding 'conformance' tests. These are
tests that are intended to check device code against a standard. In this
case, we will expect this to be filled with math conformance tests to
make sure their results are within the ULP requirements we demand.
Right now this just *assumes* the GPU libc is there, meaning you'll
likely need to do a manual `ninja` before doing `ninja -C
runtimes/runtimes-bins offload.conformance`.
Fix a couple of unhandled edge cases in offload-tblgen that were found
by static analysis
* `LineStart` may wrap around to 0 when processing multi-line strings.
The value is not actually being used in that case, but still better to
explicitly handle it
* Possible unchecked nullptr when processing parameter flags
This makes several small changes to how the platform and device info
queries are handled:
* ReturnHelper has been replaced with InfoWriter which is more explicit
in how it is invoked.
* InfoWriter consumes `llvm::Expected` rather than values directly, and
will early exit if it returns an error.
* As a result of the above, `GetInfoString` now correctly returns errors
rather than empty strings.
* The host device now has its own dedicated "getInfo" function rather
than being checked in multiple places.
`olShutDown` was not properly calling deinit on the platforms, resulting
in random segfaults on AMD devices.
As part of this, `olInit` and `olShutDown` now alloc and free the
offload context rather than it being static. This
allows `olShutDown` to be called within a destructor of a static object
(like the tests do) without having to worry about destructor ordering.
OpenMP allows duplicate mappings, i.e. in OpenMP 6.0, 7.9.6 "map
Clause":
Two list items of the map clauses on the same construct must not share
original storage unless one of the following is true: they are the same
list item [or other omitted reasons]"
Duplicate mappings can arise as a result of user-defined mapper
processing (which I think is a separate bug, and is not addressed here),
but also in straightforward cases such as:
#pragma omp target map(tofrom: s.mem[0:10]) map(tofrom: s.mem[0:10])
Both these cases cause crashes at runtime at present, due to an
unfortunate interaction between reference counting behaviour and shadow
pointer handling for blocks. This is what happens:
1. The member "s.mem" is copied to the target
2. A shadow pointer is created, modifying the pointer on the target
3. The member "s.mem" is copied to the target again
4. The previous shadow pointer metadata is still present, so the runtime doesn't modify the target pointer a second time.
The fix is to disable step 3 if we've already done step 2 for a given
block that has the "is new" flag set.
Rather than creating a new device info tree for each call to
`olGetDeviceInfo`, we instead do it on device initialisation. As well
as improving performance, this fixes a few lifetime issues with returned
strings.
This does unfortunately mean that device information is immutable,
but hopefully that shouldn't be a problem for any queries we want to
implement.
This also meant allowing offload initialization to fail, which it can
now do.
AMD treats this value as a string, so for consistency require this in
NVIDIA as well. This shouldn't change the output of the
`llvm-offload-device-info` tool, but does fix an issue in liboffload
when it tries to query the version.
Summary:
There's a new one called the AIE (AI Engine). We could handle this, but
since we don't use it currently I'm just making it future-proof. Adding
the AIE check would require checking the HSA version which isn't
worthwhile just yet.
This allows removal of a specific Image from a Device, rather than
requiring all image data to outlive the device they were created for.
This is required for `ol_program_handle_t`s, which now specify the
lifetime of the buffer used to create the program.
Previously, if a binary failed to load due to failures when jit
compiling, the function would return success with nullptr. Now it
returns a new plugin error, `COMPILE_FAILURE`.
Summary:
I'll probably want to use this as a more generic utility in the future.
This patch reworks it to make it a top level function. I also tried to
decouple this from the OpenMP utilities to make that easier in the
future. Instead, I just use `-march=native` functionality which is the
same thing. Needed a small hack to skip the linker stage for checking if
that works.
This should still create the same output as far as I'm aware.
Rather than being "stringly typed", store values as a std::variant that
can hold various types. This means that liboffload doesn't have to do
any string parsing for integer/bool device info keys.
Rather than having a number of static local variables, we now use
a single `OffloadContext` struct to store global state. This is
initialised by `olInit`, but is never deleted (de-initialization of
Offload isn't yet implemented).
The error reporting mechanism has not been moved to the struct, since
that's going to cause issues with teardown (error messages must outlive
liboffload).
Previously, device info was returned as a queue with each element having
a "Level" field indicating its nesting level. This replaces this queue
with a more traditional tree-like structure.
This should not result in a change to the output of
`llvm-offload-device-info`.
`pgo_atomic_teams.c` and `pgo_atomic_threads.c` currently are set to run
on NVPTX despite the changes for that target not being upstreamed yet.
This patch also replaces instances of `llvm-profdata` with `%profdata`
in those tests.
This is a three element x, y, z size_t vector that can be used any place
where a 3D vector is required. This ensures that all vectors across
liboffload are the same and don't require any resizing/reordering
dances.