This pull request is the second part of an ongoing effort to extends PGO
instrumentation to GPU device code and depends on #76587. This PR makes
the following changes:
- Introduces `__llvm_write_custom_profile` to PGO compiler-rt library.
This is an external function that can be used to write profiles with
custom data to target-specific files.
- Adds `__llvm_write_custom_profile` as weak symbol to libomptarget so
that it can write the collected data to a profraw file.
- Adds `PGODump` debug flag and only displays dump when the
aforementioned flag is set
Summary:
The 'omp_alloc' function should be callable from a target region. This
patch implemets it by simply calling `malloc` for every non-default
trait value allocator. All the special access modifiers are
unimplemented and return null. The null allocator returns null as the
spec states it should not be usable from the target.
Similar to (de)allocation traces, we can record kernel launch stack
traces and display them in case of an error. However, the AMD GPU plugin
signal handler, which is invoked on memroy faults, cannot pinpoint the
offending kernel. Insteade print `<NUM>`, set via
`OFFLOAD_TRACK_NUM_KERNEL_LAUNCH_TRACES=<NUM>`, many traces. The
recoding/record uses a ring buffer of fixed size (for now 8).
For `trap` errors, we print the actual kernel name, and trace if
recorded.
As a first step towards a GPU sanitizer we now can track allocations and
deallocations in order to report double frees, and other problems during
deallocation.
Sometimes it might be beneficial to spawn more thread blocks instead of
reusing existing for multiple loop iterations.
**Alternatives considered:**
Make `DefaultNumBlocks` settable via an environment variable.
---------
Co-authored-by: Joseph Huber <huberjn@outlook.com>
Users can put a : in front of KMP_HW_SUBSET to indicate that the
specified subset is an "absolute" subset. Currently, when a user puts
KMP_HW_SUBSET=1t. This gets translated to KMP_HW_SUBSET="*s,*c,1t",
where * means "use all of". If a user wants only one thread as the
entire topology they can now do KMP_HW_SUBSET=:1t.
Along with the absolute syntax is a fix for newer machines and making
them easier to use with only the 3-level topology syntax. When a user
puts KMP_HW_SUBSET=1s,4c,2t on a machine which actually has 4 layers,
(say 1s,2m,3c,2t as the entire machine) the user gets an unexpected "too
many resources asked" message because KMP_HW_SUBSET currently translates
the "4c" value to mean 4 cores per module. To help users out, the
runtime can assume that these newer layers, module in this case, should
be ignored if they are not specified, but the topology should always
take into account the sockets, cores, and threads layers.
The patch contains a basic BumpAllocator for (AMD)GPUs to allow us to
run more tests. The allocator implements `malloc`, both internally and
externally, while we continue to default to the NVIDIA `malloc` when we
target NVIDIA GPUs. Once we have smarter or customizable allocators we
should consider this choice, for now, this allocator is better than
none. It traps if it is out of memory, making it easy to debug. Heap
size is configured via `LIBOMPTARGET_HEAP_SIZE` and defaults to 512MB.
It allows to track allocation statistics via
`LIBOMPTARGET_DEVICE_RTL_DEBUG=8` (together with
`-fopenmp-target-debug=8`). Two tests were added, and one was enabled.
This is the next step towards fixing
https://github.com/llvm/llvm-project/issues/66708
This change adds the option of using different units for blocktimes specified via the KMP_BLOCKTIME environment variable. The parsing of the environment now recognizes units suffixes: ms and us. If a units suffix is not specified, the default unit is ms. Thus default behavior is still the same, and any previous usage still works the same. Internally, blocktime is now converted to microseconds everywhere, so settings that exceed INT_MAX in microseconds are considered "infinite".
kmp_set/get_blocktime are updated to use the units the user specified with KMP_BLOCKTIME, and if not specified, ms are used.
Added better range checking and inform messages for the two time units. Large values of blocktime for default (ms) case (beyond INT_MAX/1000) are no longer allowed, but will autocorrect with an INFORM message.
The delay for determining ticks per usec was lowered. It is now 1 million ticks which was calculated as ~450us based on 2.2GHz clock which is pretty typical base clock frequency on X86:
(1e6 Ticks) / (2.2e9 Ticks/sec) * (1e6 usec/sec) = 454 usec
Really short benchmarks can be affected by longer delay.
Update KMP_BLOCKTIME docs.
Portions of this commit were authored by Johnny Peyton.
Differential Revision: https://reviews.llvm.org/D157646
If the Envar is set to true (default), busy HSA queues will be
actively avoided when assigning a queue to a Stream.
Otherwise, we will initialize a new HSA queue for each requested
Stream, then default to round robin once the set maximum has been
reached.
Reviewed By: jdoerfert, kevinsala
Differential Revision: https://reviews.llvm.org/D156996
This feature was supposed to allow you to trace execution inside of
Libomptarget. However, this never really worked properly. The printing
was always reoganized, only worked for single threads, and pretty much
only told you a handful of things about a runtime library that's an
implementation detail to all users. Despite this, it contributed about
40% of the total filesize of the deviceRTL. This patch simply removes
this functionalit which I think was past due.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D157001
This patch lazily initializes queues/streams/events since their initialization
might come at a cost even if we do not use them.
To further benefit from this, AMDGPU/HSA queue management is moved into the
AMDGPUStreamManager of an AMDGPUDevice. Streams may now use different HSA queues
during their lifetime and identify busy queues.
When a Stream is requested from the resource manager, it will search for and
try to assign an idle queue. During the search for an idle queue the manager
may initialize more queues, up to the set maximum (default: 4).
When no idle queue could be found: resort to round robin selection.
With contributions from Johannes Doerfert <johannes@jdoerfert.de>
Depends on D156245
Reviewed By: kevinsala
Differential Revision: https://reviews.llvm.org/D154523
When I was trying to improve the OpenMP documentation, I found that the information in `OpenMP/docs/README.md` did not contain up-to-date information about how to build the OpenMP documentation with Sphinx. When I ran `make
docs-openmp-html`, the command failed because there were a few syntax errors in `openmp/docs/design/Runtimes.rst`. This commit fixes the syntax errors and updates the documentation on building the OpenMP documentation.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D156470
This points users to the `libc` documentation and explains the basics of
how it's used inside the runtime.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D155318
It's time to remove the old plugins as the next-gen has already been set
to default in LLVM 16.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D142820
If a combined loop has insufficient parallelism (= low trip count), we
might end up with too few teams/blocks. To counter that we can reduce
the number of threads per team we use. This patch implements a heuristic
and exposes a new environment variable to control the minimum of threads
to be employed in this case.
Issue reported by:
Felipe Cabarcas Jaramillo <cabarcas@udel.edu> (@fel-cab).
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D152014
Adds HSA timeout hint of 2 seconds to the AMDGPU nextgen-plugin to improve
performance of small kernels.
The HSA runtime may stay in HSA_WAIT_STATE_ACTIVE for up to the timeout
value before switching to HSA_WAIT_STATE_BLOCKED. This can improve
latency from which small kernels can benefit.
The value was determined via experimentation w/ different benchmarks.
The timeout value can be overriden using the environment variable
LIBOMPTARGET_AMDGPU_STREAM_BUSYWAIT with a value in microseconds.
Original author: Greg Rodgers <Gregory.Rodgers@amd.com>
Contributions from: JP Lehr <JanPatrick.Lehr@amd.com>
Differential Revision: https://reviews.llvm.org/D148808
Summary:
At some point we stopped copying this file to the server, but
realistically this is just a static `.pdf` hosted in the LLVM repository
so we can link it directly.
Dynamic memory allows users to allocate fast shared memory when a kernel
is launched. We support a single size for all kernels via the
`LIBOMPTARGET_SHARED_MEMORY_SIZE` environment variable but now we can
control it per kernel invocation, hence allow computed values.
Note: Only the nextgen plugins will allocate memory based on the clause,
the old plugins will silently miscompile.
Differential Revision: https://reviews.llvm.org/D141233
The JIT is a great debugging tool since we can modify the IR manually
before launching it in an existing test case. The new flasks allow to
skip optimizations, to use the exact given IR, as well as to provide a
finished object file. The latter is useful to try out different backend
options and to have complete freedom with pass pipelines.
Documentation is included. Minimal refactoring was performed to make the
second object fit in nicely.
We can now dump the IR before and after JIT optimizations into the
files passed via `LIBOMPTARGET_JIT_PRE_OPT_IR_MODULE` and
`LIBOMPTARGET_JIT_POST_OPT_IR_MODULE`, respectively.
Similarly, users can set `LIBOMPTARGET_JIT_REPLACEMENT_MODULE` to
replace the IR in the image with a custom IR module in a file.
All options take file paths, documentation was added.
Reviewed by: tianshilei1992
Differential revision: https://reviews.llvm.org/D140945
Add new hidden helper affinity via the environment variable,
KMP_HIDDEN_HELPER_AFFINITY, which allows users to assign thread
affinity to hidden helper threads using the same syntax as
KMP_AFFINITY. OMP_PLACES/OMP_PROC_BIND have no interaction with
KMP_HIDDEN_HELPER_AFFINITY.
Differential Revision: https://reviews.llvm.org/D135113
Previously time tracing features were hidden behind an optional CMake
option. This was because `libomptarget` was not based on the LLVM
libraries at that time. Now that `libomptarget` is an LLVM library we
should be able to freely use the `LLVMSupport` library whenever we want
and do not need to guard it in this way.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D132852
Added control to reset affinity of primary thread after outermost parallel
region to initial affinity encountered before OpenMP runtime was initialized.
KMP_AFFINITY environment variable reset/noreset modifier introduced.
Default behavior is unchanged.
Differential Revision: https://reviews.llvm.org/D125993
This patch adds the `llvm_omp_target_dynamic_shared_alloc` function to
the `omp.h` header file so users can access it by default. Also changed
the name to keep it consistent with the other target allocators. Added
some documentation so users know how to use it. Didn't add the interface
for Fortran since there's no way to test it right now.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D123246
In SupportAndFAQ.rst, add blank lines before and after a bullet list and
sublist. This avoids an "Unepxected indentation" warning.
In Runtimes.rst, adjust the suggestion for setting LIBOMPTARGET_INFO.
The right shifts are not necessary as the bit mask values are already
correct.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D119595
This patch changes the error message to instead mention the
documentation page for the debugging options provided by libomptarget
and the bitcode runtimes. Add some extra information to the documentation to
help users more quickly identify debugging resources.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D118626
Atomic handling of map clauses was introduced to comply with the OpenMP
standard (see D104418). However, many apps won't need this feature which
can be costly in certain situations. To allow for applications to
opt-out we now introduce the `LIBOMPTARGET_MAP_FORCE_ATOMIC` environment
flag that voids the atomicity guarantee of the standard for map clauses
again, shifting the burden to the user.
This patch also de-duplicates the code that introduces the events used
to enforce atomicity as a cleanup.
Differential Revision: https://reviews.llvm.org/D117627
This patch allows the user to request all resources of a particular
layer (or core-attribute). The syntax of KMP_HW_SUBSET is modified
so the number of units requested is optional or can be replaced with an
'*' character.
e.g., KMP_HW_SUBSET=c:intel_atom@3 will use all the cores after offset 3
e.g., KMP_HW_SUBSET=*c:intel_core will use all the big cores
e.g., KMP_HW_SUBSET=*s,*c,1t will use all the sockets, all cores per
each socket and 1 thread per core.
Differential Revision: https://reviews.llvm.org/D115826
Allow filtering of resources based on core attributes. There are two new
attributes added:
1) Core Type (intel_atom, intel_core)
2) Core Efficiency (integer) where the higher the efficiency, the more
performant the core
On hybrid architectures , e.g., Alder Lake, users can specify
KMP_HW_SUBSET=4c:intel_atom,4c:intel_core to select the first four Atom
and first four Big cores. The can also use the efficiency syntax. e.g.,
KMP_HW_SUBSET=2c:eff0,2c:eff1
Differential Revision: https://reviews.llvm.org/D114901
Add documentation for the debugging features in the OpenMP device
runtime library.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D112010
This patch adds support for using dynamic shared memory in the new
device runtime. The new function `__kmpc_get_dynamic_shared` will return a
pointer to the buffer of dynamic shared memory. Currently the amount of memory
allocated is set by an environment variable.
In the future this amount will be added to the amount used for the smart stack
which will be configured in a similar way.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D110006
This patch adds support for two environment variables to configure the device.
``LIBOMPTARGET_STACK_SIZE`` sets the amount of memory in bytes that each thread
has for its stack. ``LIBOMPTARGET_HEAP_SIZE`` sets the amount of heap memory
that can be allocated using malloc / free on the device.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D106627
For example, without this patch:
```
$ cat test.c
int main() {
int x;
#pragma omp target enter data map(alloc: x)
#pragma omp target exit data map(release: x)
;
return 0;
}
$ clang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda test.c
$ LIBOMPTARGET_DEBUG=1 ./a.out |& grep 'Creating\|Mapping exists'
Libomptarget --> Creating new map entry with HstPtrBegin=0x00007ffcace8e448, TgtPtrBegin=0x00007f12ef600000, Size=4, Name=unknown
Libomptarget --> Mapping exists with HstPtrBegin=0x00007ffcace8e448, TgtPtrBegin=0x00007f12ef600000, Size=4, updated RefCount=1
```
There are two problems in this example:
* `RefCount` is not reported when a mapping is created, but it might
be 1 or infinite. In this case, because it's created by `omp target
enter data`, it's 1. Seeing that would make later `RefCount`
messages easier to understand.
* `RefCount` is still 1 at the `omp target exit data`, but it's
reported as `updated`. The reason it's still 1 is that, upon
deletions, the reference count is generally not updated in
`DeviceTy::getTgtPtrBegin`, where the report is produced. Instead,
it's zeroed later in `DeviceTy::deallocTgtPtr`, where it's actually
removed from the mapping table.
This patch makes the following changes:
* Report the reference count when creating a mapping.
* Where an existing mapping is reported, always report a reference
count action:
* `update suppressed` when `UpdateRefCount=false`
* `incremented`
* `decremented`
* `deferred final decrement`, which replaces the misleading
`updated` in the above example
* Add comments to `DeviceTy::getTgtPtrBegin` to explain why it does
not zero the reference count. (Please advise if these comments miss
the point.)
* For unified shared memory, don't report confusing messages like
`RefCount=` or `RefCount= updated` given that reference counts are
irrelevant in this case. Instead, just report `for unified shared
memory`.
* Use `INFO` not `DP` consistently for `Mapping exists` messages.
* Fix device table dumps to print `INF` instead of `-1` for an
infinite reference count.
Reviewed By: jhuber6, grokos
Differential Revision: https://reviews.llvm.org/D104559
This patch adds an information flag that indicated when data is being copied to
and from the device. This will be helpful for finding redundant or unnecessary
data transfers in applications.
Reviewed By: jdoerfert, grokos
Differential Revision: https://reviews.llvm.org/D103927
Summary:
This patch adds a new runtime function __tgt_set_info_flag that allows the
user to set the information level at runtime without using the environment
variable. Using this will require an extern function, but will eventually be
added into an auxilliary library for OpenMP support functions.
This patch required moving the current InfoLevel to a global variable which must
be instantiated by each plugin.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D100774
Summary:
This patch adds a feature to print information whenever the host-device pointer
mapping table is changed by inserting or removing an entry. This introduces a
new bit field for LIBOMPTARGET_INFO at position 0x8.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D100600
Summary:
Remove some of the error messages printed when the CUDA plugin fails. The current error messages can be confusing because they are the first error messages printed after the async stream finds an error. This means that the printed values aren't related to what caused the issue, but are simply the last asyncronous operation that succeeded on the device. Remove these as they can be misleading.
Reviewers: jdoerfert
Differential Revision: https://reviews.llvm.org/D99510
Link error occurred when time profiling in libomp is enabled by default
because `libomp` is assumed to be a C library but the dependence on
`libLLVMSupport` for profiling is a C++ library. Currently the issue blocks all
OpenMP tests in Phabricator.
This patch set a new CMake option `OPENMP_ENABLE_LIBOMP_PROFILING` to
enable/disable the feature. By default it is disabled. Note that once time
profiling is enabled for `libomp`, it becomes a C++ library.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D95585
This introduces a remote offloading plugin for libomptarget. This
implementation relies on gRPC and protobuf, so this library will only
build if both libraries are available on the system. The corresponding
server is compiled to `openmp-offloading-server`.
This is a large change, but the only way to split this up is into RTL/server
but I fear that could introduce an inconsistency amongst them.
Ideally, tests for this should be added to the current ones that but that is
problematic for at least one reason. Given that libomptarget registers plugin
on a first-come-first-serve basis, if we wanted to offload onto a local x86
through a different process, then we'd have to either re-order the plugin list
in `rtl.cpp` (which is what I did locally for testing) or find a better
solution for runtime plugin registration in libomptarget.
Differential Revision: https://reviews.llvm.org/D95314
Add extra information to the runtime page describing the error messages and add information to the release notes for clang 12.0
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D94562
Add an example to the OpenMP Documentation on the LIBOMPTARGET_INFO environment variable
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D94246