Before we tracked the size of the teams reduction buffer in order to
allocate it at runtime per kernel launch. This patch splits the number
into two parts, the size of the reduction data (=all reduction
variables) and the (maximal) length of the buffer. This will allow us to
allocate less if we need less, e.g., if we have less teams than the
maximal length. It also allows us to move code from clangs codegen into
the runtime as we now know how large the reduction data is.
We used to perform team reduction on global memory allocated in the
runtime and by clang. This was racy as multiple instances of a kernel,
or different kernels with team reductions, would use the same locations.
Since we now have the kernel launch environment, we can allocate dynamic
memory per-launch, allowing us to move all the state into a non-racy
place.
Fixes: https://github.com/llvm/llvm-project/issues/70249
The KernelEnvironment is for compile time information about a kernel. It
allows the compiler to feed information to the runtime. The
KernelLaunchEnvironment is for dynamic information *per* kernel launch.
It allows the rutime to feed information to the kernel that is not
shared with other invocations of the kernel. The first use case is to
replace the globals that synchronize teams reductions with per-launch
versions. This allows concurrent teams reductions. More uses cases will
follow, e.g., per launch memory pools.
Fixes: https://github.com/llvm/llvm-project/issues/70249
This patch introduces per kernel environment. Previously, flags such as execution mode are set through global variables with name like `__kernel_name_exec_mode`. They are accessible on the host by reading the corresponding global variable, but not from the device. Besides, some assumptions, such as no nested parallelism, are not per kernel basis, preventing us applying per kernel optimization in the device runtime.
This is a combination and refinement of patch series D116908, D116909, and D116910.
Depend on D155886.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D142569
This patch introduces per kernel environment. Previously, flags such as execution mode are set through global variables with name like `__kernel_name_exec_mode`. They are accessible on the host by reading the corresponding global variable, but not from the device. Besides, some assumptions, such as no nested parallelism, are not per kernel basis, preventing us applying per kernel optimization in the device runtime.
This is a combination and refinement of patch series D116908, D116909, and D116910.
Depend on D155886.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D142569
This reverts commit 35cfadfbe2decd9633560b3046fa6c17523b2fa9.
It makes a couple of buildbots unhappy because of the following test failures:
- `Transforms/OpenMP/add_attributes.ll'`
- `mapping/declare_mapper_target_data.cpp` on AMDGPU
This patch introduces per kernel environment. Previously, flags such as execution mode are set through global variables with name like `__kernel_name_exec_mode`. They are accessible on the host by reading the corresponding global variable, but not from the device. Besides, some assumptions, such as no nested parallelism, are not per kernel basis, preventing us applying per kernel optimization in the device runtime.
This is a combination and refinement of patch series D116908, D116909, and D116910.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D142569
Code generation support for 'parallel masked' directive.
The `EmitOMPParallelMaskedDirective` was implemented.
In addition, the appropiate device functions were added.
Fix#59939.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D143527
This patch fixes the support for tasking on the device.
Note: AMDGPU doesn't support it yet because of no support for `malloc` and `free`.
Fix#59946.
```
➜ ./test_parallel_master_device
[OMPVV_RESULT: test_parallel_master_device.c] Test passed on the device.
```
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D141562
Parallel regions are outlined as functions with capture variables explicitly generated as distinct parameters in the function's argument list. That complicates the fork_call interface in the OpenMP runtime: (1) the fork_call is variadic since there is a variable number of arguments to forward to the outlined function, (2) wrapping/unwrapping arguments happens in the OpenMP runtime, which is sub-optimal, has been a source of ABI bugs, and has a hardcoded limit (16) in the number of arguments, (3) forwarded arguments must cast to pointer types, which complicates debugging. This patch avoids those issues by aggregating captured arguments in a struct to pass to the fork_call.
Reviewed By: jdoerfert, jhuber6, ABataev
Differential Revision: https://reviews.llvm.org/D102107
This patch implements omp_get_device_num() in the host and the device.
It uses the already existing getDeviceNum in the device config for the device.
And in the host it uses the omp_get_num_devices().
Two simple tests added
Differential Revision: https://reviews.llvm.org/D128347
This patch adds the `llvm_omp_target_dynamic_shared_alloc` function to
the `omp.h` header file so users can access it by default. Also changed
the name to keep it consistent with the other target allocators. Added
some documentation so users know how to use it. Didn't add the interface
for Fortran since there's no way to test it right now.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D123246
The problem with the old scheme is that we would need to keep track of
the "next region" and reset the num_threads value after it. The new RT
doesn't do it and an assertion is triggered. The old RT doesn't do it
either, I haven't tested it but I assume a num_threads clause might
impact multiple parallel regions "accidentally". Further, in SPMD mode
num_threads was simply ignored, for some reason beyond me.
In any case, parallel_51 is designed to take the clause value directly,
so let's do that instead.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D113623
The existing CGOpenMPRuntimeAMDGCN and CGOpenMPRuntimeNVPTX classes are
just code bloat. By removing them, the codebase gets a bit cleaner.
Reviewed By: jdoerfert, JonChesterfield, tianshilei1992
Differential Revision: https://reviews.llvm.org/D113421
Before we had aligned barriers the `__kmpc_barrier_simple_spmd` was
OK to be used in the custom state machine. Now that SPMD barriers are
assumed to be aligned we need to use a "generic" barrier in places
that are not aligned.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D112893
We do not generate _serialized_parallel calls in device mode, no
need for an external API.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D112145
This patch adds support for the
`__kmpc_get_hardware_num_threads_in_block` function that returns the
number of threads. This was missing in the new runtime and was used by
the AMDGPU plugin which prevented it from using the new runtime. This
patchs also unified the interface for getting the thread numbers in the
frontend.
Originally authored by jdoerfert.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D111475
This patch adds an external interface to access the dynamic shared
memory buffer in the device runtime. The function introduced is
``llvm_omp_get_dynamic_shared``. This includes a host-side
definition that only returns a null pointer so that it can be used when
host-fallback is enabled without crashing. Support for dynamic shared
memory was also ported to the old device runtime.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D110957
This is a follow-up of D110029, which uses bitset to indicate execution mode. This patches makes the changes in the function call.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D110279
The new device runtime library currently lacks the
`kmpc_get_hardware_thread_id_in_block` function which is currently used
when doing the SPMDzation optimization. This call would be introduced
through the optimization and then cause a linking error because it was
not present. This patch adds support for this runtime call.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D110195
This patch adds support for using dynamic shared memory in the new
device runtime. The new function `__kmpc_get_dynamic_shared` will return a
pointer to the buffer of dynamic shared memory. Currently the amount of memory
allocated is set by an environment variable.
In the future this amount will be added to the amount used for the smart stack
which will be configured in a similar way.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D110006
Use uint64_t for lanemask on all GPU architectures at the interface
with clang. Updates tests. The deviceRTL is always linked as IR so the zext
and trunc introduced for wave32 architectures will fold after inlining.
Simplification partly motivated by amdgpu gfx10 which will be wave32 and
is awkward to express in the current arch-dependant typedef interface.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D108317
The new method of sharing variables introduces a `__kmpc_alloc_shared` call
that cannot be removed in the middle end because of its non-constant argument
and unconnected free. This patch reverts this to the old method that used a
static amount of shared memory for sharing variables.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D106905
The "old" OpenMP GPU device runtime (D14254) has served us well for many
years but modernizing it has caused some pain recently. This patch
introduces an alternative which is mostly written from scratch embracing
OpenMP 5.X, C++, LLVM coding style (where applicable), and conceptual
interfaces. This new runtime is opt-in through a clang flag (D106793).
The new runtime is currently only build for nvptx and has "-new" in its
name.
The design is tailored towards middle-end optimizations rather than
front-end code generation choices, a trend we already started in the old
runtime a while back. In contrast to the old one, state is organized in
a simple manner rather than a "smart" one. While this can induce costs
it helps optimizations. Our expectation is that the majority of codes
can be optimized and a "simple" design is therefore preferable. The new
runtime does also avoid users to pay for things they do not use,
especially wrt. memory. The unlikely case of nested parallelism is
supported but costly to make the more likely case use less resources.
The worksharing and reduction implementation have been taken from the
old runtime and will be rewritten in the future if necessary.
Documentation and debug features are still mostly missing and will be
added over time.
All external symbols start with `__kmpc` for legacy reasons but should
be renamed once we switch over to a single runtime. All internal symbols
are placed in appropriate namespaces (anonymous or `_OMP`) to avoid name
clashes with user symbols.
Differential Revision: https://reviews.llvm.org/D106803