The debug assert is meant to check that the index is a valid which means
the runtime needs to check against the size of the array instead of the
number of threads. A free()-ed thread put back in the thread pool may
index into anywhere inside the task team's available array from 0 to
tt_max_threads potentially.
Fixes: #94260
* Serial teams now use a stack (similar to dispatch buffers)
* Serial teams always use `t_task_team[0]` as the task team and the
second pointer is a next pointer for the stack
`t_task_team[1]` is interpreted as a stack of task teams where each
level is a nested level
```
inner serial team outer serial team
[ t_task_team[0] ] -> (task_team) [ t_task_team[0] ] -> (task_team)
[ next ] ----------------> [ next ] -> ...
```
* Remove the task state memo stack from thread structure.
* Instead of a thread-private stack, use team structure to store
th_task_state of the primary thread. When coming out of a parallel,
restore the primary thread's task state. The new field in the team
structure doesn't cause sizeof(team) to change and is in the cache line
which is only read/written by the primary thread.
Fixes: #50602Fixes: #69368Fixes: #69733Fixes: #79416
When a nested parallel region ends, the runtime calls __kmp_join_call().
During this call, the primary thread of the nested parallel region will
reset its tid (retval of omp_get_thread_num()) to what it was in the
outer parallel region. A data race occurs with the current code when
another worker thread from the nested inner parallel region tries to
steal tasks from the primary thread's task deque. The worker thread
reads the tid value directly from the primary thread's data structure
and may read the wrong value.
This change just uses the calculated victim_tid from execute_tasks()
directly in the steal_task() routine rather than reading tid from the
data structure.
Fixes: #87307
The hidden helper team pre-allocates the gtid space [1,
num_hidden_helpers] (inclusive). If regular host threads are allocated,
then put back in the thread pool, then the hidden helper team is
initialized, the hidden helper team tries to allocate the threads from
the thread pool with gtids higher than [1, num_hidden_helpers]. Instead,
have the hidden helper team fork OS threads so the correct gtid range
used for hidden helper threads.
Fixes: #87117
MSVC does not define __BYTE_ORDER__ making the check for BigEndian
erroneously evaluate to true and breaking the struct definitions in MSVC
compiled builds correspondingly. The fix adds an additional check for
whether __BYTE_ORDER__ is defined by the compiler to fix these.
---------
Co-authored-by: Vadim Paretsky <b-vadipa@microsoft.com>
From "3.1 Reducing the number of edges" of this [[ https://hal.science/hal-04136674v1/ | paper ]] - Optimization (b)
Task (dependency) nodes have a `successors` list built upon passed dependency.
Given the following code, B will be added to A's successors list building the graph `A` -> `B`
```
// A
# pragma omp task depend(out: x)
{}
// B
# pragma omp task depend(in: x)
{}
```
In the following code, B is currently added twice to A's successor list
```
// A
# pragma omp task depend(out: x, y)
{}
// B
# pragma omp task depend(in: x, y)
{}
```
This patch removes such dupplicates by checking lastly inserted task in `A` successor list.
Authored by: Romain Pereira (rpereira-dev)
Differential Revision: https://reviews.llvm.org/D158544
struct DEP defined in multiple testcases must correspond to runtime's
struct kmp_depend_info. The former defines flags as int, and the latter
as kmp_uint8_t. This discrepancy goes unnoticed on little-endian
systems, but breaks big-endian ones.
Make flags in struct DEP unsigned char.
This patch implements the "__kmp_print_tdg_dot" function, that prints a task dependency graph into a dot file containing the tasks and their dependencies.
It is activated through a new environment variable "KMP_TDG_DOT"
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D150962
This patch properly marks the support level for libomp test when testing with
GCC.
Some new OpenMP features were only introduced with GCC 11.
Tests using the target construct are incompatibe with GCC.
Tests pass now with GCC 10, 11, 12
This patch implements the "task record and replay" mechanism. The idea is to be able to store tasks and their dependencies in the runtime so that we do not pay the cost of task creation and dependency resolution for future executions. The objective is to improve fine-grained task performance, both for those from "omp task" and "taskloop".
The entry point of the recording phase is __kmpc_start_record_task, and the end of record is triggered by __kmpc_end_record_task.
Tasks encapsulated between a record start and a record end are saved, meaning that the runtime stores their dependencies and structures, referred to as TDG, in order to replay them in subsequent executions. In these TDG replays, we start the execution by scheduling all root tasks (tasks that do not have input dependencies), and there will be no involvement of a hash table to track the dependencies, yet tasks do not need to be created again.
At the beginning of __kmpc_start_record_task, we must check if a TDG has already been recorded. If yes, the function returns 0 and starts to replay the TDG by calling __kmp_exec_tdg; if not, we start to record, and the function returns 1.
An integer uniquely identifies TDGs. Currently, this identifier needs to be incremented manually in the source code. Still, depending on how this feature would eventually be used in the library, the caller function must do it; also, the caller function needs to implement a mechanism to skip the associated region, according to the return value of __kmpc_start_record_task.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D146642
The working of depend clause with iterator modifier
can be correctly tested by means of execution tests
and not at the LLVM IR level. These tests
are imported/inspired from the SOLLVE tests.
SOLLVE repo: https://github.com/SOLLVE/sollve_vv
Differential Revision: https://reviews.llvm.org/D146706
Use the correct data type for pointer sized integers on Windows;
"long" is always 32 bit, even on 64 bit Windows - don't use it
for the kmp_intptr_t type.
Provide the exact correct definition of the kmp_depend_info
struct - avoid the risk of mismatches (if a platform would pack
things slightly differently when things are declared differently).
Zero initialize the whole dep_info struct before filling it in;
if only setting the in/out bits, the rest of the unallocated bits
in the bitfield can have undefined values. Libomp reads the flags
in combined form as an kmp_uint8 by reading the flag field - thus,
the unused bits do need to be zeroed. (Alternatively, the flag field
could be set to zero before setting the individual bits in the
bitfield).
Use kmp_intptr_t instead of long for casting pointers to integers.
Differential Revision: https://reviews.llvm.org/D137748
The hidden helper task is only enabled on Linux (kmp_runtime.cpp
initializes __kmp_enable_hidden_helper to TRUE for linux but to
FALSE for any other OS), and the __kmp_stg_parse_use_hidden_helper
function always makes it disabled on non-Linux OSes too.
Add a lit test feature for hidden helper tasks (only made available
on Linux) and mark two tests as requiring this feature.
Disable hidden helper tasks in the test that doesn't really involve
them, for consistent behaviour across platforms.
Differential Revision: https://reviews.llvm.org/D137749
OpenMP tests that use pthread functions include this header instead.
On Unix systems, this header includes pthread.h, while it provides
minimal implementations of the used pthread functions for Windows.
Differential Revision: https://reviews.llvm.org/D137746
Add new hidden helper affinity via the environment variable,
KMP_HIDDEN_HELPER_AFFINITY, which allows users to assign thread
affinity to hidden helper threads using the same syntax as
KMP_AFFINITY. OMP_PLACES/OMP_PROC_BIND have no interaction with
KMP_HIDDEN_HELPER_AFFINITY.
Differential Revision: https://reviews.llvm.org/D135113
Before this patch task priorities were ignored, that was a valid implementation
as the task priority is a hint according to OpenMP specification.
Implemented shared list of sorted (high -> low) task deques one per task
priority value. Tasks execution changed to first check if priority tasks ready
for execution exist, and these tasks executed before others;
otherwise usual tasks execution mechanics work.
Differential Revision: https://reviews.llvm.org/D119676
The __kmp_hidden_helper_threads_num set to N+1 if user requested N threads.
Thus number of worker hidden helper threads corresponds to user request,
main thread of helper team excluded as it does not participate in actual work.
This also fixes divide-by-0 issue in the code.
Fixes#48656
Differential Revision: https://reviews.llvm.org/D119586
Eachempati.
This patch adds clang (parsing, sema, serialization, codegen) support for the 'depend' clause on the 'taskwait' directive.
Reviewed By: ABataev
Differential Revision: https://reviews.llvm.org/D113540
The CHECK: line in the test had no effect, because the test does not
pipe to FileCheck. Since the test only checks for a single value,
encode the result in the return value of the test.
This patch allows to simplify compiler implementation on "taskwait nowait"
construct. The "taskwait nowait" is semantically equivalent to the empty task.
Instead of creating an empty routine as a task entry, compiler can just send
NULL pointer to the runtime. Then the runtime will make all the work with
dependences and return because of the absent task routine.
Differential Revision: https://reviews.llvm.org/D112015
GOMP depobjs are represented as a two intptr_t array. The first
element is the base address of the dependency and the second element
is the flag indicating the type the depobj represents.
Differential Revision: https://reviews.llvm.org/D108790
New omp_all_memory task dependence type is implemented.
Library recognizes the new type via either
(dependence_address == NULL && dependence_flag == 0x80)
or
(dependence_address == SIZE_MAX).
A task with new dependence type depends on each preceding task
with any dependence type (kind of a dependence barrier).
Differential Revision: https://reviews.llvm.org/D108574
As discussed in D107121, task wait doesn't work when a regular task T depends on
a detached task or a hidden helper task T' in a serialized team. The root cause is,
since the team is serialized, the last task will not be tracked by
`td_incomplete_child_tasks`. When T' is finished, it first releases its
dependences, and then decrements its parent counter. So far so good. For the thread
that is running task wait, if at the moment it is still spinning and trying to
execute tasks, it is fine because it can detect the new task and execute it.
However, if it happends to finish the function `flag.execute_tasks(...)`, it will
be broken because `td_incomplete_child_tasks` is 0 now.
In this patch, we update the rule to track children tasks a little bit. If the
task team encounters a proxy task or a hidden helper task, all following tasks
will be tracked.
Reviewed By: AndreyChurbanov
Differential Revision: https://reviews.llvm.org/D107496
Fix for https://bugs.llvm.org/show_bug.cgi?id=49723.
Eliminated references from task dependency hash to node allocated on stack,
thus eliminated accesses to stale memory. So the node now never freed.
Uncommented assertion which triggered when stale memory accessed.
Removed unneeded ref count increment for stack allocated node.
Differential Revision: https://reviews.llvm.org/D106705
This patch fixes https://bugs.llvm.org/show_bug.cgi?id=49066.
For detachable tasks, the assumption breaks that the proxy task cannot have
remaining child tasks when the proxy completes.
In stead of increment/decrement the incomplete task count, a high-order bit
is flipped to mark and wait for the incomplete proxy task.
Differential Revision: https://reviews.llvm.org/D101082
gcc 11 introduced support for depend clause, but the gomp interface of libomp
does not yet handle the information.
Also remove -fopenmp-version=50, which is no longer needed for clang, but not
supported by gcc.
Refactored code of dependence processing and added new inoutset dependence type.
Compiler can set dependence flag to 0x8 when call __kmpc_omp_task_with_deps.
All dependence flags library gets so far and corresponding dependence types:
1 - IN, 2 - OUT, 3 - INOUT, 4 - MUTEXINOUTSET, 8 - INOUTSET.
Differential Revision: https://reviews.llvm.org/D97085
This reverts commit a1f550e052543f75acac9089b760cbc61729131f.
Revert in order to fix backwards compatibility breakage
caused by type size change for task dependence flag.
Refactored code of dependence processing and added new inoutset dependence type.
Compiler can set dependence flag to 0x8 when call __kmpc_omp_task_with_deps.
Size of type of the dependence flag changed from 1 to 4 bytes in clang.
All dependence flags library gets so far and corresponding dependence types:
1 - IN, 2 - OUT, 3 - INOUT, 4 - MUTEXINOUTSET, 8 - INOUTSET.
Differential Revision: https://reviews.llvm.org/D97085
Bug 49356 (https://bugs.llvm.org/show_bug.cgi?id=49356) reports crash in
the test case `tasking/bug_taskwait_detach.cpp`, which is caused by the wrong
function declaration. `gtid` in `__kmpc_omp_task` should be `kmp_int32`.
Reviewed By: AndreyChurbanov
Differential Revision: https://reviews.llvm.org/D102584
Implement the remaining GOMP_* functions to support task reductions
in taskgroup, parallel, loop, and taskloop constructs. The unused mem
argument to many of the work-sharing constructs has to do with the
scan() directive/ inscan() modifier. If mem is set, each function
will call KMP_FATAL() and tell the user scan/inscan is unsupported. The
GOMP reduction implementation is kept separate from our implementation
because of how GOMP presents reduction data and computes the reductions.
GOMP expects the privatized copies to be present even after a #pragma
omp parallel reduction(task:...) region has ended so the data is stored
inside GOMP's uintptr_t* data pseudo-structure. This style is tightly
coupled with GCC compiler codegen. There also isn't any init(),
combiner(), fini() functions in GOMP's codegen so the two
implementations were to disparate to try to wrap GOMP's around our own.
Differential Revision: https://reviews.llvm.org/D98806
It is reported that after enabling hidden helper thread, the program
can hit the assertion `new_gtid < __kmp_threads_capacity` sometimes. The root
cause is explained as follows. Let's say the default `__kmp_threads_capacity` is
`N`. If hidden helper thread is enabled, `__kmp_threads_capacity` will be offset
to `N+8` by default. If the number of threads we need exceeds `N+8`, e.g. via
`num_threads` clause, we need to expand `__kmp_threads`. In
`__kmp_expand_threads`, the expansion starts from `__kmp_threads_capacity`, and
repeatedly doubling it until the new capacity meets the requirement. Let's
assume the new requirement is `Y`. If `Y` happens to meet the constraint
`(N+8)*2^X=Y` where `X` is the number of iterations, the new capacity is not
enough because we have 8 slots for hidden helper threads.
Here is an example.
```
#include <vector>
int main(int argc, char *argv[]) {
constexpr const size_t N = 1344;
std::vector<int> data(N);
#pragma omp parallel for
for (unsigned i = 0; i < N; ++i) {
data[i] = i;
}
#pragma omp parallel for num_threads(N)
for (unsigned i = 0; i < N; ++i) {
data[i] += i;
}
return 0;
}
```
My CPU is 20C40T, then `__kmp_threads_capacity` is 160. After offset,
`__kmp_threads_capacity` becomes 168. `1344 = (160+8)*2^3`, then the assertions
hit.
Reviewed By: protze.joachim
Differential Revision: https://reviews.llvm.org/D98838
The basic design is to create an outer-most parallel team. It is not a regular team because it is only created when the first hidden helper task is encountered, and is only responsible for the execution of hidden helper tasks. We first use `pthread_create` to create a new thread, let's call it the initial and also the main thread of the hidden helper team. This initial thread then initializes a new root, just like what RTL does in initialization. After that, it directly calls `__kmpc_fork_call`. It is like the initial thread encounters a parallel region. The wrapped function for this team is, for main thread, which is the initial thread that we create via `pthread_create` on Linux, waits on a condition variable. The condition variable can only be signaled when RTL is being destroyed. For other work threads, they just do nothing. The reason that main thread needs to wait there is, in current implementation, once the main thread finishes the wrapped function of this team, it starts to free the team which is not what we want.
Two environment variables, `LIBOMP_NUM_HIDDEN_HELPER_THREADS` and `LIBOMP_USE_HIDDEN_HELPER_TASK`, are also set to configure the number of threads and enable/disable this feature. By default, the number of hidden helper threads is 8.
Here are some open issues to be discussed:
1. The main thread goes to sleeping when the initialization is finished. As Andrey mentioned, we might need it to be awaken from time to time to do some stuffs. What kind of update/check should be put here?
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D77609
The basic design is to create an outer-most parallel team. It is not a regular team because it is only created when the first hidden helper task is encountered, and is only responsible for the execution of hidden helper tasks. We first use `pthread_create` to create a new thread, let's call it the initial and also the main thread of the hidden helper team. This initial thread then initializes a new root, just like what RTL does in initialization. After that, it directly calls `__kmpc_fork_call`. It is like the initial thread encounters a parallel region. The wrapped function for this team is, for main thread, which is the initial thread that we create via `pthread_create` on Linux, waits on a condition variable. The condition variable can only be signaled when RTL is being destroyed. For other work threads, they just do nothing. The reason that main thread needs to wait there is, in current implementation, once the main thread finishes the wrapped function of this team, it starts to free the team which is not what we want.
Two environment variables, `LIBOMP_NUM_HIDDEN_HELPER_THREADS` and `LIBOMP_USE_HIDDEN_HELPER_TASK`, are also set to configure the number of threads and enable/disable this feature. By default, the number of hidden helper threads is 8.
Here are some open issues to be discussed:
1. The main thread goes to sleeping when the initialization is finished. As Andrey mentioned, we might need it to be awaken from time to time to do some stuffs. What kind of update/check should be put here?
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D77609
This patch adds new API __kmpc_taskloop_5 to accomadate strict
modifier (introduced in OpenMP 5.1) in num_tasks and grainsize
clause.
Differential Revision: https://reviews.llvm.org/D92352
This change introduces the GOMP_taskwait_depend() function. It implements
the OpenMP 5.0 feature of #pragma omp taskwait with depend() clause by
wrapping around __kmpc_omp_wait_deps().
Differential Revision: https://reviews.llvm.org/D87269
Encapsulate GOMP task dependencies in separate class and introduce the
new mutexinoutset dependency type. This separate class allows
future GOMP task APIs easier access to the task dependency functionality
and better ability to propagate new dependency types to all existing GOMP
task APIs which use task dependencies.
Differential Revision: https://reviews.llvm.org/D87267