On powerpc, physical_package_id may not be available. Currently, this
causes openmp to fall back to flat topology and various affinity tests
fail.
Fix this by parsing core_siblings_list to determine which cpus belong
to the same socket. This matches what the testing code does. The code to
parse the CPU list format thankfully already exists.
Fixes https://github.com/llvm/llvm-project/issues/111809.
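As an illustration of the approach, here is a minimal sketch of parsing the CPU list format (for example `0-3,8-11`, as found in the Linux sysfs file `core_siblings_list`) and using it to decide whether two CPUs share a socket. The names `parse_cpu_list`, `same_package`, and `MAX_CPUS` are hypothetical and are not the runtime's actual parser.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

#define MAX_CPUS 256 /* hypothetical upper bound for this sketch */

/* Parse a list such as "0-3,8,10-11" into a byte mask.
 * Returns the number of CPUs set, or -1 if the list is malformed. */
static int parse_cpu_list(const char *s, unsigned char mask[MAX_CPUS]) {
  int count = 0;
  assert(s != NULL);
  memset(mask, 0, MAX_CPUS);
  while (*s && *s != '\n') {
    char *end;
    long lo, hi;
    if (!isdigit((unsigned char)*s))
      return -1;
    lo = hi = strtol(s, &end, 10);
    s = end;
    if (*s == '-') { /* a range "lo-hi" */
      ++s;
      if (!isdigit((unsigned char)*s))
        return -1;
      hi = strtol(s, &end, 10);
      s = end;
    }
    if (lo > hi || hi >= MAX_CPUS)
      return -1;
    for (long c = lo; c <= hi; ++c) {
      if (!mask[c]) {
        mask[c] = 1;
        ++count;
      }
    }
    if (*s == ',')
      ++s;
    else if (*s && *s != '\n')
      return -1;
  }
  return count;
}

/* Two CPUs are treated as being in the same socket when their
 * core_siblings_list contents describe the same set of CPUs. */
static int same_package(const char *list_a, const char *list_b) {
  unsigned char a[MAX_CPUS], b[MAX_CPUS];
  if (parse_cpu_list(list_a, a) <= 0 || parse_cpu_list(list_b, b) <= 0)
    return 0;
  return memcmp(a, b, MAX_CPUS) == 0;
}
```

Comparing the parsed sets rather than the raw strings means that `0-3` and `0,1,2,3` are treated as the same socket.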
This patch adds support for pause resource with a new enumerator
omp_pause_stop_tool. The expected behavior of this enumerator is
* omp_pause_resource: not allowed
* omp_pause_resource_all: equivalent to omp_pause_hard
The debug assert is meant to check that the index is valid, which means
the runtime needs to check against the size of the array instead of the
number of threads. A free()-ed thread put back in the thread pool may
index into anywhere inside the task team's available array from 0 to
tt_max_threads potentially.
Fixes: #94260
Add the reverse directive which will be introduced in the upcoming
OpenMP 6.0 specification. A preview has been published in [Technical
Report 12](https://www.openmp.org/wp-content/uploads/openmp-TR12.pdf).
---------
Co-authored-by: Alexey Bataev <a.bataev@outlook.com>
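To make the intended effect concrete, the sketch below emulates by hand what applying the reverse directive to `for (int i = 0; i < n; ++i)` is expected to do according to the TR12 preview: the same logical iterations run in the opposite order. It does not use the directive itself, and the helper names are hypothetical.

```c
#include <assert.h>

#define MAX_ITERS 64 /* hypothetical bound for this sketch */

/* Hand-written equivalent of reversing
 *   for (int i = 0; i < n; ++i) order[pos++] = i;
 * The generated loop counts upwards, and the original loop variable is
 * recomputed from the new counter. */
static void run_reversed(int n, int *order) {
  int pos = 0;
  for (int k = 0; k < n; ++k) {
    int i = n - 1 - k; /* logical iteration of the original loop */
    order[pos++] = i;
  }
}

/* Returns the original iteration number executed at position pos. */
static int nth_visited_reversed(int n, int pos) {
  int order[MAX_ITERS];
  assert(n > 0 && n <= MAX_ITERS && pos >= 0 && pos < n);
  run_reversed(n, order);
  return order[pos];
}
```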
This change makes the runtime use new OMPT state and sync kinds
introduced in OpenMP 5.1 in place of the deprecated implicit state and
sync kinds. Events from implicit barriers use different enumerators for
workshare, parallel, and teams.
Use more specific values from `ompt_work_t` to allow the tool to identify
the schedule of a worksharing-loop. With this patch, the runtime will
report the schedule chosen by the runtime rather than necessarily the
schedule literally requested by the clause.
E.g., for guided + just one iteration per thread, the runtime would
choose and report static.
Fixes issue #63904
Add support to the runtime for 6.0 spec features that allow num_threads
clause to take a list, and also make use of the strict modifier.
Provides new compiler interface functions for these features.
OpenMP loop transformation did not work on a for-loop using an iterator
or range-based for-loops. The first reason is that it combined the
iterator's type for generated loops with the type of `NumIterations` as
generated for any `OMPLoopBasedDirective` which is an integer. Fixed by
basing all generated loop variables on `NumIterations`.
Second, C++11 range-based for-loops include syntactic sugar that needs
to be executed before the loop. This additional code is now added to the
construct's Pre-Init lists.
Third, C++20 added an initializer statement to range-based for-loops
which is also added to the pre-init statement. PreInits used to be a
`DeclStmt` which made it difficult to add arbitrary statements from
`CXXRangeForStmt`'s syntactic sugar, especially the for-loops init
statement which does not need to be a declaration. Change it to be a
general `Stmt` that can be a `CompoundStmt` to hold arbitrary Stmts,
including DeclStmts. This also avoids the `PointerUnion` workaround used
by `checkTransformableLoopNest`.
End-to-end tests are added to verify the expected number and order of
loop execution and evaluations of expressions (such as iterator
dereference). The order and number of evaluations of expressions in
canonical loops is explicitly undefined by OpenMP but checked here for
clarification and for changes to be noticed.
Allow non-constants in the `sizes` clause such as
```
#pragma omp tile sizes(a)
for (int i = 0; i < n; ++i)
```
This is permitted since tile was introduced in [OpenMP
5.1](https://www.openmp.org/spec-html/5.1/openmpsu53.html#x78-860002.11.9).
It is possible to sneak-in negative numbers at runtime as in
```
int a = -1;
#pragma omp tile sizes(a)
```
Even though it is not well-formed, it should still result in every loop
iteration being executed exactly once, an invariant of the tile
construct that we should ensure. `ParseOpenMPExprListClause` is
extracted-out to be reused by the `permutation` clause of the
`interchange` construct. Some care was put into ensuring correct behavior
in template contexts.
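The invariant can be illustrated with a hand-tiled loop. This is a sketch only: it assumes that a non-positive runtime tile size is treated as 1, which is one way to keep the invariant and may differ from what the compiler actually generates. All names are hypothetical.

```c
#include <assert.h>

#define MAX_ITERS 64 /* hypothetical bound for this sketch */

/* Hand-written equivalent of
 *   #pragma omp tile sizes(a)
 *   for (int i = 0; i < n; ++i) hits[i]++;
 * ASSUMPTION: a tile size <= 0 is clamped to 1. */
static void tiled_loop(int n, int a, int *hits) {
  int tile = a > 0 ? a : 1;
  int floor_i = 0;
  while (floor_i < n) { /* floor loop: one trip per tile */
    int remaining = n - floor_i;
    int len = remaining < tile ? remaining : tile;
    for (int i = floor_i; i < floor_i + len; ++i) /* tile loop */
      hits[i]++;
    floor_i += len;
  }
}

/* Returns 1 if every iteration in [0, n) ran exactly once. */
static int every_iteration_once(int n, int a) {
  int hits[MAX_ITERS] = {0};
  assert(n >= 0 && n <= MAX_ITERS);
  tiled_loop(n, a, hits);
  for (int i = 0; i < n; ++i)
    if (hits[i] != 1)
      return 0;
  return 1;
}
```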
Root cause: Segmentation fault is caused by null pointer dereference
inside the __kmpc_fork_call_if function at
https://github.com/llvm/llvm-project/blob/main/openmp/runtime/src/z_Linux_asm.S#L1186
__kmpc_fork_call_if is missing a case to handle argc=0.
Fix: Added a check inside the __kmp_invoke_microtask function to handle
the case when argc is 0.
---------
Co-authored-by: Singh <chasingh@amd.com>
When a child process is forked with OpenMP already initialized, the
child process resets its affinity mask and sets proc-bind-var to false
so that the entire original affinity mask is used. This patch corrects
an issue with the affinity initialization code setting affinity to
compact instead of none for this special case of forked children.
The test trying to catch this only tested an explicit setting of
KMP_AFFINITY=none. Add a test run with no KMP_AFFINITY setting.
Fixes: #91098
* Serial teams now use a stack (similar to dispatch buffers)
* Serial teams always use `t_task_team[0]` as the task team and the
second pointer is a next pointer for the stack
`t_task_team[1]` is interpreted as a stack of task teams where each
level is a nested level
```
inner serial team outer serial team
[ t_task_team[0] ] -> (task_team) [ t_task_team[0] ] -> (task_team)
[ next ] ----------------> [ next ] -> ...
```
* Remove the task state memo stack from thread structure.
* Instead of a thread-private stack, use team structure to store
th_task_state of the primary thread. When coming out of a parallel,
restore the primary thread's task state. The new field in the team
structure doesn't cause sizeof(team) to change and is in the cache line
which is only read/written by the primary thread.
Fixes: #50602
Fixes: #69368
Fixes: #69733
Fixes: #79416
The new collapse test cases define `MAX_THREADS` to be 256 and use all
available threads/logical processors on the system. This triples the
testing time on an AIX machine that has 128 logical processors. This
patch changes to use half of available logical processors to avoid over
subscribing because there are other libomp tests running at the same
time, including 2 other such collapse tests.
This patch fixes the test config so that it works for
`tasking/omp50_taskdep_depobj.c` which uses different flags to test with
compiler's `omp.h`.
* set test environment variable `OBJECT_MODE` to `64` if it is set
explicitly to `64` in the AIX environment. `OBJECT_MODE` defaults to
`32` and is recognized by AIX compilers and toolchain. In this way, we
don't need to set `-m64` for all compiler flags for 64-bit mode
* add option `-Wl,-bmaxdata` to 32-bit `test_openmp_flags` used by
`tasking/omp50_taskdep_depobj.c`
Users can put a : in front of KMP_HW_SUBSET to indicate that the
specified subset is an "absolute" subset. Currently, when a user puts
KMP_HW_SUBSET=1t, this gets translated to KMP_HW_SUBSET="*s,*c,1t",
where * means "use all of". If a user wants only one thread as the
entire topology they can now do KMP_HW_SUBSET=:1t.
Along with the absolute syntax is a fix for newer machines and making
them easier to use with only the 3-level topology syntax. When a user
puts KMP_HW_SUBSET=1s,4c,2t on a machine which actually has 4 layers,
(say 1s,2m,3c,2t as the entire machine) the user gets an unexpected "too
many resources asked" message because KMP_HW_SUBSET currently translates
the "4c" value to mean 4 cores per module. To help users out, the
runtime can assume that these newer layers, module in this case, should
be ignored if they are not specified, but the topology should always
take into account the sockets, cores, and threads layers.
When a nested parallel region ends, the runtime calls __kmp_join_call().
During this call, the primary thread of the nested parallel region will
reset its tid (retval of omp_get_thread_num()) to what it was in the
outer parallel region. A data race occurs with the current code when
another worker thread from the nested inner parallel region tries to
steal tasks from the primary thread's task deque. The worker thread
reads the tid value directly from the primary thread's data structure
and may read the wrong value.
This change just uses the calculated victim_tid from execute_tasks()
directly in the steal_task() routine rather than reading tid from the
data structure.
Fixes: #87307
detect `aarch64_32` with compiler defined macro `__ARM64_ARCH_8_32__`
reuse ARM `__kmp_unnamed_critical_addr` and add `KMP_PREFIX_UNDERSCORE`
macro like AARCH64
reuse AARCH64 `__kmp_invoke_microtask`
build log for watchos armv7k + arm64_32 and watchos simulator x86_64 +
arm64
https://github.com/nihui/action-protobuf/actions/runs/8520684611/job/23337305030
The hidden helper team pre-allocates the gtid space [1,
num_hidden_helpers] (inclusive). If regular host threads are allocated,
then put back in the thread pool, then the hidden helper team is
initialized, the hidden helper team tries to allocate the threads from
the thread pool with gtids higher than [1, num_hidden_helpers]. Instead,
have the hidden helper team fork OS threads so the correct gtid range
is used for hidden helper threads.
Fixes: #87117
This patch implements `affinity` for AIX, which is quite different from
platforms such as Linux.
- Setting CPU affinity through masks and related functions is not
supported. The system call `bindprocessor()` is used to bind a thread to
one CPU per call.
- There are no system routines to get the affinity info of a thread. The
implementation of `get_system_affinity()` for AIX gets the mask of all
available CPUs, to be used as the full mask only.
- Topology is not available from the file system. It is obtained through
system SRAD (Scheduler Resource Allocation Domain).
This patch has run through the libomp LIT tests successfully with
`affinity` enabled.
MSVC does not define __BYTE_ORDER__ making the check for BigEndian
erroneously evaluate to true and breaking the struct definitions in MSVC
compiled builds correspondingly. The fix adds an additional check for
whether __BYTE_ORDER__ is defined by the compiler to fix these.
---------
Co-authored-by: Vadim Paretsky <b-vadipa@microsoft.com>
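The underlying pitfall is that undefined identifiers evaluate to 0 inside `#if`, so when neither macro is defined the comparison becomes `0 == 0`. A minimal sketch of a guarded check follows; `SKETCH_BIG_ENDIAN` and the helper functions are hypothetical names, not the runtime's actual macros.

```c
/* Unsafe form: if the compiler defines neither macro, both sides are
 * replaced by 0 and the condition is true on every target:
 *   #if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
 *
 * Guarded form: only true when the compiler really provides the macros. */
#if defined(__BYTE_ORDER__) && defined(__ORDER_BIG_ENDIAN__) &&               \
    (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
#define SKETCH_BIG_ENDIAN 1
#else
#define SKETCH_BIG_ENDIAN 0
#endif

/* What the preprocessor check decided at compile time. */
static int compiled_as_big_endian(void) { return SKETCH_BIG_ENDIAN; }

/* Independent runtime probe: on a big-endian host the first byte of the
 * integer 1 is 0. */
static int runtime_is_big_endian(void) {
  const unsigned int probe = 1;
  return *(const unsigned char *)&probe == 0;
}
```

The guarded form treats a compiler that does not define the macros as little-endian, which is an assumption that holds for the targets MSVC supports.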
This PR adds OMP runtime support for more efficient partitioning of
certain types of collapsed loops that can be used by compilers that
support loop collapsing (e.g., MSVC) to achieve better thread load
balancing.
In particular, this PR addresses double nested upper and lower isosceles
triangular loops of the following types
1. lower triangular 'less_than'
for (int i=0; i<N; i++)
for (int j=0; j<i; j++)
2. lower triangular 'less_than_equal'
for (int i=0; i<N; i++)
for (int j=0; j<=i; j++)
3. upper triangular
for (int i=0; i<N; i++)
for (int j=i; j<N; j++)
Includes tests for the three supported loop types.
---------
Co-authored-by: Vadim Paretsky <b-vadipa@microsoft.com>
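Balanced partitioning of a collapsed triangular loop relies on mapping a linear iteration number back to an `(i, j)` pair. The sketch below does this for the lower triangular 'less_than' case using integer arithmetic only. It illustrates the idea and is not the runtime's actual algorithm; the function names are hypothetical.

```c
#include <stdint.h>

/* Map a linear iteration number k (0-based) of
 *   for (i = 0; i < N; i++) for (j = 0; j < i; j++)
 * back to (i, j). Row i starts at linear index i*(i-1)/2, so i is the
 * largest r with r*(r-1)/2 <= k. ASSUMPTION: k is small enough that the
 * answer is below 2^32. */
static void lower_tri_index(uint64_t k, uint64_t *i, uint64_t *j) {
  uint64_t lo = 1, hi = (uint64_t)1 << 32;
  while (lo < hi) { /* binary search for the row */
    uint64_t mid = lo + (hi - lo + 1) / 2;
    if (mid * (mid - 1) / 2 <= k)
      lo = mid;
    else
      hi = mid - 1;
  }
  *i = lo;
  *j = k - lo * (lo - 1) / 2;
}

/* Walk the real loop nest and count positions where the mapping
 * disagrees with it. */
static uint64_t lower_tri_mismatches(uint64_t n) {
  uint64_t k = 0, bad = 0;
  for (uint64_t i = 0; i < n; i++)
    for (uint64_t j = 0; j < i; j++, k++) {
      uint64_t mi, mj;
      lower_tri_index(k, &mi, &mj);
      if (mi != i || mj != j)
        bad++;
    }
  return bad;
}
```

With such a mapping, the N*(N-1)/2 iterations can be split into equal contiguous chunks per thread instead of splitting by rows of unequal length.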
The resume thread logic inside __kmp_free_team() is faulty. Only
checking b_go for sleep status doesn't wake up threads sleeping on the
distributed barrier. Change to a generic check for th_sleep_loc and call
__kmp_null_resume_wrapper().
Fixes: #80664
Within the MSVC ABI, long doubles are the same as regular 64 bit
doubles. This test case, which is compiled with -mlong-double-80, cannot
work when libomp has been compiled without that flag, as
-mlong-double-80 changes the calling convention for the tested
functions.
This PR contains initial changes for building and testing libomp on AIX.
More changes will follow.
- `KMP_OS_AIX` is defined for the AIX platform
- `KMP_ARCH_PPC` is defined for 32-bit PPC
- `KMP_ARCH_PPC_XCOFF` and `KMP_ARCH_PPC64_XCOFF` are for 32- and 64-bit
XCOFF object formats respectively
- Assembly file `z_AIX_asm.S` is used for AIX specific assembly code and
will be added in a separate PR
- The target library is disabled because AIX does not have the device
support
- OMPT is temporarily disabled
ompt/synchronization/[masked.c | master.c] tests fail due to a wrong
offset being calculated for the possible return addresses. PR #65936
fixes this for Darwin and the same has to be done for Linux.
Updates #69627
- The `nothing` directive was affecting the `if` block structure, which it
should not. Return an empty statement instead of an error statement
while parsing to avoid this.
From "3.1 Reducing the number of edges" of this [[ https://hal.science/hal-04136674v1/ | paper ]] - Optimization (b)
Task (dependency) nodes have a `successors` list built upon passed dependency.
Given the following code, B will be added to A's successors list building the graph `A` -> `B`
```
// A
# pragma omp task depend(out: x)
{}
// B
# pragma omp task depend(in: x)
{}
```
In the following code, B is currently added twice to A's successor list
```
// A
# pragma omp task depend(out: x, y)
{}
// B
# pragma omp task depend(in: x, y)
{}
```
This patch removes such duplicates by checking the last inserted task in `A`'s successor list.
Authored by: Romain Pereira (rpereira-dev)
Differential Revision: https://reviews.llvm.org/D158544
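A minimal sketch of the check, assuming a singly linked successor list with insertion at the head. The types and function names are hypothetical and much simpler than the runtime's dependency nodes.

```c
#include <stddef.h>
#include <stdlib.h>

typedef struct node node_t;
typedef struct succ {
  node_t *task;
  struct succ *next;
} succ_t;
struct node {
  const char *name;
  succ_t *successors; /* most recently inserted successor first */
  int npredecessors;
};

/* Add task as a successor of pred unless it was the last one inserted.
 * Returns 1 if an edge was added, 0 if it was skipped as a duplicate. */
static int add_successor(node_t *pred, node_t *task) {
  succ_t *e;
  if (pred->successors && pred->successors->task == task)
    return 0; /* same edge as the previous insertion */
  e = malloc(sizeof *e);
  if (!e)
    abort();
  e->task = task;
  e->next = pred->successors;
  pred->successors = e;
  task->npredecessors++;
  return 1;
}

/* Emulate task B depending on ndeps variables all written by task A
 * and return how many A -> B edges end up in the graph. */
static int edges_for_shared_deps(int ndeps) {
  node_t a = {"A", NULL, 0}, b = {"B", NULL, 0};
  int edges = 0;
  for (int d = 0; d < ndeps; ++d)
    edges += add_successor(&a, &b);
  while (a.successors) { /* release the list */
    succ_t *e = a.successors;
    a.successors = e->next;
    free(e);
  }
  return edges;
}
```

Checking only the last inserted entry keeps the test O(1). It catches the common case where all dependences of one task are processed back to back, but not duplicates separated by other insertions.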
This commit adds skewed distribution of iterations in
nonmonotonic:dynamic schedule (static steal) for hybrid systems when
thread affinity is assigned. Currently, it distributes the iterations at
60:40 ratio. Consider this loop with dynamic schedule type,
for (int i = 0; i < 100; ++i). In a hybrid system with 20 hardware
threads (16 CORE and 4 ATOM core), 88 iterations will be assigned to
performance cores and 12 iterations will be assigned to efficient cores.
Each thread with CORE core will process 5 iterations + extras and with
ATOM core will process 3 iterations.
Differential Revision: https://reviews.llvm.org/D152955
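The arithmetic of the example can be reproduced with the sketch below. The formula is an assumption chosen to match the numbers quoted above (a 3:2 weight per thread, rounded down for efficient cores, with performance cores taking the remainder); it is not necessarily the computation the runtime performs. All names are hypothetical.

```c
#include <assert.h>

typedef struct {
  int e_each;   /* iterations per efficient-core thread */
  int e_total;  /* iterations on all efficient-core threads */
  int p_total;  /* iterations on all performance-core threads */
  int p_base;   /* base iterations per performance-core thread */
  int p_extras; /* leftover iterations spread over performance threads */
} split_t;

/* ASSUMPTION: each performance thread weighs 1.5x an efficient thread
 * (60:40), so e_each = floor(n / (1.5 * p + e)). */
static split_t skewed_split(int n, int p_threads, int e_threads) {
  split_t s;
  assert(n >= 0 && p_threads > 0 && e_threads >= 0);
  s.e_each = (2 * n) / (3 * p_threads + 2 * e_threads);
  s.e_total = s.e_each * e_threads;
  s.p_total = n - s.e_total;
  s.p_base = s.p_total / p_threads;
  s.p_extras = s.p_total % p_threads;
  return s;
}
```

For 100 iterations with 16 performance and 4 efficient threads this gives 3 iterations per efficient thread (12 in total) and 88 for the performance threads, that is 5 each plus 8 extras.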
* openmp/README.rst
- Add s390x to those platforms supported
* openmp/libomptarget/plugins-nextgen/CMakeLists.txt
- Add s390x subdirectory
* openmp/libomptarget/plugins-nextgen/s390x/CMakeLists.txt
- Add s390x definitions
* openmp/runtime/CMakeLists.txt
- Add s390x to those platforms supported
* openmp/runtime/cmake/LibompGetArchitecture.cmake
- Define s390x ARCHITECTURE
* openmp/runtime/cmake/LibompMicroTests.cmake
- Add dependencies for System z (aka s390x)
* openmp/runtime/cmake/LibompUtils.cmake
- Add S390X to the mix
* openmp/runtime/cmake/config-ix.cmake
- Add s390x as a supported LIBOMP_ARCH
* openmp/runtime/src/kmp_affinity.h
- Define __NR_sched_[get|set]affinity for s390x
* openmp/runtime/src/kmp_config.h.cmake
- Define CACHE_LINE for s390x
* openmp/runtime/src/kmp_os.h
- Add KMP_ARCH_S390X to support checks
* openmp/runtime/src/kmp_platform.h
- Define KMP_ARCH_S390X
* openmp/runtime/src/kmp_runtime.cpp
- Generate code when KMP_ARCH_S390X is defined
* openmp/runtime/src/kmp_tasking.cpp
- Generate code when KMP_ARCH_S390X is defined
* openmp/runtime/src/thirdparty/ittnotify/ittnotify_config.h
- Define ITT_ARCH_S390X
* openmp/runtime/src/z_Linux_asm.S
- Instantiate __kmp_invoke_microtask for s390x
* openmp/runtime/src/z_Linux_util.cpp
- Generate code when KMP_ARCH_S390X is defined
* openmp/runtime/test/ompt/callback.h
- Define print_possible_return_addresses for s390x
* openmp/runtime/tools/lib/Platform.pm
- Return s390x as platform and host architecture
* openmp/runtime/tools/lib/Uname.pm
- Set hardware platform value for s390x
struct DEP defined in multiple testcases must correspond to runtime's
struct kmp_depend_info. The former defines flags as int, and the latter
as kmp_uint8_t. This discrepancy goes unnoticed on little-endian
systems, but breaks big-endian ones.
Make flags in struct DEP unsigned char.
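The sketch below shows why the mismatch only matters on big-endian hosts. The struct and function names are hypothetical stand-ins for the test's `struct DEP` and the runtime's `kmp_depend_info`. The reader looks at one byte at the offset of `flags`, while the old test code stored a whole `int` there.

```c
#include <stddef.h>
#include <string.h>

/* Test-side layout before the fix: flags stored as an int. */
struct dep_int_flags {
  long addr;
  size_t len;
  int flags;
};

/* Layout with a one-byte flags field, as the runtime reads it. */
struct dep_byte_flags {
  long addr;
  size_t len;
  unsigned char flags;
};

/* Store flags as an int, then read back the single byte that a reader
 * using the one-byte layout would see. */
static unsigned char byte_seen_by_runtime(int flags) {
  struct dep_int_flags d = {0, 0, flags};
  unsigned char b;
  memcpy(&b, (const char *)&d + offsetof(struct dep_byte_flags, flags), 1);
  return b;
}

/* Runtime probe of the host byte order. */
static int host_is_little_endian(void) {
  const unsigned int probe = 1;
  return *(const unsigned char *)&probe == 1;
}
```

On a little-endian host the low-order byte of the `int` comes first, so small flag values are read correctly by accident. On a big-endian host the first byte is the high-order byte and the reader sees 0.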
Change to use VE_LD_LIBRARY_PATH for VE instead of LD_LIBRARY_PATH. The
VE is connected to the host, and compiled test programs for VE are
invoked on the host and transferred to the VE. If programs are compiled
for the host, we use LD_LIBRARY_PATH. Otherwise, we use
VE_LD_LIBRARY_PATH.
Support OpenMP runtime library on VE. This patch makes OpenMP compilable
for VE architecture. Almost all tests run correctly on VE.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D159401
Previously, the test ran a section with
#pragma omp target thread_limit(4)
and expected it to execute exactly 4 times, even though it would
in practice execute min(cores, 4) times.
Increment a counter and check that it executed 1-4 times.
Differential Revision: https://reviews.llvm.org/D159311
offloading
- This patch adds support for thread_limit clause on target directive according to OpenMP 5.1 [2.14.5]
- The idea is to create an outer task for target region, when there is a thread_limit clause, and manipulate the thread_limit of task instead. This way, thread_limit will be applied to all the relevant constructs enclosed by the target region.
Differential Revision: https://reviews.llvm.org/D152054
* Add KMP_CPU_EQUAL and KMP_CPU_ISEMPTY to affinity mask API
* Add printout of leader to hardware thread dump
* Allow OMP_PLACES to restrict fullMask
This change fixes an issue with the OMP_PLACES=resource(#) syntax.
Before this change, specifying the number of resources did NOT change
the default number of threads created by the runtime. e.g.,
OMP_PLACES=cores(2) would still create __kmp_avail_proc number of
threads. After this change, the fullMask and __kmp_avail_proc are
modified if necessary so that the final place list dictates which
resources are available and thus how many threads are created by
default.
* Introduce hybrid core attributes to OMP_PLACES and KMP_AFFINITY
For OMP_PLACES, two new features are added:
1) OMP_PLACES=cores:<attribute> where <attribute> is either
intel_atom, intel_core, or eff# where # is 0 - number of core
efficiencies-1. This syntax also supports the optional (#)
number selection of resources.
2) OMP_PLACES=core_types|core_effs where this setting will create
the number of core_types (or core_effs|core_efficiencies).
For KMP_AFFINITY, the granularity setting is expanded to include two new
keywords: core_type, and core_eff (or core_efficiency). This will set
the granularity to include all cores with a particular core type (or
efficiency). e.g., KMP_AFFINITY=granularity=core_type,compact will
create threads which can float across a single core type.
Differential Revision: https://reviews.llvm.org/D154547
Add CHECK_OPENMP_ENV environment variable which will be passed to environment
variables for test (make check-* target). This provides a handy way to
exercise various openmp code with different settings during development.
For example, to change default barrier pattern:
```
$ env CHECK_OPENMP_ENV="KMP_FORKJOIN_BARRIER_PATTERN=hier,hier \
KMP_PLAIN_BARRIER_PATTERN=hier,hier \
KMP_REDUCTION_BARRIER_PATTERN=hier,hier" \
ninja check-openmp
```
Even with this, each test can set appropriate environment variables if needed
as before.
Also, this commit adds missing documentation about how to run tests in README.
Patch provided by t-msn
Differential Revision: https://reviews.llvm.org/D122645
The OpenMP specification mentions that omp_test_lock and
omp_test_nest_lock dispatch OMPT callbacks with ompt_mutex_test_lock
and ompt_mutex_test_nest_lock for their kind respectively. Previously,
the values ompt_mutex_lock and ompt_mutex_nest_lock were used. This
could cause issues in applications relying on the kind to correctly
determine lock states. This commit changes the kind to the expected
ones.
Also update callback.h and OMPT tests to reflect this change.
Patch prepared by Thyre
Differential Review: https://reviews.llvm.org/D153028
Differential Review: https://reviews.llvm.org/D153031
Differential Review: https://reviews.llvm.org/D153032
omp_all_memory currently has no representation in OMPT.
Adding new dependency flags as suggested by omp-lang issue #3007.
Differential Revision: https://reviews.llvm.org/D111788