Based on https://en.cppreference.com/w/c/memory/aligned_alloc, the
`size` is supposed
to be a multiple of `alignment`, and it is implementation defined
behavior if not.
We have a non-conformant use in `kmp_barrier.h` when allocating
distribute barrier.
The size of the barrier is 576 and the alignment is `4*CACHE_LINE`,
which is 256
on most systems. Apparently it works perfectly fine for Linux and
Intel-based Mac,
but not for Apple Silicon based Mac.
Fix#63194.
Summary:
The `omp_get_num_procs()` function should return the amount of
parallelism availible. On the GPU, this was not defined. We have elected
to define this function as the maximum amount of wavefronts / warps that
can be simultaneously resident on the device. For AMDGPU this is the
number of CUs multiplied byth CU's per wave. For NVPTX this is the
maximum threads per SM divided by the warp size and multiplied by the
number of SMs.
The function assumes that `__kmp_gtid_get_specific` always returns a valid gtid.
That is not always true, because when creating the key for thread-specific data,
a destructor is assigned. The dtor will be called at thread exit. However, before
the dtor is called, the thread-specific data will be reset to NULL first
(https://pubs.opengroup.org/onlinepubs/009695399/functions/pthread_key_create.html):
> At thread exit, if a key value has a non-NULL destructor pointer, and the thread
> has a non-NULL value associated with that key, the value of the key is set to NULL.
This will lead to that `__kmp_gtid_get_specific` returns `KMP_GTID_DNE`.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D159369
The outlined function is typically invoked by using
`__kmp_invoke_microtask`,
which is written in asm. D138495 introduces a new interface function for
parallel
region for OpenMPIRBuilder, where the outlined function is called via
the function
pointer. For some reason, it works perfectly well on x86 and x86-64
system, but
doesn't work on Apple Silicon. The 3rd argument in the callee is always
`nullptr`, even
if it is not in caller. It appears `x2` always contains `0x0`. This
patch adopts
the typical method to invoke the function pointer. It works on my M2
Ultra Mac.
Fix#63194.
This patch ensures that the locally built version of flang when building in-tree. `find_program` sometimes used the wrong executable if a different copy of flang was installed.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D159161
Previously, the test ran a section with
#pragma omp target thread_limit(4)
and expected it to execute exactly 4 times, even though it would
in practice execute min(cores, 4) times.
Increment a counter and check that it executed 1-4 times.
Differential Revision: https://reviews.llvm.org/D159311
This patch adds a test that uses a target region to set a scalar value. It also
adds rules in lit.cfg to handle fortran testing.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D159216
In `__kmp_get_global_thread_id`, if the gtid mode is 1, after getting the gtid
from TLS, it will store the gtid value to the thread stack maintained in the thread
descriptor. However, `__kmp_get_global_thread_id` can be called when the library
is destructed, after the corresponding thread info has been release. This will
cause a segment fault. This can happen on an Intel-based Mac.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D159324
This patch fixes the flang detection in the openmp fortran offloading test.
Reviewed By: jsjodin
Differential Revision: https://reviews.llvm.org/D158546
Update DeviceRTL and the AMDGPU plugin to support code
object version 5. Default is code object version 4.
CodeGen for __builtin_amdgpu_workgroup_size generates code
for cov4 as well as cov5 if -mcode-object-version=none
is specified. DeviceRTL compilation passes this argument
via Xclang option to generate abi-agnostic code.
Generated code for the above builtin uses a clang
control constant "llvm.amdgcn.abi.version" to branch on
the abi version, which is available during linking of
user's OpenMP code. Load of this constant gets eliminated
during linking.
AMDGPU plugin queries the ELF for code object version
and then prepares various implicitargs accordingly.
Differential Revision: https://reviews.llvm.org/D139730
Reviewed By: jhuber6, yaxunl
As discussed on the weekly OpenMP meeting on the second of August 2023, the default version
in the OpenMP documentation shoud be changed from OpenMP 5.0 to 5.1.
Differential Revision: https://reviews.llvm.org/D156901
At the moment Archer segfaults due to a null-pointer access, if an application
uses taskwait with depend clause as used in the two new tests.
This patch cleans up the task_schedule function, moves semantic blocks into
functions and replaces the if blocks by a single switch statement. The switch
statement will warn, when new enum values are added in OMPT and makes clear
what code is executed for the different cases.
With free-agent tasks coming up in OpenMP 6.0, we should expect more
null-pointer task_data, so additional null-pointer checks were added.
We also cannot rely on having an implicit task on the stack, so the
BarrierIndex is stored during task creation.
Differential Revision: https://reviews.llvm.org/D158072
Since td_allow_completion_event is a member of the taskdata struct, not all
firstprivate/shared variables are stored at the end of the task memory
allocation. Simply report the whole allocation instead.
Furthermore, the function should always return 0 since in no case there is
another block to report.
Differential Review: https://reviews.llvm.org/D158080
offloading
- This patch adds support for thread_limit clause on target directive according to OpenMP 51 [2.14.5]
- The idea is to create an outer task for target region, when there is a thread_limit clause, and manipulate the thread_limit of task instead. This way, thread_limit will be applied to all the relevant constructs enclosed by the target region.
Differential Revision: https://reviews.llvm.org/D152054
D152014 introduced an optimization that favors more smaller blocks over
fewer larger blocks, even if user sets `thread_limit` explicitly. This patch changes
the behavior to honor user value.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D158802
The changes in D157738 allowed for us to emit stub globals on the device
in the offloading entry section. These globals contain the addresses of
device functions and allow us to map host functions to their
corresponding device equivalent. This patch provides the initial support
required to build a table on the device to lookup the associated value.
This is done by finding these entries and creating a global table on the
device that can be searched with a simple binary search.
This requires an allocation, which supposedly should be automatically
freed at plugin shutdown. This includes a basic test which looks up device
pointers via a host pointer using the added function. This will need to be built
upon to provide full support for these calls in the runtime.
To support reverse offloading it would also be useful to provide a reverse table
that allows us to get host functions from device stubs.
Depends on D157738
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D157918
Before, we checked the parallel region only once, and ignored updates in
the KernelInfo for the parallel region that happened later. This caused
us to think nested parallel sections are not present even if they are,
among other things.
Make the generic-plugin report a corresponding CU kind -- instead of 'unknown'.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D158542
The old code did not account for new queries during an update, which
caused us to leave stack RQIs in the map. We are now explicit about
temporary vs non-temporary RQIs.
Fixes: https://github.com/llvm/llvm-project/issues/64959
Using LIBOMPTARGET_PROFILER, duplicates are created from timing both Kernel functions and Data update functions.
I commented out the duplicate timescope and left them in the targetkernel and the targetdataupdate functions. This
way the timescope calls will be closer to the launching of the kernel and the data moving.
Reviewed By: jdoerfert, tianshilei1992
Differential Revision: https://reviews.llvm.org/D157725
When we used to treat the kernel end as as aligned barrier, assertions
at the end made sense. Now, they actually cause problems as the "writes"
are not ordered with regards to reads within the kernel. We can simply
get rid of them.
We are having a hard time optimizing some vectorized loads/stores later
on which causes this optimization to degrade performance.
Differential Revision: https://reviews.llvm.org/D158656
We used to have two separate implementations to derive the number of
threads used in a target region. This lead us to sometimes miss out on
user provided thread bounds (num_threads, or thread_limit) when we
looked for "constant default values". If we might miss out on the
presence of those bounds, we cannot set the thread_limit statically
since the runtime will try to honor user input rather than cap it at the
"preferred default". This patch replaces the secondary implementation
with the primary in a mode that will not emit code but just look for the
presence, and potentially upper bounds, of thread limiting clauses.
The runtime test would not pass without this rewrite as we missed some
clauses, set the static limit on the device to the preferred value, but
then violated that value at runtime.
Fixes: https://github.com/llvm/llvm-project/issues/64845
Differential Revision: https://reviews.llvm.org/D158381
A few places in the loop collapse support code make small dynamic allocations
that introduce a noticeable performance overhead when made on the heap.
This change moves allocations up to 32 bytes to the stack instead of the heap.
Differential Revision: https://reviews.llvm.org/D158220
This change has the primary thread create each thread's initial mask
and topology information so it is available immediately after
forking. The setting of mask/topology information is decoupled from the
actual binding. Also add this setting of topology information inside the
__kmp_partition_places mechanism for OMP_PLACES+OMP_PROC_BIND.
Without this, there could be a timing window after the primary
thread signals the workers to fork where worker threads have not yet
established their affinity mask or topology information.
Each worker thread will then bind to the location the primary thread
sets.
Differential Revision: https://reviews.llvm.org/D156727
This patch fixes: https://github.com/llvm/llvm-project/issues/64738
We observed multiple issues, primarily that the `DeviceId` was reported as -1
in certain scenarios. The reason for this is simply that the device is not
initialized at that point. Hence, we need to move the RAII object creation just
after the `checkDeviceAndCtors`, closer to the actual call we want to observe.
This also solves an odering issue where one `target enter data` callback would
be executed before the `Init` callback.
Additionally, this change will also fix that the callbacks corresponding to
`enter / exit data` and `update` in conjunction with `nowait` would not result
in the emission of an OMPT callback.
Added a testcase to cover initialized device number and `omp target` constructs.
Reviewed By: dhruvachak
Differential Revision: https://reviews.llvm.org/D157605
AAIndirectCallInfo will collect information and specialize indirect call
sites. It is similar to our IndirectCallPromotion but runs as part of
the Attributor (so with assumed callee information). It also expands
more calls and let's the rest of the pipeline figure out what is UB, for
now. We use existing call promotion logic to improve the result,
otherwise we rely on the (implicit) function pointer cast.
This effectively "fixes" #60327 as it will undo the type punning early
enough for the inliner to work with the (now specialized, thus direct)
call.
Fixes: https://github.com/llvm/llvm-project/issues/60327
This change adds the option of using different units for blocktimes specified via the KMP_BLOCKTIME environment variable. The parsing of the environment now recognizes units suffixes: ms and us. If a units suffix is not specified, the default unit is ms. Thus default behavior is still the same, and any previous usage still works the same. Internally, blocktime is now converted to microseconds everywhere, so settings that exceed INT_MAX in microseconds are considered "infinite".
kmp_set/get_blocktime are updated to use the units the user specified with KMP_BLOCKTIME, and if not specified, ms are used.
Added better range checking and inform messages for the two time units. Large values of blocktime for default (ms) case (beyond INT_MAX/1000) are no longer allowed, but will autocorrect with an INFORM message.
The delay for determining ticks per usec was lowered. It is now 1 million ticks which was calculated as ~450us based on 2.2GHz clock which is pretty typical base clock frequency on X86:
(1e6 Ticks) / (2.2e9 Ticks/sec) * (1e6 usec/sec) = 454 usec
Really short benchmarks can be affected by longer delay.
Update KMP_BLOCKTIME docs.
Portions of this commit were authored by Johnny Peyton.
Differential Revision: https://reviews.llvm.org/D157646
This patch fixes: https://github.com/llvm/llvm-project/issues/64671
DataOp EMI callbacks would not report the correct target pointer.
This is now alleviated by passing a `void**` into the function which
emits the actual callback, then evaluating that pointer.
Note: Since this is only done after the pointer has been properly
updated, only `endpoint=2` callbacks will show a non-null value.
Reviewed By: dhruvachak, jdoerfert
Differential Revision: https://reviews.llvm.org/D157996