Summary:
We use this `dyn_ptr` argument in Clang/OpenMP to handle the
`KernelLaunchEnvironment`. This is a per-kernel argument used to share
some information. Currenetly, it's prepended to the argument list and we
generate storage for it in the runtime.
This is bad for a few reasons:
1. It changes the ABI by shifting user arguments
2. It cannot be trivially be left uninitialized if unused
3. The runtime must allocate its own memory for it
This PR changes it to be appended instead. Additionally, space for this
is always emitted. This means the OMPIRBuilder itself will provide the
storage, we simply need to populate it in the runtime if it is used.
This means that if it's unused we don't always pay the cost and it's
easier for non-OpenMP users to ignore it.
Backward compatibility is maintained by auto-upgrading the kernel
arguments. In `libomptarget` we completely allocate a new buffer to
store this in the new format. The plugins still need to respect the old
ABI of the called device object, so we simply rotate it if it's the old
version.
Generate nuw GEPs for struct member accesses, as inbounds + non-negative
implies nuw.
Regression tests are updated using update scripts where possible, and by
find + replace where not.
The KernelEnvironment is for compile time information about a kernel. It
allows the compiler to feed information to the runtime. The
KernelLaunchEnvironment is for dynamic information *per* kernel launch.
It allows the rutime to feed information to the kernel that is not
shared with other invocations of the kernel. The first use case is to
replace the globals that synchronize teams reductions with per-launch
versions. This allows concurrent teams reductions. More uses cases will
follow, e.g., per launch memory pools.
Fixes: https://github.com/llvm/llvm-project/issues/70249
offloading
- This patch adds support for thread_limit clause on target directive according to OpenMP 51 [2.14.5]
- The idea is to create an outer task for target region, when there is a thread_limit clause, and manipulate the thread_limit of task instead. This way, thread_limit will be applied to all the relevant constructs enclosed by the target region.
Differential Revision: https://reviews.llvm.org/D152054