The deduplication of the calls to `omp_get_thread_limit` used to be
legal when originally added in
<e28936f613 (diff-de101c82aff66b2bda2d1f53fde3dde7b0d370f14f1ff37b7919ce38531230dfR123)>,
as the result (thread_limit) was immutable.
However, now that we have `thread_limit` clause, we no longer have
immutability; therefore `omp_get_thread_limit()` is not a deduplicable
runtime call.
Thus, removing `omp_get_thread_limit` from the
`DeduplicableRuntimeCallIDs` array.
Here's a simple example:
```
#include <omp.h>
#include <stdio.h>
int main()
{
#pragma omp target thread_limit(4)
{
printf("\n1:target thread_limit: %d\n", omp_get_thread_limit());
}
#pragma omp target thread_limit(3)
{
printf("\n2:target thread_limit: %d\n", omp_get_thread_limit());
}
return 0;
}
```
GCC-compiled binary execution: https://gcc.godbolt.org/z/Pjv3TWoTq
```
1:target thread_limit: 4
2:target thread_limit: 3
```
Clang/LLVM-compiled binary execution:
https://clang.godbolt.org/z/zdPbrdMPn
```
1:target thread_limit: 4
2:target thread_limit: 4
```
By my reading of the OpenMP spec GCC does the right thing here; cf.
<https://www.openmp.org/spec-html/5.2/openmpse12.html#x34-330002.4>:
> If a target construct with a thread_limit clause is encountered, the
thread-limit-var ICV from the data environment of the generated initial
task is instead set to an implementation defined value between one and
the value specified in the clause.
The common subexpression elimination (CSE) of the second call to
`omp_get_thread_limit` by LLVM does not seem to be correct, as it's not
an available expression at any program point(s) (in the scope of the
clause in question) after the second target construct with a
`thread_limit` clause is encountered.
Compiling with `-Rpass=openmp-opt -Rpass-analysis=openmp-opt
-Rpass-missed=openmp-opt` we have:
https://clang.godbolt.org/z/G7dfhP7jh
```
<source>:8:42: remark: OpenMP runtime call omp_get_thread_limit deduplicated. [OMP170] [-Rpass=openmp-opt]
8 | printf("\n1:target thread_limit: %d\n",omp_get_thread_limit());
| ^
```
OMP170 has the following explanation:
https://openmp.llvm.org/remarks/OMP170.html
> This optimization remark indicates that a call to an OpenMP runtime
call was replaced with the result of an existing one. This occurs when
the compiler knows that the result of a runtime call is immutable.
Removing duplicate calls is done by replacing all calls to that function
with the result of the first call. This cannot be done automatically by
the compiler because the implementations of the OpenMP runtime calls
live in a separate library the compiler cannot see.
This optimization will trigger for known OpenMP runtime calls whose
return value will not change.
At the same time I do not believe we have an analysis checking whether
this precondition holds here: "This occurs when the compiler knows that
the result of a runtime call is immutable."
AFAICT, such analysis doesn't appear to exist in the original patch
introducing deduplication, either:
-
9548b74a83
- https://reviews.llvm.org/D69930
The fix is to remove it from `DeduplicableRuntimeCallIDs`, effectively
reverting the addition in this commit (noting that `omp_get_max_threads`
is not present in `DeduplicableRuntimeCallIDs`, so it's possible this
addition was incorrect in the first place):
- [OpenMP][Opt] Annotate known runtime functions and deduplicate more,
-
e28936f613 (diff-de101c82aff66b2bda2d1f53fde3dde7b0d370f14f1ff37b7919ce38531230dfR123)
As a result, we're no longer unsoundly deduplicating the OpenMP runtime
call `omp_get_thread_limit` as illustrated by the test case: Note the
(correctly) repeated `call i32 @omp_get_thread_limit()`.
---------
Co-authored-by: Joseph Huber <huberjn@outlook.com>
This is an experimental address space for strided buffers. These buffers
can have structs as elements and
a stride > 1.
These pointers allow the indexed access in units of stride, i.e., they
point at `buffer[index * stride]`.
Thus, we can use the `idxen` modifier for buffer loads.
We assign address space 9 to 192-bit buffer pointers which contain a
128-bit descriptor, a 32-bit offset and a 32-bit index. Essentially,
they are fat buffer pointers with an additional 32-bit index.
* Remove a call to CreatePointerBitCastOrAddrSpaceCast which merely adds
a no-op ptr-to-ptr bitcast.
* Most of the diff is from removing checks for no-op ptr-to-ptr bitcasts
in relevant LIT tests
Before we tracked the size of the teams reduction buffer in order to
allocate it at runtime per kernel launch. This patch splits the number
into two parts, the size of the reduction data (=all reduction
variables) and the (maximal) length of the buffer. This will allow us to
allocate less if we need less, e.g., if we have less teams than the
maximal length. It also allows us to move code from clangs codegen into
the runtime as we now know how large the reduction data is.
The KernelEnvironment is for compile time information about a kernel. It
allows the compiler to feed information to the runtime. The
KernelLaunchEnvironment is for dynamic information *per* kernel launch.
It allows the rutime to feed information to the kernel that is not
shared with other invocations of the kernel. The first use case is to
replace the globals that synchronize teams reductions with per-launch
versions. This allows concurrent teams reductions. More uses cases will
follow, e.g., per launch memory pools.
Fixes: https://github.com/llvm/llvm-project/issues/70249
Summary:
Part of the work done in the `libc` project is to provide host services
for things like `printf` or `malloc`, or generally any syscall-like
behaviour. This scheme works by emitting an externally visible global
called `__llvm_libc_rpc_client` that the host runtime can pick up to get
a handle to the global memory associated with the client. We use the
presence of this symbol to indicate whether or not we need to run an RPC
server. Normally, this symbol is only present if something requiring an
RPC server was linked in, such as `printf`. However, if this call to
`printf` was subsequently optimizated out, the symbol would remain and
cannot be removed (rightfully so) because of its linkage. This patch
adds a special-case optimization to remove this symbol so we can
indicate that an RPC server is no longer needed.
This patch puts this logic in `OpenMPOpt` as the most readily available
place for it. In the future, we should think how to move this somewhere
more generic. Furthermore, we use a hard-coded runtime name (which isn't
uncommon given all the other magic symbol names). But it might be nice
to abstract that part away.
This reverts commit ddbaa11e9f43a38d50d62a9b9b07c3653b6bf8ab.
Reapply the original commit, the broken test was repaired in 5e51363f38d083ab326736c0d4d1b5f9fe0de080 in the meantime.
The runtime needs to know about the acceptable launch bounds, especially
if the compiler (middle- or backend) assumed those bounds. While this
patch does not yet inform the runtime, it stores the bounds in a place
that can/will be accessed and is associated with the kernel.
There are many tests that specify a target triple/CPU flags but no
DataLayout which can lead to IR being generated that has unusual
behaviour. This commit attempts to use the default DataLayout based
on the relevant flags if there is no explicit override on the command
line or in the IR file.
One thing that is not currently possible to differentiate from a missing
datalayout `target datalayout = ""` in the IR file since the current
APIs don't allow detecting this case. If it is considered useful to
support this case (instead of passing "-data-layout=" on the command
line), I can change IR parsers to track whether they have seen such a
directive and change the callback type.
Differential Revision: https://reviews.llvm.org/D141060
If we update the state, or indicate a pessimistic fixpoint, we need to
consider NestedParallelism too.
Fixes part of https://github.com/llvm/llvm-project/issues/66708
That said, the reproducer still needs malloc which we don't support on
AMD GPU. Will be added later.
Barrier removal in OpenMPOpt normally removes barriers by proving that
they are redundant with barriers preceding them. However, it can't do
this with the "pseudo-barrier" at the end of kernels because that can't
be removed. Instead, it removes the barriers preceding the end of the
kernel which that end-of-kernel barrier is redundant with. However,
these barriers aren't always redundant with the end-of-kernel barrier
when loops are involved, and removing them can lead to incorrect results
in compiled code.
This change fixes this by requiring that these pre-end-of-kernel
barriers also have the kernel end as a unique successor before removing
them. It also changes the initialization of `ExitED` for kernels since
the kernel end is not an aligned barrier.
Current OpenMPOpt assumes all kernels are OpenMP kernels (aka. with
"kernel"
attribute). This doesn't hold if we mix OpenMP code and CUDA code by
lingking
them together because CUDA kernels are not annotated with the attribute.
This
patch removes the assumption and added a new counter for those
non-OpenMP
kernels.
Fix#66687.
The associated function can be a nullptr if it is an indirect call.
This causes a crash in `CheckCallee` which always assumes the callee
is a valid pointer.
Fix#66904.
The Attributor has gained support for indirect calls but it is opt-in.
This patch makes AAKernelInfoCallSite able to handle multiple potential
callees.
When we remove barriers, we might need to remove llvm.assume
assumptions as well. However, doing this early, thus in the module pass,
will cause us to miss out on information we might need. There are few
situations we can eliminate barriers across functions, for now we simply
disable elimination of barriers that require assumptions to be removed
during the early module pass.
The visited set was used to not visit the same function twice, however,
the (new) algorithm requires we do since we start the queries at
different call sites.
We know that __kmpc_alloc_shared is by construction matched with a
unique __kmpc_free_shared. Making the compiler aware of these facts
helps to avoid mallocs/allocas.
Fixes: https://github.com/llvm/llvm-project/issues/64551
When other AAs query the current value of KernelEnvC via the callback
KernelConfigurationSimplifyCB we need to ensure they are now dependent
on the AAKernelInfo that is in charge of the KernelEnvC.
It was `inaccessiblemem: readwrite` before, no need for the read.
No real benefit is expected but it can help debugging and other efforts.
Differential Revision: https://reviews.llvm.org/D156478
This patch introduces per kernel environment. Previously, flags such as execution mode are set through global variables with name like `__kernel_name_exec_mode`. They are accessible on the host by reading the corresponding global variable, but not from the device. Besides, some assumptions, such as no nested parallelism, are not per kernel basis, preventing us applying per kernel optimization in the device runtime.
This is a combination and refinement of patch series D116908, D116909, and D116910.
Depend on D155886.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D142569
This patch introduces per kernel environment. Previously, flags such as execution mode are set through global variables with name like `__kernel_name_exec_mode`. They are accessible on the host by reading the corresponding global variable, but not from the device. Besides, some assumptions, such as no nested parallelism, are not per kernel basis, preventing us applying per kernel optimization in the device runtime.
This is a combination and refinement of patch series D116908, D116909, and D116910.
Depend on D155886.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D142569
This showed up when we started to deduce readnone for the argument of
__kmpc_global_thread_num. The known attributes for "getters" did not
allow to read arguments, but that is sometimes the case.
AANonNull is now the first AA that is always queried via the new APIs
and not created manually. Others will follow shortly to avoid trivial
AAs whenever possible.
This commit introduced some helper logic that will make it simpler to
port the next one. It also untangles AADereferenceable and AANonNull
such that the former does not keep a handle on the latter. Finally,
we stop deducing `nonnull` for `undef`, which was incorrect.
A kernel can be exited in a non-aligned fashion, so we cannot pretend it
always ends in an aligned barrier. Instead, we require an explicit
aligned barrier as we lack a divergence analysis at this point.
We had some custom manifest for assumption attributes but we use the
generic manifest logic. If we later decide to curb duplication (of
attributes on the call site and callee), we can do that at a single
location and for all attributes.
The test changes basically add known `llvm.assume` callee information to
the call sites.
We had some custom handling for existing MemoryEffects but we now move
it to the place we check other existing attributes before we manifest
new ones. If we later decide to curb duplication (of attributes on the
call site and callee), we can do that at a single location and for all
attributes.
The test changes basically add known `memory` callee information to the
call sites.
This is a partial cleanup to centralize the initialization and update
decisions for AAs. Lifting the burdon and boilerplate on users and
making it harder to accidentally perform unsound deductions.
The two static helpers show how we can lift the decisions to generate an
AA into the Attributor, avoiding trivial AAs that just cost us compile
time and maintenance code (to check for pre-conditions).