* Remove a call to CreatePointerBitCastOrAddrSpaceCast which merely adds
a no-op ptr-to-ptr bitcast.
* Most of the diff is from removing checks for no-op ptr-to-ptr bitcasts
in relevant LIT tests
Before we tracked the size of the teams reduction buffer in order to
allocate it at runtime per kernel launch. This patch splits the number
into two parts, the size of the reduction data (=all reduction
variables) and the (maximal) length of the buffer. This will allow us to
allocate less if we need less, e.g., if we have less teams than the
maximal length. It also allows us to move code from clangs codegen into
the runtime as we now know how large the reduction data is.
The KernelEnvironment is for compile time information about a kernel. It
allows the compiler to feed information to the runtime. The
KernelLaunchEnvironment is for dynamic information *per* kernel launch.
It allows the rutime to feed information to the kernel that is not
shared with other invocations of the kernel. The first use case is to
replace the globals that synchronize teams reductions with per-launch
versions. This allows concurrent teams reductions. More uses cases will
follow, e.g., per launch memory pools.
Fixes: https://github.com/llvm/llvm-project/issues/70249
This reverts commit ddbaa11e9f43a38d50d62a9b9b07c3653b6bf8ab.
Reapply the original commit, the broken test was repaired in 5e51363f38d083ab326736c0d4d1b5f9fe0de080 in the meantime.
The runtime needs to know about the acceptable launch bounds, especially
if the compiler (middle- or backend) assumed those bounds. While this
patch does not yet inform the runtime, it stores the bounds in a place
that can/will be accessed and is associated with the kernel.
This patch introduces per kernel environment. Previously, flags such as execution mode are set through global variables with name like `__kernel_name_exec_mode`. They are accessible on the host by reading the corresponding global variable, but not from the device. Besides, some assumptions, such as no nested parallelism, are not per kernel basis, preventing us applying per kernel optimization in the device runtime.
This is a combination and refinement of patch series D116908, D116909, and D116910.
Depend on D155886.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D142569
This patch introduces per kernel environment. Previously, flags such as execution mode are set through global variables with name like `__kernel_name_exec_mode`. They are accessible on the host by reading the corresponding global variable, but not from the device. Besides, some assumptions, such as no nested parallelism, are not per kernel basis, preventing us applying per kernel optimization in the device runtime.
This is a combination and refinement of patch series D116908, D116909, and D116910.
Depend on D155886.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D142569
This is a partial cleanup to centralize the initialization and update
decisions for AAs. Lifting the burdon and boilerplate on users and
making it harder to accidentally perform unsound deductions.
The two static helpers show how we can lift the decisions to generate an
AA into the Attributor, avoiding trivial AAs that just cost us compile
time and maintenance code (to check for pre-conditions).
This reverts commit 35cfadfbe2decd9633560b3046fa6c17523b2fa9.
It makes a couple of buildbots unhappy because of the following test failures:
- `Transforms/OpenMP/add_attributes.ll'`
- `mapping/declare_mapper_target_data.cpp` on AMDGPU
This patch introduces per kernel environment. Previously, flags such as execution mode are set through global variables with name like `__kernel_name_exec_mode`. They are accessible on the host by reading the corresponding global variable, but not from the device. Besides, some assumptions, such as no nested parallelism, are not per kernel basis, preventing us applying per kernel optimization in the device runtime.
This is a combination and refinement of patch series D116908, D116909, and D116910.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D142569
There's a desire to move away from `undef` in LLVM. Currently we want to
have the `addressspace(3)` variables use `poison` instead.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D147719
This patch adds the AANonConvergent abstract attribute. It removes the
convergent attribute from functions that only call non-convergent
functions.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D143228
The `OpenMPOpt` pass contains optimizations that generate new calls into
the OpenMP runtime. This causes problems if we are in a state where the
runtime has already been linked statically. Generating these new calls
will result in them never being resolved. We should indicate if we are
in a "post-link" LTO phase and prevent OpenMPOpt from generating new
runtime calls.
Generally, it's not desireable for passes to maintain state about the
context in which they're called. But this is the only reasonable
solution to static linking when we have a pass that generates new
runtime calls.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D142646
With this patch we track aligned barriers in AAExecutionDomain and also
delete unnecessary barriers there. This allows us to eliminate barriers
across blocks, across functions, and in the presence of complex accesses
that do not force a barrier. Further, we can use the collected
information to enable store-load forwarding in a threaded environment
(follow up patch).
Differential Revision: https://reviews.llvm.org/D140463
This switches everything to use the memory attribute proposed in
https://discourse.llvm.org/t/rfc-unify-memory-effect-attributes/65579.
The old argmemonly, inaccessiblememonly and inaccessiblemem_or_argmemonly
attributes are dropped. The readnone, readonly and writeonly attributes
are restricted to parameters only.
The old attributes are auto-upgraded both in bitcode and IR.
The bitcode upgrade is a policy requirement that has to be retained
indefinitely. The IR upgrade is mainly there so it's not necessary
to update all tests using memory attributes in this patch, which
is already large enough. We could drop that part after migrating
tests, or retain it longer term, to make it easier to import IR
from older LLVM versions.
High-level Function/CallBase APIs like doesNotAccessMemory() or
setDoesNotAccessMemory() are mapped transparently to the memory
attribute. Code that directly manipulates attributes (e.g. via
AttributeList) on the other hand needs to switch to working with
the memory attribute instead.
Differential Revision: https://reviews.llvm.org/D135780
If we run LTO optimization we migth end up introducing a custom state machine
and later transforming the region into SPMD. This is a problem. While a follow
up will introduce a check for the SPMD conversion, this already prevents the
eager custom state machine generation. Only if the kernel init function is
defined, rather then declared, we will emit a custom state machine. SPMD-zation
can happen eagerly though. Tests are adjusted via a weak definition. The LTO
test was added to verify this works as expected.
Differential Revision: https://reviews.llvm.org/D136740
Using Max for both "PIC Level" and "PIE Level" is inconsistent. PIC imposes less
restriction while PIE imposes more restriction. The result generally
picks the more restrictive behavior: Min for PIC.
This choice matches `ld -r`: a non-pic object and a pic object merge into a
result which should be treated as non-pic.
To allow linking "PIC Level" using Error/Max from old bitcode files, upgrade
Error/Max to Min.
Reviewed By: tejohnson
Differential Revision: https://reviews.llvm.org/D130531
For the longest time we used `AAValueSimplify` and
`genericValueTraversal` to determine "potential values". This was
problematic for many reasons:
- We recomputed the result a lot as there was no caching for the 9
locations calling `genericValueTraversal`.
- We added the idea of "intra" vs. "inter" procedural simplification
only as an afterthought. `genericValueTraversal` did offer an option
but `AAValueSimplify` did not. Thus, we might end up with "too much"
simplification in certain situations and then gave up on it.
- Because `genericValueTraversal` was not a real `AA` we ended up with
problems like the infinite recursion bug (#54981) as well as code
duplication.
This patch introduces `AAPotentialValues` and replaces the
`AAValueSimplify` uses with it. `genericValueTraversal` is folded into
`AAPotentialValues` as are the instruction simplifications performed in
`AAValueSimplify` before. We further distinguish "intra" and "inter"
procedural simplification now.
`AAValueSimplify` was not deleted as we haven't ported the
re-materialization of instructions yet. There are other differences over
the former handling, e.g., we may not fold trivially foldable
instructions right now, e.g., `add i32 1, 1` is not folded to `i32 2`
but if an operand would be simplified to `i32 1` we would fold it still.
We are also even more aware of function/SCC boundaries in CGSCC passes,
which is good even if some tests look like they regress.
Fixes: https://github.com/llvm/llvm-project/issues/54981
Note: A previous version was flawed and consequently reverted in
6555558a80589d1c5a1154b92cc3af9495f8f86c.
This reverts commit f17639ea0cd30f52ac853ba2eb25518426cc3bb8 as three
AMDGPU tests haven't been updated. Will need to verify the changes are
not regressions we should avoid.
For the longest time we used `AAValueSimplify` and
`genericValueTraversal` to determine "potential values". This was
problematic for many reasons:
- We recomputed the result a lot as there was no caching for the 9
locations calling `genericValueTraversal`.
- We added the idea of "intra" vs. "inter" procedural simplification
only as an afterthought. `genericValueTraversal` did offer an option
but `AAValueSimplify` did not. Thus, we might end up with "too much"
simplification in certain situations and then gave up on it.
- Because `genericValueTraversal` was not a real `AA` we ended up with
problems like the infinite recursion bug (#54981) as well as code
duplication.
This patch introduces `AAPotentialValues` and replaces the
`AAValueSimplify` uses with it. `genericValueTraversal` is folded into
`AAPotentialValues` as are the instruction simplifications performed in
`AAValueSimplify` before. We further distinguish "intra" and "inter"
procedural simplification now.
`AAValueSimplify` was not deleted as we haven't ported the
re-materialization of instructions yet. There are other differences over
the former handling, e.g., we may not fold trivially foldable
instructions right now, e.g., `add i32 1, 1` is not folded to `i32 2`
but if an operand would be simplified to `i32 1` we would fold it still.
We are also even more aware of function/SCC boundaries in CGSCC passes,
which is good even if some tests look like they regress.
Fixes: https://github.com/llvm/llvm-project/issues/54981
Note: A previous version was flawed and consequently reverted in
6555558a80589d1c5a1154b92cc3af9495f8f86c.
If we are certainly not in a loop we can directly emit the heap2stack
allocas in the function entry block. This will help to get rid of them
(SROA) and avoid stacksave/restore intrinsics when the function is
inlined.
For the longest time we used `AAValueSimplify` and
`genericValueTraversal` to determine "potential values". This was
problematic for many reasons:
- We recomputed the result a lot as there was no caching for the 9
locations calling `genericValueTraversal`.
- We added the idea of "intra" vs. "inter" procedural simplification
only as an afterthought. `genericValueTraversal` did offer an option
but `AAValueSimplify` did not. Thus, we might end up with "too much"
simplification in certain situations and then gave up on it.
- Because `genericValueTraversal` was not a real `AA` we ended up with
problems like the infinite recursion bug (#54981) as well as code
duplication.
This patch introduces `AAPotentialValues` and replaces the
`AAValueSimplify` uses with it. `genericValueTraversal` is folded into
`AAPotentialValues` as are the instruction simplifications performed in
`AAValueSimplify` before. We further distinguish "intra" and "inter"
procedural simplification now.
`AAValueSimplify` was not deleted as we haven't ported the
re-materialization of instructions yet. There are other differences over
the former handling, e.g., we may not fold trivially foldable
instructions right now, e.g., `add i32 1, 1` is not folded to `i32 2`
but if an operand would be simplified to `i32 1` we would fold it still.
We are also even more aware of function/SCC boundaries in CGSCC passes,
which is good.
Fixes: https://github.com/llvm/llvm-project/issues/54981
This reverts commit 7aea3ea8c3b33c9bb338d5d6c0e4832be1d09ac3 as it
breaks the buildbots.
I didn't see these failures in the pre-merge checks, looking into it.
Most intrinsics, especially "default" ones, will not call back into the
IR module. `nocallback` encodes this nicely. As it was not used before,
this patch also makes use of `nocallback` in the Attributor which
results in many more `norecurse` deductions.
Tablegen part is mechanical, test updates by script.
Differential Revision: https://reviews.llvm.org/D118680
With debug information enabled (-g) Clang will wrap the actual target
region into a new function which is called from the "kernel". The problem
is that the "kernel" is now basically a wrapper without all the things
we expect. More importantly, if we end up asking for an AAKernelInfo
for the "target region function" we might try to turn it into SPMD mode.
That used to cause an assertion as that function doesn't have an
appropriately named `_exec_mode` global. While the global is going away
soon we still need to make sure to properly handle this case, e.g.,
perform optimizations reliably.
Differential Revision: https://reviews.llvm.org/D122043
Outside users of the Attributor, e.g., OpenMP-opt, want to seed AAs
themselves. We should not seed all default AAs one an internal function
becomes live. That said, there should be a callback such that they can
do lazy seeding as well.
Differential Revision: https://reviews.llvm.org/D121489
Both `undef` and `nullptr` are maximally aligned. This is especially
important as we often see `undef` until a proper value has been
identified during simplification.
This is the second step in obviating two columns about allocation
functions in MemoryBuiltins.cpp.
Differential Revision: https://reviews.llvm.org/D119583
The custom state machine had a check for surplus threads that filtered
the main thread if the kernel was executed by a single warp only. We
now first check for the main thread, then for surplus threads, avoiding
to filter the former out.
Fixes#54214.
Reviewed By: jhuber6
Differential Revision: https://reviews.llvm.org/D121011
When we move an allocation from the heap to the stack we need to
allocate it in the alloca AS and then cast the result. This also
prevents us from inserting the alloca after the allocation call but
rather right before.
Fixes https://github.com/llvm/llvm-project/issues/53858
We missed out on AANoRecurse in the module pass because we had no call
graph. With AAFunctionReachability we can simply ask if the function may
reach itself.
Differential Revision: https://reviews.llvm.org/D110099
This fixes a conceptual problem with our AAIsDead usage which conflated
call site liveness with call site return value liveness. Without the
fix tests would obviously miscompile as we make genericValueTraversal
more powerful (in a follow up). The effects on the tests are mixed but
mostly marginal. The most prominent one is the lack of `noreturn` for
functions. The reason is that we make entire blocks live at the same
time (for time reasons). Now that we actually look at the block
liveness, which we need to do, the return instructions are live and
will survive. As an example, `noreturn_async.ll` has been modified
to retain the `noreturn` even with block granularity. We could address
this easily but there is little need in practice.
Currenly we push some variables to a global constant containing shared
memory as an optimization. This generated constant had internal linkage
and should not have collided with any known identifiers in the
translation unit. However, there have been observed cases of this
optimiztaion unintentionally colliding with undocumented PTX
identifiers. This patch adds a suffix to the created globals to
hopefully bypass this.
Depends on D118059
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D118068
Previously in OpenMPOpt we did not correctly inherit the calling
convention of the callee when creating new OpenMP runtime calls. This
created issues when the calling convention was changed during
`GlobalOpt` but a new call was creating without the correct calling
convention. This lead to the call being replaced with a poison value in
`InstCombine` due to undefined behaviour and causing large portions of
the program to be incorrectly eliminated. This patch correctly inherits
the existing calling convention from the callee.
Reviewed By: tianshilei1992, jdoerfert
Differential Revision: https://reviews.llvm.org/D118059
This reverts commit 73ece231ee0cf048d56841f47915beb1db6afc26 and
reapplies 7bfcdbcbf368cea14a5236080af975d5878a46eb with mlir changes.
Also reverts commit 423ba12971bac8397c87fcf975ba6a4b7530ed28 and
includes the unit test changes of
16da2140045808b2aea1d28366ca7d326eb3c809.
Similar to loads, we want to be aggressive when it comes to store
simplification. Not everything in LLVM handles dead stores well when
address space casts are involved, we can simply ask the Attributor to do
it for us though.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D109998
One of the unused ident_t fields now holds the size of the string
(=const char *) field so we have an easier time dealing with those
in the future.
Differential Revision: https://reviews.llvm.org/D113126