The old code used isInstructionTriviallyDead() and removed instructions
when walking the path from a resume call to function return to check if
the call is in tail position.
However, since the code was walking forwards it was not able to get past
instructions such as:
%gep = getelementptr inbounds i64, ptr %alloc.var, i32 0
%foo = ptrtoint ptr %gep to i64
This patch instead ignores such instructions as long as their values are
not needed. This enables the code to emit tail calls in more situations.
Implement `llvm.coro.await.suspend` intrinsics, to deal with performance
regression after prohibiting `.await_suspend` inlining, as suggested in
#64945.
Actually, there are three new intrinsics, which directly correspond to
each of three forms of `await_suspend`:
```
void llvm.coro.await.suspend.void(ptr %awaiter, ptr %frame, ptr @wrapperFunction)
i1 llvm.coro.await.suspend.bool(ptr %awaiter, ptr %frame, ptr @wrapperFunction)
ptr llvm.coro.await.suspend.handle(ptr %awaiter, ptr %frame, ptr @wrapperFunction)
```
There are three different versions instead of one, because in `bool`
case it's result is used for resuming via a branch, and in
`coroutine_handle` case exceptions from `await_suspend` are handled in
the coroutine, and exceptions from the subsequent `.resume()` are
propagated to the caller.
Await-suspend block is simplified down to intrinsic calls only, for
example for symmetric transfer:
```
%id = call token @llvm.coro.save(ptr null)
%handle = call ptr @llvm.coro.await.suspend.handle(ptr %awaiter, ptr %frame, ptr @wrapperFunction)
call void @llvm.coro.resume(%handle)
%result = call i8 @llvm.coro.suspend(token %id, i1 false)
switch i8 %result, ...
```
All await-suspend logic is moved out into a wrapper function, generated
for each suspension point.
The signature of the function is `<type> wrapperFunction(ptr %awaiter,
ptr %frame)` where `<type>` is one of `void` `i1` or `ptr`, depending on
the return type of `await_suspend`.
Intrinsic calls are lowered during `CoroSplit` pass, right after the
split.
Because I'm new to LLVM, I'm not sure if the helper function generation,
calls to them and lowering are implemented in the right way, especially
with regard to various metadata and attributes, i. e. for TBAA. All
things that seemed questionable are marked with `FIXME` comments.
There is another detail: in case of symmetric transfer raw pointer to
the frame of coroutine, that should be resumed, is returned from the
helper function and a direct call to `@llvm.coro.resume` is generated.
C++ standard demands, that `.resume()` method is evaluated. Not sure how
important is this, because code has been generated in the same way
before, sans helper function.
Fixes: aadd7650447b
The above commit landed with an incorrect test expect, missing a `metadata`
prefix. This patch adds the expected prefix to the test.
Fixes the error described here:
a93a4ec7dd (commitcomment-138965199)
The function `findDbgIntrinsics` is used to return a list of debug
intrinsics and DPValues that use a given value, with the intent that no
duplicates are returned in either list. For DPValues, we've guarded
against DPValues that use a value multiple times as part of a DIArgList,
but we have not guarded against DPValues that use a value multiple times
as separate operands (currently only possible for `dbg_assign`s,
something I missed in my implementation of that type!). This patch adds
a guard, and also updates a test to cover this case.
This patch fixes a verifier error when async lowering is used for
WebAssembly target without tail-call feature. This missing check was
revealed by b1ac052ab07ea091c90c2b7c89445b2bfcfa42ab, which removed
inlining of the musttail'ed call and it started leaving the invalid call
at the verification stage. Additionally, `TTI::supportsTailCallFor` did
not respect the concrete TTI's `supportsTailCalls` implementation, so it
always returned true even though `supportsTailCalls` returned false, so
this patch also fixes the wrong CRTP base class implementation.
The call to the inlining utility does not update the call graph. Leading
to assertion failures when calling the call graph utility to update the
call graph.
Instead rely on an inline pass to run after coro splitting and use
alwaysinline annotations.
github.com/apple/swift/issues/68708
This patch canonicalizes getelementptr instructions with constant
indices to use the `i8` source element type. This makes it easier for
optimizations to recognize that two GEPs are identical, because they
don't need to see past many different ways to express the same offset.
This is a first step towards
https://discourse.llvm.org/t/rfc-replacing-getelementptr-with-ptradd/68699.
This is limited to constant GEPs only for now, as they have a clear
canonical form, while we're not yet sure how exactly to deal with
variable indices.
The test llvm/test/Transforms/PhaseOrdering/switch_with_geps.ll gives
two representative examples of the kind of optimization improvement we
expect from this change. In the first test SimplifyCFG can now realize
that all switch branches are actually the same. In the second test it
can convert it into simple arithmetic. These are representative of
common optimization failures we see in Rust.
Fixes https://github.com/llvm/llvm-project/issues/69841.
Make the hoisted dbg.declare inherent the DILocation scope from the new
storage.
After hoisting, the dbg.declare is moved into the block that defines the
new storage. This could create an inconsistency in the debug location
scope hierarchy where the scope of hoisted dbg.declare (i.e.
DILexicalBlock) is enclosed with the scope of the block (i.e.
DISubprogram). This confuses LiveDebugValues pass to think that the
hoisted dbg.declare is killed in that block and does not generate
DBG_VALUE in other blocks. Debugger won't be able to track its value
anymore.
We do this for unoptimized binary only.
As part of the RemoveDIs project, transitioning to non-instruction debug
info, all debug intrinsic handling code needs to be duplicated to handle
DPValues.
--try-experimental-debuginfo-iterators enables the new debug mode in
tests if the CMake option has been enabled.
`getInsertPtAfterFramePtr` now returns an iterator so we don't lose
debug-info-communicating bits.
---
Depends on #73500, #74090, #74091.
Make the hoisted dbg.declare inherent the DILocation scope from the new
storage.
After hoisting, the dbg.declare is moved into the block that defines the
new storage. This could create an inconsistency in the debug location
scope hierarchy where the scope of hoisted dbg.declare (i.e.
DILexicalBlock) is enclosed with the scope of the block (i.e.
DISubprogram). This confuses LiveDebugValues pass to think that the
hoisted dbg.declare is killed in that block and does not generate
DBG_VALUE in other blocks. Debugger won't be able to track its value
anymore.
If a suspend happens in the resume part (this can happen in the case of chained coroutines), and that's part of a loop, the pre-split CFG has the suspend block as an exit of that loop. PGO Counter Promotion will then try to commit the temporary counter to the global in that "exit" block (it also does that in the other loop exit BBs, which also includes
the "destroy" case). This interferes with symmetric transfer.
We don't need to commit the counter in the suspend case - it's not a loop exit from the perspective of the behavior of the program. The regular loop exit, together with the "destroy" case, completely cover any updates that may need to happen to the global counter.
If we did, we couldn't lower symmetric transfer resumes to tail calls.
We can instrument the other 2 edges instead, as long as they also don't
point to the same basic block.
Close https://github.com/llvm/llvm-project/issues/56980.
This patch tries to introduce a light-weight optimization attribute for
coroutines which are guaranteed to only be destroyed after it reached
the final suspend.
The rationale behind the patch is simple. See the example:
```C++
A foo() {
dtor d;
co_await something();
dtor d1;
co_await something();
dtor d2;
co_return 43;
}
```
Generally the generated .destroy function may be:
```C++
void foo.destroy(foo.Frame *frame) {
switch(frame->suspend_index()) {
case 1:
frame->d.~dtor();
break;
case 2:
frame->d.~dtor();
frame->d1.~dtor();
break;
case 3:
frame->d.~dtor();
frame->d1.~dtor();
frame->d2.~dtor();
break;
default: // coroutine completed or haven't started
break;
}
frame->promise.~promise_type();
delete frame;
}
```
Since the compiler need to be ready for all the cases that the coroutine
may be destroyed in a valid state.
However, from the user's perspective, we can understand that certain
coroutine types may only be destroyed after it reached to the final
suspend point. And we need a method to teach the compiler about this.
Then this is the patch. After the compiler recognized that the
coroutines can only be destroyed after complete, it can optimize the
above example to:
```C++
void foo.destroy(foo.Frame *frame) {
frame->promise.~promise_type();
delete frame;
}
```
I spent a lot of time experimenting and experiencing this in the
downstream. The numbers are really good. In a real-world coroutine-heavy
workload, the size of the build dir (including .o files) reduces 14%.
And the size of final libraries (excluding the .o files) reduces 8% in
Debug mode and 1% in Release mode.
The stack might be in a different address space, in which case, bitcast
does not work. We should use addrspacecast. As we do not support typed
pointer anymore, so we do not need a bitcast here anymore.
When dealing with short-circuiting coroutines (e.g. expected), the
deferred calls that resolve the get_return_object are currently being
emitted after we delete the coroutine frame.
This was caught by ASAN when using optimizations -O1 and above:
optimizations after inlining would place the __coro_gro in the heap, and
subsequent delete of the coroframe followed by the conversion -> BOOM.
This patch forbids the GRO to be placed in the coroutine frame, by
adding a new metadata node that can be attached to `alloca`
instructions.
Fix#49843
One of the main user of these kind of coroutines is swift. There yield-once (`retcon.once`) coroutines are used to temporary "expose" pointers to internal fields of various objects creating borrow scopes.
However, in some cases it might be useful also to allow these coroutines to produce a normal result, but there is no convenient way to represent this (as compared to switched-resume kind of coroutines where C++ `co_return`
is transformed to a member / callback call on promise object).
The extension is simple: we allow continuation function to have a non-void result and accept optional extra arguments via a special `llvm.coro.end.result` intrinsic that would essentially forward them as normal results.
Only X86_64 and ARM64 have a reserved register for async arguments, and so the
debugger is only able to handle those targets. For other architectures, we use a
non-entry-value expression and let the debugger do its best with that.
Differential Revision: https://reviews.llvm.org/D158638
The entry point function is called as a regular function. Among other things, it
can be inlined, which would violate the semantics of entry_value in the IR.
Differential Revision: https://reviews.llvm.org/D158108
This patch adds the linkage name update to DISubprogram's declaration after 6ce76ff7eb7640e53b65f0473848ce7d08165c98.
Reviewed By: ChuanqiXu
Differential Revision: https://reviews.llvm.org/D157184
Previously, we would try to transform cmpinst in
simplifyTerminatorLeadingToRet if we found it was a constant. However,
this is incorrect.
Since the resolved constants in simplifyTerminatorLeadingToRet are not
truely constants. They are basically constants along cerntain code
paths.
In this way, it is clearly incorrect to transform the compare
instruction to a constant.
It will cause confusing miscompilations. This patch tries to fix this.
Try to address part of
https://github.com/llvm/llvm-project/issues/61900.
It is not completely addressed since the original reproducer is not
fixed due to the final suspend point is optimized out in its special
case. But that is a relatively independent issue.
When the coroutine splitter splits swift coroutines, variables in the new
funclets are now described in terms of the frame pointer, which is always placed
at a ABI-specified register whose contents are valid upon function entry. As
such, debug intrinsics must be prepended by the `entry_value` operation.
Depends on D149778
Differential Revision: https://reviews.llvm.org/D149779
D97673 implemented salvaging o dbg.value inside coroutine funclets, but
left the original function untouched. Before, only dbg.addr and dbg.decl
would get salvaged.
D121324 implemented salvaging of dbg.addr and dbg.decl in the original
function as well, but not of dbg.values.
This patch unifies salvaging in the original function and related
funclets, so that all intrinsics are salvaged in all functions. This is
particularly useful for ABIs where the original function is also
rewritten to receive the frame pointer as an argument.
Differential Revision: https://reviews.llvm.org/D148745
The insertSpills() code will currently skip lifetime intrinsic users
when replacing the alloca with a frame reference. Rather than
leaving behind the dead lifetime intrinsics working on the old
alloca, directly remove them. This makes sure the alloca can be
dropped as well.
I noticed this as a regression when converting tests to opaque
pointers. Without opaque pointers, this code didn't really do
anything, because there would usually be a bitcast in between.
The lifetimes would get rewritten to the frame pointer. With
opaque pointers, this code now triggers and leaves behind users
of the old allocas.
Differential Revision: https://reviews.llvm.org/D148240
A temp spill may not have direct dbg.declare attached. This can cause problem for debugger when it wants to print the value in resume/destroy/cleanup functions.
In particular, we found this happening to "this" pointer that a temp is used to store its value in entry block and spilled later.
Reviewed By: ChuanqiXu
Differential Revision: https://reviews.llvm.org/D146543
Currently we will run two optimization (rematerialization and sink
lifetime markers) unconditionally even if the coroutine is marked as
optnone (O0). This looks not good. This patch disables these 2
optimizations for optnone functions. An internal change shows the change
improve the compilation time for 3% in the debug build.
As originally implemented, the rematerialization of valid instructions across
the suspend point would iterate 4 times, meaning that up to 4 instructions could
be rematerialized.
This implementation changes that approach to instead build a graph of
rematerializable instructions, then move all of them. This is faster than the
original approach and is not limited to an arbitrary limit.
Differential Revision: https://reviews.llvm.org/D142620
Added more tests that check for >4 instructions.
Also added a retcon-remat test that checks rematerialization into a suspend
block predecessor (such as when remat for a retcon suspend happens).
Differential Revision: https://reviews.llvm.org/D142619
IR is now always parsed in opaque pointer mode, unless
-opaque-pointers=0 is explicitly given. There is no automatic
detection of typed pointers anymore.
The -opaque-pointers=0 option is added to any remaining IR tests
that haven't been migrated yet.
Differential Revision: https://reviews.llvm.org/D141912