This PR fixes another race condition in
https://github.com/llvm/llvm-project/pull/90930. The failure was found
by @labath with this log: https://paste.debian.net/hidden/30235a5c/:
```
dotest_wrapper. < 15> send packet: $z0,224505,1#65
...
b-remote.async> < 22> send packet: $vCont;s:p1dcf.1dcf#4c
intern-state GDBRemoteClientBase::Lock::Lock sent packet: \x03
b-remote.async> < 818> read packet: $T13thread:p1dcf.1dcf;name:a.out;threads:1dcf,1dd2;jstopinfo:5b7b226e616d65223a22612e6f7574222c22726561736f6e223a227369676e616c222c227369676e616c223a31392c22746964223a373633317d2c7b226e616d65223a22612e6f7574222c22746964223a373633347d5d;thread-pcs:0000000000224505,00007f4e4302119a;00:0000000000000000;01:0000000000000000;02:0100000000000000;03:0000000000000000;04:9084997dfc7f0000;05:a8742a0000000000;06:b084997dfc7f0000;07:6084997dfc7f0000;08:0000000000000000;09:00d7e5424e7f0000;0a:d0d9e5424e7f0000;0b:0202000000000000;0c:80cc290000000000;0d:d8cc1c434e7f0000;0e:2886997dfc7f0000;0f:0100000000000000;10:0545220000000000;11:0602000000000000;12:3300000000000000;13:0000000000000000;14:0000000000000000;15:2b00000000000000;16:80fbe5424e7f0000;17:0000000000000000;18:0000000000000000;19:0000000000000000;reason:signal;#b9
```
It shows that an async interrupt (`\x03`) was sent immediately after the
`vCont;s` packet that single-steps over the breakpoint at address
`0x224505` (which was disabled before the `vCont`). The subsequent stop
was still reported at the original PC (`0x224505`); the thread had not
moved forward.
The investigation shows that the failure happens when the timeout is
short and the async interrupt reaches lldb-server immediately after the
`vCont`. `ptrace()` resumes the debuggee, which is then interrupted
before it gets a chance to execute and advance the PC, so it stops again
at the original PC. `ThreadPlanStepOverBreakpoint` does not expect the
PC to stay unchanged and reports a stop at the original location.
To fix this, the PR prevents `ThreadPlanSingleThreadTimeout` from being
created during `ThreadPlanStepOverBreakpoint` by introducing a new
`SupportsResumeOthers()` method, for which
`ThreadPlanStepOverBreakpoint` returns false. This makes sense because
we should never resume other threads while stepping over a breakpoint
anyway; otherwise they might miss the breakpoint.
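
As a rough illustration of the idea (not the actual LLDB code), the sketch below uses simplified, hypothetical stand-ins for the thread plan classes. Only the `SupportsResumeOthers()` name comes from this PR; `Plan`, `StepOverBreakpointPlan`, `StepOverRangePlan`, and `ShouldPushTimeoutPlan` are assumptions made up for the example.

```cpp
// Hypothetical, simplified stand-in for a thread plan base class.
struct Plan {
  virtual ~Plan() = default;
  // By default a plan tolerates other threads being resumed later.
  virtual bool SupportsResumeOthers() const { return true; }
};

// Stand-in for ThreadPlanStepOverBreakpoint: the breakpoint is disabled
// while stepping, so other threads must stay stopped.
struct StepOverBreakpointPlan : Plan {
  bool SupportsResumeOthers() const override { return false; }
};

// Stand-in for a range-stepping plan that keeps the default behavior.
struct StepOverRangePlan : Plan {};

// Hypothetical helper: a timeout plan is only worth creating when
// single-thread stepping is on and the current plan allows the other
// threads to be resumed once the timeout fires.
bool ShouldPushTimeoutPlan(const Plan &current, bool single_thread_stepping) {
  return single_thread_stepping && current.SupportsResumeOthers();
}
```

With this shape, the decision is made by the plan itself, so callers do not need to special-case the step-over-breakpoint plan.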
---------
Co-authored-by: jeffreytan81 <jeffreytan@fb.com>
This PR fixes a potential race condition in
https://github.com/llvm/llvm-project/pull/90930.
This race can happen because the original code set `m_info->m_isAlive =
true` **after** the timer thread was created. If a context switch
happens and the timer thread checks `m_info->m_isAlive` before the main
thread gets a chance to run `m_info->m_isAlive = true`, the timer thread
may treat `ThreadPlanSingleThreadTimeout` as not alive and simply exit.
The async interrupt is then never sent to resume all threads
(deadlock).
The PR fixes the race by initializing all state **before** the worker
timer thread is created.
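
The ordering can be sketched in isolation as follows. This is a minimal example under stated assumptions, not the LLDB implementation: `TimeoutInfo`, `is_alive`, and `StartTimerAfterInit` are hypothetical names standing in for the shared `m_info` state and the code that spawns the timer thread.

```cpp
#include <atomic>
#include <memory>
#include <thread>

// Hypothetical stand-in for the state shared with the timer thread.
struct TimeoutInfo {
  std::atomic<bool> is_alive{false};
};

// Returns what the timer thread observed for `is_alive`.
bool StartTimerAfterInit() {
  auto info = std::make_shared<TimeoutInfo>();
  std::atomic<bool> observed_alive{false};

  // Initialize all shared state first...
  info->is_alive = true;

  // ...and only then create the worker thread. Thread creation
  // synchronizes with the start of the thread function, so the worker
  // cannot see the pre-initialization value.
  std::thread timer([info, &observed_alive] {
    observed_alive = info->is_alive.load();
  });
  timer.join();
  return observed_alive.load();
}
```

If the assignment were moved below the `std::thread` construction, the worker could run first and read `false`, which is the race described above.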
Co-authored-by: jeffreytan81 <jeffreytan@fb.com>
This fixes the following MSVC warning:
```
[6831/7617] Building CXX object tools\lldb\source\Target\CMakeFiles\lldbTarget.dir\ThreadPlanSingleThreadTimeout.cpp.obj
C:\src\git\llvm-project\lldb\source\Target\ThreadPlanSingleThreadTimeout.cpp(66) : warning C4715: 'lldb_private::ThreadPlanSingleThreadTimeout::StateToString': not all control paths return a value
```
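
For context, C4715 typically appears when a function returns only from inside a `switch` over an enum, because MSVC does not assume the switch covers every possible value. The sketch below shows one common way to silence it; the `State` enumerators and the fallback string are assumptions for illustration, and the actual change may differ (for example, by marking the tail as unreachable).

```cpp
#include <string>

// Hypothetical enumerators; not taken from the LLDB source.
enum class State { WaitTimeout, AsyncInterrupt, Done };

std::string StateToString(State state) {
  switch (state) {
  case State::WaitTimeout:
    return "WaitTimeout";
  case State::AsyncInterrupt:
    return "AsyncInterrupt";
  case State::Done:
    return "Done";
  }
  // Every enumerator is handled above, but a return (or an
  // "unreachable" marker) after the switch is still needed so that all
  // control paths visibly return a value.
  return "Unknown";
}
```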
This PR fixes the ASAN failure in
https://github.com/llvm/llvm-project/pull/90930.
The original PR assumed that the lifetime of the parent
`ThreadPlanStepOverRange` is always longer than that of the
`ThreadPlanSingleThreadTimeout` leaf plan, so it passed
`m_timeout_info` to the leaf plan by reference.
The ASAN failure suggests that this assumption does not always hold
(likely because the thread stack is holding a strong reference to the
leaf plan).
This PR fixes the lifetime issue by using a shared pointer instead of
passing by reference.
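
The lifetime fix can be illustrated with a small standalone sketch. The types here (`TimeoutInfo`, `LeafPlan`, `ParentPlan`) and the helper `LeafOutlivesParent` are hypothetical simplifications, not the real LLDB classes; only the idea of sharing `m_timeout_info` through a shared pointer comes from the description above.

```cpp
#include <memory>
#include <utility>

// Hypothetical stand-in for the timeout bookkeeping shared by both plans.
struct TimeoutInfo {
  bool is_alive = false;
};

// Stand-in for the leaf timeout plan: it co-owns the info instead of
// holding a reference into its parent.
struct LeafPlan {
  explicit LeafPlan(std::shared_ptr<TimeoutInfo> info)
      : m_info(std::move(info)) {}
  void MarkAlive() { m_info->is_alive = true; }
  bool IsAlive() const { return m_info->is_alive; }
  std::shared_ptr<TimeoutInfo> m_info;
};

// Stand-in for the parent range-stepping plan.
struct ParentPlan {
  std::shared_ptr<TimeoutInfo> m_timeout_info =
      std::make_shared<TimeoutInfo>();
  std::unique_ptr<LeafPlan> MakeLeaf() const {
    return std::make_unique<LeafPlan>(m_timeout_info);
  }
};

// Destroys the parent before the leaf and then touches the shared info.
bool LeafOutlivesParent() {
  std::unique_ptr<LeafPlan> leaf;
  {
    ParentPlan parent;
    leaf = parent.MakeLeaf();
  } // parent is destroyed here; the info stays alive through the leaf
  leaf->MarkAlive(); // would be a use-after-free with a plain reference
  return leaf->IsAlive() && leaf->m_info.use_count() == 1;
}
```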
---------
Co-authored-by: jeffreytan81 <jeffreytan@fb.com>
This PR introduces a new `ThreadPlanSingleThreadTimeout` that will be
used to address a potential deadlock during single-thread stepping.
While debugging a target with a non-trivial number of threads (around
5000 threads in one example target), we noticed that a simple step over
can take as long as 10 seconds. Enabling single-thread stepping mode
significantly reduces the stepping time to around 3 seconds. However,
this can introduce a deadlock if we try to step over a method that
depends on other threads to release a lock.
To address this issue, we introduce a new
`ThreadPlanSingleThreadTimeout` that can be controlled by the
`target.process.thread.single-thread-plan-timeout` setting during
single-thread stepping mode. The concept involves counting the elapsed
time since the last internal stop to detect overall stepping progress.
Once a timeout occurs, we assume the target is not making progress due
to a potential deadlock, as mentioned above. We then send a new async
interrupt, resume all threads, and `ThreadPlanSingleThreadTimeout`
completes its task.
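
The timeout-or-progress wait described above can be sketched with a condition variable. This is a generic illustration under stated assumptions, not the code in this PR: `StepTimer`, `Cancel`, and `WaitForTimeout` are hypothetical names, and in a real debugger the wait would run on a worker thread whose expiry triggers the async interrupt.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical timer: waits until either the timeout expires or the
// stepping logic reports that the wait is no longer needed.
class StepTimer {
public:
  // Called when stepping made progress or the plan finished.
  void Cancel() {
    {
      std::lock_guard<std::mutex> lock(m_mutex);
      m_cancelled = true;
    }
    m_cv.notify_all();
  }

  // Blocks for at most `timeout`. Returns true when the timeout expired
  // without a cancel, i.e. the caller should interrupt the target and
  // resume all threads.
  bool WaitForTimeout(std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> lock(m_mutex);
    bool cancelled =
        m_cv.wait_for(lock, timeout, [this] { return m_cancelled; });
    return !cancelled;
  }

private:
  std::mutex m_mutex;
  std::condition_variable m_cv;
  bool m_cancelled = false;
};
```

Using a predicate with `wait_for` keeps the wait robust against spurious wakeups and against a cancel that arrives before the wait starts.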
To support this design, the major changes made in this PR are:
1. `ThreadPlanSingleThreadTimeout` is popped during every internal stop
and reset (re-pushed) to the top of the stack (as a leaf node) during
resume. This is achieved by always returning `true` from
`ThreadPlanSingleThreadTimeout::DoPlanExplainsStop()` and
`ThreadPlanSingleThreadTimeout::MischiefManaged()`.
2. A new thread-specific async interrupt stop is introduced, which can
be detected/consumed by `ThreadPlanSingleThreadTimeout`.
3. The clearing of branch breakpoints in the range thread plan has been
moved from `DoPlanExplainsStop()` to `ShouldStop()`, because
`DoPlanExplainsStop()` is not guaranteed to be called.
The detailed design is discussed in the RFC below:
[https://discourse.llvm.org/t/improve-single-thread-stepping/74599](https://discourse.llvm.org/t/improve-single-thread-stepping/74599)
---------
Co-authored-by: jeffreytan81 <jeffreytan@fb.com>