I saw a hang triggered by test_failpoints in JoinBuilder::HandoffToProbesAndWait(), where the thread was blocked even though build_side_state->is_cancelled_ was true.
The sequence of events leading to the bug is as follows:
- Thread A is in HandoffToProbesAndWait(), reads is_cancelled_ and sees false.
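- Thread B in RuntimeState::Cancel() sets is_cancelled_ = true, acquires cancellation_cvs_lock_, then calls NotifyAll() on the condition variable. Thread A is not waiting yet, so the notification wakes nobody.
- Thread A calls Wait() on the condition variable and blocks forever, because the only notification has already been sent.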
I think this is most likely if thread A is de-scheduled between reading is_cancelled_ and calling Wait().
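
For illustration, here is a minimal sketch of one common way to close this kind of lost-wakeup window: the waiter checks the flag and waits while holding the same lock that the canceller takes before changing the flag and notifying. This is not the actual Impala code. CancellableBarrier, WaitForProbesOrCancel(), probes_done and CancelBeforeWaitStillWakes() are hypothetical names, and std::mutex / std::condition_variable stand in for cancellation_cvs_lock_ and the real condition variable.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the shared state; not Impala's actual classes.
struct CancellableBarrier {
  std::mutex lock;             // plays the role of cancellation_cvs_lock_
  std::condition_variable cv;
  bool is_cancelled = false;   // guarded by 'lock'
  bool probes_done = false;    // guarded by 'lock'

  // Waiter side (thread A). Returns true if it stopped waiting because of
  // cancellation.
  bool WaitForProbesOrCancel() {
    std::unique_lock<std::mutex> l(lock);
    // The predicate is evaluated under the lock before blocking and after
    // every wakeup, so a Cancel() that completed earlier is still observed.
    cv.wait(l, [this] { return is_cancelled || probes_done; });
    return is_cancelled;
  }

  // Canceller side (thread B).
  void Cancel() {
    {
      std::lock_guard<std::mutex> l(lock);
      is_cancelled = true;  // written under the lock the waiter checks under
    }
    cv.notify_all();
  }
};

// Reproduces the problematic ordering: the notification is sent before the
// waiter starts waiting. With the predicate check under the lock, the waiter
// returns instead of blocking.
bool CancelBeforeWaitStillWakes() {
  CancellableBarrier b;
  b.Cancel();
  bool woke_cancelled = false;
  std::thread waiter([&] { woke_cancelled = b.WaitForProbesOrCancel(); });
  waiter.join();
  assert(b.is_cancelled);
  return woke_cancelled;
}
```

Because the flag is read and the wait is entered without releasing the lock in between, there is no point at which thread B can set the flag and notify after thread A's check but before thread A is waiting.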