There is a race condition in the query coordination code that could cause queries to hang indefinitely in an un-cancellable state if an impalad crashes after the query has transitioned to the FINISHED state, but before all backends have completed.
The issue occurs if:
- A query produces all results
- A client issues a fetch request to read all of those results
- The client fetch request fetches all available rows (e.g. eos is hit)
- Coordinator::GetNext then calls SetNonErrorTerminalState(ExecState::RETURNED_RESULTS) which eventually calls WaitForBackends()
- WaitForBackends() will block until all backends have completed
- One of the impalads running the query crashes, and thus never reports success for the query fragment it was running
- The WaitForBackends() call will then block indefinitely
- Any attempt to cancel the query fails because the original fetch request that drove the WaitForBackends() call has acquired the ClientRequestState lock, which thus prevents any cancellation from occurring.
Implementing IMPALA-6984 should theoretically fix this because as soon as eos is hit, the coordinator will call CancelBackends() rather than WaitForBackends(). Another solution would be to add a timeout to the WaitForBackends() so that it returns after the timeout is hit, this would force the fetch request to return 0 rows with hasMoreRows=true, and unblock any cancellation threads.