Details
Bug
Status: Resolved
Critical
Resolution: Fixed
Impala 2.1
None
Description
If a node is truly hung, the statestore appears able to wait forever for the heartbeat response. We need to check the TCP timeouts on the connections from the statestore to the subscriber.
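To illustrate the kind of timeout in question, here is a minimal Python sketch, not the statestore's actual code (which is C++ and uses its own RPC layer). The function name `send_heartbeat`, the payload, and the one-byte acknowledgement are hypothetical; the point is only that connect, send, and receive are each bounded, so a hung peer yields a failure instead of an indefinite wait.

```python
import socket


def send_heartbeat(host, port, payload, timeout_s=3.0):
    """Hypothetical heartbeat: True if the peer replies within timeout_s.

    The timeout passed to create_connection bounds the connect and is kept
    on the returned socket, so sendall and recv are bounded as well.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s) as sock:
            sock.sendall(payload)
            # Any reply byte counts as an acknowledgement in this sketch.
            return len(sock.recv(1)) > 0
    except OSError:
        # Covers timeouts, refused connections, and resets alike.
        return False
```

A caller would treat a `False` result as a missed heartbeat and apply whatever failure-counting policy it already has.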
Since the operating system can also interfere, we should periodically visit all heartbeat threads and check how long each has been in the heartbeat RPC. I think we can forcibly close the socket from a GC thread if the RPC has taken too long. The next heartbeat attempt should then hit the TCP connection timeout (or be refused), and the subscriber should be marked as dead.
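The proposal could be sketched roughly as follows, again in Python for illustration only. The class `HeartbeatWatchdog` and its method names are hypothetical and do not correspond to anything in the Impala code base. Heartbeat threads register the socket before entering the RPC and deregister afterwards; a separate GC thread calls `sweep()` periodically and shuts down any socket that has been in flight too long.

```python
import socket
import threading
import time


class HeartbeatWatchdog:
    """Hypothetical tracker for in-flight heartbeat RPCs."""

    def __init__(self, max_rpc_seconds):
        self._max = max_rpc_seconds
        self._lock = threading.Lock()
        self._in_flight = {}  # subscriber id -> (socket, start time)

    def rpc_started(self, subscriber_id, sock):
        with self._lock:
            self._in_flight[subscriber_id] = (sock, time.monotonic())

    def rpc_finished(self, subscriber_id):
        with self._lock:
            self._in_flight.pop(subscriber_id, None)

    def sweep(self, now=None):
        """Close sockets of overdue RPCs and return their subscriber ids."""
        now = time.monotonic() if now is None else now
        with self._lock:
            overdue = [(sid, sock)
                       for sid, (sock, start) in self._in_flight.items()
                       if now - start > self._max]
            for sid, _ in overdue:
                del self._in_flight[sid]
        for _, sock in overdue:
            try:
                # shutdown() is intended to wake a thread blocked on this
                # socket; exact behaviour is platform dependent.
                sock.shutdown(socket.SHUT_RDWR)
            except OSError:
                pass  # Already disconnected.
            sock.close()
        return [sid for sid, _ in overdue]
```

The sockets are closed outside the lock so that a slow close cannot stall heartbeat threads that are registering or deregistering. The heartbeat thread whose socket was closed would see its RPC fail and can then count the failure toward marking the subscriber dead.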