Details
- Type: Bug
- Status: Resolved
- Priority: Critical
- Resolution: Fixed
- Affects Version/s: Impala 2.5.0, Impala 2.6.0, Impala 2.7.0, Impala 2.8.0, Impala 2.9.0, Impala 2.10.0, Impala 2.11.0
- None
Description
I was able to reproduce this bug on a version of Impala pre-IMPALA-4953 with the attached patch that adds a sleep. The patch is a hack and only works on my system (it has a name hardcoded). The trick is to kill the third impalad manually while the cluster is starting up.
Then the system gets stuck in a state where all impalads think 22002 is alive even though the process was actually killed. Queries fail because they keep getting scheduled on the dead impalad.
Known backend(s): 3
Address               Coordinator  Executor
tarmstrong-box:22002  true         true
tarmstrong-box:22001  true         true
tarmstrong-box:22000  true         true
The race seems quite exotic but may be possible if there are intermittent transport errors (causing heartbeats to fail) or if there are delays in processing topic updates, e.g. from contending for locks.
IMPALA-4953 fixes the problem by deleting newly-added transient entries if the subscriber got unregistered while the statestore was processing an update.
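To make the race and the fix easier to follow, here is a minimal Python sketch. It is a toy model, not Impala's statestore code: the class `ToyStatestore`, the `delay_hook` parameter (standing in for the sleep in the attached patch), the `cleanup` flag and the `box:2200x` addresses are all hypothetical names introduced for illustration. It only shows the idea described above: an update applied after its sender was unregistered leaves a stale transient entry unless the entries just added are deleted again.

```python
import threading


class ToyStatestore:
    """Toy model of a membership topic; not Impala's actual statestore code."""

    def __init__(self):
        self._lock = threading.Lock()
        self._registered = set()   # subscriber ids currently registered
        self._entries = {}         # transient topic entry -> owning subscriber

    def register(self, subscriber):
        with self._lock:
            self._registered.add(subscriber)

    def unregister(self, subscriber):
        # Failure handling: forget the subscriber and drop its transient entries.
        with self._lock:
            self._registered.discard(subscriber)
            for key in [k for k, owner in self._entries.items() if owner == subscriber]:
                del self._entries[key]

    def process_update(self, subscriber, keys, cleanup, delay_hook=None):
        # delay_hook is a hypothetical stand-in for the sleep in the attached
        # patch: it runs after the update arrives but before it is applied.
        if delay_hook is not None:
            delay_hook()
        with self._lock:
            for key in keys:
                self._entries[key] = subscriber
            # Sketch of the fix: if the subscriber was unregistered while the
            # update was in flight, delete the entries that were just added.
            if cleanup and subscriber not in self._registered:
                for key in keys:
                    self._entries.pop(key, None)

    def known_backends(self):
        with self._lock:
            return sorted(self._entries)


def simulate(cleanup):
    statestore = ToyStatestore()
    backends = ["box:22000", "box:22001", "box:22002"]  # hypothetical addresses
    for backend in backends:
        statestore.register(backend)
    statestore.process_update("box:22000", ["box:22000"], cleanup)
    statestore.process_update("box:22001", ["box:22001"], cleanup)
    # The third impalad is killed (and unregistered) while its update is pending.
    statestore.process_update(
        "box:22002", ["box:22002"], cleanup,
        delay_hook=lambda: statestore.unregister("box:22002"),
    )
    return statestore.known_backends()


print(simulate(cleanup=False))  # stale entry for box:22002 survives
print(simulate(cleanup=True))   # stale entry is removed
```

Without the cleanup step the toy membership still lists `box:22002` after it was unregistered, which mirrors the stuck state in the report. With the cleanup step only the two live backends remain.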
Attachments
Issue Links
- is related to: IMPALA-7306 Add regression test for IMPALA-7305 (Resolved)
- relates to: IMPALA-4953 Prevent large statestore updates from head-of-line blocking subsequent updates to different topics (Resolved)