[IMPALA-7305] membership entry for failed impalad gets stuck in statestore due to race between failure detection and update processing - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: Impala 2.5.0, Impala 2.6.0, Impala 2.7.0, Impala 2.8.0, Impala 2.9.0, Impala 2.10.0, Impala 2.11.0
Fix Version/s: Impala 2.12.0, Impala 3.1.0
Component/s: Distributed Exec
Labels: None

Description

I was able to reproduce this bug on a version of Impala that predates IMPALA-4953, using the attached patch, which adds a sleep. The patch is a hack and only works on my system (it has a name hardcoded). The trick is to kill the third impalad manually while the cluster is starting up.

The system then gets stuck in a state where all impalads think 22002 is alive even though the process was actually killed. Queries fail because they keep getting scheduled on the dead impalad.

Known backend(s): 3
Address               Coordinator  Executor
tarmstrong-box:22002  true         true
tarmstrong-box:22001  true         true
tarmstrong-box:22000  true         true

The race seems quite exotic, but it may be possible if there are intermittent transport errors (causing heartbeats to fail) or if there are delays processing topics, e.g. due to lock contention.

IMPALA-4953 fixes the problem by deleting newly-added transient entries if the subscriber got unregistered while the statestore was processing an update.
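The interleaving can be illustrated with a small, deterministic toy model. This is only a sketch under simplifying assumptions, not Impala's statestore code: the class `ToyStatestore`, its methods, the `cleanup_after_unregister` flag and the `impalad:<port>` keys are all hypothetical names made up for illustration. It shows failure detection unregistering a subscriber before that subscriber's pending update is applied, and how re-checking registration after applying the update removes the orphaned entry.

```python
class ToyStatestore:
    """Toy model of a membership topic; hypothetical, not Impala's actual code."""

    def __init__(self, cleanup_after_unregister):
        self.subscribers = set()   # ids of currently registered subscribers
        self.membership = {}       # transient entries: key -> owning subscriber id
        # Assumption: this flag stands in for the behaviour described for the fix.
        self.cleanup_after_unregister = cleanup_after_unregister

    def register(self, sub_id):
        self.subscribers.add(sub_id)

    def unregister(self, sub_id):
        """Failure-detection path: drop the subscriber and its transient entries."""
        self.subscribers.discard(sub_id)
        self.membership = {
            key: owner for key, owner in self.membership.items() if owner != sub_id
        }

    def apply_update(self, sub_id, keys):
        """Update-processing path: add the transient entries sent by sub_id."""
        for key in keys:
            self.membership[key] = sub_id
        # With the flag set, delete the entries just added if the subscriber
        # was unregistered while its update was still pending.
        if self.cleanup_after_unregister and sub_id not in self.subscribers:
            for key in keys:
                self.membership.pop(key, None)


def run_race(cleanup_after_unregister):
    """Replay the problematic ordering and return the surviving membership keys."""
    ss = ToyStatestore(cleanup_after_unregister)
    for port in (22000, 22001):
        ss.register(port)
        ss.apply_update(port, [f"impalad:{port}"])

    ss.register(22002)
    pending = ["impalad:22002"]      # update received but not yet applied
    ss.unregister(22002)             # failure detected first: nothing to clean up yet
    ss.apply_update(22002, pending)  # late update adds an entry with no live owner
    return sorted(ss.membership)


if __name__ == "__main__":
    print("without cleanup:", run_race(False))
    print("with cleanup:   ", run_race(True))
```

In this model, `run_race(False)` leaves the entry for 22002 in the membership map even though no such subscriber is registered, which mirrors the stuck state above; `run_race(True)` leaves only the two live entries.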

Attachments


0001-Repro-CDH-70703.patch
16/Jul/18 21:26
3 kB
Tim Armstrong

Issue Links

is related to

IMPALA-7306 Add regression test for IMPALA-7305

Resolved

relates to

IMPALA-4953 Prevent large statestore updates from head-of-line blocking subsequent updates to different topics

Resolved


People

Assignee: Tim Armstrong

Reporter: Tim Armstrong

Votes: 0

Watchers: 4

Dates

Created: 16/Jul/18 21:30

Updated: 25/Sep/18 04:54

Resolved: 16/Jul/18 21:32