[IMPALA-414] Impala server cannot detect crash-restart failures reliably - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Duplicate
Affects Version/s: Impala 1.0.1
Fix Version/s: None
Component/s: Distributed Exec
Labels:
- statestore

Target Version: Product Backlog

Description

The membership mechanism used to tell Impala servers about failures does not always detect fast crash-restarts. If a server restarts and re-registers before the state-store recognises that it has failed, the failure won't get reported to any other subscriber.
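The failure mode can be illustrated with a small simulation. This is only a sketch under assumptions: `NaiveStatestore`, its `timeout` parameter, and the explicit `now` timestamps are hypothetical names for illustration, not Impala's actual state-store code. It assumes failures are reported only when a subscriber has been silent for longer than a timeout.

```python
class NaiveStatestore:
    """Hypothetical membership tracker that reports a failure only
    after a subscriber has been silent for longer than `timeout`."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}          # subscriber id -> time last heard from
        self.reported_failures = []  # ids reported as failed to subscribers

    def register(self, subscriber_id, now):
        # A re-registration overwrites the old entry, so the previous
        # incarnation of the process is forgotten without any report.
        self.last_seen[subscriber_id] = now

    def check_failures(self, now):
        for subscriber_id, seen in list(self.last_seen.items()):
            if now - seen > self.timeout:
                self.reported_failures.append(subscriber_id)
                del self.last_seen[subscriber_id]
        return list(self.reported_failures)


def simulate_fast_restart():
    store = NaiveStatestore(timeout=10)
    store.register("impalad-1", now=0)
    # The server crashes at t=2 and re-registers at t=5, before the timeout.
    store.register("impalad-1", now=5)
    return store.check_failures(now=12)


def simulate_slow_restart():
    store = NaiveStatestore(timeout=10)
    store.register("impalad-1", now=0)
    # The server crashes and stays down past the timeout.
    failures = store.check_failures(now=12)
    store.register("impalad-1", now=15)
    return failures
```

In this sketch the slow restart is reported, while the fast restart leaves `reported_failures` empty, so a coordinator waiting on the old process would never be told that it died.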

The right way to fix this, I think, is to track a version number for every subscriber. When a subscriber reconnects, it gets a new version number. For every query, we record the highest subscriber version number known at the time the query starts. Then if any backend executing the query later shows a higher version number, it has likely restarted since the query started. There might be a few false positives, since a node could conceivably restart between the scheduling assignment and actually receiving the query, but that's unlikely and better than false negatives.
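A minimal sketch of the version-number idea follows. It rests on assumptions that the description does not spell out: version numbers come from a single monotonically increasing counter, and the names `VersionedMembership`, `Query`, and `restarted_backends` are hypothetical, chosen only for illustration.

```python
class VersionedMembership:
    """Hypothetical registry that hands out a fresh version number
    on every (re-)registration, from one global counter (assumption)."""

    def __init__(self):
        self._next_version = 1
        self._versions = {}  # subscriber id -> version of current incarnation

    def register(self, subscriber_id):
        version = self._next_version
        self._next_version += 1
        self._versions[subscriber_id] = version
        return version

    def version_of(self, subscriber_id):
        return self._versions.get(subscriber_id)

    def highest_version(self):
        # Highest version handed out so far (0 if nothing has registered).
        return self._next_version - 1


class Query:
    """Records the highest version known when the query starts."""

    def __init__(self, membership, backends):
        self.backends = list(backends)
        self.max_version_at_start = membership.highest_version()

    def restarted_backends(self, membership):
        # A backend whose version is higher than anything known at query
        # start must have re-registered since then. Backends that are no
        # longer registered are left to the ordinary failure path.
        restarted = []
        for backend in self.backends:
            version = membership.version_of(backend)
            if version is not None and version > self.max_version_at_start:
                restarted.append(backend)
        return restarted
```

For example, after registering `"a"` and `"b"` and starting a query on both, a second `register("b")` makes `restarted_backends` return `["b"]`, even though `"b"` was never absent long enough to miss a timeout. The false positive mentioned above corresponds to a backend re-registering after the version is recorded but before it receives its part of the query.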

Issue Links

duplicates

IMPALA-2990 Coordinator should timeout and cancel queries with unresponsive / stuck executors

Resolved

relates to

IMPALA-412 Impala might hang when an impalad die during query execution

Resolved

People

Assignee: Unassigned

Reporter: Henry Robinson

Votes: 0

Watchers: 1

Dates

Created: 12/Jun/13 23:11

Updated: 26/Oct/18 01:24

Resolved: 26/Oct/18 01:24