[IMPALA-9425] Statestore may fail to report when an impalad has failed - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: Impala 3.4.0
Fix Version/s: Impala 3.4.0
Component/s: Distributed Exec
Labels: None

Description

If an impalad fails and another impalad is quickly started at the same host:port combination, the statestore may fail to report to the coordinators that the original impalad went down.

The reason is that in the cluster membership topic, impalads are keyed by their statestore subscriber id, which is "impalad@host:port". If the new impalad registers itself before a topic update has been generated for a particular coordinator, the statestore has no way of knowing that the key was deleted and then re-added since the last update it sent to that coordinator.
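
To illustrate the failure mode, here is a minimal sketch in Python. It is a simplified model, not Impala's actual statestore code: the topic is treated as a plain dict keyed by subscriber id, and `diff_membership`, the host name, and the port are all hypothetical names chosen for the example.

```python
# Simplified model: the membership topic is a dict keyed by subscriber id.
# The real statestore tracks topic entries differently; this sketch only
# compares the key sets a coordinator would observe between two updates.

def diff_membership(last_sent, current):
    """Return (added, removed) keys between two membership snapshots."""
    added = set(current) - set(last_sent)
    removed = set(last_sent) - set(current)
    return added, removed

subscriber_id = "impalad@host1:22000"  # hypothetical host:port

# State as of the last update sent to one coordinator.
last_sent = {subscriber_id: "process A"}

# Between updates: process A fails, then a replacement registers
# under the same subscriber id.
topic = dict(last_sent)
del topic[subscriber_id]
topic[subscriber_id] = "process B"

added, removed = diff_membership(last_sent, topic)
print(added, removed)  # both sets are empty: the failure is invisible
```

Because the key is identical before and after the restart, nothing in the key set tells the coordinator that the process behind it changed.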

The result is that queries that were running on the impalad that failed may not be cancelled by the coordinator until they pass the unresponsive backend timeout, which by default is ~12 minutes.

As a solution, I propose that we add the concept of a UUID for impalads, where each impalad generates its own UUID on startup. This allows us to differentiate between different impalads running at the same host:port combination.
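
A rough sketch of how a per-process UUID would make the restart visible, again in Python and again as a simplified model rather than the actual fix. The `Backend` class, its `backend_id` field, `diff_membership`, and the address are hypothetical; only the idea of generating a UUID at startup comes from the proposal above.

```python
import uuid
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Backend:
    """Hypothetical stand-in for an impalad's membership entry."""
    address: str
    # Generated once per process start, so a restarted process gets a new id.
    backend_id: str = field(default_factory=lambda: str(uuid.uuid4()))

def diff_membership(last_sent, current):
    """Return (added, removed) keys between two membership snapshots."""
    added = set(current) - set(last_sent)
    removed = set(last_sent) - set(current)
    return added, removed

old = Backend("host1:22000")  # original process
new = Backend("host1:22000")  # replacement at the same host:port

# Membership keyed by UUID instead of host:port.
last_sent = {old.backend_id: old}
topic = dict(last_sent)
del topic[old.backend_id]      # original process fails
topic[new.backend_id] = new    # replacement registers

added, removed = diff_membership(last_sent, topic)
# removed holds the old UUID and added holds the new one, so the
# coordinator can tell the original process is gone.
```

Keying by UUID means a delete followed by a re-add at the same address always changes the key set, whatever the timing of topic updates.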

The UUID can also be used to simplify some logic in the scheduler, ExecutorGroup, ExecutorBlacklist, etc., where we currently have data structures containing info about impalads that are keyed off host/port combinations.

People

Assignee: Thomas Tauber-Marshall

Reporter: Thomas Tauber-Marshall

Votes: 0

Watchers: 4

Dates

Created: 26/Feb/20 00:35

Updated: 12/Mar/20 21:45

Resolved: 12/Mar/20 21:45