[IMPALA-8339] Coordinator should be more resilient to fragment instances startup failure - ASF JIRA

Details

Type: Sub-task
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: None
Fix Version/s: Impala 3.3.0
Component/s: Distributed Exec
Labels:
- Availability
- resilience

Description

Impala currently relies on the statestore for cluster membership. When an Impala executor goes offline, it may take a while for the statestore to declare that node unavailable and for that information to propagate to all coordinator nodes. Within this window, some coordinator nodes may still attempt to issue RPCs to the faulty node, resulting in RPC failures that in turn cause query failures. In other words, many queries may fail to start within this window until all coordinator nodes get the latest information on cluster membership.

Going forward, the coordinator may need to fall back to backup executors for each fragment in case some of the executors are not available. Moreover, the coordinator should treat the cluster membership information from the statestore (or any external source of truth, e.g. etcd) as hints instead of ground truth and adjust the scheduling of fragment instances based on the availability of the executors from the coordinator's perspective.
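
The "membership as hints" idea could look roughly like the sketch below. It is an illustration only, not Impala's implementation: `ExecutorView`, `pick_executor`, the `blacklist_ttl_s` parameter and its default value are all hypothetical names and values chosen for the example. The coordinator keeps a local, time-limited blacklist of hosts whose RPCs recently failed, and picks the first candidate that the statestore lists and that is not locally blacklisted.

```python
import time
from dataclasses import dataclass, field


@dataclass
class ExecutorView:
    """Coordinator-local view of executor health (hypothetical, not an Impala API)."""

    # Assumed value: how long a host stays blacklisted after a failed RPC.
    blacklist_ttl_s: float = 12.0
    _blacklisted_until: dict = field(default_factory=dict)

    def report_rpc_failure(self, host, now=None):
        """Record a failed RPC so the host is skipped until the TTL expires."""
        now = time.monotonic() if now is None else now
        self._blacklisted_until[host] = now + self.blacklist_ttl_s

    def is_usable(self, host, now=None):
        """True if the host was never blacklisted or its blacklist entry expired."""
        now = time.monotonic() if now is None else now
        until = self._blacklisted_until.get(host)
        return until is None or now >= until


def pick_executor(candidates, statestore_members, view, now=None):
    """Return the first candidate, in preference order, that the statestore
    lists (the hint) and that the coordinator has not locally blacklisted.
    Returns None if no candidate qualifies."""
    for host in candidates:
        if host in statestore_members and view.is_usable(host, now):
            return host
    return None
```

In this sketch, a coordinator would call `view.report_rpc_failure(host)` when a fragment-instance startup RPC fails and then call `pick_executor` again with the primary executor followed by its backups. The expiring blacklist lets a recovered host be used again without waiting for a statestore update.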

Issue Links

blocks

IMPALA-2638 Retry queries that fail during scheduling

Resolved

relates to

IMPALA-9224 Blacklist nodes with faulty disks

Resolved

IMPALA-9137 Blacklist node if a DataStreamService RPC to the node fails

Resolved

IMPALA-9243 Coordinator Web UI should list which executors have been blacklisted

Resolved

IMPALA-9124 Transparently retry queries that fail due to cluster membership changes

In Progress

People

Assignee: Thomas Tauber-Marshall

Reporter: Michael Ho

Votes: 0

Watchers: 3

Dates

Created: 22/Mar/19 22:23

Updated: 16/Jan/20 21:04

Resolved: 30/Jul/19 15:22