[IMPALA-10476] Remove executor node with faulty disks from executor group - ASF JIRA

Details

Type: Sub-task
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: Distributed Exec
Labels: None

Description

If an executor node frequently hits disk I/O failures when reading from or writing to local disk, it should report its unhealthy state to the statestore so that the node can be marked as down and removed from its executor group, avoiding repeated query failures in the cluster. This gives an executor node a mechanism to remove itself from scheduling.

The two major components of Impala that read from and write to local disk are the spill-to-disk and data caching features. We need to add stats that count such local disk failures over a recent period, such as the last x seconds, and then use these stats to decide whether a node is healthy enough to execute query fragment instances.
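As a rough illustration of the "failures in the last x seconds" stat, here is a minimal sketch in Python (Impala's backend is C++, so this only shows the idea). The class name `DiskFailureWindow`, the `window_seconds` and `max_failures` parameters, and the threshold rule are all assumptions made for this example; none of them come from Impala.

```python
from collections import deque
import time


class DiskFailureWindow:
    """Hypothetical sliding-window counter of local disk I/O failures."""

    def __init__(self, window_seconds, max_failures, clock=time.monotonic):
        self.window_seconds = window_seconds  # the "last x seconds"
        self.max_failures = max_failures      # assumed health threshold
        self._clock = clock                   # injectable for testing
        self._failures = deque()              # failure timestamps, oldest first

    def _evict_expired(self, now):
        # Drop failures that have fallen out of the window.
        while self._failures and now - self._failures[0] > self.window_seconds:
            self._failures.popleft()

    def record_failure(self):
        # Would be called from the spill-to-disk or data cache I/O paths.
        now = self._clock()
        self._failures.append(now)
        self._evict_expired(now)

    def failure_count(self):
        self._evict_expired(self._clock())
        return len(self._failures)

    def is_healthy(self):
        # Assumed rule: healthy while recent failures stay at or below the limit.
        return self.failure_count() <= self.max_failures
```

A node could consult `is_healthy()` periodically and report a change to the statestore. A deque of timestamps keeps the count exact; a bucketed counter would use less memory if failures can be very frequent.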

The health state of an executor node should be shown on the debug WebUI. We should also allow users to override the node's health state. Once its health state is overridden, the node will restart to register itself with the statestore.
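One possible way to model the user override is to keep the measured state and the override separately, so the WebUI can show both. The Python sketch below is illustrative only: `Health`, `NodeHealth`, and `set_override` are hypothetical names, and the return value merely signals when a restart and registration would be needed under that assumption.

```python
from enum import Enum
from typing import Optional


class Health(Enum):
    HEALTHY = "healthy"
    UNHEALTHY = "unhealthy"


class NodeHealth:
    """Hypothetical holder for a node's measured and user-overridden state."""

    def __init__(self) -> None:
        self.measured: Health = Health.HEALTHY   # derived from disk failure stats
        self.override: Optional[Health] = None   # set by a user, None if unset

    def effective(self) -> Health:
        # A user override, when present, takes precedence over the measurement.
        return self.override if self.override is not None else self.measured

    def set_override(self, state: Optional[Health]) -> bool:
        # Returns True when the effective state changed, which is the point
        # at which the node would restart and register with the statestore.
        before = self.effective()
        self.override = state
        return self.effective() != before
```

Keeping the measured value intact means clearing the override (passing `None`) falls back to whatever the failure stats currently indicate.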

People

Assignee: Wenzhe Zhou

Reporter: Wenzhe Zhou

Votes: 0

Watchers: 2

Dates

Created: 05/Feb/21 06:00

Updated: 05/Feb/21 17:32