[IMPALA-9224] Blacklist nodes with faulty disks - ASF JIRA

Details

Type: Sub-task
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: None
Fix Version/s: Impala 4.0.0
Component/s: Backend
Labels: None
Target Version: Impala 4.0.0

Description

Similar to IMPALA-8339 and IMPALA-9137, Impala should blacklist nodes with faulty disks. Specifically, if a query fails because of a disk error, the node with that disk should be blacklisted and the query should be retried.

We shouldn't need to blacklist nodes that fail to read from HDFS / S3, since those storage systems have their own internal mechanisms for recovering from faulty disks. We should only blacklist nodes when they fail to read from or write to local disks.

The two main components of Impala that read / write from local disk are the spill-to-disk and data caching features. Whenever a query fails because of a disk failure during spill-to-disk, the node should be blacklisted.
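The blacklist-and-retry behaviour described above can be sketched roughly as follows. This is a minimal illustration in Python, not Impala's implementation (the Impala backend is C++). The names `LocalDiskError` and `run_with_blacklisting`, and the `max_retries` parameter, are hypothetical and exist only for this example.

```python
class LocalDiskError(Exception):
    """Hypothetical error raised when a node fails a local disk read/write."""

    def __init__(self, node):
        super().__init__(f"local disk error on {node}")
        self.node = node


def run_with_blacklisting(run_query, nodes, max_retries=1):
    """Run a query, blacklisting any node that reports a local disk error.

    run_query is a callable that takes the list of usable nodes and either
    returns a result or raises LocalDiskError. Returns (result, blacklist).
    """
    blacklist = set()
    for attempt in range(max_retries + 1):
        # Schedule only on nodes that have not been blacklisted so far.
        healthy = [n for n in nodes if n not in blacklist]
        try:
            return run_query(healthy), blacklist
        except LocalDiskError as err:
            blacklist.add(err.node)
            if attempt == max_retries:
                raise  # out of retries: surface the failure to the caller
```

In this sketch the retry simply re-runs the whole query on the remaining nodes, which matches the description's "blacklist, then retry" ordering; how long a node stays blacklisted is left out.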

Reads from and writes to the data cache are a bit different. If a cache read fails due to a disk error, the error will be printed out and the Lookup() call to the cache will return 0 bytes read, which means it couldn't find the data in the cache. This should cause the scan to fall back to a normal, un-cached read. While this doesn't affect query correctness or the ability of a query to complete, it can affect performance. Since cache failures don't result in query failures, we might consider having a threshold of data cache read / write errors before blacklisting a node.
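One way to picture the threshold idea is a per-node error counter combined with the fallback to an un-cached read. The sketch below is an assumption-laden illustration in Python rather than Impala code: `CacheErrorTracker`, `read_with_fallback`, and the `threshold` value are all hypothetical, and an empty result stands in for Lookup() returning 0 bytes.

```python
from dataclasses import dataclass


@dataclass
class CacheErrorTracker:
    """Counts data cache disk errors on one node (hypothetical helper)."""

    threshold: int  # assumed tunable; the issue does not propose a value
    error_count: int = 0

    def record_error(self) -> None:
        self.error_count += 1

    @property
    def should_blacklist(self) -> bool:
        # The node becomes a blacklisting candidate once errors reach the threshold.
        return self.error_count >= self.threshold


def read_with_fallback(cache_lookup, uncached_read, tracker):
    """Try the cache first; fall back to an un-cached read on a miss or error."""
    try:
        cached = cache_lookup()
    except OSError:
        # A disk error is counted, then treated like a cache miss (0 bytes).
        tracker.record_error()
        cached = b""
    return cached if cached else uncached_read()
```

The read always succeeds from the caller's point of view; only the counter changes, so a scheduler could check `should_blacklist` separately from the query path.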

We need to be careful to only capture specific disk failures. Errors such as disk quota exceeded or permission denied shouldn't result in blacklisting, as they are typically the result of system misconfiguration.
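The distinction could be expressed as an allowlist of error codes. The following Python sketch is illustrative only: treating `EIO` as the indicator of a faulty disk is an assumption made for this example, not something the issue specifies, and `should_blacklist_for` is a hypothetical name. The quota and permission cases come from the description above.

```python
import errno

# Assumption for illustration: a low-level I/O error suggests a faulty disk.
FAULTY_DISK_ERRNOS = frozenset({errno.EIO})

# From the description: these usually point to misconfiguration, not hardware.
MISCONFIGURATION_ERRNOS = frozenset({errno.EDQUOT, errno.EACCES})


def should_blacklist_for(err: OSError) -> bool:
    """Return True only for errors explicitly listed as faulty-disk errors."""
    return err.errno in FAULTY_DISK_ERRNOS
```

Using an allowlist means any error code that is not explicitly recognised, including the misconfiguration ones, does not trigger blacklisting, which errs on the side of keeping nodes in service.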

Issue Links

is related to

IMPALA-8339 Coordinator should be more resilient to fragment instances startup failure (Resolved)

IMPALA-9137 Blacklist node if a DataStreamService RPC to the node fails (Resolved)

relates to

IMPALA-4683 Intelligently blacklist scratch disks with physical I/O errors (Open)

People

Assignee: Wenzhe Zhou
Reporter: Sahil Takiar
Votes: 0
Watchers: 4

Dates

Created: 10/Dec/19 00:50
Updated: 07/Feb/21 04:49
Resolved: 07/Feb/21 04:49