[IMPALA-9137] Blacklist node if a DataStreamService RPC to the node fails - ASF JIRA

Details

Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: Impala 3.4.0
Component/s: Backend
Labels: None

Description

If a query fails because an RPC to a specific node failed, the query error message will be similar to one of the following:

ERROR: TransmitData() to 10.65.30.141:27000 failed: Network error: recv got EOF from 10.65.30.141:27000 (error 108)
ERROR: TransmitData() to 10.65.29.251:27000 failed: Network error: recv error from 0.0.0.0:0: Transport endpoint is not connected (error 107)
ERROR: TransmitData() to 10.65.26.254:27000 failed: Network error: Client connection negotiation failed: client connection to 10.65.26.254:27000: connect: Connection refused (error 111)
ERROR: EndDataStream() to 127.0.0.1:27002 failed: Network error: recv error from 0.0.0.0:0: Transport endpoint is not connected (error 107)

RPCs are already retried, so it is likely that something is wrong with the target node. Perhaps it crashed or is so overloaded that it can't process RPC requests. In any case, the Impala Coordinator should blacklist the target of the failed RPC so that future queries don't fail with the same error.
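As a rough illustration of the proposal, the sketch below parses error strings in the format shown above, extracts the RPC target and the POSIX error code, and adds the target to a coordinator-local blacklist when the code indicates a connection-level failure. The names `parse_rpc_failure`, `Blacklist`, and `BLACKLISTABLE_ERRNOS` are hypothetical and are not Impala's backend API (the real backend is C++ and would inspect the RPC status rather than a string). The set of codes is taken only from the example messages in this issue.

```python
import re
from typing import NamedTuple, Optional, Set

# Error codes that appear in the example messages (Linux errno values):
# 107 = ENOTCONN, 108 = ESHUTDOWN, 111 = ECONNREFUSED.
# Assumption: only these codes trigger a blacklist in this sketch.
BLACKLISTABLE_ERRNOS = {107, 108, 111}

# Matches e.g. "ERROR: TransmitData() to 10.0.0.1:27000 failed: ... (error 111)"
_RPC_ERROR = re.compile(
    r"^ERROR: (?P<rpc>\w+)\(\) to (?P<target>[\d.]+:\d+) failed: "
    r".*\(error (?P<errno>\d+)\)$"
)


class RpcFailure(NamedTuple):
    rpc: str      # RPC name, e.g. "TransmitData"
    target: str   # "host:port" of the node the RPC was sent to
    errno: int    # POSIX error code reported at the end of the message


def parse_rpc_failure(message: str) -> Optional[RpcFailure]:
    """Return the parsed failure, or None if the message has another shape."""
    match = _RPC_ERROR.match(message.strip())
    if match is None:
        return None
    return RpcFailure(match.group("rpc"), match.group("target"),
                      int(match.group("errno")))


class Blacklist:
    """Hypothetical coordinator-local set of nodes to avoid when scheduling."""

    def __init__(self) -> None:
        self._nodes: Set[str] = set()

    def record_failure(self, failure: RpcFailure) -> bool:
        """Blacklist the RPC target if the error code qualifies."""
        if failure.errno in BLACKLISTABLE_ERRNOS:
            self._nodes.add(failure.target)
            return True
        return False

    def __contains__(self, target: str) -> bool:
        return target in self._nodes
```

Note that the blacklisted address is the RPC target named after `to`, not the peer address inside the network error, which can be `0.0.0.0:0` as in the second and fourth messages.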

If the node crashed, the statestore will eventually remove the failed node from the cluster as well. However, the statestore can take a while to detect a failed node because it has a long timeout. The issue is that queries scheduled on that node can still fail within the timeout window.

Blacklisting is necessary for transparent query retries: if a node does crash, the statestore takes too long to remove it from the cluster, so any attempt to retry the query will be scheduled on the crashed node again and fail with the same error.
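The retry argument can be made concrete with a toy model. In the sketch below, `membership` stands in for the statestore's view of the cluster, which still lists the crashed node during the timeout window. `RpcError`, `schedule`, and `run_with_retry` are hypothetical names for illustration only, not Impala code. The retry succeeds only because the failed RPC target is excluded before rescheduling.

```python
from typing import Callable, List, Set


class RpcError(Exception):
    """Hypothetical error carrying the node an RPC failed to reach."""

    def __init__(self, target: str) -> None:
        super().__init__(f"RPC to {target} failed")
        self.target = target


def schedule(membership: List[str], blacklist: Set[str]) -> List[str]:
    """Pick executors from cluster membership, skipping blacklisted nodes."""
    return [node for node in membership if node not in blacklist]


def run_with_retry(membership: List[str],
                   execute: Callable[[List[str]], object],
                   use_blacklist: bool = True,
                   max_attempts: int = 2) -> object:
    """Run a query, retrying after an RPC failure.

    `membership` is not updated between attempts, modelling a statestore
    that has not yet noticed the crashed node.
    """
    blacklist: Set[str] = set()
    last_error = None
    for _ in range(max_attempts):
        nodes = schedule(membership, blacklist)
        try:
            return execute(nodes)
        except RpcError as err:
            last_error = err
            if use_blacklist:
                # Avoid the failed target on the next attempt.
                blacklist.add(err.target)
    raise last_error
```

With `use_blacklist=False`, every attempt is scheduled on the same stale membership and fails in the same way, which is the situation described above.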

Issue Links

causes

IMPALA-9262 Blacklist test test_kill_impalad_with_running_queries fails in exhaustive mode

Resolved

depends upon

IMPALA-8138 Re-introduce rpc debugging options

Resolved

is related to

IMPALA-8339 Coordinator should be more resilient to fragment instances startup failure

Resolved

IMPALA-9296 Move FragmentInstanceExecStatus' AuxErrorInfo to StatefulStatus

Resolved

relates to

IMPALA-9227 Test coverage for query retries when there is a network partition

Open

IMPALA-9253 Blacklist additional posix error codes for failed DataStreamService RPCs

Open

IMPALA-9224 Blacklist nodes with faulty disks

Resolved

IMPALA-9295 RPC failures don't always trigger a blacklist

Resolved

People

Assignee: Sahil Takiar

Reporter: Sahil Takiar

Votes: 0

Watchers: 2

Dates

Created: 08/Nov/19 17:05

Updated: 16/Jan/20 20:56

Resolved: 20/Dec/19 18:26