[IMPALA-9124] Transparently retry queries that fail due to cluster membership changes - ASF JIRA

Details

Type: New Feature
Status: In Progress
Priority: Critical
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: Backend, Clients
Labels: None

Description

Currently, if the Impala Coordinator or any Executors run into errors during query execution, Impala will fail the entire query. It would improve user experience to transparently retry the query for some transient, recoverable errors.

This JIRA focuses on retrying queries that would otherwise fail due to cluster membership changes. Specifically, it covers two cases: node failures that change the cluster membership (currently, the Coordinator cancels all queries running on a node if it detects that the node is no longer part of the cluster) and node blacklisting (the Coordinator blacklists a node because it detects a problem with that node: it can't execute RPCs against the node). It is not focused on retrying general errors (e.g. frontend errors or MemLimitExceeded exceptions).
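
The scope above can be sketched as a small retry-eligibility check. This is only an illustration of the intended split between retryable and non-retryable failures, not Impala's implementation: the `FailureCause` categories, the `should_retry` function, and the `max_retries` limit of one are all hypothetical names and assumptions chosen for this sketch.

```python
from enum import Enum, auto


class FailureCause(Enum):
    """Hypothetical failure categories, named after the cases in the description."""
    NODE_LEFT_CLUSTER = auto()   # node failure changed the cluster membership
    NODE_BLACKLISTED = auto()    # Coordinator could not execute RPCs against the node
    FRONTEND_ERROR = auto()      # general error, out of scope for this JIRA
    MEM_LIMIT_EXCEEDED = auto()  # general error, out of scope for this JIRA


# Only failures caused by cluster membership changes are in scope for a retry.
RETRYABLE_CAUSES = frozenset({
    FailureCause.NODE_LEFT_CLUSTER,
    FailureCause.NODE_BLACKLISTED,
})


def should_retry(cause: FailureCause, attempts_so_far: int, max_retries: int = 1) -> bool:
    """Return True if a failed query is eligible for a transparent retry.

    max_retries=1 is an assumption made for this sketch, not a documented limit.
    """
    if attempts_so_far >= max_retries:
        return False
    return cause in RETRYABLE_CAUSES
```

Under these assumptions, `should_retry(FailureCause.NODE_BLACKLISTED, 0)` is eligible for a retry, while `should_retry(FailureCause.MEM_LIMIT_EXCEEDED, 0)` is not, because general errors stay outside the scope of this JIRA.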

Attachments

Impala Transparent Query Retries.pdf (138 kB, 05/Nov/19 17:22, Sahil Takiar)

Issue Links

is related to

- IMPALA-6984 Coordinator should cancel backends when returning EOS (Reopened)
- IMPALA-8339 Coordinator should be more resilient to fragment instances startup failure (Resolved)
- IMPALA-9834 test_query_retries.TestQueryRetries is flaky on erasure coding configurations (Resolved)
- IMPALA-9113 Queries can hang if an impalad is killed after a query has FINISHED (Resolved)
- IMPALA-10585 retry_failed_queries=true should not apply to DMLs (Resolved)
- IMPALA-6194 Ensure all fragment instances notice cancellation (Open)
- IMPALA-8138 Re-introduce rpc debugging options (Resolved)
- IMPALA-2638 Retry queries that fail during scheduling (Resolved)

relates to

- IMPALA-9299 Node Blacklisting: Coordinators should blacklist unhealthy nodes (Open)

requires

- IMPALA-6894 Use an internal representation of query states in ClientRequestState (Resolved)

Sub-Tasks

1.	Fix error reporting when AuxErrorInfoPB is present without an error	Open	Wenzhe Zhou
2.	Test coverage for query retries when there is a network partition	Open	Wenzhe Zhou
3.	Retried runtime profile should include some information about previous query attempts	Open	Unassigned
4.	Add impalad level metrics for query retries	Open	Unassigned
5.	Queries should only be retried if all fragments fail with retryable errors	Open	Unassigned
6.	Re-factor ImpalaServer, ClientRequestState, Coordinator protocol	Open	Unassigned
7.	Test that queries are not retried if they cause an impalad to crash	Open	Unassigned
8.	Web UI improvements for retried queries	Open	Unassigned
9.	Add support for multi query retries on cluster membership changes	Open	Unassigned
10.	Profile log does not include profiles of failed queries	Open	Unassigned
11.	Impala Doc: Add docs for transparent query retries	Open	shajini thayasingh
12.	Consider using num_rows_fetched instead of fetched_rows in checking whether client has fetched any results in TryQueryRetry	Open	Unassigned

People

Assignee: Sahil Takiar

Reporter: Sahil Takiar

Votes: 0

Watchers: 7

Dates

Created: 05/Nov/19 17:09

Updated: 02/May/23 16:13