[IMPALA-9124] Transparently retry queries that fail due to cluster membership changes - ASF JIRA

Details

Type: New Feature
Status: In Progress
Priority: Critical
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: Backend, Clients
Labels: None

Description

Currently, if the Impala Coordinator or any Executor runs into an error during query execution, Impala fails the entire query. Transparently retrying the query for some transient, recoverable errors would improve the user experience.

This JIRA focuses on retrying queries that would otherwise fail due to cluster membership changes. Two cases are in scope: node failures that change the cluster membership (currently the Coordinator cancels all queries running on a node if it detects that the node is no longer part of the cluster), and node blacklisting (the Coordinator blacklists a node because it detects a problem with that node, such as being unable to execute RPCs against it). Retrying general errors (e.g. any frontend errors, MemLimitExceeded exceptions, etc.) is out of scope.
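The scope described above can be illustrated with a minimal Python sketch. It is not Impala's implementation (Impala's backend is C++), and every name in it (`FailureKind`, `QueryFailure`, `run_with_retry`, `max_retries`) is hypothetical. It assumes a failure can be classified as retryable or not, that a retry excludes the node implicated in the failure, and that only a single retry is attempted by default.

```python
from enum import Enum, auto


class FailureKind(Enum):
    """Hypothetical failure classification, for illustration only."""
    NODE_LEFT_CLUSTER = auto()   # node is no longer part of the cluster membership
    NODE_BLACKLISTED = auto()    # coordinator could not execute RPCs against the node
    FRONTEND_ERROR = auto()      # general error, out of scope for retries
    MEM_LIMIT_EXCEEDED = auto()  # general error, out of scope for retries


# Only failures caused by cluster membership changes are treated as retryable.
RETRYABLE = {FailureKind.NODE_LEFT_CLUSTER, FailureKind.NODE_BLACKLISTED}


class QueryFailure(Exception):
    """Hypothetical error carrying the failure kind and the implicated node."""

    def __init__(self, kind, node=None):
        super().__init__(kind.name)
        self.kind = kind
        self.node = node


def run_with_retry(execute, nodes, max_retries=1):
    """Run execute(candidate_nodes), retrying only on retryable failures.

    `execute` is any callable that takes a list of node names and either
    returns a result or raises QueryFailure. Nodes implicated in a
    retryable failure are excluded from later attempts.
    """
    excluded = set()
    attempts = 0
    while True:
        candidates = [n for n in nodes if n not in excluded]
        try:
            return execute(candidates)
        except QueryFailure as failure:
            # Non-retryable errors and exhausted retries surface to the caller.
            if failure.kind not in RETRYABLE or attempts >= max_retries:
                raise
            if failure.node is not None:
                excluded.add(failure.node)
            attempts += 1
```

In this sketch a `MEM_LIMIT_EXCEEDED` or `FRONTEND_ERROR` failure propagates immediately, while a blacklisted or departed node triggers one more attempt on the remaining nodes.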

Attachments

Impala Transparent Query Retries.pdf	05/Nov/19 17:22	138 kB	Sahil Takiar

Issue Links

is related to

IMPALA-6984 Coordinator should cancel backends when returning EOS	Reopened
IMPALA-8339 Coordinator should be more resilient to fragment instances startup failure	Resolved
IMPALA-9834 test_query_retries.TestQueryRetries is flaky on erasure coding configurations	Resolved
IMPALA-9113 Queries can hang if an impalad is killed after a query has FINISHED	Resolved
IMPALA-10585 retry_failed_queries=true should not apply to DMLs	Resolved
IMPALA-6194 Ensure all fragment instances notice cancellation	Open
IMPALA-8138 Re-introduce rpc debugging options	Resolved
IMPALA-2638 Retry queries that fail during scheduling	Resolved

relates to

IMPALA-9299 Node Blacklisting: Coordinators should blacklist unhealthy nodes	Open

requires

IMPALA-6894 Use an internal representation of query states in ClientRequestState	Resolved

Sub-Tasks

1.	Classify certain errors as retryable	Resolved	Sahil Takiar
2.	Add support for single query retries on cluster membership changes	Resolved	Sahil Takiar
3.	Avoid copying TExecRequest when retrying queries	Resolved	Sahil Takiar
4.	Client logs should indicate if a query has been retried	Resolved	Quanlong Huang
5.	Query progress bar freezes when a query is retried	Resolved	Quanlong Huang
6.	ACID-query retry integration	Resolved	Sahil Takiar
7.	Link failed and retried runtime profiles	Resolved	Sahil Takiar
8.	Retried queries that blacklist nodes should ensure they don't run on the blacklisted node	Resolved	Wenzhe Zhou
9.	Retryable queries should spool all results before returning any to the client	Resolved	Quanlong Huang
10.	TSAN data race in QueryDriver::CreateRetriedClientRequestState	Resolved	Sahil Takiar
11.	TSAN lock-order-inversion warning in QueryDriver::RetryQueryFromThread	Resolved	Sahil Takiar
12.	Hit DCHECK when retrying a query in FINISHED state	Resolved	Quanlong Huang
13.	summary and profile command in impala-shell should show both original and retried info	Resolved	Quanlong Huang
14.	Fix error reporting when AuxErrorInfoPB is present without an error	Open	Wenzhe Zhou
15.	Test coverage for query retries when there is a network partition	Open	Wenzhe Zhou
16.	Retried runtime profile should include some information about previous query attempts	Open	Unassigned
17.	Add impalad level metrics for query retries	Open	Unassigned
18.	Queries should only be retried if all fragments fail with retryable errors	Open	Unassigned
19.	Re-factor ImpalaServer, ClientRequestState, Coordinator protocol	Open	Unassigned
20.	Test that queries are not retried if they cause an impalad to crash	Open	Unassigned
21.	Web UI improvements for retried queries	Open	Unassigned
22.	Add support for multi query retries on cluster membership changes	Open	Unassigned
23.	Profile log does not include profiles of failed queries	Open	Unassigned
24.	Impala Doc: Add docs for transparent query retries	Open	shajini thayasingh
25.	Consider using num_rows_fetched instead of fetched_rows in checking whether client has fetched any results in TryQueryRetry	Open	Unassigned

People

Assignee: Sahil Takiar

Reporter: Sahil Takiar

Votes: 0

Watchers: 7

Dates

Created: 05/Nov/19 17:09

Updated: 02/May/23 16:13