Description
Currently, if the Impala Coordinator or any Executor encounters an error during query execution, Impala fails the entire query. Transparently retrying the query for certain transient, recoverable errors would improve the user experience.
This JIRA focuses on retrying queries that would otherwise fail due to cluster membership changes. Specifically, it covers two cases: node failures that change the cluster membership (currently the Coordinator cancels all queries running on a node if it detects that the node is no longer part of the cluster), and node blacklisting (the Coordinator blacklists a node because it detects a problem with that node, such as being unable to execute RPCs against it). It does not cover retrying general errors (e.g. frontend errors, MemLimitExceeded exceptions, etc.).
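The scope described above can be illustrated with a minimal sketch. This is not Impala's implementation: `FailureKind`, `QueryFailed`, `run_with_retry`, and the single-retry default are all hypothetical names and choices made for illustration. It only shows the intended split between failures caused by cluster membership changes, which are retried, and general errors, which still fail the query.

```python
from enum import Enum, auto


class FailureKind(Enum):
    # Hypothetical failure categories, named after the cases in the description.
    NODE_REMOVED = auto()        # node is no longer part of the cluster
    NODE_BLACKLISTED = auto()    # Coordinator blacklisted the node (e.g. failed RPCs)
    FRONTEND_ERROR = auto()      # general error, out of scope for retries
    MEM_LIMIT_EXCEEDED = auto()  # general error, out of scope for retries


# Only failures caused by cluster membership changes are retried.
RETRYABLE = frozenset({FailureKind.NODE_REMOVED, FailureKind.NODE_BLACKLISTED})


class QueryFailed(Exception):
    """Hypothetical exception carrying the category of a query failure."""

    def __init__(self, kind, message=""):
        super().__init__(message or kind.name)
        self.kind = kind


def run_with_retry(execute, max_retries=1):
    """Call execute(attempt); retry only on retryable failures.

    max_retries=1 is an assumption for this sketch, not a documented default.
    """
    attempt = 0
    while True:
        try:
            return execute(attempt)
        except QueryFailed as err:
            # Non-retryable errors, or exhausted retries, fail the query as before.
            if err.kind not in RETRYABLE or attempt >= max_retries:
                raise
            attempt += 1
```

In this sketch the caller sees either the result of a successful attempt or the original error, so a retry is transparent whenever it succeeds.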
Attachments
Issue Links
- is related to
  - IMPALA-6984 Coordinator should cancel backends when returning EOS (Reopened)
  - IMPALA-8339 Coordinator should be more resilient to fragment instances startup failure (Resolved)
  - IMPALA-9834 test_query_retries.TestQueryRetries is flaky on erasure coding configurations (Resolved)
  - IMPALA-9113 Queries can hang if an impalad is killed after a query has FINISHED (Resolved)
  - IMPALA-10585 retry_failed_queries=true should not apply to DMLs (Resolved)
  - IMPALA-6194 Ensure all fragment instances notice cancellation (Open)
  - IMPALA-8138 Re-introduce rpc debugging options (Resolved)
  - IMPALA-2638 Retry queries that fail during scheduling (Resolved)
- relates to
  - IMPALA-9299 Node Blacklisting: Coordinators should blacklist unhealthy nodes (Open)
- requires
  - IMPALA-6894 Use an internal representation of query states in ClientRequestState (Resolved)