Details
-
Improvement
-
Status: Resolved
-
Minor
-
Resolution: Duplicate
-
Impala 2.3.0
-
None
Description
An important building block for node-decommissioning is the ability to retry queries if they fail during scheduling for some recoverable reason (e.g. RPC failed due to unreachable host, fragment could not be started due to memory pressure).
To do this we can detect failures during Coordinator::Exec(), cancel the running query and then re-start from somewhere in QueryExecState::ExecQueryOrDmlRequest() - updating a local blacklist of nodes so that we know to avoid those that have caused failures.
There are some subtleties though:
- Queries shouldn't be retried more than a small number of times, in case they cause the outage (there might be a good way to figure that out at the time)
- If the query is restarted from the scheduling step (rather than completely restarting), some care will have to be taken to ensure that none of the old query's fragments that are being cancelled can affect the new query's operation in any way (there are several ways to do this).
Eventually the failures will propagate to the rest of the cluster via the statestore - this mechanism allows queries to recover and continue while the statestore detects the failure.
This JIRA doesn't address restarting queries that have suffered failures part-way through execution, because that's strictly harder and not (as) needed for decommissioning.
Attachments
Issue Links
- blocks
-
IMPALA-1760 Add Impala SQL command to gracefully shut down an Impala daemon
- Resolved
- is blocked by
-
IMPALA-8339 Coordinator should be more resilient to fragment instances startup failure
- Resolved
- relates to
-
IMPALA-9124 Transparently retry queries that fail due to cluster membership changes
- In Progress
-
IMPALA-3380 Add TCP timeouts to all RPCs that don't block
- Resolved