An important building block for node-decommissioning is the ability to retry queries if they fail during scheduling for some recoverable reason (e.g. RPC failed due to unreachable host, fragment could not be started due to memory pressure).
To do this we can detect failures during Coordinator::Exec(), cancel the running query and then re-start from somewhere in QueryExecState::ExecQueryOrDmlRequest() - updating a local blacklist of nodes so that we know to avoid those that have caused failures.
There are some subtleties though:
- Queries shouldn't be retried more than a small number of times, in case they cause the outage (there might be a good way to figure that out at the time)
- If the query is restarted from the scheduling step (rather than completely restarting), some care will have to be taken to ensure that none of the old query's fragments that are being cancelled can affect the new query's operation in any way (there are several ways to do this).
Eventually the failures will propagate to the rest of the cluster via the statestore - this mechanism allows queries to recover and continue while the statestore detects the failure.
This JIRA doesn't address restarting queries that have suffered failures part-way through execution, because that's strictly harder and not (as) needed for decommissioning.