[IMPALA-4555] Don't cancel query for failed ReportExecStatus (done=false) RPC - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: Impala 2.7.0
Fix Version/s: Impala 3.2.0
Component/s: Distributed Exec
Labels:
None

Target Version:

Impala 3.2.0

Description

We currently try to send the ReportExecStatus RPC up to 3 times if the first 2 times are unsuccessful - due to high network load or a network partition. If all 3 attempts fail, we cancel the fragment instance and hence the query.

However, we do not need to cancel the fragment instance if sending the report with done=false failed. We can just skip this turn and try again the next time.

We could probably skip sending the report up to 2 times (if we're unable to send due to high network load and if done=false) before succumbing to the current behavior, which is to cancel the fragment instance. The point is to try at a later time when the network load may be lower rather than try quickly again. The chance that the network load would reduce in 100 ms is less than in 5s.

Also, we probably do not need to have the retry logic unless we've already skipped twice or if done=true.

This could help reduce the network load on the coordinator for highly concurrent workloads.

The only drawback I see now is that the QueryExecSummary might be stale for a while (which it would have anyway because the RPCs would have failed to send)

P.S: This above proposed solution may need to change if we go ahead with ~~IMPALA-2990~~.

Attachments

Issue Links

is part of

IMPALA-2990 Coordinator should timeout and cancel queries with unresponsive / stuck executors

Resolved

relates to

IMPALA-4063 Make fragment instance reports per-query (or per-host) instead of per-fragment instance.

Resolved

Activity

People

Assignee:: Thomas Tauber-Marshall

Reporter:: Sailesh Mukil

Votes:: 0 Vote for this issue

Watchers:: 4 Start watching this issue

Dates

Created:: 29/Nov/16 22:25

Updated:: 13/Feb/19 03:39

Resolved:: 08/Feb/19 18:29