[IMPALA-5558] Query hang after coordinator crash because DoRpc(ReportExecStatus) fails and is not retried - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Fixed
Affects Version/s: Impala 2.9.0
Fix Version/s: Impala 2.10.0
Component/s: Distributed Exec
Labels: None
Target Version: Impala 2.10.0

Description

The following loop is meant to try the RPC up to 3 times when reporting the exec status of a fragment instance to the coordinator. However, it is not very effective because it does not check out a new client between retries: if the connection is bad, the retry will fail again. In addition, since we are reporting the query profile, it should be fine to always retry, even if the payload was partially sent to the remote client.

cc'ing henryr and sailesh

  // Try to send the RPC 3 times before failing.
  for (int i = 0; i < 3; ++i) {
    rpc_status = coord.DoRpc(
        &ImpalaBackendClient::ReportExecStatus, params, &res, &retry_is_safe);
    if (rpc_status.ok()) break;
    if (!retry_is_safe) break;
    if (i < 2) SleepForMs(RETRY_SLEEP_MS);
  }
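
For illustration, the idea in the description (check out a new client for every attempt, and keep retrying even after a partial send) could look roughly like the sketch below. This is not the actual fix. `Status`, `FakeClient`, `ClientFactory` and `ReportWithFreshClient` are hypothetical stand-ins invented for this example, not Impala's client cache API, and the sleep between attempts is omitted.

```cpp
#include <functional>
#include <memory>
#include <string>

// Hypothetical stand-ins for illustration only; not Impala's real types.
struct Status {
  bool is_ok;
  std::string msg;
  bool ok() const { return is_ok; }
};

// A connection that is either healthy or broken for its whole lifetime.
class FakeClient {
 public:
  explicit FakeClient(bool healthy) : healthy_(healthy) {}
  Status ReportExecStatus(const std::string& /*params*/) const {
    if (healthy_) return Status{true, ""};
    return Status{false, "connection reset"};
  }

 private:
  bool healthy_;
};

// Opens a new connection on every call (assumed behavior of the factory).
using ClientFactory = std::function<std::unique_ptr<FakeClient>()>;

// Tries the RPC up to max_attempts times. Each attempt checks out a fresh
// client, so one bad connection cannot doom every retry. There is no
// retry_is_safe check: per the description, resending a profile report is
// assumed to be harmless even after a partial send.
Status ReportWithFreshClient(const ClientFactory& new_client,
                             const std::string& params, int max_attempts,
                             int* attempts_made) {
  Status status{false, "no attempt made"};
  *attempts_made = 0;
  for (int i = 0; i < max_attempts; ++i) {
    std::unique_ptr<FakeClient> client = new_client();
    status = client->ReportExecStatus(params);
    *attempts_made = i + 1;
    if (status.ok()) break;
    // Real code would sleep between attempts here, as the original loop does.
  }
  return status;
}
```

With a factory whose first connection is broken and whose later ones are healthy, the second attempt succeeds, whereas reusing the first client would fail on every attempt.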

Issue Links

breaks:
IMPALA-5588 test_rpc_secure_recv_timed_out: TypeError (Resolved)
IMPALA-6792 Appears to be a memory leak in orphaned fragments (Resolved)

is broken by:
IMPALA-5388 wrong results under stress with secure cluster (Resolved)

is related to:
IMPALA-5576 Wrong Cancel() in QueryState::ReportExecStatusAux() can lead to coordinator hang (Resolved)

People

Assignee: Michael Ho
Reporter: Michael Ho
Votes: 0
Watchers: 5

Dates

Created: 22/Jun/17 07:03
Updated: 18/Apr/18 23:22
Resolved: 26/Jun/17 16:56