[IMPALA-5576] Wrong Cancel() in QueryState::ReportExecStatusAux() can lead to coordinator hang

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: Impala 2.9.0
Fix Version/s: Impala 2.10.0
Component/s: Distributed Exec
Labels:
- hang

Target Version: Impala 2.9.0

Description

Code introduced as part of IMPALA-2550 makes a hang possible if QueryState::ReportExecStatusAux() fails to get a backend client for the coordinator. In that case the new code cancels the local fragment instances and returns, so the status is never reported to the coordinator, which then waits indefinitely for their reports.

void QueryState::ReportExecStatusAux(bool done, const Status& status,
    FragmentInstanceState* fis, bool instances_started) {
  // if we're reporting an error, we're done
  DCHECK(status.ok() || done);
  // if this is not for a specific fragment instance, we're reporting an error
  DCHECK(fis != nullptr || !status.ok());
  DCHECK(fis == nullptr || fis->IsPrepared());

  // This will send a report even if we are cancelled.  If the query completed correctly
  // but fragments still need to be cancelled (e.g. limit reached), the coordinator will
  // be waiting for a final report and profile.

  Status coord_status;
  ImpalaBackendConnection coord(ExecEnv::GetInstance()->impalad_client_cache(),
      query_ctx().coord_address, &coord_status);
  if (!coord_status.ok()) {
    // TODO: this might flood the log
    LOG(WARNING) << "Couldn't get a client for " << query_ctx().coord_address
        <<"\tReason: " << coord_status.GetDetail();
    if (instances_started) Cancel();
    return;
  }
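
To make the failure mode concrete, here is a minimal, self-contained sketch of the control flow described above. It is not Impala code: `MockCoordinator`, `MockBackend`, `TryGetClient()`, `ReportBuggy()` and `ReportWithRetry()` are hypothetical names invented for illustration, and the bounded retry is only one conceivable mitigation, not necessarily the change that resolved this issue.

```cpp
#include <cassert>
#include <set>

// Hypothetical stand-in for the coordinator's bookkeeping: it tracks which
// backends still owe it a final report.
struct MockCoordinator {
  std::set<int> pending_backends;

  void ReceiveFinalReport(int backend_id) { pending_backends.erase(backend_id); }

  // With no timeout in place, any backend that never reports keeps the
  // coordinator waiting.
  bool WouldWaitForever() const { return !pending_backends.empty(); }
};

// Hypothetical stand-in for a backend's per-query state.
struct MockBackend {
  int id;
  int failures_left;  // how many client lookups fail before one succeeds
  bool cancelled;

  MockBackend(int backend_id, int failures)
      : id(backend_id), failures_left(failures), cancelled(false) {}

  // Simulates obtaining a client connection to the coordinator.
  bool TryGetClient() {
    if (failures_left > 0) {
      --failures_left;
      return false;
    }
    return true;
  }

  // Mirrors the shape of the quoted excerpt: on a client failure, cancel
  // locally and return without telling the coordinator anything.
  void ReportBuggy(MockCoordinator* coord) {
    if (!TryGetClient()) {
      cancelled = true;
      return;  // the coordinator never hears from this backend
    }
    coord->ReceiveFinalReport(id);
  }

  // Assumed mitigation for illustration only: retry the client lookup a
  // bounded number of times before giving up and cancelling locally.
  bool ReportWithRetry(MockCoordinator* coord, int max_attempts) {
    assert(max_attempts > 0);
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
      if (TryGetClient()) {
        coord->ReceiveFinalReport(id);
        return true;
      }
    }
    cancelled = true;
    return false;
  }
};
```

In this model, a backend whose first client lookup fails ends up cancelled under `ReportBuggy()` while its id stays in `pending_backends`, which is the hang reported here. Under `ReportWithRetry()` with a transient failure the report still goes out; if every attempt fails, the coordinator would still need its own timeout to recover (compare IMPALA-2990).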

Issue Links

breaks
- IMPALA-6792 Appears to be a memory leak in orphaned fragments (Resolved)

is broken by
- IMPALA-2550 Switch to per-query exec rpc (Resolved)

relates to
- IMPALA-5537 Impala does not retry RPCs that fail in SSL_read() (Resolved)
- IMPALA-5558 Query hang after coordinator crash because DoRpc(ReportExecStatus) fails and is not retried (Resolved)
- IMPALA-2990 Coordinator should timeout and cancel queries with unresponsive / stuck executors (Resolved)

People

Assignee: Unassigned
Reporter: Matthew Jacobs
Votes: 0
Watchers: 5

Dates

Created: 24/Jun/17 18:27
Updated: 18/Apr/18 23:22
Resolved: 26/Jun/17 17:07