Details
- Bug
- Status: Resolved
- Critical
- Resolution: Fixed
- Impala 2.9.0
Description
Code introduced as part of IMPALA-2550 makes a hang possible when the function that reports execution status, QueryState::ReportExecStatusAux() (excerpted below), fails to get a backend client for the coordinator. The new code cancels the local fragment instances, but their status is never reported to the coordinator, so the coordinator waits indefinitely for their reports.
void QueryState::ReportExecStatusAux(bool done, const Status& status,
    FragmentInstanceState* fis, bool instances_started) {
  // if we're reporting an error, we're done
  DCHECK(status.ok() || done);
  // if this is not for a specific fragment instance, we're reporting an error
  DCHECK(fis != nullptr || !status.ok());
  DCHECK(fis == nullptr || fis->IsPrepared());

  // This will send a report even if we are cancelled.  If the query completed correctly
  // but fragments still need to be cancelled (e.g. limit reached), the coordinator will
  // be waiting for a final report and profile.

  Status coord_status;
  ImpalaBackendConnection coord(ExecEnv::GetInstance()->impalad_client_cache(),
      query_ctx().coord_address, &coord_status);
  if (!coord_status.ok()) {
    // TODO: this might flood the log
    LOG(WARNING) << "Couldn't get a client for " << query_ctx().coord_address
        << "\tReason: " << coord_status.GetDetail();
    if (instances_started) Cancel();
    return;
  }
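The failure mode described above can be illustrated with a toy model. This is a minimal sketch, not Impala code: ToyCoordinator, ToyBackend, and CoordinatorStillWaiting are hypothetical names invented for this illustration, and the real coordinator tracks far more state. It only assumes what the description states, namely that the coordinator waits until every backend has delivered a final report.

```cpp
#include <cassert>

// Toy model only; every name below is hypothetical and not part of Impala.
struct ToyCoordinator {
  int pending_reports;  // backends that still owe a final report

  // The coordinator keeps waiting while any final report is outstanding.
  bool WaitWouldHang() const { return pending_reports > 0; }

  void ReceiveFinalReport() {
    if (pending_reports > 0) --pending_reports;
  }
};

struct ToyBackend {
  bool cancelled = false;
  bool final_report_sent = false;

  // Mirrors the shape of the excerpt: if no client to the coordinator can be
  // obtained, cancel locally and return without sending anything.
  void ReportExecStatus(bool got_client, ToyCoordinator* coord) {
    if (!got_client) {
      cancelled = true;  // local fragment instances are cancelled...
      return;            // ...but the coordinator is never told
    }
    final_report_sent = true;
    coord->ReceiveFinalReport();
  }
};

// Returns true if the coordinator is still waiting after the backend's
// single report attempt.
bool CoordinatorStillWaiting(bool got_client) {
  ToyCoordinator coord{1};  // one backend owes a final report
  ToyBackend backend;
  backend.ReportExecStatus(got_client, &coord);
  return coord.WaitWouldHang();
}
```

In this model, CoordinatorStillWaiting(false) leaves the pending count at one, which corresponds to the indefinite wait in the description. The linked issues point at the kinds of remedies one might expect, such as retrying the failed RPC or having the coordinator time out on unresponsive executors.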
Issue Links
- breaks
  - IMPALA-6792 Appears to be a memory leak in orphaned fragments (Resolved)
- is broken by
  - IMPALA-2550 Switch to per-query exec rpc (Resolved)
- relates to
  - IMPALA-5537 Impala does not retry RPCs that fail in SSL_read() (Resolved)
  - IMPALA-5558 Query hang after coordinator crash because DoRpc(ReportExecStatus) fails and is not retried (Resolved)
  - IMPALA-2990 Coordinator should timeout and cancel queries with unresponsive / stuck executors (Resolved)