[IMPALA-2990] Coordinator should timeout and cancel queries with unresponsive / stuck executors - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: Impala 2.3.0
Fix Version/s: Impala 3.3.0
Component/s: Distributed Exec
Labels:

Epic Link: Impala Scalability Improvement
Target Version: Impala 3.3.0

Description

The coordinator currently waits indefinitely if it does not hear back from a backend. As a result, a query can hang forever after a failure such as a network error.

We should add logic to detect when a backend is unresponsive and cancel the query. The logic should mostly live in Coordinator::Wait() and Coordinator::UpdateFragmentExecStatus(), and should be based on whether the coordinator receives periodic updates from a backend (via FragmentExecState::ReportStatusCb()).
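
To make the proposal concrete, here is a minimal sketch of the detection step: remember when each backend last reported, and flag the ones whose last report is older than a timeout. This is not Impala's code. The class `BackendLivenessTracker`, its methods, and the idea of passing the current time in explicitly are all hypothetical names and choices made for illustration. The timeout value is an assumption too.

```cpp
#include <cassert>
#include <chrono>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical sketch, not Impala's actual implementation: tracks when each
// backend last sent a status report and lists the ones that have gone quiet.
class BackendLivenessTracker {
 public:
  using Clock = std::chrono::steady_clock;
  using TimePoint = Clock::time_point;

  explicit BackendLivenessTracker(std::chrono::milliseconds timeout)
      : timeout_(timeout) {
    assert(timeout.count() > 0);  // A non-positive timeout would flag everything.
  }

  // Call when execution starts on a backend and on every status report from it.
  void RecordReport(const std::string& backend, TimePoint now) {
    last_report_[backend] = now;
  }

  // Call once a backend has sent its final report; it is no longer expected
  // to send periodic updates, so it must not be flagged afterwards.
  void MarkDone(const std::string& backend) { last_report_.erase(backend); }

  // Returns the backends whose most recent report is older than the timeout.
  // The caller (the coordinator, in the proposal) would cancel the query if
  // this list is non-empty.
  std::vector<std::string> FindUnresponsive(TimePoint now) const {
    std::vector<std::string> stuck;
    for (const auto& entry : last_report_) {
      if (now - entry.second > timeout_) stuck.push_back(entry.first);
    }
    return stuck;
  }

 private:
  std::chrono::milliseconds timeout_;
  std::unordered_map<std::string, TimePoint> last_report_;
};
```

In this sketch the time is passed in as an argument rather than read inside the class, which keeps the check deterministic in tests. A periodic task would call `FindUnresponsive(BackendLivenessTracker::Clock::now())` and trigger cancellation for the query if any backend is returned.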

Issue Links

blocks

IMPALA-6787 On large secure clusters the connection setup thread becomes bottleneck at warmup and cause occasional timeout failures

Resolved

IMPALA-6338 Tests fail due to runtime profile for query with limit missing pieces

Resolved

incorporates

IMPALA-4555 Don't cancel query for failed ReportExecStatus (done=false) RPC

Resolved

is depended upon by

IMPALA-5119 Don't make RPCs from Coordinator::UpdateBackendExecStatus()

Open

is duplicated by

IMPALA-414 Impala server cannot detect crash-restart failures reliably

Resolved

is related to

IMPALA-4063 Make fragment instance reports per-query (or per-host) instead of per-fragment instance.

Resolved

IMPALA-6596 Query failed with OOM on coordinator while remote fragments on other nodes continue to run

Open

IMPALA-9919 Bad Impala Performance after a period of time

Open

IMPALA-2567 KRPC milestone 1

Resolved

IMPALA-5576 Wrong Cancel() in QueryState::ReportExecStatusAux() can lead to coordinator hang

Resolved

IMPALA-5746 Remote fragments continue to hold onto memory after stopping the coordinator daemon

Resolved

IMPALA-8327 TestRPCTimeout::test_reportexecstatus_retry() times out on exhaustive build

Resolved

IMPALA-3160 Queries may not get cancelled if cancellation pool hits MAX_CANCELLATION_QUEUE_SIZE

Resolved

IMPALA-539 Impala should gather final runtime profile from fragments for aborted/cancelled query

Resolved

Parent Feature

IMPALA-3380 Add TCP timeouts to all RPCs that don't block

Resolved

relates to

IMPALA-6984 Coordinator should cancel backends when returning EOS

Reopened

requires

IMPALA-7163 Implement a state machine for the QueryState class

Resolved

Sub-Tasks

1.	Port ReportExecStatus() RPCs to KRPC	Resolved	Michael Ho
2.	Deprecate --use_krpc flag	Resolved	Michael Ho
3.	Implement a state machine for the QueryState class	Resolved	Sailesh Mukil
4.	Make fragment instance reports per-query (or per-host) instead of per-fragment instance.	Resolved	Michael Ho
5.	Use sidecars for Thrift-wrapped RPC payloads	Resolved	Unassigned
6.	Don't cancel query for failed ReportExecStatus (done=false) RPC	Resolved	Thomas Tauber-Marshall

People

Assignee: Thomas Tauber-Marshall

Reporter: Sailesh Mukil

Votes: 2

Watchers: 20

Dates

Created: 13/Feb/16 00:38

Updated: 06/Jul/20 15:44

Resolved: 30/Apr/19 17:00