[IMPALA-6025] Improve hang diagnostics - ASF JIRA

Details

Type: Epic
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: Impala 2.9.0
Fix Version/s: None
Component/s: Distributed Exec
Labels:
- observability
- supportability

Epic Name: Improve Hang Diagnostics
Target Version: Product Backlog

Description

In the past, Impala users have had a hard time getting diagnostic information when a query hangs. Usually, this involves a rather manual process: determining which fragment instances aren't making progress, generating a stack trace or core dump from the affected impalad, and examining it under a debugger. Given the thousands of threads running when multiple queries are active, this is quite time-consuming.

This JIRA tracks improvement ideas that we can implement to make this kind of issue easier to debug. Some ideas include:

Implement a diagnostic button (analogous to the cancellation button in the UI) that dumps diagnostic information (e.g. threads' backtraces, executor nodes' internals, states of data stream senders and receivers, and lock information such as the holder's pid) for fragment instances on some or all hosts of a query.

Have a watchdog that dumps backtraces of threads that haven't made progress for a while. This probably doesn't apply to all threads (e.g. idle threads shouldn't trigger any alert).

A fragment instance can appear not to be making progress because its parent operator / fragment is hung (e.g. the probe side of a join cannot make much progress until the build side is done, and the build side itself could be another chain of joins). It would be much easier to resolve this dependency chain programmatically to find the root of the cascade of delays.
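The dependency-chain idea above can be illustrated with a small graph walk. This is only a minimal Python sketch, not Impala code: the function name `find_stall_roots`, the `waits_on` mapping and the instance IDs are all hypothetical, and it assumes that the "who is waiting on whom" relation between fragment instances has already been collected somehow.

```python
from typing import Dict, List, Set


def find_stall_roots(waits_on: Dict[str, List[str]],
                     stalled: Set[str],
                     start: str) -> List[str]:
    """Follow "waits on" edges from `start` through stalled instances only.

    Returns the instances reached that are not waiting on any other stalled
    instance, i.e. the candidates for the root of the delay. An empty result
    means every instance reached is waiting on another stalled instance,
    which suggests a cycle.
    """
    roots: List[str] = []
    seen: Set[str] = set()
    stack = [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue  # already visited; also protects against cycles
        seen.add(node)
        blocked_on = [dep for dep in waits_on.get(node, []) if dep in stalled]
        if blocked_on:
            stack.extend(blocked_on)  # keep walking down the chain
        else:
            roots.append(node)  # nothing stalled below this instance
    return sorted(roots)


# Hypothetical example: a probe waits on a build, which is itself a join
# waiting on another build, which waits on a stalled scan.
waits_on = {
    "probe_F0": ["build_F1", "scan_F4"],
    "build_F1": ["build_F2"],
    "build_F2": ["scan_F3"],
}
stalled = {"probe_F0", "build_F1", "build_F2", "scan_F3"}
roots = find_stall_roots(waits_on, stalled, "probe_F0")
```

In this example `scan_F4` is making progress and is ignored, so the walk ends at `scan_F3`, which is the instance worth inspecting first.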
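The watchdog idea above could look roughly like the following. This is a hedged Python sketch of the general technique only (Impala's backend is C++ and nothing here reflects its actual code): the class `ProgressWatchdog` and its methods are hypothetical names, and it assumes worker threads report progress explicitly and mark themselves idle when they have no work, so that idle threads never trigger a dump.

```python
import sys
import threading
import time
import traceback
from typing import Dict, Optional


class ProgressWatchdog:
    """Tracks per-thread progress timestamps and dumps stacks of stalled threads."""

    def __init__(self, stall_seconds: float) -> None:
        self._stall_seconds = stall_seconds
        self._lock = threading.Lock()
        # Thread id -> time of last progress, or None while the thread is idle.
        self._last_progress: Dict[int, Optional[float]] = {}

    def mark_progress(self, now: Optional[float] = None) -> None:
        """Called by a worker thread whenever it makes progress."""
        ts = time.monotonic() if now is None else now
        with self._lock:
            self._last_progress[threading.get_ident()] = ts

    def mark_idle(self) -> None:
        """Called by a worker thread when it has no work; idle threads are skipped."""
        with self._lock:
            self._last_progress[threading.get_ident()] = None

    def dump_stalled(self, now: Optional[float] = None) -> Dict[int, str]:
        """Return a backtrace for each non-idle thread stalled past the threshold."""
        ts = time.monotonic() if now is None else now
        with self._lock:
            snapshot = dict(self._last_progress)
        frames = sys._current_frames()  # thread id -> current stack frame
        dumps: Dict[int, str] = {}
        for tid, last in snapshot.items():
            if last is None or ts - last < self._stall_seconds:
                continue  # idle, or made progress recently
            frame = frames.get(tid)
            if frame is not None:  # the thread may have exited already
                dumps[tid] = "".join(traceback.format_stack(frame))
        return dumps
```

A monitoring thread would call `dump_stalled()` periodically and log whatever it returns. The optional `now` parameter exists only so the behaviour can be exercised without waiting in real time.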

Please feel free to add more ideas to this JIRA.

Issue Links

is a child of

IMPALA-6698 Supportability roadmap (Open)

is related to

IMPALA-5865 Improve Impala execution scalability (Open)

IMPALA-2567 KRPC milestone 1 (Resolved)

People

Assignee: Lars Volker

Reporter: Michael Ho

Votes: 0

Watchers: 10

Dates

Created: 06/Oct/17 22:25

Updated: 21/Dec/20 19:14