In the past, Impalad users have had a hard time getting diagnostic information when a query hangs. Usually, this involves a rather manual process: determining which fragment instances aren't making progress, generating a stack trace or core dump from that Impalad, and inspecting it under a debugger. Given the thousands of threads running when multiple queries are active, this is quite time-consuming.
This JIRA tracks improvement ideas we can implement to make this kind of issue easier to debug. Some ideas include:
- Implement a diagnostics button (analogous to the cancellation button in the UI) to dump diagnostic information for fragment instances on some or all hosts of a query, e.g. threads' backtraces, executor nodes' internals, the states of data stream senders and receivers, and lock information (e.g. the holder's pid).
- Have a watchdog dump backtraces of threads which haven't made progress for a while. This probably shouldn't apply to all threads (e.g. idle threads shouldn't trigger any alert).
- A fragment instance can appear to be making no progress because its parent operator / fragment is hung (e.g. the probe side of a join cannot make much progress until the build side is done, and the build side itself could be another chain of joins). It'd be much easier to resolve this dependency chain programmatically to find the root of the cascade of delays.
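
To illustrate the last idea, here is a minimal sketch of walking a "waits on" graph to find the instances at the root of a stall. It is written in Python for brevity and is not Impala code: the function name `find_root_blockers`, the `waits_on` map, the `stalled` set, and the fragment IDs are all hypothetical, and it assumes the dependency edges and per-instance progress state are already available from somewhere.

```python
from typing import Dict, List, Set


def find_root_blockers(
    waits_on: Dict[str, List[str]], stalled: Set[str], start: str
) -> List[str]:
    """Follow waits-on edges from `start` through stalled instances.

    Returns the reachable instances that are not themselves waiting on
    another stalled instance, i.e. the likely roots of the delay.
    All names here are hypothetical, for illustration only.
    """
    roots: List[str] = []
    seen: Set[str] = set()
    stack = [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue  # already visited; also guards against cycles
        seen.add(node)
        # Only dependencies that are themselves stalled can explain the hang.
        blocked_by = [d for d in waits_on.get(node, []) if d in stalled]
        if blocked_by:
            stack.extend(blocked_by)
        else:
            roots.append(node)
    return sorted(roots)


# Hypothetical plan: F0 waits on join F1; F1 waits on its build side F2 and
# probe-side scan F3; the build side F2 waits on scan F4.
waits_on = {"F0": ["F1"], "F1": ["F2", "F3"], "F2": ["F4"]}
stalled = {"F0", "F1", "F2", "F4"}  # F3 is still making progress
print(find_root_blockers(waits_on, stalled, "F0"))  # prints ['F4']
```

In this example the stall seen at F0 is traced through the join and its build side down to F4, which is the only instance that needs a backtrace. A cycle of mutually waiting instances would yield an empty list, which could itself be reported as a possible deadlock.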
Please feel free to add more ideas to this JIRA.
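
As a starting point for discussion, the watchdog idea could look roughly like the sketch below. It is Python for brevity, not Impala code, and the class `ProgressWatchdog` and its methods are hypothetical. It assumes worker threads report progress explicitly and mark themselves idle so that idle threads never trigger a dump. `sys._current_frames()` is a CPython facility; a real implementation would need its own way to capture native backtraces.

```python
import sys
import threading
import time
import traceback
from typing import Dict, Optional


class ProgressWatchdog:
    """Tracks the last progress time per thread (hypothetical sketch)."""

    def __init__(self, stall_seconds: float) -> None:
        self._stall_seconds = stall_seconds
        self._last_progress: Dict[int, float] = {}
        self._lock = threading.Lock()

    def report_progress(self, now: Optional[float] = None) -> None:
        """Called by a worker thread each time it makes progress."""
        ts = time.monotonic() if now is None else now
        with self._lock:
            self._last_progress[threading.get_ident()] = ts

    def mark_idle(self) -> None:
        """Called by a thread that has no work, so it is never flagged."""
        with self._lock:
            self._last_progress.pop(threading.get_ident(), None)

    def stalled_backtraces(self, now: Optional[float] = None) -> Dict[int, str]:
        """Returns backtraces of tracked threads with no recent progress."""
        ts = time.monotonic() if now is None else now
        with self._lock:
            stalled = [
                tid
                for tid, last in self._last_progress.items()
                if ts - last >= self._stall_seconds
            ]
        frames = sys._current_frames()  # thread id -> current stack frame
        return {
            tid: "".join(traceback.format_stack(frames[tid]))
            for tid in stalled
            if tid in frames
        }
```

A periodic timer thread would call `stalled_backtraces()` and log the result. The `now` parameter exists only so the stall threshold can be exercised without sleeping.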