[IMPALA-5473] Make diagnosing network issues easier - ASF JIRA

Details

Type: Task
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: Impala 2.10.0
Fix Version/s: None
Component/s: Distributed Exec
Labels:
- observability
- supportability

Epic Link:
Network Debugging / Supportability Improvements

Description

With our current metrics in the profile, it's hard to debug queries that get slow throughput from their exchanges.

The following cases have different causes, but similar symptoms (e.g. a high InactiveTimer in the xchg profile):

1. Downstream sender does not produce rows quickly (perhaps because its child instances do not produce rows quickly).

2. Downstream sender cannot send rows quickly, perhaps because of network congestion.

3. Downstream sender does not start producing rows until some time after the upstream has started (captured by FirstBatchArrivalWaitTime).

4. Downstream sender does not close stream until some time after all rows are sent.

We should try to improve these metrics so that all the information about who is slow, and why, is available clearly in the runtime profile. Distinguishing cases 1 and 2 is particularly important.
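As a rough illustration of what such metrics could enable, the sketch below classifies a sender into one of the four cases above. It assumes each sender reported four separate timers; the names `startup_delay`, `produce_time`, `send_time` and `close_delay` are hypothetical and are not existing Impala profile counters.

```python
from dataclasses import dataclass


@dataclass
class SenderTimings:
    """Hypothetical per-sender timers, in seconds (not real Impala counters)."""
    startup_delay: float  # time before the sender began producing rows (case 3)
    produce_time: float   # time spent waiting on child instances for rows (case 1)
    send_time: float      # time spent blocked sending rows over the network (case 2)
    close_delay: float    # time between the last row sent and stream close (case 4)


def classify(t: SenderTimings) -> str:
    """Return the case whose timer contributes most to the receiver's idle time."""
    contributions = {
        "1: slow row production": t.produce_time,
        "2: slow network send": t.send_time,
        "3: late first batch": t.startup_delay,
        "4: late stream close": t.close_delay,
    }
    # Pick the label with the largest contribution; ties resolve to the
    # earliest entry in the dictionary.
    return max(contributions, key=contributions.get)


if __name__ == "__main__":
    congested = SenderTimings(startup_delay=0.1, produce_time=0.2,
                              send_time=4.0, close_delay=0.0)
    print(classify(congested))
```

With timers split this way, cases 1 and 2 stop looking identical from the receiver's side: the same high InactiveTimer would be attributed either to `produce_time` or to `send_time`.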

Issue Links

depends upon

IMPALA-2567 KRPC milestone 1 (Resolved)

relates to

IMPALA-6692 When partition exchange is followed by sort each sort node becomes a synchronization point across the cluster (Reopened)

IMPALA-6685 Improve profile in KrpcDataStreamRecvr and KrpcDataStreamSender (Resolved)

People

Assignee: Unassigned

Reporter: Henry Robinson

Votes: 0

Watchers: 6

Dates

Created: 08/Jun/17 22:11

Updated: 21/Dec/20 19:11