[IMPALA-9128] Improve debugging for slow sends in KrpcDataStreamSender - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: Impala 3.4.0
Component/s: Distributed Exec
Labels:
- observability

Epic Color:
ghx-label-13

Description

I'm trying to debug a problem that appears to be caused by a slow RPC:

    Fragment F00
      Instance 754fc21ba4744310:d58fd04200000020 (host=xxxxxxxxx)
        Hdfs split stats (<volume id>:<# splits>/<split lengths>): 0:1/120.48 MB 
        - AverageThreadTokens: 1.00 (1.0)
        - BloomFilterBytes: 0 B (0)
        - InactiveTotalTime: 0ns (0)
        - PeakMemoryUsage: 3.2 MiB (3337546)
        - PeakReservation: 2.0 MiB (2097152)
        - PeakUsedReservation: 0 B (0)
        - PerHostPeakMemUsage: 6.7 MiB (6987376)
        - RowsProduced: 7 (7)
        - TotalNetworkReceiveTime: 0ns (0)
        - TotalNetworkSendTime: 3.6m (215354065071)
        - TotalStorageWaitTime: 4ms (4552708)
        - TotalThreadsInvoluntaryContextSwitches: 2 (2)
        - TotalThreadsTotalWallClockTime: 3.6m (215924079474)
          - TotalThreadsSysTime: 24ms (24386000)
          - TotalThreadsUserTime: 505ms (505714000)
        - TotalThreadsVoluntaryContextSwitches: 3,623 (3623)
        - TotalTime: 3.6m (215801961705)
        Fragment Instance Lifecycle Event Timeline
          Prepare Finished: 1ms (1812344)
          Open Finished: 322ms (322905753)
          First Batch Produced: 447ms (447050377)
          First Batch Sent: 447ms (447054546)
          ExecInternal Finished: 3.6m (215802284852)
        Buffer pool
          - AllocTime: 0ns (0)
          - CumulativeAllocationBytes: 0 B (0)
          - CumulativeAllocations: 0 (0)
          - InactiveTotalTime: 0ns (0)
          - PeakReservation: 0 B (0)
          - PeakUnpinnedBytes: 0 B (0)
          - PeakUsedReservation: 0 B (0)
          - ReadIoBytes: 0 B (0)
          - ReadIoOps: 0 (0)
          - ReadIoWaitTime: 0ns (0)
          - ReservationLimit: 0 B (0)
          - TotalTime: 0ns (0)
          - WriteIoBytes: 0 B (0)
          - WriteIoOps: 0 (0)
          - WriteIoWaitTime: 0ns (0)
        Fragment Instance Lifecycle Timings
          - ExecTime: 3.6m (215479380267)
            - ExecTreeExecTime: 124ms (124299400)
          - InactiveTotalTime: 0ns (0)
          - OpenTime: 321ms (321088906)
            - ExecTreeOpenTime: 572.04us (572045)
          - PrepareTime: 1ms (1426412)
            - ExecTreePrepareTime: 233.32us (233318)
          - TotalTime: 0ns (0)
        KrpcDataStreamSender (dst_id=3)
          - EosSent: 58 (58)
          - InactiveTotalTime: 3.6m (215354085858)
          - PeakMemoryUsage: 464.4 KiB (475504)
          - RowsSent: 7 (7)
          - RpcFailure: 0 (0)
          - RpcRetry: 0 (0)
          - SerializeBatchTime: 99.87us (99867)
          - TotalBytesSent: 207 B (207)
          - TotalTime: 3.6m (215355336381)
          - UncompressedRowBatchSize: 267 B (267)

We should add some diagnostics that will allow us to figure out which RPCs are slow and whether there's a pattern about which host is the problem. E.g. maybe we should log if the RPC time exceeds a configured threshold.

It may also be useful to include some stats about the wait time, e.g. a histogram of the wait times, so that we can see if it's an outlier or general slowness.

Attachments

Issue Links

is related to

IMPALA-10137 Network Debugging / Supportability Improvements

Open

KUDU-2996 Log trace when --rpc_duration_too_long_ms is exceeded

Resolved

relates to

IMPALA-10241 Impala Doc: RPC troubleshooting guide

Open

IMPALA-10139 Slow RPC logs can be misleading

Open

Activity

People

Assignee:: Tim Armstrong

Reporter:: Tim Armstrong

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 05/Nov/19 20:09

Updated:: 02/Mar/22 02:01

Resolved:: 14/Nov/19 21:14