Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
None
-
ghx-label-13
Description
I'm trying to debug a problem that appears to be caused by a slow RPC:
Fragment F00 Instance 754fc21ba4744310:d58fd04200000020 (host=xxxxxxxxx) Hdfs split stats (<volume id>:<# splits>/<split lengths>): 0:1/120.48 MB - AverageThreadTokens: 1.00 (1.0) - BloomFilterBytes: 0 B (0) - InactiveTotalTime: 0ns (0) - PeakMemoryUsage: 3.2 MiB (3337546) - PeakReservation: 2.0 MiB (2097152) - PeakUsedReservation: 0 B (0) - PerHostPeakMemUsage: 6.7 MiB (6987376) - RowsProduced: 7 (7) - TotalNetworkReceiveTime: 0ns (0) - TotalNetworkSendTime: 3.6m (215354065071) - TotalStorageWaitTime: 4ms (4552708) - TotalThreadsInvoluntaryContextSwitches: 2 (2) - TotalThreadsTotalWallClockTime: 3.6m (215924079474) - TotalThreadsSysTime: 24ms (24386000) - TotalThreadsUserTime: 505ms (505714000) - TotalThreadsVoluntaryContextSwitches: 3,623 (3623) - TotalTime: 3.6m (215801961705) Fragment Instance Lifecycle Event Timeline Prepare Finished: 1ms (1812344) Open Finished: 322ms (322905753) First Batch Produced: 447ms (447050377) First Batch Sent: 447ms (447054546) ExecInternal Finished: 3.6m (215802284852) Buffer pool - AllocTime: 0ns (0) - CumulativeAllocationBytes: 0 B (0) - CumulativeAllocations: 0 (0) - InactiveTotalTime: 0ns (0) - PeakReservation: 0 B (0) - PeakUnpinnedBytes: 0 B (0) - PeakUsedReservation: 0 B (0) - ReadIoBytes: 0 B (0) - ReadIoOps: 0 (0) - ReadIoWaitTime: 0ns (0) - ReservationLimit: 0 B (0) - TotalTime: 0ns (0) - WriteIoBytes: 0 B (0) - WriteIoOps: 0 (0) - WriteIoWaitTime: 0ns (0) Fragment Instance Lifecycle Timings - ExecTime: 3.6m (215479380267) - ExecTreeExecTime: 124ms (124299400) - InactiveTotalTime: 0ns (0) - OpenTime: 321ms (321088906) - ExecTreeOpenTime: 572.04us (572045) - PrepareTime: 1ms (1426412) - ExecTreePrepareTime: 233.32us (233318) - TotalTime: 0ns (0) KrpcDataStreamSender (dst_id=3) - EosSent: 58 (58) - InactiveTotalTime: 3.6m (215354085858) - PeakMemoryUsage: 464.4 KiB (475504) - RowsSent: 7 (7) - RpcFailure: 0 (0) - RpcRetry: 0 (0) - SerializeBatchTime: 99.87us (99867) - TotalBytesSent: 207 B (207) - TotalTime: 3.6m (215355336381) - UncompressedRowBatchSize: 267 B (267)
We should add some diagnostics that will allow us to figure out which RPCs are slow and whether there's a pattern about which host is the problem. E.g. maybe we should log if the RPC time exceeds a configured threshold.
It may also be useful to include some stats about the wait time, e.g. a histogram of the wait times, so that we can see if it's an outlier or general slowness.
Attachments
Issue Links
- is related to
-
IMPALA-10137 Network Debugging / Supportability Improvements
- Open
-
KUDU-2996 Log trace when --rpc_duration_too_long_ms is exceeded
- Resolved
- relates to
-
IMPALA-10241 Impala Doc: RPC troubleshooting guide
- Open
-
IMPALA-10139 Slow RPC logs can be misleading
- Open