  IMPALA-6294

Concurrent queries with lots of spilling hang or make slow progress due to blocking in DataStreamRecvr and DataStreamSender


Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: Impala 2.11.0
    • Fix Version/s: None
    • Component/s: Backend
    • Labels: None

    Description

      While running a highly concurrent spilling workload on a large cluster, queries start running slower; even lightweight queries that are not part of the spilling workload are affected by this slowdown.
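      The profile counters and stack traces below point at time spent blocked between DataStreamSender and DataStreamRecvr rather than time spent doing useful work. As a rough illustration of how one slow, spilling consumer can stall unrelated queries, the toy model below (not Impala code; BoundedQueue, AddBatch and kNumServiceThreads are invented for the sketch) has a fixed pool of "service" threads pushing row batches into a bounded receiver queue whose consumer drains slowly. Once every service thread is parked in AddBatch, any other query whose RPCs need one of those threads has to wait as well.

        // Toy model (not Impala's DataStreamRecvr/DataStreamSender) of the blocking
        // pattern described in the summary: a bounded receiver queue with a slow
        // consumer, and a fixed pool of service threads that block while adding
        // batches to it.
        #include <chrono>
        #include <condition_variable>
        #include <deque>
        #include <iostream>
        #include <mutex>
        #include <thread>
        #include <vector>

        struct BoundedQueue {
          std::mutex mu;
          std::condition_variable not_full, not_empty;
          std::deque<int> batches;
          const size_t capacity = 4;

          // Blocks the calling (service) thread until there is room, like a
          // receiver whose batch queue is over its limit.
          void AddBatch(int batch) {
            std::unique_lock<std::mutex> l(mu);
            not_full.wait(l, [&] { return batches.size() < capacity; });
            batches.push_back(batch);
            not_empty.notify_one();
          }

          int GetBatch() {
            std::unique_lock<std::mutex> l(mu);
            not_empty.wait(l, [&] { return !batches.empty(); });
            int b = batches.front();
            batches.pop_front();
            not_full.notify_one();
            return b;
          }
        };

        int main() {
          BoundedQueue recvr_queue;
          constexpr int kNumServiceThreads = 4;

          // Slow consumer: stands in for a fragment instance that drains its
          // exchange slowly because it spends most of its time spilling.
          std::thread consumer([&] {
            for (int i = 0; i < 32; ++i) {
              recvr_queue.GetBatch();
              std::this_thread::sleep_for(std::chrono::milliseconds(50));
            }
          });

          // All service threads end up handling transmits for the slow query and
          // block in AddBatch once the queue is full; a "lightweight" query's RPC
          // arriving now would find no free service thread.
          auto start = std::chrono::steady_clock::now();
          std::vector<std::thread> service_threads;
          for (int t = 0; t < kNumServiceThreads; ++t) {
            service_threads.emplace_back([&] {
              for (int i = 0; i < 8; ++i) recvr_queue.AddBatch(i);
            });
          }

          for (auto& t : service_threads) t.join();
          consumer.join();
          auto secs = std::chrono::duration<double>(
              std::chrono::steady_clock::now() - start).count();
          std::cout << "senders stayed blocked behind the slow consumer for ~"
                    << secs << "s\n";
          return 0;
        }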

                EXCHANGE_NODE (id=9):(Total: 3m1s, non-child: 3m1s, % non-child: 100.00%)
                   - ConvertRowBatchTime: 999.990us
                   - PeakMemoryUsage: 0
                   - RowsReturned: 108.00K (108001)
                   - RowsReturnedRate: 593.00 /sec
                  DataStreamReceiver:
                    BytesReceived(4s000ms): 254.47 KB, 338.82 KB, 338.82 KB, 852.43 KB, 1.32 MB, 1.33 MB, 1.50 MB, 2.53 MB, 2.99 MB, 3.00 MB, 3.00 MB, 3.00 MB, 3.00 MB, 3.00 MB, 3.00 MB, 3.00 MB, 3.00 MB, 3.00 MB, 3.16 MB, 3.49 MB, 3.80 MB, 4.15 MB, 4.55 MB, 4.84 MB, 4.99 MB, 5.07 MB, 5.41 MB, 5.75 MB, 5.92 MB, 6.00 MB, 6.00 MB, 6.00 MB, 6.07 MB, 6.28 MB, 6.33 MB, 6.43 MB, 6.67 MB, 6.91 MB, 7.29 MB, 8.03 MB, 9.12 MB, 9.68 MB, 9.90 MB, 9.97 MB, 10.44 MB, 11.25 MB
                     - BytesReceived: 11.73 MB (12301692)
                     - DeserializeRowBatchTimer: 957.990ms
                     - FirstBatchArrivalWaitTime: 0.000ns
                     - PeakMemoryUsage: 644.44 KB (659904)
                     - SendersBlockedTimer: 0.000ns
                     - SendersBlockedTotalTimer(*): 0.000ns
      
              DataStreamSender (dst_id=9):(Total: 1s819ms, non-child: 1s819ms, % non-child: 100.00%)
                 - BytesSent: 234.64 MB (246033840)
                 - NetworkThroughput(*): 139.58 MB/sec
                 - OverallThroughput: 128.92 MB/sec
                 - PeakMemoryUsage: 33.12 KB (33920)
                 - RowsReturned: 108.00K (108001)
                 - SerializeBatchTime: 133.998ms
                 - TransmitDataRPCTime: 1s680ms
                 - UncompressedRowBatchSize: 446.42 MB (468102200)
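
      Reading the two excerpts together: the receiver saw 11.73 MB arrive over roughly the 3m1s the EXCHANGE_NODE was active, i.e. on the order of tens of KB/s, while the sender's NetworkThroughput(*) of ~140 MB/s (roughly BytesSent / TransmitDataRPCTime) shows the network is fast whenever an RPC is actually in flight. The snippet below just reproduces that arithmetic from the counters above; it assumes the 3m1s total is the wall-clock window for the receiver's BytesReceived, and the sender and receiver excerpts need not come from the same pair of fragments.

        // Back-of-the-envelope check on the counters in the profile excerpts above.
        #include <cstdio>

        int main() {
          const double bytes_received = 12301692.0;  // DataStreamReceiver BytesReceived
          const double wall_secs = 3 * 60 + 1;       // EXCHANGE_NODE total: 3m1s
          const double bytes_sent = 246033840.0;     // DataStreamSender BytesSent
          const double rpc_secs = 1.680;             // TransmitDataRPCTime: 1s680ms

          // ~66 KB/s trickling into the exchange over the whole window...
          printf("effective receive rate: %.1f KB/s\n",
                 bytes_received / wall_secs / 1024);
          // ...versus ~140 MB/s while an RPC is actually transmitting, so almost
          // all of the 3m1s is spent waiting/blocked rather than moving data.
          printf("throughput while an RPC is in flight: %.1f MB/s\n",
                 bytes_sent / rpc_secs / (1024 * 1024));
          return 0;
        }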
      

      The timeouts seen in IMPALA-6285 are caused by this issue:

      I1206 12:44:14.925405 25274 status.cc:58] RPC recv timed out: Client foo-17.domain.com:22000 timed-out during recv call.
          @           0x957a6a  impala::Status::Status()
          @          0x11dd5fe  impala::DataStreamSender::Channel::DoTransmitDataRpc()
          @          0x11ddcd4  impala::DataStreamSender::Channel::TransmitDataHelper()
          @          0x11de080  impala::DataStreamSender::Channel::TransmitData()
          @          0x11e1004  impala::ThreadPool<>::WorkerThread()
          @           0xd10063  impala::Thread::SuperviseThread()
          @           0xd107a4  boost::detail::thread_data<>::run()
          @          0x128997a  (unknown)
          @     0x7f68c5bc7e25  start_thread
          @     0x7f68c58f534d  __clone
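
      The frames above show a DataStreamSender worker thread giving up inside DoTransmitDataRpc when its recv timeout expires. A minimal stand-in for that failure mode (not Impala's RPC stack; the promise/future pair and the 1s/3s durations are invented for the sketch) is a caller that waits on a synchronous response with a timeout while the peer is too stalled to answer in time:

        // Toy illustration of an RPC recv timeout: the "receiver" side is stuck
        // (e.g. behind a full batch queue) and only responds long after the
        // sender's timeout has fired.
        #include <chrono>
        #include <future>
        #include <iostream>
        #include <thread>

        int main() {
          std::promise<void> response;
          std::future<void> response_ready = response.get_future();

          // Stalled peer: produces the response far too late.
          std::thread receiver([&] {
            std::this_thread::sleep_for(std::chrono::seconds(3));
            response.set_value();
          });

          // Sender-side wait with a recv timeout, analogous to DoTransmitDataRpc.
          if (response_ready.wait_for(std::chrono::seconds(1)) ==
              std::future_status::timeout) {
            std::cerr << "RPC recv timed out: peer did not respond in time\n";
          }
          receiver.join();
          return 0;
        }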
      

      Similar behavior was also observed with KRPC enabled (IMPALA-6048).

      Attachments

        1. IMPALA-6285 TPCDS Q3 slow broadcast
          412 kB
          Mostafa Mokhtar
        2. slow_broadcast_q3_reciever.txt
          19.99 MB
          Mostafa Mokhtar
        3. slow_broadcast_q3_sender.txt
          22.60 MB
          Mostafa Mokhtar


            People

              Assignee: Michael Ho (kwho)
              Reporter: Mostafa Mokhtar (mmokhtar)
              Votes: 0
              Watchers: 6
