HIVE-17908

LLAP External client not correctly handling killTask for pending requests


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.0.0
    • Component/s: llap
    • Labels: None

    Description

      Hitting "Timed out waiting for heartbeat for task ID" errors with the LLAP external client.
      HIVE-17393 fixed some of these errors, however it is also occurring because the client is not correctly handling the killTask notification when the request is accepted but still waiting for the first task heartbeat. In this situation the client should retry the request, similar to what the LLAP AM does. Current logic is ignoring the killTask in this situation, which results in a heartbeat timeout - no heartbeats are sent by LLAP because of the killTask notification.

      17/08/09 05:36:02 WARN TaskSetManager: Lost task 10.0 in stage 4.0 (TID 14, cn114-10.l42scl.hortonworks.com, executor 5): java.io.IOException: Received reader event error: Timed out waiting for heartbeat for task ID attempt_7739111832518812959_0005_0_00_000010_0
              at org.apache.hadoop.hive.llap.LlapBaseRecordReader.next(LlapBaseRecordReader.java:178)
              at org.apache.hadoop.hive.llap.LlapBaseRecordReader.next(LlapBaseRecordReader.java:50)
              at org.apache.hadoop.hive.llap.LlapRowRecordReader.next(LlapRowRecordReader.java:121)
              at org.apache.hadoop.hive.llap.LlapRowRecordReader.next(LlapRowRecordReader.java:68)
              at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:266)
              at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:211)
              at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
              at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
              at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
              at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
              at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source)
              at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
              at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
              at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
              at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
              at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
              at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
              at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
              at org.apache.spark.scheduler.Task.run(Task.scala:99)
              at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
              at java.lang.Thread.run(Thread.java:745)
      Caused by: java.io.IOException: LlapTaskUmbilicalExternalClient(attempt_7739111832518812959_0005_0_00_000010_0): Error while attempting to read chunk length
              at org.apache.hadoop.hive.llap.io.ChunkedInputStream.read(ChunkedInputStream.java:82)
              at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
              at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
              at java.io.FilterInputStream.read(FilterInputStream.java:83)
              at org.apache.hadoop.hive.llap.LlapBaseRecordReader.hasInput(LlapBaseRecordReader.java:267)
              at org.apache.hadoop.hive.llap.LlapBaseRecordReader.next(LlapBaseRecordReader.java:142)
              ... 22 more
      Caused by: java.net.SocketException: Socket closed
              at java.net.SocketInputStream.socketRead0(Native Method)
              at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
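
      As illustration only, here is a minimal sketch (Java, with hypothetical class, state, and method names rather than the actual LlapTaskUmbilicalExternalClient internals) of the intended handling: a killTask that arrives while a request is still pending, i.e. accepted but before the first heartbeat, should trigger a resubmission instead of being dropped.

      // Hypothetical sketch of the retry-on-killTask behavior described above.
      // None of these names come from the real client; they only model the state machine.
      import java.util.Map;
      import java.util.concurrent.ConcurrentHashMap;

      public class KillTaskRetrySketch {

        // Hypothetical lifecycle states for a submitted fragment request.
        enum RequestState { PENDING, HEARTBEATING, DONE }

        // Hypothetical per-request bookkeeping.
        static final class PendingRequest {
          final String taskAttemptId;
          volatile RequestState state = RequestState.PENDING;
          PendingRequest(String taskAttemptId) { this.taskAttemptId = taskAttemptId; }
        }

        private final Map<String, PendingRequest> requests = new ConcurrentHashMap<>();

        // Called when the first heartbeat for a task attempt arrives.
        void onFirstHeartbeat(String taskAttemptId) {
          PendingRequest req = requests.get(taskAttemptId);
          if (req != null) {
            req.state = RequestState.HEARTBEATING;
          }
        }

        // Called when LLAP sends a killTask for a task attempt.
        // Previously a kill arriving in the PENDING state was ignored, so no heartbeats
        // ever arrived and the reader eventually hit the heartbeat timeout. The intended
        // behavior is to resubmit the request in that case, mirroring the LLAP AM.
        void onKillTask(String taskAttemptId) {
          PendingRequest req = requests.get(taskAttemptId);
          if (req == null) {
            return; // unknown or already-completed task attempt
          }
          if (req.state == RequestState.PENDING) {
            resubmitRequest(req); // accepted but no heartbeat yet: retry instead of dropping
          } else {
            requests.remove(taskAttemptId); // task was already running; treat the kill as terminal
          }
        }

        // Stub for re-sending the fragment request to the LLAP daemon.
        private void resubmitRequest(PendingRequest req) {
          System.out.println("Resubmitting " + req.taskAttemptId);
        }
      }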
      

      Attachments

        1. HIVE-17908.1.patch
          7 kB
          Jason Dere
        2. HIVE-17908.2.patch
          7 kB
          Jason Dere
        3. HIVE-17908.3.patch
          7 kB
          Jason Dere
        4. HIVE-17908.4.patch
          7 kB
          Jason Dere
        5. HIVE-17908.5.patch
          7 kB
          Jason Dere
        6. HIVE-17908.6.patch
          6 kB
          Jason Dere


          People

            Assignee: Jason Dere
            Reporter: Jason Dere
            Votes: 0
            Watchers: 2
