  1. Hadoop HDFS
  2. HDFS-15219

DFS client hangs when ResponseProcessor.run throws an Error

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.7.3
    • Fix Version/s: 3.3.0, 3.1.4, 3.2.2
    • Component/s: hdfs-client
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      In my case, a Tez application was stuck for more than 2 hours until we killed it. The cause was a single stuck task attempt; because speculative execution was disabled, the stuck attempt blocked the whole application.

      The task log shows the following exception:

      2020-03-11 01:23:59,141 [INFO] [TezChild] |exec.MapOperator|: MAP[4]: records read - 100000
      2020-03-11 01:24:50,294 [INFO] [TezChild] |exec.FileSinkOperator|: FS[3]: records written - 1000000
      2020-03-11 01:24:50,294 [INFO] [TezChild] |exec.MapOperator|: MAP[4]: records read - 1000000
      2020-03-11 01:29:02,967 [FATAL] [ResponseProcessor for block BP-1856561198-172.16.6.67-1421842461517:blk_15177828027_14109212073] |yarn.YarnUncaughtExceptionHandler|: Thread Thread[ResponseProcessor for block BP-1856561198-172.16.6.67-1421842461517:blk_15177828027_14109212073,5,main] threw an Error. Shutting down now...
      java.lang.NoClassDefFoundError: com/google/protobuf/TextFormat
       at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.toString(PipelineAck.java:253)
       at java.lang.String.valueOf(String.java:2847)
       at java.lang.StringBuilder.append(StringBuilder.java:128)
       at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:737)
      Caused by: java.lang.ClassNotFoundException: com.google.protobuf.TextFormat
       at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
       at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
       at java.security.AccessController.doPrivileged(Native Method)
       at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
       at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
       at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
       at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
       ... 4 more
      Caused by: java.util.zip.ZipException: error reading zip file
       at java.util.zip.ZipFile.read(Native Method)
       at java.util.zip.ZipFile.access$1400(ZipFile.java:56)
       at java.util.zip.ZipFile$ZipFileInputStream.read(ZipFile.java:679)
       at java.util.zip.ZipFile$ZipFileInflaterInputStream.fill(ZipFile.java:415)
       at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158)
       at sun.misc.Resource.getBytes(Resource.java:124)
       at java.net.URLClassLoader.defineClass(URLClassLoader.java:444)
       at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
       at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
       ... 10 more
      2020-03-11 01:29:02,970 [INFO] [ResponseProcessor for block BP-1856561198-172.16.6.67-1421842461517:blk_15177828027_14109212073] |util.ExitUtil|: Exiting with status -1
      2020-03-11 03:27:26,833 [INFO] [TaskHeartbeatThread] |task.TaskReporter|: Received should die response from AM
      2020-03-11 03:27:26,834 [INFO] [TaskHeartbeatThread] |task.TaskReporter|: Asked to die via task heartbeat
      2020-03-11 03:27:26,839 [INFO] [TaskHeartbeatThread] |task.TezTaskRunner2|: Attempting to abort attempt_1583335296048_917815_3_01_000704_0 due to an invocation of shutdownRequested
      
      

      The root cause is an uncaught Error. At 01:29 a disk failed, so the class loader could not read the jar (ZipException: error reading zip file) and a NoClassDefFoundError was thrown. ResponseProcessor.run only catches Exception, and NoClassDefFoundError is an Error, so it was not caught. As a result, the ResponseProcessor never set errorState. The DataStreamer did not know that the ResponseProcessor was dead and could not trigger closeResponder, so it stayed stuck in DataStreamer.run.
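
      The catch-scope problem described above can be reproduced in isolation. The following is a minimal sketch, not the HDFS code: the class CatchScopeDemo, the method names, and the errorStateSet flag are hypothetical stand-ins for the ResponseProcessor's error handling.

```java
public class CatchScopeDemo {
    // Simulates the failure from the log: an Error, not an Exception.
    static void failLikeTheLog() {
        throw new NoClassDefFoundError("com/google/protobuf/TextFormat");
    }

    /** Returns true only if the catch (Exception) handler ran. */
    static boolean handledByCatchException() {
        boolean errorStateSet = false;
        try {
            try {
                failLikeTheLog();
            } catch (Exception e) {
                // Hypothetical equivalent of setting errorState; not reached for an Error.
                errorStateSet = true;
            }
        } catch (Error escaped) {
            // In a real thread the Error would reach the uncaught-exception handler instead.
        }
        return errorStateSet;
    }

    public static void main(String[] args) {
        System.out.println("is an Exception: "
            + Exception.class.isAssignableFrom(NoClassDefFoundError.class));
        System.out.println("is an Error: "
            + Error.class.isAssignableFrom(NoClassDefFoundError.class));
        System.out.println("handler ran: " + handledByCatchException());
    }
}
```

      Because NoClassDefFoundError extends LinkageError, which extends Error, the handler is skipped and the flag is never set.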

       I tested this with the unit test TestDataStream.testDfsClient. When I throw NoClassDefFoundError in ResponseProcessor.run, TestDataStream.testDfsClient fails because of a timeout.

      I think we should catch Throwable instead of Exception in ResponseProcessor.run.
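
      The effect of the proposed change can be sketched with a plain worker thread and a waiter. This is an illustration under assumptions, not the attached patch: CatchThrowableSketch, processAck, errorState, and responderClosed are hypothetical names, and the 500 ms timeout stands in for the DataStreamer waiting indefinitely.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class CatchThrowableSketch {
    // Stand-in for ack processing that fails with an Error.
    static void processAck() {
        throw new NoClassDefFoundError("simulated: com/google/protobuf/TextFormat");
    }

    /** Returns true if the waiter was told about the responder's failure. */
    static boolean waiterSeesFailure(boolean catchThrowable) {
        AtomicBoolean errorState = new AtomicBoolean(false);
        CountDownLatch responderClosed = new CountDownLatch(1);

        Runnable narrow = () -> {
            try {
                processAck();
            } catch (Exception e) {   // misses Error: nothing is signalled
                errorState.set(true);
                responderClosed.countDown();
            }
        };
        Runnable broad = () -> {
            try {
                processAck();
            } catch (Throwable t) {   // also covers Error
                errorState.set(true);
                responderClosed.countDown();
            }
        };

        Thread responder = new Thread(catchThrowable ? broad : narrow, "responder-sketch");
        // Swallow the uncaught Error quietly so the demo output stays readable.
        responder.setUncaughtExceptionHandler((thread, error) -> { });
        responder.start();
        try {
            // Bounded wait; a waiter without a timeout would block forever here.
            boolean signalled = responderClosed.await(500, TimeUnit.MILLISECONDS);
            responder.join();
            return signalled && errorState.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("catch (Exception): " + waiterSeesFailure(false));
        System.out.println("catch (Throwable): " + waiterSeesFailure(true));
    }
}
```

      With the narrow handler the responder thread dies silently and the waiter only gives up because of the timeout; with the broad handler the failure is recorded and the waiter is released immediately.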

       

        Attachments

        1. HDFS-15219.001.patch
          0.7 kB
          zhengchenyu


              People

              • Assignee: zhengchenyu
              • Reporter: zhengchenyu
              • Votes: 0
              • Watchers: 6


                  Time Tracking

                  • Estimated: 672h
                  • Remaining: 672h
                  • Logged: Not Specified