Details
- Bug
- Status: Resolved
- Major
- Resolution: Fixed
- 2.7.3
- None
- Reviewed
Description
In my case, a Tez application was stuck for more than 2 hours until we killed it. The cause was a stuck task attempt, which nothing replaced because speculative execution was disabled.
The task log showed this exception:
2020-03-11 01:23:59,141 [INFO] [TezChild] |exec.MapOperator|: MAP[4]: records read - 100000
2020-03-11 01:24:50,294 [INFO] [TezChild] |exec.FileSinkOperator|: FS[3]: records written - 1000000
2020-03-11 01:24:50,294 [INFO] [TezChild] |exec.MapOperator|: MAP[4]: records read - 1000000
2020-03-11 01:29:02,967 [FATAL] [ResponseProcessor for block BP-1856561198-172.16.6.67-1421842461517:blk_15177828027_14109212073] |yarn.YarnUncaughtExceptionHandler|: Thread Thread[ResponseProcessor for block BP-1856561198-172.16.6.67-1421842461517:blk_15177828027_14109212073,5,main] threw an Error. Shutting down now...
java.lang.NoClassDefFoundError: com/google/protobuf/TextFormat
    at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.toString(PipelineAck.java:253)
    at java.lang.String.valueOf(String.java:2847)
    at java.lang.StringBuilder.append(StringBuilder.java:128)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:737)
Caused by: java.lang.ClassNotFoundException: com.google.protobuf.TextFormat
    at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    ... 4 more
Caused by: java.util.zip.ZipException: error reading zip file
    at java.util.zip.ZipFile.read(Native Method)
    at java.util.zip.ZipFile.access$1400(ZipFile.java:56)
    at java.util.zip.ZipFile$ZipFileInputStream.read(ZipFile.java:679)
    at java.util.zip.ZipFile$ZipFileInflaterInputStream.fill(ZipFile.java:415)
    at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158)
    at sun.misc.Resource.getBytes(Resource.java:124)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:444)
    at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
    ... 10 more
2020-03-11 01:29:02,970 [INFO] [ResponseProcessor for block BP-1856561198-172.16.6.67-1421842461517:blk_15177828027_14109212073] |util.ExitUtil|: Exiting with status -1
2020-03-11 03:27:26,833 [INFO] [TaskHeartbeatThread] |task.TaskReporter|: Received should die response from AM
2020-03-11 03:27:26,834 [INFO] [TaskHeartbeatThread] |task.TaskReporter|: Asked to die via task heartbeat
2020-03-11 03:27:26,839 [INFO] [TaskHeartbeatThread] |task.TezTaskRunner2|: Attempting to abort attempt_1583335296048_917815_3_01_000704_0 due to an invocation of shutdownRequested
The root cause is an uncaught Error. At 01:29 a disk error occurred, so reading a class from its jar failed (the ZipException above) and a NoClassDefFoundError was thrown. ResponseProcessor.run only catches Exception, so it cannot catch NoClassDefFoundError, which is an Error. As a result the ResponseProcessor never set errorState. The DataStreamer then did not know that the ResponseProcessor was dead and could not trigger closeResponder, so it stayed stuck in DataStreamer.run.
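To illustrate why the handler is bypassed, here is a minimal standalone sketch. The class CatchScopeDemo and the method handledBy are hypothetical names made up for this example and are not Hadoop code; the sketch only shows that a catch (Exception) clause does not match NoClassDefFoundError.

```java
// Hypothetical demo class; it is not part of Hadoop.
class CatchScopeDemo {
    /** Runs the body and reports which catch clause, if any, handled the failure. */
    static String handledBy(Runnable body) {
        try {
            try {
                body.run();
                return "none";
            } catch (Exception e) {
                // Same scope as the handler described for ResponseProcessor.run.
                return "catch (Exception)";
            }
        } catch (Throwable t) {
            // Reached only when the inner clause did not match.
            return "escaped to outer handler";
        }
    }

    public static void main(String[] args) {
        // An ordinary exception is handled by the inner clause.
        System.out.println(handledBy(() -> {
            throw new IllegalStateException("ordinary failure");
        }));
        // NoClassDefFoundError extends Error, so the inner clause is skipped.
        System.out.println(handledBy(() -> {
            throw new NoClassDefFoundError("com/google/protobuf/TextFormat");
        }));
    }
}
```

In the real thread there is no outer handler inside run, which is why the log shows the Error reaching YarnUncaughtExceptionHandler instead.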
I reproduced this in the unit test TestDataStream.testDfsClient. When I throw NoClassDefFoundError in ResponseProcessor.run, TestDataStream.testDfsClient fails because of a timeout.
I think we should catch Throwable rather than Exception in ResponseProcessor.run.
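The proposal can be sketched with a simplified stand-in for the responder thread. Everything here is an assumption for illustration: ResponderSketch, errorFlagSet and the AtomicBoolean named errorState are hypothetical and only mimic the idea of recording a failure so that another thread can react; the actual patch may look different.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for the responder thread; it is not Hadoop code.
class ResponderSketch {
    /**
     * Starts a responder thread whose work throws NoClassDefFoundError and
     * returns true if the error flag was set by the time the thread ended.
     */
    static boolean errorFlagSet(boolean catchThrowable) {
        AtomicBoolean errorState = new AtomicBoolean(false);
        Runnable work = () -> {
            throw new NoClassDefFoundError("com/google/protobuf/TextFormat");
        };

        Thread responder = new Thread(() -> {
            if (catchThrowable) {
                try {
                    work.run();
                } catch (Throwable t) {   // proposed scope: also covers Error
                    errorState.set(true);
                }
            } else {
                try {
                    work.run();
                } catch (Exception e) {   // reported scope: Error escapes
                    errorState.set(true);
                }
            }
        }, "responder-sketch");
        // Swallow the escaped Error so the demo output stays readable.
        responder.setUncaughtExceptionHandler((t, e) -> { });
        responder.start();
        try {
            responder.join(TimeUnit.SECONDS.toMillis(5));
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", ie);
        }
        return errorState.get();
    }

    public static void main(String[] args) {
        System.out.println("catch Exception -> flag set: " + errorFlagSet(false));
        System.out.println("catch Throwable -> flag set: " + errorFlagSet(true));
    }
}
```

With the narrower clause the flag stays unset, so any thread that waits for it has nothing to react to. One trade-off worth weighing: catching Throwable also intercepts errors such as OutOfMemoryError, so the handler should do as little as possible beyond recording the failure.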