Hadoop Map/Reduce
  1. Hadoop Map/Reduce
  2. MAPREDUCE-2998

Failing to contact Am/History for jobs: java.io.EOFException in DataInputStream

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Critical Critical
    • Resolution: Fixed
    • Affects Version/s: 0.23.0, 2.0.0-alpha
    • Fix Version/s: 0.23.0
    • Component/s: mrv2
    • Labels:
      None

      Description

      I am getting an exception frequently when running my jobs on a single-node cluster. It happens with basically any job I run: sometimes the job will work, but most of the time I get this exception (in this case, I was running a simple wordcount from the examples jar - where I got the exception 4 times in a row, and then the job worked the fifth time I submitted it).
      Sometimes restarting the namenode, resourcemanager, and historyserver helps - but not always. Several other developers have seen this problem.

      11/09/12 17:17:50 INFO mapred.YARNRunner: AppMaster capability = memory: 2048,
      11/09/12 17:17:51 INFO mapred.YARNRunner: Command to launch container for ApplicationMaster is : $JAVA_HOME/bin/java -Dhadoop.root.logger=DEBUG,console -Xmx1536m org.apache.hadoop.mapreduce.v2.app.MRAppMaster 1315847180566 6 <FAILCOUNT> 1><LOG_DIR>/stdout 2><LOG_DIR>/stderr
      11/09/12 17:17:51 INFO mapred.ResourceMgrDelegate: Submitted application application_1315847180566_6 to ResourceManager
      11/09/12 17:17:51 INFO mapred.ClientCache: Connecting to HistoryServer at: 0.0.0.0:10020
      11/09/12 17:17:51 INFO ipc.YarnRPC: Creating YarnRPC for org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC
      11/09/12 17:17:51 INFO mapred.ClientCache: Connected to HistoryServer at: 0.0.0.0:10020
      11/09/12 17:17:51 INFO ipc.HadoopYarnRPC: Creating a HadoopYarnProtoRpc proxy for protocol interface org.apache.hadoop.mapreduce.v2.api.MRClientProtocol
      11/09/12 17:17:51 INFO mapreduce.Job: Running job: job_1315847180566_0006
      11/09/12 17:17:52 INFO mapreduce.Job: map 0% reduce 0%
      11/09/12 17:18:00 INFO mapred.ClientServiceDelegate: Tracking Url of JOB is <IP-ADDRESS>:55361
      11/09/12 17:18:00 INFO mapred.ClientServiceDelegate: Connecting to <IP-ADDRESS>:43465
      11/09/12 17:18:00 INFO ipc.YarnRPC: Creating YarnRPC for org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC
      11/09/12 17:18:00 INFO ipc.HadoopYarnRPC: Creating a HadoopYarnProtoRpc proxy for protocol interface org.apache.hadoop.mapreduce.v2.api.MRClientProtocol
      11/09/12 17:18:01 INFO mapred.ClientServiceDelegate: Failed to contact AM/History for job job_1315847180566_0006 Will retry..
      java.lang.reflect.UndeclaredThrowableException
      at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:179)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
      at java.lang.reflect.Method.invoke(Method.java:597)
      at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:237)
      at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:276)
      at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:547)
      at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:540)
      at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1144)
      at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1092)
      at org.apache.hadoop.examples.WordCount.main(WordCount.java:84)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
      at java.lang.reflect.Method.invoke(Method.java:597)
      at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
      at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
      at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
      at java.lang.reflect.Method.invoke(Method.java:597)
      at org.apache.hadoop.util.RunJar.main(RunJar.java:189)
      Caused by: com.google.protobuf.ServiceException: java.io.IOException: Call to /<IP-ADDRESS>:43465 failed on local exception: java.io.EOFException
      at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:139)
      at $Proxy8.getTaskAttemptCompletionEvents(Unknown Source)
      at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172)
      ... 23 more
      Caused by: java.io.IOException: Call to /<IP-ADDRESS>:43465 failed on local exception: java.io.EOFException
      at org.apache.hadoop.ipc.Client.wrapException(Client.java:1119)
      at org.apache.hadoop.ipc.Client.call(Client.java:1087)
      at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:136)
      ... 25 more
      Caused by: java.io.EOFException
      at java.io.DataInputStream.readInt(DataInputStream.java:375)
      at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:816)
      at org.apache.hadoop.ipc.Client$Connection.run(Client.java:754)
      11/09/12 17:18:01 INFO mapreduce.Job: Job job_1315847180566_0006 failed with state FAILED
      11/09/12 17:18:01 INFO mapreduce.Job: Counters: 0

      1. MAPREDUCE-2998.txt
        1 kB
        Vinod Kumar Vavilapalli
      2. amlog
        180 kB
        Jeffrey Naisbitt

        Activity

        Hide
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-trunk #807 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/807/)
        MAPREDUCE-2998. Fixed a bug in TaskAttemptImpl which caused it to fork bin/mapred too many times. Contributed by Vinod K V.

        acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1173456
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java
        Show
        Hudson added a comment - Integrated in Hadoop-Hdfs-trunk #807 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/807/ ) MAPREDUCE-2998 . Fixed a bug in TaskAttemptImpl which caused it to fork bin/mapred too many times. Contributed by Vinod K V. acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1173456 Files : /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java
        Hide
        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-trunk #837 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/837/)
        MAPREDUCE-2998. Fixed a bug in TaskAttemptImpl which caused it to fork bin/mapred too many times. Contributed by Vinod K V.

        acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1173456
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java
        Show
        Hudson added a comment - Integrated in Hadoop-Mapreduce-trunk #837 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/837/ ) MAPREDUCE-2998 . Fixed a bug in TaskAttemptImpl which caused it to fork bin/mapred too many times. Contributed by Vinod K V. acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1173456 Files : /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java
        Hide
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-0.23-Build #16 (See https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/16/)
        Merge -r 1173455:1173456 from trunk to branch-0.23 to fix MAPREDUCE-2998.

        acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1173457
        Files :

        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java
        Show
        Hudson added a comment - Integrated in Hadoop-Hdfs-0.23-Build #16 (See https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/16/ ) Merge -r 1173455:1173456 from trunk to branch-0.23 to fix MAPREDUCE-2998 . acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1173457 Files : /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java
        Hide
        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-0.23-Build #17 (See https://builds.apache.org/job/Hadoop-Mapreduce-0.23-Build/17/)
        Merge -r 1173455:1173456 from trunk to branch-0.23 to fix MAPREDUCE-2998.

        acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1173457
        Files :

        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java
        Show
        Hudson added a comment - Integrated in Hadoop-Mapreduce-0.23-Build #17 (See https://builds.apache.org/job/Hadoop-Mapreduce-0.23-Build/17/ ) Merge -r 1173455:1173456 from trunk to branch-0.23 to fix MAPREDUCE-2998 . acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1173457 Files : /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java
        Hide
        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-trunk-Commit #941 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/941/)
        MAPREDUCE-2998. Fixed a bug in TaskAttemptImpl which caused it to fork bin/mapred too many times. Contributed by Vinod K V.

        acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1173456
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java
        Show
        Hudson added a comment - Integrated in Hadoop-Mapreduce-trunk-Commit #941 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/941/ ) MAPREDUCE-2998 . Fixed a bug in TaskAttemptImpl which caused it to fork bin/mapred too many times. Contributed by Vinod K V. acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1173456 Files : /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java
        Hide
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-trunk-Commit #1002 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/1002/)
        MAPREDUCE-2998. Fixed a bug in TaskAttemptImpl which caused it to fork bin/mapred too many times. Contributed by Vinod K V.

        acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1173456
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java
        Show
        Hudson added a comment - Integrated in Hadoop-Hdfs-trunk-Commit #1002 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/1002/ ) MAPREDUCE-2998 . Fixed a bug in TaskAttemptImpl which caused it to fork bin/mapred too many times. Contributed by Vinod K V. acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1173456 Files : /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java
        Hide
        Hudson added a comment -

        Integrated in Hadoop-Common-trunk-Commit #924 (See https://builds.apache.org/job/Hadoop-Common-trunk-Commit/924/)
        MAPREDUCE-2998. Fixed a bug in TaskAttemptImpl which caused it to fork bin/mapred too many times. Contributed by Vinod K V.

        acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1173456
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java
        Show
        Hudson added a comment - Integrated in Hadoop-Common-trunk-Commit #924 (See https://builds.apache.org/job/Hadoop-Common-trunk-Commit/924/ ) MAPREDUCE-2998 . Fixed a bug in TaskAttemptImpl which caused it to fork bin/mapred too many times. Contributed by Vinod K V. acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1173456 Files : /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java
        Hide
        Arun C Murthy added a comment -

        I committed this trivial, but very important fix.

        Show
        Arun C Murthy added a comment - I committed this trivial, but very important fix.
        Hide
        Arun C Murthy added a comment -

        Lets commit this to unblock QE folks while I work to stabilize MAPREDUCE-2880.

        Show
        Arun C Murthy added a comment - Lets commit this to unblock QE folks while I work to stabilize MAPREDUCE-2880 .
        Hide
        Vinod Kumar Vavilapalli added a comment -

        Vinod - what is the bug you see?

        Please see my comment above(dated 19/Sep/11 12:44) and the attached patch.

        Show
        Vinod Kumar Vavilapalli added a comment - Vinod - what is the bug you see? Please see my comment above(dated 19/Sep/11 12:44) and the attached patch.
        Hide
        Arun C Murthy added a comment -

        Vinod - what is the bug you see?

        Show
        Arun C Murthy added a comment - Vinod - what is the bug you see?
        Hide
        Vinod Kumar Vavilapalli added a comment -

        MAPREDUCE-2880 will avoid "bin/mapred" launches altogether, but we still need this fix.

        Show
        Vinod Kumar Vavilapalli added a comment - MAPREDUCE-2880 will avoid "bin/mapred" launches altogether, but we still need this fix.
        Hide
        Vinod Kumar Vavilapalli added a comment -

        Okay, I(actually Karam Singh who is helping me) ran into this one multiple times and I figured out the root cause.

        AM launches "bin/mapred classpath" in the beginning to figure out the classpath for tasks and this is supposed to happen only once per AM. Because of a subtle bug, it happens more than once during the beginning of AM and each sub-process takes an equal amount of Vmem. The more the number of tasks, the more chance of this happening.

        Show
        Vinod Kumar Vavilapalli added a comment - Okay, I(actually Karam Singh who is helping me) ran into this one multiple times and I figured out the root cause. AM launches "bin/mapred classpath" in the beginning to figure out the classpath for tasks and this is supposed to happen only once per AM. Because of a subtle bug, it happens more than once during the beginning of AM and each sub-process takes an equal amount of Vmem. The more the number of tasks, the more chance of this happening.
        Hide
        Jeffrey Naisbitt added a comment -

        Looking at the NodeManager logs, I am seeing this:
        11/09/15 12:46:51 WARN monitor.ContainersMonitorImpl: Container [pid=22125,containerID=container_1316090748961_0001_01_000001] is running beyond memory-limits. Current usage : 2150408192bytes. Limit : 2147483648bytes. Killing container.
        Dump of the process-tree for container_1316090748961_0001_01_000001 :

        • 22125 21539 22125 22125 (bash) 1 1 65400832 291 /bin/bash -c /home/<USERNAME>/hadoop-build/jdk1.6.0_12/bin/java -Dhadoop.root.logger=DEBUG,console -Xmx1536m org.apache.hadoop.mapreduce.v2.app.MRAppMaster 1316090748961 1 1 1>/home/hadoop/mapred/nm/logs/application_1316090748961_0001/container_1316090748961_0001_01_000001/stdout 2>/home/

        So, for some reason the AM is using more memory now. We should figure out why it's using so much memory lately. In the meantime though, as a temporary workaround, you can use this option when running your jobs: -Dyarn.app.mapreduce.am.command-opts=-Xmx1024m
        ...or add this to your yarn-site.xml file:
        <property>
        <name>yarn.app.mapreduce.am.command-opts</name>
        <value>-Xmx1024m</value>
        </property>

        I am removing the Blocker priority since we have a workaround now.

        Show
        Jeffrey Naisbitt added a comment - Looking at the NodeManager logs, I am seeing this: 11/09/15 12:46:51 WARN monitor.ContainersMonitorImpl: Container [pid=22125,containerID=container_1316090748961_0001_01_000001] is running beyond memory-limits. Current usage : 2150408192bytes. Limit : 2147483648bytes. Killing container. Dump of the process-tree for container_1316090748961_0001_01_000001 : 22125 21539 22125 22125 (bash) 1 1 65400832 291 /bin/bash -c /home/<USERNAME>/hadoop-build/jdk1.6.0_12/bin/java -Dhadoop.root.logger=DEBUG,console -Xmx1536m org.apache.hadoop.mapreduce.v2.app.MRAppMaster 1316090748961 1 1 1>/home/hadoop/mapred/nm/logs/application_1316090748961_0001/container_1316090748961_0001_01_000001/stdout 2>/home/ So, for some reason the AM is using more memory now. We should figure out why it's using so much memory lately. In the meantime though, as a temporary workaround, you can use this option when running your jobs: -Dyarn.app.mapreduce.am.command-opts=-Xmx1024m ...or add this to your yarn-site.xml file: <property> <name>yarn.app.mapreduce.am.command-opts</name> <value>-Xmx1024m</value> </property> I am removing the Blocker priority since we have a workaround now.
        Hide
        Jeffrey Naisbitt added a comment -

        Here is the stderr for the AM container log

        Show
        Jeffrey Naisbitt added a comment - Here is the stderr for the AM container log
        Hide
        Vinod Kumar Vavilapalli added a comment -

        The corrupt blocks should be because of MAPREDUCE-3006 which should be fixed soon.

        If you attach logs, we can see if there are any more bugs .

        Show
        Vinod Kumar Vavilapalli added a comment - The corrupt blocks should be because of MAPREDUCE-3006 which should be fixed soon. If you attach logs, we can see if there are any more bugs .
        Hide
        Jeffrey Naisbitt added a comment -

        Thanks for the comment, Vinod. I will post logs. This is currently trivial to reproduce since it happens 4 out of 5 times for me.

        This is now causing delays and failures in our QE cycles.
        Also, in our QE clusters, this failure is also corrupting HDFS blocks:

        $ $hadoophdfs fsck / | grep CORRUP
        11/09/14 23:51:50 WARN conf.Configuration: mapred.used.genericoptionsparser is deprecated. Instead, use
        mapreduce.client.genericoptionsparser.used
        Connecting to namenode via https://<HOSTNAME>:50470
        /mapred/history/done_intermediate/hadoopqa/job_1316042948239_0009.summary: CORRUPT blockpool
        BP-1806827031-98.137.110.254-1316028239408 block blk_7057908993439990810
        /mapred/history/done_intermediate/hadoopqa/job_1316042948239_0010.summary: CORRUPT blockpool
        BP-1806827031-98.137.110.254-1316028239408 block blk_3616335227169344631
        /mapred/history/done_intermediate/hadoopqa/job_1316042948239_0012.summary: CORRUPT blockpool
        BP-1806827031-98.137.110.254-1316028239408 block blk_-7171149404998570529
        /mapred/history/done_intermediate/hadoopqa/job_1316042948239_0012_conf.xml: CORRUPT blockpool
        BP-1806827031-98.137.110.254-1316028239408 block blk_5875537261129974720
        ..Status: CORRUPT
        CORRUPT FILES: 4
        CORRUPT BLOCKS: 4
        The filesystem under path '/' is CORRUPT

        This is causing cluster to be in Safemode unless we leave manually

        Show
        Jeffrey Naisbitt added a comment - Thanks for the comment, Vinod. I will post logs. This is currently trivial to reproduce since it happens 4 out of 5 times for me. This is now causing delays and failures in our QE cycles. Also, in our QE clusters, this failure is also corrupting HDFS blocks: $ $hadoophdfs fsck / | grep CORRUP 11/09/14 23:51:50 WARN conf.Configuration: mapred.used.genericoptionsparser is deprecated. Instead, use mapreduce.client.genericoptionsparser.used Connecting to namenode via https://<HOSTNAME>:50470 /mapred/history/done_intermediate/hadoopqa/job_1316042948239_0009.summary: CORRUPT blockpool BP-1806827031-98.137.110.254-1316028239408 block blk_7057908993439990810 /mapred/history/done_intermediate/hadoopqa/job_1316042948239_0010.summary: CORRUPT blockpool BP-1806827031-98.137.110.254-1316028239408 block blk_3616335227169344631 /mapred/history/done_intermediate/hadoopqa/job_1316042948239_0012.summary: CORRUPT blockpool BP-1806827031-98.137.110.254-1316028239408 block blk_-7171149404998570529 /mapred/history/done_intermediate/hadoopqa/job_1316042948239_0012_conf.xml: CORRUPT blockpool BP-1806827031-98.137.110.254-1316028239408 block blk_5875537261129974720 ..Status: CORRUPT CORRUPT FILES: 4 CORRUPT BLOCKS: 4 The filesystem under path '/' is CORRUPT This is causing cluster to be in Safemode unless we leave manually
        Hide
        Vinod Kumar Vavilapalli added a comment -

        From both the logs you pasted, the job clearly failed, you need to debug why by looking at AM logs.

        The pasted exceptions are happening because MR AM exited when some call from client was in progress. This is expected in some cases. The retry should eventually figure out the final state of the job from RM and subsequently form JobHistoryServer. BTW, do you have JobHistoryServer up?

        Side note: Reading the logs pasted on JIRA is so hard, can you please attach the full JobClient as a separate file the next time you run into this? Thanks!

        Show
        Vinod Kumar Vavilapalli added a comment - From both the logs you pasted, the job clearly failed, you need to debug why by looking at AM logs. The pasted exceptions are happening because MR AM exited when some call from client was in progress. This is expected in some cases. The retry should eventually figure out the final state of the job from RM and subsequently form JobHistoryServer. BTW, do you have JobHistoryServer up? Side note: Reading the logs pasted on JIRA is so hard, can you please attach the full JobClient as a separate file the next time you run into this? Thanks!
        Hide
        Jeffrey Naisbitt added a comment -

        Sometimes, after the original EOException, I get the following exception when it retries the connection:

        java.lang.reflect.UndeclaredThrowableException
        at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:179)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:237)
        at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:276)
        at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:546)
        at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:540)
        at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1144)
        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1092)
        at org.apache.hadoop.examples.WordCount.main(WordCount.java:84)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
        at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:189)
        Caused by: com.google.protobuf.ServiceException: java.net.ConnectException: Call to <HOSTNAME>/<IP-ADDRESS>:50862 failed on connection exception: java.net.ConnectException: Connection refused
        at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:139)
        at $Proxy8.getTaskAttemptCompletionEvents(Unknown Source)
        at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172)
        ... 23 more
        Caused by: java.net.ConnectException: Call to <HOSTNAME>/<IP-ADDRESS>:50862 failed on connection exception: java.net.ConnectException: Connection refused
        at org.apache.hadoop.ipc.Client.wrapException(Client.java:1111)
        at org.apache.hadoop.ipc.Client.call(Client.java:1087)
        at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:136)
        ... 25 more
        Caused by: java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
        at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:376)
        at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:459)
        at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:556)
        at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:207)
        at org.apache.hadoop.ipc.Client.getConnection(Client.java:1220)
        at org.apache.hadoop.ipc.Client.call(Client.java:1064)
        ... 26 more
        11/09/13 20:13:01 INFO mapreduce.Job: Job job_1315941172309_0006 failed with state FAILED
        11/09/13 20:13:01 INFO mapreduce.Job: Counters: 0

        Show
        Jeffrey Naisbitt added a comment - Sometimes, after the original EOException, I get the following exception when it retries the connection: java.lang.reflect.UndeclaredThrowableException at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:179) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:237) at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:276) at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:546) at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:540) at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1144) at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1092) at org.apache.hadoop.examples.WordCount.main(WordCount.java:84) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72) at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144) at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:189) Caused by: com.google.protobuf.ServiceException: java.net.ConnectException: Call to <HOSTNAME>/<IP-ADDRESS>:50862 failed on connection exception: java.net.ConnectException: Connection refused at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:139) at $Proxy8.getTaskAttemptCompletionEvents(Unknown Source) at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172) ... 23 more Caused by: java.net.ConnectException: Call to <HOSTNAME>/<IP-ADDRESS>:50862 failed on connection exception: java.net.ConnectException: Connection refused at org.apache.hadoop.ipc.Client.wrapException(Client.java:1111) at org.apache.hadoop.ipc.Client.call(Client.java:1087) at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:136) ... 25 more Caused by: java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574) at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:376) at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:459) at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:556) at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:207) at org.apache.hadoop.ipc.Client.getConnection(Client.java:1220) at org.apache.hadoop.ipc.Client.call(Client.java:1064) ... 26 more 11/09/13 20:13:01 INFO mapreduce.Job: Job job_1315941172309_0006 failed with state FAILED 11/09/13 20:13:01 INFO mapreduce.Job: Counters: 0

          People

          • Assignee:
            Vinod Kumar Vavilapalli
            Reporter:
            Jeffrey Naisbitt
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development