Hadoop Map/Reduce
  1. Hadoop Map/Reduce
  2. MAPREDUCE-2998

Failing to contact Am/History for jobs: java.io.EOFException in DataInputStream

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Critical Critical
    • Resolution: Fixed
    • Affects Version/s: 0.23.0, 2.0.0-alpha
    • Fix Version/s: 0.23.0
    • Component/s: mrv2
    • Labels:
      None

      Description

      I am getting an exception frequently when running my jobs on a single-node cluster. It happens with basically any job I run: sometimes the job will work, but most of the time I get this exception (in this case, I was running a simple wordcount from the examples jar - where I got the exception 4 times in a row, and then the job worked the fifth time I submitted it).
      Sometimes restarting the namenode, resourcemanager, and historyserver helps - but not always. Several other developers have seen this problem.

      11/09/12 17:17:50 INFO mapred.YARNRunner: AppMaster capability = memory: 2048,
      11/09/12 17:17:51 INFO mapred.YARNRunner: Command to launch container for ApplicationMaster is : $JAVA_HOME/bin/java -Dhadoop.root.logger=DEBUG,console -Xmx1536m org.apache.hadoop.mapreduce.v2.app.MRAppMaster 1315847180566 6 <FAILCOUNT> 1><LOG_DIR>/stdout 2><LOG_DIR>/stderr
      11/09/12 17:17:51 INFO mapred.ResourceMgrDelegate: Submitted application application_1315847180566_6 to ResourceManager
      11/09/12 17:17:51 INFO mapred.ClientCache: Connecting to HistoryServer at: 0.0.0.0:10020
      11/09/12 17:17:51 INFO ipc.YarnRPC: Creating YarnRPC for org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC
      11/09/12 17:17:51 INFO mapred.ClientCache: Connected to HistoryServer at: 0.0.0.0:10020
      11/09/12 17:17:51 INFO ipc.HadoopYarnRPC: Creating a HadoopYarnProtoRpc proxy for protocol interface org.apache.hadoop.mapreduce.v2.api.MRClientProtocol
      11/09/12 17:17:51 INFO mapreduce.Job: Running job: job_1315847180566_0006
      11/09/12 17:17:52 INFO mapreduce.Job: map 0% reduce 0%
      11/09/12 17:18:00 INFO mapred.ClientServiceDelegate: Tracking Url of JOB is <IP-ADDRESS>:55361
      11/09/12 17:18:00 INFO mapred.ClientServiceDelegate: Connecting to <IP-ADDRESS>:43465
      11/09/12 17:18:00 INFO ipc.YarnRPC: Creating YarnRPC for org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC
      11/09/12 17:18:00 INFO ipc.HadoopYarnRPC: Creating a HadoopYarnProtoRpc proxy for protocol interface org.apache.hadoop.mapreduce.v2.api.MRClientProtocol
      11/09/12 17:18:01 INFO mapred.ClientServiceDelegate: Failed to contact AM/History for job job_1315847180566_0006 Will retry..
      java.lang.reflect.UndeclaredThrowableException
      at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:179)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
      at java.lang.reflect.Method.invoke(Method.java:597)
      at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:237)
      at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:276)
      at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:547)
      at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:540)
      at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1144)
      at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1092)
      at org.apache.hadoop.examples.WordCount.main(WordCount.java:84)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
      at java.lang.reflect.Method.invoke(Method.java:597)
      at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
      at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
      at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
      at java.lang.reflect.Method.invoke(Method.java:597)
      at org.apache.hadoop.util.RunJar.main(RunJar.java:189)
      Caused by: com.google.protobuf.ServiceException: java.io.IOException: Call to /<IP-ADDRESS>:43465 failed on local exception: java.io.EOFException
      at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:139)
      at $Proxy8.getTaskAttemptCompletionEvents(Unknown Source)
      at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172)
      ... 23 more
      Caused by: java.io.IOException: Call to /<IP-ADDRESS>:43465 failed on local exception: java.io.EOFException
      at org.apache.hadoop.ipc.Client.wrapException(Client.java:1119)
      at org.apache.hadoop.ipc.Client.call(Client.java:1087)
      at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:136)
      ... 25 more
      Caused by: java.io.EOFException
      at java.io.DataInputStream.readInt(DataInputStream.java:375)
      at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:816)
      at org.apache.hadoop.ipc.Client$Connection.run(Client.java:754)
      11/09/12 17:18:01 INFO mapreduce.Job: Job job_1315847180566_0006 failed with state FAILED
      11/09/12 17:18:01 INFO mapreduce.Job: Counters: 0

      1. amlog
        180 kB
        Jeffrey Naisbitt
      2. MAPREDUCE-2998.txt
        1 kB
        Vinod Kumar Vavilapalli

        Activity

        Transition Time In Source Status Execution Times Last Executer Last Execution Date
        Open Open Patch Available Patch Available
        7d 9h 27m 1 Arun C Murthy 21/Sep/11 02:25
        Patch Available Patch Available Resolved Resolved
        2m 18s 1 Arun C Murthy 21/Sep/11 02:27
        Resolved Resolved Closed Closed
        54d 23h 21m 1 Arun C Murthy 15/Nov/11 00:48
        Allen Wittenauer made changes -
        Fix Version/s 2.0.0-alpha [ 12320354 ]
        Allen Wittenauer made changes -
        Affects Version/s 2.0.0-alpha [ 12320354 ]
        Affects Version/s 0.24.0 [ 12317654 ]
        Fix Version/s 2.0.0-alpha [ 12320354 ]
        Fix Version/s 0.24.0 [ 12317654 ]
        Arun C Murthy made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Hide
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-trunk #807 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/807/)
        MAPREDUCE-2998. Fixed a bug in TaskAttemptImpl which caused it to fork bin/mapred too many times. Contributed by Vinod K V.

        acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1173456
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java
        Show
        Hudson added a comment - Integrated in Hadoop-Hdfs-trunk #807 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/807/ ) MAPREDUCE-2998 . Fixed a bug in TaskAttemptImpl which caused it to fork bin/mapred too many times. Contributed by Vinod K V. acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1173456 Files : /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java
        Hide
        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-trunk #837 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/837/)
        MAPREDUCE-2998. Fixed a bug in TaskAttemptImpl which caused it to fork bin/mapred too many times. Contributed by Vinod K V.

        acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1173456
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java
        Show
        Hudson added a comment - Integrated in Hadoop-Mapreduce-trunk #837 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/837/ ) MAPREDUCE-2998 . Fixed a bug in TaskAttemptImpl which caused it to fork bin/mapred too many times. Contributed by Vinod K V. acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1173456 Files : /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java
        Hide
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-0.23-Build #16 (See https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/16/)
        Merge -r 1173455:1173456 from trunk to branch-0.23 to fix MAPREDUCE-2998.

        acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1173457
        Files :

        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java
        Show
        Hudson added a comment - Integrated in Hadoop-Hdfs-0.23-Build #16 (See https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/16/ ) Merge -r 1173455:1173456 from trunk to branch-0.23 to fix MAPREDUCE-2998 . acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1173457 Files : /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java
        Hide
        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-0.23-Build #17 (See https://builds.apache.org/job/Hadoop-Mapreduce-0.23-Build/17/)
        Merge -r 1173455:1173456 from trunk to branch-0.23 to fix MAPREDUCE-2998.

        acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1173457
        Files :

        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java
        Show
        Hudson added a comment - Integrated in Hadoop-Mapreduce-0.23-Build #17 (See https://builds.apache.org/job/Hadoop-Mapreduce-0.23-Build/17/ ) Merge -r 1173455:1173456 from trunk to branch-0.23 to fix MAPREDUCE-2998 . acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1173457 Files : /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java
        Hide
        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-trunk-Commit #941 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/941/)
        MAPREDUCE-2998. Fixed a bug in TaskAttemptImpl which caused it to fork bin/mapred too many times. Contributed by Vinod K V.

        acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1173456
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java
        Show
        Hudson added a comment - Integrated in Hadoop-Mapreduce-trunk-Commit #941 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/941/ ) MAPREDUCE-2998 . Fixed a bug in TaskAttemptImpl which caused it to fork bin/mapred too many times. Contributed by Vinod K V. acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1173456 Files : /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java
        Mahadev konar made changes -
        Fix Version/s 0.23.0 [ 12315570 ]
        Fix Version/s 0.24.0 [ 12317654 ]
        Hide
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-trunk-Commit #1002 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/1002/)
        MAPREDUCE-2998. Fixed a bug in TaskAttemptImpl which caused it to fork bin/mapred too many times. Contributed by Vinod K V.

        acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1173456
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java
        Show
        Hudson added a comment - Integrated in Hadoop-Hdfs-trunk-Commit #1002 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/1002/ ) MAPREDUCE-2998 . Fixed a bug in TaskAttemptImpl which caused it to fork bin/mapred too many times. Contributed by Vinod K V. acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1173456 Files : /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java
        Hide
        Hudson added a comment -

        Integrated in Hadoop-Common-trunk-Commit #924 (See https://builds.apache.org/job/Hadoop-Common-trunk-Commit/924/)
        MAPREDUCE-2998. Fixed a bug in TaskAttemptImpl which caused it to fork bin/mapred too many times. Contributed by Vinod K V.

        acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1173456
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java
        Show
        Hudson added a comment - Integrated in Hadoop-Common-trunk-Commit #924 (See https://builds.apache.org/job/Hadoop-Common-trunk-Commit/924/ ) MAPREDUCE-2998 . Fixed a bug in TaskAttemptImpl which caused it to fork bin/mapred too many times. Contributed by Vinod K V. acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1173456 Files : /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java
        Arun C Murthy made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Hide
        Arun C Murthy added a comment -

        I committed this trivial, but very important fix.

        Show
        Arun C Murthy added a comment - I committed this trivial, but very important fix.
        Arun C Murthy made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Hide
        Arun C Murthy added a comment -

        Lets commit this to unblock QE folks while I work to stabilize MAPREDUCE-2880.

        Show
        Arun C Murthy added a comment - Lets commit this to unblock QE folks while I work to stabilize MAPREDUCE-2880 .
        Arun C Murthy made changes -
        Assignee Vinod Kumar Vavilapalli [ vinodkv ]
        Hide
        Vinod Kumar Vavilapalli added a comment -

        Vinod - what is the bug you see?

        Please see my comment above(dated 19/Sep/11 12:44) and the attached patch.

        Show
        Vinod Kumar Vavilapalli added a comment - Vinod - what is the bug you see? Please see my comment above(dated 19/Sep/11 12:44) and the attached patch.
        Hide
        Arun C Murthy added a comment -

        Vinod - what is the bug you see?

        Show
        Arun C Murthy added a comment - Vinod - what is the bug you see?
        Vinod Kumar Vavilapalli made changes -
        Attachment MAPREDUCE-2998.txt [ 12495059 ]
        Hide
        Vinod Kumar Vavilapalli added a comment -

        MAPREDUCE-2880 will avoid "bin/mapred" launches altogether, but we still need this fix.

        Show
        Vinod Kumar Vavilapalli added a comment - MAPREDUCE-2880 will avoid "bin/mapred" launches altogether, but we still need this fix.
        Hide
        Vinod Kumar Vavilapalli added a comment -

        Okay, I(actually Karam Singh who is helping me) ran into this one multiple times and I figured out the root cause.

        AM launches "bin/mapred classpath" in the beginning to figure out the classpath for tasks and this is supposed to happen only once per AM. Because of a subtle bug, it happens more than once during the beginning of AM and each sub-process takes an equal amount of Vmem. The more the number of tasks, the more chance of this happening.

        Show
        Vinod Kumar Vavilapalli added a comment - Okay, I(actually Karam Singh who is helping me) ran into this one multiple times and I figured out the root cause. AM launches "bin/mapred classpath" in the beginning to figure out the classpath for tasks and this is supposed to happen only once per AM. Because of a subtle bug, it happens more than once during the beginning of AM and each sub-process takes an equal amount of Vmem. The more the number of tasks, the more chance of this happening.
        Jeffrey Naisbitt made changes -
        Priority Blocker [ 1 ] Critical [ 2 ]
        Hide
        Jeffrey Naisbitt added a comment -

        Looking at the NodeManager logs, I am seeing this:
        11/09/15 12:46:51 WARN monitor.ContainersMonitorImpl: Container [pid=22125,containerID=container_1316090748961_0001_01_000001] is running beyond memory-limits. Current usage : 2150408192bytes. Limit : 2147483648bytes. Killing container.
        Dump of the process-tree for container_1316090748961_0001_01_000001 :

        • 22125 21539 22125 22125 (bash) 1 1 65400832 291 /bin/bash -c /home/<USERNAME>/hadoop-build/jdk1.6.0_12/bin/java -Dhadoop.root.logger=DEBUG,console -Xmx1536m org.apache.hadoop.mapreduce.v2.app.MRAppMaster 1316090748961 1 1 1>/home/hadoop/mapred/nm/logs/application_1316090748961_0001/container_1316090748961_0001_01_000001/stdout 2>/home/

        So, for some reason the AM is using more memory now. We should figure out why it's using so much memory lately. In the meantime though, as a temporary workaround, you can use this option when running your jobs: -Dyarn.app.mapreduce.am.command-opts=-Xmx1024m
        ...or add this to your yarn-site.xml file:
        <property>
        <name>yarn.app.mapreduce.am.command-opts</name>
        <value>-Xmx1024m</value>
        </property>

        I am removing the Blocker priority since we have a workaround now.

        Show
        Jeffrey Naisbitt added a comment - Looking at the NodeManager logs, I am seeing this: 11/09/15 12:46:51 WARN monitor.ContainersMonitorImpl: Container [pid=22125,containerID=container_1316090748961_0001_01_000001] is running beyond memory-limits. Current usage : 2150408192bytes. Limit : 2147483648bytes. Killing container. Dump of the process-tree for container_1316090748961_0001_01_000001 : 22125 21539 22125 22125 (bash) 1 1 65400832 291 /bin/bash -c /home/<USERNAME>/hadoop-build/jdk1.6.0_12/bin/java -Dhadoop.root.logger=DEBUG,console -Xmx1536m org.apache.hadoop.mapreduce.v2.app.MRAppMaster 1316090748961 1 1 1>/home/hadoop/mapred/nm/logs/application_1316090748961_0001/container_1316090748961_0001_01_000001/stdout 2>/home/ So, for some reason the AM is using more memory now. We should figure out why it's using so much memory lately. In the meantime though, as a temporary workaround, you can use this option when running your jobs: -Dyarn.app.mapreduce.am.command-opts=-Xmx1024m ...or add this to your yarn-site.xml file: <property> <name>yarn.app.mapreduce.am.command-opts</name> <value>-Xmx1024m</value> </property> I am removing the Blocker priority since we have a workaround now.
        Jeffrey Naisbitt made changes -
        Attachment amlog [ 12494610 ]
        Hide
        Jeffrey Naisbitt added a comment -

        Here is the stderr for the AM container log

        Show
        Jeffrey Naisbitt added a comment - Here is the stderr for the AM container log
        Hide
        Vinod Kumar Vavilapalli added a comment -

        The corrupt blocks should be because of MAPREDUCE-3006 which should be fixed soon.

        If you attach logs, we can see if there are any more bugs .

        Show
        Vinod Kumar Vavilapalli added a comment - The corrupt blocks should be because of MAPREDUCE-3006 which should be fixed soon. If you attach logs, we can see if there are any more bugs .
        Jeffrey Naisbitt made changes -
        Environment This is now causing delays and failures in our QE cycles.
        Hide
        Jeffrey Naisbitt added a comment -

        Thanks for the comment, Vinod. I will post logs. This is currently trivial to reproduce since it happens 4 out of 5 times for me.

        This is now causing delays and failures in our QE cycles.
        Also, in our QE clusters, this failure is also corrupting HDFS blocks:

        $ $hadoophdfs fsck / | grep CORRUP
        11/09/14 23:51:50 WARN conf.Configuration: mapred.used.genericoptionsparser is deprecated. Instead, use
        mapreduce.client.genericoptionsparser.used
        Connecting to namenode via https://<HOSTNAME>:50470
        /mapred/history/done_intermediate/hadoopqa/job_1316042948239_0009.summary: CORRUPT blockpool
        BP-1806827031-98.137.110.254-1316028239408 block blk_7057908993439990810
        /mapred/history/done_intermediate/hadoopqa/job_1316042948239_0010.summary: CORRUPT blockpool
        BP-1806827031-98.137.110.254-1316028239408 block blk_3616335227169344631
        /mapred/history/done_intermediate/hadoopqa/job_1316042948239_0012.summary: CORRUPT blockpool
        BP-1806827031-98.137.110.254-1316028239408 block blk_-7171149404998570529
        /mapred/history/done_intermediate/hadoopqa/job_1316042948239_0012_conf.xml: CORRUPT blockpool
        BP-1806827031-98.137.110.254-1316028239408 block blk_5875537261129974720
        ..Status: CORRUPT
        CORRUPT FILES: 4
        CORRUPT BLOCKS: 4
        The filesystem under path '/' is CORRUPT

        This is causing cluster to be in Safemode unless we leave manually

        Show
        Jeffrey Naisbitt added a comment - Thanks for the comment, Vinod. I will post logs. This is currently trivial to reproduce since it happens 4 out of 5 times for me. This is now causing delays and failures in our QE cycles. Also, in our QE clusters, this failure is also corrupting HDFS blocks: $ $hadoophdfs fsck / | grep CORRUP 11/09/14 23:51:50 WARN conf.Configuration: mapred.used.genericoptionsparser is deprecated. Instead, use mapreduce.client.genericoptionsparser.used Connecting to namenode via https://<HOSTNAME>:50470 /mapred/history/done_intermediate/hadoopqa/job_1316042948239_0009.summary: CORRUPT blockpool BP-1806827031-98.137.110.254-1316028239408 block blk_7057908993439990810 /mapred/history/done_intermediate/hadoopqa/job_1316042948239_0010.summary: CORRUPT blockpool BP-1806827031-98.137.110.254-1316028239408 block blk_3616335227169344631 /mapred/history/done_intermediate/hadoopqa/job_1316042948239_0012.summary: CORRUPT blockpool BP-1806827031-98.137.110.254-1316028239408 block blk_-7171149404998570529 /mapred/history/done_intermediate/hadoopqa/job_1316042948239_0012_conf.xml: CORRUPT blockpool BP-1806827031-98.137.110.254-1316028239408 block blk_5875537261129974720 ..Status: CORRUPT CORRUPT FILES: 4 CORRUPT BLOCKS: 4 The filesystem under path '/' is CORRUPT This is causing cluster to be in Safemode unless we leave manually
        Jeffrey Naisbitt made changes -
        Field Original Value New Value
        Environment This is now causing delays and failures in our QE cycles.
        Priority Major [ 3 ] Blocker [ 1 ]
        Hide
        Vinod Kumar Vavilapalli added a comment -

        From both the logs you pasted, the job clearly failed, you need to debug why by looking at AM logs.

        The pasted exceptions are happening because MR AM exited when some call from client was in progress. This is expected in some cases. The retry should eventually figure out the final state of the job from RM and subsequently form JobHistoryServer. BTW, do you have JobHistoryServer up?

        Side note: Reading the logs pasted on JIRA is so hard, can you please attach the full JobClient as a separate file the next time you run into this? Thanks!

        Show
        Vinod Kumar Vavilapalli added a comment - From both the logs you pasted, the job clearly failed, you need to debug why by looking at AM logs. The pasted exceptions are happening because MR AM exited when some call from client was in progress. This is expected in some cases. The retry should eventually figure out the final state of the job from RM and subsequently form JobHistoryServer. BTW, do you have JobHistoryServer up? Side note: Reading the logs pasted on JIRA is so hard, can you please attach the full JobClient as a separate file the next time you run into this? Thanks!
        Hide
        Jeffrey Naisbitt added a comment -

        Sometimes, after the original EOException, I get the following exception when it retries the connection:

        java.lang.reflect.UndeclaredThrowableException
        at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:179)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:237)
        at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:276)
        at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:546)
        at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:540)
        at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1144)
        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1092)
        at org.apache.hadoop.examples.WordCount.main(WordCount.java:84)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
        at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:189)
        Caused by: com.google.protobuf.ServiceException: java.net.ConnectException: Call to <HOSTNAME>/<IP-ADDRESS>:50862 failed on connection exception: java.net.ConnectException: Connection refused
        at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:139)
        at $Proxy8.getTaskAttemptCompletionEvents(Unknown Source)
        at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172)
        ... 23 more
        Caused by: java.net.ConnectException: Call to <HOSTNAME>/<IP-ADDRESS>:50862 failed on connection exception: java.net.ConnectException: Connection refused
        at org.apache.hadoop.ipc.Client.wrapException(Client.java:1111)
        at org.apache.hadoop.ipc.Client.call(Client.java:1087)
        at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:136)
        ... 25 more
        Caused by: java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
        at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:376)
        at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:459)
        at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:556)
        at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:207)
        at org.apache.hadoop.ipc.Client.getConnection(Client.java:1220)
        at org.apache.hadoop.ipc.Client.call(Client.java:1064)
        ... 26 more
        11/09/13 20:13:01 INFO mapreduce.Job: Job job_1315941172309_0006 failed with state FAILED
        11/09/13 20:13:01 INFO mapreduce.Job: Counters: 0

        Show
        Jeffrey Naisbitt added a comment - Sometimes, after the original EOException, I get the following exception when it retries the connection: java.lang.reflect.UndeclaredThrowableException at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:179) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:237) at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:276) at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:546) at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:540) at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1144) at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1092) at org.apache.hadoop.examples.WordCount.main(WordCount.java:84) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72) at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144) at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:189) Caused by: com.google.protobuf.ServiceException: java.net.ConnectException: Call to <HOSTNAME>/<IP-ADDRESS>:50862 failed on connection exception: java.net.ConnectException: Connection refused at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:139) at $Proxy8.getTaskAttemptCompletionEvents(Unknown Source) at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172) ... 23 more Caused by: java.net.ConnectException: Call to <HOSTNAME>/<IP-ADDRESS>:50862 failed on connection exception: java.net.ConnectException: Connection refused at org.apache.hadoop.ipc.Client.wrapException(Client.java:1111) at org.apache.hadoop.ipc.Client.call(Client.java:1087) at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:136) ... 25 more Caused by: java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574) at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:376) at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:459) at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:556) at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:207) at org.apache.hadoop.ipc.Client.getConnection(Client.java:1220) at org.apache.hadoop.ipc.Client.call(Client.java:1064) ... 26 more 11/09/13 20:13:01 INFO mapreduce.Job: Job job_1315941172309_0006 failed with state FAILED 11/09/13 20:13:01 INFO mapreduce.Job: Counters: 0
        Jeffrey Naisbitt created issue -

          People

          • Assignee:
            Vinod Kumar Vavilapalli
            Reporter:
            Jeffrey Naisbitt
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development