Uploaded image for project: 'Hadoop Map/Reduce'
  1. Hadoop Map/Reduce
  2. MAPREDUCE-6762

ControlledJob#toString failed with NPE when job status is not successfully updated

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.7.2
    • Fix Version/s: 2.9.0, 3.0.0-alpha1
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      This issue was found from a cluster where Pig query occasionally failed on NPE. Pig uses JobControl API to track MR job status, but sometimes Job History Server failed to flush job meta files to HDFS which caused the status update failed. Then we get NPE in org.apache.hadoop.mapreduce.Job.getJobName. The result of this situation is quite confusing: Pig query failed, job history is missing, but the job status on Yarn is succeed.

        Issue Links

          Activity

          Hide
          cheersyang Weiwei Yang added a comment -

          Thanks a lot Varun Saxena.

          Show
          cheersyang Weiwei Yang added a comment - Thanks a lot Varun Saxena .
          Hide
          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #10316 (See https://builds.apache.org/job/Hadoop-trunk-Commit/10316/)
          MAPREDUCE-6762. ControlledJob#toString failed with NPE when job status (varunsaxena: rev d37b45d613b768950d1cbe342961cd71776816ae)

          • (edit) hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/Job.java
          • (edit) hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/TestJob.java
          Show
          hudson Hudson added a comment - SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #10316 (See https://builds.apache.org/job/Hadoop-trunk-Commit/10316/ ) MAPREDUCE-6762 . ControlledJob#toString failed with NPE when job status (varunsaxena: rev d37b45d613b768950d1cbe342961cd71776816ae) (edit) hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/Job.java (edit) hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/TestJob.java
          Hide
          varun_saxena Varun Saxena added a comment -

          Committed to trunk, branch-2.
          Thanks Weiwei Yang for your contribution. I am unable to assign JIRA to you. Will ask somebody to assign it to you

          Show
          varun_saxena Varun Saxena added a comment - Committed to trunk, branch-2. Thanks Weiwei Yang for your contribution. I am unable to assign JIRA to you. Will ask somebody to assign it to you
          Hide
          varun_saxena Varun Saxena added a comment -

          Looks good. Will commit it if there are no further comments.

          Job history not being flushed when AM is finishing, which leads to job not being found in JHS has been encountered before (see, MAPREDUCE-6428, MAPREDUCE-5703 and MAPREDUCE-5547). But irrespective, this NPE should be fixed.

          Show
          varun_saxena Varun Saxena added a comment - Looks good. Will commit it if there are no further comments. Job history not being flushed when AM is finishing, which leads to job not being found in JHS has been encountered before (see, MAPREDUCE-6428 , MAPREDUCE-5703 and MAPREDUCE-5547 ). But irrespective, this NPE should be fixed.
          Hide
          hadoopqa Hadoop QA added a comment -
          +1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 17s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
          +1 mvninstall 7m 9s trunk passed
          +1 compile 0m 26s trunk passed
          +1 checkstyle 0m 18s trunk passed
          +1 mvnsite 0m 29s trunk passed
          +1 mvneclipse 0m 13s trunk passed
          +1 findbugs 0m 48s trunk passed
          +1 javadoc 0m 22s trunk passed
          +1 mvninstall 0m 22s the patch passed
          +1 compile 0m 21s the patch passed
          +1 javac 0m 21s the patch passed
          +1 checkstyle 0m 14s the patch passed
          +1 mvnsite 0m 26s the patch passed
          +1 mvneclipse 0m 10s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 0m 54s the patch passed
          +1 javadoc 0m 19s the patch passed
          +1 unit 2m 4s hadoop-mapreduce-client-core in the patch passed.
          +1 asflicense 0m 15s The patch does not generate ASF License warnings.
          15m 43s



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:9560f25
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12824496/MAPREDUCE-6762.001.patch
          JIRA Issue MAPREDUCE-6762
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 51cbe2969693 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / dbcaf99
          Default Java 1.8.0_101
          findbugs v3.0.0
          Test Results https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6680/testReport/
          modules C: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core U: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core
          Console output https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6680/console
          Powered by Apache Yetus 0.3.0 http://yetus.apache.org

          This message was automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - +1 overall Vote Subsystem Runtime Comment 0 reexec 0m 17s Docker mode activated. +1 @author 0m 0s The patch does not contain any @author tags. +1 test4tests 0m 0s The patch appears to include 1 new or modified test files. +1 mvninstall 7m 9s trunk passed +1 compile 0m 26s trunk passed +1 checkstyle 0m 18s trunk passed +1 mvnsite 0m 29s trunk passed +1 mvneclipse 0m 13s trunk passed +1 findbugs 0m 48s trunk passed +1 javadoc 0m 22s trunk passed +1 mvninstall 0m 22s the patch passed +1 compile 0m 21s the patch passed +1 javac 0m 21s the patch passed +1 checkstyle 0m 14s the patch passed +1 mvnsite 0m 26s the patch passed +1 mvneclipse 0m 10s the patch passed +1 whitespace 0m 0s The patch has no whitespace issues. +1 findbugs 0m 54s the patch passed +1 javadoc 0m 19s the patch passed +1 unit 2m 4s hadoop-mapreduce-client-core in the patch passed. +1 asflicense 0m 15s The patch does not generate ASF License warnings. 15m 43s Subsystem Report/Notes Docker Image:yetus/hadoop:9560f25 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12824496/MAPREDUCE-6762.001.patch JIRA Issue MAPREDUCE-6762 Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle uname Linux 51cbe2969693 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux Build tool maven Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh git revision trunk / dbcaf99 Default Java 1.8.0_101 findbugs v3.0.0 Test Results https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6680/testReport/ modules C: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core U: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core Console output https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6680/console Powered by Apache Yetus 0.3.0 http://yetus.apache.org This message was automatically generated.
          Hide
          cheersyang Weiwei Yang added a comment -

          Attached v1 patch, including the test case to simulate this issue, and a simple fix to get rid of this NPE.

          Show
          cheersyang Weiwei Yang added a comment - Attached v1 patch, including the test case to simulate this issue, and a simple fix to get rid of this NPE.
          Hide
          cheersyang Weiwei Yang added a comment -

          The error we saw from Pig console

          2016-07-20 07:28:13,625 [uber-SubtaskRunner] INFO  org.apache.hadoop.mapred.ClientServiceDelegate  - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
          2016-07-20 07:28:16,252 [JobControl] ERROR org.apache.pig.backend.hadoop23.PigJobControl  - Error while trying to run jobs.
          java.lang.NullPointerException
          	at org.apache.hadoop.mapreduce.Job.getJobName(Job.java:426)
          	at org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob.toString(ControlledJob.java:93)
          	at java.lang.String.valueOf(String.java:2982)
          	at java.lang.StringBuilder.append(StringBuilder.java:131)
          	at org.apache.pig.backend.hadoop23.PigJobControl.run(PigJobControl.java:182)
          	at java.lang.Thread.run(Thread.java:745)
          	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$1.run(MapReduceLauncher.java:276)
          

          The error we saw from app-master log (indicating that failure when writing job meta files)

          2016-08-10 07:46:54,862 INFO [Thread-1245] org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.mapreduce.v2.app.MRAppMaster failed in state STOPPED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.net.SocketTimeoutException: 70000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.87.225.170:40913 remote=/10.87.225.174:50010]
          org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.net.SocketTimeoutException: 70000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.87.225.170:40913 remote=/10.87.225.174:50010]
                  at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.handleEvent(JobHistoryEventHandler.java:580)
                  at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.serviceStop(JobHistoryEventHandler.java:374)
                  at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
                  at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
                  at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
                  at org.apache.hadoop.service.CompositeService.stop(CompositeService.java:157)
                  at org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:131)
                  at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.serviceStop(MRAppMaster.java:1626)
                  at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
                  at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.stop(MRAppMaster.java:1126)
                  at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.shutDownJob(MRAppMaster.java:561)
                  at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobFinishEventHandler$1.run(MRAppMaster.java:609)
          Caused by: java.net.SocketTimeoutException: 70000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.87.225.170:40913 remote=/10.87.225.174:50010]
                  at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
                  at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
                  at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
                  at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
                  at java.io.FilterInputStream.read(FilterInputStream.java:83)
                  at java.io.FilterInputStream.read(FilterInputStream.java:83)
                  at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2278)
                  at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:1020)
                  at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:990)
                  at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1131)
                  at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:876)
                  at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:402)
          

          So the cause looks like

          • Datanode was too busy to answer JHS's request to flush job meta files
          • Job meta files missing
          • Job client failed to get job status update
          • Job.status resets to null
          • Job.getJobName failed with NPE
          Show
          cheersyang Weiwei Yang added a comment - The error we saw from Pig console 2016-07-20 07:28:13,625 [uber-SubtaskRunner] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server 2016-07-20 07:28:16,252 [JobControl] ERROR org.apache.pig.backend.hadoop23.PigJobControl - Error while trying to run jobs. java.lang.NullPointerException at org.apache.hadoop.mapreduce.Job.getJobName(Job.java:426) at org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob.toString(ControlledJob.java:93) at java.lang. String .valueOf( String .java:2982) at java.lang.StringBuilder.append(StringBuilder.java:131) at org.apache.pig.backend.hadoop23.PigJobControl.run(PigJobControl.java:182) at java.lang. Thread .run( Thread .java:745) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$1.run(MapReduceLauncher.java:276) The error we saw from app-master log (indicating that failure when writing job meta files) 2016-08-10 07:46:54,862 INFO [ Thread -1245] org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.mapreduce.v2.app.MRAppMaster failed in state STOPPED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.net.SocketTimeoutException: 70000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.87.225.170:40913 remote=/10.87.225.174:50010] org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.net.SocketTimeoutException: 70000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.87.225.170:40913 remote=/10.87.225.174:50010] at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.handleEvent(JobHistoryEventHandler.java:580) at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.serviceStop(JobHistoryEventHandler.java:374) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.CompositeService.stop(CompositeService.java:157) at org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:131) at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.serviceStop(MRAppMaster.java:1626) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.stop(MRAppMaster.java:1126) at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.shutDownJob(MRAppMaster.java:561) at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobFinishEventHandler$1.run(MRAppMaster.java:609) Caused by: java.net.SocketTimeoutException: 70000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.87.225.170:40913 remote=/10.87.225.174:50010] at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118) at java.io.FilterInputStream.read(FilterInputStream.java:83) at java.io.FilterInputStream.read(FilterInputStream.java:83) at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2278) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:1020) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:990) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1131) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:876) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:402) So the cause looks like Datanode was too busy to answer JHS's request to flush job meta files Job meta files missing Job client failed to get job status update Job.status resets to null Job.getJobName failed with NPE

            People

            • Assignee:
              cheersyang Weiwei Yang
              Reporter:
              cheersyang Weiwei Yang
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development