Hadoop Map/Reduce / MAPREDUCE-5001

LocalJobRunner has race condition resulting in job failures

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.2-alpha
    • Fix Version/s: 0.23.10, 2.1.1-beta
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      Hive is hitting a race condition between the LocalJobRunner and the Cluster class. The JobClient uses the Cluster class to obtain Job objects. The Cluster class uses the job.xml file to populate the JobConf object (https://github.com/apache/hadoop-common/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/Cluster.java#L184). However, this file is deleted by the LocalJobRunner at the end of its job (https://github.com/apache/hadoop-common/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common/src/main/java/org/apache/hadoop/mapred/LocalJobRunner.java#L484).
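      The failure mode is a classic check-then-read (TOCTOU) race: the configuration loader sees that job.xml exists, the LocalJobRunner's cleanup deletes it, and the subsequent open throws. A minimal, JDK-only sketch of the pattern (illustrative class and method names, not Hadoop code):

      ```java
      import java.io.FileInputStream;
      import java.io.FileNotFoundException;
      import java.io.IOException;
      import java.io.UncheckedIOException;
      import java.nio.file.Files;
      import java.nio.file.Path;

      public class ToctouDemo {
          // Returns true when the existence check passed but the read still
          // failed -- the essence of the race hit by the Cluster class.
          static boolean demonstrate() {
              try {
                  Path conf = Files.createTempFile("job", ".xml");
                  boolean existedAtCheck = Files.exists(conf); // the "check"
                  Files.delete(conf); // cleanup wins the race before the read
                  try (FileInputStream in = new FileInputStream(conf.toFile())) {
                      return false; // read succeeded; no race visible
                  } catch (FileNotFoundException e) {
                      return existedAtCheck; // check said yes, read said no
                  }
              } catch (IOException e) {
                  throw new UncheckedIOException(e);
              }
          }

          public static void main(String[] args) {
              System.out.println(demonstrate()); // prints true
          }
      }
      ```

      In the real bug the deletion happens from another thread, so the window is narrow; the sketch simply forces the deletion between the check and the read to make the failure deterministic.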

      This results in the following exception:

      2013-02-11 14:45:17,755 (main) [FATAL - org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2001)] error parsing conf file:/tmp/hadoop-brock/mapred/staging/brock1916441210/.staging/job_local_0432/job.xml
      java.io.FileNotFoundException: /tmp/hadoop-brock/mapred/staging/brock1916441210/.staging/job_local_0432/job.xml (No such file or directory)
      	at java.io.FileInputStream.open(Native Method)
      	at java.io.FileInputStream.<init>(FileInputStream.java:120)
      	at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:1917)
      	at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:1870)
      	at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:1777)
      	at org.apache.hadoop.conf.Configuration.get(Configuration.java:712)
      	at org.apache.hadoop.mapred.JobConf.checkAndWarnDeprecation(JobConf.java:1951)
      	at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:398)
      	at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:388)
      	at org.apache.hadoop.mapred.JobClient$NetworkedJob.<init>(JobClient.java:174)
      	at org.apache.hadoop.mapred.JobClient.getJob(JobClient.java:655)
      	at org.apache.hadoop.mapred.JobClient.getJob(JobClient.java:668)
      	at org.apache.hadoop.mapreduce.TestMR2LocalMode.test(TestMR2LocalMode.java:40)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
      	at java.lang.reflect.Method.invoke(Method.java:597)
      	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
      	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
      	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
      	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
      	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263)
      	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68)
      	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47)
      	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231)
      	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60)
      	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229)
      	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50)
      	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222)
      	at org.junit.runners.ParentRunner.run(ParentRunner.java:300)
      	at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
      	at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
      	at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
      	at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
      	at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
      	at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)
      

      Here is code which exposes this race fairly quickly:

          // Reproducer from the description, with imports added for context.
          // This is a test-method body: it loops forever submitting local-mode
          // jobs, and JobClient.getJob() eventually re-reads a job.xml that
          // the LocalJobRunner has already deleted.
          import java.io.File;

          import org.apache.commons.io.FileUtils;
          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.Path;
          import org.apache.hadoop.mapred.FileInputFormat;
          import org.apache.hadoop.mapred.FileOutputFormat;
          import org.apache.hadoop.mapred.JobClient;
          import org.apache.hadoop.mapred.JobConf;
          import org.apache.hadoop.mapred.RunningJob;
          import org.junit.Assert;

          Configuration conf = new Configuration();
          conf.set("mapreduce.framework.name", "local");
          conf.set("mapreduce.jobtracker.address", "local");
          File inputDir = new File("/tmp", "input-" + System.currentTimeMillis());
          File outputDir = new File("/tmp", "output-" + System.currentTimeMillis());
          while (true) {
            Assert.assertTrue(inputDir.mkdirs());
            File inputFile = new File(inputDir, "file");
            FileUtils.copyFile(new File("/etc/passwd"), inputFile);
            Path input = new Path(inputDir.getAbsolutePath());
            Path output = new Path(outputDir.getAbsolutePath());
            JobConf jobConf = new JobConf(conf, TestMR2LocalMode.class);
            FileInputFormat.addInputPath(jobConf, input);
            FileOutputFormat.setOutputPath(jobConf, output);
            JobClient jobClient = new JobClient(conf);
            RunningJob runningJob = jobClient.submitJob(jobConf);
            // Busy-polls getJob(); each call re-parses job.xml and can race
            // with the LocalJobRunner's cleanup.
            while (!runningJob.isComplete()) {
              runningJob = jobClient.getJob(runningJob.getJobID());
            }
            FileUtils.deleteQuietly(inputDir);
            FileUtils.deleteQuietly(outputDir);
          }
      


          Activity

          sandyr Sandy Ryza added a comment -

          I'm not sure that getJob is meant to be defined if the job associated with the given job ID is not running. While it should not throw this exception, the correct behavior is probably for it to return null.

          The preferred way to check whether a job has completed would be

          runningJob.getJobState() == JobStatus.RUNNING
          

          More generally, runningJob.getJobStatus() should contain information about the job. Brock Noland, is what a JobStatus provides not sufficient?
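          For context, a null-tolerant polling loop along the lines suggested above might look like this JDK-only sketch, where the Supplier stands in for a lookup such as JobClient#getJob / getJobStatus (names and the string states are illustrative, not Hadoop APIs):

          ```java
          import java.util.Iterator;
          import java.util.List;
          import java.util.function.Supplier;

          public class PollSketch {
              // Polls the lookup until the job stops reporting RUNNING.
              // A null result is treated as "job finished and was cleaned up"
              // rather than as an error.
              static String waitForCompletion(Supplier<String> lookup) {
                  String state;
                  while ((state = lookup.get()) != null && state.equals("RUNNING")) {
                      // a real client would sleep/back off here
                  }
                  return state; // null means the runner already discarded the job
              }

              public static void main(String[] args) {
                  Iterator<String> states =
                      List.of("RUNNING", "RUNNING", "SUCCEEDED").iterator();
                  System.out.println(waitForCompletion(
                      () -> states.hasNext() ? states.next() : null));
                  // prints SUCCEEDED
              }
          }
          ```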

          brocknoland Brock Noland added a comment -

          Sandy,

          I agree, I think the getJob() method returning null for a job which does not exist is fine. Another issue I should have mentioned is:

          JobClient.get{Map,Reduce}TaskReports
          

          throws an exception for the same reason. HIVE-4009 would be a non-issue if these methods returned null since the repeated use of JobClient.getJob() can likely be removed.

          jlowe Jason Lowe added a comment -

          Sandy Ryza, do you have an ETA on a patch? Some of our Hive devs would love to see this fixed.

          sandyr Sandy Ryza added a comment -

          Uploading a patch that catches the exception in Cluster#getJob and returns null. An alternative approach I considered would be to modify ClientProtocol and get the job directly from the LocalJobRunner, which could do the appropriate synchronization.

          It's hard to write a test for because reproducing the issue requires the file to be deleted in between when Configuration#loadResource checks whether the file exists and when it tries to read it. I haven't had a chance to test it manually.

          I can't devote a ton of time to it in the near future, so feel free to take over if it's urgent for you.
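          The fix pattern described above (catch the failure to load job.xml and report the job as gone, i.e. return null) can be sketched in plain JDK terms; this illustrates the approach only and is not the actual Cluster#getJob code:

          ```java
          import java.io.BufferedReader;
          import java.io.File;
          import java.io.FileNotFoundException;
          import java.io.FileReader;
          import java.io.IOException;
          import java.io.UncheckedIOException;

          public class GetJobSketch {
              // Loads the first line of a (hypothetical) job.xml. If the file
              // vanished because the job already finished and was cleaned up,
              // return null ("no such job") instead of propagating the error.
              static String getJobConf(File jobXml) {
                  try (BufferedReader r = new BufferedReader(new FileReader(jobXml))) {
                      return r.readLine();
                  } catch (FileNotFoundException e) {
                      return null; // job.xml deleted: treat the job as gone
                  } catch (IOException e) {
                      throw new UncheckedIOException(e); // other I/O errors still surface
                  }
              }

              public static void main(String[] args) {
                  System.out.println(getJobConf(new File("/nonexistent-dir/job.xml")));
                  // prints null
              }
          }
          ```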

          revans2 Robert Joseph Evans added a comment -

          Putting this in Patch Available to kick the automated build.

          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12592735/MAPREDUCE-5001.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3864//testReport/
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3864//console

          This message is automatically generated.

          jlowe Jason Lowe added a comment -

          +1, seems like a reasonable approach to try to fix this. Agree that modifying ClientProtocol is probably not something we want to mess with at this point. Committing this.

          hudson Hudson added a comment -

          SUCCESS: Integrated in Hadoop-trunk-Commit #4297 (See https://builds.apache.org/job/Hadoop-trunk-Commit/4297/)
          MAPREDUCE-5001. LocalJobRunner has race condition resulting in job failures. Contributed by Sandy Ryza (jlowe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1515863)

          • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/Cluster.java
          jlowe Jason Lowe added a comment -

          Thanks, Sandy! I committed this to trunk, branch-2, branch-2.1-beta, and branch-0.23.

          hudson Hudson added a comment -

          SUCCESS: Integrated in Hadoop-Yarn-trunk #308 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/308/)
          MAPREDUCE-5001. LocalJobRunner has race condition resulting in job failures. Contributed by Sandy Ryza (jlowe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1515863)

          • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/Cluster.java
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Hdfs-0.23-Build #706 (See https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/706/)
          svn merge -c 1515863 FIXES: MAPREDUCE-5001. LocalJobRunner has race condition resulting in job failures. Contributed by Sandy Ryza (jlowe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1515882)

          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/Cluster.java
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Mapreduce-trunk #1525 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1525/)
          MAPREDUCE-5001. LocalJobRunner has race condition resulting in job failures. Contributed by Sandy Ryza (jlowe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1515863)

          • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/Cluster.java

            People

            • Assignee: Sandy Ryza
            • Reporter: Brock Noland
            • Votes: 0
            • Watchers: 11
