  1. Hadoop Map/Reduce
  2. MAPREDUCE-7015

Possible race condition in JHS if the job is not loaded

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.1.0, 3.0.1, 2.10.0
    • Component/s: jobhistoryserver
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      There is a possible race condition inside the JobHistoryServer (JHS). In our build environment, TestMRJobClient.testJobClient() failed with this exception:

      java.io.FileNotFoundException: File does not exist: hdfs://localhost:32836/tmp/hadoop-yarn/staging/history/done_intermediate/jenkins/job_1509975084722_0001_conf.xml
      	at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1266)
      	at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1258)
      	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
      	at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1258)
      	at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:340)
      	at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:292)
      	at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2123)
      	at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2092)
      	at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2068)
      	at org.apache.hadoop.mapreduce.tools.CLI.run(CLI.java:460)
      	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
      	at org.apache.hadoop.mapreduce.TestMRJobClient.runTool(TestMRJobClient.java:94)
      	at org.apache.hadoop.mapreduce.TestMRJobClient.testConfig(TestMRJobClient.java:551)
      	at org.apache.hadoop.mapreduce.TestMRJobClient.testJobClient(TestMRJobClient.java:167)
      

      Root cause:
      1. MapReduce job completes
      2. CLI calls cluster.getJob(jobid)
      3. The job is finished and the client side gets redirected to JHS
      4. The job data is missing from CachedHistoryStorage so JHS tries to find the job
      5. First it scans the intermediate directory and finds the job
      6. A moveToDone() call is submitted to moveToDoneExecutor, where it starts running immediately on a separate thread
      7. RPC invocation returns with the path pointing to /tmp/hadoop-yarn/staging/history/done_intermediate
      8. The call to moveToDone() completes, moving the contents of done_intermediate to done
      9. Hadoop CLI tries to download the config file from done_intermediate, but it's no longer there

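      Usually the move scheduled in step #6 is slow enough that it completes only after the client has used the path returned in #7. Occasionally it finishes sooner, and the client then fails as described in step #9.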
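
      The ordering in steps 5 to 9 can be reproduced outside Hadoop with a small, self-contained Java sketch. It is only a toy model: ToyHistoryServer, lookupRacy(), lookupAfterMove() and the file names are hypothetical and are not taken from the JHS code or from the attached patches. The sketch waits for the move on purpose, so the unlucky interleaving happens every time. It also shows one possible way to avoid the stale path, namely returning the location only after the move has finished. This is an assumption for illustration, not a description of the committed fix.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class MoveToDoneRaceSketch {

    /** Toy stand-in for the history server; every name here is hypothetical. */
    static final class ToyHistoryServer implements AutoCloseable {
        private final Path intermediateDir;
        private final Path doneDir;
        private final ExecutorService moveToDoneExecutor = Executors.newSingleThreadExecutor();
        private Future<Path> pendingMove;

        ToyHistoryServer(Path root) throws IOException {
            intermediateDir = Files.createDirectories(root.resolve("done_intermediate"));
            doneDir = Files.createDirectories(root.resolve("done"));
        }

        /** Simulates a finished job writing its config file (step 1). */
        void publish(String name) throws IOException {
            Files.write(intermediateDir.resolve(name),
                    "<configuration/>".getBytes(StandardCharsets.UTF_8));
        }

        /** Steps 5-7: find the file, schedule the move, return the OLD path. */
        Path lookupRacy(String name) {
            Path found = intermediateDir.resolve(name);
            Callable<Path> move = () -> Files.move(found, doneDir.resolve(name));
            pendingMove = moveToDoneExecutor.submit(move);
            return found; // may be stale by the time the caller uses it
        }

        /** Alternative: block until the move is done and return the final path. */
        Path lookupAfterMove(String name) throws Exception {
            lookupRacy(name);
            return pendingMove.get();
        }

        /** Step 8: lets the caller force the move to complete before reading. */
        void awaitMove() throws Exception {
            pendingMove.get();
        }

        @Override
        public void close() {
            moveToDoneExecutor.shutdown();
        }
    }

    /** Step 9: the client tries to read the file at the path it was given. */
    static boolean clientCanRead(Path path) {
        try {
            Files.readAllBytes(path);
            return true;
        } catch (IOException e) { // NoSuchFileException when the file has moved
            return false;
        }
    }

    /** Returns {readable via stale path, readable via path returned after the move}. */
    static boolean[] demo() {
        try {
            Path root = Files.createTempDirectory("jhs-race");
            try (ToyHistoryServer jhs = new ToyHistoryServer(root)) {
                jhs.publish("job_1_conf.xml");
                Path stale = jhs.lookupRacy("job_1_conf.xml");
                jhs.awaitMove(); // force the unlucky order: step 8 before step 9
                boolean racy = clientCanRead(stale);

                jhs.publish("job_2_conf.xml");
                Path moved = jhs.lookupAfterMove("job_2_conf.xml");
                boolean safe = clientCanRead(moved);
                return new boolean[] {racy, safe};
            }
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] result = demo();
        System.out.println("read via stale path succeeded: " + result[0]);
        System.out.println("read via post-move path succeeded: " + result[1]);
    }
}
```

      In this model the read through the stale done_intermediate path fails, while the read through the path handed out after the move succeeds. Waiting inside the lookup trades a slightly slower RPC for a path that stays valid. Whether that trade-off is acceptable for the real JHS is a separate question.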

        Attachments

        1. MAPREDUCE-7015-001.patch
          6 kB
          Peter Bacsko
        2. MAPREDUCE-7015-POC02.patch
          4 kB
          Peter Bacsko
        3. MAPREDUCE-7015-POC01.patch
          6 kB
          Peter Bacsko

              People

              • Assignee: Peter Bacsko (pbacsko)
              • Reporter: Peter Bacsko (pbacsko)
              • Votes: 0
              • Watchers: 5
