  1. Hadoop Map/Reduce
  2. MAPREDUCE-7015

Possible race condition in JHS if the job is not loaded


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.1.0, 3.0.1, 2.10.0
    • Component/s: jobhistoryserver
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      There could be a race condition inside JHS. In our build environment, TestMRJobClient.testJobClient() failed with this exception:

      java.io.FileNotFoundException: File does not exist: hdfs://localhost:32836/tmp/hadoop-yarn/staging/history/done_intermediate/jenkins/job_1509975084722_0001_conf.xml
      	at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1266)
      	at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1258)
      	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
      	at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1258)
      	at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:340)
      	at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:292)
      	at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2123)
      	at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2092)
      	at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2068)
      	at org.apache.hadoop.mapreduce.tools.CLI.run(CLI.java:460)
      	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
      	at org.apache.hadoop.mapreduce.TestMRJobClient.runTool(TestMRJobClient.java:94)
      	at org.apache.hadoop.mapreduce.TestMRJobClient.testConfig(TestMRJobClient.java:551)
      	at org.apache.hadoop.mapreduce.TestMRJobClient.testJobClient(TestMRJobClient.java:167)
      

      Root cause:
      1. MapReduce job completes
      2. CLI calls cluster.getJob(jobid)
      3. The job is finished and the client side gets redirected to JHS
      4. The job data is missing from CachedHistoryStorage so JHS tries to find the job
      5. First it scans the intermediate directory and finds the job
      6. The call moveToDone() is scheduled for execution on a separate thread inside moveToDoneExecutor and it starts to run immediately
      7. RPC invocation returns with the path pointing to /tmp/hadoop-yarn/staging/history/done_intermediate
      8. The call to moveToDone() completes which moves the contents of done_intermediate to done
      9. Hadoop CLI tries to download the config file from done_intermediate but it's no longer there

      Usually the moveToDone() call scheduled in step #6 is slow enough that it completes only after the client has downloaded the file in step #9, but sometimes it is faster, causing this race condition.
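
      The interleaving above can be reproduced outside Hadoop. The following is only a minimal sketch: it uses java.nio on the local file system instead of HDFS, and the class name MoveRaceSketch, the method staleReadFails, the waitForMove flag and the file name job_0001_conf.xml are all hypothetical. The waitForMove branch illustrates one possible mitigation (return the path only after the move has finished); it is not necessarily what the attached patches do.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class MoveRaceSketch {

    /**
     * Simulates steps 5-9 of the root cause on the local file system.
     * Returns true if the path handed to the "client" no longer exists
     * by the time the client reads it.
     */
    public static boolean staleReadFails(boolean waitForMove) {
        // Stand-in for moveToDoneExecutor (assumption: a single worker thread).
        ExecutorService moveExecutor = Executors.newSingleThreadExecutor();
        try {
            Path root = Files.createTempDirectory("history");
            Path intermediate = Files.createDirectory(root.resolve("done_intermediate"));
            Path done = Files.createDirectory(root.resolve("done"));
            Path conf = Files.write(intermediate.resolve("job_0001_conf.xml"),
                    "<configuration/>".getBytes(StandardCharsets.UTF_8));

            // Step 6: the move is scheduled on a separate thread.
            Future<Path> move = moveExecutor.submit(
                    () -> Files.move(conf, done.resolve(conf.getFileName())));

            Path returned;
            if (waitForMove) {
                // Hypothetical mitigation: hand out the final location
                // only once the move has completed.
                returned = move.get();
            } else {
                // Step 7: the intermediate path is returned right away.
                returned = conf;
                // Step 8: force the losing interleaving deterministically
                // by letting the move finish before the client reads.
                move.get();
            }

            // Step 9: the client tries to download the config file.
            try {
                Files.readAllBytes(returned);
                return false;
            } catch (NoSuchFileException e) {
                return true;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            moveExecutor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("stale path missing without waiting: " + staleReadFails(false));
        System.out.println("stale path missing when waiting:    " + staleReadFails(true));
    }
}
```

      Without waiting, the read fails in the same way as the FileNotFoundException in the stack trace; with waiting, the client receives a path under done and the read succeeds. The sketch leaves its temporary directories behind for inspection.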

      Attachments

        1. MAPREDUCE-7015-POC01.patch
          6 kB
          Peter Bacsko
        2. MAPREDUCE-7015-POC02.patch
          4 kB
          Peter Bacsko
        3. MAPREDUCE-7015-001.patch
          6 kB
          Peter Bacsko

            People

              Assignee: Peter Bacsko (pbacsko)
              Reporter: Peter Bacsko (pbacsko)
              Votes: 0
              Watchers: 5
