Hadoop Map/Reduce / MAPREDUCE-5528

TeraSort fails with "can't read paritions file" - does not read partition file from distributed cache

    Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 0.20.2, 2.5.0, 2.4.1, 2.6.0, 3.0.0-alpha1
    • Fix Version/s: None
    • Component/s: examples
    • Labels: None

      Description

      I was trying to run TeraSort against a parallel networked file system, setting things up via the 'file://' scheme. I always got the following error when running TeraSort:

      13/09/23 11:15:12 INFO mapreduce.Job: Task Id : attempt_1379960046506_0001_m_000080_1, Status : FAILED
      Error: java.lang.IllegalArgumentException: can't read paritions file
              at org.apache.hadoop.examples.terasort.TeraSort$TotalOrderPartitioner.setConf(TeraSort.java:254)
              at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
              at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
              at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:678)
              at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:747)
              at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
              at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:171)
              at java.security.AccessController.doPrivileged(Native Method)
              at javax.security.auth.Subject.doAs(Subject.java:396)
              at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1499)
              at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:166)
      Caused by: java.io.FileNotFoundException: File _partition.lst does not exist
              at org.apache.hadoop.fs.Stat.parseExecResult(Stat.java:124)
              at org.apache.hadoop.util.Shell.runCommand(Shell.java:486)
              at org.apache.hadoop.util.Shell.run(Shell.java:417)
              at org.apache.hadoop.fs.Stat.getFileStatus(Stat.java:74)
              at org.apache.hadoop.fs.RawLocalFileSystem.getNativeFileLinkStatus(RawLocalFileSystem.java:808)
              at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:740)
              at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:525)
              at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:398)
              at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:137)
              at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
              at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:763)
              at org.apache.hadoop.examples.terasort.TeraSort$TotalOrderPartitioner.readPartitions(TeraSort.java:161)
              at org.apache.hadoop.examples.terasort.TeraSort$TotalOrderPartitioner.setConf(TeraSort.java:246)
              ... 10 more
      

      After digging into TeraSort, I noticed that the partitions file is created in the output directory and then added to the distributed cache:

      Path outputDir = new Path(args[1]);
      ...
      Path partitionFile = new Path(outputDir, TeraInputFormat.PARTITION_FILENAME);
      ...
      job.addCacheFile(partitionUri);
      

      but the partitions file doesn't seem to be read back from either the output directory or the distributed cache:

      FileSystem fs = FileSystem.getLocal(conf);
      ...
      Path partFile = new Path(TeraInputFormat.PARTITION_FILENAME);
      splitPoints = readPartitions(fs, partFile, conf);
      

      It seems the file is being read relative to whatever the working directory is for the filesystem returned by FileSystem.getLocal(conf).

      Under HDFS this code works; the working directory appears to be the task's distributed cache directory (by default, I assume).

      But when I set things up with the networked file system and 'file://' scheme, the working directory was the directory I was running my Hadoop binaries out of.
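
      The failure mode can be illustrated outside Hadoop: a bare relative filename resolves against whatever the process's current working directory happens to be, so whether _partition.lst is found depends entirely on where the task runs. A minimal plain-Java sketch (not Hadoop code; the filename is the only detail taken from the issue):

      ```java
      import java.nio.file.Files;
      import java.nio.file.Path;
      import java.nio.file.Paths;

      public class RelativePathDemo {
          public static void main(String[] args) {
              // A bare relative name, analogous to
              // new Path(TeraInputFormat.PARTITION_FILENAME) in TeraSort.
              Path partFile = Paths.get("_partition.lst");

              // Resolution depends entirely on the JVM's working directory:
              Path resolved = partFile.toAbsolutePath();
              System.out.println("would look for: " + resolved);

              // If the working directory is the task's localized cache dir,
              // the file is found; if it is, say, the directory the binaries
              // were launched from, it is not.
              System.out.println("exists: " + Files.exists(resolved));
          }
      }
      ```

      The same lookup succeeds or fails purely based on the launch directory, which matches the behavior described above.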

      The attached patch fixed things for me. It always grabs the partition file from the distributed cache, instead of trusting the layers underneath to resolve it. That seems to be the right thing to do?
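
      The patch itself is attached to the issue; its core idea (select the cached copy by filename instead of relying on the working directory) can be sketched in plain Java, with the example paths and helper name being hypothetical:

      ```java
      import java.util.Arrays;

      public class CacheLookupSketch {
          static final String PARTITION_FILENAME = "_partition.lst";

          // Given the task's localized cache file paths (conceptually, what
          // the patch obtains from the distributed cache), pick the one whose
          // final path component is the partition file.
          static String findPartitionFile(String[] localPaths) {
              return Arrays.stream(localPaths)
                  .filter(p -> p.substring(p.lastIndexOf('/') + 1)
                                .equals(PARTITION_FILENAME))
                  .findFirst()
                  .orElseThrow(() -> new IllegalArgumentException(
                      "can't read paritions file")); // TeraSort's (typo'd) message
          }

          public static void main(String[] args) {
              String[] cached = {
                  "/tmp/nm-local-dir/filecache/10/_partition.lst",
                  "/tmp/nm-local-dir/filecache/11/job.jar"
              };
              System.out.println(findPartitionFile(cached));
          }
      }
      ```

      This makes the lookup independent of the working directory, which is why it behaves the same under both HDFS and 'file://'.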

      Apologies: I was unable to reproduce this under the TeraSort example tests, such as TestTeraSort.java, so no test is added. I'm not sure what the subtle difference in setup is. I tested under both the HDFS and 'file' schemes, and the patch worked under both.

          Activity

          ehiggs Ewan Higgs added a comment -

          I have reproduced this issue.
          ehiggs Ewan Higgs added a comment -

          I can confirm that this patch fixes the issue.

          @Albert Chu, thanks for submitting the fix!
          ehiggs Ewan Higgs added a comment -

          Can we merge this?
          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12604700/MAPREDUCE-5528.patch
          against trunk revision 3411732.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          -1 javac. The applied patch generated 1158 javac compiler warnings (more than the trunk's current 1155 warnings).

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-examples.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5306//testReport/
          Javac warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5306//artifact/patchprocess/diffJavacWarnings.txt
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5306//console

          This message is automatically generated.

          ozawa Tsuyoshi Ozawa added a comment -

          Ewan Higgs Albert Chu thank you for taking this issue. Could you update the following line to use job.getLocalCacheFiles() instead of DistributedCache.getLocalCacheFiles, since the latter is deprecated?

          +        Path[] localPaths = DistributedCache.getLocalCacheFiles(conf);
          chu11 Albert Chu added a comment -

          Sure, I'll update the patch and retest just to make sure everything is fine and dandy.
          chu11 Albert Chu added a comment -

          Since I wrote my original patch, DistributedCache has been deprecated in favor of using the job Context. Unfortunately, in TeraSort there is no mapper or reducer; all of the sorting is handled via the partitioner. As far as I can tell, the Job context can't be accessed in the partitioner. Because of that, this really can't be handled through the patch I had before, assuming we don't want to use deprecated code. Using the basic idea from Ewan Higgs in MAPREDUCE-5050 wouldn't have worked, because the JobContext is again needed in newer versions of FileOutputFormat.

          I was trying to think of a clean way to do this, but nothing came to mind in any of the approaches I tried. I might just be missing something that others would see.

          Open to suggestions.
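
          One avenue the discussion above suggests: since a Configurable partitioner only ever receives a Configuration in setConf, the localized cache file list would have to come from a configuration property rather than the JobContext. I believe DistributedCache.getLocalCacheFiles itself reads the comma-separated property mapreduce.job.cache.local.files; treat both the key name and the value format as assumptions to verify against the Hadoop source. The lookup logic, sketched in plain Java with a Map standing in for Configuration:

          ```java
          import java.util.HashMap;
          import java.util.Map;

          public class ConfCacheSketch {
              // The property key is an assumption (believed to be what
              // DistributedCache reads under the covers); the Map is a
              // stand-in for Hadoop's Configuration.
              static final String KEY = "mapreduce.job.cache.local.files";

              static String findPartitionFile(Map<String, String> conf, String filename) {
                  String list = conf.getOrDefault(KEY, "");
                  for (String path : list.split(",")) {
                      // Match on the final path component only.
                      if (!path.isEmpty()
                              && path.substring(path.lastIndexOf('/') + 1).equals(filename)) {
                          return path;
                      }
                  }
                  return null; // caller decides how to fail
              }

              public static void main(String[] args) {
                  Map<String, String> conf = new HashMap<>();
                  conf.put(KEY, "/local/filecache/10/_partition.lst,/local/filecache/11/job.jar");
                  System.out.println(findPartitionFile(conf, "_partition.lst"));
              }
          }
          ```

          Whether reading that property directly counts as "non-deprecated" is debatable, since it relies on an internal key; it is offered only as a sketch of the shape a fix could take.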

            People

            • Assignee: chu11 Albert Chu
            • Reporter: chu11 Albert Chu
            • Votes: 1
            • Watchers: 5
