Hadoop Map/Reduce
MAPREDUCE-5528

TeraSort fails with "can't read paritions file" - does not read partition file from distributed cache


Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 0.20.2, 2.5.0, 2.4.1, 2.6.0, 3.0.0-alpha1
    • Fix Version/s: None
    • Component/s: examples
    • Labels: None

    Description

      I was trying to run TeraSort against a parallel networked file system, setting things up via the 'file://' scheme. I always got the following error when running terasort:

      13/09/23 11:15:12 INFO mapreduce.Job: Task Id : attempt_1379960046506_0001_m_000080_1, Status : FAILED
      Error: java.lang.IllegalArgumentException: can't read paritions file
              at org.apache.hadoop.examples.terasort.TeraSort$TotalOrderPartitioner.setConf(TeraSort.java:254)
              at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
              at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
              at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:678)
              at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:747)
              at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
              at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:171)
              at java.security.AccessController.doPrivileged(Native Method)
              at javax.security.auth.Subject.doAs(Subject.java:396)
              at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1499)
              at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:166)
      Caused by: java.io.FileNotFoundException: File _partition.lst does not exist
              at org.apache.hadoop.fs.Stat.parseExecResult(Stat.java:124)
              at org.apache.hadoop.util.Shell.runCommand(Shell.java:486)
              at org.apache.hadoop.util.Shell.run(Shell.java:417)
              at org.apache.hadoop.fs.Stat.getFileStatus(Stat.java:74)
              at org.apache.hadoop.fs.RawLocalFileSystem.getNativeFileLinkStatus(RawLocalFileSystem.java:808)
              at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:740)
              at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:525)
              at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:398)
              at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:137)
              at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
              at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:763)
              at org.apache.hadoop.examples.terasort.TeraSort$TotalOrderPartitioner.readPartitions(TeraSort.java:161)
              at org.apache.hadoop.examples.terasort.TeraSort$TotalOrderPartitioner.setConf(TeraSort.java:246)
              ... 10 more
      

      After digging into TeraSort, I noticed that the partition file is created in the output directory and then added to the distributed cache:

      Path outputDir = new Path(args[1]);
      ...
      Path partitionFile = new Path(outputDir, TeraInputFormat.PARTITION_FILENAME);
      ...
      job.addCacheFile(partitionUri);
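
      For context, the elided line that builds partitionUri presumably attaches a '#' fragment so the cached file is localized under the name _partition.lst; a reconstruction of the surrounding code (based on the released sources, not quoted from the attached patch) looks roughly like this:

      // Inside TeraSort.run(String[] args), which is declared to throw
      // Exception, so the URISyntaxException from new URI(...) propagates.
      Path outputDir = new Path(args[1]);
      Path partitionFile = new Path(outputDir, TeraInputFormat.PARTITION_FILENAME);
      // The '#' fragment names the localized copy, i.e. "_partition.lst"
      URI partitionUri = new URI(partitionFile.toString() +
                                 "#" + TeraInputFormat.PARTITION_FILENAME);
      TeraInputFormat.writePartitionFile(job, partitionFile);
      job.addCacheFile(partitionUri);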
      

      However, on the task side the partition file doesn't seem to be read back from either the output directory or the distributed cache:

      FileSystem fs = FileSystem.getLocal(conf);
      ...
      Path partFile = new Path(TeraInputFormat.PARTITION_FILENAME);
      splitPoints = readPartitions(fs, partFile, conf);
      

      Instead, the file is read from whatever the working directory happens to be for the file system returned by FileSystem.getLocal(conf).
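
      A quick way to see where that relative lookup goes is to qualify the path against the local file system (a hypothetical diagnostic, not code from TeraSort; conf is assumed to be the task's Configuration):

      FileSystem localFs = FileSystem.getLocal(conf);
      // Relative paths are resolved against the local FS working directory,
      // which is exactly the lookup the failing code performs.
      System.out.println("working dir: " + localFs.getWorkingDirectory());
      System.out.println("resolves to: " +
          localFs.makeQualified(new Path(TeraInputFormat.PARTITION_FILENAME)));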

      Under HDFS this code works; the working directory appears to resolve to the task's localized distributed-cache directory (presumably by default).

      But when I set things up against the networked file system with the 'file://' scheme, the working directory was the directory I was running my Hadoop binaries out of.

      The attached patch fixed things for me. It always reads the partition file from the distributed cache, instead of trusting the layers underneath to resolve the relative path. That seems like the right thing to do, though I'd appreciate a sanity check.
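
      For illustration, here is a sketch of that approach (a reconstruction of the idea, not the attached patch verbatim; it assumes the surrounding TotalOrderPartitioner class and the pre-deprecation org.apache.hadoop.filecache.DistributedCache API):

      public void setConf(Configuration conf) {
        try {
          FileSystem fs = FileSystem.getLocal(conf);
          Path partFile = null;
          // Resolve the localized partition file through the distributed
          // cache instead of assuming a working-directory-relative name.
          Path[] localFiles = DistributedCache.getLocalCacheFiles(conf);
          if (localFiles != null) {
            for (Path p : localFiles) {
              if (TeraInputFormat.PARTITION_FILENAME.equals(p.getName())) {
                partFile = p;
                break;
              }
            }
          }
          if (partFile == null) {
            // Fall back to the old working-directory lookup.
            partFile = new Path(TeraInputFormat.PARTITION_FILENAME);
          }
          splitPoints = readPartitions(fs, partFile, conf);
        } catch (IOException ie) {
          // Message (including the "paritions" typo) matches the original source.
          throw new IllegalArgumentException("can't read paritions file", ie);
        }
      }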

      Apologies: I was unable to reproduce this under the TeraSort example tests such as TestTeraSort.java, so no test is added; I'm not sure what the subtle difference in setup is. I tested under both HDFS and the 'file' scheme, and the patch worked under both.

      Attachments

        1. MAPREDUCE-5528.patch (1 kB, Albert Chu)


          People

            Assignee: Albert Chu (chu11)
            Reporter: Albert Chu (chu11)

