Hadoop Map/Reduce: MAPREDUCE-5050

Cannot find partition.lst in Terasort on Hadoop/Local File System

    Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 0.20.2
    • Fix Version/s: None
    • Component/s: examples
    • Labels: None
    • Environment: Cloudera VM CDH3u4, VMWare, Linux, Java SE 1.6.0_31-b04

    Description

      I'm trying to simulate running Hadoop on Lustre by configuring it to use the local file system, using a single Cloudera VM (CDH3u4).
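
      (For context, pointing a 0.20-era Hadoop at the local file system essentially
      means setting fs.default.name to file:///. The sketch below shows that kind of
      setup with a hypothetical data path; the exact values on the VM may differ.)

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class LocalFsCheck {
          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Use the local file system instead of HDFS.
            // ("fs.default.name" is the 0.20-era key, later renamed fs.defaultFS.)
            conf.set("fs.default.name", "file:///");

            FileSystem fs = FileSystem.get(conf);
            System.out.println(fs.getClass().getName()); // expect a local file system

            // Hypothetical location of the generated terasort input directory.
            Path partFile = new Path("/tmp/terasort-in/_partition.lst");
            System.out.println(fs.exists(partFile));
          }
        }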

      I can generate the data just fine, but when running the sorting portion of the program I get an error that the _partition.lst file cannot be found, even though it exists in the generated data directory.

      Perusing the TeraSort code, I see that the run method builds its Path reference to _partition.lst with the input directory as the parent.

        public int run(String[] args) throws Exception {
          LOG.info("starting");
          JobConf job = (JobConf) getConf();
    >>    Path inputDir = new Path(args[0]);
    >>    inputDir = inputDir.makeQualified(inputDir.getFileSystem(job));
    >>    Path partitionFile = new Path(inputDir, TeraInputFormat.PARTITION_FILENAME);
          URI partitionUri = new URI(partitionFile.toString() +
                                     "#" + TeraInputFormat.PARTITION_FILENAME);
          TeraInputFormat.setInputPaths(job, new Path(args[0]));
          FileOutputFormat.setOutputPath(job, new Path(args[1]));
          job.setJobName("TeraSort");
          job.setJarByClass(TeraSort.class);
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(Text.class);
          job.setInputFormat(TeraInputFormat.class);
          job.setOutputFormat(TeraOutputFormat.class);
          job.setPartitionerClass(TotalOrderPartitioner.class);
          TeraInputFormat.writePartitionFile(job, partitionFile);
          DistributedCache.addCacheFile(partitionUri, job);
          DistributedCache.createSymlink(job);
          job.setInt("dfs.replication", 1);
          TeraOutputFormat.setFinalSync(job, true);
          JobClient.runJob(job);
          LOG.info("done");
          return 0;
        }
      

      But in the configure method, the Path is created from the bare filename alone, with no parent directory. (On a real cluster this presumably works because the DistributedCache symlinks _partition.lst into each task's working directory; when the job runs against the local file system that symlink apparently is not in place, so the bare relative name never resolves to the file in the generated data directory.)

        public void configure(JobConf job) {
          try {
            FileSystem fs = FileSystem.getLocal(job);
    >>      Path partFile = new Path(TeraInputFormat.PARTITION_FILENAME);
            splitPoints = readPartitions(fs, partFile, job);
            trie = buildTrie(splitPoints, 0, splitPoints.length, new Text(), 2);
          } catch (IOException ie) {
            throw new IllegalArgumentException("can't read paritions file", ie);
          }
        }
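
      The distinction matters because a bare relative Path such as new
      Path("_partition.lst") is resolved against the file system's working
      directory, not against the job's input directory. A small standalone
      sketch of the difference (directory names are hypothetical):

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class PathResolutionDemo {
          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem localFs = FileSystem.getLocal(conf);

            // A bare relative Path resolves against the working directory.
            Path bare = new Path("_partition.lst");
            System.out.println(localFs.makeQualified(bare));
            // e.g. file:/home/user/_partition.lst (wherever the task runs)

            // A Path built with an explicit parent resolves inside that parent.
            Path anchored = new Path(new Path("/tmp/terasort-in"), "_partition.lst");
            System.out.println(localFs.makeQualified(anchored));
            // file:/tmp/terasort-in/_partition.lst
          }
        }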
      

      I modified the code as follows, and now the sorting portion of the TeraSort test works on the local file system. I believe the code above is a bug.

        public void configure(JobConf job) {
          try {
            FileSystem fs = FileSystem.getLocal(job);
    >>      Path[] inputPaths = TeraInputFormat.getInputPaths(job);
    >>      Path partFile = new Path(inputPaths[0], TeraInputFormat.PARTITION_FILENAME);
            splitPoints = readPartitions(fs, partFile, job);
            trie = buildTrie(splitPoints, 0, splitPoints.length, new Text(), 2);
          } catch (IOException ie) {
            throw new IllegalArgumentException("can't read paritions file", ie);
          }
        }
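
      (One possible hardening of this patch, sketched here as a hypothetical
      variant rather than part of the change above: fall back to the bare
      filename when no input paths are configured, preserving the
      DistributedCache symlink lookup on a real cluster.)

        // Hypothetical defensive variant of the fix above.
        Path[] inputPaths = TeraInputFormat.getInputPaths(job);
        Path partFile = (inputPaths != null && inputPaths.length > 0)
            ? new Path(inputPaths[0], TeraInputFormat.PARTITION_FILENAME)
            : new Path(TeraInputFormat.PARTITION_FILENAME);  // original lookup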
      

  Issue Links

    • This issue duplicates MAPREDUCE-5528

  Activity

          Albert Chu made changes -
            Link: This issue duplicates MAPREDUCE-5528 [ MAPREDUCE-5528 ]
          Matt Parker created issue -

    People

    • Assignee: Unassigned
    • Reporter: Matt Parker
    • Votes: 0
    • Watchers: 2
