Uploaded image for project: 'Crunch (Retired)'
  1. Crunch (Retired)
  2. CRUNCH-227

Write to sequence file ignores destination path.

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Not A Problem
    • 0.6.0, 0.7.0
    • 0.8.4, 0.12.0
    • IO
    • None
    • Hadoop 1.0.3

    Description

      I'm trying to write a simple Crunch job that outputs a sequence file consisting of a custom Writable.

      The job runs successfully, but the output is not written to the path that I specify in To.sequenceFile(), but instead to a Crunch working directory.

      This happens when running the job both locally and on my 1-node Hadoop
      test cluster, and it happens both with Crunch 0.6.0 and 0.7.0-SNAPSHOT as of today (38a97e5).

      When using pipeline.done() instead of pipeline.run(), the Crunch working directory gets removed after execution, in that case, the output is not retained at all.

      Code snippet:

      public int run(String[] args) throws IOException {
        CommandLine cl = parseCommandLine(args);
        Path output = new Path((String) cl.getValue(OUTPUT_OPTION));
        int docIdIndex = getColumnIndex(cl, "DocID");
        int ldaIndex = getColumnIndex(cl, "LDA");
      
        Pipeline pipeline = new MRPipeline(DbDumpToSeqFile.class);
        pipeline.setConfiguration(getConf());
        PCollection<String> lines = pipeline.readTextFile((String)
      cl.getValue(INPUT_OPTION));
        PTable<String, NamedQuantizedVecWritable> vectors = lines.parallelDo(
          new ConvertToSeqFileDoFn(docIdIndex, ldaIndex),
          tableOf(strings(), writables(NamedQuantizedVecWritable.class)));
      
        vectors.write(To.sequenceFile(output));
      
        PipelineResult res = pipeline.run();
        return res.succeeded() ? 0 : 1;
      }
      

      Log output from local run.
      Note how the intended output path "/tmp/foo.seq" is reported in the
      execution plan,
      is not actually used.

      2013-06-25 16:19:44.250 java[10755:1203] Unable to load realm info
      from SCDynamicStore
      2013-06-25 16:19:44 HadoopUtil:185 [INFO] Deleting /tmp/foo.seq
      2013-06-25 16:19:44 FileTargetImpl:224 [INFO] Will write output files
      to new path: /tmp/foo.seq
      2013-06-25 16:19:45 JobClient:741 [WARN] No job jar file set.  User
      classes may not be found. See JobConf(Class) or
      JobConf#setJar(String).
      2013-06-25 16:19:45 FileInputFormat:237 [INFO] Total input paths to process : 1
      2013-06-25 16:19:45 TrackerDistributedCacheManager:407 [INFO] Creating
      MAP in /tmp/hadoop-florian/mapred/local/archive/4100035173370108016_-456151549_2075417214/file/tmp/crunch-1128974463/p1-work--1596891011522800122
      with rwxr-xr-x
      2013-06-25 16:19:45 TrackerDistributedCacheManager:447 [INFO] Cached
      /tmp/crunch-1128974463/p1/MAP as
      /tmp/hadoop-florian/mapred/local/archive/4100035173370108016_-456151549_2075417214/file/tmp/crunch-1128974463/p1/MAP
      2013-06-25 16:19:45 TrackerDistributedCacheManager:470 [INFO] Cached
      /tmp/crunch-1128974463/p1/MAP as
      /tmp/hadoop-florian/mapred/local/archive/4100035173370108016_-456151549_2075417214/file/tmp/crunch-1128974463/p1/MAP
      
      2013-06-25 16:19:45 CrunchControlledJob:303 [INFO] Running job
      "com.issuu.mahout.utils.DbDumpToSeqFile:
      Text(/Users/florian/data/docdb.first20.txt)+S0+SeqFile(/tmp/foo.seq)"
      
      2013-06-25 16:19:45 CrunchControlledJob:304 [INFO] Job status
      available at: http://localhost:8080/
      2013-06-25 16:19:45 Task:792 [INFO] Task:attempt_local_0001_m_000000_0
      is done. And is in the process of commiting
      2013-06-25 16:19:45 LocalJobRunner:321 [INFO]
      2013-06-25 16:19:45 Task:945 [INFO] Task attempt_local_0001_m_000000_0
      is allowed to commit now
      
      2013-06-25 16:19:45 FileOutputCommitter:173 [INFO] Saved output of
      task 'attempt_local_0001_m_000000_0' to
      /tmp/crunch-1128974463/p1/output
      
      2013-06-25 16:19:48 LocalJobRunner:321 [INFO]
      2013-06-25 16:19:48 Task:904 [INFO] Task 'attempt_local_0001_m_000000_0' done.
      

      This crude patch makes the output end up at the right place,
      but breaks a lot of other tests.

      --- a/crunch-core/src/main/java/org/apache/crunch/io/impl/FileTargetImpl.java
      +++ b/crunch-core/src/main/java/org/apache/crunch/io/impl/FileTargetImpl.java
      @@ -66,7 +66,7 @@ public class FileTargetImpl implements PathTarget {
         protected void configureForMapReduce(Job job, Class keyClass, Class
      valueClass,
             Class outputFormatClass, Path outputPath, String name) {
           try {
      -      FileOutputFormat.setOutputPath(job, outputPath);
      +      FileOutputFormat.setOutputPath(job, path);
           } catch (Exception e) {
             throw new RuntimeException(e);
           }
      

      Attachments

        1. 0001-CRUNCH-227-Added-test-that-shows-ToolRunner-does-not.patch
          5 kB
          Micah Whitacre
        2. CRUNCH-227_tests.patch
          8 kB
          Micah Whitacre
        3. CRUNCH-227.patch
          3 kB
          Josh Wills

        Activity

          People

            mkwhitacre Micah Whitacre
            florian_laws Florian Laws
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: