Description
I'm trying to write a simple Crunch job that outputs a sequence file consisting of a custom Writable.
The job runs successfully, but the output is written not to the path I specify in To.sequenceFile(), but to a Crunch working directory instead.
This happens when running the job both locally and on my 1-node Hadoop test cluster, and it happens both with Crunch 0.6.0 and 0.7.0-SNAPSHOT as of today (38a97e5).
When using pipeline.done() instead of pipeline.run(), the Crunch working directory gets removed after execution; in that case, the output is not retained at all.
Code snippet:
—
public int run(String[] args) throws IOException {
  CommandLine cl = parseCommandLine(args);
  Path output = new Path((String) cl.getValue(OUTPUT_OPTION));
  int docIdIndex = getColumnIndex(cl, "DocID");
  int ldaIndex = getColumnIndex(cl, "LDA");

  Pipeline pipeline = new MRPipeline(DbDumpToSeqFile.class);
  pipeline.setConfiguration(getConf());
  PCollection<String> lines = pipeline.readTextFile((String) cl.getValue(INPUT_OPTION));
  PTable<String, NamedQuantizedVecWritable> vectors = lines.parallelDo(
      new ConvertToSeqFileDoFn(docIdIndex, ldaIndex),
      tableOf(strings(), writables(NamedQuantizedVecWritable.class)));
  vectors.write(To.sequenceFile(output));
  PipelineResult res = pipeline.run();
  return res.succeeded() ? 0 : 1;
}
—
Log output from a local run. Note how the intended output path "/tmp/foo.seq" is reported in the execution plan but is not actually used.
—
2013-06-25 16:19:44.250 java[10755:1203] Unable to load realm info from SCDynamicStore
2013-06-25 16:19:44 HadoopUtil:185 [INFO] Deleting /tmp/foo.seq
2013-06-25 16:19:44 FileTargetImpl:224 [INFO] Will write output files to new path: /tmp/foo.seq
2013-06-25 16:19:45 JobClient:741 [WARN] No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
2013-06-25 16:19:45 FileInputFormat:237 [INFO] Total input paths to process : 1
2013-06-25 16:19:45 TrackerDistributedCacheManager:407 [INFO] Creating MAP in /tmp/hadoop-florian/mapred/local/archive/4100035173370108016_-456151549_2075417214/file/tmp/crunch-1128974463/p1-work--1596891011522800122 with rwxr-xr-x
2013-06-25 16:19:45 TrackerDistributedCacheManager:447 [INFO] Cached /tmp/crunch-1128974463/p1/MAP as /tmp/hadoop-florian/mapred/local/archive/4100035173370108016_-456151549_2075417214/file/tmp/crunch-1128974463/p1/MAP
2013-06-25 16:19:45 TrackerDistributedCacheManager:470 [INFO] Cached /tmp/crunch-1128974463/p1/MAP as /tmp/hadoop-florian/mapred/local/archive/4100035173370108016_-456151549_2075417214/file/tmp/crunch-1128974463/p1/MAP
2013-06-25 16:19:45 CrunchControlledJob:303 [INFO] Running job "com.issuu.mahout.utils.DbDumpToSeqFile: Text(/Users/florian/data/docdb.first20.txt)+S0+SeqFile(/tmp/foo.seq)"
2013-06-25 16:19:45 CrunchControlledJob:304 [INFO] Job status available at: http://localhost:8080/
2013-06-25 16:19:45 Task:792 [INFO] Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
2013-06-25 16:19:45 LocalJobRunner:321 [INFO]
2013-06-25 16:19:45 Task:945 [INFO] Task attempt_local_0001_m_000000_0 is allowed to commit now
2013-06-25 16:19:45 FileOutputCommitter:173 [INFO] Saved output of task 'attempt_local_0001_m_000000_0' to /tmp/crunch-1128974463/p1/output
2013-06-25 16:19:48 LocalJobRunner:321 [INFO]
2013-06-25 16:19:48 Task:904 [INFO] Task 'attempt_local_0001_m_000000_0' done.
—
This crude patch makes the output end up in the right place, but it breaks a lot of other tests.
—
--- a/crunch-core/src/main/java/org/apache/crunch/io/impl/FileTargetImpl.java
+++ b/crunch-core/src/main/java/org/apache/crunch/io/impl/FileTargetImpl.java
@@ -66,7 +66,7 @@ public class FileTargetImpl implements PathTarget {
   protected void configureForMapReduce(Job job, Class keyClass, Class valueClass,
       Class outputFormatClass, Path outputPath, String name) {
     try {
-      FileOutputFormat.setOutputPath(job, outputPath);
+      FileOutputFormat.setOutputPath(job, path);
     } catch (Exception e) {
       throw new RuntimeException(e);
     }
—
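My guess at the mechanism, sketched as a toy model (hypothetical names, not the actual Crunch source): configureForMapReduce() receives the pipeline's per-job working directory as its outputPath argument, while the FileTargetImpl's own path field holds the user-supplied destination. Wiring the job to the argument sends the committed output to the working directory; the patch swaps in the field, which fixes this case but presumably breaks whatever multi-output plumbing relies on writing to the working directory first.

```java
// Toy model of the suspected path mix-up (illustrative only, not Crunch code).
class Target {
  private final String path; // user-supplied destination, e.g. /tmp/foo.seq

  Target(String path) {
    this.path = path;
  }

  // The planner passes its per-job working directory as outputPath.
  String configure(String outputPath) {
    return outputPath; // current behavior: the job commits to the working dir
    // return path;    // patched behavior: the job commits to the target
  }
}

public class PathMixupDemo {
  public static void main(String[] args) {
    Target t = new Target("/tmp/foo.seq");
    // The job ends up configured with the working directory, not /tmp/foo.seq.
    System.out.println(t.configure("/tmp/crunch-1128974463/p1/output"));
  }
}
```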