Description
I'm trying to write a simple Crunch job that outputs a sequence file consisting of a custom Writable.
The job runs successfully, but the output is written not to the path I specify in To.sequenceFile(), but to a Crunch working directory instead.
This happens when running the job both locally and on my 1-node Hadoop test cluster, and it happens both with Crunch 0.6.0 and 0.7.0-SNAPSHOT as of today (38a97e5).
When using pipeline.done() instead of pipeline.run(), the Crunch working directory gets removed after execution; in that case, the output is not retained at all.
Code snippet:
—
public int run(String[] args) throws IOException {
  CommandLine cl = parseCommandLine(args);
  Path output = new Path((String) cl.getValue(OUTPUT_OPTION));
  int docIdIndex = getColumnIndex(cl, "DocID");
  int ldaIndex = getColumnIndex(cl, "LDA");

  Pipeline pipeline = new MRPipeline(DbDumpToSeqFile.class);
  pipeline.setConfiguration(getConf());
  PCollection<String> lines = pipeline.readTextFile((String) cl.getValue(INPUT_OPTION));
  PTable<String, NamedQuantizedVecWritable> vectors = lines.parallelDo(
      new ConvertToSeqFileDoFn(docIdIndex, ldaIndex),
      tableOf(strings(), writables(NamedQuantizedVecWritable.class)));
  vectors.write(To.sequenceFile(output));
  PipelineResult res = pipeline.run();
  return res.succeeded() ? 0 : 1;
}
—
Log output from a local run. Note how the intended output path "/tmp/foo.seq" is reported in the execution plan but is not actually used.
—
2013-06-25 16:19:44.250 java[10755:1203] Unable to load realm info from SCDynamicStore
2013-06-25 16:19:44 HadoopUtil:185 [INFO] Deleting /tmp/foo.seq
2013-06-25 16:19:44 FileTargetImpl:224 [INFO] Will write output files to new path: /tmp/foo.seq
2013-06-25 16:19:45 JobClient:741 [WARN] No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
2013-06-25 16:19:45 FileInputFormat:237 [INFO] Total input paths to process : 1
2013-06-25 16:19:45 TrackerDistributedCacheManager:407 [INFO] Creating MAP in /tmp/hadoop-florian/mapred/local/archive/4100035173370108016_-456151549_2075417214/file/tmp/crunch-1128974463/p1-work--1596891011522800122 with rwxr-xr-x
2013-06-25 16:19:45 TrackerDistributedCacheManager:447 [INFO] Cached /tmp/crunch-1128974463/p1/MAP as /tmp/hadoop-florian/mapred/local/archive/4100035173370108016_-456151549_2075417214/file/tmp/crunch-1128974463/p1/MAP
2013-06-25 16:19:45 TrackerDistributedCacheManager:470 [INFO] Cached /tmp/crunch-1128974463/p1/MAP as /tmp/hadoop-florian/mapred/local/archive/4100035173370108016_-456151549_2075417214/file/tmp/crunch-1128974463/p1/MAP
2013-06-25 16:19:45 CrunchControlledJob:303 [INFO] Running job "com.issuu.mahout.utils.DbDumpToSeqFile: Text(/Users/florian/data/docdb.first20.txt)+S0+SeqFile(/tmp/foo.seq)"
2013-06-25 16:19:45 CrunchControlledJob:304 [INFO] Job status available at: http://localhost:8080/
2013-06-25 16:19:45 Task:792 [INFO] Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
2013-06-25 16:19:45 LocalJobRunner:321 [INFO]
2013-06-25 16:19:45 Task:945 [INFO] Task attempt_local_0001_m_000000_0 is allowed to commit now
2013-06-25 16:19:45 FileOutputCommitter:173 [INFO] Saved output of task 'attempt_local_0001_m_000000_0' to /tmp/crunch-1128974463/p1/output
2013-06-25 16:19:48 LocalJobRunner:321 [INFO]
2013-06-25 16:19:48 Task:904 [INFO] Task 'attempt_local_0001_m_000000_0' done.
—
This crude patch makes the output end up in the right place, but it breaks a lot of other tests.
—
--- a/crunch-core/src/main/java/org/apache/crunch/io/impl/FileTargetImpl.java
+++ b/crunch-core/src/main/java/org/apache/crunch/io/impl/FileTargetImpl.java
@@ -66,7 +66,7 @@ public class FileTargetImpl implements PathTarget {
   protected void configureForMapReduce(Job job, Class keyClass, Class valueClass,
       Class outputFormatClass, Path outputPath, String name) {
     try {
-      FileOutputFormat.setOutputPath(job, outputPath);
+      FileOutputFormat.setOutputPath(job, path);
     } catch (Exception e) {
       throw new RuntimeException(e);
     }
—
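My guess at the mechanism, sketched as a toy model (hypothetical names, not the actual Crunch source): configureForMapReduce() receives the pipeline's per-job working directory as its outputPath argument, while the FileTargetImpl's own path field holds the user-supplied destination. Wiring the job to the argument sends the committed output to the working directory; the patch swaps in the field, which fixes this case but presumably breaks whatever multi-output plumbing relies on writing to the working directory first.

```java
// Toy model of the suspected path mix-up (illustrative only, not Crunch code).
class Target {
  private final String path; // user-supplied destination, e.g. /tmp/foo.seq

  Target(String path) {
    this.path = path;
  }

  // The planner passes its per-job working directory as outputPath.
  String configure(String outputPath) {
    return outputPath; // current behavior: the job commits to the working dir
    // return path;    // patched behavior: the job commits to the target
  }
}

public class PathMixupDemo {
  public static void main(String[] args) {
    Target t = new Target("/tmp/foo.seq");
    // The job ends up configured with the working directory, not /tmp/foo.seq.
    System.out.println(t.configure("/tmp/crunch-1128974463/p1/output"));
  }
}
```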