Description
I have a pipeline that consists of several stages to process and filter a dataset. I would like to persist this dataset to HDFS and then perform further computation on it.
Example:
1) Load text data A and convert to Avro -> A'
2) Load text data B and convert to Avro -> B'
3) Union A' and B' -> C
4) Filter C -> D
5) Write D to HDFS
6a) Use a DoFn to extract strings from D -> E
6b) Aggregate E (count strings) -> F
6c) Convert F to HBase puts -> G
6d) Write G to HBase
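In Crunch terms, the steps above might be sketched roughly as follows. This is a non-compilable sketch, not the actual job code: the paths, the Record type, and the ConvertToAvroFn / KeepFn / ExtractStringsFn / ToPutsFn classes are all hypothetical placeholders.

```java
// Rough sketch only; all *Fn classes, the Record type, and the paths
// are placeholders, not real code from the pipeline in question.
Pipeline pipeline = new MRPipeline(MyJob.class);

// Steps 1-2: load text and convert to Avro
PCollection<Record> a = pipeline.readTextFile("/in/a")
    .parallelDo(new ConvertToAvroFn(), Avros.records(Record.class));
PCollection<Record> b = pipeline.readTextFile("/in/b")
    .parallelDo(new ConvertToAvroFn(), Avros.records(Record.class));

// Steps 3-4: union and filter
PCollection<Record> c = a.union(b);
PCollection<Record> d = c.filter(new KeepFn());

// Step 5: write D to HDFS
pipeline.write(d, To.avroFile("/out/d"));

// Steps 6a-6d: extract strings, count them, convert to puts, write to HBase
PCollection<String> e = d.parallelDo(new ExtractStringsFn(), Avros.strings());
PTable<String, Long> f = e.count();
PCollection<Put> g = f.parallelDo(new ToPutsFn(), Writables.writables(Put.class));
pipeline.write(g, new HBaseTarget("my_table"));

pipeline.done();
```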
Running this pipeline code generates two MapReduce jobs that run in parallel:
job A) runs steps 1, 2, 3, 4, 5
job B) runs steps 1, 2, 3, 4, 6abcd
If a "pipeline.run()" call is inserted after step 5, the same two jobs are run, but sequentially.
What I would like is to be able to hold on to the PCollection reference to "D", so that steps 6a-6d can be run without going back to the start and redoing all the work needed to generate it.
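As a toy plain-Java analogy (nothing Crunch-specific), a lazy pipeline re-runs its upstream work once per consumer unless the intermediate result is held onto; the class and method names below are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Toy analogy of the lazy-evaluation behavior: every consumer of the lazy
// "D" re-runs the filter, just as each output target re-derives D.
public class LazyPipelineDemo {
    static int filterRuns = 0;

    // Lazy filtered view of the source: the work happens on every get().
    static Supplier<List<String>> filtered(Supplier<List<String>> source) {
        return () -> {
            filterRuns++; // count how many times "D" is recomputed
            List<String> out = new ArrayList<>();
            for (String s : source.get()) {
                if (!s.isEmpty()) {
                    out.add(s);
                }
            }
            return out;
        };
    }

    // Two consumers read the lazy D directly: the filter runs twice.
    static int consumeLazily() {
        filterRuns = 0;
        Supplier<List<String>> d = filtered(() -> List.of("a", "", "b"));
        d.get(); // consumer 1, e.g. the HDFS write
        d.get(); // consumer 2, e.g. the HBase path
        return filterRuns;
    }

    // Materialize D once and hand the same result to both consumers.
    static int consumeMaterialized() {
        filterRuns = 0;
        Supplier<List<String>> d = filtered(() -> List.of("a", "", "b"));
        List<String> materialized = d.get(); // persist once (step 5)
        materialized.size(); // consumer 1 reuses the held result
        materialized.size(); // consumer 2 reuses the held result
        return filterRuns;
    }

    public static void main(String[] args) {
        System.out.println(consumeLazily());       // prints 2
        System.out.println(consumeMaterialized()); // prints 1
    }
}
```

Holding the concrete result is exactly the behavior requested for "D": compute it once, then fan out to multiple downstream consumers.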
–
Ref to original discussion on crunch-user: http://mail-archives.apache.org/mod_mbox/incubator-crunch-user/201301.mbox/%3CCAH29n6MORejkxD%2ByRycRw40vxf4GruJ8m46AMjx_RGd6DvDUQA%40mail.gmail.com%3E