Description
I have a pipeline that consists of several stages to process and filter a dataset. I would like to persist this dataset to HDFS and then perform further computation on it.
Example:
1) Load text data A and convert to Avro -> A'
2) Load text data B and convert to Avro -> B'
3) Union A' and B' -> C
4) Filter C -> D
5) Write D to HDFS
6a) Use a DoFn to extract strings from D -> E
6b) Aggregate E (count strings) -> F
6c) Convert F to HBase puts -> G
6d) Write G to HBase
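In Crunch terms, the steps above might be sketched roughly as follows. This is a non-compilable sketch, not the actual job code: the paths, the Record type, and the ConvertToAvroFn / KeepFn / ExtractStringsFn / ToPutsFn classes are all hypothetical placeholders.

```java
// Rough sketch only; all *Fn classes, the Record type, and the paths
// are placeholders, not real code from the pipeline in question.
Pipeline pipeline = new MRPipeline(MyJob.class);

// Steps 1-2: load text and convert to Avro
PCollection<Record> a = pipeline.readTextFile("/in/a")
    .parallelDo(new ConvertToAvroFn(), Avros.records(Record.class));
PCollection<Record> b = pipeline.readTextFile("/in/b")
    .parallelDo(new ConvertToAvroFn(), Avros.records(Record.class));

// Steps 3-4: union and filter
PCollection<Record> c = a.union(b);
PCollection<Record> d = c.filter(new KeepFn());

// Step 5: write D to HDFS
pipeline.write(d, To.avroFile("/out/d"));

// Steps 6a-6d: extract strings, count them, convert to puts, write to HBase
PCollection<String> e = d.parallelDo(new ExtractStringsFn(), Avros.strings());
PTable<String, Long> f = e.count();
PCollection<Put> g = f.parallelDo(new ToPutsFn(), Writables.writables(Put.class));
pipeline.write(g, new HBaseTarget("my_table"));

pipeline.done();
```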
Running this pipeline code generates two MapReduce jobs that run in parallel:
job A) runs steps 1, 2, 3, 4, 5
job B) runs steps 1, 2, 3, 4, 6abcd
If a "pipeline.run()" call is inserted after step 5, the same two jobs are run, but sequentially.
What I would like is to be able to hold on to the PCollection reference to "D", so that steps 6a-6d can be run without going back to the start and redoing all the work needed to generate it.
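As a toy plain-Java analogy (nothing Crunch-specific), a lazy pipeline re-runs its upstream work once per consumer unless the intermediate result is held onto; the class and method names below are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Toy analogy of the lazy-evaluation behavior: every consumer of the lazy
// "D" re-runs the filter, just as each output target re-derives D.
public class LazyPipelineDemo {
    static int filterRuns = 0;

    // Lazy filtered view of the source: the work happens on every get().
    static Supplier<List<String>> filtered(Supplier<List<String>> source) {
        return () -> {
            filterRuns++; // count how many times "D" is recomputed
            List<String> out = new ArrayList<>();
            for (String s : source.get()) {
                if (!s.isEmpty()) {
                    out.add(s);
                }
            }
            return out;
        };
    }

    // Two consumers read the lazy D directly: the filter runs twice.
    static int consumeLazily() {
        filterRuns = 0;
        Supplier<List<String>> d = filtered(() -> List.of("a", "", "b"));
        d.get(); // consumer 1, e.g. the HDFS write
        d.get(); // consumer 2, e.g. the HBase path
        return filterRuns;
    }

    // Materialize D once and hand the same result to both consumers.
    static int consumeMaterialized() {
        filterRuns = 0;
        Supplier<List<String>> d = filtered(() -> List.of("a", "", "b"));
        List<String> materialized = d.get(); // persist once (step 5)
        materialized.size(); // consumer 1 reuses the held result
        materialized.size(); // consumer 2 reuses the held result
        return filterRuns;
    }

    public static void main(String[] args) {
        System.out.println(consumeLazily());       // prints 2
        System.out.println(consumeMaterialized()); // prints 1
    }
}
```

Holding the concrete result is exactly the behavior requested for "D": compute it once, then fan out to multiple downstream consumers.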
–
Ref to original discussion on crunch-user: http://mail-archives.apache.org/mod_mbox/incubator-crunch-user/201301.mbox/%3CCAH29n6MORejkxD%2ByRycRw40vxf4GruJ8m46AMjx_RGd6DvDUQA%40mail.gmail.com%3E