Crunch (Retired) / CRUNCH-144

Ability to re-use PCollections after a write without having to recompute them


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version: 0.4.0
    • Fix Version: 0.5.0
    • Component: Core
    • Labels: None

    Description

      I have a pipeline that consists of several stages to process and filter a dataset. I would like to persist this dataset to HDFS and then perform further computation on it.

      Example:
      1) Load text data A and convert to Avro -> A'
      2) Load text data B and convert to Avro -> B'
      3) Union A' and B' -> C
      4) Filter C -> D

      5) Write D to HDFS

      6a) Use a DoFn to extract strings from D -> E
      6b) Aggregate E (count strings) -> F
      6c) Convert F to HBase Puts -> G
      6d) Write G to HBase

      Running this pipeline generates two MapReduce jobs which run in parallel:
      job A) runs steps 1, 2, 3, 4, 5
      job B) runs steps 1, 2, 3, 4, 6abcd

      If a "pipeline.run()" call is included after step 5, the same two jobs run, but sequentially.

      What I would like is to be able to hold on to the PCollection reference for "D", so that steps 6* can run without going back to the start and redoing all the work needed to generate it.
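      The steps above can be sketched with the Crunch API roughly as follows. This is only illustrative: MyAvroRecord, ToAvroFn, MyFilterFn, ExtractStringsFn, ToPutsFn, the HDFS paths, and the HBase table name are all hypothetical placeholders, not part of the original report.

```java
// Sketch of the pipeline described in the steps above.
// MyAvroRecord, ToAvroFn, MyFilterFn, ExtractStringsFn and ToPutsFn are
// hypothetical placeholders, as are the paths and the HBase table name.
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.io.To;
import org.apache.crunch.io.hbase.HBaseTarget;
import org.apache.crunch.types.avro.Avros;
import org.apache.crunch.types.writable.Writables;
import org.apache.hadoop.hbase.client.Put;

public class ReusePipeline {
  public static void main(String[] args) {
    Pipeline pipeline = new MRPipeline(ReusePipeline.class);

    // 1-2) Load text data A and B and convert each to Avro
    PCollection<MyAvroRecord> aPrime = pipeline.readTextFile("/in/a")
        .parallelDo(new ToAvroFn(), Avros.records(MyAvroRecord.class));
    PCollection<MyAvroRecord> bPrime = pipeline.readTextFile("/in/b")
        .parallelDo(new ToAvroFn(), Avros.records(MyAvroRecord.class));

    // 3-4) Union A' and B', then filter -> D
    PCollection<MyAvroRecord> d = aPrime.union(bPrime).filter(new MyFilterFn());

    // 5) Write D to HDFS
    d.write(To.avroFile("/out/d"));

    // Calling pipeline.run() here forces D to be written first, but the
    // second job still recomputes D from scratch rather than reading /out/d,
    // which is the behaviour this issue asks to improve.

    // 6a-6b) Extract strings from D and count them
    PTable<String, Long> f =
        d.parallelDo(new ExtractStringsFn(), Avros.strings()).count();

    // 6c-6d) Convert the counts to HBase Puts and write them
    PCollection<Put> g =
        f.parallelDo(new ToPutsFn(), Writables.writables(Put.class));
    pipeline.write(g, new HBaseTarget("my_table"));

    pipeline.done();
  }
}
```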

      Reference to the original discussion on crunch-user: http://mail-archives.apache.org/mod_mbox/incubator-crunch-user/201301.mbox/%3CCAH29n6MORejkxD%2ByRycRw40vxf4GruJ8m46AMjx_RGd6DvDUQA%40mail.gmail.com%3E

      Attachments

        1. CRUNCH-144.patch (1 kB, Josh Wills)
        2. CRUNCH-144b.patch (4 kB, Josh Wills)


          People

            Assignee: Josh Wills (jwills)
            Reporter: Dave Beech (dbeech)
            Votes: 0
            Watchers: 3
