Beam / BEAM-1250

Remove leaf when materializing PCollection to avoid re-evaluation.

Details

    • Type: Bug
    • Status: Resolved
    • Priority: P2
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.5.0
    • Component/s: runner-spark
    • Labels: None

    Description

      When a PCollection (implemented as an RDD) is materialized, for example to create a PCollectionView, the runner should remove the materialized RDD from the "leaves" set.
      The runner keeps track of the leaves left un-handled in the DAG so that it can force an action on them. Write, for one, is implemented as a sequence of ParDos, which the runner implements via mapPartitions, so an action has to be forced.
      Materializing an RDD is done via the action collect(), so there is no reason to keep it in the "leaves" set.
      Currently it remains in the "leaves" set, so an action is forced on it again and the lineage is re-evaluated: unless the RDD is cached, the lineage executes twice.
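
      The double evaluation described above can be illustrated with a small toy model. This is only a sketch, not the runner's code: LazyDataset, EvaluationContext, register, materialize and forceLeaves are hypothetical names invented for this example, and no Spark or Beam classes are used. LazyDataset stands in for an uncached RDD whose lineage is recomputed on every action.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Supplier;

/** Toy model of leaf tracking. None of these names are Beam or Spark APIs. */
public class LeafTrackingSketch {

    /** Stand-in for an uncached RDD: every action recomputes the lineage. */
    static final class LazyDataset<T> {
        private final Supplier<List<T>> lineage;
        private int evaluations = 0;

        LazyDataset(Supplier<List<T>> lineage) {
            this.lineage = lineage;
        }

        /** Action: runs the lineage and returns its values, like collect(). */
        List<T> collect() {
            evaluations++;
            return lineage.get();
        }

        int evaluations() {
            return evaluations;
        }
    }

    /** Stand-in for the runner's bookkeeping of un-handled leaves. */
    static final class EvaluationContext {
        private final Set<LazyDataset<?>> leaves = new LinkedHashSet<>();
        private final boolean removeLeafOnMaterialize;

        EvaluationContext(boolean removeLeafOnMaterialize) {
            this.removeLeafOnMaterialize = removeLeafOnMaterialize;
        }

        /** Every new dataset starts out as a leaf of the DAG. */
        <T> LazyDataset<T> register(LazyDataset<T> dataset) {
            leaves.add(dataset);
            return dataset;
        }

        /** Materializing already runs an action, so the leaf can be dropped. */
        <T> List<T> materialize(LazyDataset<T> dataset) {
            List<T> values = dataset.collect();
            if (removeLeafOnMaterialize) {
                leaves.remove(dataset);
            }
            return values;
        }

        /** Forces an action on every leaf that is still tracked. */
        void forceLeaves() {
            for (LazyDataset<?> leaf : leaves) {
                leaf.collect();
            }
            leaves.clear();
        }
    }

    /** Counts lineage evaluations for one materialize-then-force run. */
    static int evaluationsFor(boolean removeLeafOnMaterialize) {
        EvaluationContext ctx = new EvaluationContext(removeLeafOnMaterialize);
        LazyDataset<Integer> dataset =
            ctx.register(new LazyDataset<>(() -> Arrays.asList(1, 2, 3)));
        ctx.materialize(dataset); // e.g. to build a view
        ctx.forceLeaves();        // end-of-pipeline forcing of remaining leaves
        return dataset.evaluations();
    }

    public static void main(String[] args) {
        System.out.println("leaf kept:    " + evaluationsFor(false));
        System.out.println("leaf removed: " + evaluationsFor(true));
    }
}
```

      In this model the lineage is evaluated twice when the leaf is kept and once when it is removed on materialization, which is the behavior this issue asks for.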

            People

              Assignee: Amit Sela (amitsela)
              Reporter: Amit Sela (amitsela)
              Votes: 0
              Watchers: 2

