Details
-
Bug
-
Status: Resolved
-
P2
-
Resolution: Fixed
-
None
-
None
Description
When the runner materializes a PCollection (implemented as an RDD), for example to create a PCollectionView, it should remove the materialized RDD from the "leaves" set.
The runner keeps track of leaves left unhandled in the DAG so that it can force an action on them. Write, for one, is implemented as a sequence of ParDos, which the runner implements via mapPartitions; since mapPartitions is a lazy transformation, an action has to be forced.
Materializing an RDD is done via the collect() action, so the RDD has already been evaluated and there is no reason to keep it in the "leaves" set.
Currently, it remains in the "leaves" set, so an action is forced on it and its lineage is evaluated again. Unless the RDD is cached, the lineage is therefore executed twice.
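The double evaluation can be illustrated with a minimal, Spark-free sketch. Everything in it is hypothetical and only models the behavior described above: LazyDataset stands in for an uncached RDD, Context for the runner's bookkeeping of leaves, and the removeOnMaterialize flag toggles the proposed fix. It is not the runner's actual code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Supplier;

public class LeavesSketch {

    /** Hypothetical stand-in for a lazily evaluated, uncached RDD. */
    static final class LazyDataset {
        private final Supplier<List<Integer>> lineage;
        int evaluations = 0; // how many times the lineage has been executed

        LazyDataset(Supplier<List<Integer>> lineage) {
            this.lineage = lineage;
        }

        /** Models an action: every call re-runs the whole lineage. */
        List<Integer> collect() {
            evaluations++;
            return lineage.get();
        }
    }

    /** Hypothetical stand-in for the runner's tracking of unhandled leaves. */
    static final class Context {
        final Set<LazyDataset> leaves = new LinkedHashSet<>();
        private final boolean removeOnMaterialize;

        Context(boolean removeOnMaterialize) {
            this.removeOnMaterialize = removeOnMaterialize;
        }

        LazyDataset register(LazyDataset dataset) {
            leaves.add(dataset);
            return dataset;
        }

        /** Materializes via an action; with the fix, the dataset stops being a leaf. */
        List<Integer> materialize(LazyDataset dataset) {
            List<Integer> values = dataset.collect();
            if (removeOnMaterialize) {
                leaves.remove(dataset);
            }
            return values;
        }

        /** Forces an action on every leaf that is still unhandled. */
        void forceActionsOnLeaves() {
            for (LazyDataset leaf : new ArrayList<>(leaves)) {
                leaf.collect();
            }
            leaves.clear();
        }
    }

    /** Returns how many times the lineage ran for one materialized dataset. */
    static int runPipeline(boolean removeOnMaterialize) {
        Context context = new Context(removeOnMaterialize);
        LazyDataset dataset =
            context.register(new LazyDataset(() -> Arrays.asList(1, 2, 3)));
        context.materialize(dataset);   // e.g. to build a side-input view
        context.forceActionsOnLeaves(); // end-of-pipeline pass over the leaves
        return dataset.evaluations;
    }

    public static void main(String[] args) {
        System.out.println("evaluations without removal: " + runPipeline(false));
        System.out.println("evaluations with removal: " + runPipeline(true));
    }
}
```

In this model the lineage runs twice when the materialized dataset stays in the leaves set and once when it is removed, which is the behavior this issue asks for.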