Details
-
Improvement
-
Status: Resolved
-
P3
-
Resolution: Fixed
-
None
Description
As a result of BEAM-2302, files in excess of WriteFiles maxNumWritersPerBundle are shuffled to be written later. The downside to this is that even if you can guarantee that maxNumWritersPerBundle is high enough to handle all writes you still have to pay the overhead of this write now being a MultiOutput ParDo.
e.g. In the Spark Runner when a ParDo has multiple outputs the returned data is cached and if using the disableCache pipeline option it would cause recalculation and all the temp files would be written again.
I'm sure that the Spark Runner is not the only runner that would benefit from an optional setting for WriteFiles that would skip this spilling and simplify the pipeline.
Attachments
Issue Links
- links to