Uploaded image for project: 'Beam'
  1. Beam
  2. BEAM-6735

WriteFiles with runner-determined sharding is forced to handle spilling

Details

    • Improvement
    • Status: Resolved
    • P3
    • Resolution: Fixed
    • None
    • 2.12.0
    • sdk-java-core

    Description

      As a result of BEAM-2302, files in excess of WriteFiles maxNumWritersPerBundle are shuffled to be written later. The downside to this is that even if you can guarantee that maxNumWritersPerBundle is high enough to handle all writes you still have to pay the overhead of this write now being a MultiOutput ParDo.

      e.g. In the Spark Runner when a ParDo has multiple outputs the returned data is cached and if using the disableCache pipeline option it would cause recalculation and all the temp files would be written again.

      I'm sure that the Spark Runner is not the only runner that would benefit from an optional setting for WriteFiles that would skip this spilling and simplify the pipeline.

      Attachments

        Issue Links

          Activity

            People

              winkelman.kyle Kyle Winkelman
              winkelman.kyle Kyle Winkelman
              Votes:
              1 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 3.5h
                  3.5h