Beam / BEAM-10409

Add combiner packing to graph optimizer phases

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: P2
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: runner-core
    • Labels:

      Description

      Some use cases of Beam (e.g. TensorFlow Transform) create thousands of Combine stages with a common parent. The large number of stages can cause performance issues on some runners. To alleviate this, a graph optimization phase could be added to the translations module that packs compatible Combine stages into a single stage.

      The graph optimization for CombinePerKey would work as follows: if several CombinePerKey stages share a common input, and each has exactly one input and one output, pack them into a single stage that runs all of the CombinePerKeys and outputs the results as tuples to a new PCollection. A subsequent stage unpacks the tuples from this PCollection and sends each element to its original output PCollection.
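
      The pack-then-unpack idea can be illustrated with a toy model in plain Python. This is only a sketch and does not use the Beam API: combine_per_key, pack and unpack are hypothetical helpers, dictionaries stand in for PCollections, and each combiner is assumed to be a plain function over the list of values for a key.

```python
from collections import defaultdict


def combine_per_key(pairs, combine_fn):
    """Toy stand-in for CombinePerKey: group values by key, then reduce each group."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: combine_fn(values) for key, values in groups.items()}


def pack(combine_fns):
    """Return one combine function that runs every fn and emits a tuple of results."""
    def packed(values):
        return tuple(fn(values) for fn in combine_fns)
    return packed


def unpack(packed_output, num_outputs):
    """Route element i of each result tuple to the i-th original output."""
    return [
        {key: results[i] for key, results in packed_output.items()}
        for i in range(num_outputs)
    ]


pairs = [("a", 1), ("a", 5), ("b", 2), ("b", 3)]
fns = [sum, min, max]

# One packed stage followed by one unpacking stage...
unpacked = unpack(combine_per_key(pairs, pack(fns)), len(fns))
# ...should match running each combiner as its own stage.
separate = [combine_per_key(pairs, fn) for fn in fns]
```

      In this model the input is grouped once rather than once per combiner, which is the saving the packed stage is meant to provide.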

      There is an additional issue with supporting this for CombineGlobally: because an intermediate KeyWithVoid stage sits between the input stage and each CombinePerKey stage, the CombinePerKey stages do not have a common input stage and cannot be packed. To support CombineGlobally, a common sibling elimination graph optimization phase can be used to merge the KeyWithVoid stages. After this, the CombinePerKey stages would have a common input and can be packed.
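
      Common sibling elimination can be sketched on a toy stage graph as follows. This is a simplified model, not Beam's translations module: the Stage record and eliminate_common_siblings are hypothetical, stages are assumed to be listed in topological order, and two stages are assumed to be interchangeable whenever they apply the same transform to the same input.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Stage:
    name: str
    transform: str  # e.g. "KeyWithVoid" or "CombinePerKey(sum)"
    input: str      # name of the input PCollection
    output: str     # name of the output PCollection


def eliminate_common_siblings(stages):
    """Keep one stage per (transform, input) pair and rewire consumers of the rest."""
    survivors = {}  # (transform, input) -> output of the stage that was kept
    renamed = {}    # output of a dropped stage -> output of its surviving sibling
    result = []
    for stage in stages:
        # Point this stage at the surviving sibling if its producer was dropped.
        stage = replace(stage, input=renamed.get(stage.input, stage.input))
        key = (stage.transform, stage.input)
        if key in survivors:
            renamed[stage.output] = survivors[key]
        else:
            survivors[key] = stage.output
            result.append(stage)
    return result


stages = [
    Stage("kwv1", "KeyWithVoid", "pc_in", "pc_k1"),
    Stage("sum", "CombinePerKey(sum)", "pc_k1", "pc_sum"),
    Stage("kwv2", "KeyWithVoid", "pc_in", "pc_k2"),
    Stage("max", "CombinePerKey(max)", "pc_k2", "pc_max"),
]
optimized = eliminate_common_siblings(stages)
combine_inputs = {s.input for s in optimized if s.transform.startswith("CombinePerKey")}
```

      After the pass, only one KeyWithVoid stage remains and both CombinePerKey stages read from its output, so they satisfy the common-input condition for packing.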

            People

            • Assignee:
              myffical@gmail.com Yifan Mai
              Reporter:
              myffical@gmail.com Yifan Mai

                Time Tracking

                • Estimated: Not Specified
                • Remaining: 0h
                • Logged: 8h