  1. Beam
  2. BEAM-2873

Detect number of shards for file sink in Flink Streaming Runner


    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: P2
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.10.0
    • Component/s: runner-flink
    • Labels:
      None

      Description

      Reuven Lax mentioned that this is done for the Dataflow Runner and the default behaviour on Flink can be somewhat surprising for users.

      Mailing list entry (https://www.mail-archive.com/dev@beam.apache.org/msg02665.html):

      This is how the file sink has always worked in Beam. If no sharding is specified, then this means runner-determined sharding, and by default that is one file per bundle. If Flink has small bundles, then I suggest using the withNumShards method to explicitly pick the number of output shards.

      The Flink runner can detect that runner-determined sharding has been chosen, and override it with a specific number of shards. For example, the Dataflow streaming runner (which, as you mentioned, also has small bundles) detects this case and sets the number of output file shards based on the number of workers in the worker pool. Here is the code that does this; it should be quite simple to do something similar for Flink, and then there will be no need for users to explicitly call withNumShards themselves.
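
The override described above can be sketched in plain Java, without any Beam or Flink dependency. This is only an illustration of the decision logic, not the code of either runner: the class `ShardOverrideSketch`, the method `resolveNumShards`, the constant `FALLBACK_SHARDS`, and the choice of one shard per unit of parallelism are all assumptions made for this example. An empty `OptionalInt` stands in for "runner-determined sharding".

```java
import java.util.OptionalInt;

public class ShardOverrideSketch {

    // Hypothetical fallback used when the runner cannot tell its parallelism.
    static final int FALLBACK_SHARDS = 1;

    /**
     * Picks the number of output shards for a file sink.
     *
     * @param userNumShards shard count set by the user; empty means
     *                      "runner-determined sharding" in this sketch
     * @param parallelism   parallelism reported by the runner; values <= 0
     *                      are treated as unknown
     */
    static int resolveNumShards(OptionalInt userNumShards, int parallelism) {
        if (userNumShards.isPresent()) {
            // The user made an explicit choice, so the runner leaves it alone.
            return userNumShards.getAsInt();
        }
        // Runner-determined sharding: derive a fixed count from the
        // parallelism instead of writing one file per bundle.
        return parallelism > 0 ? parallelism : FALLBACK_SHARDS;
    }

    public static void main(String[] args) {
        System.out.println(resolveNumShards(OptionalInt.of(3), 8));    // prints 3
        System.out.println(resolveNumShards(OptionalInt.empty(), 8));  // prints 8
        System.out.println(resolveNumShards(OptionalInt.empty(), -1)); // prints 1
    }
}
```

With logic of this shape, a pipeline that never calls withNumShards still gets a bounded number of output files, while an explicit user setting is left untouched.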


              People

              • Assignee:
                Dawid Wysakowicz (dwysakowicz)
              • Reporter:
                Aljoscha Krettek (aljoscha)
              • Votes:
                0
              • Watchers:
                3


                  Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 6h