  1. Beam
  2. BEAM-14429

SyntheticUnboundedSource (with SDF) produces duplicate records when split with DEFAULT_DESIRED_NUM_SPLITS

Details

    • Type: Bug
    • Status: Resolved
    • Priority: P1
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.40.0
    • Component/s: io-common
    • Labels: None

    Description

      With the default of 20 splits, the number of records produced by Read.from(SyntheticUnboundedSource) is always larger than the specified numRecords. The more splits there are, the further the actual record count deviates from numRecords, and the Read step tends to take longer as the number of splits grows.

      https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/Read.java#L512

      The issue manifests in the Java LoadTests on Dataflow Runner v2.

      The initial suspicion is that multiple UnboundedSourceAsSDFWrapperFn instances created duplicate source readers for the same restriction and checkpoint.
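
      The suspected mechanism can be illustrated with a minimal, self-contained sketch. It does not use Beam at all: Restriction, split, and read are hypothetical stand-ins invented for illustration, and the sketch assumes that each reader replays its whole restriction from the same starting point. It only shows why a second reader on the same restriction inflates the record count beyond numRecords; it does not model why the excess grows with the number of splits.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DuplicateReaderSketch {

    // Hypothetical stand-in for one split: the half-open range [start, end) of record indices.
    static final class Restriction {
        final long start;
        final long end;

        Restriction(long start, long end) {
            this.start = start;
            this.end = end;
        }
    }

    // Divide [0, numRecords) into numSplits contiguous, non-overlapping restrictions.
    static List<Restriction> split(long numRecords, int numSplits) {
        List<Restriction> out = new ArrayList<>();
        for (int i = 0; i < numSplits; i++) {
            long start = numRecords * i / numSplits;
            long end = numRecords * (i + 1) / numSplits;
            out.add(new Restriction(start, end));
        }
        return out;
    }

    // Emit every record index of each restriction once per reader opened on it.
    // readersPerRestriction > 1 models the suspected bug (assumption, not confirmed by the ticket).
    static List<Long> read(List<Restriction> restrictions, int readersPerRestriction) {
        List<Long> emitted = new ArrayList<>();
        for (Restriction r : restrictions) {
            for (int reader = 0; reader < readersPerRestriction; reader++) {
                for (long i = r.start; i < r.end; i++) {
                    emitted.add(i);
                }
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        long numRecords = 1000;
        List<Restriction> splits = split(numRecords, 20);

        List<Long> expected = read(splits, 1);
        List<Long> duplicated = read(splits, 2);
        Set<Long> distinct = new HashSet<>(duplicated);

        System.out.println("one reader per restriction: " + expected.size() + " records");
        System.out.println("two readers per restriction: " + duplicated.size()
                + " records, " + distinct.size() + " distinct");
    }
}
```

      Comparing the emitted count with the distinct count, as main does, is one way to tell duplicated records apart from a source that simply generates too many unique records.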

          People

            Assignee: Yichi Zhang (yichi)
            Reporter: Yichi Zhang (yichi)
            Votes: 0
            Watchers: 4

              Time Tracking

                Estimated: Not Specified
                Remaining: 0h
                Logged: 3.5h