[BEAM-14429] SyntheticUnboundedSource (with SDF) produces duplicate records when split with DEFAULT_DESIRED_NUM_SPLITS - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: P1
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.40.0
Component/s: io-common
Labels: None

Description

With the default of 20 splits, the number of records produced by Read.from(SyntheticUnboundedSource) is always larger than the specified numRecords. The more splits there are, the further the actual record count deviates from the specified value. The Read step also tends to take longer as the number of splits grows.

https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/Read.java#L512

The issue manifests in the Java LoadTests on Dataflow Runner v2.

The initial suspicion is that multiple UnboundedSourceAsSDFWrapperFn instances created duplicate source readers for the same restriction and checkpoint.
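To make the suspected failure mode concrete, here is a minimal, Beam-free sketch of how duplicate readers would inflate the record count. It is not the actual Read.java logic and does not reflect the confirmed root cause: the class DuplicateReaderSketch, the helpers split and countEmitted, and the duplicatedSplits parameter are all hypothetical names introduced for illustration. It only assumes that each split covers a contiguous range of records and that a split served by two readers emits its range twice.

```java
import java.util.ArrayList;
import java.util.List;

public class DuplicateReaderSketch {

    // Mirrors the default of 20 splits mentioned in the description.
    static final int DEFAULT_DESIRED_NUM_SPLITS = 20;

    /** Divides [0, numRecords) into numSplits contiguous half-open ranges {start, end}. */
    static List<long[]> split(long numRecords, int numSplits) {
        List<long[]> ranges = new ArrayList<>();
        for (int i = 0; i < numSplits; i++) {
            long start = numRecords * i / numSplits;
            long end = numRecords * (i + 1) / numSplits;
            ranges.add(new long[] {start, end});
        }
        return ranges;
    }

    /**
     * Counts the records emitted when the first duplicatedSplits ranges are each
     * read by two readers (hypothetical model of the suspected bug) and every
     * other range is read by exactly one reader.
     */
    static long countEmitted(long numRecords, int numSplits, int duplicatedSplits) {
        long emitted = 0;
        List<long[]> ranges = split(numRecords, numSplits);
        for (int i = 0; i < ranges.size(); i++) {
            long size = ranges.get(i)[1] - ranges.get(i)[0];
            int readers = i < duplicatedSplits ? 2 : 1;
            emitted += size * readers;
        }
        return emitted;
    }

    public static void main(String[] args) {
        long numRecords = 1000;
        // No duplicated readers: the emitted count matches numRecords.
        System.out.println(countEmitted(numRecords, DEFAULT_DESIRED_NUM_SPLITS, 0)); // prints 1000
        // Five splits read twice: the emitted count exceeds numRecords.
        System.out.println(countEmitted(numRecords, DEFAULT_DESIRED_NUM_SPLITS, 5)); // prints 1250
    }
}
```

Under this model, comparing the emitted count against numRecords is enough to detect the problem, which matches the symptom reported above: the count is correct only when no split is read more than once.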

Issue Links

links to

GitHub Pull Request #17576

GitHub Pull Request #17600

GitHub Pull Request #17609

People

Assignee: Yichi Zhang

Reporter: Yichi Zhang

Votes: 0

Watchers: 4

Dates

Created: 06/May/22 01:17

Updated: 12/May/22 17:06

Resolved: 10/May/22 23:10

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 3.5h