[BEAM-8884] Python MongoDBIO TypeError when splitting - ASF JIRA

Details

Type: Bug
Status: Triage Needed
Priority: P2
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.18.0
Component/s: sdk-py-core
Labels: None

Description

I am trying to run a pipeline (defined with the Python SDK) on Dataflow that uses beam.io.ReadFromMongoDB. With very small datasets (<10 MB) it runs fine, but with slightly larger datasets (70 MB) I always get this error:

TypeError: '<' not supported between instances of 'dict' and 'ObjectId'

The stack trace is below. Running the pipeline on a local machine works just fine. I would highly appreciate any pointers on what could be causing this.
I hope this is the right channel to raise this.

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/dataflow_worker/batchworker.py", line 649, in do_work
    work_executor.execute()
  File "/usr/local/lib/python3.7/site-packages/dataflow_worker/executor.py", line 218, in execute
    self._split_task)
  File "/usr/local/lib/python3.7/site-packages/dataflow_worker/executor.py", line 226, in _perform_source_split_considering_api_limits
    desired_bundle_size)
  File "/usr/local/lib/python3.7/site-packages/dataflow_worker/executor.py", line 263, in _perform_source_split
    for split in source.split(desired_bundle_size):
  File "/usr/local/lib/python3.7/site-packages/apache_beam/io/mongodbio.py", line 174, in split
    bundle_end = min(stop_position, split_key_id)
TypeError: '<' not supported between instances of 'dict' and 'ObjectId'
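The failing call can be illustrated without MongoDB. The sketch below is only a minimal reproduction under stated assumptions: `FakeObjectId` is a hypothetical stand-in for `bson.ObjectId` (it orders only against its own type), and the split key is assumed to arrive as a dict wrapping the id, which is what the operand order in the error message suggests (`min(a, b)` evaluates `b < a`, so the dict is the second argument, `split_key_id`). The `bundle_end` helper is likewise hypothetical and is not the change made in the linked pull requests.

```python
from functools import total_ordering


@total_ordering
class FakeObjectId:
    """Hypothetical stand-in for bson.ObjectId: comparable only to its own type."""

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        if not isinstance(other, FakeObjectId):
            return NotImplemented
        return self.value == other.value

    def __lt__(self, other):
        if not isinstance(other, FakeObjectId):
            return NotImplemented
        return self.value < other.value

    def __hash__(self):
        return hash(self.value)


def reproduce_error(stop_position, split_key):
    """Run the same min() call as the traceback and return the error text, if any."""
    try:
        min(stop_position, split_key)
    except TypeError as exc:
        return str(exc)
    return None


def bundle_end(stop_position, split_key):
    """Assumed workaround: unwrap the '_id' field before comparing."""
    split_key_id = split_key["_id"] if isinstance(split_key, dict) else split_key
    return min(stop_position, split_key_id)


stop = FakeObjectId(10)
key = {"_id": FakeObjectId(4)}  # assumed shape of a split key

error_message = reproduce_error(stop, key)
print(error_message)
print(bundle_end(stop, key).value)
```

If the split keys really are wrapped this way, it would also be consistent with the small-dataset case passing: when no split keys are produced, the comparison is never reached. That reading is an inference from the traceback, not something the report confirms.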

Issue Links

links to

GitHub Pull Request #10282

GitHub Pull Request #10298

People

Assignee: Yichi Zhang

Reporter: Brian Hulette

Votes: 0

Watchers: 4

Dates

Created: 04/Dec/19 00:16

Updated: 13/Apr/23 10:59

Resolved: 13/Dec/19 01:38

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 3h 40m