[BEAM-7866] Python MongoDB IO performance and correctness issues - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Triage Needed
Priority: P2
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.15.0
Component/s: sdk-py-core
Labels:
None

Description

https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/mongodbio.py splits the query result by computing number of results in constructor, and then in each reader re-executing the whole query and getting an index sub-range of those results.

This is broken in several critical ways:

The order of query results returned by find() is not necessarily deterministic, so the idea of index ranges on it is meaningless: each shard may basically get random, possibly overlapping subsets of the total results
Even if you add order by `_id`, the database may be changing concurrently to reading and splitting. E.g. if the database contained documents with ids 10 20 30 40 50, and this was split into shards 0..2 and 3..5 (under the assumption that these shards would contain respectively 10 20 30, and 40 50), and then suppose shard 10 20 30 is read and then document 25 is inserted - then the 3..5 shard will read 30 40 50, i.e. document 30 is duplicated and document 25 is lost.
Every shard re-executes the query and skips the first start_offset items, which in total is quadratic complexity
The query is first executed in the constructor in order to count results, which 1) means the constructor can be super slow and 2) it won't work at all if the database is unavailable at the time the pipeline is constructed (e.g. if this is a template).

Unfortunately, none of these issues are caught by SourceTestUtils: this class has extensive coverage with it, and the tests pass. This is because the tests return the same results in the same order. I don't know how to catch this automatically, and I don't know how to catch the performance issue automatically, but these would all be important follow-up items after the actual fix.

CC: chamikara as reviewer.

Attachments

Issue Links

links to

GitHub Pull Request #9233

GitHub Pull Request #9342

Activity

People

Assignee:: Yichi Zhang

Reporter:: Eugene Kirpichov

Votes:: 0 Vote for this issue

Watchers:: 4 Start watching this issue

Dates

Created:: 31/Jul/19 20:58

Updated:: 13/Apr/23 10:56

Resolved:: 14/Aug/19 21:41

Time Tracking

Estimated:

Not Specified

Remaining:

Logged:

12h