[OAK-10452] Indexing job/regex filtering: getting ancestors nodes of filtered path incorrectly does a full col scan on Mongo - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.58.0
Component/s: indexing
Labels: None

Description

In the PIPELINED strategy of the indexing job, when regex path filtering is enabled, the job runs two queries against Mongo:

1. Download the ancestors of the base path (e.g., 0:/, 1:/p1, 2:/p1/p2).
2. Download all the children of the base path (e.g., ???:/p1/p2/*).
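
The id shapes used by these two queries can be sketched as follows. This is a minimal illustration, assuming ids of the form <depth>:<path> as in the examples above. The helper names ancestor_ids and descendants_pattern are hypothetical and not taken from Oak, and the hash-based ids used for long paths are not handled.

```python
import re
from typing import List


def ancestor_ids(path: str) -> List[str]:
    """Return '<depth>:<path>' ids for the root, every ancestor, and the path itself."""
    if not path.startswith("/"):
        raise ValueError("expected an absolute path")
    segments = [s for s in path.split("/") if s]
    ids = ["0:/"]  # the root node is always at depth 0
    for depth in range(1, len(segments) + 1):
        ids.append(f"{depth}:/" + "/".join(segments[:depth]))
    return ids


def descendants_pattern(path: str):
    """Compile a regex matching ids of nodes strictly below `path`, at any depth.

    The depth prefix is left open because it differs per level,
    which is what the '???' in the example stands for.
    """
    base = path.rstrip("/")
    return re.compile(r"^[0-9]+:" + re.escape(base) + r"/.+$")


print(ancestor_ids("/p1/p2"))
print(bool(descendants_pattern("/p1/p2").match("3:/p1/p2/child")))
```

The ancestor list is small and fully enumerable, so it suits an exact-match lookup on _id, while the second query needs a pattern because the number of matching nodes and their depths are unknown.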

The first query returns only a few results, so it should use the index on _id. However, to handle the rare case where the path is long and the _id field holds a hash instead of the path, the ancestors query also searches for matches on the _path field, which is set when _id is a hash. The problem is that _path is not indexed, so the first query falls back to a full collection scan, which is much slower than an index scan for the handful of ancestors. This negates most or even all of the gains of regex filtering.
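
To make the problem concrete, the sketch below builds two filter documents as plain Python dicts: one that also matches on _path, and one that matches on _id only. The exact filters used by Oak are not shown in this issue, so both shapes are assumptions. The function or_branches_all_indexed is a hypothetical, deliberately simplified stand-in for the MongoDB rule that an $or query can use index scans only when every clause is supported by an index. The id-only shape is shown for contrast and is not claimed to be the change made in the linked pull request.

```python
from typing import Any, Dict, Iterable, List


def filter_with_path(ids: List[str]) -> Dict[str, Any]:
    """Assumed shape: match ancestors by _id, or by _path for hashed long-path ids."""
    paths = [i.split(":", 1)[1] for i in ids]  # strip the '<depth>:' prefix
    return {"$or": [{"_id": {"$in": ids}}, {"_path": {"$in": paths}}]}


def filter_id_only(ids: List[str]) -> Dict[str, Any]:
    """Assumed shape: exact-match lookup that touches only the indexed _id field."""
    return {"_id": {"$in": ids}}


def or_branches_all_indexed(query_filter: Dict[str, Any],
                            indexed_fields: Iterable[str]) -> bool:
    """Toy planner check: True if every $or branch uses only indexed fields.

    A filter without $or is treated as a single branch.
    """
    indexed = set(indexed_fields)
    branches = query_filter.get("$or", [query_filter])
    return all(set(branch) <= indexed for branch in branches)


ids = ["0:/", "1:/p1", "2:/p1/p2"]
print(or_branches_all_indexed(filter_with_path(ids), {"_id"}))
print(or_branches_all_indexed(filter_id_only(ids), {"_id"}))
```

Under this model, the filter with the _path branch cannot be served by the _id index alone, which corresponds to the collection scan described above. On a real deployment, the query plan reported by MongoDB's explain output is the authoritative way to check which scan is used.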

Issue Links

links to

GitHub Pull Request #1129

People

Assignee: Unassigned
Reporter: Nuno Santos
Votes: 0
Watchers: 2

Dates

Created: 21/Sep/23 12:41
Updated: 16/Oct/23 07:53
Resolved: 28/Sep/23 15:16