[OAK-10671] [Indexing job] - Improve Mongo regex query: remove condition on non-indexed _path field to speedup traversal - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Done
Affects Version/s: None
Fix Version/s: 1.62.0
Component/s: indexing
Labels: None

Description

Regex path filtering is currently implemented with a condition like:

_id in [^[0-9]{1,3}:\Q/foo/bar/\E.*$] OR (_id in [^[0-9]{1,3}:h.*$] AND _path in [^\Q/foo/bar/\E.*$])

The second condition is necessary to deal with long path documents, whose _id is a hash instead of the path of the document, and which have an additional _path property holding the full path. The _id field is part of the index used by the query, but _path is not indexed. The performance of this query is therefore very sensitive to how many times the condition can be resolved without looking up the value of _path, which requires retrieving the full document from the column store. If the condition can be evaluated using only the _id value, then a document that does not match should not be retrieved from the column store.
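As a rough illustration of why the second branch is the expensive one, the condition can be modelled in Python as below. This is only a sketch of the logic, not the Oak or Mongo implementation: the function name `old_condition`, the sample `_id` values, and the use of `re.escape` in place of `\Q...\E` (which Python's `re` module does not support) are all assumptions made for the example.

```python
import re

PREFIX = "/foo/bar/"

# re.escape stands in for the \Q...\E quoting used in the Mongo regex.
ID_PATH = re.compile(r"^[0-9]{1,3}:" + re.escape(PREFIX) + r".*$")
ID_HASH = re.compile(r"^[0-9]{1,3}:h.*$")
PATH_RE = re.compile(r"^" + re.escape(PREFIX) + r".*$")


def old_condition(doc):
    """Return (matches, needs_path).

    needs_path is True when the non-indexed _path field had to be read,
    i.e. when the full document would have to be fetched.
    """
    doc_id = doc["_id"]
    if ID_PATH.match(doc_id):
        return True, False  # decided from _id alone
    if ID_HASH.match(doc_id):
        # Long path document: the real path is only available in _path.
        return bool(PATH_RE.match(doc.get("_path", ""))), True
    return False, False  # decided from _id alone
```

In this model, only documents whose _id is in the hash format ever need _path; every other document can be accepted or rejected from the indexed _id alone.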

Unfortunately, Mongo does not seem to optimize this query properly and retrieves the document from the column store even when the _id neither matches the path /foo/bar nor is in the hash format. This leads to very poor performance, as the query has to read both the index and the column store in full.

We can instead use the following condition:

_id in [^[0-9]{1,3}:\Q/foo/bar/\E.*$, ^[0-9]{1,3}:h.*$]
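A minimal Python sketch of this simplified condition follows. It only models the matching logic; the names `ID_PATTERNS` and `new_condition` and the sample `_id` values are hypothetical, and `re.escape` again replaces `\Q...\E`, which Python's `re` module lacks.

```python
import re

PREFIX = "/foo/bar/"

# Both patterns look only at _id, so no other field is ever consulted.
ID_PATTERNS = [
    re.compile(r"^[0-9]{1,3}:" + re.escape(PREFIX) + r".*$"),  # path under PREFIX
    re.compile(r"^[0-9]{1,3}:h.*$"),                           # hashed long path id
]


def new_condition(doc_id):
    """Return True if the document should be downloaded, judging by _id only."""
    return any(p.match(doc_id) for p in ID_PATTERNS)
```

Here `new_condition("3:/foo/bar/baz")` and `new_condition("12:habc/leaf")` both return True, while `new_condition("2:/other/node")` returns False.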

That is, download the document if its _id matches the path or if it is a hash. The disadvantage is that this downloads all long path documents from the repository, many of which might not be needed. However, the condition uses only the _id field, so it is guaranteed to be evaluated entirely from the data in the index. The number of long path documents is usually very small (some environments have none at all), so downloading them should not take much time. The indexing job will in any case reapply the path filter locally to discard the long path documents it does not need.
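The local re-filtering step could look roughly like the following Python sketch. It is an illustration under assumptions rather than the indexing job's actual code: `effective_path` and `filter_locally` are hypothetical helpers, and the sketch assumes that a regular document encodes its path in _id after the first colon while a long path document carries it in _path.

```python
def effective_path(doc):
    """Return the repository path of a downloaded document."""
    if "_path" in doc:
        # Long path document: _id is a hash, the path is stored separately.
        return doc["_path"]
    # Regular document: _id has the form "<depth>:<path>".
    return doc["_id"].split(":", 1)[1]


def filter_locally(docs, prefix="/foo/bar/"):
    """Keep only the documents whose path lies under the given prefix."""
    return [d for d in docs if effective_path(d).startswith(prefix)]
```

With this approach the extra long path documents cost one download each and are then dropped in memory, which stays cheap as long as there are few of them.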

Issue Links

links to

GitHub Pull Request #1331

People

Assignee: Unassigned

Reporter: Nuno Santos

Dates

Created: 26/Feb/24 14:20

Updated: 09/Apr/24 04:25

Resolved: 29/Feb/24 10:43