[OAK-10608] [Indexing job] Improve regex expression used to download from Mongo to make better use of Mongo indexes - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Done
Affects Version/s: None
Fix Version/s: 1.62.0
Component/s: indexing
Labels: None

Description

The regex currently used to filter the included/excluded paths in Mongo has conditions on both the _id and _path fields. In most cases, the _id field contains the path of the node, but when the path is too long, the _id is replaced by a hash of the path and the full path is added to the document as an additional _path field. For these cases, the regex must also check the _path field.

When running an ordered traversal, we use a Mongo index on (_modified, _id), so checks on _id can be done with just the data retrieved from the index. For the check on _path, however, Mongo needs to read the full document from the column store, which significantly slows down the traversal.

Currently, if _id does not match, the regex always checks _path, forcing a retrieval of the document. But we only need to check _path if the _id has the form of a long-path id, that is, the pattern 4:h<hash>... If the _id is not a long-path id and does not match the regex, we can be sure that the document is not needed. The check that _id is a hash can be done without retrieving the full document from the column store, so it is fast. In the common case, the document does not have a long path, so this simple check avoids retrieving the document from the column store.
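To make the difference concrete, here is a minimal sketch of the two filter shapes written as MongoDB query documents built from Python dicts. It is an illustration only, not the query from the pull request: the function names, the long-path pattern `^[0-9]{1,3}:h` (derived from the 4:h<hash>... form above) and the idea of passing separate regexes for _id and _path are assumptions. Only the standard `$or`, `$and` and `$regex` operators are used.

```python
import re

# Assumed pattern for a long-path _id, based on the "4:h<hash>..." form
# described above: a numeric prefix, a colon, then "h" followed by the hash.
LONG_PATH_ID_REGEX = r"^[0-9]{1,3}:h"


def current_filter(id_regex: str, path_regex: str) -> dict:
    """Current shape: _path is checked whenever _id does not match."""
    return {"$or": [
        {"_id": {"$regex": id_regex}},
        {"_path": {"$regex": path_regex}},
    ]}


def optimized_filter(id_regex: str, path_regex: str) -> dict:
    """Proposed shape: _path is only checked when _id looks like a long-path id."""
    return {"$or": [
        {"_id": {"$regex": id_regex}},
        {"$and": [
            # This condition only needs _id, which is part of the index.
            {"_id": {"$regex": LONG_PATH_ID_REGEX}},
            # This condition needs the _path field of the full document.
            {"_path": {"$regex": path_regex}},
        ]},
    ]}


# Hypothetical regexes for an included path /content/dam.
example = optimized_filter(r"^[0-9]{1,3}:/content/dam/", r"^/content/dam/")
```

With this shape, a document whose _id neither matches the path regex nor looks like a long-path id can be rejected from the _id value alone, which is the point made in the paragraph above.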

This optimization will have a big impact when the regex matches a small fraction of the repository. In the current implementation, Mongo has to traverse both the index and the column store, regardless of the regex filter. With the additional check for long paths, Mongo still has to traverse the full index, but it only retrieves from the column store the documents that match the filter and the long-path documents. Since the index is much smaller than the column store and can be cached more easily, this will significantly improve performance.
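The expected effect can be illustrated with a small simulation that counts how many full documents would have to be read in each case. This is a toy model, not a measurement of Mongo: the document ids, the paths, the regexes and the `count_fetches` helper are all made up for the example, and a dictionary lookup stands in for fetching a document.

```python
import re

LONG_PATH_ID = re.compile(r"^[0-9]+:h")  # assumed shape of a long-path _id


def count_fetches(docs, id_regex, path_regex, check_long_path_first):
    """Return (matched_ids, fetches) for one pass over the index entries.

    docs maps _id -> _path (None when the document has no _path field).
    Reading docs[doc_id] stands in for fetching the full document.
    """
    matched, fetches = [], 0
    for doc_id in docs:  # iterating the keys models the index-only traversal
        if id_regex.search(doc_id):
            matched.append(doc_id)
            fetches += 1  # matching documents are downloaded anyway
            continue
        if check_long_path_first and not LONG_PATH_ID.match(doc_id):
            continue  # rejected using only the _id from the index
        fetches += 1  # the document must be read to inspect _path
        path = docs[doc_id]
        if path is not None and path_regex.search(path):
            matched.append(doc_id)
    return matched, fetches


# Hypothetical repository: four regular ids and two long-path ids.
docs = {
    "2:/content/dam": None,
    "3:/content/dam/a": None,
    "2:/apps/foo": None,
    "2:/libs/bar": None,
    "9:h0123abcd/leaf": "/content/dam/very/long/path/leaf",
    "9:h4567ef01/leaf": "/apps/very/long/path/leaf",
}
id_regex = re.compile(r"^[0-9]+:/content/dam(/|$)")
path_regex = re.compile(r"^/content/dam(/|$)")

current = count_fetches(docs, id_regex, path_regex, check_long_path_first=False)
optimized = count_fetches(docs, id_regex, path_regex, check_long_path_first=True)
```

In this toy data set both variants select the same three documents, but the current filter reads all six documents while the optimized one reads four: the two that match on _id plus the two long-path documents. The smaller the fraction of the repository that matches, the larger the gap.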

Issue Links

links to

GitHub Pull Request #1273

People

Assignee: Unassigned

Reporter: Nuno Santos

Votes: 0

Watchers: 1

Dates

Created: 17/Jan/24 07:24

Updated: 09/Apr/24 04:25

Resolved: 29/Jan/24 10:14