Description
In the PIPELINED strategy of the indexing job, when regex path filtering is enabled, the job issues two queries to Mongo:
- Download the ancestors of the base path (e.g., 0:/, 1:/p1, 2:/p1/p2).
- Download all the children of the base path (e.g., ???:/p1/p2/*).
The first query returns only a few results, so it should use the index on _id. However, to handle the rare case where the path is long and the _id field holds a hash instead of the path, the query for the ancestors also matches on the _path field, which is set when _id is a hash. The issue is that _path is not indexed, so the first query falls back to a full collection scan, which is much slower than an index scan for the handful of ancestors. This negates most or even all of the gains from regex filtering.
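To make the problem concrete, here is a minimal sketch of what the ancestors filter might look like. The helper names (ancestor_ids, ancestor_paths, ancestors_filter) and the exact $or/$in shape are assumptions for illustration, not the code the indexing job actually runs; only the <depth>:<path> id format and the _id and _path field names come from the description above.

```python
def _segments(base_path):
    # Split "/p1/p2" into ["p1", "p2"], ignoring empty pieces.
    return [s for s in base_path.split("/") if s]


def ancestor_paths(base_path):
    """Paths from the root down to base_path, inclusive (hypothetical helper)."""
    segs = _segments(base_path)
    return ["/"] + ["/" + "/".join(segs[:d]) for d in range(1, len(segs) + 1)]


def ancestor_ids(base_path):
    """'<depth>:<path>' ids matching the examples in the description."""
    return [f"{depth}:{path}" for depth, path in enumerate(ancestor_paths(base_path))]


def ancestors_filter(base_path):
    # Assumed query shape: match on _id (indexed) OR on _path (not indexed).
    # The second clause covers long paths whose _id is a hash.
    return {
        "$or": [
            {"_id": {"$in": ancestor_ids(base_path)}},
            {"_path": {"$in": ancestor_paths(base_path)}},
        ]
    }


if __name__ == "__main__":
    print(ancestors_filter("/p1/p2"))
```

With a filter of this shape, MongoDB can only use index scans for an $or when every clause is supported by an index. Because the _path clause is not, the whole query is evaluated with a collection scan, even though the _id clause alone would touch only a handful of index entries.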