  1. Jackrabbit Oak
  2. OAK-10671

[Indexing job] - Improve Mongo regex query: remove condition on non-indexed _path field to speedup traversal



    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: 1.62.0
    • Component/s: indexing
    • Labels: None


      Regex path filtering currently is implemented with a condition like:

      _id in [^[0-9]{1,3}:\Q/foo/bar/\E.*$] OR (_id in [^[0-9]{1,3}:h.*$] AND _path in [^\Q/foo/bar/\E.*$])

      The second condition is necessary to handle long path documents, whose _id is a hash instead of the path of the document and which carry an additional _path property with the full path. The _id field is part of the index used by the query, but _path is not indexed. The performance of this query is therefore very sensitive to how many times the condition can be resolved without looking up the value of _path, which requires retrieving the full document from the column store. If the condition can be evaluated using only the _id value, then a document that does not match should never be retrieved from the column store.

      Unfortunately, Mongo does not seem to optimize this query properly: it retrieves the document from the column store even when the _id neither matches the path /foo/bar nor is in the hash format. This leads to very poor performance, as the query has to read both the index and the column store in full.
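
      To make the intended evaluation order concrete, here is a minimal sketch of the current condition as a plain Java predicate. It is not Oak's implementation: the class and method names are hypothetical, the _id values used with it are illustrative, and only java.util.regex is used. The point it shows is that _path only needs to be consulted when the _id is in the hash format and does not already match the path.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of Oak.
class OldPathFilter {
    // Patterns copied from the condition above for the example path /foo/bar.
    static final Pattern ID_PATH = Pattern.compile("^[0-9]{1,3}:\\Q/foo/bar/\\E.*$");
    static final Pattern ID_HASH = Pattern.compile("^[0-9]{1,3}:h.*$");
    static final Pattern PATH = Pattern.compile("^\\Q/foo/bar/\\E.*$");

    // True when the _id alone decides the outcome, so _path (and therefore
    // the full document) would not have to be read.
    static boolean decidableFromIdOnly(String id) {
        return ID_PATH.matcher(id).find() || !ID_HASH.matcher(id).find();
    }

    // The full condition: _id matches the path, OR (_id is a hash AND _path matches).
    static boolean matches(String id, String path) {
        if (ID_PATH.matcher(id).find()) {
            return true;
        }
        return ID_HASH.matcher(id).find()
                && path != null
                && PATH.matcher(path).find();
    }
}
```

      With this ordering, an _id such as 2:/other/node is rejected without touching _path. The behaviour described above suggests that Mongo fetches the document in that case as well.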

      We can instead use the following condition:

      _id in [^[0-9]{1,3}:\Q/foo/bar/\E.*$, ^[0-9]{1,3}:h.*$]

      That is, download the document if its _id matches the path or if it is a hash. The disadvantage is that this downloads all long path documents from the repository, many of which might not be needed. However, the condition uses only the _id field, so it is guaranteed to be evaluated entirely from the data in the index. The number of long path documents is usually very small (some environments have none at all), so downloading them should not take much time. The indexing job reapplies the path filter locally anyway, which eliminates the long path documents it does not need.
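
      The two-stage approach can be sketched as follows. This is an assumption-laden illustration rather than the actual indexing job code: the class and method names are hypothetical, and the server-side condition is modelled as a local predicate so that the example stays self-contained. The first method depends on _id only; the second reapplies the path filter to the downloaded long path documents.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of Oak.
class IdOnlyPathFilter {
    // Patterns copied from the proposed condition for the example path /foo/bar.
    static final Pattern ID_PATH = Pattern.compile("^[0-9]{1,3}:\\Q/foo/bar/\\E.*$");
    static final Pattern ID_HASH = Pattern.compile("^[0-9]{1,3}:h.*$");
    static final Pattern PATH = Pattern.compile("^\\Q/foo/bar/\\E.*$");

    // Stage 1 (server side in the proposal): uses _id only, so it can be
    // answered from the index without reading the document.
    static boolean downloadCandidate(String id) {
        return ID_PATH.matcher(id).find() || ID_HASH.matcher(id).find();
    }

    // Stage 2 (local): documents with a regular _id were already filtered by
    // path in stage 1; long path documents are checked against _path here.
    static boolean keepLocally(String id, String path) {
        if (!ID_HASH.matcher(id).find()) {
            return true;
        }
        return path != null && PATH.matcher(path).find();
    }

    // A document is used by the indexing job only if it passes both stages.
    static boolean selected(String id, String path) {
        return downloadCandidate(id) && keepLocally(id, path);
    }
}
```

      Every long path document passes stage 1 regardless of its path, which is the extra download cost accepted above. Stage 2 then discards the ones outside /foo/bar.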





              Assignee: Unassigned
              Reporter: Nuno Santos (nuno.santos)

