  1. Jackrabbit Oak
  2. OAK-10452

Indexing job/regex filtering: getting ancestors nodes of filtered path incorrectly does a full col scan on Mongo

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.58.0
    • Component/s: indexing
    • Labels: None

    Description

      In the PIPELINED strategy of the indexing job, when regex path filtering is enabled, the job does two queries to Mongo:

      • Download the ancestors of the base path (e.g., 0:/, 1:/p1, 2:/p1/p2).
      • Download all the children of the base path (e.g., ???:/p1/p2/*).
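
To make the first query concrete, here is a minimal sketch of how the ancestor ids for a base path could be enumerated. It assumes ids take the form <depth>:<path>, which is how the document store names nodes with short paths; the class and method names are hypothetical and are not taken from the Oak code base. It also assumes an absolute path with no trailing slash.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: not the Oak implementation.
class AncestorIdsSketch {

    // Returns the "<depth>:<path>" id of the root, of every ancestor,
    // and of the path itself, ordered from the root downwards.
    static List<String> ancestorIds(String path) {
        List<String> ids = new ArrayList<>();
        ids.add("0:/"); // the root node is always an ancestor
        if (path.equals("/")) {
            return ids;
        }
        int depth = 0;
        int slash = 0; // index of the '/' that starts the current segment
        while (slash != -1) {
            int next = path.indexOf('/', slash + 1);
            depth++;
            // Prefix of the path up to (not including) the next '/'
            String prefix = next == -1 ? path : path.substring(0, next);
            ids.add(depth + ":" + prefix);
            slash = next;
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(ancestorIds("/p1/p2")); // prints [0:/, 1:/p1, 2:/p1/p2]
    }
}
```

The list grows with the depth of the path, not with the size of the repository, which is why this lookup is expected to be cheap.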

      The first query returns only a few results, so it should use the index on _id. However, to handle the rare case where the path is long and the _id field holds a hash instead of the path, the ancestors query also matches on the _path field, which is set when _id is a hash. The problem is that _path is not indexed, so the first query falls back to a full collection scan, which is much slower than an index scan for a handful of ancestors. This negates most or even all of the gains from regex filtering.
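
The difference between the two filter shapes can be sketched as follows. This is an assumption about the general form of the queries, not the filters the indexing job actually builds: the class and method names are hypothetical, and the filters are assembled as plain JSON strings (without escaping) so that the example does not depend on the MongoDB driver.

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative only: builds MongoDB-style filter documents as JSON text.
class AncestorFilterSketch {

    // Renders a list of strings as a JSON array. Values are not escaped,
    // which is acceptable for this sketch but not for production use.
    static String jsonArray(List<String> values) {
        return values.stream()
                .map(v -> "\"" + v + "\"")
                .collect(Collectors.joining(", ", "[", "]"));
    }

    // Matches on _id only, so the default _id index can serve the query.
    static String idOnlyFilter(List<String> ids) {
        return "{\"_id\": {\"$in\": " + jsonArray(ids) + "}}";
    }

    // Also matches on _path to cover long paths whose _id is a hash.
    // With no index on _path, this branch cannot be answered from an index.
    static String idOrPathFilter(List<String> ids, List<String> paths) {
        return "{\"$or\": [" + idOnlyFilter(ids)
                + ", {\"_path\": {\"$in\": " + jsonArray(paths) + "}}]}";
    }

    public static void main(String[] args) {
        List<String> ids = List.of("0:/", "1:/p1", "2:/p1/p2");
        List<String> paths = List.of("/", "/p1", "/p1/p2");
        System.out.println(idOnlyFilter(ids));
        System.out.println(idOrPathFilter(ids, paths));
    }
}
```

In MongoDB, an $or query uses indexes only when every clause is supported by one, so a single unindexed clause is enough to force a scan of the whole collection. Running explain() on each filter would show which plan the server picks.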

            People

              Assignee: Unassigned
              Reporter: Nuno Santos (nuno.santos)
              Votes: 0
              Watchers: 2