  1. Lucene - Core
  2. LUCENE-6868

ParallelLeafReader.getTermVectors can indirectly load TVs multiple times

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      ParallelLeafReader's getTermVectors(docId) implementation loops over each of its fields and calls getTermVector(docId, fieldName) on the reader that owns that field. But that call loads the term vectors for all fields in that reader, even though ParallelLeafReader only wants one. The effect is O(n^2) work, where 'n' is the number of fields, when O(n) is achievable. PLR should instead call getTermVectors(docId) (without naming a specific field) once on each of its readers and then aggregate the results.
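
      To make the cost difference concrete, here is a minimal toy model in plain Java. It does not use Lucene at all: CountingReader, perField and perReader are hypothetical names invented for this sketch, and the model assumes (as described above) that every term vector lookup on a sub-reader loads the vectors for all of that reader's fields. It only counts loads under the two strategies; it is not a patch for ParallelLeafReader.

```java
import java.util.*;

/** Toy model only: none of these types are Lucene classes. */
class TermVectorLoadSketch {

    /** Hypothetical stand-in for a sub-reader that always loads every field's vector for a doc. */
    static class CountingReader {
        private final Map<String, String> vectorsByField = new LinkedHashMap<>();
        int loads = 0;       // number of whole-document term vector loads
        int fieldsRead = 0;  // total field vectors materialized across all loads

        CountingReader(String... fields) {
            for (String f : fields) vectorsByField.put(f, "tv(" + f + ")");
        }

        /** Models getTermVectors(docId): one load that reads all fields. */
        Map<String, String> getTermVectors(int docId) {
            loads++;
            fieldsRead += vectorsByField.size();
            return new LinkedHashMap<>(vectorsByField);
        }

        /** Models getTermVector(docId, field): loads everything, then returns one field. */
        String getTermVector(int docId, String field) {
            return getTermVectors(docId).get(field);
        }
    }

    /** Behaviour described in this issue: one single-field lookup per field. */
    static Map<String, String> perField(Map<String, CountingReader> fieldToReader, int docId) {
        Map<String, String> result = new TreeMap<>();
        for (Map.Entry<String, CountingReader> e : fieldToReader.entrySet()) {
            String tv = e.getValue().getTermVector(docId, e.getKey());
            if (tv != null) result.put(e.getKey(), tv);
        }
        return result;
    }

    /** Proposed behaviour: load once per distinct reader, then pick out the mapped fields. */
    static Map<String, String> perReader(Map<String, CountingReader> fieldToReader, int docId) {
        Map<CountingReader, Map<String, String>> loaded = new IdentityHashMap<>();
        Map<String, String> result = new TreeMap<>();
        for (Map.Entry<String, CountingReader> e : fieldToReader.entrySet()) {
            Map<String, String> all =
                loaded.computeIfAbsent(e.getValue(), r -> r.getTermVectors(docId));
            String tv = all.get(e.getKey());
            if (tv != null) result.put(e.getKey(), tv);
        }
        return result;
    }
}
```

      In this model, with four fields that all map to one CountingReader, perField triggers four loads that materialize sixteen field vectors in total, while perReader triggers a single load of four. Both return the same aggregated map.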

      This wouldn't be such a problem if our term vector API/Codec was improved to not load all term vectors for all fields from disk at once.

      Found via randomized testing, with IndexWriter auto-picking ParallelAtomicReader, along with a test I have that asserts TVs aren't fetched for a doc more than once.

            People

            • Assignee: Unassigned
            • Reporter: David Smiley (dsmiley)
