  1. Lucene - Core
  2. LUCENE-7861

Hidden assumption that the return value of IndexSearcher.slices is an array of continuous, sequential slices of the index

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 6.0, 6.5.1
    • Fix Version/s: None
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      The IndexSearcher-method

      protected LeafSlice[] slices(List<LeafReaderContext> leaves)

      can be overridden to customize how the index is searched with multiple threads. However, IndexSearcher assumes the result is an ordered array of continuous slices of the index. If the result is "interleaved" or unordered, searchAfter may skip results.

      The issue seems to be how searchAfter works vs how TopDocs.merge works:

      searchAfter skips every document with a higher score than the "after" document. In case of equal scores, it uses the document ID and skips every document whose ID is less than or equal to that of the "after" document (see PagingFieldCollector).

      TopDocs.merge uses the score to determine which hits should be part of the merged TopDocs. In case of equal scores, it uses the shard index (this corresponds to the slices the IndexSearcher uses) to break ties (see ScoreMergeSortQueue.lessThan).

      So if the shards are noncontinuous/unordered, searchAfter uses a different way of sorting the documents than TopDocs.merge, and therefore hits are skipped.
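The mismatch can be reproduced without Lucene. The following is a minimal sketch, not Lucene code: `SliceOrderDemo`, `page`, and `pageThrough` are hypothetical names, every document is assumed to have the same score, and each slice is modelled as an array of global doc IDs. The merge step breaks ties by slice index, while paging breaks ties by doc ID, as described above.

```java
import java.util.ArrayList;
import java.util.List;

public class SliceOrderDemo {

    /** Returns one page of doc IDs; all documents are assumed to share the same score. */
    public static List<Integer> page(int[][] slices, int afterDoc, int pageSize) {
        List<Integer> merged = new ArrayList<>();
        // Merge tie-break: lower slice (shard) index first, then the slice's own order.
        for (int[] slice : slices) {
            int taken = 0;
            for (int doc : slice) {
                // Paging tie-break: with equal scores, skip doc IDs <= the "after" doc.
                if (doc <= afterDoc) {
                    continue;
                }
                // Each slice contributes at most pageSize hits to the merge.
                if (taken++ < pageSize) {
                    merged.add(doc);
                }
            }
        }
        return new ArrayList<>(merged.subList(0, Math.min(pageSize, merged.size())));
    }

    /** Pages through all hits, using the last hit of each page as the "after" doc. */
    public static List<Integer> pageThrough(int[][] slices, int pageSize) {
        List<Integer> seen = new ArrayList<>();
        int after = -1;
        while (true) {
            List<Integer> hits = page(slices, after, pageSize);
            if (hits.isEmpty()) {
                return seen;
            }
            seen.addAll(hits);
            after = hits.get(hits.size() - 1);
        }
    }

    public static void main(String[] args) {
        int[][] contiguous = {{0, 1}, {2, 3}};
        int[][] interleaved = {{0, 2}, {1, 3}};
        System.out.println(pageThrough(contiguous, 2));  // prints [0, 1, 2, 3]
        System.out.println(pageThrough(interleaved, 2)); // prints [0, 2, 3]
    }
}
```

With interleaved slices, the first page is [0, 2]. The second page then skips doc 1 because its ID is lower than the "after" doc 2, so doc 1 is never returned.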

      On the mailing list, Michael McCandless suggested either improving TopDocs.merge to optionally use the docID for tie breaking (optionally, because the docID is apparently not always global across calls to TopDocs.merge) or at least documenting the requirement on the return value of IndexSearcher.slices().
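To illustrate the first suggestion, here is a hedged sketch of the two tie-breaking orders as plain Java comparators. It does not use or modify TopDocs.merge: `MergeTieBreakDemo`, `Hit`, `BY_SHARD`, and `BY_DOC` are hypothetical names, and the docID tie-break assumes docIDs are global across all shards being merged.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class MergeTieBreakDemo {

    /** Hypothetical stand-in for a scored hit; not a Lucene class. */
    public static final class Hit {
        final int doc;
        final float score;
        final int shardIndex;

        public Hit(int doc, float score, int shardIndex) {
            this.doc = doc;
            this.score = score;
            this.shardIndex = shardIndex;
        }
    }

    /** Higher score first, then lower shard index, then doc order within the shard. */
    public static final Comparator<Hit> BY_SHARD =
        Comparator.comparingDouble((Hit h) -> -h.score)
            .thenComparingInt(h -> h.shardIndex)
            .thenComparingInt(h -> h.doc);

    /** Higher score first, then lower docID (assumes docIDs are global across shards). */
    public static final Comparator<Hit> BY_DOC =
        Comparator.comparingDouble((Hit h) -> -h.score)
            .thenComparingInt(h -> h.doc);

    /** Sorts a copy of the hits with the given order and returns the doc IDs. */
    public static List<Integer> mergedDocs(List<Hit> hits, Comparator<Hit> order) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(order);
        List<Integer> docs = new ArrayList<>();
        for (Hit h : sorted) {
            docs.add(h.doc);
        }
        return docs;
    }

    /** Four equally scored hits from two interleaved shards: {0, 2} and {1, 3}. */
    public static List<Hit> interleavedHits() {
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit(0, 1.0f, 0));
        hits.add(new Hit(2, 1.0f, 0));
        hits.add(new Hit(1, 1.0f, 1));
        hits.add(new Hit(3, 1.0f, 1));
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(mergedDocs(interleavedHits(), BY_SHARD)); // prints [0, 2, 1, 3]
        System.out.println(mergedDocs(interleavedHits(), BY_DOC));   // prints [0, 1, 2, 3]
    }
}
```

Only the docID order agrees with the order searchAfter uses when scores are equal, which is why it would avoid skipped hits for interleaved shards.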

      In my use case (generating a fixed number of slices of approximately equal size), requiring ordered slices forces a less optimal partitioning, but I am not sure whether this has a real impact on performance.
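As a rough illustration of that trade-off, the sketch below partitions leaves into a fixed number of slices while keeping them contiguous and in order. It is an assumption-laden example, not the reporter's code: `SliceBalanceDemo` and `contiguousSlices` are hypothetical names, leaf sizes are plain document counts, and `sliceCount` is assumed to be at least 1.

```java
import java.util.ArrayList;
import java.util.List;

public class SliceBalanceDemo {

    /**
     * Groups leaf indices into at most sliceCount contiguous, ordered slices of
     * roughly equal total size. Assumes sliceCount >= 1.
     */
    public static List<List<Integer>> contiguousSlices(int[] leafSizes, int sliceCount) {
        long total = 0;
        for (int size : leafSizes) {
            total += size;
        }
        List<List<Integer>> slices = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        long running = 0;
        for (int i = 0; i < leafSizes.length; i++) {
            current.add(i);
            running += leafSizes[i];
            // Close the slice once the running total reaches the next equal-share boundary,
            // unless this is already the last allowed slice.
            long boundary = total * (slices.size() + 1) / sliceCount;
            if (running >= boundary && slices.size() < sliceCount - 1) {
                slices.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            slices.add(current);
        }
        return slices;
    }

    public static void main(String[] args) {
        int[] leafSizes = {10, 50, 10, 30};
        System.out.println(contiguousSlices(leafSizes, 2)); // prints [[0, 1], [2, 3]]
    }
}
```

For leaf sizes 10, 50, 10, 30 this yields slices of 60 and 40 documents, whereas an unordered grouping of {50} and {10, 10, 30} would give an even 50/50 split.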


            People

            • Assignee:
              Unassigned
              Reporter:
              christophk Christoph Kaser
            • Votes:
              0
              Watchers:
              2
