  1. Lucene - Core
  2. LUCENE-1483

Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.9
    • Fix Version/s: 2.9
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      This issue changes how an IndexSearcher searches over multiple segments. Currently, a multi-segment search uses a MultiSegmentReader and treats all of the segments as one. This causes Filters and FieldCaches to be keyed to the MultiReader, which makes reopen expensive: even if only a few segments change, the FieldCache is still loaded for all of them.

      This patch instead searches each segment individually, one at a time, while sharing a single HitCollector across all segments. This allows FieldCaches and Filters to be keyed on individual SegmentReaders, making reopen much cheaper. FieldCache loading over multiple segments can be much faster as well: with the old method, all unique terms for every segment are enumerated against each segment, which can be very wasteful because of the likely logarithmic change in terms per segment. Searching individual segments avoids this cost. The term/document statistics from the MultiReader are still used to score results for each segment.
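
As a rough illustration of why per-segment keying makes reopen cheaper, here is a minimal sketch that does not use Lucene at all. `Segment`, `SegmentCache`, and `warm` are hypothetical stand-ins invented for this example; the real FieldCache is more involved. The point is only that a reopened view sharing unchanged segments reuses their cached entries and loads values for new segments alone.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PerSegmentCacheSketch {

    /** Hypothetical stand-in for an immutable segment holding one int field per document. */
    static final class Segment {
        final int[] fieldValues;
        Segment(int... fieldValues) { this.fieldValues = fieldValues; }
    }

    /** Cache keyed on the segment object itself, not on the combined multi-segment view. */
    static final class SegmentCache {
        private final Map<Segment, int[]> entries = new HashMap<>();
        int loads = 0; // counts how often values really had to be loaded

        int[] get(Segment segment) {
            return entries.computeIfAbsent(segment, s -> {
                loads++;
                return s.fieldValues.clone(); // pretend this is the expensive un-inversion step
            });
        }
    }

    /** Touches the cache for every segment of a view, as a sorted search would. */
    static void warm(SegmentCache cache, List<Segment> view) {
        for (Segment segment : view) {
            cache.get(segment);
        }
    }

    /** Returns {loads for the first view, additional loads after a reopen}. */
    static int[] demo() {
        Segment s0 = new Segment(3, 1), s1 = new Segment(7), s2 = new Segment(2, 9);
        Segment s3 = new Segment(4); // new segment that replaces s2 on reopen
        SegmentCache cache = new SegmentCache();

        warm(cache, Arrays.asList(s0, s1, s2));
        int first = cache.loads;

        warm(cache, Arrays.asList(s0, s1, s3)); // reopened view: s0 and s1 are unchanged
        int afterReopen = cache.loads - first;
        return new int[] { first, afterReopen };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(demo())); // prints [3, 1]
    }
}
```

With a cache keyed on the combined reader, the second `warm` call would have had to load all three segments again, since the reopened view is a different object.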

      When sorting, it's more difficult to use a single HitCollector for each sub-searcher, because ordinals are not comparable across segments. To account for this, a new field-sort-enabled HitCollector is introduced that can collect and sort across segments by translating ordinals between them. This TopFieldCollector class collects the values/ordinals for a given segment and, upon moving to the next segment, translates any ordinals/values so that they can be compared against the values for the new segment. This is done lazily.
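
The ordinal translation can be sketched roughly as follows. This is not the code from the patch; `Segment`, `Slot`, `translate`, and `topN` are hypothetical names, and the "doubled ordinal" encoding is just one possible way to represent a value that falls between two terms of the new segment. Each queued hit keeps its string value; its ordinal is only valid for one segment and is re-derived by binary search the first time it is compared in a later segment.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class OrdinalTranslationSketch {

    /** One segment: its sorted unique terms and, per local doc, the ordinal into that array. */
    static final class Segment {
        final String[] sortedTerms;
        final int[] docToOrd;
        Segment(String[] sortedTerms, int[] docToOrd) {
            this.sortedTerms = sortedTerms;
            this.docToOrd = docToOrd;
        }
    }

    /** A queued hit: the value is always valid, the ordinal only for segment number ordSegment. */
    static final class Slot {
        final String value;
        int doubledOrd;
        int ordSegment;
        Slot(String value, int doubledOrd, int ordSegment) {
            this.value = value;
            this.doubledOrd = doubledOrd;
            this.ordSegment = ordSegment;
        }
    }

    /** Maps a value into a segment: 2*i if it equals term i, 2*i - 1 if it sorts just before term i. */
    static int translate(String value, Segment segment) {
        int idx = Arrays.binarySearch(segment.sortedTerms, value);
        return idx >= 0 ? 2 * idx : 2 * (-idx - 1) - 1;
    }

    /** Returns the n smallest field values over all segments, ascending. Assumes n >= 1. */
    static List<String> topN(List<Segment> segments, int n) {
        List<Slot> queue = new ArrayList<>(); // kept sorted by value; the last slot is the bottom
        for (int seg = 0; seg < segments.size(); seg++) {
            Segment segment = segments.get(seg);
            for (int doc = 0; doc < segment.docToOrd.length; doc++) {
                int ord = segment.docToOrd[doc];
                if (queue.size() == n) {
                    Slot bottom = queue.get(n - 1);
                    if (bottom.ordSegment != seg) { // lazy: translate only when first compared
                        bottom.doubledOrd = translate(bottom.value, segment);
                        bottom.ordSegment = seg;
                    }
                    if (2 * ord >= bottom.doubledOrd) {
                        continue; // not competitive; decided with an int comparison only
                    }
                    queue.remove(n - 1);
                }
                Slot slot = new Slot(segment.sortedTerms[ord], 2 * ord, seg);
                int pos = 0;
                while (pos < queue.size() && queue.get(pos).value.compareTo(slot.value) <= 0) {
                    pos++;
                }
                queue.add(pos, slot);
            }
        }
        List<String> result = new ArrayList<>();
        for (Slot slot : queue) {
            result.add(slot.value);
        }
        return result;
    }

    static List<String> demo() {
        Segment a = new Segment(new String[] {"apple", "kiwi", "pear"}, new int[] {2, 0, 1});
        Segment b = new Segment(new String[] {"banana", "kiwi", "zebra"}, new int[] {0, 2, 1});
        return topN(Arrays.asList(a, b), 3);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [apple, banana, kiwi]
    }
}
```

In this sketch the common case, rejecting a non-competitive hit, costs one integer comparison. String comparisons happen only when a hit actually enters the queue, and the binary search runs at most once per bottom slot per segment.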

      All in all, the switch seems to provide numerous performance benefits in both sorted and non-sorted search. We were seeing a significant loss on indices with lots of segments (1000?) and certain queue sizes / queries, but the latest results seem to show that's been mostly taken care of (you shouldn't be using such a large queue on such a segmented index anyway).

      • Introduces
        • MultiReaderHitCollector - a HitCollector that can collect across multiple IndexReaders. Old HitCollectors are wrapped to support multiple IndexReaders.
        • TopFieldCollector - a HitCollector that can compare values/ordinals across IndexReaders and sort on fields.
        • FieldValueHitQueue - a priority queue that is part of the TopFieldCollector implementation.
        • FieldComparator - a new Comparator class that works across IndexReaders. Part of the TopFieldCollector implementation.
        • FieldComparatorSource - new class to allow for custom Comparators.
      • Alters
        • IndexSearcher uses a single HitCollector to collect hits against each individual SegmentReader. All the other changes stem from this.
      • Deprecates
        • TopFieldDocCollector
        • FieldSortedHitQueue
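
To make the collector changes listed above more concrete, here is a minimal, Lucene-free sketch of a search loop that visits segments one at a time with one shared collector, and of how an old-style collector might be wrapped so it still sees index-wide document ids. The interfaces `LegacyCollector` and `SegmentAwareCollector`, the `LegacyWrapper` adapter, and the `setNextSegment` method are all assumptions made for illustration, not the signatures introduced by the patch. Scores are a constant here, whereas the real search scores with index-wide statistics.

```java
import java.util.ArrayList;
import java.util.List;

class SharedCollectorSketch {

    /** Old-style collector (hypothetical): expects index-wide document ids. */
    interface LegacyCollector {
        void collect(int globalDoc, float score);
    }

    /** New-style collector (hypothetical): is told when the search moves to the next segment. */
    interface SegmentAwareCollector {
        void setNextSegment(int docBase);
        void collect(int localDoc, float score);
    }

    /** Adapter that lets an old collector keep working by re-basing segment-local ids. */
    static final class LegacyWrapper implements SegmentAwareCollector {
        private final LegacyCollector delegate;
        private int docBase;

        LegacyWrapper(LegacyCollector delegate) {
            this.delegate = delegate;
        }

        @Override
        public void setNextSegment(int docBase) {
            this.docBase = docBase;
        }

        @Override
        public void collect(int localDoc, float score) {
            delegate.collect(docBase + localDoc, score);
        }
    }

    /**
     * Searches each segment in turn with the same collector.
     * segmentSizes[i] is the number of docs in segment i; matches[i] holds its matching local ids.
     */
    static void search(int[] segmentSizes, int[][] matches, SegmentAwareCollector collector) {
        int docBase = 0;
        for (int i = 0; i < segmentSizes.length; i++) {
            collector.setNextSegment(docBase);
            for (int localDoc : matches[i]) {
                collector.collect(localDoc, 1.0f); // constant score, for illustration only
            }
            docBase += segmentSizes[i];
        }
    }

    static List<Integer> demo() {
        List<Integer> seen = new ArrayList<>();
        LegacyCollector legacy = (doc, score) -> seen.add(doc);
        search(new int[] {5, 4, 3},
               new int[][] {{1, 3}, {0, 2}, {2}},
               new LegacyWrapper(legacy));
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [1, 3, 5, 7, 11]
    }
}
```

A collector written against the segment-aware interface can skip the re-basing and work with local ids and per-segment caches directly, which is where the sorting collector gets its benefit.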

        Attachments

        1. LUCENE-1483.patch
          12 kB
          Mark Miller
        2. LUCENE-1483.patch
          11 kB
          Mark Miller
        3. LUCENE-1483.patch
          10 kB
          Mark Miller
        4. LUCENE-1483.patch
          11 kB
          Michael McCandless
        5. LUCENE-1483.patch
          11 kB
          Mark Miller
        6. LUCENE-1483.patch
          12 kB
          Mark Miller
        7. LUCENE-1483.patch
          18 kB
          Mark Miller
        8. LUCENE-1483.patch
          18 kB
          Mark Miller
        9. LUCENE-1483.patch
          18 kB
          Mark Miller
        10. LUCENE-1483.patch
          61 kB
          Mark Miller
        11. LUCENE-1483.patch
          78 kB
          Michael McCandless
        12. LUCENE-1483.patch
          108 kB
          Mark Miller
        13. LUCENE-1483.patch
          108 kB
          Mark Miller
        14. LUCENE-1483.patch
          168 kB
          Mark Miller
        15. LUCENE-1483.patch
          160 kB
          Mark Miller
        16. sortBench.py
          6 kB
          Michael McCandless
        17. sortCollate.py
          2 kB
          Michael McCandless
        18. LUCENE-1483.patch
          159 kB
          Mark Miller
        19. LUCENE-1483.patch
          167 kB
          Michael McCandless
        20. LUCENE-1483.patch
          167 kB
          Michael McCandless
        21. LUCENE-1483.patch
          171 kB
          Michael McCandless
        22. LUCENE-1483.patch
          171 kB
          Michael McCandless
        23. LUCENE-1483.patch
          183 kB
          Mark Miller
        24. LUCENE-1483.patch
          183 kB
          Mark Miller
        25. LUCENE-1483.patch
          184 kB
          Mark Miller
        26. LUCENE-1483-partial.patch
          62 kB
          Michael McCandless
        27. LUCENE-1483.patch
          187 kB
          Mark Miller
        28. LUCENE-1483.patch
          202 kB
          Michael McCandless
        29. LUCENE-1483.patch
          207 kB
          Michael McCandless
        30. LUCENE-1483.patch
          218 kB
          Michael McCandless
        31. LUCENE-1483.patch
          207 kB
          Michael McCandless
        32. LUCENE-1483.patch
          206 kB
          Michael McCandless
        33. LUCENE-1483.patch
          192 kB
          Michael McCandless
        34. LUCENE-1483.patch
          199 kB
          Michael McCandless
        35. LUCENE-1483.patch
          199 kB
          Michael McCandless
        36. LUCENE-1483.patch
          199 kB
          Michael McCandless
        37. LUCENE-1483-backcompat.patch
          4 kB
          Michael McCandless
        38. LUCENE-1483.patch
          198 kB
          Michael McCandless
        39. LUCENE-1483.patch
          6 kB
          Michael McCandless

              People

              • Assignee: Michael McCandless (mikemccand)
              • Reporter: Mark Miller (markrmiller@gmail.com)
              • Votes: 1
              • Watchers: 2