Lucene - Core
  LUCENE-1483

Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.9
    • Fix Version/s: 2.9
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      This issue changes how an IndexSearcher searches over multiple segments. The current method of searching multiple segments is to use a MultiSegmentReader and treat all of the segments as one. This causes filters and FieldCaches to be keyed to the MultiReader and makes reopen expensive. If only a few segments change, the FieldCache is still loaded for all of them.

      This patch changes things by searching each individual segment one at a time while sharing a single HitCollector across the segments. This allows FieldCaches and Filters to be keyed on individual SegmentReaders, making reopen much cheaper. FieldCache loading over multiple segments can also be much faster: with the old method, every unique term across all segments is enumerated against each segment, which is wasteful given the roughly logarithmic growth of unique terms per segment. Searching individual segments avoids this cost. The term/document statistics from the MultiReader are still used to score results for each segment.

      When sorting, it is more difficult to share a single HitCollector across sub-searchers, because ordinals are not comparable across segments. To account for this, a new field-sort-enabled HitCollector is introduced that can collect and sort across segments (because of its ability to compare ordinals/values across segments). This TopFieldCollector class collects the values/ordinals for a given segment and, upon moving to the next segment, lazily translates any ordinals/values so that they can be compared against the values for the new segment.

      All in all, the switch provides numerous performance benefits for both sorted and non-sorted search. We were seeing a noticeable loss on indices with many segments (around 1000) at certain queue sizes and queries, but the latest results show that has mostly been taken care of (you shouldn't be using such a large queue on such a heavily segmented index anyway).
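
      As a rough sketch of the approach (illustrative, not the patch itself: getSequentialSubReaders mirrors a name discussed in this issue, MultiReaderHitCollector is the collector introduced below, and the setNextReader signature is an assumption):

          import java.io.IOException;
          import org.apache.lucene.index.IndexReader;
          import org.apache.lucene.search.Scorer;
          import org.apache.lucene.search.Weight;

          class PerSegmentSearch {
            // Search each segment in turn with one shared collector, telling the
            // collector each segment's docID base so hits can be rebased.
            static void searchSegments(IndexReader[] subReaders, Weight weight,
                                       MultiReaderHitCollector collector) throws IOException {
              int docBase = 0;
              for (int i = 0; i < subReaders.length; i++) {
                collector.setNextReader(subReaders[i], docBase); // docIDs are now segment-local
                Scorer scorer = weight.scorer(subReaders[i]);    // per-segment scorer
                scorer.score(collector);  // FieldCaches/Filters keyed on subReaders[i]
                docBase += subReaders[i].maxDoc();
              }
            }
          }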

      • Introduces
        • MultiReaderHitCollector - a HitCollector that can collect across multiple IndexReaders. Old HitCollectors are wrapped to support multiple IndexReaders.
        • TopFieldCollector - a HitCollector that can compare values/ordinals across IndexReaders and sort on fields.
        • FieldValueHitQueue - a Priority queue that is part of the TopFieldCollector implementation.
        • FieldComparator - a new comparator class that works across IndexReaders; part of the TopFieldCollector implementation (a hedged sketch of the idea follows this list).
        • FieldComparatorSource - a new class that allows custom FieldComparators.
      • Alters
        • IndexSearcher uses a single HitCollector to collect hits against each individual SegmentReader. All of the other changes stem from this.
      • Deprecates
        • TopFieldDocCollector
        • FieldSortedHitQueue
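
      As a hedged illustration of the comparator idea above (the naming follows what this patch introduces, but the simplified signatures are assumptions, not the patch's actual API):

          import java.io.IOException;
          import org.apache.lucene.index.IndexReader;
          import org.apache.lucene.search.FieldCache;

          // Sketch of a cross-reader int comparator: raw values (not per-segment
          // ordinals) are copied into slots, so slots stay comparable across segments.
          public class SketchIntFieldComparator {
            private final int[] slotValues;    // one collected value per queue slot
            private int[] currentReaderValues; // FieldCache array for the current segment
            private final String field;

            public SketchIntFieldComparator(String field, int numHits) {
              this.field = field;
              this.slotValues = new int[numHits];
            }

            // Called as the search advances to each segment: load that segment's
            // FieldCache, keyed on the SegmentReader rather than the MultiReader.
            public void setNextReader(IndexReader reader) throws IOException {
              currentReaderValues = FieldCache.DEFAULT.getInts(reader, field);
            }

            // Copy the current (segment-local) doc's value into a queue slot.
            public void copy(int slot, int doc) {
              slotValues[slot] = currentReaderValues[doc];
            }

            // Compare two previously collected values; valid across segments.
            public int compare(int slot1, int slot2) {
              int v1 = slotValues[slot1], v2 = slotValues[slot2];
              return v1 < v2 ? -1 : (v1 > v2 ? 1 : 0);
            }
          }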

      Attachments

      1. LUCENE-1483.patch (6 kB) - Michael McCandless
      2. LUCENE-1483.patch (198 kB) - Michael McCandless
      3. LUCENE-1483.patch (199 kB) - Michael McCandless
      4. LUCENE-1483.patch (199 kB) - Michael McCandless
      5. LUCENE-1483.patch (199 kB) - Michael McCandless
      6. LUCENE-1483.patch (192 kB) - Michael McCandless
      7. LUCENE-1483.patch (206 kB) - Michael McCandless
      8. LUCENE-1483.patch (207 kB) - Michael McCandless
      9. LUCENE-1483.patch (218 kB) - Michael McCandless
      10. LUCENE-1483.patch (207 kB) - Michael McCandless
      11. LUCENE-1483.patch (202 kB) - Michael McCandless
      12. LUCENE-1483.patch (187 kB) - Mark Miller
      13. LUCENE-1483.patch (184 kB) - Mark Miller
      14. LUCENE-1483.patch (183 kB) - Mark Miller
      15. LUCENE-1483.patch (183 kB) - Mark Miller
      16. LUCENE-1483.patch (171 kB) - Michael McCandless
      17. LUCENE-1483.patch (171 kB) - Michael McCandless
      18. LUCENE-1483.patch (167 kB) - Michael McCandless
      19. LUCENE-1483.patch (167 kB) - Michael McCandless
      20. LUCENE-1483.patch (159 kB) - Mark Miller
      21. LUCENE-1483.patch (160 kB) - Mark Miller
      22. LUCENE-1483.patch (168 kB) - Mark Miller
      23. LUCENE-1483.patch (108 kB) - Mark Miller
      24. LUCENE-1483.patch (108 kB) - Mark Miller
      25. LUCENE-1483.patch (78 kB) - Michael McCandless
      26. LUCENE-1483.patch (61 kB) - Mark Miller
      27. LUCENE-1483.patch (18 kB) - Mark Miller
      28. LUCENE-1483.patch (18 kB) - Mark Miller
      29. LUCENE-1483.patch (18 kB) - Mark Miller
      30. LUCENE-1483.patch (12 kB) - Mark Miller
      31. LUCENE-1483.patch (11 kB) - Mark Miller
      32. LUCENE-1483.patch (11 kB) - Michael McCandless
      33. LUCENE-1483.patch (10 kB) - Mark Miller
      34. LUCENE-1483.patch (11 kB) - Mark Miller
      35. LUCENE-1483.patch (12 kB) - Mark Miller
      36. LUCENE-1483-backcompat.patch (4 kB) - Michael McCandless
      37. LUCENE-1483-partial.patch (62 kB) - Michael McCandless
      38. sortBench.py (6 kB) - Michael McCandless
      39. sortCollate.py (2 kB) - Michael McCandless

          Activity

          Michael McCandless added a comment -

          Mark did you intend to attach the patch here?

          Mark Miller added a comment -

          I had meant to attach a patch, but then a bunch of stuff wasn't working...

          This is still a poor man's patch. I need to switch to using the expose-subreaders patch. This also doesn't include the multisearcher sort patch yet, because when I tried the first one (2nd rev), everything broke. I'll work on integrating that later.

          I think all tests pass except for the very last sort test.

          Some cleanup needed, including the possible drop of using MultiSearcher itself.

          Basically, it's still in a proof-of-concept stage.

          Marvin Humphrey added a comment -

          > Quick micro bench - did it twice and both times came out 17% slower.

          I'd guess that all the OO construction/destruction costs in this part of your patch are slowing things down.

          +    Searchable[] searchers = new Searchable[readers.length];
          +    for(int i = 0; i < readers.length; i++) {
          +      searchers[i] = new IndexSearcher(readers[i]);
          +    }
          +
          +    MultiSearcher multiSearcher = new MultiSearcher(searchers);
          +    return multiSearcher.search(weight, filter, nDocs, sort);
          
          Mark Miller added a comment - edited

          I'll be sure to include that info with the next set of results.

          I don't think those results represent getting lucky though: it's 4 rounds and 2 runs with the same results (17% both runs). Nothing scientific; I just did it real quick to get a base feel for the slowdown before the patch is finished up.

          EDIT: Just like I forgot to take the optimize out of the sort alg when I pasted it here, it looks like I missed it for the benches as well. Disregard those numbers.

          Here is the alg I used:

          merge.factor=mrg:50
          compound=false
          
          sort.rng=20000:10000:20000:10000
          
          analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
          directory=FSDirectory
          #directory=RamDirectory
          
          doc.stored=true
          doc.tokenized=true
          doc.term.vector=false
          doc.add.log.step=100000
          
          docs.dir=reuters-out
          
          doc.maker=org.apache.lucene.benchmark.byTask.feeds.SortableSimpleDocMaker
          
          query.maker=org.apache.lucene.benchmark.byTask.feeds.SimpleQueryMaker
          
          # task at this depth or less would print when they start
          task.max.depth.log=2
          
          log.queries=true
          # -------------------------------------------------------------------------------------
          
          { "Rounds"
          	{ "Run"
                ResetSystemErase
          
                { "Populate"
                  -CreateIndex
                  { "MAddDocs" AddDoc(100) > : 500000
                  -CloseIndex
                }
              
                { "TestSortSpeed"
                  OpenReader  
                  { "LoadFieldCacheAndSearch" SearchWithSort(sort_field:int) > : 1 
                  { "SearchWithSort" SearchWithSort(sort_field) > : 5000
                  CloseReader 
                
                }
              
                NewRound
               } : 4
          
          } 
          
          RepSumByName
          
          
          Mark Miller added a comment -

          Okay, I straightened things out, and now it looks like possibly no loss (for few segments, anyway). Last I looked at the index, there were only 6 segments. I've got to put real time into all this later though; I've only been able to give it some very background-ish time this morning.

          Yonik Seeley added a comment -

          > Okay, I straightened things out, and now it looks like possibly no loss

          So if there was a 17% loss on the optimized index, and very little loss on a segmented index, I assume that means that matching/scoring is enough slower on the segmented index that the loss in sorting performance doesn't matter as much?

          Mark Miller added a comment -

          Ignore those first results entirely. It turns out I had the latest 1471 patched in. That shouldn't slow down a single segment though, and neither this nor 1471 should have slowed things down, because I thought they only affect multisegment and multi-index searches. Odd, but I just junked all of that and started fresh, did the tests a little closer to right, and the numbers look the same. I didn't want to get too into benching before it's sorted out a bit more. I'll try to get enough time to be more rigorous later though. My free moments are under heavy attack by the female that appears to have made herself at home in my house.

          As a side note, 1471 doesn't work in a couple of ways with this patch - it throws both a NullPointerException and a ClassCastException in different circumstances.

          Michael McCandless added a comment -

          I think there should be very little impact to performance, for single or multi segment indices, for the search itself against a warmed reader. (And actually LUCENE-1471 should make things a wee bit faster, especially if n,m are largeish, though this will typically be in the noise).

          But warming after reopen should be much faster with this patch (we should try to measure that too).

          Mark Miller added a comment -

          > I think there should be very little impact to performance, for single or multi segment indices, for the search itself against a warmed reader. (And actually LUCENE-1471 should make things a wee bit faster, especially if n,m are largeish, though this will typically be in the noise).

          That seems to be in line with what I got with 6 segments. I'm running some tests in the 30-50 segment range on my other laptop now.

          > But warming after reopen should be much faster with this patch (we should try to measure that too).

          I've got a base alg for that type of thing around somewhere too, from 831. It should be about the same, which means pretty dramatic reopen improvements if you have multiple segments, especially if the new segment is small. It's likely to be small in comparison to all of the segments anyway, which means pretty great improvements.

          Mark Miller added a comment -

          I'll bench again after this issue is polished up, but it looks like at 100 segments I am seeing the 20% drop. I didn't see any drop at 6 segments in a retest.

          I'll do some longer, more thought-out benchmarks when the patch is in better shape.

          Michael McCandless added a comment -

          Hmmmmmmm.

          OK I think I see what could explain this: insertion into the pqueue is
          fairly costly. So, because we now make 100 pqueues, each gathering
          top N results, we are paying much more insertion cost overall than the
          single queue that IndexSearcher(MultiReader) uses.

          So.... how about still doing the searches per-sub-reader(searcher),
          but, make a HitCollector that gathers the results into a single
          pqueue, passing that HitCollector to each sub-searcher?

          If that turns out OK, then I think it would make LUCENE-1471 moot
          because we should similarly change MultiSearcher to use a single
          shared pqueue.

          Actually I think this approach should be a bit faster, because there
          is some very small method call overhead to how MultiReader implements
          TermDocs/Positions by "concatenating" its sub-readers. So by pushing
          Searcher down onto each SegmentReader we should gain a bit, but it
          could very well be in the noise. For this reason we may in fact want
          to do this same thing for the "normal" (sort by relevance)
          IndexSearcher.search.

          I wish I'd thought of this sooner. Sorry for the runaround, Mark!

          Doug Cutting added a comment -

          > make a HitCollector that gathers the results into a single pqueue

          That's good when everything's local, but bad when things are distributed. If we move RemoteSearchable to contrib (as discussed in LUCENE-1314) then this may not be a problem, but we might still leave hooks so that someone can write a search that uses a separate top-queue per remote segment.

          Michael McCandless added a comment -

          >> make a HitCollector that gathers the results into a single pqueue
          >
          > That's good when everything's local, but bad when things are distributed. If we move RemoteSearchable to contrib (as discussed in LUCENE-1314) then this may not be a problem, but we might still leave hooks so that someone can write a search that uses a separate top-queue per remote segment.

          Good point; so this means we can't blindly do this optimization to MultiSearcher (w/o having option to do separate queues, merged in the end). But for IndexSearcher(Multi*Reader).search it should be safe?

          Doug Cutting added a comment -

          > But for IndexSearcher(Multi*Reader).search it should be safe?

          Right. Perhaps this is a reason to encourage folks to use MultiReader instead of MultiSearcher. Are there cases, other than distributed, where MultiSearcher is required? If not, perhaps it could be moved to the contrib/distributed layer too.

          Mark Miller added a comment -

          > make a HitCollector that gathers the results into a single pqueue

          Well, it certainly made the code cleaner and the patch a bit nicer, but on a first quick test I still see the 20% slowdown with 100 or so segments.

          I'm looking through to see where I may have done something funny.

          Michael McCandless added a comment -

          > on a first quick test I still see the 20% slowdown with 100 or so segments.

          Argh! Can you post your current patch?

          Mark Miller added a comment -

          Here is what I've got. The final sort test still fails, but the rest should pass.

          - Mark
          Mark Miller added a comment -

          I also did a quick reopen alg. The speed gain on this can really vary depending on index access patterns. I tried adding 500,000 (very small) docs with a random sort field, then added 50,000 docs and reopened 10 times, repeating it all 4 times. The comparison is of the time it takes to load the FieldCache and do one search. With this patch it came out about 40-50% faster. Obviously this is going to depend on many factors in the real world; in certain applications I'm sure it could be many times faster or slower.

          Mark Miller added a comment - edited

          Doing a little profiling on the new code, the top results of interest are:

          FieldSortedHitQueue.lessThan(Object,Object) approx 12%
          FieldSortedHitQueue.insertWithOverflow(Object) approx 12%
          MultiReaderTopFieldDocCollector.collect(int,float) 6.3%
          FieldSortedHitQueue$4.compare() 5.3%

          and on...

          For Lucene trunk, a day or two ago:

          FieldSortedHitQueue.insertWithOverflow(Object) approx 11%
          TopFieldDocCollector.collect(int,float) 7.1%
          FieldSortedHitQueue.lessThan(Object,Object) approx 6.7%
          FieldSortedHitQueue.updateMaxScore 3.2%
          FieldSortedHitQueue$4.compare() 3.2%

          Mark Miller added a comment -

          I think I see my mistake. A dumb one - I think I did it while trying to get things to work and didn't realize I'd left it in. I'll put up another patch after some benches finish.

          Mark Miller added a comment -

          Okay, now it's as fast, if not a bit faster. I was creating a priority queue per reader (n queues for n readers) as part of getting tests to pass early, left it in after fixing things elsewhere, and boom - that explains all the lessThan nonsense in the profiling. Whoops.

          Looks good now though - I still need to investigate the failure of the last sort test.

          Michael McCandless added a comment -

          Excellent! That was a sneaky one.

          I attached a tiny change to the patch, which is to set the docBase in MultiReaderTopFieldDocCollector; this saves the lookup into starts in each collect call.

          Michael McCandless added a comment -

          OK, I ran a quick perf test on a 100 segment index with 1 million docs
          (10K docs per segment), for a single TermQuery ("text"), and I'm
          seeing 11.1% speedup (best of 4: 20.36s -> 18.11s) with this patch, on
          Mac OS X. On Linux I see 6.3% speedup (best of 4: 23.31s -> 21.84s).

          Single segment index shows no difference, as expected.

          I think the speedup is due to avoiding the extra method call plus 2nd
          pass through the int docs[] to add in the doc base, in
          MultiSegmentReader.MultiTermDocs.read(int[] docs, int[] freqs).

          This is a nice "side effect", ie in addition to getting faster reopen
          performance (the original goal here), we get a bump in single term
          search performance.

          I think given this, we should cutover other search methods
          (sort-by-relevance, custom HitCollector) to this approach? Maybe if
          we add a new Scorer.score method that can accept a "docBase" which it
          adds into the doc() before calling collect()? In fact, if we do that,
          we may not even need the new MultiReaderTopFieldDocCollector at all?

          Hmm, though, a Scorer may override that score(HitCollector), eg
          BooleanScorer does. Maybe we have to make a wrapper HitCollector that
          simply adds in the docBase and then invokes the real
          HitCollector.collect after shifting the docBase? Though that costs us
          an extra method call per collect().

          Here's the alg I used (slightly modified from the one above):

          merge.factor=1000
          compound=false
          
          analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
          directory=FSDirectory
          #directory=RamDirectory
          
          doc.tokenized=true
          doc.term.vector=false
          doc.add.log.step=100000
          max.buffered=10000
          ram.flush.mb=1000
          
          work.dir = /lucene/work
          
          doc.maker=org.apache.lucene.benchmark.byTask.feeds.SortableSimpleDocMaker
          
          query.maker=org.apache.lucene.benchmark.byTask.feeds.FileBasedQueryMaker
          file.query.maker.file = test.queries
          
          task.max.depth.log=2
          
          log.queries=true
          
          { "Populate"
            -CreateIndex
            { "MAddDocs" AddDoc(100) > : 1000000
            -CloseIndex
          }
              
          { "Rounds"
            { "Run"
              { "TestSortSpeed"
                OpenReader  
                { "LoadFieldCacheAndSearch" SearchWithSort(sort_field:int) > : 1 
                { "SearchWithSort" SearchWithSort(sort_field) > : 500
                CloseReader 
              }
              NewRound
            } : 4
          } 
          
          RepSumByPrefRound SearchWithSort
          

          It creates the index once, then does 4 rounds of searching with the
          single query "text" in test.queries (SimpleQueryMaker was creating
          other queries that were getting 0 or 1 hits).

          I'm running with "java -Xms1024M -Xmx1024M -Xbatch -server"; java is
          1.6.0_07 on Mac Pro OS X 10.5.5 and 1.6.0_10-rc on 2.6.22.1 linux
          kernel.

          Mark Miller added a comment -

          Thanks Mike.

          Are we going to be screwed by this filter stuff though? I'm looking into the last sort test failing - my first thought is that a filter with, say, 8 docs is getting pushed down onto 4 searches with 2 docs in each reader. A filter that allows the 8th doc won't do very well on a bunch of 2-doc searches... or hopefully I don't know what I'm talking about. Looking to see if I can figure my way around it now.

          Mark Miller added a comment - edited

          Hmmm...well this makes the test pass (I subtract the base from the filter doc id)...

          Let me know if I'm all wet on this...

          EDIT

          that can't be quite right... I'll try making a different test.

          Michael McCandless added a comment -

          One thing to fix is ParallelReader: it currently defines a getSubReaders method, but this won't work for searching (because ParallelReader's sub-readers are not "concatenated"). I think we need a more specific name than "getSubReaders" (there's some initial discussion of this in LUCENE-1475). Maybe getConcatenatedReaders? getSequentialReaders? Something else...?

          Mark Miller added a comment -

          > I think given this, we should cutover other search methods (sort-by-relevance, custom HitCollector) to this approach? Maybe if we add a new Scorer.score method that can accept a "docBase" which it adds into the doc() before calling collect()? In fact, if we do that, we may not even need the new MultiReaderTopFieldDocCollector at all?

          I like this idea. I'd love to figure out how 'outside' systems could get the reopen benefit as well (solr caches and such, beyond internal sorting). This seems like a first step towards that possibility (though I admittedly don't see a clear path yet).

          > Hmm, though, a Scorer may override that score(HitCollector), eg BooleanScorer does. Maybe we have to make a wrapper HitCollector that simply adds in the docBase and then invokes the real HitCollector.collect after shifting the docBase? Though that costs us an extra method call per collect().

          Well, we might as well bench and see what we lose...

          Michael McCandless added a comment -

          Maybe we should add "setDocBase" to HitCollector, then fix all core/contrib HitCollectors to respect it, and then add a method "supportsDocBase()" which returns false in the default impl in HitCollector. Then, in search(), if we are dealing with a HitCollector that does not supportsDocBase(), we have to wrap?

          Alternatively, we could throw an UnsupportedOperationException in setDocBase() by default, catch that, and fallback to wrapping.

          This way we avoid the extra collect() method call in the common cases (builtin HitCollectors). Also, we save an add when the doc is not competitive.
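
          A minimal sketch of the wrapping fallback discussed here, assuming the hypothetical setDocBase/DocBaseCollector names from this thread:

            import org.apache.lucene.search.HitCollector;

            // Wraps a legacy HitCollector that only understands top-level docIDs:
            // collect() receives segment-local docIDs and rebases them before
            // delegating, at the cost of one extra method call per hit.
            class DocBaseCollector extends HitCollector {
              private final HitCollector delegate;
              private int docBase;

              DocBaseCollector(HitCollector delegate) {
                this.delegate = delegate;
              }

              void setDocBase(int docBase) { // called as search moves to each segment
                this.docBase = docBase;
              }

              public void collect(int doc, float score) {
                delegate.collect(docBase + doc, score);
              }
            }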

          Michael McCandless added a comment -

          You shouldn't need to subtract base from the filter's docID, since the filter is operating in the docID space of the sub-reader.

          Is it TestSort.testTopDocScores that you see failing (that's what I see)?

          Unfortunately, the filter in that test is assuming that the IndexReader it's passed is equal to the "full" IndexReader, because it references docs1.scoreDocs[0].doc, which is in the docID space of the full reader.

          I would say the test is buggy (it's making an assumption about the API that happens to be true but was not guaranteed). However, this could mean similar filters "out there" think they can grab docIDs from the top-level IndexReader, cache them, and then assign them inside the getDocIdSet method.
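
          To make that contract concrete, here is a hedged sketch of a filter written against the guarantee described above (TermMatchFilter is a made-up name): it derives its bits only from the reader it is handed, so it is correct whether that reader is the full reader or a single segment.

            import java.io.IOException;
            import org.apache.lucene.index.IndexReader;
            import org.apache.lucene.index.Term;
            import org.apache.lucene.index.TermDocs;
            import org.apache.lucene.search.DocIdSet;
            import org.apache.lucene.search.Filter;
            import org.apache.lucene.util.OpenBitSet;

            // Accepts docs containing the given term. All docIDs come from the
            // passed-in reader, so the set is valid only in that reader's space.
            class TermMatchFilter extends Filter {
              private final Term term;

              TermMatchFilter(Term term) {
                this.term = term;
              }

              public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
                OpenBitSet bits = new OpenBitSet(reader.maxDoc());
                TermDocs td = reader.termDocs(term);
                try {
                  while (td.next()) {
                    bits.set(td.doc()); // never a docID cached from another reader
                  }
                } finally {
                  td.close();
                }
                return bits;
              }
            }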

          Mark Miller added a comment -

          A filter fix that's closer to correct, and a new test.

          Mark Miller added a comment - edited

          > Is it TestSort.testTopDocScores that you see failing (that's what I see)?

          right

          > Unfortunately, the filter in that test is assuming that the IndexReader it's passed is equal to the "full" IndexReader, because it references docs1.scoreDocs[0].doc, which is in the docID space of the full reader.

          Right, it's using the full id space - isn't that legal though? It did work.

          > I would say the test is buggy (it's making an assumption about the API that happens to be true but was not guaranteed). However, this could mean similar filters "out there" think they can grab docIDs from the top-level IndexReader, cache them, and then assign them inside the getDocIdSet method.

          That was my first thought - are they allowed to build the filter this way? But it worked to do it, right? I like your "assumption" argument though - it suits my lazy side.

          What if someone made a filter that is supposed to allow only the first doc? So it is set for 0... all of the first docs of each sub-reader would match. That seemed to be supported before, right?

          EDIT

          nm, I guess that still fits the "assumption" thing. It still feels odd changing this behavior though - I guess you have this same issue over MultiSearchers anyway...

          How about those keeping their own reader-to-filter cache or something? A single filter could correctly apply against a single MultiReader before... now you would need a different filter for each sub-reader, right?

          Michael McCandless added a comment -

          > How about those keeping their own reader-to-filter cache or something? A single filter could correctly apply against a single MultiReader before... now you would need a different filter for each sub-reader, right?

          Such cases would then be caching by sub-reader, right? That's the benefit of this approach. EG I had been thinking we'd need to fix the recently added FieldCacheRangeFilter to also "understand" when it's dealing w/ concatenated sub-readers, but in fact with this change that filter is only given the sub-readers, one at a time, and so it only asks FieldCache per sub-reader, and we automatically get faster reopen performance "for free" (no change to FieldCacheRangeFilter required).

          Still, I agree this is probably dangerous to suddenly change, since there could easily be filters out there that are [illegally] using a docID not belonging/corresponding to the reader that was passed in. So maybe we should provide a migration path. EG, add "allowSubReaders" to Filter, defaulting to "return false" so that any external Filter impls still get passed the Multi*Reader, and then fix all core/contrib filters to return true from that method?
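
          A hedged sketch of the reader-keyed caching pattern this enables, in the spirit of CachingWrapperFilter (PerReaderCachingFilter is a made-up name): because search now hands each sub-reader to the filter separately, cache entries for unchanged segments survive a reopen.

            import java.io.IOException;
            import java.util.WeakHashMap;
            import org.apache.lucene.index.IndexReader;
            import org.apache.lucene.search.DocIdSet;
            import org.apache.lucene.search.Filter;

            // Caches the wrapped filter's DocIdSet per reader. Under per-segment
            // search the keys are SegmentReaders, so a reopen only recomputes new
            // segments; weak keys let closed readers' entries be collected.
            class PerReaderCachingFilter extends Filter {
              private final Filter wrapped;
              private final WeakHashMap cache = new WeakHashMap(); // reader -> DocIdSet

              PerReaderCachingFilter(Filter wrapped) {
                this.wrapped = wrapped;
              }

              public synchronized DocIdSet getDocIdSet(IndexReader reader) throws IOException {
                DocIdSet cached = (DocIdSet) cache.get(reader);
                if (cached == null) {
                  cached = wrapped.getDocIdSet(reader);
                  cache.put(reader, cached);
                }
                return cached;
              }
            }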

          Mark Miller added a comment -

          Okay, I follow now. If you did things correctly, you'll always be passed the right SegmentReader through a hook. Nice. I think we really do want to do this for the other search methods.

          > Still, I agree this is probably dangerous to suddenly change, since there could easily be filters out there that are [illegally] using a docID not belonging/corresponding to the reader that was passed in. So maybe we should provide a migration path. EG, add "allowSubReaders" to Filter, defaulting to "return false" so that any external Filter impls still get passed the Multi*Reader, and then fix all core/contrib filters to return true from that method?

          This seems reasonable. I am following your [illegal] argument better now though, so I wouldn't feel so bad leaving it out. If it's unsupported behavior, I like the idea of adding backward-compat cruft much less. I had it in my head that you might be caching things based on the top-level MultiReader, but it looks like now you should always be using the reader provided by a hook - which will be the single segment reader.

          Mark Miller added a comment -

          I think this is almost there: I added the new logic to the relevance and HitCollector searches as well.

          Mark Miller added a comment -

          That actually needed a couple of tweaks.

          Michael McCandless added a comment -

          Looks good! Do we really need IndexReader.supportsSequentialReaders? Because the default impl (returning a length-1 array of itself) seems sufficient?

          Also, I don't think ParallelReader needs to throw UnsupportedOperationException in that method – it can fallback to the default?

          It would be nice to somehow deprecate "supportsDocBase" so that all outside HitCollectors would need to support it on upgrading to 3.0, but, I'm not sure how to cleanly do that. (Ie I'd rather not have that method continue to exist in 3.0).

          It's a delightfully small patch now!

          Michael McCandless added a comment -

          Also, the back-compat tests fail to compile because we renamed Multi*Reader.getSubReaders --> getSequentialSubReaders. So when committing this, be sure to also fix the test-tag branch!

          Mark Miller added a comment -

          > Looks good! Do we really need IndexReader.supportsSequentialReaders? Because the default impl (returning a length-1 array of itself) seems sufficient?

          Let me investigate. If you try using the default impl, and you change the ParallelReader reopen test to use getSequentialReaders rather than getSubReaders, the test will throw a stack overflow when checking if the reader is closed. I put the exception in there to make sure I wasn't calling it, because upon switching back to getSubReaders the problem goes away. Seemed fair enough, but I guess I'll have to understand what was going on to really respond.

          > It would be nice to somehow deprecate "supportsDocBase" so that all outside HitCollectors would need to support it on upgrading to 3.0, but, I'm not sure how to cleanly do that. (Ie I'd rather not have that method continue to exist in 3.0).

          +1. I don't see what we can do but release it deprecated with a note explaining. Fair enough for 3.0, I think.

          > It's a delightfully small patch now!

          Yeah, this one had the great feeling of the multiterm patch - it rolled right up into something nice. You've got to love the Lucene API, flaws or not.

          Michael McCandless added a comment -

          > If you try using the default impl, and you change the ParallelReader reopen test to use getSequentialReaders rather than getSubReaders

          But you shouldn't change that test (it should continue to call ParallelReader.getSubReaders).

          > I don't see what we can do but release it deprecated with a note explaining. Fair enough for 3.0, I think.

          The problem is this doesn't create the right sequence. Ie, if we mark supportsDocBase as deprecated, then some external HitCollector won't see the deprecation warning (they have not overridden supportsDocBase), and so we can't remove it in 3.0 since their code would then silently break. I think the only thing to do would be to deprecate collect() in favor of another method, or deprecate HitCollector entirely in favor of a new class DocBaseHitCollector (don't like that name). Sigh...

          > Yeah, this one had the great feeling of the multiterm patch - it rolled right up into something nice. You've got to love the Lucene API, flaws or not.

          I love this kind.

          Hide
          Mark Miller added a comment -

          So I don't know what that stack overflow was, but I just put things back to how they should be and all tests pass.

          So very close now. I'm not ready to commit myself though. I do most of my thinking after the work.

          Michael McCandless added a comment -

          Oh one more thing: I think we don't need both supportsDocBase and setDocBase throwing UOE by default? How about just the 2nd one, and you fix search to catch the UOE and wrap in DocBaseCollector?

          Mark Miller added a comment -

          But you shouldn't change that test (it should continue to call ParallelReader.getSubReaders).

          Right, I was coming at it from the wrong angle. I had refactored in the change with Eclipse, and seeing that ParallelReader.getSequentialReaders could cause a stack overflow, that's why I put in the isSequentialSupported check. Guess it's not a concern though. I didn't take much time to understand it.

          The problem is this doesn't create the right sequence. Ie, if we mark supportsDocBase as deprecated, then some external HitCollector won't see the deprecation warning (they have not overridden supportsDocBase), and so we can't remove it in 3.0 since their code will then silently break. I think the only thing to do would be to deprecate collect() in favor of another method. Or, deprecate HitCollector entirely in favor of a new class DocBaseHitCollector (don't like that name). Sigh...

          This depends, right? Don't we have a lot of latitude with 3.0? I would think we could require that you read some upgrade notes on changes...3.0 is our hope to change some things we couldn't normally get away with, I thought. I agree we should be friendly, like 1 to 2, but it's tempting to use 3.0 to do some things more cleanly rather than less.

          Mark Miller added a comment -

          Oh one more thing: I think we don't need both supportsDocBase and setDocBase throwing UOE by default? How about just the 2nd one, and you fix search to catch the UOE and wrap in DocBaseCollector?

          If you'd like...I'm not a fan of flow control with exceptions. I am also not a fan of isSupported methods though...I was leaning that way over exceptions...sounds like you lean the other way...? I guess it has its appeal in this case...

          Doug Cutting added a comment -

          Could we add a new class like

          public abstract class Hitable {
            /** Sets the doc id base of the segment about to be searched. */
            public abstract void setBase(int base);
            /** Called per matching doc; the doc id is relative to the base. */
            public abstract void hit(int doc, float score);
          }
          

          upgrade everything in trunk to use this, and change HitCollector to:

          /** @deprecated */
          public abstract class HitCollector extends Hitable {
            // Legacy collectors expect absolute doc ids, so they can't accept a base.
            public void setBase(int base) { throw new UnsupportedOperationException(); }
          }
          

          then, for back-compatibility, wrap anything that extends HitCollector to rebase?

          Then you'd have neither an isSupported method nor use exceptions for control.
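
          To make the wrapping concrete, here is a minimal sketch of such an
          adapter, assuming the Hitable sketch above and the existing
          HitCollector.collect(int, float); the class name RebasingHitable is
          hypothetical:

          // Hypothetical back-compat adapter: presents a legacy HitCollector,
          // which expects absolute doc ids, through the new Hitable API.
          public class RebasingHitable extends Hitable {
            private final HitCollector legacy;
            private int base;

            public RebasingHitable(HitCollector legacy) { this.legacy = legacy; }

            public void setBase(int base) { this.base = base; } // called per segment

            public void hit(int doc, float score) {
              legacy.collect(base + doc, score); // rebase to an absolute doc id
            }
          }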

          Mark Miller added a comment -

          That's a great idea, I'll try that route.

          Another tiny win in this is that MultiSearcher doesn't have to use a doubled-up HitCollector anymore.

          Mark Miller added a comment -

          upgrade everything in trunk to use this,

          If users are using a HitCollector as a ref to a trunk HitCollector that becomes a Hitable, won't their code break? I really like this idea, but does it get around back compat?

          Doug Cutting added a comment -

          > If users are using a HitCollector as a ref to a trunk HitCollector that becomes a Hitable, won't their code break?

          Lucene never passes someone a HitCollector that they didn't create. So all the classes that folks create may need to remain subclasses of HitCollector, but classes used internally do not need to, and we can deprecate all public HitCollector implementations and provide new versions that extend Hitable.

          http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/class-use/HitCollector.html

          Does that address your concern?

          I'm not too fond of Hitable, but can't think of anything better. With that name, the method might better be called hit() than collect(), but that's a more invasive change. DocCollector? Collector?

          Mark Miller added a comment - edited

          Okay, so subclasses of HitCollector stay subclasses until 3.0, and then they extend Hitable? Sounds good.
          EDIT
          No, that won't work - I read your comment wrong: we have to provide new impls, and old impls will be wrapped. Got it. Patch coming.

          I kind of like Hitable myself. You are collecting a 'hit' with a Hitable object. A hit is a doc and score. Any other opinions? I'm less fond of other Collector variations. hit() could be better than collect(), but collect() is not such a stretch in my mind.

          EDIT

          One further thought, if we are replacing TopDocCollector, what would it become? TopDocDocCollector, TopDocHitable? I'm liking these names less...

          Michael McCandless added a comment -

          > I'm not a fan of control with exceptions. I am also not a fan of of isSupported methods though...I was leaning that way over exceptions...sounds like you lean the other way...? I guess it has its appeal in this case...

          I agree, though I leaned towards using an exception because 1) no new
          API needing future deprecation would then have been added, and 2) the
          expectation (but not forced, and this is a problem) is that over time
          all HitCollector subclasses would implement setBase, so the exception
          would be the exception (heh) not the rule.

          But I like Doug's proposal instead, since on upgrading to 2.9 you'll
          see your code is deprecated, which then allows us to drop it in 3.0.
          I have a slight preference for DocCollector.

          This means any methods that accept a HitCollector would also be
          deprecated, and we'd add a new method that takes DocCollector instead,
          and change the deprecated one to wrap a DocBaseCollector around the
          HitCollector and invoke the new method.

          > we can deprecate all public HitCollector implementations and provide new versions that extend Hitable

          Could we leave (eg) TopDocCollector extending HitCollector in 2.9, and
          then in 3.0 change it to directly extend DocCollector (Hitable)?
          (This saves having to deprecate TopDocCollector and at least 2
          others).

          > Don't we have a lot of latitude with 3.0?

          I think in 3.0, when changing APIs, we are only allowed to remove
          deprecated APIs from 2.9? Ie we can't do more drastic changes.

          Mark Miller added a comment -

          Could we leave (eg) TopDocCollector extending HitCollector in 2.9, and then in 3.0 change it to directly extend DocCollector (Hitable)? (This saves having to deprecate TopDocCollector and at least 2 others).

          If we do that, I don't think we can do our check to use base or not by checking for HitCollector, right?

          Mark Miller added a comment -

          Is that right? We better get going on the Java 5 conversion then...

          Michael McCandless added a comment -

          > If we do that, I don't think we can do our check to use base or not by checking for HitCollector right?

          Ahh right, so if a user outside creates a TopDocCollector & passes it in, we can't tell that it can handle setBase natively. Or we could test & catch the UOE (since HitCollector is deprecated, using an exception here will go away in 3.0).

          Michael McCandless added a comment -

          > Is that right? We better get going on java5 conversion then...

          Ahh right, that too. But, as of 3.0 we can start accepting/making changes to Lucene that require a 1.5 JRE; still, we can't just swap out APIs w/o going through deprecation first?

          Mark Miller added a comment -

          I thought there were some that wanted to change some of the API to Java
          5 for the 3.0 release, because I thought back compat was less restricted
          from 2 to 3. I guess maybe that won't end up happening; if it was going
          to, it seems we'd want to deprecate what will be changed in 2.9.

          - Mark
          Mark Miller added a comment -

          I think it's almost there. I still want to spend a little time looking it over, but I think it's looking good.

          Hoss Man added a comment -

          I haven't been following this issue too closely (or truthfully: at all) but if there's talk of deprecating HitCollector and introducing a new superclass to replace it, would it also make sense to revisit the idea of adding a return type to the collect/hit method? ... ie: an enum style result indicating "OK" or "ABORT" (with the potential of adding additional constants later ala FieldSelectorResult)

          I remember this coming up as a "wish list" back when TimeLimitedCollector was added (but I don't really remember if it was decided that the current implementation was actually better than if collect() did have a return value)

          Anyway, just something to ponder...

          public abstract class Hitable {
            public abstract void setBase(int base);
            public abstract HitStatus hit(int doc, float score);
          }
          /** @deprecated */
          public abstract class HitCollector extends Hitable {
            public abstract void collect(int doc, float score);
            public void setBase(int base) { throw new UnsupportedOperationException(); }
            public HitStatus hit(int doc, float score) { 
              collect(doc, score); 
              return HitStatus.OK;
            }
          }
          
          Michael McCandless added a comment -

          Mark, I got one hunk (HitCollector) that failed on applying the patch – looks like it's the $Id$ issue again (your area doesn't expand $Id$ tags). No problem – I just applied it manually.

          Michael McCandless added a comment -

          adding a return type to the collect/hit method? ... ie: an enum style result indicating "OK" or "ABORT" (with the potential of adding additional constants later ala FieldSelectorResult)

          I think we should consider this, though this then implies an if statement checking the return result & doing something on each hit, so we should test the cost of doing so vs the cost of throwing an exception instead (eg we could define a typed exception in this new interface which means "abort the search now" and maybe another to mean "stop searching & return the results you got so far", etc.).

          Uwe Schindler added a comment -

          adding a return type to the collect/hit method? ... ie: an enum style result indicating "OK" or "ABORT" (with the potential of adding additional constants later ala FieldSelectorResult)

          I think we should consider this, though this then implies an if statement checking the return result & doing something on each hit, so we should test the cost of doing so vs the cost of throwing an exception instead (eg we could define a typed exception in this new interface which means "abort the search now" and maybe another to mean "stop searching & return the results you got so far", etc.).

          This looks like a really good idea. Currently, to stop an iterator, I use an exception class that extends RuntimeException (to have it unchecked) to cancel a search. Very nice if you support it directly.
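
          For illustration, a minimal sketch of that pattern against the existing
          HitCollector API (the collector and exception names here are
          hypothetical, not Lucene classes):

          // Unchecked exception used purely to unwind out of the collect loop.
          class StopCollectingException extends RuntimeException {}

          class BoundedCollector extends org.apache.lucene.search.HitCollector {
            private int remaining;

            BoundedCollector(int limit) { this.remaining = limit; }

            public void collect(int doc, float score) {
              if (--remaining < 0) {
                throw new StopCollectingException(); // aborts the search early
              }
              // ... record the hit ...
            }
          }

          // Caller side:
          // try { searcher.search(query, new BoundedCollector(1000)); }
          // catch (StopCollectingException e) { /* search cancelled as intended */ }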

          Michael McCandless added a comment -

          Duh: I just realized that when we switched back to a single pqueue for
          gathering results across the N subreaders, we lost the original
          intended "benefit" for this issue. Hard to keep the forest in mind
          when looking at all the trees....

          Ie, we are now (again) creating a single FieldSortedHitQueue, which
          pulls the FieldCache for the entire MultiReader, not per-segment. So
          warming time is still slow, when sorting by fields.

          Really we've "stumbled" on 2 rather different optimizations:

          1. Run Scorer at the "sub reader" level: this gains performance
            because you save the cost of going through a MultiReader. This
            requires the new DocCollector class, so we can setDocBase(...).
          2. Do collection (sort comparison w/ pqueue) at the "sub reader"
            level: this gains warming performance because we only ask for
            FieldCache for each subreader. But, it seems to hurt search
            performance (pqueue comparison & insertion cost went up), so it's
            no longer a no-brainer tradeoff (by default at least).

          Given that #1 has emerged as a tentatively fairly compelling gain, I
          now think we should decouple it from #2. Even though #2 was the
          original intent here, let's now morph this issue into addressing #1
          (since that's what current patch does), and I'll open a new issue for
          #2?
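
          For illustration, optimization #1 amounts to a loop like this sketch
          (not the actual patch; DocCollector and setDocBase are the names under
          discussion, and the Scorer/Weight calls are the 2.4-era API):

          // (imports from org.apache.lucene.index/search and java.io omitted)

          // Hypothetical collector interface for the sketch.
          interface DocCollector {
            void setDocBase(int docBase);       // called once per sub-reader
            void collect(int doc, float score); // doc is relative to the base
          }

          // Score each sub-reader directly instead of going through a
          // MultiReader, telling the collector each segment's doc id base.
          static void searchSegments(IndexReader[] subReaders, Weight weight,
                                     DocCollector collector) throws IOException {
            int docBase = 0;
            for (int i = 0; i < subReaders.length; i++) {
              collector.setDocBase(docBase);
              Scorer scorer = weight.scorer(subReaders[i]);
              while (scorer.next()) { // 2.4-era Scorer iteration
                collector.collect(scorer.doc(), scorer.score());
              }
              docBase += subReaders[i].maxDoc(); // next segment's ids start here
            }
          }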

          Mark Miller added a comment -

          Ugg...you know I was afraid of that when I was making the change, but I easily convinced myself that FieldSortedHitQueue was just taking that Reader for AUTO detect and didn't really re-look. It also makes the comparators. Bummer. I guess let's open a new issue if we can't easily deal with it here (I've got to look at it some more).

          Mark Miller added a comment -

          I've got a quick idea I want to try to fix it.

          Mark Miller added a comment -

          Bah. You can't share that queue and get the reopen benefit without jumping through too many hoops. All of a sudden you can't use ordinals, comparators need to know how to compare across segments, and it just breaks down fast. How disappointing.

          Michael McCandless added a comment -

          Yeah, ugg. This is the nature of "progress"! It's not exactly a
          straight line from point A to B. Lots of fits & starts, dead ends,
          jumps, etc.

          We could simply offer both ("collect into single pqueue but pay high
          warming cost" or "collect into separate pqueues, then merge, and pay
          low warming cost"), but that sure is an annoying choice to have to
          make.

          Oh, here's another idea: do separate pqueues (again!), but after the
          first segment is done, grab the values for the worst scoring doc in
          the pqueue (assuming the queue filled up to its numHits) and use this
          as the "cutoff" before inserting into the next segment's pqueue.

          In grabbing that cutoff we'd have to 1) map ord->value for segment 1,
          then 2) map value->ord for segment 2, then 3) use that cutoff for
          segment 2. (And likewise for all segment N -> N+1).

          I think this'd greatly reduce the number of inserts & comparisons done
          in subsequent queues because it mimics how a single pqueue behaves:
          you don't bother re-considering hits that won't be globally
          competitive.

          We could also maybe merge after each segment is processed; that way
          the cutoff we carry to the next segment is "true" so we'd reduce
          comparisons even further.

          Would this work? Let's try to think hard before writing code.

          Mark Miller added a comment -

          We could simply offer both ("collect into single pqueue but pay high warming cost" or "collect into separate pqueues, then merge, and pay low warming cost"), but that sure is an annoying choice to have to make.

          Agreed. I really hope we don't have to settle for this.

          Oh, here's another idea:

          Good one! Keep those ideas coming.

          Would this work?

          It sounds like you've nailed it to me, but I'll let it float around in my head for a bit while I work on some other things.

          Let's try to think hard before writing code

          Now there's a new concept for me. My brain will work itself to death trying to avoid real work.

          Doug Cutting added a comment -

          > public abstract void setBase(int base);

          It occurred to me last night that this really has no place in HitCollector. We're forcing applications to handle an implementation detail that they really shouldn't have to know about. It would be better to pass the base down to the scorer implementations and have them add it on before they call collect(), no?

          Michael McCandless added a comment -

          > It would be better to pass the base down to the scorer implementations and have them add it on before they call collect(), no?

          So we'd add Scorer.setDocBase instead?

          The only downside I can think of here is that often you will perform the addition when it wasn't necessary.

          Ie, if the score is not competitive at all, then you wouldn't need to create the full docID and so you'd save one add opcode.

          Admittedly, this is a very small (tiny) cost, and I do agree that making HitCollector know about docBase is really an abstraction violation...

          Mark Miller added a comment - edited

          Oh, here's another idea: do separate pqueues (again!), but after the first segment is done, grab the values for the worst scoring doc in the pqueue (assuming the queue filled up to its numHits) and use this as the "cutoff" before inserting into the next segment's pqueue.

          We've got to try it. What's the hard part in this? Converting a value to an ord?

          EDIT

          Okay, I see, we can just find our place by running through new value Comparables.

          An added cost of going back to per-reader queues is that all doc id values (not ords) also need to be adjusted (for the MultiSearcher).

          Michael McCandless added a comment -

          OK, here's another tweak on the last proposal: maybe we could,
          instead, take the pqueue produced by segment 1 and "convert" it into
          the ords matching segment 2, and then do normal searching for segment
          2 using that single pqueue (and the same for all seg N -> N+1
          transitions)?

          For all numeric fields, the conversion is a no-op (their ord is
          currently the actual numeric byte, short, int, etc. value, though
          conceivably that could change in the future); only String fields, and
          custom (hmm) would need to do something.

          This should be more efficient than the cutoff approach because it'd
          result in fewer comparisons/inserts. Ie, it's exactly a single pqueue
          again, just with some "conversion" between segments. The conversion
          cost is near zero for numeric fields, and for string fields it'd be
          O(numHits*log2(numValue)), where numValue is the number of unique
          string values in the next segment for that sort field. I think for
          most cases (many more docs than numHits requested) this would be
          faster than the cutoff approach.

          Would that work?

          Yonik Seeley added a comment -

          segment 1 has terms: apple, banana, orange
          segment 2 has terms: apple, orange

          What is the ord of banana in segment2?

          Michael McCandless added a comment -

          > What is the ord of banana in segment2?

          How about 0.5?

          Ie, we just need an ord that means it's in-between two ords for the current segment.

          On encountering that, we'd also need to record its real value so that subsequent segments could look it up properly (or, if it survives until the end, to return the correct value "banana").

          Mark Miller added a comment - edited

          Okay, but how am I going to squeeze between two customs? I guess you'd have to store as a compare against either side?

          EDIT

          There is also the problem that all compares are done based on ScoreDocs that index into a single ord array by doc. The previous pq's ScoreDocs will not compare right - they won't index into the ord array for the current Reader - they are indexes into the array for the previous Reader. This is what made me give up on single pq earlier.

          EDIT

          I guess we put them on the ScoreDoc like we do the values for multisearcher? Then we could use a PQ like FieldDocPQ that used ords rather than vals?

          EDIT

          Hmmm...How do I get at the ordinals though? The value is exposed, but the ordinals are hidden behind a compare method...

          Mark Miller added a comment -

          Okay, in a pinch I guess we just grab the ordinals straight from the field cache and violate the comparator a bit. But we don't have the score docs until after running the collector, so we can't perch stuff on them.

          Hmm - we need something like: the current hits work with the standard get-ord mechanism (which has its own problems because we compare, we don't look at ords), and the last hits work with an ord on the ScoreDoc or something. It's all ugly stuff in my head.

          Michael McCandless added a comment -

          I'm exploring one possible approach, with a new Comparator API that's told when to switch to the next subReader (which gives it the chance to translate the ords in the queue). Not sure it'll work out yet though...

          Mark Miller added a comment -

          That's where I don't follow though - it's not ords in the queue, right? It's ScoreDocs. That's what's getting me at the moment.

          Michael McCandless added a comment -

          Attached initial patch (derived from one of the earlier patches).
          A lot of work remains. TestSort (and likely others) fail.

          > That's where I don't follow though - it's not ords in the queue, right? It's ScoreDocs. That's what's getting me at the moment.

          Exactly – so I built first cut at the alternative "copy value"
          approach, where the comparator (new FieldComparator abstract class) is
          responsible for holding the values it needs for docs inserted into the
          queue. I also added TopFieldValueDocCollector (extends DocCollector),
          and ByValueFieldSortedHitQueue (extends PriorityQueue) that interacts
          with the FieldComparators. (We can change these names...). I updated
          IndexSearcher to use this new queue for field sorting.

          This patch only handles SortField.{DOC,SCORE,INT} now, but I think the
          approach has early surprising promise: I'm seeing a sizable
          performance gain for the "sort by int field" case (13.76 sec vs 17.95
          sec for 300 queries getting top 100 hits from 1M results) --> 23%
          faster. I verified for the test sort alg (above) it's producing the
          right results (at least top 40 docs match).

          I didn't expect such a performance gain (I was hoping for not much
          performance loss, actually). I think it may be that although the
          initial value copy adds some cost, the within-queue comparisons are
          then faster because you don't have to deref back to the FieldCache
          array. It seems we keep accidentally discovering performance gains
          here.

          If we go forward with this approach I think it'd mean deprecating
          FieldSortedHitQueue & ScoreDocComparator, because I think there's no
          back-compatible way to migrate forward. I also like that this
          approach means we only need an iterator interface to FieldCache
          values (for LUCENE-831).

          Mark, can you look this over and see if it makes sense, and maybe try
          to tackle the other sort types? String will be the most interesting,
          but I think very doable.
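
          For a concrete picture of the "copy value" idea, here is a
          stripped-down sketch for the int case (hypothetical names and
          signatures, not the patch's exact API):

          // Values for queued docs are copied into per-slot storage, so
          // within-queue comparisons never dereference back into the
          // FieldCache array of a (possibly previous) reader.
          abstract class FieldComparator {
            abstract int compare(int slot1, int slot2); // compare two queue slots
            abstract void copy(int slot, int doc);      // stash doc's value in a slot
            abstract void setNextReader(int[] values);  // next reader's FieldCache
          }

          class IntComparator extends FieldComparator {
            private final int[] slotValues; // one value per queue slot
            private int[] current;          // current sub-reader's FieldCache ints

            IntComparator(int numHits) { slotValues = new int[numHits]; }

            int compare(int slot1, int slot2) {
              int a = slotValues[slot1], b = slotValues[slot2];
              return a < b ? -1 : (a == b ? 0 : 1);
            }

            void copy(int slot, int doc) { slotValues[slot] = current[doc]; }

            void setNextReader(int[] values) { current = values; }
          }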

          Mark Miller added a comment -

          Fantastic Mike! I've started working through the tests and I've corrected a small ordering problem and added most of the other basic types. This is great stuff I think, but I need some more time to digest it all. Still looks like String will be the interesting piece. I think we have to fill fields for the MultiSearcher as well, which is annoying, because the MultiSearcher will have to let the IndexSearcher know to do it, or we may do it for no reason. Then we just have to clear up the setBase stuff. Light at the end of the tunnel!

          Mark Miller added a comment -

          It seems there are a few hoops to jump through with strings beyond what I was thinking. We don't have the full array of unique terms now, but rather a bunch of smaller arrays of unique terms, with overlap. That makes converting Strings to ords very difficult, right? We almost have to create the full array. Also, we still have to do some fancy footwork to get a proper ord. Then we have to juggle and track things so that we can return the String value in getValue (we are more concerned with ords, so it's not fun). It seems a lot of spinning/tracking unless we go back to a full StringIndex. Perhaps it's just as good to just compare by String value and give up on ords? It really seems that by the time we jump through every hoop to create ords from Strings on a bunch of smaller StringIndexes, we will have at least eaten as much as it costs to do String.compareTo? Not a conclusion really though, I am still mucking...

          Any insight?

          Mark Miller added a comment - edited

          I was off on the fillFields stuff, I forgot we are back to a single HitCollector. EDIT - can't even follow my own thoughts - I wasn't off, you are just already handling the adjustment that needs to be made. I'd like to avoid filling fields unless we are in a MultiSearcher still though...

          I've got everything working except custom and locale stuff, in a suboptimal way anyway. String values rather than ords, and there is plenty that can probably be improved.

          4 tests still fail (3 with custom, 1 with locale). Still trying to lock down the best way to deal with String ords/values.

          Michael McCandless added a comment -

          > I've started working through the tests and I've corrected a small ordering problem and added most of the other basic types.

          Great!

          > Then we just have to clear up the setbase stuff.

          Yeah I think we should remove that (and use setNextReader instead).

          > That makes converting Strings to ords very difficult, right?

          Right (this was the challenging example Yonik brought up above).

          How about something like this: at any given time, the slots are filled
          with an instance that has 1) the ord (that "matches" the current
          reader) and 2) the actual value. When transitioning readers you have
          to remap all ords to the new reader (but keep the value unchanged):
          for each slot, you take its String value and look it up in the new
          reader w/ binary search. If it's present, assign it the corresponding
          ord. If it's not present, it must fall between ord X and X+1, so
          assign it an ord of X+0.5.

          Then proceed like normal, since all ords are now "matched" to your
          current reader. You compare with ords, never with values.

          The one caveat is... we have to take care with precision. A float
          X+0.5 won't be precise enough. We could use double, or, we could use
          long and "remap" all valid ords to be 2X (even) their value, and all
          "in between" ords (carried over from the previous reader) to be odd
          values. I think we'd need long for this since in the worst case, if
          you have the max number of docs and each doc has a unique string, you
          would have 2^31 unique ords to represent.
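
          For illustration, that remap step could look like the following
          sketch, using long ords and the even/odd trick to keep enough
          precision (names are hypothetical; lookup stands in for the new
          reader's sorted unique terms):

          // On transitioning readers, remap each queued slot's ord into the
          // new reader's term space. Terms that are present get even ords
          // (2*i); values that fall between two terms get the odd ord in
          // between, so cross-segment comparisons stay exact.
          static void remapOrds(String[] slotValues, long[] slotOrds, String[] lookup) {
            for (int slot = 0; slot < slotValues.length; slot++) {
              int idx = java.util.Arrays.binarySearch(lookup, slotValues[slot]);
              if (idx >= 0) {
                slotOrds[slot] = 2L * idx;           // present: even "matched" ord
              } else {
                int insertion = -idx - 1;            // first term greater than value
                slotOrds[slot] = 2L * insertion - 1; // absent: odd "in between" ord
              }
            }
          }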

          Uwe Schindler added a comment -

          You do not need to map to long, if you would use negative integers as a marker for in between. So doc 0 has -1, doc 1 has -2, doc 2 has -3, ..., doc (2^31-1 == Integer.MAX_VALUE) would have (-2^31 == Integer.MIN_VALUE).

          Just an idea.

          Uwe Schindler added a comment -

          Sorry, that does not work because of comparing, but you could subtract 2^31 from the doc number and multiply by two.

          Mark Miller added a comment -

          How about something like this:

          Okay, I actually got a ways down that path before I gave up...it was clearer in my head before I got the new code - that had me trying to map one ord at a time (my fault, not the code of course) - but right, I'd have to map them all on reader change. Clears up the haze a bit.

          Locale is in, custom coming. That will make all tests pass. Then I'll get back on this String stuff (switch to ords from the value stuff I have going now). Then cleanup. Very awesome stuff, thanks Mike.

          Mark Miller added a comment -

          What do we do about the old SortComparatorSource et al? I can't figure out a way to make old custom implementations work with this - they implement logic that compares ScoreDocs, and I can't see how I can make a new comparator that works based on custom SortComparators.

          Mark Miller added a comment - - edited

> You compare with ords, never with values.

But I still have to implement getValue for fillFields. That means either storing the value along with the ords (at each slot), or storing the orig ord plus the orig StringIndex it came out of for each slot... right?

          EDIT

          Okay, that answer is in your earlier comment anyway - we have to track value by slot as well. I'm almost there...
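
A sketch of that per-slot storage (the names here are invented for illustration, not taken from the patch):

class StringOrdSlots {
  final int[] ords;       // ord of each slot against the current reader
  final String[] values;  // original value, kept around for fillFields
  StringOrdSlots(int numHits) {
    ords = new int[numHits];
    values = new String[numHits];
  }
  Comparable sortValue(int slot) {
    return values[slot];  // what fillFields would report
  }
}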

          Michael McCandless added a comment -

          > You do not need to map to long, if you would use negative integers as marker for in between

          Actually a variation on this would work: we could logically do the even/odd ord mapping, but shift it down into negative numbers to make use of the full int range. That'd give us enough precision.
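
A minimal sketch of that shifted even/odd encoding, assuming fewer than 2^30 unique terms per reader so the shift cannot overflow (helper names are invented):

class ShiftedOrds {
  // exact matches get even codes, shifted down by Integer.MIN_VALUE
  static int exact(int ord) {
    return (ord << 1) + Integer.MIN_VALUE;
  }
  // a value falling between lowerOrd and lowerOrd+1 gets the odd code in between
  static int between(int lowerOrd) {
    return ((lowerOrd << 1) + 1) + Integer.MIN_VALUE;
  }
  public static void main(String[] args) {
    // a carried-over value between ords 0 and 1 sorts between their codes
    System.out.println(exact(0) < between(0) && between(0) < exact(1)); // true
  }
}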

          Michael McCandless added a comment -

          > What do we do about the old SortComparatorSource et al?

I think we have to deprecate SortComparatorSource and ScoreDocComparator (in favor of FieldComparator, and we should add a FieldComparatorSource too, I guess). Ie, custom sort implementations would need to migrate to the new API?

          Michael McCandless added a comment -

> Sorry, that does not work because of comparing, but you could subtract 2^31 from the doc number and multiply by two.

          Ahh I missed that (I said the same thing 4 comments later... moving fast here!).

          Mark Miller added a comment -

Does this have to wait for 3.0 then? How can we back-compatibly tell everyone they have to write new custom comparator implementations?

          Michael McCandless added a comment -

> Does this have to wait for 3.0 then? How can we back-compatibly tell everyone they have to write new custom comparator implementations?

          If we deprecate old and introduce new in 2.9, then we can remove old in 3.0, right?

          Mark Miller added a comment -

But if you are using the old one, it won't work with the new Searcher methods? I guess I'll have to think about it some more.

I've got the String stuff working (I think) except for one issue... if a bunch of ords already in the queue map to the same index, they need to be adjusted to be in order (eg when you convert a value, it's not in the terms list for the next binary search). I tried some naive tricks to adjust the number based on the old ord, but no real progress yet...

          Michael McCandless added a comment -

> But if you are using the old one, it won't work with the new Searcher methods? I guess I'll have to think about it some more.

I think the search methods, on seeing that there is a SortField.CUSTOM type and then seeing that it's a SortComparatorSource in there, would fall back to the current impl, but all other Sorts would use the new one?

          > if a bunch of ords already in the queue map to the same index, they need to be adjusted to be in order

Hmmm! Perhaps on comparing two "odd" values that are the same we could fall back to a true String.compareTo (I don't like that added if, though; it should be rare). Or, better, we could add a "subord" to break the tie. That subord'd have to be computed by gathering all String values that sorted to the same "odd ord" and sorting them to assign subords.

          Yonik Seeley added a comment -

> If it's not present, it must fall between ord X and X+1 so assign it an ord of X+0.5.

          I went down this thought path in the past... float doesn't have the precision, double might, could use long and left shift to make space for "inbetween", etc.

          One problem is that putting an ord "inbetween" isn't good enough since you may be mapping many values. At that point, one starts wondering if it's worth getting an ord, and what is solved or optimized by using ords.

          Why are ords useful:
          (a) fast comparison when reordering the priority queue (insertion/removal of items)
          (b) fast comparison to determine if something should be inserted into the priority queue
          (c) other (non sorting) usage, using term numbers instead of term values.

          Using ords for (a) seems expensive, as all ords in the queue must be translated.
          Using ords for (b) means calculating an ord only for the smallest element of the priority queue.

When doing a large search with a diverse set of keys to sort by, there are many more checks to see if an item should be inserted into the queue than actual insertions. Also, when using an ord for (b), it doesn't have to be exact - it can be rounded, since its purpose isn't an exact comparison, but just an optimization to determine if insertion into the priority queue can be skipped.
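
A sketch of case (b), just to pin down why a rounded ord is safe there: skipping must be conservative, so the cheap test may only reject hits that definitely cannot enter the queue (names are invented):

class BottomGate {
  int bottomOrdUpper;  // upper bound on the bottom entry's position in this segment
  String bottomValue;  // exact value of the weakest queued entry

  // ascending sort: true only when the hit definitely cannot make the queue
  boolean skip(int docOrd, String docValue) {
    if (docOrd > bottomOrdUpper) {
      return true;                               // cheap ord-only reject
    }
    return docValue.compareTo(bottomValue) >= 0; // exact compare as fallback
  }
}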

          Mark Miller added a comment -

> One problem is that putting an ord "inbetween" isn't good enough since you may be mapping many values.

> I don't like that added if, though; it should be rare

Not rare, right? Let's say there are 20 values in the queue, and now the new reader has 2 values. Lots will map to the same index.

I'm still digesting the rest of what Yonik said. I don't have this String ord stuff working 100%, but I have it sort of working (it seems to sort pieces right, but then the pieces are out of order). I don't think the precision exists to do what I'm doing anyway, but since it kind of works, I could prob bench the diff of using values rather than ords.

          Mark Miller added a comment -

> But if you are using the old one, it won't work with the new Searcher methods? I guess I'll have to think about it some more.

>> I think the search methods, on seeing that there is a SortField.CUSTOM type and then seeing that it's a SortComparatorSource in there, would fall back to the current impl, but all other Sorts would use the new one?

          Ahhh...I was afraid of the ugliness. All right.

          Michael McCandless added a comment -

> Not rare, right? Let's say there are 20 values in the queue, and now the new reader has 2 values. Lots will map to the same index.

True, not rare in that case, but then the added cost will be tiny since the new segment has relatively few hits.

          Michael McCandless added a comment -

          > At that point, one starts wondering if it's worth getting an ord, and what is solved or optimized by using ords.

          I agree: all this complexity may not be worth it, if simple
          compare-by-value in fact performs well enough. I think we should
          benchmark both.

Mark, if you get things to a semi-stable state & post a patch we
can do some benching. You could actually just have two SortField
types (STRING and STRING_ORD); then we could easily swap
back & forth.

          Mark Miller added a comment - - edited

> Or, better, we could add a "subord" to break the tie. That subord'd have to be computed by gathering all String values that sorted to the same "odd ord" and sorting them to assign subords.

That seems like a bit of work to manage. What if I just kept a separate double subords array that holds the old mapping? If two slots are equal, we can fall back to that? We would have the whole new array, but the other way we have to track what maps to the same index, sort that, and then have a map or another array anyway. What do we save? We could have an int[] rather than a double[]? But a lot more juggling...

          EDIT
Nm... that logic doesn't quite work...

          EDIT

Or it does... man, I have a scatterbrain and a half. I got it to work (at least partially - the tests pass). I'll now work on getting something up for you to test with. My main worry at this point is that I'm doing things inefficiently enough that it's not a fair test.

          Mark Miller added a comment -

          Quick question: what should I use for the value version...Strings or StringIndex? StringIndex takes less mem, but requires two array lookups.

          Mark Miller added a comment -

Here you go. I'm sorry I've made everything so messy - I'll clean it up later.

          There is now SortField.STRING_VAL and STRING_ORD.

STRING uses StringIndex,
STRING_VAL uses String[],
STRING_ORD uses ordinals.

I haven't benched anything yet myself.

There is still a lot of cleanup to do beyond this String stuff (and the old comparators still don't work).

I haven't gotten over everything post-scramble yet: hope I'm not doing anything too stupid. Sort tests pass other than the custom comparator.

          Mark Miller added a comment -

          One small error:

+ public int sortType() {
+   return SortField.STRING_VA;
+ }

should be

+ public int sortType() {
+   return SortField.STRING_VAL;
+ }

          Mark Miller added a comment -

That actually had a System.out as well. Another patch that takes care of the above error and the System.out, and a manual fix of the HitCollector $Id$.

          Michael McCandless added a comment -

Thanks Mark. FWIW, when I generate a patch (with "svn diff") that is near the $Id$ tag, I too cannot apply the patch. So it seems like "svn diff" has some "smarts" whereby it un-expands a keyword, thus screwing up the patch. We really need that "svn patch" command... (which IIRC is coming in an upcoming svn release).

          Mark Miller added a comment -

          Some more quick comments:

the StringIndex value version does too much array dereferencing, so I'd fix that, or just use the String version.

I'm doing ordinals by keeping a second double subord array. Every time ords are mapped to the new IndexReader, if an ord doesn't map directly onto the new terms array and the subord is not 0, I multiply the old ord mapping into the subord and give it the new ord - i.e. the subord becomes the old ord times the current subord, and the ord is updated. When comparing, if two ords are the same, we drop to the subord.

I haven't thought about precision issues (it prob doesn't work, but I don't know), but it works for the tests.

          Michael McCandless added a comment -

BTW, one important difference w/ the new TopFieldValueDocCollector is that it does not track the max score – we probably need to add that back in, until Hits is removed in 3.0 (is it needed beyond that?).

          Michael McCandless added a comment -

          There's something wrong w/ the SortField.STRING_ORD case – I'm trying "sort by title" w/ a Wikipedia index, and while I see the same results in a clean checkout vs STRING_VAL, I get different (wrong) results for STRING_ORD.

          Clean checkout & STRING_VAL get:

          hit 0: docID=1974521 title="Born into Trouble as the Sparks Fly Upward."
          hit 1: docID=688913 title="Into The Open" Exhibition
          hit 2: docID=1648 title="Love and Theft"
          hit 3: docID=599545 title="Repent, Harlequin!" Said the Ticktockman
          hit 4: docID=349499 title="The Spaghetti Incident?"
          

          but STRING_ORD gets this:

          hit 0: docID=599545 title="Repent, Harlequin!" Said the Ticktockman
          hit 1: docID=688913 title="Into The Open" Exhibition
          hit 2: docID=1974521 title="Born into Trouble as the Sparks Fly Upward."
          hit 3: docID=992439 title='Abd al-Malik II
          hit 4: docID=1951563 title='Auhelawa language
          

          I haven't tried to track it down yet...

          Mark Miller added a comment -

> I haven't tried to track it down yet...

I wouldn't. The method I am using passes the tests, but it's not likely viable (maybe even on a smaller scale than I would have guessed, based on what you're seeing). But since it's close, I figure a benchmark of it against using values should tell us a lot about whether it makes sense to keep pushing with ord. Ord will no doubt end up a little slower if it's to work properly, but comparing them now should give us a gauge of using values instead. That's what we were looking for, right?

I still have tons I want to look at in this patch (and hopefully some ideas/suggestions). I haven't looked at it at all in that context yet though. I merely sat down and made the tests pass one by one, with little consideration for anything else.

          Mark Miller added a comment -

Another thing I'll do is add another sort test - the tests may not hit all of the edge cases - I don't think they hit compare(ord, doc, score) at all, for one (if I am remembering right).

          Mark Miller added a comment -

I just made a quick new test for what I'm doing with ords - it seems once I add more than about 500 docs, one or two are out of order - and that problem compounds as the number goes up. Either a special case I am missing, or precision issues.

          Michael McCandless added a comment -

          OK I ran an initial test, though since the ord approach is a "bit"
          buggy we can't be sure how well to trust these results.

I indexed the first 2M docs from Wikipedia into a 101-segment index,
then searched for "text" (97K hits), sorting by title and pulling the
best 100 hits. I do the search 1000 times in each round.

          Current trunk (best 107.1 searches/sec):

Operation            round   runCnt   recsPerRun        rec/s  elapsedSec    avgUsedMem    avgTotalMem
XSearchWarm              0        1            1          0.0       93.64   463,373,760  1,029,046,272
XSearchWithSort_1000     0        1         1000        100.6        9.94   463,373,760  1,029,046,272
XSearchWithSort_1000     1        1         1000        107.1        9.34   572,969,344  1,029,046,272
XSearchWithSort_1000     2        1         1000        105.5        9.48   572,969,344  1,029,046,272
XSearchWithSort_1000     3        1         1000        106.2        9.41   587,068,928  1,029,046,272
          

          Patch STRING_ORD (best 102.0 searches/sec):

Operation            round   runCnt   recsPerRun        rec/s  elapsedSec    avgUsedMem    avgTotalMem
XSearchWarm              0        1            1          0.5        2.16   384,153,600  1,029,046,272
XSearchWithSort_1000     0        1         1000         94.1       10.63   439,173,824  1,029,046,272
XSearchWithSort_1000     1        1         1000        100.7        9.93   439,173,824  1,029,046,272
XSearchWithSort_1000     2        1         1000        101.9        9.81   573,822,208  1,029,046,272
XSearchWithSort_1000     3        1         1000        102.0        9.81   573,822,208  1,029,046,272
          

          Patch STRING_VAL (best 34.6 searches/sec):

Operation            round   runCnt   recsPerRun        rec/s  elapsedSec    avgUsedMem    avgTotalMem
XSearchWarm              0        1            1          0.4        2.24   368,201,088  1,029,046,272
XSearchWithSort_1000     0        1         1000         34.6       28.94   415,107,648  1,029,046,272
XSearchWithSort_1000     1        1         1000         33.9       29.54   415,107,648  1,029,046,272
XSearchWithSort_1000     2        1         1000         33.9       29.46   545,339,904  1,029,046,272
XSearchWithSort_1000     3        1         1000         34.0       29.40   545,339,904  1,029,046,272
          

          Notes:

          • Populating the field cache on trunk for MultiReader is
            fantastically costly (94 sec). The IO cache was already hot so
            this isn't IO latency. I think MultiTermEnum/Docs behaves badly
            for this use case (single unique term (title) per doc). We really
            need to switch to column-stride fields, not un-invert, for this.
          • For this case at least STRING_ORD is still quite a bit faster than
            STRING_VAL; however, it's still buggy. Maybe a smaller queue size
            (eg 10 or 20) would make them closer.
          • STRING_ORD is still a bit slower than trunk's sort; hopefully once
            tuned it'll be closer.

          I think we now need to fix the STRING_ORD bug & retest.

          Mark Miller added a comment -

Thanks Mike. Yup - that says enough for me. We have to get ords working. I don't think ords will get faster though (you wouldn't surprise me, though) - I think they will get slower. Certainly they should stay much faster than value though - what a dog. But who knows - we can test falling back to value compare instead of subords as well. My naive ords attempt now keeps the previous mapping's ord order by multiplying it as we move to new readers - that's going to explode into some very large numbers pretty fast, and I don't expect we can get by so easily. Either fall back to value will be good enough, or we will prob have to map to new ords rather than simply multiplying to retain each stage's ordering.

I'll keep playing with the ord on my end though - I only got it to pass those tests moments before that patch went up. I try to keep a frantic pace because I never know if my spare cycles will go away - I have juggled a defensive plate for a while. No doubt I'll squeeze some more hours in tonight though.

          Mark Miller added a comment -

I'm starting to think fall back to value won't be so bad. I'll give you another cut that does fall back to value, and whatever I have to do to get the correct subord stuff right.

          Michael McCandless added a comment -

          We should definitely try fallback compare-by-value.

But, sort by title (presumably a unique key) is actually a worst case
for us, because all values in the queue will not exist in the next
segment. So it's a good test. We should also test sorting by an
enum field ("country", "state").

          Thinking more about how to compute subords... I think we could store
          ord & subord each as int, and then efficiently translate them to the
          next segment with a single pass through the queue, in sort key order.
          This would ensure we hit all the dups (different Strings that map to
          the same ord in the next segment, but different subords) in one
          cluster. And, the subord could be easily computed by simply
          incrementing (starting with 1) in key sort order, until the cluster is
          done.

          It should be simple to step through the pqueue's heap in sort order
          min->max (w/o removing the entries which is the "normal" heapsort way
          to sort the elements); you'd need to maintain some sort of queue to
          keep track of the "frontier" as you walk down the heap. But I haven't
          found a cookbook example yet... It should be fast since we can use
          the ord/subords in the queue for all within-queue comparisons.

          We could also save time on the binary search by bounding the search by
          where we just found the last key. It may be worth tracking the max
          value in the queue, to bound the other end of the search. For a big
          search the queue should have a fairly tight bound.
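
A sketch of that bounded lookup, using the same miss encoding as java.util.Arrays.binarySearch (a standalone helper; the patch may structure this differently):

class BoundedSearch {
  // search only between the previous hit's position (lo) and the
  // position bounding the queue's max value (hi), both inclusive
  static int binarySearch(String[] a, String key, int lo, int hi) {
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      int cmp = a[mid].compareTo(key);
      if (cmp < 0) lo = mid + 1;
      else if (cmp > 0) hi = mid - 1;
      else return mid;
    }
    return -(lo + 1);  // -(insertionPoint) - 1, like Arrays.binarySearch
  }
}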

          Mark Miller added a comment -

> We should also test sorting by an enum field ("country", "state").

I'm actually trying it on random data that I am creating, and I'm getting different results. Oddly, in many cases Values seem to beat Trunk.

> We could also save time on the binary search by bounding the search by where we just found the last key. It may be worth tracking the max value in the queue, to bound the other end of the search. For a big search the queue should have a fairly tight bound.

Right, I've been thinking of ways to cut that down too. It's definitely binary searching much more than it needs to. That seems to be a big ord slowdown from what I can tell - the fallback-to-value compare is actually appearing slower than straight by-value. I've made a couple of small optimizations, but there's definitely more.

> Thinking more about how to compute subords...

Cool. Great idea to think about. My main search has been for a way to get subord as an int. Getting both to int would certainly be optimal. Everything I've come up with seems too expensive though - I'll try to run with that idea.

          Michael McCandless added a comment -

          Ugh – that approach won't work, because the pqueue in the collector
          is not necessarily sorted primarily by our field (eg if the String
          sort field is not the first SortField). So we don't have a fast way
          to visit the keys in sorted order.

          Mark Miller added a comment -

          Okay, I just full on did it inefficiently, and maybe I can work backwards a little. Seems to be solid.

I just collect the old ords by making a map with the mapped-to index as the key. The map has List values, and when a list gets more than one entry, the entry is added to a morethanone set. After mapping all the ords, I go through the morethanone set and sort each list - each subord is then set based on its index in the sorted list.

We already knew that was easy enough - I just think it's probably on the terribly inefficient side. Now to think about whacking pieces off. It just makes me not very hopeful to start at something so slow. And still the double ords. Perhaps the negative int could still come into play, though.

          Way too many little objects being made...
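
Roughly, that pass might look like the following sketch (invented names, generics for brevity, and an int lower-bound ord convention; the actual patch keeps different bookkeeping):

import java.util.*;

class NaiveSubordRemap {
  static void remap(final String[] values, int[] ords, int[] subords,
                    String[] newLookup) {
    // bucket the slots whose values miss in the new segment, keyed by mapped-to ord
    Map<Integer,List<Integer>> buckets = new HashMap<Integer,List<Integer>>();
    for (int slot = 0; slot < values.length; slot++) {
      int idx = Arrays.binarySearch(newLookup, values[slot]);
      if (idx >= 0) {
        ords[slot] = idx;
        subords[slot] = 0;      // exact match: no tie to break
      } else {
        ords[slot] = -idx - 2;  // lower bound in the new segment
        List<Integer> bucket = buckets.get(ords[slot]);
        if (bucket == null) {
          bucket = new ArrayList<Integer>();
          buckets.put(ords[slot], bucket);
        }
        bucket.add(slot);
      }
    }
    // sort each colliding bucket by value, then assign subords in sorted order
    for (List<Integer> bucket : buckets.values()) {
      Collections.sort(bucket, new Comparator<Integer>() {
        public int compare(Integer a, Integer b) {
          return values[a].compareTo(values[b]);
        }
      });
      for (int i = 0; i < bucket.size(); i++) {
        subords[bucket.get(i)] = i + 1;  // subords start at 1
      }
    }
  }
}

A length-1 bucket trivially gets subord 1, which matches the shortcut suggested a few comments below.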

          Michael McCandless added a comment -

          Can you post your inefficient remapping version?

          I too created the fallback version (just set subord to -1 when value
          isn't found in the new index & respect that in the two compare
          methods). I confirmed it gives correct top 50 results. This then
          brings ord perf to 97.6 searches/sec (vs trunk 107.1 searches/sec), so
          that's our number to beat since it seems to be bug-free.
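
A sketch of those two compare paths (field names invented; ords here are small per-segment positions, so plain subtraction is safe):

class OrdFallbackCompare {
  int[] ords;      // per-slot ord against the current reader
  int[] subords;   // -1 marks "value not present in this segment"
  String[] values; // kept per slot for the fallback compare

  int compare(int slot1, int slot2) {
    if (ords[slot1] != ords[slot2]) {
      return ords[slot1] - ords[slot2];
    }
    if (subords[slot1] == -1 || subords[slot2] == -1) {
      return values[slot1].compareTo(values[slot2]);  // fall back to value
    }
    return subords[slot1] - subords[slot2];
  }
}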

          Then I ran "MatchAllDocsQuery" (to test a much larger result set –
          this returns 2M hits but previous query "text" returned ~97K hits),
          sorting by title, queue size=100. Trunk still has unbelievely slow
          warming (95 sec), and then gets 7.6 searches/sec. Patch ord search
          (with fallback) gets 30.7 searches/sec.

          This is very interesting and odd – I can't explain why ord searching
          w/ fallback is so much faster than current trunk when the number of
          hits is large (2M). I think this is very important because it's the
          big slow queries that are most important to improve here, even if it's
          at some cost to the queries that are already fast.

          Ie, we still need to do more tests, but if this result holds (and we
          need to explain the difference), I think it's a strong vote for the
          ord+fallback approach. Not to mention, it also sidesteps the absurdly
          slow warming time of FieldCache.StringIndex on a Multi*Reader.

          Michael McCandless added a comment -

          When I run MatchAllDocsQuery (2M hits), with a queue size of 10 instead of 100, trunk still gets 7.6 searches/sec and ord w/ fallback gets 33.1 searches/sec.

          Michael McCandless added a comment -

          I tested the query "1", which gets 384K hits. Trunk gets 48.1 searches/sec and ord w/ fallback gets 55.0.

          So somehow as the result set gets larger, with a crossover somewhere between 97K hits (query "text") and 384K hits (query "1"), the ord w/ fallback becomes faster and then gets much faster as the result set gets quite large (2M hits).

          Mark Miller added a comment -

          Okay, here is the super inefficient ords version.

          SortField.STRING_VAL: sort by val
          SortField.STRING_ORD: sort by ord and subord
          SortField.STRING_ORD_VAL: sort by ord fallback to val

          Those multireader fieldcache loading times blow me away...crazy.
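
For reference, selecting among the modes would look something like this (STRING_VAL / STRING_ORD / STRING_ORD_VAL are patch-only constants, not part of any released SortField):

import java.io.IOException;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopFieldDocs;

class SortModeExample {
  // swap the constant to bench STRING, STRING_VAL, STRING_ORD, STRING_ORD_VAL
  static TopFieldDocs topByTitle(IndexSearcher searcher, Query query)
      throws IOException {
    Sort sort = new Sort(new SortField("title", SortField.STRING_ORD_VAL));
    return searcher.search(query, null, 100, sort);
  }
}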

          Mark Miller added a comment -

> I too created the fallback version (just set subord to -1 when value isn't found in the new index & respect that in the two compare methods).

Yours may be better than mine then - I got rid of the subord array for the fallback, and if the ords are equal, I do a compare on the values.

          Mark Miller added a comment - - edited

> This is very interesting and odd - I can't explain why ord searching w/ fallback is so much faster than current trunk when the number of hits is large (2M).

I was testing with randomly created data (just a random number of digits (2-8), each digit randomly 0-9), and in a lot of what I was doing, straight values seemed to handily beat straight ords! It depended on how many docs I was making and how segmented I made it, I think. It wasn't very official, and the test was somewhat short, but I ran it over and over... seemed odd.

          Mark Miller added a comment - - edited

I guess I can't get away with my new index calculation either: ords[i] = ((-index << 1) - 3) / 2.0d;

Index is an int, so it's going to overflow.

EDIT

Wait... that's based on unique terms per reader, not the possible number of docs... guess it can stay.
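
Unpacking that expression against java.util.Arrays.binarySearch's miss encoding (a miss returns -(insertionPoint) - 1), it simplifies to insertionPoint - 0.5, i.e. halfway between the two surrounding ords. A tiny check:

import java.util.Arrays;

class InBetweenOrdCheck {
  public static void main(String[] args) {
    String[] lookup = { "apple", "orange" };            // the new segment's terms
    int index = Arrays.binarySearch(lookup, "banana");  // miss: returns -2
    double ord = ((-index << 1) - 3) / 2.0d;
    System.out.println(ord);  // 0.5: between apple (ord 0) and orange (ord 1)
  }
}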

          Mark Miller added a comment - - edited

Or can it? Over int.max/2 unique ids and then sorting on id would be broken, right? Okay, it would be kind of nuts to try and sort on that many unique terms, but in the future?...

EDIT

Actually, one seg would need int.max/2, but you know what I mean...

EDIT

Okay, I guess my argument with JIRA cleared up - you'd have to have the second segment or later with over int.max/2 terms. Do we care about such an insane possibility?

          Michael McCandless added a comment -

> I was testing with randomly created data (just a random number of digits (2-8), each digit randomly 0-9)

          Can you post a patch for this? Seems handy to fix contrib/benchmark to be able to generate such a field...

          > straight values seemed to handily beat straight ords!

          I'll try to test this case too; we need to understand why natural data (Wiki titles) shows one thing but synthetic data shows the opposite. And we still need to test the enum case.

          Michael McCandless added a comment -

> I guess I can't get away with my new index calculation either: ords[i] = ((-index << 1) - 3) / 2.0d;

          I think you can use an int for the ords? Now that we have subord,
          when you get negative index back from binary search, you can set ord
          to -index-1 which is the "lower bound", and then as long as subord is
          at least 1 it should compare correctly.

          Also, in your 2nd pass, if the list is length 1 then you can
          immediately set subord to 1 and move on.

          In your first pass, in the "else" clause (when the value was found in
          the next segment) don't you need to set subord to 0?

          Mark Miller added a comment - - edited

> I think you can use an int for the ords? Now that we have subord, when you get negative index back from binary search, you can set ord to -index-1 which is the "lower bound", and then as long as subord is at least 1 it should compare correctly.

Ah, indeed - no need for an in-the-middle value if you can fall to the subord. I think I may have been thinking it would be nice not to fall through to the subords so much, but surely it's worth losing the double and doing the average. It seems to like -2, not -1, though; then I start the subords at 1 rather than 0... I'll bring real thought to it later, but that's generating some good-looking results.

          EDIT

Arg - with -2 one test fails, with -1 it passes, but reverse sort fails. Prob is 1 one then, and I've got a small issue elsewhere.

> In your first pass, in the "else" clause (when the value was found in the next segment) don't you need to set subord to 0?

Hmm... I had that commented out as I was playing... let me think - if you don't set it, and multiple numbers don't map to clean new ords and they are the same, it will fall to the subords... so right, you wouldn't want an old subord around.

          I'm hoping there is a lot more optimization we can do to the pure ords case, but frankly I have low hopes for it - it should be a bit more competitive though.

I'm going to take some time to finish up some other pieces and then come back again, I think. I've polished up the comparators a bit, so once I get some other work in I'll put up another rev.
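
The -2 matches java.util.Arrays.binarySearch's encoding: a miss returns -(insertionPoint) - 1, so the lower-bound ord is -index - 2 (while -index - 1 is the insertion point itself). A quick check:

import java.util.Arrays;

class LowerBoundCheck {
  public static void main(String[] args) {
    String[] lookup = { "apple", "orange" };
    int index = Arrays.binarySearch(lookup, "banana");  // -2
    int insertionPoint = -index - 1;  // 1: would slot in before "orange"
    int lowerBoundOrd = -index - 2;   // 0: the ord of "apple"
    System.out.println(insertionPoint + " " + lowerBoundOrd);  // prints "1 0"
  }
}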

          Mark Miller added a comment -

> Admittedly, this is a very small (tiny) cost, and I do agree that making HitCollector know about docBase is really an abstraction violation...

          I'm not sold either way. Push to scorer?

          Michael McCandless added a comment -

          But, with this new approach (single pqueue that "knows" when we transition to the next segment) aren't we back to HitCollector not only knowing the doc base but also the IndexReader we are now advancing to? (We should remove the setDocBase call).

          Or I guess we could pre-add the doc (in Scorer) so that collect is called with the full docID.

Still, that "tiny" performance cost nags at me. Most of the time the add would not have been necessary, since the number of inserts into the pqueue should be a small percentage for a large number of hits. And this is the hotspot of searching for Lucene, so maybe we should not add on this cost even if it's tiny? And we can always wrap a collector that doesn't implement setIndexReader and pre-add the docId for it. It's like an "expert" DocCollector API vs the normal one.
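
As a sketch of the two options side by side (rough shape only, with invented class names; the exact committed API may differ):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.HitCollector;

// the "expert" collector: the searcher announces each segment's reader
// and doc base, and collect() then sees cheap per-segment docIDs
abstract class ExpertCollector extends HitCollector {
  public abstract void setNextReader(IndexReader reader, int docBase)
      throws IOException;
}

// the wrapping fallback for legacy collectors: pre-add the doc base so
// the old collector keeps seeing absolute docIDs
class LegacyAdapter extends ExpertCollector {
  private final HitCollector legacy;
  private int docBase;
  LegacyAdapter(HitCollector legacy) { this.legacy = legacy; }
  public void setNextReader(IndexReader reader, int docBase) {
    this.docBase = docBase;
  }
  public void collect(int doc, float score) {
    legacy.collect(docBase + doc, score);  // re-base to an absolute id
  }
}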

          Michael McCandless added a comment -

          > Prob is 1 one then, and I've got a small issue elsewhere.

          I think we want ord to be the lower bound. EG if seg 1 has:

          apple -> 0
          banana -> 1
          orange -> 2

          and then seg 2 has just apple & orange, then banana should map to ord 0 subord 1, meaning it's between ord 0 & 1, I think?

          And an exact match (apple & orange in this case) should have subord 0.

          > I'm hoping there is a lot more optimization we can do to the pure ords case, but frankly I have low hopes for it - it should be a bit more competitive though.

          I think the perf gains are very compelling, already, for the ord fallback case & title sorting. Small result sets are slower, but large result sets are substantially faster, than current trunk. Not to mention much faster warming time (side-stepping the weirdness with Multi*Reader, and only loading the FieldCache for new segments).

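          (Aside: a small self-contained demo of this lower-bound mapping. It is illustrative only; none of these names come from the patch.)

            import java.util.Arrays;

            // Demo of the ord/subord idea for the example above: seg 2 holds
            // {apple, orange}, and banana from seg 1 must land between them.
            // An exact hit keeps subord 0; a miss maps to the lower-bound ord
            // with subord > 0.
            public class OrdMappingDemo {
              public static void main(String[] args) {
                String[] seg2Terms = {"apple", "orange"};
                String[] queued = {"apple", "banana", "orange"};
                for (int i = 0; i < queued.length; i++) {
                  int index = Arrays.binarySearch(seg2Terms, queued[i]);
                  int ord, subord;
                  if (index >= 0) {
                    ord = index;       // exact match
                    subord = 0;
                  } else {
                    ord = -index - 2;  // lower bound (term just below)
                    subord = 1;        // non-zero: between ord and ord + 1
                  }
                  System.out.println(queued[i] + " -> ord " + ord
                      + " subord " + subord);
                }
                // prints: apple -> ord 0 subord 0
                //         banana -> ord 0 subord 1
                //         orange -> ord 1 subord 0
              }
            }
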
          Mark Miller added a comment - - edited

          Okay, I guess that's fair enough. I was going to push the fallback sorted search (with the old Customs) down to a single reader too, but that's not actually worth keeping setDocBase for (or needed, now that I think about it). So setNextReader brings the same abstraction argument though. But what can you do, I guess - the benefits are clearly worth it, and those comparators need access to the current subreader.

          Mark Miller added a comment -

          I think the perf gains are very compelling, already, for the ord fallback case & title sorting. Small result sets are slower, but large result sets are substantially faster, than current trunk.

          Oh, I agree there - I think this patch still makes perfect sense - it brings a lot of gains. I just don't think that ords without fallback is going to get very good. I'm wondering if we should even try too hard if ord with val fallback does so well.

          Mark Miller added a comment -

          and then seg 2 has just apple & orange, then banana should map to ord 0 subord 1, meaning it's between ord 0 & 1, I think?

          apple -> 0
          orange -> 1

          the binary search gives back -insertionPoint - 1, the insertion point for banana is 1, so -1 - 1 = -2. So I reverse that and subtract 2 to get 0, right? It lands on apple. Then on sort, apple comes first for 0, 1 and then orange is 0, 2.

          (I don't remember offhand why subord has to start at 1 not 0, but I remember it didn't work otherwise.)

          Michael McCandless added a comment -

          > the binary search gives back -insertionPoint - 1, the insertion point for banana is 1, so -1 - 1 = -2. So I reverse that and subtract 2 to get 0, right? It lands on apple.

          Hmm – I didn't realize binarySearch is returning the insertion point on a miss. So your logic (negate then subtract 2) makes perfect sense now.

          Just to be sure... maybe you should temporarily add asserts, when a negative index is returned, that values[-index-2].compareTo(newValue) < 0 and values[-index-1].compareTo(newValue) > 0 (making sure those array accesses are in bounds)?

          > (I don't remember offhand why subord has to start at 1 not 0, but I remember it didn't work otherwise.)

          This is very important – that 1 is "equivalent" to the original 0.5 proposal, i.e., think of subord as the 2nd digit in a 2-digit number. That 2nd digit being non-zero is how we know that even though banana's ord landed on apple's, banana is in fact not equal to apple (because the subord for banana is > 0) and is instead between apple and orange.

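          (Aside: those temporary asserts, sketched as a helper. lookup stands in for the segment's sorted term array, as in FieldCache.StringIndex; the method name is invented.)

            // Hypothetical helper wrapping the lower-bound binary search with
            // the suggested sanity checks, bounds-checked as noted above.
            static int checkedLowerBound(String[] lookup, String val) {
              int index = java.util.Arrays.binarySearch(lookup, val);
              if (index >= 0) {
                return index;                      // exact match
              }
              int ord = -index - 2;
              // the entry at ord must sort below val...
              assert ord < 0 || lookup[ord].compareTo(val) < 0;
              // ...and the next entry must sort above it
              assert ord + 1 >= lookup.length
                  || lookup[ord + 1].compareTo(val) > 0;
              return ord;
            }
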
          Michael McCandless added a comment -

          > I just don't think that ords without fallback is going to get very good. I'm wondering if we should even try too hard if ord with val fallback does so well.

          Maybe we can try a bit more (I'll run perf tests on your next iteration here?) and then start wrapping things up? Progress not perfection! We can further improve this later.

          Mark Miller added a comment -

          I'm on board with whatever you think is best.

          I'll keep playing with ords.

          I spent some time last night putting in most of the remaining cleanup/finish-up outside of the comparators. There's a handful of tests outside the SortTest classes that still fail though, so I still have to fix those. I'll do that, give ords a little play time, and then I think the patch will be fairly close. Then we can take it in and bench on a fairly close-to-done version.

          Uwe Schindler added a comment -

          I still have one question: why do we need the new DocCollector? Is it really needed? Would it not be OK to just add the offset before calling collect()?

          Mark Miller added a comment -

          I still have one question: why do we need the new DocCollector? Is it really needed? Would it not be OK to just add the offset before calling collect()?

          If it's not needed, let's get rid of it. We don't want to deprecate HitCollector if we don't have to. The main reason I can see that we are doing it at the moment is that TopFieldValueDocCollector needs that hook so that it can set the next IndexReader for each Comparator. The Comparator needs it to create the FieldCaches and map ords from one reader to the next. Also, it lets us do the docBase stuff, which is nice because you add the docBase less often if it's done in the collector.

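          (Aside: a rough sketch of that last point. The queue, minScoreInQueue and docBase fields are assumptions loosely modeled on TopDocCollector, not the patch's code.)

            // The docBase addition is only paid when a hit actually enters
            // the priority queue, which is rare for large result sets; a
            // scorer-side "pre-add" would pay it on every collect() call.
            public void collect(int doc, float score) {
              totalHits++;
              if (score > minScoreInQueue) {
                queue.insert(new ScoreDoc(docBase + doc, score));
                minScoreInQueue = ((ScoreDoc) queue.top()).score;
              }
            }
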
          Michael McCandless added a comment -

          > Why do we need the new DocCollector? Is it really needed? Would it not be OK to just add the offset before calling collect()?

          I'd like to allow for 'expert' cases, where the collector is told when
          we advance to the next sequential reader and can do something at that
          point (like our sort-by-field collector does).

          But then still allow for 'normal' cases, where the collector is
          unchanged with what we have today (ie it receives the "real" docID).

          The core collectors would use the expert API to eke out all
          performance; external collectors can use either, but the 'normal' one
          would be simplest (and match back compat).

          So then how to "implement" this approach... I would actually be fine
          with keeping HitCollector, adding a default "setNextReader" method,
          that either throws UOE or (if we are strongly against exceptions)
          returns "false" indicating it cannot handle sequential readers.

          Then when we run searches we simply check if the collector is an
          "expert" one (does not throw UOE or return false from setNextReader)
          and if it isn't we wrap it with DocBaseCollector (which adds the doc
          base for every collect() call).

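          (Aside: a minimal sketch of that wrapping, with DocBaseCollector written out. The setNextReader signature here is an assumption taken from this thread, not a committed API.)

            // Plain collectors keep seeing "real" docIDs: the wrapper hears
            // about segment transitions and re-bases every collected doc.
            class DocBaseCollector extends HitCollector {
              private final HitCollector delegate;
              private int docBase;

              DocBaseCollector(HitCollector delegate) {
                this.delegate = delegate;
              }

              void setNextReader(IndexReader reader, int base) {
                docBase = base;        // docID offset of the new segment
              }

              public void collect(int doc, float score) {
                delegate.collect(docBase + doc, score);
              }
            }
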
          Mark Miller added a comment -

          Hmmm...we had a reason for deprecating HitCollector, though. At first it was to do the capability check (an instance of HitCollector would be wrapped), but that didn't pan out. I think we also liked it because people got deprecation warnings though - so that they would know to implement that method for 3.0 when we would take out the wrapper.

          Michael McCandless added a comment -

          > so that they would know to implement that method for 3.0 when we would take out the wrapper.

          Right, but the new insight (for me at least) is that it's OK for external collectors not to code to the expert API.

          I.e., previously we wanted to force migration to the expert API, but now I think it's OK to allow the normal API and the expert API to exist together.

          Mark Miller added a comment -

          Okay, I hate the idea of leaving in the wrapper, but it's true that's too difficult a method for HitCollector (to be required, anyway). setNextReader is a jump in understanding above setDocBase, which was bad enough.

          Mark Miller added a comment - - edited

          Hey Mike, how about this one? BooleanScorer can collect hits out of order if you force it (against the contract). I think it's an issue with the docBase type stuff.

          Actually, I'll clarify that - I think it's an issue with the multiple-reader mojo - didn't mean to put it solely on adding bases in particular yet.

          Michael McCandless added a comment -

          > BooleanScorer can collect hits out of order if you force it (against the contract).

          Hmmm... right. You mean if you pass in allowDocsOutOfOrder=true (defaults to false).

          I think this should not be a problem? (Though, I really don't fully understand BooleanScorer!). Since we are running scoring per-segment, each segment might collect its docIDs out of order, but all such docs are still within the current segment. Then when we advance to the new segment, the collector can do something if it needs to, and then collection proceeds again on the next segment's docs, possibly out of order. Ie, the out-of-orderness never jumps across a segment and then back again?

          But this is a challenge for LUCENE-831, if we go with a primarily iterator-driven API.

          Mark Miller added a comment -

          I didn't think it should be a problem either, since we just push everything to one reader. But it seems to be - the only tests not passing involve allowDocsOutOfOrder=true. Do the search with it true, then do the same search with it false, and you get 3 and 4 docs. 2 or 3 tests involving that fail. I don't have time to dig in till tonight though - thought you might shortcut me to the answer.

          Doug Cutting added a comment -

          I would actually be fine with keeping HitCollector, adding a default "setNextReader" method, that either throws UOE or (if we are strongly against exceptions) returns "false" indicating it cannot handle sequential readers.

          Could we instead add a new HitCollector subclass, that adds the setNextReader, then use 'instanceof' to decide whether to wrap or not?

          I really don't fully understand BooleanScorer!

          The original version of BooleanScorer uses a ~16k array to score windows of docs. So it scores docs 0-16k first, then docs 16k-32k, etc. For each window it iterates through all query terms and accumulates a score in table[doc%16k]. It also stores in the table a bitmask representing which terms contributed to the score. Non-zero scores are chained in a linked list. At the end of scoring each window it then iterates through the linked list and, if the bitmask matches the boolean constraints, collects a hit. For boolean queries with lots of frequent terms this can be much faster, since it does not need to update a priority queue for each posting, instead performing constant-time operations per posting. The only downside is that it results in hits being delivered out-of-order within the window, which means it cannot be nested within other scorers. But it works well as a top-level scorer. The new BooleanScorer2 implementation instead works by merging priority queues of postings, albeit with some clever tricks. For example, a pure conjunction (all terms required) does not require a priority queue. Instead it sorts the posting streams at the start, then repeatedly skips the first up to the last. If the first ever equals the last, then there's a hit. When some terms are required and some terms are optional, the conjunction can be evaluated first, then the optional terms can all skip to the match and be added to the score. Thus the conjunction can reduce the number of priority queue updates for the optional terms. Does that help any?

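          (Aside: a toy model of the windowed scoring described above, simplified: no linked list of non-zero slots, no coord factors, and all names invented.)

            // Score one window of docs: accumulate per-slot scores plus a bit
            // per matching term, then sweep the table and collect docs whose
            // mask satisfies the required clauses. Hits come out in table
            // order, which is why this scorer delivers docs out of order.
            static void scoreWindow(int base, int window, int[][] termDocs,
                float[] termScores, int requiredMask, HitCollector hc) {
              float[] scores = new float[window];
              int[] masks = new int[window];
              for (int t = 0; t < termDocs.length; t++) {
                for (int i = 0; i < termDocs[t].length; i++) {
                  int doc = termDocs[t][i];
                  if (doc >= base && doc < base + window) {
                    scores[doc - base] += termScores[t]; // constant-time work
                    masks[doc - base] |= 1 << t;         // which terms hit
                  }
                }
              }
              for (int slot = 0; slot < window; slot++) {
                if (masks[slot] != 0
                    && (masks[slot] & requiredMask) == requiredMask) {
                  hc.collect(base + slot, scores[slot]);
                }
              }
            }
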
          Mark Miller added a comment - - edited

          Could we instead add a new HitCollector subclass, that adds the setNextReader, then use 'instanceof' to decide whether to wrap or not?

          Woah! Don't make me switch all that again! I've got wrist injuries here. The reason I lost the instanceof is that we would have to deprecate the HitCollector implementations because they need to extend HitCollector. Mike seemed against deprecating those if we could get away with it, so I've since dropped that. I've already gone back and forth - what's it going to be? I'll admit I don't like the exception trap I'm using now, but I don't much like the return-true/false method either...

          Edit

          Ah, I see, you have a new tweak this time. Extend HitCollector rather than having HitCollector extend the new type...

          Nice, I think this is the way to go.

          Doug Cutting added a comment -

          > Woah! Don't make me switch all that again!

          Sorry, I'm just tossing out ideas. Don't take me too seriously...

          > The reason I lost the instanceof is that we would have to deprecate the HitCollector implementations because they need to extend HitCollector.

          Would we? I was suggesting that, if we're going to have two APIs, one expert and one non-expert, then we could make the expert API a subclass and not deprecate or otherwise alter HitCollector. I do not like using exceptions for normal control flow. Instanceof is better, but not ideal. A default implementation of an expert method that returns 'false', as Mike suggested, isn't bad and might be best. It requires neither deprecation, exceptions nor instanceof. Would we have a subclass that overrides this that's used as a base class for optimized implementations?

          Michael McCandless added a comment -

          > Would we have a subclass that overrides this that's used as a base class for optimized implementations?

          If we do this, I don't think we need a new base class for "expert" collectors; they can simply subclass HitCollector & override the setNextReader method?

          Though one downside of this approach is that the "simple" HitCollector API is polluted with this advanced method, and HitCollector's collect method gets different args depending on what that method returns. It's a somewhat confusing API.

          I guess I'd actually prefer subclassing HitCollector (SequentialHitCollector? AdvancedHitCollector? SegmentedHitCollector?), adding setNextReader only to that subclass, and using instanceof to wrap HitCollector subclasses.

          Mark Miller added a comment -

          >> Woah! Don't make me switch all that again!

          >Sorry, I'm just tossing out ideas. Don't take me too seriously...

          Same here. If you guys have 100 ideas, I'd do it 100 times. No worries. Just wrist frustration. I misunderstood you anyway.

          It requires neither deprecation, exceptions nor instanceof.

          Okay, fair points. I guess my main dislike was having to call it, see what it returns, and then maybe call it again. That turned me off as much as instanceof. I'm still liking the suggestion you just made myself...

          Mike?

          Michael McCandless added a comment -

          > Does that help any?

          Yes, thanks! So much so that I'm going to go add that blurb to the javadocs...

          Mark Miller added a comment -

          I guess I'd actually prefer subclassing HitCollector (SequentialHitCollector? AdvancedHitCollector? SegmentedHitCollector?), adding setNextReader only to that subclass, and using instanceof to wrap HitCollector subclasses.

          That's actually what I prefer as well (and what I tried). I used MultiReaderHitCollector. Still thinking about the name...

          Michael McCandless added a comment -

          I like MultiReaderHitCollector!

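          (Aside: the shape this gives, sketched. The exact setNextReader signature is an assumption from this thread, and the wrapping helper is hypothetical.)

            // Expert collectors subclass and hear about segment transitions;
            // plain collectors are wrapped via instanceof so they still
            // receive absolute docIDs.
            public abstract class MultiReaderHitCollector extends HitCollector {
              public abstract void setNextReader(IndexReader reader, int docBase)
                  throws IOException;
            }

            // In the search code, something like:
            static MultiReaderHitCollector asMultiReaderCollector(
                final HitCollector hc) {
              if (hc instanceof MultiReaderHitCollector) {
                return (MultiReaderHitCollector) hc;   // expert: use as-is
              }
              return new MultiReaderHitCollector() {   // normal: wrap
                private int base;
                public void setNextReader(IndexReader reader, int docBase) {
                  base = docBase;
                }
                public void collect(int doc, float score) {
                  hc.collect(base + doc, score);       // restore real docID
                }
              };
            }
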
          Mark Miller added a comment -

          Hmmm... right. You mean if you pass in allowDocsOutOfOrder=true (defaults to false).

          I think this should not be a problem? (Though, I really don't fully understand BooleanScorer!). Since we are running scoring per-segment, each segment might collect its docIDs out of order, but all such docs are still within the current segment. Then when we advance to the new segment, the collector can do something if it needs to, and then collection proceeds again on the next segment's docs, possibly out of order. Ie, the out-of-orderness never jumps across a segment and then back again?

          I was off base with my guess - it's actually only using one reader for that test (3 or 4 docs). It's got to be that the HitCollector the out-of-order scorer uses needs to be tweaked. Last tests to fix.

          Mark Miller added a comment -

          Hmmm...working more with ints as ords rather than doubles...it gives us ints, but it complicates things a bit. Before, the only ords that had to be sorted and suborded were ones that didn't map onto the new Reader exactly. With an int ord, everything you add is going to collide, and you need the ords in the queue added to the double lists, and you need to fall through to the subord much more often...

          interesting...

          I guess I'll go with it for now though...

          Michael McCandless added a comment -

          Hang on – if the value carries over to the new segment (and you set subord to 0) then you don't need to add those ords to the double lists?

          Mark Miller added a comment -

          Yeah - sorry. I actually realized that as I just finished it off, but I'm trying not to spam the dev list so much (not winning that war). But it does drop through more often. Ignore those concerns. I'll put up a patch in a minute.

          Mark Miller added a comment - - edited

          This patch is entering the finishing stages I think. This one is pretty much functionally complete and all tests should pass.

          There is still a bunch of polish to be done though.

          The following sort types are still in: SortField.STRING_VAL, STRING_ORD, and STRING_ORD_VAL; STRING is currently set to straight ord.

          I think the ord case is still pretty slow; I'm sure there are still a few optimizations left, but it would be nice to see where it's at.

          There is still an issue with custom FieldComparators - they are currently passed the top-level reader in the hook - this still needs to be addressed somehow. We also need a test for one.

          • Mark

          (ignore the couple of setDocBases you see in contrib - I've got 'em)

          Mark Miller added a comment -

          Hang on - if the value carries over to the new segment (and you set subord to 0) then you don't need to add those ords to the double lists?

          What was actually happening: I noticed it wasn't quite working right after switching ords to ints from double, and I realized the problem was that there was always going to be a collision for the sort list, whereas before, there was only a sortable collision when more than one mapped-from ord collided. So I thought that out wrong and figured you needed to sort the current ord as well, but in fact, of course you don't: I just needed to assume there is always a collision that adds to the sort list, not wait for 2 mapped-from ords to collide.

          Michael McCandless added a comment -

          Patch is looking good! All tests, and back-compat tests, pass. I'm
          going to run a round of perf tests...

          Some minor things I noticed:

          • Fix indent on FieldComparatorSource.java to 2 spaces
          • Leftover "check val" print in FieldComparator.java
          • Do we need to track maxScore in TopFieldValueDocCollector (but
            mark as deprecated)? (Because Hits isn't removed yet).
          • Got some "nocommit" comments to resolve still
          • I think StringComparatorLocale should call FieldCache.getStrings
            (as it does on trunk now when you do String sort w/ Locale), not
            getStringIndex. Then the queue should just hold the String[]
            values, not StringIndex[]?
            (Aside: we could fix StringIndex computation to take a Locale,
            which'd give better performance, but that's a separate issue.)
          • I think you can improve the ord fallback comparator a bit, by
            keeping a separate "equals" array that's just like subord but is
            instead a boolean. Equals is true when the value is present in
            the new segment and false if the string could not be found in
            the new segment. Then only fall back to String.compareTo when
            equals is false. I think this is important for enum fields
            because the two segments will not have the same String object
            when they are equal (see the sketch after this list).
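          (Aside: a hedged sketch of that "equals" refinement. The ords, exact and values arrays are assumed comparator state from this discussion, not the committed patch.)

            // Fall back to String.compareTo only when a slot's value was
            // missing from the current segment. Two slots whose values both
            // exist in the segment can only share an ord when the values are
            // truly equal, even if (as with enum-like fields from different
            // segments) they are not the same String object.
            public int compare(int slot1, int slot2) {
              if (ords[slot1] != ords[slot2]) {
                return ords[slot1] - ords[slot2];
              }
              if (exact[slot1] && exact[slot2]) {
                return 0;            // equal by ords alone, no String compare
              }
              return values[slot1].compareTo(values[slot2]);
            }
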
          Mark Miller added a comment -

          Fix indent on FieldComparatorSource.java to 2 spaces

          Roger.

          Leftover "check val" print in FieldComparator.java

          I'll be sure to get all of this...I also have a ton of System.outs commented out that I'll remove.

          Do we need to track maxScore in TopFieldValueDocCollector (but mark as deprecated)? (Because Hits isn't removed yet).

          You know it.

          Got some "nocommit" comments to resolve still

          Right - time to start looking at these, and at names and design changes that might make sense, I think.

          I think StringComparatorLocale should call FieldCache.getStrings (as it does on trunk now when you do String sort w/ Locale), not getStringIndex. Then the queue should just hold the String[] values, not StringIndex[]?

          I swear I looked at trunk twice and saw it was using a StringIndex - just looked again and it's not. I love that.

          I think you can improve the ord fallback comparator a bit...

          Let's see if it's even in the ballpark yet...if it is, I'll tweak it all we can.

          Michael McCandless added a comment -

          One question: do you think we should provide a simple "legacy fallback" option, deprecated, so that in case we messed something up here, people can force the sort to use the current approach? We already have automatic "legacy" computation (if the old CUSTOM sort is in use), but we could, e.g., add a deprecated "setLegacy" to SortField or some such? As an insurance policy...

          Mark Miller added a comment -

          Yeah, that seems like a great idea. Will help debugging a lot if someone reports an oddity...

          Mark Miller added a comment -

          How about names?

          TopFieldValueDocCollector

          I'm not digging this one at the moment. I didn't dig TopFieldDocCollector either, though. TopDocCollector makes sense, but shouldn't these be more like TopFieldSortedDocCollector or something?

          ByValueFieldSortedHitQueue

          How about maybe FieldValueSortedHitQueue?

          Michael McCandless added a comment -

          How about ByValueFieldSortedHitQueue --> FieldValueHitQueue (the "sorted" seems redundant?)

          How about TopFieldValueDocCollector --> TopFieldCollector (I actually don't mind the current name TopFieldDocCollector, but, we can't use that name, so I dropped the Doc part).

          Mark Miller added a comment - - edited

          How about ByValueFieldSortedHitQueue --> FieldValueHitQueue (the "sorted" seems redundant?)

          Fair enough, and shorter is better.

          How about TopFieldValueDocCollector --> TopFieldCollector (I actually don't mind the current name TopFieldDocCollector, but, we can't use that name, so I dropped the Doc part).

          Okay...shorter is better again, so agreed. I find TopFieldDocCollector confusing myself - I'd rather it be easy to know: this one sorts by relevance, this one sorts by field value - and those names don't say that to a non-Lucene user (or more fairly, they didn't say that to me). I think: what the heck is a FieldDoc? Moot point though, I agree with both suggestions.

          edit

          Hmm...I guess that it's not just fields, but also doc ID and relevance if you want, which complicates things for that name...I guess in that way I also prefer TopFieldDocCollector over TopFieldCollector - ah well.

          Michael McCandless added a comment -

          Can you change DocComparator to just return doc1 - doc2 (instead of having ifs that translate that into -1/1)? I think that ekes out performance. (And we should fix the javadoc to say "any neg number" and "any pos number" is OK; oh, I see a nocommit asking for that already.)

          Mark Miller added a comment -

          DocComparator? It's not doing the ifs...

          Do you mean relevance? That doesn't work right when you can have negatives, does it? This is what I have for Doc (I don't think I've touched it from what you did):

            public static final class DocComparator extends FieldComparator {

              // nocommit -- maybe "setcurrentscoredoc"?

              private final int[] docIDs;
              private int docBase;
              private int readerMaxDoc;

              DocComparator(int numHits) {
                docIDs = new int[numHits];
              }

              // compare two docs already in the queue
              public int compare(int slot1, int slot2) {
                return docIDs[slot1] - docIDs[slot2];
              }

              // compare a queued doc against an incoming, segment-relative doc
              public int compare(int slot, int doc, float score) {
                return docIDs[slot] - docBase - doc;
              }

              // store the incoming doc as an absolute docID
              public void copy(int slot, int doc, float score) {
                docIDs[slot] = docBase + doc;
              }

              public void setNextReader(IndexReader reader) {
                // TODO: can we "map" our docIDs to the current
                // reader? saves having to then subtract on every
                // compare call
                docBase += readerMaxDoc;
                readerMaxDoc = reader.maxDoc();
              }

              public int sortType() {
                return SortField.DOC;
              }

              public Comparable value(int slot) {
                return new Integer(docIDs[slot]);
              }
            };
          
          Michael McCandless added a comment -

          Woops, you're right, sorry, I was looking at an old version...

          Mark Miller added a comment -

          I left a doubled set in the ords comparator that's no longer needed. We can just use map.values() instead. I'm sure there are small-win tricks we can do too - if it's anywhere near competitive.

          Mark Miller added a comment -

          A slightly improved setNextReader for the ords case:

            public void setNextReader(IndexReader reader) throws IOException {

              // Map ords in the queue to ords in the new reader
              StringIndex currentReaderValues = ExtendedFieldCache.EXT_DEFAULT
                  .getStringIndex(reader, field);

              lookup = currentReaderValues.lookup;
              order = currentReaderValues.order;

              if (lookup.length == 0) {
                return;
              }

              // Bucket queued values by the lower-bound ord they map to;
              // exact matches keep subord 0 and need no further sorting.
              Map map = new HashMap();
              for (int i = 0; i < slot + 1; i++) {
                String val = values[i];
                if (val == null) {
                  continue;
                }

                int index = binarySearch(lookup, val);

                if (index < 0) {
                  // Miss: map to the lower-bound ord, and remember the value
                  // so colliding slots can be ordered by subord below.
                  int ord = -index - 2;
                  Integer intOrd = Integer.valueOf(ord);
                  List slotVals = (List) map.get(intOrd);
                  if (slotVals == null) {
                    slotVals = new ArrayList();
                    map.put(intOrd, slotVals);
                  }
                  slotVals.add(new SlotValPair(i, val));
                  ords[i] = ord;
                } else {
                  // Exact match: the value exists in this segment.
                  ords[i] = index;
                  subords[i] = 0;
                }
              }

              // Assign subords (starting at 1) within each colliding bucket.
              Iterator it = map.values().iterator();
              while (it.hasNext()) {
                List list = (List) it.next();
                if (list.size() > 1) {
                  Collections.sort(list);
                  for (int i = 0; i < list.size(); i++) {
                    SlotValPair svp = (SlotValPair) list.get(i);
                    subords[svp.i] = i + 1;
                  }
                } else {
                  SlotValPair svp = (SlotValPair) list.get(0);
                  subords[svp.i] = 1;
                }
              }
            }
          
          Michael McCandless added a comment -

          OK I ran a bunch of sort perf tests, on trunk & with the patch.
          (Attached the two Python sources for doing this... though they require
          some small local mods to run properly).

          Each alg is run with "java -Xms1024M -Xmx1024M -Xbatch -server" on OS
          X 10.5.5, java 1.6.0_07-b06-153.

          I use two indexes, each with 2M docs. One is docs from Wikipedia
          (labeled "wiki"), the other is SortableSimpleDocMaker docs augmented
          to include a random country field (labeled "simple"). For each I
          created 1-segment, 10-segment and 100-segment indices. I sort by
          score, doc, string (val, ord = true ord+subord, ordval = ord +
          fallback). Queue size is 10.

          I ran various queries... query "147" hits ~5k docs, query "text" hits
          ~97K docs, query "1" hits 386K docs, and the alldocs query hits 2M
          docs. qps is queries per sec and warm is the time for the first
          warmup query, on trunk. qpsnew & warmnew are with the patch. pctg
          shows the % gain in qps performance:

          numSeg index sortBy query topN meth hits warm qps warmnew qpsnew pctg
          1 wiki score 147 10   4984 0.2 5717.6 0.2 5627.5 -1.6%
          1 wiki score text 10   97191 0.3 340.9 0.3 348.8 2.3%
          1 wiki score 1 10   386435 0.3 86.7 0.3 89.3 3.0%
          1 wiki doc 147 10   4984 0.3 4071.7 0.3 4649.0 14.2%
          1 wiki doc text 10   97191 0.3 225.4 0.3 253.7 12.6%
          1 wiki doc 1 10   386435 0.3 56.9 0.3 65.8 15.6%
          1 wiki doc <all> 10   2000000 0.1 23.0 0.1 38.6 67.8%
          1 simple int text 10   2000000 0.7 10.7 0.7 13.5 26.2%
          1 simple int <all> 10   2000000 0.6 21.1 0.6 34.7 64.5%
          1 simple country text 10 ord 2000000 0.6 10.7 0.6 13.2 23.4%
          1 simple country text 10 ordval 2000000 0.6 10.7 0.6 13.3 24.3%
          1 simple country <all> 10 ord 2000000 0.5 20.7 0.6 32.5 57.0%
          1 simple country <all> 10 ordval 2000000 0.5 20.7 0.6 34.6 67.1%
          1 wiki title 147 10 ord 4984 2.1 3743.8 2.0 4210.5 12.5%
          1 wiki title 147 10 ordval 4984 2.1 3743.8 2.0 4288.2 14.5%
          1 wiki title text 10 ord 97191 2.1 144.2 2.1 160.3 11.2%
          1 wiki title text 10 ordval 97191 2.1 144.2 2.1 163.5 13.4%
          1 wiki title 1 10 ord 386435 2.1 51.2 2.1 63.2 23.4%
          1 wiki title 1 10 ordval 386435 2.1 51.2 2.1 64.6 26.2%
          1 wiki title <all> 10 ord 2000000 2.1 21.1 2.1 33.2 57.3%
          1 wiki title <all> 10 ordval 2000000 2.1 21.1 2.1 35.4 67.8%
          numSeg index sortBy query topN meth hits warm qps warmnew qpsnew pctg
          10 wiki score 147 10   4984 0.3 4228.3 0.3 4510.6 6.7%
          10 wiki score text 10   97191 0.3 294.7 0.3 341.5 15.9%
          10 wiki score 1 10   386435 0.4 75.0 0.4 87.0 16.0%
          10 wiki doc 147 10   4984 0.3 3332.2 0.3 4033.9 21.1%
          10 wiki doc text 10   97191 0.4 217.0 0.4 277.0 27.6%
          10 wiki doc 1 10   386435 0.4 54.6 0.4 70.5 29.1%
          10 wiki doc <all> 10   2000000 0.1 12.7 0.1 38.6 203.9%
          10 simple int text 10   2000000 1.2 10.3 0.6 13.5 31.1%
          10 simple int <all> 10   2000000 1.1 11.8 0.8 34.6 193.2%
          10 simple country text 10 ord 2000000 0.7 10.4 0.5 13.2 26.9%
          10 simple country text 10 ordval 2000000 0.7 10.4 0.5 13.3 27.9%
          10 simple country <all> 10 ord 2000000 0.7 11.5 0.5 32.5 182.6%
          10 simple country <all> 10 ordval 2000000 0.7 11.5 0.5 34.1 196.5%
          10 wiki title 147 10 ord 4984 12.5 3004.5 2.1 3124.0 4.0%
          10 wiki title 147 10 ordval 4984 12.5 3004.5 2.1 3353.5 11.6%
          10 wiki title text 10 ord 97191 12.7 139.4 2.1 156.7 12.4%
          10 wiki title text 10 ordval 97191 12.7 139.4 2.1 160.9 15.4%
          10 wiki title 1 10 ord 386435 12.7 50.3 2.1 62.3 23.9%
          10 wiki title 1 10 ordval 386435 12.7 50.3 2.1 64.1 27.4%
          10 wiki title <all> 10 ord 2000000 12.7 11.4 2.1 33.1 190.4%
          10 wiki title <all> 10 ordval 2000000 12.7 11.4 2.1 35.3 209.6%
          numSeg index sortBy query topN meth hits warm qps warmnew qpsnew pctg
          100 wiki score 147 10   4984 0.3 1282.2 1.7 1162.3 -9.4%
          100 wiki score text 10   97191 0.4 232.4 1.3 275.6 18.6%
          100 wiki score 1 10   386435 0.4 65.1 1.4 80.4 23.5%
          100 wiki doc 147 10   4984 0.4 1170.0 0.4 1132.0 -3.2%
          100 wiki doc text 10   97191 0.4 171.7 0.4 230.1 34.0%
          100 wiki doc 1 10   386435 0.4 46.7 0.4 67.9 45.4%
          100 wiki doc <all> 10   2000000 0.2 7.8 0.1 41.6 433.3%
          100 simple int text 10   2000000 3.3 8.9 4.0 11.0 23.6%
          100 simple int <all> 10   2000000 3.4 7.7 1.1 36.5 374.0%
          100 simple country text 10 ord 2000000 1.0 8.8 0.6 10.8 22.7%
          100 simple country text 10 ordval 2000000 1.0 8.8 0.6 11.0 25.0%
          100 simple country <all> 10 ord 2000000 1.0 7.6 0.5 35.0 360.5%
          100 simple country <all> 10 ordval 2000000 1.0 7.6 0.5 36.3 377.6%
          100 wiki title 147 10 ord 4984 94.6 1066.9 2.1 583.7 -45.3%
          100 wiki title 147 10 ordval 4984 94.6 1066.9 2.1 750.1 -29.7%
          100 wiki title text 10 ord 97191 94.9 110.2 2.1 122.7 11.3%
          100 wiki title text 10 ordval 97191 94.9 110.2 2.1 128.4 16.5%
          100 wiki title 1 10 ord 386435 94.3 47.9 2.1 58.2 21.5%
          100 wiki title 1 10 ordval 386435 94.3 47.9 2.1 60.1 25.5%
          100 wiki title <all> 10 ord 2000000 94.6 7.8 2.5 35.6 356.4%
          100 wiki title <all> 10 ordval 2000000 94.6 7.8 2.4 37.0 374.4%

          It's a ridiculous amount of data to digest... but here are some
          initial thoughts:

          • These are only single term queries; I'd expect the gains for multi
            term queries to be smaller since, net/net, a smaller percentage of
            the time is spent collecting.
          • Ord + val fallback (ordval) is generally faster than pure
            ord/subord. I think for now we should run with ord + val
            fallback? (We can leave ord + subord commented out?).
          • It's great that we see decent speedups for "sort by score" which
            is presumably the most common sort used.
          • We do get slower in certain cases (neg pctg in the rightmost
            column): all not-in-the-noise slowdowns were with query "147" on
            the 100 segment index. This query hits relatively few docs (~5K),
            so this is expected: the new approach spends some time updating
            its queue for each subreader (see the sketch after this list),
            and if the time spent searching is relatively tiny then that
            queue update time becomes relatively big. I think with a larger
            queue size the slowdown will be worse. However, I think this is
            an acceptable tradeoff.
          • The gains for field sorting on a single segment (optimized) index
            are impressive. Generally, the more hits encountered the better
            the gains. It's amazing that we see ~67.8% gain sorting by docID,
            country, and title for alldocs query. My only guess for this is
            better cache hit rate (because we gain locality by copying values
            to local compact arrays).
          • Across the board the alldocs query shows very sizable (5X faster
            for 100 segment index; 3X faster for 10 segment index)
            improvements.
          • I didn't break out the %tg difference, but warming time with the
            patch is waaaay faster than trunk when the index has > 1 segment.
            Reopen time should also be fantastically faster (we are
            sidestepping something silly happening w/ FieldCache on a
            Multi*Reader). Warming on trunk takes ~95 seconds with the 100
            segment index!
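
          For anyone following along, the loop behind that queue-update cost
          is roughly the following (a minimal sketch with illustrative names,
          not the patch's exact code; subReaders, weight and searchSegment
          are stand-ins for the IndexSearcher internals). The O(topN)
          conversion in setNextReader is paid once per segment, which is why
          tiny searches on a 100 segment index feel it:

            // One collector shared across all segments:
            for (int i = 0; i < subReaders.length; i++) {
              // sorted collectors remap every queue slot's ord here: O(topN)
              collector.setNextReader(subReaders[i]);
              // then this segment's hits are scored and collected as usual
              searchSegment(subReaders[i], weight, collector);
            }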
          Mark Miller added a comment -

          Awesome results Mike, thanks! You are a wizard.

          Michael McCandless added a comment -

          I set the queue size to 1000 and reran the tests:

          numSeg index sortBy query topN hits warm qps warmnew qpsnew pctg
          1 wiki score 147 1000 4984 0.3 1356.8 0.3 1361.3 0.3%
          1 wiki score text 1000 97191 0.3 224.0 0.3 223.0 -0.4%
          1 wiki score 1 1000 386435 0.3 73.6 0.3 72.8 -1.1%
          1 wiki doc 147 1000 4984 0.3 1527.0 0.3 1475.0 -3.4%
          1 wiki doc text 1000 97191 0.3 182.5 0.3 235.4 29.0%
          1 wiki doc 1 1000 386435 0.3 50.6 0.3 67.7 33.8%
          1 wiki doc <all> 1000 2000000 0.1 22.1 0.1 37.8 71.0%
          1 simple int text 1000 2000000 0.7 10.1 0.7 12.8 26.7%
          1 simple int <all> 1000 2000000 0.6 19.0 0.6 30.5 60.5%
          1 simple country text 1000 2000000 0.9 10.1 0.7 12.5 23.8%
          1 simple country <all> 1000 2000000 0.9 19.5 0.6 29.1 49.2%
          1 wiki title 147 1000 4984 4.0 733.1 2.0 732.2 -0.1%
          1 wiki title text 1000 97191 4.1 109.1 2.1 114.7 5.1%
          1 wiki title 1 1000 386435 4.1 47.1 2.1 55.4 17.6%
          1 wiki title <all> 1000 2000000 4.1 19.4 2.1 30.5 57.2%
          numSeg index sortBy query topN hits warm qps warmnew qpsnew pctg
          10 wiki score 147 1000 4984 0.3 1259.4 0.3 1274.0 1.2%
          10 wiki score text 1000 97191 0.4 215.2 0.4 220.0 2.2%
          10 wiki score 1 1000 386435 0.4 69.6 0.4 72.0 3.4%
          10 wiki doc 147 1000 4984 0.3 1409.0 0.3 1394.7 -1.0%
          10 wiki doc text 1000 97191 0.4 192.0 0.4 232.5 21.1%
          10 wiki doc 1 1000 386435 0.4 53.0 0.4 66.3 25.1%
          10 wiki doc <all> 1000 2000000 0.1 11.9 0.1 37.5 215.1%
          10 simple int text 1000 2000000 1.2 9.8 0.6 12.8 30.6%
          10 simple int <all> 1000 2000000 1.2 11.0 0.8 30.2 174.5%
          10 simple country text 1000 2000000 1.1 9.8 0.6 12.4 26.5%
          10 simple country <all> 1000 2000000 1.1 11.0 0.5 29.1 164.5%
          10 wiki title 147 1000 4984 26.0 655.2 2.1 84.7 -87.1%
          10 wiki title text 1000 97191 26.3 100.4 2.2 77.8 -22.5%
          10 wiki title 1 1000 386435 26.0 42.3 2.6 48.4 14.4%
          10 wiki title <all> 1000 2000000 26.1 10.9 2.6 28.5 161.5%
          numSeg index sortBy query topN hits warm qps warmnew qpsnew pctg
          100 wiki score 147 1000 4984 0.4 704.1 0.5 677.5 -3.8%
          100 wiki score text 1000 97191 0.4 169.5 0.5 186.0 9.7%
          100 wiki score 1 1000 386435 0.4 56.5 0.5 67.9 20.2%
          100 wiki doc 147 1000 4984 0.4 785.0 0.4 724.0 -7.8%
          100 wiki doc text 1000 97191 0.4 159.9 0.4 204.7 28.0%
          100 wiki doc 1 1000 386435 0.4 44.9 0.4 64.8 44.3%
          100 wiki doc <all> 1000 2000000 0.2 7.8 0.1 40.4 417.9%
          100 simple int text 1000 2000000 3.3 8.4 1.4 10.3 22.6%
          100 simple int <all> 1000 2000000 3.4 7.4 1.1 32.4 337.8%
          100 simple country text 1000 2000000 1.4 8.6 0.7 10.0 16.3%
          100 simple country <all> 1000 2000000 1.5 7.3 0.6 28.6 291.8%
          100 wiki title 147 1000 4984 189.0 446.3 2.4 19.8 -95.6%
          100 wiki title text 1000 97191 188.5 87.7 2.3 27.5 -68.6%
          100 wiki title 1 1000 386435 190.4 41.1 2.7 24.6 -40.1%
          100 wiki title <all> 1000 2000000 189.2 7.4 3.0 18.4 148.6%

          Performance clearly gets worse for queries that don't hit many docs,
          with a large queue, against an index with a large number of
          segments, when sorting by a unique String field (like title).

          The slowdown for "147" at 100 segments is quite bad.

          So... I wonder how often users of Lucene set a very large queue size
          (to do some sort of post filtering, which could be more efficiently
          done as a real Filter, but...). I think it may be a non-trivial
          number, so... what to do?

          EG we could offer a different collector that's better optimized
          towards collecting a large topN (probably doing the toplevel
          FieldCache that's done today)? Or, we could explore a hybrid approach
          whereby a slot is only switched to the current segment when it's first
          visited again (instead of updating all of them on switching readers)?
          Or... something else?
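
          For concreteness, the hybrid could be as small as stamping each slot
          with a reader generation and remapping lazily; a minimal sketch
          (every name here is illustrative, not the patch's code, and null
          handling is omitted):

            class LazySlotConverter {
              String[] values;    // per-slot values; copy() (not shown) fills these
              int[] ords;         // per-slot ords, valid only for readerGen[slot]
              int[] readerGen;    // generation each slot's ord was converted for
              int currentGen;     // bumped per reader switch: O(1), not O(topN)
              String[] lookup;    // current reader's sorted term lookup

              // Called on each reader switch; does no per-slot work at all.
              void setNextReader(String[] newLookup) {
                lookup = newLookup;
                currentGen++;
              }

              // Remap a slot's ord only when a comparison actually touches it.
              int ord(int slot) {
                if (readerGen[slot] != currentGen) {
                  int idx = java.util.Arrays.binarySearch(lookup, values[slot]);
                  ords[slot] = idx >= 0 ? idx : -idx - 2; // largest entry below value
                  readerGen[slot] = currentGen;
                }
                return ords[slot];
              }
            }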

          Michael McCandless added a comment -

          One more small thing: StringOrdValComparator.compare uses double ord1 and double ord2 when comparing 2 slots, but in fact they should just be ints?

          Mark Miller added a comment -

          Yeah, wow. I've actually fixed that (though I didn't notice doing so). I've made a half dozen little tweaks trying to eke out some speed and maybe trigger some ideas. Nothing that's made much of a difference yet. We can subtract instead of compare and other tiny, in-the-noise stuff - hoping I trigger a bigger idea though.

          Michael McCandless added a comment -

          Another fix to do: in StringOrdValComparator.compare (the one that takes slot, doc, score), it does not fall back to String value comparison.
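
          For context, the intended behavior (sketched with assumed field
          names: ords[] holding the slot's converted ord, order[]/lookup from
          the current reader's StringIndex; null handling omitted) is to break
          ord ties on the actual values, since converted ords are only
          approximate across readers:

            int compare(int slot, int doc) {
              int cmp = ords[slot] - order[doc];  // cheap ordinal comparison first
              if (cmp != 0) {
                return cmp;
              }
              // converted ords can collide across readers: fall back to values
              return values[slot].compareTo(lookup[order[doc]]);
            }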

          Mark Miller added a comment -

          Thanks. I've got a convert-on-demand version to try too.

          Michael McCandless added a comment -

          OK, wanna post a new patch w/ all these fixes? I'll re-test, and add the on-demand queue. If the perf cost of that one is still too high, we may want to offer a queue that pulls the MultiReader's FieldCache values but uses the slot-based pqueue (we would have to fix the insanely slow warm time if we do that...). I bet for very large queue sizes that'd give the best performance, at the obvious expense of reopen cost.

          Mark Miller added a comment -

          Escaped back to the computer. Here is the latest. I went down a few dead ends trying to improve things, and I don't think I found anything significant.

          There are the following String sort types:

          SortField.STRING_VAL : by value

          SortField.STRING_ORD: by ordinal fall back to subord

          SortField.STRING_ORD_VAL: by ordinal fall back to value

          SortField.STRING_ORD_VAL_DEM: by ordinal fall back to value, convert on demand

          and just for kicks:

          SortField.STRING_ORD_VAL_DEM_WIN: by ordinal fall back to value, convert a window of 20 in each direction of n on demand

          SortField.STRING is set to STRING_ORD_VAL
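
          Assuming this sticks, picking a method from user code would
          presumably look like the following (the constants are from this
          patch, not a released API; searcher and query assumed in scope,
          imports elided):

            // Hypothetical usage: sort by title with ord + value fallback
            Sort byTitle = new Sort(new SortField("title", SortField.STRING_ORD_VAL));
            TopDocs top = searcher.search(query, null, 10, byTitle);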

          Mark Miller added a comment -

          Again with the two $id issues fixed.

          Mark Miller added a comment -

          Was just thinking about the window convert (which I doubt is worth considering anyway) - I wouldn't bench it... I forgot to make it check for conversion as it's working on the window; it just checks the window pivot point. That creates lots of extra work.

          Michael McCandless added a comment -

          New patch attached:

          • Made IndexWriter.getSegmentCount package protected again
          • Fixed StringOrdComparator to use no doubles
          • Added "API is experimental" warnings to the new APIs
          • Deprecated FieldValueHitQueue.getMaxScore – I think in 3.0 we
            should remove this (with Hits)?
          • Tweaked StringOrdValOnDemComparator to avoid Arrays.fill in
            setNextReader
          • Fixed FieldValueHitQueue to actually pull the right "on demand"
            comparator (it was incorrectly pulling StringOrdValComparator)

          I re-ran perf tests:

          numSeg index sortBy method query topN hits warm qps warmnew qpsnew pctg
          1 simple country val text 1000 2000000 0.9 10.0 0.8 11.1 11.0%
          1 simple country ord text 1000 2000000 0.9 10.0 0.7 13.0 30.0%
          1 simple country ordval text 1000 2000000 0.9 10.0 0.7 12.5 25.0%
          1 simple country orddem text 1000 2000000 0.9 10.0 0.7 12.5 25.0%
          1 wiki title val 147 1000 4984 2.1 740.6 2.4 375.7 -49.3%
          1 wiki title ord 147 1000 4984 2.1 740.6 2.1 811.7 9.6%
          1 wiki title ordval 147 1000 4984 2.1 740.6 2.1 806.6 8.9%
          1 wiki title orddem 147 1000 4984 2.1 740.6 2.1 207.6 -72.0%
          1 wiki title val text 1000 97191 2.1 108.7 2.4 32.5 -70.1%
          1 wiki title ord text 1000 97191 2.1 108.7 2.1 121.4 11.7%
          1 wiki title ordval text 1000 97191 2.1 108.7 2.1 119.1 9.6%
          1 wiki title orddem text 1000 97191 2.1 108.7 2.1 90.9 -16.4%
          1 wiki title val 1 1000 386435 2.1 46.2 2.4 12.8 -72.3%
          1 wiki title ord 1 1000 386435 2.1 46.2 2.1 58.6 26.8%
          1 wiki title ordval 1 1000 386435 2.1 46.2 2.1 55.7 20.6%
          1 wiki title orddem 1 1000 386435 2.1 46.2 2.1 50.1 8.4%
          numSeg index sortBy method query topN hits warm qps warmnew qpsnew pctg
          10 simple country val text 1000 2000000 0.8 9.7 0.7 11.0 13.4%
          10 simple country ord text 1000 2000000 0.8 9.7 0.6 13.0 34.0%
          10 simple country ordval text 1000 2000000 0.8 9.7 0.6 12.4 27.8%
          10 simple country orddem text 1000 2000000 0.8 9.7 0.6 12.5 28.9%
          10 wiki title val 147 1000 4984 12.7 664.2 2.5 383.8 -42.2%
          10 wiki title ord 147 1000 4984 12.7 664.2 2.1 86.2 -87.0%
          10 wiki title ordval 147 1000 4984 12.7 664.2 2.1 104.0 -84.3%
          10 wiki title orddem 147 1000 4984 12.7 664.2 2.1 77.0 -88.4%
          10 wiki title val text 1000 97191 12.6 100.4 2.4 33.3 -66.8%
          10 wiki title ord text 1000 97191 12.6 100.4 2.2 80.0 -20.3%
          10 wiki title ordval text 1000 97191 12.6 100.4 2.2 90.3 -10.1%
          10 wiki title orddem text 1000 97191 12.6 100.4 2.1 72.1 -28.2%
          10 wiki title val 1 1000 386435 12.7 42.4 2.5 14.7 -65.3%
          10 wiki title ord 1 1000 386435 12.7 42.4 2.6 50.2 18.4%
          10 wiki title ordval 1 1000 386435 12.7 42.4 2.2 51.3 21.0%
          10 wiki title orddem 1 1000 386435 12.7 42.4 2.2 47.3 11.6%
          numSeg index sortBy method query topN hits warm qps warmnew qpsnew pctg
          100 simple country val text 1000 2000000 1.0 8.5 2.1 9.2 8.2%
          100 simple country ord text 1000 2000000 1.0 8.5 0.6 10.7 25.9%
          100 simple country ordval text 1000 2000000 1.0 8.5 0.6 10.3 21.2%
          100 simple country orddem text 1000 2000000 1.0 8.5 0.6 10.2 20.0%
          100 wiki title val 147 1000 4984 93.8 442.8 3.6 238.8 -46.1%
          100 wiki title ord 147 1000 4984 93.8 442.8 2.3 19.9 -95.5%
          100 wiki title ordval 147 1000 4984 93.8 442.8 2.2 28.0 -93.7%
          100 wiki title orddem 147 1000 4984 93.8 442.8 2.2 54.1 -87.8%
          100 wiki title val text 1000 97191 93.4 88.0 3.1 33.1 -62.4%
          100 wiki title ord text 1000 97191 93.4 88.0 2.3 27.8 -68.4%
          100 wiki title ordval text 1000 97191 93.4 88.0 2.2 40.9 -53.5%
          100 wiki title orddem text 1000 97191 93.4 88.0 2.2 53.3 -39.4%
          100 wiki title val 1 1000 386435 92.8 41.0 3.2 14.7 -64.1%
          100 wiki title ord 1 1000 386435 92.8 41.0 2.7 25.3 -38.3%
          100 wiki title ordval 1 1000 386435 92.8 41.0 2.2 33.8 -17.6%
          100 wiki title orddem 1 1000 386435 92.8 41.0 2.2 42.7 4.1%

          Haven't digested these results yet...

          Michael McCandless added a comment -

          Sorry, disregard those results above... I think there's a bug in the on-demand comparator.

          Michael McCandless added a comment -

          OK new patch attached:

          • Fixed the bug (I had added) in StringOrdValOnDemComparator that
            was doing too much work on first segment (especially skewed
            1-segment index results).
          • Removed "this.slot = slot" from StringOrdComparator,
            StringOrdValComparator and StringOrdValOnDemWinComparator. I
            think this was a bug that was causing setNextReader to not convert
            enough of the queue in my tests. Unfortunately, this makes
            performance worse for these classes.
          • Tweaked null checks for string values to be tiny bit faster if there
            are nulls.

          New benchmark results for topN=1000:

          numSeg index sortBy method query topN hits warm qps warmnew qpsnew pctg
          1 simple country val text 1000 2000000 0.9 10.0 0.7 11.1 11.0%
          1 simple country ord text 1000 2000000 0.9 10.0 0.7 13.3 33.0%
          1 simple country ordval text 1000 2000000 0.9 10.0 0.7 12.7 27.0%
          1 simple country orddem text 1000 2000000 0.9 10.0 0.6 12.5 25.0%
          1 wiki title val 147 1000 4984 2.1 740.6 2.3 369.3 -50.1%
          1 wiki title ord 147 1000 4984 2.1 740.6 2.1 808.1 9.1%
          1 wiki title ordval 147 1000 4984 2.1 740.6 2.0 815.7 10.1%
          1 wiki title orddem 147 1000 4984 2.1 740.6 2.1 731.7 -1.2%
          1 wiki title val text 1000 97191 2.1 108.7 2.4 32.6 -70.0%
          1 wiki title ord text 1000 97191 2.1 108.7 2.1 121.1 11.4%
          1 wiki title ordval text 1000 97191 2.1 108.7 2.1 119.0 9.5%
          1 wiki title orddem text 1000 97191 2.1 108.7 2.1 114.8 5.6%
          1 wiki title val 1 1000 386435 2.1 46.2 2.4 12.7 -72.5%
          1 wiki title ord 1 1000 386435 2.1 46.2 2.1 58.5 26.6%
          1 wiki title ordval 1 1000 386435 2.1 46.2 2.1 56.7 22.7%
          1 wiki title orddem 1 1000 386435 2.1 46.2 2.1 55.2 19.5%
          numSeg index sortBy method query topN hits warm qps warmnew qpsnew pctg
          10 simple country val text 1000 2000000 0.8 9.7 0.6 11.0 13.4%
          10 simple country ord text 1000 2000000 0.8 9.7 0.6 13.1 35.1%
          10 simple country ordval text 1000 2000000 0.8 9.7 0.6 12.5 28.9%
          10 simple country orddem text 1000 2000000 0.8 9.7 0.6 12.5 28.9%
          10 wiki title val 147 1000 4984 12.7 664.2 2.4 382.9 -42.4%
          10 wiki title ord 147 1000 4984 12.7 664.2 2.2 57.9 -91.3%
          10 wiki title ordval 147 1000 4984 12.7 664.2 2.1 71.5 -89.2%
          10 wiki title orddem 147 1000 4984 12.7 664.2 2.1 91.1 -86.3%
          10 wiki title val text 1000 97191 12.6 100.4 2.4 33.4 -66.7%
          10 wiki title ord text 1000 97191 12.6 100.4 2.2 62.9 -37.4%
          10 wiki title ordval text 1000 97191 12.6 100.4 2.2 75.9 -24.4%
          10 wiki title orddem text 1000 97191 12.6 100.4 2.2 79.6 -20.7%
          10 wiki title val 1 1000 386435 12.7 42.4 2.4 14.7 -65.3%
          10 wiki title ord 1 1000 386435 12.7 42.4 2.7 45.2 6.6%
          10 wiki title ordval 1 1000 386435 12.7 42.4 2.1 48.5 14.4%
          10 wiki title orddem 1 1000 386435 12.7 42.4 2.2 50.2 18.4%
          numSeg index sortBy method query topN hits warm qps warmnew qpsnew pctg
          100 simple country val text 1000 2000000 1.0 8.5 0.7 9.2 8.2%
          100 simple country ord text 1000 2000000 1.0 8.5 0.6 10.1 18.8%
          100 simple country ordval text 1000 2000000 1.0 8.5 0.6 9.7 14.1%
          100 simple country orddem text 1000 2000000 1.0 8.5 0.6 10.3 21.2%
          100 wiki title val 147 1000 4984 93.8 442.8 2.3 240.7 -45.6%
          100 wiki title ord 147 1000 4984 93.8 442.8 2.3 12.3 -97.2%
          100 wiki title ordval 147 1000 4984 93.8 442.8 2.2 18.4 -95.8%
          100 wiki title orddem 147 1000 4984 93.8 442.8 2.1 57.7 -87.0%
          100 wiki title val text 1000 97191 93.4 88.0 2.3 33.3 -62.2%
          100 wiki title ord text 1000 97191 93.4 88.0 2.3 17.7 -79.9%
          100 wiki title ordval text 1000 97191 93.4 88.0 2.2 27.8 -68.4%
          100 wiki title orddem text 1000 97191 93.4 88.0 2.2 54.4 -38.2%
          100 wiki title val 1 1000 386435 92.8 41.0 2.4 14.8 -63.9%
          100 wiki title ord 1 1000 386435 92.8 41.0 2.7 16.6 -59.5%
          100 wiki title ordval 1 1000 386435 92.8 41.0 2.2 27.9 -32.0%
          100 wiki title orddem 1 1000 386435 92.8 41.0 2.2 43.2 5.4%
          Michael McCandless added a comment -

          Results with topN=10:

          numSeg index sortBy method query topN hits warm qps warmnew qpsnew pctg
          1 simple country val text 10 2000000 0.6 10.7 0.7 11.7 9.3%
          1 simple country ord text 10 2000000 0.6 10.7 0.6 13.8 29.0%
          1 simple country ordval text 10 2000000 0.6 10.7 0.6 13.3 24.3%
          1 simple country orddem text 10 2000000 0.6 10.7 0.7 13.0 21.5%
          1 wiki title val 147 10 4984 2.1 3743.8 2.3 2441.8 -34.8%
          1 wiki title ord 147 10 4984 2.1 3743.8 2.0 4426.2 18.2%
          1 wiki title ordval 147 10 4984 2.1 3743.8 2.1 4352.7 16.3%
          1 wiki title orddem 147 10 4984 2.1 3743.8 2.0 4063.6 8.5%
          1 wiki title val text 10 97191 2.1 144.2 2.3 39.1 -72.9%
          1 wiki title ord text 10 97191 2.1 144.2 2.1 164.5 14.1%
          1 wiki title ordval text 10 97191 2.1 144.2 2.1 162.6 12.8%
          1 wiki title orddem text 10 97191 2.1 144.2 2.1 157.3 9.1%
          1 wiki title val 1 10 386435 2.1 51.2 2.4 13.6 -73.4%
          1 wiki title ord 1 10 386435 2.1 51.2 2.1 65.7 28.3%
          1 wiki title ordval 1 10 386435 2.1 51.2 2.1 64.7 26.4%
          1 wiki title orddem 1 10 386435 2.1 51.2 2.1 60.4 18.0%
          numSeg index sortBy method query topN hits warm qps warmnew qpsnew pctg
          10 simple country val text 10 2000000 0.7 10.4 0.6 11.6 11.5%
          10 simple country ord text 10 2000000 0.7 10.4 0.6 13.6 30.8%
          10 simple country ordval text 10 2000000 0.7 10.4 0.6 13.1 26.0%
          10 simple country orddem text 10 2000000 0.7 10.4 0.6 13.1 26.0%
          10 wiki title val 147 10 4984 12.5 3004.5 2.5 1732.9 -42.3%
          10 wiki title ord 147 10 4984 12.5 3004.5 2.1 3067.2 2.1%
          10 wiki title ordval 147 10 4984 12.5 3004.5 2.1 3283.5 9.3%
          10 wiki title orddem 147 10 4984 12.5 3004.5 2.1 3237.9 7.8%
          10 wiki title val text 10 97191 12.7 139.4 2.4 38.6 -72.3%
          10 wiki title ord text 10 97191 12.7 139.4 2.1 159.5 14.4%
          10 wiki title ordval text 10 97191 12.7 139.4 2.1 160.5 15.1%
          10 wiki title orddem text 10 97191 12.7 139.4 2.1 154.0 10.5%
          10 wiki title val 1 10 386435 12.7 50.3 2.5 15.6 -69.0%
          10 wiki title ord 1 10 386435 12.7 50.3 2.1 64.8 28.8%
          10 wiki title ordval 1 10 386435 12.7 50.3 2.1 64.0 27.2%
          10 wiki title orddem 1 10 386435 12.7 50.3 2.1 59.4 18.1%
          numSeg index sortBy method query topN hits warm qps warmnew qpsnew pctg
          100 simple country val text 10 2000000 1.0 8.8 2.8 9.4 6.8%
          100 simple country ord text 10 2000000 1.0 8.8 0.6 11.1 26.1%
          100 simple country ordval text 10 2000000 1.0 8.8 0.6 10.7 21.6%
          100 simple country orddem text 10 2000000 1.0 8.8 0.6 10.7 21.6%
          100 wiki title val 147 10 4984 94.6 1066.9 3.3 454.8 -57.4%
          100 wiki title ord 147 10 4984 94.6 1066.9 2.1 519.2 -51.3%
          100 wiki title ordval 147 10 4984 94.6 1066.9 2.1 692.5 -35.1%
          100 wiki title orddem 147 10 4984 94.6 1066.9 2.1 778.1 -27.1%
          100 wiki title val text 10 97191 94.9 110.2 2.7 38.9 -64.7%
          100 wiki title ord text 10 97191 94.9 110.2 2.1 122.8 11.4%
          100 wiki title ordval text 10 97191 94.9 110.2 2.1 126.3 14.6%
          100 wiki title orddem text 10 97191 94.9 110.2 2.1 124.7 13.2%
          100 wiki title val 1 10 386435 94.3 47.9 2.8 15.8 -67.0%
          100 wiki title ord 1 10 386435 94.3 47.9 2.1 59.1 23.4%
          100 wiki title ordval 1 10 386435 94.3 47.9 2.2 58.9 23.0%
          100 wiki title orddem 1 10 386435 94.3 47.9 2.2 54.9 14.6%
          Mark Miller added a comment -

          Hmm... I'm going to have to break down and think hard about the slot issue. You have switched it to values.length - here is what I'm thinking: in the case where 3 hits come from a Reader, but you ask for 1000 back, that will run through that loop 1000 times, but you only need to convert 3, right? This is what I was attempting - what test are you running to see the failure? The idea for me was that if you only have 20 hits so far on a top 1000 search, you only need to hit that loop to convert 20 values to the new Reader, rather than 1000 every time (though you spin fast hitting the ==null). I'm not sure - xmas shopping has burned up what's left of my brain cells. I'll let it stew, perhaps run some code, and come back.

          Michael McCandless added a comment -

          > In the case where 3 hits come from a Reader, but you ask for 1000 back, that will run through that loop 1000 times, but you only need to convert 3 right?

          Well, it's tricky... and indeed all tests pass with the bug (which is
          spooky – I think we need to add cases to TestSort where 1) the index
          has many segments, and 2) the number of hits is much greater than the
          queue size), but I'm pretty sure it's a bug.

          You're right: it'd be nice to only visit the "used" slots in the queue
          on advancing to each reader. During the "startup transient" (when the
          collector has not yet seen enough hits to fill its queue), the slot
          indeed increases one at a time, and you could at that point use it to
          efficiently visit only the used slots.

          However, after the startup transient, the pqueue tracks the weakest
          entry in the queue, which can occur in any of the slots, and when a
          hit that beats that weakest entry arrives, it will call copy() into
          that slot.

          So the slot passed to copy is now a "relatively random" value. For a
          1000 sized queue whose slots are full, you might get a copy into slot
          242. In this case we were incorrectly setting "this.slot" to 242 and
          then only converting the first 242 entries.

          If we changed it to track the maxSlot it should work... but I'm not
          sure this is worthwhile, since it only speeds up already super-fast
          searches and slightly hurts slow searches.
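
          (For the record, the maxSlot variant would be tiny; a sketch with
          assumed names, where convert() stands in for remapping one slot to
          the new reader:

            void copy(int slot, int doc) {
              if (slot > maxSlot) maxSlot = slot;  // grows during startup transient
              values[slot] = lookup[order[doc]];
            }

            void setNextReader(IndexReader reader) throws IOException {
              // ... load the new reader's lookup/order ...
              for (int i = 0; i <= maxSlot; i++) {
                convert(i);  // never-used slots above maxSlot are skipped
              }
            }

          but once the queue fills, maxSlot is pinned at topN-1 and this
          degenerates to converting everything anyway.)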

          Michael McCandless added a comment -

          New patch attached:

          • I moved maxScore tracking up into TopFieldCollector to save a
            method call per collect().
          • Deprecated TopDocs.get/setMaxScore()
          • Inlined the common "only 1 comparator" case
          • Some small optimizations
          • Fixed a bug in reverse sorting & short-circuit testing
          • Renamed a few attrs

          In addition to fixing TestSort to test a multi-segment index where
          totalHits is much greater than topN, we should include reverse
          sorting (to tickle the bug I just found) as well as sorting by 2 or
          more fields. Mark, do you want to take a stab at this? Given how
          many sneaky bugs we're uncovering just by staring at the code (for
          long enough), I'd like to increase test coverage....
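
          A starting point for such a test might be (a sketch only, not the
          actual TestSort change; 2.9-era API assumed, imports elided):

            // Many tiny segments, totalHits >> topN, reverse String sort.
            RAMDirectory dir = new RAMDirectory();
            IndexWriter writer = new IndexWriter(dir, new WhitespaceAnalyzer(), true);
            writer.setMaxBufferedDocs(2);   // flush constantly: lots of small segments
            writer.setMergeFactor(1000);    // ...and keep them from merging away
            for (int i = 0; i < 500; i++) {
              Document doc = new Document();
              doc.add(new Field("title", "title" + i, Field.Store.NO,
                                Field.Index.NOT_ANALYZED));
              writer.addDocument(doc);
            }
            writer.close();

            IndexSearcher searcher = new IndexSearcher(dir);
            Sort sort = new Sort(new SortField("title", SortField.STRING, true)); // reverse
            TopDocs top = searcher.search(new MatchAllDocsQuery(), null, 10, sort);
            // verify against a brute-force sort of all 500 values, so a queue
            // that mis-converts ords across segments fails loudly

          I'd also repeat it with new Sort(new SortField[] { new
          SortField("title", SortField.STRING), SortField.FIELD_DOC }) to
          cover the multi-comparator path.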

          Here're the topN=10 results with this patch:

          numSeg index sortBy method query topN hits warm qps warmnew qpsnew pctg
          1 simple country val text 10 2000000 0.6 10.7 0.7 11.9 11.2%
          1 simple country ord text 10 2000000 0.6 10.7 0.6 14.2 32.7%
          1 simple country ordval text 10 2000000 0.6 10.7 0.6 14.3 33.6%
          1 simple country orddem text 10 2000000 0.6 10.7 0.6 13.4 25.2%
          1 wiki title val 147 10 4984 2.1 3743.8 2.4 2451.8 -34.5%
          1 wiki title ord 147 10 4984 2.1 3743.8 2.1 4459.4 19.1%
          1 wiki title ordval 147 10 4984 2.1 3743.8 2.1 4478.2 19.6%
          1 wiki title orddem 147 10 4984 2.1 3743.8 2.0 4233.9 13.1%
          1 wiki title val text 10 97191 2.1 144.2 2.4 38.9 -73.0%
          1 wiki title ord text 10 97191 2.1 144.2 2.1 165.0 14.4%
          1 wiki title ordval text 10 97191 2.1 144.2 2.1 159.9 10.9%
          1 wiki title orddem text 10 97191 2.1 144.2 2.1 161.4 11.9%
          1 wiki title val 1 10 386435 2.1 51.2 2.4 13.5 -73.6%
          1 wiki title ord 1 10 386435 2.1 51.2 2.1 67.1 31.1%
          1 wiki title ordval 1 10 386435 2.1 51.2 2.1 66.6 30.1%
          1 wiki title orddem 1 10 386435 2.1 51.2 2.1 64.7 26.4%
          numSeg index sortBy method query topN hits warm qps warmnew qpsnew pctg
          10 simple country val text 10 2000000 0.7 10.4 0.7 11.6 11.5%
          10 simple country ord text 10 2000000 0.7 10.4 0.5 13.9 33.7%
          10 simple country ordval text 10 2000000 0.7 10.4 0.5 14.0 34.6%
          10 simple country orddem text 10 2000000 0.7 10.4 0.5 13.2 26.9%
          10 wiki title val 147 10 4984 12.5 3004.5 2.6 1695.3 -43.6%
          10 wiki title ord 147 10 4984 12.5 3004.5 2.1 3072.8 2.3%
          10 wiki title ordval 147 10 4984 12.5 3004.5 2.1 3328.7 10.8%
          10 wiki title orddem 147 10 4984 12.5 3004.5 2.1 3295.1 9.7%
          10 wiki title val text 10 97191 12.7 139.4 2.4 38.7 -72.2%
          10 wiki title ord text 10 97191 12.7 139.4 2.1 158.9 14.0%
          10 wiki title ordval text 10 97191 12.7 139.4 2.1 161.7 16.0%
          10 wiki title orddem text 10 97191 12.7 139.4 2.1 157.7 13.1%
          10 wiki title val 1 10 386435 12.7 50.3 2.5 15.6 -69.0%
          10 wiki title ord 1 10 386435 12.7 50.3 2.1 65.4 30.0%
          10 wiki title ordval 1 10 386435 12.7 50.3 2.1 66.4 32.0%
          10 wiki title orddem 1 10 386435 12.7 50.3 2.1 63.5 26.2%
          numSeg index sortBy method query topN hits warm qps warmnew qpsnew pctg
          100 simple country val text 10 2000000 1.0 8.8 3.1 9.5 8.0%
          100 simple country ord text 10 2000000 1.0 8.8 0.6 11.4 29.5%
          100 simple country ordval text 10 2000000 1.0 8.8 0.6 11.3 28.4%
          100 simple country orddem text 10 2000000 1.0 8.8 0.6 11.0 25.0%
          100 wiki title val 147 10 4984 94.6 1066.9 3.7 456.8 -57.2%
          100 wiki title ord 147 10 4984 94.6 1066.9 2.1 522.2 -51.1%
          100 wiki title ordval 147 10 4984 94.6 1066.9 2.1 667.1 -37.5%
          100 wiki title orddem 147 10 4984 94.6 1066.9 2.1 781.7 -26.7%
          100 wiki title val text 10 97191 94.9 110.2 2.8 38.4 -65.2%
          100 wiki title ord text 10 97191 94.9 110.2 2.1 123.0 11.6%
          100 wiki title ordval text 10 97191 94.9 110.2 2.2 127.3 15.5%
          100 wiki title orddem text 10 97191 94.9 110.2 2.2 126.8 15.1%
          100 wiki title val 1 10 386435 94.3 47.9 2.8 15.8 -67.0%
          100 wiki title ord 1 10 386435 94.3 47.9 2.1 59.8 24.8%
          100 wiki title ordval 1 10 386435 94.3 47.9 2.2 60.8 26.9%
          100 wiki title orddem 1 10 386435 94.3 47.9 2.2 59.0 23.2%
          Mark Miller added a comment -

          Just as a reminder to myself - we also need a custom comparator test.

          Michael McCandless added a comment -

> If we changed it to track the maxSlot it should work... but I'm not
          > sure this is worthwhile, since it only speeds up already super-fast
          > searches and slightly hurts slow searches.

          OK I realized we could do this with very little added cost, by passing in "numFilledSlots" to FieldComparator.setNextReader. I'm attaching the patch with this.
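
Sketched, that tweak looks something like this (abridged, with an invented class name; FieldCache loading and the conversion loop itself are elided):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;

// Abridged sketch: the collector tracks how many queue slots actually hold
// hits and hands that to the comparator, so the per-segment conversion
// runs over [0, numFilledSlots) instead of the whole queue.
abstract class FieldComparatorSketch {
  abstract void setNextReader(IndexReader reader, int numFilledSlots)
      throws IOException;
}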

          Michael McCandless added a comment -

          Current results with topN=1000:

          numSeg index sortBy method query topN hits warm qps warmnew qpsnew pctg
          1 simple country val text 1000 2000000 0.9 10.0 0.8 11.4 14.0%
          1 simple country ord text 1000 2000000 0.9 10.0 0.6 13.7 37.0%
          1 simple country ordval text 1000 2000000 0.9 10.0 0.6 13.3 33.0%
          1 simple country orddem text 1000 2000000 0.9 10.0 0.7 13.1 31.0%
          1 wiki title val 147 1000 4984 2.1 740.6 2.3 381.8 -48.4%
          1 wiki title ord 147 1000 4984 2.1 740.6 2.1 905.2 22.2%
          1 wiki title ordval 147 1000 4984 2.1 740.6 2.1 906.4 22.4%
          1 wiki title orddem 147 1000 4984 2.1 740.6 2.1 834.5 12.7%
          1 wiki title val text 1000 97191 2.1 108.7 2.4 32.9 -69.7%
          1 wiki title ord text 1000 97191 2.1 108.7 2.1 124.6 14.6%
          1 wiki title ordval text 1000 97191 2.1 108.7 2.1 123.7 13.8%
          1 wiki title orddem text 1000 97191 2.1 108.7 2.1 119.9 10.3%
          1 wiki title val 1 1000 386435 2.1 46.2 2.4 12.6 -72.7%
          1 wiki title ord 1 1000 386435 2.1 46.2 2.1 60.3 30.5%
          1 wiki title ordval 1 1000 386435 2.1 46.2 2.1 59.3 28.4%
          1 wiki title orddem 1 1000 386435 2.1 46.2 2.1 57.9 25.3%
          numSeg index sortBy method query topN hits warm qps warmnew qpsnew pctg
          10 simple country val text 1000 2000000 0.8 9.7 0.7 11.2 15.5%
          10 simple country ord text 1000 2000000 0.8 9.7 0.6 13.5 39.2%
          10 simple country ordval text 1000 2000000 0.8 9.7 0.6 13.0 34.0%
          10 simple country orddem text 1000 2000000 0.8 9.7 0.6 12.8 32.0%
          10 wiki title val 147 1000 4984 12.7 664.2 2.4 417.3 -37.2%
          10 wiki title ord 147 1000 4984 12.7 664.2 2.2 58.3 -91.2%
          10 wiki title ordval 147 1000 4984 12.7 664.2 2.1 72.2 -89.1%
          10 wiki title orddem 147 1000 4984 12.7 664.2 2.1 92.5 -86.1%
          10 wiki title val text 1000 97191 12.6 100.4 2.4 33.5 -66.6%
          10 wiki title ord text 1000 97191 12.6 100.4 2.3 65.3 -35.0%
          10 wiki title ordval text 1000 97191 12.6 100.4 2.2 78.7 -21.6%
          10 wiki title orddem text 1000 97191 12.6 100.4 2.2 79.8 -20.5%
          10 wiki title val 1 1000 386435 12.7 42.4 2.5 14.6 -65.6%
          10 wiki title ord 1 1000 386435 12.7 42.4 2.7 46.2 9.0%
          10 wiki title ordval 1 1000 386435 12.7 42.4 2.1 51.1 20.5%
          10 wiki title orddem 1 1000 386435 12.7 42.4 2.1 51.5 21.5%
          numSeg index sortBy method query topN hits warm qps warmnew qpsnew pctg
          100 simple country val text 1000 2000000 1.0 8.5 1.3 9.2 8.2%
          100 simple country ord text 1000 2000000 1.0 8.5 0.6 10.3 21.2%
          100 simple country ordval text 1000 2000000 1.0 8.5 0.6 10.0 17.6%
          100 simple country orddem text 1000 2000000 1.0 8.5 0.6 10.6 24.7%
          100 wiki title val 147 1000 4984 93.8 442.8 2.9 245.2 -44.6%
          100 wiki title ord 147 1000 4984 93.8 442.8 2.3 12.0 -97.3%
          100 wiki title ordval 147 1000 4984 93.8 442.8 2.2 18.0 -95.9%
          100 wiki title orddem 147 1000 4984 93.8 442.8 2.1 58.2 -86.9%
          100 wiki title val text 1000 97191 93.4 88.0 2.5 33.5 -61.9%
          100 wiki title ord text 1000 97191 93.4 88.0 2.3 17.6 -80.0%
          100 wiki title ordval text 1000 97191 93.4 88.0 2.2 29.8 -66.1%
          100 wiki title orddem text 1000 97191 93.4 88.0 2.2 56.6 -35.7%
          100 wiki title val 1 1000 386435 92.8 41.0 2.6 14.9 -63.7%
          100 wiki title ord 1 1000 386435 92.8 41.0 2.7 16.5 -59.8%
          100 wiki title ordval 1 1000 386435 92.8 41.0 2.2 28.6 -30.2%
          100 wiki title orddem 1 1000 386435 92.8 41.0 2.2 44.1 7.6%
          Michael McCandless added a comment -

Given how different the results are, depending on how many segments
the index has, the queue size, how many hits the search gets, etc., I
think we need a dynamic solution: in certain situations (many hits,
small queue depth, small number of large segments) you use ORD, but
other times you use ORDDEM.

          So I'm thinking setNextReader should return a new comparator? Often
          it would simply return itself, but if it deems it worthwhile to switch
          eg from ORD to ORDDEM it would switch to ORDDEM and return that.

          EG for real-time search we may have a tail of a zillion small
          segments...

Then I also thought of a wild possible change: when searching, it'd
be best to visit the segments from largest to smallest, doing ORD in
the beginning and switching to ORDDEM at some point. So, could we do
this? I think we only "require" in-order docs within a segment, so
could we switch up the segment order? We'd need to fix the
setNextReader API to take in reader & docBase. Would that work?
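
A sketch of what that revised contract could look like (class names and the 10k cutoff are invented; the ORDDEM internals are stubbed out):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;

abstract class SwappableComparatorSketch {
  /** Returns the comparator to use for this reader: usually "this", but
   *  possibly a replacement that is cheaper for the remaining segments.
   *  docBase travels with the reader so out-of-order visits still work. */
  abstract SwappableComparatorSketch setNextReader(IndexReader reader, int docBase)
      throws IOException;
}

// An ORD comparator that hands over to an on-demand variant once the
// segments get small.
class OrdComparatorSketch extends SwappableComparatorSketch {
  private int segmentsSeen;
  SwappableComparatorSketch setNextReader(IndexReader reader, int docBase)
      throws IOException {
    segmentsSeen++;
    if (segmentsSeen > 1 && reader.numDocs() < 10000) {
      return new OrddemComparatorSketch(); // cheaper segment transitions from here on
    }
    return this; // keep using straight ords
  }
}

class OrddemComparatorSketch extends SwappableComparatorSketch {
  SwappableComparatorSketch setNextReader(IndexReader reader, int docBase) {
    return this; // ORDDEM pays per-compare instead of per-transition, so it stays
  }
}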

          Mark Miller added a comment -

> Given how different the results are, depending on how many segments
> the index has, the queue size, how many hits the search gets, etc., I
> think we need a dynamic solution: in certain situations (many hits,
> small queue depth, small number of large segments) you use ORD, but
> other times you use ORDDEM.

          Sounds interesting...

> So I'm thinking setNextReader should return a new comparator? Often
> it would simply return itself, but if it deems it worthwhile to switch
> eg from ORD to ORDDEM it would switch to ORDDEM and return that.

I like that, I think. The only other option I see offhand is a comparator that can do both, but that's not as clean and probably adds a check in tightly looped code.

> Then I also thought of a wild possible change: when searching, it'd
> be best to visit the segments from largest to smallest, doing ORD in
> the beginning and switching to ORDDEM at some point. So, could we do
> this? I think we only "require" in-order docs within a segment, so
> could we switch up the segment order? We'd need to fix the
> setNextReader API to take in reader & docBase. Would that work?

I think this could work well. Since you are likely to have a few large segments, ord would be fastest, then as you moved through the many small segments, orddem would likely work best. Is largest to smallest best though? You do get to map onto smaller term[] arrays as you go, but that causes more fallback. You are also likely to be carrying more hits in the queue into the next reader, right? From smallest to largest you likely have fewer hits to map as you hit the big segments, and more room to fit in for less fallback. So the question is, what obvious piece am I missing?

Largest to smallest, you fill the queue faster earlier. So there's more to convert as you hit all the other segments - but I guess that will be heavily mitigated by on-demand conversion... You will convert slot.min, and if nothing beats it, I guess that's it... so not so bad actually. And if you go smallest to largest, I guess the queue won't be full, so there will be more 'wins' into the queue, which will cause more conversions over the small segments... in which case, for a ton of them and a big queue, largest to smallest seems better. Still feel like I'm missing something, but I guess I have convinced myself largest to smallest is the way to go.
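
As a sketch of the mechanics (names invented; generics used for brevity even though the 2.9 line targets older Java): compute docBases in index order first, then visit the readers largest-first.

import java.util.Arrays;
import java.util.Comparator;
import org.apache.lucene.index.IndexReader;

class SubReaderOrderSketch {
  // Returns the visit order (indices into subs), largest segment first,
  // and fills docBases with each reader's base in the *original* order so
  // collected doc ids stay correct however the readers are visited.
  static Integer[] largestFirst(final IndexReader[] subs, int[] docBases) {
    int base = 0;
    for (int i = 0; i < subs.length; i++) {
      docBases[i] = base;
      base += subs[i].maxDoc();
    }
    Integer[] visitOrder = new Integer[subs.length];
    for (int i = 0; i < subs.length; i++) visitOrder[i] = i;
    Arrays.sort(visitOrder, new Comparator<Integer>() {
      public int compare(Integer a, Integer b) {
        return subs[b].numDocs() - subs[a].numDocs(); // biggest first
      }
    });
    return visitOrder;
  }
}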

          I'm probably not the first to wish there were more hours in the day...

          I'll put up a patch with the better testing soon just in case.

          Michael McCandless added a comment -

> The only other option I see offhand is a comparator that can do both, but that's not as clean and probably adds a check in tightly looped code.

Right, I wanted to avoid the inner-loop check by swapping out the comparator in between segments. Though, modern CPUs are quite good when an if-statement consistently goes one way, so it could be that a single comparator that does internal switching would perform fine. Still, if we fix the API to return a new comparator, we can then allow both options.

          I think in some cases we'd even fall back to VAL comparison.

          > Is largest to smallest best though?

          Good question; it's not obvious. We should try both, and perhaps allow for the collector to optionally specify the order.

My thinking was the first large segment using ORD is "free" (because ORD is only costly on switching segments). If there are many hits, likely the queue has done most of the work it'll do (ie, the majority of the total # insertions will have been done), unless the search is "degenerate". Perhaps the second segment, if large, warrants ORD, but then sometime soonish you'd switch to ORDDEM or VAL.

          The "long tail" of tiny segments would then normally be zipped through w/ hardly any insertions, so a higher insertion cost (with zero segment transition cost) is OK.

          But you're right: if we do the tiny segments first, then the queue would be small so transition cost is lower.

          We should make it simple to override a method to implement your own "search plan", and then provide a default heuristic that decides when to switch comparators. Probably that default heuristic should be based on how often compare was actually invoked for the segment. EG if the String sort is secondary to a numeric sort then even if there are many hits, if the numeric sort mostly wins (doesn't have many compare(...) == 0's) then the String sort should probably immediately switch to VAL after the first segment.

          Mark Miller added a comment -

Nice Mike! Definitely what needs to be done and very cool. Pluggable policy sounds great. You should be able to ride the right turns often enough to really reduce some of those losses. Ord with no transition cost on the largest segment should be a nice win on its own. I think that's what I was missing - if you come from smallest, you have to map on the ord switch.

          Mark Miller added a comment -

I was playing with this late last night and I got some of the work moving. I've got it so you can work the readers in any order (sorting by size at the moment, largest to smallest), but in doing so, straight ord no longer works. This is likely because the testing has gotten a little better, but I am not sure what's up. The other comparators work fine. Got a little of the "return a new comparator" type stuff in too, but nothing finished.

Kind of got caught up on ord - when that's fixed I'll put up a patch though.

          Mark Miller added a comment -

          Alright, my new years gift to myself is going to be to work on this a bit.

I've got part of the straight ord problem - if multiple queue entries clashed on conversion, and they were indeed not just clashes but clashes with the same original value, they were still getting an incremented subord when they should get an identical subord. So I've taken care of that, but something still fails. Looking.

On a side note though, we probably want another straight ords comparator that does not fall back to subords. Then we can use that as the first comparator on a large segment, and equal values won't have to needlessly fall back to subords for a worthless 0-on-0 check.
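
A sketch of the two compare flavors (field names invented): the subord fallback is only needed once slots may hold converted, approximate ords from earlier segments.

class OrdCompareSketch {
  int[] ords = new int[1000];     // per-slot ords in the current reader's term order
  int[] subords = new int[1000];  // per-slot tie-breaks assigned during conversion

  // With fallback: equal (converted) ords may still differ, so break ties.
  int compareWithFallback(int slot1, int slot2) {
    int cmp = ords[slot1] - ords[slot2]; // ords are small non-negative ints
    return cmp != 0 ? cmp : subords[slot1] - subords[slot2];
  }

  // Pure ord, as proposed above for the first (largest) segment: every
  // slot's ord is exact there, so equal ords really are equal values.
  int comparePure(int slot1, int slot2) {
    return ords[slot1] - ords[slot2];
  }
}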

          Mark Miller added a comment -

Cool, got the other one. When copying the doc into a slot (FieldComparator.copy), the subord had to be set to 0 so that it didn't retain a stale value. Straight ord looks good now, as do the rest, so I'll try and get a custom comparator policy in there so we can benchmark a couple strategies.
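
Sketched, the fix is a one-line reset in copy() (all names invented):

class CopyResetSketch {
  int[] ords = new int[1000];      // per-slot ords
  int[] subords = new int[1000];   // per-slot tie-breaks, assigned on conversion
  String[] values = new String[1000];
  int[] currentReaderOrds;         // doc -> ord for the current reader
  String[] lookup;                 // sorted unique terms for the current reader

  void copy(int slot, int doc) {
    ords[slot] = currentReaderOrds[doc]; // exact ord within this reader
    values[slot] = lookup[ords[slot]];   // keep the value for later re-keying
    subords[slot] = 0;                   // the fix: an exact ord carries no stale subord
  }
}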

Rather than a new Comparator for a pure ord that doesn't fall back to subord, perhaps the policy could return an overridden straight ord if it's first and on the first index reader...

          Mark Miller added a comment -

Still pretty ugly, but this patch has a working ord I think, allows readers in any order (giving largest to smallest now), and has a hacky ComparatorPolicy that can alter the comparators being used. Right now there is an ugly little one for ORD that switches to ORD_DEM after the first segment.

Also added something we may take out - an option to not fill fields (which we can't use for back compat, but a user could). If you are not using a MultiSearcher, it's kind of a waste of time to fill fields.

          Much to do, but fully functional I think.

          Mark Miller added a comment - - edited

          So what looks like a promising strategy?

Off the top of my head, I am thinking something as simple as:

• start with ORD with no fallback on the largest segment
• if the next segments are fairly large, use ORD_VAL
• if the segments get somewhat smaller, move to ORD_DEM

          Oddly, I've seen VAL perform well in certain situations, so maybe it has its place, but I don't know where yet.

Edit:

Oh, yeah, queue size should also play a role in the switching.
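
That strategy might sketch out as a pluggable policy roughly like this (the interface, constants, and thresholds are all invented; the patch's actual ComparatorPolicy API may differ):

import org.apache.lucene.index.IndexReader;

interface StrategySketch {
  int ORD = 0, ORD_VAL = 1, ORD_DEM = 2;
  int methodFor(IndexReader reader, int segmentIndex, int queueSize);
}

class SizeBasedStrategy implements StrategySketch {
  public int methodFor(IndexReader reader, int segmentIndex, int queueSize) {
    if (segmentIndex == 0) return ORD;   // largest segment: no fallback cost yet
    // Queue size matters too: a big queue makes each segment transition
    // expensive, pushing toward on-demand conversion sooner.
    if (reader.numDocs() > 100 * queueSize) return ORD_VAL; // still fairly large
    return ORD_DEM;                      // small tail: convert lazily
  }
}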

          Michael McCandless added a comment -

          Patch looks great! I will run some perf tests once I'm back at home
          base (still on vacation now, far away from home!). Some comments:

          • The comment "use multiple queues" in IndexSearcher.java isn't
            right – it should be "use single reader"?
          • I think FieldValueHitQueue.getComparatorPolicy should receive
            subreaders? EG when there's only one sub-reader, straight ORD is
            best. And possibly its order in the sort? We may want to move
            getComparatorPolicy into SortField so that one may override for
            "expert" cases? But we could do this later.
          • We should fix sorting of sub-readers by maxDoc() to be done once,
            not per query?
          • I think we should sort sub-readers by numDocs() not maxDoc()?
          • We should fix javadocs of MultiReaderHitCollector to explain that
            sub-readers are visited not-in-order. (And other javadoc fixes,
            but this can wait for the "polishing" phase...).
          • I like the "don't fill fields" option
          Mark Miller added a comment -

I'll put up another patch soon with those changes. All look correct. I wasn't sure about maxDoc() or numDocs() (not that I spent much time thinking about it) because the FieldCache loads up all the deleted docs - for things like the terms array to be mapped to, that depends on maxDoc(). But then how many hits are likely to end up in the queue is more related to numDocs(). Switched it to numDocs().

Still have to wrap up the custom comparator, but a ComparatorPolicy changes things there. If you can set a custom ComparatorPolicy (move it to Sort?), then you can easily put in any custom comparators.

- Mark
          Mark Miller added a comment -

          Less ugly, but still some finishing work to do, including some intelligent ComparatorPolicies.

          Mark Miller added a comment -

          No real work, but some more necessary cleanup.

          Michael McCandless added a comment -

          Mark, I see 3 testcase failures in TestSort if I "pretend" that SortField.STRING means STRING_ORD – do you see that?

          I think we should fix TestSort so that it runs N times, each time using a different STRING sort method, to make sure we are covering all these methods?

          Michael McCandless added a comment -

          I prototyped a rough change to the FieldComparator API, whereby
          TopFieldCollector calls setBottom to notify the comparator which slot
          is the bottom of the queue (whenever it changes), and then calls
          compareBottom (which replaces compare(int slot, int doc, float
          score)). This seems to offer decent perf. gains so I think we should
          make this change for real?

          I think it gives good gains because 1) compare to bottom is very
          frequent for a search that has many hits, and where the queue fairly
          quickly converges to the top N, 2) it allows the on-demand comparator
          to pre-cache the bottom's ord, and 3) it saves one array deref.

          TopFieldCollector would guarantee that compareBottom is not called
          unless setBottom was called; during the startup transient, setBottom
          is not called until the queue becomes full.
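
A sketch of the proposed shape (method names from the comment above; bodies and fields invented):

import java.io.IOException;

abstract class BottomAwareComparatorSketch {
  /** Called by the collector whenever the queue's weakest slot changes
   *  (and only once the queue is full). */
  abstract void setBottom(int slot);

  /** Replaces compare(int slot, int doc, float score): compare the incoming
   *  doc against the pre-cached bottom of the queue. */
  abstract int compareBottom(int doc, float score) throws IOException;
}

class StringOrdBottomSketch extends BottomAwareComparatorSketch {
  int[] slotOrds;          // per-slot ords
  int[] currentReaderOrds; // doc -> ord for the current reader (assumed loaded)
  private int bottomOrd;   // cached here: saves an array deref on every hit

  void setBottom(int slot) {
    bottomOrd = slotOrds[slot];
  }

  int compareBottom(int doc, float score) {
    return bottomOrd - currentReaderOrds[doc];
  }
}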

          Michael McCandless added a comment -

          On what ComparatorPolicy to use by default... I think we should start
          with ORD, but gather counters of number of compares vs number of
          copies, and based on those counters (and comparing to numDocs())
          decide "how aggressively" to switch comparators? That determination
          should also take into account the queue size.

          An optimized index would always use ORD (w/o gathering counters),
          which is fastest.
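
A rough cut of that counter-driven default (all names and thresholds invented):

class CountingPolicySketch {
  long compares, copies; // gathered by the comparator, per segment

  boolean shouldSwitchFromOrd(int nextSegmentNumDocs, int queueSize,
                              boolean optimizedIndex) {
    if (optimizedIndex) return false; // single segment: ORD, no counters needed
    if (copies == 0) return true;     // queue stopped changing: transitions are pure cost
    double comparesPerCopy = (double) compares / copies;
    // Many compares per copy means the queue has converged, so per-segment
    // conversion work now dominates and ORD should be abandoned sooner.
    return comparesPerCopy > 100 || nextSegmentNumDocs < 10 * queueSize;
  }
}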

          In the future... we could imagine allowing the query to dictate the
          order that segments are visited. EG if the query can roughly estimate
          how many hits it'll get on a given segment, we could order by that
          instead of simply numDocs().

          The query could also choose an appropriate ComparatorPolicy, eg, if it
          estimates it'll get very few hits, VAL is best right from the start,
          else start with ORD.

          Another future fix would be to implement ORDSUB with a single pass
          through the queue, using a reused secondary pqueue to do the full sort
          of the queue. This would let us assign subords much faster, I think.

          But I don't think we should pursue these optimizations as part of this
          issue... we need to bring closure here; we already have some solid
gains to capture. I think we should wrap up now...
