Patch against Lucene 1.9 trunk as of Mar 1 06
Summary: Provide a way to avoid loading the TermInfoIndex into memory if you know all the terms you are ever going to query.
In our search environment, we have a large number of indexes (many thousands), any of which may be queried by any number of hosts. These indexes may be very large (~1M documents), but since we have a low term/doc ratio, we have 7-11M terms. With an index interval of 128, that means ~70-90K entries in the term index. On loading the index, Lucene instantiates a Term, a TermInfo, a String, and a char[] for each of those entries. When the index is long lived, this makes some sense because you can quickly search the list of terms using binary search. However, since we throw away the indexes very often, a lot of garbage is created per query.
Here's an example where we load a large index 10 times. The four classes listed below account for about 70MB of allocation over the 10 loads, which corresponds to 7MB of garbage per query.
         percent          live           alloc'ed      stack class
 rank   self  accum     bytes   objs     bytes   objs  trace name
    1  4.48%  4.48%   4678736 128946  23393680 644730 387749 char[]
    3  3.95% 12.61%   4126272 128946  20631360 644730 387751 org.apache.lucene.index.TermInfo
    6  2.96% 22.71%   3094704 128946  15473520 644730 387748 java.lang.String
    8  1.98% 26.97%   2063136 128946  10315680 644730 387750 org.apache.lucene.index.Term
This adds up after a while. Since we know exactly which Terms we're going to search for before even opening the index, there's no need to allocate this much memory. Upon opening the index, we can go through the TII in sequential order and keep only the entries needed to locate those Terms in the main term dictionary, which reduces the storage requirements dramatically. If you only make 1 query/index, this reduces the amount of garbage generated by querying by about 60% and increases throughput by 77%.
This is accomplished by factoring out the "index loading" aspects of TermInfosReader into a new file, SegmentTermInfosReader. TermInfosReader becomes a base class to allow access to terms. A new class, PrefetchedTermInfosReader, will, upon startup, sort the passed-in terms and retrieve the IndexEntries for those terms. IndexReader and SegmentReader are modified to add new constructor methods that take a Collection of Terms corresponding to the total set of terms that will ever be searched in the life of the index.
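The prefetch pass described above can be sketched roughly as follows. This is an illustrative model only, not the code in the patch: the class and method names are hypothetical, plain Strings stand in for Lucene Terms and IndexEntries, and it assumes the first index entry is the first term in the dictionary (so a wanted term that sorts before it cannot exist).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.TreeSet;

// Hypothetical sketch: Strings stand in for Lucene Terms / index entries.
class PrefetchSketch {
    /**
     * indexTerms: the sampled index entries, already in sorted order.
     * wanted: every term that will ever be queried.
     * Returns the positions of the index entries worth keeping: for each
     * wanted term, the last index entry that is less than or equal to it.
     */
    static List<Integer> prefetch(List<String> indexTerms, Collection<String> wanted) {
        List<Integer> kept = new ArrayList<>();
        if (indexTerms.isEmpty()) {
            return kept;
        }
        int pos = 0;
        // Sorting the wanted terms lets us make one forward pass over the index.
        for (String term : new TreeSet<>(wanted)) {
            // Advance while the next entry is still <= the wanted term.
            while (pos + 1 < indexTerms.size()
                    && indexTerms.get(pos + 1).compareTo(term) <= 0) {
                pos++;
            }
            boolean notAfterTerm = indexTerms.get(pos).compareTo(term) <= 0;
            boolean alreadyKept = !kept.isEmpty() && kept.get(kept.size() - 1) == pos;
            if (notAfterTerm && !alreadyKept) {
                kept.add(pos);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> index = Arrays.asList("abba", "argon", "bison", "camel", "zebra");
        List<String> wanted = Arrays.asList("cat", "apple", "apricot");
        System.out.println(prefetch(index, wanted)); // prints [0, 3]
    }
}
```

In this toy run only two of the five index entries are retained, which is the effect the patch is after: memory proportional to the number of queried terms rather than to the size of the term index.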
In order to support the "skipping" behavior, some changes need to be made to SegmentTermEnum: specifically, we need to be able to go back an entry in order to retrieve the previous TermInfo and IndexPointer. This is because, unlike the normal case, with the index we want to return the entry right before the intended term (so that we end up behind the desired term in the main dictionary). For example, if we're looking for "apple" in the index, and the two adjacent values are "abba" and "argon", we want to return "abba" instead of "argon". That way we won't miss any terms in the main dictionary. This code is confusing; it should probably be moved to a subclass of TermBuffer, but that required more code. Wanting to keep the patch small by not modifying TermBuffer also led to the odd NPE catch in SegmentTermEnum.java. Sticklers for contracts may want to rename SegmentTermEnum.skipTo() because it implements a different contract, but it would be useful for anyone trying to skip around in the TII, so I figured it was the right thing to do.
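The "return the entry before the target" behavior amounts to a floor lookup. A minimal sketch of that contract, assuming a sorted in-memory array of Strings in place of the real TII (the class and method names here are hypothetical and do not appear in the patch, and String ordering only approximates Lucene's field-then-text Term ordering):

```java
import java.util.Arrays;

// Hypothetical sketch of the floor-lookup contract, not the patch's code.
class FloorLookupSketch {
    /** Index of the greatest entry <= target, or -1 if every entry is greater. */
    static int floorIndex(String[] sortedEntries, String target) {
        int i = Arrays.binarySearch(sortedEntries, target);
        if (i >= 0) {
            return i; // exact match: the entry itself is the floor
        }
        int insertionPoint = -i - 1; // index of the first entry greater than target
        return insertionPoint - 1;   // the entry just before it (-1 if none)
    }

    public static void main(String[] args) {
        String[] index = {"abba", "argon"};
        System.out.println(index[floorIndex(index, "apple")]); // prints abba
    }
}
```

Starting the scan of the main dictionary from the floor entry means the scan begins at or before the wanted term; starting from "argon" would begin past "apple" and miss it.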