> It almost feels like we should have "hooks" that are invoked at
> certain times, like when we are about to load the term infos index,
> that give the application a chance to change something...
I agree with the need for some kind of hook. This is what TermInfosConfigurer is. It calls a method whenever a SegmentReader reads an index, to obtain the parameters (termIndexDivisor) that should be used to configure the TermInfosReader.
Why not make the setters/getters on SegmentIndexProperties regular non-static methods, and allow hook methods as well? E.g., setTermIndexDivisor(), getTermIndexDivisor(), getMaxTermsCached(String segmentName, int segmentNumDocs, long segmentNumTerms). Non-static methods make the defaulting straightforward and allow for subclassing to override hook methods.
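A rough sketch of what I mean (the class and method names here are just approximations of the proposal, not existing Lucene API):

    public class SegmentIndexProperties {

      private int termIndexDivisor = 1;  // default: use the full term infos index

      // Plain non-static setter/getter, so the default lives in the instance
      // and an application can change it without touching statics.
      public void setTermIndexDivisor(int divisor) { this.termIndexDivisor = divisor; }
      public int getTermIndexDivisor() { return termIndexDivisor; }

      // Hook method, called when a SegmentReader is about to read a segment's
      // term infos index.  Subclasses override this to bound the number of
      // terms cached based on the segment's size.
      public int getMaxTermsCached(String segmentName, int segmentNumDocs, long segmentNumTerms) {
        return Integer.MAX_VALUE;  // default: no bound
      }
    }

An application would then subclass this and override getMaxTermsCached() with whatever per-segment heuristic it wants, while the simple setter/getter pair covers the common case.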
> It sounds like a detector for this would be very useful. It would, e.g., substantially
> speed updates of such indexes, and not slow searches of them like a divisor does.
> At Excite we evolved effective heuristics for wordness to keep our dictionaries from exploding.
Yes, we are pursuing that approach as well, but we have some stringent requirements in our market. E.g., we cannot filter any valid content, because searches must be guaranteed to find all matching results. As a result, we cannot impose any maximum length for documents.
Any type of binary content recognizer would either need to be 100% accurate, which is impossible, or require human intervention to validate filtering. For a human-intervention approach to be viable, the false positive rate must be tiny; to be effective, the false negative rate must be tiny. Although invalid content is pretty easy for people to recognize, I've found so far that high-accuracy recognition rules are surprisingly subtle.
Do you by chance know of any quality work in this area?
> > int bound = (int) (1+TERM_BOUNDING_MULTIPLIER*Math.sqrt(1+segmentNumDocs)/TERM_INDEX_INTERVAL);
> This sounds like a fine approach.
It seems to be working ok, but there is one issue: Heaps' Law is based on the total number of tokens in the content, not the total number of documents. I.e., longer documents will generate more distinct terms than shorter documents. For large segments the use of numDocs works ok due to statistical averaging, but for smaller segments there are errors. I may loosen the bound somewhat on smaller segments in order to allow for their larger standard deviation.
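One way I might loosen it, as a sketch (SMALL_SEGMENT_DOCS and SMALL_SEGMENT_SLACK are placeholder constants, not settled values):

    int bound = (int) (1 + TERM_BOUNDING_MULTIPLIER
                           * Math.sqrt(1 + segmentNumDocs)
                           / TERM_INDEX_INTERVAL);
    if (segmentNumDocs < SMALL_SEGMENT_DOCS) {
      // Small segments average over fewer documents, so their per-document
      // distinct-term counts have a larger standard deviation; give the
      // Heaps' Law estimate extra headroom rather than risk a bound that
      // is too tight.
      bound = (int) (bound * SMALL_SEGMENT_SLACK);
    }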
If Lucene indexes tracked totalTokens (with duplicates, i.e. not numDistinctTokens) that would be perfect, but they don't. I don't know whether or not there would be other good uses for totalTokens, but I mention its relevance here in case there are.