Case for Facet Count Caching: Paging through the hitlist (as well as paging through the facet list). In some cases generating the facet counts appears to take much longer than generating the hitlist, and that's certainly the case when the hitlist is retrieved from cache.
Case for Facet Paging: The UI design I'm doing back-end for has a list of facets with the top 5 values each, and a "More..." link when there are more than 5 facet values. Following that link is supposed to show a page with as many facet values as fit, plus Prev and Next paging buttons to reach the rest. This browser shows counts and can be sorted by count, but by default is sorted alphabetically by term. Next to each term is a checkbox; after browsing and checking, a button returns to the hitlist but adds a big OR of the checked terms as an fq. So for example if a user searches and gets 437 hits with rutabaga in the title, having 264 unique author names, they might want to browse the list looking for friends. Then after browsing and checking they can see a hitlist of all articles written by friends with rutabaga in the title.
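To make the "big OR of the checked terms" concrete, here is a minimal sketch of how the UI layer might build that fq string. The class name FacetFilterQuery and the field name "author" are made up for illustration; the only thing assumed about Solr is the standard query syntax of a field followed by a parenthesized group of quoted phrases.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper: turns the checked facet terms into one fq value.
final class FacetFilterQuery {

    // Quote a term as a phrase, escaping backslashes and double quotes
    // so that author names containing them cannot break the query.
    static String quote(String term) {
        return "\"" + term.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Builds e.g. author:("Smith, J." OR "Jones, A.")
    static String orOf(String field, List<String> checkedTerms) {
        if (checkedTerms.isEmpty()) {
            throw new IllegalArgumentException("no terms checked");
        }
        return checkedTerms.stream()
                .map(FacetFilterQuery::quote)
                .collect(Collectors.joining(" OR ", field + ":(", ")"));
    }
}
```

The resulting string would be sent as an additional fq parameter alongside the original q, so the hitlist is narrowed without changing the main query.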
I don't have any idea what proportion of facet queries would have offset > 0, i.e. where the user has moved to the next page, but I assume it's not rare.
It occurs to me that facet.limit should NOT do double-duty for paging: In a world where facet counts are cached, facet.limit should continue to play its current role, and limit the number of ranked values that make it into the BoundedTreeSet and thus the cache. Then facet.offset and facet.count could be used to return a subset. With facet.limit==0 there would be no limit, but the result could still be paged.
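A rough sketch of that split, under my own assumptions: the cached value is simply an ordered List of ranked entries, bound() plays the role of facet.limit when the cache entry is built, and page() plays the role of the proposed facet.offset and facet.count when a response is generated. FacetPager and both method names are hypothetical, not existing Solr code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper separating "how much is cached" from "which page is returned".
final class FacetPager {

    // facet.limit role: keep only the first `limit` ranked values; 0 means no limit.
    static <T> List<T> bound(List<T> ranked, int limit) {
        if (limit < 0) throw new IllegalArgumentException("limit must be >= 0");
        if (limit == 0 || limit >= ranked.size()) return new ArrayList<>(ranked);
        return new ArrayList<>(ranked.subList(0, limit));
    }

    // facet.offset / facet.count role: return one page of the cached values.
    static <T> List<T> page(List<T> cached, int offset, int count) {
        if (offset < 0 || count < 0) {
            throw new IllegalArgumentException("offset and count must be >= 0");
        }
        if (offset >= cached.size()) return Collections.emptyList();
        // Widen to long so offset + count cannot overflow.
        int end = (int) Math.min((long) offset + count, cached.size());
        return new ArrayList<>(cached.subList(offset, end));
    }
}
```

With this shape, moving to the next page only changes the arguments to page(); the cached list built by bound() is reused as-is.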
Case for pulling response generation out of getFieldCacheCounts and getFacetTermEnumCounts: I (truly) have a 37 million document index which I need to facet on Author, of which there are millions. The TermEnum algorithm is clearly unsuited, and the FieldCache algorithm requires an inordinate amount of memory; I had to disable it. So rather than tell management "can't be done", I think I need to plug in at least one more algorithm, e.g. using TermFreqVectors, to SimpleFacets. Would love not to have to replicate the response generation code.
Or the sorting code. Just had an idea: It would be even nicer if the counting logic could be passed some object, say an implementation of TermCountRecorder, which has an add(String term, int count) method.
- That object would encapsulate and isolate the generation of CountPair objects, the filtering for mincount, and whatever varieties of sorting are supported.
- Rather than have one object with multiple pathways e.g. for term vs. count vs. no sorting, a static factory method could take the field, sort, and mincount arguments and return an anonymous implementation based on a List or a TreeSet or whatever.
- The factory could also be told whether the counting logic guarantees adding terms in term (index) order, and if not but if term order were requested it could return an implementation which sorts by term text, otherwise a simple List.
- It could be the object that gets cached for that query for that field.
- It could have a generateResponse(offset, count) method which generates the <lst name="<facetfield>"> element of the response.
- It could optimize memory when multiple TermCountRecorders corresponding to different queries are cached for a field, by maintaining a single WeakHashMap of term strings for the field, so each TermCountRecorder with the same term has a pointer to the same String object – essentially like String.intern() but the scope is the field and the master value would disappear once all cached TermCountRecorders referencing it disappear.
- It would make life much easier for a faceting approach where, rather than iterating field->document, it might be more efficient to iterate document->field (e.g. TermFreqVectors?): A TermCountRecorder could be allocated for each faceting field using that algorithm and have add(...) called in a round-robin fashion as documents are iterated. At the end all could be added to the cache and, whether added or retrieved, would have generateResponse called.
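To make the idea above a bit more tangible, here is a minimal sketch of what TermCountRecorder and its factory might look like. Everything beyond the names TermCountRecorder, add and generateResponse is my own assumption: the Sort enum, the addsInTermOrder flag, and the use of a plain List of Map.Entry as a stand-in for whatever response structure Solr would actually need. The field argument and the shared term-string map are left out to keep it short.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// What the counting logic sees: it only reports (term, count) pairs.
interface TermCountRecorder {
    void add(String term, int count);

    // Returns up to `count` recorded entries starting at `offset`, in the requested order.
    List<Map.Entry<String, Integer>> generateResponse(int offset, int count);
}

final class TermCountRecorders {
    enum Sort { NONE, TERM, COUNT }  // hypothetical names for the sort varieties

    private static final Comparator<Map.Entry<String, Integer>> BY_TERM =
            (a, b) -> a.getKey().compareTo(b.getKey());

    // Highest count first; ties broken by term so a TreeSet keeps equal counts.
    private static final Comparator<Map.Entry<String, Integer>> BY_COUNT_DESC = (a, b) -> {
        int c = Integer.compare(b.getValue(), a.getValue());
        return c != 0 ? c : a.getKey().compareTo(b.getKey());
    };

    // Picks the backing collection once, so add() has a single code path.
    static TermCountRecorder create(Sort sort, int mincount, boolean addsInTermOrder) {
        final Collection<Map.Entry<String, Integer>> store;
        if (sort == Sort.COUNT) {
            store = new TreeSet<>(BY_COUNT_DESC);
        } else if (sort == Sort.TERM && !addsInTermOrder) {
            store = new TreeSet<>(BY_TERM);
        } else {
            store = new ArrayList<>();  // insertion order is already the wanted order
        }
        return new TermCountRecorder() {
            @Override
            public void add(String term, int count) {
                if (count >= mincount) {  // mincount filtering lives here, not in the counter
                    store.add(new AbstractMap.SimpleImmutableEntry<>(term, count));
                }
            }

            @Override
            public List<Map.Entry<String, Integer>> generateResponse(int offset, int count) {
                List<Map.Entry<String, Integer>> page = new ArrayList<>();
                int i = 0;
                for (Map.Entry<String, Integer> e : store) {
                    if (i >= offset && page.size() < count) page.add(e);
                    i++;
                }
                return page;
            }
        };
    }
}
```

A counting algorithm would then call create(...) once per facet field, call add(...) as it discovers counts, and hand the recorder to the cache; paging later only needs generateResponse(offset, count) on the cached object.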