Net/net I think we should offer an easy-to-use DV-backed facets impl...
If only DV could handle multi-values. Can they handle a single byte[]?
Because essentially that's what the facets API needs today - it stores everything in the payload, which is a byte[].
They can handle byte[], so I think we should just offer that.
Having a multi-val DV could benefit us by e.g. not needing to write an iterator on the payload to get the category ordinals ...
Right, though in the special (common?) case where a given facet field
is single-valued, like the Date facets I added to luceneutil /
nightlybench (see the graph here:
– only 3 data points so far!), we could also use DV's int fields and
let it encode the single ord (eg with packed ints) and then aggregate
up the taxonomy after aggregation of the leaf ords is done. I'm
playing with a prototype patch for this ...
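The "aggregate up the taxonomy after leaf counting" idea can be sketched in plain Java, with no Lucene APIs: count each doc's single ord during collection, then roll the counts up using a parents array. The names (`parents`, `countAndRollUp`) and the invariant that a parent's ordinal is smaller than its children's are illustrative assumptions, not the actual patch:

```java
// Sketch: leaf-ord counting followed by a single roll-up pass over the taxonomy.
// Assumes parents[ord] gives ord's parent, ord 0 is the root, and
// parent ordinals are always smaller than child ordinals.
public class RollUp {
  public static int[] countAndRollUp(int[] docOrds, int[] parents) {
    int[] counts = new int[parents.length];
    for (int ord : docOrds) {
      counts[ord]++;              // leaf counting during collection
    }
    // Walk from the highest ordinal down: each child's total is final
    // before its parent is visited.
    for (int ord = counts.length - 1; ord > 0; ord--) {
      counts[parents[ord]] += counts[ord];
    }
    return counts;
  }
}
```

One pass over the counts array replaces per-doc taxonomy walking, which is what makes the single-valued case attractive.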
Do I understand correctly that the caching Collector is reusable? Otherwise I don't see how the CachedBytes help.
No no: this is all just a hack (the CachedBytes / static cache). We
should somehow cleanly switch to DV ... it wasn't clear to me how to
do that ...
Hmmm, what if you used the in-mem Codec, for loading just this term's posting list into RAM? Do you think you would see the same gain?
Maybe! Have to test ...
If you want to make this a class that can be reused in other scenarios, then here are a few tips that can enable that:
I do! If ... making it fully generic doesn't hurt perf much. The
decode chain (w/ separate reInit called per doc) seems heavyish ...
Instead of referencing CatListParams.DEFAULT_TERM, you can pull the term from FacetSearchParams.getFacetIndexingParams().getCLP(new CP()).getTerm().
Ahh ok. I'll fix that.
Also, you can obtain the right IntDecoder from the CLP for decoding the ordinals. That would remove the hard dependency on VInt+gap, and allow e.g. to use a PackedInts decoder.
OK I'll try that.
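For reference, the hard-coded VInt+gap decode under discussion looks roughly like this plain-Java sketch; the class and method names are illustrative stand-ins, not Lucene's IntDecoder API (an encoder is included so the round trip is checkable):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of gap-coded VInt ordinal decoding, the way the facets
// payload is encoded today. Not actual Lucene code.
public class OrdinalDecoder {
  // Each VInt in buf is a delta (gap) from the previous ordinal.
  public static List<Integer> decodeOrdinals(byte[] buf) {
    List<Integer> ords = new ArrayList<>();
    int pos = 0, prev = 0;
    while (pos < buf.length) {
      int value = 0, shift = 0;
      byte b;
      do {
        b = buf[pos++];
        value |= (b & 0x7F) << shift;   // low 7 bits first
        shift += 7;
      } while ((b & 0x80) != 0);        // high bit set = more bytes
      prev += value;                    // undo the gap encoding
      ords.add(prev);
    }
    return ords;
  }

  // Matching encoder, for round-trip testing.
  public static byte[] encodeOrdinals(int[] ords) {
    java.io.ByteArrayOutputStream out = new java.io.ByteArrayOutputStream();
    int prev = 0;
    for (int ord : ords) {
      int delta = ord - prev;
      prev = ord;
      while ((delta & ~0x7F) != 0) {
        out.write((delta & 0x7F) | 0x80);
        delta >>>= 7;
      }
      out.write(delta);
    }
    return out.toByteArray();
  }
}
```

Pulling this loop behind the CLP's IntDecoder is what would let e.g. a PackedInts decoder be swapped in without touching the collector.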
Not sure that we should, but this class supports only one CLP. I think it's ok to leave it like that, and get the CLP.term() at the ctor, but then we must be able to cache the bytes at the reader level. That way, if an app uses multiple CLPs, it can initialize multiple such Collectors.
I think it's ok to rely on the top Query to not call us for deleted docs, and therefore pass liveDocs=null. If a Query wants to iterate on deleted docs, we should count facets for them too.
Maybe you should take the IntArrayAllocator from the outside? That class can be initialized by the app once to e.g. use maxArrays=10 (e.g. if it expects at most 10 queries in parallel), and then the int[]s are reused whenever possible. The way the patch is now, if you reuse that Collector, you can only reuse one array.
Ahh I'll do that.
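The reuse pattern that an externally-shared allocator enables can be sketched like this; `ArrayPool` is an illustrative stand-in for IntArrayAllocator, not the actual Lucene class:

```java
import java.util.ArrayDeque;

// Sketch: a shared pool of count arrays, so concurrent queries reuse
// int[]s instead of reallocating one per collector instance.
public class ArrayPool {
  private final int length;
  private final int maxArrays;
  private final ArrayDeque<int[]> free = new ArrayDeque<>();

  public ArrayPool(int length, int maxArrays) {
    this.length = length;       // e.g. taxonomy size
    this.maxArrays = maxArrays; // e.g. expected max parallel queries
  }

  public synchronized int[] allocate() {
    int[] a = free.poll();
    return a != null ? a : new int[length];
  }

  public synchronized void free(int[] a) {
    if (free.size() < maxArrays) {
      java.util.Arrays.fill(a, 0);  // clear before handing out again
      free.push(a);
    }
    // beyond maxArrays, just let the array be GC'd
  }
}
```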
Separately I was wondering if we should sometimes do aggregation
backed by an int hashmap, and have it "upgrade" to a non-sparse
array only once the number of ordinals collected gets too large. Not sure it's
THAT important since it would only serve to keep fast queries fast but
would make slow queries a bit slower...
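The sparse-then-upgrade aggregation could look something like this sketch; the class name, upgrade trigger, and use of a plain HashMap are assumptions for illustration (a real version would likely use a primitive int hash map):

```java
// Sketch: count ordinals sparsely, then switch to a dense int[] once
// the number of distinct ordinals crosses a threshold.
public class SparseCounts {
  private final int maxOrd;
  private final int upgradeThreshold;
  private java.util.HashMap<Integer, Integer> sparse = new java.util.HashMap<>();
  private int[] dense;  // null while still sparse

  public SparseCounts(int maxOrd, int upgradeThreshold) {
    this.maxOrd = maxOrd;
    this.upgradeThreshold = upgradeThreshold;
  }

  public void increment(int ord) {
    if (dense != null) {
      dense[ord]++;
      return;
    }
    sparse.merge(ord, 1, Integer::sum);
    if (sparse.size() >= upgradeThreshold) {
      // Upgrade: copy the sparse counts into a full array.
      dense = new int[maxOrd];
      for (java.util.Map.Entry<Integer, Integer> e : sparse.entrySet()) {
        dense[e.getKey()] = e.getValue();
      }
      sparse = null;
    }
  }

  public int count(int ord) {
    return dense != null ? dense[ord] : sparse.getOrDefault(ord, 0);
  }
}
```

This keeps fast (few-hit) queries from paying for a taxonomy-sized array, at the cost of one copy when a slow query crosses the threshold.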
In setNextReader you sync on the cache only in case someone executes a search w/ an ExecutorService? That's another point where caching at the Codec/AtomicReader level would be better, right?
Also for multiple threads running at once ... but it's all a hack anyway ...
Why is acceptDocsOutOfOrder false? Is it because of how the cache works? Because facet counting is not limited to in-order only.
For the non-caching one that's true, because we can only advance on the fulltree posting. But if the posting is entirely in RAM, we can random-access it, right?
Oh good point – the DV/cache collectors can accept out of order.
I wonder if we can write one good Collector, and optimize the caching stuff through the Reader, or DV. Collectors in Lucene are usually not reusable; at least, I haven't seen such a pattern. The current FacetsCollector isn't reusable (b/c of the bitset and potential scores array). So I'm worried users might be confused and won't benefit the most from that Collector, b/c they won't reuse it ..
On the other hand, saying that we have a FacetsIndexReader (composite) which per configuration initializes the right FacetAtomicReader would be more consumable by apps.
I think we should have two new collectors here. One keeps using
payloads but operates per segment and aggregates on the fly (if, on
making it generic again, we still see gains).
The other stores the byte[] in DV. But somehow we have to make "send
the byte[] to DV, not payloads, at index time" easy ... I'm not sure how ...
About the results, just to clarify – in both runs the 'QPS base' refers to current facet counting and 'QPS comp' refers to the two new collectors respectively?
Right: base = current trunk, comp = the two new collectors.
I'm surprised that the int[] didn't perform much better, since you don't need to do the decoding for every document, for every query. But then, perhaps it's because the RAM size is so large, and we pay a lot swapping in/out of the CPU cache ...
This also surprised me, but I suspect it's the per-doc pointer
dereferencing that's costing us. I saw the same problem with
DirectPostingsFormat ... This also ties up tons of extra RAM (pointer
= 4 or 8 bytes; int[] object overhead maybe 8 bytes?). I bet if we
made a single int[], and did our own addressing (eg another int[] that
maps docID to its address) then that would be faster than byte[] via ...
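That "single int[] plus an addressing int[]" layout could be sketched like this (illustrative names, not actual Lucene code): all docs' ordinals live in one flat array, and `addr[docID]` marks where each doc's slice starts, so counting needs no per-doc object dereference:

```java
// Sketch: flat ordinal storage with a docID -> start-offset address array.
public class FlatOrds {
  public final int[] addr;  // length = maxDoc + 1; addr[d]..addr[d+1] is doc d's slice
  public final int[] ords;  // concatenated ordinals for all docs

  public FlatOrds(int[][] perDocOrds) {
    addr = new int[perDocOrds.length + 1];
    int total = 0;
    for (int d = 0; d < perDocOrds.length; d++) {
      addr[d] = total;
      total += perDocOrds[d].length;
    }
    addr[perDocOrds.length] = total;
    ords = new int[total];
    for (int d = 0; d < perDocOrds.length; d++) {
      System.arraycopy(perDocOrds[d], 0, ords, addr[d], perDocOrds[d].length);
    }
  }

  // Aggregate one doc's ordinals: two array reads, no pointer chasing.
  public void countDoc(int docID, int[] counts) {
    for (int i = addr[docID]; i < addr[docID + 1]; i++) {
      counts[ords[i]]++;
    }
  }
}
```

Two flat arrays also cost a fixed ~4 bytes per ordinal plus ~4 bytes per doc, versus per-doc object headers and pointers.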
Also, note that you wrote specialized code for decoding the payload, vs. using an API to do that (e.g. PackedInts / IntDecoder). I wonder how that would compare to the base collection, i.e. whether we would still see the big difference between the int[] and the byte[] caching.
Yeah good question. I'll separately test the specialized decode to
see how much it's helping....
Mike, if we do the Reader/DV caching approach, that could benefit post-collection performance too, right? Is it possible for you to hack the current FacetsCollector to do the aggregation over CachedBytes and then compare the difference?
Right! DV vs payloads is decoupled from during- vs post-collection
I'll open a separate issue to allow byte DV backing for facets....
Because your first results show that during-collection is not that much faster than post-collection, I'm just wondering if it'll be the same when we cache the bytes outside the collector entirely.
If so, I think it should push us to do this caching outside, because we've already identified cases where post-collection is needed (e.g. sampling) too.
Overall though, great work Mike !
We must get this code in. It's clear that it can potentially be a big gain for some scenarios ...
Thanks! I want to see that graph jump!