I think that's an acceptable risk in exchange for the simpler index-time API.
What if someone already extends IW? Or what if we like that approach so much that we want to apply it elsewhere too (I mean in other modules)? I'm just saying that extending IW is a pretty big thing, compared to FacetFields.
I was thinking maybe we can wrap an IW and present an interface with only add/updateDocument, so on the one hand you get to add new FacetField() easily, but on the other you get to use whatever IW you want. It's like RandomIW, only we may not want to copy the entire set of APIs, just the document-related ones. And in the future we can think about a TaxoFacetIW which also owns commit()?
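A minimal sketch of what that wrapper idea could look like. The stand-in `Document`/`IndexWriter` types below are toy versions (the real Lucene classes are much richer), and every name here is hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

// Stand-ins for the real Lucene types, only to sketch the shape;
// all names here are hypothetical.
class Document {
    final List<String> fields = new ArrayList<>();
}

class IndexWriter {
    final List<Document> added = new ArrayList<>();
    void addDocument(Document doc) { added.add(doc); }
}

// Wraps whatever IW you want, but exposes only the document-related
// methods, so it can rewrite facet fields before delegating.
class FacetIndexWriter {
    private final IndexWriter delegate;

    FacetIndexWriter(IndexWriter delegate) { this.delegate = delegate; }

    void addDocument(Document doc) {
        delegate.addDocument(translateFacets(doc));
    }

    // The real thing would dedup/encode facet ordinals here; this just
    // tags the document to show where the hook lives.
    private Document translateFacets(Document doc) {
        doc.fields.add("$facets");
        return doc;
    }
}
```

The point is that the app keeps full control of the underlying IW (config, commit, merges) while the wrapper only intercepts document additions.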
Another crazy idea is to implement a FacetDocument which extends Document (we'd need to allow that first!) and does the same trick in addField()/indexableFields(). I think if we can pull this off (extending Document is preferred, IMO), it's far less intrusive than a new FacetIW.
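A toy sketch of that trick, pretending Document were extensible; the stand-in `Document` below is not the real Lucene class, and all names are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for Lucene's Document, pretending it were made extensible;
// all names here are hypothetical.
class Document {
    private final List<String> fields = new ArrayList<>();
    void add(String field) { fields.add(field); }
    List<String> indexableFields() { return fields; }
}

// Intercepts facet fields in add() and rewrites them when the indexer
// asks for indexableFields(), so any plain IW can index it.
class FacetDocument extends Document {
    private final List<String> facetLabels = new ArrayList<>();

    @Override
    void add(String field) {
        if (field.startsWith("facet:")) {
            facetLabels.add(field.substring("facet:".length()));
        } else {
            super.add(field);
        }
    }

    @Override
    List<String> indexableFields() {
        List<String> all = new ArrayList<>(super.indexableFields());
        // The real trick would encode ordinals into a payload/DV field.
        for (String label : facetLabels) {
            all.add("$facets/" + label);
        }
        return all;
    }
}
```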
Wait, the app cannot compute this (accurately) by summing the child counts? It will overcount in general, right?
Oops, you're right. So in that case, we can open up the extension point (I think it's FacetIW.dedupAndEncode) to let the app add the DIM ordinal too? Then I think the rest of the search code will just work?
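A rough sketch of what such an override could look like. The class and method names below are made up for illustration (only the idea of an overridable dedup/encode hook comes from the discussion):

```java
import java.util.Arrays;
import java.util.TreeSet;

// Hypothetical sketch: the default dedups (and would then encode) the
// ordinals; an app that wants accurate dimension counts overrides it to
// also add the DIM ordinal, so search-time counting just works.
class OrdinalProcessor {
    int[] dedupAndEncode(int[] ords) {
        TreeSet<Integer> dedup = new TreeSet<>();
        for (int ord : ords) dedup.add(ord);
        return dedup.stream().mapToInt(Integer::intValue).toArray();
    }
}

class DimOrdinalProcessor extends OrdinalProcessor {
    private final int dimOrd; // the ordinal of the dimension itself

    DimOrdinalProcessor(int dimOrd) { this.dimOrd = dimOrd; }

    @Override
    int[] dedupAndEncode(int[] ords) {
        int[] withDim = Arrays.copyOf(ords, ords.length + 1);
        withDim[ords.length] = dimOrd; // count the dimension directly
        return super.dedupAndEncode(withDim);
    }
}
```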
There was a user on the ML (I don't have the link) who just wanted to get the facet count for a specific label after faceting was done, and the hoops s/he had to jump through (a custom FacetResultHandler I think??) to achieve that were just crazy.
Maybe it's crazy, I don't know. All I'm saying is that now you took that user's request and made it a top-level API. I don't know if that qualifies as simplifying the APIs, or whether similar requests in the future will be easy to implement by extension, or whether we'd need to add them to our core code. For instance, this API makes Range and Count not share the same API (not saying it's a bad thing, just an example). But I already +1'd the idea of having that on the API.
Seriously? What abstractions are we missing?
Well, FacetArrays commits to int[], while we have someone who wants to use a Map, because he has millions of facets, yet queries typically hit very few of them. And he uses both the int and float arrays, most of which are allocated for no good reason. In order to use a Map he needs to write a FacetResultsHandler which computes top-K from the map, as well as extend FacetsAccumulator to override createFRH. Just an example. If we had a FacetValues abstraction with get/set/iterator methods, all he'd need to do is override the FacetValues that should be used.
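To make the idea concrete, here is a sketch of what such a FacetValues abstraction might look like, with a dense array-backed impl next to a sparse map-backed one. All names are hypothetical, and the extra indirection might cost performance, so it would need benchmarking:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PrimitiveIterator;
import java.util.stream.IntStream;

// Hypothetical abstraction over the per-ordinal aggregation storage,
// so an app with millions of ordinals but sparse hits can swap in a map.
interface FacetValues {
    double get(int ord);
    void increment(int ord, double delta);
    PrimitiveIterator.OfInt ordinals(); // ordinals to consider for top-K
}

// Dense storage: what FacetArrays does today.
class ArrayFacetValues implements FacetValues {
    private final double[] values;
    ArrayFacetValues(int size) { values = new double[size]; }
    public double get(int ord) { return values[ord]; }
    public void increment(int ord, double delta) { values[ord] += delta; }
    public PrimitiveIterator.OfInt ordinals() {
        return IntStream.range(0, values.length).iterator();
    }
}

// Sparse storage: millions of facets, few of them hit per query.
class MapFacetValues implements FacetValues {
    private final Map<Integer, Double> values = new HashMap<>();
    public double get(int ord) { return values.getOrDefault(ord, 0d); }
    public void increment(int ord, double delta) {
        values.merge(ord, delta, Double::sum);
    }
    public PrimitiveIterator.OfInt ordinals() {
        return values.keySet().stream().mapToInt(Integer::intValue).iterator();
    }
}
```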
But I already told him that this is something he can just do with FacetResultsHandler. And just to point out that this patch won't make things worse - since he already overrides the aggregation method, he can just implement a MapBasedTaxoFacetsSomething. So I only gave it as an example (a FacetValues abstraction, I think, will hurt performance in general, but we could benchmark).
Another abstraction is LUCENE-5316, which we struggle to get to perform (really frustrating!).
If this is too much change for the facets module, we could, instead,
leave the facets module as is, and break this effort out as a
different module (facets2, simplefacets, something?). We also have
many queryparsers, many highlighters, etc., and I think that's
healthy: all options can be explored.
-1 to that. I think it delivers the wrong message - why have two faceting modules? A branch is the perfect choice here, since it allows us to move the entire module to the new API. And along the way we benefit by verifying that the new API can really support implementing associations, complements and sampling.
To give an example - at first the API was overwhelming, and in the last round of changes I removed and moved things (I thought I was simplifying it!). So for instance, once upon a time FacetRequest let you specify the FacetsAggregator and FacetResultsHandler. I moved them both to FacetsAccumulator, but recently moved FacetsAggregator back to FacetRequest (it makes it easier to define which aggregator a certain FR needs to use). And on LUCENE-5333 I suggested moving FacetResultsHandler back to FR, to implement an AllDimFR.
So wearing the "experienced" hat, I just want to make sure we won't move everything out only to gradually swap it back in. I'm not against changing the APIs, but because this module is rich with features (which I, and apparently our users too, think are important), I'd like to make sure the new API works for all of them. I may compromise on complements and sampling because: (1) complements is not per-segment and I think it might require a full rewrite anyway (there's an issue for it), and (2) sampling is under the *.old package, meaning it was never fully migrated to the new APIs, and I think it could use some serious simplification there too.
If you're worried that you'll do this work alone, I promise to help. On the branch.
It's a crazy expert thing to create another faceting impl, so I think
such developers can handle changing their rollup to be lazy if it's
beneficial/necessary for their use case.
Let's use MapReduce as an analogy. The framework says "you worry about the Map and Reduce phase, I'll take care of the distribution stuff". Nice and clean. In the beginning the facets API said "you worry about aggregating a single ordinal, I'll worry about the rest". Then we changed it to "you worry about aggregating all ordinals in a segment, I'll worry about the rest".
Now, some of our users - think of them as data scientists. All they're interested in is ranking facets to drive some insights out of the data. I feel that we're putting more and more burden on them. Cutting the API from the single-ord level to the segment level was to gain sizable improvements in faceted search. But letting them do the rollup ... anyway, let's proceed with that, we'll see as we go.
I personally like code reuse, and it kills me to see duplicated code. Often that points to bad API design, not necessarily a need for simplification.
I'm not sure it's so easily separated out; the DimConfig is common to
index time and search time, and we're still iterating on that (I just
moved the indexedFieldName into it).
I was thinking to develop it still with FacetIndexingParams in mind - just cut over from FacetFields to FacetIW/FacetDocument. This makes the change gradual and allows us to investigate it separately. It's also, I think, much lower hanging fruit. But it looks like the changes in this patch will require rewriting large parts of it, so maybe we should do it here.
We'd have to change 3 places if we did that right now: FacetIndexWriter
(where we encode) and TaxonomyFacetCounts/SumValueSource (where we decode).
Wrong? I mean, you still didn't write a version which pulls the ords from OrdinalsCache (and I assume you don't want to get rid of it!). With those two TFCs, that already means we'll have 4 classes? Whereas I think SumValueSource could be oblivious to how the ords are encoded/fetched out of the box, since it's less commonly used (we only optimize counting), and someone can rewrite it to pull the ords from OrdinalsCache directly (instead of going through the CLI abstraction).
So I think it's fine to bake in the encoding, but with the two key methods in those classes 'protected' in the appropriate places so that an expert user could subclass them:
I think you point out two problems - one is the fact that we have the *.encoding package at all, and the other is how to override encoding/decoding. The nice thing about IntEncoder/Decoder (let's think of it as PackedInts, ok?) is that you can use an implementation without affecting other classes, while with protected encode/decode methods you need to extend every class that reads the data. I don't think that qualifies as good software design. Why do we have PackedInts then?
Now the question is whether we should keep the different IntEncoders or not - I don't know. Why do we have so many PackedInts versions? Do we use all of them, or are they there so that someone can make a pick? If they don't hurt anyone and require almost no maintenance, why remove them?
The argument is whether we should open up encoding/decoding at all. What Mike is saying is "if you want to encode differently, you're an expert and you can rewrite all of your and Lucene's aggregators (count, sum VS) too". What you're saying - add protected encode/decode methods - is more how the API looks today, only instead of a method we have an IntEncoder/Decoder (a la PackedInts) class. That's just easier to use in conjunction with other classes (especially if we do the CLI abstraction).
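To illustrate the difference: with a codec object, the same instance is shared by the indexing side and by every aggregator, and swapping the encoding needs no subclassing of the readers. The codec and aggregator below are toy sketches (the real IntEncoder/Decoder classes look different); `FixedWidthCodec` and `CountingAggregator` are made-up names:

```java
import java.nio.ByteBuffer;
import java.util.List;

// Hypothetical codec pair, a la PackedInts: one object both the
// indexing side and every aggregator can share, instead of protected
// encode()/decode() methods that force subclassing each reader.
interface IntEncoder { byte[] encode(int[] ords); }
interface IntDecoder { int[] decode(byte[] bytes); }

// A trivial fixed-width codec; the real module ships VInt/DGap/etc.
class FixedWidthCodec implements IntEncoder, IntDecoder {
    public byte[] encode(int[] ords) {
        ByteBuffer buf = ByteBuffer.allocate(ords.length * 4);
        for (int ord : ords) buf.putInt(ord);
        return buf.array();
    }
    public int[] decode(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        int[] ords = new int[bytes.length / 4];
        for (int i = 0; i < ords.length; i++) ords[i] = buf.getInt();
        return ords;
    }
}

// An aggregator that takes the decoder as a parameter stays oblivious
// to how the ordinals were encoded.
class CountingAggregator {
    private final IntDecoder decoder;
    CountingAggregator(IntDecoder decoder) { this.decoder = decoder; }
    int count(List<byte[]> docPayloads, int ord) {
        int count = 0;
        for (byte[] payload : docPayloads) {
            for (int o : decoder.decode(payload)) if (o == ord) count++;
        }
        return count;
    }
}
```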
Just to clarify, I don't mean we have to have the CLI abstraction and use it everywhere. I think it's a good API for someone to use if he doesn't care how ordinals are fetched (BDV, CachedOrds, whatever) or how they are encoded/decoded. It could very well be a CLITaxoFacetCounts impl which lets you pass the CLI you want to use (BDVCLI, CachedOrdsCLI, MySuperExpertDecodingCLI). That's it. Most users won't know about it, but the few that do won't need to rewrite so many classes.
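A minimal sketch of that shape, assuming the names above; the `CachedOrdsCLI` here is just an array-backed stand-in for something like OrdinalsCache:

```java
// Hypothetical sketch of the CLI idea: one abstraction over where the
// per-document ordinals come from (BDV, cached ords, a custom decoder),
// so a single counting impl serves them all.
interface CategoryListIterator {
    int[] ordinals(int docId);
}

// Backed by precomputed ordinals, standing in for OrdinalsCache.
class CachedOrdsCLI implements CategoryListIterator {
    private final int[][] cache;
    CachedOrdsCLI(int[][] cache) { this.cache = cache; }
    public int[] ordinals(int docId) { return cache[docId]; }
}

// Counting written only against the CLI: swapping in a BDV-backed,
// cached-ords or custom-decoding CLI needs no new counting class.
class CLITaxoFacetCounts {
    static int[] count(CategoryListIterator cli, int maxDoc, int numOrds) {
        int[] counts = new int[numOrds];
        for (int doc = 0; doc < maxDoc; doc++) {
            for (int ord : cli.ordinals(doc)) {
                counts[ord]++;
            }
        }
        return counts;
    }
}
```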