Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.7, Trunk
    • Component/s: modules/facet
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      I'd like to explore simplifications to the facet module's APIs: I
      think the current APIs are complex, and the addition of a new feature
      (sparse faceting, LUCENE-5333) threatens to add even more classes
      (e.g., FacetRequestBuilder). I think we can do better.

      So, I've been prototyping some drastic changes; this is very
      early/exploratory and I'm not sure where it'll wind up but I think the
      new approach shows promise.

      The big changes are:

      • Instead of *FacetRequest/Params/Result, you directly instantiate
        the classes that do facet counting (currently TaxonomyFacetCounts,
        RangeFacetCounts or SortedSetDVFacetCounts), passing in the
        SimpleFacetsCollector, and then you interact with those classes to
        pull labels + values (topN under a path, sparse, specific labels).
      • At index time, no more FacetIndexingParams/CategoryListParams;
        instead, you make a new SimpleFacetFields and pass it the field it
        should store facets + drill downs under. If you want more than
        one CLI you create more than one instance of SimpleFacetFields.
      • I added a simple schema, where you state which dimensions are
        hierarchical or multi-valued. From this we decide how to index
        the ordinals (no more OrdinalPolicy).

       Sparse faceting is just another method (getAllDims) on both the
       taxonomy & SSDV facet classes.
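       To make the proposed flow concrete, here is a tiny self-contained analogue in plain Java. MatchingDocs, ToyFacetCounts, and getTopChildren are illustrative stand-ins for SimpleFacetsCollector/TaxonomyFacetCounts, not the patch's actual classes — the point is only the shape of the API: collect hits, hand them to a counting class, pull labels + values directly.

```java
import java.util.*;
import java.util.stream.*;

// Stand-in for SimpleFacetsCollector: just gathers matching doc ids.
class MatchingDocs {
    final List<Integer> docs = new ArrayList<>();
    void collect(int doc) { docs.add(doc); }
}

// Stand-in for TaxonomyFacetCounts: counts facet labels for the matching
// docs and exposes them directly -- no FacetRequest/FacetResult indirection.
class ToyFacetCounts {
    private final Map<String, Integer> counts = new HashMap<>();

    ToyFacetCounts(MatchingDocs hits, Map<Integer, List<String>> docToLabels) {
        for (int doc : hits.docs) {
            for (String label : docToLabels.getOrDefault(doc, List.of())) {
                counts.merge(label, 1, Integer::sum);
            }
        }
    }

    // Analogue of pulling topN under a path.
    List<Map.Entry<String, Integer>> getTopChildren(int topN) {
        return counts.entrySet().stream()
            .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
            .limit(topN)
            .collect(Collectors.toList());
    }
}

public class ToyFacetDemo {
    public static void main(String[] args) {
        Map<Integer, List<String>> docToLabels = Map.of(
            0, List.of("Author/Lisa"),
            1, List.of("Author/Bob"),
            2, List.of("Author/Lisa"));
        MatchingDocs hits = new MatchingDocs();
        hits.collect(0); hits.collect(1); hits.collect(2);

        ToyFacetCounts facets = new ToyFacetCounts(hits, docToLabels);
        System.out.println(facets.getTopChildren(1)); // [Author/Lisa=2]
    }
}
```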

      I haven't created a common base class / interface for all of the
      search-time facet classes, but I think this may be possible/clean, and
      perhaps useful for drill sideways.

      All the new classes are under oal.facet.simple.*.

      Lots of things that don't work yet: drill sideways, complements,
      associations, sampling, partitions, etc. This is just a start ...

      1. LUCENE-5339.patch
        2.00 MB
        Michael McCandless
      2. LUCENE-5339.patch
        123 kB
        Michael McCandless
      3. LUCENE-5339.patch
        90 kB
        Michael McCandless

        Issue Links

          Activity

          Michael McCandless added a comment -

          Starting patch.

          Shai Erera added a comment -

          Some comments about the patch:

          • You have a TODO file which seems to have been 'svn added' - are you aware of it? Just making sure, so it's not accidentally committed
          • Maybe we should do this work in a branch and avoid the .simple package? It seems that the majority of the patch is about copy-pasting classes over to .simple, which I assume you did so you can work in isolation and have tests pass?
          • FieldTypes:
            • Can FT.FieldType provide a ctor taking these arguments and then DEFAULT_FIELD_TYPE pass (false,false)? Or, init those two members to false.
            • I wonder if FieldTypes won't confuse users w/ Field.FieldType, so maybe you should name it DimType or something? And the upper class FacetsConfig?
            • Maybe instead of setHierarchical + setMultiValued you do setDimType(DimType)? Then you could avoid the synchronization?
            • Not sure, but I think that SimpleFacetFields adds the dimension's ordinal if the dim is both hierarchical and multi-valued? That's a step back from the default ALL_BUT_DIM that we have today. I think we might want to have a requiresDimValue or something, because I do think the dimension's value (count) is most often unneeded, and it's a waste to encode its ordinal?
          • Constants isn't documented, not sure if 'precommit' will like that, but in general I think the constants should have jdocs. Maybe put a nocommit?
          • TaxonomyFacetCounts
            • .getAllDims() – If a dimension is not hierarchical, I think SimpleFacetResult.count == 0? In that case, sorting by its count is useless?
              • I think it's also relevant for SortedSet?
          • LabelAndValue has a Number as the value rather than double. I see this class is used only in the final step (labeling), but what's wrong w/ the previous 'double' primitive? Is it to avoid casting or something?

           On the approach in general - I personally don't think the current API is overwhelming, but looking at your patch, I do think it could use some simplification. But all in all, the current API allows an app to extend the facets capabilities in ways that do not force it to duplicate a lot of code. Let me explain:

          CategoryListIterator

          That's the piece of code which is responsible for decoding the category list. It's currently used for two purposes: (1) allow you to encode the categories in different formats (using IntEncoder/Decoder abstraction) as well as load the categories from a different source. E.g. that's how we were able to test fast cutting over facets to DV from Payload, that's a nice abstraction for letting you load the values from a cache, and so forth.

          If you want to specialize code, you can do what FastCountingFacetsAggregator does – if you encoded w/ DGapVInt and you want counting, it doesn't go through the CLI API. But if you look at SumScoreFacetsAggregator, there's no reason for it to be limited to just DGapVInt encoding, or that you (the app) won't be able to pull the ordinals from a cache, without rewriting SumScoreFacetsAggregator.

           Heck, even though it's not implemented yet, we could implement a SortedSetCLI (as far as I can tell) which uses SortedSetDVReaderState to pull the ordinals of a document. It's really a useful interface.

           So IMO the CLI interface is important. Most apps, I believe, don't extend the facets module (they'll use counting and maybe aggregate by ValueSource pre-built classes). But those that do want to add another aggregation method (e.g. ValueSourceFacetsAggregator) can simply use the CLI API, or copy-paste the DGapVInt decode logic if they want to specialize. Especially since that code is not trivial, and we've spent a lot of time optimizing it. And the few apps that want to explore other encoding/decoding logic can write their own CLI.

          Without that interface, if e.g. you want to encode the ordinals differently, or load them from a cache (e.g. CachedOrds), you'd have to rewrite every aggregation method that you want to use. And most probably you'll copy-paste the majority of the code. So why let go of the interface?
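           The reuse argument can be illustrated with a self-contained stand-in for the CLI abstraction. OrdinalSource here is a hypothetical stand-in for CategoryListIterator, and the two aggregators are toy analogues of CountingFacetsAggregator/SumScoreFacetsAggregator: both share one ordinal source, so swapping the encoding or plugging in a cache means swapping only the source, not rewriting each aggregator.

```java
import java.util.*;
import java.util.function.IntConsumer;

// Stand-in for CategoryListIterator: yields the ordinals of one document,
// hiding where they come from (DGapVInt decoding, a cache, SortedSetDV...).
interface OrdinalSource {
    void forEachOrdinal(int doc, IntConsumer consumer);
}

// Counting aggregator: +1 per ordinal occurrence.
class CountingAggregator {
    final int[] counts;
    CountingAggregator(int numOrds) { counts = new int[numOrds]; }
    void aggregate(int doc, OrdinalSource source) {
        source.forEachOrdinal(doc, ord -> counts[ord]++);
    }
}

// Sum-score aggregator: same traversal, different accumulation --
// it reuses the exact same OrdinalSource, which is the point.
class SumScoreAggregator {
    final float[] scores;
    SumScoreAggregator(int numOrds) { scores = new float[numOrds]; }
    void aggregate(int doc, float score, OrdinalSource source) {
        source.forEachOrdinal(doc, ord -> scores[ord] += score);
    }
}

public class CliReuseDemo {
    public static void main(String[] args) {
        // A trivial in-memory source; a cache-backed or DV-backed source
        // would plug in here without touching either aggregator.
        Map<Integer, int[]> ords = Map.of(0, new int[]{1, 2}, 1, new int[]{2});
        OrdinalSource source =
            (doc, c) -> { for (int o : ords.get(doc)) c.accept(o); };

        CountingAggregator count = new CountingAggregator(3);
        SumScoreAggregator sum = new SumScoreAggregator(3);
        for (int doc = 0; doc <= 1; doc++) {
            count.aggregate(doc, source);
            sum.aggregate(doc, 2.0f, source);
        }
        System.out.println(count.counts[2] + " " + sum.scores[2]); // 2 4.0
    }
}
```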

          FacetsCollector

           It's pretty much your SimpleFC, only it requires a FacetsAccumulator. That was done as a convenience so that apps can send in a collector and then call fc.getFacetResults. But its essence is really the fact that it collects MatchingDocs per segment, which the aggregators later use to aggregate the facets in those matching docs.

          Given the changes in this patch, +1 for making it just return the List<MatchingDocs> and let whatever API we decide on to use that list, instead of FC.getFacetResults().

          FacetsAggregator

          This class lets you implement the aggregation function, be it count, sum-score, value-source or by-association. I think it's equivalent to TaxoFacetCounts, e.g. in your approach, if anyone will want to aggregate the ordinals by other than count, that's what they will need to write.

          What I like about the approach in your patch is that (and I had similar thoughts on the last round of API changes) it "reverses" the flow. The app says "I want to count the ords in that field, and sum-score the ords in that field" and then the app can ask for the aggregated values from each such aggregator. Whereas today, app does "new CFR(), new CFR(), new CFR(), new SumScore()" and then receives back the List<FacetResult>. Today's API is convenient in that it allows you to pass a List<FR> around in your code and get back a List<FacetResult>. But I don't see this convenience getting away (you'll just pass a List<FacetsAggregator> and then pull the requested values later on). Also, you sort of pulled the specific requests, such as .getSpecificCount(CP) and .getAllDims() up to FacetsAggregator, where today this can be done by FacetResultsHandler (you'd have to write one, as I did on LUCENE-5333).

          What I like less about it is that it folds in the logic coded by FacetsAccumulator and FacetResultsHandler:

          • FacetsAccumulator
            • Invokes FacetsAggregator to aggregate the facets in MatchingDocs
            • Whether the request specifies a CP for which we need to rollupValues, it asks the aggregator to do so
            • It uses a FacetResultsHandler to compute the top-K facets for each request.
          • FacetResultsHandler
            • Computes top-K values for each parent category
            • Has variants that return a full sub-tree (TopKInEachNodeHandler)
            • By means of extension, allows you to get the value of a specific CategoryPath, as well as get values for all dimensions (as I've shown on LUCENE-5333).

          So I'm torn here. I really like how the API lets you extend things, and only the things that you care about. If you only want to implement a ValueSourceFacetsAggregator, you shouldn't be worried about whether or not to call rollupValues or how to compute top-K. That seems redundant. You should focus on what you want to extend. I also like less the fact that now every TaxoFacetsSomething will need to be aware of the facets configuration...

          But I do like the approach presented in the patch, so I wonder if there's anything we can do to preserve the focused extension points on one hand, yet simplify the app's integration with the API on the other hand. Like, if an app did something like this:

          FacetsAccumulator fa = new TaxoFA(..., matchingDocs, new CountingFacetsAggregator("dvFacets1", [CategoryListIterator]), [FacetResultsHandler]); // or SortedSetFA()
          fa.getTopK(dimension); // or fa.getTopK(new CategoryPath(...))
          fa.getValueOf(CategoryPath);
          fa.getAllDimensions(); // does not depend on the aggregation function, only needs the Taxonomy and int[]/float[] array source.
          
          fa = new TaxoFA(..., matchingDocs, new SumScoreFacetsAggregator("dvFacets2", [CategoryListIterator]), [FacetResultsHandler]); // or SortedSetFA()
          fa.getTopK(dimension); // or fa.getTopK(new CategoryPath(...))
          fa.getValueOf(CategoryPath);
          fa.getAllDimensions();
          
          fa = new TaxoFA(..., matchingDocs, new SumIntAssociationFacetsAggregator("dvFacets3"), [FacetResultsHandler]); // no CLI or SortedSetFA
          fa.getTopK(dimension); // or fa.getTopK(new CategoryPath(...))
          fa.getValueOf(CategoryPath);
          fa.getAllDimensions();
          
          RangeAccumulator range = new RangeAccumulator(..., matchingDocs, ranges); // RangeAccumulator might not even extend FacetsAccumulator
          range.getTopK(dimension);
          // I think the rest of the getters make no sense?
          

          The idea is that internally, FacetsAccumulator uses FacetsAggregator to do the aggregation and rollup if needed, and you can optionally pass a FacetResultsHandler, while it defaults to a FlatFacetResultsHandler (today's silly name DepthOneFRH). You can also pass your CategoryListIterator to the FacetsAggregator (those that need it). That way, app's code looks similar to the one in the patch, only we allow more code reuse between different aggregation functions.

          I hope this will allow us to support other aggregation functions by SortedSet too, as there's really no reason why not to do it. There are two differences between SortedSet and TaxoIndex: (1) the Taxonomy implementation (where you pull the children of an ordinal from, how you do labeling etc.) and (2) where the ordinals are encoded in the search index (BinaryDV in TaxoIndex case, SortedSetDV in SortedSet case). The latter can easily be solved by a CLI implementation of SortedSet and former by an abstract Taxonomy (read-only) API which both SortedSet and TaxoIndex implement. We should explore these on a separate issue though.

          I think that the problem with your current patch is that the aggregation is done in the constructor, so it sort of eliminates reasonable ways to extend things. But if things were done differently, we could have an abstract FacetsAggregator instead of FacetsAccumulator which let you implement only the .aggregate() method, and take care of rollup + top-K computation itself. But that means you'd need to initialize the class and then call a .aggregate() or .compute() before you call any of the .getTopK() for instance (unless we check in every .getTopK() if it's already computed...).

          FacetArrays

           Unfortunately, there is no efficient way to hold either an int or float in Java without specifying the type of the array (i.e. long[] or double[] are too expensive), so this class abstracts the way facets are aggregated, so that we can have different FacetsAggregator implementations while reusing the majority of the code that is not specific to the aggregation function (top-K computation, rollup, CategoryListIterator...).

          I don't know if we can get rid of it... especially as there might be aggregators that are interested in both arrays (e.g. avgOfSomething). Maybe if we have a FacetsAccumulator abstract class, we can have an IntFA and FloatFA abstract classes which give you access to an int[]/float[] arrays, and matching FacetResultsHandler?
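           One possible shape for the IntFA/FloatFA idea, purely as a sketch of the suggestion above (all names here are hypothetical, not existing Lucene classes): each subclass exposes its own primitive array, while result handlers read through a uniform accessor.

```java
// Sketch: an abstract accumulator with int- and float-specialized
// subclasses, so each aggregator only touches the array type it needs,
// and a result handler ranks via a uniform valueOf() read.
abstract class ToyFacetsAccumulator {
    abstract double valueOf(int ord); // uniform read for result handlers
}

class IntFacetsAccumulator extends ToyFacetsAccumulator {
    final int[] values;
    IntFacetsAccumulator(int size) { values = new int[size]; }
    void increment(int ord) { values[ord]++; }
    @Override double valueOf(int ord) { return values[ord]; }
}

class FloatFacetsAccumulator extends ToyFacetsAccumulator {
    final float[] values;
    FloatFacetsAccumulator(int size) { values = new float[size]; }
    void add(int ord, float v) { values[ord] += v; }
    @Override double valueOf(int ord) { return values[ord]; }
}

public class FacetArraysSketch {
    public static void main(String[] args) {
        IntFacetsAccumulator counts = new IntFacetsAccumulator(4);
        counts.increment(1); counts.increment(1);
        FloatFacetsAccumulator sums = new FloatFacetsAccumulator(4);
        sums.add(1, 1.5f);
        // A result handler could rank via valueOf() without caring
        // which primitive array sits underneath.
        System.out.println(counts.valueOf(1) + " " + sums.valueOf(1)); // 2.0 1.5
    }
}
```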

          FacetIndexingParams

           By all means, I'm +1 for simplifying this class and making it more user-friendly. I like how you can specify a DimType (hierarchical, singleValue...) instead of setting the more arbitrary OrdinalPolicy. I do think this class has some value today, e.g. in CategoryListParams (which you nuked), the app doesn't need to know about the field under which facets are indexed at all, only if it cares to change it or index multiple category lists. Not sure it's a blocker, just pointing that out. On the plus side - it makes the app more aware of what's going on in its index...

          I don't have strong feelings about this class, we can rename it to FacetsConfig or whatever, but let's put the configuration parameters under it (e.g. DimType and drillDownDelimiter)?

          Gilad Barkai added a comment -

          Mike,
           the idea of simplifying the API sounds great, but is it really that complicated now?

           The facet module's Accumulator is similar to Lucene's Collector, the Aggregator is sort of a Scorer, and a FacetRequest is a sort of Query.
          Actually the model after which the facets were designed was Lucene's.
          The optional IndexingParams came before the IndexWriterConfig but these can be said to be similar as well.

          More low-level objects such as the CategoryListParams are not a must, and the user may never know about them (and btw, they are similar to Codecs).

          I reviewed the patch (mostly the taxonomy related part) and I think that even without associations, counts only is a bit narrow.
           Especially with large counts (say many thousands) the count doesn't say much because of the "long tail" problem.
          When there's a large result set, all the categories will get high hit counts. And just as scoring by counting the number of query terms each document matches doesn't always make much sense (and I think all scoring functions do things a lot smarter), using counts for facets may at times yield irrelevant results.

          We found out that for large result sets, an aggregation of Lucene's score (rather than +1), or even score^2 yields better results for the user. Also arbitrary expressions which are corpus specific (with or without associations) changes the facets' usability dramatically. That's partially why the code was built to allow different "aggregation" techniques, allowing associations, numeric values etc into each value for each category.
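           The point can be seen in a toy comparison (hypothetical numbers, not real corpus data): two categories with identical hit counts rank as a tie by counting, but very differently once per-hit scores (or score^2) are summed.

```java
public class CountVsScoreDemo {
    public static void main(String[] args) {
        // Same hit count, very different relevance per hit.
        float[] hitsA = {0.9f, 0.8f, 0.9f}; // highly relevant hits
        float[] hitsB = {0.1f, 0.2f, 0.1f}; // marginal hits

        int countA = hitsA.length, countB = hitsB.length; // 3 vs 3: a tie
        float sumA = 0, sumB = 0, sqA = 0, sqB = 0;
        for (float s : hitsA) { sumA += s; sqA += s * s; }
        for (float s : hitsB) { sumB += s; sqB += s * s; }

        // By count the categories are indistinguishable; by sum-score
        // (or score^2) category A clearly dominates.
        System.out.printf("count:     %d vs %d%n", countA, countB);
        System.out.printf("sum-score: %.2f vs %.2f%n", sumA, sumB);
        System.out.printf("score^2:   %.2f vs %.2f%n", sqA, sqB);
    }
}
```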

           As for the new API, it may be useful if there were a single "interface" - so all facets implementations could be switched easily, allowing users to experiment with the different implementations without writing a lot of code.

           Bottom line, I'm all for simplifying the API, but the current cost seems too great, and I'm not sure the benefits are proportional.

          Robert Muir added a comment -

          I took a look at the patch (actually just the tests!) and had these random thoughts:

          Can we rename CategoryPath to FacetLabel or something more intuitive?
           Can there be sugar like addFields(doc, FacetLabel) so a user doesn't have to make lists etc?
          Maybe it could be varargs like FacetLabel..., so just some sugar:

          public void addFields(Document doc, FacetLabel... labels) {
            addFields(doc, Arrays.asList(labels));
          }
          

          new LongRange("less than 10", 0L, true, 10L, false) <-- can we make it so this is less arguments?

          I like that RangeFacetCounts takes varargs though!

          For the Taxo case, I think the "document build" process is still too complicated.
          What if it worked like this:

          IndexWriter iw = new FacetIndexWriter(dir1, dir2);
          Document doc = new Document();
          doc.add(...new TextField(body))
          doc.add(new FacetField("foo", "bar"))
          iw.addDocument(doc);
          

          Then this FacetIW calls super.addDoc, with an iterable filtering out FacetFields, and also
          does whatever it needs to do with the FacetFields on the taxo index.

          Hide
          Shai Erera added a comment -

          Can we rename CategoryPath to FacetLabel or something more intuitive?

          CategoryPath is all about the path components of a category, so the name is not entirely out of place. But FacetLabel is nice too, and definitely aligns w/ the rest of the API, e.g. FacetResultNode has a 'label' member, not 'path'. But CP is such a core class that renaming it will change all of the facet module's classes and every app out there (while the changes in this issue are less drastic), so I just wonder if it's a must or a nice-to-have. I don't object to it, just saying it's a rote renaming of a core class. Anyway, if we want to do this, I suggest we do it under a separate issue, just to keep things contained.

          Can there be sugar like addFields(doc, FacetLabel) so a user doesnt have to make lists etc?

          I think the varargs version is good to have, though in practice (except for tests) I never found the need for one, because usually the facets are added by reading them from somewhere, so you need to construct a List or array anyway. It's only in tests, mock demos, or very specific applications that such a varargs method will be useful. If we have a super FacetFields class, then this method could just be final and delegate to the List variant, as you noted in the code example. Otherwise, we just need to make sure we add it to every FF, so we're consistent.

          new LongRange("less than 10", 0L, true, 10L, false) <-- can we make it so this is fewer arguments?

          You mean omitting the inclusive params, maybe defaulting to true, or maybe minInclusive=true and maxInclusive=false (really depends how you view the max value)? I guess we could do that. But then we should do that in all other Range types (there are only 3).
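A sketch of that sugar, assuming half-open defaults (minInclusive=true, maxInclusive=false, like array slices); `SimpleLongRange` is a hypothetical class, not the real LongRange:

```java
public class SimpleLongRange {
  final String label;
  final long min, max;
  final boolean minInclusive, maxInclusive;

  // Full ctor, matching the existing five-argument shape.
  public SimpleLongRange(String label, long min, boolean minInclusive,
                         long max, boolean maxInclusive) {
    this.label = label;
    this.min = min;
    this.minInclusive = minInclusive;
    this.max = max;
    this.maxInclusive = maxInclusive;
  }

  // Sugar: half-open [min, max) by default.
  public SimpleLongRange(String label, long min, long max) {
    this(label, min, true, max, false);
  }

  // True if v falls inside this range, honoring the inclusive flags.
  public boolean accept(long v) {
    if (v < min || (v == min && !minInclusive)) return false;
    if (v > max || (v == max && !maxInclusive)) return false;
    return true;
  }
}
```

Half-open defaults have the nice property that adjacent ranges like [0,10) and [10,20) never double-count a value.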

          For the Taxo case, I think the "document build" process is still too complicated.

          Why the "Taxo" case? TaxoIndex and SortedSet have similar APIs – SimpleFacetFields (maybe we want to rename it to TaxonomyFacetFields) and SortedSetDVFacetFields. Both are needed because they add both drill-down terms and BinaryDV (Taxo) or SortedSetDV (SortedSet) fields.

          A FacetIndexWriter is a nice idea, but we'd need variants for Taxo, SortedSet and the enum method Mike explores in LUCENE-5326. What I like less about it is that today the app can use two FacetFields, say one for Enum and one for SortedSet (or whatever other combination), and I don't see how it will do that using FacetIW.

          Hmm ... or what we could do is have a single FacetIW, but different field types, e.g. TaxonomyFacetField, SortedSetFacetField and EnumFacetField. FacetIW will capture them and act accordingly. From the app's perspective, the interaction w/ the APIs is simpler, since it just adds another field to the document. Then we don't even need to change the FacetFields APIs (varargs or not). I think I like it! Can we do this in a separate issue? I think it's lower hanging fruit than the changes proposed in this issue, and is also focused (essentially nuking FacetFields in exchange for dedicated Field extensions).

          What I also like about FacetIW is that if we want to, we can wrap the app's Codec with a more suitable FacetCodec. E.g. maybe there's a better way to encode the facet information than a plain BinaryDV (maybe it's a special BDV). Asking an app to set a Codec is a rather high entry point. Note that I'm not talking about apps that e.g. want to configure the facet DV to use DirectDVF - that's ok. I'm talking about the case where we think there's a better way to encode the data for all (or maybe, depending on a schema, some) faceted-search apps - then requiring a special Codec is an even less approachable API.

          Shai Erera added a comment -

          About FacetIndexWriter, it will need to take an optional TaxonomyWriter (i.e. if you intend to use TaxonomyFacetField). But then I wonder whether users will expect FacetIW.commit() to commit both the underlying IW and TW. Actually, this could be a very good thing to do, since we could control the order in which those two objects are committed, and do the two-phase commit properly, rather than telling users what to do. But that means we'd need to make IW.commit() not final. Besides the advantage of doing the commit right, I worry that if we don't do that, users will be confused about having to call TW.commit() themselves, just because now FacetIW already has a handle to their TW. What do you think? We could also just add a commitTaxoAndIndex() method, but that's less elegant.
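The commit-ordering idea can be sketched like this (all names hypothetical); the point is that the taxonomy must be durable before the index whose committed documents reference its ordinals:

```java
import java.util.List;

public class TwoPhaseCommitSketch {

  // Minimal stand-in for anything with a commit(), e.g. a TaxonomyWriter
  // or an IndexWriter.
  interface Committable {
    void commit();
  }

  // Commit in the safe order and record what happened, so the ordering is
  // observable: taxonomy first, then index.
  public static List<String> commitAll(List<String> log) {
    Committable taxoWriter = () -> log.add("taxo");
    Committable indexWriter = () -> log.add("index");
    // If we crash between the two commits, the taxonomy merely contains some
    // unreferenced ordinals; the reverse order could leave committed documents
    // pointing at ordinals the taxonomy never persisted.
    taxoWriter.commit();
    indexWriter.commit();
    return log;
  }
}
```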

          Michael McCandless added a comment -

          Thanks for the feedback everyone ... I'm attaching a new patch.

          Michael McCandless added a comment -

          You have a TODO file which seems to have been 'svn added' - are you aware of it?

          I put a nocommit.

          Maybe we should do this work in a branch and avoid the .simple package?

          I'll start a branch, but on the branch I'd like to keep working on
          .simple for now while we bang out the API changes. If things start to
          crystallize then we can start cutting everything else over?

          Can FT.FieldType provide a ctor taking these arguments and then DEFAULT_FIELD_TYPE pass (false,false)? Or, init those two members to false.

          Maybe instead of setHierarchical + setMultiValued you do setDimType(DimType)? Then you could avoid the synchronization?

          I put a nocommit.

          I wonder if FieldTypes won't confuse users w/ Field.FieldType, so maybe you should name it DimType or something? And the upper class FacetsConfig?

          Good, I renamed both.

          Not sure, but I think that SimpleFacetFields adds the dimension's ordinal if the dim is both hierarchical and multi-valued? That's a step back from the default ALL_BUT_DIM that we have today. I think we might want to have a requiresDimValue or something, because I do think the dimension's value (count) is most often unneeded, and it's a waste to encode its ordinal?

          It does, and I think that's OK? Yes, it's one extra ord it indexes,
          but that will be a minor perf hit since it's multi-valued and
          hierarchical.

          Constants isn't documented, not sure if 'precommit' will like that, but in general I think the constants should have jdocs. Maybe put a nocommit?

          I put nocommits. Sadly, precommit won't fail (lucene/facet is
          "missing" in document-lint target in build.xml, because we never fixed
          all its javadocs). We should separately fix that.

          TaxonomyFacetCounts.getAllDims() – If a dimension is not hierarchical, I think SimpleFacetResult.count == 0? In that case, sorting by its count is useless? I think it's also relevant for SortedSet?

          I THINK these will work correctly (we sum the dim value as we visit
          all children), but I was missing the "corrected" dim count in the
          hierarchical + MV case so I added that. Also, I put nocommits to test
          this.

          LabelAndValue has a Number as the value rather than double. I see this class is used only in the final step (labeling), but what's wrong w/ the previous 'double' primitive? Is it to avoid casting or something?

          I think it's crazy that when I ask for counts I get a double back.
          That's why I switched it to a Number.

          CategoryListIterator

          I appreciate the usefulness of this API, but rather than adding it
          into "simple" from the get-go, I'd like to build out the different
          facet methods and understand whether it's actually useful / worth the
          additional abstraction.

          For example, I'm not sure it would work very well with SSDV, since we
          first count in seg-ord space and then convert to global-ord space only
          when combining counts across segments (this gave better performance).
          I mean, yes, it would "work" in that the abstraction would be correct,
          but we'd be paying a performance penalty.
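The seg-ord-first strategy described above can be sketched with plain arrays; the ord map below is a stand-in for the real per-segment OrdinalMap machinery:

```java
public class SegOrdCounting {

  // segOrdsPerDoc: for each hit in one segment, the segment-local ords it has.
  // segToGlobal: maps each segment ord to its global ord.
  public static int[] countGlobal(int[][] segOrdsPerDoc, int[] segToGlobal,
                                  int numGlobalOrds) {
    // Phase 1: count in dense, segment-local ord space (cheap per hit).
    int[] segCounts = new int[segToGlobal.length];
    for (int[] docOrds : segOrdsPerDoc) {
      for (int ord : docOrds) {
        segCounts[ord]++;
      }
    }
    // Phase 2: remap once per segment ord, not once per hit. This is why
    // deferring the global-ord conversion gave better performance.
    int[] global = new int[numGlobalOrds];
    for (int segOrd = 0; segOrd < segCounts.length; segOrd++) {
      global[segToGlobal[segOrd]] += segCounts[segOrd];
    }
    return global;
  }
}
```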

          E.g. that's how we were able to test fast cutting over facets to DV from Payload, that's a nice abstraction for letting you load the values from a cache, and so forth.

          I think doing such future tests with the simple APIs will still be
          easy; I don't think we should open up abstractions for the
          encoding/decoding today.

          FacetsAggregator

          I added another facet method, TaxonomyFacetSumValueSource for value
          source aggregation, in the next patch, to explore this need...

          But I don't see this convenience getting away (you'll just pass a List<FacetsAggregator> and then pull the requested values later on).

          True but ... we need some base class / interface that all these
          *Facets.java implement ... I haven't done that yet (it's a TODO).

          What I like less about it is that it folds in the logic coded by FacetsAccumulator and FacetResultsHandler

          Maybe we can move some of these methods into the base class? I'm not
          sure though... since for the TaxonomyFacetSumValueSource it's float[]
          values and for the *Count it's int[] counts.

          At the end of the day, the code that does the facet counting, the
          rollup, pulling the topK, is in fact a small amount of code; I think
          replicating bits of this code for the different methods is the lesser
          evil than adding so many abstractions that the module is hard to
          approach by new devs/users.

          Has variants that return a full sub-tree (TopKInEachNodeHandler)

          That handler is such an exotic use case ... and the app can just
          recurse itself, calling TaxoFacetCounts.getTopChildren?
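The suggested app-side recursion might look like this sketch; the `Node` tree is a hypothetical stand-in for taxonomy children as returned, pre-sorted by value, from a getTopChildren-style call:

```java
import java.util.ArrayList;
import java.util.List;

public class RecursiveTopChildren {

  public static class Node {
    public final String label;
    public final int count;
    public final List<Node> children = new ArrayList<>();
    public Node(String label, int count) { this.label = label; this.count = count; }
  }

  // Depth-first collect, keeping at most topN children per node. Children are
  // assumed already sorted by count descending, as a top-children call would
  // return them. The app controls topN per level, which the current API
  // doesn't support directly.
  public static void collect(Node node, int topN, List<String> out) {
    out.add(node.label);
    int kept = 0;
    for (Node child : node.children) {
      if (kept++ == topN) break;
      collect(child, topN, out);
    }
  }
}
```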

          FacetArrays

          We avoid the need to factor this out, by simply enumerating the facet
          impls directly. E.g. we (or a user) can make a TaxoFacetAvgScore that
          allocates its own int[] and float[]...

          I do think this class has some value today, e.g. in CategoryListParams (which you nuked), the app doesn't need to know about the field under which facets are indexed at all, only if it cares to change it or index multiple category lists.

          The facet field is now optional (defaults to $facets =
          Constants.DEFAULT_FACET_FIELD); the TestXXX.testBasic now look quite
          simple.

          let's put the configuration parameters under it (e.g. DimType and drillDownDelimiter)?

          I added nocommits; I think it's a good idea to move delim char inside
          here; we'd need to enforce that all fields sharing the same indexed
          field have the same delim char, but I think that's fine.
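Enforcing that constraint could be as simple as this sketch (hypothetical names, not the actual FacetsConfig API):

```java
import java.util.HashMap;
import java.util.Map;

public class DelimConsistency {

  // First delimiter registered per indexed field wins; later registrations
  // must agree with it.
  private final Map<String, Character> delimByField = new HashMap<>();

  // Returns false (instead of registering) on a conflicting delimiter; a real
  // config class would likely throw IllegalArgumentException here instead.
  public boolean register(String indexedField, char delim) {
    Character prev = delimByField.putIfAbsent(indexedField, delim);
    return prev == null || prev == delim;
  }
}
```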

          What do you mean by DimType? Taxo vs SSDV? That's interesting to put
          in the DimConfig... it'd mean we (almost?) could pick the right
          accum/agg/results impl at search time.

          Michael McCandless added a comment -

          Facet's Accumulator is similar to Lucene's Collector, the Aggregator is sort of a Scorer, and a FacetRequest is a sort of Query.
          Actually the model after which the facets were designed was Lucene's.
          The optional IndexingParams came before the IndexWriterConfig but these can be said to be similar as well.

          I appreciate those analogies but I think the two cases are very
          different: I think faceting is (ought to be) far simpler than
          searching.

          More low-level objects such as the CategoryListParams are not a must, and the user may never know about them (and btw, they are similar to Codecs).

          Likewise, I don't think we need to expose "codec like control" /
          pluggability over facet ords encoding at this point.

          I reviewed the patch (mostly the taxonomy related part) and I think that even without associations, counts only is a bit narrow.

          I added ValueSource aggregation in the next patch, but not
          associations; I think associations can come later (it's just another
          index time and search time impl).

          Especially with large counts (say many thousands), the count doesn't say much because of the "long tail" problem.
          When there's a large result set, all the categories will get high hit counts. And just as scoring by counting the number of query terms each document matches doesn't always make much sense (and I think all scoring functions do things a lot smarter), using counts for facets may at times yield irrelevant results.

          We found out that for large result sets, an aggregation of Lucene's score (rather than +1), or even score^2, yields better results for the user. Also, arbitrary expressions which are corpus specific (with or without associations) change the facets' usability dramatically. That's partially why the code was built to allow different "aggregation" techniques, allowing associations, numeric values etc. into each value for each category.

          I agree.

          Do you think ValueSource faceting is sufficient for such apps? Or do
          they "typically" use associations? Aren't associations only really
          required in the multi-valued facet field case?

          As for the new API, it may be useful if there would be a single "interface" - so all facets implementations could be switched easily, allowing users to experiment with the different implementations without writing a lot of code.

          Yeah I think so too ... it's on the TODO list. Especially, if the
          FacetsConfig knows the facet method used by a given field, then we
          could (almost) produce the right impl at search time.

          Michael McCandless added a comment -

          Can we rename CategoryPath to FacetLabel or something more intuitive?

          +1 for FacetLabel; I put a nocommit.

          But, the new patch actually nearly eliminates the need to create
          CategoryPath (it's still needed to create a DrillDownQuery but I
          dropped a nocommit to see if we can fix that).

          new LongRange("less than 10", 0L, true, 10L, false) <-- can we make it so this is fewer arguments?

          Not sure exactly how.

          What if it worked like this:

          This is an awesome idea! I did that in the new patch; now indexing is
          really simple, e.g.:

              doc = new Document();
              doc.add(new FacetField("Author", "Frank"));
              doc.add(new FacetField("Publish Date", "1999", "5", "5"));
          

          and:

              doc = new Document();
              doc.add(new SortedSetDocValuesFacetField("a", "bar"));
          
          Shai Erera added a comment -

          About the patch:

          • I think FacetField should have an optional ctor taking the indexedFacetField, defaulting to $facets; then the ctor calls super() with the right field, and not "dummy"? And you can remove set/get?
          • SimpleFacetsCollector jdocs are wrong – there's no create?
          • Do we still need SSDVFacetFields?
          • I like FacetIW, but the nocommit to handle updateDoc, addDocs etc. makes me wonder whether down the road we'll be sorry about doing this change (i.e. if anything changes on IW APIs). The good thing about FacetFields is that it just adds fields to a Document, and doesn't worry about the IW API at all...
          • DimType == DimConfig. Sorry if I wasn't clear somewhere in my long response.

          That handler is such an exotic use case ... and the app can just
          recurse itself, calling TaxoFacetCounts.getTopChildren?

          Could be, maybe it will work. It definitely allows asking for different topK for each child (something we currently don't support).

          It does, and I think that's OK? Yes, it's one extra ord it indexes,
          but that will be a minor perf hit since it's multi-valued and
          hierarchical.

          I don't know. Like if your facet has 2 levels, that's 33% more ords. I think the count of the root ord is most likely never needed? And if it's needed, app can compute it by traversing its children and their values in the facet arrays? Maybe as a default we just never index it, and don't add a vague requiresDimCount/Value/Weight boolean?

          I think replicating bits of this code for the different methods is the lesser
          evil than adding so many abstractions that the module is hard to
          approach by new devs/users.

          Did we ever get such feedback from users? That the module is unapproachable?
          I get the opposite feedback - that we don't have many abstractions!

          At the end of the day, the code that does the facet counting, the
          rollup, pulling the topK, is in fact a small amount of code;

          You have a nocommit "maybe we should do this lazily" in regards to when to rollupValues. That shows me that now every developer who extends this API (let's be clear - users are oblivious to this happening) will face the same decision (nocommit). If we discover one day that it's better to rollup lazily or not, other developers won't benefit from that decision. That's why I think some abstractions are good.

          I added ValueSource aggregation in the next patch, but not
          associations; I think associations can come later (it's just another
          index time and search time impl).

          I'm not sure we should do that (cut over associations later). The whole point about these features (associations, complements, sampling..) is that they are existing features. If we think they are useless / unneeded - that's one thing. But if we believe they are important, it's useless to make all the API changes without taking them into account, only to figure out later that we need abstraction X and Y in order to implement them.

          And we make heavy use of associations, and some users asked for (and use) sampling, and I remember a question about complements. So obviously we cannot conclude that these are useless features. Therefore I think it's important that we try to tackle them now, so we don't do a full round trip to find ourselves with the same API again.

          Can we do FacetIndexWriter in a separate issue (if we want to do it at all)? It's unrelated to the search API changes you want to do here, and it might be easier to contain within a single issue?

          About CategoryListIterator ... what if we do manage to come up tomorrow with a better encoding strategy for facets. Do you really think that changing all existing WhateverFacets makes sense!? And if developers write their own WhateverFacets, it means they need to change their code too? Really, you're mixing optimizations (inlining dgap+vint) with ease of use. I know (!!) that there are apps that can benefit from a different encoding scheme (e.g. FourOnesIntEncoder). We don't need to wait until someone comes up w/ a better default encoding scheme to introduce abstractions. I mean .. that just sounds crazy to me.

          Michael McCandless added a comment -

          Thanks for the feedback Shai.

          I think FacetField should have an optional ctor taking the indexedFacetField, defaulting to $facets, then the ctor calls super() with the right field, and not "dummy"? And you can remove set/get?

          I moved the indexed field name to the DimConfig.

          SimpleFacetsCollector jdocs are wrong – there's no create?

          I removed it and put nocommit.

          Do we still need SSDVFacetFields?

          Woops, no; I removed it.

          I like FacetIW, but the nocommit, to handle updateDoc, addDocs etc. makes me think if down the road we won't be sorry about doing this change (i.e. if anything changes on IW APIs). The good thing about FacetFields is that it just adds fields to a Document, and doesn't worry about IW API at all...

          I think that's an acceptable risk in exchange for the simpler
          index-time API.

          DimType == DimConfig . Sorry if I wasn't clear somewhere in my long response.

          Ahh, OK. I'm still wondering if we should put the "facetMethod" onto
          the DimConfig...

          Like if your facet has 2 levels, that's 33% more ords. I think the count of the root ord is most likely never needed? And if it's needed, app can compute it by traversing its children and their values in the facet arrays? Maybe as a default we just never index it, and don't add a vague requiresDimCount/Value/Weight boolean?

          Wait, the app cannot compute this (accurately) by summing the child counts? It will overcount in general, right?
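To make the overcount concrete, here is a toy illustration (not the actual facet code): a single document with two values under a multi-valued dimension contributes 1 to the dimension's document count, but 2 to the sum of its children's counts.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified model (not Lucene API): why rolling up a multi-valued
// dimension by summing child counts overcounts the dimension's doc count.
public class RollupOvercount {
    // Count how many times each child path occurs across all docs.
    static Map<String, Integer> childCounts(List<List<String>> docs) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> doc : docs) {
            for (String path : doc) {
                counts.merge(path, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<List<String>> docs = List.of(
            List.of("Author/Lisa", "Author/Bob"),  // one doc, two Author values
            List.of("Author/Lisa"));

        // Summing the children counts value *occurrences*:
        int sumOfChildren =
            childCounts(docs).values().stream().mapToInt(i -> i).sum();
        // But the true dim count is the number of docs with *any* Author value:
        long docsWithAuthor = docs.stream().filter(d -> !d.isEmpty()).count();

        System.out.println(sumOfChildren);   // 3
        System.out.println(docsWithAuthor);  // 2
    }
}
```

So for multi-valued dimensions the dim count really does need its own indexed ordinal (or to be skipped entirely), since it cannot be reconstructed from the children.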

          I think replicating bits of this code for the different methods is the lesser evil than adding so many abstractions that the module is hard to approach by new devs/users.

          Did we ever get such feedback from users? That the module is unapproachable?

          It's mostly from my own assessment, looking at the code, and my own
          frustrations over time in trying to improve the facets module;
          LUCENE-5333 was finally the straw (for me)... I find complex APIs
          frustrating and I think it's a serious barrier to new contributors
          getting involved and users consuming them.

          There was a user on the ML (I don't have the link) who just wanted to
          get the facet count for a specific label after faceting was done, and
          the hoops s/he had to jump through (custom FacetResultHandler I
          think??) to achieve that was just crazy.

          I get the opposite feedback - that we don't have many abstractions!

          Seriously? What abstractions are we missing?

          If this is too much change for the facets module, we could, instead,
          leave the facets module as is, and break this effort out as a
          different module (facets2, simplefacets, something?). We also have
          many queryparsers, many highlighters, etc., and I think that's
          healthy: all options can be explored.

          You have a nocommit "maybe we should do this lazily" with regard to when to rollupValues. That shows me that every developer who extends this API (let's be clear - users are oblivious to this happening) will face the same decision (nocommit). If we discover one day that it's better to roll up lazily or not, other developers don't benefit from that decision. That's why I think some abstractions are good.

          It's a crazy expert thing to create another faceting impl, so I think
          such developers can handle changing their rollup to be lazy if it's
          beneficial/necessary for their use case.

          I'm not sure we should do that (cut over associations later). The whole point about these features (associations, complements, sampling..) is that they are existing features. If we think they are useless / unneeded - that's one thing. But if we believe they are important, it's useless to make all the API changes without taking them into account, only to figure out later that we need abstraction X and Y in order to implement them.

          Well this could be a good argument for just making a new module?

          The new module would have a simpler API and less thorough
          functionality?

          Can we do FacetIndexWriter in a separate issue (if we want to do it at all)? It's unrelated to the search API changes you want to do here, and it might be easier to contain within a single issue?

          I'm not sure it's so easily separated out; the DimConfig is common to
          index time and search time, and we're still iterating on that (I just
          moved the indexedFieldName into it).

          About CategoryListIterator ... what if we do manage to come up tomorrow with a better encoding strategy for facets. Do you really think that changing all existing WhateverFacets makes sense!? And if developers write their own WhateverFacets, it means they need to change their code too? Really, you're mixing optimizations (inlining dgap+vint) with ease of use. I know (!!) that there are apps that can benefit from a different encoding scheme (e.g. FourOnesIntEncoder). We don't need to wait until someone comes up w/ a better default encoding scheme to introduce abstractions. I mean .. that just sounds crazy to me.

          We'd have to change 3 places if we did that right now: FacetIndexWriter
          (where we encode) and TaxonomyFacetCounts/SumValueSource (where we
          decode).

          Maybe... we can have a decode method in TFC, and TFSVC subclasses TFC,
          or ... something ... but I'd like to be very careful in just adding
          back abstractions here: this really is a small amount of code at the
          end of the day.

          ASF subversion and git services added a comment -

          Commit 1542023 from Michael McCandless in branch 'dev/branches/lucene5339'
          [ https://svn.apache.org/r1542023 ]

          LUCENE-5339: make branch

          ASF subversion and git services added a comment -

          Commit 1542025 from Michael McCandless in branch 'dev/branches/lucene5339'
          [ https://svn.apache.org/r1542025 ]

          LUCENE-5339: current patch

          Michael McCandless added a comment -

          OK I created a branch at https://svn.apache.org/repos/asf/lucene/dev/branches/lucene5339 and committed my current patch (same as last patch I think, except I moved the indexedFieldName from FacetField to DimConfig).

          Robert Muir added a comment -

          Really, you're mixing optimizations (inlining dgap+vint) with ease of use. I know (!!) that there are apps that can benefit from a different encoding scheme (e.g. FourOnesIntEncoder). We don't need to wait until someone comes up w/ a better default encoding scheme to introduce abstractions. I mean .. that just sounds crazy to me.

          How common is this, I'm curious?

          Just to lend my opinion/support to this issue: imo the sheer number of classes in the faceting module can be overwhelming.

          Let's take the encode/decode case: it seems to me you guys iterated a lot and figured out vint was "the best default encoding". I'm not going to argue that some use case could benefit from a custom encoding scheme: instead I'm going to argue whether it justifies a whole java package with 20 public classes?

          So I think its fine to bake in the encoding, but with the two key methods in those 20 classes 'protected' in the appropriate places so that an expert user could subclass them:

          protected void decode(BytesRef buf, IntsRef values);
          protected void encode(IntsRef values, BytesRef buf);
          

          I'd make the argument: if someone is expert enough to do this, they don't need pre-provided concrete encoder/decoder classes anyway, they can write their own method?
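As a rough sketch of what that could look like (all names here are hypothetical, and plain byte[]/int[] stand in for Lucene's BytesRef/IntsRef to keep the example self-contained): the default dgap+vint codec is baked into the facet class as a pair of protected methods, and an expert subclass just overrides them.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: dgap+vint baked in as the default encoding, exposed
// only as protected hooks for expert subclasses. Not actual Lucene classes.
class FacetOrdinalCodec {

    // Default encode: delta-gap between sorted ordinals, then vint
    // (7 payload bits per byte, high bit = continuation).
    protected byte[] encode(int[] sortedOrds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int ord : sortedOrds) {
            int delta = ord - prev;
            prev = ord;
            while ((delta & ~0x7F) != 0) {
                out.write((delta & 0x7F) | 0x80);
                delta >>>= 7;
            }
            out.write(delta);
        }
        return out.toByteArray();
    }

    // Default decode: the inverse of the above.
    protected int[] decode(byte[] buf) {
        List<Integer> ords = new ArrayList<>();
        int prev = 0, i = 0;
        while (i < buf.length) {
            int delta = 0, shift = 0, b;
            do {
                b = buf[i++] & 0xFF;
                delta |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            prev += delta;
            ords.add(prev);
        }
        return ords.stream().mapToInt(Integer::intValue).toArray();
    }
}
```

An app that wants, say, a fixed-width encoding would subclass once and override both hooks, with no separate encoder/decoder class hierarchy involved.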

          ASF subversion and git services added a comment -

          Commit 1542062 from Michael McCandless in branch 'dev/branches/lucene5339'
          [ https://svn.apache.org/r1542062 ]

          LUCENE-5339: add abstract Facets base class; fix separate test failure

          Shai Erera added a comment -

          I think that's an acceptable risk in exchange for the simpler index-time API.

          What if someone already extends IW? Or what if we like that approach so much that we want to apply it elsewhere too (I mean in other modules)? I'm just saying that extending IW is a pretty big thing, compared to FacetFields.

          I was thinking maybe we can wrap an IW, and present an interface with only add/updateDoc so on one hand you get to add new FacetField() easily, but on the other you get to use whatever IW that you want. It's like RandomIW, only we may not want to copy the entire set of APIs, only the document-related ones? And in the future we can think about TaxoFacetIW which also owns the commit()?

          Another crazy idea is to implement a FacetDocument which extends Document (we'd need to allow that first!) and does the same trick on addField() / indexableFields(). I think if we can pull this off (extending Document is preferred IMO), it's far less intrusive than a new FacetIW.
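The wrapping idea could be sketched roughly like this (every name and type below is made up for illustration; a real version would delegate to an actual IndexWriter and translate FacetField instances into indexable fields):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: wrap the writer rather than extend it, exposing just
// the document-related methods, so any IndexWriter (or subclass of it)
// can sit behind the facade.
class FacetWriter {

    interface DocSink {                        // stand-in for IW's addDocument
        void addDocument(List<String> fields);
    }

    private final DocSink delegate;

    FacetWriter(DocSink delegate) {
        this.delegate = delegate;
    }

    // Translate facet paths into ordinary indexed fields, then delegate;
    // the caller never sees the facet-processing step.
    void addDocument(List<String> fields, List<String> facetPaths) {
        List<String> all = new ArrayList<>(fields);
        for (String path : facetPaths) {
            all.add("$facets:" + path);        // default indexed facet field
        }
        delegate.addDocument(all);
    }
}
```

The design point is composition over inheritance: the facade only needs to track the handful of document methods, not the whole IW surface area.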

          Wait, the app cannot compute this (accurately) by summing the child counts? It will overcount in general, right?

          Oops, you're right. So in that case, we can open up the extension point (I think it's FacetIW.dedupAndEncode) to let the app add the dim ordinal too? Then I think the rest of the search code will just work?

          There was a user on the ML (I don't have the link) who just wanted to
          get the facet count for a specific label after faceting was done, and
          the hoops s/he had to jump through (custom FacetResultHandler I
          think??) to achieve that was just crazy.

          Maybe it's crazy I don't know. All I'm saying is that now you took that user's request and made it a top-level API. I don't know if that qualifies as simplifying the APIs, or whether similar requests in the future will be easy to implement by extension, or we'd need to add them to our core code. For instance, this API makes Range and Count not share the same API (not saying it's a bad thing, just an example). But I already +1 the idea of having that on the API.

          Seriously? What abstractions are we missing?

          Well, FacetArrays commits to an int[], while we have someone who wants to use a Map, because he has millions of facets, yet queries typically hit very few of them. And he uses both the int[] and float[] arrays, which for the most part are allocated for no good reason. In order to use a Map he needs to write a FacetResultsHandler which computes top-K from the map, as well as extend FacetsAccumulator to override createFRH. Just an example. If we had a FacetValues abstraction w/ get/set/iterator methods, all he'd need to do is override the FacetValues that should be used.

          But I already told him that this is something he can just do w/ FacetResultsHandler. And just to point out that this patch won't make things worse - since he already overrides the aggregation method, he can just implement a MapBasedTaxoFacetsSomething. So I only gave it as an example (FacetValues abstraction, I think, will hurt the performance in general, but we could benchmark).
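The FacetValues idea could be sketched like this (interface and class names are hypothetical; the real module's FacetArrays is hard-wired to arrays): aggregation code reads and writes only through the interface, so a dense array or a sparse map can be swapped without touching any aggregator.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a FacetValues abstraction with get/set semantics.
interface FacetValues {
    void increment(int ord, int delta);
    int get(int ord);
}

// Dense impl: roughly what FacetArrays does today - fastest access,
// but allocates a slot for every ordinal in the taxonomy.
class ArrayFacetValues implements FacetValues {
    private final int[] counts;
    ArrayFacetValues(int numOrds) { counts = new int[numOrds]; }
    public void increment(int ord, int delta) { counts[ord] += delta; }
    public int get(int ord) { return counts[ord]; }
}

// Sparse impl: for taxonomies with millions of ordinals where a typical
// query touches very few of them.
class MapFacetValues implements FacetValues {
    private final Map<Integer, Integer> counts = new HashMap<>();
    public void increment(int ord, int delta) { counts.merge(ord, delta, Integer::sum); }
    public int get(int ord) { return counts.getOrDefault(ord, 0); }
}
```

As noted above, the per-ordinal virtual call is exactly the cost that would need benchmarking before committing to such an abstraction.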

          Another abstraction is LUCENE-5316, which we struggle to get to perform (really frustrating!).

          If this is too much change for the facets module, we could, instead,
          leave the facets module as is, and break this effort out as a
          different module (facets2, simplefacets, something?). We also have
          many queryparsers, many highlighters, etc., and I think that's
          healthy: all options can be explored.

          -1 to that. I think it delivers the wrong message. Why have two faceting modules? A branch is the perfect choice here, since it allows us to move the entire module to the new API. And on the way we benefit by assessing that the new API can really allow implementing associations, complements, sampling.

          To give an example - the API at first was overwhelming, and in the last round of changes I removed and moved things (I thought I was simplifying it!). So for instance, once upon a time FacetRequest let you specify the FacetsAggregator and FacetResultsHandler. I moved them both to FacetsAccumulator, but recently moved FacetsAggregator back to FacetRequest (it makes it easier to define which aggregator a certain FR needs to use). And on LUCENE-5333 I suggested to move FacetResultsHandler back to FR, to implement an AllDimFR.

          So wearing the "experienced" hat, I just want to make sure we won't move everything out only to gradually swap it back in. I'm not against changing the APIs, but because this module is rich with features (which I, and apparently our users too, think are important), I'd like to make sure the new API works for all of them. I may compromise on complements and sampling because: (1) complements is not per-segment and I think it might require a full rewrite anyway (there's an issue for it) and (2) sampling because it's under the *.old package, meaning it was never fully migrated to the new APIs and I think it could use some serious simplification there too.

          If you're worried that you'll do this work alone, I promise to help. On the branch.

          It's a crazy expert thing to create another faceting impl, so I think
          such developers can handle changing their rollup to be lazy if it's
          beneficial/necessary for their use case.

          Let's use MapReduce as an analogy. The framework says "you worry about the Map and Reduce phase, I'll take care of the distribution stuff". Nice and clean. In the beginning the facets API said "you worry about aggregating a single ordinal, I'll worry about the rest". Then we changed it to "you worry about aggregating all ordinals in a segment, I'll worry about the rest".

          Now, some users we have - well, think of them as data scientists. All they're interested in is ranking facets to drive some insights out of the data. I feel that we're putting more and more burden on them. Cutting the API from the single-ord level to the segment level was to gain sizable improvements to faceted search. But letting them do the rollup ... anyway, let's proceed with that, we can see as we go.

          I personally like code reuse, and it kills me to see code that's duplicated. Often this calls out for bad design of APIs, not necessarily simplification.

          I'm not sure it's so easily separated out; the DimConfig is common to
          index time and search time, and we're still iterating on that (I just
          moved the indexedFieldName into it).

          I was thinking to develop it still with FacetIndexingParams in mind - just cutover from FacetFields to FacetIW/FacetDocument. This makes the change gradual and allows us to investigate it separately. It's also, I think, a much lower hanging fruit. But, it looks like the changes in this patch will require rewriting large parts of it, so maybe we should do it here.

          We'd have to change 3 places if we did that right now: FacetIndexWriter
          (where we encode) and TaxonomyFacetCounts/SumValueSource (where we
          decode).

          Wrong? I mean you still didn't write a version which pulls the ords from OrdinalsCache (and I assume you don't want to get rid of it!). With those two TFCs, it already means we'll have 4 classes? While I think the SumValueSource could be oblivious to how the ords are encoded/fetched out-of-the-box, since it's less common to use (we only optimize counting) and someone can rewrite it to pull the ords from OrdinalsCache directly (instead of going through the CLI abstraction).

          So I think its fine to bake in the encoding, but with the two key methods in those 20 classes 'protected' in the appropriate places so that an expert user could subclass them:

          I think you point out two problems - one is the fact that we have the *.encoding package at all, and the other is how to override encoding/decoding. The nice thing about IntEncoder/Decoder (let's think of it as PackedInts, ok?) is that you can use an implementation without affecting other classes, while with encode/decode protected methods you need to extend every class you have that reads the data. I don't think that qualifies as good software design. Why do we have PackedInts then?

          Now the question is whether we should keep the different IntEncoders or not - I don't know. Why do we have so many PackedInts versions? Do we use all of them, or are they there so that someone can make a pick? If they don't hurt anyone, and require almost no maintenance, why remove them?

          The argument is whether we should open up encoding/decoding at all. What Mike is saying "if you want to encode differently, you're an expert and you can rewrite all of your and Lucene's aggregators (count, sum VS) too". What you're saying, add encode/decode protected methods, is more how the API looks today, only instead of a method we have an IntEncoder/Decoder (a'la PackedInts) class. That's just easier to use in conjunction with other classes (especially if we do the CLI abstraction).

          Just to clarify, I don't mean we have to have the CLI abstraction and use it everywhere. I think it's a good API for someone to use if he doesn't care about how ordinals are fetched (BDV, CachedOrds, whatever) as well as how they are encoded/decoded. It could very well be a CLITaxoFacetCounts impl which lets you pass the CLI you want to use (BDVCLI, CachedOrdsCLI, MySuperExpertDecodingCLI). That's it. Most users won't know about it, but the few that do, won't need to rewrite so many classes.
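A rough sketch of that shape (names are hypothetical; the ordinal source is reduced to a single-method interface): the counting code depends only on the iterator, so a BDV-backed, cached-ords, or expert-decoding impl plugs in without rewriting the aggregator.

```java
// Hypothetical sketch of a CategoryListIterator-style abstraction: the
// aggregator is oblivious to where per-doc ordinals come from.
interface CategoryListIterator {
    // Fills 'ords' with the ordinals of the given doc; returns how many.
    int getOrdinals(int docID, int[] ords);
}

class CountingAggregator {
    private final int[] counts;

    CountingAggregator(int numOrds) {
        counts = new int[numOrds];
    }

    // Works with any CLI impl - BDVCLI, CachedOrdsCLI, or a custom decoder.
    void aggregate(CategoryListIterator cli, int[] matchingDocs) {
        int[] buf = new int[64];   // sketch: assumes <= 64 ords per doc
        for (int doc : matchingDocs) {
            int n = cli.getOrdinals(doc, buf);
            for (int i = 0; i < n; i++) {
                counts[buf[i]]++;
            }
        }
    }

    int count(int ord) {
        return counts[ord];
    }
}
```

Most users would never see the interface; only the few who need a custom ordinal source or encoding would implement it, and they'd reuse the stock aggregators unchanged.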

Cutting the API from the single-ord level to the segment-level was to gain sizable improvements to faceted search. But letting them do the rollup ... anyway, let's proceed with that, we can see as we go. I personally like code reuse, and it kills me to see code that's duplicated. Often this calls out for bad design of APIs, not necessarily simplification. I'm not sure it's so easily separated out; the DimConfig is common to index time and search time, and we're still iterating on that (I just moved the indexedFieldName into it). I was thinking to develop it still with FacetIndexingParams in mind - just cutover from FacetFields to FacetIW/FacetDocument. This makes the change gradual and allows us to investigate it separately. It's also, I think, a much lower hanging fruit. But, it looks like the changes in this patch will require rewriting large parts of it, so maybe we should do it here. We'd have to change 3 places if we did that right now: FacetIndexWriter (where we encode) and TaxonomyFacetCounts/SumValueSource (where we decode). Wrong? I mean you still didn't write a version which pulls the ords from OrdinalsCache (and I assume you don't want to get rid of it!). With those two TFCs, it already means we'll have 4 classes? While I think the SumValueSource could be oblivious to how the ords are encoded/fetched out-of-the-box, since it's less common to use (we only optimize counting) and someone can rewrite it to pull the ords from OrdinalsCache directly (instead of going through the CLI abstraction). So I think its fine to bake in the encoding, but with the two key methods in those 20 classes 'protected' in the appropriate places so that an expert user could subclass them: I think you point out two problems - one is the fact that we have the *.encoding package at all and the other one is how to override encoding/decoding. The nice thing about IntEncoder/Decoder (let's think of it as PackedInts ok?) 
is that you can use an implementation without affecting other classes, while with encode/decode protected methods you need to extend every class that you have which reads the data. I don't think that qualifies as good software design. Why do we have PackedInts then? Now the question is whether we should keep the different IntEncoders or not - I don't know. Why do we have so many PackedInts versions? Do we use all of them or are they there so that someone can make a pick? If they don't hurt anyone, and requires almost no maintenance, why remove them? The argument is whether we should open up encoding/decoding at all. What Mike is saying "if you want to encode differently, you're an expert and you can rewrite all of your and Lucene's aggregators (count, sum VS) too". What you're saying, add encode/decode protected methods, is more how the API looks today, only instead of a method we have an IntEncoder/Decoder (a'la PackedInts) class. That's just easier to use in conjunction with other classes (especially if we do the CLI abstraction). Just to clarify, I don't mean we have to have the CLI abstraction and use it everywhere. I think it's a good API for someone to use if he doesn't care about how ordinals are fetched (BDV, CachedOrds, whatever) as well as how they are encoded/decoded. It could very well be a CLITaxoFacetCounts impl which lets you pass the CLI you want to use (BDVCLI, CachedOrdsCLI, MySuperExpertDecodingCLI). That's it. Most users won't know about it, but the few that do, won't need to rewrite so many classes.
          Hide
          Robert Muir added a comment -

          Now the question is whether we should keep the different IntEncoders or not - I don't know. Why do we have so many PackedInts versions? Do we use all of them or are they there so that someone can make a pick? If they don't hurt anyone, and requires almost no maintenance, why remove them?

          Your analogy is 100% wrong Shai. PackedInts uses the different implementations behind the scenes automatically.

          Here we just have a lot of public classes overwhelming the user for absolutely no good reason.

          I will reopen this JIRA issue if it is marked 'resolved' without these classes being removed!!!!!!

          Shai Erera added a comment -

          I don't think it's 100% wrong - look at all the classes under o.a.l.util.packed – I'm talking about the public ones: PackedInts, Mutable, GrowableWriter, PagedGrowableWriter, AppendingPackedLongBuffer, AppendingDeltaPackedLongBuffer, MonotonicAppendingLongBuffer - these classes exist for a reason – you need to be able to pick the best one for your needs. They're great.

          I don't care much if you want to remove o.a.l.facet.encoding package - I think it's fair to say that if you want to have a different encoding scheme, you should extend FacetIW/FacetDocument/FacetFields (whatever we come up with) .encode() and then either write a matching CategoryListIterator, or TaxoFacetCounts.
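          For readers following along, the default category-list encoding under discussion is delta-gap + VInt. A minimal, self-contained sketch of that scheme (illustrative only, not the actual Lucene encoder classes; assumes the ordinals are deduped):

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

public class DGapVIntDemo {
  // Sort the ordinals, then write each gap from the previous ordinal
  // as a VInt (7 data bits per byte, high bit = "more bytes follow").
  static byte[] encode(int[] ords) {
    int[] sorted = ords.clone();
    Arrays.sort(sorted);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    int prev = 0;
    for (int ord : sorted) {
      int gap = ord - prev;
      prev = ord;
      while ((gap & ~0x7F) != 0) {
        out.write((gap & 0x7F) | 0x80);
        gap >>>= 7;
      }
      out.write(gap);
    }
    return out.toByteArray();
  }

  // Read `count` VInt gaps back and re-accumulate the ordinals.
  static int[] decode(byte[] bytes, int count) {
    int[] ords = new int[count];
    int pos = 0, prev = 0;
    for (int i = 0; i < count; i++) {
      int b = bytes[pos++] & 0xFF, gap = b & 0x7F, shift = 7;
      while ((b & 0x80) != 0) {
        b = bytes[pos++] & 0xFF;
        gap |= (b & 0x7F) << shift;
        shift += 7;
      }
      prev += gap;
      ords[i] = prev;
    }
    return ords;
  }

  public static void main(String[] args) {
    byte[] enc = encode(new int[] {5, 1000, 7, 200000});
    System.out.println(Arrays.toString(decode(enc, 4)));
  }
}
```

          Baking this into a counting class is fast but, as discussed above, anyone changing the scheme must also replace every class that reads the data.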

          And I wish you used fewer '!'s.

          Shai Erera added a comment -

          About IntEncoder, one thing it lets us do today is detect the encoding used for this category list, and e.g. throw a meaningful exception in FastCountingAggregator if a FIP is passed which uses a different decoder. I think it's good if we can somehow detect it in TaxoFacetCounts? If not, maybe we just document that this class assumes the default encoding is used, and that if you changed that, you shouldn't use this class? If you think documentation isn't enough, please drop a nocommit so we remember to handle it?

          I've been thinking to get rid of IntEncoder in a separate issue (do this change in baby steps), so whatever we decide here, I'll apply there.

          ASF subversion and git services added a comment -

          Commit 1542712 from Michael McCandless in branch 'dev/branches/lucene5339'
          [ https://svn.apache.org/r1542712 ]

          LUCENE-5339: add nocommits

          Michael McCandless added a comment -

          If not, maybe we just document that this class assumes default encoding is used and that if you changed that, you shouldn't use this class?

          I just committed nocommits that we should jdoc this ...

          ASF subversion and git services added a comment -

          Commit 1542713 from Michael McCandless in branch 'dev/branches/lucene5339'
          [ https://svn.apache.org/r1542713 ]

          LUCENE-5339: cutover DrillSideways

          Michael McCandless added a comment -

          I was thinking maybe we can wrap an IW,

          I think that's a good idea? I added to TODO.

          Another crazy idea is to implement a FacetDocument which extends Document

          I'm not sure how this can work, since in order to write the ords we
          need to see all FacetFields? Ie, at what point would we compile all
          the FacetFields into the BDV field?

          we can open up the extension point (I think it's FacetIW.dedupAndEncode) to let the app add the DIM ordinal too?

          Hmm, we could open that up, but ... I think that's "too late"? You
          can't easily know which dim ords to add back at that point. I added a
          TODO.

          Seriously? What abstractions are we missing?

          Well, FacetArrays committing to an int[], while we have someone which wants to use a Map, because he has millions of facets, yet queries typically hit very few of them.

          With the simplified APIs this user could just make a custom facet
          method?
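          To make the trade-off concrete, here is a self-contained sketch (hypothetical class name, no Lucene types) of the map-based sparse counting such a user might write: a HashMap in place of the dense int[] FacetArrays, with top-N extracted via a small min-heap:

```java
import java.util.*;

public class SparseFacetCounts {
  // Sparse: only ordinals actually hit by the query consume memory,
  // unlike a dense int[] sized to millions of facets.
  private final Map<Integer, Integer> counts = new HashMap<>();

  void increment(int ord) {
    counts.merge(ord, 1, Integer::sum);
  }

  // Top-n ordinals by count: keep a size-n min-heap while scanning the map.
  List<Map.Entry<Integer, Integer>> topN(int n) {
    PriorityQueue<Map.Entry<Integer, Integer>> pq =
        new PriorityQueue<>(Map.Entry.comparingByValue());
    for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
      pq.offer(e);
      if (pq.size() > n) pq.poll();  // evict the current smallest
    }
    List<Map.Entry<Integer, Integer>> top = new ArrayList<>(pq);
    top.sort(Map.Entry.<Integer, Integer>comparingByValue().reversed());
    return top;
  }

  public static void main(String[] args) {
    SparseFacetCounts c = new SparseFacetCounts();
    for (int ord : new int[] {7, 7, 7, 42, 42, 9}) c.increment(ord);
    System.out.println(c.topN(2)); // prints [7=3, 42=2]
  }
}
```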

          Another abstraction is LUCENE-5316, which we struggle to get it to perform (really frustrating!).

          I agree we need a better abstraction here... the 3 ints we require per
          unique facet label are costly. But I don't think we need to force
          SSDVFacets to use this abstraction?

          Why have two faceting modules? A branch is the perfect choice here, since it allows us to move the entire module to the new API. And on the way we benefit by assessing that the new API can really allow implementing associations, complements, sampling.

          I would also prefer to have a single facet module after all this, but
          if the requirements (rich functionality vs. simple API) are too
          divergent, then two modules is at least an option. Progress not
          perfection...

          I may compromise on complements and sampling because: (1) complements is not per-segment and I think it might require a full rewrite anywhere (there's an issue for it) and (2) sampling because it's under the *.old package, meaning it was never fully migrated to the new APIs and I think it could use some serious simplification there too.

          I agree it's bad that complements is "top level"; everything else in
          the facet module is NRT friendly and I think we should stick with
          that.

          I'll work on associations...

          I personally like code reuse, and it kills me to see code that's duplicated. Often this calls out for bad design of APIs, not necessarily simplification.

          I think this is a precarious balance. If a little code dup can
          greatly simplify the APIs, then that's the better tradeoff.

          Wrong? I mean you still didn't write a version which pulls the ords from OrdinalsCache (and I assume you don't want to get rid of it!).

          You're right, the ords cache filling will be another place that "bakes
          in" the decoding. So, I agree: if we can find a clean way to abstract
          the encoding/source then let's pursue that.

          ASF subversion and git services added a comment -

          Commit 1542717 from Michael McCandless in branch 'dev/branches/lucene5339'
          [ https://svn.apache.org/r1542717 ]

          LUCENE-5339: update TODOs

          Shai Erera added a comment -

          I'm not sure how this can work, since in order to write the ords we
          need to see all FacetFields? Ie, at what point would we compile all
          the FacetFields into the BDV field?

          I was thinking when Doc.indexableFields() is called?

          Hmm, we could open that up, but ... I think that's "too late"? You
          can't easily know which dim ords to add back at that point. I added a
          TODO.

          Maybe that's the wrong extension point, but what I had in mind is something similar to what FacetFields does today – it adds the categories to the TaxoIndex and receives their ordinal. Then it calls a CategoryListBuilder which asks for the parent of an ordinal until it hits ROOT (depending on OrdPolicy of course). I mentioned dedupAndEncode because I thought it does something like that (i.e. that you've inlined CategoryListBuilder in FacetIW). If it's not, then whatever method that does that ... and if there is none, let's wrap it in an overridable method?

          With the simplified APIs this user could just make a custom facet method?

          He already does that with the current APIs too (a special FacetsAccumulator). I don't know if it's easier/harder/the same to do w/ the new API. I guess it won't be harder.

          You're right, the ords cache filling will be another place that "bakes
          in" the decoding. So, I agree: if we can find a clean way to abstract
          the encoding/source then let's pursue that.

          As I said, let's divide that into two problems: API and optimization. For the API, we can stick w/ CategoryListIterator and implement both a DGapVIntBinaryDVIterator and an OrdinalsCacheIterator. That way, FacetsSomething (do we have a name yet? Is it just Facets?) can use a CLI if it doesn't care where the ordinals come from.

          For optimization, we do a FastFacetCounts which inlines dgap+vint and reads from BDV, and we can also do a CachedOrdsFacetCounts which inlines the interaction with OrdinalsCache. Actually, if we provide these two, we can skip the third FacetCounts (uses CLI), as it will be for demo purposes only given current encoding. If anyone changes the encoding, he can write a FacetCounts. Also, we can always add it later ...

          The rest of the Facets (i.e. non-counts) should IMO at this point use the CLI abstraction. If anyone wants to optimize a SumValueSourceFacets, he can do so however he wants. But the CLI is the abstraction I'm thinking – it only has two methods: setNextReader and getOrdinals(int doc).
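          A self-contained sketch of that two-method abstraction (the names are stand-ins; setNextSegment substitutes for a setNextReader taking a Lucene reader context, and the "segment" is a plain array, so the example runs without Lucene):

```java
import java.util.Arrays;

// Hypothetical shape of the CLI discussed above: advance to the next
// segment, then fetch the ordinals for each doc in that segment.
interface CategoryListIterator {
  void setNextSegment(int[][] ordsPerDoc);  // stand-in for setNextReader(...)
  int[] getOrdinals(int doc);               // ords for one doc; empty if none
}

class ArrayBackedCLI implements CategoryListIterator {
  private int[][] current;
  public void setNextSegment(int[][] ordsPerDoc) { current = ordsPerDoc; }
  public int[] getOrdinals(int doc) {
    int[] ords = current[doc];
    return ords != null ? ords : new int[0];
  }
}

public class CLIDemo {
  public static void main(String[] args) {
    CategoryListIterator cli = new ArrayBackedCLI();
    cli.setNextSegment(new int[][] { {3, 7}, null, {7} });
    int[] counts = new int[10];
    for (int doc = 0; doc < 3; doc++)
      for (int ord : cli.getOrdinals(doc)) counts[ord]++;
    System.out.println(Arrays.toString(counts)); // counts[7] == 2, counts[3] == 1
  }
}
```

          A counting class written against this interface is oblivious to whether the ordinals come from a BDV field, the OrdinalsCache, or a custom decoder, which is exactly the flexibility argued for above.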

          I think this is a precarious balance. If a little code dup can
          greatly simplify the APIs, then that's the better tradeoff.

          In general I agree. It then becomes a question of what counts as a little vs. a lot of code dup. I think that dgap+vint + rollup is not little (put together), nor is making the decision to rollup. But at this point I don't mind ... let's force the code dup, and then simplify if users are angry.
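          For reference, the rollup being discussed reduces to a small loop, assuming (as the taxonomy index guarantees, since parents are added before their children) that every parent ordinal is smaller than its children's. A sketch, not the module's actual code:

```java
import java.util.Arrays;

public class RollupDemo {
  // Fold each ordinal's total into its parent, iterating in reverse so a
  // child's full subtree count is already in place before it is added up.
  // parents[ord] holds the parent ordinal; ord 0 is the root.
  static void rollup(int[] counts, int[] parents) {
    for (int ord = counts.length - 1; ord > 0; ord--) {
      counts[parents[ord]] += counts[ord];
    }
  }

  public static void main(String[] args) {
    // Tiny taxonomy: 0=root, 1=Date (child of root), 2=2010, 3=2011 (children of Date)
    int[] parents = {0, 0, 1, 1};
    int[] counts  = {0, 0, 3, 5};
    rollup(counts, parents);
    System.out.println(Arrays.toString(counts)); // Date ends up with 3+5=8
  }
}
```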

          I agree we need better abstraction here... the 3 int we require per
          unique facet label is costly. But, I don't think we need to force
          SSDVFacets to use this abstraction?

          Well, at first what I had in mind is a Taxonomy interface (read-only) with two implementations: TaxoIndex and SortedSet. Especially since we're talking about supporting full hierarchies w/ SortedSet. I think it will be cool if a TaxoFacetCounts just takes a Taxonomy and we don't need to duplicate the code. But then I realized it's not just how the taxonomy is managed, but also where the ords are pulled from (BDV vs SSDV).

          We basically could have a SortedSetTaxonomy and SortedSetCLI to support any 'general' counting, but then as you pointed out somewhere, a SortedSetCLI may not be able to optimize by counting in seg-ord space and re-map afterwards. So at this point I'm not sure we should pursue the implementation of a SortedSet Taxonomy and CLI. Would be nice to know the APIs allow that (even if it means you lose some performance, e.g. always count in global ord-space), should anyone want to do that. Let's drop it for now.

          ASF subversion and git services added a comment -

          Commit 1542773 from Michael McCandless in branch 'dev/branches/lucene5339'
          [ https://svn.apache.org/r1542773 ]

          LUCENE-5339: add OrdinalsReader + Cache to abstract the source of the ords

          ASF subversion and git services added a comment -

          Commit 1542804 from Michael McCandless in branch 'dev/branches/lucene5339'
          [ https://svn.apache.org/r1542804 ]

          LUCENE-5339: simplify DrillDownQuery

          ASF subversion and git services added a comment -

          Commit 1542843 from Michael McCandless in branch 'dev/branches/lucene5339'
          [ https://svn.apache.org/r1542843 ]

          LUCENE-5339: remove delim char

          ASF subversion and git services added a comment -

          Commit 1543047 from Michael McCandless in branch 'dev/branches/lucene5339'
          [ https://svn.apache.org/r1543047 ]

          LUCENE-5339: assocations

          Shai Erera added a comment -

          Thanks Mike! Currently, int and float associations are written in different BDVs, one for int and one for float. If you look in CategoryAssociation, it has an ID() which is added to the facet field. That means fields like $facet$int, $facet$float, $facet$CA.getID(). This is also related to the nocommit you put about the association facet field – I think it should be indexFieldName + CategoryAssociation.ID.

          I see though that AssociationFacetField can handle both integers and floats. While the treatment is the same (writing a float is just like writing an int, after Float.floatToIntBits), I think it's better if we either separate them into two fields, or keep the CategoryAssociation class to distinguish between them. Then FacetIW can iterate over the AssociationFacetFields and group them by CategoryAssociation type to index them under the respective BDV.

          We cannot allow mixing different association types in the same BDV, as their decoding isn't the same (even float vs int – you have to know whether to do Float.intBitsToFloat or not, after reading the int).
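          A tiny illustration of the point: the same four bytes are both a valid int and a valid float bit pattern, so the decoder must know which association type was written:

```java
public class AssociationBitsDemo {
  public static void main(String[] args) {
    float value = 3.14f;
    int bits = Float.floatToIntBits(value);   // encode: store the float as a plain int
    float back = Float.intBitsToFloat(bits);  // decode: only correct if we KNOW it was a float
    System.out.println(bits + " -> " + back);
    // Read naively as an int association, `bits` is a huge nonsense value,
    // which is why int and float associations must live in separate fields
    // (or carry a type marker such as CategoryAssociation's ID).
  }
}
```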

          ASF subversion and git services added a comment -

          Commit 1543161 from Michael McCandless in branch 'dev/branches/lucene5339'
          [ https://svn.apache.org/r1543161 ]

          LUCENE-5339: renames

          ASF subversion and git services added a comment -

          Commit 1543202 from Michael McCandless in branch 'dev/branches/lucene5339'
          [ https://svn.apache.org/r1543202 ]

          LUCENE-5339: add base class for taxo facet impls; catch wrong index field name for a given dim

          ASF subversion and git services added a comment -

          Commit 1543213 from Michael McCandless in branch 'dev/branches/lucene5339'
          [ https://svn.apache.org/r1543213 ]

          LUCENE-5339: renames

          Michael McCandless added a comment -

          I'm not sure how this can work, since in order to write the ords we need to see all FacetFields? Ie, at what point would we compile all the FacetFields into the BDV field?

          I was thinking when Doc.indexableFields() is called?

          Hmm, that's iffy. First off, IW also calls .storableFields(), and
          it's not defined which will be called first, and we need to add both
          storable and indexable fields.

          Second off, Document has a longer lifecycle, e.g. one can reuse it,
          reuse field instance in it, etc., and I don't think we should alter it
          in-place (remove FacetFields, add new fields).

          Maybe Document should have a "rewrite" method, that IW calls to the
          "actual" document to index? The default would just "return this".
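The proposed rewrite() extension point might look roughly like the following. This is purely a hypothetical sketch of the idea under discussion (simplified stand-in classes, not actual Lucene APIs, and the idea was never necessarily adopted in this form): IndexWriter would call rewrite() and index the result, with the default simply returning the document unchanged.

```java
// Hypothetical sketch of the proposed Document.rewrite() extension point;
// Doc and FacetDoc are simplified stand-ins, not Lucene classes.
import java.util.ArrayList;
import java.util.List;

class Doc {
  final List<String> fields = new ArrayList<>();

  // default: IndexWriter would index this document as-is
  Doc rewrite() {
    return this;
  }
}

class FacetDoc extends Doc {
  @Override
  Doc rewrite() {
    // build the "actual" document: resolve facet fields into the
    // concrete indexed fields, leaving the original Doc untouched
    Doc actual = new Doc();
    actual.fields.addAll(fields);
    actual.fields.add("$facet"); // stand-in for the encoded ordinals field
    return actual;
  }
}

public class RewriteSketch {
  public static void main(String[] args) {
    Doc d = new FacetDoc();
    d.fields.add("title");
    System.out.println(d.rewrite().fields); // [title, $facet]
  }
}
```

Because rewrite() returns a new document, the original (possibly reused) Document instance is never altered in place, which addresses the lifecycle concern above.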

          Maybe that's the wrong extension point, but what I had in mind is something similar to what FacetFields does today – it adds the categories to the taxonomy index and receives their ordinals. Then it calls a CategoryListBuilder which asks for the parent of each ordinal until it hits ROOT (depending on OrdPolicy of course). I mentioned dedupAndEncode because I thought it does something like that (i.e. that you've inlined CategoryListBuilder in FacetIW). If not, then whatever method does that ... and if there is none, let's wrap it in an overridable method?

          Really this is just adding complexity for a minor gain?

          As I said, let's divide that into two problems: API and optimization. For API, we can stick w/ CategoryListIterator and implement both a DGapVIntBinaryDVIterator as well as OrdinalsCacheIterator. That way, FacetsSomething (do we have a name yet? Is it just Facets?) can use a CLI if they don't care where the ordinals come from.

          For optimization, we do a FastFacetCounts which inlines dgap+vint and reads from BDV, and we can also do a CachedOrdsFacetCounts which inlines the interaction with OrdinalsCache. Actually, if we provide these two, we can skip the third FacetCounts (uses CLI), as it will be for demo purposes only given current encoding. If anyone changes the encoding, he can write a FacetCounts. Also, we can always add it later ...

          The rest of the Facets (i.e. non-counts) should IMO at this point use the CLI abstraction. If anyone wants to optimize a SumValueSourceFacets, he can do so however he wants. But the CLI is the abstraction I'm thinking – it only has two methods: setNextReader and getOrdinals(int doc).
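The two-method CategoryListIterator abstraction described above can be sketched as follows. Names and signatures are illustrative (the real API would take a reader context rather than a segment index), and the in-memory implementation stands in for a BDV- or cache-backed one:

```java
// Illustrative sketch of the two-method CategoryListIterator abstraction;
// not the exact Lucene API.
import java.util.Arrays;

interface CategoryListIterator {
  // advance to the next index segment (the real API would take a reader context)
  void setNextReader(int segment);

  // the facet ordinals for a document within the current segment
  int[] getOrdinals(int doc);
}

// trivial in-memory implementation, standing in for a BDV- or cache-backed one
class ArrayCategoryListIterator implements CategoryListIterator {
  private final int[][][] ordsPerSegment;
  private int[][] current;

  ArrayCategoryListIterator(int[][][] ordsPerSegment) {
    this.ordsPerSegment = ordsPerSegment;
  }

  public void setNextReader(int segment) {
    current = ordsPerSegment[segment];
  }

  public int[] getOrdinals(int doc) {
    return current[doc];
  }
}

public class CliSketch {
  public static void main(String[] args) {
    CategoryListIterator cli = new ArrayCategoryListIterator(
        new int[][][] { { {1, 3}, {2} } });
    cli.setNextReader(0);
    System.out.println(Arrays.toString(cli.getOrdinals(0))); // [1, 3]
  }
}
```

A Facets implementation written against this interface would not care whether the ordinals come from a binary doc-values field or from the OrdinalsCache.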

          OK, I added an OrdinalsReader abstraction, and CachedOrdinalsReader
          (holds all decoded ords in shared int[]), and a
          DocValuesOrdinalsReader with a protected decode() method that a
          subclass could customize. FastTaxonomyFacetCounts specializes the DV
          decode.
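The dgap (delta-gap) decoding that a protected decode() override could customize works by storing sorted ordinals as gaps from the previous ordinal. A minimal sketch of just that step (simplified: the actual codec also VInt-encodes each gap into bytes):

```java
// Simplified sketch of dgap decoding: ordinals are stored sorted, each as
// the gap from the previous ordinal; decoding is a running sum.
// Not the actual Lucene codec, which also VInt-encodes each gap.
public class DGapDecode {

  static int[] decode(int[] gaps) {
    int[] ords = new int[gaps.length];
    int prev = 0;
    for (int i = 0; i < gaps.length; i++) {
      prev += gaps[i]; // each stored value is the gap from the previous ord
      ords[i] = prev;
    }
    return ords;
  }

  public static void main(String[] args) {
    // gaps {3, 2, 5} decode back to ordinals {3, 5, 10}
    System.out.println(java.util.Arrays.toString(decode(new int[] {3, 2, 5})));
  }
}
```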

          I think this is a precarious balance. If a little code dup can greatly simplify the APIs, then that's the better tradeoff.

          In general I agree. It then becomes a question of what's considered a little vs. a lot of code dup. I think that dgap+vint plus rollup is not little (put together), as well as making the decision to roll up. But at this point I don't mind ... let's force the code dup, and then simplify if users are angry.

          I pulled out a base class (TaxonomyFacets) for all the taxonomy based
          facet methods, to share some code.

          Shai Erera added a comment -

          Hmm, that's iffy. First off, IW also calls .storableFields(), and
          it's not defined which will be called first, and we need to add both
          storable and indexable fields.

          Why is storableFields() relevant? None of the facet fields are also storable. I think we should only override indexableFields(). The part about reusing fields is a bigger problem though, so maybe your Document.rewrite() idea (like Query) could work... I just think that extending IW for this minor task (adding facets) is too much. I like the way you add FacetField to a Document; I just prefer to avoid the IW extension (at this point). So if we can pursue Doc.rewrite(), that may be good. Otherwise, maybe we could bring back FacetFields, only you wouldn't add all CPs at once, but rather add FacetFields one by one and then call createDoc() or something.

          Really this is just adding complexity for a minor gain?

          We can do this later. I just think that indexing the dimension is redundant in 99% of the cases. I prefer that the one app out there that might be interested in it do the hard work, rather than forcing more ordinals into the BDV for all faceted search apps.

          OK, I added an OrdinalsReader abstraction

          Thanks!

          I pulled out a base class (TaxonomyFacets) for all the taxonomy based
          facet methods, to share some code.

          Awesome!

          ASF subversion and git services added a comment -

          Commit 1543506 from Michael McCandless in branch 'dev/branches/lucene5339'
          [ https://svn.apache.org/r1543506 ]

          LUCENE-5339: migrate some more tests; fix 'ignores IntsRef.offset bug' in TaxoFacetCounts; add FacetTestCase.getFacetCounts

          ASF subversion and git services added a comment -

          Commit 1543530 from Michael McCandless in branch 'dev/branches/lucene5339'
          [ https://svn.apache.org/r1543530 ]

          LUCENE-5339: switch to DocumentBuilder.build instead of FacetIndexWriter

          ASF subversion and git services added a comment -

          Commit 1543535 from Michael McCandless in branch 'dev/branches/lucene5339'
          [ https://svn.apache.org/r1543535 ]

          LUCENE-5339: add best-effort detection of invalid mixing of different association field types in single indexed field

          ASF subversion and git services added a comment -

          Commit 1543572 from Michael McCandless in branch 'dev/branches/lucene5339'
          [ https://svn.apache.org/r1543572 ]

          LUCENE-5339: another test, cutover taxo writer/reader to pathToString/stringToPath

          ASF subversion and git services added a comment -

          Commit 1543803 from Michael McCandless in branch 'dev/branches/lucene5339'
          [ https://svn.apache.org/r1543803 ]

          LUCENE-5339: more tests, add DimConfig.requireDimCount

          ASF subversion and git services added a comment -

          Commit 1544299 from Michael McCandless in branch 'dev/branches/lucene5339'
          [ https://svn.apache.org/r1544299 ]

          LUCENE-5339: cutover more tests; fixed a few bugs

          ASF subversion and git services added a comment -

          Commit 1544586 from Michael McCandless in branch 'dev/branches/lucene5339'
          [ https://svn.apache.org/r1544586 ]

          LUCENE-5339: cutover more tests

          ASF subversion and git services added a comment -

          Commit 1544892 from Michael McCandless in branch 'dev/branches/lucene5339'
          [ https://svn.apache.org/r1544892 ]

          LUCENE-5339: move build into FacetsConfig; cutover more tests

          ASF subversion and git services added a comment -

          Commit 1544971 from Michael McCandless in branch 'dev/branches/lucene5339'
          [ https://svn.apache.org/r1544971 ]

          LUCENE-5339: cutover more tests

          ASF subversion and git services added a comment -

          Commit 1545086 from Michael McCandless in branch 'dev/branches/lucene5339'
          [ https://svn.apache.org/r1545086 ]

          LUCENE-5339: more tests; add search utility methods; remove request path from SimpleFacetResult

          ASF subversion and git services added a comment -

          Commit 1545096 from Michael McCandless in branch 'dev/branches/lucene5339'
          [ https://svn.apache.org/r1545096 ]

          LUCENE-5339: remove RangeAccumulator/FacetRequest

          ASF subversion and git services added a comment -

          Commit 1545097 from Michael McCandless in branch 'dev/branches/lucene5339'
          [ https://svn.apache.org/r1545097 ]

          LUCENE-5339: remove old sorted set

          ASF subversion and git services added a comment -

          Commit 1545466 from Michael McCandless in branch 'dev/branches/lucene5339'
          [ https://svn.apache.org/r1545466 ]

          LUCENE-5339: finish cutover

          ASF subversion and git services added a comment -

          Commit 1545637 from Michael McCandless in branch 'dev/branches/lucene5339'
          [ https://svn.apache.org/r1545637 ]

          LUCENE-5339: cutover demo to new APIs

          ASF subversion and git services added a comment -

          Commit 1545639 from Michael McCandless in branch 'dev/branches/lucene5339'
          [ https://svn.apache.org/r1545639 ]

          LUCENE-5339: add missing file

          Robert Muir added a comment -

          I built javadocs from this branch and took a look, much less overwhelming!

          Can we do something about the .writercache packages? must there really be 3 of them with names like cl2o?

          org.apache.lucene.facet 	
          org.apache.lucene.facet.simple 	 
          org.apache.lucene.facet.taxonomy 	
          org.apache.lucene.facet.taxonomy.directory 	
          org.apache.lucene.facet.taxonomy.writercache 	
          org.apache.lucene.facet.taxonomy.writercache.cl2o 	
          org.apache.lucene.facet.taxonomy.writercache.lru 	
          
          Shai Erera added a comment -

          Can we do something about the .writercache packages?

          +1 for moving both LRU and CL2O under the *.writercache package. Those packages are tiny; there's no real need to separate them. Also, while we're at it, I think we can rename the Cl2o cache to something more meaningful, like FullTaxoInMemoryCache or a better name. Cl2o, which stands for CategoryLabelToOrdinal, is just meaningless...

          BTW, I noticed in your list of packages above that we still have .simple – are we going to keep it, or is it just that not all of the facet module was cut over to the new API?

          ASF subversion and git services added a comment -

          Commit 1545798 from Michael McCandless in branch 'dev/branches/lucene5339'
          [ https://svn.apache.org/r1545798 ]

          LUCENE-5339: move/rename away from simple

          Michael McCandless added a comment -

          Can we do something about the .writercache packages?

          +1 for moving both LRU and CL2O under the *.writercache package.

          Good idea, I'll move them.

          BTW, I noticed in your list of packages above that we still have .simple – are we going to keep it, or is it just that not all of the facet module was cut over to the new API?

          I just committed the cutover to remove "simple" ...

          ASF subversion and git services added a comment -

          Commit 1545808 from Michael McCandless in branch 'dev/branches/lucene5339'
          [ https://svn.apache.org/r1545808 ]

          LUCENE-5339: remove writercache sub-packages

          Gilad Barkai added a comment -

          Been away from the issue for some time and it looks like major progress. Hats off!

          LabelAndValue & FacetResult use instanceof checks in their equals method - is that a must?

          FacetResult has a member called childCount – I think it's the number of categories/paths/labels that were encountered. The current javadocs, "How many labels were populated under the requested path", reveal implementation details (population). Perhaps exchange "populated" with "encountered"?

          FloatRange and DoubleRange use Math.nextUp/Down for infinity, as the ranges are always inclusive. Perhaps these constants for float and double could be static final.

          TaxonomyFacetSumFloatAssociations and TaxonomyFacetSumValueSource duplicate a LOT of code – can they extend one another? Perhaps extract a common super class for both?

          In TaxonomyFacets the parents array is saved, but I could not see where it's being used (and I think it's not used even in the older taxonomy-facet implementation).

          FacetConfig confuses me a bit: on one hand it's very much aware of the Taxonomy, while on the other it handles all kinds of facets.
          Perhaps FacetConfig.build() could be split up, allowing each FacetField.Type a build() method of its own, rather than every type's building being done in the same method. It would also bring a common parent class to all FacetField types, which I also like. As such, the taxonomy part, with processFacetFields(), could be moved to its respective Facets implementation.

          Michael McCandless added a comment -

          Been away from the issue for some time and it looks like major progress. Hats off!

          Thanks Gilad, bit by bit...

          LabelAndValue & FacetResult use instanceof checks in their equals method - is that a must?

          Hmm, I don't know how to implement .equals without an instanceof check?
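For reference, the instanceof-based equals pattern under discussion looks like the following. The class body here is illustrative (modeled loosely on a LabelAndValue-style class, not the actual source); the instanceof check doubles as the null check:

```java
// Illustrative instanceof-based equals; not the actual Lucene LabelAndValue.
public class LabelAndValueSketch {
  final String label;
  final Number value;

  LabelAndValueSketch(String label, Number value) {
    this.label = label;
    this.value = value;
  }

  @Override
  public boolean equals(Object other) {
    // instanceof also rejects null, so no explicit null check is needed
    if (!(other instanceof LabelAndValueSketch)) {
      return false;
    }
    LabelAndValueSketch o = (LabelAndValueSketch) other;
    return label.equals(o.label) && value.equals(o.value);
  }

  @Override
  public int hashCode() {
    return 31 * label.hashCode() + value.hashCode();
  }

  public static void main(String[] args) {
    System.out.println(
        new LabelAndValueSketch("a", 1).equals(new LabelAndValueSketch("a", 1))); // true
  }
}
```

The alternative is a getClass() comparison, which is stricter (it rejects subclasses) but still requires a runtime type check of some kind.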

          FacetResult has a member called childCount – I think it's the number of categories/paths/labels that were encountered. The current javadocs, "How many labels were populated under the requested path", reveal implementation details (population). Perhaps exchange "populated" with "encountered"?

          Fixed.

          FloatRange and DoubleRange use Math.nextUp/Down for infinity, as the ranges are always inclusive. Perhaps these constants for float and double could be static final.

          Well, .nextUp and .nextAfter. But what constants? The number is
          computed differently for each range (from the provided min and max)...
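That point can be made concrete: the inclusive equivalent of an exclusive bound is derived per range from that range's own min/max, so there is no single constant to share. A small sketch using the standard java.lang.Math methods:

```java
// Why the adjusted bound can't be one shared constant: the inclusive
// equivalent of an exclusive bound depends on the range's own min/max,
// computed via Math.nextUp / Math.nextAfter.
public class RangeBoundsSketch {
  public static void main(String[] args) {
    double minExclusive = 1.0;
    // smallest double strictly greater than the exclusive minimum
    double minInclusive = Math.nextUp(minExclusive);

    double maxExclusive = 10.0;
    // largest double strictly less than the exclusive maximum
    double maxInclusive = Math.nextAfter(maxExclusive, Double.NEGATIVE_INFINITY);

    System.out.println(minInclusive > minExclusive); // true
    System.out.println(maxInclusive < maxExclusive); // true
  }
}
```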

          TaxonomyFacetSumFloatAssociations and TaxonomyFacetSumValueSource duplicate a LOT of code – can they extend one another? Perhaps extract a common super class for both?

          Well, they differ in the source of the value to aggregate (per doc vs
          per ord), but then the other methods are nearly the same except for
          the int/float difference... in fact, Fast/TaxoFacetCounts
          getTopChildren is also the same.

          I'll add a TODO...

          In TaxonomyFacets the parents array is saved, but I could not see where it's being used (and I think it's not used even in the older taxonomy-facet implementation).

          Ooh, good catch: I removed it.

          FacetConfig confuses me a bit: on one hand it's very much aware of the Taxonomy, while on the other it handles all kinds of facets.
          Perhaps FacetConfig.build() could be split up, allowing each FacetField.Type a build() method of its own, rather than every type's building being done in the same method. It would also bring a common parent class to all FacetField types, which I also like. As such, the taxonomy part, with processFacetFields(), could be moved to its respective Facets implementation.

          I'm on the fence on whether FacetsConfig should hold the
          taxonomyWriter, vs you must pass it to the build method ...

          I like the idea of moving the build logic into each FacetField impl,
          but I want to keep it simple for the app (i.e. the app should not have
          to invoke N build methods).

          ASF subversion and git services added a comment -

          Commit 1546008 from Michael McCandless in branch 'dev/branches/lucene5339'
          [ https://svn.apache.org/r1546008 ]

          LUCENE-5339: Gilad's feedback, improve javadocs

          ASF subversion and git services added a comment -

          Commit 1546097 from Michael McCandless in branch 'dev/branches/lucene5339'
          [ https://svn.apache.org/r1546097 ]

          LUCENE-5339: factor out base classes for int/float taxonomy aggregates

          ASF subversion and git services added a comment -

          Commit 1546129 from Michael McCandless in branch 'dev/branches/lucene5339'
          [ https://svn.apache.org/r1546129 ]

          LUCENE-5339: address some nocommits

          ASF subversion and git services added a comment -

          Commit 1546167 from Michael McCandless in branch 'dev/branches/lucene5339'
          [ https://svn.apache.org/r1546167 ]

          LUCENE-5339: nocommits

          ASF subversion and git services added a comment -

          Commit 1546653 from Michael McCandless in branch 'dev/branches/lucene5339'
          [ https://svn.apache.org/r1546653 ]

          LUCENE-5339: fix some nocommits; move taxoWriter out of FacetsConfig; move search + collect utility methods to FacetsCollector

          ASF subversion and git services added a comment -

          Commit 1547241 from Michael McCandless in branch 'dev/branches/lucene5339'
          [ https://svn.apache.org/r1547241 ]

          LUCENE-5339: nocommits, javadocs, tests, range drill downs

          ASF subversion and git services added a comment -

          Commit 1547511 from Michael McCandless in branch 'dev/branches/lucene5339'
          [ https://svn.apache.org/r1547511 ]

          LUCENE-5339: small opto for range facets, and factor out base class; put longHashCode back

          ASF subversion and git services added a comment -

          Commit 1552197 from Michael McCandless in branch 'dev/branches/lucene5339'
          [ https://svn.apache.org/r1552197 ]

          LUCENE-5339: address remaining nocommits

          ASF subversion and git services added a comment -

          Commit 1552377 from Michael McCandless in branch 'dev/branches/lucene5339'
          [ https://svn.apache.org/r1552377 ]

          LUCENE-5339: merge trunk

          Michael McCandless added a comment -

          I think this branch is nearly ready. I'm attaching the current diffs vs trunk.

          Forbidden APIs is still angry (I enabled method-level documentation linting for the facets module), and I need to get "ant precommit" happy, but I think it's close.

          This also includes the fixes for LUCENE-5371 (use segment trees for range faceting).

          Shai Erera added a comment -

          A few comments about package organization and names:

          • Can we move all the taxonomy stuff under o.a.l.facet.taxonomy?
            • Inside it, create *.associations and put all the associations stuff there (*FacetField, *Facets)
          • Can we move all the SortedSet stuff under o.a.l.facet.sortedset?
          • Do you think we need o.a.l.facet.range for all the range collecting classes? It seems we now have more than before.
          • I think we can remove FacetPackage.java - it was needed in the past because o.a.l.facet had no classes.

          If you agree with moving packages, then we need to do the same for tests too.

          I haven't fully reviewed the branch yet. I plan to, though it may take me 1-2 weeks to finish a current project with a tight deadline. If you don't want to wait, that's fine; we can open follow-up issues later.

          Michael McCandless added a comment -

          Thanks Shai.

          I'd rather not add the sub-package hierarchy here; these sub-packages would only have a few classes and I think that's overkill? I think having so many sub-packages is also intimidating on first impression.

          I think we can remove FacetPackage.java - it was needed in the past because o.a.l.facet had no classes.

          Ahh, OK, I'll remove it.

          I haven't fully reviewed the branch yet, I plan to though it may take me 1-2 weeks to finish a current project with a tight deadline. If you don't want to wait that's fine, we can open follow-up issues later.

          OK, I'll likely commit this early next week ... we can iterate after that.

          Shai Erera added a comment - edited

          It's just that we have a *.facet.taxonomy package, yet many taxonomy-related classes are outside it. I prefer a more organized package hierarchy which groups classes together, rather than having them all in one arbitrary *.facet package. For instance, the *.facet package alone contains 40 classes, while the "suggest" package contains a total of 28 classes, divided into logical packages (*.analyzing, *.fst, *.tst, *.jaspell and *.suggest itself). What's the benefit of dumping all the classes in one package when they don't share any common code? If we have *.taxonomy, *.sortedset and *.range, you at least know where to look if you want to e.g. facet by taxonomy or sortedset. I don't know why you think packages are intimidating; they are meant to organize the code and help users find related stuff. I did a quick count and compare:

          • Suggest module's packages contain 6 classes under *.analyzing and *.fst (each), 2 classes under *.jaspell and 3 classes under *.tst.
          • Facet module could contain 6 classes under *.range, 3 classes under *.sortedset and 9 classes under *.taxonomy (besides the ones that are already there).

          The two modules are similar IMO: just as you have several methods for "suggesting", you have several methods for "faceting"...

          Robert Muir added a comment -

          Maybe one crazy idea is to look at the JDK apis for inspiration.

          Sometimes the separate packages help, sometimes they hurt. Look at codecs/ for example:

          • it helps from an organization perspective, since we have quite a few of them, and it's a super-expert thing where users by definition are probably looking at the source code anyway.
          • it hurts that we have a bunch of classes exposed (in both o.a.l.index and also o.a.l.codecs). This makes the javadocs overwhelming.

          For the faceting module, I don't think we should expect users to be looking at the source code; at the same time, this is an open source project, so we should also care how it's organized.

          I don't have an opinion either way: I'm just going to note that one big difference in the JDK APIs (since I consider them very well done) is that they have a concept of "internal packages" that they hide from javadocs and so on - a way of getting the best of both worlds, keeping things reasonably organized while also providing a minimal surface area for the public API.

          I'm not trying to promote that here; it's just interesting.

          Michael McCandless added a comment -

          OK, I agree it would be good to make sub-packages; the range impls already have 6 classes, which is non-trivial, and if we add further facet methods over time (e.g. there's an issue open for the "enum" method), the single package will become more and more awkward.

          ASF subversion and git services added a comment -

          Commit 1554372 from Shai Erera in branch 'dev/branches/lucene5339'
          [ https://svn.apache.org/r1554372 ]

          LUCENE-5339: handle warnings and javadoc errors

          ASF subversion and git services added a comment -

          Commit 1554379 from Shai Erera in branch 'dev/branches/lucene5339'
          [ https://svn.apache.org/r1554379 ]

          LUCENE-5339: organize packages

          Shai Erera added a comment -

          Committed the packages changes.

          ASF subversion and git services added a comment -

          Commit 1554710 from Michael McCandless in branch 'dev/branches/lucene5339'
          [ https://svn.apache.org/r1554710 ]

          LUCENE-5339: javadocs

          ASF subversion and git services added a comment -

          Commit 1555338 from Michael McCandless in branch 'dev/trunk'
          [ https://svn.apache.org/r1555338 ]

          LUCENE-5371, LUCENE-5339: speed up range faceting from O(N) per hit to O(log(N)) using segment trees; simplify facet APIs
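The "segment trees" in this commit message refer to the LUCENE-5371 change. As a rough, hypothetical sketch of the general idea (not the actual Lucene implementation): the requested ranges are broken into elementary intervals, each range is registered on O(log N) tree nodes, and each hit value then descends the tree once, incrementing every range registered along its root-to-leaf path, instead of testing the value against all N ranges:

```java
import java.util.*;

// Hypothetical sketch of segment-tree range counting: O(log N) per hit
// instead of O(N). Class and field names are illustrative only.
class RangeCounter {
  static class Node {
    long lo, hi;                              // elementary interval [lo, hi)
    Node left, right;
    List<Integer> ranges = new ArrayList<>(); // ranges fully covering this node
  }

  final long[][] ranges;  // each entry is {min, max}, inclusive
  final int[] counts;
  final Node root;

  RangeCounter(long[][] ranges) {
    this.ranges = ranges;
    this.counts = new int[ranges.length];
    // Collect the elementary interval boundaries from all range endpoints.
    TreeSet<Long> bounds = new TreeSet<>();
    for (long[] r : ranges) { bounds.add(r[0]); bounds.add(r[1] + 1); }
    bounds.add(Long.MIN_VALUE); bounds.add(Long.MAX_VALUE);
    Long[] b = bounds.toArray(new Long[0]);
    root = build(b, 0, b.length - 1);
    // Register each range on the O(log N) nodes it fully covers.
    for (int i = 0; i < ranges.length; i++) {
      insert(root, i, ranges[i][0], ranges[i][1] + 1);
    }
  }

  Node build(Long[] b, int lo, int hi) {
    Node n = new Node();
    n.lo = b[lo]; n.hi = b[hi];
    if (hi - lo > 1) {
      int mid = (lo + hi) >>> 1;
      n.left = build(b, lo, mid);
      n.right = build(b, mid, hi);
    }
    return n;
  }

  void insert(Node n, int range, long lo, long hi) {
    if (lo <= n.lo && n.hi <= hi) { n.ranges.add(range); return; }
    if (n.left != null && lo < n.left.hi && n.left.lo < hi) insert(n.left, range, lo, hi);
    if (n.right != null && lo < n.right.hi && n.right.lo < hi) insert(n.right, range, lo, hi);
  }

  /** O(log N) per hit: walk from the root to the leaf containing v. */
  void add(long v) {
    Node n = root;
    while (n != null) {
      for (int r : n.ranges) counts[r]++;
      n = (n.left != null && v < n.left.hi) ? n.left : n.right;
    }
  }

  public static void main(String[] args) {
    RangeCounter rc = new RangeCounter(new long[][] {{0, 10}, {5, 20}});
    rc.add(7); rc.add(15); rc.add(25);
    System.out.println(Arrays.toString(rc.counts));  // [1, 2]
  }
}
```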

          ASF subversion and git services added a comment -

          Commit 1555342 from Michael McCandless in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1555342 ]

          LUCENE-5371, LUCENE-5339: speed up range faceting from O(N) per hit to O(log(N)) using segment trees; simplify facet APIs

          ASF subversion and git services added a comment -

          Commit 1555438 from Michael McCandless in branch 'dev/trunk'
          [ https://svn.apache.org/r1555438 ]

          LUCENE-5339: PrintTaxonomyStats is allowed to use System.out

          ASF subversion and git services added a comment -

          Commit 1555439 from Michael McCandless in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1555439 ]

          LUCENE-5339: PrintTaxonomyStats is allowed to use System.out

          Shai Erera added a comment -

          Thanks Mike! Do we need to delete the branch?

          Michael McCandless added a comment -

          Do we need to delete the branch?

          I'll delete it...

          ASF subversion and git services added a comment -

          Commit 1555590 from Michael McCandless in branch 'dev/branches/lucene5339'
          [ https://svn.apache.org/r1555590 ]

          LUCENE-5339: remove now dead branch (it's merged/committed to trunk)

          ASF subversion and git services added a comment -

          Commit 1555592 from Michael McCandless in branch 'dev/trunk'
          [ https://svn.apache.org/r1555592 ]

          LUCENE-5339: also move OrdinalReaders under taxonomy

          ASF subversion and git services added a comment -

          Commit 1555595 from Michael McCandless in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1555595 ]

          LUCENE-5339: also move OrdinalReaders under taxonomy

          ASF subversion and git services added a comment -

          Commit 1555627 from Michael McCandless in branch 'dev/trunk'
          [ https://svn.apache.org/r1555627 ]

          LUCENE-5339: also catch invalid components in *FacetField

          ASF subversion and git services added a comment -

          Commit 1555628 from Michael McCandless in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1555628 ]

          LUCENE-5339: also catch invalid components in *FacetField


            People

            • Assignee: Michael McCandless
            • Reporter: Michael McCandless
            • Votes: 0
            • Watchers: 4
