Lucene - Core
  1. Lucene - Core
  2. LUCENE-4748

Add DrillSideways helper class to Lucene facets module

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.2, 5.0
    • Component/s: modules/facet
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      This came out of a discussion on the java-user list with subject
      "Faceted search in OR": http://markmail.org/thread/jmnq6z2x7ayzci5k

      The basic idea is to count "near misses" during collection, ie
      documents that matched the main query and also all except one of the
      drill down filters.

      Drill sideways makes for a very nice faceted search UI because you
      don't "lose" the facet counts after drilling in. Eg maybe you do a
      search for "cameras", and you see facets for the manufacturer, so you
      drill into "Nikon".

      With drill sideways, even after drilling down, you'll still get the
      counts for all the other brands, where each count tells you how many
      hits you'd get if you changed to a different manufacturer.

      This becomes more fun if you add further drill-downs, eg maybe I next drill
      down into Resolution=10 megapixels", and then I can see how many 10
      megapixel cameras all other manufacturers, and what other resolutions
      Nikon cameras offer.

      1. LUCENE-4748.patch
        17 kB
        Michael McCandless
      2. LUCENE-4748.patch
        17 kB
        Michael McCandless
      3. LUCENE-4748.patch
        21 kB
        Michael McCandless
      4. LUCENE-4748.patch
        22 kB
        Michael McCandless
      5. LUCENE-4748.patch
        23 kB
        Michael McCandless
      6. LUCENE-4748.patch
        34 kB
        Michael McCandless
      7. DrillSideways-alternative.tar.gz
        10 kB
        Nicola Buso
      8. LUCENE-4748.patch
        66 kB
        Michael McCandless
      9. LUCENE-4748.patch
        78 kB
        Michael McCandless
      10. LUCENE-4748.patch
        81 kB
        Michael McCandless
      11. LUCENE-4748.patch
        82 kB
        Michael McCandless
      12. LUCENE-4748.patch
        83 kB
        Michael McCandless

        Activity

        Hide
        Michael McCandless added a comment -

        Initial patch with many nocommits.

        The basic idea seems to work ... I create a BooleanQuery and then visit/hang onto the child scorers to decide for each collected document whether it's a real hit (counts agains drilldown) or a near-miss (counts against whichever drill-down dimension was the near miss).

        I made a basic test case showing it seems to work ...

        Show
        Michael McCandless added a comment - Initial patch with many nocommits. The basic idea seems to work ... I create a BooleanQuery and then visit/hang onto the child scorers to decide for each collected document whether it's a real hit (counts agains drilldown) or a near-miss (counts against whichever drill-down dimension was the near miss). I made a basic test case showing it seems to work ...
        Hide
        Michael McCandless added a comment -

        New patch, just using IS.search instead of walking the leaves myself...

        Show
        Michael McCandless added a comment - New patch, just using IS.search instead of walking the leaves myself...
        Hide
        Gilad Barkai added a comment -

        Great idea!

        Since drill-down followed by drill-sideways is a sort of (re)filtering over the original result set, perhaps the query result (say ScoredDocIds) could be passed through rather than re-evaluating the Query?
        IIRC the scores should not change during drill-down (and sideways as well), so without re-evaluating this could perhaps save some juice?

        Show
        Gilad Barkai added a comment - Great idea! Since drill-down followed by drill-sideways is a sort of (re)filtering over the original result set, perhaps the query result (say ScoredDocIds) could be passed through rather than re-evaluating the Query? IIRC the scores should not change during drill-down (and sideways as well), so without re-evaluating this could perhaps save some juice?
        Hide
        Michael McCandless added a comment -

        Since drill-down followed by drill-sideways is a sort of (re)filtering over the original result set, perhaps the query result (say ScoredDocIds) could be passed through rather than re-evaluating the Query?
        IIRC the scores should not change during drill-down (and sideways as well), so without re-evaluating this could perhaps save some juice?

        Yes, this would work correctly: the drill-sideways counts for dim X will be the same as the drill-down counts for dim X on the query just before you drilled into dim X. So we could save counting the sideways counts for dim X (but, other dimensions previously drilled-down will still need new counting).

        We'd need to run a different query, basically moving dim X up to the top BooleanQuery as a MUST, and of course the app would have to save & return the previous queries drill-down counts for dim X (which is a hassle in a stateless server app)...

        Show
        Michael McCandless added a comment - Since drill-down followed by drill-sideways is a sort of (re)filtering over the original result set, perhaps the query result (say ScoredDocIds) could be passed through rather than re-evaluating the Query? IIRC the scores should not change during drill-down (and sideways as well), so without re-evaluating this could perhaps save some juice? Yes, this would work correctly: the drill-sideways counts for dim X will be the same as the drill-down counts for dim X on the query just before you drilled into dim X. So we could save counting the sideways counts for dim X (but, other dimensions previously drilled-down will still need new counting). We'd need to run a different query, basically moving dim X up to the top BooleanQuery as a MUST, and of course the app would have to save & return the previous queries drill-down counts for dim X (which is a hassle in a stateless server app)...
        Hide
        Michael McCandless added a comment -

        New patch, addressing some nocommits.

        Now you instantiate DrillSideways, calls .addDrillDown one or more times, then call one of the search methods.

        I think DrillDown ought to have a similar API, so that you can build up some dims with OR and others without ... and maybe we should merge DrillDown and DrillSideways ...

        Show
        Michael McCandless added a comment - New patch, addressing some nocommits. Now you instantiate DrillSideways, calls .addDrillDown one or more times, then call one of the search methods. I think DrillDown ought to have a similar API, so that you can build up some dims with OR and others without ... and maybe we should merge DrillDown and DrillSideways ...
        Hide
        Shai Erera added a comment -

        Now you instantiate DrillSideways, calls .addDrillDown one or more times, then call one of the search methods.

        That's actually very cool!

        I think DrillDown ought to have a similar API

        I agree! Today DrillDown API is kinda confusing (you even have a nocommit regarding that). E.g. how should one use DrillDown to construct and AND of ORs. Having DrillDown.add() will be a great simplification!

        Whether DrillDown (and sideways) should have .search(), or .getCollector() (something I've been thinking about before looking at your 2nd patch) ... I think that .search() looks good. If needed, we can always add a getCollector() later.

        I think it would be good if we can merge DrillDown and Sideways, though not sure how the API would look like at the moment. So maybe for now we keep them separated. I.e., Sideways requires heavy jdocs, explaining exactly what it does, and e.g. the multiple FacetArrays allocation, so if all that we get by merging is the joint .add() logic ... I'd rather keep them separated.

        I plan to do a more thorough review tomorrow, but briefly scanning the new patch, this looks awesome!

        Show
        Shai Erera added a comment - Now you instantiate DrillSideways, calls .addDrillDown one or more times, then call one of the search methods. That's actually very cool! I think DrillDown ought to have a similar API I agree! Today DrillDown API is kinda confusing (you even have a nocommit regarding that). E.g. how should one use DrillDown to construct and AND of ORs. Having DrillDown.add() will be a great simplification! Whether DrillDown (and sideways) should have .search(), or .getCollector() (something I've been thinking about before looking at your 2nd patch) ... I think that .search() looks good. If needed, we can always add a getCollector() later. I think it would be good if we can merge DrillDown and Sideways, though not sure how the API would look like at the moment. So maybe for now we keep them separated. I.e., Sideways requires heavy jdocs, explaining exactly what it does, and e.g. the multiple FacetArrays allocation, so if all that we get by merging is the joint .add() logic ... I'd rather keep them separated. I plan to do a more thorough review tomorrow, but briefly scanning the new patch, this looks awesome!
        Hide
        Michael McCandless added a comment -

        Whether DrillDown (and sideways) should have .search(), or .getCollector() (something I've been thinking about before looking at your 2nd patch) ... I think that .search() looks good. If needed, we can always add a getCollector() later.

        I think we should somehow do a .getCollector().

        For example, if you want to do grouping (using GroupingSearch) and faceting (using DrillSideways) ... you're kind of stuck because each of these classes does the "search" for you.

        It's also a hassle having to make the N search methods (I still have a nocommit to add searchAfter...).

        So a .getCollector would be nice so the app could use MultiCollector to run everything (maybe they need to do joins too!).

        But the challenge is ... we'd also need .getQuery, because DrillSideways runs a different Query from what the user provided (and a different collector). And then we'd need to expose the collector, and add methods to get the drill-down and drill-sideways results ...

        Show
        Michael McCandless added a comment - Whether DrillDown (and sideways) should have .search(), or .getCollector() (something I've been thinking about before looking at your 2nd patch) ... I think that .search() looks good. If needed, we can always add a getCollector() later. I think we should somehow do a .getCollector(). For example, if you want to do grouping (using GroupingSearch) and faceting (using DrillSideways) ... you're kind of stuck because each of these classes does the "search" for you. It's also a hassle having to make the N search methods (I still have a nocommit to add searchAfter...). So a .getCollector would be nice so the app could use MultiCollector to run everything (maybe they need to do joins too!). But the challenge is ... we'd also need .getQuery, because DrillSideways runs a different Query from what the user provided (and a different collector). And then we'd need to expose the collector, and add methods to get the drill-down and drill-sideways results ...
        Hide
        Nicola Buso added a comment -

        I'm looking into it and I have a question.

        why are your collecting facets in two different groups (drilldown and drillsideways)?
        In the final end the user doesn't matter of these two groups and want them merged (a facet can be in both groups). Is there a specific use case you have in mind to maintain them separated?

        Show
        Nicola Buso added a comment - I'm looking into it and I have a question. why are your collecting facets in two different groups (drilldown and drillsideways)? In the final end the user doesn't matter of these two groups and want them merged (a facet can be in both groups). Is there a specific use case you have in mind to maintain them separated?
        Hide
        Michael McCandless added a comment -

        Hmm that's a good point!

        OK I think if we changed collect to also count the hits against each drill-sideways dimension, then the drill-sideways results would be "self contained", ie also contain any drill-down values for that field.

        And then I think we can return a single list of facet results...

        Show
        Michael McCandless added a comment - Hmm that's a good point! OK I think if we changed collect to also count the hits against each drill-sideways dimension, then the drill-sideways results would be "self contained", ie also contain any drill-down values for that field. And then I think we can return a single list of facet results...
        Hide
        Michael McCandless added a comment -

        New patch, only returning a single List<FacetResult> matching the List<FacetRequest> provided in the FacetSearchParams.

        If the dimension was drill-down then the corresponding FacetResult will have the merged drill-down + -sideways results.

        Show
        Michael McCandless added a comment - New patch, only returning a single List<FacetResult> matching the List<FacetRequest> provided in the FacetSearchParams. If the dimension was drill-down then the corresponding FacetResult will have the merged drill-down + -sideways results.
        Hide
        Shai Erera added a comment -

        Few comments:

        • I think that DrillSideways can take a DrillDownQuery (once we finish with LUCENE-4750)?
          • It will eliminate .addDrillDown (and it's ok I think that DDQ too will enforce all passed CPs to belong to the same dimension)
          • Though if we do that, how can we set minShouldMatch on sub-query?
          • Maybe if DDQ itself won't wrap another Query, but just build a BQ over all CPs ... then the user will need to wrap, but we can add a utility method.
        • In .search(), just set minShouldMatch to 1 if (drillDownQueries.size() == 1)? It reads simpler...
          • Also, why do you need to add a fake Query? I understand the rewrite will eliminate BQ and return TQ, but what's the harm?
          • Isn't minShouldMatch=1 in that case similar to TQ?
        • In getDimIndex:
          • Extract dims.size() to a variable so it's not executed in every loop?
          • I think you can drop the if (cp.length > 0)? It doesn't make sense for someone to pass an empty CP. Also, you can assert on that in .addDrillDown()
            • BTW, I noticed that you test that in DrillSidewaysCollector ctor too.
          • I wonder if we made 'dims' LinkedHashSet it would perform better than these contains() (in .addDrillDown), get. Then you could just do dims.get(fr.cp.components[0]). I didn't try that in code, so not sure if you can get its index...

        Also, I think we could simplify things if DrillSideways worked like this:

        • Either exposed a .getQuery() method, or was itself a Query (like DDQ).
        • Either exposed a .getCollector() method (returning DrillSidewaysCollector) or if it was a Query, you'd just initialize a DrillSidewaysCollector (not a big deal, user-wise).
        • The collector's getFacetResults() would do the "merging" work that I see in .search()

        Then you:

        • Won't need DrillSidewaysResult, which today wrap a List<FacetResult> and TopDocs. Someone could MultiCollector.wrap(topDocsCollector, sidewaysCollector)? Just like w/ facets?
        • Won't need the multitude of search() methods. Again, someone could wrap TopDocsCollector, CachingCollector, TopFieldsCollector...

        In DrillSidewaysCollector ctor:

        • if (drillSidewaysRequest == null) – that means the user asked to drill-down on some CPs for dim X, but not requested to count it, right?
          • Do we must throw an exception? Perhaps we can just drop the relevant Query clause? Although, it's not very expected that a user would do that ... so perhaps keep the code for simplicity.
        • Instead of doing Collections.singletonList you can just pass the single FacetRequest to the vararg ctor. If you feel like it, we can optimize FacetSearchParams' vararg ctors to initialize a singletonList if facetRequests.length == 1.
        • exactCount = Math.max(2, dims.size()); – maybe add a comment why '2'?

        In DrillSidewaysCollector.setScorer:

        • Why does Scorer.getChildren() return a Collection and not List? We used to have that in IR.listCommits while in practice it was always a List. Can we fix Scorer?
          • I looked at all Scorer.getChildren() impls and they either return a List (ArrayList in most cases) or Collections.singleton (which is a Set). So it's indeed dangerous to assume it's a List, but I think we should just fix Scorer?
        • What do you mean by "// nocommit fragile: need tracker somehow..."? What's tracker?

        In DrillSidewaysCollector.collect:

        • Can you add some documentation to the 'if-else'?
        Show
        Shai Erera added a comment - Few comments: I think that DrillSideways can take a DrillDownQuery (once we finish with LUCENE-4750 )? It will eliminate .addDrillDown (and it's ok I think that DDQ too will enforce all passed CPs to belong to the same dimension) Though if we do that, how can we set minShouldMatch on sub-query? Maybe if DDQ itself won't wrap another Query, but just build a BQ over all CPs ... then the user will need to wrap, but we can add a utility method. In .search(), just set minShouldMatch to 1 if (drillDownQueries.size() == 1)? It reads simpler... Also, why do you need to add a fake Query? I understand the rewrite will eliminate BQ and return TQ, but what's the harm? Isn't minShouldMatch=1 in that case similar to TQ? In getDimIndex: Extract dims.size() to a variable so it's not executed in every loop? I think you can drop the if (cp.length > 0)? It doesn't make sense for someone to pass an empty CP. Also, you can assert on that in .addDrillDown() BTW, I noticed that you test that in DrillSidewaysCollector ctor too. I wonder if we made 'dims' LinkedHashSet it would perform better than these contains() (in .addDrillDown), get . Then you could just do dims.get(fr.cp.components [0] ). I didn't try that in code, so not sure if you can get its index... Also, I think we could simplify things if DrillSideways worked like this: Either exposed a .getQuery() method, or was itself a Query (like DDQ). Either exposed a .getCollector() method (returning DrillSidewaysCollector) or if it was a Query, you'd just initialize a DrillSidewaysCollector (not a big deal, user-wise). The collector's getFacetResults() would do the "merging" work that I see in .search() Then you: Won't need DrillSidewaysResult, which today wrap a List<FacetResult> and TopDocs. Someone could MultiCollector.wrap(topDocsCollector, sidewaysCollector)? Just like w/ facets? Won't need the multitude of search() methods. Again, someone could wrap TopDocsCollector, CachingCollector, TopFieldsCollector... In DrillSidewaysCollector ctor: if (drillSidewaysRequest == null) – that means the user asked to drill-down on some CPs for dim X, but not requested to count it, right? Do we must throw an exception? Perhaps we can just drop the relevant Query clause? Although, it's not very expected that a user would do that ... so perhaps keep the code for simplicity. Instead of doing Collections.singletonList you can just pass the single FacetRequest to the vararg ctor. If you feel like it, we can optimize FacetSearchParams' vararg ctors to initialize a singletonList if facetRequests.length == 1. exactCount = Math.max(2, dims.size()); – maybe add a comment why '2'? In DrillSidewaysCollector.setScorer: Why does Scorer.getChildren() return a Collection and not List? We used to have that in IR.listCommits while in practice it was always a List. Can we fix Scorer? I looked at all Scorer.getChildren() impls and they either return a List (ArrayList in most cases) or Collections.singleton (which is a Set). So it's indeed dangerous to assume it's a List, but I think we should just fix Scorer? What do you mean by "// nocommit fragile: need tracker somehow..."? What's tracker? In DrillSidewaysCollector.collect: Can you add some documentation to the 'if-else'?
        Hide
        Michael McCandless added a comment -

        Thank you for all the feedback Shai! Responses:

        I think that DrillSideways can take a DrillDownQuery (once we finish with LUCENE-4750)?

        +1

        DrillSideways will still need to do its own thing (build the BQ with
        minShouldMatch), but it should just take a DDQ.

        In .search(), just set minShouldMatch to 1 if (drillDownQueries.size() == 1)? It reads simpler...

        Fixed.

        Also, why do you need to add a fake Query? I understand the rewrite will eliminate BQ and return TQ, but what's the harm?

        Isn't minShouldMatch=1 in that case similar to TQ?

        If we don't add the fake query then BQ of a single term will just
        rewrite itself to that single term, so we won't find our scorer.

        Really we should make a specialized collector for this case (I added a
        TODO) because you should just use query foobar OR drill-down, and then
        check only the drill-down scorer to see if it matched.

        But I think we can do that later (it's an opto).

        In getDimIndex:

        Extract dims.size() to a variable so it's not executed in every loop?

        Fixed.

        I think you can drop the if (cp.length > 0)? It doesn't make sense for someone to pass an empty CP. Also, you can assert on that in .addDrillDown()

        BTW, I noticed that you test that in DrillSidewaysCollector ctor too.

        OK I'll move the check into .addDrillDown, but I think it should be a
        real check (not assert).

        I wonder if we made 'dims' LinkedHashSet it would perform better than these contains() (in .addDrillDown), get. Then you could just do dims.get(fr.cp.components[0]). I didn't try that in code, so not sure if you can get its index...

        Good idea, I think LinkedHashMap should work. I'll try that...

        Also, I think we could simplify things if DrillSideways worked like this:
        Either exposed a .getQuery() method, or was itself a Query (like DDQ).
        Either exposed a .getCollector() method (returning DrillSidewaysCollector) or if it was a Query, you'd just initialize a DrillSidewaysCollector (not a big deal, user-wise).
        The collector's getFacetResults() would do the "merging" work that I see in .search()
        Then you:
        Won't need DrillSidewaysResult, which today wrap a List<FacetResult> and TopDocs. Someone could MultiCollector.wrap(topDocsCollector, sidewaysCollector)? Just like w/ facets?
        Won't need the multitude of search() methods. Again, someone could wrap TopDocsCollector, CachingCollector, TopFieldsCollector...

        Alas, it's not so simple: you can't use
        MultiCollector.wrap(topDocsCollector, sidewaysCollector) because the
        BooleanQuery (DrillSidewaysQuery if we do that) hits too many results
        (hits + near-misses), and this is why it needs its own collector.

        All other collectors you may want for your search (grouping, joining,
        etc) need to be MultiCollector.wrap'd and then passed to
        DrillSidewaysCollector.

        So I'm not sure what value it is to provide the Query/Collector
        externally: all you can really do with them is search on the query,
        collecting with the collector.

        I'm also torn about "opening up" the Query/Collector because we may
        change the impl on how the near-miss counts are collected. EG Solr
        makes bit sets and does multiple passes to aggregate the near-misses.

        In DrillSidewaysCollector ctor:

        if (drillSidewaysRequest == null) – that means the user asked to drill-down on some CPs for dim X, but not requested to count it, right?

        Right.

        Do we must throw an exception? Perhaps we can just drop the relevant Query clause? Although, it's not very expected that a user would do that ... so perhaps keep the code for simplicity.

        Or, we could just return null as the FacetResult for that dimension?

        The thing is, I suspect it's unusual that the app would want this, and
        then in those cases that they do, they could drill-down on their main
        query (ie, pass DrillDownQuery as the Query to DrillSideways) and then
        we won't count the sideways counts for it.

        Instead of doing Collections.singletonList you can just pass the single FacetRequest to the vararg ctor.

        Ahh, good!

        If you feel like it, we can optimize FacetSearchParams' vararg ctors to initialize a singletonList if facetRequests.length == 1.

        I don't think we need to ... looks like we use Arrays.asList.

        exactCount = Math.max(2, dims.size()); – maybe add a comment why '2'?

        Done.

        In DrillSidewaysCollector.setScorer:
        Why does Scorer.getChildren() return a Collection and not List? We used to have that in IR.listCommits while in practice it was always a List. Can we fix Scorer?
        I looked at all Scorer.getChildren() impls and they either return a List (ArrayList in most cases) or Collections.singleton (which is a Set). So it's indeed dangerous to assume it's a List, but I think we should just fix Scorer?

        I don't think we can expect Scorer.getChildren to return a List,
        matching the order of the sub-clauses, in general. Ie, it will depend
        on each query: sometimes a Scorer can be dropped, if that clause's
        scorer was null, or the Scorers can be reordered (eg,
        ConjunctionScorer reorders by expected freq i think).

        What do you mean by "// nocommit fragile: need tracker somehow..."? What's tracker?

        Well, we cannot just cast to List and assume number of sub-scorers is
        the same as number of clauses. EG if the user's main query returns a
        null Scorer because it matches no hits, then BQ will only return a
        length 1 list to us.

        So to properly handle this I think we need a wrapper Query class (this
        original idea came from selckin I think), whose purpose is to track
        which Scorer instance was created for our "near-miss" BQ.

        In DrillSidewaysCollector.collect:
        Can you add some documentation to the 'if-else'?

        OK done.

        Show
        Michael McCandless added a comment - Thank you for all the feedback Shai! Responses: I think that DrillSideways can take a DrillDownQuery (once we finish with LUCENE-4750 )? +1 DrillSideways will still need to do its own thing (build the BQ with minShouldMatch), but it should just take a DDQ. In .search(), just set minShouldMatch to 1 if (drillDownQueries.size() == 1)? It reads simpler... Fixed. Also, why do you need to add a fake Query? I understand the rewrite will eliminate BQ and return TQ, but what's the harm? Isn't minShouldMatch=1 in that case similar to TQ? If we don't add the fake query then BQ of a single term will just rewrite itself to that single term, so we won't find our scorer. Really we should make a specialized collector for this case (I added a TODO) because you should just use query foobar OR drill-down, and then check only the drill-down scorer to see if it matched. But I think we can do that later (it's an opto). In getDimIndex: Extract dims.size() to a variable so it's not executed in every loop? Fixed. I think you can drop the if (cp.length > 0)? It doesn't make sense for someone to pass an empty CP. Also, you can assert on that in .addDrillDown() BTW, I noticed that you test that in DrillSidewaysCollector ctor too. OK I'll move the check into .addDrillDown, but I think it should be a real check (not assert). I wonder if we made 'dims' LinkedHashSet it would perform better than these contains() (in .addDrillDown), get. Then you could just do dims.get(fr.cp.components [0] ). I didn't try that in code, so not sure if you can get its index... Good idea, I think LinkedHashMap should work. I'll try that... Also, I think we could simplify things if DrillSideways worked like this: Either exposed a .getQuery() method, or was itself a Query (like DDQ). Either exposed a .getCollector() method (returning DrillSidewaysCollector) or if it was a Query, you'd just initialize a DrillSidewaysCollector (not a big deal, user-wise). The collector's getFacetResults() would do the "merging" work that I see in .search() Then you: Won't need DrillSidewaysResult, which today wrap a List<FacetResult> and TopDocs. Someone could MultiCollector.wrap(topDocsCollector, sidewaysCollector)? Just like w/ facets? Won't need the multitude of search() methods. Again, someone could wrap TopDocsCollector, CachingCollector, TopFieldsCollector... Alas, it's not so simple: you can't use MultiCollector.wrap(topDocsCollector, sidewaysCollector) because the BooleanQuery (DrillSidewaysQuery if we do that) hits too many results (hits + near-misses), and this is why it needs its own collector. All other collectors you may want for your search (grouping, joining, etc) need to be MultiCollector.wrap'd and then passed to DrillSidewaysCollector. So I'm not sure what value it is to provide the Query/Collector externally: all you can really do with them is search on the query, collecting with the collector. I'm also torn about "opening up" the Query/Collector because we may change the impl on how the near-miss counts are collected. EG Solr makes bit sets and does multiple passes to aggregate the near-misses. In DrillSidewaysCollector ctor: if (drillSidewaysRequest == null) – that means the user asked to drill-down on some CPs for dim X, but not requested to count it, right? Right. Do we must throw an exception? Perhaps we can just drop the relevant Query clause? Although, it's not very expected that a user would do that ... so perhaps keep the code for simplicity. Or, we could just return null as the FacetResult for that dimension? The thing is, I suspect it's unusual that the app would want this, and then in those cases that they do, they could drill-down on their main query (ie, pass DrillDownQuery as the Query to DrillSideways) and then we won't count the sideways counts for it. Instead of doing Collections.singletonList you can just pass the single FacetRequest to the vararg ctor. Ahh, good! If you feel like it, we can optimize FacetSearchParams' vararg ctors to initialize a singletonList if facetRequests.length == 1. I don't think we need to ... looks like we use Arrays.asList. exactCount = Math.max(2, dims.size()); – maybe add a comment why '2'? Done. In DrillSidewaysCollector.setScorer: Why does Scorer.getChildren() return a Collection and not List? We used to have that in IR.listCommits while in practice it was always a List. Can we fix Scorer? I looked at all Scorer.getChildren() impls and they either return a List (ArrayList in most cases) or Collections.singleton (which is a Set). So it's indeed dangerous to assume it's a List, but I think we should just fix Scorer? I don't think we can expect Scorer.getChildren to return a List, matching the order of the sub-clauses, in general. Ie, it will depend on each query: sometimes a Scorer can be dropped, if that clause's scorer was null, or the Scorers can be reordered (eg, ConjunctionScorer reorders by expected freq i think). What do you mean by "// nocommit fragile: need tracker somehow..."? What's tracker? Well, we cannot just cast to List and assume number of sub-scorers is the same as number of clauses. EG if the user's main query returns a null Scorer because it matches no hits, then BQ will only return a length 1 list to us. So to properly handle this I think we need a wrapper Query class (this original idea came from selckin I think), whose purpose is to track which Scorer instance was created for our "near-miss" BQ. In DrillSidewaysCollector.collect: Can you add some documentation to the 'if-else'? OK done.
        Hide
        Shai Erera added a comment -

        Thanks for all the answers Mike!

        The fixes look. And I suggest that all the nocommits that are optimizations should be converted to TODOs. We should get this feature in, optimize later.

        Show
        Shai Erera added a comment - Thanks for all the answers Mike! The fixes look. And I suggest that all the nocommits that are optimizations should be converted to TODOs. We should get this feature in, optimize later.
        Hide
        Michael McCandless added a comment -

        New patch, switching back to static methods that now just take a DrillDownQuery. I removed some nocommits but I think added more Still in progress...

        Show
        Michael McCandless added a comment - New patch, switching back to static methods that now just take a DrillDownQuery. I removed some nocommits but I think added more Still in progress...
        Hide
        Nicola Buso added a comment -

        This is another implementation of the brilliant idea from Mike McCandles in the ml. In the code FacetDrillDown is a porting of the lucene-4.2 DrillDown class in lucene-4.1. Next step would be to remove the dummy MatchAllDocsQuery.

        Show
        Nicola Buso added a comment - This is another implementation of the brilliant idea from Mike McCandles in the ml. In the code FacetDrillDown is a porting of the lucene-4.2 DrillDown class in lucene-4.1. Next step would be to remove the dummy MatchAllDocsQuery.
        Hide
        Michael McCandless added a comment -

        I made a new patch, with a custom scorer to find the exact & near-miss
        hits and tally accordingly. I also added a random test which seems to
        be passing ... I think the new scorer is working (but there are still
        tons of nocommits).

        It improves performance vs the last patch (base = last patch, comp =
        new patch), on full wikibig (6.6M docs), 7 dims. Each TermQuery does
        a drill down on Date/2012 and imageCount/1:

                            Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
                         LowTerm       33.53      (1.2%)       26.27      (4.2%)  -21.7% ( -26% -  -16%)
                         MedTerm       14.20      (0.9%)       16.29      (4.8%)   14.7% (   8% -   20%)
                        HighTerm        6.47      (1.2%)        9.43      (4.8%)   45.7% (  39% -   52%)
        

        I think LowTerm got slower because the new scorer has highish init
        cost: it works like BS1, allocating arrays[CHUNK] up front.

        For comparison ... this is the same set of queries, but doing
        only drill-down. base and comp are the same here (so the diffs are
        noise), so you have to abs compare to the table above to get the
        drill-sideways penalty (~2 - 2.4 X slower):

                            Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
                         MedTerm       32.83      (1.0%)       32.94      (0.6%)    0.3% (  -1% -    2%)
                        HighTerm       22.26      (0.8%)       22.37      (0.5%)    0.5% (   0% -    1%)
                         LowTerm       58.47      (1.4%)       58.91      (0.9%)    0.8% (  -1% -    3%)
        
        Show
        Michael McCandless added a comment - I made a new patch, with a custom scorer to find the exact & near-miss hits and tally accordingly. I also added a random test which seems to be passing ... I think the new scorer is working (but there are still tons of nocommits). It improves performance vs the last patch (base = last patch, comp = new patch), on full wikibig (6.6M docs), 7 dims. Each TermQuery does a drill down on Date/2012 and imageCount/1: Task QPS base StdDev QPS comp StdDev Pct diff LowTerm 33.53 (1.2%) 26.27 (4.2%) -21.7% ( -26% - -16%) MedTerm 14.20 (0.9%) 16.29 (4.8%) 14.7% ( 8% - 20%) HighTerm 6.47 (1.2%) 9.43 (4.8%) 45.7% ( 39% - 52%) I think LowTerm got slower because the new scorer has highish init cost: it works like BS1, allocating arrays [CHUNK] up front. For comparison ... this is the same set of queries, but doing only drill-down. base and comp are the same here (so the diffs are noise), so you have to abs compare to the table above to get the drill-sideways penalty (~2 - 2.4 X slower): Task QPS base StdDev QPS comp StdDev Pct diff MedTerm 32.83 (1.0%) 32.94 (0.6%) 0.3% ( -1% - 2%) HighTerm 22.26 (0.8%) 22.37 (0.5%) 0.5% ( 0% - 1%) LowTerm 58.47 (1.4%) 58.91 (0.9%) 0.8% ( -1% - 3%)
        Hide
        Michael McCandless added a comment -

        New patch, fixing various bugs, beefing up the tests and resolving all
        nocommits. I think it's ready!

        I also fixed a consistency issue with the facets API: if you request
        faceting for a non-existent category, it now returns an empty
        FacetResult instead of skipping it.

        I tested on a wider variety of drill down / sideways queries. base =
        old patch and comp = this patch:

                            Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
                  LowTermHardDD2       24.43      (2.0%)       24.43      (2.2%)    0.0% (  -4% -    4%)
                 HighTermEasyDD2       18.91      (1.6%)       20.59      (4.3%)    8.9% (   2% -   15%)
                  LowTermHardDD1       31.38      (2.0%)       36.21      (1.7%)   15.4% (  11% -   19%)
                 LowTermMixedDD2       44.09      (2.1%)       53.93      (0.9%)   22.3% (  18% -   25%)
                LowTermHardOrDD1       25.85      (2.3%)       33.80      (2.0%)   30.7% (  25% -   35%)
                  MedTermHardDD2        5.78      (1.4%)        7.71      (5.3%)   33.4% (  26% -   40%)
                  LowTermEasyDD2      129.51      (1.7%)      176.27      (3.9%)   36.1% (  30% -   42%)
                  MedTermEasyDD2       42.88      (1.8%)       60.03      (3.5%)   40.0% (  34% -   46%)
                 MedTermMixedDD2       12.52      (1.4%)       17.59      (4.2%)   40.5% (  34% -   46%)
                LowTermHardOrDD2       18.57      (2.8%)       26.45      (1.3%)   42.4% (  37% -   47%)
                  LowTermEasyDD1       71.73      (1.8%)      102.77      (1.8%)   43.3% (  38% -   47%)
                LowTermEasyOrDD2       61.01      (2.7%)       98.57      (6.7%)   61.6% (  50% -   73%)
                 HighTermHardDD2        1.22      (1.8%)        1.97      (6.8%)   61.7% (  52% -   71%)
                  MedTermHardDD1        8.77      (2.6%)       14.47      (5.1%)   65.1% (  55% -   74%)
                HighTermMixedDD2        2.69      (1.6%)        4.50      (6.8%)   67.4% (  58% -   76%)
                  MedTermEasyDD1       18.61      (2.6%)       32.34      (6.1%)   73.8% (  63% -   84%)
                LowTermEasyOrDD1       51.31      (2.2%)       91.48      (2.1%)   78.3% (  72% -   84%)
               HighTermEasyOrDD2        8.96      (3.1%)       16.17      (5.4%)   80.5% (  69% -   91%)
               HighTermEasyOrDD1        3.47      (4.1%)        6.40      (7.5%)   84.8% (  70% -  100%)
                MedTermHardOrDD2        4.31      (3.3%)        8.03      (6.4%)   86.6% (  74% -   99%)
                 HighTermEasyDD1        3.16      (3.0%)        5.89      (7.7%)   86.6% (  73% -  100%)
                MedTermEasyOrDD1       15.63      (3.4%)       30.05      (6.5%)   92.2% (  79% -  105%)
                 HighTermHardDD1        1.61      (3.1%)        3.13      (7.6%)   94.3% (  81% -  108%)
                MedTermHardOrDD1        6.75      (3.5%)       13.76      (6.0%)  103.9% (  91% -  117%)
               HighTermHardOrDD2        1.14      (4.2%)        2.41      (9.2%)  111.6% (  94% -  130%)
                MedTermEasyOrDD2       19.92      (3.0%)       45.44      (6.3%)  128.1% ( 115% -  141%)
               HighTermHardOrDD1        0.96      (3.5%)        2.54     (10.4%)  163.6% ( 144% -  183%)
        

        DD2 means drill down on 2 dims, DD1 means drill down on 1 dim. Hard
        means the 1 or 2 dims have high count, Easy means they have low count,
        and Mixed means one high and one low. OrDDX means I OR two values per
        dim.

        The new patch is especially faster for the OR case (ie, when you drill
        down on more than one value in a single dim), I think because it
        handles it directly instead of recursing into another BQ.

        Show
        Michael McCandless added a comment - New patch, fixing various bugs, beefing up the tests and resolving all nocommits. I think it's ready! I also fixed a consistency issue with the facets API: if you request faceting for a non-existent category, it now returns an empty FacetResult instead of skipping it. I tested on a wider variety of drill down / sideways queries. base = old patch and comp = this patch: Task QPS base StdDev QPS comp StdDev Pct diff LowTermHardDD2 24.43 (2.0%) 24.43 (2.2%) 0.0% ( -4% - 4%) HighTermEasyDD2 18.91 (1.6%) 20.59 (4.3%) 8.9% ( 2% - 15%) LowTermHardDD1 31.38 (2.0%) 36.21 (1.7%) 15.4% ( 11% - 19%) LowTermMixedDD2 44.09 (2.1%) 53.93 (0.9%) 22.3% ( 18% - 25%) LowTermHardOrDD1 25.85 (2.3%) 33.80 (2.0%) 30.7% ( 25% - 35%) MedTermHardDD2 5.78 (1.4%) 7.71 (5.3%) 33.4% ( 26% - 40%) LowTermEasyDD2 129.51 (1.7%) 176.27 (3.9%) 36.1% ( 30% - 42%) MedTermEasyDD2 42.88 (1.8%) 60.03 (3.5%) 40.0% ( 34% - 46%) MedTermMixedDD2 12.52 (1.4%) 17.59 (4.2%) 40.5% ( 34% - 46%) LowTermHardOrDD2 18.57 (2.8%) 26.45 (1.3%) 42.4% ( 37% - 47%) LowTermEasyDD1 71.73 (1.8%) 102.77 (1.8%) 43.3% ( 38% - 47%) LowTermEasyOrDD2 61.01 (2.7%) 98.57 (6.7%) 61.6% ( 50% - 73%) HighTermHardDD2 1.22 (1.8%) 1.97 (6.8%) 61.7% ( 52% - 71%) MedTermHardDD1 8.77 (2.6%) 14.47 (5.1%) 65.1% ( 55% - 74%) HighTermMixedDD2 2.69 (1.6%) 4.50 (6.8%) 67.4% ( 58% - 76%) MedTermEasyDD1 18.61 (2.6%) 32.34 (6.1%) 73.8% ( 63% - 84%) LowTermEasyOrDD1 51.31 (2.2%) 91.48 (2.1%) 78.3% ( 72% - 84%) HighTermEasyOrDD2 8.96 (3.1%) 16.17 (5.4%) 80.5% ( 69% - 91%) HighTermEasyOrDD1 3.47 (4.1%) 6.40 (7.5%) 84.8% ( 70% - 100%) MedTermHardOrDD2 4.31 (3.3%) 8.03 (6.4%) 86.6% ( 74% - 99%) HighTermEasyDD1 3.16 (3.0%) 5.89 (7.7%) 86.6% ( 73% - 100%) MedTermEasyOrDD1 15.63 (3.4%) 30.05 (6.5%) 92.2% ( 79% - 105%) HighTermHardDD1 1.61 (3.1%) 3.13 (7.6%) 94.3% ( 81% - 108%) MedTermHardOrDD1 6.75 (3.5%) 13.76 (6.0%) 103.9% ( 91% - 117%) HighTermHardOrDD2 1.14 (4.2%) 2.41 (9.2%) 111.6% ( 94% - 130%) MedTermEasyOrDD2 19.92 (3.0%) 45.44 (6.3%) 128.1% ( 115% - 141%) HighTermHardOrDD1 0.96 (3.5%) 2.54 (10.4%) 163.6% ( 144% - 183%) DD2 means drill down on 2 dims, DD1 means drill down on 1 dim. Hard means the 1 or 2 dims have high count, Easy means they have low count, and Mixed means one high and one low. OrDDX means I OR two values per dim. The new patch is especially faster for the OR case (ie, when you drill down on more than one value in a single dim), I think because it handles it directly instead of recursing into another BQ.
        Hide
        Nicola Buso added a comment -

        Hi,

        can you change DrillSideways class so that accept optionally some kind of factory for the facet collector? so that we can decide outside DrillSideways if we want a different implementation of the FacetCollector to be used!

        Nicola.

        Show
        Nicola Buso added a comment - Hi, can you change DrillSideways class so that accept optionally some kind of factory for the facet collector? so that we can decide outside DrillSideways if we want a different implementation of the FacetCollector to be used! Nicola.
        Hide
        Michael McCandless added a comment -

        can you change DrillSideways class so that accept optionally some kind of factory for the facet collector?

        Hmm you mean instead of calling FacetsCollector.create? Can you describe the use case behind this? (Hmm maybe FacetSearchParams already exposes a way to control which collector is created?).

        Show
        Michael McCandless added a comment - can you change DrillSideways class so that accept optionally some kind of factory for the facet collector? Hmm you mean instead of calling FacetsCollector.create? Can you describe the use case behind this? (Hmm maybe FacetSearchParams already exposes a way to control which collector is created?).
        Hide
        Nicola Buso added a comment -

        I have this case (I'm sure I'm not alone):
        http://markmail.org/search/?q=FacetRequest%20include%20residue#query:FacetRequest%20include%20residue%20list%3Aorg.apache.lucene.java-user+page:1+mid:2qgovhht5miqmhlk+state:results

        I implemented this creating a facetcollector that take count of the selection the user has done. I've not the 4.2 version where you introduced FacetsCollector.create(..) method; I will check there if I find something interesting.

        Show
        Nicola Buso added a comment - I have this case (I'm sure I'm not alone): http://markmail.org/search/?q=FacetRequest%20include%20residue#query:FacetRequest%20include%20residue%20list%3Aorg.apache.lucene.java-user+page:1+mid:2qgovhht5miqmhlk+state:results I implemented this creating a facetcollector that take count of the selection the user has done. I've not the 4.2 version where you introduced FacetsCollector.create(..) method; I will check there if I find something interesting.
        Hide
        Michael McCandless added a comment -

        Ahh OK I see. This is like LinkedIn's faceted UI, where you can enter a custom name under each dimension, if you don't see the value in the top N already displayed. EG http://www.linkedin.com/search/fpsearch?type=people&keywords=lisa&pplSearchOrigin=GLHD&pageKey=fps_results

        I think you cannot do this with just FacetSearchParams today ... I'll update the patch to make this possible. Maybe we need a factory to create FacetArrays ... or maybe we go back to non-static class and you subclass & override a method to create each FacetArray ...

        Show
        Michael McCandless added a comment - Ahh OK I see. This is like LinkedIn's faceted UI, where you can enter a custom name under each dimension, if you don't see the value in the top N already displayed. EG http://www.linkedin.com/search/fpsearch?type=people&keywords=lisa&pplSearchOrigin=GLHD&pageKey=fps_results I think you cannot do this with just FacetSearchParams today ... I'll update the patch to make this possible. Maybe we need a factory to create FacetArrays ... or maybe we go back to non-static class and you subclass & override a method to create each FacetArray ...
        Hide
        Shai Erera added a comment -

        I took a brief look at the path and there are two FacetCollectors used, for the drill-down counting and for each "sideway". Perhaps DrillSideways can indeed be an instance class, with a protected getFacetsAccumulator() method? The collector itself is not so important, it only collects matching documents. The accumulator does the heavy lifting and controls the FacetArrays. That way, Nicola can override and provide his own FacetArrays instance?

        Are there benefits in having this class offer static methods?

        Show
        Shai Erera added a comment - I took a brief look at the path and there are two FacetCollectors used, for the drill-down counting and for each "sideway". Perhaps DrillSideways can indeed be an instance class, with a protected getFacetsAccumulator() method? The collector itself is not so important, it only collects matching documents. The accumulator does the heavy lifting and controls the FacetArrays. That way, Nicola can override and provide his own FacetArrays instance? Are there benefits in having this class offer static methods?
        Hide
        Michael McCandless added a comment -

        Perhaps DrillSideways can indeed be an instance class, with a protected getFacetsAccumulator() method?

        +1

        I think I'd add separate methods for getting drill-down and drill-sideways Accumulator.

        But, what should the default impl be? Should I factor out that code from FacetsCollector.create that carefully picks either StandardFA or FA (maybe into a new FacetsAccumulator.create)?

        Show
        Michael McCandless added a comment - Perhaps DrillSideways can indeed be an instance class, with a protected getFacetsAccumulator() method? +1 I think I'd add separate methods for getting drill-down and drill-sideways Accumulator. But, what should the default impl be? Should I factor out that code from FacetsCollector.create that carefully picks either StandardFA or FA (maybe into a new FacetsAccumulator.create)?
        Hide
        Michael McCandless added a comment -

        New patch, w/ changes described above.

        Show
        Michael McCandless added a comment - New patch, w/ changes described above.
        Hide
        Shai Erera added a comment -

        Great work Mike!

        Few comments:

        • The CHANGES line should not remove 'static methods' right?
        • Can you add jdoc to FA.create(), something like "returns FA if all requests are CountFR, StandardFA otherwise"?
          • And while on that, fix FacetsCollector.create to say "calls create(FA) by creating FA using FA.create" ... something like that
        • This "// TODO: remove this limitation: it's silly?" – can we handle it now? It's odd that we add a 'silly' limitation . If we don't want the app to do it, then just throw the exception, and remove the TODO. Otherwise allow it?
          • Same goes for this "// TODO: remove this limitation (allow pure browse".
          • Unless, it's not easy to handle these TODOs now.
        • This "for(int i=0;i<fsp.facetRequests.size();i++) {" can be converted to Java 5-style iteration?
        • getDrillDown/SidewaysAccumulator – should the jdoc say "Override ... how the FA is created"? Or maybe "Override to use a custom FA by FacetsCollector"?

        Other than that, +1 to commit!

        Show
        Shai Erera added a comment - Great work Mike! Few comments: The CHANGES line should not remove 'static methods' right? Can you add jdoc to FA.create(), something like "returns FA if all requests are CountFR, StandardFA otherwise"? And while on that, fix FacetsCollector.create to say "calls create(FA) by creating FA using FA.create" ... something like that This "// TODO: remove this limitation: it's silly?" – can we handle it now? It's odd that we add a 'silly' limitation . If we don't want the app to do it, then just throw the exception, and remove the TODO. Otherwise allow it? Same goes for this "// TODO: remove this limitation (allow pure browse". Unless, it's not easy to handle these TODOs now. This "for(int i=0;i<fsp.facetRequests.size();i++) {" can be converted to Java 5-style iteration? getDrillDown/SidewaysAccumulator – should the jdoc say "Override ... how the FA is created"? Or maybe "Override to use a custom FA by FacetsCollector"? Other than that, +1 to commit!
        Hide
        Michael McCandless added a comment -

        Thanks Shai, I fixed those issues.

        For the first TODO, I removed the comment: I think an app should not make a DDQ, and no drill downs to it, and pass it to DS.

        For the 2nd TODO, I fixed it so the "pure browse" works, but I put another TODO to improve the implementation later ...

        Show
        Michael McCandless added a comment - Thanks Shai, I fixed those issues. For the first TODO, I removed the comment: I think an app should not make a DDQ, and no drill downs to it, and pass it to DS. For the 2nd TODO, I fixed it so the "pure browse" works, but I put another TODO to improve the implementation later ...
        Hide
        Shai Erera added a comment -

        Should DSQ perhaps be in its own class? Just to improve DS readability...

        Show
        Shai Erera added a comment - Should DSQ perhaps be in its own class? Just to improve DS readability...
        Hide
        Michael McCandless added a comment -

        Good idea! New patch pulling [package private] DSQ out.

        Show
        Michael McCandless added a comment - Good idea! New patch pulling [package private] DSQ out.
        Hide
        Commit Tag Bot added a comment -

        [trunk commit] Michael McCandless
        http://svn.apache.org/viewvc?view=revision&revision=1449972

        LUCENE-4748: add DrillSideways utility class to facets module

        Show
        Commit Tag Bot added a comment - [trunk commit] Michael McCandless http://svn.apache.org/viewvc?view=revision&revision=1449972 LUCENE-4748 : add DrillSideways utility class to facets module
        Hide
        Commit Tag Bot added a comment -

        [branch_4x commit] Michael McCandless
        http://svn.apache.org/viewvc?view=revision&revision=1449973

        LUCENE-4748: add DrillSideways utility class to facets module

        Show
        Commit Tag Bot added a comment - [branch_4x commit] Michael McCandless http://svn.apache.org/viewvc?view=revision&revision=1449973 LUCENE-4748 : add DrillSideways utility class to facets module
        Hide
        Uwe Schindler added a comment -

        Closed after release.

        Show
        Uwe Schindler added a comment - Closed after release.

          People

          • Assignee:
            Michael McCandless
            Reporter:
            Michael McCandless
          • Votes:
            1 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development