Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.3
    • Component/s: None
    • Labels:
      None

      Description

      Grouping was recently added to Lucene 3x. See LUCENE-1421 for more information.
      I think it would be nice if we expose this functionality also to the Solr users that are bound to a 3.x version.
      The grouping feature added to Lucene is currently a subset of the functionality that Solr 4.0-trunk offers. Mainly it doesn't support grouping by function / query.

      The work involved getting the grouping contrib to work on Solr 3x is acceptable. I have it more or less running here. It supports the response format and request parameters (expect: group.query and group.func) described in the FieldCollapse page on the Solr wiki.
      I think it would be great if this is included in the Solr 3.2 release. Many people are using grouping as patch now and this would help them a lot. Any thoughts?

      1. SOLR-2524.patch
        78 kB
        Martijn van Groningen
      2. SOLR-2524.patch
        66 kB
        Martijn van Groningen
      3. SOLR-2524.patch
        65 kB
        Martijn van Groningen
      4. SOLR-2524.patch
        63 kB
        Martijn van Groningen
      5. SOLR-2524.patch
        62 kB
        Martijn van Groningen
      6. SOLR-2524.patch
        24 kB
        Martijn van Groningen

        Issue Links

          Activity

          Hide
          thomas menzel added a comment -

          > 4. ... I believe the last group you specify (field, func or query) is actually returned / executed.

          i have found that this is the case as well. however, the wiki says (said):

          If true, the result of the first field grouping command is used as the main result list in the response, using group.format=simple

          i have fixed the wiki to: last

          Show
          thomas menzel added a comment - > 4. ... I believe the last group you specify (field, func or query) is actually returned / executed. i have found that this is the case as well. however, the wiki says (said): If true, the result of the first field grouping command is used as the main result list in the response, using group.format=simple i have fixed the wiki to: last
          Hide
          Martijn van Groningen added a comment -

          I think the best place to ask questions like this is the solr-user list.
          Here are my answers:
          1. The fq filters the query result (q param) and if grouping is used all groups (fields, functions and queries). In your example this means that you have one group result with documents the match on things and type:video. Maybe the group.query can best be seen as a subquery of the main query (q & fq parameters).
          2. I'm not sure what you mean here. So based on the order in the query result you want to group?
          3. Well it should obay the sort param. If this is not the case then I think it is a bug.
          4. The group.main parameter renders the regular ungrouped result set. In this format only one group can be displayed. I believe the last group you specify (field, func or query) is actually returned / executed.

          Show
          Martijn van Groningen added a comment - I think the best place to ask questions like this is the solr-user list. Here are my answers: 1. The fq filters the query result (q param) and if grouping is used all groups (fields, functions and queries). In your example this means that you have one group result with documents the match on things and type:video. Maybe the group.query can best be seen as a subquery of the main query (q & fq parameters). 2. I'm not sure what you mean here. So based on the order in the query result you want to group? 3. Well it should obay the sort param. If this is not the case then I think it is a bug. 4. The group.main parameter renders the regular ungrouped result set. In this format only one group can be displayed. I believe the last group you specify (field, func or query) is actually returned / executed.
          Hide
          Cameron added a comment -

          Trying out this functionality. Not sure this is the place, but a couple more questions:

          1. How is group.query different from fq? For example, by adding: q=things&group=true&group.query=type:video, I'd expect all docs with type:video to be contained in a group, and the rest of the docs to fill out the rest of the results. Currently it behaves like fq, only returning a single group of docs

          2. Is there a way to have the result set "wrap" around a group? For example, if results 1-3 are non-grouped (the null group), then results 4-6 are grouped (type:video), and the rest are non-grouped again, assuming this is where they fall by score.

          3. group.query doesn't appear to obey the sort param?

          4. Specifying multiple group.query entries doesn't appear to work with group.main=true

          Again, sorry if this isn't the correct forum, but if either of the above qualify as new functionality, I can properly create new tickets for them individually

          Show
          Cameron added a comment - Trying out this functionality. Not sure this is the place, but a couple more questions: 1. How is group.query different from fq? For example, by adding: q=things&group=true&group.query=type:video, I'd expect all docs with type:video to be contained in a group, and the rest of the docs to fill out the rest of the results. Currently it behaves like fq, only returning a single group of docs 2. Is there a way to have the result set "wrap" around a group? For example, if results 1-3 are non-grouped (the null group), then results 4-6 are grouped (type:video), and the rest are non-grouped again, assuming this is where they fall by score. 3. group.query doesn't appear to obey the sort param? 4. Specifying multiple group.query entries doesn't appear to work with group.main=true Again, sorry if this isn't the correct forum, but if either of the above qualify as new functionality, I can properly create new tickets for them individually
          Hide
          Yuriy Akopov added a comment -

          Thanks Michael and Martij. I'll wait for that new release then.

          Show
          Yuriy Akopov added a comment - Thanks Michael and Martij. I'll wait for that new release then.
          Hide
          Martijn van Groningen added a comment -

          All 3 are correct. In order for post grouping facets LUCENE-3097 must be applied and it must be wired up into Solr. I expect this to happen in the coming days.

          Show
          Martijn van Groningen added a comment - All 3 are correct. In order for post grouping facets LUCENE-3097 must be applied and it must be wired up into Solr. I expect this to happen in the coming days.
          Hide
          Michael McCandless added a comment -

          I believe all 3 are correct.

          Show
          Michael McCandless added a comment - I believe all 3 are correct.
          Hide
          Yuriy Akopov added a comment -

          I suppose I'm late with these questions, but could you please acknowledge if the following is correct:

          1) The functionality from this patch was included into Solr 3.3, so no need to apply it to any version >= 3.3

          2) This patch (as well as the collapsing functionality in 3.3) doesn't allow calculation of facet numbers after collapsing. Faceting is still possible for collapsed results but the numbers returned for facets are always calculated before collapsing the results.

          3) In order to calculate facets after collapsing, LUCENE-3097 must be applied to Solr 3.3.

          Thanks.

          Show
          Yuriy Akopov added a comment - I suppose I'm late with these questions, but could you please acknowledge if the following is correct: 1) The functionality from this patch was included into Solr 3.3, so no need to apply it to any version >= 3.3 2) This patch (as well as the collapsing functionality in 3.3) doesn't allow calculation of facet numbers after collapsing. Faceting is still possible for collapsed results but the numbers returned for facets are always calculated before collapsing the results. 3) In order to calculate facets after collapsing, LUCENE-3097 must be applied to Solr 3.3. Thanks.
          Hide
          Robert Muir added a comment -

          Bulk close for 3.3

          Show
          Robert Muir added a comment - Bulk close for 3.3
          Hide
          Michael McCandless added a comment -

          Question: Does this support the option of getting facet counts after grouping? I am getting lost in all the issues....

          I don't think it does. For that we need LUCENE-3097, which I think is close.

          Show
          Michael McCandless added a comment - Question: Does this support the option of getting facet counts after grouping? I am getting lost in all the issues.... I don't think it does. For that we need LUCENE-3097 , which I think is close.
          Hide
          Bill Bell added a comment -

          Question: Does this support the option of getting facet counts after grouping? I am getting lost in all the issues....

          Thank you...

          Show
          Bill Bell added a comment - Question: Does this support the option of getting facet counts after grouping? I am getting lost in all the issues.... Thank you...
          Hide
          Martijn van Groningen added a comment -

          Committed in revision 1137067.

          Show
          Martijn van Groningen added a comment - Committed in revision 1137067.
          Hide
          Martijn van Groningen added a comment -

          Updated patch to what is created in SOLR-2564. The only difference is that when grouping by field (group.field) and the field isn't a string field (String, TextField or another that can produce a StringIndex from fieldcache) then a SolrException is thrown indicating that grouping on that field is not supported. This is to prevent double cache usage.

          I think it is time to commit this patch!

          Show
          Martijn van Groningen added a comment - Updated patch to what is created in SOLR-2564 . The only difference is that when grouping by field (group.field) and the field isn't a string field (String, TextField or another that can produce a StringIndex from fieldcache) then a SolrException is thrown indicating that grouping on that field is not supported. This is to prevent double cache usage. I think it is time to commit this patch!
          Hide
          Bill Bell added a comment -

          Can you add the number of matches to the log?

          SOLR-2337

          Show
          Bill Bell added a comment - Can you add the number of matches to the log? SOLR-2337
          Hide
          Bill Bell added a comment -

          Let's do it.

          Show
          Bill Bell added a comment - Let's do it.
          Hide
          Michael McCandless added a comment -

          I was using this patch last week and didn't encounter problems.

          Excellent – thank you for testing

          Why is it actually blocked by LUCENE-3099, vs. merely stands to benefit from improvements there when it happens?

          Well... Hoss is (rightly) nervous that we have divergent grouping
          impls in trunk vs 3.x.

          There are non-trivial differences... 3.x improves on trunk's impl:

          • Optionally uses CachingCollector, so certain queries will be
            substantially faster (adds new request param
            group.cache.maxSizeMB)
          • Optionally it's able to get total count of number of unique
            groups (using AllGroupsCollector).

          But, also, 3.x uses way more RAM than trunk's impl when grouping by a
          non-term field, because the grouping module currently always pulls a
          StringIndex (but LUCENE-3099 will fix this!).

          Hoss also wanted to see all of Solr trunk's grouping tests backported,
          and that is now done (thanks Martijn).

          We could simply commit anyway, for 3.3, on the strong expectation that
          we are going to cut trunk over to the grouping module before shipping
          4.0...? Any objections? This way 3.3 will have grouping, and we let
          the feature "bake" on Jenkins daily testing instead of aging as a patch...

          Show
          Michael McCandless added a comment - I was using this patch last week and didn't encounter problems. Excellent – thank you for testing Why is it actually blocked by LUCENE-3099 , vs. merely stands to benefit from improvements there when it happens? Well... Hoss is (rightly) nervous that we have divergent grouping impls in trunk vs 3.x. There are non-trivial differences... 3.x improves on trunk's impl: Optionally uses CachingCollector, so certain queries will be substantially faster (adds new request param group.cache.maxSizeMB) Optionally it's able to get total count of number of unique groups (using AllGroupsCollector). But, also, 3.x uses way more RAM than trunk's impl when grouping by a non-term field, because the grouping module currently always pulls a StringIndex (but LUCENE-3099 will fix this!). Hoss also wanted to see all of Solr trunk's grouping tests backported, and that is now done (thanks Martijn). We could simply commit anyway, for 3.3, on the strong expectation that we are going to cut trunk over to the grouping module before shipping 4.0...? Any objections? This way 3.3 will have grouping, and we let the feature "bake" on Jenkins daily testing instead of aging as a patch...
          Hide
          David Smiley added a comment -

          I was using this patch last week and didn't encounter problems. Why is it actually blocked by LUCENE-3099, vs. merely stands to benefit from improvements there when it happens?

          Show
          David Smiley added a comment - I was using this patch last week and didn't encounter problems. Why is it actually blocked by LUCENE-3099 , vs. merely stands to benefit from improvements there when it happens?
          Hide
          Martijn van Groningen added a comment -

          Hi David, I think that as well. I removed the affected version and changed fix version to 3.3, since this issue is blocked by LUCENE-3099. I don't think we can fix this before 3.2 is released, since the vote has already started.

          Show
          Martijn van Groningen added a comment - Hi David, I think that as well. I removed the affected version and changed fix version to 3.3, since this issue is blocked by LUCENE-3099 . I don't think we can fix this before 3.2 is released, since the vote has already started.
          Hide
          David Smiley added a comment -

          Shouldn't the "fix version" be 3.2 ? It appears someone inadvertently used 3.2 for the "affects version" which is really more of a bug field than for new features.

          Show
          David Smiley added a comment - Shouldn't the "fix version" be 3.2 ? It appears someone inadvertently used 3.2 for the "affects version" which is really more of a bug field than for new features.
          Hide
          Martijn van Groningen added a comment -

          Solr 3.x is able to return the total number of unique groups (using AllGroupsCollector), right? (Ahh, I see: using the new group.totalCount parameter).

          Yes That is the right parameter.

          Show
          Martijn van Groningen added a comment - Solr 3.x is able to return the total number of unique groups (using AllGroupsCollector), right? (Ahh, I see: using the new group.totalCount parameter). Yes That is the right parameter.
          Hide
          Michael McCandless added a comment -

          Martijn, with this patch, Solr 3.x is able to return the total number of unique groups (using AllGroupsCollector), right? (Ahh, I see: using the new group.totalCount parameter).

          Show
          Michael McCandless added a comment - Martijn, with this patch, Solr 3.x is able to return the total number of unique groups (using AllGroupsCollector), right? (Ahh, I see: using the new group.totalCount parameter).
          Hide
          Martijn van Groningen added a comment -

          Attached updated patch. The random grouping test also includes the use of cache.

          Show
          Martijn van Groningen added a comment - Attached updated patch. The random grouping test also includes the use of cache.
          Hide
          Michael McCandless added a comment -

          Right – the grouping module today always uses StringIndex/DocTermsIndex from the FC.

          But once LUCENE-3099 is in, Solr can subclass the collectors and use MutableValue which pulls the "right" (typed according to the schema) entry from FC, so we can avoid the FC insanity.

          We need LUCENE-3099 anyway to be able to group by FQ (and thus cutover Solr trunk to grouping module too)....

          Show
          Michael McCandless added a comment - Right – the grouping module today always uses StringIndex/DocTermsIndex from the FC. But once LUCENE-3099 is in, Solr can subclass the collectors and use MutableValue which pulls the "right" (typed according to the schema) entry from FC, so we can avoid the FC insanity. We need LUCENE-3099 anyway to be able to group by FQ (and thus cutover Solr trunk to grouping module too)....
          Hide
          Martijn van Groningen added a comment -

          In order to avoid double FC usage we need to make the GroupCollectors instantiate the proper FieldCache impl.

          Show
          Martijn van Groningen added a comment - In order to avoid double FC usage we need to make the GroupCollectors instantiate the proper FieldCache impl.
          Hide
          Martijn van Groningen added a comment -

          Attached a new patch with the random test succeeding. You need to have SOLR-2543 applied as well on your code base.

          The only problem that I still have is that I have to invoke deleteCore() method at the end of the random test. If don't do that I get an Insane FieldCache failure.

          Show
          Martijn van Groningen added a comment - Attached a new patch with the random test succeeding. You need to have SOLR-2543 applied as well on your code base. The only problem that I still have is that I have to invoke deleteCore() method at the end of the random test. If don't do that I get an Insane FieldCache failure.
          Hide
          Michael McCandless added a comment -

          I do agree this is a real risk, Hoss. Ie, if 3.x Solr uses the
          grouping module but trunk does not, it's not great.

          For example, the grouping module is already faster than trunk Solr's
          grouping impl (because of CachingCollector), and also adds a new
          request param (groups.cacheMaxMB). Once LUCENE-3097 is in, we can
          enable 3.x Solr to count facets post-grouping, which'll be another new
          grouping request param. The single pass grouping collector is another
          good improvement (though I don't know how Solr will take advantage of
          this... it'd need control over "doc blocks" while indexing, which'd be
          a major change). The two impls will continue to diverge...

          That said, the plan is definitely to get Solr 4.0 cutover to the
          grouping module; it's just a matter of time. I don't think we should
          ship 4.0 until we've done so.

          The only missing piece now is grouping by a FunctionQuery's DocValues.
          If we do LUCENE-3099 (let a subclass customize the grouping "key"), or
          if we factor out the shared FQ module (LUCENE-2883), then we can
          cutover Solr trunk.

          Show
          Michael McCandless added a comment - I do agree this is a real risk, Hoss. Ie, if 3.x Solr uses the grouping module but trunk does not, it's not great. For example, the grouping module is already faster than trunk Solr's grouping impl (because of CachingCollector), and also adds a new request param (groups.cacheMaxMB). Once LUCENE-3097 is in, we can enable 3.x Solr to count facets post-grouping, which'll be another new grouping request param. The single pass grouping collector is another good improvement (though I don't know how Solr will take advantage of this... it'd need control over "doc blocks" while indexing, which'd be a major change). The two impls will continue to diverge... That said, the plan is definitely to get Solr 4.0 cutover to the grouping module; it's just a matter of time. I don't think we should ship 4.0 until we've done so. The only missing piece now is grouping by a FunctionQuery's DocValues. If we do LUCENE-3099 (let a subclass customize the grouping "key"), or if we factor out the shared FQ module ( LUCENE-2883 ), then we can cutover Solr trunk.
          Hide
          Martijn van Groningen added a comment -

          I created SOLR-2543 for back porting the trunk's version of SolrTestCaseJ4.
          Unfortunately the random grouping tests fails now. If I have fixed this I will update the patch.

          Show
          Martijn van Groningen added a comment - I created SOLR-2543 for back porting the trunk's version of SolrTestCaseJ4. Unfortunately the random grouping tests fails now. If I have fixed this I will update the patch.
          Hide
          Martijn van Groningen added a comment -

          So once Lucene is patched to support POST facets in LUCENE-3097, are you planning on adding that into this (or a new ticket for Solr) ?

          Yes I'm!

          Maybe we should open a separate issue for this?

          I'll open a separate issue for back porting the SolrTestCaseJ4 random testing behaviour in 3x.

          Show
          Martijn van Groningen added a comment - So once Lucene is patched to support POST facets in LUCENE-3097 , are you planning on adding that into this (or a new ticket for Solr) ? Yes I'm! Maybe we should open a separate issue for this? I'll open a separate issue for back porting the SolrTestCaseJ4 random testing behaviour in 3x.
          Hide
          Robert Muir added a comment -

          There is a little random testing framework in SolrTestCaseJ4, but only in trunk.

          I tried really quick (just a few minutes) to backport it and started chasing crazy things, so I agree we should open a separate issue for this.

          Show
          Robert Muir added a comment - There is a little random testing framework in SolrTestCaseJ4, but only in trunk. I tried really quick (just a few minutes) to backport it and started chasing crazy things, so I agree we should open a separate issue for this.
          Hide
          Michael McCandless added a comment -

          The only part I couldn't port was the random testing. The API is different in 3x.

          Martijn, could you shed some more light on this one...? Like is there something in trunk's test infra that we should back port to 3.x...? Maybe we should open a separate issue for this?

          Show
          Michael McCandless added a comment - The only part I couldn't port was the random testing. The API is different in 3x. Martijn, could you shed some more light on this one...? Like is there something in trunk's test infra that we should back port to 3.x...? Maybe we should open a separate issue for this?
          Hide
          Bill Bell added a comment -

          So once Lucene is patched to support POST facets in LUCENE-3097, are you planning on adding that into this (or a new ticket for Solr) ?

          Thanks

          Show
          Bill Bell added a comment - So once Lucene is patched to support POST facets in LUCENE-3097 , are you planning on adding that into this (or a new ticket for Solr) ? Thanks
          Hide
          Martijn van Groningen added a comment -

          Hi Hoss,

          is backporting what solr already uses on trunk to the 3x branch out of the question?

          Many folks have grouping like requirements. Nowadays they either have to use the trunk or patch Solr. I think that having grouping in a released version would be great.

          we had identical Solr tests (at the request api level) on trunk and 3x to help sanity check that the two impls behave the same way

          I copied the grouping test from trunk to 3x. Only the random tests are not enabled. The general super test class (that other tests also use) is not compatible between trunk and 3x.

          the folks working on the grouping refactoring felt confident that by the time we get arround to releasing 4.0, the grouping refactoring would be at the point that the 3.2 impl and the 4.0 impl would be equivalent.

          I totally agree. I haven't opened a Solr issue yet, but I will do that soon. Basically this new issue will be concerned with incorporating the grouping module into Solr trunk without the loss of the current grouping functionality. I think the at the time 4.0 is released the feature set Solr supports regarding to grouping might be larger then what is in a previous 3.x release. We can keep the response format and request parameters backward compatible.

          Show
          Martijn van Groningen added a comment - Hi Hoss, is backporting what solr already uses on trunk to the 3x branch out of the question? Many folks have grouping like requirements. Nowadays they either have to use the trunk or patch Solr. I think that having grouping in a released version would be great. we had identical Solr tests (at the request api level) on trunk and 3x to help sanity check that the two impls behave the same way I copied the grouping test from trunk to 3x. Only the random tests are not enabled. The general super test class (that other tests also use) is not compatible between trunk and 3x. the folks working on the grouping refactoring felt confident that by the time we get arround to releasing 4.0, the grouping refactoring would be at the point that the 3.2 impl and the 4.0 impl would be equivalent. I totally agree. I haven't opened a Solr issue yet, but I will do that soon. Basically this new issue will be concerned with incorporating the grouping module into Solr trunk without the loss of the current grouping functionality. I think the at the time 4.0 is released the feature set Solr supports regarding to grouping might be larger then what is in a previous 3.x release. We can keep the response format and request parameters backward compatible.
          Hide
          Hoss Man added a comment -

          Solr trunk already has its own grouping implementation, from which we've factored out a shared grouping module (see LUCENE-1421). That module is available in trunk and 3.x, but since Solr already has grouping in trunk, and it has more features than the grouping module (specifically, that you can group by docvalues derived from a function query and by arbitrary query), Solr trunk will for now keep its own private impl.

          FWIW: I'm a little wary of the idea that we might wind up with an alternate approach to the "grouping" functionality released in 3.2 then what would then later be released in 4.0 ... i haven't looked at the approach in branch enough to understand how they differ, but i'm concerned about the hypothetical possibilities that they might have subtly differnet behavior in edge cases, or different perf characteristics, or that 3.2 might add a feature that is hard to support in 4.0, etc....

          I say this only to raise it as a red flag to watch out for – not because i want to stop the progress on this issue.

          The first question that sprang to mind when i saw this issue was: is backporting what solr already uses on trunk to the 3x branch out of the question?

          assuming it is, then i guess the main thing that would help ease my fears are if:

          1) we had identical Solr tests (at the request api level) on trunk and 3x to help sanity check that the two impls behave the same way
          2) the folks working on the grouping refactoring felt confident that by the time we get arround to releasing 4.0, the grouping refactoring would be at the point that the 3.2 impl and the 4.0 impl would be equivalent.

          Show
          Hoss Man added a comment - Solr trunk already has its own grouping implementation, from which we've factored out a shared grouping module (see LUCENE-1421 ). That module is available in trunk and 3.x, but since Solr already has grouping in trunk, and it has more features than the grouping module (specifically, that you can group by docvalues derived from a function query and by arbitrary query), Solr trunk will for now keep its own private impl. FWIW: I'm a little wary of the idea that we might wind up with an alternate approach to the "grouping" functionality released in 3.2 then what would then later be released in 4.0 ... i haven't looked at the approach in branch enough to understand how they differ, but i'm concerned about the hypothetical possibilities that they might have subtly differnet behavior in edge cases, or different perf characteristics, or that 3.2 might add a feature that is hard to support in 4.0, etc.... I say this only to raise it as a red flag to watch out for – not because i want to stop the progress on this issue. The first question that sprang to mind when i saw this issue was: is backporting what solr already uses on trunk to the 3x branch out of the question? assuming it is, then i guess the main thing that would help ease my fears are if: 1) we had identical Solr tests (at the request api level) on trunk and 3x to help sanity check that the two impls behave the same way 2) the folks working on the grouping refactoring felt confident that by the time we get arround to releasing 4.0, the grouping refactoring would be at the point that the 3.2 impl and the 4.0 impl would be equivalent.
          Hide
          Simon Willnauer added a comment -

          Looks great – I'll commit in a day or two. Thanks Martijn!

          +1 - nice work guys

          Show
          Simon Willnauer added a comment - Looks great – I'll commit in a day or two. Thanks Martijn! +1 - nice work guys
          Hide
          Michael McCandless added a comment -

          Looks great – I'll commit in a day or two. Thanks Martijn!

          Show
          Michael McCandless added a comment - Looks great – I'll commit in a day or two. Thanks Martijn!
          Hide
          Martijn van Groningen added a comment -

          Attached an updated patch. I added an entry in Solr's CHANGES.txt

          Show
          Martijn van Groningen added a comment - Attached an updated patch. I added an entry in Solr's CHANGES.txt
          Hide
          Michael McCandless added a comment -

          Patch looks awesome Martijn!

          And good job getting group-by-Query added back in! So it's only
          missing group-by-function-query-docvalues vs trunk.

          Can you add a Solr CHANGES entry? This is a great addition
          (finally!).

          Then I think it's ready to commit once we resolve Bill's issue
          above...

          Show
          Michael McCandless added a comment - Patch looks awesome Martijn! And good job getting group-by-Query added back in! So it's only missing group-by-function-query-docvalues vs trunk. Can you add a Solr CHANGES entry? This is a great addition (finally!). Then I think it's ready to commit once we resolve Bill's issue above...
          Hide
          Martijn van Groningen added a comment -

          Will this also be applied to 4.0 and the 3.2 branch?

          This patch will be applied on the 3x branch. This patch serves as basis for the work needed to make the trunk use the grouping module. But I think that will be addressed in a different issue.

          OK. I am trying to understand group.totalCount=grouped... I am not seeing the facets change in Solr.

          That is because the group.totalCount parameter only has effect on the total count and not the facets.

          So executing the same query with group.totalCount=GROUPED:

          <lst name="grouped">
             <lst name="inStock">
               <int name="matches">2</int>
               <arr name="groups">
                  <lst>
                     <bool name="groupValue">true</bool>
          

          So executing the same query with group.totalCount=UNGROUPED (default):

          <lst name="grouped">
             <lst name="inStock">
                <int name="matches">17</int>
                   <arr name="groups">
                      <lst>
                         <bool name="groupValue">true</bool>
          

          So having facets based on groups is the next step . I haven't implemented this yet. But I will properly use the group.docSet parameter for that. Because it will not only have a effect on the FacetComponent, but also the StatsComponent. I think we should first focus on getting the current patch committed. And then tackle this issue. Also to implement this we also need a other group collector.

          Show
          Martijn van Groningen added a comment - Will this also be applied to 4.0 and the 3.2 branch? This patch will be applied on the 3x branch. This patch serves as basis for the work needed to make the trunk use the grouping module. But I think that will be addressed in a different issue. OK. I am trying to understand group.totalCount=grouped... I am not seeing the facets change in Solr. That is because the group.totalCount parameter only has effect on the total count and not the facets. So executing the same query with group.totalCount=GROUPED: <lst name= "grouped" > <lst name= "inStock" > <int name= "matches" > 2 </int> <arr name= "groups" > <lst> <bool name= "groupValue" > true </bool> So executing the same query with group.totalCount=UNGROUPED (default): <lst name= "grouped" > <lst name= "inStock" > <int name= "matches" > 17 </int> <arr name= "groups" > <lst> <bool name= "groupValue" > true </bool> So having facets based on groups is the next step . I haven't implemented this yet. But I will properly use the group.docSet parameter for that. Because it will not only have a effect on the FacetComponent, but also the StatsComponent. I think we should first focus on getting the current patch committed. And then tackle this issue. Also to implement this we also need a other group collector.
          Hide
          Michael McCandless added a comment -

          Will this also be applied to 4.0 and the 3.2 branch?

          So the current plan is to commit this issue only for 3.2.

          Solr trunk already has its own grouping implementation, from which we've factored out a shared grouping module (see LUCENE-1421). That module is available in trunk and 3.x, but since Solr already has grouping in trunk, and it has more features than the grouping module (specifically, that you can group by docvalues derived from a function query and by arbitrary query), Solr trunk will for now keep its own private impl.

          Once we've factored out more stuff from Solr (function queries, LUCENE-2883, and I think also filter caches) then we'll fix the grouping module and also cutover Solr trunk to it. This is the current thinking anyway...

          Show
          Michael McCandless added a comment - Will this also be applied to 4.0 and the 3.2 branch? So the current plan is to commit this issue only for 3.2. Solr trunk already has its own grouping implementation, from which we've factored out a shared grouping module (see LUCENE-1421 ). That module is available in trunk and 3.x, but since Solr already has grouping in trunk, and it has more features than the grouping module (specifically, that you can group by docvalues derived from a function query and by arbitrary query), Solr trunk will for now keep its own private impl. Once we've factored out more stuff from Solr (function queries, LUCENE-2883 , and I think also filter caches) then we'll fix the grouping module and also cutover Solr trunk to it. This is the current thinking anyway...
          Hide
          Bill Bell added a comment -

          OK. I am trying to understand group.totalCount=grouped... I am not seeing the facets change in Solr.

          Using latest 3.2 branch, and patching it with your latest patch:

          http://localhost:8983/solr/select?q=*:*&group.field=inStock&group=true&facet=true&facet.field=manu&group.docSet=grouped&group.totalCount=GROUPED

          
          <lst name="facet_counts">
            <lst name="facet_queries"/>
              <lst name="facet_fields">
                <lst name="manu">
                 <int name="inc">8</int>
                 <int name="apache">2</int>
                 <int name="belkin">2</int>
                 <int name="canon">2</int>
          etc...
          

          But in the result set above, I only see the following once:

          <str name="manu">Belkin</str>
          

          I would assume the results for this field would be:

          
          <lst name="facet_counts">
            <lst name="facet_queries"/>
              <lst name="facet_fields">
                <lst name="manu">
                 <int name="belkin">1</int>
          etc...
          

          Since it it grouped, and only belkin shows... Is there a parameter to do that?

          Show
          Bill Bell added a comment - OK. I am trying to understand group.totalCount=grouped... I am not seeing the facets change in Solr. Using latest 3.2 branch, and patching it with your latest patch: http://localhost:8983/solr/select?q=*:*&group.field=inStock&group=true&facet=true&facet.field=manu&group.docSet=grouped&group.totalCount=GROUPED <lst name= "facet_counts" > <lst name= "facet_queries" /> <lst name= "facet_fields" > <lst name= "manu" > < int name= "inc" >8</ int > < int name= "apache" >2</ int > < int name= "belkin" >2</ int > < int name= "canon" >2</ int > etc... But in the result set above, I only see the following once: <str name= "manu" >Belkin</str> I would assume the results for this field would be: <lst name= "facet_counts" > <lst name= "facet_queries" /> <lst name= "facet_fields" > <lst name= "manu" > < int name= "belkin" >1</ int > etc... Since it it grouped, and only belkin shows... Is there a parameter to do that?
          Hide
          Bill Bell added a comment -

          Will this also be applied to 4.0 and the 3.2 branch?

          Show
          Bill Bell added a comment - Will this also be applied to 4.0 and the 3.2 branch?
          Hide
          Martijn van Groningen added a comment -

          Attached an updated patch.

          • Added cache warning
          • Added counting based on groups. This can be controlled by group.totalCount parameter. Use GROUPED to get counts based on groups and UNGROUPED to get counts based on plain documents. Default is UNGROUPED.
          • Ported the trunk grouping tests to 3x. The only part I couldn't port was the random testing. The API is different in 3x.
          • Added grouping by query (group.query). Porting this back to 3x was trivial.
          • Second pass caching is now enabled by default.
          • Changed .cache.maxSize into .cache.maxSizeMB
          Show
          Martijn van Groningen added a comment - Attached an updated patch. Added cache warning Added counting based on groups. This can be controlled by group.totalCount parameter. Use GROUPED to get counts based on groups and UNGROUPED to get counts based on plain documents. Default is UNGROUPED. Ported the trunk grouping tests to 3x. The only part I couldn't port was the random testing. The API is different in 3x. Added grouping by query (group.query). Porting this back to 3x was trivial. Second pass caching is now enabled by default. Changed .cache.maxSize into .cache.maxSizeMB
          Hide
          Michael McCandless added a comment -

          Was this a simple TermQuery

          No a MatchDocAllQuery (

          Ahh OK then that makes sense – MatchAllDocsQuery is a might fast query to execute So the work done to cache it is going to be slower.

          Show
          Michael McCandless added a comment - Was this a simple TermQuery No a MatchDocAllQuery ( Ahh OK then that makes sense – MatchAllDocsQuery is a might fast query to execute So the work done to cache it is going to be slower.
          Hide
          Martijn van Groningen added a comment -

          and maybe log a warning?

          I also think a log warning should be added too. If users don't want that they can reconfigure their logger not to log messages from the Grouping class.

          Was this a simple TermQuery

          No a MatchDocAllQuery (:). I usually use that to maximize the number documents to group on. Trying this on a simple term query that results in 8M documents; the query with cache is a little bit (~5%) faster. For boolean queries the cache does pay off. Grouping with cache is 20% faster than without. Off course these are just numbers from my machine and my test index.

          Is it somehow possible Solr is already caching the queries results itself...?

          Only the filterCache is used for any filter query in the fq param. The main query is not cached with the filterCache or QueryResultCache. This the same behaviour as in the trunk.

          Show
          Martijn van Groningen added a comment - and maybe log a warning? I also think a log warning should be added too. If users don't want that they can reconfigure their logger not to log messages from the Grouping class. Was this a simple TermQuery No a MatchDocAllQuery ( : ). I usually use that to maximize the number documents to group on. Trying this on a simple term query that results in 8M documents; the query with cache is a little bit (~5%) faster. For boolean queries the cache does pay off. Grouping with cache is 20% faster than without. Off course these are just numbers from my machine and my test index. Is it somehow possible Solr is already caching the queries results itself...? Only the filterCache is used for any filter query in the fq param. The main query is not cached with the filterCache or QueryResultCache. This the same behaviour as in the trunk.
          Hide
          Michael McCandless added a comment -

          I think that if the cachedCollector.isCached() returns false we should put something in the response indication that the cache wasn't used because it hit the cache.maxSizeMB limit. Otherwise the nobody will no whether the cache was utilized.

          +1, and maybe log a warning? Or is that going to be too much logging?

          When I was playing around with the cache options I noticed that searching without cache (~350 ms) was faster then with cache (~500 ms) on a 10M index with 1711 distinct group values. This is not what I'd expect.

          That is worrisome!! Was this a simple TermQuery? Is it somehow possible Solr is already caching the queries results itself...?

          I'm already thinking about a new collector that collects all most relevant documents of all groups. This collector should produce something like an OpenBitSet. We can use the OpenBitSet to create a DocSet. I think this should be implemented in a different issue.

          Cool!

          Show
          Michael McCandless added a comment - I think that if the cachedCollector.isCached() returns false we should put something in the response indication that the cache wasn't used because it hit the cache.maxSizeMB limit. Otherwise the nobody will no whether the cache was utilized. +1, and maybe log a warning? Or is that going to be too much logging? When I was playing around with the cache options I noticed that searching without cache (~350 ms) was faster then with cache (~500 ms) on a 10M index with 1711 distinct group values. This is not what I'd expect. That is worrisome!! Was this a simple TermQuery? Is it somehow possible Solr is already caching the queries results itself...? I'm already thinking about a new collector that collects all most relevant documents of all groups. This collector should produce something like an OpenBitSet. We can use the OpenBitSet to create a DocSet. I think this should be implemented in a different issue. Cool!
          Hide
          Martijn van Groningen added a comment -

          Maybe rename group.cache.maxSize -> .maxSizeMB? (So it's clear what the units are).

          Yes that is a more descriptive name.

          Should we default group.cache to true? (It's false now?).

          That makes sense.

          I think that if the cachedCollector.isCached() returns false we should put something in the response indication that the cache wasn't used because it hit the cache.maxSizeMB limit. Otherwise the nobody will no whether the cache was utilized.

          When I was playing around with the cache options I noticed that searching without cache (~350 ms) was faster then with cache (~500 ms) on a 10M index with 1711 distinct group values. This is not what I'd expect.

          When you get the top groups from collector2, should you pass in offset instead of 0? (Hmm – maybe groupOffset? It seems like you're using offset for both the first & second phase collectors? Maybe I'm confused...).

          I know that is confusing, but the DocSlice expects offset + len documents. So that was a quick of doing that. I will clean that up.

          This matches how Solr does grouping on trunk right?

          Yes it does. I'm already thinking about a new collector that collects all most relevant documents of all groups. This collector should produce something like an OpenBitSet. We can use the OpenBitSet to create a DocSet. I think this should be implemented in a different issue.

          Show
          Martijn van Groningen added a comment - Maybe rename group.cache.maxSize -> .maxSizeMB? (So it's clear what the units are). Yes that is a more descriptive name. Should we default group.cache to true? (It's false now?). That makes sense. I think that if the cachedCollector.isCached() returns false we should put something in the response indication that the cache wasn't used because it hit the cache.maxSizeMB limit. Otherwise the nobody will no whether the cache was utilized. When I was playing around with the cache options I noticed that searching without cache (~350 ms) was faster then with cache (~500 ms) on a 10M index with 1711 distinct group values. This is not what I'd expect. When you get the top groups from collector2, should you pass in offset instead of 0? (Hmm – maybe groupOffset? It seems like you're using offset for both the first & second phase collectors? Maybe I'm confused...). I know that is confusing, but the DocSlice expects offset + len documents. So that was a quick of doing that. I will clean that up. This matches how Solr does grouping on trunk right? Yes it does. I'm already thinking about a new collector that collects all most relevant documents of all groups. This collector should produce something like an OpenBitSet. We can use the OpenBitSet to create a DocSet. I think this should be implemented in a different issue.
          Hide
          Michael McCandless added a comment -

          Awesome, that was fast!

          Maybe rename group.cache.maxSize -> .maxSizeMB? (So it's clear what the units are).

          Should we default group.cache to true? (It's false now?).

          When you get the top groups from collector2, should you pass in offset instead of 0? (Hmm – maybe groupOffset? It seems like you're using offset for both the first & second phase collectors? Maybe I'm confused...).

          Computed DocSet (for facetComponent and StatsComponent) is based the ungrouped result.

          This matches how Solr does grouping on trunk right?

          Show
          Michael McCandless added a comment - Awesome, that was fast! Maybe rename group.cache.maxSize -> .maxSizeMB? (So it's clear what the units are). Should we default group.cache to true? (It's false now?). When you get the top groups from collector2, should you pass in offset instead of 0? (Hmm – maybe groupOffset? It seems like you're using offset for both the first & second phase collectors? Maybe I'm confused...). Computed DocSet (for facetComponent and StatsComponent) is based the ungrouped result. This matches how Solr does grouping on trunk right?
          Hide
          Martijn van Groningen added a comment -

          Attached the initial patch.

          • Patch is based on what is in the trunk.
            • Integrated the grouping contrib collectors
            • Same response formats.
            • All parameters except group.query and group.func are supported.
            • Computed DocSet (for facetComponent and StatsComponent) is based the ungrouped result.
          • Also integrated the caching collector. For this I added the group.cache=true|false and group.cache.maxSize=[number] parameters.

          Things still todo:

          • Integrate AllGroupsCollector for total count based on groups.
          • Create a Solr Test for grouping
          • Cleanup / Refactor / java doc
          Show
          Martijn van Groningen added a comment - Attached the initial patch. Patch is based on what is in the trunk. Integrated the grouping contrib collectors Same response formats. All parameters except group.query and group.func are supported. Computed DocSet (for facetComponent and StatsComponent) is based the ungrouped result. Also integrated the caching collector. For this I added the group.cache=true|false and group.cache.maxSize= [number] parameters. Things still todo: Integrate AllGroupsCollector for total count based on groups. Create a Solr Test for grouping Cleanup / Refactor / java doc
          Hide
          Michael McCandless added a comment -

          +1 this would be awesome Martijn!!

          In general we should try hard to backport features we build on trunk, to 3.x, when feasible.

          Show
          Michael McCandless added a comment - +1 this would be awesome Martijn!! In general we should try hard to backport features we build on trunk, to 3.x, when feasible.

            People

            • Assignee:
              Martijn van Groningen
              Reporter:
              Martijn van Groningen
            • Votes:
              5 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development