SOLR-2564

Integrating grouping module into Solr 4.0

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA
    • Component/s: None
    • Labels: None

      Description

      Since work on the grouping module is going well, I think it is time to wire it up in Solr.
      Besides the current grouping features Solr provides, Solr will then also support second-pass caching and total counts based on groups.

      1. SOLR-2564.patch
        60 kB
        Martijn van Groningen
      2. SOLR-2564.patch
        62 kB
        Martijn van Groningen
      3. LUCENE-2564.patch
        61 kB
        Michael McCandless
      4. SOLR-2564.patch
        63 kB
        Martijn van Groningen
      5. SOLR-2564.patch
        68 kB
        Martijn van Groningen
      6. SOLR-2564.patch
        75 kB
        Martijn van Groningen
      7. SOLR-2564.patch
        79 kB
        Martijn van Groningen
      8. SOLR-2564.patch
        79 kB
        Martijn van Groningen
      9. SOLR-2564_performance_loss_fix.patch
        5 kB
        Yonik Seeley


          Activity

          Martijn van Groningen added a comment -

          I've fixed the issue and added a test that triggered the exception.
          Fixed in trunk in revision 1145173.
          Fixed in the 3x branch in revision 1145176.

          Martijn van Groningen added a comment -

          Hi Matteo, I can also confirm the bug; it only happens when group.main=true. I also think this error occurs on the 3x code base. I'll provide a fix for this issue soon.

          Simon Willnauer added a comment -

          Since Lucene is now also on Java 6, we can just change the code in AbstractFirstPassGroupingCollector, and the TermFirstPassGroupingCollectorJava6 in Grouping.java is no longer needed, right?

          Yes, that's right.

          Martijn van Groningen added a comment -

          Since Lucene is now also on Java 6, we can just change the code in AbstractFirstPassGroupingCollector, and the TermFirstPassGroupingCollectorJava6 in Grouping.java is no longer needed, right?

          Michael McCandless added a comment -

          That fix looks great Yonik! Let the subclass use Java 6-only code...

          Robert Muir added a comment -

          +1, that's a really good solution; it fixes the issue and we can still table the Java 5/Java 6 stuff for later.

          Yonik Seeley added a comment -

          Here's a simple patch that recovers the performance we lost when Solr cut over to the modules implementation (due to the fact that the modules are on Java 5 and Solr is on Java 6).

          Matteo Melli added a comment -

          Hi there,

          I'm testing this functionality in my project and found what I think is a bug. The revision I'm working with is 1137889.

          I could reproduce the bug with a really simple index (the column is of type solr.String):

          Col1
          1
          2
          3

          The bug appears when I run a grouping query that mixes the start parameter (with a value greater than 0) and group.main=true:

          http://localhost:8983/solr/test/select/?q=*:*&start=1&group=true&group.field=Col1&group.main=true

          The error trace is:

          Jun 21, 2011 1:32:10 PM org.apache.solr.common.SolrException log
          SEVERE: java.lang.ArrayIndexOutOfBoundsException: 3
          at org.apache.solr.search.DocSlice$1.nextDoc(DocSlice.java:119)
          at org.apache.solr.response.TextResponseWriter.writeDocuments(TextResponseWriter.java:247)
          at org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:153)
          at org.apache.solr.response.XMLWriter.writeResponse(XMLWriter.java:111)
          at org.apache.solr.response.XMLResponseWriter.write(XMLResponseWriter.java:37)
          at org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:340)
          at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:261)
          at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:242)
          at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)

          The problem does not appear without group.main=true, so this may be a bug related to that option.

          PS: I was not sure whether to open a bug since the affected version is still in development. Anyway, sorry for any inconvenience.

          Martijn van Groningen added a comment -

          Committed in revision 1137037

          Martijn van Groningen added a comment -

          I also did some performance tests with the following query on random data in the example schema:

          http://localhost:8983/solr/select?q=*:*&sort=_docid_ desc&group=true&group.cacheMB=0&group.field=single1000_i

          The field single1000_i had 1000 distinct values and the index has in total 100000 documents.

          I ran this query on the following Solr setups:

          • Last night's nightly build.
          • Solr build with this patch as it is.
          • Solr build with this patch and the necessary changes in AbstractFirstPassGroupingCollector so that pollLast was used in all cases.

          During my tests I noticed that the difference between the first and second setups was negligibly small, but the last setup was on average 32% faster than the other two. So moving to Java 6's pollLast() method definitely has a positive impact on performance!

          I also think that this patch is ready to be committed, and that pollLast should be added once Lucene or the grouping module is on Java 6 (I prefer the first option). I'll commit it in the coming day or so.

          Martijn van Groningen added a comment - edited

          Attached an updated version of the patch.

          • The group.cache.percent parameter now defaults to 0, which disables the grouping cache.
          Martijn van Groningen added a comment -

          After some more testing I think it is better to disable the group cache by default.
          Here are my small test results on an index of 30M documents. The tests were run on a low-end machine.

          Query type              Query                                                            Nr of matches   Search time with caching   Search time without caching
          Term query              country:es                                                       24270711        2564                       2261
          Boolean query           country:es OR country:tr                                         25951654        4014                       4170
          Fuzzy query             region:riviera~0.001                                             977043          146                        189
          Wildcard query          region:r*                                                        1608781         177                        190
          Complex boolean query   region:r* OR region:riveria~0.0001 OR country:es OR country:tr   27582783        5328                       7289

          So I think we can conclude that the group cache works well for boolean, wildcard, and fuzzy queries.

          Michael McCandless added a comment -

          Patch looks great Martijn... the only thing pending (I think?) is to perhaps disable the caching collector when the query is a MatchAllDocsQuery instance, since we know the cache only hurts performance in this case.

          Else, +1 to commit! I don't think we should hold this pending the Java 1.6 discussion.

          Martijn van Groningen added a comment -

          it looks like we should hold SOLR-2524 until we commit this, then backport & commit to 3.x.

          I agree with that. Also most of the discussion takes place in this issue.

          I think we're getting close to commit this issue.

          Martijn van Groningen added a comment -

          Added updated patch.

          • The caching behaviour is now controlled by group.cache.percent=[0-100]. This percentage is relative to maxDoc. A value of 0 disables the group cache, and the default is 20 (see the example request after this list).
          • Added hits to the log as requested in SOLR-2337. Hits are now also logged when group.main is not true.
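          For illustration, a request that caps the grouping cache at 50% of maxDoc could look like this (a hypothetical example reusing the single1000_i field from the tests in this thread; not taken from the patch):

          http://localhost:8983/solr/select?q=*:*&group=true&group.field=single1000_i&group.cache.percent=50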
          Michael McCandless added a comment -

          Hmm, I think this only needs to be a 4.0 blocker if we commit SOLR-2524 (3.x Solr grouping) first.

          But at this point, since we are close on this issue, it looks like we should hold SOLR-2524 until we commit this, then backport & commit to 3.x.

          Hoss Man added a comment -

          Marking this issue as a blocker for Solr 4.0 per McCandless's comment in SOLR-2524...

          That said, the plan is definitely to get Solr 4.0 cutover to the
          grouping module; it's just a matter of time. I don't think we should
          ship 4.0 until we've done so.

          Michael McCandless added a comment -

          +1

          Martijn van Groningen added a comment -

          But I think caching should still default to on, just limited as a pctg
          of the number of docs in the index. Ie, by default we will cache the
          result set if it's less than 20% (say) of total docs in your index.

          Maybe instead of specifying a maximum size for the second-pass cache, we could specify it as a percentage (0 to 100) of maxDoc. In that case, when the index grows in number of documents, the cache is still used for a lot of queries (depending on the specified percentage). If we go with this, maybe group.cacheMB should be renamed to group.cache.percentage. The default could then be something like 20. Any thoughts about this?
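          A minimal sketch of the sizing rule proposed here, with hypothetical parameter and variable names (not the actual patch):

          // Hypothetical sketch of the percent-of-maxDoc rule (names assumed, not patch code).
          int maxDoc = searcher.maxDoc();                               // total docs in the index
          int cachePercent = params.getInt("group.cache.percent", 20);  // 0 would disable caching
          // Document budget for the caching collector; if collection exceeds it,
          // caching is abandoned and we fall back to a second search pass.
          int maxDocsToCache = (int) ((long) maxDoc * cachePercent / 100);
          boolean cacheEnabled = maxDocsToCache > 0;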

          Michael McCandless added a comment -

          Ahh, I see. Could we turn off caching if the query is an instance of MatchAllDocsQuery?

          Yonik Seeley added a comment -

          > Actually, the worst case is twice as slow due to unneeded caching of a simple query.

          Sorry, what do you mean here?

          The worst case with this patch as a whole (due to the caching by default) is that this type of query is twice as slow:

          http://localhost:8983/solr/select?q=*:*&group=true&group.field=single1000_i
          

          Which led me to wonder how complex queries must be before the caching is a win.

          Michael McCandless added a comment -

          Actually, the worst case is twice as slow due to unneeded caching of a simple query.

          Sorry, what do you mean here?

          but I still question the default, which can lead to surprisingly huge memory use (think up to a field cache entry or more allocated per-request).

          I agree; -1 is a dangerous default.

          But I think caching should still default to on, just limited as a percentage of the number of docs in the index. Ie, by default we will cache the result set if it's less than (say) 20% of total docs in your index. Else we fall back to 2-pass.

          I think this matches how Solr handles caching filters now? Ie, the filter cache evicts by total filter count and not net MB, right? So if you have more docs in your index you'll spend more RAM on the caching...

          Costly queries that return a smallish result set can see big gains from the caching.

          Yonik Seeley added a comment -

          If the 16% slowdown is worst case

          Actually, the worst case is twice as slow due to unneeded caching of a simple query. Luckily this can be configured... but I still question the default, which can lead to surprisingly huge memory use (think up to a field cache entry or more allocated per-request). One advantage to the dual-pass approach by default in the first place was avoiding surprisingly large memory usage by default (which can degrade less gracefully by causing OOM exceptions as people try to crank up the number of request threads).

          Erik Hatcher added a comment -

          Just for grins, here's a Ruby script that'll do it (provided you have the solr-ruby gem installed):

          require 'solr'
          solr = Solr::Connection.new   # solr-ruby defaults to http://localhost:8983/solr
          1.upto(1000) {|i| solr.add(:id=>i, :single1000_i=>i)}   # 1000 docs, one unique int value each
          solr.commit
          
          Ryan McKinley added a comment -

          Java 5 isn't even supported by oracle anymore, so why the hell do we support it?

          +1 (though this discussion should happen elsewhere)

          If the 16% slowdown is the worst case and under 'normal' use it would be equivalent or faster, I say let's move forward with that and push for Java 6 in a different issue. Having a Java 6 version of a Lucene module (in Solr?) seems like a mess.

          Robert Muir added a comment -

          Another option is to allow the grouping module (separately from Lucene core) to use Java 6 code.... but even that could be "involved"

          Personally I am against parts of the code being Java 6 and other parts being Java 5. We already have this situation today (the Solr part is Java 6, everything else is Java 5).

          Come on, it's a major version; let's just cut everything over to Java 6. Java 5 isn't even supported by Oracle anymore, so why the hell do we support it?

          Yonik Seeley added a comment -

          Another option is to allow the grouping module (separately from Lucene core) to use Java 6 code

          +1

          Yonik, how do you create the index used for this test? Somehow you generate an int field w/ random 1000 unique values – do you have a client-side script you use to create random docs in Solr?

          I have some CSV files lying around that I reuse for ad-hoc testing of a lot of stuff. They were created with a simple Python script.
          Then I simply index with:

          URL=http://localhost:8983/solr
          curl "$URL/update/csv?stream.url=file:/tmp/test.csv&overwrite=false&commit=true"
          

          It was also my first reaction to think that this is a very synthetic case that people are unlikely to hit... until I thought about dates. Indexing everything in date order is a pretty common thing to do, and so is sorting by date - which hits the exact same case. Queries of *:* and simple filter queries on type, etc., also tend to be pretty common (i.e. full-text relevance/performance actually isn't an important feature for some users).
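          For example, if documents are indexed in date order, a date-sorted grouping query along these lines (hypothetical timestamp field) degenerates to the same no-short-circuit case as the synthetic _docid_ test above:

          http://localhost:8983/solr/select?q=*:*&sort=timestamp desc&group=true&group.field=single1000_i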

          How complex must queries be for caching to generate a net benefit under load? I haven't tried to test this myself.

          Michael McCandless added a comment -

          I think we should decouple the "should Lucene require Java 6" question (which I expect to be an..... involved discussion) from making progress here?

          My feeling is we should still commit this. The 16% slowdown is on a very synthetic case (MatchAllDocsQuery, sorting by reversed docID, grouping by a random int field)... unless we also see unacceptable slowdowns in more realistic cases? Also, net/net the user should typically see a speedup, since caching is enabled by default.

          We should still open an issue to cut this code back over to pollLast once we can use Java 6 code.

          Another option is to allow the grouping module (separately from Lucene core) to use Java 6 code.... but even that could be "involved"

          Yonik, how do you create the index used for this test? Somehow you generate an int field w/ 1000 random unique values – do you have a client-side script you use to create random docs in Solr?

          Michael McCandless added a comment -

          Ahh nice catch Martijn!

          Robert Muir added a comment -

          Just send an email to the dev list... lots of people will +1, Uwe will -1, but I don't see why this is any issue for 4.0.

          Yonik Seeley added a comment -

          Ah, good call Martijn - it must be that pollLast was replaced with two map operations.
          Too bad Lucene isn't on Java 6 yet!

          Martijn van Groningen added a comment - edited

          I've been checking out the performance, and it generally seems fine. But of course we normally short circuit based on comparators and often don't get beyond that... so to exercise & isolate the rest of the code, I tried a worst-case scenario where the short circuit wouldn't work (sort=_docid_ desc), and Solr trunk with this patch is ~16% slower than without it. Any ideas what the problem might be?

          What might be the problem is that trunk is using (Grouping.java, line 589):

          SearchGroup smallest = orderedGroups.pollLast();
          

          Whilst the AbstractFirstPassGroupingCollector (line 217) is using:

          final CollectedSearchGroup<GROUP_VALUE_TYPE> bottomGroup = orderedGroups.last();
          orderedGroups.remove(bottomGroup);
          

          The same also happens around line 271.

          I haven't checked this out, but I think it is the most likely explanation for the difference between the two implementations. Retrieving the bottom group happens in almost all cases when the short circuit doesn't work.
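          A minimal sketch of the subclass approach the later fix takes (hypothetical hook name; the actual patch may differ): keep the Java 5 code in the base class but make the bottom-group removal overridable, so that Solr, which already requires Java 6, can substitute pollLast():

          // In the Java 5 base class (hypothetical overridable hook):
          protected CollectedSearchGroup<GROUP_VALUE_TYPE> removeBottomGroup() {
            final CollectedSearchGroup<GROUP_VALUE_TYPE> bottomGroup = orderedGroups.last();
            orderedGroups.remove(bottomGroup); // two O(log n) operations on the TreeSet
            return bottomGroup;
          }

          // In a Java 6-only subclass used by Solr:
          @Override
          protected CollectedSearchGroup<GROUP_VALUE_TYPE> removeBottomGroup() {
            return orderedGroups.pollLast(); // one O(log n) operation
          }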

          Martijn van Groningen added a comment -

          Attached a new patch that should select the term-based collectors more often:

          • In the case of grouping by function when the value source is a StrFieldValueSource.
          • All field types that produce a StrFieldValueSource (in getValueSource) now use the term-based collectors, so any custom field type can also be supported. Both TextField and StrField produce a StrFieldValueSource. I'm not sure if this is the right approach, but it was the easiest way to implement.
          Yonik Seeley added a comment -

          This was without caching to put them on an even footing (and given that the base query was all docs, caching would be slower anyway). The URL above was the actual one used to test.

          Michael McCandless added a comment -

          Hmmm. Was this with or without caching?

          Yonik Seeley added a comment -

          I've been checking out the performance, and it generally seems fine. But of course we normally short circuit based on comparators and often don't get beyond that... so to exercise & isolate the rest of the code, I tried a worst-case scenario where the short circuit wouldn't work (sort=_docid_ desc), and Solr trunk with this patch is ~16% slower than without it. Any ideas what the problem might be?

          http://localhost:8983/solr/select?q=*:*&sort=_docid_ desc&group=true&group.cacheMB=0&group.field=single1000_i
          

          Note: the single1000_i field is a single-valued int field with 1000 unique values.

          Yonik Seeley added a comment -

          Browsing around this a bit more... the existing Solr code selected the string-based collectors for any ValueSource of StrFieldSource. This patch resorts to exact getClass() checks against the string and text field types, which won't match in as many cases (either derived fields, or user custom fields that don't derive from either of these field types).
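          A hedged sketch contrasting the two selection strategies described here (illustrative fragment with assumed variables like schemaField and qparser; not the actual patch code):

          // Existing Solr behavior: any value source that is a StrFieldSource
          // (including ones produced by derived or custom field types)
          // selects the term-based collectors.
          ValueSource vs = schemaField.getType().getValueSource(schemaField, qparser);
          boolean termBased = vs instanceof StrFieldSource;

          // Patch behavior criticized above: exact class checks, which miss
          // subclasses and user-defined field types.
          Class<?> ft = schemaField.getType().getClass();
          boolean termBasedExact = ft == StrField.class || ft == TextField.class;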

          Martijn van Groningen added a comment -

          But: why not just use Double.MAX_VALUE?

          Yes, I should have used that and I'll change it. I thought that the size was initially used to create the underlying array. But it isn't! The array inside the caching collector starts with a length of 128 and grows when needed.

          The way I've currently implemented LUCENE-3097, it will only get the most relevant document of each group. In terms of SOLR-236 that is the same as using collapse.threshold=1. I think what Yonik means is increasing the threshold so that more documents end up in the docset that is eventually used by the facet component. Increasing this threshold also means setting when the collapsing starts: with collapse.threshold=3, the collapsing starts from the 4th document. I think the whole collapse.threshold feature doesn't scale very well.

          Anyway, I think when we get to wiring the 2nd method (LUCENE-3097) into Solr, we should first make it work for the most relevant group documents.

          Michael McCandless added a comment -

          I think a good criterion for "correct" is: if you were to click through on the facet (ie, take the current query and add a filter on facet field = facet value), would the hit count you see match the facet count you were just looking at?

          Ie, drill down should be "consistent".

          Both approaches will give the same facet counts if the field never varies within the group (ie, the field "belongs" to the "parent" docs); it's only "child" fields where you need faceting to be aware of the groups, so for apps that never display facets on child fields, only computing facets on the group heads will work.

          I suspect doc blocks will be the only practical way to implement faceting on child fields efficiently.

          Yonik Seeley added a comment -

          > The other use-case is more like field collapsing and does change what documents match (basically, only the first documents in each group, up to limit, "match").

          I'm not sure it's that simple, ie that we can so cleanly model collapsing as reducing the docs to consider and then running faceting on that reduced set.

          I think that's what was actually implemented in SOLR-236 IIRC, and what some people seem to be asking for.

          EG, the use case of getting correct facet counts for a field that has different values within the group, can't be handled by this approach?

          Well, correct is a matter of context (for example, some have called the facet counts for the current grouping implementation "incorrect" because it didn't happen to match their use case). Looking at the original description in LUCENE-3097, it seems you're talking about Martijn's 3rd method, while I was talking about the 2nd. But maybe some people that were originally advocating for #2, really wanted #3?

          Michael McCandless added a comment -

          The other use-case is more like field collapsing and does change what documents match (basically, only the first documents in each group, up to limit, "match").

          I'm not sure it's that simple, ie that we can so cleanly model collapsing as reducing the docs to consider and then running faceting on that reduced set.

          EG, the use case of getting correct facet counts for a field that has different values within the group, can't be handled by this approach? This is the count=2 for size=S in my example at https://issues.apache.org/jira/browse/LUCENE-3097?focusedCommentId=13038605&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13038605

          I think to do that properly, the faceting impl needs to see all docs in the group, not just the "lead doc" per group.

          I think another way to visualize/model this is that we really need to be able to configure "which field counts" (ID_FIELD) for the schema. This field would then decide all counts – total "hit count", facet counts, etc., ie each of these counts is count(unique(ID_FIELD)) of the docs falling in that facet/result set. The default is Lucene's docid, but the app should be able to state any other ID_FIELD.

          Show
          Michael McCandless added a comment - The other use-case is more like field collapsing and does change what documents match (basically, only the first documents in each group, up to limit, "match"). I'm not sure it's that simple, ie that we can so cleanly model collapsing as reducing the docs to consider and then running faceting on that reduced set. EG, the use case of getting correct facet counts for a field that has different values within the group, can't be handled by this approach? This is the count=2 for size=S in my example at https://issues.apache.org/jira/browse/LUCENE-3097?focusedCommentId=13038605&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13038605 I think to do that properly, the faceting impl needs to see all docs in the group, not just the "lead doc" per group. I think another way to visualize/model this that we really need to be able to configure "which field counts" (ID_FIELD) for the schema. This field would then decide all counts – total "hit count", facet counts, etc., ie each of these counts is count(unique(ID_FIELD)) of the docs falling in that facet/result set. The default is Lucene's docid, but the app should be able to state any other ID_FIELD.
          Hide
          Michael McCandless added a comment -

          Patch looks great Martijn!

          The only thing I noticed is that cacheSizeMB is computed incorrectly from maxDoc (for the -1 case), because that's all int math, I think? Ie it'll be truncated from e.g. 13.7 MB -> 13. But: why not just use Double.MAX_VALUE?

          Martijn van Groningen added a comment -

          Attached an updated patch.

          • Renamed group.groupCount to group.ngroups.
          • When group.ngroups=true, ngroups is added alongside matches in the response (see the example below).
          • Included the group.ngroups option in the random testing. (All grouping tests succeed.)
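          For illustration, with group.ngroups=true the grouped response might then carry ngroups next to matches, along these lines (hypothetical field and counts, reusing the numbers from the performance test above):

          <lst name="grouped">
            <lst name="single1000_i">
              <int name="matches">100000</int>
              <int name="ngroups">1000</int>
              <arr name="groups">...</arr>
            </lst>
          </lst>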
          Martijn van Groningen added a comment -

          groupCount is now GROUPED or UNGROUPED, and is used to set "Accuracy" (more on that later).
          Seems like this parameter should be a boolean that says if the total number of groups should be returned?
          If true, we can add a "ngroups" or "groupCount" element at the same level as "matches". We should probably just name the parameter the same thing as the variable that gets returned... i.e. group.ngroups=true would cause "ngroups" to be populated (or groupCount if we decide that's a better name).

          I like that. I'll update the patch soon so that it includes ngroups together with matches when group.ngroups=true is specified.

          "method" should probably be reserved for the algorithm used for collapsing (as we do for faceting).

          Maybe we can use this to let the user choose one of the Term|Block|Function|*GroupCollectors. The collectors define how the groups are created.

          However for now if it's easier, we could treat group.collapse=true to apply to the base query and all filters, and handle the use case I mentioned above via a qparser in the future

          I agree, let's commit this first and then start to enhance the collapse feature. The group.collapse option should be included when LUCENE-3097 is finished, which shouldn't take that long!

          Yonik Seeley added a comment -

          Thanks for the update, Martijn; it's looking good.

          Just a note that the following optimization will no longer be valid once we have "post collapse faceting" (or whatever we're calling it), or when we have an option to return the total number of groups. But hopefully our random testing will catch that in the future.

              protected Collector createFirstPassCollector() throws IOException {
                // Ok we don't want groups, but do want a total count
                if (actualGroupsToFind <= 0) {
                  fallBackCollector = new TotalHitCountCollector();
                  return fallBackCollector;
                }
          

          However, I have changed this in the updated patch; basically this check is no longer done by setting the GET_SCORES flag.

          Thanks... GET_SCORES does have a different meaning: scores must be returned to the caller (which can still be false even if scores are used for sorting).

          I also think that group.groupCount is a better name.

          groupCount is now GROUPED or UNGROUPED, and is used to set "Accuracy" (more on that later).
          Seems like this parameter should be a boolean that says if the total number of groups should be returned?
          If true, we can add a "ngroups" or "groupCount" element at the same level as "matches". We should probably just name the parameter the same thing as the variable that gets returned... i.e. group.ngroups=true would cause "ngroups" to be populated (or groupCount if we decide that's a better name).

          Maybe another name is more descriptive. Maybe style or method?

          "method" should probably be reserved for the algorithm used for collapsing (as we do for faceting).

          Background for others: This feature has been called many things like "post collapse faceting", etc. But it's really much more than that. Normal grouping simply groups documents and presents them in a different way, but does not change what documents match the base query + filters. The other use-case is more like field collapsing and does change what documents match (basically, only the first documents in each group, up to limit, "match").

          Maybe just use a word from the original name for this whole feature... "group.collapse=true"?

          There are some other interesting semantics to work out for group.collapse=true, such as whether the collapsing happens before or after filters are applied. Perhaps either could make sense depending on the use case? Here's one use case I can think of: using field collapsing to show only the latest version of a document. In this case, one would only want collapsing to apply to the base query (with filtering happening after that), because you don't want to get into the position of having a filter that filters out the most recent version of a document and thus shows an older version.

          However for now if it's easier, we could treat group.collapse=true to apply to the base query and all filters, and handle the use case I mentioned above via a qparser in the future.

          Yonik Seeley added a comment -

          So, maybe... we make an exception to enum casing (normally ALL CAPS) and use lowercase for enums that correspond to valid values for request params?

          +1 - seems like a good tradeoff, and the fact that it's scoped by the classname still makes it stand out.

          Michael McCandless added a comment -

          We can choose to leave out the upper-casing. Solr users would then need to make sure that parameter options are spelled correctly. Would that be alright?

          It seems like "typically" the request param values must be lower-case? (Which I think is good...)

          So, maybe... we make an exception to enum casing (normally ALL CAPS) and use lowercase for enums that correspond to valid values for request params? Then we can just do .valueOf() directly, and we get the strong checking to catch mis-spellings.

          Martijn van Groningen added a comment -

          How about TotalCount.GROUPS and TotalCount.DOCS?

          +1, I think we can go with that name for the total counts.

          Martijn van Groningen added a comment -

          Hi Yonik,

          It is good to know that you took a look at the patch!

          in the QueryComponent, why the change to set the GET_SCORES flag based on the sort(s)?

          Yes, I did this because I used to set Grouping.needScores with this flag. I also used needScores to indicate whether the scores need to be cached. However, I have changed this in the updated patch; basically this check is no longer done by setting the GET_SCORES flag.

          I'm not a fan of this new style for matching request parameters to enums...

          We can choose to leave out the upper-casing. Solr users would then need to make sure that parameter options are spelled correctly. Would that be all right?

          "Accuracy" seems a bit mis-named?

          Maybe another name is more descriptive. Maybe style or method?

          The parameter "group.totalCount" I would expect to return the total count of something, not control the pre/post faceting thing?

          The jdoc is mixed up with group.docSet. I also think that group.groupCount is a better name. I changed this in the new patch.

          What does "group.docSet" do?

          Currently nothing. I plan to use it when I finish LUCENE-3097. Basically it will decide whether the docset (for FacetComponent and StatsComponent) is based on plain documents or on groups. Since you can have more than one Command (Field / Function / Query), it will then select the first CommandField or CommandFunction. I'm not sure how we should handle this when there is more than one command.

          I'm not sure we should default group.cache to true

          The query time can really be reduced with this option, but yes, it requires more memory. If the cache collector threshold is met, the array is immediately set to null during the search, so GC might be able to clean it up while the search is still running. Also, Solr users get a message in the response. Somehow I forgot to move that from SOLR-2524, but it is in the updated patch now.

          we could dump group.cache and have a single group.cacheMB parameter that uses 0 as no cache, -1 as maximum needed (solr uses -1 in this manner in other places too)

          Makes sense; grouping is then at least consistent with the rest of Solr. I made it default to -1 for now.
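
          For illustration, a minimal sketch of that contract (helper name and wiring assumed, not the actual patch): 0 disables caching, a negative value means "cache as much as needed", and anything else is a literal cap in MB, using Lucene's CachingCollector:

            Collector maybeWrapWithCache(Collector firstPass, boolean cacheScores, double cacheMB) {
              if (cacheMB == 0) {
                return firstPass;  // caching disabled
              }
              // -1 (or any negative value) means no practical upper bound
              double maxRAMMB = cacheMB < 0 ? Double.MAX_VALUE : cacheMB;
              return CachingCollector.create(firstPass, cacheScores, maxRAMMB);
            }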

          FYI: there's a nocommit in there misspelled as "No commit"

          I have removed that.

          It wasn't necessary before, and there are advantages to preserving information (like the fact that someone said "no limit" vs a specific number) until as late as possible. That was previously handled by getMax() in Grouping.java, and I still see it being called... so it should be OK?

          I've removed this if statement and made sure that getMax(...) is used wherever it is needed.

          Michael McCandless added a comment -

          Do we have a group.facet = after ? I still like the old collapse component better because it had the collapse.facet=after.

          I think there is another ticket, but not sure if it is being worked on?

          No, not in this issue. This issue is just about cutting Solr over to the grouping module.

          LUCENE-3097 is the issue for post-grouping facet counts.

          Michael McCandless added a comment -

          I haven't been following this...

          Thank you for the review, Yonik!

          I took a quick look at the patch, and at first blush it's hard to tell what changes are cleanup and what changes are cut-over.

          It's definitely a big refactoring.

          in the QueryComponent, why the change to set the GET_SCORES flag based on the sort(s)?

          I assume this is because if you sort or groupSort by a Sort that
          contains SortField.SCORE, you need a scorer. Not sure why Solr trunk
          gets away with not doing this, though... Martijn, do you know? Oh, and
          likely because the CachingCollector needs to know up front if scores
          are needed.
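
          A minimal sketch of that check (assumed helper, not from the patch): walk the SortFields and see whether any of them sorts by score, which is what forces the scorer to be set and the CachingCollector to cache scores:

            static boolean needsScores(Sort sort) {
              if (sort == null) {
                return true;  // no explicit sort means sorting by relevance score
              }
              for (SortField sf : sort.getSort()) {
                if (sf.getType() == SortField.SCORE) {
                  return true;
                }
              }
              return false;
            }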

          I'm not a fan of this new style for matching request parameters to enums... solr does a lot of lookups on a typical request, and a switch to this style everywhere could definitely have an impact (the whole upper-casing the request param so we can match it to the enum name).

          The .upper() calls (twice per request, only if the request has
          group=true) are negligible here (true, other situations might be
          different, though we should fix them to lookup/check once).

          And, this approach gives us strong checking of the enum fields.
          Contrast that with what's on trunk now:

            String format = params.get(GroupParams.GROUP_FORMAT);
            Grouping.Format defaultFormat = "simple".equals(format) ? Grouping.Format.Simple : Grouping.Format.Grouped; 
          

          If you have a typo in "simple", you'll unexpectedly get
          Grouping.Format.Grouped.

          I think we should strongly check all of our request params?

          "Accuracy" seems a bit mis-named? It seems to imply an accuracy trade-off, but both methods are 100% accurate here, they just do different things to serve different usecases. At least the name doesn't seem to have made it's way into the external API though.

          The parameter "group.totalCount" I would expect to return the total count of something, not control the pre/post faceting thing? (or are the comments just wrong?) If it's to return the number of groups, then perhaps the name should be "group.groupCount" as totalCount is unit-less.

          I agree. This setting just controls what the "total hit count" should
          count – unique groups vs docs.

          How about TotalCount.GROUPS and TotalCount.DOCS?

          What does "group.docSet" do? The comments don't quite make sense to me, but the param suggests it's sort of like group.totalCount?

          I think we should remove this from this patch (see my comment above –
          it applies to LUCENE-3097).

          in the interest of reducing the number of parameters, we could dump group.cache and have a single group.cacheMB parameter that uses 0 as no cache, -1 as maximum needed (solr uses -1 in this manner in other places too), and other values as literal number of MB (which I'd discourage people from using personally).

          +1, that makes sense.

          I'm not sure we should default group.cache to true... there's a downside to the memory usage, and it's fragile: things may be working just fine, and the user may add a few more documents to the index and then the limit is hit and it just stops working (but still consumes much memory and extra log warnings per request).

          Enabling caching can make a huge improvement in QPS, especially for
          queries that are costly to execute but don't match too many docs.

          Maybe instead of a fixed MB we could make it a percentage of the
          maxDoc? This would make it less fragile. So you could say you're
          willing to cache up to 50% of the total docs in the index.
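
          As a sketch of that idea (the percentage value is hypothetical; this is not in the patch, and it assumes the CachingCollector.create variant that accepts a document-count cap), the cap could be derived from maxDoc:

            double cachePercent = 50.0;  // hypothetical "cache up to 50% of maxDoc" setting
            int maxDocsToCache = (int) (searcher.maxDoc() * (cachePercent / 100.0));
            Collector cached = CachingCollector.create(firstPass, cacheScores, maxDocsToCache);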

          FYI: there's a nocommit in there misspelled as "No commit"

          Martijn, can you fix this one...? And in general try to spell it as
          "nocommit"; this is what our build catches. (And, fear not, you're
          not the only person to have trouble spelling nocommit!)

          Bill Bell added a comment -

          Do we have a group.facet = after ? I still like the old collapse component better because it had the collapse.facet=after.

          I think there is another ticket, but not sure if it is being worked on?

          Yonik Seeley added a comment -

          In QueryComponent, is it possible to remove this?

          +        if (limitDefault < 0) {
          +          limitDefault = searcher.maxDoc();
          +        }
          

          It wasn't necessary before, and there are advantages to preserving information (like the fact that someone said "no limit" vs a specific number) until as late as possible. That was previously handled by getMax() in Grouping.java, and I still see it being called... so it should be OK?
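
          For context, a rough sketch of the late resolution getMax() performs (body assumed from this discussion, not copied from Grouping.java): a negative limit keeps meaning "no limit" until the point of use, where it is finally clamped to maxDoc:

            static int getMax(int offset, int limit, int maxDoc) {
              int v = limit < 0 ? maxDoc : offset + limit;
              return (v < 0 || v > maxDoc) ? maxDoc : v;  // the v < 0 check guards against int overflow
            }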

          Yonik Seeley added a comment -

          FYI: there's a nocommit in there misspelled as "No commit"

          Yonik Seeley added a comment -

          I haven't been following this... I took a quick look at the patch, and at first blush it's hard to tell what changes are cleanup and what changes are cut-over.

          • in the QueryComponent, why the change to set the GET_SCORES flag based on the sort(s)?
          • I'm not a fan of this new style for matching request parameters to enums... solr does a lot of lookups on a typical request, and a switch to this style everywhere could definitely have an impact (the whole upper-casing the request param so we can match it to the enum name).
          • "Accuracy" seems a bit mis-named? It seems to imply an accuracy trade-off, but both methods are 100% accurate here, they just do different things to serve different usecases. At least the name doesn't seem to have made it's way into the external API though.
          • The parameter "group.totalCount" I would expect to return the total count of something, not control the pre/post faceting thing? (or are the comments just wrong?) If it's to return the number of groups, then perhaps the name should be "group.groupCount" as totalCount is unit-less.
          • What does "group.docSet" do? The comments don't quite make sense to me, but the param suggests it's sort of like group.totalCount?
          • I'm not sure we should default group.cache to true... there's a downside to the memory usage, and it's fragile: things may be working just fine, and the user may add a few more documents to the index and then the limit is hit and it just stops working (but still consumes much memory and extra log warnings per request).
          • in the interest of reducing the number of parameters, we could dump group.cache and have a single group.cacheMB parameter that uses 0 as no cache, -1 as maximum needed (solr uses -1 in this manner in other places too), and other values as literal number of MB (which I'd discourage people from using personally).
          Martijn van Groningen added a comment -

          Fixed jdocs for GroupParams.GROUP_CACHE (we now default to true not false, right?)

          Yes we do.

          I think GroupParams.GROUP_FACETS (group.docSet request param) is
          unused? Should we remove it? (This is for when we do LUCENE-3097, I
          think?)

          I 'reserved' this parameter for LUCENE-3097. Should we keep it? Once LUCENE-3097 is committed, I will quickly integrate it into Solr.

          Michael McCandless added a comment -

          Patch looks great!

          Attached new patch with minor fixes:

          • I had to fix up Solr's common-build.xml (it was referencing
            lucene/contrib/grouping from 3.x). After that all tests pass.
          • Fixed jdocs for GroupParams.GROUP_CACHE (we now default to true
            not false, right?)

          I think GroupParams.GROUP_FACETS (group.docSet request param) is
          unused? Should we remove it? (This is for when we do LUCENE-3097, I
          think?)

          So this cutover improves Solr trunk's grouping impl by adding an option
          to get the total group count (group.totalCount request param) and using
          the CachingCollector for fast replaying of hits for 2nd pass collector
          (adds group.cache/group.cache.maxSizeMB request params). The
          double-RAM issue is fixed, thanks to LUCENE-3099.

          So I think this is ready to be committed!!

          Martijn van Groningen added a comment -

          Added updated patch. All tests are passing now. I think that when LUCENE-3099 is committed, this patch can also be committed.

          Also included a small fix to the CachingCollector. During replaying, the scorer was only set once, but this must be done for every segment. This fix resolves that issue.
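
          A simplified sketch of what that means (all names hypothetical; this is not the actual fix): during replay, setNextReader and setScorer both have to be called once per cached segment before that segment's hits are handed to the target collector:

            void replay(Collector target) throws IOException {
              for (CachedSegment seg : segments) {  // hypothetical per-segment cached state
                target.setNextReader(seg.reader, seg.docBase);
                target.setScorer(new ReplayScorer(seg.scores));  // per segment, not just once
                for (int doc : seg.docs) {
                  target.collect(doc);
                }
              }
            }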

          Martijn van Groningen added a comment - edited

          Initial patch. All tests but the random test pass. The goal is that all existing grouping tests pass without any changes to the TestGroupingSearch class.

          This patch also includes collectors for function queries.

          LUCENE-3099 must be applied before using this patch.


            People

            • Assignee:
              Martijn van Groningen
              Reporter:
              Martijn van Groningen
            • Votes:
              0
              Watchers:
              4
