Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 1.3
    • Fix Version/s: 3.3
    • Component/s: search
    • Labels:
      None

      Description

      This patch includes a new feature called "Field collapsing".

      "Used in order to collapse a group of results with similar value for a given field to a single entry in the result set. Site collapsing is a special case of this, where all results for a given web site is collapsed into one or two entries in the result set, typically with an associated "more documents from this site" link. See also Duplicate detection."
      http://www.fastsearch.com/glossary.aspx?m=48&amid=299

      The implementation adds three new query parameters (SolrParams):

      • "collapse.field" to choose the field used to group results
      • "collapse.type": "normal" (the default) or "adjacent"
      • "collapse.max" to select how many consecutive results are allowed before collapsing

      TODO (in progress):

      • More documentation (on source code)
      • Test cases

      Two patches:

      • "field_collapsing.patch" for current development version
      • "field_collapsing_1.1.0.patch" for Solr-1.1.0

      P.S.: Feedback and misspelling corrections are welcome.

      1. SOLR-236-trunk.patch
        236 kB
        Martijn van Groningen
      2. SOLR-236-trunk.patch
        247 kB
        Martijn van Groningen
      3. SOLR-236-trunk.patch
        250 kB
        Martijn van Groningen
      4. SOLR-236-trunk.patch
        256 kB
        Martijn van Groningen
      5. SOLR-236-trunk.patch
        259 kB
        Martijn van Groningen
      6. SOLR-236-FieldCollapsing.patch
        16 kB
        Ryan McKinley
      7. SOLR-236-FieldCollapsing.patch
        18 kB
        Ryan McKinley
      8. SOLR-236-FieldCollapsing.patch
        18 kB
        Emmanuel Keller
      9. SOLR-236-distinctFacet.patch
        2 kB
        Bill Bell
      10. SOLR-236-branch_3x.patch
        258 kB
        Doug Steigerwald
      11. SOLR-236-1_4_1-paging-totals-working.patch
        264 kB
        Stephen Weiss
      12. SOLR-236-1_4_1-NPEfix.patch
        0.7 kB
        Cameron
      13. SOLR-236-1_4_1.patch
        264 kB
        Martijn van Groningen
      14. SOLR-236.patch
        253 kB
        Shalin Shekhar Mangar
      15. SOLR-236.patch
        245 kB
        Martijn van Groningen
      16. SOLR-236.patch
        257 kB
        Shalin Shekhar Mangar
      17. SOLR-236.patch
        251 kB
        Martijn van Groningen
      18. SOLR-236.patch
        252 kB
        Shalin Shekhar Mangar
      19. SOLR-236.patch
        244 kB
        Martijn van Groningen
      20. SOLR-236.patch
        245 kB
        Martijn van Groningen
      21. SOLR-236.patch
        27 kB
        Yonik Seeley
      22. solr-236.patch
        24 kB
        Bojan Smid
      23. SOLR-236_collapsing.patch
        26 kB
        Dmitry Lihachev
      24. SOLR-236_collapsing.patch
        25 kB
        Thomas Traeger
      25. quasidistributed.additional.patch
        1 kB
        Michael Gundlach
      26. NonAdjacentDocumentCollapserTest.java
        9 kB
        Peter Karich
      27. NonAdjacentDocumentCollapser.java
        21 kB
        Peter Karich
      28. field-collapsing-extended-592129.patch
        31 kB
        Karsten Sperling
      29. field-collapse-solr-236-2.patch
        52 kB
        Martijn van Groningen
      30. field-collapse-solr-236.patch
        49 kB
        Martijn van Groningen
      31. field-collapse-5.patch
        122 kB
        Martijn van Groningen
      32. field-collapse-5.patch
        133 kB
        Martijn van Groningen
      33. field-collapse-5.patch
        134 kB
        Martijn van Groningen
      34. field-collapse-5.patch
        134 kB
        Martijn van Groningen
      35. field-collapse-5.patch
        136 kB
        Martijn van Groningen
      36. field-collapse-5.patch
        146 kB
        Martijn van Groningen
      37. field-collapse-5.patch
        144 kB
        Martijn van Groningen
      38. field-collapse-5.patch
        216 kB
        Martijn van Groningen
      39. field-collapse-5.patch
        218 kB
        Martijn van Groningen
      40. field-collapse-5.patch
        218 kB
        Martijn van Groningen
      41. field-collapse-5.patch
        239 kB
        Martijn van Groningen
      42. field-collapse-5.patch
        244 kB
        Martijn van Groningen
      43. field-collapse-5.patch
        251 kB
        Martijn van Groningen
      44. field-collapse-5.patch
        253 kB
        Martijn van Groningen
      45. field-collapse-5.patch
        254 kB
        Martijn van Groningen
      46. field-collapse-4-with-solrj.patch
        66 kB
        Martijn van Groningen
      47. field-collapse-3.patch
        52 kB
        Martijn van Groningen
      48. field_collapsing_dsteigerwald.diff
        25 kB
        Doug Steigerwald
      49. field_collapsing_dsteigerwald.diff
        25 kB
        Charles Hornberger
      50. field_collapsing_dsteigerwald.diff
        25 kB
        Oleg Gnatovskiy
      51. field_collapsing_1.3.patch
        14 kB
        Emmanuel Keller
      52. field_collapsing_1.1.0.patch
        12 kB
        Emmanuel Keller
      53. DocSetScoreCollector.java
        5 kB
        Peter Karich
      54. collapsing-patch-to-1.3.0-ivan.patch
        24 kB
        Iván de Prado
      55. collapsing-patch-to-1.3.0-ivan_3.patch
        24 kB
        Iván de Prado
      56. collapsing-patch-to-1.3.0-ivan_2.patch
        24 kB
        Iván de Prado
      57. collapsing-patch-to-1.3.0-dieter.patch
        26 kB
        dieter grad

        Issue Links

          Activity

          kishore padman added a comment -

          Hi,

          I have applied these 2 patches to solr1.4.1 for the field collapsing.

          Apply patch SOLR-236-1_4_1-paging-totals-working.patch
          Apply patch SOLR-236-1_4_1-NPEfix.patch

          The collapsing works fine, and facet counts show correctly on the collapsed records, as I am using collapse.facet=after.
          But when a filter is done on a facet, all the corresponding facet counts are calculated on the basis of uncollapsed records.

          Has anyone faced this issue? Please let me know the resolution.

          Thanks
          Kishore Padman

          Robert Muir added a comment -

          Bulk close for 3.3

          Michael McCandless added a comment -

          Resolving this long-running issue as a duplicate of SOLR-2524, which brings grouping (finally!) to Solr 3.x via the new grouping module (factored out from Solr's trunk grouping implementation and then backported to 3.x).

          Jan Høydahl added a comment -

          I think you should consider the group by now included in 3_x branch (SOLR-2524 was recently committed)

          Yuriy Akopov added a comment -

          I am trying to migrate from Solr 1.4.1 to Solr 3.2 and so I need to patch the 3.2 branch.

          When I apply the "SOLR-236-branch_3x.patch" file to the dev/tags/release-3.2 branch, the WAR file builds successfully, but it then fails on loading with an "org.apache.solr.common.SolrException: Error loading class 'org.apache.solr.handler.component.CollapseComponent'" message, as if the collapsing functionality were not implemented.

          Should I try using the 1.4.1 patch instead on the 3.2 sources? That doesn't feel right, but maybe they're compatible; I don't know.

          Robert Muir added a comment -

          Bulk move 3.2 -> 3.3

          Yuriy Akopov added a comment -

          Thanks, Stephen. So it isn't just me doing something wrong.

          I'm thinking of displaying not the actual figures against the facet items but rounded ones like 100+, 200+, 300+, etc. That should be okay, as the difference is not dramatic and seems to remain within a relatively narrow interval.

          Stephen Weiss added a comment -

          Yes, I've had this too:

          https://issues.apache.org/jira/browse/SOLR-236?focusedCommentId=12655750&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12655750

          I'm pretty sure I know the reason for it, but I don't know how to fix it... to the best of my knowledge no one on the ticket really said if the problem could be fixed or not yet either. At the moment we just use facet.before and explain to our users that the facets are for "unfiltered" results... almost no one complains once we explain it to them. However, a fix would be wonderful... people ask about it often enough that clearly it's not very intuitive.

          Yuriy Akopov added a comment -

          Hi and sorry for breaking the silence.

          So far the patch is working okay in our system, thanks again.

          However, I've noticed that the collapse.facet parameter set to 'after' doesn't produce very precise figures. When results are collapsed, it may report, say, 366 results for a facet item while there are actually 396 returned by Solr after collapsing.

          The figures are never very different from the actual ones, but they vary within some narrow interval. I mean, for result counts up to 10000 they differ by fewer than 100. My collapsing-related part of the query is the following:

          $search_options['qt'] = 'collapse';
          $search_options['collapse.field'] = 'my_string_field'; // name of the field to collapse on, in my case it is a string field
          $search_options['collapse.type'] = 'normal'; // it is always 'normal' and never 'adjacent' in my case
          $search_options['collapse.facet'] = 'after';

          When collapsing is turned off, facet figures are calculated precisely, as expected. Has anybody else experienced this, and if so, is there a solution available? Thanks in advance.

          Yuriy Akopov added a comment -

          Another question:

          The patched version of .war starts and works as expected if I place the following simple instruction in solrconfig.xml:

          <searchComponent name="collapse" class="org.apache.solr.handler.component.CollapseComponent">
          </searchComponent>

          But if I add the additional factories advised by the sample config, it produces an error when searching with collapsing turned on:

          <searchComponent name="collapse" class="org.apache.solr.handler.component.CollapseComponent">
          <collapseCollectorFactory class="solr.fieldcollapse.collector.DocumentGroupCountCollapseCollectorFactory" />
          <collapseCollectorFactory class="solr.fieldcollapse.collector.FieldValueCountCollapseCollectorFactory" />
          <collapseCollectorFactory class="solr.fieldcollapse.collector.DocumentFieldsCollapseCollectorFactory" />
          <collapseCollectorFactory name="groupAggregatedData" class="org.apache.solr.search.fieldcollapse.collector.AggregateCollapseCollectorFactory">
          <function name="sum" class="org.apache.solr.search.fieldcollapse.collector.aggregate.SumFunction"/>
          <function name="avg" class="org.apache.solr.search.fieldcollapse.collector.aggregate.AverageFunction"/>
          <function name="min" class="org.apache.solr.search.fieldcollapse.collector.aggregate.MinFunction"/>
          <function name="max" class="org.apache.solr.search.fieldcollapse.collector.aggregate.MaxFunction"/>
          </collapseCollectorFactory>
          </searchComponent>

          So far it does what I expect without the additional factories mentioned, but it still bothers me that it fails when they're listed. Maybe I placed the libraries in the wrong place?

          George P. Stathis added a comment - - edited

          Bump on Yuriy's last question:

          • Are the performance issues related to the number of documents matched, the size of the index, or both?

          E.g. our index contains over 12 million documents already. Should we even consider using this feature?

          Adding a few more questions:

          • Are the performance concerns about the 1.4 patch, the current Solr 4.0 branch, or both?
          • Is sharding an option to alleviate some of these issues? Reading the comments in this ticket, it seems there are caveats to getting this to work with shards.
          Stephen Weiss added a comment -

          It would work fine as long as you weren't sending the collapse parameters; I don't think you'd need to replace the WAR.

          Yuriy Akopov added a comment -

          In other words, if I use additional filtering conditions in my request to make sure the returned set of documents to be grouped is never larger than, say, 1 million items, can I expect the described problem to happen, or will I be safe? Or am I in danger regardless of the particular query and its resulting set, simply because my index contains a few million documents?

          (sorry for commenting twice on the same problem)

          Yuriy Akopov added a comment -

          Stephen, Grant, thanks for the notice. Currently the total number of documents we deal with is about 800K, and I expect it to grow up to 2M in a year, but every user is allowed to search not the whole amount but a subset of it (so, for every search, additional filtering conditions are applied). I hope we will be fine until Solr 4 comes out.

          But if we encounter any critical problems, would it be enough to remove the collapsing parameters from the request sent to Solr to prevent the patched code from failing, or would it be necessary to replace the Solr core with an unpatched one? I mean, is a failure on a large set of documents possible even when collapse.* parameters are not supplied, or only when collapsing is requested?

          Grant Ingersoll added a comment -

          Keep in mind an alternative approach that scales, but loses some attributes of this patch (total groups for instance) is committed on trunk and will likely be backported to 3.2.

          Stephen Weiss added a comment -

          Just be careful, Yuriy, there are reasons why this thing is not in Solr 1.4.1 already. The code does not scale particularly well beyond a few million documents, especially if you use the version that preserves totals and paging. It was enough to keep my software from being scrapped, but if you plan on scaling much past that point any time soon, you may need to start thinking about alternative solutions. I know I certainly am... I have a sinking worry my application may outgrow the limits of this patch's stability before something truly production-ready comes to the fore, possibly even this year if growth continues. However, given that the very concept of grouping is critical to the site that I support with Solr, and attempts to provide the same functionality without actually grouping have failed repeatedly over the past few months, it is very sadly starting to look like I will have to cut very useful features (to no end of complaints, I'm sure) in order to ensure its overall stability unless some miracle happens. Mama always told me I should have learned Java!

          Long story short, if you don't have to have this patch yet, and your software hasn't been written to do anything like this yet, I would not start doing it now! You will regret it when you run out of options later on and your servers start crashing all over the place. See if you can keep it under wraps until a real release comes out with it.

          Yuriy Akopov added a comment -

          Stephen, apparently the version you've advised works fine! At least those two issues I complained about are gone. Many thanks for your help!

          Yuriy Akopov added a comment -

          By the way, a noob question: after the build completes, the following jars are generated along with "apache-solr-1.4.2-dev.war":

          apache-solr-cell-1.4.2-dev.jar
          apache-solr-clustering-1.4.2-dev.jar
          apache-solr-core-1.4.2-dev.jar
          apache-solr-dataimporthandler-1.4.2-dev.jar
          apache-solr-dataimporthandler-extras-1.4.2-dev.jar
          apache-solr-solrj-1.4.2-dev.jar
          solrj-lib/commons-codec-1.3.jar
          solrj-lib/commons-httpclient-3.1.jar
          solrj-lib/commons-io-1.4.jar
          solrj-lib/geronimo-stax-api_1.0_spec-1.0.1.jar
          solrj-lib/jcl-over-slf4j-1.5.5.jar
          solrj-lib/slf4j-api-1.5.5.jar
          solrj-lib/wstx-asl-3.2.7.jar

          Do I also need to transfer these libraries, or is it enough to replace the WAR file to get the patched version working properly? In my previous tries I copied the solrj-lib/*.jar files to the lib folder of the Solr instance home. Maybe that was the problem?

          Yuriy Akopov added a comment -

          I didn't expect the reply to come so quickly! Thanks, Stephen, I'll try it and post the results then.

          Stephen Weiss added a comment -

          Yuriy... try my patch: SOLR-236-1_4_1-paging-totals-working.patch. I don't have either of the problems you describe (problem B was actually the purpose of my patch; I never saw problem A, and I have tons of "single", non-grouped documents, so I'm sure I would be seeing it if it were happening). Some people had problems applying the patch (I didn't apply it myself, I made it after the fact), but if you look back through the comments, people explain how to make it work. Note that I'm not using the SOLR-236-1_4_1-NPEfix.patch patch; I never had the NPE problem they describe, so I never bothered with it, and I'm not sure what it does.

          Yuriy Akopov added a comment -

          Hi,

          First of all, thank you guys for working on this! However, I have encountered a problem with this patch which is hopefully caused by my own mistakes, so please correct me if I have done something wrong.

          So, I have applied the SOLR-236 patch to release-1.4.1 and gained support for collapse.*, which works. However, two issues discussed above in this thread are still there:

          a) When collapsing is requested, only grouped results are returned. So, if a document has a unique value in the collapsed field (i.e. it has no other docs to group with), it is excluded from the results. Instead of the expected "unique documents plus non-unique documents grouped under the most relevant one", just the grouped ones are returned.

          b) The number of results matching the query ("numFound") is always equal to the "rows" parameter provided, or 10 if not supplied (i.e. it represents the number of results returned on the page, not the total number of matched documents).

          There is a way around the latter "numFound" issue: faceting by the collapsed field, as was suggested before, but the number retrieved with that facet is also misleading, as it includes unique (non-grouped) documents as well, even though they are not returned.
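          To make the mismatch in that faceting workaround concrete, here is a small sketch with made-up facet counts on the collapse field: every distinct value contributes a facet entry, but under issue (a) only groups of two or more documents are actually returned:

```python
# Made-up facet counts on the collapse field (value -> document count).
facet_counts = {"site-a": 5, "site-b": 1, "site-c": 3, "site-d": 1}

# Counting distinct facet values includes unique (non-grouped) documents...
facet_based_total = sum(1 for n in facet_counts.values() if n > 0)

# ...but only values shared by more than one document show up in the results.
actually_returned = sum(1 for n in facet_counts.values() if n > 1)

print(facet_based_total, actually_returned)  # 4 2
```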

          So far, I'm stuck with that. Is there any chance of resolving it? What about the SOLR-1682 patch - if it fixes this, should it be applied to the original release-1.4.1, or to a release-1.4.1 already patched with SOLR-236?

          Thanks in advance.

          P.S. As I understand it, grouping is planned for Solr 4.0. Does anybody know by any chance if it is safe to use its nightly builds? I ran through its pending critical issues and they don't look fatal, but still I'm afraid of possible implications.

          Doug Steigerwald added a comment -

          I started to try and backport SOLR-1682 to the 3x branch, but that seemed to get out of hand pretty quickly from what I remember (it was a few weeks ago). It was much easier making this work with the 3x branch than backporting SOLR-1682.

          We want/need new features in 3.1 when it is released and we won't be allowed to deploy trunk to our production environment.

          Otis Gospodnetic added a comment -

          Why are people still working on this SOLR-236 patch?
          Doesn't SOLR-1682 supersede it?
          And isn't SOLR-1682 the one that's in trunk, while nothing from SOLR-236 was ever applied to trunk?
          Thanks.

          Doug Steigerwald added a comment -

          Attaching a patch for the 3x branch (SOLR-236-branch_3x.patch). This is based off of SOLR-236-1_4_1-paging-totals-working.patch and SOLR-236-1_4_1-NPEfix.patch.

          Tests work and some basic spot checking I've done looks good.

          Doug Steigerwald added a comment -

          Has anyone successfully applied field collapsing to the branch_3x branch?

          Cameron added a comment -

          Uploading SOLR-236-1_4_1-NPEfix.patch as a simple patch for the NullPointerException Shekhar and Ron have reported. To keep it small, the patch is intended to be applied AFTER the SOLR-236-1_4_1-paging-totals-working.patch has already been applied.

          I didn't actually fix the filterCache key issue as Samuel suggested. Rather I'm preventing the NPE from occurring. I believe this is ok because the collapsed results will stay sorted by score as the collapser performs the collapsing.

          Hsiu Wang added a comment -

          I applied the SOLR-236-1_4_1-paging-totals-working.patch to the 3x branch. When I ran the unit test FieldCollapsingIntegrationTest, I got "Insane FieldCache usage(s) found expected:<0> but was:<1>" on all 3 sort-related tests (testNonAdjacentFieldCollapse_sortOnNameAndCollectAggregates, testNonAdjacentFieldCollapse_sortOnNameAndCollectCollapsedDocs, and testForArrayOutOfBoundsBugWhenSorting).

          Steven Fuchs added a comment -

          Great feature! But it seems to be missing a capability I need. I'll explain it:

          I'd like to use group results in my query, namely to exclude all documents in a group when any document in that group has a certain value. It could be as simple as a field value, although the ability to do more complex queries would be nice also. Please consider adding functionality like this to your sub-task list. Or better yet, if this capability exists and I missed it, please point it out.

          TIA
          steve

          Samuel García Martínez added a comment -

          The NPE noticed by Shekhar Nirkhe is caused by errors in the filter query cache and the signature key that is used to store cached results.

          To sum up, if you perform a filter query and then perform the same query using a collapse field, the query result is already cached, but not in the form expected by this component. As a result, the DocSet implementation is not the expected one and, since the result comes from the cache, the DocumentCollector is never executed.

          As soon as I can I'll post a patch that caches results using a combined key, formed by the collector class and the query itself.

          Colbenson - Findability Experts
          http://www.colbenson.es/
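The combined key described above could look roughly like this — a hypothetical Java sketch (class and field names are illustrative, not from any posted patch), pairing the collector class with the query so that collapse-aware and plain results for the same filter query get separate cache entries:

```java
import java.util.Objects;

// Hypothetical sketch of a combined filterCache key: collector class plus
// query, so different collectors never share a cached DocSet.
class CombinedCacheKey {
    private final Class<?> collectorClass; // e.g. a DocumentCollector subclass
    private final String query;            // the filter query

    CombinedCacheKey(Class<?> collectorClass, String query) {
        this.collectorClass = collectorClass;
        this.query = query;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof CombinedCacheKey)) return false;
        CombinedCacheKey k = (CombinedCacheKey) o;
        return collectorClass.equals(k.collectorClass) && query.equals(k.query);
    }

    @Override
    public int hashCode() {
        return Objects.hash(collectorClass, query);
    }

    public static void main(String[] args) {
        // Same query, different collectors -> keys differ, so no collision.
        System.out.println(new CombinedCacheKey(String.class, "cat:books")
            .equals(new CombinedCacheKey(Integer.class, "cat:books"))); // false
    }
}
```

Any cache keyed this way would return a cached result only when both the query and the collector type match.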

          Ron Veenstra added a comment -

          I have also been getting a null pointer exception:
          message null java.lang.NullPointerException at org.apache.solr.search.fieldcollapse.NonAdjacentDocumentCollapser$PredefinedScorer.docID(NonAdjacentDocumentCollapser.java:397)

          The error is repeatable for a given search term when sorted by "score desc," followed by any other field. It seems to crop up whenever there is only one result that should be returned in the collapsed field group, but does not happen for every possible query where this is the case (leading me to believe something else is at work). Changing the sort order to anything else (moving score to second, or omitting a second field) eliminates the error. This was the simple solution for my problem, but wanted to post this in case any of the information proved useful.

          Using Solr 1.4.1 with SOLR-236-1_4_1-paging-totals-working.patch

          Shekhar Nirkhe added a comment -

          I am getting a null pointer exception in
          at org.apache.solr.search.fieldcollapse.NonAdjacentDocumentCollapser$FloatValueFieldComparator.compare(NonAdjacentDocumentCollapser.java:443)

          I am using Solr 1.4.1 with following patches.

          SOLR-236-1_4_1.patch
          SOLR-236-1_4_1.fix.patch

          Am I missing something ?

          Joseph McElroy added a comment -

          Hi there,

          Great work on this feature; it is something I have been waiting a while to see implemented in Solr. Thank you all for this.

          Two questions however:

          • An option to sort the groups on the number of documents each group has? so the group with the largest number of documents would be the highest ranked.
          • Ability to return the number of groups within the result set? This would allow for pagination.

          Thanks
          Joe

          Luke Bochsler added a comment - edited

          "Is anyone working on the ability to calculate facets AFTER the group?"

          It would be great to have that possibility! Sorry, I'm not a Java programmer, so I cannot contribute a solution; instead I contribute to other open source systems. However, would it be a big deal for you guys to implement it? I'm using Solr as the search solution in a web project and desperately need this feature along with the great grouping functionality. The grouping in general has made my life so much easier so far, so it seems we are just one step away from having it all covered by Solr!

          Thank you so much!

          Luke

          Ingmar Seeliger added a comment -

          Field collapsing is a very nice feature - thank you for that!

          I've just tested it with (pseudo-)distributed search, meaning the data on each Solr server has one specific value for the collapse field, and noticed one problem:
          I want to include the collapsed documents in the result list, using collapse.includeCollapsedDocs.fl=...
          The result list has empty docs:
          <result name="collapsedDocs" numFound="4" start="0">
          <doc/>
          <doc/>
          <doc/>
          <doc/>
          </result>
          When I remove the distributed search, everything works fine on one server. Perhaps someone can look into that? Thanks!

          Yonik Seeley added a comment -

          > Is anyone working on the ability to calculate facets AFTER the group? Without a patch for that, the facet numbering is not correct.

          There's no correctness issue or bug here. Many use cases require the current behavior (the number of docs per group shown having no effect on faceting), and other use cases require what you seek. Both are valid, but we only have one implemented so far.
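The distinction between the two behaviors can be illustrated with a toy example — a hypothetical sketch (not Solr code) that computes facet counts over all matching documents versus over one representative document per group:

```java
import java.util.*;

// Toy illustration of facets computed before vs. after grouping.
// Each doc is {groupValue, facetValue}.
class FacetVsGroupDemo {
    // Facet over ALL matching docs (the currently implemented behavior).
    static Map<String, Integer> preGroupCounts(String[][] docs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String[] d : docs) counts.merge(d[1], 1, Integer::sum);
        return counts;
    }

    // Facet over one representative doc per group (the requested behavior).
    static Map<String, Integer> postGroupCounts(String[][] docs) {
        Map<String, Integer> counts = new TreeMap<>();
        Set<String> seenGroups = new HashSet<>();
        for (String[] d : docs)
            if (seenGroups.add(d[0])) counts.merge(d[1], 1, Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        String[][] docs = {{"g1", "red"}, {"g1", "red"}, {"g2", "red"}, {"g2", "blue"}};
        System.out.println(preGroupCounts(docs));  // {blue=1, red=3}
        System.out.println(postGroupCounts(docs)); // {red=2}
    }
}
```

Both answers are internally consistent; they simply count different document sets, which is why both use cases are valid.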

          Bill Bell added a comment -

          Yonik and team,

          Is anyone working on the ability to calculate facets AFTER the group? Without a patch for that, the facet numbering is not correct.

          Thank you.
          Bill

          Yonik Seeley added a comment -

          I've just committed a fix to the sort != group.sort problem.
          As I previously said, the algorithm for handling this was broken (the TopGroupSortCollector class), so I've redefined what sort means.
          Sort no longer orders groups by the first document in each group; it orders groups by the highest-ranking document by "sort" in that group.
          I've updated the randomized grouping tests to reflect this change, and enabled tests where sort != group.sort

          Stephen Weiss added a comment -

          Cheers peterwang, you're probably right. I didn't actually use this patch, I made the modifications by hand after applying Martijn's patch. I generally don't make my own patch files, I just let SVN do it for me, so I'm not really aware of the syntax... The point is to just delete those extra lines.

          Bill Bell added a comment -

          OK, I have a patch to add namedistinct. Note that it is optional; be careful of the number of facets when using it.

          On sample data:

          http://localhost:8983/solr/select?q=*:*&facet=true&facet.field=manu&facet.mincount=1&facet.limit=-1&f.manu.facet.namedistinct=0&facet.field=price&f.price.facet.namedistinct=1

          It works on facet.fields.

          SOLR-236-distinctFacet.patch

          Bill Bell added a comment -

          To do distinct facet counts.

          peterwang added a comment - edited

          SOLR-236-1_4_1-paging-totals-working.patch failed with the following error:

          patch: **** malformed patch at line 3348: Index: src/test/org/apache/solr/search/fieldcollapse/DistributedFieldCollapsingIntegrationTest.java

          This seems to be caused by hand-editing SOLR-236-1_4_1.patch to produce SOLR-236-1_4_1-paging-totals-working.patch (6 lines were deleted without fixing the diff hunk line count).
          A possible fix:

          diff -u SOLR-236-1_4_1-paging-totals-working.patch.orig SOLR-236-1_4_1-paging-totals-working.patch
          --- SOLR-236-1_4_1-paging-totals-working.patch.orig     2010-11-17 19:26:05.000000000 +0800
          +++ SOLR-236-1_4_1-paging-totals-working.patch  2010-11-17 19:17:20.000000000 +0800
          @@ -2834,7 +2834,7 @@
           ===================================================================
           --- src/java/org/apache/solr/search/fieldcollapse/NonAdjacentDocumentCollapser.java    (revision )
           +++ src/java/org/apache/solr/search/fieldcollapse/NonAdjacentDocumentCollapser.java    (revision )
          -@@ -0,0 +1,517 @@
          +@@ -0,0 +1,511 @@
           +/**
           + * Licensed to the Apache Software Foundation (ASF) under one or more
           + * contributor license agreements.  See the NOTICE file distributed with
          
          Bill Bell added a comment -

          Here is an idea. If we go with the terminology -

          <int name="name">value</int>

          Then we can just return the distinct name count with a few mods to SimpleFacet.java. All other parameters still apply. The default will be off.

          facet.<field>.namedistinct=1

          <lst name="hgid">
          <int name="count">3</int>
          </lst>

          I can have a patch for this today. Would this be something that we could go with?

          Bill
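The single number the namedistinct proposal above would return — the count of distinct facet values meeting mincount — could be computed with a helper along these lines (a hypothetical sketch; the actual SimpleFacet.java changes are not shown here):

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for the namedistinct idea: return only the number of
// distinct facet values meeting mincount, instead of the values themselves.
class NameDistinct {
    static int nameDistinct(Map<String, Integer> facetCounts, int mincount) {
        int n = 0;
        for (int count : facetCounts.values())
            if (count >= mincount) n++;
        return n;
    }

    public static void main(String[] args) {
        Map<String, Integer> hgid = new TreeMap<>();
        hgid.put("HGPY0056D09F7B57442E8", 4);
        hgid.put("HGPY00A33AD7808996941", 3);
        hgid.put("HGPY00D6274FD07B4EE7A", 3);
        System.out.println(nameDistinct(hgid, 1)); // 3
    }
}
```

The response would then carry just `<int name="count">3</int>` instead of the full (potentially huge) facet value list.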

          Bill Bell added a comment -

          Yonik,

          > OK Bill, this should be fixed in the latest trunk... can you try it out?

          Yes, paging seems to work right now.

          Question: Is there a way to return or do it with some sort of ord() value?

          <lst name="hgid">
          <int name="HGPY0056D09F7B57442E8">4</int>
          <int name="HGPY00A33AD7808996941">3</int>
          <int name="HGPY00D6274FD07B4EE7A">3</int>
          </lst>
          <facetname name="hgid">3</facetname>

          Bill Bell added a comment -

          I found a solution, but it is not ideal. I need to be able to get a count of facets (left side and not left side):

          http://localhosT:8983/solr/select?facet=true&facet.field=hgid&facet.limit=100000&facet.mincount=1

          (HGID is the group by)

          I get the following, but I need the left-side number. So instead of 10, I need "3". Is there any way to do that? Just return 3.

          <lst name="hgid">
            <int name="HGPY0056D09F7B57442E8">4</int>
            <int name="HGPY00A33AD7808996941">3</int>
            <int name="HGPY00D6274FD07B4EE7A">3</int>
          </lst>
          Stephen Weiss added a comment -

          This would be the patch that I'm describing... I used it with the Solr 1.4.1 release tarball. It's just Martijn's latest patch minus a few lines (by his suggestion) that mess up the totals and paging. Again, you want to make sure your server is well configured - we are not really Java people and it took a while to get the settings to a place where we didn't have OOM errors every day. We're using these startup options with Jetty:

          -Xms10240m -Xmx10240m -XX:NewRatio=5 -XX:+UseParNewGC

          That RAM total is half the RAM available on the machine - we leave the rest of the RAM open for disk caches. It will take up its half of the RAM very quickly but then it hovers there and has only ever gone over the limit once since September, which seemed to be related to an unoptimized index (after replacing an unusually large # of docs).

          Bill Bell added a comment -

          Is the older CollapseComponent still available in the trunk?

          Or do we need to use the newer group parameters?

          How do I get the older one to work?

          James Dyer added a comment -

          Stephen,

          I would be very interested in seeing your patch if you can upload it. Luckily, the index we're migrating to SOLR for this project is small and I think I won't have to scale very much in either case. Your patch might be better than the current SOLR-1682/236 patches for our needs however.

          Stephen Weiss added a comment -

          If you need help James, I have a version of 1.4.1 patched that does do the collapsing and provide this data - it was based on some of the comments above along with a patch that came out a while ago (back when it really was only 5 or 6 lines of difference). The faceting route really doesn't work out well once you hit a certain number of collapse groups. Anyway, I've been using this version in production for quite a while now, and while it is a bit of a memory hog, if you manage the memory properly, keep your indexes optimized and provide enough RAM to cover your indexes then it's pretty stable and gets the job done.

          Yonik Seeley added a comment -

          > I remember it only was a difference of 5 or 6 lines of code either way.

          Not with what is committed in trunk. To be scalable with respect to the number of groups, we only keep the top 10 groups in memory at any one time (and hence we never know the total number of groups). The ability to retrieve the number of groups will require a different algorithm with different tradeoffs. I'm sure we'll get to it in time, but it is not just a tweak to the existing algorithm.
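The bounded approach described here can be sketched roughly as follows — hypothetical Java, not the trunk implementation: a min-heap retains only the best N groups, evicting the weakest as it goes, so the total group count is never tracked.

```java
import java.util.*;

// Sketch: keep only the top N groups by best score. Because weaker groups
// are discarded on the fly, the total number of groups is never known.
class TopGroups {
    static List<String> topGroups(Map<String, Double> bestScorePerGroup, int n) {
        // Min-heap ordered by score: the weakest retained group is on top.
        PriorityQueue<Map.Entry<String, Double>> heap =
            new PriorityQueue<>(Map.Entry.comparingByValue());
        for (Map.Entry<String, Double> e : bestScorePerGroup.entrySet()) {
            heap.offer(e);
            if (heap.size() > n) heap.poll(); // evict the weakest group
        }
        List<String> result = new ArrayList<>();
        while (!heap.isEmpty()) result.add(heap.poll().getKey());
        Collections.reverse(result); // highest-scoring group first
        return result;
    }

    public static void main(String[] args) {
        Map<String, Double> groups = new HashMap<>();
        groups.put("a", 1.0);
        groups.put("b", 3.0);
        groups.put("c", 2.0);
        groups.put("d", 0.5);
        System.out.println(topGroups(groups, 2)); // [b, c]
    }
}
```

Counting all groups would instead require tracking every distinct group value seen, which is the different (more memory-hungry) algorithm mentioned.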

          James Dyer added a comment -

          We also have a hard requirement for field collapsing with the total # of groups for a project scheduled for Production Q1 2011. So far, as best I can tell, I would have to facet on the group-by field with facet.limit=-1 to get this. Surely we would have less overhead if the group-by functionality could compute this by itself and just return the number. Turning it on/off makes sense as some won't want the performance/memory hit.

          Stephen Weiss added a comment -

          Just chiming in on that last comment... we also rely on functional paging and total counts when collapsing as well. I once raised the idea of not providing this information in our search results to my boss and he looked at me like I had 3 heads, it's just not an option. In most of the patches on this ticket we could get this data, but for some it seemed like eliminating totals and paging wasn't a big deal and provided a significant performance boost. I can understand the reasons for not including this for every collapsed query (if you don't need the totals or paging then the performance boost is nice), but if there was a way we could have an option to turn this on or off (even with the performance hit, having it is better than not being able to collapse at all), maybe that could help keep everyone happy. I remember it only was a difference of 5 or 6 lines of code either way.

          Bill Bell added a comment -

          Yonik,

          I am testing. Will get back to you on the starts/rows.

          Also, is there a way to get the total number of results based on the grouping? I get the following:

          
           <lst name="grouped">
             <lst name="hgid">
               <int name="matches">6</int>
               <arr name="groups">
                 <lst>
          
          

          But no total number. Also, matches=6 includes those docs not returned (the group has 2 entries, but I only return 1). It should show matches=6, results=4 (since 2 are hidden), totalNumber=6747.

          Otherwise we cannot page.

          If we do a query like http://localhost:8983/select?q=test&facet=true&face.field=hgid there are too many results (thousands). Any other way to group by and get a total?

          Yonik Seeley added a comment -

          OK Bill, this should be fixed in the latest trunk... can you try it out?

          Yonik Seeley added a comment -

          We get 15 results. 10+5 ? It should be 10 rows.

          Yes, I've reproduced this with the random testing too. Not sure what to make of it yet.
          It looks like the orderedGroups TreeSet acquires too many entries for some reason.

          Bill Bell added a comment -

          We are having an issue with this patch.

          http://localhost:8983/solr/provs/select?fl=hgid,score&q.alt=*:*&start=5&rows=10&qt=standard&group=true&group.field=hgid
          

          We get 15 results. 10+5 ? It should be 10 rows. This does not appear to be working right with start and rows.

          Yonik Seeley added a comment -

          NOTE: there was a serious bug when sort != group.sort (i.e. when TopGroupSortCollector was used).

          Actually, I think it's worse. The algorithm added in SOLR-1682 (TopGroupSortCollector) that handled sort != group.sort seems broken.
          The problem: a high-ranking group may be demoted to a lower-ranking group because its top document changed (and the sorts used to find the top doc in a group and the top group are different). But we may have already discarded higher-ranking groups based on the original high ranking, so now we have permanently lost information.
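To make that failure mode concrete, here is a small Python sketch (hypothetical data and a deliberately simplified collector, not Solr's actual Java code) of a single-pass collector that ranks groups by the sort value of a head doc chosen under a different group.sort, compared with recomputing group ranks over the whole stream:

```python
def single_pass_top_groups(docs, k):
    """Buggy: rank each group by the 'sort' value of its head doc, where the
    head doc is chosen by 'group.sort'. Groups discarded early are gone for
    good, even if a later head change demotes a kept group below them."""
    kept = {}  # group -> (sort value of head doc, group.sort value of head doc)
    for group, sort_val, group_sort_val in docs:
        if group in kept:
            if group_sort_val < kept[group][1]:  # better head by group.sort
                kept[group] = (sort_val, group_sort_val)
        else:
            kept[group] = (sort_val, group_sort_val)
        if len(kept) > k:  # permanently discard the worst group by 'sort'
            del kept[max(kept, key=lambda g: kept[g][0])]
    return sorted(kept, key=lambda g: kept[g][0])

def two_pass_top_groups(docs, k):
    """Rank groups by their best doc under 'sort' over the whole stream."""
    best = {}
    for group, sort_val, _ in docs:
        best[group] = min(best.get(group, sort_val), sort_val)
    return sorted(best, key=lambda g: best[g])[:k]

# (group, sort value, group.sort value) -- lower is better for both
docs = [("A", 1, 9), ("B", 2, 1), ("C", 3, 1), ("A", 9, 1)]
```

With k=2 the single-pass version discards C early, then A's head doc changes (group.sort value 1 beats 9) and A's rank drops to 9; the already-discarded C (rank 3) can never come back, which is the permanently lost information described above.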

          Yonik Seeley added a comment -

          Random testing found another bug - while finding the top groups, we forgot to setBottom on the priority queue when it changed.
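For illustration, here is a toy Python version of the comparator/bottom pattern (an assumed simplification, not Lucene's actual code) showing how a stale cached bottom breaks the competitive check:

```python
import heapq

def top_k(values, k, update_bottom=True):
    """Keep the k largest values. 'bottom' caches the weakest entry in the
    queue so new candidates can be rejected cheaply; forgetting to refresh
    it after the queue changes (the missing setBottom) corrupts the result."""
    heap = []                  # min-heap holding the current top-k
    bottom = float("-inf")     # cached weakest entry in the heap
    for v in values:
        if len(heap) < k:
            heapq.heappush(heap, v)
            if update_bottom:
                bottom = heap[0]
        elif v > bottom:       # competitive check against the cached bottom
            heapq.heapreplace(heap, v)
            if update_bottom:
                bottom = heap[0]   # the "setBottom" step that was forgotten
    return sorted(heap, reverse=True)
```

With the refresh in place, top_k([5, 4, 3, 2, 1], 2) keeps [5, 4]; with update_bottom=False the stale -inf bottom lets every later value replace the heap minimum, ending with [5, 1].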

          Yonik Seeley added a comment -

          NOTE: there was a serious bug when sort != group.sort (i.e. when TopGroupSortCollector was used).
          The wrong comparators were used in one place, leading to errors finding the top groups. I just committed a fix for this.

          The NPE when rows==0 has also been fixed.

          Yonik Seeley added a comment -

          Two more corner cases not yet fixed:
          1) if rows==0, we get an NPE
          2) if group.limit and group.offset are both 0, then the counts for the resulting doclists are all zero.

          Yonik Seeley added a comment -

          Just committed a fix for a problem that the random testing I'm developing uncovered - we lost the defaulting of group.sort to sort during the last refactoring.

          Martijn van Groningen added a comment -

          After applying the patch SOLR-236-1_4_1.patch the ant test task fails on org.apache.solr.spelling.SpellingQueryConverterTest. Can it be ignored?

          I think so, since the patch you refer to has nothing to do with spelling.

          Yonik Seeley added a comment -

          Here's a refactoring patch that pulls all the grouping stuff out of SolrIndexSearcher (I'm sure many of you will be glad about that) and uses subclasses rather than instanceof checks for the different behavior of grouping commands.

          This isn't the end of refactoring, but it's a good start I think, and should make additional changes easier.

          Thorsten Maus added a comment -

          After applying the patch SOLR-236-1_4_1.patch the ant test task fails on org.apache.solr.spelling.SpellingQueryConverterTest. Can it be ignored?

          Jamie added a comment -

          When using collapse.includeCollapsedDocs.fl and sorting by a field (not score), the returned collapsed results aren't sorted correctly.

          Yonik Seeley added a comment -

          It works great but gives problem when I include other components like Facet and Highlighter.

          See the list of sub-tasks on this issue starting with "SearchGrouping:".
          I fixed faceting yesterday - and I hope to fix highlighting and debugging today.

          Varun Gupta added a comment -

          I am using the patch SOLR-1682 committed on trunk for field collapsing. It works great but gives problem when I include other components like Facet and Highlighter. Is there any workaround to use Highlight and Facet components along with grouping?

          Stephen Weiss added a comment -

          FWIW, I fixed my earlier OOM issues with some garbage collection tuning.

          Now I'm noticing NPEs very similar to those people were reporting back before the patch from Jun 28th:

          SEVERE: java.lang.NullPointerException
          at org.apache.solr.search.fieldcollapse.NonAdjacentDocumentCollapser$FloatValueFieldComparator.compare(NonAdjacentDocumentCollapser.java:450)
          at org.apache.solr.search.fieldcollapse.NonAdjacentDocumentCollapser$DocumentComparator.compare(NonAdjacentDocumentCollapser.java:262)
          ... it's the same backtrace ...

          I'm guessing it's because I added those 5 lines back into the patch to get the paging working again.

          It's rather infrequent, so it's probably something I can deal with until the new patch is complete. It doesn't happen every time like it seemed to for many people - just once in a while, on queries that honestly run all the time, so it seems random and not related to a particular query (except perhaps in the size of the filter queries - these fqs match relatively large numbers of documents). But if any of this code makes it into the new patch I thought it would be worth mentioning.

          Amit Nithian added a comment -

          Two questions and one comment:
          Comment:
          1) This is a neat patch! Thanks for this contribution.

          Questions:
          1) Which patch should we start using.. this one or the one Yonik referenced?
          2) Will the cache config in the component be retrieved via the CacheConfig instead of as a child element in the component?

          Excited to see the final product. I am using it for a simple app right now and it's working fairly well.

          Peter Kieltyka added a comment -

          Hey guys,

          How difficult would it be to add the ability to specify that, for collapsed values, none of the documents are returned, i.e. to just purge all duplicates from the results?

          This could be done by adding a new parameter, collapse.purge, which can be true or false and defaults to false.

          I could really use that. I have a scenario where I have the following data set of documents:

          ALL: <1,2,3,4,5>
          A: <1,2>
          B: <3,4>
          C: <4,5>

          and I want to search the text within the subset of documents: (ALL - A) = <3,4,5>

          Collapse would do this ..

          q => text:something AND -(group_id:[* TO *] AND -group_id:A)
          collapse.field => uid
          collapse.purge => true

          Cheers!
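The proposed purge behaviour could be sketched like this (a minimal Python illustration; collapse.purge is only a proposal here and the helper below is hypothetical, not Solr code):

```python
from collections import Counter

def collapse(docs, field, purge=False):
    """purge=False: keep one head doc per field value (normal collapsing).
    purge=True: drop every doc whose field value occurs more than once,
    i.e. the proposed collapse.purge behaviour."""
    counts = Counter(doc[field] for doc in docs)
    seen = set()
    out = []
    for doc in docs:
        value = doc[field]
        if purge:
            if counts[value] == 1:   # keep only values with no duplicates
                out.append(doc)
        elif value not in seen:      # keep the first doc of each group
            seen.add(value)
            out.append(doc)
    return out

docs = [{"uid": 3}, {"uid": 4}, {"uid": 4}, {"uid": 5}]
```

collapse(docs, "uid") keeps one doc per uid (3, 4, 5), while collapse(docs, "uid", purge=True) removes both uid=4 docs, leaving only 3 and 5, matching the (ALL - A) style subsetting described above.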

          Yonik Seeley added a comment -

          Since everyone seems to be watching this issue, I'll comment here.
          I've just committed the first parts to field collapsing to trunk! See SOLR-1682
          Thanks to everyone who has worked on these related issues for so long!
          I chose to back off and bite off a manageable piece, but I referenced all the
          great work that has been done in the various related issues, and tried
          to give credit to everyone who's submitted patches (let me know if I missed anyone).

          This is really just a start to build from of course - there's much left to do!

          Evgeniy Serykh added a comment -

          I've patched the solr 1.4.1 release. When I try to execute a query with collapsing, the 'numFound' value always equals 10 when the 'rows' param is not specified.

          wyhw whon added a comment -

          When I use fq=xxxx:1302, I get the error below, but it works with other fq values.

          HTTP Status 500 - -1073634 java.lang.ArrayIndexOutOfBoundsException: -1073634
            at org.apache.lucene.search.FieldComparator$StringOrdValComparator.copy(FieldComparator.java:659)
            at org.apache.lucene.search.TopFieldCollector$OutOfOrderOneComparatorNonScoringCollector.collect(TopFieldCollector.java:133)
            at org.apache.solr.search.SolrIndexSearcher.sortDocSet(SolrIndexSearcher.java:1529)
            at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:973)
            at org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:347)
            at org.apache.solr.search.SolrIndexSearcher.getDocListAndSet(SolrIndexSearcher.java:1503)
            at org.apache.solr.handler.component.CollapseComponent.doProcess(CollapseComponent.java:183)
            at org.apache.solr.handler.component.CollapseComponent.process(CollapseComponent.java:134)
            at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
            at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
            at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
            at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:339)
            at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:242)
            at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
            at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
            at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
            at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
            at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
            at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
            at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
            at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
            at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844)
            at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
            at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
            at java.lang.Thread.run(Thread.java:619)

          But:
          it works if I don't use fq, and
          it also works if I disable useFilterForSortedQuery in solrconfig.xml.

          David Tuška added a comment -

          Hello, I found a bug in "Field collapsing".
          I tested it with the solr-1.4.1 patch and with the trunk patch (rev. 955615).

          1) No collapse_counts/results are returned when collapseCount==1,
          although the results are still returned uncollapsed.

          http://localhost:8080/solr_tour/select/?q=nl_counter%3A1%0D%0A&start=0&rows=10&indent=on&sort=c_price_from_orig+asc&collapse.field=nl_tour_id&collapse.threshold=1&collapse.type=adjacent&collapse.debug=true

           
          <lst name="collapse_counts">
            <str name="field">nl_tour_id</str>
            <lst name="results"/>
            <lst name="debug">
              <str name="Docset type">HashDocSet(26)</str>
              <long name="Total collapsing time(ms)">0</long>
              <long name="Create uncollapsed docset(ms)">0</long>
              <long name="Get fieldvalues from fieldcache (ms)">0</long>
              <long name="AdjacentDocumentCollapser collapsing time(ms)">0</long>
              <long name="Creating collapseinfo time(ms)">0</long>
              <long name="Convert to bitset time(ms)">0</long>
              <long name="Create collapsed docset time(ms)">0</long>
            </lst>
          </lst>
          <result name="response" numFound="26" start="0">
          10x <doc></doc> 
          ...
          

          If I look into the code, I find some problematic parts:

          In NonAdjacentDocumentCollapser.java, the function doCollapsing has a bad condition and bad priorityQueue handling:

          NonAdjacentDocumentCollapser.java
          protected void doCollapsing(DocSet uncollapsedDocset, FieldCache.StringIndex values) {
          
            for (DocIterator i = uncollapsedDocset.iterator(); i.hasNext();) {
              int currentId = i.nextDoc();
              String currentValue = values.lookup[values.order[currentId]];
          
              NonAdjacentCollapseGroup collapseDoc = collapsedDocs.get(currentValue);
          
              if (collapseDoc == null) {
                ..
              }
          
              Integer dropOutId = (Integer) collapseDoc.priorityQueue.insertWithOverflow(currentId);
          
              // IMHO this must be >=, not >
              if (++collapseDoc.totalCount > collapseThreshold) {
                collapseDoc.collapsedDocuments++;
          
                // Problem here too: if the group has only one doc, collapseDoc.priorityQueue.insertWithOverflow returns null for collapse.threshold=1, so nothing is reported
                if (dropOutId != null)
                {
                  for (CollapseCollector collector : collectors) {
                    collector.documentCollapsed(dropOutId, collapseDoc, collapseContext);
                  }
                }
              }
          }
          

          In AdjacentDocumentCollapser.java, doCollapsing has a problem in the initializing condition:
          if there is only one doc, only the initializing branch is processed; the else-if and else branches never run, so neither collector.documentCollapsed nor collector.documentHead is called.

          AdjacentDocumentCollapser.java
          protected void doCollapsing(DocSet uncollapsedDocset, FieldCache.StringIndex values) {
            ...
            String collapseValue = null;
            ...
            for (DocIterator i = uncollapsedDocset.iterator(); i.hasNext();) {
              int currentId = i.nextDoc();
              String currentValue = values.lookup[values.order[currentId]];
          
              // Initializing
              if (collapseValue == null) {
                repeatCount = 0;
                collapseCount = 0;
                collapseId = currentId;
                collapseValue = currentValue;
          
                // Collapse the document if the field value is the same and
                // we have a run of at least collapseThreshold uncollapsedDocset.
              }
              // IMHO this must be if, not else-if
              else if (collapseValue.equals(currentValue))
              {
                if (++repeatCount >= collapseThreshold) {
                  collapseCount++;
                  for (CollapseCollector collector : collectors) {
                    CollapseGroup valueToCollapse = new AdjacentCollapseGroup(collapseId, currentValue);
                    collector.documentCollapsed(currentId, valueToCollapse, collapseContext);
                  }
                } else {
                  addDoc(currentId);
                }
              }
              else
              {
                ...
              }
              ...
            }
            ...
          }
          

          2) I have a problem with sorting. I need to sort CollapseGroups by the c_price_from_orig field,
          but if the request contains "sort=c_price_from_orig+asc",
          the returned CollapseGroups are sorted by c_price_from_orig (the minimum over the collapsed docs in each group),
          yet some CollapseGroups are skipped and the doc with the lowest c_price_from_orig is not returned first!

          I'll try to debug this problem and report back with more details.

          Thanks for your reply, sorry for my English, and

          best regards
          David

          Pavel Minchenkov added a comment -

          Please, update patch for trunk.

          cruz fernandez added a comment - - edited

          I'm having an issue with the facet exclude filter parameters (http://wiki.apache.org/solr/SimpleFacetParameters#Tagging_and_excluding_Filters). I have added these exclude tags and the facet result I'm getting is without collapsing (it's counting the uncollapsed items).

          For example, on my first page the facet result shows something like this:

          • book (11)
          • website (20)
          • journal (5)

          after clicking on book it shows 11 results correctly, but the faceting with the exclude applied shows:

          • book (230)
          • website (25)
          • journal (5)

          I am using the parameter collapse.facet=after

          The collapsed count of books is 11, and the uncollapsed count is 230, I verified it.
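The discrepancy can be sketched in Python (an illustration only, not Solr code; "after" mirrors counting one hit per collapsed group, as collapse.facet=after should, while "before" counts raw documents, which is what the excluded-filter path appears to fall back to):

```python
from collections import defaultdict

def facet_counts(docs, facet_field, collapse_field, after_collapse=True):
    """after_collapse=True: count one hit per distinct collapse_field value,
    per facet value. after_collapse=False: count raw (uncollapsed) docs."""
    if after_collapse:
        groups = defaultdict(set)
        for doc in docs:
            groups[doc[facet_field]].add(doc[collapse_field])
        return {value: len(ids) for value, ids in groups.items()}
    counts = defaultdict(int)
    for doc in docs:
        counts[doc[facet_field]] += 1
    return dict(counts)
```

With duplicated docs in a facet bucket, the two modes diverge exactly as reported: the raw count (230 books) far exceeds the collapsed count (11 books).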

          Pavel Minchenkov added a comment -

          Latest patch for current trunk has many conflicts in SolrIndexSearcher.java.

          Stephen Weiss added a comment -

Actually I'm testing more (I want to make sure it's not just my own error), and it seems like paging in general is just broken with this patch - any page between 4 and 80 seems to have the exact same results on it. Then the results change a little every 20 pages or so.

          Stephen Weiss added a comment -

Oh Martijn, I hope you're reading. After a few months of calm we had some OOMs again on our production servers. So I tried your latest patch with the Solr 1.4.1 release, since various fixes for memory leaks are bundled in there. The performance difference is great - far less CPU and RAM usage all around. But there's a catch! Something was introduced that changes the "numFound" that is reported. After we noticed this, I found your comment and removed these lines from NonAdjacentDocumentCollapser.java:

+ if (collapsedGroupPriority.size() > maxNumberOfGroups) {
+   NonAdjacentCollapseGroup inferiorGroup = collapsedGroupPriority.first();
+   collapsedDocs.remove(inferiorGroup.fieldValue);
+   collapsedGroupPriority.remove(inferiorGroup);
+ }

          We did NOT remove line 99 as suggested because this caused compiler problems:

          [javac] /home/sweiss/apache-solr-1.4.1/src/java/org/apache/solr/search/fieldcollapse/NonAdjacentDocumentCollapser.java:99: cannot find symbol
          [javac] symbol : variable collapseDoc
          [javac] location: class org.apache.solr.search.fieldcollapse.NonAdjacentDocumentCollapser
          [javac] if (collapseDoc == null) {

          After doing this, I noticed a huge performance drop - far worse than what we had even with 1.4 and your patch from December. Searches were taking >10s to complete (before we were just over 1s for the worst searches). So, I went back and tried to find a way to get the "numFound" through other means - and I figured I could just facet on the same field we're collapsing on, and then count the number of facets. Looks good - the count of the facets is the right count, and it would appear to be working.

But there's a snag. It seems that the results being returned by your patch, unaltered, are incorrect. For example, my search for "orange" returns 7200 collapsed results, either using the real numFound from the altered patch or using the facet method with the new patch. This equates to 160 pages of results. However, with the unaltered patch, if we actually try to retrieve page 158, or really any page over 130 or so, we get the exact same results. With the altered patch (removing those few lines), page 158 actually is page 158. Basically, it seems like your patch throws away good results - and I get the feeling that it throws them away somewhere in those 5 lines.

Now, I'm stuck. I really don't know what to do... I don't want the OOMs to continue, but it looks like they will regardless, because both the old version (1.4 + December patch) and the new, altered patched version are using too many resources. But if I use the latest patch without changing it, I'm not getting the right results all the way through.

          Is there anything we can do? I appreciate your help...
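The facet-based workaround described above relies on the collapsed hit count equaling the number of distinct values of the collapse field among the matching documents (assuming, as in this setup, exactly one group per distinct value). A toy illustration of that equivalence:

```java
import java.util.LinkedHashSet;
import java.util.List;

public class CollapsedCountSketch {
    /** Collapsed numFound = number of distinct collapse-field values in the result set. */
    static int collapsedNumFound(List<String> collapseFieldValues) {
        return new LinkedHashSet<>(collapseFieldValues).size();
    }

    public static void main(String[] args) {
        // five matching docs collapsing into three groups
        System.out.println(collapsedNumFound(List.of("a", "a", "b", "c", "c"))); // 3
    }
}
```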

          Martijn van Groningen added a comment -

bq. Seconded. The NPE's were occurring rather randomly, but I haven't seen them since I've switched to 1.4.1 + your latest patch. Good stuff! It's also nice to have a patch against an actual release version (FYI, I was using r955615 before as per your patch note).

A lot of stuff is changing (or has already changed) in Lucene / Solr internally, so that might have been the cause of these exceptions.

bq. So the actual field requested (content) doesn't get added. It does work when I remove the shards= parameter, only querying one core.

I think that this part of the response is not copied from the shards' responses into the response that is returned to the client, so that will have to be added in order to get these collapsed documents.

One important note about this patch: it is not going to be committed. Child issues of SOLR-236, like SOLR-1682, on the other hand will get committed to the trunk, but it might take some time until all the functionality that the SOLR-236 patches provide is implemented in an efficient manner. Just to make some things clear, because this is a long, very long and complicated issue.

          Jasper van Veghel added a comment -

          Seconded. The NPE's were occurring rather randomly, but I haven't seen them since I've switched to 1.4.1 + your latest patch. Good stuff! It's also nice to have a patch against an actual release version (FYI, I was using r955615 before as per your patch note).

          The only thing I'm still running into at this point is that I'm trying to get this to run using multiple cores / shards. Documents with the same collapse-field values don't span across shards so I figured it should work, and it does. But when including:

          collapse.includeCollapsedDocs.fl=content

          The actual documents returned in the collapse-counts/results are listed as:

          <result name="collapsedDocs" numFound="1" start="0">
          <doc/>
          </result>

          So the actual field requested (content) doesn't get added. It does work when I remove the shards= parameter, only querying one core.

          Doug Steigerwald added a comment -

          Excellent! Everything looks good with our issue. Thanks for the quick turn around.

          Martijn van Groningen added a comment -

          @Doug Steigerwald and Jasper van Veghel
          Can you check if your errors still occur in the latest patch for 1.4.1 release?

          Martijn van Groningen added a comment -

Attached a new patch. This patch is a backport of the latest patch to the Solr 1.4.1 release. There are currently many changes in the trunk which make maintaining this patch difficult. To apply this patch, check out http://svn.apache.org/repos/asf/lucene/solr/tags/release-1.4.1/ and apply the patch in the checkout directory.

          Jasper van Veghel added a comment -

          I'm getting the same Exception as Eric Caron, only without using an fq. It seems to have something to do with caching and potentially stemming.

          These queries are being run against a set of Dutch political news articles. The following works:

          /select?q=rosenthal&collapse.field=url_exact

          And this doesn't:

          /select?q=roos&collapse.field=url_exact

Due to stemming, 'roos' is also highlighted in results for the former query; hence the (expected) results for the latter query are a subset of the former. The exception is:

          java.lang.NullPointerException
          at org.apache.solr.search.fieldcollapse.NonAdjacentDocumentCollapser$FloatValueFieldComparator.compare(NonAdjacentDocumentCollapser.java:451)
          at org.apache.solr.search.fieldcollapse.NonAdjacentDocumentCollapser$DocumentComparator.compare(NonAdjacentDocumentCollapser.java:263)
          at org.apache.solr.search.fieldcollapse.NonAdjacentDocumentCollapser$DocumentPriorityQueue.lessThan(NonAdjacentDocumentCollapser.java:197)
          at org.apache.lucene.util.PriorityQueue.insertWithOverflow(PriorityQueue.java:148)
          at org.apache.solr.search.fieldcollapse.NonAdjacentDocumentCollapser.doCollapsing(NonAdjacentDocumentCollapser.java:114)
          at org.apache.solr.search.fieldcollapse.AbstractDocumentCollapser.executeCollapse(AbstractDocumentCollapser.java:259)
          at org.apache.solr.search.fieldcollapse.AbstractDocumentCollapser.collapse(AbstractDocumentCollapser.java:183)
          at org.apache.solr.handler.component.CollapseComponent.doProcess(CollapseComponent.java:173)
          at org.apache.solr.handler.component.CollapseComponent.process(CollapseComponent.java:127)
          at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
          at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
          at org.apache.solr.core.SolrCore.execute(SolrCore.java:1322)
          at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:337)
          at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:240)
          at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157)
          at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:388)

          Doug Steigerwald added a comment -

          I keep running into an ArrayIndexOutOfBoundsException when sorting with field collapsing. I'm running Solr 1.4.1 with the field-collapse-5.patch along with the 3 files from Peter for OOM issues.

          We've got a basic query that returns all event type records in the index (object_class:events), and one fq to make sure we're grabbing data for the correct site (site_id:86). I'm sorting on a category_id (TrieIntField). Collapsing on a string (collapse.type=normal). Here's a basic query that doesn't work for us.

          q=object_class:events&fq=site_id:86&sort=category_id+desc&collapse.field=rollup&collapse.type=normal

          Jun 24, 2010 3:20:12 PM org.apache.solr.common.SolrException log
          SEVERE: java.lang.ArrayIndexOutOfBoundsException: -4294
          at org.apache.lucene.search.FieldComparator$IntComparator.copy(FieldComparator.java:328)
          at org.apache.lucene.search.TopFieldCollector$OutOfOrderOneComparatorNonScoringCollector.collect(TopFieldCollector.java:133)
          at org.apache.solr.search.SolrIndexSearcher.sortDocSet(SolrIndexSearcher.java:1487)
          at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:931)
          at org.apache.solr.search.SolrIndexSearcher.getDocListAndSet(SolrIndexSearcher.java:1289)
          at org.apache.solr.handler.component.CollapseComponent.doProcess(CollapseComponent.java:176)
          at org.apache.solr.handler.component.CollapseComponent.process(CollapseComponent.java:127)
          at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
          at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
          at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)

          This is happening to one of our sites in production (the only site left using our events calendar) and I can't seem to make it happen in development with some fake data. We wiped all data from our production indexes and reindexed recently (upgraded to Solr 1.4.0 a few weeks ago). Does anyone have any ideas what might be causing this? I'm going to try and pull the database to our development servers and see if I can reindex and reproduce the issue, but that will take some time. The copied index from production to development does show this issue.

          Any hints? This is happening when sorting on any TrieIntField or string field. Normal collapsing or adjacent.

          Martijn van Groningen added a comment -

          I've attached a new patch that is compatible with the current trunk (rev 955615). The reason the previous patch did not work, was that the StringIndex class was removed. DocTermsIndex is used instead. See LUCENE-2380 for more details on this.

          Lance Norskog added a comment -

          It's the three-year anniversary for SOLR-236! And it's still active, unfinished and uncommitted. Is this a record?

          Hoss Man added a comment -

          Bulk updating 240 Solr issues to set the Fix Version to "next" per the process outlined in this email...

          http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3Calpine.DEB.1.10.1005251052040.24672@radix.cryptio.net%3E

          Selection criteria was "Unresolved" with a Fix Version of 1.5, 1.6, 3.1, or 4.0. email notifications were suppressed.

          A unique token for finding these 240 issues in the future: hossversioncleanup20100527

          Christophe Biocca added a comment -

          I'd just like to throw in a suggestion about the AbstractDocumentCollapser & CollapseCollectorFactory APIs: It seems to me that changing the factory.createCollapseCollector(SolrRequest req) to factory.createCollapseCollector(ResponseBuilder rb) would allow for more specialized collapse collectors, that would be able to use, amongst other things, the SortSpec in the implementation of the collector. Our use case is that we want to show possibly more than one document for a given value of a collapse field, depending on relative scores. Passing in the ResponseBuilder would allow us to do that much more easily. Since the caching uses the ResponseBuilder object as its key, it won't introduce any new issues.
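A minimal, self-contained sketch of the suggested signature change. All type names here are hypothetical stand-ins (the real SolrQueryRequest, ResponseBuilder, and SortSpec classes are not reproduced); the point is only that a factory receiving the ResponseBuilder can hand its collectors the SortSpec:

```java
/** Hypothetical stand-ins for the Solr classes involved. */
class SolrRequestStub { }

class SortSpecStub {
    final String sortField;
    SortSpecStub(String sortField) { this.sortField = sortField; }
}

class ResponseBuilderStub {
    final SolrRequestStub req;
    final SortSpecStub sortSpec;
    ResponseBuilderStub(SolrRequestStub req, SortSpecStub sortSpec) {
        this.req = req;
        this.sortSpec = sortSpec;
    }
}

interface CollapseCollector {
    String describe();
}

/** Proposed shape: the factory receives the ResponseBuilder, so a collector
 *  can consult the SortSpec (e.g. to keep the top-N docs per group by sort order). */
interface CollapseCollectorFactory {
    CollapseCollector createCollapseCollector(ResponseBuilderStub rb);
}

public class FactoryApiSketch {
    public static void main(String[] args) {
        CollapseCollectorFactory factory = rb ->
                () -> "collapsing with sort on " + rb.sortSpec.sortField;
        ResponseBuilderStub rb =
                new ResponseBuilderStub(new SolrRequestStub(), new SortSpecStub("score"));
        System.out.println(factory.createCollapseCollector(rb).describe());
        // prints: collapsing with sort on score
    }
}
```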

          Kallin Nagelberg added a comment -

          I tried asking this question on the user list, but perhaps this is a more appropriate forum.

As I understand it, field collapsing has been disabled on multi-valued fields. Is this really necessary?

          Let's say I have a multi-valued field, 'my-mv-field'. I have a query like (my-mv-field:1 OR my-mv-field:5) that returns docs with the following values for 'my-mv-field':

Doc1: 1, 2, 3
          Doc2: 1, 3
          Doc3: 2, 4, 5, 6
          Doc4: 1

          If I collapse on that field with that query I imagine it should mean 'collect the docs, starting from the top, so that I find 1 and 5'. In this case if it returned Doc1 and Doc3 I would be happy.

There must be some ambiguity or implementation detail I am unaware of that is preventing this. It may be a critical piece of functionality for an application I'm working on, so I'm curious whether there is a point in pursuing development of this functionality or if I am missing something.

          Thanks,
          Kallin Nagelberg
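One way to read the semantics Kallin is asking for is a greedy scan: walking the ranked docs from the top, keep a document only if it covers a queried value not yet covered. A toy sketch of that reading (this is not what the patch does; the patch simply rejects multi-valued fields):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class MultiValuedCollapseSketch {
    /** Greedily selects docs (in ranked order) until every queried value is covered. */
    static List<String> selectCovering(Map<String, Set<Integer>> rankedDocs,
                                       Set<Integer> queriedValues) {
        List<String> selected = new ArrayList<>();
        Set<Integer> uncovered = new HashSet<>(queriedValues);
        for (Map.Entry<String, Set<Integer>> doc : rankedDocs.entrySet()) {
            if (uncovered.isEmpty()) break;
            // keep the doc only if it covers a still-uncovered queried value
            boolean coversNew = doc.getValue().stream().anyMatch(uncovered::contains);
            if (coversNew) {
                selected.add(doc.getKey());
                uncovered.removeAll(doc.getValue());
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // the example docs from the comment above, in ranked order
        Map<String, Set<Integer>> docs = new LinkedHashMap<>();
        docs.put("Doc1", Set.of(1, 2, 3));
        docs.put("Doc2", Set.of(1, 3));
        docs.put("Doc3", Set.of(2, 4, 5, 6));
        docs.put("Doc4", Set.of(1));
        System.out.println(selectCovering(docs, Set.of(1, 5))); // [Doc1, Doc3]
    }
}
```

The ambiguity the patch sidesteps is visible even here: which of several covering docs to keep depends on scan order, so different rankings yield different "collapsed" sets.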

          Martijn van Groningen added a comment -

Varun, I noticed the same NPE. I've updated the patch and fixed the issue. In the patch I've also added a test that simulates the problem you described.

          Lance Norskog added a comment -

          Eric Caron added a comment - 29/Apr/10 02:27 PM

Using the latest from trunk as of 2010-04-29, and the SOLR-236-trunk.patch from 2010-03-29 05:08, I get a NullPointerException whenever I use collapse.field together with an fq.

          Varun Gupta added a comment - 15/May/10 07:36 AM

          I applied the latest patch on the trunk and got the below exception after I made some commits to the index:

          Eric, Varun: Please create unit tests that show these bugs.

          Varun Gupta added a comment -

          I applied the latest patch on the trunk and got the below exception after I made some commits to the index:

SEVERE: java.lang.NullPointerException
at org.apache.solr.search.fieldcollapse.NonAdjacentDocumentCollapser$FloatValueFieldComparator.compare(NonAdjacentDocumentCollapser.java:450)
at org.apache.solr.search.fieldcollapse.NonAdjacentDocumentCollapser$DocumentComparator.compare(NonAdjacentDocumentCollapser.java:262)
at org.apache.solr.search.fieldcollapse.NonAdjacentDocumentCollapser$DocumentPriorityQueue.lessThan(NonAdjacentDocumentCollapser.java:196)
at org.apache.lucene.util.PriorityQueue.upHeap(PriorityQueue.java:221)
at org.apache.lucene.util.PriorityQueue.add(PriorityQueue.java:130)
at org.apache.lucene.util.PriorityQueue.insertWithOverflow(PriorityQueue.java:146)
at org.apache.solr.search.fieldcollapse.NonAdjacentDocumentCollapser.doCollapsing(NonAdjacentDocumentCollapser.java:113)
at org.apache.solr.search.fieldcollapse.AbstractDocumentCollapser.executeCollapse(AbstractDocumentCollapser.java:259)
at org.apache.solr.search.fieldcollapse.AbstractDocumentCollapser.collapse(AbstractDocumentCollapser.java:179)
at org.apache.solr.handler.component.CollapseComponent.doProcess(CollapseComponent.java:173)
at org.apache.solr.handler.component.CollapseComponent.process(CollapseComponent.java:127)
at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1321)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:341)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:244)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)

I also got an error while optimizing the index.

          Eric Caron added a comment -

          Regarding the numFound count, one of the loudest complaints from the Sphinx community is the inability to see the total number pre-collapse. Is it possible to dictate which value (possibly both) is calculated at run-time? When FieldCollapse gains the attention it deserves, I'd expect an onslaught of requests along these lines. (I personally want both, the pre-value to display the number of matches, and the post-value to calculate pagination).

          Hide
          Joseph Freeman added a comment -

          collapse.includeCollapsedDocs.count ?

          When I use collapse.includeCollapsedDocs.fl, I get ALL the collapsed documents.

          It seems like we should have a collapse.includeCollapsedDocs.count parameter to limit this result set?

          Hide
          Martijn van Groningen added a comment -

          Another note. The numFound count in this patch does not mean all documents found. This number currently represents all documents returned in the response. This is due to a performance improvement that was discussed on this page a while ago. However, you can disable this performance improvement by commenting out or deleting lines 99 and 106 to 110 in the NonAdjacentDocumentCollapser.java file (latest patch). My experience with this improvement is that it saves memory, but the search-time improvements were minimal. So whether you do this depends on your situation.

          Hide
          Martijn van Groningen added a comment -

          I've updated the patch for the trunk The following changes are included:

          • The patch has been updated to the latest trunk. So no patch conflicts should occur.
          • Eric Caron reported NPEs when using field collapsing in combination with a filter query. After some digging I found the cause of the NPE. When using an fq the scores are cached in the filter cache, but due to a bug in DelegateDocSet the scores were not returned in some cases (null was returned). This resulted in an NPE at a later stage of the query execution. I've also updated the integration test to cover this situation. This also explains why everything was fine the first time: when doing a normal refresh (F5 /  - R) the result comes from the HTTP cache, so everything is still fine. However, when doing a hard refresh a second query is executed, and results are then retrieved from the configured Solr caches in most cases, resulting in this NPE.
          Hide
          Sergey Shinderuk added a comment -

          Finally I applied SOLR-236.patch to rev 899572 (dated 2010-01-15) of the trunk, and I get correct numFound values with collapsing enabled.

          Hide
          Sergey Shinderuk added a comment -

          @Claus
          I faced the same issue. Did you find any solution or maybe workaround?

          When collapsing is enabled, numFound is equal to the number of rows requested and NOT the total number of distinct documents found.

          I applied the latest SOLR-236-trunk.patch to the trunk checked out on the date of the patch, because patching the latest revision fails.
          Am I doing something wrong?

          I want to collapse near-duplicate documents in search results based on a document signature. But with this issue I can't paginate through the results, because I don't know how many there are.

          Besides, the article at http://blog.jteam.nl/2009/10/20/result-grouping-field-collapsing-with-solr/ shows examples with a correct numFound returned. How can I get it working?

          Hide
          Eric Caron added a comment -

          Using the latest from trunk as of 2010-04-29, and the SOLR-236-trunk.patch from 2010-03-29 05:08, I get a NullPointerException whenever I use collapse.field together with an fq.

          Works:
          /solr/select/?q=sales&fq=country%3A1
          Works:
          /solr/select/?q=sales&collapse.field=company
          Doesn't work:
          /solr/select/?q=sales&collapse.field=company&fq=country%3A1

          The top of the trace is:
          java.lang.NullPointerException
          at org.apache.solr.search.fieldcollapse.NonAdjacentDocumentCollapser$FloatValueFieldComparator.compare(NonAdjacentDocumentCollapser.java:450)
          at org.apache.solr.search.fieldcollapse.NonAdjacentDocumentCollapser$DocumentComparator.compare(NonAdjacentDocumentCollapser.java:262)
          at org.apache.solr.search.fieldcollapse.NonAdjacentDocumentCollapser$DocumentPriorityQueue.lessThan(NonAdjacentDocumentCollapser.java:196)
          at org.apache.lucene.util.PriorityQueue.insertWithOverflow(PriorityQueue.java:148)
          at org.apache.solr.search.fieldcollapse.NonAdjacentDocumentCollapser.doCollapsing(NonAdjacentDocumentCollapser.java:113)
          at org.apache.solr.search.fieldcollapse.AbstractDocumentCollapser.executeCollapse(AbstractDocumentCollapser.java:259)
          at org.apache.solr.search.fieldcollapse.AbstractDocumentCollapser.collapse(AbstractDocumentCollapser.java:179)
          at org.apache.solr.handler.component.CollapseComponent.doProcess(CollapseComponent.java:173)
          at org.apache.solr.handler.component.CollapseComponent.process(CollapseComponent.java:127)
          at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)

          Hide
          Karel Braeckman added a comment -

          Hi all,

          I wondered if it is possible to sort the collapsed results based on an aggregate function (e.g., sort by sum(price))?

          What needs to be done to make this possible? (Could it be done via a plugin?)

          Kind regards,
          Karel

          Hide
          Lukas Kahwe Smith added a comment -

          It's my understanding that this patch currently only produces, for each collapsed group, a score equal to the max() of the scores inside the group. Is there any work being done to enhance the score to take all documents inside the group into account? For example, using the collapse_count or the individual scores (summing, or via some custom algorithm).
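          The two aggregation strategies raised here could, under current patches, only be applied as post-processing on the per-document scores. A minimal sketch with hypothetical helper names (not from the patch):

```java
// Hypothetical helper, not from the patch: the two group-score strategies
// discussed in this comment. The patch reportedly scores a group by the
// max() of its members; a sum would additionally reward groups with many
// matching documents.
public class GroupScoreStrategies {

    /** Best single hit wins (the behaviour described for the patch). */
    public static float maxScore(float[] memberScores) {
        float max = Float.NEGATIVE_INFINITY;
        for (float s : memberScores) {
            if (s > max) {
                max = s;
            }
        }
        return max;
    }

    /** Alternative raised in the comment: accumulate all member scores. */
    public static float sumScore(float[] memberScores) {
        float sum = 0f;
        for (float s : memberScores) {
            sum += s;
        }
        return sum;
    }
}
```

          Note that sum-based scores are sensitive to group size, so large groups of weak matches can outrank a single strong match; that is exactly the trade-off being asked about.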

          Hide
          Billy Morgan added a comment -

          @Claus

          I am having the same issue

          Hide
          Claus Schröter added a comment -

          Hi all,

          I applied Martijns last Patch to the trunk and encountered a problem with document counts:

          whenever I set a rows= value on the query, the "numFound" result parameter is limited to exactly the value of rows.
          The facet counts are also limited to this value.

          If I omit the rows parameter everything is fine. I tried to track down the problem. It seems that the SolrSearcher query is limited to the "rows" value
          before collapsing is done.

          Does anybody encounter a similar problem?

          Cheers!
          clausi

          Hide
          Pierre-Luc added a comment -

          Hi all,

          We have integrated the most recent patch into our 1.4 install and the Out of memory fix suggested by Peter. I am facing memory issues only when collapsing. I would like to know why the class CacheValue is static in AbstractDocumentCollapser. If I remove the static attribute of that class, the memory footprint is greatly reduced and everything works fine.

          My document count is around 5 million.

          Any help would be greatly appreciated.
          Thank you.

          Hide
          Martijn van Groningen added a comment -

          @Thomas
          Somehow the solrj code was left out when I created the patch yesterday. I guess I accidentally deleted it when I was moving the code to the new trunk. Anyhow, I have updated the patch so it includes the solrj code, and applying it should go flawlessly.

          Hide
          Robert Zotter added a comment -

          @Thomas Essentially my use case involves a product listing of sorts, where there are many closely related items being sold by any number of sellers. I would like to distribute the search results across as many sellers as possible, giving each seller a fair chance to sell their products, so I was going to use field collapsing to limit the number of items displayed per seller.

          Ideally it would be nice if there were some way to evenly distribute closely related documents (scores within some defined percentage of each other)

          For example instead of:

          Item 1 sold by Seller A
          Item 2 sold by Seller A
          Item 3 sold by Seller A
          Item 4 sold by Seller B
          Item 5 sold by Seller B
          Item 6 sold by Seller B

          Assuming all of these items are within a certain percentage of each other, it would be nice to have:

          Item 1 sold by Seller A
          Item 4 sold by Seller B
          Item 2 sold by Seller A
          Item 5 sold by Seller B
          ....

          Although I won't achieve this exact behavior with this particular patch, it will at least get me closer to my goal.

          FYI my document count is around 6 million and I am already utilizing the document deduper.
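          The interleaving described above can be done as a post-processing step on the ranked results. A hedged sketch (hypothetical, not part of this patch; a real implementation would work on doc ids and scores, plain strings keep it short):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical post-processing sketch: round-robin interleaving of
// already-ranked results across sellers, so that no single seller
// monopolises the top of the list.
public class SellerInterleaver {

    /** docs: pairs of {itemId, sellerId}, already sorted by descending score. */
    public static List<String> interleave(List<String[]> docs) {
        // Group items per seller, preserving the score order within each group.
        Map<String, Deque<String>> bySeller = new LinkedHashMap<>();
        for (String[] doc : docs) {
            bySeller.computeIfAbsent(doc[1], k -> new ArrayDeque<>()).add(doc[0]);
        }
        // Take the best remaining item from each seller in turn.
        List<String> interleaved = new ArrayList<>();
        boolean tookOne = true;
        while (tookOne) {
            tookOne = false;
            for (Deque<String> queue : bySeller.values()) {
                if (!queue.isEmpty()) {
                    interleaved.add(queue.poll());
                    tookOne = true;
                }
            }
        }
        return interleaved;
    }
}
```

          On the six-item example above this yields Item 1, Item 4, Item 2, Item 5, Item 3, Item 6. It ignores the "within some score percentage" condition; adding that check before swapping adjacent items would be the next refinement.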

          Hide
          Thomas Heigl added a comment -

          @Robert:

          What is your use case for field collapsing? I think under "normal" conditions (collapsing on a field with reasonably many unique values) you can go with the slightly older patch and the OOM fixes. I compared the performance of the newest patch for the trunk with the 1.4 release patched as described above and didn't notice much difference under these conditions. I will most likely go with the trunk, however, as I have millions of documents with millions of unique values in the collapse field and need every bit of performance I can get.

          Hide
          Robert Zotter added a comment -

          @Thomas. Thanks for the input. Do you think it's best to go with a clean version of 1.4 or the latest from trunk? Basically I'm asking whether you think trunk is semi-stable enough for a production environment. Thanks

          Hide
          Thomas Heigl added a comment -

          @Robert:

          I just tried the field collapsing patch with a clean version of the 1.4 release. The only recent patch that seems to be applicable without manually resolving conflicts is 2009-12-08. In addition to the patch you should also add the three individual files uploaded by Peter Karich to deal with the worst memory issues.

          Hide
          Thomas Heigl added a comment -

          @Martijn:

          There is a small problem with the latest patch file. Both TortoiseSVN and patch complain that the file is malformed because there is an "empty" patch for FieldCollapseResponse.java around line 2199. Simply removing lines 2195-2199 does the trick.

          Apart from that, the patch works perfectly for me.

          Hide
          Martijn van Groningen added a comment -

          I've attached a new patch, which includes the following changes:

          • Patch uses the new Solr trunk. Everything in the patch is relative to the trunk directory.
          • The changes Peter Karich made to DocSetScoreCollector and NonAdjacentDocumentCollapserTest that make it much more memory efficient.
          • The change Yonik suggested to make field collapsing more efficient.

            efficiency: the treeset and hashmap are both only the size of the top number of docs we are looking at (10, for instance). We will now have the top 10 documents collapsed by the right field with a collapseCount of 1. Put another way, we have the top 10 groups.

          This also means that the total count of a search with field collapsing does not represent all the found documents. The total count now represents: start + count
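          A minimal sketch of that bounded-bookkeeping idea (an assumed simplification, not the patch code): while scanning hits that are already in descending score order, stop as soon as the first `top` group heads are known, so the data structures stay O(top) instead of O(numFound). A side effect is that collapse counts stay at 1 and numFound can only reflect start + count.

```java
import java.util.LinkedHashMap;
import java.util.List;

// Assumed simplification of the optimisation: collect only the first `top`
// group head documents, then stop scanning entirely.
public class TopNGroups {

    /** hits: pairs of {docId, collapseFieldValue}, sorted by descending score. */
    public static LinkedHashMap<String, Integer> collect(List<String[]> hits, int top) {
        LinkedHashMap<String, Integer> groupHeads = new LinkedHashMap<>();
        for (String[] hit : hits) {
            if (groupHeads.size() == top) {
                break; // stop early: later hits are never inspected or counted
            }
            if (groupHeads.containsKey(hit[1])) {
                continue; // group already has its highest-scoring head document
            }
            groupHeads.put(hit[1], Integer.parseInt(hit[0]));
        }
        return groupHeads;
    }
}
```

          The early break is exactly what trades an accurate total count for memory and speed: hits beyond the first `top` groups are never visited, so neither their group membership nor their number can be reported.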

          Hide
          Robert Zotter added a comment -

          What are the required steps to get this patch working with a clean 1.4? Is it even compatible? I've read in the comments above that the 12/12 field-collapse-5.patch applies correctly but has horrible memory bugs. Have there been any updates on this? Recommendations, anyone?

          Hide
          Peter Karich added a comment - - edited

          It seems to me that the provided changes are necessary to make the OutOfMemory exception go away (see the three appended files). Please apply the files with caution, because I made the changes from an old patch (from Nov 2009).

          Hide
          Peter Karich added a comment -

          > Shouldn't the float array in DocSetScoreCollector be changed to a Map?

          hmmh, maybe I expressed myself a bit weirdly: I already changed all of this to a Map (a SortedMap) ...
          I started this change in DocSetScoreCollector and changed all the other occurrences of the float array (otherwise I would have had to copy the entire map).

          > > I think the compare method should NOT be called if no docs are in the scores array ... ?

          > I would expect that every docId has a score.

          Yes, me too. So I expect there is a bug somewhere. But as I said, this breaks only one test (collapse with faceting before). It could even be a bug in the test case, though.

          Hide
          Martijn van Groningen added a comment -

          Shouldn't the float array in DocSetScoreCollector be changed to a Map? Because that is what is actually being cached and requires the most memory. The float array in NonAdjacentDocumentCollapser.PredefinedScorer isn't being cached, though changing that to a Map could be an improvement too.

          I think the compare method should NOT be called if no docs are in the scores array ... ?

          I would expect that every docId has a score.

          Hide
          Peter Karich added a comment - - edited

          regarding the OutOfMemory problem: we are now testing the suggested change in production.

          I replaced the float array with a TreeMap<Integer, Float>. The change was nearly trivial. (I cannot provide a patch easily, because we are using an older patch, although I could post the 3 changed files.)

          The reason I used a TreeMap instead of a HashMap was that in the advance method of the class NonAdjacentDocumentCollapser.PredefinedScorer I needed the tailMap method:

          public int advance(int target) throws IOException {
              // now we need a TreeMap method:
              iter = scores.tailMap(target).entrySet().iterator();
              if (iter.hasNext())
                  return target;
              else
                  return NO_MORE_DOCS;
          }

          Then - I think - I discovered a bug/inconsistent behaviour: if I run the test FieldCollapsingIntegrationTest.testNonAdjacentCollapse_withFacetingBefore, the scores array will be created via new float[maxDocs] in the old version. But the array is never filled with any values, so Float value1 = values.get(doc1); returns null in the method NonAdjacentDocumentCollapser.FloatValueFieldComparator.compare (the size of the TreeMap is 0!). I worked around this via:

           
          if (value1 == null)
              value1 = 0f;
          if (value2 == null)
              value2 = 0f;
          

          I think the compare method should NOT be called if no docs are in the scores array ... ?
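          Detached from Lucene's Scorer API, the tailMap trick above can be illustrated in isolation. A sketch with hypothetical names, using NavigableMap's ceilingEntry (equivalent to taking the first entry of tailMap(target)) to make the "first doc id >= target" contract explicit:

```java
import java.util.Map;
import java.util.TreeMap;

// Standalone illustration of why a sorted map suits advance(): it must
// position on the first doc id greater than or equal to target, which
// TreeMap supports directly, unlike a HashMap.
public class SparseAdvance {

    public static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** scores: sparse docId -> score map; returns the first doc id >= target. */
    public static int advance(TreeMap<Integer, Float> scores, int target) {
        Map.Entry<Integer, Float> entry = scores.ceilingEntry(target);
        return entry == null ? NO_MORE_DOCS : entry.getKey();
    }
}
```

          Note one difference from the quoted snippet: that code returns `target` itself whenever the tail map is non-empty, whereas returning the actual first key >= target is what DocIdSetIterator-style advance contracts usually expect.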

          Hide
          Martijn van Groningen added a comment -

          The numFound attribute holds the total number of documents found for the specified query, so it also covers the documents beyond the first result page. The reason the numFound for the first query is lower than for the second is that the second has a higher collapse.threshold. Only documents with the same collapse field value that appear more than twice are omitted from the result, which results in fewer documents being collapsed.

          Hide
          Yao Ge added a comment -

          I just applied the latest patch to trunk and I don't quite understand how the "numFound" in the response list is computed. With rows=10&collapse.threshold=1, I got numFound=11, with rows=10&collapse.threshold=2, I got numFound=22.
          In both cases the actual number of docs in the list is 10. Why is numFound reported this way?

          Hide
          Martijn van Groningen added a comment -

          That makes sense. I initially made it an array to maintain the document order for the scores, but this order is already in the openbitset. I think a Map is a good idea.

          Hide
          Leon Messerschmidt added a comment -

          The OutOfMemory problem affects both field-collapse-5.patch on Solr 1.4 and SOLR-236.patch on the trunk.

          The root cause of the problem is DocSetScoreCollector, which creates a float array the size of the maximum document id that matches the query. If you have a large index (we have several million documents) and a document with a very large id is matched, you may end up with a huge array (in our case several hundred MB). Only a really small subset of the array is in use at any given time (especially if you're matching just a few documents with big doc ids).

          The implementation can rather use a sparse array or a map to keep track of scores.
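The dense-array versus sparse-map trade-off Leon describes can be sketched in isolation. This is an illustrative stand-alone class, not the patch's actual DocSetScoreCollector (which is a Lucene Collector); it only shows the memory idea:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: instead of "float[] scores = new float[maxDoc]" (O(maxDoc) memory
// even for a handful of hits), keep scores only for docs actually collected.
public class SparseScores {
    private final Map<Integer, Float> scores = new HashMap<>();

    public void collect(int docId, float score) {
        scores.put(docId, score);
    }

    public float score(int docId) {
        Float s = scores.get(docId);
        return s == null ? 0f : s;
    }

    // Memory now grows with the number of hits, not with the max doc id.
    public int size() {
        return scores.size();
    }

    public static void main(String[] args) {
        SparseScores s = new SparseScores();
        s.collect(7_000_000, 0.42f);  // one hit with a very large doc id
        System.out.println(s.size()); // one map entry, not a 7M-element array
    }
}
```

The boxing overhead of `HashMap<Integer, Float>` is non-trivial per entry, so for dense result sets a plain array is still cheaper; a primitive-keyed map or a paged sparse array would be the middle ground.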

          Peter Karich added a comment - - edited

          Trying the latest patch from 1 Feb 2010. It compiles against solr-2010-02-13 from the nightly build dir, but does not work. If I query

          http://server/solr-app/select?q=*:*&collapse.field=myfield

          it fails with:

           
          
          HTTP Status 500 - null java.lang.NullPointerException
              at org.apache.solr.schema.FieldType.toExternal(FieldType.java:329)
              at org.apache.solr.schema.FieldType.storedToReadable(FieldType.java:348)
              at org.apache.solr.search.fieldcollapse.collector.AbstractCollapseCollector.getCollapseGroupResult(AbstractCollapseCollector.java:58)
              at org.apache.solr.search.fieldcollapse.collector.DocumentGroupCountCollapseCollectorFactory$DocumentCountCollapseCollector.getResult(DocumentGroupCountCollapseCollectorFactory.java:84)
              at org.apache.solr.search.fieldcollapse.AbstractDocumentCollapser.getCollapseInfo(AbstractDocumentCollapser.java:193)
              at org.apache.solr.handler.component.CollapseComponent.doProcess(CollapseComponent.java:192)
              at org.apache.solr.handler.component.CollapseComponent.process(CollapseComponent.java:127)
              at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
          ...
           

          I only need the OutOfMemory problem solved ...

          Peter Karich added a comment -

          We are facing OutOfMemory problems too. We are using https://issues.apache.org/jira/secure/attachment/12425775/field-collapse-5.patch

          > Are you using any other features besides plain collapsing? The field collapse cache gets large very quickly,
          > I suggest you turn it off (if you are using it). Also you can try to make your filterCache smaller.

          How can I turn off the collapse cache or make the filterCache smaller?
          Are there other workarounds? E.g. via using a special version of the patch ?

          I read that it could help to specify collapse.maxdocs but this didn't help in our case ... could collapse.type=adjacent help here? (https://issues.apache.org/jira/browse/SOLR-236?focusedCommentId=12495376&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12495376)

          What do you think?

          BTW: We really like this patch and would like to use it !!

          Gerald DeConto added a comment -

          I have been able to apply and use the solr-236 patch successfully. Very, very cool and powerful.

          Are there any plans/hacks to include the non-collapsed documents in the collapseCount and aggregate function values (i.e. so that they include ALL documents, not just the collapsed ones)? Possibly via some parameter like collapse.includeAllDocs?

          I think this would be a great addition to the collapse code (and Solr functionality), via what I would think is a small change, since Solr doesn't have any other aggregation mechanism (as yet).

          I am trying to see how to change the code myself, but Java is not my primary language.

          Kevin Cunningham added a comment -

          No, just field collapsing. We went back to the field-collapse-5.patch for the time being. So far its been good and we updated just to get closer to the latest not because we were seeing issues. Thanks.

          Martijn van Groningen added a comment -

          > Regarding Patrick's comment about a memory leak, we are seeing something similar - very large memory usage and eventually using all the available memory. Were there any confirmed issues that may have been addressed with the later patches? We're using the 12-24 patch. Any toggles we can switch to still get the feature, yet minimize the memory footprint?

          Are you using any other features besides plain collapsing? The field collapse cache gets large very quickly, so I suggest you turn it off (if you are using it). Also you can try to make your filterCache smaller.

          > What fixes would we be missing if we ran Solr 1.4 with the last "field-collapse-5.patch" patch?

          Not much I believe; some are using it in production without too many problems.
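For reference, the two knobs mentioned above live in solrconfig.xml. A rough sketch follows; the fieldCollapsing/fieldCollapseCache element names are taken from the patch's documentation and may differ between patch versions, so treat them as placeholders rather than exact configuration:

```xml
<!-- Sketch only: shrink the filterCache (standard Solr cache syntax) ... -->
<filterCache class="solr.LRUCache"
             size="512"
             initialSize="128"
             autowarmCount="0"/>

<!-- ... and disable the field collapse cache by leaving its (patch-specific,
     possibly differently named) configuration block commented out: -->
<!--
<fieldCollapsing>
  <fieldCollapseCache class="solr.FastLRUCache"
                      size="512"
                      initialSize="512"
                      autowarmCount="128"/>
</fieldCollapsing>
-->
```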

          Kevin Cunningham added a comment - - edited

          Regarding Patrick's comment about a memory leak, we are seeing something similar - very large memory usage and eventually using all the available memory. Were there any confirmed issues that may have been addressed with the later patches? We're using the 12-24 patch. Any toggles we can switch to still get the feature, yet minimize the memory footprint?

          We had been running the 11-29 field-collapse-5.patch patch and saw nothing near this amount of memory consumption.

          What fixes would we be missing if ran Solr 1.4 with the last "field-collapse-5.patch" patch?

          Martijn van Groningen added a comment -

          If you look into AbstractDocumentCollapser#createDocumentCollapseResult() you will see that the collapseResult can never be null. Therefore I think the null check is not necessary.
          I think the following code is sufficient:

          DocListAndSet results = searcher.getDocListAndSet(rb.getQuery(),
                collapseResult.getCollapsedDocset(),
                rb.getSortSpec().getSort(),
                rb.getSortSpec().getOffset(),
                rb.getSortSpec().getCount(),
                rb.getFieldFlags());
          

          Also specifying the filters is unnecessary, because it was already taken into account when creating the uncollapsed docset.

          Koji Sekiguchi added a comment -

          The following snippet in CollapseComponent.doProcess():

          DocListAndSet results = searcher.getDocListAndSet(rb.getQuery(),
                collapseResult == null ? rb.getFilters() : null,
                collapseResult.getCollapsedDocset(),
                rb.getSortSpec().getSort(),
                rb.getSortSpec().getOffset(),
                rb.getSortSpec().getCount(),
                rb.getFieldFlags());
          

          The 2nd line implies that collapseResult may be null. If it is null, won't we get an NPE at the 3rd line?

          Martijn van Groningen added a comment -

          I agree! I've updated the patch that adds a check if a field is indexed. If not an exception is thrown.
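The added check presumably extends checkCollapseField() with an indexed test alongside the existing ones. A stand-alone sketch of that validation logic (the nested SchemaField here is a simplified stand-in for Solr's org.apache.solr.schema.SchemaField, not the patch's exact code):

```java
// Stand-alone sketch of the collapse-field validation, including the
// "field must be indexed" check Koji asked for below.
public class CollapseFieldCheck {
    // Simplified stand-in for Solr's SchemaField.
    static class SchemaField {
        final boolean indexed, multiValued, tokenized;
        SchemaField(boolean indexed, boolean multiValued, boolean tokenized) {
            this.indexed = indexed;
            this.multiValued = multiValued;
            this.tokenized = tokenized;
        }
    }

    static void checkCollapseField(SchemaField f) {
        if (f == null)
            throw new RuntimeException("Could not collapse, because collapse field does not exist in the schema.");
        if (!f.indexed)  // the new check
            throw new RuntimeException("Could not collapse, because collapse field is not indexed.");
        if (f.multiValued)
            throw new RuntimeException("Could not collapse, because collapse field is multivalued");
        if (f.tokenized)
            throw new RuntimeException("Could not collapse, because collapse field is tokenized");
    }

    public static void main(String[] args) {
        checkCollapseField(new SchemaField(true, false, false)); // passes silently
        try {
            checkCollapseField(new SchemaField(false, false, false));
        } catch (RuntimeException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In the real patch the flags would come from SchemaField's accessors (indexed(), multiValued(), getType().isTokenized()).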

          Koji Sekiguchi added a comment -

          A random comment, don't we need to check collapse.field is indexed in checkCollapseField()?

          protected void checkCollapseField(IndexSchema schema) {
            SchemaField schemaField = schema.getFieldOrNull(collapseField);
            if (schemaField == null) {
              throw new RuntimeException("Could not collapse, because collapse field does not exist in the schema.");
            }
          
            if (schemaField.multiValued()) {
              throw new RuntimeException("Could not collapse, because collapse field is multivalued");
            }
          
            if (schemaField.getType().isTokenized()) {
              throw new RuntimeException("Could not collapse, because collapse field is tokenized");
            }
          }
          

          I accidentally specified an unindexed field for collapse.field, I got unexpected result without any errors.

          Martijn van Groningen added a comment -

          Attached updated patch that works with the latest trunk. This patch is not compatible with 1.4 branch.

          Martijn van Groningen added a comment -

          Hi Yaniv, I tried the same on the 1.4 branch (from svn) and the svn trunk. Applying the patch to both sources went fine, but when building (ant dist) on trunk I also got compile errors. This had to do with the fact that SolrQueryResponse moved from the request package to the response package. I will update the patch shortly. Building on the 1.4 branch went without any problems (ant dist). What errors occurred when running ant dist on the 1.4 branch?

          Yaniv S. added a comment -

          Hi All, this is a very exciting feature and I'm trying to apply it on our system.
          I've tried patching on 1.4 and on the trunk version but both give me build errors.
          Any suggestions on how I can build 1.4 or latest with this patch?

          Many Thanks,
          Yaniv

          Martijn van Groningen added a comment -

          If the field is tokenized and has more than one token, your field collapse result will become incorrect. If I remember correctly, it will only collapse on the field's last token, which of course leads to weird collapse groups. Users that have only one token per collapse field are, because of this check, out of luck. Somehow I think we should let the user know that it is not possible to collapse on a tokenized field (at least one with multiple tokens), maybe by adding a warning in the response. Still, I think the exception is clearer, but it also prohibits it, of course.

          Or someone could come after me and write a patch that checks for multi-tokened fields somehow and throws an exception.

          Checking whether a tokenized field contains only one token is really inefficient, because you would have to check every collapse field value of all documents. Right now the check is done based on the field's definition in the schema.

          Michael Gundlach added a comment -

          I've found the need to collapse on an analyzed field which contains one token (an email field, which is analyzed in order to lowercase it.) I had to apply a patch on top of field-collapse-5.patch in order to comment out the isTokenized() check in AbstractCollapseComponent.java , at which point the code worked perfectly.

          Is there a strong argument for keeping the isTokenized() check in? Anyone who needs to collapse an analyzed, single-token field is out of luck with this check in place. I understand that the current version protects users from incorrect results if they collapse a multi-token tokenized field, but maybe collapsing on analyzed fields is worth that risk. (Or someone could come after me and write a patch that checks for multi-tokened fields somehow and throws an exception.)

          Martijn van Groningen added a comment -

          I believe the field-collapse-5.patch should work for 1.4. Some bugs were fixed in later patches, so I recommend using the latest patch on the latest successful nightly build if that is an option for you.
          Applying the latest patch to the 1.4 sources will probably result in some minor merge errors, but I think these should be easy to fix.

          Kevin Cunningham added a comment -

          Which patch is recommended for those running a stock 1.4 release?

          Martijn van Groningen added a comment - - edited

          > The result document of our prefix query, which was at position 1 without collapsing, was not even within the top 10 results with collapsing. We were using the option collapse.maxdocs=150, and after changing this option to the value 15000, the results seem to be as expected. Because of that, we concluded that there has to be a problem with the sorting of the uncollapsed docset.

          The collapse.maxdocs option aborts collapsing after the threshold is met, but it does that based on the uncollapsed docset, which is not sorted in any way. The result is that documents that would normally appear on the first page don't appear at all in the search result. Eventually the collapse component uses the collapsed docset as the result set, not the uncollapsed docset.

          > Also, we noticed a huge memory leak problem when using collapsing. We configured the component with <searchComponent name="query" class="org.apache.solr.handler.component.CollapseComponent"/>. Without setting the option collapse.field, it works normally and there are no memory problems at all. If requests with collapsing enabled are received by the Solr server, the whole memory (oldgen could not be freed; eden space is heavily in use; ...) fills up after a few requests. Using a profiler, we noticed that the filterCache was extraordinarily large. We suspected that there could be a caching problem (collapseCache was not enabled).

          I agree it gets huge. This applies to both the filterCache and the field collapse cache. This is something that has to be addressed and certainly will be in the new field-collapse implementation. In the patch you're using, too much is being cached (some of the data in the cache can even be neglected). Also, in some cases strings are being cached that could actually be replaced with hashcodes.

          > Additionally, it might be very useful if the parameter collapse=true|false would work again and could be used to enable/disable the collapsing functionality. Currently, the existence of a field chosen for collapsing enables this feature and there is no possibility to configure the fields for collapsing within the request handlers. With that, we could configure it and only enable/disable it within the requests, as is conveniently done by other components (highlighting, faceting, ...).

          That actually makes sense; it argues for using the collapse.enable parameter again in the patch.

          Martijn

          Patrick Jungermann added a comment -

          Hi all,

          we are using Solr's trunk with the latest patch of 2009-12-24 09:54 AM. Within the index, there are ~3.5 million documents with string-based identifiers of a length of up to 50 chars.

          The result document of our prefix query, which was at position 1 without collapsing, was not even within the top 10 results with collapsing. We were using the option collapse.maxdocs=150, and after changing this option to the value 15000, the results seem to be as expected. Because of that, we concluded that there has to be a problem with the sorting of the uncollapsed docset.

          Also, we noticed a huge memory leak problem when using collapsing. We configured the component with <searchComponent name="query" class="org.apache.solr.handler.component.CollapseComponent"/>.
          Without setting the option collapse.field, it works normally and there are no memory problems at all. If requests with collapsing enabled are received by the Solr server, the whole memory (oldgen could not be freed; eden space is heavily in use; ...) fills up after a few requests. Using a profiler, we noticed that the filterCache was extraordinarily large. We suspected that there could be a caching problem (collapseCache was not enabled).

          Additionally, it might be very useful if the parameter collapse=true|false would work again and could be used to enable/disable the collapsing functionality. Currently, the existence of a field chosen for collapsing enables this feature and there is no possibility to configure the fields for collapsing within the request handlers. With that, we could configure it and only enable/disable it within the requests, as is conveniently done by other components (highlighting, faceting, ...).

          Patrick

          Stanislaw Osinski added a comment -

          Hi Grant,

          I would note, in looking at the Carrot2 code, they actually have a ByFieldClusteringAlgorithm (what they call synthetic clustering) which does field collapsing/clustering on a value of a field. To quote the javadocs:

          Clusters documents into a flat structure based on the values of some field of the documents. By default the {@link Document#SOURCES} field is used. Name of the field to cluster by: each non-null scalar field value with a distinct hash code will give rise to a single cluster, named using the {@link Object#toString()} value of the field. If the field value is a collection, the document will be assigned to all clusters corresponding to the values in the collection. Note that arrays will not be 'unfolded' in this way.

          I don't know how it performs, but it seems like it would at least be worth investigating.

          Carrot2's ByFieldClusteringAlgorithm is very simple. It literally throws everything into a hash map based on the field value (source code). This algorithm is used in our live demo to cluster by news source.
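The "throw everything into a hash map keyed by field value" approach described above amounts to something like the following (an illustrative sketch, not Carrot2's actual code; the doc-id-to-field-value map is a stand-in for real document objects):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ByFieldGrouping {
    // Group document ids into clusters keyed by a field value.
    // Each distinct non-null value gives rise to one cluster.
    public static Map<String, List<Integer>> cluster(Map<Integer, String> docToFieldValue) {
        Map<String, List<Integer>> clusters = new LinkedHashMap<>();
        for (Map.Entry<Integer, String> e : docToFieldValue.entrySet()) {
            if (e.getValue() == null) continue; // non-null values only
            clusters.computeIfAbsent(e.getValue(), k -> new ArrayList<>())
                    .add(e.getKey());
        }
        return clusters;
    }

    public static void main(String[] args) {
        Map<Integer, String> docs = new LinkedHashMap<>();
        docs.put(1, "cnn.com");
        docs.put(2, "bbc.co.uk");
        docs.put(3, "cnn.com");
        System.out.println(cluster(docs)); // {cnn.com=[1, 3], bbc.co.uk=[2]}
    }
}
```

This is a single linear pass over the documents, which is why the algorithm is cheap; the field-collapsing discussed in this issue additionally has to respect score order and a per-group threshold.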

          Note, they also have a synthetic one for collapsing based on URL: ByUrlClusteringAlgorithm

          This one creates a hierarchy based on the URL segments and might be useful to create "by-domain" collapsing if needed.

          In general, my rough guess is that the criteria for content-based collapsing would be closer to duplicate detection than to the type of grouping Carrot2 produces.

          Grant Ingersoll added a comment -

          I'm curious as to whether anyone has just thought of using the Clustering component for this? If your "collapse" field was a single token, I wonder if you would get the results you're looking for.

          I would note, in looking at the Carrot2 code, they actually have a ByFieldClusteringAlgorithm (what they call synthetic clustering) which does field collapsing/clustering on a value of a field. To quote the javadocs:

          Clusters documents into a flat structure based on the values of some field of the documents. By default the {@link Document#SOURCES} field is used

          and

          Name of the field to cluster by. Each non-null scalar field value with distinct hash code will give raise to a single cluster, named using the {@link Object#toString()} value of the field. If the field value is a collection, the document will be assigned to all clusters corresponding to the values in the collection. Note that arrays will not be 'unfolded' in this way.

          I don't know how it performs, but it seems like it would at least be worth investigating.

          Note, they also have a synthetic one for collapsing based on URL: ByUrlClusteringAlgorithm

          Just food for thought.

          Shalin Shekhar Mangar added a comment -
          1. Patch updated for SOLR-1685 and SOLR-1686
          2. The last patch had reverted changes to CollapseComponent configuration in solrconfig.xml and solrconfig-fieldcollapse.xml. Synced it back
          Uri Boness added a comment -

          If we are returning a number of documents (as opposed to a number of groups) to the user, how do they avoid splitting on a page in the middle of the group?

          As far as I know (Martijn, correct me if I'm wrong), Martijn's patch returns the number of groups and documents, where each group is actually represented as a document. So in that sense, the total count applies to the result set as is (groups count as documents) and therefore pagination just works.

          The only thing this algorithm can't do (related to pagination) is give the total number of documents after collapsing (and hence can't calculate the exact number of pages). This can be fine in many circumstances as long as the gui handles it (people don't seem to mind google doing it... I just tried it. Google didn't show the result count right unless displaying the last page).

          First of all, I must admit that I never noticed that in Google, so I guess you're right. But when you think about it: with Google, how many times do you get a hit count so low that it fits in 2-3 pages? I hardly ever do, and when I do, I don't even bother checking the results; I just try to improve my search. With Solr it is often different, especially when all these discovery and faceting features are used to narrow the search extensively. I'm not saying that an imperfect pagination mechanism is a problem, not at all; I'm just saying it might be an issue for specific use cases or specific domains. But that's just an assumption (or a gut feeling).

          Martijn van Groningen added a comment -

          Yes, I used his patch. Made a small bugfix and made sure that is in sync with the latest trunk.

          Noble Paul added a comment -

          Isn't the patch built on the one given by Shalin? The configuration looks different...

          Martijn van Groningen added a comment -

          Updated the patch so it applies without conflicts to the current trunk. Also included a bugfix, regarding field collapsing and the filter cache, that was noticed by Varun Gupta on the user mailing list.

          Shalin Shekhar Mangar added a comment -

          @ttdi - Please post your questions to solr-user mailing list. This issue is strictly for Solr related development (not usage).

          ttdi added a comment -

          Hi Martijn van Groningen and other experts,
          when I use http://localhost:8080/search/?page=1 it collapses only the page=1 results, but when I use http://localhost:8080/search/?page=2
          it collapses only the page=2 results, not all records.
          I want to collapse all records across pagination; how can I do that?
          Thanks!

          Stephen Weiss added a comment -

          Are you using any extra field collapse features, such as aggregate functions? Also, do the groups you collapse on have large field values? I'm going over the code and reconsidering the way things are cached right now.

          No, we're very simple in our usage of the collapse features themselves; we don't even use the output that the collapse patch adds. However, we do facet on a number of fields in this query as well, and sort by a date field. We also use local filter queries which we exclude for the facets individually (my favorite new feature). This packs a lot more action into one query than we had been doing previously (without it, we were running 8+ queries to get the same information), so I was worried at first that this was the cause of the RAM consumption. The field we are collapsing on is type "pint"; it can be positive or negative depending on what system the document is coming in from. Each document has several stored fields, but a whole document's stored fields are always under 1K together (it's only image metadata; there's no body text to any of these documents, as this is for an image search engine).

          Martijn van Groningen added a comment -

          It almost maxed out a machine with 18GB devoted to jetty in about 20 minutes.

          Hmmm.... that doesn't seem right. This is an issue.

          Are you using any extra field collapse features, such as aggregate functions? Also, do the groups you collapse on have large field values?
          I'm going over the code and reconsidering the way things are cached right now.

          Stephen Weiss added a comment -

          Quick note on the collapse cache - we just went into production with 1.4 and right away we had to turn off the collapse cache. This was with 1.4 dist and the patch from 12/12. With the cache enabled, RAM consumption was through the roof on the production servers - I guess with the variety of queries coming in, it filled up very fast. It almost maxed out a machine with 18GB devoted to jetty in about 20 minutes. We just used the sample config (maxSize=512), it looks like there were about 60 entries in the cache before we restarted. We would see the memory usage jump by as much as 2% after just one query.

          Without the cache the performance is still quite good (far better than what we had before), so we're not too bothered, but it may indicate there needs to be more optimization there... Generally our consumption rarely goes over 50% on this machine unless we have a lot of commits coming in. The cache did provide some performance benefits on some of the queries that return large numbers of results (1M+), so it would be nice to have. Of course, it's possible that with our index these levels of RAM consumption would be unavoidable. I'm not sure if there are any further specifics I could provide that would be helpful; let me know.
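          One thing worth noting about the memory blowup described above: a maxSize=512 bound caps the number of cache entries, not their byte size, so even ~60 entries each holding a large collapsed DocSet can exhaust the heap. A count-bounded LRU map is easy to sketch on java.util.LinkedHashMap (a hypothetical illustration, not the patch's actual cache class):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A cache bounded by entry COUNT, like maxSize in solrconfig.xml: note that a
// few hundred entries each holding a large DocSet can still exhaust the heap,
// which may be what happened here. Hypothetical sketch, not Solr's cache code.
class BoundedLruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxSize;

    BoundedLruCache(int maxSize) {
        super(16, 0.75f, true); // accessOrder=true gives least-recently-used eviction
        this.maxSize = maxSize;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxSize; // evict the LRU entry once the bound is exceeded
    }
}
```

          Bounding by an estimate of entry size rather than entry count would address this kind of memory pressure more directly.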

          Yonik Seeley added a comment -

          As far as I understand from your collapse algorithm proposal, in order to save memory you'd like to restrict the group creation to only those that belong in the requested results page.

          A ton of memory, and probably a good amount of time too. It may be the only variant that certain people would be able to use (but note that it is just a variant - I'm not proposing doing away with the other options).

          I think there might be a problem with pagination as well

          Yes, pagination is a sticky issue... but I don't think this algorithm messes it up further.

          If we are returning a number of documents (as opposed to a number of groups) to the user, how do they avoid splitting on a page in the middle of the group? I guess they over-request a little. What if they want a fixed number of groups? I guess they over-request by a lot (nGroups*collapse.threshold). Then they need to keep track of how many documents they actually used.

          The only thing this algorithm can't do (related to pagination) is give the total number of documents after collapsing (and hence can't calculate the exact number of pages). This can be fine in many circumstances as long as the gui handles it (people don't seem to mind google doing it... I just tried it. Google didn't show the result count right unless displaying the last page).
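          The over-request arithmetic described above (nGroups * collapse.threshold, then tracking how many documents were actually used) can be sketched like this. The names are hypothetical and this is not code from the patch; it assumes hits arrive already sorted by score, each as a (groupKey, docId) pair:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PageCollapse {
    /**
     * Collapse score-ordered (groupKey, docId) hits into at most groupsWanted
     * groups of at most threshold docs each. A caller wanting a fixed number of
     * groups would over-request roughly groupsWanted * threshold documents and
     * then count how many hits were actually consumed.
     */
    static Map<String, List<String>> topGroups(List<String[]> hits,
                                               int groupsWanted, int threshold) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String[] hit : hits) {
            List<String> group = groups.get(hit[0]);
            if (group == null) {
                if (groups.size() == groupsWanted) continue; // page's group budget is full
                group = new ArrayList<>();
                groups.put(hit[0], group);
            }
            if (group.size() < threshold) group.add(hit[1]); // cap docs kept per group
        }
        return groups;
    }
}
```

          Because hits beyond the requested groups are simply skipped, nothing here can report a post-collapse total, which is exactly the pagination limitation being discussed.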

          Shalin Shekhar Mangar added a comment -

          This is exactly the point, it's not really meta-data over the document, but on the group the document belongs to. And you also need a more obvious way to mark this document as a group representation (to distinguish it from other normal documents).

          We show the highest scoring document of a group, so does the fact that the metadata belongs to the group and not the document matter at all?

          But extending the current <doc> element doesn't mean we break BWC. Adding a <collapse-info> (or <collapse-meta-data>) sub-element to it will certainly not break anything, especially since we still don't have a formal XSD for the responses (I know we're working on it, but it's still not out there, so it's safe).

          We are not extending anything. We're just adding a couple of fields which may not exist in the index and this is a capability we plan to introduce anyway (however this issue does not need to depend on SOLR-1566). The response format remains exactly the same. There is no break in compatibility.

          Uri Boness added a comment -

          @Yonik

          As far as I understand from your collapse algorithm proposal, in order to save memory you'd like to restrict group creation to only those groups that belong in the requested results page. Beyond losing the faceting support over the collapsed DocSet, I think there might be a problem with pagination as well. For every page you'll end up with a different total count and therefore a different number of pages. This can be very confusing from the user perspective - imagine going to the first page and calculating (and displaying) that you have 3 pages of results, then when the user asks for the second page, s/he gets a response with 2 pages and a different total count.

          Uri Boness added a comment -

          Why is it wrong? It is about adding meta-info to the docs. This is what we plan to do with SOLR-1566

          This is exactly the point, it's not really meta-data over the document, but on the group the document belongs to. And you also need a more obvious way to mark this document as a group representation (to distinguish it from other normal documents).

          Even when we collapse what we are expecting is simple search results. So a drastic deviation from the standard format is not a good idea.

          I definitely agree that BWC should be kept, especially here where we're dealing with a query component. But extending the current <doc> element doesn't mean we break BWC. Adding a <collapse-info> (or <collapse-meta-data>) sub-element to it will certainly not break anything, especially since we still don't have a formal XSD for the responses (I know we're working on it, but it's still not out there, so it's safe).

          Noble Paul added a comment -

          I think mixing the collapse information with document fields is wrong

          Why is it wrong? It is about adding meta-info to the docs. This is what we plan to do with SOLR-1566

          Even when we collapse what we are expecting is simple search results. So a drastic deviation from the standard format is not a good idea.

          Moreover, keeping it in the document keeps parsing and processing simpler.

          Yonik Seeley added a comment -

          You think that collapse.collectDiscardedDocuments.fl is better?

          Is this something that's really needed? If so, some other name ideas could be
          collapse.discarded.fl
          collapse.discarded.limit (doesn't seem to be a good idea to have an unbounded number).

          Just one thought I had about the algorithm you propose: if you only create collapse groups for the top ten documents, then what about the total count of the search? Unique documents outside the top ten documents are not being grouped (if I understand you correctly), and that would impact the total count with how it currently works.

          Right - one would not be able to tell the total number of collapsed docs, or the total number of hits (or the DocSet) after collapsing. So only collapse.facet=before would be supported. I do think that just like faceting, there will be multiple ways of doing collapsing.

          Anyway, this is a great example of trying to make sure the interface doesn't preclude optimizations. Perhaps the total count of the search (numFound) should be pre-collapsing if collapse.facet=before, or perhaps it should always be pre-collapsing, and we should have another optional count for post-collapsing?

          Uri Boness added a comment -

          @Shalin

          I think mixing the collapse information with document fields is wrong. The collapse fields don't really belong to the document, but to the group the document represents, while the other field do belong to it. The response format should somehow indicate this difference.

          Martijn van Groningen added a comment -

          We need to open a separate issue for the core related changes.

          As you probably have noticed, I have split the patch into smaller patches and created sub-issues for each patch.

          How about we change the current field collapsing response format to the following?

          Looks okay at first sight.

          For this to work, CollapseComponent must generate a custom SolrDocumentList and set it as "results" in the response.

          Maybe we need a more elegant solution for this. All these extra fields are calculated values. If we were to put the calculated values into a certain context, the response writers could then look the values up in that context and write them to the response. Other functionalities might also benefit from this solution, like distances from a central point when doing a geo search. It is just an idea. I recall there is an issue in Jira that proposes something like this, but I couldn't find it.

          "collapse.aggregate" - Can we make this a multi-valued parameter instead of comma separated?

          I think that is a good idea; other parameters (like fq) are also multi-valued.

          BTW, I think we should continue further technical discussions in the sub-issues. There is space there for a lot of comments.

          Shalin Shekhar Mangar added a comment -

          How about we change the current field collapsing response format to the following?

          We add new well-known fields to the document itself, say

          1. "collapse.value" - contains the group field's value for this document
          2. "collapse.count" - the number of results collapsed under this document
          3. "collapse.aggregate.function(field-name)" - the aggregate value for the given function applied to the given field for this document's group

          Example:

          <?xml version="1.0" encoding="UTF-8"?>
          <response>
            <lst name="responseHeader">
              <int name="status">0</int>
              <int name="QTime">2</int>
              <lst name="params">
                <str name="collapse.field">manu_exact</str>
                <str name="collapse.aggregate">max(field1)</str>
                <str name="collapse.aggregate">avg(field1)</str>
                <str name="q">title:test</str>
                <str name="field.collapse">title</str>
                <str name="qt">collapse</str>
              </lst>
            </lst>
            <result name="response" numFound="30" start="0">
              <doc>
                <str name="id">F8V7067-APL-KIT</str>
                <str name="collapse.value">Belkin</str>
                <int name="collapse.count">1</int>
                <int name="collapse.aggregate.max(field1)">100</int>
                <float name="collapse.aggregate.avg(field1)">50.0</float>
              </doc>
              <doc>
                <str name="id">TWINX2048-3200PRO</str>
                <str name="collapse.value">Corsair Microsystems Inc.</str>
                <int name="collapse.count">3</int>
                <int name="collapse.aggregate.max(field1)">100</int>
                <float name="collapse.aggregate.avg(field1)">50.0</float>
              </doc>
            </result>
          </response>
          

          No need to have another section and correlate based on uniqueKeys. For this to work, CollapseComponent must generate a custom SolrDocumentList and set it as "results" in the response.

          For request parameters:

          1. "collapse.aggregate" - Can we make this a multi-valued parameter instead of comma separated?
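          As a rough sketch of what generating those pseudo-fields could look like, the following groups documents and merges collapse.value, collapse.count, and a collapse.aggregate.max(...) field into each group's head document. Plain maps stand in for SolrDocument; the class and method are hypothetical, following the proposal above, not actual CollapseComponent code:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CollapseResponseSketch {
    /**
     * Group docs by 'field', keep the first (highest-scoring) doc of each group,
     * and merge the proposed collapse.* pseudo-fields into that document's map.
     */
    static List<Map<String, Object>> collapse(List<Map<String, Object>> docs,
                                              String field, String aggField) {
        Map<Object, List<Map<String, Object>>> groups = new LinkedHashMap<>();
        for (Map<String, Object> doc : docs) {
            groups.computeIfAbsent(doc.get(field), k -> new ArrayList<>()).add(doc);
        }
        List<Map<String, Object>> out = new ArrayList<>();
        for (Map.Entry<Object, List<Map<String, Object>>> e : groups.entrySet()) {
            List<Map<String, Object>> members = e.getValue();
            Map<String, Object> head = new LinkedHashMap<>(members.get(0));
            head.put("collapse.value", e.getKey());     // the group field's value
            head.put("collapse.count", members.size()); // docs in this group
            int max = members.stream()                  // collapse.aggregate.max(aggField)
                             .mapToInt(d -> (Integer) d.get(aggField))
                             .max().orElse(0);
            head.put("collapse.aggregate.max(" + aggField + ")", max);
            out.add(head);
        }
        return out;
    }
}
```

          Whether collapse.count means the group size or the number of documents collapsed away (size minus one) is an interpretation detail the proposal would need to pin down; the sketch uses the group size.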
          Noble Paul added a comment -

          We need to open a separate issue for the core related changes.

          Martijn van Groningen added a comment -

          I support your suggestion of splitting this issue into two, i.e. making the core changes in a separate patch. That is the plan anyway.

          The changes in the core that should be in a separate patch are:

          1. SolrIndexSearcher
          2. DocSetHitCollector
          3. DocSetAwareCollector

          The above files were changed for the following reasons:

          1. The getDocSet(...) methods in SolrIndexSearcher did not allow me to specify a Lucene Collector, which I needed in order to get the uncollapsed docset while leveraging the Solr caches. I changed them so that this is possible.
          2. The patch also contains an extra getDocListAndSet(...) method that allows specifying a filter docset, which in the case of field collapsing is the collapsed docset.

          The QueryComponent has changed as well. The only reason these changes were made was to support pseudo-distributed field collapsing. Maybe for distributed field collapsing a separate patch should be created, with this change as a start. Last but not least, the SolrJ code. I think a separate patch should be created for these changes as well. Maybe for each patch a sub-issue should be created in Jira.

          The rest of the files in the patch do not impact any core files and I think should remain in one patch.

          Martijn van Groningen added a comment -

          ttdi,
          The latest patch is not in sync with the latest trunk. You can try to port the patch to the trunk or use a previous patch against the 1.4 code.

          Yonik,
          The parameter descriptions are a bit poor. The response format of the older patches contains two separate lists of collapse group counts: a list with counts per most relevant document id, enabled or disabled with the collapse.info.doc param, and a second list with counts per field value of the most relevant document, controlled with the collapse.info.count param. Now that the response format has changed we should rename them to something more descriptive. Maybe something like collapse.showCount, which adds the collapse count to the collapse group in the response (defaults to true), and collapse.showFieldValue, which adds the field value of the most relevant document to the group (defaults to false)?

          The collapse.maxdocs parameter specifies when to abort field collapsing, after n documents have been processed. I have never used it myself. I can imagine that one would use it to shorten the search time.

          The collapse.includeCollapsedDocs.fl parameter enables a collapse collector that collects the documents that have been discarded and outputs the specified fields of those discarded documents to the fieldcollapse response, per collapse group (* for all fields). The parameter name does not reflect that behaviour entirely. Do you think that collapse.collectDiscardedDocuments.fl is better? Personally, however, I would not use this, because of the negative impact it has on performance. Usually one wants to know something like the average / highest / lowest price of a collapse group; the AggregateCollapseCollector fits those needs better.
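          As a rough illustration of the kind of per-group aggregates being discussed (this is a stand-in sketch, not the patch's AggregateCollapseCollector), tracking max and average of a numeric field per collapse group only needs a running max, sum, and count per group:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of per-group aggregation: for each collapse group,
// track the max and the average of one numeric field value.
public class GroupAggregates {

    static final class Agg {
        double max = Double.NEGATIVE_INFINITY;
        double sum = 0;
        int count = 0;

        void collect(double v) {
            if (v > max) max = v;
            sum += v;
            count++;
        }

        double avg() { return sum / count; }
    }

    // Each entry is (group key, field value) for one collected document.
    public static Map<String, Agg> aggregate(List<Map.Entry<String, Double>> docs) {
        Map<String, Agg> byGroup = new HashMap<>();
        for (Map.Entry<String, Double> e : docs) {
            byGroup.computeIfAbsent(e.getKey(), k -> new Agg()).collect(e.getValue());
        }
        return byGroup;
    }

    public static void main(String[] args) {
        Map<String, Agg> r = aggregate(List.of(
                Map.entry("Belkin", 100.0),
                Map.entry("Corsair", 20.0),
                Map.entry("Corsair", 80.0)));
        System.out.println("Corsair avg = " + r.get("Corsair").avg());
    }
}
```

          Unlike collecting whole discarded documents, this keeps only a few numbers per group, which is why an aggregate collector is the cheaper way to answer "highest / average price in this group" questions.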

          Should I be able to specify a completely different sort within a group? collapse.sort=... seems nice... what are the implications? One bit of strangeness: it would seem to allow a highly ranked document responsible for the group being at the top of the list being dropped from the group due to a different sort criteria within the group. It's not necessarily an implementation problem though (sort values for the group should be maintained separately).

          I'm not sure about that. It would make things more complicated. Sorting the discarded documents in combination with the collapse.includeCollapsedDocs.fl functionality would maybe make more sense.

          The most basic question about the interface would be how to present groups. Do we stick with a linear document list and supplement that with extra info in a different part of the response (as the current approach takes)? Or stick that extra info in with some of the documents somehow? Or if collapse=true, replace the list of documents with a list of groups, each which can contain many documents? Which will be easiest for clients to deal with? If you were starting from scratch and didn't have to deal with any of Solr's current shortcomings, what would it look like?

          I think the latter would make more sense, because field-collapsing does change the search result. It would just make it more obvious.

          Is there a way to specify the number of groups that I want back instead of the number of documents?

          No, there is not, but if the list of documents is replaced with a list of groups then the rows parameter should be used to indicate the number of groups to be displayed instead of the number of documents to be displayed.

          Just one thought I had about the algorithm you propose: if you only create collapse groups for the top ten documents, then what about the total count of the search? Unique documents outside the top ten documents are not being grouped (if I understand you correctly), and that would impact the total count compared with how it currently works.

          Yonik Seeley added a comment -

          First, thanks to everyone who has spent so much time working on this - lack of committer attention doesn't equate to lack of interest... this is a very much needed feature!

          I'd agree with Erik that the most important thing is the interface to the client, and making it well thought out and semantically "tight". Martijn's recent changes to the response structure are an example of improvements in this area. It's also important to think about the interface in terms of how easy it will be to add further features, optimizations, and support distributed search. If the code isn't sufficiently standalone, we also need to see how easily it fits into the rest of Solr (what APIs it adds or modifies, etc). Actually implementing performance improvements and more distributed search can come later - as long as we've thought about it now so we haven't boxed ourselves in.

          It seems like field collapsing should just be additional functionality of the query component rather than a separate component since it changes the results?

          The most basic question about the interface would be how to present groups. Do we stick with a linear document list and supplement that with extra info in a different part of the response (as the current approach takes)? Or stick that extra info in with some of the documents somehow? Or if collapse=true, replace the list of documents with a list of groups, each which can contain many documents? Which will be easiest for clients to deal with? If you were starting from scratch and didn't have to deal with any of Solr's current shortcomings, what would it look like?

          From the wiki:
          collapse.maxdocs - what does this actually mean? I assume it collects arbitrary documents up to the max (normally by index order)? Does this really make sense? Does it affect faceting, etc? If it does make sense, it seems like it would also make sense for normal non-collapsed query results too, in which case it should be implemented at that level.

          collapse.info.doc - what does that do? I understand counts per group, but what's count per doc?

          collapse.includeCollapsedDocs.fl - I don't understand this one, and can't find an example on the wiki or blogs. It says "Parameter indicating to return the collapsed documents in the response"... but I thought documents were included up until collapse.threshold.

          collapse.debug - should perhaps just be rolled into debugQuery, or another general debug param (someone recently suggested using a comma-separated list... debug=timings,query, etc.)

          Should I be able to specify a completely different sort within a group? collapse.sort=... seems nice... what are the implications? One bit of strangeness: it would seem to allow a highly ranked document responsible for the group being at the top of the list being dropped from the group due to a different sort criteria within the group. It's not necessarily an implementation problem though (sort values for the group should be maintained separately).

          Is there a way to specify the number of groups that I want back instead of the number of documents? Or am I supposed to just over-request (rows=num_groups_I_want*threshold) and ignore if I get too many documents back?

          Random thought: We need a test to make sure this works with multi-select faceting (SimpleFacets asks for the docset of the base query...)

          Distributed Search: should be able to use the same type of algorithm that faceting does to ensure accurate counts.

          Performance: yes, it looks like the current code uses a lot of memory.
          Here's an algorithm that I thought of on my last plane ride that can do much better (assuming max() is the aggregation function):

          =================== two pass collapsing algorithm for collapse.aggregate=max ====================
          First pass: pretend that collapseCount=1
            - Use a TreeSet as  a priority queue since one can remove and insert entries.
            - A HashMap<Key,TreeSetEntry> will be used to map from collapse group to top entry in the TreeSet
            - compare new doc with smallest element in treeset.  If smaller discard and go to the next doc.
  - If new doc is bigger, look up its group.  Use the Map to find if the group has been added to the TreeSet and add it if not.
            - If the new bigger doc is already in the TreeSet, compare with the document in that group.  If bigger, update the node,
              remove and re-add to the TreeSet to re-sort.
          
          efficiency: the treeset and hashmap are both only the size of the top number of docs we are looking at (10 for instance)
          We will now have the top 10 documents collapsed by the right field with a collapseCount of 1.  Put another way, we have the top 10 groups.
          
          Second pass (if collapseCount>1):
           - create a priority queue for each group (10) of size collapseCount
           - re-execute the query (or if the sort within the collapse groups does not involve score, we could just use the docids gathered during phase 1)
 - for each document, find its appropriate priority queue and insert
           - optimization: we can use the previous info from phase1 to even avoid creating a priority queue if no other items matched.
          
          So instead of creating collapse groups for every group in the set (as is done now?), we create it for only 10 groups.
          Instead of collecting the score for every document in the set (40MB per request for a 10M doc index is *big*) we re-execute the query if needed.
          We could optionally store the score as is done now... but I bet aggregate throughput on large indexes would be better by just re-executing.
          
          Other thought: we could also cache the first phase in the query cache which would allow one to quickly move to the 2nd phase for any collapseCount.
          
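          The first pass above can be sketched in Java roughly as follows (an illustrative stand-in with invented class and field names, not code from the patch): a TreeSet ordered by score serves as a removable priority queue of the current top-N groups, and a HashMap maps each group key to its entry in the set.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

// Sketch of the first pass: keep only the top-N groups, each represented
// by its best-scoring document so far.
public class FirstPassCollapser {

    static final class Entry {
        final String group;
        int doc;
        float score;

        Entry(String group, int doc, float score) {
            this.group = group; this.doc = doc; this.score = score;
        }
    }

    private final int topN;
    // Ascending by score, so first() is the weakest kept group; ties are
    // broken by docid so distinct entries never compare as equal.
    private final TreeSet<Entry> queue = new TreeSet<>(
            Comparator.<Entry>comparingDouble(e -> e.score).thenComparingInt(e -> e.doc));
    private final Map<String, Entry> byGroup = new HashMap<>();

    public FirstPassCollapser(int topN) { this.topN = topN; }

    public void collect(int doc, float score, String group) {
        Entry existing = byGroup.get(group);
        if (existing != null) {
            if (score > existing.score) {
                // Better doc for a group already kept: remove, update, re-add
                // so the TreeSet re-sorts the entry.
                queue.remove(existing);
                existing.doc = doc;
                existing.score = score;
                queue.add(existing);
            }
            return;
        }
        if (queue.size() < topN) {
            Entry e = new Entry(group, doc, score);
            queue.add(e);
            byGroup.put(group, e);
        } else if (score > queue.first().score) {
            Entry evicted = queue.pollFirst();   // drop the weakest group
            byGroup.remove(evicted.group);
            Entry e = new Entry(group, doc, score);
            queue.add(e);
            byGroup.put(group, e);
        }
        // else: weaker than everything kept, discard.
    }

    public TreeSet<Entry> topGroups() { return queue; }
}
```

          Both structures stay bounded at topN entries (10 in the example above), regardless of how many documents or groups the query matches, which is the memory win over building a collapse group for every distinct field value.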
          Show
          Yonik Seeley added a comment - First, thanks to everyone who has spent so much time working on this - lack of committer attention doesn't equate to lack of interest... this is a very much needed feature! I'd agree with Erik that the most important thing is the interface to the client, and making it well thought out and semantically "tight". Martijn's recent improvements to the response structure is an example of improvements in this area. It's also important to think about the interface in terms of how easy it will be to add further features, optimizations, and support distributed search. If the code isn't sufficiently standalone, we also need to see how easily it fits into the rest of Solr (what APIs it adds or modifies, etc). Actually implementing performance improvements and more distributed search can come later - as long as we've thought about it now so we haven't boxed ourselves in. It seems like field collapsing should just be additional functionality of the query component rather than a separate component since it changes the results? The most basic question about the interface would be how to present groups. Do we stick with a linear document list and supplement that with extra info in a different part of the response (as the current approach takes)? Or stick that extra info in with some of the documents somehow? Or if collapse=true, replace the list of documents with a list of groups, each which can contain many documents? Which will be easiest for clients to deal with? If you were starting from scratch and didn't have to deal with any of Solr's current shortcomings, what would it look like? From the wiki: collapse.maxdocs - what does this actually mean? I assume it collects arbitrary documents up to the max (normally by index order)? Does this really make sense? Does it affect faceting, etc? 
If it does make sense, it seems like it would also make sense for normal non-collapsed query results too, in which case it should be implemented at that level. collapse.info.doc - what does that do? I understand counts per group, but what's count per doc? collapse.includeCollapsedDocs.fl - I don't understand this one, and can't find an example on the wiki or blogs. It says "Parameter indicating to return the collapsed documents in the response"... but I thought documents were included up until collapse.threshold. collapse.debug - should perhaps just be rolled into debugQuery, or another general debug param (someone recently suggested using a comma separated list... debug=timings,query, etc. Should I be able to specify a completely different sort within a group? collapse.sort=... seems nice... what are the implications? One bit of strangeness: it would seem to allow a highly ranked document responsible for the group being at the top of the list being dropped from the group due to a different sort criteria within the group. It's not necessarily an implementation problem though (sort values for the group should be maintained separately). Is there a way to specify the number of groups that I want back instead of the number of documents? Or am I supposed to just over-request (rows=num_groups_I_want*threshold) and ignore if I get too many documents back? Random thought: We need a test to make sure this works with multi-select faceting (SimpleFacets asks for the docset of be base query...) Distributed Search: should be able to use the same type of algorithm that faceting does to ensure accurate counts. Performance: yes, it looks like the current code uses a lot of memory. 
Here's an algorithm that I thought of on my last plane ride that can do much better (assuming max() is the aggregation function): =================== two pass collapsing algorithm for collapse.aggregate=max ==================== First pass: pretend that collapseCount=1 - Use a TreeSet as a priority queue since one can remove and insert entries. - A HashMap<Key,TreeSetEntry> will be used to map from collapse group to top entry in the TreeSet - compare new doc with smallest element in treeset. If smaller discard and go to the next doc. - If new doc is bigger, look up it's group. Use the Map to find if the group has been added to the TreeSet and add it if not. - If the new bigger doc is already in the TreeSet, compare with the document in that group. If bigger, update the node, remove and re-add to the TreeSet to re-sort. efficiency: the treeset and hashmap are both only the size of the top number of docs we are looking at (10 for instance) We will now have the top 10 documents collapsed by the right field with a collapseCount of 1. Put another way, we have the top 10 groups. Second pass ( if collapseCount>1): - create a priority queue for each group (10) of size collapseCount - re-execute the query (or if the sort within the collapse groups does not involve score, we could just use the docids gathered during phase 1) - for each document, find it's appropriate priority queue and insert - optimization: we can use the previous info from phase1 to even avoid creating a priority queue if no other items matched. So instead of creating collapse groups for every group in the set (as is done now?), we create it for only 10 groups. Instead of collecting the score for every document in the set (40MB per request for a 10M doc index is *big*) we re-execute the query if needed. We could optionally store the score as is done now... but I bet aggregate throughput on large indexes would be better by just re-executing. 
Other thought: we could also cache the first phase in the query cache which would allow one to quickly move to the 2nd phase for any collapseCount.
          Hide
          Yonik Seeley added a comment -

          First, thanks to everyone who has spent so much time working on this - lack of committer attention doesn't equate to lack of interest... this is a very much needed feature!

          I'd agree with Erik that the most important thing is the interface to the client, and making it well thought out and semantically "tight". Martijn's recent improvements to the response structure is an example of improvements in this area. It's also important to think about the interface in terms of how easy it will be to add further features, optimizations, and support distributed search. If the code isn't sufficiently standalone, we also need to see how easily it fits into the rest of Solr (what APIs it adds or modifies, etc). Actually implementing performance improvements and more distributed search can come later - as long as we've thought about it now so we haven't boxed ourselves in.

          It seems like field collapsing should just be additional functionality of the query component rather than a separate component since it changes the results?

          The most basic question about the interface would be how to present groups. Do we stick with a linear document list and supplement that with extra info in a different part of the response (as the current approach takes)? Or stick that extra info in with some of the documents somehow? Or if collapse=true, replace the list of documents with a list of groups, each which can contain many documents? Which will be easiest for clients to deal with? If you were starting from scratch and didn't have to deal with any of Solr's current shortcomings, what would it look like?

          From the wiki:
          collapse.maxdocs - what does this actually mean? I assume it collects arbitrary documents up to the max (normally by index order)? Does this really make sense? Does it affect faceting, etc? If it does make sense, it seems like it would also make sense for normal non-collapsed query results too, in which case it should be implemented at that level.

          collapse.info.doc - what does that do? I understand counts per group, but what's count per doc?

          collapse.includeCollapsedDocs.fl - I don't understand this one, and can't find an example on the wiki or blogs. It says "Parameter indicating to return the collapsed documents in the response"... but I thought documents were included up until collapse.threshold.

          collapse.debug - should perhaps just be rolled into debugQuery, or another general debug param (someone recently suggested using a comma separated list... debug=timings,query, etc.

          Should I be able to specify a completely different sort within a group? collapse.sort=... seems nice... what are the implications? One bit of strangeness: it would seem to allow a highly ranked document responsible for the group being at the top of the list being dropped from the group due to a different sort criteria within the group. It's not necessarily an implementation problem though (sort values for the group should be maintained separately).

          Is there a way to specify the number of groups that I want back instead of the number of documents? Or am I supposed to just over-request (rows=num_groups_I_want*threshold) and ignore if I get too many documents back?

          Random thought: We need a test to make sure this works with multi-select faceting (SimpleFacets asks for the docset of be base query...)

          Distributed Search: should be able to use the same type of algorithm that faceting does to ensure accurate counts.

          Performance: yes, it looks like the current code uses a lot of memory.
          Here's an algorithm that I thought of on my last plane ride that can do much better (assuming max() is the aggregation function):

          =================== two pass collapsing algorithm for collapse.aggregate=max ====================
          First pass: pretend that collapseCount=1
            - Use a TreeSet as a priority queue since one can remove and insert entries.
            - A HashMap<Key,TreeSetEntry> will be used to map from collapse group to top entry in the TreeSet
            - compare new doc with smallest element in treeset. If smaller discard and go to the next doc.
            - If new doc is bigger, look up it's group. Use the Map to find if the group has been added to the TreeSet and add it if not.
            - If the new bigger doc is already in the TreeSet, compare with the document in that group. If bigger, update the node,
              remove and re-add to the TreeSet to re-sort.
          
          efficiency: the treeset and hashmap are both only the size of the top number of docs we are looking at (10 for instance)
          We will now have the top 10 documents collapsed by the right field with a collapseCount of 1. Put another way, we have the top 10 groups.
          
          Second pass (if collapseCount>1):
           - create a priority queue for each group (10) of size collapseCount
           - re-execute the query (or if the sort within the collapse groups does not involve score, we could just use the docids gathered during phase 1)
           - for each document, find it's appropriate priority queue and insert
           - optimization: we can use the previous info from phase1 to even avoid creating a priority queue if no other items matched.
          
          So instead of creating collapse groups for every group in the set (as is done now?), we create it for only 10 groups.
          Instead of collecting the score for every document in the set (40MB per request for a 10M doc index is *big*) we re-execute the query if needed.
          We could optionally store the score as is done now... but I bet aggregate throughput on large indexes would be better by just re-executing.
          
          Other thought: we could also cache the first phase in the query cache which would allow one to quickly move to the 2nd phase for any collapseCount.
          
          Mark Miller added a comment -

          This is a huge difference, considering the number of non-committers involved in this issue.

          It's not really any different than putting it in trunk. Non-committers can still post patches to the branch in JIRA, the same as if the issue were in trunk. Smaller, more focused patches. If there are no benefits to a branch in this regard, what is the argument for putting this in trunk for further dev? Might as well just stay in patch form until it's ready then.

          If your patch does not modify any existing files you never have to sync it w/ trunk. It is always synced.

          You have to apply the patch. With a branch you have to type a merge command. It's the same effort - a single command.

          Noble Paul added a comment -

          The main difference I see is that its easier for non committers to share updated patches

          This is a huge difference, considering the number of non-committers involved in this issue.

          If your patch does not modify any existing files you never have to sync it w/ trunk. It is always synced.

          Mark Miller added a comment -

          On the other hand if the code lives in a branch it is more work to keep it synced w/ the trunk than the patch itself.

          Is that true? Syncing a branch is the same as syncing a patch - non-conflicts are merged automatically and conflicts must be handled - same with a patch or a branch. And a patch gets out of date just as easily as a branch. The main difference I see is that it's easier for non-committers to share updated patches, whereas merging the branch will require the help of a committer if you want to share the merge with others. Anyone can check out the branch and merge with trunk though - it's literally the same effort as updating an out-of-date patch.

          Noble Paul added a comment -

          Solr already has a few places where the response format is still marked as experimental and as subject to changes in the future ....

          Marking the output format as experimental is just trying to be safe. We strive hard to ensure that we don't change it, or even if we do, it is not disruptive. So let us not take this as an excuse to be lax about the review of the public API.

          on keeping a separate branch....

          I would say a branch is less useful than a patch. If the patch applies to the trunk, I can be sure that I have the latest and greatest stuff. On the other hand, if the code lives in a branch it is more work to keep it synced w/ the trunk than the patch itself.

          @Uri
          I support your suggestion on splitting this issue into two, i.e. make the core changes in a separate patch. That is the plan anyway.

          Uri Boness added a comment -

          Essentially it boils down to two options:

          1. Keep it out of the trunk, in which case users that need this functionality will only get it by working with a patched Solr version of their own, or use a branch (in both cases, most likely they will miss the continuous work done on the trunk unless they keep on merging the changes)
          2. Keep it in the trunk with some caveats, in which case users have a chance to use this functionality out of the box

          In both cases, the user has a choice to make:

          • be satisfied by the performance of this feature
          • look for an alternative solution (other products)
          • give up this functionality all together (if their business requirements allow that)

          So the main difference here, I would say, is in how easy you'd like to make it for users to get this functionality. On the Solr development side, indeed once this is committed to the trunk there's much more responsibility on the committers to make it work (enhance performance and fix bugs)... but this is a good thing, as there is a high demand for this feature and as a community-driven project this demand should be satisfied. And I do think that the number of users using this patch already is a good indicator that it is good enough for quite a lot of use cases.

          I do agree though that before committing anything, the public API should be re-evaluated to minimize chances for BWC issues later on. BTW, regarding the response, Solr already has a few places where the response format is still marked as experimental and as subject to changes in the future (but it doesn't stop people from using this functionality, as they take the responsibility to adapt to any such future changes when they come).

          Now... writing this, it suddenly occurred to me that there might be another solution to this whole discussion which is in a way a combination of many of the suggestions in this thread. What if this patch were split into two: the changes to the core and the component itself. Now, if the changes to the core are not that drastic and make sense (or at least everyone can live with them), then perhaps they can be committed to the trunk. As for the rest of the patch (which consists of the search component and its other supporting classes), this can be put in SVN as a separate branch for contrib. The good thing about this solution is that the work done on this functionality will be in SVN, so you benefit from it as David mentioned above. The other benefit is that with this layout you can actually build the branched code base separately and distribute this functionality as a separate jar which can be deployed in a Solr 1.5x distribution. Again, a bit of work is left to the users (too much for my taste), but at least they're not forced to use a patched version of Solr. Would that be a possible solution?

          Patrick Eger added a comment -

          Hi, possibly not important but I would like to give my perspective as a user. Specifically, the code is very much production-ready in our opinion, albeit under a limited set of circumstances that we are comfortable with (< 5 million docs, no distributed search). Within those confines it works great and satisfies our needs, and we are more than willing to pay the performance hit since it's absolutely essential to the correct functionality. I suppose I'd disagree with the assertion that the performance is "unacceptable", as I think that is a value judgement each user will have to make.

          Modulo the discussion about the request format, output format and config (stuff that is hard to change later), I would much rather have the code be in and documented with those caveats clearly spelled out and probably tracked in separate JIRA issues, i.e. DO NOT USE IF SHARDING, > 5 million docs, etc. Again, just my 2c as a satisfied user.

          Grant Ingersoll added a comment -

          I'm not sold on the output yet, either. Have we considered it being inline? We're getting more and more parallel arrays we need to consider. I think with the other Solr issues that are looking at pseudo-fields and the ability for components to add results, that we could rework these things.

          Also, why don't the aggregate functions just work w/ all the existing functions?

          Noble Paul added a comment -

          The main problem with the patch is that the performance/resource consumption is unacceptable.

          • Is it true that the perf cost is avoidable?
          • or are there implementation details which can be optimized?

          We are working to make it ready for trunk. So anything that helps us move towards that objective is welcome.

          Mark Miller added a comment - - edited

          I very much disagree with a policy blocking non-production-ready code from being in source control

          Just to be clear, there is no such policy that I've seen - each decision just comes down to consensus. And as far as I know, our branch policy is pretty much "anything goes" - trunk is very different than svn. Anyone (anyone with access to svn that is) can play around with a branch for anything if they want.

          I agree with your thoughts on a branch - if the argument is that we want it to be easier for devs to check out and work on this, or for users to check out and build this without applying patches, why not just make a branch? Merging is annoying but not difficult - I've been doing plenty of branch merging lately, and while it's not glorious work, modern tools make it more of a grind than a challenge.

          David Smiley added a comment -

          I've been watching this thread forever without saying anything but want to offer my two cents, and then I'll butt out.

          I very much disagree with a policy blocking non-production-ready code from being in source control. All code starts off this way and it would be quite a shame not to leverage the advantages of source control simply because it isn't ready yet. If people are uncomfortable with it being in trunk then simply use a branch. Of course, how simple "simple" is depends on one's comfort with source control and the particular source control technology used and tools to help you (e.g. IDEs). By the way, git makes "feature branches" (which is what this would be) easy to manage and integrates bidirectionally with subversion. If you're not comfortable with branching because you're not familiar with it then you need to learn. By "you" I don't mean anyone in particular, I mean all professional software developers. Source control and branching are tools of our trade.

          Mark Miller added a comment - - edited

          (Faceting got a 50 times perf boost in 1.4)

          No it didn't. Certain cases have gotten a boost (I think you might be referring to multi-valued field faceting cases?). And general faceting was always relatively fast and scalable.

          I'm against committing features to trunk with a warning that the feature is not ready for trunk.

          Noble Paul added a comment -

          This patch has quite a resource/performance hit. I've seen and read about the resource hit. It's rather large.

          The performance price is paid only if you use this component. Having the functionality itself in Solr is quite important. Performance can obviously be improved. (Faceting got a 50 times perf boost in 1.4.) As long as the performance of the component is within an acceptable range we should leave that call to the user. The cost actually depends on the data set too.

          As long as the component has a correct public API (req params/response format/configuration) I believe it can be committed with a clear warning.

          Mark Miller added a comment -

          I'm with Grant on this one. Trunk is not a sandbox, and getting more developer attention is not a good reason to put something in trunk. Issues should go in when they are ready.

          Tons of interest and votes doesn't mean rush to trunk - if that type of thing moves you, it means start putting some work into it to make it ready for trunk.

          This patch has quite a resource/performance hit. I've seen and read about the resource hit. It's rather large. The performance hit is not any better. The linked-to blog marks performance with collapsing as 5-10 times slower than without.

          Personally, I don't think this issue is ready for trunk.

          Uri Boness added a comment -

          I'm curious as to whether anyone has just thought of using the Clustering component for this? If your "collapse" field was a single token, I wonder if you would get the results you're looking for.

          The main difference between the two components is that while clustering works more as a function, where the input is the doclist/docset and the output is a separate data structure representing the groups, the collapse component operates directly on the docset & doclist, modifies them, and incorporates the groups within the final search result.

          In all occurrences where we found the need for the collapse component, we needed to incorporate the grouping within the search result and adjust the sorting and the pagination accordingly. As far as I know you cannot do that with the clustering component. This tight integration with the result is also the reason why the collapse component right now is actually a replacement for the query component.
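As a toy illustration of what "operates directly on the doclist" means, a collapse pass can be seen as filtering the result list in place, using the collapse.type=normal/adjacent and collapse.max semantics from the issue description. This sketch is hypothetical and not the patch's API; it works on bare field values instead of documents:

```java
import java.util.*;

// Hypothetical illustration (not the patch's API): collapsing rewrites the
// result list itself instead of producing a separate cluster structure.
class CollapseIllustration {

    /**
     * Keeps at most max results per field value. "adjacent" only collapses
     * consecutive duplicates; "normal" collapses duplicates anywhere in the list.
     */
    static List<String> collapse(List<String> fieldValues, String type, int max) {
        List<String> out = new ArrayList<>();
        Map<String, Integer> counts = new HashMap<>(); // per-value counts ("normal")
        String prev = null;
        int run = 0;
        for (String v : fieldValues) {
            if (type.equals("adjacent")) {
                run = v.equals(prev) ? run + 1 : 1; // length of the current run
                prev = v;
                if (run <= max) out.add(v);
            } else { // "normal"
                if (counts.merge(v, 1, Integer::sum) <= max) out.add(v);
            }
        }
        return out;
    }
}
```

Because the output is still an ordered result list, sorting and pagination apply to it directly, which is exactly what the clustering component's separate output structure cannot give you.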

          Grant Ingersoll added a comment -

          I'm curious as to whether anyone has just thought of using the Clustering component for this? If your "collapse" field was a single token, I wonder if you would get the results you're looking for.

          Martijn van Groningen added a comment -

          For Shalin:

          I just don't think that we should introduce new tags and new kinds of components in solrconfig.xml, particularly those that are useful to only a single component. That introduces changes in SolrConfig.java so that it knows how to load such things. That is why I moved that configuration inside CollapseComponent. Ideally, all components will use PluginInfo and load whatever they need from their own PluginInfo object and SolrConfig would not need to be changed unless we introduce new kinds of Solr plugins.

          I agree about the PluginInfo and I think it is the right place for field collapse config.

          Just curious, what would be a use-case for sharing factories (other than reducing duplication of configuration) and having multiple CollapseComponent?

          Besides differently configured CollapseCollectorFactories, none.

          I don't think we need to add that functionality to CoreContainer and SolrDispatchFilter. It is still possible to specify a different solrconfig and schema for a test. Let me see if I can make this work with BaseDistributedSearchTestCase

          That would be great!

          Shalin Shekhar Mangar added a comment -

          Changes:

          1. Modified configuration as Noble suggested. The AggregateCollapseCollectorFactory is now PluginInfoInitialized instead of NamedListInitialized and functions are plugins. The "name" attribute is removed from "collapseCollectorFactory" since it is no longer necessary:
            <searchComponent name="collapse" class="org.apache.solr.handler.component.CollapseComponent">
              <collapseCollectorFactory class="solr.fieldcollapse.collector.DocumentGroupCountCollapseCollectorFactory"/>
              <collapseCollectorFactory class="solr.fieldcollapse.collector.FieldValueCountCollapseCollectorFactory"/>
              <collapseCollectorFactory class="solr.fieldcollapse.collector.DocumentFieldsCollapseCollectorFactory"/>
              <collapseCollectorFactory class="org.apache.solr.search.fieldcollapse.collector.AggregateCollapseCollectorFactory">
                <function name="sum" class="org.apache.solr.search.fieldcollapse.collector.aggregate.SumFunction"/>
                <function name="avg" class="org.apache.solr.search.fieldcollapse.collector.aggregate.AverageFunction"/>
                <function name="min" class="org.apache.solr.search.fieldcollapse.collector.aggregate.MinFunction"/>
                <function name="max" class="org.apache.solr.search.fieldcollapse.collector.aggregate.MaxFunction"/>
              </collapseCollectorFactory>
              <fieldCollapseCache class="solr.FastLRUCache"
                                  size="512"
                                  initialSize="512"
                                  autowarmCount="128"/>
            </searchComponent>
            
          2. Changed DistributedFieldCollapsingIntegrationTest to use BaseDistributedSearchTestCase. This fails right now. I believe there is a bug with the distributed implementation. The distributed version returns one extra group when compared to the non-distributed version. I've put an @Ignore annotation on that test.

          We can consider creating the functions through a factory so that they can accept initialization parameters. The schema-fieldcollapse.xml and solrconfig-fieldcollapse.xml are no longer necessary and can be removed.
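To make the role of those function plugins concrete, here is a minimal stand-alone sketch of what the configured sum/avg/min/max functions compute: a fold over a collapse group's numeric field values. The interface and names below are hypothetical illustrations, not the patch's actual AggregateCollapseCollectorFactory API.

```java
import java.util.List;

// Hypothetical sketch: an aggregate function folds the numeric values of
// one collapse group's field into a single value (as sum/avg/min/max do).
public class AggregateDemo {
    interface AggregateFunction {
        double aggregate(List<Double> groupValues);
    }

    // Analogous in spirit to the configured SumFunction / MaxFunction plugins.
    static final AggregateFunction SUM =
        vals -> vals.stream().mapToDouble(Double::doubleValue).sum();
    static final AggregateFunction MAX =
        vals -> vals.stream().mapToDouble(Double::doubleValue).max().orElse(Double.NaN);

    public static void main(String[] args) {
        // Field values of one collapse group (e.g. prices of documents from one site).
        List<Double> prices = List.of(10.0, 20.0, 5.0);
        System.out.println(SUM.aggregate(prices)); // prints 35.0
        System.out.println(MAX.aggregate(prices)); // prints 20.0
    }
}
```

Creating such functions through a factory, as suggested above, would let each one accept initialization parameters from its plugin configuration.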

          Next steps:

          1. Let us open issues for all the modifications needed in Solr to support this feature. That will help us break down this patch into more manageable (and easily reviewable) pieces. I guess we need one for providing custom Collectors for SolrIndexSearcher methods. Any others?
          2. The response format is not very clear in the wiki. We should add more examples and explain the format.
          Shalin Shekhar Mangar added a comment -

          For Martijn:

          The reason I added <fieldCollapsing> ... </fieldCollapsing> was to be able to support sharing of collapseCollectorFactory instances between different collapse components in the near future. Do you think that is a valid reason? Or do you think that collapseCollectorFactories shouldn't be shared?

          I just don't think that we should introduce new tags and new kinds of components in solrconfig.xml, particularly those that are useful to only a single component. That introduces changes in SolrConfig.java so that it knows how to load such things. That is why I moved that configuration inside CollapseComponent. Ideally, all components will use PluginInfo and load whatever they need from their own PluginInfo object and SolrConfig would not need to be changed unless we introduce new kinds of Solr plugins.

          Just curious, what would be a use-case for sharing factories (other than reducing duplication of configuration) and having multiple CollapseComponent?

          The CollapseComponentTest was failing. The field collapseCollectorFactories in CollapseComponent was null when not specifying any collapse collector factories in the solrconfig.xml which resulted in a NPE.

          Oops, sorry about that. I only ran the tests inside org.apache.solr.search.fieldcollapse. I didn't notice there are other tests too. Thanks!

          The DistributedFieldCollapsingIntegrationTest is still failing, because you left out changes in JettySolrRunner, CoreContainer and SolrDispatchFilter from my original patch.

          I don't think we need to add that functionality to CoreContainer and SolrDispatchFilter. It is still possible to specify a different solrconfig and schema for a test. Let me see if I can make this work with BaseDistributedSearchTestCase

          Noble Paul added a comment -

          I think that is all the more reason why it needs to be done right and not just be a "good start".

          The fact that it has been around for so long means that the "good start" is gonna take longer to happen. In my opinion, we should fix the obvious stuff and commit this with a clear warning in the javadocs and wiki that this has perf issues and the code/API/configuration may change incompatibly in the future.

          Committed stuff I'll try out easier than patches actually.

          +1 There is a better chance of developers taking a look at it if it is already in the trunk.

          Martijn van Groningen added a comment -

          Shalin, I have updated your patch.

          1. The CollapseComponentTest was failing. The field collapseCollectorFactories in CollapseComponent was null when not specifying any collapse collector factories in the solrconfig.xml which resulted in a NPE.
          2. Removed a system.out that I accidentally added in my previous patch.

          The DistributedFieldCollapsingIntegrationTest is still failing, because you left out changes in JettySolrRunner, CoreContainer and SolrDispatchFilter from my original patch. Those changes allowed me to specify a different schema file for this particular test. I think it is important for test coverage to have this test. Should I add the fields from schema-fieldcollapse.xml to the schema.xml that the other tests use? The test should then succeed.

          Grant Ingersoll added a comment -

          Grant, this patch may not be perfect but I think we all agree that it is a great start. This is stable, used by many and has been well supported by the community. This is also a large patch and as I have known from my DataImportHandler experience, maintaining a large patch is quite a pain (and DataImportHandler didn't even touch the core). How about we commit this (after some review, of course), mark this as experimental (no guarantees of any sort) and then start improving it one issue at a time? Alternately, if you are not comfortable adding it to trunk, we can commit this on a branch and merge into trunk later.

          Which is why it should not go in unless it is ready. Adding a large patch that isn't right just b/c it's been around for a while and is "hard to maintain" is no reason to just go commit something. The problem w/ committing something that isn't ready is then we have to do even more work to maintain it, thus taking away from the opportunity to make it better.

          As for the voting and the popularity, I think that is all the more reason why it needs to be done right and not just be a "good start". With this many eyes on it, it shouldn't be hard to get people testing it and giving feedback.

          If the issue is that the patch is too big, then perhaps it needs to be broken up into smaller pieces that lay the framework for field collapsing to work.

          Erik Hatcher added a comment - - edited

          I'll just add my 0,02€ - the main thing to vet now that it works (first make it work), is the interface to the client. are the request params ideal? is the response data structure locked down? if so, get this committed ASAP and iterate on the internals of distributed and performance issues (then make it right).

          Admittedly I've not tried this feature out myself though. Committed stuff I'll try out easier than patches actually.

          Martijn van Groningen added a comment -

          I have updated the response examples on the wiki.

          Some time ago I tried to come up with an accurate distributed solution, but I ran into a problem, as I described in a previous comment:

          ....
          Field collapsing keeps track of the number of documents collapsed per unique field value and the total count of documents encountered per unique field value. If the total count is greater than the specified collapse threshold, then the number of documents collapsed is the difference between the total count and the threshold. Let's say we have two shards, each with one document with the same field value. The collapse threshold is one, meaning that if we run the collapsing algorithm on each shard individually, neither document will ever be collapsed. But when the algorithm is applied across both shards, one of the documents must be collapsed; however, neither shard knows that its document is the one to collapse.

          There are more situations described as above, but it all boils down to the fact that each shard does not have meta information about the other shards in the cluster. Sharing the intermediate collapse results between the shards is in my opinion not an option. This is because if you do that then you also need to share information about documents / fields that have a collapse count of zero. This is totally impractical for large indexes.
          ....

          I'm really curious how others have addressed this issue. I have not stumbled on any literature on this particular issue, maybe someone else has.
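The two-shard scenario above can be sketched as a tiny simulation; the class and helper names below are hypothetical illustrations, not code from the patch, but they show why per-shard collapsing under-counts relative to a global view of the same documents:

```java
import java.util.*;

// Illustration of the counting problem: collapsing each shard in isolation
// misses duplicates whose occurrences are spread across shards.
public class CollapseCountDemo {
    // Count how many documents collapse for a list of field values:
    // for each unique value, anything beyond the threshold is collapsed.
    static int collapsedCount(List<String> values, int threshold) {
        Map<String, Integer> counts = new HashMap<>();
        for (String v : values) counts.merge(v, 1, Integer::sum);
        int collapsed = 0;
        for (int c : counts.values()) collapsed += Math.max(0, c - threshold);
        return collapsed;
    }

    public static void main(String[] args) {
        int threshold = 1;
        List<String> shard1 = List.of("siteA"); // one document with value siteA
        List<String> shard2 = List.of("siteA"); // another document, same value

        // Each shard sees only one occurrence, so nothing collapses...
        int perShard = collapsedCount(shard1, threshold)
                     + collapsedCount(shard2, threshold);

        // ...but a global view of the same documents must collapse one.
        List<String> global = new ArrayList<>(shard1);
        global.addAll(shard2);
        int globalCollapsed = collapsedCount(global, threshold);

        System.out.println(perShard + " " + globalCollapsed); // prints "0 1"
    }
}
```

The discrepancy (0 vs 1) is exactly the missing cross-shard metadata the comment describes.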

          Uri Boness added a comment -

          Grant, this patch may not be perfect but I think we all agree that it is a great start. This is stable, used by many and has been well supported by the community. This is also a large patch and as I have known from my DataImportHandler experience, maintaining a large patch is quite a pain (and DataImportHandler didn't even touch the core). How about we commit this (after some review, of course), mark this as experimental (no guarantees of any sort) and then start improving it one issue at a time? Alternately, if you are not comfortable adding it to trunk, we can commit this on a branch and merge into trunk later.

          I think managing a separate branch will be just as hard as managing a patch. I do however agree that it's about time this patch was committed to the trunk. Even though the current solution is not scalable in terms of distributed search (and I agree that the current solution for that is not really viable), many are already using it and it is the most wanted feature in JIRA after all. One thing you can do is apply the changes to the core (which are not really many) and commit the rest of the patch as a contrib (along with all the disclaimers Shalin mentioned above).

          Shalin Shekhar Mangar added a comment -

          I'd define large scale for this in a couple of ways:
          1. Lots of docs in the result set (10K+)
          2. Lots of overall docs (100M+)
          3. Lots of queries (> 10 QPS)

          Grant, this patch may not be perfect but I think we all agree that it is a great start. This is stable, used by many and has been well supported by the community. This is also a large patch and as I have known from my DataImportHandler experience, maintaining a large patch is quite a pain (and DataImportHandler didn't even touch the core). How about we commit this (after some review, of course), mark this as experimental (no guarantees of any sort) and then start improving it one issue at a time? Alternately, if you are not comfortable adding it to trunk, we can commit this on a branch and merge into trunk later.

          What do you think?

          Oleg Gnatovskiy added a comment -

          Grant - I agree regarding the current distributed implementation. The implementation is pretty much pseudo-distributed and would cause many companies (ours included) to have to completely restructure their indexes. What we tried long ago was to have the process method on each shard return the id that is being collapsed on, along with documentId and score. Then, in mergeIds we would do another level of collapse - basically keeping only one of the documents with a unique collapseId, and removing the others from all other shards.

          Obviously this caused several problems, not the least of which being that facet counts would always be slightly off, since we might have removed a document that was counted by the facetComponent.
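That merge-time approach can be sketched roughly as follows. This is a simplified stand-alone illustration; the record fields and method names are hypothetical, not the actual mergeIds code, and real merging would also respect the requested sort and row count.

```java
import java.util.*;

// Simplified sketch of collapsing at merge time: each shard returns
// (docId, collapseId, score) tuples, and the merger keeps only the
// highest-scoring document per collapseId across all shards.
public class MergeCollapseDemo {
    record ShardDoc(String docId, String collapseId, float score) {}

    static List<ShardDoc> mergeAndCollapse(List<List<ShardDoc>> shardResults) {
        Map<String, ShardDoc> best = new LinkedHashMap<>();
        for (List<ShardDoc> shard : shardResults) {
            for (ShardDoc d : shard) {
                ShardDoc cur = best.get(d.collapseId());
                if (cur == null || d.score() > cur.score()) {
                    best.put(d.collapseId(), d); // drop lower-scoring duplicates
                }
            }
        }
        List<ShardDoc> merged = new ArrayList<>(best.values());
        merged.sort((a, b) -> Float.compare(b.score(), a.score()));
        return merged;
    }

    public static void main(String[] args) {
        List<ShardDoc> shard1 = List.of(new ShardDoc("d1", "siteA", 0.9f),
                                        new ShardDoc("d2", "siteB", 0.5f));
        List<ShardDoc> shard2 = List.of(new ShardDoc("d3", "siteA", 0.7f));
        for (ShardDoc d : mergeAndCollapse(List.of(shard1, shard2))) {
            System.out.println(d.docId()); // prints "d1" then "d2"; d3 is collapsed away
        }
        // Note: facet counts computed per shard would still include d3,
        // which is the skew described in the comment above.
    }
}
```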

          Grant Ingersoll added a comment -

          I think you are also referring to sharding. Sharding is supported, but not in a very elegant way. You will need to partition your documents to your shards in such a way that all documents belonging to a collapse group appear on one shard. To be honest I have never tested the patch on a corpus of 100M docs.

          That doesn't seem good and I don't think it will work w/ all the distributed work going on. I will likely have some time next week to help out. Has anyone looked at how Google or others do this? Clearly they collapse at very large scale w/ no noticeable detrimental effect. Anyone looked at the literature on this?

          The first two response examples are for 'old' patches. The last response example is for the more recent patches (and current patch).

          OK, good to know. Can you update the page to reflect the latest patch?

          Martijn van Groningen added a comment -

          Shalin.
          1. This configuration also looks fine by me. The reason I added <fieldCollapsing> ... </fieldCollapsing> was to be able to support sharing of collapseCollectorFactory instances between different collapse components in the near future. Do you think that is a valid reason? Or do you think that collapseCollectorFactories shouldn't be shared?
          2. I forgot to create that, so a good thing you added it.
          3. I think leaving out those changes will make the distributed integration tests fail (Haven't checked it).

          Noble.
          1. The reason I gave a name to collapseCollectorFactory was for using an instance twice for different collapse components.
          2. Moving the classname to the class attribute looks better than putting it in the function element. So I think we should change that.

          Grant.
          1. I think you are also referring to sharding. Sharding is supported, but not in a very elegant way. You will need to partition your documents to your shards in such a way that all documents belonging to a collapse group appear on one shard. To be honest I have never tested the patch on a corpus of 100M docs.
          2. Field collapsing can impact the search time in a very negative way. I wrote a small paragraph about it on my blog.
          3. The first two response examples are for 'old' patches. The last response example is for the more recent patches (and current patch).
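The partitioning requirement in point 1 amounts to routing documents by the collapse field rather than by document id. A hypothetical sketch of such a routing rule (not part of the patch; real deployments would do this at indexing time):

```java
// Hypothetical document-routing sketch: sending every document whose
// collapse field hashes to the same bucket to the same shard guarantees
// that a collapse group never spans shards.
public class CollapseRoutingDemo {
    static int shardFor(String collapseFieldValue, int numShards) {
        // Math.floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(collapseFieldValue.hashCode(), numShards);
    }

    public static void main(String[] args) {
        int numShards = 4;
        // Two documents with the same collapse value always land on the same shard,
        // because the routing depends only on the field value.
        int s1 = shardFor("example.com", numShards);
        int s2 = shardFor("example.com", numShards);
        System.out.println(s1 == s2); // prints "true"
    }
}
```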

          Grant Ingersoll added a comment -

          Is there a typo on the http://wiki.apache.org/solr/FieldCollapsing page in regards to the outputs? There are two different output results, but the URL for the examples is the same. See http://wiki.apache.org/solr/FieldCollapsing#Examples. I think the second one is intended to show a collapse count for fields?

          Also, I'm not sold on having separate collapse elements from the actual response (but I know other things do it too, so it isn't a huge deal), but the list of "parallel arrays" that one needs to traverse in order to render results is growing (highlighter, MLT, now Field Collapsing).

          Grant Ingersoll added a comment -

          I'd define large scale for this in a couple of ways:
          1. Lots of docs in the result set (10K+)
          2. Lots of overall docs (100M+)
          3. Lots of queries (> 10 QPS)

          Stephen Weiss added a comment -

          How do we define "large scale"? I have an index of about 5 million docs. Does that qualify? I'm working on it right now, I can run whatever benchmarks you like.

          Grant Ingersoll added a comment -

          Does anybody have a reason for why this should not be committed to trunk as it stands right now?

          It's been a while, but the last time I looked at it (3-4 mos. ago) I had the impression that it wouldn't scale. Has anyone benchmarked this at large scale?

          Noble Paul added a comment - - edited

          Shalin, the names may not be necessary on the collapseCollectorFactory because they are never referred to by name.

          How about making the functions plugins too, as:

          <collapseCollectorFactory class="org.apache.solr.search.fieldcollapse.collector.AggregateCollapseCollectorFactory">        
                <function name="sum" class="org.apache.solr.search.fieldcollapse.collector.aggregate.SumFunction"/>
                <function name="avg" class="org.apache.solr.search.fieldcollapse.collector.aggregate.AverageFunction"/>
                <function name="min" class="org.apache.solr.search.fieldcollapse.collector.aggregate.MinFunction"/>
                <function name="max" class="org.apache.solr.search.fieldcollapse.collector.aggregate.MaxFunction"/>
          </collapseCollectorFactory>
                  
          
          Shalin Shekhar Mangar added a comment -

          Patch in sync with trunk.

          1. CollapseComponent is PluginInfoInitialized. Removed changes to SolrConfig. Note that the collapseCollectorFactories array and the separate fieldCollapsing element have been removed from configuration. This patch has the following configuration:
            <searchComponent name="collapse" class="org.apache.solr.handler.component.CollapseComponent">
                <collapseCollectorFactory name="groupDocumentsCounts" class="solr.fieldcollapse.collector.DocumentGroupCountCollapseCollectorFactory" />
            
                <collapseCollectorFactory name="groupFieldValue" class="solr.fieldcollapse.collector.FieldValueCountCollapseCollectorFactory" />
            
                <collapseCollectorFactory name="groupDocumentsFields" class="solr.fieldcollapse.collector.DocumentFieldsCollapseCollectorFactory" />
            
                <collapseCollectorFactory name="groupAggregatedData" class="org.apache.solr.search.fieldcollapse.collector.AggregateCollapseCollectorFactory">
                    <lst name="aggregateFunctions">
                        <str name="sum">org.apache.solr.search.fieldcollapse.collector.aggregate.SumFunction</str>
                        <str name="avg">org.apache.solr.search.fieldcollapse.collector.aggregate.AverageFunction</str>
                        <str name="min">org.apache.solr.search.fieldcollapse.collector.aggregate.MinFunction</str>
                        <str name="max">org.apache.solr.search.fieldcollapse.collector.aggregate.MaxFunction</str>
                    </lst>
                </collapseCollectorFactory>
            
               <fieldCollapseCache
                  class="solr.FastLRUCache"
                  size="512"
                  initialSize="512"
                  autowarmCount="128"/>
              </searchComponent>
            
          2. I couldn't find where the fieldCollapseCache was being regenerated. It seems it was not being thrown away after commits, so I have changed it to be re-created on the newSearcher event.
          3. Removed changes to JettySolrRunner, CoreContainer and SolrDispatchFilter for the distributed test case. We will refactor it to use BaseDistributedSearchTestCase (not implemented yet).
          Martijn van Groningen added a comment -

          Well, that is nice to hear, Stephen. I think I will add a 1.4-compatible patch to the issue, so people do not have issues while patching.
          I think it is a good idea, Shalin, to add the patch to the trunk as it is. The patch is quite stable now. Any future work related to field collapsing should go into new issues (this is the longest issue I've ever seen). Does anyone else have a reason why field collapsing shouldn't be committed to the trunk?

          Shalin Shekhar Mangar added a comment -

          Does anybody have a reason for why this should not be committed to trunk as it stands right now?

          Stephen Weiss added a comment -

          Martijn, I'm about to upgrade our production servers to Solr 1.4 with this latest patch you just posted, and the difference is incredible. The time from startup to first collapsed query results has gone from 90 seconds down to about 20, and subsequent searches seem to execute about twice as fast on average. SOLR-236 has come a very long way in the year since we last patched. Thanks for all the hard work, it's truly great.

          FYI, it doesn't patch cleanly against the 1.4 distribution tarball, but I don't even understand what the conflict is; reading the patch, the original code in the area that failed looked identical to what the patch was expecting:

          (in QueryComponent.java)

          sreq.params.remove(ResponseBuilder.FIELD_SORT_VALUES); // this was there
          +
          + // disable collapser
          + sreq.params.remove("collapse.field");
          +
          // make sure that the id is returned for correlation. // and so was this?

          Maybe it's a whitespace issue? Anyway it works fine if you just paste it in place.

          Martijn van Groningen added a comment -

          @Marc. This was a silly bug, that occurs when you do not define a field collapse cache in the solrconfig.xml. I have attached a patch that fixes this bug, so you can use field collapse without configuring a field collapse cache. Caching with field collapsing is an optional feature.

          @Chad. Due to changes in the trunk, applying the previous patch will result in merge conflicts. The new patch can be applied without merge conflicts. This means that applying this patch to the 1.4 source will probably result in merge conflicts.

          Chad Kouse added a comment - - edited

          Just wanted to comment that I am experiencing the same behavior as Marc Menghin above (NPE). The patch did NOT apply cleanly (1 hunk failed), but I couldn't really tell why, since it looked like it should have worked; I just manually copied the hunk into the correct class. Sorry I didn't note what failed.

          Marc Menghin added a comment -

          Hi,

          I'm new to Solr, so sorry for my likely still incomplete setup. I got everything from Solr SVN and applied the patch (field-collapse-5.patch 2009-12-08 09:43 PM). When I search, I get an NPE because I do not seem to have a cache for the collapsing. It wants to add an entry to the cache but can't: there is none at that time, which it checks for in AbstractDocumentCollapser.collapse, but it still tries to use it later in AbstractDocumentCollapser.createDocumentCollapseResult. I suppose that's a bug? Or is something wrong on my side?

          Exception I get is:

          java.lang.NullPointerException
          at org.apache.solr.search.fieldcollapse.AbstractDocumentCollapser.createDocumentCollapseResult(AbstractDocumentCollapser.java:278)
          at org.apache.solr.search.fieldcollapse.AbstractDocumentCollapser.executeCollapse(AbstractDocumentCollapser.java:249)
          at org.apache.solr.search.fieldcollapse.AbstractDocumentCollapser.collapse(AbstractDocumentCollapser.java:172)
          at org.apache.solr.handler.component.CollapseComponent.doProcess(CollapseComponent.java:173)
          at org.apache.solr.handler.component.CollapseComponent.process(CollapseComponent.java:127)
          at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
          at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
          at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)

          I fixed it locally by only adding something to the cache if there is one (fieldCollapseCache != null). But I'm not very familiar with the code, so I'm not sure if that's the right way to fix it.
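A minimal sketch of the null guard described above. The class and method names here are illustrative, not the actual Solr classes; the point is only that every cache access is guarded when no `fieldCollapseCache` was configured:

```java
import java.util.Map;

// Hypothetical stand-in for the collapser's optional cache handling.
// When no <fieldCollapseCache> is configured, the map reference is null,
// and every access must be guarded to avoid the NPE Marc describes.
class GuardedCollapseCache {
    // null when no field collapse cache is configured in solrconfig.xml
    private final Map<String, Object> fieldCollapseCache;

    GuardedCollapseCache(Map<String, Object> cacheOrNull) {
        this.fieldCollapseCache = cacheOrNull;
    }

    void put(String key, Object collapseResult) {
        if (fieldCollapseCache != null) {   // the guard that prevents the NPE
            fieldCollapseCache.put(key, collapseResult);
        }
    }

    Object get(String key) {
        return fieldCollapseCache == null ? null : fieldCollapseCache.get(key);
    }
}
```

With this guard, collapsing still works without a configured cache; results are simply recomputed on every request instead of being cached.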

          Thanks,
          Marc

          Martijn van Groningen added a comment -

          I have updated the patch and fixed the following issues:

          • The issue that Marc described on the solr-dev list: the collapsed group identifiers disappeared when the id field was anything other than a plain field (int, long, etc.).
          • The caching was not working properly when the collapse.field was changed between requests. Queries that should not have been cached were.
          Martijn van Groningen added a comment - - edited

          I have attached a new patch that has the following changes:

          1. Added caching for the field collapse functionality. Check the solr wiki for how to configure field-collapsing with caching.
          2. Removed the collapse.max parameter (collapse.threshold must be used instead). It was deprecated for a long time.
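Since collapse.max is gone, requests have to send collapse.threshold instead. A hedged sketch of building such a request by hand (plain query string, no SolrJ; the host, port, and field name are illustrative):

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Builds a collapse request URL using the collapse.threshold parameter
// that replaces the removed collapse.max. Host and field are examples only.
class CollapseUrl {
    static String buildQuery(String q, String collapseField, int threshold) {
        return "http://localhost:8983/solr/select"
             + "?q=" + URLEncoder.encode(q, StandardCharsets.UTF_8)
             + "&collapse.field=" + collapseField
             + "&collapse.threshold=" + threshold;  // formerly collapse.max
    }
}
```

For example, buildQuery("*:*", "site", 1) produces a match-all query collapsed on the site field with a threshold of 1.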
          German Attanasio Ruiz added a comment -

          Tomorrow I'm going to try the patch. Next time I hope to help, and not only report the problem.

          Martijn van Groningen added a comment - - edited

          The reason the search results after the first search were incorrect was that the scores were not preserved in the cache. As a result, the collapsing algorithm could not properly group the documents into collapse groups (the most relevant document per group could not be determined), because there was no score information when retrieving the documents from the cache (as a DocSet in SolrIndexSearcher).

          In the attached patch I made sure that the score is also saved in the cache, so the collapsing algorithm can do its work properly when the documents are retrieved from the cache. Because the scores are now stored with the cached documents, the actual in-memory size of the filterCache will increase.
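To illustrate why the cached scores matter, here is a simplified sketch (not Solr's actual classes; the names are made up for this example) of picking the most relevant document per collapse group, which is only possible when each cached doc id carries its score:

```java
import java.util.HashMap;
import java.util.Map;

// Toy stand-in for a cached doc set that keeps scores alongside doc ids.
// Storing the scores is what makes the entry larger, as noted above.
class ScoredDocSet {
    final int[] docIds;
    final float[] scores;

    ScoredDocSet(int[] docIds, float[] scores) {
        this.docIds = docIds;
        this.scores = scores;
    }

    /** Returns the doc id with the highest score within each collapse group. */
    static Map<String, Integer> topDocPerGroup(ScoredDocSet docs, String[] groupValues) {
        Map<String, Integer> best = new HashMap<>();
        Map<String, Float> bestScore = new HashMap<>();
        for (int i = 0; i < docs.docIds.length; i++) {
            String group = groupValues[i];
            Float current = bestScore.get(group);
            if (current == null || docs.scores[i] > current) {
                bestScore.put(group, docs.scores[i]);
                best.put(group, docs.docIds[i]);
            }
        }
        return best;
    }
}
```

Without the scores array, every document in a group looks equally relevant, which is exactly the symptom German reported after the first (uncached) search.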

          Martijn van Groningen added a comment -

          I can confirm this bug. I will attach a new patch that fixes this issue shortly. Thanks for noticing.

          German Attanasio Ruiz added a comment -

          Sorting of results doesn't work properly. Below I detail the steps I followed and the problem I faced.

          I am using Solr as a search engine for web pages, in which I use a field named "site" for collapsing and sort by score.

          Steps
          After downloading the latest version of Solr ("solr-2009-11-15") and applying the patch "field-collapse-5.patch 2009-11-15 08:55 PM Martijn van Groningen 239 kB":

          STEP 1 - I run a search using field collapsing and the result is correct; the document with the greatest score is 0.477
          STEP 2 - I run the same search and field collapsing returns a different result with score 0.17; the (correct) result of step 1 does not appear again

          Possible problem
          Step 1 stores the document in the cache for future searches;
          at step 2 the search is done over the cache and does not find the previously stored document.

          Possible solution
          I believe the problem is in how the document is stored in the cache, since if we run step 2 again we get the same result: the document with score 0.17 is not removed from the results, the only result removed is the document with score 0.477.

          Conclusion
          Documents are not sorted properly when using field collapsing with the Solr cache, that is, when documents stored in the Solr cache are used.

          Thomas Woodard added a comment -

          And this morning, without changing anything, it is working fine. I don't know what happened on Friday, but the changes I made then must have fixed it without showing up for some reason. In any case, thank you for the assistance.

          Martijn van Groningen added a comment -

          I have attached a new patch that incorporates Michael's quasi-distributed patch, so you don't have to patch twice. In addition, the new patch also merges the collapse_count data from each individual shard's response. When using this patch you will still need to make sure that all documents of one collapse group stay on one shard, otherwise your collapse result will be incorrect. Documents of different collapse groups can live on different shards.

          Martijn van Groningen added a comment -

          What kind of exception is occurring if you use dismax (with and without field collapsing)? If I do a collapse search with dismax in the example setup (http://localhost:8983/solr/select/?q=power&collapse.field=inStock&qt=dismax) field collapsing appears to be working.

          Thomas Woodard added a comment - - edited

          I tried the build again, and you are right, it does work fine with the default search handler. I had been trying to get it working with our search handler, which uses dismax. That still doesn't work. Here is the handler configuration, which works fine until collapsing is added.

          <requestHandler name="glsearch" class="solr.SearchHandler">
          	<lst name="defaults">
          		<str name="defType">dismax</str>
          		<str name="qf">name^3 description^2 long_description^2 search_stars^1 search_directors^1 product_id^0.1</str>
          		<str name="tie">0.1</str>
          		<str name="facet">true</str>
          		<str name="facet.field">stars</str>
          		<str name="facet.field">directors</str>
          		<str name="facet.field">keywords</str>
          		<str name="facet.field">studio</str>
          		<str name="facet.mincount">1</str>
          	</lst>
          </requestHandler>
          

          Edit: The search fails even if you don't pass a collapse field.

          Martijn van Groningen added a comment -

          Thomas, the method that cannot be found (SolrIndexSearcher.getDocSet(...)) is a method that is part of the patch, so if the patch was successfully applied this should not happen.
          When I released the latest patch I only tested against the Solr trunk, but I have done the following to verify that the patch works with the 1.4.0 release:

          • Downloaded the 1.4.0 release from the Solr site
          • Applied the patch
          • Executed: ant clean dist example
          • In the example config (example/solr/conf/solrconfig.xml) I added the following line under the standard request handler:
            <searchComponent name="query" class="org.apache.solr.handler.component.CollapseComponent" />
          • Started the Jetty with Solr with the following command: java -jar start.jar
          • Added example data to Solr with the following command in the exampledocs dir: ./post.sh *.xml
          • Browsed to the following URL: http://localhost:8983/solr/select/?q=*:*&collapse.field=inStock and saw that the result was collapsed on the inStock field.

          It seems that everything is running fine on my end. Can you tell me more about how you deployed Solr on your machine?

          Thomas Woodard added a comment -

          I'm trying to get field collapsing to work against the 1.4.0 release. I applied the latest patch, moved the file, did a clean build, and set up a config based on the example. If I run a search without collapsing everything is fine, but if it actually tries to collapse, I get the following error:

          java.lang.NoSuchMethodError: org.apache.solr.search.SolrIndexSearcher.getDocSet(Lorg/apache/lucene/search/Query;Lorg/apache/solr/search/DocSet;Lorg/apache/solr/search/DocSetAwareCollector;)Lorg/apache/solr/search/DocSet;
          at org.apache.solr.search.fieldcollapse.NonAdjacentDocumentCollapser.doQuery(NonAdjacentDocumentCollapser.java:60)
          at org.apache.solr.search.fieldcollapse.AbstractDocumentCollapser.collapse(AbstractDocumentCollapser.java:168)
          at org.apache.solr.handler.component.CollapseComponent.doProcess(CollapseComponent.java:160)
          at org.apache.solr.handler.component.CollapseComponent.process(CollapseComponent.java:121)
          at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
          at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
          at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
          at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
          at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)

          The tricky part is that the method is there in the source and I wrote a little test JSP that can find it just fine. That implies a class loader issue of some sort, but I'm not seeing it. Any help would be greatly appreciated.

          Martijn van Groningen added a comment -

          I have updated the field collapse patch and improved the response format. Check my blog for more details.

          Michael Gundlach added a comment - - edited

          This patch (quasidistributed.additional.patch) does not itself add field collapsing.

          Apply this patch in addition to the latest field collapsing patch, to avoid an NPE when:

          • you are collapsing on a field F,
          • you are sharding into multiple cores, using the hash of field F as your sharding key, AND
          • you perform a distributed search on a tokenized field.

          Note that if you attempt to use this patch to collapse on a field F1 and shard according to a field F2, you will get buggy search behavior.

          Michael Gundlach added a comment -

          Martijn,

          I probably wasn't clear – we are sharding and collapsing on a non-tokenized "email" field. We can perform distributed collapsing fine when searching on some other nontokenized field; the NPE occurs when we perform a search on a tokenized field.

          Anyway, I'll attach the small patch now, which just adds the null check to Solr trunk.

          Martijn van Groningen added a comment -

          Hi Shalin, it was not my intention (Usually in my case I use a long as id). I'm currently refactoring the response format as described in a previous comment, so I have to change the SolrJ classes anyway. I will submit a patch shortly.

          Shalin Shekhar Mangar added a comment -

          I'm using Martijn's patch from 2009-10-27. The FieldCollapseResponse#parseDocumentIdCollapseCounts assumes the unique key is a long. Is that a bug or an undocumented limitation?

          Nice work guys! We should definitely get this into Solr 1.5

          Martijn van Groningen added a comment -

          With the current patch, if you try to collapse on a field that is tokenized or multivalued, an exception is thrown indicating that you cannot do that and the search is cancelled. My guess is that when the search results are retrieved from the shards on the master, an NPE is thrown because the shard result is not there. This is a limitation in itself, but it boils down to how the FieldCache handles such field types (or at least how I think the FieldCache handles them).

          I think it is a good idea to share your patch; from there we might be able to get the change in properly, so others will also benefit from quasi-distributed field collapsing.

          Anyhow, to properly implement distributed field collapsing the distributed methods have to be overridden in the collapse component, so that is where I would start. We might then also include the collapse_count in the response.

          Michael Gundlach added a comment - - edited

          I've found an NPE that occurs when performing quasi-distributed field collapsing.

          My company only has one use case for field collapsing: collapsing on email address. Our index is spread across multiple cores. We found that if we shard by email address, so that all documents with a given email address are guaranteed to appear on the same core, then we can do distributed field collapsing.

          We add &collapse.field=email and &shards=core1,core2,... to a regular query. Each core collapses on email and sends the results back to the requestor. Since no emails appear on more than one core, we've accomplished distributed search. We do lose the <collapse_count> section, but that's not needed for our purpose – we just need an accurate total document count, and to have no more than one document for a given email address in the results.
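          The sharding invariant described above (route by a hash of the collapse field so that per-core collapsing equals global collapsing) can be sketched roughly as follows. The ShardRouter class and its modulo-hash scheme are illustrative assumptions, not Solr's actual document-routing code:

```java
// Hypothetical sketch: pick a core for a document by hashing the
// collapse field value (here, the email address). All documents
// sharing a value land on the same core, so collapsing within each
// core never needs to merge groups across cores.
class ShardRouter {
    static int shardFor(String collapseFieldValue, int numShards) {
        // Mask to a non-negative int; Math.abs(Integer.MIN_VALUE) would overflow.
        return (collapseFieldValue.hashCode() & 0x7fffffff) % numShards;
    }
}
```

          The guarantee only holds when the routing key and collapse.field are the same field, which is exactly the caveat mentioned in the comment above.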

          Unfortunately, this throws an NPE when searching on a tokenized field. Searching string fields is fine. I don't understand exactly why the NPE appears, but I did bandaid over it by checking explicitly for nulls at the appropriate line in the code. No more NPE.

          There's a downside, which is that if we attempt to collapse on a field other than email – one which has documents appearing in multiple cores – the results are buggy: the first search returns few documents, and the number of documents actually displayed doesn't always match the "numFound" value. Then upon refresh we get what we think is the correct numFound, and the correct list of documents. This doesn't bother me too much, as you're guaranteed to get incorrect answers from the collapse code anyway when collapsing on a field that you didn't use as your key for sharding.

          In the spirit of Yonik's law of patches, I have made two imperfect patches attempting to contribute the fix, or at least point out the error:

          1. I pulled trunk, applied the latest SOLR-236 patch, made my 2 line change, and created a patch file. The resultant patch file looks very different from the latest SOLR-236 patchfile, so I assume I did something wrong.

          2. I pulled trunk, made my 2 line change, and created another patch file. This file is tiny but of course is missing all of the field collapsing changes.

          Would you like me to post either of these patchfiles to this issue? Or is it sufficient to just tell you that the NPE occurred in QueryComponent.java on line 556? ("rb._responseDocs.set(sdoc.positionInResponse, doc);" where sdoc was null.) Perhaps my use case is extraordinary enough that you're happy leaving the NPE in place and telling other users not to do what I'm doing?

          Thanks!
          Michael
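          As a rough illustration of the two-line bandaid described above (skipping a null ShardDoc before rb._responseDocs.set(sdoc.positionInResponse, doc) is reached), here is a self-contained sketch. MergeSketch and its simplified types are hypothetical stand-ins for Solr's internals, not the real classes:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Stand-in for the merge step in QueryComponent: each ShardDoc says
// where its document belongs in the final response. When a shard never
// returned a given id (as in the quasi-distributed collapse case), the
// slot is null and must be skipped instead of dereferenced.
class MergeSketch {
    static class ShardDoc {
        final int positionInResponse;
        ShardDoc(int positionInResponse) { this.positionInResponse = positionInResponse; }
    }

    static List<String> fillResponse(ShardDoc[] shardDocs, String[] docs, int size) {
        List<String> response = new ArrayList<>(Arrays.asList(new String[size]));
        for (int i = 0; i < shardDocs.length; i++) {
            ShardDoc sdoc = shardDocs[i];
            if (sdoc == null) continue;  // the null guard: missing shard result, skip it
            response.set(sdoc.positionInResponse, docs[i]);
        }
        return response;
    }
}
```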

          Martijn van Groningen added a comment -

          I agree about the caching. When searching with field collapsing for the same query more than once, some caching should kick in. I think that the result of the doCollapse(...) method should be cached. In this method the field collapse logic is executed, which takes up most of the time of a field collapse search.
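          A minimal sketch of the kind of caching suggested here, keyed on the query plus collapse.field. This standalone LRU map is only an illustration (Solr's real caches are declared in solrconfig.xml); the class name and key format are assumptions:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative LRU cache for collapse results, keyed by
// "query|collapse.field". V is a placeholder for whatever
// doCollapse(...) would produce (e.g. the collapsed DocSet).
class CollapseCacheSketch<V> {
    private final int maxEntries;
    private final Map<String, V> cache;

    CollapseCacheSketch(int maxEntries) {
        this.maxEntries = maxEntries;
        // accessOrder=true gives least-recently-used eviction order
        this.cache = new LinkedHashMap<String, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
                return size() > CollapseCacheSketch.this.maxEntries;
            }
        };
    }

    V get(String query, String collapseField) {
        return cache.get(query + "|" + collapseField);
    }

    void put(String query, String collapseField, V result) {
        cache.put(query + "|" + collapseField, result);
    }
}
```

          A second page request for the same query would then hit the cache instead of re-running the collapse pass, which is the paging scenario Lance raises below.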

          Lance Norskog added a comment -

          Getting the refactoring right is important.

          Scaling needs to be on the roadmap as well. The data created in collapsing has to be cached in some way. If I do a collapse on my 500m test index, the first one takes 110ms and the second one takes 80-90ms. Searches that walk from one result page to the next have to be fast the second time. Field collapsing probably needs some explicit caching. This is a show-stopper for getting this committed.

          When I sort or facet the work done up front is reused in some way. In sorting there is a huge amount of work pushed to the first query and explicitly cached. Faceting seems to leave its work in the existing caches and runs much faster the second time.

          Martijn van Groningen added a comment -

          It certainly has been going on for a long time.
          Talking about the last mile, there are a few things on my mind about field collapsing:

          • Change the response format. Currently if I look at the response even I get confused sometimes about the information returned. The response should be more structured. Something like this:
            <lst name="collapse_counts">
                <str name="field">venue</str>
                <lst name="results">
                    <lst name="233238"> <!-- id of most relevant document of the group -->
                        <str name="fieldValue">melkweg</str>
                        <int name="collapseCount">2</int>
                        <!-- and other CollapseCollector specific collapse information -->
                    </lst>
                    ...
                </lst>
            </lst>
            

            Currently when doing adjacent field collapsing the collapse_counts gives results that are unusable. The collapse_counts use the field value as key, which is not unique for adjacent collapsing, as shown in the example:

            <lst name="collapse_counts">
             <int name="hard">1</int>
             <int name="hard">1</int>
             <int name="electronics">1</int>
             <int name="memory">2</int>
             <int name="monitor">1</int>
            </lst>
            
          • Add the notion of a CollapseMatcher, that decides whether document field values are equal or not and thus whether they are allowed to be collapsed. This opens the road for more exotic features like fuzzy field collapsing and collapsing on more than one field. Also this allows users of the patch to easily implement their own matching rules.
          • Distributed field collapsing. Although I have some ideas on how to get started, from my perspective it is not going to be performant, because somehow the field collapse state has to be shared between shards in order to do proper field collapsing. This state can potentially be a lot of data depending on the specific search and corpus.
          • And maybe add a collapse collector that collects statistics about most common field value per collapsed group.

          I think that this is somewhat the roadmap from my side for field collapsing at the moment, but feel free to elaborate on it.
          Btw I have recently written a blog post about field collapsing in general, that might be handy for someone who is implementing field collapsing.
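          The CollapseMatcher idea from the roadmap above could look roughly like this. The interface name comes from the comment itself; the method shape and the two sample implementations (exact and case-insensitive "fuzzy" matching) are assumptions for illustration:

```java
// A pluggable rule deciding whether two field values belong to the
// same collapse group.
interface CollapseMatcher {
    boolean matches(String valueA, String valueB);
}

// Exact matching reproduces the current behavior...
class ExactMatcher implements CollapseMatcher {
    public boolean matches(String a, String b) { return a.equals(b); }
}

// ...while a fuzzier matcher could, for example, ignore case.
class CaseInsensitiveMatcher implements CollapseMatcher {
    public boolean matches(String a, String b) { return a.equalsIgnoreCase(b); }
}
```

          The collapser would then ask the configured matcher whether a candidate document joins an existing group, which also opens the door to multi-field or fuzzy matching as the comment suggests.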

          Martijn van Groningen added a comment -

          I have updated the patch to fix the bug that was reported yesterday on the solr-user mailing list:

          found another exception, i cant find specific steps to reproduce
          besides starting with an unfiltered result and then given an int field
          with values (1,2,3) filtering by 3 triggers it sometimes, this is in
          an index with very frequent updates and deletes

          --joe

          java.lang.NullPointerException
          at org.apache.solr.search.fieldcollapse.collector.FieldValueCountCollapseCollectorFactory
          $FieldValueCountCollapseCollector.getResult(FieldValueCountCollapseCollectorFactory.java:84)
          at org.apache.solr.search.fieldcollapse.AbstractDocumentCollapser.getCollapseInfo(AbstractDocumentCollapser.java:191)
          at org.apache.solr.handler.component.CollapseComponent.doProcess(CollapseComponent.java:179)
          at org.apache.solr.handler.component.CollapseComponent.process(CollapseComponent.java:121)
          at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
          at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
          at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
          at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
          at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
          at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
          at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1148)
          at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:387)
          at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
          at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
          at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)
          at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
          at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
          at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
          at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
          at org.mortbay.jetty.Server.handle(Server.java:326)
          at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
          at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
          at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:539)
          at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
          at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
          at org.mo