Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 1.3
    • Fix Version/s: 3.3
    • Component/s: search
    • Labels:
      None

      Description

      This patch includes a new feature called "Field collapsing".

      "Used in order to collapse a group of results with similar value for a given field to a single entry in the result set. Site collapsing is a special case of this, where all results for a given web site is collapsed into one or two entries in the result set, typically with an associated "more documents from this site" link. See also Duplicate detection."
      http://www.fastsearch.com/glossary.aspx?m=48&amid=299

      The implementation adds three new query parameters (SolrParams):
      • "collapse.field" to choose the field used to group results
      • "collapse.type": normal (default value) or adjacent
      • "collapse.max" to select how many continuous results are allowed before collapsing
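      These parameters combine with an ordinary Solr request. As an illustrative sketch (the handler path, query, and field name below are hypothetical examples, not taken from the patch):

```python
from urllib.parse import urlencode

# Hypothetical request: collapse results on a "site" field so that at most
# one continuous entry per site survives in the result set.
params = {
    "q": "ipod",                # ordinary Solr query (example value)
    "collapse.field": "site",   # field used to group results
    "collapse.type": "normal",  # "normal" (default) or "adjacent"
    "collapse.max": 1,          # continuous results allowed before collapsing
}
url = "http://localhost:8983/solr/select?" + urlencode(params)
print(url)
```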

      TODO (in progress):

      • More documentation (on source code)
      • Test cases

      Two patches:

      • "field_collapsing.patch" for current development version
      • "field_collapsing_1.1.0.patch" for Solr-1.1.0

      P.S.: Feedback and misspelling corrections are welcome.

      1. field_collapsing_1.1.0.patch
        12 kB
        Emmanuel Keller
      2. SOLR-236-FieldCollapsing.patch
        16 kB
        Ryan McKinley
      3. SOLR-236-FieldCollapsing.patch
        18 kB
        Ryan McKinley
      4. SOLR-236-FieldCollapsing.patch
        18 kB
        Emmanuel Keller
      5. field_collapsing_1.3.patch
        14 kB
        Emmanuel Keller
      6. field-collapsing-extended-592129.patch
        31 kB
        Karsten Sperling
      7. field_collapsing_dsteigerwald.diff
        25 kB
        Doug Steigerwald
      8. field_collapsing_dsteigerwald.diff
        25 kB
        Charles Hornberger
      9. field_collapsing_dsteigerwald.diff
        25 kB
        Oleg Gnatovskiy
      10. solr-236.patch
        24 kB
        Bojan Smid
      11. collapsing-patch-to-1.3.0-ivan.patch
        24 kB
        Iván de Prado
      12. collapsing-patch-to-1.3.0-ivan_2.patch
        24 kB
        Iván de Prado
      13. collapsing-patch-to-1.3.0-ivan_3.patch
        24 kB
        Iván de Prado
      14. collapsing-patch-to-1.3.0-dieter.patch
        26 kB
        dieter grad
      15. SOLR-236_collapsing.patch
        26 kB
        Dmitry Lihachev
      16. SOLR-236_collapsing.patch
        25 kB
        Thomas Traeger
      17. field-collapse-solr-236.patch
        49 kB
        Martijn van Groningen
      18. field-collapse-solr-236-2.patch
        52 kB
        Martijn van Groningen
      19. field-collapse-3.patch
        52 kB
        Martijn van Groningen
      20. field-collapse-4-with-solrj.patch
        66 kB
        Martijn van Groningen
      21. field-collapse-5.patch
        122 kB
        Martijn van Groningen
      22. field-collapse-5.patch
        133 kB
        Martijn van Groningen
      23. field-collapse-5.patch
        134 kB
        Martijn van Groningen
      24. field-collapse-5.patch
        134 kB
        Martijn van Groningen
      25. field-collapse-5.patch
        136 kB
        Martijn van Groningen
      26. field-collapse-5.patch
        146 kB
        Martijn van Groningen
      27. field-collapse-5.patch
        144 kB
        Martijn van Groningen
      28. field-collapse-5.patch
        216 kB
        Martijn van Groningen
      29. field-collapse-5.patch
        218 kB
        Martijn van Groningen
      30. quasidistributed.additional.patch
        1 kB
        Michael Gundlach
      31. field-collapse-5.patch
        218 kB
        Martijn van Groningen
      32. field-collapse-5.patch
        239 kB
        Martijn van Groningen
      33. field-collapse-5.patch
        244 kB
        Martijn van Groningen
      34. field-collapse-5.patch
        251 kB
        Martijn van Groningen
      35. field-collapse-5.patch
        253 kB
        Martijn van Groningen
      36. field-collapse-5.patch
        254 kB
        Martijn van Groningen
      37. SOLR-236.patch
        253 kB
        Shalin Shekhar Mangar
      38. SOLR-236.patch
        245 kB
        Martijn van Groningen
      39. SOLR-236.patch
        257 kB
        Shalin Shekhar Mangar
      40. SOLR-236.patch
        251 kB
        Martijn van Groningen
      41. SOLR-236.patch
        252 kB
        Shalin Shekhar Mangar
      42. SOLR-236.patch
        244 kB
        Martijn van Groningen
      43. SOLR-236.patch
        245 kB
        Martijn van Groningen
      44. DocSetScoreCollector.java
        5 kB
        Peter Karich
      45. NonAdjacentDocumentCollapser.java
        21 kB
        Peter Karich
      46. NonAdjacentDocumentCollapserTest.java
        9 kB
        Peter Karich
      47. SOLR-236-trunk.patch
        236 kB
        Martijn van Groningen
      48. SOLR-236-trunk.patch
        247 kB
        Martijn van Groningen
      49. SOLR-236-trunk.patch
        250 kB
        Martijn van Groningen
      50. SOLR-236-trunk.patch
        256 kB
        Martijn van Groningen
      51. SOLR-236-trunk.patch
        259 kB
        Martijn van Groningen
      52. SOLR-236-1_4_1.patch
        264 kB
        Martijn van Groningen
      53. SOLR-236.patch
        27 kB
        Yonik Seeley
      54. SOLR-236-1_4_1-paging-totals-working.patch
        264 kB
        Stephen Weiss
      55. SOLR-236-distinctFacet.patch
        2 kB
        Bill Bell
      56. SOLR-236-1_4_1-NPEfix.patch
        0.7 kB
        Cameron
      57. SOLR-236-branch_3x.patch
        258 kB
        Doug Steigerwald

        Issue Links

        1. Provide an API to specify custom Collectors (Sub-task, Resolved, Unassigned)
        2. Fieldcollapse SolrJ code (Sub-task, Closed, Unassigned)
        3. Implement CollapseComponent (Sub-task, Closed, Shalin Shekhar Mangar)
        4. Distributed field collapsing (Sub-task, Closed, Unassigned)
        5. Refactor QueryComponent for easy extensibility (Sub-task, Resolved, Shalin Shekhar Mangar)
        6. Support fixing the number of shards in BaseDistributedTestCase (Sub-task, Resolved, Shalin Shekhar Mangar)
        7. Search Grouping: single doclist format (Sub-task, Resolved, Unassigned)
        8. Search Grouping: support highlighting (Sub-task, Closed, Unassigned)
        9. Search Grouping: support explain (debugQuery) (Sub-task, Resolved, Unassigned)
        10. Search Grouping: support distributed search (Sub-task, Closed, Unassigned)
        11. Search Grouping: CSV response writer (Sub-task, Open, Unassigned)
        12. Search Grouping: collapse by string specialization (Sub-task, Closed, Unassigned)
        13. Search Grouping: intermediate caches (Sub-task, Open, Unassigned)
        14. Search Grouping: single pass implementation (Sub-task, Open, Unassigned)
        15. Search Grouping: unlikely collision implementation (Sub-task, Open, Unassigned)
        16. Search Grouping: expand group sort options (Sub-task, Open, Unassigned)
        17. Search Grouping: SolrJ support (Sub-task, Resolved, Unassigned)
        18. Search Grouping: Facet support (Sub-task, Closed, Unassigned)
        19. Search Grouping: Group by query (like facet.query) (Sub-task, Resolved, Unassigned)
        20. Add grouping support to Velocity UI (Sub-task, Resolved, Erik Hatcher)
        21. Externalizing groupValue values (Sub-task, Closed, Unassigned)
        22. Grouping treats null values as equivalent to 0 or an empty string (Sub-task, Resolved, Unassigned)
        23. Grouping performance improvements (Sub-task, Closed, Unassigned)
        24. Search Grouping: random testing (Sub-task, Resolved, Unassigned)

          Activity

          Transition | Time In Source Status | Execution Times | Last Executer | Last Execution Date
          Open to Resolved | 1500d 19h 52m | 1 | Michael McCandless | 20/Jun/11 19:06
          Resolved to Closed | 11d 8h 36m | 1 | Robert Muir | 02/Jul/11 03:43
          Gavin made changes -
          Link This issue depends upon SOLR-2246 [ SOLR-2246 ]
          Gavin made changes -
          Link This issue depends on SOLR-2246 [ SOLR-2246 ]
          Gavin made changes -
          Link This issue depends upon SOLR-281 [ SOLR-281 ]
          Gavin made changes -
          Link This issue depends on SOLR-281 [ SOLR-281 ]
          kishore padman added a comment -

          Hi,

          I have applied these two patches to Solr 1.4.1 for field collapsing:

          Apply patch SOLR-236-1_4_1-paging-totals-working.patch
          Apply patch SOLR-236-1_4_1-NPEfix.patch

          The collapsing works fine, and the facet counts show correctly on the collapsed records, as I am using collapse.facet=after.
          But when a filter is applied on a facet, all the corresponding facet counts are calculated on the basis of the uncollapsed records.

          Has anyone faced this issue? Please let me know the resolution.

          Thanks
          Kishore Padman

          Robert Muir made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Robert Muir added a comment -

          Bulk close for 3.3

          Michael McCandless made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Duplicate [ 3 ]
          Michael McCandless added a comment -

          Resolving this looooong issue as a duplicate of SOLR-2524, which brings grouping (finally!) to Solr 3.x via the new grouping module (factored out from Solr's trunk grouping impl and then backported to 3.x).

          Jan Høydahl added a comment -

          I think you should consider the group-by support now included in the 3_x branch (SOLR-2524 was recently committed).

          Yuriy Akopov added a comment -

          I am trying to migrate from Solr 1.4.1 to Solr 3.2, so I need to patch the 3.2 branch.

          When I apply the "SOLR-236-branch_3x.patch" file to the dev/tags/release-3.2 branch, the WAR file is built successfully, but it then fails on loading with an "org.apache.solr.common.SolrException: Error loading class 'org.apache.solr.handler.component.CollapseComponent'" message, as if the collapsing functionality was not implemented.

          Should I try using the 1.4.1 patch on the 3.2 sources instead? That doesn't feel right, but maybe they're compatible, I don't know.

          Robert Muir made changes -
          Fix Version/s 3.3 [ 12316471 ]
          Fix Version/s 3.2 [ 12316172 ]
          Robert Muir added a comment -

          Bulk move 3.2 -> 3.3

          Hoss Man made changes -
          Fix Version/s 3.2 [ 12316172 ]
          Fix Version/s Next [ 12315093 ]
          Yuriy Akopov added a comment -

          Thanks, Stephen. So it isn't just me doing something wrong.

          I'm thinking of displaying not the actual figures against the facet items but something like 100+, 200+, 300+, etc. That should be okay, as the difference is not dramatic and seems to remain within a relatively narrow interval.

          Stephen Weiss added a comment -

          Yes, I've had this too:

          https://issues.apache.org/jira/browse/SOLR-236?focusedCommentId=12655750&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12655750

          I'm pretty sure I know the reason for it, but I don't know how to fix it... to the best of my knowledge no one on the ticket really said if the problem could be fixed or not yet either. At the moment we just use facet.before and explain to our users that the facets are for "unfiltered" results... almost no one complains once we explain it to them. However, a fix would be wonderful... people ask about it often enough that clearly it's not very intuitive.

          Yuriy Akopov added a comment -

          Hi and sorry for breaking the silence.

          So far the patch is working okay in our system, thanks again.

          However I've noticed that the collapse.facet parameter set to 'after' doesn't produce very precise figures. When results are collapsed, it may give, say, 366 results for the facet item while actually there are 396 returned by Solr after collapsing.

          The figures are never very different from the actual ones, but they vary within some narrow interval. I mean, for result counts up to 10000 they differ by <100 only. The collapsing-related part of my query is the following:

          $search_options['qt'] = 'collapse';
          $search_options['collapse.field'] = 'my_string_field'; // name of the field to collapse on, in my case it is a string field
          $search_options['collapse.type'] = 'normal'; // it is always 'normal' and never 'adjacent' in my case
          $search_options['collapse.facet'] = 'after';

          When collapsing is turned off, facet figures are calculated precisely, as expected. Has anybody else experienced that, and if so, is there a solution available? Thanks in advance.

          motto made changes -
          Comment [ Am I right that trunk is 4.0? What is the newest patch that works on that code? All patches I tried so far failed for me.
          Also, would someone we able to share a solr.WAR file that is already patched and fairly up-to-date?

          Thanks ]
          Yuriy Akopov added a comment -

          Another question:

          The patched version of .war starts and works as expected if I place the following simple instruction in solrconfig.xml:

          <searchComponent name="collapse" class="org.apache.solr.handler.component.CollapseComponent">
          </searchComponent>

          But if I add additional factories as advised by the sample config, it produces an error when searching with collapsing turned on:

          <searchComponent name="collapse" class="org.apache.solr.handler.component.CollapseComponent">
          <collapseCollectorFactory class="solr.fieldcollapse.collector.DocumentGroupCountCollapseCollectorFactory" />
          <collapseCollectorFactory class="solr.fieldcollapse.collector.FieldValueCountCollapseCollectorFactory" />
          <collapseCollectorFactory class="solr.fieldcollapse.collector.DocumentFieldsCollapseCollectorFactory" />
          <collapseCollectorFactory name="groupAggregatedData" class="org.apache.solr.search.fieldcollapse.collector.AggregateCollapseCollectorFactory">
          <function name="sum" class="org.apache.solr.search.fieldcollapse.collector.aggregate.SumFunction"/>
          <function name="avg" class="org.apache.solr.search.fieldcollapse.collector.aggregate.AverageFunction"/>
          <function name="min" class="org.apache.solr.search.fieldcollapse.collector.aggregate.MinFunction"/>
          <function name="max" class="org.apache.solr.search.fieldcollapse.collector.aggregate.MaxFunction"/>
          </collapseCollectorFactory>
          </searchComponent>

          So far it does what I expect without the additional factories mentioned, but it still bothers me that it fails when they're listed. Maybe I placed the libraries in the wrong place?

          George P. Stathis added a comment - - edited

          Bump on Yuriy's last question:

          • Are performance issues around the number of documents matched, the size of the index, or both?

          E.g. our index contains over 12 million documents already. Should we even consider using this feature?

          Adding a few more questions:

          • Are performance concerns around the 1.4 patch, the current Solr 4.0 branch or both?
          • Is sharding an option to alleviate some of these issues? Reading the comments in this ticket, it seems there are caveats getting this to work with shards?
          Stephen Weiss added a comment -

          It would work fine as long as you weren't sending the collapse parameters; I don't think you'd need to replace the WAR.

          Yuriy Akopov added a comment -

          In other words, if I use additional filtering conditions in my request to make sure the returned set of documents to be grouped is never larger than, say, 1 million items, can I expect the described problem to happen, or will I be safe? Or am I in danger regardless of the particular query and its resulting set to be collapsed if my index contains a few million documents?

          (sorry for commenting twice on the same problem)

          Yuriy Akopov added a comment -

          Stephen, Grant, thanks for the notice. Currently the total number of documents we deal with is about 800K, and I expect it to grow up to 2M in a year, but every user is allowed to search not the whole amount but a subset of it (so, for every search, additional filtering conditions are applied). I hope we will be fine until Solr4 comes out.

          But if we encounter any critical problems, would it be enough to remove the collapsing parameters from the request sent to Solr to prevent the external functions from failing, or is it necessary to replace the Solr core with an unpatched one? I mean, is the failure on a large set of documents possible even when collapse.* parameters are not supplied, or only if collapsing was requested?

          Grant Ingersoll added a comment -

          Keep in mind that an alternative approach that scales, but loses some attributes of this patch (total groups, for instance), is committed on trunk and will likely be backported to 3.2.

          Stephen Weiss added a comment -

          Just be careful, Yuriy, there are reasons why this thing is not in Solr 1.4.1 already. The code does not scale particularly well beyond a few million documents, especially if you use the version that preserves totals and paging. It was enough to keep my software from being scrapped, but if you plan on scaling much past that point any time soon, you may need to start thinking about alternative solutions. I know I certainly am... I have a sinking worry my application may outgrow the limits of this patch's stability before something truly production ready comes to the fore, possibly even this year if growth continues. However, given that the very concept of grouping is critical to the site that I support with SOLR, and attempts to provide the same functionality without actually grouping have failed repeatedly over the past few months, it is very sadly starting to look like I will have to cut very useful features (to no end of complaints, I'm sure) in order to ensure its overall stability unless some miracle happens. Mama always told me I should have learned Java!

          Long story short, if you don't have to have this patch yet, and your software hasn't been written to do anything like this yet, I would not start doing it now! You will regret it when you run out of options later on and your servers start crashing all over the place. See if you can keep it under wraps until a real release comes out with it.

          Yuriy Akopov added a comment -

          Stephen, apparently the version you've advised works fine! At least those two issues I complained about are gone. Many thanks for your help!

          Yuriy Akopov added a comment -

          By the way, a noob question: after the build completes, along with "apache-solr-1.4.2-dev.war" the following jars are generated:

          apache-solr-cell-1.4.2-dev.jar
          apache-solr-clustering-1.4.2-dev.jar
          apache-solr-core-1.4.2-dev.jar
          apache-solr-dataimporthandler-1.4.2-dev.jar
          apache-solr-dataimporthandler-extras-1.4.2-dev.jar
          apache-solr-solrj-1.4.2-dev.jar
          solrj-lib/commons-codec-1.3.jar
          solrj-lib/commons-httpclient-3.1.jar
          solrj-lib/commons-io-1.4.jar
          solrj-lib/geronimo-stax-api_1.0_spec-1.0.1.jar
          solrj-lib/jcl-over-slf4j-1.5.5.jar
          solrj-lib/slf4j-api-1.5.5.jar
          solrj-lib/wstx-asl-3.2.7.jar

          Do I also need to transfer these libraries, or is it only needed to replace the WAR file to get the patched version working properly? In my previous tries I copied the solrj-lib/*.jar files to the lib folder of the Solr instance home. Maybe that was the problem?

          Yuriy Akopov added a comment -

          I didn't expect the reply to come so quickly! Thanks, Stephen, I'll try it and post the results then.

          Stephen Weiss added a comment -

          Yuriy... try my patch: SOLR-236-1_4_1-paging-totals-working.patch. I don't have either of the problems you describe (problem B was actually the purpose of my patch, I never saw Problem A and I have tons of "single", non-grouped documents so I'm sure I would be seeing it if it were happening). Some people had problems using the patch (I didn't use it myself, I made it after the fact) but if you look up in the comments people explain how to make it work. Note that I'm not using the SOLR-236-1_4_1-NPEfix.patch patch, I never had the NPE problem they describe so I never bothered with it, not sure what it does.

          Yuriy Akopov added a comment -

          Hi,

          First of all, thank you guys for working on this! However, I have encountered a problem with the patch which is hopefully caused by my own mistakes, so please correct me if I have done something wrong.

          So, I have applied SOLR-236 patch to release-1.4.1 and gained support for collapse.*, which works. However, two issues discussed above in this thread are still there:

          a) When collapsing is requested, only grouped results are returned. So if a document has a unique value in the collapsed field (i.e. it has no other docs to group with), it is excluded from the results. Instead of the expected "unique documents plus non-unique ones grouped under the most relevant document", only the grouped ones are returned.

          b) The number of results matching the query ("numFound") is always equal to the "rows" parameter provided, or 10 if not supplied (i.e. it represents the number of results returned on the page, not the total number of matched documents).

          There is a way around the latter "numFound" issue: faceting on the collapsed field, as suggested before. But the count retrieved from that facet is also misleading, as it includes the unique (non-grouped) documents even though they are not returned.

          So far, I'm stuck with that. Is there any chance of resolving it? And what about the SOLR-1682 patch - if it fixes this, should it be applied to the original release-1.4.1 or to release-1.4.1 already patched with SOLR-236?

          Thanks in advance.

          P.S. As I understand, grouping is planned for Solr 4.0. Does anybody know by any chance if it is safe to use its nightly builds? I ran through its pending critical issues and they don't look fatal, but still I'm afraid of possible implications.
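          The collapsing behavior expected in (a) - unique documents kept, duplicates collapsed onto the most relevant one - can be sketched in a few lines. This is an illustrative Python sketch; the field name and result structure are assumptions, not Solr's actual types:

```python
def collapse(results, field):
    """Collapse a score-ordered result list on `field`, keeping the most
    relevant (first-seen) document per value; unique values survive as-is."""
    seen = set()
    collapsed = []
    for doc in results:
        if doc[field] not in seen:
            seen.add(doc[field])
            collapsed.append(doc)
    return collapsed

# score-descending results; doc 2 has a unique site and must still be kept
results = [
    {"id": 1, "site": "a.com"},
    {"id": 2, "site": "b.com"},
    {"id": 3, "site": "a.com"},  # collapsed into doc 1
]
hits = collapse(results, "site")
```

          Under these semantics, numFound would be the collapsed total (2 in the sketch) regardless of the rows parameter, which is exactly what issue (b) violates.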

          Doug Steigerwald added a comment -

          I started to try and backport SOLR-1682 to the 3x branch, but that seemed to get out of hand pretty quickly from what I remember (was a few weeks ago). It was much easier making this work with the 3x branch than backporting SOLR-1682.

          We want/need new features in 3.1 when it is released and we won't be allowed to deploy trunk to our production environment.

          Otis Gospodnetic added a comment -

          Why are people still working on this SOLR-236 patch?
          Doesn't SOLR-1682 supersede it?
          And isn't SOLR-1682 the one that's in trunk, while nothing from SOLR-236 was ever applied to trunk?
          Thanks.

          Doug Steigerwald made changes -
          Attachment SOLR-236-branch_3x.patch [ 12471418 ]
          Doug Steigerwald added a comment -

          Attaching a patch for the 3x branch (SOLR-236-branch_3x.patch). It is based on SOLR-236-1_4_1-paging-totals-working.patch and SOLR-236-1_4_1-NPEfix.patch.

          Tests work and some basic spot checking I've done looks good.

          Doug Steigerwald added a comment -

          Has anyone successfully applied field collapsing to the branch_3x branch?

          Cameron made changes -
          Attachment SOLR-236-1_4_1-NPEfix.patch [ 12470202 ]
          Cameron added a comment -

          Uploading SOLR-236-1_4_1-NPEfix.patch as a simple patch for the NullPointerException Shekhar and Ron have reported. For brevity, the patch is intended to be applied AFTER SOLR-236-1_4_1-paging-totals-working.patch has already been applied.

          I didn't actually fix the filterCache key issue as Samuel suggested. Rather I'm preventing the NPE from occurring. I believe this is ok because the collapsed results will stay sorted by score as the collapser performs the collapsing.

          Hsiu Wang added a comment -

          I applied the SOLR-236-1_4_1-paging-totals-working.patch to the 3x branch. When I ran the unit test FieldCollapsingIntegrationTest, I got "Insane FieldCache usage(s) found expected:<0> but was:<1>" on all 3 sort-related tests (testNonAdjacentFieldCollapse_sortOnNameAndCollectAggregates, testNonAdjacentFieldCollapse_sortOnNameAndCollectCollapsedDocs, and testForArrayOutOfBoundsBugWhenSorting).

          Steven Fuchs added a comment -

          Great feature! But it seems to be missing a capability I need. I'll explain it:

          I'd like to use group results in my query, namely exclude all documents in a group when any document in that group has a certain value. A simple field-value match would suffice, although the ability to use more complex queries would also be nice. Please consider adding functionality like this to your sub-task list. Or better yet, if this capability already exists and I missed it, please point it out.

          TIA
          steve

          Samuel García Martínez added a comment -

          The NPE noticed by Shekhar Nirkhe is caused by errors in the filter query cache and the signature key that is used to store cached results.

          To sum up: if you perform a filter query and then perform the same query with a collapse field, the query result is already cached, but not in the form this component expects. As a result, the DocSet implementation is not the expected one and, since the result comes from the cache, the DocumentCollector is never executed.

          As soon as I can, I'll post a patch that caches results under a combined key formed from the collector class and the query itself.

          Colbenson - Findability Experts
          http://www.colbenson.es/
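          The combined-key idea can be illustrated with a small sketch. The class and method names here are hypothetical, not the actual Solr filterCache code:

```python
class CollapseAwareCache:
    """Cache keyed on (collector class, query) so a plain filter query and
    a collapsing query never share an entry."""

    def __init__(self):
        self.entries = {}

    def doc_set(self, query, collector_cls, compute):
        key = (collector_cls, query)
        if key not in self.entries:
            self.entries[key] = compute()  # the collector runs only on a miss
        return self.entries[key]

cache = CollapseAwareCache()
plain = cache.doc_set("q=foo", None, lambda: {1, 2, 3})
grouped = cache.doc_set("q=foo", "NonAdjacentDocumentCollapser", lambda: {1, 3})
# the same query under a different collector gets its own entry
```

          With the query alone as the key, the second lookup would return the plain DocSet and skip the collector entirely, which matches the failure mode described above.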

          Ron Veenstra added a comment -

          I have also been getting a null pointer exception:
          message null java.lang.NullPointerException at org.apache.solr.search.fieldcollapse.NonAdjacentDocumentCollapser$PredefinedScorer.docID(NonAdjacentDocumentCollapser.java:397)

          The error is repeatable for a given search term when sorted by "score desc" followed by any other field. It seems to crop up whenever only one result should be returned in the collapsed field group, but it does not happen for every possible query where this is the case (leading me to believe something else is at work). Changing the sort order to anything else (moving score to second, or omitting the second field) eliminates the error. This was the simple solution for my problem, but I wanted to post this in case any of the information proves useful.

          Using Solr 1.4.1 with SOLR-236-1_4_1-paging-totals-working.patch

          Shekhar Nirkhe added a comment -

          I am getting null pointer exception in
          at org.apache.solr.search.fieldcollapse.NonAdjacentDocumentCollapser$FloatValueFieldComparator.compare(NonAdjacentDocumentCollapser.java:443)

          I am using Solr 1.4.1 with following patches.

          SOLR-236-1_4_1.patch
          SOLR-236-1_4_1.fix.patch

          Am I missing something?

          Joseph McElroy added a comment -

          Hi there,

          Great work on this feature - it is something I have been waiting a while to see implemented in Solr. Thank you all for this.

          Two questions however:

          • An option to sort the groups by the number of documents each group contains, so the group with the most documents would rank highest?
          • Ability to return the number of groups within the result set? This would allow for pagination.

          Thanks
          Joe

          Luke Bochsler added a comment - - edited

          "Is anyone working on the ability to calculate facets AFTER the group?"

          It would be great to have that possibility! Sorry, I'm not a Java programmer, so I cannot contribute a solution - I contribute to other open source systems instead. However, would it be a big deal for you guys to implement? I'm using Solr as the search solution in a web project and desperately need this feature along with the great grouping functionality. The grouping in general has made my life so much easier so far, so it seems we are just one step away from having it all covered by Solr!

          Thank you so much!

          Luke

          Ingmar Seeliger added a comment -

          Field collapsing is a very nice feature - thank you for that!

          I've just tested it with (pseudo-)distributed search - meaning the data on each Solr server has one specific value for the collapse field - and ran into one problem:
          I want to include the documents in the result list, using collapse.includeCollapsedDocs.fl=...
          The result list has empty docs:
          <result name="collapsedDocs" numFound="4" start="0">
          <doc/>
          <doc/>
          <doc/>
          <doc/>
          </result>
          When I remove the distributed search, everything works fine on one server. Perhaps someone can look into that? Thanks!

          Yonik Seeley added a comment -

          > Is anyone working on the ability to calculate facets AFTER the group? Without a patch for that, the facet numbering is not correct.

          There's no correctness issue or bug here. Many use cases require the current behavior (the number of docs per group shown having no effect on faceting), and other use cases require what you seek. Both are valid, but we only have one implemented so far.

          Bill Bell added a comment -

          Yonik and team,

          Is anyone working on the ability to calculate facets AFTER the group? Without a patch for that, the facet numbering is not correct.

          Thank you.
          Bill

          Yonik Seeley added a comment -

          I've just committed a fix to the sort != group.sort problem.
          As I previously said, the algorithm for handling this was broken (the TopGroupSortCollector class), so I've redefined what sort means.
          Sort does not order groups by the first document in each group, but orders groups by the highest ranking document by "sort" in that group.
          I've updated the randomized grouping tests to reflect this change, and enabled tests where sort != group.sort
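          The redefined semantics can be sketched as follows: for each group, find the document that ranks highest under "sort", then order the groups by those documents. This is an illustrative Python sketch, not the actual collector code; the group field and ascending sort key are assumptions:

```python
def order_groups(docs, group_field, sort_key):
    """Order groups by their highest-ranking document under sort_key
    (lower key value = higher rank, as with an ascending sort)."""
    best = {}
    for doc in docs:
        g = doc[group_field]
        if g not in best or sort_key(doc) < sort_key(best[g]):
            best[g] = doc
    return sorted(best, key=lambda g: sort_key(best[g]))

docs = [
    {"group": "x", "price": 9},
    {"group": "y", "price": 2},
    {"group": "x", "price": 1},  # best document in group x
]
order = order_groups(docs, "group", lambda d: d["price"])
# group x outranks y because its best price (1) beats y's best (2)
```

          Note that ordering by each group's first document under group.sort could give a different result, which is the distinction the fix draws.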

          Bill Bell made changes -
          Link This issue depends on SOLR-2246 [ SOLR-2246 ]
          Bill Bell made changes -
          Link This issue is related to SOLR-2242 [ SOLR-2242 ]
          Stephen Weiss added a comment -

          Cheers peterwang, you're probably right. I didn't actually use this patch, I made the modifications by hand after applying Martijn's patch. I generally don't make my own patch files, I just let SVN do it for me, so I'm not really aware of the syntax... The point is to just delete those extra lines.

          Bill Bell added a comment -

          OK, I have a patch to add namedistinct. Note that it is optional, and be careful about the number of facets when using it.

          On sample data:

          http://localhost:8983/solr/select?q=*:*&facet=true&facet.field=manu&facet.mincount=1&facet.limit=-1&f.manu.facet.namedistinct=0&facet.field=price&f.price.facet.namedistinct=1

          It works on facet.fields.

          SOLR-236-distinctFacet.patch

          Bill Bell made changes -
          Attachment SOLR-236-distinctFacet.patch [ 12459815 ]
          Bill Bell added a comment -

          To compute distinct facet counts.

          peterwang added a comment - - edited

          SOLR-236-1_4_1-paging-totals-working.patch patch failed with following errors:

          patch: **** malformed patch at line 3348: Index: src/test/org/apache/solr/search/fieldcollapse/DistributedFieldCollapsingIntegrationTest.java

          It seems to be caused by hand-editing SOLR-236-1_4_1.patch to produce SOLR-236-1_4_1-paging-totals-working.patch (6 lines were deleted without fixing the diff hunk header).
          A possible fix:

          diff -u SOLR-236-1_4_1-paging-totals-working.patch.orig SOLR-236-1_4_1-paging-totals-working.patch
          --- SOLR-236-1_4_1-paging-totals-working.patch.orig     2010-11-17 19:26:05.000000000 +0800
          +++ SOLR-236-1_4_1-paging-totals-working.patch  2010-11-17 19:17:20.000000000 +0800
          @@ -2834,7 +2834,7 @@
           ===================================================================
           --- src/java/org/apache/solr/search/fieldcollapse/NonAdjacentDocumentCollapser.java    (revision )
           +++ src/java/org/apache/solr/search/fieldcollapse/NonAdjacentDocumentCollapser.java    (revision )
          -@@ -0,0 +1,517 @@
          +@@ -0,0 +1,511 @@
           +/**
           + * Licensed to the Apache Software Foundation (ASF) under one or more
           + * contributor license agreements.  See the NOTICE file distributed with
          
          Bill Bell added a comment -

          Here is an idea. If we go with the terminology -

          <int name="name">value</int>

          Then we can return the distinct name count with a few mods to SimpleFacets.java. All other parameters still apply. Default will be off.

          facet.<field>.namedistinct=1

          <lst name="hgid">
          <int name="count">3</int>
          </lst>

          I can have a patch for this today. Would this be something that we could go with?

          Bill
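          The computation itself is trivial - namedistinct is just the number of distinct facet names for the field. A minimal sketch (the parameter and response shapes above are Bill's proposal, not a shipped API; the hgid values are taken from the sample output later in the thread):

```python
def name_distinct(facet_counts):
    """Count distinct facet names (the 'left side' of the facet list),
    ignoring the per-name document counts on the right side."""
    return len(facet_counts)

hgid_facets = {
    "HGPY0056D09F7B57442E8": 4,
    "HGPY00A33AD7808996941": 3,
    "HGPY00D6274FD07B4EE7A": 3,
}
distinct = name_distinct(hgid_facets)  # 3 names, regardless of their counts
```

          The cost caveat is that the server still has to enumerate every facet name to count them, which is why the feature is off by default.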

          Bill Bell added a comment -

          Yonik,

          > OK Bill, this should be fixed in the latest trunk... can you try it out?

          Yes paging seems to work right now.

          Question: Is there a way to return or do it with some sort of ord() value?

          <lst name="hgid">
          <int name="HGPY0056D09F7B57442E8">4</int>
          <int name="HGPY00A33AD7808996941">3</int>
          <int name="HGPY00D6274FD07B4EE7A">3</int>
          </lst>
          <facetname name="hgid">3</facetname>

          Bill Bell added a comment -

          I found a solution, but it is not ideal. I need to be able to get a count of the facet entries (the left-side names, not the right-side values):

          http://localhosT:8983/solr/select?facet=true&facet.field=hgid&facet.limit=100000&facet.mincount=1

          (HGID is the group by)

          I get the following, but I need the count of left-side entries. So instead of 10, I need "3". Is there any way to do that? Just return 3.

          <lst name="hgid">
            <int name="HGPY0056D09F7B57442E8">4</int>
            <int name="HGPY00A33AD7808996941">3</int>
            <int name="HGPY00D6274FD07B4EE7A">3</int>
          </lst>
          Stephen Weiss made changes -
          Stephen Weiss added a comment -

          This would be the patch that I'm describing... I used it with the Solr 1.4.1 release tarball. It's just Martijn's latest patch minus a few lines (by his suggestion) that mess up the totals and paging. Again, you want to make sure your server is well configured - we are not really Java people and it took a while to get the settings to a place where we didn't have OOM errors every day. We're using these startup options with Jetty:

          -Xms10240m -Xmx10240m -XX:NewRatio=5 -XX:+UseParNewGC

          That RAM total is half the RAM available on the machine - we leave the rest open for disk caches. It will take up its half of the RAM very quickly, but then it hovers there and has only gone over the limit once since September, which seemed to be related to an unoptimized index (after replacing an unusually large # of docs).

          Bill Bell added a comment -

          Is the older CollapseComponent still available in the trunk?

          Or do we need to use the newer group parameters?

          How do I get the older one to work?

          James Dyer added a comment -

          Stephen,

          I would be very interested in seeing your patch if you can upload it. Luckily, the index we're migrating to SOLR for this project is small and I think I won't have to scale very much in either case. Your patch might be better than the current SOLR-1682/236 patches for our needs however.

          Stephen Weiss added a comment -

          If you need help, James, I have a patched version of 1.4.1 that does the collapsing and provides this data - it was based on some of the comments above along with a patch that came out a while ago (back when it really was only 5 or 6 lines of difference). The faceting route really doesn't work out well once you hit a certain number of collapse groups. Anyway, I've been using this version in production for quite a while now, and while it is a bit of a memory hog, if you manage the memory properly, keep your indexes optimized, and provide enough RAM to cover your indexes, it's pretty stable and gets the job done.

          Yonik Seeley added a comment -

          I remember it only was a difference of 5 or 6 lines of code either way.

          Not with what is committed in trunk. To be scalable with respect to the number of groups, we only keep the top 10 groups in memory at any one time (and hence we never know the total number of groups). The ability to retrieve the number of groups will require a different algorithm with different tradeoffs. I'm sure we'll get to it in time, but it is not just a tweak to the existing algorithm.
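
          The approach described above (keep only the top N groups while collecting, discard the rest, and never count totals) can be sketched roughly as follows. This is a hypothetical simplification, not the Solr trunk code; `TopNGroups`, the score-ordered `TreeSet`, and the long-array entries are all illustrative:

          ```java
          import java.util.HashMap;
          import java.util.Map;
          import java.util.TreeSet;

          // Sketch of keeping only the top-N groups in memory while collecting:
          // each group is tracked by the best score seen for it, and once the set
          // is full, a new group only enters by evicting the current worst one.
          // The total number of distinct groups is never counted.
          class TopNGroups {
              private final int n;
              // entries are {groupKey, bestScore}, sorted by score descending
              private final TreeSet<long[]> ordered = new TreeSet<>((a, b) -> {
                  int c = Long.compare(b[1], a[1]);          // higher score first
                  return c != 0 ? c : Long.compare(a[0], b[0]);
              });
              private final Map<Long, long[]> byKey = new HashMap<>();

              TopNGroups(int n) { this.n = n; }

              void collect(long groupKey, long score) {
                  long[] entry = byKey.get(groupKey);
                  if (entry != null) {
                      if (score > entry[1]) {                // group improved: re-sort it
                          ordered.remove(entry);
                          entry[1] = score;
                          ordered.add(entry);
                      }
                      return;
                  }
                  if (ordered.size() < n) {
                      entry = new long[]{groupKey, score};
                      byKey.put(groupKey, entry);
                      ordered.add(entry);
                  } else if (score > ordered.last()[1]) {    // beats the current bottom
                      long[] evicted = ordered.pollLast();   // drop worst group for good
                      byKey.remove(evicted[0]);
                      entry = new long[]{groupKey, score};
                      byKey.put(groupKey, entry);
                      ordered.add(entry);
                  }
                  // else: group discarded entirely; we lose all knowledge of it
              }

              int size() { return ordered.size(); }
              long bestKey() { return ordered.first()[0]; }
          }
          ```

          The key consequence is visible in the last branch: a discarded group leaves no trace, which is why the total group count cannot be reported with this algorithm.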

          James Dyer added a comment -

          We also have a hard requirement for field collapsing with total # of groups for a project scheduled for Production Q1 2011. So far, as best I can tell, I would have to facet on the group-by field with facet.limit=-1 to get this. Surely we would have less overhead if the group-by functionality could compute this by itself and just return the number. Turning it on/off makes sense as some won't want the performance/memory hit.

          Stephen Weiss added a comment -

          Just chiming in on that last comment... we also rely on functional paging and total counts when collapsing as well. I once raised the idea of not providing this information in our search results to my boss and he looked at me like I had 3 heads, it's just not an option. In most of the patches on this ticket we could get this data, but for some it seemed like eliminating totals and paging wasn't a big deal and provided a significant performance boost. I can understand the reasons for not including this for every collapsed query (if you don't need the totals or paging then the performance boost is nice), but if there was a way we could have an option to turn this on or off (even with the performance hit, having it is better than not being able to collapse at all), maybe that could help keep everyone happy. I remember it only was a difference of 5 or 6 lines of code either way.

          Bill Bell added a comment -

          Yonik,

          I am testing. Will get back to you on the starts/rows.

          Also, is there a way to get the total number of results based on the grouping? I get the following:

          
           <lst name="grouped">
             <lst name="hgid">
               <int name="matches">6</int>
               <arr name="groups">
                 <lst>
          
          

          But no total number. Also, matches=6 includes those documents not returned (the group has 2 entries, but I only return 1). It should show matches=6, results=4 (since 2 are hidden), totalNumber=6747.

          Otherwise we cannot page.

          If we do http://localhost:8983/select?q=test&facet=true&facet.field=hgid there are too many results (thousands). Any other way to group by and get a total?

          Yonik Seeley added a comment -

          OK Bill, this should be fixed in the latest trunk... can you try it out?

          Yonik Seeley added a comment -

          We get 15 results. 10+5 ? It should be 10 rows.

          Yes, I've reproduced this with the random testing too. Not sure what to make of it yet.
          It looks like the orderedGroups TreeSet acquires too many entries for some reason.

          Bill Bell added a comment -

          We are having an issue with this patch.

          http://localhost:8983/solr/provs/select?fl=hgid,score&q.alt=*:*&start=5&rows=10&qt=standard&group=true&group.field=hgid
          

          We get 15 results. 10+5 ? It should be 10 rows. This does not appear to be working right with start and rows.

          Yonik Seeley added a comment -

          NOTE: there was a serious bug when sort != group.sort (i.e. when TopGroupSortCollector was used).

          Actually, I think it's worse. The algorithm added in SOLR-1682 (TopGroupSortCollector) that handled when sort != group.sort seems broken.
          The problem: a high-ranking group may be demoted to a lower-ranking group because its top document changed (and the sorts used to find the top doc in a group and the top group are different). But we may have already discarded higher-ranking groups based on the original high ranking, so now we have permanently lost information.
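
          The demotion problem described above can be shown with a toy single-pass collector. Everything here (`Doc`, `OnePassTopGroup`) is hypothetical, not the SOLR-1682 code: it keeps a single top group, ranked by the `sort` value of the group's representative doc, where the representative is the doc with the best (lowest) `group.sort` value seen so far:

          ```java
          // A document with a group key, an overall sort value (higher is better),
          // and a within-group sort value (lower is better).
          class Doc {
              final String group; final int sort; final int groupSort;
              Doc(String g, int s, int gs) { group = g; sort = s; groupSort = gs; }
          }

          class OnePassTopGroup {
              String topGroup; int topSort; int topGroupSort = Integer.MAX_VALUE;

              void collect(Doc d) {
                  if (topGroup == null) { take(d); return; }
                  if (d.group.equals(topGroup)) {
                      if (d.groupSort < topGroupSort) {  // new representative doc
                          topSort = d.sort;              // group's rank may DROP here
                          topGroupSort = d.groupSort;
                      }
                  } else if (d.sort > topSort) {         // competing group wins now
                      take(d);
                  }
                  // else: competing group is discarded permanently
              }

              private void take(Doc d) {
                  topGroup = d.group; topSort = d.sort; topGroupSort = d.groupSort;
              }
          }
          ```

          Feeding it X(sort=100, groupSort=5), then Y(sort=90, groupSort=1), then X(sort=10, groupSort=0): Y is discarded while X's rank is still 100, then X's representative changes and its rank drops to 10. The correct top group is Y, but Y is already gone, illustrating the permanent information loss.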

          Yonik Seeley added a comment -

          Random testing found another bug - while finding the top groups, we forgot to setBottom on the priority queue when it changed.
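
          This class of bug can be sketched with a toy top-N score collector. `TopScores` is hypothetical, not the Solr code; the point is only that the cached bottom value must be refreshed whenever the queue's worst entry changes, or later hits are wrongly rejected:

          ```java
          import java.util.PriorityQueue;

          // Toy collector that prunes hits against a cached "bottom" (the worst
          // score currently in the queue). If the bottom cache is not refreshed
          // after the queue changes, a hit better than the true worst entry can
          // be rejected against a stale, too-low threshold.
          class TopScores {
              private final int n;
              private final PriorityQueue<Integer> pq = new PriorityQueue<>(); // min-heap
              private int bottom = Integer.MIN_VALUE; // cached worst score once full

              TopScores(int n) { this.n = n; }

              void collect(int score) {
                  if (pq.size() < n) {
                      pq.add(score);
                      if (pq.size() == n) bottom = pq.peek(); // queue just filled
                  } else if (score > bottom) {
                      pq.poll();
                      pq.add(score);
                      bottom = pq.peek(); // the easily forgotten "setBottom" step
                  }
              }

              PriorityQueue<Integer> top() { return pq; }
          }
          ```

          With scores 5, 1, 3, 2 and n=2, refreshing the bottom after the eviction of 1 correctly rejects the 2; with a stale bottom of 1, the 2 would wrongly evict the 3.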

          Yonik Seeley added a comment -

          NOTE: there was a serious bug when sort != group.sort (i.e. when TopGroupSortCollector was used).
          The wrong comparators were used in one place, leading to errors finding the top groups. I just committed a fix for this.

          The NPE when rows==0 has also been fixed.

          Yonik Seeley added a comment -

          Two more corner cases not yet fixed:
          1) if rows==0, we get an NPE
          2) if group.limit and group.offset are both 0, then the counts for the resulting doclists are all zero.

          Yonik Seeley added a comment -

          Just committed a patch for a bug that the random testing I'm developing uncovered - we lost the default of group.sort to sort during the last refactoring.

          Martijn van Groningen added a comment -

          After applying the patch SOLR-236-1_4_1.patch the ant test task fails on org.apache.solr.spelling.SpellingQueryConverterTest. Can it be ignored?

          I think so, since the patch you refer to has nothing to do with spelling.

          Yonik Seeley made changes -
          Attachment SOLR-236.patch [ 12458675 ]
          Yonik Seeley added a comment -

          Here's a refactoring patch that pulls all the grouping stuff out of SolrIndexSearcher (I'm sure many of you will be glad about that) and uses subclasses rather than instanceof checks for different behavior of grouping commands.

          This isn't the end of refactoring, but it's a good start I think, and should make additional changes easier.

          Thorsten Maus added a comment -

          After applying the patch SOLR-236-1_4_1.patch the ant test task fails on org.apache.solr.spelling.SpellingQueryConverterTest. Can it be ignored?

          Jamie added a comment -

          When using collapse.includeCollapsedDocs.fl and sorting by a field (not score), the returned collapsed results aren't sorted correctly.

          Yonik Seeley added a comment -

          It works great but gives problem when I include other components like Facet and Highlighter.

          See the list of sub-tasks on this issue starting with "SearchGrouping:".
          I fixed faceting yesterday - and I hope to fix highlighting and debugging today.

          Varun Gupta added a comment -

          I am using the patch SOLR-1682 committed on trunk for field collapsing. It works great but gives problem when I include other components like Facet and Highlighter. Is there any workaround to use Highlight and Facet components along with grouping?

          Stephen Weiss added a comment -

          FWIW, I fixed my earlier OOM issues with some garbage collection tuning.

          Now I'm noticing NPEs very similar to those people were reporting back before the patch from Jun 28th:

          SEVERE: java.lang.NullPointerException
          at org.apache.solr.search.fieldcollapse.NonAdjacentDocumentCollapser$FloatValueFieldComparator.compare(NonAdjacentDocumentCollapser.java:450)
          at org.apache.solr.search.fieldcollapse.NonAdjacentDocumentCollapser$DocumentComparator.compare(NonAdjacentDocumentCollapser.java:262)
          ... it's the same backtrace ...

          I'm guessing it's because I added those 5 lines back into the patch to get the paging working again.

          It's rather infrequent; it's probably something I can deal with until the new patch is complete. It doesn't happen every time at all like it seemed to for many people - just once in a while, and on queries that honestly run all the time, so it seems random and not related to a particular query (except perhaps in the size of the filter queries - these fqs match relatively large numbers of documents). But if any of this code makes it into the new patch, I thought it would be worth mentioning.

          Amit Nithian added a comment -

          Two questions and one comment:
          Comment:
          1) This is a neat patch! Thanks for this contribution.

          Questions:
          1) Which patch should we start using.. this one or the one Yonik referenced?
          2) Will the cache config in the component be retrieved via the CacheConfig instead of as a child element in the component?

          Excited to see the final product. I am using it for a simple app right now and it's working fairly well.

          Peter Kieltyka added a comment -

          Hey guys,

          How difficult would it be to add the ability to specify that, for any collapsed values, none of the documents are returned; that is, to just purge all duplicates from the results?

          This could be done by adding a new parameter, collapse.purge, which can be true or false and defaults to false.

          I could really use that. I have a scenario where I have the following data set of documents:

          ALL: <1,2,3,4,5>
          A: <1,2>
          B: <3,4>
          C: <4,5>

          and I want to search the text within the subset of documents: (ALL - A) = <3,4,5>

          Collapse would do this ..

          q => text:something AND -(group_id:[* TO *] AND -group_id:A)
          collapse.field => uid
          collapse.purge => true

          Cheers!

          Yonik Seeley added a comment -

          Since everyone seems to be watching this issue, I'll comment here.
          I've just committed the first parts to field collapsing to trunk! See SOLR-1682
          Thanks to everyone who has worked on these related issues for so long!
          I chose to back off and bite off a manageable piece, but I referenced all the
          great work that has been done in the various related issues, and tried
          to give credit to everyone who's submitted patches (let me know if I missed anyone.)

          This is really just a start to build from of course - there's much left to do!

          Evgeniy Serykh added a comment -

          I've patched the 1.4.1 release of Solr. When I execute a query with collapsing, the 'numFound' value always equals 10 when the 'rows' param is not specified.

          wyhw whon added a comment -

          When I use fq=xxxx:1302 I get the error below, but it works with other fq values.

          HTTP Status 500 - -1073634 java.lang.ArrayIndexOutOfBoundsException: -1073634 at org.apache.lucene.search.FieldComparator$StringOrdValComparator.copy(FieldComparator.java:659) at org.apache.lucene.search.TopFieldCollector$OutOfOrderOneComparatorNonScoringCollector.collect(TopFieldCollector.java:133) at org.apache.solr.search.SolrIndexSearcher.sortDocSet(SolrIndexSearcher.java:1529) at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:973) at org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:347) at org.apache.solr.search.SolrIndexSearcher.getDocListAndSet(SolrIndexSearcher.java:1503) at org.apache.solr.handler.component.CollapseComponent.doProcess(CollapseComponent.java:183) at org.apache.solr.handler.component.CollapseComponent.process(CollapseComponent.java:134) at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:339) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:242) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286) at 
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844) at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583) at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447) at java.lang.Thread.run(Thread.java:619)

          However, it works if no fq is used. If I disable useFilterForSortedQuery in solrconfig.xml, it also works.

          David Tuška added a comment -

          Hello, I found a bug in "Field collapsing".
          I tested it with the solr-1.4.1 patch and also with the trunk patch (rev. 955615).

          1) No collapse_counts/results are returned if collapseCount==1, although uncollapsed results are returned.

          http://localhost:8080/solr_tour/select/?q=nl_counter%3A1%0D%0A&start=0&rows=10&indent=on&sort=c_price_from_orig+asc&collapse.field=nl_tour_id&collapse.threshold=1&collapse.type=adjacent&collapse.debug=true

           
          <lst name="collapse_counts">
            <str name="field">nl_tour_id</str>
            <lst name="results"/>
            <lst name="debug">
              <str name="Docset type">HashDocSet(26)</str>
              <long name="Total collapsing time(ms)">0</long>
              <long name="Create uncollapsed docset(ms)">0</long>
              <long name="Get fieldvalues from fieldcache (ms)">0</long>
              <long name="AdjacentDocumentCollapser collapsing time(ms)">0</long>
              <long name="Creating collapseinfo time(ms)">0</long>
              <long name="Convert to bitset time(ms)">0</long>
              <long name="Create collapsed docset time(ms)">0</long>
            </lst>
          </lst>
          <result name="response" numFound="26" start="0">
          10x <doc></doc> 
          ...
          

          Looking into the code, I found some problematic parts:

          In NonAdjacentDocumentCollapser.java, in the function doCollapsing, the condition and the priorityQueue handling are wrong:

          NonAdjacentDocumentCollapser.java
          protected void doCollapsing(DocSet uncollapsedDocset, FieldCache.StringIndex values) {
          
            for (DocIterator i = uncollapsedDocset.iterator(); i.hasNext();) {
              int currentId = i.nextDoc();
              String currentValue = values.lookup[values.order[currentId]];
          
              NonAdjacentCollapseGroup collapseDoc = collapsedDocs.get(currentValue);
          
              if (collapseDoc == null) {
                ..
              }
          
              Integer dropOutId = (Integer) collapseDoc.priorityQueue.insertWithOverflow(currentId);
          
              // IMHO this must be >=, not > !!
              if (++collapseDoc.totalCount > collapseThreshold) {
                collapseDoc.collapsedDocuments++;
          
                // This is a problem too: if there is only one doc, it is not returned
                // by collapseDoc.priorityQueue.insertWithOverflow for collapse.threshold=1
                if (dropOutId != null)
                {
                  for (CollapseCollector collector : collectors) {
                    collector.documentCollapsed(dropOutId, collapseDoc, collapseContext);
                  }
                }
              }
            }
          }
          

          In AdjacentDocumentCollapser.java, in doCollapsing, there is a problem with the initializing condition: if there is only one doc, only the initializing branch is processed; the else-if and else parts are never reached, so collector.documentCollapsed and collector.documentHead are never called.

          AdjacentDocumentCollapser.java
          protected void doCollapsing(DocSet uncollapsedDocset, FieldCache.StringIndex values) {
            ...
            String collapseValue = null;
            ...
            for (DocIterator i = uncollapsedDocset.iterator(); i.hasNext();) {
              int currentId = i.nextDoc();
              String currentValue = values.lookup[values.order[currentId]];
          
              // Initializing
              if (collapseValue == null) {
                repeatCount = 0;
                collapseCount = 0;
                collapseId = currentId;
                collapseValue = currentValue;
          
                // Collapse the document if the field value is the same and
                // we have a run of at least collapseThreshold uncollapsedDocset.
              }
              // IMHO this must be if, not else-if !!
              else if (collapseValue.equals(currentValue))
              {
                if (++repeatCount >= collapseThreshold) {
                  collapseCount++;
                  for (CollapseCollector collector : collectors) {
                    CollapseGroup valueToCollapse = new AdjacentCollapseGroup(collapseId, currentValue);
                    collector.documentCollapsed(currentId, valueToCollapse, collapseContext);
                  }
                } else {
                  addDoc(currentId);
                }
              }
              else
              {
                ...
              }
              ...
            }
            ...
          }
          

          2) I have a problem with sorting. I need to sort CollapseGroups by the c_price_from_orig field, but if the request has "sort=c_price_from_orig+asc", the returned CollapseGroups are sorted by c_price_from_orig (the minimum of the collapsed docs in each group), yet some CollapseGroups are skipped and the doc with the lowest c_price_from_orig is not returned first!

          I will try to debug this problem and report back in more detail.

          Thanks for your reply, and sorry for my English.

          best regards
          David

          Pavel Minchenkov added a comment -

          Please, update patch for trunk.

          cruz fernandez added a comment - - edited

          I'm having an issue with the facet exclude filter parameters (http://wiki.apache.org/solr/SimpleFacetParameters#Tagging_and_excluding_Filters). I have added this exclude tags and the facet result I'm getting is without collapsing (it's counting the uncollapsed items).

          For example, in my first page it shows something like this (the facet result gives something like this):

          • book (11)
          • website (20)
          • journal (5)

          after clicking on book it shows 11 results correctly, but the faceting with the exclude applied shows:

          • book (230)
          • website (25)
          • journal (5)

          I am using the parameter collapse.facet=after

          The collapsed count of books is 11, and the uncollapsed count is 230, I verified it.

          Pavel Minchenkov added a comment -

          Latest patch for current trunk has many conflicts in SolrIndexSearcher.java.

          Stephen Weiss added a comment -

          Actually I'm testing more (I want to make sure it's not just my own error), and it seems like paging in general is just broken with this patch - any page between 4 and 80 seems to have exactly the same results on it. Then the results change a little, every 20 pages or so.

          Stephen Weiss added a comment -

          Oh Martijn, I hope you're reading. After a few months of calm we had some OOMs again on our production servers. So I tried your latest patch with the Solr 1.4.1 release, since it bundles various fixes for memory leaks. The performance difference is great - far less CPU and RAM usage all around. But there's a catch! Something was introduced that changes the "numFound" that is reported. After we noticed this, I found your comment and removed these lines from NonAdjacentDocumentCollapser.java:

          + if (collapsedGroupPriority.size() > maxNumberOfGroups) {
          +   NonAdjacentCollapseGroup inferiorGroup = collapsedGroupPriority.first();
          +   collapsedDocs.remove(inferiorGroup.fieldValue);
          +   collapsedGroupPriority.remove(inferiorGroup);
          + }

          We did NOT remove line 99 as suggested, because that caused a compile error:

          [javac] /home/sweiss/apache-solr-1.4.1/src/java/org/apache/solr/search/fieldcollapse/NonAdjacentDocumentCollapser.java:99: cannot find symbol
          [javac] symbol : variable collapseDoc
          [javac] location: class org.apache.solr.search.fieldcollapse.NonAdjacentDocumentCollapser
          [javac] if (collapseDoc == null) {

          After doing this, I noticed a huge performance drop - far worse than what we had even with 1.4 and your patch from December. Searches were taking >10s to complete (before we were just over 1s for the worst searches). So, I went back and tried to find a way to get the "numFound" through other means - and I figured I could just facet on the same field we're collapsing on, and then count the number of facets. Looks good - the count of the facets is the right count, and it would appear to be working.
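The facet workaround described here relies on the fact that the number of collapsed groups equals the number of distinct collapse-field values among the hits, which is exactly what counting the entries of a facet on that field gives you. A minimal plain-Java illustration of that equivalence (hypothetical names, not Solr code):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

public class CollapsedCountSketch {
    // Given the collapse-field value of each matching document, the number
    // of collapsed groups is the number of distinct values - the same
    // quantity a facet on that field reports when you count its entries.
    static int collapsedNumFound(List<String> collapseFieldValues) {
        return new HashSet<>(collapseFieldValues).size();
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("siteA", "siteA", "siteB", "siteC", "siteB");
        System.out.println(collapsedNumFound(values)); // 3 groups
    }
}
```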

          But, there's a snag. It seems that the results being returned by your patch, unaltered, are incorrect. For example - my search for "orange" returns 7200 collapsed results, either using the real numFound from the altered patch, or using the facet method with the new patch. This equates to 160 pages of results. However, with the unaltered patch, if we actually try to retrieve page 158, or really any page over 130 or so, we get the exact same results. With the altered patch (removing those few lines), page 158 actually is page 158. Basically, it seems like your patch throws away good results - and I get the feeling that it throws away those good results somewhere in those 5 lines.

          Now, I'm stuck. I really don't know what to do... I don't want the OOMs to continue, but it looks like they will regardless because both the old version (1.4 + December patch) and the new, altered patched version are using too many resources. But if I used the latest patch without changing it, I'm not getting the right results all the way through.

          Is there anything we can do? I appreciate your help...

          Martijn van Groningen added a comment -

          Seconded. The NPE's were occurring rather randomly, but I haven't seen them since I've switched to 1.4.1 + your latest patch. Good stuff! It's also nice to have a patch against an actual release version (FYI, I was using r955615 before as per your patch note).

          A lot of stuff is changing (or already has changed) in Lucene / Solr internally, so that might have been the cause of these exceptions.

          So the actual field requested (content) doesn't get added. It does work when I remove the shards= parameter, only querying one core.

          I think that this part of the response is not copied from the shards' responses into the response that is returned to the client. So that will have to be added in order to get these collapsed documents.

          One important notice about this patch is that it is not going to be committed. Child issues of SOLR-236 like SOLR-1682, on the other hand, will get committed to the trunk, but it might take some time until all the functionality that the patches of SOLR-236 provide is implemented in an efficient manner. Just to make some things clear, because this is a long, very long and complicated issue.

          Jasper van Veghel added a comment -

          Seconded. The NPE's were occurring rather randomly, but I haven't seen them since I've switched to 1.4.1 + your latest patch. Good stuff! It's also nice to have a patch against an actual release version (FYI, I was using r955615 before as per your patch note).

          The only thing I'm still running into at this point is that I'm trying to get this to run using multiple cores / shards. Documents with the same collapse-field values don't span across shards so I figured it should work, and it does. But when including:

          collapse.includeCollapsedDocs.fl=content

          The actual documents returned in the collapse-counts/results are listed as:

          <result name="collapsedDocs" numFound="1" start="0">
          <doc/>
          </result>

          So the actual field requested (content) doesn't get added. It does work when I remove the shards= parameter, only querying one core.

          Doug Steigerwald added a comment -

          Excellent! Everything looks good with our issue. Thanks for the quick turn around.

          Martijn van Groningen added a comment -

          @Doug Steigerwald and Jasper van Veghel
          Can you check if your errors still occur in the latest patch for 1.4.1 release?

          Martijn van Groningen made changes -
          Attachment SOLR-236-1_4_1.patch [ 12448216 ]
          Martijn van Groningen added a comment -

          Attached a new patch. This patch is a backport of the last patch to Solr 1.4.1. There are currently many changes in the trunk which make maintaining this patch difficult. To apply this patch, check out http://svn.apache.org/repos/asf/lucene/solr/tags/release-1.4.1/ and apply the patch in the checkout directory.

          Jasper van Veghel added a comment -

          I'm getting the same Exception as Eric Caron, only without using an fq. It seems to have something to do with caching and potentially stemming.

          These queries are being run against a set of Dutch political news articles. The following works:

          /select?q=rosenthal&collapse.field=url_exact

          And this doesn't:

          /select?q=roos&collapse.field=url_exact

          Due to stemming, 'roos' is also highlighted in results for the former query; hence the (expected) results for the latter query are a subset of the former. The exception is:

          java.lang.NullPointerException
          at org.apache.solr.search.fieldcollapse.NonAdjacentDocumentCollapser$FloatValueFieldComparator.compare(NonAdjacentDocumentCollapser.java:451)
          at org.apache.solr.search.fieldcollapse.NonAdjacentDocumentCollapser$DocumentComparator.compare(NonAdjacentDocumentCollapser.java:263)
          at org.apache.solr.search.fieldcollapse.NonAdjacentDocumentCollapser$DocumentPriorityQueue.lessThan(NonAdjacentDocumentCollapser.java:197)
          at org.apache.lucene.util.PriorityQueue.insertWithOverflow(PriorityQueue.java:148)
          at org.apache.solr.search.fieldcollapse.NonAdjacentDocumentCollapser.doCollapsing(NonAdjacentDocumentCollapser.java:114)
          at org.apache.solr.search.fieldcollapse.AbstractDocumentCollapser.executeCollapse(AbstractDocumentCollapser.java:259)
          at org.apache.solr.search.fieldcollapse.AbstractDocumentCollapser.collapse(AbstractDocumentCollapser.java:183)
          at org.apache.solr.handler.component.CollapseComponent.doProcess(CollapseComponent.java:173)
          at org.apache.solr.handler.component.CollapseComponent.process(CollapseComponent.java:127)
          at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
          at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
          at org.apache.solr.core.SolrCore.execute(SolrCore.java:1322)
          at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:337)
          at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:240)
          at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157)
          at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:388)
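The trace points at FloatValueFieldComparator dereferencing a value that is apparently null for some documents (e.g. docs without a value for the sort field). Purely as an illustration of the general fix direction - a hypothetical stand-in, not the actual Solr comparator - nulls can be ordered explicitly instead of dereferenced:

```java
import java.util.Comparator;

public class NullSafeFloatCompare {
    // Documents lacking the sort field can yield null from a field lookup;
    // ordering nulls explicitly (here: after all real values) avoids the
    // NullPointerException that unboxing or dereferencing null would throw.
    static final Comparator<Float> NULLS_LAST =
            Comparator.nullsLast(Comparator.naturalOrder());

    public static void main(String[] args) {
        System.out.println(NULLS_LAST.compare(1.0f, null));  // negative: value before null
        System.out.println(NULLS_LAST.compare(null, null));  // 0: equal
    }
}
```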

          Doug Steigerwald added a comment -

          I keep running into an ArrayIndexOutOfBoundsException when sorting with field collapsing. I'm running Solr 1.4.1 with the field-collapse-5.patch along with the 3 files from Peter for OOM issues.

          We've got a basic query that returns all event type records in the index (object_class:events), and one fq to make sure we're grabbing data for the correct site (site_id:86). I'm sorting on a category_id (TrieIntField). Collapsing on a string (collapse.type=normal). Here's a basic query that doesn't work for us.

          q=object_class:events&fq=site_id:86&sort=category_id+desc&collapse.field=rollup&collapse.type=normal

          Jun 24, 2010 3:20:12 PM org.apache.solr.common.SolrException log
          SEVERE: java.lang.ArrayIndexOutOfBoundsException: -4294
          at org.apache.lucene.search.FieldComparator$IntComparator.copy(FieldComparator.java:328)
          at org.apache.lucene.search.TopFieldCollector$OutOfOrderOneComparatorNonScoringCollector.collect(TopFieldCollector.java:133)
          at org.apache.solr.search.SolrIndexSearcher.sortDocSet(SolrIndexSearcher.java:1487)
          at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:931)
          at org.apache.solr.search.SolrIndexSearcher.getDocListAndSet(SolrIndexSearcher.java:1289)
          at org.apache.solr.handler.component.CollapseComponent.doProcess(CollapseComponent.java:176)
          at org.apache.solr.handler.component.CollapseComponent.process(CollapseComponent.java:127)
          at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
          at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
          at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)

          This is happening to one of our sites in production (the only site left using our events calendar) and I can't seem to make it happen in development with some fake data. We wiped all data from our production indexes and reindexed recently (we upgraded to Solr 1.4.0 a few weeks ago). Does anyone have any ideas what might be causing this? I'm going to try to pull the database to our development servers and see if I can reindex and reproduce the issue, but that will take some time. The index copied from production to development does show this issue.

          Any hints? This is happening when sorting on any TrieIntField or string field. Normal collapsing or adjacent.

          Martijn van Groningen made changes -
          Attachment SOLR-236-trunk.patch [ 12447374 ]
          Martijn van Groningen added a comment -

          I've attached a new patch that is compatible with the current trunk (rev 955615). The reason the previous patch did not work, was that the StringIndex class was removed. DocTermsIndex is used instead. See LUCENE-2380 for more details on this.

          Lance Norskog added a comment -

          It's the three-year anniversary for SOLR-236! And it's still active, unfinished and uncommitted. Is this a record?

          Hoss Man made changes -
          Fix Version/s Next [ 12315093 ]
          Fix Version/s 1.5 [ 12313566 ]
          Hoss Man added a comment -

          Bulk updating 240 Solr issues to set the Fix Version to "next" per the process outlined in this email...

          http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3Calpine.DEB.1.10.1005251052040.24672@radix.cryptio.net%3E

          Selection criteria was "Unresolved" with a Fix Version of 1.5, 1.6, 3.1, or 4.0. email notifications were suppressed.

          A unique token for finding these 240 issues in the future: hossversioncleanup20100527

          Christophe Biocca added a comment -

          I'd just like to throw in a suggestion about the AbstractDocumentCollapser & CollapseCollectorFactory APIs: it seems to me that changing factory.createCollapseCollector(SolrRequest req) to factory.createCollapseCollector(ResponseBuilder rb) would allow for more specialized collapse collectors that could use, amongst other things, the SortSpec in the implementation of the collector. Our use case is that we want to show possibly more than one document for a given value of the collapse field, depending on relative scores. Passing in the ResponseBuilder would allow us to do that much more easily. Since the caching uses the ResponseBuilder object as its key, it won't introduce any new issues.
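A sketch of the suggested signature change, using stub stand-ins for the Solr types (the real ResponseBuilder, SortSpec, and factory interfaces are richer; this only illustrates what the factory would gain access to):

```java
public class CollapseFactorySketch {
    // Hypothetical stubs for the types named in the suggestion.
    interface SolrQueryRequest {}
    static class SortSpec {}
    static class ResponseBuilder {
        final SolrQueryRequest req;
        final SortSpec sortSpec;
        ResponseBuilder(SolrQueryRequest req, SortSpec sortSpec) {
            this.req = req;
            this.sortSpec = sortSpec;
        }
    }
    interface CollapseCollector {}

    interface CollapseCollectorFactory {
        // Was: createCollapseCollector(SolrQueryRequest req) - the factory
        // saw only the raw request. Passing the ResponseBuilder also exposes
        // the parsed SortSpec (and the request remains reachable via rb.req).
        CollapseCollector createCollapseCollector(ResponseBuilder rb);
    }

    public static void main(String[] args) {
        ResponseBuilder rb = new ResponseBuilder(new SolrQueryRequest() {}, new SortSpec());
        // A factory can now branch on rb.sortSpec, e.g. to keep the top-N
        // documents per collapse group under the requested sort.
        CollapseCollectorFactory f = b -> new CollapseCollector() {};
        System.out.println(f.createCollapseCollector(rb) != null);
    }
}
```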

          Kallin Nagelberg added a comment -

          I tried asking this question on the user list, but perhaps this is a more appropriate forum.

          As I understand it, field collapsing has been disabled on multi-valued fields. Is this really necessary?

          Let's say I have a multi-valued field, 'my-mv-field'. I have a query like (my-mv-field:1 OR my-mv-field:5) that returns docs with the following values for 'my-mv-field':

          Doc1: 1, 2, 3
          Doc2: 1, 3
          Doc3: 2, 4, 5, 6
          Doc4: 1

          If I collapse on that field with that query I imagine it should mean 'collect the docs, starting from the top, so that I find 1 and 5'. In this case if it returned Doc1 and Doc3 I would be happy.

          There must be some ambiguity or implementation detail I am unaware of that is preventing this. It may be a critical piece of functionality for an application I'm working on, so I'm curious whether there is a point in pursuing development of this functionality or if I am missing something.

          Thanks,
          Kallin Nagelberg
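The greedy reading of this request can be sketched in plain Java (a hypothetical helper, not the patch's implementation): walk the docs in rank order and keep one only while it covers a queried value that no kept doc has covered yet.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class MultiValuedCollapseSketch {
    // Greedy interpretation: keep a doc only if it contributes at least one
    // queried value not yet covered by a kept doc; stop once all are covered.
    static List<String> selectCovering(LinkedHashMap<String, List<Integer>> rankedDocs,
                                       Set<Integer> queriedValues) {
        List<String> kept = new ArrayList<>();
        Set<Integer> uncovered = new HashSet<>(queriedValues);
        for (Map.Entry<String, List<Integer>> doc : rankedDocs.entrySet()) {
            if (uncovered.isEmpty()) break;
            if (doc.getValue().stream().anyMatch(uncovered::contains)) {
                kept.add(doc.getKey());
                uncovered.removeAll(doc.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        LinkedHashMap<String, List<Integer>> docs = new LinkedHashMap<>();
        docs.put("Doc1", Arrays.asList(1, 2, 3));
        docs.put("Doc2", Arrays.asList(1, 3));
        docs.put("Doc3", Arrays.asList(2, 4, 5, 6));
        docs.put("Doc4", Arrays.asList(1));
        // Query (my-mv-field:1 OR my-mv-field:5): Doc1 covers 1, Doc3 covers 5.
        System.out.println(selectCovering(docs, new HashSet<>(Arrays.asList(1, 5))));
    }
}
```

One ambiguity this exposes: the result depends on iteration order (here, rank order), so two docs sharing a value can both survive if each also contributes a new value - which may be why the patch simply disallows multi-valued fields.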

          Martijn van Groningen made changes -
          Attachment SOLR-236-trunk.patch [ 12444611 ]
          Martijn van Groningen added a comment -

          Varun, I noticed the same NPE. I've updated the patch and fixed the issue. In the patch I've also added a test that simulates the problem you described.

          Lance Norskog added a comment -

          Eric Caron added a comment - 29/Apr/10 02:27 PM

          Using the latest from trunk as of 2010-04-29, and the SOLR-236-trunk.patch from 2010-03-29 05:08, I get a NullPointerException whenever I use collapse.field and an fq.

          Varun Gupta added a comment - 15/May/10 07:36 AM

          I applied the latest patch on the trunk and got the below exception after I made some commits to the index:

          Eric, Varun: Please create unit tests that show these bugs.

          Varun Gupta added a comment -

          I applied the latest patch on the trunk and got the below exception after I made some commits to the index:

          SEVERE: java.lang.NullPointerException
          at org.apache.solr.search.fieldcollapse.NonAdjacentDocumentCollapser$FloatValueFieldComparator.compare(NonAdjacentDocumentCollapser.java:450)
          at org.apache.solr.search.fieldcollapse.NonAdjacentDocumentCollapser$DocumentComparator.compare(NonAdjacentDocumentCollapser.java:262)
          at org.apache.solr.search.fieldcollapse.NonAdjacentDocumentCollapser$DocumentPriorityQueue.lessThan(NonAdjacentDocumentCollapser.java:196)
          at org.apache.lucene.util.PriorityQueue.upHeap(PriorityQueue.java:221)
          at org.apache.lucene.util.PriorityQueue.add(PriorityQueue.java:130)
          at org.apache.lucene.util.PriorityQueue.insertWithOverflow(PriorityQueue.java:146)
          at org.apache.solr.search.fieldcollapse.NonAdjacentDocumentCollapser.doCollapsing(NonAdjacentDocumentCollapser.java:113)
          at org.apache.solr.search.fieldcollapse.AbstractDocumentCollapser.executeCollapse(AbstractDocumentCollapser.java:259)
          at org.apache.solr.search.fieldcollapse.AbstractDocumentCollapser.collapse(AbstractDocumentCollapser.java:179)
          at org.apache.solr.handler.component.CollapseComponent.doProcess(CollapseComponent.java:173)
          at org.apache.solr.handler.component.CollapseComponent.process(CollapseComponent.java:127)
          at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
          at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
          at org.apache.solr.core.SolrCore.execute(SolrCore.java:1321)
          at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:341)
          at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:244)
          at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
          at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)

          I also got an error while optimizing the index.

          Eric Caron added a comment -

          Regarding the numFound count, one of the loudest complaints from the Sphinx community is the inability to see the total number pre-collapse. Is it possible to dictate which value (possibly both) is calculated at run-time? When FieldCollapse gains the attention it deserves, I'd expect an onslaught of requests along these lines. (I personally want both, the pre-value to display the number of matches, and the post-value to calculate pagination).

          Joseph Freeman added a comment -

          collapse.includeCollapsedDocs.count ?

          When I use collapse.includeCollapsedDocs.fl, I get ALL the collapsed documents.

          It seems like we should have a collapse.includeCollapsedDocs.count parameter to limit this result set?

          Martijn van Groningen added a comment -

          Another note. The numFound count in this patch does not mean all documents found; it currently represents the number of documents returned in the response. This is due to a performance improvement that was discussed on this page a while ago. You can disable this improvement by commenting out or deleting lines 99 and 106 to 110 in the NonAdjacentDocumentCollapser.java file (latest patch). My experience with this improvement is that it saves memory, but the search-time improvements were minimal, so whether you do this depends on your situation.

          Martijn van Groningen made changes -
          Attachment SOLR-236-trunk.patch [ 12444531 ]
          Martijn van Groningen added a comment -

          I've updated the patch for the trunk. The following changes are included:

          • The patch has been updated to the latest trunk. So no patch conflicts should occur.
          • Eric Caron reported NPEs when using field collapsing in combination with a filter query. After some digging I found the cause of the NPE. When using an fq the scores are cached in the filter cache, but due to a bug in DelegateDocSet the scores were not returned in some cases (null was returned instead). This resulted in an NPE at a later stage of the query execution. I've also updated the integration test to cover this situation. This also explains why everything was fine the first time: when doing a normal refresh (F5 / Cmd-R) the result comes from the HTTP cache, so everything is still fine. However, when doing a hard refresh a second query is executed, the results are then retrieved from the configured Solr caches in most cases, and this results in the NPE.
          Sergey Shinderuk added a comment -

          Finally, I applied SOLR-236.patch to rev 899572 (dated 2010-01-15) of the trunk, and I get correct numFound values with collapsing enabled.

          Sergey Shinderuk added a comment -

          @Claus
          I faced the same issue. Did you find any solution, or maybe a workaround?

          When collapsing is enabled, numFound is equal to the number of rows requested and NOT the total number of distinct documents found.

          I applied the latest SOLR-236-trunk.patch to the trunk checked out on the date of patch, because patching the latest revision fails.
          Am I doing something wrong?

          I want to collapse near-duplicate documents in search results based on a document signature. But with this issue I can't paginate through the results, because I don't know how many there are.

          Besides, an article at http://blog.jteam.nl/2009/10/20/result-grouping-field-collapsing-with-solr/ shows examples with the correct numFound returned. How can I get that working?

          Eric Caron added a comment -

          Using the latest from trunk as of 2010-04-29, and the SOLR-236-trunk.patch from 2010-03-29 05:08, I get a NullPointerException whenever I use collapse.field and an fq.

          Works:
          /solr/select/?q=sales&fq=country%3A1
          Works:
          /solr/select/?q=sales&collapse.field=company
          Doesn't work:
          /solr/select/?q=sales&collapse.field=company&fq=country%3A1

          The top of the trace is:
          java.lang.NullPointerException
          at org.apache.solr.search.fieldcollapse.NonAdjacentDocumentCollapser$FloatValueFieldComparator.compare(NonAdjacentDocumentCollapser.java:450)
          at org.apache.solr.search.fieldcollapse.NonAdjacentDocumentCollapser$DocumentComparator.compare(NonAdjacentDocumentCollapser.java:262)
          at org.apache.solr.search.fieldcollapse.NonAdjacentDocumentCollapser$DocumentPriorityQueue.lessThan(NonAdjacentDocumentCollapser.java:196)
          at org.apache.lucene.util.PriorityQueue.insertWithOverflow(PriorityQueue.java:148)
          at org.apache.solr.search.fieldcollapse.NonAdjacentDocumentCollapser.doCollapsing(NonAdjacentDocumentCollapser.java:113)
          at org.apache.solr.search.fieldcollapse.AbstractDocumentCollapser.executeCollapse(AbstractDocumentCollapser.java:259)
          at org.apache.solr.search.fieldcollapse.AbstractDocumentCollapser.collapse(AbstractDocumentCollapser.java:179)
          at org.apache.solr.handler.component.CollapseComponent.doProcess(CollapseComponent.java:173)
          at org.apache.solr.handler.component.CollapseComponent.process(CollapseComponent.java:127)
          at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)

          Karel Braeckman added a comment -

          Hi all,

          I wondered if it is possible to sort the collapsed results based on an aggregate function (e.g., sort by sum(price))?

          What needs to be done to make this possible? (Could it be done via a plugin?)

          Kind regards,
          Karel
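Absent plugin support, one workaround is to fetch the collapsed groups with their field values and re-sort client-side by the aggregate. A hypothetical post-processing sketch (the class and method names are invented; this is not a Solr plugin API):

```java
import java.util.*;
import java.util.stream.*;

// Hypothetical client-side sketch: re-sort collapsed groups by an aggregate
// such as sum(price), computed over each group's collapsed documents.
public class GroupAggregateSort {
    public static List<String> sortGroupsBySumDesc(Map<String, List<Double>> pricesByGroup) {
        return pricesByGroup.entrySet().stream()
                // order groups by the sum of their prices, highest first
                .sorted(Comparator.comparingDouble(
                        (Map.Entry<String, List<Double>> e) ->
                                e.getValue().stream().mapToDouble(Double::doubleValue).sum())
                        .reversed())
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```

This trades an extra round-trip (or a larger rows value) for the aggregate ordering, so it only works when the candidate group set is small enough to fetch.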

          Lukas Kahwe Smith added a comment -

          It's my understanding that this patch currently produces, for each collapsed group, a score equal to the max() of the scores inside the group. Is there any work being done to extend the score to take all documents inside the group into account? For example, by using the collapse_count or the individual scores (summing them, or via some custom algorithm).

          Billy Morgan added a comment -

          @Claus

          I am having the same issue

          Claus Schröter added a comment -

          Hi all,

          I applied Martijns last Patch to the trunk and encountered a problem with document counts:

          whenever I set the rows= value on the query, the "numFound" result parameter is limited to exactly the value of rows.
          The facet counts are also limited to this value.

          If I omit the rows parameter, everything is fine. I tried to track down the problem; it seems that the SolrSearcher query is limited to the "rows" value
          before collapsing is done.

          Does anybody encounter a similar problem?

          Cheers!
          clausi

          Pierre-Luc added a comment -

          Hi all,

          We have integrated the most recent patch into our 1.4 install and the Out of memory fix suggested by Peter. I am facing memory issues only when collapsing. I would like to know why the class CacheValue is static in AbstractDocumentCollapser. If I remove the static attribute of that class, the memory footprint is greatly reduced and everything works fine.

          My document count is around 5 million.

          Any help would be greatly appreciated.
          Thank you.

          Martijn van Groningen made changes -
          Attachment SOLR-236-trunk.patch [ 12440108 ]
          Martijn van Groningen added a comment -

          @Thomas
          Somehow the solrj code was left out when I created the patch yesterday. I guess I accidentally deleted it when I was moving the code to the new trunk. Anyhow, I have updated the patch to include the solrj code, and applying it should go flawlessly.

          Robert Zotter added a comment -

          @Thomas Essentially my use case involves a product listing where there are many closely related items being sold by any number of sellers. I would like to distribute the search results across as many sellers as possible, giving each seller a fair chance to sell their products, so I was going to use field collapsing to limit the number of items displayed per seller.

          Ideally it would be nice if there were some way to evenly distribute closely related documents (scores within some defined percentage of each other)

          For example instead of:

          Item 1 sold by Seller A
          Item 2 sold by Seller A
          Item 3 sold by Seller A
          Item 4 sold by Seller B
          Item 5 sold by Seller B
          Item 6 sold by Seller B

          Assuming all of these items score within a certain percentage of each other, it would be nice to have:

          Item 1 sold by Seller A
          Item 4 sold by Seller B
          Item 2 sold by Seller A
          Item 5 sold by Seller B
          ....

          Although I do not achieve this exact behavior with this particular patch, it will at least get me closer to my goal.

          FYI my document count is around 6 million and I am already utilizing the document deduper.
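The even distribution described above is not something the patch does, but it can be approximated client-side by bucketing the collapsed results per seller and interleaving them round-robin. A hypothetical sketch (class and method names invented for illustration):

```java
import java.util.*;

// Hypothetical client-side sketch: round-robin interleave of ranked results
// so that consecutive hits rotate across sellers.
public class SellerInterleave {
    /** Each hit is a String[]{itemId, sellerId}, already in rank order. */
    public static List<String> interleave(List<String[]> hits) {
        // Bucket hits per seller, preserving rank order inside each bucket.
        Map<String, Deque<String>> bySeller = new LinkedHashMap<>();
        for (String[] hit : hits) {
            bySeller.computeIfAbsent(hit[1], k -> new ArrayDeque<>()).add(hit[0]);
        }
        List<String> out = new ArrayList<>();
        while (!bySeller.isEmpty()) {
            // Take one item from each seller in turn, dropping empty buckets.
            Iterator<Deque<String>> it = bySeller.values().iterator();
            while (it.hasNext()) {
                Deque<String> queue = it.next();
                out.add(queue.poll());
                if (queue.isEmpty()) it.remove();
            }
        }
        return out;
    }
}
```

On the six-item example above, this yields Item1, Item4, Item2, Item5, Item3, Item6. Note it ignores the "within a defined score percentage" refinement; that would need a score threshold check before rotating away from the top seller.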

          Thomas Heigl added a comment -

          @Robert:

          What is your use case for field collapsing? I think under "normal" conditions (collapsing on a field with reasonably many unique values) you can go with the slightly older patch and the OOM fixes. I compared the performance of the newest patch for the trunk with the 1.4 release patched as described above and didn't notice much difference under these conditions. I will most likely go with the trunk, however, as I have millions of documents with millions of unique values in the collapse field and need every bit of performance I can get.

          Robert Zotter added a comment -

          @Thomas: Thanks for the input. Do you think it's best to go with a clean version of 1.4 or the latest from trunk? Basically, I'm asking whether you think trunk is semi-stable enough for a production environment. Thanks.

          Thomas Heigl added a comment -

          @Robert:

          I just tried the field collapsing patch with a clean version of the 1.4 release. The only recent patch that seems to be applicable without manually resolving conflicts is 2009-12-08. In addition to the patch you should also add the three individual files uploaded by Peter Karich to deal with the worst memory issues.

          Thomas Heigl added a comment -

          @Martijn:

          There is a small problem with the latest patch file. Both TortoiseSVN and patch complain that the file is malformed because there is an "empty" patch for FieldCollapseResponse.java around line 2199. Simply removing lines 2195-2199 does the trick.

          Apart from that, the patch works perfectly for me.

          Martijn van Groningen made changes -
          Attachment SOLR-236-trunk.patch [ 12440022 ]
          Martijn van Groningen added a comment -

          I've attached a new patch, which includes the following changes:

          • Patch uses the new Solr trunk. Everything in the patch is relative to the trunk directory.
          • The changes Peter Karich made to DocSetScoreCollector and NonAdjacentDocumentCollapserTest that make it much more memory efficient.
          • The change Yonik suggested to make field collapsing more efficient.

            efficiency: the treeset and hashmap are both only the size of the top number of docs we are looking at (10, for instance). We will now have the top 10 documents collapsed by the right field with a collapseCount of 1. Put another way, we have the top 10 groups.

          This also means that the total count of a search with field collapsing does not represent all the found documents. The total count now represents: start + count

          Robert Zotter added a comment -

          What are the required steps to get this patch working with a clean 1.4? Is it even compatible? I've read in the comments above that the 12/12 field-collapse-5.patch applies correctly but has horrible memory bugs. Have there been any updates on this? Recommendations, anyone?

          Peter Karich made changes -
          Attachment DocSetScoreCollector.java [ 12437980 ]
          Attachment NonAdjacentDocumentCollapser.java [ 12437981 ]
          Attachment NonAdjacentDocumentCollapserTest.java [ 12437982 ]
          Peter Karich added a comment - - edited

          It seems to me that the provided changes are necessary to fix the OutOfMemory exception (see the 3 appended files). Please apply the files with caution, because I made the changes against an old patch (from Nov 2009).

          Peter Karich added a comment -

          > Shouldn't the float array in DocSetScoreCollector be changed to a Map?

          hmmh, maybe I expressed myself a bit weirdly: I already changed all of this to a Map (a SortedMap).
          I started this change in DocSetScoreCollector and then changed all the other occurrences of the float array (otherwise I would have had to copy the entire map).

          > > I think the compare method should NOT be called if no docs are in the scores array ... ?

          > I would expect that every docId has a score.

          Yes, me too. So I expect there is a bug somewhere. But as I said, this breaks only one test (collapse with faceting before). It could even be a bug in the test case, though.

          Martijn van Groningen added a comment -

          Shouldn't the float array in DocSetScoreCollector be changed to a Map? Because that is actually being cached and requires the most memory. The float array in the NonAdjacentDocumentCollapser.PredefinedScorer isn't being cached. Though changing this to a Map can be an improvement.

          I think the compare method should NOT be called if no docs are in the scores array ... ?

          I would expect that every docId has a score.

          Peter Karich added a comment - - edited

          Regarding the OutOfMemory problem: we are now testing the suggested change in production.

          I replaced the float array with a TreeMap<Integer, Float>. The change was nearly trivial. (I cannot easily provide a patch, because we are using an older patch, although I could post the 3 changed files.)

          The reason I used a TreeMap instead of a HashMap is that I needed the tailMap method in the advance method of NonAdjacentDocumentCollapser.PredefinedScorer:

          public int advance(int target) throws IOException {
              // now we need a TreeMap method:
              iter = scores.tailMap(target).entrySet().iterator();
              if (iter.hasNext())
                  return target;
              else
                  return NO_MORE_DOCS;
          }
          

          Then - I think - I discovered a bug/inconsistent behaviour: if I run the test FieldCollapsingIntegrationTest.testNonAdjacentCollapse_withFacetingBefore, the scores array is created as new float[maxDocs] in the old version, but it is never filled with any values, so Float value1 = values.get(doc1); returns null in the method NonAdjacentDocumentCollapser.FloatValueFieldComparator.compare (the size of the TreeMap is 0!). I work around this via:

           
          if (value1 == null)
              value1 = 0f;
          if (value2 == null)
              value2 = 0f;
          

          I think the compare method should NOT be called if no docs are in the scores array ... ?
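          The sparse-score idea being discussed can be sketched as a small self-contained class. This is an illustration, not the actual patch: the class name is made up, and the NO_MORE_DOCS sentinel mirrors Lucene's DocIdSetIterator convention. Unlike the snippet above, advance here returns the actual docId found rather than the raw target, and score falls back to 0f for an uncollected doc, covering the null case described:

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Standalone sketch of sparse score storage with tailMap-style advancing.
public class SparseScorerDemo {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    // Scores keyed by docId; the TreeMap keeps docId order for iteration.
    private final NavigableMap<Integer, Float> scores = new TreeMap<>();

    public void collect(int docId, float score) {
        scores.put(docId, score);
    }

    // Returns the first collected docId >= target, or NO_MORE_DOCS.
    public int advance(int target) {
        Integer doc = scores.ceilingKey(target);
        return doc == null ? NO_MORE_DOCS : doc;
    }

    // Guards against the missing-score case: uncollected docs score 0f.
    public float score(int docId) {
        Float s = scores.get(docId);
        return s == null ? 0f : s;
    }
}
```

Memory here is proportional to the number of collected documents, not to the largest docId, which is the point of the change.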

          Hide
          Martijn van Groningen added a comment -

          The numFound attribute holds the total number of documents found for the specified query, so it also counts the documents beyond the first result page. The reason that numFound is lower for the first query than for the second is that the second query's collapse.threshold is higher. Only documents with the same collapse field value that appear more than twice are omitted from the result, which means fewer documents are collapsed.

          Hide
          Yao Ge added a comment -

          I just applied the latest patch to trunk and I don't quite understand how the "numFound" in the response list is computed. With rows=10&collapse.threshold=1 I got numFound=11; with rows=10&collapse.threshold=2 I got numFound=22.
          In both cases the actual number of docs in the list is 10. Why is numFound reported this way?

          Hide
          Martijn van Groningen added a comment -

          That makes sense. I initially made it an array to maintain the document order for the scores, but this order is already in the OpenBitSet. I think a Map is a good idea.

          Hide
          Leon Messerschmidt added a comment -

          The OutOfMemory problem affects both field-collapse-5.patch on Solr 1.4 and SOLR-236.patch on the trunk.

          The root cause of the problem is DocSetScoreCollector, which creates a float array sized to the largest docId that matches the query. If you have a large index (we have several million documents) and a document with a very large id is matched, you may end up with a huge array (in our case several hundred MB). Only a really small subset of the array is used at any given time (especially if you're matching just a few documents with big doc ids).

          The implementation can rather use a sparse array or a map to keep track of scores.
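          To make the dense-versus-sparse contrast concrete, here is a minimal sketch (illustrative names, not code from any patch): the dense array pays for every docId up to maxDoc, while the map pays only for the hits actually collected:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: dense vs. sparse score storage for collapsing.
public class ScoreStorageDemo {
    // Dense: one float slot per docId, allocated up-front.
    // For an index with ~100M documents this is roughly 400 MB,
    // even when the query matches only a handful of documents.
    static float[] denseScores(int maxDoc) {
        return new float[maxDoc];
    }

    // Sparse: memory grows only with the number of documents collected,
    // regardless of how large the matching docIds are.
    static Map<Integer, Float> sparseScores() {
        return new HashMap<>();
    }
}
```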

          steevensz made changes -
          Comment [ I applied this patch to the nightlybuild of feb 22 and this compiles without any problem.

          I can start Solr and it runs fine. But when i add the Field Collapse in the solrconfig.xml i cannot start Solr anymore.

          After adding this line to my solrconfig.xml:

          <searchComponent name="query"
                    class="org.apache.solr.handler.component.CollapseComponent" />


          I get this error when i run Solr:

          2010-02-22 22:24:30.722::WARN: Failed startup of context org.mortbay.jetty.webapp.WebAppContext@7f5580{/solr,jar:file:/opt/apache-solr-1.5-dev/example/webapps/solr.war!/}
          java.lang.NullPointerException
                  at org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:593)
                  at org.mortbay.jetty.servlet.Context.startContext(Context.java:139)
                  at org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1218)
                  at org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:500)
                  at org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:448)
                  at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
                  at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:147)
                  at org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:161)
                  at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
                  at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:147)
                  at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
                  at org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:117)
                  at org.mortbay.jetty.Server.doStart(Server.java:210)
                  at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
                  at org.mortbay.xml.XmlConfiguration.main(XmlConfiguration.java:929)
                  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                  at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
                  at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
                  at java.lang.reflect.Method.invoke(Unknown Source)
                  at org.mortbay.start.Main.invokeMain(Main.java:183)
                  at org.mortbay.start.Main.start(Main.java:497)
                  at org.mortbay.start.Main.main(Main.java:115)

          (I am using Centos with Java 1.6.0_13)

          Any help is greatly appreciated!!
          ]
          Otis Gospodnetic made changes -
          Link This issue is related to SOLR-1773 [ SOLR-1773 ]
          Otis Gospodnetic made changes -
          Link This issue is related to SOLR-1682 [ SOLR-1682 ]
          Hide
          Peter Karich added a comment - - edited

          Trying the latest patch from 1st Feb 2010. It compiles against solr-2010-02-13 from the nightly build dir, but does not work. If I query

          http://server/solr-app/select?q=*:*&collapse.field=myfield

          it fails with:

           
          
          HTTP Status 500 - null java.lang.NullPointerException
              at org.apache.solr.schema.FieldType.toExternal(FieldType.java:329)
              at org.apache.solr.schema.FieldType.storedToReadable(FieldType.java:348)
              at org.apache.solr.search.fieldcollapse.collector.AbstractCollapseCollector.getCollapseGroupResult(AbstractCollapseCollector.java:58)
              at org.apache.solr.search.fieldcollapse.collector.DocumentGroupCountCollapseCollectorFactory$DocumentCountCollapseCollector.getResult(DocumentGroupCountCollapseCollectorFactory.java:84)
              at org.apache.solr.search.fieldcollapse.AbstractDocumentCollapser.getCollapseInfo(AbstractDocumentCollapser.java:193)
              at org.apache.solr.handler.component.CollapseComponent.doProcess(CollapseComponent.java:192)
              at org.apache.solr.handler.component.CollapseComponent.process(CollapseComponent.java:127)
              at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
              ...
           

          I only need the OutOfMemory problem solved ...

          Hide
          Peter Karich added a comment -

          We are facing OutOfMemory problems too. We are using https://issues.apache.org/jira/secure/attachment/12425775/field-collapse-5.patch

          > Are you using any other features besides plain collapsing? The field collapse cache gets large very quickly,
          > I suggest you turn it off (if you are using it). Also you can try to make your filterCache smaller.

          How can I turn off the collapse cache or make the filterCache smaller?
          Are there other workarounds, e.g. via a special version of the patch?

          I read that it could help to specify collapse.maxdocs but this didn't help in our case ... could collapse.type=adjacent help here? (https://issues.apache.org/jira/browse/SOLR-236?focusedCommentId=12495376&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12495376)

          What do you think?

          BTW: We really like this patch and would like to use it !!

          Hide
          Gerald DeConto added a comment -

          I have been able to apply and use the solr-236 patch successfully. Very, very cool and powerful.

          Are there any plans/hacks to include the non-collapsed documents in the collapseCount and aggregate function values (i.e. so that they include ALL documents, not just the collapsed ones)? Possibly via some parameter like collapse.includeAllDocs?

          I think this would be a great addition to the collapse code (and Solr functionality), via what I would think is a small change, since Solr doesn't have any other aggregation mechanism (as yet).

          Am trying to see how to change the code myself but Java is not my primary language.

          Hide
          Kevin Cunningham added a comment -

          No, just field collapsing. We went back to the field-collapse-5.patch for the time being. So far it's been good, and we updated just to get closer to the latest, not because we were seeing issues. Thanks.

          Hide
          Martijn van Groningen added a comment -

          > Regarding Patrick's comment about a memory leak, we are seeing something similar - very large memory usage and eventually using all the available memory. Were there any confirmed issues that may have been addressed with the later patches? We're using the 12-24 patch. Any toggles we can switch to still get the feature, yet minimize the memory footprint?

          Are you using any other features besides plain collapsing? The field collapse cache gets large very quickly, I suggest you turn it off (if you are using it). Also you can try to make your filterCache smaller.

          > What fixes would we be missing if we ran Solr 1.4 with the last "field-collapse-5.patch" patch?

          Not much I believe, some are using it in production without too many problems.

          Hide
          Kevin Cunningham added a comment - - edited

          Regarding Patrick's comment about a memory leak, we are seeing something similar - very large memory usage and eventually using all the available memory. Were there any confirmed issues that may have been addressed with the later patches? We're using the 12-24 patch. Any toggles we can switch to still get the feature, yet minimize the memory footprint?

          We had been running the 11-29 field-collapse-5.patch patch and saw nothing near this amount of memory consumption.

          What fixes would we be missing if we ran Solr 1.4 with the last "field-collapse-5.patch" patch?

          Hide
          Martijn van Groningen added a comment -

          If you look into AbstractDocumentCollapser#createDocumentCollapseResult() you will see that collapseResult will never be null. Therefore I think the null check is not necessary.
          I think the following code is sufficient:

          DocListAndSet results = searcher.getDocListAndSet(rb.getQuery(),
                collapseResult.getCollapsedDocset(),
                rb.getSortSpec().getSort(),
                rb.getSortSpec().getOffset(),
                rb.getSortSpec().getCount(),
                rb.getFieldFlags());
          

          Also specifying the filters is unnecessary, because they were already taken into account when creating the uncollapsed docset.

          Hide
          Koji Sekiguchi added a comment -

          The following snippet in CollapseComponent.doProcess():

          DocListAndSet results = searcher.getDocListAndSet(rb.getQuery(),
                collapseResult == null ? rb.getFilters() : null,
                collapseResult.getCollapsedDocset(),
                rb.getSortSpec().getSort(),
                rb.getSortSpec().getOffset(),
                rb.getSortSpec().getCount(),
                rb.getFieldFlags());
          

          The 2nd line implies that collapseResult may be null. If it is null, don't we get an NPE at the 3rd line?

          Martijn van Groningen made changes -
          Attachment SOLR-236.patch [ 12434435 ]
          Hide
          Martijn van Groningen added a comment -

          I agree! I've updated the patch so that it adds a check that the field is indexed. If not, an exception is thrown.

          Hide
          Koji Sekiguchi added a comment -

          A random comment: don't we need to check that collapse.field is indexed in checkCollapseField()?

          protected void checkCollapseField(IndexSchema schema) {
            SchemaField schemaField = schema.getFieldOrNull(collapseField);
            if (schemaField == null) {
              throw new RuntimeException("Could not collapse, because collapse field does not exist in the schema.");
            }
          
            if (schemaField.multiValued()) {
              throw new RuntimeException("Could not collapse, because collapse field is multivalued");
            }
          
            if (schemaField.getType().isTokenized()) {
              throw new RuntimeException("Could not collapse, because collapse field is tokenized");
            }
          }
          

          I accidentally specified an unindexed field for collapse.field and got unexpected results without any errors.
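          The extra guard being discussed could be sketched as below. FieldProps is a hypothetical stand-in for Solr's SchemaField so the example stays self-contained; the real check would call methods such as schemaField.indexed() on the schema object:

```java
// Illustrative sketch of checkCollapseField with an added "indexed" guard.
// FieldProps is a hypothetical stand-in for Solr's SchemaField.
public class CollapseFieldCheckDemo {
    static class FieldProps {
        final boolean indexed, multiValued, tokenized;
        FieldProps(boolean indexed, boolean multiValued, boolean tokenized) {
            this.indexed = indexed;
            this.multiValued = multiValued;
            this.tokenized = tokenized;
        }
    }

    static void checkCollapseField(FieldProps field) {
        if (field == null)
            throw new RuntimeException("Could not collapse, because collapse field does not exist in the schema.");
        // New guard: an unindexed field silently produces wrong groups otherwise.
        if (!field.indexed)
            throw new RuntimeException("Could not collapse, because collapse field is not indexed.");
        if (field.multiValued)
            throw new RuntimeException("Could not collapse, because collapse field is multivalued");
        if (field.tokenized)
            throw new RuntimeException("Could not collapse, because collapse field is tokenized");
    }
}
```

Failing fast here turns a silent wrong-results case into an explicit configuration error.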

          Martijn van Groningen made changes -
          Attachment SOLR-236.patch [ 12430926 ]
          Hide
          Martijn van Groningen added a comment -

          Attached updated patch that works with the latest trunk. This patch is not compatible with 1.4 branch.

          Hide
          Martijn van Groningen added a comment -

          Hi Yaniv, I tried the same on the 1.4 branch (from svn) and the svn trunk. Applying the patch on both sources went fine, but when building (ant dist) on trunk I also got compile errors. This was because SolrQueryResponse moved from the request package to the response package. I will update the patch shortly. Building on the 1.4 branch went without any problems (ant dist). What errors occurred when running ant dist on the 1.4 branch?

          Hide
          Yaniv S. added a comment -

          Hi All, this is a very exciting feature and I'm trying to apply it on our system.
          I've tried patching on 1.4 and on the trunk version but both give me build errors.
          Any suggestions on how I can build 1.4 or latest with this patch?

          Many Thanks,
          Yaniv

          Hide
          Martijn van Groningen added a comment -

          If the field is tokenized and has more than one token, your field collapse result will become incorrect. What happens, if I remember correctly, is that it will only collapse on the field's last token. This of course leads to weird collapse groups. Users who have only one token per collapse field are out of luck because of this check. Somehow I think we should let the user know that it is not possible to collapse on a tokenized field (at least with multiple tokens), maybe by adding a warning in the response. Still, I think the exception is clearer, but it also prohibits the single-token case, of course.

          Or someone could come after me and write a patch that somehow checks for multi-tokened fields and throws an exception.

          Checking if a tokenized field contains only one token is really inefficient, because you would have to check every collapse field of all documents. Now the check is done based on the field's definition in the schema.

          Hide
          Michael Gundlach added a comment -

          I've found the need to collapse on an analyzed field which contains one token (an email field, which is analyzed in order to lowercase it). I had to apply a patch on top of field-collapse-5.patch to comment out the isTokenized() check in AbstractCollapseComponent.java, at which point the code worked perfectly.

          Is there a strong argument for keeping the isTokenized() check in? Anyone who needs to collapse an analyzed, single-token field is out of luck with this check in place. I understand that the current version protects users from incorrect results if they collapse a multi-token tokenized field, but maybe collapsing on analyzed fields is worth that risk. (Or someone could come after me and write a patch that checks for multi-tokened fields somehow and throws an exception.)

          Hide
          Martijn van Groningen added a comment -

          I believe the field-collapse-5.patch should work for 1.4. Some bugs were fixed in later patches, so I recommend using the latest patch on the latest successful nightly build if that is an option for you.
          Applying the latest patch to the 1.4 sources will probably result in some minor merge errors, but I think these should be easy to fix.

          Hide
          Kevin Cunningham added a comment -

          Which patch is recommended for those running a stock 1.4 release?

          Hide
          Martijn van Groningen added a comment - - edited

          The result document of our prefix query, which was at position 1 without collapsing, was with collapsing not even within the top 10 results. We using the option collapse.maxdocs=150 and after changing this option to the value 15000, the results seem to be as expected. Because of that, we concluded, that there has to be a problem with the sorting of the uncollapsed docset.

          The collapse.maxdocs aborts collapsing after the threshold is met, but it is doing that based on the uncollapsed docset which is not sorted in any way. The result of that is that documents that would normally appear in the first page don't appear at all in the search result. Eventually the collapse component uses the collapsed docset as the result set and not the uncollapsed docset.

          Also, we noticed a huge memory leak problem, when using collapsing. We configured the component with <searchComponent name="query" class="org.apache.solr.handler.component.CollapseComponent"/>. Without setting the option collapse.field, it works normally, there are far no memory problems. If requests with enabled collapsing are received by the Solr server, the whole memory (oldgen could not be freed; eden space is heavily in use; ...) gets full after some few requests. By using a profiler, we noticed that the filterCache was extraordinary large. We supposed that there could be a caching problem (collapeCache was not enabled).

          I agree it gets huge. This applies to both the filterCache and the field collapse cache. This is something that has to be addressed and certainly will be in the new field-collapse implementation. In the patch you're using, too much is being cached (some data in the cache can even be neglected). Also, in some cases strings are being cached that could actually be replaced with hash codes.
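As an illustration of the hash-code idea (my own sketch, not code from the patch): instead of holding the full collapse-field string for every document, a cache entry can hold just the value's hash code, with the usual caveat that distinct values may collide.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: count group sizes keyed by the hash code of the
// collapse-field value instead of by the String itself, so the cache
// holds 4-byte ints rather than (potentially long) strings.
public class HashGrouping {

    static Map<Integer, Integer> groupSizes(String[] collapseValues) {
        Map<Integer, Integer> sizes = new HashMap<>();
        for (String value : collapseValues) {
            // merge: insert 1, or add 1 to the existing count
            sizes.merge(value.hashCode(), 1, Integer::sum);
        }
        return sizes;
    }
}
```

For Patrick's index (~3.5 million documents, string identifiers up to 50 chars) this trades roughly 100 bytes of string data per entry for a single int.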

          Additionally, it might be very useful if the parameter collapse=true|false worked again and could be used to enable/disable the collapsing functionality. Currently, the existence of a field chosen for collapsing enables this feature, and there is no way to configure the fields for collapsing within the request handlers. With that, we could configure it once and only enable/disable it per request, as is conveniently done for other components (highlighting, faceting, ...).

          It actually makes sense to bring back the collapse.enable parameter in the patch.

          Martijn

          Patrick Jungermann added a comment -

          Hi all,

          we are using Solr's trunk with the latest patch of 2009-12-24 09:54 AM. The index contains ~3.5 million documents with string-based identifiers up to 50 chars long.

          The result document of our prefix query, which was at position 1 without collapsing, was not even within the top 10 results with collapsing. We were using the option collapse.maxdocs=150, and after changing this option to the value 15000 the results seem to be as expected. Because of that, we concluded that there has to be a problem with the sorting of the uncollapsed docset.

          Also, we noticed a huge memory leak when using collapsing. We configured the component with <searchComponent name="query" class="org.apache.solr.handler.component.CollapseComponent"/>.
          Without setting the option collapse.field it works normally and there are no memory problems at all. If requests with collapsing enabled are received by the Solr server, the whole memory (oldgen could not be freed; eden space is heavily in use; ...) fills up after just a few requests. Using a profiler, we noticed that the filterCache was extraordinarily large. We suspected a caching problem (the collapse cache was not enabled).

          Additionally, it might be very useful if the parameter collapse=true|false worked again and could be used to enable/disable the collapsing functionality. Currently, the existence of a field chosen for collapsing enables this feature, and there is no way to configure the fields for collapsing within the request handlers. With that, we could configure it once and only enable/disable it per request, as is conveniently done for other components (highlighting, faceting, ...).

          Patrick
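A sketch of how the requested configuration might look (the collapse.enable parameter and the field name "site" are assumptions drawn from this discussion, not a working feature of the current patch):

```xml
<!-- solrconfig.xml sketch; collapse.enable and the field name "site"
     are hypothetical, based on the request in this comment thread -->
<searchComponent name="query"
                 class="org.apache.solr.handler.component.CollapseComponent"/>

<requestHandler name="/collapsed" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="collapse.field">site</str>
    <!-- collapsing stays off unless a request passes collapse.enable=true -->
    <str name="collapse.enable">false</str>
  </lst>
</requestHandler>
```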

          Stanislaw Osinski added a comment -

          Hi Grant,

          I would note, in looking at the Carrot2 code, they actually have a ByFieldClusteringAlgorithm (what they call synthetic clustering) which does field collapsing/clustering on a value of a field. To quote the javadocs:

          Clusters documents into a flat structure based on the values of some field of the documents. By default the {@link Document#SOURCES} field is used and Name of the field to cluster by. Each non-null scalar field value with distinct hash code will give raise to a single cluster, named using the {@link Object#toString()} value of the field. If the field value is a collection, the document will be assigned to all clusters corresponding to the values in the collection. Note that arrays will not be 'unfolded' in this way.

          I don't know how it performs, but it seems like it would at least be worth investigating.

          Carrot2's ByFieldClusteringAlgorithm is very simple. It literally throws everything into a hash map based on the field value (source code). This algorithm is used in our live demo to cluster by news source.
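The behaviour described here can be approximated in a few lines (a sketch of the idea, not Carrot2's actual implementation): each distinct field value becomes one flat cluster, and a document with a collection-valued field joins every matching cluster.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of by-field "synthetic" clustering: throw every document into
// a hash map keyed by its field value. Collection values assign the
// document to several clusters; a plain scalar assigns it to one.
public class ByFieldSketch {

    static Map<String, List<String>> cluster(Map<String, Object> docIdToFieldValue) {
        Map<String, List<String>> clusters = new LinkedHashMap<>();
        for (Map.Entry<String, Object> doc : docIdToFieldValue.entrySet()) {
            Object v = doc.getValue();
            Collection<?> values = (v instanceof Collection)
                    ? (Collection<?>) v
                    : Collections.singleton(v);
            for (Object value : values) {
                clusters.computeIfAbsent(String.valueOf(value),
                        k -> new ArrayList<>()).add(doc.getKey());
            }
        }
        return clusters;
    }
}
```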

          Note, they also have a synthetic one for collapsing based on URL: ByUrlClusteringAlgorithm

          This one creates a hierarchy based on the URL segments and might be useful to create "by-domain" collapsing if needed.

          In general, my rough guess is that the criteria for content-based collapsing would be closer to duplicate detection than to the type of grouping Carrot2 produces.

          Grant Ingersoll added a comment -

          I'm curious as to whether anyone has just thought of using the Clustering component for this? If your "collapse" field was a single token, I wonder if you would get the results you're looking for.

          I would note, in looking at the Carrot2 code, they actually have a ByFieldClusteringAlgorithm (what they call synthetic clustering) which does field collapsing/clustering on a value of a field. To quote the javadocs:

          Clusters documents into a flat structure based on the values of some field of the documents. By default the {@link Document#SOURCES} field is used

          and

          Name of the field to cluster by. Each non-null scalar field value with distinct hash code will give raise to a single cluster, named using the {@link Object#toString()} value of the field. If the field value is a collection, the document will be assigned to all clusters corresponding to the values in the collection. Note that arrays will not be 'unfolded' in this way.

          I don't know how it performs, but it seems like it would at least be worth investigating.

          Note, they also have a synthetic one for collapsing based on URL: ByUrlClusteringAlgorithm

          Just food for thought.

          Shalin Shekhar Mangar made changes -
          Attachment SOLR-236.patch [ 12428902 ]
          Shalin Shekhar Mangar added a comment -
          1. Patch updated for SOLR-1685 and SOLR-1686
          2. The last patch had reverted changes to CollapseComponent configuration in solrconfig.xml and solrconfig-fieldcollapse.xml. Synced it back
          Uri Boness added a comment -

          If we are returning a number of documents (as opposed to a number of groups) to the user, how do they avoid splitting on a page in the middle of the group?

          As far as I know (Martijn, correct me if I'm wrong), Martijn's patch returns the number of groups and documents, where each group is actually represented as a document. So in that sense, the total count applies to the result set as is (groups count as documents) and therefore pagination just works.

          The only thing this algorithm can't do (related to pagination) is give the total number of documents after collapsing (and hence can't calculate the exact number of pages). This can be fine in many circumstances as long as the gui handles it (people don't seem to mind google doing it... I just tried it. Google didn't show the result count right unless displaying the last page).

          First of all, I must admit that I never noticed that in Google, so I guess you're right. But when you think about it: with Google, how many times do you get a hit count so low that it fits in only 2-3 pages? I hardly ever do, and when I do I don't even bother to check the results, I just try to improve my search. With Solr it's often different, especially since all these discovery features and faceting are so often used to narrow the search extensively... I'm not saying that not having a perfect pagination mechanism is a problem, not at all; I'm just saying that it might be an issue for specific use cases or specific domains... but that's just an assumption (or a gut feeling).

          Martijn van Groningen added a comment -

          Yes, I used his patch. Made a small bugfix and made sure that it is in sync with the latest trunk.

          Noble Paul added a comment -

          Isn't the patch built on the one given by Shalin? The configuration looks different...

          Martijn van Groningen made changes -
          Attachment SOLR-236.patch [ 12428818 ]
          Martijn van Groningen added a comment -

          Updated the patch so that it applies without conflicts to the current trunk. Also included a bugfix regarding field collapsing and the filter cache that was noticed by Varun Gupta on the user mailing list.

          Shalin Shekhar Mangar added a comment -

          @ttdi - Please post your questions to solr-user mailing list. This issue is strictly for Solr related development (not usage).

          ttdi added a comment -

          Hi Martijn van Groningen and experts,
          when I use http://localhost:8080/search/?page=1 it collapses the page=1 results, but when I use http://localhost:8080/search/?page=2
          it only collapses the page=2 results, not all records.
          I want to collapse all records across the pagination. How can I do that?
          Thanks!

          Stephen Weiss added a comment -

          Are you using any extra field collapse features, such as aggregate functions? Also, do the groups you collapse on have large field values? I'm going over the code and reconsidering the way stuff is cached right now.

          No, we're very simple in our usage of the collapse features themselves; we don't even use the output that the collapse patch adds. However, we do facet on a number of fields in this query as well, and sort by a date field. We also use local filter queries which we exclude for the facets individually (my favorite new feature). This packs a lot more action into one query than we had been doing previously (without that, we were running 8+ queries to get the same information), so I was worried at first that this was the cause of the RAM consumption. The field we are collapsing on is type "pint"; it can be positive or negative depending on what system the document is coming in from. Each document has several stored fields, but a whole document's stored fields are always under 1K together (it's only image metadata; there's no body text to any of these documents, as this is for an image search engine).

          Martijn van Groningen added a comment -

          It almost maxed out a machine with 18GB devoted to jetty in about 20 minutes.

          Hmmm.... that doesn't seem right. This is an issue.

          Are you using any extra field collapse features, such as aggregate functions? Also, do the groups you collapse on have large field values?
          I'm going over the code and reconsidering the way stuff is cached right now.

          Stephen Weiss added a comment -

          Quick note on the collapse cache - we just went into production with 1.4 and right away we had to turn off the collapse cache. This was with 1.4 dist and the patch from 12/12. With the cache enabled, RAM consumption was through the roof on the production servers - I guess with the variety of queries coming in, it filled up very fast. It almost maxed out a machine with 18GB devoted to jetty in about 20 minutes. We just used the sample config (maxSize=512), it looks like there were about 60 entries in the cache before we restarted. We would see the memory usage jump by as much as 2% after just one query.

          Without the cache the performance is still quite good (far better than what we had before), so we're not fussed, but it may indicate there needs to be more optimization there... Generally our consumption rarely goes over 50% on this machine unless we have a lot of commits coming in. The cache did provide some performance benefits on some of the queries that return large numbers of results (1M+), so it would be nice to have. Of course, it's possible that with our index these levels of RAM consumption would be unavoidable. I'm not sure if there are any further specifics I could provide that would be helpful; let me know.

          Yonik Seeley added a comment -

          As far as I understand from your collapse algorithm proposal, in order to save memory you'd like to restrict the group creation to only those that belong in the requested results page.

          A ton of memory, and probably a good amount of time too. It may be the only variant that certain people would be able to use (but note that it is just a variant - I'm not proposing doing away with the other options).

          I think there might be a problem with pagination as well

          Yes, pagination is a sticky issue... but I don't think this algorithm messes it up further.

          If we are returning a number of documents (as opposed to a number of groups) to the user, how do they avoid splitting on a page in the middle of the group? I guess they over-request a little. What if they want a fixed number of groups? I guess they over-request by a lot (nGroups*collapse.threshold). Then they need to keep track of how many documents they actually used.
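The worst-case over-request here is just a product (the method and parameter names below are mine, matching the discussion rather than any actual API): to guarantee nGroups full groups you may need up to nGroups * collapse.threshold underlying documents.

```java
// Back-of-envelope sketch for the over-request estimate; names are
// assumptions based on the discussion, not Solr parameters.
public class OverRequest {

    // Worst case: every one of the nGroups groups is collapsed from a
    // full run of collapseThreshold documents.
    static int docsToRequest(int nGroups, int collapseThreshold) {
        return nGroups * collapseThreshold;
    }
}
```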

          The only thing this algorithm can't do (related to pagination) is give the total number of documents after collapsing (and hence can't calculate the exact number of pages). This can be fine in many circumstances as long as the gui handles it (people don't seem to mind google doing it... I just tried it. Google didn't show the result count right unless displaying the last page).

          Shalin Shekhar Mangar added a comment -

          This is exactly the point, it's not really meta-data over the document, but on the group the document belongs to. And you also need a more obvious way to mark this document as a group representation (to distinguish it from other normal documents).

          We show the highest scoring document of a group, so does the fact that the metadata belongs to the group and not the document matter at all?

          But extending the current <doc> element doesn't mean we break BWC. Adding a <collapse-info> (or <collapse-meta-data>) sub-element to it will certainly not break anything, especially since we still don't have a formal xsd for the responses (I know we're working on it, but it's still not out there, so it's safe).

          We are not extending anything. We're just adding a couple of fields which may not exist in the index and this is a capability we plan to introduce anyway (however this issue does not need to depend on SOLR-1566). The response format remains exactly the same. There is no break in compatibility.

          Uri Boness added a comment -

          @Yonik

          As far as I understand from your collapse algorithm proposal, in order to save memory you'd like to restrict group creation to only those groups that belong in the requested results page. Beyond losing the faceting support over the collapsed DocSet, I think there might be a problem with pagination as well. For every page you'll end up with a different total count and therefore a different number of pages. This can be very confusing from the user perspective - imagine going to the first page and calculating (and displaying) that you have 3 pages of results, then when the user asks for the second page, s/he gets a response with 2 pages and a different total count.

          Uri Boness added a comment -

          Why is it wrong. it is about adding meta-info to the docs. This is what we plan to do with SOLR-1566

          This is exactly the point, it's not really meta-data over the document, but on the group the document belongs to. And you also need a more obvious way to mark this document as a group representation (to distinguish it from other normal documents).

          Even when we collapse what we are expecting is simple search results. So a drastic deviation from the standard format is not a good idea.

          I definitely agree that BWC should be kept, especially here where we're dealing with a query component. But extending the current <doc> element doesn't mean we break BWC. Adding a <collapse-info> (or <collapse-meta-data>) sub-element to it will certainly not break anything, especially since we still don't have a formal xsd for the responses (I know we're working on it, but it's still not out there, so it's safe).

          Noble Paul added a comment -

          I think mixing the collapse information with document fields is wrong

          Why is it wrong? It is about adding meta-info to the docs. This is what we plan to do with SOLR-1566

          Even when we collapse, what we are expecting is simple search results. So a drastic deviation from the standard format is not a good idea.

          Moreover, if we keep it in the document, parsing and processing stay simpler.

          Noble Paul made changes -
          Comment [ hi, experts,
           thanks for the great work!
           I downloaded Solr 1.4 from http://apache.freelamp.com/lucene/solr/1.4.0/apache-solr-1.4.0.zip
          and applied this patch: SOLR-236.patch 2009-12-18 10:16 AM Shalin Shekhar Mangar
          like this:
          G:\doc\apache-solr-1.4.0>patch.exe -p0 < SOLR-236.patch

          It shows some errors - does this patch (SOLR-236.patch 2009-12-18 10:16 AM) not support Solr 1.4?


          and the result is:
          patching file src/test/test-files/solr/conf/solrconfig-fieldcollapse.xml
          patching file src/test/test-files/solr/conf/schema-fieldcollapse.xml
          patching file src/test/test-files/solr/conf/solrconfig.xml
          patching file src/test/test-files/fieldcollapse/testResponse.xml
          can't find file to patch at input line 787
          Perhaps you used the wrong -p or --strip option?
          The text leading up to this was:
          --------------------------
          |
          |Property changes on: src/test/test-files/fieldcollapse/testResponse.xml
          |___________________________________________________________________
          |Added: svn:keywords
          | + Date Author Id Revision HeadURL
          |Added: svn:eol-style
          | + native
          |
          |Index: src/test/org/apache/solr/BaseDistributedSearchTestCase.java
          |===================================================================
          |--- src/test/org/apache/solr/BaseDistributedSearchTestCase.java(revision 891214)
          |+++ src/test/org/apache/solr/BaseDistributedSearchTestCase.java(working copy)
          --------------------------
          File to patch: SOLR-236.patch
          S: No such file or directory
          Skip this patch? [y] y
          Skipping patch.
          2 out of 2 hunks ignored
          patching file src/test/org/apache/solr/search/fieldcollapse/FieldCollapsingIntegrationTest.java
          patching file src/test/org/apache/solr/search/fieldcollapse/DistributedFieldCollapsingIntegrationTest.java
          patching file src/test/org/apache/solr/search/fieldcollapse/NonAdjacentDocumentCollapserTest.java
          patching file src/test/org/apache/solr/search/fieldcollapse/AdjacentCollapserTest.java
          patching file src/test/org/apache/solr/handler/component/CollapseComponentTest.java
          patching file src/test/org/apache/solr/client/solrj/response/FieldCollapseResponseTest.java
          patching file src/java/org/apache/solr/search/DocSetAwareCollector.java
          patching file src/java/org/apache/solr/search/fieldcollapse/CollapseGroup.java
          patching file src/java/org/apache/solr/search/fieldcollapse/DocumentCollapseResult.java
          patching file src/java/org/apache/solr/search/fieldcollapse/DocumentCollapser.java
          patching file src/java/org/apache/solr/search/fieldcollapse/collector/CollapseCollectorFactory.java
          patching file src/java/org/apache/solr/search/fieldcollapse/collector/DocumentGroupCountCollapseCollectorFactory.java
          patching file src/java/org/apache/solr/search/fieldcollapse/collector/aggregate/AverageFunction.java
          patching file src/java/org/apache/solr/search/fieldcollapse/collector/aggregate/MinFunction.java
          patching file src/java/org/apache/solr/search/fieldcollapse/collector/aggregate/SumFunction.java
          patching file src/java/org/apache/solr/search/fieldcollapse/collector/aggregate/MaxFunction.java
          patching file src/java/org/apache/solr/search/fieldcollapse/collector/aggregate/AggregateFunction.java
          patching file src/java/org/apache/solr/search/fieldcollapse/collector/CollapseContext.java
          patching file src/java/org/apache/solr/search/fieldcollapse/collector/DocumentFieldsCollapseCollectorFactory.java
          patching file src/java/org/apache/solr/search/fieldcollapse/collector/AggregateCollapseCollectorFactory.java
          patching file src/java/org/apache/solr/search/fieldcollapse/collector/CollapseCollector.java
          patching file src/java/org/apache/solr/search/fieldcollapse/collector/FieldValueCountCollapseCollectorFactory.java
          patching file src/java/org/apache/solr/search/fieldcollapse/collector/AbstractCollapseCollector.java
          patching file src/java/org/apache/solr/search/fieldcollapse/AbstractDocumentCollapser.java
          patching file src/java/org/apache/solr/search/fieldcollapse/NonAdjacentDocumentCollapser.java
          patching file src/java/org/apache/solr/search/fieldcollapse/AdjacentDocumentCollapser.java
          patching file src/java/org/apache/solr/search/fieldcollapse/util/Counter.java
          patching file src/java/org/apache/solr/search/SolrIndexSearcher.java
          patching file src/java/org/apache/solr/search/DocSetHitCollector.java
          patching file src/java/org/apache/solr/handler/component/CollapseComponent.java
          patching file src/java/org/apache/solr/handler/component/QueryComponent.java
          Hunk #5 succeeded at 521 with fuzz 2.
          Hunk #6 succeeded at 562 (offset -5 lines).
          patching file src/java/org/apache/solr/util/DocSetScoreCollector.java
          patching file src/common/org/apache/solr/common/params/CollapseParams.java
          patching file src/solrj/org/apache/solr/client/solrj/SolrQuery.java
          Hunk #1 FAILED at 17.
          Hunk #2 FAILED at 50.
          Hunk #3 FAILED at 76.
          Hunk #4 FAILED at 148.
          Hunk #5 FAILED at 197.
          Hunk #6 succeeded at 510 (offset -155 lines).
          Hunk #7 succeeded at 566 (offset -155 lines).
          5 out of 7 hunks FAILED -- saving rejects to file src/solrj/org/apache/solr/client/solrj/SolrQuery.java.rej
          patching file src/solrj/org/apache/solr/client/solrj/response/QueryResponse.java
          Hunk #1 FAILED at 47.
          Hunk #2 FAILED at 63.
          Hunk #3 succeeded at 122 with fuzz 2 (offset -8 lines).
          Hunk #4 succeeded at 320 with fuzz 2 (offset 17 lines).
          2 out of 4 hunks FAILED -- saving rejects to file src/solrj/org/apache/solr/client/solrj/response/QueryResponse.java.rej
          patching file src/solrj/org/apache/solr/client/solrj/response/FieldCollapseResponse.java

          and in src/solrj/org/apache/solr/client/solrj/SolrQuery.java.rej:

          ***************
          *** 17,28 ****
            
            package org.apache.solr.client.solrj;
            
          - import org.apache.solr.common.params.CommonParams;
          - import org.apache.solr.common.params.FacetParams;
          - import org.apache.solr.common.params.HighlightParams;
          - import org.apache.solr.common.params.ModifiableSolrParams;
          - import org.apache.solr.common.params.StatsParams;
          - import org.apache.solr.common.params.TermsParams;
            
            import java.util.regex.Pattern;
            
          --- 17,23 ----
            
            package org.apache.solr.client.solrj;
            
          + import org.apache.solr.common.params.*;
            
            import java.util.regex.Pattern;
            
          ***************
          *** 55,62 ****
                this.set(CommonParams.Q, q);
              }
            
          - /** enable/disable terms.
          - *
               * @param b flag to indicate terms should be enabled. <br /> if b==false, removes all other terms parameters
               * @return Current reference (<i>this</i>)
               */
          --- 50,57 ----
                this.set(CommonParams.Q, q);
              }
            
          + /** enable/disable terms.
          + *
               * @param b flag to indicate terms should be enabled. <br /> if b==false, removes all other terms parameters
               * @return Current reference (<i>this</i>)
               */
          ***************
          *** 81,150 ****
                }
                return this;
              }
          -
              public boolean getTerms() {
                return this.getBool(TermsParams.TERMS, false);
              }
          -
              public SolrQuery addTermsField(String field) {
                this.add(TermsParams.TERMS_FIELD, field);
                return this;
              }
          -
              public String[] getTermsFields() {
                return this.getParams(TermsParams.TERMS_FIELD);
              }
          -
              public SolrQuery setTermsLower(String lower) {
                this.set(TermsParams.TERMS_LOWER, lower);
                return this;
              }
          -
              public String getTermsLower() {
                return this.get(TermsParams.TERMS_LOWER, "");
              }
          -
              public SolrQuery setTermsUpper(String upper) {
                this.set(TermsParams.TERMS_UPPER, upper);
                return this;
              }
          -
              public String getTermsUpper() {
                return this.get(TermsParams.TERMS_UPPER, "");
              }
          -
              public SolrQuery setTermsUpperInclusive(boolean b) {
                this.set(TermsParams.TERMS_UPPER_INCLUSIVE, b);
                return this;
              }
          -
              public boolean getTermsUpperInclusive() {
                return this.getBool(TermsParams.TERMS_UPPER_INCLUSIVE, false);
              }
          -
              public SolrQuery setTermsLowerInclusive(boolean b) {
                this.set(TermsParams.TERMS_LOWER_INCLUSIVE, b);
                return this;
              }
          -
              public boolean getTermsLowerInclusive() {
                return this.getBool(TermsParams.TERMS_LOWER_INCLUSIVE, true);
              }
          -
              public SolrQuery setTermsLimit(int limit) {
                this.set(TermsParams.TERMS_LIMIT, limit);
                return this;
              }
          -
              public int getTermsLimit() {
                return this.getInt(TermsParams.TERMS_LIMIT, 10);
              }
          -
              public SolrQuery setTermsMinCount(int cnt) {
                this.set(TermsParams.TERMS_MINCOUNT, cnt);
                return this;
              }
          -
              public int getTermsMinCount() {
                return this.getInt(TermsParams.TERMS_MINCOUNT, 1);
              }
          --- 76,145 ----
                }
                return this;
              }
          +
              public boolean getTerms() {
                return this.getBool(TermsParams.TERMS, false);
              }
          +
              public SolrQuery addTermsField(String field) {
                this.add(TermsParams.TERMS_FIELD, field);
                return this;
              }
          +
              public String[] getTermsFields() {
                return this.getParams(TermsParams.TERMS_FIELD);
              }
          +
              public SolrQuery setTermsLower(String lower) {
                this.set(TermsParams.TERMS_LOWER, lower);
                return this;
              }
          +
              public String getTermsLower() {
                return this.get(TermsParams.TERMS_LOWER, "");
              }
          +
              public SolrQuery setTermsUpper(String upper) {
                this.set(TermsParams.TERMS_UPPER, upper);
                return this;
              }
          +
              public String getTermsUpper() {
                return this.get(TermsParams.TERMS_UPPER, "");
              }
          +
              public SolrQuery setTermsUpperInclusive(boolean b) {
                this.set(TermsParams.TERMS_UPPER_INCLUSIVE, b);
                return this;
              }
          +
              public boolean getTermsUpperInclusive() {
                return this.getBool(TermsParams.TERMS_UPPER_INCLUSIVE, false);
              }
          +
              public SolrQuery setTermsLowerInclusive(boolean b) {
                this.set(TermsParams.TERMS_LOWER_INCLUSIVE, b);
                return this;
              }
          +
              public boolean getTermsLowerInclusive() {
                return this.getBool(TermsParams.TERMS_LOWER_INCLUSIVE, true);
              }
          +
              public SolrQuery setTermsLimit(int limit) {
                this.set(TermsParams.TERMS_LIMIT, limit);
                return this;
              }
          +
              public int getTermsLimit() {
                return this.getInt(TermsParams.TERMS_LIMIT, 10);
              }
          +
              public SolrQuery setTermsMinCount(int cnt) {
                this.set(TermsParams.TERMS_MINCOUNT, cnt);
                return this;
              }
          +
              public int getTermsMinCount() {
                return this.getInt(TermsParams.TERMS_MINCOUNT, 1);
              }
          ***************
          *** 153,186 ****
                this.set(TermsParams.TERMS_MAXCOUNT, cnt);
                return this;
              }
          -
              public int getTermsMaxCount() {
                return this.getInt(TermsParams.TERMS_MAXCOUNT, -1);
              }
          -
              public SolrQuery setTermsPrefix(String prefix) {
                this.set(TermsParams.TERMS_PREFIX_STR, prefix);
                return this;
              }
          -
              public String getTermsPrefix() {
                return this.get(TermsParams.TERMS_PREFIX_STR, "");
              }
          -
              public SolrQuery setTermsRaw(boolean b) {
                this.set(TermsParams.TERMS_RAW, b);
                return this;
              }
          -
              public boolean getTermsRaw() {
                return this.getBool(TermsParams.TERMS_RAW, false);
              }
          -
              public SolrQuery setTermsSortString(String type) {
                this.set(TermsParams.TERMS_SORT, type);
                return this;
              }
          -
              public String getTermsSortString() {
                return this.get(TermsParams.TERMS_SORT, TermsParams.TERMS_SORT_COUNT);
              }
          --- 148,181 ----
                this.set(TermsParams.TERMS_MAXCOUNT, cnt);
                return this;
              }
          +
              public int getTermsMaxCount() {
                return this.getInt(TermsParams.TERMS_MAXCOUNT, -1);
              }
          +
              public SolrQuery setTermsPrefix(String prefix) {
                this.set(TermsParams.TERMS_PREFIX_STR, prefix);
                return this;
              }
          +
              public String getTermsPrefix() {
                return this.get(TermsParams.TERMS_PREFIX_STR, "");
              }
          +
              public SolrQuery setTermsRaw(boolean b) {
                this.set(TermsParams.TERMS_RAW, b);
                return this;
              }
          +
              public boolean getTermsRaw() {
                return this.getBool(TermsParams.TERMS_RAW, false);
              }
          +
              public SolrQuery setTermsSortString(String type) {
                this.set(TermsParams.TERMS_SORT, type);
                return this;
              }
          +
              public String getTermsSortString() {
                return this.get(TermsParams.TERMS_SORT, TermsParams.TERMS_SORT_COUNT);
              }
          ***************
          *** 202,208 ****
              public String[] getTermsRegexFlags() {
                return this.getParams(TermsParams.TERMS_REGEXP_FLAG);
              }
          -
              /** Add field(s) for facet computation.
               *
               * @param fields Array of field names from the IndexSchema
          --- 197,203 ----
              public String[] getTermsRegexFlags() {
                return this.getParams(TermsParams.TERMS_REGEXP_FLAG);
              }
          +
              /** Add field(s) for facet computation.
               *
               * @param fields Array of field names from the IndexSchema



          in src/solrj/org/apache/solr/client/solrj/response/QueryResponse.java.rej:

          ***************
          *** 47,52 ****
              private NamedList<Object> _spellInfo = null;
              private NamedList<Object> _statsInfo = null;
              private NamedList<Object> _termsInfo = null;
            
              // Facet stuff
              private Map<String,Integer> _facetQuery = null;
          --- 47,53 ----
              private NamedList<Object> _spellInfo = null;
              private NamedList<Object> _statsInfo = null;
              private NamedList<Object> _termsInfo = null;
          + private NamedList<Object> _collapseInfo = null;
            
              // Facet stuff
              private Map<String,Integer> _facetQuery = null;
          ***************
          *** 62,68 ****
            
              // Terms Response
              private TermsResponse _termsResponse = null;
          -
              // Field stats Response
              private Map<String,FieldStatsInfo> _fieldStatsInfo = null;
              
          --- 63,72 ----
            
              // Terms Response
              private TermsResponse _termsResponse = null;
          +
          + // Field collapse response
          + private FieldCollapseResponse _fieldCollapseResponse = null;
          +
              // Field stats Response
              private Map<String,FieldStatsInfo> _fieldStatsInfo = null;
              

          ]
          Yonik Seeley added a comment -

          Do you think that collapse.collectDiscardedDocuments.fl is better?

          Is this something that's really needed? If so, some other name ideas could be
          collapse.discarded.fl
          collapse.discarded.limit (it doesn't seem like a good idea to allow an unbounded number).

          Just one thought I had about the algorithm you propose. If you only create collapse groups for the top ten documents, then what about the total count of the search? Unique documents outside the top ten documents are not being grouped (if I understand you correctly) and that would impact the total count with how it currently works.

          Right - one would not be able to tell the total number of collapsed docs, or the total number of hits (or the DocSet) after collapsing. So only collapse.facet=before would be supported. I do think that just like faceting, there will be multiple ways of doing collapsing.

          Anyway, this is a great example of trying to make sure the interface doesn't preclude optimizations. Perhaps the total count of the search (numFound) should be pre-collapsing if collapse.facet=before, or perhaps it should always be pre-collapsing, and we should have another optional count for post-collapsing?
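          The pre-collapsing versus post-collapsing counts Yonik mentions can be sketched in plain Java (hypothetical data, not Solr internals): the two numbers are simply the raw hit count and the number of distinct groups among the hits.

```java
import java.util.*;

public class CollapseCounts {
    // Pre-collapsing count: every matching document, as numFound reports today.
    static long preCollapse(List<String> groupValuesOfHits) {
        return groupValuesOfHits.size();
    }

    // Post-collapsing count: one entry per distinct group value.
    static long postCollapse(List<String> groupValuesOfHits) {
        return new HashSet<>(groupValuesOfHits).size();
    }

    public static void main(String[] args) {
        List<String> hits = Arrays.asList("Belkin", "Corsair", "Corsair", "Corsair", "Belkin");
        System.out.println(preCollapse(hits));  // 5 - candidate value for numFound
        System.out.println(postCollapse(hits)); // 2 - the optional post-collapsing count
    }
}
```

Reporting both values side by side would let clients paginate on whichever count matches the chosen collapse.facet mode.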

          Uri Boness added a comment -

          @Shalin

          I think mixing the collapse information with document fields is wrong. The collapse fields don't really belong to the document but to the group the document represents, while the other fields do belong to it. The response format should somehow indicate this difference.

          Martijn van Groningen added a comment -

          We need to open a separate issue for the core related changes.

          As you have probably noticed, I have split the patch into smaller patches and created sub-issues for each patch.

          How about we change the current field collapsing response format to the following?

          Looks okay at first sight.

          For this to work, CollapseComponent must generate a custom SolrDocumentList and set it as "results" in the response.

          Maybe we need a more elegant solution for this. All these extra fields are calculated values. If we were to put the calculated values into a certain context, the response writers could then look values up in the context and write them to the response. Other functionalities might also benefit from this solution, like distances from a central point when doing a geo search. It is just an idea. I recall there is an issue in Jira that proposes something like this, but I couldn't find it.

          "collapse.aggregate" - Can we make this a multi-valued parameter instead of comma separated?

          I think that is a good idea; other parameters (like fq) are also multi-valued.

          BTW, I think we should continue further technical discussions in the sub-issues. We have space there for a lot of comments.

          Shalin Shekhar Mangar added a comment -

          How about we change the current field collapsing response format to the following?

          We add new well-known fields to the document itself, say

          1. "collapse.value" - contains the group field's value for this document
          2. "collapse.count" - the number of results collapsed under this document
          3. "collapse.aggregate.function(field-name)" - the aggregate value for the given function applied to the given field for this document's group

          Example:

          <?xml version="1.0" encoding="UTF-8"?>
          <response>
            <lst name="responseHeader">
              <int name="status">0</int>
              <int name="QTime">2</int>
              <lst name="params">
                <str name="collapse.field">manu_exact</str>
                <str name="collapse.aggregate">max(field1)</str>
                <str name="collapse.aggregate">avg(field1)</str>
                <str name="q">title:test</str>
                <str name="field.collapse">title</str>
                <str name="qt">collapse</str>
              </lst>
            </lst>
            <result name="response" numFound="30" start="0">
              <doc>
                <str name="id">F8V7067-APL-KIT</str>
                <str name="collapse.value">Belkin</str>
                <int name="collapse.count">1</int>
                <int name="collapse.aggregate.max(field1)">100</int>
                <float name="collapse.aggregate.avg(field1)">50.0</float>
              </doc>
              <doc>
                <str name="id">TWINX2048-3200PRO</str>
                <str name="collapse.value">Corsair Microsystems Inc.</str>
                <int name="collapse.count">3</int>
                <int name="collapse.aggregate.max(field1)">100</int>
                <float name="collapse.aggregate.avg(field1)">50.0</float>
              </doc>
            </result>
          </response>
          

          No need to have another section and correlate based on uniqueKeys. For this to work, CollapseComponent must generate a custom SolrDocumentList and set it as "results" in the response.

          For request parameters:

          1. "collapse.aggregate" - Can we make this a multi-valued parameter instead of comma separated?
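          The proposed pseudo-fields can be mocked up in plain Java to show how a component might compute them per group. The Row class and the sample data below are hypothetical, and real code would work against Lucene documents rather than simple objects: keep the most relevant document of each group and attach collapse.value, collapse.count, and the requested aggregates.

```java
import java.util.*;

public class CollapsePseudoFields {
    // Hypothetical input row: (uniqueKey, grouping-field value, field1 value),
    // listed most relevant first.
    static class Row {
        final String id;
        final String group;
        final int field1;
        Row(String id, String group, int field1) {
            this.id = id; this.group = group; this.field1 = field1;
        }
    }

    // Keep the first (most relevant) document of each group and attach the
    // proposed collapse.* pseudo-fields: value, count, and two aggregates.
    static List<Map<String, Object>> collapse(List<Row> rows) {
        Map<String, List<Row>> byGroup = new LinkedHashMap<>();
        for (Row r : rows) {
            byGroup.computeIfAbsent(r.group, k -> new ArrayList<>()).add(r);
        }
        List<Map<String, Object>> docs = new ArrayList<>();
        for (List<Row> group : byGroup.values()) {
            Row head = group.get(0);
            Map<String, Object> doc = new LinkedHashMap<>();
            doc.put("id", head.id);
            doc.put("collapse.value", head.group);
            doc.put("collapse.count", group.size());
            doc.put("collapse.aggregate.max(field1)",
                    group.stream().mapToInt(r -> r.field1).max().getAsInt());
            doc.put("collapse.aggregate.avg(field1)",
                    group.stream().mapToInt(r -> r.field1).average().getAsDouble());
            docs.add(doc);
        }
        return docs;
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
                new Row("F8V7067-APL-KIT", "Belkin", 100),
                new Row("TWINX2048-3200PRO", "Corsair", 100),
                new Row("VS1GB400C3", "Corsair", 50),
                new Row("VDBDB1A16", "Corsair", 0));
        for (Map<String, Object> doc : collapse(rows)) {
            System.out.println(doc);
        }
    }
}
```

With this shape the extra fields ride along inside each result document, which is what makes the "custom SolrDocumentList set as 'results'" approach possible without a separate response section.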
          Noble Paul added a comment -

          We need to open a separate issue for the core related changes.

          Martijn van Groningen added a comment -

          I support your suggestion of splitting this issue into two, i.e. making the core changes a separate patch. That is the plan anyway.

          The changes in the core that should be in a separate patch are:

          1. SolrIndexSearcher
          2. DocSetHitCollector
          3. DocSetAwareCollector

          The above files were changed for the following reasons:

          1. The getDocSet(...) methods in SolrIndexSearcher did not allow me to specify a Lucene Collector, which I needed in order to get the uncollapsed docset while leveraging the Solr caches. I changed them so that I could do that.
          2. The patch also contains an extra getDocListAndSet(...) method that allows specifying a filter docset, which in the case of field collapsing is the collapsed docset.

          The QueryComponent has changed as well. The only reason these changes were made was to support the pseudo-distributed field collapsing. Maybe for distributed field collapsing a separate patch should be created with this change as a start. Last but not least, the SolrJ code. I think a separate patch should be created for these changes as well. Maybe for each patch a sub-issue should be created in Jira.

          The rest of the files in the patch do not impact any core files and I think should remain in one patch.

          Martijn van Groningen added a comment -

          ttdi,
          The latest patch is not in sync with the latest trunk. You can try to apply the patch to the trunk or use a previous patch for the 1.4 code.

          Yonik,
          The parameters description is a bit poor. The response format of the older patches contains two separate lists of collapse group counts: a list with counts per most relevant document id that is enabled or disabled with the collapse.info.doc param, and a second list with counts per field value of the most relevant document that is controlled with the collapse.info.count param. Now that the response format has changed we should rename them to something more descriptive. Maybe something like collapse.showCount that adds the collapse count to the collapse group in the response (defaults to true) and collapse.showFieldValue that adds the field value of the most relevant document to the group (defaults to false)?

          The collapse.maxdocs parameter specifies when to abort field collapsing, after n documents have been processed. I have never used it. I can imagine that one would use it to shorten the search time.

          The collapse.includeCollapsedDocs.fl parameter enables a collapse collector that collects the documents that have been discarded and outputs the specified fields of the discarded documents to the fieldcollapse response per collapse group (* for all fields). The parameter name does not reflect that behaviour entirely. Do you think that collapse.collectDiscardedDocuments.fl is better? However, personally I would not use this, because of the negative impact it has on performance. Usually one wants to know something like the average / highest / lowest price of a collapse group. The AggregateCollapseCollector would fit those needs better.

          Should I be able to specify a completely different sort within a group? collapse.sort=... seems nice... what are the implications? One bit of strangeness: it would seem to allow a highly ranked document responsible for the group being at the top of the list being dropped from the group due to a different sort criteria within the group. It's not necessarily an implementation problem though (sort values for the group should be maintained separately).

          I'm not sure about that. It would make things more complicated. Sorting the discarded documents in combination with the collapse.includeCollapsedDocs.fl functionality would maybe make more sense.

          The most basic question about the interface would be how to present groups. Do we stick with a linear document list and supplement that with extra info in a different part of the response (as the current approach does)? Or stick that extra info in with some of the documents somehow? Or if collapse=true, replace the list of documents with a list of groups, each of which can contain many documents? Which will be easiest for clients to deal with? If you were starting from scratch and didn't have to deal with any of Solr's current shortcomings, what would it look like?

          I think the latter would make more sense, because field-collapsing does change the search result. It would just make it more obvious.

          Is there a way to specify the number of groups that I want back instead of the number of documents?

          No there is not, but if the list of documents is replaced with a list of groups then the rows parameter should be used to indicate the number of groups to be displayed instead of the number of documents to be displayed.

          Just one thought I had about the algorithm you propose. If you only create collapse groups for the top ten documents, then what about the total count of the search? Unique documents outside the top ten documents are not being grouped (if I understand you correctly) and that would impact the total count compared with how it currently works.

          Yonik Seeley added a comment -

          First, thanks to everyone who has spent so much time working on this - lack of committer attention doesn't equate to lack of interest... this is a very much needed feature!

          I'd agree with Erik that the most important thing is the interface to the client, and making it well thought out and semantically "tight". Martijn's recent improvements to the response structure are an example of progress in this area. It's also important to think about the interface in terms of how easy it will be to add further features, optimizations, and support distributed search. If the code isn't sufficiently standalone, we also need to see how easily it fits into the rest of Solr (what APIs it adds or modifies, etc). Actually implementing performance improvements and more distributed search can come later - as long as we've thought about it now so we haven't boxed ourselves in.

          It seems like field collapsing should just be additional functionality of the query component rather than a separate component since it changes the results?

          The most basic question about the interface would be how to present groups. Do we stick with a linear document list and supplement that with extra info in a different part of the response (as the current approach does)? Or stick that extra info in with some of the documents somehow? Or if collapse=true, replace the list of documents with a list of groups, each of which can contain many documents? Which will be easiest for clients to deal with? If you were starting from scratch and didn't have to deal with any of Solr's current shortcomings, what would it look like?

          From the wiki:
          collapse.maxdocs - what does this actually mean? I assume it collects arbitrary documents up to the max (normally by index order)? Does this really make sense? Does it affect faceting, etc? If it does make sense, it seems like it would also make sense for normal non-collapsed query results too, in which case it should be implemented at that level.

          collapse.info.doc - what does that do? I understand counts per group, but what's count per doc?

          collapse.includeCollapsedDocs.fl - I don't understand this one, and can't find an example on the wiki or blogs. It says "Parameter indicating to return the collapsed documents in the response"... but I thought documents were included up until collapse.threshold.

          collapse.debug - should perhaps just be rolled into debugQuery, or another general debug param (someone recently suggested using a comma separated list... debug=timings,query, etc.)

          Should I be able to specify a completely different sort within a group? collapse.sort=... seems nice... what are the implications? One bit of strangeness: it would seem to allow a highly ranked document responsible for the group being at the top of the list being dropped from the group due to a different sort criteria within the group. It's not necessarily an implementation problem though (sort values for the group should be maintained separately).

          Is there a way to specify the number of groups that I want back instead of the number of documents? Or am I supposed to just over-request (rows=num_groups_I_want*threshold) and ignore if I get too many documents back?

          Random thought: We need a test to make sure this works with multi-select faceting (SimpleFacets asks for the docset of the base query...)

          Distributed Search: should be able to use the same type of algorithm that faceting does to ensure accurate counts.

          Performance: yes, it looks like the current code uses a lot of memory.
          Here's an algorithm that I thought of on my last plane ride that can do much better (assuming max() is the aggregation function):

          =================== two pass collapsing algorithm for collapse.aggregate=max ====================
          First pass: pretend that collapseCount=1
  - Use a TreeSet as a priority queue since one can remove and insert entries.
            - A HashMap<Key,TreeSetEntry> will be used to map from collapse group to top entry in the TreeSet
            - compare new doc with smallest element in treeset.  If smaller discard and go to the next doc.
  - If new doc is bigger, look up its group.  Use the Map to find if the group has been added to the TreeSet and add it if not.
            - If the new bigger doc is already in the TreeSet, compare with the document in that group.  If bigger, update the node,
              remove and re-add to the TreeSet to re-sort.
          
          efficiency: the treeset and hashmap are both only the size of the top number of docs we are looking at (10 for instance)
          We will now have the top 10 documents collapsed by the right field with a collapseCount of 1.  Put another way, we have the top 10 groups.
          
          Second pass (if collapseCount>1):
           - create a priority queue for each group (10) of size collapseCount
           - re-execute the query (or if the sort within the collapse groups does not involve score, we could just use the docids gathered during phase 1)
 - for each document, find its appropriate priority queue and insert
           - optimization: we can use the previous info from phase1 to even avoid creating a priority queue if no other items matched.
          
          So instead of creating collapse groups for every group in the set (as is done now?), we create it for only 10 groups.
          Instead of collecting the score for every document in the set (40MB per request for a 10M doc index is *big*) we re-execute the query if needed.
          We could optionally store the score as is done now... but I bet aggregate throughput on large indexes would be better by just re-executing.
          
          Other thought: we could also cache the first phase in the query cache which would allow one to quickly move to the 2nd phase for any collapseCount.
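
          The first pass of the algorithm above can be sketched concretely. This is a hedged, self-contained illustration (the class and field names are mine, not from any patch): a TreeSet ordered by score serves as a removable priority queue of the current top-N groups, and a HashMap points from the collapse-field value to that group's entry in the queue.

```java
import java.util.*;

// Sketch of the first pass of the two-pass collapsing algorithm described
// above (collapse.aggregate=max, collapseCount=1). Names are illustrative.
public class CollapseFirstPass {
    static class Entry {
        final String group;  // collapse field value
        int doc;             // best docid seen so far for this group
        float score;         // best score seen so far for this group
        Entry(String group, int doc, float score) {
            this.group = group; this.doc = doc; this.score = score;
        }
    }

    /** Returns the top-n groups, each represented by its best-scoring doc. */
    static List<Entry> topGroups(List<Entry> docs, int n) {
        // TreeSet ordered by score (ties broken by docid) is the priority queue.
        TreeSet<Entry> queue = new TreeSet<>(
            Comparator.comparingDouble((Entry e) -> e.score).thenComparingInt(e -> e.doc));
        Map<String, Entry> groupToEntry = new HashMap<>();

        for (Entry doc : docs) {
            Entry existing = groupToEntry.get(doc.group);
            if (existing != null) {
                // Group already tracked: if the new doc is bigger, update the
                // node, then remove and re-add to the TreeSet to re-sort.
                if (doc.score > existing.score) {
                    queue.remove(existing);
                    existing.doc = doc.doc;
                    existing.score = doc.score;
                    queue.add(existing);
                }
            } else if (queue.size() < n) {
                groupToEntry.put(doc.group, doc);
                queue.add(doc);
            } else if (doc.score > queue.first().score) {
                // New group beats the smallest tracked group: evict it.
                Entry smallest = queue.pollFirst();
                groupToEntry.remove(smallest.group);
                groupToEntry.put(doc.group, doc);
                queue.add(doc);
            }
            // Otherwise: smaller than the smallest element, discard.
        }
        return new ArrayList<>(queue.descendingSet());  // best group first
    }

    public static void main(String[] args) {
        List<Entry> docs = Arrays.asList(
            new Entry("a", 1, 0.9f), new Entry("b", 2, 0.8f),
            new Entry("a", 3, 0.95f), new Entry("c", 4, 0.5f),
            new Entry("d", 5, 0.7f));
        for (Entry e : topGroups(docs, 2)) {
            System.out.println(e.group + ":" + e.doc);  // prints a:3 then b:2
        }
    }
}
```

          As the comment notes, both structures stay at the size of the requested top-N, so memory is bounded by the page size rather than by the number of groups in the full result set.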
          
          Mark Miller added a comment -

          This is a huge difference. Considering the number of non-committers involved in this issue.

          It's not really any different than putting it in trunk. Non-committers can still post patches to the branch in JIRA, the same as if the issue was in trunk. Smaller, more focused patches. If there are no benefits to a branch in this regard, what is the argument for putting this in trunk for further dev? Might as well just stay in patch form until it's ready then.

          If your patch does not modify any existing files you never have to sync it w/ trunk. It is always synced.

          You have to apply the patch. With a branch you have to type a merge command. It's the same effort - a single command.

          Noble Paul added a comment -

The main difference I see is that it's easier for non-committers to share updated patches

This is a huge difference. Considering the number of non-committers involved in this issue.

          If your patch does not modify any existing files you never have to sync it w/ trunk. It is always synced.

          Mark Miller added a comment -

bq. On the other hand if the code lives in a branch it is more work to keep it synced w/ the trunk than the patch itself.

Is that true? Syncing a branch is the same as syncing a patch - non-conflicts are merged automatically and conflicts must be handled - same with a patch or a branch. And a patch gets out of date just as easily as a branch. The main difference I see is that it's easier for non-committers to share updated patches, whereas merging the branch will require the help of a committer if you want to share the merge with others. Anyone can check out the branch and merge with trunk though - it's literally the same effort as updating an out-of-date patch.

          Noble Paul added a comment -

Solr already has a few places where the response format is still marked as experimental and as subject to changes in the future ....

Marking the output format as experimental is just trying to be safe. We strive hard to ensure that we don't change it or, even if we do, that the change is not disruptive. So let us not take this as an excuse to be lax in the review of the public API.

On keeping a separate branch....

I would say a branch is less useful than a patch. If the patch applies to the trunk, I can be sure that I have the latest and greatest stuff. On the other hand, if the code lives in a branch it is more work to keep it synced w/ the trunk than the patch itself.

@Uri
I support your suggestion on splitting this issue into two, i.e. make the core changes in a separate patch. That is the plan anyway.

          Uri Boness added a comment -

Essentially it boils down to two options:

1. Keep it out of the trunk, in which case users that need this functionality will only get it by working with a patched Solr version of their own, or use a branch (in both cases, most likely they will miss the continuous work done on the trunk unless they keep on merging the changes)
2. Keep it in the trunk with some caveats, in which case users have a chance to use this functionality out of the box

In both cases, the user has a choice to make:

• be satisfied by the performance of this feature
• look for an alternative solution (other products)
• give up this functionality altogether (if their business requirements allow that)

So the main difference here, I would say, is in how easy you'd like to make providing this functionality to the users. On the Solr development side, indeed once this is committed to the trunk there's much more responsibility on the committers to make it work (enhance performance and fix bugs)... but this is a good thing, as there is a high demand for this feature and as a community-driven project this demand should be satisfied. And I do think that the number of users using this patch already is a good indicator that it is good enough for quite a lot of use cases.

I do agree though that before committing anything, the public API should be re-evaluated to minimize chances for BWC issues later on. BTW, regarding the response, Solr already has a few places where the response format is still marked as experimental and as subject to changes in the future (but it doesn't stop people from using this functionality, as they take the responsibility to adapt to any such future changes when they come).

Now... writing this, it suddenly occurred to me that there might be another solution to this whole discussion which is in a way a combination of many of the suggestions in this thread. What if this patch were split in two: the changes to the core and the component itself? Now, if the changes to the core are not that drastic and make sense (or at least everyone can live with them) then perhaps they can be committed to the trunk. As for the rest of the patch (which consists of the search component and its other supporting classes), this can be put in SVN as a separate branch for contrib. The good thing about this solution is that the work done on this functionality will be in SVN, so you benefit from it as David mentioned above. The other benefit is that with this layout you can actually build the branched code base separately and distribute this functionality as a separate jar which can be deployed in a Solr 1.5x distribution. Again, a bit of work is left to the users (too much to my taste) but at least they're not forced to use a patched version of Solr. Would that be a possible solution?

          Patrick Eger added a comment -

Hi, possibly not important, but I would like to give my perspective as a user. Specifically, the code is very much production-ready in our opinion, albeit under a limited set of circumstances that we are comfortable with (< 5 million docs, no distributed search). Within those confines it works great and satisfies our needs, and we are more than willing to pay the performance hit since it's absolutely essential to the correct functionality. I suppose I'd disagree with the assertion that the performance is "unacceptable", as I think that is a value judgement each user will have to make.

Modulo the discussion about the request format, output format and config (stuff that is hard to change later), I would much rather have the code be in and documented with those caveats clearly spelled out and probably tracked in separate JIRA issues. IE DO NOT USE IF SHARDING, >5 million docs, etc, etc. Again, just my 2c as a satisfied user.

          Grant Ingersoll added a comment -

          I'm not sold on the output yet, either. Have we considered it being inline? We're getting more and more parallel arrays we need to consider. I think with the other Solr issues that are looking at pseudo-fields and the ability for components to add results, that we could rework these things.

          Also, why don't the aggregate functions just work w/ all the existing functions?

          Noble Paul added a comment -

The main problem with the patch is that the performance/resource consumption is unacceptable.

• Is it true that the perf cost is unavoidable?
• Or are there implementation details which can be optimized?

We are working to make it ready for trunk. So anything that helps us move towards that objective is welcome.

          Mark Miller added a comment - - edited

          I very much disagree with a policy blocking non-production-ready code from being in source control

Just to be clear, there is no such policy that I've seen - each decision just comes down to consensus. And as far as I know, our branch policy is pretty much "anything goes" - trunk is very different than svn. Anyone (anyone with access to svn, that is) can play around with a branch for anything if they want.

I agree with your thoughts on a branch - if the argument is that we want it to be easier for devs to check out and work on this, or for users to check out and build this without applying patches, why not just make a branch? Merging is annoying but not difficult - I've been doing plenty of branch merging lately, and while it's not glorious work, modern tools make it more of a grind than a challenge.

          David Smiley added a comment -

I've been watching this thread forever without saying anything, but I want to offer my two cents and then I'll butt out.

I very much disagree with a policy blocking non-production-ready code from being in source control. All code starts off this way and it would be quite a shame not to leverage the advantages of source control simply because it isn't ready yet. If people are uncomfortable with it being in trunk then simply use a branch. Of course, how simple "simple" is depends on one's comfort with source control, the particular source control technology used, and the tools that help you (e.g. IDEs). By the way, git makes "feature branches" (which is what this would be) easy to manage and integrates bidirectionally with subversion. If you're not comfortable with branching because you're not familiar with it then you need to learn. By "you" I don't mean anyone in particular, I mean all professional software developers. Source control and branching are tools of our trade.

          Mark Miller added a comment - - edited

(Faceting got a 50 times perf boost in 1.4)

          No it didn't. Certain cases have gotten a boost (I think you might be referring to multi-valued field faceting cases?). And general faceting was always relatively fast and scalable.

          I'm against committing features to trunk with a warning that the feature is not ready for trunk.

          Noble Paul added a comment -

          This patch has quite a resource/performance hit. I've seen and read about the resource hit. Its rather large.

The performance price is paid only if you use this component. Having the functionality itself in Solr is quite important. Performance can obviously be improved. (Faceting got a 50 times perf boost in 1.4.) As long as the performance of the component is within an acceptable range, we should leave that call to the user. The cost actually depends on the data set too.

          As long as the component has a correct public API (req params/response format/configuration) I believe it can be committed with a clear warning.

          Mark Miller added a comment -

          I'm with Grant on this one. Trunk is not a sandbox, and getting more developer attention is not a good reason to put something in trunk. Issues should go in when they are ready.

          Tons of interest and votes doesn't mean rush to trunk - if that type of thing moves you, it means start putting some work into it to make it ready for trunk.

This patch has quite a resource/performance hit. I've seen and read about the resource hit. It's rather large. The performance hit is no better: the linked-to blog marks performance with collapsing as 5-10 times slower than without.

          Personally, I don't think this issue is ready for trunk.

          Uri Boness added a comment -

          I'm curious as to whether anyone has just thought of using the Clustering component for this? If your "collapse" field was a single token, I wonder if you would get the results you're looking for.

The main difference between the two components is that while clustering works more as a function, where the input is the doclist/docset and the output is a separate data structure representing the groups, the collapse component operates directly on the docset & doclist, modifies them, and incorporates the groups within the final search result.

          In all occurrences where we found the need for the collapse component, we needed to incorporate the grouping within the search result, and adjust the sorting and the pagination accordingly. As far as I know you cannot do that with the clustering component. This tight integration with the result is also the reason why the collapse component right now is actually a replacement to the query component.

          Grant Ingersoll added a comment -

          I'm curious as to whether anyone has just thought of using the Clustering component for this? If your "collapse" field was a single token, I wonder if you would get the results you're looking for.

          Martijn van Groningen added a comment -

          For Shalin:

          I just don't think that we should introduce new tags and new kinds of components in solrconfig.xml, particularly those that are useful to only a single component. That introduces changes in SolrConfig.java so that it knows how to load such things. That is why I moved that configuration inside CollapseComponent. Ideally, all components will use PluginInfo and load whatever they need from their own PluginInfo object and SolrConfig would not need to be changed unless we introduce new kinds of Solr plugins.

          I agree about the PluginInfo and I think it is the right place for field collapse config.

          Just curious, what would be a use-case for sharing factories (other than reducing duplication of configuration) and having multiple CollapseComponent?

Besides differently configured CollapseCollectorFactories, none.

          I don't think we need to add that functionality to CoreContainer and SolrDispatchFilter. It is still possible to specify a different solrconfig and schema for a test. Let me see if I can make this work with BaseDistributedSearchTestCase

          That would be great!

          Shalin Shekhar Mangar made changes -
          Attachment SOLR-236.patch [ 12428420 ]
          Shalin Shekhar Mangar added a comment -

          Changes:

1. Modified configuration as Noble suggested. The AggregateCollapseCollectorFactory is now PluginInfoInitialized instead of NamedListInitialized, and the functions are plugins. The "name" attribute is removed from "collapseCollectorFactory" since it is no longer necessary:
  <searchComponent name="collapse" class="org.apache.solr.handler.component.CollapseComponent">
    <collapseCollectorFactory class="solr.fieldcollapse.collector.DocumentGroupCountCollapseCollectorFactory" />
    <collapseCollectorFactory class="solr.fieldcollapse.collector.FieldValueCountCollapseCollectorFactory" />
    <collapseCollectorFactory class="solr.fieldcollapse.collector.DocumentFieldsCollapseCollectorFactory" />
    <collapseCollectorFactory class="org.apache.solr.search.fieldcollapse.collector.AggregateCollapseCollectorFactory">
      <function name="sum" class="org.apache.solr.search.fieldcollapse.collector.aggregate.SumFunction"/>
      <function name="avg" class="org.apache.solr.search.fieldcollapse.collector.aggregate.AverageFunction"/>
      <function name="min" class="org.apache.solr.search.fieldcollapse.collector.aggregate.MinFunction"/>
      <function name="max" class="org.apache.solr.search.fieldcollapse.collector.aggregate.MaxFunction"/>
    </collapseCollectorFactory>
    <fieldCollapseCache class="solr.FastLRUCache"
                        size="512"
                        initialSize="512"
                        autowarmCount="128"/>
  </searchComponent>
            
          2. Changed DistributedFieldCollapsingIntegrationTest to use BaseDistributedSearchTestCase. This fails right now. I believe there is a bug with the distributed implementation. The distributed version returns one extra group when compared to the non-distributed version. I've put an @Ignore annotation on that test.

          We can consider creating the functions through a factory so that they can accept initialization parameters. The schema-fieldcollapse.xml and solrconfig-fieldcollapse.xml are no longer necessary and can be removed.
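As a rough illustration of the function-plugin shape being discussed, a minimal sketch follows. The interface name and method signatures here are assumptions made up for illustration; the real SumFunction/MaxFunction/etc. classes live in the attached patch and may look quite different:

```java
/** Illustrative sketch only: the interface and method names below are
 *  assumptions, not the patch's actual API. Each function is fed one
 *  field value per document in a collapse group. */
interface AggregateFunction {
    void add(double fieldValue);   // accumulate a value from one collapsed document
    double result();               // the aggregate for the whole group
}

class MaxFunction implements AggregateFunction {
    private double max = Double.NEGATIVE_INFINITY;
    public void add(double v) { if (v > max) max = v; }
    public double result() { return max; }
}

class SumFunction implements AggregateFunction {
    private double sum;
    public void add(double v) { sum += v; }
    public double result() { return sum; }
}

public class AggregateSketch {
    public static void main(String[] args) {
        AggregateFunction max = new MaxFunction();
        AggregateFunction sum = new SumFunction();
        for (double v : new double[] {1.0, 4.0, 2.5}) {
            max.add(v);
            sum.add(v);
        }
        System.out.println("max=" + max.result() + " sum=" + sum.result());
    }
}
```

A factory per function (as suggested) would let each instance accept initialization parameters from its own plugin configuration rather than being constructed with a no-arg constructor.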

          Next steps:

          1. Let us open issues for all the modifications needed in Solr to support this feature. That will help us break down this patch into more manageable (and easily reviewable) pieces. I guess we need one for providing custom Collectors for SolrIndexSearcher methods. Any others?
          2. The response format is not very clear in the wiki. We should add more examples and explain the format.
          Shalin Shekhar Mangar added a comment -

          For Martijn:

The reason I added <fieldCollapsing> ... </fieldCollapsing> was to be able to support sharing of collapseCollectorFactory instances between different collapse components in the near future. Do you think that is a valid reason for that? Or do you think that collapseCollectorFactories shouldn't be shared?

          I just don't think that we should introduce new tags and new kinds of components in solrconfig.xml, particularly those that are useful to only a single component. That introduces changes in SolrConfig.java so that it knows how to load such things. That is why I moved that configuration inside CollapseComponent. Ideally, all components will use PluginInfo and load whatever they need from their own PluginInfo object and SolrConfig would not need to be changed unless we introduce new kinds of Solr plugins.

          Just curious, what would be a use-case for sharing factories (other than reducing duplication of configuration) and having multiple CollapseComponent?

          The CollapseComponentTest was failing. The field collapseCollectorFactories in CollapseComponent was null when not specifying any collapse collector factories in the solrconfig.xml, which resulted in an NPE.

          Oops, sorry about that. I only ran the tests inside org.apache.solr.search.fieldcollapse. I didn't notice there are other tests too. Thanks!

          The DistributedFieldCollapsingIntegrationTest is still failing, because you left out changes in JettySolrRunner, CoreContainer and SolrDispatchFilter from my original patch.

          I don't think we need to add that functionality to CoreContainer and SolrDispatchFilter. It is still possible to specify a different solrconfig and schema for a test. Let me see if I can make this work with BaseDistributedSearchTestCase.

          Noble Paul added a comment -

          I think that is all the more reason why it needs to be done right and not just be a "good start".

          The fact that it has been around for so long means that the "good start" is gonna take even longer to happen. In my opinion, we should fix the obvious stuff and commit this with a clear warning in the javadocs and wiki that this has performance issues and that the code/API/configuration may change incompatibly in the future.

          Committed stuff I'll try out more readily than patches, actually.

          +1 There is a better chance of developers taking a look at it if it is already in the trunk.

          Martijn van Groningen made changes -
          Attachment SOLR-236.patch [ 12428365 ]
          Martijn van Groningen added a comment -

          Shalin, I have updated your patch.

          1. The CollapseComponentTest was failing. The field collapseCollectorFactories in CollapseComponent was null when not specifying any collapse collector factories in the solrconfig.xml, which resulted in an NPE.
          2. Removed a system.out that I accidentally added in my previous patch.

          The DistributedFieldCollapsingIntegrationTest is still failing, because you left out the changes to JettySolrRunner, CoreContainer and SolrDispatchFilter from my original patch. That allowed me to specify a different schema file for this particular test. I think it is important for the test coverage to have this test. Should I add the fields of the schema-fieldcollapse.xml to the schema.xml that the other tests use? The test should then succeed.

          Grant Ingersoll added a comment -

          Grant, this patch may not be perfect but I think we all agree that it is a great start. This is stable, used by many and has been well supported by the community. This is also a large patch and as I have known from my DataImportHandler experience, maintaining a large patch is quite a pain (and DataImportHandler didn't even touch the core). How about we commit this (after some review, of course), mark this as experimental (no guarantees of any sort) and then start improving it one issue at a time? Alternately, if you are not comfortable adding it to trunk, we can commit this on a branch and merge into trunk later.

          Which is why it should not go in unless it is ready. Adding a large patch that isn't right just b/c it's been around for a while and is "hard to maintain" is no reason to just go commit something. The problem w/ committing something that isn't ready is that we then have to do even more work to maintain it, taking away from the opportunity to make it better.

          As for the voting and the popularity, I think that is all the more reason why it needs to be done right and not just be a "good start". With this many eyes on it, it should be easy to get people testing it and giving feedback.

          If the issue is that the patch is too big, then perhaps it needs to be broken up into smaller pieces that lay the framework for field collapsing to work.

          Erik Hatcher added a comment - edited

          I'll just add my 0,02€ - the main thing to vet, now that it works (first make it work), is the interface to the client. Are the request params ideal? Is the response data structure locked down? If so, get this committed ASAP and iterate on the internals of the distributed and performance issues (then make it right).

          Admittedly, I've not tried this feature out myself though. Committed stuff I'll try out more readily than patches, actually.

          Martijn van Groningen added a comment -

          I have updated the response examples on the wiki.

          Some time ago I tried to come up with an accurate distributed solution, but I ran into a problem, as I described in a previous comment:

          ....
          Field collapsing keeps track of the number of documents collapsed per unique field value and the total count of documents encountered per unique field value. If the total count is greater than the specified collapse threshold, then the number of documents collapsed is the difference between the total count and the threshold. Let's say we have two shards, and each shard has one document with the same field value. The collapse threshold is one, meaning that if we run the collapsing algorithm on each shard individually, neither document will ever be collapsed. But when the algorithm is applied to both shards together, one of the documents must be collapsed; however, neither shard knows that its document is the one to collapse.

          There are more situations like the one described above, but it all boils down to the fact that each shard does not have meta information about the other shards in the cluster. Sharing the intermediate collapse results between the shards is, in my opinion, not an option, because you would also need to share information about documents/fields that have a collapse count of zero. That is totally impractical for large indexes.
          ....
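          The per-value arithmetic in the quoted paragraph can be sketched as follows (the class and method names are illustrative, not part of the patch). It shows why per-shard counts and the combined count disagree at a threshold of one:

```java
public class CollapseCountSketch {
    // Documents collapsed for one field value: whatever exceeds the threshold.
    static int collapsedCount(int totalCount, int threshold) {
        return Math.max(0, totalCount - threshold);
    }

    public static void main(String[] args) {
        int threshold = 1;
        // Two shards, each holding a single document with the same field value.
        int shardA = 1, shardB = 1;
        // Collapsing per shard: neither shard sees a surplus, nothing collapses.
        System.out.println(collapsedCount(shardA, threshold)); // 0
        System.out.println(collapsedCount(shardB, threshold)); // 0
        // Collapsing over the combined index: one document must collapse,
        // but neither shard can tell whether its document is the one.
        System.out.println(collapsedCount(shardA + shardB, threshold)); // 1
    }
}
```

          The gap between the per-shard results (0 + 0) and the combined result (1) is exactly the missing cross-shard meta information described above.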

          I'm really curious how others have addressed this issue. I have not stumbled on any literature about this particular issue; maybe someone else has.

          Uri Boness added a comment -

          Grant, this patch may not be perfect but I think we all agree that it is a great start. This is stable, used by many and has been well supported by the community. This is also a large patch and as I have known from my DataImportHandler experience, maintaining a large patch is quite a pain (and DataImportHandler didn't even touch the core). How about we commit this (after some review, of course), mark this as experimental (no guarantees of any sort) and then start improving it one issue at a time? Alternately, if you are not comfortable adding it to trunk, we can commit this on a branch and merge into trunk later.

          I think managing a separate branch will be just as hard as managing a patch. I do, however, agree that it's about time this patch was committed to the trunk. Even though the current solution is not scalable in terms of distributed search (and I agree that the current approach for that is not really viable), many are already using it, and it is the most voted-for feature in JIRA after all. One thing you can do is apply the changes to the core (which are not really many) and commit the rest of the patch as a contrib (along with all the disclaimers Shalin mentioned above).

          Shalin Shekhar Mangar added a comment -

          I'd define large scale for this in a couple of ways:
          1. Lots of docs in the result set (10K+)
          2. Lots of overall docs (100M+)
          3. Lots of queries (> 10 QPS)

          Grant, this patch may not be perfect but I think we all agree that it is a great start. This is stable, used by many and has been well supported by the community. This is also a large patch and as I have known from my DataImportHandler experience, maintaining a large patch is quite a pain (and DataImportHandler didn't even touch the core). How about we commit this (after some review, of course), mark this as experimental (no guarantees of any sort) and then start improving it one issue at a time? Alternately, if you are not comfortable adding it to trunk, we can commit this on a branch and merge into trunk later.

          What do you think?

          Oleg Gnatovskiy added a comment -

          Grant - I agree regarding the current distributed implementation. The implementation is pretty much pseudo-distributed and would cause many companies (ours included) to have to completely restructure their indexes. What we tried long ago was to have the process method on each shard return the id that is being collapsed on, along with the documentId and score. Then, in mergeIds, we would do another level of collapsing - basically keeping only one of the documents with a given collapseId and removing the others from all other shards.

          Obviously this caused several problems, not the least of which was that facet counts would always be slightly off, since we might have removed a document that had already been counted by the facetComponent.
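          A minimal sketch of the second-level collapse described above, assuming each shard returns (documentId, collapseId, score) tuples; ShardDoc and collapseAcrossShards are hypothetical names, not the patch's actual API. The merged list keeps only the best-scoring document per collapseId:

```java
import java.util.*;

public class MergeIdsCollapseSketch {
    // Hypothetical stand-in for a shard hit: (documentId, collapseId, score).
    record ShardDoc(String docId, String collapseId, float score) {}

    // Second level of collapsing during merge: keep only the best-scoring
    // document per collapseId across the merged results from all shards.
    static List<ShardDoc> collapseAcrossShards(List<ShardDoc> merged) {
        Map<String, ShardDoc> best = new LinkedHashMap<>();
        for (ShardDoc d : merged) {
            best.merge(d.collapseId(), d,
                    (kept, cand) -> kept.score() >= cand.score() ? kept : cand);
        }
        List<ShardDoc> out = new ArrayList<>(best.values());
        out.sort((a, b) -> Float.compare(b.score(), a.score()));
        return out;
    }

    public static void main(String[] args) {
        List<ShardDoc> merged = List.of(
                new ShardDoc("shard1/doc1", "site-a", 0.9f),
                new ShardDoc("shard2/doc7", "site-a", 0.7f), // same group, dropped
                new ShardDoc("shard2/doc3", "site-b", 0.8f));
        System.out.println(collapseAcrossShards(merged)); // two docs survive
    }
}
```

          As the comment notes, any document dropped at this stage was already counted by the facet component on its shard, which is exactly why the facet counts drift.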

          Grant Ingersoll added a comment -

          I think you are also referring to sharding. Sharding is supported, but not in a very elegant way. You will need to partition your documents across your shards in such a way that all documents belonging to a collapse group appear on one shard. To be honest, I have never tested the patch on a corpus of 100M docs.

          That doesn't seem good and I don't think it will work w/ all the distributed work going on. I will likely have some time next week to help out. Has anyone looked at how Google or others do this? Clearly they collapse at very large scale w/ no noticeable detrimental effect. Anyone looked at the literature on this?

          The first two response examples are for 'old' patches. The last response example is for the more recent patches (and current patch).

          OK, good to know. Can you update the page to reflect the latest patch?

          Martijn van Groningen added a comment -

          Shalin.
          1. This configuration also looks fine to me. The reason I added <fieldCollapsing> ... </fieldCollapsing> was to be able to support sharing of collapseCollectorFactory instances between different collapse components in the near future. Do you think that is a valid reason? Or do you think that collapseCollectorFactories shouldn't be shared?
          2. I forgot to create that, so a good thing you added it.
          3. I think leaving out those changes will make the distributed integration tests fail (Haven't checked it).

          Noble.
          1. The reason I gave a name to collapseCollectorFactory was so that an instance could be used twice for different collapse components.
          2. Moving the classname to the class attribute looks better than putting it in the function element, so I think we should change that.

          Grant.
          1. I think you are also referring to sharding. Sharding is supported, but not in a very elegant way. You will need to partition your documents across your shards in such a way that all documents belonging to a collapse group appear on one shard. To be honest, I have never tested the patch on a corpus of 100M docs.
          2. Field collapsing can impact the search time in a very negative way. I wrote a small paragraph about it on my blog.
          3. The first two response examples are for 'old' patches. The last response example is for the more recent patches (and current patch).
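          The partitioning requirement in point 1 amounts to routing documents by the collapse field. A minimal sketch, where shardFor is a hypothetical helper (not part of Solr or the patch):

```java
public class CollapseGroupRouting {
    // Route by the collapse field's value so that every document belonging
    // to the same collapse group lands on the same shard.
    static int shardFor(String collapseValue, int numShards) {
        // Math.floorMod guards against negative hashCode values.
        return Math.floorMod(collapseValue.hashCode(), numShards);
    }

    public static void main(String[] args) {
        int numShards = 4;
        // Two documents sharing a collapse value are assigned the same shard,
        // so per-shard collapsing sees the whole group.
        System.out.println(shardFor("example.com", numShards)
                == shardFor("example.com", numShards)); // true
    }
}
```

          With such routing in place, each shard can collapse its own groups correctly; the trade-off is that the index layout is now dictated by one field.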

          Grant Ingersoll added a comment -

          Is there a typo on the http://wiki.apache.org/solr/FieldCollapsing page with regard to the outputs? There are two different output results, but the URLs for the examples are the same. See http://wiki.apache.org/solr/FieldCollapsing#Examples. I think the second one is intended to show a collapse count for fields?

          Also, I'm not sold on having collapse elements separate from the actual response (but I know other things do it too, so it isn't a huge deal); the list of "parallel arrays" that one needs to traverse in order to render results is growing (highlighter, MLT, now Field Collapsing).

          Grant Ingersoll added a comment -

          I'd define large scale for this in a couple of ways:
          1. Lots of docs in the result set (10K+)
          2. Lots of overall docs (100M+)
          3. Lots of queries (> 10 QPS)

          Stephen Weiss added a comment -

          How do we define "large scale"? I have an index of about 5 million docs. Does that qualify? I'm working on it right now, I can run whatever benchmarks you like.

          Grant Ingersoll added a comment -

          Does anybody have a reason for why this should not be committed to trunk as it stands right now?

          It's been a while, but the last time I looked at it (3-4 mos. ago) I had the impression that it wouldn't scale. Has anyone benchmarked this at large scale?
