Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 1.3
    • Fix Version/s: 3.3
    • Component/s: search
    • Labels: None

      Description

      This patch includes a new feature called "field collapsing".

      "Used in order to collapse a group of results with similar value for a given field to a single entry in the result set. Site collapsing is a special case of this, where all results for a given web site is collapsed into one or two entries in the result set, typically with an associated "more documents from this site" link. See also Duplicate detection."
      http://www.fastsearch.com/glossary.aspx?m=48&amid=299

      The implementation adds three new query parameters (SolrParams), illustrated in the sketch below:
      "collapse.field" to choose the field used to group results
      "collapse.type": normal (default value) or adjacent
      "collapse.max" to select how many consecutive results are allowed before collapsing

      TODO (in progress):

      • More documentation (on source code)
      • Test cases

      Two patches:

      • "field_collapsing.patch" for current development version
      • "field_collapsing_1.1.0.patch" for Solr-1.1.0

      P.S.: Feedback and misspelling corrections are welcome.

      1. collapsing-patch-to-1.3.0-dieter.patch
        26 kB
        dieter grad
      2. collapsing-patch-to-1.3.0-ivan_2.patch
        24 kB
        Iván de Prado
      3. collapsing-patch-to-1.3.0-ivan_3.patch
        24 kB
        Iván de Prado
      4. collapsing-patch-to-1.3.0-ivan.patch
        24 kB
        Iván de Prado
      5. DocSetScoreCollector.java
        5 kB
        Peter Karich
      6. field_collapsing_1.1.0.patch
        12 kB
        Emmanuel Keller
      7. field_collapsing_1.3.patch
        14 kB
        Emmanuel Keller
      8. field_collapsing_dsteigerwald.diff
        25 kB
        Oleg Gnatovskiy
      9. field_collapsing_dsteigerwald.diff
        25 kB
        Charles Hornberger
      10. field_collapsing_dsteigerwald.diff
        25 kB
        Doug Steigerwald
      11. field-collapse-3.patch
        52 kB
        Martijn van Groningen
      12. field-collapse-4-with-solrj.patch
        66 kB
        Martijn van Groningen
      13. field-collapse-5.patch
        254 kB
        Martijn van Groningen
      14. field-collapse-5.patch
        253 kB
        Martijn van Groningen
      15. field-collapse-5.patch
        251 kB
        Martijn van Groningen
      16. field-collapse-5.patch
        244 kB
        Martijn van Groningen
      17. field-collapse-5.patch
        239 kB
        Martijn van Groningen
      18. field-collapse-5.patch
        218 kB
        Martijn van Groningen
      19. field-collapse-5.patch
        218 kB
        Martijn van Groningen
      20. field-collapse-5.patch
        216 kB
        Martijn van Groningen
      21. field-collapse-5.patch
        144 kB
        Martijn van Groningen
      22. field-collapse-5.patch
        146 kB
        Martijn van Groningen
      23. field-collapse-5.patch
        136 kB
        Martijn van Groningen
      24. field-collapse-5.patch
        134 kB
        Martijn van Groningen
      25. field-collapse-5.patch
        134 kB
        Martijn van Groningen
      26. field-collapse-5.patch
        133 kB
        Martijn van Groningen
      27. field-collapse-5.patch
        122 kB
        Martijn van Groningen
      28. field-collapse-solr-236.patch
        49 kB
        Martijn van Groningen
      29. field-collapse-solr-236-2.patch
        52 kB
        Martijn van Groningen
      30. field-collapsing-extended-592129.patch
        31 kB
        Karsten Sperling
      31. NonAdjacentDocumentCollapser.java
        21 kB
        Peter Karich
      32. NonAdjacentDocumentCollapserTest.java
        9 kB
        Peter Karich
      33. quasidistributed.additional.patch
        1 kB
        Michael Gundlach
      34. SOLR-236_collapsing.patch
        25 kB
        Thomas Traeger
      35. SOLR-236_collapsing.patch
        26 kB
        Dmitry Lihachev
      36. solr-236.patch
        24 kB
        Bojan Smid
      37. SOLR-236.patch
        27 kB
        Yonik Seeley
      38. SOLR-236.patch
        245 kB
        Martijn van Groningen
      39. SOLR-236.patch
        244 kB
        Martijn van Groningen
      40. SOLR-236.patch
        252 kB
        Shalin Shekhar Mangar
      41. SOLR-236.patch
        251 kB
        Martijn van Groningen
      42. SOLR-236.patch
        257 kB
        Shalin Shekhar Mangar
      43. SOLR-236.patch
        245 kB
        Martijn van Groningen
      44. SOLR-236.patch
        253 kB
        Shalin Shekhar Mangar
      45. SOLR-236-1_4_1.patch
        264 kB
        Martijn van Groningen
      46. SOLR-236-1_4_1-NPEfix.patch
        0.7 kB
        Cameron
      47. SOLR-236-1_4_1-paging-totals-working.patch
        264 kB
        Stephen Weiss
      48. SOLR-236-branch_3x.patch
        258 kB
        Doug Steigerwald
      49. SOLR-236-distinctFacet.patch
        2 kB
        Bill Bell
      50. SOLR-236-FieldCollapsing.patch
        18 kB
        Emmanuel Keller
      51. SOLR-236-FieldCollapsing.patch
        18 kB
        Ryan McKinley
      52. SOLR-236-FieldCollapsing.patch
        16 kB
        Ryan McKinley
      53. SOLR-236-trunk.patch
        259 kB
        Martijn van Groningen
      54. SOLR-236-trunk.patch
        256 kB
        Martijn van Groningen
      55. SOLR-236-trunk.patch
        250 kB
        Martijn van Groningen
      56. SOLR-236-trunk.patch
        247 kB
        Martijn van Groningen
      57. SOLR-236-trunk.patch
        236 kB
        Martijn van Groningen

        Issue Links

        1. Provide an API to specify custom Collectors Sub-task Resolved Unassigned
        2. Fieldcollapse SolrJ code Sub-task Closed Unassigned
        3. Implement CollapseComponent Sub-task Closed Shalin Shekhar Mangar
        4. Distributed field collapsing Sub-task Closed Unassigned
        5. Refactor QueryComponent for easy extensibility Sub-task Resolved Shalin Shekhar Mangar
        6. Support fixing the number of shards in BaseDistributedTestCase Sub-task Resolved Shalin Shekhar Mangar
        7. Search Grouping: single doclist format Sub-task Resolved Unassigned
        8. Search Grouping: support highlighting Sub-task Closed Unassigned
        9. Search Grouping: support explain (debugQuery) Sub-task Resolved Unassigned
        10. Search Grouping: support distributed search Sub-task Closed Unassigned
        11. Search Grouping: CSV response writer Sub-task Open Unassigned
        12. Search Grouping: collapse by string specialization Sub-task Closed Unassigned
        13. Search Grouping: intermediate caches Sub-task Open Unassigned
        14. Search Grouping: single pass implementation Sub-task Open Unassigned
        15. Search Grouping: unlikely collision implementation Sub-task Open Unassigned
        16. Search Grouping: expand group sort options Sub-task Open Unassigned
        17. Search Grouping: SolrJ support Sub-task Resolved Unassigned
        18. Search Grouping: Facet support Sub-task Closed Unassigned
        19. Search Grouping: Group by query (like facet.query) Sub-task Resolved Unassigned
        20. Add grouping support to Velocity UI Sub-task Open Erik Hatcher
        21. Externalizing groupValue values Sub-task Closed Unassigned
        22. Grouping treats null values as equivalent to 0 or an empty string Sub-task Resolved Unassigned
        23. Grouping performance improvements Sub-task Closed Unassigned
        24. Search Grouping: random testing Sub-task Resolved Unassigned

          Activity

          Emmanuel Keller added a comment -

          Field Collapsing

          Emmanuel Keller added a comment -

          Replacing HashDocSet with BitDocSet for hasMoreResult, for better performance.

          Ryan McKinley added a comment -

          This looks good. Someone with better lucene chops should look at the IndexSearcher getDocListAndSet part...

          A few comments/questions about the interface:

          If you apply all the example docs and hit:
          http://localhost:8983/solr/select/?q=*:*&collapse=true

          you get a 500 error. We should use params.required().get( "collapse.field" ) to produce a nicer error:
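
          For illustration, a minimal self-contained sketch of that check (assuming the org.apache.solr.common.params classes; not code from the patch):

          import org.apache.solr.common.params.ModifiableSolrParams;

          public class RequiredParamExample {
              public static void main(String[] args) {
                  ModifiableSolrParams params = new ModifiableSolrParams();
                  params.set("q", "*:*");
                  // No collapse.field was set, so required().get(...) throws a SolrException
                  // with a 400 (bad request) code and a descriptive message, instead of the
                  // handler failing later with an unexplained 500.
                  String collapseField = params.required().get("collapse.field");
                  System.out.println(collapseField);
              }
          }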

          With:
          http://localhost:8983/solr/select/?q=*:*&collapse=true&collapse.field=manu&collapse.max=1

          the collapse info at the bottom says:

          <lst name="collapse_counts">
          <int name="has_more_results">3</int>
          <int name="has_more_results">5</int>
          <int name="has_more_results">9</int>
          </lst>

          what does that mean? How would you use it? How does it relate to the <result> docs?

          Emmanuel Keller added a comment -

          My turn to miss something.
          You are right, we have to use params.required().get("collapse.field").

          About collapse info:
          <int name="has_more_results">3</int> means that the third doc of the result has been collapsed and that some consecutive results having the same field value have been removed.

          Yonik Seeley added a comment -

          Thanks for looking into this Emmanuel.
          It appears as if this only collapses adjacent documents, correct?

          We should really try to get everyone on the same page... hash out the exact semantics of "collapsing", and the most useful interface. An efficient implementation can follow.

          A good starting point might be here:

          Yonik Seeley added a comment -

          A good starting point might be here: http://www.nabble.com/result-grouping--tf2910425.html#a8131895
          Hide
          Emmanuel Keller added a comment -

          Yonik,

          You are right, only adjacent documents are collapsed.
          I work on a large index (2,000,000 documents) growing every day. The first goal was to group results, preserving score ranking and achieving good performance. This "light" implementation meets our needs.
          I am currently working on a second implementation taking care of the semantics.

          P.S.: Congratulations on this great application.

          Emmanuel Keller added a comment -

          This release conforms more closely to the semantics of "field collapsing".

          Parameters are:

          collapse=true // enable collapsing
          collapse.field=[field] // indexed field used for collapsing
          collapse.max=[integer] // start collapsing after n documents
          collapse.type=[normal|adjacent] // default value is "normal"

          • "adjacent" collapses only consecutive documents.
          • "normal" collapses all documents having an equal value in the collapsing field (see the sketch below).
          Emmanuel Keller added a comment -

          Corrects a bug in the previous version when using a value greater than 1 for the collapse.max parameter.

          Otis Gospodnetic added a comment -

          Question:
          Do you need collapse=true when you can detect whether collapse.field has been specified or not?

          Emmanuel Keller added a comment -

          You're right. As collapse.field is a required parameter, we don't need more information. My first idea was to copy the behavior of faceting.

          Emmanuel Keller added a comment -

          The latest version of the patch.

          • Results are now cached using "CollapseCache" (a new SolrCache instance added in solrconfig.xml)
          • The "collapse" parameter has been removed.

          This version has been fully tested.

          Feedback is welcome.

          Emmanuel Keller added a comment -

          I still maintain a version for the 1.1.0 release (the version we use in our production environment).

          Ryan McKinley added a comment -

          I updated the patch so that it applies cleanly to trunk. While I was at it, I:

          • fixed a few spelling errors
          • made "collapse.type" parameter parsing throw an error if the passed value is unknown (rather than quietly using 'normal')
          • changed the patch name to include the issue number; as we update the patch, use this same name again so it is easy to tell which is the most current.

          I also made a wiki page so there are direct links to interesting queries:
          http://wiki.apache.org/solr/FieldCollapsing

          • - - - - - -

          Again, I will leave any discussion about the Lucene implementation to others more qualified and will just focus on the response interface.

          Currently if you send the query:
          http://localhost:8983/solr/select/?q=*:*&collapse.field=cat&collapse.max=1&collapse.type=normal

          you get a response that looks like:
          <lst name="collapse_counts">
          <int name="hard">1</int>
          <int name="electronics">2</int>
          <int name="memory">2</int>
          <int name="monitor">1</int>
          <int name="software">1</int>
          </lst>

          It looks like that says: for the field 'cat', there is one more result with cat=hard, 2 more results with cat=electronics, ...

          How is a client supposed to know how to deal with that? "hard" is the tokenized version of "hard drive" – unless it were a 'string' field, the client would need to know how to handle that – or the response needs to change.

          From a client, it would be more useful to have output that looked something like:
          <lst name="collapse_counts">
          <str name="field">cat</str>
          <lst name="doc">
          <int name="SP2514N">1</int>
          <int name="6H500F0">1</int>
          <int name="VS1GB400C3">2</int>
          <int name="VS1GB400C3">1</int>
          </lst>
          <lst name="count">
          <int name="hard">1</int>
          <int name="electronics">1</int>
          <int name="memory">2</int>
          <int name="monitor">1</int>
          </lst>
          </lst>

          "field" says what field was collapsed on,
          "doc" is a map of doc id -> how many more collapsed on that field
          "count" is a map of 'token'-> how many more collapsed on that field

          This way, the client would know what collapse counts apply to which documents without knowing about the schema.

          thoughts?
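
          For illustration, a sketch of how a SolrJ-style client might read such a structure (it assumes the layout proposed above, with "doc" keyed by the uniqueKey field; it is not code from the patch):

          import org.apache.solr.common.util.NamedList;

          public class CollapseCountsClientSketch {
              static void printCollapseCounts(NamedList response) {
                  NamedList collapse = (NamedList) response.get("collapse_counts");
                  if (collapse == null) return;                        // collapsing was not requested
                  String field = (String) collapse.get("field");       // field collapsed on
                  NamedList perDoc = (NamedList) collapse.get("doc");  // uniqueKey -> collapsed count
                  for (int i = 0; i < perDoc.size(); i++) {
                      System.out.println(field + ": doc " + perDoc.getName(i)
                              + " hides " + perDoc.getVal(i) + " more");
                  }
              }
          }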

          Emmanuel Keller added a comment -

          Right, it's more useful.

          This new version returns the result in the form you expect.

          You should add the following constraint on the wiki: the collapsing field must be un-tokenized.

          Ryan McKinley added a comment -

          I just took a look at this using the example data:
          http://localhost:8983/solr/select/?q=*:*&collapse.field=cat&collapse.max=1&collapse.type=normal&rows=10

          <lst name="collapse_counts">
          <str name="field">cat</str>
          <lst name="doc">
          <int>1</int>
          <int name="1">2</int>
          <int name="2">2</int>
          <int name="4">1</int>
          <int name="7">1</int>
          </lst>
          <lst name="count">
          <int>1</int>
          <int name="card">2</int>
          <int name="drive">2</int>
          <int name="hard">1</int>
          <int name="music">1</int>
          </lst>
          </lst>

          • - -

          what is the "<int>1</int>" at the front of each response?

          Perhaps the 'doc' results should be renamed 'offset' or 'index', and then have another one named 'doc' that uses the uniqueKey as the index... this would be useful to build a Map.

          • - -

          Also, check:
          http://localhost:8983/solr/select/?q=*:*&collapse.field=cat&collapse.max=1&collapse.type=adjacent&rows=50

          ArrayIndexOutOfBoundsException:

          • - -

          > You should add the following constraint on the wiki: The collapsing field must be un-tokenized.

          Anyone can edit the wiki (you just have to make an account) – it would be great if you could help keep the page accurate / useful. JIRA discussion comment trails don't work so well at that...

          Re: tokenized... what about it does not work? Are the limitations any different if it is multi-valued? Is it just that if any token matches within the field it will collapse, and that may or may not be what you expect?

          • - -

          Did you get a chance to look at the questions from the previous discussion? I just noticed Yonik posted something new there:
          http://www.nabble.com/result-grouping--tf2910425.html#a10959848

          Emmanuel Keller added a comment -

          Sorry, my last post was buggy. Here is the correct one. There is no more exception now.
          About tokens: if any token matches within the field, it will collapse.
          When I started implementing collapsing, my need was to group documents having an exactly identical field.

          I believe that faceting has identical behavior. Look at "Graphic card" as an example:
          http://localhost:8983/solr/select/?q=cat:graphic%20card&version=2.2&start=0&rows=10&indent=on&facet=true&facet.field=cat

          I will try to maintain the wiki page.

          Yonik Seeley added a comment -

          I guess adjacent collapsing can make sense when one is sorting by the field that is being collapsed.

          For the normal collapsing though, this patch appears to implement it by changing the sort order to the collapsing field (normally not desired). For example, if sorting by relevance and collapsing on a field, one would normally want the groups sorted by relevance (with the group relevance defined as the max score of its members).

          As far as how to do paging, it makes sense to rigidly define it in terms of number of documents, regardless of how many documents are in each group. Going back to google, it always displays the first 10 documents, but a variable number of groups. That does mean that a group could be split across pages. It would actually be much simpler (IMO) to always return a fixed number of groups rather than a fixed number of documents, but I don't think this would be less useful to people. Thoughts?

          Yonik Seeley added a comment -

          Will Johnson brings up other use-cases:
          [...]
          > it's also heavily used in
          > ecommerce settings. Check out BestBuy.com/circuitcity/etc and do a
          > search for some really generic word like 'cable' and notice all the
          > groups of items; BB shows 3 per group, CC shows 1 per group. In each
          > case it's not clear that the number of docs is really limited at all, ie
          > it's more important to get back all the categories with n docs per
          > category and the counts per category than it is to get back a fixed
          > number of results or even categories for that matter. Also notice that
          > neither of these sites allow you to page through the categorized
          > results.

          Some of this seems very closely related to faceted search, and much of it could be implemented that way now on the client side, but it would take multiple queries to do so.

          One could also think about supporting multi-valued fields in the same manner that faceting does.

          Emmanuel Keller added a comment -

          Adjacent collapsing is useful because it preserves the relevance of the sort.
          The sorting is not modified. I copy the current sort to do a new search.

          I am currently working on handling the type of the collapsed field (int).

          Yonik Seeley added a comment -

          > The sorting is not modified. I copy the current sort to do a new search.

          Perhaps if you outlined the algorithm you use, it would clear up some things.

          It looks like you make a copy of the Sort and insert a primary sort on the field to be collapsed, and then process the same way as you would for the "ADJACENT" option. If the original sort was by relevance, this doesn't give you the groups sorted by relevance, right?

          Yonik Seeley added a comment -

          Oh I see... the modified sort is just to build the filter.

          The building-the-filter part is a problem though... asking for all matching docs in sorted order isn't that scalable.
          If we get the interface right though, more efficient implementations can follow.
          For that reason, it might be good for implementation details like "collapseCache" to be private.
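
          For illustration, a simplified sketch of that two-pass approach (plain Java standing in for the Lucene/Solr structures; not the patch code):

          import java.util.*;

          public class CollapseFilterSketch {
              public static void main(String[] args) {
                  // doc id -> collapse field value, listed in the original (e.g. relevance) order
                  final LinkedHashMap<Integer, String> byRelevance = new LinkedHashMap<Integer, String>();
                  byRelevance.put(7, "a"); byRelevance.put(3, "b");
                  byRelevance.put(9, "a"); byRelevance.put(2, "b");

                  // Pass 1: order by the collapse field (as the modified Sort would) and mark
                  // every document after the first of each group as collapsed.
                  List<Integer> ids = new ArrayList<Integer>(byRelevance.keySet());
                  Collections.sort(ids, new Comparator<Integer>() {
                      public int compare(Integer x, Integer y) {
                          return byRelevance.get(x).compareTo(byRelevance.get(y));
                      }
                  });
                  Set<Integer> hidden = new HashSet<Integer>();
                  String prev = null;
                  for (Integer id : ids) {
                      String v = byRelevance.get(id);
                      if (v.equals(prev)) hidden.add(id);
                      prev = v;
                  }

                  // Pass 2: the original ordering is untouched; the filter only removes collapsed docs.
                  for (Integer id : byRelevance.keySet()) {
                      if (!hidden.contains(id)) System.out.println(id); // prints 7 then 3
                  }
              }
          }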

          Emmanuel Keller added a comment -

          Correct, except that the collapse result is only used as a filter on the final result to hide collapsed documents.

          P.S.: Sorry if my answers are a little short; I am not perfectly fluent in English.

          Ryan McKinley added a comment -

          Any thoughts on what the faceting semantics for field collapsing should be?

          That is, should faceting apply to the collapsed results or the pre-collapsed results?

          I think the pre-collapsed results.

          Yonik Seeley added a comment -

          Yes, it seems like faceting should be for pre-collapsed.

          Emmanuel Keller added a comment -

          Do we have to make a choice? Both behaviors are interesting.
          What about a new parameter like collapse.facet=[pre|post]?

          Yonik Seeley added a comment -

          We facet on the complete set of documents matching a query, even when the user only requests the top 10 matches. It seems we should do the same here. The set of documents is the same, the only difference is what "top" documents are returned.

          Emmanuel Keller added a comment -

          New release:

          • Field collapsing added to the DisMaxRequestHandler
          • Types are correctly handled on the collapsed field
          Ryan McKinley added a comment -

          No real changes. Updated to apply with trunk.
          Moved the valid values for CollapseType to a 'common' package

          • - - -

          as a side note, when you make a patch, it's easiest to deal with if the path is relative to the solr root directory.

          src/java/org/apache/solr/search/SolrIndexSearcher.java
          is better than:
          /Users/ekeller/Documents/workspace/solr/src/java/org/apache/solr/search/SolrIndexSearcher.java

          Emmanuel Keller added a comment -

          This new patch resolves a performance issue.
          I have added timing information for monitoring performance:

          <str name="time">57/5</str>

          The first value is the elapsed time (in milliseconds) needed to compute the collapse information (CollapseFilter.ajacentCollapse method).
          The second value is the elapsed time needed to compute the result information (CollapseFilter.getMoreResults method).

          We are using Solr (with the collapsing patch) on a large index in a production environment (120 GB with more than 3,000,000 documents).

          P.S.: This time, the patch is relative to the solr root directory.

          nunol added a comment -

          It would be nice for this patch to also report on what documents were actually collapsed - for example, if the result list contained:

          doc1
          doc2
          doc3

          and doc2 and doc3 were collapsed, this would be reflected in the XML result, so that one could determine that (forgive my crude visual representation):

          doc1
          -> doc2
          -> doc3

          Regards.

          Brian Mertens added a comment -

          Imagine a case where a Solr database contains news stories from many newspapers and some wire services.

          A single wire story will typically be picked up and reprinted in many different papers, ranging from national papers like the NYTimes, to small town papers. My database will have all of them, and possibly also the original from the wire service. Each paper will choose their own headline, and will edit the story differently for length to fill a hole on the printed page, so they cannot be trivially detected as duplicates, but to my users, they basically are.

          I need to detect and group together these "duplicates" when displaying search results.

          So let's say every story has had an integer hash value calculated of the first X words of the lead paragraph, and that value is indexed and stored (e.g. "similarity_hash"), as a way to detect duplicate stories.
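
          For illustration, a hypothetical sketch of such a hash (the normalization rules and the word count are placeholders, not from any patch):

          import java.util.Locale;

          public class SimilarityHashSketch {
              // An integer hash of the first "firstWords" normalized words of the lead
              // paragraph; near-duplicate stories map to the same value and can then be
              // collapsed on a "similarity_hash" field.
              static int similarityHash(String leadParagraph, int firstWords) {
                  String[] words = leadParagraph.toLowerCase(Locale.ENGLISH)
                                                .replaceAll("[^a-z0-9 ]", " ")
                                                .trim().split("\\s+");
                  StringBuilder sb = new StringBuilder();
                  for (int i = 0; i < Math.min(firstWords, words.length); i++) {
                      sb.append(words[i]).append(' ');
                  }
                  return sb.toString().hashCode();
              }

              public static void main(String[] args) {
                  System.out.println(similarityHash("Dog bites man in downtown park on Tuesday...", 5));
                  System.out.println(similarityHash("DOG BITES MAN IN DOWNTOWN park, sources said.", 5));
                  // Both print the same value: the first five normalized words match.
              }
          }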

          I would want to Field Collapse my results on that hash value, so that all occurrences of the same story are lumped together.

          Also, my users would much prefer the most "authoritative" version of the story to be displayed as the primary result, with a count and link to the collapsed results. Authoritativeness could be coded as simply as 1) Wire Service, 2) National Paper, 3) Regional Paper, 4) Small Town Paper, which could be indexed and stored as an integer "authority". (For finer-grained authority we could store the newspapers' circulation numbers.)

          Then I could display to users:
          "Dog Bites Man"
          New York Times, link to see 77 other duplicates

          So, finally getting to the point, would it be possible to make this feature work such that it field-collapses results on one field ("similarity_hash") and selects the one to return based on another field ("authority" or "circulation")? (While allowing the results to be sorted by a third field, e.g. date or relevance.)

          Perhaps by a new parameter?
          collapse.authority=[field] // indexed field used for selecting which result from collapsed group to return, default being... ?

          If this sounds familiar, it is somewhat similar to what Google News is doing:
          http://www.pcworld.com/article/id,136680/article.html

          Final question: Do you think Field Collapse could work nicely with SOLR-303 Federated Search, or is that a bridge too far?

          Dima Brodsky added a comment -

          Hi,

          I am new to the list and to Solr, so I apologize in advance if I say something silly.

          I have been playing with the field collapse patch, and I have a couple of questions and have noticed a couple of issues. What is the intended use / audience for the field collapsing patch? One of the issues I see is that the sort order is changed during normal field collapsing, and this causes problems if I want the results ordered based on relevancy. Another issue is that the backfilling of the results, if there are not enough, is done from the deduped results rather than getting more results from the index. Is this by design?

          Thanks!!
          ttyl
          Dima

          Tracy Flynn added a comment -

          Hi,

          I am new to Solr, and this thread in particular, so please excuse any questions that seem obvious.

          I am investigating converting an existing FAST installation to Solr. I've been able to see how to convert all my queries to Solr/Lucene with little or no trouble, with the exception of field collapsing. I've actually implemented a demo of our main search with a Ruby/Rails front end in a few hours. Nice work everyone!

          I have found this thread, looked at the patch for field collapsing and have a couple of questions.

          I've looked at the Subversion tree and

          • Don't find a 1.3 branch
          • Don't find the patch code in the trunk

          Is there a 'private' sandbox Solr developers work in that's not visible to the public (i.e. me)?

          If not, what revision of the trunk does the patch apply to?

          Any help would be appreciated. If I can get a demo that includes field collapsing, my management may be persuaded to let me move our main search to Solr.

          Regards,

          Tracy

          Ryan McKinley added a comment -

          Hi Tracy-

          There has not been much movement on this while we get SOLR-281 sorted (I hope this happens soon) – once that is in, there will hopefully be an updated patch on the 1.3 branch that will be posted here.

          "1.3" is not a branch yet – it is the trunk revision that most patches work with. Only when it becomes an official release, will it actually get called 1.3 in the repository.

          If you need to show field collapsing soon, I think your best bet (i have not tried it) is to apply the ' field_collapsing_1.1.0.patch' to the 1.1.0 branch ( http://svn.apache.org/repos/asf/lucene/solr/tags/release-1.1.0/ ) But if you can wait a few weeks, it will hopefully be available in trunk (or easily patchable from trunk)

          ryan

          Tracy Flynn added a comment -

          Ryan,

          Thanks for the quick reply and clarification. I'll follow your suggestion as to where to apply and try the patch.

          I'll be eagerly waiting for the updated trunk.

          Regards,

          Tracy

          Emmanuel Keller added a comment - - edited

          Here is the patch for Solr 1.3 rev 589395.

          I made some performance improvements. No more cache. I use BitDocSet or HashDocSet depending on the solrconfig hashDocSetMaxSize setting.

          Regards,
          Emmanuel Keller.

          Yonik Seeley added a comment -

          It looks like the latest patch only includes changed files and not new ones (like CollapseFilter?)

          Emmanuel Keller added a comment -

          Thank you Yonik !
          Here is the complete version.

          P.S.: It's time to go to bed in Europe ...

          Emmanuel.

          Karsten Sperling added a comment -

          I've just looked at the implementation of this patch again – it ends up calling SolrIndexSearcher.getDocListC() with a DocSet derived from the CollapseFilter as the 'filter' parameter. The comment on that method says that only filter or filterList should be provided, but not both. However with the field collapsing patch both WILL be provided if filter queries are passed to the dismax request handler by the client. Can anybody shed any light on what the implications of this are?

          Karsten Sperling added a comment -

          I've done some work on the field collapsing patch, made some additions and changes, and am posting this patch (against revision 592129) here for discussion.

          • Added a collapse.facet = before|after parameter to control if faceting happens before or after collapsing.
          • Changed collapse.max to collapse.threshold – this value controls after which number of collapsible hits collapsing actually kicks in (collapse.max is still supported as an alias).
          • Added a collapse.maxdocs parameter that limits the number of documents that CollapseFilter will process to create the filter DocSet. The intention of this is to be able to limit the time collapsing will take for very large result sets (obviously at the expense of accurate collapsing in those cases).
          • Inverted the logic of the filter DocSet created by CollapseFilter to contain the documents that are to be collapsed instead of the ones that are to be kept. Without this collapse.maxdocs doesn't work.
          • Added collapse.info.doc and collapse.info.count parameters to provide more control over what gets returned in the collapse_counts extra results.
          • Made a minimal change to SolrIndexSearcher.getDocListC() to support passing both the filter and filterList parameters. In most cases this was already handled anyway.
          • Did some general refactoring and added comments and a test case.

          If somebody with deeper Solr/Lucene knowledge could review these changes it would be much appreciated.

          Karsten

          Doug Steigerwald added a comment - - edited

          I've created a CollapseComponent for field collapsing. Everything seems to work fine with it. The only issue I'm having is that I cannot use the query component: when it isn't commented out, the non-field-collapsed results are displayed and I can't figure out how to remove them. Someone might be able to figure that part out.

          [http://localhost:8983/solr/search?q=id:[0%20TO%20*]&collapse=true&collapse.field=inStock&collapse.type=normal&collapse.threshold=0]

          Here's the config I'm using:

          <searchComponent name="collapse" class="org.apache.solr.handler.component.CollapseComponent" />
          <requestHandler name="/search" class="solr.SearchHandler">
          <lst name="defaults">
          <str name="echoParams">explicit</str>
          </lst>
          <arr name="components">
          <!-- <str>query</str> -->
          <str>facet</str>
          <!-- <str>mlt</str> -->
          <!-- <str>highlight</str> -->
          <!-- <str>debug</str> -->
          <str>collapse</str>
          </arr>
          </requestHandler>

          Charles Hornberger added a comment - - edited

          UPDATE: Doug Steigerwald's patch (field_collapsing_dsteigerwald.diff) applies cleanly to trunk

          I'm having trouble applying field_collapsing_1.3.patch to the head of trunk.

          charlie@macbuntu:~/solr/src/java$ patch -p0 < /home/charlie/downloads/field_collapsing_1.3.patch 
          patching file org/apache/solr/search/CollapseFilter.java
          patching file org/apache/solr/search/SolrIndexSearcher.java
          Hunk #1 succeeded at 694 (offset -8 lines).
          Hunk #2 succeeded at 1252 (offset -1 lines).
          patching file org/apache/solr/common/params/CollapseParams.java
          patching file org/apache/solr/handler/StandardRequestHandler.java
          Hunk #1 FAILED at 33.
          Hunk #2 FAILED at 90.
          Hunk #3 FAILED at 117.
          3 out of 3 hunks FAILED -- saving rejects to file org/apache/solr/handler/StandardRequestHandler.java.rej
          patching file org/apache/solr/handler/DisMaxRequestHandler.java
          Hunk #1 FAILED at 31.
          Hunk #2 FAILED at 40.
          Hunk #3 FAILED at 311.
          Hunk #4 FAILED at 339.
          4 out of 4 hunks FAILED -- saving rejects to file org/apache/solr/handler/DisMaxRequestHandler.java.rej
          

I'm guessing that maybe the field collapsing patch needs to be updated for the SearchHandler refactoring that was done as part of SOLR-281? If so, I'll take a whack at migrating the changes to SearchHandler.java, and see if I can produce a better patch.

          Ryan McKinley added a comment -

          Charles - try applying Doug Steigerwald's latest patch: field_collapsing_dsteigerwald.diff

          I have not tested it, but it does apply without errors

          Charles Hornberger added a comment -

Doug – I just started looking into field collapsing the other day, but from glancing at the code in QueryComponent.java and CollapseComponent.java, it seems like perhaps you're not supposed to be using both components – after all, their prepare() methods are identical, and their process() methods both execute the user's search and shove the resulting DocList into the "response" entry of the response object's internal storage Map. (The QueryComponent additionally stores the DocListAndSet in the ResponseBuilder object via builder.setResults() – I'm not sure why this is – and prefetches documents if the result set is small enough.) My guess is that if you want to enable collapsing, you should use the CollapseComponent; if you want to disable it, use the QueryComponent. Maybe someone who understands the design of the search handling components better than me can confirm this or correct my misunderstanding(s) ...

          Charles Hornberger added a comment -

          Attaching a new copy of Doug Steigerwald's patch that omits the System.out.println() call in CollapseComponent.java.

          Doug Steigerwald added a comment -

I copied what was in the QueryComponent.prepare() method because I was having to disable the query component because of the extra results I was getting. Initially I had CollapseComponent.prepare() empty, but then I had the results from the query component plus the collapse component results being returned (two 'response' sections in the results).

          Easy solution for me was to copy the prepare from QueryComponent and disable the query component in the request handler. There may be another way, but I was unable to figure it out.

          Oleg Gnatovskiy added a comment -

Hello, I am new to Solr, so forgive me if what I say doesn't make sense... None of the patches for 1.3 work any more, since the file org.apache.solr.handler.SearchHandler has been removed from the nightly builds. Will someone write a new patch that works with the current nightly builds? If not, could we get a copy of an old nightly build somewhere? Thanks a lot.

          Charles Hornberger added a comment -

It seems like SearchHandler was simply moved down into the org.apache.solr.handler.component package as part of r610426 - http://svn.apache.org/viewvc?view=rev&revision=610426

You should be able to modify the import statements in field_collapsing_dsteigerwald.diff to make it work, no?

          Oleg Gnatovskiy added a comment -

Oh, I didn't notice. I will give it a try tomorrow morning. Thank you.

          Oleg Gnatovskiy added a comment -

          That works, thanks

          Charles Hornberger added a comment -

          NegatedDocSet is throwing "Unsupported Operation" exceptions:

          org.apache.solr.common.SolrException:Unsupported Operation
          at org.apache.solr.search.NegatedDocSet.iterator(NegatedDocSet.java:77)
          at org.apache.solr.search.DocSetBase.getBits(DocSet.java:183)
          at org.apache.solr.search.NegatedDocSet.getBits(NegatedDocSet.java:27)
          at org.apache.solr.search.DocSetBase.intersection(DocSet.java:199)
          at org.apache.solr.search.BitDocSet.intersection(BitDocSet.java:30)
          at org.apache.solr.search.SolrIndexSearcher.getDocListAndSetNC(SolrIndexSearcher.java:1109)
          at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:811)
          at org.apache.solr.search.SolrIndexSearcher.getDocListAndSet(SolrIndexSearcher.java:1258)
          at org.apache.solr.handler.component.CollapseComponent.process(CollapseComponent.java:103)
          at org.apache.solr.handler.SearchHandler.handleRequestBody(SearchHandler.java:155)
          at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:117)
          at org.apache.solr.core.SolrCore.execute(SolrCore.java:902)
          at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:275)
          at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
          at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
          at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
          at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
          at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:174)
          at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
          at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)
          at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
          at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:151)
          at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:874)
          at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665)
          at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528)
          at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81)
          at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689)
          at java.lang.Thread.run(Thread.java:595)

Not quite sure what search is triggering this path through the code, but it is not happening on every request, just some. I'm firing up the debugger now to see what I can learn, but thought I'd post this anyway to see if anyone has any tips.

          Charles Hornberger added a comment - - edited

Ah ... got the beginnings of a diagnosis. The problem appears when the DocSet qDocSet returned by DocSetHitCollector.getDocSet() – called at org.apache.solr.search.SolrIndexSearcher:1101 in trunk, or 1108 with the field_collapsing patch applied, inside getDocListAndSetNC() – is a BitDocSet, and not when it's a HashDocSet. As the stack trace above shows, calling intersection() on a BitDocSet object invokes the superclass' DocSetBase.intersection() method, which invokes a call chain that blows up when it hits the iterator() method of the NegatedDocSet passed in as the filter parameter to getDocListAndSetNC(); NegatedDocSet.iterator() blows up by design:

          public DocIterator iterator() {
              throw new SolrException(SolrException.ErrorCode.SERVER_ERROR, "Unsupported Operation");
          }
          

          I see that DocSetBase.intersection(DocSet other) has special-casing logic for dealing with other parameters that are instances of HashDocSet; does it also need special casing logic for dealing with other parameters that are NegatedDocSets? Or should NegatedDocSet really implement iterator()? Or something else entirely?

          Charles Hornberger added a comment -

          Here's the simplest change I could think of to make DocSetBase subclasses that don't override intersection() (which just means BitDocSet at the moment) stop choking when their intersection() gets called with a NegatedDocSet as the other parameter; it's probably horribly stupid. Also, there should be a test.

          Index: src/java/org/apache/solr/search/DocSet.java
          ===================================================================
          --- src/java/org/apache/solr/search/DocSet.java (revision 617738)
          +++ src/java/org/apache/solr/search/DocSet.java (working copy)
          @@ -193,7 +193,18 @@
               if (other instanceof HashDocSet) {
                 return other.intersection(this);
               }
          -
          +    // you can't call getBits() on a NegatedDocSet, because
+    // getBits() calls iterator(), and iterator() isn't
          +    // supported by NegatedDocSet
          +    if (other instanceof NegatedDocSet) {
          +        BitDocSet newdocs = new BitDocSet();
          +        for (DocIterator iter = iterator(); iter.hasNext();) {
          +          int next = iter.nextDoc();
          +          if (other.exists(next))
          +           newdocs.add(next);
          +        }
          +        return newdocs;
          +    }
               // Default... handle with bitsets.
               OpenBitSet newbits = (OpenBitSet)(this.getBits().clone());
               newbits.and(other.getBits());
          

          Comments?

          Yonik Seeley added a comment -

          I haven't been following this, so I don't know why there is a need for a NegatedDocSet (or if introducing it is the best solution), but it looks like you have two cases to handle: one negative set or two negative sets.
          If you have a and -b, then return a.andNot(b)
          if both a and b are negative (-a.intersection(-b)) then return NegatedDocSet(a.union(b)) // per De Morgan, -a&-b == -(a|b)

          That's only for intersection() of course.
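To make the two rewrite rules above concrete, here is a minimal, untested sketch of an intersection dispatch that handles negated operands; it assumes NegatedDocSet wraps the set it negates and exposes it through a hypothetical getSource() accessor (the patch may use a different name):

static DocSet intersect(DocSet a, DocSet b) {
  boolean negA = a instanceof NegatedDocSet;
  boolean negB = b instanceof NegatedDocSet;
  if (negA && negB) {
    // -a & -b == -(a | b), per De Morgan
    DocSet srcA = ((NegatedDocSet) a).getSource();
    DocSet srcB = ((NegatedDocSet) b).getSource();
    return new NegatedDocSet(srcA.union(srcB));
  }
  if (negB) {
    // a & -b == a minus b
    return a.andNot(((NegatedDocSet) b).getSource());
  }
  if (negA) {
    // -a & b == b minus a
    return b.andNot(((NegatedDocSet) a).getSource());
  }
  return a.intersection(b); // both positive: the normal case
}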

          Karsten Sperling added a comment -

          NegatedDocSet got introduced because the filter logic expects to use the intersection operation to apply a number of filters to a result. Introducing a negated docset was much easier than supporting both intersection as well as and-not type filters.

          NegatedDocSet does not support iteration because the negation of a finite set is (at least theoretically) infinite. Even though it would in practice be possible to limit the negated set via the known maximum document id, this would probably not be very efficient. However, it is simply not necessary to ever iterate over the elements of a NegatedDocSet, because we know that the end-result of all DocSet operations is going to be a finite set of results, not an infinite one. A NegatedDocSet will only ever be used to "subtract" from a finite DocSet. As Yonik has pointed out, operations on a NegatedDocSet can be rewritten as (different) operations on the set being negated. The operation methods inside NegatedDocSet do this.

          The reason the bug occurs is because of the naive way the binary set operation calls are dispatched: DocSet clients simply call e.g. set1.intersection(set2), arbitrarily leaving the choice of implementation to the logic defined by the class of set1. Currently, BitDocSet does not know about NegatedDocSet, and hence doesn't perform the necessary rewriting or delegation to NegatedDocSet.

However, instead of requiring each and every DocSet subclass to know about all the other ones (and in the absence of language support for multiple dispatch), I think it would be better to centralize this knowledge in a single class DocSetOp with static methods that select the appropriate implementation for an operation based on the types of both parameters. The client code could be changed to call DocSetOp.intersection(a, b) instead of a.intersection(b), but this would involve changing the DocSet interface. A backwards-compatible solution would be to simply have a final DocSetBase.intersection() delegate to DocSetOp.intersection().
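As a rough sketch of the backwards-compatible option (assuming a DocSetOp helper along the lines of the intersect() sketch after Yonik's comment above; none of these names are from the patch):

public abstract class DocSetBase implements DocSet {
  // All type-specific dispatch (HashDocSet, BitDocSet, NegatedDocSet, ...)
  // would live in DocSetOp, so subclasses no longer need to know about each other.
  public final DocSet intersection(DocSet other) {
    return DocSetOp.intersection(this, other);
  }
  // union() and andNot() could delegate the same way.
}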

          Charles Hornberger added a comment -

          As Yonik has pointed out, operations on a NegatedDocSet can be rewritten as (different) operations on the set being negated. The operation methods inside NegatedDocSet do this.

          Right. I realized, sheepishly, after I posted the first suggested patch that it'd be much simpler to just mimic the first if-clause in DocSet.intersection():

  if (other instanceof NegatedDocSet) {
    return other.intersection(this);
  }
          
          Charles Hornberger added a comment -

          However, instead of requiring each and every DocSet subclass to know about all other ones (and in the absence of language support for multiple dispatch), I think it would be better to centralize this knowledge in a single class DocSetOp with static methods that selects the appropriate implementation for an operation based on the type of both parameters.

          +1 for this ... whether or not NegatedDocSet is part of the final implementation of this feature. FWIW, I just noticed that there's another bug lurking in BitDocSet.andNot(), which will fail if a NegatedDocSet is passed in. It seems to me that it might be easier – at least for me – to read/write/extend a test suite that exercised all the paths thru DocSetOp, than to write a set of tests that exercised all the paths thru DocSetBase and its subclasses.

Also, I think that maybe there's a clear distinction to be made between intrinsic operations on a set (add(), exists(), et al.) and ones that involve another set (intersection(), union(), andNot()). Not sure it's a useful one, but it makes sense to me. I don't know, though, whether it makes sense to go further than that and say – as the current implementation of NegatedDocSet implies – that there are some set operations (iterator() and size()) that are in fact optional.

Off the top of my head: Would it be simpler to just add a filterType flag to the getDocList*() family of methods in SolrSearchInterface to cause it to call a.andNot(b) rather than a.intersection(b) when applying b as a filter? (I'm really completely ignorant – or nearly completely – of how the search code works, so feel free not to dignify this with a response if it's a useless idea ... )

          Oleg Gnatovskiy added a comment - - edited

Hello everyone. I am planning to implement chain collapsing on a high traffic production environment, so I'd like to use a stable version of Solr. It doesn't seem like you have a chain collapse patch for Solr 1.2, so I tried the Solr 1.1 patch. It seems to work fine at collapsing, but how do I get a count for the documents other than the one being displayed?

          As a result I see:

<lst name="collapse_counts">
  <int name="Restaurant">2414</int>
  <int name="Bar/Club">9</int>
  <int name="Directory & Services">37</int>
</lst>

          Does that mean that there are 2414 more Restaurants, 9 more Bars and 37 more Directory & Services? If so, then that's great.

However, when I collapse on some fields I get an empty collapse_counts list. It could be that those fields have a large number of different values that it collapses on. Is there a limit to the number of values that collapse_counts displays?

          Thanks in advance for any help you can provide!

          Oleg Gnatovskiy added a comment -

          Also, is field collapse going to be a part of the upcoming Solr 1.3 release, or will we need to run a patch on it?

          Oleg Gnatovskiy added a comment -

OK, I think I have the first issue figured out. If the current result set (let's say the first 10 rows) doesn't have the field that we are collapsing on, the counts don't show up. Is that correct?

          Oleg Gnatovskiy added a comment -

          Latest patch file fixes an issue where facet searching would throw a NullPointerException when using the fieldCollapse requestHandler. Also, updated the import path for SearchHandler. Thank you Dave for these tips!

          Oleg Gnatovskiy added a comment -

That thanks should go to Charles, not Dave. Sorry about that!

          Nikolai Kordulla added a comment -

It would be a good thing to be able to apply this CollapseComponent to the mlt results as well.

          Oleg Gnatovskiy added a comment -

          Are there any plans to add collapse controls to SolrJ?

          Oleg Gnatovskiy added a comment -

          None of the patches work on the current nightly build anymore. Could anyone help? Thanks

          Bojan Smid added a comment -

          I will try to bring this patch up to date. Currently I see two main problems:

1) The patch applies to trunk, but it doesn't compile. The problem occurs mainly because of changes in Search Components (for instance, some method signatures which CollapseComponent implements were changed). I have this fixed locally (more or less), but I have to test it before posting a new version of the patch.

2) It seems that CollapseComponent can't be used in a chain with QueryComponent, but instead of it. CollapseComponent basically copies QueryComponent's querying logic and adds some of its own. I guess this isn't the right way to go. CollapseComponent should contain only collapsing logic and should be chainable with other components. Can anyone confirm if I'm right here? Of course, there might be some fundamental reason why CollapseComponent had to be implemented this way.

          Does anyone else see any other issues with this component?

          Oleg Gnatovskiy added a comment -

Hey Bojan. I actually hacked CollapseComponent quite a bit in order to get it to work with Distributed Search, but I am not going to upload it, since it's horribly buggy. Do you think that's a feature that can be added?

          Bojan Smid added a comment -

          Hi Oleg. I'll look into this also. In case you have any working code, you can mail it to me, and I'll see what can be reused.

          Otis Gospodnetic added a comment -

          It's amazing this issue/patch has so many votes and watchers, yet it's stuck...
          Ryan, Yonik, Emmanuel, Doug, Charles, Karsten

          I think Bojan is onto something here. Isn't the ability to chain QueryComponent (QC) and CollapseComponent (CC) essential?

          I'm looking at field_collapsing_dsteigerwald.diff and see that the CC.prepare method there is identical to the QC.prepare method, while process methods are different. Could we solve this particular copy/paste situation by making CC extend QC and simply override the process method?

As for chaining, could CC take the same approach as the MLT Component, which simply does its thing to find "more like this" docs and stuffs them into the "moreLikeThis" element in the response?

          I could be misunderstanding something, so please correct me if I'm wrong. I'd love to get this one in 1.3 – it's been waiting in JIRA for too long.
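A minimal sketch of that idea (assuming QueryComponent.prepare() can be reused unchanged; the body of process() is only a placeholder for the existing collapsing logic, not code from the patch):

public class CollapseComponent extends QueryComponent {
  // prepare() is inherited from QueryComponent, so the copied code goes away.
  @Override
  public void process(ResponseBuilder rb) throws IOException {
    // Collapsing-specific work (building the CollapseFilter, running the
    // collapsed search, adding collapse_counts) would go here, replacing
    // QueryComponent.process().
  }
}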

          Bojan Smid added a comment -

I updated the patch so that it can be compiled on Solr trunk. Also, since CollapseComponent essentially copied QueryComponent's prepare method (and it seems that it is supposed to be used instead of it), I made it extend QueryComponent (with a collapsing-specific process() method, and prepare() inherited from the superclass).

          Oleg Gnatovskiy added a comment -

          I'd like to request some distributed search functionality for this feature as well.

          Otis Gospodnetic added a comment -

There is so little interest in this patch/functionality now that I doubt it will get distributed search support in time for 1.3. I would like to commit Bojan's patch for 1.3, though.

          Yonik Seeley added a comment -

          Since this is adding new interface/API, it would be very nice if one could easily review it. It's very important that the interface and the exact semantics are nailed down IMO (there seem to be a lot of options).
          Is http://wiki.apache.org/solr/FieldCollapsing up-to-date?

          There don't seem to be any tests either.

          JList added a comment -

Although field collapsing worked fine in my brief testing, when I put it to work with more documents, I got exceptions. It seems to have something to do with the queries (or documents, since different queries return different documents). With some queries, this exception does not happen.

          If I remove the collapse.* parameters, the error does not happen. Any idea why this is happening? Thanks.

          HTTP ERROR: 500
          Unsupported Operation

          org.apache.solr.common.SolrException: Unsupported Operation
          at org.apache.solr.search.NegatedDocSet.iterator(NegatedDocSet.java:77)
          at org.apache.solr.search.DocSetBase.getBits(DocSet.java:183)
          at org.apache.solr.search.NegatedDocSet.getBits(NegatedDocSet.java:27)
          at org.apache.solr.search.DocSetBase.intersection(DocSet.java:199)
          at org.apache.solr.search.BitDocSet.intersection(BitDocSet.java:30)
          at org.apache.solr.search.SolrIndexSearcher.getDocListAndSetNC(SolrIndexSearcher.java:1109)
          at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:811)
          at org.apache.solr.search.SolrIndexSearcher.getDocListAndSet(SolrIndexSearcher.java:1282)
          at org.apache.solr.handler.component.CollapseComponent.process(CollapseComponent.java:57)
          at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:156)
          at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:125)
          at org.apache.solr.core.SolrCore.execute(SolrCore.java:965)
          at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
          at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:272)
          at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
          at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
          at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
          at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
          at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
          at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
          at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
          at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
          at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
          at org.mortbay.jetty.Server.handle(Server.java:285)
          at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
          at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:821)
          at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:513)
          at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208)
          at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
          at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
          at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)

          Bojan Smid added a comment -

          You can check discussion about this same problem in the posts above (starting with 1st Feb 2008). It seems like a rather complex issue which could require some serious refactoring of collapsing code.

          JList added a comment -

Sorry about the dup. I obviously didn't check the comments before I posted the bug. Anyway, it's still there; it's still happening.

          JList added a comment -

          Not sure if it's related to the query string or the documents that the query hits. If the latter, it would be trickier to reproduce.
          Anyway I tried a few English words and the error didn't happen. So far I was only able to reproduce it with CJK (Simplified Chinese to be exact) queries.

          This is an example query that triggers this problem (in UTF-8):
          '\xe5\x9c\xb0\xe9\x9c\x87'

          The query string:
          http://localhost:8983/solr/select/?q=%E5%9C%B0%E9%9C%87&version=2.2&start=0&rows=10&indent=on&collapse.field=domain

          Matthias Epheser added a comment -

          I just tried to apply the last patch and ran into 2 issues:

          First:

          The new getDocListAndSet(Query query, List<Query>..) method in SolrIndexSearcher calls the getDocListC(..) method using the old signature. I changed the call to the new signature and it worked very well:

          DocListAndSet ret = new DocListAndSet();
          QueryResult queryResult = new QueryResult();
          queryResult.setDocListAndSet(ret);
          queryResult.setPartialResults(false);
          QueryCommand queryCommand = new QueryCommand();
          queryCommand.setQuery(query);
          queryCommand.setFilterList(filterList);
          queryCommand.setFilter(docSet);
          queryCommand.setSort(lsort);
          queryCommand.setOffset(offset);
          queryCommand.setLen(len);
          queryCommand.setFlags(flags |= GET_DOCSET);
          getDocListC(queryResult, queryCommand);

          Second:

          After adding more docs (~3000), I got an Exception in SolrIndexSearcher at line ~1300:
          qr.setDocSet(filter == null ? qDocSet : qDocSet.intersection(filter));

As the NegatedDocSet doesn't implement the iterator() function, this call led to an Unsupported Operation exception. I just naively tried to implement this function using "return source.iterator()". Works fine for me.

          As the first issue is very clear, I wanted to check my approach for the second one before I post a patch. Maybe there are some side effects that I missed.

          Doug Steigerwald added a comment -

I'm in the process of updating our Solr build and I'm running into issues with this patch now. I added the code from the first issue Matthias mentioned. Unfortunately, whenever I try to do any field collapsing, I get an NPE:

          java.lang.NullPointerException
          at org.apache.solr.search.CollapseFilter.getCollapseInfo(CollapseFilter.java:263)
          at org.apache.solr.handler.component.CollapseComponent.process(CollapseComponent.java:65)
          ...

          My request handler for testing is simple. It only has the collapse component in it. Posting the example docs and trying to execute the following query gives me the NPE.

          http://localhost:8983/solr/search?q=*:*&collapse.field=cat&collapse.type=normal

          Updated my trunk this morning (r687489).

          Oleg Gnatovskiy added a comment - - edited

I was able to hack the latest patch in and get it to work, but it required some pretty heavy, naive changes...

If you are getting an NPE, try this: in the SolrIndexSearcher class, in the getDocListC method, change out = new DocListAndSet(); to

DocListAndSet out = null;
if (qr.getDocListAndSet() == null) {
  out = new DocListAndSet();
} else {
  out = qr.getDocListAndSet();
}

          Mark Miller added a comment - - edited

Sorting twice (when not sorting on the collapse field) only makes sense if we are doing external sorts (hard drive), correct? It seems to me that this should be closer to the facet stuff (in using the field cache) and then use a hash table of accumulators: linear time (is that generally?), right? (edit: looks like that's too memory intensive)

          As Otis mentions above, this issue appears very popular. We should finish it up.

          Oleg Gnatovskiy added a comment -

          What's a hard drive sort?

          Mark Miller added a comment - - edited

          What's a hard drive sort?

          Sorry - was not very clear.

Just like sorting, finding dupes can be done in memory or using external storage (hard drive). I am only just looking into this stuff myself, but it seems that in the best case you would want to do it in memory with a hash system, which can scale linearly. If you have too many items to look for dupes in, you have to use external storage - one good method is two sorts (we get one from the search), but there are other options too, I think. In this case, the sorts are able to be done in memory, but I think the hashtable method of identifying dupes is much less memory efficient (too many unique terms).
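As a toy illustration of the in-memory hash approach (this is not the patch's implementation; the per-document collapse values are assumed to come from something like Lucene's FieldCache):

// Keeps the first hit per collapse value from an already-sorted hit list and
// counts how many later hits each value absorbed (cf. the collapse_counts output above).
static List<Integer> collapse(int[] sortedHits, String[] collapseValue,
                              Map<String, Integer> collapseCounts) {
  List<Integer> kept = new ArrayList<Integer>();
  for (int doc : sortedHits) {
    String key = collapseValue[doc];
    Integer extra = collapseCounts.get(key);
    if (extra == null) {
      kept.add(doc);                      // first document for this value survives
      collapseCounts.put(key, 0);
    } else {
      collapseCounts.put(key, extra + 1); // later ones are collapsed and counted
    }
  }
  return kept;
}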

          Vaijanath N. Rao added a comment -

          Hi All,

          I am trying to apply this patch to the Solr 1.4 code and am getting the following error.
          At line 58 of CollapseComponent.java the error is:
          The method getDocListAndSet (Query, List<Query>, Sort, int , int , int) in the type SolrIndexSearcher is not applicable for the arguments (Query, List<Query>, DocSet, Sort, int , int , int)

          Can anyone tell me what correction I need to make to get this code working?

          --Thanks and Regards
          Vaijanath

          Vaijanath N. Rao added a comment -

          Hi All,

          I got this patch working, but against the 1.3 code and not 1.4. I will try to get it working on 1.4 and report the results. I pulled in some code from an older version, namely for:
          getDocListAndSet
          getDocListNC
          getDocListC

          I also added a constructor DocSetHitCollector(int maxDoc) with the following code:

          public DocSetHitCollector(int maxDoc) {
            this(HashDocSet.DEFAULT_INVERSE_LOAD_FACTOR, maxDoc, maxDoc);
          }

          I wanted to know if any of these additions could harm any other component of Solr.

          Do I need to make any changes to solrconfig.xml other than the following?

          Adding <arr name="first-components"> <str>collapse</str> </arr> to the standard and dismax request handlers, and registering:
          <searchComponent name="collapse" class="org.apache.solr.handler.component.CollapseComponent" />

          I will check this with highlighting and let you all know of any observation that I make.

          --Thanks and Regards
          Vaijanath

          Iván de Prado added a comment -

          A patch for field collapsing over Solr 1.3.0. It changes the behavior to be more memory-friendly when the collapse.maxdocs parameter is used.

          Iván de Prado added a comment -

          I attached a patch named collapsing-patch-to-1.3.0-ivan.patch. The patch applies to Solr 1.3.0.

          Karsten wrote in his comment of 06/Nov/07 02:06 PM:

          Inverted the logic of the filter DocSet created by CollapseFilter to contain the documents that are to be collapsed instead of the ones that are to be kept. Without this collapse.maxdocs doesn't work.

          I found that this approach consumes a lot of memory, even if your query is bounded to a small number of documents. And I found that there is no advantage to using collapse.maxdocs if it doesn't speed up queries and reduce the amount of memory needed.

          So I decided to revert Karsten's change in order to make field collapsing faster and less resource-consuming when querying smaller datasets.

          WARNING: This patch changes the semantics of collapse.maxdocs. Before this patch, collapse.maxdocs was used just to reduce the number of docs checked for grouping, while still presenting the remaining, ungrouped documents in the result.

          With the current patch, only documents that were examined for grouping can appear in the result. This semantics has two benefits:

          • The amount of resources can be controlled for each query
          • No ungrouped content is presented.
          Doug Steigerwald added a comment -

          I'm having an issue with Ivan's latest patch. I'm testing on a data set of 8113 documents. All the documents have a string field called site. There are only two sites, Site1 and Site2.

          Site1 has 3466 documents.
          Site2 has 4647 documents.

          With the following simple query, I only get 1 result:
          http://localhost:8983/solr/core1/search?q=*:*&collapase=true&collapse.field=site

          ....
          <lst name="collapse_counts">
          <str name="field">site</str>
          <lst name="doc">
          <int name="site2-doc-2981790">4646</int>
          </lst>
          <lst name="count">
          <int name="Site2">4646</int>
          </lst>
          <str name="debug">HashDocSet(2) Time(ms): 0/0/0/0</str>
          </lst>
          <result name="response" numFound="1" start="0">
          ....

          The only result displayed is for Site2.

          I have an older patch working with Solr 1.3.0, but I can't get it to mesh with localsolr properly. My localsolr query gives 1656 results; collapsed on the site field it should give 2 results, but it gives 8 results, some of which are duplicate documents. Without localsolr, my field collapsing patch seems to work fine.

          Ryan McKinley added a comment -

          What is the "localsolr" field you are talking about?

          Is it the solr stuff from http://sourceforge.net/projects/locallucene ?

          Doug Steigerwald added a comment -

          Yes, that localsolr. I've just been trying to get the two components working together but haven't had much luck.

          Separately they work fine, but together not so much. I can't get the field collapsing to work correctly with an existing result set from the localsolr component in the response builder.

          Iván de Prado added a comment - edited

          I have attached a new patch that solves the problems in my first submitted patch. Doug Steigerwald, could you check whether this patch works well for you? Thanks.

          Doug Steigerwald added a comment -

          Looks fine from my little bit of testing.

          Stephen Weiss added a comment -

          I'm using Ivan's patch and running into some trouble with faceting...

          Basically, I can tell that faceting is happening after the collapse, because the facet counts are definitely lower than they would be otherwise. For example, with one search I get 196 results with no collapsing and 120 results with collapsing - but the facet count is 119??? In other searches the difference is more drastic: I get 61 results without collapsing, 61 with collapsing, but the facet count is 39.

          Looking at it for a while now, I think I can guess what the problem might be...

          The incorrect counts seem to only happen when the term in question does not occur evenly across all duplicates of a document. That is, multiple document records may exist for the same image (it's an image search engine), but each document will have different terms in different fields depending on the audience it's targeting. So, when you collapse, the counts are lower than they should be because when you actually execute a search with that facet's term included in the query, all the documents after collapsing will be ones that have that term.

          Here's an illustration:

          Collapse field is "link_id", facet field is "keyword":

          Doc 1:
          id: 123456,
          link_id: 2,
          keyword: Black, Printed, Dress

          Doc 2:
          id: 123457,
          link_id: 2,
          keyword: Black, Shoes, Patent

          Doc 3:
          id: 123458,
          link_id: 2,
          keyword: Red, Hat, Felt

          Doc 4:
          id: 123459,
          link_id: 1,
          keyword: Felt, Hat, Black

          So, when you collapse, only two of these documents are in the result set (123456, 123459), and only the keywords Black, Printed, Dress, Felt, and Hat are counted. The facet count for Black is 2, the facet count for Felt is 1. If you choose Black and add it to your query, you get 2 results (great). However, if you add Felt to your query, you get 2 results (because a different document for link_id 2 is chosen in that query than is in the more general query from which the facets are produced).

          I think what needs to happen here is that all the terms for all the documents that are collapsed together need to be included (just once) with the document that gets counted for faceting. In this example, when the document for link_id 2 is counted, it would need to appear to the facet counter to have keywords Black, Printed, Dress, Shoes, Patent, Red, Hat, and Felt, as opposed to just Black, Printed, and Dress.
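
          To make that suggestion concrete, here is a small, self-contained sketch (my own illustration, not code from any attached patch) that counts facets over the union of terms in each collapse group, using the four documents above. With it, Black and Felt each get a count of 2, matching the number of results the corresponding filtered queries actually return.

          import java.util.Arrays;
          import java.util.HashMap;
          import java.util.HashSet;
          import java.util.List;
          import java.util.Map;
          import java.util.Set;

          public class GroupFacetSketch {
            public static void main(String[] args) {
              // doc id -> collapse value (link_id) and facet terms (keyword), from the example above
              Map<Integer, String> linkId = new HashMap<Integer, String>();
              Map<Integer, List<String>> keywords = new HashMap<Integer, List<String>>();
              linkId.put(123456, "2"); keywords.put(123456, Arrays.asList("Black", "Printed", "Dress"));
              linkId.put(123457, "2"); keywords.put(123457, Arrays.asList("Black", "Shoes", "Patent"));
              linkId.put(123458, "2"); keywords.put(123458, Arrays.asList("Red", "Hat", "Felt"));
              linkId.put(123459, "1"); keywords.put(123459, Arrays.asList("Felt", "Hat", "Black"));

              // Union of facet terms per collapse group, so each term is counted once per group.
              Map<String, Set<String>> termsByGroup = new HashMap<String, Set<String>>();
              for (Integer doc : linkId.keySet()) {
                Set<String> union = termsByGroup.get(linkId.get(doc));
                if (union == null) {
                  union = new HashSet<String>();
                  termsByGroup.put(linkId.get(doc), union);
                }
                union.addAll(keywords.get(doc));
              }

              // Facet counts computed over groups instead of individual documents.
              Map<String, Integer> counts = new HashMap<String, Integer>();
              for (Set<String> terms : termsByGroup.values()) {
                for (String term : terms) {
                  Integer c = counts.get(term);
                  counts.put(term, c == null ? 1 : c + 1);
                }
              }
              System.out.println(counts); // Black=2, Felt=2, Hat=2, all other keywords 1
            }
          }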

          Iván de Prado added a comment -

          You can try collapse.facet=before, but then you'll notice that the list of documents returned is the full result set, not only the collapsed one.

          Stephen Weiss added a comment -

          Yes, this is basically what I'm doing for now... At least it's reasonable enough to explain to a client that the counts are for unfiltered results. However, ideally, it should be able to facet properly on filtered results as well...

          Also, with simply collapse.facet=before, the results returned are the unfiltered results. You have to specify collapse.facet=after to get filtered results at all, and then run the query component right before the facet component to get the unfiltered facet counts... which doesn't seem ideal. This is with the release version of Solr 1.3 and Iván's most recent patch. All in all it took a lot of experimenting, but at least now I have a method that works that we can go live with, and then we'll just update the software as the situation improves.

          Thanks for all your efforts on the patch! I complain but really, the fact it works at all is a miracle for us.

          Stephen Weiss added a comment -

          I get an error on certain searches with Ivan's latest patch.

          Dec 15, 2008 2:32:00 PM org.apache.solr.core.SolrCore execute
          INFO: [ss_image_core] webapp=/solr path=/select params={collapse=true&facet.limit=5&wt=json&rows=50&json.nl=map&start=0&sort=add_date+desc,+object_id+asc&facet=true&collapse.facet=after&f.season.facet.limit=-1&facet.mincount=1&fl=object_id&q=object_type:image+AND+classif_name:(19097)+AND+market:(49154)+AND+perms:(1835+OR+4785+OR+1725+OR+1690+OR+2816+OR+3149+OR+3082+OR+2815+OR+2814+OR+3083+OR+4783)&version=1.2&f.classif_name.facet.limit=-1&collapse.field=link_id&collapse.threshold=1&facet.field=classif_name&facet.field=market&facet.field=season&facet.field=city&facet.field=designer&facet.field=category&facet.field=keywords&facet.field=lifestyle} hits=263059 status=500 QTime=4508
          Dec 15, 2008 2:32:00 PM org.apache.solr.common.SolrException log
          SEVERE: java.lang.ArrayIndexOutOfBoundsException: 41386
          at org.apache.solr.util.OpenBitSet.fastSet(OpenBitSet.java:235)
          at org.apache.solr.search.CollapseFilter.addDoc(CollapseFilter.java:214)
          at org.apache.solr.search.CollapseFilter.adjacentCollapse(CollapseFilter.java:171)
          at org.apache.solr.search.CollapseFilter.<init>(CollapseFilter.java:139)
          at org.apache.solr.handler.component.CollapseComponent.process(CollapseComponent.java:52)
          at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:169)
          at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
          at org.apache.solr.core.SolrCore.execute(SolrCore.java:1204)
          at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
          at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
          at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1115)
          at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:361)
          at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
          at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
          at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
          at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
          at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
          at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
          at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
          at org.mortbay.jetty.Server.handle(Server.java:324)
          at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
          at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
          at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:533)
          at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:207)
          at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:403)
          at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
          at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522)

          It's unfortunate, really: it happens every time this specific search is run, but many, many other searches of similar result set size and equal or greater complexity execute fine... I can't honestly tell you what's special about this one search that makes it fail.

          For now the patch is offline until we can figure something out for it... I can provide access to the machine (I managed to reproduce it in a test environment) if it would help determine what the problem is / make the software better for everyone.

          Karsten Sperling added a comment -

          I'm pretty sure the problem Stephen ran into is an off-by-one error in the bitset allocation inside the collapsing code; I ran into the same problem when I customized it for internal use about half a year ago – and unfortunately forgot all about the problem until reading Stephen's comment just now. Basically the bitset gets allocated 1 bit too small, so there's about a 1/32 chance that if the bit for the document with the highest ID gets set it will cause the AIOOB exception.
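
          For anyone hitting the same ArrayIndexOutOfBoundsException, here is a hypothetical sketch of the kind of sizing bug being described (the class and method names are assumptions for illustration, not the actual patch code):

          import org.apache.lucene.index.IndexReader;
          import org.apache.solr.util.OpenBitSet;

          public class BitSetSizingSketch {
            // Document ids range from 0 to maxDoc() - 1, so the bitset needs maxDoc() bits.
            public static OpenBitSet newCollapseBitSet(IndexReader reader) {
              return new OpenBitSet(reader.maxDoc()); // correct sizing
              // Sizing it as new OpenBitSet(reader.maxDoc() - 1) leaves the last document's
              // bit out of range, and fastSet() on that bit throws ArrayIndexOutOfBoundsException.
            }
          }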

          Iván de Prado added a comment -

          Karsten Sperling was right. It seems there was a wrong bounds initialization for the OpenBitSet. I have fixed it and attached a new patch.

          Stephen Weiss, can you test if now the error has disappeared?

          Thanks.

          Stephen Weiss added a comment -

          Yes! It does work. Thank you both so much! It's been running for 5 days now without a hiccup. This is going into production use now (we'll be monitoring), they simply can't wait for the functionality. From here it looks like if you get faceting tidied up and some docs written, they should be including this soon!

          Ryan McKinley added a comment -

          I see there is a patch against 1.3; is there any current patch against trunk? (We would need something against trunk in order to consider this for 1.4.)

          Thomas Traeger added a comment -

          I tested 1.3 with Ivan's latest patch.

          When I add a filter query (fq param) to my query I get the exception "Either filter or filterList may be set in the QueryCommand, but not both." I'm not that familiar with Java, but I at least disabled the exception in SolrIndexSearcher.java. I can use filter queries now and no problems have occurred so far. But surely this has to be handled in another way.

          Btw, I think this had already been fixed in some way by Karsten back in 2007 (patch field-collapsing-extended-592129.patch). He commented:

          "Made a minimal change to SolrIndexSearcher.getDocListC() to support passing both the filter and filterList parameters. In most cases this was already handled anyway."

          dieter grad added a comment -

          I had to make a patch to fix two issues that we needed for our system. I am not very familiar with this code, so maybe someone can pick this patch up and turn it into something useful for everybody.

          The fixes are:

          1) When collapse.facet=before, only the collapsed documents are returned (and not the whole collection).

          2) When collapsing is normal, the selected sort order is preserved by returning the first document of the collapsed group.

          For example, if the values of the collapsing field are:

          1) Y
          2) X
          3) X
          4) Y
          5) X
          6) Z

          the documents returned are 1, 2 and 6, in that order.

          So, for example, if you sort by price ascending, you will get the result sorted by price, where each item is the cheapest item of its collapsed group.

          Shalin Shekhar Mangar added a comment -

          Marked for 1.5

          Oleg Gnatovskiy added a comment -

          Are there any concrete plans on where this feature is going? Is it ever going to get support for distributed search?

          Stephen Weiss added a comment - edited

          Help!!

          We've been using this patch in production for months now, and suddenly in the last 3 days it is crashing constantly.

          Edit - It's Ivan's latest patch, #3, with Solr 1.3 dist

          Mar 6, 2009 5:23:50 AM org.apache.solr.common.SolrException log
          SEVERE: java.lang.OutOfMemoryError: Java heap space
          at org.apache.solr.util.OpenBitSet.ensureCapacityWords(OpenBitSet.java:701)
          at org.apache.solr.util.OpenBitSet.ensureCapacity(OpenBitSet.java:711)
          at org.apache.solr.util.OpenBitSet.expandingWordNum(OpenBitSet.java:280)
          at org.apache.solr.util.OpenBitSet.set(OpenBitSet.java:221)
          at org.apache.solr.search.CollapseFilter.addDoc(CollapseFilter.java:217)
          at org.apache.solr.search.CollapseFilter.adjacentCollapse(CollapseFilter.java:171)
          at org.apache.solr.search.CollapseFilter.<init>(CollapseFilter.java:139)
          at org.apache.solr.handler.component.CollapseComponent.process(CollapseComponent.java:52)
          at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:169)
          at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
          at org.apache.solr.core.SolrCore.execute(SolrCore.java:1204)
          at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
          at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
          at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1115)
          at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:361)
          at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
          at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
          at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
          at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
          at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
          at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
          at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
          at org.mortbay.jetty.Server.handle(Server.java:324)
          at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
          at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
          at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:533)
          at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:207)
          at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:403)
          at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
          at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522)

          It seems to happen randomly - there's no special request happening, nothing new added to the index, nothing. We've made no configuration changes. The only thing that's happened is that more documents have been added since then. The schema is the same; we have perhaps 200,000 more documents in the index now than we did when we first went live with it.

          It was a 32-bit machine allocated 2GB of RAM for Java before. We just upgraded it to 64-bit and increased the heap space to 3GB, and still it went down last night. I'm at my wits' end. I don't know what to do, but this functionality has been live so long now that it's going to be extremely painful to take it away. Someone, please tell me if there's anything I can do to save this thing.

          Iván de Prado added a comment -

          That is one of the problems this patch has: the consumption of resources (memory and CPU) increases with the number of results in the query and with the number of requests.

          It is not trivial to change that. I imagine that deep changes in Solr or Lucene would be needed to make collapsing efficient.

          The advice I can give you is:

          • Increase the amount of memory for your Solr instance.
          • Use the "collapse.maxdocs" parameter (see the example request below). This parameter limits the number of documents that are examined when collapsing. By using it, you limit the amount of memory and resources used per query. But if your query matches more than maxdocs documents, the collapsing won't be perfect: some documents won't be collapsed.
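
          For illustration, a request using this parameter might look like the following (the host, core, field, and the 10000 value are made-up examples, not recommendations):

          http://localhost:8983/solr/select?q=*:*&collapse=true&collapse.field=site&collapse.maxdocs=10000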

          I hope it helps somewhat.

          Stephen Weiss added a comment -

          Thank you so much for your prompt response Ivan, I really appreciate your help.

          I have already maxed out the RAM on the machine - it seems very strange to me that adding a whole other GB of RAM did not fix the issue already. So I will have to try the next option, collapse.maxdocs.

          How does this work, though? Let's say I set collapse.maxdocs to 10000 - does that mean the first 10000 documents will be collapsed, and after that they won't be? Or is it more random?

          Iván de Prado added a comment -

          It's not random. I don't remember exactly, but I think the documents are sorted by the collapsing field. After that, they are grouped sequentially until maxdocs is reached. The groups that result are the documents that are presented. So the number of resulting groups is always smaller than maxdocs.

          Summary: only maxdocs documents are scanned to generate the resulting groups.

          Stephen Weiss added a comment -

          Unfortunately I don't think that will work for us. collapse.maxdocs seems to collapse the oldest documents in the index - but we sort from newest to oldest, so effectively the newest documents in the index are just left out. Not only do they not collapse, but they don't appear at all. If this is the only solution, then we will have to stop using the patch... and unfortunately that means we will probably have to stop using Solr in general. The company has already made clear that this functionality is required, and especially since it has been working for several months now, they will be very unlikely to accept that they can't have it anymore.

          Anyway I don't want to give up yet...

          I'm really not convinced this is really a problem of running out of the necessary memory to complete the operation - it only started doing this very recently. How does it run for 3 months with 2GB of RAM without any trouble, and now it fails even with 3GB of RAM? It's not like we just added those 200000 documents yesterday - they have accumulated over the past few months, in the past 3 days we've only perhaps added 20,000 documents. 20,000 more documents (with barely any new search terms at all) means it needs more than 1GB of memory more than what it was already using? If we grow by 25% every year that means by December we will need 50GB of RAM in the machine.

          Mark Miller added a comment -

          How much RAM does the machine have total? 4 GB?

          Do you ever commit rapidly?

          You might try decreasing your cache sizes if you are using them.

          Stephen Weiss added a comment -

          The machine has 4GB total. In response to this issue, and especially now that we have upgraded it to be 64 bit (again, for this issue), we have already ordered another 16 GB for the machine to try and stave off the problem. We should have it in next week.

          I restrict commits severely - a commit is only allowed once an hour; in practice they happen even less frequently - perhaps 5 or 6 times a day, and very spread out. We are freakishly paranoid. But honestly that's all we need - new documents come in in chunks and generally they want them to go in all at once, not piecemeal, so that the site updates cleanly (the commits are synchronized with other content updates - new images on the home page, etc.).

          Some more information... just trying to toss out anything that matters. We have a very small set of possible terms - only 60,000 or so, which tokenize to perhaps 200,000 total distinct words. We do not use synonyms at index time (only at query time). We use faceting, collapsing, and sorting - that's about it, no MoreLikeThis or spellchecker (although we'd like to, we haven't gotten there yet). Faceting we do use heavily though - there are 16 different fields on which we return facet counts. All these fields together represent no more than 15,000 unique terms. There are approx. 4M documents in the index total, and none of them are larger than 1K.

          Memory usage on the machine seems to steadily increase - after restart and warming, 40% of the RAM on the machine is in use. Then, as searches come in, it steadily increases. Right now it is using 61%, in an hour it will probably be closer to 75% - the danger zone. This is also unusual because before, it used to stay pretty steady around 52-53%.

          This is a multi-core system - we have 2 cores, the one I'm describing now is only one of them. The other core is very, very small - total 8000 documents, which are also no more than 1 K each. We do use faceting there but no collapsing (it is not necessary for that part). It is essentially irrelevant, with or without that core the machine consumes about the same amount of resources.

          In response to this problem I have already dramatically reduced the following options:

          < <mergeFactor>2</mergeFactor>
          < <maxBufferedDocs>100</maxBufferedDocs>
          ---
          > <mergeFactor>10</mergeFactor>
          > <maxBufferedDocs>1000</maxBufferedDocs>
          42c42
          < <maxFieldLength>2500</maxFieldLength>
          ---
          > <maxFieldLength>10000</maxFieldLength>
          50,51c50,51
          < <mergeFactor>2</mergeFactor>
          < <maxBufferedDocs>100</maxBufferedDocs>
          ---
          > <mergeFactor>10</mergeFactor>
          > <maxBufferedDocs>1000</maxBufferedDocs>
          53c53
          < <maxFieldLength>2500</maxFieldLength>
          ---
          > <maxFieldLength>10000</maxFieldLength>

          (diff of solrconfig.xml - < indicates current values, > indicates values from when the problem started happening).

          This actually seemed to make the search much faster (strangely enough), but it doesn't seem to have helped memory consumption very much.

          These are our cache parameters:

          <filterCache
          class="solr.LRUCache"
          size="65536"
          initialSize="4096"
          autowarmCount="2048"/>

          <queryResultCache
          class="solr.LRUCache"
          size="512"
          initialSize="512"
          autowarmCount="256"/>

          <documentCache
          class="solr.LRUCache"
          size="16384"
          initialSize="16384"
          autowarmCount="0"/>

          <cache name="collapseCache"
          class="solr.LRUCache"
          size="512"
          initialSize="512"
          autowarmCount="0"/>

          I'm actually not sure if the collapseCache even does anything since it does not appear in the admin listing. I'm going to try reducing the filterCache to 32K entries and see if that makes a difference. I think that may be the right track since otherwise it seems like a big memory leak is happening.

          Is there any way to specify the size of a cache in terms of the actual space it should take up in memory, as opposed to the number of entries? 64K entries sounded quite small to me, but now I'm thinking that 64K entries could mean GBs of memory depending on what the entries are; I honestly don't understand what the correlation is between an entry and the size that entry takes in RAM.

          Mark Miller added a comment -

          we have already ordered another 16 GB for the machine to try and stave off the problem. We should have it in next week.

          Great. You've got a lot going on here, and 4 GB is on the extremely low end of what I'd suggest.

          I restrict commits severely -

          Good news again.

          In response to this problem I have already dramatically reduced the following options:

          Dropping the merge factor is not likely to help much. It will increase the time it takes to add docs (merges occur much more often) for the benefit of maintaining an almost optimized index at all times (hence the faster search speed). Not a big RAM factor though.

          Dropping the max buffered docs is also probably not a huge saver, and will only affect RAM usage during indexing. Going from 1000 to 100 will likely hurt indexing performance and not save that much RAM in the larger scheme of things.

          And dropping the maxFieldLength will hide parts of documents that are over that length - perhaps you'll end up with a handful fewer index terms, but again, not likely a big savings here, and it may do more harm than good.

          My suggestion of lowering your cache sizes was just a thought to eke out some more RAM for you. It's not really suggested, though, if you can get more RAM. For best performance, those caches should be set correctly. If you are using the fieldcache method for faceting, you want the size of the filter cache to be the same as the number of unique terms you are faceting on. The other caches are not so large that I would suggest trimming them.

          The reality is, you've got 4 million docs, sorting (uses field caches), faceting (likely uses field caches), and this resource-intensive field collapse patch. More RAM is probably your best bet. Every document you add potentially adds to the RAM usage of each of these things. That doesn't mean you don't have a different problem (it does seem weird it ballooned all of a sudden), but you're running some RAM-hungry stuff here, and it wouldn't blow my mind that 3 gig is not enough to handle it. It could be that only recently the right searches started coming in at the right times to fire up all your needs at once. Much of this may be lazy loaded or loaded on the fly depending on if and how you have configured your warming searches.

          Stephen Weiss added a comment -

          Thanks. In the wiki, next to each one of these parameters, it explicitly says that reducing the parameter will decrease memory usage; this is why we reduced these parameters (it did not mention the filterCache at all).

          I really do hope the RAM will help. It certainly can't hurt.

          My filterCache stats are great - you know it's set to 64K, but right now, with almost all the RAM used up (we're at 71.9% now), it's only using 36290 entries, and it's holding pretty steady there (even as RAM usage increased by 10%). None of the other caches have gone up much either. We have no cache evictions at all, and a 99% hit ratio.

          I'm going to try lowering the filterCache to be just above the number it's at now, since that amount seems to be all it needs. It's possible that at crash time it suddenly uses a lot more of it for some reason - I have a feeling it might be related to a new permissions group that was added 3 days ago. That might trigger a lot more filters. It is barely used at all yet except by one client - I'm going to check whether there's any correspondence between when that client logs in and when the problem occurs - I bet there is.

          Thanks for all your help guys.

          Mark Miller added a comment -

          Thanks. In the wiki, next to each one of these parameters, it explicitly says that reducing the parameter will decrease memory usage; this is why we reduced these parameters (it did not mention the filterCache at all).

          They will save RAM to a certain extent for certain situations. But not very helpful at the sizes you are working with (and not settings I would use to save RAM anyway, unless the amount I need to save was pretty small). Also, the savings are largely index side - not likely a huge part of your RAM concerns, which are search side.

          My filterCache stats are great - you know it's set to 64K, but right now, with almost all the RAM used up (we're at 71.9% now), it's only using 36290 entries and holding pretty steady there (even as RAM usage increased by 10%). None of the other caches have gone up much either. We have no cache evictions at all, and a 99% hit ratio.

          The sizes may be higher than you need, then. They should be adjusted to the best settings based on the wiki info. I was originally suggesting you might sacrifice speed with the caches for RAM - but it's always best to use the best settings and have the necessary RAM.

          Dmitry Lihachev added a comment -

          When I add a Filter Query (fq param) to my query I get an exception "Either filter or filterList may be set in the QueryCommand, but not both."

          Dmitry Lihachev added a comment -

          This patch (based on dieter's patch) allows using the fq parameter.

          dredford added a comment - - edited

          There is an issue with collapsed result ordering when querying with only the unique Id and score fields in the request.

          [Update: this is only an issue when both standard results and collapse results are present - which I was using for testing]

          eg:
          q=ford&version=2.2&start=0&rows=10&indent=on&fl=Id,score&collapse.field=PrimaryId&collapse.max=1

          gives wrong ordering (note: Id is our unique Id)

          but adding another field - even a bogus one - works.
          q=ford&version=2.2&start=0&rows=10&indent=on&fl=Id,score,bogus&collapse.field=PrimaryId&collapse.max=1

          Also using an fq makes it work
          eg:
          fq=Type:articles&q=ford&version=2.2&start=0&rows=10&indent=on&fl=Id,score&collapse.field=PrimaryId&collapse.max=1

          I'm using the latest Dmitry patch (25/mar/09) against 1.3.0.

          Apart from that, great so far... thanks to all.

          Jeff added a comment - - edited

          We have tried to integrate the most recent patch into our 1.4 install. The patching was smooth and overall it works well. However, it appears the issue with fq has returned: whenever I try to filter the query it gives "Either filter or filterList may be set in the QueryCommand, but not both." Not sure what happened. What part of the patch makes it possible for fq to work? It may not be there now.

          Additionally, collapse.facet=before does not seem to work. Any help in this area would be greatly appreciated.

          Domingo Gómez García added a comment - - edited

          I checked out svn release-1.3.0 and applied SOLR-236_collapsing.patch.
          Is there any way to integrate this with SolrJ?

          Oleg Gnatovskiy added a comment -

          How did you fix the memory issue?

          Domingo Gómez García added a comment -

          -XX:PermSize=1524m -XX:MaxPermSize=1524m -Xmx128m
          It's not a real fix, but works for now...

          Thomas Traeger added a comment -

          This patch is based on the latest patch by Dmitry, it addresses the following issues:

          • the CollapseComponent now simply falls back to the process method of QueryComponent when no collapse.field is defined (see the sketch after this list). This fixes issues with the fq param when collapsing was disabled and makes CollapseComponent a fully compatible replacement for QueryComponent.
          • collapse.facet=before is now fixed; the previous patch ignored any filter queries (fq) and therefore returned wrong facet counts
          • ResponseBuilder "builder" renamed to "rb" to match QueryComponent
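
          For illustration, that fallback could look roughly like this (a hypothetical sketch, not the actual patch code; doCollapse is an invented name for the collapse-specific path):

              import java.io.IOException;
              import org.apache.solr.handler.component.QueryComponent;
              import org.apache.solr.handler.component.ResponseBuilder;

              public class CollapseComponent extends QueryComponent {
                @Override
                public void process(ResponseBuilder rb) throws IOException {
                  String collapseField = rb.req.getParams().get("collapse.field");
                  if (collapseField == null) {
                    // no collapsing requested: behave exactly like QueryComponent,
                    // so fq and all other standard parameters keep working
                    super.process(rb);
                    return;
                  }
                  doCollapse(rb, collapseField); // collapse-specific search path
                }

                private void doCollapse(ResponseBuilder rb, String field) throws IOException {
                  // the collapsing logic from the patch would go here
                }
              }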

          This patch applies to trunk (rev. 772433) but works with Solr 1.3 too. For 1.3 you have to move CollapseParams.java from common/org/apache/solr/common/params to java/org/apache/solr/common/params/ as the location of this file has been changed in trunk.

          This is my first contribution, so any feedback is much appreciated. This is a great feature, so let's get it into Solr as soon as possible.

          Martijn van Groningen added a comment - - edited

          Hi,

          I have modified the latest patch of Thomas and made two performance improvements:
          1) Improved normal field collapsing. I tested it with an index of 1.1 million documents. When collapsing on all documents with no sorting specified (so sorting on score), the query time is around 130 ms, compared with around 1.5 s for the previous patch. When I then add sorting on a string field, the query time is around 220 ms, compared with around 5.2 s for the previous patch.

          The reason why it is faster is because the latest patch queries for a doclist instead of a docset. In the normal collapse method it keeps track of the most relevant documents, so the end result is the same; creating a docList of 1.1 million documents (and ordering it) is very expensive.

          Note: I did not improve adjacent collapsing, because the adjacent method needs (as far as I understand it) a completely sorted list of documents (docList).

          2) Slightly improved faceting in combination with field collapsing, by reusing the uncollapsed docset that is created during the collapsing process (the previous patch invoked a second search).
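
          To illustrate the idea behind 1) - keeping only the most relevant document per collapse value instead of materializing and ordering the full result list - here is a minimal sketch (assumes collapse.max=1 and ordering by score; hypothetical code, not taken from the patch):

              import java.util.HashMap;
              import java.util.Map;

              class BestPerGroup {
                // best-scoring doc per collapse-field value
                private final Map<String, Integer> bestDoc = new HashMap<String, Integer>();
                private final Map<String, Float> bestScore = new HashMap<String, Float>();

                void collect(int docId, String collapseValue, float score) {
                  Float current = bestScore.get(collapseValue);
                  if (current == null || score > current) {
                    bestScore.put(collapseValue, score);
                    bestDoc.put(collapseValue, docId);
                  }
                }
                // afterwards only bestDoc.values() needs to be sorted and paged,
                // instead of building an ordered DocList over every hit
              }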

          I also have added documentation, added a few unit tests for the collapsing process itself and made the debug information more readable.
          This patch works from revision 779335 (last Wednesday) and up. This patch depends on some changes in Solr and a change inside Lucene.

          I'm very interested in other people's experiences with this patch and feedback on the patch itself.

          Cheers,

          Martijn

          Thomas Traeger added a comment -

          I made some tests with your patch and trunk (rev. 779497). It looks good so far but I have some problems with occasional null pointer exceptions when using the sort parameter:

          http://localhost:8983/solr/select?q=*:*&collapse.field=manu&sort=score%20desc,alphaNameSort%20asc

          java.lang.NullPointerException
          at org.apache.lucene.search.FieldComparator$RelevanceComparator.copy(FieldComparator.java:421)
          at org.apache.solr.search.CollapseFilter$DocumentComparator.compare(CollapseFilter.java:649)
          at org.apache.solr.search.CollapseFilter$DocumentPriorityQueue.lessThan(CollapseFilter.java:596)
          at org.apache.lucene.util.PriorityQueue.insertWithOverflow(PriorityQueue.java:153)
          at org.apache.solr.search.CollapseFilter.normalCollapse(CollapseFilter.java:321)
          at org.apache.solr.search.CollapseFilter.<init>(CollapseFilter.java:211)
          at org.apache.solr.handler.component.CollapseComponent.process(CollapseComponent.java:67)
          at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
          at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
          at org.apache.solr.core.SolrCore.execute(SolrCore.java:1328)
          at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:341)
          at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:244)
          at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
          at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
          at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
          at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
          at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
          at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
          at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
          at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
          at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
          at org.mortbay.jetty.Server.handle(Server.java:285)
          at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
          at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:821)
          at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:513)
          at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208)
          at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
          at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
          at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)

          These queries work as expected:
          http://localhost:8983/solr/select?q=*:*&collapse.field=manu&sort=score%20desc
          http://localhost:8983/solr/select?q=*:*&sort=score%20desc,alphaNameSort%20asc

          Martijn van Groningen added a comment -

          Thanks for the feedback. I fixed the problem you described and have added a new patch containing the fix.
          The problem occurred when sorting was done on one or more normal fields and on score.

          Thomas Traeger added a comment -

          The problem is solved, thanks. I will use your patch for my current project, which is planned for go-live in 5 weeks. If I find any more issues I will report them here.

          Oleg Gnatovskiy added a comment -

          Hey guys, are there any plans to make field collapsing work on multi shard systems?

          Martijn van Groningen added a comment -

          I'm looking forward to hearing about your experiences with this patch, particularly in production.

          I think in order to make collapsing work on multi shard systems the process method of the CollapseComponent needs to be modified.
          CollapseComponent already subclasses QueryComponent (which already supports querying on multi shard systems), so it should not be that difficult.

          Ron Veenstra added a comment - - edited

          I require assistance. I've installed a fresh Solr (1.3.0), and all appears/operates well. I then patch using SOLR-236_collapsing.patch [by Thomas Traeger] (the last patch I saw claimed to work with 1.3.0), without error. I then add the following to solrconfig.xml (per http://wiki.apache.org/solr/FieldCollapsing):

          <searchComponent name="collapse" class="org.apache.solr.handler.component.CollapseComponent" />

          Upon restart, I get a long configuration error, which seems to hinge on:

          HTTP Status 500 - Severe errors in solr configuration. Check your log files for more detailed information on what may be wrong. If you want solr to continue after configuration errors, change: <abortOnConfigurationError>false</abortOnConfigurationError> in solrconfig.xml ------------------------------------------------------------- org.apache.solr.common.SolrException: Error loading class 'org.apache.solr.handler.component.CollapseComponent' at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:273)

          [the full error can be included if desired.]

          I've verified that the CollapseComponent file exists in the proper place.
          I've moved CollapseParams as required (moving CollapseParams.java from common/org/apache/solr/common/params to java/org/apache/solr/common/params/).
          I've tried multiple iterations of the patch (on fresh installs), all with the same issue.

          Are there additional steps, patches, or configurations that are required?
          Is this a known issue?
          Any help is very much appreciated.

          Thomas Traeger added a comment -

          Ron, your approach should work; I just verified it on my Ubuntu 9.04 box. Here are my steps to a working example installation of Solr 1.3.0 with collapsing enabled:

          java -version
          > java version "1.6.0_13"
          > Java(TM) SE Runtime Environment (build 1.6.0_13-b03)
          > Java HotSpot(TM) 64-Bit Server VM (build 11.3-b02, mixed mode)
          
          wget http://www.apache.org/dist/lucene/solr/1.3.0/apache-solr-1.3.0.tgz
          tar xvzf apache-solr-1.3.0.tgz 
          wget http://issues.apache.org/jira/secure/attachment/12407410/SOLR-236_collapsing.patch
          cd apache-solr-1.3.0/
          patch -p0 <../SOLR-236_collapsing.patch 
          mv src/common/org/apache/solr/common/params/CollapseParams.java src/java/org/apache/solr/common/params/
          ant example
          cd example/
          vi solr/conf/solrconfig.xml 
          

          add the collapse component class definition:

          <searchComponent name="collapse" class="org.apache.solr.handler.component.CollapseComponent" />
          

          set the components in the standard requestHandler:

              <arr name="components">
                <str>collapse</str>
              </arr>
          

          start jetty

          java -jar start.jar
          

          add example docs

          cd example/exampledocs
          sh post.sh *.xml
          

          and open http://localhost:8983/solr/select/?q=*:*&collapse.field=cat in your browser.

          Stephen Weiss added a comment -

          The problem sounds very familiar to me; I remember going through something similar when I was first trying to get the patch to work. My configuration ended up being:

          <searchComponent name="collapse" class="org.apache.solr.handler.component.CollapseComponent" />

          <requestHandler name="standard" class="solr.StandardRequestHandler">
          <!-- default values for query parameters -->
          <lst name="defaults">
          <str name="echoParams">explicit</str>
          <!--
          <int name="rows">10</int>
          <str name="fl">*</str>
          <str name="version">2.1</str>
          -->
          </lst>
          <arr name="components">
          <str>query</str>
          <str>facet</str>
          <str>collapse</str>
          <str>mlt</str>
          <str>highlight</str>
          <str>debug</str>
          </arr>
          </requestHandler>

          All I remember is that if I didn't have that <arr name="components"> section arranged exactly like that (even if I rearranged other items without rearranging the "collapse" part), either 1) faceting would stop working correctly, giving me totally bogus numbers, or 2) I would get something a lot like the error described above and nothing would work at all.

          However, I'm using an older version of the patch (collapsing-patch-to-1.3.0-ivan_3.patch) so it's totally possible that this has nothing to do with that.

          On that note... have people found that the newer versions of the patch give any particular benefits? I saw someone say that the latest patches were faster, but I wasn't sure if they were faster in all cases or only when not sorting (we always sort, so if it's only for unsorted sets it doesn't do us much good).

          Ron Veenstra added a comment -

          Thanks for the replies.

          Thomas, I followed your steps, verifying the same Java version and build, etc. (all matched; I'm working with a CentOS 5 machine - any chance the problem is related to that?)
          Patching and installing all appeared successful, but the resulting jetty-powered page still showed:

          org.apache.solr.common.SolrException: Error loading class 'org.apache.solr.handler.component.CollapseComponent'
          [followed by the long line of tracebacks..]

          My solrconfig.xml included the following (included in case there is an obvious flaw):

          <searchComponent name="collapse" class="org.apache.solr.handler.component.CollapseComponent" />

          <requestHandler name="standard" class="solr.SearchHandler" default="true">
          <!-- default values for query parameters -->
          <lst name="defaults">
          <str name="echoParams">explicit</str>
          <!--
          <int name="rows">10</int>
          <str name="fl">*</str>
          <str name="version">2.1</str>
          -->
          </lst>

          <arr name="components">
          <str>collapse</str>
          </arr>
          </requestHandler>

          Stephen: I attempted your configuration as well, with the most recent patch and the patch you referenced, but the results were the same.

          I am going to attempt a fresh try on an Ubuntu Machine, but any other ideas would be most appreciated.

          Thomas Traeger added a comment -

          Strange, maybe something went wrong during building and CollapseComponent is not included in the war. You might look into solr.war and check for CollapseComponent.class:

          cd apache-solr-1.3.0/example/webapps
          unzip solr.war
          cd WEB-INF/lib
          unzip apache-solr-core-1.3.0.jar
          cd org/apache/solr/handler/component/
          

          Is the file CollapseComponent.class there?

          Martijn van Groningen added a comment -

          Hi Stephen, when I was doing performance tests on the latest patch for normal collapsing (not adjacent collapsing), I found that there was a significant performance improvement during field collapsing compared to the old patch. This applies both when specifying sorting and when not specifying sorting in the request. If you have other questions or comments about the latest patch, just ask.

          Ron Veenstra added a comment -

          Thomas,

          Again thanks. I've verified that the CollapseComponent is indeed NOT present in the war. That'd suggest something going amiss during the patching process, correct? And as it appears to be happening each time, either there's an issue with the patch (which others have verified as working) or something conflicts with my current setup (solr / tomcat / CentOS). Can I manually create apache-solr-core and force the file in?

          Ron Veenstra added a comment -

          Quick update: starting fresh, I was able to get the issue resolved once ant properly rebuilt the solr-core file. Uncertain why previous attempts failed so completely. Many thanks for your help.

          Earwin Burrfoot added a comment -

          I have implemented collapsing on a high-volume project of mine in a much less flexible, but more practical manner.

          Part I. You have to guarantee that all documents having the same value of collapse-field are dropped into Lucene index as a sequential batch. That guarantees they get sequential docIds, and with some more work - that they all end up in the same segment.
          Part II. When doing collection you always get docIds in sequential order, and thus, thanks to Part I you get the docs-to-be-collapsed already grouped by collapse-field, even before you drop the docs into PriorityQueue to sort them.

          Cons:
          You can only collapse on a single field, predetermined at index-creation time.
          If one document changes, you have to reindex all docs that have the same collapse-field value, so it's best if you have either low update/add rates, or few documents sharing the same collapse-field value.

          Pros:
          The CPU and memory costs for collapsing compared to usual search are very close to zero and do not depend on index size/total docs found.
          The same idea works with new Lucene per-segment collection and in distributed mode (sharded index).
          Within collapsed group you can sort hits however you want, and select one that will represent the group for usual sort/paging.
          The implementation is not brain-dead simple, but nears it.
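
          A minimal sketch of the collection side of this idea (hypothetical code assuming hits arrive in increasing docId order, that keyOf(doc) returns the collapse-field value, e.g. from a field cache, and that emit(doc, score) pushes a group's representative into whatever priority queue is used for normal sorting and paging - none of this is taken from the attached patches):

              // one pass over hits in increasing docId order
              String currentKey = null;
              int bestDoc = -1;
              float bestScore = Float.NEGATIVE_INFINITY;

              void collect(int doc, float score) {
                String key = keyOf(doc);                  // collapse-field value for this doc
                if (!key.equals(currentKey)) {
                  if (currentKey != null) {
                    emit(bestDoc, bestScore);             // previous group is complete
                  }
                  currentKey = key;
                  bestDoc = doc;
                  bestScore = score;
                } else if (score > bestScore) {
                  bestDoc = doc;                          // better representative within the group
                  bestScore = score;
                }
              }
              // call emit(bestDoc, bestScore) once more after the last hit to flush the final group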

          Kevin Cunningham added a comment -

          Martijn,
          You mentioned your latest patch update "depends on some changes in Solr and a change inside Lucene". Does this mean it is not compatible with 1.3?

          emmanuel vecchia added a comment -

          I applied the latest patch, field-collapse-solr-236-2.patch, to http://www.apache.org/dist/lucene/solr/1.3.0/apache-solr-1.3.0.tgz and tried to compile. It seems to require org.apache.lucene.search.FieldComparator and org.apache.lucene.search.Collector, and maybe other classes from Lucene. I checked out a few versions of Lucene, but looking at LUCENE-1483 it seems that only the current trunk has the classes needed. So it doesn't seem to be possible to use the patch with 1.3.

          Martijn van Groningen added a comment -

          Kevin, that is correct: my patch is not compatible with 1.3. It works from revision 779497 (which is 1.4-dev).

          Shekhar added a comment -

          Hi,

          Has anyone successfully used localsolr and the collapse patch together in Solr 1.4-dev? I am getting two result sets, one from localsolr and the other from collapse, but I need a single merged result set.
          I am using localsolr 1.5 and field-collapse-solr-236-2.patch.
          Any pointers?

          Martijn van Groningen added a comment -

          Shekhar, can you show how you configured local solr and field collapsing in the solrconfig.xml file?

          Shekhar added a comment - - edited

          Here is the solrconfig file.

          <requestHandler name="geo" class="solr.SearchHandler">
          <lst name="defaults">
          <str name="echoParams">explicit</str>
          </lst>

          <arr name="components">
          <str>localsolr</str>
          <str>collapse</str>
          </arr>

          </requestHandler>

          You can get more details from http://www.gissearch.com/localsolr

          ===================================================

          Following are the results I am getting :

          <response>

          <lst name="responseHeader">
          <int name="status">0</int>
          <int name="QTime">146</int>

          <lst name="params">
          <str name="lat">41.883784</str>
          <str name="radius">50</str>
          <str name="collapse.field">resource_id</str>
          <str name="rows">2</str>
          <str name="indent">on</str>
          <str name="fl">resource_id,geo_distance</str>
          <str name="q">TV</str>
          <str name="qt">geo</str>
          <str name="long">-87.637668</str>
          </lst>
          </lst>

          <result name="response" numFound="4294" start="0">

          <doc>
          <int name="resource_id">10018</int>
          <double name="geo_distance">26.16691883965225</double>
          </doc>

          <doc>
          <int name="resource_id">10102</int>
          <double name="geo_distance">39.90588996589528</double>
          </doc>
          </result>

          <lst name="collapse_counts">
          <str name="field">resource_id</str>

          <lst name="doc">
          <int name="10022">116</int>
          <int name="11701">4</int>
          </lst>

          <lst name="count">
          <int name="10015">116</int>
          <int name="10018">4</int>
          </lst>

          <lst name="debug">
          <str name="Docset type">BitDocSet(5201)</str>
          <long name="Total collapsing time(ms)">46</long>
          <long name="Create uncollapsed docset(ms)">22</long>
          <long name="Collapsing normal time(ms)">24</long>
          <long name="Creating collapseinfo time(ms)">0</long>
          <long name="Convert to bitset time(ms)">0</long>
          <long name="Create collapsed docset time(ms)">0</long>
          </lst>
          </lst>

          <result name="response" numFound="5201" start="0">

          <doc>
          <int name="resource_id">10015</int>
          </doc>

          <doc>
          <int name="resource_id">10018</int>
          </doc>
          </result>
          </response>

          Martijn van Groningen added a comment -

          The LocalSolrQueryComponent and the CollapseComponent are both doing a search, that is why there are two result sets.
          I think if you want field collapsing and local search you cannot use the version of localsolr that you are currently using, but you can use the latest
          local solr patch (SOLR-773). The latest patch does local search in a different manner: the DistanceCalculatingComponent (LocalSolrQueryComponent has been removed) does not itself do a search, but adds a filter query (based on the lat, long and radius) to the normal search, which is then executed in the collapse component, so it should work the way you expect.

          Example configuration for the latest patch:
          <searchComponent name="geodistance" class="org.apache.solr.spatial.tier.DistanceCalculatingComponent" />
          <queryParser name="spatial_tier" class="org.apache.solr.spatial.tier.SpatialTierQueryParserPlugin" />
          <requestHandler name="geo" class="org.apache.solr.handler.component.SearchHandler">
          <lst name="defaults">
          <str name="echoParams">explicit</str>
          <str name="defType">spatial_tier</str>
          </lst>
          <lst name="invariants">
          <str name="latField">lat</str>
          <str name="lngField">lng</str>
          <str name="distanceField">geo_distance</str>
          <str name="tierPrefix">tier</str>
          </lst>
          <arr name="components">
          <str>collapse</str>
          </arr>
          <arr name="last-components">
          <str>geodistance</str>
          </arr>
          </requestHandler>

          Shekhar added a comment -

          Thanks a lot, Martijn, for your help.
          Could you please point me to the example you are referring to? I could not find any example that uses DistanceCalculatingComponent.

          Martijn van Groningen added a comment - - edited

          I have not found an online example yet, but I copied this config from the javadoc of the DistanceCalculatingComponent class and modified it. The patch also modifies the Solr examples, so if you look there you can see how the patch is used (example/solr/conf/schema.xml and example/solr/conf/solrconfig.xml). You need to add an extra update processor, an extra field, and a dynamic field in order to make it work.

          Oleg Gnatovskiy added a comment -

          Hello all. We implemented the old Field Collapse patch ( 2008-02-14 03:38 PM) a few months ago into our production environment with some custom code to make it work over distributed search. Shortly after deployment we started noticing extremely slow queries (3-12 seconds) on a completely random basis. After disabling field collapse these random queries disappeared. Does anyone here know of the issue that might have caused this, and any idea if it has been fixed in the patches since the one we used?

          Martijn van Groningen added a comment -

          Hi Oleg, I have checked your latest patch, but I could not find the code that deals with the distributed search. How did you make collapsing work for distributed search? Which parameters did you use while doing a search? What I can tell is that the latest patches do not support field collapsing for distributed search.

          David Smiley added a comment -

          Auto-reply: I'm on Vacation this week.

          Jay Hill added a comment -

          I've tried applying the most recent patch against a completely fresh check out of the trunk, but I'm getting compile errors related to a class updated in the patch:

          compile:
          [mkdir] Created dir: /Users/jayhill/solrwork/trunk/build/solr
          [javac] Compiling 367 source files to /Users/jayhill/solrwork/trunk/build/solr
          [javac] /Users/jayhill/solrwork/trunk/src/java/org/apache/solr/util/DocSetScoreCollector.java:31: org.apache.solr.util.DocSetScoreCollector is not abstract and does not override abstract method acceptsDocsOutOfOrder() in org.apache.lucene.search.Collector
          [javac] public class DocSetScoreCollector extends Collector {
          [javac] ^
          [javac] Note: Some input files use or override a deprecated API.
          [javac] Note: Recompile with -Xlint:deprecation for details.
          [javac] Note: Some input files use unchecked or unsafe operations.
          [javac] Note: Recompile with -Xlint:unchecked for details.
          [javac] 1 error

          I noticed that FieldCollapsing is targeted for release 1.5, but some folks have been using it in production, and I was curious to work with it in 1.4 if possible.

          Martijn van Groningen added a comment -

          Hey Jay, I have fixed this issue in the new patch. So if you apply the new patch everything should be fine.
          The compile error was a result of the upgrade of the Lucene libraries in Solr. Because of LUCENE-1630 a new method was added to the Collector class.
          In this patch I also removed the invocations to ExtendedFieldCache methods and changed them to FieldCache methods. ExtendedFieldCache is now deprecated in the updated Lucene libraries. If you have any problems with this patch let me know.
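
          For anyone carrying a similar custom Collector on an older checkout, the missing override is a one-liner. A minimal sketch, added inside the collector class; whether true or false is correct depends on the collector, and the actual DocSetScoreCollector in the patch may differ:

          // Satisfies the abstract method introduced by LUCENE-1630 on org.apache.lucene.search.Collector.
          // Returning true is typical for collectors that only record doc ids and scores and do not
          // depend on documents arriving in increasing doc id order.
          @Override
          public boolean acceptsDocsOutOfOrder() {
              return true;
          }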

          Important:
          Only use this patch from revision 794328 (07/15/2009) and up. Use the previous patch if you are using an older 1.4-dev revision.

          Martijn van Groningen added a comment -

          Because the lucene jars have been updated, the previous patch does not work with the current trunk.
          Use this patch for rev 801872 and up. For revisions before that use the older patches.

          I have also included SolrJ support for field collapsing in this patch. Might be handy for those integrating with Solr via SolrJ.
          By invoking enableFieldCollapsing(...) with a field name as parameter on the SolrQuery class you enable field collapsing for the current request.
          If the search was successful, one can execute getFieldCollapseResponse() on SolrResponse to retrieve a FieldCollapseResponse object from which one can read the field collapse information.
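
          A rough SolrJ usage sketch of the additions described above. Note that enableFieldCollapsing(...) and getFieldCollapseResponse() exist only with this patch applied, and the URL, query string and collapse field below are made up for illustration:

          import org.apache.solr.client.solrj.SolrQuery;
          import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
          import org.apache.solr.client.solrj.response.QueryResponse;

          public class FieldCollapseSolrjExample {
              public static void main(String[] args) throws Exception {
                  CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
                  SolrQuery query = new SolrQuery("*:*");
                  // enableFieldCollapsing(...) is added to SolrQuery by this patch
                  query.enableFieldCollapsing("venue");
                  QueryResponse response = server.query(query);
                  // getFieldCollapseResponse() is added by the patch and exposes the collapse counts
                  System.out.println(response.getFieldCollapseResponse());
              }
          }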

          Martijn van Groningen added a comment -

          I have updated the field collapse patch and made the following changes:

          1. Refactored the collapse code into a strategy pattern. The two distinct manners of collapsing are now in two different classes, which in my understanding makes the code cleaner and easier to understand. I have removed the CollapseFilter and created a DocumentCollapser interface. The DocumentCollapser has two concrete implementations, the AdjacentDocumentCollapser and the NonAdjacentDocumentCollapser. Both implementations share the same abstract base class, AbstractDocumentCollapser, which holds the fields and methods common to both.
          2. Removed deprecated Lucene methods in the PredefinedScorer.
          3. Fixed a normal field collapse bug. Filter queries were handled as normal queries (were added together via a boolean query), and thus were also used for scoring.
          4. Added more unit and integration tests, including two tests that test facets in combination with field collapsing. These tests cover faceting both before and after collapsing.

          This patch only works with the Solr 1.4-dev from revision 804700 and later.

          Martijn van Groningen added a comment -

          I was trying to come up with a solution to implement distributed field collapsing, but I ran into a problem that I could not solve in an efficient manner.

          Field collapsing keeps track of the number of documents collapsed per unique field value and the total count of documents encountered per unique field. If the total count is greater than the specified collapse threshold, then the number of documents collapsed is the difference between the total count and the threshold. Let's say we have two shards and each shard has one document with the same field value. The collapse threshold is one, meaning that if we run the collapsing algorithm on each shard individually neither document will ever be collapsed. But when the algorithm is applied across both shards, one of the documents must be collapsed, yet neither shard knows that its document is the one to collapse.

          There are more situations described as above, but it all boils down to the fact that each shard does not have meta information about the other shards in the cluster. Sharing the intermediate collapse results between the shards is in my opinion not an option. This is because if you do that then you also need to share information about documents / fields that have a collapse count of zero. This is totally impractical for large indexes.

          Besides that, there is another problem with distributed field collapsing. Field collapsing only keeps the most relevant document in the result set and collapses the less relevant ones. If scoring is used to sort, then field collapsing will fail to do this properly, because there is no global scoring (idf).

          Does anyone have an idea on how to solve this? The first problem seems related to the same kind of problem that implementing global scoring has.

          Thomas Traeger added a comment -

          Hi Martijn, I tested your latest patch and found no problems so far. The code is indeed easier to understand now, good work.

          For my current project I need to know which documents have been removed during collapsing. The current idea is to change the collapsing info and add an array with all document IDs that are removed from the result. Any suggestion on how/where to implement this?

          Tarjei Huse added a comment -

          Hi,

          I tested the latest patch (field-collapse-5.patch) and got:

          HTTP Status 500 - 8452333 java.lang.ArrayIndexOutOfBoundsException: 8452333
                  at org.apache.lucene.search.FieldComparator$StringOrdValComparator.copy(FieldComparator.java:660)
                  at org.apache.solr.search.NonAdjacentDocumentCollapser$DocumentComparator.compare(NonAdjacentDocumentCollapser.java:254)
                  at org.apache.solr.search.NonAdjacentDocumentCollapser$DocumentPriorityQueue.lessThan(NonAdjacentDocumentCollapser.java:192)
                  at org.apache.lucene.util.PriorityQueue.insertWithOverflow(PriorityQueue.java:158)
                  at org.apache.solr.search.NonAdjacentDocumentCollapser.doCollapsing(NonAdjacentDocumentCollapser.java:99)
                  at org.apache.solr.search.AbstractDocumentCollapser.collapse(AbstractDocumentCollapser.java:174)
                  at org.apache.solr.handler.component.CollapseComponent.doProcess(CollapseComponent.java:98)
                  at org.apache.solr.handler.component.CollapseComponent.process(CollapseComponent.java:67)
                  at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
          ...

          I can provide the complete stacktrace if needed.

          Martijn van Groningen added a comment -

          Hi Thomas, currently both collapsing algorithms do not store the ids of the collapsed documents.
          In order to have this functionality I think the following has to be done:
          1) In the doCollapsing(...) methods of both concrete implementations of DocumentCollapser, the collapsed documents have to be stored. Depending on what you want, you can store them in one big list or store a list per most relevant document (a sketch of the latter follows below this list). The most relevant document is the document that does not collapse.
          2) In the getCollapseInfo(...) method in the AbstractDocumentCollapser you then need to output these collapsed documents. If you are storing the collapsed documents in one big list then adding a new NamedList with the collapsed documents would be fine I guess. If you are storing the collapsed documents per document head, then I would add the collapsed document ids to the existing resDoc named list. It is important that you return the Solr unique id instead of the Lucene id.
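
          A minimal sketch of the bookkeeping described in 1), assuming a map from the surviving (head) document to the documents collapsed into it. All names here are illustrative and not part of the patch:

          import java.util.ArrayList;
          import java.util.HashMap;
          import java.util.List;
          import java.util.Map;

          // Illustrative only: collect the Lucene doc ids that were collapsed away, grouped by the
          // head document that stays in the result set. A real implementation would translate these
          // Lucene ids back to the Solr unique key field before writing them to the response.
          class CollapsedDocTracker {
              private final Map<Integer, List<Integer>> collapsedPerHead = new HashMap<Integer, List<Integer>>();

              void recordCollapse(int headDocId, int collapsedDocId) {
                  List<Integer> collapsed = collapsedPerHead.get(headDocId);
                  if (collapsed == null) {
                      collapsed = new ArrayList<Integer>();
                      collapsedPerHead.put(headDocId, collapsed);
                  }
                  collapsed.add(collapsedDocId);
              }

              List<Integer> collapsedDocsFor(int headDocId) {
                  List<Integer> collapsed = collapsedPerHead.get(headDocId);
                  return collapsed != null ? collapsed : new ArrayList<Integer>();
              }
          }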

          This is just one approach, but what is the reason that you want this functionality? I guess what would be much easier is to do a second query after the collapse query. In this second query you disable field collapsing (by not setting collapse.field) and you set fq=[collapse.field]:[collapse.value], for example.

          Potentially the number of collapsed documents can be very large and in that situation it can have an impact on performance. Therefore I think that this functionality should be disabled by default, in the same way collapseInfoDoc and collapseInfoCount are managed.
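
          To make the two-query workaround concrete, the follow-up request simply drops the collapse parameters and filters on the group's value instead. The handler path, field name and value below are only an example:

          First request (collapsed): http://localhost:8983/solr/select?q=laptop&collapse.field=venue
          Follow-up request for one group (no collapse.field): http://localhost:8983/solr/select?q=laptop&fq=venue:melkweg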

          Martijn van Groningen added a comment -

          Hi Tarjei, that doesn't look good. Besides the complete stacktrace, I'm also interested in what request url to Solr resulted in this exception (with what params etc.) and what version of Solr you are currently using.

          Thomas Traeger added a comment -

          I use collapsing in an online store and need to do a quite complex price calculation for every collapse group based on the products behind that group. I also thought about doing a second query, but that is not an option as I would have to do that for every group (I have up to 100 groups per request). So doing the calculation outside the scope of Solr, but retrieving the necessary data from Solr, seems to be the best approach for me. I agree that this functionality should be disabled by default.

          Thanks for the pointer, I will have a look at it...

          Darrell Silver added a comment -

          Hi there, Martijn & Thomas,

          We're using FieldCollapse exactly in this way. In order to retrieve the collapsed results we do subqueries over each of the results returned from the (outer) collapsing query, as Martijn suggests. It would be a fantastic option if the documents in the collapse could be returned. Knowing how many there are would be a big improvement as well (maybe this is possible and I don't know how?).

          Right now, in order to manage the load, we're calling the subquery only on the results in the page in the user's view. Because this is all happening in a web environment, we also selectively choose to make some requests on that page while we're generating the search results page, and make the others via ajax from the browser, which gives the user a much faster response.

          Thanks,

          D

          Martijn van Groningen added a comment -

          Hi Thomas, I agree that in your situation this feature is very handy. Assuming that you want to return the whole document (with all fields) and you have groups of reasonable size, this increases your response time dramatically. What I think would be a better approach is to only return the fields you want to use for your calculation, let's say an average price per group. So instead of returning 10 fields per group (say 7000 documents) you will only return one, and that will save you a lot of response time.
          What do you think about this approach?

          I also think the Ajax solution that Darrell describes is a good way to go.

          Thomas Traeger added a comment -

          Yes, returning only one field would perfectly fit my needs, but Darrell seems to need more or even the complete document. So I think we need a collapse parameter that defines the field(s) of the removed documents that have to be included in the response. The Ajax approach is quite interesting but unfortunately does not fit our needs in this case.

          Darrell, the counts are already included in the response by default, look for "collapse_count".

          Darrell Silver added a comment -

          Ha, so it is! Thanks for the note; I'd totally missed that.

          Returning only select fields of the collapsed documents would be a good option for us. Also, in our subquery of the collapsed documents we're finding the first and last result (they're time sorted so this makes sense). I guess this is similar to Thomas' average problem, but for us it's not necessary to iterate over the entire subquery results.

          Martijn van Groningen added a comment - - edited

          Yes, specifying which collapse fields to return is a good idea. Just like the fl parameter for a normal request.
          I was thinking about how to fit this new feature into the current patch and I thought that it might be a good idea to revise the current field collapse result format. So that the results of this feature can fit nicely into the response.

          Currently the collapse response is like this:

          <lst name="collapse_counts">
                  <str name="field">venue</str>
                  <lst name="doc">
                      <int name="233238">1</int>
                  </lst>
                  <lst name="count">
                      <int name="melkweg">1</int>
                  </lst>
          </lst>
          

          I think a response format like the following would be more ....

          <lst name="collapse_counts">
                  <str name="field">venue</str>
                  <lst name="results">
                      <lst name="233238">
                           <str name="fieldValue">melkweg</str>
                           <int name="collapseCount">2</int>
                           <lst name="collapsedValues">
                               <str name="price">10.99, "1.999,99"</str>
                               <str name="name">adapter, laptop</str>
                           </lst>
                      </lst>
                  </lst>
          </lst>
          

          As you can see the data is more grouped together and therefore easier to parse. The collapsedValues element can have one or more fields, each containing collapsed field values in a comma-separated format. The collapsedValues element will of course only be added when the client specifies the collapsed fields in the request.
          What do you think about this new result format?

          Thomas Traeger added a comment -

          Hi Martijn,

          I also thought about changing the response format and introducing two new parameters, "collapse.response" and "collapse.response.fl".

          What do you think of these values for "collapse.response":

          "counts": the default and current behavior, maybe even current response format to provide backward compatibility
          "docs": returns the counts and the collapsed docs inside the collapse response (essentialy instead of removing the doc from the result just move it from the result to the collapse response). The parameter "collapse.response.fl" can be used to specify the field(s) to be returned in the collapse response.

          So starting with your proposal the new collapse response format might look like this:

          <lst name="collapse_counts">
              <str name="field">venue</str>
              <lst name="results">
                  <lst name="233238">
                      <str name="fieldValue">melkweg</str>
                      <int name="collapseCount">2</int>
                       <lst name="collapsedDocs">
                          <doc>
                              <str name="id">233239</str>
                              <str name="name">Foo Bar</str>
                              ...
                          </doc>
                          <doc>
                              <str name="id">233240</str>
                              <str name="name">Foo Bar 2</str>
                              ...
                          </doc>
                      </lst>
                  </lst>
              </lst>
          </lst>
          

          I think just moving the collapsed docs into the collapse response when desired provides us the necessary flexibility and is hopefully easy to implement.

          Martijn van Groningen added a comment -

          Hi Thomas,

          Comparing my format proposal with yours, the difference is how I output the collapsed documents. I chose to add all collapsed values in an element per field, because that would make it more compact and thus easier to transmit on the wire (certainly if the number of collapsed documents to return is large). This approach is not standard in Solr and your result structure is more common. I think that most of the time is probably spent reading the collapsed field values from the index anyway (i/o), therefore I think that your result structure is probably the best way to go.

          I think that supporting the 'old' format is not that good an idea, because this only increases complexity in the code. Also, field collapsing is just a patch (although it has been around for a while) and is not a core Solr feature. I think people using this patch (and a patch in general) should always be aware that everything in a patch is subject to change. I think that collapse.response should be named something like collapse.includeCollapsedDocs; when this is specified it includes the collapsed documents. The collapse.includeCollapsedDocs.fl parameter would then only include the specified fields in the collapsed documents. So specifying collapse.includeCollapsedDocs=true would result in the following response:

          <lst name="collapse_counts">
              <str name="field">venue</str>
              <lst name="results">
                  <lst name="233238">
                      <str name="fieldValue">melkweg</str>
                      <int name="collapseCount">2</int>
                       <lst name="collapsedDocs">
                          <doc>
                              <str name="id">233239</str>
                              <str name="name">Foo Bar</str>
                              ...
                          </doc>
                          <doc>
                              <str name="id">233240</str>
                              <str name="name">Foo Bar 2</str>
                              ...
                          </doc>
                      </lst>
                  </lst>
              </lst>
          </lst>
          

          Not specifying collapse.includeCollapsedDocs would result in the following response output:

          <lst name="collapse_counts">
              <str name="field">venue</str>
              <lst name="results">
                  <lst name="233238">
                      <str name="fieldValue">melkweg</str>
                      <int name="collapseCount">2</int>
                  </lst>
              </lst>
          </lst>
          

          This will be the default and only response format.
          And when for example collapse.info.doc=false is specified then the following result will be returned:

          <lst name="collapse_counts">
              <str name="field">venue</str>
              <lst name="results"> 
                  <lst name="melkweg"> <!-- we can not use the head document id any more, so we use the field value --> 
                      <int name="collapseCount">2</int>
                  </lst>
              </lst>
          </lst>
          

          When collapse.info.count=false is specified this would just remove the fieldValue from the response. I do not know if these parameters are actually set to false by many people, but it is something to keep in mind. I also recently added support for field collapsing to SolrJ in the patch; obviously this has to be updated to the latest response format.

          In general it must be made clear to the Solr user that this feature is handy, but it can dramatically influence performance in a negative way. This is because the response can contain a lot of documents and each field value has to be read from the index, which results in a lot of i/o activity on the Solr side. Simply because so much data is returned in the response, even viewing it in the browser can become quite a challenge.

          But more important do you think that these changes are acceptable (response format / request parameters)?

          Abdul Chaudhry added a comment - - edited

          I have some ideas for performance improvements.

          I noticed that the code fetches the field cache twice, once for the collapse and then for the response object, assuming you asked for the info count in the response.

          That seems expensive, especially for real-time content.

          I think it's better to use FieldCache.StringIndex instead of returning a large string array, and keep it around for the collapse and the response object.

          I changed the code so that I keep the cache around like so

          /**
           * Keep the field cached for the collapsed fields for the response object as well
           */
          private FieldCache.StringIndex collapseIndex;

          To get the index use something like this instead of getting the string array for all docs

          collapseIndex = FieldCache.DEFAULT.getStringIndex(searcher.getReader(), collapseField)

          When collapsing, you can get the current value using something like this and remove the code that passes the array:

          int currentId = i.nextDoc();
          String currentValue = collapseIndex.lookup[collapseIndex.order[currentId]];

          when building the response for the info count, you can reference the same cache like so:-

          if (collapseInfoCount) {
              resCount.add(collapseFieldType.indexedToReadable(collapseIndex.lookup[collapseIndex.order[id]]), count);
          }

          I also added timing for the cache access as it could be slow if you are doing a lot of updates

          I have added code for displaying selected fields for the duplicates, but it's difficult to submit. I hope this gets committed, as it's hard to submit a patch when the base isn't in svn and I cannot submit a patch to a patch to a patch... you get the idea.

          Martijn van Groningen added a comment -

          Hi Abdul, nice improvements. It makes absolute sense to keep the field values around during the collapsing as a StringIndex. From what I understand the StringIndex does not have duplicate string values, whereas the plain string array does. This will lower the memory footprint. I will add these improvements to the next patch. Thanks for pointing this out!

          Abdul Chaudhry added a comment -

          In case this helps you fix your unit tests: I fixed the unit tests by changing the CollapseFilter constructor that's used for testing to take a StringIndex, like so:

          - CollapseFilter(int collapseMaxDocs, int collapseTreshold) {
          + CollapseFilter(int collapseMaxDocs, int collapseTreshold, FieldCache.StringIndex index) {
          +     this.collapseIndex = index;

          and then I changed the unit test cases to move values into a StringIndex in CollapseFilterTest like so:-

          public void testNormalCollapse_collapseThresholdOne() {
          -     collapseFilter = new CollapseFilter(Integer.MAX_VALUE, 1);
          +     String[] values = new String[] {"a", "b", "c"};
          +     int[] order = new int[] {0, 1, 0, 2, 1, 0, 1};
          +     FieldCache.StringIndex index = new FieldCache.StringIndex(order, values);
          +     int[] docIds = new int[] {1, 2, 0, 3, 4, 5, 6};
          +
          +     collapseFilter = new CollapseFilter(Integer.MAX_VALUE, 1, index);
          -     String[] values = new String[] {"a", "b", "a", "c", "b", "a", "b"};

          Paul Nelson added a comment - - edited

          Hey All: Just upgraded to 1.4 to get the new patch (many thanks, Martijn). The new algorithm appears to be sensitive to the size and complexity of the query (rather than simply the count of documents) - should this be the case? Unfortunately, we have rather large and complex queries with dozens of terms and several phrases, and while these queries are <0.5sec without collapsing, they are 3-4sec with collapsing. Meanwhile, collapse using *:* or other simple queries come back in <0.5sec - so it appears to be primarily a query-complexity issue.

          I'm wondering if the filter cache (or some other cache) might be able to help with this situation?

          Martijn van Groningen added a comment -

          I have updated the field collapse patch with the following:
          1. Added the return collapsed documents feature. When the parameter collapse.includeCollapsedDocs with value true is specified, the collapsed documents will be returned per distinct field value. When this feature is enabled a collapsedDocs element is added to the field collapse response part. It looks like this:

          <lst name="collapsedDocs">
            <result name="Amsterdam" numFound="2" start="0">
          	<doc>
          	 <str name="id">262701</str>
          	 <str name="title">Bitterzoet, 100% Halal, Appletree Records &amp; Deux d'Amsterdam presents</str>
          	</doc>
          	<doc>
          	 <str name="id">327511</str>
          	 <str name="title">Salsa Danscafé</str>
          	</doc>
            </result>
           </lst>
          

          It is also possible to return only specific fields with the collapse.includeCollapsedDocs.fl parameter. It expects field names delimited by commas, just like the normal fl parameter.

          This feature can dramatically impact performance, because a group can potentially contain many documents which all have to be retrieved from the index and transported over the wire. So it is certainly wise to use it in combination with the fl parameter (see the example request after this list).
          2. Added Solrj support for collapsed documents feature.
          3. Added the performance improvements that Abdul suggested.
          4. The debug information is now not returned by default. When the parameter collapse.debug with value true is specified, then the debug information is returned.
          5. When field collapsing is done on a field that is multivalued or tokenized, an exception is thrown. I have chosen to do this because collapsing on such fields leads to unexpected results. For example, when a field is tokenized only the last token of the field can be retrieved from the fieldcache (the fieldcache is used for retrieving the fields from the index in a cached manner for grouping documents into groups of distinct field values). This results in collapsing only on the last token of a field value instead of the complete field value. Multivalued fields have similar behaviour, plus for multivalued fields the Lucene FieldCache throws an exception when there are more tokens for a field than documents. Personally I think that throwing an exception is better than having unexpected results; at least it is clear that something field collapse related is wrong.
          6. When doing a normal field collapse and not sorting on score the Solr caching mechanism is used. Unfortunately this was previously not the case.
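
          For reference, a request combining the parameters from points 1 and 4 might look like the following; the handler path and field names are only illustrative:

          http://localhost:8983/solr/select?q=*:*&collapse.field=venue&collapse.includeCollapsedDocs=true&collapse.includeCollapsedDocs.fl=id,title&collapse.debug=true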

          @Paul
          When doing non adjacent collapsing (aka normal collapsing) the Solr caches are not being used. The current patch uses the Solr caches when doing a search without scoring, but still the most common case is of course field collapsing and sorting on score. This is because the non adjacent field collapse algorithm requires the score of all results, which is collected with a Lucene collector. The search method on the SolrIndexSearcher that specifies a collector does not have caching capabilities. In the next patch I will fix this problem, so that normal field collapse search uses the Solr caches as it should. The adjacent collapsing algorithm does use the Solr caches, but the algorithm is much slower than non adjacent collapsing.

          Paul Nelson added a comment -

          Thanks Martijn!

          Also, while testing collapse, I've noticed some threading issues. I think they are primarily centered around the collapseRequest field.

          Specifically, when I run two collapse queries at the same time, I get the following exception:

          java.lang.IllegalStateException: Invoke the collapse method before invoking getCollapseInfo method
                  at org.apache.solr.search.AbstractDocumentCollapser.getCollapseInfo(AbstractDocumentCollapser.java:183)
                  at org.apache.solr.handler.component.CollapseComponent.doProcess(CollapseComponent.java:115)
                  at org.apache.solr.handler.component.CollapseComponent.process(CollapseComponent.java:67)
                  at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
                  at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
                  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1299)
                  at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
                  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
                  at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
                  at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
                  at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
                  at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
                  at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
                  at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
                  at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
                  at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
                  at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:849)
                  at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
                  at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:454)
                  at java.lang.Thread.run(Thread.java:619)
          

           And when I run a second (non-collapsing) query at the same time as the collapse query, I get this exception:

          java.lang.NullPointerException
                  at org.apache.solr.handler.component.CollapseComponent.doProcess(CollapseComponent.java:109)
                  at org.apache.solr.handler.component.CollapseComponent.process(CollapseComponent.java:67)
                  at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
                  at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
                  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1299)
                  at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
                  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
                  at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
                  at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
                  at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
                  at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
                  at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
                  at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
                  at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
                  at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
                  at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:849)
                  at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
                  at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:454)
                  at java.lang.Thread.run(Thread.java:619)
          

          These errors occurred with the 2009-08-24 patch, but (upon brief inspection) it looks like the same situation would occur with the latest patch.

          If I get the chance, I'll try and debug further.

          Oleg Gnatovskiy added a comment -

          Hey Martijn,
          Have you made any progress on making field collapsing distributed?
          Oleg

          Martijn van Groningen added a comment -

          Hi Paul, thanks for pointing this out. I also tried to hammer my Solr instance and I got the same exceptions, which is not good. I have attached a patch that fixes these exceptions. The problem was indeed centred around the collapseRequest field and I have fixed this by using a ThreadLocal that holds the CollapseRequest instance. Because of this the reference to the CollapseRequest is not shared across the search requests and thus a new thread cannot interfere with a collapse request that is still being used by another thread.
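
           For illustration, a rough sketch of the ThreadLocal approach described above (this is not the literal patch code; imports are omitted as in the other snippets in this issue, and it assumes CollapseComponent extends QueryComponent and reuses the resolveCollapseRequest and doProcess methods that appear elsewhere in this issue):

           public class CollapseComponent extends QueryComponent {

             // one CollapseRequest per request-handling thread, so concurrent searches no longer share state
             private final ThreadLocal<CollapseRequest> collapseRequest = new ThreadLocal<CollapseRequest>();

             public void prepare(ResponseBuilder rb) throws IOException {
               super.prepare(rb);
               collapseRequest.set(resolveCollapseRequest(rb));
             }

             public void process(ResponseBuilder rb) throws IOException {
               CollapseRequest request = collapseRequest.get();
               collapseRequest.remove(); // always clear, so no state leaks to the next request handled by this thread
               if (request == null) {
                 super.process(rb);
                 return;
               }
               doProcess(rb, request);
             }
           }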

          Martijn van Groningen added a comment -

          Hi Oleg, no I have not made any progress. I'm still not clear how to solve it in an efficient manner as I have written in my previous comment:

          I was trying to come up with a solution to implement distributed field collapsing, but I ran into a problem that I could not solve in an efficient manner.

           Field collapsing keeps track of the number of documents collapsed per unique field value and the total count of documents encountered per unique field value. If the total count is greater than the specified collapse threshold, then the number of documents collapsed is the difference between the total count and the threshold. Let's say we have two shards and each shard has one document with the same field value. The collapse threshold is one, meaning that if we run the collapsing algorithm on each shard individually, neither document will ever be collapsed. But when the algorithm is applied to both shards together, one of the documents must be collapsed, yet neither shard knows that its document is the one to collapse.

           There are more situations like the one described above, but it all boils down to the fact that each shard does not have meta information about the other shards in the cluster. Sharing the intermediate collapse results between the shards is in my opinion not an option, because if you do that then you also need to share information about documents / fields that have a collapse count of zero. This is totally impractical for large indexes.

           Besides that, there is another problem with distributed field collapsing. Field collapsing only keeps the most relevant document in the result set and collapses the less relevant ones. If scoring is used to sort, then field collapsing will fail to do this properly, because there is no global scoring (idf).

           Does anyone have an idea on how to solve this? The first problem seems related to the same kind of problem that implementing global scoring has.

           I recently read something about Katta. Katta facilitates distributed search and has support for global scoring. I'm not completely sure how it is implemented in Katta, but maybe with Katta it is relatively efficient to share the intermediate collapse results between shards.

          Uri Boness added a comment -

           Martijn, I think a more appropriate way to fix the threading issue is to bind the collapseRequest to the request context and drop the class field altogether. So:

          public void prepare(ResponseBuilder rb) throws IOException {
              super.prepare(rb);
              rb.req.getContext().put("collapseRequest", resolveCollapseRequest(rb));
          }
          

          and

          public void process(ResponseBuilder rb) throws IOException {
              CollapseRequest collapseRequest = (CollapseRequest) rb.req.getContext().remove("collapseRequest"); // getContext() returns a map of Objects, so a cast is needed
              if (collapseRequest == null) {
                super.process(rb);
                return;
              }
              doProcess(rb, collapseRequest);
          }
          
          Martijn van Groningen added a comment -

           You are right Uri, using the requestContext is much more appropriate than using a ThreadLocal. I have updated the patch with this change.

          Thomas Traeger added a comment -

           Hi Martijn, I made some tests with the new collapsedDocs feature. It looks very good, but in some cases it seems to return wrong collapsed docs. There seems to be a connection between sorting and this problem. Here is an example using the example docs, collapsed on the field inStock and sorted by popularity:

          http://localhost:8983/solr/select/?q=*:*&sort=popularity%20asc&fl=id&collapse.field=inStock&collapse.includeCollapsedDocs=true&collapse.includeCollapsedDocs.fl=id

          For inStock:T document id:VDBDB1A16 remains in the result after collapsing. But this document is also returned in the collapsedDocs response and in addition document id:SP2514N is missing there.

          Martijn van Groningen added a comment -

           Hi Thomas. I tried to reproduce something similar here, but I did not run into the problems you described. Can you tell me what the field types are for your sort field and collapse field?

          Thomas Traeger added a comment -

           I found the problem with my real world data and reproduced it with the Solr example schema and data. In the Solr example, popularity is of type "int" and inStock is "boolean". I made some more tests and could reproduce it with other field types too; here are some examples using the field manu_exact (string):

          http://localhost:8983/solr/select/?q=*:*&sort=manu_exact%20asc&fl=id&collapse.field=inStock&collapse.includeCollapsedDocs=true
          -> as in the previous example document id:VDBDB1A16 is in result and collapsedDocs

          http://localhost:8983/solr/select/?q=*:*&sort=manu_exact%20desc&fl=id&collapse.field=inStock&collapse.includeCollapsedDocs=true
          -> document id:VA902B is in result and collapsedDocs

          http://localhost:8983/solr/select/?q=*:*&sort=popularity%20desc&fl=id&collapse.field=manu_exact&collapse.includeCollapsedDocs=true
          -> document id:VS1GB400C3 is in result and collapsedDocs

          Martijn van Groningen added a comment -

           Hi Thomas, I have fixed the problem and updated the patch. I was able to reproduce the bug on the Solr example dataset. The problem was not limited to field collapsing with sorting on a field alone. The problem was located in the NonAdjacentFieldCollapser in the doCollapse(...) method, in this specific part:

                // dropoutId has a value smaller than the smallest value in the queue and therefore it was removed from the queue
                collapseDoc.priorityQueue.insertWithOverflow(currentId);
          
                // check if we have reached the collapse threshold, if so start counting collapsed documents
                if (++collapseDoc.totalCount > collapseTreshold) {
                  collapseDoc.collapsedDocuments++;
                  if (dropOutId != null) {
                    addCollapsedDoc(currentId, currentValue);
                  }
                }
          

           Let's say that the currentId has the most relevant field value and the collapseThreshold is met. When the currentId is added to the queue it stays there and another document id is dropped out. In this situation a document with the most relevant field value is added to the collapsed documents while it also stays in the queue, and therefore it will also be added to the normal results.

          I changed it to this.

                // dropoutId has a value smaller than the smallest value in the queue and therefore it was removed from the queue
                Integer dropOutId = (Integer) collapseDoc.priorityQueue.insertWithOverflow(currentId);
          
                // check if we have reached the collapse threshold, if so start counting collapsed documents
                if (++collapseDoc.totalCount > collapseTreshold) {
                  collapseDoc.collapsedDocuments++;
                  if (dropOutId != null) {
                    addCollapsedDoc(dropOutId, currentValue);
                  }
                }
          

           Now only a document that will never end up in the final results is added to the collapsed documents (and not the current document, which might be more relevant than other documents in the priority queue). The above code change fixes the bug in my test setups; can you confirm that it also fixes the issue on your side?

          Thomas Traeger added a comment -

          Hi Martijn, this fixed the problem, thanks

          Martijn van Groningen added a comment -

          I have created a new patch that has the following changes:
           1) Non adjacent collapsing with sorting on score also uses the Solr caches now. So now every field collapse search uses the Solr caches properly. This was not the case in my previous versions of the patch. This improvement will make field collapsing perform better and reduce the query time for regular searches. The downside was that, in order to make this work, I had to modify some methods in the SolrIndexSearcher.

          When sorting on score the non adjacent collapsing algorithm needs the score per document. The score is collected in a Lucene collector. The previous version of the patch uses the searcher.search(Query, Filter, Collector) method to collect the documents (as a DocSet) and scores, but by using this method the Solr caches were ignored.
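
           For illustration, a simplified sketch of what such a score-collecting Lucene collector can look like (the attached DocSetScoreCollector.java differs in detail; this example only records the matching doc ids in a bit set plus their scores):

           import java.io.IOException;

           import org.apache.lucene.index.IndexReader;
           import org.apache.lucene.search.Collector;
           import org.apache.lucene.search.Scorer;
           import org.apache.lucene.util.OpenBitSet;
           import org.apache.solr.search.BitDocSet;
           import org.apache.solr.search.DocSet;

           // Simplified example: collects matching documents as a DocSet and remembers their scores.
           public class SimpleDocSetScoreCollector extends Collector {

             private final OpenBitSet bits;
             private final float[] scores;
             private Scorer scorer;
             private int docBase;

             public SimpleDocSetScoreCollector(int maxDoc) {
               bits = new OpenBitSet(maxDoc);
               scores = new float[maxDoc];
             }

             public void setScorer(Scorer scorer) throws IOException {
               this.scorer = scorer;
             }

             public void setNextReader(IndexReader reader, int docBase) throws IOException {
               this.docBase = docBase;
             }

             public void collect(int doc) throws IOException {
               int globalDoc = docBase + doc;
               bits.set(globalDoc);
               scores[globalDoc] = scorer.score();
             }

             public boolean acceptsDocsOutOfOrder() {
               return true;
             }

             public DocSet getDocSet() {
               return new BitDocSet(bits);
             }

             public float score(int globalDoc) {
               return scores[globalDoc];
             }
           }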

           The methods that return a DocSet in the SolrIndexSearcher do not offer the ability to specify your own collector. I changed that so you can specify your own collector and still benefit from the Solr caches. I did this in a non-intrusive manner, so that nothing changes for existing code that uses the normal versions of these methods.

          
             public DocSet getDocSet(Query query) throws IOException {
              DocSetCollector collector = new DocSetCollector(maxDoc()>>6, maxDoc());
              return getDocSet(query, collector);
             }
          
             public DocSet getDocSet(Query query, DocSetAwareCollector collector) throws IOException {
              ....
             }
          
            DocSet getPositiveDocSet(Query q) throws IOException {
              DocSetCollector collector = new DocSetCollector(maxDoc()>>6, maxDoc());
              return getPositiveDocSet(q, collector);
             }
          
            DocSet getPositiveDocSet(Query q, DocSetAwareCollector collector) throws IOException {
              .....
             }
          
            public DocSet getDocSet(List<Query> queries) throws IOException {
              DocSetCollector collector = new DocSetCollector(maxDoc()>>6, maxDoc());
              return getDocSet(queries, collector);
             }
          
            public DocSet getDocSet(List<Query> queries, DocSetAwareCollector collector) throws IOException {
             .......
             }
          
            protected DocSet getDocSetNC(Query query, DocSet filter) throws IOException {
              DocSetCollector collector = new DocSetCollector(maxDoc()>>6, maxDoc());
              return getDocSetNC(query,  filter, collector);
             }
          
            protected DocSet getDocSetNC(Query query, DocSet filter, DocSetAwareCollector collector) throws IOException {
             .........
             }
          

          I also made a DocSetAwareCollector that both DocSetCollector and DocSetScoreCollector implement.
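
           As a rough illustration of that idea (the exact definition in the patch may differ), such an interface can be as small as:

           import org.apache.solr.search.DocSet;

           // Hypothetical sketch: a collector that can expose what it collected as a DocSet,
           // so the SolrIndexSearcher can still put the result into its caches.
           public interface DocSetAwareCollector {

             /** Returns the documents collected so far as a DocSet. */
             DocSet getDocSet();
           }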
           2) The collapse.includeCollapsedDocs parameter has been removed. In order to include the collapsed documents the parameter collapse.includeCollapsedDocs.fl must be specified. collapse.includeCollapsedDocs.fl=* will include all fields of the collapsed documents and collapse.includeCollapsedDocs.fl=id,name will only include the id and name fields of the collapsed documents.

          Aytek Ekici added a comment - - edited

          Hi all,
          Just applied "field-collapse-5.patch" and i guess there are problems with filter queries.

          Here it is:

           1- select?q=*:*&fq=lat:[37.2 TO 39.8]
          numFound: 6284

           2- select?q=*:*&fq=lng:[24.5 TO 29.9]
          numFound: 16912

           3- select?q=*:*&fq=lat:[37.2 TO 39.8]&fq=lng:[24.5 TO 29.9]
          numFound: 19419

          4- When using "q" instead of "fq" which is:
          select?q=lat:[37.2 TO 39.8] AND lng:[24.5 TO 29.9]
          numFound: 3777 (which is the only correct number)

           The thing is, as I understand it, instead of applying "AND" between the filter queries it applies "OR". I checked select?q=lat:[37.2 TO 39.8] OR lng:[24.5 TO 29.9]
          numFound: 19419 (same as 3rd one)

          Any idea how to fix this?
          Thx.

          Martijn van Groningen added a comment -

          Hi Aytek,

           As I understand it, each separate filter query produces a result set and these result sets are intersected, which means that it works as you want.
           I'm not sure, but I think this issue is not related to the patch. I have tried to reproduce this situation (on a different data set), but it behaved as it should, with the patch and without.
           Have you tried fq=lat:[37.2 TO 39.8] AND lng:[24.5 TO 29.9] instead of having it in two separate fqs?

          Martijn

          Anıl Çetin added a comment -

          Hi Martijn,

          to clarify the problem;

           1) select/?q=*:*&fq=+lat:[37.2 TO 39.8] +lng:[24.5 TO 29.9]

           2) select/?q=*:*&fq=lat:[37.2 TO 39.8]&fq=lng:[24.5 TO 29.9]

           The expected result sets for these queries are identical (aren't they?), but with the patch the results become different.

          Also without patch there is no problem.

          Aytek Ekici added a comment -

          Hi Martijn,
           Intersection of result sets is also a kind of "AND", right? The intersection of docset A and docset B is equal to the result set of "condA AND condB", I think.

          Your suggestion "fq=lat:[37.2 TO 39.8] AND lng:[24.5 TO 29.9]" works. And also Anil's suggestion "fq=+lat:[37.2 TO 39.8] +lng:[24.5 TO 29.9]" works.
          But they don't allow multiple selections for a facet field. I can't use excludes. It throws parsing errors.
          Using "AND" between two filters in a filter query results with one item in FilterList of QueryCommand, that must be the reason not to be able to parse/support ex/tag things there i guess.

           I have two Solr instances here, one with the patch and another without it, and I just copied the configuration and data from one to the other. The only difference is the field collapsing patch, as far as I can see. I'm trying to see what makes the difference in the results, but I'm new to Solr so it takes time to catch what is going on. Any help/tip would be appreciated.

          Thanks,
          Aytek

          Martijn van Groningen added a comment -

          Hi Aytek,

           I was able to reproduce the same situation you described earlier. When I was testing yesterday I thought I was testing on a Solr instance without the patch, but I wasn't. Anyhow, I have fixed the bug and attached a new patch. Good thing you noticed this bug; it was really corrupting the search results.

          Martijn

          Aytek Ekici added a comment -

          Hi Martijn,
          Thanks a lot it works.

          Aytek

          Martijn van Groningen added a comment - - edited

           I have attached a new patch which includes a major refactoring that makes the code more flexible and cleaner. The patch also includes new aggregate functionality and a bug fix.

          Aggregate function and bug fix

           The new patch allows you to execute aggregate functions on the collapsed documents (for example summing the stock amount or calculating the minimum price of a collapsed group). Currently there are four aggregate functions available: sum(), min(), max() and avg(). To execute one or more functions the collapse.aggregate parameter has to be added to the request URL. The parameter expects the following syntax: function_name(field_name)[, function_name(field_name)]. For example, collapse.aggregate=sum(stock),min(price) might give a result like this:

          <lst name="aggregatedResults">
             <lst name="sum(stock)">
                <str name="Amsterdam">10</str>
                ...
             </lst>
             <lst name="min(price)">
                <str name="Amsterdam">5.99</str>
                ...
             </lst>
          </lst>
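
           A request combining collapsing and aggregation could then look like this (the field names city, stock and price are just placeholders for this example):

           http://localhost:8983/solr/select?q=*:*&collapse.field=city&collapse.aggregate=sum(stock),min(price)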
          

          The patch also fixes a bug inside the NonAdjacentDocumentCollapser that was reported on the solr-user mailing list a few days ago. An index out of bounds exception was thrown when documents were removed from an index and a field collapse search was done afterwards.

          Code refactoring

          The code refactoring includes the following things:

           • The notion of a CollapseGroup. A collapse group defines what a unique group is in the search result. For the adjacent and non adjacent document collapsers this is different. For adjacent field collapsing a group is defined by its field value and the document id of the most relevant document in that group, so more than one collapse group may have the same field value. For normal field collapsing (non adjacent) the group is defined just by the field value.
           • The notion of a CollapseCollector that receives the collapsed documents from a DocumentCollapser and does something with them, for example keeping a count of how many documents were collapsed per collapse group or computing an average of a certain field like price. As you can see in the code, instead of using field values or document ids a CollapseGroup is used for identifying a group (a hypothetical example collector is sketched near the end of this comment).
            /**
             * A <code>CollapseCollector</code> is responsible for receiving collapse callbacks from the <code>DocumentCollapser</code>.
             * An implementation can choose what to do with the received callbacks and data. Whatever an implementation collects it
             * is responsible for adding its results to the response.
             *
             * Implementation of this interface don't need to be thread safe!
             */
            public interface CollapseCollector {
            
              /**
               * Informs the <code>CollapseCollector</code> that a document has been collapsed under the specified collapseGroup.
               *
               * @param docId The id of the document that has been collapsed
               * @param collapseGroup The collapse group the docId has been collapsed under
               * @param collapseContext The collapse context
               */
              void documentCollapsed(int docId, CollapseGroup collapseGroup, CollapseContext collapseContext);
            
              /**
               * Informs the <code>CollapseCollector</code> about the document head.
               * The document head is the most relevant id for the specified collapseGroup.
               *
               * @param docHeadId The identifier of the document head
               * @param collapseGroup The collapse group of the document head
               * @param collapseContext The collapse context
               */
              void documentHead(int docHeadId, CollapseGroup collapseGroup, CollapseContext collapseContext);
            
              /**
               * Adds the <code>CollapseCollector</code> implementation specific result data to the result.
               *
               * @param result The response result 
               * @param docs The documents to be added to the response
               * @param collapseContext The collapse context
               */
              void getResult(NamedList result, DocList docs, CollapseContext collapseContext);
            
            }
            

             There is also a CollapseContext that allows you to store data that can be shared between CollapseCollectors.

           • A CollapseCollectorFactory is responsible for creating a CollapseCollector. It does this based on the SolrQueryRequest. All the logic for deciding when to enable a certain CollapseCollector must be placed in the factory.
            /**
             * A concrete <code>CollapseCollectorFactory</code> implementation is responsible for creating {@link CollapseCollector}
             * instances based on the {@link SolrQueryRequest}.
             */
            public interface CollapseCollectorFactory {
            
              /**
               * Creates an instance of a CollapseCollector specified by the concrete subclass.
               * The concrete subclass decides based on the specified request if a new instance has to be created and
               * can return <code>null</code> for that matter.
               * 
               * @param request The specified request
               * @return an instance of a CollapseCollector or <code>null</code>
               */
              CollapseCollector createCollapseCollector(SolrQueryRequest request);
            
            }
            

            Currently there are four CollapseCollectorFactories implementations:

           1. DocumentGroupCountCollapseCollectorFactory creates CollapseCollectors that collect the collapse counts per document group and return the counts in the response keyed by the collapsed group's most relevant document id.
           2. FieldValueCountCollapseCollectorFactory creates CollapseCollectors that collect the collapse count per collapsed group and return the counts in the response keyed by the collapsed group's field value.
           3. DocumentFieldsCollapseCollectorFactory creates CollapseCollectors that collect predefined field values from collapsed documents.
          4. AggregateCollapseCollectorFactory creates CollapseCollectors that create aggregate statistics based on the collapsed documents.
            CollapseCollectorFactories are configured in the solrconfig.xml and by default all implementations in the patch are configured. The following configuration is sufficient
            <searchComponent name="collapse" class="org.apache.solr.handler.component.CollapseComponent" />
            

             The following configuration configures the same CollapseCollectorFactories as the previous one:

            <searchComponent name="collapse" class="org.apache.solr.handler.component.CollapseComponent">
                <arr name="collapseCollectorFactories">
                    <str>groupDocumentsCounts</str>
                    <str>groupFieldValue</str>
                    <str>groupDocumentsFields</str>
                    <str>groupAggregatedData</str>
                </arr>
              </searchComponent>
            
              <fieldCollapsing>
                <collapseCollectorFactory name="groupDocumentsCounts" 
            class="solr.fieldcollapse.collector.DocumentGroupCountCollapseCollectorFactory" />
            
                <collapseCollectorFactory name="groupFieldValue" class="solr.fieldcollapse.collector.FieldValueCountCollapseCollectorFactory" />
            
                <collapseCollectorFactory name="groupDocumentsFields" 
             class="solr.fieldcollapse.collector.DocumentFieldsCollapseCollectorFactory" />
            
                <collapseCollectorFactory name="groupAggregatedData"
             class="org.apache.solr.search.fieldcollapse.collector.AggregateCollapseCollectorFactory">
                    <lst name="aggregateFunctions">
                        <str name="sum">org.apache.solr.search.fieldcollapse.collector.aggregate.SumFunction</str>
                        <str name="avg">org.apache.solr.search.fieldcollapse.collector.aggregate.AverageFunction</str>
                        <str name="min">org.apache.solr.search.fieldcollapse.collector.aggregate.MinFunction</str>
                        <str name="max">org.apache.solr.search.fieldcollapse.collector.aggregate.MaxFunction</str>
                    </lst>
                </collapseCollectorFactory>
              </fieldCollapsing>
            

             The configured CollapseCollectorFactories can be shared among different CollapseComponents. Most users do not have to do this, but when you create your own implementations, or use someone else's, you have to do this in order to configure the CollapseCollectorFactory implementation. The order in collapseCollectorFactories does matter: CollapseCollectors may share data via the CollapseContext, and for that reason the order is significant. The CollapseCollectorFactories in the patch do not share data, but other implementations may.
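
             To make the extension point above a bit more concrete, here is a hypothetical example (not part of the patch) of a factory with a nested CollapseCollector that only counts collapsed documents per group. It assumes it is placed alongside the patch's CollapseCollector, CollapseCollectorFactory, CollapseGroup and CollapseContext classes; the collapse.groupCounts parameter and keying the output by the group's string representation are assumptions of this sketch:

             import java.util.HashMap;
             import java.util.Map;

             import org.apache.solr.common.util.NamedList;
             import org.apache.solr.request.SolrQueryRequest;
             import org.apache.solr.search.DocList;

             public class GroupCountCollapseCollectorFactory implements CollapseCollectorFactory {

               public CollapseCollector createCollapseCollector(SolrQueryRequest request) {
                 // assumption for this sketch: only enable the collector when collapse.groupCounts=true is requested
                 if (request.getParams().getBool("collapse.groupCounts", false)) {
                   return new GroupCountCollapseCollector();
                 }
                 return null;
               }

               // Counts how many documents were collapsed per collapse group.
               static class GroupCountCollapseCollector implements CollapseCollector {

                 private final Map<CollapseGroup, Integer> counts = new HashMap<CollapseGroup, Integer>();

                 public void documentCollapsed(int docId, CollapseGroup collapseGroup, CollapseContext collapseContext) {
                   Integer current = counts.get(collapseGroup);
                   counts.put(collapseGroup, current == null ? 1 : current + 1);
                 }

                 public void documentHead(int docHeadId, CollapseGroup collapseGroup, CollapseContext collapseContext) {
                   // this simple example does not need the document head
                 }

                 public void getResult(NamedList result, DocList docs, CollapseContext collapseContext) {
                   NamedList groupCounts = new NamedList();
                   for (Map.Entry<CollapseGroup, Integer> entry : counts.entrySet()) {
                     // keyed by the group's string representation; the collectors in the patch key by field value or doc id
                     groupCounts.add(entry.getKey().toString(), entry.getValue());
                   }
                   result.add("groupCounts", groupCounts);
                 }
               }
             }

             Such a factory could then be registered under a name of its own in the fieldCollapsing section shown above.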

           The new patch contains a lot of changes, but I personally think that it is really an improvement, especially the introduction of the CollapseCollectors, which allows a lot of flexibility. Btw, any feedback or questions are welcome.

          Lance Norskog added a comment -

          This looks like a really nice rework! This JIRA has been a marathon (2.5 years!), but maybe the last miles are here.

          Since this JIRA has so many comments, it is hard to navigate. Maybe it is a good time to close it and start a new active JIRA for the field collapsing project.

          Martijn van Groningen added a comment -

           I have updated the patch; it fixes the bug that was reported yesterday on the solr-user mailing list:

          found another exception, i cant find specific steps to reproduce
          besides starting with an unfiltered result and then given an int field
          with values (1,2,3) filtering by 3 triggers it sometimes, this is in
          an index with very frequent updates and deletes

          --joe

          java.lang.NullPointerException
          at org.apache.solr.search.fieldcollapse.collector.FieldValueCountCollapseCollectorFactory
          $FieldValueCountCollapseCollector.getResult(FieldValueCountCollapseCollectorFactory.java:84)
          at org.apache.solr.search.fieldcollapse.AbstractDocumentCollapser.getCollapseInfo(AbstractDocumentCollapser.java:191)
          at org.apache.solr.handler.component.CollapseComponent.doProcess(CollapseComponent.java:179)
          at org.apache.solr.handler.component.CollapseComponent.process(CollapseComponent.java:121)
          at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
          at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
          at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
          at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
          at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
          at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
          at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1148)
          at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:387)
          at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
          at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
          at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)
          at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
          at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
          at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
          at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
          at org.mortbay.jetty.Server.handle(Server.java:326)
          at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
          at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
          at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:539)
          at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
          at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
          at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
          at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:520)

          Martijn van Groningen added a comment -

          It certainly has been going on for a long time.
          Talking about the last miles, there are a few things on my mind about field collapsing:

          • Change the response format. Currently, even I sometimes get confused by the information returned in the response. The response should be more structured, something like this:
            <lst name="collapse_counts">
                <str name="field">venue</str>
                <lst name="results">
                    <lst name="233238"> <!-- id of most relevant document of the group -->
                        <str name="fieldValue">melkweg</str>
                        <int name="collapseCount">2</int>
                        <!-- and other CollapseCollector specific collapse information -->
                    </lst>
                    ...
                </lst>
            </lst>
            

            Currently, when doing adjacent field collapsing, the collapse_counts section gives results that are unusable. It uses the field value as its key, which is not unique for adjacent collapsing, as shown in this example:

            <lst name="collapse_counts">
             <int name="hard">1</int>
             <int name="hard">1</int>
             <int name="electronics">1</int>
             <int name="memory">2</int>
             <int name="monitor">1</int>
            </lst>
            
          • Add the notion of a CollapseMatcher, which decides whether document field values are equal and thus whether the documents may be collapsed. This opens the road to more exotic features like fuzzy field collapsing and collapsing on more than one field, and it lets users of the patch easily implement their own matching rules (see the sketch after this list).
          • Distributed field collapsing. Although I have some ideas on how to get started, from my perspective it is not going to perform well, because the field collapse state somehow has to be shared between shards in order to do proper field collapsing, and that state can potentially be a lot of data depending on the specific search and corpus.
          • And maybe add a collapse collector that collects statistics about the most common field value per collapsed group.
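            As a rough, purely hypothetical sketch of what such a CollapseMatcher could look like (none of this exists in the current patch; the interface and class names are made up):

            // Hypothetical sketch only; not part of the current patch.
            // A CollapseMatcher decides whether two field values belong to the
            // same collapse group, which would allow fuzzy or multi-field rules.
            public interface CollapseMatcher {
                boolean matches(String groupFieldValue, String candidateFieldValue);
            }

            // Exact matching, equivalent to the current collapsing behaviour.
            public class ExactCollapseMatcher implements CollapseMatcher {
                public boolean matches(String groupFieldValue, String candidateFieldValue) {
                    return groupFieldValue == null
                            ? candidateFieldValue == null
                            : groupFieldValue.equals(candidateFieldValue);
                }
            }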

          I think this is roughly the roadmap from my side for field collapsing at the moment, but feel free to elaborate on it.
          Btw, I have recently written a blog post about field collapsing in general that might be handy for anyone implementing field collapsing.

          Lance Norskog added a comment -

          Getting the refactoring right is important.

          Scaling needs to be on the roadmap as well. The data created in collapsing has to be cached in some way. If I do a collapse on my 500m test index, the first one takes 110ms and the second one takes 80-90ms. Searches that walk from one result page to the next have to be fast the second time. Field collapsing probably needs some explicit caching. This is a show-stopper for getting this committed.

          When I sort or facet the work done up front is reused in some way. In sorting there is a huge amount of work pushed to the first query and explicitly cached. Faceting seems to leave its work in the existing caches and runs much faster the second time.
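          For illustration, Solr already supports generic user-defined caches in solrconfig.xml, so explicit collapse caching could hook into something like the entry below. The cache name and sizes are made up, and the current patch does not necessarily look up or populate such a cache:

            <!-- Hypothetical user cache for collapse results; the collapse code
                 would have to look this cache up by name and populate it. -->
            <cache name="fieldCollapseCache"
                   class="solr.LRUCache"
                   size="512"
                   initialSize="128"
                   autowarmCount="0"/>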

          Martijn van Groningen added a comment -

          I agree about the caching. When searching with field collapsing for the same query more than once, some caching should kick in. I think that the execution of the doCollapse(...) method should be cached; that method executes the field collapse logic, which is where most of the time of a field-collapse search is spent.

          Michael Gundlach added a comment - - edited

          I've found an NPE that occurs when performing quasi-distributed field collapsing.

          My company only has one use case for field collapsing: collapsing on email address. Our index is spread across multiple cores. We found that if we shard by email address, so that all documents with a given email address are guaranteed to appear on the same core, then we can do distributed field collapsing.

          We add &collapse.field=email and &shards=core1,core2,... to a regular query. Each core collapses on email and sends the results back to the requestor. Since no emails appear on more than one core, we've accomplished distributed search. We do lose the <collapse_count> section, but that's not needed for our purpose – we just need an accurate total document count, and to have no more than one document for a given email address in the results.
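          (Something like the following sketch is the kind of index-time routing this relies on; the class name and core URLs are just illustrative and are not from the patch or our code. The point is simply that hashing on the collapse field keeps every document of a group on the same core.)

            // Illustrative sketch only: pick a core for each document by hashing
            // the field that will later be used for collapsing (here: email).
            public final class EmailShardRouter {
                private final String[] coreUrls; // e.g. ".../solr/core1", ".../solr/core2"

                public EmailShardRouter(String[] coreUrls) {
                    this.coreUrls = coreUrls;
                }

                public String coreFor(String email) {
                    // Mask the sign bit so Integer.MIN_VALUE cannot produce a negative bucket.
                    int bucket = (email.toLowerCase().hashCode() & 0x7fffffff) % coreUrls.length;
                    return coreUrls[bucket];
                }
            }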

          Unfortunately, this throws an NPE when searching on a tokenized field. Searching string fields is fine. I don't understand exactly why the NPE appears, but I did bandaid over it by checking explicitly for nulls at the appropriate line in the code. No more NPE.

          There's a downside, which is that if we attempt to collapse on a field other than email – one which has documents appearing in multiple cores – the results are buggy: the first search returns few documents, and the number of documents actually displayed doesn't always match the "numFound" value. Then upon refresh we get what we think is the correct numFound, and the correct list of documents. This doesn't bother me too much, as you're guaranteed to get incorrect answers from the collapse code anyway when collapsing on a field that you didn't use as your key for sharding.

          In the spirit of Yonik's law of patches, I have made two imperfect patches attempting to contribute the fix, or at least point out the error:

          1. I pulled trunk, applied the latest SOLR-236 patch, made my 2 line change, and created a patch file. The resultant patch file looks very different from the latest SOLR-236 patchfile, so I assume I did something wrong.

          2. I pulled trunk, made my 2 line change, and created another patch file. This file is tiny but of course is missing all of the field collapsing changes.

          Would you like me to post either of these patchfiles to this issue? Or is it sufficient to just tell you that the NPE occurred in QueryComponent.java on line 556? ("rb._responseDocs.set(sdoc.positionInResponse, doc);" where sdoc was null.) Perhaps my use case is extraordinary enough that you're happy leaving the NPE in place and telling other users not to do what I'm doing?
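          (For reference, the band-aid is essentially a null guard around that line, roughly like this; the exact surrounding code may differ and this is not verified against trunk:)

            // Sketch of the null check described above, in the shard-response
            // merging code; sdoc can be null when a shard returned no entry for
            // this position in the quasi-distributed collapse setup.
            if (sdoc != null) {
                rb._responseDocs.set(sdoc.positionInResponse, doc);
            }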

          Thanks!
          Michael

          Martijn van Groningen added a comment -

          With the current patch, if you try to collapse on a field that is tokenized or multivalued, an exception is thrown indicating that you cannot do that and the search is cancelled. My guess is that when the search results are retrieved from the shards on the master, an NPE is thrown because the shard result is not there. This is a limitation in itself, but it boils down to how the FieldCache handles such field types (or at least how I think the FieldCache handles them).

          I think it is a good idea to share your patch; from there we might be able to get the change in properly, so others will also benefit from quasi-distributed field collapsing.

          Anyhow, to properly implement distributed field collapsing the distributed methods have to be overridden in the collapse component, so that is where I would start. We might then also include the collapse_count in the response.

          Shalin Shekhar Mangar added a comment -

          I'm using Martijn's patch from 2009-10-27. The FieldCollapseResponse#parseDocumentIdCollapseCounts assumes the unique key is a long. Is that a bug or an undocumented limitation?

          Nice work guys! We should definitely get this into Solr 1.5

          Martijn van Groningen added a comment -

          Hi Shalin, it was not my intention (Usually in my case I use a long as id). I'm currently refactoring the response format as described in a previous comment, so I have to change the SolrJ classes anyway. I will submit a patch shortly.

          Michael Gundlach added a comment -

          Martijn,

          I probably wasn't clear – we are sharding and collapsing on a non-tokenized "email" field. We can perform distributed collapsing fine when searching on some other nontokenized field; the NPE occurs when we perform a search on a tokenized field.

          Anyway, I'll attach the small patch now, which just adds the null check to Solr trunk.

          Michael Gundlach added a comment - - edited

          This patch (quasidistributed.additional.patch) does not apply field collapsing.

          Apply this patch in addition to the latest field collapsing patch, to avoid an NPE when:

          • you are collapsing on a field F,
          • you are sharding into multiple cores, using the hash of field F as your sharding key, AND
          • you perform a distributed search on a tokenized field.

          Note that if you attempt to use this patch to collapse on a field F1 and shard according to a field F2, you will get buggy search behavior.

          Martijn van Groningen added a comment -

          I have updated the field collapse patch and improved the response format. Check my blog for more details.

          Thomas Woodard added a comment -

          I'm trying to get field collapsing to work against the 1.4.0 release. I applied the latest patch, moved the file, did a clean build, and set up a config based on the example. If I run a search without collapsing everything is fine, but if it actually tries to collapse, I get the following error:

          java.lang.NoSuchMethodError: org.apache.solr.search.SolrIndexSearcher.getDocSet(Lorg/apache/lucene/search/Query;Lorg/apache/solr/search/DocSet;Lorg/apache/solr/search/DocSetAwareCollector;)Lorg/apache/solr/search/DocSet;
          at org.apache.solr.search.fieldcollapse.NonAdjacentDocumentCollapser.doQuery(NonAdjacentDocumentCollapser.java:60)
          at org.apache.solr.search.fieldcollapse.AbstractDocumentCollapser.collapse(AbstractDocumentCollapser.java:168)
          at org.apache.solr.handler.component.CollapseComponent.doProcess(CollapseComponent.java:160)
          at org.apache.solr.handler.component.CollapseComponent.process(CollapseComponent.java:121)
          at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
          at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
          at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
          at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
          at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)

          The tricky part is that the method is there in the source and I wrote a little test JSP that can find it just fine. That implies a class loader issue of some sort, but I'm not seeing it. Any help would be greatly appreciated.

          Martijn van Groningen added a comment -

          Thomas, the method that cannot be found ( SolrIndexSearcher.getDocSet(...) ) is part of the patch, so if the patch was successfully applied this should not happen.
          When I released the latest patch I only tested against the Solr trunk, but I have now tried the following to verify that the patch works with the 1.4.0 release:

          • Downloaded the 1.4.0 release from the Solr site
          • Applied the patch
          • Executed: ant clean dist example
          • In the example config (example/solr/conf/solrconfig.xml) I added the following line under the standard request handler:
            <searchComponent name="query" class="org.apache.solr.handler.component.CollapseComponent" />
          • Started the Jetty with Solr with the following command: java -jar start.jar
          • Added example data to Solr with the following command in the exampledocs dir: ./post.sh *.xml
          • I browsed to the following URL: http://localhost:8983/solr/select/?q=*:*&collapse.field=inStock and saw that the result was collapsed on the inStock field.

          It seems that everything is running fine. Can you tell something about how you deployed Solr on your machine?

          Thomas Woodard added a comment - - edited

          I tried the build again, and you are right, it does work fine with the default search handler. I had been trying to get it working with our search handler, which is dismax. That still doesn't work. Here is the handler configuration, which works fine until collapsing is added.

          <requestHandler name="glsearch" class="solr.SearchHandler">
          	<lst name="defaults">
          		<str name="defType">dismax</str>
          		<str name="qf">name^3 description^2 long_description^2 search_stars^1 search_directors^1 product_id^0.1</str>
          		<str name="tie">0.1</str>
          		<str name="facet">true</str>
          		<str name="facet.field">stars</str>
          		<str name="facet.field">directors</str>
          		<str name="facet.field">keywords</str>
          		<str name="facet.field">studio</str>
          		<str name="facet.mincount">1</str>
          	</lst>
          </requestHandler>
          

          Edit: The search fails even if you don't pass a collapse field.

          Martijn van Groningen added a comment -

          What kind of exception is occurring if you use dismax (with and without field collapsing)? If I do a collapse search with dismax in the example setup (http://localhost:8983/solr/select/?q=power&collapse.field=inStock&qt=dismax) field collapsing appears to be working.

          Martijn van Groningen added a comment -

          I have attached a new patch that incorporates Michael's quasi-distributed patch, so you don't have to patch twice. In addition, the new patch also merges the collapse_count data from each individual shard response. When using this patch you will still need to make sure that all documents of one collapse group stay on one shard, otherwise your collapse result will be incorrect; documents of different collapse groups can live on different shards.

          Thomas Woodard added a comment -

          And this morning, without changing anything, it is working fine. I don't know what happened on Friday, but the changes I made then must have fixed it without showing up for some reason. In any case, thank you for the assistance.

          German Attanasio Ruiz added a comment -

          Sorting of results doesn't work properly. Below I detail the steps I followed and the problem I ran into.

          I am using Solr as a search engine for web pages; I use a field named "site" for collapsing and sort by score.

          Steps
          After downloading the latest version of Solr ("solr-2009-11-15") and applying the patch "field-collapse-5.patch 2009-11-15 08:55 PM Martijn van Groningen 239 kB":

          STEP 1 - I run a search using field collapsing and the result is correct; the greatest score is 0.477.
          STEP 2 - I run the same search and field collapsing returns a different result with score 0.17; the (correct) result from step 1 does not appear again.

          Possible problem
          Step 1 stores the document in the cache for future searches.
          At step 2 the search is done over the cache and does not find the previously stored document.

          Possible solution
          I believe the problem is in how the document is stored in the cache: if we repeat step 2 we get the same result, and the document with score 0.17 is not removed from the results; the only result missing is the document with score 0.477.

          Conclusion
          Documents are not sorted properly when using field collapsing together with the Solr cache, i.e. whenever documents stored in the Solr cache are needed.

          Martijn van Groningen added a comment -

          I can confirm this bug. I will attach a new patch that fixes this issue shortly. Thanks for noticing.

          Martijn van Groningen added a comment - - edited

          The reason the search results after the first search were incorrect is that the scores were not preserved in the cache. As a result, the collapsing algorithm could not properly group the documents into collapse groups (the most relevant document per group could not be determined), because there was no score information when the documents were retrieved from the cache (as a DocSet in SolrIndexSearcher).

          I made sure that in the attached patch the score is also saved in the cache, so the collapsing algorithm can do its work properly when the documents are retrieved from the cache. Because the scores are now stored with the cached documents, the actual in-memory size of the filterCache will increase.

          Show
          Martijn van Groningen added a comment - - edited The reason why the search results after the first search were incorrect was, because the scores were not preserved in the cache. The result of that was that the collapsing algorithm could not properly group the documents into the collapse groups (the most relevant document per document group could not be determined properly), because there was no score information when retrieving the documents from cache (as DocSet in SolrIndexSearcher) . I made sure that in