Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 1.4
    • Fix Version/s: 3.2
    • Component/s: search
    • Labels:
      None

      Description

      I am trying to develop a new way of doing field collapsing, based on the adjacent field collapsing algorithm. I started working on it because I am experiencing performance problems with the field collapsing patch on a big index (8 GB).
      The algorithm does adjacent pseudo-field collapsing. It collapses only within the first X documents. Instead of making the collapsed docs disappear, the algorithm moves them to a given position in the relevance-sorted results list.
      The reason I collapse only within the first X documents is that if I have, for example, 600000 results and show 10 results per page, I really don't need to collapse on page 30000, or even on page 3000. Doing this, I see dramatically better performance. The problem is that I couldn't find a way to plug the algorithm in as a component and keep good performance. I had to hack a few methods in SolrIndexSearcher.java.
      This patch is experimental and for testing purposes only. If someone finds it interesting, it would be good to find a better way to integrate it than the current one.
      Advice is more than welcome.

      Functionality:
      In solrconfig.xml we specify the pseudo-collapsing parameters:
      <str name="plus.considerMoreDocs">true</str>
      <str name="plus.considerHowMany">3000</str>
      <str name="plus.considerField">name</str>
      (at the moment there is no threshold, nor the other parameters that exist in the current field collapsing patch)

      plus.considerMoreDocs enables pseudo-collapsing
      plus.considerHowMany sets the number of top results to which the algorithm is applied
      plus.considerField is the field to pseudo-collapse on

      If the number of results is lower than plus.considerHowMany, the algorithm is applied to all the results.
      Let's say there is a query with 600000 results and we've set considerHowMany to 3000 (and we already have the docs sorted by relevance).
      What adjacent pseudo-collapsing does is this: if the 2nd doc has to be collapsed, it is sent to position 2999 of the relevance results array (the last slot of the 3000-doc window, counting from 0). If the 3rd has to be collapsed too, it goes to position 2998, and so on.

      The algorithm is not applied when a sort spec is set or when plus.considerMoreDocs is set to false. Nor is it applied when using the MoreLikeThis request handler.
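
      The conditions above can be sketched as a small guard. This is only an illustration, not code from the patch: the class name PseudoCollapseParams, the plain Map used in place of Solr's request parameters, the default values, and the null check on the field are all assumptions.

```java
import java.util.Map;

/** Hypothetical holder for the three plus.* parameters (not part of the patch). */
public class PseudoCollapseParams {
    final boolean considerMoreDocs;
    final int considerHowMany;
    final String considerField;

    public PseudoCollapseParams(Map<String, String> params) {
        // Defaults ("false", "0") are assumptions made for this sketch.
        considerMoreDocs = Boolean.parseBoolean(params.getOrDefault("plus.considerMoreDocs", "false"));
        considerHowMany = Integer.parseInt(params.getOrDefault("plus.considerHowMany", "0"));
        considerField = params.get("plus.considerField");
    }

    /** Pseudo-collapsing is skipped when disabled, when a sort spec is set, or for MoreLikeThis requests. */
    public boolean applies(boolean hasSortSpec, boolean isMoreLikeThis) {
        return considerMoreDocs
                && considerField != null   // assumption: a field must be configured
                && !hasSortSpec
                && !isMoreLikeThis;
    }

    /** With fewer results than plus.considerHowMany, the whole result list is considered. */
    public int window(int numResults) {
        return Math.min(numResults, considerHowMany);
    }
}
```

      With the configuration shown earlier, window(600000) would be 3000, and window(9) would be 9.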

      Example with a query of 9 results:
      Results sorted by relevance without pseudo-collapse-algorithm:

      doc1 - collapse_field_value 3
      doc2 - collapse_field_value 3
      doc3 - collapse_field_value 4
      doc4 - collapse_field_value 7
      doc5 - collapse_field_value 6
      doc6 - collapse_field_value 6
      doc7 - collapse_field_value 5
      doc8 - collapse_field_value 1
      doc9 - collapse_field_value 2

      Results pseudo-collapsed with plus.considerHowMany = 5

      doc1 - collapse_field_value 3
      doc3 - collapse_field_value 4
      doc4 - collapse_field_value 7
      doc5 - collapse_field_value 6
      doc2 - collapse_field_value 3*
      doc6 - collapse_field_value 6
      doc7 - collapse_field_value 5
      doc8 - collapse_field_value 1
      doc9 - collapse_field_value 2

      Results pseudo-collapsed with plus.considerHowMany = 9

      doc1 - collapse_field_value 3
      doc3 - collapse_field_value 4
      doc4 - collapse_field_value 7
      doc5 - collapse_field_value 6
      doc7 - collapse_field_value 5
      doc8 - collapse_field_value 1
      doc9 - collapse_field_value 2
      doc6 - collapse_field_value 6*
      doc2 - collapse_field_value 3*

      *pseudo-collapsed documents
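
      The two reordered lists above can be reproduced with a minimal sketch of the reordering step. It is not the patch itself: the class and method names are hypothetical, it works on a plain list of field values instead of Lucene doc ids, and it assumes a document is collapsed when its field value equals that of the document ranked immediately before it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical illustration of adjacent pseudo-collapsing (not the patch code). */
public class PseudoCollapseSketch {

    /**
     * values.get(i) is the collapse-field value of the i-th ranked document.
     * Returns the new order as 0-based indexes into the original ranking.
     */
    public static List<Integer> reorder(List<String> values, int considerHowMany) {
        int window = Math.min(considerHowMany, values.size());
        List<Integer> kept = new ArrayList<>();
        List<Integer> collapsed = new ArrayList<>();
        for (int i = 0; i < window; i++) {
            // Assumed rule: same field value as the previous document means "collapse".
            if (i > 0 && values.get(i).equals(values.get(i - 1))) {
                collapsed.add(i);
            } else {
                kept.add(i);
            }
        }
        // The first collapsed doc takes the last slot of the window,
        // the second one the slot before it, and so on.
        Collections.reverse(collapsed);
        List<Integer> order = new ArrayList<>(kept);
        order.addAll(collapsed);
        // Documents beyond the window keep their original positions.
        for (int i = window; i < values.size(); i++) {
            order.add(i);
        }
        return order;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("3", "3", "4", "7", "6", "6", "5", "1", "2");
        for (int n : new int[] {5, 9}) {
            StringBuilder sb = new StringBuilder("considerHowMany=" + n + ":");
            for (int idx : reorder(values, n)) {
                sb.append(" doc").append(idx + 1);
            }
            System.out.println(sb);
        }
    }
}
```

      Run on the nine example documents, the sketch yields the same two orderings as listed above, with doc2 (and doc6 in the second case) moved to the end of the considered window.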

          Activity

          Uri Boness added a comment -

          Wouldn't it be an idea to try to merge this code with the original field collapsing patch? Quite a bit of work was done recently on that patch to make it more extensible. For example, you now have a Collapser interface that encapsulates the actual collapsing algorithm, and my guess is that your algorithm can probably fit there. Indeed, when the corpus is large, adjacent field collapsing can turn into a performance issue, and having this pseudo algorithm seems to make a lot of sense. Using the original field collapsing patch, it would be nice if we could just define another parameter called collapse.type that holds one of three values: adjacent, pseudo-adjacent, and non-adjacent.

          BTW, I haven't looked at your patch yet, and I don't know how well it works with faceting. But integrating it with the original patch will give you that support (i.e. before/after collapse facet counts) automatically.
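
          For illustration only, the three values proposed for collapse.type could be modelled as an enum. The enum name, its methods, and the error handling are hypothetical; nothing here is taken from the field collapsing patch or its Collapser interface.

```java
import java.util.Locale;

/** Hypothetical model of the proposed collapse.type parameter values. */
public enum CollapseType {
    ADJACENT("adjacent"),
    PSEUDO_ADJACENT("pseudo-adjacent"),
    NON_ADJACENT("non-adjacent");

    private final String paramValue;

    CollapseType(String paramValue) {
        this.paramValue = paramValue;
    }

    public String paramValue() {
        return paramValue;
    }

    /** Maps a raw request parameter value to a type, rejecting unknown values. */
    public static CollapseType fromParam(String raw) {
        if (raw == null) {
            throw new IllegalArgumentException("collapse.type is missing");
        }
        String normalized = raw.trim().toLowerCase(Locale.ROOT);
        for (CollapseType type : values()) {
            if (type.paramValue.equals(normalized)) {
                return type;
            }
        }
        throw new IllegalArgumentException("Unknown collapse.type: " + raw);
    }
}
```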

          Marc Sturlese added a comment -

          Well, the thing is that my patch performs very well precisely because, for now, it cannot be integrated as a plugin. The field collapsing patch does 2 "searches": one to pick the ids to collapse and a second to filter those ids in the main search.
          What I do is pseudo-collapse directly in the main search, reordering the ids in getDocListAndSetNC and getDocListNC, so response times are almost the same with or without the patch.

          Hoss Man added a comment -

          Bulk updating 240 Solr issues to set the Fix Version to "next" per the process outlined in this email...

          http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3Calpine.DEB.1.10.1005251052040.24672@radix.cryptio.net%3E

          Selection criteria was "Unresolved" with a Fix Version of 1.5, 1.6, 3.1, or 4.0. email notifications were suppressed.

          A unique token for finding these 240 issues in the future: hossversioncleanup20100527

          Peter Karich added a comment -

          Hi Marc,

          could this issue be closed, now that field collapsing is in trunk and more mature?

          Why can't it be integrated as a plugin?

          Marc Sturlese added a comment -

          Well, I said it cannot be integrated as a plugin because it hacks getDocListAndSetNC and getDocListNC. These 2 methods can only be changed by modifying the SolrIndexSearcher.java class.
          The pseudo-field-collapse sort is not included in the current field collapsing, but the current field collapsing seems to perform much better than it used to (I don't think as well as this patch, but the current feature is much more complete than my patch).
          I suppose I can close it.


            People

            • Assignee:
              Unassigned
              Reporter:
              Marc Sturlese
            • Votes:
              0
              Watchers:
              2
