Solr
  1. Solr
  2. SOLR-1773

Field Collapsing (lightweight version)

    Details

    • Type: New Feature New Feature
    • Status: Resolved
    • Priority: Minor Minor
    • Resolution: Duplicate
    • Affects Version/s: 1.4
    • Fix Version/s: None
    • Component/s: search
    • Labels:
      None

      Description

      I'd like to start another approach for field collapsing suggested by Yonik on 19/Dec/09 at SOLR-236. Re-posting the idea:

      =================== two pass collapsing algorithm for collapse.aggregate=max ====================
      First pass: pretend that collapseCount=1
        - Use a TreeSet as  a priority queue since one can remove and insert entries.
        - A HashMap<Key,TreeSetEntry> will be used to map from collapse group to top entry in the TreeSet
        - compare new doc with smallest element in treeset.  If smaller discard and go to the next doc.
        - If new doc is bigger, look up it's group.  Use the Map to find if the group has been added to the TreeSet and add it if not.
        - If the new bigger doc is already in the TreeSet, compare with the document in that group.  If bigger, update the node,
          remove and re-add to the TreeSet to re-sort.
      
      efficiency: the treeset and hashmap are both only the size of the top number of docs we are looking at (10 for instance)
      We will now have the top 10 documents collapsed by the right field with a collapseCount of 1.  Put another way, we have the top 10 groups.
      
      Second pass (if collapseCount>1):
       - create a priority queue for each group (10) of size collapseCount
       - re-execute the query (or if the sort within the collapse groups does not involve score, we could just use the docids gathered during phase 1)
       - for each document, find it's appropriate priority queue and insert
       - optimization: we can use the previous info from phase1 to even avoid creating a priority queue if no other items matched.
      
      So instead of creating collapse groups for every group in the set (as is done now?), we create it for only 10 groups.
      Instead of collecting the score for every document in the set (40MB per request for a 10M doc index is *big*) we re-execute the query if needed.
      We could optionally store the score as is done now... but I bet aggregate throughput on large indexes would be better by just re-executing.
      
      Other thought: we could also cache the first phase in the query cache which would allow one to quickly move to the 2nd phase for any collapseCount.
      

      The restriction is:

      one would not be able to tell the total number of collapsed docs, or the total number of hits (or the DocSet) after collapsing. So only collapse.facet=before would be supported.

      1. LOADTEST.patch
        109 kB
        Koji Sekiguchi
      2. SOLR-1773.patch
        42 kB
        Koji Sekiguchi

        Issue Links

          Activity

          Hide
          Koji Sekiguchi added a comment -

          The first draft, untested patch. Use for PoC only. In this patch, I hard-coded sort field by using java.util.Comparator.

          Show
          Koji Sekiguchi added a comment - The first draft, untested patch. Use for PoC only. In this patch, I hard-coded sort field by using java.util.Comparator.
          Hide
          Koji Sekiguchi added a comment - - edited

          Random comment on the patch:

          • TimeAllowed not supported
          • cache not supported
          • distributed search is not supported
          • sort field is hard-coded in the patch
          • collapse.type=adjacent is not supported
          • collapse.aggregate is not supported (but supportable)
          • not yet, but collapse.sort can be supported to specify sort criteria in collapse group

          supported parameters:

          collapse set to on to use field collapsing
          collapse.field field name to collapse (required)
          collapse.limit maximum number of collapsed docs to return in each collapse group. default is 0.
          collapse.fl comma- or space- delimited list of fields to return. multiValued field and TrieField are not supported yet
          Show
          Koji Sekiguchi added a comment - - edited Random comment on the patch: TimeAllowed not supported cache not supported distributed search is not supported sort field is hard-coded in the patch collapse.type=adjacent is not supported collapse.aggregate is not supported (but supportable) not yet, but collapse.sort can be supported to specify sort criteria in collapse group supported parameters: collapse set to on to use field collapsing collapse.field field name to collapse (required) collapse.limit maximum number of collapsed docs to return in each collapse group. default is 0. collapse.fl comma- or space- delimited list of fields to return. multiValued field and TrieField are not supported yet
          Hide
          Koji Sekiguchi added a comment -

          A very rough/simple load test patch attached.

          QTime average of 1,000 times random queries were:

          num docs in index SOLR-236 SOLR-1773
          1M 321 ms 185ms
          10M 2,914 ms 1,642 ms

          I needed to set -Xmx1024m in this case, though 512m for other cases, to avoid OOM.

          SOLR-1773 is 43% faster.

          Show
          Koji Sekiguchi added a comment - A very rough/simple load test patch attached. QTime average of 1,000 times random queries were: num docs in index SOLR-236 SOLR-1773 1M 321 ms 185ms 10M 2,914 ms 1,642 ms I needed to set -Xmx1024m in this case, though 512m for other cases, to avoid OOM. SOLR-1773 is 43% faster.
          Hide
          Shalin Shekhar Mangar added a comment -

          Koji, have you looked at SOLR-1682? I gave an implementation of the same approach but that too is only a PoC.

          Show
          Shalin Shekhar Mangar added a comment - Koji, have you looked at SOLR-1682 ? I gave an implementation of the same approach but that too is only a PoC.
          Hide
          Koji Sekiguchi added a comment - - edited

          Oops, I've glanced at SOLR-236 related issues, but I thought it was for finalize response format from Description. I'll look into SOLR-1682. Thanks!

          Show
          Koji Sekiguchi added a comment - - edited Oops, I've glanced at SOLR-236 related issues, but I thought it was for finalize response format from Description. I'll look into SOLR-1682 . Thanks!
          Hide
          Koji Sekiguchi added a comment -

          closed as duplicate of SOLR-236, SOLR-1682.

          Show
          Koji Sekiguchi added a comment - closed as duplicate of SOLR-236 , SOLR-1682 .

            People

            • Assignee:
              Unassigned
              Reporter:
              Koji Sekiguchi
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development