Lucene - Core
LUCENE-3129

Single-pass grouping collector based on doc blocks

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.2, 4.0-ALPHA
    • Component/s: modules/grouping
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      LUCENE-3112 enables adding/updating a contiguous block of documents to
      the index, guaranteed (though still experimental!) to retain adjacent
      docID assignment through the full life of the index, as long as the app
      doesn't delete individual docs from the block.

      When an app does this, it can enable neat features like LUCENE-2454
      (nested documents) and post-group facet counting (LUCENE-3097).

      It also makes single-pass grouping possible, when you group by
      the "identifier" field shared by the doc block, since we know we will
      see a given group only once with all of its docs within one block.

      This should be faster than the fully general two-pass collectors we
      already have.
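
      As a rough illustration of why one pass suffices, here is a minimal
      plain-Java sketch. It does not use any Lucene classes: the class name
      SinglePassBlockGrouper, the hitDocIds array and the groupKeyByDoc
      lookup are all hypothetical stand-ins. It assumes hits arrive in
      increasing docID order and that every doc in a group sits in one
      contiguous block, so a change in the group key always means a new group.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Lucene code: groups hits in a single pass.
final class SinglePassBlockGrouper {

    /**
     * @param hitDocIds     matching docIDs, in increasing order (assumption)
     * @param groupKeyByDoc group key for each docID; docs sharing a key are
     *                      assumed to be adjacent (one doc block per group)
     * @return one list of docIDs per group, in the order groups were seen
     */
    static List<List<Integer>> group(int[] hitDocIds, String[] groupKeyByDoc) {
        List<List<Integer>> groups = new ArrayList<>();
        List<Integer> current = null;
        String currentKey = null;
        for (int doc : hitDocIds) {
            String key = groupKeyByDoc[doc];
            // Because blocks are contiguous, a key change can only mean that
            // the previous group is finished and will never be seen again.
            if (current == null || !key.equals(currentKey)) {
                current = new ArrayList<>();
                groups.add(current);
                currentKey = key;
            }
            current.add(doc);
        }
        return groups;
    }
}
```

      With keys {"a","a","a","b","b","c"} for docs 0..5 and hits on docs
      0, 2, 3 and 5, the sketch yields three groups: [0, 2], [3] and [5].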

      I'm working on a patch but not quite there yet...

      1. LUCENE-3129.patch
        47 kB
        Michael McCandless
      2. LUCENE-3129.patch
        47 kB
        Michael McCandless
      3. LUCENE-3129.patch
        82 kB
        Michael McCandless

          Activity

          Robert Muir added a comment -

          Bulk closing for 3.2

          Michael McCandless added a comment -

          New patch with small changes: renamed to BlockGroupingCollector, fixed it to set the totalGroupCount in the returned TopGroups, removed some dead code and shuffled some params from ctor -> getTopGroups.

          I'll commit shortly...

          Michael McCandless added a comment -

          New patch attached; I think it's ready to commit.

          I changed the approach, poaching an improvement from nested docs
          (LUCENE-2454): instead of pulling a DocTermsIndex from the field
          cache and detecting a new group by a change in ord, I now require
          that the app provide a Filter to denote the transition between groups.

          Not only is this better because it uses far less RAM, it's also more
          general than the 2-pass collector in that the app is free to
          arbitrarily set the groups by indexing the right doc blocks. All
          that's necessary is the app has some way to create the Filter noting
          the last doc in each group. It need not be a "single valued indexed
          field"...

          Performance is good: roughly 25-28% faster than the two-pass
          collector with caching.
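
          To make the Filter-based idea concrete, here is a small plain-Java
          sketch. It is not the patch's code and uses no Lucene classes: a
          java.util.BitSet stands in for the Filter, and the class name
          BlockGroupingSketch, its Group type and its method signatures are
          hypothetical. It assumes hits are collected in increasing docID
          order and that the bit set marks the last doc of each group. Groups
          are ranked by their best score, and only the top N are kept.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch: single-pass grouping driven by "last doc in group" bits.
final class BlockGroupingSketch {

    static final class Group {
        final List<Integer> docs = new ArrayList<>();
        float maxScore = Float.NEGATIVE_INFINITY;
    }

    private final BitSet lastDocPerGroup; // stand-in for the app-supplied Filter
    private final int topNGroups;
    // Min-heap on maxScore, so the weakest kept group is evicted first.
    private final PriorityQueue<Group> queue =
        new PriorityQueue<Group>((a, b) -> Float.compare(a.maxScore, b.maxScore));

    private Group current;
    private int currentGroupEnd = -1;
    private int totalGroupCount;

    BlockGroupingSketch(BitSet lastDocPerGroup, int topNGroups) {
        this.lastDocPerGroup = lastDocPerGroup;
        this.topNGroups = topNGroups;
    }

    /** Must be called with increasing docIDs (assumption of this sketch). */
    void collect(int doc, float score) {
        if (current == null || doc > currentGroupEnd) {
            flush();
            current = new Group();
            // The next set bit at or after doc is the end of doc's block.
            int end = lastDocPerGroup.nextSetBit(doc);
            currentGroupEnd = end < 0 ? Integer.MAX_VALUE : end;
        }
        current.docs.add(doc);
        current.maxScore = Math.max(current.maxScore, score);
    }

    /** Finishes the in-progress group and keeps it only if it ranks in the top N. */
    private void flush() {
        if (current == null) {
            return;
        }
        totalGroupCount++;
        queue.add(current);
        if (queue.size() > topNGroups) {
            queue.poll();
        }
        current = null;
    }

    /** Returns the kept groups, best maxScore first. */
    List<Group> getTopGroups() {
        flush();
        List<Group> result = new ArrayList<>(queue);
        result.sort((a, b) -> Float.compare(b.maxScore, a.maxScore));
        return result;
    }

    /** Number of distinct groups that had at least one hit. */
    int getTotalGroupCount() {
        flush();
        return totalGroupCount;
    }
}
```

          Only one group is buffered at a time, plus the top N kept so far,
          and no per-document group value is loaded, which is in the spirit
          of the RAM savings described above.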

          Michael McCandless added a comment -

          That patch also requires you first apply LUCENE-3112.

          Michael McCandless added a comment -

          Patch.

          I ran a quick perf test: single pass was ~18% faster than two-pass (using cache). Not as much as I expected... but every bit counts!


            People

            • Assignee:
              Michael McCandless
              Reporter:
              Michael McCandless
            • Votes:
              0
              Watchers:
              1
