Lucene - Core
  1. Lucene - Core
  2. LUCENE-4832

Unbounded getTopGroups for ToParentBlockJoinCollector

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.3, 6.0
    • Component/s: modules/join
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      ToParentBlockJoinCollector#getTopGroups method takes several arguments:

      public TopGroups<Integer> getTopGroups(ToParentBlockJoinQuery query, 
                                             Sort withinGroupSort,
                                             int offset,
                                             int maxDocsPerGroup,
                                             int withinGroupOffset,
                                             boolean fillSortFields)
      

      and one of them is maxDocsPerGroup which specifies upper bound of child documents number returned within each group.
      ToParentBlockJoinCollector collects and caches all child documents matched by given ToParentBlockJoinQuery in OneGroup objects during search so it is possible to create GroupDocs with all matched child documents instead of part of them bounded by maxDocsPerGroup.

      When you specify maxDocsPerGroup new queues(I mean TopScoreDocCollector/TopFieldCollector) will be created for each group with maxDocsPerGroup objects created within each queue which could lead to redundant memory allocation in case of child documents number within group is less than maxDocsPerGroup.

      I suppose that there are many cases where you need to get all child documents matched by query so it could be nice to have ability to get top groups with all matched child documents without unnecessary memory allocation.

      Possible solution is to pass negative maxDocsPerGroup in case when you need to get all matched child documents within each group and check maxDocsPerGroup value: if it is negative then we need to create queue with size of matched child documents number; otherwise create queue with size equals to maxDocsPerGroup.

      1. LUCENE-4832.patch
        11 kB
        Aleksey Aleev
      2. LUCENE-4832.patch
        12 kB
        Aleksey Aleev
      3. LUCENE-4832.patch
        14 kB
        Aleksey Aleev

        Activity

        Hide
        Aleksey Aleev added a comment -

        Patch with proposed solution attached.

        Show
        Aleksey Aleev added a comment - Patch with proposed solution attached.
        Hide
        Michael McCandless added a comment -

        Could we just fix it so passing in Integer.MAX_VALUE for maxDocsPerGroup works? I.e., we'd need to do the min(numDocsInGroup, maxDocsPerGroup) when we create the collector for the group. And I think we don't need to create this separate Accumulator class...

        Show
        Michael McCandless added a comment - Could we just fix it so passing in Integer.MAX_VALUE for maxDocsPerGroup works? I.e., we'd need to do the min(numDocsInGroup, maxDocsPerGroup) when we create the collector for the group. And I think we don't need to create this separate Accumulator class...
        Hide
        Aleksey Aleev added a comment -

        Michael, thank you for replying!
        I agree with you about Integer.MAX_VALUE, it looks much better in this way.
        The reason why I introduced GroupDocsAccumulator class is that I wanted to reduce the size of getTopGroups(...) method and make it more readable.
        Could you please tell me you don't like introducing new class and creating an instance of it at all or you think that it's not clear what accumulate() method should do?
        Maybe it will be more clear if the loop by groups will remain in getTopGroups() and the loop's body will be extracted in accumulate() method? So we'll have:

        for(int groupIDX=offset;groupIDX<sortedGroups.length;groupIDX++) {
          groupDocsAccumulator.accumulate(groupIDX);
        }
        

        Please tell WDYT about it and I'll update the patch.

        Show
        Aleksey Aleev added a comment - Michael, thank you for replying! I agree with you about Integer.MAX_VALUE, it looks much better in this way. The reason why I introduced GroupDocsAccumulator class is that I wanted to reduce the size of getTopGroups(...) method and make it more readable. Could you please tell me you don't like introducing new class and creating an instance of it at all or you think that it's not clear what accumulate() method should do? Maybe it will be more clear if the loop by groups will remain in getTopGroups() and the loop's body will be extracted in accumulate() method? So we'll have: for ( int groupIDX=offset;groupIDX<sortedGroups.length;groupIDX++) { groupDocsAccumulator.accumulate(groupIDX); } Please tell WDYT about it and I'll update the patch.
        Hide
        Michael McCandless added a comment -

        Hi Aleksey, reducing that method size would be nice! Can we just make it a new method (accumulate is good), instead of a new class? (And also the Integer.MAX_VALUE fix). I think this will be a good improvement...

        Show
        Michael McCandless added a comment - Hi Aleksey, reducing that method size would be nice! Can we just make it a new method (accumulate is good), instead of a new class? (And also the Integer.MAX_VALUE fix). I think this will be a good improvement...
        Hide
        Aleksey Aleev added a comment -

        Hi Michael! I've updated the patch with introducing a method instead of class and Integer.MAX_VALUE fix. Please have a look and tell what do you think about it. Thank you.

        Show
        Aleksey Aleev added a comment - Hi Michael! I've updated the patch with introducing a method instead of class and Integer.MAX_VALUE fix. Please have a look and tell what do you think about it. Thank you.
        Hide
        Michael McCandless added a comment -

        The Integer.MAX_VALUE change looks great!

        But one thing I don't like about the accumulateGroups is there's now a separate (second) loop to sum up the totalGroupedHitCount.

        Maybe accumulateGroups should do this itself, and then return TopGroups instead of GroupDocs<Integer>()?

        Show
        Michael McCandless added a comment - The Integer.MAX_VALUE change looks great! But one thing I don't like about the accumulateGroups is there's now a separate (second) loop to sum up the totalGroupedHitCount. Maybe accumulateGroups should do this itself, and then return TopGroups instead of GroupDocs<Integer>()?
        Hide
        Aleksey Aleev added a comment -

        The patch is updated. It looks better, thanks.

        Show
        Aleksey Aleev added a comment - The patch is updated. It looks better, thanks.
        Hide
        Michael McCandless added a comment -

        OK, I committed the last patch (plus some small unrelated cleanup to a pre-existing test case); thanks Aleksey!

        Show
        Michael McCandless added a comment - OK, I committed the last patch (plus some small unrelated cleanup to a pre-existing test case); thanks Aleksey!
        Hide
        Uwe Schindler added a comment -

        Closed after release.

        Show
        Uwe Schindler added a comment - Closed after release.

          People

          • Assignee:
            Unassigned
            Reporter:
            Aleksey Aleev
          • Votes:
            1 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development