Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA
    • Component/s: modules/grouping
    • Labels:
      None

      Description

      Currently the grouping module has many collector classes with a lot of different options per class. I think it would be a good idea to have a GroupUtil (Or another name?) convenience class. I think this could be a builder, because of the many options (sort,sortWithinGroup,groupOffset,groupCount and more) and implementations (term/dv/function) grouping has.

      1. LUCENE-3778.patch
        27 kB
        Martijn van Groningen
      2. LUCENE-3778.patch
        36 kB
        Martijn van Groningen
      3. LUCENE-3778.patch
        37 kB
        Martijn van Groningen
      4. LUCENE-3778.patch
        36 kB
        Martijn van Groningen

        Activity

        Hide
        Michael McCandless added a comment -

        This sounds like a good idea! Grouping is hard to use now...

        Would it be something like this?

        GroupSearch ctx = new GroupingSearch(groupByField);
        
        ctx.setGroupSort(...);
        ctx.setWithinGroupSort(...);
        ctx.setNeedsAllGroups(...);
        ctx.setNeedsAllGroupHeads(...);
        ctx.setUseCaching(...);
        
        topGroups = ctx.search(searcher, query, numGroups);
        

        ... and the search method figures out which first/second pass
        collectors to make, whether to cache, etc.?

        Would you also handle block (single pass) grouping with the same
        class...?

        I guess you'd then .getAllGroups(), .getAllGroupHeads() after
        .search(...)?

        Hmm would we try to handle Term/BytesRef and Function/MutableValue
        with the same class?

        Show
        Michael McCandless added a comment - This sounds like a good idea! Grouping is hard to use now... Would it be something like this? GroupSearch ctx = new GroupingSearch(groupByField); ctx.setGroupSort(...); ctx.setWithinGroupSort(...); ctx.setNeedsAllGroups(...); ctx.setNeedsAllGroupHeads(...); ctx.setUseCaching(...); topGroups = ctx.search(searcher, query, numGroups); ... and the search method figures out which first/second pass collectors to make, whether to cache, etc.? Would you also handle block (single pass) grouping with the same class...? I guess you'd then .getAllGroups(), .getAllGroupHeads() after .search(...)? Hmm would we try to handle Term/BytesRef and Function/MutableValue with the same class?
        Hide
        Martijn van Groningen added a comment - - edited

        Yes, it would look something like that.

        Would you also handle block (single pass) grouping with the same class...?

        I think we can do this. The block grouping returns TopGroups as result.

        I guess you'd then .getAllGroups(), .getAllGroupHeads() after .search(...)?

        Yes, we need that. In the case of getAllGroups() the TopGroups#totalGroupCount field can be used when the user is only interested in the number of matching groups.

        Hmm would we try to handle Term/BytesRef and Function/MutableValue with the same class?

        With generics?

        class GroupingSearch {
        
          public <T> TopGroups<T> search(IndexSearcher searcher, Query query, int numGroups) {
            ...
          }
        
        }
        

        Function usage:

        GroupSearch ctx = new GroupingSearch(groupByFunction);
        ctx.useFunction();
        ...
        TopGroups<MutableValue> topGroups = ctx.search(searcher, query, numGroups);
        

        The same could work for grouping by docvalues types int and float, right?

        Maybe distributed grouping needs its own class? Since the usage is different from a non distributed grouping.

        Show
        Martijn van Groningen added a comment - - edited Yes, it would look something like that. Would you also handle block (single pass) grouping with the same class...? I think we can do this. The block grouping returns TopGroups as result. I guess you'd then .getAllGroups(), .getAllGroupHeads() after .search(...)? Yes, we need that. In the case of getAllGroups() the TopGroups#totalGroupCount field can be used when the user is only interested in the number of matching groups. Hmm would we try to handle Term/BytesRef and Function/MutableValue with the same class? With generics? class GroupingSearch { public <T> TopGroups<T> search(IndexSearcher searcher, Query query, int numGroups) { ... } } Function usage: GroupSearch ctx = new GroupingSearch(groupByFunction); ctx.useFunction(); ... TopGroups<MutableValue> topGroups = ctx.search(searcher, query, numGroups); The same could work for grouping by docvalues types int and float, right? Maybe distributed grouping needs its own class? Since the usage is different from a non distributed grouping.
        Hide
        Michael McCandless added a comment -

        Would you also handle block (single pass) grouping with the same class...?

        I think we can do this. The block grouping returns TopGroups as result.

        Nice.

        I guess you'd then .getAllGroups(), .getAllGroupHeads() after .search(...)?

        Yes, we need that. In the case of getAllGroups() the TopGroups#totalGroupCount field can be used when the user is only interested in the number of matching groups.

        OK.

        Hmm would we try to handle Term/BytesRef and Function/MutableValue with the same class?

        With generics?

        I think so... but I think it may get tricky. Like, I think you should
        specify up front (to GroupingSearch ctor) the required things about
        your request (block join OR group field OR field + DV type OR function
        VS/ctx map), setters for the numerous optional things (sort,
        groupSort, getScores, getMaxScores, maxDocsPerGroup) and maybe params
        to search for the per-requesty things (topNGroups, groupOffset,
        withinGroupOffset).

        But then the <T> will depend on which ctor you used...? Not sure how
        it'd work...

        Maybe distributed grouping needs its own class? Since the usage is different from a non distributed grouping.

        Yeah...

        Maybe we can do this for join module too!

        Show
        Michael McCandless added a comment - Would you also handle block (single pass) grouping with the same class...? I think we can do this. The block grouping returns TopGroups as result. Nice. I guess you'd then .getAllGroups(), .getAllGroupHeads() after .search(...)? Yes, we need that. In the case of getAllGroups() the TopGroups#totalGroupCount field can be used when the user is only interested in the number of matching groups. OK. Hmm would we try to handle Term/BytesRef and Function/MutableValue with the same class? With generics? I think so... but I think it may get tricky. Like, I think you should specify up front (to GroupingSearch ctor) the required things about your request (block join OR group field OR field + DV type OR function VS/ctx map), setters for the numerous optional things (sort, groupSort, getScores, getMaxScores, maxDocsPerGroup) and maybe params to search for the per-requesty things (topNGroups, groupOffset, withinGroupOffset). But then the <T> will depend on which ctor you used...? Not sure how it'd work... Maybe distributed grouping needs its own class? Since the usage is different from a non distributed grouping. Yeah... Maybe we can do this for join module too!
        Hide
        Martijn van Groningen added a comment -

        Just a start, but it think this is the way to go.

        Show
        Martijn van Groningen added a comment - Just a start, but it think this is the way to go.
        Hide
        Michael McCandless added a comment -

        I really dislike the cancerous chaining (all setters returning
        this): it's poor API design because it creates unnecessary
        ambiguity on how to consume it. It amounts to a denial-of-service
        attack on the devs who consume our APIs.... We should strive to have
        less, not more, ambiguity in all of our APIs.

        But, since others seem to love it, as a compromise, can you write all
        consumption of the cancerous chaining as if the methods did not chain?
        Ie minimize the cancer: contain it to the API definition, alone.

        The litmus test is then simple: if I were to change all methods to
        return void instead, everything should compile / tests should pass.

        Otherwise, the patch looks great!

        One can actually use GroupingSearch in a shard'd env, on each shard,
        right? It's just that then you merge them like normal on the front
        end (ie, TopGroups.merge). Is that the only reason for the "... in a
        non distributed environment" javadoc warning?

        I think the jdocs for each ctor should explain what kind of grouping
        impl will be used (ie, the ctor taking Filter groupEndDocs uses
        single-pass block collector, and requires you indexed doc blocks).

        Maybe the ctor should take docValuesType / diskResidentValues, instead
        of setters to change it? Ie, so that you are stating up front what
        source to group by (DocValues, FC (Term), function, block).

        Maybe you should pass the groupSort, groupsOffset, groupsLimit to the
        search method (instead of setters)?

        Show
        Michael McCandless added a comment - I really dislike the cancerous chaining (all setters returning this ): it's poor API design because it creates unnecessary ambiguity on how to consume it. It amounts to a denial-of-service attack on the devs who consume our APIs.... We should strive to have less, not more, ambiguity in all of our APIs. But, since others seem to love it, as a compromise, can you write all consumption of the cancerous chaining as if the methods did not chain? Ie minimize the cancer: contain it to the API definition, alone. The litmus test is then simple: if I were to change all methods to return void instead, everything should compile / tests should pass. Otherwise, the patch looks great! One can actually use GroupingSearch in a shard'd env, on each shard, right? It's just that then you merge them like normal on the front end (ie, TopGroups.merge). Is that the only reason for the "... in a non distributed environment" javadoc warning? I think the jdocs for each ctor should explain what kind of grouping impl will be used (ie, the ctor taking Filter groupEndDocs uses single-pass block collector, and requires you indexed doc blocks). Maybe the ctor should take docValuesType / diskResidentValues, instead of setters to change it? Ie, so that you are stating up front what source to group by (DocValues, FC (Term), function, block). Maybe you should pass the groupSort, groupsOffset, groupsLimit to the search method (instead of setters)?
        Hide
        Martijn van Groningen added a comment -

        One can actually use GroupingSearch in a shard'd env, on each shard,
        right? It's just that then you merge them like normal on the front
        end (ie, TopGroups.merge). Is that the only reason for the "... in a
        non distributed environment" javadoc warning?

        In a sharded env one needs to first execute all the first pass collector on all shards, merge the returned search groups, execute the second pass search (with the top N merged search groups as argument) on most of the shards and finally merge the TopGroups from all shards into a topN TopGroups. Also grouping by docblock and grouping features like allGroups and groupHead don't work in a normal sharded environment (unless you partition the groups properly). The docs caching also only makes sense when performing grouping on a local index. This usage is very different then non distributed grouping, that is why I think it is better to have a separate grouping convenience class for distributed grouping (DistributedGroupSearch?).

        Maybe the ctor should take docValuesType / diskResidentValues

        Makes sense!

        Maybe you should pass the groupSort, groupsOffset, groupsLimit to the search method (instead of setters)?

        Maybe we just should have defaults for these options? Sort.RELEVANCE, 0 and 10?

        Show
        Martijn van Groningen added a comment - One can actually use GroupingSearch in a shard'd env, on each shard, right? It's just that then you merge them like normal on the front end (ie, TopGroups.merge). Is that the only reason for the "... in a non distributed environment" javadoc warning? In a sharded env one needs to first execute all the first pass collector on all shards, merge the returned search groups, execute the second pass search (with the top N merged search groups as argument) on most of the shards and finally merge the TopGroups from all shards into a topN TopGroups. Also grouping by docblock and grouping features like allGroups and groupHead don't work in a normal sharded environment (unless you partition the groups properly). The docs caching also only makes sense when performing grouping on a local index. This usage is very different then non distributed grouping, that is why I think it is better to have a separate grouping convenience class for distributed grouping (DistributedGroupSearch?). Maybe the ctor should take docValuesType / diskResidentValues Makes sense! Maybe you should pass the groupSort, groupsOffset, groupsLimit to the search method (instead of setters)? Maybe we just should have defaults for these options? Sort.RELEVANCE, 0 and 10?
        Hide
        Martijn van Groningen added a comment -

        Updated patch.

        • Added jdocs.
        • allGroups and allGroupHeads methods are implemented now.
        Show
        Martijn van Groningen added a comment - Updated patch. Added jdocs. allGroups and allGroupHeads methods are implemented now.
        Hide
        Michael McCandless added a comment -

        Patch looks good!

        Also grouping by docblock and grouping features like allGroups and groupHead don't work in a normal sharded environment (unless you partition the groups properly).

        Doc block grouping should work well in a sharded env right? As long
        as you send the whole block to a single shard...

        This usage is very different then non distributed grouping, that is why I think it is better to have a separate grouping convenience class for distributed grouping (DistributedGroupSearch?).

        OK I agree.

        Maybe you should pass the groupSort, groupsOffset, groupsLimit to the search method (instead of setters)?

        Maybe we just should have defaults for these options? Sort.RELEVANCE, 0 and 10?

        Well, I was trying to mirror IndexSearcher.search (for lack of any
        other guidance on what should be required arg to ctor, optional via
        setter or required arg to .search).

        So, yeah, I think default to Sort.RELEVANCE is good? Maybe we have 1
        search method doing that and another taking the GroupSort?

        I think it's strange to have a default for the top N groups (10)? I
        think app should have to specify that as an arg to search...

        Should we name it GroupSearch? GroupedSearch? GroupingSearch...? I don't
        have a strong preference... I think GroupingSearch (current name) is OK?

        Show
        Michael McCandless added a comment - Patch looks good! Also grouping by docblock and grouping features like allGroups and groupHead don't work in a normal sharded environment (unless you partition the groups properly). Doc block grouping should work well in a sharded env right? As long as you send the whole block to a single shard... This usage is very different then non distributed grouping, that is why I think it is better to have a separate grouping convenience class for distributed grouping (DistributedGroupSearch?). OK I agree. Maybe you should pass the groupSort, groupsOffset, groupsLimit to the search method (instead of setters)? Maybe we just should have defaults for these options? Sort.RELEVANCE, 0 and 10? Well, I was trying to mirror IndexSearcher.search (for lack of any other guidance on what should be required arg to ctor, optional via setter or required arg to .search). So, yeah, I think default to Sort.RELEVANCE is good? Maybe we have 1 search method doing that and another taking the GroupSort? I think it's strange to have a default for the top N groups (10)? I think app should have to specify that as an arg to search... Should we name it GroupSearch? GroupedSearch? GroupingSearch...? I don't have a strong preference... I think GroupingSearch (current name) is OK?
        Hide
        Martijn van Groningen added a comment -

        So, yeah, I think default to Sort.RELEVANCE is good? Maybe we have 1
        search method doing that and another taking the GroupSort?

        I think it's strange to have a default for the top N groups (10)? I
        think app should have to specify that as an arg to search...

        Makes sense.

        I think GroupingSearch (current name) is OK?

        In the join module we have the JoinUtil. Maybe rename it to GroupUtil? Just to be consistent? Or rename JoinUtil to JoinSearch?

        Show
        Martijn van Groningen added a comment - So, yeah, I think default to Sort.RELEVANCE is good? Maybe we have 1 search method doing that and another taking the GroupSort? I think it's strange to have a default for the top N groups (10)? I think app should have to specify that as an arg to search... Makes sense. I think GroupingSearch (current name) is OK? In the join module we have the JoinUtil. Maybe rename it to GroupUtil? Just to be consistent? Or rename JoinUtil to JoinSearch?
        Hide
        Michael McCandless added a comment -

        In the join module we have the JoinUtil. Maybe rename it to GroupUtil? Just to be consistent? Or rename JoinUtil to JoinSearch?

        Hmm, except in JoinUtil's case, it just has one static method... vs GroupingSearch which you instantiate, tweak, and use. So I think it's good that they are named differently?

        Show
        Michael McCandless added a comment - In the join module we have the JoinUtil. Maybe rename it to GroupUtil? Just to be consistent? Or rename JoinUtil to JoinSearch? Hmm, except in JoinUtil's case, it just has one static method... vs GroupingSearch which you instantiate, tweak, and use. So I think it's good that they are named differently?
        Hide
        Martijn van Groningen added a comment -

        Hmm, except in JoinUtil's case, it just has one static method... vs GroupingSearch which you instantiate, tweak, and use. So I think it's good that they are named differently?

        True. Lets keep the current name then.

        I think that the package.html should also be updated? It should use the GroupSearch instead of all the different collectors.

        Show
        Martijn van Groningen added a comment - Hmm, except in JoinUtil's case, it just has one static method... vs GroupingSearch which you instantiate, tweak, and use. So I think it's good that they are named differently? True. Lets keep the current name then. I think that the package.html should also be updated? It should use the GroupSearch instead of all the different collectors.
        Hide
        Michael McCandless added a comment -

        I think that the package.html should also be updated? It should use the GroupSearch instead of all the different collectors.

        +1, as long as GroupSearch covers all cases package.html now describes I guess?

        Show
        Michael McCandless added a comment - I think that the package.html should also be updated? It should use the GroupSearch instead of all the different collectors. +1, as long as GroupSearch covers all cases package.html now describes I guess?
        Hide
        Robert Muir added a comment -

        I took a quick glance and ... completely avoiding the chaining discussion, i have a few silly API suggestions:

        • methods like enableCaching(double maxCacheRAMMB, boolean cacheScores) and enableCaching(int maxDocsToCache, boolean cacheScores)
          are confusing because the signature is overloaded by different primitive types... can we give those unique names?
        • methods like disableCaching(), includeScores(boolean includeScores) are a little confusing because the verb 'set' is missing?
          In general I think getXXX and setXXX are useful, because otherwise its hard to consume the API since you don't
          know if the method will have some unintended side effects. As an extreme example, imagine if instead of isClosed() if you
          used closed(), which is very different from close() but confusable.
        Show
        Robert Muir added a comment - I took a quick glance and ... completely avoiding the chaining discussion, i have a few silly API suggestions: methods like enableCaching(double maxCacheRAMMB, boolean cacheScores) and enableCaching(int maxDocsToCache, boolean cacheScores) are confusing because the signature is overloaded by different primitive types... can we give those unique names? methods like disableCaching(), includeScores(boolean includeScores) are a little confusing because the verb 'set' is missing? In general I think getXXX and setXXX are useful, because otherwise its hard to consume the API since you don't know if the method will have some unintended side effects. As an extreme example, imagine if instead of isClosed() if you used closed(), which is very different from close() but confusable.
        Hide
        Martijn van Groningen added a comment -

        +1, as long as GroupSearch covers all cases package.html now describes I guess?

        I think the GroupSearch class covers all local cases, so that should be fine.

        methods like disableCaching(), includeScores(boolean includeScores) are a little confusing because the verb 'set' is missing?
        In general I think getXXX and setXXX are useful, because otherwise its hard to consume the API since you don't
        know if the method will have some unintended side effects. As an extreme example, imagine if instead of isClosed() if you
        used closed(), which is very different from close() but confusable.

        Kinda makes sense, lets make the method names more descriptives to avoid any confusion (although we have jdocs).

        methods like enableCaching(double maxCacheRAMMB, boolean cacheScores) and enableCaching(int maxDocsToCache, boolean cacheScores)
        are confusing because the signature is overloaded by different primitive types... can we give those unique names?

        Sure, we can do that. What would be good names? setCacheSizeInMb(Double,boolean) and setCacheSizeInDocs(Integer,boolean)? Setting null would disable the cache?

        Show
        Martijn van Groningen added a comment - +1, as long as GroupSearch covers all cases package.html now describes I guess? I think the GroupSearch class covers all local cases, so that should be fine. methods like disableCaching(), includeScores(boolean includeScores) are a little confusing because the verb 'set' is missing? In general I think getXXX and setXXX are useful, because otherwise its hard to consume the API since you don't know if the method will have some unintended side effects. As an extreme example, imagine if instead of isClosed() if you used closed(), which is very different from close() but confusable. Kinda makes sense, lets make the method names more descriptives to avoid any confusion (although we have jdocs). methods like enableCaching(double maxCacheRAMMB, boolean cacheScores) and enableCaching(int maxDocsToCache, boolean cacheScores) are confusing because the signature is overloaded by different primitive types... can we give those unique names? Sure, we can do that. What would be good names? setCacheSizeInMb(Double,boolean) and setCacheSizeInDocs(Integer,boolean)? Setting null would disable the cache?
        Hide
        Martijn van Groningen added a comment -

        Updated patch.

        • Changed package.html
        • The methods that set something now have a name that begins with set.
        Show
        Martijn van Groningen added a comment - Updated patch. Changed package.html The methods that set something now have a name that begins with set.
        Hide
        Martijn van Groningen added a comment -

        Updated patch. Grouplimit and groupoffset are now parameters in the search methods. I think it is time to commit this.

        Show
        Martijn van Groningen added a comment - Updated patch. Grouplimit and groupoffset are now parameters in the search methods. I think it is time to commit this.
        Hide
        Michael McCandless added a comment -

        +1

        Show
        Michael McCandless added a comment - +1
        Hide
        Martijn van Groningen added a comment -

        Committed to trunk. Feature work (distributed grouping, grouped facets etc.) will be done in new issues.

        Show
        Martijn van Groningen added a comment - Committed to trunk. Feature work (distributed grouping, grouped facets etc.) will be done in new issues.

          People

          • Assignee:
            Unassigned
            Reporter:
            Martijn van Groningen
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development