Lucene - Core
  1. Lucene - Core
  2. LUCENE-3099

Grouping module should allow subclasses to set the group key per document

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.3, 4.0-ALPHA
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      The new grouping module can only group by a single-valued indexed field.

      But, if we make the 'getGroupKey' a method that a subclass could override, then I think we could refactor Solr over to the module, because it could do function queries and normal queries via subclass (I think).

      This also makes the impl more extensible to apps that might have their own interesting group values per document.

      1. LUCENE-3099.patch
        75 kB
        Martijn van Groningen
      2. LUCENE-3099.patch
        21 kB
        Michael McCandless
      3. LUCENE-3099.patch
        38 kB
        Martijn van Groningen
      4. LUCENE-3099.patch
        40 kB
        Martijn van Groningen
      5. LUCENE-3099.patch
        43 kB
        Martijn van Groningen
      6. LUCENE-3099.patch
        72 kB
        Martijn van Groningen
      7. LUCENE-3099.patch
        48 kB
        Martijn van Groningen
      8. LUCENE-3099.patch
        55 kB
        Martijn van Groningen

        Issue Links

          Activity

          Hide
          Robert Muir added a comment -

          bulk close for 3.3

          Show
          Robert Muir added a comment - bulk close for 3.3
          Hide
          Michael McCandless added a comment -

          Thanks Martijn!

          Show
          Michael McCandless added a comment - Thanks Martijn!
          Hide
          Martijn van Groningen added a comment -

          Updated patch to trunk. Previous patch can't be applied on the current trunk without issues.

          Show
          Martijn van Groningen added a comment - Updated patch to trunk. Previous patch can't be applied on the current trunk without issues.
          Hide
          Martijn van Groningen added a comment -

          Looks like LUCENE-3097 snuck into the patch (AllGroupHeadsCollector).

          oops... Yes it did. My change lists collide. Maybe it is time use git...

          Show
          Martijn van Groningen added a comment - Looks like LUCENE-3097 snuck into the patch (AllGroupHeadsCollector). oops... Yes it did. My change lists collide. Maybe it is time use git...
          Hide
          Michael McCandless added a comment -

          Patch looks great – I think it's ready to commit!

          Looks like LUCENE-3097 snuck into the patch (AllGroupHeadsCollector).

          Show
          Michael McCandless added a comment - Patch looks great – I think it's ready to commit! Looks like LUCENE-3097 snuck into the patch (AllGroupHeadsCollector).
          Hide
          Martijn van Groningen added a comment -

          Attached an updated version of the patch with the Mike's points. When building the module no generic warnings occur any more in the ant build. I also updated some documentation.

          For AbstractAllGroupsCollector, can we impl the getGroupCount in
          the base class as getGroups.size()?

          Certainly! Much better. Since the method is syntactical sugar anyway.

          Show
          Martijn van Groningen added a comment - Attached an updated version of the patch with the Mike's points. When building the module no generic warnings occur any more in the ant build. I also updated some documentation. For AbstractAllGroupsCollector, can we impl the getGroupCount in the base class as getGroups.size()? Certainly! Much better. Since the method is syntactical sugar anyway.
          Hide
          Michael McCandless added a comment -

          I think we are close here! It's great to see you're able to cutover
          Solr trunk to the grouping module based on this.

          Random things:

          • I think we can in fact use Map<...> instead of HashMap<...> in 2nd
            pass abstract collector?
          • Can you add @SuppressWarnings("unchecked") for the generic array
            creations?
          • Can you fix the other generics warnings? Eg the copy-ctor in
            TopGroups, and TestGrouping has a few warnings. ("ant clean
            compile-test" should show all these warnings).
          • The class in AllGroupsCollectorTest needs to be renamed
          • OK, let's leave groupDocs as protected in the 2nd pass collector
            (remove the nocommit/your response).
          • For AbstractAllGroupsCollector, can we impl the getGroupCount in
            the base class as getGroups.size()?
          Show
          Michael McCandless added a comment - I think we are close here! It's great to see you're able to cutover Solr trunk to the grouping module based on this. Random things: I think we can in fact use Map<...> instead of HashMap<...> in 2nd pass abstract collector? Can you add @SuppressWarnings("unchecked") for the generic array creations? Can you fix the other generics warnings? Eg the copy-ctor in TopGroups, and TestGrouping has a few warnings. ("ant clean compile-test" should show all these warnings). The class in AllGroupsCollectorTest needs to be renamed OK, let's leave groupDocs as protected in the 2nd pass collector (remove the nocommit/your response). For AbstractAllGroupsCollector, can we impl the getGroupCount in the base class as getGroups.size()?
          Hide
          Martijn van Groningen added a comment -

          Previous patch was wrong. Here a new one.

          Show
          Martijn van Groningen added a comment - Previous patch was wrong. Here a new one.
          Hide
          Martijn van Groningen added a comment -

          Attached a new patch with the discussed changes. The AbstractMatchAllGroupCollector changed a lot. Most code is pushed to implementation classes. During development of fq impl I noticed that the abstraction was still too specific for terms impl.

          Show
          Martijn van Groningen added a comment - Attached a new patch with the discussed changes. The AbstractMatchAllGroupCollector changed a lot. Most code is pushed to implementation classes. During development of fq impl I noticed that the abstraction was still too specific for terms impl.
          Hide
          Martijn van Groningen added a comment -

          After FQ is relatively done/stabilized, then we should look into grouping offering collectors for all filed types, using the infrastructure from that common module, and extending the common module at that time if there are needs of grouping not yet met, ie merging in the GroupValue, GroupValueSource, GroupHolder from this patch.

          That makes sense to me.

          Show
          Martijn van Groningen added a comment - After FQ is relatively done/stabilized, then we should look into grouping offering collectors for all filed types, using the infrastructure from that common module, and extending the common module at that time if there are needs of grouping not yet met, ie merging in the GroupValue, GroupValueSource, GroupHolder from this patch. That makes sense to me.
          Hide
          Michael McCandless added a comment -

          I thought that 2883 was more about integrating fq, not a general module for value source and docvalues that both grouping and fq can share. But I might have been mistaken.

          Right LUCENE-2883 is about FQ, but I'm thinking we should simply "birth" the commons module under LUCEEN-2883, meeting only the needs of FQ, initially.

          After FQ is relatively done/stabilized, then we should look into grouping offering collectors for all filed types, using the infrastructure from that common module, and extending the common module at that time if there are needs of grouping not yet met, ie merging in the GroupValue, GroupValueSource, GroupHolder from this patch.

          Show
          Michael McCandless added a comment - I thought that 2883 was more about integrating fq, not a general module for value source and docvalues that both grouping and fq can share. But I might have been mistaken. Right LUCENE-2883 is about FQ, but I'm thinking we should simply "birth" the commons module under LUCEEN-2883, meeting only the needs of FQ, initially. After FQ is relatively done/stabilized, then we should look into grouping offering collectors for all filed types, using the infrastructure from that common module, and extending the common module at that time if there are needs of grouping not yet met, ie merging in the GroupValue, GroupValueSource, GroupHolder from this patch.
          Hide
          Martijn van Groningen added a comment - - edited

          I think we can take this up under LUCENE-2883?

          I thought that 2883 was more about integrating fq, not a general module for value source and docvalues that both grouping and fq can share. But I might have been mistaken.

          Should we name it TermAllGroupsCollector

          +1 I'll rename it

          Maybe, instead, we can have only retrieveDocGroup? And if it returns
          null that means the group for this doc isn't being collected? Then we
          don't double the ord lookup for TermSecondPass.

          That seems reasonable to me. That also saves a key lookup for the TermSecondPassGroupingCollector.

          Show
          Martijn van Groningen added a comment - - edited I think we can take this up under LUCENE-2883 ? I thought that 2883 was more about integrating fq, not a general module for value source and docvalues that both grouping and fq can share. But I might have been mistaken. Should we name it TermAllGroupsCollector +1 I'll rename it Maybe, instead, we can have only retrieveDocGroup? And if it returns null that means the group for this doc isn't being collected? Then we don't double the ord lookup for TermSecondPass. That seems reasonable to me. That also saves a key lookup for the TermSecondPassGroupingCollector.
          Hide
          Michael McCandless added a comment -

          Should we rename all abstract collectors to Abstract*? To make it clear that these classes are abstract.

          +1

          You think that we already should open a new issue for this?

          I think we can take this up under LUCENE-2883? There's already an
          initial patch there, starting to factor out Mutable*...

          Should we name it TermAllGroupsCollector (instead of Terms...)? Or,
          fix the others to be TermsFirst/SecondPassCollector?

          I noticed that FirstPassGroupingCollector and SecondPassGroupingCollector still has groupField as field and constructor argument. So I moved this to TermsFirstPassGroupingCollector and TermSecondPassGroupingCollector

          Ahh, excellent – these abstract base classes don't need to know it's
          a feild we are grouping on.

          When working with fqs the slot is not practical, since there is no ords (like DocTermsIndex).

          OK I agree.

          Maybe, instead, we can have only retrieveDocGroup? And if it returns
          null that means the group for this doc isn't being collected? Then we
          don't double the ord lookup for TermSecondPass.

          Show
          Michael McCandless added a comment - Should we rename all abstract collectors to Abstract*? To make it clear that these classes are abstract. +1 You think that we already should open a new issue for this? I think we can take this up under LUCENE-2883 ? There's already an initial patch there, starting to factor out Mutable*... Should we name it TermAllGroupsCollector (instead of Terms...)? Or, fix the others to be TermsFirst/SecondPassCollector? I noticed that FirstPassGroupingCollector and SecondPassGroupingCollector still has groupField as field and constructor argument. So I moved this to TermsFirstPassGroupingCollector and TermSecondPassGroupingCollector Ahh, excellent – these abstract base classes don't need to know it's a feild we are grouping on. When working with fqs the slot is not practical, since there is no ords (like DocTermsIndex). OK I agree. Maybe, instead, we can have only retrieveDocGroup? And if it returns null that means the group for this doc isn't being collected? Then we don't double the ord lookup for TermSecondPass.
          Hide
          Martijn van Groningen added a comment -

          I think the SecondPassGroupingCollector class needs the following two methods instead of getDocSlot:

            public void collect(int doc) throws IOException {
              totalHitCount++;
              if (gatherGroupedDocs(doc)) {
                retrieveGroup(doc).collector.collect(doc);
              }
            }
          
            protected abstract boolean gatherGroupedDocs(int doc) throws IOException;
          
            protected abstract SearchGroupDocs<GROUP_VALUE_TYPE> retrieveGroup(int doc) throws IOException;
          

          When working with fqs the slot is not practical, since there is no ords (like DocTermsIndex).
          This setup works nicely in both impls. Downside is that for the Terms impl the ord key has to be looked up twice when gatherGroupedDocs(..) returns true.

          Show
          Martijn van Groningen added a comment - I think the SecondPassGroupingCollector class needs the following two methods instead of getDocSlot: public void collect( int doc) throws IOException { totalHitCount++; if (gatherGroupedDocs(doc)) { retrieveGroup(doc).collector.collect(doc); } } protected abstract boolean gatherGroupedDocs( int doc) throws IOException; protected abstract SearchGroupDocs<GROUP_VALUE_TYPE> retrieveGroup( int doc) throws IOException; When working with fqs the slot is not practical, since there is no ords (like DocTermsIndex). This setup works nicely in both impls. Downside is that for the Terms impl the ord key has to be looked up twice when gatherGroupedDocs(..) returns true.
          Hide
          Martijn van Groningen added a comment -

          Attached an updated patch.

          I'm currently busy with integrating the grouping module in trunk Solr. I noticed that FirstPassGroupingCollector and SecondPassGroupingCollector still has groupField as field and constructor argument. So I moved this to TermsFirstPassGroupingCollector and TermSecondPassGroupingCollector. Also made a small change in GroupDocs regarding generics.

          Show
          Martijn van Groningen added a comment - Attached an updated patch. I'm currently busy with integrating the grouping module in trunk Solr. I noticed that FirstPassGroupingCollector and SecondPassGroupingCollector still has groupField as field and constructor argument. So I moved this to TermsFirstPassGroupingCollector and TermSecondPassGroupingCollector. Also made a small change in GroupDocs regarding generics.
          Hide
          Martijn van Groningen added a comment -

          Separately, we should merge the cool GroupValue, GroupValueSource,
          GroupHolder, etc., from the first patch here, with Solr's equivalents,
          factored out I think into a shared "common" module that the FQ module
          (LUCENE-2883) can also use.

          +1! We need to find out what fq and grouping really needs under the hood. Performance can not be harmed by moving this into common module.
          You think that we already should open a new issue for this?

          Show
          Martijn van Groningen added a comment - Separately, we should merge the cool GroupValue, GroupValueSource, GroupHolder, etc., from the first patch here, with Solr's equivalents, factored out I think into a shared "common" module that the FQ module ( LUCENE-2883 ) can also use. +1! We need to find out what fq and grouping really needs under the hood. Performance can not be harmed by moving this into common module. You think that we already should open a new issue for this?
          Hide
          Martijn van Groningen added a comment -

          Attached a new patch that is based on Mike's patch.

          • All existing grouping tests pass
          • AllGroupsCollector has also been included in this infrastructure.

          Only the TermSecondPassGroupingCollector didn't work. The size groupDocs array was too small.

          I think the following things need to be done:

          • Update the documentation in package.html
          • Backport to 3x

          Should we rename all abstract collectors to Abstract*? To make it clear that these classes are abstract.

          Show
          Martijn van Groningen added a comment - Attached a new patch that is based on Mike's patch. All existing grouping tests pass AllGroupsCollector has also been included in this infrastructure. Only the TermSecondPassGroupingCollector didn't work. The size groupDocs array was too small. I think the following things need to be done: Update the documentation in package.html Backport to 3x Should we rename all abstract collectors to Abstract*? To make it clear that these classes are abstract.
          Hide
          Michael McCandless added a comment -

          That patch is totally untested, and has at least 2 generics
          warnings! But hopefully the approach can work...

          Basically the idea of the 2nd patch is to just make minimal extensions
          points to the current grouping collectors, so that Solr could subclass
          these and use its MutableValue/DocValues to manage the group keys. I
          think this would then mean Solr trunk and Solr 3.x could fully cutover
          and not lose any functionality (and gain the benefits of the grouping
          module).

          Separately, we should merge the cool GroupValue, GroupValueSource,
          GroupHolder, etc., from the first patch here, with Solr's equivalents,
          factored out I think into a shared "common" module that the FQ module
          (LUCENE-2883) can also use.

          Show
          Michael McCandless added a comment - That patch is totally untested, and has at least 2 generics warnings! But hopefully the approach can work... Basically the idea of the 2nd patch is to just make minimal extensions points to the current grouping collectors, so that Solr could subclass these and use its MutableValue/DocValues to manage the group keys. I think this would then mean Solr trunk and Solr 3.x could fully cutover and not lose any functionality (and gain the benefits of the grouping module). Separately, we should merge the cool GroupValue, GroupValueSource, GroupHolder, etc., from the first patch here, with Solr's equivalents, factored out I think into a shared "common" module that the FQ module ( LUCENE-2883 ) can also use.
          Hide
          Michael McCandless added a comment -

          Attached patch, with a possible more minimal approach to enabling Solr trunk to cutover to the grouping module.

          Show
          Michael McCandless added a comment - Attached patch, with a possible more minimal approach to enabling Solr trunk to cutover to the grouping module.
          Hide
          Martijn van Groningen added a comment -

          Also some initial performance indications. This data was created with the runner provided in the patch. A ran it on a index of 30M documents that has 7502 unique groups.

          ===== First pass collectors execution =====
          number   |collector                               |query                   |time (ms)   |Current mem usage (MB)  
          1        |FirstPassGroupingCollector              |country:es              |888         |170                     
          2        |ResearchFirstPassGroupingCollector      |country:es              |710         |170                     
          3        |FirstPassGroupingCollector              |country:es country:eg   |1448        |170                      
          4        |ResearchFirstPassGroupingCollector      |country:es country:eg   |1350        |170                      
          5        |FirstPassGroupingCollector              |country:it              |2           |170                    
          6        |ResearchFirstPassGroupingCollector      |country:it              |3           |170                     
          7        |FirstPassGroupingCollector              |country:gr              |45          |170                      
          8        |ResearchFirstPassGroupingCollector      |country:gr              |43          |170                    
          9        |FirstPassGroupingCollector              |country:it              |3           |170                    
          10       |ResearchFirstPassGroupingCollector      |country:it              |3           |170                    
          11       |FirstPassGroupingCollector              |country:it              |2           |170                    
          12       |ResearchFirstPassGroupingCollector      |country:it              |3           |170                    
          13       |FirstPassGroupingCollector              |country:eg              |67          |170                    
          14       |ResearchFirstPassGroupingCollector      |country:eg              |91          |171                     
          15       |FirstPassGroupingCollector              |country:es country:eg   |1387        |171                    
          16       |ResearchFirstPassGroupingCollector      |country:es country:eg   |1351        |171                     
          17       |FirstPassGroupingCollector              |country:e*              |2264        |173                    
          18       |ResearchFirstPassGroupingCollector      |country:e*              |1036        |176                     
          19       |FirstPassGroupingCollector              |country:it              |2           |176                     
          20       |ResearchFirstPassGroupingCollector      |country:it              |3           |176                      
          21       |FirstPassGroupingCollector              |*:*                     |526         |176                     
          22       |ResearchFirstPassGroupingCollector      |*:*                     |410         |176                     
          23       |FirstPassGroupingCollector              |country:gr              |50          |177                      
          24       |ResearchFirstPassGroupingCollector      |country:gr              |45          |177                      
          25       |FirstPassGroupingCollector              |country:es country:eg   |1397        |177                     
          26       |ResearchFirstPassGroupingCollector      |country:es country:eg   |1362        |177                     
          27       |FirstPassGroupingCollector              |*:*                     |544         |177                      
          28       |ResearchFirstPassGroupingCollector      |*:*                     |411         |177                    
          29       |FirstPassGroupingCollector              |country:tr              |54          |177                     
          30       |ResearchFirstPassGroupingCollector      |country:tr              |52          |177                      
          31       |FirstPassGroupingCollector              |*:*                     |531         |177                    
          32       |ResearchFirstPassGroupingCollector      |*:*                     |413         |177                     
          33       |FirstPassGroupingCollector              |*:*                     |534         |177                     
          34       |ResearchFirstPassGroupingCollector      |*:*                     |416         |177                      
          35       |FirstPassGroupingCollector              |country:eg              |67          |177                    
          36       |ResearchFirstPassGroupingCollector      |country:eg              |64          |177                    
          37       |FirstPassGroupingCollector              |country:gr              |45          |177                     
          38       |ResearchFirstPassGroupingCollector      |country:gr              |47          |177                      
          39       |FirstPassGroupingCollector              |country:e*              |1109        |181                     
          40       |ResearchFirstPassGroupingCollector      |country:e*              |965         |185                      
          41       |FirstPassGroupingCollector              |country:us              |56          |185                      
          42       |ResearchFirstPassGroupingCollector      |country:us              |62          |185                      
          43       |FirstPassGroupingCollector              |country:gr              |44          |185                      
          44       |ResearchFirstPassGroupingCollector      |country:gr              |42          |185                      
          45       |FirstPassGroupingCollector              |country:es              |872         |185                      
          46       |ResearchFirstPassGroupingCollector      |country:es              |723         |185                     
          47       |FirstPassGroupingCollector              |country:es              |804         |185                      
          48       |ResearchFirstPassGroupingCollector      |country:es              |717         |185                      
          49       |FirstPassGroupingCollector              |country:es              |747         |185                      
          50       |ResearchFirstPassGroupingCollector      |country:es              |723         |185                      
          
          
          
          ===== All groups collectors execution =====
          number   |collector                     |query                   |time (ms)   |count       |Current mem usage (MB)
          1        |AllGroupsCollector            |country:us              |54          |11          |185         
          2        |ResearchAllGroupsCollector    |country:us              |53          |11          |169         
          3        |AllGroupsCollector            |country:it              |2           |9           |169         
          4        |ResearchAllGroupsCollector    |country:it              |2           |9           |169         
          5        |AllGroupsCollector            |country:tr              |64          |1700        |170         
          6        |ResearchAllGroupsCollector    |country:tr              |67          |1700        |170         
          7        |AllGroupsCollector            |*:*                     |454         |7502        |170         
          8        |ResearchAllGroupsCollector    |*:*                     |471         |7502        |170         
          9        |AllGroupsCollector            |country:tr              |51          |1700        |171         
          10       |ResearchAllGroupsCollector    |country:tr              |61          |1700        |171         
          11       |AllGroupsCollector            |country:es              |717         |3012        |171         
          12       |ResearchAllGroupsCollector    |country:es              |755         |3012        |171         
          13       |AllGroupsCollector            |country:es              |691         |3012        |172         
          14       |ResearchAllGroupsCollector    |country:es              |765         |3012        |172         
          15       |AllGroupsCollector            |*:*                     |411         |7502        |172         
          16       |ResearchAllGroupsCollector    |*:*                     |462         |7502        |173         
          17       |AllGroupsCollector            |country:es              |685         |3012        |173         
          18       |ResearchAllGroupsCollector    |country:es              |743         |3012        |173         
          19       |AllGroupsCollector            |country:eg              |57          |646         |173         
          20       |ResearchAllGroupsCollector    |country:eg              |63          |646         |173         
          21       |AllGroupsCollector            |country:tr              |51          |1700        |173         
          22       |ResearchAllGroupsCollector    |country:tr              |54          |1700        |174         
          23       |AllGroupsCollector            |country:it              |2           |9           |174         
          24       |ResearchAllGroupsCollector    |country:it              |3           |9           |174         
          25       |AllGroupsCollector            |*:*                     |433         |7502        |174         
          26       |ResearchAllGroupsCollector    |*:*                     |709         |7502        |175         
          27       |AllGroupsCollector            |country:tr              |50          |1700        |175         
          28       |ResearchAllGroupsCollector    |country:tr              |59          |1700        |175         
          29       |AllGroupsCollector            |country:gr              |44          |2061        |175         
          30       |ResearchAllGroupsCollector    |country:gr              |43          |2061        |175         
          31       |AllGroupsCollector            |country:us              |17          |11          |176         
          32       |ResearchAllGroupsCollector    |country:us              |27          |11          |176         
          33       |AllGroupsCollector            |country:tr              |55          |1700        |176         
          34       |ResearchAllGroupsCollector    |country:tr              |55          |1700        |176         
          35       |AllGroupsCollector            |country:e*              |941         |3658        |180         
          36       |ResearchAllGroupsCollector    |country:e*              |1087        |3658        |184         
          37       |AllGroupsCollector            |country:es              |706         |3012        |184         
          38       |ResearchAllGroupsCollector    |country:es              |747         |3012        |184         
          39       |AllGroupsCollector            |country:eg              |57          |646         |185         
          40       |ResearchAllGroupsCollector    |country:eg              |61          |646         |185         
          41       |AllGroupsCollector            |country:es country:eg   |1507        |3658        |170         
          42       |ResearchAllGroupsCollector    |country:es country:eg   |1512        |3658        |171         
          43       |AllGroupsCollector            |country:gr              |38          |2061        |171         
          44       |ResearchAllGroupsCollector    |country:gr              |41          |2061        |172         
          45       |AllGroupsCollector            |country:us              |17          |11          |172         
          46       |ResearchAllGroupsCollector    |country:us              |19          |11          |172         
          47       |AllGroupsCollector            |country:es              |708         |3012        |172         
          48       |ResearchAllGroupsCollector    |country:es              |742         |3012        |172         
          49       |AllGroupsCollector            |country:us              |17          |11          |172         
          50       |ResearchAllGroupsCollector    |country:us              |19          |11          |173         
          
          Show
          Martijn van Groningen added a comment - Also some initial performance indications. This data was created with the runner provided in the patch. A ran it on a index of 30M documents that has 7502 unique groups. ===== First pass collectors execution ===== number |collector |query |time (ms) |Current mem usage (MB) 1 |FirstPassGroupingCollector |country:es |888 |170 2 |ResearchFirstPassGroupingCollector |country:es |710 |170 3 |FirstPassGroupingCollector |country:es country:eg |1448 |170 4 |ResearchFirstPassGroupingCollector |country:es country:eg |1350 |170 5 |FirstPassGroupingCollector |country:it |2 |170 6 |ResearchFirstPassGroupingCollector |country:it |3 |170 7 |FirstPassGroupingCollector |country:gr |45 |170 8 |ResearchFirstPassGroupingCollector |country:gr |43 |170 9 |FirstPassGroupingCollector |country:it |3 |170 10 |ResearchFirstPassGroupingCollector |country:it |3 |170 11 |FirstPassGroupingCollector |country:it |2 |170 12 |ResearchFirstPassGroupingCollector |country:it |3 |170 13 |FirstPassGroupingCollector |country:eg |67 |170 14 |ResearchFirstPassGroupingCollector |country:eg |91 |171 15 |FirstPassGroupingCollector |country:es country:eg |1387 |171 16 |ResearchFirstPassGroupingCollector |country:es country:eg |1351 |171 17 |FirstPassGroupingCollector |country:e* |2264 |173 18 |ResearchFirstPassGroupingCollector |country:e* |1036 |176 19 |FirstPassGroupingCollector |country:it |2 |176 20 |ResearchFirstPassGroupingCollector |country:it |3 |176 21 |FirstPassGroupingCollector |*:* |526 |176 22 |ResearchFirstPassGroupingCollector |*:* |410 |176 23 |FirstPassGroupingCollector |country:gr |50 |177 24 |ResearchFirstPassGroupingCollector |country:gr |45 |177 25 |FirstPassGroupingCollector |country:es country:eg |1397 |177 26 |ResearchFirstPassGroupingCollector |country:es country:eg |1362 |177 27 |FirstPassGroupingCollector |*:* |544 |177 28 |ResearchFirstPassGroupingCollector |*:* |411 |177 29 |FirstPassGroupingCollector |country:tr |54 |177 30 |ResearchFirstPassGroupingCollector |country:tr |52 |177 31 |FirstPassGroupingCollector |*:* |531 |177 32 |ResearchFirstPassGroupingCollector |*:* |413 |177 33 |FirstPassGroupingCollector |*:* |534 |177 34 |ResearchFirstPassGroupingCollector |*:* |416 |177 35 |FirstPassGroupingCollector |country:eg |67 |177 36 |ResearchFirstPassGroupingCollector |country:eg |64 |177 37 |FirstPassGroupingCollector |country:gr |45 |177 38 |ResearchFirstPassGroupingCollector |country:gr |47 |177 39 |FirstPassGroupingCollector |country:e* |1109 |181 40 |ResearchFirstPassGroupingCollector |country:e* |965 |185 41 |FirstPassGroupingCollector |country:us |56 |185 42 |ResearchFirstPassGroupingCollector |country:us |62 |185 43 |FirstPassGroupingCollector |country:gr |44 |185 44 |ResearchFirstPassGroupingCollector |country:gr |42 |185 45 |FirstPassGroupingCollector |country:es |872 |185 46 |ResearchFirstPassGroupingCollector |country:es |723 |185 47 |FirstPassGroupingCollector |country:es |804 |185 48 |ResearchFirstPassGroupingCollector |country:es |717 |185 49 |FirstPassGroupingCollector |country:es |747 |185 50 |ResearchFirstPassGroupingCollector |country:es |723 |185 ===== All groups collectors execution ===== number |collector |query |time (ms) |count |Current mem usage (MB) 1 |AllGroupsCollector |country:us |54 |11 |185 2 |ResearchAllGroupsCollector |country:us |53 |11 |169 3 |AllGroupsCollector |country:it |2 |9 |169 4 |ResearchAllGroupsCollector |country:it |2 |9 |169 5 |AllGroupsCollector |country:tr |64 |1700 |170 6 |ResearchAllGroupsCollector |country:tr |67 |1700 |170 7 |AllGroupsCollector |*:* |454 |7502 |170 8 |ResearchAllGroupsCollector |*:* |471 |7502 |170 9 |AllGroupsCollector |country:tr |51 |1700 |171 10 |ResearchAllGroupsCollector |country:tr |61 |1700 |171 11 |AllGroupsCollector |country:es |717 |3012 |171 12 |ResearchAllGroupsCollector |country:es |755 |3012 |171 13 |AllGroupsCollector |country:es |691 |3012 |172 14 |ResearchAllGroupsCollector |country:es |765 |3012 |172 15 |AllGroupsCollector |*:* |411 |7502 |172 16 |ResearchAllGroupsCollector |*:* |462 |7502 |173 17 |AllGroupsCollector |country:es |685 |3012 |173 18 |ResearchAllGroupsCollector |country:es |743 |3012 |173 19 |AllGroupsCollector |country:eg |57 |646 |173 20 |ResearchAllGroupsCollector |country:eg |63 |646 |173 21 |AllGroupsCollector |country:tr |51 |1700 |173 22 |ResearchAllGroupsCollector |country:tr |54 |1700 |174 23 |AllGroupsCollector |country:it |2 |9 |174 24 |ResearchAllGroupsCollector |country:it |3 |9 |174 25 |AllGroupsCollector |*:* |433 |7502 |174 26 |ResearchAllGroupsCollector |*:* |709 |7502 |175 27 |AllGroupsCollector |country:tr |50 |1700 |175 28 |ResearchAllGroupsCollector |country:tr |59 |1700 |175 29 |AllGroupsCollector |country:gr |44 |2061 |175 30 |ResearchAllGroupsCollector |country:gr |43 |2061 |175 31 |AllGroupsCollector |country:us |17 |11 |176 32 |ResearchAllGroupsCollector |country:us |27 |11 |176 33 |AllGroupsCollector |country:tr |55 |1700 |176 34 |ResearchAllGroupsCollector |country:tr |55 |1700 |176 35 |AllGroupsCollector |country:e* |941 |3658 |180 36 |ResearchAllGroupsCollector |country:e* |1087 |3658 |184 37 |AllGroupsCollector |country:es |706 |3012 |184 38 |ResearchAllGroupsCollector |country:es |747 |3012 |184 39 |AllGroupsCollector |country:eg |57 |646 |185 40 |ResearchAllGroupsCollector |country:eg |61 |646 |185 41 |AllGroupsCollector |country:es country:eg |1507 |3658 |170 42 |ResearchAllGroupsCollector |country:es country:eg |1512 |3658 |171 43 |AllGroupsCollector |country:gr |38 |2061 |171 44 |ResearchAllGroupsCollector |country:gr |41 |2061 |172 45 |AllGroupsCollector |country:us |17 |11 |172 46 |ResearchAllGroupsCollector |country:us |19 |11 |172 47 |AllGroupsCollector |country:es |708 |3012 |172 48 |ResearchAllGroupsCollector |country:es |742 |3012 |172 49 |AllGroupsCollector |country:us |17 |11 |172 50 |ResearchAllGroupsCollector |country:us |19 |11 |173
          Hide
          Martijn van Groningen added a comment -

          Attached an initial idea of abstracting away the source of the group values.

          In the patch there are four concepts:

          • GroupValue. An abstraction of a value.
          • GroupValueSource. An abstraction of the source of the group value.
          • GroupHolder. An abstract holder to get the groups in a efficient manner.
          • GroupSpecification. A factory class that glues all the above concepts together.

          I think with this infrastructure it is quite straight forward to add the ability to group by a function. The patch contains implementations for all fields (string, int, double etc.).

          Some concepts are look like with what is already in Lucene / Solr. For example Lucene's ValueSource and DocValues or Solr's DocValues and MutableValue. I just started from scratch to see what grouping really needs. The Lucene's DocValues for example didn't have all functionality grouping needs.

          Furthermore I have added research group collectors. I ported the MatchAllGroupsCollector and FirstPassGroupingCollector to use this new infrastructure. Just to get a feeling of how it all fits together.

          I also included a simple runner class that runs these research collectors and compares them to the variants already added to the code base. You can then easily check search times and the group results.

          Show
          Martijn van Groningen added a comment - Attached an initial idea of abstracting away the source of the group values. In the patch there are four concepts: GroupValue. An abstraction of a value. GroupValueSource. An abstraction of the source of the group value. GroupHolder. An abstract holder to get the groups in a efficient manner. GroupSpecification. A factory class that glues all the above concepts together. I think with this infrastructure it is quite straight forward to add the ability to group by a function. The patch contains implementations for all fields (string, int, double etc.). Some concepts are look like with what is already in Lucene / Solr. For example Lucene's ValueSource and DocValues or Solr's DocValues and MutableValue. I just started from scratch to see what grouping really needs. The Lucene's DocValues for example didn't have all functionality grouping needs. Furthermore I have added research group collectors. I ported the MatchAllGroupsCollector and FirstPassGroupingCollector to use this new infrastructure. Just to get a feeling of how it all fits together. I also included a simple runner class that runs these research collectors and compares them to the variants already added to the code base. You can then easily check search times and the group results.
          Hide
          Martijn van Groningen added a comment -

          In order to avoid double FC usage we need to let the grouping collectors instantiate the right fieldcache for the right field.
          The current collectors always create a StringIndex (3.x) or DocTermsIndex (4.0). When you sort on the same field as you group and this field is for example an int. Then a complete new StringIndex / DocTermsIndex is put in the FieldCache.

          Show
          Martijn van Groningen added a comment - In order to avoid double FC usage we need to let the grouping collectors instantiate the right fieldcache for the right field. The current collectors always create a StringIndex (3.x) or DocTermsIndex (4.0). When you sort on the same field as you group and this field is for example an int. Then a complete new StringIndex / DocTermsIndex is put in the FieldCache.
          Hide
          Michael McCandless added a comment -

          I was thinking we could parameterize the type of the group key for these collectors, ie today it's BytesRef, but Solr would use MutableValue instead, and would subclass to override the getGroupKey (and presumably also setNextReader). Not sure it'll all work though until we try it... I agree there are alot of places that need to be touched.

          Show
          Michael McCandless added a comment - I was thinking we could parameterize the type of the group key for these collectors, ie today it's BytesRef, but Solr would use MutableValue instead, and would subclass to override the getGroupKey (and presumably also setNextReader). Not sure it'll all work though until we try it... I agree there are alot of places that need to be touched.
          Hide
          Martijn van Groningen added a comment -

          All the current group collectors use currently DocTermsIndex. With function queries I think this can't be used. However changing this requires changing the collectors in many places. So I think the getGroupKey() method isn't enough. Solr uses the MutableValue to abstract away the source of the values. I'm not sure what we can use in Lucene to have this abstraction.

          Show
          Martijn van Groningen added a comment - All the current group collectors use currently DocTermsIndex. With function queries I think this can't be used. However changing this requires changing the collectors in many places. So I think the getGroupKey() method isn't enough. Solr uses the MutableValue to abstract away the source of the values. I'm not sure what we can use in Lucene to have this abstraction.

            People

            • Assignee:
              Unassigned
              Reporter:
              Michael McCandless
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development