Solr / SOLR-6216

Better faceting for multiple intervals on DV fields

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.10
    • Component/s: None
    • Labels: None

      Description

      There are two ways to have faceting on values ranges in Solr right now: “Range Faceting” and “Query Faceting” (doing range queries). They both end up doing something similar:

      searcher.numDocs(rangeQ, docs)
      

      The good thing about this implementation is that it can benefit from caching. The bad thing is that it may be slow with cold caches, and that there will be a query for each of the ranges.

      A different implementation would work similarly to regular field faceting: use doc values and validate the ranges for each value of the matching documents. This implementation would sometimes be faster than Range Faceting / Query Faceting, especially in cases where caches are not very effective, such as under a high update rate or when the ranges change frequently.

      Functionally, the result should be exactly the same as the one obtained by doing a facet query for every interval.
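The idea above can be sketched as follows. This is a hypothetical, simplified illustration of the doc-values approach (not the code in the patch, which works per index segment and also supports exclusive bounds): each matching document's value is tested against every interval in a single pass, instead of running one range query per interval.

```java
import java.util.Arrays;
import java.util.List;

public class IntervalCounter {
    public static final class Interval {
        final long start, end;   // inclusive bounds, for simplicity
        long count;
        public Interval(long start, long end) { this.start = start; this.end = end; }
        public long getCount() { return count; }
    }

    // Single pass over the doc values of the matching documents; no
    // per-interval range query (and therefore no filter cache) is involved.
    public static void count(long[] docValues, List<Interval> intervals) {
        for (long v : docValues) {
            for (Interval i : intervals) {
                if (v >= i.start && v <= i.end) {
                    i.count++;
                }
            }
        }
    }

    public static void main(String[] args) {
        List<Interval> intervals = Arrays.asList(new Interval(0, 9), new Interval(5, 20));
        count(new long[] {3, 7, 15}, intervals);
        // [0,9] matches 3 and 7; [5,20] matches 7 and 15
        System.out.println(intervals.get(0).getCount() + " " + intervals.get(1).getCount());
    }
}
```

Note that, unlike facet queries, the cost here grows with the number of matching documents rather than with cache misses per interval.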

      1. FacetTester.java
        4 kB
        Tomás Fernández Löbbe
      2. SOLR-6216.patch
        103 kB
        Erick Erickson
      3. SOLR-6216.patch
        103 kB
        Tomás Fernández Löbbe
      4. SOLR-6216.patch
        101 kB
        Tomás Fernández Löbbe
      5. SOLR-6216.patch
        96 kB
        Erick Erickson
      6. SOLR-6216.patch
        89 kB
        Tomás Fernández Löbbe
      7. SOLR-6216.patch
        93 kB
        Tomás Fernández Löbbe
      8. SOLR-6216.patch
        98 kB
        Tomás Fernández Löbbe
      9. SOLR-6216.patch
        87 kB
        Tomás Fernández Löbbe
      10. SOLR-6216.patch
        73 kB
        Tomás Fernández Löbbe
      11. SOLR-6216.patch
        73 kB
        Tomás Fernández Löbbe


          Activity

          Tomás Fernández Löbbe added a comment -

          Here is the approach I mentioned. It still has some things to be done, but it should give an idea. I'd like to hear what people think about this.

          Tomás Fernández Löbbe added a comment - - edited

          I did some very basic performance testing to compare interval faceting vs facet queries:
          Dataset: Geonames.org dataset (added 4 times to make a 33M-doc index)
          Query Set: 4960 boolean queries using terms from the dataset
          1 document updated every second
          autoSoftCommit every second.
          HW: MacBook Pro Core i7, 2.7 GHz with 8 GB of RAM with spinning disk (5400 RPM)
          All times are in milliseconds
          Repeated the test with different number of intervals (on the “population” field of the geonames dataset)

          Num intervals          1     2     3     4     5     10
          Min     Intervals      25    23    26    23    24    26
                  Facet Query    2     2     3     4     4     6
          Max     Intervals      1885  2254  2508  2800  2749  3031
                  Facet Query    2199  2414  3957  2766  1869  5975
          Average Intervals      181   177   191   183   148   174
                  Facet Query    156   277   359   299   216   408
          P10     Intervals      53    54    54    54    54    56
                  Facet Query    26    30    33    31    29    35
          P50     Intervals      96    95    98    97    88    96
                  Facet Query    54    211   293   188   58    74
          P90     Intervals      453   940   467   458   350   438
                  Facet Query    432   656   794   749   660   1066
          P99     Intervals      809   884   968   877   857   897
                  Facet Query    867   1041  1354  1219  1116  1784
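(For reference, percentile figures like the ones in these tables can be derived from raw per-query latencies with a nearest-rank calculation along these lines. This is a sketch, not the attached FacetTester.java:)

```java
import java.util.Arrays;

public class LatencyStats {
    // Nearest-rank percentile over a sorted copy of the latency samples (ms).
    public static long percentile(long[] samples, double p) {
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }

    public static void main(String[] args) {
        long[] ms = {25, 96, 53, 453, 809, 181, 1885};
        System.out.println("p50=" + percentile(ms, 50) + " p90=" + percentile(ms, 90));
    }
}
```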

          There is some variation between the tests with different numbers of intervals (with the same method) that I don't understand very well. For each test I restarted Jetty (though the index files are probably cached by the OS between tests).

          In general, the average is similar to or lower than facet query, p10 and p50 are similar or higher than facet query (these are probably the cases where the facet queries hit the cache), and p90/p99 are lower for the intervals implementation, probably because of facet queries missing the cache.

          "Max" varies a lot; I don't think it's a very representative number, I just left it in for completeness. Min is very similar in all cases; it's obvious that in the best case (all cache hits), facet query is much faster than intervals.

          I also did a quick test on an internal collection with around 100M docs in a single shard, ran around 6000 queries with around 40 intervals each, for this test I got:

                  Intervals   Facet Query
          Min     122         124
          Max     6626        61009
          Average 238         620
          P10     155         151
          P50     201         202
          P90     324         461
          P99     836         23662

          This collection has updates and soft commits.
          I don’t have numbers for distributed tests, but from what I could see, the result was even better on wide collections, because of the lower p90/p99 I assume.

          Tomás Fernández Löbbe added a comment -

          Updated to current trunk

          Tomás Fernández Löbbe added a comment -

          Added a distributed test case. I don't know if it makes sense to have a completely different test class for this, but I need to suppress some older codecs, and I don't know of a way from inside the test (in the distributed test cases) to know which codecs are being used.

          Tomás Fernández Löbbe added a comment -

          Added some more unit tests

          Tomás Fernández Löbbe added a comment -

          Added some more javadocs and cleaned up the tests.

          Tomás Fernández Löbbe added a comment -

          Any thoughts on this issue?

          Mark Miller added a comment -

          Thoughts:

          The numbers look good, the patch looks good, looks like good comments, looks like good tests, very clean patch.

          Unfortunately, probably not something I can look at soon personally, but just to note in case another committer can tackle this more near term.

          Tomás Fernández Löbbe added a comment -

          Thanks for the feedback Mark. Hopefully some other committer can also take a look soon.

          Tomás Fernández Löbbe added a comment -

          Fixed a problem with the distributed test. Removed some unnecessary changes.

          Tomás Fernández Löbbe added a comment -

          Anyone else that can take a look at the patch?

          Tomás Fernández Löbbe added a comment -

          I have been testing this code in 4.x too, and it works fine there as well. The tests need to omit the "Lucene3x" codec in addition to those omitted in trunk.

          Erick Erickson added a comment -

          Tomas:

          Sorry I haven't been able to get to this, but I started looking at it today. Very nice patch! I'll upload a few tweaks in a bit (spelling corrections, and I thought FacetComponent.countFacets was getting a little long, so I refactored the three clauses into their own methods; no functional change at all).

          A few questions though:
          1> I'm assuming this supports the f.<fieldname>.facet syntax; that's at a higher level than this patch, I'd guess
          2> There's a TODO in SimpleFacets [1410] or so....
          3> Perhaps the most substantive question I have is the syntax. Why use '()' for exclusive ranges and not '{}' like range facets do? Other things being equal, it seems like this would be more consistent with other Solr syntax.

          Tomás Fernández Löbbe added a comment -

          Thanks for looking at the patch Erick.

          1> I'm assuming this supports the f.<fieldname>.facet syntax; that's at a higher level than this patch, I'd guess

          Yes, that's the way to set the intervals for a field. For example:
          

          f.test_l_dv.facet.interval.set=[10,20]

          2> There's a TODO in SimpleFacets ...

          There are a couple of TODOs. The one in SimpleFacets is about adding a comment on the distributed case; I'll add that. In IntervalFacets there are a couple more: two of them I kept from the code I reused from NumericFacets (maybe this can be refactored in the future), and two are in the interval parsing method. I see those last two as possible future improvements; I don't think they are necessary at this point.

          3> Perhaps the most substantive question I have is the syntax. Why use '()' for exclusive ranges and not '{}' like range facets do? Other things being equal, it seems like this would be more consistent with other Solr syntax.

          Well, that is true, but I guess range queries had to use "{}" because "()" was already being used for grouping clauses. I feel that "(10,20)" is a better representation of an interval, but I see your point about consistency with other Solr syntax, and I'm OK either way. I was also planning to add support for local params on the interval, to allow setting an arbitrary key, and with curly braces that would be more confusing:
          

          f.test_l_dv.facet.interval.set={!key='foo'}(10,20)

          vs
          

          f.test_l_dv.facet.interval.set={!key='foo'}{10,20}
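For illustration, a minimal parser for the bracket syntax under discussion could look like this (a hypothetical sketch, not the code in the patch): '[' and ']' make a bound inclusive, while '(' and ')' make it exclusive.

```java
public class IntervalParser {
    public static final class ParsedInterval {
        final long start, end;
        final boolean startOpen, endOpen;   // open = exclusive bound
        ParsedInterval(long start, long end, boolean startOpen, boolean endOpen) {
            this.start = start; this.end = end;
            this.startOpen = startOpen; this.endOpen = endOpen;
        }
        public boolean contains(long v) {
            return (startOpen ? v > start : v >= start)
                && (endOpen ? v < end : v <= end);
        }
    }

    // Accepts strings like "[10,20]", "(10,20)" or "[10,20)".
    public static ParsedInterval parse(String s) {
        s = s.trim();
        char first = s.charAt(0), last = s.charAt(s.length() - 1);
        if ((first != '[' && first != '(') || (last != ']' && last != ')')) {
            throw new IllegalArgumentException("bad interval: " + s);
        }
        String[] parts = s.substring(1, s.length() - 1).split(",");
        return new ParsedInterval(Long.parseLong(parts[0].trim()),
                Long.parseLong(parts[1].trim()), first == '(', last == ')');
    }
}
```

With this reading, "[10,20)" counts 10 through 19 inclusive, matching the interval notation Tomás describes.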
          Tomás Fernández Löbbe added a comment -

          Attached a new patch with some more parsing unit tests (validating Erick's question of f.<field>.facet.interval.set vs facet.interval.set)
          Added some more javadocs on SimpleFacets (removed that TODO)

          Tomás Fernández Löbbe added a comment -

          Added a couple of tests

          Erick Erickson added a comment -

          Trivial changes, took out a couple of unused parameters to methods in IntervalFacets.

          Erick Erickson added a comment -

          Tomás:

          Took another look today; it looks really good. I did happen to notice a couple of unused parameters, so I attached a patch that takes them out.

          Unless there are objections, I'll commit this next Tuesday or Wednesday.

          Thanks!
          Erick

          Tomás Fernández Löbbe added a comment -

          Great, thanks Erick. I'll create a JIRA for the LocalParam behavior I mentioned before (overriding the interval key).

          ASF subversion and git services added a comment -

          Commit 1612889 from Erick Erickson in branch 'dev/trunk'
          [ https://svn.apache.org/r1612889 ]

          SOLR-6216: Better faceting for multiple intervals on DV fields. Thanks Tomas

          Tomás Fernández Löbbe added a comment -

          Thanks for committing this Erick. I tested this in 4x and the tests pass there too. The only thing that needs to be added on top of the merge is "Lucene3x" to the list of codecs to skip (in DistributedIntervalFacetingTest and TestIntervalFaceting).

          ASF subversion and git services added a comment -

          Commit 1612958 from Erick Erickson in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1612958 ]

          SOLR-6216: Better faceting for multiple intervals on DV fields. Thanks Tomas

          Erick Erickson added a comment -

          Thanks Tomás!

          David Smiley added a comment -

          Tomás Fernández Löbbe, do you have any scripts for your performance testing? I'm impressed with that part specifically, as I've got some interval faceting on my horizon to do something somewhat similar to what you've got here: LUCENE-5735

          Tomás Fernández Löbbe added a comment -

          David Smiley I used the attached Java class for running the queries. As I said, the dataset was geonames, added 4 times (with different IDs) so that the index had 33M docs total.
          The queries are all boolean queries with two OR terms, generated by taking terms from the “name” field of the dataset. An example:

          name:cemetery name:lake
          name:el name:historical
          name:church name:el
          name:dam name:la
          name:al name:church
          name:al name:creek
          name:baptist name:la
          name:la name:mount
          name:creek name:de
          name:center name:park
          name:church name:creek
          ...
          

          Eyeballing the logs, most of those queries matched a high number of docs from the index. In addition, I had a bash script running to add documents every one second:

          #!/bin/bash
          IFS=$'\n'
          while read q
          do
            echo "$q" > tmp.doc
            curl -v 'http://localhost:8983/solr/geonames/update?stream.file=/absolute/path/to/tmp.doc&stream.contentType=text/csv;charset=utf-8&separator=%09&encapsulator=%0E&header=false&fieldnames=id,name,,alternatenames,latitude,longitude,feature_class,feature_code,country_code,cc2,admin1_code,admin2_code,admin3_code,admin4_code,population,elevation,dem,timezone,modification_date&f.alternatenames.split=true&f.alternatenames.separator=,&f.alternatenames.encapsulator=%0E&f.cc2.split=true&f.cc2.separator=,&f.cc2.encapsulator=%0E'
            sleep 1
          done < allCountries.txt
          

          Unfortunately, it looks like I deleted the schema file I used; however, there was nothing crazy about it: population is an int field with docValues=true, and autoSoftCommit is configured to run every second.

          For the second test, I can’t upload the code because it’s full of customer specific data, but the test is very similar. I took some production queries, which had “intervals” in 6 fields, around 40 “intervals” total (originally using facet queries for each of them). For that test I used a similar bash script to upload data every second too.

          I have been testing this code in an environment mirroring production for around 2 to 3 weeks now, and QTimes have improved dramatically (on a multi-shard collection). I haven't seen errors related to this.


            People

            • Assignee:
              Erick Erickson
              Reporter:
              Tomás Fernández Löbbe
            • Votes: 0
            • Watchers: 6
