Lucene - Core / LUCENE-8040

Optimize IndexSearcher.collectionStatistics

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: master (8.0)
    • Component/s: core/search
    • Labels: None
    • Lucene Fields: New

      Description

IndexSearcher.collectionStatistics(field) can do a fair amount of work because each invocation calls MultiFields.getTerms(...). The cost is aggravated for queries with many fields, since each field needs its own statistics, and further aggravated when there are many segments.
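For a rough picture of where that work happens, this is approximately what the current implementation does on every call (an approximation for illustration; the exact zero/null handling differs between 7.x and master):

    // Approximately the current IndexSearcher.collectionStatistics(field); `reader` is the
    // searcher's top-level IndexReader. MultiFields.getTerms builds a merged view over the
    // Terms of every segment, and it does so again on every invocation.
    public CollectionStatistics collectionStatistics(String field) throws IOException {
      final long docCount;
      final long sumTotalTermFreq;
      final long sumDocFreq;
      Terms terms = MultiFields.getTerms(reader, field);
      if (terms == null) {
        docCount = 0;
        sumTotalTermFreq = 0;
        sumDocFreq = 0;
      } else {
        docCount = terms.getDocCount();
        sumTotalTermFreq = terms.getSumTotalTermFreq();
        sumDocFreq = terms.getSumDocFreq();
      }
      return new CollectionStatistics(field, reader.maxDoc(), docCount, sumTotalTermFreq, sumDocFreq);
    }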

      1. lucenecollectionStatisticsbench.zip
        64 kB
        David Smiley
      2. MyBenchmark.java
        9 kB
        David Smiley
      3. LUCENE-8040.patch
        2 kB
        David Smiley
      4. LUCENE-8040.patch
        2 kB
        David Smiley

        Issue Links

          Activity

David Smiley added a comment -

          Good. Thanks for the review/feedback!

ASF subversion and git services added a comment -

          Commit af2b903f65e4451838fb3e93511329acec30a2a1 in lucene-solr's branch refs/heads/master from David Smiley
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=af2b903 ]

          LUCENE-8040: optimize IndexSearcher.collectionStatistics

Robert Muir added a comment -

          +1 to take the conservative approach and just commit to master: the stats can't be -1 there.

David Smiley added a comment -

Robert Muir wrote: "No, for 7.x you need to handle the -1 case for stats, just like MultiTerms currently does."

Oh yeah, thanks for the tip. So adding support for -1 stats would be pretty annoying here... something like this, but for all three stats. Instead of

    docCount += terms.getDocCount();

we'd have:

    int tmpDC = terms.getDocCount();
    docCount = tmpDC == -1 ? -1 : docCount + tmpDC;

But even then it's not completely equivalent if the stats are -1 in some segments but not all. Do you think that matters, Robert Muir? I'm tempted to just not backport to 7.x.
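For completeness, the full treatment for all three stats might look roughly like this inside a per-segment loop, keeping -1 "sticky" once any segment reports it (illustration only, not an actual patch):

    // Sketch: 7.x-style -1 ("statistic unavailable") propagation for all three stats.
    long docCount = 0, sumTotalTermFreq = 0, sumDocFreq = 0;
    for (LeafReaderContext leaf : reader.leaves()) {
      Terms terms = leaf.reader().terms(field);
      if (terms == null) {
        continue;                                // field absent from this segment
      }
      int dc = terms.getDocCount();
      docCount = (dc == -1 || docCount == -1) ? -1 : docCount + dc;
      long sttf = terms.getSumTotalTermFreq();
      sumTotalTermFreq = (sttf == -1 || sumTotalTermFreq == -1) ? -1 : sumTotalTermFreq + sttf;
      long sdf = terms.getSumDocFreq();
      sumDocFreq = (sdf == -1 || sumDocFreq == -1) ? -1 : sumDocFreq + sdf;
    }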

Robert Muir added a comment -

No, for 7.x you need to handle the -1 case for stats, just like MultiTerms currently does.

David Smiley added a comment -

          Updated patch to fix a mistake I discovered while investigating some test failures. I had written =+ instead of += – nasty! I've tweaked my IntelliJ inspection settings to alert me to this.
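To show why it compiles yet misbehaves (both lines are valid Java):

    docCount =+ terms.getDocCount();   // parses as docCount = +terms.getDocCount(); silently overwrites the running sum
    docCount += terms.getDocCount();   // intended: accumulates into the running sum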

          I'll commit later this week if I don't hear further feedback.

David Smiley added a comment -

          LUCENE-8040.patch is the master branch version.

I think the 7.x version simply removes the null-return condition when docCount is 0.

Robert Muir added a comment -

Can you please upload a proper patch so it can be reviewed? Note the code needs to be different for master and 7.x.

David Smiley added a comment -

          Ok, I'll commit the raw impl then. It's at least nice to no longer reference MultiFields and hence MultiTerms here.

Robert Muir added a comment -

It's not saving a "lot". We are talking about microseconds here either way.

IndexSearcher does not contain the QueryCache. The caching is at the segment level; you just configure it by passing it in there.

Big difference: I'm strongly against caching on IndexSearcher, especially for something that takes microseconds.

David Smiley added a comment -

          I updated the benchmark to use a custom FilterDirectoryReader that ultimately has a custom FilterLeafReader that caches the Terms impls into a HashMap. Then I reran the benchmark with 150 fields, 30 segments:

          IndexSearcher MultiFields (current)
            346.155 ±(99.9%) 57.775 us/op [Average]
            (min, avg, max) = (334.952, 346.155, 371.996), stdev = 15.004
            CI (99.9%): [288.380, 403.930] (assumes normal distribution)
          
          Raw compute on demand each time
            196.271 ±(99.9%) 14.716 us/op [Average]
            (min, avg, max) = (192.012, 196.271, 201.187), stdev = 3.822
            CI (99.9%): [181.555, 210.987] (assumes normal distribution)
          
          ConcurrentHashMap lazy cache of raw compute
            4.553 ±(99.9%) 0.245 us/op [Average]
            (min, avg, max) = (4.465, 4.553, 4.636), stdev = 0.064
            CI (99.9%): [4.308, 4.799] (assumes normal distribution)
          

          Clearly the ConcurrentHashMap is saving us a lot.

You say we shouldn't add caching to IndexSearcher, but IndexSearcher contains the QueryCache. Looking at LRUQueryCache, I think I can safely say that a ConcurrentHashMap is lightweight by comparison. Do you disagree?

Robert Muir added a comment -

I don't really see it as a separate issue. collectionStatistics() is just looking up the stats from blocktree's maps and summing them up, so if it's too slow, then blocktree is the place to fix it.

David Smiley added a comment -

          RE switching away from TreeMap to HashMap: absolutely! I've been working on a client project where this was already profiled and optimized to a HashMap in a fork. Definitely a separate issue, though I can see how the results of that will impact the results here. I'll file an issue.

Robert Muir added a comment -

Also, as far as lowering the overhead of getting to a field, the better fix is probably in BlockTreeTermsReader. Today getting to a specific field is O(log N) (TreeMap). Maybe it should be a HashMap instead.

Either a LinkedHashMap or a separate sorted array could provide the "iterator" functionality, but I think it currently optimizes for the wrong case (iterating fields in order, versus getting to a particular field).
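A minimal sketch of that idea in isolation, hypothetical and not the actual BlockTreeTermsReader code: a HashMap for constant-time lookup of a field's entry, plus a sorted array of field names for in-order iteration.

    import java.util.*;

    // Hypothetical sketch: O(1) field lookup plus sorted iteration, built once at open time.
    class FieldsIndex<T> {
      private final HashMap<String, T> byName;   // fast get-a-particular-field lookups
      private final String[] sortedNames;        // preserves the sorted iteration order

      FieldsIndex(SortedMap<String, T> source) {
        this.byName = new HashMap<>(source);
        this.sortedNames = source.keySet().toArray(new String[0]);
      }

      T get(String field) {                      // O(1) instead of the TreeMap's O(log N)
        return byName.get(field);
      }

      Iterator<String> iterator() {              // fields in sorted order, like today
        return Arrays.asList(sortedNames).iterator();
      }
    }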

Robert Muir added a comment -

I don't think we should add caching to the IndexSearcher, as it is supposed to remain lightweight. Instead of using MultiFields, we should just sum up the stats per segment as a start. This is easier now that the statistics can no longer be -1.

David Smiley added a comment -

          I considered a few alternatives to this today using JMH:

          • Cache MultiFields on the IndexSearcher
• Compute the CollectionStatistics raw, immediately (don't look up or cache)
          • Add a ConcurrentHashMap<String,CollectionStatistics> on the IndexSearcher and compute on demand.

Attached is the JMH benchmark program. Between runs I would change line 78 to call out to the impl I wanted to try. The JMH main class is "org.openjdk.jmh.Main" and I used args "-wi 5 -i 5 -f 1".
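This is not the attached MyBenchmark.java, just an illustrative skeleton of the shape such a JMH benchmark takes, with a toy in-memory index standing in for the real multi-field, multi-segment one:

    import java.io.IOException;
    import java.util.concurrent.TimeUnit;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;
    import org.openjdk.jmh.annotations.*;
    import org.openjdk.jmh.infra.Blackhole;

    // Illustrative skeleton only -- not the attached MyBenchmark.java. Run via
    // org.openjdk.jmh.Main with "-wi 5 -i 5 -f 1" as described above.
    @State(Scope.Benchmark)
    public class CollectionStatisticsBench {
      String[] fields;
      IndexSearcher searcher;

      @Setup
      public void setup() throws IOException {
        fields = new String[50];                 // toy field count; the real bench uses more fields and segments
        Directory dir = new RAMDirectory();      // tiny in-memory index standing in for the real one
        try (IndexWriter iw = new IndexWriter(dir, new IndexWriterConfig())) {
          Document doc = new Document();
          for (int i = 0; i < fields.length; i++) {
            fields[i] = "field" + i;
            doc.add(new TextField(fields[i], "some text", Field.Store.NO));
          }
          iw.addDocument(doc);
        }
        searcher = new IndexSearcher(DirectoryReader.open(dir));
      }

      @Benchmark
      @BenchmarkMode(Mode.AverageTime)
      @OutputTimeUnit(TimeUnit.MICROSECONDS)
      public void bench(Blackhole bh) throws IOException {
        for (String field : fields) {
          bh.consume(searcher.collectionStatistics(field));  // the call being measured
        }
      }
    }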

          My annotated results are:

          Result "dsmiley.MyBenchmark.bench":    IndexSearcher
            1146.739 ±(99.9%) 280.645 us/op [Average]
            (min, avg, max) = (1034.410, 1146.739, 1238.123), stdev = 72.883
            CI (99.9%): [866.094, 1427.385] (assumes normal distribution)
          
          Result "dsmiley.MyBenchmark.bench":    cached MultiFields
            29.556 ±(99.9%) 8.929 us/op [Average]
            (min, avg, max) = (27.409, 29.556, 33.424), stdev = 2.319
            CI (99.9%): [20.626, 38.485] (assumes normal distribution)
          
          Result "dsmiley.MyBenchmark.bench":    raw compute 
            951.494 ±(99.9%) 182.555 us/op [Average]
            (min, avg, max) = (904.328, 951.494, 1024.473), stdev = 47.409
            CI (99.9%): [768.940, 1134.049] (assumes normal distribution)
          
          Result "dsmiley.MyBenchmark.bench":   ConcurrentHashMap
            4.448 ±(99.9%) 1.268 us/op [Average]
            (min, avg, max) = (4.090, 4.448, 4.860), stdev = 0.329
            CI (99.9%): [3.180, 5.717] (assumes normal distribution)
          
          
           For 5 fields:
           raw:                10.716 us/op
           ConcurrentHashMap:   0.155 us/op
          

          I think the results are pretty clear that we should go with the ConcurrentHashMap.

I'm aware my benchmark implementation of this needs some more work. If an IOException is thrown, it should pass through without a RuntimeException wrapper. And if the field doesn't exist, we want to return null.
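For reference, a sketch of how the compute-on-demand cache might look, with one possible way to let the IOException pass through; computeRawStatistics(field) is a hypothetical helper that does the per-segment work (illustration only, not a patch):

    // Sketch of the ConcurrentHashMap<String,CollectionStatistics> lazy cache.
    // Needs java.util.concurrent.ConcurrentHashMap and java.io.UncheckedIOException.
    private final ConcurrentHashMap<String, CollectionStatistics> statsCache = new ConcurrentHashMap<>();

    public CollectionStatistics collectionStatistics(String field) throws IOException {
      try {
        // If the field doesn't exist, computeRawStatistics returns null; computeIfAbsent
        // then returns null without recording an entry.
        return statsCache.computeIfAbsent(field, f -> {
          try {
            return computeRawStatistics(f);       // hypothetical per-segment summation helper
          } catch (IOException e) {
            throw new UncheckedIOException(e);    // can't throw checked exceptions from the lambda...
          }
        });
      } catch (UncheckedIOException e) {
        throw e.getCause();                       // ...so unwrap, letting callers see the IOException
      }
    }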


            People

            • Assignee: David Smiley
            • Reporter: David Smiley
            • Votes: 0
            • Watchers: 5

              Dates

              • Created:
              • Updated:
              • Resolved:

                Development