SOLR-8922: DocSetCollector can allocate massive garbage on large indexes

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      After reaching a point of diminishing returns tuning the GC collector, I decided to take a look at where the garbage was coming from. To my surprise, it turned out that for my index and query set, almost 60% of the garbage was coming from this single line:

      https://github.com/apache/lucene-solr/blob/94c04237cce44cac1e40e1b8b6ee6a6addc001a5/solr/core/src/java/org/apache/solr/search/DocSetCollector.java#L49

      This is due to the simple fact that I have 86M documents in my shards. Allocating a scratch array big enough to track a result set 1/64th of my index (1.3M) is also almost certainly excessive, considering my 99.9th percentile hit count is less than 56k.
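For scale, a quick back-of-the-envelope sketch of that allocation. The 1/64 sizing and the 86M document count come from the description above; the class and method names here are illustrative, not Solr's actual code:

```java
// Rough cost of the per-query scratch array described above: Solr sizes it
// at maxDoc/64 (the point below which an int[] of doc ids is smaller than a
// bitset of maxDoc bits), regardless of how many docs actually match.
public class ScratchSizeEstimate {
    public static long scratchBytes(long maxDoc) {
        long smallSetSize = maxDoc >> 6;     // 1/64th of the index
        return smallSetSize * Integer.BYTES; // one int per collectable doc id
    }

    public static void main(String[] args) {
        long maxDoc = 86_000_000L; // 86M docs per shard, as reported
        // ~1.34M ints, roughly 5.4 MB of garbage per docset-producing query,
        // even when the 99.9th percentile hit count is under 56k.
        System.out.println(scratchBytes(maxDoc));
    }
}
```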

      Attachments

      1. SOLR-8922.patch (5 kB, Yonik Seeley)
      2. SOLR-8922.patch (5 kB, Jeff Wartes)


          Activity

          jwartes Jeff Wartes added a comment -

          SOLR-5444 had a patch to help with this, (SOLR-5444_ExpandingIntArray_DocSetCollector_4_4_0.patch) but it was mixed in with some other things, and didn't get picked up with the other parts of the issue.

          jwartes Jeff Wartes added a comment -

          This is essentially the same patch as in SOLR-5444, but applies cleanly against (at least) 5.4 where I did some GC testing, and master.

          jwartes Jeff Wartes added a comment -

          For my index, (86M-doc shards and a per-shard 99.9th percentile query hit count of 56k) this reduced total garbage generation by 33%, which naturally also brought significant improvements in gc pause and frequency.

          erickerickson Erick Erickson added a comment -

          Jeff:

          Thanks for pointing that out, we need to keep from generating garbage whenever possible.

          One question I have is whether you've got any good stats on how performant this is?

          jwartes Jeff Wartes added a comment -

          Not yet. The major risk area would be the new ExpandingIntArray class, but it looked reasonable. It expands along powers of two, and although the add() and copyTo() calls are certainly more work than simple array assignment/retrieval, it still all looks like pretty simple stuff. A few ArrayList calls and some simple numeric comparisons mostly.
          I'm more worried about bugs in there than performance, I don't know how well Per Steffensen tested this, although I got the impression he was using it in production at the time.

          There may be better approaches, but this one was handy and I'm excited enough that I'm going to be doing a production test. I'll have more info in a day or two.

          As a side note, I got a similar garbage-related improvement on an earlier test by simply hard-coding the smallSetSize to 100000 - the expanding arrays approach only bought me another 3%. But of course, that 100000 is very index and query set dependent, so I didn't want to offer it as a general case.
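For readers without the patch handy, the expanding-array idea described above can be sketched roughly like this. This is a simplified illustration of chunked, power-of-two growth, not the actual ExpandingIntArray code from the patch:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the expanding-array idea: instead of one large upfront
// int[maxDoc/64] scratch array, allocate chunks whose sizes grow along
// powers of two, so total allocation tracks the actual hit count.
public class ExpandingIntArraySketch {
    private final List<int[]> chunks = new ArrayList<>();
    private int[] current;       // chunk currently being filled
    private int indexInCurrent;  // next free slot in current
    private int size;            // total values added

    public void add(int value) {
        if (current == null || indexInCurrent == current.length) {
            // next chunk doubles: 4, 8, 16, ... (the first chunk size is arbitrary)
            current = new int[current == null ? 4 : current.length * 2];
            chunks.add(current);
            indexInCurrent = 0;
        }
        current[indexInCurrent++] = value;
        size++;
    }

    public int size() { return size; }

    // Copy all collected values into a single right-sized array, as copyTo() would.
    public int[] toArray() {
        int[] out = new int[size];
        int pos = 0;
        for (int[] chunk : chunks) {
            int n = Math.min(chunk.length, size - pos);
            System.arraycopy(chunk, 0, out, pos, n);
            pos += n;
        }
        return out;
    }

    public static void main(String[] args) {
        ExpandingIntArraySketch a = new ExpandingIntArraySketch();
        for (int i = 0; i < 100; i++) a.add(i);
        System.out.println(a.size() + " values in " + a.toArray().length + "-element array");
    }
}
```

As Jeff notes, add() and copyTo() do more work than a plain array store, but it is all cheap arithmetic plus an occasional small allocation.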

          erickerickson Erick Erickson added a comment -

          I'm also wondering if there are some tricks where we could re-use memory rather than allocate new all the time. I guess the reason I'm obsessing about this is that anything that accounts for this much of the garbage collected seems like it's worth the effort. This certainly seems like one of those odd places where efficiency might trump simplicity.

          I wonder if there's something we could do that would allocate some kind of reusable pool. I'm thinking of something stupid-simple that we could use for benchmarking (not for commit or production) to get a handle on the dimension of the problem and how broadly it would apply...

          After all, GC is one of the things we spend a lot of time on when supporting Solr.

          jwartes Jeff Wartes added a comment -

          Absolutely. Memory pools were my first thought, between when I saw that 60% and when I looked at my hit rates and realized the allocation size could just be changed. I had started poking around the internet for terms like "slab allocators" and "direct byte buffers", but even an on-heap persistent pool sounded good to me. Or, if you had persistent tracking of hit rates for the optimization, perhaps the size of the scratch array could optimize itself over time. All of that would be more complicated, of course.

          I did look at one other place worth mentioning though. In Heliosearch the way the DocSetCollector handles the "scratch" array isn't any different, but it's interesting because it added a lifecycle with a close() method to the class, to support the native bitset implementation. Knowing that it's possible to impose a lifecycle on the class, checking things out and back into a persistent memory pool should be easy.
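The check-out/check-in idea could look something like the following. This is purely a hypothetical sketch of the pooling approach under discussion; nothing below exists in Solr, and all names are invented:

```java
import java.util.concurrent.ArrayBlockingQueue;

// Hypothetical scratch-array pool: with a close() lifecycle on the collector
// (as Heliosearch added), scratch arrays could be checked out of a shared pool
// and returned on close() instead of being allocated fresh for every query.
public class ScratchPool {
    private final ArrayBlockingQueue<int[]> pool;
    private final int arraySize;

    public ScratchPool(int poolCapacity, int arraySize) {
        this.pool = new ArrayBlockingQueue<>(poolCapacity);
        this.arraySize = arraySize;
    }

    // Reuse a pooled array if one is available, otherwise allocate.
    public int[] checkout() {
        int[] a = pool.poll();
        return a != null ? a : new int[arraySize];
    }

    // Return an array to the pool; if the pool is full, let GC reclaim it.
    public void checkin(int[] a) {
        pool.offer(a);
    }

    public static void main(String[] args) {
        ScratchPool p = new ScratchPool(4, 1024);
        int[] a = p.checkout();
        p.checkin(a); // a close() method on the collector would do this
        System.out.println("reused: " + (p.checkout() == a));
    }
}
```

A real version would also need to handle sizing (queries needing different scratch sizes) and zeroing or bounds-tracking of reused arrays, which is where the complexity Jeff mentions comes in.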

          jwartes Jeff Wartes added a comment -

          Incidentally, I had one or two other findings from my garbage analysis. Solutions are less obvious there though, and probably involve some conversation. Is Jira the right place for that, or is there another medium more appropriate?

          jwartes Jeff Wartes added a comment -

          Ok, so after a little more than 12 hours on one of my production nodes, there was no noticeable change in CPU usage.
          Running before/after GC logs through GCViewer, it's a little hard to compare rate, since the logs were for different intervals and the "after" log included startup. That said, "Freed mem/minute" was down by 44%, and "Throughput" went from 87% to 93%. I also see noticeably reduced average pause time, and increased average pause interval. All positive signs.

          The only irritation I'm finding here is that it looks like the CMS collector is running more often. I expect that's simply because I changed the footing of a fairly tuned set of GC parameters though.

          k317h Keith Laban added a comment -

          Just out of curiosity, have you run tests with G1? Was there a performance difference?

          dsmiley David Smiley added a comment -

          Solr uses HPPC for its native collections. Just use IntArrayList instead of writing your own.

          joel.bernstein Joel Bernstein added a comment -

          Also check out ArrayUtil in Lucene

          jwartes Jeff Wartes added a comment -

          With some tweaking, I was able to get G1 pause times into about the same ballpark as I get with ParNew/CMS. But without a compelling difference, the Lucene recommendation against G1 keeps me away.

          This issue is more about garbage generation though. Less garbage should be a benefit regardless of the collector you choose.

          jwartes Jeff Wartes added a comment -

          Both of those appear to add capacity by declaring a new array and doing a System.arraycopy.
          Wouldn't that just result in more space allocated and then thrown away?

          dsmiley David Smiley added a comment -

          I spoke too soon; I like your utility

          joel.bernstein Joel Bernstein added a comment -

          Yep, that is cool!

          jwartes Jeff Wartes added a comment -

          I stumbled onto SJK recently, which provides me a more lightweight way to measure allocation rate on my production nodes, and also eliminates startup noise from the measurement.
          According to this tool, the node with this patch is allocating heap space at roughly 60% of the rate that the others are.
          That's reasonably consistent with my other measurements, and a pretty big improvement.

          If anyone decides to pull this in, I'd appreciate it getting applied to the 5.5 branch as well, in case there's a 5.5.1 release.

          yseeley@gmail.com Yonik Seeley added a comment -

          I quickly tried some patches in the past that resulted in performance decreases, so I never got around to committing something.
          I'll try to take a look at the patch here soon.

          yseeley@gmail.com Yonik Seeley added a comment (edited) -

          OK, it took me much longer to do the benchmarks than planned. It was sometimes difficult to get stable numbers (mostly due to variations in how hotspot can optimize/deoptimize things I imagine).

          I took the first patch here and further optimized it, getting rid of some of the branches in the inner loop. This new patch also keeps track of the smallSetSize to avoid over-allocating space that won't be used. On the best runs, this did not seem to make too much of a difference, but it became apparent that slower runs were much more frequent before this optimization (the average seemed to be about 15% better).

          Benchmark of new patch vs trunk:
          10M doc index, 20% chance of a document missing the value for a field.
          Queries consisted of many filters (using the filter() support in the query syntax)... 50 per request, with a filterCache size of 1 to generally avoid cache hits. The large number of filters per request is to just try and make docset generation the bottleneck.

          Single-threaded performance: (nterms is the number of unique terms in the field across the entire index)

          nterms    perf improvement
          10        1.88%
          100       -1.05%
          1000      38.25%
          10000     75.10%
          100000    88.86%
          1000000   94.49%

          Single-threaded analysis: one could expect single threaded performance of the patch to be slower... if we promote to a bit set, we will have allocated more memory and done more work. Also, with a single thread doing requests, other CPU cores are free to concurrently perform GC. The fact that the patch was faster for the field with 10 unique terms is most likely measurement inaccuracies. nterms=10 had the most instability across runs, while nterms=100 had the next most instability. Standard deviation of performance results dropped with increasing nterms values.

          Multi-threaded performance (8 threads on a 4 core CPU):

          nterms    perf improvement
          100       14.49%
          1000      93.36%
          10000     179.07%
          100000    216.65%
          1000000   148.45%
          jira-bot ASF subversion and git services added a comment -

          Commit cfba58f0d0adecab495c8ea073f38b0e53f5481f in lucene-solr's branch refs/heads/master from Yonik Seeley
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=cfba58f ]

          SOLR-8922: optimize DocSetCollector to produce less garbage

          jira-bot ASF subversion and git services added a comment -

          Commit 301e178681d72a142dac4bc44416b93f42f33c01 in lucene-solr's branch refs/heads/branch_6x from Yonik Seeley
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=301e178 ]

          SOLR-8922: optimize DocSetCollector to produce less garbage

          yseeley@gmail.com Yonik Seeley added a comment -

          Committed. Thanks Jeff!

          dsmiley David Smiley added a comment -

          Wow; I'm (pleasantly) surprised to see such a general performance increase; I thought this was just about saving memory. Why is it faster? Less GC time?

          I'm confused by the benchmark and/or I don't understand the setup.

          > 20% chance of a document missing the value for a field.

          Put another way, do you mean any given term has an 80% chance of being in the doc?

          I'm confused why the number of terms that are in the field has anything to do with the performance of this patch. Perhaps what you've done in your benchmark is have the fields with the larger number of terms result in any given term matching fewer documents? I think it would be far clearer to report the performance increase over a varying number of docs counted into the doc set. However many terms were in the field doesn't really matter in and of itself (I think). Couldn't you have done all this in one field and just chosen your 50 term queries based on those terms that have the same(ish) document frequency? It might be reported as a percentage of the total docs (thus making the numbers more generally interpretable).

          yseeley@gmail.com Yonik Seeley added a comment -

          > I thought this was just about saving memory. Why is it faster? Less GC time?

          The fact that any difference can be seen at all is that I made docset generation a bottleneck in the test. So the benchmark itself is certainly not typical, even if it is real. And yes, I expect that it's all due to GC... but it's hard to prove.

          > 20% chance of a document missing the value for a field.

          > Put another way, do you mean any given term has an 80% chance of being in the doc?

          No, an 80% chance of having a value for the field. The chance of having "any given term" would be 80%/nterms.

          > I'm confused why the number of terms that are in the field has anything to do with the performance of this patch.

          I'm leveraging pre-existing indexes and tools to test this. Using the different fields with different doc freqs for the terms was an easy way to vary the number of docs collected. 100 unique values in the field means a single matching term query on that field will match 10M docs * .80 / 100, or 80K docs.

          steff1193 Per Steffensen added a comment -

          > I don't know how well Per Steffensen tested this, although I got the impression he was using it in production at the time

          Well, I know I am late, but just for the record: it was tested thoroughly and the gain was significant. It has been running in several production systems since forever.


            People

            • Assignee: yseeley@gmail.com Yonik Seeley
            • Reporter: jwartes Jeff Wartes
            • Votes: 0
            • Watchers: 10
