Uploaded image for project: 'Solr'
  1. Solr
  2. SOLR-16555

SolrIndexSearcher - FilterCache intersections/andNot should not clone bitsets repeatedly

    XMLWordPrintableJSON

Details

    Description

      SolrIndexSearcher takes the bitset from the result and tries to combine it with all the cached filter queries. Currently this duplicates the bitset multiple times based on the number of filter queries. It looks like this isn't necessary and instead could just operate on the bitset itself or a single mutable copy of the bitset.

      Lines 1219 to 1225
      https://github.com/apache/solr/blob/main/solr/core/src/java/org/apache/solr/search/SolrIndexSearcher.java#L1219


      I've been using async profiler (https://github.com/jvm-profiling-tools/async-profiler) to look at some performance stuff with Solr for a client. Originally I looked at CPU in the profile and found that I could also capture and look at memory allocations during the same run. This led me to finding this crazy amount of memory allocation over a short period of time.

      Async profiler is being run with the following parameters which captures cpu, memory, and lock information for a 300 second period on some pid.

      /opt/async-profiler/profiler.sh -a -d 300 -o jfr -e cpu,alloc,lock -f /tmp/profile.jfr PID_GOES_HERE
      

      The resulting JFR for this is ~100-200MB usually and so not going to attach it here since it has some client specific methods in some calls in it.

      However screenshots of the findings from loading the jfr in both IntelliJ and Java Mission Control you can see some of the findings:

      The memory allocated from SolrIndexSearcher#getProcessedFilter is ~60% of total memory allocated during the 5 minute profile period.

      This shows that in 5 minutes ~1TB (yes thats TB=terabyte or 1000GB) of memory allocations for SolrIndexSearcher#getProcessedFilter

      ~680GB was allocated from BitDocSet#intersection

      ~315GB was allocated from BitDocSet#andNot

      Based on CPU profiling, it is amazing to me but G1 garbage collector is keeping up. Each of these objects are very short lived.

      This was during some load testing and able to give some query types in question:

      • ~30 queries/second
      • ~5 fq parameters per query
      • so ~9000 queries in 5 minutes with ~45000 fq clauses
      • 10GB heap for the Solr instance with 128GB ram on the node and index size completely fits in memory.
      • this is one shard on the node for testing and ~23 million documents in the shard - optimized so no deletes.
      • This was tested with Solr 8.8, but as far as I can tell the code has not changed for the main branch significantly (slight refactoring of class hierachy after SOLR-14256 but didn't affect intersection/andNot).

      Based on my rough calculations, that is ~24MB of heap per filter query clause (1.06TB/45000) or ~120MB of heap per query (assuming 5 fq per query).


      I loaded the same JFR up in Java Mission Control to see if there were other insights and found TLAB memory allocation details.

      Since most of these are large allocations, Java mission control is very helpful in saying that there are a large number of allocations outside of TLAB (~80%) and that you probably shouldn't do that.

      From what I understand about allocations outside of TLAB is that they should in theory be a small portion of allocations, but they aren't...

      Attachments

        1. Screenshot 2022-11-16 at 14.52.37.png
          283 kB
          Kevin Risden
        2. Screenshot 2022-11-16 at 14.53.23.png
          62 kB
          Kevin Risden
        3. Screenshot 2022-11-16 at 14.53.35.png
          46 kB
          Kevin Risden
        4. Screenshot 2022-11-17 at 13.03.21.png
          112 kB
          Kevin Risden
        5. Screenshot 2022-11-17 at 13.25.57.png
          554 kB
          Kevin Risden
        6. Screenshot 2022-11-17 at 13.28.06.png
          529 kB
          Kevin Risden

        Issue Links

          Activity

            People

              krisden Kevin Risden
              krisden Kevin Risden
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 8h
                  8h