Solr
  1. Solr
  2. SOLR-6521

CloudSolrClient should synchronize cache cluster state loading

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Critical Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.0, 6.0
    • Component/s: SolrCloud
    • Labels:

      Description

      Under heavy load-testing with the new solrj client that caches the cluster state instead of setting a watcher, I started seeing lots of zk connection loss on the client-side when refreshing the CloudSolrServer collectionStateCache, and this was causing crazy client-side 99.9% latency (~15 sec). I swapped the cache out with guava's LoadingCache (which does locking to ensure only one thread loads the content under one key while the other threads that want the same key wait) and the connection loss went away and the 99.9% latency also went down to just about 1 sec.

      1. SOLR-6521.patch
        5 kB
        Noble Paul
      2. SOLR-6521.patch
        4 kB
        Noble Paul

        Issue Links

          Activity

          Hide
          Noble Paul added a comment - - edited

          AFAIK SolrJ does not have a dependency on guava. I shall do a single threaded loading of state and it should be fine

          Show
          Noble Paul added a comment - - edited AFAIK SolrJ does not have a dependency on guava. I shall do a single threaded loading of state and it should be fine
          Hide
          Jessica Cheng Mallet added a comment -

          I worry that this will not be performant enough since that makes lookups of different keys serial, and in a system with lots of active collections, lookups will be unnecessarily blocked.

          In the same vein, I feel that the codebase can benefit from a LoadingCache-style cache that only synchronizes loading of the same key. Incidentally we also found that the FilterCache population suffers from the same problem. We've found that in our load test, when a filter query is heavily used, it actually adversely impacts performance to cache the fq because whenever the filter cache is invalidated, all the request threads try to build the cache entry for that exact same fq and locks the system up. We would get 30 second 99% request time when we cache the fq, and when we added

          {!cache=false}

          to the fq, 99% went back down to the 100ms range.

          I understand that adding guava to solrj would probably require a lot of discussions and voting, etc., but I think the benefit of a LoadingCache-style cache is high enough that even if we can't include guava, it might make sense for solr/apache to implement their own.

          Show
          Jessica Cheng Mallet added a comment - I worry that this will not be performant enough since that makes lookups of different keys serial, and in a system with lots of active collections, lookups will be unnecessarily blocked. In the same vein, I feel that the codebase can benefit from a LoadingCache-style cache that only synchronizes loading of the same key. Incidentally we also found that the FilterCache population suffers from the same problem. We've found that in our load test, when a filter query is heavily used, it actually adversely impacts performance to cache the fq because whenever the filter cache is invalidated, all the request threads try to build the cache entry for that exact same fq and locks the system up. We would get 30 second 99% request time when we cache the fq, and when we added {!cache=false} to the fq, 99% went back down to the 100ms range. I understand that adding guava to solrj would probably require a lot of discussions and voting, etc., but I think the benefit of a LoadingCache-style cache is high enough that even if we can't include guava, it might make sense for solr/apache to implement their own.
          Hide
          Xu Zhang added a comment -

          Thanks Jessica, I think we met a similar question. Could I get more information about your tests? I would like to reproduce it.
          Just curious, you said you using guava's cache for collectionStateCache ? Is it right?

          Show
          Xu Zhang added a comment - Thanks Jessica, I think we met a similar question. Could I get more information about your tests? I would like to reproduce it. Just curious, you said you using guava's cache for collectionStateCache ? Is it right?
          Hide
          Jessica Cheng Mallet added a comment -

          Xu, I think it's unlikely you've hit the same issue unless you are using the new split clusterstate feature in https://issues.apache.org/jira/browse/SOLR-5473.

          Show
          Jessica Cheng Mallet added a comment - Xu, I think it's unlikely you've hit the same issue unless you are using the new split clusterstate feature in https://issues.apache.org/jira/browse/SOLR-5473 .
          Hide
          Anshum Gupta added a comment -

          Noble Paul do you intend to get this in for 5.0?

          Show
          Anshum Gupta added a comment - Noble Paul do you intend to get this in for 5.0?
          Hide
          Noble Paul added a comment -

          It' unlikely for 5.0. I would take it up for the next release

          Show
          Noble Paul added a comment - It' unlikely for 5.0. I would take it up for the next release
          Hide
          Jessica Cheng Mallet added a comment -

          The patch is locking the entire cache for all loading, which might not be an ideal solution for a cluster with many, many collections. Guava's implementation of LocalCache would only lock and wait on "Segment"s, which increases the concurrency level (which is tunable).

          Show
          Jessica Cheng Mallet added a comment - The patch is locking the entire cache for all loading, which might not be an ideal solution for a cluster with many, many collections. Guava's implementation of LocalCache would only lock and wait on "Segment"s, which increases the concurrency level (which is tunable).
          Hide
          Noble Paul added a comment -

          The patch is locking the entire cache for all loading, which might not be an ideal solution for a cluster with many

          I understand that. In reality , different collections expire at different time .so everyone waiting on the lock would be a rare thing. The common use case is one collection expired and every thread is trying to refresh that simultaneously.

          I agree that the concurrency can be dramatically improved . Using Guava may not be an option because it is not yet a dependency on SolrJ. The other option would be to make the cache pluggable through an API . So ,if you have Guava or something else in your package you can plug it in through an API

          Show
          Noble Paul added a comment - The patch is locking the entire cache for all loading, which might not be an ideal solution for a cluster with many I understand that. In reality , different collections expire at different time .so everyone waiting on the lock would be a rare thing. The common use case is one collection expired and every thread is trying to refresh that simultaneously. I agree that the concurrency can be dramatically improved . Using Guava may not be an option because it is not yet a dependency on SolrJ. The other option would be to make the cache pluggable through an API . So ,if you have Guava or something else in your package you can plug it in through an API
          Hide
          Jessica Cheng Mallet added a comment -

          I agree that the concurrency can be dramatically improved . Using Guava may not be an option because it is not yet a dependency on SolrJ. The other option would be to make the cache pluggable through an API . So ,if you have Guava or something else in your package you can plug it in through an API

          That'd be awesome!

          Show
          Jessica Cheng Mallet added a comment - I agree that the concurrency can be dramatically improved . Using Guava may not be an option because it is not yet a dependency on SolrJ. The other option would be to make the cache pluggable through an API . So ,if you have Guava or something else in your package you can plug it in through an API That'd be awesome!
          Hide
          Noble Paul added a comment -

          hi Jessica Cheng Mallet

          This patch increases the parallelism and makes it tunable.

          Show
          Noble Paul added a comment - hi Jessica Cheng Mallet This patch increases the parallelism and makes it tunable.
          Hide
          Anshum Gupta added a comment -

          We should use CloudSolrClient.setParallelCacheRefreshes() while initializing the default locks (size=3) instead of Arrays.asList(..) directly.
          Also, can you add a bit of javadoc mentioning that the user can use custom cache implementation by overriding the getDocCollection() method? I think that would be helpful for users looking forward to do that.

          The rest looks good to me.

          Show
          Anshum Gupta added a comment - We should use CloudSolrClient.setParallelCacheRefreshes() while initializing the default locks (size=3) instead of Arrays.asList(..) directly. Also, can you add a bit of javadoc mentioning that the user can use custom cache implementation by overriding the getDocCollection() method? I think that would be helpful for users looking forward to do that. The rest looks good to me.
          Hide
          ASF subversion and git services added a comment -

          Commit 1653801 from Noble Paul in branch 'dev/branches/lucene_solr_5_0'
          [ https://svn.apache.org/r1653801 ]

          SOLR-6521 CloudSolrServer should synchronize cache cluster state loading

          Show
          ASF subversion and git services added a comment - Commit 1653801 from Noble Paul in branch 'dev/branches/lucene_solr_5_0' [ https://svn.apache.org/r1653801 ] SOLR-6521 CloudSolrServer should synchronize cache cluster state loading
          Hide
          ASF subversion and git services added a comment -

          Commit 1653805 from Noble Paul in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1653805 ]

          SOLR-6521 CloudSolrServer should synchronize cache cluster state loading

          Show
          ASF subversion and git services added a comment - Commit 1653805 from Noble Paul in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1653805 ] SOLR-6521 CloudSolrServer should synchronize cache cluster state loading
          Hide
          ASF subversion and git services added a comment -

          Commit 1653822 from Noble Paul in branch 'dev/trunk'
          [ https://svn.apache.org/r1653822 ]

          SOLR-6521 CloudSolrServer should synchronize cache cluster state loading

          Show
          ASF subversion and git services added a comment - Commit 1653822 from Noble Paul in branch 'dev/trunk' [ https://svn.apache.org/r1653822 ] SOLR-6521 CloudSolrServer should synchronize cache cluster state loading
          Hide
          Shalin Shekhar Mangar added a comment -

          There's a test failure on jenkins due to this fix. I guess it's because of:

          locks.get(collection.hashCode() % locks.size())
          

          http://jenkins.thetaphi.de/job/Lucene-Solr-5.0-Linux/29/

          1 tests failed.
          FAILED:  org.apache.solr.cloud.HttpPartitionTest.testDistribSearch
          
          Error Message:
          -2
          
          Stack Trace:
          java.lang.ArrayIndexOutOfBoundsException: -2
                  at __randomizedtesting.SeedInfo.seed([117EBA4326468C3C:9098345B5119EC00]:0)
                  at java.util.ArrayList.elementData(ArrayList.java:418)
                  at java.util.ArrayList.get(ArrayList.java:431)
                  at org.apache.solr.client.solrj.impl.CloudSolrClient.getDocCollection(CloudSolrClient.java:1099)
                  at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:762)
                  at org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:737)
                  at org.apache.solr.cloud.HttpPartitionTest.sendDoc(HttpPartitionTest.java:481)
                  at org.apache.solr.cloud.HttpPartitionTest.testRf3(HttpPartitionTest.java:271)
                  at org.apache.solr.cloud.HttpPartitionTest.doTest(HttpPartitionTest.java:120)
                  at org.apache.solr.BaseDistributedSearchTestCase.testDistribSearch(BaseDistributedSearchTestCase.java:878)
                  at sun.reflect.GeneratedMethodAccessor40.invoke(Unknown Source)
          
          Show
          Shalin Shekhar Mangar added a comment - There's a test failure on jenkins due to this fix. I guess it's because of: locks.get(collection.hashCode() % locks.size()) http://jenkins.thetaphi.de/job/Lucene-Solr-5.0-Linux/29/ 1 tests failed. FAILED: org.apache.solr.cloud.HttpPartitionTest.testDistribSearch Error Message: -2 Stack Trace: java.lang.ArrayIndexOutOfBoundsException: -2 at __randomizedtesting.SeedInfo.seed([117EBA4326468C3C:9098345B5119EC00]:0) at java.util.ArrayList.elementData(ArrayList.java:418) at java.util.ArrayList.get(ArrayList.java:431) at org.apache.solr.client.solrj.impl.CloudSolrClient.getDocCollection(CloudSolrClient.java:1099) at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:762) at org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:737) at org.apache.solr.cloud.HttpPartitionTest.sendDoc(HttpPartitionTest.java:481) at org.apache.solr.cloud.HttpPartitionTest.testRf3(HttpPartitionTest.java:271) at org.apache.solr.cloud.HttpPartitionTest.doTest(HttpPartitionTest.java:120) at org.apache.solr.BaseDistributedSearchTestCase.testDistribSearch(BaseDistributedSearchTestCase.java:878) at sun.reflect.GeneratedMethodAccessor40.invoke(Unknown Source)
          Hide
          ASF subversion and git services added a comment -

          Commit 1653862 from Noble Paul in branch 'dev/trunk'
          [ https://svn.apache.org/r1653862 ]

          SOLR-6521 use murmurhash because hashCode can give -ve val

          Show
          ASF subversion and git services added a comment - Commit 1653862 from Noble Paul in branch 'dev/trunk' [ https://svn.apache.org/r1653862 ] SOLR-6521 use murmurhash because hashCode can give -ve val
          Hide
          ASF subversion and git services added a comment -

          Commit 1653864 from Noble Paul in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1653864 ]

          SOLR-6521 use murmurhash because hashCode can give -ve val

          Show
          ASF subversion and git services added a comment - Commit 1653864 from Noble Paul in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1653864 ] SOLR-6521 use murmurhash because hashCode can give -ve val
          Hide
          ASF subversion and git services added a comment -

          Commit 1653866 from Noble Paul in branch 'dev/branches/lucene_solr_5_0'
          [ https://svn.apache.org/r1653866 ]

          SOLR-6521 use murmurhash because hashCode can give -ve val

          Show
          ASF subversion and git services added a comment - Commit 1653866 from Noble Paul in branch 'dev/branches/lucene_solr_5_0' [ https://svn.apache.org/r1653866 ] SOLR-6521 use murmurhash because hashCode can give -ve val
          Hide
          ASF subversion and git services added a comment -

          Commit 1653921 from Noble Paul in branch 'dev/trunk'
          [ https://svn.apache.org/r1653921 ]

          SOLR-6521 use abs() to handle -ve hash

          Show
          ASF subversion and git services added a comment - Commit 1653921 from Noble Paul in branch 'dev/trunk' [ https://svn.apache.org/r1653921 ] SOLR-6521 use abs() to handle -ve hash
          Hide
          ASF subversion and git services added a comment -

          Commit 1653923 from Noble Paul in branch 'dev/branches/lucene_solr_5_0'
          [ https://svn.apache.org/r1653923 ]

          SOLR-6521 use abs() to handle -ve hash

          Show
          ASF subversion and git services added a comment - Commit 1653923 from Noble Paul in branch 'dev/branches/lucene_solr_5_0' [ https://svn.apache.org/r1653923 ] SOLR-6521 use abs() to handle -ve hash
          Hide
          ASF subversion and git services added a comment -

          Commit 1653924 from Noble Paul in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1653924 ]

          SOLR-6521 use abs() to handle -ve hash

          Show
          ASF subversion and git services added a comment - Commit 1653924 from Noble Paul in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1653924 ] SOLR-6521 use abs() to handle -ve hash
          Hide
          Anshum Gupta added a comment -

          Bulk close after 5.0 release.

          Show
          Anshum Gupta added a comment - Bulk close after 5.0 release.

            People

            • Assignee:
              Noble Paul
              Reporter:
              Jessica Cheng Mallet
            • Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development