SOLR-15093

Heavy lock contention during collection creation

Details

    • Type: Task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved

    Description

      I was doing some lock analysis and found quite a bit of contention on ZkStateReader$LazyCollectionRef.get(boolean) during heavy collection creation. I ran a sample workload that created as many collections as it could in 10 minutes, and this method was blocked for about 1:30 of that time, which is a significant portion of the run.

      A few representative stack traces:

      org.apache.solr.common.cloud.ZkStateReader$LazyCollectionRef.get(boolean)
      org.apache.solr.common.cloud.ClusterState.getCollectionOrNull(String, boolean)
      org.apache.solr.common.cloud.ClusterState.getCollectionOrNull(String)
      org.apache.solr.cloud.ZkController.checkIfCoreNodeNameAlreadyExists(CoreDescriptor)
      org.apache.solr.core.CoreContainer.create(String, Path, Map, boolean)
      

      And another:

      org.apache.solr.common.cloud.ZkStateReader$LazyCollectionRef.get(boolean)
      org.apache.solr.common.cloud.ClusterState.getCollectionOrNull(String, boolean)
      org.apache.solr.common.cloud.ClusterState.getCollectionOrNull(String)
      org.apache.solr.common.cloud.ZkStateReader.getCollection(String)
      org.apache.solr.cloud.ZkController.publish(CoreDescriptor, Replica$State, boolean, boolean)
      org.apache.solr.cloud.ZkController.preRegister(CoreDescriptor, boolean)
      org.apache.solr.core.CoreContainer.createFromDescriptor(CoreDescriptor, boolean, boolean)
      org.apache.solr.core.CoreContainer.create(String, Path, Map, boolean)
      

      And one more:

      org.apache.solr.common.cloud.ZkStateReader$LazyCollectionRef.get(boolean)
      org.apache.solr.common.cloud.ClusterState.getCollectionOrNull(String, boolean)
      org.apache.solr.common.cloud.ClusterState.getCollectionOrNull(String)
      org.apache.solr.common.cloud.ZkStateReader.registerDocCollectionWatcher(String, DocCollectionWatcher)
      org.apache.solr.common.cloud.ZkStateReader.waitForState(String, long, TimeUnit, Predicate)
      org.apache.solr.cloud.ZkController.checkStateInZk(CoreDescriptor)
      org.apache.solr.cloud.ZkController.preRegister(CoreDescriptor, boolean)
      org.apache.solr.core.CoreContainer.createFromDescriptor(CoreDescriptor, boolean, boolean)
      org.apache.solr.core.CoreContainer.create(String, Path, Map, boolean)
      

      It looks like part of the problem is that these call paths never allow the cached value to be used, so each call ends up being a full fetch out to ZK. We do have optimizations there that compare the stat and the version, but the call still appears to be relatively heavyweight.
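
      To illustrate the pattern I mean, here is a minimal sketch. It is not the actual Solr code: LazyRefSketch, LazyRefDemo, fetchesAfter, and the Supplier standing in for the ZK round trip are all hypothetical names. It only shows how a synchronized get(boolean) that is always called with allowCached=false pays for a fetch on every call while holding the lock.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical demo driver; counts how many "remote" fetches a run of calls triggers.
class LazyRefDemo {
  static int fetchesAfter(int calls, boolean allowCached) {
    // The supplier stands in for a round trip to ZK (an assumption for illustration).
    LazyRefSketch<String> ref =
        new LazyRefSketch<>(() -> "collection-state", TimeUnit.HOURS.toNanos(1));
    for (int i = 0; i < calls; i++) {
      ref.get(allowCached);
    }
    return ref.fetchCount();
  }

  public static void main(String[] args) {
    System.out.println("allowCached=false: " + fetchesAfter(100, false) + " fetches");
    System.out.println("allowCached=true:  " + fetchesAfter(100, true) + " fetches");
  }
}

// Hypothetical stand-in for a lazily loaded reference; not Solr's LazyCollectionRef.
final class LazyRefSketch<T> {
  private final Supplier<T> remoteFetch;
  private final long cacheTtlNanos;
  private final AtomicInteger fetches = new AtomicInteger();
  private T cached;
  private boolean loaded;
  private long lastFetchNanos;

  LazyRefSketch(Supplier<T> remoteFetch, long cacheTtlNanos) {
    this.remoteFetch = remoteFetch;
    this.cacheTtlNanos = cacheTtlNanos;
  }

  // A single monitor guards the whole call, so concurrent callers queue
  // behind whichever thread is currently doing the fetch.
  synchronized T get(boolean allowCached) {
    boolean fresh = loaded && System.nanoTime() - lastFetchNanos < cacheTtlNanos;
    if (!allowCached || !fresh) {
      cached = remoteFetch.get();
      fetches.incrementAndGet();
      lastFetchNanos = System.nanoTime();
      loaded = true;
    }
    return cached;
  }

  int fetchCount() {
    return fetches.get();
  }
}
```

      In this sketch, 100 calls with allowCached=false trigger 100 fetches, while 100 calls with allowCached=true (and a long TTL) trigger only one. Whether any of the call paths in the stack traces above can safely tolerate a cached value is a separate question that the sketch does not answer.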

      cc: noble.paul, you might find this interesting.

          People

            Assignee: Unassigned
            Reporter: Mike Drob (mdrob)

              Time Tracking

                Estimated: Not Specified
                Remaining: 0h
                Logged: 50m