Solr / SOLR-10720

Aggressive removal of a collection breaks cluster state

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 6.5.1
    • Fix Version/s: 7.3, 8.0
    • Component/s: SolrCloud
    • Labels:
      None

      Description

      We periodically see a tricky concurrency bug in SolrCloud that starts with a `Could not fully remove collection: my_collection` exception:

      2017-05-17T14:47:50,153 - ERROR [OverseerThreadFactory-6-thread-5:SolrException@159] - {} - Collection: my_collection operation: delete failed:org.apache.solr.common.SolrException: Could not fully remove collection: my_collection
              at org.apache.solr.cloud.DeleteCollectionCmd.call(DeleteCollectionCmd.java:106)
              at org.apache.solr.cloud.OverseerCollectionMessageHandler.processMessage(OverseerCollectionMessageHandler.java:224)
              at org.apache.solr.cloud.OverseerTaskProcessor$Runner.run(OverseerTaskProcessor.java:463)
      

      After that, every SolrCloud operation that reads the cluster state fails with:

      org.apache.solr.common.SolrException: Error loading config name for collection my_collection
          at org.apache.solr.common.cloud.ZkStateReader.readConfigName(ZkStateReader.java:198)
          at org.apache.solr.handler.admin.ClusterStatus.getClusterStatus(ClusterStatus.java:141)
      ...
      Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /collections/my_collection
      ...
      

      As a result, SolrCloud becomes completely broken. We are seeing this with 6.5.1, but I think we have seen it with older versions too.

      Looking at the code, this appears to be a combination of two factors:

      • DeleteCollectionCmd forcefully removes the collection's znode in its finally block (introduced in SOLR-5135). This leaves the cached cluster state out of sync with the state in Zk: zkStateReader.getClusterState() still contains the collection, whereas the /collections/<collection_id> znode in Zk has already been removed.
      • Reading the cluster state does not only return the cached version; it also reads each collection's config name from its /collections/<collection_id> znode, which has been forcefully removed. The code that reads the config name for every collection directly from Zk was introduced in SOLR-7636. Aren't there performance implications of reading N znodes (one per collection) on every getClusterStatus call?
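
The interaction of these two factors can be illustrated with a small in-memory model. This is only a sketch: `StaleStateDemo`, its fields, and its `NoNodeException` are hypothetical stand-ins (not Solr or ZooKeeper classes), and the exception is unchecked here for brevity, whereas the real KeeperException is checked.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical model of the failure; none of these types come from Solr or ZooKeeper.
class StaleStateDemo {
    /** Stand-in for KeeperException$NoNodeException (unchecked here for brevity). */
    static class NoNodeException extends RuntimeException {
        NoNodeException(String path) {
            super("KeeperErrorCode = NoNode for " + path);
        }
    }

    // Fake Zk: znode path -> config name stored at that znode.
    final Map<String, String> znodes = new TreeMap<>();
    // Fake cached cluster state: collections the reader still believes exist.
    final Set<String> cachedCollections = new TreeSet<>();

    void createCollection(String name, String configName) {
        znodes.put("/collections/" + name, configName);
        cachedCollections.add(name);
    }

    // Factor 1: the znode is removed forcefully, but the cache is not updated.
    void forceRemoveZnode(String name) {
        znodes.remove("/collections/" + name);
    }

    String readConfigName(String name) {
        String path = "/collections/" + name;
        String configName = znodes.get(path);
        if (configName == null) {
            throw new NoNodeException(path);
        }
        return configName;
    }

    // Factor 2: iterate the cached collections, but read each config name from Zk.
    Map<String, String> getClusterStatus() {
        Map<String, String> status = new TreeMap<>();
        for (String name : cachedCollections) {
            status.put(name, readConfigName(name));
        }
        return status;
    }

    public static void main(String[] args) {
        StaleStateDemo cluster = new StaleStateDemo();
        cluster.createCollection("my_collection", "conf1");
        cluster.createCollection("other_collection", "conf2");
        cluster.forceRemoveZnode("my_collection");
        try {
            cluster.getClusterStatus();
        } catch (NoNodeException e) {
            // One stale cache entry fails the status call for the whole cluster.
            System.out.println("cluster status failed: " + e.getMessage());
        }
    }
}
```

In this model a single stale entry in the cache is enough to make the status call fail for every collection, which matches the symptom described above.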

      I'm not sure what the proper fix should be:

      • Should we just catch KeeperException$NoNodeException in getClusterStatus and treat such a collection as removed? That looks like the easiest and least invasive fix.
      • Should we stop reading the config name from the collection znode and get it from the cache somehow?
      • Should we avoid deleting the collection's data from Zk if the delete operation failed?
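
The first option could look roughly like the following. This is a minimal sketch on an in-memory model, not the attached patch: `TolerantStatusDemo`, its fields, and its `NoNodeException` are hypothetical stand-ins for the Solr and ZooKeeper types, and the exception is unchecked here for brevity.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical model of the "catch NoNode and skip" option; not Solr code.
class TolerantStatusDemo {
    /** Stand-in for KeeperException$NoNodeException (unchecked here for brevity). */
    static class NoNodeException extends RuntimeException {
        NoNodeException(String path) {
            super("KeeperErrorCode = NoNode for " + path);
        }
    }

    // Fake Zk: znode path -> config name stored at that znode.
    final Map<String, String> znodes = new TreeMap<>();
    // Fake cached cluster state, which may lag behind Zk.
    final Set<String> cachedCollections = new TreeSet<>();

    void createCollection(String name, String configName) {
        znodes.put("/collections/" + name, configName);
        cachedCollections.add(name);
    }

    // Removes the znode only; the cache keeps the stale entry.
    void forceRemoveZnode(String name) {
        znodes.remove("/collections/" + name);
    }

    String readConfigName(String name) {
        String path = "/collections/" + name;
        String configName = znodes.get(path);
        if (configName == null) {
            throw new NoNodeException(path);
        }
        return configName;
    }

    // A missing znode means the collection was removed, so it is skipped
    // instead of failing the status call for every other collection.
    Map<String, String> getClusterStatus() {
        Map<String, String> status = new TreeMap<>();
        for (String name : cachedCollections) {
            try {
                status.put(name, readConfigName(name));
            } catch (NoNodeException e) {
                // Stale cache entry: treat the collection as already deleted.
            }
        }
        return status;
    }

    public static void main(String[] args) {
        TolerantStatusDemo cluster = new TolerantStatusDemo();
        cluster.createCollection("my_collection", "conf1");
        cluster.createCollection("other_collection", "conf2");
        cluster.forceRemoveZnode("my_collection");
        System.out.println(cluster.getClusterStatus()); // prints {other_collection=conf2}
    }
}
```

This keeps the status call usable while the cache catches up, but it does not address the cost of reading one znode per collection on every call, which the second option targets.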

        Attachments

        1. SOLR-10720.patch
          5 kB
          Shalin Shekhar Mangar

              People

              • Assignee: Shalin Shekhar Mangar (shalinmangar)
              • Reporter: Alexey Serba (alexey)