[SOLR-10720] Aggressive removal of a collection breaks cluster state - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 6.5.1
Fix Version/s: 7.3, 8.0
Component/s: SolrCloud
Labels:
None

Description

We are periodically seeing a tricky concurrency bug in SolrCloud that starts with a `Could not fully remove collection: my_collection` exception:

2017-05-17T14:47:50,153 - ERROR [OverseerThreadFactory-6-thread-5:SolrException@159] - {} - Collection: my_collection operation: delete failed:org.apache.solr.common.SolrException: Could not fully remove collection: my_collection
        at org.apache.solr.cloud.DeleteCollectionCmd.call(DeleteCollectionCmd.java:106)
        at org.apache.solr.cloud.OverseerCollectionMessageHandler.processMessage(OverseerCollectionMessageHandler.java:224)
        at org.apache.solr.cloud.OverseerTaskProcessor$Runner.run(OverseerTaskProcessor.java:463)

After that, all SolrCloud operations that involve reading the cluster state fail with:

org.apache.solr.common.SolrException: Error loading config name for collection my_collection
    at org.apache.solr.common.cloud.ZkStateReader.readConfigName(ZkStateReader.java:198)
    at org.apache.solr.handler.admin.ClusterStatus.getClusterStatus(ClusterStatus.java:141)
...
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /collections/my_collection
...

As a result, SolrCloud becomes completely broken. We are seeing this with 6.5.1, but I think we have seen it with older versions too.

From looking at the code, it appears to be a combination of two factors:

Forcefully removing the collection's znode in the finally block of DeleteCollectionCmd, which was introduced in SOLR-5135. Note that this leaves the cached cluster state out of sync with the state in Zk: zkStateReader.getClusterState() still contains the collection (see the code here), whereas the /collections/<collection_id> znode in Zk has already been removed.
The operation that reads cluster state not only returns the cached version but also reads each collection's config name from its /collections/<collection_id> znode, which has been forcefully removed. The code that reads the config name for every collection directly from Zk was introduced in SOLR-7636. Aren't there performance implications of reading N znodes (one per collection) on every getClusterStatus call?
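
The interaction of the two factors can be illustrated with a toy model. This is only a sketch under stated assumptions: `StaleClusterStateDemo`, its fields, and its methods are hypothetical stand-ins and are not Solr or ZooKeeper classes; a plain map plays the role of Zk and a set plays the role of the cached cluster state.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Toy model only: none of these types are Solr or ZooKeeper classes.
final class StaleClusterStateDemo {

    /** Hypothetical stand-in for KeeperException.NoNodeException (unchecked to keep the sketch short). */
    static final class NoNodeException extends RuntimeException {
        NoNodeException(String path) {
            super("KeeperErrorCode = NoNode for " + path);
        }
    }

    /** Stand-in for Zk: znode path -> config name stored in that znode. */
    static final Map<String, String> znodes = new ConcurrentHashMap<>();

    /** Stand-in for the cached cluster state: collections the reader still knows about. */
    static final Set<String> cachedCollections = ConcurrentHashMap.newKeySet();

    /** Reads the config name straight from the simulated Zk, bypassing the cache. */
    static String readConfigName(String collection) {
        String path = "/collections/" + collection;
        String configName = znodes.get(path);
        if (configName == null) {
            throw new NoNodeException(path);
        }
        return configName;
    }

    /** Iterates the cached collections; one missing znode fails the whole call. */
    static Map<String, String> clusterStatus() {
        Map<String, String> result = new LinkedHashMap<>();
        for (String collection : cachedCollections) {
            result.put(collection, readConfigName(collection));
        }
        return result;
    }

    /** Simulates the failed delete: the znode is removed but the cache is not updated. */
    static void failedDelete(String collection) {
        znodes.remove("/collections/" + collection);
    }
}
```

In this model, once `failedDelete("my_collection")` has run while `cachedCollections` still contains `my_collection`, every later `clusterStatus()` call throws, which corresponds to the symptom described above.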

I'm not sure what the proper fix should be:

Should we just catch KeeperException$NoNodeException in getClusterStatus and treat such a collection as removed? That looks like the easiest and least invasive fix.
Should we stop reading the config name from the collection znode and get it from the cache somehow?
Should we not try to delete the collection's data from Zk if the delete operation failed?
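
The first option might look roughly like the sketch below. It is not the attached patch and does not use Solr's API: `TolerantClusterStatus`, its `NoNodeException`, and the `configNameReader` parameter are hypothetical names introduced for illustration, with the reader function standing in for whatever reads the config name from Zk.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Sketch of "catch NoNode and treat the collection as removed"; not Solr code.
final class TolerantClusterStatus {

    /** Hypothetical stand-in for KeeperException.NoNodeException (unchecked to keep the sketch short). */
    static final class NoNodeException extends RuntimeException {
        NoNodeException(String path) {
            super("KeeperErrorCode = NoNode for " + path);
        }
    }

    /**
     * Builds a collection -> config name map from a possibly stale list of collections.
     *
     * @param cachedCollections collection names taken from the cached cluster state
     * @param configNameReader  reads a config name, throwing NoNodeException if the znode is gone
     */
    static Map<String, String> clusterStatus(Iterable<String> cachedCollections,
                                             Function<String, String> configNameReader) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String collection : cachedCollections) {
            try {
                result.put(collection, configNameReader.apply(collection));
            } catch (NoNodeException e) {
                // The znode vanished under us: treat the collection as removed
                // and keep reporting the remaining collections.
            }
        }
        return result;
    }
}
```

With this shape, a collection whose znode has disappeared is simply omitted from the result instead of failing the whole status request. The cached state stays stale, so this only masks the inconsistency rather than repairing it.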

Attachments

SOLR-10720.patch
26/Feb/18 05:49
5 kB
Shalin Shekhar Mangar

Issue Links

is duplicated by

SOLR-9471 Another race condition in ClusterStatus.getClusterStatus

Resolved

is related to

SOLR-5135 Deleting a collection should be extra aggressive in the face of failures.

Closed

SOLR-7636 CLUSTERSTATUS Api should not go to OCP to fetch data

Closed

relates to

SOLR-12544 ZkStateReader can cache deleted collections and never refresh it

Resolved

People

Assignee: Shalin Shekhar Mangar

Reporter: Alexey Serba

Votes: 0

Watchers: 4

Dates

Created: 19/May/17 21:43

Updated: 02/Oct/19 17:24

Resolved: 26/Feb/18 05:54