[SOLR-16412] Race condition could trigger error on concurrent SizeLimitedDistributedMap cleanup - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 8.8, 9.1, main (10.0)
Fix Version/s: 9.1, main (10.0)
Component/s: SolrCloud
Labels: None

Description

The exception below is observed while updating the `completedMap` field in `OverseerTaskProcessor`:

o.a.s.c.OverseerTaskProcessor :org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /overseer/collection-map-completed/mn-736f6c726d616e2d312d31383930383730393837313333303932353331
at org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:2001)
at org.apache.solr.common.cloud.SolrZkClient.lambda$delete$1(SolrZkClient.java:264)
at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:71)
at org.apache.solr.common.cloud.SolrZkClient.delete(SolrZkClient.java:263)
at org.apache.solr.cloud.SizeLimitedDistributedMap.put(SizeLimitedDistributedMap.java:76)
at org.apache.solr.cloud.OverseerTaskProcessor$Runner.run(OverseerTaskProcessor.java:538)
at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:218)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)

Cause

Based on the stack trace, `SizeLimitedDistributedMap` had reached its size limit and attempted to clean up entries:
https://github.com/fullstorydev/lucene-solr/blob/75e89929eb360b513ee864aeb23a80c049747246/solr/core/src/java/org/apache/solr/cloud/SizeLimitedDistributedMap.java#L73-L80

However, when it performed the actual deletion, it failed with `NoNodeException`.

This is likely caused by a race condition: multiple threads can enter the same code block and try to delete the same list of children, so a slower thread can attempt to delete a child node that a faster thread has already removed.

This condition can be reproduced by a unit test case, which will be included in the PR.
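The race described above can be illustrated without ZooKeeper. The following is a minimal sketch, not the unit test from the PR: `FakeZk`, `NoNodeException`, and `runRace` are hypothetical stand-ins written for this example, and the sketch purges every child rather than only the oldest entries. A barrier forces both threads to list the children before either one starts deleting.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicInteger;

public class PurgeRaceDemo {
    /** Hypothetical stand-in for KeeperException.NoNodeException. */
    static class NoNodeException extends Exception {
        NoNodeException(String path) { super("NoNode for " + path); }
    }

    /** Hypothetical in-memory stand-in for the ZooKeeper calls used by the purge. */
    static class FakeZk {
        private final Set<String> nodes = ConcurrentHashMap.newKeySet();
        void create(String path) { nodes.add(path); }
        List<String> getChildren() { return new ArrayList<>(nodes); }
        void delete(String path) throws NoNodeException {
            // Like ZooKeeper, deleting a node that is already gone is an error.
            if (!nodes.remove(path)) throw new NoNodeException(path);
        }
    }

    /** Runs two concurrent purges over the same children; returns the number of NoNode failures. */
    static int runRace(int nodeCount) {
        FakeZk zk = new FakeZk();
        for (int i = 0; i < nodeCount; i++) zk.create("/completed/mn-" + i);

        CyclicBarrier bothListed = new CyclicBarrier(2);
        AtomicInteger failures = new AtomicInteger();
        Runnable purge = () -> {
            try {
                List<String> snapshot = zk.getChildren(); // both threads see the same children
                bothListed.await();                       // neither deletes until both have listed
                for (String path : snapshot) {
                    try {
                        zk.delete(path);
                    } catch (NoNodeException e) {
                        failures.incrementAndGet();       // the other thread got there first
                    }
                }
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        };

        Thread a = new Thread(purge);
        Thread b = new Thread(purge);
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return failures.get();
    }

    public static void main(String[] args) {
        System.out.println("NoNode failures: " + runRace(10));
    }
}
```

Because each node is removed successfully by exactly one of the two threads, the other thread's attempt on that node fails, so the failure count equals the number of nodes.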

Solution

Although we could enforce synchronization to prevent threads from purging the same set of child nodes, it might not be desirable to add extra blocking.

Instead, it is probably safe to ignore `KeeperException.NoNodeException` during the purge operation: if the node is no longer there, the deletion has effectively already happened.
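A minimal sketch of this approach follows, assuming an in-memory stand-in for ZooKeeper. `FakeZk`, `NoNodeException`, and `purge` are hypothetical names for illustration; this is not the code merged in the PR. The purge treats an already-missing node as a successful deletion instead of propagating the exception.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class TolerantPurgeDemo {
    /** Hypothetical stand-in for KeeperException.NoNodeException. */
    static class NoNodeException extends Exception {
        NoNodeException(String path) { super("NoNode for " + path); }
    }

    /** Hypothetical in-memory stand-in for the ZooKeeper calls used by the purge. */
    static class FakeZk {
        private final Set<String> nodes = ConcurrentHashMap.newKeySet();
        void create(String path) { nodes.add(path); }
        List<String> getChildren() { return new ArrayList<>(nodes); }
        int size() { return nodes.size(); }
        void delete(String path) throws NoNodeException {
            if (!nodes.remove(path)) throw new NoNodeException(path);
        }
    }

    /**
     * Deletes every path in the snapshot and returns how many this caller removed.
     * A missing node is ignored: the goal of the purge (node gone) is already met.
     */
    static int purge(FakeZk zk, List<String> snapshot) {
        int deleted = 0;
        for (String path : snapshot) {
            try {
                zk.delete(path);
                deleted++;
            } catch (NoNodeException e) {
                // Another thread already removed this node; nothing left to do.
            }
        }
        return deleted;
    }

    public static void main(String[] args) {
        FakeZk zk = new FakeZk();
        for (int i = 0; i < 5; i++) zk.create("/completed/mn-" + i);

        List<String> snapshot = zk.getChildren();
        int first = purge(zk, snapshot);   // the faster thread removes everything
        int second = purge(zk, snapshot);  // the slower thread works from a stale list
        System.out.println(first + " " + second + " " + zk.size()); // prints "5 0 0"
    }
}
```

The second call stands in for the slower thread: it holds a stale list of children, finds nothing left to delete, and returns without an error. No extra locking is introduced, which matches the trade-off stated above.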

Issue Links

duplicates

SOLR-16175 race condition SizeLimitedDistributedMap

Resolved

is duplicated by

SOLR-16454 Fixed race condition that trigger error on SizeLimitedDistributedMap …

Resolved

links to

GitHub Pull Request #1032

People

Assignee: Ishan Chattopadhyaya

Reporter: Patson Luk

Votes: 0

Watchers: 5

Dates

Created: 16/Sep/22 04:17

Updated: 15/May/23 22:00

Resolved: 26/Oct/22 01:59

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 1h 50m