Solr / SOLR-16412

Race condition could trigger error on concurrent SizeLimitedDistributedMap cleanup

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 8.8, 9.1, main (10.0)
    • Fix Version/s: 9.1, main (10.0)
    • Component/s: SolrCloud
    • Labels: None

    Description

      The exception below is observed while updating the `completedMap` field in `OverseerTaskProcessor`:

      o.a.s.c.OverseerTaskProcessor :org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /overseer/collection-map-completed/mn-736f6c726d616e2d312d31383930383730393837313333303932353331
      at org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
      at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
      at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:2001)
      at org.apache.solr.common.cloud.SolrZkClient.lambda$delete$1(SolrZkClient.java:264)
      at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:71)
      at org.apache.solr.common.cloud.SolrZkClient.delete(SolrZkClient.java:263)
      at org.apache.solr.cloud.SizeLimitedDistributedMap.put(SizeLimitedDistributedMap.java:76)
      at org.apache.solr.cloud.OverseerTaskProcessor$Runner.run(OverseerTaskProcessor.java:538)
      at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:218)
      at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
      at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)

      Cause

      Based on the stack trace, `SizeLimitedDistributedMap` had reached its size limit and attempted to clean up entries:
      https://github.com/fullstorydev/lucene-solr/blob/75e89929eb360b513ee864aeb23a80c049747246/solr/core/src/java/org/apache/solr/cloud/SizeLimitedDistributedMap.java#L73-L80

      However, the actual deletion failed with `NoNodeException`.

      This is likely caused by a race condition: multiple threads can enter the same code block and try to delete the same list of children, so a slower thread may attempt to delete a child node that a faster thread has already removed.

      This condition can be reproduced with a unit test, which will be included in the PR.
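
The interleaving described above can be illustrated without a ZooKeeper server. The following is a minimal sketch, not the unit test from the PR: `FakeZkDir` and the nested `NoNodeException` are hypothetical in-memory stand-ins for the ZooKeeper directory and for `KeeperException.NoNodeException`, and the `CyclicBarrier` is only there to force both threads to list the children before either starts deleting.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class PurgeRaceDemo {
  /** Hypothetical stand-in for KeeperException.NoNodeException. */
  static class NoNodeException extends Exception {
    NoNodeException(String path) {
      super("NoNode for " + path);
    }
  }

  /** Hypothetical in-memory stand-in for the ZooKeeper directory backing the map. */
  static class FakeZkDir {
    private final ConcurrentHashMap<String, byte[]> nodes = new ConcurrentHashMap<>();

    void create(String name) {
      nodes.put(name, new byte[0]);
    }

    List<String> getChildren() {
      return new ArrayList<>(nodes.keySet());
    }

    /** Like a ZooKeeper delete: fails if the node is already gone. */
    void delete(String name) throws NoNodeException {
      if (nodes.remove(name) == null) {
        throw new NoNodeException(name);
      }
    }
  }

  /** Runs two purging threads over the same children; returns {deleted, noNodeErrors}. */
  static int[] runRace(int childCount) throws Exception {
    FakeZkDir dir = new FakeZkDir();
    for (int i = 0; i < childCount; i++) {
      dir.create("mn-" + i);
    }
    AtomicInteger deleted = new AtomicInteger();
    AtomicInteger noNode = new AtomicInteger();
    CyclicBarrier bothListed = new CyclicBarrier(2);

    Runnable purge = () -> {
      try {
        List<String> children = dir.getChildren(); // both threads see the same list
        bothListed.await();                        // wait until the other thread has listed too
        for (String child : children) {
          try {
            dir.delete(child);
            deleted.incrementAndGet();
          } catch (NoNodeException e) {
            noNode.incrementAndGet();              // the other thread got there first
          }
        }
      } catch (Exception e) {
        throw new RuntimeException(e);
      }
    };

    ExecutorService pool = Executors.newFixedThreadPool(2);
    pool.submit(purge);
    pool.submit(purge);
    pool.shutdown();
    if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
      throw new IllegalStateException("purge threads did not finish");
    }
    return new int[] {deleted.get(), noNode.get()};
  }

  public static void main(String[] args) throws Exception {
    int[] result = runRace(100);
    System.out.println("deleted=" + result[0] + ", NoNode errors=" + result[1]);
  }
}
```

Because both threads work from the same snapshot, every child is deleted successfully by exactly one thread and rejected for the other. In this sketch the failure is only counted, whereas in the stack trace above it propagates out of `put`.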

      Solution

      Although we could enforce synchronization to prevent threads from purging the same set of child nodes, it might not be desirable to add extra blocking.

      Instead, it's probably safe to ignore `KeeperException.NoNodeException` during the purge operation: if the node is no longer there, the deletion has already achieved its goal.
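
A rough sketch of that approach follows. It is not the actual patch: `FakeZkDir`, the nested `NoNodeException`, and the `purge` helper are hypothetical stand-ins chosen so the example runs without ZooKeeper. The point is only that a missing node is treated as already purged rather than as a failure.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

public class TolerantPurgeDemo {
  /** Hypothetical stand-in for KeeperException.NoNodeException. */
  static class NoNodeException extends Exception {
    NoNodeException(String path) {
      super("NoNode for " + path);
    }
  }

  /** Hypothetical in-memory stand-in for the ZooKeeper directory backing the map. */
  static class FakeZkDir {
    private final ConcurrentHashMap<String, byte[]> nodes = new ConcurrentHashMap<>();

    void create(String name) {
      nodes.put(name, new byte[0]);
    }

    List<String> getChildren() {
      return new ArrayList<>(nodes.keySet());
    }

    void delete(String name) throws NoNodeException {
      if (nodes.remove(name) == null) {
        throw new NoNodeException(name);
      }
    }

    int size() {
      return nodes.size();
    }
  }

  /** Deletes every listed child; returns how many this caller actually removed. */
  static int purge(FakeZkDir dir, List<String> children) {
    int removed = 0;
    for (String child : children) {
      try {
        dir.delete(child);
        removed++;
      } catch (NoNodeException e) {
        // Another thread already removed this node, so there is nothing left to do for it.
      }
    }
    return removed;
  }

  public static void main(String[] args) throws Exception {
    FakeZkDir dir = new FakeZkDir();
    for (int i = 0; i < 5; i++) {
      dir.create("mn-" + i);
    }
    List<String> snapshot = dir.getChildren();

    // Simulate a faster thread that has already purged two of the listed nodes.
    dir.delete("mn-0");
    dir.delete("mn-1");

    int removed = purge(dir, snapshot);
    System.out.println("removed=" + removed + ", remaining=" + dir.size());
  }
}
```

The catch is deliberately narrow: only the "node already gone" case is swallowed, so any other failure raised during a purge would still surface to the caller.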

            People

              Assignee: Ishan Chattopadhyaya (ichattopadhyaya)
              Reporter: Patson Luk (patson)
              Votes: 0
              Watchers: 5

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 1h 50m