Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 7.2.1
Description
A problem was privately reported to me in which the Overseer gets stuck: no operations are processed until a new Overseer node is elected.
The following was logged:
WARN - 2019-03-11 10:11:34.879; org.apache.solr.cloud.LockTree$Node; lock_is_leaked at[item-xref-secondary-stage]
ERROR - 2019-03-11 10:11:35.002; org.apache.solr.common.SolrException; Collection: item-xref-secondary-stage operation: delete failed:org.apache.solr.common.SolrException: Could not find collection : item-xref-secondary-stage
    at org.apache.solr.common.cloud.ClusterState.getCollection(ClusterState.java:111)
    at org.apache.solr.cloud.OverseerCollectionMessageHandler.collectionCmd(OverseerCollectionMessageHandler.java:795)
    at org.apache.solr.cloud.DeleteCollectionCmd.call(DeleteCollectionCmd.java:91)
    at org.apache.solr.cloud.OverseerCollectionMessageHandler.processMessage(OverseerCollectionMessageHandler.java:233)
    at org.apache.solr.cloud.OverseerTaskProcessor$Runner.run(OverseerTaskProcessor.java:464)
    at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:188)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
This is a serious problem, since it can cause the whole system to hang.
Verified:
1. GC settings and long GC issues on Solr/ZK: none
2. Ulimits (OK): 65535 (-n open files) and nproc
3. ZK quorum working (5 ZKs)
4. Checked min/avg/max latencies on the ZK ensemble
5. Solr startup parameters
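
As a possible further check, the Overseer logs could be scanned for other occurrences of the lock_is_leaked warning to see which lock paths are affected and how often. The following is only a minimal sketch: it assumes the warning always has the form shown in the log excerpt above (lock_is_leaked at[...]), and the function name leaked_lock_counts and the command-line usage are hypothetical, not part of Solr.

```python
import re
import sys
from collections import Counter

# Assumption: the warning text matches the excerpt in this report, e.g.
#   ...LockTree$Node; lock_is_leaked at[item-xref-secondary-stage]
# The bracket content is captured as-is, whatever it contains.
LEAK_RE = re.compile(r"lock_is_leaked at\[([^\]]*)\]")


def leaked_lock_counts(lines):
    """Count lock_is_leaked warnings per lock path in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        for path in LEAK_RE.findall(line):
            counts[path] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python leaked_locks.py /path/to/solr.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        for path, n in leaked_lock_counts(fh).most_common():
            print(f"{path}\t{n}")
```

A path that appears here and is followed by failing or stalled collection operations would be a candidate for the same problem.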