[SOLR-6056] Zookeeper crash JVM stack OOM because of recover strategy - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: 4.6
Fix Version/s: 5.0
Component/s: SolrCloud
Labels:
- cluster
- crash
- recover
Environment:

Two linux servers, 65G memory, 16 core cpu
20 collections, every collection has one shard two replica
one zookeeper

Description

Some errors, such as "org.apache.solr.common.SolrException: Error opening new searcher. exceeded limit of maxWarmingSearchers=2, try again later", trigger the core admin recovery process when they occur in DistributedUpdateProcessor.
That means every update request that hits this error sends a core admin recovery request.
(See doFinish() in DistributedUpdateProcessor.java.)

The terrible thing is that CoreAdminHandler starts a new thread for each request to publish the recovery status and start recovery. The thread count increases very quickly and ends in a stack OOM. The Overseer can't handle that many status updates: the ZooKeeper queue nodes under /overseer/queue (for example /overseer/queue/qn-0000125553) grew by more than 40 thousand in two minutes.
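To illustrate the failure mode described above, here is a minimal sketch of one way to keep recovery requests from turning into one thread each: coalesce requests per core and run them on a bounded pool. `RecoveryGate` and `requestRecovery` are hypothetical names invented for this illustration. They are not Solr APIs, and this is not necessarily what the attached patch or the eventual fix does.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/**
 * Hypothetical helper (not part of Solr): allows at most one pending
 * recovery per core and caps the number of recovery threads.
 */
final class RecoveryGate {
    // Cores that already have a recovery queued or running.
    private final Set<String> inFlight = ConcurrentHashMap.newKeySet();
    // Fixed-size pool, so a burst of requests cannot create unbounded threads.
    private final ExecutorService pool;

    RecoveryGate(int maxThreads) {
        this.pool = Executors.newFixedThreadPool(maxThreads);
    }

    /**
     * Returns true if the recovery was scheduled, false if a recovery for
     * this core is already pending and the request was dropped as a duplicate.
     */
    boolean requestRecovery(String coreName, Runnable recovery) {
        if (!inFlight.add(coreName)) {
            return false; // duplicate request: no new thread, no new status update
        }
        try {
            pool.execute(() -> {
                try {
                    recovery.run();
                } finally {
                    inFlight.remove(coreName); // allow a later recovery for this core
                }
            });
        } catch (RuntimeException e) {
            // For example RejectedExecutionException after shutdown.
            inFlight.remove(coreName);
            throw e;
        }
        return true;
    }

    void shutdown() {
        pool.shutdown();
    }
}
```

With a gate like this, a burst of failed updates against the same core would schedule a single recovery instead of one thread and one Overseer status update per request. Whether dropping duplicates is acceptable depends on the recovery semantics, which this sketch does not model.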

In the end, ZooKeeper crashed.
Worse, with so many nodes in the ZooKeeper queue, the cluster couldn't publish the correct status, because only one Overseer processes the queue. I had to start three threads to clear the queue nodes. The cluster didn't work normally for nearly 30 minutes.

Attachments

patch-6056.txt (3 kB), 12/May/14 05:31, Raintung Li

Issue Links

relates to

SOLR-8371 Try and prevent too many recovery requests from stacking up and clean up some faulty logic.

Closed

People

Assignee: Shalin Shekhar Mangar
Reporter: Raintung Li
Votes: 0
Watchers: 9

Dates

Created: 10/May/14 06:07
Updated: 02/Oct/19 17:24
Resolved: 27/Jan/17 19:58