[SOLR-6524] Collections left in recovery state after node restart because recovery sleep time increases exponentially between retries - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 4.10.2, 5.0, 6.0
Component/s: SolrCloud
Labels:

Description

The RecoveryStrategy has a retry wait that grows exponentially. It waits 1 second before the first recovery retry, then 2 seconds, then 4 seconds, and so on.
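
To make the growth concrete, here is a minimal sketch of the wait sequence described above. It is not Solr's RecoveryStrategy code: the class and method names are hypothetical, and the 1-second base with doubling on every retry is taken only from the description.

```java
// Hypothetical illustration of the retry wait described in this issue.
// This is NOT the actual RecoveryStrategy implementation.
class ExponentialRetryWait {

    // Wait before retry number `retry` (0-based): 1s, 2s, 4s, ...
    // Valid for retry values below 63; larger shifts would overflow a long.
    static long waitSeconds(int retry) {
        return 1L << retry;
    }

    // Total time spent sleeping across the first `retries` failed attempts.
    static long totalWaitSeconds(int retries) {
        long total = 0;
        for (int i = 0; i < retries; i++) {
            total += waitSeconds(i);
        }
        return total;
    }

    public static void main(String[] args) {
        for (int retry = 0; retry < 10; retry++) {
            System.out.printf("retry %d: wait %ds (cumulative %ds)%n",
                    retry, waitSeconds(retry), totalWaitSeconds(retry + 1));
        }
    }
}
```

Under these assumptions the tenth wait alone is 512 seconds (about 8.5 minutes), which is in line with the multi-minute sleeps reported below.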

This causes problems when running a large number of collections in SolrCloud. We saw a case with 500 collections on 3 nodes (1 shard, 3 replicas) where, after a node was restarted, many collections could not come back up from recovery because:

1. The overseer is slow to process events (I'll create another issue for it).
2. Because the overseer is slow, cluster state updates are delayed, so recovery cannot succeed (WaitForState hangs while waiting to see the recovery state on replicas).
3. Because recovery can't succeed immediately, the recovery thread sleeps for longer and longer periods.
4. Even after the whole overseer queue has been cleared, many recovery threads are in such a long sleep that they won't even attempt to recover for many minutes (up to 10 minutes).
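
One common way to keep a sleeping retry thread from falling this far behind is to cap the exponential wait. The sketch below illustrates that general technique only. It is not taken from the attached patch, and the class name, method name, and 60-second cap are all assumptions chosen for illustration.

```java
// Hypothetical illustration of a capped exponential retry wait.
// Not taken from SOLR-6524.patch; the cap value is an assumption.
class CappedRetryWait {

    // Assumed upper bound on any single wait, in seconds.
    static final long MAX_WAIT_SECONDS = 60;

    // Wait before retry number `retry` (0-based): 1s, 2s, 4s, ...
    // but never more than MAX_WAIT_SECONDS.
    static long waitSeconds(int retry) {
        // Clamp the shift so very large retry counts cannot overflow.
        int shift = Math.min(Math.max(retry, 0), 30);
        return Math.min(1L << shift, MAX_WAIT_SECONDS);
    }
}
```

With a cap like this, a thread that has failed many times still retries within about a minute once the overseer queue drains, instead of sleeping through several more minutes.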

Attachments

SOLR-6524.patch
22/Oct/14 17:15
0.9 kB
Shalin Shekhar Mangar

People

Assignee: Shalin Shekhar Mangar

Reporter: Shalin Shekhar Mangar

Votes: 0

Watchers: 7

Dates

Created: 16/Sep/14 13:15

Updated: 09/May/16 18:55

Resolved: 22/Oct/14 17:23
