[SOLR-6524] Collections left in recovery state after node restart because recovery sleep time increases exponentially between retries - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 4.10.2, 5.0, 6.0
Component/s: SolrCloud
Labels:

Description

RecoveryStrategy waits between recovery retries, and that wait grows exponentially: it waits 1 second before the first retry, then 2 seconds, then 4 seconds, and so on.

This causes problems when running a large number of collections in SolrCloud. We saw a case with 500 collections on 3 nodes (1 shard, 3 replicas) where, after a node was restarted, many collections could not come back up from recovery because:

- The overseer is slow to process events (I'll create another issue for it).
- Because the overseer is slow, cluster state updates are delayed, so recovery cannot succeed (WaitForState hangs while waiting to see the recovery state on replicas).
- Because recovery can't succeed immediately, the recovery thread sleeps for longer and longer periods.
- Even after the whole overseer queue has been cleared, many recovery threads are in such a long sleep that they won't even attempt to recover for many minutes (up to 10 minutes).
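
The doubling schedule described above can be sketched in a few lines of Java. This is only an illustration of the arithmetic, not the RecoveryStrategy code: the class `BackoffSketch` and its methods are hypothetical names, and the only inputs taken from the report are the 1-second starting wait and the doubling on each retry.

```java
public class BackoffSketch {
    // Hypothetical helper: wait in seconds before retry number `retry` (0-based),
    // assuming a 1-second starting wait that doubles on every retry.
    public static long waitSeconds(int retry) {
        return 1L << retry;
    }

    // Total time spent sleeping across the first `retries` waits.
    public static long totalWaitSeconds(int retries) {
        long total = 0;
        for (int i = 0; i < retries; i++) {
            total += waitSeconds(i);
        }
        return total;
    }

    public static void main(String[] args) {
        // Print the wait before each of the first ten retries and the running total.
        for (int retry = 0; retry < 10; retry++) {
            System.out.printf("retry %d: sleep %d s (cumulative %d s)%n",
                    retry, waitSeconds(retry), totalWaitSeconds(retry + 1));
        }
    }
}
```

Under these assumptions the tenth wait alone is 512 seconds (about 8.5 minutes), which is in the same range as the delays of up to 10 minutes reported above. A thread that enters such a sleep just before the overseer queue drains will not retry until the sleep ends.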

Attachments

SOLR-6524.patch
22/Oct/14 17:15
0.9 kB
Shalin Shekhar Mangar

People

Assignee: Shalin Shekhar Mangar

Reporter: Shalin Shekhar Mangar

Votes: 0

Watchers: 7

Dates

Created: 16/Sep/14 13:15

Updated: 09/May/16 18:55

Resolved: 22/Oct/14 17:23