Solr / SOLR-6524

Collections left in recovery state after node restart because recovery sleep time increases exponentially between retries


    Description

      The RecoveryStrategy waits between recovery retries, and that wait grows exponentially. It waits 1 second before the first retry, then 2 seconds, then 4 seconds, and so on.
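
      The doubling schedule described above can be sketched as follows. This is a minimal illustration, not the actual RecoveryStrategy code: the class name `BackoffSchedule` and the helpers `waitSeconds` and `schedule` are hypothetical, and the only behavior taken from the description is the 1, 2, 4, ... second progression.

```java
import java.util.ArrayList;
import java.util.List;

class BackoffSchedule {
    // Hypothetical helper: seconds to wait before retry number `retry` (0-based).
    // Mirrors the 1s, 2s, 4s, ... progression from the description.
    static long waitSeconds(int retry) {
        if (retry < 0 || retry > 62) {
            throw new IllegalArgumentException("retry out of range: " + retry);
        }
        return 1L << retry; // 2^retry
    }

    // Waits for the first `retries` retries, in order.
    static List<Long> schedule(int retries) {
        List<Long> waits = new ArrayList<>();
        for (int r = 0; r < retries; r++) {
            waits.add(waitSeconds(r));
        }
        return waits;
    }

    public static void main(String[] args) {
        System.out.println(schedule(6)); // prints [1, 2, 4, 8, 16, 32]
    }
}
```

      Under this assumed schedule, the tenth retry alone would already wait 512 seconds.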

      This causes problems when running a large number of collections in SolrCloud. We saw a case with 500 collections on 3 nodes (1 shard, 3 replicas) where, after a node was restarted, many collections could not come back up from recovery because:

      1. The overseer is slow to process events (I'll create another issue for it)
      2. Because the overseer is slow, cluster state updates are delayed and therefore recovery cannot succeed (WaitForState hangs while waiting to see recovery state on replicas)
      3. Because recovery can't succeed immediately, the recovery thread sleeps for larger and larger amounts of time
      4. Even after the whole overseer queue has been cleared, many recovery threads are in such a long sleep that they won't even attempt to recover for many minutes (up to 10 minutes).
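
      Points 3 and 4 can be illustrated with a short sketch of how the sleep time accumulates across failed attempts. Everything here is an assumption made for illustration: the class `RecoverySleepSketch`, its helpers, and the 600-second cap (chosen only to match the "up to 10 minutes" figure above) are hypothetical and are not taken from the Solr source.

```java
class RecoverySleepSketch {
    // Assumed cap, picked to match the "up to 10 minutes" figure in the description.
    static final long MAX_WAIT_SECONDS = 600;

    // Hypothetical helper: capped exponential wait before retry `retry` (0-based).
    static long waitSeconds(int retry) {
        if (retry < 0) {
            throw new IllegalArgumentException("retry must be >= 0");
        }
        if (retry >= 30) {
            return MAX_WAIT_SECONDS; // avoid shifting into overflow territory
        }
        return Math.min(1L << retry, MAX_WAIT_SECONDS);
    }

    // Total seconds spent sleeping after `failedAttempts` failed attempts.
    static long totalSleptSeconds(int failedAttempts) {
        long total = 0;
        for (int r = 0; r < failedAttempts; r++) {
            total += waitSeconds(r);
        }
        return total;
    }

    public static void main(String[] args) {
        for (int attempts = 1; attempts <= 12; attempts++) {
            System.out.printf("failed attempts=%d, last wait=%ds, total slept=%ds%n",
                    attempts, waitSeconds(attempts - 1), totalSleptSeconds(attempts));
        }
    }
}
```

      In this sketch, a thread whose tenth attempt fails while the overseer is still backlogged goes to sleep for 512 seconds, so it cannot notice that the queue has drained until that sleep ends. A smaller cap or a fixed retry interval would bound that blind period.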

          People

            shalin Shalin Shekhar Mangar
            shalin Shalin Shekhar Mangar