Solr / SOLR-6524

Collections left in recovery state after node restart because recovery sleep time increases exponentially between retries

    Details

      Description

      The RecoveryStrategy waits between recovery retries, and the wait grows exponentially: it waits 1 second before the first retry, then 2 seconds, then 4 seconds, and so on.
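
      To make the growth concrete, here is a minimal sketch of the doubling schedule described above. It is illustrative only and is not the actual RecoveryStrategy code; the class and method names (BackoffSchedule, waitSeconds, totalSleepSeconds) are hypothetical.

```java
// Illustrative only: a doubling retry schedule as described in this issue.
// BackoffSchedule and its methods are hypothetical names, not Solr APIs.
class BackoffSchedule {

    // Wait in seconds before retry number `retry` (0-based): 1, 2, 4, ...
    static long waitSeconds(int retry) {
        if (retry < 0 || retry > 61) {
            throw new IllegalArgumentException("retry out of range: " + retry);
        }
        return 1L << retry;
    }

    // Total seconds spent sleeping across the first `retries` retries.
    static long totalSleepSeconds(int retries) {
        long total = 0;
        for (int i = 0; i < retries; i++) {
            total += waitSeconds(i);
        }
        return total;
    }

    public static void main(String[] args) {
        for (int retry = 0; retry < 10; retry++) {
            System.out.printf("retry %d: wait %ds, cumulative %ds%n",
                    retry, waitSeconds(retry), totalSleepSeconds(retry + 1));
        }
    }
}
```

      Under this schedule the tenth wait alone is 512 seconds, and the ten waits together add up to 1023 seconds, so a replica that fails a handful of early attempts spends most of its time asleep.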

      This causes problems when running a large number of collections in SolrCloud. We saw a case with 500 collections on 3 nodes (1 shard, 3 replicas) where, after a node was restarted, many collections could not come back up from recovery because:

      1. The overseer is slow to process events (I'll create another issue for it).
      2. Because the overseer is slow, cluster state updates are delayed, so recovery cannot succeed (WaitForState hangs while waiting to see the recovery state on replicas).
      3. Because recovery cannot succeed immediately, the recovery thread sleeps for progressively longer periods.
      4. Even after the whole overseer queue has been cleared, many recovery threads are in such a long sleep that they will not attempt to recover again for many minutes (up to 10 minutes).
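
      The effect in step 4 can be sketched with a small simulation. This is a simplified model, not Solr code: it assumes every recovery attempt fails instantly until the overseer queue clears, and all names (RecoveryDelaySim, idleAfterClear) are hypothetical. The 600-second cap is an assumption taken from the "up to 10 minutes" observation above; the 60-second cap is an arbitrary smaller value used only for comparison.

```java
// Simplified model: attempts fail instantly until the blocker clears,
// and the sleep between attempts doubles up to a cap.
// RecoveryDelaySim and idleAfterClear are hypothetical names, not Solr APIs.
class RecoveryDelaySim {

    // Seconds between the moment the blocker clears and the first
    // recovery attempt that starts at or after that moment.
    static long idleAfterClear(long clearsAtSeconds, long capSeconds) {
        long now = 0;   // the first attempt happens at t = 0
        long wait = 1;  // the first sleep is 1 second
        while (now < clearsAtSeconds) {
            now += Math.min(wait, capSeconds); // sleep, then attempt again
            if (wait < capSeconds) {
                wait *= 2;                     // double until the cap is reached
            }
        }
        return now - clearsAtSeconds;
    }

    public static void main(String[] args) {
        long clearsAt = 520; // assumed: the overseer queue drains after 520s
        System.out.println("cap 600s: idle " + idleAfterClear(clearsAt, 600) + "s");
        System.out.println("cap 60s:  idle " + idleAfterClear(clearsAt, 60) + "s");
    }
}
```

      In this model, a queue that drains at 520 seconds leaves the thread asleep for another 503 seconds with the 600-second cap, but only 23 seconds with the 60-second cap. The exact numbers depend on the assumptions, but the pattern matches the report: the longer the initial blockage, the longer a recovered cluster keeps waiting on sleeping threads.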

        Attachments

        1. SOLR-6524.patch
          0.9 kB
          Shalin Shekhar Mangar

            People

            • Assignee:
              Shalin Shekhar Mangar (shalinmangar)
              Reporter:
              Shalin Shekhar Mangar (shalinmangar)