SOLR-6524

Collections left in recovery state after node restart because recovery sleep time increases exponentially between retries

    Details

      Description

      RecoveryStrategy waits between recovery retries, and that wait grows exponentially: 1 second before the first retry, then 2 seconds, then 4 seconds, and so on.
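
      The schedule described above can be sketched as a pure function. This is an illustration only, not the code in RecoveryStrategy: the class name RecoveryBackoffSketch, the methods waitSeconds and schedule, and passing the ceiling as a parameter are all assumptions made for the example. The 600-second ceiling used in main is the 10-minute maximum mentioned elsewhere in this issue.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch of a capped exponential retry wait; not Solr's actual code. */
final class RecoveryBackoffSketch {

    /**
     * Seconds to sleep before retry number {@code retry} (0-based):
     * 1, 2, 4, ... doubling each time, but never more than {@code capSeconds}.
     */
    static long waitSeconds(int retry, long capSeconds) {
        if (retry < 0 || capSeconds < 1) {
            throw new IllegalArgumentException("retry must be >= 0 and capSeconds >= 1");
        }
        // Clamp the shift so the doubling cannot overflow a long.
        long uncapped = 1L << Math.min(retry, 62);
        return Math.min(uncapped, capSeconds);
    }

    /** The first {@code count} waits, for inspection or logging. */
    static List<Long> schedule(int count, long capSeconds) {
        List<Long> waits = new ArrayList<>();
        for (int retry = 0; retry < count; retry++) {
            waits.add(waitSeconds(retry, capSeconds));
        }
        return waits;
    }

    public static void main(String[] args) {
        // 600 seconds is the 10-minute ceiling discussed in this issue.
        System.out.println(schedule(12, 600));
    }
}
```

      With a 600-second ceiling the first twelve waits are 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 600 and 600 seconds, so a thread that has failed ten times sleeps for the full 10 minutes on every later retry.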

      This causes problems when running a large number of collections in SolrCloud. We saw a case with 500 collections on 3 nodes (1 shard, 3 replicas) in which, after a node was restarted, many collections could not come back up from recovery because:

      1. The overseer is slow to process events (I'll create another issue for it)
      2. Because the overseer is slow, cluster state updates are delayed and therefore recovery cannot succeed (WaitForState hangs while waiting to see recovery state on replicas)
      3. Because recovery can't succeed immediately, the recovery thread sleeps for larger and larger amounts of time
      4. Even after the whole overseer queue has been cleared, many recovery threads are in such a long sleep that they won't even attempt to recover for many minutes (up to 10 minutes).
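
      The effect of steps 3 and 4 can be estimated with a small model. This is a rough sketch under stated assumptions, not a description of Solr internals: it assumes every recovery attempt fails instantly while the overseer is backed up, that waits double from 1 second up to a ceiling, and that the first attempt after the backlog clears succeeds. The class RecoveryStallSketch, its method names, and the 30-minute backlog are hypothetical.

```java
/** Illustrative model of how long a sleeping recovery thread lags behind the cluster; not Solr's code. */
final class RecoveryStallSketch {

    /** Capped exponential wait in seconds: 1, 2, 4, ... up to capSeconds. */
    static long waitSeconds(int retry, long capSeconds) {
        return Math.min(1L << Math.min(retry, 62), capSeconds);
    }

    /**
     * Time (seconds since the first attempt) of the first attempt that can succeed,
     * assuming every attempt made before {@code blockedSeconds} fails instantly.
     */
    static long firstSuccessfulAttemptAt(long blockedSeconds, long capSeconds) {
        if (capSeconds < 1) {
            throw new IllegalArgumentException("capSeconds must be >= 1");
        }
        long now = 0;
        int retry = 0;
        while (now < blockedSeconds) {
            now += waitSeconds(retry, capSeconds); // sleep, then try again
            retry++;
        }
        return now;
    }

    /** Seconds spent asleep after recovery had already become possible. */
    static long lagSeconds(long blockedSeconds, long capSeconds) {
        return firstSuccessfulAttemptAt(blockedSeconds, capSeconds) - blockedSeconds;
    }

    public static void main(String[] args) {
        long blocked = 30 * 60; // hypothetical: the overseer queue takes 30 minutes to drain
        System.out.println("cap 600s: lag " + lagSeconds(blocked, 600) + "s");
        System.out.println("cap  60s: lag " + lagSeconds(blocked, 60) + "s");
    }
}
```

      In this model a 30-minute backlog leaves the thread asleep for 423 seconds after recovery became possible when the ceiling is 600 seconds, compared with 3 seconds when the ceiling is 60 seconds. The lag can never exceed the ceiling, which is why the ceiling matters more than the doubling itself. Real attempts take time, so actual delays will differ.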

      Attachments

      SOLR-6524.patch (0.9 kB, Shalin Shekhar Mangar)

        Activity

        Shalin Shekhar Mangar added a comment -

        I think, at a minimum, we should reduce the max wait time of 600 seconds (10 minutes).

        Shalin Shekhar Mangar added a comment -

        A workaround for people who are affected is to call core reload or collection reload to force restart the recovery process.
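
        As a rough sketch of that workaround, the reload requests could be built as follows. The Collections API and CoreAdmin API both accept action=RELOAD; everything else here is an assumption for illustration: the class ReloadWorkaroundSketch, its method names, the base URL http://localhost:8983/solr, and the collection and core names are placeholders to adapt to the affected cluster.

```java
import java.io.IOException;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URLEncoder;

/** Illustrative helper for issuing reload requests; names and URLs are placeholders. */
final class ReloadWorkaroundSketch {

    /** Collections API: reloads every core of the named collection. */
    static URI collectionReloadUri(String solrBaseUrl, String collection) {
        return URI.create(solrBaseUrl + "/admin/collections?action=RELOAD&name=" + encode(collection));
    }

    /** CoreAdmin API: reloads a single core on the node that is addressed. */
    static URI coreReloadUri(String solrBaseUrl, String core) {
        return URI.create(solrBaseUrl + "/admin/cores?action=RELOAD&core=" + encode(core));
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    /** Sends a GET request and returns the HTTP status code. */
    static int send(URI uri) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) uri.toURL().openConnection();
        try {
            conn.setRequestMethod("GET");
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) {
        String base = "http://localhost:8983/solr"; // placeholder base URL
        // Printed rather than sent, so the sketch is safe to run anywhere.
        System.out.println(collectionReloadUri(base, "collection1"));
        System.out.println(coreReloadUri(base, "collection1_shard1_replica1"));
    }
}
```

        Passing either URI to send would issue the request against a live node; main only prints the URIs so nothing is reloaded by accident.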

        Mark Miller added a comment -

        Agreed - let's drop the max retry time.

        Shalin Shekhar Mangar added a comment -

        The patch sets the max retry wait time to 1 minute instead of 10 minutes.

        ASF subversion and git services added a comment -

        Commit 1633655 from shalin@apache.org in branch 'dev/trunk'
        [ https://svn.apache.org/r1633655 ]

        SOLR-6524: Collections left in recovery state after node restart because recovery sleep time increases exponentially between retries

        ASF subversion and git services added a comment -

        Commit 1633656 from shalin@apache.org in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1633656 ]

        SOLR-6524: Collections left in recovery state after node restart because recovery sleep time increases exponentially between retries

        ASF subversion and git services added a comment -

        Commit 1633658 from shalin@apache.org in branch 'dev/branches/lucene_solr_4_10'
        [ https://svn.apache.org/r1633658 ]

        SOLR-6524: Collections left in recovery state after node restart because recovery sleep time increases exponentially between retries


          People

          • Assignee: Shalin Shekhar Mangar
          • Reporter: Shalin Shekhar Mangar
          • Votes: 0
          • Watchers: 6