  1. Solr
  2. SOLR-10914

RecoveryStrategy's sendPrepRecoveryCmd can get stuck for 5 minutes if leader is unloaded


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 6.4, 6.5, 6.6
    • Fix Version/s: 6.7, 7.0
    • Component/s: SolrCloud
    • Labels: None

    Description

      tl;dr: A recovering replica can get stuck for 5 minutes in the prep recovery request if the leader core is unloaded before the request is made.

      SOLR-9716 changed sendPrepRecoveryCmd to retry on read timeouts (earlier it had no connection/read timeout at all), but that fix has caused another problem. Consider the following sequence:

      1. A replica starts up (or is newly created) and goes into recovery.
      2. The replica finds that leader=X.
      3. The core X is unloaded, but the node that used to host X is still running and taking requests.
      4. The replica calls sendPrepRecoveryCmd to X.

      At this point, the node that used to host X receives the prep recovery command, finds that the core X does not exist, and keeps checking again in a sleep-loop until a timeout happens. I am not sure why the prep recovery core admin command needs to keep waiting if the local core does not exist. The default timeout here is usually longer than 10 seconds.
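
      One way to avoid the wait is to check for the local core on every pass of the loop and fail immediately when it is missing. The following is only a minimal sketch of that idea, not Solr's actual prep recovery handler and not necessarily what the attached patches do. The class name, the exception type, the `loadedCores` set, and all parameters are hypothetical.

      ```java
      import java.util.Set;
      import java.util.function.BooleanSupplier;

      // Hypothetical sketch; none of these names come from Solr.
      final class PrepRecoverySketch {

          /** Signals that the requested leader core is not loaded on this node. */
          static final class CoreNotFoundException extends RuntimeException {
              CoreNotFoundException(String coreName) {
                  super("core not found: " + coreName);
              }
          }

          /**
           * Polls until stateReached is true or the timeout elapses.
           * Fails fast when the leader core is not loaded locally instead of
           * sleeping until the timeout.
           *
           * @return true if the state was reached, false on timeout
           */
          static boolean waitForReplicaState(Set<String> loadedCores,
                                             String leaderCore,
                                             BooleanSupplier stateReached,
                                             long timeoutMillis,
                                             long pollMillis) {
              long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
              while (true) {
                  // Checked on every pass: the core may be unloaded while we wait.
                  if (!loadedCores.contains(leaderCore)) {
                      throw new CoreNotFoundException(leaderCore);
                  }
                  if (stateReached.getAsBoolean()) {
                      return true;
                  }
                  if (System.nanoTime() >= deadline) {
                      return false;
                  }
                  try {
                      Thread.sleep(pollMillis);
                  } catch (InterruptedException e) {
                      Thread.currentThread().interrupt();
                      throw new IllegalStateException("interrupted while waiting", e);
                  }
              }
          }
      }
      ```

      With a check like this, the caller gets a definitive error right away and can look up the current leader, rather than waiting for the server-side timeout.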

      On the recovering replica's side, the prep recovery request has a connection/read timeout of only 10s, so the request always times out and is retried for up to 5 minutes. Only then does the recovery attempt fail, after which it may be restarted with the right leader URL.
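
      To make the arithmetic concrete: with a 10s read timeout and a 5 minute retry window, the replica can make roughly 30 attempts that all time out. The sketch below illustrates this, along with a retry loop that retries only on timeouts and lets any other failure propagate at once. It is an illustration under assumptions, not the code in RecoveryStrategy. The class, the `ReadTimeoutException` type, and the method names are hypothetical.

      ```java
      import java.time.Duration;

      // Hypothetical sketch; none of these names come from Solr.
      final class PrepRecoveryRetrySketch {

          /** Stand-in for a connection/read timeout on the prep recovery request. */
          static final class ReadTimeoutException extends RuntimeException {}

          /** Upper bound on attempts when every attempt runs into the read timeout. */
          static long maxTimedOutAttempts(Duration retryBudget, Duration readTimeout) {
              return retryBudget.toMillis() / readTimeout.toMillis();
          }

          /**
           * Retries the request on timeouts until the budget is spent.
           * Any other exception is not caught, so a definitive error from the
           * remote node (for example "core not found") ends the loop at once.
           *
           * @return the number of attempts made, including the successful one
           */
          static int sendWithRetry(Runnable request, Duration retryBudget) {
              long deadline = System.nanoTime() + retryBudget.toNanos();
              int attempts = 0;
              while (true) {
                  attempts++;
                  try {
                      request.run();
                      return attempts;
                  } catch (ReadTimeoutException e) {
                      if (System.nanoTime() >= deadline) {
                          throw new IllegalStateException(
                                  "prep recovery timed out after " + attempts + " attempts", e);
                      }
                  }
              }
          }
      }
      ```

      The retry loop only helps if the remote node reports a missing core as an error instead of holding the request open. Otherwise every attempt ends in a timeout and the full window is spent.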

      Attachments

        1. SOLR-10914.patch
          2 kB
          Shalin Shekhar Mangar
        2. SOLR-10914.patch
          10 kB
          Shalin Shekhar Mangar
        3. SOLR-10914.patch
          10 kB
          Shalin Shekhar Mangar



          People

            Assignee: Shalin Shekhar Mangar
            Reporter: Shalin Shekhar Mangar
            Votes: 0
            Watchers: 2
