Solr / SOLR-10914

RecoveryStrategy's sendPrepRecoveryCmd can get stuck for 5 minutes if leader is unloaded



    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 6.4, 6.5, 6.6
    • Fix Version/s: 6.7, 7.0
    • Component/s: SolrCloud
    • Labels: None


      tl;dr: A recovering replica can be stuck for up to 5 minutes in the prep recovery request if the leader core is unloaded before that request is made.

      SOLR-9716 changed sendPrepRecoveryCmd to retry on read timeouts (earlier it had no connection/read timeout at all), but that fix has caused another problem. Consider the following sequence:

      1. A replica starts up (or is newly created) and goes into recovery.
      2. The replica finds that leader=X.
      3. Core X is unloaded, but the node that used to host X is still running and accepting requests.
      4. The replica calls sendPrepRecoveryCmd against X.

      At this point, the node that hosted X receives the prep recovery command, finds that core X does not exist, and keeps checking again in a sleep loop until a timeout happens. I am not sure why the prep recovery core admin command needs to keep waiting if the local core does not exist. The default timeout here is usually longer than 10 seconds.
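The leader-side behaviour described above can be modelled roughly as follows. This is a minimal sketch, not Solr's actual prep recovery handler: `CoreLookup`, `PrepRecoveryWaitSketch`, `waitForCore`, and the `failFast` flag are hypothetical names, and time is simulated rather than slept so the example runs instantly. The `failFast` branch only illustrates the question raised above (returning immediately when the local core is missing); it is not a statement of what the attached patches do.

```java
// Hypothetical model of the leader-side wait; not Solr's actual code.
interface CoreLookup {
    boolean coreExists(String coreName);
}

final class PrepRecoveryWaitSketch {
    enum Result { CORE_FOUND, CORE_MISSING_FAIL_FAST, TIMED_OUT }

    /** Result of the simulated wait plus how long the request was held open. */
    static final class Outcome {
        final Result result;
        final long elapsedMillis;

        Outcome(Result result, long elapsedMillis) {
            this.result = result;
            this.elapsedMillis = elapsedMillis;
        }
    }

    /**
     * Polls for the core every pollMillis until waitMillis have elapsed.
     * With failFast=false this mirrors the sleep loop described in the issue;
     * with failFast=true it returns as soon as the core is found to be missing.
     */
    static Outcome waitForCore(CoreLookup lookup, String coreName,
                               long waitMillis, long pollMillis, boolean failFast) {
        long elapsed = 0;
        while (elapsed < waitMillis) {
            if (lookup.coreExists(coreName)) {
                return new Outcome(Result.CORE_FOUND, elapsed);
            }
            if (failFast) {
                return new Outcome(Result.CORE_MISSING_FAIL_FAST, elapsed);
            }
            elapsed += pollMillis; // stands in for Thread.sleep(pollMillis)
        }
        return new Outcome(Result.TIMED_OUT, elapsed);
    }
}
```

With an assumed 30-second wait and a core that never appears, `waitForCore(name -> false, "X", 30_000, 1_000, false)` holds the request for the full 30 seconds, which is longer than the caller's 10-second read timeout; with `failFast` set to true it returns immediately.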

      On the recovering replica's side, the prep recovery request has a connection/read timeout of only 10s, so the request always times out and is retried for up to 5 minutes. Only then does the recovery attempt fail, after which it may be restarted with the right leader URL.
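To make the timing concrete, here is a small simulation of the replica-side retries. It is a sketch under stated assumptions: the 5 minutes is treated as a time budget, every attempt blocks for the full read timeout whenever the leader-side wait outlasts it, and the 30-second leader wait used in the usage note is an assumed value (the issue only says it is usually longer than 10 seconds). `PrepRecoveryRetrySketch` and `simulate` are hypothetical names, not Solr APIs.

```java
// Hypothetical timing model of the replica-side retry loop; not Solr's actual code.
final class PrepRecoveryRetrySketch {
    /**
     * Returns {attempts, elapsedMillis} for a replica that retries a request
     * until the retry budget is used up. An attempt succeeds only if the
     * leader-side wait finishes within the read timeout.
     */
    static long[] simulate(long readTimeoutMillis, long retryBudgetMillis, long leaderWaitMillis) {
        long attempts = 0;
        long elapsed = 0;
        while (elapsed < retryBudgetMillis) {
            attempts++;
            if (leaderWaitMillis <= readTimeoutMillis) {
                // The leader answers before the read timeout fires.
                elapsed += leaderWaitMillis;
                return new long[] {attempts, elapsed};
            }
            // Read timeout fires first; the replica retries.
            elapsed += readTimeoutMillis;
        }
        return new long[] {attempts, elapsed};
    }
}
```

Under these assumptions, `simulate(10_000, 300_000, 30_000)` yields 30 timed-out attempts and 300,000 ms of stall, which matches the 5 minutes described above. If the leader answered within the read timeout, for example `simulate(10_000, 300_000, 5_000)`, the first attempt would return after 5 seconds.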


        1. SOLR-10914.patch
          2 kB
          Shalin Shekhar Mangar
        2. SOLR-10914.patch
          10 kB
          Shalin Shekhar Mangar
        3. SOLR-10914.patch
          10 kB
          Shalin Shekhar Mangar



            Assignee: Shalin Shekhar Mangar
            Reporter: Shalin Shekhar Mangar