SOLR-10914 (Solr)

RecoveryStrategy's sendPrepRecoveryCmd can get stuck for 5 minutes if leader is unloaded

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 6.4, 6.5, 6.6
    • Fix Version/s: 6.7, 7.0
    • Component/s: SolrCloud
    • Security Level: Public (Default Security Level. Issues are Public)
    • Labels:
      None

      Description

      tl;dr: a recovering replica gets stuck for 5 minutes in the prep recovery request if the leader core is unloaded before the prep recovery request is made.

      SOLR-9716 changed sendPrepRecoveryCmd to retry on read timeouts (previously it had no connection or read timeout at all), but that fix has caused another problem. Consider the following sequence:

      1. A replica starts up (or is newly created) and goes into recovery.
      2. The replica finds that the leader is X.
      3. Core X is unloaded, but the node that hosted X is still running and accepting requests.
      4. The replica calls sendPrepRecoveryCmd on X.

      At this point, the node that hosted X receives the prep recovery command, finds that core X does not exist, and keeps re-checking in a sleep loop until a timeout expires. I am not sure why the prep recovery core admin command needs to keep waiting if the local core does not exist. The default timeout here is usually longer than 10 seconds.
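The leader-side behavior described above can be modelled with a small simulation. This is a minimal sketch under stated assumptions, not Solr's PrepRecoveryOp code: `wait_for_core`, `FakeClock`, the 1-second poll interval and the 30-second timeout are hypothetical. In Solr the command's main job is to wait until the leader sees the recovering replica in the expected state; that part is omitted here, and "ready" only means the core was found. Passing `fail_fast=True` shows the alternative the report hints at: answering immediately when the local core is missing.

```python
from dataclasses import dataclass


@dataclass
class FakeClock:
    """Hypothetical simulated clock so the sketch runs instantly."""
    now: float = 0.0

    def sleep(self, seconds: float) -> None:
        self.now += seconds


def wait_for_core(core_exists, clock, timeout_s, poll_s=1.0, fail_fast=False):
    """Hypothetical model of the leader-side prep recovery sleep loop.

    `core_exists` is polled on every iteration. Returns (outcome, seconds_spent).
    With fail_fast=True a missing core is reported at once instead of waiting.
    """
    start = clock.now
    while clock.now - start < timeout_s:
        if core_exists():
            return "ready", clock.now - start
        if fail_fast:
            return "core not found", clock.now - start
        clock.sleep(poll_s)  # keep re-checking until the timeout expires
    return "timed out", clock.now - start


if __name__ == "__main__":
    # Core X was unloaded, so it never reappears: the loop burns the full timeout.
    print(wait_for_core(lambda: False, FakeClock(), timeout_s=30))  # ('timed out', 30.0)
```

With a missing core the current behavior spends the whole timeout, which is longer than the replica's 10-second read timeout; the fail-fast variant returns after zero simulated seconds.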

      On the recovering replica's side, the prep recovery request has a connection/read timeout of only 10 seconds, so every request times out and is retried for up to 5 minutes. Only then does the recovery attempt fail, after which it may be restarted with the correct leader URL.
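The replica-side arithmetic can be sketched the same way. Assuming (hypothetically) that every retry reaches the same missing core, so the leader always waits longer than the read timeout, a 10-second read timeout and a 5-minute retry budget add up to 30 attempts and 300 seconds before the recovery attempt fails. `send_prep_recovery` and its parameters are illustrative names, not Solr's API, and the 30-second leader wait is an assumed value.

```python
def send_prep_recovery(leader_wait_s, read_timeout_s=10.0, total_budget_s=300.0):
    """Hypothetical model of the recovering replica retrying prep recovery.

    If the leader answers within the read timeout (possibly with an error),
    the replica gets a response. Otherwise the attempt ends in a read timeout
    and is retried until the total budget is spent.
    Returns (responded, attempts, elapsed_s).
    """
    elapsed = 0.0
    attempts = 0
    while elapsed < total_budget_s:
        attempts += 1
        if leader_wait_s <= read_timeout_s:
            return True, attempts, elapsed + leader_wait_s
        elapsed += read_timeout_s  # read timeout hit; retry from scratch
    return False, attempts, elapsed


if __name__ == "__main__":
    # Leader keeps sleeping for 30s per request: every attempt times out.
    print(send_prep_recovery(leader_wait_s=30.0))  # (False, 30, 300.0)
```

With `leader_wait_s=0.0` (a leader that reports the missing core immediately), the first attempt gets a response, so the replica can fail this recovery attempt and re-resolve the leader instead of waiting out the full budget.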

        Attachments

        1. SOLR-10914.patch
          10 kB
          Shalin Shekhar Mangar
        2. SOLR-10914.patch
          10 kB
          Shalin Shekhar Mangar
        3. SOLR-10914.patch
          2 kB
          Shalin Shekhar Mangar

              People

              • Assignee: Shalin Shekhar Mangar
              • Reporter: Shalin Shekhar Mangar
              • Votes: 0
              • Watchers: 2