tl;dr: a recovering replica is stuck for 5 minutes in the prep recovery request if the leader core is unloaded before the prep recovery request is made.
SOLR-9716 changed sendPrepRecoveryCmd to retry on read timeouts (earlier it had no connection/read timeout at all), but the fix has caused another problem. Consider this sequence:
- A replica starts up (or is newly created) and goes into recovery.
- The replica finds that leader=X.
- The core X is unloaded, but the node that used to host X is still running and taking requests.
- The replica calls sendPrepRecoveryCmd to X.
At this point, the node that used to host X receives the prep recovery command, finds that the core X does not exist, and keeps checking again in a sleep-loop until a timeout happens. I am not sure why the prep recovery core admin command needs to continue waiting if the local core does not exist. The default timeout here is usually longer than 10 seconds.
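To illustrate the server-side behaviour described above, here is a minimal model of such a sleep-loop. It is not Solr code: `wait_for_core`, `core_exists`, `poll_interval_s` and the `fail_fast` flag are all hypothetical names, and the clock is simulated rather than real. The `fail_fast` option only sketches what returning immediately for a missing local core might look like.

```python
from typing import Callable, Tuple


def wait_for_core(core_exists: Callable[[], bool],
                  timeout_s: float,
                  poll_interval_s: float = 1.0,
                  fail_fast: bool = False) -> Tuple[bool, float]:
    """Model of a handler that polls for a local core.

    Returns (found, seconds_waited). Time is simulated: each loop
    iteration adds poll_interval_s instead of actually sleeping.
    All names and defaults here are assumptions for illustration.
    """
    waited = 0.0
    while True:
        if core_exists():
            return True, waited
        # Give up at once if asked to, otherwise only after the timeout.
        if fail_fast or waited >= timeout_s:
            return False, waited
        waited += poll_interval_s  # stands in for a real sleep


if __name__ == "__main__":
    # The core was unloaded, so the existence check never succeeds.
    print(wait_for_core(lambda: False, timeout_s=30.0))
    print(wait_for_core(lambda: False, timeout_s=30.0, fail_fast=True))
```

With a core that never appears, the loop holds the request open for the whole timeout, whereas the fail-fast variant answers immediately.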
On the recovering replica's side, the prep recovery request has a connection/read timeout of only 10s, so the request always times out and is retried for up to 5 minutes. Only then does the recovery attempt fail, after which it may be restarted with the right leader URL.
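The interaction of the two timeouts can be sketched with a small timing model. Again this is not Solr code: `simulate_prep_recovery` is a hypothetical helper, the 10s read timeout and 5 minute retry window are the values mentioned above, and the server-side wait passed in by the caller is an assumption (the description only says it is usually longer than 10 seconds).

```python
from typing import Tuple


def simulate_prep_recovery(server_wait_s: float,
                           read_timeout_s: float = 10.0,
                           retry_budget_s: float = 300.0) -> Tuple[int, float, str]:
    """Model the recovering replica's retry loop with a simulated clock.

    server_wait_s is how long the remote node holds the request before
    answering (an assumed value). Returns (attempts, elapsed_seconds, outcome).
    """
    elapsed = 0.0
    attempts = 0
    while elapsed < retry_budget_s:
        attempts += 1
        if server_wait_s <= read_timeout_s:
            # The remote node answers before the read timeout fires,
            # so the replica sees the response and can stop retrying.
            elapsed += server_wait_s
            return attempts, elapsed, "responded"
        # The read timeout fires first; the request is retried.
        elapsed += read_timeout_s
    return attempts, elapsed, "gave-up"


if __name__ == "__main__":
    # Remote wait longer than the read timeout: every attempt times out.
    print(simulate_prep_recovery(server_wait_s=30.0))
    # Remote node answers at once (e.g. if it failed fast on a missing core).
    print(simulate_prep_recovery(server_wait_s=0.0))
```

In this model, any server-side wait longer than the read timeout means no attempt can ever complete, so the replica spends the full retry window before giving up.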
SOLR-10878 MOVEREPLICA command may lose data when replicationFactor==1
- relates to
SOLR-9716 RecoveryStrategy send prep recovery cmd without setting request time out