Discovered the following issue with core recovery:
- Core recovery is not initialized, and the following exception message is thrown:
- The above error occurs when the ping request takes longer than the timeout period, which is hard-coded to one second in the Solr source code. In a typical production setting, however, it is common for a ping to take more than one second, so core recovery never starts and the exception is thrown.
- The other major concern is that this exception is logged as an info message, so the error is very difficult to identify if info logging is not enabled.
- Please refer to the following code snippet from the source code to understand the above issue.
This issue will have a high impact on production clusters, since cores that are unable to recover may lead to data loss.
The following improvements would be really helpful:
1. The timeout for the ping request in RecoveryStrategy.java should be configurable, with the default set to a higher value such as 15 seconds.
2. The exception messages at line 797 and line 801 of RecoveryStrategy.java should be logged as error messages instead of info messages.
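
To illustrate the two improvements, here is a minimal standalone sketch. It is not the actual RecoveryStrategy.java code. The class name, the method names, and the system property `example.recovery.pingTimeoutSeconds` are hypothetical and chosen only for this example. It uses java.util.logging so that it runs without dependencies; in Solr, which logs through SLF4J, the equivalent would be an error-level log call.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical illustration only; none of these names exist in Solr.
final class ConfigurablePingTimeout {
    private static final Logger log =
        Logger.getLogger(ConfigurablePingTimeout.class.getName());

    // Assumed property name for this sketch, not an existing Solr setting.
    static final String TIMEOUT_PROPERTY = "example.recovery.pingTimeoutSeconds";
    static final int DEFAULT_TIMEOUT_SECONDS = 15;

    // Reads the timeout from a system property. Falls back to 15 seconds when
    // the property is unset, not a valid integer, or not positive.
    static int pingTimeoutSeconds() {
        int configured = Integer.getInteger(TIMEOUT_PROPERTY, DEFAULT_TIMEOUT_SECONDS);
        return configured > 0 ? configured : DEFAULT_TIMEOUT_SECONDS;
    }

    // Runs the ping on a worker thread and waits at most the given timeout.
    // Every failure path is logged at error (SEVERE) level instead of info.
    static boolean pingWithTimeout(Callable<Boolean> ping, long timeout, TimeUnit unit) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<Boolean> result = executor.submit(ping);
        try {
            return Boolean.TRUE.equals(result.get(timeout, unit));
        } catch (TimeoutException e) {
            result.cancel(true);
            log.log(Level.SEVERE, "Ping did not complete within " + timeout + " " + unit, e);
            return false;
        } catch (ExecutionException e) {
            log.log(Level.SEVERE, "Ping failed", e.getCause());
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            log.log(Level.SEVERE, "Interrupted while waiting for ping", e);
            return false;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Simulated ping that takes 1.5 seconds, longer than a one-second timeout.
        Callable<Boolean> slowPing = () -> {
            Thread.sleep(1500);
            return true;
        };
        // With a hard-coded one-second timeout the wait is abandoned.
        System.out.println(pingWithTimeout(slowPing, 1, TimeUnit.SECONDS));
        // With the configurable timeout the same ping is given more time.
        System.out.println(pingWithTimeout(slowPing, pingTimeoutSeconds(), TimeUnit.SECONDS));
    }
}
```

With this shape, an operator could raise or lower the timeout at startup, for example with `-Dexample.recovery.pingTimeoutSeconds=30`, without a code change, and a timed-out ping would still be visible when info logging is disabled.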