Description
The following issue was discovered with core recovery:
- Core recovery is not initiated, and the following message is logged:
2019-06-07 00:53:12.436 INFO (recoveryExecutor-4-thread-1-processing-n:<solr_ip>:8983_solr x:<collection_name>_shard41_replica_n2777 c:<collection_name> s:shard41 r:core_node2778) x:<collection_name>_shard41_replica_n2777 o.a.s.c.RecoveryStrategy Failed to connect leader http://<solr_ip>:8983/solr on recovery, try again
- The above error occurs when the ping request to the leader takes longer than the timeout, which is hard-coded to one second in the Solr source code. However, in a typical production setting it is common for a ping to take more than one second, so core recovery never starts and only the message above is logged.
- The other major concern is that this failure is logged at INFO level, so it is very difficult to identify the error if INFO logging is not enabled.
- Please refer to the following snippet from the Solr source code (RecoveryStrategy.java) to understand the issue:
try (HttpSolrClient httpSolrClient = new HttpSolrClient.Builder(leaderReplica.getCoreUrl())
    .withSocketTimeout(1000)
    .withConnectionTimeout(1000)
    .withHttpClient(cc.getUpdateShardHandler().getRecoveryOnlyHttpClient())
    .build()) {
  SolrPingResponse resp = httpSolrClient.ping();
  return leaderReplica;
} catch (IOException e) {
  log.info("Failed to connect leader {} on recovery, try again", leaderReplica.getBaseUrl());
  Thread.sleep(500);
} catch (Exception e) {
  if (e.getCause() instanceof IOException) {
    log.info("Failed to connect leader {} on recovery, try again", leaderReplica.getBaseUrl());
    Thread.sleep(500);
  } else {
    return leaderReplica;
  }
}
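The timeout behaviour described above can be reproduced without Solr. The following is a minimal sketch using only the JDK: a local server accepts the connection but never replies, and the client's read timeout expires with a SocketTimeoutException, which is a subclass of IOException and would therefore land in the first catch block of the snippet. The class and method names (SlowLeaderDemo, readTimesOut) are hypothetical and exist only for this illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical demo class; not part of Solr.
final class SlowLeaderDemo {

    private SlowLeaderDemo() {}

    /**
     * Connects to a local server that never sends a response and reports
     * whether the read gave up after readTimeoutMs milliseconds.
     */
    static boolean readTimesOut(int readTimeoutMs) {
        // The server socket is bound but accept() is never called, so the
        // connection sits in the OS backlog and no data is ever returned.
        try (ServerSocket silentServer =
                     new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket()) {
            client.connect(new InetSocketAddress(
                    silentServer.getInetAddress(), silentServer.getLocalPort()), 1000);
            // Plays the role of the socket timeout in the snippet above.
            client.setSoTimeout(readTimeoutMs);
            try {
                client.getInputStream().read();
                return false; // the server answered or closed the connection
            } catch (SocketTimeoutException e) {
                return true;  // no answer arrived within the timeout
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling SlowLeaderDemo.readTimesOut(200) should return true after roughly 200 ms. A leader that needs longer than the socket timeout to answer a ping looks the same to the client as a server that never answers.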
This issue will have a high impact on production clusters, since cores that cannot recover may lead to data loss.
The following improvements would be really helpful:
1. The timeout for the ping request in RecoveryStrategy.java should be configurable, with the default set to a higher value such as 15 seconds.
2. The messages at line 797 and line 801 in RecoveryStrategy.java should be logged at ERROR level instead of INFO.
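As a rough illustration of the first improvement, the sketch below resolves the ping timeout from a JVM system property and falls back to 15 seconds when the property is missing or invalid. It is not taken from the Solr code base: the class name RecoveryPingSettings and the property name solr.recovery.pingTimeoutMs are assumptions made for this example, and it uses java.util.logging only to stay self-contained.

```java
import java.util.logging.Logger;

// Hypothetical helper; not part of Solr.
final class RecoveryPingSettings {

    // Hypothetical property name, chosen for illustration only.
    static final String TIMEOUT_PROPERTY = "solr.recovery.pingTimeoutMs";

    // Default proposed in this issue: 15 seconds.
    static final int DEFAULT_TIMEOUT_MS = 15_000;

    private static final Logger log =
            Logger.getLogger(RecoveryPingSettings.class.getName());

    private RecoveryPingSettings() {}

    /** Parses a raw setting; falls back to the default when missing or invalid. */
    static int resolveTimeoutMs(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return DEFAULT_TIMEOUT_MS;
        }
        try {
            int value = Integer.parseInt(raw.trim());
            if (value > 0) {
                return value;
            }
        } catch (NumberFormatException e) {
            // Not a number: fall through to the warning and the default.
        }
        log.warning("Invalid value '" + raw + "' for " + TIMEOUT_PROPERTY
                + "; using default of " + DEFAULT_TIMEOUT_MS + " ms");
        return DEFAULT_TIMEOUT_MS;
    }

    /** Reads the timeout from the JVM system property, e.g. -Dsolr.recovery.pingTimeoutMs=30000. */
    static int pingTimeoutMs() {
        return resolveTimeoutMs(System.getProperty(TIMEOUT_PROPERTY));
    }
}
```

Under these assumptions, the value returned by pingTimeoutMs() would replace the literal 1000 passed to withSocketTimeout and withConnectionTimeout in the snippet above. The second improvement would then amount to changing the two log.info calls to log.error.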
Attachments
Issue Links
- relates to SOLR-13457 Managing Timeout values in Solr (Open)