[SOLR-13532] Unable to start core recovery due to timeout in ping request - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 7.6
Fix Version/s: 8.2, 8.3, 9.0
Component/s: SolrCloud
Labels: None

Description

Core recovery fails to start, and the following message is logged:

2019-06-07 00:53:12.436 INFO  (recoveryExecutor-4-thread-1-processing-n:<solr_ip>:8983_solr x:<collection_name>_shard41_replica_n2777 c:<collection_name> s:shard41 r:core_node2778) x:<collection_name>_shard41_replica_n2777 o.a.s.c.RecoveryStrategy Failed to connect leader http://<solr_ip>:8983/solr on recovery, try again

This error occurs when the ping request to the leader takes longer than the timeout, which is hard-coded to one second in the Solr source code. In a typical production setting, however, ping times of more than one second are common, so core recovery never starts.
Another major concern is that this failure is logged at INFO level, so it is very difficult to identify when INFO logging is not enabled.
The following snippet from the source code illustrates the issue.

      try (HttpSolrClient httpSolrClient = new HttpSolrClient.Builder(leaderReplica.getCoreUrl())
          .withSocketTimeout(1000)
          .withConnectionTimeout(1000)
          .withHttpClient(cc.getUpdateShardHandler().getRecoveryOnlyHttpClient())
          .build()) {
        SolrPingResponse resp = httpSolrClient.ping();
        return leaderReplica;
      } catch (IOException e) {
        log.info("Failed to connect leader {} on recovery, try again", leaderReplica.getBaseUrl());
        Thread.sleep(500);
      } catch (Exception e) {
        if (e.getCause() instanceof IOException) {
          log.info("Failed to connect leader {} on recovery, try again", leaderReplica.getBaseUrl());
          Thread.sleep(500);
        } else {
          return leaderReplica;
        }
      }
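
To make the retry decision in the snippet above easier to follow, here is a minimal, JDK-only sketch of the condition it encodes: retry when the failure is an IOException or is directly caused by one, and otherwise treat the leader as reachable. The class and method names (`LeaderPingOutcome`, `shouldRetry`) are hypothetical and are not part of Solr.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;

// Hypothetical helper, not Solr code: mirrors the two catch blocks in the snippet.
final class LeaderPingOutcome {
    private LeaderPingOutcome() {}

    // True when the snippet would log "Failed to connect leader ..." and sleep
    // before trying again; false when it would return the leader replica.
    static boolean shouldRetry(Exception e) {
        return e instanceof IOException || e.getCause() instanceof IOException;
    }

    public static void main(String[] args) {
        // A timeout wrapped in another exception still counts as retryable.
        Exception wrappedTimeout =
            new RuntimeException(new SocketTimeoutException("Read timed out"));
        System.out.println(shouldRetry(wrappedTimeout)); // prints "true"

        // Any other failure is treated as "leader reachable" by the snippet.
        System.out.println(shouldRetry(new IllegalStateException("bad response"))); // prints "false"
    }
}
```

Under this reading, a ping that exceeds the one-second socket timeout always falls into the retry branch, which matches the behavior described in this report.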

This issue has a high impact on production clusters, since cores that cannot recover may lead to data loss.

The following improvements would be helpful:
1. The timeout for the ping request in RecoveryStrategy.java should be configurable, with a higher default such as 15 seconds.
2. The messages logged at line 797 and line 801 in RecoveryStrategy.java should be logged at error level instead of info level.
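
As a rough illustration of the first suggestion, the timeout could be resolved from a property with a 15-second fallback. This is only a sketch under stated assumptions: the property name `recovery.leaderPingTimeoutMs` and the class `LeaderPingTimeout` are hypothetical, and this is not necessarily how the attached patch implements the fix.

```java
import java.util.Properties;

// Hypothetical helper, not Solr code: resolves the leader ping timeout.
final class LeaderPingTimeout {
    // Assumed property name, chosen for illustration only.
    static final String PROP = "recovery.leaderPingTimeoutMs";
    // Default suggested in this report: 15 seconds.
    static final int DEFAULT_MS = 15_000;

    private LeaderPingTimeout() {}

    // Returns the configured timeout in milliseconds, or the default when the
    // property is missing, blank, non-numeric, or not strictly positive.
    static int resolve(Properties props) {
        String raw = props.getProperty(PROP);
        if (raw == null || raw.trim().isEmpty()) {
            return DEFAULT_MS;
        }
        try {
            int value = Integer.parseInt(raw.trim());
            return value > 0 ? value : DEFAULT_MS;
        } catch (NumberFormatException e) {
            return DEFAULT_MS;
        }
    }

    public static void main(String[] args) {
        int timeoutMs = resolve(System.getProperties());
        System.out.println("leader ping timeout (ms): " + timeoutMs);
    }
}
```

In this sketch, the resolved value would take the place of the two `1000` literals passed to `withSocketTimeout` and `withConnectionTimeout` in the snippet above.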

Attachments

SOLR-13532.patch (4 kB), attached 03/Jul/19 22:40 by Chris M. Hostetter

Issue Links

relates to

SOLR-13457 Managing Timeout values in Solr (Open)

links to

GitHub Pull Request #736

GitHub Pull Request #737

GitHub Pull Request #738

People

Assignee: Chris M. Hostetter

Reporter: Suril Shah

Votes: 0

Watchers: 8

Dates

Created: 08/Jun/19 00:31

Updated: 16/Sep/19 00:23

Resolved: 11/Jul/19 23:46

Time Tracking

Estimated: Not Specified

Remaining:

Logged: