Solr / SOLR-13532

Unable to start core recovery due to timeout in ping request

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 7.6
    • Fix Version/s: master (9.0), 8.2, 8.3
    • Component/s: SolrCloud
    • Labels:
      None

      Description

      We discovered the following issue with core recovery:

      • Core recovery does not start, and the following message is logged:
        2019-06-07 00:53:12.436 INFO  (recoveryExecutor-4-thread-1-processing-n:<solr_ip>:8983_solr x:<collection_name>_shard41_replica_n2777 c:<collection_name> s:shard41 r:core_node2778) x:<collection_name>_shard41_replica_n2777 o.a.s.c.RecoveryStrategy Failed to connect leader http://<solr_ip>:8983/solr on recovery, try again
      • The message above is logged when the ping request to the leader takes longer than the timeout, which is hard-coded to one second in the Solr source code. In a typical production setting it is common for a ping to take more than one second, so core recovery never starts and the failure is only logged.
      • The other major concern is that this failure is logged at INFO level, so it is very difficult to identify when INFO logging is not enabled.
      • The following snippet from the source code shows the issue:
            try (HttpSolrClient httpSolrClient = new HttpSolrClient.Builder(leaderReplica.getCoreUrl())
                .withSocketTimeout(1000)
                .withConnectionTimeout(1000)
                .withHttpClient(cc.getUpdateShardHandler().getRecoveryOnlyHttpClient())
                .build()) {
              SolrPingResponse resp = httpSolrClient.ping();
              return leaderReplica;
            } catch (IOException e) {
              log.info("Failed to connect leader {} on recovery, try again", leaderReplica.getBaseUrl());
              Thread.sleep(500);
            } catch (Exception e) {
              if (e.getCause() instanceof IOException) {
                log.info("Failed to connect leader {} on recovery, try again", leaderReplica.getBaseUrl());
                Thread.sleep(500);
              } else {
                return leaderReplica;
              }
            }
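
      The effect of a short timeout can be reproduced outside Solr. The following is a minimal stand-alone sketch that uses only JDK classes, not SolrJ: a local HTTP server delays its reply, and a client with a shorter read timeout gets a SocketTimeoutException, which is a subclass of IOException. The class name SlowPingDemo, the /ping path, and the delay values are hypothetical and chosen for illustration only.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.SocketTimeoutException;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Solr.
final class SlowPingDemo {

  /** Returns true if the ping completed within timeoutMs, false if it timed out. */
  static boolean ping(int serverDelayMs, int timeoutMs) throws IOException {
    // Local server on an ephemeral port that answers /ping after a delay.
    HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
    server.createContext("/ping", exchange -> {
      try {
        Thread.sleep(serverDelayMs); // simulate a slow leader
        byte[] body = "OK".getBytes(StandardCharsets.UTF_8);
        exchange.sendResponseHeaders(200, body.length);
        exchange.getResponseBody().write(body);
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
      } catch (IOException ignored) {
        // The client may already have given up on the request.
      } finally {
        exchange.close();
      }
    });
    server.start();
    try {
      int port = server.getAddress().getPort();
      HttpURLConnection conn = (HttpURLConnection)
          URI.create("http://127.0.0.1:" + port + "/ping").toURL().openConnection();
      conn.setConnectTimeout(timeoutMs);
      conn.setReadTimeout(timeoutMs);
      try (InputStream in = conn.getInputStream()) {
        in.readAllBytes();
        return true;
      } catch (SocketTimeoutException e) {
        // SocketTimeoutException is an IOException, so a catch (IOException e)
        // block like the one in the snippet above would handle it.
        return false;
      } finally {
        conn.disconnect();
      }
    } finally {
      server.stop(0);
    }
  }
}
```

      Under these assumptions, a call such as SlowPingDemo.ping(400, 100) times out, while SlowPingDemo.ping(0, 2000) succeeds.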
      

      This issue has a high impact on production clusters, since cores that cannot recover may lead to data loss.

      The following improvements would be really helpful:
      1. The timeout for the ping request in RecoveryStrategy.java should be configurable, with a higher default such as 15 seconds.
      2. The messages logged at lines 797 and 801 of RecoveryStrategy.java should be logged at ERROR level instead of INFO.
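
      The two improvements could look roughly like the sketch below. It is an illustration under stated assumptions, not the change that was committed to Solr: the class RecoveryPingSettings and the property name solr.recovery.pingTimeoutMs are hypothetical, and java.util.logging is used in place of Solr's own logger so that the example stays self-contained (Level.SEVERE stands in for the ERROR level).

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper, not part of Solr.
final class RecoveryPingSettings {
  // Hypothetical property name, chosen for this sketch only.
  static final String TIMEOUT_PROPERTY = "solr.recovery.pingTimeoutMs";
  // Default of 15 seconds, as proposed in the report.
  static final int DEFAULT_TIMEOUT_MS = 15_000;

  private static final Logger log =
      Logger.getLogger(RecoveryPingSettings.class.getName());

  private RecoveryPingSettings() {}

  /** Returns the configured timeout in ms, or the default when unset, non-numeric, or not positive. */
  static int pingTimeoutMs() {
    // Integer.getInteger returns null when the property is missing or not a number.
    Integer configured = Integer.getInteger(TIMEOUT_PROPERTY);
    if (configured == null || configured <= 0) {
      return DEFAULT_TIMEOUT_MS;
    }
    return configured;
  }

  /** Logs a failed leader ping at the highest level so it is visible without INFO logging. */
  static void logPingFailure(String leaderBaseUrl, Exception e) {
    log.log(Level.SEVERE,
        "Failed to connect leader " + leaderBaseUrl + " on recovery, try again", e);
  }
}
```

      With such a helper, the hard-coded 1000 in the snippet above would be replaced by the value of pingTimeoutMs(), which an operator could raise with, for example, -Dsolr.recovery.pingTimeoutMs=30000.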

              People

              • Assignee: Hoss Man (hossman)
              • Reporter: Suril Shah (surilshah)
              • Votes: 0
              • Watchers: 7

                  Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 0.5h