  1. Solr
  2. SOLR-11881

Retry update requests from leaders to replicas


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 7.5, 8.0
    • Component/s: None
    • Labels: None

    Description

      A connection reset on a leader -> replica update request is causing LIR (leader-initiated recovery).

      If a leader -> replica update gets a connection reset like this, the leader will initiate LIR:

      2018-01-08 17:39:16.980 ERROR (qtp1010030701-468988) [c:collection s:shardX r:core_node56 collection_shardX_replicaY] o.a.s.u.p.DistributedUpdateProcessor Setting up to try to start recovery on replica https://host08.domain:8985/solr/collection_shardX_replicaY/
      java.net.SocketException: Connection reset
              at java.net.SocketInputStream.read(SocketInputStream.java:210)
              at java.net.SocketInputStream.read(SocketInputStream.java:141)
              at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
              at sun.security.ssl.InputRecord.read(InputRecord.java:503)
              at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
              at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
              at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
              at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
              at org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:543)
              at org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:409)
              at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:177)
              at org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:304)
              at org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:611)
              at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:446)
              at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
              at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
              at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
              at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:312)
              at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:185)
              at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
              at java.lang.Thread.run(Thread.java:745)
      

      From https://issues.apache.org/jira/browse/SOLR-6931, Mark says: "On a heavy working SolrCloud cluster, even a rare response like this from a replica can cause a recovery and heavy cluster disruption."

      In SOLR-6931 we added an HTTP retry handler, but it only retries GET requests. Updates are sent as POST requests (see ConcurrentUpdateSolrClient#sendUpdateStream), so they are never retried.
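
      To make the gap concrete, here is a minimal sketch of a method-based retry decision. It is a plain-Java model, not Solr's or HttpClient's actual retry handler: the class name RetryPolicySketch, the method shouldRetry, and the retryVersionedUpdates flag are all hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.net.SocketException;

/** Hypothetical model of a method-based retry policy; not Solr's actual handler. */
public class RetryPolicySketch {

    /** Returns true if a failed request should be sent again. */
    public static boolean shouldRetry(String method, IOException error, int attempt,
                                      int maxRetries, boolean retryVersionedUpdates) {
        if (attempt >= maxRetries) {
            return false; // retry budget exhausted
        }
        if (!(error instanceof SocketException)) {
            return false; // this sketch only considers connection-level failures
        }
        if ("GET".equals(method)) {
            return true; // GET requests are retried
        }
        // POST updates are retried only when the caller opts in (assumed flag).
        return "POST".equals(method) && retryVersionedUpdates;
    }

    public static void main(String[] args) {
        IOException reset = new SocketException("Connection reset");
        System.out.println(shouldRetry("GET", reset, 0, 3, false));  // true
        System.out.println(shouldRetry("POST", reset, 0, 3, false)); // false
        System.out.println(shouldRetry("POST", reset, 0, 3, true));  // true
    }
}
```

      With the flag off, a POST that hits a connection reset is reported as a failure immediately, which is the situation that leads to LIR in the log above.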

      Update requests between the leader and replica should be retry-able since they have been versioned.
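
      The reasoning can be illustrated with a toy model, assuming the replica drops any update whose version is not newer than the one it already holds for that document. Everything here (VersionedRetrySketch, Replica, sendWithRetry, resetsToInject) is hypothetical and only simulates the idea; it is not the code in DistributedUpdateProcessor or in the attached patches.

```java
import java.net.SocketException;
import java.util.HashMap;
import java.util.Map;

/** Toy model: retrying a versioned update is harmless because duplicates are dropped. */
public class VersionedRetrySketch {

    /** Hypothetical replica that keeps the highest version seen per document id. */
    public static class Replica {
        public final Map<String, Long> versions = new HashMap<>();
        public final Map<String, String> docs = new HashMap<>();
        public int resetsToInject; // number of responses to lose after applying
        public int applyCount;     // how many times an update actually changed state

        /** Applies the update if it is newer, then optionally simulates a lost response. */
        public boolean update(String id, long version, String body) throws SocketException {
            boolean applied = false;
            Long current = versions.get(id);
            if (current == null || current < version) {
                versions.put(id, version);
                docs.put(id, body);
                applyCount++;
                applied = true;
            }
            if (resetsToInject > 0) {
                resetsToInject--;
                throw new SocketException("Connection reset"); // leader never sees the reply
            }
            return applied; // false means the replica dropped a duplicate or stale update
        }
    }

    /** Sends the update, retrying on SocketException up to maxRetries extra times. */
    public static boolean sendWithRetry(Replica replica, String id, long version,
                                        String body, int maxRetries) {
        for (int attempt = 0; ; attempt++) {
            try {
                return replica.update(id, version, body);
            } catch (SocketException e) {
                if (attempt >= maxRetries) {
                    // Out of retries: this is where a leader would fall back to recovery.
                    throw new IllegalStateException("update failed after retries", e);
                }
            }
        }
    }

    public static void main(String[] args) {
        Replica replica = new Replica();
        replica.resetsToInject = 1; // first response is lost after the update is applied
        sendWithRetry(replica, "doc1", 5L, "a", 3);
        System.out.println("stored=" + replica.docs.get("doc1")
                + " applyCount=" + replica.applyCount);
    }
}
```

      In this model the first attempt is applied but its response is lost; the retry carries the same version, so the replica drops it and the document is written exactly once.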

      Attachments

        1. SOLR-11881.patch
          47 kB
          Tomas Eduardo Fernandez Lobbe
        2. SOLR-11881-SolrCmdDistributor.patch
          17 kB
          Tomas Eduardo Fernandez Lobbe
        3. SOLR-11881.patch
          2 kB
          Varun Thacker
        4. SOLR-11881.patch
          2 kB
          Varun Thacker


          People

            tflobbe Tomas Eduardo Fernandez Lobbe
            varun Varun Thacker
            Votes: 0
            Watchers: 6
