SOLR-7483: Investigate ways to deal with the tlog growing indefinitely while it's being replayed

Details

    • Type: Task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: SolrCloud
    • Labels: None

    Description

      While trying to track down the data-loss issue I found while testing SOLR-7332, one of my replicas was forced into recovery by the leader due to a network error (I'm over-stressing Solr as part of this test) ...

      In the leader log:

      INFO  - 2015-04-28 21:36:55.096; [perf10x2 shard2 core_node7 perf10x2_shard2_replica2] org.apache.http.impl.client.DefaultRequestDirector; I/O exception (java.net.SocketException) caught when processing request to {}->http://ec2-54-242-70-241.compute-1.amazonaws.com:8985: Broken pipe
      INFO  - 2015-04-28 21:36:55.096; [perf10x2 shard2 core_node7 perf10x2_shard2_replica2] org.apache.http.impl.client.DefaultRequestDirector; Retrying request to {}->http://ec2-54-242-70-241.compute-1.amazonaws.com:8985
      ERROR - 2015-04-28 21:36:55.091; [perf10x2 shard2 core_node7 perf10x2_shard2_replica2] org.apache.solr.update.StreamingSolrClients$1; error
      org.apache.http.NoHttpResponseException: ec2-54-242-70-241.compute-1.amazonaws.com:8985 failed to respond
              at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:143)
              at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
              at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:261)
              at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:283)
              at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:251)
              at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:197)
              at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:272)
              at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:124)
              at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:685)
              at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:487)
              at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
              at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
              at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:107)
              at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
              at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:243)
              at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:148)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
              at java.lang.Thread.run(Thread.java:745)
      

      In the logs on the replica, I see a bunch of failed checksum messages, like:

      WARN  - 2015-04-28 21:38:43.345; [   ] org.apache.solr.handler.IndexFetcher; File _xv.si did not match. expected checksum is 617655777 and actual is checksum 1090588695. expected length is 419 and actual length is 419
      WARN  - 2015-04-28 21:38:43.349; [   ] org.apache.solr.handler.IndexFetcher; File _xv.fnm did not match. expected checksum is 1992662616 and actual is checksum 1632122630. expected length is 1756 and actual length is 1756
      WARN  - 2015-04-28 21:38:43.353; [   ] org.apache.solr.handler.IndexFetcher; File _xv.nvm did not match. expected checksum is 384078655 and actual is checksum 3108095639. expected length is 92 and actual length is 92
      

      This tells me it tried a snapshot pull of the index from the leader ...
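
      To make those WARN lines concrete, here's a minimal sketch of the comparison they imply (plain CRC32 over the file, not Solr's actual IndexFetcher logic; the class and argument names are made up): the local and remote copies of a segment file have the same length but different checksums, so the local copy can't be trusted and the file has to be re-fetched:

      import java.io.IOException;
      import java.io.InputStream;
      import java.nio.file.Files;
      import java.nio.file.Path;
      import java.nio.file.Paths;
      import java.util.zip.CRC32;

      // Illustrative only, not Solr's IndexFetcher. Decides whether a local
      // segment file can be reused or must be pulled again from the leader,
      // based on the same length + checksum comparison the WARN lines report.
      public class SegmentFileCheckSketch {

          /** True when the local copy matches the expected length and checksum. */
          static boolean matches(Path localFile, long expectedLength, long expectedChecksum) throws IOException {
              if (Files.size(localFile) != expectedLength) {
                  return false; // different length: clearly stale
              }
              CRC32 crc = new CRC32();
              try (InputStream in = Files.newInputStream(localFile)) {
                  byte[] buf = new byte[8192];
                  int n;
                  while ((n = in.read(buf)) != -1) {
                      crc.update(buf, 0, n);
                  }
              }
              // Same length but different checksum: the bytes diverged, which is
              // exactly the situation the _xv.* warnings above describe.
              return crc.getValue() == expectedChecksum;
          }

          public static void main(String[] args) throws IOException {
              Path file = Paths.get(args[0]);               // e.g. an _xv.si segment file
              long expectedLength = Long.parseLong(args[1]);
              long expectedChecksum = Long.parseLong(args[2]);
              System.out.println(matches(file, expectedLength, expectedChecksum)
                      ? "local copy reusable" : "must re-fetch from leader");
          }
      }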

      Also, I see the replica started to replay the tlog (presumably the snapshot pull succeeded; of course my logging is set to WARN, so I'm not seeing the full story in the logs):

      WARN  - 2015-04-28 21:38:45.656; [   ] org.apache.solr.update.UpdateLog$LogReplayer; Starting log replay tlog{file=/vol0/cloud85/solr/perf10x2_shard2_replica1/data/tlog/tlog.0000000000000000046 refcount=2} active=true starting pos=56770101
      

      The problem is that the tlog continues to grow and grow while this "replay" is happening ... when I first looked at the tlog it was 769m; a few minutes later it's at 2.2g and still growing, i.e. the leader is still pounding the replica with updates faster than it can replay them.
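
      To illustrate that race, here is a tiny, purely hypothetical simulation (not Solr code; every rate and size in it is an assumption, not a measurement from this cluster): one cursor appends to the tlog at the incoming update rate while the replay cursor advances from the starting position above at a slower rate, so both the file and the unreplayed backlog keep growing:

      // Purely illustrative simulation; the rates and entry size below are
      // made-up assumptions, not measurements from this cluster.
      public class TlogBacklogSketch {

          public static void main(String[] args) {
              final long avgEntryBytes = 1_500;   // assumed average size of one logged update
              final long appendsPerSec = 4_000;   // leader keeps forwarding updates at this rate
              final long replaysPerSec = 2_500;   // replica replays more slowly than it receives

              long appendPos = 56_770_101L;       // end of the tlog (the starting pos from the log line above)
              long replayPos = 56_770_101L;       // replay cursor begins at the same offset

              for (int second = 1; second <= 600; second++) {
                  appendPos += appendsPerSec * avgEntryBytes;  // writer: new updates are still being appended
                  replayPos += replaysPerSec * avgEntryBytes;  // reader: replay makes progress, but more slowly
                  if (second % 120 == 0) {
                      long backlog = appendPos - replayPos;
                      System.out.printf("after %3ds: tlog ~%.2f GB, unreplayed backlog ~%.2f GB%n",
                              second, appendPos / 1e9, backlog / 1e9);
                  }
              }
              // As long as appendsPerSec > replaysPerSec the backlog only grows,
              // which is the "769m, then 2.2g and still growing" behavior described above.
          }
      }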

      The good thing, of course, is that the updates are being persisted to durable storage on the replica, so it's better than if the replica were just marked down. So maybe there isn't much we can do about this, but I wanted to capture a description of this event in a JIRA so we can investigate it further.

People

    Assignee: Unassigned
    Reporter: Timothy Potter (thelabdude)
    Votes: 1
    Watchers: 6
