SOLR-7483: Investigate ways to deal with the tlog growing indefinitely while it's being replayed

Details

    • Type: Task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: SolrCloud
    • Labels: None

    Description

      While trying to track down the data-loss issue I found while testing SOLR-7332, one of my replicas was forced into recovery by the leader due to a network error (I'm over-stressing Solr as part of this test) ...

      In the leader log:

      INFO  - 2015-04-28 21:36:55.096; [perf10x2 shard2 core_node7 perf10x2_shard2_replica2] org.apache.http.impl.client.DefaultRequestDirector; I/O exception (java.net.SocketException) caught when processing request to {}->http://ec2-54-242-70-241.compute-1.amazonaws.com:8985: Broken pipe
      INFO  - 2015-04-28 21:36:55.096; [perf10x2 shard2 core_node7 perf10x2_shard2_replica2] org.apache.http.impl.client.DefaultRequestDirector; Retrying request to {}->http://ec2-54-242-70-241.compute-1.amazonaws.com:8985
      ERROR - 2015-04-28 21:36:55.091; [perf10x2 shard2 core_node7 perf10x2_shard2_replica2] org.apache.solr.update.StreamingSolrClients$1; error
      org.apache.http.NoHttpResponseException: ec2-54-242-70-241.compute-1.amazonaws.com:8985 failed to respond
              at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:143)
              at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
              at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:261)
              at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:283)
              at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:251)
              at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:197)
              at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:272)
              at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:124)
              at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:685)
              at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:487)
              at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
              at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
              at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:107)
              at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
              at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:243)
              at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:148)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
              at java.lang.Thread.run(Thread.java:745)
      

      In the logs on the replica, I see a bunch of failed checksum messages, like:

      WARN  - 2015-04-28 21:38:43.345; [   ] org.apache.solr.handler.IndexFetcher; File _xv.si did not match. expected checksum is 617655777 and actual is checksum 1090588695. expected length is 419 and actual length is 419
      WARN  - 2015-04-28 21:38:43.349; [   ] org.apache.solr.handler.IndexFetcher; File _xv.fnm did not match. expected checksum is 1992662616 and actual is checksum 1632122630. expected length is 1756 and actual length is 1756
      WARN  - 2015-04-28 21:38:43.353; [   ] org.apache.solr.handler.IndexFetcher; File _xv.nvm did not match. expected checksum is 384078655 and actual is checksum 3108095639. expected length is 92 and actual length is 92
      

      This tells me it tried a snapshot pull of the index from the leader ...
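
      To make those WARN lines concrete, here's a minimal sketch of the comparison they imply (plain CRC32 over the file, not Solr's actual IndexFetcher logic; the class and argument names are made up): the local and remote copies of a segment file have the same length but different checksums, so the local copy can't be trusted and the file has to be re-fetched:

      import java.io.IOException;
      import java.io.InputStream;
      import java.nio.file.Files;
      import java.nio.file.Path;
      import java.nio.file.Paths;
      import java.util.zip.CRC32;

      // Illustrative only, not Solr's IndexFetcher. Decides whether a local
      // segment file can be reused or must be pulled again from the leader,
      // based on the same length + checksum comparison the WARN lines report.
      public class SegmentFileCheckSketch {

          /** True when the local copy matches the expected length and checksum. */
          static boolean matches(Path localFile, long expectedLength, long expectedChecksum) throws IOException {
              if (Files.size(localFile) != expectedLength) {
                  return false; // different length: clearly stale
              }
              CRC32 crc = new CRC32();
              try (InputStream in = Files.newInputStream(localFile)) {
                  byte[] buf = new byte[8192];
                  int n;
                  while ((n = in.read(buf)) != -1) {
                      crc.update(buf, 0, n);
                  }
              }
              // Same length but different checksum: the bytes diverged, which is
              // exactly the situation the _xv.* warnings above describe.
              return crc.getValue() == expectedChecksum;
          }

          public static void main(String[] args) throws IOException {
              Path file = Paths.get(args[0]);               // e.g. an _xv.si segment file
              long expectedLength = Long.parseLong(args[1]);
              long expectedChecksum = Long.parseLong(args[2]);
              System.out.println(matches(file, expectedLength, expectedChecksum)
                      ? "local copy reusable" : "must re-fetch from leader");
          }
      }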

      Also, I see the replica started to replay the tlog (presumably the snapshot pull succeeded; of course my logging is set to WARN, so I'm not seeing the full story in the logs):

      WARN  - 2015-04-28 21:38:45.656; [   ] org.apache.solr.update.UpdateLog$LogReplayer; Starting log replay tlog{file=/vol0/cloud85/solr/perf10x2_shard2_replica1/data/tlog/tlog.0000000000000000046 refcount=2} active=true starting pos=56770101
      

      The problem is that the tlog continues to grow and grow while this "replay" is happening ... when I first looked at the tlog it was 769m; a few minutes later it's at 2.2g and still growing, i.e. the leader is still pounding the replica with updates faster than it can replay them.
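
      To illustrate that race, here is a tiny, purely hypothetical simulation (not Solr code; every rate and size in it is an assumption, not a measurement from this cluster): one cursor appends to the tlog at the incoming update rate while the replay cursor advances from the starting position above at a slower rate, so both the file and the unreplayed backlog keep growing:

      // Purely illustrative simulation; the rates and entry size below are
      // made-up assumptions, not measurements from this cluster.
      public class TlogBacklogSketch {

          public static void main(String[] args) {
              final long avgEntryBytes = 1_500;   // assumed average size of one logged update
              final long appendsPerSec = 4_000;   // leader keeps forwarding updates at this rate
              final long replaysPerSec = 2_500;   // replica replays more slowly than it receives

              long appendPos = 56_770_101L;       // end of the tlog (the starting pos from the log line above)
              long replayPos = 56_770_101L;       // replay cursor begins at the same offset

              for (int second = 1; second <= 600; second++) {
                  appendPos += appendsPerSec * avgEntryBytes;  // writer: new updates are still being appended
                  replayPos += replaysPerSec * avgEntryBytes;  // reader: replay makes progress, but more slowly
                  if (second % 120 == 0) {
                      long backlog = appendPos - replayPos;
                      System.out.printf("after %3ds: tlog ~%.2f GB, unreplayed backlog ~%.2f GB%n",
                              second, appendPos / 1e9, backlog / 1e9);
                  }
              }
              // As long as appendsPerSec > replaysPerSec the backlog only grows,
              // which is the "769m, then 2.2g and still growing" behavior described above.
          }
      }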

      The good thing, of course, is that the updates are being persisted to durable storage on the replica, so it's better than if the replica were just marked down. So maybe there isn't much we can do about this, but I wanted to capture a description of this event in a JIRA so we can investigate it further.

People

    Assignee: Unassigned
    Reporter: Timothy Potter (thelabdude)
    Votes: 1
    Watchers: 6
