Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Duplicate
-
None
-
None
-
None
-
None
Description
When a node is recovering, the leader can meet an exception when trying to send an update to the buffering node. So the leader will try running LIR process: first set the node's state to DOWN, then send recovery OP to the node.
In the same time, PrepRecoveryOp will make the leader wait for a very long time to see the node's state is RECOVERING.
This scenario can easily be achieved by using this test
String collection = "collection2"; CollectionAdminRequest .createCollection(collection, "config", 1, 2) .setMaxShardsPerNode(1) .process(cluster.getSolrClient()); AbstractDistribZkTestBase.waitForRecoveriesToFinish(collection, cluster.getSolrClient().getZkStateReader(), false, true, 30); CloudSolrClient cloudClient = cluster.getSolrClient(); DocCollection docCollection = cloudClient.getZkStateReader().getClusterState().getCollection(collection); Slice slice = docCollection.getSlice("shard1"); Replica replicaNode = slice.getReplicas(replica -> replica != slice.getLeader()).get(0); JettySolrRunner replicaRunner = cluster.getReplicaJetty(replicaNode); new UpdateRequest() .add(sdoc("id", "1")) .process(cloudClient, collection); ChaosMonkey.stop(replicaRunner); new UpdateRequest() .add(sdoc("id", "2")) .process(cloudClient, collection); ChaosMonkey.start(replicaRunner); new UpdateRequest() .add(sdoc("id", "3")) .process(cloudClient, collection); AbstractDistribZkTestBase.waitForRecoveriesToFinish(collection, cluster.getSolrClient().getZkStateReader(), false, true, 60); CollectionAdminRequest .deleteCollection(collection) .process(cloudClient);
Here are some log
10942 INFO (searcherExecutor-53-thread-1-processing-n:127.0.0.1:36014_solr x:collection1_shard2_replica2 s:shard2 c:collection1 r:core_node4) [n:127.0.0.1:36014_solr c:collection1 s:shard2 r:core_node4 x:collection1_shard2_replica2] o.a.s.c.SolrCore [collection1_shard2_replica2] Registered new searcher Searcher@510d1ea5[collection1_shard2_replica2] main{ExitableDirectoryReader(UninvertingDirectoryReader())} 10942 INFO (searcherExecutor-56-thread-1-processing-n:127.0.0.1:36014_solr x:collection1_shard1_replica2 s:shard1 c:collection1 r:core_node3) [n:127.0.0.1:36014_solr c:collection1 s:shard1 r:core_node3 x:collection1_shard1_replica2] o.a.s.c.SolrCore [collection1_shard1_replica2] Registered new searcher Searcher@505d7839[collection1_shard1_replica2] main{ExitableDirectoryReader(UninvertingDirectoryReader())} 10958 INFO (coreZkRegister-47-thread-1-processing-n:127.0.0.1:36014_solr x:collection1_shard2_replica2 s:shard2 c:collection1 r:core_node4) [n:127.0.0.1:36014_solr c:collection1 s:shard2 r:core_node4 x:collection1_shard2_replica2] o.a.s.c.ZkController Core needs to recover:collection1_shard2_replica2 10964 INFO (updateExecutor-20-thread-1-processing-n:127.0.0.1:36014_solr x:collection1_shard2_replica2 s:shard2 c:collection1 r:core_node4) [n:127.0.0.1:36014_solr c:collection1 s:shard2 r:core_node4 x:collection1_shard2_replica2] o.a.s.u.DefaultSolrCoreState Running recovery 10970 INFO (recoveryExecutor-21-thread-1-processing-n:127.0.0.1:36014_solr x:collection1_shard2_replica2 s:shard2 c:collection1 r:core_node4) [n:127.0.0.1:36014_solr c:collection1 s:shard2 r:core_node4 x:collection1_shard2_replica2] o.a.s.c.RecoveryStrategy Starting recovery process. recoveringAfterStartup=true 10971 INFO (recoveryExecutor-21-thread-1-processing-n:127.0.0.1:36014_solr x:collection1_shard2_replica2 s:shard2 c:collection1 r:core_node4) [n:127.0.0.1:36014_solr c:collection1 s:shard2 r:core_node4 x:collection1_shard2_replica2] o.a.s.c.RecoveryStrategy ###### startupVersions=[[-1556034709770731520]] 10971 INFO (recoveryExecutor-21-thread-1-processing-n:127.0.0.1:36014_solr x:collection1_shard2_replica2 s:shard2 c:collection1 r:core_node4) [n:127.0.0.1:36014_solr c:collection1 s:shard2 r:core_node4 x:collection1_shard2_replica2] o.a.s.c.RecoveryStrategy Begin buffering updates. core=[collection1_shard2_replica2] 10971 INFO (recoveryExecutor-21-thread-1-processing-n:127.0.0.1:36014_solr x:collection1_shard2_replica2 s:shard2 c:collection1 r:core_node4) [n:127.0.0.1:36014_solr c:collection1 s:shard2 r:core_node4 x:collection1_shard2_replica2] o.a.s.u.UpdateLog Starting to buffer updates. FSUpdateLog{state=ACTIVE, tlog=null} 10971 INFO (recoveryExecutor-21-thread-1-processing-n:127.0.0.1:36014_solr x:collection1_shard2_replica2 s:shard2 c:collection1 r:core_node4) [n:127.0.0.1:36014_solr c:collection1 s:shard2 r:core_node4 x:collection1_shard2_replica2] o.a.s.c.RecoveryStrategy Publishing state of core [collection1_shard2_replica2] as recovering, leader is [http://127.0.0.1:42549/solr/collection1_shard2_replica1/] and I am [http://127.0.0.1:36014/solr/collection1_shard2_replica2/] 11129 INFO (recoveryExecutor-21-thread-1-processing-n:127.0.0.1:36014_solr x:collection1_shard2_replica2 s:shard2 c:collection1 r:core_node4) [n:127.0.0.1:36014_solr c:collection1 s:shard2 r:core_node4 x:collection1_shard2_replica2] o.a.s.c.RecoveryStrategy Sending prep recovery command to [http://127.0.0.1:42549/solr]; [WaitForState: action=PREPRECOVERY&core=collection1_shard2_replica1&nodeName=127.0.0.1:36014_solr&coreNodeName=core_node4&state=recovering&checkLive=true&onlyIfLeader=true&onlyIfLeaderActive=true] 11131 INFO (zkCallback-12-thread-2-processing-n:127.0.0.1:42549_solr) [n:127.0.0.1:42549_solr ] o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/collection1/state.json] for collection [collection1] has occurred - updating... (live nodes size: [2]) 11131 INFO (zkCallback-23-thread-1-processing-n:127.0.0.1:36014_solr) [n:127.0.0.1:36014_solr ] o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/collection1/state.json] for collection [collection1] has occurred - updating... (live nodes size: [2]) 11133 INFO (qtp344759431-38) [n:127.0.0.1:42549_solr ] o.a.s.h.a.PrepRecoveryOp Going to wait for coreNodeName: core_node4, state: recovering, checkLive: true, onlyIfLeader: true, onlyIfLeaderActive: true 11140 INFO (qtp344759431-38) [n:127.0.0.1:42549_solr ] o.a.s.h.a.PrepRecoveryOp Will wait a max of 183 seconds to see collection1_shard2_replica1 (shard2 of collection1) have state: recovering 11140 INFO (qtp344759431-38) [n:127.0.0.1:42549_solr ] o.a.s.h.a.PrepRecoveryOp In WaitForState(recovering): collection=collection1, shard=shard2, thisCore=collection1_shard2_replica1, leaderDoesNotNeedRecovery=false, isLeader? true, live=true, checkLive=true, currentState=recovering, localState=active, nodeName=127.0.0.1:36014_solr, coreNodeName=core_node4, onlyIfActiveCheckResult=false, nodeProps: core_node4:{"core":"collection1_shard2_replica2","base_url":"http://127.0.0.1:36014/solr","node_name":"127.0.0.1:36014_solr","state":"recovering"} 11140 INFO (qtp344759431-38) [n:127.0.0.1:42549_solr ] o.a.s.h.a.PrepRecoveryOp Waited coreNodeName: core_node4, state: recovering, checkLive: true, onlyIfLeader: true for: 0 seconds. 11140 INFO (qtp344759431-38) [n:127.0.0.1:42549_solr ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/cores params={nodeName=127.0.0.1:36014_solr&onlyIfLeaderActive=true&core=collection1_shard2_replica1&coreNodeName=core_node4&action=PREPRECOVERY&checkLive=true&state=recovering&onlyIfLeader=true&wt=javabin&version=2} status=0 QTime=7 11160 INFO (coreLoadExecutor-52-thread-3-processing-n:127.0.0.1:36014_solr) [n:127.0.0.1:36014_solr c:collection2 s:shard1 r:core_node1 x:collection2_shard1_replica2] o.a.s.s.SolrIndexSearcher Opening [Searcher@28717b92[collection2_shard1_replica2] main] 11161 INFO (coreLoadExecutor-52-thread-3-processing-n:127.0.0.1:36014_solr) [n:127.0.0.1:36014_solr c:collection2 s:shard1 r:core_node1 x:collection2_shard1_replica2] o.a.s.r.ManagedResourceStorage Configured ZooKeeperStorageIO with znodeBase: /configs/config 11162 INFO (coreLoadExecutor-52-thread-3-processing-n:127.0.0.1:36014_solr) [n:127.0.0.1:36014_solr c:collection2 s:shard1 r:core_node1 x:collection2_shard1_replica2] o.a.s.r.ManagedResourceStorage Loaded null at path _rest_managed.json using ZooKeeperStorageIO:path=/configs/config 11162 INFO (coreLoadExecutor-52-thread-3-processing-n:127.0.0.1:36014_solr) [n:127.0.0.1:36014_solr c:collection2 s:shard1 r:core_node1 x:collection2_shard1_replica2] o.a.s.h.ReplicationHandler Commits will be reserved for 10000 11162 INFO (searcherExecutor-54-thread-1-processing-n:127.0.0.1:36014_solr x:collection2_shard1_replica2 s:shard1 c:collection2 r:core_node1) [n:127.0.0.1:36014_solr c:collection2 s:shard1 r:core_node1 x:collection2_shard1_replica2] o.a.s.c.SolrCore [collection2_shard1_replica2] Registered new searcher Searcher@28717b92[collection2_shard1_replica2] main{ExitableDirectoryReader(UninvertingDirectoryReader(Uninverting(_0(7.0.0):C1)))} 11168 INFO (coreZkRegister-47-thread-1-processing-n:127.0.0.1:36014_solr x:collection2_shard1_replica2 s:shard1 c:collection2 r:core_node1) [n:127.0.0.1:36014_solr c:collection2 s:shard1 r:core_node1 x:collection2_shard1_replica2] o.a.s.c.ZkController Core needs to recover:collection2_shard1_replica2 11168 INFO (updateExecutor-20-thread-2-processing-n:127.0.0.1:36014_solr x:collection2_shard1_replica2 s:shard1 c:collection2 r:core_node1) [n:127.0.0.1:36014_solr c:collection2 s:shard1 r:core_node1 x:collection2_shard1_replica2] o.a.s.u.DefaultSolrCoreState Running recovery 11168 INFO (recoveryExecutor-21-thread-2-processing-n:127.0.0.1:36014_solr x:collection2_shard1_replica2 s:shard1 c:collection2 r:core_node1) [n:127.0.0.1:36014_solr c:collection2 s:shard1 r:core_node1 x:collection2_shard1_replica2] o.a.s.c.RecoveryStrategy Starting recovery process. recoveringAfterStartup=true 11169 INFO (recoveryExecutor-21-thread-2-processing-n:127.0.0.1:36014_solr x:collection2_shard1_replica2 s:shard1 c:collection2 r:core_node1) [n:127.0.0.1:36014_solr c:collection2 s:shard1 r:core_node1 x:collection2_shard1_replica2] o.a.s.c.RecoveryStrategy ###### startupVersions=[[1556034713508904960]] 11169 INFO (recoveryExecutor-21-thread-2-processing-n:127.0.0.1:36014_solr x:collection2_shard1_replica2 s:shard1 c:collection2 r:core_node1) [n:127.0.0.1:36014_solr c:collection2 s:shard1 r:core_node1 x:collection2_shard1_replica2] o.a.s.c.RecoveryStrategy Begin buffering updates. core=[collection2_shard1_replica2] 11169 INFO (recoveryExecutor-21-thread-2-processing-n:127.0.0.1:36014_solr x:collection2_shard1_replica2 s:shard1 c:collection2 r:core_node1) [n:127.0.0.1:36014_solr c:collection2 s:shard1 r:core_node1 x:collection2_shard1_replica2] o.a.s.u.UpdateLog Starting to buffer updates. FSUpdateLog{state=ACTIVE, tlog=null} 11169 INFO (recoveryExecutor-21-thread-2-processing-n:127.0.0.1:36014_solr x:collection2_shard1_replica2 s:shard1 c:collection2 r:core_node1) [n:127.0.0.1:36014_solr c:collection2 s:shard1 r:core_node1 x:collection2_shard1_replica2] o.a.s.c.RecoveryStrategy Publishing state of core [collection2_shard1_replica2] as recovering, leader is [http://127.0.0.1:42549/solr/collection2_shard1_replica1/] and I am [http://127.0.0.1:36014/solr/collection2_shard1_replica2/] 11171 INFO (recoveryExecutor-21-thread-2-processing-n:127.0.0.1:36014_solr x:collection2_shard1_replica2 s:shard1 c:collection2 r:core_node1) [n:127.0.0.1:36014_solr c:collection2 s:shard1 r:core_node1 x:collection2_shard1_replica2] o.a.s.c.RecoveryStrategy Sending prep recovery command to [http://127.0.0.1:42549/solr]; [WaitForState: action=PREPRECOVERY&core=collection2_shard1_replica1&nodeName=127.0.0.1:36014_solr&coreNodeName=core_node1&state=recovering&checkLive=true&onlyIfLeader=true&onlyIfLeaderActive=true] 11174 INFO (qtp344759431-37) [n:127.0.0.1:42549_solr ] o.a.s.h.a.PrepRecoveryOp Going to wait for coreNodeName: core_node1, state: recovering, checkLive: true, onlyIfLeader: true, onlyIfLeaderActive: true 11175 INFO (qtp344759431-37) [n:127.0.0.1:42549_solr ] o.a.s.h.a.PrepRecoveryOp Will wait a max of 183 seconds to see collection2_shard1_replica1 (shard1 of collection2) have state: recovering 11175 INFO (qtp344759431-37) [n:127.0.0.1:42549_solr ] o.a.s.h.a.PrepRecoveryOp In WaitForState(recovering): collection=collection2, shard=shard1, thisCore=collection2_shard1_replica1, leaderDoesNotNeedRecovery=false, isLeader? true, live=true, checkLive=true, currentState=down, localState=active, nodeName=127.0.0.1:36014_solr, coreNodeName=core_node1, onlyIfActiveCheckResult=false, nodeProps: core_node1:{"core":"collection2_shard1_replica2","base_url":"http://127.0.0.1:36014/solr","node_name":"127.0.0.1:36014_solr","state":"down"} 11278 INFO (zkCallback-23-thread-1-processing-n:127.0.0.1:36014_solr) [n:127.0.0.1:36014_solr ] o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/collection2/state.json] for collection [collection2] has occurred - updating... (live nodes size: [2]) 11278 INFO (zkCallback-12-thread-2-processing-n:127.0.0.1:42549_solr) [n:127.0.0.1:42549_solr ] o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/collection2/state.json] for collection [collection2] has occurred - updating... (live nodes size: [2]) 11323 ERROR (updateExecutor-7-thread-1-processing-http:////127.0.0.1:36014//solr//collection2_shard1_replica2 x:collection2_shard1_replica1 r:core_node2 n:127.0.0.1:42549_solr s:shard1 c:collection2) [n:127.0.0.1:42549_solr c:collection2 s:shard1 r:core_node2 x:collection2_shard1_replica1] o.a.s.u.StreamingSolrClients error java.net.SocketException: Broken pipe (Write failed) at java.net.SocketOutputStream.socketWrite0(Native Method) at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109) at java.net.SocketOutputStream.write(SocketOutputStream.java:153) at org.apache.http.impl.io.SessionOutputBufferImpl.streamWrite(SessionOutputBufferImpl.java:126) at org.apache.http.impl.io.SessionOutputBufferImpl.flushBuffer(SessionOutputBufferImpl.java:138) at org.apache.http.impl.io.SessionOutputBufferImpl.flush(SessionOutputBufferImpl.java:146) at org.apache.http.impl.io.ChunkedOutputStream.close(ChunkedOutputStream.java:205) at org.apache.http.impl.DefaultBHttpClientConnection.sendRequestEntity(DefaultBHttpClientConnection.java:159) at org.apache.http.impl.conn.CPoolProxy.sendRequestEntity(CPoolProxy.java:162) at org.apache.http.protocol.HttpRequestExecutor.doSendRequest(HttpRequestExecutor.java:237) at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:122) at org.apache.solr.util.stats.InstrumentedHttpRequestExecutor.execute(InstrumentedHttpRequestExecutor.java:61) at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:271) at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184) at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:88) at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184) at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82) at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55) at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:323) at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:186) at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176) at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745)
Attachments
Attachments
Issue Links
- duplicates
-
SOLR-11702 Redesign current LIR implementation
- Closed
- is related to
-
SOLR-9555 Leader incorrectly publishes state for replica when it puts replica into LIR.
- Closed