[SOLR-9945] LIR should check the node is recovering before bring it down - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Duplicate
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels:
None

Description

When a node is recovering, the leader can meet an exception when trying to send an update to the buffering node. So the leader will try running LIR process: first set the node's state to DOWN, then send recovery OP to the node.
In the same time, PrepRecoveryOp will make the leader wait for a very long time to see the node's state is RECOVERING.
This scenario can easily be achieved by using this test

String collection = "collection2";
CollectionAdminRequest
    .createCollection(collection, "config", 1, 2)
    .setMaxShardsPerNode(1)
    .process(cluster.getSolrClient());
AbstractDistribZkTestBase.waitForRecoveriesToFinish(collection, cluster.getSolrClient().getZkStateReader(),
    false, true, 30);
CloudSolrClient cloudClient = cluster.getSolrClient();

DocCollection docCollection = cloudClient.getZkStateReader().getClusterState().getCollection(collection);
Slice slice = docCollection.getSlice("shard1");
Replica replicaNode = slice.getReplicas(replica -> replica != slice.getLeader()).get(0);
JettySolrRunner replicaRunner = cluster.getReplicaJetty(replicaNode);

new UpdateRequest()
    .add(sdoc("id", "1"))
    .process(cloudClient, collection);
ChaosMonkey.stop(replicaRunner);
new UpdateRequest()
    .add(sdoc("id", "2"))
    .process(cloudClient, collection);
ChaosMonkey.start(replicaRunner);
new UpdateRequest()
    .add(sdoc("id", "3"))
    .process(cloudClient, collection);
AbstractDistribZkTestBase.waitForRecoveriesToFinish(collection, cluster.getSolrClient().getZkStateReader(),
    false, true, 60);
CollectionAdminRequest
    .deleteCollection(collection)
    .process(cloudClient);

Here are some log

10942 INFO  (searcherExecutor-53-thread-1-processing-n:127.0.0.1:36014_solr x:collection1_shard2_replica2 s:shard2 c:collection1 r:core_node4) [n:127.0.0.1:36014_solr c:collection1 s:shard2 r:core_node4 x:collection1_shard2_replica2] o.a.s.c.SolrCore [collection1_shard2_replica2] Registered new searcher Searcher@510d1ea5[collection1_shard2_replica2] main{ExitableDirectoryReader(UninvertingDirectoryReader())}
10942 INFO  (searcherExecutor-56-thread-1-processing-n:127.0.0.1:36014_solr x:collection1_shard1_replica2 s:shard1 c:collection1 r:core_node3) [n:127.0.0.1:36014_solr c:collection1 s:shard1 r:core_node3 x:collection1_shard1_replica2] o.a.s.c.SolrCore [collection1_shard1_replica2] Registered new searcher Searcher@505d7839[collection1_shard1_replica2] main{ExitableDirectoryReader(UninvertingDirectoryReader())}
10958 INFO  (coreZkRegister-47-thread-1-processing-n:127.0.0.1:36014_solr x:collection1_shard2_replica2 s:shard2 c:collection1 r:core_node4) [n:127.0.0.1:36014_solr c:collection1 s:shard2 r:core_node4 x:collection1_shard2_replica2] o.a.s.c.ZkController Core needs to recover:collection1_shard2_replica2
10964 INFO  (updateExecutor-20-thread-1-processing-n:127.0.0.1:36014_solr x:collection1_shard2_replica2 s:shard2 c:collection1 r:core_node4) [n:127.0.0.1:36014_solr c:collection1 s:shard2 r:core_node4 x:collection1_shard2_replica2] o.a.s.u.DefaultSolrCoreState Running recovery
10970 INFO  (recoveryExecutor-21-thread-1-processing-n:127.0.0.1:36014_solr x:collection1_shard2_replica2 s:shard2 c:collection1 r:core_node4) [n:127.0.0.1:36014_solr c:collection1 s:shard2 r:core_node4 x:collection1_shard2_replica2] o.a.s.c.RecoveryStrategy Starting recovery process. recoveringAfterStartup=true
10971 INFO  (recoveryExecutor-21-thread-1-processing-n:127.0.0.1:36014_solr x:collection1_shard2_replica2 s:shard2 c:collection1 r:core_node4) [n:127.0.0.1:36014_solr c:collection1 s:shard2 r:core_node4 x:collection1_shard2_replica2] o.a.s.c.RecoveryStrategy ###### startupVersions=[[-1556034709770731520]]
10971 INFO  (recoveryExecutor-21-thread-1-processing-n:127.0.0.1:36014_solr x:collection1_shard2_replica2 s:shard2 c:collection1 r:core_node4) [n:127.0.0.1:36014_solr c:collection1 s:shard2 r:core_node4 x:collection1_shard2_replica2] o.a.s.c.RecoveryStrategy Begin buffering updates. core=[collection1_shard2_replica2]
10971 INFO  (recoveryExecutor-21-thread-1-processing-n:127.0.0.1:36014_solr x:collection1_shard2_replica2 s:shard2 c:collection1 r:core_node4) [n:127.0.0.1:36014_solr c:collection1 s:shard2 r:core_node4 x:collection1_shard2_replica2] o.a.s.u.UpdateLog Starting to buffer updates. FSUpdateLog{state=ACTIVE, tlog=null}
10971 INFO  (recoveryExecutor-21-thread-1-processing-n:127.0.0.1:36014_solr x:collection1_shard2_replica2 s:shard2 c:collection1 r:core_node4) [n:127.0.0.1:36014_solr c:collection1 s:shard2 r:core_node4 x:collection1_shard2_replica2] o.a.s.c.RecoveryStrategy Publishing state of core [collection1_shard2_replica2] as recovering, leader is [http://127.0.0.1:42549/solr/collection1_shard2_replica1/] and I am [http://127.0.0.1:36014/solr/collection1_shard2_replica2/]
11129 INFO  (recoveryExecutor-21-thread-1-processing-n:127.0.0.1:36014_solr x:collection1_shard2_replica2 s:shard2 c:collection1 r:core_node4) [n:127.0.0.1:36014_solr c:collection1 s:shard2 r:core_node4 x:collection1_shard2_replica2] o.a.s.c.RecoveryStrategy Sending prep recovery command to [http://127.0.0.1:42549/solr]; [WaitForState: action=PREPRECOVERY&core=collection1_shard2_replica1&nodeName=127.0.0.1:36014_solr&coreNodeName=core_node4&state=recovering&checkLive=true&onlyIfLeader=true&onlyIfLeaderActive=true]
11131 INFO  (zkCallback-12-thread-2-processing-n:127.0.0.1:42549_solr) [n:127.0.0.1:42549_solr    ] o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/collection1/state.json] for collection [collection1] has occurred - updating... (live nodes size: [2])
11131 INFO  (zkCallback-23-thread-1-processing-n:127.0.0.1:36014_solr) [n:127.0.0.1:36014_solr    ] o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/collection1/state.json] for collection [collection1] has occurred - updating... (live nodes size: [2])
11133 INFO  (qtp344759431-38) [n:127.0.0.1:42549_solr    ] o.a.s.h.a.PrepRecoveryOp Going to wait for coreNodeName: core_node4, state: recovering, checkLive: true, onlyIfLeader: true, onlyIfLeaderActive: true
11140 INFO  (qtp344759431-38) [n:127.0.0.1:42549_solr    ] o.a.s.h.a.PrepRecoveryOp Will wait a max of 183 seconds to see collection1_shard2_replica1 (shard2 of collection1) have state: recovering
11140 INFO  (qtp344759431-38) [n:127.0.0.1:42549_solr    ] o.a.s.h.a.PrepRecoveryOp In WaitForState(recovering): collection=collection1, shard=shard2, thisCore=collection1_shard2_replica1, leaderDoesNotNeedRecovery=false, isLeader? true, live=true, checkLive=true, currentState=recovering, localState=active, nodeName=127.0.0.1:36014_solr, coreNodeName=core_node4, onlyIfActiveCheckResult=false, nodeProps: core_node4:{"core":"collection1_shard2_replica2","base_url":"http://127.0.0.1:36014/solr","node_name":"127.0.0.1:36014_solr","state":"recovering"}
11140 INFO  (qtp344759431-38) [n:127.0.0.1:42549_solr    ] o.a.s.h.a.PrepRecoveryOp Waited coreNodeName: core_node4, state: recovering, checkLive: true, onlyIfLeader: true for: 0 seconds.
11140 INFO  (qtp344759431-38) [n:127.0.0.1:42549_solr    ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/cores params={nodeName=127.0.0.1:36014_solr&onlyIfLeaderActive=true&core=collection1_shard2_replica1&coreNodeName=core_node4&action=PREPRECOVERY&checkLive=true&state=recovering&onlyIfLeader=true&wt=javabin&version=2} status=0 QTime=7
11160 INFO  (coreLoadExecutor-52-thread-3-processing-n:127.0.0.1:36014_solr) [n:127.0.0.1:36014_solr c:collection2 s:shard1 r:core_node1 x:collection2_shard1_replica2] o.a.s.s.SolrIndexSearcher Opening [Searcher@28717b92[collection2_shard1_replica2] main]
11161 INFO  (coreLoadExecutor-52-thread-3-processing-n:127.0.0.1:36014_solr) [n:127.0.0.1:36014_solr c:collection2 s:shard1 r:core_node1 x:collection2_shard1_replica2] o.a.s.r.ManagedResourceStorage Configured ZooKeeperStorageIO with znodeBase: /configs/config
11162 INFO  (coreLoadExecutor-52-thread-3-processing-n:127.0.0.1:36014_solr) [n:127.0.0.1:36014_solr c:collection2 s:shard1 r:core_node1 x:collection2_shard1_replica2] o.a.s.r.ManagedResourceStorage Loaded null at path _rest_managed.json using ZooKeeperStorageIO:path=/configs/config
11162 INFO  (coreLoadExecutor-52-thread-3-processing-n:127.0.0.1:36014_solr) [n:127.0.0.1:36014_solr c:collection2 s:shard1 r:core_node1 x:collection2_shard1_replica2] o.a.s.h.ReplicationHandler Commits will be reserved for  10000
11162 INFO  (searcherExecutor-54-thread-1-processing-n:127.0.0.1:36014_solr x:collection2_shard1_replica2 s:shard1 c:collection2 r:core_node1) [n:127.0.0.1:36014_solr c:collection2 s:shard1 r:core_node1 x:collection2_shard1_replica2] o.a.s.c.SolrCore [collection2_shard1_replica2] Registered new searcher Searcher@28717b92[collection2_shard1_replica2] main{ExitableDirectoryReader(UninvertingDirectoryReader(Uninverting(_0(7.0.0):C1)))}
11168 INFO  (coreZkRegister-47-thread-1-processing-n:127.0.0.1:36014_solr x:collection2_shard1_replica2 s:shard1 c:collection2 r:core_node1) [n:127.0.0.1:36014_solr c:collection2 s:shard1 r:core_node1 x:collection2_shard1_replica2] o.a.s.c.ZkController Core needs to recover:collection2_shard1_replica2
11168 INFO  (updateExecutor-20-thread-2-processing-n:127.0.0.1:36014_solr x:collection2_shard1_replica2 s:shard1 c:collection2 r:core_node1) [n:127.0.0.1:36014_solr c:collection2 s:shard1 r:core_node1 x:collection2_shard1_replica2] o.a.s.u.DefaultSolrCoreState Running recovery
11168 INFO  (recoveryExecutor-21-thread-2-processing-n:127.0.0.1:36014_solr x:collection2_shard1_replica2 s:shard1 c:collection2 r:core_node1) [n:127.0.0.1:36014_solr c:collection2 s:shard1 r:core_node1 x:collection2_shard1_replica2] o.a.s.c.RecoveryStrategy Starting recovery process. recoveringAfterStartup=true
11169 INFO  (recoveryExecutor-21-thread-2-processing-n:127.0.0.1:36014_solr x:collection2_shard1_replica2 s:shard1 c:collection2 r:core_node1) [n:127.0.0.1:36014_solr c:collection2 s:shard1 r:core_node1 x:collection2_shard1_replica2] o.a.s.c.RecoveryStrategy ###### startupVersions=[[1556034713508904960]]
11169 INFO  (recoveryExecutor-21-thread-2-processing-n:127.0.0.1:36014_solr x:collection2_shard1_replica2 s:shard1 c:collection2 r:core_node1) [n:127.0.0.1:36014_solr c:collection2 s:shard1 r:core_node1 x:collection2_shard1_replica2] o.a.s.c.RecoveryStrategy Begin buffering updates. core=[collection2_shard1_replica2]
11169 INFO  (recoveryExecutor-21-thread-2-processing-n:127.0.0.1:36014_solr x:collection2_shard1_replica2 s:shard1 c:collection2 r:core_node1) [n:127.0.0.1:36014_solr c:collection2 s:shard1 r:core_node1 x:collection2_shard1_replica2] o.a.s.u.UpdateLog Starting to buffer updates. FSUpdateLog{state=ACTIVE, tlog=null}
11169 INFO  (recoveryExecutor-21-thread-2-processing-n:127.0.0.1:36014_solr x:collection2_shard1_replica2 s:shard1 c:collection2 r:core_node1) [n:127.0.0.1:36014_solr c:collection2 s:shard1 r:core_node1 x:collection2_shard1_replica2] o.a.s.c.RecoveryStrategy Publishing state of core [collection2_shard1_replica2] as recovering, leader is [http://127.0.0.1:42549/solr/collection2_shard1_replica1/] and I am [http://127.0.0.1:36014/solr/collection2_shard1_replica2/]
11171 INFO  (recoveryExecutor-21-thread-2-processing-n:127.0.0.1:36014_solr x:collection2_shard1_replica2 s:shard1 c:collection2 r:core_node1) [n:127.0.0.1:36014_solr c:collection2 s:shard1 r:core_node1 x:collection2_shard1_replica2] o.a.s.c.RecoveryStrategy Sending prep recovery command to [http://127.0.0.1:42549/solr]; [WaitForState: action=PREPRECOVERY&core=collection2_shard1_replica1&nodeName=127.0.0.1:36014_solr&coreNodeName=core_node1&state=recovering&checkLive=true&onlyIfLeader=true&onlyIfLeaderActive=true]
11174 INFO  (qtp344759431-37) [n:127.0.0.1:42549_solr    ] o.a.s.h.a.PrepRecoveryOp Going to wait for coreNodeName: core_node1, state: recovering, checkLive: true, onlyIfLeader: true, onlyIfLeaderActive: true
11175 INFO  (qtp344759431-37) [n:127.0.0.1:42549_solr    ] o.a.s.h.a.PrepRecoveryOp Will wait a max of 183 seconds to see collection2_shard1_replica1 (shard1 of collection2) have state: recovering
11175 INFO  (qtp344759431-37) [n:127.0.0.1:42549_solr    ] o.a.s.h.a.PrepRecoveryOp In WaitForState(recovering): collection=collection2, shard=shard1, thisCore=collection2_shard1_replica1, leaderDoesNotNeedRecovery=false, isLeader? true, live=true, checkLive=true, currentState=down, localState=active, nodeName=127.0.0.1:36014_solr, coreNodeName=core_node1, onlyIfActiveCheckResult=false, nodeProps: core_node1:{"core":"collection2_shard1_replica2","base_url":"http://127.0.0.1:36014/solr","node_name":"127.0.0.1:36014_solr","state":"down"}
11278 INFO  (zkCallback-23-thread-1-processing-n:127.0.0.1:36014_solr) [n:127.0.0.1:36014_solr    ] o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/collection2/state.json] for collection [collection2] has occurred - updating... (live nodes size: [2])
11278 INFO  (zkCallback-12-thread-2-processing-n:127.0.0.1:42549_solr) [n:127.0.0.1:42549_solr    ] o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/collection2/state.json] for collection [collection2] has occurred - updating... (live nodes size: [2])
11323 ERROR (updateExecutor-7-thread-1-processing-http:////127.0.0.1:36014//solr//collection2_shard1_replica2 x:collection2_shard1_replica1 r:core_node2 n:127.0.0.1:42549_solr s:shard1 c:collection2) [n:127.0.0.1:42549_solr c:collection2 s:shard1 r:core_node2 x:collection2_shard1_replica1] o.a.s.u.StreamingSolrClients error
java.net.SocketException: Broken pipe (Write failed)
	at java.net.SocketOutputStream.socketWrite0(Native Method)
	at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
	at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
	at org.apache.http.impl.io.SessionOutputBufferImpl.streamWrite(SessionOutputBufferImpl.java:126)
	at org.apache.http.impl.io.SessionOutputBufferImpl.flushBuffer(SessionOutputBufferImpl.java:138)
	at org.apache.http.impl.io.SessionOutputBufferImpl.flush(SessionOutputBufferImpl.java:146)
	at org.apache.http.impl.io.ChunkedOutputStream.close(ChunkedOutputStream.java:205)
	at org.apache.http.impl.DefaultBHttpClientConnection.sendRequestEntity(DefaultBHttpClientConnection.java:159)
	at org.apache.http.impl.conn.CPoolProxy.sendRequestEntity(CPoolProxy.java:162)
	at org.apache.http.protocol.HttpRequestExecutor.doSendRequest(HttpRequestExecutor.java:237)
	at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:122)
	at org.apache.solr.util.stats.InstrumentedHttpRequestExecutor.execute(InstrumentedHttpRequestExecutor.java:61)
	at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:271)
	at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184)
	at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:88)
	at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
	at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
	at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:323)
	at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:186)
	at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176)
	at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

SOLR-9945.patch
09/Jan/17 08:30
6 kB
Cao Manh Dat

Issue Links

duplicates

SOLR-11702 Redesign current LIR implementation

Closed

is related to

SOLR-9555 Leader incorrectly publishes state for replica when it puts replica into LIR.

Closed

Activity

People

Assignee:: Unassigned

Reporter:: Cao Manh Dat

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 09/Jan/17 08:24

Updated:: 08/Jun/19 15:31

Resolved:: 20/Jan/17 00:56