Uploaded image for project: 'Solr'
  1. Solr
  2. SOLR-9945

LIR should check the node is recovering before bring it down

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Duplicate
    • None
    • None
    • None
    • None

    Description

      When a node is recovering, the leader can meet an exception when trying to send an update to the buffering node. So the leader will try running LIR process: first set the node's state to DOWN, then send recovery OP to the node.
      In the same time, PrepRecoveryOp will make the leader wait for a very long time to see the node's state is RECOVERING.
      This scenario can easily be achieved by using this test

      String collection = "collection2";
      CollectionAdminRequest
          .createCollection(collection, "config", 1, 2)
          .setMaxShardsPerNode(1)
          .process(cluster.getSolrClient());
      AbstractDistribZkTestBase.waitForRecoveriesToFinish(collection, cluster.getSolrClient().getZkStateReader(),
          false, true, 30);
      CloudSolrClient cloudClient = cluster.getSolrClient();
      
      DocCollection docCollection = cloudClient.getZkStateReader().getClusterState().getCollection(collection);
      Slice slice = docCollection.getSlice("shard1");
      Replica replicaNode = slice.getReplicas(replica -> replica != slice.getLeader()).get(0);
      JettySolrRunner replicaRunner = cluster.getReplicaJetty(replicaNode);
      
      new UpdateRequest()
          .add(sdoc("id", "1"))
          .process(cloudClient, collection);
      ChaosMonkey.stop(replicaRunner);
      new UpdateRequest()
          .add(sdoc("id", "2"))
          .process(cloudClient, collection);
      ChaosMonkey.start(replicaRunner);
      new UpdateRequest()
          .add(sdoc("id", "3"))
          .process(cloudClient, collection);
      AbstractDistribZkTestBase.waitForRecoveriesToFinish(collection, cluster.getSolrClient().getZkStateReader(),
          false, true, 60);
      CollectionAdminRequest
          .deleteCollection(collection)
          .process(cloudClient);  
      

      Here are some log

      10942 INFO  (searcherExecutor-53-thread-1-processing-n:127.0.0.1:36014_solr x:collection1_shard2_replica2 s:shard2 c:collection1 r:core_node4) [n:127.0.0.1:36014_solr c:collection1 s:shard2 r:core_node4 x:collection1_shard2_replica2] o.a.s.c.SolrCore [collection1_shard2_replica2] Registered new searcher Searcher@510d1ea5[collection1_shard2_replica2] main{ExitableDirectoryReader(UninvertingDirectoryReader())}
      10942 INFO  (searcherExecutor-56-thread-1-processing-n:127.0.0.1:36014_solr x:collection1_shard1_replica2 s:shard1 c:collection1 r:core_node3) [n:127.0.0.1:36014_solr c:collection1 s:shard1 r:core_node3 x:collection1_shard1_replica2] o.a.s.c.SolrCore [collection1_shard1_replica2] Registered new searcher Searcher@505d7839[collection1_shard1_replica2] main{ExitableDirectoryReader(UninvertingDirectoryReader())}
      10958 INFO  (coreZkRegister-47-thread-1-processing-n:127.0.0.1:36014_solr x:collection1_shard2_replica2 s:shard2 c:collection1 r:core_node4) [n:127.0.0.1:36014_solr c:collection1 s:shard2 r:core_node4 x:collection1_shard2_replica2] o.a.s.c.ZkController Core needs to recover:collection1_shard2_replica2
      10964 INFO  (updateExecutor-20-thread-1-processing-n:127.0.0.1:36014_solr x:collection1_shard2_replica2 s:shard2 c:collection1 r:core_node4) [n:127.0.0.1:36014_solr c:collection1 s:shard2 r:core_node4 x:collection1_shard2_replica2] o.a.s.u.DefaultSolrCoreState Running recovery
      10970 INFO  (recoveryExecutor-21-thread-1-processing-n:127.0.0.1:36014_solr x:collection1_shard2_replica2 s:shard2 c:collection1 r:core_node4) [n:127.0.0.1:36014_solr c:collection1 s:shard2 r:core_node4 x:collection1_shard2_replica2] o.a.s.c.RecoveryStrategy Starting recovery process. recoveringAfterStartup=true
      10971 INFO  (recoveryExecutor-21-thread-1-processing-n:127.0.0.1:36014_solr x:collection1_shard2_replica2 s:shard2 c:collection1 r:core_node4) [n:127.0.0.1:36014_solr c:collection1 s:shard2 r:core_node4 x:collection1_shard2_replica2] o.a.s.c.RecoveryStrategy ###### startupVersions=[[-1556034709770731520]]
      10971 INFO  (recoveryExecutor-21-thread-1-processing-n:127.0.0.1:36014_solr x:collection1_shard2_replica2 s:shard2 c:collection1 r:core_node4) [n:127.0.0.1:36014_solr c:collection1 s:shard2 r:core_node4 x:collection1_shard2_replica2] o.a.s.c.RecoveryStrategy Begin buffering updates. core=[collection1_shard2_replica2]
      10971 INFO  (recoveryExecutor-21-thread-1-processing-n:127.0.0.1:36014_solr x:collection1_shard2_replica2 s:shard2 c:collection1 r:core_node4) [n:127.0.0.1:36014_solr c:collection1 s:shard2 r:core_node4 x:collection1_shard2_replica2] o.a.s.u.UpdateLog Starting to buffer updates. FSUpdateLog{state=ACTIVE, tlog=null}
      10971 INFO  (recoveryExecutor-21-thread-1-processing-n:127.0.0.1:36014_solr x:collection1_shard2_replica2 s:shard2 c:collection1 r:core_node4) [n:127.0.0.1:36014_solr c:collection1 s:shard2 r:core_node4 x:collection1_shard2_replica2] o.a.s.c.RecoveryStrategy Publishing state of core [collection1_shard2_replica2] as recovering, leader is [http://127.0.0.1:42549/solr/collection1_shard2_replica1/] and I am [http://127.0.0.1:36014/solr/collection1_shard2_replica2/]
      11129 INFO  (recoveryExecutor-21-thread-1-processing-n:127.0.0.1:36014_solr x:collection1_shard2_replica2 s:shard2 c:collection1 r:core_node4) [n:127.0.0.1:36014_solr c:collection1 s:shard2 r:core_node4 x:collection1_shard2_replica2] o.a.s.c.RecoveryStrategy Sending prep recovery command to [http://127.0.0.1:42549/solr]; [WaitForState: action=PREPRECOVERY&core=collection1_shard2_replica1&nodeName=127.0.0.1:36014_solr&coreNodeName=core_node4&state=recovering&checkLive=true&onlyIfLeader=true&onlyIfLeaderActive=true]
      11131 INFO  (zkCallback-12-thread-2-processing-n:127.0.0.1:42549_solr) [n:127.0.0.1:42549_solr    ] o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/collection1/state.json] for collection [collection1] has occurred - updating... (live nodes size: [2])
      11131 INFO  (zkCallback-23-thread-1-processing-n:127.0.0.1:36014_solr) [n:127.0.0.1:36014_solr    ] o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/collection1/state.json] for collection [collection1] has occurred - updating... (live nodes size: [2])
      11133 INFO  (qtp344759431-38) [n:127.0.0.1:42549_solr    ] o.a.s.h.a.PrepRecoveryOp Going to wait for coreNodeName: core_node4, state: recovering, checkLive: true, onlyIfLeader: true, onlyIfLeaderActive: true
      11140 INFO  (qtp344759431-38) [n:127.0.0.1:42549_solr    ] o.a.s.h.a.PrepRecoveryOp Will wait a max of 183 seconds to see collection1_shard2_replica1 (shard2 of collection1) have state: recovering
      11140 INFO  (qtp344759431-38) [n:127.0.0.1:42549_solr    ] o.a.s.h.a.PrepRecoveryOp In WaitForState(recovering): collection=collection1, shard=shard2, thisCore=collection1_shard2_replica1, leaderDoesNotNeedRecovery=false, isLeader? true, live=true, checkLive=true, currentState=recovering, localState=active, nodeName=127.0.0.1:36014_solr, coreNodeName=core_node4, onlyIfActiveCheckResult=false, nodeProps: core_node4:{"core":"collection1_shard2_replica2","base_url":"http://127.0.0.1:36014/solr","node_name":"127.0.0.1:36014_solr","state":"recovering"}
      11140 INFO  (qtp344759431-38) [n:127.0.0.1:42549_solr    ] o.a.s.h.a.PrepRecoveryOp Waited coreNodeName: core_node4, state: recovering, checkLive: true, onlyIfLeader: true for: 0 seconds.
      11140 INFO  (qtp344759431-38) [n:127.0.0.1:42549_solr    ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/cores params={nodeName=127.0.0.1:36014_solr&onlyIfLeaderActive=true&core=collection1_shard2_replica1&coreNodeName=core_node4&action=PREPRECOVERY&checkLive=true&state=recovering&onlyIfLeader=true&wt=javabin&version=2} status=0 QTime=7
      11160 INFO  (coreLoadExecutor-52-thread-3-processing-n:127.0.0.1:36014_solr) [n:127.0.0.1:36014_solr c:collection2 s:shard1 r:core_node1 x:collection2_shard1_replica2] o.a.s.s.SolrIndexSearcher Opening [Searcher@28717b92[collection2_shard1_replica2] main]
      11161 INFO  (coreLoadExecutor-52-thread-3-processing-n:127.0.0.1:36014_solr) [n:127.0.0.1:36014_solr c:collection2 s:shard1 r:core_node1 x:collection2_shard1_replica2] o.a.s.r.ManagedResourceStorage Configured ZooKeeperStorageIO with znodeBase: /configs/config
      11162 INFO  (coreLoadExecutor-52-thread-3-processing-n:127.0.0.1:36014_solr) [n:127.0.0.1:36014_solr c:collection2 s:shard1 r:core_node1 x:collection2_shard1_replica2] o.a.s.r.ManagedResourceStorage Loaded null at path _rest_managed.json using ZooKeeperStorageIO:path=/configs/config
      11162 INFO  (coreLoadExecutor-52-thread-3-processing-n:127.0.0.1:36014_solr) [n:127.0.0.1:36014_solr c:collection2 s:shard1 r:core_node1 x:collection2_shard1_replica2] o.a.s.h.ReplicationHandler Commits will be reserved for  10000
      11162 INFO  (searcherExecutor-54-thread-1-processing-n:127.0.0.1:36014_solr x:collection2_shard1_replica2 s:shard1 c:collection2 r:core_node1) [n:127.0.0.1:36014_solr c:collection2 s:shard1 r:core_node1 x:collection2_shard1_replica2] o.a.s.c.SolrCore [collection2_shard1_replica2] Registered new searcher Searcher@28717b92[collection2_shard1_replica2] main{ExitableDirectoryReader(UninvertingDirectoryReader(Uninverting(_0(7.0.0):C1)))}
      11168 INFO  (coreZkRegister-47-thread-1-processing-n:127.0.0.1:36014_solr x:collection2_shard1_replica2 s:shard1 c:collection2 r:core_node1) [n:127.0.0.1:36014_solr c:collection2 s:shard1 r:core_node1 x:collection2_shard1_replica2] o.a.s.c.ZkController Core needs to recover:collection2_shard1_replica2
      11168 INFO  (updateExecutor-20-thread-2-processing-n:127.0.0.1:36014_solr x:collection2_shard1_replica2 s:shard1 c:collection2 r:core_node1) [n:127.0.0.1:36014_solr c:collection2 s:shard1 r:core_node1 x:collection2_shard1_replica2] o.a.s.u.DefaultSolrCoreState Running recovery
      11168 INFO  (recoveryExecutor-21-thread-2-processing-n:127.0.0.1:36014_solr x:collection2_shard1_replica2 s:shard1 c:collection2 r:core_node1) [n:127.0.0.1:36014_solr c:collection2 s:shard1 r:core_node1 x:collection2_shard1_replica2] o.a.s.c.RecoveryStrategy Starting recovery process. recoveringAfterStartup=true
      11169 INFO  (recoveryExecutor-21-thread-2-processing-n:127.0.0.1:36014_solr x:collection2_shard1_replica2 s:shard1 c:collection2 r:core_node1) [n:127.0.0.1:36014_solr c:collection2 s:shard1 r:core_node1 x:collection2_shard1_replica2] o.a.s.c.RecoveryStrategy ###### startupVersions=[[1556034713508904960]]
      11169 INFO  (recoveryExecutor-21-thread-2-processing-n:127.0.0.1:36014_solr x:collection2_shard1_replica2 s:shard1 c:collection2 r:core_node1) [n:127.0.0.1:36014_solr c:collection2 s:shard1 r:core_node1 x:collection2_shard1_replica2] o.a.s.c.RecoveryStrategy Begin buffering updates. core=[collection2_shard1_replica2]
      11169 INFO  (recoveryExecutor-21-thread-2-processing-n:127.0.0.1:36014_solr x:collection2_shard1_replica2 s:shard1 c:collection2 r:core_node1) [n:127.0.0.1:36014_solr c:collection2 s:shard1 r:core_node1 x:collection2_shard1_replica2] o.a.s.u.UpdateLog Starting to buffer updates. FSUpdateLog{state=ACTIVE, tlog=null}
      11169 INFO  (recoveryExecutor-21-thread-2-processing-n:127.0.0.1:36014_solr x:collection2_shard1_replica2 s:shard1 c:collection2 r:core_node1) [n:127.0.0.1:36014_solr c:collection2 s:shard1 r:core_node1 x:collection2_shard1_replica2] o.a.s.c.RecoveryStrategy Publishing state of core [collection2_shard1_replica2] as recovering, leader is [http://127.0.0.1:42549/solr/collection2_shard1_replica1/] and I am [http://127.0.0.1:36014/solr/collection2_shard1_replica2/]
      11171 INFO  (recoveryExecutor-21-thread-2-processing-n:127.0.0.1:36014_solr x:collection2_shard1_replica2 s:shard1 c:collection2 r:core_node1) [n:127.0.0.1:36014_solr c:collection2 s:shard1 r:core_node1 x:collection2_shard1_replica2] o.a.s.c.RecoveryStrategy Sending prep recovery command to [http://127.0.0.1:42549/solr]; [WaitForState: action=PREPRECOVERY&core=collection2_shard1_replica1&nodeName=127.0.0.1:36014_solr&coreNodeName=core_node1&state=recovering&checkLive=true&onlyIfLeader=true&onlyIfLeaderActive=true]
      11174 INFO  (qtp344759431-37) [n:127.0.0.1:42549_solr    ] o.a.s.h.a.PrepRecoveryOp Going to wait for coreNodeName: core_node1, state: recovering, checkLive: true, onlyIfLeader: true, onlyIfLeaderActive: true
      11175 INFO  (qtp344759431-37) [n:127.0.0.1:42549_solr    ] o.a.s.h.a.PrepRecoveryOp Will wait a max of 183 seconds to see collection2_shard1_replica1 (shard1 of collection2) have state: recovering
      11175 INFO  (qtp344759431-37) [n:127.0.0.1:42549_solr    ] o.a.s.h.a.PrepRecoveryOp In WaitForState(recovering): collection=collection2, shard=shard1, thisCore=collection2_shard1_replica1, leaderDoesNotNeedRecovery=false, isLeader? true, live=true, checkLive=true, currentState=down, localState=active, nodeName=127.0.0.1:36014_solr, coreNodeName=core_node1, onlyIfActiveCheckResult=false, nodeProps: core_node1:{"core":"collection2_shard1_replica2","base_url":"http://127.0.0.1:36014/solr","node_name":"127.0.0.1:36014_solr","state":"down"}
      11278 INFO  (zkCallback-23-thread-1-processing-n:127.0.0.1:36014_solr) [n:127.0.0.1:36014_solr    ] o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/collection2/state.json] for collection [collection2] has occurred - updating... (live nodes size: [2])
      11278 INFO  (zkCallback-12-thread-2-processing-n:127.0.0.1:42549_solr) [n:127.0.0.1:42549_solr    ] o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/collection2/state.json] for collection [collection2] has occurred - updating... (live nodes size: [2])
      11323 ERROR (updateExecutor-7-thread-1-processing-http:////127.0.0.1:36014//solr//collection2_shard1_replica2 x:collection2_shard1_replica1 r:core_node2 n:127.0.0.1:42549_solr s:shard1 c:collection2) [n:127.0.0.1:42549_solr c:collection2 s:shard1 r:core_node2 x:collection2_shard1_replica1] o.a.s.u.StreamingSolrClients error
      java.net.SocketException: Broken pipe (Write failed)
      	at java.net.SocketOutputStream.socketWrite0(Native Method)
      	at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
      	at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
      	at org.apache.http.impl.io.SessionOutputBufferImpl.streamWrite(SessionOutputBufferImpl.java:126)
      	at org.apache.http.impl.io.SessionOutputBufferImpl.flushBuffer(SessionOutputBufferImpl.java:138)
      	at org.apache.http.impl.io.SessionOutputBufferImpl.flush(SessionOutputBufferImpl.java:146)
      	at org.apache.http.impl.io.ChunkedOutputStream.close(ChunkedOutputStream.java:205)
      	at org.apache.http.impl.DefaultBHttpClientConnection.sendRequestEntity(DefaultBHttpClientConnection.java:159)
      	at org.apache.http.impl.conn.CPoolProxy.sendRequestEntity(CPoolProxy.java:162)
      	at org.apache.http.protocol.HttpRequestExecutor.doSendRequest(HttpRequestExecutor.java:237)
      	at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:122)
      	at org.apache.solr.util.stats.InstrumentedHttpRequestExecutor.execute(InstrumentedHttpRequestExecutor.java:61)
      	at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:271)
      	at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184)
      	at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:88)
      	at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
      	at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
      	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
      	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
      	at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:323)
      	at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:186)
      	at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176)
      	at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      	at java.lang.Thread.run(Thread.java:745)
      

      Attachments

        1. SOLR-9945.patch
          6 kB
          Cao Manh Dat

        Issue Links

          Activity

            People

              Unassigned Unassigned
              caomanhdat Cao Manh Dat
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: