Solr / SOLR-6923

AutoAddReplicas should consult live nodes also to see if a state has changed

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.0, 6.0
    • Component/s: SolrCloud
    • Labels:
      None

      Description

      • I did the following
        ./solr start -e cloud -noprompt
        
        kill -9 <pid-of-node2>   # not the node which is running ZK
        
      • /live_nodes reflects that the node is gone.
      • This is the only message logged on the node1 server after killing node2:
      45812 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:9983] WARN  org.apache.zookeeper.server.NIOServerCnxn  – caught end of stream exception
      EndOfStreamException: Unable to read additional data from client sessionid 0x14ac40f26660001, likely client has closed socket
          at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
          at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
          at java.lang.Thread.run(Thread.java:745)
      
      • The graph shows node2 in the 'Gone' state.
      • clusterstate.json keeps showing the replica as 'active':
      {"collection1":{
          "shards":{"shard1":{
              "range":"80000000-7fffffff",
              "state":"active",
              "replicas":{
                "core_node1":{
                  "state":"active",
                  "core":"collection1",
                  "node_name":"169.254.113.194:8983_solr",
                  "base_url":"http://169.254.113.194:8983/solr",
                  "leader":"true"},
                "core_node2":{
                  "state":"active",
                  "core":"collection1",
                  "node_name":"169.254.113.194:8984_solr",
                  "base_url":"http://169.254.113.194:8984/solr"}}}},
          "maxShardsPerNode":"1",
          "router":{"name":"compositeId"},
          "replicationFactor":"1",
          "autoAddReplicas":"false",
          "autoCreated":"true"}}
      

      One immediate problem I can see is that AutoAddReplicas doesn't work, since clusterstate.json never changes. Other features may be affected by this as well.

      My first thought on handling this: the shard leader could listen for changes on /live_nodes and, if it has replicas that were on the departed node, mark them as 'down' in clusterstate.json?

      1. SOLR-6923.patch
        2 kB
        Varun Thacker

        Activity

        Timothy Potter added a comment -

        The actual runtime state of a replica is determined by 1) what's in clusterstate.json and 2) whether the node hosting the replica is live. If the node is not live, the state reported in clusterstate.json can be "stale" for some time. It has always worked this way in SolrCloud. Thus, AutoAddReplicas needs to consult live nodes before treating a replica as live.
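
        The two-part check described above can be sketched as follows. This is only an illustration under stated assumptions: the class and method names are hypothetical and are not Solr's API; only the node names are taken from the clusterstate.json in the description.

```java
import java.util.Set;

public class EffectiveReplicaState {
    // Hypothetical helper (not part of Solr): the state recorded in
    // clusterstate.json is trusted only if the hosting node is in live_nodes.
    static String effectiveState(String recordedState, String nodeName, Set<String> liveNodes) {
        return liveNodes.contains(nodeName) ? recordedState : "down";
    }

    public static void main(String[] args) {
        // After killing node2, only node1 remains under /live_nodes.
        Set<String> liveNodes = Set.of("169.254.113.194:8983_solr");

        // Both replicas are still recorded as "active" in clusterstate.json.
        System.out.println(effectiveState("active", "169.254.113.194:8983_solr", liveNodes)); // active
        System.out.println(effectiveState("active", "169.254.113.194:8984_solr", liveNodes)); // down
    }
}
```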

        Varun Thacker added a comment -

        Thanks Tim for pointing it out. I was not aware of this.

        I'll rename the issue appropriately with this information and come up with a patch for AutoAddReplicas to consult live nodes too.

        Varun Thacker added a comment -

        Simple patch which checks against live nodes before short circuiting.

        SharedFSAutoReplicaFailoverTest passes.
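
        As a rough sketch of the idea (not the contents of the attached patch), a short-circuit that also looks at live nodes might be shaped like this. The class and method names are hypothetical; the point is that work is skipped only when both the cluster state version and the live-node set are unchanged.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

public class ChangeDetector {
    private Integer lastVersion;                         // null until the first check
    private Set<String> lastLiveNodes = new HashSet<>(); // live nodes seen on the last check

    // Returns true if the version or the live-node set differs from the
    // previous call, i.e. the caller must NOT short-circuit.
    public boolean hasChanged(Integer version, Set<String> liveNodes) {
        boolean unchanged = Objects.equals(version, lastVersion)
                && liveNodes.equals(lastLiveNodes);
        lastVersion = version;
        lastLiveNodes = new HashSet<>(liveNodes);
        return !unchanged;
    }
}
```

        With a check of this shape, a node that is killed without any clusterstate.json update is still noticed, because the live-node set differs even though the version does not.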

        Anshum Gupta added a comment -

        LGTM.

        ASF subversion and git services added a comment -

        Commit 1651221 from Anshum Gupta in branch 'dev/trunk'
        [ https://svn.apache.org/r1651221 ]

        SOLR-6923: AutoAddReplicas also consults live_nodes to see if a state change has happened

        ASF subversion and git services added a comment -

        Commit 1651223 from Anshum Gupta in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1651223 ]

        SOLR-6923: AutoAddReplicas also consults live_nodes to see if a state change has happened (merge from trunk)

        Mark Miller added a comment -

        I'll look at what test we should add for this today or tomorrow.

        Mark Miller added a comment -

        Never got to this - I'll open a new issue to add a test.

        Shalin Shekhar Mangar added a comment -

        I was going to backport this to 4.10.4 but then I realized that this code has:

        if (lastClusterStateVersion == clusterState.getZkClusterStateVersion() && baseUrlForBadNodes.size() == 0 &&
                  liveNodes.equals(clusterState.getLiveNodes())) {
        ...
        }
        

        Two Number objects are compared using == instead of .equals, which is only guaranteed to work if the values are between -128 and 127. This is buggy!
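
        A minimal illustration of the safe comparison, assuming the versions are boxed Integer values (the class and method names here are hypothetical, not Solr code). Boxing is only guaranteed to reuse the same object for values from -128 to 127, so == on larger values may compare two distinct objects.

```java
import java.util.Objects;

public class BoxedCompare {
    // Compares boxed versions by value; also tolerates nulls.
    static boolean sameVersion(Integer a, Integer b) {
        return Objects.equals(a, b);
    }

    public static void main(String[] args) {
        Integer a = Integer.valueOf(1000);
        Integer b = Integer.valueOf(1000);

        System.out.println(a.equals(b));       // true: compares the int values
        System.out.println(sameVersion(a, b)); // true

        // a == b would compare object references instead. For values outside
        // -128..127 the result depends on the JVM's Integer cache, so it must
        // not be relied on.
    }
}
```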

        Shalin Shekhar Mangar added a comment -

        I am marking this for 4.10.5 whenever that happens. I fixed the bug I reported in my last comment with SOLR-7178.

        Shalin Shekhar Mangar added a comment -

        This was released in 5.0


          People

          • Assignee: Mark Miller
          • Reporter: Varun Thacker
          • Votes: 0
          • Watchers: 8