Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.2, 6.0
    • Component/s: SolrJ
    • Labels:
      None

      Description

      Following SOLR-7325, this issue adds a State enum to Replica.

      1. SOLR-7336.patch
        117 kB
        Shai Erera
      2. SOLR-7336.patch
        116 kB
        Shai Erera
      3. SOLR-7336.patch
        116 kB
        Shai Erera

          Activity

          Shai Erera added a comment -
          • Add Replica.getState()
          • Remove ZkStateReader state related constants and cutover to use Replica.State
          • Take Replica.State where need to, instead of a String.

          Tests pass, I think it's ready.

          Shai Erera added a comment -

          Forgot to mention that I also removed ZkStateReader.SYNC, which seemed unused except by a test that waited for replicas to become active. But I don't think a replica is ever put in that state?

          Also, I would appreciate it if someone could review the documentation of the Replica.State values.

          Mark Miller added a comment -

          SYNC is just cruft - part of some prototyping at the very start, and never used.

          Mark Miller added a comment -

          Notes on states:

          ACTIVE

          The replica is ready to receive updates and queries.

          DOWN

          Some of these names came before things were fully fleshed out. DOWN is actually the first state before RECOVERING. I think tlog replay happens in DOWN, though that is a bit of a bug IMO; we should probably have a new state for it. A node in DOWN should be actively trying to move to RECOVERING.

          RECOVERING

          The node is recovering from the leader. This might involve peersync, full replication, or finding out things are already in sync.

          RECOVERY_FAILED

          RECOVERY attempts have not worked; something is not right.

          NOTE: This state doesn't matter if the node is not part of /live_nodes in zk - in that case the node is not part of the cluster and its state should be discarded.
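          Mark's notes above map naturally onto enum javadocs. A minimal, self-contained sketch of how such a State enum might look - this is an illustration derived from the comment, not the exact code committed in the patch, and the getState helper here is hypothetical:

          ```java
          /**
           * Sketch of a replica State enum, with javadocs based on the
           * notes above. Illustrative only; the committed Replica.State
           * may differ in detail.
           */
          public enum State {
              /** The replica is ready to receive updates and queries. */
              ACTIVE,
              /**
               * The first state before RECOVERING. A replica in this state
               * should be actively trying to move to RECOVERING.
               */
              DOWN,
              /**
               * The replica is recovering from the leader. This might involve
               * peersync, full replication, or finding out things are already
               * in sync.
               */
              RECOVERING,
              /** Recovery attempts have not worked; something is not right. */
              RECOVERY_FAILED;

              /** Parse the lowercase form stored in cluster state, e.g. "active". */
              public static State getState(String stateStr) {
                  if (stateStr == null) return null;
                  return State.valueOf(stateStr.toUpperCase(java.util.Locale.ROOT));
              }

              @Override
              public String toString() {
                  // cluster state stores the lowercase form
                  return name().toLowerCase(java.util.Locale.ROOT);
              }
          }
          ```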

          Shai Erera added a comment -

          ACTIVE

          Changed.

          RECOVERING

          Changed.

          RECOVERY_FAILED

          Changed.

          DOWN

          Left as is for now. When we change the semantics and logic, we should change the documentation too.

          Shai Erera added a comment -

          Patch updated to latest trunk, with Mark's comments incorporated.

          Mark Miller added a comment -

          + /**
          + * A replica can be in that state in two cases:
          + * <ul>
          + * <li>It is truly down, i.e. hosted on a node that is no longer live.</li>
          + * <li>It failed to acknowledge an update request from the leader and is
          + * catching up with the leader's transaction log.</li>
          + * </ul>

          Neither of those statements is really correct currently, though. It's as I said above:

          "DOWN is the first state before RECOVERING. A node in DOWN should be actively trying to move to RECOVERING."

          The main reason for DOWN is so that leaders can see a replica's state change to RECOVERING.

          It doesn't mean the replica is truly down or hosted on a node that is no longer live, nor does it necessarily imply an update from the leader failed.

          Shai Erera added a comment -

          OK. I will update the documentation to what you put in quotes (and only that, correct?).

          But isn't a replica in DOWN when the node it's on is also down? I know I saw that happen after shutting down Solr on a node.

          Mark Miller added a comment -

          and is catching up with the leader's transaction log

          A DOWN node may or may not be catching up from its own transaction log.

          Mark Miller added a comment -

          But isn't a replica in DOWN, when the node it's on is also down? I know I saw that happening after shutting down Solr on a node.

          We try to publish DOWN on shutdown because on startup we always want to see the progression DOWN, RECOVERING, and this makes it more pronounced - and because we want to make a best effort to mark the node as not ACTIVE in clusterstate.json, since it confuses users that you have to consult zk live_nodes to know the actual state. It's not really required; it came later, and it's kind of a best-effort thing. The key is: if a node is really shut down, its zk live node is gone. A zk live node being gone means ignore the state. You need both pieces of info - state is useless by itself.

          See the Solr Cloud admin UI: it won't show a node that is shut down as DOWN. Its zk live node is gone, and so it's marked as gray and gone.

          DOWN is the wrong name, but that's what it is.
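          The rule in this comment - that you need both pieces of info, liveness and published state - can be sketched as a small helper. The class, method name, and liveNodes set below are illustrative, not the actual SolrJ API:

          ```java
          import java.util.Set;

          /**
           * Hypothetical helper illustrating the point above: a published
           * replica state is meaningless unless the replica's node is also
           * present in zk /live_nodes.
           */
          public class ReplicaLiveness {
              public enum State { ACTIVE, DOWN, RECOVERING, RECOVERY_FAILED }

              /**
               * Returns true only when the replica may serve requests: its
               * node must be live AND its published state must be ACTIVE.
               */
              public static boolean isActive(String nodeName, State publishedState,
                                             Set<String> liveNodes) {
                  if (!liveNodes.contains(nodeName)) {
                      return false; // node gone: ignore whatever state was published
                  }
                  return publishedState == State.ACTIVE;
              }
          }
          ```

          This mirrors what the admin UI does: a node missing from live_nodes is shown as gone regardless of the last state it published.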

          Mark Miller added a comment -

          We try and publish DOWN on shutdown

          Oh yeah, and I also think it was an attempt at taking nodes out of rotation cleanly on shutdown - just dropping out while staying ACTIVE in clusterstate.json can leave connections coming in during shutdown, and this slows things down (there is a JIRA somewhere).

          CoreContainer#shutdown is actually called too late to do this anyway. We need some explicit REST command or something to move to the DOWN state before we start container shutdown.

          Shai Erera added a comment -

          Thanks Mark for the clarifications. I improved the javadocs of all the states and Replica.State based on your feedback and my understanding. Please confirm I got it all right.

          Shai Erera added a comment -

          Mark Miller if you have no objections, I will commit these changes.

          Mark Miller added a comment -

          Looks good to me.

          ASF subversion and git services added a comment -

          Commit 1671240 from Shai Erera in branch 'dev/trunk'
          [ https://svn.apache.org/r1671240 ]

          SOLR-7336: Add State enum to Replica

          ASF subversion and git services added a comment -

          Commit 1671246 from Shai Erera in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1671246 ]

          SOLR-7336: Add State enum to Replica

          Shai Erera added a comment -

          Committed to trunk and 5x. Thanks Mark Miller for the review and feedback!

          ASF subversion and git services added a comment -

          Commit 1676350 from hossman@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1676350 ]

          preemptive cleanup of 'Upgrading' section for 5.2 (SOLR-7325, SOLR-7336, SOLR-4839)

          ASF subversion and git services added a comment -

          Commit 1676351 from hossman@apache.org in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1676351 ]

          preemptive cleanup of 'Upgrading' section for 5.2 (SOLR-7325, SOLR-7336, SOLR-4839 - merge r1676350)

          Anshum Gupta added a comment -

          Bulk close for 5.2.0.


            People

            • Assignee:
              Shai Erera
              Reporter:
              Shai Erera
            • Votes:
              0
              Watchers:
              3
