Cassandra / CASSANDRA-6961

nodes should go into hibernate when join_ring is false

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: 2.0.7, 2.1 beta2
    • Component/s: None
    • Labels:

      Description

      The impetus here is this: a node that was down for some period and then comes back can serve stale information. We know from CASSANDRA-768 that we can't just wait for hints, and the tangentially related CASSANDRA-3569 prevents us from having a node that is down (from the failure detector's point of view) handle streaming.

      We can almost do this today: set join_ring to false, then repair, and then join the ring to narrow the window (in fact, you can do this and everything succeeds because the node doesn't know it's a member yet, which is probably a bit of a bug). If instead we modified this to put the node into hibernate, as replace_address does, it could work almost like replace, except that you could run a repair (manually) while in the hibernate state and then flip to normal when it's done.

      This won't prevent the staleness 100%, but it will greatly reduce the chance if the node has been down a significant amount of time.
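      As a minimal sketch of the proposed flow (assuming the hibernate behavior described above; the keyspace name and host are placeholders, while the startup property and nodetool commands are the existing ones):

        # Restart the previously-down node, but keep it out of the ring (hibernate):
        cassandra -Dcassandra.join_ring=false

        # While it is hibernating, repair it so its data is no longer stale
        # ("my_keyspace" is a placeholder):
        nodetool -h <node> repair my_keyspace

        # When repair completes, flip the node to normal and rejoin the ring:
        nodetool -h <node> join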

      Attachments

      1. 6961.txt (14 kB) - Brandon Williams

        Activity

        Brandon Williams added a comment -

        Patch to enable this. We only advertise tokens if some were already saved, so you can still start with join_ring=false, and then use join to bootstrap later (I don't know why anyone would do that, but that's the behavior we had before.) You can run repair on the node while it's in hibernate, and as a bonus side effect, if you take a blank node and set join_ring=false but don't disable rpc, you have an instant coordinator-only fat client (where before if you did this, you were asking for trouble.)
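        To illustrate the coordinator-only case above, a sketch (not taken from the patch itself): start a node with an empty data directory, leave the client-facing interfaces enabled in cassandra.yaml, and keep join_ring disabled:

          # cassandra.yaml: leave the client interfaces on, e.g.
          #   start_rpc: true
          #   start_native_transport: true

          # Start the blank node without joining the ring; it never owns tokens
          # and only coordinates client requests:
          cassandra -Dcassandra.join_ring=false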

        Tyler Hobbs added a comment - edited

        I'm seeing some issues with repair while one node is running with join_ring=false.

        Here's what I did:

        • Start a three node ccm cluster
        • Start a stress write with RF=3
        • Stop node3
        • Start node3 with join_ring=false
        • Run a repair against node3

        It looks like the repair finishes all of its diffing and streaming, but the repair command hangs, and netstats shows continuously increasing completed Command/Response counts.
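        For reference, the steps above correspond roughly to the ccm session below (the version and the old-style stress flags are illustrative and vary between releases):

          ccm create repro -v 2.0.6 -n 3 -s
          # stress write with RF=3 (flags shown are the 2.0-era stress tool's)
          tools/bin/cassandra-stress -d 127.0.0.1 -l 3 -n 1000000
          ccm node3 stop
          # restart node3 with join_ring=false (via --jvm_arg, if your ccm supports it)
          ccm node3 start --jvm_arg=-Dcassandra.join_ring=false
          ccm node3 nodetool repair
          # watch streaming and command/response counts
          ccm node3 nodetool netstats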

        Brandon Williams added a comment -

        Hmm, I can't reproduce that, even wiping the node before starting it with join_ring=false:

        [2014-04-04 21:51:30,888] Repair command #1 finished
        

        and nodetool exits.

        Tyler Hobbs added a comment -

        CASSANDRA-6984 was the cause of the hung repair.

        Tyler Hobbs added a comment -

        +1

        Brandon Williams added a comment -

        Committed.

        Rick Branson added a comment -

        Very happy about this. It will make a ton of things easier from an operations perspective (bringing up new DCs, bringing up hosts after long-ish maintenance), but I'm also interested in using this to potentially run dedicated coordinator nodes that are separate from storage. We find ourselves CPU bound on our more capacity-constrained and expensive storage-class hardware, and most of that CPU time is spent on request coordination. Moving this work to cheap "diskless" application-class hardware is a much better fit and will let us maximize the capacity of our storage nodes.

        Robert Coli added a comment -

        While shortening the staleness race when reading at ONE is cool, I am most excited that this ticket provides an alternative to the previous operational best practice of restoring a node that was down for a long time by re-bootstrapping it. Briefly, this is because re-bootstrapping reduces the number of unique copies of the data (the node's original copy is wiped and replaced with data streamed from other replicas), whereas this approach preserves the original data and replica sets. In my view, we should aim to preserve the unique copy of data on any given replica as much as is feasible.


          People

          • Assignee: Brandon Williams
          • Reporter: Brandon Williams
          • Reviewer: Tyler Hobbs
          • Tester: Ryan McGuire
          • Votes: 0
          • Watchers: 9
