Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: 0.8 beta 1
    • Component/s: None
    • Labels: None

      Description

      Currently our move/loadbalance operations only implement case 2 of the Ruhl algorithm described at https://issues.apache.org/jira/browse/CASSANDRA-192#action_12713079.

      We should add functionality to optimize moves that take/give ranges to a node's direct neighbors.
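
      For illustration, a minimal sketch of why such a move is cheap (the BigInteger tokens and the little main method are stand-ins, not Cassandra's classes): only the slice of the ring between the old and new token changes ownership.

        import java.math.BigInteger;

        public class NeighborMoveSketch
        {
            public static void main(String[] args)
            {
                BigInteger predecessor = BigInteger.valueOf(100); // token of the previous node on the ring
                BigInteger oldToken    = BigInteger.valueOf(200); // this node's current token
                BigInteger newToken    = BigInteger.valueOf(150); // target token, still between the same neighbors

                // Before the move this node owns (100, 200]; afterwards it owns (100, 150].
                // The only data that changes hands is (150, 200], streamed to the successor;
                // moving the other way is symmetric: fetch only, stream nothing.
                System.out.println("still owned:         (" + predecessor + ", " + newToken + "]");
                System.out.println("stream to successor: (" + newToken + ", " + oldToken + "]");
            }
        }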

      Attachments

      1. 0001-Minor-reorginization.patch (8 kB, Nick Bailey)
      2. CASSANDRA-1427.patch (20 kB, Pavel Yaskevich)
      3. CASSANDRA-1427-v2.patch (31 kB, Pavel Yaskevich)
      4. CASSANDRA-1427-v3.patch (114 kB, Pavel Yaskevich)
      5. CASSANDRA-1427-v4.patch (114 kB, Pavel Yaskevich)
      6. CASSANDRA-1427-v5.patch (116 kB, Pavel Yaskevich)
      7. CASSANDRA-1427-v6.patch (119 kB, Pavel Yaskevich)

        Activity

        Stu Hood added a comment -

        Congratulations to everyone who worked on this ticket: this has been very exciting work. We have upcoming use cases that will likely use OPP, and this was a blocker for that work. Thank you!

        Hudson added a comment -

        Integrated in Cassandra #752 (See https://hudson.apache.org/hudson/job/Cassandra/752/)
        optimize node movement within the same arc of the ring
        patch by Pavel Yaskevich; reviewed by Nick Bailey for CASSANDRA-1427

        Jonathan Ellis added a comment -

        committed

        Nick Bailey added a comment -

        +1 on v6.

        Pavel Yaskevich added a comment -

        Merged with 0001-Minor-reorginization.patch; nodetool now supports the "Moving" state, and in MoveTest all asserts were changed to assertTrue.

        Pavel Yaskevich added a comment -

        +1

        Nick Bailey added a comment -

        A small patch on top of your latest patch to reorganize some of the streamRanges/requestRanges code. Thoughts?

        Nick Bailey added a comment -
        • In requestRanges, instead of passing a request count, you should just use the same strategy that streamRanges uses for counting down the latch (see the sketch after this list).
        • Hmm, I didn't notice this before, but the fact that the new token in loadbalance is calculated before the move changes things pretty significantly. The fact that it left the ring before calculating a new token meant that it would consider the load it was giving away when trying to pick a new token. Honestly, maybe we should just get rid of loadbalance since this is a major version change. If someone wants load balance they can manually decommission then bootstrap. We should probably get someone else's opinion here.
        • I've got to wonder if there is a better way to test this kind of stuff. (Move/Remove/LeaveAndBootstrap)Test are all kind of ridiculous (it's sad that I wrote one of them). Maybe we just need a separate task for figuring that out. Also perhaps a ticket for making/updating the distributed test for this.
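
        A minimal sketch of that latch strategy, assuming hypothetical requestRanges/startStreaming signatures rather than the actual Cassandra code: one latch count per source, counted down by that source's completion callback instead of a pre-computed request count.

        import java.net.InetAddress;
        import java.util.Collection;
        import java.util.Map;
        import java.util.concurrent.CountDownLatch;

        public class RequestRangesSketch
        {
            interface Range {}

            // Hypothetical fetch helper: the latch is sized to the number of sources and
            // decremented by each source's completion callback.
            static void requestRanges(Map<InetAddress, Collection<Range>> rangesPerSource) throws InterruptedException
            {
                final CountDownLatch latch = new CountDownLatch(rangesPerSource.size());
                for (Map.Entry<InetAddress, Collection<Range>> entry : rangesPerSource.entrySet())
                {
                    Runnable onCompletion = new Runnable()
                    {
                        public void run() { latch.countDown(); } // runs when this source finishes streaming
                    };
                    startStreaming(entry.getKey(), entry.getValue(), onCompletion);
                }
                latch.await(); // block until every source has finished
            }

            // Placeholder for the real streaming call; it would register onCompletion with the stream session.
            static void startStreaming(InetAddress source, Collection<Range> ranges, Runnable onCompletion)
            {
                onCompletion.run();
            }
        }
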
        Nick Bailey added a comment -

        Some initial comments:

        • This change disallows moving when RF = N. The call to calculateNaturalEndpoints fails when you use the TokenMetadata that doesn't include the moving node.
        • We need to sleep for RING_DELAY before we start any streaming, similar to how decommission does. Basically the other nodes need to calculate their pending ranges so they can start sending writes to the appropriate nodes. Once that happens we can stream. It might be nice to bypass the sleep if nothing is going to be streamed; it would make setting up the ring before any data is in it easier.
        • The check to see if data is moving to this node needs to be done in a loop at the beginning of the move method. By the time you throw the exception in the current implementation, you've already gossiped the move state and calculated a bunch of ranges and whatnot.
        • You never await the latch in streamRanges, so that call returns immediately.
        • It looks like decom is broken. There is a latch in unbootstrap that is waited on but never counted down.
        • Also, in regard to the above point: I think we should try to combine a lot of the logic for calculating ranges that we are fetching or streaming. Can we have decom and move call the same method to calculate ranges to send, then do the actual sending, and the same for bootstrap/move? Right now that move method is pretty hairy and it seems like it would be good to have that logic in a single spot (see the sketch after this comment).

        Gonna post this comment so it doesn't get lost, but I'm gonna still do some more reviewing.
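
        As a rough sketch of the kind of shared helper suggested in the last bullet (the names calculateStreamAndFetchRanges and StreamAndFetch are mine, and treating ranges as atomic set elements is a simplification; real ranges would have to be split at their intersections), decommission, bootstrap, and move could all derive their work from the same current/updated comparison:

        import java.util.Collection;
        import java.util.HashSet;
        import java.util.Set;

        public class RangeDiffSketch
        {
            interface Range {}

            static class StreamAndFetch
            {
                final Set<Range> toStream; // ranges we own now but will not own after the change
                final Set<Range> toFetch;  // ranges we will own after the change but do not hold yet

                StreamAndFetch(Set<Range> toStream, Set<Range> toFetch)
                {
                    this.toStream = toStream;
                    this.toFetch = toFetch;
                }
            }

            // Hypothetical shared helper that decommission, bootstrap and move could all call.
            static StreamAndFetch calculateStreamAndFetchRanges(Collection<Range> current, Collection<Range> updated)
            {
                Set<Range> toStream = new HashSet<Range>(current);
                toStream.removeAll(updated); // keep only what we are giving up

                Set<Range> toFetch = new HashSet<Range>(updated);
                toFetch.removeAll(current);  // keep only what we are gaining

                return new StreamAndFetch(toStream, toFetch);
            }
        }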

        Pavel Yaskevich added a comment -

        Rebased to the latest trunk.

        Pavel Yaskevich added a comment -

        Everything from your previous comment has been implemented. The old MoveTest content has been moved to LeaveBootstrapTest (I think it is still useful), and MoveTest itself has been updated.

        Pavel Yaskevich added a comment -

        I'm a bit stuck reworking MoveTest, so the v3 patch is coming only tomorrow, FYI...

        Pavel Yaskevich added a comment -

        I concur! Changed that in my version 3 patch, will attach it later today.

        Nick Bailey added a comment -

        Fine with me. Also, I think getWriteEndpoints in TMD should return a Set instead of a List. Moving nodes are likely going to generate duplicate pending ranges.
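
        A minimal sketch of the List-to-Set change, with a simplified signature (the real getWriteEndpoints lives on TokenMetadata and works from the token and pending ranges): duplicates introduced by moving nodes collapse automatically.

        import java.net.InetAddress;
        import java.util.Collection;
        import java.util.HashSet;
        import java.util.Set;

        public class WriteEndpointsSketch
        {
            // Illustrative signature only; the point is returning a Set instead of a List.
            static Set<InetAddress> getWriteEndpoints(Collection<InetAddress> naturalEndpoints,
                                                      Collection<InetAddress> pendingEndpoints)
            {
                Set<InetAddress> endpoints = new HashSet<InetAddress>(naturalEndpoints);
                endpoints.addAll(pendingEndpoints); // a moving node that shows up twice is deduplicated here
                return endpoints;
            }
        }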

        Pavel Yaskevich added a comment -

        Never mind about 2. It's a simple operation: add a record to movingNodes in the onChange handler for move.

        About 3: I think we should keep movingNodes for code readability and logic consistency...

        Nick Bailey added a comment -

        2. In StorageService.handleStateMoving store both new token and InetAddress of the moving node.

        Not sure what you mean.

        3. Add following code to calculatePendingRanges(AbstractReplicationStrategy, String) after bootstrapToken ranges calculation to track pending ranges of the moving nodes (almost the same code as for bootstrapping nodes):

        I think this is the right approach. Would it be useful to get rid of 'movingRanges' in TMD and simply add nodes that are moving to both the leaving endpoints and the bootstrap tokens? I think this would remove the need to change calculatePendingRanges at all. I could see how the distinction between moving and leaving+bootstrap might be easier to follow in the code, though.

        Pavel Yaskevich added a comment -

        Shouldn't the pending ranges calculation include the tokens the nodes are moving to as well?

        Yeah, I think so. I'm going to do the following:

        1. Change TokenMetadata.movingEndpoints type from Set<InetAddress> to Set<Pair<Token, InetAddress>>, which will allow us to store the token each node is moving to.

        2. In StorageService.handleStateMoving store both the new token and the InetAddress of the moving node.

        3. Add the following code to calculatePendingRanges(AbstractReplicationStrategy, String) after the bootstrapToken ranges calculation to track pending ranges of the moving nodes (almost the same code as for bootstrapping nodes):

        for (Pair<Token, InetAddress> moving : allLeftMetadata.getMovingEndpoints())
        {
           InetAddress endpoint = moving.right; // address of the moving node
        
           // moving.left is a new token of the endpoint
           allLeftMetadata.updateNormalToken(moving.left, endpoint);
        
           for (Range range : strategy.getAddressRanges(allLeftMetadata).get(endpoint))
           {
               pendingRanges.put(range, endpoint);
           }
        
           allLeftMetadata.removeEndpoint(endpoint);
        }
        

        What do you think?

        Nick Bailey added a comment -
        • Seems like move should just fail if the token is null.
        • Shouldn't the pending ranges calculation include the tokens the nodes are moving to as well?
          • It also looks like the writeEndpoints calculation in token metadata is just a list rather than a set, so we could potentially be adding the same endpoint multiple times here. The best solution is probably to make that calculation return a set.
        • A MoveTest.java already exists. The fact that it's still passing is probably indicative of its usefulness, but it should probably be updated and hopefully made more useful.

        Pavel Yaskevich added a comment -

        Added support for the features described in your latest comment. Also, streaming unused ranges to new replicas is now issued not on every move but only when needed, and only for the needed parts (calculating the unused parts between the current and new ranges for the current node). What do you think?

        Nick Bailey added a comment -

        Regarding the patch:
        1. There is no need for the multiple loadbalance method signatures. Move is for specifying a token. Loadbalance is for letting c* choose.
        2. The general approach seems like it should work but there are a few things missing.

        • The pending ranges calculation needs to include moving nodes. This calculation is what causes writes to go to all the right nodes when a node is entering/leaving the ring, so that writes that happened during the ring change but weren't necessarily streamed aren't lost.
        • Along with the new application state, SS needs to handle that state. When the moving node gossips the moving state to other nodes, they should update their pending ranges as well. This means the token the node is moving to will need to be included in the state (see the sketch after this list).
        • Once the move is completed the node needs to return to a normal state.
        • Along with that, the node needs to be removed from the moving nodes list in TMD, so the pending ranges calculation will be correct again.
        • Along with the additional application state, SS needs to handle that state in the onChange method.
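
        To make that lifecycle concrete, a loose sketch (Token, Pair, the movingEndpoints field, and the recalculation hook are illustrative placeholders, not necessarily what the patch implements): record the (token, endpoint) pair when the MOVING state is gossiped, recalculate pending ranges, and undo both once the node returns to a normal state.

        import java.net.InetAddress;
        import java.util.HashSet;
        import java.util.Iterator;
        import java.util.Set;

        public class MovingStateSketch
        {
            interface Token {}

            static class Pair<L, R>
            {
                final L left;
                final R right;
                Pair(L left, R right) { this.left = left; this.right = right; }
            }

            // Mirrors the proposed TokenMetadata.movingEndpoints: the target token plus the moving node.
            static final Set<Pair<Token, InetAddress>> movingEndpoints = new HashSet<Pair<Token, InetAddress>>();

            // Called when another node gossips its MOVING state along with the token it is moving to.
            static void handleStateMoving(InetAddress endpoint, Token newToken)
            {
                movingEndpoints.add(new Pair<Token, InetAddress>(newToken, endpoint));
                recalculatePendingRanges(); // writes now also cover the ranges the node is moving onto
            }

            // Called when the moved node gossips a normal state again.
            static void handleStateNormal(InetAddress endpoint)
            {
                for (Iterator<Pair<Token, InetAddress>> it = movingEndpoints.iterator(); it.hasNext();)
                {
                    if (it.next().right.equals(endpoint))
                        it.remove();
                }
                recalculatePendingRanges(); // pending ranges shrink back once the move has finished
            }

            static void recalculatePendingRanges() { /* placeholder for calculatePendingRanges(...) */ }
        }
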
        Nick Bailey added a comment -

        1). What should we do with the ranges left on the node if we are moving to a completely new range (no intersection with the old range)? Should we stream them to the ring?

        We should just follow the same strategy decom uses. Any ranges that this node owned before the move should be streamed to anyone who will become a new replica for that data.

        2). Should we always do 1) (on any move)?

        Yes.

        Pavel Yaskevich added a comment -

        This is the initial patch, which does the following:

        Replaces the leave/bootstrap sequence with the following (the node registers its state as MOVING):

        a). Call BootStrapper.getBalancedToken(...) to get the token to move to;
        b). For each non-system table (keyspace), get the collection of ranges in the current state and in the "after move" state;
        c). Calculate the ranges to fetch from the ring (skipping the intersection of the current/updated ranges);
        d). Fetch the ranges and move to the new token.

        Open questions:

        1). What should we do with the ranges left on the node if we are moving to a completely new range (no intersection with the old range)? Should we stream them to the ring?
        2). Should we always do 1) (on any move)?

        Requesting comments!

        Jonathan Ellis added a comment -

        +1

        Ryan King added a comment -

        I think we should generalize it to cover all cases.

        Nick Bailey added a comment -

        I'm wondering if it's worth generalizing this enough to optimize all moves rather than just a move that doesn't move past any surrounding tokens. It's possible a move that skips other nodes could still leave some data on a node.

        For example, if you are using SimpleStrategy with RF=3, a node could perform a move that skips the two nodes before it in the ring but would still contain some of the data it had previously.

        Nick Bailey added a comment -

        Currently a move is just decommission then bootstrap. This means that if loadbalance is called, the token the node is going to move to isn't calculated until the node has fully left the ring.

        It might make sense for this ticket to implement this as a special case for when a token to move to is actually specified. Any loadbalancer implemented in CASSANDRA-1418 will need to calculate the token it plans to move to before the move operation takes place.


          People

          • Assignee: Pavel Yaskevich
          • Reporter: Nick Bailey
          • Reviewer: Nick Bailey
          • Votes: 1
          • Watchers: 5


              Time Tracking

              • Original Estimate: 42h
              • Remaining Estimate: 0h
              • Time Spent: 42h
