CASSANDRA-4119

Support multiple non-consecutive tokens per host (virtual nodes)

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: 1.2.0
    • Component/s: Core

      Description

      This is the parent ticket for the virtual nodes implementation which was proposed here: http://www.mail-archive.com/dev@cassandra.apache.org/msg03837.html and discussed in the subsequent thread.

      The goals of this ticket are:

      • reduced operations complexity for scaling up/down
      • reduced rebuild time in event of failure
      • evenly distributed load impact in the event of failure
      • evenly distributed impact of streaming operations
      • more viable support for heterogeneity of hardware

      The intention is that this can be done in a way which is

      • fully backwards-compatible
      • optionally enabled

      The latter of these can be trivially achieved by setting the number of tokens per host to 1, to reproduce the existing behaviour.

      Implementation detail can be added and discussed in the sub-tickets, but here is an overview of the proposed changes:

      • TokenMetadata will allow multiple tokens per host (a rough sketch follows this list)
      • Hosts will be referred to by a UUID instead of token (e.g. in Gossip, when storing hints, etc.)
      • A bootstrapping node can get multiple tokens from initial_token (comma separated) or by random allocation
      • NetworkTopologyStrategy will be extended to be aware of virtual nodes so that replicas are not placed on the same host (similar to racks now)
      • Repairs will be staggered similar to CASSANDRA-3721
      • Nodetool operations will be virtual-node aware, while maintaining backwards compatibility (i.e. existing scripts won't have to change)
      • Upgrade will be a standard rolling upgrade, with optional rolling migration to full vnode support
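
      To make the shape of these changes concrete, here is a minimal sketch of a ring that stores multiple tokens per UUID-identified host and selects bootstrap tokens either from a comma-separated initial_token or by random allocation. It is illustrative only: all class and method names are hypothetical, not the actual patch. Note that numTokens == 1 with a single initial_token degenerates to the existing one-token-per-host behaviour.

          import java.util.Collection;
          import java.util.Random;
          import java.util.SortedMap;
          import java.util.TreeMap;
          import java.util.TreeSet;
          import java.util.UUID;

          // Illustrative only: hosts are identified by UUID, and the ring maps
          // every token a host owns back to that host id.
          class VNodeRingSketch
          {
              // token -> owning host; a host appears once per token it holds
              private final SortedMap<Long, UUID> ring = new TreeMap<>();

              public void addHost(UUID hostId, Collection<Long> tokens)
              {
                  for (Long token : tokens)
                      ring.put(token, hostId);
              }

              // Bootstrap: parse comma-separated initial_token if set, otherwise
              // allocate numTokens tokens at random (long-valued tokens assumed,
              // as with Murmur3-style partitioning).
              public static Collection<Long> bootstrapTokens(String initialToken, int numTokens)
              {
                  TreeSet<Long> tokens = new TreeSet<>();
                  if (initialToken != null && !initialToken.isEmpty())
                  {
                      for (String s : initialToken.split(","))
                          tokens.add(Long.parseLong(s.trim()));
                  }
                  else
                  {
                      Random random = new Random();
                      while (tokens.size() < numTokens)
                          tokens.add(random.nextLong());
                  }
                  return tokens;
              }
          }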


          Activity

          Sam Overton created issue -
          Eric Evans made changes -
          Description: changed "Hosts will be referred to by IP instead of token" to "Hosts will be referred to by a UUID instead of token" (rest of the description unchanged)
          Sam Overton made changes -
          Link: This issue is blocked by CASSANDRA-3881
          Sam Overton added a comment -

          Added CASSANDRA-3881 as blocker.

          Brandon Williams added a comment -

          I smoke tested your git branch and saw write performance was about 33% slower than trunk. I haven't dug into why that is yet though.

          Eric Evans added a comment -

          I'm not seeing this; likely I'm not testing the way you are.

          You're testing from p/4127/01_migration_path? How many nodes in your test? I assume you used stress; what test parameters did you use? How many tokens per node are you using? Did you manually assign them, or did you allow them to be selected automatically (and are they evenly distributed (hint: nodetool clusterinfo))?

          Brandon Williams added a comment -

          You're testing from p/4127/01_migration_path?

          Yes

          How many nodes in your test?

          Three

          what test parameters did you use?

          Just defaults with the thread count upped to 300 (separate stress machine)

          How many tokens per node are you using?

          Default; 64

          Did you manually assign them or did you allow them to be selected automatically

          Automatic.

          are they evenly distributed (hint: nodetool clusterinfo)?

          The largest discrepancy I saw was about 6%, but even at 1-2% the difference was still profound.

          Eric Evans added a comment -

          Default; 64

          Was it 64, or 256 (the default is 256)?

          Brandon Williams added a comment -

          I mean 256, not sure where I got 64.

          Sam Overton added a comment -

          Brandon, I can't reproduce this difference either.

          Is debug logging turned on? If so, try with it off.

          What replication strategy & snitch are you using? Is the dynamic snitch on?

          The only piece of code on the insert path that these patches touch is calculateNaturalEndpoints, which is cached in AbstractReplicationStrategy anyway, so it's not immediately clear to me what could cause such a discrepancy in performance for you.

          I'll do some more thorough testing and profiling tomorrow.
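
          For context, the caching I'm referring to works roughly like the sketch below. This is illustrative only, not the actual AbstractReplicationStrategy code; the class, field, and type names are simplified and hypothetical.

              import java.util.List;
              import java.util.UUID;
              import java.util.concurrent.ConcurrentHashMap;

              // Illustrative sketch: replica lookup memoizes the expensive ring
              // walk, so vnodes add cost only to the first lookup per token
              // range, not to every write.
              abstract class ReplicationStrategySketch
              {
                  private final ConcurrentHashMap<Long, List<UUID>> cachedEndpoints = new ConcurrentHashMap<>();

                  // Expensive: walks the ring. With vnodes there are many more
                  // ranges, but each range is computed only once.
                  protected abstract List<UUID> calculateNaturalEndpoints(Long token);

                  public List<UUID> getNaturalEndpoints(Long token)
                  {
                      return cachedEndpoints.computeIfAbsent(token, this::calculateNaturalEndpoints);
                  }

                  // Cleared whenever ring membership changes, e.g. on a token move.
                  public void clearEndpointCache()
                  {
                      cachedEndpoints.clear();
                  }
              }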

          Brandon Williams added a comment -

          It's been some time since I last tested this, so I decided to give it another go. After updating the branch to 7e9ecd9783fbefaea13fe1b1ceb3bac46aa57291 I'm unable to reproduce, so perhaps it was some condition in trunk exacerbating the problem that has been fixed in the past ~20 days.

          Sam Overton added a comment -

          Just to make sure, I did an inserts test with trunk and with the patches applied: https://github.com/acunu/cassandra/wiki/4119-Performance-smoketest

          Brandon Williams made changes -
          Link: This issue is related to CASSANDRA-4658
          Jonathan Ellis made changes -
          Status: Open → Resolved
          Fix Version/s: 1.2.0
          Resolution: Fixed

            People

            • Assignee: Sam Overton
            • Reporter: Sam Overton
            • Votes: 2
            • Watchers: 25
