CASSANDRA-3829

make seeds *only* be seeds, not special in gossip

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Won't Fix
    • Fix Version/s: None
    • Component/s: Core
    • Labels:
      None

      Description

      First, a little bit of "framing" on how seeds work:

      The concept of "seed hosts" makes fundamental sense; you need to
      "seed" a new node with some information required in order to join a
      cluster. Seed hosts is the information Cassandra uses for this
      purpose.

      But seed hosts play a role even after the initial start-up of a new
      node in a ring. Specifically, seed hosts continue to be gossiped to
      separately by the Gossiper throughout the life of a node and the
      cluster.

      Generally, operators must be careful to ensure that all nodes in a
      cluster are appropriately configured to refer to an overlapping set of
      seed hosts. Strictly speaking this should not be necessary (see
      further down though), but is the general recommendation. An
      unfortunate side-effect of this is that whenever you are doing ring
      management, such as replacing nodes, removing nodes, etc, you have to
      keep in mind which nodes are seeds.

      For example, if you bring a new node into the cluster, doing
      everything right with token assignment and auto_bootstrap=true, but
      with the node itself listed as a seed, it will just enter the cluster
      without bootstrapping - causing inconsistent reads. This is dangerous.

      And worse - changing the notion of which nodes are seeds across a
      cluster requires a rolling restart. It can be argued that it should
      actually be okay for nodes other than the one being fiddled with to
      incorrectly treat the fiddled-with node as a seed node, but this fact
      is highly opaque to most users that are not intimately familiar with
      Cassandra internals.

      This adds additional complexity to operations, as it introduces a
      reason why you cannot view the ring as completely homogeneous, despite
      the fundamental idea of Cassandra that all nodes should be equal.

      Now, fast forward a bit to what we are doing over here to avoid this
      problem: We have a zookeeper-based system for keeping track of hosts
      in a cluster, which is used by our Cassandra client to discover nodes
      to talk to. This works well.

      In order to avoid the need to manually keep track of seeds, we wanted
      to make seeds automatically discoverable, eliminating them as an
      operational concern. We have implemented a seed provider that does
      this for us, based on the data we keep in zookeeper.

      We could see essentially three ways of plugging this in:

      • (1) We could simply rely on not needing overlapping seeds and grab whatever we have when a node starts.
      • (2) We could do something like continually treat all other nodes as seeds by dynamically changing the seed list (this involves some other changes, like having the Gossiper update its notion of seeds).
      • (3) We could completely eliminate the use of seeds except for the very specific purpose of initial start-up of an unbootstrapped node, and keep using a static (for the duration of the node's uptime) seed list.

      (3) was attractive because it felt like this was the original intent
      of seeds; that they be used for seeding, and not be constantly
      required during cluster operation once nodes are already joined.

      Now before I make the suggestion, let me explain how we are currently
      (though not yet in production) handling seeds and start-up.

      First, we have the following relevant cases to consider during a normal start-up:

      • (a) we are starting up a cluster for the very first time
      • (b) we are starting up a new clean node in order to join it to a pre-existing cluster
      • (c) we are starting up a pre-existing already joined node in a pre-existing cluster

      First, we proceeded on the assumption that we wanted to remove the use
      of seeds during regular gossip (other than on initial startup). This
      means that for the (c) case, we can completely ignore seeds. We
      never even have to discover the seed list, or if we do, we don't have
      to use them.

      This leaves (a) and (b). In both cases, the critical invariant we want
      to achieve is that we must have one or more valid seeds (valid means
      for (b) that the seed is in the cluster, and for (a) that it is one of
      the nodes that are part of the initial cluster setup).

      In the (c) case the problem is trivial - ignore seeds.

      In the (a) case, the algorithm is:

      • Register with zookeeper as a seed
      • Wait until we see at least one seed other than ourselves in zookeeper
      • Continue regular start-up process with the seed list (with 1 or more seeds)

      In the (b) case, the algorithm is:

      • Wait until we see at least one seed in zookeeper
      • Continue regular start-up process with the seed list (with 1 or more seeds)
      • Once fully up (around the time we listen to thrift), register as a seed in zookeeper
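The (a) and (b) algorithms above can be sketched against a toy in-memory registry standing in for zookeeper. This is a minimal illustration only; the registry class and function names are invented for this sketch and are not our actual seed provider:

```python
import time

class SeedRegistry:
    """Toy stand-in for the zookeeper-backed seed registry described
    above (illustrative only; not the real implementation)."""
    def __init__(self):
        self.seeds = set()

    def register(self, node):
        self.seeds.add(node)

    def current_seeds(self):
        return set(self.seeds)

def startup_seeds(registry, me, initial_cluster, poll=0.01, timeout=5.0):
    """Return the seed list to hand to the regular start-up process.

    Case (a) - first boot of a brand-new cluster: register first, then
    wait until at least one seed *other than ourselves* appears.
    Case (b) - joining an existing cluster: wait for any seed; the node
    registers itself only once fully up, so that step is left to the
    caller here.
    """
    deadline = time.monotonic() + timeout
    if initial_cluster:                       # case (a)
        registry.register(me)
        while time.monotonic() < deadline:
            others = registry.current_seeds() - {me}
            if others:
                return others
            time.sleep(poll)
    else:                                     # case (b)
        while time.monotonic() < deadline:
            seeds = registry.current_seeds()
            if seeds:
                return seeds
            time.sleep(poll)
    raise TimeoutError("no seeds discovered before timeout")
```

In case (b) the caller would register as a seed only once the node is fully up (around the time it starts listening to thrift), which is why registration is left out of that branch.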

      With the annoyance that you have to explicitly let Cassandra know that
      "I am starting a cluster for the very first time from scratch", and
      ignoring the problem of single node clusters (just to avoid
      complicating this post further), this guarantees in both cases that
      all nodes eventually see each other.

      In the (a) case, all nodes except one are guaranteed to see the "one"
      node. The "one" node is guaranteed to see one of the others. Thus -
      convergence.

      In the (b) case, it's simple - the new node is guaranteed to see one
      or more nodes that are in the cluster - convergence.

      The current status is that we have implemented the seed provider and
      the start-up sequence works. But in order to simplify Cassandra (and
      to avoid having to diverge), we propose that we take this to its
      conclusion and officially make seeds only relevant on start-up, by
      only ever gossiping to seeds when in pre-bootstrap mode during
      start-up.

      The perceived benefits are:

      • Simplicity for the operator. All nodes are equal once joined; you can almost forget completely about seeds.
      • No rolling restarts or potential for footshooting a node into a cluster without bootstrap because it happened to be a seed.
      • Production clusters will suddenly start to actually test the gossip protocol without relying on seeds. How sure are we that it even works, and that phi conviction is appropriate and RING_DELAY is appropriate, given that practical clusters tend to gossip to a random (among very few) seeds? This change would make it so that we always gossip randomly to anyone in the cluster, and there should be no danger that a cluster happens to hold together because seeds are up - only to explode when they are not.
      • It eliminates non-trivial concerns with automatic seed discovery, particularly when you want that seed discovery to be rack and DC aware. All you care about is what was described above; if that seed happens to fail, we simply fail to find the cluster and can abort start-up so it can be retried. There is no need for "redundancy" in seeds.

      Thoughts? Are seeds important (by design) in some way other than for seeding? What do other people think about the implications of RING_DELAY etc?

        Activity

        Brandon Williams made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Resolution Won't Fix [ 2 ]
        Brandon Williams added a comment -

        I don't think the potential danger of this optimization is worth the chance of regression, and I don't think RING_DELAY is something people consider a pain point.

        Jonathan Ellis made changes -
        Assignee Peter Schuller [ scode ] Brandon Williams [ brandon.williams ]
        Brandon Williams added a comment -

        That sounds reasonable. (And would imply "wait a couple seconds between bootstrapping nodes," right?)

        Right, plus some padding for timer skew and processing on the seeds.

        Jonathan Ellis added a comment -

        That sounds reasonable. (And would imply "wait a couple seconds between bootstrapping nodes," right?)

        Brandon Williams added a comment -

        If I understand correctly, we're going to reduce RING_DELAY as follows:

        • gossip a full round to every seed, sleep one extra gossip interval (1s)
        • announce the pending range setup to each seed, sleep one extra gossip interval

        Step 1 is to learn about all nodes in the ring, and make sure they know about us. Step 2 is roughly the same, but with the pending range announced. The catch here, however, is that we're exploiting the 'seed optimization' (meaning that all other nodes will have gossiped with one of the seeds during the gossip interval we slept for) which means that seed list homogeneity is now even more important than before; if any node has a differing list we can't guarantee that it saw our updates in this time frame.
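The two steps above could be sketched roughly as follows, with gossip_to as a stand-in callback for a real gossip exchange (none of these names are Cassandra's actual API; this is just the shape of the proposal):

```python
import time

GOSSIP_INTERVAL = 1.0  # seconds between gossip rounds

def announce_to_ring(seeds, gossip_to, interval=GOSSIP_INTERVAL):
    """Sketch of the proposed two-phase start-up announcement."""
    # Phase 1: one full round to every seed, so we learn the ring and
    # the seeds learn about us; then sleep one extra gossip interval so
    # every other node has had a chance to gossip with some seed.
    for seed in seeds:
        gossip_to(seed, payload="full-state")
    time.sleep(interval)

    # Phase 2: same again, but announcing our pending ranges.
    for seed in seeds:
        gossip_to(seed, payload="pending-ranges")
    time.sleep(interval)
```

Note that, as the comment says, the correctness of the sleep in each phase leans on the 'seed optimization' and therefore on homogeneous seed lists.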

        Jonathan Ellis added a comment -

        Brandon says: "every round of gossip already fully updates both members, regardless of who initiates." So that simplifies part 1 of what I propose to "gossip to each seed on startup [and wait for acks]."

        Jonathan Ellis added a comment -

        this impacts the usability of single-node clusters which is where virtually everybody starts. So, I'll need to see a solution that doesn't make life more confusing for that overwhelming majority.

        What about this?

        1. Instead of special casing seeds in the gossip loop (so that seeds will eventually push state to us), have nodes pull state from all configured (and reachable) seeds on startup, by adding a new cluster state request verb (that would send the same state we have in gossip)
        2. Seeds bootstrap normally

        This means that for a single-node nothing changes, since there is only itself configured, but for the "replace node in large cluster scenario" you don't have to care about seed-ness.

        This also allows us to eliminate RING_DELAY in other places in the code where it's just saying "wait until we think we have an accurate picture of the cluster state" since explicitly pulling it from the seeds solves that better.
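A rough sketch of this pull-based start-up, with request_state standing in for the proposed cluster state request verb (all names here are hypothetical; the real verb and its wire format are not specified in this ticket):

```python
def pull_cluster_state(seeds, me, request_state):
    """Pull gossip state from all configured, reachable seeds on
    startup. request_state(seed) is a hypothetical callback for the
    new verb; it returns a node->state mapping, or raises
    ConnectionError if the seed is unreachable."""
    merged = {}
    for seed in seeds:
        if seed == me:          # single-node case: only ourselves configured
            continue
        try:
            state = request_state(seed)
        except ConnectionError:
            continue            # seed unreachable; try the rest
        merged.update(state)    # last-writer-wins merge is a simplification
    return merged
```

The single-node property falls out naturally: with only itself configured, the loop does nothing and start-up proceeds as today.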

        Jonathan Ellis added a comment -

        I have a hard time understanding how it cannot be obviously better to allow seeds to be reloadable?

        My hangup is that I still don't understand what problem that solves. In the example you give of replacing a seed node, all you need to do is take that node out of its own seed list so it will bootstrap, then afterwards you can add it back to its own seed list. You don't need to touch the other nodes at all.

        Peter Schuller added a comment -

        Okay, I'm with you so far. But as you note, this impacts the usability of single-node clusters which is where virtually everybody starts. So, I'll need to see a solution that doesn't make life more confusing for that overwhelming majority. I get that you don't like the current tradeoffs but I haven't seen a better proposal yet. (I'll go ahead and pre-emptively -1 special environment variables...)

        I haven't been able to come up with a solution that avoids the initial setup requiring special actions. While I am personally fine with this (any software that doesn't would cause me to wonder "what? what if this wasn't an initial setup?") I understand that 99% of users would probably not be fond of this behavior and it would just turn people off of Cassandra.

        So, what about an opt-in setting which explicitly says the inverse - this is a production cluster that is not being set up? The recommendation could be that everyone uses this setting after a cluster is in production, but things keep working if they don't (subject to the risks associated with re-bootstrapping someone on the seed list, a problem we already have).

        This could be either a cassandra.yaml option or, if that is deemed too visible/confusing, a not-so-prominently-documented environment variable. However, if a documented cassandra.yaml option in the default config is not acceptable, I'd still prefer a cassandra.yaml setting that wasn't in the default configuration over an environment variable.

        (This is another case where it doesn't really matter to me. We can easily just patch in the env variable and run with it on our end, it's not like that patch will be a maintenance problem for us. I really just want to try to make this safer for all users.)

        I still haven't seen a case when this, or special-casing seeds to prevent gossip partitions, causes real problems. Whereas I was around when we added the gossip-partition-prevention code, so I do know the problems that prevents.

        Jumping into clusters/rolling restarts:

        So I can give anecdotal stories about seeing people, multiple times, being unaware and/or confused about a node jumping into a cluster without bootstrapping and not realizing what's going on, or tell you that a long time ago before I knew enough about gossip I was feeling the pains of rolling restarts whenever maintenance was done on clusters.

        But in this case it seems better to just have it flow from actual facts because it's not really that subjective. Consider the combination of:

        • Restarts are in fact required to change seeds.
        • A restart can easily be very very slow due to index sampling (until the samples-on-disk patch is in), row cache pre-load, commit log replay (not if you drained properly though), etc.
        • A restart can also be problematic if it e.g. causes page cache eviction and thus necessitates rate limiting rolling restarts.
        • Completing rolling restarts in a safe manner can be prevented by pre-existing down nodes in the cluster (e.g., RF=3 QUORUM, one node already down -> can't restart neighbors).
        • In addition, all forms of restarts carry with them some risk, even if we were to only consider the risk involved in terms of adding additional windows of potential double failures.

        Having to do a full rolling restart on a production cluster, particularly if the cluster has a lot of data (-> slower restarts, more sensitive to page caches, etc), is a huge operation to do just because you needed to e.g. replace a broken disk and rebootstrap a node that just happened to be a seed. And clearly, the probability that some other node in a large cluster is currently down for whatever reason is non-trivial, and would prevent completing a rolling restart.

        Of course one might again argue that there is no real need to be that strict about maintaining the seed list, but again the circumstances under which this is safe are very opaque to people not intimately familiar with the code - and not being strict about it kind of takes away the protection against partitions it was supposed to give you from the start.

        So, while I realize changing the role of seeds is more controversial, I have a hard time understanding how it cannot be obviously better to allow seeds to be reloadable? Pushing a .yaml configuration file vs. a complete rolling restart of the entire cluster - that's a huge difference in impact, effort and risk for most production clusters.

        Jonathan Ellis added a comment -

        I propose that the behavior of (1) be removed

        Okay, I'm with you so far. But as you note, this impacts the usability of single-node clusters which is where virtually everybody starts. So, I'll need to see a solution that doesn't make life more confusing for that overwhelming majority. I get that you don't like the current tradeoffs but I haven't seen a better proposal yet. (I'll go ahead and pre-emptively -1 special environment variables...)

        Fixing (2) so that the seed list is reloadable

        I still haven't seen a case when this, or special-casing seeds to prevent gossip partitions, causes real problems. Whereas I was around when we added the gossip-partition-prevention code, so I do know the problems that prevents.

        Peter Schuller added a comment -

        I'm not sure what's making it sound like I want a free lunch

        Let me start with what I hope are the less controversial bits.

        1. If you apply the normal bootstrapping process when inserting a node into the cluster, and it happens to be a seed according to its own configuration, it will just jump into the cluster w/o streaming data.
        2. You currently have to do rolling restarts to change the seed list.

        In order to make clusters easier to operate, and make it more difficult to shoot yourself in the foot, I propose that the behavior of (1) be removed. I think it makes more sense to require a special setting (such as a system property) when performing the very unusual (in production) task of setting up a new cluster from scratch. For single-node cases, we could support a mode where a node is "alone" and never tries to bootstrap if we are concerned with maintaining simple "./bin/cassandra -f" type running of lone nodes.

        Fixing (2) so that the seed list is reloadable makes sense if seeds are kept "relevant" other than on start-up, and would in particular be even more important if we cannot agree on (1). Asking users for rolling restarts to do maintenance on a seed is IMO clearly not a good thing, even if we were to disagree about eliminating the behavior in (1).

        Ok - so far what I've said in this comment doesn't change the notion of seeds as something which is continually used throughout the life-time of a node.

        Now, if we make seeds be dynamic during runtime (minor changes in the code are needed to support this "cleanly", but it's not a big deal) everyone is of course free to do whatever they want in terms of seed sources. I described an example zookeeper/serversets based case in the original filing of this ticket. For someone with infrastructure in place for multiple clusters and where these things aren't "manually" maintained, it's not really an issue once we reach the point of never having to do rolling restarts.
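For illustration, a reloadable file-backed provider along these lines might look like the following sketch. The class name and file format are invented for this example (Cassandra's real SeedProvider interface is Java and is not shown here); the point is only that the Gossiper would ask for the list each round instead of caching it, so edits take effect without a restart:

```python
import os

class FileSeedProvider:
    """Sketch of a reloadable, file-backed seed provider. One seed
    address per line; blank lines and '#' comments are ignored."""
    def __init__(self, path):
        self.path = path
        self._mtime = None
        self._seeds = set()

    def get_seeds(self):
        # Re-read the file only when its modification time changes.
        mtime = os.stat(self.path).st_mtime
        if mtime != self._mtime:
            with open(self.path) as f:
                self._seeds = {line.strip() for line in f
                               if line.strip() and not line.startswith("#")}
            self._mtime = mtime
        return set(self._seeds)
```

An operator (or tool) would then push a new seeds file instead of performing a rolling restart.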

        But I would really like to go further and make the seed concept simpler for everyone. I am not proposing to remove seeds; only to make them seeds only, in the sense of initially seeding a node with information about its cluster when it starts up for the first time (not as a list of "special" nodes that are always gossiped to). Even if we make seeds reloadable, and provide an out-of-the-box implementation that e.g. loads from a property file, it still means operators (or their tools) have to actively be aware of seeds and the fact that special action is required during some tasks, if an affected node happens to be a seed.

        I believe that for operational simplicity, it would be better if seeds would only enter the consciousness (or tool) on initial bootstrap, where they are fundamentally required no matter what (for obvious reasons there must be some source, as you point out, pointing a node to the appropriate cluster).

        As discussed, this would be a slight regression in terms of partitioning, in the sense that if a node goes down for a while, and goes back up, and all nodes it knows about have either changed IP addresses or are down - then yes, you would introduce a partition. But look at it this way: in my opinion this can easily be considered "operator error". If you point clients to a set of nodes in a cluster and make hugely significant topology changes while a node being used by clients is down, that's a mistake. It's worth noting, however, that it's only slightly easier to make that mistake than the mistake already possible right now - in the exact same scenario, you are already in trouble if all of the nodes listed in your seeds list are among those either down or having changed IP address. Now, granted, if you change the IP address of a seed you will deploy that change; but what if the node that went down just booted up (and never got the deploy)? You still have, in practice, a partitioned cluster.

        So in short, I believe that for practical use-cases, removing the significance of seeds in all but initial seeding has minimal negative consequences, while the positive consequences in terms of operational simplicity are very much significant.

        That said, if I am truly the only person who thinks this would be an important improvement, then we can at least make seeds dynamic and provide a simple out-of-the-box way of using that feature (property file based seeds probably). I'll submit the necessary patches in a separate ticket if so. If so, I will also try to make time to empirically test how the propagation time in the cluster is affected by cluster size (because of CASSANDRA-3830).

        Show
        Peter Schuller added a comment -

        I'm not sure what's making it sound like I want a free lunch. Let me start with what I hope are the less controversial bits:

        (1) If you apply the normal bootstrapping process when inserting a node into the cluster, and the node happens to be a seed according to its own configuration, it will just jump into the cluster without streaming data.
        (2) You currently have to do rolling restarts to change the seed list.

        In order to make clusters easier to operate, and make it harder to shoot yourself in the foot, I propose that the behavior in (1) be removed. I think it makes more sense to require a special setting (such as a system property) when performing the very unusual (in production) task of setting up a new cluster from scratch. For single-node cases, we could support a mode where a node is "alone" and never tries to bootstrap, if we are concerned with maintaining simple "./bin/cassandra -f" style running of lone nodes.

        Fixing (2) so that the seed list is reloadable makes sense if seeds are kept "relevant" beyond start-up, and would be even more important if we cannot agree on (1). Asking users for rolling restarts to do maintenance on a seed is, in my opinion, clearly not a good thing, even if we were to disagree about eliminating the behavior in (1).

        So far, nothing I've said in this comment changes the notion of seeds as something which is continually used throughout the lifetime of a node. Now, if we make seeds dynamic during runtime (minor changes in the code are needed to support this cleanly, but it's not a big deal), everyone is of course free to do whatever they want in terms of seed sources. I described an example ZooKeeper/ServerSets-based case in the original filing of this ticket. For someone with infrastructure in place for multiple clusters, where these things aren't maintained manually, it's not really an issue once we reach the point of never having to do rolling restarts.

        But I would really like to go further and make the seed concept simpler for everyone. I am not proposing to remove seeds; only to make them seeds *only*, in the sense of initially seeding a node with information about its cluster when it starts up for the first time (not as a list of "special" nodes that are always gossiped to). Even if we make seeds reloadable, and provide an out-of-the-box implementation that e.g. loads from a property file, it still means operators (or their tools) have to actively be aware of seeds and the fact that special action is required during some tasks if an affected node happens to be a seed. I believe that for operational simplicity, it would be better if seeds only entered the operator's consciousness (or tooling) on initial bootstrap, where they are fundamentally required no matter what (for obvious reasons there must be some source, as you point out, pointing a node to the appropriate cluster).

        As discussed, this would be a slight regression in terms of partitioning: if a node goes down for a while, comes back up, and all the nodes it knows about have either changed IP addresses or are down, then yes, you would introduce a partition. But look at it this way: in my opinion this can easily be considered operator error. If you point clients to a set of nodes in a cluster and make hugely significant topology changes while a node being used by clients is down, that's a mistake. It's worth noting, however, that this is only slightly easier to get wrong than the potential for a mistake already there right now: in the exact same scenario, you are already in trouble if all of the nodes listed in your seed list are among those either down or having changed IP address. Now granted, if you change the IP address of a seed you will deploy that change; but what if the node that went down just booted up and never got the deploy? You still have, in practice, a partitioned cluster.

        In short, I believe that for practical use cases, removing the significance of seeds in all but initial seeding has minimal negative consequences, while the positive consequences in terms of operational simplicity are very much significant. That said, if I am truly the only person who thinks this would be an important improvement, then we can at least make seeds dynamic and provide a simple out-of-the-box way of using that feature (property-file-based seeds, probably). I'll submit the necessary patches in a separate ticket if so. If so, I will also try to make time to empirically test how propagation time in the cluster is affected by cluster size (because of CASSANDRA-3830).
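To make the "reloadable seed list" idea concrete, here is a minimal sketch, assuming a property-file source. The class name, property key, and reload hook are all hypothetical illustrations, not Cassandra's actual SeedProvider API:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Properties;

// Hypothetical sketch: a seed provider whose list can be refreshed at
// runtime, e.g. from a deployed property file or an external store such
// as ZooKeeper, so that changing seeds needs no rolling restart.
public class ReloadableSeedProvider {
    private volatile List<String> seeds = Collections.emptyList();

    // Re-read the seed list from a properties source. A real node could
    // invoke this on a timer or from a JMX hook.
    public void reload(Reader source) throws IOException {
        Properties props = new Properties();
        props.load(source);
        List<String> fresh = new ArrayList<>();
        for (String s : props.getProperty("seeds", "").split(",")) {
            if (!s.trim().isEmpty())
                fresh.add(s.trim());
        }
        seeds = Collections.unmodifiableList(fresh); // atomic swap via volatile
    }

    public List<String> getSeeds() {
        return seeds;
    }

    public static void main(String[] args) throws IOException {
        ReloadableSeedProvider p = new ReloadableSeedProvider();
        p.reload(new StringReader("seeds=10.0.0.1,10.0.0.2"));
        System.out.println(p.getSeeds()); // [10.0.0.1, 10.0.0.2]
        p.reload(new StringReader("seeds=10.0.0.3"));
        System.out.println(p.getSeeds()); // [10.0.0.3]
    }
}
```

The volatile swap means readers never see a half-updated list; a failed reload simply leaves the previous list in place.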
        Jonathan Ellis added a comment -

        I don't want people to be scared because a node is a seed because they aren't intimately familiar with the code base.

        What do you propose? We need some kind of cluster discovery method; flat files, zookeeper, or other external shared storage (e.g. s3 or simpledb for AWS users) all have problems of their own. I'm not sure there's a free lunch to be had here.

        Peter Schuller added a comment -

        My point is that the standard, officially supported and documented way of running a cluster is that you're supposed to keep the seeds up to date. As I stated in my original submission, it takes a lot of insight into the code base to really understand under what circumstances it is safe to run with seeds down. I want to make Cassandra simpler to operate for everyone, not just developers of Cassandra. I don't want a user to randomly see errors in the log about trying to gossip to some random cluster that is not even related, if they happen to re-use an IP for another cluster and didn't do rolling restarts of seeds; I don't want people to accidentally bootstrap machines into the cluster without actual streaming ("bootstrap") and serve bad reads; I don't want people to constantly have to think "is any of the nodes I'm touching a seed, and if so, what do I need to do about it?". I don't want people to be scared because a node is a seed when they aren't intimately familiar with the code base.

        I want Cassandra to live up to the claim that all nodes are equal.

        Further, even if you are aware of this, you DO still need seeds when bootstrapping nodes in a cluster, and if you do not want to run a cluster that is depending on the gossip-to-seed special case, your process for bootstrapping a node is now to bootstrap a node with seeds, bring it back down, remove seeds (or add some bogus seed if we require seeds, I didn't check), and start the node back up.

        Brandon Williams added a comment -

        Most people can start almost completely ignoring seeds, except on initial bootstrap. This makes cluster operation simpler (see discussion above).

        Again, with the persistent ring, you can do this already, though it's not the cleanest practice perhaps.

        We remove reliance on the special case and can be more comfortable that gossip works (in any particular cluster) without the seeds (specifically) being up. We eliminate the risk of sudden explosion because seeds were brought down.

        You can already do this as well, having seeds up is not a requirement.

        Peter Schuller added a comment -

        It removes the massive special case that is gossip-to-seeds, and we can hopefully document better under what circumstances it's even relevant to begin with; under almost all circumstances, most operators could just ignore seeds other than for initial bootstrap. For most operational cases, there would never be a need to maintain seed lists, but if someone is in a position where the partition prevention is statistically an issue (a tiny two-node cluster, or something whereby nodes change IPs frequently), they can still maintain the seed list.

        So bottom line:

        • Most people can start almost completely ignoring seeds, except on initial bootstrap. This makes cluster operation simpler (see discussion above).
        • We remove reliance on the special case and can be more comfortable that gossip works (in any particular cluster) without the seeds (specifically) being up. We eliminate the risk of sudden explosion because seeds were brought down.
        Brandon Williams added a comment -

        Instead of saying "seeds are absolutely only used on initial bootstrap", we make it "seeds are also considered after every start-up, until at least a single gossip round has happened successfully with the seed in question".

        This should retain, I think, the healing properties we have now with respect to nodes re-starting after having been down during topology changes (but unfortunately retains the requirement that a human keeps the seed list up to date at all times, and not just when adding nodes).

        If a human still has to maintain the seed list, what does this buy us over keeping things the way they are?

        Peter Schuller added a comment -

        Regarding the "everyone switches IP" case, here is a stop-gap adjustment to my suggestion that should make this no longer a significant problem:

        Instead of saying "seeds are absolutely only used on initial bootstrap", we make it "seeds are also considered after every start-up, until at least a single gossip round has happened successfully with the seed in question".

        This should retain, I think, the healing properties we have now with respect to nodes re-starting after having been down during topology changes (but unfortunately retains the requirement that a human keeps the seed list up to date at all times, and not just when adding nodes). It should also be easy to implement in the gossip (from my recollections of recent staring at that code, I didn't vet this idea specifically).

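The proposed stop-gap can be sketched as follows, assuming a flag on the gossiper. Class and method names are hypothetical; the real Gossiper is structured differently:

```java
import java.util.List;
import java.util.Random;

// Hypothetical sketch of the stop-gap: after start-up, keep contacting
// seeds each round, but only until one gossip exchange with any seed has
// succeeded; from then on seeds are ordinary nodes.
public class StartupSeedGossip {
    private final List<String> seeds;
    private final Random random = new Random();
    private boolean seedRoundSucceeded = false;

    public StartupSeedGossip(List<String> seeds) {
        this.seeds = seeds;
    }

    // Returns the extra seed target for this round, or null once a seed
    // exchange has succeeded (i.e. seeds are no longer special).
    public String maybePickSeed() {
        if (seedRoundSucceeded || seeds.isEmpty())
            return null;
        return seeds.get(random.nextInt(seeds.size()));
    }

    // Called when a gossip ack arrives from the given endpoint.
    public void onGossipAck(String endpoint) {
        if (seeds.contains(endpoint))
            seedRoundSucceeded = true; // stop treating seeds specially
    }

    public static void main(String[] args) {
        StartupSeedGossip g = new StartupSeedGossip(java.util.Arrays.asList("10.0.0.1"));
        System.out.println(g.maybePickSeed()); // 10.0.0.1 (seed still special)
        g.onGossipAck("10.0.0.1");
        System.out.println(g.maybePickSeed()); // null (seed no longer special)
    }
}
```

This preserves the healing property for a node restarting after topology changes, while making the seed special case self-extinguishing once the node has rejoined gossip.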
        Peter Schuller added a comment -

        Historically, I believe this has been for ensuring partitions heal. However, with the ring persisted after CASSANDRA-1518 this is probably not important in the common case of an established ring.

        If the premise is that you've created a partition, you are kind of fubared anyway (you will have caused inconsistencies in data access). Even with "seeds only being true seeds", you should never be partitioned to begin with as long as the seed list, when you do need it (bootstrap), is a correct list of members in the cluster.

        Now, it is true that if you, e.g., bring a node down, lots of changes happen to the ring ("everyone" changes IP, etc.), and you then attempt to bring the node up again, its entire notion of the ring is no longer valid (the same goes for a group of hosts), and you will have a problem. Especially for small clusters, this may be something to seriously consider.

        There are, I think, reasonable mitigations here that will be needed anyway, having to do with making node start-up include certain steps that ensure the node is in a valid ring prior to taking client traffic - but I want to avoid getting into that discussion at the moment (will revisit in the locator redesign JIRAs).

        This is probably not true, you can just change them everywhere and let the nodes restart naturally due to whatever reason.

        That's if you aren't blocking on it and don't care about losing track of whether you're satisfying the seed invariant. If you want to maintain "X seeds, no seed that is not actually in the cluster" and you need to, e.g., pop out one node that's a seed as part of ring re-adjustment, or just because you picked a few hosts (that were in a rack or whatever) to remove from the ring, you now need to actually push the seed change out.

        Again, I don't believe skipping this is a problem per se, but it is if we go with the officially supported behavior/invariant of maintaining seeds.

        I should of course mention that simply making the seeds dynamically reloadable mitigates this problem quite a lot, so if this is the main issue, that is in fact an easy fix to make. So this is not really my primary argument. I am much more interested in ensuring that we don't rely on seeds for correct propagation through gossip, and in removing the mental load on operators to consider seeds for anything but initial bootstrapping of nodes.

        I'm not sure I follow, why would a node told to autobootstrap disregard that and just join the ring?

        That's what Cassandra does - check joinTokenRing(). A node that detects that it is a seed (i.e., it is itself listed in its own list of seeds) will skip the bootstrapping step and just enter the ring. Very dangerous. For example, you lose a node due to a disk crash. It comes back up with clean state. You deploy your configuration. auto_bootstrap is true by default (which is good and correct, and let's not change that). You forget that it's one of the seeds. Bang, now you have brought a node into the cluster serving inconsistent reads.

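The hazard boils down to a single branch. Here is a simplified sketch of the decision, modeled on (but not copied from) joinTokenRing; the method and parameter names are hypothetical:

```java
import java.util.Collections;
import java.util.Set;

// Simplified sketch of the join decision: a node that finds itself in
// its own seed list skips bootstrap entirely, even with
// auto_bootstrap=true -- which is exactly the dangerous case for a node
// rebuilt with empty state after, say, a disk crash.
public class JoinDecision {
    public static boolean shouldBootstrap(boolean autoBootstrap,
                                          Set<String> seeds,
                                          String myAddress,
                                          boolean hasSavedState) {
        if (!autoBootstrap)
            return false;          // operator explicitly opted out
        if (seeds.contains(myAddress))
            return false;          // seed: jumps straight into the ring
        return !hasSavedState;     // normal new node: stream data first
    }

    public static void main(String[] args) {
        Set<String> seeds = Collections.singleton("10.0.0.1");
        // Rebuilt seed with empty state silently skips bootstrap:
        System.out.println(shouldBootstrap(true, seeds, "10.0.0.1", false)); // false
        // An ordinary new node bootstraps as expected:
        System.out.println(shouldBootstrap(true, seeds, "10.0.0.2", false)); // true
    }
}
```

Nothing in the seed branch consults whether the node actually has data, which is why the clean-state-seed case serves inconsistent reads.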
        Brandon Williams added a comment -

        Are seeds important (by design) in some way other than for seeding?

        Historically, I believe this has been for ensuring partitions heal. However, with the ring persisted after CASSANDRA-1518 this is probably not important in the common case of an established ring. That said..

        And worse - changing the notion of which nodes are seeds across a cluster requires a rolling restart.

        This is probably not true, you can just change them everywhere and let the nodes restart naturally due to whatever reason.

        For example, if you bring a new node into the cluster, doing everything right with token assignment and auto_bootstrap=true, it will just enter the cluster without bootstrap

        I'm not sure I follow, why would a node told to autobootstrap disregard that and just join the ring?

        Peter Schuller added a comment -

        Put another way, on a typical cluster the number of live nodes will not be less than the number of seeds, unless you have configured your cluster to have all nodes as seeds (recommended against). So for the purpose of that typical case, we should be able to ignore the CASSANDRA-150 stuff for the purpose of this ticket.

        Peter Schuller added a comment -

        Random live members are gossiped to, this may of course include nodes that happen to be seeds, but the only time seeds are done 'separately' is if the amount of live nodes is less than the amount of seeds (see CASSANDRA-150)

        What you are talking about from CASSANDRA-150 is there in the code, but it caters to an edge case which is almost never true: early start-up of a cluster, avoiding partitions.

        When this does not come into play, the same old algo seems to apply: We always gossip to a random live member, followed by a random seed unless the random live member we picked happened to be a seed.

        The result is that on a typical cluster with seeds up, it would be expected that propagation seems to work even if the generalized propagation were broken, just because of the seed special case.

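The round structure described above can be sketched as follows. Names are hypothetical and the real Gossiper differs in detail; the CASSANDRA-150 "fewer live nodes than seeds" handling and the unreachable-node probe are omitted:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Sketch of per-round target selection: gossip to a random live member,
// then additionally to a random seed unless the live member chosen was
// itself a seed.
public class GossipRound {
    public static List<String> chooseTargets(List<String> live,
                                             List<String> seeds,
                                             Random random) {
        List<String> targets = new ArrayList<>();
        if (!live.isEmpty()) {
            String member = live.get(random.nextInt(live.size()));
            targets.add(member);
            if (seeds.contains(member))
                return targets;    // already hit a seed this round
        }
        if (!seeds.isEmpty())
            targets.add(seeds.get(random.nextInt(seeds.size())));
        return targets;
    }

    public static void main(String[] args) {
        List<String> targets = chooseTargets(
            java.util.Arrays.asList("10.0.0.5"),
            java.util.Arrays.asList("10.0.0.9"),
            new Random());
        System.out.println(targets); // [10.0.0.5, 10.0.0.9]
    }
}
```

The consequence Peter describes falls out directly: every round that doesn't already touch a seed gets a second, seed-directed message, so seed-mediated propagation can mask a broken general path.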
        Brandon Williams added a comment -

        Specifically, seed hosts continue to be gossiped to separately by the Gossiper throughout the life of a node and the cluster.

        Random live members are gossiped to, this may of course include nodes that happen to be seeds, but the only time seeds are done 'separately' is if the amount of live nodes is less than the amount of seeds (see CASSANDRA-150)

        Peter Schuller made changes -
        Issue Type changed from Bug [ 1 ] to Improvement [ 4 ]
        Peter Schuller created issue -

          People

          • Assignee: Brandon Williams
          • Reporter: Peter Schuller
          • Votes: 0
          • Watchers: 6