CASSANDRA-3829: make seeds *only* be seeds, not special in gossip


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Low
    • Resolution: Won't Fix

    Description

      First, a little bit of "framing" on how seeds work:

      The concept of "seed hosts" makes fundamental sense; you need to
      "seed" a new node with some information required in order to join a
      cluster. Seed hosts are the information Cassandra uses for this
      purpose.

      But seed hosts play a role even after the initial start-up of a new
      node in a ring. Specifically, seed hosts continue to be gossiped to
      separately by the Gossiper throughout the life of a node and the
      cluster.

      Generally, operators must be careful to ensure that all nodes in a
      cluster are appropriately configured to refer to an overlapping set of
      seed hosts. Strictly speaking this should not be necessary (see
      further down though), but is the general recommendation. An
      unfortunate side-effect of this is that whenever you are doing ring
      management, such as replacing nodes, removing nodes, etc, you have to
      keep in mind which nodes are seeds.

      For example, if you bring a new node into the cluster and it happens
      to be listed as a seed, it will skip bootstrap and just enter the
      cluster immediately, even if you do everything right with token
      assignment and auto_bootstrap=true - causing inconsistent reads.
      This is dangerous.

      And worse - changing the notion of which nodes are seeds across a
      cluster requires a rolling restart. It can be argued that it should
      actually be okay for nodes other than the one being fiddled with to
      incorrectly treat the fiddled-with node as a seed node, but this fact
      is highly opaque to most users that are not intimately familiar with
      Cassandra internals.

      This adds additional complexity to operations, as it introduces a
      reason why you cannot view the ring as completely homogeneous, despite
      the fundamental idea of Cassandra that all nodes should be equal.

      Now, fast forward a bit to what we are doing over here to avoid this
      problem: We have a ZooKeeper-based system for keeping track of hosts
      in a cluster, which is used by our Cassandra client to discover nodes
      to talk to. This works well.

      In order to avoid the need to manually keep track of seeds, we wanted
      to make seeds automatically discoverable, eliminating them as an
      operational concern. We have implemented a seed provider that does
      this for us, based on the data we keep in ZooKeeper.
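
      For concreteness, a minimal sketch of what such a provider could look
      like is below. It assumes the single-method SeedProvider interface
      (getSeeds() returning a List<InetAddress>) and a provider constructed
      with the parameter map from cassandra.yaml; the parameter names
      (zookeeper_connect, seed_path) and the znode layout (one child per
      seed, named by its IP address) are made up for illustration.

      import java.io.IOException;
      import java.net.InetAddress;
      import java.util.ArrayList;
      import java.util.Collections;
      import java.util.List;
      import java.util.Map;

      import org.apache.cassandra.locator.SeedProvider;
      import org.apache.zookeeper.KeeperException;
      import org.apache.zookeeper.WatchedEvent;
      import org.apache.zookeeper.Watcher;
      import org.apache.zookeeper.ZooKeeper;

      public class ZooKeeperSeedProvider implements SeedProvider
      {
          private final String connectString;
          private final String seedPath;

          public ZooKeeperSeedProvider(Map<String, String> args)
          {
              this.connectString = args.get("zookeeper_connect"); // e.g. "zk1:2181,zk2:2181"
              this.seedPath = args.get("seed_path");              // e.g. "/cassandra/seeds"
          }

          public List<InetAddress> getSeeds()
          {
              List<InetAddress> seeds = new ArrayList<InetAddress>();
              try
              {
                  // One-shot read; we only need the seed list at start-up.
                  // (A real implementation would wait for the session to
                  // connect and retry on transient failures.)
                  ZooKeeper zk = new ZooKeeper(connectString, 10000, new Watcher()
                  {
                      public void process(WatchedEvent event) { /* no-op */ }
                  });
                  try
                  {
                      for (String child : zk.getChildren(seedPath, false))
                          seeds.add(InetAddress.getByName(child));
                  }
                  finally
                  {
                      zk.close();
                  }
              }
              catch (IOException e)
              {
                  // Discovery failed; return whatever we have (possibly nothing).
              }
              catch (KeeperException e)
              {
                  // Same as above.
              }
              catch (InterruptedException e)
              {
                  Thread.currentThread().interrupt();
              }
              return Collections.unmodifiableList(seeds);
          }
      }

      The idea is that such a class replaces SimpleSeedProvider via the
      seed_provider section of cassandra.yaml.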

      We could see essentially three ways of plugging this in:

      • (1) We could simply rely on not needing overlapping seeds and grab whatever we have when a node starts.
      • (2) We could do something like continually treating all other nodes as seeds by dynamically changing the seed list (this involves some other changes, like having the Gossiper update its notion of seeds).
      • (3) We could completely eliminate the use of seeds except for the very specific purpose of initial start-up of an unbootstrapped node, and keep using a static (for the duration of the node's uptime) seed list.

      (3) was attractive because it felt like this was the original intent
      of seeds; that they be used for seeding, and not be constantly
      required during cluster operation once nodes are already joined.

      Now before I make the suggestion, let me explain how we are currently
      (though not yet in production) handling seeds and start-up.

      First, we have the following relevant cases to consider during a normal start-up:

      • (a) we are starting up a cluster for the very first time
      • (b) we are starting up a new clean node in order to join it to a pre-existing cluster
      • (c) we are starting up a pre-existing already joined node in a pre-existing cluster

      We proceeded on the assumption that we wanted to remove the use
      of seeds during regular gossip (other than on initial startup). This
      means that for the (c) case, we can completely ignore seeds. We
      never even have to discover the seed list, or if we do, we don't have
      to use them.

      This leaves (a) and (b). In both cases, the critical invariant we want
      to achieve is that we must have one or more valid seeds (valid means
      for (b) that the seed is in the cluster, and for (a) that it is one of
      the nodes that are part of the initial cluster setup).

      In the (c) case the problem is trivial - ignore seeds.

      In the (a) case, the algorithm is:

      • Register with zookeeper as a seed
      • Wait until we see at least one seed other than ourselves in zookeeper
      • Continue regular start-up process with the seed list (with 1 or more seeds)

      In the (b) case, the algorithm is:

      • Wait until we see at least one seed in zookeeper
      • Continue regular start-up process with the seed list (with 1 or more seeds)
      • Once fully up (around the time we listen to thrift), register as a seed in zookeeper
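
      A sketch of how those two sequences could look is below. The class,
      the method names and the znode layout are hypothetical (it assumes
      the seed znode already exists and that each registered node creates a
      child named by its own IP); retries and error handling are left out.

      import java.util.ArrayList;
      import java.util.List;

      import org.apache.zookeeper.CreateMode;
      import org.apache.zookeeper.ZooDefs;
      import org.apache.zookeeper.ZooKeeper;

      public final class SeedRegistration
      {
          private final ZooKeeper zk;
          private final String seedPath;   // e.g. "/cassandra/seeds" (assumed to already exist)
          private final String myAddress;  // this node's IP, used as its znode name

          public SeedRegistration(ZooKeeper zk, String seedPath, String myAddress)
          {
              this.zk = zk;
              this.seedPath = seedPath;
              this.myAddress = myAddress;
          }

          /** Case (a): brand-new cluster. Register first, then wait for at least one peer. */
          public List<String> firstClusterStartup() throws Exception
          {
              register();
              return waitForOtherSeeds();   // then continue the normal start-up process
          }

          /**
           * Case (b): new node joining an existing cluster. Wait for a seed and
           * bootstrap with it; register() is only called once the node is fully up
           * (around the time it starts listening for thrift).
           */
          public List<String> joinExistingCluster() throws Exception
          {
              return waitForOtherSeeds();
          }

          /** Register ourselves under the seed znode (an ephemeral node is one possible choice). */
          public void register() throws Exception
          {
              zk.create(seedPath + "/" + myAddress, new byte[0],
                        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
          }

          /** Poll until at least one seed other than ourselves is registered. */
          private List<String> waitForOtherSeeds() throws Exception
          {
              while (true)
              {
                  List<String> seeds = new ArrayList<String>(zk.getChildren(seedPath, false));
                  seeds.remove(myAddress);   // in case (a) we registered first; exclude ourselves
                  if (!seeds.isEmpty())
                      return seeds;
                  Thread.sleep(1000);
              }
          }
      }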

      With the annoyance that you have to explicitly let Cassandra know that
      "I am starting a cluster for the very first time from scratch", and
      ignoring the problem of single node clusters (just to avoid
      complicating this post further), this guarantees in both cases that
      all nodes eventually see each other.

      In the (a) case, all nodes except one are guaranteed to see the "one"
      node. The "one" node is guaranteed to see one of the others. Thus -
      convergence.

      In the (b) case, it's simple - the new node is guaranteed to see one
      or more nodes that are in the cluster - convergence.

      The current status is that we have implemented the seed provider and
      the start-up sequence works. But in order to simplify Cassandra (and
      to avoid having to diverge), we propose that we take this to its
      conclusion and officially make seeds only relevant on start-up, by
      only ever gossiping to seeds when in pre-bootstrap mode during
      start-up.
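
      In other words (a toy illustration only, not the actual Gossiper
      code): the seed list would only be consulted while the node has not
      yet joined the ring; afterwards every gossip round just picks a
      random live peer, with no special casing of seeds.

      import java.net.InetAddress;
      import java.util.List;
      import java.util.Random;

      final class GossipTargetChooser
      {
          private final Random random = new Random();

          /** Pick who to gossip to this round; null means "nobody to talk to yet". */
          InetAddress chooseTarget(List<InetAddress> livePeers, List<InetAddress> seeds, boolean joinedRing)
          {
              if (!joinedRing)
                  // Pre-bootstrap: seeds are the only way to find the cluster.
                  return seeds.isEmpty() ? null : seeds.get(random.nextInt(seeds.size()));

              // Joined: all nodes are equal; seeds get no extra gossip traffic.
              return livePeers.isEmpty() ? null : livePeers.get(random.nextInt(livePeers.size()));
          }
      }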

      The perceived benefits are:

      • Simplicity for the operator. All nodes are equal once joined; you can almost forget completely about seeds.
      • No rolling restarts, and no potential for foot-shooting a node into the cluster without bootstrap just because it happened to be a seed.
      • Production clusters will suddenly start to actually test the gossip protocol without relying on seeds. How sure are we that it even works, and that phi conviction and RING_DELAY are appropriate, given that practical clusters tend to gossip to a random node among very few seeds? This change would make it so that we always gossip randomly to anyone in the cluster, and there should be no danger that a cluster happens to hold together only because the seeds are up - only to explode when they are not.
      • It eliminates non-trivial concerns with automatic seed discovery, particularly when you want that seed discovery to be rack- and DC-aware. All you care about is what was described above; if a seed happens to fail, we simply fail to find the cluster, abort start-up, and retry. There is no need for "redundancy" in seeds.

      Thoughts? Are seeds important (by design) in some way other than for seeding? What do other people think about the implications of RING_DELAY etc?


          People

            Assignee: Brandon Williams
            Reporter: Peter Schuller
            Votes: 0
            Watchers: 6
