Cassandra / CASSANDRA-3830

gossip-to-seeds is not obviously independent of failure detection algorithm


Details

    • Type: Task
    • Status: Open
    • Priority: Low
    • Resolution: Unresolved

    Description

      The failure detector, ignoring all the theory, boils down to an
      extremely simple algorithm. The FD keeps a sliding window (currently
      the last 1000 entries) of heartbeat intervals for a given host.
      Meaning, we have a track record of the intervals between the last
      1000 times we saw an updated heartbeat for that host.

      At any given moment, a host has a score which is simply the time since
      the last heartbeat, over the mean interval in the sliding
      window. For historical reasons a simple scaling factor is applied to
      this prior to checking the phi conviction threshold.

      (CASSANDRA-2597 has the details, but thanks to Paul's work there it's
      now trivial to understand intuitively what it does.)

      So in effect, a host is considered down if we haven't heard from it in
      some time which is significantly longer than the "average" time we
      expect to hear from it.
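      The scoring described above can be sketched roughly as follows. This
      is a simplified model, not the real o.a.c.gms.FailureDetector; the
      class and method names are illustrative, and the 1/ln(10) scaling
      factor is the historical constant discussed in CASSANDRA-2597.

```java
import java.util.ArrayDeque;

// Simplified sketch of the failure-detector scoring described above.
// Names and constants are illustrative, not Cassandra's actual code.
public class PhiSketch {
    static final int WINDOW = 1000;                         // sliding window size
    static final double PHI_FACTOR = 1.0 / Math.log(10.0);  // historical scaling

    private final ArrayDeque<Double> intervals = new ArrayDeque<>();
    private double lastHeartbeatMillis = -1;

    // Record a heartbeat arrival, keeping the last WINDOW inter-arrival times.
    void report(double nowMillis) {
        if (lastHeartbeatMillis >= 0) {
            intervals.addLast(nowMillis - lastHeartbeatMillis);
            if (intervals.size() > WINDOW)
                intervals.removeFirst();
        }
        lastHeartbeatMillis = nowMillis;
    }

    // phi = scaling factor * (time since last heartbeat / mean interval);
    // the node is convicted when phi exceeds the conviction threshold.
    double phi(double nowMillis) {
        double mean = intervals.stream().mapToDouble(d -> d)
                               .average().orElse(Double.MAX_VALUE);
        return PHI_FACTOR * (nowMillis - lastHeartbeatMillis) / mean;
    }

    public static void main(String[] args) {
        PhiSketch fd = new PhiSketch();
        // Heartbeats every ~1500 ms, matching the mean seen in the test cluster.
        for (int i = 0; i <= 10; i++)
            fd.report(i * 1500.0);
        System.out.println(fd.phi(16500.0)); // one interval late: ~0.43, healthy
        System.out.println(fd.phi(45000.0)); // long silence: ~8.7, convicted
    }
}
```

      Note how phi depends only on the ratio of the current silence to the
      learned mean interval, which is why a shift in the mean matters.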

      This seems reasonable, but it does assume that under normal conditions
      the average time between heartbeats does not change for reasons other
      than those that would be plausible reasons to think a node is
      unhealthy.

      This assumption could be violated by the gossip-to-seed
      feature. There is an argument to avoid gossip-to-seed for other
      reasons (see CASSANDRA-3829), but this is a concrete case in which
      gossip-to-seed could cause a negative side-effect of the general kind
      mentioned in CASSANDRA-3829 (see notes at the end about the case
      without seeds not being continuously tested). Normally, due to gossip
      to seeds, everyone essentially sees the latest information within
      very few heartbeats (assuming only 2-3 seeds). But should all seeds
      be down, we suddenly flip a switch and start relying on generalized
      propagation in the gossip system, rather than the seed special case.

      The potential problem I foresee here is that if the average
      propagation time suddenly spikes when all seeds become unavailable,
      it could cause bogus flapping of nodes into the down state.
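      To put rough numbers on the concern: suppose the interval window was
      filled while seeds were up, giving the ~1500 ms mean observed below.
      Assuming the 1/ln(10) scaling factor and the default conviction
      threshold of 8 (both assumptions, not stated in this ticket), one can
      compute how long a silence it takes to convict a node:

```java
// Illustration of the flapping concern: with a mean learned while seeds
// were up, how long a silence crosses the conviction threshold? The
// scaling factor and threshold here are assumed defaults, not from this
// ticket.
public class ConvictionDelay {
    public static void main(String[] args) {
        double mean = 1500.0;                     // mean interval with seeds up
        double phiFactor = 1.0 / Math.log(10.0);  // assumed scaling factor
        double threshold = 8.0;                   // assumed phi conviction threshold
        // Conviction when phiFactor * silence / mean >= threshold:
        double silenceMillis = threshold * mean / phiFactor;
        System.out.printf("node convicted after %.0f ms of silence%n",
                          silenceMillis);        // ~27.6 seconds
        // If losing the seeds stretches real propagation delay past this,
        // healthy nodes would flap into the down state.
    }
}
```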

      In order to test this, I deployed a ~180 node cluster with a version
      that logs heartbeat information on each interpret(), similar to:

      INFO [GossipTasks:1] 2012-02-01 23:29:58,746 FailureDetector.java (line 187) ep /XXX.XXX.XXX.XXX is at phi 0.0019521638443084342, last interval 7.0, mean is 1557.2777777777778

      It turns out that, at least at 180 nodes, with 4 seed nodes, whether
      or not seeds are running does not seem to matter significantly. In
      both cases, the mean interval is around 1500 milliseconds.

      I don't feel I have a good grasp of whether this is incidental or
      guaranteed, and it would be good to at least empirically test
      propagation time without seeds at different cluster sizes; it's
      supposed to be unaffected by cluster size (RING_DELAY is static for
      this reason, as I understand it). It would be nice to confirm that
      this is the case.

People

    Peter Schuller (scode)

    Votes: 0
    Watchers: 4