Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: None
    • Component/s: None
    • Labels: None
    • Environment: Any cluster that has vnodes and consists of hundreds of physical nodes.

      Description

      There are a lot of gossip-related issues affecting very wide clusters that also have vnodes enabled. Let's use this ticket as a master in case there are sub-tickets.

      The most obvious symptom I've seen is with 1000 nodes in EC2 with m1.xlarge instances. Each node configured with 32 vnodes.

      Without vnodes, cluster spins up fine and is ready to handle requests within 30 minutes or less.

      With vnodes, nodes are reporting constant up/down flapping messages with no external load on the cluster. After a couple of hours, they were still flapping, had very high cpu load, and the cluster never looked like it was going to stabilize or be useful for traffic.

      Attachments

      1. 2013-11-05_18-04-03_no_compression_cpu_time.png
        337 kB
        Quentin Conner
      2. 2013-11-05_18-09-38_compression_on_cpu_time.png
        323 kB
        Quentin Conner
      3. 6000vnodes.patch
        0.5 kB
        Quentin Conner
      4. AdjustableGossipPeriod.patch
        3 kB
        Quentin Conner
      5. cpu-vs-token-graph.png
        9 kB
        Quentin Conner
      6. delayEstimatorUntilStatisticallyValid.patch
        0.5 kB
        Quentin Conner
      7. flaps-vs-tokens.png
        56 kB
        Quentin Conner
      8. vnodes & gossip flaps.png
        74 kB
        Quentin Conner

        Issue Links

          Activity

          jbellis Jonathan Ellis added a comment -

          I think we've addressed the major problems in the related tickets above. No single culprit.

          qconner Quentin Conner added a comment - edited

           Early results of testing with patch #3 were good; see vnodes & gossip flaps.png.

          qconner Quentin Conner added a comment -

          Yes, both use case 1 and use case 2 (detailed in early comment above) were cured by patch #3. Zero flaps were recorded in multiple trials in both use cases. Patch #3 cures the flaps, but does not address the cpu usage symptom.

           This was tested against the cassandra-1.2 branch. I am conducting the same test against use case 2 today, but using the current cassandra-2.0 branch of source.

          jbellis Jonathan Ellis added a comment -

           > Untested patch #3. Delays output from FailureDetector until statistically valid number of samples have been obtained.

          Did we ever find a scenario where we can demonstrate this patch making a difference? Because I think it's a good idea in theory.

          jbellis Jonathan Ellis added a comment -

          ISTM that FD processing Gossip updates synchronously is a fundamental problem. Any hiccup in processing will cause FD false positives.

          I've pulled a fix for this out to CASSANDRA-6338.

          qconner Quentin Conner added a comment - edited

          The num_tokens setting has a mild impact on average cpu utilization. Please see the graph below for the trend with 256 nodes.

          This graph does not characterize the bursty nature of any given node's CPU utilization. It does average the utilization over a 200 second period, taken at 10 second intervals using "sar -u".

           Since gossip heartbeat destinations are random, "unlucky" nodes will sometimes receive twice the gossip traffic; CPU utilization (on a 10 second basis) has been casually observed at about 25% for N=256 and 60% for N=512.

          qconner Quentin Conner added a comment - edited

          Flapping occurs with vnodes or without. Please see below.

           Using vnodes appears to exacerbate it, possibly because of longer messages and probably because of higher CPU utilization. Either would delay the timestamp used for the Failure Detector's inter-arrival time.

          qconner Quentin Conner added a comment -

          Tupshin, can you further quantify the CPU usage you observed, in terms of USER CPU and KERNEL CPU?
           Also, can you confirm the number of nodes and vnodes for those observations?

          I've seen about 25% user cpu @ 256 nodes and 60% @ 512 nodes. Kernel cpu was under 5% for both in my trials.

          qconner Quentin Conner added a comment -

          Good morning. We saw the same CPU usage profile with cassandra-1.2 8e7d7285cdeac4f2527c933280d595bbddd26935 (which included the patch to not flush peers CF).

          CPU time was spent in looking up EndpointState or spent in PHI calculation. No surprises were found. No race conditions, no deadlocks or mutex/monitor contention.

          I do not know if flapping happens in 1.2 head without vnodes. I will find out today, if I can get the nodes (having trouble this morning allocating from EC2). Will keep trying (Fridays seem better) but could slip into the weekend...

          qconner Quentin Conner added a comment -

          Run against the following commit on cassandra-1.2 branch:
          8e7d7285cdeac4f2527c933280d595bbddd26935

          This profile measures CPU time, not elapsed time.

          Internode compression turned off.

          256 nodes w/ num_tokens=256.
          GossipPeriod 1000 ms.
          ArrivalWindow.size() = 1000.

          qconner Quentin Conner added a comment -

          Run against the following commit on cassandra-1.2 branch:
          8e7d7285cdeac4f2527c933280d595bbddd26935

          This profile measures CPU time, not elapsed time.

          Internode compression turned on.

          256 nodes w/ num_tokens=256.
          GossipPeriod 1000 ms.
          ArrivalWindow.size() = 1000.

          brandon.williams Brandon Williams added a comment - edited

          At this point, I think we should:

          • see if the flapping happens with vnodes in 1.2 head (maybe Quentin already knows from his last test)
          • see if the flapping happens without vnodes in 1.2 head but the same number of nodes

          Because if sum() in ArrivalWindow is burning the most CPU in the Gossiper task (note: not bottlenecking, each call was at most ~3ms, there were just lots of them) then that means that the problem is no longer tied to vnodes (if it ever was, since sum is per-node, not per-token) and we should probably open a new ticket (can't start a cluster of size >=X all at once, or similar) and discuss there. We know that clusters much larger than any discussed on this ticket exist, but I don't think any of them have all rebooted at once.
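
           For illustration, a minimal sketch (not the actual o.a.c.gms.ArrivalWindow code) of the kind of optimization implied here: keep a running sum next to the bounded interval deque so the mean is O(1) per call instead of re-summing up to 1000 samples.

             // Sketch only, not o.a.c.gms.ArrivalWindow: maintain a running sum alongside the
             // bounded deque of inter-arrival times so mean() no longer re-sums every sample.
             import java.util.ArrayDeque;

             final class ArrivalWindowSketch
             {
                 private final ArrayDeque<Long> intervals = new ArrayDeque<>();
                 private final int maxSamples;   // e.g. 1000, matching the window size used in these tests
                 private long runningSum = 0;

                 ArrivalWindowSketch(int maxSamples)
                 {
                     this.maxSamples = maxSamples;
                 }

                 synchronized void add(long intervalNanos)
                 {
                     if (intervals.size() == maxSamples)
                         runningSum -= intervals.pollFirst();   // evict the oldest sample and keep the sum consistent
                     intervals.addLast(intervalNanos);
                     runningSum += intervalNanos;
                 }

                 synchronized double mean()
                 {
                     return intervals.isEmpty() ? 0.0 : (double) runningSum / intervals.size();
                 }
             }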

          tupshin Tupshin Harper added a comment -

          +1. Strongly agree with Jonathan's analysis and proposal.

          jbellis Jonathan Ellis added a comment -

          ISTM that FD processing Gossip updates synchronously is a fundamental problem. Any hiccup in processing will cause FD false positives. (And even if our own code is perfect, GC pauses can still do this to us.)

          Wouldn't it be better if we:

          • time heartbeats based on when they arrive instead of when Gossip processes them
           • teach FD to recognize that its information is only good up to the most recently processed message – the absence of messages after that doesn't mean everyone is down unless the Gossip stage is empty (a rough sketch of both points follows below)
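
           A rough sketch of those two points, with illustrative names rather than Cassandra APIs: heartbeats are stamped when they arrive on the socket, and conviction is suppressed while the Gossip stage still has unprocessed messages.

             // Illustrative only, not Cassandra's API. The arrival timestamp is captured on the
             // inbound connection thread, and a backlog in the Gossip stage blocks conviction
             // because the FD's view is only current up to the last processed message.
             final class ArrivalAwareFailureDetector
             {
                 private final double phiConvictThreshold;

                 ArrivalAwareFailureDetector(double phiConvictThreshold)
                 {
                     this.phiConvictThreshold = phiConvictThreshold;
                 }

                 /** Call this from the thread that reads the message off the wire, not from the Gossip stage. */
                 long stampArrival()
                 {
                     return System.nanoTime();
                 }

                 boolean shouldConvict(double phi, int pendingGossipTasks)
                 {
                     // If gossip messages are queued but unprocessed, silence proves nothing.
                     return pendingGossipTasks == 0 && phi > phiConvictThreshold;
                 }
             }
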
          qconner Quentin Conner added a comment -

          Good cpu profile results were obtained last night with the 1.2.9 code line. Switching over to the cassandra-1.2 HEAD this morning for up-to-date analysis.
          CPU profile of 1.2.9 showed bottleneck was computation of sum for the ArrivalWindow deque members (inter-arrival times of gossip messages).

          brandon.williams Brandon Williams added a comment -

          With CASSANDRA-6244 and CASSANDRA-6297 in 1.2 head, I think we need to re-verify this is still a problem.

          mstump Matt Stump added a comment -

          Zero to minimal load. 177 writes/second, 0 reads against the entire ring. m2.4xlarge instances.

          jbellis Jonathan Ellis added a comment -

          How heavy is read/write load?

          mstump Matt Stump added a comment -

           We're observing the logs of a random sample of nodes, and on every node observed the entire ring is marked as down, so I assume it happens on all nodes.

          brandon.williams Brandon Williams added a comment -

           > Every node in the ring will be marked as down by the gossiper

          In which node's view? (or all of them?)

          mstump Matt Stump added a comment -

          As another datapoint/use case create a 32 node ring with vnodes, decommission one of the nodes and observe the logs. Every node in the ring will be marked as down by the gossiper, then immediately be re-added again as up/available.

          qconner Quentin Conner added a comment -

           Monday (11/4) I will start capturing CPU profiles with a 256 or 512 node cluster. The plan is to capture with internode compression and without.
           I was able to get a partial reproduction this week in a 256 node cluster – one node had twice the cpu utilization of the others (20% user versus 10% user). But I had too much logging enabled and that skewed the results.

          jbellis Jonathan Ellis added a comment -

           > it would be better to limit that in the config instead of failing at an assert later on.

          Split that out to CASSANDRA-6267.

          brandon.williams Brandon Williams added a comment -

          Let's move that discussion to CASSANDRA-4288, since that change is orthogonal to the actual problem we have here, regardless of whether it fixes it or just papers over the problem. What we need to do next on this ticket is either correlate a thread dump to what is burning up CPU, or attach a debugger and see where the time is being spent.

          qconner Quentin Conner added a comment -

          Brandon,

          You said Patch #3 will make it take much longer for a rebooted node to know who's actually up or down, exacerbating CASSANDRA-4288. I've given this some thought and want to see if I understand your concern.

          Patch #3 serves to send a zero value for phi, for newly-discovered nodes, until an accurate calculation of variance is complete. This would be 40 seconds, applicable to new nodes only.

           However (and this is what I'm looking for you to confirm): if a new node comes online but is stopped again within 40 seconds of start-up, the FD will not "convict" it until the end of that 40 seconds.

          I suspect this occurs less frequently than adding a node to a cluster, but probably depends on your use case (dev vs prod).

          In my view, we can't escape the math, and the need to amass 40 samples. That is why the bug exists today. I agree we should look at tying thrift to a healthy startup as a compensating measure.

          Instead of a fixed amount of time (gossip rounds), perhaps we should consider adding a hold-down timer based on a statistical measure?

           This hold-down timer could be implemented for newly discovered nodes to suppress interaction until Gossip "stabilizes". Just like we have a high-water mark for phi to denote failure, we could set a low-water mark and call it a trust threshold. We wouldn't enable thrift communications to the new node until its phi value is below this low-water mark.

           So the condition for "recognizing" a new node for thrift purposes could be twofold:
          1. valid computation for variance (40 samples obtained in the 1000 sample window)
          2. accurate phi value is indeed below the low-water mark
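
           A minimal sketch of that twofold condition; the names are illustrative, not a patch, and the low-water value is an assumption:

             // Illustrative only: gate client (thrift/native) visibility of a newly discovered
             // node on both a statistically valid sample count and a phi below a low-water mark.
             final class TrustThresholdSketch
             {
                 static final int MIN_SAMPLES = 40;        // samples needed for a valid variance, per the comment above
                 static final double PHI_LOW_WATER = 1.0;  // assumed "trust threshold"; would be tunable

                 static boolean trustedForClientTraffic(int sampleCount, double phi)
                 {
                     return sampleCount >= MIN_SAMPLES && phi < PHI_LOW_WATER;
                 }
             }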

          jbellis Jonathan Ellis added a comment -

          I'm okay with that limitation. Intuitively it's reasonable that C* can't compensate for really ridiculous performance differences.

          (Of course, you could also reduce the weak nodes below 256.)

          cburroughs Chris Burroughs added a comment -

           > I'd just set a max of 1024. No one could ever need more than that. (Famous last words)

          Isn't that equivalent to saying no one will have a heterogeneous cluster with more than a 1024/256 = 4 performance delta between physical nodes? SSD vs spinny could account for more than that.

          brandon.williams Brandon Williams added a comment -

          That might confuse autodiscovery clients, at least without further changes.

          jbellis Jonathan Ellis added a comment -

          Couldn't we tie the thrift/native server startup to "I have enough gossip data now?"

          brandon.williams Brandon Williams added a comment -

          Patch #3 will make it take much longer for a rebooted node to know who's actually up or down, exacerbating CASSANDRA-4288. I'd still like to know why things are taking longer with vnodes, and I'm especially hesitant to make any adjustments to the gossiper or FD since we know they work fine with single tokens, and also because they have no knowledge about tokens, it's just another opaque state to them. I suspect something in StorageService is blocking the gossiper long enough to cause this, perhaps CASSANDRA-6244 or something similar.

          tupshin Tupshin Harper added a comment -

           I'd just set a max of 1024. No one could ever need more than that. (Famous last words)

          jbellis Jonathan Ellis added a comment -

          Patch 1 will break things since later on we write the length of the string as two bytes.

          I think we're fine with 1700 vnodes per machine TBH, although it would be better to limit that in the config instead of failing at an assert later on.

          qconner Quentin Conner added a comment -

          I grabbed some sample log files from 10 nodes of 256 in a run today.
          flap-intervals.tar.gz

          Convictions are happening with only 1 to 5 intervals recorded. Patch #3 is looking like the winner but we should do the math by hand to be sure (volunteers?).

          Also, I just tested Patch #3 and found 0 flaps for the same setup as yesterday (256 nodes, phi=8, normal 1000 ms gossip period).

          cburroughs Chris Burroughs added a comment -

          > It would be helpful to dump the interval times for a node that is flapping (dumpInterArrivalTimes on the FD) so we can see how long the heartbeats are taking.

          A per endpoint histogram of heartbeat arrival latency seems a worthwhile o.a.c.Metric to have all the time.

           Quentin Conner: On the topic of "wait until there is enough data before doing stuff" you might also be interested in the heuristic & report from CASSANDRA-4288.

          brandon.williams Brandon Williams added a comment -

          Can you see if adding -Dcassandra.unsafesystem=true allows the cluster to stabilize at some point?

          brandon.williams Brandon Williams added a comment -

           It would be helpful to dump the interval times for a node that is flapping (dumpInterArrivalTimes on the FD) so we can see how long the heartbeats are taking. If some are excessively long, we need to get thread dumps/debugger timings from the gossiper to see if something is blocking it or taking a long time before changing any fundamentals (gossip interval, FD formula) that we already know work in principle without vnodes. Increasing the payload size to >32k shouldn't cause these problems, since that is only sent during initial state synchronization and isn't all that large to begin with.

          qconner Quentin Conner added a comment -

          First results with workaround patch #2.
          No load. No data. Only system keyspace and Gossip on a 256 node m1.medium cluster in EC2.
          Nodes started in rapid succession.

          phi=8, variable gossip_period
          1154 flaps for 1 sec
          685 flaps for 2 sec
          146 flaps for 3 sec
          88 flaps for 4 sec
          70 flaps for 5 sec
          100 flaps for 10 sec

          phi=12
          1289 flaps for 1 sec
          77 flaps for 2 sec
          6 flaps for 3 sec
          1 flaps for 4 sec
          3 flaps for 5 sec
          1 flaps for 6 sec
          0 flaps for 8 sec
          1 flaps for 10 sec

          qconner Quentin Conner added a comment -

          Untested patch #3. Delays output from FailureDetector until statistically valid number of samples have been obtained.

          qconner Quentin Conner added a comment -

          This is patch #2. It adds a new configuration item in cassandra.yaml, "gossip_period". Set it in milliseconds.

           Setting the gossip period by JMX will be lost if gossip is stopped and restarted, so it needs more work. Reading it from JMX seems fine.

           This also isn't DRY. The value for intervalInMillis should be set from config in a static initializer; I wasn't sure if the config object is available in that scope.
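
           A minimal sketch of the idea behind patch #2; the gossip_period name comes from this comment and is not a stock cassandra.yaml setting:

             // Sketch only: replace the hard-coded 1000 ms gossip round with an interval read
             // from configuration once at startup.
             import java.util.concurrent.Executors;
             import java.util.concurrent.ScheduledExecutorService;
             import java.util.concurrent.TimeUnit;

             final class ConfigurableGossipRound
             {
                 static void schedule(Runnable gossipRound, long gossipPeriodMillis)
                 {
                     ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
                     executor.scheduleWithFixedDelay(gossipRound, gossipPeriodMillis, gossipPeriodMillis, TimeUnit.MILLISECONDS);
                 }
             }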

          qconner Quentin Conner added a comment -

          Patch #1. Increases number of allowed vnodes (num_tokens) from 1720 up to about 6000 with 512K stack size.

           Because of the CQL ANTLR grammar parser, stack space is needed to parse the token definitions.

          qconner Quentin Conner added a comment - edited

          Feature Suggestion

          The current Gossip failure detector is characterized by a sliding window of elapsed time, a heartbeat message period and a PHI threshold used to make the continuous random variable (lower case phi) into a dichotomous (binary) random variable. That PHI (uppercase) threshold is called phi_convict_threshold.

           I don't have a better mathematical theory or derivation at this writing, but I do have an easy workaround for your consideration. While phi_convict_threshold is adjustable, the period (or frequency) of Gossip messages is not. Adjusting the gossip period to integrate over a longer time baseline reduced false positives from the Gossip failure detector. The side effect is an increase in the elapsed time to detect a legitimately failed node.
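
           For readers following the math, a minimal sketch of a phi-accrual style calculation under the common exponential approximation; Cassandra's actual ArrivalWindow differs in detail:

             // Minimal sketch of a phi-accrual style calculation (exponential approximation).
             // phi grows the longer a heartbeat is overdue relative to the observed mean
             // inter-arrival time, and the node is convicted once phi exceeds phi_convict_threshold.
             final class PhiSketch
             {
                 // phi = -log10( P(no heartbeat by now | node alive) ) ~ overdue / (mean * ln 10)
                 static double phi(long nowMillis, long lastHeartbeatMillis, double meanIntervalMillis)
                 {
                     double overdue = nowMillis - lastHeartbeatMillis;
                     return overdue / (meanIntervalMillis * Math.log(10));
                 }

                 static boolean convict(double phi, double phiConvictThreshold)
                 {
                     return phi > phiConvictThreshold;   // e.g. the default threshold of 8
                 }
             }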

          Depending on user workload characteristics, and the related sources of latency (CPU, disk and network activity or transient delays) cited above, a System Architect could present a reasonable use case for controlling the Gossip message period.

           The goal would be to set a detection window that accommodates common occurrences for a given deployment scenario. Not all data centers are created equal.

          Patches and results from implementation will follow in subsequent posts.

           Potential Next Steps

           • Explore concern about sensitivity to gossip period. Do the vnode gossip messages exceed the capacity of peers to ingest them?
           • Explore concern about phi estimates from an un-filled (new) deque. See Patch #3.
           • Explore concern about assuming a Gaussian PDF. Networks (not computers) generally characterize expected arrival time by a Poisson distribution, not a Gaussian.

          qconner Quentin Conner added a comment -

          Analysis

           My first experiments aimed to quantify the length of Gossip messages and determine what factors drive the message length. I found the size of certain gossip messages increases proportionally with the number of vnodes (num_tokens in c*.yaml). I recorded message size over the num_tokens and number-of-nodes domains: (64, 128, 256, 512, ...) for tokens and (32, 64, 128, 256, 512) for nodes. I also made non-rigorous observations of User and Kernel CPU (Ubuntu 10.04 LTS). My hunch is that both vnode count and node count have a mild effect on user CPU resource usage.

           What is the rough estimate of bytes sent for certain Gossip messages, and why does this matter? The Phi Accrual Failure Detector (Hayashibara, et al.) assumes fixed-length heartbeat messages while Cassandra uses variable-length messages. I observed a correlation between larger messages, higher vnode counts and false-positive detections by the Gossip FailureDetector. These observations, IMHO, are not explained by the research paper. I formed a hypothesis that the false positives are due to jitter in the interval values. I wondered if perhaps using a longer baseline to integrate over would reduce the jitter.

           I have a second theory to follow up on. A newly added node will not have a long history of Gossip heartbeat inter-arrival times. At least 40 samples are needed to compute the mean and variance with any statistical significance. It's possible the phi estimation algorithm is simply invalid for newly created nodes and that is why we see them flap shortly after creation.
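
           A minimal sketch of the guard that theory suggests (and that patch #3, described later in the thread, implements in spirit); illustrative only, not the actual patch:

             // Illustrative guard: treat phi as 0 (never convict) for an endpoint until enough
             // inter-arrival samples exist for the mean/variance estimate to be meaningful.
             final class GuardedPhi
             {
                 static final int MIN_SAMPLES = 40;   // the sample count cited above

                 static double phi(int sampleCount, double rawPhi)
                 {
                     return sampleCount < MIN_SAMPLES ? 0.0 : rawPhi;
                 }
             }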

           In any case, the message of interest is the GossipDigestAck2 (GDA2) because it is the largest of the Gossip messages. GDA2 contains the set of EndpointStateMaps (node metadata) for newly-discovered nodes, i.e. those nodes just added to an existing cluster. When each node becomes aware of a joining node, it gossips that state to three randomly-chosen other nodes. The GDA2 message is tailored to contain the delta of new node metadata the receiving node is unaware of.

          For a single node, the upper limit on GDA message size is roughly 3 * N * k * V
          Where N is the number of nodes in the cluster,
          V is the number of tokens (vnodes) per cluster,
          k is a constant value, approximately 64 bytes, that represents a serialized token plus some other endpoint metadata.
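
           A back-of-the-envelope reading of that bound, assuming k is about 64 bytes as stated and reading V as the num_tokens setting per node; on that reading, the 200 KB to 2 MB messages mentioned below correspond to GDA2 deltas covering only the newly joined nodes:

             // Rough arithmetic only; the constants are the assumptions stated above.
             final class GossipSizeEstimate
             {
                 public static void main(String[] args)
                 {
                     long k = 64;    // bytes per serialized token plus endpoint metadata (assumed)
                     long v = 256;   // num_tokens per node (assumed reading of V)
                     long n = 256;   // nodes in the cluster

                     System.out.println("worst-case full-state bound 3*N*k*V: " + (3 * n * k * v) + " bytes");  // ~12.6 MB
                     System.out.println("delta for 10 joining nodes:          " + (10 * k * v) + " bytes");      // ~164 KB
                     System.out.println("delta for 100 joining nodes:         " + (100 * k * v) + " bytes");     // ~1.6 MB
                 }
             }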

           If one is running hundreds of nodes in a cluster, the Gossip message traffic created when a node joins can be significant and increases with the number of nodes. I believe this to be the first-order effect, and it probably violates one of the assumptions of PHI Accrual Failure Detection: that heartbeat messages are small enough not to consume a relevant amount of compute or communication resources. The variable transmission time (due to variable-length messages) is a clear violation of the assumptions, if I've read the source code correctly.

          On a related topic, there is a hard-coded limitation to the number of vnodes due to the serialization of the GDA messages.
          No more than 1720 vnodes can be configured without creating a greater than 32K serialized String vnode message. A patch is provided below for future use should this become an issue.

          In clusters with hundreds of nodes, GDA2 messages can be 200 KB or 2 MB if many nodes join simultaneously. This is not an issue if the computer experiences no latency from competing workloads. In the real world, nodes are added because the cluster load has grown in terms of retained data, or in terms of a high transaction arrival rate. This means node resources may be fully utilized when adding new nodes is typically attempted.

           It occurred to me that we have another use case to accommodate. It is common to experience transient failure modes, even in modern data centers with disciplined maintenance practices. Ethernet cables get moved, switches and routers rebooted. BGP route errors and other temporary interruptions may occur with the network fabric in real-world scenarios. People make mistakes, plans change, and preventative maintenance often causes short-lived interruptions to network, CPU and disk subsystems.

          qconner Quentin Conner added a comment - edited

          Background and Reproduction

           The symptom is evident from the presence of "is now DOWN" messages in the Cassandra system.log file. The recording of a node DOWN is often followed by a node UP a few seconds later. Users have coined this phenomenon "gossip flap", and the occurrence of Gossip flaps has both a machine and a human consequence.

          Humans react strongly to the (temporary) marking of a node down. Automated monitoring may trigger SNMP traps, etc. A "busy" node that doesn't transmit heartbeat gossip messages on time will be marked as "down" though it may still be performing useful work.

          Machine reactions include other C* nodes buffering of row mutations and storage of hints on disk when another node is marked down. I have not explored the machine reactions but imagine the endpointSnitch could also be affected from the client frame of reference.

          One piece of good news is that I was able to reproduce two different use cases that elicit the "is now DOWN" message in Log4J log files.

          Use Case #1 is as follows:
          provision 256 or 512 nodes in EC2
          install Cassandra 1.2.9
          take defaults except specify num_tokens=256 in c*.yaml
          start one node at a time

          Use Case #2 is as follows:
          provision 32 nodes in EC2
          install Cassandra 1.2.9
          take defaults in c*.yaml
          configure rack & datacenter
          start one node at a time
          when all nodes are up create about 1GB of data
          e.g. "tools/bin/cassandra-stress -c 20 -l 3 -n 1000000"
           provision a 33rd (extra) node in EC2
          install Cassandra 1.2.9
          take defaults except specify num_tokens=256
          configure different datacenter than first 32 nodes
          start the node (auto_bootstrap=true)

          jbellis Jonathan Ellis added a comment -

           > After a couple of hours, they were still flapping, had very high cpu load

          To clarify, this is a bit of a mashup of multiple observations:

          When there was zero traffic on the cluster, we were seeing flapping without very high cpu. On smaller tests, we saw much higher cpu than expected when under load.


            People

            • Assignee: jbellis Jonathan Ellis
            • Reporter: tupshin Tupshin Harper
            • Votes: 0
            • Watchers: 19

              Dates

              • Created:
                Updated:
                Resolved:

                Development