Cassandra / CASSANDRA-3831

scaling to large clusters in GossipStage impossible due to calculatePendingRanges

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: 1.1.0
    • Component/s: Core
    • Labels:
      None

      Description

      (Most observations below are from 0.8, but I just now tested on
      trunk and I can trigger this problem just by bootstrapping a ~180
      node cluster concurrently, presumably due to the number of nodes that
      are simultaneously in bootstrap state.)

      It turns out that:

      • (1) calculatePendingRanges is not just expensive, it's computationally complex - cubic or worse
      • (2) it gets called NOT just once per node being bootstrapped/leaving etc, but is called repeatedly while nodes are in these states

      As a result, clusters start exploding when you start reaching 100-300
      nodes. The GossipStage will get backed up because a single
      calculatePendingRanges takes seconds, and depending on what the
      average heartbeat interval is in relation to this, this can lead to
      massive cluster-wide flapping.

      This all started because we hit this in production; several nodes
      started flapping several other nodes as down, with many nodes
      seeing the entire cluster, or a large portion of it, as down. Logging
      in to some of these nodes, you would see them constantly flapping
      nodes up/down for minutes at a time until one got lucky and
      stabilized.

      In the end we had to perform an emergency full-cluster restart with
      gossip patched to force-forget certain nodes in bootstrapping state.

      I can't go into all details here from the post-mortem (just the
      write-up would take a day), but in short:

      • We graphed the number of hosts in the cluster that saw more than 5
        nodes Down (in a cluster that should have 0 down) on a per-minute timeline.
      • We also graphed the number of hosts in the cluster that had GossipStage backed up.
      • The two graphs correlated extremely well.
      • jstack sampling showed the process being CPU bound, mostly sorting under calculatePendingRanges.
      • We were never able to exactly reproduce it with normal RING_DELAY and gossip intervals, even on a 184 node cluster (the production cluster is around 180 nodes).
      • Dropping RING_DELAY, and in particular dropping the gossip interval to 10 ms instead of 1000 ms, we were able to observe all of the behavior we saw in production.

      So our steps to reproduce are:

      • Launch a 184 node cluster with the gossip interval at 10 ms and RING_DELAY at 1 second.
      • On one node, run something like the following (or just do a manual decommission/bootstrap once; it triggers every time):

               while [ 1 ] ; do
                   date ; echo decom ; nodetool decommission
                   date ; echo done leaving decommed for a while ; sleep 3
                   date ; echo done restarting
                   sudo rm -rf /data/disk1/commitlog/*
                   sudo rm -rf /data/diskarray/tables/*
                   sudo monit restart cassandra
                   date ; echo restarted waiting for a while ; sleep 40
               done

      • Watch all nodes flap massively and not recover at all, or only after a long time.

      I observed the flapping using a Python script that every 5 seconds
      (randomly spread out) asked all nodes in the cluster for their
      unreachable nodes, and printed any nodes and their counts when they
      had more than 5 unreachables. The cluster can be observed instantly
      going into massive flapping when leaving/bootstrap is initiated. The
      script needs Cassandra running with Jolokia enabled for http/json
      access to JMX. I can provide the script if needed after cleanup; a
      rough sketch of the same idea is included below.
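
      A minimal Java sketch of roughly what the script does (the real
      script is Python and is not included here; the Jolokia port 8778 and
      reading the StorageService UnreachableNodes attribute are assumptions
      about the setup, and the JSON handling is deliberately crude):

       import java.net.URI;
       import java.net.http.HttpClient;
       import java.net.http.HttpRequest;
       import java.net.http.HttpResponse;
       import java.util.List;

       // Poll every node's Jolokia endpoint for the UnreachableNodes attribute of
       // the StorageService MBean and report any node that sees more than 5 down.
       public class FlapWatcher
       {
           public static void main(String[] args) throws Exception
           {
               List<String> hosts = List.of(args);          // cluster nodes to poll
               HttpClient client = HttpClient.newHttpClient();
               while (true)
               {
                   for (String host : hosts)
                   {
                       String url = "http://" + host + ":8778/jolokia/read/"
                                  + "org.apache.cassandra.db:type=StorageService/UnreachableNodes";
                       HttpRequest req = HttpRequest.newBuilder(URI.create(url)).GET().build();
                       String body = client.send(req, HttpResponse.BodyHandlers.ofString()).body();
                       // Crude extraction of the "value" array; the real script parses the JSON.
                       int start = body.indexOf("\"value\":[");
                       int unreachable = 0;
                       if (start >= 0)
                       {
                           String value = body.substring(start + 9, body.indexOf(']', start));
                           unreachable = value.isEmpty() ? 0 : value.split(",").length;
                       }
                       if (unreachable > 5)
                           System.out.println(host + " sees " + unreachable + " unreachable nodes");
                   }
                   Thread.sleep(5000);                       // the real script spread polls out randomly
               }
           }
       }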

      The phi conviction, based on logging I added, was legitimate. Using
      the 10 ms interval, the average heartbeat interval ends up being
      roughly 25 ms. As a result, a single ~2 second delay in the gossip
      stage is huge in comparison to those 25 ms, and so we go past the phi
      conviction threshold. This is much more sensitive than in production,
      but it's the same effect, even if it triggers less easily for real.
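
      As a back-of-the-envelope check (assuming the usual exponential
      simplification of the phi accrual detector and the default
      phi_convict_threshold of 8):

       phi ≈ (delta_t / mean_interval) * log10(e) ≈ (2000 ms / 25 ms) * 0.434 ≈ 35 >> 8

      so a single ~2 second GossipStage stall is more than enough to convict.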

      The best workaround currently internally is to memoize
      calculatePendingRanges so that we don't re-calculate if the token
      metadata, the list of moving, the list of bootstrapping and the list
      of leaving nodes are all the same as on the prior calculation (a
      sketch of the idea is below). It's not entirely clear at this point
      whether there is a clean fix to avoid executing
      calculatePendingRanges more than once per unique node in this state.
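
      A minimal, simplified sketch of the memoization idea (the class and
      method names here are hypothetical; the real patch is attached to
      this ticket and works against the actual TokenMetadata/StorageService
      types, and callers would pass immutable copies of the inputs):

       import java.util.Objects;

       // Skip the expensive recalculation when none of the inputs that
       // calculatePendingRanges depends on have changed since the last call.
       final class PendingRangeMemoizer
       {
           private Object lastTokenMetadata;
           private Object lastBootstrapping;
           private Object lastLeaving;
           private Object lastMoving;

           synchronized void maybeCalculatePendingRanges(Object tokenMetadata, Object bootstrapping,
                                                         Object leaving, Object moving)
           {
               if (Objects.equals(tokenMetadata, lastTokenMetadata)
                   && Objects.equals(bootstrapping, lastBootstrapping)
                   && Objects.equals(leaving, lastLeaving)
                   && Objects.equals(moving, lastMoving))
                   return;                        // nothing relevant changed; keep the previous result

               recalculatePendingRanges();        // the expensive part we want to avoid repeating

               lastTokenMetadata = tokenMetadata; // remember the inputs for the next call
               lastBootstrapping = bootstrapping;
               lastLeaving = leaving;
               lastMoving = moving;
           }

           private void recalculatePendingRanges() { /* the existing calculation would go here */ }
       }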

      It should be noted though that even if that is fixed, it is not
      acceptable to spend several seconds doing these calculations on a ~
      200 node cluster and it needs to be made fundamentally more efficient.

      Here is a dump of thoughts by me in an internal JIRA ticket (not
      exhaustive, I just went as far as to show that there is an issue;
      there might be worse things I missed, but worse than cubic is bad
      enough that I stopped):

      (Comment uses 0.8 source.)

      Okay, so let's break down the computational complexity here.

      Suppose ring size is n and number of bootstrapping/leaving tokens is m. One of two places that take time (by measurement) is this part of calculatePendingRanges():

        // At this stage pendingRanges has been updated according to leave operations. We can
              // now continue the calculation by checking bootstrapping nodes.
      
              // For each of the bootstrapping nodes, simply add and remove them one by one to
              // allLeftMetadata and check in between what their ranges would be.
              for (Map.Entry<Token, InetAddress> entry : bootstrapTokens.entrySet())
              {
                  InetAddress endpoint = entry.getValue();
      
                  allLeftMetadata.updateNormalToken(entry.getKey(), endpoint);
                  for (Range range : strategy.getAddressRanges(allLeftMetadata).get(endpoint))
                      pendingRanges.put(range, endpoint);
                  allLeftMetadata.removeEndpoint(endpoint);
              }
      

      I'll ignore anything that's O(log n) or better.

      The outer loop is O(m). The inner loop is O(n), making the aggregate
      so far O(nm).

      We have a call in there to updateNormalToken(), which implies a sort,
      which is O(n log n). So now we're at O(mn log n).

      Next up we call getAddressRanges(), which immediately does another
      O(n log n) sort; we're still at O(mn log n). It then iterates
      (linear) and:

      • calls getPrimaryRangeFor() for each.
      • calls calculateNaturalEndpoints() for each.

      The former ends up sorting again, so now we're at O(mn^2 log n)
      (worse than quadratic in n).

      NTS.calculateNaturalEndpoints starts by collecting token metadata for
      the nodes in the DC using updateNormalToken, which again implies
      sorting. Whoa. Now we're at roughly O(mn^3 log n).

      I might have missed things that are even worse, but this is bad
      enough to warrant this ticket. To put it into perspective, 168^3 is
      4.7 million.
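
      For intuition, here is a tiny self-contained sketch (not Cassandra
      code; the numbers are illustrative) of the same shape of work: a sort
      nested inside two loops over the ring, once per bootstrapping token.

       import java.util.ArrayList;
       import java.util.Collections;
       import java.util.List;

       // Illustrates how "a sort inside nested loops over the ring" grows with ring size n.
       public class NestedSortCost
       {
           public static void main(String[] args)
           {
               int n = 180;   // ring size
               int m = 10;    // bootstrapping/leaving tokens
               List<Integer> ring = new ArrayList<>();
               for (int i = 0; i < n; i++)
                   ring.add((i * 7919) % n);              // arbitrary token-ish values

               long sorts = 0;
               long start = System.nanoTime();
               for (int b = 0; b < m; b++)                // per bootstrapping token
                   for (int t = 0; t < n; t++)            // per token in the ring
                       for (int e = 0; e < n; e++)        // per endpoint considered
                       {
                           Collections.sort(new ArrayList<>(ring));   // the repeated sort
                           sorts++;
                       }
               long ms = (System.nanoTime() - start) / 1_000_000;
               System.out.println(sorts + " sorts of " + n + " elements took ~" + ms + " ms");
           }
       }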


          Activity

          Peter Schuller added a comment -

          CASSANDRA-3881 filed for further work.

          Sylvain Lebresne added a comment -

          Since the code has been committed, can we close this one and open a separate ticket for the remaining work, for the sake of keeping track of what went into 1.1.0 and what didn't?

          Peter Schuller added a comment -

          Wow, that's embarrassing. Thanks for catching that!

          Jonathan Ellis added a comment - edited

          Committed the group add patch with a minor tweak to move the isEmpty check to the top so we can skip lock/unlock too, and a more important one: changing sortedTokens = sortedTokens() (which is a no-op) to sortedTokens = sortTokens().
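
          For context, this is the kind of bug the second tweak fixes
          (illustrative, self-contained code; the names only approximate the
          real TokenMetadata fields): assigning a field from a getter that
          merely returns that field is a no-op, while the intent was to
          rebuild the sorted list.

           import java.util.ArrayList;
           import java.util.Collections;
           import java.util.List;

           class TokenCache
           {
               private final List<String> tokens = new ArrayList<>();
               private List<String> sortedTokens = new ArrayList<>();

               private List<String> sortedTokens() { return sortedTokens; }  // getter only

               private List<String> sortTokens()                             // actually re-sorts
               {
                   List<String> copy = new ArrayList<>(tokens);
                   Collections.sort(copy);
                   return copy;
               }

               void addToken(String token)
               {
                   tokens.add(token);
                   // Buggy:  sortedTokens = sortedTokens();  // reads the stale cache back into itself
                   sortedTokens = sortTokens();               // fixed: rebuild from the current tokens
               }
           }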

          Peter Schuller added a comment -

          I filed CASSANDRA-3856, which might be related to a proper fix for this.

          Peter Schuller added a comment -

          Attaching CASSANDRA-3831-trunk-group-add-dc-tokens.txt, which adds a "group-update" interface to TokenMetadata that allows NTS to use it when constructing its local per-DC token metadata (a sketch of the general shape is below).

          This is not even close to a complete fix for this issue, but I do think it is a "clean" change, because it makes sense in terms of the TokenMetadata API to provide a group-update method given the expense involved. And given its existence, it makes sense for NTS to use it.

          This change mitigates the problem significantly on the ~180 node test cluster, since it takes away an n from the complexity, and it should significantly raise the bar on how many nodes in a cluster are realistic without other changes.

          I think this might be a fix worth committing because it feels safe and is maybe a candidate for the 1.1 release, assuming review doesn't yield anything obvious. But I'm leaving the JIRA open for a more overarching fix (I'm not sure what that is at the moment; I'm mulling it over).
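
          A minimal sketch of the general shape of such a group-update
          method (hypothetical, simplified types; the actual change is in
          the attached patch): sort once after adding the whole batch
          instead of re-sorting per token.

           import java.net.InetAddress;
           import java.util.ArrayList;
           import java.util.Collections;
           import java.util.HashMap;
           import java.util.List;
           import java.util.Map;

           class SimpleTokenMetadata
           {
               private final Map<Long, InetAddress> tokenToEndpoint = new HashMap<>();
               private List<Long> sortedTokens = new ArrayList<>();

               // Per-token update: an O(n log n) sort on every call, so adding n tokens
               // one by one costs O(n^2 log n).
               void updateNormalToken(Long token, InetAddress endpoint)
               {
                   tokenToEndpoint.put(token, endpoint);
                   sortedTokens = sortTokens();
               }

               // Group update: one sort for the whole batch, O(n log n) total.
               void updateNormalTokens(Map<Long, InetAddress> tokens)
               {
                   tokenToEndpoint.putAll(tokens);
                   sortedTokens = sortTokens();
               }

               private List<Long> sortTokens()
               {
                   List<Long> copy = new ArrayList<>(tokenToEndpoint.keySet());
                   Collections.sort(copy);
                   return copy;
               }
           }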

          Peter Schuller added a comment -

          With respect to trunk and whether it gets called repeatedly, I gave that a try now. I picked a random node to tail (running a version that logs calculatePendingRanges calls and how long they take), and decommissioned another node:

           INFO [GossipStage:1] 2012-02-05 23:33:32,675 StorageService.java (line 1275) calculate pending ranges called, took 885 ms
           INFO [OptionalTasks:1] 2012-02-05 23:34:00,828 HintedHandOffManager.java (line 180) Deleting any stored hints for /XXX.XXX.XXX.XXX
           INFO [GossipStage:1] 2012-02-05 23:34:00,828 StorageService.java (line 1275) calculate pending ranges called, took 0 ms
           INFO [GossipStage:1] 2012-02-05 23:34:00,829 StorageService.java (line 1275) calculate pending ranges called, took 0 ms
           INFO [GossipStage:1] 2012-02-05 23:34:00,829 StorageService.java (line 1217) Removing token 23117008622346363007022731483136427400 for /XXX.XXX.XXX.XXX
          

          At least two of those are expected - once when it goes into leaving, and once when it drops out of the cluster. Not sure at this point why we see a third call.

          For bootstrapping the guy back into the cluster I see:

           INFO [GossipStage:1] 2012-02-05 23:38:15,832 StorageService.java (line 1275) calculate pending ranges called, took 1413 ms
           INFO [GossipStage:1] 2012-02-05 23:38:45,229 ColumnFamilyStore.java (line 590) Enqueuing flush of Memtable-LocationInfo@659873291(35/43 serialized/live bytes, 1 ops)
           INFO [FlushWriter:4] 2012-02-05 23:38:45,229 Memtable.java (line 252) Writing Memtable-LocationInfo@659873291(35/43 serialized/live bytes, 1 ops)
           INFO [FlushWriter:4] 2012-02-05 23:38:45,236 Memtable.java (line 293) Completed flushing /data/diskarray/tables/system/LocationInfo/system-LocationInfo-hc-20-Data.db (89 bytes)
           INFO [GossipStage:1] 2012-02-05 23:38:45,236 StorageService.java (line 1275) calculate pending ranges called, took 0 ms
           INFO [CompactionExecutor:21] 2012-02-05 23:38:45,237 CompactionTask.java (line 115) Compacting [SSTableReader(path='/data/diskarray/tables/system/LocationInfo/system-LocationInfo-hc-20-Data.db'), SSTableReader(path='/data/diskarray/tables/system/LocationInfo/system-LocationInfo-hc-19-Data.db'), SSTableReader(path='/data/diskarray/tables/system/LocationInfo/system-LocationInfo-hc-18-Data.db'), SSTableReader(path='/data/diskarray/tables/system/LocationInfo/system-LocationInfo-hc-17-Data.db')]
           INFO [CompactionExecutor:21] 2012-02-05 23:38:45,248 CompactionTask.java (line 226) Compacted to [/data/diskarray/tables/system/LocationInfo/system-LocationInfo-hc-21-Data.db,].  7,388 to 7,047 (~95% of original) bytes for 4 keys at 0.610958MB/s.  Time: 11ms.
          

          So it got called twice, which is expected: once when the node entered the joining state, and once when it flipped into Normal.

          Relatedly:

          On the other hand, on the node that is bootstrapping, calculatePendingRanges does get called lots of times (this is expected) - presumably (not confirmed) at least once for every node in the cluster, as the gossiper emits events to inform it of each one. If there is a single node bootstrapping this is kind of okay, because the absence of another bootstrapping node means the calculations are quick. But if other nodes are bootstrapping too (highly likely if you're doing capacity adds on large clusters), that would be expected to take a long time to process. This could throw off the bootstrapping node, which tries to wait for RING_DELAY on start-up but is spending a lot of that time doing these calculations rather than staying up-to-date with ring information (for the record, though, I have not specifically timed/tested this particular case).

          Peter Schuller added a comment -

          I agree (I did say that myself already). The memoization (plus being ready to change the cluster-wide phi convict threshold through JMX) was just the safest way to fix the situation on our production cluster so that we could continue to add capacity. It was never intended as a suggested fix. But I still wanted to upload it instead of keeping the patch private, in case it helps someone.

          But the larger issue is that calculatePendingRanges must be faster to begin with. Even if only called once, if it takes 1-4 seconds on a ~ 180 node cluster and it's worse than O(n^3) it's way too slow and won't scale. First due to the failure detector, and of course at some point it's just too slow to even wait for the calculation to complete at all (from a RING_DELAY standpoint for example).

          I'll see later this weekend about doing more tests on trunk to confirm/deny whether it is getting called multiple times. As I indicated, I never confirmed that particular bit on trunk, and it's very possible it doesn't happen there.

          I haven't had time to seriously look at suggesting changes to fix the computational complexity. Might be very easy for all I know; I just haven't looked at it yet.

          Jonathan Ellis added a comment -

          calculatePendingRanges is only supposed to be called when the ring changes. So I'd say the right fix would be to eliminate whatever is breaking that design, rather than adding a memoization bandaid.

          (I eyeballed 1.1 and didn't see anything obvious, so either it's subtle or it got fixed post-0.8.)

          I don't suppose your CPU spinning test got any more of a call tree to go on?

          Peter Schuller added a comment - edited

          I am attaching CASSANDRA-3831-memoization-not-for-inclusion.txt as an "FYI" and in case it helps others. It's against 0.8, and implements memoization of calculate pending ranges.

          The correct/clean fix is probably to change behavior so that it doesn't get called unnecessarily to begin with (and to make sure the computational complexity is reasonable when it does get called). This patch was made specifically to address the production issue we are having in a minimally dangerous fashion, and is not to be taken as a suggested fix.

          Peter Schuller added a comment -

          Correction: all of the above applies, except on trunk: I have not confirmed that calculatePendingRanges() indeed gets called repeatedly (not just once per node that starts a bootstrap/leave/etc) on trunk. I have only confirmed that it is CPU spinning there, and that the gossip stage gets backed up as a result.

          On 0.8, it is specifically confirmed that it does get called repeatedly both in my test case and when we saw this happen in production.


    People

    • Assignee: Peter Schuller
    • Reporter: Peter Schuller