CASSANDRA-6345

Endpoint cache invalidation causes CPU spike (on vnode rings?)

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: 1.2.13, 2.0.4
    • Component/s: None
    • Labels:
      None
    • Environment:

      30 nodes total, 2 DCs
      Cassandra 1.2.11
      vnodes enabled (256 per node)

    Description

      We've observed that events which cause invalidation of the endpoint cache (update keyspace, add/remove nodes, etc.) in AbstractReplicationStrategy result in several seconds of thundering-herd behavior across the entire cluster.

      A thread dump shows over a hundred threads (I stopped counting at that point) with a backtrace like this:

      at java.net.Inet4Address.getAddress(Inet4Address.java:288)
      at org.apache.cassandra.locator.TokenMetadata$1.compare(TokenMetadata.java:106)
      at org.apache.cassandra.locator.TokenMetadata$1.compare(TokenMetadata.java:103)
      at java.util.TreeMap.getEntryUsingComparator(TreeMap.java:351)
      at java.util.TreeMap.getEntry(TreeMap.java:322)
      at java.util.TreeMap.get(TreeMap.java:255)
      at com.google.common.collect.AbstractMultimap.put(AbstractMultimap.java:200)
      at com.google.common.collect.AbstractSetMultimap.put(AbstractSetMultimap.java:117)
      at com.google.common.collect.TreeMultimap.put(TreeMultimap.java:74)
      at com.google.common.collect.AbstractMultimap.putAll(AbstractMultimap.java:273)
      at com.google.common.collect.TreeMultimap.putAll(TreeMultimap.java:74)
      at org.apache.cassandra.utils.SortedBiMultiValMap.create(SortedBiMultiValMap.java:60)
      at org.apache.cassandra.locator.TokenMetadata.cloneOnlyTokenMap(TokenMetadata.java:598)
      at org.apache.cassandra.locator.AbstractReplicationStrategy.getNaturalEndpoints(AbstractReplicationStrategy.java:104)
      at org.apache.cassandra.service.StorageService.getNaturalEndpoints(StorageService.java:2671)
      at org.apache.cassandra.service.StorageProxy.performWrite(StorageProxy.java:375)

      It looks like there's a large amount of cost in the TokenMetadata.cloneOnlyTokenMap call that AbstractReplicationStrategy.getNaturalEndpoints makes each time there is a cache miss for an endpoint. It seems this would only impact clusters with large numbers of tokens, so it's probably a vnodes-only issue.

      Proposal: In AbstractReplicationStrategy.getNaturalEndpoints(), cache the cloned TokenMetadata instance returned by TokenMetadata.cloneOnlyTokenMap(), wrapping it with a lock to prevent stampedes, and clearing it in clearEndpointCache(). Thoughts?
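
      A minimal sketch of that shape, with stand-in types (all names here are hypothetical, not the actual Cassandra classes): the volatile reference gives a lock-free fast path, and the fill lock ensures only one thread pays for the clone on a miss.

      // Snapshot stands in for the cloned TokenMetadata
      class CachedRingSnapshot
      {
          interface Snapshot {}
          interface Source { Snapshot cloneOnlyTokenMap(); }

          private final Source source;
          private final Object fillLock = new Object(); // prevents a clone stampede
          private volatile Snapshot snapshot;           // null means invalidated

          CachedRingSnapshot(Source source) { this.source = source; }

          Snapshot get()
          {
              Snapshot s = snapshot;
              if (s != null)
                  return s;                             // fast path: no locking
              synchronized (fillLock)
              {
                  if (snapshot == null)                 // re-check after winning the lock
                      snapshot = source.cloneOnlyTokenMap();
                  return snapshot;
              }
          }

          void clearEndpointCache() { snapshot = null; } // called on ring changes
      }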

    Attachments

      1. 6345.txt
        3 kB
        Jonathan Ellis
      2. 6345-v2.txt
        3 kB
        Jonathan Ellis
      3. half-way-thru-6345-rbranson-patch-applied.png
        31 kB
        Rick Branson
      4. 6345-rbranson.txt
        5 kB
        Rick Branson
      5. 6345-rbranson-v2.txt
        10 kB
        Rick Branson
      6. 6345-v3.txt
        4 kB
        Jonathan Ellis
      7. 6345-v4.txt
        9 kB
        Jonathan Ellis
      8. 6345-v5.txt
        8 kB
        Jonathan Ellis

    Activity

          Jonathan Ellis added a comment -

          We don't care about keeping an accurate count, only that once it's done with that block it's higher than it was before.

          Chris Burroughs added a comment -
          private volatile long ringVersion = 0;
          
          ringVersion++;
          

          If there is something tricky here that makes an increment on a volatile okay then it deserves a comment.
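
          Background for this concern: volatile guarantees visibility, but ++ is a separate read-modify-write, so two writers that don't share a lock can lose an increment. Per Jonathan's reply above, a lost increment is tolerable here because the value still only moves up; the alternatives look like this (hypothetical names):

          import java.util.concurrent.atomic.AtomicLong;

          class VersionCounter
          {
              // ringVersion++ compiles to read, add, write: two writers that don't
              // share a lock can both read N and both store N + 1, losing a bump.
              private volatile long ringVersion = 0;

              void bumpUnderWriteLock() { ringVersion++; } // safe only if all writers hold one lock

              // Drop-in alternative, atomic without any external locking:
              private final AtomicLong version = new AtomicLong();

              void bumpLockFree() { version.incrementAndGet(); }
          }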

          Rick Branson added a comment -

          LGTM!

          Jonathan Ellis added a comment -

          My defensiveness comment suggested bumping the version number each time the TM write lock is released, which would be in addition to the existing invalidations.

          Okay. I'm going to leave this be then, because I don't want to accidentally start invalidating the cache unnecessarily because one of those operations was more common than I thought. Could address in trunk if you want to open a ticket.

          Committed v5 w/ nits fixed.

          Rick Branson added a comment - edited

          Thanks for taking the time to explain the consistency story. It makes perfect sense.

          My defensiveness comment suggested bumping the version number each time the TM write lock is released, which would be in addition to the existing invalidations. You're probably a much better gauge on the usefulness of this, so up to you.

          Really nice that the v5 patch is so compact. Two minor comments: the endpointsLock declaration is still in there, and not to be all nitpicky but there are two typos in the comments ("wo we keep" and "clone got invalidted").

          Jonathan Ellis added a comment -

          It seems that unless I'm missing something either is possible with the current release code, and thus these patches as well

          Technically correct, but in practice we're in pretty good shape. The sequence is:

          1. Add the changing node to pending ranges
          2. Sleep for RING_DELAY so everyone else starts including the new target in their writes
          3. Flush data to be transferred
          4. Send over data for writes that happened before (1)

          Step 1 happens on every coordinator; steps 2-4 only happen on the node that is giving up a token range.

          The guarantee we need is that any write that happens before the pending range change, completes before the subsequent flush.

          Even if we used TM.lock to protect the entire ARS sequence (guaranteeing that no local write is in progress once the PRC happens) we could still receive writes from other nodes that began their PRC change later.

          So we rely on the RING_DELAY (30s) sleep. I suppose a GC pause for instance at just the wrong time could theoretically mean a mutation against the old state gets sent out late, but I don't see how we can improve it.

          IMHO to be defensive, any time the write lock is acquired in TokenMetadata, the version should be bumped in the finally block before the lock is released

          Haven't thought this through as much. What are you saying we should bump that we weren't calling invalidate on before?

          Is the idea with the striped lock on the endpoint cache in AbstractReplicationStrategy to help smooth out the stampede effect when the "global" lock on the cached TM gets released after the fill?

          I'm trying to avoid a minor stampede on calculateNaturalEndpoints (CASSANDRA-3881) but it's probably premature optimization. v5 attached w/o that.

          Rick Branson added a comment -

          +100 at removing those pub/sub callbacks

          The concurrency issues I bring up are probably because I'm unfamiliar with the "guarantees" needed by TokenMetadata updates. It looks like the current release code is subject to the issue I brought up, where method calls on TokenMetadata that change state return successfully before all threads applying mutations have "seen" the update. There will be some mutations in progress that are using "stale" token data to apply writes even after TokenMetadata write methods return as successful. So this does not appear to be a regression, but I'm just being overly cautious, having been burned by these sorts of double-caching scenarios before. You bring up the point that over-broad operations are ok, and I agree, but I'm more concerned about operations that are too narrow. It seems that unless I'm missing something, either is possible with the current release code, and thus these patches as well (including mine).

          TokenMetadata#updateNormalTokens is (implicitly) relying on the removeFromMoving call to bump the version, but the tokenToEndpointMap is updated afterwards, which means internal data is updated after the version is bumped. IMHO to be defensive, any time the write lock is acquired in TokenMetadata, the version should be bumped in the finally block before the lock is released. I don't think this is exposing a bug in the existing patch though, because cloneOnlyTokenMap will be blocked until the write lock is released in the finally block.
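
          A sketch of the defensive pattern being described (hypothetical names): all map mutations happen first, and the version bump sits in the finally block immediately before the write lock is released, so no reader can observe the new version paired with old data.

          import java.util.concurrent.locks.ReentrantReadWriteLock;

          class DefensiveVersioning
          {
              private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
              private volatile long ringVersion = 0; // ++ is safe: writers hold the write lock

              void updateNormalTokens(Runnable mutation)
              {
                  lock.writeLock().lock();
                  try
                  {
                      mutation.run();     // e.g. update tokenToEndpointMap
                  }
                  finally
                  {
                      ringVersion++;      // bumped last, after every mutation, before unlock
                      lock.writeLock().unlock();
                  }
              }

              long version() { return ringVersion; } // volatile read, no lock needed
          }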

          Is the idea with the striped lock on the endpoint cache in AbstractReplicationStrategy to help smooth out the stampede effect when the "global" lock on the cached TM gets released after the fill? Do you think it's worth the extra complexity? FWIW, my v2 patch suffers from this issue and it hasn't reared its head in production. The write load for the machines in the cluster I've been looking at is low compared to many others, though, at 6-7k/sec peak on an 8-core box.

          Jonathan Ellis added a comment - edited

          v4 attached that uses a versioning approach like yours. I dropped the readLock acquire on version read since it's not necessary to block callers during the update. (A few extra over-broad replica set operations won't hurt.)

          Jonathan Ellis added a comment -

          It seems as if just setting the cache to empty would allow a period of time where TokenMetadata write methods had returned but not all threads have seen the mutation yet

          I'm not 100% sure this is what you're talking about, but I see this problem with the existing code (and my v3):

          Thread 1                 Thread 2        
          getNaturalEndpoints      
          cloneOnlyTokenMap        
                                   invalidateCachedTokenEndpointValues
          endpoints = calculate
          cacheEndpoint [based on the now-invalidated token map]
          

          So it doesn't quite work. We'd need to introduce another AtomicReference on the cache, so that invalidate could create a new Map (so it doesn't matter if someone updates the old one). But I think you're right that getting rid of the callback approach entirely is better.
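
          A sketch of that fix (generic, hypothetical names): invalidate() swaps in a fresh map, so a cacheEndpoint() racing with invalidation writes its stale entry into a map that no future reader will consult.

          import java.util.concurrent.ConcurrentHashMap;
          import java.util.concurrent.atomic.AtomicReference;

          class SwappableCache<K, V>
          {
              private final AtomicReference<ConcurrentHashMap<K, V>> map =
                      new AtomicReference<>(new ConcurrentHashMap<>());

              V get(K key) { return map.get().get(key); }

              void cacheEndpoint(K key, V value)
              {
                  // A thread computing from a pre-invalidation clone writes into
                  // whatever map it sees here; if invalidate() ran in between, that
                  // map is already unreachable to new readers, so the stale entry
                  // is harmless.
                  map.get().put(key, value);
              }

              void invalidate() { map.set(new ConcurrentHashMap<>()); }
          }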

          Rick Branson added a comment -

          I like the simpler approach. I still think the callbacks for invalidation are asking for it. I also think perhaps the stampede lock should be more explicit than a synchronized lock on "this" to prevent unintended blocking from future modifications.

          Either way, I think the only material concern I have is the order that TokenMetadata changes get applied to the caches in AbstractReplicationStrategy instances. Shouldn't the invalidation take place on all threads in all instances of AbstractReplicationStrategy before returning from an endpoint-mutating write operation in TokenMetadata? It seems as if just setting the cache to empty would allow a period of time where TokenMetadata write methods had returned but not all threads have seen the mutation yet because they are still holding onto the old clone of TM. This might be alright though, I'm not sure. Thoughts?

          Jonathan Ellis added a comment -

          I think we can craft a simpler solution (v3) by using an AtomicReference to the TM clone. This removes the possibility of deadlock since clearEndpointCache now only makes non-blocking calls.

          I've also refined it to use a Striped<Lock> per-keyToken, instead of synchronizing on the TM clone itself, since concurrent endpoint computation is fine.
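
          This presumably refers to Guava's Striped. A sketch of the shape under that assumption (the surrounding cache code is elided and hypothetical): a stampede on one token is serialized, while distinct tokens compute their endpoints concurrently.

          import java.net.InetAddress;
          import java.util.List;
          import java.util.concurrent.locks.Lock;

          import com.google.common.util.concurrent.Striped;

          class StripedEndpointFill
          {
              private final Striped<Lock> stripes = Striped.lock(1024); // 1024 lock stripes

              List<InetAddress> endpointsFor(Object keyToken)
              {
                  Lock lock = stripes.get(keyToken); // same token -> same stripe
                  lock.lock();
                  try
                  {
                      // re-check the cache here, then compute on a miss
                      return computeEndpoints(keyToken);
                  }
                  finally
                  {
                      lock.unlock();
                  }
              }

              private List<InetAddress> computeEndpoints(Object keyToken)
              {
                  throw new UnsupportedOperationException("stands in for calculateNaturalEndpoints()");
              }
          }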

          Rick Branson added a comment -

          Attached a new patch with the deadlock fixed. We're running this on a production cluster.

          The primary issue was the callback for invalidation from TokenMetadata to all of the registered AbstractReplicationStrategy instances. This was asking for it anyway, so in the patch I replaced the "push" invalidation with simple versioning of the TokenMetadata endpoints. TokenMetadata bumps its version number each time the cache would need to be invalidated, and AbstractReplicationStrategy checks its version when it needs to do a read, invalidating if necessary. This gets the invalidation out of the gossip threads and into the RPC threads, which is probably a good thing. The only thing I'm not super crazy about is the extra hot-path read lock acquisition on TokenMetadata.getEndpointVersion(), which might be avoidable.
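
          A sketch of this pull model (hypothetical names): TokenMetadata only increments a counter, and each strategy compares the counter on the read path, clearing its own cache when it has fallen behind.

          import java.util.concurrent.ConcurrentHashMap;

          class VersionedEndpointCache<K, V>
          {
              interface Ring { long getEndpointVersion(); } // bumped on every ring mutation

              private final Ring ring;
              private final ConcurrentHashMap<K, V> cache = new ConcurrentHashMap<>();
              private volatile long cachedVersion = -1;

              VersionedEndpointCache(Ring ring) { this.ring = ring; }

              V get(K token)
              {
                  long current = ring.getEndpointVersion();
                  if (current != cachedVersion)
                  {
                      cache.clear();           // invalidate lazily, on the RPC read path
                      cachedVersion = current; // benign race: worst case is an extra clear
                  }
                  return cache.get(token);
              }
          }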

          Rick Branson added a comment -

          Unfortunately both of the patches suffer from a deadlock, since the invalidation and fill are wrapped up in TokenMetadata's locks.

          T1 acquires cache read lock
          T2 acquires TokenMetadata write lock
          T1 acquires cache write lock on miss
          T2 is blocked on cache write lock trying to invalidate
          T1 is blocked on TokenMetadata read lock trying to cloneOnlyTokenMap to fill the cache

          Trying to work on a fix.
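
          Reduced to its shape (simplified: the real path first takes the cache read lock and upgrades on a miss), this is a classic lock-order inversion: the fill path takes cache then TokenMetadata, while invalidation takes them in the opposite order. A hypothetical sketch:

          import java.util.concurrent.locks.ReentrantReadWriteLock;

          class DeadlockShape
          {
              final ReentrantReadWriteLock cacheLock = new ReentrantReadWriteLock();
              final ReentrantReadWriteLock tmLock = new ReentrantReadWriteLock();

              void fillOnMiss()  // T1, the read path
              {
                  cacheLock.writeLock().lock();     // holds the cache write lock
                  try
                  {
                      tmLock.readLock().lock();     // blocks while T2 holds the TM write lock
                      try { /* cloneOnlyTokenMap() to fill the cache */ }
                      finally { tmLock.readLock().unlock(); }
                  }
                  finally { cacheLock.writeLock().unlock(); }
              }

              void ringChange()  // T2, the gossip path
              {
                  tmLock.writeLock().lock();        // holds the TM write lock
                  try
                  {
                      cacheLock.writeLock().lock(); // blocks while T1 holds the cache write lock
                      try { /* invalidate the endpoint cache */ }
                      finally { cacheLock.writeLock().unlock(); }
                  }
                  finally { tmLock.writeLock().unlock(); }
              }
          }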

          Rick Branson added a comment -

          Well, I started writing the patch this morning and I don't write multi-threaded Java code every day, so I'm overly careful. The only theoretical advantage to my patch is that it allows concurrent readers.

          Jonathan Ellis added a comment -

          I have to admit I like it better without the custom wrapper class.

          Rick Branson added a comment -

          CPU user% graph during the rollout of the patch I attached on 1 DC (15 nodes) of the cluster. Around ~21:05 the patch starts to roll out and spikes are seen. The node in question receives the patch at ~21:30, and afterwards the spikes are gone. The rollout finishes at ~21:45.

          Rick Branson added a comment -

          Attached a patch we deployed to production that fixed the issue.

          Jonathan Ellis added a comment -

          I see, with vnodes we have enough ranges that we can have a thundering herd even if each range only clones once.

          v2 attached with the approach you described originally.

          Jonathan Ellis added a comment -

          Proposal: In AbstractReplicationStrategy.getNaturalEndpoints(), cache the cloned TokenMetadata instance returned by TokenMetadata.cloneOnlyTokenMap(), wrapping it with a lock to prevent stampedes, and clearing it in clearEndpointCache().

          Why not just use a sharded lock to prevent stampedes directly w/o the caching complexity, as in the attached?

          Benedict added a comment - edited

          Just have the wrapper make any updates to a collection replace the collection instead of modifying it.

          [NB: I haven't looked to see if this would have any negative performance implications on the update side; I'm assuming the reads are more frequent and/or collections small... if not, a Treap is probably the better choice, as my old Jjoost implementation (IIRC) supports snapshotting, and multiple values are dealt with inside the tree itself, not as a collection]
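
          A sketch of the wrapper idea (generic types; Guava's immutable collections assumed): every update replaces the value collection rather than mutating it, so a previously cloned top-level map never observes later changes.

          import java.util.concurrent.ConcurrentHashMap;

          import com.google.common.collect.ImmutableSortedSet;

          class CowMultimap<K, V extends Comparable<V>>
          {
              private final ConcurrentHashMap<K, ImmutableSortedSet<V>> map = new ConcurrentHashMap<>();

              void put(K key, V value)
              {
                  // Replace the value set instead of mutating it: old snapshots that
                  // still reference the previous set are unaffected.
                  map.compute(key, (k, old) -> old == null
                          ? ImmutableSortedSet.of(value)
                          : ImmutableSortedSet.<V>naturalOrder().addAll(old).add(value).build());
              }

              ImmutableSortedSet<V> get(K key) { return map.get(key); }
          }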

          Jonathan Ellis added a comment -

          Actually I think an STM multimap would still get messy quickly since you need to do a "deep" clone: cloning the top-level Map would leave the values (sub-collections) sharing a reference.

          Jonathan Ellis added a comment -

          Yes. Too bad the implementation classes like AbstractSortedKeySortedSetMultimap are package-private.

          Benedict added a comment - edited

          A Treap? Can be cheaply built, cheaply merged and cheaply cloned.

          Also, anything cheaply cloneable would work for that operation. A SnapTree that is wrapped to support multi-map functionality would also work.

          Jonathan Ellis added a comment -

          it looks like TreeMultimap.putAll actually loops over each entry and calls put one at a time

          I don't see a way around this: https://code.google.com/p/guava-libraries/issues/detail?id=1579

          Jonathan Ellis added a comment - edited

          Interesting. Could we optimize cOTM instead? That would definitely be the simplest solution.

          E.g. it looks like TreeMultimap.putAll actually loops over each entry and calls put one at a time, which is the worst-case scenario for binary tree rebalancing: quadratic time.
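
          For contrast, a generic illustration (not the Cassandra change itself): TreeMap's SortedMap copy constructor builds from already-sorted input in linear time, while per-entry put pays a tree search and possible rebalance on every insert.

          import java.util.Map;
          import java.util.SortedMap;
          import java.util.TreeMap;

          class BulkBuild
          {
              static TreeMap<String, Integer> copyFast(SortedMap<String, Integer> source)
              {
                  // TreeMap's SortedMap constructor uses buildFromSorted: linear time.
                  return new TreeMap<>(source);
              }

              static TreeMap<String, Integer> copySlow(SortedMap<String, Integer> source)
              {
                  TreeMap<String, Integer> copy = new TreeMap<>();
                  for (Map.Entry<String, Integer> e : source.entrySet())
                      copy.put(e.getKey(), e.getValue()); // search + rebalance per entry
                  return copy;
              }
          }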


            People

            • Assignee: Jonathan Ellis
            • Reporter: Rick Branson
            • Reviewer: Rick Branson
            • Votes: 2
            • Watchers: 6
