CASSANDRA-5051: Allow automatic cleanup after gc_grace

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Fix Version/s: 3.0
    • Component/s: Core
    • Labels:

      Description

      When using vnodes, after adding a new node you have to run cleanup on all the machines, because you don't know which are affected, and chances are it was most if not all of them. As an alternative to this intensive process, we could allow cleanup during compaction if the data is older than gc_grace (or perhaps some other time period, since people tend to use gc_grace hacks to get rid of tombstones).
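
      A minimal sketch of the proposed check, using hypothetical names and a simplified long-based token model rather than the actual Cassandra types:

          import java.util.List;

          class CleanupDuringCompactionSketch
          {
              // Illustrative token range [left, right] over longs.
              record Range(long left, long right)
              {
                  boolean contains(long token) { return token >= left && token <= right; }
              }

              // A row may be dropped during compaction only if the local node no
              // longer owns its token AND the row is older than gc_grace.
              static boolean canPurge(long rowToken, long rowTimestampSeconds,
                                      long nowSeconds, long gcGraceSeconds,
                                      List<Range> locallyOwnedRanges)
              {
                  boolean ownedLocally = locallyOwnedRanges.stream().anyMatch(r -> r.contains(rowToken));
                  boolean olderThanGcGrace = nowSeconds - rowTimestampSeconds > gcGraceSeconds;
                  return !ownedLocally && olderThanGcGrace;
              }
          }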

      Attachments

      1. 5051-v2.txt
        12 kB
        Jonathan Ellis
      2. 0004-5051-additional-test-v4.patch
        4 kB
        Vijay
      3. 0002-5051-remove-upgradesstable-v4.patch
        9 kB
        Vijay
      4. 0002-5051-remove-upgradesstable.patch
        9 kB
        Vijay
      5. 0001-CASSANDRA-5051.patch
        17 kB
        Vijay
      6. 0001-5051-with-test-fixes.patch
        23 kB
        Vijay
      7. 0001-5051-v6.patch
        41 kB
        Vijay
      8. 0001-5051-v4.patch
        24 kB
        Vijay


          Activity

          Jonathan Ellis added a comment -

          I think I'd be comfortable with auto-cleanup during compaction now, even without a grace period.

          Throwing away data silently still makes me nervous, even if it's data we're sure we've sent elsewhere, but it's easy enough to set up a real backup strategy now that I think we're okay. And the truly paranoid can run with snapshot_before_compaction.

          Jonathan Ellis added a comment -

          v2 attached that makes cleanup-during-compaction always-on. (v2 also switches from comparing based on table name to checking for LocalStrategy, and moves the indexColumns list internal to rmIdxRenewCounter.)

          This highlights a problem, though – some tests now fail, because StorageService.getLocalRanges will return an empty list until StorageService initializes it. (I think this is either via loading the stored ring through initServer, or via gossip filling things in. Not really sure how gossip fills in my own token if we don't load from the system table. Bit of a mess here.)

          So (a) obviously this is a bit fragile for the tests. But is there potential for us to also throw away data that we shouldn't if we get behind on gossip somehow? At the very least I think we need to include pending ranges for the local node.

          Nit: ISTM we ought to be able to move LCR's Iterable<OnDiskAtom> into AbstractCompactedRow, but I'm not actually sure how to make generics happy w/ the PR subclass returning Iterator<Column>.

          Brandon Williams added a comment -

          But is there potential for us to also throw away data that we shouldn't if we get behind on gossip somehow?

          I don't see how. We're fairly well protected by persisting the ring, but even without that, if we're behind, don't we just own more than we should? This was the impetus for persisting the ring.

          Vijay added a comment -

          Hi Jonathan, LGTM, do you want me to fix the test cases and get ready for commit?

          Jonathan Ellis added a comment -

          WFM.

          Jason Brown added a comment -

          +1 on this ticket (and patch), as well. My devops guys asked about something like this recently, so it will be helpful.

          I see it's tagged for 2.0 only. The change to the code seems innocuous enough; is there a safety concern behind not putting it into 1.2 as well?

          Jonathan Ellis added a comment -

          Brandon's pretty confident, but I'd rather not cram it into a minor release just in case there's something we missed.

          Jason Brown added a comment -

          Yeah, I figured it was a safety concern. Not a problem.

          Jonathan Ellis added a comment -

          Explicit cleanup is still useful to force reclaiming disk space immediately. But I think we can combine cleanup/upgradesstables. (I suggest retaining the "cleanup" command name.)

          Vijay added a comment -

          0001 fixes the test cases.
          0002 removes the upgradesstables command.

          Jonathan Ellis added a comment -

          Don't we need to include pending ranges? I don't think getRangesForEndpoint does.

          Vijay added a comment -

          v4 includes pending ranges. All the test cases pass. Thanks!

          Jonathan Ellis added a comment -

          Is it possible to add a test or dtest for the pending ranges case? I would really hate to introduce a regression there later.

          Vijay added a comment -

          0004 adds additional test cases for the scenario where compaction happens during range movement.

          Jonathan Ellis added a comment -

          Does 0004 fail if you remove the pending ranges code? It looks to me like it might not. (Couldn't actually try it since I ran into rebase problems.)

          Vijay added a comment -

          Hi Jonathan, I rebased, fixed a few more tests, adjusted the test case a bit, and pushed to https://github.com/Vijay2win/cassandra/commits/5051-v5. If you remove the ranges code, the v5 test case will fail. Thanks!

          Jonathan Ellis added a comment -

          I think I'm missing something still. The test makes tk0 bootstrap into tk2's range (the "local" node). So there should be a pending range for tk0 but tk2 should still be the "owner" of that range until the bootstrap finishes. So why would this test fail w/o pending ranges in the cleanup?

          Vijay added a comment -

          Hi Jonathan, maybe it will help if you add the following line at line 169 (testCleanupDuringRangeMovement):

                  logger.info("Node's Range: {}", StorageService.instance.getLocalRanges(table.getName()));
          

          The output looks like this:

          13/04/06 21:48:24 INFO compaction.CompactionsPurgeTest: Node's Range: [(Token(bytes[01]),Token(bytes[02])]]
          13/04/06 21:48:24 INFO compaction.CompactionsPurgeTest: Range movement scheduled for: {(Token(bytes[02]),Token(bytes[00])]=[/127.0.0.3]}
          
          Edward Capriolo added a comment -

          This has come up before (I even remember suggesting it). To play devil's advocate: why do vnodes change the rationale for whether this feature should be allowed? Not automatically cleaning up was always described as a "safety feature". Vnodes are not any more "safe" from that perspective. Adding this feature and having it on by default would be a complete 180. Even if all nodes are affected and need to be cleaned up, since they are vnodes they are only "slightly" affected.

          It would be nice if the feature did not kick in until after gc_grace and could be controlled at runtime, possibly defined at the keyspace or column family level.

          Brandon Williams added a comment -

          Why do vnodes change the rationale for whether this feature should be allowed?

          The rationale is in the description.

          Edward Capriolo added a comment -

          That is not a rationale that describes why cleanup is no longer required; it is only a rationale that argues life would be easier if I did not have to do it.

          I agree with the argument that not having to run cleanup would be easier on administrators.

          My point is that this opinion has been put forward before and been denied. I'm not going to look up the JIRA, but if I were to find it and quote it, how do we account for the previous arguments against it?

          Brandon Williams added a comment -

          That is not a rationale that describes why cleanup is no longer required

          That is true, but it also doesn't say that. It says it's required on all the nodes if you have vnodes, instead of just the neighbor preceding the new node in the ring.

          Edward Capriolo added a comment -

          https://issues.apache.org/jira/browse/CASSANDRA-1746

          Jonathan Ellis added a comment -

          Did you read the comments on the ticket you linked?

          The reason cleanup and compaction are different things is that rows that do not belong to the current node can be generated by CL.ANY writes as well as "left behind" by token changes... We don't have an efficient way to ask "does key exist in hints" (hints are keyed by target).

          Hinted handoff no longer works by storing foreign material in the data tables.

          I also alluded to another reason in my first comment here:

          Throwing away data silently still makes me nervous ... but it's easy enough to set up a real backup strategy now that I think we're okay.

          Incremental backups and PITR also did not exist back in November 2010.

          Jonathan Ellis added a comment -

          maybe it will help if you add the following line at line 169 (testCleanupDuringRangeMovement)

          That does help. I think the test and the code are both wrong.

          First, the code is wrong because the range being moved is going from node 2's range to node 3. The "local" node's range ([01 .. 02)) is unaffected by the bootstrap here. Only node 3 should add that range to its "don't throw this away on cleanup" list. (So, just don't drop the destination part of TM.getPendingRanges and this will be trivial to calculate.)

          Second, the test is wrong because (as written) it should only retain keys from [01 .. 02]. What we actually need to test is making the local node the one bootstrapping (i.e., having a pending range entry).
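
          A sketch of the range set the local node should protect under this scheme, with hypothetical names and simplified types; in the real code the pending map would come from getPendingRanges, which maps each pending range to its destination endpoints:

              import java.net.InetAddress;
              import java.util.*;

              class ProtectedRangesSketch
              {
                  // Keep what the node owns now, plus any pending range whose
                  // destination is this node (data it is about to receive).
                  static <R> Set<R> protectedRanges(Collection<R> localRanges,
                                                    Map<R, ? extends Collection<InetAddress>> pendingRangesByDestination,
                                                    InetAddress localAddress)
                  {
                      Set<R> result = new HashSet<>(localRanges);
                      for (Map.Entry<R, ? extends Collection<InetAddress>> entry : pendingRangesByDestination.entrySet())
                          if (entry.getValue().contains(localAddress))
                              result.add(entry.getKey());
                      return result;
                  }
              }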

          Vijay added a comment -

          Pushed a change to the same git branch, which handles all of the above concerns and in addition handles the replace_token case, where we should not clean up (range information is not available in the replace_token case).

          Jonathan Ellis added a comment -

          It still looks to me like the test is not checking range calculation for a node that is in the process of bootstrapping (i.e.: localhost is a Normal token; another node is Bootstrapping).

          I think I liked the isInRanges implementation better before the "if it's empty, assume we're bootstrapping" refactor. At the least I'd add an assert to make sure empty really does imply isBootstrapping.
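
          Roughly what that guard plus the suggested assert could look like, with hypothetical names and a simplified long-based token model (not the actual patch):

              import java.util.List;

              class IsInRangesSketch
              {
                  record Range(long left, long right)
                  {
                      boolean contains(long token) { return token >= left && token <= right; }
                  }

                  static boolean isInRanges(long token, List<Range> localRanges, boolean isBootstrapping)
                  {
                      if (localRanges.isEmpty())
                      {
                          // Empty should only happen while bootstrapping; assert it,
                          // and keep everything rather than risk dropping live data.
                          assert isBootstrapping : "empty local ranges on a non-bootstrapping node";
                          return true;
                      }
                      return localRanges.stream().anyMatch(r -> r.contains(token));
                  }
              }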

          Vijay added a comment -

          (i.e.: localhost is a Normal token; another node is Bootstrapping)

          I think that's exactly what the test does.

          tk0 (the local node owns the whole range before bootstrap) is taking over the whole range, but it is still bootstrapping...
          CPT.testCleanupDuringRangeMovement()

                  // test if node streaming is not dropping the data.
                  tmd.addBootstrapToken(new BytesToken(tk0), InetAddress.getByName("127.0.0.3"));
                  tmd.updateNormalToken(new BytesToken(tk2), FBUtilities.getBroadcastAddress());
                  StorageService.calculatePendingRanges(table.getReplicationStrategy(), table.getName());
                  logger.info("Range movement scheduled for: {}", tmd.getPendingRanges(table.getName()));
                  CompactionManager.instance.submitMaximal(cfs, Integer.MAX_VALUE).get();
          

          At the least I'd add an assert to make sure empty really does imply isBootstrapping.

          Will do, let me know... Thanks!

          Jonathan Ellis added a comment -

          I meant that localhost=Normal, other=Bootstrap does not exercise "include pending ranges" code, since local node ranges do not change during BS.

          Edward Capriolo added a comment -

          Weird edge case I thought of: how will this work if someone needs to change snitch/RF? There might be some window of time where I would not want Cassandra to delete data even though it is not supposed to be there. In that case it would be nice to be able to turn this feature off.

          Vijay added a comment -

          V6 addresses the test concerns.

          Jonathan Ellis added a comment -

          How will this work if someone needs to change snitch/RF?

          I don't see a problem, but I could be missing something subtle. Here are the three scenarios I see:

          1. You expand a replica set to new nodes. This doesn't affect you since there is never "data that used to belong but now does not anymore."
          2. You reduce the replica set. This saves you from having to run cleanup but I don't see why you might want to stop it from doing that.
          3. You reduce the replica set temporarily, knowing that you're going to re-expand it. Here it might look like you could save time by preventing the cleanup, but Cassandra doesn't actually know that "I used to be responsible for some of this data" and will re-stream the entire range anyway.
          Jonathan Ellis added a comment - edited

          V6 addresses the test concerns

          Why do half a dozen tests need initServer now?

          When I rip pendingranges out of getNodeRange, CPTest still passes:

          private List<Range<Token>> getNodeRange()
          {
              List<Range<Token>> ranges = new ArrayList<Range<Token>>(StorageService.instance.getLocalRanges(cfs.table.getName()));
              return ranges;
          }
          

          This may be because it's still hitting the "empty ranges == all ranges" code in isInRanges, which looks kind of fishy to me – if we have the pending ranges in getNodeRange, we shouldn't need this extra piece.

          Vijay added a comment -

          Hi Jonathan, we also need to remove the following for the test to fail...

          StorageService.instance.startBootstrapping();

          In the test case, the empty-ranges check applies when the node is bootstrapping. I would rather be safe than sorry.

          Jonathan Ellis added a comment -

          Why do half a dozen tests need initServer now?

          Via IRC: since we don't have a token without it, bootstrap is not true, and those tests do compactions.

          Jonathan Ellis added a comment -

          I wanted to see if I could decouple things a bit and https://github.com/jbellis/cassandra/commits/5051-v7 is the result (also rebased to current trunk). I added IRangeProvider to allow tests to inject what they want to test directly.

          I also switched from using bootstrapMode to Mode, but either way I don't think we're quite in the clear – when we tell node A to move or relocate data to B, it relies on gossip to tell node B about that (the pending ranges). But if gossip is extra slow for whatever reason – even just a large cluster where everything else is working as planned – it might take longer than the hardcoded ring delay for B to learn about its new responsibility, and compaction would delete the data.

          I think we might need to add a "tell the recipient about the new ranges he's going to have" step to streaming. /cc Yuki Morishita Brandon Williams
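
          Conceptually, something like the following entirely hypothetical sketch: the stream setup message carries the ranges the recipient is about to own, and the recipient registers them as pending before accepting any data:

              import java.util.List;

              class StreamNotifySketch
              {
                  record TokenRange(long left, long right) {}

                  // Hypothetical stream-setup message carrying the ranges the
                  // recipient is about to become responsible for.
                  record StreamRequest(String keyspace, List<TokenRange> incomingRanges) {}

                  interface PendingRangeRegistry
                  {
                      void addPending(String keyspace, List<TokenRange> ranges);
                  }

                  static void onStreamRequest(StreamRequest request, PendingRangeRegistry registry)
                  {
                      // Register pending ranges first, so the cleanup check during
                      // compaction sees them even if gossip is lagging, then accept
                      // the streamed sstables.
                      registry.addPending(request.keyspace(), request.incomingRanges());
                  }
              }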

          Jonathan Ellis added a comment - edited

          Ping Yuki Morishita Brandon Williams
          Brandon Williams added a comment -

          I think we might need to add a "tell the recipient about the new ranges he's going to have" step to streaming

          This sounds like a good idea to me, since gossip can only guarantee eventual consistency, and there might be a partition you're not aware of at the time, causing loss.

          Jonathan Ellis added a comment -

          I think that creeps the scope enough that we should push this to 2.1, especially since Yuki is concurrently working on a new Streaming design.

          Chris Burroughs added a comment -

          /operating-a-cluster-hat

          "Yo we added some more nodes why is performance not better yet" is a fairly common complaint I get. That said of: (A) status quo; (B) auto-cleanup after X; (C) auto-cleanup all the time. I'd welcome the option for (B) and probably enable it on all current clusters, but I can't think of a cluster where I would prefer (C) over (A). Even a small risk of dealing with a backup/restore situation is not worth it.


            People

            • Assignee:
              Vijay
            • Reporter:
              Brandon Williams
            • Reviewer:
              Jonathan Ellis
            • Votes:
              1
            • Watchers:
              11
