Cassandra
  1. Cassandra
  2. CASSANDRA-1740

Nodetool commands to query and stop compaction, repair, cleanup and scrub

    Details

    • Type: Improvement Improvement
    • Status: Resolved
    • Priority: Minor Minor
    • Resolution: Fixed
    • Fix Version/s: 1.1.0
    • Component/s: Tools
    • Labels:

      Description

      The only way to stop compaction, repair, cleanup, or scrub in progress is to stop and restart the entire Cassandra server. Please provide nodetool commands to query whether such things are running, and stop them if they are.

      1. 0001-Patch-to-Stop-compactions.patch
        13 kB
        Vijay
      2. 0001-Patch-to-Stop-compactions-v2.patch
        17 kB
        Vijay
      3. 0001-Patch-to-Stop-compactions-v3.patch
        17 kB
        Vijay
      4. 0001-Patch-to-Stop-compactions-v4.patch
        16 kB
        Sylvain Lebresne
      5. 0001-Patch-to-Stop-compactions-v5.patch
        16 kB
        Vijay
      6. 0001-Patch-to-Stop-compactions-v6.patch
        17 kB
        Sylvain Lebresne
      7. CASSANDRA-1740.patch
        9 kB
        Pavel Yaskevich

        Issue Links

          Activity

          Hide
          Pavel Yaskevich added a comment -

          it supports stopping cleanup, scrub, and both minor/major compactions. I'm not sure if we should do the same thing for repair, what do you think, Jonathan?

          Show
          Pavel Yaskevich added a comment - it supports stopping cleanup, scrub, and both minor/major compactions. I'm not sure if we should do the same thing for repair, what do you think, Jonathan?
          Hide
          Stu Hood added a comment -

          If you don't mind, I'd really like to get CASSANDRA-2191 in first, as it will change the semantics of this ticket quite a bit. In particular, it looks like you'll need a compaction id to decide what to stop: I'm currently exposing the CI hashcode, which might do the trick.

          Show
          Stu Hood added a comment - If you don't mind, I'd really like to get CASSANDRA-2191 in first, as it will change the semantics of this ticket quite a bit. In particular, it looks like you'll need a compaction id to decide what to stop: I'm currently exposing the CI hashcode, which might do the trick.
          Hide
          Pavel Yaskevich added a comment -

          I don't mind, sure.

          Show
          Pavel Yaskevich added a comment - I don't mind, sure.
          Hide
          Jonathan Ellis added a comment -

          Repair is tough since it interacts with other nodes. Let's leave that alone for now.

          Show
          Jonathan Ellis added a comment - Repair is tough since it interacts with other nodes. Let's leave that alone for now.
          Hide
          Pavel Yaskevich added a comment -

          Great, thank you!

          Show
          Pavel Yaskevich added a comment - Great, thank you!
          Hide
          Pavel Yaskevich added a comment -

          Stu Hood: do you want to be in charge of this one since your changes in compaction mechanism?

          Show
          Pavel Yaskevich added a comment - Stu Hood: do you want to be in charge of this one since your changes in compaction mechanism?
          Hide
          Sylvain Lebresne added a comment -

          I committed CASSANDRA-2191 so as stu said, this will need some rebasing to handle it.

          For repair, let's just create another ticket. If stu wants/have time to do it, fine, otherwise I may do it.

          Show
          Sylvain Lebresne added a comment - I committed CASSANDRA-2191 so as stu said, this will need some rebasing to handle it. For repair, let's just create another ticket. If stu wants/have time to do it, fine, otherwise I may do it.
          Hide
          Sylvain Lebresne added a comment -

          As for handling the multi-threaded compactions, the hashcode would more or less work but since it's not totally safe I would prefer assigning a name to each compaction which could be:

          • a uuid assigned to each compaction when created
          • a simple (atomically) increasing number
          • a simple (atomically) increasing number for each type of compaction, the name being something like major-42 or minor-3012. Nice thing is it tells you how many minor, major, validata, ... compaction you have run already.
          Show
          Sylvain Lebresne added a comment - As for handling the multi-threaded compactions, the hashcode would more or less work but since it's not totally safe I would prefer assigning a name to each compaction which could be: a uuid assigned to each compaction when created a simple (atomically) increasing number a simple (atomically) increasing number for each type of compaction, the name being something like major-42 or minor-3012. Nice thing is it tells you how many minor, major, validata, ... compaction you have run already.
          Hide
          Pavel Yaskevich added a comment -

          I like the last option!

          Show
          Pavel Yaskevich added a comment - I like the last option!
          Hide
          Jonathan Ellis added a comment -

          Stu, are you planning to multithreadify this, then?

          Show
          Jonathan Ellis added a comment - Stu, are you planning to multithreadify this, then?
          Hide
          Stu Hood added a comment -

          I wasn't necessarily volunteering, but I might be able to get to this after CASSANDRA-2455?

          Show
          Stu Hood added a comment - I wasn't necessarily volunteering , but I might be able to get to this after CASSANDRA-2455 ?
          Hide
          Pavel Yaskevich added a comment -

          sgtm

          Show
          Pavel Yaskevich added a comment - sgtm
          Hide
          Pavel Yaskevich added a comment -

          Stu, are you still planning to work on this?

          Show
          Pavel Yaskevich added a comment - Stu, are you still planning to work on this?
          Hide
          Stu Hood added a comment -

          Hey Pavel, sorry: I should have commented ages ago: I won't be working on this.

          Show
          Stu Hood added a comment - Hey Pavel, sorry: I should have commented ages ago: I won't be working on this.
          Hide
          Vijay added a comment -

          Will it make more sense to have a commands like pause compaction via "nt pause compaction" and resume via "nt resume compaction"? instead of "nt stop compaction"

          Show
          Vijay added a comment - Will it make more sense to have a commands like pause compaction via "nt pause compaction" and resume via "nt resume compaction"? instead of "nt stop compaction"
          Hide
          Jonathan Ellis added a comment -

          No. I don't want to deal with trying to "freeze" compaction state for later. It's also dangerous because of the reference counting we do, to pause that indefinitely.

          Show
          Jonathan Ellis added a comment - No. I don't want to deal with trying to "freeze" compaction state for later. It's also dangerous because of the reference counting we do, to pause that indefinitely.
          Hide
          Vijay added a comment -

          Attached patch can stop the following, COMPACTION, VALIDATION, KEY_CACHE_SAVE, ROW_CACHE_SAVE,CLEANUP, SCRUB, INDEX_BUILD

          nt stop <compaction type>

          but for the minor compaction when we run stop it starts immediately hence i am not sure if this will help to stop the minor compactions but it is definitely better than restarting.

          Show
          Vijay added a comment - Attached patch can stop the following, COMPACTION, VALIDATION, KEY_CACHE_SAVE, ROW_CACHE_SAVE,CLEANUP, SCRUB, INDEX_BUILD nt stop <compaction type> but for the minor compaction when we run stop it starts immediately hence i am not sure if this will help to stop the minor compactions but it is definitely better than restarting.
          Hide
          Sylvain Lebresne added a comment -
          • In CompactionInfo.Holder, stop should likely be volatile
          • I'm not fond of using exceptions (StoppedException) for something that is not exceptional or an error in the first place. But in particular, throwing the exception will result in a 'Fatal error' in the log, which tends to scare people, so that's another reason to avoid it.
          • Stopping an AutoSavingCache task will leave a tmp file around.
          • Doesn't make much sense to me to have StoppedException extend IOException since it's not. If the only goal is to avoid having to add a new checked exception, then making it a RuntimeException is imo better. But as said above, I think we shouldn't use exception for that anyway.
          Show
          Sylvain Lebresne added a comment - In CompactionInfo.Holder, stop should likely be volatile I'm not fond of using exceptions (StoppedException) for something that is not exceptional or an error in the first place. But in particular, throwing the exception will result in a 'Fatal error' in the log, which tends to scare people, so that's another reason to avoid it. Stopping an AutoSavingCache task will leave a tmp file around. Doesn't make much sense to me to have StoppedException extend IOException since it's not. If the only goal is to avoid having to add a new checked exception, then making it a RuntimeException is imo better. But as said above, I think we shouldn't use exception for that anyway.
          Hide
          Vijay added a comment -

          I can do all of the above, but i still think it is an Exception because someone/something interrupted (agree about runtime exception and will add more message so users wont panic). It will be easier to maintain later because people dont need to understand a return vs exception, they just need to worry about is an exception will be thrown somewhere.

          I can remove the exception and just do return if stopped if we prefer otherwise, let me know.

          Show
          Vijay added a comment - I can do all of the above, but i still think it is an Exception because someone/something interrupted (agree about runtime exception and will add more message so users wont panic). It will be easier to maintain later because people dont need to understand a return vs exception, they just need to worry about is an exception will be thrown somewhere. I can remove the exception and just do return if stopped if we prefer otherwise, let me know.
          Hide
          Sylvain Lebresne added a comment -

          Let me reformulate: I strongly think that we shouldn't use an uncaught exception (or rather only caught by our default uncaught exception handler) for this. If we have a nodetool command to stop something, then stopping it is not an error, and thus we shouldn't have a line in the log that start with ERROR. Not only because it scares people, but also because you don't want to have false alert for any system that is grepping the log for ERROR. And also because that would be super ugly.

          We do want a log message at INFO though. And it wouldn't hurt to make it as informative as we can. Something like "Stopping compaction of [sstable1, sstabl2, ...] as requested by user". Now I care less whether we use an exception internally to achieve this or not, and I could see benefits in throwing an exception for the JMX commands that are stopped, but I'm -1 on having an ERROR log message for a user triggered action.

          As for stopping validation compaction, this will actually leave repair compaction hanging. It's probably ok to add it still since it's better than nothing, but this ticket should probably spawn a specific 'stop repair' ticket.

          Show
          Sylvain Lebresne added a comment - Let me reformulate: I strongly think that we shouldn't use an uncaught exception (or rather only caught by our default uncaught exception handler) for this. If we have a nodetool command to stop something, then stopping it is not an error, and thus we shouldn't have a line in the log that start with ERROR. Not only because it scares people, but also because you don't want to have false alert for any system that is grepping the log for ERROR. And also because that would be super ugly. We do want a log message at INFO though. And it wouldn't hurt to make it as informative as we can. Something like "Stopping compaction of [sstable1, sstabl2, ...] as requested by user". Now I care less whether we use an exception internally to achieve this or not, and I could see benefits in throwing an exception for the JMX commands that are stopped, but I'm -1 on having an ERROR log message for a user triggered action. As for stopping validation compaction, this will actually leave repair compaction hanging. It's probably ok to add it still since it's better than nothing, but this ticket should probably spawn a specific 'stop repair' ticket.
          Hide
          Vijay added a comment -

          Hi Sylvain,

          • we shouldn't have a line in the log that start with ERROR

          I have made it to log info (with a localized message instead of error stack trace).

          In this way when a person running a command will also get exception printed out. Let me know if you still want to remove throw exception. Thanks!

          Show
          Vijay added a comment - Hi Sylvain, we shouldn't have a line in the log that start with ERROR I have made it to log info (with a localized message instead of error stack trace). In this way when a person running a command will also get exception printed out. Let me know if you still want to remove throw exception. Thanks!
          Hide
          Jonathan Ellis added a comment -

          I'm not a fan of the instanceof UIE in the default exception handler. Feels spaghetti-ish to me.

          What about checking for UIE in the compactionexecutor afterExecute? That would address the DRY problem without violating encapsulation so badly.

          Nit: would prefer mutable fields to be private and exposed via getter if necessary, e.g. would use Holder.isStopped() here.

          Show
          Jonathan Ellis added a comment - I'm not a fan of the instanceof UIE in the default exception handler. Feels spaghetti-ish to me. What about checking for UIE in the compactionexecutor afterExecute? That would address the DRY problem without violating encapsulation so badly. Nit: would prefer mutable fields to be private and exposed via getter if necessary, e.g. would use Holder.isStopped() here.
          Hide
          Vijay added a comment -

          Looks like calling TPE.afterExecute wont work because Throwable t is always null... http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6450211 (in java6)
          I can add back the isStopped() if we are ok with the rest, plz let me know. Thanks!

          Show
          Vijay added a comment - Looks like calling TPE.afterExecute wont work because Throwable t is always null... http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6450211 (in java6) I can add back the isStopped() if we are ok with the rest, plz let me know. Thanks!
          Hide
          Jonathan Ellis added a comment -

          Why not just make StoppedException extend RuntimeException?

          Show
          Jonathan Ellis added a comment - Why not just make StoppedException extend RuntimeException?
          Hide
          Vijay added a comment -

          Hi Jonathan,

          Moved instanceof to DTPE.logExceptionsAfterExecute

          Thanks!

          Show
          Vijay added a comment - Hi Jonathan, Moved instanceof to DTPE.logExceptionsAfterExecute Thanks!
          Hide
          Sylvain Lebresne added a comment -

          Looks good except that I believe logExceptionAfterExecute uses one too many getCause() so that the UserInterruptedException is not actually catched. Attaching v4 that solves this as well as the following minor changes:

          • make UserInteruptedException takes a CompactionInfo directly and generate the message internally (instead of duplicating the same message at each call site).
          • remove an added and non used logger in CompactionInfo
          • update the nodetool help message to respect the convention for mandatory arguments.
          Show
          Sylvain Lebresne added a comment - Looks good except that I believe logExceptionAfterExecute uses one too many getCause() so that the UserInterruptedException is not actually catched. Attaching v4 that solves this as well as the following minor changes: make UserInteruptedException takes a CompactionInfo directly and generate the message internally (instead of duplicating the same message at each call site). remove an added and non used logger in CompactionInfo update the nodetool help message to respect the convention for mandatory arguments.
          Hide
          Vijay added a comment -

          the only difference from v4 in v5 is

          if (actualException.getCause() != null && actualException.getCause() instanceof UserInterruptedException)

          Otherwise there will be a error in the log. Thanks!

          Show
          Vijay added a comment - the only difference from v4 in v5 is if (actualException.getCause() != null && actualException.getCause() instanceof UserInterruptedException) Otherwise there will be a error in the log. Thanks!
          Hide
          Sylvain Lebresne added a comment -

          I don't understand. Did you actually try v4 and got an exception in the log? actualException is already the cause of the ExecutionException, so it is directly the UserInterruptedException, unless that latter gets further encapsulated by another exception but I see nowhere in the code where that could happen. And as a matter of fact, if you do try stopping a compaction, both v3 and v5 do log an error, while v4 does not. If your testing differs, there is something going on.

          Show
          Sylvain Lebresne added a comment - I don't understand. Did you actually try v4 and got an exception in the log? actualException is already the cause of the ExecutionException, so it is directly the UserInterruptedException, unless that latter gets further encapsulated by another exception but I see nowhere in the code where that could happen. And as a matter of fact, if you do try stopping a compaction, both v3 and v5 do log an error, while v4 does not. If your testing differs, there is something going on.
          Hide
          Vijay added a comment - - edited

          Yeah I did try it and saw the exception, I was not sure who is wrapping it again with RTE.

          To be more clear:
          The RTE additional wrap is because WrappableRunnable which catches all Exceptions just to wrap it to a RTE (plz check http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/lang/RuntimeException.java). Hope this helps

          Show
          Vijay added a comment - - edited Yeah I did try it and saw the exception, I was not sure who is wrapping it again with RTE. To be more clear: The RTE additional wrap is because WrappableRunnable which catches all Exceptions just to wrap it to a RTE (plz check http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/lang/RuntimeException.java ). Hope this helps
          Hide
          Sylvain Lebresne added a comment -

          Ok, I believe you were either testing on trunk or you were testing stopping cache writes. On trunk, compaction tasks are wrapped but not on 1.0 (on 1.0, only the cache writing tasks uses WrappedRunnable).

          There is different way to fix but I believe the best one is to change WrappedRunnable so that it doesn't wrap RTEs. Attaching v6 that takes this approach.

          (Note that I have nothing against using WrappedRunnable in 1.0 too as for trunk for compactions if we feel so inclined but I still think WrappedRunnable should not wrap RTE)

          Show
          Sylvain Lebresne added a comment - Ok, I believe you were either testing on trunk or you were testing stopping cache writes. On trunk, compaction tasks are wrapped but not on 1.0 (on 1.0, only the cache writing tasks uses WrappedRunnable). There is different way to fix but I believe the best one is to change WrappedRunnable so that it doesn't wrap RTEs. Attaching v6 that takes this approach. (Note that I have nothing against using WrappedRunnable in 1.0 too as for trunk for compactions if we feel so inclined but I still think WrappedRunnable should not wrap RTE)
          Hide
          Vijay added a comment -

          +1, Thanks!

          Show
          Vijay added a comment - +1, Thanks!
          Hide
          Sylvain Lebresne added a comment -

          Committed, thanks

          Show
          Sylvain Lebresne added a comment - Committed, thanks
          Hide
          Jonathan Ellis added a comment -

          Reverted from 1.0.6 to avoid possible introduction of instability. (E.g., CASSANDRA-3566.) Left the WrappedRunnable improvement.

          Show
          Jonathan Ellis added a comment - Reverted from 1.0.6 to avoid possible introduction of instability. (E.g., CASSANDRA-3566 .) Left the WrappedRunnable improvement.
          Hide
          Robert Coli added a comment -

          To whom it may concern (people finding this ticket via CHANGES.txt/google and/or people who are wondering why their 1.2.x era nodetool help doesn't mention "stop") :

          While this code is in all versions of cassandra >=1.1.0, the "nodetool help" line referring to it disappears in 1.2.0 (CASSANDRA-2293) and re-appears in 1.2.4 (thx driftx!).

          Show
          Robert Coli added a comment - To whom it may concern (people finding this ticket via CHANGES.txt/google and/or people who are wondering why their 1.2.x era nodetool help doesn't mention "stop") : While this code is in all versions of cassandra >=1.1.0, the "nodetool help" line referring to it disappears in 1.2.0 ( CASSANDRA-2293 ) and re-appears in 1.2.4 (thx driftx!).

            People

            • Assignee:
              Vijay
              Reporter:
              Chip Salzenberg
              Reviewer:
              Sylvain Lebresne
            • Votes:
              2 Vote for this issue
              Watchers:
              10 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Time Tracking

                Estimated:
                Original Estimate - 24h
                24h
                Remaining:
                Remaining Estimate - 24h
                24h
                Logged:
                Time Spent - Not Specified
                Not Specified

                  Development