Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Fix Version/s: 2.1.x
    • Component/s: Tools
    • Labels:
    • Environment: JVM

      Description

      After CASSANDRA-1740, if the validation compaction is stopped, the repair will hang. This ticket will allow users to kill the original repair.

        Issue Links

          Activity

          Sylvain Lebresne added a comment -

          Moving to 1.1 because this will almost certainly require a wire protocol change. Also, CASSANDRA-3112 requires a very similar change (basically validation needs to be able to report an error back to the repair), so it's worth making room for both of those changes together.

          Radim Kolar added a comment -

          It will be useful. Sometimes validation eats too much disk bandwidth, making compaction too slow.

          Vijay added a comment -

          Attached is a simple patch to expose SS.forceTerminateAllRepairSessions in NodeTool.

          #nt stop repairsessions

          will stop the repair sessions on the node that originated the repair (cleanup).
          NOTE: The user might need to run "#nt stop Validation" on all the nodes involved in the repair.
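
          For reference, a rough sketch of invoking the same JMX operation directly (assuming the default JMX port 7199 on localhost and no authentication; the class and variable names here are illustrative, not part of the attached patch):

          import javax.management.MBeanServerConnection;
          import javax.management.ObjectName;
          import javax.management.remote.JMXConnector;
          import javax.management.remote.JMXConnectorFactory;
          import javax.management.remote.JMXServiceURL;

          public class StopRepairSessions
          {
              public static void main(String[] args) throws Exception
              {
                  // default Cassandra JMX port is 7199; host/port here are placeholders
                  JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
                  JMXConnector connector = JMXConnectorFactory.connect(url);
                  try
                  {
                      MBeanServerConnection connection = connector.getMBeanServerConnection();
                      ObjectName storageService = new ObjectName("org.apache.cassandra.db:type=StorageService");
                      // terminate the repair sessions coordinated by this node, as the nodetool command above does
                      connection.invoke(storageService, "forceTerminateAllRepairSessions", new Object[0], new String[0]);
                  }
                  finally
                  {
                      connector.close();
                  }
              }
          }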

          Sylvain Lebresne added a comment -

          As I said back on CASSANDRA-3316, I see forceTerminateAllRepairSessions more as a band-aid solution to avoid having someone stuck until we better handle repair failures and have a correct way to stop them. But it isn't really a very user-friendly solution, since it doesn't properly stop the repair (it won't stop the validation compaction (though we can do that by other means now), nor any streaming that would be running) and you have to run it everywhere. For all those reasons, my opinion is that we should keep this JMX-only; I see no good reason to promote it to nodetool. That's just an opinion though.

          Vijay added a comment -

          Sounds good to me. Let me know if you want me to clear the streaming sessions that were associated with the repair when running SS.forceTerminateAllRepairSessions; I can do that (as part of this ticket or a separate one)...
          If the current state is good enough, I can close this ticket, as it is already exposed for advanced users (I think the streaming sessions are mostly harmless for most users).

          Sylvain Lebresne added a comment -

          I don't think we should close this ticket, because I don't think we have a satisfying way to stop repair. A satisfying way would be to be able to run 'nodetool repair stop <some_repair_session_id>' on the host the repair was started on and have it cleanly stop everything related to that repair (including any validation or streaming going on) on every participating node. That requires a bit more work, however, and let me note that I don't see this ticket as a priority at all.

          Robert Coli added a comment -

          Regarding the priority of this ticket, operators frequently report hung repair streaming sessions on #cassandra/cassandra-user@. Currently the only thing we can tell them is to restart all affected nodes. This presumably gives them a bad impression of Cassandra. First, because the repair (which they have to run once every GCGraceSeconds per best practice) hangs with no useful messaging. Second, because the only solution is to restart multiple nodes. It's a little surprising that this ticket suggests this negative user experience is uncommon enough not to expose some version of this functionality via nodetool. Two people's clusters have been in this state in #cassandra so far today, and it's only 2pm...

          Jeremy Hanna added a comment -

          See also CASSANDRA-5426

          Bill Hathaway added a comment -

          I hit this today on 1.1.10.
          Node X was running a repair that was hung. It reported several SSTables it was streaming from node Y; node Y reported it was not streaming anything to node X.
          It looks like our only solution is to bounce the node, which is frustrating. A 'nodetool stop repair' would have been very helpful in my scenario.

          Justen Walker added a comment -

          I also hit this today on 1.1.11, the same problem Bill describes.

          Nate McCall added a comment -

          I disagree with the prioritization of this ticket as minor. I've seen this on three separate clusters (all 1.2.x >= 1.2.8) in the past month alone. It is a difficult and time-consuming problem to have to work around.

          Jason Brown added a comment -

          Yuki Morishita: since I'm knee-deep in CASSANDRA-6503, I'll knock this out at the same time.

          Robert Coli added a comment -

          Jason Brown: I see that CASSANDRA-6503 is complete, but this ticket is unresolved. Do you still plan to address this issue soon?

          Jason Brown added a comment -

          Robert Coli: hmm, I totally forgot about this ticket - I can give it a shot over the next week.

          David Huang added a comment -

          Is there any update on this?

          Jason Brown added a comment -

          First draft is about 90% complete.

          David Huang added a comment -

          Thanks.

          Brian Flad added a comment -

          Any updates? Much appreciated.

          Jason Brown added a comment -

          It's been a while since I looked at this, but IIRC the problem was determining whether we should kill all ongoing repairs in a cluster or expose a way for a user to terminate a single one. The trick was 1) how to expose repair session ids to users, globally, and 2) how to allow users to cancel an individual repair session from any node in the cluster.

          Carlos Diaz added a comment -

          +1 for this fix. I run into this issue quite regularly and have no good solution other than restarting the Cassandra nodes one by one. However, this causes a significant performance impact when I do it.

          Paulo Motta added a comment -

          Jason Brown do you still plan to work on this? There is a lot of overlap between this and CASSANDRA-11190, which I'll start working on soon, so I wanted to check whether we could still reuse anything from your first draft or whether it would be better to start from scratch.

          As to your previous questions, I think we can provide a way for users to stop/abort an ongoing local repair session given its session id. So, if session <id> is happening on nodes A, B and C, I can execute nodetool repair --abort <id> on nodes A, B or C to abort session <id>. If I execute nodetool repair --abort <id> on node D, it will print something like No ongoing repair session found with id <id>. For this to be useful, we should also provide a nodetool repair --list command to list ongoing repair sessions on the local node, and also nodetool repair --abort-all to abort all ongoing sessions on the local node.

          Jason Brown added a comment -

          Paulo Motta: It's all yours. My first draft of this is so out of date that it's probably not worth the archaeological effort. And, yes, those were the sort of nodetool commands I thought we'd need.

          IIRC, the tricky thing two years ago when I worked on this is that you have to pass around the parent repair session id (which I don't think we did before) in order to group all the child repairs created due to vnodes (so they can all be killed too). I haven't been in the repair code in a while, so it might be different now.

          Daniel Sand added a comment -

          Paulo Motta +1 for the feature - we really needed this when our cluster blew up this week.

          Paulo Motta added a comment -

          Attaching a preliminary patch in case anyone wants to have a look or give feedback before the review-ready version.

          Current state

          • Add nodetool repair --list to list ongoing repair jobs (parent repair sessions) on the local node
          • Add nodetool repair --abort <jobId> and nodetool repair --abort-all to abort a specific or all jobs
          • Any participant can abort the repair job:
            • When a participant receives an abort request, it sends an abort message to the coordinator and aborts its local tasks
            • When a coordinator receives an abort message or abort request, it sends an abort message to all participants and aborts its local tasks, failing the repair job
          • Add abort support to StreamResultFuture and StreamSession
          • Refactor ActiveRepairService and RepairMessageVerbHandler
          • Add dtests to abort repair on coordinator and participants on different phases (validation, sync, anticompaction)
          • Fix races and leaks found during dtests

          Limitations and next steps

          While compactions have abort/stop support via CompactionManager.stopCompactionById,
          we cannot guarantee the compaction will be aborted during a repair abort, because its abort handler (Holder) is only registered during iteration via the CompactionIterator. If we stop the compaction before that, the task is not aborted and will execute even if its parent repair session was aborted. Furthermore, an anti-compaction is split into multiple subcompactions, so this method only stops the currently running subcompaction.

          In order to overcome this, I abort the compaction task Future directly, which causes the task thread to be interrupted; I then check Thread.currentThread().isInterrupted() during iteration and throw a CompactionInterruptedException if it is set, causing the compaction to be aborted (by brute force).
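
          A stripped-down sketch of that check (the class and method names are made up for illustration; only the interrupt test mirrors the approach described, and a plain RuntimeException stands in for CompactionInterruptedException):

          import java.util.Iterator;

          public class InterruptAwareIteration
          {
              // Consume partitions, bailing out if the compaction thread was interrupted
              // (e.g. because the repair abort cancelled the task Future with mayInterruptIfRunning=true).
              public static <T> void consume(Iterator<T> partitions)
              {
                  while (partitions.hasNext())
                  {
                      if (Thread.currentThread().isInterrupted())
                          // the actual patch throws CompactionInterruptedException carrying the compaction info
                          throw new RuntimeException("compaction interrupted, aborting");
                      write(partitions.next());
                  }
              }

              private static <T> void write(T partition)
              {
                  // writing the partition to the output sstable / merkle tree is omitted in this sketch
              }
          }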

          However, this is not very safe: it can generate a ClosedByInterruptException if we're blocked on an I/O operation, and we currently treat any IOException as a corrupt sstable. Furthermore, an interrupted thread is not able to abort the transaction when it gets a CompactionInterruptedException. To solve this we could special-case interruptions in many places (readers, transaction aborting, etc.), but even that wouldn't guarantee we're safe, so this is probably a bad smell.

          A cleaner option, which I will implement in the next iteration, is to associate a CompactionHolder with a ListenableFuture as soon as the anti-compaction or validation is submitted, so we can abort it safely without interrupting the compaction thread.
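
          Roughly, that could look like the following sketch, which relies on a shared flag rather than thread interruption (all names here are illustrative, not the actual CompactionHolder wiring in the patch):

          import com.google.common.util.concurrent.ListenableFuture;
          import com.google.common.util.concurrent.ListeningExecutorService;
          import com.google.common.util.concurrent.MoreExecutors;

          import java.util.Map;
          import java.util.UUID;
          import java.util.concurrent.ConcurrentHashMap;
          import java.util.concurrent.Executors;
          import java.util.concurrent.atomic.AtomicBoolean;
          import java.util.function.Consumer;

          public class AbortableRepairCompactions
          {
              private static final ListeningExecutorService executor =
                      MoreExecutors.listeningDecorator(Executors.newSingleThreadExecutor());

              // parent repair session id -> abort flag shared with the running task
              private static final Map<UUID, AtomicBoolean> abortFlags = new ConcurrentHashMap<>();

              // Submit an anti-compaction/validation task; the task is expected to poll
              // the flag between partitions and stop cleanly once it flips to true.
              public static ListenableFuture<?> submit(UUID parentSessionId, Consumer<AtomicBoolean> task)
              {
                  AtomicBoolean abort = new AtomicBoolean(false);
                  abortFlags.put(parentSessionId, abort);
                  ListenableFuture<?> future = executor.submit(() -> task.accept(abort));
                  // drop the flag once the task finishes, whatever the outcome
                  future.addListener(() -> abortFlags.remove(parentSessionId), MoreExecutors.directExecutor());
                  return future;
              }

              // Cooperative stop: no Thread.interrupt(), so no ClosedByInterruptException on blocked I/O.
              public static void abort(UUID parentSessionId)
              {
                  AtomicBoolean abort = abortFlags.get(parentSessionId);
                  if (abort != null)
                      abort.set(true);
              }
          }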

          Nick Bailey added a comment -

          Some questions:

          • If the abort is initiated on the coordinator can we return the success/failure of the attempt to abort on the participants as well? And vice versa?
          • Similarly for the list of results when aborting all jobs.
          • Can we make sure we are testing the case where for whatever reason a coordinator or participant receives an abort for a repair it doesn't know about?
          • Since we are now tracking repairs by uuid like this, can we expose a progress API outside of the jmx notification process? An mbean for retrieving the progress/status of a repair job by uuid?
          Paulo Motta added a comment -

          Thanks for the feedback Nick Bailey. See follow-up below:

          If the abort is initiated on the coordinator can we return the success/failure of the attempt to abort on the participants as well? And vice versa? Similarly for the list of results when aborting all jobs.

          We could, but in this initial implementation I opted to take an optimistic approach to keep the protocol simple and non-blocking. If for some reason there is a network partition and "orphaned" sessions keep running, you can always abort them individually later. Do you think a blocking + timeout approach would be preferable?

          Can we make sure we are testing the case where for whatever reason a coordinator or participant receives an abort for a repair it doesn't know about?

          Sure. One of the changes in this patch that I forgot to mention is that all messages are validated against the repair session UUID, so if a node receives a message from a repair it doesn't know about, it logs and ignores it.
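
          A minimal sketch of that kind of check (the names are illustrative; in the patch the validation lives in the repair message handling path):

          import java.util.Set;
          import java.util.UUID;
          import java.util.concurrent.ConcurrentHashMap;

          public class RepairMessageValidator
          {
              // parent repair sessions this node registered when it received the prepare message
              private final Set<UUID> knownSessions = ConcurrentHashMap.newKeySet();

              public void register(UUID parentSessionId)
              {
                  knownSessions.add(parentSessionId);
              }

              // Returns true if the message belongs to a known session; unknown sessions are logged and dropped.
              public boolean validate(UUID parentSessionId)
              {
                  if (!knownSessions.contains(parentSessionId))
                  {
                      System.err.println("Ignoring message for unknown repair session " + parentSessionId);
                      return false;
                  }
                  return true;
              }
          }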

          Since we are now tracking repairs by uuid like this, can we expose a progress API outside of the jmx notification process? An mbean for retrieving the progress/status of a repair job by uuid?

          We could, but we currently don't keep state or progress information in the repair session. Furthermore, we clear repair session information as soon as it's finished, so the list-repairs stub only lists currently active repairs. So we would need to maintain progress status and provide some way to clear repair information after some time.

          I personally think we should go down this route of making repair more stateful, which will not only improve monitoring but will also allow us to break up a repair job into more decoupled subtasks, simplifying the single chain of futures we have today, which can be quite complex to understand and error-prone.

          Nick Bailey added a comment -

          Do you think a blocking + timeout approach would be preferable?

          Maybe. My goal in asking would be to know whether the repair needs to be canceled on other nodes or not. Right now you need to either just run the abort on all nodes from the start, or run it on the coordinator and then check the participants to double-check that it succeeded there as well.

          I personally think we should go this route of making repair more stateful

          I agree, especially with the upcoming coordinated repairs in C*

          Stefan Podkowinski added a comment - edited

          I've now looked at this issue as a potential use case for CASSANDRA-12016 and added a test on top of it. The branch can be found at WIP-3486, and the tests I'm talking about are in ActiveRepairServiceMessagingTest.java and RepairRunnableTest.java.

          My goal was to make coordination between different nodes in repair scenarios easier to test. The basic cases covered so far are pretty simple, but I'd like to add more edge cases in the future. Nonetheless, I wanted to share this early on in case Paulo Motta and others have any feedback on how this approach could help make progress on this issue.

          Paulo Motta added a comment -

          Stefan Podkowinski well done! I think this is going to be useful for testing this ticket, especially when we add acks to abort, since we will need to deal with timeouts, error handling, etc., which is generally a pain to do via dtests. Thanks!


            People

            • Assignee:
              Paulo Motta
              Reporter:
              Vijay
            • Votes:
              24
              Watchers:
              44
