Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Fix Version/s: 2.1.1
    • Component/s: Tools
    • Labels:
    • Environment: JVM

Description

    After CASSANDRA-1740, if the validation compaction is stopped, the repair will hang. This ticket will allow users to kill the original repair.


Activity

          David Huang added a comment -

          Thanks.

          Jason Brown added a comment -

          First draft is about 90% complete.

          David Huang added a comment -

          Is there any update on this?

          Jason Brown added a comment -

          Robert Coli: hmm, totally forgot about this ticket - I can give it a shot over the next week.

          Robert Coli added a comment -

          Jason Brown: I see that CASSANDRA-6503 is complete, but this ticket is unresolved. Do you still plan to address this issue soon?

          Jason Brown added a comment -

          Yuki Morishita: since I'm knee-deep in CASSANDRA-6503, I'll knock this out at the same time.

          Nate McCall added a comment -

          I disagree with the prioritization of this ticket as minor. I've seen this on three separate clusters (all 1.2.x >= 1.2.8) in the past month alone. It is a difficult and time-consuming problem to have to work around.

          Justen Walker added a comment -

          I also hit this today on 1.1.11 - same problem as Bill describes.

          Bill Hathaway added a comment -

          I hit this today on 1.1.10.
          Node X was running a repair that was hung. It reported several SSTables it was streaming from node Y, while node Y reported it was not streaming anything to node X.
          It looks like our only solution is to bounce the node, which is frustrating. A 'nodetool stop repair' would have been very helpful in my scenario.

          Jeremy Hanna added a comment -

          See also CASSANDRA-5426

          Robert Coli added a comment -

          Regarding the priority of this ticket, operators frequently report hung repair streaming sessions on #cassandra/cassandra-user@. Currently the only thing we can tell them is to restart all affected nodes. This presumably gives them a bad impression of Cassandra. First, because the repair (which they have to run once every GCGraceSeconds per best practice) hangs with no useful messaging. Second, because the only solution is to restart multiple nodes. It's a little bit surprising that this ticket suggests that this negative user experience is uncommon enough not to expose some version of this functionality via nodetool. Two people's clusters have been in this state in #cassandra so far today, and it's only 2pm...

          Sylvain Lebresne added a comment -

          I don't think we should close this ticket, because I don't think we have a satisfying way to stop repair. A satisfying way to stop repair would be to be able to run 'nodetool repair stop <some_repair_session_id>' on the host the repair was started on, and have it cleanly stop everything related to that repair (including any validation or streaming going on), and this on every participating node. That requires a bit more work, however, and let me note that I don't see this ticket as a priority at all.

          Vijay added a comment -

          Sounds good to me. Let me know if you want me to clear the streaming sessions that were associated with the repair while it was running (SS.forceTerminateAllRepairSessions); I can do that (as a part of this ticket or a separate one)...
          If the current state is good enough, I can close this ticket, as it is already exposed for advanced users (I think the streaming sessions are mostly harmless for most users).

          Sylvain Lebresne added a comment -

          As I said back on CASSANDRA-3316, I see forceTerminateAllRepairSessions more as a band-aid solution to avoid having someone stuck until we better handle repair failures and have a correct way to stop it. But it isn't really a very user-friendly solution, since it doesn't properly stop the repair (it won't stop the validation compaction (though we can do that by other means now), nor any streaming that would be running) and you have to run it everywhere. For all those reasons, my opinion is that we should keep this JMX only; I see no good reason to promote it to nodetool. That's just an opinion though.
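
          For reference, a minimal sketch of invoking this JMX-only operation directly, assuming the StorageService MBean name org.apache.cassandra.db:type=StorageService, the default JMX port 7199, and no JMX authentication or SSL:

          import javax.management.MBeanServerConnection;
          import javax.management.ObjectName;
          import javax.management.remote.JMXConnector;
          import javax.management.remote.JMXConnectorFactory;
          import javax.management.remote.JMXServiceURL;

          public class TerminateRepairSessions
          {
              public static void main(String[] args) throws Exception
              {
                  // Node on which the hung repair was started; 7199 is Cassandra's default JMX port.
                  String host = args.length > 0 ? args[0] : "localhost";
                  JMXServiceURL url = new JMXServiceURL(
                          "service:jmx:rmi:///jndi/rmi://" + host + ":7199/jmxrmi");

                  JMXConnector connector = JMXConnectorFactory.connect(url);
                  try
                  {
                      MBeanServerConnection mbs = connector.getMBeanServerConnection();
                      ObjectName storageService =
                              new ObjectName("org.apache.cassandra.db:type=StorageService");
                      // Terminates the repair sessions tracked on this node only; validation
                      // compactions and streams already running on other nodes are not touched.
                      mbs.invoke(storageService, "forceTerminateAllRepairSessions",
                                 new Object[0], new String[0]);
                  }
                  finally
                  {
                      connector.close();
                  }
              }
          }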

          Vijay added a comment -

          Attached is a simple patch to expose SS.forceTerminateAllRepairSessions in NodeTool.

          #nt stop repairsessions

          will stop the repair sessions on the node the repair originated from (cleanup).
          NOTE: The user might need to run "#nt stop Validation" on all the nodes involved in the repair.
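
          A hedged sketch of the follow-up step from the NOTE above: calling stopCompaction on the CompactionManager MBean (the operation behind "nodetool stop") on every node taking part in the repair. The hostnames are placeholders, and the default JMX port 7199 with no authentication is assumed:

          import javax.management.MBeanServerConnection;
          import javax.management.ObjectName;
          import javax.management.remote.JMXConnector;
          import javax.management.remote.JMXConnectorFactory;
          import javax.management.remote.JMXServiceURL;

          public class StopValidationEverywhere
          {
              public static void main(String[] args) throws Exception
              {
                  // Placeholder hostnames; replace with the nodes involved in the repair.
                  String[] hosts = { "node1", "node2", "node3" };
                  for (String host : hosts)
                  {
                      JMXServiceURL url = new JMXServiceURL(
                              "service:jmx:rmi:///jndi/rmi://" + host + ":7199/jmxrmi");
                      JMXConnector connector = JMXConnectorFactory.connect(url);
                      try
                      {
                          MBeanServerConnection mbs = connector.getMBeanServerConnection();
                          ObjectName compactionManager =
                                  new ObjectName("org.apache.cassandra.db:type=CompactionManager");
                          // Equivalent of running "nodetool stop VALIDATION" on this node;
                          // the argument is the compaction OperationType name.
                          mbs.invoke(compactionManager, "stopCompaction",
                                     new Object[]{ "VALIDATION" },
                                     new String[]{ "java.lang.String" });
                      }
                      finally
                      {
                          connector.close();
                      }
                  }
              }
          }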

          Radim Kolar added a comment -

          It will be useful. Sometimes validation eats too much disk bandwidth, making compaction too slow.

          Sylvain Lebresne added a comment -

          Moving to 1.1 because this will almost certainly require a wire protocol change. Also, CASSANDRA-3112 requires a very similar change (basically validation needs to be able to report an error back to the repair), so it's worth making room for both of those changes together.


People

    • Assignee: Jason Brown
    • Reporter: Vijay
    • Votes: 12
    • Watchers: 14
