CASSANDRA-3112

Make repair fail when an unexpected error occurs

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Duplicate
    • Fix Version/s: None
    • Component/s: Core
    • Labels:

      Description

      CASSANDRA-2433 makes it so that nodetool repair will fail if a node participating in the repair dies before completing its part of the repair. This handles most of the situations where repair previously hung, but repair can still hang if an unexpected error occurs during either merkle tree creation (say, an on-disk corruption triggers an IOError) or streaming (though I'm not sure what could make streaming fail other than 'one of the nodes died', besides a bug).
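
      The fix being requested is essentially "propagate unexpected errors to the coordinator so repair fails instead of hanging". A minimal, self-contained sketch of that pattern in plain Java (none of these names are Cassandra's actual internals; validate() merely simulates a validation error):

          import java.util.concurrent.CompletableFuture;
          import java.util.concurrent.ExecutionException;

          public class FailFastRepairSketch {
              // stand-in for per-node validation work; hypothetical, not Cassandra code
              static String validate() {
                  throw new AssertionError("simulated on-disk corruption (IOError)");
              }

              public static void main(String[] args) throws Exception {
                  CompletableFuture<String> reply = new CompletableFuture<>();
                  new Thread(() -> {
                      try {
                          reply.complete(validate());
                      } catch (Throwable t) {
                          // the crux: report the unexpected error back; if it were
                          // swallowed here, reply.get() below would block forever
                          reply.completeExceptionally(t);
                      }
                  }).start();
                  try {
                      reply.get(); // the "coordinator" waits for the validation result
                  } catch (ExecutionException e) {
                      System.err.println("repair failed: " + e.getCause()); // fail fast
                  }
              }
          }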

        Issue Links

          Activity

          Sylvain Lebresne created issue -
          Sylvain Lebresne added a comment -

          Attaching the last two patches from an initial version of CASSANDRA-2433 that handled this. They will need to be rebased to the current code though.

          Sylvain Lebresne made changes -
          Field Original Value New Value
          Attachment 0003-Report-streaming-errors-back-to-repair-v4.patch [ 12492474 ]
          Attachment 0004-Reports-validation-compaction-errors-back-to-repair-v4.patch [ 12492475 ]
          Jonathan Ellis made changes -
          Fix Version/s 1.0.1 [ 12317948 ]
          Fix Version/s 1.1 [ 12317615 ]
          Jonathan Ellis made changes -
          Fix Version/s 1.0.2 [ 12318740 ]
          Fix Version/s 1.0.1 [ 12317948 ]
          Sylvain Lebresne made changes -
          Fix Version/s 1.0.3 [ 12318940 ]
          Fix Version/s 1.0.2 [ 12318740 ]
          Sylvain Lebresne made changes -
          Fix Version/s 1.0.4 [ 12319064 ]
          Fix Version/s 1.0.3 [ 12318940 ]
          Sylvain Lebresne made changes -
          Fix Version/s 1.0.5 [ 12319144 ]
          Fix Version/s 1.0.4 [ 12319064 ]
          Sylvain Lebresne made changes -
          Fix Version/s 1.0.6 [ 12319161 ]
          Fix Version/s 1.0.5 [ 12319144 ]
          Vijay added a comment - edited

          Hi Sylvain,

          I have seen the following issues with repairs, especially in AWS multi-DC deployments:
          1) The stream session or the stream makes no progress (read timeout/rpc timeout; a socket timeout might help).
          2) Validation compaction completes and the resulting tree is sent, but never received.
          3) A repair request is sent, but the receiving node never receives it.
          4) For a big repair that runs for hours, it would be better to retry the failed part rather than retry the whole thing.

          Do you think it is worth addressing these in a separate ticket? Otherwise I will close CASSANDRA-3487.

          Sylvain Lebresne added a comment -

          1) The stream session or the stream makes no progress (read timeout/rpc timeout; a socket timeout might help).

          But do you know the reason it makes no progress? Unless we know what can cause it, I'm not sure what to fix.

          2) Validation compaction completes and the resulting tree is sent, but never received.
          3) A repair request is sent, but the receiving node never receives it.

          How can we "lose" messages? Isn't TCP supposed to avoid this?

          4) For a big repair that runs for hours, it would be better to retry the failed part rather than retry the whole thing.

          Streaming is supposed to have some built-in retry, though I'm not sure there is a situation where it is actually useful. But if we are talking about having a repair fail because a node died and resuming it once the node is back up, that would be nice, but I'm pretty sure it would be mightily complicated. In particular, and to name only one difficulty, whether for the validation compaction or for the streaming itself, we would likely have a hard time making sure that sstables haven't been compacted between the initial try and the retry (or we risk hanging on to obsolete sstables forever). But in principle, that would be nice. Clearly not in the scope of this ticket in any case.

          Sylvain Lebresne made changes -
          Fix Version/s 1.0.7 [ 12319244 ]
          Fix Version/s 1.0.6 [ 12319161 ]
          Jonathan Ellis made changes -
          Fix Version/s 1.1 [ 12317615 ]
          Fix Version/s 1.0.7 [ 12319244 ]
          Vijay added a comment -

          "But do you know what is the reason for it making no progress? Because unless we know what can cause it, not sure what to fix?"
          it is usually is in the Streaming phase, i think adding a SoTimeout might fix it... but it is so random i couldn't reproduce in my tests but definitely seeing it in production.

          "How can we "lose" messages, aren't tcp supposed to avoid this?"
          Once you send the message the other node might get restarted (without validation or starting any thing) or the sockets can get reset, Actually i think when i posted this message it was because of CASSANDRA-3577. There isnt something like hints or a retry on the messages sent for the repairs.

          I understand this isnt the scope of this ticket, but i still think there should be a way to orchestrate repairs with a little complicated logic and i will try to do some parts of it in the other ticket.
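
          For reference, a minimal, self-contained illustration of the SoTimeout idea (plain java.net, not Cassandra's actual streaming code; the peer address and timeout are placeholders): without setSoTimeout, a blocking read on a connection that has silently died can stall forever, whereas with it the stall surfaces as a SocketTimeoutException that the stream session could report as a failure.

              import java.io.InputStream;
              import java.net.Socket;
              import java.net.SocketTimeoutException;

              public class StreamReadTimeoutSketch {
                  public static void main(String[] args) throws Exception {
                      try (Socket socket = new Socket("peer.example.org", 7000)) { // placeholder peer
                          socket.setSoTimeout(60000); // blocking reads now fail after 60s without data
                          InputStream in = socket.getInputStream();
                          try {
                              in.read(); // without SO_TIMEOUT this can block forever on a dead peer
                          } catch (SocketTimeoutException e) {
                              // the stall is now an explicit error the session can act on
                              System.err.println("no progress on stream: " + e);
                          }
                      }
                  }
              }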

          Sylvain Lebresne made changes -
          Fix Version/s 1.1.1 [ 12319857 ]
          Fix Version/s 1.1 [ 12317615 ]
          Jonathan Ellis made changes -
          Fix Version/s 1.1.2 [ 12321445 ]
          Fix Version/s 1.1.1 [ 12319857 ]
          Sylvain Lebresne made changes -
          Fix Version/s 1.1.3 [ 12321881 ]
          Fix Version/s 1.1.2 [ 12321445 ]
          Jonathan Ellis made changes -
          Fix Version/s 1.2 [ 12319262 ]
          Fix Version/s 1.1.3 [ 12321881 ]
          Jonathan Ellis made changes -
          Fix Version/s 1.3 [ 12322954 ]
          Fix Version/s 1.2.0 [ 12319262 ]
          Gavin made changes -
          Workflow no-reopen-closed, patch-avail [ 12630856 ] patch-available, re-open possible [ 12753112 ]
          Gavin made changes -
          Workflow patch-available, re-open possible [ 12753112 ] reopen-resolved, no closed status, patch-avail, testing [ 12755764 ]
          Jason Wee added a comment - edited

          In StreamOutSession.java (Cassandra version 1.0.8), the logger call in the convict(...) method has two placeholders but only one argument supplied; is a second variable missing from the call?

          logger.error("StreamOutSession {} failed because {} died or was restarted/removed", endpoint);
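
          For reference, SLF4J fills each {} placeholder with one argument in order, so the call above leaves its second placeholder unsubstituted in the output. A fixed call would pass one value per placeholder, for example (getSessionId() is assumed here as a stand-in for however the session exposes its id):

              logger.error("StreamOutSession {} failed because {} died or was restarted/removed",
                           getSessionId(), endpoint);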

          Yuki Morishita made changes -
          Link This issue is related to CASSANDRA-5426 [ CASSANDRA-5426 ]
          Jonathan Ellis added a comment -

          What is the scope of this ticket? Should it be wontfixed or moved to 2.1?

          Yuki Morishita added a comment -

          I'm working on CASSANDRA-5426 and this may be a duplicate of it. CASSANDRA-5426 is targeted for the 2.0.0 release.

          Jonathan Ellis made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Assignee Sylvain Lebresne [ slebresne ]
          Fix Version/s 2.0 [ 12322954 ]
          Resolution Duplicate [ 3 ]

            People

            • Assignee: Unassigned
            • Reporter: Sylvain Lebresne
            • Votes: 1
            • Watchers: 4
