Cassandra
  1. Cassandra
  2. CASSANDRA-5009

RangeStreamer has no way to report failures, allowing bootstrap/move etc to complete without data

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Critical Critical
    • Resolution: Fixed
    • Fix Version/s: 1.1.8, 1.2.0 beta 3
    • Component/s: Core
    • Labels:
      None

      Description

      It looks like we fixed this for 1.2 by having RS.fetch() throw, but in 1.1 it does not and there doesn't appear to be a way to detect an RS failure, which among other things will cause bootstrap to succeed even though it failed to fetch any data.

      1. 5009-v2.txt
        4 kB
        Brandon Williams
      2. 5009.txt
        4 kB
        Brandon Williams

        Issue Links

          Activity

          Brandon Williams created issue -
          Hide
          Brandon Williams added a comment -

          Actually we haven't completely fixed this in 1.2, because we rely on the FD notification to fail but use phi * 2 to avoid a premature costly repair abortion, but the streaming can fail out before that and thus fetch() will never raise an exception.

          Show
          Brandon Williams added a comment - Actually we haven't completely fixed this in 1.2, because we rely on the FD notification to fail but use phi * 2 to avoid a premature costly repair abortion, but the streaming can fail out before that and thus fetch() will never raise an exception.
          Hide
          Brandon Williams added a comment -

          Patch to maintain a set of completed ranges, then check at the end that every range was successful or raise an exception.

          This patch is against 1.1, but I'll note that when merged RangeStreamer should look exactly the same. That is to say, all the FD business will be removed, as I can't figure out how it makes any sense to check the FD at all since we're going to block on the streaming latch first. If any ranges (not necessarily streams) failed, then it's time to bail regardless of the FD, and if the FD does convict before that it won't break us out of the latch wait.

          Show
          Brandon Williams added a comment - Patch to maintain a set of completed ranges, then check at the end that every range was successful or raise an exception. This patch is against 1.1, but I'll note that when merged RangeStreamer should look exactly the same. That is to say, all the FD business will be removed, as I can't figure out how it makes any sense to check the FD at all since we're going to block on the streaming latch first. If any ranges (not necessarily streams) failed, then it's time to bail regardless of the FD, and if the FD does convict before that it won't break us out of the latch wait.
          Brandon Williams made changes -
          Field Original Value New Value
          Attachment 5009.txt [ 12555577 ]
          Brandon Williams made changes -
          Reviewer yukim
          Brandon Williams made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Brandon Williams made changes -
          Link This issue is duplicated by CASSANDRA-4651 [ CASSANDRA-4651 ]
          Hide
          Yuki Morishita added a comment -

          two comments on the patch:

          • looks like exceptionMessage is not used anywhere
          • I think completed needs to be synchronized and completed.addAll(ranges) should be placed before latch.countDown(), otherwise we have chance to get incomplete completed set.
          Show
          Yuki Morishita added a comment - two comments on the patch: looks like exceptionMessage is not used anywhere I think completed needs to be synchronized and completed.addAll(ranges) should be placed before latch.countDown() , otherwise we have chance to get incomplete completed set.
          Hide
          Brandon Williams added a comment - - edited

          Oops, exceptionMessage was a vestige from my initial attempt at backporting the 1.2 version before I discovered it was flawed too.

          Updated patch removes that, moves the addAll call before the latch countdown, and initiates the set with a concurrent hash map.

          Show
          Brandon Williams added a comment - - edited Oops, exceptionMessage was a vestige from my initial attempt at backporting the 1.2 version before I discovered it was flawed too. Updated patch removes that, moves the addAll call before the latch countdown, and initiates the set with a concurrent hash map.
          Brandon Williams made changes -
          Attachment 5009-v2.txt [ 12555808 ]
          Hide
          Yuki Morishita added a comment -

          +1

          Show
          Yuki Morishita added a comment - +1
          Hide
          Brandon Williams added a comment -

          Committed.

          Show
          Brandon Williams added a comment - Committed.
          Brandon Williams made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Fix Version/s 1.2.0 rc1 [ 12323481 ]
          Resolution Fixed [ 1 ]
          Sylvain Lebresne made changes -
          Fix Version/s 1.2.0 beta 3 [ 12323603 ]
          Fix Version/s 1.2.0 rc1 [ 12323481 ]
          Fix Version/s 1.1.8 [ 12323574 ]
          Brandon Williams made changes -
          Fix Version/s 1.1.8 [ 12323574 ]
          Gavin made changes -
          Workflow no-reopen-closed, patch-avail [ 12736470 ] patch-available, re-open possible [ 12752089 ]
          Gavin made changes -
          Workflow patch-available, re-open possible [ 12752089 ] reopen-resolved, no closed status, patch-avail, testing [ 12755181 ]

            People

            • Assignee:
              Brandon Williams
              Reporter:
              Brandon Williams
              Reviewer:
              Yuki Morishita
            • Votes:
              1 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development