Cassandra / CASSANDRA-2643

read repair/reconciliation breaks slice based iteration at QUORUM

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Fix Version/s: 1.0.0
    • Component/s: None
    • Labels: None

      Description

      In short, I believe iterating over columns is impossible to do reliably with QUORUM due to the way reconciliation works.

      The problem is that the SliceQueryFilter is executing locally when reading on a node, but no attempts seem to be made to consider limits when doing reconciliation and/or read-repair (RowRepairResolver.resolveSuperset() and ColumnFamily.resolve()).

      If a node slices and comes up with 100 columns, and another node slices and comes up with 100 columns, some of which are unique to each side, reconciliation results in > 100 columns in the result set. In this case the effect is limited to "client gets more than asked for", but the columns still accurately represent the range. This is easily triggered by my test-case.

      In addition to the client receiving "too many" columns, I believe some of them will not be satisfying the QUORUM consistency level for the same reasons as with deletions (see discussion below).
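      The over-delivery can be made concrete with a toy merge. This is a hypothetical sketch, not Cassandra's actual resolver: each replica honours `count` locally, but a name-keyed merge of two slices holding columns unique to each side grows past `count`.

```java
import java.util.TreeMap;

public class SliceMergeSketch {
    // Toy model: column name -> timestamp, newer timestamp wins on collision.
    // Worst case (fully disjoint slices) doubles the requested count.
    static int mergedSize(int count) {
        TreeMap<Integer, Long> a = new TreeMap<>(), b = new TreeMap<>();
        for (int i = 0; i < count; i++) {
            a.put(2 * i, 1L);     // replica A: even names only
            b.put(2 * i + 1, 1L); // replica B: odd names only
        }
        // merge as reconciliation would: union by name, newest wins
        TreeMap<Integer, Long> merged = new TreeMap<>(a);
        b.forEach((name, ts) -> merged.merge(name, ts, Math::max));
        return merged.size();
    }

    public static void main(String[] args) {
        // each replica returned exactly 100 columns, the client gets 200
        System.out.println(mergedSize(100)); // 200
    }
}
```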

      Now, there should be a problem for tombstones as well, but it's more subtle. Suppose A has:

      1
      2
      3
      4
      5
      6

      and B has:

      1
      del 2
      del 3
      del 4
      5
      6

      If you now slice 1-6 with count=3, the tombstones from B will reconcile against the columns from A - fine. So you end up getting 1, 5 and 6 back. This made it a bit difficult to trigger in a test case until I realized what was going on. At first I was "hoping" to see a "short" iteration result, which would mean that the process of iterating until you get a short result causes spurious "end of columns" and thus makes it impossible to iterate correctly.

      So; due to 5-6 existing (and if they didn't, you legitimately reached end-of-columns) we do indeed get a result of size 3 which contains 1,5 and 6. However, only node B would have contributed columns 5 and 6; so there is actually no QUORUM consistency on the co-ordinating node with respect to these columns. If node A and C also had 5 and 6, they would not have been considered.
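      The A/B merge above can be sketched as a toy (assumed semantics only, nothing here is Cassandra's actual reconcile code: per name, the newer timestamp wins, and a winning tombstone suppresses the column):

```java
import java.util.*;

public class TombstoneReconcileSketch {
    static List<Integer> reconcileLive() {
        Map<Integer, long[]> merged = new TreeMap<>(); // name -> {timestamp, isTombstone}
        for (int name = 1; name <= 6; name++)
            merged.put(name, new long[]{1L, 0L});      // A: columns 1..6, live, ts=1
        for (int name = 2; name <= 4; name++)
            merged.put(name, new long[]{2L, 1L});      // B: del 2..4 at ts=2 (newer, wins)
        List<Integer> live = new ArrayList<>();
        for (Map.Entry<Integer, long[]> e : merged.entrySet())
            if (e.getValue()[1] == 0L) live.add(e.getKey());
        return live;
    }

    public static void main(String[] args) {
        // count=3 is satisfied by 1, 5, 6: no short read, hence the subtlety
        System.out.println(reconcileLive()); // [1, 5, 6]
    }
}
```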

      Am I wrong?

      In any case; using the script I'm about to attach, you can trigger the over-delivery case very easily:

      (0) disable hinted hand-off to avoid that interacting with the test
      (1) start three nodes
      (2) create ks 'test' with rf=3 and cf 'slicetest'
      (3) ./slicetest.py hostname_of_node_C insert # let it run for a few seconds, then ctrl-c
      (4) stop node A
      (5) ./slicetest.py hostname_of_node_C insert # let it run for a few seconds, then ctrl-c
      (6) start node A, wait for B and C to consider it up
      (7) ./slicetest.py hostname_of_node_A slice # make A co-ordinator though it doesn't necessarily matter

      You can also pass 'delete' (random deletion of 50% of contents) or 'deleterange' (delete all in [0.2,0.8]) to slicetest, but you don't trigger a short read by doing that (see discussion above).

      1. ASF.LICENSE.NOT.GRANTED--short_read_0.8.sh
        2 kB
        Byron Clark
      2. CASSANDRA-2643-poc.patch
        17 kB
        Byron Clark
      3. CASSANDRA-2643-v2.patch
        17 kB
        Byron Clark
      4. CASSANDRA-2643-v3.patch
        18 kB
        Byron Clark
      5. reliable_short_read_0.8.sh
        2 kB
        Byron Clark
      6. short_read.sh
        1 kB
        Sylvain Lebresne
      7. slicetest.py
        3 kB
        Peter Schuller

        Activity

        Peter Schuller created issue -
        Peter Schuller made changes -
        Field Original Value New Value
        Attachment slicetest.py [ 12478987 ]
        Peter Schuller added a comment -

        Unless I'm off, the problem with deleted columns is fairly similar to that of deleted rows. Like with range ghosts, an option would be to propagate tombstones to clients.

        Another option is to make read responses include the range for which they are authoritative, and then only consider the intersection of all responses' authoritative ranges when reconciling results. The authoritative range of the response would have to be communicated back to the client, such that it can continue iterating from the appropriate column name even without actually receiving a column for that name.

        Other than failing requests with a new kind of error, I don't see a good way to fix the tombstone case (the over-shoot case can be fixed by just capping results) without changing the protocol. For obvious reasons we don't want the co-ordinating node to go into a potentially unbounded re-try spin until sufficient results are obtained from all nodes participating.

        FWIW, returning iteration state feels pretty clean to me (it's what our layer implements on top to enable easier iteration). It is also compatible with future improvements, e.g. capping the size of a response by bytes for safely slicing over columns whose size you do not know. By removing the strict requirement to deliver exactly the number of asked-for columns lest the result be interpreted as "out of columns", significant flexibility is gained. So if the only option for a clean fix is truly to change the protocol, at least other positive benefits may be had.
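        The iteration-state idea can be sketched as follows. All names here (Page, resumeFrom, exhausted) are invented for illustration; the point is that a page smaller than `count` no longer signals end-of-columns, because exhaustion is an explicit flag.

```java
import java.util.*;

public class IterationStateSketch {
    // A page of columns plus explicit iteration state (hypothetical shape).
    static class Page {
        final List<Integer> columns;
        final int resumeFrom;    // where the next slice should start
        final boolean exhausted; // explicit end-of-columns marker
        Page(List<Integer> c, int r, boolean e) { columns = c; resumeFrom = r; exhausted = e; }
    }

    static final NavigableSet<Integer> STORE = new TreeSet<>();
    static { for (int i = 0; i < 25; i++) STORE.add(i); }

    // Simulated server: may legally return fewer than `count` columns
    // (capped at 4 here, standing in for e.g. a byte-size cap) without
    // that meaning end-of-data.
    static Page page(int start, int count) {
        List<Integer> out = new ArrayList<>();
        for (int c : STORE.tailSet(start, true)) {
            if (out.size() == Math.min(count, 4)) return new Page(out, c, false);
            out.add(c);
        }
        return new Page(out, -1, true);
    }

    // Client: iterate on the explicit state, not on the page size.
    static int countAll() {
        int total = 0;
        Page p = page(0, 10);
        total += p.columns.size();
        while (!p.exhausted) {
            p = page(p.resumeFrom, 10);
            total += p.columns.size();
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(countAll()); // 25, despite every page being "short"
    }
}
```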

        Peter Schuller added a comment -

        I realized I failed to make the possibly most important point: That you can indeed get short reads such that iteration will stop early. Consider 3 nodes, RF=3, QUORUM.

        • Node A has [10,40] of columns
        • Node B has [10,40] of columns
        • Node C has [10,20] of column deletions for the columns that A/B has,
          but does NOT have any for [21,40] because it was down when those were written

        Now a client slices [10,1000] with count=11. The co-ordinating node will reconcile that; C's tombstones will override A/B (I'm assuming tombstones are later than A+B's columns), but since C is lacking the "remainder" of columns you don't just get some columns at lowered consistency level - you actually get a "short" result, and the application or high-level client will believe that the iteration is complete.
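        A toy model of this merge (hypothetical code, assuming C's tombstones carry newer timestamps than A/B's columns): each replica slices locally with count=11, and the coordinator's reconciled page comes out empty even though columns 21..40 exist on A and B.

```java
import java.util.*;

public class SpuriousShortReadSketch {
    static List<Integer> reconciledPage(int count) {
        // replica-local slices: count applies to live columns only
        List<Integer> a = new ArrayList<>(), b = new ArrayList<>();
        for (int name = 10; a.size() < count; name++) { a.add(name); b.add(name); }
        Set<Integer> cTombstones = new TreeSet<>();
        for (int name = 10; name <= 20; name++) cTombstones.add(name); // C: deletions only

        // coordinator merge: the (newer) tombstones erase everything A/B returned
        TreeSet<Integer> live = new TreeSet<>(a);
        live.addAll(b);
        live.removeAll(cTombstones);
        return new ArrayList<>(live);
    }

    public static void main(String[] args) {
        // empty result looks exactly like legitimate end-of-columns
        System.out.println(reconciledPage(11)); // []
    }
}
```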

        This was the primary reason why I said in the OP that I believed "iterating over columns is impossible to do reliably with QUORUM". I somehow lost this when re-phrasing the JIRA post a couple of times.

        Note: The short read case is not something I have tested and triggered, so is based on extrapolation from my understanding of the code.

        Sylvain Lebresne added a comment -

        You are right, there is a problem here.

        I'll just note that your example is not a good example for QUORUM, because the fact that only C "has [10,20] of column deletions" means this situation could not have arisen with QUORUM writes (and the consistency guarantee for QUORUM requires both reads and writes at QUORUM).

        However, this still shows that there is a problem for ONE (writes) + ALL (reads). And it's not hard to come up with an example for QUORUM (reads and writes): just consider the case where you insert, say, 10 columns and then delete the first 3, each time with 1 node down, but never the same one.

        To make this concrete, I'm attaching a script that produces this "short read" effect. Disclaimer: it uses https://github.com/pcmanus/ccm and requires the patch I've attached to CASSANDRA-2646 (to be able to do a bounded slice with the cli).

        The simplest way to fix this that I see (which doesn't mean it is simple per se) would be to request more columns if we're short after a resolve on the coordinator. Yes, in theory it means we may have to do an unknown number of such re-requests, but in practice I strongly doubt this will be a problem. The problem has very little chance of happening in real life to start with (for QUORUM, my script is simple but implements something that has very, very little chance of actually happening in real life, especially with HH, read repair and repair), and the chances that, if it does happen, we need more than one re-request are ridiculously small.
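        The retry idea, sketched end-to-end on the earlier A/B/C example (all of this is a toy simulation with invented names, not the coordinator's real code): if the resolved page is short but some replica filled its page, re-request starting past the merged range.

```java
import java.util.*;

public class ShortReadRetrySketch {
    // name -> isTombstone; replicaC's tombstones are assumed newer.
    static final NavigableMap<Integer, Boolean> replicaAB = new TreeMap<>();
    static final NavigableMap<Integer, Boolean> replicaC = new TreeMap<>();
    static {
        for (int n = 10; n <= 40; n++) replicaAB.put(n, false); // live columns
        for (int n = 10; n <= 20; n++) replicaC.put(n, true);   // deletions only
    }

    // Replica-local slice: scan from `start`, stop after `count` live columns.
    static NavigableMap<Integer, Boolean> slice(NavigableMap<Integer, Boolean> r, int start, int count) {
        NavigableMap<Integer, Boolean> out = new TreeMap<>();
        int liveSeen = 0;
        for (Map.Entry<Integer, Boolean> e : r.tailMap(start, true).entrySet()) {
            if (liveSeen == count) break;
            out.put(e.getKey(), e.getValue());
            if (!e.getValue()) liveSeen++;
        }
        return out;
    }

    static int live(Map<Integer, Boolean> m) {
        int n = 0;
        for (boolean tombstone : m.values()) if (!tombstone) n++;
        return n;
    }

    // Coordinator read with the proposed retry.
    static List<Integer> read(int start, int count) {
        List<Integer> result = new ArrayList<>();
        while (result.size() < count) {
            int need = count - result.size();
            NavigableMap<Integer, Boolean> a = slice(replicaAB, start, need);
            NavigableMap<Integer, Boolean> c = slice(replicaC, start, need);
            TreeMap<Integer, Boolean> merged = new TreeMap<>(a);
            merged.putAll(c); // tombstones (assumed newer) override
            for (Map.Entry<Integer, Boolean> e : merged.entrySet())
                if (!e.getValue() && result.size() < count) result.add(e.getKey());
            boolean someReplicaFull = live(a) == need || live(c) == need;
            if (result.size() >= count || !someReplicaFull || merged.isEmpty()) break;
            start = merged.lastKey() + 1; // spurious short read: ask again
        }
        return result;
    }

    public static void main(String[] args) {
        // one retry turns the empty page into the correct [21..31]
        System.out.println(read(10, 11));
    }
}
```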

        Sylvain Lebresne made changes -
        Attachment short_read.sh [ 12479111 ]
        Peter Schuller added a comment -

        You're right of course - my example was bogus. I'll also agree about re-try being reasonable under the circumstances, though perhaps not optimal.

        With regards to the fix. Let me just make sure I understand you correctly. So given a read command with a limit N that yields <N columns (post-reconciliation), we may need to re-request from one or more nodes. But how do we distinguish between a legitimate short read and a spurious short read? The criterion seems to me to be that a read is potentially spuriously short if one or more of the nodes involved returned a NON-short read. If all of them returned short reads, it's fine; only if we have results from a node that we cannot prove did indeed exhaust its list of available columns do we need to check.

        That is my understanding of your proposed solution, and that does seem doable on the co-ordinator side without protocol changes since we obviously know what we actually got from each node; it's just a matter of coding acrobatics (not sure how much work).

        However, would you agree with this claim: This would fix the spurious short read problem specifically, but does not address the more general problem of consistency - i.e., one might receive columns that have not gone through reconciliation by QUORUM?

        If we are to solve that, while still not implying protocol changes, I believe we need to do re-tries whenever a more general condition is true: That we do not have confirmed QUORUM for the full range implied by the start+limit range that we are being asked for. In other words, if one or more of the nodes participating in the read returned a response that satisfies:

        (1) The response was not short.
        AND
        (2) The response's "last" column was < the "last" column that we are to return post-reconciliation.

        Lacking a protocol change to communicate authoritative ranges of responses, and given that the premise is that we must deliver start+limit unless there are < limit number of columns available, we necessarily can only consider the full range (first-to-last column) of a response as authoritative (except in the case of a short read, in which case it's authoritative to infinity).
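        Conditions (1) and (2) can be written as a predicate over the responses (types here are invented for illustration; a response is reduced to how many columns it returned and the last column name it covered):

```java
import java.util.*;

public class AuthoritativeRangeCheck {
    static class Response {
        final int size; final String last;
        Response(int size, String last) { this.size = size; this.last = last; }
    }

    // Retry if some response was NOT short (size == limit) yet stops
    // before the last column we are about to return: part of the
    // reconciled range then lacks that replica's vote.
    static boolean needsRetry(List<Response> responses, String mergedLast, int limit) {
        for (Response r : responses)
            if (r.size == limit && r.last.compareTo(mergedLast) < 0)
                return true;
        return false;
    }

    public static void main(String[] args) {
        List<Response> rs = Arrays.asList(new Response(3, "c"), new Response(3, "f"));
        System.out.println(needsRetry(rs, "f", 3)); // true: first response ends early
        System.out.println(needsRetry(rs, "c", 3)); // false
    }
}
```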

        Without revisiting the code to try to figure out what the easiest way to implement it is, one thought is that if you agree that a clean long-term fix would be to communicate authoritativeness in responses, perhaps one can at least make the logic to handle this compatible with that way of thinking. It's just that until protocol changes can happen, we'd (1) infer authoritativeness from columns/tombstones in the result instead of from explicit indicators in a response, and (2) since we cannot propagate short ranges to clients, we must re-request instead of cleanly return a short-but-not-eof-indicating range to the client.

        Thoughts?

        Jonathan Ellis made changes -
        Assignee Sylvain Lebresne [ slebresne ]
        Fix Version/s 1.0 [ 12316349 ]
        Jonathan Ellis made changes -
        Assignee Sylvain Lebresne [ slebresne ] Brandon Williams [ brandon.williams ]
        Byron Clark made changes -
        Attachment short_read_0.8.sh [ 12488060 ]
        Byron Clark added a comment -

        [^short_read_0.8.sh] is an update to short_read.sh that works with 0.8.x.

        Byron Clark added a comment - edited

        I'm having a really hard time reproducing this issue consistently on trunk. That's not to say that it doesn't happen, but I'm only seeing it one time out of fifteen using short_read.sh.

        I'm running this on Linux and, from the logs, it looks like the downed node isn't being detected as down, even when I sleep 10 seconds before doing the set.

        Any hints on getting this failure to happen more reliably?

        Byron Clark added a comment -

        reliable_short_read_0.8.sh reproduces the issue for me every time. This script requires the following commit to ccm: https://github.com/byronclark/ccm/commit/974a5773228e783d4a91d7ba46d744e5a1216377

        Byron Clark made changes -
        Attachment reliable_short_read_0.8.sh [ 12488169 ]
        Byron Clark added a comment - edited

        The attached CASSANDRA-2643-poc.patch, while extremely ugly, serves as a proof of concept that all the data is available and the short read problem can be corrected.

        Byron Clark made changes -
        Attachment CASSANDRA-2643-poc.patch [ 12488173 ]
        Byron Clark added a comment -

        CASSANDRA-2643-v2.patch makes the following improvements on CASSANDRA-2643-poc.patch:

        • Cleans up duplicated code for counting live columns.
        • Removes the need to store the ReadCommand in the RepairCallback.

        Remaining issue to be dealt with:

        • Storing maxLiveColumns in the resolver feels wrong, but that's where the data is generated. We need to know how many live columns the largest response had in order to detect whether this is actually a short read. I'm open to suggestions for a better place for that count to live.
        Byron Clark made changes -
        Attachment CASSANDRA-2643-v2.patch [ 12488280 ]
        Jonathan Ellis added a comment -

        Looks good on the whole. One point to clear up:

        if ((maxLiveColumns >= sliceCommand.count) && (liveColumnsInRow < sliceCommand.count))
        

        maxLiveColumns is the max from a single response, so how can it be greater than sliceCommand.count? Would this be a valid reformulation?

        assert maxLiveColumns <= sliceCommand.count;
        if ((maxLiveColumns == sliceCommand.count) && (liveColumnsInRow < sliceCommand.count))
        

        Minor things I'd like to clean up:

        • is maxLiveColumns valid on any AbstractRR subclass other than RRR? If not I'd rather move it in there and throw an UnsupportedOperation in ARR.
        • Would prefer initializing commandsToRetry to Collections.emptyList, to avoid allocating that list in the common case that no retries are needed. (Then clear of course needs to become allocate.)
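        The emptyList suggestion in miniature (a generic sketch, not the actual StorageProxy code): start with the shared immutable empty list and allocate only when a retry is actually queued.

```java
import java.util.*;

public class LazyRetryList {
    // Hypothetical helper: no allocation in the common no-retry path.
    static List<String> queueRetries(boolean anyShort) {
        List<String> commandsToRetry = Collections.emptyList();
        if (anyShort) {
            commandsToRetry = new ArrayList<>(); // allocate only now
            commandsToRetry.add("re-read");
        }
        return commandsToRetry;
    }

    public static void main(String[] args) {
        System.out.println(queueRetries(false).size()); // 0
        System.out.println(queueRetries(true));         // [re-read]
    }
}
```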
        Byron Clark added a comment -

        CASSANDRA-2643-v3.patch incorporates Jonathan's suggestions.

        Byron Clark made changes -
        Attachment CASSANDRA-2643-v3.patch [ 12488855 ]
        Jonathan Ellis added a comment -

        added a skeleton getMaxLiveColumns to RangeSliceResponseResolver (and created CASSANDRA-2986 to follow up on that) and committed.

        thanks!

        Jonathan Ellis made changes -
        Resolution Fixed [ 1 ]
        Status Open [ 1 ] Resolved [ 5 ]
        Assignee Brandon Williams [ brandon.williams ] Byron Clark [ byronclark ]
        Reviewer jbellis
        Hudson added a comment -

        Integrated in Cassandra #992 (See https://builds.apache.org/job/Cassandra/992/)
        fix "short reads" in [multi]get
        patch by Byron Clark; reviewed by jbellis for CASSANDRA-2643

        jbellis : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1153115
        Files :

        • /cassandra/trunk/CHANGES.txt
        • /cassandra/trunk/src/java/org/apache/cassandra/service/IResponseResolver.java
        • /cassandra/trunk/src/java/org/apache/cassandra/service/StorageProxy.java
        • /cassandra/trunk/src/java/org/apache/cassandra/service/RepairCallback.java
        • /cassandra/trunk/src/java/org/apache/cassandra/service/AbstractRowResolver.java
        • /cassandra/trunk/src/java/org/apache/cassandra/service/RowRepairResolver.java
        • /cassandra/trunk/src/java/org/apache/cassandra/db/AbstractColumnContainer.java
        • /cassandra/trunk/src/java/org/apache/cassandra/service/RangeSliceResponseResolver.java
        Gavin made changes -
        Workflow no-reopen-closed, patch-avail [ 12613295 ] patch-available, re-open possible [ 12752829 ]
        Gavin made changes -
        Workflow patch-available, re-open possible [ 12752829 ] reopen-resolved, no closed status, patch-avail, testing [ 12758480 ]
        Transition: Open → Resolved
        Time in source status: 81d 20h 59m
        Execution times: 1
        Last executer: Jonathan Ellis
        Last execution date: 02/Aug/11 14:27

          People

          • Assignee:
            Byron Clark
            Reporter:
            Peter Schuller
            Reviewer:
            Jonathan Ellis
          • Votes: 0
          • Watchers: 6
