Cassandra
  1. Cassandra
  2. CASSANDRA-3395

Quorum returns incorrect results during hinted handoff

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Fix Version/s: 1.0.2
    • Component/s: Core
    • Labels:
      None

      Description

      In a 3 node cluster with RF=3 and using a single coordinator, if monotonically increasing columns are inserted into a row and the latest one sliced (both at QUORUM) during HH replay occasionally this column will not be seen.

      1. logs.tar.bz2
        2.15 MB
        Brandon Williams
      2. ttestraw.py
        2 kB
        Brandon Williams
      3. ttest.py
        1 kB
        Brandon Williams

        Activity

        Hide
        Brandon Williams added a comment -

        Logs from a failure where 20172 was expected but 20171 was returned and the pycassa script to reproduce. Procedure is start the script and kill a node, wait, bring it up back up and wait for HH. Takes a few tries.

        Show
        Brandon Williams added a comment - Logs from a failure where 20172 was expected but 20171 was returned and the pycassa script to reproduce. Procedure is start the script and kill a node, wait, bring it up back up and wait for HH. Takes a few tries.
        Hide
        Jonathan Ellis added a comment -

        This sure looks like a case of "if timestamps are equal, all bets are off" to me. Suggest testing w/ explicitly increasing timestamps.

        Show
        Jonathan Ellis added a comment - This sure looks like a case of "if timestamps are equal, all bets are off" to me. Suggest testing w/ explicitly increasing timestamps.
        Hide
        Jonathan Ellis added a comment -

        Sorry, mis-read the script. Brandon points out that there are no column overwrites, only appends.

        Show
        Jonathan Ellis added a comment - Sorry, mis-read the script. Brandon points out that there are no column overwrites, only appends.
        Hide
        Brandon Williams added a comment -

        Equivalent script using telephus that does not repro. Slightly hackish in that it won't create the ks/cf, but the pycassa script can do that and be killed before it inserts.

        Show
        Brandon Williams added a comment - Equivalent script using telephus that does not repro. Slightly hackish in that it won't create the ks/cf, but the pycassa script can do that and be killed before it inserts.
        Hide
        Jonathan Ellis added a comment - - edited

        This is tricky, but we figured out what's happening.

        First, hinted handoff isn't important to reproducing, but bouncing nodes is. You need a node to miss an update to get this.

        If you do that, then you can get this situation, as seen in Brandon's log:

        all 3 nodes reply: (not strictly necessary, all we need is two nodes that disagree with each other)

        DEBUG [ReadRepairStage:8] 2011-10-21 19:35:23,105 RowDigestResolver.java (line 62) resolving 3 responses
        

        The responses don't match:

        DEBUG [pool-2-thread-6] 2011-10-21 19:35:23,105 StorageProxy.java (line 615) Digest mismatch: org.apache.cassandra.service.DigestMismatchException: Mismatch for key DecoratedKey(91747740688180627279175449712403223124, 747465737472617732) (6705a2ef7042fd98f2c30c5450d33e17 vs bf8c16eb98f3209d3abb723ee8c33185)
        

        The coordinator requests the actual data from each replica and merges the responses:

        DEBUG [pool-2-thread-6] 2011-10-21 19:35:23,107 SliceQueryFilter.java (line 123) collecting 0 of 2147483647: 00004ecb:false:4@1319226304173446
        DEBUG [pool-2-thread-6] 2011-10-21 19:35:23,108 SliceQueryFilter.java (line 123) collecting 1 of 2147483647: 00004ecc:false:4@1319226304178304
        

        Note that 00004ecb=20171, and 00004ecc=20172. So both columns are present, and the coordinator now has a slice of [20171, 20172]. It repairs the missing data, then reverses the order as requested in the query and returns [20172, 20171] to the client.

        So, the bug on the Cassandra side is that we don't re-restrict the resultset to the requested count after a digest mismatch, before sending it to the client.

        Then, the client calls popitem() on the result, which means you get back the last item in the resultset, i.e., 20171.

        In short: Cassandra needs to fix sending back more results than requested when there are different versions on different nodes that need to be resolved. We'll address this as part of the related CASSANDRA-3303. Unfortunately, this code path is one that we know from experience is easy to introduce regressions to, so I don't think we can safely do this in 0.8; the fix will be in 1.0.2+.

        However, a simple workaround exists, which is for clients to consume results from the front of the row, instead of the back. This is what the telphus script did by accident, which is why Brandon couldn't reproduce with that version.

        Show
        Jonathan Ellis added a comment - - edited This is tricky, but we figured out what's happening. First, hinted handoff isn't important to reproducing, but bouncing nodes is. You need a node to miss an update to get this. If you do that, then you can get this situation, as seen in Brandon's log: all 3 nodes reply: (not strictly necessary, all we need is two nodes that disagree with each other) DEBUG [ReadRepairStage:8] 2011-10-21 19:35:23,105 RowDigestResolver.java (line 62) resolving 3 responses The responses don't match: DEBUG [pool-2-thread-6] 2011-10-21 19:35:23,105 StorageProxy.java (line 615) Digest mismatch: org.apache.cassandra.service.DigestMismatchException: Mismatch for key DecoratedKey(91747740688180627279175449712403223124, 747465737472617732) (6705a2ef7042fd98f2c30c5450d33e17 vs bf8c16eb98f3209d3abb723ee8c33185) The coordinator requests the actual data from each replica and merges the responses: DEBUG [pool-2-thread-6] 2011-10-21 19:35:23,107 SliceQueryFilter.java (line 123) collecting 0 of 2147483647: 00004ecb:false:4@1319226304173446 DEBUG [pool-2-thread-6] 2011-10-21 19:35:23,108 SliceQueryFilter.java (line 123) collecting 1 of 2147483647: 00004ecc:false:4@1319226304178304 Note that 00004ecb=20171, and 00004ecc=20172. So both columns are present, and the coordinator now has a slice of [20171, 20172] . It repairs the missing data, then reverses the order as requested in the query and returns [20172, 20171] to the client. So, the bug on the Cassandra side is that we don't re-restrict the resultset to the requested count after a digest mismatch, before sending it to the client. Then, the client calls popitem() on the result, which means you get back the last item in the resultset, i.e., 20171. In short: Cassandra needs to fix sending back more results than requested when there are different versions on different nodes that need to be resolved. We'll address this as part of the related CASSANDRA-3303 . Unfortunately, this code path is one that we know from experience is easy to introduce regressions to, so I don't think we can safely do this in 0.8; the fix will be in 1.0.2+. However, a simple workaround exists, which is for clients to consume results from the front of the row, instead of the back. This is what the telphus script did by accident, which is why Brandon couldn't reproduce with that version.
        Hide
        Sylvain Lebresne added a comment -

        CASSANDRA-3303 has been committed with the fix for this. As said by Jonathan above, the fix is 1.0 only.

        Show
        Sylvain Lebresne added a comment - CASSANDRA-3303 has been committed with the fix for this. As said by Jonathan above, the fix is 1.0 only.

          People

          • Assignee:
            Brandon Williams
            Reporter:
            Brandon Williams
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development