Cassandra / CASSANDRA-1042

ColumnFamilyRecordReader returns duplicate rows

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: 0.6.5
    • Component/s: None
    • Labels:

      Description

      There's a bug in ColumnFamilyRecordReader that appears when processing a single split (which happens in most tests that have a small number of rows), and potentially in other cases. When the start and end tokens of the split are equal, duplicate rows can be returned.

      Example with 5 rows:
      token (start and end) = 53193025635115934196771903670925341736

      Tokens returned by first get_range_slices iteration (all 5 rows):
      16955237001963240173058271559858726497
      40670782773005619916245995581909898190
      99079589977253916124855502156832923443
      144992942750327304334463589818972416113
      166860289390734216023086131251507064403

      Tokens returned by next iteration (first token is last token from
      previous, end token is unchanged)
      16955237001963240173058271559858726497
      40670782773005619916245995581909898190

      Tokens returned by final iteration (first token is last token from
      previous, end token is unchanged)
      [] (empty)

      In this example, the mapper has processed 7 rows in total, 2 of which
      were duplicates.
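The mechanics of that loop can be seen in a small simulation. This is a hypothetical sketch, not the actual ColumnFamilyRecordReader code, using shortened stand-ins for the tokens above (split token 53; rows at 16, 40, 99, 144, 166), and assuming the server returns batches in absolute token order:

```java
import java.util.ArrayList;
import java.util.List;

public class DuplicatePaging {
    // Tokens of the 5 rows, in absolute order.
    static final List<Integer> RING = List.of(16, 40, 99, 144, 166);

    // get_range_slices stand-in: returns tokens in (start, end] in
    // ABSOLUTE token order; start == end means the full ring.
    public static List<Integer> rangeSlice(int start, int end) {
        List<Integer> out = new ArrayList<>();
        for (int t : RING) {
            boolean in = (start == end)
                || (start < end ? (t > start && t <= end)   // normal range
                                : (t > start || t <= end)); // wrapping range
            if (in) out.add(t);
        }
        return out;
    }

    // The reader pages by restarting from the last token of each batch.
    // Because batches come back in absolute order rather than ring order,
    // the restart point is wrong and rows get re-fetched.
    public static int countRowsProcessed(int splitToken) {
        int processed = 0;
        int start = splitToken;
        List<Integer> batch;
        while (!(batch = rangeSlice(start, splitToken)).isEmpty()) {
            processed += batch.size();
            start = batch.get(batch.size() - 1);
        }
        return processed;
    }
}
```

With a split token of 53, the first batch is all 5 rows (ending at 166), the second is the wrapped pair 16 and 40 again, and the third is empty: 7 rows processed, 2 of them duplicates, matching the example above.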

      Attachments

      1. 1042-0_6.txt
        3 kB
        Jeremy Hanna
      2. 1042-test.txt
        5 kB
        Jonathan Ellis
      3. 1042-v2.txt
        8 kB
        Jonathan Ellis
      4. cassandra.tar.gz
        3 kB
        Jeremy Hanna
      5. Cassandra-1042-0_6-branch.patch.txt
        2 kB
        Jeremy Hanna
      6. CASSANDRA-1042-trunk.patch.txt
        5 kB
        Jeremy Hanna
      7. duplicate_keys.rtf
        0.8 kB
        Jeremy Hanna

        Activity

        Jonathan Ellis added a comment -

        This sounds like something we could make a unit test for, without having to get Hadoop itself involved.

        Christophe Biocca added a comment -

        The basic issue is that the Thrift server's return value is sorted by the absolute value of the tokens, while the CassandraRecordReader assumes that the order is the one given by traversal of the range (that is, we get the smallest value greater than start_token in first position, and the greatest value smaller than or equal to end_token in last position).
        I don't know which is correct, as the API docs I've looked at don't say which order is supposed to be returned. But if the server's implementation is correct, then the record reader needs to iterate over the returned tokens to figure out which one is actually the last token for iteration purposes. Otherwise, switching the server's implementation to return keys in iteration order will work.
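If the server's absolute-order behavior were kept, the record-reader side of this might look like the following helper. This is an illustrative sketch with integer stand-ins for tokens, not Cassandra code: scan the batch for the token that is actually last in traversal order of the (start, end] range.

```java
import java.util.Collections;
import java.util.List;

public class TraversalOrder {
    // For a batch returned in absolute token order, find the token that
    // is last in traversal (ring) order of the range (start, end].
    // If the range wraps (start >= end), the wrapped-around tokens
    // (those <= end) come last; otherwise the maximum token is last.
    public static int lastInTraversalOrder(List<Integer> batch, int start, int end) {
        if (start < end) return Collections.max(batch);
        int last = Integer.MIN_VALUE;
        for (int t : batch)
            if (t <= end && t > last) last = t;
        // no wrapped-around tokens yet: the largest token > start is last
        return last == Integer.MIN_VALUE ? Collections.max(batch) : last;
    }
}
```

For the example in the description (batch {16, 40, 99, 144, 166} in absolute order, start and end token 53), this picks 40 as the true last token rather than 166.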

        Jeremy Hanna added a comment -

        Hmm, I was able to reproduce this with the contrib/word_count piece on trunk. It appears to double count rows in ranges that have a single row as well as those that have more - in this case 1000.

        Jeremy Hanna added a comment -

        Just some more data - when I use an OrderPreservingPartitioner, the word count works fine...

        Jeremy Hanna added a comment - edited

        The describe_splits call it makes returns 3 sub-splits, the first of which is a wrapping split. Sounds like it's buggy on the server side. Will check with Jonathan.

        Jeremy Hanna added a comment -

        Jonathan:

        Inputs to client.describe_splits() - ColumnFamilyInputSplit:185:
        range.start_token: 85469146195799762548951268272529359452
        range.end_token: 85469146195799762548951268272529359452
        splitsize: 65536

        output:
        splits - ArrayList<String>:
        0: 85469146195799762548951268272529359452
        1: 85469146195799762548951268272529359452

        Seems like that is the bug right there, but I'm not familiar enough with what it's supposed to do in that case.

        Btw, there are only 4 rows in the CF.

        Jeremy Hanna added a comment -

        Appears to be something server related in the splits themselves.

        Jeremy Hanna added a comment -

        In order to facilitate reproducing the problem, I'm attaching my Cassandra data directory, tarred and gzipped.

        There are 4 rows in the cassandra instance. If you modify WordCountSetup to change TEST_COUNT from 4 to 1, then run WordCount, you will find that Cassandra trunk will count 7 occurrences instead of 4. You can also debug on the line I mentioned previously to see what describe_splits receives and then outputs.

        Just wanted to facilitate reproducing the problem.

        Jeremy Hanna added a comment -

        Adding a patch that does the following:

        1. Removes an ordering section in StorageProxy that messes with the wrapping range for a get_range_slice call, thereby messing up the order of the records returned. That led to the initial wrapping range being returned in token order instead of wrapping order, so there was a second call going from the last token (in natural ordering) all the way back to the initial start token. If the server's token were 5 and there were 10 tokens, it would list 1-10, then 1-5 again. With this fix, the return order of the tokens is 6-10, then 1-5, which is correct: the wrapped portion of the range first, then in token order.

        2. A few instances of token.toString that should have been TokenFactory.toString(token) - fixed.

        3. There was a method in StorageService - getStringEndpointMap - that is never called; removed it.

        4. Updated WordCountSetup with the latest trunk to use new Clock(System.currentTimeMillis())

        Jeremy Hanna added a comment -

        Thanks to Stu Hood for helping me narrow this down.

        Jeremy Hanna added a comment -

        To clarify how this fixes the problem: it removes some ordering in StorageProxy.getRangeIterator, since getRestrictedRanges should already have returned the right thing.

        Jonathan Ellis added a comment -

        Looks good to me. Nice work, Jeremy and Stu.

        Can you submit a version against 0.6 branch too?

        Jeremy Hanna added a comment -

        Attaching patches for 0.6-branch and trunk.

        Jeremy Hanna added a comment -

        Jonathan - I updated the trunk patch to not add a couple of unused imports that snuck in while I was messing with WordCount.

        Jeremy Hanna added a comment - edited

        Also - I didn't remove StorageService.getStringEndpointMap in the 0.6 branch version because CassandraServer.get_string_property still calls it. get_string_property was removed on trunk as part of CASSANDRA-965.

        Jonathan Ellis added a comment -

        committed, thanks!

        Jonathan Ellis added a comment -

        done

        Jonathan Ellis added a comment -

        re-opening in light of CASSANDRA-1198

        Jeremy Hanna added a comment -

        Unwrapped the tokens up front, ensuring that the splits would not contain wraps. Works fine now.
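That unwrapping step can be sketched as follows. This is a hypothetical illustration with integer tokens and a stand-in minimum token, not the patch itself: a range whose start token is greater than or equal to its end token wraps around the ring, and is replaced by two non-wrapping pieces split at the ring's minimum token.

```java
import java.util.List;

public class Unwrap {
    static final int MIN_TOKEN = 0; // stand-in for the partitioner's minimum token

    // Ranges are (start, end]; end == MIN_TOKEN means "to the end of the ring".
    public static List<int[]> unwrap(int start, int end) {
        if (start < end) return List.of(new int[]{start, end}); // no wrap
        // wrapping (or full-ring) range -> two non-wrapping halves
        return List.of(new int[]{start, MIN_TOKEN}, new int[]{MIN_TOKEN, end});
    }
}
```

A full-ring split like (5, 5] becomes (5, min] and (min, 5], so no individual split ever wraps and the reader's paging assumption holds within each split.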

        Jeremy Hanna added a comment -

        Attaching new patch.

        Jeremy Hanna added a comment -

        The patch should apply cleanly to 0.7/trunk as well.

        Stu Hood added a comment -

        +1

        Jonathan Ellis added a comment -

        do we have a theory as to why wrapped ranges should cause bugs?

        I worry that if we're just trying code out and it seems to work, we may not be fixing the real problem.

        Jeremy Hanna added a comment - edited

        Good point.

        From what I could tell in this instance, it would go through the input splits, and on the last input split it would have an incorrect last value. So it would go back through and take that value to the end of the input list. I would imagine that is where it had wrapped. I'm not sure why it had the incorrect last value as the last value in the wrapped input split, though; if someone is wiser than I in these matters, please chime in. But it appears that normalizing how the splits are done so that no split wraps internally solves the problem.

        To reproduce easily and with a small dataset: if you don't apply the patch and run word_count_setup with only 10 values for text3, that will usually be enough to manifest the problem when running wordcount.

        Also, if the wrap can be detected when creating the splits, as with this patch, then presumably wrapping could also be detected when reading the rows in the ColumnFamilyRecordReader. That could be another way to resolve it, but I think it's six of one, half a dozen of the other.

        Like I said, I'm not certain why that incorrect ordering happens on the wrapped split.

        Jeremy Hanna added a comment -

        Adding the output of word count with duplicate tokens. It appears to happen when the input split contains a wrapped key range. That's why the updated patch splits wrapped key ranges (fixing the problem).

        Jeremy Hanna added a comment -

        Sorry if this is redundant but pasting in a thought we had a while ago that motivated the attached patch. If we make sure that the splits are always in ring order and never wrap, it solves the problem.

        "Token ranges may also wrap – that is, the end token may be less than the start one. Thus, a range from keyX to keyX is a one-element range, but a range from tokenY to tokenY is the full ring."

        It does not say what order they will be in when it wraps. Some clients assume that the ordering is natural order while the hadoop client interactions assume that it will be ring order.

        For example:
        – a list of tokens (1,2,3,4,5,6,7,8,9)
        – a get_range_slice call with start_token = 5, end_token = 5
        Natural order means token order from start to finish, returning the results (1,2,3,4,5,6,7,8,9).
        Ring order (or wrapping order) means it would return the results (5,6,7,8,9,1,2,3,4).
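The two orderings can be made concrete with a comparator. This is an illustrative sketch with integer tokens, not Cassandra code; it treats the range as exclusive of the start token and inclusive of the end, so ring order for start_token = 5 comes out as (6,7,8,9,1,2,3,4,5):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class RingOrder {
    // Sort tokens by their position around the ring, starting just
    // after `start`: tokens greater than start come first (ascending),
    // then the wrapped-around tokens less than or equal to start.
    public static List<Integer> ringSort(List<Integer> tokens, int start) {
        List<Integer> sorted = new ArrayList<>(tokens);
        sorted.sort(Comparator.comparingLong(
            t -> t > start ? (long) t - start
                           : (long) t - start + (1L << 32)));
        return sorted;
    }
}
```

Natural order is just the plain ascending sort; ring order is the same list rotated to start just after the query's start token.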

        Jonathan Ellis added a comment -

        the "correct" order when tokens are involved is ring order

        (when start_key is used instead of start_token, you can't have a wrapping range so it should be moot)

        Jonathan Ellis added a comment -

        patching CFIF isn't the answer, we need any client using the API to get the right results

        Jonathan Ellis added a comment -

        It seems that the root of the problem is, as Jeremy said, rows getting returned in token order instead of ring order. If, in Joost's original example, the rows were returned in order of

        99079589977253916124855502156832923443
        144992942750327304334463589818972416113
        166860289390734216023086131251507064403
        16955237001963240173058271559858726497
        40670782773005619916245995581909898190

        then doing an extra query for (40670782773005619916245995581909898190, 53193025635115934196771903670925341736]

        would return the desired result of nothing.

        but I am unable to reproduce this behavior in a unit test (against 0.6 branch, attached). trying jeremy's data dir (also against 0.6 branch), I get "java.io.IOException: Found system table files, but they couldn't be loaded. Did you change the partitioner?"

        Jonathan Ellis added a comment -

        Jeremy pointed out that the sorting removed by the original patch here sorts in raw token order rather than taking into account the requested start token. I think that's our problem, although I'm not sure why my unit test isn't running into it.

        Jonathan Ellis added a comment -

        ah, the unit test hits CFS directly instead of going through StorageProxy (where the sort happens)...

        Jonathan Ellis added a comment -

        v2 attached:

        • removes wrapped-range handling from CFS.getRangeSlice, since StorageProxy always unwraps first
        • adds (initially failing) system test exercising wrapped-range path
        • adds sorting of unwrapped, restricted ranges relative to the original query range [this is the bug fix]
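The third bullet, the actual bug fix, might be sketched like this. This is a hypothetical illustration with integer tokens standing in for real tokens and ranges as {start, end} pairs, not the v2 patch itself: after unwrapping, the restricted sub-ranges are sorted by the ring position of their start token relative to the original query's start, so results come back in traversal order rather than absolute token order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class RangeSort {
    // Sort sub-ranges by ring distance from the query's start token
    // to each range's own start token, so iteration proceeds around
    // the ring in traversal order.
    public static List<int[]> sortRelativeTo(List<int[]> ranges, int queryStart) {
        List<int[]> sorted = new ArrayList<>(ranges);
        sorted.sort(Comparator.comparingLong(
            r -> r[0] >= queryStart ? (long) r[0] - queryStart
                                    : (long) r[0] - queryStart + (1L << 32)));
        return sorted;
    }
}
```

For a query starting at token 6, the sub-ranges {6,9}, {9,1}, {1,5} come out in that order rather than sorted by raw token value, which is exactly the distinction between ring order and token order discussed above.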
        Show
        jbellis Jonathan Ellis added a comment - v2 attached: removes wrapped-range handling from CFS.getRangeSlice, since StorageProxy always unwraps first adds (initially failing) system test exercising wrapped-range path adds sorting of unwrapped, restricted ranges relative to the original query range [this is the bug fix]
        Hide
        jeromatron Jeremy Hanna added a comment -

        +1

        Jonathan Ellis added a comment -

        committed

        Jonathan Ellis added a comment -

        Backported a related fix from CASSANDRA-1156 to 0.6.5 in r982580


          People

          • Assignee:
            Jonathan Ellis
          • Reporter:
            Joost Ouwerkerk
          • Votes: 0
          • Watchers: 6
