Uploaded image for project: 'Cassandra'
  1. Cassandra
  2. CASSANDRA-13595

Short read protection doesn't work at the end of a partition

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Fix Version/s: None
    • Component/s: Coordination
    • Labels:

      Description

      It seems that short read protection doesn't work when the short read is done at the end of a partition in a range query. The final assertion of this dtest fails:

      def short_read_partitions_delete_test(self):
              cluster = self.cluster
              cluster.set_configuration_options(values={'hinted_handoff_enabled': False})
              cluster.set_batch_commitlog(enabled=True)
              cluster.populate(2).start(wait_other_notice=True)
              node1, node2 = self.cluster.nodelist()
      
              session = self.patient_cql_connection(node1)
              create_ks(session, 'ks', 2)
              session.execute("CREATE TABLE t (k int, c int, PRIMARY KEY(k, c)) WITH read_repair_chance = 0.0")
      
              # we write 1 and 2 in a partition: all nodes get it.
              session.execute(SimpleStatement("INSERT INTO t (k, c) VALUES (1, 1)", consistency_level=ConsistencyLevel.ALL))
              session.execute(SimpleStatement("INSERT INTO t (k, c) VALUES (2, 1)", consistency_level=ConsistencyLevel.ALL))
      
              # we delete partition 1: only node 1 gets it.
              node2.flush()
              node2.stop(wait_other_notice=True)
              session = self.patient_cql_connection(node1, 'ks', consistency_level=ConsistencyLevel.ONE)
              session.execute(SimpleStatement("DELETE FROM t WHERE k = 1"))
              node2.start(wait_other_notice=True)
      
              # we delete partition 2: only node 2 gets it.
              node1.flush()
              node1.stop(wait_other_notice=True)
              session = self.patient_cql_connection(node2, 'ks', consistency_level=ConsistencyLevel.ONE)
              session.execute(SimpleStatement("DELETE FROM t WHERE k = 2"))
              node1.start(wait_other_notice=True)
      
              # read from both nodes
              session = self.patient_cql_connection(node1, 'ks', consistency_level=ConsistencyLevel.ALL)
              assert_none(session, "SELECT * FROM t LIMIT 1")
      

      However, the dtest passes if we remove the LIMIT 1.

      Short read protection uses a SinglePartitionReadCommand, maybe it should use a PartitionRangeReadCommand instead?

        Activity

        Hide
        jasonstack ZhaoYang added a comment - - edited

        The cause is no short-read-protection generated for node2 with key=2, because no UnfilteredRowIterator with key=2 for node2...

        For initial read:
        Node1 returns:  
                   PartitionIterator {
                        UnfilteredIterator( k=1@tombstone, back by short-read-protection)  
                        UnfilteredIterator( k=2, back by short-read-protection)  
                  }
        Node2 returns: 
                   PartitionIterator {
                        UnfilteredIterator( k=1, back by short-read-protection) 
                  }
        
        Show
        jasonstack ZhaoYang added a comment - - edited The cause is no short-read-protection generated for node2 with key=2, because no UnfilteredRowIterator with key=2 for node2... For initial read: Node1 returns: PartitionIterator { UnfilteredIterator( k=1@tombstone, back by short -read-protection) UnfilteredIterator( k=2, back by short -read-protection) } Node2 returns: PartitionIterator { UnfilteredIterator( k=1, back by short -read-protection) }
        Hide
        jasonstack ZhaoYang added a comment -

        It seems that current ShortReadProtection only supports the same partition.

        The idea is to extend current ShortReadProtection function with across-partition support. If current response (UnfilteredPartitionIterator) reached the end and its last key is "behind" (token smaller than) other responses' current key, we will do a PartitionRange retry from the last key to make sure no response is falling short of unfetched partitions.

        Show
        jasonstack ZhaoYang added a comment - It seems that current ShortReadProtection only supports the same partition. The idea is to extend current ShortReadProtection function with across-partition support. If current response (UnfilteredPartitionIterator) reached the end and its last key is "behind" (token smaller than) other responses' current key, we will do a PartitionRange retry from the last key to make sure no response is falling short of unfetched partitions.
        Hide
        iamaleksey Aleksey Yeschenko added a comment -

        The idea is to extend current ShortReadProtection function with across-partition support.

        Correct. Short read protection hasn't been implemented properly for range reads, which causes correctness issues in particular with paging.

        I'm currently addressing the few outstanding issues with single partition short reads on 3.0 and above (CASSANDRA-13794, CASSANDRA-12872). This would be an extension of that work, I guess - or at least there is strong overlap. Feel free to give it a go though - however it may or may not have to be altered afterwards to harmonise both implementations.

        Show
        iamaleksey Aleksey Yeschenko added a comment - The idea is to extend current ShortReadProtection function with across-partition support. Correct. Short read protection hasn't been implemented properly for range reads, which causes correctness issues in particular with paging. I'm currently addressing the few outstanding issues with single partition short reads on 3.0 and above ( CASSANDRA-13794 , CASSANDRA-12872 ). This would be an extension of that work, I guess - or at least there is strong overlap. Feel free to give it a go though - however it may or may not have to be altered afterwards to harmonise both implementations.
        Hide
        jasonstack ZhaoYang added a comment - - edited

        Aleksey Yeschenko Thanks for the heads up.

        Source
        3.0
        dtest

        Here is draft of making ShortReadProtection extend MorePartitions.

        Changes:
        1. If there is no more rows ahead, no short-read-protection(both partition and row) is triggered
        2. Short-read-partition-protection only fetches range from last-partition-key to max token of current command range
        3. Extend UnfilteredPartitionItr with short-read-partition-protection first, then apply short-read-partition

        If you have better solution or integration, feel free to mark it as duplicated.

        Show
        jasonstack ZhaoYang added a comment - - edited Aleksey Yeschenko Thanks for the heads up. Source 3.0 dtest Here is draft of making ShortReadProtection extend MorePartitions . Changes: 1. If there is no more rows ahead, no short-read-protection(both partition and row) is triggered 2. Short-read-partition-protection only fetches range from last-partition-key to max token of current command range 3. Extend UnfilteredPartitionItr with short-read-partition-protection first, then apply short-read-partition If you have better solution or integration, feel free to mark it as duplicated.
        Hide
        jjirsa Jeff Jirsa added a comment -

        Quick note that the dtest repo has moved to apache/cassandra-dtest - so we should move e706952e over to the new repo

        Show
        jjirsa Jeff Jirsa added a comment - Quick note that the dtest repo has moved to apache/cassandra-dtest - so we should move e706952e over to the new repo
        Hide
        iamaleksey Aleksey Yeschenko added a comment -

        ZhaoYang Yours is not a duplicate, I'd just recommend holding it off a bit until the fixes for CASSANDRA-13794 and CASSANDRA-12872 get in. The code you've submitted doesn't seem very far off of what it should be, either. I will probably review/collaborate after I'm done with the other two tickets.

        But this is 3.0. Have you looked into 2.1? I'm not sure how I feel about committing it to 2.1. On one hand, it's a huge correctness issues. Does it count as critical?

        Show
        iamaleksey Aleksey Yeschenko added a comment - ZhaoYang Yours is not a duplicate, I'd just recommend holding it off a bit until the fixes for CASSANDRA-13794 and CASSANDRA-12872 get in. The code you've submitted doesn't seem very far off of what it should be, either. I will probably review/collaborate after I'm done with the other two tickets. But this is 3.0. Have you looked into 2.1? I'm not sure how I feel about committing it to 2.1. On one hand, it's a huge correctness issues. Does it count as critical?
        Hide
        jasonstack ZhaoYang added a comment -

        Jeff Jirsa thanks, I will move to new repo.

        Aleksey Yeschenko I haven't looked at 2.2 yet, just started off with 3.0, will do that after you finalized 13794, 12872.

        Does it count as critical?

        As a user, yes..

        Show
        jasonstack ZhaoYang added a comment - Jeff Jirsa thanks, I will move to new repo. Aleksey Yeschenko I haven't looked at 2.2 yet, just started off with 3.0, will do that after you finalized 13794, 12872. Does it count as critical? As a user, yes..
        Hide
        iamaleksey Aleksey Yeschenko added a comment -

        ZhaoYang My current thinking is that as bad as this is, fixing in 3.0+ is fine. You need to upgrade to get the fix though.

        Can you rebase based on the latest 3.0, though - with CASSANDRA-13794 committed? Then we can take it from there.

        Show
        iamaleksey Aleksey Yeschenko added a comment - ZhaoYang My current thinking is that as bad as this is, fixing in 3.0+ is fine. You need to upgrade to get the fix though. Can you rebase based on the latest 3.0, though - with CASSANDRA-13794 committed? Then we can take it from there.

          People

          • Assignee:
            jasonstack ZhaoYang
            Reporter:
            adelapena Andrés de la Peña
          • Votes:
            0 Vote for this issue
            Watchers:
            7 Start watching this issue

            Dates

            • Created:
              Updated:

              Development