Uploaded image for project: 'Cassandra'
  1. Cassandra
  2. CASSANDRA-12872

Fix short read protection when more than one row is missing

    Details

      Description

      We are seeing an issue with paging reads missing some small number of columns when we do paging/limit reads. We get this on a single DC cluster itself when both reads and writes are happening with QUORUM. Paging/limit reads see this issue. I have attached the ccm based script which reproduces the problem.

      • Keyspace RF - 3
      • Table (id int, course text, marks int, primary key(id, course))
      • replicas for partition key 1 - r1, r2 and r3
      • insert (1, '1', 1) , (1, '2', 2), (1, '3', 3), (1, '4', 4), (1, '5', 5) - succeeded on all 3 replicas
      • insert (1, '6', 6) succeeded on r1 and r3, failed on r2
      • delete (1, '2'), (1, '3'), (1, '4'), (1, '5') succeeded on r1 and r2, failed on r3
      • insert (1, '7', 7) succeeded on r1 and r2, failed on r3

      Local data on 3 nodes looks like as below now

      r1: (1, '1', 1), tombstone(2-5 records), (1, '6', 6), (1, '7', 7)
      r2: (1, '1', 1), tombstone(2-5 records), (1, '7', 7)
      r3: (1, '1', 1), (1, '2', 2), (1, '3', 3), (1, '4', 4), (1, '5', 5), (1, '6', 6)

      If we do a paging read with page_size 2, and if it gets data from r2 and r3, then it will only get the data (1, '1', 1) and (1, '7', 7) skipping record 6. This problem would happen if the same query is not doing paging but limit set to 2 records.

      Resolution code for reads works same for paging queries and normal queries. Co-ordinator shouldn't respond back to client with records/columns that it didn't have complete visibility on all required replicas (in this case 2 replicas). In above case, it is sending back record (1, '7', 7) back to client, but its visibility on r3 is limited up to (1, '2', 2) and it is relying on just r2 data to assume (1, '6', 6) doesn't exist, which is wrong. End of the resolution all it can conclusively say any thing about is (1, '1', and the other one is that we and and and and and and the and the and the and d and the other is and 1), which exists and (1, '2', 2), which is deleted.

      Ideally we should have different resolution implementation for paging/limit queries.

      We could reproduce this on 2.0.17, 2.1.16 and 3.0.9.

      Seems like 3.0.9 we have ShortReadProtection transformation on list queries. I assume that is to protect against the cases like above. But, we can reproduce the issue in 3.0.9 as well.

        Activity

        Hide
        iamaleksey Aleksey Yeschenko added a comment -

        A minimal dtest reproducing the issue here. CASSANDRA-13794 changes make it pass, but unfortunately the bug itself is still here - it's just masked by min(limit, 16) optimisation in CASSANDRA-13794 branch.

        Benjamin Lerer I'm close-ish to completion here. Mind if I take over? It's a bit time-sensitive. If you do mind and/or have something in progress, feel free to take it back.

        Show
        iamaleksey Aleksey Yeschenko added a comment - A minimal dtest reproducing the issue here . CASSANDRA-13794 changes make it pass, but unfortunately the bug itself is still here - it's just masked by min(limit, 16) optimisation in CASSANDRA-13794 branch. Benjamin Lerer I'm close-ish to completion here. Mind if I take over? It's a bit time-sensitive. If you do mind and/or have something in progress, feel free to take it back.
        Hide
        blerer Benjamin Lerer added a comment -

        Aleksey Yeschenko I do not mind at all. You can take more if you want too

        Show
        blerer Benjamin Lerer added a comment - Aleksey Yeschenko I do not mind at all. You can take more if you want too
        Hide
        iamaleksey Aleksey Yeschenko added a comment -

        Fixed branches: 3.0, 3.11, and 4.0

        Will link to circle and dtest runs a bit later (Jenkins is being a bit unresponsive).

        The problem is caused by us applying the counter transformation in an incorrect order: before extending the partition for moreContents() instead of after. And obviously only the transformations applied after extension will be applied upon re-extension, to the next element. Because it makes the most sense.

        This causes us skipping an increment of the counter, which in turn causes SRP deciding to not proceed with fetching more rows, triggering this early return:

        if (lastCount == counter.counted() || !counter.isDoneForPartition())
            return null;
        

        The patch simply moves counter application to the correct spot, and the new dtest passes (as do the older ones).

        Show
        iamaleksey Aleksey Yeschenko added a comment - Fixed branches: 3.0 , 3.11 , and 4.0 Will link to circle and dtest runs a bit later (Jenkins is being a bit unresponsive). The problem is caused by us applying the counter transformation in an incorrect order: before extending the partition for moreContents() instead of after. And obviously only the transformations applied after extension will be applied upon re-extension, to the next element. Because it makes the most sense. This causes us skipping an increment of the counter, which in turn causes SRP deciding to not proceed with fetching more rows, triggering this early return: if (lastCount == counter.counted() || !counter.isDoneForPartition()) return null ; The patch simply moves counter application to the correct spot, and the new dtest passes (as do the older ones).
        Hide
        iamaleksey Aleksey Yeschenko added a comment -

        In effect, SRP only works correctly on the first call to moreContents(). And given that because of a logical bug (CASSANDRA-13794) we only fetch one extra row at a time, SRP doesn't work correctly for most cases at all.

        Show
        iamaleksey Aleksey Yeschenko added a comment - In effect, SRP only works correctly on the first call to moreContents() . And given that because of a logical bug ( CASSANDRA-13794 ) we only fetch one extra row at a time, SRP doesn't work correctly for most cases at all.
        Hide
        iamaleksey Aleksey Yeschenko added a comment -

        The reason our existing dtests never caught it is that all of them only require getting one extra row at most, if anyone's curious.

        Show
        iamaleksey Aleksey Yeschenko added a comment - The reason our existing dtests never caught it is that all of them only require getting one extra row at most, if anyone's curious.
        Hide
        benedict Benedict added a comment - - edited

        nit: the comment suggests we will not apply the transformations "correctly" to future partitions; in reality, the new transformation will be applied just fine, we will just begin accumulating the old ones. Since it was the counter transformation preceding MoreRows, this meant we began double, triple, etc. counting each time we extended. Had it been the "protection" transformation this would have just meant a transient memory leak (of sorts) and a modest incremental CPU burden.

        My bad, I'm forgetting which way around extensions work (it's been a while, even if I did write them). In which case I would elide "correctly" as they won't be applied at all

        Also, grammatically, we're not "extending MoreRows" we're extending "partition" with MoreRows

        +1

        Show
        benedict Benedict added a comment - - edited nit: the comment suggests we will not apply the transformations "correctly" to future partitions; in reality, the new transformation will be applied just fine, we will just begin accumulating the old ones. Since it was the counter transformation preceding MoreRows, this meant we began double, triple, etc. counting each time we extended. Had it been the "protection" transformation this would have just meant a transient memory leak (of sorts) and a modest incremental CPU burden. My bad, I'm forgetting which way around extensions work (it's been a while, even if I did write them). In which case I would elide "correctly" as they won't be applied at all Also, grammatically, we're not "extending MoreRows" we're extending "partition" with MoreRows +1
        Hide
        iamaleksey Aleksey Yeschenko added a comment - - edited

        Thanks, committed as 9ea61305ec30a476f48320c06f56d8d67000bbbe and merged with 3.11 and trunk, with grammar corrected. Or altered, anyway. dtest committed as 9119cdfe921a2f39a315badd58900a12409d506e.

        All unit tests are passing, although a few flaked out on CircleCI and had to be rerun locally. CommitLogSegmentManagerTest in 3.0, DeleteTest, PreparedStatementsTest, and RemoveTest on 3.11, and ViewTest in 4.0. Links to runs: 3.0, 3.11, 4.0.

        On dtests front, there is one new legit failure - in consistency_test.py:TestConsistency.test_13747. Fixing the bug with counting exposed yet another bug in SRP and potentially an orthogonal issue with the read path. It has to do with how we handle EMPTY clustering and bounds. See CASSANDRA-13880.

        Show
        iamaleksey Aleksey Yeschenko added a comment - - edited Thanks, committed as 9ea61305ec30a476f48320c06f56d8d67000bbbe and merged with 3.11 and trunk, with grammar corrected. Or altered, anyway. dtest committed as 9119cdfe921a2f39a315badd58900a12409d506e . All unit tests are passing, although a few flaked out on CircleCI and had to be rerun locally. CommitLogSegmentManagerTest in 3.0, DeleteTest , PreparedStatementsTest , and RemoveTest on 3.11, and ViewTest in 4.0. Links to runs: 3.0 , 3.11 , 4.0 . On dtests front, there is one new legit failure - in consistency_test.py:TestConsistency.test_13747 . Fixing the bug with counting exposed yet another bug in SRP and potentially an orthogonal issue with the read path. It has to do with how we handle EMPTY clustering and bounds. See CASSANDRA-13880 .

          People

          • Assignee:
            iamaleksey Aleksey Yeschenko
            Reporter:
            mgvbhaskar Bhaskar Muppana
            Reviewer:
            Benedict
          • Votes:
            0 Vote for this issue
            Watchers:
            17 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development