Details

    • Type: Improvement
    • Status: Awaiting Feedback
    • Priority: Major
    • Resolution: Unresolved
    • Fix Version/s: 4.x
    • Component/s: Coordination
    • Labels:
      None

      Description

      Today, if there is a digest mismatch in a foreground read repair, the insert that updates the out-of-date replicas is blocking. This means that if it fails, the read fails with a timeout. If a node is dropping writes (perhaps because it is overloaded or its mutation stage is backed up for some other reason), all reads to that replica set could fail. Further, replicas that drop writes fall further out of sync, so they will require even more read repair.

      The comment on the code for why the writes are blocking is:

      // wait for the repair writes to be acknowledged, to minimize impact on any replica that's
      // behind on writes in case the out-of-sync row is read multiple times in quick succession
      

      but the bad side effect is that reads time out. Either the writes should not block, or we should return success for the read even if the repair write times out.
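
      For illustration only, here is a minimal JDK-only sketch of the two behaviours being weighed: blocking the read on repair acknowledgements versus returning the merged result regardless. The class and method names are hypothetical; this is not Cassandra's actual read path.

          import java.util.List;
          import java.util.concurrent.CompletableFuture;
          import java.util.concurrent.ExecutionException;
          import java.util.concurrent.TimeUnit;
          import java.util.concurrent.TimeoutException;

          // Hypothetical illustration only; not Cassandra's read path.
          public class BlockingReadRepairSketch
          {
              /** Blocking variant: a single slow or dropped repair write fails the whole read. */
              static <T> T blockingRead(T mergedResult, List<CompletableFuture<Void>> repairAcks, long timeoutMillis)
                      throws TimeoutException, InterruptedException, ExecutionException
              {
                  // wait for every repair write to be acknowledged before answering the client
                  CompletableFuture.allOf(repairAcks.toArray(new CompletableFuture[0]))
                                   .get(timeoutMillis, TimeUnit.MILLISECONDS);
                  return mergedResult;
              }

              /** Non-blocking alternative: answer immediately; repairs finish (or fail) in the background. */
              static <T> T nonBlockingRead(T mergedResult, List<CompletableFuture<Void>> repairAcks)
              {
                  return mergedResult;
              }
          }

      With the blocking variant, a single replica that drops the repair write turns a successful data read into a client-visible timeout; the non-blocking variant trades that failure mode for a weaker consistency guarantee, which is what the rest of this ticket debates.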

        Issue Links

          Activity

          bdeggleston Blake Eggleston added a comment -

          I opened CASSANDRA-14058 for the refactor since that's sort of a separate task and to keep refactor discussion out of this ticket. There's an initial implementation up, take a look and let me know what you think.

          jjirsa Jeff Jirsa added a comment -

          Agree as well - keep the default as-is, but I like the other two options.

          jjordan Jeremiah Jordan added a comment -

          And I'm leaning towards these three settings:
          1. Perform blocking RR (current default, left default)
          2. Write hints instead of sending RR mutations, and let hint delivery repair the inconsistency, in a manner decoupled from the original read request
          3. Do nothing, for those who rely on regular repair for one reason or another (thinking TWCS, DTCS users?)

          I think the current default needs to stay the default (I was bitten by the original problem more than once in the 0.6/0.7 days), but giving advanced users the ability to disable that seems fine to me. +1 to those options.

          iamaleksey Aleksey Yeschenko added a comment -

          I don't agree that it's required, but I'm not talking here about changing the default behaviour. I'm talking about providing extra options, while retaining the default blocking behaviour.

          rlow Richard Low added a comment -

          Background read repair is quite different. This foreground read repair is required to be blocking, as the discussion at the beginning of the ticket shows. Now that I understand it, I think this is an important guarantee, and people would be very surprised if this behaviour changed.

          So I'm strongly in favour of option 1, although the title of the ticket may be misleading.

          iamaleksey Aleksey Yeschenko added a comment -

          I've been thinking more about this, in particular in the context of deprecating/removing the read_repair_chance and dclocal_read_repair_chance options. The question I asked myself was: what kind of tuning do we want for read repair once those two are gone?

          And I'm leaning towards these three settings:
          1. Perform blocking RR (current default, left default)
          2. Write hints instead of sending RR mutations, and let hint delivery repair the inconsistency, in a manner decoupled from the original read request
          3. Do nothing, for those who rely on regular repair for one reason or another (thinking TWCS, DTCS users?)

          Any opinions?
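
          A rough sketch of how the three settings could hang together; the enum constants and handler below are made-up names for illustration under the assumptions above, not the option that was eventually implemented.

              // Hypothetical sketch of the three proposed behaviours; names are illustrative only.
              enum ReadRepairStrategy { BLOCKING, HINTED, NONE }

              class DigestMismatchHandler
              {
                  void onDigestMismatch(ReadRepairStrategy strategy, Runnable sendBlockingRepairMutations, Runnable writeHints)
                  {
                      switch (strategy)
                      {
                          case BLOCKING:
                              sendBlockingRepairMutations.run(); // 1. current default: block the read on repair acks
                              break;
                          case HINTED:
                              writeHints.run();                  // 2. hand off to hint delivery, decoupled from the read
                              break;
                          case NONE:
                              break;                             // 3. leave the inconsistency to scheduled repair
                      }
                  }
              }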

          xiaolong302@gmail.com Xiaolong Jiang added a comment -

          Aleksey Yeschenko Because I don't know how to do the refactor in a way that keeps my change clean, unless we change the iterator style. If we do that, it will be a much bigger change, which we could put in a separate new JIRA. But I am not sure how safe that refactor would be, since the iterator style is the new engine style and it's all over the place; it seems like a significant change and the new direction after 3.0. If you have any suggestions on how to refactor, or a specific refactor for this JIRA, I would be more than happy to make the proper change.

          iamaleksey Aleksey Yeschenko added a comment -

          I would not refactor the whole read pipeline right now I guess, even though I do agree the code becomes so complicated

          Can you elaborate on reasons why you wouldn't do that?

          xiaolong302@gmail.com Xiaolong Jiang added a comment -

          Marcus Eriksson we don't want to share results. If speculative read retry kicks in, it will mess things up and may trigger a background async repair, which would actually do the repair twice, together with the foreground read repair. We were already doing duplicated repairs before.
          So I want the two repairs (foreground and background async repair) not to interfere with each other. Ideally, I would return the map as the input for the next steps, so we wouldn't have any race at all. However, the iterator close behaviour makes it impossible to pass the value around cleanly; that's how I ended up with this shared-map hack. Aleksey Yeschenko I would not refactor the whole read pipeline right now I guess, even though I do agree the code becomes so complicated.
          Regarding the 1M-row timeout, I compared what we did before with what I am doing now. It turns out the previous code waits for the repair with the write RPC timeout. I was hoping I could make it better, but it turns out I was making it worse. It's better to wait longer instead of returning a failure for the read. If we cannot get a result even after waiting longer, the client will get a timeout anyway.

          Thus I changed the repair wait timeout to be the same as before, and I also ran the stress test with 1M rows by shutting down node3 during writes and then reading with CL=ALL to force read repair. It looks like there are no read timeouts now. I pushed the fix to the same PR I sent previously. Could you please check again?

          iamaleksey Aleksey Yeschenko added a comment -

          Not a fan of what the patch does to the already complicated AbstractReadExecutor and DataResolver. Given that the patch is going into trunk, we might use the opportunity to maybe do some deeper refactoring instead of hacking this on top of an already overly complicated path?

          krummas Marcus Eriksson added a comment -

          Your patch fixes the issue, but don't we actually want the different invocations to share results? I tried replacing the HashMap with a ConcurrentHashMap and that also fixes the issue.

          I also ran a small test to make sure standard read performance is not horrible, but it seems we are getting way more timeouts with this patch - I started a 3 node ccm cluster, inserted 1M rows with stress, flushed and stopped node3, removed all its data and restarted, then read back 1M rows with CL=ALL. With the patch I'm getting lots of timeouts, without it, no timeouts.
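
          For reference, a minimal sketch (JDK types only, illustrative names, not the actual patch) of the ConcurrentHashMap swap discussed here and in the comments below: the foreground read-repair path and the background AsyncRepairRunner share one response map, so iterating a plain HashMap while the other thread inserts throws the ConcurrentModificationException shown in the stack trace further down.

              import java.util.Map;
              import java.util.concurrent.ConcurrentHashMap;

              // Illustrative only; field and method names do not match the real DataResolver.
              class SharedRepairResponses
              {
                  // before: new HashMap<>() — unsafe to iterate while another thread puts
                  private final Map<String, Long> repairResponsesByHost = new ConcurrentHashMap<>();

                  void onRepairAck(String host, long latencyNanos)
                  {
                      repairResponsesByHost.put(host, latencyNanos);
                  }

                  int respondedHostCount()
                  {
                      int count = 0;
                      // ConcurrentHashMap iterators are weakly consistent, so a concurrent
                      // put from the async repair runner no longer throws CME here
                      for (String host : repairResponsesByHost.keySet())
                          count++;
                      return count;
                  }
              }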

          xiaolong302@gmail.com Xiaolong Jiang added a comment -

          Marcus Eriksson this is the patch to fix the CME:

          https://github.com/krummas/cassandra/pull/4

          Can you please take a look? If it sounds good, can you merge it to your repo and retrigger the dtest?

          xiaolong302@gmail.com Xiaolong Jiang added a comment -

          Marcus Eriksson After digging into this, I think the problem is that there are two paths that do the read repair. The first path is the one I fixed, which is directly part of the read call. The other path is the background AsyncRepairRunner, which uses the same DataResolver to resolve the conflict. Since they use the same DataResolver, they share the same repairResponseRequestMap object, so the background AsyncRepairRunner is also modifying this map, causing the iteration over the repairResponseRequestMap keySet to throw the CME. Let me see how I can fix this.

          krummas Marcus Eriksson added a comment -

          Xiaolong Jiang seems we are getting a CME:

          java.util.ConcurrentModificationException: null
                  at java.util.HashMap$HashIterator.nextNode(HashMap.java:1437) ~[na:1.8.0_149-apple]
                  at java.util.HashMap$KeyIterator.next(HashMap.java:1461) ~[na:1.8.0_149-apple]
                  at org.apache.cassandra.service.DataResolver$RepairMergeListener.awaitRepairResponses(DataResolver.java:298) ~[main/:na]
                  at org.apache.cassandra.service.DataResolver$RepairMergeListener.waitRepairToFinishWithPossibleRetry(DataResolver.java:223) ~[main/:na]
                  at org.apache.cassandra.service.DataResolver$RepairMergeListener.close(DataResolver.java:175) ~[main/:na]
                  at org.apache.cassandra.db.partitions.UnfilteredPartitionIterators$2.close(UnfilteredPartitionIterators.java:175) ~[main/:na]
                  at org.apache.cassandra.db.transform.BaseIterator.close(BaseIterator.java:92) ~[main/:na]
                  at org.apache.cassandra.service.DataResolver.compareResponses(DataResolver.java:103) ~[main/:na]
                  at org.apache.cassandra.service.ReadCallback$AsyncRepairRunner.run(ReadCallback.java:232) ~[main/:na]
                  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_149-apple]
                  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[na:1.8.0_149-apple]
                  at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81) ~[main/:na]
                  at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_149-apple]
          

          in the materialized_views_test.py:TestMaterializedViewsConsistency.multi_partition_consistent_reads_after_write_test dtest

          xiaolong302@gmail.com Xiaolong Jiang added a comment -

          Marcus Eriksson Thanks for the review. Yes, your change looks good to me. I saw the tests in CircleCI passed; the dtest is still running. Please go ahead and merge when the dtest passes.
          I think this can go to 4.0 only for the open source version.

          krummas Marcus Eriksson added a comment -

          btw, I think this should go to 4.0 only, do you agree?

          krummas Marcus Eriksson added a comment -

          This LGTM, pushed a branch with some small nits fixed here: https://github.com/krummas/cassandra/tree/xiaolong/10726 (please have a look)

          running tests:
          https://circleci.com/gh/krummas/cassandra/67
          https://builds.apache.org/view/A-D/view/Cassandra/job/Cassandra-devbranch-dtest/175

          Will commit if the tests look good and you think my nits are ok

          xiaolong302@gmail.com Xiaolong Jiang added a comment -

          Blake Eggleston I made a few changes based on your comments and squashed them into one commit. Please take a look and let me know what you think.

          https://github.com/apache/cassandra/pull/94/files

          xiaolong302@gmail.com Xiaolong Jiang added a comment -

          1. I will change isQuorum to satisfiesQuorumFor and add unit tests. Not sure about your suggestion "satisfiedQuorumFor(int quorum)" though. I will mock the keyspace and do the unit test.
          2. I will remove FBUtilities#waitOnFuturesNanos.
          3. I will make changes to wait a maximum of timeToWaitNanos for all responses instead of for each one.
          4. I do have a test covering the read repair response from the second node: testResolveOneReadRepairRetry in DataResolverTest. It doesn't check the response directly; it makes sure the correct data is sent to peer4. (The response is actually mocked by calling resolver.preprocess, which is meaningless; we only need to make sure the correct data is retried to peer4.)
          5. Hmm, it builds in my personal CASSANDRA-10726 branch. I will remove the "final" keyword.

          xiaolong302@gmail.com Xiaolong Jiang added a comment -

          Blake Eggleston Thanks for taking the time to review.
          The reason I need responseCntSnapshot is that sources.length != responseCntSnapshot because of read speculative retry: when a read is slow, Cassandra will try the read on one more host.

          bdeggleston Blake Eggleston added a comment -

          Sorry for the delay getting to this. I have some comments, but conceptually, this seems good.

          Here's my first round of comments:

          ConsistencyLevel#isQuorum

          • a name like satisfiesQuorumFor would be more descriptive
          • needs unit tests that demonstrate it works as expected for different RF/CL combos. It might be easier to have this call a satisfiedQuorumFor(int quorum), and test that.

          FBUtilities#waitOnFuturesNanos

          • this can be removed, it’s not used anywhere

          DataResolver

          • repairResponseRequestMap
            • unless you need to mutate this as part of a test, this should remain private. For testing purposes, you should have a getter method that makes a defensive copy
          • responseCntSnapshot
            • I’m pretty sure you can get rid of this class member. In the resolve method, sources.length == responseCntSnapshot, so you can just use RepairMergeListener.sources.length instead of setting a variable on the class.
          • timeOuts
            • I think the intended behavior is to wait for a max of timeToWaitNanos, but return immediately if any of the read repair recipients have responded. This method waits for timeToWaitNanos for each AsyncOneResponse, so if the last response has been received, it will still wait for each preceding response future before getting to it and returning (see the deadline-based sketch at the end of this comment).
            • should also have a more descriptive name like awaitRepairResponses or something
          • waitRepairToFinishWithPossibleRetry
            • this method needs a detailed javadoc explaining what sort of guarantees it’s providing with regard to read repair. Specifically, the idea of a “monotonic read” is brought up in its comments; this has been discussed in this JIRA, but it also needs to be explained in the code. How it provides that guarantee without unnecessary blocking should also be explained.

          DataResolverTest

          • You’re using the wrong brace style for some of the tests still
          • I don’t see any tests confirming that the method actually returns as expected if you receive a read repair response from the second node.

          Also, could you please also fix these more general issues before posting changes:

          • fix the compilation errors (it won’t build for me)
          • squash and rebase onto trunk; the way the various commits are interleaved with other commits makes it difficult to review this without the help of GitHub.
          • remove the unnecessary uses of the final keyword. It's used in a lot of local variables and method arguments. final typically isn't used in C* outside of class members
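
          To make point 3 above concrete, here is a small sketch of deadline-based waiting using plain JDK futures: the total wait is capped at timeToWaitNanos across all repair responses instead of giving each future its own full timeout. This is not the actual DataResolver code, and awaitRepairResponses is just the name suggested above, used hypothetically.

              import java.util.List;
              import java.util.concurrent.ExecutionException;
              import java.util.concurrent.Future;
              import java.util.concurrent.TimeUnit;
              import java.util.concurrent.TimeoutException;

              final class RepairResponseWait
              {
                  /** @return the number of repair responses acknowledged within the shared time budget */
                  static int awaitRepairResponses(List<Future<Void>> responses, long timeToWaitNanos)
                  {
                      long deadline = System.nanoTime() + timeToWaitNanos;
                      int acked = 0;
                      for (Future<Void> response : responses)
                      {
                          long remaining = deadline - System.nanoTime();
                          if (remaining <= 0)
                              break; // shared budget exhausted; don't wait per-future
                          try
                          {
                              response.get(remaining, TimeUnit.NANOSECONDS);
                              acked++;
                          }
                          catch (TimeoutException e)
                          {
                              break; // the remaining budget was spent waiting on this response
                          }
                          catch (InterruptedException e)
                          {
                              Thread.currentThread().interrupt();
                              break;
                          }
                          catch (ExecutionException e)
                          {
                              // this repair write failed outright; count it as not acknowledged
                          }
                      }
                      return acked;
                  }
              }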
          xiaolong302@gmail.com Xiaolong Jiang added a comment -

          Dtest to verify read repair retry works

          xiaolong302@gmail.com Xiaolong Jiang added a comment -

          regarding distinctHostNum:

          /**
           * When doing read repair, the mutation is per partition key, so it's possible we will repair multiple
           * partitions on different hosts. Say RF = 5 and we need to read partitions p1, p2, p3, p4 from three nodes,
           * n1, n2, n3. If n1 contains the latest data, n2 is missing p1 and p2, and n3 is missing p3 and p4, then we
           * need to repair n2 by sending it the p1 and p2 partitions and repair n3 by sending it the p3 and p4
           * partitions. It's possible that the p1 and p3 repairs are slow, so distinctHostNum below will return 2; in
           * that case I will not retry a new node for read repair, since this read repair retry can only handle one
           * slow host. If p3 and p4 are fast and the p1 and p2 repairs are slow (or just p1 is slow), distinctHostNum
           * below will return 1; in that case I will retry 1 extra node and send p1 and p2 to the extra node (or just
           * p1 if only the p1 read repair times out).
           * On the same host we can have multiple partition read repairs, and we can only handle one host's slowness,
           * so we should get the distinct hosts from the read repair response futures.
           */
          xiaolong302@gmail.com Xiaolong Jiang added a comment -

          regarding the response snapshot count:

          we need to capture this count and save it in the data resolver. All returned results are based on this many
          responses, and "responseCntSnapshot" below is used to calculate whether the read repair is OK or not.
          Since the response list can receive more responses later, but we only iterate over "responseCntSnapshot"
          entries for any future operations (including the read repair retry), we have to save this count as part of
          the current state.

          jasobrown Jason Brown added a comment -

          Xiaolong Jiang Thanks for the comments; they add more context to the patch. Here's my initial round of comments on your patch:

          • general nit: code style is incorrect wrt braces
          • rename ConsistencyLevel#isMajority to ConsistencyLevel#isQuorum as it's more in line with our general naming conventions
          • AbstractReadExecutor#getReadRepairExtra
            • function needs a javadoc comment
            • rename #readRepairExtra to ... ??? "spareReadRepairNode". I would like to make this an Optional<InetAddress> as it makes it clearer that it's not required (rather than piggybacking on "null" to carry that semantic meaning).
            • readRepairExtra = allSortedReplicas.get(blockForReplicas.size()); - this seems unsafe as it assumes the Lists are ordered the same.
          • DataResolver
            • #reppairExtra is misspelled. Further, make it consistent with whatever we name it in AbstractReadExecutor
            • #close - don't add the exception to the Tracing.trace() call, and it's debatable if you need it on the logger.debug line, as well.
            • repairResponseRequestMap and #responseCntSnapshot - I don't think you actually need these. You just care about the number of replicas that have responded, and you can just call responses.size() where you actually care about it (in #waitRepairToFinishWithPossibleRetry()).
            • #distinctHostNum() I'm not sure why you need this as we should only send the message to distinct hosts. I think you can just use results.size() instead.
            • #waitRepairToFinishWithPossibleRetry
              • wrt the block starting at line 224. You iterate over all the timed out entries, but you need to cover the full merged set and send that to the "repairExtra" node. You don't know what data that replica has or needs, thus you need to send the full merged set. I'm not sure off the top of my head where/how to grab the merged rows, but I'm sure you can figure it out.
              • Furthermore, once you get the merged set, you do not need to send multiple messages to the target node, and can instead send one. Thus you can simplify the block starting at line 224 as such:
                Tracing.trace("retry read-repair-mutation to {}", reppairExtra);
                PartitionUpdate update = // get merged result from some location ...
                MessageOut messageOut = new Mutation(update).createMessage(MessagingService.Verb.READ_REPAIR);
                AsyncOneResponse response = MessagingService.instance().sendRR(messageOut, reppairExtra);
                response.get(waitTimeNanos, TimeUnit.NANOSECONDS); // TimeoutException will be thrown
                ....

          The biggest thing this patch needs is testing. You might be able to unit test this one (in fact Xiaolong Jiang and I spoke offline about some ideas for how to do it), but it will take some time (worthwhile, in my opinion). A dtest will probably be required as well, even though that will get tricky - byteman will probably be necessary to help you out.

          xiaolong302@gmail.com Xiaolong Jiang added a comment -

          The patch is trying to do 2 things:
          1. Before: with a quorum read and RF = 3 (replica1, replica2, replica3), the client request reads from 2 replicas (replica1, replica2). If there is a digest mismatch between these 2 replicas, read repair kicks in. Say the stale data is on replica2: read repair sends the correct data to replica2, but if that write request times out for some reason, we send a "read timeout" to the client.
          After this patch, we wait for the replica2 write for some time; if it doesn't come back, the correct data is sent to replica3, whether or not replica3 already has the latest data. If the replica3 write succeeds, it's guaranteed that 2 replicas have the correct data, the client gets a successful read response, and the next quorum read will definitely see the correct data.

          2. The second thing this patch does is make sure that, in the read repair part, we don't block on more replicas than the consistency level needs in the speculative retry / read repair chance case. For example, take the same RF = 3 quorum read: it reads from replica1 and replica2, replica2 is slow, speculative retry kicks in and the read also goes to replica3; then all 3 replica reads come back, but there is a digest mismatch and both replica2 and replica3 have stale data. Previously, read repair would block on both replica2 and replica3 finishing their repairs, but there is no need to wait for both; we only need one successful repair to guarantee a successful quorum read, and the next quorum read will definitely see the latest data even if the replica3 read repair failed. The same applies to read repair chance: if read_repair_chance is "GLOBAL", we don't need to block on all replicas finishing repair, only on what the read consistency level needs.
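
          A hedged sketch (made-up names, not the patch itself) of the decision rule described above: the read succeeds once the number of replicas known to hold the latest data reaches the consistency level's quorum, and a missing repair ack triggers one retry against a spare replica instead of failing the read.

              final class MonotonicQuorumCheck
              {
                  enum NextStep { RETURN_TO_CLIENT, RETRY_ON_SPARE_REPLICA, FAIL_WITH_TIMEOUT }

                  /** true when a follow-up QUORUM read is guaranteed to see the merged data */
                  static boolean satisfiesQuorum(int replicasWithLatestData, int ackedRepairWrites, int quorum)
                  {
                      return replicasWithLatestData + ackedRepairWrites >= quorum;
                  }

                  static NextStep nextStep(int replicasWithLatestData, int ackedRepairWrites, int quorum, boolean spareAvailable)
                  {
                      if (satisfiesQuorum(replicasWithLatestData, ackedRepairWrites, quorum))
                          return NextStep.RETURN_TO_CLIENT;
                      return spareAvailable ? NextStep.RETRY_ON_SPARE_REPLICA : NextStep.FAIL_WITH_TIMEOUT;
                  }
              }

          With RF = 3 and a quorum of 2, one up-to-date replica plus one acknowledged repair write is enough to return success, whether that ack comes from replica2 or from the retried replica3.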

          xiaolong302@gmail.com Xiaolong Jiang added a comment -

          Jason Brown Patch is ready: https://github.com/apache/cassandra/pull/94
          githubbot ASF GitHub Bot added a comment -

          GitHub user xiaolong302 opened a pull request:

          https://github.com/apache/cassandra/pull/94

          CASSANDRA-10726: Read repair inserts should use speculative retry

          1. do an extra read repair retry to only guarantee “monotonic quorum
          read”. Here “quorum” means majority of nodes among replicas
          2. only block what is needed for resolving the digest mismatch no
          matter whether it’s speculative retry or read repair chance.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/xiaolong302/cassandra CASSANDRA-10726

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/cassandra/pull/94.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #94


          commit a587c20e82ffc4aa7c4a3cb1468551b255fc7f71
          Author: Xiaolong Jiang <xiaolong_jiang@apple.com>
          Date: 2017-01-17T05:31:06Z

          Cass: CASSANDRA-10726: Read repair inserts should use speculative retry

          1. do an extra read repair retry to only guarantee “monotonic quorum
          read”. Here “quorum” means majority of nodes among replicas
          2. only block what is needed for resolving the digest mismatch no
          matter whether it’s speculative retry or read repair chance.


          xiaolong302@gmail.com Xiaolong Jiang added a comment -

          sankalp kohli Can you assign this to me? Richard Low had a quick chat with me about this JIRA. I will try to fix this one.

          kohlisankalp sankalp kohli added a comment -

          Assigned to Nachiket Patil

          jbellis Jonathan Ellis added a comment - - edited

          Yes. And if two of the nodes agree on the most recent value then we don't need to block on the third being updated since our monotonicity of future quorum reads is already assured.

          rlow Richard Low added a comment -

          What do you think, Jonathan Ellis, Sylvain Lebresne?

          rlow Richard Low added a comment -

          Actually, isn't the real problem here that speculative retry doesn't include the RR write? We should give up waiting for the write to complete and retry on another replica. A slow RR insert is just as bad as a slow read.

          jbellis Jonathan Ellis added a comment -

          I can live with the -D approach.

          slebresne Sylvain Lebresne added a comment -

          Would a reasonable half way house be to keep the write as blocking but return success in the case of a write timeout?

          That would still break the "monotonic quorum reads" guarantee: unless you get positive acks from the read repair, you can't guarantee that a quorum of replicas is now up to date. Granted, it will work more often if we do that (than if we don't block at all), but guarantees are not about "most of the time".

          And just to recap my personal position on this: I do feel we should keep the guarantee, at least by default, and I still feel the right way to deal with the scenario you're complaining about is a better way of handling nodes that back up on writes. But we all know it's easier said than fixed, and while I'd rather we spend time on that better way of dealing with the two scenarios Jonathan Ellis mentioned above, I'm not too strongly opposed to a -D stopgap for advanced users.

          rlow Richard Low added a comment -

          It would lose a guarantee (which admittedly I didn't know existed), but most people who care about what happens when there's a write timeout will use CAS read and write.

          Would a reasonable half way house be to keep the write as blocking but return success in the case of a write timeout? Then almost always the behaviour will be the same, but it would avoid the timeouts caused by a single broken replica.

          iamaleksey Aleksey Yeschenko added a comment -

          I understand your sentiment, but maybe a -D flag for the 'power user' isn't too awful?

          I agree. We lose nothing here by providing a -D flag.

          brandon.williams Brandon Williams added a comment -

          I understand your sentiment, but maybe a -D flag for the 'power user' isn't too awful?

          jbellis Jonathan Ellis added a comment -

          For the last, we already have options for how much read repair to do. For the former, I'm really -0 leaning to -1 against adding an option to discard an important guarantee.

          brandon.williams Brandon Williams added a comment -

          Let's make it configurable for a) blocking read repair, b) non-blocking read repair, and c) no read repair at all.

          rlow Richard Low added a comment -

          +1 on the option to disable.

          jbellis Jonathan Ellis added a comment -

          Seeing reads "go backwards in time" is one of the most confusing aspects of eventual consistency for people, so I do think it's important that quorum reads avoid that, even more so because users tend to oversimplify quorum reads as "strong consistency that means I don't have to think about EC." So to the degree we can make that assumption true, we should, especially if that's been our behavior already for 4+ years.

          It seems like there are two primary problem scenarios:

          • When a node is overloaded for writes, this stops reads as well. First, delaying reads when we're behind on writes is arguably a good thing that will help you recover faster. Second, the right way to tackle this is with better handling of the write overload, as in CASSANDRA-9318.
          • When data is read-only because disks are failing. I agree with Sylvain that half-broken is often worse than completely broken, and in this specific case, if a disk puts itself in read-only mode then it won't be long until it isn't readable either. This is another case where "mark a disk bad and broadcast to other nodes not to send me requests for tokens pinned to it" as envisioned in CASSANDRA-6696 would be useful, along with an option to "promote write errors to blacklist on reads as well."
          iamaleksey Aleksey Yeschenko added a comment -

          Need to think about it more, but I wouldn't mind, in principle, dropping the guarantee. Arguably the behaviour seen by Richard is more confusing/painful than the CASSANDRA-2494 issue. We should at the very least add an option to disable this.

          slebresne Sylvain Lebresne added a comment -

          The comment in the code is not terribly informative, but the original reason for this is CASSANDRA-2494. I'll let you read up for precise context, but the summary is that we wait for read repair to ensure "monotonic quorum reads", i.e. that if you do 2 successive quorum reads, you're guaranteed the 2nd one won't see something older than the 1st one, and this even if you had a failed quorum write that made you write the most up to date value only to a minority of replicas.

          How important that guarantee is in practice is probably debatable; I'm sure some would be totally fine without it, but we've been providing it (silently) pretty much forever at this point, so some users are likely relying on it (even if without realizing it). I also more generally think we should try to lean towards providing more guarantees rather than fewer when we can, as it yields a less surprising system. So, without pretending this is the best guarantee since sliced bread, I'm not terribly enthusiastic about the idea of dropping it.

          Which doesn't mean I'm ignoring the problem you're raising. If a node properly responds to reads but not to writes, then it can indeed be a problem for the reads it is participating in, and that's not great. I'm just not sure dropping our monotonic quorum read guarantee is the correct way to mitigate that problem. Part of me feels like a node that is consistently dropping writes shouldn't be serving reads as if everything were fine (a half-broken node is often worse than a fully broken one), but I'm not saying I have a good solution off the top of my head that ensures that without too many downsides.

          I'd certainly welcome a broader range of opinions/ideas (Jonathan Ellis, Aleksey Yeschenko in particular?).

          rlow Richard Low added a comment -

          How does it violate consistency? The replica was already inconsistent enough to require a read repair insert, so returning before completing the write can't make it any worse.

          brandon.williams Brandon Williams added a comment -

          Blocking foreground repair basically means a quorum or all consistency level. If we return before completing the write, that violates consistency.

          kohlisankalp sankalp kohli added a comment -

          cc Brandon Williams Sylvain Lebresne
          What do you think?


            People

            • Assignee:
              bdeggleston Blake Eggleston
              Reporter:
              rlow Richard Low
              Reviewer:
              Marcus Eriksson
            • Votes:
              0
              Watchers:
              26