CASSANDRA-5932: Speculative read performance data show unexpected results

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: 2.0.2
    • Component/s: None
    • Labels: None

      Description

      I've done a series of stress tests with eager retries enabled that show undesirable behavior. I'm grouping these behaviors into one ticket as they are most likely related.

      1) Killing off a node in a 4 node cluster actually increases performance.
      2) Compactions make nodes slow, even after the compaction is done.
      3) Eager Reads tend to lessen the immediate performance impact of a node going down, but not consistently.

      My Environment:
      1 stress machine: node0
      4 C* nodes: node4, node5, node6, node7

      My script:
      node0 writes some data: stress -d node4 -F 30000000 -n 30000000 -i 5 -l 2 -K 20
      node0 reads some data: stress -d node4 -n 30000000 -o read -i 5 -K 20

      Examples:

      A node going down increases performance:

      Data for this test here

      At 450s, I kill -9 one of the nodes. There is a brief decrease in performance as the snitch adapts, but then it recovers... to even higher performance than before.

      Compactions make nodes permanently slow:


      The green and orange lines represent trials with eager retry enabled; they never recover their pre-compaction op-rate the way the red and blue lines do.

      Data for this test here

      Speculative Read tends to lessen the immediate impact:


      This graph looked the most promising to me: the two trials with eager retry (the green and orange lines) showed the smallest dip in performance at 450s.

      Data for this test here

      But not always:


      This is a retrial with the same settings as above, yet the 95percentile eager retry (red line) did poorly this time at 450s.

      Data for this test here

      Attachments

      1. eager-read-not-consistent.png
        61 kB
        Ryan McGuire
      2. eager-read-looks-promising.png
        53 kB
        Ryan McGuire
      3. compaction-makes-slow.png
        50 kB
        Ryan McGuire
      4. node-down-increase-performance.png
        32 kB
        Ryan McGuire
      5. eager-read-not-consistent-stats.png
        31 kB
        Ryan McGuire
      6. eager-read-looks-promising-stats.png
        31 kB
        Ryan McGuire
      7. compaction-makes-slow-stats.png
        32 kB
        Ryan McGuire
      8. 5932.txt
        23 kB
        Aleksey Yeschenko
      9. 5933-7a87fc11.png
        83 kB
        Ryan McGuire
      10. 5933-128_and_200rc1.png
        77 kB
        Ryan McGuire
      11. 5933-logs.tar.gz
        565 kB
        Ryan McGuire
      12. 5933-randomized-dsnitch-replica.png
        67 kB
        Ryan McGuire
      13. 5933-randomized-dsnitch-replica.2.png
        79 kB
        Ryan McGuire
      14. 5933-randomized-dsnitch-replica.3.png
        68 kB
        Ryan McGuire
      15. 5932.ded39c7e1c2fa.logs.tar.gz
        536 kB
        Ryan McGuire
      16. 5932-6692c50412ef7d.png
        76 kB
        Ryan McGuire
      17. 5932.6692c50412ef7d.compaction.png
        66 kB
        Ryan McGuire
      18. 5932.6692c50412ef7d.rr0.png
        99 kB
        Ryan McGuire
      19. 5932.6692c50412ef7d.rr1.png
        100 kB
        Ryan McGuire

          Activity

          Jonathan Ellis added a comment -

          Under git branch.

          Li Zou added a comment -

          I cannot even see the cassandra-2.0 branch.
          My "git tag" gives a list that includes the following tags.

          $ git tag
          1.2.8
          1.2.8-tentative
          cassandra-0.3.0-final
          cassandra-0.3.0-rc1
          cassandra-0.3.0-rc2
          ...
          cassandra-1.2.4
          cassandra-1.2.5
          cassandra-1.2.6
          cassandra-1.2.7
          cassandra-1.2.8
          cassandra-1.2.9
          cassandra-2.0.0
          cassandra-2.0.0-beta1
          cassandra-2.0.0-beta2
          cassandra-2.0.0-rc1
          cassandra-2.0.0-rc2
          cassandra-2.0.1
          drivers
          list
          

          There is no cassandra-2.0 branch. Where can I find it?

          Jonathan Ellis added a comment -

          The cassandra-2.0 branch is what will become 2.0.2.

          Li Zou added a comment -

          Hello,

          As this ticket is already fixed in 2.0.2, where can I get the 2.0.2 source code?

          Currently, my "git tag" only shows up to 2.0.1.

          Jonathan Ellis added a comment -

          There's a ticket open for trunk over at CASSANDRA-6154.

          Li Zou added a comment - edited

          Jonathan Ellis, this morning's trunk build has a slightly different symptom and is even more serious than last Friday's build: this time, merely commenting out the assert statement in MessagingService.addCallback() does not help.

          I have copied the exception from /var/log/cassandra/system.log below.

          ERROR [Thrift:12] 2013-10-07 14:42:39,396 Caller+0       at org.apache.cassandra.service.CassandraDaemon$2.uncaughtException(CassandraDaemon.java:134)
           - Exception in thread Thread[Thrift:12,5,main]
          java.lang.AssertionError: null
                  at org.apache.cassandra.net.MessagingService.addCallback(MessagingService.java:543) ~[apache-cassandra-2.1-SNAPSHOT.jar:2.1-SNAPSHOT]
                  at org.apache.cassandra.net.MessagingService.sendRR(MessagingService.java:591) ~[apache-cassandra-2.1-SNAPSHOT.jar:2.1-SNAPSHOT]
                  at org.apache.cassandra.net.MessagingService.sendRR(MessagingService.java:571) ~[apache-cassandra-2.1-SNAPSHOT.jar:2.1-SNAPSHOT]
                  at org.apache.cassandra.service.StorageProxy.sendToHintedEndpoints(StorageProxy.java:869) ~[apache-cassandra-2.1-SNAPSHOT.jar:2.1-SNAPSHOT]
                  at org.apache.cassandra.service.StorageProxy$2.apply(StorageProxy.java:123) ~[apache-cassandra-2.1-SNAPSHOT.jar:2.1-SNAPSHOT]
                  at org.apache.cassandra.service.StorageProxy.performWrite(StorageProxy.java:739) ~[apache-cassandra-2.1-SNAPSHOT.jar:2.1-SNAPSHOT]
                  at org.apache.cassandra.service.StorageProxy.mutate(StorageProxy.java:511) ~[apache-cassandra-2.1-SNAPSHOT.jar:2.1-SNAPSHOT]
                  at org.apache.cassandra.service.StorageProxy.mutateWithTriggers(StorageProxy.java:581) ~[apache-cassandra-2.1-SNAPSHOT.jar:2.1-SNAPSHOT]
                  at org.apache.cassandra.cql3.statements.ModificationStatement.executeWithoutCondition(ModificationStatement.java:379) ~[apache-cassandra-2.1-SNAPSHOT.jar:2.1-SNAPSHOT]
                  at org.apache.cassandra.cql3.statements.ModificationStatement.execute(ModificationStatement.java:363) ~[apache-cassandra-2.1-SNAPSHOT.jar:2.1-SNAPSHOT]
                  at org.apache.cassandra.cql3.QueryProcessor.processStatement(QueryProcessor.java:126) ~[apache-cassandra-2.1-SNAPSHOT.jar:2.1-SNAPSHOT]
                  at org.apache.cassandra.cql3.QueryProcessor.processPrepared(QueryProcessor.java:267) ~[apache-cassandra-2.1-SNAPSHOT.jar:2.1-SNAPSHOT]
                  at org.apache.cassandra.thrift.CassandraServer.execute_prepared_cql3_query(CassandraServer.java:2061) ~[apache-cassandra-2.1-SNAPSHOT.jar:2.1-SNAPSHOT]
                  at org.apache.cassandra.thrift.Cassandra$Processor$execute_prepared_cql3_query.getResult(Cassandra.java:4502) ~[apache-cassandra-thrift-2.1-SNAPSHOT.jar:2.1-SNAPSHOT]
                  at org.apache.cassandra.thrift.Cassandra$Processor$execute_prepared_cql3_query.getResult(Cassandra.java:4486) ~[apache-cassandra-thrift-2.1-SNAPSHOT.jar:2.1-SNAPSHOT]
                  at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) ~[libthrift-0.9.1.jar:0.9.1]
                  at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) ~[libthrift-0.9.1.jar:0.9.1]
                  at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:194) ~[apache-cassandra-2.1-SNAPSHOT.jar:2.1-SNAPSHOT]
                  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[na:1.7.0_25]
                  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) ~[na:1.7.0_25]
                  at java.lang.Thread.run(Thread.java:724) ~[na:1.7.0_25]
          
          
          Jonathan Ellis added a comment -

          (Pushed fix for mutateCounter in 3da10f469d6a328bad209d723a5997c932284344.)

          Jonathan Ellis added a comment -

          Are you doing counter updates? That's the only use of sendRR for updates I see.

          Can you post the stack trace of the assertion error you're getting?

          Li Zou added a comment -

          As of yesterday's trunk build, there were two addCallback() methods, but the one with the ConsistencyLevel argument was not called anywhere; the one without ConsistencyLevel asserts.

          Li Zou added a comment - edited

          The trunk build I used for testing was pulled at noon today. MessagingService.java has two addCallback() methods, and the one without the ConsistencyLevel argument asserts.

          • The one without ConsistencyLevel is called by sendRR()
          • The one with ConsistencyLevel is called by sendMessageToNonLocalDC()
          Jonathan Ellis added a comment -

          MessagingService.addCallback

          Are you sure you have the latest code? The only invocations of addCallback in 2.0/trunk include the consistencylevel argument as of late last night.

          Li Zou added a comment -

          I have done some testing using today's trunk and have observed the following issues.

          Issue 1 – The first method MessagingService.addCallback() (i.e. without the ConsistencyLevel argument) asserts.

          Commenting out the assert statement seems to work, but the Cassandra servers themselves will periodically produce a 10-second outage (i.e. zero transactions from the client's point of view).

          Issue 2 – Speculative Retry seems to stop retrying during the outage window.

          During the outage window, whether triggered by killing one of the Cassandra nodes or produced by the Cassandra servers themselves, JConsole shows that the SpeculativeRetry JMX counter stops incrementing until gossip detects the outage.

          What is the reason for this? Speculative Retry is meant to help during exactly this outage period. This observed behavior is consistent with Cassandra 2.0.0-rc2.

          Ryan McGuire added a comment -

          With read_repair_chance = 1

          data here

          Li Zou added a comment -

          This test result is reasonable and matches what I expected.

          For the PERCENTILE / CUSTOM configurations, the larger cfs.sampleLatencyNanos is, the smaller the throughput impact during normal operation before the outage. During the outage period the situation is reversed: the smaller cfs.sampleLatencyNanos is, the smaller the throughput impact, because requests time out sooner and trigger the speculative retries.

          For the ALWAYS configuration, since it always sends one speculative request in addition to the usual read requests, its throughput should be lower than PERCENTILE / CUSTOM during normal operation before the outage. And since it always sends the speculative retries, its throughput impact during the outage period should be the smallest. The test result indicates that this is true.
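
          To make that trade-off concrete, here is a minimal self-contained sketch of the idea (this is not Cassandra's AbstractReadExecutor code; the suppliers, thread pool, and thresholdNanos parameter are stand-ins, with thresholdNanos playing the role of cfs.sampleLatencyNanos): a timed wait on the first replica, then at most one extra request.

          import java.util.concurrent.*;
          import java.util.function.Supplier;

          public class SpeculativeReadSketch {
              private final ExecutorService pool = Executors.newCachedThreadPool();

              // 'primary' and 'backup' stand in for reads against two replicas.
              public String read(Supplier<String> primary, Supplier<String> backup, long thresholdNanos)
                      throws InterruptedException, ExecutionException {
                  CompletionService<String> cs = new ExecutorCompletionService<>(pool);
                  cs.submit(primary::get);                               // initial data request
                  Future<String> first = cs.poll(thresholdNanos, TimeUnit.NANOSECONDS);
                  if (first == null) {                                   // threshold exceeded
                      cs.submit(backup::get);                            // one speculative retry
                      first = cs.take();                                 // first response of the two wins
                  }
                  return first.get();
              }
          }

          In this picture a threshold of zero corresponds to ALWAYS and an effectively infinite threshold corresponds to NONE.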

          Jonathan Ellis added a comment - edited

          The throughput is a wash in the compaction scenario, but the 99.9% latency looks a lot better with the retries.

          Any theories on why the percentile settings are posting better latency numbers than ALWAYS though?

          Ryan McGuire added a comment - edited

          Node killed while read_repair_chance=0. I accidentally left the test run at 60M rows, so I chopped off the uninteresting bit.

          data here

          (rr=1 is next..)

          Ryan McGuire added a comment -

          I had to double the test length to get a good compaction graph. I'm not sure why it took so long; it didn't take as long in the original test.

          data here

          (Aleksey Yeschenko your read_repair 0 / 1 tests are in progress...)

          Aleksey Yeschenko added a comment -

          Li Zou see CASSANDRA-4792 (TLDR: yes)

          Aleksey Yeschenko added a comment -

          For the data repair, do we need to block waiting for the acks?

          The reasons are listed in the comments, as you've seen:

          // wait for the repair writes to be acknowledged, to minimize impact on any replica that's
          // behind on writes in case the out-of-sync row is read multiple times in quick succession
          

          To reach that goal - yes, it's necessary. Is that scenario worth optimizing for or should we reconsider? Dunno. We are only writing to the replicas that we got the result from, though, so a known down replica wouldn't affect it.
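
          For concreteness, the blocking step amounts to a bounded wait on the repair-write futures. The following is only a self-contained sketch of that pattern (stand-in names, not Cassandra's FBUtilities.waitOnFutures): a lagging replica can delay the read by at most the write RPC timeout.

          import java.util.List;
          import java.util.concurrent.*;

          final class RepairAckWaitSketch {
              // 'repairResults' stands in for resolver.repairResults; 'timeoutMillis' for
              // DatabaseDescriptor.getWriteRpcTimeout().
              static void waitOnAcks(List<Future<?>> repairResults, long timeoutMillis)
                      throws InterruptedException, ExecutionException, TimeoutException {
                  long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
                  for (Future<?> ack : repairResults) {
                      long remaining = deadline - System.nanoTime();
                      if (remaining <= 0)
                          throw new TimeoutException("repair writes not acknowledged in time");
                      ack.get(remaining, TimeUnit.NANOSECONDS);   // block for this ack within the shared budget
                  }
              }
          }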

          Li Zou added a comment -

          Thanks for the clarification of the sendRR issue.

          Since the Randomized approach is not checked in, let us skip over it.

          For the data repair, do we need to block waiting for the acks?

          Aleksey Yeschenko added a comment -

          First, let me thank you for your continued digging. Some of it helped. That said, you should probably look at the current cassandra-2.0 branch, and not the 2.0.0 tarball/branches here in the comments.

          Issue 1 – When handling DigestMismatchException in StorageProxy.fetchRows(), all data read requests are sent out using sendRR without distinguishing remote nodes from the local node.

          This is not an issue, and it's not spec retry related. Using LRR for local read requests is merely an optimisation - there is nothing wrong with sendRR (not that it isn't worth optimising here - just noting that it's not an issue). This is also the answer to "How do we handle the case for local node? Does the sendRR() and the corresponding receive part can handle the case for local node? If not, then this may block for 10 seconds." Same goes for Issue 2 and Issue 3.

          The data read request for the local node may never be sent out. Since one of the nodes is down (which is what triggered the Speculative Retry), there will be one missing response.

          The former is not true, and the latter won't happen, since the current cassandra-2.0 code sends requests to all the contacted replicas. So if a node triggered spec retry, that extra speculated replica will get the request as well, and we can still satisfy the CL.

                              for (InetAddress endpoint : exec.getContactedReplicas())
                              {
                                  Tracing.trace("Enqueuing full data read to {}", endpoint);
                                  MessagingService.instance().sendRR(message, endpoint, repairHandler);
                              }
          

          Question for the Randomized approach – Since the end points are randomized, the first node in the list is not likely to be the local node. This may cause a higher possibility of data repair.

          I don't see how the possibility of data repair is correlated with the locality of a target node, but it doesn't matter anyway. The 'randomised approach' was an experiment; it wasn't committed as part of the fix. See the latest cassandra-2.0 branch code.

          In the Randomized Approach, the end points are reshuffled, so the first node in the list used for the data read request is not likely to be the local node. If this node happens to be the DOWN node, then we end up with all digest responses and no data, which will block and eventually time out.

          See the above reply.

          TLDR: None of these seem to be issues, but we could optimise RR to use LRR for local reads to get slightly better performance for local requests (and to be consistent with the regular reads code).
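
          The optimisation mentioned in the TLDR would amount to a dispatch of roughly the following shape (a stand-alone sketch with stub types, not the committed change): route the local endpoint through a local read runnable and only the remote ones through the messaging layer.

          import java.util.List;

          final class RepairReadDispatchSketch {
              interface Endpoint { boolean isLocal(); }

              // contactedReplicas stands in for exec.getContactedReplicas() from the loop above.
              void sendFullDataReads(List<Endpoint> contactedReplicas) {
                  for (Endpoint endpoint : contactedReplicas) {
                      if (endpoint.isLocal())
                          runLocalRead(endpoint);   // LocalReadRunnable-style path, skips the socket
                      else
                          sendRR(endpoint);         // remote path, as in the existing loop
                  }
              }

              private void runLocalRead(Endpoint endpoint) { /* stub for the local read stage */ }
              private void sendRR(Endpoint endpoint)       { /* stub for MessagingService.sendRR */ }
          }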

          Li Zou added a comment - edited

          Hello Aleksey Yeschenko and Jonathan Ellis,

          It appears to me that the testing results suggest the "data read + speculative retry" path works as expected. This "data read + speculative retry" path has greatly minimized the throughput impact caused by the failure of one of the Cassandra server nodes.

          The observed small degradation of throughput when speculative retry is enabled is very likely caused by the "read repair" path. I read through the code of this path last Friday and noticed some design / coding issues that I would like to discuss with you.

          Please note that my code base is still the Cassandra 2.0.0 tarball, not updated with the latest code changes.

          Issue 1 – When handling DigestMismatchException in StorageProxy.fetchRows(), all data read requests are sent out using sendRR without distinguishing remote nodes from the local node.

          Will this cause an issue? MessagingService.instance().sendRR() sends enqueued messages to a specified remote node via its pre-established TCP socket connection; for the local node, this should be done via LocalReadRunnable, i.e. StageManager.getStage(Stage.READ).execute(new LocalReadRunnable(command, handler)).

          If this is an issue, the following wait may block.

                      // read the results for the digest mismatch retries
                      if (repairResponseHandlers != null)
                      {
                          for (int i = 0; i < repairCommands.size(); i++)
                          {
                              ReadCommand command = repairCommands.get(i);
                              ReadCallback<ReadResponse, Row> handler = repairResponseHandlers.get(i);
          
                              Row row;
                              try
                              {
                                  row = handler.get();
                              }
          

          For two reasons:

          • The data read request for the local node may never be sent out.
          • One of the nodes is down (which is what triggered the Speculative Retry), which will cause one missing response.

          If two responses are missing, this will block for 10 seconds.

          Issue 2 – For data repair, RowDataResolver.resolve() has a similar issue as it calls scheduleRepairs() to send out messages using sendRR() without distinguishing remote nodes from the local node.

          Issue 3 – When handling data repair, StorageProxy.fetchRows() blocks waiting for acks to all of the data repair requests sent out using sendRR(), which may block the thread.

          In the data repair path, data requests are sent out, the received responses are compared / merged, the merged / diff version is sent out, and then we block for acks.

          How do we handle the case of the local node? Can sendRR() and the corresponding receive path handle the local node? If not, this may block for 10 seconds.

                      if (repairResponseHandlers != null)
                      {
                          for (int i = 0; i < repairCommands.size(); i++)
                          {
                              ReadCommand command = repairCommands.get(i);
                              ReadCallback<ReadResponse, Row> handler = repairResponseHandlers.get(i);
          
                              Row row;
                              try
                              {
                                  row = handler.get();
                              }
                              catch (DigestMismatchException e)
                              ...
                              RowDataResolver resolver = (RowDataResolver)handler.resolver;
                              try
                              {
                                  // wait for the repair writes to be acknowledged, to minimize impact on any replica that's
                                  // behind on writes in case the out-of-sync row is read multiple times in quick succession
                                  FBUtilities.waitOnFutures(resolver.repairResults, DatabaseDescriptor.getWriteRpcTimeout());
                              }
                              catch (TimeoutException e)
                              {
                                  Tracing.trace("Timed out on digest mismatch retries");
                                  int blockFor = consistency_level.blockFor(Keyspace.open(command.getKeyspace()));
                                  throw new ReadTimeoutException(consistency_level, blockFor, blockFor, true);
                              }
          

          Question for waiting for the ack – Do we really need to wait for the ack?

          We should take a best-effort approach, i.e. do the data repair and then return; there is no need to block waiting for the acks as confirmation.

          Question for the Randomized approach – Since the end points are randomized, the first node in the list is not likely to be the local node. This may cause a higher possibility of data repair.

          In the Randomized Approach, the end points are reshuffled, so the first node in the list used for the data read request is not likely to be the local node. If this node happens to be the DOWN node, then we end up with all digest responses and no data, which will block and eventually time out.

          Jonathan Ellis added a comment -

          Ryan McGuire, can you also re-test the uncapped compaction scenario with the same set of retry settings?

          Jonathan Ellis added a comment -

          It looks to me like 75/90/Always are about the same, with Always dropping from a lower baseline. Which makes sense; it's still doing a lot of unnecessary work compared to the others.

          Hubert Sugeng added a comment - edited

          Definitely the best results seen so far! Nice work!

          In my mind, during the transition period right after the killing of the node, I expected "ALWAYS" to have negligible impact, or at least the smallest impact of all the values. However, red (90%) and purple (75%) are showing a smaller impact. Seems fishy. Do I misunderstand the intention of the "ALWAYS" setting?

          (Edited to clarify the period I'm talking about.)

          Ryan McGuire added a comment -

          BINGO!

          data here

          Jonathan Ellis added a comment -

          The code to convert 99 into 0.99 was buggy and was actually converting to 0.0099. Fix pushed, can you try it again?
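
          The arithmetic behind that kind of double-scaling is easy to see (hypothetical illustration only; the offending code and the fix are not shown in this ticket): dividing the configured 99 by 100 twice yields 0.0099 instead of 0.99, i.e. an absurdly low percentile threshold.

          public class PercentileScalingSketch {
              public static void main(String[] args) {
                  double configured = 99.0;                 // user-facing "99percentile"
                  double scaledOnce = configured / 100.0;   // 0.99   -- intended threshold
                  double scaledTwice = scaledOnce / 100.0;  // 0.0099 -- the buggy effective threshold
                  System.out.println(scaledOnce + " vs " + scaledTwice);
              }
          }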

          Ryan McGuire added a comment -

          Jonathan Ellis, here are your two runs:

          dea27f84f40
          ded39c7e1c2fa

          Logs for the second run are attached as 5932.ded39c7e1c2fa.logs.tar.gz

          Jonathan Ellis added a comment -

          Pretty sure that's our smoking gun. Pushed a commit to the -randomized branch that adds coordinator-level, per-cf latency tracking and uses that instead.

          Can you repeat the last test with that? (Maybe throw in ALL as well if you're feeling optimistic that we'll have a measurable difference between ALL and 90%.)

          Jonathan Ellis added a comment -

          Maybe the problem is that we're using CF-level latency instead of StorageProxy.

          What does cfhistograms give for read latency?

          Li Zou added a comment -

          I did some more code reading and noticed some potential issues and possible improvement. I've got to run now. I will get back to you guys Monday morning.

          My guess is that the Speculative NONE case is hurt on the initial read path, which Speculative Retry successfully resolves. The throughput hit observed when Speculative Retry is enabled is caused by the ReadRepair path, which has some coding / design issues. I will talk to you next Monday.

          Jonathan Ellis added a comment -

          Starting to think we still have a bug. 99.9 should be doing fewer retries than 10ms, but the graph shows it doing more.

          Ryan McGuire added a comment -

          Here are the 99th and 99.9th percentiles:

          I'll look at that stress patch again; I seem to recall it not making a lot of sense to me when I last tried it, but I will give it another go.

          Jonathan Ellis added a comment -

          Also, did that stress patch work to get you failed request counts? Would be good to get that too if we can show that even 10ms keeps requests from failing entirely.

          Jonathan Ellis added a comment -

          Yeah, that makes sense. 70th..95th are all pretty damn close to median still.

          What I'd like to do is get close to the 10ms performance hit (~none) as a percentile, and make that default in 2.1. Try 99th and 99.9th?

          Ryan McGuire added a comment - edited

          Seems like it's quite tunable, but not a lot of difference under 90%:

          data here

          Jonathan Ellis added a comment -

          That's awesome.

          Can you test 90th and 75th percentile too?

          Ryan McGuire added a comment - edited

          This looks exactly like what I was expecting:

          data here

          Jonathan Ellis added a comment -

          That still gives you luck-of-the-draw as to which replica it prefers. (Unlikely to be evenly distributed.)

          Brandon Williams added a comment -

          Couldn't we just do a run with the dsnitch disabled?

          Jonathan Ellis added a comment -

          Hmm.

          I wonder if it's just luck of the draw as to which replica dsnitch is preferring. Here's a branch to randomize that, per-operation:

          https://github.com/jbellis/cassandra/commits/5932-randomized

          Ryan McGuire added a comment -

          The other thing I note is that all of these runs are better than 1.2.8 (further evidence that CASSANDRA-5933 may be invalid).

          Jeremiah Jordan added a comment -

          Yeah. The graphs for ALWAYS and NONE look swapped from what I would expect.

          Ryan McGuire added a comment - edited

          The good news is that speculative read has improved across the board.

          However, this new batch of testing introduces some new mysteries.

          Here are all of the runs from 7a87fc1186f39678382cf9b3e1dd224d9c71aead:

          All of the speculative retry runs are better than with 2.0.0-rc1. However, I can't explain why sr=NONE did better than ALWAYS and 95percentile. There is no visible indication that a node went down for sr=NONE. I have double checked the logs, and it did, in fact, go down.

          Compare this to the baseline of 1.2.8 and 2.0.0-rc1 (redone last night on same hardware as above):

          All of these have clear indications of the node going down.

          You can see all the data here - you can double click the colored squares to toggle the visibility of the lines, as they do overlap.

          I've uploaded logs from all these runs as 5933-logs.tar.gz.

          Jonathan Ellis added a comment -

          I see what you mean. Fixed in 7a87fc1186f39678382cf9b3e1dd224d9c71aead.

          Li Zou added a comment -

          The logic for AlwaysSpeculatingReadExecutor is good. What I meant in my previous comment is that when targetReplicas.size() == allReplicas.size() and targetReplicas.size() == 1, then AlwaysSpeculatingReadExecutor.executeAsync() will throw an exception as there is only one endpoint in targetReplicas, but it tries to access two endpoints in targetReplicas.
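
          A guard of roughly the shape below would avoid that boundary case (a hypothetical sketch with stub methods, not the actual AlwaysSpeculatingReadExecutor): with a single-element targetReplicas list, unconditionally addressing index 1 throws, so the second data read has to be conditional.

          import java.util.List;

          final class AlwaysSpeculatingSketch {
              // targetReplicas stands in for the executor's replica list; sendDataRequest and
              // sendDigestRequests are stubs for the messaging calls.
              void executeAsync(List<String> targetReplicas) {
                  sendDataRequest(targetReplicas.get(0));                // always one data read
                  if (targetReplicas.size() > 1)
                      sendDataRequest(targetReplicas.get(1));            // second data read only if a
                                                                         // second replica actually exists
                  if (targetReplicas.size() > 2)
                      sendDigestRequests(targetReplicas.subList(2, targetReplicas.size()));
              }

              private void sendDataRequest(String endpoint)            { /* stub */ }
              private void sendDigestRequests(List<String> endpoints)  { /* stub */ }
          }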

          Jonathan Ellis added a comment -

          (Committed after Aleksey's +1, incidentally.)

          Jonathan Ellis added a comment -

          The logic looks like this:

          1. Figure out how many replicas we need to contact to satisfy the desired consistencyLevel + Read Repair settings
          2. If that ends up being all the replicas, then use ASRE to get some redundancy on the data reads. This will allow the read to succeed even if a digest for RR times out. Of course if you are reading at CL.ALL and a replica times out there's nothing we can do.
          3. Otherwise, use SRE and make an "extra" request later, if it looks like one of the minimal set isn't going to respond in time

          Note that performing extra data requests does not affect handler.blockfor – just makes it possible for the request to proceed if it gets enough responses back, no matter which replicas they come from.
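
          In code form, that selection reduces to something like the sketch below (the names are stand-ins; the real logic lives in AbstractReadExecutor.getReadExecutor(), and the never-speculating fallback corresponds to speculative_retry = NONE):

          import java.util.List;

          final class ReadExecutorSelectionSketch {
              enum Executor { NEVER_SPECULATING, SPECULATING, ALWAYS_SPECULATING }

              // targetReplicas = replicas needed for the CL plus read-repair extras (step 1);
              // retryConfigured = the table has a speculative_retry threshold set.
              static Executor choose(List<String> targetReplicas, List<String> allReplicas,
                                     boolean retryConfigured) {
                  if (targetReplicas.size() == allReplicas.size())
                      return Executor.ALWAYS_SPECULATING;   // step 2: already contacting everyone,
                                                            // so make the extra read a data read
                  if (retryConfigured)
                      return Executor.SPECULATING;          // step 3: send one extra request later
                                                            // if the minimal set looks slow
                  return Executor.NEVER_SPECULATING;
              }
          }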

          Li Zou added a comment -

          Hello Aleksey Yeschenko and Jonathan Ellis,

          I took a quick look at the code changes. The new code looks very good to me, but I see one potential issue in AlwaysSpeculatingReadExecutor.executeAsync(), in which it makes at least two data / digest requests. This will cause problems for a data center with only one Cassandra server node (e.g. bringing up an embedded Cassandra node in the JVM for a JUnit test) or a production data center of two Cassandra server nodes with one node shut down for maintenance. In both of those cases, AbstractReadExecutor.getReadExecutor() will return the AlwaysSpeculatingReadExecutor because the condition (targetReplicas.size() == allReplicas.size()) is met, whether or not the tables are configured with speculative ALWAYS.

          It is true that for our legacy products we are considering deploying each data center with only two Cassandra server nodes, with RF = 2 and CL = 1.

          Aleksey Yeschenko added a comment -

          +1, I'm out of OCD juice.

          Jonathan Ellis added a comment -

          Pushed one more set of changes to mine, not forced: https://github.com/jbellis/cassandra/commits/5932. Goal is to make SRE less fragile when doing RR.

          Aleksey Yeschenko added a comment -

          Force-pushed the 'final' version to https://github.com/iamaleksey/cassandra/commits/5932.

          Among other things, properly handles RRD.GLOBAL and RRD.DC_LOCAL in 1-DC scenario.

          Aleksey Yeschenko added a comment -

          Pushed even more OCD to https://github.com/iamaleksey/cassandra/commits/5932 on top of yours.

          I'm not sure what the distinction is here. Do you mean that if we weren't read-repairing, there would be no extra data request at all?

          For example, instead of turning 1 data request + 2 digest requests into 2 data requests + 2 digest requests, ALWAYS was turning it into 2 data requests + 1 digest request, so it was not really helping to satisfy the CL in case of a node failure.

          I dunno, I think we should turn a digest into a data for redundancy the way ALWAYS used to.

          Maybe.

          Jonathan Ellis added a comment -

          Pushed some OCD of my own to https://github.com/jbellis/cassandra/commits/5932 on top of this.

          ALWAYS wasn't making an extra request, it was making an extra data request at the expense of one digest request

          I'm not sure what the distinction is here. Do you mean that if we weren't read-repairing, there would be no extra data request at all?

          SpecRetry w/ RRD.GLOBAL is a noop, you can't speculate if you contact all the replicas in the first place.

          I dunno, I think we should turn a digest into a data for redundancy the way ALWAYS used to.
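
          For concreteness, a rough sketch of what "turn a digest into a data" would mean at request-planning time (the names and structure here are made up for illustration, not the actual patch):

          import java.util.LinkedHashMap;
          import java.util.List;
          import java.util.Map;

          // Made-up request-planning sketch; not the actual patch.
          public class DigestToDataSketch
          {
              enum RequestType { DATA, DIGEST }

              // One request per contacted replica. The first replica always gets a full data read;
              // with "always speculate", the second replica is upgraded from DIGEST to DATA so that a
              // usable full response is already in flight if the data replica dies mid-request.
              static Map<String, RequestType> plan(List<String> replicas, boolean alwaysSpeculate)
              {
                  Map<String, RequestType> requests = new LinkedHashMap<>();
                  for (int i = 0; i < replicas.size(); i++)
                  {
                      boolean fullRead = (i == 0) || (alwaysSpeculate && i == 1);
                      requests.put(replicas.get(i), fullRead ? RequestType.DATA : RequestType.DIGEST);
                  }
                  return requests;
              }

              public static void main(String[] args)
              {
                  List<String> replicas = List.of("node1", "node2", "node3");
                  System.out.println(plan(replicas, false)); // {node1=DATA, node2=DIGEST, node3=DIGEST}
                  System.out.println(plan(replicas, true));  // {node1=DATA, node2=DATA, node3=DIGEST}
              }
          }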

          Aleksey Yeschenko added a comment -

          Attaching 5932.txt that will hopefully fix this (Ryan McGuire could you run the tests again, please, with the patch applied?)

          1. As noted by Li Zou and sankalp kohli, ALWAYS wasn't making an extra request, it was making an extra data request at the expense of one digest request. Fixed.

          2. SpecRetry wasn't working correctly with RRD.DC_LOCAL, as noted by @lizou, because the two lists could end up in a different order, and a retry might be sent to a node that had already had a request sent to it (see the sketch after this list). (Please note that LOCAL_QUORUM here does not affect anything - CL.filterForQuery() sorts in place, so the two lists would be in the same order and everything was working correctly.) RRD.DC_LOCAL handling was a legit issue, though. Fixed.

          3. SpecRetry w/ RRD.GLOBAL is a noop, you can't speculate if you contact all the replicas in the first place. This is normal.

          4. The DME issue is semi-legit. Killing a node shouldn't trigger DME or increase the likelihood of DME happening. HOWEVER when shooting requests for repair, we were not considering the case where one of the replies satisfying the original CL came from a SpecRetry attempt. The patch includes the extra replica in repair commands if SpecRetry had been triggered by the original request.

          5. SP.getRangeSlice() is not SpecRetry-aware as of now. I don't know if this is an omission or by design, but for now, please don't include that in the benchmarks, since it would only be misleading.
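
          As a concrete illustration of the ordering problem in point 2, a minimal sketch (method and variable names are made up): pick the speculative-retry target as the first replica that has not already been contacted, rather than indexing into a second, possibly differently ordered list.

          import java.util.List;
          import java.util.Optional;
          import java.util.Set;

          // Illustrative only; not the actual patch.
          public class RetryTargetSketch
          {
              // The speculative extra request should go to a replica that has not been contacted yet.
              static Optional<String> pickRetryTarget(List<String> liveReplicas, Set<String> alreadyContacted)
              {
                  return liveReplicas.stream()
                                     .filter(replica -> !alreadyContacted.contains(replica))
                                     .findFirst();
              }

              public static void main(String[] args)
              {
                  // With RRD.DC_LOCAL the contacted set and the candidate list can end up ordered
                  // differently, so "take the next index" could pick node2, which was already asked.
                  List<String> liveReplicas = List.of("node3", "node2", "node1");
                  Set<String> contacted = Set.of("node1", "node2");
                  System.out.println(pickRetryTarget(liveReplicas, contacted)); // Optional[node3]
              }
          }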

          Aleksey Yeschenko added a comment -

          Hey Li Zou. Yeah, I've fixed most of these already (rewritten most of the ARE code, actually). Specifically issues 2,3,4. Will look into 1 too.

          Thanks.

          Li Zou added a comment -

          Hello Aleksey Yeschenko,

          Thanks for the link to this jira and for your very detailed testing results. It confirms what we have seen in our lab testing for the Cassandra 2.0.0-rc2 "Speculative Execution for Reads".

          We have a very simple data center setup consisting of four Cassandra nodes running on four server machines. A testing application (Cassandra client) is interacting with Cassandra nodes 1, 2 and 3. That is, the testing app is not directly connected to Cassandra node 4.

          The keyspace Replication Factor is set to 3 and the client requested Consistency Level is set to CL_TWO.

          I have tested all three configurations of Speculative Execution for Reads ('ALWAYS', '85 PERCENTILE', '50 MS' / '100 MS'). It seems that none of them works as expected. From the test app log file's point of view, they all give a 20-second window of outage immediately after the 4th node is killed. This behavior is consistent with Cassandra 1.2.4.

          I have done a quick code reading of the Cassandra Server implementation (Cassandra 2.0.0 tarball) and I have noticed some design issues. I would like to discuss them with you.

          Issue 1 - StorageProxy.fetchRows() may still block for as long as conf.read_request_timeout_in_ms, though the speculative retry did fire correctly after the Cassandra node 4 was killed.

          Take the speculative configuration of 'PERCENTILE' / 'CUSTOM' as an example: after Cassandra node 4 was killed, SpeculativeReadExecutor.speculate() would block for responses. If it timed out, it would send out one more read request to an alternative node (from unfiltered) and increment the speculativeRetry counter. This part should work.

          However, killing the 4th node will very likely cause inconsistency in the database, and this will trigger a DigestMismatchException. In fetchRows(), when handling the DigestMismatchException, it uses handler.endpoints to send out digest-mismatch retries and then blocks for responses. Since one of those endpoints has already been killed, handler.get() will block until it times out, which is 10 seconds.

                          catch (DigestMismatchException ex)
                          {
                              Tracing.trace("Digest mismatch: {}", ex);
          
                              ...
          
                              MessageOut<ReadCommand> message = exec.command.createMessage();
                              for (InetAddress endpoint : exec.handler.endpoints)
                              {
                                  Tracing.trace("Enqueuing full data read to {}", endpoint);
                                  MessagingService.instance().sendRR(message, endpoint, repairHandler);
                              }
                          }
                      }
          
                      ...
          
                      // read the results for the digest mismatch retries
                      if (repairResponseHandlers != null)
                      {
                          for (int i = 0; i < repairCommands.size(); i++)
                          {
                              ReadCommand command = repairCommands.get(i);
                              ReadCallback<ReadResponse, Row> handler = repairResponseHandlers.get(i);
          
                              Row row;
                              try
                              {
                                  row = handler.get();
                              }
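
          To see why this blocks for the full timeout, here is a toy model of the behaviour described above (the endpoint names and the "handler" are invented; only the shape of the problem is shown): the repair read waits for a response from every endpoint it was given, so one dead endpoint holds it for the whole read_request_timeout_in_ms.

          import java.util.List;
          import java.util.concurrent.CountDownLatch;
          import java.util.concurrent.TimeUnit;

          // Toy model, not Cassandra code: a "repair handler" that needs one response per endpoint.
          public class RepairBlockingSketch
          {
              public static void main(String[] args) throws InterruptedException
              {
                  List<String> endpoints = List.of("node1", "node2", "node4"); // node4 was killed
                  CountDownLatch responses = new CountDownLatch(endpoints.size());

                  for (String endpoint : endpoints)
                  {
                      new Thread(() -> {
                          if (!endpoint.equals("node4"))
                              responses.countDown(); // live nodes answer almost immediately
                          // the killed node never responds at all
                      }).start();
                  }

                  long start = System.nanoTime();
                  // Analogous to handler.get(): block until every response arrives or the
                  // 10-second read_request_timeout_in_ms expires.
                  boolean complete = responses.await(10, TimeUnit.SECONDS);
                  long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
                  System.out.println("complete=" + complete + " after ~" + elapsedMs + " ms");
                  // Prints complete=false after ~10000 ms: the full-data repair read is stuck on the
                  // dead endpoint even though the live replicas responded right away.
              }
          }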
          

          Issue 2 - The speculative 'ALWAYS' does NOT send out any more read requests. Thus, in the face of the failure of node 4, it will not help at all.

          SpeculateAlwaysExecutor.executeAsync() only sends out handler.endpoints.size() read requests and then blocks for the responses to come back. If one of the nodes is killed, say node 4, speculative retry 'ALWAYS' behaves the same way as Cassandra 1.2.4, i.e. it blocks until it times out, which is 10 seconds.

          My understanding is that speculative retry 'ALWAYS' should always send out "handler.endpoints.size() + 1" read requests and block for handler.endpoints.size() responses.
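
          A rough sketch of that idea, as the counterpart of the toy above (again, names and numbers are invented): send one extra request up front, but only block for the original number of responses, so a single dead node no longer gates the read.

          import java.util.ArrayList;
          import java.util.List;
          import java.util.concurrent.CountDownLatch;
          import java.util.concurrent.TimeUnit;

          // Toy model only: "send endpoints.size() + 1 requests, block for endpoints.size() responses".
          public class SendExtraWaitForNSketch
          {
              public static void main(String[] args) throws InterruptedException
              {
                  List<String> endpoints = List.of("node1", "node2", "node4"); // node4 is dead
                  int responsesNeeded = endpoints.size();                      // still need 3 responses

                  List<String> contacted = new ArrayList<>(endpoints);
                  contacted.add("node3");                                      // the extra, speculative request

                  CountDownLatch responses = new CountDownLatch(responsesNeeded);
                  for (String endpoint : contacted)
                  {
                      new Thread(() -> {
                          if (!endpoint.equals("node4"))
                              responses.countDown();                           // node1, node2, node3 answer
                      }).start();
                  }

                  boolean satisfied = responses.await(10, TimeUnit.SECONDS);
                  System.out.println("satisfied=" + satisfied);                // true, almost immediately
              }
          }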

          Issue 3 - Since the ReadRepairDecision is determined by a Random() number, this speculative retry may not work, as the ReadRepairDecision may be ReadRepairDecision.GLOBAL.

          Issue 4 - For the ReadExecutor(s), this.unfiltered and this.endpoints may not be consistent. Thus, using this.unfiltered and this.endpoints for speculative retry may cause unexpected results. This is especially true when the Consistency Level is LOCAL_QUORUM and the ReadRepairDecision is DC_LOCAL.

          Hubert Sugeng added a comment -

          Thanks, Ryan McGuire. Sounds like you are seeing the same thing as us, so it's great to see it's getting attention!

          sankalp kohli added a comment -

          SpeculateAlwaysExecutor - Here we are not reading from more endpoints than normal. We are only reading data from two endpoints. We should be reading from one more endpoint if possible.

          Ryan McGuire added a comment -

          I'm using 5-second intervals in these charts. 'Multi-second stress-client outage' is a good way to put it: for both the speculative-retry and non-speculative-retry cases, the drop in performance after a node goes down is a period of complete non-responsiveness (not degraded performance). The addition of speculative retry consistently shortens this duration (it's always better), but the duration itself is inconsistent.

          Hubert Sugeng added a comment -

          Ryan McGuire, what was your collection interval for the metrics you've collected?

          I ask because I'm observing results similar to yours for "3) Eager Reads tend to lessen the immediate performance impact of a node going down, but not consistently." However, I've polled metrics at a 1-second granularity and can see that it's actually a multi-second stress-client outage - not just poor and inconsistent performance.

          Polling metrics at a 1-second interval shows that a ~20-second read-operation starvation outage occurs for the stress client across all data in the cluster (even with the lowest phi_convict_threshold=6).

          Analysis so far indicates that high-rate reads starve out all the C* client threads/connections, because they get stuck awaiting a server response whenever a request hits the key range owned by the node that is down (and given the probabilities at a high read rate, within 1 second every stress client thread will have hit the downed node's key range).
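
          To illustrate that mechanism with a toy model (the pool size, latencies, and the 25% dead-node hit rate are all made-up numbers, not measurements):

          import java.util.concurrent.ExecutorService;
          import java.util.concurrent.Executors;
          import java.util.concurrent.atomic.AtomicInteger;

          // Toy model of the client-side starvation described above; not driver or stress-tool code.
          public class ClientStarvationSketch
          {
              public static void main(String[] args) throws InterruptedException
              {
                  ExecutorService clientThreads = Executors.newFixedThreadPool(4); // small client pool
                  AtomicInteger completed = new AtomicInteger();

                  for (int i = 0; i < 100; i++)
                  {
                      final int request = i;
                      clientThreads.submit(() -> {
                          try
                          {
                              if (request % 4 == 3)
                                  Thread.sleep(10_000); // ~25% of requests hit the dead node and hang until the timeout
                              else
                                  Thread.sleep(5);      // healthy replicas answer in a few milliseconds
                              completed.incrementAndGet();
                          }
                          catch (InterruptedException e)
                          {
                              Thread.currentThread().interrupt();
                          }
                      });
                  }

                  Thread.sleep(2_000);
                  // Roughly a dozen completions instead of ~100: every worker thread quickly picks up a
                  // "dead node" request and parks, so throughput collapses even though 75% of keys are fine.
                  System.out.println("completed after 2s: " + completed.get());
                  clientThreads.shutdownNow();
              }
          }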

          So I'm confirming that I'm also seeing this bug: speculative reads (even with the ALWAYS setting) are not solving this outage for clients during high-rate reads, and based on what I understand of the feature, they should.

          Thanks, guys!


            People

            • Assignee: Aleksey Yeschenko
            • Reporter: Ryan McGuire
            • Reviewer: Jonathan Ellis
            • Votes: 1
            • Watchers: 12
