CASSANDRA-4705

Speculative execution for reads / eager retries

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: 2.0 beta 1
    • Component/s: None
    • Labels: None

      Description

      When read_repair is not 1.0, we send the request to only one node for some of the requests. When that node goes down or is too busy, the client has to wait for the timeout before it can retry.

      It would be nice to watch the latency and execute an additional request to a different node if the response is not received within the average/99th percentile of the response times recorded in the past.

      CASSANDRA-2540 might be able to solve the variance when read_repair is set to 1.0.

      1) Maybe we need to use metrics-core to record the various percentiles.
      2) Modify ReadCallback.get to execute an additional request speculatively.
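
      For illustration, a minimal sketch of the idea in (1) and (2). This is not the committed implementation: the callback and the send helper are stand-ins (a CountDownLatch and a Runnable), and only the metrics-core calls are real.

      import java.util.concurrent.CountDownLatch;
      import java.util.concurrent.TimeUnit;

      import com.yammer.metrics.Metrics;
      import com.yammer.metrics.core.Histogram;

      // Hypothetical sketch: record past read latencies and, if the primary replica has not
      // answered within the 99th percentile of those latencies, send one extra ("speculative")
      // request to a different replica before waiting out the full rpc timeout.
      public class SpeculativeReadSketch
      {
          private final Histogram readLatencyMicros =
                  Metrics.newHistogram(SpeculativeReadSketch.class, "ReadLatencyMicros", true);

          public void recordLatency(long micros)
          {
              readLatencyMicros.update(micros);
          }

          // 'responded' stands in for ReadCallback's condition; 'speculate' stands in for
          // sending the duplicate request to another live replica.
          public void awaitWithSpeculation(CountDownLatch responded, Runnable speculate) throws InterruptedException
          {
              long p99Micros = (long) readLatencyMicros.getSnapshot().get99thPercentile();
              if (!responded.await(p99Micros, TimeUnit.MICROSECONDS))
                  speculate.run();
              // ...then continue waiting up to the normal rpc timeout, exactly as today.
          }
      }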


          Activity

          Jonathan Ellis added a comment -

          No.

          Curtis Caravone added a comment -

          Is there any plan to backport this to the 1.2 branch?

          Li Zou added a comment -

          Aleksey Yeschenko please see my comments at CASSANDRA-5932

          Aleksey Yeschenko added a comment -

          Li Zou See CASSANDRA-5932
          Li Zou added a comment -

          Hello Vijay and Jonathan Ellis,

          My company is adopting the Cassandra db technology and we are very much interested in this new feature of "Speculative Execution for Reads", as it could potentially help reduce the overall outage window to the sub-second range upon the failure of any one of the Cassandra nodes in a data center.

          I have recently tested the "Speculative Execution for Reads" feature using Cassandra 2.0.0-rc2 and I have not seen the expected results. In my tests, I have already excluded possible effects from the client (connection pool) side. I am not sure what is still missing in my tests.

          Here is my setup for the tests. One data center of four Cassandra nodes is configured on machines A, B, C and D, with each physical machine running one Cassandra node. My testing app (Cassandra client) is running on machine A and it is configured to connect to Cassandra nodes A, B, and C only. In other words, my testing app will never connect to Cassandra node D.

          • Replication Factor (RF) is set to 3
          • Client requested Consistency Level (CL) is set to CL_TWO

          For Cassandra 1.2.4, upon the failure of Cassandra node D (via "kill -9 <pid>" or "kill -s SIGKILL <pid>"), there will be an outage window for around 20 seconds with zero transactions. The outage ends as soon as the gossip protocol detects the failure of Cassandra node D and marks it down.

          For Cassandra 2.0.0-rc2, database tables are configured with proper "speculative_retry" values (such as 'ALWAYS', '10 ms', '100 ms', '80 percentile', 'NONE'). The expected results are not observed in the tests. The testing results are quite similar to those observed for Cassandra 1.2.4. That is, upon the failure of Cassandra node D, there will be an outage window of around 20 seconds with zero transactions. The outage ends as soon as the gossip protocol detects the failure of Cassandra node D and marks it down. The tested values for speculative_retry against database tables are listed below:

          • ALWAYS
          • 10 ms
          • 100 ms
          • 80 percentile
          • NONE

          In my tests, I also checked the JMX stats of Cassandra node A for the speculative_retry of each table. What I observed really surprised me. For instance, with speculative_retry (for each table) set to '20 ms', during normal operations with all four Cassandra nodes up, the "SpeculativeRetry" count went up occasionally. However, during the 20-second outage window, the "SpeculativeRetry" count somehow never went up, for three tests in a row. On the contrary, I would expect lots of speculative retries to be executed (say on Cassandra node A) during the 20-second outage window.
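
          For what it is worth, a minimal sketch of polling such a per-table counter over JMX while a test runs. The host, keyspace, table, and especially the MBean name below are guesses/placeholders and may need adjusting against the actual 2.0.0-rc2 metric layout; only the javax.management calls themselves are standard.

          import javax.management.MBeanServerConnection;
          import javax.management.ObjectName;
          import javax.management.remote.JMXConnector;
          import javax.management.remote.JMXConnectorFactory;
          import javax.management.remote.JMXServiceURL;

          // Hypothetical sketch: read the speculative retry counter of one table from one node.
          public class SpeculativeRetryPoller
          {
              public static void main(String[] args) throws Exception
              {
                  // "nodeA", "myks" and "mytable" are placeholders; 7199 is the usual JMX port.
                  JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://nodeA:7199/jmxrmi");
                  JMXConnector connector = JMXConnectorFactory.connect(url);
                  MBeanServerConnection mbs = connector.getMBeanServerConnection();
                  // The ObjectName below is an assumption about how the metric is exposed.
                  ObjectName counter = new ObjectName(
                          "org.apache.cassandra.metrics:type=ColumnFamily,keyspace=myks,scope=mytable,name=SpeculativeRetries");
                  System.out.println("SpeculativeRetries = " + mbs.getAttribute(counter, "Count"));
                  connector.close();
              }
          }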

          Some notes on the "kill" signals. In my tests, if Cassandra node D is killed using SIGTERM (for Cassandra 1.2.4) or SIGINT / SIGTERM (for Cassandra 2.0.0-rc2), there is no observed outage in the second range, as Cassandra node D does an orderly shutdown and the gossip announces the node "down". From my testing app's point of view, everything goes very well.

          But if the Cassandra node D is killed using SIGKILL, there will be an outage with zero transactions. I guess this might be a TCP socket related issue. I often observed TCP socket "CLOSE_WAIT" on the passive close side (i.e. Cassandra node A, B, and C) and TCP socket TIME_WAIT on the active close side (Cassandra node D).

          Vijay added a comment -

          Stress tool

          Wesley Chow added a comment -

          Thanks for the fast reply.

          Did you use YCSB to test at all? Or did you just use the stress tool?

          Vijay added a comment -

          Hey Wesley, I don't have it anymore, but you can simulate it by running the stress tool and performing manual GCs (with some huge heap settings) on some of the servers, or shutting some of them down... You will obviously see reduced throughput because duplicate requests are sent... but your latency with SPE should be better than with none!

          Wesley Chow added a comment -

          Hi Vijay,

          What sort of benchmarks have you done with this? I'm exploring some possible ways to improve speculative reads, but would like to see first what sort of performance gains you've already measured.

          Thanks!

          Vijay added a comment -

          Committed to trunk!

          Jonathan Ellis added a comment -

          Can you rebase + commit?

          Vijay added a comment -

          LGTM +1

          (will create a ticket for SimpleCondition with some benchmarks (Possibly) )

          Jonathan Ellis added a comment (edited) -

          (Also, changing SimpleCondition scares me a bit so let's pull that out to a separate ticket if it's necessary – I took out those changes in my branch.)

          Jonathan Ellis added a comment -

          Thanks, that makes it a lot easier to follow the refactor.

          v6 pushed to https://github.com/jbellis/cassandra/branches/4705-v6 with a bunch of changes. The biggest is a simplified SpeculativeReadExecutor.speculate, under the reasoning that if an earlier request hasn't been serviced yet on a given replica, sending a second request to the same one probably won't help. Also, I realized sorting the speculation wasn't necessary after all. LMKWYT.

          Vijay added a comment -

          Hi Jonathan, sorry, I missed the mail notification on the comment, hence the delay.
          I have also pushed the updates to https://github.com/Vijay2win/cassandra/tree/4705-v5

          Should split latency tracking into, at least, single-row reads vs index/seq scans.

          This patch doesn't cover the index scans for now, but I will create a ticket to include those (the reason being that it needs more refactoring to track the info); hope that's fine.

          Making the extra call to isSignaled in RC.get is probably a pessimization since it is also synchronized

          Honestly, I am not sure why isSignaled is synchronized and set is not volatile; I fixed it, hope that's fine.

          Done the rest as suggested.

          Thanks!

          Jonathan Ellis added a comment -

          1. Can you pull the AbstractReadExecutor refactor into a separate commit? That is, just the introduction of ARE and DRE, then SRE would be added with the rest of the changes here in the "main" commit.

          2. This looks problematic to me, since if the first callback is high-latency we won't send out extra requests for low-latency callbacks promptly.

              for (AbstractReadExecutor exec : readCallbacks)
                  exec.speculate();

          Would just sorting by expected latency be enough to fix this?
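
          A minimal sketch of the sorting idea; expectedLatencyMicros() is a hypothetical accessor, and only AbstractReadExecutor.speculate() comes from the patch under review:

          import java.util.Collections;
          import java.util.Comparator;
          import java.util.List;

          // Hypothetical sketch: speculate for the executors expected to answer soonest first,
          // so a slow first callback does not delay extra requests for the fast ones.
          final class SpeculationOrder
          {
              static void speculateInLatencyOrder(List<AbstractReadExecutor> executors)
              {
                  Collections.sort(executors, new Comparator<AbstractReadExecutor>()
                  {
                      public int compare(AbstractReadExecutor a, AbstractReadExecutor b)
                      {
                          // expectedLatencyMicros() is assumed here, not part of the patch
                          return Long.compare(a.expectedLatencyMicros(), b.expectedLatencyMicros());
                      }
                  });
                  for (AbstractReadExecutor exec : executors)
                      exec.speculate();
              }
          }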

          3. Should split latency tracking into, at least, single-row reads vs index/seq scans. Can we go a step farther and track by PreparedStatement? (Thrift singlerow/scan ops would have to be lumped into one bucket each, still.) This can be pushed into a separate ticket.

          Nits:

          • ReadCallback.get only appears to be called with command.timeout, so pulling that out into a parameter looks like premature generalization
          • Making the extra call to isSignaled in RC.get is probably a pessimization since it is also synchronized
          • missing @Override annotations for ARE subclasses
          Vijay added a comment -

          Hi Jonathan, I pushed it to https://github.com/Vijay2win/cassandra/tree/4705-v4; let me know, I can also rebase. Thanks!

          Jonathan Ellis added a comment -

          Can you rebase or publish github branch?

          Vijay added a comment -

          Hi Jonathan, I think the attached patch covers all the previous concerns, except the timeout being in micros. I created #5014 to convert the timeouts to microseconds. Thanks!

          Jonathan Ellis added a comment -

          Okay, let's leave UpdateSampleLatencies alone (although as a matter of style I'd prefer to inline it as an anonymous Runnable).

          Thinking more about the core functionality:

          • a RetryType of "one pre-emptive redundant data read" would be a useful alternative to ALL. (If supporting both makes things more complex, I would vote for just supporting the single extra read.) E.g., for a CL.ONE read it would perform two data reads; for CL.QUORUM it would perform two data reads and a digest read. Put another way, it would do the same extra data read Xpercentile would, but it would do it ahead of the threshold timeout.
          • ISTM we should continue to use RDR for normal (non-RR) SR reads, and just accept the first data reply that comes back without comparing it to others. This makes the most sense to me semantically, and keeps CL.ONE reads lightweight.
          • I think it's incorrect (again, in the non-RR case) to perform a data read against the same host we sent a digest read to. Consider CL.QUORUM: I send a data read to replica X and a digest to replica Y. X is slow to respond. Doing a data read to Y won't help, since I need both to meet my CL. I have to do my SR read to replica Z, if one exists and is alive.
          • We should probably extend this to doing extra digest reads for CL > ONE, when we get the data read back quickly but the digest read is slow.
          • SR + RR is the tricky part... this is where SR could result in data and digests from the same host. So ideally, we want the ability to compare (potentially) multiple data reads, and multiple digests, and track the source for CL purposes, which neither RDR nor RRR is equipped to do. Perhaps we should just force all reads to data reads for SR + RR [or even for all RR reads], to simplify this.

          Finally,

          • millis may be too coarse a grain here, especially for Custom settings. Currently an in-memory read will typically be under 2ms and it's quite possible we can get that down to 1 if we can purge some of the latency between stages. Might as well use micros since Timer gives it to us for free, right?
          Vijay added a comment (edited) -

          Hi Jonathan, Sorry for the delay.

          Would it make more sense to have getReadLatencyRate and UpdateSampleLatencies into SR? that way we could replace case statements with polymorphism.

          The problem is that we have to perform the expensive percentile calculation asynchronously using a scheduled TPE. We could avoid the switch by introducing an additional SRFactory which would initialize the TPE as the CF settings change? Let me know.
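
          As an illustration of the async calculation described above (a sketch only; the class and field names are made up, and only the metrics-core and java.util.concurrent calls are real):

          import java.util.concurrent.Executors;
          import java.util.concurrent.ScheduledExecutorService;
          import java.util.concurrent.TimeUnit;

          import com.yammer.metrics.core.Histogram;

          // Hypothetical sketch: recompute the (relatively expensive) percentile on a schedule
          // instead of on every read, and publish the result through a volatile field.
          public class CachedRetryThreshold
          {
              private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
              private volatile long thresholdMicros = Long.MAX_VALUE; // effectively NONE until the first sample

              public CachedRetryThreshold(final Histogram readLatencyMicros, final double quantile)
              {
                  scheduler.scheduleAtFixedRate(new Runnable()
                  {
                      public void run()
                      {
                          // Snapshot.getValue returns the recorded latency at the requested quantile
                          thresholdMicros = (long) readLatencyMicros.getSnapshot().getValue(quantile);
                      }
                  }, 1, 1, TimeUnit.SECONDS);
              }

              public long thresholdMicros()
              {
                  return thresholdMicros;
              }
          }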

          Why does preprocess return a boolean now?

          The current patch uses the boolean to indicate whether the processing was done or not... it is used by RCB, after the patch, when more than one response is received by the coordinator from the same host (when SR is on and the actual read response gets back at the same time as the speculated response); we should not count that towards the consistency level.

          How does/should SR interact with RR? Using ALL + RRR

          Currently we do an additional read to double-check if we need to write; I thought the goal for ALL was to eliminate that and do an additional write instead... In most cases it will be a memtable update.
          I can think of 2 options:
          1) Just document the ALL case and live with the additional writes; it might not be a big issue for most cases, and for the rest the user can switch to the default behavior.
          2) We can queue the repair mutations, and in the async thread we can check if there are duplicate mutations pending... if yes, then we can just ignore the duplicates. This can be done by doing sendRR and adding the CF to be repaired to a HashSet (it takes an additional memory footprint).

          Should we move this discussion to a different ticket?

          Let me know, Thanks!

          Jonathan Ellis added a comment -

          avro is just used for upgrading from 1.0 schemas, so we shouldn't need to touch that anymore.

          Would it make more sense to have getReadLatencyRate and UpdateSampleLatencies into SR? that way we could replace case statements with polymorphism.

          Can you split the AbstractReadExecutor refactor out from the speculative execution code? That would make it easier to isolate the changes in review.

          Why does preprocess return a boolean now?

          How does/should SR interact with RR? Using ALL + RRR means we're probably going to do a lot of unnecessary "repair" writes in a high-update environment (i.e., it would be normal for one replica to be slightly behind others on a read), which is probably not what we want. Also unclear to me what happens when we use RDR and do a SR when we've also requested extra digests for RR, and we get a data read and a digest from the same replica.

          Vijay added a comment -

          The attached patch does:

          • ALL
          • Xpercentile
          • Xms
          • NONE

          Optionally we might also need to rename RowRepairResolver to RowAllDataResolver or something.

          Jonathan Ellis added a comment -

          our history has been that sooner or later someone always wants fractional ms, but I'm fine w/ long (or int)

          Vijay added a comment -

          Cool, let me work on the patch soon...

          are both doubles?

          Well, it will be a long in ms.

          Jonathan Ellis added a comment -

          So I guess we could support

          {ALL, Xpercentile, Yms, NONE}

          where X and Y are both doubles?
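
          For illustration, one way such a setting string could be parsed (a hypothetical sketch, not the option-handling code that was eventually committed):

          // Hypothetical sketch of parsing a speculative_retry option of the form
          // ALL, NONE, "95percentile" or "10ms", with X and Y as doubles.
          public class RetryOption
          {
              enum Kind { ALL, PERCENTILE, CUSTOM_MS, NONE }

              final Kind kind;
              final double value;

              private RetryOption(Kind kind, double value)
              {
                  this.kind = kind;
                  this.value = value;
              }

              static RetryOption fromString(String s)
              {
                  String upper = s.trim().toUpperCase();
                  if (upper.equals("ALL"))
                      return new RetryOption(Kind.ALL, 0);
                  if (upper.equals("NONE"))
                      return new RetryOption(Kind.NONE, 0);
                  if (upper.endsWith("PERCENTILE"))
                      return new RetryOption(Kind.PERCENTILE,
                                             Double.parseDouble(upper.substring(0, upper.length() - "PERCENTILE".length())));
                  if (upper.endsWith("MS"))
                      return new RetryOption(Kind.CUSTOM_MS,
                                             Double.parseDouble(upper.substring(0, upper.length() - "MS".length())));
                  throw new IllegalArgumentException("unrecognized speculative_retry value: " + s);
              }
          }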

          Jonathan Ellis added a comment -

          Thanks Chris!

          Chris Burroughs added a comment -

          > Looks like metrics-core exposes 75, 95, 97, 99 and 99.9

          Reporters have a limited set (i.e. you can't generate new values that will pop up in JMX on the fly), but in code you should be able to get at any percentile you want: https://github.com/codahale/metrics/blob/2.x-maintenance/metrics-core/src/main/java/com/yammer/metrics/stats/Snapshot.java#L54
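
          For reference, a minimal sketch of what that looks like in code (metrics 2.x; the histogram name here is only illustrative):

          import com.yammer.metrics.Metrics;
          import com.yammer.metrics.core.Histogram;
          import com.yammer.metrics.stats.Snapshot;

          // Sketch: the JMX reporters publish a fixed set of percentiles, but any quantile
          // can be read from the histogram's Snapshot in code.
          public class ArbitraryPercentile
          {
              private static final Histogram readLatency =
                      Metrics.newHistogram(ArbitraryPercentile.class, "ReadLatency", true);

              public static double percentile(double quantile)
              {
                  Snapshot snapshot = readLatency.getSnapshot();
                  return snapshot.getValue(quantile); // e.g. 0.80 for an 80th-percentile threshold
              }
          }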

          Vijay added a comment -

          Hi Jonathan, the custom value is kind of better in the cases where users can say: my SLA is 20 ms and I want the coordinator to retry the reads after 15 ms, or more aggressively retry after 10 ms.

          Attached patch supports the following:

          • ALL
          • auto95 (Default)
          • auto98
          • auto99
          • auto999
          • autoMean
          • NONE (current behavior)
          Jonathan Ellis added a comment -

          Well, we have a pretty short list of possibilities from metrics... I guess we could add auto95, auto97, auto99 options?

          Vijay added a comment (edited) -

          I pushed the prototype code into https://github.com/Vijay2win/cassandra/commit/62bbabfc41ba8e664eb63ba50110e5f5909b2a87

          Looks like metrics-core exposes the 75, 95, 97, 99 and 99.9 percentiles. In my tests, 75P is too low and 99P is too high to make a difference, whereas the 95P long tail looks better (a moving average doesn't make much of a difference either). It also supports ALL, AUTO, NONE (current behavior) as per Jonathan's comment above.

          But I still think we should also support a hard-coded value in addition to the auto settings.

          Note: we have to make speculative_retry part of the schema, but currently, if you want to test it out, it is a code change in CFMetaData.

          Jonathan Ellis added a comment -

          I don't like the idea of making users manually specify thresholds. They will usually get it wrong, and we have latency histograms that should let us do a better job automagically.

          But I could see the value of a setting to allow disabling it when you know your CF has a bunch of different query types being thrown at it. Something like speculative_retry =

          {off, automatic, full}

          where full is Peter's full data reads to each replica.

          Vijay added a comment -

          99% based on what time period? If period it too short, you won't get the full impact since you'll pollute the track record. If it's too large, consider the traffic increase resulting from a prolonged hiccup

          That's the hardest problem which I am trying to solve right now. Actually, surprisingly (to me), the code itself for sending a backup request is not complicated.

          Will you be able to hide typical GC pauses?

          Worst case, we send some extra requests, which IMO is OK for a few milliseconds.

          Most of the time while working on AWS the network is not that predictable, and with MR clusters we were reluctant to enable RR.
          This is not something new to me; we did something like this back at NFLX (we never gave it a fancy name) in the client (http://netflix.github.com/astyanax/javadoc/com/netflix/astyanax/Execution.html#executeAsync()) to retry independently of the default rpc_timeout.

          am not arguing against the idea of backup requests, but I strongly recommend simply going for the trivial and obvious route of full data reads

          I am neutral about this; originally the idea was to move the above logic, which was done in the client, back into the server.

          Here's a good example of complexity implication that I just thought of
          ...

          How about we provide an override for users with multiple kinds of requests? We can override via a CF setting which will be something like a timeout... wait for x seconds before sending a secondary request.

          Lior Golan added a comment -

          How about just letting the user configure a threshold, above which a backup request will be sent?
          This can be an easy way to start with this feature (saving the need to estimate the p99 point).
          It will allow what Peter is suggesting above (just set the threshold to 0), and will allow the user to tune the tradeoff between latency and throughput.

          It would be cool to be able to set this threshold on a per request basis, similar to how CL is specified.

          But thinking about this a bit more - isn't such a feature better implemented at the client library level? Implementing this at the client library level will also allow handling cases where the StorageProxy is down (i.e. GC at the coordinator), and would make it easier to specify at the per request level (no need to pollute the protocol with this setting)

          Peter Schuller added a comment -

          Here's a good example of complexity implication that I just thought of (and it's stuff like this I'm worried about w.r.t. complexity): How do you split requests into "groups" within which to do latency profiling? If you don't, you'll easily end up having the expensive requests always be processed multiple times because they always hit the backup path (because they are expensive and thus latent). So you could very easily "eat up" all your intended benefit by having the very expensive requests take the backup path. Without knowledge of the nature of the requests, and since we cannot reliably just assume a homogenous request pattern, you would probably need some non-trivial way of classifying requests and having it relate to these statistics to keep.

          In some cases, having it be a per-cf setting might be enough. In other cases that's not feasible - for example maybe you're doing slicing on large rows, and maybe it's impossible to determine based on an incoming request whether it's expensive or not (the range may be high but result in only a single column, for example).

          What if you don't care about the latency of the "legitimately expensive" requests, but about the cheap ones? And what if those "legitimately expensive" requests consumes your 1% (p99), such that none of the "cheaper" requests are subject to backup requests? Now you get none of the benefit, but you still take the brunt of the cost you'd have if you just went with full data reads.

          I'm sure there are many other concerns I'm not thinking of; this was meant as an example of how it can be hard to make this actually work the way it's intended.

          Peter Schuller added a comment -

          99% based on what time period? If the period is too short, you won't get the full impact since you'll pollute the track record. If it's too large, consider the traffic increase resulting from a prolonged hiccup. Will you be able to hide typical GC pauses? Then you better have the window be higher than 250 ms. What about full gc:s? How do you determine what the p99 is given a node with multiple replica sets shared with it? If a single node goes into full gc, how do you keep latency unaffected while still capping the number of backup requests at a reasonable number? If you don't cap it, the optimization is more dangerous than useful, since it just means you'll fall over under various hard-to-predict emergent situations if you expect to take advantage of fewer reads when provisioning your cluster. What's an appropriate cap? How do you scale that with RF and consistency level? How do you explain this to the person who has to figure out how much capacity is needed for a cluster?

          In our case, we pretty much run all our clusters with RR turned fully up - not necessarily for RR purposes, but for the purpose of more deterministic behavior. You don't want things falling over when a replica goes down. If you don't have the iops/CPU to take all replicas having to process all requests for a replica set, you're at risk of falling over (i.e., you don't scale, because failures are common in large clusters) - unless you over-provision, but then you might as well go all data reads to begin with.

          I am not arguing against the idea of backup requests, but I strongly recommend simply going for the trivial and obvious route of full data reads first and getting the obvious pay-off with no increase in complexity (I would even argue it's a decrease in complexity in terms of the behavior of the system as a whole, especially from the perspective of a human understanding emergent cluster behavior) - and then slowly develop something like this, with very careful thought to all the edge cases and implications of it.

          I'm in favor of long-term predictable performance. Full data reads is a very very easy way to achieve that, and vastly better latency, in many cases (the bandwidth saturation case pretty much being the major exception; CPU savings aren't really relevant with Cassandra's model if you expect to survive nodes being down). It's also very easy for a human to understand the behavior when looking at graphs of system behavior in some event, and trying to predict what will happen, or explain what did happen.

          I really think the drawbacks of full data reads are being massively over-estimated and the implications of lack of data reads massively under-estimated.

          Jonathan Ellis added a comment -

          FTR I'm not sure CL.ONE is going to be substantially easier than generalizing to all CL.

          Vijay added a comment -

          No, DSnitch watches for the latency but doesn't do the latter... It won't speculate/execute duplicate requests to another host if the response times are > x%.

          I think this patch will be in addition to dsnitch, something like what Jonathan posted in CASSANDRA-2540:

          I like the approach described in http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/people/jeff/Berkeley-Latency-Mar2012.pdf of doing "backup" requests if the original doesn't reply within N% of normal.

          Brandon Williams added a comment -

          It would be nice to watch for latency and execute an additional request to a different node

          Isn't this what the dsnitch does to some degree?


            People

            • Assignee: Vijay
            • Reporter: Vijay
            • Reviewer: Jonathan Ellis