Cassandra / CASSANDRA-3722

Send hints to the dynamic snitch when compaction or repair is in progress on a node.

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Fix Version/s: 1.2.0 beta 1
    • Component/s: Core
    • Labels:
      None

      Description

      Currently the dynamic snitch looks at latency to figure out which node will best serve a request. This works well, but part of the traffic is sent just to collect this data... There is also a window during which the snitch doesn't know about a major event that is about to happen on the node which is going to receive the data request.

      It would be great if we could send some sort of hint to the snitch so it can score based on known events that cause higher latencies.

      1. 0001-CASSANDRA-3722-A1.patch
        10 kB
        Vijay
      2. 0001-CASSANDRA-3722-A1-V2.patch
        8 kB
        Vijay
      3. 0001-CASSANDRA-3723-A2-Patch.patch
        6 kB
        Vijay
      4. 0001-Expose-SP-latencies-in-nodetool-proxyhistograms.txt
        7 kB
        Brandon Williams
      5. 0001-CASSANDRA-3722-v3.patch
        13 kB
        Vijay
      6. 3722-v4.txt
        12 kB
        Brandon Williams
      7. 3722-v5.txt
        14 kB
        Brandon Williams

        Activity

        Brandon Williams added a comment -

        Committed.

        Vijay added a comment -

        wfm +1 on both the v5 and nt PROXYHISTOGRAMS

        Brandon Williams added a comment -

        v5 does some minor clean up (REPORT renamed to SEVERITY mostly) and fixes the dsnitch test so it passes.

        Brandon Williams added a comment - edited

        is it because the lack of a datapoint isn't taken into account as slowness?

        Exactly. It's not receiving new data, so the score doesn't change and the dead host is still rated the best until the FD removes it as an option, or enough TOEs demote it. Doing it this way, time itself penalizes the host when it stops responding.

        Jonathan Ellis added a comment -

        Why doesn't the dsnitch incorporate this automatically? Is it because the lack of a datapoint isn't taken into account as slowness?

        Vijay added a comment -

        Done some testing (single DC) and it works as expected; +1 for v4. Thanks!

        Brandon Williams added a comment -

        I think requests pending just isn't a good measure, especially for a multi-dc setup. Instead what we can do is penalize hosts based on the last time they replied, with a cap of the update interval. This way, if we aren't querying cross-dc, that dc is penalized, but within a reasonable limit such that if the local reads get bad we'll exceed the threshold and go cross-dc if we have to. This seems to respond much better to my torture test of force-suspending a JVM, but not long enough for the FD to kick in.
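
        Brandon's last-reply penalty can be sketched roughly as follows (a hypothetical simplification; `UPDATE_INTERVAL_IN_MS` mirrors the dsnitch's update interval, everything else is invented for illustration and is not the actual patch code):

```java
// Sketch: penalize each host by the time since its last reply, capped at
// the score update interval, so silent hosts decay toward the worst score
// without any explicit failure signal.
import java.util.HashMap;
import java.util.Map;

public class SeverityPenaltySketch {
    static final long UPDATE_INTERVAL_IN_MS = 100; // assumed cap, mirrors dsnitch update interval

    final Map<String, Long> lastReply = new HashMap<>();

    void receiveTiming(String host, long nowMs) {
        lastReply.put(host, nowMs);
    }

    /** Penalty in ms added to a host's latency score. */
    long penalty(String host, long nowMs) {
        Long last = lastReply.get(host);
        if (last == null)
            return UPDATE_INTERVAL_IN_MS; // no datapoint yet: maximum penalty
        return Math.min(nowMs - last, UPDATE_INTERVAL_IN_MS); // capped so a quiet remote DC stays reachable
    }
}
```

        The cap is what keeps a deliberately unqueried DC usable: its hosts are penalized, but only up to the interval, so local slowness can still push reads cross-DC.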

        Vijay added a comment -

        Hi Brandon, Fixed in v3. Thanks!

        Brandon Williams added a comment -

        A few things while I continue to test:

        Compaction can occur before the gossiper is started (and indeed, while gossip is deactivated):

        ERROR 23:38:09,192 Exception in thread Thread[CompactionExecutor:1,1,main]
        java.lang.AssertionError
                at org.apache.cassandra.gms.Gossiper.addLocalApplicationState(Gossiper.java:1105)
                at org.apache.cassandra.service.StorageService.reportSeverity(StorageService.java:790)
                at org.apache.cassandra.db.compaction.CompactionManager$CompactionExecutor.beginCompaction(CompactionManager.java:1021)
                at org.apache.cassandra.db.compaction.CompactionTask.execute(CompactionTask.java:136)
                at org.apache.cassandra.db.compaction.CompactionManager$1.runMayThrow(CompactionManager.java:128)
                at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:26)
                at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
                at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
                at java.util.concurrent.FutureTask.run(FutureTask.java:138)
                at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
                at java.lang.Thread.run(Thread.java:662)
        

        getConnectionManagers is unused, but I imagine that was to help with the hashcode business that we can now remove.

        Brandon Williams added a comment -

        Here's a patch to make this a bit easier to test; it exposes SP recent latency histograms via nodetool.

        Vijay added a comment -

        A2 goes on top of A1. A2 also has pending queue stats; it looks choppier with that metric in the mix.

        Vijay added a comment -

        Fixed, not sure why I decided to do that.

        Brandon Williams added a comment -

        We should probably avoid having Gossiper inject application states directly, if for nothing else than to not make life harder for CASSANDRA-3125

        Vijay added a comment -

        Attached is the version which just has the hints and doesn't take the pending queue into account. (Working on A2, just in case, with artificial load for the other DCs.)

        In addition there is a JMX method with which users can force load into or out of a node by specifying positive/negative severity.

        Vijay added a comment -

        I was almost done with the patch, but filtering based on the pending queue can be potentially dangerous.
        On a multi-region cluster the pending commands on the local replicas will almost always be higher than on the remote ones, because the remote ones don't receive reads. We might not want to filter based on pending.

        We could do a hacky solution by padding the scores of the remote DCs to an artificially higher value than the local DC. What do you guys think?

        Pavel Yaskevich added a comment -

        Also, maintaining a kind of normalized load statistic for each node, in combination with pending requests, could give a better view of what is going on on the node; e.g. load <= 0.5 but a big pending queue for the node could mean a network failure. For the statistic we can assign each of the sub-routines a "load impact" factor, e.g. compaction 0.3, scrub 0.2, read 0.01; we can set a load threshold for "overloaded" nodes, e.g. 0.85 (which could be adjusted at runtime), and sort hosts according to their load + pending request statistics, which would make penalizing hosts more precise. Obviously some normalization should be done because clusters won't always have nodes with identical processing capabilities (network, hardware, etc.).
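
        Pavel's weighting idea could be sketched like this (the impact factors and the 0.85 threshold are the examples from the comment; the class and method names are made up for illustration):

```java
// Illustrative sketch of per-operation "load impact" accounting: each
// running sub-routine contributes a weight, and a node is "overloaded"
// once the summed weight crosses a (runtime-adjustable) threshold.
import java.util.EnumMap;
import java.util.Map;

public class LoadImpactSketch {
    enum Op { COMPACTION, SCRUB, READ }

    static final Map<Op, Double> IMPACT = new EnumMap<>(Op.class);
    static {
        IMPACT.put(Op.COMPACTION, 0.3);
        IMPACT.put(Op.SCRUB, 0.2);
        IMPACT.put(Op.READ, 0.01);
    }
    static final double OVERLOAD_THRESHOLD = 0.85; // example value from the comment

    double load = 0.0;

    void begin(Op op) { load += IMPACT.get(op); } // operation starts
    void end(Op op)   { load -= IMPACT.get(op); } // operation finishes

    boolean overloaded() { return load >= OVERLOAD_THRESHOLD; }
}
```

        The normalization Pavel mentions would scale `load` by per-node capacity before comparing across hosts.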

        Jonathan Ellis added a comment -

        Taking into account the number of outstanding requests is IMO a necessity

        It sounds like you're saying that if coordinator X has 1000 requests pending response from replica Y, and only 100 to replica Z, then X should suspect that Y is having trouble and rely more heavily on Z, even before requests to Y start timing out. Right?

        That sounds reasonable to me in theory. How do we mix that into the existing latency information we track for dsnitch?

        Peter Schuller added a comment - edited

        I'm -0 on the original bit of this ticket, but +1 on more generic changes that cover the original use case as well if not better anyway. I think that instead of trying to predict exactly the behavior of some particular event like compaction, we should just be better at responding to what is actually going on:

        • We have CASSANDRA-2540 which can help avoid blocking uselessly on a dropped or slow request even if we haven't had the opportunity to react to overall behavior yet (I have a partial patch that breaks read repair, I haven't had time to finish it).
        • Taking into account the number of outstanding requests is IMO a necessity. There is plenty of precedent for anyone who wants it (least-used-connection policies in various LBs), but more importantly it would so clearly help in several situations, including:
          • Sudden GC pause of a node
          • Sudden death of a node
          • Sudden page cache eviction and slowness of a node, before snitching figures it out
          • Constantly overloaded node; even with the dynsnitch it would improve the situation as the number of requests affected by a dynsnitch reset is lessened
          • Packet loss/hiccup/whatever across DCs

        There is some potential for foot shooting in the sense that if a node is broken in a way that it responds with incorrect data, but responds faster than anyone else, it will tend to "swallow" all the traffic. But honestly, that feels like a minor concern to me based on what I've seen actually happen in production clusters. If we ever start sending non-successes back over inter-node RPC, this would change however.

        My only major concern is potential performance impacts of keeping track of the number of outstanding requests, but if that does become a problem one can make it probabilistic - have N % of all requests be tracked. Less impact, but also less immediate response to what's happening.

        This will also have the side-effect of mitigating sudden bursts of promotion into old-gen if we combine it with pro-actively dropping read-repair messages for nodes that are overloaded (effectively prioritizing data reads), hence helping for CASSANDRA-3853.

        Should we tee (send additional requests which are not part of the normal operations) the requests until the other node recovers?

        In the absence of read repair, we'd have to do speculative reads as Stu has previously noted. With read repair turned on, this is not an issue because the node will still receive requests and eventually warm up. Only with read repair turned off do we not send requests to more than the first N of endpoints, with N being what is required by CL.

        Semi-relatedly, I think it would be a good idea to make the proximity sorting probabilistic in nature so that we don't do a binary flip back and forth between who gets data vs. digest reads or who doesn't get reads at all. That might mitigate this problem, but not help fundamentally, since the rate of warm-up would decrease with a node being slow.

        I do want to make this point though: Every single production cluster I have ever been involved with so far has been such that you basically never want to turn read repair off. Not because of read repair itself, but because of the traffic it generates. Having nodes not receive traffic is extremely dangerous under most circumstances, as it leaves nodes cold, only to suddenly explode and cause timeouts and other bad behavior as soon as e.g. some neighbor goes down and they suddenly start taking traffic. This is an easy way to make production clusters fall over. If your workload is entirely in memory or otherwise not reliant on caching the problem is much less pronounced, but even then I would generally recommend that you keep it turned on, if only because your nodes will have to be able to take the additional load anyway if you are to survive other nodes in the neighborhood going down. It just makes clusters much easier to reason about.

        Vijay added a comment -

        Another issue just came up which is related to this ticket.
        A node that has just completed bootstrap will not have its file caches warm enough for a fast response, so the scores for this node will be much lower and very little traffic will go to it. The problem is that unless more traffic is sent, the file caches will not warm up fast enough. Should we tee the requests (send additional requests which are not part of the normal operations) until the node recovers?

        Vijay added a comment -

        How about, instead of sending a gossip message, we also compare in DES.updateScores the number of tasks waiting to be performed on a given node? This way we can actually move on to the next node when there is a lot more pending data to be received before the latencies come in... Makes sense? (Where node Y's pending > node X's pending + some percentage, prefer node X.)
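
        The comparison Vijay describes might look like this (the class, method, and 10% margin are all hypothetical; the real DES.updateScores works on latency samples):

```java
// Sketch: prefer node X over node Y when Y's pending task count exceeds
// X's by more than a slack margin, so minor jitter doesn't flip ordering.
public class PendingCompareSketch {
    static final double MARGIN = 0.10; // illustrative 10% slack

    /** true if X should be preferred over Y based on pending tasks alone */
    static boolean preferX(int xPending, int yPending) {
        return yPending > xPending * (1 + MARGIN);
    }
}
```

        As the later comments note, raw pending counts turn out to be a poor signal across DCs, since unqueried replicas always show near-zero pending.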

        Brandon Williams added a comment -

        Right, so I'm saying instead of gossiping an indirect indicator like compaction status, we could gossip read latency directly.

        I'm not sure that's a good solution either due to the propagation time. Especially in the case that the repair is staggered across the replica set, there's a good chance you're penalizing the wrong host after the first one until the state is propagated to you.

        Vijay added a comment -

        will do.

        Jonathan Ellis added a comment -

        the problem with that is that until we send actual traffic to the suspected node we will not know it

        Right, so I'm saying instead of gossiping an indirect indicator like compaction status, we could gossip read latency directly.

        Vijay added a comment -

        Probably after CASSANDRA-3723 we might be able to include the wait time in the queue, but the problem with that is that until we send actual traffic to the suspected node we will not know it; we should actually send occasional requests to those nodes to get this data (not sure how we can throttle it down to fewer requests)...

        Something like test/real messages to see if they are back to normal SLAs (doing a complete end-to-end read from disk), if we don't want to publish major events.

        Brandon Williams added a comment -

        you could base it on latency instead of pending count...

        I'm not sure I understand, do you mean the latency the pending reads would have if they came back at the time of calculating the scores? I was thinking about doing that since we have the timestamps in EM.

        Jonathan Ellis added a comment -

        you could base it on latency instead of pending count...

        Vijay added a comment -

        but the pending will be almost zero for the bad nodes as we already redirected traffic, right?

        Brandon Williams added a comment -

        The tricky thing is, we don't know how much to penalize them per pending read. We could choose an arbitrary static amount, and that would work as long as it's the same for all hosts... except the badness threshold loses meaning.

        Jonathan Ellis added a comment -

        maybe penalizing hosts with pending reads would work better since it would work universally

        I like that idea.

        Brandon Williams added a comment -

        Hmm, I see. Instead of communicating various states the node is in over gossip, maybe penalizing hosts with pending reads would work better since it would work universally.

        Vijay added a comment -

        Hi Brandon, I am talking about reads when RR is disabled (0.0)... It is not going to be 1 or 2 reads (we would be really happy if it were only 1 or 2 reads); it will be as many reads as it takes for the first read to come back. For example, with a 1000-requests-per-second workload, when the node goes slow it might take a second, or time out (10 seconds), which means 1000 to 10k requests.

        Brandon Williams added a comment -

        If we have the end time we can avoid sending traffic back to the node until it recovers completely right?

        I'm not sure if you mean writes or checksum requests for read repair here, but neither is avoidable (except for RR, by adjusting the probability setting).

        If you mean the one or two reads after the reset interval, that doesn't seem like it's going to have a big enough impact to optimize for.

        Vijay added a comment -

        If we have the end time, we can avoid sending traffic back to the node until it recovers completely, right? If compactions last a second, the worst case is that we will not send traffic until ring delay... Alternatively we could take the difference from the response time average and throttle the requests... or even tee the request just to confirm the event is complete... Makes sense?

        Brandon Williams added a comment -

        Hi Brandon, will it make sense for the remote node to avoid traffic given the start and end?

        Sure, but my point is that's going to happen already:

        • node X begins a repair, adversely affecting its read latency
        • node Y tries to read data from X, and scores it badly due to the extra latency

        At this point, node Y will not try to read from X again until either:

        • all other members of X's replica set perform worse than X
        • the RESET_INTERVAL_IN_MS elapses, and the entire process starts over again

        Eventually, node X finishes the validation compaction and when the reset interval elapses, its reads are more equal to the rest of the replica set and everything is back to normal.

        Telling the dsnitch when a remote node starts and stops compacting doesn't seem like it's going to improve on this a whole lot.
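
        The cycle Brandon walks through can be modeled in miniature (everything here is a toy: the EWMA weights and names are invented, and only `RESET_INTERVAL_IN_MS` corresponds to a real dsnitch concept):

```java
// Toy model of the dsnitch cycle: reads go to the best-scored replica,
// a latency spike demotes a host, and scores reset wholesale after the
// reset interval so the process starts over.
import java.util.HashMap;
import java.util.Map;

public class ResetCycleSketch {
    static final long RESET_INTERVAL_IN_MS = 600_000; // assumed 10-minute reset interval

    final Map<String, Double> scores = new HashMap<>();
    long lastReset = 0;

    void recordLatency(String host, double ms) {
        // crude weighted update: newer samples dominate the score
        scores.merge(host, ms, (old, cur) -> 0.25 * old + 0.75 * cur);
    }

    /** Host with the lowest (best) score, or null if no scores yet. */
    String best() {
        return scores.entrySet().stream()
                     .min((a, b) -> Double.compare(a.getValue(), b.getValue()))
                     .map(Map.Entry::getKey)
                     .orElse(null);
    }

    void maybeReset(long nowMs) {
        if (nowMs - lastReset >= RESET_INTERVAL_IN_MS) {
            scores.clear(); // everyone starts equal again
            lastReset = nowMs;
        }
    }
}
```

        Node X stays demoted until either the other replicas score worse or the reset wipes the slate, which is exactly why gossiping compaction state adds little.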

        Vijay added a comment -

        Hi Brandon, will it make sense for the remote node to avoid traffic given the start and end? Yes, the start of it is going to be an issue, but around the end of the major event we can at least be conservative about not sending data?

        Brandon Williams added a comment -

        It might be worth experimenting with the update interval and the window size, though.

        Brandon Williams added a comment -

        The dsnitch doesn't have a phi to adjust. What would be phi in the FD is just the score here, and we don't care what that is; just that we sort by it.

        Jonathan Ellis added a comment -

        Can we get a free lunch by adjusting the dsnitch's phi?

        Brandon Williams added a comment -

        The only way we could do this efficiently is via gossip, which will have at best a latency of one second to another (up to) 3 nodes, but asymptotically just one. This is easily going to take far longer to propagate through the cluster than it will for the snitch to see the slow read that affects the score enough to lower the remote node's read priority.
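
        A back-of-the-envelope model of that propagation concern (an idealized best case assuming every informed node reaches 3 new peers per one-second round; real gossip is randomized and converges more slowly):

```java
// Rough lower bound on gossip rounds (~seconds) needed to inform n nodes
// with fanout f, assuming perfect spread with no duplicate targets.
public class GossipRoundsSketch {
    static int roundsToReach(int n, int fanout) {
        int informed = 1, rounds = 0;
        while (informed < n) {
            informed += informed * fanout; // each informed node reaches f new peers
            rounds++;
        }
        return rounds;
    }
}
```

        Even in this best case a 100-node cluster needs several seconds, while a single slow read updates the local snitch score immediately.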


          People

          • Assignee: Vijay
          • Reporter: Vijay
          • Reviewer: Brandon Williams
          • Votes: 0
          • Watchers: 4

            Dates

            • Created:
              Updated:
              Resolved:
