Cassandra
  CASSANDRA-3294

a node whose TCP connection is not up should be considered down for the purpose of reads and writes

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      Cassandra fails to handle the most simple of cases intelligently - a process gets killed and the TCP connection dies. I cannot see a good reason to wait for a bunch of RPC timeouts and thousands of hung requests to realize that we shouldn't be sending messages to a node when the only possible means of communication is confirmed down. This is why one has to "disablegossip and wait for a while" to restart a node on a busy cluster (especially without CASSANDRA-2540, but that only helps under certain circumstances).

      A more generalized approach, whereby one e.g. weighs in the number of currently outstanding RPC requests to a node, would likely take care of this case as well. But until such a thing exists and works well, it seems prudent to have this very common and controlled form of "failure" be handled better.

      Are there difficulties I'm not seeing?

      I can see that one may want to distinguish between considering something "really down" (and e.g. fail a repair because it's down) from what I'm talking about, so maybe there are different concepts (say one is "currently unreachable" rather than "down") being conflated. But in the specific case of sending reads/writes to a node we know we cannot talk to, it seems unnecessarily detrimental.

        Activity

        Jonathan Ellis added a comment -

        What do you suggest? TCP connection death isn't synonymous with process death.

        Peter Schuller added a comment -

        Right. What I am saying is that for the purpose of picking nodes to send reads/writes to, it doesn't make sense to pick nodes that we know for a fact we cannot communicate with. As I indicate, I'm afraid that maybe this does not translate to gossip up/down state which may have other implications. But, in the most extreme case, if I'm connecting to a co-ordinator and submitting a read at CL.ONE, I don't want the co-ordinator to opt to send it to a node it knows for a fact it is currently not able to communicate with, causing RPC timeouts until rpc_timeout seconds have passed (more or less) and the dynamic snitch has figured things out.

        Although... the more I think of it maybe it's better to just go for more actively weighting in outstanding RPC requests instead.

        Basically, from what I've observed in high-throughput (lots of queries) clusters, most day-to-day "hiccups" that spill over to applications tend to be one of:

        • Process got stopped without disablegossip+wait, or got killed, etc.
        • Process had a temporary hiccup (e.g., GC pause, networking glitch, sudden burst of streaming from another node causing a short duration of disk bottlenecking)
        • Process is legitimately overloaded for a while, e.g. due to disk I/O, and despite dynamic snitching there is an annoyingly significant impact to clusters likely resulting from the periodic dynamic snitch reset.

        All of these, in addition to many other "soft" failure modes that add up to "node is slow", should be helped quite a lot by significantly weighting outstanding request count when picking nodes. I'm not necessarily suggesting least-used flat out, and I'm aware that one could introduce foot-shooting here, and there are some performance vs. responsiveness concerns.

        In the end, the goal is often not really to maximize throughput, but rather to keep a decent latency consistently. Not necessarily the lowest possible average latency, but avoiding extreme outliers. Particularly when application code is not good at dealing with concurrency spikes (e.g., thread limits, process limits, whatever).

        If we get a read and know that we have 50 requests outstanding to the node that is closest according to the snitch, but 0 outstanding to others, we shouldn't be adding another request onto that...
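        The replica-picking idea above can be sketched as follows. This is a minimal illustration with hypothetical names (`outstanding`, `snitch_order`), not anything from Cassandra's codebase:

```python
def pick_replica(replicas, outstanding, snitch_order):
    """Pick a replica, weighting outstanding request count heavily.

    `replicas`: candidate endpoints; `outstanding`: endpoint -> in-flight
    request count; `snitch_order`: endpoint -> proximity rank (0 = closest).
    All names here are hypothetical, not Cassandra API.
    """
    # Outstanding requests dominate; proximity only breaks ties, so an
    # idle but slightly more distant replica beats one with 50 queued
    # requests -- the behaviour argued for above.
    return min(replicas, key=lambda n: (outstanding.get(n, 0),
                                        snitch_order.get(n, 9)))
```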

        Thoughts? Am I trying to make the problem be simpler than what it is?

        Stu Hood added a comment -

        Narrowing the window before node A notices that node B is slow/dead is a good thing, but it is not possible to remove this window entirely. Instead, speculatively requesting the data from extra nodes should be an option (if not the default). CASSANDRA-2540 touches on this issue, but I made the mistake of conflating data vs digest with speculation: even if we never remove digest reads, we should consider adding speculative reads.
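        The speculative-read idea can be sketched roughly like this (an illustration under assumed names such as `fetch`, not Cassandra's implementation):

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def speculative_read(fetch, primary, backups, speculate_after=0.05):
    """Send the read to `primary`; if no answer arrives within
    `speculate_after` seconds, also ask the backup replicas and take
    whichever responds first. `fetch` is a hypothetical blocking read."""
    with ThreadPoolExecutor(max_workers=1 + len(backups)) as pool:
        futures = [pool.submit(fetch, primary)]
        done, _ = wait(futures, timeout=speculate_after)
        if not done:
            # Primary is slow or dead: speculate to the extra nodes.
            futures += [pool.submit(fetch, b) for b in backups]
        done, _ = wait(futures, return_when=FIRST_COMPLETED)
        return next(iter(done)).result()
```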

        Pavel Yaskevich added a comment -

        How about we assign a probability "to be alive" to each of the nodes in the ring (starting from a uniform distribution); with each failure, e.g. an RPC/Gossiper communication error, we would decrease the probability of the node being alive by a constant factor, and increase it by another constant factor if communication was successful. That would allow us to calculate the endpoint with the highest "alive" probability (and all others sorted) for the sub-group of SS.getLiveNaturalEndpoints(String, RingPosition). What do you think?
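        As a rough sketch of that scheme (the class name and the DECAY/RECOVER constants are made up for illustration, not tuned values):

```python
class AliveScore:
    """Per-node 'probability alive' tracker, as suggested above:
    multiply down on each failed exchange, recover on each success."""
    DECAY = 0.5     # illustrative constant applied on a failure
    RECOVER = 1.2   # illustrative constant applied on a success

    def __init__(self, nodes):
        # Uniform starting point: every node equally likely alive.
        self.p = {n: 1.0 for n in nodes}

    def failure(self, node):
        self.p[node] *= self.DECAY

    def success(self, node):
        self.p[node] = min(1.0, self.p[node] * self.RECOVER)

    def ranked(self, candidates):
        # Candidates sorted most-likely-alive first, ready to be applied
        # to the live-endpoint subset mentioned above.
        return sorted(candidates, key=lambda n: self.p[n], reverse=True)
```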

        Brandon Williams added a comment -

        How about we assign probability "to be alive" to each of the nodes in the ring

        This sounds like reinventing the existing failure detector to me.

        Brandon Williams added a comment -

        One thing that occurs to me here is that the FD is sort of a one-way device: we can send it hints that something is alive, but we can't send it hints that something is dead. Thus, the only way a node can be marked down is by its phi decaying over time. If we added the ability to negatively affect the phi directly (TCP connection isn't present, or has been refused, etc) this could speed failure detection up considerably.
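        A toy version of a phi-style detector with such a negative hint might look like this (hypothetical code, not Cassandra's FailureDetector; the threshold and the linear-phi approximation are assumptions):

```python
import math

class PhiDetector:
    """Minimal phi-accrual sketch with a 'negative hint': report()
    records a heartbeat, while force_down() pushes phi past the
    threshold immediately (e.g. on a refused TCP connection)."""
    PHI_THRESHOLD = 8.0   # assumed value, for illustration only

    def __init__(self):
        self.intervals = []    # recent heartbeat inter-arrival times
        self.last = None
        self.forced_down = False

    def report(self, now):
        if self.last is not None:
            self.intervals = (self.intervals + [now - self.last])[-100:]
        self.last = now
        self.forced_down = False   # a fresh heartbeat clears the hint

    def force_down(self):
        # The negative hint: connection refused/reset -> down right now,
        # instead of waiting for phi to climb past the threshold.
        self.forced_down = True

    def phi(self, now):
        if self.forced_down:
            return float("inf")
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        # Exponential-arrival approximation: phi grows linearly with
        # the time elapsed since the last heartbeat.
        return (now - self.last) / mean * math.log10(math.e)

    def is_alive(self, now):
        return self.phi(now) < self.PHI_THRESHOLD
```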

        Pavel Yaskevich added a comment -

        The main idea of the algorithm I have mentioned is to make sure that we always do operations (write/read etc.) on the nodes that have the highest probability to be alive determined by live traffic going there instead of passively relying on the failure detector.

        Brandon Williams added a comment -

        I see. We can do that by sorting on the current phi scores, but we'd need to respect the badness threshold for those doing replica pinning. Sounds like we're starting to bump up against CASSANDRA-3722 here.

        Pavel Yaskevich added a comment -

        After reading CASSANDRA-3722 it seems we can implement required logic at the snitch level taking latency measurements into account. I think we can close this one in favor of CASSANDRA-3722 and continue work/discussion there. What do you think, Brandon, Peter?

        Peter Schuller added a comment - edited

        This sounds like reinventing the existing failure detector to me.

        Except we don't use it that way at all (see CASSANDRA-3927). Even if we did though, I personally think it's totally the wrong solution to this problem since we have the perfect measurement - whether the TCP connection is up.

        It's fine if we have other information that actively indicates we shouldn't send messages to it (whether it's the FD or the fact that we have 500 000 messages queued to the node), but if we know the TCP connection is down, we should just not send messages to it, period. With the only caveat being that of course we'd have to make sure TCP connections are in fact pro-actively kept up under all circumstances (I'd have to look at code to figure out what issues there are, if any, in detail).
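        A sketch of that check (illustrative only; `connection_up` is a hypothetical predicate, not an existing Cassandra API):

```python
def reachable_endpoints(endpoints, connection_up):
    """Drop replicas whose outbound TCP connection is known down,
    before any snitch sorting or scoring happens."""
    live = [e for e in endpoints if connection_up(e)]
    # If nobody is reachable, fall back to the full list so the request
    # fails with a normal timeout rather than instantly with no targets.
    return live or endpoints
```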

        The main idea of the algorithm I have mentioned is to make sure that we always do operations (write/read etc.) on the nodes that have the highest probability to be alive determined by live traffic going there instead of passively relying on the failure detector.

        I have an unfiled ticket to suggest making the proximity sorting probabilistic to avoid the binary "either we get traffic or we don't" (or "either we get data or we get digest") situation. That would certainly help. As would least-requests-outstanding.

        You can totally make it so that this ticket is irrelevant by just making the general case well-supported enough that there is no reason to special case this. This was originally filed since we had none of that, and we still don't, and it seemed like a very trivial case to handle for the TCP connection to be actively reset by the other side.

        After reading CASSANDRA-3722 it seems we can implement required logic at the snitch level taking latency measurements into account. I think we can close this one in favor of CASSANDRA-3722 and continue work/discussion there. What do you think, Brandon, Peter?

        I think CASSANDRA-3722's original premise doesn't address the concerns I see in real life (I don't want special cases trying to communicate "X is happening"), but towards the end I start agreeing with the ticket more.

        In any case, feel free to close if you want. If I ever get to actually implementing this (if at that point there is no other mechanism to remove the need) I'll just re-file or re-open with a patch. We don't need to track this if others aren't interested.

        Brandon Williams added a comment -

        I think CASSANDRA-3722's original premise doesn't address the concerns I see in real life (I don't want special cases trying to communicate "X is happening"), but towards the end I start agreeing with the ticket more.

        I agree; the original premise there was jumping the gun with a solution a bit, but I think ultimately we end up in very similar places.

        Jonathan Ellis added a comment -

        I think between CASSANDRA-3722, CASSANDRA-5393, and CASSANDRA-4705 we've covered this pretty well.


          People

          • Assignee:
            Peter Schuller
          • Reporter:
            Peter Schuller
          • Votes:
            0
          • Watchers:
            6

            Dates

            • Created:
              Updated:
              Resolved:
