Cassandra > CASSANDRA-8611

give streaming_socket_timeout_in_ms a non-zero default

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: 2.1.10, 2.2.2, 3.0 beta 2
    • Component/s: None
    • Labels: None

      Description

      Sometimes, as mentioned in CASSANDRA-8472, streams will hang. We have streaming_socket_timeout_in_ms, which can retry after a timeout. It would be good to make the default a non-zero value. We don't want to paper over problems, but streams do sometimes hang, and you don't want long-running streaming operations such as repairs or bootstraps to simply fail.

      streaming_socket_timeout_in_ms should be based on the TCP idle timeout, so it shouldn't be a problem to set it on the order of minutes. Also, the socket should only be open during the actual streaming and not during operations such as merkle tree generation. We can set it to a conservative value and people can set it more aggressively as needed. Disabling it by default is, in my opinion, too conservative.
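      As a sketch of what this looks like in practice, here is the relevant cassandra.yaml fragment with the one-hour default (3600000 ms) that the thread below eventually settled on; the comment about 0 reflects the pre-patch behavior described in this ticket:

      ```yaml
      # Socket timeout for streaming operations, in milliseconds.
      # 0 disables the timeout (the old default this ticket changes).
      # This thread settled on 1 hour = 3600000 ms as a conservative safety net.
      streaming_socket_timeout_in_ms: 3600000
      ```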

        Activity

        jeromatron Jeremy Hanna added a comment -

        J.B. Langston mentioned that it would be good to make sure that it actually does time out, and that it logs an understandable message when it does. That way it can be tracked when troubleshooting socket timeouts/resets at the router level, for example.

        sebastian.estevez@datastax.com Sebastian Estevez added a comment -

        This bites most production bootstraps that I encounter, especially in the cloud. Are there any downsides to improving this bad default?

        blerer Benjamin Lerer added a comment -

        Yuki Morishita, Paulo Motta: any suggestions for the default value? My knowledge of streaming is limited.

        pauloricardomg Paulo Motta added a comment -

        If we want to be really conservative, how about setting it to the default Linux tcp_keepalive_time of 7200 seconds (two hours)? Given that I have seen streams hang on EC2 for tens of hours or even days, this should be sufficient to catch the most extreme scenarios, while still allowing operators to set a lower value if they want to. If this is too conservative, maybe we can set it to 10-30 minutes.
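        The candidate defaults floated in this thread span a wide range; a quick sketch converting them into the unit the yaml setting actually takes (milliseconds) makes the spread concrete. The values are taken from the comments here; only the conversion arithmetic is mine:

        ```python
        # Candidate streaming_socket_timeout_in_ms defaults discussed in this
        # ticket, converted from human units to milliseconds.
        candidates = {
            "linux tcp_keepalive_time (2 h)": 7200 * 1000,  # Paulo's conservative option
            "30 min": 30 * 60 * 1000,
            "10 min": 10 * 60 * 1000,                        # Robert's original patch
            "1 h (committed default)": 60 * 60 * 1000,       # value in the final patch
        }
        for name, ms in candidates.items():
            print(f"{name}: {ms} ms")
        ```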

        elubow Eric Lubow added a comment -

        I've seen streams hang for days on EC2 as well. This can be especially problematic when you are trying to add capacity. Typically, if nothing has happened in an hour, it's probably a hung stream, and waiting another hour doesn't provide much benefit. One thing to keep in mind with a two-hour timeout is that on smaller datasets, the timeout would be longer than the entire bootstrap of the machine. I think it would be safe to bring it down to an hour, which is still very conservative.

        rcoli Robert Coli added a comment -

        Attaching a patch which sets this timeout to 10 minutes. Rationale is as follows:

        • Streams continue to hang in normal operation.
        • Operators want hung streams to restart faster than they could notice the hang and restart them by restarting the node, which takes on the order of 10 minutes for a typical node.
        • Re-streaming 10 minutes' worth of data is not prohibitive; at the default throttle of 25 MB/s, it's "only" about 15 GB.

        Patch was created before seeing above discussion, but seems to be within the bounds discussed above.
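        The re-streaming cost in the rationale above checks out with a quick back-of-the-envelope calculation (25 MB/s and 10 minutes are the figures from the comment; the decimal MB-to-GB conversion is mine):

        ```python
        # Worst-case data to re-stream if a stream times out after 10 minutes
        # at the default 25 MB/s throttle cited in the comment.
        throttle_mb_per_s = 25
        timeout_s = 10 * 60

        restream_gb = throttle_mb_per_s * timeout_s / 1000  # MB -> GB (decimal)
        print(restream_gb)  # 15.0, matching the "only 15gb" figure above
        ```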

        blerer Benjamin Lerer added a comment -

        What we are looking for is a safety net, not something too aggressive. Based on the discussion, I am in favor of setting it to 1 hour. We can still lower it in the future if needed.
        Robert Coli, can you provide another patch for 2.1?

        rcoli Robert Coli added a comment -

        Attached two patches, one against trunk and one against 2.1, which default to 3600000 ms, i.e. 1 hour.

        blerer Benjamin Lerer added a comment -

        Thanks for the patch.

        • the results for the unit tests for 2.1 are here
        • the results for the dtests for 2.1 are here
        • the results for the unit tests for 2.2 are here
        • the results for the dtests for 2.2 are here
        • the results for the unit tests for 3.0 are here
        • the results for the dtests for 3.0 are here

        LGTM

        blerer Benjamin Lerer added a comment -

        committed: 7e1ea4c8c1af0809b990c27648edbff2efb2434a


          People

          • Assignee: rcoli Robert Coli
          • Reporter: jeromatron Jeremy Hanna
          • Reviewer: Benjamin Lerer
          • Votes: 5
          • Watchers: 10
