  Apache Cassandra / CASSANDRA-8611

give streaming_socket_timeout_in_ms a non-zero default

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Normal
    • Resolution: Fixed
    • Fix Version/s: 2.1.10, 2.2.2, 3.0 beta 2

    Description

      Sometimes, as mentioned in CASSANDRA-8472, streams will hang. We have streaming_socket_timeout_in_ms, which can retry after a timeout. It would be good to give it a non-zero default value. We don't want to paper over problems, but streams sometimes hang, and you don't want long-running streaming operations such as repairs or bootstraps to just fail.

      streaming_socket_timeout_in_ms should be based on the TCP idle timeout, so it shouldn't be a problem to set it to a value on the order of minutes. Also, the socket should only be open during the actual streaming and not during operations such as merkle tree generation. We can set it to a conservative value and people can tune it more aggressively as needed. Disabling it by default is, in my opinion, too conservative.
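
      For reference, the setting lives in cassandra.yaml. A minimal sketch of the entry as it stands today, assuming the current default of 0 (disabled); the comment wording here is illustrative, not the shipped text:

          # Socket timeout, in milliseconds, for the streaming connection.
          # 0 disables the read timeout entirely, so a stream whose peer goes
          # silent can hang forever; a non-zero value lets the stream time out
          # and be retried.
          streaming_socket_timeout_in_ms: 0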

      Attachments

        Activity

          jeromatron Jeremy Hanna added a comment -

          jblangston@datastax.com mentioned that it would be good to make sure that it does time out and that it logs an understandable message when it does. That way, it can be tracked when troubleshooting socket timeouts/resets at the router level, for example.


          sebastian.estevez@datastax.com Sebastian Estevez added a comment -

          This bites most production bootstraps that I encounter, especially on the cloud. Are there any downsides to improving this bad default?
          blerer Benjamin Lerer added a comment -

          yukim, pauloricardomg, any suggestions for the default value? My knowledge of streaming is limited.

          pauloricardomg Paulo Motta added a comment -

          If we want to be really conservative, how about setting it to the default Linux tcp_keepalive_time of 7200 seconds (two hours)? Given that I have seen streams hang on EC2 for tens of hours or even days, this should be sufficient to catch the most extreme scenarios, while still allowing operators to set it to a lower value if they want to. If this is too conservative, maybe we can set it to 10-30 minutes.

          elubow Eric Lubow added a comment -

          I've seen streams hang for days on EC2 as well. This can be especially problematic when you are trying to add capacity. Typically, if nothing has happened in an hour, it's probably the result of a hung stream, and waiting another hour doesn't do much good. The one thing to keep in mind with a two-hour timeout is that on smaller datasets the timeout for a stream would be longer than the entire bootstrap of the machine would take. I think it would be safe to bring it down to an hour, which is still very conservative.

          rcoli Robert Coli added a comment -

          Attaching a patch which sets this timeout to 10 minutes. Rationale is as follows:

          • Streams continue to hang in normal operation.
          • Operators want hung streams to restart faster than they could by noticing the hang and restarting the node manually; that path is on the order of 10 minutes for a typical node.
          • Re-streaming 10 minutes' worth of data is not prohibitive; at the default throttle of 25 MB/s, it's "only" 15 GB (see the quick check below).
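
          A quick check of that last figure, using the 25 MB/s default throttle cited above:

          $25\ \mathrm{MB/s} \times 600\ \mathrm{s} = 15{,}000\ \mathrm{MB} \approx 15\ \mathrm{GB}$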

          Patch was created before seeing above discussion, but seems to be within the bounds discussed above.

          blerer Benjamin Lerer added a comment -

          What we are looking for is a safety net, not something too aggressive. Based on the discussion, I am in favor of setting it to 1 hour. We can still lower it in the future if needed.
          rcoli, can you provide another patch for 2.1?

          rcoli Robert Coli added a comment -

          Attached two patches, one against trunk and one against 2.1, which default to 3600000 ms, i.e. 1 hour.
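
          With either patch applied, the shipped cassandra.yaml entry would look roughly like the sketch below (the comment wording is illustrative, not the exact text of the patch):

              # Non-zero default: a hung stream now times out after one hour
              # and can be retried, instead of hanging indefinitely.
              streaming_socket_timeout_in_ms: 3600000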

          blerer Benjamin Lerer added a comment -

          Thanks for the patch.

          • the results for the unit tests for 2.1 are here
          • the results for the dtests for 2.1 are here
          • the results for the unit tests for 2.2 are here
          • the results for the dtests for 2.2 are here
          • the results for the unit tests for 3.0 are here
          • the results for the dtests for 3.0 are here

          LGTM

          blerer Benjamin Lerer added a comment -

          committed: 7e1ea4c8c1af0809b990c27648edbff2efb2434a


          People

            Assignee: rcoli Robert Coli
            Reporter: jeromatron Jeremy Hanna
            Authors: Robert Coli
            Reviewers: Benjamin Lerer
            Votes: 5
            Watchers: 9
