Details
- Sub-task
- Status: Open
- Critical
- Resolution: Unresolved
- 0.9.2-incubating, 0.9.3
- None
- None
Description
The latest Netty client code attempts to reestablish the connection on failure as part of the send method call. It blocks until the connection is established or a timeout occurs. By default the timeout is about 30 seconds, which is also the default tuple timeout.
This is exacerbated by the read lock that is held during the send, which prevents the node->socket mapping from changing while we are sending. The lock exists mostly so that we don't close connections while we are trying to write to them, which would cause an exception. But it also means that if multiple workers on a node all get rescheduled, we wait the full 30 seconds for each worker's connection to time out.
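To make the lock interaction concrete, here is a minimal sketch in plain Java. It is not the Storm worker code: the class `LockStallDemo`, the method `writerStalls`, and the use of `Thread.sleep` to stand in for a blocking reconnect are all assumptions made for illustration. It shows that while a "send" holds the read lock and blocks, anything that needs the write lock (such as updating the node->socket mapping) cannot proceed.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical demo class; names are illustrative, not from Storm.
class LockStallDemo {
    /**
     * Returns true if a writer failed to get the write lock within
     * writerWaitMillis while a simulated send held the read lock
     * for sendBlockMillis.
     */
    static boolean writerStalls(long sendBlockMillis, long writerWaitMillis) {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        CountDownLatch readLockHeld = new CountDownLatch(1);

        Thread sender = new Thread(() -> {
            lock.readLock().lock();
            try {
                readLockHeld.countDown();
                // Stands in for send() blocking on a reconnect attempt.
                Thread.sleep(sendBlockMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                lock.readLock().unlock();
            }
        });
        sender.start();

        try {
            // Wait until the sender really holds the read lock.
            readLockHeld.await();
            // Stands in for the code that wants to change the mapping.
            boolean acquired =
                lock.writeLock().tryLock(writerWaitMillis, TimeUnit.MILLISECONDS);
            if (acquired) {
                lock.writeLock().unlock();
            }
            sender.join();
            return !acquired;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

With a 30 second block in place of the short sleep, the writer would be held up for the full timeout, which matches the behavior described above.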
send must be non-blocking in the current design of the worker. Otherwise it prevents other messages from being delivered and is likely to cause many messages to time out on a reschedule.
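One way a non-blocking send could look is sketched below. This is an assumption for illustration only, not the fix that was applied in Storm: the class `BufferedSender` and its methods are hypothetical. The idea is that `send` only places the message in a bounded in-memory buffer and returns immediately, while a separate background thread (not shown) would own connection setup and call `drainForWrite` once the connection is available.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch; not the Storm Netty client.
class BufferedSender {
    private final BlockingQueue<byte[]> pending;

    BufferedSender(int capacity) {
        this.pending = new ArrayBlockingQueue<>(capacity);
    }

    /**
     * Never blocks. Returns false when the buffer is full so the
     * caller can decide whether to drop the message or retry later.
     */
    boolean send(byte[] message) {
        return pending.offer(message);
    }

    /**
     * Intended to be called by a background thread after the
     * connection has been (re)established. Removes and returns
     * all buffered messages in insertion order.
     */
    List<byte[]> drainForWrite() {
        List<byte[]> batch = new ArrayList<>();
        pending.drainTo(batch);
        return batch;
    }

    int pendingCount() {
        return pending.size();
    }
}
```

The bounded capacity is a deliberate choice in this sketch: an unbounded buffer would avoid dropped messages but could exhaust memory if the remote worker stays down.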
Issue Links
- is related to
- STORM-329 Fix cascading Storm failure by improving reconnection strategy and buffering messages
- Resolved