Details
-
Bug
-
Status: Open
-
Major
-
Resolution: Unresolved
-
None
-
None
-
None
-
None
Description
We've experienced a number of issues lately where "ruok" requests would take upwards of 10 seconds to return, and ZooKeeper instances were extremely sluggish. The sluggish instance requires a restart to make it responsive again.
I believe the issue is connections are very imbalanced, leading to certain instances having many thousands of connections, while other instances are largely idle.
A potential solution is periodically disconnecting/reconnecting to balance connections over time; this seems fine because sessions should not be affected, and therefore ephemaral nodes and watches should not be affected.