Details
- Bug
- Status: Resolved
- Blocker
- Resolution: Fixed
- None
- Mesosphere Sprint 27
- 3
Description
Sample Mesos Agent log: https://gist.github.com/brndnmtthws/fb846fa988487250a809
Note: ZooKeeper has a function to change the list of servers at runtime: https://github.com/apache/zookeeper/blob/735ea78909e67c648a4978c8d31d63964986af73/src/c/src/zookeeper.c#L1207-L1232
This comes up when using an AWS AutoScalingGroup for managing the set of masters.
When the agent comes up for the first time, it resolves the hosts in the zk:// string. Once every host in the original string has failed (each one fails and is replaced by a new machine with the same DNS name), the agent keeps spinning in an internal loop and never re-resolves the DNS names.
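To make the failure mode concrete, here is a minimal Python sketch of what re-resolving a zk:// string could look like. It is not the agent's actual code: the helper names `parse_zk_hosts` and `resolve_all`, the default port 2181, and the example hostnames are all assumptions made for illustration.

```python
import socket


def parse_zk_hosts(zk_url):
    """Split 'zk://h1:2181,h2:2181/mesos' into [('h1', 2181), ('h2', 2181)]."""
    if not zk_url.startswith("zk://"):
        raise ValueError("expected a zk:// URL")
    # Keep only the host list, dropping the znode path after the first '/'.
    authority = zk_url[len("zk://"):].split("/", 1)[0]
    # Drop optional 'user:password@' credentials.
    authority = authority.rsplit("@", 1)[-1]
    hosts = []
    for entry in authority.split(","):
        host, _, port = entry.partition(":")
        # Assumption: fall back to ZooKeeper's conventional port 2181.
        hosts.append((host, int(port) if port else 2181))
    return hosts


def resolve_all(hosts, resolver=socket.getaddrinfo):
    """Resolve every (host, port) pair to a set of (ip, port) tuples."""
    addresses = set()
    for host, port in hosts:
        for info in resolver(host, port, type=socket.SOCK_STREAM):
            # info[4] is the socket address; its first element is the IP.
            addresses.add((info[4][0], port))
    return addresses
```

Calling `resolve_all` once at startup and again later, then comparing the two sets, shows whether the machines behind the same DNS names have been replaced. The agent described here only ever performs the first call.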
I see two possible solutions:
1. Update the list of servers by re-resolving the DNS names.
2. Have the agent detect that it hasn't connected recently and kill itself, which forces a re-resolution when the agent starts back up.
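Both options can be driven by the same "time since last successful connection" check. The following Python sketch only illustrates that decision logic; it is not how the agent or the ZooKeeper client implements it. `ConnectionWatchdog`, `check`, the timeout value, and the `re_resolve` callback are hypothetical names. In a real agent the callback would update the client's server list (for example via the ZooKeeper function linked above), and the "exit" result would translate into terminating the process.

```python
import time


class ConnectionWatchdog:
    """Tracks how long it has been since the last successful connection."""

    def __init__(self, timeout_secs, clock=time.monotonic):
        self.timeout_secs = timeout_secs
        self.clock = clock
        self.last_reset = clock()

    def reset(self):
        # Call on every successful connection, or after a recovery attempt.
        self.last_reset = self.clock()

    def expired(self):
        return self.clock() - self.last_reset >= self.timeout_secs


def check(watchdog, re_resolve=None):
    """Return 'ok', 're-resolved' (option 1) or 'exit' (option 2)."""
    if not watchdog.expired():
        return "ok"
    if re_resolve is not None:
        # Option 1: refresh the server list and give the new hosts a chance.
        re_resolve()
        watchdog.reset()
        return "re-resolved"
    # Option 2: the caller should exit so a restart re-resolves the names.
    return "exit"
```

A caller would run `check` periodically from its reconnect loop. Passing a callback selects option 1, and omitting it selects option 2.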
Issue Links
- relates to MESOS-2681 Slave process must restart to update ensemble members (Open)