[MESOS-4546] Mesos Agents needs to re-resolve hosts in zk string on leader change / failure to connect - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.27.1, 0.28.0
Component/s: agent
Labels:
- mesosphere

Target Version/s: 0.28.0
Sprint: Mesosphere Sprint 27
Story Points: 3

Description

Sample Mesos Agent log: https://gist.github.com/brndnmtthws/fb846fa988487250a809

Note: the ZooKeeper C client has a function to change the list of servers at runtime: https://github.com/apache/zookeeper/blob/735ea78909e67c648a4978c8d31d63964986af73/src/c/src/zookeeper.c#L1207-L1232

This comes up when using an AWS AutoScalingGroup for managing the set of masters.

When the agent first comes up, it resolves the hosts in the zk:// string. Once every host in the original string has failed (each one fails and is replaced by a new machine with the same DNS name), the agent keeps spinning in an internal loop and never re-resolves the DNS names.

I see two possible solutions:
1. Update the list of servers / re-resolve the DNS names.
2. Have the agent detect that it hasn't connected recently and kill itself, which forces a re-resolution when the agent starts back up.
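
To illustrate the first option, here is a minimal Python sketch of a connect loop that re-resolves the hostnames from the zk:// string on every round instead of caching the addresses from startup. It is not the Mesos fix and does not use the ZooKeeper client. The names `parse_zk_hosts`, `resolve_all`, `connect_with_reresolve`, and the `try_connect` callback are hypothetical, the default port 2181 is an assumption, and IPv6 literals are not handled.

```python
import socket
from typing import Callable, List, Optional, Tuple

Address = Tuple[str, int]


def parse_zk_hosts(zk_url: str, default_port: int = 2181) -> List[Address]:
    """Split 'zk://h1:2181,h2:2181/path' into (hostname, port) pairs."""
    if not zk_url.startswith("zk://"):
        raise ValueError("expected a zk:// URL")
    authority = zk_url[len("zk://"):].split("/", 1)[0]
    authority = authority.rsplit("@", 1)[-1]  # drop an optional user:pass@ prefix
    pairs = []
    for entry in authority.split(","):
        host, _, port = entry.partition(":")
        pairs.append((host, int(port) if port else default_port))
    return pairs


def system_resolver(host: str) -> List[str]:
    """Resolve a hostname through the OS resolver and return IP strings."""
    infos = socket.getaddrinfo(host, None, type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def resolve_all(
    zk_url: str, resolver: Callable[[str], List[str]] = system_resolver
) -> List[Address]:
    """Resolve every hostname in the zk:// string right now (no caching)."""
    addresses: List[Address] = []
    for host, port in parse_zk_hosts(zk_url):
        try:
            ips = resolver(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            continue  # skip names that do not resolve at the moment
        addresses.extend((ip, port) for ip in ips)
    return addresses


def connect_with_reresolve(
    zk_url: str,
    try_connect: Callable[[Address], bool],
    resolver: Callable[[str], List[str]] = system_resolver,
    max_rounds: int = 3,
) -> Optional[Address]:
    """Try each address; when a whole round fails, resolve the names again."""
    for _ in range(max_rounds):
        for address in resolve_all(zk_url, resolver):
            if try_connect(address):
                return address
    return None
```

The resolver is passed in as a parameter so the behavior can be exercised without real DNS. A real agent would also back off between rounds. If `max_rounds` is exhausted, the caller could fall back to the second option and exit so that a supervisor restarts the process.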

Issue Links

relates to: MESOS-2681 Slave process must restart to update ensemble members (Open)

People

Assignee: Neil Conway

Reporter: Cody Maloney

Shepherd: Joris Van Remoortere

Votes: 0

Watchers: 3

Dates

Created: 28/Jan/16 21:24

Updated: 26/Nov/18 12:22

Resolved: 26/Nov/18 12:22