Mesos / MESOS-4546

Mesos agents need to re-resolve hosts in zk string on leader change / failure to connect

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.27.1, 0.28.0
    • Component/s: agent
    • Sprint: Mesosphere Sprint 27
    • Story Points: 3

    Description

      Sample Mesos Agent log: https://gist.github.com/brndnmtthws/fb846fa988487250a809

      Note that the ZooKeeper C client has a function to change the list of servers at runtime: https://github.com/apache/zookeeper/blob/735ea78909e67c648a4978c8d31d63964986af73/src/c/src/zookeeper.c#L1207-L1232

      This comes up when using an AWS AutoScalingGroup for managing the set of masters.

      When the agent first comes up, it resolves the hosts in the zk:// string. Once every host in the original string has failed (each one fails and is replaced by a new machine with the same DNS name), the agent keeps spinning in an internal loop and never re-resolves the DNS names.
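
      To make the failure mode concrete, here is a minimal Python sketch that parses a zk:// string and resolves its hosts. It is illustrative only and is not Mesos or ZooKeeper code: `parse_zk_url`, `resolve_all`, and the injectable `resolver` parameter are hypothetical names, and port 2181 is assumed when an entry omits one. Addresses captured by a single call at startup go stale once DNS points at a replacement machine; only a second call picks up the new address.

```python
import socket


def parse_zk_url(url):
    """Split 'zk://host1:2181,host2:2181/path' into ([(host, port), ...], path)."""
    prefix = "zk://"
    if not url.startswith(prefix):
        raise ValueError("expected a zk:// URL")
    hosts_part, _, path = url[len(prefix):].partition("/")
    # Drop optional 'user:pass@' credentials in front of the host list.
    hosts_part = hosts_part.rpartition("@")[2]
    servers = []
    for entry in hosts_part.split(","):
        host, _, port = entry.partition(":")
        # Assumption: fall back to 2181 when no port is given.
        servers.append((host, int(port) if port else 2181))
    return servers, "/" + path


def resolve_all(servers, resolver=socket.getaddrinfo):
    """Look up every host now and return a list of (ip, port) pairs.

    The result is a snapshot: it only reflects DNS at the time of the call.
    """
    addrs = []
    for host, port in servers:
        try:
            infos = resolver(host, port, type=socket.SOCK_STREAM)
        except socket.gaierror:
            continue  # skip names that do not resolve right now
        # info[4] is the sockaddr; its first two fields are (ip, port).
        addrs.extend(info[4][:2] for info in infos)
    return addrs
```

      Calling `resolve_all` once and reusing the result forever reproduces the behaviour described above; calling it again before each reconnect attempt would not.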

      I see two possible solutions:
      1. Update the list of servers, i.e. re-resolve the DNS names.
      2. Have the agent detect that it hasn't connected recently and kill itself (which forces a re-resolution when the agent starts back up).
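
      The two options can be sketched together as a reconnect loop, shown here in Python. This is only a sketch under stated assumptions and not the fix that was applied in Mesos: `connect_with_reresolve`, `ConnectGaveUp`, and all parameters are hypothetical, and `resolve` and `try_connect` stand in for real DNS lookup and connection logic. The loop performs a fresh lookup on every pass (option 1) and gives up after a deadline (option 2), at which point a real agent would exit so that its supervisor restarts it.

```python
import time


class ConnectGaveUp(Exception):
    """Raised when no server could be reached before the deadline."""


def connect_with_reresolve(servers, resolve, try_connect, give_up_after,
                           clock=time.monotonic, sleep=time.sleep,
                           retry_delay=1.0):
    """Return the first address that accepts a connection.

    servers:       list of (hostname, port) pairs from the zk:// string
    resolve:       callable mapping servers to a list of (ip, port) pairs
    try_connect:   callable returning True if the address is reachable
    give_up_after: seconds to keep trying before raising ConnectGaveUp
    """
    deadline = clock() + give_up_after
    while True:
        # Option 1: look the names up again on every pass instead of
        # reusing the addresses obtained at startup.
        for addr in resolve(servers):
            if try_connect(addr):
                return addr
        # Option 2: stop spinning once the deadline has passed.
        if clock() >= deadline:
            raise ConnectGaveUp("no ZooKeeper server reachable")
        sleep(retry_delay)
```

      Passing the resolver and the clock in as arguments keeps the loop easy to exercise without touching real DNS or waiting in real time.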

            People

              Assignee: Neil Conway (neilc)
              Reporter: Cody Maloney (cmaloney)
              Shepherd: Joris Van Remoortere
