Mesos / MESOS-1523

ZooKeeper timeout should be longer

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.20.0
    • Component/s: agent
    • Labels: None
    • Sprint: Q2 Sprint 4

    Description

      zookeeper_init relies on name resolution, which can fail temporarily. When getaddrinfo returns EAI_AGAIN, which normally suggests a retry, ZooKeeper instead returns EINVAL to the calling code. We currently treat EINVAL as a signal that we should retry.

      However, our timeout is set to 10 seconds. If there are, say, three nameservers and each takes fifteen seconds to time out, a single call to zookeeper_init takes 45 seconds. That already exceeds the timeout, so we try only once before aborting.
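The arithmetic above can be checked with a small sketch. This is not the Mesos retry code: `attemptsBeforeDeadline` is a hypothetical helper, and it assumes that every attempt fails, that each attempt blocks for a fixed duration, and that a new attempt starts only while the overall timeout has not yet elapsed.

```cpp
#include <chrono>

using std::chrono::seconds;

// Hypothetical helper (an assumption for illustration, not Mesos code).
// Counts how many failing attempts are started before `timeout` elapses,
// given that each attempt blocks for `attemptDuration`.
int attemptsBeforeDeadline(seconds timeout, seconds attemptDuration)
{
  // Guard against a non-positive duration, which would never advance time.
  if (attemptDuration <= seconds(0)) {
    return 0;
  }

  int attempts = 0;
  seconds elapsed(0);

  // A new attempt starts only while the deadline has not yet passed.
  while (elapsed < timeout) {
    ++attempts;
    elapsed += attemptDuration;
  }

  return attempts;
}
```

Under these assumptions, `attemptsBeforeDeadline(seconds(10), seconds(45))` yields a single attempt, matching the scenario described above, while a ten-minute timeout, `attemptsBeforeDeadline(seconds(600), seconds(45))`, leaves room for 14 attempts.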

      To increase resilience in the case of name server failure, we should increase this timeout.

      Given that the slave is still able to respond to health checks and tasks are still running, this timeout can be quite long. However, we don't want to stay in this state for too long, because we want to readily observe a more persistent name resolution error.

      As such, ten minutes seems reasonable.

            People

              Assignee: Dominic Hamon (dhamon)
              Reporter: Dominic Hamon (dhamon)