Mesos / MESOS-3790

ZooKeeper connection should retry on EAI_NONAME

Details

    Description

      The ZooKeeper interface is designed to retry (once per second for up to ten minutes) if one or more of the ZooKeeper hostnames can't be resolved (see MESOS-1326 and MESOS-1523).

      However, the current implementation assumes that a DNS resolution failure is indicated by zookeeper_init() returning NULL with errno set to EINVAL (ZooKeeper translates getaddrinfo() failures into errno values). That assumption does not hold for every failure: the current ZooKeeper code maps EAI_NONAME to ENOENT, not EINVAL:

      static int getaddrinfo_errno(int rc) {
          switch(rc) {
          case EAI_NONAME:
      // ZOOKEEPER-1323 EAI_NODATA and EAI_ADDRFAMILY are deprecated in FreeBSD.
      #if defined EAI_NODATA && EAI_NODATA != EAI_NONAME
          case EAI_NODATA:
      #endif
              return ENOENT;
          case EAI_MEMORY:
              return ENOMEM;
          default:
              return EINVAL;
          }
      }
      

      getaddrinfo() returns EAI_NONAME when "the node or service is not known"; per discussion in MESOS-2186, this seems to happen intermittently due to DNS failures.

      Proposed fix: looking at errno is always going to be somewhat fragile, but if we're going to continue doing that, we should check for ENOENT as well as EINVAL.
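
The proposed check could look roughly like the sketch below. It is not the actual Mesos patch: `is_retryable_init_errno`, `init_with_retry`, `init_fn`, and `flaky_init` are hypothetical names introduced only for illustration, and `flaky_init` stands in for zookeeper_init() so the example builds without the ZooKeeper client library. The only behavior taken from the description above is that a failed init returns NULL with errno set to ENOENT or EINVAL when name resolution fails.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: true if the errno left behind by a failed init may
 * indicate a hostname resolution failure that is worth retrying. */
static bool is_retryable_init_errno(int err) {
    /* EINVAL: the default case in getaddrinfo_errno().
     * ENOENT: the mapping for EAI_NONAME (and EAI_NODATA where defined). */
    return err == EINVAL || err == ENOENT;
}

/* Hypothetical callback standing in for zookeeper_init(): returns non-NULL
 * on success, or NULL with errno set on failure. */
typedef void *(*init_fn)(void *ctx);

/* Calls init up to max_attempts times, sleeping delay_seconds between
 * attempts, and retries only when errno looks like a resolution failure.
 * On failure, returns NULL with errno set to the last error seen. */
static void *init_with_retry(init_fn init, void *ctx,
                             int max_attempts, unsigned delay_seconds) {
    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        errno = 0;
        void *handle = init(ctx);
        if (handle != NULL) {
            return handle;
        }
        int saved = errno;  /* preserve the error across sleep() */
        if (!is_retryable_init_errno(saved) || attempt == max_attempts) {
            errno = saved;
            return NULL;
        }
        sleep(delay_seconds);
    }
    return NULL;
}

/* Demo stand-in for a flaky resolver: fails with ENOENT until the counter
 * pointed to by ctx reaches zero, then succeeds. */
static void *flaky_init(void *ctx) {
    int *remaining_failures = ctx;
    if (*remaining_failures > 0) {
        (*remaining_failures)--;
        errno = ENOENT;
        return NULL;
    }
    return ctx;  /* any non-NULL value represents a handle here */
}
```

Calling `init_with_retry(flaky_init, &failures, 600, 1)` would approximate the once-per-second, ten-minute policy described above. Errors outside the two listed values, such as ENOMEM, are returned to the caller immediately and are not retried.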

            People

              andschwa Andrew Schwartzmeyer
              neilc Neil Conway