Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.7.0
    • Fix Version/s: None
    • Component/s: Framework
    • Labels: None

Description

Our environment is set up so that host names (rather than IP addresses) are used when registering services.

When a node is disconnected from the network, it attempts to reconnect and, in order to do this, tries to resolve a host name. That resolution fails, since there is no network connectivity and a DNS server is used.

It appears this type of exception is not retryable, so the node simply gives up and never reconnects, even after network connectivity is restored.

      Is this the expected behavior? Is there any way to configure Curator so that this type of exception is retryable? I had a look at CuratorFrameworkImpl.java around line 768 but there doesn't seem to be anything configurable.

      If this is not the expected behavior (or if it is but you don't mind making it configurable), I should be able to provide a patch via a pull request.

Activity

        Michael Putters added a comment -

        The stacktrace:

        ERROR o.a.c.f.imps.CuratorFrameworkImpl - Background exception was not retry-able or retry gave up
        java.net.UnknownHostException: host.name.here
          at java.net.InetAddress.getAllByName0(InetAddress.java:1250) ~[na:1.7.0_67]
          at java.net.InetAddress.getAllByName(InetAddress.java:1162) ~[na:1.7.0_67]
          at java.net.InetAddress.getAllByName(InetAddress.java:1098) ~[na:1.7.0_67]
          at org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:61) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
          at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:445) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
          at org.apache.curator.utils.DefaultZookeeperFactory.newZooKeeper(DefaultZookeeperFactory.java:29) ~[curator-client-2.7.0.jar:na]
          at org.apache.curator.framework.imps.CuratorFrameworkImpl$2.newZooKeeper(CuratorFrameworkImpl.java:160) ~[curator-framework-2.7.0.jar:na]
          at org.apache.curator.HandleHolder$1.getZooKeeper(HandleHolder.java:94) ~[curator-client-2.7.0.jar:na]
          at org.apache.curator.HandleHolder.getZooKeeper(HandleHolder.java:55) ~[curator-client-2.7.0.jar:na]
          at org.apache.curator.ConnectionState.reset(ConnectionState.java:218) ~[curator-client-2.7.0.jar:na]
          at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:193) ~[curator-client-2.7.0.jar:na]
          at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:87) ~[curator-client-2.7.0.jar:na]
          at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:115) ~[curator-client-2.7.0.jar:na]
          at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:816) [curator-framework-2.7.0.jar:na]
          at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:802) [curator-framework-2.7.0.jar:na]
          at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$400(CuratorFrameworkImpl.java:61) [curator-framework-2.7.0.jar:na]
          at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:272) [curator-framework-2.7.0.jar:na]
          at java.util.concurrent.FutureTask.run(FutureTask.java:262) [na:1.7.0_67]
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_67]
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_67]
          at java.lang.Thread.run(Thread.java:745) [na:1.7.0_67]
        
        Andy Sloane added a comment -

        Seconding this.

        What's super annoying is that the background thread just dies, yet no API call ever throws an exception or otherwise indicates that anything is wrong. If CuratorFramework.start() or LeaderLatch.await() threw an exception we'd be fine, since we'd just die and restart, but instead everything is simply stuck.

        Andy Sloane added a comment - edited

        Actually I take that back; a possible workaround from the client side is this:

            client.getUnhandledErrorListenable().addListener((message, e) -> {
                throw new RuntimeException(message, e);
            });
        

        edit: no, the exception thrown here will also just be printed and will not cause the app to exit, so you have to flip some sort of flag in the callback that your main app watches instead.
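
        A minimal sketch of that flag approach, assuming a plain AtomicBoolean the application polls (the class and field names here are illustrative, not part of Curator):

            import java.util.concurrent.atomic.AtomicBoolean;
            import org.apache.curator.framework.CuratorFramework;

            final class CuratorFailureFlag {
                static final AtomicBoolean failed = new AtomicBoolean(false);

                // A throw from the listener would only be logged, so record the
                // failure in a flag the application can poll instead.
                static void install(CuratorFramework client) {
                    client.getUnhandledErrorListenable().addListener((message, e) -> failed.set(true));
                }
            }

            // In the main loop or a health check, exit so a supervisor restarts us:
            // if (CuratorFailureFlag.failed.get()) System.exit(1);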

        Cameron McKenzie added a comment -

        I don't believe there's currently a way to configure which exceptions Curator considers retryable.

        I'm not sure if changing Curator to retry on the UnknownHostException would do much anyway. The exception is being thrown by the underlying ZK client. Every time Curator tried to create a new ZK client while the hostname was unresolvable, the error would occur.

        I guess if your DNS outage is only for a short period (or you have a long retry period defined) then it would be recoverable.

        I think the best generic fix is probably to give the RetryLoop code a default set of retryable error codes and then allow that set to be overridden. This is a bit of effort due to the way that RetryLoop is currently accessed (statically), but I will have a look into it.
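
        For illustration, such an override might look roughly like this. This is a hypothetical shape only, not Curator's actual API; the default set below mirrors the codes that Curator 2.x's static RetryLoop.shouldRetry() treats as retryable:

            import java.util.EnumSet;
            import java.util.Set;
            import org.apache.zookeeper.KeeperException;

            // Hypothetical: an overridable retry predicate instead of a static check.
            final class RetryableCodes {
                private final Set<KeeperException.Code> codes;

                RetryableCodes() {
                    // Defaults matching RetryLoop.shouldRetry() in Curator 2.x.
                    this(EnumSet.of(
                        KeeperException.Code.CONNECTIONLOSS,
                        KeeperException.Code.OPERATIONTIMEOUT,
                        KeeperException.Code.SESSIONMOVED,
                        KeeperException.Code.SESSIONEXPIRED));
                }

                RetryableCodes(Set<KeeperException.Code> overrides) {
                    this.codes = EnumSet.copyOf(overrides);
                }

                boolean shouldRetry(KeeperException.Code code) {
                    return codes.contains(code);
                }
            }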

        Andy Sloane added a comment -

        Right, there are two cases (permanent and temporary), but in both cases I would argue the behavior is undesirable.

        If the host is truly not resolvable, then you get the background thread exception logged as above, and... nothing else is obviously wrong. CuratorFrameworkImpl.start() returns without issue while the background thread hangs, and there's no API-level indication that anything is wrong (unless you've registered an UnhandledErrorListener), at least if you're using simple things like LeaderLatch.

        If it's a temporary DNS failure, and retrying would work, then retrying in the background and not complaining in start() is fine, but if it's permanent you're stuck without really bubbling the configuration error to the surface.

        Even just throwing the error from CuratorFrameworkImpl.start() to the caller, instead of treating it as a background exception, would improve the situation. And if the client was previously connected and is reconnecting outside of start(), then attempting to reconnect to ZK in the background makes sense.

        Cameron McKenzie added a comment -

        My concern with modifying the start() method functionality is that it breaks the current contract. The current start() method definition contains no throws clause.

        We could possibly introduce a separate start() method that explicitly fails if no connection to ZooKeeper can be established. Would that, in conjunction with control over which exceptions/errors are retryable, solve your respective problems?
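
        In the meantime, a caller-side approximation is possible (a sketch, assuming a Curator version that exposes CuratorFramework.blockUntilConnected(); the 30-second deadline is arbitrary):

            import java.util.concurrent.TimeUnit;
            import org.apache.curator.framework.CuratorFramework;

            final class StartHelper {
                // Start the client, then fail fast if no connection is established
                // within the deadline, instead of hanging silently.
                static CuratorFramework startOrFail(CuratorFramework client) throws InterruptedException {
                    client.start();
                    if (!client.blockUntilConnected(30, TimeUnit.SECONDS)) {
                        client.close();
                        throw new IllegalStateException("could not connect to ZooKeeper");
                    }
                    return client;
                }
            }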

        Andy Sloane added a comment -

        Yeah, that's what I was thinking. No throws clause on start, so there's no API-compatible way to fix that...

        I think that sounds like a reasonable solution.

        Cameron McKenzie added a comment -

        Jordan Zimmerman, Fangmin Lv

        Any thoughts on this approach? If everyone is happy with this, I'll have a look at implementing it.

        Jordan Zimmerman added a comment -

        You can always wrap the exception in a RuntimeException and put the work in the existing start().

        Cameron McKenzie added a comment -

        So you're saying just add an unhandled exception handler and have it throw a RuntimeException if it encounters an UnknownHostException (in this case)?

        That doesn't handle the case where we've already connected and want to allow reconnection, does it?

        We would need to implement a way of programmatically defining retryable errors as well.

        Ionuț G. Stan added a comment - edited

        We've bumped into the same issue. Our DNS server was temporarily down and Curator stopped retrying to connect because ZooKeeper threw a non-retryable exception (UnknownHostException). In ZooKeeper >= 3.4.11 it throws an IllegalArgumentException instead. This behaviour changed as a result of these:

          • https://issues.apache.org/jira/browse/ZOOKEEPER-1576
          • https://issues.apache.org/jira/browse/ZOOKEEPER-2614

        What we've ended up doing is registering a custom ZookeeperFactory with CuratorFrameworkFactory.builder(). That factory is responsible for creating new ZooKeeper instances when retrying. So we're just catching UnknownHostException and IllegalArgumentException there and then throwing a ConnectionLossException, which is retry-able as far as Curator is concerned.

        In case anyone's interested, here's the code, in Scala:

        ZookeeperFactory.scala
        import java.net.UnknownHostException
        import com.typesafe.scalalogging.LazyLogging
        import org.apache.zookeeper.KeeperException.ConnectionLossException
        import org.apache.zookeeper.{Watcher, ZooKeeper}
        
        /** ZooKeeper client factory that's resilient to hostname lookup errors.
          *
          * The purpose of this wrapper is to handle hostname errors encountered
          * while creating ZooKeeper client instances. It works around these issues:
          *
          *   - https://issues.apache.org/jira/browse/ZOOKEEPER-1576
          *   - https://issues.apache.org/jira/browse/ZOOKEEPER-2614
          *   - https://issues.apache.org/jira/browse/CURATOR-229
          *
          * Curator knows how to retry a finite and predefined set of exceptions. What
          * this custom factory does is to map hostname-related exceptions into one
          * that Curator interprets as a retry-able exception. So it will keep trying
          * to establish a connection to ZooKeeper even in the face of such errors.
          *
          * @param servers The list of ZooKeeper hostnames or addresses.
          */
        class ZookeeperFactory(servers: Seq[String])
          extends org.apache.curator.utils.ZookeeperFactory
            with LazyLogging {
        
          override def newZooKeeper(connectString: String, sessionTimeout: Int, watcher: Watcher, canBeReadOnly: Boolean): ZooKeeper = {
            def retry(servers: Seq[String]): ZooKeeper = {
              servers match {
                case Nil =>
                  // All server hostnames have failed. Tell Curator to retry later.
                  throw new ConnectionLossException()
                case remainingServers =>
                  val connectString = remainingServers.mkString(",")
        
                  try {
        
                    new ZooKeeper(connectString, sessionTimeout, watcher, canBeReadOnly)
        
                  } catch {
                    // Apache ZooKeeper <= 3.4.10 will throw an UnknownHostException at
                    // the first hostname which it can't resolve, instead of trying the
                    // following hostnames in the list. So, we just drop the offending
                    // hostnames from the servers list and try again.
                    case e: UnknownHostException =>
                      logger.warn(s"ZooKeeper client creation failed for server list: $connectString", e)
                      retry(remainingServers.drop(1))
        
                    // Apache ZooKeeper >= 3.4.11, will try all hostnames, but we still
                    // want to retry if all of them fail right now.
                    case EmptyHostProvider(e) =>
                      logger.warn(s"ZooKeeper client creation failed for server list: $connectString", e)
                      throw new ConnectionLossException()
                  }
              }
            }
        
            retry(servers)
          }
        }
        
        object EmptyHostProvider {
          private final val MESSAGE = "A HostProvider may not be empty!"
        
          def unapply(e: Throwable): Option[IllegalArgumentException] = {
            e match {
              case e: IllegalArgumentException if e.getMessage == MESSAGE => Some(e)
              case _ => None
            }
          }
        }
        

        And its usage:

        val zk = CuratorFrameworkFactory.builder()
            .connectString(config.servers)
            .sessionTimeoutMs(...)
            .connectionTimeoutMs(...)
            .zookeeperFactory(new ZookeeperFactory(config.servers.split(',')))
            .retryPolicy(new RetryForever(1000))
            .build()
        

People

    • Assignee: Unassigned
    • Reporter: Michael Putters
    • Votes: 1
    • Watchers: 5
