ZooKeeper
  1. ZooKeeper
  2. ZOOKEEPER-1506

Re-try DNS hostname -> IP resolution if node connection fails

    Details

    • Type: Improvement Improvement
    • Status: Patch Available
    • Priority: Critical Critical
    • Resolution: Unresolved
    • Affects Version/s: 3.4.5
    • Fix Version/s: 3.4.7, 3.5.1, 3.6.0
    • Component/s: server
    • Labels:
    • Environment:

      Ubuntu 11.04 64-bit

    • Release Note:
      Incomplete - will break tests.

      Description

      In our zoo.cfg we use hostnames to identify the ZK servers that are part of an ensemble. These hostnames are configured with a low (<= 60s) TTL and the IP address they map to can and does change. Our procedure for replacing/upgrading a ZK node is to boot an entirely new instance and remap the hostname to the new instance's IP address. Our expectation is that when the original ZK node is terminated/shutdown, the remaining nodes in the ensemble would reconnect to the new instance.

      However, what we are noticing is that the remaining ZK nodes do not attempt to re-resolve the hostname->IP mapping for the new server. Once the original ZK node is terminated, the existing servers continue to attempt contacting it at the old IP address. It would be great if the ZK servers could try to re-resolve the hostname when attempting to connect to a lost ZK server, instead of caching the lookup indefinitely. Currently we must do a rolling restart of the ZK ensemble after swapping a node – which at three nodes means we periodically lose quorum.

      The exact method we are following is to boot new instances in EC2 and attach one, of a set of three, Elastic IP address. External to EC2 this IP address remains the same and maps to whatever instance it is attached to. Internal to EC2, the elastic IP hostname has a TTL of about 45-60 seconds and is remapped to the internal (10.x.y.z) address of the instance it is attached to. Therefore, in our case we would like ZK to pickup the new 10.x.y.z address that the elastic IP hostname gets mapped to and reconnect appropriately.

      1. zk-dns-caching-refresh.patch
        7 kB
        Michael Lasevich
      2. ZOOKEEPER-1506.patch
        6 kB
        Michi Mutsuzaki
      3. ZOOKEEPER-1506.patch
        3 kB
        Michi Mutsuzaki
      4. ZOOKEEPER-1506.patch
        3 kB
        Michi Mutsuzaki
      5. ZOOKEEPER-1506.patch
        2 kB
        Michi Mutsuzaki

        Activity

        Hide
        Matt Wise added a comment -

        This is also impacting our environment. We run our entire infrastructure in the cloud, and use a set of static (but short TTL'd) hostnames to point to our Zookeeper instances. We have the exact same concern ... that right now it requires a restart of the Zookeeper application on each system to trigger a 're-lookup' of the DNS record.

        This seems like it should be customizable with a simple option. If 'dns_lookup_on_connect' is turned on, it should do a new DNS lookup for every connection to a ensemble server. If its disabled, it can lookup the DNS once and only once. Tools like 'stunnel' have this behavior and its great in the cloud.

        Show
        Matt Wise added a comment - This is also impacting our environment. We run our entire infrastructure in the cloud, and use a set of static (but short TTL'd) hostnames to point to our Zookeeper instances. We have the exact same concern ... that right now it requires a restart of the Zookeeper application on each system to trigger a 're-lookup' of the DNS record. This seems like it should be customizable with a simple option. If 'dns_lookup_on_connect' is turned on, it should do a new DNS lookup for every connection to a ensemble server. If its disabled, it can lookup the DNS once and only once. Tools like 'stunnel' have this behavior and its great in the cloud.
        Hide
        Matt Wise added a comment -

        This is still an ongoing issue. An AWS failure last night caused us to have to restart our entire Zookeeper cluster one-node-at-a-time because of this bug. Can someone at least set a target date for it?

        Show
        Matt Wise added a comment - This is still an ongoing issue. An AWS failure last night caused us to have to restart our entire Zookeeper cluster one-node-at-a-time because of this bug. Can someone at least set a target date for it?
        Hide
        Thawan Kooburat added a comment -

        There is also a problem with Java DNS caching. I haven't tried this but you might want to check this out
        http://stackoverflow.com/questions/1256556/any-way-to-make-java-honor-the-dns-caching-timeout-ttl

        Show
        Thawan Kooburat added a comment - There is also a problem with Java DNS caching. I haven't tried this but you might want to check this out http://stackoverflow.com/questions/1256556/any-way-to-make-java-honor-the-dns-caching-timeout-ttl
        Hide
        Matt Wise added a comment -

        We've tried that and had no luck... It seems that the Zookeeper app does a Dns lookup before creating its connection object. Once the connection object is opened, it's never re created or updated in the event of a connection failure.

        This is an ongoing (and ugly) bug that impacts the running of Zookeeper in public clouds.

        Will someone please take a look and see how fixable this is?

        Show
        Matt Wise added a comment - We've tried that and had no luck... It seems that the Zookeeper app does a Dns lookup before creating its connection object. Once the connection object is opened, it's never re created or updated in the event of a connection failure. This is an ongoing (and ugly) bug that impacts the running of Zookeeper in public clouds. Will someone please take a look and see how fixable this is?
        Hide
        Daniel Heidebrecht added a comment -

        I am also seeing this with a 5 node Zookeeper 3.4.5 cluster running in AWS. All nodes in the cluster must be restarted.

        Show
        Daniel Heidebrecht added a comment - I am also seeing this with a 5 node Zookeeper 3.4.5 cluster running in AWS. All nodes in the cluster must be restarted.
        Hide
        Matt Wise added a comment -

        This is still an ongoing issue. We've tried making changes to the way Java handles DNS caching and we've been unable to solve the issue. This really seems like a simple thing to fix, and its critical when running Zookeeper in Amazons cloud where internal IP addresses change each time you boot up.

        Please... somebody fix this!

        Show
        Matt Wise added a comment - This is still an ongoing issue. We've tried making changes to the way Java handles DNS caching and we've been unable to solve the issue. This really seems like a simple thing to fix, and its critical when running Zookeeper in Amazons cloud where internal IP addresses change each time you boot up. Please... somebody fix this!
        Hide
        Michael Lasevich added a comment -

        This is a patch for the DNS resolution caching in case other server restarts with a new IP. This is an incomplete patch, since it may cause test cases to fail and only takes care of the elections. More complete patch to come.

        Show
        Michael Lasevich added a comment - This is a patch for the DNS resolution caching in case other server restarts with a new IP. This is an incomplete patch, since it may cause test cases to fail and only takes care of the elections. More complete patch to come.
        Hide
        Michael Lasevich added a comment -

        DNS Caching refresh in case of election connection failure. (apply with patch -p1). Will cause tests to fail, more complete patch to come.

        Show
        Michael Lasevich added a comment - DNS Caching refresh in case of election connection failure. (apply with patch -p1). Will cause tests to fail, more complete patch to come.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12581564/zk-dns-caching-refresh.patch
        against trunk revision 1463329.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        -1 patch. The patch command could not apply the patch.

        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1467//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12581564/zk-dns-caching-refresh.patch against trunk revision 1463329. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1467//console This message is automatically generated.
        Hide
        Charity Majors added a comment -

        This is affecting us too. This is a pretty terrible bug for anyone trying to run zk in the cloud.

        Show
        Charity Majors added a comment - This is affecting us too. This is a pretty terrible bug for anyone trying to run zk in the cloud.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12581564/zk-dns-caching-refresh.patch
        against trunk revision 1561672.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        -1 patch. The patch command could not apply the patch.

        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1904//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12581564/zk-dns-caching-refresh.patch against trunk revision 1561672. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1904//console This message is automatically generated.
        Hide
        Flavio Junqueira added a comment -

        A few comments and concerns with the current patch:

        1. What's the point of making the QuorumServer constructors private?
        2. Please check the spacing, it is not aligned with the other lines.
        3. This patch will recreate the socket addresses every time a connection in QCM fails to get established. Sometimes this happens because the server is not up yet, and in such cases, if the mapping hasn't changed, then the recreation of the socket addresses is necessary. Right now, I don't have any great idea to get around it, so I just wanted to point it out.
        4. I was also thinking about how to test this. It would be good to have a test case.
        Show
        Flavio Junqueira added a comment - A few comments and concerns with the current patch: What's the point of making the QuorumServer constructors private? Please check the spacing, it is not aligned with the other lines. This patch will recreate the socket addresses every time a connection in QCM fails to get established. Sometimes this happens because the server is not up yet, and in such cases, if the mapping hasn't changed, then the recreation of the socket addresses is necessary. Right now, I don't have any great idea to get around it, so I just wanted to point it out. I was also thinking about how to test this. It would be good to have a test case.
        Hide
        Robert Kamphuis added a comment -

        As this seems not to progress much, some tips for people working in AWS or similar enough environments. This is likely not the only, nor the best, but it is working for me.

        • configure the zookeeper ensemble servers to connect to the elastic-IP-address in stead of the hostname
        • on a serious failure of one of the servers, boot a replacement, and re-assign the corresponding elastic-ip to that server.
        • others will reconnect correctly
        • you will need to setup the Security group to explicitly enable the interconnect to 2888/3888(/2181) or your ports of choice for the elasticIPs to enable the connections to work.
        • downsides:
          1. traffic between zookeeper servers goes via whatever boxes doing the elastic-ip to server mapping - bigger latency. My measurements as an example: ping using private IPs vs elastic- IPs: 0.8 ms vs 1.4 msec (500 byte packets - servers in two different AZs in US-east)
          2. you will need to pay for this traffic whereas when using the names which are mapped to the internal IPs you would not.

        Also: for the clients, I am using as connect string static DNS records with names like: zookeeperN.<domain> pointing to the ec2-A-B-C-D.compute-1.amazonaws.com - thus pointing to the elastic-ip's name and not the IPs. These are mapped by EC2 to the active private IPs after assigning the elastic-ip to an instance. The clients will be recognised properly as from the correct security group(s). No need to add all the client IPs - of which I have many, and changing set; just add the clients security groups access to the the zookeeper security group.

        BTW: if someone knows of good resources running zookeeper and curator-based clients in AWS I would kindly like to know where...

        Show
        Robert Kamphuis added a comment - As this seems not to progress much, some tips for people working in AWS or similar enough environments. This is likely not the only, nor the best, but it is working for me. configure the zookeeper ensemble servers to connect to the elastic-IP-address in stead of the hostname on a serious failure of one of the servers, boot a replacement, and re-assign the corresponding elastic-ip to that server. others will reconnect correctly you will need to setup the Security group to explicitly enable the interconnect to 2888/3888(/2181) or your ports of choice for the elasticIPs to enable the connections to work. downsides: traffic between zookeeper servers goes via whatever boxes doing the elastic-ip to server mapping - bigger latency. My measurements as an example: ping using private IPs vs elastic- IPs: 0.8 ms vs 1.4 msec (500 byte packets - servers in two different AZs in US-east) you will need to pay for this traffic whereas when using the names which are mapped to the internal IPs you would not. Also: for the clients, I am using as connect string static DNS records with names like: zookeeperN.<domain> pointing to the ec2-A-B-C-D.compute-1.amazonaws.com - thus pointing to the elastic-ip's name and not the IPs. These are mapped by EC2 to the active private IPs after assigning the elastic-ip to an instance. The clients will be recognised properly as from the correct security group(s). No need to add all the client IPs - of which I have many, and changing set; just add the clients security groups access to the the zookeeper security group. BTW: if someone knows of good resources running zookeeper and curator-based clients in AWS I would kindly like to know where...
        Hide
        Michi Mutsuzaki added a comment -

        I'd like to get this fixed in 3.5. I'll rebase the patch and address Flavio's comments. I'm still not sure how we should test this though.

        Show
        Michi Mutsuzaki added a comment - I'd like to get this fixed in 3.5. I'll rebase the patch and address Flavio's comments. I'm still not sure how we should test this though.
        Hide
        Michi Mutsuzaki added a comment -

        Rebased and cleaned up the patch. No test yet.

        Show
        Michi Mutsuzaki added a comment - Rebased and cleaned up the patch. No test yet.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12642754/ZOOKEEPER-1506.patch
        against trunk revision 1591175.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2072//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2072//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2072//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12642754/ZOOKEEPER-1506.patch against trunk revision 1591175. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2072//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2072//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2072//console This message is automatically generated.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12642801/ZOOKEEPER-1506.patch
        against trunk revision 1591175.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2074//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2074//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2074//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12642801/ZOOKEEPER-1506.patch against trunk revision 1591175. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2074//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2074//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2074//console This message is automatically generated.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12642801/ZOOKEEPER-1506.patch
        against trunk revision 1591175.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2075//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2075//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2075//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12642801/ZOOKEEPER-1506.patch against trunk revision 1591175. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2075//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2075//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2075//console This message is automatically generated.
        Hide
        Michael Lasevich added a comment -

        Michi Mutsuzaki Thanks for taking this over, I clearly lost track of it. For what its worth, we have been running this patch in production for almost a year with no issues but it would be nice to have it properly merged. Thank you.

        Show
        Michael Lasevich added a comment - Michi Mutsuzaki Thanks for taking this over, I clearly lost track of it. For what its worth, we have been running this patch in production for almost a year with no issues but it would be nice to have it properly merged. Thank you.
        Hide
        Michi Mutsuzaki added a comment -

        Thanks Michael, it's good to know that you've been using this patch without an issue.

        Show
        Michi Mutsuzaki added a comment - Thanks Michael, it's good to know that you've been using this patch without an issue.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12642801/ZOOKEEPER-1506.patch
        against trunk revision 1591175.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2076//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2076//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2076//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12642801/ZOOKEEPER-1506.patch against trunk revision 1591175. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2076//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2076//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2076//console This message is automatically generated.
        Hide
        Michi Mutsuzaki added a comment -

        I've been running FollowerResyncConcurrencyTest for a while, but I can't reproduce the failure. Maybe it's a transient failure.

        Show
        Michi Mutsuzaki added a comment - I've been running FollowerResyncConcurrencyTest for a while, but I can't reproduce the failure. Maybe it's a transient failure.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12642801/ZOOKEEPER-1506.patch
        against trunk revision 1591175.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2077//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2077//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2077//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12642801/ZOOKEEPER-1506.patch against trunk revision 1591175. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2077//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2077//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2077//console This message is automatically generated.
        Hide
        Diwaker Gupta added a comment -

        This is a severe issue in any environment that uses DNSMasq or relies on DHCP for DNS mappings. Would really like to see this fixed for 3.5.0, or better yet, the next bugfix release in 3.4.x if there's going to be one.

        Show
        Diwaker Gupta added a comment - This is a severe issue in any environment that uses DNSMasq or relies on DHCP for DNS mappings. Would really like to see this fixed for 3.5.0, or better yet, the next bugfix release in 3.4.x if there's going to be one.
        Hide
        Flavio Junqueira added a comment -

        Diwaker Gupta, is the patch here good for you, have you had a chance to review it?

        Show
        Flavio Junqueira added a comment - Diwaker Gupta , is the patch here good for you, have you had a chance to review it?
        Hide
        MUFEED USMAN added a comment -

        In the case where no IPs change, but a mere DNS outage occurs (for some unforeseen reason); shouldn't the ZKs be able to able to make use of the cached info and restore connectivity once the DNS is back up?

        In the case I'm handling I faced the following:

        2014-02-06 20:22:54,438 [myid:0] - INFO [WorkerReceiver[myid=0]:FastLeaderElection@542] - Notification: 0 (n.leader), 0x200000171 (n.zxid), 0x1 (n.round), LOOKING (n.state), 0 (n.sid), 0x2 (
        n.peerEPoch), LOOKING (my state)
        2014-02-06 20:22:54,451 [myid:0] - WARN [WorkerSender[myid=0]:QuorumCnxManager@368] - Cannot open channel to 1 at election address node02:3888
        java.net.UnknownHostException: node02

        And only ZK restarts restored services.

        Show
        MUFEED USMAN added a comment - In the case where no IPs change, but a mere DNS outage occurs (for some unforeseen reason); shouldn't the ZKs be able to able to make use of the cached info and restore connectivity once the DNS is back up? In the case I'm handling I faced the following: 2014-02-06 20:22:54,438 [myid:0] - INFO [WorkerReceiver [myid=0] :FastLeaderElection@542] - Notification: 0 (n.leader), 0x200000171 (n.zxid), 0x1 (n.round), LOOKING (n.state), 0 (n.sid), 0x2 ( n.peerEPoch), LOOKING (my state) 2014-02-06 20:22:54,451 [myid:0] - WARN [WorkerSender [myid=0] :QuorumCnxManager@368] - Cannot open channel to 1 at election address node02:3888 java.net.UnknownHostException: node02 And only ZK restarts restored services.
        Hide
        Michi Mutsuzaki added a comment -

        Hi Mufeed,

        Good point, I can update this.addr and this.electionAddr only if the InetSocketAddress constructor actually resolved the hostname.

        By the way, are you running ZooKeeper with this patch applied? Does the patch fix the original problem?

        Show
        Michi Mutsuzaki added a comment - Hi Mufeed, Good point, I can update this.addr and this.electionAddr only if the InetSocketAddress constructor actually resolved the hostname. By the way, are you running ZooKeeper with this patch applied? Does the patch fix the original problem?
        Hide
        MUFEED USMAN added a comment -

        Nope. I do not have any patch installed. Happened to hit this JIRA during my search to understand the working relationship between ZooKeeper and DNS when the service was disrupted and did not have an automatic recovery.

        Show
        MUFEED USMAN added a comment - Nope. I do not have any patch installed. Happened to hit this JIRA during my search to understand the working relationship between ZooKeeper and DNS when the service was disrupted and did not have an automatic recovery.
        Hide
        Michi Mutsuzaki added a comment -

        Ok thanks Mufeed. I'll update the patch to handle the case you reported.

        Show
        Michi Mutsuzaki added a comment - Ok thanks Mufeed. I'll update the patch to handle the case you reported.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12648826/ZOOKEEPER-1506.patch
        against trunk revision 1600481.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2125//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2125//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2125//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12648826/ZOOKEEPER-1506.patch against trunk revision 1600481. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2125//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2125//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2125//console This message is automatically generated.
        Hide
        Joshua Buss added a comment -

        This is very, very annoying. Please get the fix into master as soon as possible. Thanks from another heavy user of cloud services (this affects any environment where you use the same hostnames for configuration purposes but dynamically re-build instances which get new IPs).

        Show
        Joshua Buss added a comment - This is very, very annoying. Please get the fix into master as soon as possible. Thanks from another heavy user of cloud services (this affects any environment where you use the same hostnames for configuration purposes but dynamically re-build instances which get new IPs).
        Hide
        Flavio Junqueira added a comment -

        I think there is a deeper problem here that I'm worried about. If you point a name to a different server starting from scratch, then it is like the server state has been wiped out from the perspective of the ZK replication, which can lead to state loss in some corner cases. One way around this is to use reconfiguration: remove and re-introduce the server. Is something like this doable for you?

        Show
        Flavio Junqueira added a comment - I think there is a deeper problem here that I'm worried about. If you point a name to a different server starting from scratch, then it is like the server state has been wiped out from the perspective of the ZK replication, which can lead to state loss in some corner cases. One way around this is to use reconfiguration: remove and re-introduce the server. Is something like this doable for you?
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12648826/ZOOKEEPER-1506.patch
        against trunk revision 1623916.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2337//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2337//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2337//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12648826/ZOOKEEPER-1506.patch against trunk revision 1623916. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2337//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2337//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2337//console This message is automatically generated.
        Hide
        Flavio Junqueira added a comment -

        Let me clarify my previous point. The description of this jira talks about bringing up a new instance to upgrade/replace a ZK node. It it starting a ZK node from scratch, essentially dropping the state of the previous incarnation of the node, that I was trying to point out that could be a problem. Resolving names again as this patch proposes is fine. We should get this one in.

        Show
        Flavio Junqueira added a comment - Let me clarify my previous point. The description of this jira talks about bringing up a new instance to upgrade/replace a ZK node. It it starting a ZK node from scratch, essentially dropping the state of the previous incarnation of the node, that I was trying to point out that could be a problem. Resolving names again as this patch proposes is fine. We should get this one in.
        Hide
        Ramya Bharathi Nimmagadda added a comment -

        Thanks Flavio.
        Michi Mutsuzaki Would you be able to submit a new patch with unit tests included? Thanks

        Show
        Ramya Bharathi Nimmagadda added a comment - Thanks Flavio. Michi Mutsuzaki Would you be able to submit a new patch with unit tests included? Thanks
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12648826/ZOOKEEPER-1506.patch
        against trunk revision 1632209.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2394//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2394//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2394//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12648826/ZOOKEEPER-1506.patch against trunk revision 1632209. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2394//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2394//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2394//console This message is automatically generated.
        Hide
        Flavio Junqueira added a comment -

        I'm not sure I have any great suggestion for how to do a test case for this fix. Any suggestion?

        Show
        Flavio Junqueira added a comment - I'm not sure I have any great suggestion for how to do a test case for this fix. Any suggestion?
        Hide
        Michi Mutsuzaki added a comment -

        We can modify /etc/hosts (or a local hosts file specified in HOSTALIASES environment variable) to simulate ip address change. This wont' work on windows though.

        Show
        Michi Mutsuzaki added a comment - We can modify /etc/hosts (or a local hosts file specified in HOSTALIASES environment variable) to simulate ip address change. This wont' work on windows though.
        Hide
        Ramya Bharathi Nimmagadda added a comment -

        Like the idea. There is a similar file in Windows at "%SystemRoot%\system32\drivers\etc\hosts" that could be updated.

        Here's the format :
        #The IP address should be placed in the first column followed by the corresponding host name.
        #The IP address and the host name should be separated by at least one space.

        Show
        Ramya Bharathi Nimmagadda added a comment - Like the idea. There is a similar file in Windows at "%SystemRoot%\system32\drivers\etc\hosts" that could be updated. Here's the format : #The IP address should be placed in the first column followed by the corresponding host name. #The IP address and the host name should be separated by at least one space.
        Hide
        Michi Mutsuzaki added a comment -

        I manually tested the patch by modifying /etc/hosts and verified that quorum connection manager picks up new ip addresses. As a part of this change, I removed the retry limit in the qcm listener. Otherwise the listener dies after 3 retries if it fails to resolve the hostname.

        Any volunteer for adding a unit test? If not, maybe we could check this in after manually verifying the fix.

        Show
        Michi Mutsuzaki added a comment - I manually tested the patch by modifying /etc/hosts and verified that quorum connection manager picks up new ip addresses. As a part of this change, I removed the retry limit in the qcm listener. Otherwise the listener dies after 3 retries if it fails to resolve the hostname. Any volunteer for adding a unit test? If not, maybe we could check this in after manually verifying the fix.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12675762/ZOOKEEPER-1506.patch
        against trunk revision 1632209.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2395//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2395//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2395//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12675762/ZOOKEEPER-1506.patch against trunk revision 1632209. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2395//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2395//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2395//console This message is automatically generated.
        Hide
        Flavio Junqueira added a comment -

        Modifying /etc/hosts from a unit test doesn't sound like an appropriate thing to do, so I'm still not sure how to write a test for this. Perhaps we just have to agree to not having a test case for this.

        Show
        Flavio Junqueira added a comment - Modifying /etc/hosts from a unit test doesn't sound like an appropriate thing to do, so I'm still not sure how to write a test for this. Perhaps we just have to agree to not having a test case for this.

          People

          • Assignee:
            Michael Lasevich
            Reporter:
            Mike Heffner
          • Votes:
            20 Vote for this issue
            Watchers:
            29 Start watching this issue

            Dates

            • Created:
              Updated:

              Development