ZooKeeper
  1. ZooKeeper
  2. ZOOKEEPER-1506

Re-try DNS hostname -> IP resolution if node connection fails

    Details

    • Type: Improvement Improvement
    • Status: Reopened
    • Priority: Critical Critical
    • Resolution: Unresolved
    • Affects Version/s: 3.4.5
    • Fix Version/s: 3.4.7, 3.5.1, 3.6.0
    • Component/s: server
    • Labels:
    • Environment:

      Ubuntu 11.04 64-bit

    • Release Note:
      Incomplete - will break tests.

      Description

      In our zoo.cfg we use hostnames to identify the ZK servers that are part of an ensemble. These hostnames are configured with a low (<= 60s) TTL and the IP address they map to can and does change. Our procedure for replacing/upgrading a ZK node is to boot an entirely new instance and remap the hostname to the new instance's IP address. Our expectation is that when the original ZK node is terminated/shutdown, the remaining nodes in the ensemble would reconnect to the new instance.

      However, what we are noticing is that the remaining ZK nodes do not attempt to re-resolve the hostname->IP mapping for the new server. Once the original ZK node is terminated, the existing servers continue to attempt contacting it at the old IP address. It would be great if the ZK servers could try to re-resolve the hostname when attempting to connect to a lost ZK server, instead of caching the lookup indefinitely. Currently we must do a rolling restart of the ZK ensemble after swapping a node – which at three nodes means we periodically lose quorum.

      The exact method we are following is to boot new instances in EC2 and attach one, of a set of three, Elastic IP address. External to EC2 this IP address remains the same and maps to whatever instance it is attached to. Internal to EC2, the elastic IP hostname has a TTL of about 45-60 seconds and is remapped to the internal (10.x.y.z) address of the instance it is attached to. Therefore, in our case we would like ZK to pickup the new 10.x.y.z address that the elastic IP hostname gets mapped to and reconnect appropriately.

      1. zk-dns-caching-refresh.patch
        7 kB
        Michael Lasevich
      2. ZOOKEEPER-1506.patch
        5 kB
        Michi Mutsuzaki
      3. ZOOKEEPER-1506.patch
        7 kB
        Michi Mutsuzaki
      4. ZOOKEEPER-1506.patch
        6 kB
        Michi Mutsuzaki
      5. ZOOKEEPER-1506.patch
        6 kB
        Michi Mutsuzaki
      6. ZOOKEEPER-1506.patch
        3 kB
        Michi Mutsuzaki
      7. ZOOKEEPER-1506.patch
        3 kB
        Michi Mutsuzaki
      8. ZOOKEEPER-1506.patch
        2 kB
        Michi Mutsuzaki
      9. ZOOKEEPER-1506-fix.patch
        0.7 kB
        Michi Mutsuzaki

        Activity

        Hide
        Matt Wise added a comment -

        This is also impacting our environment. We run our entire infrastructure in the cloud, and use a set of static (but short TTL'd) hostnames to point to our Zookeeper instances. We have the exact same concern ... that right now it requires a restart of the Zookeeper application on each system to trigger a 're-lookup' of the DNS record.

        This seems like it should be customizable with a simple option. If 'dns_lookup_on_connect' is turned on, it should do a new DNS lookup for every connection to a ensemble server. If its disabled, it can lookup the DNS once and only once. Tools like 'stunnel' have this behavior and its great in the cloud.

        Show
        Matt Wise added a comment - This is also impacting our environment. We run our entire infrastructure in the cloud, and use a set of static (but short TTL'd) hostnames to point to our Zookeeper instances. We have the exact same concern ... that right now it requires a restart of the Zookeeper application on each system to trigger a 're-lookup' of the DNS record. This seems like it should be customizable with a simple option. If 'dns_lookup_on_connect' is turned on, it should do a new DNS lookup for every connection to a ensemble server. If its disabled, it can lookup the DNS once and only once. Tools like 'stunnel' have this behavior and its great in the cloud.
        Hide
        Matt Wise added a comment -

        This is still an ongoing issue. An AWS failure last night caused us to have to restart our entire Zookeeper cluster one-node-at-a-time because of this bug. Can someone at least set a target date for it?

        Show
        Matt Wise added a comment - This is still an ongoing issue. An AWS failure last night caused us to have to restart our entire Zookeeper cluster one-node-at-a-time because of this bug. Can someone at least set a target date for it?
        Hide
        Thawan Kooburat added a comment -

        There is also a problem with Java DNS caching. I haven't tried this but you might want to check this out
        http://stackoverflow.com/questions/1256556/any-way-to-make-java-honor-the-dns-caching-timeout-ttl

        Show
        Thawan Kooburat added a comment - There is also a problem with Java DNS caching. I haven't tried this but you might want to check this out http://stackoverflow.com/questions/1256556/any-way-to-make-java-honor-the-dns-caching-timeout-ttl
        Hide
        Matt Wise added a comment -

        We've tried that and had no luck... It seems that the Zookeeper app does a Dns lookup before creating its connection object. Once the connection object is opened, it's never re created or updated in the event of a connection failure.

        This is an ongoing (and ugly) bug that impacts the running of Zookeeper in public clouds.

        Will someone please take a look and see how fixable this is?

        Show
        Matt Wise added a comment - We've tried that and had no luck... It seems that the Zookeeper app does a Dns lookup before creating its connection object. Once the connection object is opened, it's never re created or updated in the event of a connection failure. This is an ongoing (and ugly) bug that impacts the running of Zookeeper in public clouds. Will someone please take a look and see how fixable this is?
        Hide
        Daniel Heidebrecht added a comment -

        I am also seeing this with a 5 node Zookeeper 3.4.5 cluster running in AWS. All nodes in the cluster must be restarted.

        Show
        Daniel Heidebrecht added a comment - I am also seeing this with a 5 node Zookeeper 3.4.5 cluster running in AWS. All nodes in the cluster must be restarted.
        Hide
        Matt Wise added a comment -

        This is still an ongoing issue. We've tried making changes to the way Java handles DNS caching and we've been unable to solve the issue. This really seems like a simple thing to fix, and its critical when running Zookeeper in Amazons cloud where internal IP addresses change each time you boot up.

        Please... somebody fix this!

        Show
        Matt Wise added a comment - This is still an ongoing issue. We've tried making changes to the way Java handles DNS caching and we've been unable to solve the issue. This really seems like a simple thing to fix, and its critical when running Zookeeper in Amazons cloud where internal IP addresses change each time you boot up. Please... somebody fix this!
        Hide
        Michael Lasevich added a comment -

        This is a patch for the DNS resolution caching in case other server restarts with a new IP. This is an incomplete patch, since it may cause test cases to fail and only takes care of the elections. More complete patch to come.

        Show
        Michael Lasevich added a comment - This is a patch for the DNS resolution caching in case other server restarts with a new IP. This is an incomplete patch, since it may cause test cases to fail and only takes care of the elections. More complete patch to come.
        Hide
        Michael Lasevich added a comment -

        DNS Caching refresh in case of election connection failure. (apply with patch -p1). Will cause tests to fail, more complete patch to come.

        Show
        Michael Lasevich added a comment - DNS Caching refresh in case of election connection failure. (apply with patch -p1). Will cause tests to fail, more complete patch to come.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12581564/zk-dns-caching-refresh.patch
        against trunk revision 1463329.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        -1 patch. The patch command could not apply the patch.

        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1467//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12581564/zk-dns-caching-refresh.patch against trunk revision 1463329. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1467//console This message is automatically generated.
        Hide
        Charity Majors added a comment -

        This is affecting us too. This is a pretty terrible bug for anyone trying to run zk in the cloud.

        Show
        Charity Majors added a comment - This is affecting us too. This is a pretty terrible bug for anyone trying to run zk in the cloud.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12581564/zk-dns-caching-refresh.patch
        against trunk revision 1561672.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        -1 patch. The patch command could not apply the patch.

        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1904//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12581564/zk-dns-caching-refresh.patch against trunk revision 1561672. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1904//console This message is automatically generated.
        Hide
        Flavio Junqueira added a comment -

        A few comments and concerns with the current patch:

        1. What's the point of making the QuorumServer constructors private?
        2. Please check the spacing, it is not aligned with the other lines.
        3. This patch will recreate the socket addresses every time a connection in QCM fails to get established. Sometimes this happens because the server is not up yet, and in such cases, if the mapping hasn't changed, then the recreation of the socket addresses is necessary. Right now, I don't have any great idea to get around it, so I just wanted to point it out.
        4. I was also thinking about how to test this. It would be good to have a test case.
        Show
        Flavio Junqueira added a comment - A few comments and concerns with the current patch: What's the point of making the QuorumServer constructors private? Please check the spacing, it is not aligned with the other lines. This patch will recreate the socket addresses every time a connection in QCM fails to get established. Sometimes this happens because the server is not up yet, and in such cases, if the mapping hasn't changed, then the recreation of the socket addresses is necessary. Right now, I don't have any great idea to get around it, so I just wanted to point it out. I was also thinking about how to test this. It would be good to have a test case.
        Hide
        Robert Kamphuis added a comment -

        As this seems not to progress much, some tips for people working in AWS or similar enough environments. This is likely not the only, nor the best, but it is working for me.

        • configure the zookeeper ensemble servers to connect to the elastic-IP-address in stead of the hostname
        • on a serious failure of one of the servers, boot a replacement, and re-assign the corresponding elastic-ip to that server.
        • others will reconnect correctly
        • you will need to setup the Security group to explicitly enable the interconnect to 2888/3888(/2181) or your ports of choice for the elasticIPs to enable the connections to work.
        • downsides:
          1. traffic between zookeeper servers goes via whatever boxes doing the elastic-ip to server mapping - bigger latency. My measurements as an example: ping using private IPs vs elastic- IPs: 0.8 ms vs 1.4 msec (500 byte packets - servers in two different AZs in US-east)
          2. you will need to pay for this traffic whereas when using the names which are mapped to the internal IPs you would not.

        Also: for the clients, I am using as connect string static DNS records with names like: zookeeperN.<domain> pointing to the ec2-A-B-C-D.compute-1.amazonaws.com - thus pointing to the elastic-ip's name and not the IPs. These are mapped by EC2 to the active private IPs after assigning the elastic-ip to an instance. The clients will be recognised properly as from the correct security group(s). No need to add all the client IPs - of which I have many, and changing set; just add the clients security groups access to the the zookeeper security group.

        BTW: if someone knows of good resources running zookeeper and curator-based clients in AWS I would kindly like to know where...

        Show
        Robert Kamphuis added a comment - As this seems not to progress much, some tips for people working in AWS or similar enough environments. This is likely not the only, nor the best, but it is working for me. configure the zookeeper ensemble servers to connect to the elastic-IP-address in stead of the hostname on a serious failure of one of the servers, boot a replacement, and re-assign the corresponding elastic-ip to that server. others will reconnect correctly you will need to setup the Security group to explicitly enable the interconnect to 2888/3888(/2181) or your ports of choice for the elasticIPs to enable the connections to work. downsides: traffic between zookeeper servers goes via whatever boxes doing the elastic-ip to server mapping - bigger latency. My measurements as an example: ping using private IPs vs elastic- IPs: 0.8 ms vs 1.4 msec (500 byte packets - servers in two different AZs in US-east) you will need to pay for this traffic whereas when using the names which are mapped to the internal IPs you would not. Also: for the clients, I am using as connect string static DNS records with names like: zookeeperN.<domain> pointing to the ec2-A-B-C-D.compute-1.amazonaws.com - thus pointing to the elastic-ip's name and not the IPs. These are mapped by EC2 to the active private IPs after assigning the elastic-ip to an instance. The clients will be recognised properly as from the correct security group(s). No need to add all the client IPs - of which I have many, and changing set; just add the clients security groups access to the the zookeeper security group. BTW: if someone knows of good resources running zookeeper and curator-based clients in AWS I would kindly like to know where...
        Hide
        Michi Mutsuzaki added a comment -

        I'd like to get this fixed in 3.5. I'll rebase the patch and address Flavio's comments. I'm still not sure how we should test this though.

        Show
        Michi Mutsuzaki added a comment - I'd like to get this fixed in 3.5. I'll rebase the patch and address Flavio's comments. I'm still not sure how we should test this though.
        Hide
        Michi Mutsuzaki added a comment -

        Rebased and cleaned up the patch. No test yet.

        Show
        Michi Mutsuzaki added a comment - Rebased and cleaned up the patch. No test yet.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12642754/ZOOKEEPER-1506.patch
        against trunk revision 1591175.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2072//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2072//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2072//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12642754/ZOOKEEPER-1506.patch against trunk revision 1591175. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2072//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2072//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2072//console This message is automatically generated.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12642801/ZOOKEEPER-1506.patch
        against trunk revision 1591175.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2074//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2074//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2074//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12642801/ZOOKEEPER-1506.patch against trunk revision 1591175. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2074//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2074//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2074//console This message is automatically generated.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12642801/ZOOKEEPER-1506.patch
        against trunk revision 1591175.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2075//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2075//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2075//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12642801/ZOOKEEPER-1506.patch against trunk revision 1591175. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2075//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2075//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2075//console This message is automatically generated.
        Hide
        Michael Lasevich added a comment -

        Michi Mutsuzaki Thanks for taking this over, I clearly lost track of it. For what its worth, we have been running this patch in production for almost a year with no issues but it would be nice to have it properly merged. Thank you.

        Show
        Michael Lasevich added a comment - Michi Mutsuzaki Thanks for taking this over, I clearly lost track of it. For what its worth, we have been running this patch in production for almost a year with no issues but it would be nice to have it properly merged. Thank you.
        Hide
        Michi Mutsuzaki added a comment -

        Thanks Michael, it's good to know that you've been using this patch without an issue.

        Show
        Michi Mutsuzaki added a comment - Thanks Michael, it's good to know that you've been using this patch without an issue.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12642801/ZOOKEEPER-1506.patch
        against trunk revision 1591175.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2076//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2076//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2076//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12642801/ZOOKEEPER-1506.patch against trunk revision 1591175. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2076//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2076//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2076//console This message is automatically generated.
        Hide
        Michi Mutsuzaki added a comment -

        I've been running FollowerResyncConcurrencyTest for a while, but I can't reproduce the failure. Maybe it's a transient failure.

        Show
        Michi Mutsuzaki added a comment - I've been running FollowerResyncConcurrencyTest for a while, but I can't reproduce the failure. Maybe it's a transient failure.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12642801/ZOOKEEPER-1506.patch
        against trunk revision 1591175.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2077//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2077//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2077//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12642801/ZOOKEEPER-1506.patch against trunk revision 1591175. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2077//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2077//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2077//console This message is automatically generated.
        Hide
        Diwaker Gupta added a comment -

        This is a severe issue in any environment that uses DNSMasq or relies on DHCP for DNS mappings. Would really like to see this fixed for 3.5.0, or better yet, the next bugfix release in 3.4.x if there's going to be one.

        Show
        Diwaker Gupta added a comment - This is a severe issue in any environment that uses DNSMasq or relies on DHCP for DNS mappings. Would really like to see this fixed for 3.5.0, or better yet, the next bugfix release in 3.4.x if there's going to be one.
        Hide
        Flavio Junqueira added a comment -

        Diwaker Gupta, is the patch here good for you, have you had a chance to review it?

        Show
        Flavio Junqueira added a comment - Diwaker Gupta , is the patch here good for you, have you had a chance to review it?
        Hide
        MUFEED USMAN added a comment -

        In the case where no IPs change, but a mere DNS outage occurs (for some unforeseen reason); shouldn't the ZKs be able to able to make use of the cached info and restore connectivity once the DNS is back up?

        In the case I'm handling I faced the following:

        2014-02-06 20:22:54,438 [myid:0] - INFO [WorkerReceiver[myid=0]:FastLeaderElection@542] - Notification: 0 (n.leader), 0x200000171 (n.zxid), 0x1 (n.round), LOOKING (n.state), 0 (n.sid), 0x2 (
        n.peerEPoch), LOOKING (my state)
        2014-02-06 20:22:54,451 [myid:0] - WARN [WorkerSender[myid=0]:QuorumCnxManager@368] - Cannot open channel to 1 at election address node02:3888
        java.net.UnknownHostException: node02

        And only ZK restarts restored services.

        Show
        MUFEED USMAN added a comment - In the case where no IPs change, but a mere DNS outage occurs (for some unforeseen reason); shouldn't the ZKs be able to able to make use of the cached info and restore connectivity once the DNS is back up? In the case I'm handling I faced the following: 2014-02-06 20:22:54,438 [myid:0] - INFO [WorkerReceiver [myid=0] :FastLeaderElection@542] - Notification: 0 (n.leader), 0x200000171 (n.zxid), 0x1 (n.round), LOOKING (n.state), 0 (n.sid), 0x2 ( n.peerEPoch), LOOKING (my state) 2014-02-06 20:22:54,451 [myid:0] - WARN [WorkerSender [myid=0] :QuorumCnxManager@368] - Cannot open channel to 1 at election address node02:3888 java.net.UnknownHostException: node02 And only ZK restarts restored services.
        Hide
        Michi Mutsuzaki added a comment -

        Hi Mufeed,

        Good point, I can update this.addr and this.electionAddr only if the InetSocketAddress constructor actually resolved the hostname.

        By the way, are you running ZooKeeper with this patch applied? Does the patch fix the original problem?

        Show
        Michi Mutsuzaki added a comment - Hi Mufeed, Good point, I can update this.addr and this.electionAddr only if the InetSocketAddress constructor actually resolved the hostname. By the way, are you running ZooKeeper with this patch applied? Does the patch fix the original problem?
        Hide
        MUFEED USMAN added a comment -

        Nope. I do not have any patch installed. Happened to hit this JIRA during my search to understand the working relationship between ZooKeeper and DNS when the service was disrupted and did not have an automatic recovery.

        Show
        MUFEED USMAN added a comment - Nope. I do not have any patch installed. Happened to hit this JIRA during my search to understand the working relationship between ZooKeeper and DNS when the service was disrupted and did not have an automatic recovery.
        Hide
        Michi Mutsuzaki added a comment -

        Ok thanks Mufeed. I'll update the patch to handle the case you reported.

        Show
        Michi Mutsuzaki added a comment - Ok thanks Mufeed. I'll update the patch to handle the case you reported.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12648826/ZOOKEEPER-1506.patch
        against trunk revision 1600481.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2125//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2125//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2125//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12648826/ZOOKEEPER-1506.patch against trunk revision 1600481. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2125//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2125//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2125//console This message is automatically generated.
        Hide
        Joshua Buss added a comment -

        This is very, very annoying. Please get the fix into master as soon as possible. Thanks from another heavy user of cloud services (this affects any environment where you use the same hostnames for configuration purposes but dynamically re-build instances which get new IPs).

        Show
        Joshua Buss added a comment - This is very, very annoying. Please get the fix into master as soon as possible. Thanks from another heavy user of cloud services (this affects any environment where you use the same hostnames for configuration purposes but dynamically re-build instances which get new IPs).
        Hide
        Flavio Junqueira added a comment -

        I think there is a deeper problem here that I'm worried about. If you point a name to a different server starting from scratch, then it is like the server state has been wiped out from the perspective of the ZK replication, which can lead to state loss in some corner cases. One way around this is to use reconfiguration: remove and re-introduce the server. Is something like this doable for you?

        Show
        Flavio Junqueira added a comment - I think there is a deeper problem here that I'm worried about. If you point a name to a different server starting from scratch, then it is like the server state has been wiped out from the perspective of the ZK replication, which can lead to state loss in some corner cases. One way around this is to use reconfiguration: remove and re-introduce the server. Is something like this doable for you?
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12648826/ZOOKEEPER-1506.patch
        against trunk revision 1623916.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2337//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2337//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2337//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12648826/ZOOKEEPER-1506.patch against trunk revision 1623916. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2337//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2337//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2337//console This message is automatically generated.
        Hide
        Flavio Junqueira added a comment -

        Let me clarify my previous point. The description of this jira talks about bringing up a new instance to upgrade/replace a ZK node. It it starting a ZK node from scratch, essentially dropping the state of the previous incarnation of the node, that I was trying to point out that could be a problem. Resolving names again as this patch proposes is fine. We should get this one in.

        Show
        Flavio Junqueira added a comment - Let me clarify my previous point. The description of this jira talks about bringing up a new instance to upgrade/replace a ZK node. It it starting a ZK node from scratch, essentially dropping the state of the previous incarnation of the node, that I was trying to point out that could be a problem. Resolving names again as this patch proposes is fine. We should get this one in.
        Hide
        Ramya Bharathi Nimmagadda added a comment -

        Thanks Flavio.
        Michi Mutsuzaki Would you be able to submit a new patch with unit tests included? Thanks

        Show
        Ramya Bharathi Nimmagadda added a comment - Thanks Flavio. Michi Mutsuzaki Would you be able to submit a new patch with unit tests included? Thanks
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12648826/ZOOKEEPER-1506.patch
        against trunk revision 1632209.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2394//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2394//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2394//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12648826/ZOOKEEPER-1506.patch against trunk revision 1632209. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2394//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2394//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2394//console This message is automatically generated.
        Hide
        Flavio Junqueira added a comment -

        I'm not sure I have any great suggestion for how to do a test case for this fix. Any suggestion?

        Show
        Flavio Junqueira added a comment - I'm not sure I have any great suggestion for how to do a test case for this fix. Any suggestion?
        Hide
        Michi Mutsuzaki added a comment -

        We can modify /etc/hosts (or a local hosts file specified in HOSTALIASES environment variable) to simulate ip address change. This wont' work on windows though.

        Show
        Michi Mutsuzaki added a comment - We can modify /etc/hosts (or a local hosts file specified in HOSTALIASES environment variable) to simulate ip address change. This wont' work on windows though.
        Hide
        Ramya Bharathi Nimmagadda added a comment -

        Like the idea. There is a similar file in Windows at "%SystemRoot%\system32\drivers\etc\hosts" that could be updated.

        Here's the format :
        #The IP address should be placed in the first column followed by the corresponding host name.
        #The IP address and the host name should be separated by at least one space.

        Show
        Ramya Bharathi Nimmagadda added a comment - Like the idea. There is a similar file in Windows at "%SystemRoot%\system32\drivers\etc\hosts" that could be updated. Here's the format : #The IP address should be placed in the first column followed by the corresponding host name. #The IP address and the host name should be separated by at least one space.
        Hide
        Michi Mutsuzaki added a comment -

        I manually tested the patch by modifying /etc/hosts and verified that quorum connection manager picks up new ip addresses. As a part of this change, I removed the retry limit in the qcm listener. Otherwise the listener dies after 3 retries if it fails to resolve the hostname.

        Any volunteer for adding a unit test? If not, maybe we could check this in after manually verifying the fix.

        Show
        Michi Mutsuzaki added a comment - I manually tested the patch by modifying /etc/hosts and verified that quorum connection manager picks up new ip addresses. As a part of this change, I removed the retry limit in the qcm listener. Otherwise the listener dies after 3 retries if it fails to resolve the hostname. Any volunteer for adding a unit test? If not, maybe we could check this in after manually verifying the fix.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12675762/ZOOKEEPER-1506.patch
        against trunk revision 1632209.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2395//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2395//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2395//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12675762/ZOOKEEPER-1506.patch against trunk revision 1632209. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2395//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2395//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2395//console This message is automatically generated.
        Hide
        Flavio Junqueira added a comment -

        Modifying /etc/hosts from a unit test doesn't sound like an appropriate thing to do, so I'm still not sure how to write a test for this. Perhaps we just have to agree to not having a test case for this.

        Show
        Flavio Junqueira added a comment - Modifying /etc/hosts from a unit test doesn't sound like an appropriate thing to do, so I'm still not sure how to write a test for this. Perhaps we just have to agree to not having a test case for this.
        Hide
        koray sariteke added a comment -

        We also have that problem, when will it be submitted to release?

        Show
        koray sariteke added a comment - We also have that problem, when will it be submitted to release?
        Hide
        Rakesh R added a comment -

        Perhaps we just have to agree to not having a test case for this

        I also failed to find a better approach to write unit test for this. I agree to not having a test case.

        Show
        Rakesh R added a comment - Perhaps we just have to agree to not having a test case for this I also failed to find a better approach to write unit test for this. I agree to not having a test case.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12675762/ZOOKEEPER-1506.patch
        against trunk revision 1665315.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2559//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2559//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2559//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12675762/ZOOKEEPER-1506.patch against trunk revision 1665315. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2559//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2559//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2559//console This message is automatically generated.
        Hide
        Michi Mutsuzaki added a comment -

        Camille FournierRakesh R could you take a look at the patch? I'd like to get this in for 3.5.1. Thanks!

        Show
        Michi Mutsuzaki added a comment - Camille Fournier Rakesh R could you take a look at the patch? I'd like to get this in for 3.5.1. Thanks!
        Hide
        Camille Fournier added a comment -

        The only concern I have is removing the retries. I think it's probably right to do but does it imply documentation changes anywhere that we need to make? Michi Mutsuzaki?

        Show
        Camille Fournier added a comment - The only concern I have is removing the retries. I think it's probably right to do but does it imply documentation changes anywhere that we need to make? Michi Mutsuzaki ?
        Hide
        Michi Mutsuzaki added a comment -

        The retry was added a while back in ZOOKEEPER-512. Flavio Junqueira let me know if I should leave the retry logic as is. Thanks!

        Show
        Michi Mutsuzaki added a comment - The retry was added a while back in ZOOKEEPER-512 . Flavio Junqueira let me know if I should leave the retry logic as is. Thanks!
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12704647/ZOOKEEPER-1506.patch
        against trunk revision 1666768.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2570//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2570//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2570//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12704647/ZOOKEEPER-1506.patch against trunk revision 1666768. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2570//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2570//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2570//console This message is automatically generated.
        Hide
        Flavio Junqueira added a comment -

        One clarification here. Don't we have to guarantee that all code paths to connectOne (both signatures) recreate addresses? It looks like it isn't the case with this patch. For example, QuorumPeer.connectNewPeers invokes the version of connectOne that doesn't recreate.

        Show
        Flavio Junqueira added a comment - One clarification here. Don't we have to guarantee that all code paths to connectOne (both signatures) recreate addresses? It looks like it isn't the case with this patch. For example, QuorumPeer.connectNewPeers invokes the version of connectOne that doesn't recreate.
        Hide
        Michi Mutsuzaki added a comment -

        Thanks Flavio Junqueira, yeah I think you are right. I'll change QuorumPeer.connectNewPeers to call connectOne(long sid) and make connectOne(long sid, InetSocketAddress electionAddr) private so that everybody goes through connectOne(long sid). The only other place we use connectOne(long sid, InetSocketAddress electionAddr) is QuorumCnxManager.receiveConnection() but this one is ok since it creates a new InetSocketAddress explicitly.

        Show
        Michi Mutsuzaki added a comment - Thanks Flavio Junqueira , yeah I think you are right. I'll change QuorumPeer.connectNewPeers to call connectOne(long sid) and make connectOne(long sid, InetSocketAddress electionAddr) private so that everybody goes through connectOne(long sid). The only other place we use connectOne(long sid, InetSocketAddress electionAddr) is QuorumCnxManager.receiveConnection() but this one is ok since it creates a new InetSocketAddress explicitly.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12707161/ZOOKEEPER-1506.patch
        against trunk revision 1669060.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2588//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2588//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2588//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12707161/ZOOKEEPER-1506.patch against trunk revision 1669060. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2588//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2588//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2588//console This message is automatically generated.
        Hide
        Flavio Junqueira added a comment -

        The change looks good, thanks. About the retry logic, the change seems gratuitous. I don´t see a big issue with removing it, but it sounds best to leave the retry logic as is because it is unrelated to the fix here. If needed, propose it in a new jira. Does it make sense?

        Show
        Flavio Junqueira added a comment - The change looks good, thanks. About the retry logic, the change seems gratuitous. I don´t see a big issue with removing it, but it sounds best to leave the retry logic as is because it is unrelated to the fix here. If needed, propose it in a new jira. Does it make sense?
        Hide
        Michi Mutsuzaki added a comment -

        Yes that makes sense. I updated the patch to leave the retry logic as is. Please take a look when you get a chance. Thanks!

        Show
        Michi Mutsuzaki added a comment - Yes that makes sense. I updated the patch to leave the retry logic as is. Please take a look when you get a chance. Thanks!
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12707249/ZOOKEEPER-1506.patch
        against trunk revision 1669062.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2589//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2589//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2589//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12707249/ZOOKEEPER-1506.patch against trunk revision 1669062. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2589//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2589//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2589//console This message is automatically generated.
        Hide
        Michi Mutsuzaki added a comment -

        Camille Fournier Flavio Junqueira could you take a look at the patch? Once this gets checked in, I'll create a release candidate for 3.5.1.

        Show
        Michi Mutsuzaki added a comment - Camille Fournier Flavio Junqueira could you take a look at the patch? Once this gets checked in, I'll create a release candidate for 3.5.1.
        Hide
        Rakesh R added a comment -

        Thanks Michi Mutsuzaki for the fix. +1 latest patch looks good to me.

        Show
        Rakesh R added a comment - Thanks Michi Mutsuzaki for the fix. +1 latest patch looks good to me.
        Hide
        Flavio Junqueira added a comment -

        +1, lgtm. Camille Fournier, is it a go?

        Show
        Flavio Junqueira added a comment - +1, lgtm. Camille Fournier , is it a go?
        Hide
        Camille Fournier added a comment -

        +1

        Show
        Camille Fournier added a comment - +1
        Hide
        Rakesh R added a comment -

        I'll commit this shortly.

        Thanks Flavio Junqueira, Camille Fournier, all others for the reviews and Michi Mutsuzaki, Michael Lasevich for taking care this issue.

        Show
        Rakesh R added a comment - I'll commit this shortly. Thanks Flavio Junqueira , Camille Fournier , all others for the reviews and Michi Mutsuzaki , Michael Lasevich for taking care this issue.
        Hide
        Rakesh R added a comment -

        Committed to trunk : http://svn.apache.org/viewvc?view=revision&revision=1672436
        Committed to br3.5 : http://svn.apache.org/viewvc?view=revision&revision=1672438

        Since the patch has conflicts in 3.4 branch, am not resolving this issue now. Michi Mutsuzaki will you generate a 3.4 patch?

        Show
        Rakesh R added a comment - Committed to trunk : http://svn.apache.org/viewvc?view=revision&revision=1672436 Committed to br3.5 : http://svn.apache.org/viewvc?view=revision&revision=1672438 Since the patch has conflicts in 3.4 branch, am not resolving this issue now. Michi Mutsuzaki will you generate a 3.4 patch?
        Hide
        Hudson added a comment -

        SUCCESS: Integrated in ZooKeeper-trunk #2656 (See https://builds.apache.org/job/ZooKeeper-trunk/2656/)
        ZOOKEEPER-1506 Re-try DNS hostname -> IP resolution if node connection fails(Michi Mutsuzaki via rakeshr) (rakeshr: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1672436)

        • /zookeeper/trunk/CHANGES.txt
        • /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java
        • /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/QuorumPeer.java
        Show
        Hudson added a comment - SUCCESS: Integrated in ZooKeeper-trunk #2656 (See https://builds.apache.org/job/ZooKeeper-trunk/2656/ ) ZOOKEEPER-1506 Re-try DNS hostname -> IP resolution if node connection fails(Michi Mutsuzaki via rakeshr) (rakeshr: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1672436 ) /zookeeper/trunk/CHANGES.txt /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/QuorumPeer.java
        Hide
        Raul Gutierrez Segales added a comment -

        I am running elections (in a 5 participants + 1 observer cluster) as part of validating the 3.5.1 alpha rc proposed by Michi Mutsuzaki. I am getting this from time to time:

        https://gist.github.com/rgs1/d11822799fdbbfa5d5f2

        I only have IP addresses in zoo.cfg and this patch seems to be triggering a reverse lookup (IP-> hostname). Given that in my current setup (a test setup, with systemd-nspawn containers) hostnames don't necessarily resolve back (i.e.: hostname -> IP doesn't work), participants might end up unable to connect to the leader if it's initially unavailable.

        Is the reverse lookup (IP -> hostname) something expected with this patch or a side effect? I don't see why we'd ever want/need that reverse lookup given that it could be problematic in some setups.

        Thoughts?

        p.s.: will post my entire, reproducible, setup a bit later.

        Show
        Raul Gutierrez Segales added a comment - I am running elections (in a 5 participants + 1 observer cluster) as part of validating the 3.5.1 alpha rc proposed by Michi Mutsuzaki . I am getting this from time to time: https://gist.github.com/rgs1/d11822799fdbbfa5d5f2 I only have IP addresses in zoo.cfg and this patch seems to be triggering a reverse lookup (IP-> hostname). Given that in my current setup (a test setup, with systemd-nspawn containers) hostnames don't necessarily resolve back (i.e.: hostname -> IP doesn't work), participants might end up unable to connect to the leader if it's initially unavailable. Is the reverse lookup (IP -> hostname) something expected with this patch or a side effect? I don't see why we'd ever want/need that reverse lookup given that it could be problematic in some setups. Thoughts? p.s.: will post my entire, reproducible, setup a bit later.
        Hide
        Michi Mutsuzaki added a comment -

        The patch uses HostNameUtils.getHostString(), which supposedly avoid reverse lookup. Maybe there is a bug in HostNameUtils.getHostString()?

        We can replace HostNameUtils with InetSocketAddress.getHostString since we now require java7.

        Show
        Michi Mutsuzaki added a comment - The patch uses HostNameUtils.getHostString(), which supposedly avoid reverse lookup. Maybe there is a bug in HostNameUtils.getHostString()? We can replace HostNameUtils with InetSocketAddress.getHostString since we now require java7.
        Hide
        Michi Mutsuzaki added a comment -

        Raul, could you try ZOOKEEPER-1506-fix.patch and see if it fixes the problem? Thanks!

        Show
        Michi Mutsuzaki added a comment - Raul, could you try ZOOKEEPER-1506 -fix.patch and see if it fixes the problem? Thanks!
        Hide
        Raul Gutierrez Segales added a comment -

        testing right now

        Show
        Raul Gutierrez Segales added a comment - testing right now
        Hide
        Raul Gutierrez Segales added a comment -

        hmmm, it does not help Michi Mutsuzaki

        Show
        Raul Gutierrez Segales added a comment - hmmm, it does not help Michi Mutsuzaki
        Hide
        Raul Gutierrez Segales added a comment -

        actually, maybe it does. not sure my first try was clean. couldn't get a repro after the 2nd try.

        Show
        Raul Gutierrez Segales added a comment - actually, maybe it does. not sure my first try was clean. couldn't get a repro after the 2nd try.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12726674/ZOOKEEPER-1506-fix.patch
        against trunk revision 1672934.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2641//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2641//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2641//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12726674/ZOOKEEPER-1506-fix.patch against trunk revision 1672934. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2641//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2641//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2641//console This message is automatically generated.
        Hide
        Raul Gutierrez Segales added a comment -

        It doesn't. I got a consistent repro by first firewalling the participant with id 0, to force that code path.

        I'll try reverting the patch entirely and see if that helps.

        Show
        Raul Gutierrez Segales added a comment - It doesn't. I got a consistent repro by first firewalling the participant with id 0, to force that code path. I'll try reverting the patch entirely and see if that helps.
        Hide
        Michi Mutsuzaki added a comment -

        Thanks Raul for testing this. I'd try replacing calls to getHostName to getHostString. For example, I found another one in QuorumCnxManager.java:

        org/apache/zookeeper/server/quorum/QuorumCnxManager.java: String addr = self.getElectionAddress().getHostName() + ":" + self.getElectionAddress().getPort();

        Show
        Michi Mutsuzaki added a comment - Thanks Raul for testing this. I'd try replacing calls to getHostName to getHostString. For example, I found another one in QuorumCnxManager.java: org/apache/zookeeper/server/quorum/QuorumCnxManager.java: String addr = self.getElectionAddress().getHostName() + ":" + self.getElectionAddress().getPort();
        Hide
        Raul Gutierrez Segales added a comment -

        So I created a build with ZOOKEEPER-1506 removed and I still get the problem.

        It's probably due the getHostName() calls that you pointed out. These calls can actually generate reverse lookups according to:

        http://download.java.net/jdk7/archive/b123/docs/api/java/net/InetSocketAddress.html#getHostName%28%29

        However, these calls have been introduced by ZOOKEEPER-107 (according to git-blame). I think we should avoid them, though lets do that in another ticket.

        In conclusion, if you have a bad resolver or bogus reverse lookups (as is the case in my test scenario): you'll have issues because of these calls.

        Show
        Raul Gutierrez Segales added a comment - So I created a build with ZOOKEEPER-1506 removed and I still get the problem. It's probably due the getHostName() calls that you pointed out. These calls can actually generate reverse lookups according to: http://download.java.net/jdk7/archive/b123/docs/api/java/net/InetSocketAddress.html#getHostName%28%29 However, these calls have been introduced by ZOOKEEPER-107 (according to git-blame). I think we should avoid them, though lets do that in another ticket. In conclusion, if you have a bad resolver or bogus reverse lookups (as is the case in my test scenario): you'll have issues because of these calls.
        Hide
        Raul Gutierrez Segales added a comment -

        I'll go ahead and close this again Michi Mutsuzaki.

        Show
        Raul Gutierrez Segales added a comment - I'll go ahead and close this again Michi Mutsuzaki .
        Hide
        Raul Gutierrez Segales added a comment -

        Looks like we closed this one by accident, there hasn't been a backport to 3.4 yet. I'll prepare a patch.

        cc: Michi Mutsuzaki

        Show
        Raul Gutierrez Segales added a comment - Looks like we closed this one by accident, there hasn't been a backport to 3.4 yet. I'll prepare a patch. cc: Michi Mutsuzaki
        Hide
        André Cruz added a comment -

        Any news regarding the inclusion of this fix in a stable zookeeper version?

        Show
        André Cruz added a comment - Any news regarding the inclusion of this fix in a stable zookeeper version?

          People

          • Assignee:
            Raul Gutierrez Segales
            Reporter:
            Mike Heffner
          • Votes:
            26 Vote for this issue
            Watchers:
            40 Start watching this issue

            Dates

            • Created:
              Updated:

              Development