ZooKeeper / ZOOKEEPER-1506

Re-try DNS hostname -> IP resolution if node connection fails

    Details

    • Type: Improvement
    • Status: Patch Available
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 3.4.5, 3.4.6
    • Fix Version/s: 3.4.7, 3.5.1, 3.6.0
    • Component/s: server
    • Labels:
    • Environment:

      Ubuntu 11.04 64-bit

    • Release Note:
      Tests pass with this patch.

      Description

      In our zoo.cfg we use hostnames to identify the ZK servers that are part of an ensemble. These hostnames are configured with a low (<= 60s) TTL, and the IP address they map to can and does change. Our procedure for replacing or upgrading a ZK node is to boot an entirely new instance and remap the hostname to the new instance's IP address. Our expectation is that when the original ZK node is terminated or shut down, the remaining nodes in the ensemble will reconnect to the new instance.

      However, we are noticing that the remaining ZK nodes do not attempt to re-resolve the hostname->IP mapping for the new server. Once the original ZK node is terminated, the existing servers continue to attempt contacting it at the old IP address. It would be great if the ZK servers could try to re-resolve the hostname when attempting to connect to a lost ZK server, instead of caching the lookup indefinitely. Currently we must do a rolling restart of the ZK ensemble after swapping a node, which at three nodes means we periodically lose quorum.
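The requested behavior can be sketched in Java roughly as follows. This is only an illustration of the idea, not the attached patch: PeerAddress, refresh() and connect() are hypothetical names and do not come from the ZooKeeper code base. The sketch relies on the fact that new InetSocketAddress(host, port) performs the hostname lookup at construction time, so an object created once at startup keeps pointing at the old IP until a new one is built.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Hypothetical holder for one ensemble peer; not a ZooKeeper class. */
final class PeerAddress {
    private final String host;
    private final int port;
    private InetSocketAddress current;

    PeerAddress(String host, int port) {
        this.host = host;
        this.port = port;
        // The hostname is resolved here, once, at construction time.
        this.current = new InetSocketAddress(host, port);
    }

    synchronized InetSocketAddress current() {
        return current;
    }

    /** Runs the lookup again; keeps the old address if the name does not resolve. */
    synchronized InetSocketAddress refresh() {
        InetSocketAddress fresh = new InetSocketAddress(host, port);
        if (!fresh.isUnresolved()) {
            current = fresh;
        }
        return current;
    }

    /** Tries the cached address first, then re-resolves and tries once more. */
    Socket connect(int timeoutMs) throws IOException {
        try {
            return open(current(), timeoutMs);
        } catch (IOException firstFailure) {
            return open(refresh(), timeoutMs);
        }
    }

    private static Socket open(InetSocketAddress addr, int timeoutMs) throws IOException {
        Socket socket = new Socket();
        try {
            socket.connect(addr, timeoutMs);
            return socket;
        } catch (IOException e) {
            socket.close(); // do not leak the socket on a failed attempt
            throw e;
        }
    }
}
```

With this shape, a failed connection to a replaced node triggers one fresh lookup instead of further attempts against the stale IP. Whether the lookup returns the new address still depends on the resolver and on any caching in the JVM.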

      The exact method we are following is to boot new instances in EC2 and attach one of a set of three Elastic IP addresses. External to EC2, this IP address remains the same and maps to whatever instance it is attached to. Internal to EC2, the Elastic IP hostname has a TTL of about 45-60 seconds and is remapped to the internal (10.x.y.z) address of the instance it is attached to. Therefore, in our case we would like ZK to pick up the new 10.x.y.z address that the Elastic IP hostname gets mapped to and reconnect appropriately.
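A related knob is the JVM's own DNS cache, which is separate from the DNS record TTL described above. The sketch below shows one way to cap it through the standard networkaddress.cache.ttl and networkaddress.cache.negative.ttl security properties. DnsCachePolicy is a hypothetical helper name, and the values 30 and 5 are arbitrary examples chosen to sit below the 45-60 second TTL mentioned in the description. On its own this is unlikely to be sufficient, because an InetSocketAddress that was already constructed keeps its resolved IP regardless of the cache policy.

```java
import java.security.Security;

/** Hypothetical helper; the property names are standard JVM security properties. */
final class DnsCachePolicy {
    private DnsCachePolicy() {
    }

    /**
     * Caps how long the JVM caches successful and failed lookups, in seconds.
     * Should run before the first name lookup, since the JVM may read these
     * properties only once.
     */
    static void apply(int positiveTtlSeconds, int negativeTtlSeconds) {
        Security.setProperty("networkaddress.cache.ttl",
                Integer.toString(positiveTtlSeconds));
        Security.setProperty("networkaddress.cache.negative.ttl",
                Integer.toString(negativeTtlSeconds));
    }
}
```

Calling DnsCachePolicy.apply(30, 5) early in startup would let later lookups see a remapped hostname within roughly 30 seconds, provided the code performing the connection actually performs a new lookup.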

      1. zk-dns-caching-refresh.patch
        7 kB
        Michael Lasevich
      2. Zookeeper-1506.patch
        40 kB
        Robert P. Thille
      3. ZOOKEEPER-1506.patch
        39 kB
        Robert P. Thille
      4. ZOOKEEPER-1506.patch
        5 kB
        Michi Mutsuzaki
      5. ZOOKEEPER-1506.patch
        7 kB
        Michi Mutsuzaki
      6. ZOOKEEPER-1506.patch
        6 kB
        Michi Mutsuzaki
      7. ZOOKEEPER-1506.patch
        6 kB
        Michi Mutsuzaki
      8. ZOOKEEPER-1506.patch
        3 kB
        Michi Mutsuzaki
      9. ZOOKEEPER-1506.patch
        3 kB
        Michi Mutsuzaki
      10. ZOOKEEPER-1506.patch
        2 kB
        Michi Mutsuzaki
      11. ZOOKEEPER-1506-fix.patch
        0.7 kB
        Michi Mutsuzaki

        Activity

        Robert P. Thille made changes -
        Attachment ZOOKEEPER-1506.patch [ 12752610 ]
        Robert P. Thille made changes -
        Attachment Zookeeper-1506.patch [ 12752535 ]
        Robert P. Thille made changes -
        Status Reopened [ 4 ] Patch Available [ 10002 ]
        Affects Version/s 3.4.6 [ 12323310 ]
        Release Note Incomplete - will break tests. Tests pass with this patch.
        Michi Mutsuzaki made changes -
        Assignee Michi Mutsuzaki [ michim ] Raul Gutierrez Segales [ rgs ]
        Michi Mutsuzaki made changes -
        Description (no visible text change; the full text appears under Description above)
        Raul Gutierrez Segales made changes -
        Resolution Fixed [ 1 ]
        Status Resolved [ 5 ] Reopened [ 4 ]
        Raul Gutierrez Segales made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Michi Mutsuzaki made changes -
        Attachment ZOOKEEPER-1506-fix.patch [ 12726674 ]
        Rakesh R made changes -
        Assignee Michael Lasevich [ mlasevich ] Michi Mutsuzaki [ michim ]
        Michi Mutsuzaki made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Michi Mutsuzaki made changes -
        Attachment ZOOKEEPER-1506.patch [ 12707249 ]
        Michi Mutsuzaki made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Michi Mutsuzaki made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Michi Mutsuzaki made changes -
        Attachment ZOOKEEPER-1506.patch [ 12707161 ]
        Michi Mutsuzaki made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Michi Mutsuzaki made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Michi Mutsuzaki made changes -
        Attachment ZOOKEEPER-1506.patch [ 12704647 ]
        Michi Mutsuzaki made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Michi Mutsuzaki made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Fix Version/s 3.6.0 [ 12326518 ]
        Michi Mutsuzaki made changes -
        Attachment ZOOKEEPER-1506.patch [ 12675762 ]
        Michi Mutsuzaki made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Flavio Junqueira made changes -
        Fix Version/s 3.4.7 [ 12325149 ]
        Patrick Hunt made changes -
        Fix Version/s 3.5.1 [ 12326786 ]
        Fix Version/s 3.5.0 [ 12316644 ]
        Michi Mutsuzaki made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Michi Mutsuzaki made changes -
        Attachment ZOOKEEPER-1506.patch [ 12648826 ]
        Michi Mutsuzaki made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Michi Mutsuzaki made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Michi Mutsuzaki made changes -
        Attachment ZOOKEEPER-1506.patch [ 12642801 ]
        Michi Mutsuzaki made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Michi Mutsuzaki made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Michi Mutsuzaki made changes -
        Attachment ZOOKEEPER-1506.patch [ 12642754 ]
        Michi Mutsuzaki made changes -
        Priority Major [ 3 ] Critical [ 2 ]
        Flavio Junqueira made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Patrick Hunt made changes -
        Fix Version/s 3.5.0 [ 12316644 ]
        Patrick Hunt made changes -
        Assignee Michael Lasevich [ mlasevich ]
        Michael Lasevich made changes -
        Attachment zk-dns-caching-refresh.patch [ 12581564 ]
        Michael Lasevich made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Release Note Incomplete - will break tests.
        Labels patch
        Mike Heffner made changes -
        Priority Minor [ 4 ] Major [ 3 ]
        Matt Wise made changes -
        Field Original Value New Value
        Affects Version/s 3.4.5 [ 12321883 ]
        Affects Version/s 3.3.5 [ 12319081 ]
        Mike Heffner created issue -

          People

          • Assignee:
            Raul Gutierrez Segales
            Reporter:
            Mike Heffner
          • Votes:
            29
            Watchers:
            44
