Whirr / WHIRR-128

On EC2 instances, the started master node is resolved to a public IP address instead of its public DNS name, so workers cannot connect back to the master

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.3.0
    • Fix Version/s: 0.3.0
    • Component/s: core
    • Labels:
      None
    • Environment:

      Running Hadoop (Apache or CDH distribution) on EC2 instances (Ubuntu, CentOS, or Fedora).
      The same issue occurs with Whirr's integration tests.

      Description

      The problem is related to how reverse address resolution behaves on EC2 instances.
      After isolating the problem, I wrote a very simple app that reproduces the cause of the issue.
      Run the following small piece of code, passing the public IP address of the EC2 instance as the first argument (args[0]).
      InetAddress namenodePublicAddress = InetAddress.getByName(args[0]);
      System.out.println("getHostAddress: " + namenodePublicAddress.getHostAddress());
      System.out.println("getHostName: " + namenodePublicAddress.getHostName());
      System.out.println("getCanonicalHostName: " + namenodePublicAddress.getCanonicalHostName());

      If I run it on my laptop, I get:
      getHostAddress: 50.16.71.64
      getHostName: ec2-50-16-71-64.compute-1.amazonaws.com
      getCanonicalHostName: ec2-50-16-71-64.compute-1.amazonaws.com

      If I run it on an EC2 instance, I get:
      getHostAddress: 50.16.71.64
      getHostName: 50.16.71.64
      getCanonicalHostName: 50.16.71.64

      My laptop runs the same CentOS 5.5 as my EC2 instance, and /etc/resolv.conf contains a nameserver entry in both cases.
      For some unknown reason, java.net.InetAddress's getHostName() and getCanonicalHostName() do not resolve the reverse DNS name of an EC2 public address when the code runs on an EC2 instance.
      Other resolver tools, however, resolve that reverse DNS name correctly.
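
      To compare InetAddress with a plain reverse lookup from inside the same JVM, the PTR record can be queried directly. The following is a minimal sketch, not necessarily what the attached patch does. It assumes the JDK's built-in JNDI DNS provider (com.sun.jndi.dns.DnsContextFactory) is available; the class and method names (ReverseDnsSketch, toArpaName, lookupPtr) are made up for illustration. One plausible explanation for the difference, not confirmed in this report, is that InetAddress checks the reverse-resolved name with a forward lookup and falls back to the IP literal when that lookup returns a different address, which can happen inside EC2 where the public DNS name may resolve to the internal IP.

```java
import java.util.Hashtable;
import javax.naming.NamingException;
import javax.naming.directory.Attribute;
import javax.naming.directory.Attributes;
import javax.naming.directory.DirContext;
import javax.naming.directory.InitialDirContext;

// Hypothetical helper, for illustration only.
class ReverseDnsSketch {

    // Builds the reverse-lookup name for a dotted-quad IPv4 address,
    // e.g. "50.16.71.64" becomes "64.71.16.50.in-addr.arpa".
    static String toArpaName(String ipv4) {
        String[] octets = ipv4.trim().split("\\.");
        if (octets.length != 4) {
            throw new IllegalArgumentException("Not an IPv4 address: " + ipv4);
        }
        return octets[3] + "." + octets[2] + "." + octets[1] + "." + octets[0]
                + ".in-addr.arpa";
    }

    // Queries the PTR record through the JNDI DNS provider, using the
    // platform's configured nameservers. Unlike InetAddress.getHostName(),
    // this does no forward lookup to confirm the result.
    // Returns null when there is no PTR record or the lookup fails.
    static String lookupPtr(String ipv4) {
        Hashtable<String, String> env = new Hashtable<>();
        env.put("java.naming.factory.initial", "com.sun.jndi.dns.DnsContextFactory");
        DirContext ctx = null;
        try {
            ctx = new InitialDirContext(env);
            Attributes attrs = ctx.getAttributes(toArpaName(ipv4), new String[] {"PTR"});
            Attribute ptr = attrs.get("PTR");
            if (ptr == null) {
                return null;
            }
            String name = ptr.get().toString();
            // PTR values are usually fully qualified; drop the trailing dot.
            return name.endsWith(".") ? name.substring(0, name.length() - 1) : name;
        } catch (NamingException e) {
            return null;
        } finally {
            if (ctx != null) {
                try {
                    ctx.close();
                } catch (NamingException ignored) {
                    // Nothing useful to do if closing fails.
                }
            }
        }
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: ReverseDnsSketch <public-ipv4-address>");
            return;
        }
        System.out.println("PTR lookup: " + lookupPtr(args[0]));
    }
}
```

      Running this next to the small app above, on the laptop and on the EC2 instance, would show whether the PTR record itself is visible from inside EC2.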

      The Whirr codebase contains several getHostName() calls. Because of the symptom described above, /etc/hadoop/conf/hadoop-site.xml on the worker nodes is filled in with IP addresses instead of DNS names. It is important to use the public DNS name of an EC2 instance, because Amazon's nameserver resolves it to either the external or the internal IP address, whichever is better for direct communication. In a Hadoop cluster, the security group in use does not allow nodes to communicate with each other over public IP addresses, so the worker nodes cannot reach the services on the master node. The Hadoop logs clearly show that the workers cannot connect to the master.
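
      The symptom can be detected before a value is written into a configuration file: when reverse resolution fails, getHostName() returns the same text as getHostAddress(). The sketch below only illustrates that check; the class and method names (HostNameCheckSketch, isUnresolved) are hypothetical and are not taken from Whirr or from the attached patch.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper, for illustration only.
class HostNameCheckSketch {

    // True when the supposed host name is missing or is just the textual
    // IP address, which is what InetAddress.getHostName() returns when it
    // could not obtain a reverse DNS name.
    static boolean isUnresolved(String hostName, String hostAddress) {
        return hostName == null || hostName.equals(hostAddress);
    }

    public static void main(String[] args) throws UnknownHostException {
        if (args.length != 1) {
            System.err.println("usage: HostNameCheckSketch <public-ip-address>");
            return;
        }
        InetAddress address = InetAddress.getByName(args[0]);
        String hostName = address.getHostName();
        if (isUnresolved(hostName, address.getHostAddress())) {
            // Writing this value into hadoop-site.xml would reproduce the bug.
            System.err.println("No DNS name for " + address.getHostAddress()
                    + "; workers may be unable to reach the master through it");
        } else {
            System.out.println("Resolved DNS name: " + hostName);
        }
    }
}
```

      A check like this makes the failure visible on the node that generates the configuration, rather than later in the worker logs.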

      Attachments

      1. whirr-trunk.patch (11 kB, Tibor Kiss)
      2. on-ec2-before-patch.tar.gz (2 kB, Tibor Kiss)
      3. on-ec2-after-patch.tar.gz (0.9 kB, Tibor Kiss)
      4. compare-myhost-with-ec2.txt (14 kB, Tibor Kiss)

        Activity

        Tibor Kiss created issue -
        Tibor Kiss made changes -
        Field Original Value New Value
        Attachment compare-myhost-with-ec2.txt [ 12458872 ]
        Attachment on-ec2-before-patch.tar.gz [ 12458873 ]
        Attachment on-ec2-after-patch.tar.gz [ 12458874 ]
        Tibor Kiss made changes -
        Attachment whirr-trunk.patch [ 12458875 ]
        Tibor Kiss made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Patrick Hunt made changes -
        Component/s core [ 12313574 ]
        Patrick Hunt made changes -
        Assignee Tibor Kiss [ tibor.kiss ]
        Fix Version/s 0.3.0 [ 12315487 ]
        Tibor Kiss made changes -
        Attachment whirr-trunk.patch [ 12459062 ]
        Tibor Kiss made changes -
        Attachment whirr-trunk.patch [ 12458875 ]
        Tibor Kiss made changes -
        Attachment whirr-trunk.patch [ 12459080 ]
        Tibor Kiss made changes -
        Attachment whirr-trunk.patch [ 12459062 ]
        Tom White made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]

          People

          • Assignee:
            Tibor Kiss
            Reporter:
            Tibor Kiss
          • Votes:
            0
            Watchers:
            1