Hadoop Common

HADOOP-1638: Master node unable to bind to DNS hostname

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.13.0, 0.13.1, 0.14.0, 0.15.0
    • Fix Version/s: 0.14.0
    • Component/s: contrib/cloud
    • Labels:
      None

      Description

      With a release package of Hadoop 0.13.0 or with latest SVN, the Hadoop contrib/ec2 scripts fail to start Hadoop correctly. After working around issues HADOOP-1634 and HADOOP-1635, and setting up a DynDNS address pointing to the master's IP, the ec2/bin/start-hadoop script completes.

      But the cluster is unusable because the namenode and tasktracker have not started successfully. Looking at the namenode log on the master reveals the following error:

      2007-07-19 16:54:53,156 ERROR org.apache.hadoop.dfs.NameNode: java.net.BindException: Cannot assign requested address
      at sun.nio.ch.Net.bind(Native Method)
      at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:119)
      at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:59)
      at org.apache.hadoop.ipc.Server$Listener.<init>(Server.java:186)
      at org.apache.hadoop.ipc.Server.<init>(Server.java:631)
      at org.apache.hadoop.ipc.RPC$Server.<init>(RPC.java:325)
      at org.apache.hadoop.ipc.RPC.getServer(RPC.java:295)
      at org.apache.hadoop.dfs.NameNode.init(NameNode.java:164)
      at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:211)
      at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:803)
      at org.apache.hadoop.dfs.NameNode.main(NameNode.java:811)

      The master node refuses to bind to the DynDNS hostname in the generated hadoop-site.xml. Here is the relevant part of the generated file:

      <property>
        <name>fs.default.name</name>
        <value>blah-ec2.gotdns.org:50001</value>
      </property>

      <property>
        <name>mapred.job.tracker</name>
        <value>blah-ec2.gotdns.org:50002</value>
      </property>

      I'll attach a patch against hadoop-trunk that fixes the issue for me, but I'm not sure if this issue is something that someone can fix more thoroughly.

        Issue Links

          This issue depends upon HADOOP-1202

          Activity

          Stu Hood created issue -
          Stu Hood added a comment -

          Here is the patch I mentioned...

          Sorry about the nasty formatting in the original issue.

          Stu Hood made changes -
          Field Original Value New Value
          Attachment hadoop-1638.patch [ 12362158 ]
          Stu Hood made changes -
          Fix Version/s 0.13.1 [ 12312579 ]
          Status Open [ 1 ] Patch Available [ 10002 ]
          Fix Version/s 0.15.0 [ 12312565 ]
          Fix Version/s 0.14.0 [ 12312474 ]
          Hadoop QA added a comment -

          +1 http://issues.apache.org/jira/secure/attachment/12362158/hadoop-1638.patch applied and successfully tested against trunk revision r557790.

          Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/435/testReport/
          Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/435/console
          Tom White added a comment -

          The hadoop-site.xml is created on node startup using the hostname passed in as user data (which comes from the env file). See src/contrib/ec2/bin/image/hadoop-init. Is the problem that something is corrupted - what are the #</property> lines above? Or is it that the wrong hostname is substituted?
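
          The substitution step Tom describes can be sketched as follows. This is a minimal illustration, not the real hadoop-init script, and the %MASTER_HOST% token is an invented placeholder name:

          ```python
          # Sketch of what a hadoop-init-style script does: substitute the master
          # hostname (delivered as EC2 user data) into a hadoop-site.xml template.
          # The %MASTER_HOST% token is illustrative, not the actual placeholder.
          TEMPLATE = """<property>
            <name>fs.default.name</name>
            <value>%MASTER_HOST%:50001</value>
          </property>

          <property>
            <name>mapred.job.tracker</name>
            <value>%MASTER_HOST%:50002</value>
          </property>"""

          def render_site_xml(template: str, master_host: str) -> str:
              """Substitute the master hostname into the config template."""
              return template.replace("%MASTER_HOST%", master_host)

          print(render_site_xml(TEMPLATE, "blah-ec2.gotdns.org"))
          ```

          If the substitution itself is correct (as Stu confirms below), the generated file will contain the right hostname and the failure must lie elsewhere.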

          Stu Hood added a comment -

          (fixed the confusing formatting)

          The hadoop-init script gets the MASTER_HOST value correctly and places it in hadoop-site.xml, but the problem is that the master node will not bind to the MASTER_HOST value. Since this is the address that gets put in hadoop-site.xml, the jobtracker and namenode will not start.
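
          The failure mode can be reproduced outside Hadoop: a process can only bind a socket to an address assigned to one of its local interfaces, so binding to a public DNS name that resolves to the NAT'd external IP fails with the same errno that underlies java.net.BindException. A small sketch (Python here for brevity; 203.0.113.1 is a reserved documentation address assumed not to be local):

          ```python
          import socket

          def can_bind(host: str, port: int = 0) -> bool:
              """Try to bind a TCP socket to host:port; True on success."""
              s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
              try:
                  s.bind((host, port))  # port 0 = any free port
                  return True
              except OSError:
                  # On Linux this is EADDRNOTAVAIL, the errno behind Java's
                  # "java.net.BindException: Cannot assign requested address"
                  return False
              finally:
                  s.close()

          print(can_bind("0.0.0.0"))      # wildcard address: always bindable
          print(can_bind("203.0.113.1"))  # expected to fail when this is not a local address
          ```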

          Stu Hood made changes -
          Description reformatted (old and new values duplicate the description above)
          Fredrik Hedberg made changes -
          Link This issue depends on HADOOP-1202 [ HADOOP-1202 ]
          Fredrik Hedberg added a comment -

          There doesn't seem to be much action around HADOOP-1202, but it should make this a non-issue.

          Michael Bieniosek added a comment -

          I abandoned HADOOP-1202 because people didn't seem to see any value in it,
          and I changed the way I use hadoop on ec2 around the same time. You're
          welcome to pick it up and port the patch to trunk; it shouldn't be too much
          work.

          -Michael

          Tom White added a comment -

          This problem was caused by the changes made in Amazon EC2 addressing: previously instances were direct addressed (given a single IP routable address) and now they are NAT-addressed (by default, for later tool versions). The key point is that NAT-addressed instances can't access other NAT-addressed instances using the public address. Direct addressing is going to be phased out. See http://developer.amazonwebservices.com/connect/entry.jspa?externalID=682&categoryID=100 for more details.

          Tool versions ec2-api-tools-1.2-9739 and later use NAT addressing; I have been using ec2-api-tools-1.2-7546 (although I thought I had been using ec2-api-tools-1.2-9739), which still uses direct addressing.

          I don't think HADOOP-1202 will make this a non-issue since EC2 NAT instances cannot route to the public address of other instances. So even if the namenode and job tracker could bind to the public address that would not be much help to the slaves since they have to connect to the internal address - so this patch would still be needed.

          Stu, I agree that it would be nice to fix this problem more thoroughly but until we have a better solution I think this approach is fine.

          I've tested with the last three versions of ec2-api-tools and have successfully run the grep example on small multi-node clusters. When NAT addressing is used, however, the web servers on datanodes and tasktrackers are not accessible, since non-routable addresses are used. Apart from this limitation (which can be worked around by logging in to the relevant machine to browse logs), jobs ran OK.

          So I vote to commit this (along with HADOOP-1635, HADOOP-1634) - I'll have some time to do this tomorrow.

          Michael Bieniosek added a comment -

          > This problem was caused by the changes made in Amazon EC2 addressing: previously instances were direct addressed (given a single IP routable address) and now they are NAT-addressed (by default, for later tool versions). The key point is that NAT-addressed instances can't access other NAT-addressed instances using the public address.

          I don't use the hadoop ec2 scripts, but I filed HADOOP-1202 specifically because of this issue.

          The solution I intended with HADOOP-1202 was to make the namenode and jobtracker bind to 0.0.0.0 using my HADOOP-1202 patch, but use the internal addresses in the hadoop configs. I set up an http proxy to view logs for the datanodes and tasktrackers (I have my httpd.conf if anybody is interested). It is then possible to view the jobtracker & namenode website normally (you have to submit jobs from inside the cluster though, since submitting a job writes to the dfs). The problem is that you can't use the dfs from outside the cluster; instead you have to use some proxying solution which will be much slower (in our case it took longer to copy data back than to compute it).

          If you need to use dfs, the real solution is to make all datanodes bind to 0.0.0.0, make the namenode aware that each datanode has two addresses, and make sure the namenode knows when to use which one. This would require significantly more work than my HADOOP-1202 patch though.
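
          The split Michael describes (the address a daemon binds to versus the address it advertises to peers) can be sketched generically. This is not Hadoop's actual configuration API, and the internal hostname below is made up:

          ```python
          import socket

          BIND_ADDR = "0.0.0.0"  # listen on every local interface
          ADVERTISED_HOST = "ip-10-251-0-1.ec2.internal"  # hypothetical internal EC2 name

          # Binding to the wildcard address succeeds even when the public DNS
          # name does not resolve to a local interface (the case that broke the
          # namenode in this issue).
          server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
          server.bind((BIND_ADDR, 0))
          server.listen(1)
          port = server.getsockname()[1]

          # Peers would be handed the advertised, internally-routable address,
          # not the wildcard the daemon actually bound to.
          print(f"bound to {BIND_ADDR}:{port}; peers connect via {ADVERTISED_HOST}:{port}")
          server.close()
          ```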

          Tom White committed 558546 (2 files)
          Reviews: none

          HADOOP-1638. Fix contrib EC2 scripts to support NAT addressing. Contributed by Stu Hood.

          Tom White committed 558547 (2 files)
          Tom White committed 558548 (2 files)
          Tom White added a comment -

          I've just committed this. Thanks Stu!

          (I also fixed another instance of the column 7 problem in start-hadoop.)

          Tom White made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Doug Cutting made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Doug Cutting made changes -
          Fix Version/s 0.13.1 [ 12312579 ]
          Fix Version/s 0.15.0 [ 12312565 ]
          Gavin made changes -
          Link This issue depends on HADOOP-1202 [ HADOOP-1202 ]
          Gavin made changes -
          Link This issue depends upon HADOOP-1202 [ HADOOP-1202 ]

            People

            • Assignee: Unassigned
            • Reporter: Stu Hood
            • Votes: 0
            • Watchers: 0