
WHIRR-459: DNS Failure when trying to spawn HBase cluster

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.7.0
    • Fix Version/s: 0.7.2
    • Component/s: None
    • Labels: None
    • Environment:
      Trying to use Whirr from behind a NAT

    Description

      While trying to launch an HBase cluster from a system that runs behind a NAT, I get the following exception. The cluster is spawned and then destroyed. The same command, when run from another EC2 instance, works fine.

      bin/whirr launch-cluster --config hbase-ec2.properties
      Bootstrapping cluster
      Configuring template
      Configuring template
      Starting 1 node(s) with roles [zookeeper, hadoop-namenode, hadoop-jobtracker, hbase-master]
      Starting 2 node(s) with roles [hadoop-datanode, hadoop-tasktracker, hbase-regionserver]
      Nodes started: [[id=us-east-1/i-5890203a, providerId=i-5890203a, group=hbase, name=hbase-5890203a, location=[id=us-east-1c, scope=ZONE, description=us-east-1c, parent=us-east-1, iso3166Codes=[US-VA], metadata={}], uri=null, imageId=us-east-1/ami-da0cf8b3, os=[name=null, family=ubuntu, version=10.04, arch=paravirtual, is64Bit=true, description=ubuntu-images-us/ubuntu-lucid-10.04-amd64-server-20101020.manifest.xml], state=RUNNING, loginPort=22, hostname=domU-12-31-39-0F-94-D1, privateAddresses=[10.193.151.31], publicAddresses=[204.236.208.250], hardware=[id=c1.xlarge, providerId=c1.xlarge, name=null, processors=[[cores=8.0, speed=2.5]], ram=7168, volumes=[[id=null, type=LOCAL, size=10.0, device=/dev/sda1, durable=false, isBootDevice=true], [id=null, type=LOCAL, size=420.0, device=/dev/sdb, durable=false, isBootDevice=false], [id=null, type=LOCAL, size=420.0, device=/dev/sdc, durable=false, isBootDevice=false], [id=null, type=LOCAL, size=420.0, device=/dev/sdd, durable=false, isBootDevice=false], [id=null, type=LOCAL, size=420.0, device=/dev/sde, durable=false, isBootDevice=false]], supportsImage=And(ALWAYS_TRUE,Or(isWindows(),requiresVirtualizationType(paravirtual)),ALWAYS_TRUE,is64Bit()), tags=[]], loginUser=ubuntu, userMetadata={Name=hbase-5890203a}, tags=[]]]
      Nodes started: [[id=us-east-1/i-54902036, providerId=i-54902036, group=hbase, name=hbase-54902036, location=[id=us-east-1c, scope=ZONE, description=us-east-1c, parent=us-east-1, iso3166Codes=[US-VA], metadata={}], uri=null, imageId=us-east-1/ami-da0cf8b3, os=[name=null, family=ubuntu, version=10.04, arch=paravirtual, is64Bit=true, description=ubuntu-images-us/ubuntu-lucid-10.04-amd64-server-20101020.manifest.xml], state=RUNNING, loginPort=22, hostname=ip-10-7-29-242, privateAddresses=[10.7.29.242], publicAddresses=[75.101.240.254], hardware=[id=c1.xlarge, providerId=c1.xlarge, name=null, processors=[[cores=8.0, speed=2.5]], ram=7168, volumes=[[id=null, type=LOCAL, size=10.0, device=/dev/sda1, durable=false, isBootDevice=true], [id=null, type=LOCAL, size=420.0, device=/dev/sdb, durable=false, isBootDevice=false], [id=null, type=LOCAL, size=420.0, device=/dev/sdc, durable=false, isBootDevice=false], [id=null, type=LOCAL, size=420.0, device=/dev/sdd, durable=false, isBootDevice=false], [id=null, type=LOCAL, size=420.0, device=/dev/sde, durable=false, isBootDevice=false]], supportsImage=And(ALWAYS_TRUE,Or(isWindows(),requiresVirtualizationType(paravirtual)),ALWAYS_TRUE,is64Bit()), tags=[]], loginUser=ubuntu, userMetadata={Name=hbase-54902036}, tags=[]], [id=us-east-1/i-5a902038, providerId=i-5a902038, group=hbase, name=hbase-5a902038, location=[id=us-east-1c, scope=ZONE, description=us-east-1c, parent=us-east-1, iso3166Codes=[US-VA], metadata={}], uri=null, imageId=us-east-1/ami-da0cf8b3, os=[name=null, family=ubuntu, version=10.04, arch=paravirtual, is64Bit=true, description=ubuntu-images-us/ubuntu-lucid-10.04-amd64-server-20101020.manifest.xml], state=RUNNING, loginPort=22, hostname=ip-10-108-182-53, privateAddresses=[10.108.182.53], publicAddresses=[50.16.48.211], hardware=[id=c1.xlarge, providerId=c1.xlarge, name=null, processors=[[cores=8.0, speed=2.5]], ram=7168, volumes=[[id=null, type=LOCAL, size=10.0, device=/dev/sda1, durable=false, isBootDevice=true], [id=null, type=LOCAL, size=420.0, device=/dev/sdb, durable=false, isBootDevice=false], [id=null, type=LOCAL, size=420.0, device=/dev/sdc, durable=false, isBootDevice=false], [id=null, type=LOCAL, size=420.0, device=/dev/sdd, durable=false, isBootDevice=false], [id=null, type=LOCAL, size=420.0, device=/dev/sde, durable=false, isBootDevice=false]], supportsImage=And(ALWAYS_TRUE,Or(isWindows(),requiresVirtualizationType(paravirtual)),ALWAYS_TRUE,is64Bit()), tags=[]], loginUser=ubuntu, userMetadata={Name=hbase-5a902038}, tags=[]]]
      Authorizing firewall ingress to [us-east-1/i-5890203a] on ports [2181] for [122.172.0.45/32]
      Unable to start the cluster. Terminating all nodes.
      org.apache.whirr.net.DnsException: java.net.ConnectException: Connection refused
      at org.apache.whirr.net.FastDnsResolver.apply(FastDnsResolver.java:83)
      at org.apache.whirr.net.FastDnsResolver.apply(FastDnsResolver.java:40)
      at org.apache.whirr.Cluster$Instance.getPublicHostName(Cluster.java:112)
      at org.apache.whirr.Cluster$Instance.getPublicAddress(Cluster.java:94)
      at org.apache.whirr.service.hadoop.HadoopNameNodeClusterActionHandler.doBeforeConfigure(HadoopNameNodeClusterActionHandler.java:58)
      at org.apache.whirr.service.hadoop.HadoopClusterActionHandler.beforeConfigure(HadoopClusterActionHandler.java:86)
      at org.apache.whirr.service.ClusterActionHandlerSupport.beforeAction(ClusterActionHandlerSupport.java:53)
      at org.apache.whirr.actions.ScriptBasedClusterAction.execute(ScriptBasedClusterAction.java:100)
      at org.apache.whirr.ClusterController.launchCluster(ClusterController.java:109)
      at org.apache.whirr.cli.command.LaunchClusterCommand.run(LaunchClusterCommand.java:63)
      at org.apache.whirr.cli.Main.run(Main.java:64)
      at org.apache.whirr.cli.Main.main(Main.java:97)
      Caused by: java.net.ConnectException: Connection refused
      at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
      at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:592)
      at org.xbill.DNS.TCPClient.connect(TCPClient.java:30)
      at org.xbill.DNS.TCPClient.sendrecv(TCPClient.java:118)
      at org.xbill.DNS.SimpleResolver.send(SimpleResolver.java:254)
      at org.xbill.DNS.ExtendedResolver$Resolution.start(ExtendedResolver.java:95)
      at org.xbill.DNS.ExtendedResolver.send(ExtendedResolver.java:358)
      at org.apache.whirr.net.FastDnsResolver.apply(FastDnsResolver.java:69)
      ... 11 more
      Unable to load cluster state, assuming it has no running nodes.
      java.io.FileNotFoundException: /home/akash/.whirr/hbase/instances (No such file or directory)
      at java.io.FileInputStream.open(Native Method)
      at java.io.FileInputStream.<init>(FileInputStream.java:137)
      at com.google.common.io.Files$1.getInput(Files.java:100)
      at com.google.common.io.Files$1.getInput(Files.java:97)
      at com.google.common.io.CharStreams$2.getInput(CharStreams.java:91)
      at com.google.common.io.CharStreams$2.getInput(CharStreams.java:88)
      at com.google.common.io.CharStreams.readLines(CharStreams.java:306)
      at com.google.common.io.Files.readLines(Files.java:580)
      at org.apache.whirr.state.FileClusterStateStore.load(FileClusterStateStore.java:54)
      at org.apache.whirr.state.ClusterStateStore.tryLoadOrEmpty(ClusterStateStore.java:58)
      at org.apache.whirr.ClusterController.destroyCluster(ClusterController.java:143)
      at org.apache.whirr.ClusterController.launchCluster(ClusterController.java:118)
      at org.apache.whirr.cli.command.LaunchClusterCommand.run(LaunchClusterCommand.java:63)
      at org.apache.whirr.cli.Main.run(Main.java:64)
      at org.apache.whirr.cli.Main.main(Main.java:97)
      Starting to run scripts on cluster for phase destroyinstances:
      Starting to run scripts on cluster for phase destroyinstances:
      Finished running destroy phase scripts on all cluster instances
      Destroying hbase cluster
      Cluster hbase destroyed
      Exception in thread "main" java.lang.RuntimeException: org.apache.whirr.net.DnsException: java.net.ConnectException: Connection refused
      at org.apache.whirr.ClusterController.launchCluster(ClusterController.java:125)
      at org.apache.whirr.cli.command.LaunchClusterCommand.run(LaunchClusterCommand.java:63)
      at org.apache.whirr.cli.Main.run(Main.java:64)
      at org.apache.whirr.cli.Main.main(Main.java:97)
      Caused by: org.apache.whirr.net.DnsException: java.net.ConnectException: Connection refused
      at org.apache.whirr.net.FastDnsResolver.apply(FastDnsResolver.java:83)
      at org.apache.whirr.net.FastDnsResolver.apply(FastDnsResolver.java:40)
      at org.apache.whirr.Cluster$Instance.getPublicHostName(Cluster.java:112)
      at org.apache.whirr.Cluster$Instance.getPublicAddress(Cluster.java:94)
      at org.apache.whirr.service.hadoop.HadoopNameNodeClusterActionHandler.doBeforeConfigure(HadoopNameNodeClusterActionHandler.java:58)
      at org.apache.whirr.service.hadoop.HadoopClusterActionHandler.beforeConfigure(HadoopClusterActionHandler.java:86)
      at org.apache.whirr.service.ClusterActionHandlerSupport.beforeAction(ClusterActionHandlerSupport.java:53)
      at org.apache.whirr.actions.ScriptBasedClusterAction.execute(ScriptBasedClusterAction.java:100)
      at org.apache.whirr.ClusterController.launchCluster(ClusterController.java:109)
      ... 3 more
      Caused by: java.net.ConnectException: Connection refused
      at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
      at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:592)
      at org.xbill.DNS.TCPClient.connect(TCPClient.java:30)
      at org.xbill.DNS.TCPClient.sendrecv(TCPClient.java:118)
      at org.xbill.DNS.SimpleResolver.send(SimpleResolver.java:254)
      at org.xbill.DNS.ExtendedResolver$Resolution.start(ExtendedResolver.java:95)
      at org.xbill.DNS.ExtendedResolver.send(ExtendedResolver.java:358)
      at org.apache.whirr.net.FastDnsResolver.apply(FastDnsResolver.java:69)
      ... 11 more

    Attachments

      1. WHIRR-459.patch
        0.8 kB
        Paolo Castagna
      2. WHIRR-459-fallback-to-jclouds-hostname.patch
        6 kB
        Alex Heneveld


    Activity

          Andrei Savu added a comment -

          Committed to trunk and branch 0.7. Thanks Alex!

          Alex Heneveld added a comment -

          falls back to jclouds public hostname if DNS fails, and adds logging; should fix #459, at least in AWS, maybe in other clouds (are there any environments where public hostname is not appropriate?)

          Alex Heneveld added a comment -

          Submitting the more recent patch instead.

          Alex Heneveld added a comment -

          Patch which not only catches the error but also falls back to the jclouds hostname.

          Alex Heneveld added a comment -

          falls back to jclouds public hostname if DNS fails, fixes #459, adds logging

          Alex Heneveld added a comment -

          I ran into this; catching the error helps sometimes, but then it failed – with Hadoop 1.0.2 (not sure whether it's a problem with 0.2x) – because Hadoop was attempting to bind to the public IP, which is not available.

          The patch I'm attaching adds one extra fallback – checking whether nodeMetadata.getHostname() is suitable, and using that in preference to the purely numeric IP address. The good thing about this is that internally (definitely on EC2, and I think on other clouds too) the hostname resolves to the private IP, while externally it resolves to the public address.

          (This may address a few of the other DNS-woe issues.)

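          A minimal sketch of the fallback described in the comment above, assuming the resolver is a Guava Function<String, String> as in the FastDnsResolver snippet later in this thread; the class and method names here are illustrative, not the literal patch:

            import com.google.common.base.Function;
            import org.jclouds.compute.domain.NodeMetadata;

            /** Illustrative fallback (names are ours, not Whirr's): reverse DNS first,
             *  then the hostname jclouds reports, then the raw public IP. */
            public class PublicHostNameFallback {
              private final Function<String, String> resolver;  // e.g. a FastDnsResolver
              private final NodeMetadata node;

              public PublicHostNameFallback(Function<String, String> resolver, NodeMetadata node) {
                this.resolver = resolver;
                this.node = node;
              }

              public String getPublicHostName(String publicIp) {
                try {
                  return resolver.apply(publicIp);  // reverse DNS lookup
                } catch (RuntimeException e) {      // e.g. DnsException when the query is refused
                  String hostname = node.getHostname();
                  // On EC2 this name resolves to the private IP from inside the cloud
                  // and to the public IP from outside, which is what Hadoop and HBase need.
                  return hostname != null ? hostname : publicIp;
                }
              }
            }

          Preferring the cloud-reported hostname over a purely numeric IP is what lets the daemons bind correctly on both sides of the NAT.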
          Andrei Savu added a comment -

          Putting this on the roadmap for 0.7.2 and 0.8.0. Unfortunately handling the ConnectException is not going to make things work for Hadoop & HBase - what we need is the ability to fetch the hostname using the API (this works for Amazon).

          Grant Ingersoll added a comment -

          I can confirm that the patch included here fixes the problem for me running the 5 min quick start and the FastDnsResolverTest. What a colossal waste of a day tracking down that one!

          Grant Ingersoll added a comment -

          Also, from what I can tell, it's getting through the install part (creating the nodes and installing ZooKeeper), but then failing in the configure phase.

          Grant Ingersoll added a comment -

          I'm pretty sure I'm getting this, too, when running the 5 minute quick start. I can confirm FastDnsResolverTest fails as well. This is for both 0.7.1 and trunk.

          Ashish Paliwal added a comment -

          Not sure if this would be of help, but this is a recent thread on the HBase mailing list regarding the DNS issue, and it talks about a utility:

          http://markmail.org/thread/d3l46ejly5kr63g5

          Paolo Castagna added a comment -

          > Do you think it's better if we implement a fail fast mechanism?

          Andrei, I am not sure... I am clearly not an expert on DNS reverse lookups, and I am not completely sure how they are done, used, and needed in the context of a tool such as Whirr.

          Certainly, from a user's point of view, you do not want to wait (and pay!) to provision a cluster only to find out that something goes wrong towards the end (when you have already paid... and on EC2 you pay by the hour even if you use only 2 minutes). Failing fast is good in general, even more so IMHO in this case.

          If reverse DNS lookup is necessary in order to provision a service with Whirr, Whirr should test for that before doing anything that will make a user pay, and deliver a clear error message. I also searched for similar errors and suggestions online, but I did not find anything useful other than this JIRA issue. I think others might hit this problem: it isn't a Whirr problem, but Whirr could help in the diagnosis and, as you suggested, fail fast/sooner.

          My 2 cents.

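          A sketch of the fail-fast idea discussed above, purely illustrative (no such method exists in Whirr): probe reverse DNS once before provisioning anything billable and abort with a clear message if it fails.

            import com.google.common.base.Function;

            /** Hypothetical preflight check; not part of Whirr. */
            public final class ReverseDnsPreflight {
              private ReverseDnsPreflight() {}

              public static void checkOrFail(Function<String, String> resolver) {
                String probeIp = "8.8.8.8";  // any routable address with a PTR record
                try {
                  resolver.apply(probeIp);   // throws if reverse DNS is broken locally
                } catch (RuntimeException e) {
                  throw new IllegalStateException(
                      "Reverse DNS lookups fail from this machine; fix the local "
                          + "resolver/firewall before launching a cluster (WHIRR-459).", e);
                }
              }
            }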
          Andrei Savu added a comment -

          Do you think it's better if we implement a fail fast mechanism?

          Paolo Castagna added a comment - edited

          > Can you try to switch to UDP? (resolver.setTCP(false))

          This didn't help.

          However, I think this is a client problem:

          1. check your router/broadband modem
          2. check your DNS configuration settings (i.e. /etc/resolv.conf, or wherever it lives on Windows)
          3. check your firewall configuration if you are running one
          4. run FastDnsResolverTest.java to quickly check whether reverse DNS queries with Whirr are working (a standalone probe is sketched after this comment)

          In my case it was a problem with 1.
          I can confirm Apache Whirr 0.7.1 works with Apache Hadoop 1.0.1.

          You might decide to apply the patch anyway, but that is not going to spare trouble for others whose reverse DNS requests, for some reason, are not working properly.

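          For item 4 above, here is a minimal standalone probe using dnsjava (the org.xbill.DNS library visible in the stack traces). The class name is hypothetical; it only mimics the query FastDnsResolver sends:

            import org.xbill.DNS.DClass;
            import org.xbill.DNS.ExtendedResolver;
            import org.xbill.DNS.Message;
            import org.xbill.DNS.Record;
            import org.xbill.DNS.Resolver;
            import org.xbill.DNS.ReverseMap;
            import org.xbill.DNS.Type;

            /** Hypothetical standalone reverse-DNS probe; run it from the affected machine. */
            public class ReverseDnsProbe {
              public static void main(String[] args) throws Exception {
                String hostIp = args.length > 0 ? args[0] : "204.236.208.250";
                Record question = Record.newRecord(ReverseMap.fromAddress(hostIp), Type.PTR, DClass.IN);
                Resolver resolver = new ExtendedResolver();  // uses the system's configured nameservers
                // resolver.setTCP(true);  // Whirr's queries go over TCP (see TCPClient in the
                //                         // stack traces); toggle this to compare TCP and UDP
                Message response = resolver.send(Message.newQuery(question));
                System.out.println(response);
              }
            }

          If this throws ConnectException while dig or nslookup works from the same machine, the problem is in how the queries are sent (TCP vs UDP, or a middlebox dropping them), not in Whirr itself.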
          Andrei Savu added a comment -

          You are right. At least for Hadoop and HBase I think we need to be able to find the public hostname. The good news is that on Amazon we can retrieve that information using the API.

          Paolo Castagna added a comment -

          > Do you know why reverse DNS queries over TCP are not supported from your network?

          No, I don't. I never had any issues related to reverse DNS queries before.

          > Can you try to switch to UDP?

          I'll try that.

          > or should we always fallback to standard Java on exception?

          Maybe, but I am not sure... if returning the public IP is not going to work on VMs running in Amazon, what is the point?

          Andrei Savu added a comment -

          Or should we always fall back to standard Java on exception?

          Andrei Savu added a comment -

          Thanks Paolo. This is failing because the public IP is not exposed as a NIC on VMs running in Amazon. Do you know why reverse DNS queries over TCP are not supported from your network? Can you try to switch to UDP? (resolver.setTCP(false))

          Paolo Castagna added a comment -

          This is my attempt at fixing this issue.

          Running FastDnsResolverTest.java locally from Eclipse, I saw exactly the same exceptions.

          With the patch applied, no exceptions. I tried to provision a Hadoop cluster with a patched version of Whirr. The cluster started; however, there might be issues. I used to see public DNS names to connect to the NameNode UI and to the JobTracker UI. This time, I saw IP addresses.

          I failed to connect to them.

          I was able to ssh into the instances of the cluster. But on the master I saw errors:

          2012-03-07 17:30:34,173 FATAL org.apache.hadoop.mapred.JobTracker: java.net.BindException: Problem binding to /50.16.125.61:8021 : Cannot assign requested address
          at org.apache.hadoop.ipc.Server.bind(Server.java:227)
          at org.apache.hadoop.ipc.Server$Listener.<init>(Server.java:301)
          at org.apache.hadoop.ipc.Server.<init>(Server.java:1483)
          at org.apache.hadoop.ipc.RPC$Server.<init>(RPC.java:545)
          at org.apache.hadoop.ipc.RPC.getServer(RPC.java:506)
          at org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:2306)
          at org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:2192)
          at org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:2186)
          at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:300)
          at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:291)
          at org.apache.hadoop.mapred.JobTracker.main(JobTracker.java:4978)
          Caused by: java.net.BindException: Cannot assign requested address
          at sun.nio.ch.Net.bind(Native Method)
          at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:137)
          at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:77)
          at org.apache.hadoop.ipc.Server.bind(Server.java:225)
          ... 10 more

          Similar exception for the NameNode.

          Any idea what's going badly wrong here?

          Andrei Savu added a comment -

          > Maybe it is possible to catch java.net.ConnectException and, as for SocketTimeoutException, return hostIp.

          Sounds reasonable to me. Would that fix the problem for you?

          Paolo Castagna added a comment -

          I am looking at FastDnsResolver.java:

            @Override
            public String apply(String hostIp) {
              try {
                ...
              } catch (SocketTimeoutException e) {
                return hostIp;  /* same response as standard Java on timeout */
              } catch (IOException e) {
                throw new DnsException(e);
              }
            }

          Maybe it is possible to catch java.net.ConnectException and, as for SocketTimeoutException, return hostIp. If not, why not?

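          A sketch of the proposed change, shaped as a fragment of the apply method quoted above (reverseLookup stands in for the elided dnsjava call; imports of java.net.ConnectException and friends are assumed). As the rest of the thread shows, catching alone was not enough without the hostname fallback:

            @Override
            public String apply(String hostIp) {
              try {
                return reverseLookup(hostIp);   // the existing dnsjava lookup, elided above
              } catch (ConnectException e) {
                return hostIp;  /* DNS server unreachable: degrade like a timeout */
              } catch (SocketTimeoutException e) {
                return hostIp;  /* same response as standard Java on timeout */
              } catch (IOException e) {
                throw new DnsException(e);
              }
            }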
          Andrei Savu added a comment -

          Thanks Paolo. I will look into this more later today. There are some known issues with reverse DNS resolution (WHIRR-511) that we are working on for 0.8.0.

          Paolo Castagna added a comment -

          Possible: yes. Ideal: no. I normally use my laptop or desktop to develop and test; when I am ready, I run Whirr. On my laptop or desktop I have everything set up properly. An additional VM just to launch stuff is really a 'PITA'. I know: "if you are behind NAT you are not on the net".

          It would be good to describe the change in behaviour between Whirr 0.6.0-incubating (which AFAIK did not have this problem) and Whirr 0.7.1 and understand what functionality/benefit that change brought to the users.

          By the way, this isn't limited to an HBase cluster; I was firing up a Hadoop cluster.

          Andrei Savu added a comment -

          Paolo, would it be possible for you to use a VM in Amazon as a launcher?

          Paolo Castagna added a comment -

          I am having the same issue and, like Joris, I am interested in any workaround. Thanks.

          Joris Poort added a comment -

          I'm having issues with this, but pardon my ignorance - how does a NAT work and is there any way to get around it?

          Thanks... Joris

          Andrei Savu added a comment -

          Thanks Akash for looking into this. Can we at least make the error message more friendly? Is there a better way of handling this failure scenario? (e.g. returning the raw IP as reverse DNS)

          Akash Ashok added a comment -

          Figured this is not an issue with the code per se, but with the network configuration and accessibility of DNS servers. I have opened a mail conversation with bwelling@xbill.org.

          Can we resolve this issue?

          Akash Ashok added a comment -

          My mistake. It has nothing to do with any reverse connection being made, because there is no reverse connection being made.

          It's the code below, resolver.send, which is failing because my system is not able to connect to the DNS servers, so it's giving "connection refused".

            Message response = resolver.send(newQuery(record));
          Akash Ashok added a comment -

          I was testing why this was happening. As the stack trace shows, it fails in ReverseMap.fromAddress(hostIp). As it turned out, it's not so much the value being passed for resolution as the network from which the call is being made. I ran the same function from my system as standalone code and it threw the same exception.

          So I re-ran it on a system with another IP address and it ran fine. So I am guessing this API somehow has a reverse connection being made.

          Will post further on this issue.


    People

    • Assignee:
      Alex Heneveld
    • Reporter:
      Akash Ashok
    • Votes:
      1
    • Watchers:
      2

    Dates

    • Created:
    • Updated:
    • Resolved: