Hadoop HDFS / HDFS-62

SecondaryNamenode may report incorrect info host name

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      I have set up dfs.secondary.http.address like this:

      <property>
        <name>dfs.secondary.http.address</name>
        <value>secondary.example.com:50090</value>
      </property>
      

      In my setup secondary.example.com resolves to an IP address (say, 192.168.0.10) which is not the same as the local host's own address (as returned by InetAddress.getLocalHost().getHostAddress(), say 192.168.0.1).

      In this situation, edit log related transfers fail. From the namenode log:

      2009-04-05 13:32:39,128 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll Edit Log from 192.168.0.10
      2009-04-05 13:32:39,168 WARN org.mortbay.log: /getimage: java.io.IOException: GetImage failed. java.net.ConnectException: Connection refused
              at java.net.PlainSocketImpl.socketConnect(Native Method)
              at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
              at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
              at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
              at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
              at java.net.Socket.connect(Socket.java:519)
              at java.net.Socket.connect(Socket.java:469)
              at sun.net.NetworkClient.doConnect(NetworkClient.java:163)
              at sun.net.www.http.HttpClient.openServer(HttpClient.java:394)
              at sun.net.www.http.HttpClient.openServer(HttpClient.java:529)
              at sun.net.www.http.HttpClient.<init>(HttpClient.java:233)
              at sun.net.www.http.HttpClient.New(HttpClient.java:306)
              at sun.net.www.http.HttpClient.New(HttpClient.java:323)
              at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:837)
              at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:778)
              at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:703)
              at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1026)
              at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.getFileClient(TransferFsImage.java:151)
              ...
      

      From the secondary namenode log:

      2009-04-05 13:42:39,238 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Exception in doCheckpoint: 
      2009-04-05 13:42:39,238 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: java.io.FileNotFoundException: http://nn.example.com:50070/getimage?putimage=1&port=50090&machine=192.168.0.1&token=-19:1243068779:0:1238929357000:1238929031783
              at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1288)
              at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.getFileClient(TransferFsImage.java:151)
              at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.putFSImage(SecondaryNameNode.java:294)
              at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:333)
              at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:239)
              at java.lang.Thread.run(Thread.java:619)
      
      Attachments

      1. hadoop-5626.txt (3 kB) - Todd Lipcon
      2. HADOOP-5626.patch (0.9 kB) - Carlos Valiente

        Issue Links

          Activity

          Todd Lipcon added a comment -

          This is the issue that prevented TestCheckpoint from passing in HADOOP-3694.

          My patch is in the same spirit as Carlos's, but uses InetAddress.isAnyLocalAddress instead of a string compare. Uploading shortly.
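          For illustration, here is a minimal standalone sketch (not the actual patch) of the check described above: testing whether a configured address is the wildcard via InetAddress.isAnyLocalAddress() rather than comparing against the string "0.0.0.0". The class and method names are hypothetical.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class AnyLocalCheck {
    // Hypothetical helper: treat the wildcard address (0.0.0.0) as
    // "unconfigured" using isAnyLocalAddress() instead of a string compare.
    static boolean isWildcard(String host) throws UnknownHostException {
        return InetAddress.getByName(host).isAnyLocalAddress();
    }

    public static void main(String[] args) throws UnknownHostException {
        System.out.println(isWildcard("0.0.0.0"));   // the wildcard address
        System.out.println(isWildcard("127.0.0.1")); // loopback, not the wildcard
    }
}
```

          This also catches spellings the string compare would miss, such as "0.0.0.0:0" resolved forms or the IPv6 wildcard "::".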

          Todd Lipcon added a comment -

          Fixes the behavior described in this ticket. Modifications to the test improve its speed (100 seconds down to 30 seconds on my machine, where 0.0.0.0 lookup is very slow) and also verify the new behavior (the test fails with the old behavior, as noted in HADOOP-3694).

          steve_l added a comment -

          Todd,

          -can you use the code in DNS to get localhost reliably on more systems?

          Todd Lipcon added a comment -

          Steve: that would require the name of the local interface. In the default configuration, that isn't defined, unless I'm missing something.

          Here's an alternate solution that I like better: I don't see any reason why the 2NN specifies its own IP address as the machine=<ip> parameter to the putimage request. There are two other options:

          • The NN could simply use the requester address of the HTTP request to determine the 2NN's IP.
          • The 2NN could simply use an HTTP PUT or POST to upload the checkpoint rather than requesting a GET callback. I don't know a lot about the Java HTTP facilities, but I assume this isn't too difficult.
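          The first option above can be sketched with the JDK's built-in HTTP server (a hypothetical standalone demo, not Hadoop code): the server reads the caller's address from the connection itself, so the client never passes its own IP as a parameter.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class RemoteAddrDemo {
    public static void main(String[] args) throws Exception {
        // Bind a throwaway HTTP server to an ephemeral loopback port.
        HttpServer server = HttpServer.create(
                new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/getimage", ex -> {
            // The server learns the caller's address from the connection
            // itself, so the client never has to report (and possibly
            // misreport) its own IP in a machine=<ip> parameter.
            String caller = ex.getRemoteAddress().getAddress().getHostAddress();
            byte[] body = caller.getBytes(StandardCharsets.UTF_8);
            ex.sendResponseHeaders(200, body.length);
            try (OutputStream os = ex.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
        int port = server.getAddress().getPort();
        try (InputStream in = new URL(
                "http://127.0.0.1:" + port + "/getimage").openStream()) {
            System.out.println(new String(in.readAllBytes(),
                                          StandardCharsets.UTF_8));
        } finally {
            server.stop(0);
        }
    }
}
```

          Run over loopback, this prints 127.0.0.1. Note that behind a NAT or proxy the connection address may differ from the address the 2NN is reachable at, which is one argument for keeping the behaviour configurable.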
          steve_l added a comment -

          I think we could tease out the code that caches the local hostname and make it more public; that would save the 30s hangs whenever something tries to look it up on one of my badly configured boxes.

          For Java upload, POST is good, goes through firewalls too. HttpClient is the best way to do this -it is already part of the hadoop core library dependencies. Don't waste time trying to understand java.net if you can help it.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12408757/hadoop-5626.txt
          against trunk revision 778182.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/394/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/394/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/394/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/394/console

          This message is automatically generated.

          Todd Lipcon added a comment -

          The test failures are unrelated (job history and capacity scheduler, both mapred, whereas this only touches hdfs).

          While I think the discussion above is valuable, since this blocks HADOOP-5888 I'd like for this to be committed and we can continue work on making image transfer more sane in another JIRA.

          As for switching to POST rather than the "GET with a callback to another GET", I looked into this today and, while easy, I'm not sure it's a great idea. I think there are some people who are relying on the GetImageServlet on the 2NN for backup purposes, so removing the Jetty from the 2NN would break that.

          dhruba borthakur added a comment -

          > As for switching to POST rather than the "GET with a callback to another GET", I

          I remember that when this was being designed the first time, I was told that there could be "permission" related issues to do a POST vs a GET. The POST is like a "write" somewhat, whereas a GET is more like a "read".

          Jakob Homan added a comment -

          In BackupNode::getHttpServerAddress this issue also arises and is handled by using the rpcAddress, which has already been resolved to a real address. It would be good to move your solution to NetUtils and access it from both the SecondaryNameNode and the BackupNode, to avoid code duplication.

          steve_l added a comment -

          +1 to having something common

          I think we also ought to spell out the minimum requirements of hadoop on a network, something like

          • /etc/hosts is well configured (local hostname doesn't map to 127.0.0.1 or ::1)
          • DNS works
          • rDNS works
          • etc.
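          The first of those requirements can be checked programmatically; here is a minimal sketch of such a sanity check (hypothetical class, not part of Hadoop), whose output depends on the machine's /etc/hosts and DNS configuration:

```java
import java.net.InetAddress;

public class HostSanityCheck {
    public static void main(String[] args) throws Exception {
        InetAddress local = InetAddress.getLocalHost();
        // A machine whose hostname maps to 127.0.0.1 or ::1 in /etc/hosts
        // trips this check; a daemon running there would advertise an
        // address that other nodes cannot reach.
        if (local.isLoopbackAddress()) {
            System.out.println("WARN: hostname resolves to loopback: " + local);
        } else {
            System.out.println("OK: " + local.getHostAddress());
        }
    }
}
```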

          Todd Lipcon added a comment -

          Jakob and Steve: I definitely agree that we need a more centralized common way of doing this. I think HADOOP-4383 is a good place to suggest doing a more general overhaul of how services locate each other. I wrote up a brief proposal to Steve in a private email which I'll copy paste into that JIRA.

          For now, I think making some small concrete steps is the best route. History has shown that it's difficult (and takes a lot of time) to get larger sweeping changes through the development pipeline, so we should work towards some small achievable goals while planning out the better solution.

          Jakob Homan added a comment -

          For now, I think making some small concrete steps is the best route. History has shown that it's difficult (and takes a lot of time) to get larger sweeping changes through the development pipeline, so we should work towards some small achievable goals while planning out the better solution.

          Service location issues are certainly worth looking at, and you're right that HADOOP-4383 is a good place to start. My comment dealt with your specific patch, which, as it stands, introduces a bit of code duplication in resolving the full hostname of the backup node and the secondary namenode. I'd recommend updating your patch to avoid this by using the net utils package as a repository for this rather generic function. No need to introduce two ways of doing the same thing.

          Todd Lipcon added a comment -

          Removing "Patch available" status to incorporate feedback and refactor the change into a static method in NetUtils.
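          One possible shape for such a shared static helper (a hypothetical sketch, not the committed code): given a configured bind address, substitute the local host's resolved address when the configuration names the wildcard, so both the SecondaryNameNode and the BackupNode can call it.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;

public class NetUtilsSketch {
    // Hypothetical shared helper: if the configured bind address is the
    // wildcard (0.0.0.0), substitute the local host's resolved address so
    // the daemon advertises something other nodes can actually connect to.
    public static InetSocketAddress substituteWildcard(InetSocketAddress addr)
            throws UnknownHostException {
        if (addr.getAddress() != null && addr.getAddress().isAnyLocalAddress()) {
            return new InetSocketAddress(InetAddress.getLocalHost(),
                                         addr.getPort());
        }
        return addr;
    }

    public static void main(String[] args) throws UnknownHostException {
        // An InetSocketAddress built from only a port binds the wildcard.
        InetSocketAddress fixed = substituteWildcard(new InetSocketAddress(50090));
        // The port is preserved; only the wildcard host is replaced.
        System.out.println(fixed.getPort());
        System.out.println(fixed.getAddress().isAnyLocalAddress());
    }
}
```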

          Jakob Homan added a comment -

          Seeing as there hasn't been any movement on this in more than a year, and HDFS-1080 also solves this issue, I'd like to close this as won't fix. I prefer the 1080 approach for two reasons: it relies on user-provided configuration to guarantee the hostname is reported as expected, rather than whatever we may be able to decipher, and it's also a smaller change with fewer moving parts.

          Thoughts?

          Jakob Homan added a comment -

          Not hearing any objections, I'm resolving this issue.


            People

            • Assignee: Todd Lipcon
            • Reporter: Carlos Valiente
            • Votes: 0
            • Watchers: 5
