Hadoop Common / HADOOP-6760

WebServer shouldn't increase port number in case of negative port setting caused by Jetty's race

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.20.3
    • Fix Version/s: 0.20.3
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      When a negative port is assigned to a webserver socket (because of a race inside the Jetty server), the workaround from HADOOP-6386 increases the original port number on the next bind attempt. This is incorrect logic: the next bind attempt should happen on the same port number if possible.
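
      For illustration only, here is a minimal sketch of the retry behavior the description calls for: rebinding to the same port rather than an incremented one. The names (listener, oriPort, MAX_RETRIES, LOG) follow the snippet Eli Collins quotes in the comments and are assumptions, not the committed patch (which, per the discussion below, reverts the HADOOP-6386 workaround instead).

          // Hedged sketch, not the committed patch. Assumes a Jetty-6 style
          // Connector field "listener", the originally configured port
          // "oriPort", and LOG / MAX_RETRIES as in the snippet quoted below.
          int port = listener.getLocalPort();
          int numRetries = 0;
          while (port < 0 && numRetries++ < MAX_RETRIES) {
            LOG.warn("listener.getLocalPort returned " + port
                + "; rebinding to " + oriPort);
            listener.close();
            listener.setPort(oriPort);  // retry on the SAME port; do not increment
            listener.open();
            port = listener.getLocalPort();
          }
          if (port < 0) {
            throw new Exception("local port still negative after "
                + MAX_RETRIES + " retries");
          }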

      1. HADOOP-6760.patch
        0.5 kB
        Konstantin Boudnik
      2. HADOOP-6760.patch
        2 kB
        Konstantin Boudnik
      3. HADOOP-6760.patch
        1 kB
        Konstantin Boudnik
      4. HADOOP-6760.0.20.patch
        0.6 kB
        Konstantin Boudnik
      5. HADOOP-6760.0.20.patch
        2 kB
        Konstantin Boudnik
      6. HADOOP-6760.0.20.patch
        2 kB
        Konstantin Boudnik
      7. HADOOP-6760.0.20.patch
        0.6 kB
        Konstantin Boudnik
      8. HADOOP-6760.0.20.patch
        2 kB
        Konstantin Boudnik


          Activity

          Hudson added a comment -

          Integrated in Hadoop-Common-trunk #346 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Common-trunk/346/)

          Hudson added a comment -

          Integrated in Hadoop-Common-trunk-Commit #263 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Common-trunk-Commit/263/)
          HADOOP-6760. Moving the entry to the correct base release 0.20.3

          Konstantin Boudnik added a comment -

          I have just committed this to all open branches (0.20.3, 0.21) and trunk

          Hudson added a comment -

          Integrated in Hadoop-Common-trunk-Commit #262 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Common-trunk-Commit/262/)
          HADOOP-6760. WebServer shouldn't increase port number in case of
          negative port setting caused by Jetty's race. Contributed by Konstantin
          Boudnik.

          Konstantin Boudnik added a comment -

          Wrong file was uploaded for 0.20.3 branch

          Konstantin Boudnik added a comment -

           If no one else has any objections, I'm going to commit this tomorrow.

          Koji Noguchi added a comment -

          Thus a daemon's webserver becomes pretty much useless.

           Just for the record, we observed multiple Datanodes/TaskTrackers constantly failing as below.

          2010-04-01 00:36:39,706 INFO org.apache.hadoop.tools.DistCp: FAIL 4/part-00031 : java.io.FileNotFoundException: http://abc1234.com:50076/streamFile?filename=/data/4/part-00031&ugi=knoguchi
                  at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1288)
                  at org.apache.hadoop.hdfs.HftpFileSystem.open(HftpFileSystem.java:143)
                  at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:356)
                  at org.apache.hadoop.tools.DistCp$CopyFilesMapper.copy(DistCp.java:415)
                  at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:543)
                  at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:310)
                  at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
                  at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
                  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
                  at org.apache.hadoop.mapred.Child.main(Child.java:159)
          
          2010-04-27 14:48:48,641 WARN org.apache.hadoop.mapred.ReduceTask: java.io.FileNotFoundException: http://abcl51349.com:50061/mapOutput?job=job_201004211814_6666&map=attempt_201004211814_6666_m_000791_0&reduce=0
                  at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1288)
                  at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getInputStream(ReduceTask.java:1500)
                  at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1381)
                  at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1293)
                  at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1224)
          

          Map tasks on these tasktrackers kept on failing with
          "Status : FAILED Too many fetch-failures"

          Konstantin Boudnik added a comment -

          Same patch for 0.20

          Konstantin Boudnik added a comment -

          Unit test isn't feasible for this fix.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12444852/HADOOP-6760.patch
          against trunk revision 945953.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/528/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/528/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/528/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/528/console

          This message is automatically generated.

          Konstantin Boudnik added a comment -

           Patch is the same except for CHANGES.txt mods. Re-verifying nonetheless.

          Konstantin Boudnik added a comment -

          Patch should include modifications for CHANGES.txt

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12444456/HADOOP-6760.patch
          against trunk revision 944012.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h1.grid.sp2.yahoo.net/67/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h1.grid.sp2.yahoo.net/67/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h1.grid.sp2.yahoo.net/67/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h1.grid.sp2.yahoo.net/67/console

          This message is automatically generated.

          Konstantin Boudnik added a comment -

          Need to re-verify the patch because it has been changed.

          Konstantin Boudnik added a comment -

          Same for the trunk and 0.21 branch

          Konstantin Boudnik added a comment -

           Fixing CHANGES.txt as well, removing the JIRA numbers.

          Konstantin Boudnik added a comment -

          Patch removes the workaround in question.

          Devaraj Das added a comment -

          +1 on reverting HADOOP-6386 for now. BTW http://jira.codehaus.org/browse/JETTY-748 is the jira that is tracking the fix on the jetty side.

          Konstantin Boudnik added a comment -

           After some consideration and investigation we have found that the workaround from HADOOP-6386 has a very unpleasant side effect which wasn't apparent at the time. Here is what is happening:

           • daemons like NN, TT, etc. add certain contexts to the webserver before it is started;
           • when the workaround in question detects a negative-port problem (aka the Jetty race), it applies logic which essentially restarts the server via server.stop() and server.start() calls;
           • the trick is that server.stop() clears (i.e. shuts down) all previously added contexts.

           The effect of this is that even if a daemon's webserver was started, it lacks all the servlets which were added initially. Thus a daemon's webserver becomes pretty much useless.

           Unfortunately, there's no easy way to fix this, because the contexts are added from outside and it'd be pretty tricky to trace which contexts need to be preserved before the server is restarted, and so on.

           I believe the better solution would be to revert the workaround introduced by HADOOP-6386. In that case we'll get back to the original behavior, where the webserver either starts or fails but isn't left in an inconsistent state.
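
           To make that failure mode concrete, here is a hypothetical, minimal Jetty 6-style sketch (the class name and the /status servlet are invented for illustration; the claim that stop() discards the contexts is taken from this comment, not from the Jetty documentation):

           import javax.servlet.http.HttpServlet;
           import org.mortbay.jetty.Server;
           import org.mortbay.jetty.servlet.Context;
           import org.mortbay.jetty.servlet.ServletHolder;

           public class ContextLossSketch {
             public static void main(String[] args) throws Exception {
               Server server = new Server(50070);
               Context root = new Context(server, "/", Context.SESSIONS);
               // A daemon (NN, TT, etc.) registers its servlets before start():
               root.addServlet(new ServletHolder(new HttpServlet() {}), "/status");

               server.start();
               // On hitting the negative-port race, the HADOOP-6386 workaround
               // effectively did:
               server.stop();   // per the comment above, this drops the contexts
               server.start();  // the server is back up, but /status is gone
             }
           }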

          Eli Collins added a comment -

           Yes, I was just simplifying the HADOOP-4744 part; it looks like you could do something similar for the HADOOP-6386 part. Feel free to ignore or defer to a separate jira; at a minimum, that code could use a big comment explaining why it's jumping through so many hoops.

          +1 to your patch. The change for HADOOP-6386 should not have incremented oriPort.

          Konstantin Boudnik added a comment -

           Yes, Eli, this seems to be a valid simplification. We are seeing quite a bunch of -1 ports on our production clusters, and the workaround for HADOOP-6386 was trying to address that. I guess it has done a pretty good job; however, increasing the port was wrong.

           Two workarounds exist for a purpose, actually. The first one, HADOOP-4744, is about getting a negative port as the result of the initial getLocalPort() call. However, what we are sometimes seeing is that getLocalPort() can return a positive number, and then when you try to bind to it you get an IllegalArgumentException because the port is actually negative... It is apparently caused by some crazy race in Jetty. Hence workaround #2, which verifies that the allocated port is actually positive and, if it isn't, engages all that voodoo...

          So, I believe your simplification won't address the second issue... Please correct me if I'm wrong.
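
           A rough sketch of what that second workaround amounts to; the names are assumed and this is not the actual HttpServer code:

           // Hypothetical sketch of "workaround #2" from HADOOP-6386.
           int port = listener.getLocalPort();
           if (port < 0) {
             // The Jetty race: getLocalPort() looked fine earlier, but the
             // connector now reports a negative port (or the bind threw
             // IllegalArgumentException). HADOOP-6386 reacted by restarting
             // the whole server:
             server.stop();   // ...which, per the comments above, also drops
             server.start();  //    the registered contexts
             port = listener.getLocalPort();
           }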

          Eli Collins added a comment -

           Hey Kos - your patch looks correct to me. Rationale: according to the jetty javadoc below, getPort should never return a negative value, therefore oriPort should never be negative, therefore you don't need the workaround of adding one, and it's OK to re-bind to the same port.

           Not your change, but the HADOOP-4744 workaround looks ok; per the docs below it is valid for getLocalPort to return -1. I think it could be simpler though, something like the following. Make sense?

           int numRetries = 1;
           while (port < 0) {
             LOG.warn("listener.getLocalPort returned " + port);
             // Give up after MAX_RETRIES attempts to get a valid port.
             if (numRetries++ > MAX_RETRIES) {
               throw new Exception("listener.getLocalPort is returning " +
                   "less than 0 even after " + numRetries + " resets");
             }
             // Bounce the listener: close it and re-open it on the original port.
             LOG.info("Bouncing the listener");
             listener.close();
             Thread.sleep(1000);
             listener.setPort(oriPort);  // same port as before, not incremented
             listener.open();
             Thread.sleep(100);
             port = listener.getLocalPort();
           }


           /**
            * @return The configured port for the connector or 0 if any available
            * port may be used.
            */
           int getPort();

           /* ------------------------------------------------------------ */
           /**
            * @return The actual port the connector is listening on or -1 if there
            * is no port or the connector is not open.
            */
           int getLocalPort();

          
          
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12444225/HADOOP-6760.0.20.patch
          against trunk revision 941662.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          -1 patch. The patch command could not apply the patch.

          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/515/console

          This message is automatically generated.

          Konstantin Boudnik added a comment -

          Fix is simple and is ready for verification.

          Konstantin Boudnik added a comment -

           Same for the 0.20 source tree.


            People

            • Assignee: Konstantin Boudnik
            • Reporter: Konstantin Boudnik
            • Votes: 0
            • Watchers: 1
