Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.20.2
    • Fix Version/s: 0.20.3, 0.20.205.0
    • Component/s: hdfs-client
    • Labels:
      None
    • Environment:

      Linux 2.6.18-194.32.1.el5 #1 SMP Wed Jan 5 17:52:25 EST 2011 x86_64 x86_64 x86_64 GNU/Linux
      java version "1.6.0_23"
      Java(TM) SE Runtime Environment (build 1.6.0_23-b05)
      Java HotSpot(TM) 64-Bit Server VM (build 19.0-b09, mixed mode)

    • Hadoop Flags:
      Reviewed

      Description

      $ /usr/sbin/lsof -i TCP:50010 | grep -c CLOSE_WAIT
      4471

      Everything is fine as long as the client runs normally.
      However, from time to time "DataStreamer Exception: java.net.SocketTimeoutException" and "DFSClient.processDatanodeError(2507) | Error Recovery for" messages appear in the log file, and the number of CLOSE_WAIT sockets just keeps increasing.

      The CLOSE_WAIT handles may remain for hours or days, until the client eventually fails with "Too many open files".

      1. patch-draft-1836.patch
        2 kB
        Dennis Cheung
      2. hdfs-1836-0.20.txt
        2 kB
        Todd Lipcon
      3. hdfs-1836-0.20.txt
        2 kB
        Todd Lipcon
      4. hdfs-1836-0.20.205.txt
        1 kB
        Bharath Mundlapudi

        Activity

        Matt Foley added a comment -

        Closed upon release of 0.20.205.0

        Suresh Srinivas added a comment -

        This patch is not required for trunk as it has equivalent code.

        Matt Foley added a comment -

        That's correct. This patch was NOT in the old 205 branch which Owen merged into 204. It was only committed to 0.20-security, which will be in the new 205. Thanks for the catch.

        Nathan Roberts added a comment -

        @Matt, I think this should be marked as fixed in 0.20.205.0. Is that correct?

        Matt Foley added a comment -

        Marking it fixed in 205 since 206 isn't there yet.

        Matt Foley added a comment -

        Committed to 0.20-security for .206 release.
        Will correct the "Fix Versions" field when .206 is added.

        Matt Foley added a comment -

        Reopening because Bharath's adapted patch for 0.20.205.0 has sat here for a week, and no one noticed because the bug is closed and already says fixed in 0.20.205.0 – but it isn't.

        Bharath Mundlapudi added a comment -

        Attaching a patch for the 0.20.205 version. I just eliminated some hunks.

        Eli Collins added a comment -

        I've committed this. Thanks Todd!

        Eli Collins added a comment -

        +1 to the latest hdfs-1836-0.20.txt patch for 20.

        Here are test-patch results. This is just cleanup so no new tests are required. The eclipse classpath error is unrelated.

             [exec] 
             [exec] -1 overall.  
             [exec] 
             [exec]     +1 @author.  The patch does not contain any @author tags.
             [exec] 
             [exec]     -1 tests included.  The patch doesn't appear to include any new or modified tests.
             [exec]                         Please justify why no tests are needed for this patch.
             [exec] 
             [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
             [exec] 
             [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
             [exec] 
             [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
             [exec] 
             [exec]     -1 Eclipse classpath. The patch causes the Eclipse classpath to differ from the contents of the lib directories.
             [exec]
        
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12479218/hdfs-1836-0.20.txt
        against trunk revision 1103987.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        -1 patch. The patch command could not apply the patch.

        Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/536//console

        This message is automatically generated.

        Todd Lipcon added a comment -

        Sure, no problem. I was trying to maintain the old behavior, but I agree it's nicer to provide the exceptions if we have debug on.

        Bharath Mundlapudi added a comment -

        That's correct. This code is part of trunk already.

        Todd, One minor comment.

        1. Can we also pass the LOG object to this method? Users who want to debug can enable the debug option.
        IOUtils.cleanup(LOG, blockStream, blockReplyStream);

        Otherwise, the patch looks good. Thank you.

        Todd Lipcon added a comment -

        Here's a patch against branch-0.20. Bharath, can you take a look?

        Todd Lipcon added a comment -

        Looks like this code was fixed in trunk by HADOOP-5859

        Bharath Mundlapudi added a comment -

        +1 to Todd's suggestion.

        Though the current code logs errors only in debug mode.

        Todd Lipcon added a comment -

        How about using IOUtils.cleanup, which does the above but also logs errors, etc?

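        (For reference, a minimal sketch of what a cleanup-based close could look like. The surrounding class is hypothetical, and it assumes org.apache.hadoop.io.IOUtils.cleanup(Log, Closeable...) as quoted elsewhere in this thread.)

        import java.io.DataInputStream;
        import java.io.DataOutputStream;
        import org.apache.commons.logging.Log;
        import org.apache.commons.logging.LogFactory;
        import org.apache.hadoop.io.IOUtils;

        class StreamCleanupSketch {
            private static final Log LOG = LogFactory.getLog(StreamCleanupSketch.class);

            // cleanup() closes each stream in its own try/catch, so a failure closing
            // blockStream cannot leave blockReplyStream open, and the IOException is
            // logged (when debug logging is enabled) instead of being silently dropped.
            static void closeStreams(DataOutputStream blockStream,
                                     DataInputStream blockReplyStream) {
                IOUtils.cleanup(LOG, blockStream, blockReplyStream);
            }
        }
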
        Bharath Mundlapudi added a comment -

        Retyping with format for better readability.

        Change the following code

        try { 
            blockStream.close(); 
            blockReplyStream.close(); 
        } catch (IOException e) {
        }
        

        to this:

        try { 
           blockStream.close(); 
        } catch (IOException e) {
        }
        try { 
           blockReplyStream.close(); 
        } catch (IOException e) {
        }
        
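        (As a self-contained illustration of why the split matters: plain java.io only, no Hadoop dependencies; the demo class and its stream stand-ins are hypothetical, named after the snippet above.)

        import java.io.Closeable;
        import java.io.IOException;

        public class CloseOrderDemo {

            /** Stand-in for a stream whose close() fails, e.g. because the datanode went away. */
            static class FailingStream implements Closeable {
                @Override
                public void close() throws IOException {
                    throw new IOException("close failed");
                }
            }

            /** Stand-in for a stream that records whether close() was ever reached. */
            static class TrackingStream implements Closeable {
                boolean closed = false;
                @Override
                public void close() {
                    closed = true;
                }
            }

            public static void main(String[] args) {
                FailingStream blockStream = new FailingStream();
                TrackingStream blockReplyStream = new TrackingStream();

                // Original pattern: both closes share one try, so the second close
                // is skipped as soon as the first one throws.
                try {
                    blockStream.close();
                    blockReplyStream.close();
                } catch (IOException e) {
                }
                System.out.println("shared try: blockReplyStream closed? " + blockReplyStream.closed);   // false

                // Patched pattern: each close gets its own try, so a failure in one
                // cannot prevent the other.
                try { blockStream.close(); } catch (IOException e) { }
                try { blockReplyStream.close(); } catch (IOException e) { }
                System.out.println("separate try: blockReplyStream closed? " + blockReplyStream.closed); // true
            }
        }
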
        Bharath Mundlapudi added a comment -

        Dennis,

        For me, the following code seems like an issue.

        try {
            blockStream.close();
            blockReplyStream.close();
        } catch (IOException e) {
        }

        Reason: if blockStream throws an exception, blockReplyStream will not be closed.

        Can we replace all the places (2 places) in DFSClient with the following and try?

        try {
            blockStream.close();
        } catch (IOException e) {
        }
        try {
            blockReplyStream.close();
        } catch (IOException e) {
        }

        Can you just try this change in your environment?

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12478378/patch-draft-1836.patch
        against trunk revision 1100054.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        -1 patch. The patch command could not apply the patch.

        Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/463//console

        This message is automatically generated.

        Dennis Cheung added a comment -

        I've run the patched 0.20.2 for a few days, and the number of CLOSE_WAIT sockets to port 50010 has dropped to zero.

        Dennis Cheung added a comment -

        Two new findings after reading the Hadoop source:

        1. If an exception is thrown between DFSClient.java lines 2883-2901, the reference to blockReplyStream may be lost and the stream remains unclosed.

        2. processDatanodeError() will not close the previous Socket, if any.

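        (For illustration, a minimal sketch of the kind of guard the first finding points at. The socket and stream setup below is hypothetical, not the actual DFSClient.java code: if an exception escapes after the streams are opened but before the references are stored, close them before propagating.)

        import java.io.DataInputStream;
        import java.io.DataOutputStream;
        import java.io.IOException;
        import java.net.InetSocketAddress;
        import java.net.Socket;
        import org.apache.commons.logging.Log;
        import org.apache.commons.logging.LogFactory;
        import org.apache.hadoop.io.IOUtils;

        class GuardedStreamSetup {
            private static final Log LOG = LogFactory.getLog(GuardedStreamSetup.class);

            private DataOutputStream blockStream;
            private DataInputStream blockReplyStream;

            void connect(InetSocketAddress datanode) throws IOException {
                Socket sock = new Socket();
                DataOutputStream out = null;
                DataInputStream in = null;
                try {
                    sock.connect(datanode, 60 * 1000);
                    out = new DataOutputStream(sock.getOutputStream());
                    in = new DataInputStream(sock.getInputStream());
                    // ... handshake / header writes that may throw ...
                    blockStream = out;       // only now do the fields hold the references
                    blockReplyStream = in;
                } catch (IOException e) {
                    // Without this cleanup the half-opened streams (and the socket
                    // behind them) become unreachable and linger in CLOSE_WAIT.
                    IOUtils.cleanup(LOG, in, out);
                    IOUtils.closeSocket(sock);
                    throw e;
                }
            }
        }
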
        Dennis Cheung added a comment -

        It creates a singleton FileSystem object connected to an HDFS cluster, and it does only one simple thing with Hadoop: copy local files into HDFS.

        We use only FileSystem.create(path, true) and IOUtils.copyLarge(), and we already call IOUtils.closeQuietly(out); in the finally block.

        Nothing more than that.

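        (For context, a minimal sketch of the usage pattern described here. The class name and paths are placeholders, and IOUtils below is org.apache.commons.io.IOUtils, not the Hadoop class of the same name.)

        import java.io.FileInputStream;
        import java.io.IOException;
        import java.io.InputStream;
        import java.io.OutputStream;
        import org.apache.commons.io.IOUtils;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class SimpleHdfsUploader {
            // Singleton FileSystem connected to the HDFS cluster, as described above.
            private static FileSystem fs;

            private static synchronized FileSystem getFs() throws IOException {
                if (fs == null) {
                    fs = FileSystem.get(new Configuration());
                }
                return fs;
            }

            /** Copies one local file into HDFS, overwriting any existing file at the target path. */
            public static void upload(String localFile, String hdfsPath) throws IOException {
                InputStream in = new FileInputStream(localFile);
                OutputStream out = getFs().create(new Path(hdfsPath), true);
                try {
                    IOUtils.copyLarge(in, out);
                } finally {
                    IOUtils.closeQuietly(out);   // as in the original client code
                    IOUtils.closeQuietly(in);
                }
            }
        }
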
        Bharath Mundlapudi added a comment -

        Hi Dennis,

        What is this client code doing? Is this your program which uses HDFS APIs to talk to datanode and namenode?

        -Bharath

        Dennis Cheung added a comment -

        @Allen Wittenauer
        Update.

        None. `netstat -ano` shows no connections to this client at all, only a few ESTABLISHED connections from some other datanodes/clients.

        Rechecking on the client side, the connections may still remain in CLOSE_WAIT.

        Dennis Cheung added a comment -

        @Allen Wittenauer
        Unknown, I've no access to the remote system

        Allen Wittenauer added a comment -

        What is the state of the other side of the socket connection?


          People

          • Assignee:
            Bharath Mundlapudi
          • Reporter:
            Dennis Cheung
          • Votes:
            0
          • Watchers:
            10
