Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.17.0
    • Fix Version/s: 0.20.0
    • Component/s: io
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Starting with Hadoop-0.17, most of the network I/O uses non-blocking NIO channels. Normal blocking reads and writes are handled by Hadoop, using our own cache of selectors. This cache suits Hadoop well, since I/O often occurs on many short-lived threads; the number of fds consumed is proportional to the number of threads currently blocked.

      If blocking I/O is done using java.*, Sun's implementation uses internal per-thread selectors. These selectors are closed using sun.misc.Cleaner, which behaves like a finalizer and is tied to GC. That is ill suited for many short-lived threads: until a GC happens, the number of these selectors keeps growing, and each selector consumes 3 fds.

      Though blocking read and write are handled by Hadoop, connect() still uses the default implementation, which relies on a per-thread selector.

      Koji helped a lot in tracking this. Some sections from 'jmap' output and other info Koji collected led to this suspicion; I will include that in the next comment.

      One solution might be to handle connect() in Hadoop as well, using our own selectors.
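
      A minimal sketch of that approach, assuming a dedicated selector per connect (the committed patch instead reuses the cached selectors in SocketIOWithTimeout; the class and method names here are illustrative):

          import java.io.IOException;
          import java.net.SocketAddress;
          import java.net.SocketTimeoutException;
          import java.nio.channels.SelectionKey;
          import java.nio.channels.Selector;
          import java.nio.channels.SocketChannel;

          class TimedConnect {
            // Selector-based connect with a timeout (timeoutMs > 0 assumed),
            // avoiding the JDK's internal per-thread connect selector.
            static void connect(SocketChannel channel, SocketAddress endpoint,
                                long timeoutMs) throws IOException {
              channel.configureBlocking(false);
              if (channel.connect(endpoint)) {
                return;                  // connected immediately (e.g. loopback)
              }
              Selector selector = Selector.open();
              try {
                channel.register(selector, SelectionKey.OP_CONNECT);
                if (selector.select(timeoutMs) == 0) {
                  throw new SocketTimeoutException(
                      timeoutMs + " ms timeout while waiting for connect");
                }
                channel.finishConnect(); // throws if the attempt failed
              } finally {
                selector.close();        // a selector cache would recycle this fd
              }
            }
          }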

      Attachments

      1. HADOOP-4346-0.18.3.patch
        10 kB
        Aaron Kimball
      2. HADOOP-4346-0.18.3.patch
        10 kB
        Aaron Kimball
      3. HADOOP-4346-0.18.3.patch
        10 kB
        Aaron Kimball
      4. HADOOP-4346.patch
        11 kB
        Raghu Angadi
      5. HADOOP-4346.patch
        11 kB
        Raghu Angadi
      6. HADOOP-4346.patch
        11 kB
        Raghu Angadi
      7. HADOOP-4346.patch
        11 kB
        Raghu Angadi
      8. HADOOP-4346.patch
        11 kB
        Raghu Angadi
      9. HADOOP-4346-branch-18.patch
        10 kB
        Raghu Angadi
      10. HADOOP-4346.patch
        11 kB
        Raghu Angadi

        Activity

        Raghu Angadi added a comment -

        > When I grepped, I noticed that we don't fix connect in Balancer.java. We probably should; I will file a jira on that.

        Not an issue. Balancer does not use NIO sockets, so it is not affected. Sorry for the confusion.

        Raghu Angadi added a comment -

        +1. Looks correct.

        When I grepped, I noticed that we don't fix connect in Balancer.java. We probably should; I will file a jira on that.

        Aaron Kimball added a comment -

        Fixed -p0 / -p1 issue in patch (again...)

        Aaron Kimball added a comment -

        Updated 0.18.3 patch per Raghu's suggestions

        Raghu Angadi added a comment -

        Your resolution is correct. targetSock.connect() should be changed instead of proxySock.

        Essentially, DataNode should not invoke plain socket.connect() at all; those calls should all be replaced with NetUtils.connect(sock, ...).
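
        For illustration, the mechanical shape of that change, using the call named above:

          // before: plain connect, goes through the JDK's per-thread selector
          targetSock.connect(targetAddr, socketTimeout);

          // after: route the timed connect through Hadoop's selector cache
          NetUtils.connect(targetSock, targetAddr, socketTimeout);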

        Aaron Kimball added a comment -

        Thanks Raghu,

        So what's the resolution on DataNode.java? Just disregard the non-applying hunk? If so, what about the targetSock.connect() call at line 1431?

        Raghu Angadi added a comment -

        > The lines near 1490 do not make any reference to 'proxySock'.

        Right, looks like that part got reverted in 0.18 later.

        The 0.18 patch I attached does not have some minor fixes that were done later.

        I suggest the following change for your patch: the changes for SocketIOWithTimeout.java and NetUtils.java should come from the trunk patch (they should apply without any conflicts). The rest of it should be fine.

        We should run 'ant test-patch' with the new one. I can run that if you want.

        Aaron Kimball added a comment -

        Reuploading patch that works with -p0 instead of -p1

        Aaron Kimball added a comment -

        My proposed change to the patch is attached as HADOOP-4346-0.18.3.patch, but I'm not sure if it's correct.

        Aaron Kimball added a comment -

        The -branch-18 patch does not apply to Hadoop 0.18.3. Specifically, the following hunk fails to apply in DataNode.java:

        *** 1490,1496 ****
                  InetSocketAddress proxyAddr = NetUtils.createSocketAddr(
                      proxySource.getName());
                  proxySock = newSocket();
        -         proxySock.connect(proxyAddr, socketTimeout);
                  proxySock.setSoTimeout(socketTimeout);
        
                  OutputStream baseStream = NetUtils.getOutputStream(proxySock,
        

        The lines near 1490 do not make any reference to 'proxySock'.

        The only definitions of "OutputStream baseStream" are in run(), copyBlock(), and readBlock(). Without more context for this patch hunk, I'm not sure where it should be applied, though the line:

        targetSock.connect(targetAddr, socketTimeout);
        

        near line 1431 seems like a reasonable candidate, as it's the only example of a fooSock.connect() call.

        Raghu, am I correct here? If not, can you release a new 18-branch patch?

        Hudson added a comment -

        Integrated in Hadoop-trunk #646 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/646/)
        HADOOP-4346. Implement blocking connect so that Hadoop is not affected
        by selector problem with JDK default implementation. (Raghu Angadi)

        Raghu Angadi added a comment -

        I just committed this.

        Raghu Angadi added a comment -

        test-patch output:

             [exec] -1 overall.
        
             [exec]     +1 @author.  The patch does not contain any @author tags.
        
             [exec]     -1 tests included.  The patch doesn't appear to include any new or modified tests.
             [exec]                         Please justify why no tests are needed for this patch.
        
             [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
        
             [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
        
             [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
        
             [exec]     +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
        
        Raghu Angadi added a comment -

        Updated patch fixes two findbugs warnings. 'SocketChannel' is already a SelectableChannel, so there is no need to check with instanceof.

        Tsz Wo Nicholas Sze added a comment -

        +1 patch looks good.

        Raghu Angadi added a comment -

        Thanks Nicholas.

        Updated patch is attached.

        > In case SelectionKey.OP_WRITE, "read" should be "write"
        Good catch.

        > I suggest simplifying the timing code as below
        I modified the loop a bit along the lines you suggested, but not exactly the same, since that version requires two calls to currentTimeMillis() per loop iteration instead of one, even in the common case of no timeouts.
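
        For comparison, a sketch of the loop shape described here, reading the clock at most once per iteration and not at all when no timeout is set; tryIO() and waitForReady() are hypothetical stand-ins, not the patch's code:

          static int doIOWithTimeout(long timeout) throws IOException {
            final long start = (timeout > 0) ? System.currentTimeMillis() : 0;
            long timeLeft = timeout;
            while (true) {
              int n = tryIO();            // hypothetical non-blocking I/O attempt
              if (n != 0) {
                return n;                 // progress made; no clock read needed
              }
              waitForReady(timeLeft);     // hypothetical selector wait
              if (timeout > 0) {
                // a single clock read per iteration updates the remaining time
                timeLeft = timeout - (System.currentTimeMillis() - start);
                if (timeLeft <= 0) {
                  throw new SocketTimeoutException("timed out waiting for I/O");
                }
              }
            }
          }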

        Tsz Wo Nicholas Sze added a comment -
        • In case SelectionKey.OP_WRITE, "read" should be "write"
        • I suggest simplifying the timing code as below:
          final long endtime = System.currentTimeMillis() + timeout;
          while(true) {
            final long timeleft = endtime - System.currentTimeMillis();
            ...
            if (... && timeleft < 0) {
              throw ...
            }
          }
          
        Ankur added a comment - edited

        That would explain the kind of file handle leaks we are observing when writing from Apache HTTP Server via a custom log writer to HDFS. The log writer opens a file for writing every X minutes and keeps writing the Apache log entries that it receives through a pipe. Underneath the log writer, the DFSClient opens a new blocking connection (DFSClient.connect()) and transfers the data via data streamer threads. In our case X is sufficiently large that the wrapped selector objects move to "old space" and never get garbage collected, because a full GC never runs since the program never exceeds its memory requirements.

        I think having this patch will alleviate the problem. Another way for client applications is to force full GC at regular intervals. I will try forcing full GC and see if that works for my case.
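
        A sketch of that client-side workaround, with the caveat that System.gc() is only a request to the JVM; the interval and thread name are illustrative:

          // Daemon thread that periodically requests a full GC so the JVM's
          // per-thread selectors get their Cleaners run. A workaround only;
          // the real fix is the patch on this issue.
          Thread gcForcer = new Thread(new Runnable() {
            public void run() {
              while (true) {
                try {
                  Thread.sleep(10 * 60 * 1000L);  // e.g. every 10 minutes
                } catch (InterruptedException e) {
                  return;
                }
                System.gc();                       // a request, not a guarantee
              }
            }
          }, "periodic-full-gc");
          gcForcer.setDaemon(true);
          gcForcer.start();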

        Raghu Angadi added a comment -

        Could someone review this patch for trunk?

        Eventually we should write a socket factory and return our own Socket so that we don't need NetUtils.connect(), NetUtils.getOutputStream(), etc.

        Raghu Angadi added a comment -

        Updated patch closes the channel, as the javadoc for SocketChannel.connect() requires.
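
        A minimal sketch of that behavior; doTimedConnect() stands in for the selector-based connect and is hypothetical:

          // A failed connect must not leave the channel (and its fd) open,
          // mirroring the contract documented for SocketChannel.connect().
          static void connectAndCloseOnFailure(SocketChannel ch,
              SocketAddress endpoint, int timeoutMs) throws IOException {
            boolean connected = false;
            try {
              doTimedConnect(ch, endpoint, timeoutMs);  // hypothetical helper
              connected = true;
            } finally {
              if (!connected) {
                try {
                  ch.close();
                } catch (IOException ignored) {
                  // keep the original failure; suppress errors from close()
                }
              }
            }
          }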

        Koji Noguchi added a comment -

        > I guess our datanodes aren't that busy then...

        It also depends on the heap size, which changes the frequency of full GCs.
        We use the default heap size of 1000M for the datanode.
        The cluster that hit this issue was using 2048M (with a 1024 file descriptor limit).

        Allen Wittenauer added a comment -

        I guess our datanodes aren't that busy then...

        Raghu Angadi added a comment -

        I doubt 8K would be enough for a moderately busy datanode without this patch. Ultimately it depends on what goes on in the black box of GC and phantom references; it might be fine for some users or some loads.

        Allen Wittenauer added a comment -

        We're running ours with 8192, which is likely why we've never seen this.

        Bryan Duxbury added a comment -

        Originally I had 1024 per machine. I recently increased it to 4096 and I haven't seen the problem as much, but it was intermittent before, so I don't know if it's resolved or not.

        Raghu Angadi added a comment -

        Patch for 0.18 is attached.

        Allen Wittenauer added a comment -

        Bryan: out of curiosity, what is your upper limit on FDs? In theory, you should be able to work around this by increasing the fd limit (rlim_fd_max in /etc/system on Solaris, nofile in /etc/security/limits.conf on Linux, ...).
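
        For illustration, raising the limit as described; the user name and values are examples, not a recommendation from this issue:

          # Linux: /etc/security/limits.conf
          hadoop  soft  nofile  8192
          hadoop  hard  nofile  8192

          # Solaris: /etc/system
          set rlim_fd_max=8192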

        Bryan Duxbury added a comment -

        I'd love to try this out on 0.18.1. Sounds like my exact problem.

        Raghu Angadi added a comment -

        Suggested fix is attached. Datanodes show no instances of Sun's SelectorWrapper objects with this patch.

        This adds NetUtils.connect(socket, addr, timeout) which invokes SocketIOWithTimeout.connect() for channels.

        connect() seems to work fine. If anyone wants to try this patch out, I can attach a version for 0.17 or 0.18.
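
        A sketch of what the new helper plausibly looks like, based on the description above (the exact code is in the attached patch):

          // Route channel-backed sockets through Hadoop's cached selectors;
          // plain sockets keep the normal blocking connect.
          public static void connect(Socket socket, SocketAddress endpoint,
                                     int timeout) throws IOException {
            SocketChannel ch = socket.getChannel();
            if (ch == null) {
              socket.connect(endpoint, timeout);           // non-NIO socket
            } else {
              SocketIOWithTimeout.connect(ch, endpoint, timeout);
            }
          }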

        Raghu Angadi added a comment -

        The following shows relevant info from jmap for a datanode that had a lot of fds open.

        • # jmap without full GC. Includes stale objects:
          # num of fds for the process : 5358
          
          #java internal selectors
           117:          1780          42720  sun.nio.ch.Util$SelectorWrapper
           118:          1762          42288  sun.nio.ch.Util$SelectorWrapper$Closer
          
          #Hadoop selectors
            93:          3026         121040  org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool$SelectorInfo
           844:             1             40  org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool$ProviderInfo
          
          #Datanode threads 
            99:          2229         106992  org.apache.hadoop.dfs.DataNode$DataXceiver
          
        • #jmap -histo:live immediately after the previous. This does a full-GC before counting.
          #num of fds : 5187
          
            64:          1759          42216  sun.nio.ch.Util$SelectorWrapper
            65:          1759          42216  sun.nio.ch.Util$SelectorWrapper$Closer
          
           465:             4            160  org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool$SelectorInfo
           772:             1             40  org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool$ProviderInfo
          
           422:             4            192  org.apache.hadoop.dfs.DataNode$DataXceiver
          

          This shows that there is no fd leak in Hadoop's selector cache. DN has 4 threads doing I/O and there are 4 selectors. But there are a lot of java internal selectors open.

        • # 'jmap -histo:live' about 1 minute after the previous full GC
          #num of fds : 57
          
          # There are no SelectorWrapper objects. All of these must have been closed.
          
           768:             1             40  org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool$SelectorInfo
          
           730:             1             48  org.apache.hadoop.dfs.DataNode$DataXceiver
          

        I will try to reproduce this myself and try out a patch for connect().
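
        For reference, the numbers above were presumably collected with commands along these lines (pid is the datanode process id):

          # histogram including unreachable (stale) objects
          jmap -histo <pid>
          # forces a full GC first, then counts only live objects
          jmap -histo:live <pid>
          # number of open fds for the process (Linux)
          ls /proc/<pid>/fd | wc -l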


          People

          • Assignee:
            Raghu Angadi
            Reporter:
            Raghu Angadi
          • Votes:
            0
            Watchers:
            18
