HBase / HBASE-9393

HBase does not close already-closed sockets, resulting in many CLOSE_WAIT connections

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.94.2, 0.98.0
    • Fix Version/s: None
    • Component/s: None
    • Labels: None
    • Environment:

      CentOS 6.4 - 7 regionservers/datanodes, 8 TB per node, 7279 regions

      Description

      HBase does not close dead connections to the DataNode.
      This results in over 60K sockets in CLOSE_WAIT, and at some point HBase cannot connect to the DataNode because there are too many mapped sockets from one host to another on the same port.

      The example below shows a low CLOSE_WAIT count because we had to restart HBase to work around the problem; left running, it grows to 60-100K sockets in CLOSE_WAIT.

      [root@hd2-region3 ~]# netstat -nap |grep CLOSE_WAIT |grep 21592 |wc -l
      13156
      [root@hd2-region3 ~]# ps -ef |grep 21592
      root 17255 17219 0 12:26 pts/0 00:00:00 grep 21592
      hbase 21592 1 17 Aug29 ? 03:29:06 /usr/java/jdk1.6.0_26/bin/java -XX:OnOutOfMemoryError=kill -9 %p -Xmx8000m -ea -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -Dhbase.log.dir=/var/log/hbase -Dhbase.log.file=hbase-hbase-regionserver-hd2-region3.swnet.corp.log ...

        Activity

        Colin Patrick McCabe added a comment -

        Best guess is that you didn't apply your configuration to HBase, which is the DFSClient in this scenario. Suggest posting to hdfs-user@apache.org

        steven xu added a comment -

        [Colin Patrick McCabe], please review my comments. Could you give more suggestions? Thanks a lot.

        steven xu added a comment -

        I also changed the configuration and tested with:

        • dfs.client.socketcache.capacity = 0
        • dfs.datanode.socket.reuse.keepalive = 0

          The number of CLOSE_WAIT sockets still grew quickly.

        steven xu added a comment -

        I considered [cmccabe]'s suggestions and tested the following cases (a configuration sketch follows below):

        • use short-circuit reads;
        • set the socket cache to a bigger size (dfs.client.socketcache.capacity = 1024) and adjust the timeouts to be longer.
          But it did not work; the number of CLOSE_WAIT sockets still grew quickly.
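        For reference, a minimal sketch of the configuration described in that test, assuming Hadoop 2.x property names; the domain-socket path and the cache-expiry key are illustrative assumptions, not values from this thread:

        import org.apache.hadoop.conf.Configuration;

        public class LargerSocketCacheExample {        // illustrative class name
            public static Configuration build() {
                Configuration conf = new Configuration();
                // Short-circuit reads: local blocks are read via a file descriptor
                // passed over a domain socket instead of a DataNode TCP socket.
                conf.setBoolean("dfs.client.read.shortcircuit", true);
                conf.set("dfs.domain.socket.path", "/var/run/hadoop-hdfs/dn_socket"); // placeholder path
                // Larger client socket cache, as tested above.
                conf.setInt("dfs.client.socketcache.capacity", 1024);
                // Longer cache expiry (this key name is an assumption for illustration).
                conf.setLong("dfs.client.socketcache.expiryMsec", 10 * 60 * 1000L);
                return conf;
            }
        }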
        steven xu added a comment -

        Hi, Avi Zrachya, I also have the same issue, HBASE-11833, in HDP 2.1/Hadoop 2.4/HBase 0.98.0. Can you share more info about this issue? Thanks.

        Liang Xie added a comment -

        Avi Zrachya, do you observe a stack trace similar to HDFS-5671 in your scenario?

        Colin Patrick McCabe added a comment -

        I guess I should also explain why this doesn't happen in branch-1 of Hadoop. The reason is that Hadoop-1 had no socket cache and no grace period before the sockets were closed. The client simply opened a new socket each time, performed the op, and then closed it. This would result in (basically) no sockets in CLOSE_WAIT. Remember, CLOSE_WAIT only happens when the server is waiting for the client to execute close.

        Keeping sockets open is an optimization, but one that may require you to raise your maximum number of file descriptors. If you are not happy with this tradeoff, you can set dfs.client.socketcache.capacity to 0 and dfs.datanode.socket.reuse.keepalive to 0 to get the old branch-1 behavior. It will be slower, though.
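        For illustration, a minimal sketch of how a client could request this branch-1-like behavior, assuming the Hadoop 2.x property names quoted above; in practice both keys would normally be set in hdfs-site.xml, and the keepalive key is read by the DataNode, not the client:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;

        public class NoSocketCacheExample {            // illustrative class name
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                // Client side: disable the DFSClient socket cache entirely.
                conf.setInt("dfs.client.socketcache.capacity", 0);
                // DataNode side: no keepalive grace period before the DN closes idle
                // sockets. This key is read by the DataNode, so it must be set in the
                // DataNode's hdfs-site.xml to take effect; it is shown here only to
                // document the pairing.
                conf.setInt("dfs.datanode.socket.reuse.keepalive", 0);
                // Any DFSClient created from this Configuration (for example by HBase
                // when it opens HFiles) then opens a fresh socket per operation and
                // closes it afterwards, at the cost of extra latency.
                FileSystem fs = FileSystem.get(conf);
                fs.close();
            }
        }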

        Colin Patrick McCabe added a comment -

        I looked into this issue. I found a few things:

        The HDFS socket cache is too small by default and times out too quickly. Its default size is 16, but HBase seems to be opening many more connections to the DN than that. In this situation, sockets must inevitably be opened and then discarded, leading to sockets in CLOSE_WAIT.

        When you use positional read (aka pread), we grab a socket from the cache, read from it, and then immediately put it back. When you seek and then call read, we don't put the socket back at the end. The assumption behind the normal read method is that you are probably going to call read again, so it holds on to the socket until something else comes up (such as closing the stream). In many scenarios, this can lead to seek+read generating more sockets in CLOSE_WAIT than pread.

        I don't think we want to alter this HDFS behavior, since it's helpful in the case that you're reading through the entire file from start to finish-- which many HDFS clients do. It allows us to make certain optimizations such as reading a few kilobytes at a time, even if the user only asks for a few bytes at a time. These optimizations are unavailable with pread because it creates a new BlockReader each time.

        So as far as recommendations for HBase go:

        • use short-circuit reads whenever possible, since in many cases you can avoid needing a socket at all and just reuse the same file descriptor
        • set the socket cache to a bigger size and adjust the timeouts to be longer (I may explore changing the defaults in HDFS...)
        • if you are going to keep files open for a while and do random reads, use pread, never seek+read (see the sketch below).
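        To make the last point concrete, a minimal sketch of the two read styles on an HDFS stream; the file path, offset, and buffer size are placeholders:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataInputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class PreadVsSeekRead {
            public static void main(String[] args) throws Exception {
                FileSystem fs = FileSystem.get(new Configuration());
                byte[] buf = new byte[4096];

                try (FSDataInputStream in = fs.open(new Path("/hbase/example/hfile"))) {
                    // Positional read (pread): borrows a socket from the cache and
                    // returns it immediately, so random reads do not pin sockets.
                    int n = in.read(123456L, buf, 0, buf.length);

                    // seek + read: the stream holds on to its block reader (and its
                    // socket) until the stream is closed or the reader is replaced,
                    // which is what can inflate CLOSE_WAIT under random-read workloads.
                    in.seek(123456L);
                    int m = in.read(buf, 0, buf.length);

                    System.out.println("pread=" + n + " bytes, seek+read=" + m + " bytes");
                }
                fs.close();
            }
        }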
        Avi Zrachya added a comment -

        I have something that might shed light on this issue.
        It happens with Cloudera's CDH 4.2.1 and CDH 4.4.0.
        It does not happen with Hortonworks.
        So I assume it is connected to the fact that CDH 4 runs on HDFS 2.

        Please check this direction; the problem is consistent on CDH 4.

        Enis Soztutar added a comment -

        I was not able to reproduce this problem consistently when I had initially bumped into it. The number of hanging TCP connections varied across nodes. Let me try Avi's approach once more.

        Avi Zrachya added a comment -

        Yes, it is most definitely the regionserver, as you can see below.
        PID 21592 is the process holding the CLOSE_WAIT sockets, and as you can see, 21592 is the regionserver.

        [root@hd2-region3 ~]# netstat -nap |grep CLOSE_WAIT |grep 21592 |wc -l
        13156
        [root@hd2-region3 ~]# ps -ef |grep 21592
        root 17255 17219 0 12:26 pts/0 00:00:00 grep 21592
        hbase 21592 1 17 Aug29 ? 03:29:06 /usr/java/jdk1.6.0_26/bin/java -XX:OnOutOfMemoryError=kill -9 %p -Xmx8000m -ea -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -Dhbase.log.dir=/var/log/hbase -Dhbase.log.file=hbase-hbase-regionserver-hd2-region3.swnet.corp.log ..
        
        stack added a comment -

        You can attribute the CLOSE_WAIT to the regionserver process (it is not the datanode process? If it is the datanode process, what version of Hadoop?).

        Can someone else try to repro what Avi Zrachya is reporting above? Enis Soztutar, what if you do the sequence Avi Zrachya outlines? Thanks.

        Avi Zrachya added a comment -

        In our case it's definitely not because of many random reads.

        On a cluster that has no traffic:
        1. restart the regionserver on one node; the CLOSE_WAIT sockets go away.
        2. wait a few minutes and see that there are still 0 CLOSE_WAIT sockets.
        3. re-balance the cluster so the new regionserver gets some regions.
        4. as soon as the regionserver gets some regions, many CLOSE_WAIT sockets start to show up.

        My assumption is that it happens when the regionserver reads/initializes the assigned regions.

        Reading data from the cluster does not seem to affect the CLOSE_WAIT sockets, but when regions in transition kick in for whatever reason, the CLOSE_WAIT count starts to climb.

        stack added a comment -

        Lots of random reads?

        Enis Soztutar added a comment -

        We have also observed a similar situation, which rendered some of the regionservers unusable because the master was not able to open more sockets to the regionserver.

        $ for i in `cat allnodes`; do echo $i; ssh $i "netstat -to | grep CLOSE_WAIT" ; done  
        horn04
        tcp       22      0 horn04.gq1.ygridcore:49853 horn04.gq1.ygridcore:50010 CLOSE_WAIT  off (0.00/0/0)
        tcp        1      0 horn04.gq1.ygridcore:49812 horn04.gq1.ygridcore:50010 CLOSE_WAIT  off (0.00/0/0)
        horn05
        tcp       76      0 horn05.gq1.ygridcore:40253 horn05.gq1.ygridcore:50010 CLOSE_WAIT  off (0.00/0/0)
        tcp        1      0 horn05.gq1.ygridcore:39667 horn05.gq1.ygridcore:50010 CLOSE_WAIT  off (0.00/0/0)
        tcp      166      0 horn05.gq1.ygridcore:39919 horn05.gq1.ygridcore:50010 CLOSE_WAIT  off (0.00/0/0)
        tcp       97      0 horn05.gq1.ygridcore:40631 horn05.gq1.ygridcore:50010 CLOSE_WAIT  off (0.00/0/0)
        tcp        5      0 horn05.gq1.ygridcore:40227 horn05.gq1.ygridcore:50010 CLOSE_WAIT  off (0.00/0/0)
        tcp       32      0 horn05.gq1.ygridcore:39707 horn05.gq1.ygridcore:50010 CLOSE_WAIT  off (0.00/0/0)
        

        I was not able to nail down the root cause at that time though.


          People

          • Assignee: Unassigned
          • Reporter: Avi Zrachya
          • Votes: 0
          • Watchers: 21
