HBASE-24: Scaling: Too many open file handles to datanodes

    Details

    • Type: Bug
    • Status: Open
    • Priority: Blocker
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: regionserver
    • Labels: None

      Description

      We've been here before (HADOOP-2341).

      Today the Rapleaf folks gave me an lsof listing from a regionserver. It had thousands of open sockets to datanodes, all in ESTABLISHED and CLOSE_WAIT state. On average they seem to have about ten file descriptors/sockets open per region (they have 3 column families IIRC; each family can have between 1-5 or so mapfiles open – 3 is the max, but compacting we open a new one, etc.).

      They have thousands of regions. 400 regions – ~100G, which is not that much – takes about 4k open file handles.

      If they want a regionserver to serve a decent disk's worth – 300-400G – then that's maybe 1600 regions... 16k file handles. With more than just 3 column families, we are in danger of blowing out the limit if it is 32k.

      We've been here before with HADOOP-2341.

      A dfsclient that used non-blocking i/o would help applications like hbase (the datanode doesn't have the problem as badly – the CLOSE_WAIT sockets on the regionserver side, the bulk of the open fds in the Rapleaf log, don't have a corresponding open resource on the datanode end).

      We could also just open mapfiles as needed, but that would kill our random read performance, and it's bad enough already.

      Attachments

      1. HBASE-823.patch (1 kB, Luo Ning)
      2. MonitoredReader.java (10 kB, Luo Ning)


          Activity

          Alex Newman added a comment -

          + conf); // defer opening streams
          is in the patch. Are we actually still deferring? That seems wrong.

          stack added a comment -

          Moving out of 0.92.0.

          Ted Yu added a comment -

          Should we take further action on this JIRA?

          stack added a comment -

          Moved from 0.21 to 0.22 just after merge of old 0.20 branch into TRUNK.

          stack added a comment -

          Moving out of 0.20.0 – we won't get to it I'm thinking.

          stack added a comment -

          Thanks Luo Ning. Yeah, I agree with your exposition above. In 0.19.0 hadoop, I believe the timed-out socket on the datanode is revived by the dfsclient (HADOOP-3831). I need to test. Regardless, we need to put a bound on the number of open mapfiles (as you've been saying for a good while now). Let me look at your patch (500 regions per server is really good for 0.18.x hbase).

          Luo Ning added a comment -

          My understanding:
          The xceiverCount limit must be set to avoid OOME. Given the -Xss option and the memory a java process can use, it can be set much higher than 256; I'm using 2500 on my DN.

          However many mapfiles hbase has open concurrently, that is how many xceiver threads the hadoop DNs will have.

          If write.timeout is set to 0, the xceiverCount limit is reached quickly as the hbase data grows, and then the DN hangs.

          Setting write.timeout to 8 minutes only helps in the following situation: hbase never reads/writes more than 'xceiverCount' mapfiles within those 8 minutes; otherwise the DN will stop serving data.

          The chain is: more data -> more mapfiles -> more open mapfiles -> more xceiver threads. Comparing all the solutions available along that chain, controlling the number of open mapfiles is the most efficient.

          My hbase installation: 2 regionservers/namenodes, each handling about 500 regions and 250G of data. Concurrent open mapfiles is set to 2000, and xceiverCount should be larger than that, so it is set to 2500.
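
          For illustration, a minimal hdfs-site.xml fragment matching the numbers above might look like the following (the values are the ones described in this comment; the property names keep Hadoop's historical spelling "xcievers", and the timeout is in milliseconds):

          <property>
            <name>dfs.datanode.max.xcievers</name>
            <value>2500</value> <!-- keep above hbase's cap on concurrently open mapfiles (2000 here) -->
          </property>
          <property>
            <name>dfs.datanode.socket.write.timeout</name>
            <value>480000</value> <!-- 8 minutes; 0 disables the timeout -->
          </property>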

          stack added a comment -

          Hey Luo Ning:

          So I understand correctly: rather than a timeout of 0 for dfs.datanode.socket.write.timeout as we suggest elsewhere, you are saying it should be left at, say, the current default of 8 minutes?

          Also, rather than keep raising the xceiverCount as more regions are added, you're saying it should be bounded – at the current default of 256? – otherwise we get the OOME about not being able to create threads, etc.?

          How many regions per server do you carry with your patch in place, Luo Ning?

          Thanks

          Luo Ning added a comment -

          The patch is for HStoreFile.java.

          Luo Ning added a comment -

          Sharing my experience here, hope it is helpful:
          1. Since hbase never closes mapfiles in normal usage, the dfs.datanode.socket.write.timeout timeout should always be set.
          2. xceiverCount should be set too, or there will be a stack memory overflow or some other resource limit will be hit.
          3. As in my comments above, the only real solution is limiting the number of mapfiles hbase has open concurrently, or it will OOME as the data size increases.

          Posting my patch here:
          1. The patch is made for hbase 0.18.0 and has worked well for the last 3 months, with about 500G of data on 4 machines now.
          2. The patch makes HStoreFile.HbaseReader extend MonitoredReader instead of the original MapFile.Reader, so we can control things in it.
          3. See the javadoc of MonitoredReader for more detail on how concurrent opens are controlled.
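
          The attached MonitoredReader.java is not reproduced here. As a rough, hypothetical sketch of the approach it takes – callers block until one of a fixed number of open-reader slots is free – something like the following would do against the 0.18-era MapFile API. The class name and the permit count are invented for illustration; the real patch hooks into HStoreFile.HbaseReader rather than a static helper:

          import java.io.IOException;
          import java.util.concurrent.Semaphore;

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.io.MapFile;

          /** Hypothetical helper that bounds how many MapFile readers are open at once. */
          public final class BoundedReaders {
            // One permit per concurrently open reader across the regionserver; 2000 mirrors
            // the "concurrent open mapfiles" setting mentioned above.
            private static final Semaphore OPEN_BUDGET = new Semaphore(2000);

            /** Blocks until a slot is free, then opens the reader (data file + index). */
            public static MapFile.Reader open(FileSystem fs, String dir, Configuration conf)
                throws IOException {
              OPEN_BUDGET.acquireUninterruptibly();
              boolean opened = false;
              try {
                MapFile.Reader reader = new MapFile.Reader(fs, dir, conf);
                opened = true;
                return reader;
              } finally {
                if (!opened) OPEN_BUDGET.release(); // don't leak a slot if the open fails
              }
            }

            /** Closes the reader and returns its slot to the budget. */
            public static void release(MapFile.Reader reader) throws IOException {
              try {
                reader.close(); // frees the underlying DFS streams / datanode sockets
              } finally {
                OPEN_BUDGET.release();
              }
            }
          }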

          stack added a comment -

          Made it a blocker. Fellas have been running into this in various guises; e.g. jgray's recent OOME in DNs.

          stack added a comment -

          Here are related remarks made by Jean-Adrien over in hadoop. The baseline is that we've been bandaging over this ugliness for a while now. Time to address the pus.

          xceiverCount limit reason
          by Jean-Adrien, Jan 08, 2009, 03:03am

          Hello all,
          
          I'm running HBase on top of hadoop and I'm having some difficulty tuning the hadoop conf to work well with HBase.
          My configuration is 4 desktop-class machines: 2 run a datanode/region server, 1 only a region server, and 1 a namenode/hbase master, with 1Gb RAM each.

          When I start HBase, about 300 regions must be loaded on 3 region servers, so a lot of concurrent accesses are made to Hadoop. My first problem, using the default configuration, was seeing too many of:
          DataXceiver: java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write.
          
          I was wondering what the reason for such a timeout is. Where is the bottleneck? At first I believed it was a network problem (I have 100Mbits/s interfaces), but after monitoring the network, the load seems low when it happens.
          Anyway, I found the parameter
          dfs.datanode.socket.write.timeout and I set it to 0 to disable the timeout.
          
          Then I saw in datanodes
          xceiverCount 256 exceeds the limit of concurrent xcievers 255
          What exactly is the role of these receivers? To receive the replicated blocks and/or to receive files from clients?
          When are their threads created? When do they end?
          
          Anyway, I found the parameter
          dfs.datanode.max.xcievers
          I upped it to 511, then to 1023 and today to 2047; but my cluster is not so big (300 HBase regions, 200Gb including a replication factor of 2), and I'm not sure I will be able to keep raising this limit for long. Moreover, it considerably increases the amount of virtual memory needed for the datanode jvm (about 2Gb now, with only 500Mb for heap). That leads to excessive swapping, and a new problem arises: some leases expire, and my entire cluster eventually fails.
          
          Can I tune another parameter to avoid these concurrent receivers being created?
          Could upping dfs.replication.interval, for example, help?

          Could the fact that I run the regionserver on the same machine as the datanode increase the number of xcievers? In that case I'll try a different layout and use the network bottleneck to avoid stressing the datanodes.
          
          Any clue about the inner workings of the hadoop xcievers would be appreciated.
          Thanks.
          
          -- Jean-Adrien
          

          ... and

          Some more information about the case.
          
          I read HADOOP-3633 / 3859 / 3831 in jira.
          I run version 18.1 of hadoop, therefore I have no fix for 3831.
          Nevertheless my problem seems different.
          The threads are created as soon as the client (HBase) requests data. The data
          arrives at HBase without problem, but the threads never end. Looking at the
          number-of-threads graph:
          
          http://www.nabble.com/file/p21352818/launch_tests.png
          (you might need to go to nabble to see the image:
          http://www.nabble.com/xceiverCount-limit-reason-tp21349807p21349807.html)
          
          In the graph, hadoop / HBase is run 3 times (A/B/C):
          A:
          I configure hadoop with dfs.datanode.max.xcievers=2023 and
          dfs.datanode.socket.write.timeout=0
          As soon as I start hbase, the regions load their data from dfs and the number of
          threads climbs up to 1100 in about 2-3 min. Then it stays in that range.
          All DataXceiver threads are in one of these two states:
          
          "org.apache.hadoop.dfs.DataNode$DataXceiver@6a2f81" daemon prio=10
          tid=0x08289c00 nid=0x6bb6 runnable [0x8f980000..0x8f981140]
            java.lang.Thread.State: RUNNABLE
                 at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
                 at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215)
                 at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
                 at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
                 - locked <0x95838858> (a sun.nio.ch.Util$1)
                 - locked <0x95838868> (a java.util.Collections$UnmodifiableSet)
                 - locked <0x95838818> (a sun.nio.ch.EPollSelectorImpl)
                 at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
                 at
          org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:260)
                 at
          org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:155)
                 at
          org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:150)
                 at
          org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:123)
                 at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
                 at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
                 - locked <0x95838b90> (a java.io.BufferedInputStream)
                 at java.io.DataInputStream.readShort(DataInputStream.java:295)
                 at
          org.apache.hadoop.dfs.DataNode$DataXceiver.readBlock(DataNode.java:1115)
                 at
          org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:1037)
                 at java.lang.Thread.run(Thread.java:619)
          
          "org.apache.hadoop.dfs.DataNode$DataXceiver@1abf87e" daemon prio=10
          tid=0x90bbd400 nid=0x61ae runnable [0x7b68a000..0x7b68afc0]
            java.lang.Thread.State: RUNNABLE
                 at java.net.SocketInputStream.socketRead0(Native Method)
                 at java.net.SocketInputStream.read(SocketInputStream.java:129)
                 at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
                 at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
                 - locked <0x9671a8e0> (a java.io.BufferedInputStream)
                 at java.io.DataInputStream.readShort(DataInputStream.java:295)
                 at
          org.apache.hadoop.dfs.DataNode$DataXceiver.readBlock(DataNode.java:1115)
                 at
          org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:1037)
                 at java.lang.Thread.run(Thread.java:619)
          
          
          B:
          I changed hadoop configuration, introducing the default 8min timeout.
          Once again, as soon as HBase gets data from dfs, the number of threads grows to
          1100. After 8 minutes the timeout fires, and they fail one after another
          with the exception:
          
          2009-01-08 14:21:09,305 WARN org.apache.hadoop.dfs.DataNode:
          DatanodeRegistration(192.168.1.13:50010,
          storageID=DS-1681396969-127.0.1.1-50010-1227536709605, infoPort=50075,
          ipcPor
          t=50020):Got exception while serving blk_-1718199459793984230_722338 to
          /192.168.1.13:
          java.net.SocketTimeoutException: 480000 millis timeout while waiting for
          channel to be ready for write. ch :
          java.nio.channels.SocketChannel[connected local=/192.168.1.13:50010 re
          mote=/192.168.1.13:37462]
                 at
          org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:185)
                 at
          org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
                 at
          org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
                 at
          org.apache.hadoop.dfs.DataNode$BlockSender.sendChunks(DataNode.java:1873)
                 at
          org.apache.hadoop.dfs.DataNode$BlockSender.sendBlock(DataNode.java:1967)
                 at
          org.apache.hadoop.dfs.DataNode$DataXceiver.readBlock(DataNode.java:1109)
                 at
          org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:1037)
                 at java.lang.Thread.run(Thread.java:619)
          
          C:
          During this third session, I made the same run, but before the timeout
          fired, I stopped HBase. In this case, the threads end correctly.
          
          Is it the responsibility of the hadoop client to manage its connection pool
          with the server? If so, would the problem be an HBase problem?
          Anyway, I found my problem; it is not a matter of performance.
          
          Thanks for your answers
          Have a nice day.
          
          -- Jean-Adrien
          --
          View this message in context: http://www.nabble.com/xceiverCount-limit-reason-tp21349807p21352818.html
          

          And Raghu:

          Jean-Adrien wrote:
          
              Is it the responsibility of the hadoop client to manage its connection pool
              with the server? If so, would the problem be an HBase problem?
              Anyway, I found my problem; it is not a matter of performance.
          
          
          Essentially, yes. Client has to close the file to relinquish connections, if clients are using the common read/write interface.
          
          Currently if a client keeps many hdfs files open, it results in many threads held at the DataNodes. As you noticed, timeout at DNs helps.
          
          Various solutions are possible at different levels: the application (hbase), the client API, HDFS, etc. https://issues.apache.org/jira/browse/HADOOP-3856 is a proposal at the HDFS level.
          
          stack added a comment -

          Let's play with all the ideas suggested above – only open on demand, an LRU/ARU cache of opened mapfiles, etc. I don't think it'd be hard to do, and it would get us over this awkward hump in our scaling story.
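
          As a sketch of the "LRU cache of opened mapfiles" idea (the class name and pool size are invented for illustration, not taken from any patch here), an access-ordered LinkedHashMap that closes evicted readers gives open-on-demand plus LRU eviction over the 0.18-era MapFile API:

          import java.io.IOException;
          import java.util.LinkedHashMap;
          import java.util.Map;

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.io.MapFile;

          /** Hypothetical LRU pool of open MapFile readers; the eldest is closed on overflow. */
          public class ReaderPool {
            private final FileSystem fs;
            private final Configuration conf;
            private final Map<String, MapFile.Reader> cache;

            public ReaderPool(final FileSystem fs, final Configuration conf, final int maxOpen) {
              this.fs = fs;
              this.conf = conf;
              // An access-ordered LinkedHashMap gives LRU eviction almost for free.
              this.cache = new LinkedHashMap<String, MapFile.Reader>(maxOpen, 0.75f, true) {
                protected boolean removeEldestEntry(Map.Entry<String, MapFile.Reader> eldest) {
                  if (size() <= maxOpen) return false;
                  try {
                    eldest.getValue().close(); // releases the datanode sockets/file handles
                  } catch (IOException e) {
                    // evict anyway; a real implementation would log this
                  }
                  return true;
                }
              };
            }

            /** Open on demand; reuse the reader if it is already open. */
            public synchronized MapFile.Reader get(String mapFileDir) throws IOException {
              MapFile.Reader reader = cache.get(mapFileDir);
              if (reader == null) {
                reader = new MapFile.Reader(fs, mapFileDir, conf); // cold start: re-reads the index
                cache.put(mapFileDir, reader);
              }
              return reader;
            }
          }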

          Andrew Purtell added a comment -

          This came up on IRC today.

          stack added a comment -

          Putting into 0.19. We can move it out later.

          Billy Pearson added a comment -

          We may not need to reject regions if we do a little more work on the load balancing functions and make them smarter by factoring in memory size.

          Andrew Purtell added a comment -

          I think the bounded pool idea, dynamically sized to the available heap on startup, makes a lot of sense.

          Also I wonder if region servers can/should relinquish or reject regions after a threshold of available to maximum heap is reached and forced GC does not help?
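
          A rough sketch of sizing such a pool from the heap at startup (the per-reader estimate and heap fraction below are assumptions for illustration, not measured numbers):

          /** Hypothetical sizing of the open-reader pool from the heap available at startup. */
          public final class PoolSizing {
            public static int maxOpenReaders() {
              long maxHeap = Runtime.getRuntime().maxMemory(); // bytes, reflects -Xmx
              long perReaderEstimate = 1L << 20;               // assume ~1 MB per open reader (index + buffers)
              double readerShare = 0.25;                       // assumed share of heap for readers; rest for memcache etc.
              return (int) ((maxHeap * readerShare) / perReaderEstimate);
            }
          }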

          Luo Ning added a comment -

          I have filed a new issue, HBASE-745, for discussion of regionserver scalability.
          Btw, I don't think reducing the compaction threshold is a good idea, because it will waste cpu time on compaction, as described in that issue.

          stack added a comment -

          Thanks for doing the profiling LN. Do you think we should put an upper bound on the number of regions a particular regionserver can carry at any one time, or an upper bound on the number of open Readers? I wonder, if you want to carry many regions, whether lowering the compaction threshold from 3 to 2 – or even to 1 – would make any difference in our memory profile (at a CPU cost)? We load the index and keep it around to avoid doing it on each random access – maybe with a bounded pool of open MapFiles, we could move files in and out of the pool on some kind of LRU basis?

          Luo Ning added a comment -

          From my profiling results, the memory usage of a regionserver is determined by three things:
          1. The mapfile indexes read into memory (io.map.index.skip can adjust this, but the entries kept will stay in memory whether you need them or not).
          2. The data output buffer used by each SequenceFile$Reader (each can be measured as the largest value size in the file).
          3. The memcache, which hbase can control itself.

          So, when the number of concurrently open mapfiles is limited, the memory usage of a regionserver is limited too. Otherwise, a regionserver will OOME as the data size increases. In my test environment, 100G of data (200M of mapfile index in total, 2000 HStoreFiles opened, 512M memcache) used 1.5G of heap, the most I can set with -Xmx.

          I think this issue should be taken seriously on the HBase side, not only on the DFS side.
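
          For the index-memory part, the io.map.index.skip knob mentioned above is set in the Hadoop configuration; a hypothetical fragment that keeps only every 32nd index entry in memory (trading a little extra scanning per random read for a smaller resident index) would be:

          <property>
            <name>io.map.index.skip</name>
            <value>31</value> <!-- skip 31 entries between each index entry kept in memory -->
          </property>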

          Bryan Duxbury added a comment -

          This is really more of a DFS issue, and there's an admittedly suboptimal solution to be had in increasing the max number of open file handles at the OS level. As such, we're going to hold off on solving this issue until after 0.2.

          Jim Kellerman added a comment -

          Changing priority to Critical to emphasize that this is one of the major roadblocks to scalability that we have.

          And the problem is not only the number of connections the region servers have open to the dfs; it also includes the datanode connections for each open file.

          stack added a comment -

          We could do that, or make it optional behavior until we figure out a fix.

          FYI, to do a cold-start random read into a MapFile, you need to open two files (the data file and its index), read all of the index into memory, find the closest offset in the index and then seek around in the data file to find the asked-for key. In hbase currently, only the data file is held open (the index file read into memory has been forced and the index file has then been let go).
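
          For reference, a hypothetical "open as needed" read against the 0.18-era MapFile API looks roughly like this (key/value types are assumed to be Text for illustration). Closing after every get frees the datanode sockets, but the index load and the open/seek dance are repeated on the next access, which is the random-read cost being weighed here:

          import java.io.IOException;

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.io.MapFile;
          import org.apache.hadoop.io.Text;

          /** Hypothetical cold-start random read: open, read, close every time. */
          public class ColdRead {
            public static Text get(String mapFileDir, Text key) throws IOException {
              Configuration conf = new Configuration();
              FileSystem fs = FileSystem.get(conf);
              // Opening the reader opens the data file and its index and loads the index into memory.
              MapFile.Reader reader = new MapFile.Reader(fs, mapFileDir, conf);
              try {
                Text value = new Text();
                // Finds the closest index entry, then seeks in the data file for the asked-for key.
                return (Text) reader.get(key, value);
              } finally {
                reader.close(); // releases the DFS streams / datanode sockets
              }
            }
          }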

          Chad Walters added a comment -

          Perhaps we should just open the mapfiles as needed and measure the effect on random reads. It could turn out that the effect is not that big, given that random reads are not performing well anyway. Focusing on allowing folks to scale up rather than performance on the random read use case seems like it could be the right trade-off at the moment and give us some breathing room to address the non-blocking IO issue at a more measured pace.


             People

             • Assignee: Unassigned
             • Reporter: stack
             • Votes: 1
             • Watchers: 16
