HBASE-745

scaling of one regionserver, improving memory and cpu usage

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.1.3, 0.2.0
    • Fix Version/s: 0.2.0
    • Component/s: regionserver
    • Labels:
      None
    • Environment:

      hadoop 0.17.1

      Description

      After weeks of testing hbase 0.1.3 with hadoop (0.16.4, 0.17.1), I found there is a lot of work to do before a single regionserver can handle about 100G of data, or even more. I'd like to share my opinions here with stack and the other developers.

      First, the easiest way to improve the scalability of a regionserver is to upgrade the hardware: use a 64-bit OS and 8G of memory for the regionserver process, and speed up disk IO.

      Besides hardware, these are the software bottlenecks I found in the regionserver:
      1. As data grows, compaction eats CPU (and IO) time; total compaction time is basically linear in the whole data size, and sometimes even quadratic in it.
      2. Memory usage depends on the number of open mapfiles.
      3. The number of network connections depends on the number of open mapfiles; see HADOOP-2341 and HBASE-24.

      Attachments

      1. hbase-745-for-0.2.patch
        4 kB
        Izaak Rubin
      2. HBASE-745.compact.patch
        3 kB
        Luo Ning

          Activity

          Luo Ning added a comment - - edited

Calculating compaction time:
1. Suppose we keep writing data to a regionserver, and row ids are hashed across all regions.
2. With the default optionalcacheflushinterval (30 min) and compaction threshold (3), every HStore creates a flushed storefile every 30 minutes, so after 1 hour each HStore has 3 storefiles (including the original one) and a compaction is triggered. In other words, every HStore on the regionserver compacts once per hour.
3. A compaction of an HStore reads all the data in that HStore's mapfiles, so I assume compaction time is proportional to the total size of the mapfiles the HStore holds. The total compaction time of a regionserver (driven by optionalcacheflushinterval) therefore depends on the amount of data the regionserver serves.
4. Now we can see that the default optionalcacheflushinterval is not suitable for most environments. My hardware (2 x dual-core Xeon 3.2 GHz, SCSI) compacts about 10 MB of data per second, which means about 36 GB per hour. Does that mean a regionserver can only hold less than 36 GB?
5. What about increasing optionalcacheflushinterval to 12 or even 24 hours? Unfortunately, I found it useless because of globalMemcacheLimit (default 512 MB): when the limit is reached, memcaches are flushed (creating storefiles) until the total memcache size drops below 256 MB. Since inserted row ids are spread across all regions, nearly half of the regions get a new storefile on each flush, so once about 1 GB has been inserted (4 flushes of the global memcache), all data on the regionserver needs compaction. No setting can adjust this behaviour.
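
To make the arithmetic in points 3-5 concrete, here is a minimal back-of-envelope sketch (my own illustration, not part of any patch; the class name and constants are hypothetical) that turns an assumed compaction throughput and flush interval into the data ceiling described above:

    /**
     * Back-of-envelope sketch (illustrative only): if every HStore compacts once
     * per interval and compaction has to re-read all of its data, then
     * throughput * interval bounds how much data a regionserver can serve
     * before it does nothing but compact.
     */
    public class CompactionCeiling {
        public static void main(String[] args) {
            double throughputMBperSec = 10.0;   // observed: ~10 MB compacted per second
            double intervalSeconds = 3600.0;    // one compaction per HStore per hour
            double ceilingGB = throughputMBperSec * intervalSeconds / 1024.0;
            // Prints roughly 35 GB, matching the ~36 GB figure in point 4.
            System.out.printf("Compaction-bound data ceiling: ~%.0f GB per regionserver%n", ceilingGB);
        }
    }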

          Luo Ning added a comment -

Compaction improvement:

Compaction has very poor efficiency in the current hbase release (0.1.3). Suppose an HStore has 3 mapfiles: the original one is 128 MB and the 2 newly flushed ones are each smaller than 1 MB (the most common situation when a regionserver carries 512 or more HStores and flushes a 256 MB global memcache each time). We compacted 2 MB of data but had to read and rewrite the whole 128 MB file as well!

My suggestions:
1. Set the threshold larger. This means fewer compactions but more mapfiles (memory usage is discussed later in this issue).
2. Implement incremental compaction, meaning: don't compact down to 1 file each time; compact only the small files, and do a full compaction once the file sizes grow large enough. In HStore#compact(boolean) we can use an algorithm to select which hstorefiles to compact, as in the sketch below. (I will attach my implementation for review later.)
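
The actual selection logic is in the attached HBASE-745.compact.patch; purely as an illustration of the "compact small files only" rule (not the patch itself, and with a made-up ratio parameter), a selector might walk the storefiles newest-first and stop before a file that dwarfs everything chosen so far:

    import java.util.ArrayList;
    import java.util.List;

    /**
     * Illustrative sketch of incremental compaction selection: given storefile
     * sizes ordered newest-to-oldest, pick the newest files and stop before a
     * file that is much larger than everything selected so far, so small flush
     * files are merged without rewriting the big base file.
     */
    public class CompactionSelector {
        static List<Long> selectForCompaction(List<Long> sizesNewestFirst, double ratio) {
            List<Long> selected = new ArrayList<>();
            long selectedTotal = 0;
            for (long size : sizesNewestFirst) {
                // Stop once the next (older, larger) file would dominate the compaction.
                if (!selected.isEmpty() && size > ratio * selectedTotal) {
                    break;
                }
                selected.add(size);
                selectedTotal += size;
            }
            return selected;
        }

        public static void main(String[] args) {
            // Two ~1 MB flush files and one 128 MB base file: only the small ones are picked.
            System.out.println(selectForCompaction(List.of(1L, 1L, 128L), 2.0)); // [1, 1]
        }
    }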

          Luo Ning added a comment -

Memory calculation:
The memory usage of a regionserver is determined by 3 things:
#1. the mapfile indexes read into memory (io.map.index.skip can shrink them, but they all stay in memory whether you need them or not)
#2. the data output buffer used by each SequenceFile$Reader (each can be measured as the largest value size in the file)
#3. the memcache, controlled by 'globalMemcacheLimit' and 'globalMemcacheLimitLowMark'

That is, besides #3 which is already bounded, memory is determined by the concurrently open mapfiles (in fact, the open SequenceFiles over the mapfile data).

In HBASE-24, stack advised controlling either the number of open regions or the number of open mapfile readers. I would prefer controlling the open mapfile readers directly, since they are the core of regionserver resource usage.

My suggestions for regionserver memory:
1. Upgrade to hadoop 0.17.1 (there is only one line in hbase 0.1.3 that is incompatible with hadoop 0.17.1; I will file an issue separately). HADOOP-2346 resolved the DataNode running out of connections/threads by using read/write timeouts.
2. Set globalMemcacheLimit to a lower size if your application doesn't read recently inserted records frequently.
3. Implement a MonitoredMapFileReader that extends MapFile.Reader and limits the number of concurrently open instances with an LRU, checking readers in and out in every MapFile.Reader method; then make HStoreFile.HbaseMapFile.HbaseReader extend MonitoredMapFileReader. A rough sketch of the idea follows below.

Looking beyond the 0.1.3 release, I think hbase needs an interface like HStoreFileReader to abstract the file reading path; that would make controlling open readers much easier.
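
As a rough, self-contained sketch of suggestion 3 (not the proposed MonitoredMapFileReader itself, and using a placeholder OpenReader class instead of Hadoop's MapFile.Reader so the example stays runnable), the LRU check-in/check-out idea could look like this:

    import java.util.LinkedHashMap;
    import java.util.Map;

    /** Cap the number of concurrently open readers; close the least recently used one when over the cap. */
    public class ReaderPoolSketch {
        /** Stand-in for an open mapfile reader holding index and buffer memory. */
        static class OpenReader {
            final String path;
            OpenReader(String path) { this.path = path; System.out.println("open  " + path); }
            void close() { System.out.println("close " + path); }
        }

        private final int maxOpen;

        // An access-ordered LinkedHashMap doubles as the LRU; the eldest
        // (least recently used) reader is closed once the cap is exceeded.
        private final LinkedHashMap<String, OpenReader> lru =
            new LinkedHashMap<String, OpenReader>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, OpenReader> eldest) {
                    if (size() > maxOpen) { eldest.getValue().close(); return true; }
                    return false;
                }
            };

        ReaderPoolSketch(int maxOpen) { this.maxOpen = maxOpen; }

        /** Check out a reader for a store file, reopening it if it was evicted. */
        synchronized OpenReader checkout(String path) {
            OpenReader reader = lru.get(path);   // get() refreshes the LRU order
            if (reader == null) {
                reader = new OpenReader(path);
                lru.put(path, reader);           // may evict and close the eldest reader
            }
            return reader;
        }

        public static void main(String[] args) {
            ReaderPoolSketch pool = new ReaderPoolSketch(2);
            pool.checkout("/hbase/region1/mapfile-a");
            pool.checkout("/hbase/region1/mapfile-b");
            pool.checkout("/hbase/region2/mapfile-c"); // closes mapfile-a (least recently used)
        }
    }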

          Billy Pearson added a comment -

I agree with your idea of incremental compaction.

My two ideas for increasing compaction efficiency while under load:

1. Compact only the newest threshold (3) mapfiles.

This allows a region server to compact just the latest 3 mapfiles created, lowering the number of mapfiles by 2 per compaction. The newest mapfiles will not hold the bulk of a region's data; if we are under load they will be small memcache flushes and will compact fast.

By always taking the newest ones, once the load reduces and there are only 3 mapfiles left, one of them will be the largest and oldest mapfile, and all old and new data will get compacted together.

2. The compaction queue.

Currently we only add a region to a queued list of regions needing a compaction check, and compact in that order.

My suggestion would be to have the queued list store how many times a region has been added to the compaction queue (memcache flushes). That way we can sort the list, compact the hot spots first, and, with the idea above implemented, reduce the number of mapfiles the fastest. When a compaction is done, reduce the region's count in the queue by the number of files compacted (or remove it if there is nothing left to compact), re-sort the list, and start over; a rough sketch of such a queue follows below.

These are my ideas on how we can reduce the number of mapfiles we have while under a write load.
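
Purely as an illustration of idea 2 (not an HBase implementation; the class and method names are made up), a weighted compaction queue could track flush counts per region and always hand back the hottest region first:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.PriorityQueue;

    /** Sketch of a weighted compaction queue: compact the region with the most pending flushes first. */
    public class CompactionQueueSketch {
        // How many memcache flushes each region has accumulated since its last compaction.
        private final Map<String, Integer> pendingFlushes = new HashMap<>();

        /** Called whenever a region's memcache is flushed. */
        synchronized void regionFlushed(String region) {
            pendingFlushes.merge(region, 1, Integer::sum);
        }

        /** Returns the region with the most pending flushes, or null if none. */
        synchronized String pollHottest() {
            PriorityQueue<Map.Entry<String, Integer>> byCount =
                new PriorityQueue<>((a, b) -> b.getValue() - a.getValue());
            byCount.addAll(pendingFlushes.entrySet());
            Map.Entry<String, Integer> hottest = byCount.peek();
            return hottest == null ? null : hottest.getKey();
        }

        /** After a compaction, credit the region for the files it merged. */
        synchronized void compacted(String region, int filesCompacted) {
            pendingFlushes.computeIfPresent(region,
                (r, count) -> count > filesCompacted ? count - filesCompacted : null);
        }

        public static void main(String[] args) {
            CompactionQueueSketch queue = new CompactionQueueSketch();
            queue.regionFlushed("regionA");
            queue.regionFlushed("regionB");
            queue.regionFlushed("regionB");          // regionB is the hot spot
            System.out.println(queue.pollHottest()); // regionB
            queue.compacted("regionB", 2);
            System.out.println(queue.pollHottest()); // regionA
        }
    }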

          Luo Ning added a comment -

Incremental compaction patch for the 0.1.3 release. I use a simple algorithm to automatically select the files to compact, described in the source.

Sorry for the lack of a unit test for this patch; I haven't yet learned how to prepare unit test data for this kind of issue :<. In fact, this patch has been running for about a week in my test environment, and most compaction times dropped from about 1 minute to under 5 seconds.

By the way, I manually removed from this patch some modifications to the 0.1.3 release version of HStore.java that were for hadoop 0.17.1 compatibility.

          Jim Kellerman added a comment -

          With respect to MapFile extensions in HBase, see HStoreFile$HBaseMapFile, HStoreFile$BloomFilterMapFile and HStoreFile$HalfMapFileReader

          Jim Kellerman added a comment -

          I would also suggest that with respect to performance, you should focus on trunk and not 0.1.x because trunk has changed the internals of flushing and compaction quite a bit, and it is unlikely that performance improvements for 0.1.x will port easily to trunk.

          Luo Ning added a comment - - edited

          > With respect to MapFile extensions in HBase, see HStoreFile$HBaseMapFile, HStoreFile$BloomFilterMapFile and HStoreFile$HalfMapFileReader

I have noticed the inheritance among the HStoreFile$xxxMapFile classes. Since every xxxReader inherits from HStoreFile$HbaseMapFile$HbaseReader, that is a good point for controlling all read operations in HbaseReader, so my suggestion is to have HbaseReader extend a new class (which itself extends MapFile.Reader) and do the limiting there.

          Luo Ning added a comment -

About the version: forget my code (patch) here; I want to provide more information about how hbase runs (including the patched results), which may be helpful for further design and coding. However, I think hbase users need the current release to be more stable and scalable, if it can be.

I will read the code from trunk. Is there any other discussion about memory and compaction that I can read first, in JIRA or on the wiki?

          stack added a comment -

LN: I'm not sure I follow the above comment. What are you thinking? Yes, hbase users need stability in 0.1.3 and in 0.2. Let's experiment in 0.3.

          No discussion of memory or compaction other than what is in JIRAs. Want to start up a wiki page that we can all hack on?

          FYI, Izaak is working on upgrading your patch so it works against TRUNK.

          Luo Ning added a comment -

Maybe I'm just hungry for a stronger hbase. I know Robustness and Scalability (in that order) are the focus of the 0.2 release, and "3TB of data on about ~50 nodes" means 60G per regionserver, which is not very hard; each regionserver (with the default config) can handle 30G of data on my testing server with 0.1.3.

I'm trying to make a regionserver handle more data, maybe 1T, because I think the resource (memory, cpu) usage of a regionserver should not depend on the existing data size, but on the active data size (read/write throughput).

I think I have found the bottlenecks (compaction eating cpu, open mapfiles eating memory), but I am NOT SURE about my solution, so I am posting it here for review, especially from Jim and Stack.

Here is my 'total solution', which I call the '0.1.3/0.17.1 scalability pack':
1. the HBASE-749 patch for 0.17.1 compatibility
2. the HADOOP-3778 patch for a socket exception bug
3. HADOOP-3779 for the concurrent connection limitation of the datanode (patch not attached)
4. the attached incremental compaction patch
5. an "open mapfile reader" limitation patch implementing my suggestion above, but it doesn't look good yet, so I haven't attached it.

With the above and some adjusted config properties, I have my regionserver handling about 400G of data now, with about 15G of test write throughput per day.

          Izaak Rubin added a comment -

          I've been looking over the issue, and I (and Stack) agree with LN and the changes proposed in his patch. However, as Jim noted, we want to be focusing on 0.2 instead of 0.1.3. I've taken LN's patch and modified it slightly to fit into trunk (hbase-745-for-0.2.patch). I've also added several additional assertions to TestCompaction to account for the changes.

All HBase tests passed successfully. However, this patch SHOULD NOT be applied until after HBASE-720 is resolved and its patch (hbase-720.patch) is applied. Both of these patches modify the same two files (HStore, TestCompaction), and they must be committed in the correct order (first 720, then 745).

          Billy Pearson added a comment -

I tried to apply this patch for 0.2 to trunk but got an error. I applied the hbase-720 patch successfully first, but this one failed.

          Izaak Rubin added a comment - - edited

          Hi Billy,

          I can't seem to replicate this problem - I removed my local copies of HStore and TestCompaction, updated, and then applied hbase-745-for-0.2.patch (successfully). The patch for HBase-720 was committed before you made your comment on Thursday (although the issue was only closed today) - is it possible that when you tried to apply the hbase-720 patch, you actually removed it by accident? Maybe try what I did (remove the files, update, and re-apply 745) and see if it still doesn't work - let me know.

          Billy Pearson added a comment - - edited

Now that I think about it, I think I tried an older version of trunk; my mistake, it applies to trunk.

I will run some bulk import tests on it soon and see if the compactions work out ok.

          stack added a comment -

Billy, I'm running tests too. So far, it looks like the LN patch is an improvement. Will report back when I have more data. If it's good – should know tonight – I'll apply it.

          stack added a comment -

I applied hbase-745-for-0.2.patch, Izaak's fixup of LN's original patch, though I saw little discernible improvement.

          Running the PerformanceEvaluation with the patch, we spent about 20% less time compacting in total but on test completion, there were 79 data files in the filesystem as opposed to 72 when I ran without the patch. My guess is that after the 79 files became 72, there wouldn't be much of the 20% difference left over.

          Test ran for about 30 minutes running 8 concurrent MR clients writing 8M rows.

          stack added a comment -

Hmm. Took another look. The comparison is a little more complicated than I supposed above. I rechecked the number of data files after the without-patch run completed, about ten minutes after it ended; about the same amount of time had elapsed when I went to check the with-patch test. The number of data files is rising, as is the aggregate of all time spent compacting. It would seem, then, that the patch cuts the time spent compacting by some 10-20% or so in the test I just ran.

          Billy Pearson added a comment -

I lost the data I was using to test the large import, so I am re-downloading it; I will be able to run my test on the patch in about 24 hours, when I am done processing my dataset again.
Your last post sounds better and more correct: if the patch is working correctly, we pick up efficiency because we do not have to compact the larger mapfiles with every compaction.
I would assume this will also help users who still have 32-bit servers keep the region server under the 2 GB limit, by flushing a little more often if needed under load.

          Billy Pearson added a comment -

Cannot test; the patch will not apply to trunk.

We should get this into 0.2.0 if it's showing good results like stack reported above.

          Billy Pearson added a comment -

Looks like this has been committed to trunk.
It seems to be improving my import speed: because I am spending less time on compaction, I get more cpu time for transactions.
I use compression on my table, so it improves my compaction speed by not having to uncompress and re-compress all the mapfiles on each compaction.
+1

So we should mark this issue done.

          stack added a comment -

          Hey Billy: Yeah, it was committed a while back. In my comments above, I'm not very enthusiastic because I did not see BIG gains in our simple PerformanceEvaluation. But thinking on it more, Luo Ning's simple rule is kinda elegant and in real-life situations is probably saving truckloads of CPU and I/O.

          stack added a comment -

The bulk of this work was applied to 0.2.0. I opened HBASE-823 to do Luo Ning's "open mapfile reader" limitation patch. Thanks for the patch Luo (and Izaak).


  People

  • Assignee:
    Unassigned
  • Reporter:
    Luo Ning
  • Votes:
    0
  • Watchers:
    2
