HBASE-900: Regionserver memory leak causing OOME during relatively modest bulk importing

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.18.1, 0.19.0
    • Fix Version/s: 0.19.0
    • Component/s: None
    • Labels: None

      Description

      I have recreated this issue several times and it appears to have been introduced in 0.2.

      During an import to a single table, memory usage of individual region servers grows without bound; when set to the default 1GB heap they eventually die with an OOME. This has happened to me as well as to Daniel Ploeg on the mailing list. In my case, I have 10 RS nodes and the OOME happens with a 1GB heap at only about 30-35 regions per RS. In previous versions, I imported to several hundred regions per RS with the default heap size.

      I am able to get past this by increasing the max heap to 2GB. However, the appearance of this in newer versions leads me to believe there is now some kind of memory leak happening in the region servers during import.

      Attachments

      1. 900-p4.patch
        15 kB
        stack
      2. hbase-900-part3.patch
        3 kB
        stack
      3. 900-part2-v7.patch
        18 kB
        stack
      4. 900-part2-v5.patch
        15 kB
        stack
      5. 900-part2-v4.patch
        15 kB
        stack
      6. 900-part2.patch
        12 kB
        stack
      7. 900.patch
        130 kB
        stack
      8. memoryOn13.png
        14 kB
        stack


          Activity

          Andrew Purtell added a comment -

          Should the indexing interval be set higher by default? At least until MapFile is brought down or a custom file format replacement is put in?

          stack added a comment -

          Another thing to do to ameliorate memory usage when cells are small is to up the indexing interval from the default 32 to 245 or 1024, etc. It makes a difference for sure, more than changing the upperLimit does.

          stack added a comment -

          To generate OOMEs, change the PE so that cells are 10 bytes in size instead of the default 1000 bytes in size.

          stack added a comment -

          Resolving with commit of part 4.

          Ran more tests with small cells. We run for longer if hbase.regionserver.globalMemcache.upperLimit is set down from the 0.4 default. Set it down to 0.3 or even 0.25 to make more room for indices (this means we can carry more regions before OOME).
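
          As a rough illustration of this tuning (not an official recommendation), the sketch below lowers the limit via the Hadoop Configuration API; the property name comes from this comment, while the class name and chosen value are illustrative, and in practice the setting would live in hbase-site.xml.

          import org.apache.hadoop.conf.Configuration;

          public class MemcacheTuningExample {
            public static void main(String[] args) {
              Configuration conf = new Configuration();
              // Cap the aggregate memcache at 25% of heap instead of the 0.4 default,
              // leaving more headroom for MapFile indices.
              conf.set("hbase.regionserver.globalMemcache.upperLimit", "0.25");
              System.out.println("upperLimit = "
                  + conf.getFloat("hbase.regionserver.globalMemcache.upperLimit", 0.4f));
            }
          }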

          stack added a comment -

          Added a comment to HBASE-70. It's about fixing up our memory management story.

          stack added a comment -

          I'm going to close this issue after p4 goes in. Enough work has been done on it at least for 0.19.0 time frame even though we will continue to have memory issues until we start counting the size of loaded MapFile indices. I'll open a new issue to do this for 0.20.0 timeframe. To fix, will require our bringing down MapFile into hbase or putting in place the new file format.

          stack added a comment -

          Patch that adds a scheduled Executor to BlockFSInputStream. It runs periodically to check for any entries in the soft values reference queue. In testing it seems to work. It has hard-coded values, which is kinda ugly, but the alternative – passing in a Configuration – is not viable down here low in the io classes.
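
          As a rough illustration of the approach described in this comment (not the actual BlockFSInputStream patch), the sketch below pairs a soft-valued cache with a scheduled executor that periodically drains the reference queue so cleared entries are removed from the map; the class, field names, and period are invented for the example.

          import java.lang.ref.ReferenceQueue;
          import java.lang.ref.SoftReference;
          import java.util.Map;
          import java.util.concurrent.ConcurrentHashMap;
          import java.util.concurrent.Executors;
          import java.util.concurrent.ScheduledExecutorService;
          import java.util.concurrent.TimeUnit;

          class SoftBlockCache {
            // Soft value that remembers its key so the map entry can be removed
            // once the garbage collector clears the referent.
            static final class SoftValue extends SoftReference<byte[]> {
              final long key;
              SoftValue(long key, byte[] block, ReferenceQueue<byte[]> q) {
                super(block, q);
                this.key = key;
              }
            }

            private final Map<Long, SoftValue> cache = new ConcurrentHashMap<>();
            private final ReferenceQueue<byte[]> queue = new ReferenceQueue<>();
            private final ScheduledExecutorService sweeper =
                Executors.newSingleThreadScheduledExecutor();

            SoftBlockCache() {
              // Hard-coded period, mirroring the "hard-coded values" caveat above.
              sweeper.scheduleAtFixedRate(this::drainQueue, 10, 10, TimeUnit.SECONDS);
            }

            void put(long pos, byte[] block) {
              cache.put(pos, new SoftValue(pos, block, queue));
            }

            byte[] get(long pos) {
              SoftValue sv = cache.get(pos);
              return sv == null ? null : sv.get();
            }

            // Remove map entries whose referents the GC has already cleared.
            void drainQueue() {
              SoftValue sv;
              while ((sv = (SoftValue) queue.poll()) != null) {
                cache.remove(sv.key, sv);
              }
            }
          }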

          stack added a comment -

          Looking more at why I can make an OOME writing small cells, I see the MapFile indices starting to come to the fore. I counted 90MB of indices in a heap of 11 regions and 45 storefiles. A few were up in the 20+MB range. Accounting for this size I'll leave aside for the 0.19.0 release (as is, we can't get at the index anyway in the current MapFile, not unless we brought MapFile local – let's not do that for 0.19.0).

          stack added a comment -

          Committed part 3. It's an improvement. Will look at a few more heaps but this might be good enough for 0.19.0 for writing. Next up, part 4: making sure the blockcache gets cleared promptly, i.e. the OOME when writing and reading at the same time.

          stack added a comment -

          Part 3 takes the size of region memcaches at the start of a flush and then subtracts that size on flush completion, rather than zeroing it as soon as the flush starts. It's still not enough for the case where cells are 10 bytes in size, but it goes half as far again before OOME'ing. Might be enough for 0.19.0.
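
          A simplified sketch of the accounting change described above (class and method names are illustrative, not the actual HBase code): the global memcache size is decremented by the snapshotted amount only after the flush completes, instead of being zeroed as soon as the flush begins.

          import java.util.concurrent.atomic.AtomicLong;

          class MemcacheAccounting {
            interface RegionMemcache {
              long snapshotSize();          // bytes held by the snapshot being flushed
              void flushSnapshotToDisk();   // may take milliseconds to tens of seconds
            }

            private final AtomicLong globalMemcacheSize = new AtomicLong();

            void addEdit(long editHeapSize) {
              globalMemcacheSize.addAndGet(editHeapSize);
            }

            void flushRegion(RegionMemcache memcache) {
              // Record how much this region's memcache holds at the start of the flush...
              long flushed = memcache.snapshotSize();
              try {
                memcache.flushSnapshotToDisk();
              } finally {
                // ...and only subtract it once the flush is done, so the lingering
                // snapshot still counts against the global limit while it exists.
                globalMemcacheSize.addAndGet(-flushed);
              }
            }
          }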

          stack added a comment -

          Around flush we make a snapshot. As soon as the snapshot is made, we zero the memcache size. Suspicious. The snapshot hangs around until the flush is completed, which can take anything from a millisecond to ten+ seconds, and it's usually 64MB in size. In a 1G heap fielding a withering upload, that could be what throws us over.

          stack added a comment -

          Thanks for the +1s.

          I can still make it OOME here locally if I use small BatchOperations – cells of size 10 bytes – and if I put up lots of clients. Investigating. And I still need to fix the OOME that happens on random reading because the blockcache is not getting processed comprehensively.

          Tim Sell added a comment -

          ditto

          Andrew Purtell added a comment -

          +1 No OOME with part 1 and v7 of part 2, even with heavy write load.

          stack added a comment -

          Applying v7. Can improve on it in later patches as we get more info.

          + Downs the client-side batch write heap default from 10MB to 2MB
          + Adds ByteSize interface with a heapSize member. BatchUpdate, HStoreKey, etc., implement it.
          + The sizes returned out of heapSize favor 64-bit JVMs. They were obtained from studying heaps made by running the HRS and from runs of the new BU and Memcache mains, which have little scripts to generate heaps with arrays of BUs and different Memcaches that can then be heap-dumped and studied.
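
          A hedged sketch of the ByteSize idea from the list above; the interface name and heapSize() method come from this comment, while the example implementation, its field layout, and the overhead constant are illustrative guesses rather than the committed patch.

          interface ByteSize {
            /** @return approximate heap footprint of this object, in bytes */
            long heapSize();
          }

          class BatchOperationExample implements ByteSize {
            // Rough fixed per-object overhead for a 64-bit JVM; an assumed constant.
            private static final int OBJECT_OVERHEAD = 16;

            private final byte[] column;
            private final byte[] value;

            BatchOperationExample(byte[] column, byte[] value) {
              this.column = column;
              this.value = value;
            }

            @Override
            public long heapSize() {
              // Count the payload plus a fixed estimate for headers and references,
              // not just the raw payload bytes.
              return OBJECT_OVERHEAD
                  + (column == null ? 0 : column.length)
                  + (value == null ? 0 : value.length);
            }
          }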

          stack added a comment -

          v7 adjusts our BatchUpdate sizing (we were a little under). It also sets the default for the client write buffer down from 10M to 2M. With 10 handlers, when it gets to the server side that's 10MB x 10, which is 1/10th of your heap if you are running 1G.

          stack added a comment -

          v5 just adds a new Memcache to the Memcache#main. I did some more testing and we are coming in close enough on Memcache sizes. Would like to commit this part2.

          Tim's run w/ v4 ran into hdfs issues – twice. Didn't OOME. What about you Andrew?

          In a 1G heap I OOME'd and it was not blockcache retention nor Memcache size so at least two other fixes coming on this issue.

          Tim Sell added a comment -

          Ran with 900 part2. 2 gig heap. using table output format.
          OOME'd. 18 of 88 maps completed.
          Running test again with 900 part2 v4.
          stack I'll email you a link to the dump / logs.

          Andrew Purtell added a comment -

          I'm testing part 2 v4 now also.

          stack added a comment -

          v4 of part 2 of this issue. Want to do a bit more testing before I commit.

          Tim Sell added a comment -

          running test of part2 patch now. I'll post the results in my morning.

          stack added a comment -

          First cut at new sizing. Adds a new ByteSize interface that things like HSK, BU, and BO implement, making estimates of size that are not just a count of the payload. Left the flush on the client side at 10MB; the number of edits should be a good bit smaller now that we do things like count the BU row and the size of BU+BO when summing to see if we've hit the flush boundary.

          I checked our estimates against files output to the filesystem and they seem close enough. Doing the same comparison of memcache size to that given by the profiler is a bit tougher, but I'm trying.
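
          To make the flush-boundary check concrete, here is a minimal client-side write buffer sketch that sums heapSize() estimates (assuming a ByteSize-style interface like the one sketched earlier on this page); the class, method names, and the 10MB threshold are illustrative only.

          import java.util.ArrayList;
          import java.util.List;

          class WriteBufferExample {
            private static final long FLUSH_THRESHOLD = 10L * 1024 * 1024; // 10MB, per this comment

            private final List<ByteSize> pending = new ArrayList<>();
            private long pendingHeapSize;

            void add(ByteSize update) {
              pending.add(update);
              // Sum estimated heap size, not just payload bytes, toward the boundary.
              pendingHeapSize += update.heapSize();
              if (pendingHeapSize >= FLUSH_THRESHOLD) {
                flush();
              }
            }

            void flush() {
              // In the real client this would ship the batch to the regionserver.
              pending.clear();
              pendingHeapSize = 0;
            }
          }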

          stack added a comment -

          Our calculation of Memcache sizes is way off. Our math says the aggregate of all Memcaches is 200MB. In the profiler, the 153 Memcaches present on OOME have 800MB accumulated. Working on a better memcache sizer.

          Other items: the presence of compressors/decompressors is 'normal'. A mapfile index is block compressed. The bad news is that though the index file is closed as soon as possible, the allocated buffers for decompressors stick around (MapFile keeps a reference to the index SequenceFile so it's not GC'd). The OK news is that, in the scheme of things, this accounts for a small amount of heap – about 10MB in Tim's case.

          stack added a comment -

          Studying my replica of the Tim Sell job – i.e. using TOF and seeing 100k+ BatchUpdates in an array held in the HBaseRPC#Invocation#parameters field – I now conclude that TOF is operating "as-advertised". The default is that the client marshalls 10MB of data. In the PE case, this is 12k edits (we measure the BU to be of size 1039 bytes, which is probably low-ball looking at a BU up in jhat, but near enough). If the server is running 10 handlers, then a common case is 10x10MB of edits just sitting around while the batch of edits is being processed server-side. We should set the client-side 10MB down to maybe 2MB as the default, but this is not the root cause of the Tim Sell OOME (avoiding TOF, he ran longer but still OOME'd). In his case, the 10MB holds even more edits – 70k for 10MB seems viable after he described his data format – and, allowing that our accounting of object sizes is coarse, the 'deep size' of 318MB reported in the profiler is probably about right.

          So, TODO, set the client-side batch of edits flush size down from 10MB to 2MB.

          Now to look at latest Tim Sell heap dump.

          Andrew Purtell added a comment -

          We use TOF also.

          stack added a comment -

          Ran a MR job using TableOutputFormat – batches of BatchUpdate – and got a heap that looked like Tim Sell's, with 100k BatchUpdate instances. It's not obvious to me how we're doing this. Adding instrumentation to help me narrow in on the issue. Tim Sell is running a test that avoids TOF.

          stack added a comment -

          Have been looking at Tim Sell's heaps over the last day. The anomaly is hundreds of thousands of BatchUpdates. The write rate at OOME is about 15k/20k a second. It's like we're retaining arrays of BatchUpdates – something in the rpc invocation code – but it looks right when I read it. I'm missing something obvious. Will keep at it.

          stack added a comment -

          I committed the above patch as part 1 of this issue. Thanks for testing, Andrew. Part 2 will be fixing the blockcache. There may be a part 3 (Tim Sell just manufactured an hprof for his OOME'ing cluster – need to figure out what's up with his failures) and even a part 4 (jgray's failure – though I think this is fixed by HBASE-1027).

          stack added a comment -

          nm. I'll just go w/ your +1 above. Will work on the 1046 next.

          stack added a comment -

          It wasn't an OOME on a regionserver that brought on the hbase-1046? If you think not, and no OOMEs, great. I'll commit.

          Andrew Purtell added a comment -

          A scenario where I'm sure I would have seen OOMEs succeeds. However another occurrence of HBASE-1046 might have compromised the testing.

          Definitely +1 on the patch. Heap use on my regionservers is much better.

          stack added a comment -

          HADOOP-4797 is intriguing though not our direct problem (I think – we can pull it in if we think it can help). I made HADOOP-4802 to fix the apurtell issue up in hadoop.

          Andrew Purtell added a comment -

          Running with the patch now. There is an improvement. Will have to run for a while to see what the impact on stability is.

          stack added a comment -

          Patch that brings down Server and Client from hadoop ipc. We now have the bulk of hadoop ipc local. Classes have been renamed to have an HBase prefix to distinguish them from their hadoop versions. Had to bring at least Server local because the fix needed meddling in a private class (Server.Handler). Added a check on the size of the stack-based ByteArrayOutputStream after every use. It used to always reset. Now, if the BAOS is > the initial buffer size, we allocate a new BAOS instance rather than reset.
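
          A minimal sketch of the buffer check just described (invented names, not the committed HBase ipc code): reuse the per-handler response buffer unless a large response has grown it past its initial size, in which case allocate a fresh one so the oversized backing array can be garbage collected.

          import java.io.ByteArrayOutputStream;

          class HandlerResponseBuffer {
            private static final int INITIAL_SIZE = 10 * 1024; // assumed initial buffer size

            private ByteArrayOutputStream buffer = new ByteArrayOutputStream(INITIAL_SIZE);

            ByteArrayOutputStream get() {
              return buffer;
            }

            // Called after every response has been written out.
            void doneWithBuffer() {
              if (buffer.size() > INITIAL_SIZE) {
                // The buffer grew to hold a big response; drop it rather than keep the
                // large backing array alive for the life of the handler thread.
                buffer = new ByteArrayOutputStream(INITIAL_SIZE);
              } else {
                buffer.reset(); // cheap: keeps the existing small backing array
              }
            }
          }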

          Verified in testbed it does the right thing. Unit tests pass. Tempted to commit but maybe Andrew you can give it a spin first?

          Next will work on the blockcache leak.

          stack added a comment -

          Yeah, I think it's the stack-based BAOS in Server. Rather than allocating a new one each time, it's reset. Reset looks like it keeps the old buffer – it just resets the buffer length. Saves on allocations. Means that if we ever return a big Cell in a RowResult, then the server buffer stays that big (I see also that hbase.regionserver.handler.count is set to 30, which jibes with the 30 I saw in the heap dump). Handlers live for the life of the application.

          Working on a patch for Andrew to try.

          stack added a comment -

          Looking more at my local heap, I see ReferenceQueues with links to megabytes of unreleased data. More evidence that we are not processing ReferenceQueues fast enough. Need to fix. Might be able to go with a bigger blockcache size if there is one Map for all blockcaches.

          Looking at the Andrew Purtell heap dump, it does not have the same character as mine, where we are holding on to blockcache items. His heap has 30 instances of stack-based ByteArrayOutputStreams; together they add up to 690MB of data. Trying to figure out which BAOS is the problem. Our use in hbase is innocent. At the moment the ipc Server use is suspect. Digging.

          stack added a comment -

          Here is one theory. Looking at a heap that OOME'd here on the test cluster using JProfiler, there were a bunch of instances of SoftValue (30 or 40k). I was able to sort them by deep size and most encountered held byte arrays 16k in size. This would seem to indicate elements of the blockcache. The odd thing is that you'd think the SoftValues shouldn't be in the heap on OOME; they should have been cleared by the GC. Looking, each store file instance has a Map of SoftValues. They are keyed by position into the file. The GC does the job of moving the blocks that are to be cleared onto a ReferenceQueue, but unless the ReferenceQueue gets processed promptly, we'll hold on to the SoftValue references (JProfiler has a button which says 'clean References' and after selecting this, the SoftValues remained). The ReferenceQueue gets processed when we add a new block to the cache or if we seek to a new location in a block that we got from the cache (only). Otherwise, blocks to be removed are not processed. If random-reading or only looking at certain stores in a regionserver, all other storefiles, unless they are accessed, will continue to hold on to blocks via their uncleared ReferenceQueue.

          I tried adding in a check of the ReferenceQueue every time anything was accessed on a file, but I still OOME'd using a random read test.

          Next thing to try is a single Map that holds all blockcache entries. Will be lots of contention on this single Map but better than going to disk any day. All accesses will check the ReferenceQueue.
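
          A compact sketch of that single-shared-map idea, reusing the SoftBlockCache shape sketched earlier on this page (all names illustrative): every read drains the reference queue, so cleared soft values cannot pile up in storefiles that are never touched again.

          class SharedBlockCacheSketch {
            // One cache shared by all storefiles instead of one map per file.
            private final SoftBlockCache shared = new SoftBlockCache();

            byte[] getBlock(long pos) {
              // Hypothetical: process cleared references on every access, not only on insert.
              shared.drainQueue();
              return shared.get(pos);
            }
          }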

          Only downer is that Tim Sell says his last test was run without blockcache enabled and that it made no difference. Maybe try it Andrew?

          Meantime, I'll try the above suggestion. Andrew, any chance of a copy of your heap dump? Tim the same?

          stack added a comment -

          Datapoint: Tim Sell is having similar OOME'ing issues doing an import. He reports that he disabled blockcaching with no apparent change in behavior.

          Trying to run a simplified replica of Andrew's recipe above, the global memflusher – with the new HBASE-1027 in place – got stuck here around an OOME down in the DFSClient building up a response:

          Exception in thread "ResponseProcessor for block blk_5165224789834035674_1602" java.lang.OutOfMemoryError: GC overhead limit exceeded
                  at java.util.Arrays.copyOf(Unknown Source)
                  at java.lang.AbstractStringBuilder.expandCapacity(Unknown Source)
                  at java.lang.AbstractStringBuilder.append(Unknown Source)
                  at java.lang.StringBuilder.append(Unknown Source)
                  at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2319)
          
          "IPC Server handler 0 on 60020" daemon prio=10 tid=0x00007f7f501a4400 nid=0x2b31 in Object.wait() [0x0000000042b00000..0x0000000042b01b00]
             java.lang.Thread.State: WAITING (on object monitor)
                  at java.lang.Object.wait(Native Method)
                  at java.lang.Object.wait(Object.java:485)
                  at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.flushInternal(DFSClient.java:3026)
                  - locked <0x00007f7f8615cd50> (a java.util.LinkedList)
                  - locked <0x00007f7f8615c9c0> (a org.apache.hadoop.hdfs.DFSClient$DFSOutputStream)
                  at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeInternal(DFSClient.java:3104)
                  - locked <0x00007f7f8615c9c0> (a org.apache.hadoop.hdfs.DFSClient$DFSOutputStream)
                  at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.close(DFSClient.java:3053)
                  at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:59)
                  at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:79)
                  at org.apache.hadoop.io.SequenceFile$Writer.close(SequenceFile.java:959)
                  - locked <0x00007f7f8615c858> (a org.apache.hadoop.io.SequenceFile$Writer)
                  at org.apache.hadoop.io.MapFile$Writer.close(MapFile.java:183)
                  - locked <0x00007f7f86157370> (a org.apache.hadoop.hbase.io.BloomFilterMapFile$Writer)
                  at org.apache.hadoop.hbase.io.BloomFilterMapFile$Writer.close(BloomFilterMapFile.java:212)
                  - locked <0x00007f7f86157370> (a org.apache.hadoop.hbase.io.BloomFilterMapFile$Writer)
                  at org.apache.hadoop.hbase.regionserver.HStore.internalFlushCache(HStore.java:680)
                  - locked <0x00007f7f5de99c88> (a java.lang.Integer)
                  at org.apache.hadoop.hbase.regionserver.HStore.flushCache(HStore.java:627)
                  at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:863)
                  at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:772) 
                  at org.apache.hadoop.hbase.regionserver.MemcacheFlusher.flushRegion(MemcacheFlusher.java:220)
                  at org.apache.hadoop.hbase.regionserver.MemcacheFlusher.flushSomeRegions(MemcacheFlusher.java:284)
                  - locked <0x00007f7f5dc0f828> (a org.apache.hadoop.hbase.regionserver.MemcacheFlusher)
                  at org.apache.hadoop.hbase.regionserver.MemcacheFlusher.reclaimMemcacheMemory(MemcacheFlusher.java:254)
                  at org.apache.hadoop.hbase.regionserver.HRegionServer.batchUpdates(HRegionServer.java:1455)
                  at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)
                  at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
                  at java.lang.reflect.Method.invoke(Unknown Source)
                  at org.apache.hadoop.hbase.ipc.HbaseRPC$Server.call(HbaseRPC.java:634)
                  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:892)
          
          Andrew Purtell added a comment - edited

          Here is a scenario that guarantees a flurry of regionserver OOMEs on my cluster, which is now running latest trunk on top of Hadoop 0.18.2-dev + Ganglia 3.1 patch:

          1) Start up heritrix with hbase-writer. 25 TOEs should do it. Start a long running job.

          2) Build up content until there are ~20 regions per regionserver.

          3) Run a mapreduce job that walks a metadata column of the content table – not all columns, not the family storing the content itself, just some small auxiliary metadata.

          4) Simultaneously with the scanning read (#3), perform what amounts to a bulk import with 5 concurrent writers. (Typical for my load is 4-8GB in maybe a few 10K updates.) Specifically I am using MozillaHtmlParser to build Document objects from text content and am then storing back serialized representations of those Document objects.

          After an invocation of #4, heap usage has ballooned across the cluster and it is only a matter of time. Memcache is within limits and for my configuration represents 25% of heap max (I run with a 2G heap), so the remaining data is something else. Heap histograms from jhat show a very large number of allocations of [B, which can be as much as 1.5GB in total. Soon the regionservers will start to compact or do other heap-intensive activities and will fall over.

          A flurry of OOMEs can confuse the master. It will reject region opens thinking they are closing and the regions will remain offline until a manual restart of the cluster. Disable/enable of the table only makes that particular wrinkle worse.

          After restart, invariably a number of regions want to (and do) split.

          Andrew Purtell added a comment -

          The file involved was a 105MB Win32 executable. Using compression compounded the heap charge already taken by the several copies made from RPC to Cell to ByteArrayOutputStream, etc. I will use a file size limit of 20MB going forward. Also I filed HBASE-1024.

          stack added a comment -

          Andrew, what if you disabled compression? See if you still have the issue. How many heritrix instances? If one, how many Writers? 5 is the default IIRC? A byte array of 100MB is kinda crazy. Was there a big page crawled by heritrix? You can check its log. It outputs sizes. Maybe you need an upper bound on page sizes in heritrix if it's not there already?

          Andrew Purtell added a comment -

          Yes, RECORD compression on 'content' family, which will have up to two cells per row: 'content:raw' will contain the response body written by a custom Heritrix hbase writer, and if the mimetype is text/*, another cell 'content:document' containing a serialized Document object produced by MozillaHtmlParser (http://sourceforge.net/projects/mozillaparser/). Some binary content can be very large, e.g. 100MB zip, tgz, etc. Row index is SHA1 hash of content object. There is also an 'info' family, not compressed, that stores attributes. Finally there is a 'urls' family, not compressed, that will have a cell for each unique URL corresponding to the content object.

          stack added a comment -

          Are you using compression in your instance, Andrew?
          St.Ack

          Andrew Purtell added a comment -

          Another RS went down this morning. This time somehow I ended up with 1,790,757,470 bytes in 9760 instances of byte[] on the heap. Scrolling through the list of these objects, most are < 256 bytes, some are <= 5K. Only 2727 HSKs. I also see 2426 instances of Hashtable$Entry, only 148 instances of TreeMap$Entry. 12 HStores with 22 HStoreFiles.

          Found a number of 100MB instances of byte[], e.g. referenced from a ByteArrayOutputStream referenced from an o.a.h.i.compress.CompressionOutputStream. Another referenced from both o.a.h.h.io.DataOutputBuffer and o.a.h.h.io.DataInputBuffer referenced from a SequenceFile$Reader. Looks like a Cell has a reference to a copy of this. Found more with local/weak references from Server$Handler. Did a couple of big files (~100MB) and copies thereof take down the RS?

          Andrew Purtell added a comment - edited

          Last night under Heritrix hbase-writer stress I had a regionserver with 2GB heap go down with an OOME. It was serving 4 regions only. This was with 0.18.1, so the line numbers won't match up with trunk.

          class                                        instances   bytes
          [B                                           27343       2008042041
          [C                                           11714       966164
          org.apache.hadoop.hbase.HStoreKey            9781        312992
          java.util.TreeMap$Entry                      7596        311436
          [Lorg.apache.hadoop.io.WritableComparable;   17          139536

          Incidentally the RS was also hosting ROOT so the whole cluster went down. I agree with jgray this combined with the ROOT SPOF is deadly.

          Stack trace of the OOME:

          2008-11-22 09:24:40,950 INFO org.apache.hadoop.hbase.regionserver.HRegion: starting compaction on region content,29308276c599f8a0baca15c224c22ad2,1227337259738
          2008-11-22 09:24:56,429 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: Set stop flag in regionserver/0:0:0:0:0:0:0:0:60020.compactor
          java.lang.OutOfMemoryError: Java heap space
          at java.util.Arrays.copyOf(Arrays.java:2786)
          at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:94)
          at java.io.DataOutputStream.write(DataOutputStream.java:90)
          at org.apache.hadoop.io.compress.CompressorStream.compress(CompressorStream.java:78)
          at org.apache.hadoop.io.compress.CompressorStream.write(CompressorStream.java:71)
          at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
          at java.io.DataOutputStream.write(DataOutputStream.java:90)
          at org.apache.hadoop.hbase.io.ImmutableBytesWritable.write(ImmutableBytesWritable.java:116)
          at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
          at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)
          at org.apache.hadoop.io.SequenceFile$RecordCompressWriter.append(SequenceFile.java:1131)
          at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:980)
          at org.apache.hadoop.io.MapFile$Writer.append(MapFile.java:198)
          at org.apache.hadoop.hbase.regionserver.HStoreFile$BloomFilterMapFile$Writer.append(HStoreFile.java:846)
          at org.apache.hadoop.hbase.regionserver.HStore.compact(HStore.java:988)
          at org.apache.hadoop.hbase.regionserver.HStore.compact(HStore.java:893)
          at org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:902)
          at org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:860)
          at org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:83)

          stack added a comment -

          One thought: I wonder if fixing the indexing interval so it's actually 32 rather than the default 128 helped make this issue worse?

          Andrew Purtell added a comment -

          This is a recurring issue presently causing pain on current trunk. It seems to be worse now than in 0.18.1. Heap gets out of control (> 1GB) for regionservers hosting only ~20 regions or so. Much of the heap is tied up in byte[] referenced by HSKs, which are referenced by the WritableComparable[] arrays used by MapFile indexes.

          From a jgray server:

          class                                        instances   bytes
          [B                                           3525873     615313626
          org.apache.hadoop.hbase.HStoreKey            1605046     51361472
          java.util.TreeMap$Entry                      1178067     48300747
          [Lorg.apache.hadoop.io.WritableComparable;   56          4216992

          Approximately 56 mapfile indexes were resident. Approximately 15-20 regions were being hosted at the time of the crash.

          On an apurtell server, >900MB of heap was observed to be consumed by mapfile indexes for 48 store files corresponding to 16 regions.
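
          As a back-of-envelope illustration (all numbers below are hypothetical, not measurements from this issue) of why small cells inflate MapFile index memory: the index keeps roughly one HStoreKey per indexing interval's worth of entries, so many small cells means many resident index entries.

          public class IndexMemoryEstimate {
            public static void main(String[] args) {
              long entriesPerStoreFile = 10_000_000L; // assumed: lots of tiny cells
              int indexInterval = 32;                 // the low interval discussed above
              long bytesPerIndexEntry = 100L;         // rough guess: HStoreKey + overhead

              long indexEntries = entriesPerStoreFile / indexInterval;
              long bytesPerFile = indexEntries * bytesPerIndexEntry;
              long storeFiles = 48;                   // storefile count from this comment

              System.out.printf("~%d MB of index per store file, ~%d MB across %d files%n",
                  bytesPerFile / (1024 * 1024),
                  storeFiles * bytesPerFile / (1024 * 1024),
                  storeFiles);
            }
          }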

          stack added a comment -

          Let me move this issue out of 0.18.1. There is something going on here given it's reported by different people, but my attempts at replication using a simple schema fail. We need more info. We can do a 0.18.2 later after we figure out what's leaking.

          stack added a comment -

          I ran the randomread test overnight with gc logging enabled. Here are snippets from the gc log from different times during the night showing full GCs:

          3738.529: [Full GC 107893K->86326K(220480K), 0.3393940 secs]
          3944.907: [Full GC 110079K->90694K(212160K), 0.3828950 secs]
          ...
          43142.078: [Full GC 105996K->82458K(139840K), 0.3558530 secs]
          43339.019: [Full GC 102767K->86387K(190656K), 0.3512450 secs]
          43490.046: [Full GC 105187K->87709K(212288K), 0.3523640 secs]
          43735.589: [Full GC 107799K->88233K(174784K), 0.3547080 secs]
          ...
          25003.983: [Full GC 105412K->87523K(205312K), 0.3559230 secs]
          25139.998: [Full GC 106102K->80911K(131712K), 0.3432420 secs]
          ..
          47924.811: [Full GC 105487K->80566K(148864K), 0.3392500 secs]
          48088.641: [Full GC 98025K->86603K(212736K), 0.3439750 secs]
          48338.127: [Full GC 105214K->87088K(159872K), 0.3481490 secs]
          ..
          

          It's holding pretty steady.

          I also attached a memory graph from ganglia covering the night. It shows nothing untoward.

          Rong-En Fan added a comment -

          I see a possible memory leak in the regionserver after running it with ~200 regions per node for a few days (it keeps receiving read traffic with very few writes). My used swap grows slowly. Region servers occupy around 3G of memory (both virtual and rss as shown in top). Once I restart the regionserver, the swap space is freed as well.

          This is with hbase 0.2.x on hadoop 0.17.x.

          stack added a comment -

          Looks like it was actually 225 regions before things went bad (need to be more patient).

          stack added a comment -

          I got 185 PE regions into a single HRS before HDFS went bad; about 32M rows of 1k values and 10-byte keys. Profiling, the only thing that grows in memory is the count of HSKs. Each store file we open has an index of long to HSK. As the upload progresses, more index is in memory.

          Was going to move this out of 0.18.1 since it's not obviously broken, but then I talked to Rong-En. He says that it was reading where he was seeing memory issues. Will try a read test.


            People

            • Assignee: stack
            • Reporter: Jonathan Gray
            • Votes: 0
            • Watchers: 4
