HBASE-3551

Loaded hfile indexes occupy a good chunk of heap; look into shrinking the amount used

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      I hung out with a user, Marc, and we were looking over configs and his cluster profile up on EC2. One thing we noticed was that his 100+ 1G regions of two families had ~2.5G of heap resident. We did a bit of math and couldn't get to 2.5G, so that needs looking into. Even still, 2.5G is a bunch of heap to give over to indices (he actually OOME'd when he had his RS heap set to just 3G; we shouldn't OOME, we should just run slower). It sounds like he needs the indices loaded, but still, in some cases we should drop indices for unaccessed files.

        Activity

        Lars George added a comment -

        Just as a note: maybe another option is to compress the index in memory?

        stack added a comment -

        Ok. Closing. Will reference your comment Marc over in HBASE-25, etc. I also added a section to schema design on size of rows and column family names, keeping them small. Thanks for digging in boss.

        <section xml:id="keysize">
        <title>Try to minimize row and column sizes</title>
        <para>In HBase, values are always freighted with their coordinates; as a
        cell value passes through the system, it'll be accompanied by its
        row, column name, and timestamp. Always. If your rows and column names
        are large, especially compared to the size of the cell value, then
        you may run up against some interesting scenarios. One such is
        the case described by Marc Limotte at the tail of
        <link xlink:href="https://issues.apache.org/jira/browse/HBASE-3551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13005272#comment-13005272">HBASE-3551</link>
        (recommended!).
        Therein, the indices that are kept on HBase storefiles (<link linkend="hfile">HFile</link>s)
        to facilitate random access may end up occupying large chunks of the HBase
        allotted RAM because the cell value coordinates are large.
        Marc, in the above-cited comment, suggests upping the block size so
        entries in the store file index happen at a larger interval, or
        modifying the table schema so it makes for smaller rows and column
        names.
        </para>
        </section>
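
        To make the "freighted with coordinates" point concrete, here is a minimal sketch (mine, not from the issue) of the serialized size of one KeyValue in the v1 layout; the row, family, and qualifier strings are made-up examples:

        # Approximate serialized size of one KeyValue (HFile v1 layout):
        #   4B key length + 4B value length + key + value, where
        #   key = 2B row length + row + 1B family length + family + qualifier + 8B timestamp + 1B type
        def keyvalue_size(row, family, qualifier, value_len):
            key_len = 2 + len(row) + 1 + len(family) + len(qualifier) + 8 + 1
            return key_len, 4 + 4 + key_len + value_len

        # Hypothetical wide coordinates wrapped around a tiny 7-byte value:
        key_len, total = keyvalue_size(row=b"user:some-long-natural-key-2011-03-08",
                                       family=b"metrics",
                                       qualifier=b"clicks_per_session_avg",
                                       value_len=7)
        print(key_len, total)   # 78 of 93 bytes are coordinates, repeated for every column of the row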

        Marc Limotte added a comment -

        I understand this better now. I did some poking around with the HFile tool. Average key length does seem to be around 150 bytes, as I estimated.

        For one hfile /hbase/foo/fb820ae7002fc96f78165802a0b05e63/metrics/14129209576094096, metadata is:

        avgKeyLen=159, avgValueLen=7, entries=49285512, length=615516343
        fileinfoOffset=592314718, dataIndexOffset=592315104, dataIndexCount=131869, metaIndexOffset=0, metaIndexCount=0, totalBytes=8653853680, entryCount=49285512, version=1

        Size of index = length - dataIndexOffset = 615516343 - 592315104 = 22mb

        Index data per Region Server = 22mb * 180 regions = almost 4gb. Plus the other column family, so this does seem to add up to the 5 to 6gb of HEAP we are seeing.

        # of entries per dataindex entry = 49285512 / 131869 = 374
        Times the key size (avg 157 bytes for this file) = 59k (close to the block size of 64k). So, seems to make sense.
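
        A minimal sketch of the arithmetic above, plugging in the numbers from the HFile tool output (the 180-region figure is the one quoted in this comment):

        # Figures from the HFile tool metadata above.
        length         = 615516343     # on-disk file length
        data_index_off = 592315104
        entries        = 49285512
        index_entries  = 131869

        index_bytes = length - data_index_off            # ~23.2 MB of data index in this one hfile
        cells_per_index_entry = entries / index_entries  # ~374 column entries per indexed block

        regions = 180                                    # per-region-server count quoted above
        per_rs  = index_bytes * regions                  # ~3.9 GiB ("almost 4gb"), before the second family
        print(index_bytes, round(cells_per_index_entry), per_rs / 2**30)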

        I also looked at the keyvalue pairs using the HFile tool (a section of output is below).

        We have a few billion rows (2 - 4 billion). I haven't done a full row count.

        What I didn't understand previously is that it's not 374 rows, but 374 "entries". An entry means a single column entry and the key is repeated for each column value. Given our fairly large key, that would add up quickly.

        Solutions
        1) Increase the hbase block size (I did this and it resolved our situation for now; see the sketch below for the expected effect on index size)
        2) Modify our schema to use smaller keys - perhaps IDs instead of string names.
        3) Modify our schema to have fewer columns - we could combine several related columns into one compound value.
        4) An LRU cache for storefile indexes

        Given the other options, #4 may not be warranted, so I think we can close this issue.
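
        For option 1, a rough model (my sketch, with a guessed ~20 bytes of offset/size bookkeeping per index entry) of how block size drives index size. The block size caps uncompressed data per block, so the uncompressed totalBytes from the hfile above is the right input, not the ~1G on-disk length:

        def est_index_bytes(uncompressed_bytes, block_size, avg_key_len, per_entry_overhead=20):
            # One index entry (key plus a little bookkeeping) per data block.
            blocks = uncompressed_bytes / block_size
            return blocks * (avg_key_len + per_entry_overhead)

        total_bytes = 8653853680          # uncompressed totalBytes from the hfile above
        for bs in (64 * 1024, 128 * 1024, 256 * 1024):
            mb = est_index_bytes(total_bytes, bs, avg_key_len=159) / 2**20
            print("block size %dK -> ~%.1f MB of index for this file" % (bs // 1024, mb))
        # 64K gives ~23 MB (close to the 22mb measured above); doubling the block size
        # roughly halves the index, at the cost of reading more data per random get.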

        Marc Limotte added a comment -

        Here's some more detail about the situation that Stack and I saw:

        From region server UI (via lynx):

        HBase Version: 0.90.0, r0b7903c50eef589c632582f7d9d6364eb3912c38 (HBase version and svn revision)
        HBase Compiled: Mon Jan 24 20:44:24 UTC 2011, root (when the HBase version was compiled, and by whom)
        Metrics: request=0.0, regions=107, stores=214, storefiles=381, storefileIndexSize=2983, memstoreSize=0,
        compactionQueueSize=29, usedHeap=3774, maxHeap=7141, blockCacheSize=509777848, blockCacheFree=987798472,
        blockCacheCount=7557, blockCacheHitCount=60151, blockCacheMissCount=38698247, blockCacheEvictedCount=0,
        blockCacheHitRatio=0, blockCacheHitCachingRatio=88 (RegionServer metrics; file and heap sizes are in megabytes)
        Zookeeper Quorum: ip-xxxxxxxxx.ec2.internal:2181 (addresses of all registered ZK servers)

        So, almost 3gb for the index

        1-2 stores per region, storefile-size = 1gb, hbase block size = 64k
        num-of-entries-per-storefile = storefile-size / hbase-block-size
        estimated index size = num-of-entries-per-storefile * num-store-files * key-and-entry-size
        key-and-entry-size = 20 to 200 => 150 (guess)
        estimated index size = (1G / 64K) * 381 * 150 = 900M (much less than 2983M)
        This doesn't account for any overhead in the index, but it's hard to imagine that the overhead would account for a 3X size difference.
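
        For reference, the same estimate as runnable arithmetic (the inputs are the guesses stated above):

        storefile_size = 1 << 30       # ~1 GB per storefile
        block_size     = 64 * 1024
        key_and_entry  = 150           # guessed bytes per index entry (the 20-200 range above)
        storefiles     = 381

        entries_per_storefile = storefile_size / block_size               # 16384
        estimated_index_mb = entries_per_storefile * storefiles * key_and_entry / 2**20
        print(estimated_index_mb)      # ~893 MB, well short of the observed storefileIndexSize=2983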

        Also, our compaction queue is fairly deep (due to forced major compactions). What impact could that have on storefileIndexSize?

        stack added a comment -

        @Ryan Regarding key size, it was small in this case and not tweakable. Same for index offset. Didn't want to use bigger blocks.

        @Andrew So, you are suggesting that we let go of the whole file, not just the index. That is probably the better thing to do; it addresses index size and the other costs associated with keeping files open.

        ryan rawson added a comment -

        the index size is related to (a) the block size and (b) the key size; perhaps by tweaking one or both, something beneficial might happen?

        Andrew Purtell added a comment -

        We talked before about using a fixed resource pool for file handles / storefile indices and loading or unloading on an LRU basis (HBASE-24, HBASE-2751, etc.)
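
        A minimal sketch of that idea, assuming a fixed byte budget and a caller-supplied loader; the names here (IndexPool, load_index) are made up and do not reflect actual HBase internals:

        from collections import OrderedDict

        class IndexPool:
            # Keep at most max_bytes of loaded storefile indices, evicting the
            # least-recently-used index when over budget (sketch only).
            def __init__(self, max_bytes, load_index):
                self.max_bytes = max_bytes
                self.load_index = load_index      # hypothetical: path -> (index_obj, size_in_bytes)
                self.used = 0
                self.cache = OrderedDict()        # path -> (index_obj, size), in LRU order

            def get(self, path):
                if path in self.cache:
                    self.cache.move_to_end(path)  # mark as most recently used
                    return self.cache[path][0]
                index, size = self.load_index(path)
                self.cache[path] = (index, size)
                self.used += size
                while self.used > self.max_bytes and len(self.cache) > 1:
                    _, (_, dropped) = self.cache.popitem(last=False)  # evict LRU index
                    self.used -= dropped
                return index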


          People

          • Assignee: Unassigned
          • Reporter: stack
          • Votes: 1
          • Watchers: 2
