HBase
HBASE-9553

Pad HFile blocks to a fixed size before placing them into the blockcache

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Invalid
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      In order to make it easy on the garbage collector and to avoid full GC compaction phases, we should make sure that all (or at least a large percentage) of the HFile blocks cached in the block cache are exactly the same size.

      Currently an HFile block is typically slightly larger than the declared block size, because the block grows to accommodate the last KV written to it. The padding would be a ColumnFamily option. In many cases 100 bytes would probably be a good value to make all blocks exactly the same size (but of course it depends on the maximum size of the KVs).
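
      As a rough illustration (hypothetical code, not the real HFileBlock/BlockCache API; all names below are made up), padding a block's backing array up to a fixed target size before it is handed to the block cache could look something like this:

      // Hypothetical sketch only -- not actual HBase code.
      public final class BlockPadding {

        /**
         * Returns a buffer of exactly targetSize bytes containing the block's data,
         * or the original array unchanged if the block already exceeds the target
         * (e.g. because of an unusually large last KV).
         */
        static byte[] padToFixedSize(byte[] blockData, int targetSize) {
          if (blockData.length >= targetSize) {
            return blockData;                      // too large to pad; cache as-is
          }
          byte[] padded = new byte[targetSize];    // fixed-size allocation for the cache
          System.arraycopy(blockData, 0, padded, 0, blockData.length);
          return padded;                           // trailing bytes stay zero
        }

        public static void main(String[] args) {
          int target = 64 * 1024 + 100;            // 64k block size + 100 bytes of headroom
          byte[] block = new byte[64 * 1024 + 37]; // a block that ran slightly over 64k
          System.out.println(padToFixedSize(block, target).length); // prints 65636
        }
      }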

      This does not have to be perfect. The more of the blocks evicted and replaced in the block cache are of the exact same size, the easier it should be on the GC.

      Thoughts?

        Activity

        Nick Dimiduk added a comment -

        I think it's worth giving a try. Why not take it one step further and self-manage a slice of the BlockCache with this pre-defined block size, a la MemStoreLAB? Reserve, say, 80% of the BlockCache for slab management and leave the rest for the awkward-sized blocks.

        Instead of explicitly setting the buffer size, why not sample existing HFiles and calculate a guesstimate?
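
        One way to get such a guesstimate (hypothetical sketch; how the block sizes are sampled from existing HFiles is left out, and none of these names exist in HBase) would be to take a high percentile of the sampled block sizes as the fixed padded size:

        // Hypothetical sketch; not HBase code.
        import java.util.Arrays;

        public class PaddedSizeGuess {

          /** Pick e.g. the 95th percentile of sampled block sizes so most blocks fit without heavy over-padding. */
          static int guessPaddedSize(int[] sampledBlockSizes, double percentile) {
            int[] sorted = sampledBlockSizes.clone();
            Arrays.sort(sorted);
            int idx = (int) Math.ceil(percentile * sorted.length) - 1;
            return sorted[Math.max(0, Math.min(idx, sorted.length - 1))];
          }

          public static void main(String[] args) {
            int[] sizes = {65570, 65600, 65642, 65588, 65710, 65605}; // made-up samples
            System.out.println(guessPaddedSize(sizes, 0.95));         // prints 65710
          }
        }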

        Lars Hofhansl added a comment -

        The memstore stores small, variable-sized KVs, so a slab is essential there.
        Not sure a slab is needed or even desired here, as we would already have fixed-size chunks of memory (well, after we do some simple padding). The padding is simple and low overhead.

        We could calculate the standard deviation of the KV sizes and add that to the HFile's metadata. Then the padding could be a multiple of the standard deviation, subject to some maximum (like 2% of the HFile's blocksize or something).
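
        A minimal sketch of that heuristic (hypothetical names; nothing below exists in HBase):

        // Hypothetical sketch of the heuristic above; not HBase code.
        public final class PaddingEstimator {

          /** Pad by a multiple of the KV-size standard deviation, capped at a fraction of the block size. */
          static int estimatePadding(long[] sampledKvSizes, int blockSize,
                                     double stdDevMultiple, double maxFractionOfBlock) {
            double mean = 0;
            for (long s : sampledKvSizes) {
              mean += s;
            }
            mean /= sampledKvSizes.length;

            double variance = 0;
            for (long s : sampledKvSizes) {
              variance += (s - mean) * (s - mean);
            }
            double stdDev = Math.sqrt(variance / sampledKvSizes.length);

            int cap = (int) (blockSize * maxFractionOfBlock);  // e.g. 0.02 => 2% of the block size
            return (int) Math.min(stdDevMultiple * stdDev, cap);
          }

          public static void main(String[] args) {
            long[] kvSizes = {90, 110, 95, 130, 85, 120};      // made-up sample of KV sizes
            System.out.println(estimatePadding(kvSizes, 64 * 1024, 2.0, 0.02));
          }
        }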

        For testing, I would generate data with KVs drawn from a simple size distribution and then measure the GC as we evict/replace blocks in the block cache.

        Vasu Mariyala, this is the idea I was talking about earlier today.

        Jean-Marc Spaggiari added a comment -

        The idea seems correct. Looking forward to seeing the results. I'm not sure we will get much improvement, but as Nick says, it's worth giving it a try.

        Liang Xie added a comment -

        It could probably beat the current implementation, but IMHO the off-heap solution (e.g. BucketCache with off-heap enabled) is still better than padding. Per one of our internal benchmarks, the off-heap block caching model could cut the 99th-percentile latency in half compared with the current on-heap block caching implementation.

        PS: I recall (not entirely sure) that HotSpot can internally resize some structures dynamically, such as PLABs, to accommodate different object sizes. Maybe some VM experts could give more of an explanation. Of course, I agree that a change in the application code would be better than depending on HotSpot.
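
        For reference, a hedged sketch of enabling an off-heap BucketCache via hbase-site.xml (exact property semantics vary between HBase versions, and the JVM's direct memory limit, e.g. -XX:MaxDirectMemorySize in hbase-env.sh, must be raised to cover the cache):

        <!-- Hedged example only; check the docs for your HBase version. -->
        <property>
          <name>hbase.bucketcache.ioengine</name>
          <value>offheap</value>
        </property>
        <property>
          <name>hbase.bucketcache.size</name>
          <value>4096</value> <!-- cache size, interpreted here as MB -->
        </property>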

        Anoop Sam John added a comment -

        What about the case when on-cache encoding is enabled? Can the HFile block sizes then vary much from block to block?

        Todd Lipcon added a comment -

        Interested to see the results here. When I tested block cache churn before, I didn't see heap fragmentation really crop up: http://blog.cloudera.com/blog/2011/02/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-2/

        For testing this improvement, it would be good to produce similar graphs of the CMS maximum chunk size metric from -XX:+PrintFLSStatistics output under some workload, and show that the improvement results in less fragmentation over time for at least some workload(s).
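
        For example (a hedged sketch for hbase-env.sh, assuming the CMS collector on a HotSpot JVM of that era; PrintFLSStatistics takes an integer value):

        export HBASE_OPTS="$HBASE_OPTS -XX:+UseConcMarkSweepGC -verbose:gc \
          -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:PrintFLSStatistics=1"

        The "Max Chunk Size" lines in the resulting GC log can then be graphed over a run to see whether the padding actually reduces old-generation fragmentation.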

        Matt Corgan added a comment -

        I don't know the code-level implementation details of any of the garbage collectors, but I imagine they do this to an extent already by dividing the heap into regions of different chunk sizes and placing blocks into slightly bigger slots than they need, effectively doing the padding by leaving empty space after each block. Maybe not for tiny objects, but possibly for bigger ones.

        I also worry it would be hard to pick a single size to round all the blocks to, because HBase allows a configurable block size and encoding per table. And even if all tables use the default block size and encoding, the encoding will result in different block sizes depending on the nature of the data in each table.

        It would be a good question for the Mechanical Sympathy mailing list.

        Lars Hofhansl added a comment -

        So I did some simple tests with just byte[]'s:

        1. allocated 10000 chunks of 64k ± 100 bytes
        2. allocated 10000 chunks of exactly 65636 (64k + 100) bytes
        3. allocated 10000 chunks of 64k ± 1000 bytes
        4. allocated 10000 chunks of exactly 66536 (64k + 1000) bytes

        Each run allocates and GCs 10M of those ~64k byte[]s (see the sketch below).

        With various GC settings, there was no discernible difference between the fixed- and variable-sized blocks.
        Maybe I should have done this testing before I filed this idea; going to close as "Invalid".
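
        (Lars's actual test code is not attached to this issue; a minimal sketch of that kind of byte[] churn microbenchmark, for anyone who wants to repeat it, might look like the following.)

        // Hypothetical reconstruction of the microbenchmark described above: keep
        // 10000 ~64k byte[]s live, replace them 10 million times, and compare GC
        // behaviour (e.g. with -verbose:gc) for fixed- vs. variable-sized arrays.
        import java.util.Random;

        public class BlockChurn {
          public static void main(String[] args) {
            final int live = 10000;
            final int replacements = 10000000;
            final int base = 64 * 1024;
            final int jitter = 100;                 // try 100 vs. 1000
            final boolean fixed = args.length > 0 && args[0].equals("fixed");

            Random rnd = new Random(42);
            byte[][] cache = new byte[live][];
            for (int i = 0; i < replacements; i++) {
              int size = fixed
                  ? base + jitter                                // every block exactly 64k + jitter
                  : base - jitter + rnd.nextInt(2 * jitter + 1); // 64k +/- jitter
              cache[rnd.nextInt(live)] = new byte[size];         // "evict" one block, allocate a new one
            }
            System.out.println(cache.length + " slots live at the end");
          }
        }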

        Andrew Purtell added a comment -

        Maybe I should have done this testing before I filed this idea, going to close as "Invalid".

        This was an interesting issue though.

        A negative result is just as interesting and informative as a positive one. In some cases, more.


          People

          • Assignee: Unassigned
          • Reporter: Lars Hofhansl
          • Votes: 0
          • Watchers: 9
