HBase
HBASE-5313

Restructure hfiles layout for better compression

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: io
    • Labels:
      None
    • Tags:
      Phoenix

      Description

      An HFile block contains a stream of key-values. Can we organize these KVs on disk in a better way so that we get much greater compression ratios?

      One option (thanks Prakash) is to store all the keys in the beginning of the block (let's call this the key-section) and then store all their corresponding values towards the end of the block. This will allow us to not even decompress the values when we are scanning and skipping over rows in the block.
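
      A minimal sketch of such a key-section/value-section block layout (illustrative only, not HBase code; the Kv holder and class/method names are made up for the sketch):

      import java.io.ByteArrayOutputStream;
      import java.io.DataOutputStream;
      import java.io.IOException;
      import java.util.List;

      // Illustrative sketch: write all keys (plus their value lengths) at the
      // front of the block, then all values at the end, so a scanner can walk
      // the key section without touching the value section.
      public class KeySectionBlockSketch {

        // Hypothetical minimal key-value holder for the sketch.
        public static final class Kv {
          final byte[] key;
          final byte[] value;
          public Kv(byte[] key, byte[] value) { this.key = key; this.value = value; }
        }

        // Layout: [kvCount] [keyLen key valueLen]* [value]*
        public static byte[] writeBlock(List<Kv> kvs) throws IOException {
          ByteArrayOutputStream keySection = new ByteArrayOutputStream();
          ByteArrayOutputStream valueSection = new ByteArrayOutputStream();
          DataOutputStream keyOut = new DataOutputStream(keySection);
          for (Kv kv : kvs) {
            keyOut.writeInt(kv.key.length);
            keyOut.write(kv.key);
            // Record the value length next to the key so value offsets can be
            // computed later without reading the value section.
            keyOut.writeInt(kv.value.length);
            valueSection.write(kv.value);
          }
          ByteArrayOutputStream block = new ByteArrayOutputStream();
          DataOutputStream out = new DataOutputStream(block);
          out.writeInt(kvs.size());
          out.flush();
          keySection.writeTo(block);   // key section first
          valueSection.writeTo(block); // value section last (could stay compressed)
          return block.toByteArray();
        }
      }

      With this layout a scanner decodes only the key section while skipping rows; the value section would only need to be decompressed when a value is actually returned.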

      Any other ideas?

        Activity

        Jean-Daniel Cryans added a comment -

        After some more investigation, I don't think it will be easy to do. Matt Corgan's HBASE-7162 relies on that code too. So it seems we have to make HFileBlockDefaultEncodingContext thread-safe.

        Mikhail Bautin added a comment -

        Jean-Daniel Cryans: I'm OK with reverting HBASE-5521 because it does not look like HBASE-5313 is moving forward.

        Jean-Daniel Cryans added a comment -

        He Yongqiang, dhruba borthakur, Mikhail Bautin

        Guys, I need your help understanding what's going on with this jira. HBASE-5521 was committed more than a year ago and nothing has moved since then. Moreover, that code breaks encoding by making it not thread-safe. See HBASE-8732.

        This makes me think that the code in 5521 was not seriously tested (maybe waiting on this jira to tie all the loose ends?) and since we are trying to release 0.96.0 soonish, I'm currently in favor of reverting it.

        He Yongqiang added a comment -

        Hi Kannan,

        We are still experimenting with this. The initial results show less than a one-quarter reduction, which is not big enough for us. The timestamp issue is low-hanging fruit, which can cut 8%.
        We will post a diff ASAP, once we finalize our experiments.

        Kannan Muthukkaruppan added a comment -

        Yongqiang: Any updates on this effort/investigation? I noticed HBASE-5674, which you created and which goes after a specific part (timestamps)... but I was curious where things stand with respect to this JIRA.

        Hudson added a comment -

        Integrated in HBase-TRUNK-security #143 (See https://builds.apache.org/job/HBase-TRUNK-security/143/)
        HBASE-5521 [jira] Move compression/decompression to an encoder specific encoding
        context

        Author: Yongqiang He

        Summary:
        https://issues.apache.org/jira/browse/HBASE-5521

        As part of working on HBASE-5313, we want to add a new columnar encoder/decoder.
        It makes sense to move compression to be part of encoder/decoder:
        1) a scanner for a columnar encoded block can do lazy decompression to a
        specific part of a key value object
        2) avoid an extra bytes copy from encoder to hblock-writer.

        If there is no encoder specified for a writer, the HBlock.Writer will use a
        default compression-context to do something very similar to today's code.

        Test Plan: existing unit tests verified by mbautin and tedyu. And no new test
        added here since this code is just a preparation for columnar encoder. Will add
        testcase later in that diff.

        Reviewers: dhruba, tedyu, sc, mbautin

        Reviewed By: mbautin

        Differential Revision: https://reviews.facebook.net/D2097 (Revision 1302602)

        Result = FAILURE
        mbautin :
        Files :

        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/BufferedDataBlockEncoder.java
        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/CopyKeyDataBlockEncoder.java
        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/DataBlockEncoder.java
        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/DataBlockEncoding.java
        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/DiffKeyDeltaEncoder.java
        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/EncodedDataBlock.java
        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/FastDiffDeltaEncoder.java
        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/HFileBlockDecodingContext.java
        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/HFileBlockDefaultDecodingContext.java
        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/HFileBlockDefaultEncodingContext.java
        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/HFileBlockEncodingContext.java
        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/PrefixKeyDeltaEncoder.java
        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/Compression.java
        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java
        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileDataBlockEncoder.java
        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileDataBlockEncoderImpl.java
        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileWriterV2.java
        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/NoOpDataBlockEncoder.java
        • /hbase/trunk/src/test/java/org/apache/hadoop/hbase/io/encoding/TestDataBlockEncoders.java
        • /hbase/trunk/src/test/java/org/apache/hadoop/hbase/io/hfile/TestHFileBlock.java
        • /hbase/trunk/src/test/java/org/apache/hadoop/hbase/io/hfile/TestHFileBlockCompatibility.java
        • /hbase/trunk/src/test/java/org/apache/hadoop/hbase/io/hfile/TestHFileDataBlockEncoder.java
        • /hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/DataBlockEncodingTool.java
        dhruba borthakur added a comment -

        I am guessing that, initially, we keep this new "columnar encoding" completely isolated inside an HFileBlock. At table creation time, one can specify that the table be stored in columnar-encoded fashion.

        An HFile will have information in the FixedFileTrailer that specifies whether the data inside the hfile is in columnar format. A single HFileBlock will have one section per "column entity": all the rowkeys laid out first, followed by all the column families, followed by all the column names, followed by the timestamps, followed by the memstoreTS values, followed by all the values.

        If 'prefix-encoding' is enabled, then each column entity will be prefix-encoded individually. If compression (LZO, gz, etc.) is enabled, the entire HFileBlock will be compressed accordingly.

        Prefix-encoding will work well for the rowkey entity and the column-family entity. The column name entity may need to be sorted and then prefix-encoded. The timestamp entity may need a special kind of encoding. One option (suggested by a co-worker, Chip Turner) is to take the first timestamp as the base, xor it with each of the following timestamps (thus zeroing out the common bits), and store the result. Another option is to find the minimum timestamp in the block and then store diffs from that minimum value. Yet another option is to make Jan-01-2012 the hbase-epoch and store the difference from that number.
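
        A minimal sketch of the second option (deltas from the block minimum); illustrative only, not HBase code, and the class/method names are made up:

        // Illustrative sketch: encode a block's timestamps as the block minimum
        // followed by per-cell deltas, which are small and compress well.
        public final class MinDeltaTimestampEncoding {
          private MinDeltaTimestampEncoding() {}

          // Returns {min, ts[0] - min, ts[1] - min, ...}.
          public static long[] encode(long[] timestamps) {
            long min = Long.MAX_VALUE;
            for (long ts : timestamps) {
              min = Math.min(min, ts);
            }
            long[] encoded = new long[timestamps.length + 1];
            encoded[0] = min;
            for (int i = 0; i < timestamps.length; i++) {
              encoded[i + 1] = timestamps[i] - min;
            }
            return encoded;
          }

          public static long[] decode(long[] encoded) {
            long min = encoded[0];
            long[] timestamps = new long[encoded.length - 1];
            for (int i = 0; i < timestamps.length; i++) {
              timestamps[i] = encoded[i + 1] + min;
            }
            return timestamps;
          }
        }

        The xor-with-first-timestamp and fixed-epoch options have the same shape: replace the subtraction with an xor against the first timestamp, or subtract the chosen epoch instead of the block minimum.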

        Matt Corgan added a comment -

        Just noticed this jira. I've been working on https://issues.apache.org/jira/browse/HBASE-4676. In this trie format all the values are concatenated at the end of the block. I haven't done anything with compressing them because they are generally small in my use cases, but it seems like it would eventually be a good option. I would think that the compression savings would be similar to the on-disk compression savings, but the benefit is that you have access to scan the keys while the data part of the block is still compressed.

        He Yongqiang added a comment -

        As part of working on HBASE-5313, we first tried to write an HFileWriter/HFileReader to do it. After finishing some work, it seems this requires a lot of code refactoring in order to reuse existing code as much as possible.

        Then we found that adding a new columnar encoder/decoder would be easier to do, and opened https://issues.apache.org/jira/browse/HBASE-5521 to do the encoder/decoder-specific compression work.

        He Yongqiang added a comment -

        As a first step, we will go ahead with a simple columnar layout implementation, and leave more advanced features (like a nested column layout) for a follow-up.

        He Yongqiang added a comment -

        >>However, those compression numbers are pretty nice. I worry a little bit about having now an hfileV3, so soon on the heels of the last, leading to a proliferation of versions. My other concern is that the columnar storage doesn't make sense for all cases - Dremel is for a specific use case.

        >>That being said, I would love to see the ability to do Dremel in HBase. How about along with a new version/columnar data support comes the ability to select storage files on a per-table basis? That would enable some tables to be optimized for certain use cases, other tables for others, rather than having to use completely different clusters (continuing the multi-tenancy story).

        @Jesse Yates, yeah, agreed. One big thing we need to answer is how to integrate with the current HFile implementation. We want to reuse code as much as possible. I guess a nested columnar structure like Dremel is what we finally want for HBase. But we first need to figure out a good story for how applications will use it.

        He Yongqiang added a comment -

        >>Can you also list the time it took to write the HFile for each of the three schemes?
        @Zhihong, we are still trying to explore more ideas here. Once we have a finalized plan, I will get the CPU/latency numbers.

        >>Yongqiang, which delta encoding algorithm did you use? The default algorithm only does a simple encoding. Do we have results using the prefix and fast-diff algorithms for the current HFile v2?

        @Jerry, I tried all three delta encodings, and Diff with HFileWriterV2 produced the smallest file in my test.

        dhruba borthakur added a comment -

        yq: can we get some numbers on how the compression looks if we just do columnar and delta compression (no LZO)? This will tell us if there is a benefit to storing data in columnar form in the cache.

        We still have to measure the overhead of a read/scan when data is stored in columnar fashion. It is very early to say whether this is 0.96 or something further out.

        Lars Hofhansl added a comment -

        I agree with Ted, this is 0.96 material.

        Jerry Chen added a comment -

        Yongqiang, which delta encoding algorithm did you use? The default algorithm only does a simple encoding. Do we have results using the prefix and fast-diff algorithms for the current HFile v2?

        I suppose this is only for the on-disk representation. How do we plan to represent it in the block cache?

        Ted Yu added a comment -

        There are only two weeks before we branch 0.94.
        I think HFile v3 would be in 0.96, containing this feature and HBASE-5347.

        Jesse Yates added a comment -

        However, those compression numbers are pretty nice. I worry a little bit about having now an hfileV3, so soon on the heels of the last, leading to a proliferation of versions. My other concern is that the columnar storage doesn't make sense for all cases - Dremel is for a specific use case.

        That being said, I would love to see the ability to do Dremel in HBase. How about along with a new version/columnar data support comes the ability to select storage files on a per-table basis? That would enable some tables to be optimized for certain use cases, other tables for others, rather than having to use completely different clusters (continuing the multi-tenancy story).

        dhruba borthakur added a comment -

        The same number of KVs in each file: a total of 3 million KVs for this experiment. The block size is 16 KB.

        Ted Yu added a comment -

        @Yongqiang:
        Thanks for sharing the results.
        Can you also list the time it took to write the HFile for each of the three schemes?

        If you can characterize the row keys and values, that would be nice too.

        stack added a comment -

        How do I read the above? Is it the same number of KVs in each of the files?

        He Yongqiang added a comment -

        @Todd, with such a small block size and the data already sorted, I was also thinking it would be very hard to optimize the space.

        So we did some experiments by modifying today's HFileWriter. It turns out we can still save a lot if we play more tricks.

        Here are the test results (block size is 16 KB):

        42 MB HFile, with delta compression and LZO compression (default settings on Apache trunk).

        30 MB HFile, columnar, with delta compression and LZO compression.

        Inside one block, first put all the row keys in that block and apply delta compression and then LZO compression. After the row keys, put all the column family data in that block and do delta+LZO for it. Then similarly put the column_qualifier, etc.

        24 MB HFile, columnar, with the value column and column_qualifier column sorted, and with LZO compression.

        Inside one block, first put all the row keys in that block and apply delta compression and then LZO compression. After the row keys, put all the column family data in that block and do delta+LZO for it. Then put the column_qualifier, sort it, and do delta+LZO. The TS column and Code column are processed the same as the column family. The value column is processed the same as the column_qualifier. So this is the same disk format as the 30 MB HFile, except that all data for 'column_qualifier' and 'value' is sorted separately.

        Out of the 24 MB file, 6 MB is used to store row keys, 7 MB to store column_qualifier, and 6 MB to store values.
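
        A rough sketch of this per-entity block layout (illustrative only, not the experimental code; Deflater stands in for the LZO codec, and the delta-encoding step is left as a comment):

        import java.io.ByteArrayOutputStream;
        import java.io.DataOutputStream;
        import java.io.IOException;
        import java.util.List;
        import java.util.zip.DeflaterOutputStream;

        // Illustrative sketch: inside one block, concatenate each entity
        // (row keys, families, qualifiers, timestamps, values) into its own
        // section and compress each section independently.
        public class ColumnarBlockSketch {

          private static byte[] compress(byte[] section) throws IOException {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (DeflaterOutputStream dos = new DeflaterOutputStream(bos)) {
              dos.write(section); // stand-in for the LZO codec used in the tests
            }
            return bos.toByteArray();
          }

          // Each inner list holds one entity's bytes for every cell in the block, in cell order.
          public static byte[] writeBlock(List<List<byte[]>> entities) throws IOException {
            ByteArrayOutputStream block = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(block);
            for (List<byte[]> entity : entities) {
              ByteArrayOutputStream section = new ByteArrayOutputStream();
              for (byte[] cellBytes : entity) {
                section.write(cellBytes); // per-entity delta encoding would go here
              }
              byte[] compressed = compress(section.toByteArray());
              out.writeInt(compressed.length); // record section length so a reader can split sections
              out.write(compressed);
            }
            return block.toByteArray();
          }
        }

        The 24 MB variant above additionally sorts the qualifier and value sections before compressing them, which trades the implicit cell ordering for better compression.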

        More ideas are welcome!

        Todd Lipcon added a comment -

        I'm curious what the expected compression gain would be. Has anyone tried "rearranging" an example of a production hfile block and recompressing to see the difference?

        My thinking is that typical LZ-based compression (eg snappy) uses a hashtable for common substring identification which is up to 16K entries or so. So I don't know that it would do a particularly better job with the common keys if they were all grouped at the front of the block - so long as the keyval pairs are less than a few hundred bytes apart, it should still find them OK.

        Of course the other gains (storing large values compressed in RAM for example) seem good.

        He Yongqiang added a comment -

        "I suppose we could use the value length from the key, then know we have nth key and by using the value length of all 1 to n-1 keys to find the value."
        Yes. The value length is stored in the key header. The key header is cheap and can always be decompressed without a big CPU cost.

        Prakash Khemani added a comment -

        The values can be kept compressed in memory. We can uncompress them on demand when writing out the key-values during RPC or compactions.

        The key has to have a pointer to the values. The pointer can be implicit and can be derived from the value lengths if all the values are stored in the same order as the keys.

        The value pointer has to be explicit if the values are stored in a different order than the keys. We might want to write out the values in a different order if we want to do per-column compression. While writing out the HFileBlock the following can be done: group all the values by their column identifier, independently compress and write out each group of values, then go back to the keys and update the value pointers.
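
        A minimal sketch of the implicit-pointer case (illustrative only; the names are made up): with values stored in key order, the offset of the nth value in the value section is the sum of the preceding value lengths.

        // Illustrative sketch: derive a value's position from the value lengths
        // recorded alongside the keys, so no explicit value pointer is stored.
        public final class ImplicitValuePointer {
          private ImplicitValuePointer() {}

          // valueLengths[i] is the value length recorded with the i-th key.
          public static int valueOffset(int[] valueLengths, int n) {
            int offset = 0;
            for (int i = 0; i < n; i++) {
              offset += valueLengths[i]; // skip over values 0 .. n-1
            }
            return offset;
          }

          // Copies the n-th value out of an (already decompressed) value section.
          public static byte[] readValue(byte[] valueSection, int[] valueLengths, int n) {
            int offset = valueOffset(valueLengths, n);
            byte[] value = new byte[valueLengths[n]];
            System.arraycopy(valueSection, offset, value, 0, value.length);
            return value;
          }
        }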


        Lars Hofhansl added a comment -

        Presumably storing the keys together might lend itself to better compression.
        Do we need to index the values then? In that case we'd use up more space. Or how would we find the value belonging to a key?
        I suppose we could use the value length from the key; then, knowing we have the nth key, we can use the value lengths of keys 1 to n-1 to find the value.
        Or store the lengths with the values and scan the keys and values in "parallel".

        Nicolas Spiegelberg added a comment -

        Storing all keys together would just help on CPU, correct? We wouldn't get any disk size savings or IO savings with the current approach.

        He Yongqiang added a comment -

        As discussed earlier, one thing we can try is to use something like Hive's RCFile. The difference from Hive is that an HBase row's value is not a single type. If it turns out the columnar file format helps, we can employ a nested columnar format for the value (like what Dremel does). There is a thread on Quora about Dremel: http://www.quora.com/How-will-Googles-Dremel-change-future-Hadoop-releases.

        dhruba borthakur added a comment -

        One option listed above is to keep all the keys at the beginning of the block and all the values at the end of the block. The keys will still be delta-encoded. The values can be LZO-compressed.

        Any other ideas out there?


          People

          • Assignee:
            dhruba borthakur
          • Reporter:
            dhruba borthakur
          • Votes:
            0
          • Watchers:
            30
