HBase / HBASE-3474

HFileOutputFormat to use column family's compression algorithm

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.92.0
    • Fix Version/s: 0.92.0
    • Component/s: mapreduce
    • Labels: None
    • Environment: All

    • Hadoop Flags: Reviewed
    • Release Note:
      HFileOutputFormat now uses each column family's compression algorithm instead of a blanket setting applied to all column families

      Description

      HFileOutputFormat currently creates HFile writers using the compression algorithm set in the configuration property "hfile.compression", defaulting to no compression. The code does not take into account the compression algorithm configured for each of the table's column families. As a result, bulk-loaded tables are not compressed until a major compaction is run on them. This could be fixed by using the column family descriptors when creating the HFile writers.
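      In rough terms, the fix consults each HColumnDescriptor when a writer is opened. A minimal sketch of that lookup, with illustrative class and method names rather than the committed patch:

```java
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.hfile.Compression;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch only: derive a per-family compression map from the table's column
// descriptors. HFileOutputFormat would consult this map when it opens a
// writer for a family, instead of applying one table-wide setting.
public class FamilyCompressionSketch {
  static Map<byte[], Compression.Algorithm> familyCompressionMap(HTable table)
      throws IOException {
    Map<byte[], Compression.Algorithm> map =
        new TreeMap<byte[], Compression.Algorithm>(Bytes.BYTES_COMPARATOR);
    for (HColumnDescriptor family : table.getTableDescriptor().getFamilies()) {
      map.put(family.getName(), family.getCompression());
    }
    return map; // writer creation then looks up map.get(family) per family
  }
}
```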

      1. hbase-3474.txt
        16 kB
        Ashish Shinde
      2. hbase-3474.txt
        17 kB
        Todd Lipcon
      3. patch3474.txt
        16 kB
        Ashish Shinde
      4. patch3474.txt
        4 kB
        Ashish Shinde
      5. patch3474.txt
        4 kB
        Ashish Shinde

        Activity

        Ashish Shinde created issue -
        Ashish Shinde added a comment -

        HFileOutputFormat now uses the HColumnDescriptors of the column families in the HTable and creates writers with the corresponding compression algorithms.

        Ashish Shinde made changes -
        Field Original Value New Value
        Status Open [ 1 ] Patch Available [ 10002 ]
        Release Note HFileOutputFormat to use column family's compression algorithm instead of a blanket all column family wide
        Affects Version/s 0.92.0 [ 12314223 ]
        Affects Version/s 0.90.0 [ 12313607 ]
        Fix Version/s 0.92.0 [ 12314223 ]
        Ashish Shinde added a comment -

        Patch that uses the column descriptors for setting compression.

        Ashish Shinde made changes -
        Attachment patch3474.txt [ 12469541 ]
        Ashish Shinde added a comment -

        Removed stray prints from the patch

        Ashish Shinde made changes -
        Attachment patch3474.txt [ 12469544 ]
        Ashish Shinde made changes -
        Attachment patch3474.txt [ 12469541 ]
        Todd Lipcon added a comment -

        Thanks for taking this on, Ashish. A few comments on the patch:

        • for createFamilyCompressionMap, specify in the javadoc that this is run inside the task to deserialize the map back out of the Configuration
        • In javadoc for configureCompression, I don't think there's any reason to include the @param table and @param conf since they're kind of obvious
        • in createFamilyCompressionMap, 'compression' is misspelled in one var name
        • The encoding you've chosen for this config value isn't the best, since we currently do support ',' in a column family name. What about encoding the map like a query string in a URL? foo=bar&baz=blah - and using URLEncoder/URLDecoder to escape the =s and &s that might be in there.
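        A minimal, self-contained sketch of the query-string style encoding Todd suggests here (class and method names are made up for illustration; the committed patch may differ):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Serialize a family -> compression-name map into a single Configuration
// value as "family1=algo1&family2=algo2", URL-escaping '=' and '&' so that
// awkward characters in family names survive the round trip.
public class FamilyCompressionCodec {

  static String serialize(Map<String, String> familyToCompression)
      throws UnsupportedEncodingException {
    StringBuilder sb = new StringBuilder();
    for (Map.Entry<String, String> e : familyToCompression.entrySet()) {
      if (sb.length() > 0) {
        sb.append('&');
      }
      sb.append(URLEncoder.encode(e.getKey(), "UTF-8"))
        .append('=')
        .append(URLEncoder.encode(e.getValue(), "UTF-8"));
    }
    return sb.toString();
  }

  static Map<String, String> deserialize(String encoded)
      throws UnsupportedEncodingException {
    Map<String, String> map = new LinkedHashMap<String, String>();
    if (encoded == null || encoded.isEmpty()) {
      return map;
    }
    for (String pair : encoded.split("&")) {
      int idx = pair.indexOf('=');
      map.put(URLDecoder.decode(pair.substring(0, idx), "UTF-8"),
              URLDecoder.decode(pair.substring(idx + 1), "UTF-8"));
    }
    return map;
  }
}
```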
        Ashish Shinde added a comment -

        Incorporated Todd's comments on javadoc and use of URL encoder for serializing family compression map.

        Ashish Shinde made changes -
        Attachment patch3474.txt [ 12469636 ]
        Todd Lipcon added a comment -

        Hey Ashish. This is looking good. A few more comments before I think it's ready to commit:

        • If you make configureCompression and createFamilyCompressionMap package-private instead of private, you could add a nice unit test for them. You can use Mockito to make a mock HTable that returns the HColumnDescriptors you want for the sake of the test - if you need a hand with this you can catch me on #hbase IRC
        • Probably a good idea to put "families.compression" in a constant, and maybe change it to be more obviously scoped - ie something like "hbase.hfileoutputformat.families.compression"
        • When you upload a patch, you should take the diff from the root of the SVN trunk/ - so that it applies with "patch -p0" from inside trunk/.

        Thanks!
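        A rough Mockito-based sketch of the kind of cluster-free test Todd describes above. The helper signatures (configureCompression(HTable, Configuration), createFamilyCompressionMap(Configuration)) and the shape of the decoded map are assumptions for illustration, not the committed code:

```java
package org.apache.hadoop.hbase.mapreduce;

import static org.junit.Assert.assertEquals;
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.hfile.Compression;
import org.apache.hadoop.hbase.util.Bytes;
import org.junit.Test;

public class TestFamilyCompressionConfig {

  @Test
  public void testCompressionRoundTrip() throws Exception {
    // Build a table descriptor with two families using different
    // compression settings; no running cluster is needed.
    HTableDescriptor desc = new HTableDescriptor("mock-table");
    HColumnDescriptor gzFamily = new HColumnDescriptor("cf-gz");
    gzFamily.setCompressionType(Compression.Algorithm.GZ);
    desc.addFamily(gzFamily);
    desc.addFamily(new HColumnDescriptor("cf-none"));

    // Mock an HTable that hands back the descriptor above.
    HTable table = mock(HTable.class);
    when(table.getTableDescriptor()).thenReturn(desc);

    // Round-trip through the Configuration via the package-private
    // helpers discussed in the review (signatures assumed).
    Configuration conf = new Configuration();
    HFileOutputFormat.configureCompression(table, conf);
    Map<byte[], Compression.Algorithm> decoded =
        HFileOutputFormat.createFamilyCompressionMap(conf);

    // Assumes the decoded map compares byte[] keys by content
    // (e.g. a TreeMap built with Bytes.BYTES_COMPARATOR).
    assertEquals(Compression.Algorithm.GZ, decoded.get(Bytes.toBytes("cf-gz")));
    assertEquals(Compression.Algorithm.NONE, decoded.get(Bytes.toBytes("cf-none")));
  }
}
```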

        Ashish Shinde added a comment -

        Will use a constant for the config key. The test will take me some time because I am not familiar with mockito. Will put out a patch as soon as I get my head around creating the unit test.

        Todd Lipcon added a comment -

        Let me know if you need a hand with the test, happy to help. Mockito is strange at first but once you get the hang of it, it's pretty nice

        Ashish Shinde added a comment -

        Added unit tests for
        1. compression map serialization and deserialization to the job config
        2. verifying that the HFile writers use the config correctly.

        Minor issues
        -------------

        1. The test parses out HFile.Reader.toString() to retrieve the compression algorithm used on an HFile. One alternative is to add getCompressionAlgo() to HFile.Reader.
        2. The testColumnFamilyCompression test creates a mapred job just to get hold of a usable writer. Not sure if this is the best thing to do.
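        The accessor proposed in point 1 would amount to something like the following inside HFile.Reader (the field name is an assumption; Todd adds a getCompressionAlgorithm method along these lines in a later comment):

```java
// Sketch, inside HFile.Reader: expose the reader's compression algorithm so
// tests can assert on it directly instead of parsing toString().
// 'compressAlgo' is assumed to be the field the reader already keeps.
public Compression.Algorithm getCompressionAlgorithm() {
  return this.compressAlgo;
}
```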

        Ashish Shinde made changes -
        Attachment patch3474.txt [ 12473973 ]
        Todd Lipcon added a comment -

        Hi Ashish. I took your patch and cleaned up the unit tests a bit to share some of the mocking code. I also added a getCompressionAlgorithm call to HFile.Reader as you suggested.

        I removed one of the unit tests since it seemed to be entirely subsumed by the other one. Also updated the unit tests to check cases like 0-CF and 1-CF.

        Can you check over this updated patch and make sure it looks good to you?

        Todd Lipcon made changes -
        Attachment hbase-3474.txt [ 12474224 ]
        Ashish Shinde added a comment -

        Hi Todd, thanks for refactoring the tests. They look better now. Will try to be more careful in looking for code reuse opportunities. I just removed some unused imports from the test file; my IDE marks unused imports as warnings.

        Ashish Shinde made changes -
        Attachment hbase-3474.txt [ 12474271 ]
        stack added a comment -

        @Ashish So, are you +1 on committing your patch (w/ Todd cleanup)?

        Ashish Shinde added a comment -

        +1 on committing the patch with Todd's cleanup.

        stack added a comment -

        Committed Todd's fixup of Ashish's patch. Thank you for the patch, Ashish.

        stack made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Hadoop Flags [Reviewed]
        Resolution Fixed [ 1 ]
        Hudson added a comment -

        Integrated in HBase-TRUNK #1814 (See https://hudson.apache.org/hudson/job/HBase-TRUNK/1814/)


          People

          • Assignee:
            Unassigned
          • Reporter:
            Ashish Shinde
          • Votes:
            0
          • Watchers:
            1

            Dates

            • Created:
              Updated:
              Resolved:

              Time Tracking

              Estimated: 48h
              Remaining: 48h
              Logged: Not Specified
