Hive > HIVE-2097 Explore mechanisms for better compression with RC Files > HIVE-2604

Add UberCompressor Serde/Codec to contrib which allows per-column compression strategies

    Details

    • Type: Sub-task
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Contrib
    • Labels: None

      Description

      The strategies supported are:
      1. using a specified codec on the column
      2. using a specific codec on the column which is serialized via a specific serde
      3. using a specific "TypeSpecificCompressor" instance
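The third strategy hinges on the TypeSpecificCompressor extension point. As a rough sketch of the idea (the interface and class names below are hypothetical, not the ones defined in the patch), a type-specific compressor for an integer column might look like:

```java
import java.nio.ByteBuffer;

/** Hypothetical shape of a type-specific column compressor (illustrative only;
 *  the real extension point is TypeSpecificCompressor.java in the patch). */
interface IntColumnCompressor {
    byte[] compress(int[] column);
    int[] decompress(byte[] data);
}

/** Delta-encodes an int column: slowly-changing columns yield many small
 *  deltas, which a downstream codec can then shrink effectively. */
class DeltaIntCompressor implements IntColumnCompressor {
    public byte[] compress(int[] column) {
        ByteBuffer buf = ByteBuffer.allocate(column.length * 4);
        int prev = 0;
        for (int v : column) {
            buf.putInt(v - prev);   // store difference from previous value
            prev = v;
        }
        return buf.array();
    }

    public int[] decompress(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data);
        int[] out = new int[data.length / 4];
        int prev = 0;
        for (int i = 0; i < out.length; i++) {
            prev += buf.getInt();   // accumulate deltas back into values
            out[i] = prev;
        }
        return out;
    }
}
```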

      Attachments

      1. ASF.LICENSE.NOT.GRANTED--HIVE-2604.D1011.1.patch
        61 kB
        Phabricator
      2. ASF.LICENSE.NOT.GRANTED--HIVE-2604.D1011.2.patch
        63 kB
        Phabricator
      3. HIVE-2604.v0.patch
        62 kB
        Krishna Kumar
      4. HIVE-2604.v1.patch
        62 kB
        Krishna Kumar
      5. HIVE-2604.v2.patch
        63 kB
        Krishna Kumar

        Activity

        Uma Maheswara Rao G added a comment -

        Hi Yongqiang, any reason for holding this off from commit?

        He Yongqiang added a comment -

        +1, will commit after tests pass

        Phabricator added a comment -

        krishnakumar updated the revision "HIVE-2604 [jira] Add UberCompressor Serde/Codec to contrib which allows per-column compression strategies".
        Reviewers: JIRA, heyongqiang

        Addressing review comments

        REVISION DETAIL
        https://reviews.facebook.net/D1011

        AFFECTED FILES
        contrib/src/java/org/apache/hadoop/hive/contrib/ubercompressor/InputReader.java
        contrib/src/java/org/apache/hadoop/hive/contrib/ubercompressor/OutputWriter.java
        contrib/src/java/org/apache/hadoop/hive/contrib/ubercompressor/TypeSpecificCompressor.java
        contrib/src/java/org/apache/hadoop/hive/contrib/ubercompressor/UberCompressionCodec.java
        contrib/src/java/org/apache/hadoop/hive/contrib/ubercompressor/UberCompressionInputStream.java
        contrib/src/java/org/apache/hadoop/hive/contrib/ubercompressor/UberCompressionOutputStream.java
        contrib/src/java/org/apache/hadoop/hive/contrib/ubercompressor/UberCompressorColumnConfig.java
        contrib/src/java/org/apache/hadoop/hive/contrib/ubercompressor/UberCompressorConfig.java
        contrib/src/java/org/apache/hadoop/hive/contrib/ubercompressor/UberCompressorSerde.java
        contrib/src/java/org/apache/hadoop/hive/contrib/ubercompressor/UberCompressorSerdeField.java
        contrib/src/java/org/apache/hadoop/hive/contrib/ubercompressor/UberCompressorUtils.java
        contrib/src/java/org/apache/hadoop/hive/contrib/ubercompressor/compressors/DummyIntegerCompressor.java
        contrib/src/java/org/apache/hadoop/hive/contrib/ubercompressor/dsalg/Tuple.java
        contrib/src/test/queries/clientpositive/ubercompressor.q
        contrib/src/test/results/clientpositive/ubercompressor.q.out

        Phabricator added a comment -

        krishnakumar has commented on the revision "HIVE-2604 [jira] Add UberCompressor Serde/Codec to contrib which allows per-column compression strategies".

        INLINE COMMENTS
        contrib/src/java/org/apache/hadoop/hive/contrib/ubercompressor/UberCompressionCodec.java:33 This, itself, is an implementation of the CompressionCodec interface. The only important parts of the class are the createInputStream/createOutputStream methods. The dummyCompressor is needed for conforming to the interface.
        contrib/src/java/org/apache/hadoop/hive/contrib/ubercompressor/UberCompressionInputStream.java:70 Will add comments.

        The method is called readFromCompressor as it is reading from the inputreader created off a type-specific compressor. I can rename it to readFromInputReader?

        If you mean the copying annotated by the FIXME, yes, it can be avoided by having an outputstream on an existing buffer. Did not find a readymade class for that, I will create one.
        contrib/src/java/org/apache/hadoop/hive/contrib/ubercompressor/UberCompressionInputStream.java:101 This is the second case (in the jira description) where the user specifies a custom serde+codec to be used for compressing a specific column. So we need to deserialize and reserialize here.
        contrib/src/java/org/apache/hadoop/hive/contrib/ubercompressor/UberCompressorUtils.java:38 I needed a simple read/write on outputstream. WritableUtils implements a more complicated mechanism which prefers smaller values.
        contrib/src/java/org/apache/hadoop/hive/contrib/ubercompressor/dsalg/Tuple.java:1 data structures and algorithms!
        contrib/src/test/queries/clientpositive/ubercompressor.q:4 The configs are modelled on existing config for compression, so I guess that means that all output tables will be compressed using the same config?

        The codec and its child classes do not have access to table/partition, right? How would we populate the metastore from codec implementation classes?
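The trade-off under discussion here is fixed-width versus variable-length integer encoding: WritableUtils.writeVInt emits one to five bytes depending on the value's magnitude, favoring small values, whereas a plain fixed-width write always emits four. A minimal sketch of the latter (illustrative only; not the actual UberCompressorUtils code):

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

/** Minimal fixed-width big-endian int I/O on plain streams (illustrative). */
final class SimpleIntIO {
    static void writeInt(OutputStream out, int v) throws IOException {
        // Always four bytes, regardless of magnitude.
        out.write(v >>> 24);
        out.write(v >>> 16);
        out.write(v >>> 8);
        out.write(v);
    }

    static int readInt(InputStream in) throws IOException {
        int b0 = in.read(), b1 = in.read(), b2 = in.read(), b3 = in.read();
        if ((b0 | b1 | b2 | b3) < 0) {
            throw new EOFException("stream ended mid-int");
        }
        return (b0 << 24) | (b1 << 16) | (b2 << 8) | b3;
    }
}
```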

        REVISION DETAIL
        https://reviews.facebook.net/D1011

        Phabricator added a comment -

        heyongqiang has commented on the revision "HIVE-2604 [jira] Add UberCompressor Serde/Codec to contrib which allows per-column compression strategies".

        INLINE COMMENTS
        contrib/src/test/queries/clientpositive/ubercompressor.q:4 setting a bunch of compression config here is fine for single insert. But how about multi-insert queries?

        Can u put these configs to table/partition object? And that will make things easy to debug. (if u want to do in a followup, please open a follow up jira.)

        contrib/src/java/org/apache/hadoop/hive/contrib/ubercompressor/dsalg/Tuple.java:1 what is the package name "dsalg"
        contrib/src/java/org/apache/hadoop/hive/contrib/ubercompressor/UberCompressorUtils.java:38 just curious, can WritableUtils be used here?
        contrib/src/java/org/apache/hadoop/hive/contrib/ubercompressor/UberCompressionCodec.java:33 How is this class used? Can it be defined as an interface? DummyCompressor inside it is not doing anything.
        contrib/src/java/org/apache/hadoop/hive/contrib/ubercompressor/UberCompressionInputStream.java:70 can u add more comments here? If i understand correctly, it is doing read and decompression here. But there is readFromCompressor. Should it be readFromDecompressor()?

        And there is some bytes transfer and copied involved here. Can that be avoided?
        contrib/src/java/org/apache/hadoop/hive/contrib/ubercompressor/UberCompressionInputStream.java:101 why is the serde involved here? It is deserializing and serializing again here...

        REVISION DETAIL
        https://reviews.facebook.net/D1011

        He Yongqiang added a comment -

        looking.

        Phabricator added a comment -

        krishnakumar requested code review of "HIVE-2604 [jira] Add UberCompressor Serde/Codec to contrib which allows per-column compression strategies".
        Reviewers: JIRA

        added serde/codec which provide for type-specific compression mechanisms

        The strategies supported are
        1. using a specified codec on the column
        2. using a specific codec on the column which is serialized via a specific serde
        3. using a specific "TypeSpecificCompressor" instance

        TEST PLAN
        EMPTY

        REVISION DETAIL
        https://reviews.facebook.net/D1011

        AFFECTED FILES
        contrib/src/java/org/apache/hadoop/hive/contrib/ubercompressor/InputReader.java
        contrib/src/java/org/apache/hadoop/hive/contrib/ubercompressor/OutputWriter.java
        contrib/src/java/org/apache/hadoop/hive/contrib/ubercompressor/TypeSpecificCompressor.java
        contrib/src/java/org/apache/hadoop/hive/contrib/ubercompressor/UberCompressionCodec.java
        contrib/src/java/org/apache/hadoop/hive/contrib/ubercompressor/UberCompressionInputStream.java
        contrib/src/java/org/apache/hadoop/hive/contrib/ubercompressor/UberCompressionOutputStream.java
        contrib/src/java/org/apache/hadoop/hive/contrib/ubercompressor/UberCompressorColumnConfig.java
        contrib/src/java/org/apache/hadoop/hive/contrib/ubercompressor/UberCompressorConfig.java
        contrib/src/java/org/apache/hadoop/hive/contrib/ubercompressor/UberCompressorSerde.java
        contrib/src/java/org/apache/hadoop/hive/contrib/ubercompressor/UberCompressorSerdeField.java
        contrib/src/java/org/apache/hadoop/hive/contrib/ubercompressor/UberCompressorUtils.java
        contrib/src/java/org/apache/hadoop/hive/contrib/ubercompressor/compressors/DummyIntegerCompressor.java
        contrib/src/java/org/apache/hadoop/hive/contrib/ubercompressor/dsalg/Tuple.java
        contrib/src/test/queries/clientpositive/ubercompressor.q
        contrib/src/test/results/clientpositive/ubercompressor.q.out

        MANAGE HERALD DIFFERENTIAL RULES
        https://reviews.facebook.net/herald/view/differential/

        WHY DID I GET THIS EMAIL?
        https://reviews.facebook.net/herald/transcript/2121/

        Tip: use the X-Herald-Rules header to filter Herald messages in your client.

        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/3075/
        -----------------------------------------------------------

        (Updated 2011-12-17 10:41:45.367761)

        Review request for hive and Yongqiang He.

        Changes
        -------

        Closed the two gaps - support for arbitrary types, and stats

        Summary
        -------

        Add UberCompressor Serde/Codec to contrib which allows per-column compression strategies

        • gaps:
          • supports only certain complex types
          • stats

        This addresses bug HIVE-2604.
        https://issues.apache.org/jira/browse/HIVE-2604

        Diffs (updated)


        contrib/src/java/org/apache/hadoop/hive/contrib/ubercompressor/InputReader.java PRE-CREATION
        contrib/src/java/org/apache/hadoop/hive/contrib/ubercompressor/OutputWriter.java PRE-CREATION
        contrib/src/java/org/apache/hadoop/hive/contrib/ubercompressor/TypeSpecificCompressor.java PRE-CREATION
        contrib/src/java/org/apache/hadoop/hive/contrib/ubercompressor/UberCompressionCodec.java PRE-CREATION
        contrib/src/java/org/apache/hadoop/hive/contrib/ubercompressor/UberCompressionInputStream.java PRE-CREATION
        contrib/src/java/org/apache/hadoop/hive/contrib/ubercompressor/UberCompressionOutputStream.java PRE-CREATION
        contrib/src/java/org/apache/hadoop/hive/contrib/ubercompressor/UberCompressorColumnConfig.java PRE-CREATION
        contrib/src/java/org/apache/hadoop/hive/contrib/ubercompressor/UberCompressorConfig.java PRE-CREATION
        contrib/src/java/org/apache/hadoop/hive/contrib/ubercompressor/UberCompressorSerde.java PRE-CREATION
        contrib/src/java/org/apache/hadoop/hive/contrib/ubercompressor/UberCompressorSerdeField.java PRE-CREATION
        contrib/src/java/org/apache/hadoop/hive/contrib/ubercompressor/UberCompressorUtils.java PRE-CREATION
        contrib/src/java/org/apache/hadoop/hive/contrib/ubercompressor/compressors/DummyIntegerCompressor.java PRE-CREATION
        contrib/src/java/org/apache/hadoop/hive/contrib/ubercompressor/dsalg/Tuple.java PRE-CREATION
        contrib/src/test/queries/clientpositive/ubercompressor.q PRE-CREATION
        contrib/src/test/results/clientpositive/ubercompressor.q.out PRE-CREATION

        Diff: https://reviews.apache.org/r/3075/diff

        Testing
        -------

        test added

        Thanks,

        Krishna

        Krishna Kumar added a comment -

        I used the word Uber, not in the sense of 'super' but, as [Wikipedia def] 'Über also translates to over, above, meta, but mainly in compound words.'; that is, highlighting the fact that this is not a 'real' compressor but a wrapper on other existing compressors/codecs. Anyway, no hangups, can bulk rename if necessary.

        Edward Capriolo added a comment -

        I think this is a +1 idea, but I am -1 on the name: 'Uber' has to go. The name should describe what the class does, i.e. ColumnarCompressor or PerColumnCompressor. The problem is everything in hive is Uber cool anyway so every class would have to be named as such.

        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/3075/
        -----------------------------------------------------------

        Review request for hive.

        Summary
        -------

        Add UberCompressor Serde/Codec to contrib which allows per-column compression strategies

        • gaps:
          • supports only certain complex types
          • stats

        This addresses bug HIVE-2604.
        https://issues.apache.org/jira/browse/HIVE-2604

        Diffs


        contrib/src/java/org/apache/hadoop/hive/contrib/ubercompressor/compressors/DummyIntegerCompressor.java PRE-CREATION
        contrib/src/java/org/apache/hadoop/hive/contrib/ubercompressor/dsalg/Tuple.java PRE-CREATION
        contrib/src/test/queries/clientpositive/ubercompressor.q PRE-CREATION
        contrib/src/test/results/clientpositive/ubercompressor.q.out PRE-CREATION
        contrib/src/java/org/apache/hadoop/hive/contrib/ubercompressor/UberCompressorUtils.java PRE-CREATION
        contrib/src/java/org/apache/hadoop/hive/contrib/ubercompressor/UberCompressorColumnConfig.java PRE-CREATION
        contrib/src/java/org/apache/hadoop/hive/contrib/ubercompressor/UberCompressorConfig.java PRE-CREATION
        contrib/src/java/org/apache/hadoop/hive/contrib/ubercompressor/UberCompressorSerde.java PRE-CREATION
        contrib/src/java/org/apache/hadoop/hive/contrib/ubercompressor/UberCompressionOutputStream.java PRE-CREATION
        contrib/src/java/org/apache/hadoop/hive/contrib/ubercompressor/UberCompressionInputStream.java PRE-CREATION
        contrib/src/java/org/apache/hadoop/hive/contrib/ubercompressor/InputReader.java PRE-CREATION
        contrib/src/java/org/apache/hadoop/hive/contrib/ubercompressor/OutputWriter.java PRE-CREATION
        contrib/src/java/org/apache/hadoop/hive/contrib/ubercompressor/TypeSpecificCompressor.java PRE-CREATION
        contrib/src/java/org/apache/hadoop/hive/contrib/ubercompressor/UberCompressionCodec.java PRE-CREATION

        Diff: https://reviews.apache.org/r/3075/diff

        Testing
        -------

        test added

        Thanks,

        Krishna

        Krishna Kumar added a comment -

        Serde wraps lazysimpleserde to make it more similar to columnarserde + tests added

        Krishna Kumar added a comment -

        The current implementation works as follows:

        • Adds a serde, UberCompressorSerde, which is used to convert the cell values to bytes
        • Adds a codec, UberCompressionCodec, which uses user-specified config to compress each block of column values through one of three possible mechanisms:
          • Config for the column: "codec:<codecname>" - apply a CompressionCodec on the UberCompressorSerde-serialized bytestream
          • Config for the column: "codec:<codecname>,<serdename>" - re-serialize the bytestream through serdename and then apply codecname on it
          • Config for the column: "compressor:<compressorname>" - compress the cell values by sending them through the type-specific compressor
        • [As a future enhancement, the config, say if it is "dynamic", can let the codec decide the mechanism based on the current block stats/previously seen blocks]

        The idea is to maintain the ability to use a serde/codec combination (as we do now) for any columns which are not 'interesting' and use type-specific compressors only for special columns.

        Type-specific compressor is also an extension point only; no implementation attached to this jira. Have attached one sample compressor to HIVE-2623, while many others are possible.
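The three per-column config forms described above can be sketched as a small parser/dispatcher. This is illustrative only: the ColumnStrategy class and its field names are hypothetical, not the patch's UberCompressorColumnConfig.

```java
/** Illustrative parser for the per-column config strings ("codec:<codecname>",
 *  "codec:<codecname>,<serdename>", "compressor:<compressorname>"). */
final class ColumnStrategy {
    enum Kind { CODEC, CODEC_WITH_SERDE, COMPRESSOR }

    final Kind kind;
    final String codecName, serdeName, compressorName;

    private ColumnStrategy(Kind kind, String codec, String serde, String compressor) {
        this.kind = kind;
        this.codecName = codec;
        this.serdeName = serde;
        this.compressorName = compressor;
    }

    static ColumnStrategy parse(String config) {
        if (config.startsWith("codec:")) {
            String rest = config.substring("codec:".length());
            int comma = rest.indexOf(',');
            if (comma < 0) {
                // "codec:<codecname>": codec applied to the serialized bytestream
                return new ColumnStrategy(Kind.CODEC, rest, null, null);
            }
            // "codec:<codecname>,<serdename>": re-serialize, then compress
            return new ColumnStrategy(Kind.CODEC_WITH_SERDE,
                rest.substring(0, comma), rest.substring(comma + 1), null);
        }
        if (config.startsWith("compressor:")) {
            // "compressor:<compressorname>": type-specific compressor
            return new ColumnStrategy(Kind.COMPRESSOR, null, null,
                config.substring("compressor:".length()));
        }
        throw new IllegalArgumentException("unknown column config: " + config);
    }
}
```

Columns with no special config would simply fall through to the ordinary serde/codec path, matching the stated goal of reserving type-specific compressors for the 'interesting' columns.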

        Krishna Kumar added a comment -

        Sure. I'll add some of the compressors in a day or two.

        He Yongqiang added a comment -

        Can u give some examples of such compressors? so we can also try that.

        Krishna Kumar added a comment -

        initial version of the patch.


          People

          • Assignee: Krishna Kumar
          • Reporter: Krishna Kumar
          • Votes: 0
          • Watchers: 4

            Dates

            • Created:
            • Updated:
