  2. HIVE-2917

Add support for various charsets in LazySimpleSerDe

    Details

    • Type: New Feature
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.9.0
    • Fix Version/s: None
    • Labels:
      None

      Description

      Currently, Hive can only serialize/deserialize data encoded in UTF-8.

      It would be useful to specify the data's charset when creating the table.

      The idea is to add a new keyword, CHARSET, to set the charset at the table level.
      For example:
      CREATE TABLE tbl1 (col1 STRING) ROW FORMAT CHARSET "GBK" DELIMITED FIELDS TERMINATED BY '\t';

      Another place to use CHARSET is in the TRANSFORM clause.
      For example:
      SELECT TRANSFORM(col1, col2) ROW FORMAT CHARSET 'gbk'
      USING 'some_script'
      AS (col3, col4) ROW FORMAT CHARSET 'utf-8';
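
What a table-level charset amounts to on the read path can be sketched in plain Java: the raw field bytes are decoded with the declared charset instead of being assumed to be UTF-8, and are re-encoded as UTF-8 for internal use. This is a minimal illustration only, not code from the attached patches; the class name CharsetSketch and its two methods are hypothetical, and it assumes the JVM ships the GBK charset (standard JDK builds do).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of the HIVE-2917 patch.
final class CharsetSketch {

    // Decode raw field bytes using the charset declared for the table.
    // Note: new String(byte[], Charset) replaces malformed input with U+FFFD
    // rather than throwing an exception.
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    // Re-encode field bytes from the declared charset to UTF-8.
    static byte[] toUtf8(byte[] raw, String charsetName) {
        return decode(raw, charsetName).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The two characters U+4E2D U+6587 encoded in GBK (two bytes each).
        byte[] gbk = {(byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4};
        byte[] utf8 = toUtf8(gbk, "GBK");
        // Each of the two characters takes three bytes in UTF-8.
        System.out.println(utf8.length + " UTF-8 bytes"); // prints "6 UTF-8 bytes"
    }
}
```

The TRANSFORM example above would apply the same conversion in both directions: rows are encoded as 'gbk' before being passed to the script, and the script's output is decoded as 'utf-8'.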

      1. HIVE-2917.3.patch.txt
        72 kB
        Kai Zhang
      2. HIVE-2917.2.patch.txt
        72 kB
        Kai Zhang
      3. HIVE-2917.1.patch.txt
        72 kB
        Kai Zhang
      4. ASF.LICENSE.NOT.GRANTED--HIVE-2917.D2619.1.patch
        73 kB
        Phabricator

        Activity

        Kai Zhang added a comment -

        Fixed a mistake in the last patch

        Phabricator added a comment -

        flyinggarden requested code review of "HIVE-2917 [jira] Add support for various charsets in LazySimpleSerDe".
        Reviewers: JIRA

        https://issues.apache.org/jira/browse/HIVE-2917

        HIVE-2917: Add support for various charsets in LazySimpleSerDe

        Currently, Hive can only serialize/deserialize data encoded in UTF-8.

        It would be useful to specify the data's charset when creating the table.

        The idea is to add a new keyword, CHARSET, to set the charset at the table level.
        For example:
        CREATE TABLE tbl1 (col1 STRING) ROW FORMAT CHARSET "GBK" DELIMITED FIELDS TERMINATED BY '\t';

        Another place to use CHARSET is in the TRANSFORM clause.
        For example:
        SELECT TRANSFORM(col1, col2) ROW FORMAT CHARSET 'gbk'
        USING 'some_script'
        AS (col3, col4) ROW FORMAT CHARSET 'utf-8';

        TEST PLAN
        EMPTY

        REVISION DETAIL
        https://reviews.facebook.net/D2619

        AFFECTED FILES
        hbase-handler/src/test/org/apache/hadoop/hive/hbase/TestLazyHBaseObject.java
        data/files/gbk.txt
        serde/src/test/org/apache/hadoop/hive/serde2/lazy/TestLazyArrayMapStruct.java
        serde/src/test/org/apache/hadoop/hive/serde2/lazy/TestLazyPrimitive.java
        serde/src/test/org/apache/hadoop/hive/serde2/lazy/TestLazyCharset.java
        serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazyUtils.java
        serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazyString.java
        serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazyFactory.java
        serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazySimpleSerDe.java
        serde/src/java/org/apache/hadoop/hive/serde2/lazy/objectinspector/LazyObjectInspectorFactory.java
        serde/src/java/org/apache/hadoop/hive/serde2/lazy/objectinspector/LazyUnionObjectInspector.java
        serde/src/java/org/apache/hadoop/hive/serde2/lazy/objectinspector/primitive/LazyStringObjectInspector.java
        serde/src/java/org/apache/hadoop/hive/serde2/lazy/objectinspector/primitive/LazyPrimitiveObjectInspectorFactory.java
        serde/src/java/org/apache/hadoop/hive/serde2/lazy/objectinspector/LazyListObjectInspector.java
        serde/src/java/org/apache/hadoop/hive/serde2/lazy/objectinspector/LazyMapObjectInspector.java
        serde/src/java/org/apache/hadoop/hive/serde2/lazy/objectinspector/LazySimpleStructObjectInspector.java
        serde/src/java/org/apache/hadoop/hive/serde2/columnar/ColumnarSerDe.java
        serde/src/gen/thrift/gen-py/org_apache_hadoop_hive_serde/constants.py
        serde/src/gen/thrift/gen-cpp/serde_constants.cpp
        serde/src/gen/thrift/gen-cpp/serde_constants.h
        serde/src/gen/thrift/gen-rb/serde_constants.rb
        serde/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/serde/Constants.java
        serde/src/gen/thrift/gen-php/serde/serde_constants.php
        serde/if/serde.thrift
        ql/src/test/results/clientpositive/charset.q.out
        ql/src/test/results/clientpositive/input35.q.out
        ql/src/test/results/clientpositive/input36.q.out
        ql/src/test/results/clientpositive/transform_charset.q.out
        ql/src/test/queries/clientpositive/transform_charset.q
        ql/src/test/queries/clientpositive/charset.q
        ql/src/java/org/apache/hadoop/hive/ql/plan/CreateTableDesc.java
        ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g
        ql/src/java/org/apache/hadoop/hive/ql/parse/BaseSemanticAnalyzer.java
        ql/src/java/org/apache/hadoop/hive/ql/parse/ImportSemanticAnalyzer.java
        ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
        ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java


          People

          • Assignee: Unassigned
          • Reporter: Kai Zhang
          • Votes: 1
          • Watchers: 1
