Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.5.0
    • Fix Version/s: None
    • Labels:
      None

      Description

      I'd like to work with non-UTF8 data easily.

      Suppose I have data in latin1. Currently, a "select *" returns each upper-range (non-ASCII) character as '\xef\xbf\xbd', which is the replacement character '\ufffd' encoded in UTF-8. It would be nice for Hive to understand different encodings, or to have a concept of a byte string.
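The symptom described above can be reproduced outside Hive. The following is a minimal Python sketch, not Hive's actual code path: it assumes only that the latin1 bytes get decoded as UTF-8 with invalid sequences replaced, which is consistent with the '\xef\xbf\xbd' output reported here. The sample string "café" is an illustrative choice.

```python
# Bytes as they might sit in a latin1-encoded data file: "café".
raw = "café".encode("latin-1")  # b'caf\xe9'

# Decoding those bytes as UTF-8 fails on the 0xE9 byte; with
# errors="replace" the decoder substitutes U+FFFD for it.
as_utf8 = raw.decode("utf-8", errors="replace")

# Re-encoding the result as UTF-8 yields the bytes seen in the report.
round_tripped = as_utf8.encode("utf-8")

# Decoding with the charset the data was written in recovers the text.
as_latin1 = raw.decode("latin-1")

print(repr(as_utf8))
print(round_tripped)
print(as_latin1)
```

Once the replacement has happened the original byte is gone, so the fix has to be applied at decode time rather than afterwards.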

        Activity

        Ted Xu added a comment -

        We implemented an encoding configuration feature for tables.
        Set the table encoding through a serde parameter, for example:

        alter table src set serdeproperties ('serialization.encoding'='GBK');
        

        This makes table src use the GBK encoding (a Chinese character encoding). Furthermore, when using the command line interface, the parameter 'hive.cli.encoding' must also be set. It must be set before the hive prompt starts, so set it in hive-site.xml or pass -hiveconf hive.cli.encoding=GBK on the command line, rather than running 'set hive.cli.encoding=GBK' in Hive QL.
        Because of this, I can't find a way to add a unit test.
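To illustrate the idea of a per-table encoding property, here is a rough Python sketch. It is not the patch's implementation: `decode_field` is a hypothetical helper, and falling back to UTF-8 when the property is absent is an assumption made for the example. Only the property name 'serialization.encoding' comes from the comment above.

```python
def decode_field(raw: bytes, serde_props: dict) -> str:
    """Decode one field using the table's declared encoding.

    Hypothetical helper: falls back to UTF-8 (an assumption) when
    'serialization.encoding' is not set on the table.
    """
    encoding = serde_props.get("serialization.encoding", "utf-8")
    return raw.decode(encoding, errors="replace")


# Sample GBK-encoded bytes, as they might be stored in table src.
gbk_bytes = "中文".encode("gbk")

# With the property set, the text is recovered.
with_prop = decode_field(gbk_bytes, {"serialization.encoding": "GBK"})

# Without it, the bytes are misread and replacement characters appear.
without_prop = decode_field(gbk_bytes, {})

print(with_prop)
print(repr(without_prop))
```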

        Ted Xu added a comment -

        Please review trunk-encoding.patch, thanks.

        Edward Capriolo added a comment -

        Maybe you should fork hive and call it chive.

        On a serious note: great job. Would you consider editing cli.xml in the xdocs to explain this feature? I think it would be very helpful. Look in docs/xdocs/.

        Ted Xu added a comment -

        Thanks Edward.

        I dug into the problem and found that the patch does not work when the query has subqueries; it is very hard to retain encoding information in those queries.

        Table properties may be missing in queries. The problem is the same as a missing field delimiter setting: whenever Hive can't get the table properties in a subquery (e.g., a join operation), the default value is used (^A for the field delimiter). That's why the deserializer fails most of the time when the data contains a ^A character, even if ^A is not set as the field delimiter.
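The delimiter analogy can be sketched in Python. This is only an illustration of the failure mode, not Hive's deserializer: `parse_row`, the property key used for the lookup, and the sample row are all hypothetical. The only detail taken from the comment is that ^A ('\x01') is the default used when table properties are unavailable.

```python
DEFAULT_FIELD_DELIM = "\x01"  # ^A, the default named in the comment


def parse_row(line: str, table_props=None) -> list:
    """Split a row on the table's delimiter, or on ^A if props are lost.

    Hypothetical helper; 'field.delim' is used here as an
    illustrative property key.
    """
    props = table_props or {}
    delim = props.get("field.delim", DEFAULT_FIELD_DELIM)
    return line.split(delim)


# A comma-delimited row whose middle field happens to contain ^A.
line = "id1,x\x01y,z"

# Properties available: three fields, ^A stays inside the data.
with_props = parse_row(line, {"field.delim": ","})

# Properties lost (e.g. in a subquery): the row is split on ^A instead.
without_props = parse_row(line)

print(with_props)
print(without_props)
```

The same loss applies to an encoding property: once the table properties are not available to an intermediate stage, that stage has nothing but the default to decode with.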


          People

          • Assignee:
            Ted Xu
            Reporter:
            bc Wong
          • Votes:
            4
            Watchers:
            11
