Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.5.0
    • Fix Version/s: None
    • Labels:
      None

      Description

      I'd like to work with non-UTF8 data easily.

      Suppose I have data in latin1. Currently, a "select *" returns each upper-ASCII character (any byte above 0x7F) as '\xef\xbf\xbd', which is the replacement character '\ufffd' encoded in UTF-8. It would be nice for Hive to understand different encodings, or to have a concept of a byte string.
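
The behaviour described above can be reproduced outside Hive. The following is a minimal Python sketch, not Hive code; the sample string "café" is an arbitrary example chosen for illustration. It shows how a Latin-1 byte that is not valid UTF-8 turns into the replacement character when the data is decoded as UTF-8.

```python
# Latin-1 bytes for "café": the final byte 0xE9 is "é" in Latin-1,
# but on its own it is not a valid UTF-8 sequence.
raw = "café".encode("latin-1")

# Decoding as UTF-8 with replacement substitutes U+FFFD for the invalid byte.
as_utf8 = raw.decode("utf-8", errors="replace")

# Re-encoding to UTF-8 yields the '\xef\xbf\xbd' bytes mentioned in the report.
reencoded = as_utf8.encode("utf-8")

# Decoding with the data's real encoding preserves the character.
correct = raw.decode("latin-1")

print(repr(raw), repr(reencoded), correct)
```

Once the byte has been replaced, the original character cannot be recovered from the output, which is why the encoding has to be known at read time.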

        Activity

        Transition                        Time In Source Status   Execution Times   Last Executer   Last Execution Date
        Open -> In Progress               17d 9h 45m              1                 Ted Xu          20/Aug/10 02:46
        In Progress -> Patch Available    4m 3s                   1                 Ted Xu          20/Aug/10 02:50
        Patch Available -> Open           5d 16h 45m              1                 Namit Jain      25/Aug/10 19:35
        Namit Jain made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Ted Xu added a comment -

        Thanks Edward.

        I dug into the problem and found that the patch does not work when the query has subqueries; it is very hard to retain encoding information in those queries.

        Table properties may be missing in queries. The problem is the same as a missing field delimiter setting: whenever Hive can't get the table properties in a subquery (e.g., a join operation), the default value is used (^A for the field delimiter; that's why the deserializer fails most of the time when the data contains a ^A character, even if ^A is not set as the field delimiter).
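
The delimiter analogy can be illustrated with a toy model. This is a Python sketch under stated assumptions, not Hive's deserializer: `split_row` is a hypothetical helper, and the `field.delim` key is used here only as an illustrative property name. It shows how a row written with ',' as the delimiter is split incorrectly once the table properties are lost and the ^A default is applied.

```python
DEFAULT_DELIM = "\x01"  # ^A, the default field delimiter mentioned above


def split_row(line, table_props=None):
    """Hypothetical helper: split a row with the table's delimiter,
    falling back to the default when the properties are unavailable."""
    delim = (table_props or {}).get("field.delim", DEFAULT_DELIM)
    return line.split(delim)


# A row written with ',' as the delimiter; the second field contains ^A.
row = "id42,some\x01text"

with_props = split_row(row, {"field.delim": ","})
without_props = split_row(row)  # properties lost, e.g. inside a subquery

print(with_props)
print(without_props)
```

With the properties present the row splits into the two intended fields; without them the split happens at the ^A inside the data. An encoding setting stored in table properties would be lost in the same way.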

        Edward Capriolo added a comment -

        Maybe you should fork hive and call it chive.

        On a serious note: great job. Would you consider editing the cli.xml in the xdocs to explain this feature? I think it would be very helpful. Look in docs/xdocs/.

        Ted Xu made changes -
        Status In Progress [ 3 ] Patch Available [ 10002 ]
        Ted Xu added a comment -

        Please review trunk-encoding.patch, thanks.

        Ted Xu made changes -
        Status Open [ 1 ] In Progress [ 3 ]
        Ted Xu made changes -
        Assignee Ted Xu [ tedxu ]
        Ted Xu made changes -
        Field Original Value New Value
        Attachment trunk-encoding.patch [ 12452490 ]
        Ted Xu added a comment -

        We implemented an encoding configuration feature for tables.
        Set the table encoding through a serde parameter, for example:

        alter table src set serdeproperties ('serialization.encoding'='GBK');
        

        This makes table src use the GBK encoding (a Chinese character encoding). Furthermore, when using the command line interface, the parameter 'hive.cli.encoding' must also be set. It must be set before the Hive prompt starts, so set 'hive.cli.encoding' in hive-site.xml or pass -hiveconf hive.cli.encoding=GBK on the command line, instead of running 'set hive.cli.encoding=GBK' in Hive QL.
        For that reason, I can't find a way to add a unit test.
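
As a rough illustration of what a per-table encoding setting implies, here is a minimal Python sketch. It is not the patch's implementation: `read_field` is a hypothetical helper, and the UTF-8 fallback is an assumption made for the example. The stored bytes are decoded with the charset named in the serde properties when one is given.

```python
def read_field(raw: bytes, serde_props: dict) -> str:
    """Hypothetical reader: decode stored bytes using the table's
    'serialization.encoding' property, assuming UTF-8 when it is absent."""
    encoding = serde_props.get("serialization.encoding", "utf-8")
    return raw.decode(encoding, errors="replace")


# Bytes as they might be stored in a GBK-encoded data file.
stored = "中文".encode("gbk")

decoded_default = read_field(stored, {})
decoded_gbk = read_field(stored, {"serialization.encoding": "GBK"})

print(decoded_default)
print(decoded_gbk)
```

Without the property, the GBK bytes are misread as UTF-8 and come out as replacement characters; with it, the original text is recovered.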

        bc Wong created issue -

          People

          • Assignee:
            Ted Xu
            Reporter:
            bc Wong
          • Votes:
            3
            Watchers:
            9

            Dates

            • Created:
              Updated:
