Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-21042

varchar and string will behave differently for malformed utf8 characters

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • None
    • None
    • SQL
    • None

    Description

      If data contains malformed utf8 characters, varchar and string will behave differently.

      The content in /tmp/hex_data (as attached) is 6130373530633166313366306B35B0A546386A8DAEAB62B4526F273464613936 in hex.

      B0A5 and 8DAEAB are malformed utf8 characters, they are decoded to EFBFBD if the data type is varchar, but string will not change it. So:

      VARCHAR in hex shows: 6130373530633166313366306B35EFBFBDEFBFBD46386AEFBFBDEFBFBDEFBFBD62EFBFBD526F273464613936
      STRING in hex shows: 6130373530633166313366306B35B0A546386A8DAEAB62B4526F273464613936

      See details:

      hive> DROP TABLE TBL_S;
      OK
      Time taken: 0.562 seconds
      hive> CREATE TABLE TBL_S
          > (
          >       GUID STRING
          > )
          > row format delimited fields terminated by '\177' stored as textfile
          > LOCATION
          > '/tmp/hex_data'
          > tblproperties('serialization.null.format'='',  'timestamp.formats' = 'yyyy-MM-dd HH:mm:ss')
          > ;
      OK
      Time taken: 3.074 seconds
      hive>
          > DROP TABLE TBL_V;
      OK
      Time taken: 0.894 seconds
      hive> CREATE TABLE TBL_V
          > (
          >       GUID VARCHAR(32)
          > )
          > row format delimited fields terminated by '\177' stored as textfile
          > LOCATION
          > '/tmp/hex_data'
          > tblproperties('serialization.null.format'='',  'timestamp.formats' = 'yyyy-MM-dd HH:mm:ss')
          > ;
      OK
      Time taken: 0.242 seconds
      hive> SELECT GUID, hex(GUID) FROM TBL_S;
      OK
      a0750c1f13f0k5��F8j���b�Ro'4da96        6130373530633166313366306B35B0A546386A8DAEAB62B4526F273464613936
      Time taken: 1.581 seconds, Fetched: 1 row(s)
      hive> SELECT GUID, hex(GUID) FROM TBL_V;
      OK
      a0750c1f13f0k5��F8j���b�Ro'4da96        6130373530633166313366306B35EFBFBDEFBFBD46386AEFBFBDEFBFBDEFBFBD62EFBFBD526F273464613936
      

      Is it expected?

      Attachments

        Activity

          People

            Unassigned Unassigned
            yucai yucai
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated: