Details
-
Bug
-
Status: Open
-
Major
-
Resolution: Unresolved
-
None
-
None
-
None
Description
If data contains malformed utf8 characters, varchar and string will behave differently.
The content in /tmp/hex_data (as attached) is 6130373530633166313366306B35B0A546386A8DAEAB62B4526F273464613936 in hex.
B0A5 and 8DAEAB are malformed utf8 characters, they are decoded to EFBFBD if the data type is varchar, but string will not change it. So:
VARCHAR in hex shows: 6130373530633166313366306B35EFBFBDEFBFBD46386AEFBFBDEFBFBDEFBFBD62EFBFBD526F273464613936
STRING in hex shows: 6130373530633166313366306B35B0A546386A8DAEAB62B4526F273464613936
See details:
hive> DROP TABLE TBL_S; OK Time taken: 0.562 seconds hive> CREATE TABLE TBL_S > ( > GUID STRING > ) > row format delimited fields terminated by '\177' stored as textfile > LOCATION > '/tmp/hex_data' > tblproperties('serialization.null.format'='', 'timestamp.formats' = 'yyyy-MM-dd HH:mm:ss') > ; OK Time taken: 3.074 seconds hive> > DROP TABLE TBL_V; OK Time taken: 0.894 seconds hive> CREATE TABLE TBL_V > ( > GUID VARCHAR(32) > ) > row format delimited fields terminated by '\177' stored as textfile > LOCATION > '/tmp/hex_data' > tblproperties('serialization.null.format'='', 'timestamp.formats' = 'yyyy-MM-dd HH:mm:ss') > ; OK Time taken: 0.242 seconds hive> SELECT GUID, hex(GUID) FROM TBL_S; OK a0750c1f13f0k5��F8j���b�Ro'4da96 6130373530633166313366306B35B0A546386A8DAEAB62B4526F273464613936 Time taken: 1.581 seconds, Fetched: 1 row(s) hive> SELECT GUID, hex(GUID) FROM TBL_V; OK a0750c1f13f0k5��F8j���b�Ro'4da96 6130373530633166313366306B35EFBFBDEFBFBD46386AEFBFBDEFBFBDEFBFBD62EFBFBD526F273464613936
Is it expected?