- Type: Bug
- Status: Open
- Priority: Critical
- Resolution: Unresolved
- Affects Version/s: None
- Fix Version/s: None
- Component/s: None
- Labels: None
- Environment: Hive version: 1.1.0-cdh5.5.1 (bundled with Cloudera 5.5.1)
When the Avro files that back an external table contain \n characters, the results of SELECT queries may be corrupt. I encountered this error when querying Hive from both beeline and JDBC.
Steps to reproduce (the files used are attached to the ticket):
- Create an .avro file that contains newline characters in a value of a map:
avro-tools fromjson --schema-file test.schema test.json > test.avro
- Copy .avro file to HDFS
hdfs dfs -copyFromLocal test.avro /some/location/
- Create an external table in beeline containing this .avro:
beeline> CREATE EXTERNAL TABLE broken_newline_map
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS
    INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  LOCATION '/some/location/'
  TBLPROPERTIES ('avro.schema.literal'='
    {
      "type" : "record",
      "name" : "myEntry",
      "namespace" : "myNamespace",
      "fields" : [
        { "name" : "foo", "type" : "long" },
        { "name" : "bar", "type" : { "type" : "map", "values" : "string" } }
      ]
    }
  ');
- Now, selecting may return corrupt results:
jdbc:hive2://my-server:10000/> select * from broken_newline_map;
+-------------------------+---------------------------------------------------+--+
| broken_newline_map.foo  | broken_newline_map.bar                            |
+-------------------------+---------------------------------------------------+--+
| 1                       | {"key2":"value2","key1":"value1\nafter newline"}  |
| 2                       | {"key2":"new value2","key1":"new value"}          |
+-------------------------+---------------------------------------------------+--+
2 rows selected (1.661 seconds)
jdbc:hive2://my-server:10000/> select foo, map_keys(bar), map_values(bar) from broken_newline_map;
+-------+------------------+-----------------------------+--+
| foo   | _c1              | _c2                         |
+-------+------------------+-----------------------------+--+
| 1     | ["key2","key1"]  | ["value2","value1"]         |
| NULL  | NULL             | NULL                        |
| 2     | ["key2","key1"]  | ["new value2","new value"]  |
+-------+------------------+-----------------------------+--+
3 rows selected (28.05 seconds)
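The second result set is wrong in two ways. Row 1 is incorrect: the value for key1 is cut off at the newline ("value1" instead of "value1\nafter newline"). Row 2 is corrupt: it is a spurious all-NULL row, so the query returns three rows although the table holds only two. I see the same behavior when running this query through JDBC.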
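The input files for the first step are attached to the ticket and are not reproduced here. As a minimal sketch, the script below writes a test.schema and test.json that should be equivalent: the schema is copied from the avro.schema.literal above, while the two records are inferred from the SELECT * output, so the attached test.json may differ. The helper names render_json_lines and write_inputs are hypothetical.

```python
import json

# Schema copied from the avro.schema.literal in the CREATE TABLE statement.
SCHEMA = {
    "type": "record",
    "name": "myEntry",
    "namespace": "myNamespace",
    "fields": [
        {"name": "foo", "type": "long"},
        {"name": "bar", "type": {"type": "map", "values": "string"}},
    ],
}

# Assumption: records inferred from the SELECT * output in this ticket.
RECORDS = [
    {"foo": 1, "bar": {"key1": "value1\nafter newline", "key2": "value2"}},
    {"foo": 2, "bar": {"key1": "new value", "key2": "new value2"}},
]


def render_json_lines(records):
    """Render one JSON record per line.

    json.dumps escapes the embedded newline as the two characters
    backslash and n, so each record stays on a single physical line.
    """
    return "".join(json.dumps(record) + "\n" for record in records)


def write_inputs(schema_path="test.schema", json_path="test.json"):
    """Write the schema and data files used by the avro-tools step."""
    with open(schema_path, "w") as schema_file:
        json.dump(SCHEMA, schema_file, indent=2)
    with open(json_path, "w") as json_file:
        json_file.write(render_json_lines(RECORDS))


if __name__ == "__main__":
    write_inputs()
```

After running the script, the avro-tools fromjson command from the first step can be applied to the generated files unchanged.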
- is related to: HIVE-11785 Support escaping carriage return and new line for LazySimpleSerDe (Closed)