Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: 1.2.1, 2.0.1
- Fix Version/s: None
- Component/s: None
Description
AvroSerDe allows the table columns to be omitted from a table definition, as long as TBLPROPERTIES includes a valid avro.schema.url or avro.schema.literal. The table columns are then inferred by processing the Avro schema file/literal.
The problem is that the inferred schema might not be congruent with the actual schema in the Avro schema file/literal. Consider the following table definition:
CREATE TABLE avro_schema_break_1
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.literal'='{
  "type": "record",
  "name": "Messages",
  "namespace": "net.myth",
  "fields": [
    { "name": "header",
      "type": [ "null",
        { "type": "record", "name": "HeaderInfo",
          "fields": [
            { "name": "inferred_event_type", "type": [ "null", "string" ], "default": null },
            { "name": "event_type", "type": [ "null", "string" ], "default": null },
            { "name": "event_version", "type": [ "null", "string" ], "default": null }
          ] } ] },
    { "name": "messages",
      "type": { "type": "array",
        "items": { "name": "MessageInfo", "type": "record",
          "fields": [
            { "name": "message_id", "type": [ "null", "string" ], "doc": "Message-ID" },
            { "name": "received_date", "type": [ "null", "long" ], "doc": "Received Date" },
            { "name": "sent_date", "type": [ "null", "long" ] },
            { "name": "from_name", "type": [ "null", "string" ] },
            { "name": "flags",
              "type": [ "null",
                { "type": "record", "name": "Flags",
                  "fields": [
                    { "name": "is_seen", "type": [ "null", "boolean" ], "default": null },
                    { "name": "is_read", "type": [ "null", "boolean" ], "default": null },
                    { "name": "is_flagged", "type": [ "null", "boolean" ], "default": null }
                  ] } ], "default": null }
          ] } } }
  ] }');
This produces a table with the following schema:
2016-09-19T13:23:42,934 DEBUG [0ce7e586-13ea-4390-ac2a-6dac36e8a216 main] hive.log: DDL: struct avro_schema_break_1 { struct<inferred_event_type:string,event_type:string,event_version:string> header, list<struct<message_id:string,received_date:i64,sent_date:i64,from_name:string,flags:struct<is_seen:bool,is_read:bool,is_flagged:bool>>> messages}
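Note how the generated DDL carries no trace of the ["null", ...] unions: header becomes a plain struct, not a nullable one. The collapse can be illustrated with a small Python sketch (a simplified, hypothetical stand-in for AvroSerDe's inference logic; the function avro_to_hive_type and its type mapping are illustrative, not Hive's actual code):

```python
import json

def avro_to_hive_type(schema):
    """Map an Avro schema fragment to a Hive type string, mimicking
    (in simplified form) how table columns are inferred from the schema.
    Illustrative sketch only -- not the actual AvroSerDe implementation."""
    if isinstance(schema, list):
        # A union: the "null" branch is dropped, so ["null", "string"]
        # collapses to plain string. The union wrapper -- and hence the
        # fact that the data is union-encoded -- is lost.
        branches = [b for b in schema if b != "null"]
        if len(branches) == 1:
            return avro_to_hive_type(branches[0])
        return "uniontype<" + ",".join(avro_to_hive_type(b) for b in branches) + ">"
    if isinstance(schema, dict):
        t = schema["type"]
        if t == "record":
            fields = ",".join(f"{f['name']}:{avro_to_hive_type(f['type'])}"
                              for f in schema["fields"])
            return f"struct<{fields}>"
        if t == "array":
            return f"array<{avro_to_hive_type(schema['items'])}>"
        return avro_to_hive_type(t)
    return {"string": "string", "long": "bigint", "boolean": "boolean"}[schema]

# A trimmed-down version of the "header" field from the schema above:
header_field = json.loads('''
  { "name": "header",
    "type": [ "null",
      { "type": "record", "name": "HeaderInfo",
        "fields": [ { "name": "event_type", "type": [ "null", "string" ] } ] } ] }
''')

# The inferred Hive type no longer mentions the union:
print(avro_to_hive_type(header_field["type"]))  # struct<event_type:string>
```

A schema regenerated from these inferred columns expects a bare record where the writer wrote a union branch, which is consistent with the "expecting union" failure shown below.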
Data written to this table using the Avro schema from avro.schema.literal via Pig's AvroStorage cannot be read back in Hive using the generated table schema. This is the exception one sees:
java.io.IOException: org.apache.avro.AvroTypeException: Found net.myth.HeaderInfo, expecting union
    at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:521)
    at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:428)
    at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:147)
    at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:2019)
    at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:253)
    at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:184)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:400)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:336)
    at org.apache.hadoop.hive.ql.QTestUtil.executeClientInternal(QTestUtil.java:1162)
    at org.apache.hadoop.hive.ql.QTestUtil.executeClient(QTestUtil.java:1136)
    at org.apache.hadoop.hive.cli.control.CoreCliDriver.runTest(CoreCliDriver.java:172)
    at org.apache.hadoop.hive.cli.control.CliAdapter.runTest(CliAdapter.java:104)
    at org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver(TestCliDriver.java:59)
    ...
The only way to read this table is with the attached avro.schema.literal or avro.schema.url. This has implications for systems where data could be produced externally to Hive. It also has repercussions for table replication using Falcon/GDM, in that the schema file/literal needs to be replicated along with the data.