Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: 1.2.1, 2.0.1
- Fix Version/s: None
- Component/s: None
Description
AvroSerDe allows the table columns to be omitted from a table definition, as long as TBLPROPERTIES includes a valid avro.schema.url or avro.schema.literal. The table columns are then inferred by processing the Avro schema file/literal.
The problem is that the inferred schema might not be congruent with the actual schema in the Avro schema file/literal. Consider the following table definition:
CREATE TABLE avro_schema_break_1
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.literal'='{
  "type": "record",
  "name": "Messages",
  "namespace": "net.myth",
  "fields": [
    { "name": "header",
      "type": [ "null",
        { "type": "record", "name": "HeaderInfo",
          "fields": [
            { "name": "inferred_event_type", "type": [ "null", "string" ], "default": null },
            { "name": "event_type", "type": [ "null", "string" ], "default": null },
            { "name": "event_version", "type": [ "null", "string" ], "default": null }
          ] } ] },
    { "name": "messages",
      "type": { "type": "array",
        "items": { "name": "MessageInfo", "type": "record",
          "fields": [
            { "name": "message_id", "type": [ "null", "string" ], "doc": "Message-ID" },
            { "name": "received_date", "type": [ "null", "long" ], "doc": "Received Date" },
            { "name": "sent_date", "type": [ "null", "long" ] },
            { "name": "from_name", "type": [ "null", "string" ] },
            { "name": "flags",
              "type": [ "null",
                { "type": "record", "name": "Flags",
                  "fields": [
                    { "name": "is_seen", "type": [ "null", "boolean" ], "default": null },
                    { "name": "is_read", "type": [ "null", "boolean" ], "default": null },
                    { "name": "is_flagged", "type": [ "null", "boolean" ], "default": null }
                  ] } ], "default": null }
          ] } } }
  ] }');
This produces a table with the following schema:
2016-09-19T13:23:42,934 DEBUG [0ce7e586-13ea-4390-ac2a-6dac36e8a216 main] hive.log: DDL: struct avro_schema_break_1 { struct<inferred_event_type:string,event_type:string,event_version:string> header, list<struct<message_id:string,received_date:i64,sent_date:i64,from_name:string,flags:struct<is_seen:bool,is_read:bool,is_flagged:bool>>> messages}
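Note how the generated DDL carries no trace of the ["null", ...] unions: header becomes a plain struct, not a nullable one. The collapse can be illustrated with a small Python sketch (a simplified, hypothetical stand-in for AvroSerDe's inference logic; the function avro_to_hive_type and its type mapping are illustrative, not Hive's actual code):

```python
import json

def avro_to_hive_type(schema):
    """Map an Avro schema fragment to a Hive type string, mimicking
    (in simplified form) how table columns are inferred from the schema.
    Illustrative sketch only -- not the actual AvroSerDe implementation."""
    if isinstance(schema, list):
        # A union: the "null" branch is dropped, so ["null", "string"]
        # collapses to plain string. The union wrapper -- and hence the
        # fact that the data is union-encoded -- is lost.
        branches = [b for b in schema if b != "null"]
        if len(branches) == 1:
            return avro_to_hive_type(branches[0])
        return "uniontype<" + ",".join(avro_to_hive_type(b) for b in branches) + ">"
    if isinstance(schema, dict):
        t = schema["type"]
        if t == "record":
            fields = ",".join(f"{f['name']}:{avro_to_hive_type(f['type'])}"
                              for f in schema["fields"])
            return f"struct<{fields}>"
        if t == "array":
            return f"array<{avro_to_hive_type(schema['items'])}>"
        return avro_to_hive_type(t)
    return {"string": "string", "long": "bigint", "boolean": "boolean"}[schema]

# A trimmed-down version of the "header" field from the schema above:
header_field = json.loads('''
  { "name": "header",
    "type": [ "null",
      { "type": "record", "name": "HeaderInfo",
        "fields": [ { "name": "event_type", "type": [ "null", "string" ] } ] } ] }
''')

# The inferred Hive type no longer mentions the union:
print(avro_to_hive_type(header_field["type"]))  # struct<event_type:string>
```

A schema regenerated from these inferred columns expects a bare record where the writer wrote a union branch, which is consistent with the "expecting union" failure shown below.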
Data written to this table using the Avro schema from avro.schema.literal via Pig's AvroStorage cannot be read back in Hive using the generated table schema. This is the exception one sees:
java.io.IOException: org.apache.avro.AvroTypeException: Found net.myth.HeaderInfo, expecting union
    at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:521)
    at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:428)
    at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:147)
    at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:2019)
    at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:253)
    at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:184)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:400)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:336)
    at org.apache.hadoop.hive.ql.QTestUtil.executeClientInternal(QTestUtil.java:1162)
    at org.apache.hadoop.hive.ql.QTestUtil.executeClient(QTestUtil.java:1136)
    at org.apache.hadoop.hive.cli.control.CoreCliDriver.runTest(CoreCliDriver.java:172)
    at org.apache.hadoop.hive.cli.control.CliAdapter.runTest(CliAdapter.java:104)
    at org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver(TestCliDriver.java:59)
    ...
The only way to read this table is with the attached avro.schema.literal or avro.schema.url. This has implications for systems where data could be produced externally to Hive. It also has repercussions for table replication using Falcon/GDM, in that the schema file/literal needs to be replicated along with the data.