Hive / HIVE-14789

Avro Table-reads bork when using SerDe-generated table-schema.


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.2.1, 2.0.1
    • None
    • None

    Description

      AvroSerDe allows one to skip the table-columns in a table-definition when creating a table, as long as the TBLPROPERTIES includes a valid avro.schema.url or avro.schema.literal. The table-columns are inferred from processing the Avro schema file/literal.
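As a rough illustration of the column inference described above, the sketch below maps a parsed Avro schema to Hive-style type strings. It is not Hive's implementation: the function names `hive_type` and `infer_columns`, the primitive mapping, and the trimmed-down `SCHEMA` are assumptions made for this example, and only records, arrays, primitives, and `["null", T]` unions are handled.

```python
import json

# Simplified Avro-to-Hive primitive mapping (an assumption for illustration).
PRIMITIVES = {
    "string": "string",
    "boolean": "boolean",
    "int": "int",
    "long": "bigint",
    "float": "float",
    "double": "double",
    "bytes": "binary",
}


def hive_type(avro):
    """Return a Hive-style type string for a parsed Avro schema node."""
    if isinstance(avro, str):
        return PRIMITIVES[avro]
    if isinstance(avro, list):
        # A union: drop the "null" branch; only [null, T] is handled here.
        branches = [b for b in avro if b != "null"]
        if len(branches) != 1:
            raise ValueError("only [null, T] unions are handled in this sketch")
        return hive_type(branches[0])
    kind = avro["type"]
    if kind == "record":
        cols = ",".join(
            f'{f["name"]}:{hive_type(f["type"])}' for f in avro["fields"]
        )
        return f"struct<{cols}>"
    if kind == "array":
        return f'array<{hive_type(avro["items"])}>'
    # Wrapped primitive such as {"type": "string"}; maps and enums are not covered.
    return hive_type(kind)


def infer_columns(schema_literal):
    """Return (column name, Hive type) pairs for a top-level Avro record."""
    schema = json.loads(schema_literal)
    return [(f["name"], hive_type(f["type"])) for f in schema["fields"]]


# A shortened, hypothetical schema in the same shape as the one in this report.
SCHEMA = """
{"type": "record", "name": "Messages", "namespace": "net.myth", "fields": [
  {"name": "header", "type": ["null", {"type": "record", "name": "HeaderInfo",
    "fields": [{"name": "event_type", "type": ["null", "string"], "default": null}]}]},
  {"name": "messages", "type": {"type": "array", "items": {"type": "record",
    "name": "MessageInfo", "fields": [
      {"name": "message_id", "type": ["null", "string"]},
      {"name": "received_date", "type": ["null", "long"]}]}}}
]}
"""

columns = infer_columns(SCHEMA)
for name, col_type in columns:
    print(name, col_type)
```

Note that the nullable unions and the record names do not survive into the type strings in this sketch.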

      The problem is that the inferred schema might not be congruent with the actual schema in the Avro schema file/literal. Consider the following table definition:

      CREATE TABLE avro_schema_break_1
      ROW FORMAT
      SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
      STORED AS
      INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
      OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
      TBLPROPERTIES ('avro.schema.literal'='{
        "type": "record",
        "name": "Messages",
        "namespace": "net.myth",
        "fields": [
          {
            "name": "header",
            "type": [
              "null",
              {
                "type": "record",
                "name": "HeaderInfo",
                "fields": [
                  {
                    "name": "inferred_event_type",
                    "type": [
                      "null",
                      "string"
                    ],
                    "default": null
                  },
                  {
                    "name": "event_type",
                    "type": [
                      "null",
                      "string"
                    ],
                    "default": null
                  },
                  {
                    "name": "event_version",
                    "type": [
                      "null",
                      "string"
                    ],
                    "default": null
                  }    
                ]
              }
            ]
          },
          {
            "name": "messages",
            "type": {
              "type": "array",
              "items": {
                "name": "MessageInfo",
                "type": "record",
                "fields": [
                  {
                    "name": "message_id",
                    "type": [
                      "null",
                      "string"
                    ],
                    "doc": "Message-ID"
                  },
                  {
                    "name": "received_date",
                    "type": [
                      "null",
                      "long"
                    ],
                    "doc": "Received Date"
                  },
                  {
                    "name": "sent_date",
                    "type": [
                      "null",
                      "long"
                    ]
                  },
                  {
                    "name": "from_name",
                    "type": [
                      "null",
                      "string"
                    ]
                  },
                  {
                    "name": "flags",
                    "type": [
                      "null",
                      {
                        "type": "record",
                        "name": "Flags",
                        "fields": [
                          {
                            "name": "is_seen",
                            "type": [
                              "null",
                              "boolean"
                            ],
                            "default": null
                          },
                          {
                            "name": "is_read",
                            "type": [
                              "null",
                              "boolean"
                            ],
                            "default": null
                          },
                          {
                            "name": "is_flagged",
                            "type": [
                              "null",
                              "boolean"
                            ],
                            "default": null
                          }
                        ]
                      }
                    ],
                    "default": null
                  }
                ]
              }
            }
          }
        ]
      }');
      

      This produces a table with the following schema:

      2016-09-19T13:23:42,934 DEBUG [0ce7e586-13ea-4390-ac2a-6dac36e8a216 main] hive.log: DDL: struct avro_schema_break_1 { struct<inferred_event_type:string,event_type:string,event_version:string> header, list<struct<message_id:string,received_date:i64,sent_date:i64,from_name:string,flags:struct<is_seen:bool,is_read:bool,is_flagged:bool>>> messages}
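
Comparing the logged DDL string with the schema literal, several details of the Avro schema have no counterpart in the DDL: record names, the namespace, the `"null"` union branches, and field defaults. The sketch below walks a parsed Avro schema and lists those details. The helper `lost_details`, its path notation, and the shortened `SCHEMA` are hypothetical and exist only for this example. It does not establish which missing detail triggers the read failure.

```python
import json


def lost_details(node, path="$"):
    """List Avro schema details that a bare name:type DDL string cannot express."""
    lost = []
    if isinstance(node, list):
        # Union: the DDL string shows only the non-null branch.
        if "null" in node:
            lost.append(f"{path}: nullable union")
        for branch in node:
            if branch != "null":
                lost.extend(lost_details(branch, path))
    elif isinstance(node, dict):
        kind = node["type"]
        if kind == "record":
            # A struct<...> type keeps field names but not the record's own name.
            lost.append(f"{path}: record name {node['name']}")
            if "namespace" in node:
                lost.append(f"{path}: namespace {node['namespace']}")
            for field in node["fields"]:
                field_path = f"{path}.{field['name']}"
                if "default" in field:
                    lost.append(f"{field_path}: field default")
                lost.extend(lost_details(field["type"], field_path))
        elif kind == "array":
            lost.extend(lost_details(node["items"], path + "[]"))
        else:
            # Wrapped type such as {"type": "string"}; primitives add nothing.
            lost.extend(lost_details(kind, path))
    return lost


# A shortened, hypothetical schema in the same shape as the one in this report.
SCHEMA = json.loads("""
{"type": "record", "name": "Messages", "namespace": "net.myth", "fields": [
  {"name": "header", "type": ["null", {"type": "record", "name": "HeaderInfo",
    "fields": [{"name": "event_type", "type": ["null", "string"], "default": null}]}]}
]}
""")

lost = lost_details(SCHEMA)
for line in lost:
    print(line)
```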
      

      Data written to this table with Pig's AvroStorage, using the Avro schema from avro.schema.literal, cannot be read by Hive through the generated table schema. The read fails with the following exception:

      java.io.IOException: org.apache.avro.AvroTypeException: Found net.myth.HeaderInfo, expecting union
        at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:521)
        at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:428)
        at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:147)
        at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:2019)
        at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:253)
        at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:184)
        at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:400)
        at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:336)
        at org.apache.hadoop.hive.ql.QTestUtil.executeClientInternal(QTestUtil.java:1162)
        at org.apache.hadoop.hive.ql.QTestUtil.executeClient(QTestUtil.java:1136)
        at org.apache.hadoop.hive.cli.control.CoreCliDriver.runTest(CoreCliDriver.java:172)
        at org.apache.hadoop.hive.cli.control.CliAdapter.runTest(CliAdapter.java:104)
        at org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver(TestCliDriver.java:59)
      ...
      

      The only way to read this table is with the attached avro.schema.literal or avro.schema.url. This affects systems where data may be produced outside Hive. It also affects table replication using Falcon/GDM, because the schema file/literal must be replicated along with the table.

      Attachments

        1. HIVE-14789-reproduce.patch
          4 kB
          Mithun Radhakrishnan

          People

            Assignee: Mithun Radhakrishnan
            Reporter: Mithun Radhakrishnan