  1. Hive
  2. HIVE-14044

Newlines in Avro maps cause external table to return corrupt values

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.2.2
    • Fix Version/s: None
    • Component/s: None
    • Labels: None
    • Environment: Hive version 1.1.0-cdh5.5.1 (bundled with Cloudera 5.5.1)

    Description

      When the Avro files that back an external table contain \n characters, SELECT queries may return corrupt results. I encountered this error when querying Hive from both beeline and JDBC.

      Steps to reproduce (the files used are attached to this ticket)

      1. Create an .avro file that contains newline characters in a value of a map:
        avro-tools fromjson --schema-file test.schema test.json > test.avro
        
      2. Copy the .avro file to HDFS:
        hdfs dfs -copyFromLocal test.avro /some/location/
        
      3. In beeline, create an external table over the location that contains this .avro file:
        beeline> CREATE EXTERNAL TABLE broken_newline_map
        ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
        STORED AS
        INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
        OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
        LOCATION '/some/location/'
        TBLPROPERTIES ('avro.schema.literal'='
        {
          "type" : "record",
          "name" : "myEntry",
          "namespace" : "myNamespace",
          "fields" : [ {
            "name" : "foo",
            "type" : "long"
          }, {
            "name" : "bar",
            "type" : {
              "type" : "map",
              "values" : "string"
            }
          } ]
        }
        ');
        
      4. Now, selecting may return corrupt results:
        jdbc:hive2://my-server:10000/> select * from broken_newline_map;
        +-------------------------+---------------------------------------------------+--+
        | broken_newline_map.foo  |              broken_newline_map.bar               |
        +-------------------------+---------------------------------------------------+--+
        | 1                       | {"key2":"value2","key1":"value1\nafter newline"}  |
        | 2                       | {"key2":"new value2","key1":"new value"}          |
        +-------------------------+---------------------------------------------------+--+
        2 rows selected (1.661 seconds)
        
        jdbc:hive2://my-server:10000/> select foo, map_keys(bar), map_values(bar) from broken_newline_map;
        +-------+------------------+-----------------------------+--+
        |  foo  |       _c1        |             _c2             |
        +-------+------------------+-----------------------------+--+
        | 1     | ["key2","key1"]  | ["value2","value1"]         |
        | NULL  | NULL             | NULL                        |
        | 2     | ["key2","key1"]  | ["new value2","new value"]  |
        +-------+------------------+-----------------------------+--+
        3 rows selected (28.05 seconds)
        

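      The second result set is wrong: it reports three rows for a table that holds two. Its first row is incorrect (the map value is truncated to "value1", losing everything from the newline onward), and its second row is a spurious all-NULL entry. The first query (select *), by contrast, returns both rows with the newline-containing value intact. I also encountered this when running the query over JDBC.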
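
      The ticket does not state a root cause. The symptom (one extra row, with the first row cut off at the newline) is consistent with rows passing through a newline-delimited text representation somewhere between the Avro reader and the client. The following is only a toy model of that hypothesis, not Hive code; the function names and the tab/comma encoding are assumptions made for illustration.

```python
# Toy model (assumption, not Hive internals): rows are written as
# tab-separated text, one row per line, without escaping newlines.

def encode_row(foo, keys, values):
    """Encode one row as text; embedded newlines are NOT escaped."""
    return "\t".join([str(foo), ",".join(keys), ",".join(values)])


def decode_rows(text):
    """Decode assuming exactly one row per line."""
    return [line.split("\t") for line in text.split("\n")]


# The two records from the table above, as (foo, map keys, map values).
rows = [
    (1, ["key2", "key1"], ["value2", "value1\nafter newline"]),
    (2, ["key2", "key1"], ["new value2", "new value"]),
]

text = "\n".join(encode_row(*row) for row in rows)
decoded = decode_rows(text)

for fields in decoded:
    print(fields)
# Two rows go in, three come out: the first is truncated at the newline,
# and the leftover fragment becomes a malformed row of its own.
```

      In this model the middle "row" has a single field that cannot be parsed into the expected columns, which would match the all-NULL row in the second result set.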

      Attachments

        1. test.json
          0.1 kB
          David Nies
        2. test.schema
          0.2 kB
          David Nies
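
        The attachments themselves are not reproduced on this page. As a sketch, the script below writes stand-in input files: the schema is copied from the avro.schema.literal in step 3, while the two records are inferred from the first SELECT output and are an assumption about what test.json contains. The helper name write_inputs is hypothetical.

```python
import json

# Schema copied from the avro.schema.literal in the CREATE TABLE statement.
SCHEMA = {
    "type": "record",
    "name": "myEntry",
    "namespace": "myNamespace",
    "fields": [
        {"name": "foo", "type": "long"},
        {"name": "bar", "type": {"type": "map", "values": "string"}},
    ],
}

# Records inferred from the query output (assumption, not the real attachment).
RECORDS = [
    {"foo": 1, "bar": {"key1": "value1\nafter newline", "key2": "value2"}},
    {"foo": 2, "bar": {"key1": "new value", "key2": "new value2"}},
]


def write_inputs(schema_path="test.schema", json_path="test.json"):
    """Write the schema file and one JSON record per line."""
    with open(schema_path, "w") as f:
        json.dump(SCHEMA, f, indent=2)
    with open(json_path, "w") as f:
        for record in RECORDS:
            # json.dumps escapes the embedded newline as \n,
            # so each record stays on a single line of the file.
            f.write(json.dumps(record) + "\n")


if __name__ == "__main__":
    write_inputs()
```

        The resulting files can then be passed to the avro-tools fromjson command from step 1.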

            People

              Assignee: Sahil Takiar (stakiar)
              Reporter: David Nies (Sh4pe)