Hive / HIVE-14044

Newlines in Avro maps cause external table to return corrupt values


    Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None
    • Environment:

      Hive version: 1.1.0-cdh5.5.1 (bundled with cloudera 5.5.1)

      Description

      When the Avro files that back an external table contain \n characters, SELECT queries may return corrupt results. I encountered this error when querying Hive both from beeline and over JDBC.

      Steps to reproduce (used files are attached to ticket)

      1. Create an .avro file that contains newline characters in a value of a map:
        avro-tools fromjson --schema-file test.schema test.json > test.avro
        
      2. Copy .avro file to HDFS
        hdfs dfs -copyFromLocal test.avro /some/location/
        
      3. Create an external table in beeline containing this .avro:
        beeline> CREATE EXTERNAL TABLE broken_newline_map
        ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
        STORED AS
        INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
        OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
        LOCATION '/some/location/'
        TBLPROPERTIES ('avro.schema.literal'='
        {
          "type" : "record",
          "name" : "myEntry",
          "namespace" : "myNamespace",
          "fields" : [ {
            "name" : "foo",
            "type" : "long"
          }, {
            "name" : "bar",
            "type" : {
              "type" : "map",
              "values" : "string"
            }
          } ]
        }
        ');
        
      4. Now, selecting may return corrupt results:
        jdbc:hive2://my-server:10000/> select * from broken_newline_map;
        +-------------------------+---------------------------------------------------+--+
        | broken_newline_map.foo  |              broken_newline_map.bar               |
        +-------------------------+---------------------------------------------------+--+
        | 1                       | {"key2":"value2","key1":"value1\nafter newline"}  |
        | 2                       | {"key2":"new value2","key1":"new value"}          |
        +-------------------------+---------------------------------------------------+--+
        2 rows selected (1.661 seconds)
        
        jdbc:hive2://my-server:10000/> select foo, map_keys(bar), map_values(bar) from broken_newline_map;
        +-------+------------------+-----------------------------+--+
        |  foo  |       _c1        |             _c2             |
        +-------+------------------+-----------------------------+--+
        | 1     | ["key2","key1"]  | ["value2","value1"]         |
        | NULL  | NULL             | NULL                        |
        | 2     | ["key2","key1"]  | ["new value2","new value"]  |
        +-------+------------------+-----------------------------+--+
        3 rows selected (28.05 seconds)
        

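      The second result set is wrong in two ways: it returns three rows instead of two, with a spurious all-NULL row (row 2), and the value for key1 in row 1 is truncated at the newline ("value1" instead of "value1\nafter newline"). The same happens when the query is issued over JDBC.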
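
      The attached test.schema and test.json are not reproduced in the ticket text. As a minimal sketch for regenerating equivalent inputs for step 1, the following assumes the schema matches the avro.schema.literal from step 3 and infers the two records from the output of the first SELECT; the actual attachments may differ in key order or whitespace, and the helper name write_inputs is hypothetical.

```python
import json

# Schema copied from the avro.schema.literal used in step 3.
SCHEMA = {
    "type": "record",
    "name": "myEntry",
    "namespace": "myNamespace",
    "fields": [
        {"name": "foo", "type": "long"},
        {"name": "bar", "type": {"type": "map", "values": "string"}},
    ],
}

# Records inferred from the first SELECT output (assumption: the attached
# test.json may not be byte-identical to what this produces).
RECORDS = [
    {"foo": 1, "bar": {"key1": "value1\nafter newline", "key2": "value2"}},
    {"foo": 2, "bar": {"key1": "new value", "key2": "new value2"}},
]


def write_inputs(schema_path="test.schema", json_path="test.json"):
    """Write the schema file and a JSON file with one record per line."""
    with open(schema_path, "w") as f:
        json.dump(SCHEMA, f, indent=2)
    with open(json_path, "w") as f:
        for record in RECORDS:
            # json.dumps escapes the embedded newline as the two characters
            # backslash + n, so each record stays on one physical line.
            f.write(json.dumps(record) + "\n")
```

      Calling write_inputs() in an empty directory and then running the avro-tools command from step 1 should give a test.avro whose first record carries a real newline inside the map value for key1.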

        Attachments

        1. test.schema
          0.2 kB
          David Nies
        2. test.json
          0.1 kB
          David Nies

              People

              • Assignee:
                Sahil Takiar (stakiar)
              • Reporter:
                David Nies (Sh4pe)
              • Votes:
                0
              • Watchers:
                5
