HIVE-5850

Multiple table join error for Avro

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.11.0
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      Steps to reproduce:

      -- Create table Part.
      CREATE EXTERNAL TABLE part
      ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
      STORED AS
      INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
      OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
      LOCATION 'hdfs://<hostname>/user/hadoop/tpc-h/data/part'
      TBLPROPERTIES ('avro.schema.url'='hdfs://<hostname>/user/hadoop/tpc-h/schema/part.avsc');
      
      -- Create table Part Supplier.
      CREATE EXTERNAL TABLE partsupp
      ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
      STORED AS
      INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
      OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
      LOCATION 'hdfs://<hostname>/user/hadoop/tpc-h/data/partsupp'
      TBLPROPERTIES ('avro.schema.url'='hdfs://<hostname>/user/hadoop/tpc-h/schema/partsupp.avsc');
      -- Query.
      select * from partsupp ps join part p on ps.ps_partkey = p.p_partkey where p.p_partkey=1;
      
      The error message is:
      Error: java.io.IOException: java.io.IOException: org.apache.avro.AvroTypeException: Found {
        "type" : "record",
        "name" : "partsupp",
        "namespace" : "com.gs.sdst.pl.avro.tpch",
        "fields" : [ {
          "name" : "ps_partkey",
          "type" : "long"
        }, {
          "name" : "ps_suppkey",
          "type" : "long"
        }, {
          "name" : "ps_availqty",
          "type" : "long"
        }, {
          "name" : "ps_supplycost",
          "type" : "double"
        }, {
          "name" : "ps_comment",
          "type" : "string"
        }, {
          "name" : "systimestamp",
          "type" : "long"
        } ]
      }, expecting {
        "type" : "record",
        "name" : "part",
        "namespace" : "com.gs.sdst.pl.avro.tpch",
        "fields" : [ {
          "name" : "p_partkey",
          "type" : "long"
        }, {
          "name" : "p_name",
          "type" : "string"
        }, {
          "name" : "p_mfgr",
          "type" : "string"
        }, {
          "name" : "p_brand",
          "type" : "string"
        }, {
          "name" : "p_type",
          "type" : "string"
        }, {
          "name" : "p_size",
          "type" : "int"
        }, {
          "name" : "p_container",
          "type" : "string"
        }, {
          "name" : "p_retailprice",
          "type" : "double"
        }, {
          "name" : "p_comment",
          "type" : "string"
        }, {
          "name" : "systimestamp",
          "type" : "long"
        } ]
      }
              at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
              at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
              at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.doNextWithExceptionHandler(HadoopShimsSecure.java:302)
              at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.next(HadoopShimsSecure.java:218)
              at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:197)
              at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:183)
              at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:52)
              at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:429)
              at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
              at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:158)
              at java.security.AccessController.doPrivileged(Native Method)
              at javax.security.auth.Subject.doAs(Subject.java:415)
              at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1478)
              at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:153)
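
      The AvroTypeException above is Avro's schema-resolution failure: the record reader was handed part's schema as the reader (expected) schema, but the split it was asked to decode is a partsupp data file whose embedded writer schema does not match. A minimal standalone reproduction, independent of Hive, is sketched below (file names follow the attachments to this issue; the class name is illustrative):

      import java.io.File;

      import org.apache.avro.Schema;
      import org.apache.avro.file.DataFileReader;
      import org.apache.avro.generic.GenericDatumReader;
      import org.apache.avro.generic.GenericRecord;

      public class WrongReaderSchema {
        public static void main(String[] args) throws Exception {
          // Reader (expected) schema: the 'part' record from part.avsc.
          Schema partSchema = new Schema.Parser().parse(new File("part.avsc"));
          GenericDatumReader<GenericRecord> datumReader =
              new GenericDatumReader<>(null, partSchema);
          // The writer schema comes from the data file itself, which here is
          // a 'partsupp' file -- the same mismatch the record reader hit above.
          try (DataFileReader<GenericRecord> fileReader = new DataFileReader<>(
              new File("good_2013-01_partsupp_tbl_0002.avro"), datumReader)) {
            // Throws org.apache.avro.AvroTypeException:
            //   Found ...partsupp, expecting ...part
            fileReader.next();
          }
        }
      }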
      
      Attachments

      1. part.tar.gz
        4.12 MB
        Shengjun Xin
      2. partsupp.tar.gz
        5.02 MB
        Shengjun Xin
      3. schema.tar.gz
        0.4 kB
        Shengjun Xin

        Activity

        Shengjun Xin added a comment -

        This issue is caused by using the wrong schema when processing a split.

        In the getSchema function of AvroGenericRecordReader.java, if a partition path is a string prefix of a split path, the schema of that partition is used to parse the split, but this is not always correct.

        For example, the partition '/user/hadoop/tpc-h/data/part' is a string prefix of '/user/hadoop/tpc-h/data/partsupp/good_2013-01_partsupp_tbl_0002.avro', but we cannot use the schema of '/user/hadoop/tpc-h/data/part' to parse '/user/hadoop/tpc-h/data/partsupp/good_2013-01_partsupp_tbl_0002.avro'.

        In my opinion, a partition's schema should be used to parse a split only if the partition path is a parent of the split path.
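
        To make the proposed distinction concrete, the sketch below (illustrative helper names, not the actual AvroGenericRecordReader code) contrasts the buggy string-prefix check with a parent-path check built on Hadoop's Path:

        import org.apache.hadoop.fs.Path;

        public class SchemaPathMatch {
          // Buggy check: a plain string-prefix test, so the 'part' partition
          // directory also matches files under 'partsupp'.
          static boolean prefixMatch(String partitionDir, String splitFile) {
            return splitFile.startsWith(partitionDir);
          }

          // Proposed check: walk up the split's ancestors and require the
          // partition directory to be an actual parent of the split file.
          static boolean parentMatch(Path partitionDir, Path splitFile) {
            for (Path p = splitFile.getParent(); p != null; p = p.getParent()) {
              if (p.equals(partitionDir)) {
                return true;
              }
            }
            return false;
          }

          public static void main(String[] args) {
            Path part = new Path("/user/hadoop/tpc-h/data/part");
            Path split = new Path(
                "/user/hadoop/tpc-h/data/partsupp/good_2013-01_partsupp_tbl_0002.avro");
            System.out.println(prefixMatch(part.toString(), split.toString())); // true  -> wrong schema chosen
            System.out.println(parentMatch(part, split));                       // false -> partition correctly skipped
          }
        }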

        Jakob Homan added a comment -

        This has been a recurring problem. The code to figure out what schema goes where has been problematic and the information passed to the mapper has changed from Hive version to Hive version. Using the parent may not always get the latest schema, yes?

        Shengjun Xin added a comment -

        So could you please give me an example where the parent cannot get the latest schema?


          People

          • Assignee: Unassigned
          • Reporter: Shengjun Xin
          • Votes: 1
          • Watchers: 5
