Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-1806

Impala is unable to read hive created parquet field: interpreting INT8 as BYTE_ARRAY

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: Impala 2.1.1
    • Fix Version/s: Impala 2.2
    • Component/s: None
    • Labels:
      None

      Description

      I have created a parquet table using hive and here is the table structure:

      0: jdbc:hive2://shashmi-test-1.vpc.cloudera.c> show create table newparquethiveclient_impalaclient_uncompressed;
      +---------------------------------------------------------------------------------------------------------------------+--+
      |                                                   createtab_stmt                                                    |
      +---------------------------------------------------------------------------------------------------------------------+--+
      | CREATE TABLE `newparquethiveclient_impalaclient_uncompressed`(                                                      |
      |   `row_id` tinyint,                                                                                                 |
      |   `int8_field` tinyint,                                                                                             |
      |   `int16_field` smallint,                                                                                           |
      |   `int32_field` int,                                                                                                |
      |   `int64_field` bigint,                                                                                             |
      |   `float_field` float,                                                                                              |
      |   `double_field` double,                                                                                            |
      |   `char_field` char(4),                                                                                             |
      |   `varchar_field` varchar(4),                                                                                       |
      |   `string_field` string,                                                                                            |
      |   `bool_field` boolean)                                                                                             |
      | ROW FORMAT SERDE                                                                                                    |
      |   'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'                                                     |
      | STORED AS INPUTFORMAT                                                                                               |
      |   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'                                                   |
      | OUTPUTFORMAT                                                                                                        |
      |   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'                                                  |
      | LOCATION                                                                                                            |
      |   'hdfs://shashmi-test-1.vpc.cloudera.com:8020/user/hive/warehouse/newparquethiveclient_impalaclient_uncompressed'  |
      | TBLPROPERTIES (                                                                                                     |
      |   'transient_lastDdlTime'='1424742425')                                                                             |
      +---------------------------------------------------------------------------------------------------------------------+--+
      22 rows selected (0.062 seconds)
      

      If I select from this table using hive, I can fetch the records correctly:

      0: jdbc:hive2://shashmi-test-1.vpc.cloudera.c> SELECT row_id,int8_field,int16_field,int32_field,int64_field,float_field,double_field,char_field,varchar_field,string_field,bool_field FROM newparquethiveclient_impalaclient_uncompressed ORDER BY row_id LIMIT 10;
      INFO  : Number of reduce tasks determined at compile time: 1
      INFO  : In order to change the average load for a reducer (in bytes):
      INFO  :   set hive.exec.reducers.bytes.per.reducer=<number>
      INFO  : In order to limit the maximum number of reducers:
      INFO  :   set hive.exec.reducers.max=<number>
      INFO  : In order to set a constant number of reducers:
      INFO  :   set mapreduce.job.reduces=<number>
      WARN  : Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
      INFO  : number of splits:1
      INFO  : Submitting tokens for job: job_1424044706702_0055
      INFO  : The url to track the job: http://shashmi-test-1.vpc.cloudera.com:8088/proxy/application_1424044706702_0055/
      INFO  : Starting Job = job_1424044706702_0055, Tracking URL = http://shashmi-test-1.vpc.cloudera.com:8088/proxy/application_1424044706702_0055/
      INFO  : Kill Command = /opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.796/lib/hadoop/bin/hadoop job  -kill job_1424044706702_0055
      INFO  : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
      INFO  : 2015-02-23 17:47:36,477 Stage-1 map = 0%,  reduce = 0%
      INFO  : 2015-02-23 17:47:42,739 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.63 sec
      INFO  : 2015-02-23 17:47:49,935 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.01 sec
      INFO  : MapReduce Total cumulative CPU time: 3 seconds 10 msec
      INFO  : Ended Job = job_1424044706702_0055
      +---------+-------------+--------------+--------------+-----------------------+--------------+---------------+-------------+----------------+-----------------+-------------+--+
      | row_id  | int8_field  | int16_field  | int32_field  |      int64_field      | float_field  | double_field  | char_field  | varchar_field  |  string_field   | bool_field  |
      +---------+-------------+--------------+--------------+-----------------------+--------------+---------------+-------------+----------------+-----------------+-------------+--+
      | 1       | 4           | 4            | 4            | 4                     | 4.0          | 4.0           | ab          | cd             | test string     | false       |
      | 2       | -4          | -4           | -4           | -4                    | -4.0         | -4.0          | abcd        | abcd           | another string  | false       |
      | 3       | 0           | 0            | 0            | 0                     | 0.0          | 0.0           | a           | a              |                 | true        |
      | 4       | 127         | 32767        | 2147483647   | 9223372036854775807   | 14.0         | 14.0          | a           | a              | a               | true        |
      | 5       | -128        | -32768       | -2147483648  | -9223372036854775808  | NULL         | NULL          | null        | null           | null            | true        |
      +---------+-------------+--------------+--------------+-----------------------+--------------+---------------+-------------+----------------+-----------------+-------------+--+
      5 rows selected (21.385 seconds)
      

      But when I try to select using Impala, I am getting following exception:

      [shashmi-test-2.vpc.cloudera.com:21000] > SELECT row_id,int8_field,int16_field,int32_field,int64_field,float_field,double_field,char_field,varchar_field,string_field,bool_field FROM newparquethiveclient_impalaclient_uncompressed ORDER BY row_id LIMIT 10;
      Query: select row_id,int8_field,int16_field,int32_field,int64_field,float_field,double_field,char_field,varchar_field,string_field,bool_field FROM newparquethiveclient_impalaclient_uncompressed ORDER BY row_id LIMIT 10
      WARNINGS: File 'hdfs://shashmi-test-1.vpc.cloudera.com:8020/user/hive/warehouse/newparquethiveclient_impalaclient_uncompressed/000000_0' has an incompatible type with the table schema for column 'row_id'.  Expected type: INT32.  Actual type: BYTE_ARRAY
      
      
      File 'hdfs://shashmi-test-1.vpc.cloudera.com:8020/user/hive/warehouse/newparquethiveclient_impalaclient_uncompressed/000000_0' has an incompatible type with the table schema for column 'row_id'.  Expected type: INT32.  Actual type: BYTE_ARRAY
      

      However If I run show create table from Impala, it shows the right data type:

      Query: show create table newparquethiveclient_impalaclient_uncompressed
      +-------------------------------------------------------------------------------------------------------------------------------------------------------------+
      | result                                                                                                                                                      |
      +-------------------------------------------------------------------------------------------------------------------------------------------------------------+
      | CREATE TABLE default.newparquethiveclient_impalaclient_uncompressed (                                                                                       |
      |   row_id TINYINT,                                                                                                                                           |
      |   int8_field TINYINT,                                                                                                                                       |
      |   int16_field SMALLINT,                                                                                                                                     |
      |   int32_field INT,                                                                                                                                          |
      |   int64_field BIGINT,                                                                                                                                       |
      |   float_field FLOAT,                                                                                                                                        |
      |   double_field DOUBLE,                                                                                                                                      |
      |   char_field CHAR(4),                                                                                                                                       |
      |   varchar_field VARCHAR(4),                                                                                                                                 |
      |   string_field STRING,                                                                                                                                      |
      |   bool_field BOOLEAN                                                                                                                                        |
      | )                                                                                                                                                           |
      | WITH SERDEPROPERTIES ('serialization.format'='1')                                                                                                           |
      | STORED AS PARQUET                                                                                                                                           |
      | LOCATION 'hdfs://shashmi-test-1.vpc.cloudera.com:8020/user/hive/warehouse/newparquethiveclient_impalaclient_uncompressed'                                   |
      | TBLPROPERTIES ('numFiles'='1', 'COLUMN_STATS_ACCURATE'='true', 'transient_lastDdlTime'='1424742442', 'numRows'='5', 'totalSize'='1614', 'rawDataSize'='55') |
      +-------------------------------------------------------------------------------------------------------------------------------------------------------------+
      

      Repro file is attached as well.

        Attachments

        1. impala_repro.parquet
          2 kB
          Syed A. Hashmi

          Activity

            People

            • Assignee:
              skye Skye Wanderman-Milne
              Reporter:
              shashmi Syed A. Hashmi
            • Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: