[IMPALA-1806] Impala is unable to read hive created parquet field: interpreting INT8 as BYTE_ARRAY - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Fixed
Affects Version/s: Impala 2.1.1
Fix Version/s: Impala 2.2
Component/s: None
Labels:
None

Description

I have created a parquet table using hive and here is the table structure:

0: jdbc:hive2://shashmi-test-1.vpc.cloudera.c> show create table newparquethiveclient_impalaclient_uncompressed;
+---------------------------------------------------------------------------------------------------------------------+--+
|                                                   createtab_stmt                                                    |
+---------------------------------------------------------------------------------------------------------------------+--+
| CREATE TABLE `newparquethiveclient_impalaclient_uncompressed`(                                                      |
|   `row_id` tinyint,                                                                                                 |
|   `int8_field` tinyint,                                                                                             |
|   `int16_field` smallint,                                                                                           |
|   `int32_field` int,                                                                                                |
|   `int64_field` bigint,                                                                                             |
|   `float_field` float,                                                                                              |
|   `double_field` double,                                                                                            |
|   `char_field` char(4),                                                                                             |
|   `varchar_field` varchar(4),                                                                                       |
|   `string_field` string,                                                                                            |
|   `bool_field` boolean)                                                                                             |
| ROW FORMAT SERDE                                                                                                    |
|   'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'                                                     |
| STORED AS INPUTFORMAT                                                                                               |
|   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'                                                   |
| OUTPUTFORMAT                                                                                                        |
|   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'                                                  |
| LOCATION                                                                                                            |
|   'hdfs://shashmi-test-1.vpc.cloudera.com:8020/user/hive/warehouse/newparquethiveclient_impalaclient_uncompressed'  |
| TBLPROPERTIES (                                                                                                     |
|   'transient_lastDdlTime'='1424742425')                                                                             |
+---------------------------------------------------------------------------------------------------------------------+--+
22 rows selected (0.062 seconds)

If I select from this table using hive, I can fetch the records correctly:

0: jdbc:hive2://shashmi-test-1.vpc.cloudera.c> SELECT row_id,int8_field,int16_field,int32_field,int64_field,float_field,double_field,char_field,varchar_field,string_field,bool_field FROM newparquethiveclient_impalaclient_uncompressed ORDER BY row_id LIMIT 10;
INFO  : Number of reduce tasks determined at compile time: 1
INFO  : In order to change the average load for a reducer (in bytes):
INFO  :   set hive.exec.reducers.bytes.per.reducer=<number>
INFO  : In order to limit the maximum number of reducers:
INFO  :   set hive.exec.reducers.max=<number>
INFO  : In order to set a constant number of reducers:
INFO  :   set mapreduce.job.reduces=<number>
WARN  : Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
INFO  : number of splits:1
INFO  : Submitting tokens for job: job_1424044706702_0055
INFO  : The url to track the job: http://shashmi-test-1.vpc.cloudera.com:8088/proxy/application_1424044706702_0055/
INFO  : Starting Job = job_1424044706702_0055, Tracking URL = http://shashmi-test-1.vpc.cloudera.com:8088/proxy/application_1424044706702_0055/
INFO  : Kill Command = /opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.796/lib/hadoop/bin/hadoop job  -kill job_1424044706702_0055
INFO  : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
INFO  : 2015-02-23 17:47:36,477 Stage-1 map = 0%,  reduce = 0%
INFO  : 2015-02-23 17:47:42,739 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.63 sec
INFO  : 2015-02-23 17:47:49,935 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.01 sec
INFO  : MapReduce Total cumulative CPU time: 3 seconds 10 msec
INFO  : Ended Job = job_1424044706702_0055
+---------+-------------+--------------+--------------+-----------------------+--------------+---------------+-------------+----------------+-----------------+-------------+--+
| row_id  | int8_field  | int16_field  | int32_field  |      int64_field      | float_field  | double_field  | char_field  | varchar_field  |  string_field   | bool_field  |
+---------+-------------+--------------+--------------+-----------------------+--------------+---------------+-------------+----------------+-----------------+-------------+--+
| 1       | 4           | 4            | 4            | 4                     | 4.0          | 4.0           | ab          | cd             | test string     | false       |
| 2       | -4          | -4           | -4           | -4                    | -4.0         | -4.0          | abcd        | abcd           | another string  | false       |
| 3       | 0           | 0            | 0            | 0                     | 0.0          | 0.0           | a           | a              |                 | true        |
| 4       | 127         | 32767        | 2147483647   | 9223372036854775807   | 14.0         | 14.0          | a           | a              | a               | true        |
| 5       | -128        | -32768       | -2147483648  | -9223372036854775808  | NULL         | NULL          | null        | null           | null            | true        |
+---------+-------------+--------------+--------------+-----------------------+--------------+---------------+-------------+----------------+-----------------+-------------+--+
5 rows selected (21.385 seconds)

But when I try to select using Impala, I am getting following exception:

[shashmi-test-2.vpc.cloudera.com:21000] > SELECT row_id,int8_field,int16_field,int32_field,int64_field,float_field,double_field,char_field,varchar_field,string_field,bool_field FROM newparquethiveclient_impalaclient_uncompressed ORDER BY row_id LIMIT 10;
Query: select row_id,int8_field,int16_field,int32_field,int64_field,float_field,double_field,char_field,varchar_field,string_field,bool_field FROM newparquethiveclient_impalaclient_uncompressed ORDER BY row_id LIMIT 10
WARNINGS: File 'hdfs://shashmi-test-1.vpc.cloudera.com:8020/user/hive/warehouse/newparquethiveclient_impalaclient_uncompressed/000000_0' has an incompatible type with the table schema for column 'row_id'.  Expected type: INT32.  Actual type: BYTE_ARRAY


File 'hdfs://shashmi-test-1.vpc.cloudera.com:8020/user/hive/warehouse/newparquethiveclient_impalaclient_uncompressed/000000_0' has an incompatible type with the table schema for column 'row_id'.  Expected type: INT32.  Actual type: BYTE_ARRAY

However If I run show create table from Impala, it shows the right data type:

Query: show create table newparquethiveclient_impalaclient_uncompressed
+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
| result                                                                                                                                                      |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
| CREATE TABLE default.newparquethiveclient_impalaclient_uncompressed (                                                                                       |
|   row_id TINYINT,                                                                                                                                           |
|   int8_field TINYINT,                                                                                                                                       |
|   int16_field SMALLINT,                                                                                                                                     |
|   int32_field INT,                                                                                                                                          |
|   int64_field BIGINT,                                                                                                                                       |
|   float_field FLOAT,                                                                                                                                        |
|   double_field DOUBLE,                                                                                                                                      |
|   char_field CHAR(4),                                                                                                                                       |
|   varchar_field VARCHAR(4),                                                                                                                                 |
|   string_field STRING,                                                                                                                                      |
|   bool_field BOOLEAN                                                                                                                                        |
| )                                                                                                                                                           |
| WITH SERDEPROPERTIES ('serialization.format'='1')                                                                                                           |
| STORED AS PARQUET                                                                                                                                           |
| LOCATION 'hdfs://shashmi-test-1.vpc.cloudera.com:8020/user/hive/warehouse/newparquethiveclient_impalaclient_uncompressed'                                   |
| TBLPROPERTIES ('numFiles'='1', 'COLUMN_STATS_ACCURATE'='true', 'transient_lastDdlTime'='1424742442', 'numRows'='5', 'totalSize'='1614', 'rawDataSize'='55') |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------+

Repro file is attached as well.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

impala_repro.parquet
24/Feb/15 02:06
2 kB
Syed A. Hashmi

Activity

People

Assignee:: Skye Wanderman-Milne

Reporter:: Syed A. Hashmi

Votes:: 0 Vote for this issue

Watchers:: 7 Start watching this issue

Dates

Created:: 24/Feb/15 02:06

Updated:: 20/Dec/15 00:05

Resolved:: 10/Mar/15 19:36