Details
-
Bug
-
Status: Open
-
Major
-
Resolution: Unresolved
-
Impala 2.2.4
-
None
-
None
Description
Adding a column in IMPALA for a partitioned table causes Hive to display "NULL" values for the newly added column.
== Reproduction Details ==
<<< 1. Setup using impala.
– Impala Setup –
create table parttest (a int) partitioned by (c int); insert into table parttest partition (c=1) values (1); insert into table parttest partition (c=2) values (2); alter table parttest add columns (b int); insert into table parttest partition (c=1) values (3,3);
<<< 2. Query using impala and hive to see the different results.
– Query Results –
*Note that Hive reports value of "b" as NULL for "C=1" whereas Impala correctly reports it as 3.
*
- Impala -
[jrepo7-2:21000] > select * from parttest; Query: select * from parttest +---+------+---+ | a | b | c | +---+------+---+ | 3 | 3 | 1 | | 2 | NULL | 2 | | 1 | NULL | 1 | +---+------+---+
- Hive -
hive> select * from parttest; OK 1 NULL 1 3 NULL 1 2 NULL 2
<<< 3. Do the following describes in hive to see the metadata difference.
hive> desc parttest; OK a int b int c int # Partition Information # col_name data_type comment c int
- Describe partition
hive> desc parttest partition (c=1); OK a int c int # Partition Information # col_name data_type comment c int
<<< 3.a. Alternatively you can do the following in the metastore database to see the difference.
select tbl.TBL_NAME, tbl.TBL_ID, tbl.SD_ID table_sd_id, col.CD_ID, col.COLUMN_NAME from TBLS tbl join SDS sd on tbl.SD_ID = sd.SD_ID join COLUMNS_V2 col on sd.CD_ID = col.CD_ID where tbl.TBL_NAME = "parttest"; select part.SD_ID partition_sd_id, part.PART_NAME, col.COLUMN_NAME from PARTITIONS part join TBLS tbl on part.TBL_ID = tbl.TBL_ID join SDS sd on part.SD_ID = sd.SD_ID join COLUMNS_V2 col on col.CD_ID = sd.CD_ID where tbl.TBL_NAME = "parttest";
Example:
mysql> select tbl.TBL_NAME, tbl.TBL_ID, tbl.SD_ID table_sd_id, col.CD_ID, col.COLUMN_NAME from TBLS tbl join SDS sd on tbl.SD_ID = sd.SD_ID join COLUMNS_V2 col on sd.CD_ID = col.CD_ID where tbl.TBL_NAME = "parttest"; +----------+--------+-------------+-------+-------------+ | TBL_NAME | TBL_ID | table_sd_id | CD_ID | COLUMN_NAME | +----------+--------+-------------+-------+-------------+ | parttest | 3775 | 11273 | 3781 | a | | parttest | 3775 | 11273 | 3781 | b | +----------+--------+-------------+-------+-------------+ 2 rows in set (0.01 sec) mysql> mysql> select part.SD_ID partition_sd_id, part.PART_NAME, col.COLUMN_NAME from PARTITIONS part join TBLS tbl on part.TBL_ID = tbl.TBL_ID join SDS sd on part.SD_ID = sd.SD_ID join COLUMNS_V2 col on col.CD_ID = sd.CD_ID where tbl.TBL_NAME = "parttest"; +-----------------+-----------+-------------+ | partition_sd_id | PART_NAME | COLUMN_NAME | +-----------------+-----------+-------------+ | 11274 | c=1 | a | | 11275 | c=2 | a | +-----------------+-----------+-------------+ 2 rows in set (0.00 sec)
Workarounds:
- ALTER TABLE ... CASCADE in hive instead of impala which will update metadata for table and push to partitions.
- COMPUTE STATS INCREMENTAL in Impala for affected partitions to update metadata.