Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-4894

Impala changes table schema when you do ALTER TABLE .. ADD COLUMNS but not partition schema resulting in false NULL values in Hive.

Attach filesAttach ScreenshotAdd voteVotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • Impala 2.2.4
    • None
    • Catalog
    • None

    Description

      Adding a column in IMPALA for a partitioned table causes Hive to display "NULL" values for the newly added column.

      == Reproduction Details ==

      <<< 1. Setup using impala.

      – Impala Setup –

      create table parttest (a int) partitioned by (c int);
      insert into table parttest partition (c=1) values (1);
      insert into table parttest partition (c=2) values (2);
      alter table parttest add columns (b int);
      insert into table parttest partition (c=1) values (3,3);
      

      <<< 2. Query using impala and hive to see the different results.

      – Query Results –

      *Note that Hive reports value of "b" as NULL for "C=1" whereas Impala correctly reports it as 3.
      *

      • Impala -
      [jrepo7-2:21000] > select * from parttest;
      Query: select * from parttest
      +---+------+---+
      | a | b    | c |
      +---+------+---+
      | 3 | 3    | 1 |
      | 2 | NULL | 2 |
      | 1 | NULL | 1 |
      +---+------+---+
      
      • Hive -
      hive> select * from parttest;
      OK
      1  NULL  1
      3  NULL  1
      2  NULL  2
      
      

      <<< 3. Do the following describes in hive to see the metadata difference.

      hive> desc parttest;
      OK
      a                     int
      b                     int
      c                     int
      
      # Partition Information
      # col_name              data_type             comment
      
      c                     int
      
      • Describe partition
      hive> desc parttest partition (c=1);
      OK
      a                     int
      c                     int
      
      # Partition Information
      # col_name              data_type             comment
      
      c                     int
      

      <<< 3.a. Alternatively you can do the following in the metastore database to see the difference.

      select tbl.TBL_NAME, tbl.TBL_ID, tbl.SD_ID table_sd_id, col.CD_ID, col.COLUMN_NAME from TBLS tbl join SDS sd on tbl.SD_ID = sd.SD_ID join COLUMNS_V2 col on sd.CD_ID = col.CD_ID where tbl.TBL_NAME = "parttest";
      
      select part.SD_ID partition_sd_id, part.PART_NAME, col.COLUMN_NAME from PARTITIONS part join TBLS tbl on part.TBL_ID = tbl.TBL_ID join SDS sd on part.SD_ID = sd.SD_ID join COLUMNS_V2 col on col.CD_ID = sd.CD_ID where tbl.TBL_NAME = "parttest";
      

      Example:

      mysql> select tbl.TBL_NAME, tbl.TBL_ID, tbl.SD_ID table_sd_id, col.CD_ID, col.COLUMN_NAME from TBLS tbl join SDS sd on tbl.SD_ID = sd.SD_ID join COLUMNS_V2 col on sd.CD_ID = col.CD_ID where tbl.TBL_NAME = "parttest";
      +----------+--------+-------------+-------+-------------+
      | TBL_NAME | TBL_ID | table_sd_id | CD_ID | COLUMN_NAME |
      +----------+--------+-------------+-------+-------------+
      | parttest |   3775 |       11273 |  3781 | a           |
      | parttest |   3775 |       11273 |  3781 | b           |
      +----------+--------+-------------+-------+-------------+
      2 rows in set (0.01 sec)
      
      mysql>
      mysql> select part.SD_ID partition_sd_id, part.PART_NAME, col.COLUMN_NAME from PARTITIONS part join TBLS tbl on part.TBL_ID = tbl.TBL_ID join SDS sd on part.SD_ID = sd.SD_ID join COLUMNS_V2 col on col.CD_ID = sd.CD_ID where tbl.TBL_NAME = "parttest";
      +-----------------+-----------+-------------+
      | partition_sd_id | PART_NAME | COLUMN_NAME |
      +-----------------+-----------+-------------+
      |           11274 | c=1       | a           |
      |           11275 | c=2       | a           |
      +-----------------+-----------+-------------+
      2 rows in set (0.00 sec)
      

      Workarounds:

      • ALTER TABLE ... CASCADE in hive instead of impala which will update metadata for table and push to partitions.
      • COMPUTE STATS INCREMENTAL in Impala for affected partitions to update metadata.

      Attachments

        Issue Links

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            Unassigned Unassigned
            SudarshanS Sudarshan

            Dates

              Created:
              Updated:

              Slack

                Issue deployment