IMPALA-3963

refresh <table> and refresh <partition> are broken for boolean partitions

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Duplicate
    • Affects Version/s: Impala 2.7.0
    • Fix Version/s: None
    • Component/s: Catalog

    Description

      Impala does not refresh boolean partitions correctly if there are multiple directories corresponding to a "true" or "false" partition key value.

      The root cause of this issue is HIVE-6590. Impala can generally handle the strange metadata state caused by HIVE-6590, except when doing a refresh <table> or refresh <partition>.

      Reproduction
      In Hive:

      CREATE TABLE tbl (i INT) PARTITIONED BY (b BOOLEAN);
      INSERT OVERWRITE TABLE tbl PARTITION(b=false) VALUES(1);
      INSERT OVERWRITE TABLE tbl PARTITION(b=FALSE) VALUES(2);
      INSERT OVERWRITE TABLE tbl PARTITION(b=true) VALUES(10);
      

      In Impala:

      invalidate metadata tbl;
      show files in tbl;
      +------------------------------------------------------------+------+-----------+
      | Path                                                       | Size | Partition |
      +------------------------------------------------------------+------+-----------+
      | hdfs://localhost:20500/test-warehouse/tbl/b=false/000000_0 | 2B   | b=FALSE   |
      | hdfs://localhost:20500/test-warehouse/tbl/b=FALSE/000000_0 | 2B   | b=FALSE   |
      | hdfs://localhost:20500/test-warehouse/tbl/b=true/000000_0  | 3B   | b=TRUE    |
      +------------------------------------------------------------+------+-----------+
      
      refresh tbl;
      show files in tbl;
      +------------------------------------------------------------+------+-----------+
      | Path                                                       | Size | Partition |
      +------------------------------------------------------------+------+-----------+
      | hdfs://localhost:20500/test-warehouse/tbl/b=false/000000_0 | 2B   | b=FALSE   |
      | hdfs://localhost:20500/test-warehouse/tbl/b=FALSE/000000_0 | 2B   | b=FALSE   |
      | hdfs://localhost:20500/test-warehouse/tbl/b=false/000000_0 | 2B   | b=FALSE   |
      | hdfs://localhost:20500/test-warehouse/tbl/b=true/000000_0  | 3B   | b=TRUE    |
      +------------------------------------------------------------+------+-----------+
      

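      Notice that after the refresh the file under b=false is listed twice. Queries also return wrong results because that file is scanned twice. The three inserted rows (1, 2, and 10) should sum to 13, but the query returns 14: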

      select sum(i) from tbl;
      +--------+
      | sum(i) |
      +--------+
      | 14     |
      +--------+
      

      A similar problem occurs with refresh <partition>.

      Workaround

      • Run invalidate metadata <table> to reload the table metadata and clear the duplicated file entries.
      • Ensure that each boolean partition value has only a single corresponding HDFS directory, i.e., avoid the metadata state caused by HIVE-6590.
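
      As a minimal sketch of the first workaround, the following statements could be run in Impala against the reproduction table tbl after a bad refresh. The expected results in the comments are inferred from the reproduction above (rows 1, 2, and 10, one file per directory). They are not output captured from a real session.

```sql
-- Discard the cached metadata for the table and reload it on next access.
invalidate metadata tbl;

-- Each file should be listed exactly once again, matching the first
-- "show files" listing in the reproduction.
show files in tbl;

-- Assuming only the three reproduction rows exist, the sum should be
-- 1 + 2 + 10 = 13 rather than 14.
select sum(i) from tbl;
```

      Until the underlying issue is fixed, a later refresh tbl or refresh <partition> on the same table can be expected to bring the duplicate entries back, so invalidate metadata would need to be used in its place.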

          People

            alex.behm Alexander Behm
            anujphadke Anuj Phadke