
    Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: Impala 4.0.0
    • Component/s: Frontend
    • Labels:

      Description

      We currently detect corrupt stats (0 rows but data in the partition) but only flag them; the 0 row count is still used for planning. I ran into a scenario where this led to an extremely pathological plan: the 0 row count led the planner to flip a nested loop join so that the big table ended up on the build side, and the query ran out of memory.

      I propose doing something very conservative to avoid this scenario: if we see corrupt stats in any partition, and the row count is computed to be zero, ignore the row count and treat it the same as missing stats in the planner.
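
      The proposed rule can be sketched roughly as follows. This is only an illustration in Python, not the actual frontend change: `PartitionStats`, `has_corrupt_stats`, `planner_row_count` and the `-1` sentinel are hypothetical names, and treating any partition without stats as "unknown" is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import Iterable

UNKNOWN_ROW_COUNT = -1  # hypothetical sentinel meaning "treat as missing stats"


@dataclass
class PartitionStats:
    """Hypothetical per-partition stats record (not Impala's actual class)."""
    num_rows: int    # -1 when the partition has no row-count stats
    size_bytes: int  # total size of the partition's data files


def has_corrupt_stats(p: PartitionStats) -> bool:
    """Corrupt stats as described above: 0 rows, but data in the partition."""
    return p.num_rows == 0 and p.size_bytes > 0


def planner_row_count(partitions: Iterable[PartitionStats]) -> int:
    """Row count the planner should use, or UNKNOWN_ROW_COUNT."""
    parts = list(partitions)
    # Simplifying assumption: any partition without stats makes the total unknown.
    if any(p.num_rows < 0 for p in parts):
        return UNKNOWN_ROW_COUNT
    total = sum(p.num_rows for p in parts)
    # The conservative rule: a computed total of zero is not trusted
    # if any partition has corrupt stats.
    if total == 0 and any(has_corrupt_stats(p) for p in parts):
        return UNKNOWN_ROW_COUNT
    return total
```

      With this sketch, a single partition reporting 0 rows and a 7.75KB file (7936 bytes) is treated as having missing stats, while a table whose other partitions report a non-zero row count keeps its computed total.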

      Here's an example where we end up with corrupt stats. Warning: this can remove the data file from your alltypes table. I recommend copying the file to a different location before running this.

      # In beeline against HS2
      !connect jdbc:hive2://localhost:11050 hive org.apache.hive.jdbc.HiveDriver
      set hive.stats.autogather=true;
      CREATE TABLE `alltypes_insert_only`(
         `id` int COMMENT 'Add a comment',
         `bool_col` boolean,
         `tinyint_col` tinyint,
         `smallint_col` smallint,
         `int_col` int,
         `bigint_col` bigint,
         `float_col` float,
         `double_col` double,
         `date_string_col` string,
         `string_col` string,
         `timestamp_col` timestamp)
       PARTITIONED BY (
         `year` int,
         `month` int)
       STORED AS PARQUET
       TBLPROPERTIES ("transactional"="true", "transactional_properties"="insert_only");
      load data inpath 'hdfs://172.19.0.1:20500/test-warehouse/alltypes_parquet/year=2009/month=1/154473eafa08ea0e-f9d70e7100000004_1040780996_data.0.parq' into table alltypes_insert_only partition (year=2009,month=9);
      
      # In Impala
      show table stats alltypes_insert_only;
      +-------+-------+-------+--------+--------+--------------+-------------------+---------+-------------------+----------------------------------------------------------------------------------------+
      | year  | month | #Rows | #Files | Size   | Bytes Cached | Cache Replication | Format  | Incremental stats | Location                                                                               |
      +-------+-------+-------+--------+--------+--------------+-------------------+---------+-------------------+----------------------------------------------------------------------------------------+
      | 2009  | 10    | 0     | 1      | 7.75KB | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://172.19.0.1:20500/test-warehouse/managed/alltypes_insert_only/year=2009/month=10 |
      | Total |       | -1    | 1      | 7.75KB | 0B           |                   |         |                   |                                                                                        |
      +-------+-------+-------+--------+--------+--------------+-------------------+---------+-------------------+----------------------------------------------------------------------------------------+
      

            People

            • Assignee: Qifan Chen (sql_forever)
            • Reporter: Tim Armstrong (tarmstrong)