Details
Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Fixed
Description
We currently detect corrupt stats (0 rows but data in the partition) but only flag them; the 0 row count is still used for planning. I ran into a scenario where this led to an extremely pathological plan: the 0 row count caused the planner to flip a nested loop join so that the big table ended up on the build side, and the query ran out of memory.
I propose doing something very conservative to avoid this scenario: if we see corrupt stats in any partition, and the row count is computed to be zero, ignore the row count and treat it the same as missing stats in the planner.
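The proposed rule can be sketched as below. This is only an illustration in Python, not the planner's actual code: `PartitionStats`, `has_corrupt_stats`, `planner_row_count` and the `-1` sentinel for unknown stats are hypothetical names chosen for this sketch, and treating a table as unknown when any partition lacks stats is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable

# Hypothetical sentinel meaning "no usable row count" (missing stats).
UNKNOWN_ROW_COUNT = -1


@dataclass(frozen=True)
class PartitionStats:
    num_rows: int     # row count recorded in stats, -1 if missing
    total_bytes: int  # total size of the partition's data files


def has_corrupt_stats(p: PartitionStats) -> bool:
    # Corrupt as described in this issue: stats claim 0 rows,
    # yet the partition contains data.
    return p.num_rows == 0 and p.total_bytes > 0


def planner_row_count(partitions: Iterable[PartitionStats]) -> int:
    parts = list(partitions)
    # Assumption: one partition without stats makes the total unknown.
    if any(p.num_rows < 0 for p in parts):
        return UNKNOWN_ROW_COUNT
    total = sum(p.num_rows for p in parts)
    # Conservative rule: a computed total of zero is not trusted if any
    # partition has corrupt stats; treat it the same as missing stats.
    if total == 0 and any(has_corrupt_stats(p) for p in parts):
        return UNKNOWN_ROW_COUNT
    return total
```

Note that the rule only applies when the computed total is zero. A table with one corrupt partition and other partitions with non-zero counts keeps its (underestimated) total, which is what makes the change conservative.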
Here's an example where we end up with corrupt stats. Warning: this can remove the data file from your alltypes table, so I recommend copying the file to a different location before running this.
# In beeline against HS2
!connect jdbc:hive2://localhost:11050 hive org.apache.hive.jdbc.HiveDrive
set hive.stats.autogather=true;
CREATE TABLE `alltypes_insert_only`(
  `id` int COMMENT 'Add a comment',
  `bool_col` boolean,
  `tinyint_col` tinyint,
  `smallint_col` smallint,
  `int_col` int,
  `bigint_col` bigint,
  `float_col` float,
  `double_col` double,
  `date_string_col` string,
  `string_col` string,
  `timestamp_col` timestamp)
PARTITIONED BY (
  `year` int,
  `month` int)
STORED AS PARQUET
TBLPROPERTIES ("transactional"="true", "transactional_properties"="insert_only");
load data inpath 'hdfs://172.19.0.1:20500/test-warehouse/alltypes_parquet/year=2009/month=1/154473eafa08ea0e-f9d70e7100000004_1040780996_data.0.parq' into table alltypes_insert_only partition (year=2009,month=9);

# In Impala
show table stats alltypes_insert_only;
+-------+-------+-------+--------+--------+--------------+-------------------+---------+-------------------+----------------------------------------------------------------------------------------+
| year  | month | #Rows | #Files | Size   | Bytes Cached | Cache Replication | Format  | Incremental stats | Location                                                                               |
+-------+-------+-------+--------+--------+--------------+-------------------+---------+-------------------+----------------------------------------------------------------------------------------+
| 2009  | 10    | 0     | 1      | 7.75KB | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://172.19.0.1:20500/test-warehouse/managed/alltypes_insert_only/year=2009/month=10 |
| Total |       | -1    | 1      | 7.75KB | 0B           |                   |         |                   |                                                                                        |
+-------+-------+-------+--------+--------+--------------+-------------------+---------+-------------------+----------------------------------------------------------------------------------------+