Details
Type: Task
Status: Resolved
Priority: Major
Resolution: Fixed
Description
Currently, in HdfsScanNode.java, 'hasCorruptTableStats_' of an HDFS table is set to true when any of the following holds.
- Its 'cardinality_' is less than -1.
- The number of rows in one of its partitions is less than -1.
- The number of rows in one of its partitions is 0 but the total size of that partition's files is greater than 0.
- The number of rows in the table is 0 but the size of the associated files of this table is greater than 0.
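The conditions above can be sketched as a single predicate. This is only an illustration, not the actual code in HdfsScanNode.java: the class `CorruptStatsCheck`, the holder `PartitionStats`, and both method names are hypothetical, and the sketch assumes the table's 'cardinality_' and its row count are the same value.

```java
import java.util.List;

// Hypothetical sketch of the checks described above; names are illustrative only.
final class CorruptStatsCheck {

    // Hypothetical holder for one partition's row count and total file size in bytes.
    static final class PartitionStats {
        final long numRows;
        final long totalFileBytes;

        PartitionStats(long numRows, long totalFileBytes) {
            this.numRows = numRows;
            this.totalFileBytes = totalFileBytes;
        }
    }

    // True if the row count is below -1, or is 0 while the files are non-empty.
    static boolean looksCorrupt(long numRows, long totalFileBytes) {
        return numRows < -1 || (numRows == 0 && totalFileBytes > 0);
    }

    // Applies the same check to the table as a whole and to each partition.
    // Assumption: tableCardinality stands in for both 'cardinality_' and the table row count.
    static boolean hasCorruptTableStats(long tableCardinality, long tableFileBytes,
                                        List<PartitionStats> partitions) {
        if (looksCorrupt(tableCardinality, tableFileBytes)) {
            return true;
        }
        for (PartitionStats p : partitions) {
            if (looksCorrupt(p.numRows, p.totalFileBytes)) {
                return true;
            }
        }
        return false;
    }
}
```

Under this sketch, a table with 0 rows and non-empty files is flagged even when its statistics are accurate, which is the false positive this issue is about.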
For such a table, the EXPLAIN output of any query involving the table contains the message "WARNING: The following tables have potentially corrupt table statistics. Drop and re-compute statistics to resolve this problem."
The warning may be needlessly alarming for an Impala user, especially since Impala's frontend can set 'hasCorruptTableStats_' to true for a table whose statistics are not actually corrupt.
Specifically, a table with valid statistics but with 'hasCorruptTableStats_' set to true can be created as follows after starting the Impala cluster.
- On the command line, run "beeline -u 'jdbc:hive2://localhost:11050/default'" to enter beeline.
- Create a transactional table in beeline via "create table test_db.test_tbl_01 (id int, name string) stored as orc tblproperties ('transactional'='true')".
- Insert a row into the table just created in beeline via "insert into table test_db.test_tbl_01 values (1, "Alex");".
- Delete the row just inserted in beeline via "delete from test_db.test_tbl_01 where id = 1".
- In Impala shell, execute "compute stats test_db.test_tbl_01".
- In Impala shell, execute "explain select * from test_db.test_tbl_01" to verify that the warning message described above appears in the output.
The table 'test_tbl_01' above has 0 rows, but the total size of its associated files is greater than 0.
It may be better to revise the warning message to something less alarming, as shown below.
The following tables, or one of their partitions, have a row count that is either less than -1, or 0 with a positive total file size. This does not necessarily imply that the statistics are corrupt. If the statistics are corrupt, dropping and re-computing them could resolve this problem.