Details
Bug
Status: Resolved
Critical
Resolution: Fixed
Impala 2.6.0
CDH 5.8.2 running impala 2.6.0-cdh5.8.2
Description
According to the docs (https://www.cloudera.com/documentation/enterprise/5-8-x/topics/impala_perf_stats.html), under the section "Setting Column Stats Manually through ALTER TABLE", column stats can be set manually. This capability was introduced as part of IMPALA-3369.
However, setting column stats manually appears to remove the table-level stats.
To reproduce:
Create a table in hive with a single column.
hive> create table t(c int);
Insert 1 row of data using hive:
hive> insert into t values (1);
Compute table level stats using hive:
hive> analyze table t compute statistics;
Running describe formatted in hive should show 1 row for numRows:
hive> describe formatted t;
# col_name data_type comment
c int
# Detailed Table Information
Table Type: MANAGED_TABLE
Table Parameters:
numFiles 1
numRows 1
rawDataSize 1
totalSize 12
transient_lastDdlTime 1484319025
Running show table stats in impala should show the same value of 1 for #Rows:
[impala:21000] > show table stats t;
Query: show table stats t
+-------+--------+------+--------------+-------------------+--------+-------------------+--------------------------------------------------------+
| #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location                                               |
+-------+--------+------+--------------+-------------------+--------+-------------------+--------------------------------------------------------+
| 1     | 1      | 12B  | NOT CACHED   | NOT CACHED        | TEXT   | false             | hdfs://....t                                           |
+-------+--------+------+--------------+-------------------+--------+-------------------+--------------------------------------------------------+
Now manually set column stats on column 'c':
[impala:21000] > alter table t set column stats c ('numdvs'='1');
View the column stats and see that '#Distinct Values' is now set to 1:
[impala:21000] > show column stats t;
Query: show column stats t
+--------+------+------------------+--------+----------+----------+
| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
+--------+------+------------------+--------+----------+----------+
| c      | INT  | 1                | -1     | 4        | 4        |
+--------+------+------------------+--------+----------+----------+
But the table-level stats now appear to be lost. Show table stats in Impala now reports -1 for #Rows:
[impala:21000] > show table stats t;
Query: show table stats t
+-------+--------+------+--------------+-------------------+--------+-------------------+--------------------------------------------------------+
| #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location                                               |
+-------+--------+------+--------------+-------------------+--------+-------------------+--------------------------------------------------------+
| -1    | 1      | 12B  | NOT CACHED   | NOT CACHED        | TEXT   | false             | hdfs://...t                                            |
+-------+--------+------+--------------+-------------------+--------+-------------------+--------------------------------------------------------+
Describe formatted in hive reports -1 for numRows:
hive> describe formatted t;
OK
# col_name data_type comment
c int
# Detailed Table Information
Table Type: MANAGED_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE false
numFiles 1
numRows -1
rawDataSize -1
totalSize 12
transient_lastDdlTime 1484319616
This causes problems for any application (such as Hive and Impala) that relies on these table-level stats.
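To check for this symptom across many tables, the output of show table stats can be scanned for a negative #Rows value. The following is a minimal sketch, not part of the original report: it assumes the impala-shell ASCII table layout shown above, and the helper names parse_shell_table and rows_missing_stats are hypothetical.

```python
def parse_shell_table(text):
    """Parse an impala-shell ASCII table into a list of row dicts.

    Assumes the layout shown in this report: rows delimited by '|',
    with '+---+' border lines and a 'Query:' echo that are skipped.
    """
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # border line, prompt, or query echo
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # first '|' line holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def rows_missing_stats(text):
    """Return the rows whose #Rows is negative (-1 means stats are unset)."""
    return [row for row in parse_shell_table(text) if int(row["#Rows"]) < 0]
```

Passing the captured output of show table stats t to rows_missing_stats returns an empty list while the row count is intact, and the affected row once #Rows has dropped to -1.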
Workaround:
Recompute the table-level stats in Hive:
hive> analyze table t compute statistics;
Issue Links
- is duplicated by: IMPALA-3231 renaming table discards column statistics (Resolved)
- relates to: IMPALA-4260 Alter table add column drops all the column stats (Resolved)
- requires: IMPALA-1657 Improve logging incase of query failures with negative cardinalities (Resolved)