Details
Description
When hive.stats.autogather=true, the Metastore lists all files under the table directory to populate basic stats such as file counts and sizes. This file listing can be very expensive, particularly on filesystems like S3.
One way to address this issue is to set hive.stats.autogather=false.
Here's the bug
It is my understanding that the DO_NOT_UPDATE_STATS table property is intended to selectively prevent this stats collection. Unfortunately, the property is checked only after the expensive file listing has already run, so DO_NOT_UPDATE_STATS does not avoid the cost it appears intended to avoid.
Relevant code snippet:
public static boolean updateTableStatsFast(Database db, Table tbl, Warehouse wh,
    boolean madeDir, boolean forceRecompute, EnvironmentContext environmentContext)
    throws MetaException {
  if (tbl.getPartitionKeysSize() == 0) {
    // Update stats only when unpartitioned
    FileStatus[] fileStatuses = wh.getFileStatusesForUnpartitionedTable(db, tbl);
    // DO_NOT_UPDATE_STATS is checked in here, after
    // wh.getFileStatusesForUnpartitionedTable() has already been called
    return updateTableStatsFast(tbl, fileStatuses, madeDir, forceRecompute, environmentContext);
  } else {
    return false;
  }
}
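One possible direction, sketched here under assumptions rather than taken from any actual patch, is to test the table property before touching the filesystem. The example below is self-contained and does not use the real Hive classes: `StatsGuardSketch`, `statsUpdateDisabled`, and the `Supplier<long[]>` standing in for `wh.getFileStatusesForUnpartitionedTable()` are hypothetical names, and it assumes the property value is a boolean string stored in the table parameters.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical stand-in for the metastore utility; not the real Hive code.
final class StatsGuardSketch {
    // Property name as it appears in the issue description.
    static final String DO_NOT_UPDATE_STATS = "DO_NOT_UPDATE_STATS";

    // Assumption: the property holds "true"/"false" in the table parameters.
    // Boolean.parseBoolean(null) is false, so a missing key means "update".
    static boolean statsUpdateDisabled(Map<String, String> tableParams) {
        return tableParams != null
            && Boolean.parseBoolean(tableParams.get(DO_NOT_UPDATE_STATS));
    }

    // fileSizeLister stands in for the expensive file listing call.
    static boolean updateTableStatsFast(Map<String, String> tableParams,
                                        int partitionKeyCount,
                                        Supplier<long[]> fileSizeLister) {
        if (partitionKeyCount != 0) {
            return false; // only unpartitioned tables are handled here
        }
        if (statsUpdateDisabled(tableParams)) {
            return false; // bail out BEFORE the expensive listing
        }
        long[] sizes = fileSizeLister.get(); // the expensive step
        long total = 0;
        for (long size : sizes) {
            total += size;
        }
        tableParams.put("numFiles", Integer.toString(sizes.length));
        tableParams.put("totalSize", Long.toString(total));
        return true;
    }

    public static void main(String[] args) {
        Map<String, String> params = new HashMap<>();
        params.put(DO_NOT_UPDATE_STATS, "true");
        boolean updated = updateTableStatsFast(params, 0, () -> {
            throw new IllegalStateException("listing should have been skipped");
        });
        System.out.println("updated=" + updated);
    }
}
```

With the guard placed first, a table that sets DO_NOT_UPDATE_STATS=true never triggers the listing, which is the behavior the property seems meant to provide. Whether the real fix belongs in this method or in its callers is a separate question this sketch does not settle.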
Attachments
Issue Links
- relates to
  - HIVE-19489 Disable stats autogather for external tables (Open)
  - HIVE-10228 Changes to Hive Export/Import/DropTable/DropPartition to support replication semantics (Closed)
  - HIVE-13341 Stats state is not captured correctly: differentiate load table and create table (Closed)
  - HIVE-17478 Move filesystem stats collection from metastore to ql (Patch Available)