Details
- Type: Improvement
- Status: Open
- Priority: Minor
- Resolution: Unresolved
- Affects Version/s: 2.1.1
- Fix Version/s: None
- Component/s: None
Description
CREATE TABLE `test_stats` (
  `a` int,
  `b` int,
  `c` string,
  `d` bigint)
STORED AS parquet;

INSERT INTO test_stats VALUES (1, 31, "This is a test", 4567890);

ANALYZE TABLE test_stats COMPUTE STATISTICS FOR COLUMNS;

EXPLAIN SELECT * FROM test_stats WHERE c IS NULL;
Explain
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: test_stats
            filterExpr: c is null (type: boolean)
            Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: COMPLETE
            Filter Operator
              predicate: c is null (type: boolean)
              Statistics: Num rows: 1 Data size: 114 Basic stats: COMPLETE Column stats: COMPLETE
              Select Operator
                expressions: a (type: int), b (type: int), null (type: string), d (type: bigint)
                outputColumnNames: _col0, _col1, _col2, _col3
                Statistics: Num rows: 1 Data size: 114 Basic stats: COMPLETE Column stats: COMPLETE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 1 Data size: 114 Basic stats: COMPLETE Column stats: COMPLETE
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
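(Not part of the original repro, just a hedged follow-up check, assuming hive.stats.fetch.column.stats behaves as in Hive 2.x: with column-stat fetching disabled, the operators should fall back to the table-level rawDataSize, so both estimates in the plan drop to the 4-byte figure.)

-- Hypothetical follow-up: toggle column-stat fetching to compare the
-- table-level estimate (rawDataSize) with the column-stat based estimate.
SET hive.stats.fetch.column.stats=false;
EXPLAIN SELECT * FROM test_stats WHERE c IS NULL;

SET hive.stats.fetch.column.stats=true;
EXPLAIN SELECT * FROM test_stats WHERE c IS NULL;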
For Parquet tables, Hive sees one data size when it looks at the table-level stats and a different one when it looks at the column stats:
Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: COMPLETE
Statistics: Num rows: 1 Data size: 114 Basic stats: COMPLETE Column stats: COMPLETE
The column-stat based estimate is much more accurate, though I would expect the two to agree.
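For reference, both numbers can be inspected directly (a minimal sketch, assuming the same test_stats table as above): the table-level rawDataSize sits in the table parameters, while the per-column averages that drive the 114-byte estimate are stored as column statistics.

-- Table parameters: numRows, rawDataSize (the 4-byte figure), totalSize.
DESCRIBE FORMATTED test_stats;

-- Column statistics for c, e.g. avgColLen / maxColLen for the string value.
DESCRIBE FORMATTED test_stats c;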
Total Table Size = (Num Rows) * (Average Row Size)
The rawDataSize is reported as 4 bytes, which badly under-reports the size of the data. Perhaps the row does occupy 4 bytes in Parquet's encoded format, but when it is loaded into a Spark HashTable Sink or a cache, this one row will take at least 4 + 4 + 8 + 14 = 30 bytes.
Consider a case where hive.auto.convert.join.noconditionaltask.size is set to 10 MB and the decision is based on rawDataSize: on Spark, with column stats enabled, this row actually requires 114 bytes rather than the 4 bytes reported in the table stats (28.5x more). You can imagine a case where rawDataSize is reported as 10 MB but the real amount of memory required to cache the table is 285 MB. That can break an executor with limited memory, or one where multiple tables are being cached.
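For context, a minimal sketch of the map-join setup described above; the threshold value and the second table (some_big_table) are illustrative, not from the original report:

-- Convert a common join to a map join when the small side's estimated size,
-- derived from rawDataSize, falls below the threshold (10 MB here).
SET hive.auto.convert.join=true;
SET hive.auto.convert.join.noconditionaltask=true;
SET hive.auto.convert.join.noconditionaltask.size=10485760;

-- If rawDataSize under-reports by ~28x, a table that really needs ~285 MB in
-- the hash table can still pass the 10 MB check and be broadcast.
-- some_big_table is a hypothetical second table used only for illustration.
EXPLAIN
SELECT s.*
FROM some_big_table b
JOIN test_stats s ON b.a = s.a;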
The Parquet SerDe should report the total uncompressed size of the data, so that we know exactly how much data will be loaded into memory, much like it is done here:
Attachments
Issue Links
- relates to: HIVE-20079 Populate more accurate rawDataSize for parquet format (Closed)