HIVE-16887: Parquet rawDataSize Under Reported


Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 2.1.1
    • Fix Version/s: None
    • Component/s: HiveServer2, Query Planning
    • Labels: None

    Description

      CREATE TABLE `test_stats`(
        `a` int,
        `b` int,
        `c` string,
        `d` bigint) STORED AS PARQUET;

      INSERT INTO test_stats VALUES (1, 31, "This is a test", 4567890);
      ANALYZE TABLE test_stats COMPUTE STATISTICS FOR COLUMNS;
      EXPLAIN SELECT * FROM test_stats WHERE c IS NULL;
      
      Explain
      STAGE DEPENDENCIES:
        Stage-1 is a root stage
        Stage-0 depends on stages: Stage-1
      
      STAGE PLANS:
        Stage: Stage-1
          Map Reduce
            Map Operator Tree:
                TableScan
                  alias: test_stats
                  filterExpr: c is null (type: boolean)
                  Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: COMPLETE
                  Filter Operator
                    predicate: c is null (type: boolean)
                    Statistics: Num rows: 1 Data size: 114 Basic stats: COMPLETE Column stats: COMPLETE
                    Select Operator
                      expressions: a (type: int), b (type: int), null (type: string), d (type: bigint)
                      outputColumnNames: _col0, _col1, _col2, _col3
                      Statistics: Num rows: 1 Data size: 114 Basic stats: COMPLETE Column stats: COMPLETE
                      File Output Operator
                        compressed: false
                        Statistics: Num rows: 1 Data size: 114 Basic stats: COMPLETE Column stats: COMPLETE
                        table:
                            input format: org.apache.hadoop.mapred.TextInputFormat
                            output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                            serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      
        Stage: Stage-0
          Fetch Operator
            limit: -1
            Processor Tree:
              ListSink
      

      For Parquet tables, when Hive looks at the table stats it sees one data size, and when it looks at the column stats it sees another:

      Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: COMPLETE
      Statistics: Num rows: 1 Data size: 114 Basic stats: COMPLETE Column stats: COMPLETE

      The second estimate (114 bytes) is much closer to reality, though I would expect the two numbers to be the same.

      Total Table Size = (Num Rows) * (Average Row Size)

      The rawDataSize is reported as 4 bytes, which drastically under-reports the size of the data. It may well be 4 bytes in Parquet's encoded form, but when this data is loaded into a Spark HashTable Sink or a cache, this one row will occupy at least 4 + 4 + 8 + 14 = 30 bytes (two ints, a bigint, and a 14-character string).
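
      As a sanity check, here is a minimal Java sketch of that back-of-the-envelope arithmetic. The per-type widths are assumptions (one byte per string character); this is not Hive's actual accounting:

      public class RowSizeEstimate {
          public static void main(String[] args) {
              int a = 4;                          // `a`: int, 4 bytes
              int b = 4;                          // `b`: int, 4 bytes
              int c = "This is a test".length();  // `c`: string, 14 bytes
              int d = 8;                          // `d`: bigint, 8 bytes
              // Minimum in-memory footprint of the single inserted row
              System.out.println("Estimated row size: " + (a + b + c + d) + " bytes"); // 30
              System.out.println("Reported rawDataSize: 4 bytes");
          }
      }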

      Consider a case where hive.auto.convert.join.noconditionaltask.size is set to 10 MB and the decision is based on rawDataSize: in Spark, with stats enabled, the row actually requires 114 bytes rather than the 4 bytes reported in the table stats, 28.5x more. Now imagine a table whose rawDataSize is reported as 10 MB but whose real in-memory footprint is 285 MB. That may break an executor with limited memory, or one that is caching multiple tables.
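
      To make the failure mode concrete, here is a simplified sketch of that kind of threshold check. This is not Hive's actual map-join code; the class, method, and the 28.5x factor below are purely illustrative:

      public class MapJoinDecision {
          // Mirrors hive.auto.convert.join.noconditionaltask.size = 10 MB
          static final long NONCONDITIONAL_TASK_SIZE = 10L * 1024 * 1024;

          static boolean convertToMapJoin(long rawDataSize) {
              return rawDataSize <= NONCONDITIONAL_TASK_SIZE;
          }

          public static void main(String[] args) {
              long reported = 10L * 1024 * 1024;        // under-reported rawDataSize
              long actual   = (long) (reported * 28.5); // real in-memory footprint
              // The check passes on the reported size...
              System.out.println("convert to map join? " + convertToMapJoin(reported)); // true
              // ...but the executor must actually hold ~285 MB in memory.
              System.out.println("actual bytes to cache: " + actual);
          }
      }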

      The Parquet SerDe should report the total uncompressed size of the data, so that we know exactly how much data will be loaded into memory, much like what is done here:

      https://github.com/apache/hive/blob/6b6a00ffb0dae651ef407a99bab00d5e74f0d6aa/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java#L544
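
      For reference, a minimal sketch of that approach: estimate the uncompressed size as numRows times the sum of average column widths. The classes and names below are hypothetical, not the Hive API; see the linked StatsUtils code for the real per-type logic:

      import java.util.List;

      public class UncompressedSizeEstimator {
          static class ColStat {
              final long avgColLen;  // average width of the column in bytes
              ColStat(long avgColLen) { this.avgColLen = avgColLen; }
          }

          static long estimateDataSize(long numRows, List<ColStat> colStats) {
              long avgRowSize = 0;
              for (ColStat cs : colStats) {
                  avgRowSize += cs.avgColLen;
              }
              // Total Table Size = (Num Rows) * (Average Row Size)
              return numRows * avgRowSize;
          }

          public static void main(String[] args) {
              // test_stats: two ints, one 14-character string, one bigint
              List<ColStat> cols = List.of(new ColStat(4), new ColStat(4),
                                           new ColStat(14), new ColStat(8));
              System.out.println(estimateDataSize(1, cols)); // 30, not 4
          }
      }

      With an estimate like this, the table above would report 30 bytes per row instead of 4, which is the number that memory planning actually needs.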

People

    Assignee: Unassigned
    Reporter: David Mollitor (belugabehr)
    Votes: 0
    Watchers: 3
