HIVE-16887: Parquet rawDataSize Under Reported


Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 2.1.1
    • Fix Version/s: None
    • Component/s: HiveServer2, Query Planning
    • Labels: None

    Description

      CREATE TABLE `test_stats`(
        `a` int,
        `b` int,
        `c` string,
        `d` bigint) STORED AS PARQUET;

      INSERT INTO test_stats VALUES (1, 31, "This is a test", 4567890);
      ANALYZE TABLE test_stats COMPUTE STATISTICS FOR COLUMNS;
      EXPLAIN SELECT * FROM test_stats WHERE c IS NULL;
      
      Explain
      STAGE DEPENDENCIES:
        Stage-1 is a root stage
        Stage-0 depends on stages: Stage-1
      
      STAGE PLANS:
        Stage: Stage-1
          Map Reduce
            Map Operator Tree:
                TableScan
                  alias: test_stats
                  filterExpr: c is null (type: boolean)
                  Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: COMPLETE
                  Filter Operator
                    predicate: c is null (type: boolean)
                    Statistics: Num rows: 1 Data size: 114 Basic stats: COMPLETE Column stats: COMPLETE
                    Select Operator
                      expressions: a (type: int), b (type: int), null (type: string), d (type: bigint)
                      outputColumnNames: _col0, _col1, _col2, _col3
                      Statistics: Num rows: 1 Data size: 114 Basic stats: COMPLETE Column stats: COMPLETE
                      File Output Operator
                        compressed: false
                        Statistics: Num rows: 1 Data size: 114 Basic stats: COMPLETE Column stats: COMPLETE
                        table:
                            input format: org.apache.hadoop.mapred.TextInputFormat
                            output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                            serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      
        Stage: Stage-0
          Fetch Operator
            limit: -1
            Processor Tree:
              ListSink
      

      For Parquet tables, when Hive looks at the table stats it sees one data size, and when it looks at the column stats it sees another:

      Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: COMPLETE
      Statistics: Num rows: 1 Data size: 114 Basic stats: COMPLETE Column stats: COMPLETE

      The second estimate (114 bytes) is much closer to reality, though I would expect the two numbers to be the same.

      Total Table Size = (Num Rows) * (Average Row Size)

      The rawDataSize is reported as 4 bytes, which drastically under-reports the size of the data. It may well be 4 bytes in Parquet's encoded form, but when this data is loaded into a Spark HashTable Sink or a cache, this one row will occupy at least 4 + 4 + 8 + 14 = 30 bytes (two ints, a bigint, and a 14-character string).
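
      As a sanity check, here is a minimal Java sketch of that back-of-the-envelope arithmetic. The per-type widths are assumptions (one byte per string character); this is not Hive's actual accounting:

      public class RowSizeEstimate {
          public static void main(String[] args) {
              int a = 4;                          // `a`: int, 4 bytes
              int b = 4;                          // `b`: int, 4 bytes
              int c = "This is a test".length();  // `c`: string, 14 bytes
              int d = 8;                          // `d`: bigint, 8 bytes
              // Minimum in-memory footprint of the single inserted row
              System.out.println("Estimated row size: " + (a + b + c + d) + " bytes"); // 30
              System.out.println("Reported rawDataSize: 4 bytes");
          }
      }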

      Consider a case where hive.auto.convert.join.noconditionaltask.size is set to 10 MB and the decision is based on rawDataSize: in Spark, with stats enabled, the row actually requires 114 bytes rather than the 4 bytes reported in the table stats, 28.5x more. Now imagine a table whose rawDataSize is reported as 10 MB but whose real in-memory footprint is 285 MB. That may break an executor with limited memory, or one that is caching multiple tables.
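
      To make the failure mode concrete, here is a simplified sketch of that kind of threshold check. This is not Hive's actual map-join code; the class, method, and the 28.5x factor below are purely illustrative:

      public class MapJoinDecision {
          // Mirrors hive.auto.convert.join.noconditionaltask.size = 10 MB
          static final long NONCONDITIONAL_TASK_SIZE = 10L * 1024 * 1024;

          static boolean convertToMapJoin(long rawDataSize) {
              return rawDataSize <= NONCONDITIONAL_TASK_SIZE;
          }

          public static void main(String[] args) {
              long reported = 10L * 1024 * 1024;        // under-reported rawDataSize
              long actual   = (long) (reported * 28.5); // real in-memory footprint
              // The check passes on the reported size...
              System.out.println("convert to map join? " + convertToMapJoin(reported)); // true
              // ...but the executor must actually hold ~285 MB in memory.
              System.out.println("actual bytes to cache: " + actual);
          }
      }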

      The Parquet SerDe should report the total uncompressed size of the data, so that we know exactly how much data will be loaded into memory, much like what is done here:

      https://github.com/apache/hive/blob/6b6a00ffb0dae651ef407a99bab00d5e74f0d6aa/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java#L544
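
      For reference, a minimal sketch of that approach: estimate the uncompressed size as numRows times the sum of average column widths. The classes and names below are hypothetical, not the Hive API; see the linked StatsUtils code for the real per-type logic:

      import java.util.List;

      public class UncompressedSizeEstimator {
          static class ColStat {
              final long avgColLen;  // average width of the column in bytes
              ColStat(long avgColLen) { this.avgColLen = avgColLen; }
          }

          static long estimateDataSize(long numRows, List<ColStat> colStats) {
              long avgRowSize = 0;
              for (ColStat cs : colStats) {
                  avgRowSize += cs.avgColLen;
              }
              // Total Table Size = (Num Rows) * (Average Row Size)
              return numRows * avgRowSize;
          }

          public static void main(String[] args) {
              // test_stats: two ints, one 14-character string, one bigint
              List<ColStat> cols = List.of(new ColStat(4), new ColStat(4),
                                           new ColStat(14), new ColStat(8));
              System.out.println(estimateDataSize(1, cols)); // 30, not 4
          }
      }

      With an estimate like this, the table above would report 30 bytes per row instead of 4, which is the number that memory planning actually needs.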

People

    Assignee: Unassigned
    Reporter: David Mollitor (belugabehr)
    Votes: 0
    Watchers: 3
