Hive / HIVE-10278

Hive does not use Parquet projection to access structures

Details

    Description

      Selecting from a table stored in Parquet format with struct columns does not use column projection as described in the Parquet specification. As a result, reading a single field of a struct reads the whole struct. This was found with the following test:

      Two tables (one flat, one with structs) were created as follows:

      drop table if exists test_flat;
      create table test_flat
      (urlurl string,
      urlvalid boolean,
      urlhost string,
      urldomain string,
      urlsubdomain string,
      urlprotocol string,
      urlsuffix string,
      urlmiddomain string,
      refererurl string,
      referervalid boolean,
      refererhost string,
      refererdomain string,
      referersubdomain string,
      refererprotocol string,
      referersuffix string,
      referermiddomain string)
      stored as parquet
      ;

      drop table if exists test_struct;
      create table test_struct
      (url struct<url:string, valid:boolean, host:string, domain:string, subdomain:string, protocol:string, suffix:string, middomain:string>,
      referer struct<url:string, valid:boolean, host:string, domain:string, subdomain:string, protocol:string, suffix:string, middomain:string>)
      stored as parquet;

      The sizes of these tables are:

      [havlik@ams07-015 ~]$ hdfs dfs -du -s -h /results/havlik/new_calibration/test_flat/
      820.4 G 1.6 T /results/havlik/new_calibration/test_flat

      [havlik@ams07-015 ~]$ hdfs dfs -du -s -h /results/havlik/new_calibration/test_struct/
      822.6 G 1.6 T /results/havlik/new_calibration/test_struct

      Struct SELECT:

      select
      count(*)
      from
      test_struct
      where
      url.valid = true
      and referer.valid = true;

      Flat SELECT:

      select
      count(*)
      from
      test_flat
      where
      urlvalid = true
      and referervalid = true;

      CPU time:
      flat: 11785 seconds
      struct: 38004 seconds

      HDFS bytes read:
      flat: 1 812 148 468
      struct: 883 774 856 844 (which is the total size of the table)
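
      To put these measurements in proportion, the ratios can be worked out directly. A minimal sketch in Python, using only the figures reported above (the variable names are illustrative):

```python
# Figures copied from the measurements reported in this issue.
cpu_seconds = {"flat": 11785, "struct": 38004}
hdfs_bytes_read = {"flat": 1_812_148_468, "struct": 883_774_856_844}

# How many times more expensive the struct query is than the flat one.
cpu_ratio = cpu_seconds["struct"] / cpu_seconds["flat"]
bytes_ratio = hdfs_bytes_read["struct"] / hdfs_bytes_read["flat"]

print(f"CPU time:   struct/flat = {cpu_ratio:.1f}x")
print(f"Bytes read: struct/flat = {bytes_ratio:.0f}x")
```

      The struct query uses roughly 3.2 times the CPU time and reads about 488 times as many bytes from HDFS as the flat query.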

      With a custom MapReduce job it is possible to project into the structs and get results similar to those for the flat table. Hive should implement this, because the current behavior causes unnecessary disk reads and CPU overhead and cripples performance.
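
      To illustrate what projecting into a struct means, the sketch below builds the kind of reduced Parquet schema a reader could request for this query: only url.valid and referer.valid, with the other struct fields left out. It is a hedged illustration in Python, not the reporter's MapReduce job. The helper projected_schema, the message name test_struct and the use of optional for every field are assumptions made for the example. In parquet-mr, a requested schema like this is typically passed to the reader through the parquet.read.schema job property, but how the reporter supplied it is not stated in this issue.

```python
# Field types for the paths used here, taken from the test_struct DDL in this issue.
FIELD_TYPES = {"url.valid": "boolean", "referer.valid": "boolean"}


def projected_schema(message_name, paths):
    """Build a Parquet message-type string containing only the given struct fields.

    Hypothetical helper: assumes one level of nesting ("column.field") and
    that every column and field is written as `optional`.
    """
    groups = {}  # struct column -> list of (field, type), in first-seen order
    for path in paths:
        column, field = path.split(".", 1)
        groups.setdefault(column, []).append((field, FIELD_TYPES[path]))

    lines = [f"message {message_name} {{"]
    for column, fields in groups.items():
        lines.append(f"  optional group {column} {{")
        for field, type_name in fields:
            lines.append(f"    optional {type_name} {field};")
        lines.append("  }")
    lines.append("}")
    return "\n".join(lines)


schema = projected_schema("test_struct", ["url.valid", "referer.valid"])
print(schema)
```

      A reader given this requested schema only needs the two boolean column chunks, which is why the bytes read can approach those of the flat table.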

          People

            Assignee: Unassigned
            Reporter: Jakub Havlík (jakub_havlik)
            Votes: 0
            Watchers: 2