Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-9398

Hive did not start small file merge if the source table has .deflate files

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 0.12.0, 0.13.0
    • None
    • Compression
    • None

    Description

      My lab Env:
      Hive 0.13

      If the source table has .deflate compressed files and there is where condition,
      Hive did not start small file merge feature.
      If we have one partition table, and if we run SQL like:
      INSERT OVERWRITE TABLE target
      select xxx from source where...;

      After that, "target" table has many empty files, and the number of files = the
      number of mappers.

      I can reproduce it in house, and here is minimum reproduce.
      Is it by design or do we need to fix it?

      ----------------------------------------
      --------------Reproduce---------------
      ----------------------------------------

      1. Create a source tables – "source_support" and "source_support2" with the
      same DDL.
      "source_support" is to store normal text files, "source_support2" will have
      .deflate compressed files.

      CREATE TABLE source_support(
      onecol string
      )
      PARTITIONED BY (
      partcol string)
      ROW FORMAT SERDE
      'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
      STORED AS INPUTFORMAT
      'org.apache.hadoop.mapred.TextInputFormat'
      OUTPUTFORMAT
      'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

      CREATE TABLE source_support2(
      onecol string
      )
      PARTITIONED BY (
      partcol string)
      ROW FORMAT SERDE
      'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
      STORED AS INPUTFORMAT
      'org.apache.hadoop.mapred.TextInputFormat'
      OUTPUTFORMAT
      'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

      2. Create a one-row data file:

      1. cat /root/hao/000000_0
        'abc'

      3. Loading to 3 partitions of "source_support":

      LOAD DATA LOCAL INPATH '/root/hao/000000_0' INTO TABLE source_support
      PARTITION(partcol='2015-01-01');
      LOAD DATA LOCAL INPATH '/root/hao/000000_0' INTO TABLE source_support
      PARTITION(partcol='2015-01-02');
      LOAD DATA LOCAL INPATH '/root/hao/000000_0' INTO TABLE source_support
      PARTITION(partcol='2015-01-03');

      hive> select * from source_support;
      OK
      'abc' 2015-01-01
      'abc' 2015-01-02
      'abc' 2015-01-03
      Time taken: 0.836 seconds, Fetched: 3 row(s)

      4. Loading to "source_support2" from "source_support" to generate deflate
      files.
      set hive.exec.compress.output=true;
      INSERT OVERWRITE TABLE source_support2 PARTITION (partcol='2015-01-01')
      select onecol from source_support where
      partcol='2015-01-01';

      set hive.exec.compress.output=true;
      INSERT OVERWRITE TABLE source_support2 PARTITION (partcol='2015-01-02')
      select onecol from source_support where
      partcol='2015-01-02';

      set hive.exec.compress.output=true;
      INSERT OVERWRITE TABLE source_support2 PARTITION (partcol='2015-01-03')
      select onecol from source_support where
      partcol='2015-01-03';

      5. Source has .deflate files even though the small file merge is enabled.

      drop table testbysupport2;
      set hive.merge.mapfiles=true;
      set hive.merge.mapredfiles=true;
      create table testbysupport2 as
      SELECT 'policy-sale' data_source
      FROM source_support2
      WHERE onecol = '2015.01.04' and partcol in
      ('2015-01-01','2015-01-02','2015-01-03');

      [root@n3a warehouse]# ls -altr testbysupport2
      total 1
      drwxr-xr-x 42 xxx xxx 42 Jan 13 14:34 ..
      -rwxr-xr-x 1 root root 0 Jan 13 14:34 000002_0
      -rwxr-xr-x 1 root root 0 Jan 13 14:34 000001_0
      -rwxr-xr-x 1 root root 0 Jan 13 14:34 000000_0
      drwxr-xr-x 2 root root 3 Jan 13 14:34 .

      6. If we remove the where condition "onecol = '2015.01.04'",
      small file merge is now enabled.

      drop table testbysupport2;
      set hive.merge.mapfiles=true;
      set hive.merge.mapredfiles=true;
      create table testbysupport2 as
      SELECT 'policy-sale' data_source
      FROM source_support2
      WHERE partcol in ('2015-01-01','2015-01-02','2015-01-03');

      [root@n3a warehouse]# ls -altr testbysupport2
      total 2
      drwxr-xr-x 42 xxx xxx 42 Jan 13 14:37 ..
      -rwxr-xr-x 1 root root 36 Jan 13 14:37 000000_0
      drwxr-xr-x 2 root root 1 Jan 13 14:37 .
      ----------------------------------------
      --------------Reproduce---------------
      ----------------------------------------

      Attachments

        Activity

          People

            Unassigned Unassigned
            haozhu Hao Zhu
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

              Created:
              Updated: