Hive / HIVE-16870

Give Hive the ability to suppress output of empty files

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: StorageHandler
    • Labels: None
    • Flags: Important

    Description

      Today some Hive queries that use joins can write zero-byte output files, particularly on large joins. This can have a negative effect on HDFS because it adds to the small-files problem [1].

      A Cloudera Community thread [2] suggests setting the OutputFormat to LazyOutputFormat, because MapReduce can then suppress the generation of empty (zero-byte) files.
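
      For context, Hadoop's LazyOutputFormat wraps another output format and defers creating the underlying writer until the first record is written, so a task that emits nothing leaves no file behind. The sketch below illustrates only that idea in plain Java; it is not Hadoop's or Hive's code, and the names LazyFileWriter, LazyOutputDemo, and the part-00000 file name are hypothetical.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical writer: the file is created only when the first record arrives. */
final class LazyFileWriter implements AutoCloseable {
    private final Path path;
    private BufferedWriter out; // stays null until the first write

    LazyFileWriter(Path path) {
        this.path = path;
    }

    void write(String record) throws IOException {
        if (out == null) {
            // First record: only now does the file appear on disk.
            out = Files.newBufferedWriter(path, StandardCharsets.UTF_8);
        }
        out.write(record);
        out.newLine();
    }

    @Override
    public void close() throws IOException {
        // If no record was ever written there is no stream to close and no file on disk.
        if (out != null) {
            out.close();
        }
    }
}

final class LazyOutputDemo {
    /** Simulates one task that emits recordCount records and reports whether a file was left behind. */
    static boolean fileCreated(int recordCount) {
        try {
            Path dir = Files.createTempDirectory("lazy-output-demo");
            Path part = dir.resolve("part-00000");
            try (LazyFileWriter writer = new LazyFileWriter(part)) {
                for (int i = 0; i < recordCount; i++) {
                    writer.write("row-" + i);
                }
            }
            return Files.exists(part);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("empty task left a file: " + fileCreated(0));
        System.out.println("non-empty task left a file: " + fileCreated(3));
    }
}
```

      An eager writer would open the file in its constructor and so always leave at least a zero-byte file; moving the open into the first write is the whole difference.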

      However, Hive does not allow creating a table whose OutputFormat is just LazyOutputFormat. This is what we found when testing:

      create table mytable (fip int, state string, zip string, level int)
      STORED AS
        INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
        OUTPUTFORMAT 'org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat';

      ------------
      Error: Error while compiling statement: FAILED: SemanticException [Error 10055]: Output Format must implement HiveOutputFormat, otherwise it should be either IgnoreKeyTextOutputFormat or SequenceFileOutputFormat (state=42000,code=10055)

      [1] http://blog.cloudera.com/blog/2009/02/the-small-files-problem/
      [2] https://community.cloudera.com/t5/Batch-Processing-and-Workflow/how-to-suppress-mapper-output-files-if-the-output-file-does-not/td-p/29540

          People

            Assignee: Unassigned
            Reporter: Stephen Measmer (smeasmer@cloudera.com)
            Votes: 0
            Watchers: 3