Details
Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Important
Description
Today, some Hive queries that use joins can output zero-byte files, particularly on large joins. This can have a negative effect on HDFS, as it can lead to too many small files [1].
A Cloudera Community thread [2] suggests using LazyOutputFormat as the OutputFormat, because it lets MapReduce suppress the generation of empty (zero-byte) files.
However, Hive does not allow creating a table whose OutputFormat is just LazyOutputFormat. The statement below and the resulting error are what we found when testing.
create table mytable (fip int, state string, zip string, level int) STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat';
------------
Error: Error while compiling statement: FAILED: SemanticException [Error 10055]: Output Format must implement HiveOutputFormat, otherwise it should be either IgnoreKeyTextOutputFormat or SequenceFileOutputFormat (state=42000,code=10055)
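For context on the suggestion in [2]: in a plain MapReduce driver, lazy output is usually enabled by wrapping the real format, for example LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class), so the output file is created only when the first record is written. The sketch below illustrates that idea using only the Java standard library. It is not the Hadoop class and not a Hive fix; LazyFileWriter, LazyOutputDemo, the part-0000x file names and the sample record are all hypothetical names chosen for illustration.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical illustration of lazy output; not the Hadoop LazyOutputFormat. */
final class LazyFileWriter implements AutoCloseable {
    private final Path path;
    private BufferedWriter writer; // stays null until the first record arrives

    LazyFileWriter(Path path) {
        this.path = path;
    }

    /** Opens the file on the first call only, then appends one record per line. */
    void write(String record) throws IOException {
        if (writer == null) {
            writer = Files.newBufferedWriter(path, StandardCharsets.UTF_8);
        }
        writer.write(record);
        writer.newLine();
    }

    /** Closes the file if it was ever opened; otherwise no file exists at all. */
    @Override
    public void close() throws IOException {
        if (writer != null) {
            writer.close();
        }
    }
}

class LazyOutputDemo {
    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("lazy-output-demo");
        Path empty = dir.resolve("part-00000");    // task that emits no records
        Path nonEmpty = dir.resolve("part-00001"); // task that emits one record

        try (LazyFileWriter w = new LazyFileWriter(empty)) {
            // No records written, so no file should be created.
        }
        try (LazyFileWriter w = new LazyFileWriter(nonEmpty)) {
            w.write("6\tCA\t94105\t1");
        }

        System.out.println("empty task wrote a file: " + Files.exists(empty));
        System.out.println("non-empty task wrote a file: " + Files.exists(nonEmpty));
    }
}
```

An eager writer would open the file in its constructor and leave a zero-byte file behind for the first task. Deferring the open to the first write is what avoids that.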
[1] http://blog.cloudera.com/blog/2009/02/the-small-files-problem/
[2] https://community.cloudera.com/t5/Batch-Processing-and-Workflow/how-to-suppress-mapper-output-files-if-the-output-file-does-not/td-p/29540