Details
Type: New Feature
Status: Resolved
Priority: Major
Resolution: Won't Fix
Description
Hive has a feature that can automatically merge the small files in an HQL query's output path.
This feature is quite useful in cases where people use insert into to load minute-level data from an input path into a daily table.
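For reference, the Hive behavior described above is controlled by a handful of settings; a hedged sketch follows (property names as documented by Hive, defaults and byte values here are illustrative):

```sql
-- Enable merging of small output files after map-only and map-reduce jobs.
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
-- Trigger a merge when the average output file is below this size (bytes).
SET hive.merge.smallfiles.avgsize=16000000;
-- Target size of each merged file (bytes).
SET hive.merge.size.per.task=256000000;
```

With these set, Hive launches an extra merge stage after the query when the output files are small enough to warrant it.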
In that case, if the SQL includes a group by or join operation, we usually set the reduce number to at least 200 to avoid a possible OOM on the reduce side.
That causes the SQL to output at least 200 files at the end of each execution, so the daily table eventually contains more than 50000 files.
If SparkSQL provided the same feature, it would greatly reduce HDFS operations and the number of Spark tasks needed when running other SQL against this table.
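Absent such a feature, one hedged workaround sketch is to lower the shuffle parallelism for the final insert, trading fewer output files for larger reduce tasks (table and column names below are hypothetical; spark.sql.shuffle.partitions is the standard Spark SQL setting):

```sql
-- Reduce the number of shuffle partitions, and hence output files,
-- for this one insert; larger partitions raise the OOM risk the
-- 200-reducer setting was meant to avoid.
SET spark.sql.shuffle.partitions=10;
INSERT INTO daily_table
SELECT key, count(*) AS cnt
FROM minute_table
GROUP BY key;
```

This only shifts the trade-off per query; an automatic post-write merge stage, as in Hive, would avoid having to tune parallelism against file count by hand.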