Details
Type: New Feature
Status: Open
Priority: Major
Resolution: Unresolved
Description
With dynamic partitions, it becomes very easy to create partitions.
We have seen cases where corrupt data caused a large number of partitions and files to be created:
a single corrupt row can create a new partition and, if merge is false, as many files as there are mappers.
This puts a lot of load on the cluster and is a debugging nightmare.
It would be good to have a configuration parameter that sets the minimum number of rows for a partition.
If a partition would contain fewer rows than this threshold, it should not be created. The default value
of this parameter can be zero for backward compatibility.
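
To make the proposed behavior concrete, here is a minimal sketch in Python. It is not Hive code: the function `plan_partitions`, the `min_rows` argument, and the sample rows are all hypothetical names chosen for illustration. It only shows the intended semantics, namely that partitions below the threshold are skipped and that a threshold of zero keeps every partition.

```python
from collections import defaultdict


def plan_partitions(rows, partition_key, min_rows=0):
    """Group rows by partition value and split them into kept and skipped.

    min_rows is the hypothetical threshold from this request; the default
    of 0 keeps every partition, matching the current behavior.
    """
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[partition_key]].append(row)

    # Partitions meeting the threshold would be created as usual.
    kept = {key: part for key, part in grouped.items() if len(part) >= min_rows}
    # Partitions below the threshold are not created; report their row counts.
    skipped = {key: len(part) for key, part in grouped.items() if len(part) < min_rows}
    return kept, skipped


# Illustrative data: three valid rows and one row with a corrupt partition value.
example_rows = [
    {"ds": "2010-01-01", "value": 1},
    {"ds": "2010-01-01", "value": 2},
    {"ds": "2010-01-01", "value": 3},
    {"ds": "20x0-01-0!", "value": 4},
]

kept, skipped = plan_partitions(example_rows, "ds", min_rows=2)
print(sorted(kept))
print(skipped)
```

Returning the skipped partitions with their row counts, rather than dropping them silently, is one possible design choice: it would let a job log which partitions were suppressed, which helps with the debugging problem described above.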