Details
-
New Feature
-
Status: Open
-
Major
-
Resolution: Unresolved
-
None
-
None
-
None
-
None
Description
It is tiresome to keep telling users to increase pig.maxCombinedSplitSize to 512MB or 1G when they are reading TBs of data to avoid launching too many map tasks (50-100K) for loading data. It has unnecessary overhead in terms of container launch and wastes lot of resources.
Would be good to have a new settings to configure the max number of tasks which will override pig.maxCombinedSplitSize and combine more splits into one task. For eg: pig.max.input.splits=30000 and data size is 2TB, it will combine more than 128MB (default pig.maxCombinedSplitSize) per task to have maximum of 30K tasks. That will go as default into pig-default.properties and apply to all users.
Thank you rohini for filing the issue.