Pig / PIG-5365

Add support for PARALLEL clause in LOAD statement

Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved

    Description

      It is tiresome to keep telling users to increase pig.maxCombinedSplitSize to 512MB or 1GB when they are reading terabytes of data, just so that loading does not launch too many map tasks (50-100K). That many tasks adds unnecessary container-launch overhead and wastes a lot of resources.

      It would be good to have a new setting that caps the number of tasks, overriding pig.maxCombinedSplitSize and combining more splits into one task. For example, with pig.max.input.splits=30000 and a data size of 2TB, it will combine more than 128MB (the default pig.maxCombinedSplitSize) per task to have a maximum of 30K tasks. The setting would go into pig-default.properties as a default and apply to all users.
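
The sizing rule proposed above can be sketched roughly as follows. This is only an estimate under stated assumptions, not Pig's implementation: pig.max.input.splits is the setting proposed in this issue, the function names are hypothetical, and real split combination also depends on file and block boundaries and on locality. The idea is to keep the configured combined split size unless that would exceed the task cap, and otherwise grow it just enough to stay under the cap.

```python
MB = 1024 ** 2
TB = 1024 ** 4


def ceil_div(a, b):
    """Integer ceiling division, avoiding float rounding on large byte counts."""
    return -(-a // b)


def effective_combined_split_size(total_input_bytes,
                                  max_combined_split_size=128 * MB,
                                  max_input_splits=None):
    """Estimate the per-task combined split size in bytes.

    max_combined_split_size stands in for pig.maxCombinedSplitSize.
    max_input_splits stands in for the proposed pig.max.input.splits
    (hypothetical; None or a non-positive value means no cap).
    """
    if max_input_splits is None or max_input_splits <= 0:
        return max_combined_split_size
    # Smallest split size that keeps the task count at or below the cap.
    size_for_cap = ceil_div(total_input_bytes, max_input_splits)
    # Never go below the configured size; only grow it when the cap requires.
    return max(max_combined_split_size, size_for_cap)


def estimated_tasks(total_input_bytes, split_size):
    """Rough task count if every task reads split_size bytes."""
    return ceil_div(total_input_bytes, split_size)


if __name__ == "__main__":
    for total in (2 * TB, 10 * TB):
        size = effective_combined_split_size(total, max_input_splits=30000)
        print(total // TB, "TB:", size, "bytes per task,",
              estimated_tasks(total, size), "tasks")
```

Under this sketch, 2TB at the 128MB default already comes to about 16K tasks, so a cap of 30000 would leave the split size unchanged. The cap starts to matter for larger inputs: 10TB at 128MB would be about 82K tasks, and the cap would raise the per-task size to roughly 350MB.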

      Thank you, rohini, for filing the issue.

          People

            satishsaley Satish Saley