  Pig / PIG-744

PERFORMANCE: Bag creation can be more efficiently handled in order by

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 0.2.0
    • Fix Version/s: 0.3.0
    • Component/s: None
    • Labels: None

    Description

      Currently, order by compiles into multiple map-reduce jobs (2 or 3, depending on the script), of which the last one does the actual ordering. In the reduce phase of this last job, POPackage creates a bag of values for each sort key (or set of sort keys), where each value is the entire tuple being sorted. The foreach that follows the package then immediately flattens that bag, so the bag is not really needed. To stay generic and keep using the existing operators, we can instead tag the POPackage so that it creates bags backed by the Hadoop iterator itself. This avoids building the bag by copying each tuple from the Hadoop iterator, which should help both performance and scalability by making better use of memory.
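
      The difference between the two approaches can be sketched in plain Java. This is only an illustration of the idea, not Pig code: the class and method names (IteratorBackedBagSketch, copyingBag, IteratorBackedBag, flatten) are hypothetical, a generic java.util.Iterator stands in for the Hadoop reduce-side iterator, and a Consumer stands in for the foreach/flatten step. It assumes the bag is read exactly once, which is the case when it is flattened right after the package.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch: none of these names exist in Pig or Hadoop.
class IteratorBackedBagSketch {

    // Copying approach: every value from the source iterator is
    // materialized into a list before anything downstream sees it.
    static <T> Iterable<T> copyingBag(Iterator<T> values) {
        List<T> copy = new ArrayList<>();
        while (values.hasNext()) {
            copy.add(values.next());
        }
        return copy;
    }

    // Iterator-backed approach: the "bag" is a thin view over the source
    // iterator. Nothing is copied, but it can only be traversed once.
    static final class IteratorBackedBag<T> implements Iterable<T> {
        private final Iterator<T> source;
        private boolean consumed = false;

        IteratorBackedBag(Iterator<T> source) {
            this.source = source;
        }

        @Override
        public Iterator<T> iterator() {
            if (consumed) {
                throw new IllegalStateException("iterator-backed bag is single-pass");
            }
            consumed = true;
            return source;
        }
    }

    // Stand-in for the foreach/flatten step: emits each element of the bag.
    static <T> void flatten(Iterable<T> bag, Consumer<T> emit) {
        for (T value : bag) {
            emit.accept(value);
        }
    }

    public static void main(String[] args) {
        List<String> tuples = Arrays.asList("a", "b", "c");

        // Holds all three values in memory before flattening.
        flatten(copyingBag(tuples.iterator()), System.out::println);

        // Streams the values straight through, one at a time.
        flatten(new IteratorBackedBag<>(tuples.iterator()), System.out::println);
    }
}
```

      Both variants emit the same values in the same order. The trade-off is that the iterator-backed bag gives up re-iteration in exchange for not holding a copy of every tuple for a key in memory.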

          People

            Assignee: Unassigned
            Reporter: Pradeep Kamath (pkamath)