Pig / PIG-841

PERFORMANCE: The sample MR job in the order by implementation (or in joins which require sampling) can use Hadoop sorting instead of doing a POSort

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.3.0
    • Fix Version/s: 0.8.1
    • Component/s: None
    • Labels: None

      Description

      Currently, the sample MapReduce job in the order by implementation does the following:

      1. Samples 100 records from each map.
      2. Groups all of the sampled output (group all).
      3. Sorts the output bag from that grouping on the order by keys.
      4. Passes the sorted bag to the FindQuantiles UDF.

      Steps 2 and 3 can be replaced by a single step:

      • Group the sample output by the order by key and set the parallelism of the group to 1, so that all of the group's output goes to one reducer. Since Hadoop delivers the group's output sorted by key, we get the sort for free without using POSort.
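
      The difference between the two plans can be sketched in a few lines of Python. This is only an illustrative simulation, not Pig or Hadoop code: `sample_map`, `current_plan`, and `proposed_plan` are hypothetical names, and the `sorted` call inside `proposed_plan` merely stands in for the key ordering that the Hadoop shuffle provides to a single reducer.

```python
import random
from itertools import groupby


def sample_map(records, n=100):
    """Step 1: sample up to n records from one map's input.

    A fixed seed keeps the sketch deterministic so both plans see the same sample.
    """
    rng = random.Random(0)
    return rng.sample(records, min(n, len(records)))


def current_plan(map_inputs, key):
    """Group everything into one bag, then sort that bag explicitly.

    The explicit sort plays the role of POSort in the reducer.
    """
    bag = [r for records in map_inputs for r in sample_map(records)]
    return sorted(bag, key=key)


def proposed_plan(map_inputs, key):
    """Group by the order-by key with a single reducer.

    The reducer does no sorting of its own; it only streams the groups
    in the order in which they arrive.
    """
    sampled = [r for records in map_inputs for r in sample_map(records)]
    # Assumption: stand-in for the shuffle, which hands keys to the reducer in sorted order.
    shuffled = sorted(sampled, key=key)
    out = []
    for _, group in groupby(shuffled, key=key):
        out.extend(group)
    return out
```

      Under these assumptions both functions hand the same key-ordered sample to whatever consumes it next (FindQuantiles in the real job); the proposed plan simply relies on the framework for the ordering instead of performing a second sort.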

            People

            • Assignee: Unassigned
            • Reporter: Pradeep Kamath (pkamath)
