[PIG-841] PERFORMANCE: The sample MR job in order by (or joins which require sampling) implementation can use Hadoop sorting instead of doing a POSort - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 0.3.0
Fix Version/s: 0.8.1
Component/s: None
Labels: None

Description

Currently, the sample MapReduce job in the order by implementation does the following:

1. Sample 100 records from each map.
2. Group all on the above output.
3. Sort the output bag from the above grouping on the keys of the order by.
4. Give the sorted bag to the FindQuantiles UDF.

Steps 2 and 3 above can be replaced by the following:

Group the sample output by the order by key and set the parallelism of the group to 1, so that the output of the group goes to a single reducer. Since Hadoop delivers keys to a reducer in sorted order, the output of the group is already sorted by key, and we get sorting for free without using POSort.
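
The proposal can be illustrated with a small simulation. This is a minimal sketch in plain Python, not Pig or Hadoop code. The names `sample_map`, `single_reducer_group` and `find_quantiles` are hypothetical, and `find_quantiles` is a simplified stand-in rather than Pig's FindQuantiles UDF. The call to `sorted` stands in for the sort that Hadoop's shuffle performs before a single reducer sees its keys, so no separate sort step appears between grouping and quantile selection.

```python
import random
from itertools import groupby


def sample_map(records, n=100, rng=None):
    """Step 1: sample up to n records from one map's input."""
    rng = rng or random.Random(0)
    records = list(records)
    if len(records) <= n:
        return records
    return rng.sample(records, n)


def single_reducer_group(samples, key):
    """Simulate a group by key with parallelism 1.

    The sorted() call models Hadoop's shuffle sort: the lone reducer
    receives its keys in sorted order, so no explicit POSort is needed.
    """
    shuffled = sorted(samples, key=key)
    for k, group in groupby(shuffled, key=key):
        yield k, list(group)


def find_quantiles(sorted_groups, num_partitions):
    """Simplified stand-in: pick num_partitions - 1 boundary keys
    from a stream of (key, records) pairs that is already sorted by key."""
    keys = [k for k, grp in sorted_groups for _ in grp]
    if not keys:
        return []
    return [keys[(i * len(keys)) // num_partitions]
            for i in range(1, num_partitions)]


if __name__ == "__main__":
    # Three hypothetical map inputs; each record is its own sort key.
    splits = [[5, 1, 9], [3, 7, 2], [8, 4, 6]]
    samples = [r for split in splits for r in sample_map(split)]
    groups = single_reducer_group(samples, key=lambda r: r)
    print(find_quantiles(groups, num_partitions=3))
```

The quantile step consumes the grouped stream directly, which mirrors the idea of handing the reducer's key-ordered input to the UDF instead of sorting a bag first.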

People

Assignee: Unassigned
Reporter: Pradeep Kamath
Votes: 0
Watchers: 1

Dates

Created: 09/Jun/09 19:14
Updated: 25/Apr/11 21:27
Resolved: 04/Mar/11 21:49