Details
Improvement
Status: Closed
Major
Resolution: Fixed
0.3.0
None
None
Description
Currently, the sampling MapReduce job in the ORDER BY implementation does the following:
- Sample 100 records from each map.
- Perform a GROUP ALL on the sampled output.
- Sort the output bag from that grouping on the ORDER BY keys.
- Pass the sorted bag to the FindQuantiles UDF.
Steps 2 and 3 above (the GROUP ALL and the explicit sort) can be replaced by:
- Group the sample output by the ORDER BY key and set the parallelism of the group to 1, so that all of the group's output goes to one reducer. Since Hadoop ensures that the output of the group is sorted by key, we get the sorting for free without using POSort.
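
The equivalence of the two plans can be illustrated with a small simulation. This is only a sketch in plain Python, not Pig or Hadoop code: `sample_per_map`, `find_quantiles`, `current_plan` and `proposed_plan` are hypothetical names, `find_quantiles` is a simplified stand-in for the FindQuantiles UDF, and the framework's shuffle sort is imitated with `sorted()`.

```python
import random
from itertools import groupby
from operator import itemgetter


def sample_per_map(map_inputs, n=100, seed=0):
    """Take up to n random (key, value) records from each simulated map input."""
    rng = random.Random(seed)
    samples = []
    for records in map_inputs:
        samples.extend(rng.sample(records, min(n, len(records))))
    return samples


def find_quantiles(sorted_keys, num_partitions):
    """Simplified stand-in for FindQuantiles: pick num_partitions - 1 cut points."""
    if not sorted_keys or num_partitions < 2:
        return []
    step = len(sorted_keys) / num_partitions
    return [sorted_keys[int(i * step)] for i in range(1, num_partitions)]


def current_plan(samples, num_partitions):
    """GROUP ALL into one bag, then sort the bag explicitly (the POSort step)."""
    bag = list(samples)
    bag.sort(key=itemgetter(0))
    return find_quantiles([key for key, _ in bag], num_partitions)


def proposed_plan(samples, num_partitions):
    """Group by the ORDER BY key with a single reducer and no explicit sort."""
    # Assumption: with one reducer, the framework hands over keys in sorted
    # order. sorted() below imitates that shuffle; it is not a plan operator.
    shuffled = sorted(samples, key=itemgetter(0))
    ordered_keys = []
    for key, group in groupby(shuffled, key=itemgetter(0)):
        # One entry per sampled record, so duplicate keys keep their weight.
        ordered_keys.extend(key for _ in group)
    return find_quantiles(ordered_keys, num_partitions)


if __name__ == "__main__":
    # Three simulated map inputs with interleaved integer keys.
    inputs = [[(k, "map%d" % m) for k in range(m, 1000, 3)] for m in range(3)]
    sampled = sample_per_map(inputs)
    print(current_plan(sampled, 4) == proposed_plan(sampled, 4))
```

Both functions feed the same ordered key sequence to the quantile step; the difference is only where the ordering comes from, which is the point of the proposed change.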