Details
-
Bug
-
Status: Closed
-
Major
-
Resolution: Fixed
-
None
-
None
-
Reviewed
-
Description
Unless the user specify "set mapred.reduce.tasks=1;", he will see unexpected results with the query of "SELECT * FROM t SORT BY col1 LIMIT 100"
Basically, in the first map-reduce job, each reducer will get sorted data and only keep the first 100. In the second map-reduce job, we will distribute and sort the data randomly, before feeding into a single reducer that outputs the first 100.
In short, the query will output 100 random records in N * 100 top records from each of the reducer in the first map-reduce job.
This is contradicting to what people expects.
We should propagate the SORT BY columns to the second map-reduce job.