[HIVE-404] Problems in "SELECT * FROM t SORT BY col1 LIMIT 100" - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.3.0
Component/s: Query Processor
Labels: None
Hadoop Flags: Reviewed
Release Note: HIVE-404. Fix ordering in "SELECT * FROM t SORT BY col1 LIMIT 100" when the query is an outer-most query. (Namit Jain via zshao)

Description

Unless the user specifies "set mapred.reduce.tasks=1;", they will see unexpected results from the query "SELECT * FROM t SORT BY col1 LIMIT 100".

In the first map-reduce job, each reducer receives sorted data and keeps only its first 100 rows. In the second map-reduce job, the data is distributed and sorted randomly, rather than by col1, before being fed into a single reducer that outputs the first 100 rows.

In short, the query outputs 100 arbitrary records out of N * 100 candidates, namely the top 100 records from each of the N reducers in the first map-reduce job.

This contradicts what people expect.

We should propagate the SORT BY columns to the second map-reduce job.
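The two-job behaviour described above, and the effect of propagating the sort key, can be illustrated with a small Python simulation. This is only a sketch under stated assumptions, not Hive's implementation: the function names (`stage_one`, `stage_two_unsorted`, `stage_two_sorted`) are hypothetical, rows are plain integers standing in for `col1`, and the random partitioning and shuffle merely model "arbitrary order".

```python
import random


def stage_one(rows, num_reducers, limit, rng):
    """Model of the first job: rows are spread over reducers (partitioning
    is an assumption here), each reducer sorts its share and keeps `limit`."""
    partitions = [[] for _ in range(num_reducers)]
    for row in rows:
        partitions[rng.randrange(num_reducers)].append(row)
    return [sorted(part)[:limit] for part in partitions]


def stage_two_unsorted(partials, limit, rng):
    """Model of the reported behaviour: the single reducer receives the
    candidates in an order unrelated to col1 and keeps the first `limit`."""
    candidates = [row for part in partials for row in part]
    rng.shuffle(candidates)  # stands in for "distributed and sorted randomly"
    return candidates[:limit]


def stage_two_sorted(partials, limit):
    """Model of the proposed fix: the sort key is carried into the second
    job, so the single reducer sees candidates ordered by col1."""
    candidates = [row for part in partials for row in part]
    return sorted(candidates)[:limit]


if __name__ == "__main__":
    rng = random.Random(0)
    rows = list(range(1000))
    rng.shuffle(rows)
    partials = stage_one(rows, num_reducers=4, limit=100, rng=rng)
    print("unsorted second job:", stage_two_unsorted(partials, 100, rng)[:10])
    print("sorted second job:  ", stage_two_sorted(partials, 100)[:10])
```

Because each first-job reducer keeps its own smallest rows, the overall smallest 100 rows are always among the candidates. Only the ordering in the second job decides whether the final LIMIT selects them.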

Attachments

hive.404.1.patch: 16/Apr/09 01:18, 4 kB, Namit Jain
hive.404.2.patch: 16/Apr/09 16:16, 4 kB, Namit Jain

People

Assignee: Namit Jain

Reporter: Zheng Shao

Votes: 0

Watchers: 2

Dates

Created: 10/Apr/09 07:51

Updated: 17/Dec/11 00:08

Resolved: 18/Apr/09 01:25