[HIVE-588] LIMIT n is slower than it needs to be - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Duplicate
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels:
None

Description

SELECT a FROM t LIMIT 10;
...simply prints the output of the first 10 lines of the first file in the database. That's good.

However,
SELECT function(a) FROM t LIMIT 10;
appears to send all of t to the mappers, run the function on every row, and then return the first 10 rows from whichever mapper(s) finish first. This is very slow in some cases!

Appropriate behavior for LIMIT would be to use ONE mapper, push files from the table into that mapper, and auto-kill the mapper once it has output 10 rows: just take the first 10 rows and kill the whole task if necessary. On dying, it should emit some informative message like, "Dying intentionally; LIMIT has been reached." This should be the case even for TRANSFORMs in the mapper: the TRANSFORM could spit out 20 rows, but once it has spit out 10, the whole task should die and the 10 should be returned immediately.
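
The single-mapper behavior proposed above can be sketched in a few lines. This is only an illustration of the idea, not Hive code: `limited_map`, `files`, and `fn` are hypothetical names, each file is modeled as an iterable of rows, and `fn` stands in for the function or TRANSFORM and may emit any number of rows per input row.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def limited_map(files: Iterable[Iterable[str]],
                fn: Callable[[str], Iterable[str]],
                limit: int) -> List[str]:
    """Push files through one 'mapper' and stop once `limit` rows are output."""
    def outputs() -> Iterator[str]:
        for rows in files:            # files are fed in one at a time
            for row in rows:
                yield from fn(row)    # fn may emit zero or more rows per input
    # islice stops pulling from the generator after `limit` rows,
    # so the remaining input rows and files are never read.
    return list(islice(outputs(), limit))
```

Because the pipeline is lazy, a function that emits two rows per input row reads only five input rows to satisfy LIMIT 10, regardless of how large the table is.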

The purpose of LIMIT is not just to have "only one response"; it is also to speed up queries a whole lot. Running the function over the entire table is a big waste.

Obviously, when a reduce step is necessary, the whole table will have to be pushed through mappers and then copied and then sorted. But in those cases, as soon as 10 total rows have been output by any reducer(s), all reduce tasks should be killed.
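
The reduce case can be sketched the same way. Again this is a hypothetical illustration rather than Hive's implementation: `limited_reduce`, `pairs`, and `reduce_fn` are made-up names, and the reducers are modeled as one lazy sequence instead of parallel tasks. The whole input still has to be grouped and sorted, but reduce work stops as soon as the limit is reached.

```python
from collections import defaultdict
from itertools import islice
from typing import Callable, Hashable, Iterable, List, Tuple


def limited_reduce(pairs: Iterable[Tuple[Hashable, int]],
                   reduce_fn: Callable[[Hashable, List[int]], Tuple],
                   limit: int) -> List[Tuple]:
    """Group all mapper output, then reduce only until `limit` rows exist."""
    groups = defaultdict(list)
    for key, value in pairs:          # the full input must be consumed here
        groups[key].append(value)
    # Reduce lazily in sorted key order; keys past the limit are never reduced.
    reduced = (reduce_fn(key, groups[key]) for key in sorted(groups))
    return list(islice(reduced, limit))
```

With 50 distinct keys and a limit of 10, `reduce_fn` runs for only the first 10 keys.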

Issue Links

is duplicated by

HIVE-908 optimize limit

Open

People

Assignee: Unassigned

Reporter: Adam Kramer

Votes: 0

Watchers: 4

Dates

Created: 27/Jun/09 22:06

Updated: 16/Dec/09 17:18

Resolved: 16/Dec/09 17:18