[HIVE-2295] Implement CLUSTERED BY, DISTRIBUTED BY, SORTED BY directives for a single query level. - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: Query Processor
Labels: None

Description

The common pattern for using the MapReduce framework from Hive looks like this:

SELECT TRANSFORM(a.foo, a.bar)
USING 'mapper.py'
AS x, y, z
FROM (
SELECT b.foo, b.bar
FROM tablename b
CLUSTER BY b.foo
) a;

...however, this is exceptionally fragile, because it relies on the assumption that Hive does no "magic" between the query steps. People familiar with SQL frequently assume that query steps are effectively isolated from each other. CLUSTER BY in the subquery, then, guarantees only that data are clustered on their way OUT of that subquery; what we really need is a directive indicating that data must be clustered on the way INTO the query that consumes them.

This is not a pedantic concern: nothing prevents Hive from optimizing data flow between query steps, for example by systematically splitting up big queries. The UDAF framework, with its merging step, would allow the rows for a single key to be split across SEVERAL reducers, "violating" the MapReduce assumptions while still returning the correct data. A TRANSFORM statement, however, is afforded no such protection.
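
To make the risk concrete, here is a minimal sketch of the kind of logic a streaming script used with TRANSFORM might contain. It is only an illustration: the function name sum_by_key and the tab-separated "key<TAB>value" input format are assumptions, not anything defined by Hive or by this ticket. The point is that itertools.groupby merges only adjacent rows, so the script silently emits partial results when the rows for a key do not arrive together.

```python
import itertools


def sum_by_key(lines):
    """Sum integer values per key, assuming each key's rows arrive adjacent.

    `lines` is an iterable of "key<TAB>value" strings (an assumed format).
    """
    rows = (line.rstrip("\n").split("\t") for line in lines)
    out = []
    # groupby only merges *consecutive* rows with the same key, which is
    # exactly the guarantee a reducer-style script expects from clustered input.
    for key, group in itertools.groupby(rows, key=lambda r: r[0]):
        out.append((key, sum(int(value) for _, value in group)))
    return out


clustered = ["a\t1", "a\t2", "b\t5"]
unclustered = ["a\t1", "b\t5", "a\t2"]

print(sum_by_key(clustered))    # one output row per key
print(sum_by_key(unclustered))  # key "a" appears twice, each with a partial sum
```

Nothing in the script can detect the second case; it simply produces wrong output, which is why the clustering guarantee has to come from the query itself.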

I propose, for greater clarity, that these directives be placed at the same query level as the statement that depends on them. Example syntax:

SELECT TRANSFORM(foo, bar)
USING 'reducer.py'
AS x, y, z
FROM tablename
CLUSTERED BY foo;

...in other words, move the directive regarding data distribution to the query that actually cares about it, so that users who rely on the assumptions of the MapReduce framework can formally indicate that their transformer really DOES need clustered data. Put differently, CLUSTER BY is a directive guaranteeing that data are clustered on the way OUT OF a query (e.g., for bucketed tables), whereas CLUSTERED BY would be a directive guaranteeing that data are clustered on the way INTO a query.
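
As a rough model of the guarantee being requested, the sketch below shows what "clustered on the way INTO a query" means for the rows a transformer sees: every row for a given key lands in exactly one partition, and rows with the same key are contiguous within it. This is an illustration under stated assumptions, not Hive's implementation: cluster_rows, num_reducers, and the CRC32-based partitioning are all hypothetical choices made for the example.

```python
import zlib


def cluster_rows(rows, num_reducers):
    """Assign each (key, value) row to one partition by key, then sort by key."""
    partitions = [[] for _ in range(num_reducers)]
    for key, value in rows:
        # A stable hash, so the same key always maps to the same partition.
        index = zlib.crc32(key.encode("utf-8")) % num_reducers
        partitions[index].append((key, value))
    # Sorting by key makes each key's rows contiguous within its partition.
    return [sorted(p, key=lambda row: row[0]) for p in partitions]


rows = [("a", 1), ("b", 5), ("a", 2), ("c", 7), ("b", 1)]
for i, part in enumerate(cluster_rows(rows, num_reducers=3)):
    print(i, part)
```

A transformer that receives one such partition can safely process keys one run at a time, which is the assumption the proposed CLUSTERED BY directive would let users state explicitly.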

Bonus points: For tables that are already CLUSTERED BY in their definition, allow this query to run in the map phase.

People

Assignee: Unassigned
Reporter: Adam Kramer
Votes: 1
Watchers: 1

Dates

Created: 20/Jul/11 14:03
Updated: 20/Jul/11 14:03