Affects Version/s: None
Fix Version/s: None
Component/s: Query Processor
The common pattern for using the mapreduce framework from Hive looks like this:
-- 'my_script' is a placeholder for the user's transform script
SELECT TRANSFORM(a.foo, a.bar)
USING 'my_script'
AS x, y, z
FROM (
SELECT b.foo, b.bar
FROM tablename b
CLUSTER BY b.foo
) a;
...however, this is exceptionally fragile, as it relies on the assumption that Hive is not doing any "magic" in between the query steps. People familiar with SQL frequently assume that query steps are effectively separated from each other. CLUSTER BY, then, would guarantee that data are clustered on their way OUT of the query, but really what we need is a directive to indicate that data must be clustered on the way INTO the query.
This is not pedantic, because there is no reason that Hive wouldn't try to optimize data flow between queries, for example by systematically splitting up big queries. The UDAF framework, with its merging step, would allow the rows for a single key to be split across SEVERAL reducers, "violating" the mapreduce assumptions but still returning the correct data. For a TRANSFORM statement, however, no such protections are afforded: the script's partial outputs are never merged.
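To make the failure mode concrete, here is a minimal Python sketch. It is not Hive code: the `transform` and `merge` functions and the sample rows are hypothetical stand-ins for a TRANSFORM script that sums bar per foo and for a UDAF-style merge step. The sketch assumes the script processes adjacent rows with the same key as one group, which is the usual way such scripts are written.

```python
from itertools import groupby

def transform(rows):
    """Hypothetical TRANSFORM script body: sums the value per key.

    Assumes all rows for a key arrive at the same process, adjacent
    to each other (the guarantee CLUSTER BY is expected to provide).
    """
    for key, group in groupby(rows, key=lambda r: r[0]):
        yield key, sum(value for _, value in group)

def merge(partials):
    """UDAF-style merge step: combines partial sums for the same key."""
    totals = {}
    for key, value in partials:
        totals[key] = totals.get(key, 0) + value
    return sorted(totals.items())

# All rows for each key reach one reducer, clustered by key.
clustered = [("a", 1), ("a", 2), ("b", 5)]
clustered_out = list(transform(clustered))

# Same rows, but key "a" has been split across two reducers.
reducer_1 = [("a", 1)]
reducer_2 = [("a", 2), ("b", 5)]
split_out = list(transform(reducer_1)) + list(transform(reducer_2))

print(clustered_out)     # [('a', 3), ('b', 5)]
print(split_out)         # [('a', 1), ('a', 2), ('b', 5)]
print(merge(split_out))  # [('a', 3), ('b', 5)]
```

With the key split, the script emits two partial records for "a". A merge step repairs this for an aggregate, but a TRANSFORM script has no such step, so its output is simply wrong.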
I propose, for greater clarity, that these directives be part of the same query level. Example syntax:
-- 'my_script' is a placeholder for the user's transform script
SELECT TRANSFORM(foo, bar)
USING 'my_script'
AS x, y, z
FROM tablename
CLUSTERED BY foo;
...in other words, move the directive regarding data distribution to the query that actually cares about it, so that users who rely on the assumptions of the mapreduce framework can formally indicate that their transformer really DOES need clustered data. Put differently, CLUSTER BY is a directive guaranteeing that data are clustered on the way OUT OF a query (e.g., for bucketed tables), whereas CLUSTERED BY is a directive guaranteeing that data are clustered on the way INTO a query.
Bonus points: For tables that are already CLUSTERED BY in their definition, allow this query to run in the map phase.