Details
Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Description
Many traditional transforms can be accomplished by chaining simple Unix commands together. For example, the "sort" phase is an instance of "cut -f 1 | sort". However, the TRANSFORM command in Hive does not allow Unix-style piping.
One classic case where I wish piping were available is when I want to "stack" several columns into rows:
SELECT TRANSFORM(key, col0, col1, col2) USING 'python stacker.py | python reducer.py' AS key, value FROM table
...in this case, stacker.py would produce output of this form:
key col0
key col1
key col2
...and then reducer.py would reduce that output to one item per key. The current workaround is this:
SELECT TRANSFORM(a.key, a.col) USING 'python reducer.py' AS key, value FROM
(SELECT TRANSFORM(key, col0, col1, col2) USING 'python stacker.py' AS key, col FROM table) a
...the problem here is that for the above to work (and it should, indeed, work in a map-only MR task), I must assume that the output of the inner query is passed to the outer query in EXACTLY THE SAME FORMAT. That is, I must assume that Hive will not insert a map or reduce phase in between, or "fan out" data from the inner query across different mappers in the outer query.
As a user, I should not be allowed to assume that data coming out of a subquery reaches the nodes running the outer query in the same order, ESPECIALLY in the map phase.
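Until piping is supported, one possible way to avoid relying on how Hive schedules the two queries is to run both stages inside a single script. The sketch below is only an illustration: the real stacker.py and reducer.py are not shown in this issue, so the function names are hypothetical and the reduction itself (joining the values for a key with commas) is an assumption. It also assumes rows arrive on stdin as tab-separated text, one row per line.

```python
import sys
from itertools import groupby


def stack(rows):
    """Turn each (key, col0, col1, ...) row into one (key, value) pair per column."""
    for row in rows:
        key, *values = row
        for value in values:
            yield key, value


def reduce_by_key(pairs):
    """Collapse consecutive pairs that share a key into one (key, value) pair.

    Assumption: the reduction joins the values with commas, since the
    actual logic of reducer.py is not given in the issue.
    """
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield key, ",".join(value for _, value in group)


def main(stdin=sys.stdin, stdout=sys.stdout):
    # Split each tab-separated input line into fields, then chain the
    # two stages in-process, the way a shell pipe would chain them.
    rows = (line.rstrip("\n").split("\t") for line in stdin)
    for key, value in reduce_by_key(stack(rows)):
        stdout.write(f"{key}\t{value}\n")


if __name__ == "__main__":
    main()
```

Saved under a hypothetical name such as stack_and_reduce.py, this could replace the piped command in a single TRANSFORM ... USING 'python stack_and_reduce.py' clause. Because the stacked rows for a key never leave the process, the result does not depend on whether Hive inserts a phase between an inner and an outer query. It does not remove the need for real piping when the stages are arbitrary Unix commands.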