Details
Type: Wish
Status: Resolved
Priority: Major
Resolution: Won't Fix
See comments for workarounds.
Description
Hive currently does a lot of awesome work to figure out when my transformers should be used in the mapper and when they should be used in the reducer. However, sometimes I have a different plan.
For example, consider this:
SELECT TRANSFORM(a.val1, a.val2)
USING './niftyscript'
AS part1, part2, part3
FROM (
  SELECT b.val AS val1, c.val AS val2
  FROM tblb b JOIN tblc c ON (b.key = c.key)
) a
...now, assume that the join step is very easy and 'niftyscript' is really processor intensive. The ideal plan for this is one MR job with few mappers and few reducers, followed by a second MR job with lots of mappers.
Currently, there is no way to even require that the outer TRANSFORM statement run in a separate map phase. Implementing a "hint" such as /*+ MAP */, akin to /*+ MAPJOIN */, would be awesome.
The current workaround is to dump everything to a temporary table and then start over, but that is not easy to scale. By contrast, the subquery structure effectively (and easily) "locks" the mid-points so that no other job can touch the intermediate data.
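For reference, the temporary-table workaround described above might look roughly like the following. This is only a sketch: the table name tmp_joined is made up for illustration, and it assumes a Hive version that supports CREATE TABLE ... AS SELECT.

```sql
-- Stage 1: run the cheap join on its own and materialize the result.
-- tmp_joined is a hypothetical name, not something from the original query.
CREATE TABLE tmp_joined AS
SELECT b.val AS val1, c.val AS val2
FROM tblb b JOIN tblc c ON (b.key = c.key);

-- Stage 2: run the expensive script in a separate job that reads
-- the materialized rows instead of the join output.
SELECT TRANSFORM(t.val1, t.val2)
USING './niftyscript'
AS part1, part2, part3
FROM tmp_joined t;

-- Clean up the intermediate table once the second stage has finished.
DROP TABLE tmp_joined;
```

Between stage 1 and the final DROP, tmp_joined is an ordinary table that any other job can read or modify, which is the scaling problem noted above.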