Hive / HIVE-836

Add syntax to force a new mapreduce job / transform subquery in mapper

    Details

    • Type: Wish
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None
    • Release Note:
      See comments for workarounds.

      Description

      Hive currently does a lot of awesome work to figure out when my transformers should be used in the mapper and when they should be used in the reducer. However, sometimes I have a different plan.

      For example, consider this:

      foo.sql
      SELECT TRANSFORM(a.val1, a.val2)
      USING './niftyscript'
      AS part1, part2, part3
      FROM (
          SELECT b.val AS val1, c.val AS val2
          FROM tblb b JOIN tblc c on (b.key=c.key)
      ) a
      

      ...now, assume that the join step is very cheap and 'niftyscript' is really processor-intensive. The ideal plan for this is an MR task with few mappers and few reducers for the join, and then a second MR task with lots of mappers for the transform.
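
One way to check where Hive actually places the script is to prefix the query with EXPLAIN, which prints the stage plan without running the job. This is only a sketch: the operator names in the output differ between Hive versions, so the thing to look for is whether the script step appears under a map-side or a reduce-side operator tree.

```sql
-- Print the plan for the query above instead of executing it.
-- If the script step is listed in the same stage as the join, under the
-- reduce-side operators, then niftyscript runs in that stage's reducers.
EXPLAIN
SELECT TRANSFORM(a.val1, a.val2)
USING './niftyscript'
AS part1, part2, part3
FROM (
    SELECT b.val AS val1, c.val AS val2
    FROM tblb b JOIN tblc c ON (b.key = c.key)
) a;
```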

      Currently, there is no way to even require that the outer TRANSFORM statement occur in a separate map phase. Implementing a "hint" such as /* +MAP */, akin to /* +MAPJOIN(x) */, would be awesome.

      The current workaround is to dump everything to a temporary table and then start over, but that is not easy to scale--the subquery structure effectively (and easily) "locks" the mid-points so no other job can touch the table.
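
The temporary-table workaround might look roughly like the sketch below. The table name tmp_joined and the STRING column types are assumptions made for illustration; they do not come from the original query. Because the second statement is a separate query with no join or aggregation, it would be expected to plan as its own job with the script in the map phase, though how many mappers it gets still depends on how the staged data is split.

```sql
-- Step 1: materialize the cheap join in its own job.
-- tmp_joined and the STRING types are hypothetical; adjust to the real schema.
CREATE TABLE tmp_joined (val1 STRING, val2 STRING);

INSERT OVERWRITE TABLE tmp_joined
SELECT b.val AS val1, c.val AS val2
FROM tblb b JOIN tblc c ON (b.key = c.key);

-- Step 2: run the expensive script as a separate query over the staged rows.
SELECT TRANSFORM(t.val1, t.val2)
USING './niftyscript'
AS part1, part2, part3
FROM tmp_joined t;

-- Step 3: clean up by hand. Nothing stops another job from touching
-- tmp_joined between steps 1 and 3, which is the scaling problem noted above.
DROP TABLE tmp_joined;
```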

        Activity

        Adam Kramer made changes -
        Status: Open [ 1 ] -> Resolved [ 5 ]
        Release Note: See comments for workarounds.
        Resolution: Won't Fix [ 2 ]
        Adam Kramer made changes -
        Field: Description
        Original Value:
        Hive currently does a lot of awesome work to figure out when my transformers should be used in the mapper and when they should be used in the reducer. However, sometimes I have a different plan.

        For example, consider this:

        SELECT TRANSFORM(a.val1, a.val2)
        USING './niftyscript'
        AS part1, part2, part3
        FROM (
            SELECT b.val AS val1, c.val AS val2
            FROM tblb b JOIN tblc c on (b.key=c.key)
        ) a

        ...in this syntax b and c will be joined (in the reducer, of course), and then the rows that pass the join clause will be passed to niftyscript _in the reducer._ However, when niftyscript is high-computation and there is a lot of data coming out of the join but very few reducers, there's a huge hold-up. It would be awesome if I could somehow force a new mapreduce step after the subquery, so that ./niftyscript is run in the mappers rather than the prior step's reducers.

        Current workaround is to dump everything to a temporary table and then start over, but that is not easy to scale--the subquery structure effectively (and easily) "locks" the mid-points so no other job can touch the table.

        SUGGESTED FIX: Either cause MAP and REDUCE to force map/reduce steps (cf. https://issues.apache.org/jira/browse/HIVE-835 ), or add a query element to specify that "the job ends here." For example, in the above query, FROM a SELF-CONTAINED or PRECOMPUTE a or START JOB AFTER a or something like that.
        New Value:
        Hive currently does a lot of awesome work to figure out when my transformers should be used in the mapper and when they should be used in the reducer. However, sometimes I have a different plan.

        For example, consider this:

        {code:title=foo.sql}
        SELECT TRANSFORM(a.val1, a.val2)
        USING './niftyscript'
        AS part1, part2, part3
        FROM (
            SELECT b.val AS val1, c.val AS val2
            FROM tblb b JOIN tblc c on (b.key=c.key)
        ) a
        {code}

        ...now, assume that the join step is very easy and 'niftyscript' is really processor intensive. The ideal format for this is a MR task with few mappers and few reducers, and then a second MR task with lots of mappers.

        Currently, there is no way to even require the outer TRANSFORM statement occur in a separate map phase. Implementing a "hint" such as /* +MAP */, akin to /* +MAPJOIN(x) */, would be awesome.

        Current workaround is to dump everything to a temporary table and then start over, but that is not easy to scale--the subquery structure effectively (and easily) "locks" the mid-points so no other job can touch the table.
        Adam Kramer created issue -

          People

          • Assignee:
            Unassigned
            Reporter:
            Adam Kramer
          • Votes:
            0
            Watchers:
            2
