Hive / HIVE-1251

TRANSFORM should allow piping or allow cross-subquery assumptions.

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved

    Description

      Many traditional transforms can be accomplished by chaining simple Unix commands together. For example, the "sort" phase is an instance of "cut -f 1 | sort". However, the TRANSFORM command in Hive does not allow Unix-style piping.

      One classic case where I wish there was piping is when I want to "stack" a column into several rows:

      SELECT TRANSFORM(key, col0, col1, col2) USING 'python stacker.py | python reducer.py' AS key, value

      ...in this case, stacker.py would produce output of this form:
      key col0
      key col1
      key col2
      ...and then the reducer would reduce the above down to one item per key. In this case, the current workaround is this:

      SELECT TRANSFORM(a.key, a.col) USING 'python reducer.py' AS key, value FROM
      (SELECT TRANSFORM(key, col0, col1, col2) USING 'python stacker.py' AS key, col FROM table) a

      ...the problem here is that for the above to work (and it should, indeed, work in a map-only MR task), I must assume that the data output from the inner subquery will be passed in EXACTLY THE SAME FORMAT to the outer query--i.e., I must assume that Hive will not insert a map or reduce phase in between, or "fan out" data from the inner query into different mappers in the outer query.

      As a user, I should not be allowed to assume that data coming out of a subquery reaches the nodes running the outer query in the same order, ESPECIALLY in the map phase.
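
To make the ordering assumption concrete, here is a minimal Python sketch of what stacker.py and reducer.py might look like. The issue does not include either script, so the function names stack and reduce_rows and the comma-joining aggregation are purely illustrative assumptions. The sketch relies on TRANSFORM's default behaviour of passing rows to the script as tab-separated lines on stdin and reading tab-separated lines back from stdout.

```python
import itertools


def stack(lines):
    """Hypothetical stacker: one 'key<TAB>value' row per input column.

    Input rows are assumed to look like 'key<TAB>col0<TAB>col1<TAB>col2'.
    """
    for line in lines:
        key, *cols = line.rstrip("\n").split("\t")
        for col in cols:
            yield f"{key}\t{col}"


def reduce_rows(lines):
    """Hypothetical reducer: collapse CONSECUTIVE rows that share a key.

    The aggregation (comma-joining the values) is an assumption made for
    illustration. Because groupby only merges adjacent rows, this is
    correct only if all rows for a key arrive together and in order.
    """
    rows = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in itertools.groupby(rows, key=lambda row: row[0]):
        yield f"{key}\t{','.join(value for _, value in group)}"
```

With piping, the two stages would run back to back in one process chain, equivalent to reduce_rows(stack(sys.stdin)). In the nested-subquery form, reduce_rows only gives the right answer if Hive delivers the stacker's rows to the reducer script unsplit and in the same order, which is the assumption this issue is about.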

          People

            Assignee: Unassigned
            Reporter: Adam Kramer (akramer)