Hive
  1. Hive
  2. HIVE-655

Add support for user defined table generating functions

    Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.5.0
    • Component/s: Query Processor
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Provide a way for users to add a table generating function, i.e., functions that generate multiple rows from a single input row. Currently, the only way to do it is via the TRANSFORM clause which requires streaming the data.

      1. HIVE-655.5.patch
        57 kB
        Paul Yang
      2. HIVE-655.4.patch
        55 kB
        Paul Yang
      3. HIVE-655.3.patch
        52 kB
        Paul Yang
      4. HIVE-655.2.patch
        45 kB
        Paul Yang
      5. HIVE-655.1.patch
        45 kB
        Paul Yang

        Issue Links

          Activity

          Hide
          Namit Jain added a comment -

          Committed. Thanks Paul

          Show
          Namit Jain added a comment - Committed. Thanks Paul
          Hide
          Paul Yang added a comment -
          • Removed author tag
          • Moved error message strings to ErrorMsg
          • Added entry for hive.udtf.auto.progress in hive-default.xml
          Show
          Paul Yang added a comment - Removed author tag Moved error message strings to ErrorMsg Added entry for hive.udtf.auto.progress in hive-default.xml
          Hide
          Namit Jain added a comment -

          Minor comments:

          1. Remove author tag from AutoProgress
          2. SemanticAnalyzer/TypeCheckProcFactory - when you are throwing SemanticException, add new error messages in ErrorMsg
          3. Do you want to enable auto progress by default - this might break some existing script operator applications.

          Show
          Namit Jain added a comment - Minor comments: 1. Remove author tag from AutoProgress 2. SemanticAnalyzer/TypeCheckProcFactory - when you are throwing SemanticException, add new error messages in ErrorMsg 3. Do you want to enable auto progress by default - this might break some existing script operator applications.
          Hide
          Paul Yang added a comment -
          • Additional simplifications to UDTFOperator, SemanticAnalyzer support only SELECT UDTF(...)
          • Added option for periodic progress updates to the tracker
          Show
          Paul Yang added a comment - Additional simplifications to UDTFOperator, SemanticAnalyzer support only SELECT UDTF(...) Added option for periodic progress updates to the tracker
          Hide
          Namit Jain added a comment -

          Can you remove it now - maybe, copy it in your local file - but let us apply the patch which only supports
          select udtf() as col from T

          Show
          Namit Jain added a comment - Can you remove it now - maybe, copy it in your local file - but let us apply the patch which only supports select udtf() as col from T
          Hide
          Paul Yang added a comment -

          No, I don't think that will be supported - only SELECT UDTF(col0) as colName...

          previousRow & previousTag aren't needed if the only object that should be send to Operator.forward() is the output of the UDTF. It was needed for supporting SELECT pageid, explode(adid_list).... but I left it in there as I thought it might be useful for implementing LATERAL VIEW later.

          Show
          Paul Yang added a comment - No, I don't think that will be supported - only SELECT UDTF(col0) as colName... previousRow & previousTag aren't needed if the only object that should be send to Operator.forward() is the output of the UDTF. It was needed for supporting SELECT pageid, explode(adid_list).... but I left it in there as I thought it might be useful for implementing LATERAL VIEW later.
          Hide
          Namit Jain added a comment -

          /**
          121 * forwardUDTFOutput is typically called indirectly by the GenericUDTF when
          122 * the GenericUDTF has generated output data that should be passed on to the
          123 * next operator in the DAG.
          124 *
          125 * @param o
          126 * @throws HiveException
          127 */
          128 public void forwardUDTFOutput(Object o) throws HiveException {
          129 // Now that we have the data for the UDTF, repack the row by including
          130 // the columns not used by the UDTF.
          131 StandardStructObjectInspector soi =
          132 (StandardStructObjectInspector) inputObjInspectors[previousTag];
          133 List<Object> newRow = new ArrayList<Object>();
          134 List<? extends StructField> fields = soi.getAllStructFieldRefs();
          135 boolean insertedUDTFCol = false;
          136
          137 for(StructField sf : fields) {
          138 if(udtfInputColNamesSet.contains(sf.getFieldName())) {
          139 if(!insertedUDTFCol)

          { 140 newRow.add(o); 141 insertedUDTFCol = true; 142 }


          143 } else

          { 144 newRow.add(soi.getStructFieldData(previousRow, sf)); 145 }


          146 }
          147 forward(newRow, outputObjInspector);
          148 }
          149

          Are we planning to support:

          select udtf(..) as (a,b)...
          select udtf(..) as (a int,b int)...

          If not, why do you need previousTag etc.

          Show
          Namit Jain added a comment - /** 121 * forwardUDTFOutput is typically called indirectly by the GenericUDTF when 122 * the GenericUDTF has generated output data that should be passed on to the 123 * next operator in the DAG. 124 * 125 * @param o 126 * @throws HiveException 127 */ 128 public void forwardUDTFOutput(Object o) throws HiveException { 129 // Now that we have the data for the UDTF, repack the row by including 130 // the columns not used by the UDTF. 131 StandardStructObjectInspector soi = 132 (StandardStructObjectInspector) inputObjInspectors [previousTag] ; 133 List<Object> newRow = new ArrayList<Object>(); 134 List<? extends StructField> fields = soi.getAllStructFieldRefs(); 135 boolean insertedUDTFCol = false; 136 137 for(StructField sf : fields) { 138 if(udtfInputColNamesSet.contains(sf.getFieldName())) { 139 if(!insertedUDTFCol) { 140 newRow.add(o); 141 insertedUDTFCol = true; 142 } 143 } else { 144 newRow.add(soi.getStructFieldData(previousRow, sf)); 145 } 146 } 147 forward(newRow, outputObjInspector); 148 } 149 Are we planning to support: select udtf(..) as (a,b)... select udtf(..) as (a int,b int)... If not, why do you need previousTag etc.
          Hide
          Paul Yang added a comment -
          • Simplified genSelectPlan(), genUDTFPlan() to handle only single UDTF case
          • Added detection of multiple expressions in SELECT to throw error when used with UDTF's
          • Added detection of {group, cluster, distribute, sort}

            by to throw errors as well

          Show
          Paul Yang added a comment - Simplified genSelectPlan(), genUDTFPlan() to handle only single UDTF case Added detection of multiple expressions in SELECT to throw error when used with UDTF's Added detection of {group, cluster, distribute, sort} by to throw errors as well
          Hide
          Ning Zhang added a comment -

          +1 the plan.

          @Zheng, lateral view is typically used to write a view in SQL (usually not UDTF) that can be embedded in the FROM clause and reference other tables in the same FROM clause. Oracle and DB2 (maybe MS SQLServer as well) all have their own table function definitions that are similar to our UDTF. Internally their table function implementations may use lateral views if possible.

          Let's have a detailed design discussion about how to implement lateral views when you are back, as Rugu mentioned lateral views is quite expressive.

          Show
          Ning Zhang added a comment - +1 the plan. @Zheng, lateral view is typically used to write a view in SQL (usually not UDTF) that can be embedded in the FROM clause and reference other tables in the same FROM clause. Oracle and DB2 (maybe MS SQLServer as well) all have their own table function definitions that are similar to our UDTF. Internally their table function implementations may use lateral views if possible. Let's have a detailed design discussion about how to implement lateral views when you are back, as Rugu mentioned lateral views is quite expressive.
          Hide
          Namit Jain added a comment -

          I agree that there is a patch already from Paul - I will file a follow-up jira for the LATERAL VIEW, and get the

          SELECT explode(adid_list) FROM myTabl

          in this jira

          Show
          Namit Jain added a comment - I agree that there is a patch already from Paul - I will file a follow-up jira for the LATERAL VIEW, and get the SELECT explode(adid_list) FROM myTabl in this jira
          Hide
          Namit Jain added a comment -

          Yes, "LATERAL VIEW" does support cartesian product.

          select ...
          from A LATERAL VIEW explode(c1) LATERAL VIEW explode(c2)

          Show
          Namit Jain added a comment - Yes, "LATERAL VIEW" does support cartesian product. select ... from A LATERAL VIEW explode(c1) LATERAL VIEW explode(c2)
          Hide
          Zheng Shao added a comment -

          Yeah I think we can support both formats. I agree "LATERAL VIEW" is a better and more consistent with SQL, but that might require much more time to implement (given that the discussion of this issue was started long time ago and Paul has done most of the work given the current design).

          We can finish the current implementation asap, and at the same time keep in mind about the "LATERAL VIEW" struct, and implement it when we get time.

          Does "LATERAL VIEW" supports exploding multiple columns (as a cartisian product)?

          @Raghu: The history of map/reduce and transform was like this: tranform was in the first proposal and implemented when hive was started. Later we added "Map" and "Reduce" as a shortcut so that users (who know map-reduce already) can understand it better. However, there are also concerns from users (e.g. Venky) because the keywords do not make sure the "map" and "reduce" are run in actual "map" and "reduce" phase. So I would say these are historical issues.

          Show
          Zheng Shao added a comment - Yeah I think we can support both formats. I agree "LATERAL VIEW" is a better and more consistent with SQL, but that might require much more time to implement (given that the discussion of this issue was started long time ago and Paul has done most of the work given the current design). We can finish the current implementation asap, and at the same time keep in mind about the "LATERAL VIEW" struct, and implement it when we get time. Does "LATERAL VIEW" supports exploding multiple columns (as a cartisian product)? @Raghu: The history of map/reduce and transform was like this: tranform was in the first proposal and implemented when hive was started. Later we added "Map" and "Reduce" as a shortcut so that users (who know map-reduce already) can understand it better. However, there are also concerns from users (e.g. Venky) because the keywords do not make sure the "map" and "reduce" are run in actual "map" and "reduce" phase. So I would say these are historical issues.
          Hide
          Ning Zhang added a comment -

          I think it is fine to allow both formats. The select explode could be a shortcut to the lateral view format. This may help users who are familiar with the TRANSFORM syntax.

          Show
          Ning Zhang added a comment - I think it is fine to allow both formats. The select explode could be a shortcut to the lateral view format. This may help users who are familiar with the TRANSFORM syntax.
          Hide
          Namit Jain added a comment -

          @Ning, please comment

          Show
          Namit Jain added a comment - @Ning, please comment
          Hide
          Namit Jain added a comment -

          I am fine with that.

          So, we allow:

          SELECT explode(adid_list) FROM myTabl

          and

          SELECT .... from T LATERAL VIEW explode(adid_list)

          @Zheng, please comment

          Show
          Namit Jain added a comment - I am fine with that. So, we allow: SELECT explode(adid_list) FROM myTabl and SELECT .... from T LATERAL VIEW explode(adid_list) @Zheng, please comment
          Hide
          Namit Jain added a comment -

          When you say : if it is the only thing specified: do you mean that we should disallow:

          SELECT pageid, explode(adid_list) FROM myTabl

          Show
          Namit Jain added a comment - When you say : if it is the only thing specified: do you mean that we should disallow: SELECT pageid, explode(adid_list) FROM myTabl
          Hide
          Raghu Murthy added a comment -

          There was an alternate syntax to TRANSFORM.

          FROM (
          MAP columnlist USING mapscript AS newcolumnlist
          FROM tablename ) a
          REDUCE a.newcolumnlist USING reducescript

          IIRC we later decided to just use TRANSORM for both map and reduce.
          The lateral view syntax is more expressive. But I think it makes it
          simpler to specify udtf in select clause if it's the only thing
          specified, similar to transform.

          Show
          Raghu Murthy added a comment - There was an alternate syntax to TRANSFORM. FROM ( MAP columnlist USING mapscript AS newcolumnlist FROM tablename ) a REDUCE a.newcolumnlist USING reducescript IIRC we later decided to just use TRANSORM for both map and reduce. The lateral view syntax is more expressive. But I think it makes it simpler to specify udtf in select clause if it's the only thing specified, similar to transform.
          Hide
          Paul Yang added a comment -

          So I had a discussion with Ning and Namit this morning and a slightly different syntax for UDTF's was proposed. Something like:

          SELECT pageid, adid FROM myTable LATERAL VIEW explode(adid_list) AS adid ;
          

          where the LATERAL VIEW keyword associates the given UDTF with the table in the FROM clause. As Ning pointed out, one of the issues with having the UDTF in the SELECT is that queries like the following

          SELECT pageid, explode(adid_list), count(1) FROM myTable GROUP BY pageid;
          

          are a bit confusing as it's not clear what it's supposed to do. We could disallow these sort of operations but it makes it more complicated to the user. Using LATERAL VIEW also handles Raghotham's concern about having to specify the input for the UDTF. The UDTF still returns one column, thought multiple values can be returned via a an array or a struct.

          Zheng, do you have any thoughts about the proposed syntax? I know from early on UDTF's were planned to be in the SELECT clause and I'm wondering if there were other reasons for why UDTF's should be there. With SELECT, it seemed more straightforward implementation-wise. Also, going back to TRANSFORM, it does seem like it can fit in FROM too. What was the rationale for having it in the SELECT?

          Show
          Paul Yang added a comment - So I had a discussion with Ning and Namit this morning and a slightly different syntax for UDTF's was proposed. Something like: SELECT pageid, adid FROM myTable LATERAL VIEW explode(adid_list) AS adid ; where the LATERAL VIEW keyword associates the given UDTF with the table in the FROM clause. As Ning pointed out, one of the issues with having the UDTF in the SELECT is that queries like the following SELECT pageid, explode(adid_list), count(1) FROM myTable GROUP BY pageid; are a bit confusing as it's not clear what it's supposed to do. We could disallow these sort of operations but it makes it more complicated to the user. Using LATERAL VIEW also handles Raghotham's concern about having to specify the input for the UDTF. The UDTF still returns one column, thought multiple values can be returned via a an array or a struct. Zheng, do you have any thoughts about the proposed syntax? I know from early on UDTF's were planned to be in the SELECT clause and I'm wondering if there were other reasons for why UDTF's should be there. With SELECT, it seemed more straightforward implementation-wise. Also, going back to TRANSFORM, it does seem like it can fit in FROM too. What was the rationale for having it in the SELECT?
          Hide
          Ning Zhang added a comment -

          If I understand it correctly, the semantics of "select pageid, explode(addid_list) from T" is to for each row in T, we will take a row in T and combine pageid and every item in addid_list to make it a set of rows. This can be easily translated to something like:

          SELECT T.pageid, S.addid FROM T, explode(addid_list) S;

          The benefits of this is that you can add WHERE, GROUP BY etc. to the query such as

          SELECT s.addid, t.pageid FROM t, explod(addid_list) S WHERE pageid < 10 GROUP BY S.addid, T.pageid;

          If we put explode in the SELECT clause, we have to use a subquery to wrap up the explode() and put it into the FROM clause of the outer query. And we have to make sure the inner subquery has no group by, no aggregation function etc., and we have to explain it to the users. That seems unnecessarily complicated.

          If we want to follow the SQL's data flow: rows are passed in the order of FROM --> WHERE --> GROUP BY --> HAVING --> SELECT, I think it's better to put it into the FROM clause. This is not only the syntax of Oracle's table function, DB2 also need to reference the user defined table function in the FROM clause (http://publib.boulder.ibm.com/infocenter/db2luw/v8/index.jsp?topic=/com.ibm.db2.udb.doc/ad/t0009218.htm). This is also in line with the SQL's LATERAL VIEW construct where views defined in the FROM clause can reference tables appearing before them in the same FROM clause. It has a well-define semantics and it can be easily composed with other SQL construct without any exceptions.

          But again, if we want backward compatibility I have no objection of putting it into the SELECT clause, but I think we'd better also support UDTFs in the FROM clause as well.

          Show
          Ning Zhang added a comment - If I understand it correctly, the semantics of "select pageid, explode(addid_list) from T" is to for each row in T, we will take a row in T and combine pageid and every item in addid_list to make it a set of rows. This can be easily translated to something like: SELECT T.pageid, S.addid FROM T, explode(addid_list) S; The benefits of this is that you can add WHERE, GROUP BY etc. to the query such as SELECT s.addid, t.pageid FROM t, explod(addid_list) S WHERE pageid < 10 GROUP BY S.addid, T.pageid; If we put explode in the SELECT clause, we have to use a subquery to wrap up the explode() and put it into the FROM clause of the outer query. And we have to make sure the inner subquery has no group by, no aggregation function etc., and we have to explain it to the users. That seems unnecessarily complicated. If we want to follow the SQL's data flow: rows are passed in the order of FROM --> WHERE --> GROUP BY --> HAVING --> SELECT, I think it's better to put it into the FROM clause. This is not only the syntax of Oracle's table function, DB2 also need to reference the user defined table function in the FROM clause ( http://publib.boulder.ibm.com/infocenter/db2luw/v8/index.jsp?topic=/com.ibm.db2.udb.doc/ad/t0009218.htm ). This is also in line with the SQL's LATERAL VIEW construct where views defined in the FROM clause can reference tables appearing before them in the same FROM clause. It has a well-define semantics and it can be easily composed with other SQL construct without any exceptions. But again, if we want backward compatibility I have no objection of putting it into the SELECT clause, but I think we'd better also support UDTFs in the FROM clause as well.
          Hide
          Namit Jain added a comment -

          I was thinking about it, and there is no workaround for the above operation. So, I am fine with supporting it.

          However, we should throw an error if there is a group by or anything like that.

          Show
          Namit Jain added a comment - I was thinking about it, and there is no workaround for the above operation. So, I am fine with supporting it. However, we should throw an error if there is a group by or anything like that.
          Hide
          Zheng Shao added a comment -

          I think we should support:

          SELECT pageid, explode(adid_list)...

          This will probably be the most popular query that people will write. Otherwise I don't know how many people will actually use UDTF ...

          I don't think we have to follow Oracle's syntax here. Oracle's UDTF may not support the "explode" feature as described above.
          This syntax looks much cleaner. Transform is actually a UDTF in Oracle's sense, but we didn't follow Oracle's syntax.

          We can give it a different name if calling this UDTF causes confusion. The real use case above (and the single "explode" function) will probably dominate what users will write.

          Show
          Zheng Shao added a comment - I think we should support: SELECT pageid, explode(adid_list)... This will probably be the most popular query that people will write. Otherwise I don't know how many people will actually use UDTF ... I don't think we have to follow Oracle's syntax here. Oracle's UDTF may not support the "explode" feature as described above. This syntax looks much cleaner. Transform is actually a UDTF in Oracle's sense, but we didn't follow Oracle's syntax. We can give it a different name if calling this UDTF causes confusion. The real use case above (and the single "explode" function) will probably dominate what users will write.
          Hide
          Namit Jain added a comment -

          Let us not support:

          SELECT pageid, UDTF(adid_list)...

          for now - again the semantics are not clear.

          Show
          Namit Jain added a comment - Let us not support: SELECT pageid, UDTF(adid_list)... for now - again the semantics are not clear.
          Hide
          Ning Zhang added a comment -

          Agree with Namit. A UDTF is a function returning a table, so it should appear where any table/view could appear in the QL. Ideally it should only appear in the FROM clause since that's the only place a table can appear, and the semantics is well defined. If we really want to put UDTF in a SELECT clause for backward compatibility reason, we'd better to limit the number of UDTF to 1 so that it can only be used as a table in the subquery select * from (select UDTF from ..). Otherwise we have to deal with the complexity of semantics for queries such as select pageid, count(...), UDTF(...) from .. group by ... It is new to users and it is hard to explain.

          Show
          Ning Zhang added a comment - Agree with Namit. A UDTF is a function returning a table, so it should appear where any table/view could appear in the QL. Ideally it should only appear in the FROM clause since that's the only place a table can appear, and the semantics is well defined. If we really want to put UDTF in a SELECT clause for backward compatibility reason, we'd better to limit the number of UDTF to 1 so that it can only be used as a table in the subquery select * from (select UDTF from ..). Otherwise we have to deal with the complexity of semantics for queries such as select pageid, count(...), UDTF(...) from .. group by ... It is new to users and it is hard to explain.
          Hide
          Paul Yang added a comment -
          • API for the UDTF's have changed slightly
          • Added support for columns in addition to UDTF's - i.e. SELECT pageid, UDTF(adid_list)... Does this conflict with 'i.e only 1 item can be present in the select clause'?
          • Disallowed >1 UDTF per SELECT
          Show
          Paul Yang added a comment - API for the UDTF's have changed slightly Added support for columns in addition to UDTF's - i.e. SELECT pageid, UDTF(adid_list)... Does this conflict with 'i.e only 1 item can be present in the select clause'? Disallowed >1 UDTF per SELECT
          Hide
          Namit Jain added a comment -

          Since it is a table function, I think we should limit its usage, i.e only 1 item can be present in the select clause.

          SELECT explode(list1), EXPLODE(list2) FROM mytable;
          should throw an error instead of returning (list1, list2) pairs.

          We can always do it via a sub-query.

          select * from
          (SELECT explode(list1) from mytable)sq1 join (select * from EXPLODE(list2) FROM mytable) sq2;

          If need be, we can add it later in a follow-up jira

          Show
          Namit Jain added a comment - Since it is a table function, I think we should limit its usage, i.e only 1 item can be present in the select clause. SELECT explode(list1), EXPLODE(list2) FROM mytable; should throw an error instead of returning (list1, list2) pairs. We can always do it via a sub-query. select * from (SELECT explode(list1) from mytable)sq1 join (select * from EXPLODE(list2) FROM mytable) sq2; If need be, we can add it later in a follow-up jira
          Hide
          Paul Yang added a comment -

          Preliminary version

          Show
          Paul Yang added a comment - Preliminary version
          Hide
          Zheng Shao added a comment -

          We had an offline design meeting and here are some conclusions:

          1. Grammar:

          SELECT myUDTF(expression1, expression2 ...) AS col1, col2, col3
          FROM src;
          

          The column names after AS is required for now.

          2. UDTF class hierarchy:
          We will have an abstract base class UDTF that have the following interface. The interface is simplified from Operator interface.

          abstract ObjectInspector init(ObjectInspector row); // take input row OI and output output row OI
          abstract void process(Object o); // Get a row and process it
          abstract void close(); // Notify that the data stream is ended
          final void forward(Object o); // called by UDTF to forward the row that is generated
          

          3. UDTF creation statement. Follow the same way as UDF and UDAF.

          CREATE TEMPORARY FUNCTION myUDAF ..
          
          Show
          Zheng Shao added a comment - We had an offline design meeting and here are some conclusions: 1. Grammar: SELECT myUDTF(expression1, expression2 ...) AS col1, col2, col3 FROM src; The column names after AS is required for now. 2. UDTF class hierarchy: We will have an abstract base class UDTF that have the following interface. The interface is simplified from Operator interface. abstract ObjectInspector init(ObjectInspector row); // take input row OI and output output row OI abstract void process( Object o); // Get a row and process it abstract void close(); // Notify that the data stream is ended final void forward( Object o); // called by UDTF to forward the row that is generated 3. UDTF creation statement. Follow the same way as UDF and UDAF. CREATE TEMPORARY FUNCTION myUDAF ..
          Hide
          Zheng Shao added a comment -

          Another way is like this:

          1. Simplest example: just use the same syntax as UDF.

            SELECT pageid, EXPLODE(adid_list) as adid
            FROM mytable;
          

          2. If the UDTF produces more than 1 columns, then we have 3 options:

          A. Simplest way: needs common sub expression elimination to achieve good performance
            SELECT pageid, EXPLODE(ad_list).adid AS adid, EXPLODE(ad_list).adtext AS adtext
            FROM mytable;
          
          B. Simplify the query using sub query:
            SELECT pageid, ad.adid AS adid, ad.adtext AS adtext
            FROM (SELECT pageid, EXPLODE(ad_list) AS ad
                 FROM mytable) a;
          
          C. Expand the structure inline:
            SELECT pageid, EXPLODE(ad_list) as (adid, adtext)
            FROM mytable;
          

          Hive already have support for B. For A, we need to do the common sub expression, but I guess we want to do it anyway.
          C seems a nice extension but it is not limited to UDTF - UDF/UDAF should support the same thing, if we want to support this.

          3. Parallel UDTF calls means cross product:

            SELECT pageid, EXPLODE(adid_list) AS adid, EXPLODE(link_list) AS link
            FROM mytable;
          
          Show
          Zheng Shao added a comment - Another way is like this: 1. Simplest example: just use the same syntax as UDF. SELECT pageid, EXPLODE(adid_list) as adid FROM mytable; 2. If the UDTF produces more than 1 columns, then we have 3 options: A. Simplest way: needs common sub expression elimination to achieve good performance SELECT pageid, EXPLODE(ad_list).adid AS adid, EXPLODE(ad_list).adtext AS adtext FROM mytable; B. Simplify the query using sub query: SELECT pageid, ad.adid AS adid, ad.adtext AS adtext FROM (SELECT pageid, EXPLODE(ad_list) AS ad FROM mytable) a; C. Expand the structure inline: SELECT pageid, EXPLODE(ad_list) as (adid, adtext) FROM mytable; Hive already have support for B. For A, we need to do the common sub expression, but I guess we want to do it anyway. C seems a nice extension but it is not limited to UDTF - UDF/UDAF should support the same thing, if we want to support this. 3. Parallel UDTF calls means cross product: SELECT pageid, EXPLODE(adid_list) AS adid, EXPLODE(link_list) AS link FROM mytable;
          Hide
          Raghotham Murthy added a comment -

          The problem with doing it the oracle way is that we need to find another way to specify the input (or FROM clause) for the TRANSFORM/UDTF itself. So, seems like its better to specify table function in the SELECT and use that as a subquery in the FROM clause.

          Show
          Raghotham Murthy added a comment - The problem with doing it the oracle way is that we need to find another way to specify the input (or FROM clause) for the TRANSFORM/UDTF itself. So, seems like its better to specify table function in the SELECT and use that as a subquery in the FROM clause.
          Hide
          Raghotham Murthy added a comment -

          I was trying to use the same syntax for UDTF as TRANSFORM, since UDTF is just the java way of doing a transform. So, TRANSFORM clause can also be deemed a table function and should be written as

          SELECT * FROM TABLE (TRANSFORM(parameters) USING 'script' AS col_list)
          
          Show
          Raghotham Murthy added a comment - I was trying to use the same syntax for UDTF as TRANSFORM, since UDTF is just the java way of doing a transform. So, TRANSFORM clause can also be deemed a table function and should be written as SELECT * FROM TABLE (TRANSFORM(parameters) USING 'script' AS col_list)
          Hide
          Namit Jain added a comment -

          Other option is go the oracle route:
          select * from TABLE(udf(prameters));

          Show
          Namit Jain added a comment - Other option is go the oracle route: select * from TABLE(udf(prameters));
          Hide
          Raghotham Murthy added a comment -

          One option for the syntax would be:

            SELECT udtf(a,b,c) FROM r
          

          'utdf' is the name of the user defined table function which needs to be registered.

          Note that there shouldnt be any other expression in the SELECT clause (similar to TRANSFORM). One way to implement this would be to define a new TableFunctionOperator which takes one of UDTF or GenericUDTF.

          Show
          Raghotham Murthy added a comment - One option for the syntax would be: SELECT udtf(a,b,c) FROM r 'utdf' is the name of the user defined table function which needs to be registered. Note that there shouldnt be any other expression in the SELECT clause (similar to TRANSFORM). One way to implement this would be to define a new TableFunctionOperator which takes one of UDTF or GenericUDTF.

            People

            • Assignee:
              Paul Yang
              Reporter:
              Raghotham Murthy
            • Votes:
              3 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development