Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.5.0
    • Component/s: Query Processor
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      For table functions, the following syntax should be supported

      select ... from T LATERAL VIEW explode(adid_list) as adid,...

      1. HIVE-935.1.patch
        74 kB
        Paul Yang
      2. HIVE-935.2.patch
        74 kB
        Paul Yang
      3. HIVE-935.3.patch
        76 kB
        Paul Yang
      4. HIVE-935.4.patch
        85 kB
        Paul Yang
      5. HIVE-935.5.patch
        85 kB
        Paul Yang
      6. HIVE-935.6.patch
        93 kB
        Paul Yang
      7. HIVE-935.7.patch
        93 kB
        Paul Yang
      8. HIVE-935.8.patch
        101 kB
        Paul Yang

        Issue Links

          Activity

          Hide
          Paul Yang added a comment -

          Some notes from an offline discussion

          1. To be similar to other vendors, the alias specified after the lateral view might be better as a table alias instead of as a column alias. For reference, take the query:

          SELECT <expressions> FROM example_table LATERAL VIEW explode(adid_list) AS adid
          

          Whether the alias 'adid' is for table or a column affects which <expressions> are (in)valid. If adid were a table alias, then the following query would not work:

          SELECT adid ...
          

          On the other hand If adid were a column, alias, the following query would not work:

          SELECT adid.* ...
          

          In either case, the following query would be valid:

          SELECT adid.col0 ...
          

          2. Following the notion that the output of the UDTF is a table, the output of explode() and any other GenericUDTF should be a struct.

          3. The operator structure of the lateral view is as follows:

          For a query such as

          SELECT pageid, adid.* FROM example_table LATERAL VIEW explode(adid_list) AS adid
          

          The top of the operator tree will look something like (excuse the ASCII):

          *
                 [Table Scan]
                    /   \
           [Select](*)  [Select](adid_list)
                   |      |
                   |     [UDTF] (explode)
                   \     /
             [Lateral View Join]
                      |
                      |
            [Select] (pageid, adid.*)
                      |
                     ....
                     
          

          Rows from the table scan operator will branch in two ways. The left branch will contain a single select operator that gets all the columns from the table. The right branch will only get the columns that should be sent to the following UDTF operator. The output of both the left and right branches will get sent to a lateral view join operator that appropriately joins the inputs.

          An issue with this setup is that for every row coming from the left branch, there may be one or more rows (that are the output of the UDTF) coming from the right branch. The single row from the left branch and the multiple rows from the right branch must be associated in some way. The proposed solution is to depend on the positioning of the operators. If the operator DAG is setup as previously mentioned, then the lateral view join operator should first see a row from the left branch, followed by rows from the right branch, followed by a single row from the left branch again, and so on. Hence, the reception of a row from the left branch can be used as a delimiter.

          For multiple lateral views, additional branches will be added from the table scan operator that include the appropriate select and UDTF operators. A recursive approach was considered, but may present issues with generating a Cartesian product

          4. With the proposed semantics and operator DAG, the definitions of GenericUDTF.

          {process(), close()}

          must be adjusted slightly. GenericUDTF.close() should not output additional rows and GenericUDTF.process() should make all the calls to GenericUDTF.forward() that are necessary for the given row.

          Show
          Paul Yang added a comment - Some notes from an offline discussion 1. To be similar to other vendors, the alias specified after the lateral view might be better as a table alias instead of as a column alias. For reference, take the query: SELECT <expressions> FROM example_table LATERAL VIEW explode(adid_list) AS adid Whether the alias 'adid' is for table or a column affects which <expressions> are (in)valid. If adid were a table alias, then the following query would not work: SELECT adid ... On the other hand If adid were a column, alias, the following query would not work: SELECT adid.* ... In either case, the following query would be valid: SELECT adid.col0 ... 2. Following the notion that the output of the UDTF is a table, the output of explode() and any other GenericUDTF should be a struct. 3. The operator structure of the lateral view is as follows: For a query such as SELECT pageid, adid.* FROM example_table LATERAL VIEW explode(adid_list) AS adid The top of the operator tree will look something like (excuse the ASCII): * [Table Scan] / \ [Select](*) [Select](adid_list) | | | [UDTF] (explode) \ / [Lateral View Join] | | [Select] (pageid, adid.*) | .... Rows from the table scan operator will branch in two ways. The left branch will contain a single select operator that gets all the columns from the table. The right branch will only get the columns that should be sent to the following UDTF operator. The output of both the left and right branches will get sent to a lateral view join operator that appropriately joins the inputs. An issue with this setup is that for every row coming from the left branch, there may be one or more rows (that are the output of the UDTF) coming from the right branch. The single row from the left branch and the multiple rows from the right branch must be associated in some way. The proposed solution is to depend on the positioning of the operators. If the operator DAG is setup as previously mentioned, then the lateral view join operator should first see a row from the left branch, followed by rows from the right branch, followed by a single row from the left branch again, and so on. Hence, the reception of a row from the left branch can be used as a delimiter. For multiple lateral views, additional branches will be added from the table scan operator that include the appropriate select and UDTF operators. A recursive approach was considered, but may present issues with generating a Cartesian product 4. With the proposed semantics and operator DAG, the definitions of GenericUDTF. {process(), close()} must be adjusted slightly. GenericUDTF.close() should not output additional rows and GenericUDTF.process() should make all the calls to GenericUDTF.forward() that are necessary for the given row.
          Hide
          Namit Jain added a comment -

          Looks great - few more things.

          1. Lateral View Join will be a new operator (not same as Join)
          2. Lateral View Join operator will always have 2 inputs - the inputs will not be merged like join

          Show
          Namit Jain added a comment - Looks great - few more things. 1. Lateral View Join will be a new operator (not same as Join) 2. Lateral View Join operator will always have 2 inputs - the inputs will not be merged like join
          Hide
          Paul Yang added a comment -

          I recall in our meeting yesterday that having a GenericUDTF.forward() was considered preferable to having GenericUDTF.process() return an array of results because there were performance benefits. Namely, the GenericUDTF could create a mutable object once, and then call GenericUDTF.forward() multiple times on the same object.

          However, the following LateralViewJoinOperator will need to hold all the objects forwarded by the GenericUDTF for a given row before it can run the crossproduct/join - which implies that the forwarded objects can't be the same.

          Are there other reasons for having a separate GenericUDTF.forward()?

          Show
          Paul Yang added a comment - I recall in our meeting yesterday that having a GenericUDTF.forward() was considered preferable to having GenericUDTF.process() return an array of results because there were performance benefits. Namely, the GenericUDTF could create a mutable object once, and then call GenericUDTF.forward() multiple times on the same object. However, the following LateralViewJoinOperator will need to hold all the objects forwarded by the GenericUDTF for a given row before it can run the crossproduct/join - which implies that the forwarded objects can't be the same. Are there other reasons for having a separate GenericUDTF.forward()?
          Hide
          Paul Yang added a comment -

          Actually, never mind. With the 2 input approach, the lateral view join operator doesn't need to store the rows in memory.

          Show
          Paul Yang added a comment - Actually, never mind. With the 2 input approach, the lateral view join operator doesn't need to store the rows in memory.
          Hide
          Paul Yang added a comment -
          • Implemented lateral view
          • Changed the GenericUDTF API to require that the output be a struct
          • Changed the AS alias (i.e. explode(adid_list) AS myAlias) to be a table alias
          Show
          Paul Yang added a comment - Implemented lateral view Changed the GenericUDTF API to require that the output be a struct Changed the AS alias (i.e. explode(adid_list) AS myAlias) to be a table alias
          Hide
          Paul Yang added a comment -
          • Changed naming of internal names to use getColumnInternalName()
          • Moved renaming of internal columns to lateral view join operator
          Show
          Paul Yang added a comment - Changed naming of internal names to use getColumnInternalName() Moved renaming of internal columns to lateral view join operator
          Hide
          Paul Yang added a comment -

          For the first cut, LATERAL VIEW can't be used directly with JOIN. Instead, a subquery must be used.
          i.e.
          SELECT * FROM (SELECT * FROM src LATERAL VIEW explode(col) AS myTable) a JOIN b ON a.elt = b.key

          • Added detection of LATERAL VIEW with JOIN
          Show
          Paul Yang added a comment - For the first cut, LATERAL VIEW can't be used directly with JOIN. Instead, a subquery must be used. i.e. SELECT * FROM (SELECT * FROM src LATERAL VIEW explode(col) AS myTable) a JOIN b ON a.elt = b.key Added detection of LATERAL VIEW with JOIN
          Hide
          Namit Jain added a comment -

          I will take a look at the patch

          Show
          Namit Jain added a comment - I will take a look at the patch
          Hide
          Paul Yang added a comment -
          • Some comment corrections
          • Altered how the select operator was created for lateral view joins
          Show
          Paul Yang added a comment - Some comment corrections Altered how the select operator was created for lateral view joins
          Hide
          Namit Jain added a comment -

          Can you regenerate the patch after refreshing ? It has some conflicts

          Show
          Namit Jain added a comment - Can you regenerate the patch after refreshing ? It has some conflicts
          Hide
          Paul Yang added a comment -

          Conflict was with a test output file - re-ran test to revolve the conflict.

          Show
          Paul Yang added a comment - Conflict was with a test output file - re-ran test to revolve the conflict.
          Hide
          Namit Jain added a comment -

          I also got conflicts in SemanticAnalyzer.java and Hive.g

          Show
          Namit Jain added a comment - I also got conflicts in SemanticAnalyzer.java and Hive.g
          Hide
          Namit Jain added a comment -

          The latest patch applied fine - no conflicts now

          Show
          Namit Jain added a comment - The latest patch applied fine - no conflicts now
          Hide
          Namit Jain added a comment -

          Very well commented code -
          A couple of minor points:

          1. The test is not deterministic -

          SELECT * FROM src LATERAL VIEW explode(array(1,2,3)) AS myTable LIMIT 3;
          – TABLE.* should be supported
          SELECT myTable.* FROM src LATERAL VIEW explode(array(1,2,3)) AS myTable LIMIT 3;

          add a order by before the LIMIT

          2. In SemanticAnalyzer, whenever you throw an error - add the message in ErrorMessage.java
          3. You can change

          private String processLateralView(QB qb, ASTNode lateralView) throws SemanticException {
          int numChildren = lateralView.getChildCount();
          if(numChildren != 2)

          { throw new SemanticException("Lateral view with incorrect nubmer of children"); }

          to an assert instead

          In the same function:

          default:
          throw new SemanticException("Invalid 3rd child under TOK_LATERAL_VIEW");
          }

          The error message is not correct

          Show
          Namit Jain added a comment - Very well commented code - A couple of minor points: 1. The test is not deterministic - SELECT * FROM src LATERAL VIEW explode(array(1,2,3)) AS myTable LIMIT 3; – TABLE.* should be supported SELECT myTable.* FROM src LATERAL VIEW explode(array(1,2,3)) AS myTable LIMIT 3; add a order by before the LIMIT 2. In SemanticAnalyzer, whenever you throw an error - add the message in ErrorMessage.java 3. You can change private String processLateralView(QB qb, ASTNode lateralView) throws SemanticException { int numChildren = lateralView.getChildCount(); if(numChildren != 2) { throw new SemanticException("Lateral view with incorrect nubmer of children"); } to an assert instead In the same function: default: throw new SemanticException("Invalid 3rd child under TOK_LATERAL_VIEW"); } The error message is not correct
          Hide
          Namit Jain added a comment -

          One issue about "elt" - is this a standard terminology ?

          Show
          Namit Jain added a comment - One issue about "elt" - is this a standard terminology ?
          Hide
          Paul Yang added a comment -

          Well, I wasn't sure what to name the output columns for the explode UDTF. I chose 'elt' because it seemed like a good abbreviation for element, but I don't think it's standard. It does seem a bit unusual - what's a better name?

          Show
          Paul Yang added a comment - Well, I wasn't sure what to name the output columns for the explode UDTF. I chose 'elt' because it seemed like a good abbreviation for element, but I don't think it's standard. It does seem a bit unusual - what's a better name?
          Hide
          Namit Jain added a comment -

          In Oracle, the client used to define the schema. But here, we are implicitly defining it for him.

          I am not sure, but "elt" seems confusing - we should find a way for the user to specify it,
          or have some better default like "key"

          Show
          Namit Jain added a comment - In Oracle, the client used to define the schema. But here, we are implicitly defining it for him. I am not sure, but "elt" seems confusing - we should find a way for the user to specify it, or have some better default like "key"
          Hide
          Namit Jain added a comment -

          I thought about it more and also checked documentation for other databases for similar functionality.
          Having a hard-coded column name does not seem like a good idea.

          I would propose the following syntax:

          select ... as T LATERAL VIEW UDTF() <table_alias> as <column_names>

          or

          select ... as T LATERAL VIEW UDTF() <table_alias> as <column_schema>

          In the second case, we should do appropriate type conversions.

          Oracle, LucidDB let the user define the schema. We should also let the user define that.
          @Paul, hopefully, this will not require many changes to the patch. But, we should finalize the
          syntax first before you continue on coding. Both <table_alias> and <column_names>/<column_schema>
          are mandatory. We can split this, and support <column_schema> later if that makes it easier.

          Any suggestions.

          Show
          Namit Jain added a comment - I thought about it more and also checked documentation for other databases for similar functionality. Having a hard-coded column name does not seem like a good idea. I would propose the following syntax: select ... as T LATERAL VIEW UDTF() <table_alias> as <column_names> or select ... as T LATERAL VIEW UDTF() <table_alias> as <column_schema> In the second case, we should do appropriate type conversions. Oracle, LucidDB let the user define the schema. We should also let the user define that. @Paul, hopefully, this will not require many changes to the patch. But, we should finalize the syntax first before you continue on coding. Both <table_alias> and <column_names>/<column_schema> are mandatory. We can split this, and support <column_schema> later if that makes it easier. Any suggestions.
          Hide
          Paul Yang added a comment -

          If we go ahead with this change, then the syntax for UDTF's in the select expression list probably should change from:

          SELECT udtf(...) AS tableAlias ...
          

          to:

          SELECT udtf(...) tableAlias AS col0, col1 ...
          
          Show
          Paul Yang added a comment - If we go ahead with this change, then the syntax for UDTF's in the select expression list probably should change from: SELECT udtf(...) AS tableAlias ... to: SELECT udtf(...) tableAlias AS col0, col1 ...
          Hide
          Paul Yang added a comment -

          Offline meeting - we will support both forms but with slight modifications to the syntax to allow the user to define the column aliases.

          For UDTF's in the select expression: (To reiterate, we are supporting this because it'd be easier for new users to learn. Plus, it's similar to the transform syntax)

          SELECT udtf(inputCol1, inputCol2) AS (colAlias1, colAlias2, ...) FROM sourceTable
          i.e.
          SELECT explode(adid_list) as (adid) FROM src;
          

          The column alias list is required, and must be in parenthesis to aid parsing. Otherwise, it would be difficult to differentiate between column aliases in the AS and other select expressions.

          For UDTF's in lateral view

          SELECT <expression list> FROM sourceTable LATERAL VIEW udtf(inputCol1, inputCol2, ...) tableAlias AS (colAlias1, colAlias2..) ...
          i.e.
          SELECT myTable.* FROM src LATERAL VIEW explode(adid_list) myTable AS (adid);
          

          For the lateral view, the both table and column aliases will be required parameters.

          In either form, an error will be generated if the number of aliases supplied in the AS is not the same as the number of columns (fields) output by the UDTF.

          Show
          Paul Yang added a comment - Offline meeting - we will support both forms but with slight modifications to the syntax to allow the user to define the column aliases. For UDTF's in the select expression: (To reiterate, we are supporting this because it'd be easier for new users to learn. Plus, it's similar to the transform syntax) SELECT udtf(inputCol1, inputCol2) AS (colAlias1, colAlias2, ...) FROM sourceTable i.e. SELECT explode(adid_list) as (adid) FROM src; The column alias list is required, and must be in parenthesis to aid parsing. Otherwise, it would be difficult to differentiate between column aliases in the AS and other select expressions. For UDTF's in lateral view SELECT <expression list> FROM sourceTable LATERAL VIEW udtf(inputCol1, inputCol2, ...) tableAlias AS (colAlias1, colAlias2..) ... i.e. SELECT myTable.* FROM src LATERAL VIEW explode(adid_list) myTable AS (adid); For the lateral view, the both table and column aliases will be required parameters. In either form, an error will be generated if the number of aliases supplied in the AS is not the same as the number of columns (fields) output by the UDTF.
          Hide
          Paul Yang added a comment -

          I've implemented the column aliases as previously mentioned, and made fixes for the issues Namit had pointed out.

          One change is that the patch only requires parenthesis for specifying multiple column aliases for a UDTF in the select expression. Requiring parenthesis in all cases seemed a little too verbose.

          Show
          Paul Yang added a comment - I've implemented the column aliases as previously mentioned, and made fixes for the issues Namit had pointed out. One change is that the patch only requires parenthesis for specifying multiple column aliases for a UDTF in the select expression. Requiring parenthesis in all cases seemed a little too verbose.
          Hide
          Namit Jain added a comment -

          can you regenerate the patch - there is a conflict while applying the patch

          Show
          Namit Jain added a comment - can you regenerate the patch - there is a conflict while applying the patch
          Hide
          Paul Yang added a comment -

          Sorry about that - had to run merge.sh.

          Show
          Paul Yang added a comment - Sorry about that - had to run merge.sh.
          Hide
          Namit Jain added a comment -

          5 ArrayList<exprNodeDesc> colList = new ArrayList<exprNodeDesc>();
          4726 RowResolver inputRR = opParseCtx.get(op).getRR();
          4727 RowResolver allPathRR = new RowResolver();
          4728 genColListRegex(".*", null, null, null, colList, inputRR,
          4729 Integer.valueOf(0), allPathRR);
          4730 Vector<ColumnInfo> cols = allPathRR.getColumnInfos();
          4731 ArrayList<String> outputColumnNames = new ArrayList<String>();
          4732 for (ColumnInfo c : cols)

          { 4733 outputColumnNames.add(c.getInternalName()); 4734 }


          4735 Operator allPath =
          4736 putOpInsertMap(OperatorFactory.getAndMakeChild(
          4737 new selectDesc(colList, outputColumnNames, true),
          4738 new RowSchema(allPathRR.getColumnInfos()),
          4739 op), allPathRR);
          4740

          Instead of the above in SemanticAnalyzer.java, you can create a select by

          public selectDesc(final boolean selStarNoCompute)

          { this.selStarNoCompute = selStarNoCompute; }

          If selStarNoCompute is true, you get the same as above

          Show
          Namit Jain added a comment - 5 ArrayList<exprNodeDesc> colList = new ArrayList<exprNodeDesc>(); 4726 RowResolver inputRR = opParseCtx.get(op).getRR(); 4727 RowResolver allPathRR = new RowResolver(); 4728 genColListRegex(".*", null, null, null, colList, inputRR, 4729 Integer.valueOf(0), allPathRR); 4730 Vector<ColumnInfo> cols = allPathRR.getColumnInfos(); 4731 ArrayList<String> outputColumnNames = new ArrayList<String>(); 4732 for (ColumnInfo c : cols) { 4733 outputColumnNames.add(c.getInternalName()); 4734 } 4735 Operator allPath = 4736 putOpInsertMap(OperatorFactory.getAndMakeChild( 4737 new selectDesc(colList, outputColumnNames, true), 4738 new RowSchema(allPathRR.getColumnInfos()), 4739 op), allPathRR); 4740 Instead of the above in SemanticAnalyzer.java, you can create a select by public selectDesc(final boolean selStarNoCompute) { this.selStarNoCompute = selStarNoCompute; } If selStarNoCompute is true, you get the same as above
          Hide
          Namit Jain added a comment -

          Also, would it be possible to add the following tests:

          1. A negative test where the explode is returning 2 columns, but you have specified 2 aliases.
          2. A positive test where 2 column aliases are being returned - you can add a dummy function in contrib/test
          which returns the same value twice.

          Show
          Namit Jain added a comment - Also, would it be possible to add the following tests: 1. A negative test where the explode is returning 2 columns, but you have specified 2 aliases. 2. A positive test where 2 column aliases are being returned - you can add a dummy function in contrib/test which returns the same value twice.
          Hide
          Namit Jain added a comment -

          I got a diff in clientnegative/lateral_view_join.q.out - please correct that also

          Show
          Namit Jain added a comment - I got a diff in clientnegative/lateral_view_join.q.out - please correct that also
          Hide
          Paul Yang added a comment -
          • Added 2 output col tests, additional negative tests in contrib
          • Changed select to use selStarNoCompute
          • Updated test outputs
          Show
          Paul Yang added a comment - Added 2 output col tests, additional negative tests in contrib Changed select to use selStarNoCompute Updated test outputs
          Hide
          Namit Jain added a comment -

          +1

          looks good - will commit if the tests pass

          Show
          Namit Jain added a comment - +1 looks good - will commit if the tests pass
          Hide
          Namit Jain added a comment -

          Committed. Thanks Paul

          Show
          Namit Jain added a comment - Committed. Thanks Paul

            People

            • Assignee:
              Paul Yang
              Reporter:
              Namit Jain
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development