Some notes from an offline discussion
1. To be similar to other vendors, the alias specified after the lateral view might be better as a table alias instead of as a column alias. For reference, take the query:
SELECT <expressions> FROM example_table LATERAL VIEW explode(adid_list) AS adid
Whether the alias 'adid' is for table or a column affects which <expressions> are (in)valid. If adid were a table alias, then the following query would not work:
On the other hand If adid were a column, alias, the following query would not work:
In either case, the following query would be valid:
2. Following the notion that the output of the UDTF is a table, the output of explode() and any other GenericUDTF should be a struct.
3. The operator structure of the lateral view is as follows:
For a query such as
SELECT pageid, adid.* FROM example_table LATERAL VIEW explode(adid_list) AS adid
The top of the operator tree will look something like (excuse the ASCII):
| [UDTF] (explode)
[Lateral View Join]
[Select] (pageid, adid.*)
Rows from the table scan operator will branch in two ways. The left branch will contain a single select operator that gets all the columns from the table. The right branch will only get the columns that should be sent to the following UDTF operator. The output of both the left and right branches will get sent to a lateral view join operator that appropriately joins the inputs.
An issue with this setup is that for every row coming from the left branch, there may be one or more rows (that are the output of the UDTF) coming from the right branch. The single row from the left branch and the multiple rows from the right branch must be associated in some way. The proposed solution is to depend on the positioning of the operators. If the operator DAG is setup as previously mentioned, then the lateral view join operator should first see a row from the left branch, followed by rows from the right branch, followed by a single row from the left branch again, and so on. Hence, the reception of a row from the left branch can be used as a delimiter.
For multiple lateral views, additional branches will be added from the table scan operator that include the appropriate select and UDTF operators. A recursive approach was considered, but may present issues with generating a Cartesian product
4. With the proposed semantics and operator DAG, the definitions of GenericUDTF.
must be adjusted slightly. GenericUDTF.close() should not output additional rows and GenericUDTF.process() should make all the calls to GenericUDTF.forward() that are necessary for the given row.