Uploaded image for project: 'Calcite'
  1. Calcite
  2. CALCITE-1551

RelBuilder's project() doesn't preserve alias

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.11.0
    • Component/s: core
    • Labels:
      None

      Description

      The alias should be preserved by the project node, e.g.

      .scan("EMP")
      .as("EMP_alias")
      .project(...) // resets alias to null
      .project(builder.field("EMP_alias", "DEPTNO")) // fails that no such alias exists
      

        Issue Links

          Activity

          Hide
          julianhyde Julian Hyde added a comment -

          Agreed, project should preserve table aliases.

          Show
          julianhyde Julian Hyde added a comment - Agreed, project should preserve table aliases.
          Hide
          jbalint@gmail.com Jess Balint added a comment -

          The difficulty that I'm facing is that it's not clear how to handle the case when the alias contains more than one entry, i.e. a join node. The resulting alias is something like [<left_rel_alias, left_row_type>, <right_rel_alias, right_row_type>]. The projection can include references to an arbitrary set of fields from both sides of the join. Would it make sense to rework the Frame.right field to include the alias for each field in the row type?

          I sketch an example here:

          • Each frame contains an ordered listed of <rel_alias, field_name> pairs. Current list of <rel_alias, row_type> pairs is removed.

          Given the following rel builder code:

          .scan("EMP")
          .as("e")
          // (1)
          .scan("DEPT")
          .as("d")
          // (2)
          .join(JoinRelType.INNER) // condition omitted
          // (3)
          .project(builder.field("e", "ENAME"),
                   builder.field("d", "DNAME"),
                   builder.field("e", "MGR"),
                   builder.literal(20))
          // (4)
          .as("x")
          // (5)
          

          At each numbered location above, the top Frame.right consists of the following:

          1. Scan includes all fields from EMP: [<e, EMPNO>, <e, ENAME>, <e, JOB>, <e, MGR>, <e, HIREDATE>, <e, SAL>, <e, COMM>, <e, DEPTNO>]
          2. Scan includes all fields from DEPT: [<d, DEPTNO>, <d, DNAME>, <d, LOC>]
          3. Join builds a concatenation of both lists of fields: [<e, EMPNO>, <e, ENAME>, <e, JOB>, <e, MGR>, <e, HIREDATE>, <e, SAL>, <e, COMM>, <e, DEPTNO>, <d, DEPTNO>, <d, DNAME>, <d, LOC>]
          4. Project retains rel aliases of individual fields, assigning null to new fields: [<e, ENAME>, <d, DNAME>, <e, MGR>, <null, $f2>]
          5. Alias (as()) sets the alias of all fields: [<x, ENAME>, <x, DNAME>, <x, MGR>, <x, $f2>]
          Show
          jbalint@gmail.com Jess Balint added a comment - The difficulty that I'm facing is that it's not clear how to handle the case when the alias contains more than one entry, i.e. a join node. The resulting alias is something like [<left_rel_alias, left_row_type>, <right_rel_alias, right_row_type>] . The projection can include references to an arbitrary set of fields from both sides of the join. Would it make sense to rework the Frame.right field to include the alias for each field in the row type? I sketch an example here: Each frame contains an ordered listed of <rel_alias, field_name> pairs. Current list of <rel_alias, row_type> pairs is removed. Given the following rel builder code: .scan( "EMP" ) .as( "e" ) // (1) .scan( "DEPT" ) .as( "d" ) // (2) .join(JoinRelType.INNER) // condition omitted // (3) .project(builder.field( "e" , "ENAME" ), builder.field( "d" , "DNAME" ), builder.field( "e" , "MGR" ), builder.literal(20)) // (4) .as( "x" ) // (5) At each numbered location above, the top Frame.right consists of the following: Scan includes all fields from EMP: [<e, EMPNO>, <e, ENAME>, <e, JOB>, <e, MGR>, <e, HIREDATE>, <e, SAL>, <e, COMM>, <e, DEPTNO>] Scan includes all fields from DEPT: [<d, DEPTNO>, <d, DNAME>, <d, LOC>] Join builds a concatenation of both lists of fields: [<e, EMPNO>, <e, ENAME>, <e, JOB>, <e, MGR>, <e, HIREDATE>, <e, SAL>, <e, COMM>, <e, DEPTNO>, <d, DEPTNO>, <d, DNAME>, <d, LOC>] Project retains rel aliases of individual fields, assigning null to new fields: [<e, ENAME>, <d, DNAME>, <e, MGR>, <null, $f2>] Alias ( as() ) sets the alias of all fields: [<x, ENAME>, <x, DNAME>, <x, MGR>, <x, $f2>]
          Hide
          julianhyde Julian Hyde added a comment -

          Frame.right is a list of pairs, each considered as a (key, value) pair, and at any point the number of pairs is equal to the number of "tables" that are in scope. At point 1 there is one table [e]; at point 2 and point 3 there are two tables [e, d], and at point 5 there is one table [x]. It's not clear how many tables there should be at point 4, and we should discuss.

          The "key" of each pair is a table alias, and the "value" is its record type (i.e. the list of field names and types). For instance, at point 3, there are the following 2 pairs:

          • key: 'e' value: RecordType(EMPNO INTEGER, ENAME VARCHAR, JOB VARCHAR, MGR INTEGER, HIREDATE DATE, SAL FLOAT, COMM FLOAT, DEPTNO INTEGER)
          • key: 'd', value: RecordType(DEPTNO INTEGER, DNAME VARCHAR, LOC VARCHAR)

          Now, back to point 4. I can see it would be useful if you can see through the project, and reference fields from the underlying aliases (d and e). But obviously you can't reference fields that have been projected away. Therefore I think that project should create two NEW aliases d and e with row-types that are subsets of the original d and e row types. What do you think?

          Show
          julianhyde Julian Hyde added a comment - Frame.right is a list of pairs, each considered as a (key, value) pair, and at any point the number of pairs is equal to the number of "tables" that are in scope. At point 1 there is one table [e] ; at point 2 and point 3 there are two tables [e, d] , and at point 5 there is one table [x] . It's not clear how many tables there should be at point 4, and we should discuss. The "key" of each pair is a table alias, and the "value" is its record type (i.e. the list of field names and types). For instance, at point 3, there are the following 2 pairs: key: 'e' value: RecordType(EMPNO INTEGER, ENAME VARCHAR, JOB VARCHAR, MGR INTEGER, HIREDATE DATE, SAL FLOAT, COMM FLOAT, DEPTNO INTEGER) key: 'd', value: RecordType(DEPTNO INTEGER, DNAME VARCHAR, LOC VARCHAR) Now, back to point 4. I can see it would be useful if you can see through the project, and reference fields from the underlying aliases (d and e). But obviously you can't reference fields that have been projected away. Therefore I think that project should create two NEW aliases d and e with row-types that are subsets of the original d and e row types. What do you think?
          Hide
          jbalint@gmail.com Jess Balint added a comment -

          you said:

          at point 2 and point 3 there are two tables [e, d]

          There's a difference between these two points. At point 2, there are two frames on the stack. Each frame has only one alias in the list. At point 3, there is one stack frame with two aliases (and row types) in the list.

          With the relational algebra, we don't consider tables "in scope", but instead we consider each expression to be a relation itself. The current approach accumulates aliases and it works for joins until there is a projection because the set of fields can be decomposed into sequential disjoint lists - represented by the list of pairs. Howver, this artificially limits the propagation of aliases when the sequential lists are no longer maintained, e.g. a projection with arbitrary ordering of fields from both tables in the join.

          Unless I'm wrong, the projection needs to maintain a list of pairs in Frame.right whereby the (concatenated list of) row types fields correspond exactly with the projected elements. You suggested:

          that project should create two NEW aliases d and e with row-types that are subsets of the original d and e row types

          Applying this to the example, we would need two pairs for e as projection's use of e's fields is not sequential. Two aliases are not enough. This would require something like this at point 4:
          [<e, RecordType(ENAME VARCHAR)>, <d, RecordType(DNAME VARCHAR)>, <e, RecordType(MGR INTEGER)>, <null, RecordType($f2 INTEGER)>]

          The change I am proposing is not too dissimilar but treats each stack frame as one relation (not a sequence of aliased relations) with a single list of fields. Each field includes the relation alias from where it originated.

          Thanks.

          Show
          jbalint@gmail.com Jess Balint added a comment - you said: at point 2 and point 3 there are two tables [e, d] There's a difference between these two points. At point 2, there are two frames on the stack. Each frame has only one alias in the list. At point 3, there is one stack frame with two aliases (and row types) in the list. With the relational algebra, we don't consider tables "in scope", but instead we consider each expression to be a relation itself. The current approach accumulates aliases and it works for joins until there is a projection because the set of fields can be decomposed into sequential disjoint lists - represented by the list of pairs. Howver, this artificially limits the propagation of aliases when the sequential lists are no longer maintained, e.g. a projection with arbitrary ordering of fields from both tables in the join. Unless I'm wrong, the projection needs to maintain a list of pairs in Frame.right whereby the (concatenated list of) row types fields correspond exactly with the projected elements. You suggested: that project should create two NEW aliases d and e with row-types that are subsets of the original d and e row types Applying this to the example, we would need two pairs for e as projection's use of e 's fields is not sequential. Two aliases are not enough. This would require something like this at point 4: [<e, RecordType(ENAME VARCHAR)>, <d, RecordType(DNAME VARCHAR)>, <e, RecordType(MGR INTEGER)>, <null, RecordType($f2 INTEGER)>] The change I am proposing is not too dissimilar but treats each stack frame as one relation (not a sequence of aliased relations) with a single list of fields. Each field includes the relation alias from where it originated. Thanks.
          Hide
          julianhyde Julian Hyde added a comment - - edited

          I think you're proposing that each field has a two-part qualified name in addition to its actual field name, and can be referenced by either. I think that would be fine.

          I believe you can apply multiple aliases, for example

          builder.table("emp")
            .as("e")
            .project(...)
            .as("e2")
            .table("dept")
            .as("d")
            .join("deptno")
            .as("ed")
            .table("location")
           .join("locid);
          

          You can reference the "empno" field in 4 ways: "empno", ["e", "empno"], ["e2", "empno"], ["ed", "empno"].

          With the relational algebra, we don't consider tables "in scope", but instead we consider each expression to be a relation itself.

          Yes, but we're augmenting relational algebra a bit. The output of the builder is pure relational algebra. But the builder maintains some intermediate state to make the process a bit easier. In particular, the "table alias" concept, which is somewhat similar to table aliases in SQL, makes it easier to calculate field offsets after joins have occurred. Your proposal achieves that, I think.

          Show
          julianhyde Julian Hyde added a comment - - edited I think you're proposing that each field has a two-part qualified name in addition to its actual field name, and can be referenced by either. I think that would be fine. I believe you can apply multiple aliases, for example builder.table( "emp" ) .as( "e" ) .project(...) .as( "e2" ) .table( "dept" ) .as( "d" ) .join( "deptno" ) .as( "ed" ) .table( "location" ) .join("locid); You can reference the "empno" field in 4 ways: "empno", ["e", "empno"] , ["e2", "empno"] , ["ed", "empno"] . With the relational algebra, we don't consider tables "in scope", but instead we consider each expression to be a relation itself. Yes, but we're augmenting relational algebra a bit. The output of the builder is pure relational algebra. But the builder maintains some intermediate state to make the process a bit easier. In particular, the "table alias" concept, which is somewhat similar to table aliases in SQL, makes it easier to calculate field offsets after joins have occurred. Your proposal achieves that, I think.
          Hide
          jbalint@gmail.com Jess Balint added a comment -

          Thanks. PR opened

          Show
          jbalint@gmail.com Jess Balint added a comment - Thanks. PR opened
          Hide
          julianhyde Julian Hyde added a comment -

          Looks basically good.

          • Pig requires that the name of the aggregate is retained. Can you revert the change you made to PigletTest and make sure it passes.
          • Is there a really good reason why DEPTNO0 became DEPTNO9 in RelOptRulesTest.xml? We'd like derived field names to be stable, otherwise this change will break downstream projects.
          • It doesn't look as if the set of strings is immutable. Specifically, in RelBuilder#as. Could you use an ImmutableSet? If Pair<ImmutableSet<String>, RelDataTypeField> is too verbose, create class FrameField extends Pair<ImmutableSet<String>, RelDataTypeField>.
          • The standard formatting (e.g. for multi-line strings and chained method calls) is not to align, but indent 4. I am happy to fix these before I commit.
          Show
          julianhyde Julian Hyde added a comment - Looks basically good. Pig requires that the name of the aggregate is retained. Can you revert the change you made to PigletTest and make sure it passes. Is there a really good reason why DEPTNO0 became DEPTNO9 in RelOptRulesTest.xml? We'd like derived field names to be stable, otherwise this change will break downstream projects. It doesn't look as if the set of strings is immutable. Specifically, in RelBuilder#as . Could you use an ImmutableSet ? If Pair<ImmutableSet<String>, RelDataTypeField> is too verbose, create class FrameField extends Pair<ImmutableSet<String>, RelDataTypeField> . The standard formatting (e.g. for multi-line strings and chained method calls) is not to align, but indent 4. I am happy to fix these before I commit.
          Hide
          jbalint@gmail.com Jess Balint added a comment -

          Pig requires that the name of the aggregate is retained. Can you revert the change you made to PigletTest and make sure it passes.

          Not 100% clear on what the alias is doing here, but the reverted test is passing.

          Is there a really good reason why DEPTNO0 became DEPTNO9 in RelOptRulesTest.xml? We'd like derived field names to be stable, otherwise this change will break downstream projects.

          It was slightly less code to do it this way. I've changed it to name the duplicate fields the same way as was done before. One small caveat is that RelOptRulesTest.testReduceCasts() and StreamTest.testStreamToRelationJoin() require minor renamings but are more consistent now.

          It doesn't look as if the set of strings is immutable.

          Fixed.

          The standard formatting (e.g. for multi-line strings and chained method calls) is not to align, but indent 4. I am happy to fix these before I commit.

          Do you do it in your IDE? If it's not much work for you, that would be great. Otherwise, I can do it. Do the checkstyle rules need to be amended?

          Show
          jbalint@gmail.com Jess Balint added a comment - Pig requires that the name of the aggregate is retained. Can you revert the change you made to PigletTest and make sure it passes. Not 100% clear on what the alias is doing here, but the reverted test is passing. Is there a really good reason why DEPTNO0 became DEPTNO9 in RelOptRulesTest.xml? We'd like derived field names to be stable, otherwise this change will break downstream projects. It was slightly less code to do it this way. I've changed it to name the duplicate fields the same way as was done before. One small caveat is that RelOptRulesTest.testReduceCasts() and StreamTest.testStreamToRelationJoin() require minor renamings but are more consistent now. It doesn't look as if the set of strings is immutable. Fixed. The standard formatting (e.g. for multi-line strings and chained method calls) is not to align, but indent 4. I am happy to fix these before I commit. Do you do it in your IDE? If it's not much work for you, that would be great. Otherwise, I can do it. Do the checkstyle rules need to be amended?
          Hide
          julianhyde Julian Hyde added a comment -

          Fixed in http://git-wip-us.apache.org/repos/asf/calcite/commit/31d68f71. Thanks for the PR, Jess Balint!

          I re-indented your code a bit (my Intellij settings seem to do the right thing, but unfortunately checkstyle doesn't seem capable of checking the indentation) and created a class Field extends Pair, which made a few things more concise. Thanks for restoring the functionality in PigletRelBuilder. It is necessary for Pig Latin GROUP and COGROUP operators, for example the following (from the Pig Latin reference manual):

          A = load 'data' as (x, y);
          B = load 'data' as (x, z);
          C = cogroup A by x, B by x;
          D = foreach C generate flatten(A), flatten(B);
          

          Note that relation C has multi-set columns A and B. The columns inherit their names from the previous relation names (what we would call table aliases). What you have done in PigletRelBuilder doesn't look entirely robust if there are multiple relations, but it is good enough.

          Show
          julianhyde Julian Hyde added a comment - Fixed in http://git-wip-us.apache.org/repos/asf/calcite/commit/31d68f71 . Thanks for the PR, Jess Balint ! I re-indented your code a bit (my Intellij settings seem to do the right thing, but unfortunately checkstyle doesn't seem capable of checking the indentation) and created a class Field extends Pair , which made a few things more concise. Thanks for restoring the functionality in PigletRelBuilder . It is necessary for Pig Latin GROUP and COGROUP operators, for example the following (from the Pig Latin reference manual ): A = load 'data' as (x, y); B = load 'data' as (x, z); C = cogroup A by x, B by x; D = foreach C generate flatten(A), flatten(B); Note that relation C has multi-set columns A and B . The columns inherit their names from the previous relation names (what we would call table aliases). What you have done in PigletRelBuilder doesn't look entirely robust if there are multiple relations, but it is good enough.
          Hide
          julianhyde Julian Hyde added a comment -

          Resolved in release 1.11.0 (2017-01-11).

          Show
          julianhyde Julian Hyde added a comment - Resolved in release 1.11.0 (2017-01-11).

            People

            • Assignee:
              julianhyde Julian Hyde
              Reporter:
              jbalint@gmail.com Jess Balint
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development