Uploaded image for project: 'Calcite'
  1. Calcite
  2. CALCITE-1682

New metadata providers for expression column origin and all predicates in plan

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.12.0
    • Fix Version/s: 1.13.0
    • Component/s: core
    • Labels:
      None

      Description

      I am working on the integration of materialized view rewriting within Hive.

      Once a view matches an operator plan, rewriting is split vastly in two steps. The first step will verify that the input to the root operator of the matched plan is equivalent or contained within the input to the root operator of the query representing the view. The second step will trigger a unify rule, which tries to rewrite the matched operator tree into a scan on the view and possibly some additional operators to compute the exact results needed by the query (think about Project that alters the column order, additional Filter on the view, additional Join operation, etc.)

      If we focus on step 1, checking equivalence/containment, I would like to extend the metadata providers in Calcite to give us more information about the matched (sub)plan. In particular, I am thinking on:

      • Expression column origin. Currently Calcite can provide the column origins for a certain column and whether it is derived or not. However, we would need to obtain the expression that generated a certain column. This expression should contain references to the input tables. For instance, given expression column c, the new md provider would return that it was generated by expression A.a + B.b.
      • All predicates. Currently Calcite can extract predicates that have been applied on an RelNode output (we can think on them as constraints on the output). However, I would like to extract all predicates that have been applied on a given RelNode (sub)plan. Since nodes might not be part of the output, expressions should contain references to the input tables. For instance, the new md provider might return the expressions A.a + B.b > C.c AND D.d = 100.
      • PK-FK relationship. I do not plan to implement this one immediately. However, exposing this information (given it is provided) can help us to trigger more rewriting containing join operators. Thus, I was wondering if it is worth adding it.

      Once this information is available, we can rely on it to implement logic similar to [1] to check whether a given (sub)plan is equivalent/contained within a given view.

      One question I have is about representing the table columns as a RexNode, as I think it is the easiest way to be returned by the new metadata providers. I checked RexPatternFieldRef and I think it will meet our requirements: alpha would be the qualified table name, while the index is the column idx for the table. Thoughts?

      I have started working on this and will provide a patch shortly; feedback is greatly appreciated.

      [1] ftp://ftp10.us.freebsd.org/users/azhang/disc/SIGMOD/pdf-files/331/202-optimizing.pdf

        Issue Links

          Activity

          Hide
          jcamachorodriguez Jesus Camacho Rodriguez added a comment -

          Resolved in release 1.13.0 (2017-06-26).

          Show
          jcamachorodriguez Jesus Camacho Rodriguez added a comment - Resolved in release 1.13.0 (2017-06-26).
          Show
          jcamachorodriguez Jesus Camacho Rodriguez added a comment - Fixed in http://git-wip-us.apache.org/repos/asf/calcite/commit/41b05d7 .
          Hide
          julianhyde Julian Hyde added a comment -

          Jesus Camacho Rodriguez, As we discussed today, you need some kind of "wrapper table" to prevent metadata caching. And other ways of creating a "grey box" where you can control which parts of the behavior/properties of the underlying table leak through the abstraction. I think you should consider using StarTable, with just a single underlying table, as that wrapper table. It's not the original intent of StarTable but I think it can be made to work.

          Show
          julianhyde Julian Hyde added a comment - Jesus Camacho Rodriguez , As we discussed today, you need some kind of "wrapper table" to prevent metadata caching. And other ways of creating a "grey box" where you can control which parts of the behavior/properties of the underlying table leak through the abstraction. I think you should consider using StarTable , with just a single underlying table, as that wrapper table. It's not the original intent of StarTable but I think it can be made to work.
          Hide
          jcamachorodriguez Jesus Camacho Rodriguez added a comment -

          Julian Hyde, initial work in case you want to take a look is in https://github.com/jcamachor/calcite/tree/CALCITE-agg-views branch.

          I am working on adding one additional metadata provider in the same PR that would extract the RelNode types for a given (sub)plan.

          Show
          jcamachorodriguez Jesus Camacho Rodriguez added a comment - Julian Hyde , initial work in case you want to take a look is in https://github.com/jcamachor/calcite/tree/CALCITE-agg-views branch. I am working on adding one additional metadata provider in the same PR that would extract the RelNode types for a given (sub)plan.
          Hide
          jcamachorodriguez Jesus Camacho Rodriguez added a comment -

          Using constraints could be an elegant way of defining it. Since this would be useful to check whether rows are in the view with the right duplication factor, will IN imply that there is a single match? Or will we need a stronger constraint?

          Show
          jcamachorodriguez Jesus Camacho Rodriguez added a comment - Using constraints could be an elegant way of defining it. Since this would be useful to check whether rows are in the view with the right duplication factor, will IN imply that there is a single match? Or will we need a stronger constraint?
          Hide
          julianhyde Julian Hyde added a comment -

          Regarding PK-FK relationship. There are 2 ways to do this already. First is to use a lattice (which implicitly models FK relationships to dimension tables). Second is a constraint based on a RexSubQuery (basically, you'd place a constraint on EMP saying "DEPTNO IN (SELECT DEPTNO FROM DEPT)").

          Show
          julianhyde Julian Hyde added a comment - Regarding PK-FK relationship. There are 2 ways to do this already. First is to use a lattice (which implicitly models FK relationships to dimension tables). Second is a constraint based on a RexSubQuery (basically, you'd place a constraint on EMP saying "DEPTNO IN (SELECT DEPTNO FROM DEPT)").
          Hide
          julianhyde Julian Hyde added a comment -

          The idea you suggested in the meeting - basically, treating each use of a table within a query as unique - seems simple and likely to work. I would tend to work from each use in the text of the query, not each TableScan, because Calcite will aggressively combine scans of the same table into the same object.

          It might be useful to register all table uses inside the RelOptCluster. Give each table use an ID such that if the table is used only once the ID is the same as the fully-qualified table name.

          It's almost as if you create a dummy table HR.EMP$2 for the second use of EMP in the query.

          Show
          julianhyde Julian Hyde added a comment - The idea you suggested in the meeting - basically, treating each use of a table within a query as unique - seems simple and likely to work. I would tend to work from each use in the text of the query, not each TableScan, because Calcite will aggressively combine scans of the same table into the same object. It might be useful to register all table uses inside the RelOptCluster. Give each table use an ID such that if the table is used only once the ID is the same as the fully-qualified table name. It's almost as if you create a dummy table HR.EMP$2 for the second use of EMP in the query.
          Hide
          jcamachorodriguez Jesus Camacho Rodriguez added a comment -

          Julian Hyde, one doubt I have is concerning the unique identification of a table (within a TableScan) in the plan for expression origin/lineage. If we have a self-join on table A with condition a1.x = a2.y, I would like lineage to be different than a filter on single table A with condition a.x = a.y. However, getQualifiedName for the concerning tables will return the same qualified name (obviously)... Any thoughts on how we solve this?

          Show
          jcamachorodriguez Jesus Camacho Rodriguez added a comment - Julian Hyde , one doubt I have is concerning the unique identification of a table (within a TableScan) in the plan for expression origin/lineage. If we have a self-join on table A with condition a1.x = a2.y, I would like lineage to be different than a filter on single table A with condition a.x = a.y. However, getQualifiedName for the concerning tables will return the same qualified name (obviously)... Any thoughts on how we solve this?
          Show
          jcamachorodriguez Jesus Camacho Rodriguez added a comment - Cc Julian Hyde , Ashutosh Chauhan

            People

            • Assignee:
              jcamachorodriguez Jesus Camacho Rodriguez
              Reporter:
              jcamachorodriguez Jesus Camacho Rodriguez
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development