Uploaded image for project: 'Calcite'
  1. Calcite
  2. CALCITE-1389

Add rule to perform rewriting of queries using materialized views with joins

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.11.0
    • Component/s: core
    • Labels:
      None

      Description

      I've been looking into implementing the approach from the following paper. It's very nicely written and easy to follow. It also doesn't require trying different join permutations. I'm starting with several additional restrictions (only equijoins, no aggregations, etc.)

      ftp://ftp.cse.buffalo.edu/users/azhang/disc/SIGMOD/pdf-files/331/202-optimizing.pdf

      Thanks to Jesus Camacho Rodriguez for his help in sorting out some issues with the rule so far.

        Activity

        Hide
        michaelmior Michael Mior added a comment -

        Pull request available on GitHub.

        https://github.com/apache/calcite/pull/284

        Show
        michaelmior Michael Mior added a comment - Pull request available on GitHub. https://github.com/apache/calcite/pull/284
        Hide
        jcamachorodriguez Jesus Camacho Rodriguez added a comment -

        Michael Mior, I think you are missing some files in the PR (at least the rule itself)?

        Show
        jcamachorodriguez Jesus Camacho Rodriguez added a comment - Michael Mior , I think you are missing some files in the PR (at least the rule itself)?
        Hide
        julianhyde Julian Hyde added a comment -

        I haven't looked at your code yet but I am familiar with the paper. Did you consider implementing that paper by building on top of Calcite's lattice construct?

        The problem with "free form" materialized views is that there tend to be a lot of them. The paper aims to solve that problem, and so do lattices. But lattices are better: they can gather statistics and recommend creating views that don't exist but would likely be useful.

        Lattices are essentially the same as the SPJ views described in the paper, but of course today they need to be created by hand. (I would argue that for DW-style workloads, creating lattices by hand is a lot more practical than creating MVs by hand. This is not just about making the optimizer's life easier, but also about making the DBA's life easier. MVs are not easy to manage operationally.) By anyway, if people have created a lot of MVs by hand, my idea was to have an algorithm that created lattices automatically, and thereby reduce the cost of examining all of those MVs.

        The main missing piece, as I see it, is an algorithm that, given a collection of MVs, creates an optimal set of lattices such that each MV belongs to a lattice.

        Show
        julianhyde Julian Hyde added a comment - I haven't looked at your code yet but I am familiar with the paper. Did you consider implementing that paper by building on top of Calcite's lattice construct? The problem with "free form" materialized views is that there tend to be a lot of them. The paper aims to solve that problem, and so do lattices. But lattices are better: they can gather statistics and recommend creating views that don't exist but would likely be useful. Lattices are essentially the same as the SPJ views described in the paper, but of course today they need to be created by hand. (I would argue that for DW-style workloads, creating lattices by hand is a lot more practical than creating MVs by hand. This is not just about making the optimizer's life easier, but also about making the DBA's life easier. MVs are not easy to manage operationally.) By anyway, if people have created a lot of MVs by hand, my idea was to have an algorithm that created lattices automatically, and thereby reduce the cost of examining all of those MVs. The main missing piece, as I see it, is an algorithm that, given a collection of MVs, creates an optimal set of lattices such that each MV belongs to a lattice.
        Hide
        jcamachorodriguez Jesus Camacho Rodriguez added a comment -

        Michael Mior, I think you are missing some files in the PR (at least the rule itself)?

        Michael Mior, have you checked this? I could review and try to include it in 1.10.0. Thanks

        Show
        jcamachorodriguez Jesus Camacho Rodriguez added a comment - Michael Mior , I think you are missing some files in the PR (at least the rule itself)? Michael Mior , have you checked this? I could review and try to include it in 1.10.0. Thanks
        Hide
        michaelmior Michael Mior added a comment -

        Sorry, somehow I didn't see your earlier comment. Not sure how I missed this, but I added it now.

        Show
        michaelmior Michael Mior added a comment - Sorry, somehow I didn't see your earlier comment. Not sure how I missed this, but I added it now.
        Hide
        julianhyde Julian Hyde added a comment -

        I reviewed your change. It looks fine, but a few comments:

        • I don't get a sense that this is very robust. It may be robust, but to be convinced, I'd need to see a lot more tests covering more of the potential patterns.
        • I can't figure out how it would fit into multi-phase planning. It seems to be a Hep step. So, would it happen early in the planning process, before Volcano?
        • Can it handle multiple MVs? What if more than one matches?
        • If you have an MV on A join B, can it match a query A join B join C; can it match a query B join C?
        • What if there are intervening GROUP BYs between the items in the join tree (or SELECT DISTINCTs or semi-joins).
        • What if the MV uses the same table twice, by different join paths? Or if the the MV doesn't use any table twice, but the MV does?
        • What are the performance characteristics if the number of MVs grows large?

        In short, I can't tell how far this is along the road to production quality. It doesn't need to be production quality (we can flag it as 'experimental') but we do need to know what cases it can handle, what the deficiencies are, and which of those deficiencies could be addressed by future work.

        Code style:

        Show
        julianhyde Julian Hyde added a comment - I reviewed your change. It looks fine, but a few comments: I don't get a sense that this is very robust. It may be robust, but to be convinced, I'd need to see a lot more tests covering more of the potential patterns. I can't figure out how it would fit into multi-phase planning. It seems to be a Hep step. So, would it happen early in the planning process, before Volcano? Can it handle multiple MVs? What if more than one matches? If you have an MV on A join B , can it match a query A join B join C ; can it match a query B join C ? What if there are intervening GROUP BYs between the items in the join tree (or SELECT DISTINCTs or semi-joins). What if the MV uses the same table twice, by different join paths? Or if the the MV doesn't use any table twice, but the MV does? What are the performance characteristics if the number of MVs grows large? In short, I can't tell how far this is along the road to production quality. It doesn't need to be production quality (we can flag it as 'experimental') but we do need to know what cases it can handle, what the deficiencies are, and which of those deficiencies could be addressed by future work. Code style: If you need to wrap arguments, e.g. https://github.com/apache/calcite/pull/284/files#diff-937b1dde0b0e60b098796a4a10d4a2e6R115 , don't align with the '(', instead indent 4. In method javadoc, use active "Produces ..." rather than imperative "Produce ..."
        Hide
        michaelmior Michael Mior added a comment -

        Definitely far from production quality IMO, but I think interesting enough to be included. I'll have to think over what would happen in each of the scenarios you presented above, but it's definitely not very robust at this point. Some more test cases clarifying the current behaviour are definitely in order.

        That said, I've tried to make the rule fairly restrictive so it won't try to rewrite cases that aren't correctly handled. For example, I check that all the joins are inner joins. My target for this is conjunctive queries with equality predicates and simple field references in the SELECT clause.

        Or if the the MV doesn't use any table twice, but the MV does?

        I assume one of the references to MV above should be query?

        I haven't thought specifically about the complexity of this approach, but I would expect poor performance with a large number of MVs. The paper gives some ideas on how to build index structures to optimize this. On a related note, if you have any suggestions on where an appropriate place in the code would be to construct such indexes and what object should own the index data, that would be helpful.

        As for where this fits into planning, I think this could still work as part of Volcano. Using MVs when possible is not always the best choice. It also enables cost-based selection of multiple MVs when that's an option.

        Thanks for the style suggestions. I fixed those issues.

        Show
        michaelmior Michael Mior added a comment - Definitely far from production quality IMO, but I think interesting enough to be included. I'll have to think over what would happen in each of the scenarios you presented above, but it's definitely not very robust at this point. Some more test cases clarifying the current behaviour are definitely in order. That said, I've tried to make the rule fairly restrictive so it won't try to rewrite cases that aren't correctly handled. For example, I check that all the joins are inner joins. My target for this is conjunctive queries with equality predicates and simple field references in the SELECT clause. Or if the the MV doesn't use any table twice, but the MV does? I assume one of the references to MV above should be query? I haven't thought specifically about the complexity of this approach, but I would expect poor performance with a large number of MVs. The paper gives some ideas on how to build index structures to optimize this. On a related note, if you have any suggestions on where an appropriate place in the code would be to construct such indexes and what object should own the index data, that would be helpful. As for where this fits into planning, I think this could still work as part of Volcano. Using MVs when possible is not always the best choice. It also enables cost-based selection of multiple MVs when that's an option. Thanks for the style suggestions. I fixed those issues.
        Hide
        julianhyde Julian Hyde added a comment -

        I'm happy for these changes to go in. But I'd like to have a sense for where the effort is going - and where it fits with related efforts.

        How would you feel about creating a page in our documentation about materialized views? It would cover regular materializations (with their general-purpose matching algorithm), lattices (specialized for SPJA), planning secondary indexes (as Maryann Xue has done in Phoenix), view recommendations, and life cycle issues (how do we invalidate materialized views? how to add the necessary hooks so that they appear in the schema). We could discuss which query patterns each kind of MV can handle; and compare with MVs in other systems such as Cassandra and Druid; and list some of the possible future directions. It would help our users figure out which are the best pieces of our (complicated) framework to use.

        Show
        julianhyde Julian Hyde added a comment - I'm happy for these changes to go in. But I'd like to have a sense for where the effort is going - and where it fits with related efforts. How would you feel about creating a page in our documentation about materialized views? It would cover regular materializations (with their general-purpose matching algorithm), lattices (specialized for SPJA), planning secondary indexes (as Maryann Xue has done in Phoenix), view recommendations, and life cycle issues (how do we invalidate materialized views? how to add the necessary hooks so that they appear in the schema). We could discuss which query patterns each kind of MV can handle; and compare with MVs in other systems such as Cassandra and Druid; and list some of the possible future directions. It would help our users figure out which are the best pieces of our (complicated) framework to use.
        Hide
        michaelmior Michael Mior added a comment -

        Sounds like a great idea. I will admit that I don't have a great handle on some of the aspects you mentioned, but I think all of those should be included in the documentation. I think this is worth opening up a separate issue for which I'll do if there are no objections. Maybe we can get some stable documentation on existing use of MVs before pushing this further.

        Show
        michaelmior Michael Mior added a comment - Sounds like a great idea. I will admit that I don't have a great handle on some of the aspects you mentioned, but I think all of those should be included in the documentation. I think this is worth opening up a separate issue for which I'll do if there are no objections. Maybe we can get some stable documentation on existing use of MVs before pushing this further.
        Hide
        julianhyde Julian Hyde added a comment -

        Yes, please open a separate issue for the doc, and go ahead and commit your PR.

        Show
        julianhyde Julian Hyde added a comment - Yes, please open a separate issue for the doc, and go ahead and commit your PR.
        Hide
        michaelmior Michael Mior added a comment -

        I think it might be good to disable this rule by default. That is, avoid adding it to the default planner constructed in CalcitePrepareImpl. Thoughts? The documentation would mention that it is necessary to manually add this rule and the goal would still be to eventually have this enabled by default.

        Show
        michaelmior Michael Mior added a comment - I think it might be good to disable this rule by default. That is, avoid adding it to the default planner constructed in CalcitePrepareImpl . Thoughts? The documentation would mention that it is necessary to manually add this rule and the goal would still be to eventually have this enabled by default.
        Hide
        julianhyde Julian Hyde added a comment -

        I'm fine with that. In fact, the documentation will help people decide which rules to enable to achieve a particular materialization strategy.

        Show
        julianhyde Julian Hyde added a comment - I'm fine with that. In fact, the documentation will help people decide which rules to enable to achieve a particular materialization strategy.
        Hide
        michaelmior Michael Mior added a comment -

        Any suggestions on how I can selectively enable the rule during tests? I'd like to keep the test case in MaterializationTest, but it will fail with the MaterializedViewJoinRule not enabled. I don't see an obvious to have it enabled just for the tests. (Ideally, just for that single test).

        Show
        michaelmior Michael Mior added a comment - Any suggestions on how I can selectively enable the rule during tests? I'd like to keep the test case in MaterializationTest, but it will fail with the MaterializedViewJoinRule not enabled. I don't see an obvious to have it enabled just for the tests. (Ideally, just for that single test).
        Hide
        julianhyde Julian Hyde added a comment -

        Maybe something involving CalciteAssert.AssertQuery.withHook(Hook.PROGRAM, ...)?

        Show
        julianhyde Julian Hyde added a comment - Maybe something involving CalciteAssert.AssertQuery.withHook(Hook.PROGRAM, ...) ?
        Hide
        michaelmior Michael Mior added a comment -

        Thanks! That did the trick.

        Show
        michaelmior Michael Mior added a comment - Thanks! That did the trick.
        Hide
        michaelmior Michael Mior added a comment -

        Fixed in 435b6d4.

        Show
        michaelmior Michael Mior added a comment - Fixed in 435b6d4 .
        Hide
        julianhyde Julian Hyde added a comment -

        Resolved in release 1.11.0 (2017-01-11).

        Show
        julianhyde Julian Hyde added a comment - Resolved in release 1.11.0 (2017-01-11).

          People

          • Assignee:
            michaelmior Michael Mior
            Reporter:
            michaelmior Michael Mior
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development