Uploaded image for project: 'Cassandra'
  1. Cassandra
  2. CASSANDRA-9415

Implicit use of Materialized Views on SELECT

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Later
    • Fix Version/s: None
    • Component/s: None
    • Labels:

      Description

      CASSANDRA-6477 introduces Materialized Views. This greatly simplifies the write path for the best-practice of "query tables". But it does not simplify the read path as much as our users want/need.

      We suggest to folks to create multiple copies of their base table optimized for certain queries - hence "query table". For example, we may have a USER table with two type of queries: lookup by userid and lookup by email address. We would recommend creating 2 tables USER_BY_USERID and USER_BY_EMAIL. Both would have the exact same schema, with the same PRIMARY KEY columns, but different PARTITION KEY - the first would be USERID and the second would be EMAIL.

      One complicating thing with this approach is that the application now needs to know that when it INSERT/UPDATE/DELETEs from the base table it needs to INSERT/UPDATE/DELETE from all of the query tables as well. CASSANDRA-6477 covers this nicely.

      However, the other side of the coin is that the application needs to know which query table to leverage based on the selection criteria. Using the example above, if the query has a predicate such as "WHERE userid = 'bhess'", then USERS_BY_USERID is the better table to use. Similarly, when the predicate is "WHERE email = 'bhess@company.whatever'", USERS_BY_EMAIL is appropriate.

      On INSERT/UPDATE/DELETE, Materialized Views essentially give a single "name" to the collection of tables. You do operations just on the base table. It is very attractive for the SELECT side as well. It would be very good to allow an application to simply do "SELECT * FROM users WHERE userid = 'bhess'" and have that query implicitly leverage the USERS_BY_USERID materialized view.

      For additional use cases, especially analytics use cases like in Spark, this allows the Spark code to simply push down the query without having to know about all of the MVs that have been set up. The system will route the query appropriately. And if additional MVs are necessary to make a query run better/faster, then those MVs can be set up and Spark will implicitly leverage them.

        Issue Links

          Activity

          Hide
          rssvihla Ryan Svihla added a comment -

          This would be a big win for a lot of analytics tools and would bring us ever closer to RDBMS for ease of use. I can see this greatly smoothing the learning curve for new users as well.

          Show
          rssvihla Ryan Svihla added a comment - This would be a big win for a lot of analytics tools and would bring us ever closer to RDBMS for ease of use. I can see this greatly smoothing the learning curve for new users as well.
          Hide
          iamaleksey Aleksey Yeschenko added a comment -

          This sounds interesting. May or may not be doable to provide this by default - depending on consistency provided by CASSANDRA-6477 - but certainly something to look into.

          Show
          iamaleksey Aleksey Yeschenko added a comment - This sounds interesting. May or may not be doable to provide this by default - depending on consistency provided by CASSANDRA-6477 - but certainly something to look into.
          Hide
          tjake T Jake Luciani added a comment -

          This might be a better fit for the drivers to implement. since they could route the queries to the appropriate nodes directly.

          Show
          tjake T Jake Luciani added a comment - This might be a better fit for the drivers to implement. since they could route the queries to the appropriate nodes directly.
          Hide
          carlyeks Carl Yeksigian added a comment -

          There are a couple of problems with substituting materialized views in place of base tables transparently:

          1. Unless the MV includes all of the columns exactly as the underlying table has them, select statements which are valid on the base table will not be valid on the MV
          2. When where clauses are allowed for the MV, the MV can't be used for the select statement since it isn't a full copy of the base table

          Also, using a different table was a benefit because the driver's will be able to easily route the queries (CASSANDRA-8517).

          Overall, I'm weary of transforming user's queries for them; I'd rather users who use MV use the views directly. It is less likely that we would change the rules surrounding queries to a MV directly versus base table query transformations.

          Show
          carlyeks Carl Yeksigian added a comment - There are a couple of problems with substituting materialized views in place of base tables transparently: Unless the MV includes all of the columns exactly as the underlying table has them, select statements which are valid on the base table will not be valid on the MV When where clauses are allowed for the MV, the MV can't be used for the select statement since it isn't a full copy of the base table Also, using a different table was a benefit because the driver's will be able to easily route the queries ( CASSANDRA-8517 ). Overall, I'm weary of transforming user's queries for them; I'd rather users who use MV use the views directly. It is less likely that we would change the rules surrounding queries to a MV directly versus base table query transformations.
          Hide
          iamaleksey Aleksey Yeschenko added a comment -

          For the record - if we did this at all, it wouldn't be a part of 3.0.0. Maybe not even 3.X at all.

          Show
          iamaleksey Aleksey Yeschenko added a comment - For the record - if we did this at all, it wouldn't be a part of 3.0.0. Maybe not even 3.X at all.
          Hide
          jbellis Jonathan Ellis added a comment -

          I'm -0 leaning to -1 on this. Explicit is better than implicit and we should not train users to expect us to silently rewrite queries for them. Nor am I aware of any RDBMS that does this.

          Show
          jbellis Jonathan Ellis added a comment - I'm -0 leaning to -1 on this. Explicit is better than implicit and we should not train users to expect us to silently rewrite queries for them. Nor am I aware of any RDBMS that does this.
          Hide
          brianmhess Brian Hess added a comment -

          Jonathan Ellis Oracle, DB2, and SQL Server do this (at least - maybe others).

          In Oracle it is Materialized Views (see http://docs.oracle.com/cd/B28359_01/server.111/b28313/basicmv.htm):

          The end user queries the tables and views at the detail data level. The query rewrite mechanism in the Oracle server automatically rewrites the SQL query to use the summary tables. This mechanism reduces response time for returning results from the query. Materialized views within the data warehouse are transparent to the end user or to the database application."

          In DB2 it is Materialized Query Tables (see http://www.ibm.com/developerworks/data/library/techarticle/dm-0509melnyk/)

          Materialized query tables can significantly improve the performance of queries, especially complex queries. If the optimizer determines that a query or part of a query could be resolved using an MQT, the query might be rewritten to take advantage of the MQT.

          In SQL Server it is Indexed Views (see https://msdn.microsoft.com/en-us/library/dd171921(SQL.100).aspx):

          The indexed view can be used in a query execution in two ways. The query can reference the indexed view directly, or, more importantly, the query optimizer can select the view if it determines that the view can be substituted for some or all of the query in the lowest-cost query plan. In the second case, the indexed view is used instead of the underlying tables and their ordinary indexes. The view does not need to be referenced in the query for the query optimizer to use it during query execution. This allows existing applications to benefit from the newly created indexed views without changing those applications.

          Show
          brianmhess Brian Hess added a comment - Jonathan Ellis Oracle, DB2, and SQL Server do this (at least - maybe others). In Oracle it is Materialized Views (see http://docs.oracle.com/cd/B28359_01/server.111/b28313/basicmv.htm): The end user queries the tables and views at the detail data level. The query rewrite mechanism in the Oracle server automatically rewrites the SQL query to use the summary tables. This mechanism reduces response time for returning results from the query. Materialized views within the data warehouse are transparent to the end user or to the database application." In DB2 it is Materialized Query Tables (see http://www.ibm.com/developerworks/data/library/techarticle/dm-0509melnyk/ ) Materialized query tables can significantly improve the performance of queries, especially complex queries. If the optimizer determines that a query or part of a query could be resolved using an MQT, the query might be rewritten to take advantage of the MQT. In SQL Server it is Indexed Views (see https://msdn.microsoft.com/en-us/library/dd171921(SQL.100).aspx): The indexed view can be used in a query execution in two ways. The query can reference the indexed view directly, or, more importantly, the query optimizer can select the view if it determines that the view can be substituted for some or all of the query in the lowest-cost query plan. In the second case, the indexed view is used instead of the underlying tables and their ordinary indexes. The view does not need to be referenced in the query for the query optimizer to use it during query execution. This allows existing applications to benefit from the newly created indexed views without changing those applications.
          Hide
          brianmhess Brian Hess added a comment -

          Wouldn't the preparing of the statement handle that?

          Show
          brianmhess Brian Hess added a comment - Wouldn't the preparing of the statement handle that?
          Hide
          brianmhess Brian Hess added a comment -

          For #1, that would be part of choosing the MV. If it didn't contain all the columns in the projection list then you couldn't use that MV.
          For #2, that would also be understandable by C*. That is, you can see if the predicate of the query "matches" the predicate of the MV. For example, if the MV had a predicate like "WHERE x > 100", then if the query had a predicate like "WHERE x=200" then you know you could use the MV.

          As to the driver being able to route the query - in order for the driver to route the query you need to prepare the statement. In preparing the statement, you would choose the MV, and the driver gets that anyway.

          Show
          brianmhess Brian Hess added a comment - For #1, that would be part of choosing the MV. If it didn't contain all the columns in the projection list then you couldn't use that MV. For #2, that would also be understandable by C*. That is, you can see if the predicate of the query "matches" the predicate of the MV. For example, if the MV had a predicate like "WHERE x > 100", then if the query had a predicate like "WHERE x=200" then you know you could use the MV. As to the driver being able to route the query - in order for the driver to route the query you need to prepare the statement. In preparing the statement, you would choose the MV, and the driver gets that anyway.
          Hide
          brianmhess Brian Hess added a comment -

          We already silently rewrite queries for users when they use a secondary index. The user doesn't specify that C* should consult the secondary index when they query; it is implicit.

          Show
          brianmhess Brian Hess added a comment - We already silently rewrite queries for users when they use a secondary index. The user doesn't specify that C* should consult the secondary index when they query; it is implicit.
          Hide
          slebresne Sylvain Lebresne added a comment -

          While I understand the interest it would have for analytics, I, like others expressed, tend to not love adding an implicit way to do something when we have an explicit one that seems simple enough to me (and we've constantly tried to favor explicit over implicit in C* as a design choice).

          But the more important problem is that doing this would break the consistency level guarantees. And for that reason I'm -1 on doing this completely implicitly (unless/until we somehow fix the CL guarantees made on MVs but that's probably not around the corner either). In theory I suppose we could add some form of flag for the user to say "I'm happy to give up any CL guarantee but please use MVs if you can" but that's pretty messy imo.

          I'll also note that while it would undeniably be convenient for analytics, it's totally possible for analytic drivers to check the defined MVs and do the exact same work on their side.

          Anyway, given what's above I'm going to close as "Later", by which I mean "We're not planning to implement this anytime soon so let's be upfront about it, but maybe later, when we have more mileage with MVs and if we've improve the CL guarantees on them, we might look again at this with fresh eyes".

          Show
          slebresne Sylvain Lebresne added a comment - While I understand the interest it would have for analytics, I, like others expressed, tend to not love adding an implicit way to do something when we have an explicit one that seems simple enough to me (and we've constantly tried to favor explicit over implicit in C* as a design choice). But the more important problem is that doing this would break the consistency level guarantees. And for that reason I'm -1 on doing this completely implicitly (unless/until we somehow fix the CL guarantees made on MVs but that's probably not around the corner either). In theory I suppose we could add some form of flag for the user to say "I'm happy to give up any CL guarantee but please use MVs if you can" but that's pretty messy imo. I'll also note that while it would undeniably be convenient for analytics, it's totally possible for analytic drivers to check the defined MVs and do the exact same work on their side. Anyway, given what's above I'm going to close as "Later", by which I mean "We're not planning to implement this anytime soon so let's be upfront about it, but maybe later, when we have more mileage with MVs and if we've improve the CL guarantees on them, we might look again at this with fresh eyes".

            People

            • Assignee:
              Unassigned
              Reporter:
              brianmhess Brian Hess
            • Votes:
              5 Vote for this issue
              Watchers:
              11 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development