Uploaded image for project: 'Cassandra'
  1. Cassandra
  2. CASSANDRA-9664

Allow MV's select statements to be more complex

    Details

      Description

      Materialized Views add support for a syntax which includes a SELECT statement, but only allows selection of direct columns, and does not allow any filtering to take place.

      We should add support to the MV SELECT statement to bring better parity with the normal CQL SELECT statement, specifically simple functions in the selected columns, as well as specifying a WHERE clause.

        Issue Links

          Activity

          Hide
          boneill Brian ONeill added a comment -

          Will the scope of this ticket include aggregations? e.g.

          select avg(price), hour from stockprices;

          Show
          boneill Brian ONeill added a comment - Will the scope of this ticket include aggregations? e.g. select avg(price), hour from stockprices;
          Hide
          jbellis Jonathan Ellis added a comment -

          Materializing aggregates is out of scope here. See CASSANDRA-9778 for that.

          Show
          jbellis Jonathan Ellis added a comment - Materializing aggregates is out of scope here. See CASSANDRA-9778 for that.
          Hide
          thobbs Tyler Hobbs added a comment -

          I've discovered one limitation of this: I don't think we will be able to support filtering on regular columns in the base table due to timestamp issues.

          Suppose we have a base table like this:

          CREATE TABLE base (a int, b int, c int, d int, PRIMARY KEY (a, b));
          

          and a view:

          CREATE MATERIALIZED VIEW view AS
              SELECT * FROM base
              WHERE a IS NOT NULL
              AND b IS NOT NULL
              AND c = 1
              PRIMARY KEY (a, b)
          

          The view should only contain rows where c = 1.

          Now, suppose we do the following:

          INSERT INTO base (a, b, c, d) VALUES (0, 0, 1, 0) USING TIMESTAMP 0;
          UPDATE base SET c = 1 USING TIMESTAMP 2;
          UPDATE base SET c = 0 USING TIMESTAMP 3;  -- view row should be deleted w/ timestamp 3
          UPDATE base SET c = 1 USING TIMESTAMP 4;
          

          The third update results in a tombstone being written to the view row with a timestamp of 3 because c no longer matches our SELECT statement. The problem comes at the final update: now that c matches the SELECT again, we should reinsert a row into the view. For the reinsertion, we can use timestamp 4 for the row marker and c's cell, but what about d's cell? If we use its timestamp from the base row (0), the view tombstone supercedes it, leaving d as null. If we use timestamp 4 for d's cell, we may accidentally ignore updates to d. For example, this would update d in the base table but not the view:

          UPDATE base SET d = 1 WHERE a = 0 AND b = 0 USING TIMESTAMP 1;
          

          Note that the timestamp of 1 is higher than d's timestamp in the base table (0) but lower than d's timestamp in the view (4).

          In light of this, I suggest that for the first iteration of this we only allow the SELECT statement to restrict PK columns.

          Show
          thobbs Tyler Hobbs added a comment - I've discovered one limitation of this: I don't think we will be able to support filtering on regular columns in the base table due to timestamp issues. Suppose we have a base table like this: CREATE TABLE base (a int, b int, c int, d int, PRIMARY KEY (a, b)); and a view: CREATE MATERIALIZED VIEW view AS SELECT * FROM base WHERE a IS NOT NULL AND b IS NOT NULL AND c = 1 PRIMARY KEY (a, b) The view should only contain rows where c = 1 . Now, suppose we do the following: INSERT INTO base (a, b, c, d) VALUES (0, 0, 1, 0) USING TIMESTAMP 0; UPDATE base SET c = 1 USING TIMESTAMP 2; UPDATE base SET c = 0 USING TIMESTAMP 3; -- view row should be deleted w/ timestamp 3 UPDATE base SET c = 1 USING TIMESTAMP 4; The third update results in a tombstone being written to the view row with a timestamp of 3 because c no longer matches our SELECT statement. The problem comes at the final update: now that c matches the SELECT again, we should reinsert a row into the view. For the reinsertion, we can use timestamp 4 for the row marker and c 's cell, but what about d 's cell? If we use its timestamp from the base row (0), the view tombstone supercedes it, leaving d as null. If we use timestamp 4 for d 's cell, we may accidentally ignore updates to d . For example, this would update d in the base table but not the view: UPDATE base SET d = 1 WHERE a = 0 AND b = 0 USING TIMESTAMP 1; Note that the timestamp of 1 is higher than d 's timestamp in the base table (0) but lower than d 's timestamp in the view (4). In light of this, I suggest that for the first iteration of this we only allow the SELECT statement to restrict PK columns.
          Hide
          thobbs Tyler Hobbs added a comment -

          I have a branch with an implementation that's ready for review.

          As noted in my previous comment, this only allows restricting columns that are in the base table's primary key. When handling individual mutations this filtering can be performed prior to the read-before-write, so write-path work for unselected rows is minimal. When performing the initial build of the MV, we don't yet take full advantage of the select statement's restrictions. I would like to improve single-partition builds and builds that can be assisted by secondary indexes, but if it's okay with the reviewer, that feels best left to another ticket.

          One of the trickiest parts of this ticket was representing the WHERE clause restrictions in the MV's schema. This needs to support multi-column relations, single-column relations, and any operator (including IN, which expects multiple values). The schema I settled on was this:

          where_clause frozen<list<tuple<list<text>, int, list<text>>>>
          

          Roughly speaking, this is a list of <id, operator, value> tuples, but with lists for ids and values to support multi-column relations. I know the nesting is a little crazy there, but that allows us to represent everything that we need. I also considered storing a single string of the WHERE clause, but this presents difficulties when loading the MV from the schema. In particular, we don't have a good way to use the parser only for the whereClause rule. If somebody has a better idea, I'm open to suggestions.

          Last, this implementation is nearly restricted to what normal SELECT statements allow. In some cases those restrictions don't make much sense for MVs, where we don't need to execute an efficient query. For the most part I haven't changed anything here. The one modification I did make is to allow filtering on clustering columns when the SELECT is being built for use by an MV. As an example, if the base primary key is (a, b, c), the MV can do "WHERE c = 0" without restricting column b. Normally this is only allowed if column c is indexed, but for MV purposes, this is efficient to filter.

          Pending CI tests:

          After the first round of review, I'll also run CI tests on trunk.

          Show
          thobbs Tyler Hobbs added a comment - I have a branch with an implementation that's ready for review. As noted in my previous comment, this only allows restricting columns that are in the base table's primary key. When handling individual mutations this filtering can be performed prior to the read-before-write, so write-path work for unselected rows is minimal. When performing the initial build of the MV, we don't yet take full advantage of the select statement's restrictions. I would like to improve single-partition builds and builds that can be assisted by secondary indexes, but if it's okay with the reviewer, that feels best left to another ticket. One of the trickiest parts of this ticket was representing the WHERE clause restrictions in the MV's schema. This needs to support multi-column relations, single-column relations, and any operator (including IN, which expects multiple values). The schema I settled on was this: where_clause frozen<list<tuple<list<text>, int, list<text>>>> Roughly speaking, this is a list of <id, operator, value> tuples, but with lists for ids and values to support multi-column relations. I know the nesting is a little crazy there, but that allows us to represent everything that we need. I also considered storing a single string of the WHERE clause, but this presents difficulties when loading the MV from the schema. In particular, we don't have a good way to use the parser only for the whereClause rule. If somebody has a better idea, I'm open to suggestions. Last, this implementation is nearly restricted to what normal SELECT statements allow. In some cases those restrictions don't make much sense for MVs, where we don't need to execute an efficient query. For the most part I haven't changed anything here. The one modification I did make is to allow filtering on clustering columns when the SELECT is being built for use by an MV. As an example, if the base primary key is (a, b, c), the MV can do "WHERE c = 0" without restricting column b. Normally this is only allowed if column c is indexed, but for MV purposes, this is efficient to filter. Pending CI tests: 3.0 testall 3.0 dtest After the first round of review, I'll also run CI tests on trunk.
          Hide
          slebresne Sylvain Lebresne added a comment -

          The schema I settled on was this:

          where_clause frozen<list<tuple<list<text>, int, list<text>>>>
          

          I'm actually a little bit worried about this for 2 reasons:

          1. it's incomprehensible by a human being and will be a bit of a pain for external tools (like, say, the DESCRIBE of cqlsh). We've done extensive and painful work to make the schema tables more understandable and more easy to translate to their original statement, and this feels like a step backward.
          2. it's not future proof. For instance, and while the exact plans are not fully fleshed out, I strongly suspect we'll start to support at least some form of OR in the medium-ish term (I also have some ideas for allowing some custom form of expressions for the sake of custom index that might or might not pan out, but probably wouldn't work with this if they do).

          I also considered storing a single string of the WHERE clause

          That would have my preference, mainly for my 2nd reason above: it makes no assumption on what a WHERE clause may end up supporting in the future. It also solves my 1st point, even though in all honestly, it does makes it a little harder for external tools that would prefer a most "parsed" version of the WHERE clause (but then again, a WHERE clause parser is not that hard if you really care). Overall, I'd be much more comfortable with it. It does mean we'd need to modify the parser to parse WHERE clauses separately, but is it really that hard?

          Show
          slebresne Sylvain Lebresne added a comment - The schema I settled on was this: where_clause frozen<list<tuple<list<text>, int, list<text>>>> I'm actually a little bit worried about this for 2 reasons: it's incomprehensible by a human being and will be a bit of a pain for external tools (like, say, the DESCRIBE of cqlsh). We've done extensive and painful work to make the schema tables more understandable and more easy to translate to their original statement, and this feels like a step backward. it's not future proof. For instance, and while the exact plans are not fully fleshed out, I strongly suspect we'll start to support at least some form of OR in the medium-ish term (I also have some ideas for allowing some custom form of expressions for the sake of custom index that might or might not pan out, but probably wouldn't work with this if they do). I also considered storing a single string of the WHERE clause That would have my preference, mainly for my 2nd reason above: it makes no assumption on what a WHERE clause may end up supporting in the future. It also solves my 1st point, even though in all honestly, it does makes it a little harder for external tools that would prefer a most "parsed" version of the WHERE clause (but then again, a WHERE clause parser is not that hard if you really care). Overall, I'd be much more comfortable with it. It does mean we'd need to modify the parser to parse WHERE clauses separately, but is it really that hard?
          Hide
          thobbs Tyler Hobbs added a comment -

          I strongly suspect we'll start to support at least some form of OR in the medium-ish term

          That's a good point.

          It does mean we'd need to modify the parser to parse WHERE clauses separately, but is it really that hard?

          I'll look more into this today.

          but then again, a WHERE clause parser is not that hard if you really care

          Well, it is the most complex part of our grammar right now, but I'm not sure how many external tools will really need to be able to parse it.

          Show
          thobbs Tyler Hobbs added a comment - I strongly suspect we'll start to support at least some form of OR in the medium-ish term That's a good point. It does mean we'd need to modify the parser to parse WHERE clauses separately, but is it really that hard? I'll look more into this today. but then again, a WHERE clause parser is not that hard if you really care Well, it is the most complex part of our grammar right now, but I'm not sure how many external tools will really need to be able to parse it.
          Hide
          thobbs Tyler Hobbs added a comment -

          Okay, I've pushed a commit that switches to using a single string for the where clause in the schema. This actually turned out to be fairly simple, and reduces some of the complexities in the patch.

          I also pushed a second commit to stop using apache common's StringUtils, which was causing problems with the dtests.

          Show
          thobbs Tyler Hobbs added a comment - Okay, I've pushed a commit that switches to using a single string for the where clause in the schema. This actually turned out to be fairly simple, and reduces some of the complexities in the patch. I also pushed a second commit to stop using apache common's StringUtils, which was causing problems with the dtests.
          Hide
          thobbs Tyler Hobbs added a comment -

          I pushed another two small commits to the branch: one that throws an IRE if query parameters are in the MV SELECT statement, and one that tests that IS NOT NULL is rejected in normal SELECT statements.

          Show
          thobbs Tyler Hobbs added a comment - I pushed another two small commits to the branch: one that throws an IRE if query parameters are in the MV SELECT statement, and one that tests that IS NOT NULL is rejected in normal SELECT statements.
          Hide
          iamaleksey Aleksey Yeschenko added a comment -

          You might want (should want) to wait for CASSANDRA-9921 before committing this (which is going to happen once the drivers are ready), or at least base the patch on the branch in that ticket (somewhat in-progress).

          Show
          iamaleksey Aleksey Yeschenko added a comment - You might want (should want) to wait for CASSANDRA-9921 before committing this (which is going to happen once the drivers are ready), or at least base the patch on the branch in that ticket (somewhat in-progress).
          Hide
          thobbs Tyler Hobbs added a comment -

          I've rebased on the latest patch/branch from CASSANDRA-9921 in a new branch. Due to an issue in 9921 with handling ALTER TABLE in some cases, I've commented out one of the new unit tests.

          Show
          thobbs Tyler Hobbs added a comment - I've rebased on the latest patch/branch from CASSANDRA-9921 in a new branch . Due to an issue in 9921 with handling ALTER TABLE in some cases, I've commented out one of the new unit tests.
          Hide
          aholmber Adam Holmberg added a comment - - edited

          Tyler Hobbs we will need to rebase again. New changes here brought in changes from trunk including updates to the system_schema.columns table.

          Meanwhile, no dev is blocked – I'm working rebased locally.

          Show
          aholmber Adam Holmberg added a comment - - edited Tyler Hobbs we will need to rebase again. New changes here brought in changes from trunk including updates to the system_schema.columns table. Meanwhile, no dev is blocked – I'm working rebased locally.
          Hide
          thobbs Tyler Hobbs added a comment -

          Alright, I've rebased again on the latest from 9921 in another new branch. The problem with ALTER TABLE still exists, so there are still a couple of commented out unit tests. However, this should be good enough for Sylvain Lebresne to start code review.

          Show
          thobbs Tyler Hobbs added a comment - Alright, I've rebased again on the latest from 9921 in another new branch . The problem with ALTER TABLE still exists, so there are still a couple of commented out unit tests. However, this should be good enough for Sylvain Lebresne to start code review.
          Hide
          aholmber Adam Holmberg added a comment -

          Driver branch here

          Show
          aholmber Adam Holmberg added a comment - Driver branch here
          Hide
          adutra Alexandre Dutra added a comment -

          Java driver branch here.

          Show
          adutra Alexandre Dutra added a comment - Java driver branch here .
          Hide
          aboudreault Alan Boudreault added a comment -

          Written a basic test to ensure this was also working properly in a multinode cluster: https://github.com/riptano/cassandra-dtest/pull/556

          Show
          aboudreault Alan Boudreault added a comment - Written a basic test to ensure this was also working properly in a multinode cluster: https://github.com/riptano/cassandra-dtest/pull/556
          Hide
          thobbs Tyler Hobbs added a comment -

          I've done a (hopefully final) rebase on the latest 3.0 in a new branch. This re-enables the previously failing unit tests.

          T Jake Luciani pointed out that with CASSANDRA-10261 complete, we should be able to filter on non-PK columns now. However, since we're getting close to the deadline for 3.0.0-rc1, I think it might be better to push that to another ticket targeting 3.2. What do you think, Sylvain Lebresne?

          Show
          thobbs Tyler Hobbs added a comment - I've done a (hopefully final) rebase on the latest 3.0 in a new branch . This re-enables the previously failing unit tests. T Jake Luciani pointed out that with CASSANDRA-10261 complete, we should be able to filter on non-PK columns now. However, since we're getting close to the deadline for 3.0.0-rc1, I think it might be better to push that to another ticket targeting 3.2. What do you think, Sylvain Lebresne ?
          Hide
          adutra Alexandre Dutra added a comment - - edited

          Updated java driver branch here.

          Show
          adutra Alexandre Dutra added a comment - - edited Updated java driver branch here .
          Hide
          slebresne Sylvain Lebresne added a comment -

          A few remarks on the patch:

          • I don't think the Term.Literal thing works, because a collection can contain a function call and yet FunctionCall.Raw is not a Term.Literal. So the whole collection will pass the instanceof Term.Literal test in CreateViewStatement, but getRawText will throw a ClassCastException. Outside of that, I'm not sure there is a good reason to disallow function calls (or type-cast for that matter). What I think we want is just rejecting bind markers, and we can do that by just waiting in CreateViewStatement to have prepared the statement (ParsedStatement.Prepared tells use if there is any bound variable in particular). Once we do that, I think getRawText can just be put in Term.Raw (and renamed getText).
          • Regarding getRawText, I believe all or almost all Term.Raw are already implementing the right thing in their toString() method, and we somewhat rely on it for some error messages. I understand that adding a specific method makes it a bit less likely that we'll add a new Term.Raw and forgot to implement its toString(), but that still mean a bit of ugly code duplication in practice so I wonder if it's worth it. The worth case scenario is we add a new Term.Raw, forget about toString() and someone tries to use it while creating a MV and it fails at creation time. This wouldn't be great, but 1) this would be easily detected and fixed and 2) this would imply we've released code without testing it, at which point I'd almost rather have the MV creation fail (so we can fix and test), rather than having it pass but potentially do something bad (since it hasn't been tested). Anyway, I don't really feel too strongly, but just to say I'd be fine with just toString() and some clear message that it has to be properly implemented in the class javadoc of Term.Raw.
          • Not sure to understand why the initialization of select and query in View.java is delayed. Are you worried about MV creation schema changes arriving before other schema changes it depends on on some nodes? If so, other places of the code where we have this kind of dependencies (creating a table after a keyspace creation for instance) assume that the user has to wait for schema agreement, and I'm not sure why we'd do differently here as it's less consistent and complicate the code a bit.
          • The fact that ReadCommand now has both a selects(DecoratedKey key, Clustering clustering) and selectsClustering(DecoratedKey key, Clustering clustering) which do roughly the same thing but not exactly and with no comment as to the difference is quite a bit confusing. I think we should just remove the existing selects and replace its current call by (selectsKey || selectsClustering). I'm also not sure why those methods in SinglePartitionReadCommand don't check the rowFilter (I suspect they should).
          • It's a bit weird/unintuitive that StatementRestrictions.isRestricted ignores IS NOT NULL restriction. And it seems changing that would actually simplify the code using that method.
          • Nit: In the last if of View.createForDeletionInfo, having ReadQuery selectQuery = getReadQuery(); but using query on the next line assert is a bit anti-social.
          • Nit: in Operator.java, can make readFrom call the new fromValue.

          Those are based on the previous rebase (CASSANDRA-9664-rebase-2) so I assume the last rebase was just that, a rebase.

          since we're getting close to the deadline for 3.0.0-rc1, I think it might be better to push that to another ticket targeting 3.2.

          Yes, pretty please, let's push that to 3.2.

          Show
          slebresne Sylvain Lebresne added a comment - A few remarks on the patch: I don't think the Term.Literal thing works, because a collection can contain a function call and yet FunctionCall.Raw is not a Term.Literal . So the whole collection will pass the instanceof Term.Literal test in CreateViewStatement , but getRawText will throw a ClassCastException . Outside of that, I'm not sure there is a good reason to disallow function calls (or type-cast for that matter). What I think we want is just rejecting bind markers, and we can do that by just waiting in CreateViewStatement to have prepared the statement ( ParsedStatement.Prepared tells use if there is any bound variable in particular). Once we do that, I think getRawText can just be put in Term.Raw (and renamed getText ). Regarding getRawText , I believe all or almost all Term.Raw are already implementing the right thing in their toString() method, and we somewhat rely on it for some error messages. I understand that adding a specific method makes it a bit less likely that we'll add a new Term.Raw and forgot to implement its toString() , but that still mean a bit of ugly code duplication in practice so I wonder if it's worth it. The worth case scenario is we add a new Term.Raw , forget about toString() and someone tries to use it while creating a MV and it fails at creation time. This wouldn't be great, but 1) this would be easily detected and fixed and 2) this would imply we've released code without testing it, at which point I'd almost rather have the MV creation fail (so we can fix and test), rather than having it pass but potentially do something bad (since it hasn't been tested). Anyway, I don't really feel too strongly, but just to say I'd be fine with just toString() and some clear message that it has to be properly implemented in the class javadoc of Term.Raw . Not sure to understand why the initialization of select and query in View.java is delayed. Are you worried about MV creation schema changes arriving before other schema changes it depends on on some nodes? If so, other places of the code where we have this kind of dependencies (creating a table after a keyspace creation for instance) assume that the user has to wait for schema agreement, and I'm not sure why we'd do differently here as it's less consistent and complicate the code a bit. The fact that ReadCommand now has both a selects(DecoratedKey key, Clustering clustering) and selectsClustering(DecoratedKey key, Clustering clustering) which do roughly the same thing but not exactly and with no comment as to the difference is quite a bit confusing. I think we should just remove the existing selects and replace its current call by (selectsKey || selectsClustering) . I'm also not sure why those methods in SinglePartitionReadCommand don't check the rowFilter (I suspect they should). It's a bit weird/unintuitive that StatementRestrictions.isRestricted ignores IS NOT NULL restriction. And it seems changing that would actually simplify the code using that method. Nit: In the last if of View.createForDeletionInfo , having ReadQuery selectQuery = getReadQuery(); but using query on the next line assert is a bit anti-social. Nit: in Operator.java , can make readFrom call the new fromValue . Those are based on the previous rebase ( CASSANDRA-9664 -rebase-2 ) so I assume the last rebase was just that, a rebase. since we're getting close to the deadline for 3.0.0-rc1, I think it might be better to push that to another ticket targeting 3.2. Yes, pretty please, let's push that to 3.2.
          Hide
          thobbs Tyler Hobbs added a comment -

          Regarding getRawText, I believe all or almost all Term.Raw are already implementing the right thing in their toString() method, and we somewhat rely on it for some error messages.

          What do you think about leaving getRawText()/getText() and using that for toString() where it makes sense? That should eliminate most of the code duplication, but still give us some compile-time checks.

          Show
          thobbs Tyler Hobbs added a comment - Regarding getRawText, I believe all or almost all Term.Raw are already implementing the right thing in their toString() method, and we somewhat rely on it for some error messages. What do you think about leaving getRawText() / getText() and using that for toString() where it makes sense? That should eliminate most of the code duplication, but still give us some compile-time checks.
          Hide
          slebresne Sylvain Lebresne added a comment -

          What do you think about leaving getRawText()/getText() and using that for toString() where it makes sense?

          I though about it and figure that it would be perfect if Term.Raw was an abstract class (cause if we re-declare toString() every time, it's better than nothing but a bit ugly imo) but that we probably wouldn't want to make it an abstract class. That said, I just remembered that interface now have default functions, so if we can use that, that works for me.

          Show
          slebresne Sylvain Lebresne added a comment - What do you think about leaving getRawText()/getText() and using that for toString() where it makes sense? I though about it and figure that it would be perfect if Term.Raw was an abstract class (cause if we re-declare toString() every time, it's better than nothing but a bit ugly imo) but that we probably wouldn't want to make it an abstract class. That said, I just remembered that interface now have default functions, so if we can use that, that works for me.
          Hide
          thobbs Tyler Hobbs added a comment -

          I've pushed several commits to the same branch to address the review comments. A few responses inline:

          I just remembered that interface now have default functions, so if we can use that, that works for me.

          Unfortunately, default functions can't override inherited functions like toString. However, I made Term.Literal an abstract class with toString() defined, which works quite nicely with minimal pointless code.

          I don't think the Term.Literal thing works, because a collection can contain a function call and yet FunctionCall.Raw is not a Term.Literal

          Good catch. I've made the changes you suggested and added some additional tests around this.

          Not sure to understand why the initialization of select and query in View.java is delayed. Are you worried about MV creation schema changes arriving before other schema changes it depends on on some nodes?

          No, it's about the schema initialization sequence during startup. I've updated the comment to explain this better; let me know if you still have questions.

          I think we should just remove the existing selects and replace its current call

          I meant to do this and forgot to. That's taken care of now, thanks.

          in Operator.java, can make readFrom call the new fromValue.

          This was actually unused (after storing where_clause as a single string), so I've removed it.

          Yes, pretty please, let's push that to 3.2.

          Okay, I've created CASSANDRA-10368 as a follow-up.

          I assume the last rebase was just that, a rebase.

          Yes, no real changes there.

          Pending CI tests:

          Show
          thobbs Tyler Hobbs added a comment - I've pushed several commits to the same branch to address the review comments. A few responses inline: I just remembered that interface now have default functions, so if we can use that, that works for me. Unfortunately, default functions can't override inherited functions like toString. However, I made Term.Literal an abstract class with toString() defined, which works quite nicely with minimal pointless code. I don't think the Term.Literal thing works, because a collection can contain a function call and yet FunctionCall.Raw is not a Term.Literal Good catch. I've made the changes you suggested and added some additional tests around this. Not sure to understand why the initialization of select and query in View.java is delayed. Are you worried about MV creation schema changes arriving before other schema changes it depends on on some nodes? No, it's about the schema initialization sequence during startup. I've updated the comment to explain this better; let me know if you still have questions. I think we should just remove the existing selects and replace its current call I meant to do this and forgot to. That's taken care of now, thanks. in Operator.java, can make readFrom call the new fromValue. This was actually unused (after storing where_clause as a single string), so I've removed it. Yes, pretty please, let's push that to 3.2. Okay, I've created CASSANDRA-10368 as a follow-up. I assume the last rebase was just that, a rebase. Yes, no real changes there. Pending CI tests: 3.0 testall 3.0 dtest trunk testall trunk dtest
          Hide
          andrew.tolbert Andy Tolbert added a comment - - edited

          Is it required for a where clause to be present now? I've been experimenting on this branch w/ the java-driver and noticed that I now get an NPE when not providing a where clause:

          ERROR [SharedPool-Worker-1] 2015-09-17 17:50:02,862 ErrorMessage.java:336 - Unexpected exception during request
          java.lang.NullPointerException: null
                  at org.apache.cassandra.db.view.View.relationsToWhereClause(View.java:719) ~[main/:na]
                  at org.apache.cassandra.cql3.statements.CreateViewStatement.announceMigration(CreateViewStatement.java:224) ~[main/:na]
                  at org.apache.cassandra.cql3.statements.SchemaAlteringStatement.execute(SchemaAlteringStatement.java:93) ~[main/:na]
                  at org.apache.cassandra.cql3.QueryProcessor.processStatement(QueryProcessor.java:204) ~[main/:na]
                  at org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:235) ~[main/:na]
                  at org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:220) ~[main/:na]
                  at org.apache.cassandra.cql3.CustomPayloadMirroringQueryHandler.process(CustomPayloadMirroringQueryHandler.java:43) ~[main/:na]
                  at org.apache.cassandra.transport.messages.QueryMessage.execute(QueryMessage.java:115) ~[main/:na]
                  at org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:507) [main/:na]
                  at org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:401) [main/:na]
                  at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) [netty-all-4.0.23.Final.jar:4.0.23.Final]
                  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) [netty-all-4.0.23.Final.jar:4.0.23.Final]
                  at io.netty.channel.AbstractChannelHandlerContext.access$700(AbstractChannelHandlerContext.java:32) [netty-all-4.0.23.Final.jar:4.0.23.Final]
                  at io.netty.channel.AbstractChannelHandlerContext$8.run(AbstractChannelHandlerContext.java:324) [netty-all-4.0.23.Final.jar:4.0.23.Final]
                  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_25]
                  at org.apache.cassandra.concurrent.AbstractTracingAwareExecutorService$FutureTask.run(AbstractTracingAwareExecutorService.java:164) [main/:na]
                  at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:105) [main/:na]
                  at java.lang.Thread.run(Thread.java:745) [na:1.8.0_25]
          

          Where on trunk C* surfaces an InvalidQueryException (as each primary key column I declare must include a 'IS NOT NULL' in my where clause).

          I'm also wondering why the 'IS NOT NULL' declaration for primary keys is required (couldn't that be implicit?) although thats not new on this branch and there might have been previous conversations about that (I'll look through tickets and strikethrough and link to that if I find it). Explanation (thanks Joel Knighton )

          Show
          andrew.tolbert Andy Tolbert added a comment - - edited Is it required for a where clause to be present now? I've been experimenting on this branch w/ the java-driver and noticed that I now get an NPE when not providing a where clause: ERROR [SharedPool-Worker-1] 2015-09-17 17:50:02,862 ErrorMessage.java:336 - Unexpected exception during request java.lang.NullPointerException: null at org.apache.cassandra.db.view.View.relationsToWhereClause(View.java:719) ~[main/:na] at org.apache.cassandra.cql3.statements.CreateViewStatement.announceMigration(CreateViewStatement.java:224) ~[main/:na] at org.apache.cassandra.cql3.statements.SchemaAlteringStatement.execute(SchemaAlteringStatement.java:93) ~[main/:na] at org.apache.cassandra.cql3.QueryProcessor.processStatement(QueryProcessor.java:204) ~[main/:na] at org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:235) ~[main/:na] at org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:220) ~[main/:na] at org.apache.cassandra.cql3.CustomPayloadMirroringQueryHandler.process(CustomPayloadMirroringQueryHandler.java:43) ~[main/:na] at org.apache.cassandra.transport.messages.QueryMessage.execute(QueryMessage.java:115) ~[main/:na] at org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:507) [main/:na] at org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:401) [main/:na] at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) [netty-all-4.0.23.Final.jar:4.0.23.Final] at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) [netty-all-4.0.23.Final.jar:4.0.23.Final] at io.netty.channel.AbstractChannelHandlerContext.access$700(AbstractChannelHandlerContext.java:32) [netty-all-4.0.23.Final.jar:4.0.23.Final] at io.netty.channel.AbstractChannelHandlerContext$8.run(AbstractChannelHandlerContext.java:324) [netty-all-4.0.23.Final.jar:4.0.23.Final] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_25] at org.apache.cassandra.concurrent.AbstractTracingAwareExecutorService$FutureTask.run(AbstractTracingAwareExecutorService.java:164) [main/:na] at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:105) [main/:na] at java.lang.Thread.run(Thread.java:745) [na:1.8.0_25] Where on trunk C* surfaces an InvalidQueryException (as each primary key column I declare must include a 'IS NOT NULL' in my where clause). I'm also wondering why the 'IS NOT NULL' declaration for primary keys is required (couldn't that be implicit?) although thats not new on this branch and there might have been previous conversations about that (I'll look through tickets and strikethrough and link to that if I find it). Explanation (thanks Joel Knighton )
          Hide
          andrew.tolbert Andy Tolbert added a comment -

          One small nit I noticed that isn't technically a bug is that the whereClause in system_schema.views quotes all column names, even if they are all in lowercase (don't need to be quoted identifiers):

          "game" IS NOT NULL AND "year" IS NOT NULL AND "month" IS NOT NULL AND "score" IS NOT NULL AND "user" IS NOT NULL AND "day" IS NOT NULL

          is this ok? I'm not concerned about it, but thought i'd bring it up.

          Show
          andrew.tolbert Andy Tolbert added a comment - One small nit I noticed that isn't technically a bug is that the whereClause in system_schema.views quotes all column names, even if they are all in lowercase (don't need to be quoted identifiers): "game" IS NOT NULL AND "year" IS NOT NULL AND "month" IS NOT NULL AND "score" IS NOT NULL AND "user" IS NOT NULL AND "day" IS NOT NULL is this ok? I'm not concerned about it, but thought i'd bring it up.
          Hide
          slebresne Sylvain Lebresne added a comment -

          I now get an NPE when not providing a where clause

          Definitively something to fix.

          quotes all column names, even if they are all in lowercase

          It was done out of simplicity (quoting unconditionally in the code), but that made realize that this is broken if the identifier has a quote itself as we need to escape it. And I think it's high time we add a ColumnIdentifier.toCQLString() that does the right thing (and could easily avoid quoting when not required so it's a bit nicer), especially since CASSANDRA-10217 also has to do this and there is no point in not having that in only one place.

          Show
          slebresne Sylvain Lebresne added a comment - I now get an NPE when not providing a where clause Definitively something to fix. quotes all column names, even if they are all in lowercase It was done out of simplicity (quoting unconditionally in the code), but that made realize that this is broken if the identifier has a quote itself as we need to escape it. And I think it's high time we add a ColumnIdentifier.toCQLString() that does the right thing (and could easily avoid quoting when not required so it's a bit nicer), especially since CASSANDRA-10217 also has to do this and there is no point in not having that in only one place.
          Hide
          slebresne Sylvain Lebresne added a comment -

          I made Term.Literal an abstract class with toString() defined

          But you forgot to make some classes like FunctionCall.Raw, TypeCast and AbstractMarker.Raw extensions of Term.Literal. Truth is, I really don't think we need Term.Literal since its only purpose is to make sure we have a proper toString() method and I see no reason why we wouldn't want that for each and every Term.Raw. I only mention not wanting to make Term.Raw an abstract class because I though we might get a multi-inheritance problem in a few places, but it appears it's not the case so let's just make Term.Raw an abstract class and get rid of Term.Literal.

          Regarding the last commit, about not checking the row filter for 2i, it seems from the code that only the IN case is not currently handled by RowFilter. At which point I would have a small preference for just quickly adding code to make it work (it's not too hard) and let 2i also check the row filter. My reasoning is that:

          1. it'll be cleaner than having the boolean flag to selectsKey and selectsClustering
          2. it's actually an optimization for 2i. Even though we do check those condition elsewhere, selectsKey and selectsClustering are called before the data query (on the index entry), so filtering at that point is better. We can even easily avoid checking those condition again later if we really want to.

          Anyway, if you feel adding the code for IN is too involved, I'm fine pushing that to a followup, but it feels to me like it's almost simpler to do it "right" from the get go.

          Show
          slebresne Sylvain Lebresne added a comment - I made Term.Literal an abstract class with toString() defined But you forgot to make some classes like FunctionCall.Raw , TypeCast and AbstractMarker.Raw extensions of Term.Literal . Truth is, I really don't think we need Term.Literal since its only purpose is to make sure we have a proper toString() method and I see no reason why we wouldn't want that for each and every Term.Raw . I only mention not wanting to make Term.Raw an abstract class because I though we might get a multi-inheritance problem in a few places, but it appears it's not the case so let's just make Term.Raw an abstract class and get rid of Term.Literal . Regarding the last commit, about not checking the row filter for 2i, it seems from the code that only the IN case is not currently handled by RowFilter . At which point I would have a small preference for just quickly adding code to make it work (it's not too hard) and let 2i also check the row filter. My reasoning is that: it'll be cleaner than having the boolean flag to selectsKey and selectsClustering it's actually an optimization for 2i. Even though we do check those condition elsewhere, selectsKey and selectsClustering are called before the data query (on the index entry), so filtering at that point is better. We can even easily avoid checking those condition again later if we really want to. Anyway, if you feel adding the code for IN is too involved, I'm fine pushing that to a followup, but it feels to me like it's almost simpler to do it "right" from the get go.
          Hide
          thobbs Tyler Hobbs added a comment -

          let's just make Term.Raw an abstract class and get rid of Term.Literal.

          Okay, done.

          it seems from the code that only the IN case is not currently handled by RowFilter

          It turns out we also need to handle CONTAINS and CONTAINS KEY in Operator.isSatisfiedBy() (as evidenced by failing FrozenCollectionsTest tests), so I've added support for all three.

          Show
          thobbs Tyler Hobbs added a comment - let's just make Term.Raw an abstract class and get rid of Term.Literal. Okay, done. it seems from the code that only the IN case is not currently handled by RowFilter It turns out we also need to handle CONTAINS and CONTAINS KEY in Operator.isSatisfiedBy() (as evidenced by failing FrozenCollectionsTest tests), so I've added support for all three.
          Hide
          thobbs Tyler Hobbs added a comment -

          I now get an NPE when not providing a where clause

          Fixed and added a test for this, thanks.

          that this is broken if the identifier has a quote itself as we need to escape it. And I think it's high time we add a ColumnIdentifier.toCQLString() that does the right thing

          Thanks for pointing that out. I've created ColumnIdentifier.toCQLString() as you suggested, and updated a test to cover this.

          I've also merged the latest 3.0/trunk into my branches to get clearer test results and resolve a few merge conflicts.

          Show
          thobbs Tyler Hobbs added a comment - I now get an NPE when not providing a where clause Fixed and added a test for this, thanks. that this is broken if the identifier has a quote itself as we need to escape it. And I think it's high time we add a ColumnIdentifier.toCQLString() that does the right thing Thanks for pointing that out. I've created ColumnIdentifier.toCQLString() as you suggested, and updated a test to cover this. I've also merged the latest 3.0/trunk into my branches to get clearer test results and resolve a few merge conflicts.
          Hide
          aholmber Adam Holmberg added a comment -

          Since I can't guarantee availability when this might go in:
          Python driver branch for zip update is 399 @2803bb7.
          There is also a pull request for getting this into cassandra-test when you're ready.

          Show
          aholmber Adam Holmberg added a comment - Since I can't guarantee availability when this might go in: Python driver branch for zip update is 399 @2803bb7. There is also a pull request for getting this into cassandra-test when you're ready.
          Hide
          andrew.tolbert Andy Tolbert added a comment -

          just reaffirming that java-driver branch is good to go here.

          Show
          andrew.tolbert Andy Tolbert added a comment - just reaffirming that java-driver branch is good to go here .
          Hide
          slebresne Sylvain Lebresne added a comment -

          +1 on the last branch, but could you replace the calls to the maybeEscapeQuotedName() method in IndexTarget.java by the new ColumnIdentifier.toCQLString(), as I fear we'll forget to do it otherwise.

          Show
          slebresne Sylvain Lebresne added a comment - +1 on the last branch, but could you replace the calls to the maybeEscapeQuotedName() method in IndexTarget.java by the new ColumnIdentifier.toCQLString() , as I fear we'll forget to do it otherwise.
          Hide
          thobbs Tyler Hobbs added a comment -

          Thanks, committed with maybeEscapeQuotedName() usage replaced as 5a4253b6a17de9810fbc4e1c3b8d4980e26adcca to 3.0 and merged to trunk.

          Show
          thobbs Tyler Hobbs added a comment - Thanks, committed with maybeEscapeQuotedName() usage replaced as 5a4253b6a17de9810fbc4e1c3b8d4980e26adcca to 3.0 and merged to trunk.
          Hide
          iamaleksey Aleksey Yeschenko added a comment -

          Updated the bundled python- and java-driver in 6af4657a96409e2b5f40822646daf42fae33c11b. Merged driver PRs before then.

          We should be all good now.

          Show
          iamaleksey Aleksey Yeschenko added a comment - Updated the bundled python- and java-driver in 6af4657a96409e2b5f40822646daf42fae33c11b . Merged driver PRs before then. We should be all good now.
          Hide
          aboudreault Alan Boudreault added a comment - - edited

          Reopening this issue. This change caused an important regression with MV performance. I'm using mvbench to test with RF=3 and I confirm it's not a driver issue.

          EC2 RF=3 (i2.2xlarge, also tried on i2.4xlarge)
          mvn exec:java -Dexec.args="--num-users 100000 --num-songs 1000000 --num-artists 10000 -n 500000 --endpoint node1"
          
          3.0.0-beta2 (alpha2 java driver)
          -----------
          total
                       count = 461601
                   mean rate = 1923.21 calls/second
               1-minute rate = 1937.82 calls/second
               5-minute rate = 1424.09 calls/second
              15-minute rate = 1058.28 calls/second
                         min = 1.90 milliseconds
                         max = 3707.76 milliseconds
                        mean = 516.42 milliseconds
                      stddev = 457.41 milliseconds
                      median = 390.07 milliseconds
                        75% <= 775.95 milliseconds
                        95% <= 1417.67 milliseconds
                        98% <= 1728.05 milliseconds
                        99% <= 1954.55 milliseconds
                      99.9% <= 2566.91 milliseconds
          
          
          3.0.0-rc1 (alpha3 java driver)
          ---------
          
          total
                       count = 310373
                   mean rate = 272.25 calls/second
               1-minute rate = 0.00 calls/second
               5-minute rate = 45.47 calls/second
              15-minute rate = 295.94 calls/second
                         min = 1.05 milliseconds
                         max = 10468.98 milliseconds
                        mean = 492.99 milliseconds
                      stddev = 510.42 milliseconds
                      median = 281.02 milliseconds
                        75% <= 696.25 milliseconds
                        95% <= 1434.45 milliseconds
                        98% <= 1820.33 milliseconds
                        99% <= 2080.37 milliseconds
                      99.9% <= 4362.08 milliseconds
          

          Tyler Hobbs any idea what could make things getting slower?

          Show
          aboudreault Alan Boudreault added a comment - - edited Reopening this issue. This change caused an important regression with MV performance. I'm using mvbench to test with RF=3 and I confirm it's not a driver issue. EC2 RF=3 (i2.2xlarge, also tried on i2.4xlarge) mvn exec:java -Dexec.args= "--num-users 100000 --num-songs 1000000 --num-artists 10000 -n 500000 --endpoint node1" 3.0.0-beta2 (alpha2 java driver) ----------- total count = 461601 mean rate = 1923.21 calls/second 1-minute rate = 1937.82 calls/second 5-minute rate = 1424.09 calls/second 15-minute rate = 1058.28 calls/second min = 1.90 milliseconds max = 3707.76 milliseconds mean = 516.42 milliseconds stddev = 457.41 milliseconds median = 390.07 milliseconds 75% <= 775.95 milliseconds 95% <= 1417.67 milliseconds 98% <= 1728.05 milliseconds 99% <= 1954.55 milliseconds 99.9% <= 2566.91 milliseconds 3.0.0-rc1 (alpha3 java driver) --------- total count = 310373 mean rate = 272.25 calls/second 1-minute rate = 0.00 calls/second 5-minute rate = 45.47 calls/second 15-minute rate = 295.94 calls/second min = 1.05 milliseconds max = 10468.98 milliseconds mean = 492.99 milliseconds stddev = 510.42 milliseconds median = 281.02 milliseconds 75% <= 696.25 milliseconds 95% <= 1434.45 milliseconds 98% <= 1820.33 milliseconds 99% <= 2080.37 milliseconds 99.9% <= 4362.08 milliseconds Tyler Hobbs any idea what could make things getting slower?
          Hide
          iamaleksey Aleksey Yeschenko added a comment -

          Can you open a new ticket instead? In general we try not to reopen tickets that have been included in a released C* version.

          Show
          iamaleksey Aleksey Yeschenko added a comment - Can you open a new ticket instead? In general we try not to reopen tickets that have been included in a released C* version.
          Hide
          aboudreault Alan Boudreault added a comment -

          sure

          Show
          aboudreault Alan Boudreault added a comment - sure
          Hide
          aboudreault Alan Boudreault added a comment -

          Closing in favor of CASSANDRA-10609.

          Show
          aboudreault Alan Boudreault added a comment - Closing in favor of CASSANDRA-10609 .

            People

            • Assignee:
              thobbs Tyler Hobbs
              Reporter:
              carlyeks Carl Yeksigian
              Reviewer:
              Sylvain Lebresne
              Tester:
              Alan Boudreault
            • Votes:
              0 Vote for this issue
              Watchers:
              14 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development