Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Fix Version/s: 1.1.0
    • Component/s: API, Core
    • Labels:


        Activity

        Vivek Mishra added a comment -

        Eric,
        Can you please provide more information on this JIRA? What needs to be done? I can take up this issue.

        Vivek

        Jonathan Ellis added a comment -

        Traditionally, a prepared statement API would look something like this:

          int prepare_cql_query(1:binary query, 2:Compression compression);
          void execute_prepared_query(1:int query_handle, 2:list<binary> parameters, 3:Compression compression);
        

        ... that is, prepare would parse and plan the query, then execute would supply the parameters to be bound to query variables.

        On the client side, the APIs are well-defined, e.g. JDBC's prepareStatement. (Note that we currently "fake" these APIs on top of the standard execute_cql_query.)
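
        A sketch of the client-side flow this implies; the CassandraClient stub, the example query, and both method names simply follow the proposal above (this is not a shipped interface):

          import java.nio.ByteBuffer;
          import java.nio.charset.StandardCharsets;
          import java.util.Arrays;
          import java.util.List;

          class PreparedFlowSketch
          {
              // `client` stands in for a Thrift-generated client exposing the
              // two proposed methods; `age` and `key` are already-encoded values.
              static void example(CassandraClient client, ByteBuffer age, ByteBuffer key) throws Exception
              {
                  ByteBuffer query = ByteBuffer.wrap(
                      "UPDATE users SET age = ? WHERE KEY = ?".getBytes(StandardCharsets.UTF_8));
                  int handle = client.prepare_cql_query(query, Compression.NONE); // parse and plan once

                  // Execute as many times as needed; only bound values cross the wire.
                  List<ByteBuffer> params = Arrays.asList(age, key);
                  client.execute_prepared_query(handle, params, Compression.NONE);
              }
          }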

        Eric Evans added a comment -

        Why list<binary> for parameters, and not list<string>?

        Jonathan Ellis added a comment -

        The thinking being that it's easier for a client to encode as string?

        That's a reasonable point, but having been up close and personal with both of our existing client APIs I'd say that saving a 20-line encode case statement is not going to make much difference in pain there. On the other hand, the efficiency wins from receiving the data in "native" encoded format seem substantial (one of the most common data types, numerics, is particularly inefficient to decode-from-string).
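
        A minimal illustration of the trade-off for a numeric value; the string path parses and re-encodes on every execution, while the binary path can use the client's bytes as-is:

          import java.nio.ByteBuffer;

          class DecodeCostSketch
          {
              static void example()
              {
                  // String path: every execution scans the digits into a long and
                  // then re-encodes them into the internal 8-byte representation.
                  long parsed = Long.parseLong("1234567890123");
                  ByteBuffer reencoded = ByteBuffer.allocate(8).putLong(0, parsed);

                  // "Native" path: the 8-byte big-endian value arrives ready to
                  // use, with no parse step and no extra copy.
                  ByteBuffer nativeParam = ByteBuffer.allocate(8).putLong(0, 1234567890123L);
                  long direct = nativeParam.getLong(0);
              }
          }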

        Eric Evans added a comment -

        The thinking being that it's easier for a client to encode as string?

        Well, the thinking being that the code to do so already exists. It's also pushing some logic back to the client and opening up the possibility for a whole class of bugs (multiplied by the number of clients that have to implement it).

        This isn't entirely hypothetical: it was recently discovered that Pycassa treated IntegerType as always being 4 bytes in length, and that had gone undiscovered for a long time. The string representation of a number, though, can be parsed by IntegerType.fromString() node-side consistently, regardless of the client.
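
        To make the bug class concrete, a sketch assuming IntegerType's variable-length two's-complement encoding (equivalent to BigInteger.toByteArray()):

          import java.math.BigInteger;
          import java.nio.ByteBuffer;

          class IntegerTypeSketch
          {
              static void example()
              {
                  // Variable-length encoding: small values occupy a single byte.
                  byte[] canonical = BigInteger.valueOf(5).toByteArray();       // { 0x05 }

                  // The buggy assumption: a fixed four-byte width for every value.
                  byte[] fixedWidth = ByteBuffer.allocate(4).putInt(5).array(); // { 0, 0, 0, 5 }

                  // A length-agnostic decoder handles any width, but a client that
                  // always reads exactly four bytes breaks on the one-byte form and
                  // mangles anything wider than four bytes.
                  BigInteger decoded = new BigInteger(canonical);
              }
          }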

        Nate McCall added a comment -

        Usually I would side with Eric's point about re-use on these types of arguments, but in the context of a prepared statement, I'm willing to add some complexity hoops at the client level if it buys me a noticeable performance gain over a standard statement.

        Eric Evans added a comment -

        Either way, prepared statements should be significantly more performant than their unprepared counterparts, but how much more? Put another way, the performance argument could be a compelling one (for either side), but without the data how can we really make it?

        Rick Shaw added a comment -

        I think prepared statements for Cassandra are a nice convenience that seems very natural, but the idea is born out of the way relational systems work (pre-compiling, strategies, and the like). I just don't see that much performance advantage being gained in Cassandra over the "faked" way it is done now.

        The other question I have is about the above code snippet. It implies (I think?) that the caller passes in CQL and gets a handle to a "something" that is going to be stored on the server side? That seems to be fraught with problems. How long does it live? How does the server side know when you might be done with it? I wonder if the complication of keeping state on the server side and substituting binary values into a new tokenized structure will really be faster than parsing the string again with ANTLR?

        Jonathan Ellis added a comment -

        How long does it live? How does the server side know when you might be done with it?

        That part is easy: it lives until you disconnect, same as keyspace/authentication.

        I wonder if the complication of keeping state on the server side and substituting binary values into a new tokenized structure will really be faster

        I suppose nothing's a given, but a hash lookup is pretty damn fast.

        Michal Augustýn added a comment -

        It would be great to have this overload, in order to eliminate one client-server roundtrip:

        CqlResult execute_cql_query(1:binary query, 2:list<binary> parameters, 3:Compression compression);

        In many applications there are just a few queries (hundreds at most?), so I think the handle could be cached server-side (we could limit the cache size via configuration).
        And do you/we plan to support named parameters?

        Rick Shaw added a comment -

        In order to see a performance benefit (as opposed to mere convenience) from prepared statements in CQL, the design needs to change so that ANTLR parses the strings into a token stream that can be reused, as opposed to executing inline with recursive descent. It is the reusable token stream that provides the performance leverage.

        I suggest that these token streams could be small enough that they could be the returned value from the "Prepare" call and relieve the server side from the maintenance and accounting hassle of keeping track of them. The client would squirrel them away for reuse, whenever. Possibly the token stream could be designed to be reused across processes; just a resource in a proprietary byte stream format? The design of the JDBC driver assumes that Prepared Statements are pool-able on the client side along with connections. I am aware that it could be the handle that is pooled, but that would have a potentially shorter lifetime.

        Even simple statements would be parsed down to the stream of tokens; it would just be executed immediately and then tossed, as opposed to being returned to the caller.

        A roadblock may be the fact that batch is handled directly in the CQL syntax right now. This is fine using the current approach, but will be very messy with a saved set of tokens that are reused and the parameter substitution that must go on server side. I believe that is why the JDBC driver treats batch as another concept that can take sets of prepared statements and their associated parameters as a list rather than making it just part of the SQL syntax.

        Jonathan Ellis added a comment -

        I suggest that these token streams could be small enough that they could be the returned value from the "Prepare" call and relieve the server side from the maintenance and accounting hassle of keeping track of them.

        It's really no hassle. We already encapsulate per-connection state in the ClientState object. And parsing the tokens from the client on each request is going to generate a ton of GC churn...
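
        A hedged sketch of what that per-connection storage could look like; ClientState is real, but the map, counter, and method names here are illustrative:

          import java.util.HashMap;
          import java.util.Map;

          class ClientStateSketch
          {
              private int nextStatementId = 0;
              private final Map<Integer, Object> prepared = new HashMap<Integer, Object>();

              // Called at prepare time: store the parsed statement, hand back a handle.
              synchronized int store(Object parsedStatement)
              {
                  int id = ++nextStatementId;
                  prepared.put(id, parsedStatement);
                  return id;
              }

              // Called at execute time: the "hash lookup" on the hot path.
              synchronized Object retrieve(int id)
              {
                  return prepared.get(id);
              }
          }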

        Rick Shaw added a comment -

        I guess I was a bit too vague... My suggestion would be to return the pre-parsed token stream, not something you would re-parse every time it is re-submitted. It is the same item I think you suggest we cache on the server side. The interesting twist is that, using the suggested method, the pre-parsed item (or "prepared" statement) could be used days later in a different connection. It would be an immutable resource. If it is cached server-side, it is only good for the life of the connection.

        But it's just a suggestion. I understand the merits of retaining complete control of the format over time, and the efficiencies of passing the "handle" back and forth. And I was not familiar with ClientState at all.

        More troubling is the batch semantics... I hate the idea of disrupting the current CQL syntax, but I think the parameter substitution step will be very fragile if there is not a notion of lists of items tightly coupled with their respective handle's parameters in the batch. The thought of thousands of rows' worth of entries in a batch, and getting the parameters right for a giant array/list that fills into the pre-compiled tokens, seems fraught with problems. How does the repeating nature get expressed? Currently it is very concrete and can be parsed into a mutation on the fly. But if it is pre-parsed, what syntax represents the concept of repetition? Is the syntax different for the prepared statement vs. the simple (not prepared) statement as today?

        The crafters of the JDBC driver specification seem to have been faced with the same problem. Their solution was to have a "batch" method, as well as "execute" methods that take an array/list of prepared statements. Unsolved for us is how to recognize the notion of "mutation start"/"mutation end" using that approach. Maybe you just do a "prepare" call for "BEGIN BATCH" and "APPLY BATCH" and use them in the list sent via the "batch" method?
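
        For reference, the JDBC batch model being described uses the standard java.sql API (the table here is made up); the batch is a list of parameter sets attached to one prepared statement, not part of the SQL text:

          import java.sql.Connection;
          import java.sql.PreparedStatement;
          import java.sql.SQLException;

          class JdbcBatchSketch
          {
              static void example(Connection conn) throws SQLException
              {
                  PreparedStatement ps = conn.prepareStatement("INSERT INTO users (id, name) VALUES (?, ?)");
                  for (int i = 0; i < 1000; i++)
                  {
                      ps.setInt(1, i);
                      ps.setString(2, "user-" + i);
                      ps.addBatch();   // queue this parameter set
                  }
                  ps.executeBatch();   // send all queued sets at once
                  ps.close();
              }
          }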

        Jonathan Ellis added a comment -

        My point was that you still need to "parse" it from the socket bytestream, even when you are not re-parsing the raw CQL.

        I really don't see what is so complex about apply(parsed_tokens_list, parameters) vs apply(saved_queries.get(parsed_tokens_id), parameters).

        Rick Shaw added a comment -

        I really don't see what is so complex about apply(parsed_tokens_list, parameters) vs apply(saved_queries.get(parsed_tokens_id), parameters).

        Given that the design has such a stream, I agree completely. Not complex at all. Hence my statement:

        Even simple statements would be parsed down to the stream of tokens; it would just be executed immediately and then tossed, as opposed to being cached and returned to the caller.

        I think we are in agreement on the need for such a precompiled item, and given that it needs to exist anyway, we might as well have only one ANTLR parser and use its product for both simple and "prepared" statements.

        Jonathan Ellis added a comment -

        we might as well only have one ANTLR parser and use its product for both simple and "prepared" statements

        Makes sense to me. So conceptually, that would involve adding to Cql.g that ? (for instance) is an acceptable "term" value?

        Rick Shaw added a comment -

        Yes.

        The confusing part would be the use of "?" in a Statement rather than a PreparedStatement. I guess it will be allowed syntactically, so it will generate the prescribed token stream, but it will be a semantic error when the token stream is (immediately) executed: the "fetch the next Term parameter" token will be encountered, but the next parameter will not exist, and that would signal an error.

        Rick Shaw added a comment -

        Server-side modifications to support prepared statements. All existing prepared statement unit tests run (with a small mod to remove a reference to implementation-specific code from the previous approach).

        Eric Evans added a comment -

        Rick, is this meant to be applied to trunk? I think it needs to be rebased.

        Rick Shaw added a comment - edited

        Yes, but perhaps I missed a release? I used the Git version. I'll try to rebase from the current SVN version.

        Jonathan Ellis added a comment -

        It applies to trunk fine for me with the exception of one piece to QP, which is probably due to the changes made there by CASSANDRA-1034. So, I'd say you're on the right branch, Rick.

        Rick Shaw added a comment -

        Try 2

        Eric Evans added a comment -

        Alright. This one is sort of a monster, so bear with me if any of this seems disorganized or jumps around. Also, if it seems long(ish), don't be disheartened; again, it's a bit of a monster. All in all, it looks pretty good.

        First, on the subject of patch submission, coding conventions, etc:

        If at all possible, try to break a large change like this into more than one changeset, with logically separate changes in separate patches. A work-flow based on patch submission sucks, I know, but Git can make this fairly easy (you mentioned you were using Git). It definitely makes review easier and is much appreciated.

        That said, don't worry about breaking this one into smaller patches (something to keep in mind for next time).

        Also, avoid unnecessary changes to whitespace, or unrelated/cosmetic changes. They add noise to the review and increase the likelihood of collisions when merging or rebasing. A bit here and there is fine, but if you do any sort of substantive cleanup, roll that up into a separate patch.

        Some specific code-style/convention nits:

        • Consensus seems to be against SuppressWarnings annotations, or the use of Override for interfaces
        • Put a space between method arguments
        • When wrapping a method call, align the arguments with the open paren

        OK, on to the bigger fish.

        There was some earlier discussion in this ticket about using preserialized binary parameters, or strings that would be parsed node-side. Which of those two views is "right" notwithstanding, I feel pretty strongly that a struct that can optionally do either is the wrong choice. Let's not make this implementation, or the client-facing interface, any more complicated than it needs to be.

        My opinion on string vs. binary is largely unchanged here. String parameters are the path of greatest abstraction, eliminating a proven vector for bugs, and keeping our query interface as friendly as possible.

        That said, if the performance advantage to preserialized values were known, and turned out to be significant enough, I'd happily reconsider (I like fast as much as the next guy). My suggestion then would be this: Let's refactor this patch to type the variables as String and get it in-tree. From there, it's a simple matter of a patch to change from String to ByteBuffer (and of course to drop the AT.fromString), and we can run some benchmarks. I will commit to working up this latter patch, and to the benchmarking, within the time remaining to 1.1, if that helps.
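
        A sketch of what that String-typed binding step might look like; AbstractType.fromString is the hook named above, while the validators list is an illustrative stand-in for however the statement tracks the type of each bind marker:

          import java.nio.ByteBuffer;
          import java.util.ArrayList;
          import java.util.List;

          import org.apache.cassandra.db.marshal.AbstractType;

          class StringBindingSketch
          {
              static List<ByteBuffer> bind(List<String> parameters, List<AbstractType<?>> validators)
              {
                  List<ByteBuffer> bound = new ArrayList<ByteBuffer>(parameters.size());
                  for (int i = 0; i < parameters.size(); i++)
                      bound.add(validators.get(i).fromString(parameters.get(i)));
                  return bound;
              }
              // The follow-up patch would retype `parameters` as List<ByteBuffer>
              // and delete the fromString call; the change stays that localized.
          }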

        Other questions and concerns in no particular order:

        • Does remove_prepared_query support something particular in JDBC (or any other standard)? How will that be used?
        • With regard to CqlPreparedResult:
          • What is the purpose of the count that is returned? How is that used?
          • What is the purpose of the CqlStatementType returned? How will that be used?
        • Is CQLStatement.cql only used for logging? If so, should we be keeping a copy of the query string around for that? Maybe we could create a toString that would descend to create the query (or something comparable).
        • There are a few places where queries are being logged at DEBUG. That seems too verbose for DEBUG.
        • Why is Term.bindIndex marked as volatile?
        • In CassandraServer.prepare_cql_query, don't create a separate variable for state
        • Not a biggie, but how about using "bind" or "bound" instead of "mark" when referring to term position? i.e. needsBind instead of isMarked, or indexBindTerms instead of discoverMarkedTerms
        • QueryProcessor.prepare seems as though it could be folded into CassandraServer.prepare_cql_query
        • It seems as though QueryProcessor.doTheStatement and QueryProcessor.process could be merged into one method.
        Eric Evans added a comment -

        Since I had to rebase a couple of times (and resolve conflicts), I've attached patches from my repository if that helps.

        Rick Shaw added a comment -

        You are right, of course; this was a really big bear, and very cumbersome.

        I appreciate all the comments and I'll try to address all the problems. Thanks for the review! Really helpful.

        I'll try to start from the top. I tried several times to break it up. Really. But there never seemed to be a clean break. I see what you did with the generated Thrift code... that really helps... I'll do that.

        I am using Git, but some of the techniques were a bit foreign to me and tricky to get all together, so I learned as I went along on this patch. I think I have turned the corner, but it definitely slowed me down and I backtracked a lot... I think I've got it now...

        I must say I really did not intend any of the whitespace changes (in general); they just sneak in there when I am not looking, I guess. I'll try to keep it tidier.

        As to coding conventions, I'll comply... NBD.

        And now on to the meat...

        I feel pretty strongly that a struct that can optionally do either is the wrong choice. Let's not make this implementation, or the client-facing interface, any more complicated than it needs to be.

        I completely agree and implemented the whole thing first using String.

        But then I hated the idea of having to go to hex to get "blob"/bytes data in. This approach let me do both. It also allowed me to serialize Java objects nicely as bytes. Can you think of a clever way to handle byte streams (AbstractBytes)? But I can live with just String. I agree it is the most flexible.

        I really don't see any real performance advantage, and the loss of flexibility on the server side is just too much, in my opinion.

        Does remove_prepared_query support something particular in JDBC (or any other standard)? How will that be used?

        It allows you to free the cached CQLStatement on the server side when a client-side PreparedStatement is closed. If not, you will accumulate dead entries until the connection is closed. That could be a lot of dead wood. Seemed the tidy thing to do.

        With regard to CqlPreparedResult:
        What is the purpose of the count that is returned? How is that used?
        What is the purpose of the CqlStatementType returned? How will that be used?

        The count is to know how many markers were actually encountered. This number serves as the upper bound for setXxx parameter indexes. Better than regexing for it; it is exactly what the server side encountered.
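
        In JDBC terms (with a hypothetical table t, and conn an open java.sql.Connection), a reported count of 3 means valid parameter indexes are 1 through 3, so the driver can reject an out-of-range setXxx locally, before anything hits the wire:

          import java.sql.Connection;
          import java.sql.PreparedStatement;
          import java.sql.SQLException;

          class BindCountSketch
          {
              static void example(Connection conn) throws SQLException
              {
                  // The server reports 3 bind markers for this statement.
                  PreparedStatement ps = conn.prepareStatement("UPDATE t SET a = ?, b = ? WHERE k = ?");
                  ps.setInt(1, 42);        // index 1: first marker
                  ps.setString(2, "x");    // index 2: second marker
                  ps.setString(3, "key1"); // index 3: last valid index
                  ps.executeUpdate();
                  ps.close();
              }
          }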

        The statement type is again a validation of what the server side saw. Remember, this happens in two steps, prepare then execute, and the execute step does not have the CQL text.
        I used it while debugging and I don't seem to use it any more, so I guess it could go, but I thought I might find a use for it, so I never did make it go away.

        Is CQLStatement.cql only used for logging? If so, should we be keeping a copy of the query string around for that? Maybe we could create a toString that would descend to create the query (or something comparable).

        Another "seems useful" so I kept it around. If something goes wrong and you need to go poking around its handy to have attached to the statement (I think).

        There are a few places where queries are being logged at DEBUG. That seems too verbose for DEBUG.

        I think there was already an instance there at DEBUG and I just added some more. I'll gladly move to TRACE.

        Why is Term.bindIndex marked as volatile?

        No good reason. I'll fix.

        Not a biggie, but how about using "bind" or "bound" instead of "mark" when referring to term position? i.e. needsBind instead of isMarked, or indexBindTerms instead of discoverMarkedTerms

        The short answer is because the question marks are often referred to in the spec as "bound variable markers", so I just propagated it. But NBD to change to the "bind" theme.

        QueryProcessor.prepare seems as though it could be folded into CassandraServer.prepare_cql_query

        I guess it could, but I liked the way it read better split up for all the methods.


        It seems as though QueryProcessor.doTheStatement and QueryProcessor.process could be merged into one method.

        I factored it out because doTheStatement is used by both process and process_prepared.


        So in summary, I'll start another pass, and would welcome responses to my excuses.

        Jonathan Ellis added a comment -

        My opinion on string vs. binary is largely unchanged here. String parameters are the path of greatest abstraction, eliminating a proven vector for bugs, and keeping our query interface as friendly as possible.

        While I agree in principle, from CASSANDRA-1788, CASSANDRA-3333, and others, we've seen that reducing copies really matters to CPU-bound queries. So I would expect the same here; convert-from-string is basically a glorified copy.

        We're already doing binary data for resultsets, so I don't think the bar for client developers gets much higher if we use them for prepared statements.

        Eric Evans added a comment -

        But then I hated the idea of having to go to hex to get "blob"/bytes data in. This approach let me do both. It also allowed me to serialize Java objects nicely as bytes. Can you think of a clever way to handle byte streams (AbstractBytes)? But I can live with just String. I agree it is the most flexible.

        Believe it or not, between ASCII strings and hex-encoded blobs, the latter is cheaper to decode node-side.

        It allows you to free the cached CQLStatement on the server side when a client-side PreparedStatement is closed. If not, you will accumulate dead entries until the connection is closed. That could be a lot of dead wood. Seemed the tidy thing to do.

        So, my initial thought (and what led me to ask this) was that in practice, the number of prepared statements per active connection is probably quite low. Low enough that there would never be any reason to evict. You probably wouldn't want to bet the farm on that though, so I had thought it would make some sense to have a threshold that would cause statements to be evicted when new ones were added (on an LRU basis).

        This seems to have the advantage of making the interface simpler. It would also be less error-prone; relying on the client to free resources seems a bit brittle.

        What do you think?
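
        A minimal sketch of that policy using java.util.LinkedHashMap in access-order mode; the threshold and the Object value type (standing in for the cached statement) are placeholders:

          import java.util.LinkedHashMap;
          import java.util.Map;

          class PreparedStatementCache extends LinkedHashMap<Integer, Object>
          {
              private static final int MAX_ENTRIES = 50; // placeholder threshold

              PreparedStatementCache()
              {
                  super(16, 0.75f, true); // accessOrder = true yields LRU iteration order
              }

              @Override
              protected boolean removeEldestEntry(Map.Entry<Integer, Object> eldest)
              {
                  // Evict the least recently used statement once over the threshold.
                  return size() > MAX_ENTRIES;
              }
          }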

        The count is to know how many markers were actually encountered. This number serves as the upper bound for setXxx parameter indexes. Better than regexing for it; it is exactly what the server side encountered.

        OK

        The statement type is again a validation of what the server side saw. Remember, this happens in two steps, prepare then execute, and the execute step does not have the CQL text.
        I used it while debugging and I don't seem to use it any more, so I guess it could go, but I thought I might find a use for it, so I never did make it go away.

        It's probably best to avoid it without a raison d'etre. Things like this are easier to add later than they are to remove once they've made it into a release.

        Another "seems useful" so I kept it around. If something goes wrong and you need to go poking around its handy to have attached to the statement (I think).

        I worry that it might be wasteful, especially if we do need to worry about the number of statements we keep for each connection.

        Query logging can be used to capture the verbatim string for troubleshooting purposes, and all of the data should still be there in the form of the graph of objects. Is there some known case that doesn't cover?

        I think there was already an instance there at DEBUG and I just added some more. I'll gladly move to TRACE.

        The way it was originally, the statement type (SELECT, UPDATE, etc) was logged at level DEBUG, and the entire query was logged at TRACE. If there isn't a reason to change, we should probably keep it that way.

        The short answer is because the question marks are often referred to in the spec as "bound variable markers", so I just propagated it. But NBD to change to the "bind" theme.

        OK. Yeah, even say indexBindMarkers would be good. I was just thinking that "markers" was a bit generic there.

        Eric Evans added a comment -

        While I agree in principle, from CASSANDRA-1788, CASSANDRA-3333, and others, we've seen that reducing copies really matters to CPU-bound queries. So I would expect the same here; convert-from-string is basically a glorified copy.

        Is it more expensive to parse them as strings? Sure, but evaluating the cost-to-benefit could be difficult enough without guessing at what that cost is. Whether it's preserialized binary, or string, it should be one or the other and it sounds like no one is in disagreement there. Testing it both ways should be very easy, so I suggest we revisit this part of the discussion (if necessary) after we have some real data.

        We're already doing binary data for resultsets, so I don't think the bar for client developers gets much higher if we use them for prepared statements.

        If by bar you mean skills/capabilities, then sure, but that wasn't my concern.

        Serializing to bytes is a Whole Other Thing; it's not as if already doing the one is going to make doing the other any easier or less error-prone. It's also two very different vectors for bugs, multiplied by the number of client implementations. And it is very different from deserializing results, which can only happen one way; a serialization bug could mean that execute_cql_query() and execute_prepared_cql_query() do very different things with the same query.

        Rick Shaw added a comment -

        I have created 2475-v3.1.patch and the thrift code in 2475-v3.2-Thrift.patch.

        These patches reflect the changes from the review.

        • removed unnecessary Thrift methods and parameters as agreed.
        • went with the "never release" of the prepared CQLStatement approach.
        • tidied up the debugging; moving more verbose stuff to TRACE
        • replaced "marked" thing with "bind" theme as agreed
        • went to a "String" bound variable for now to get a base into trunk. (Eric to test performance with ByteBuffers)
        Eric Evans added a comment -

        Thanks, Rick; I think we're getting closer.

        I was going to run the tests in the new-prepared branch of the JDBC driver, but had problems figuring out how to build it; then I realized that it needed changes for CqlBindValue and CqlResultType, and had problems getting it set up in Eclipse (Thrift not on the classpath).

        I'll try to get back to this tomorrow, but if you have any time to look at this in the meantime, that would be great.

        Rick Shaw added a comment -

        Made the changes but didn't push to the branch... Sorry... I'll get on it.

        Eric Evans added a comment -

        OK, the attached v2-* patches are:

        • v2-0001-CASSANDRA-2475-rickshaw-2475-v3.1.patch.txt
        • v2-0002-rickshaw-2475-v3.2-Thrift.patch-w-changes.txt

        Rick's last, rebased against trunk as of today.

        • 0003-eevans-increment-thrift-version-by-1-not-3.txt

        Increment Thrift minor from 19 to 20, instead of 22

        • 0004-eevans-misc-cleanups.txt

        Various code cleanups

        • 0005-eevans-refactor-for-better-encapsulation-of-prepare.txt

        A refactoring of CassandraServer.prepare_cql_query and QueryProcessor.prepare() to encapsulate query preparation within QueryProcessor

        • 0006-eevans-log-queries-at-TRACE.txt

        Log queries at TRACE.

        • 0007-use-an-LRU-map-for-storage-of-prepared-statements.txt

        Turn the Map that stores prepared statements into an LRU cache (currently evicts when size() > 50)


        Patches 3-6 are pretty minor and I would have just committed them as-is, but it wouldn't hurt to have someone else (Rick?) look at 7 (at least).

        Rick Shaw added a comment -

        +1

        Thanks Eric, for all the help and attention to detail! I learned a lot. I'll be tuning up my formatter (Eclipse) to handle the nits of formatting and white-space. And I'll try to break things up much finer in future large patches.

        I'm still a bit iffy about having the server ditch entries in the prepared statement map without regard to whether the client side closed the associated PreparedStatement. But like you, I think the chances of ever seeing 50 entries without closing the connection are slim.

        On to Batch and Functions.

        Eric Evans added a comment -

        Thanks Eric, for all the help and attention to detail! I learned a lot. I'll be tuning up my formatter (Eclipse) to handle the nits of formatting and white-space. And I'll try to break things up much finer in future large patches

        Thank you for knocking this out!

        I'm still a bit iffy about having the server ditching entries in the prepared statement map without regard to whether the client side closed the associated PreparedStatement. But like you, I think the chances of ever seeing 50 entries without closing the connection are slim.

        Think about it this way: whether or not the client removes unused entries, it's still prudent to put a limit on the number of statements we cache, otherwise a buggy client could eat up our heap. And, if you're going to put something in place to evict, choosing the least recently accessed seems like the safest approach.

        The question then becomes: how many can we afford to hang onto, and what is the likelihood that, short of a buggy client, we won't be able to accommodate everything needed? Remember, this is only for the life of a connection; reconnecting starts a whole new Map. So if an API to remove entries is ultimately of little to no benefit, then we should save everyone the trouble of implementing and maintaining it.

        And, we could always take this to a wider audience to see what others think, but it's easier to add something like this later than it is to remove (or fix) it.

        Also, 50 might not be the right number. I basically pulled that out of my butt.

        Hudson added a comment -

        Integrated in Cassandra #1256 (See https://builds.apache.org/job/Cassandra/1256/)
        use an LRU map for storage of prepared statements

        Patch by eevans; reviewed by Rick Shaw for CASSANDRA-2475
        log CQL queries at TRACE

        Patch by eevans; reviewed by Rick Shaw for CASSANDRA-2475
        refactor for better encapsulation of prepare()

        Patch by eevans; reviewed by Rick Shaw for CASSANDRA-2475
        review-related code-style fixes and cleanups

        Patch by eevans; reviewed by Rick Shaw for CASSANDRA-2475
        compiler generated thrift code for prepared statements

        Patch by Rick Shaw; reviewed by eevans for CASSANDRA-2475
        CQL support for prepared statements

        Patch by Rick Shaw; reviewed by eevans for CASSANDRA-2475

        eevans : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1214527
        Files :

        • /cassandra/trunk/src/java/org/apache/cassandra/service/ClientState.java
        • /cassandra/trunk/src/java/org/apache/cassandra/thrift/CassandraServer.java

        eevans : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1214526
        Files :

        • /cassandra/trunk/src/java/org/apache/cassandra/cql/QueryProcessor.java

        eevans : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1214525
        Files :

        • /cassandra/trunk/src/java/org/apache/cassandra/cql/QueryProcessor.java
        • /cassandra/trunk/src/java/org/apache/cassandra/thrift/CassandraServer.java

        eevans : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1214523
        Files :

        • /cassandra/trunk/src/java/org/apache/cassandra/cql/DeleteStatement.java
        • /cassandra/trunk/src/java/org/apache/cassandra/cql/QueryProcessor.java
        • /cassandra/trunk/src/java/org/apache/cassandra/cql/SelectExpression.java
        • /cassandra/trunk/src/java/org/apache/cassandra/cql/SelectStatement.java
        • /cassandra/trunk/src/java/org/apache/cassandra/cql/Term.java
        • /cassandra/trunk/src/java/org/apache/cassandra/cql/WhereClause.java
        • /cassandra/trunk/src/java/org/apache/cassandra/thrift/CassandraServer.java

        eevans : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1214521
        Files :

        • /cassandra/trunk/interface/cassandra.thrift
        • /cassandra/trunk/interface/thrift/gen-java/org/apache/cassandra/thrift/Cassandra.java
        • /cassandra/trunk/interface/thrift/gen-java/org/apache/cassandra/thrift/Constants.java
        • /cassandra/trunk/interface/thrift/gen-java/org/apache/cassandra/thrift/CqlPreparedResult.java

        eevans : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1214520
        Files :

        • /cassandra/trunk/interface/cassandra.thrift
        • /cassandra/trunk/src/java/org/apache/cassandra/cql/AbstractModification.java
        • /cassandra/trunk/src/java/org/apache/cassandra/cql/BatchStatement.java
        • /cassandra/trunk/src/java/org/apache/cassandra/cql/CQLStatement.java
        • /cassandra/trunk/src/java/org/apache/cassandra/cql/Cql.g
        • /cassandra/trunk/src/java/org/apache/cassandra/cql/CreateColumnFamilyStatement.java
        • /cassandra/trunk/src/java/org/apache/cassandra/cql/DeleteStatement.java
        • /cassandra/trunk/src/java/org/apache/cassandra/cql/QueryProcessor.java
        • /cassandra/trunk/src/java/org/apache/cassandra/cql/SelectExpression.java
        • /cassandra/trunk/src/java/org/apache/cassandra/cql/SelectStatement.java
        • /cassandra/trunk/src/java/org/apache/cassandra/cql/Term.java
        • /cassandra/trunk/src/java/org/apache/cassandra/cql/UpdateStatement.java
        • /cassandra/trunk/src/java/org/apache/cassandra/cql/WhereClause.java
        • /cassandra/trunk/src/java/org/apache/cassandra/service/ClientState.java
        • /cassandra/trunk/src/java/org/apache/cassandra/thrift/CassandraServer.java
        Jonathan Ellis added a comment -

        I'm still a bit iffy about having the server ditch entries in the prepared statement map without regard to whether the client side closed the associated PreparedStatement.

        Me, too. Feels like premature defensiveness to me. I think we're a lot more likely to have someone get bitten by PS number 51 failing than by OOMing because of 100,000 if we just say "we'll keep them around until you close them or disconnect."

        Eric Evans added a comment -

        Me, too. Feels like premature defensiveness to me. I think we're a lot more likely to have someone get bitten by PS number 51 failing than by OOMing because of 100,000 if we just say "we'll keep them around until you close them or disconnect."

        To clarify, you're saying you're in favor of putting no constraints whatsoever on the number of stored statements, and relying solely on clients to free them up?

        Does this change at all (the perceived likelihood) if the number were 1,000 instead of 51? Or 10,000?

        And, to be clear about what I was thinking: Short of a (seriously) buggy client, my expectation is that there would be no reason we couldn't hold as many of these statements as needed (with or without an API to remove them on close()). Based on that assumption, the LRU map was a brake against buggy clients (and so should probably be much higher than 50), and the API to remove them was unnecessary.

        Jonathan Ellis added a comment -

        To clarify, you're saying you're in favor of putting no constraints whatsoever on the number of stored statements, and relying solely on clients to free them up?

        That's what I meant, yes.

        Does this change at all (the perceived likelihood) if the number were 1,000 instead of 51? Or 10,000?

        If a client is doing it "right," then it will use pooled connections, each of which will use a PS for each query needed by the app. Hundreds or even thousands isn't crazy. So I'd be okay with a limit of 10,000 as "practically equivalent to unlimited for all intents and purposes, but it might save you from a buggy client."

        Eric Evans added a comment -

        If a client is doing it "right," then it will use pooled connections, each of which will use a PS for each query needed by the app. Hundreds or even thousands isn't crazy. So I'd be okay with a limit of 10,000 as "practically equivalent to unlimited for all intents and purposes, but it might save you from a buggy client."

        OK. The second half of that is: how many of those hundreds or even thousands will you want to remove during the life of a connection? For every application I can think of, I picture, like you say, a pool of connections that contain a PS for each query needed. The cases where you would no longer need a PS for the life of a connection seem exceptional enough that there would be no harm in keeping it.

        I'm not fundamentally opposed to an API call to remove PSes; I'm just trying to keep things simpler if it doesn't actually provide any value.

        But this raises something I hadn't considered. Storing prepared statements per connection seems like it could make things awkward. If you want to access a PS on an arbitrary connection pulled from a pool, then you're going to need some way of dealing with a connection that doesn't have that PS stored. Likewise, if we have an API for removing them, then you'd need to iterate all open connections to remove the others. Or am I missing something?

        Jonathan Ellis added a comment -

        I'm not fundamentally opposed to an API call to remove PSes; I'm just trying to keep things simpler if it doesn't actually provide any value.

        WFM. I assumed Rick had already implemented one for JDBC API completeness but if we're just going to no-op that out for now I'm not going to lose any sleep over it.

        If you want to access a PS on an arbitrary connection pulled from a pool, then you're going to need some way of dealing with a connection that doesn't have that PS stored.

        It's the client's responsibility to prepare the statements on each connection before using them, which implies some caching behavior on the part of the driver as in http://www.theserverside.com/news/1365244/Why-Prepared-Statements-are-important-and-how-to-use-them-properly
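
        Roughly the driver-side pattern I mean, as a sketch (the class and method names are invented, and prepareOnServer() is a hypothetical stand-in for whatever Thrift call the driver actually makes):

          import java.util.HashMap;
          import java.util.Map;

          // Each pooled connection lazily prepares a statement the first time
          // it sees a given query string, then reuses the cached server-side
          // handle on subsequent calls.
          public class PooledConnection
          {
              // query text -> statement handle, scoped to this one connection
              private final Map<String, Integer> prepared = new HashMap<String, Integer>();

              public int prepare(String cql)
              {
                  Integer handle = prepared.get(cql);
                  if (handle == null)
                  {
                      handle = prepareOnServer(cql); // cache miss: (re)prepare here
                      prepared.put(cql, handle);
                  }
                  return handle;
              }

              private int prepareOnServer(String cql)
              {
                  // hypothetical stub: issue the prepare over this connection's
                  // transport and return the resulting handle
                  throw new UnsupportedOperationException("illustrative only");
              }
          }

        That way a PS "exists" on whichever connections have actually used it, and a fresh connection from the pool just pays the prepare cost once on first use.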

        Eric Evans added a comment -

        WFM. I assumed Rick had already implemented one for JDBC API completeness but if we're just going to no-op that out for now I'm not going to lose any sleep over it.

        He did, but we removed it at an earlier stage of the review, for the reasons listed here (so if it's decided that we should have one, I'll do the work to put it back in).

        It's the client's responsibility to prepare the statements on each connection before using them, which implies some caching behavior on the part of the driver as in http://www.theserverside.com/news/1365244/Why-Prepared-Statements-are-important-and-how-to-use-them-properly

        OK, that makes sense. Though, it would seem to add another data-point to the "API to remove PSes isn't necessary" argument, since a close() on a pooled connection isn't going to remove the statement server-side anyway.

        Jonathan Ellis added a comment -

        it would seem to add another data-point to the "API to remove PSes isn't necessary" argument

        Agreed, let's leave that out for now.

        Hudson added a comment -

        Integrated in Cassandra #1257 (See https://builds.apache.org/job/Cassandra/1257/)
        bump maximum cached prepared statements to 10,000 (from 50)

        (and fix Map so that it is actually LRU)

        Patch by evans for CASSANDRA-2475

        eevans : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1214803
        Files :

        • /cassandra/trunk/src/java/org/apache/cassandra/service/ClientState.java
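
        (Worth noting for the "fix Map so that it is actually LRU" part: with its default constructors, LinkedHashMap iterates in insertion order, so an eviction hook would drop the first statement prepared rather than the least recently used one. Assuming the cache is built on LinkedHashMap, the fix comes down to the third constructor argument:)

          import java.util.LinkedHashMap;
          import java.util.Map;

          class LruOrderingNote
          {
              // insertion-ordered: "eldest" means first inserted -- LRU in name only
              Map<Integer, Object> insertionOrdered = new LinkedHashMap<Integer, Object>();

              // access-ordered: get()/put() move an entry to the back of the
              // iteration order, so the eldest entry really is least recently used
              Map<Integer, Object> accessOrdered = new LinkedHashMap<Integer, Object>(16, 0.75f, true);
          }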
        Eric Evans added a comment -

        I found a rather serious issue with the way bind markers are indexed.

        Basically QueryProcessor.discoverBoundTerms attempts to work through the parsed result instances in order, but the order in which terms appeared in the statement is not preserved. For example, the following are equivalent:

        col = 10 AND KEY > aaa AND KEY < bbb
        KEY < bbb AND col = 10 AND KEY > aaa
        

        However, both will be represented by a WhereClause with attributes for start and finish key, and a list of column relations, with no regard to the order the terms appeared in the original statement.

        The attached patches (v10-*) refactor marker indexing to occur within the parser.
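
        The shape of that fix, as a sketch (names invented for illustration, not the actual v10 code): the parser owns the index counter and bumps it in the grammar action for each '?' token, so indexes always follow the order markers appear in the query text, no matter how the WHERE clause is later decomposed.

          // Sketch: one counter per parse, consulted from the '?' grammar rule.
          public class ParseContext
          {
              private int markerIndex = 0;

              // invoked in token order, left to right through the query text
              public int nextMarkerIndex()
              {
                  return markerIndex++;
              }

              // after the parse, this doubles as the statement's marker count
              public int markerCount()
              {
                  return markerIndex;
              }
          }

        A Term built for a marker then carries its index from the moment it is parsed, instead of QueryProcessor trying to reconstruct ordering from the finished WhereClause.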

        Rick Shaw added a comment -

        +1

        Very elegant fix to the problem. As is often the case, new folks don't want to be too disruptive, so they try to isolate a problem that is best solved with a bold change. Cool.

        Hudson added a comment -

        Integrated in Cassandra #1260 (See https://builds.apache.org/job/Cassandra/1260/)
        clean up Term ctors

        Patch by eevans; reviewed by Rick Shaw for CASSANDRA-2475
        index bind markers using parser

        Patch by eevans; reviewed by Rick Shaw for CASSANDRA-2475
        properly report number of markers in a statement

        Patch by eevans; reviewed by Rick Shaw for CASSANDRA-2475

        eevans : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1220843
        Files :

        • /cassandra/trunk/src/java/org/apache/cassandra/cql/Term.java

        eevans : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1220840
        Files :

        • /cassandra/trunk/src/java/org/apache/cassandra/cql/CQLStatement.java
        • /cassandra/trunk/src/java/org/apache/cassandra/cql/Cql.g
        • /cassandra/trunk/src/java/org/apache/cassandra/cql/QueryProcessor.java
        • /cassandra/trunk/src/java/org/apache/cassandra/cql/Term.java

        eevans : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1220839
        Files :

        • /cassandra/trunk/src/java/org/apache/cassandra/cql/QueryProcessor.java

          People

          • Assignee: Rick Shaw
          • Reporter: Eric Evans
          • Reviewer: Eric Evans
          • Votes: 1
          • Watchers: 11
