Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Fix Version/s: 1.2.0 beta 1
    • Component/s: API, Core
    • Labels:

      Description

      A custom wire protocol would give us the flexibility to optimize for our specific use-cases, and eliminate a troublesome dependency (I'm referring to Thrift, but none of the others would be significantly better). Additionally, RPC is a bad fit here, and we'd do better to move in the direction of something that natively supports streaming.

      I don't think this is as daunting as it might seem initially. Utilizing an existing server framework like Netty, combined with some copy-and-paste of bits from other FLOSS projects would probably get us 80% of the way there.

      1. cql_binary_protocol
        13 kB
        Sylvain Lebresne
      2. cql_binary_protocol-v2
        14 kB
        Sylvain Lebresne

        Activity

        Nick Berardi added a comment -

        Apparently a lot of other software also chose 8000 at random: http://www.speedguide.net/port.php?port=8000 Everything from internet radio to trojans to VoIP uses it. It might be wise to choose one that is less random, so that communication over the internet isn't hampered by other rogue programs getting blocked for using the same port.

        Port 8160 seems pretty clean:
        http://www.speedguide.net/port.php?port=8160

        I don't care either way, 8000 just seems crowded to me.
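A quick way to sanity-check a candidate port on a given machine is to see whether anything is already accepting TCP connections on it. This is only a local sketch (it says nothing about firewalls or what other software registers the port for):

```python
import socket

def port_is_free(port, host="127.0.0.1"):
    """Return True if nothing currently accepts TCP connections on (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(0.25)
        # connect_ex returns 0 only when a listener accepted the connection
        return s.connect_ex((host, port)) != 0
```

Combined with a registry lookup like the speedguide.net pages above, this gives a rough idea of how "crowded" a port is in practice.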

        Nick Berardi added a comment -

        I didn't see a reference to any port, so 8000 is good enough for me. It wasn't actually stopping me from doing anything, I just didn't know what endpoint to connect my build up to.

        Sylvain Lebresne added a comment -

        The currently committed implementation uses port 8000 for the binary protocol. Honestly, that was kind of a random choice, and I don't care about changing that to any other random choice. Not sure why that choice would stop anyone from starting to implement the binary protocol in any client library, though.

        Nick Berardi added a comment -

        Has a default port been decided for the binary protocol yet? I would love to get a jump on implementing this for my .NET library, FluentCassandra.

        Sylvain Lebresne added a comment -

        I'm very much with the other commenters suggesting that async messaging will be important soon

        I've changed my mind to: CASSANDRA-4473

        paul cannon added a comment -

        3. we will be nice and have the server understand old protocol versions for at least 1 or 2 major C* versions after we've changed them.

        Ok, a guarantee like this, plus the reporting of supported protocol versions in OPTIONS, alleviates most of my concern.

        New query capabilities will not imply a change in the protocol, for instance. Typically, I don't know that libpq changes all the time or lacks hundreds of features.

        It's interesting that you bring that up, because libpq uses 32 bits for its versioning. 16 bits for a major version, and 16 for a minor. And no, it doesn't change much.
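That libpq scheme is compact enough to show in two lines; the version travels as a single 32-bit integer with the major version in the high 16 bits and the minor in the low 16:

```python
def pg_protocol_version(major, minor):
    """Pack a libpq-style protocol version number: 16 bits of major
    version followed by 16 bits of minor, in one 32-bit integer."""
    return (major << 16) | minor

# PostgreSQL protocol 3.0 goes on the wire as (3 << 16) | 0 == 196608
```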

        I see your point, though, that limiting the space to 7 bits should be a strong discouragement against making changes, which ought to result in a more stable protocol. I don't quite share the optimism of how well it will serve us, but I do at least understand, and I guess having the option of "version==127 -> use additional bytes for version" in the future is good enough.

        Sylvain Lebresne added a comment -

        Let me try to clarify my position.

        The current versioning is per-frame, so widening it would be a per-frame extra cost. Furthermore, I like that the header is just 8 bytes, because that's 1) small and 2) easy to decode (even for debugging with tcpdump or other tools).
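To illustrate why a fixed 8-byte header is easy to decode, here is a sketch using a field order assumed from the attached draft (version, flags, stream/reserved byte, opcode, then a 4-byte big-endian body length); treat the layout as illustrative, not normative:

```python
import struct

# Assumed layout: 1-byte version, 1-byte flags, 1 signed byte,
# 1-byte opcode, 4-byte big-endian body length -- 8 bytes total.
HEADER = struct.Struct(">BBbBI")

def encode_header(version, flags, stream, opcode, body_length):
    return HEADER.pack(version, flags, stream, opcode, body_length)

def decode_header(frame):
    """Return (version, flags, stream, opcode, body_length) from the first 8 bytes."""
    return HEADER.unpack(frame[:HEADER.size])
```

A decoder this small is also easy to replicate in a tcpdump/Wireshark dissector, which is part of the appeal of keeping the header fixed-size.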

        And the idea of the current implementation was to say:

        1. there is just one version for the whole protocol.
        2. two different protocol versions are incompatible, period (though we should make the guarantee that a server is always able to understand at least the STARTUP message of every old version, to be able to return an error with what versions are supported).
        3. we will be nice and have the server understand old protocol versions for at least 1 or 2 major C* versions after we've changed them.

        As far as I can tell, this does cover what we need, i.e. clients can always tell whether they can talk to a server or not and we have a clear compatibility story (not the same version == not compatible). I'll note that this is how we version our internal protocol (not trying to imply it's the exact same problem but ...).

        That being said, I would agree there are at least 2 missing pieces in the implementation currently:

        • The server should respond with an error to the STARTUP message if the version is not understood (with a list of supported versions), which is not implemented yet.
        • The OPTIONS/SUPPORTED pair of messages could return the supported protocol versions too. But I plan to modify the SUPPORTED message in a separate ticket to make it more versatile anyway, so I'll do that then.

        There is the argument that a 7-bit version number might not be enough. On that, my personal belief is that we don't have a choice: we have to make it enough if we want the protocol to be used. You can't have a network protocol that changes every two weeks. If it changes all the time, it's not a "protocol". And it happens that I think it's very possible to achieve a stable protocol, because imho there are not infinitely many things that the protocol can do (which again doesn't mean there isn't tons of stuff the current implementation doesn't handle correctly, but that's ok, it's not the final version and the goal is to add those missing pieces for the final version). This is especially true imho because we have CQL. New query capabilities will not imply a change in the protocol, for instance. Typically, I don't know that libpq changes all the time or lacks hundreds of features.

        async messaging

        I'm warming up to the idea of allowing async query handling too, actually, especially because I think we can make it optional for the client. But I'll open a separate ticket.

        challenge/response auth

        As said above, I'm all for having SASL, and I certainly intend to have that for version 1; I just prefer handling that in its own ticket. Again, the committed version is not version 1.

        Now, separating the versioning into a simple frame version and another per-connection message version is a possibility, but I'm afraid of the following downsides:

        • It complicates things. You now have to care about cases like the server handling the frame version but not the messaging one, or the contrary. You also need to document the versions separately, so you need a versioned document for the frame protocol and one for the message protocol. Again, a single version where different version == incompatible feels simpler.
        • I really do believe that having the protocol stable (as in, one version per C* major version as a worst case) is not optional. Adding a more complex versioning doesn't convey that intention and will make it easier for us to make excuses to break the protocol often. I don't want that.
        • I'm not sure the 3-part semantic versioning makes complete sense for a protocol. At least the last part (the 'patch' version) does not make sense, because this info will be carried by the C* version (you don't "patch" a protocol, you patch an implementation of that protocol). For the minor version, it's more debatable I suppose, but I would argue that it doesn't make sense either, again because a protocol is different from its implementation. Any change to a protocol will break either the server or the client, so there aren't really minor changes. Again, one simple major version seems much simpler.

        (a) a signal to clients of compatibility

        My argument below is that this signal exists since we do have a version

        (b) an acknowledgement of a possibly-remote chance of failure; a safety net

        We have a safety net since we have a version. With the 'only major version' versioning I'm advocating, in the remote chance we fail at keeping things stable, one version's purpose can be to introduce finer-grained versioning. Introducing finer-grained versioning right away is not a safety net, it's saying that you expect to fail.

        paul cannon added a comment -

        First, I think that for dealing with cross-version compatibility, having one single version number is much easier than splitting it in 2 different versions. All you care about is 'can I have a discussion with that server', and there is no point in adding complexity by having to check 2 numbers (this would also complicate documentation typically), so I'm fairly strongly against that idea.

        It doesn't need to add much complexity; it's just a separation of layers, both of which would be simpler afterward. Any good implementation of this protocol will already be handling the framing at a different level than the messaging. But if you're that much against it, that's fine; I'd be ok with just a single, wider version space.

        Now there is the question of whether we'll have enough version numbers with the current format. First, I certainly hope so. I think that the protocol should be as stable as we possibly can make it, and I honestly don't think things in the protocol will change all the time. There are really only so many things that a protocol can do and so many things we can add.

        I wholeheartedly agree we want the protocol to be as stable as possible, while still evolving toward additional needs (I'm very much with the other commenters suggesting that async messaging will be important soon, and also challenge/response auth). I also hope that we won't need more than a few dozen revisions total. But I find it unlikely; the very nature of network protocols is that the more specific in intention they are, the more they will need to grow over time.

        Another consideration is letting servers and clients know, if their protocol versions differ, whether they're still compatible with each other. This is vital information to have—it will be necessary in practical situations for clients to speak with multiple server versions and for servers to speak with multiple client versions. Think rolling upgrades or multiple client libs in use simultaneously.

        Finally, what's the rationale for not having a three-part semantic version for the protocol? It's an extra two bytes used per connection. There's practically no downside, and a massive potential upside.

        I prefer not designing the first version of the protocol assuming we will fail at that.

        It's not an assumption that we will fail; it's (a) a signal to clients of compatibility and (b) an acknowledgement of a possibly-remote chance of failure; a safety net. It's just like having exception handlers around complex code, even if you don't expect to need them. It's just good design.

        Sylvain Lebresne added a comment -

        Do you expect the spec to be implemented "by hand"?

        That's what I would do, but everybody is free to do whatever they prefer.

        I don't see too many people writing their own generators

        That's likely true, but I personally don't see that as a showstopper. In my ideal world, what I would like to see is 1 good implementation (of the protocol) per language that everyone reuses. Now, yeah, it's unlikely COBOL will get such a good implementation soon, but I can live with that.

        it should be possible to verify (not "test", verify) a driver for compliance

        Sure, it would be nice. I'd lie if I said this is a priority on my todo list at this stage, but I would certainly welcome such a contribution.

        Holger Hoffstätte added a comment -

        Another question about implementing the protocol. Thrift's only saving grace is its capability to generate code for a multitude of languages from a schema (aka protocol DSL). Do you expect the spec to be implemented "by hand"? Even with a BNF or similar I don't see too many people writing their own generators (sigh).

        Either way, it should be possible to verify (not "test", verify) a driver for compliance, otherwise I'd expect the same drama as with AMQP where non-vendor supplied drivers were (and often still are) in constantly diverging states of brokenness. Just a suggestion.

        Holger Hoffstätte added a comment -

        Sylvain, thanks for the explanation - makes perfect sense, though I disagree. In my experience, every protocol that starts with fully-synchronous interactions eventually has to bite that bullet. I agree though that it's not the most important thing right now, and other ways of coordinating client<->cluster interactions could be much more beneficial and efficient.

        As for response frames I don't see how you could handle two interleaved queries (first slow, second fast but "blocked from returning" by the previous one) without command/response correlation; there can only be a single query to a node in flight at a time.
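The correlation mechanism being asked for here is usually a small per-frame stream id that routes each response back to its request. A minimal sketch of that idea (hypothetical: the draft under discussion carries no such id, and the names below are made up for illustration):

```python
class StreamMultiplexer:
    """Per-connection request/response correlation via a small
    integer stream id carried in each frame header (hypothetical)."""

    def __init__(self, max_streams=128):
        self._free = list(range(max_streams))  # ids not currently in flight
        self._pending = {}                     # stream id -> request context

    def register(self, request):
        """Reserve a stream id for an outgoing request."""
        if not self._free:
            raise RuntimeError("all stream ids in flight")
        stream = self._free.pop()
        self._pending[stream] = request
        return stream  # would be written into the outgoing frame header

    def complete(self, stream):
        """Route an incoming response home, in whatever order it arrives."""
        request = self._pending.pop(stream)
        self._free.append(stream)
        return request
```

With ids like this, a fast second query can return before a slow first one on the same connection, which is exactly the interleaving the comment above says is impossible without correlation.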

        Sylvain Lebresne added a comment -

        I did consider a fully asynchronous protocol for a minute, but I felt that it would complicate things quite a bit, and I'm unsure of the benefits. That is, I can see the point of out-of-order frames in things like SPDY, where you want to multiplex lots of different streams inside the same connection, but it feels overkill for us. I typically don't think having clients maintain a pool of connections is a big problem, and in fact clients will want that anyway because you want to load balance the connections over multiple nodes.

        That being said, there is nothing in the spec (or if there is, it's unintentional) preventing one-way commands nor even incremental server push. There is no frame sequence number, which prevents completely out-of-order frames, but that doesn't mean clients should, say, expect that the next frame after a query request is necessarily a result frame. For instance, I intend to explore adding some form of server event push, where the server could push (at any time) an event frame saying, for instance, that a new node joined the cluster.

        Holger Hoffstätte added a comment -

        Hi - just found this and got curious (esp. after our conclusions on Thrift in Cologne). The protocol looks straightforward enough, though I'm not sure about the general assumption of synchronous behaviour. My understanding so far is that this is a completely synchronous request/response protocol with no provision for out-of-order or async one-way commands (which might also obviate otherwise empty Void RESULTs), and I don't see frame or command/reply correlation numbers anywhere. These might become necessary for incremental server push as well. Is this assessment correct? Is this omission intentional? Just asking whether this was a consideration at all, since it has coupling implications for the actual transport layer and client interaction.

        Sylvain Lebresne added a comment -

        It appears as though there is no addition to the cassandra-clientutils jar (yet) to allow for client only access to the new transport libraries

        No, there is not. But we love contributions

        Is there a plan to run the new native CQL protocol without the thrift libraries?

        Hell yeah. I intend to remove the thrift exceptions as part of CASSANDRA-3979 (it's not explicit from the ticket description, but I think it would be a good occasion to do so). Making thrift needed only for the thrift interface is definitely a goal (at least one of mine). It's just that we do that incrementally.

        Sylvain Lebresne added a comment -

        This draft spec doesn't allow much in that respect.

        That was certainly not the intention, so let me try to give my point of view.

        First, I think that for dealing with cross-version compatibility, having one single version number is much easier than splitting it in 2 different versions. All you care about is 'can I have a discussion with that server', and there is no point in adding complexity by having to check 2 numbers (this would also complicate documentation typically), so I'm fairly strongly against that idea.

        Now there is the question of whether we'll have enough version numbers with the current format. First, I certainly hope so. I think that the protocol should be as stable as we possibly can make it, and I honestly don't think things in the protocol will change all the time. There are really only so many things that a protocol can do and so many things we can add.

        But let me be clear that I don't pretend the current committed version is perfect or in any way final. I wanted it committed so that people can try working with it and give feedback. I typically don't think we should call the first version of the protocol final before there are at least 2 or 3 drivers (in different languages) that use it.

        But once we get to that first final version, I think that at the very least we should not bump the protocol version between minor C* versions. And truth is, while I expect burning a few versions towards the beginning because I'm sure we won't get everything right, if we continue burning 1 version of the protocol every major C* version after, say, version 3 or 4 of the protocol, then I would consider that a big big failure on our part (if only because no driver implementor in their right mind will want to put up with our shit if we do that). But even if that happens, with one major C* release every 6 months, we're good for some time.

        Besides, if I happen to be completely off the mark and we do burn version like crazy, version 127 of the protocol can very well add a new byte to the header to increase the number of possible versions. But I estimate the chances of that happening to be 0.

        That being said, and to make things a bit more flexible, I did put a paragraph in the spec saying that clients should never assume a message contains no more data than what they know about. So it will be possible to add optional info to a server response without having to burn a protocol version, which might be handy.
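        That compatibility rule — decode the fields you know and ignore any trailing bytes — can be sketched as follows (the field layout here is purely hypothetical, for illustration):

```python
import struct

def decode_body(body: bytes) -> dict:
    """Decode a hypothetical response body whose only known field is one
    [short string]: a 2-byte big-endian length followed by UTF-8 bytes.
    Trailing bytes (optional info added by a newer server) are ignored."""
    (length,) = struct.unpack_from(">H", body, 0)
    value = body[2:2 + length].decode("utf-8")
    # body[2 + length:] may be non-empty on newer servers; a compliant
    # client simply ignores it instead of raising an error.
    return {"value": value}

old_server = struct.pack(">H", 2) + b"ok"
new_server = old_server + b"\x00\x05extra"   # same field plus optional data
assert decode_body(old_server) == decode_body(new_server) == {"value": "ok"}
```

An older client can thus keep talking to a newer server without a protocol version bump, as long as additions are append-only.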

        Also, the OPTIONS/STARTUP pair is here exactly so that there can be some negotiation, which will allow supporting new optional features without necessarily increasing the protocol version, further reducing the changes that do require incrementing it.
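        A client-side sketch of that negotiation, where the client enables an optional feature only if the server's SUPPORTED response offers it (option names like CQL_VERSION and COMPRESSION follow the draft spec's startup options, but treat the exact names and values here as illustrative):

```python
def choose_startup_options(supported: dict) -> dict:
    """Given the option map from the server's SUPPORTED response, build
    the STARTUP options, enabling optional features only when offered."""
    options = {"CQL_VERSION": "3.0.0"}
    if "snappy" in supported.get("COMPRESSION", []):
        options["COMPRESSION"] = "snappy"   # enable only if the server offers it
    return options

assert choose_startup_options({"COMPRESSION": ["snappy"]}) == {
    "CQL_VERSION": "3.0.0", "COMPRESSION": "snappy"}
assert choose_startup_options({}) == {"CQL_VERSION": "3.0.0"}
```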

        Overall, I did think about that compatibility issue quite a bit, and I believe the protocol is already reasonably well equipped to deal with it. In any case, it will always be possible to extend the versioning later if we really need to, but I do want to believe that we'll be able to have a fairly stable protocol (with optional features negotiated at startup), and I prefer not designing the first version of the protocol assuming we will fail at that.

        Rick Shaw added a comment -

        The client community typically only uses cassandra-thrift and cassandra-clientutil.jar as well as some required dependencies like guava.jar. It appears as though there is no addition to the cassandra-clientutils jar (yet) to allow for client only access to the new transport libraries. It also appears that there are some thrift exceptions in the new code methods. Is there a plan to run the new native CQL protocol without the thrift libraries?

        paul cannon added a comment -

        I have some input on this too, if it's worth anything. When implementing servers and clients with this sort of protocol, the biggest headaches I experience are usually around cross-version compatibility. This draft spec doesn't allow much in that respect.

        As an initial change, could we divorce the versioning for the framing protocol from the set of actual communication messages within the protocol? For example, the framing protocol (referring only to the encapsulation of direction, version, flags, opcode, length, and message body) currently has version 1, but maybe the rest of the protocol (the exact format and set of the supported messages) could have something fitting the Semantic Versioning scheme. This would let the client know whether the server-side protocol version(s) were compatible with the client's own supported version(s). Maybe we could even negotiate a protocol version between the server and client with the OPTIONS/STARTUP pair.

        If we don't do this, and we end up making 127 changes to any part of the protocol over time (definitely possible), then we have no backward-compatible way to state a higher version number.
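        The compatibility check this proposal would enable could look roughly like this (a sketch of the proposal only, not of anything in the committed spec, which uses a single version number):

```python
def compatible(client_version: str, server_version: str) -> bool:
    """Semantic-versioning style check: matching major versions are
    compatible; a minor bump only adds backward-compatible features."""
    client_major = int(client_version.split(".")[0])
    server_major = int(server_version.split(".")[0])
    return client_major == server_major

assert compatible("1.2", "1.5")        # minor difference: still compatible
assert not compatible("1.2", "2.0")    # major difference: incompatible
```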

        Sylvain Lebresne added a comment -

        I did.

        Norman Maurer added a comment -

        We released 3.5.2.Final today.. not sure if you used it

        Sylvain Lebresne added a comment -

        Alright, committed, thanks (I've updated the netty jar to the last version in the process).

        Sylvain Lebresne added a comment -

        Here, I mean automated protocol test using SimpleClient in unit test.

        Oh ok. I was confused because the CliTest itself won't work (and will never) with the native protocol since it's completely thrift related. But yes, having some unit test can never hurt. Another avenue for tests will be to update the python cql driver to use the native transport, and then we'll be able to reuse the dtests (which have quite a bunch of CQL tests).

        @Yuki I think everything is ok.. Let me do a final review cycle today again

        Ok. I will commit to trunk. We can always commit improvements later anyway (Norman: if you do spot some minor things, do feel free to comment on that ticket directly (or on github), even if it's closed, and I'll commit the changes).

        Norman Maurer added a comment -

        @Yuki I think everything is ok.. Let me do a final review cycle today again.

        Yuki Morishita added a comment -

        And not necessary at this time, but it is nice to have unit test like CliTest, based on debug client.

        Here, I mean automated protocol tests using SimpleClient in a unit test. CliTest runs a bunch of cli commands to test cli functionality, so I thought we might be able to do the same thing to test the native transport.

        If Norman has nothing more to say, then let's commit this to trunk.

        Sylvain Lebresne added a comment -

        All of protocol related classes are placed under org.apache.cassandra.cql3.transport, but I think it'd be better to keep them under org.apache.cassandra.transport.

        Makes sense.

        Native transport service starts by default along with thrift service, but isn't it better to turn off by default to indicate "for developing client only"?

        I agree. And in any case, most people won't want to run both servers at the same time anyway, so I've added flags for both thrift and the new native server in the config file to choose whether to start them. You can still override those with startup flags, but I believe the config setting will be more convenient most of the time. And the default is to not start the native transport while it's still beta.

        I've rebased the branch and added the two changes above in https://github.com/pcmanus/cassandra/commits/2478-3. I also made a small modification due to a remark made by Norman on github. Talking about that, @Norman, was that your only remark or are you just not done looking at this?

        And not necessary at this time, but it is nice to have unit test like CliTest, based on debug client.

        Not sure what you mean by that.

        Norman Maurer added a comment -

        @Jonathan sorry I was so busy that I was not able to review yet... I will do today and give feedback!

        Yuki Morishita added a comment -

        I'm +1 to put Sylvain's patch into 1.2. It works as described in spec, so client developers can start developing on this new protocol.
        But before commit, I suggest two changes to the patch:

        • All of protocol related classes are placed under org.apache.cassandra.cql3.transport, but I think it'd be better to keep them under org.apache.cassandra.transport.
        • Native transport service starts by default along with thrift service, but isn't it better to turn off by default to indicate "for developing client only"?

        And not necessary at this time, but it is nice to have unit test like CliTest, based on debug client.

        Jonathan Ellis added a comment -

        Norman, would you mind reviewing the updated branch?

        Sylvain Lebresne added a comment -

        I've pushed at https://github.com/pcmanus/cassandra/commits/2478-2 an updated version of this patch. I'm also attaching the v2 document describing this new version. This version adds to the previous one:

        • keyspace and table names to the metadata returned as a response to SELECT and prepared queries. The protocol supports having different ks/cf names for each column, but has a more compact form when all columns are from the same ks/cf (which is always the case for SELECT, but not for prepared queries).
        • it removes (mostly from the document) the few tidbits about the sketched auto-paging. It turns out this is more complicated than I thought and may require a bit more discussion, so I prefer leaving that to a follow-up ticket.
        • a few other cleanups

        In this form, this is a "complete" first version in that it is on par with the thrift API. So imo it's a good first step, and I'd like to try to get that in and then add new features/improvements in separate tickets (SSL, auto-paging, considering SASL, etc...). I think the sooner we get a working version in, the sooner we'll have people playing with it and get feedback.

        Rick Shaw added a comment -

        Much of the pressure in the corporate world I live in is from teams that are trying to use a JDBC driver for C* in the same fashion and with the same BI, ETL and statistics tools (Talend, Informatica, Datastage, SPSS, SAS) as they do for relational solutions. Many of these tools are quite comprehensive in the information they can display and use, so they are heavy users of the metadata features of the driver. Returning no data for some functions often causes the tooling to just give up rather than use what it has.

        Jeremy Hanna added a comment -

        Rick: do you have specific tools which would benefit from that metadata?

        Rick Shaw added a comment -

        My personal opinion is we will never get to a point where we need multiple KSs. But multiple CFs (Tables), yes. The point is that in the current state of the returned metadata we do not know which KS or CF, even if there is only one, and I contend we have client calls that want that info. "Bikeshedding" on my part will now end.

        Sylvain Lebresne added a comment -

        The ResultSetMetaData interface provides methods for getSchemaName(column) and getTableName(column) on a column-by-column basis

        Which begs the question: do we want to also allow per-column keyspace/table names? In C*'s current state this is not needed, since one can only query one table at a time. But wiring that restriction into the protocol could be limiting in the future. On the other hand, it would be simpler/more compact to allow only one keyspace/table name, and adding queries on multiple tables, if we ever do it, won't be a small addition, so maybe we're fine with having it trigger a bump in the protocol version when that happens.

        I suppose we could support both versions through a simple flag that says whether there is just one keyspace/table pair or one per column, but that complicates the protocol for something that may well never be useful. Opinions?
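        Such a flag-based compromise could look like this (the flag name and layout below are hypothetical; only the idea — one shared ks/cf pair versus one pair per column — comes from the discussion):

```python
GLOBAL_TABLES_SPEC = 0x01  # hypothetical flag: all columns share one ks/cf pair

def encode_metadata(columns):
    """columns: list of (keyspace, table, column_name) triples. Use the
    compact form when every column comes from the same keyspace/table
    (always true for SELECT today), else repeat the pair per column."""
    shared = len({(ks, cf) for ks, cf, _ in columns}) == 1
    if shared:
        ks, cf, _ = columns[0]
        return [GLOBAL_TABLES_SPEC, (ks, cf)] + [name for _, _, name in columns]
    return [0] + list(columns)

meta = encode_metadata([("ks1", "t1", "a"), ("ks1", "t1", "b")])
assert meta == [GLOBAL_TABLES_SPEC, ("ks1", "t1"), "a", "b"]
```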

        use this as an opportunity to get rid of our custom authentication/authorization, and add hooks for SASL instead

        I'm not against that in theory. But I'll admit to not knowing all the nuts and bolts of SASL. From an initial read, it seems the protocol part is fairly simple; it's just a couple of simple messages carrying strings to support. However, what's less clear to me is how to wire that in on the Cassandra side, and in particular how to ensure some form of compatibility with our current IAuthenticator interface.

        Rick Shaw added a comment - - edited

        re: KS/CF metadata. As you say, JDBC/SQL can have resultset information from many tables, which is why you can acquire that metadata information at the column level. The ResultSetMetaData interface provides methods for getSchemaName(column) and getTableName(column) on a column-by-column basis. The point is tools use these interfaces and methods heavily to deliver their functionality and we will improve adoption of the Server and the Driver if we can make the interface process for these tools deliver as much information as practical.

        Eric Evans added a comment -

        I haven't had a chance to do much more than skim through the code, but it looks great so far; Nice work

        One suggestion I'd like to throw out there though, is that we use this as an opportunity to get rid of our custom authentication/authorization, and add hooks for SASL instead. This isn't a wheel we should have reinvented ourselves, and SASL would provide a simple means of integrating with a lot of different systems.

        Jonathan Ellis added a comment - - edited

        How does this matter? As far as JDBC/SQL is concerned, a resultset can be from many tables or even none. Granted, CQL is more limited, but just because we can expose something doesn't mean we should. ISTM if we report what the column names and types are, that should be all the driver needs to care about.

        Rick Shaw added a comment -

        Please consider returning the KS (schema) and CF (table) in the metadata returned with "rows" and a "prepare". This will facilitate clients reporting this information in metadata commands from the client side for tools and the like. This information cannot really be known without the fragile method of scanning the CQL, because the context is always passed within the script itself. And with the addition of "KS.CF" notation in CQL for the selected CF, you can't really use the state of a USE to tell what the applicable KS is within the currently returned information.

        Shahryar Sedghi added a comment -

        Please look at Websocket for the wire protocol, it is way faster than HTTP and also HTTP friendly

        Sylvain Lebresne added a comment -

        I did look at that as main inspiration
        But I'll have a final go over to make sure I didn't forget something.

        Jonathan Ellis added a comment -

        Is it worth taking a look at something like the libpq protocol to see if we're missing anything obvious? http://www.postgresql.org/docs/9.2/static/protocol.html

        Sylvain Lebresne added a comment -

        Support for secure connection(SSL support)

        Yes, I completely forgot about that. I meant to mention it. I indeed think we can do that in a follow-up ticket, though that won't be too hard (we don't plan on supporting StartTLS, right?). Anyway, to some extent it's not really part of the protocol; it's more that we can embed the protocol inside SSL. But clearly, that's planned.

        READY messages body - We may want to reserve some space for server response

        I've tried not to add stuff where I did not know whether it would be useful, otherwise you never stop. I mean, things like that can easily be added later when they become useful (and we decide when version 1 of the protocol is final, so we don't have to figure everything out as soon as this gets committed).

        Types in metadata - proposal states it is sent in <string> format, but is it better to use flags? When custom type is used, we can append FQCN of that type after the flag

        Why not. I'm not sure we'll have a huge saving, and I wonder if tying the supported native types to the protocol is a good idea. But as long as we have a way to have custom types, we're good, so why not. I'll update.
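        Encoding types as a numeric id, with the class name appended only for custom types, might look like this (the id values below are made up for illustration):

```python
import struct

# Hypothetical id assignments; CUSTOM marks a type whose FQCN follows.
CUSTOM, BIGINT, VARCHAR = 0x0000, 0x0002, 0x000D

def encode_type(type_id: int, fqcn: str = "") -> bytes:
    """Emit a [short] type id; for custom types, append the fully
    qualified class name as a [short string]."""
    out = struct.pack(">H", type_id)
    if type_id == CUSTOM:
        name = fqcn.encode("utf-8")
        out += struct.pack(">H", len(name)) + name
    return out

assert encode_type(VARCHAR) == b"\x00\x0d"
custom = encode_type(CUSTOM, "com.example.MyType")
assert custom[:2] == b"\x00\x00" and custom[4:] == b"com.example.MyType"
```

Native types cost two bytes instead of a full string, while custom types stay fully expressible.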

        Yuki Morishita added a comment -

        Couple of comments on protocol proposal:

        • Support for secure connections (SSL support) - we can implement this later, but I think it is better to mention it in the protocol. Do we need to connect to a secure port from the beginning, or do we upgrade to SSL after the handshake when the server supports secure connections?
        • READY messages body - We may want to reserve some space for server response (possibly, server capabilities), using [option list]
        • Types in metadata - proposal states it is sent in <string> format, but is it better to use flags? When custom type is used, we can append FQCN of that type after the flag.
        Sylvain Lebresne added a comment -

        Thanks Norman, it does. I've pushed a new patch to the branch with modifications from your comments and one of Yuki's.

        Norman Maurer added a comment -

        Sylvain Lebresne, I added some comments to your code. Hope it helps.

        Sylvain Lebresne added a comment -

        Attaching cql_binary_protocol as a draft for such a custom binary protocol (CQL3 only). The protocol follows more or less Rick's outline above. It is frame based; each frame has a small header indicating, among other things, the opcode for the message, which defines what the body of said message must contain.

        A typical communication starts with a small handshake/authentication phase, after which queries (and preparation/execution) can be performed. The protocol supports a form of 'cursor' API for queries: given a select query, the client can ask the server to return only a handful of rows first, and then fetch more rows at its own rate using a NEXT message.

        Outside of the cursor thing, the protocol as described here exposes pretty much the same things as the thrift transport (as far as CQL is concerned) but not much more (a small exception is that CASSANDRA-3707 is included). I plan on experimenting next with a few additional features, like allowing clients to register for events like 'a new node joined' and be notified when such events happen, but I'll leave that to follow-up tickets.
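The cursor exchange could look like the following client-side sketch; `Transport`, its method names, and the in-memory `FakeTransport` are hypothetical stand-ins for illustration, not part of the actual implementation:

```java
import java.util.ArrayList;
import java.util.List;

// Client-side sketch of the cursor exchange: a QUERY asks for at most
// pageSize rows, then NEXT messages fetch subsequent pages at the client's
// own rate. Transport and FakeTransport are hypothetical stand-ins.
interface Transport
{
    List<String> query(String cql, int pageSize); // sends QUERY, returns first page
    List<String> next();                          // sends NEXT, returns next page (empty when done)
}

final class CursorClient
{
    // Drain every page of a query; no round trip carries more than pageSize rows.
    static List<String> fetchAll(Transport transport, String cql, int pageSize)
    {
        List<String> rows = new ArrayList<>(transport.query(cql, pageSize));
        for (List<String> page = transport.next(); !page.isEmpty(); page = transport.next())
            rows.addAll(page);
        return rows;
    }
}

// In-memory stand-in for the server side, used only to exercise the loop.
final class FakeTransport implements Transport
{
    private final List<String> all;
    private int pos;
    private int pageSize;

    FakeTransport(List<String> all) { this.all = all; }

    public List<String> query(String cql, int pageSize)
    {
        this.pageSize = pageSize;
        this.pos = 0;
        return next();
    }

    public List<String> next()
    {
        int end = Math.min(pos + pageSize, all.size());
        List<String> page = new ArrayList<>(all.subList(pos, end));
        pos = end;
        return page;
    }
}
```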

        I've pushed at https://github.com/pcmanus/cassandra/commits/2478 an initial implementation of this protocol (using netty). Almost all of the protocol is implemented except for the NEXT message, but I'll get to that. There are currently four main patches:

        • the first one is the bulk of the new server
        • the second one is a simple client, along with a small "debug" console (whose code is ugly), that allows sending messages to test the server. This does not necessarily have to make it into the final commit, but it is useful for testing.
        • the third one replaces the use of CqlResult in the CQL3 code with directly building the new messages; for the thrift interface, it simply translates those messages to CqlResult. I've done it that way (instead of generating CqlResult and converting that to messages of the native protocol) because I think that is the direction we want to go. However, this currently means that some information no longer makes it into the CqlResult, namely the timestamp of the columns. Anyway, imo CASSANDRA-4217 is a much better way to access the timestamp and I'm not sure existing clients were exposing it, but if there are complaints, that can be fixed.
        • the last one changes our CassandraDaemon business so that we can run a server with both the thrift and native protocol servers running cleanly.

        Other than that, I haven't really benchmarked this (but that should be done). I meant to update stress to use the new server but realized that stress doesn't work with CQL3 at all, so that will probably be a separate ticket.

        Ahmet AKYOL added a comment -

        +1 for netty

        Jonathan Ellis added a comment -

        Related? HBASE-5355

        Norman Maurer added a comment -

        If you go with netty I may be able to help with writing the code for it (I'm one of the netty committers/devs). Let me know if you are interested.

        Eric Evans added a comment -

        I don't see any other right off the bat, but between those two I would have a slight preference for a custom protocol. My (to be honest not so extensive) experience with HTTP is that it can be slowish and a tad annoying to work with when you use it for something it wasn't designed for (typically streaming is not a given). But a custom protocol will clearly be more work for us. I just have a feeling that it may be worth it in the end.

        And whether it is HTTP or custom, I've had good experience with Netty in the past too.

        +1

        Sylvain Lebresne added a comment -

        Are there any serious contenders besides HTTP and a simple custom protocol similar to the one outlined by Rick?

        I don't see any other right off the bat, but between those two I would have a slight preference for a custom protocol. My (to be honest not so extensive) experience with HTTP is that it can be slowish and a tad annoying to work with when you use it for something it wasn't designed for (typically streaming is not a given). But a custom protocol will clearly be more work for us. I just have a feeling that it may be worth it in the end.

        And whether it is HTTP or custom, I've had good experience with Netty in the past too.

        Jonathan Ellis added a comment -

        I believe the focus is on getting CQL up to 100% parity functionality-wise with Thrift, and then there will be time to work on this.

        The end is in sight (CASSANDRA-2474 is almost done, with CASSANDRA-2477 to follow soon), so I agree that it's time to start thinking harder about this.

        Are there any serious contenders besides HTTP and a simple custom protocol similar to the one outlined by Rick? Hessian looks a lot more like Thrift than like something we'd use here.

        Piotr Kołaczkowski added a comment - - edited

        Have you considered Hessian? Or taking some parts of it, since it is open source and quite simple. We used it as the main platform for web services at our company (mostly doing MMORPG games). It is very lightweight, supports streaming, supports many programming languages, and can be run on Jetty or Netty (we used Jetty). However, it doesn't support versioning, so it would have to be improved.

        http://hessian.caucho.com/

        Rick Branson added a comment -

        I believe the focus is on getting CQL up to 100% parity functionality-wise with Thrift, and then there will be time to work on this. Once CQL becomes the default, it'll be transparent to switch out the 'innards' with a custom protocol. An extremely optimistic figure puts this 6 months out.

        Christoph Hack added a comment -

        I would really appreciate such a custom protocol, since I would like to write a Go client.

        Go seems to be a bad fit for Thrift since the language and API are improving steadily (which isn't a problem for Go programs, since a tool called "gofix" can rewrite existing code), but updating the Thrift code generator all the time is not practical. Also, I do not want to contribute to Thrift at the moment; writing a nice API in the first place (without huge amounts of duplicated generated code) seems like a better solution to me than writing a generator that produces non-idiomatic code (but that's just my personal preference).

        I know this feature was only proposed recently, but is it likely to be implemented in the near future?

        Rick Branson added a comment -

        So I'll start throwing ideas out there.

        I think we should use a simple framed binary transport with a 4-byte frame size prefix followed by the frame. Inside the frame is the message: a 2-byte integer indicating message type followed by the message contents, which varies based on the type. The client sends requests, and the server responds to them.

        The Cassandra protocol needs to do the following things, which I would imagine would equate to a pair of message types (request and response):

        • Handshake
        • Authentication
        • Health Checks (ping/pong)
        • CQL Queries (... and a response that encodes rows, columns, and supercolumns)
        • CQL Statement Preparation & Execution

        Anything I'm missing? It would be nice to make some more complex or problematic functionality, such as compression and streaming, optional for the client.
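The framing described above could be sketched as follows; the opcode value and helper names are made up for illustration, and a real server would do this inside a Netty frame decoder rather than by hand:

```java
import java.nio.ByteBuffer;

// Sketch of the proposed framing: a 4-byte frame size prefix, then the frame
// itself, which starts with a 2-byte message type followed by the contents.
final class Frame
{
    static final short OPCODE_QUERY = 4; // illustrative value only

    // Wrap a message body in a length-prefixed frame.
    static byte[] encode(short messageType, byte[] body)
    {
        ByteBuffer buf = ByteBuffer.allocate(4 + 2 + body.length);
        buf.putInt(2 + body.length); // frame size = message type + contents
        buf.putShort(messageType);
        buf.put(body);
        return buf.array();
    }

    // Returns the message body, or null if the buffer doesn't yet hold a
    // full frame; a streaming decoder would then wait for more bytes.
    static byte[] decodeBody(ByteBuffer buf)
    {
        if (buf.remaining() < 4)
            return null;
        buf.mark();
        int size = buf.getInt();
        if (buf.remaining() < size)
        {
            buf.reset(); // incomplete frame: rewind and retry later
            return null;
        }
        buf.getShort(); // message type, ignored in this sketch
        byte[] body = new byte[size - 2];
        buf.get(body);
        return body;
    }
}
```

The size prefix is what lets the server accumulate bytes until a whole message is available before dispatching on the type.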

        Ryan King added a comment -

        Finagle is a library for building protocols that happens to come with a few built-in implementations (HTTP, memcached, Thrift, etc.). It solves a lot of problems that you'd otherwise have to rebuild on top of Netty.

        Rick Branson added a comment -

        Looks like Finagle supports HTTP, Thrift, Memcached, and some protocol called "More to come!"

        Jonathan Ellis added a comment -

        the idea is to reduce dependencies and improve support for non-JVM clients

        But it's still just HTTP right?

        Has anyone proposed anything regarding the wire protocol?

        Not yet.

        Rick Branson added a comment -

        Finagle is nice, but I think the idea is to reduce dependencies and improve support for non-JVM clients.

        Has anyone proposed anything regarding the wire protocol yet or is it still wide open?

        Ryan King added a comment -

        If you want to use netty, I'd suggest considering using finagle on top of it: http://github.com/twitter/finagle. It's written in Scala, but it's very easy to use from Java.


          People

          • Assignee:
            Sylvain Lebresne
            Reporter:
            Eric Evans
            Reviewer:
            Yuki Morishita
          • Votes:
            3
            Watchers:
            26
