Thrift
  1. Thrift
  2. THRIFT-1920 Binary Type
  3. THRIFT-830

Switch binary field implementation from byte[] to ByteBuffer

    Details

    • Type: Sub-task Sub-task
    • Status: Closed
    • Priority: Blocker Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.4
    • Labels:
      None

      Description

      Instead of using byte[] as the implementation for binary fields, let's use ByteBuffer.

      There's nothing that you can do with byte[] that you can't also do with ByteBuffer, and there are more things you can do with ByteBuffer. It opens the way for us to avoid needless buffer copies on serialization and deserialization. It gives us a generally accepted equals() and compareTo() implementation, so we don't have to have custom cases for that anymore.

      Making this change will probably cause more than a little bit of trauma, changing the method signatures in both TProtocol and generated code. It's possible that I could be persuaded to support a command line switch for producing old-style byte[] methods in some contexts, but I'd love not to waste time supporting suboptimal features.

      1. thrift-830.patch
        36 kB
        Bryan Duxbury

        Issue Links

          Activity

          Hide
          Bryan Duxbury added a comment -

          I would encourage you to look at the generated code. For binary fields, we generate:

          • getFieldName: byte[]
          • bufferForFieldName: ByteBuffer
          • setFieldName(byte[])
          • setFieldName(ByteBuffer)

          We explicitly expose both interfaces in a fashion that is backwards compatible and optional.

          Yes, the benefits of using a ByteBuffer implementation are significant, even if you're talking about RPC over the internet (which is likely a minority of Thrift use cases). You have to realize that this reduces server-side CPU and memory usage at no cost to any other resource. Even if you ignore the buffer-accessing methods, if you access fewer than 100% of the binary fields in a struct, you'll get better performance.

          If your data isn't stored in an in-memory byte array, then why are you using a binary field? If you don't use any binary fields, your generated structs won't contain ByteBuffers or byte[] at all.

          Also, what if someone wants to send 2, non consecutive, parts of the data in the same call? ByteBuffer doesn't solve this.

          ByteBuffer actually does solve this by letting you take slices from the same backing array. On the contrary, using byte[]s to solve this situation would be more difficult and less efficient.

          You didn't find the cons to this feature because there really aren't any. The complexity of interacting with ByteBuffer is an implementation cost and nothing more.

          Show
          Bryan Duxbury added a comment - I would encourage you to look at the generated code. For binary fields, we generate: getFieldName: byte[] bufferForFieldName: ByteBuffer setFieldName(byte[]) setFieldName(ByteBuffer) We explicitly expose both interfaces in a fashion that is backwards compatible and optional. Yes , the benefits of using a ByteBuffer implementation are significant, even if you're talking about RPC over the internet (which is likely a minority of Thrift use cases). You have to realize that this reduces server-side CPU and memory usage at no cost to any other resource. Even if you ignore the buffer-accessing methods, if you access fewer than 100% of the binary fields in a struct, you'll get better performance. If your data isn't stored in an in-memory byte array, then why are you using a binary field? If you don't use any binary fields, your generated structs won't contain ByteBuffers or byte[] at all. Also, what if someone wants to send 2, non consecutive, parts of the data in the same call? ByteBuffer doesn't solve this. ByteBuffer actually does solve this by letting you take slices from the same backing array. On the contrary, using byte[]s to solve this situation would be more difficult and less efficient. You didn't find the cons to this feature because there really aren't any. The complexity of interacting with ByteBuffer is an implementation cost and nothing more.
          Hide
          Jens Geyer added a comment -

          I fail to see how byte[] is going to solve the issues ByteBuffer can't solve.

          Show
          Jens Geyer added a comment - I fail to see how byte[] is going to solve the issues ByteBuffer can't solve.
          Hide
          Alik Elzin added a comment -

          Bryan Duxbury ByteBuffer is not internal. If I decalare a method "void uploadChunk(1:binary bytes)_", the generated result is "public void uploadChunk(ByteBuffer bytes) ...".
          It means that ByteBuffer is part of the external interface for both client and server.

          I **totally agree** that if you have a large dataset as byte array and want to send only parts of the data each time, a ByteBuffer is more efficient - instead of copying parts of the data, you can send a ByteBuffer that points to the correct place in the byte array with a specific length.

          But, the how efficient is it really compared to transferring stuff over the internet? How do you compare them against the cons and pros?
          Also, things might not always be stored in an in-memory bytearray. For example, the data is stored in a DB. In such case, there's no need for a ByteBuffer at all.
          Also, what if someone wants to send 2, non consecutive, parts of the data in the same call? ByteBuffer doesn't solve this.

          While reading through this user story, I couldn't find the cons to this feature - just the pros. I'm quite sure you discussed the cons but haven't seen them in writing so I decided to add them.

          ByteBuffer to the rescue...

          Show
          Alik Elzin added a comment - Bryan Duxbury ByteBuffer is not internal. If I decalare a method "void uploadChunk(1:binary bytes)_", the generated result is "public void uploadChunk(ByteBuffer bytes) ...". It means that ByteBuffer is part of the external interface for both client and server. I ** totally agree ** that if you have a large dataset as byte array and want to send only parts of the data each time, a ByteBuffer is more efficient - instead of copying parts of the data, you can send a ByteBuffer that points to the correct place in the byte array with a specific length. But, the how efficient is it really compared to transferring stuff over the internet? How do you compare them against the cons and pros? Also, things might not always be stored in an in-memory bytearray. For example, the data is stored in a DB. In such case, there's no need for a ByteBuffer at all. Also, what if someone wants to send 2, non consecutive, parts of the data in the same call? ByteBuffer doesn't solve this. While reading through this user story, I couldn't find the cons to this feature - just the pros. I'm quite sure you discussed the cons but haven't seen them in writing so I decided to add them. ByteBuffer to the rescue...
          Hide
          Bryan Duxbury added a comment -

          I'm confused by your objections.

          The use of the ByteBuffer is internal and hidden by default through the normal getter, though it is getable if you are daring and want to work with the ByteBuffer directly. Users who prefer to use byte[] can do so transparently (and probably already are).

          Internally, it is measurably more efficient to use ByteBuffer than byte[] for storing and handling binary data. Not "more convenient" - more efficient. Array copies are minimized by using this technique. "Simplifying" this implementation would be a net performance loss, and I'm not seeing the benefit.

          Show
          Bryan Duxbury added a comment - I'm confused by your objections. The use of the ByteBuffer is internal and hidden by default through the normal getter, though it is getable if you are daring and want to work with the ByteBuffer directly. Users who prefer to use byte[] can do so transparently (and probably already are). Internally, it is measurably more efficient to use ByteBuffer than byte[] for storing and handling binary data. Not "more convenient" - more efficient. Array copies are minimized by using this technique. "Simplifying" this implementation would be a net performance loss, and I'm not seeing the benefit.
          Hide
          Jens Geyer added a comment -

          Hi Alik Elzin,

          ByteBuffer ... gives the feeling of streaming

          That's an interesting point of view.

          ByteBuffer provides a state and meant to buffer data, but that's not the intention in Thrift. The intention in Thrift is different: Just passing data from over the network

          Not quite. You have to have some kind of a buffer to hold the data that you intend to pass around. Whether or not that buffer is a ByteBuffer or a byte[] or a MemoryStream does not really matter. You can't pass around data without storing them somewhere.

          Simplicity - byte[] is easy to grasp, exists in lots of languages and can always be wrapped with ByeBuffer. ByteBuffer on the other hand is more complex to grasp and I think exists in fewer languages.

          I agree that internally, for Thrift generated code and library, it's easier to work with ByteBuffer. But that's not a concern for the users of Thrift. I think it's much more convenient for me to work with byte[].

          Well, one of the primary goals of Thrift is efficiency. So of course it is important for all parties to get as much as possible efficiency out here. I would disagree with ByteBuffer being too hard to grasp, and I'm not really a Java programmer. Similar concepts exist in a number of languages.

          Consistency - In C# (csharp), binary is generated to byte[]. Using ByteBuffer in Java diminishes consistency.

          Ok, that's easy. We should adjust C# to conform to the standards Joke aside: From my understanding, all languages should map the data types into something that is sufficient, efficient, practical and idiomatic to that particular language.

          This imples two things:

          • a given language A may or may not use similar constructs as another language B does
          • there may be multiple equally suitable ways to represent a particular Thrift IDL data type in code, but we have to choose one of them.

          As being more on the a practicioner side, I tend to agree with what Bryan wrote above and if asked I would choose ByteBuffer over byte[] in a heartbeat, because it offers more value to the developer compared to the cost.

          In the context of this particular ticket, breaking the code again at the same places, now in the opposite direction, will for sure be widely recognized as a very "clever" tactical move. The next two questions people will ask are (a) why we did it and (b) whether it will be changed again in the next release? That's not to say that I am strictly against it, but every change we make should add value to the project, and given the arguments so far I don't think this will really be the case here.

          If at some day we get something like the feature mentioned in THRIFT-1854, or what we discussed just recently in some other ticket (I forgot the number), something where the exact type mapping could be changed in an easy, declarative or configurational way, offering a completely flexible and open type mapping system, the whole issue will become a no-brainer. Having such a feature would make certain things easier and more consistent that are solved today either by annotations, by compiler switch options, or not at all. At least in my humble opinion, this is the worthwhile goal to strive for.

          Show
          Jens Geyer added a comment - Hi Alik Elzin , ByteBuffer ... gives the feeling of streaming That's an interesting point of view. ByteBuffer provides a state and meant to buffer data, but that's not the intention in Thrift. The intention in Thrift is different: Just passing data from over the network Not quite. You have to have some kind of a buffer to hold the data that you intend to pass around. Whether or not that buffer is a ByteBuffer or a byte[] or a MemoryStream does not really matter. You can't pass around data without storing them somewhere. Simplicity - byte[] is easy to grasp, exists in lots of languages and can always be wrapped with ByeBuffer. ByteBuffer on the other hand is more complex to grasp and I think exists in fewer languages. I agree that internally, for Thrift generated code and library, it's easier to work with ByteBuffer. But that's not a concern for the users of Thrift. I think it's much more convenient for me to work with byte[]. Well, one of the primary goals of Thrift is efficiency. So of course it is important for all parties to get as much as possible efficiency out here. I would disagree with ByteBuffer being too hard to grasp, and I'm not really a Java programmer. Similar concepts exist in a number of languages. Consistency - In C# (csharp), binary is generated to byte[]. Using ByteBuffer in Java diminishes consistency. Ok, that's easy. We should adjust C# to conform to the standards Joke aside: From my understanding, all languages should map the data types into something that is sufficient, efficient, practical and idiomatic to that particular language. This imples two things: a given language A may or may not use similar constructs as another language B does there may be multiple equally suitable ways to represent a particular Thrift IDL data type in code, but we have to choose one of them . As being more on the a practicioner side, I tend to agree with what Bryan wrote above and if asked I would choose ByteBuffer over byte[] in a heartbeat, because it offers more value to the developer compared to the cost. In the context of this particular ticket, breaking the code again at the same places, now in the opposite direction, will for sure be widely recognized as a very "clever" tactical move. The next two questions people will ask are (a) why we did it and (b) whether it will be changed again in the next release? That's not to say that I am strictly against it, but every change we make should add value to the project, and given the arguments so far I don't think this will really be the case here. If at some day we get something like the feature mentioned in THRIFT-1854 , or what we discussed just recently in some other ticket (I forgot the number), something where the exact type mapping could be changed in an easy, declarative or configurational way, offering a completely flexible and open type mapping system, the whole issue will become a no-brainer. Having such a feature would make certain things easier and more consistent that are solved today either by annotations, by compiler switch options, or not at all. At least in my humble opinion, this is the worthwhile goal to strive for.
          Hide
          Alik Elzin added a comment -

          And there's a 6th point as well:
          6. Consistency - In C# (csharp), binary is generated to byte[]. Using ByteBuffer in Java diminishes consistency.

          Show
          Alik Elzin added a comment - And there's a 6th point as well: 6. Consistency - In C# (csharp), binary is generated to byte[]. Using ByteBuffer in Java diminishes consistency.
          Hide
          Alik Elzin added a comment -

          Forgot a 5th point:
          5. Statefull vs stateless - A state (offset) is passed using a ByteBuffer - not the data itself. Sounds like hidden business logic. Obligated to see it even though I don't use it.

          Show
          Alik Elzin added a comment - Forgot a 5th point: 5. Statefull vs stateless - A state (offset) is passed using a ByteBuffer - not the data itself. Sounds like hidden business logic. Obligated to see it even though I don't use it.
          Hide
          Alik Elzin added a comment -

          Waking up the dead...
          I'm new to Thrift and the ByteBuffer is confusing:
          1. Intuition - It gives the feeling of streaming, as if the server/client can continue update the the buffer and the other side will see the change.
          2. Simplicity - byte[] is easy to grasp, exists in lots of languages and can always be wrapped with ByeBuffer. ByteBuffer on the other hand is more complex to grasp and I think exists in fewer languages.
          3. Intention - ByteBuffer provides a state and meant to buffer data, but that's not the intention in Thrift. The intention in Thrift is different: Just passing data from over the network.
          4. Convenience - I agree that internally, for Thrift generated code and library, it's easier to work with ByteBuffer. But that's not a concern for the users of Thrift. I think it's much more convenient for me to work with byte[].

          This feels like the good old equivalent discussion to RISC vs. CISC. No right answer. I would just love things to be simple.

          I didn't see the points I raised in the conversation. Where was I 4 years ago? Sorry.

          Show
          Alik Elzin added a comment - Waking up the dead... I'm new to Thrift and the ByteBuffer is confusing: 1. Intuition - It gives the feeling of streaming, as if the server/client can continue update the the buffer and the other side will see the change. 2. Simplicity - byte[] is easy to grasp, exists in lots of languages and can always be wrapped with ByeBuffer. ByteBuffer on the other hand is more complex to grasp and I think exists in fewer languages. 3. Intention - ByteBuffer provides a state and meant to buffer data, but that's not the intention in Thrift. The intention in Thrift is different: Just passing data from over the network. 4. Convenience - I agree that internally, for Thrift generated code and library, it's easier to work with ByteBuffer. But that's not a concern for the users of Thrift. I think it's much more convenient for me to work with byte[]. This feels like the good old equivalent discussion to RISC vs. CISC. No right answer. I would just love things to be simple. I didn't see the points I raised in the conversation. Where was I 4 years ago? Sorry.
          Hide
          Bryan Duxbury added a comment -

          I just committed this.

          Show
          Bryan Duxbury added a comment - I just committed this.
          Hide
          Mathias Herberts added a comment -

          The way my GWT extension is done is by generating special stripped down POJOs intended to be used on the GWT client side and extra methods in regular Java generated classes to convert to/from those stripped down POJOs.

          Additionaly all the GWT service interfaces (both async and sync) are also generated thus speeding up development.

          Thus my GWT extension is intended more as a bridge than a standalone version of generated code.

          I'll make it evolve so the GWT POJOs still use byte[] and the conversion layer correctly converts to/from ByteBuffer to byte[].

          I retract my -1, if it improves the 'vanilla' java code, I'll adapt.

          My -1 was just me being lazy...

          Show
          Mathias Herberts added a comment - The way my GWT extension is done is by generating special stripped down POJOs intended to be used on the GWT client side and extra methods in regular Java generated classes to convert to/from those stripped down POJOs. Additionaly all the GWT service interfaces (both async and sync) are also generated thus speeding up development. Thus my GWT extension is intended more as a bridge than a standalone version of generated code. I'll make it evolve so the GWT POJOs still use byte[] and the conversion layer correctly converts to/from ByteBuffer to byte[]. I retract my -1, if it improves the 'vanilla' java code, I'll adapt. My -1 was just me being lazy...
          Hide
          Bryan Duxbury added a comment -

          I recognize that it will break a lot of code. Luckily, that breakage is pretty trivial, though. Also, we're making this change at a version boundary, so we can announce the change cautiously.

          As far as the GWT stuff goes, would it make sense to spin that off into a separate generator that inherits from the standard Java generator, and therefore can make its own adjustments?

          Show
          Bryan Duxbury added a comment - I recognize that it will break a lot of code. Luckily, that breakage is pretty trivial, though. Also, we're making this change at a version boundary, so we can announce the change cautiously. As far as the GWT stuff goes, would it make sense to spin that off into a separate generator that inherits from the standard Java generator, and therefore can make its own adjustments?
          Hide
          Mathias Herberts added a comment -

          For whatever it's worth, I don't feel comfortable with this issue, replacing byte[] with ByteBuffer will break tons of code, thus making andry quite a few people who are already coping with thrift betaness. Not good for adoption.

          And on a more personal point of view it will also break the GWT compatibility patch I've crafted for Thrift as GWT does not have ByteBuffer in its JRE emulation layer.

          Therefore -1.

          Show
          Mathias Herberts added a comment - For whatever it's worth, I don't feel comfortable with this issue, replacing byte[] with ByteBuffer will break tons of code, thus making andry quite a few people who are already coping with thrift betaness. Not good for adoption. And on a more personal point of view it will also break the GWT compatibility patch I've crafted for Thrift as GWT does not have ByteBuffer in its JRE emulation layer. Therefore -1.
          Hide
          David Reiss added a comment -

          LGTM. This will be a big improvement.

          Show
          David Reiss added a comment - LGTM. This will be a big improvement.
          Hide
          Nate McCall added a comment -

          I've only implemented this partially (simple inserts and gets), but it "worked" for that. I'll be back on this next week and will let you know immediately if anything comes up.

          If you want to commit it, this would help us somewhat coordinating with THRIFT-831 (could serve as a baseline for those patches until that is in), but thats totally up to you.

          Show
          Nate McCall added a comment - I've only implemented this partially (simple inserts and gets), but it "worked" for that. I'll be back on this next week and will let you know immediately if anything comes up. If you want to commit it, this would help us somewhat coordinating with THRIFT-831 (could serve as a baseline for those patches until that is in), but thats totally up to you.
          Hide
          Bryan Duxbury added a comment -

          Nate - has your testing progressed far enough that you think this should be committed?

          Show
          Bryan Duxbury added a comment - Nate - has your testing progressed far enough that you think this should be committed?
          Hide
          Nate McCall added a comment -

          Wow, I missed that part completely. Sorry guys. I'll let you know how it turns out after I account for the above.

          Show
          Nate McCall added a comment - Wow, I missed that part completely. Sorry guys. I'll let you know how it turns out after I account for the above.
          Hide
          Bryan Duxbury added a comment -

          I think Johnathan beat me to the punch, but generally, you can use array() to get the underlying array, and then the contents between .position() and .limit(). The entire message will be in the underlying array, but you're just supposed to not look at it

          Show
          Bryan Duxbury added a comment - I think Johnathan beat me to the punch, but generally, you can use array() to get the underlying array, and then the contents between .position() and .limit(). The entire message will be in the underlying array, but you're just supposed to not look at it
          Hide
          Jonathan Ellis added a comment -

          right, the idea is that we'll use the ByteBuffer offset and limit to avoid having to copy out each binary field into a new byte[]

          Show
          Jonathan Ellis added a comment - right, the idea is that we'll use the ByteBuffer offset and limit to avoid having to copy out each binary field into a new byte[]
          Hide
          Nate McCall added a comment - - edited

          After some initial experimentation patching these changes into Cassandra I have found some strange behavior.

          It appears the message is not being broken up correctly into component parts. When I examine the message to retrieve the key, it appears the whole message body is intact within the buffer. For example when I expect to receive "k1", the buffer contains the entire contents of a Cassandra insert:

          [0, 0, 0, 6, 105, 110, 115, 101, 114, 116, 1, 0, 0, 0, 4, 11, 0, 1, 0, 0, 0, 2, 107, 48, 12, 0, 2, 11, 0, 3, 0, 0, 0, 8, 73, 110, 100, 101, 120, 101, 100, 49, 0, 12, 0, 3, 11, 0, 1, 0, 0, 0, 9, 98, 105, 114, 116, 104, 100, 97, 116, 101, 11, 0, 2, 0, 0, 0, 8, 49, 57, 55, 53, 49, 50, 51, 48, 12, 0, 3, 10, 0, 1, 0, 4, -116, -127, 83, 19, -95, -32, 0, 0, 8, 0, 4, 0, 0, 0, 1, 0]

          Where previously, it was simply [107,49]

          I will try to distill this into a reproducable test case of some sort, but hopefully the above can at least provide some information.

          Edit: this is with TBinaryProtocol and TFramedTransport

          Show
          Nate McCall added a comment - - edited After some initial experimentation patching these changes into Cassandra I have found some strange behavior. It appears the message is not being broken up correctly into component parts. When I examine the message to retrieve the key, it appears the whole message body is intact within the buffer. For example when I expect to receive "k1", the buffer contains the entire contents of a Cassandra insert: [0, 0, 0, 6, 105, 110, 115, 101, 114, 116, 1, 0, 0, 0, 4, 11, 0, 1, 0, 0, 0, 2, 107, 48, 12, 0, 2, 11, 0, 3, 0, 0, 0, 8, 73, 110, 100, 101, 120, 101, 100, 49, 0, 12, 0, 3, 11, 0, 1, 0, 0, 0, 9, 98, 105, 114, 116, 104, 100, 97, 116, 101, 11, 0, 2, 0, 0, 0, 8, 49, 57, 55, 53, 49, 50, 51, 48, 12, 0, 3, 10, 0, 1, 0, 4, -116, -127, 83, 19, -95, -32, 0, 0, 8, 0, 4, 0, 0, 0, 1, 0] Where previously, it was simply [107,49] I will try to distill this into a reproducable test case of some sort, but hopefully the above can at least provide some information. Edit: this is with TBinaryProtocol and TFramedTransport
          Hide
          Jonathan Ellis added a comment -

          Cassandra is testing this over in CASSANDRA-1324.

          Show
          Jonathan Ellis added a comment - Cassandra is testing this over in CASSANDRA-1324 .
          Hide
          Bryan Duxbury added a comment -

          Here's my first go at this change. All the tests pass.

          Show
          Bryan Duxbury added a comment - Here's my first go at this change. All the tests pass.

            People

            • Assignee:
              Bryan Duxbury
              Reporter:
              Bryan Duxbury
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development