Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

Thrift is not as compact in writing out data as, say, protobuf. It does not have the concept of variable-length integers and various other possible optimizations. In Solr we use a lot of such optimizations to make a very compact payload. Thrift has a lot in common with that format.

It is all done in a single class:

      http://svn.apache.org/viewvc/lucene/solr/trunk/src/java/org/apache/solr/common/util/NamedListCodec.java?revision=685640&view=markup

The other optimizations include writing the type and value in the same byte, very fast writes of strings, externalizable strings, etc.

We could use a Thrift format for non-Java clients, and I would like to see it be as compact as the current Java version.

      Attachments

      1. thrift-110-v9.patch (82 kB, Bryan Duxbury)
      2. thrift-110-v8.patch (83 kB, Bryan Duxbury)
      3. thrift-110-v7.patch (83 kB, Bryan Duxbury)
      4. thrift-110-v6.patch (80 kB, Bryan Duxbury)
      5. thrift-110-v5.patch (97 kB, Bryan Duxbury)
      6. thrift-110-v4.patch (87 kB, Bryan Duxbury)
      7. thrift-110-v3.patch (85 kB, Bryan Duxbury)
      8. thrift-110-v2.patch (80 kB, Bryan Duxbury)
      9. thrift-110-v12.patch (49 kB, Bryan Duxbury)
      10. thrift-110-v11.patch (49 kB, Bryan Duxbury)
      11. thrift-110-v10.patch (48 kB, Bryan Duxbury)
      12. thrift-110.patch (38 kB, Bryan Duxbury)
      13. irclog.txt (30 kB, David Reiss)
      14. compact-proto-spec-2.txt (2 kB, Bryan Duxbury)
      15. compact-proto-spec-2.txt (2 kB, Bryan Duxbury)
      16. compact-proto-spec-2.txt (2 kB, Bryan Duxbury)
      17. compact-proto-spec-2.txt (2 kB, Bryan Duxbury)
      18. compact_proto_spec.txt (3 kB, Bryan Duxbury)
      19. compact_proto_spec.txt (3 kB, Bryan Duxbury)

          Activity

          Bryan Duxbury added a comment -

          I just committed this.

          Thanks to all who contributed with ideas and code reviews!

          David Reiss added a comment -

          "obvioulsly" is a typo in TCompactProtocol.java.

          There is some trailing whitespace here and there.

          • TODO: return a constant TStruct instance? nobody's going to know.
            Now that THRIFT-10 is resolved, you can do this. By the way, I assume you have svn upped and rebuilt since THRIFT-10 was resolved. Same with the "new TField" for TStop.

          nit: readStructBegin should be moved down to where readStructEnd is. (Or the reverse, plus moving writeStruct* right after writeMessage*)

          writeStringBody is unused.

          FWIW, the varint encoding is in the opposite byte order of that used by TDenseProtocol and MIDI.
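
          For readers less familiar with the distinction, here is a minimal Java sketch (illustrative only, not code from the patch) contrasting the two orderings. The compact protocol's varints emit the low-order 7-bit group first; MIDI-style varints emit the high-order group first:

          static byte[] writeLittleEndianVarint(long n) {
            // Low 7 bits first; the MSB of each byte marks "more bytes follow".
            java.io.ByteArrayOutputStream out = new java.io.ByteArrayOutputStream();
            while ((n & ~0x7FL) != 0) {
              out.write((int) ((n & 0x7F) | 0x80));
              n >>>= 7;
            }
            out.write((int) n);
            return out.toByteArray();
          }

          static byte[] writeBigEndianVarint(long n) {
            // High 7-bit group first (MIDI-style); only the last byte has its MSB clear.
            byte[] tmp = new byte[10];
            int i = tmp.length;
            tmp[--i] = (byte) (n & 0x7F);
            n >>>= 7;
            while (n != 0) {
              tmp[--i] = (byte) ((n & 0x7F) | 0x80);
              n >>>= 7;
            }
            return java.util.Arrays.copyOfRange(tmp, i, tmp.length);
          }

          // Example: 300 encodes as [0xAC, 0x02] little-endian and [0x82, 0x2C] big-endian.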

          Bryan Duxbury added a comment -

          This addresses the unrelated changes.

          David Reiss added a comment -

          I see some unrelated changes in build.xml, and also this from DebugProtoTest.thrift

          -  i32 Janky(i32 arg)
          +  i32 Janky( i32 arg)
          
          David Reiss added a comment -

          Does this mean you're giving up on eliminating type tags from nested containers?

          Nathan Marz added a comment -

          +1

          Bryan Duxbury added a comment -

          Alright, this patch implements the short list and short set functionality.

          Would anyone object to me committing this patch soon? I'd love to get it into production use here.

          Bryan Duxbury added a comment -

          OK, one final update to the spec. There's empty space in the type headers of lists and sets that can be filled with the size of the list, as long as the list is 0-14 elements long. Size 15 is reserved to mean there is a varint size following for lists > 14 items long. This optimization will be pretty easy to code up, and will save a byte on every short list or set serialized.
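
          A minimal Java sketch of that header byte, assuming the size rides in the high nibble and the element type in the low nibble (the nibble assignment is an assumption here, not a quote from the spec):

          static void writeListHeader(java.io.ByteArrayOutputStream out, int size, int elemType) {
            if (size <= 14) {
              out.write((size << 4) | elemType);  // sizes 0-14 fit directly in the header byte
            } else {
              out.write(0xF0 | elemType);         // 0xF in the size nibble: varint size follows
              while ((size & ~0x7F) != 0) { out.write((size & 0x7F) | 0x80); size >>>= 7; }
              out.write(size);
            }
          }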

          Bryan Duxbury added a comment -

          In this patch:

          • Added protocol-id field to message
          • Extended message type of version-and-type field
          • Cleaned up some unneeded stuff in TCompactProtocolTest

          This patch also doesn't contain the ruby stuff anymore, as that's not really ready to be committed.

          David Reiss added a comment -

          I'm pretty happy with the latest version. Specifically, I don't think that combining the integer types is necessary. Bryan and I talked about a bunch of the details in IRC. I'm attaching a transcript in case anyone is interested.

          Bryan Duxbury added a comment -

          One more small change - make the message type id 3 bits and the version id 5 bits. This will give us more room to add message types.
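
          A sketch of that packing, assuming the version sits in the low 5 bits and the message type in the high 3 bits (the exact masks are an assumption):

          static final int VERSION = 1;
          static final int VERSION_MASK = 0x1F; // low 5 bits: protocol version
          static final int TYPE_MASK = 0xE0;    // high 3 bits: message type
          static final int TYPE_SHIFT = 5;

          static byte packVersionAndType(int messageType) {
            return (byte) ((VERSION & VERSION_MASK) | ((messageType << TYPE_SHIFT) & TYPE_MASK));
          }

          static int unpackVersion(byte b) { return b & VERSION_MASK; }
          static int unpackType(byte b)    { return (b & TYPE_MASK) >>> TYPE_SHIFT; }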

          Bryan Duxbury added a comment -

          Here's an updated version of the protocol spec. It includes a few minor changes:

          • Message header now contains the "protocol-id", a single-byte value designed to make it easy to tell this isn't a framed transport message and to distinguish it on the wire from TBinaryProtocol and TDenseProtocol.
          • Added the "extended" type header symbol, essentially as a reserved position so that we can give ourselves some more room for additional types should we ever need them. (Note that there are currently no implemented extended types.)
          • List, set, and map changed the order of the fields in their headers to avoid storing the subtype headers if the collection length is zero.

          I will update the Java TCompactProtocol to comply to the new spec shortly.

          Bryan Duxbury added a comment -

          Ok. So I think I heard at least one vote of confidence for the protocol spec. If no one else objects, I'm going to assume the spec is good enough for the moment, and then commit my Java implementation.

          Larry Hastings added a comment -

          One final note, for what it's worth: if you're not changing the compiler, "compact-proto-spec-2.txt" looks pretty near optimal to me.

          Larry Hastings added a comment - - edited

          I gather my suggestions are irrelevant if you're simply not going to change the compiler. That's fine--and we might as well stop discussing them and save everyone's time.

          What does this accomplish? Leaving extra room for more types in the future? The current formulation of the protocol leaves 3 open type spots, [...]

          Yes, this accomplishes leaving extra room for more types in the future, as my proposed alternate protocol left zero open type slots. If you won't ever need to support new types you could certainly leave it undefined, or at the very least unimplemented. (Me, I'm always afraid I've forgotten something, so I always leave myself an out.)

          Cheers,

          /larry/

          Zheng Shao added a comment -

          I would vote for a protocol that works within the Thrift framework.
          Breaking the current API costs too much: backward/forward data/code compatibility, how to move old custom Thrift protocols to the new API, etc.

          Bryan Duxbury added a comment -

          First: allow casting from bool to int, so that you can send the integer values 0 and 1 as boolean-false and boolean-true respectively.

          That would be cool, but this is one of those changes that would require all of Thrift to change, too. I'm not usually one to avoid sweeping changes if I think there's benefit, but right now I'm not really pro the whole "change Thrift interface" thing. We're talking about every library, protocol, and code generator changing to match some different form of protocol/struct interface. While Ben has mentioned this a few times, I haven't seen a complete proposal for something like this yet, and it's definitely a nontrivial change.

          For the sake of expediency, I'd really like to limit the scope of the discussion of this protocol to the current Thrift interface. Changing Thrift to be even more compact isn't going to happen in the near future (and possibly not before our first release), while this protocol implementation could be committed, working, now.

          Second: make up your mind--are you using zigzag ints or not?

          Zigzags are important in this protocol, and they're not used in every situation. For user-entered data and field ids, there could be negative numbers, so I have to protect against worst-case sign extension by using zigzag. But there are other things, like list and string lengths, which are uniformly non-negative, and so zigzagging them would be a waste.

          Also, while it would be nice to have specific-sized int headers available, this doesn't help me when I'm in a map or list/set and I have to use one type header without knowing the range of values up front. Zigzag allows me to just put stuff in there and get a pretty respectable compression.
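
          The zigzag transform being discussed is the standard one; a minimal Java sketch:

          static int i32ToZigzag(int n)   { return (n << 1) ^ (n >> 31); }
          static int zigzagToI32(int z)   { return (z >>> 1) ^ -(z & 1); }
          static long i64ToZigzag(long n) { return (n << 1) ^ (n >> 63); }
          static long zigzagToI64(long z) { return (z >>> 1) ^ -(z & 1); }

          // Maps 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ... so a small negative number
          // costs one varint byte instead of a worst-case sign-extended ten.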

          ... "followed by a variable-length type-header value" ...

          I'm not sure I understand this proposal. What does this accomplish? Leaving extra room for more types in the future? The current formulation of the protocol leaves 3 open type spots, and while one might be spoken for by externalized strings, I don't really know what other types we're likely to introduce in the future.

          Larry Hastings added a comment - - edited

          If you're willing to go that far, then let's go all the way: dump all the type information from the type-header. The nibble would therefore be:

          stop => 0
          value-zero => 1 # effective payload is the value zero, 0x0, maps to integer 0 / boolean false
          value-one => 2 # effective payload is the value one, 0x1, maps to integer 1 / boolean true
          payload-length-0 => 3 # empty payload
          payload-length-1 => 4 # payload is 1 byte long
          payload-length-2 => 5 # ...
          payload-length-3 => 6
          payload-length-4 => 7
          payload-length-5 => 8
          payload-length-6 => 9
          payload-length-7 => 0xA
          payload-length-8 => 0xB
          payload-length-9 => 0xC
          payload-length-10 => 0xD # payload is 10 bytes long
          payload-length-variable => 0xE # payload length follows--zigzag int?
          extended-type => 0xF # extended type follows – zigzag int?

          If empty payloads are not possible, we can drop "payload-length-0", shift all the payload-length fields down by one, and add payload-length-11 to the end.

          On the other hand, if -1 is anything like a common value, we could add a "value-negative-one", which would stand in for any all-bits-set fixed-size value. I'm guessing it's rare, so it's probably not worth it, but as already stated I'm not really a Thrift guy.

          /larry/

          p.s. Hi Ben!

          Ben Maurer added a comment -

          Hey Larry!

          I completely agree with you. In fact, you can even go further and get rid of the double type (after all, a double is just a fixed 8-byte number). The problem is that the way the Thrift code currently works, the readField method needs to give back the exact type (bool, i32, etc).

          My personal thought is that it's better to break the current API now than to have a suboptimal file format forever. OTOH, I'll admit that zigzag comes about as close to optimal as we're going to get.

          Larry Hastings added a comment -

          I'm not a Thrift guy, but I just talked with David Reiss about this thread. I have a couple more crazy suggestions--but he tells me they'd require compiler changes.

          First: allow casting from bool to int, so that you can send the integer values 0 and 1 as boolean-false and boolean-true respectively.

          Second: make up your mind--are you using zigzag ints or not? If you are, you only need one integer type. I think it'd be better to go the other way: have int types for int-8, int-16, int-24, int-32, int-40, int-64. That would give back one additional bit per byte of the ints as you'd no longer need the zigzag marker bit.

          Finally: this still leaves one type-header value for future expansion, which I suggest should be explicitly defined as "followed by a variable-length type-header value". Even if you dropped zigzag ints for integer values, they might be worth keeping here.

          I'm gonna mark up type-and-id below; for clarity I'm going to put square brackets around individual bytes.

          If we use zigzag ints here, type-and-id becomes:

          type-and-id
          => [ field-id-delta type-header ]

          [ 0 type-header ] zigzag-varint
          [ field-id-delta 0xF ] zigzag-varint
          [ 0 0xF ] zigzag-varint zigzag-varint

          If we remove zigzags entirely, then for extended field deltas or types we must follow the type-and-id with a byte containing two type-id-headers: the high one for the extended field-id-delta, and the low one for type-header. If either isn't used, set those four bits to 0.

          type-and-id
          => [ field-id-delta type-header ]

          [ 0 type-header ] [ field-id-delta-int-type-header 0 ] n-bit-int
          [ field-id-delta 0xF ] [ 0 type-header-int-type-header ] n-bit-int
          [ 0 0xF ] [ field-id-delta-int-type-header type-header-int-type-header ] n-bit-int n-bit-int

          In terms of compression, I suspect leaving the zigzag ints here is a win; given that real-world use would likely never see extended type-headers, the only variant we'd see was a field-id-delta that didn't fit in the range 1-15. In that case, zigzag ints would be strictly either smaller or the same size as the second approach.

          I'm probably crazy,

          /larry/

          Bryan Duxbury added a comment -

          This version of the patch improves the performance of the compact protocol in Java to very close to that of the binary protocol - it only takes about 16% longer.

          I think that we're getting really close to being able to commit this protocol. Can interested parties review at least the protocol spec? If we can finalize the spec, we can fork this issue off into implementations for each library.

          Bryan Duxbury added a comment -

          OK, this changes the stack of TFields into a stack of shorts and enhances the comments on readMapBegin, readListBegin, and readSetBegin.

          Nathan Marz added a comment -

          A couple comments:

          1. Instead of keeping a stack of TFields, can we keep a stack of Integers instead? I think that will make the code easier to understand.
          2. There are some gotchas with this protocol. Namely, the TList, TSet, and TMap returned by the respective read*Begin calls will be incorrect for zero-size collections - they won't have the right types in them. I think this is an OK optimization to make, but these gotchas should be listed in the comments for the class.

          Bryan Duxbury added a comment -

          This patch adds javadoc comments for all the TCompactProtocol methods, lots of cleanup, and version checking in readMessageBegin. There are still a fair number of TODOs in there, but they're mostly optional or performance enhancement ideas.

          A very simple benchmark says that it takes about 1.5-1.6 times as long to serialize a struct in the compact protocol as in the binary protocol. I bet there are gains to be made, though.
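
          For anyone wanting to reproduce this kind of measurement, a rough benchmark shape might look like the following. This is not the benchmark referred to above; MyStruct and buildTestStruct() are hypothetical stand-ins for a generated struct, and the org.apache.thrift names assume the post-renamespacing tree:

          TSerializer binary  = new TSerializer(new org.apache.thrift.protocol.TBinaryProtocol.Factory());
          TSerializer compact = new TSerializer(new org.apache.thrift.protocol.TCompactProtocol.Factory());
          MyStruct s = buildTestStruct(); // hypothetical helper building a populated struct

          long t0 = System.nanoTime();
          for (int i = 0; i < 100000; i++) binary.serialize(s);
          long t1 = System.nanoTime();
          for (int i = 0; i < 100000; i++) compact.serialize(s);
          long t2 = System.nanoTime();

          System.out.printf("binary: %d ms, compact: %d ms%n",
              (t1 - t0) / 1000000, (t2 - t1) / 1000000);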

          Ben Maurer added a comment -

          Well, at first I just used the TField directly. However, the generated struct serialization code reuses the TField objects, so if I don't copy them, nothing works.

          Right, I mean that you should re-use the TFields that you create on your own after you pop them. The issue here is that Java effectively creates a TField*[], whereas if we were writing efficient Java code we'd have TField[] so that we didn't keep on malloc/freeing things. You can simulate that by having a TField*[] but not freeing the object when you pop it, then reusing it the next time you want to push.

          There should be a benchmark to see if this stuff matters.

          Hm, true, but the alternative is copying the object before returning it, and it's unlikely the caller is going to modify it. We should probably just make the fields on TField final.

          You said "However, the generated struct serialization code reuses the TField objects". I think that might preclude making things final.

          Bryan Duxbury added a comment -

          + lastField_.push(new TField(field));

          Avoid allocation by re-using the previous TField. You might want to make it an ArrayList<TField> and use an int to keep track of the position of the stack. That way, you don't have to re-allocate when you enter/exit a struct.

          Well, at first I just used the TField directly. However, the generated struct serialization code reuses the TField objects, so if I don't copy them, nothing works.

          Forgot to remove writeVarint16 method. It's unused at the moment.

          + lastField_.push(field);
          + return field;

          TField is not immutable, so you shouldn't do this.

          Hm, true, but the alternative is copying the object before returning it, and it's unlikely the caller is going to modify it. We should probably just make the fields on TField final.

          + byte[] buf = new byte[1];
          + trans_.read(buf, 0, 1);

          Allocate per class

          I assume you mean that I should just keep a protocol-level single-byte buffer for this purpose.

          + switch ((byte)(type & 0x0f)) {

          Use an array + lookup (it's a shame that Java can't have a const array that is mapped into memory)

          Sounds like a good idea, but I'd like to have a benchmark built first before making optimizations.

          Thanks for the thoughtful comments.

          Ben Maurer added a comment -

          + lastField_.push(new TField(field));

          Avoid allocation by re-using the previous TField. You might want to make it an ArrayList<TField> and use an int to keep track of the position of the stack. That way, you don't have to re-allocate when you enter/exit a struct.

          + byte[] data = new byte[] {0, 0, 0, 0, 0, 0, 0, 0};

          There's a pre-allocated buffer for longs

          + private void writeVarint16(short n) throws TException {

          Just pass to writeVarint32... no need to unroll the loop

          + * TODO: make a permanent buffer like writeVarint64?

          You probably should, the overhead for virtual calls is very high (I'd be interested to see how high... I know it makes a big difference in C++)

          + lastField_.push(field);
          + return field;

          TField is not immutable, so you shouldn't do this.

          + byte[] buf = new byte[1];
          + trans_.read(buf, 0, 1);

          Allocate per class

          + switch ((byte)(type & 0x0f)) {

          Use an array + lookup (it's a shame that Java can't have a const array that is mapped into memory)

          Bryan Duxbury added a comment -

          OK, I've reworked the Java implementation to match the new spec. It seems to be 2-10% more compact on the test structs, and on Rapleaf's structs it increases the size reduction to slightly more than 50% of the binary protocol. Seems like a really beneficial improvement. There's also a lot less code in the new implementation.

          The Ruby implementation still needs to be switched over, and I need some time to re-comment the Java version.

          Bryan Duxbury added a comment -

          I'm not convinced that it makes sense to try and account for future types now. We have all the primitives and containers. The only thing I could imagine adding is unsigned ints, maybe. I'd rather cross that bridge when we get to it anyway.

          I haven't put together a test that compares compression with binary protocol versus compact protocol. Our dataset compressed something like 8:1 with binary protocol, and it's probably less than that with the compact protocol. I don't think I'll bother to do the comparison until after I've tried out the delta-based approach, though.

          Ben Maurer added a comment -

          There's not a lot of room for expansion of TTypes – only 3 more could be added. I might take one extra bit out of the MSB for extra headroom there. Other than that (and the redundancy in the type expressions for variants...), LGTM.

          Btw, one thing I'd be curious to know – you said that the previous version was a 50% reduction in space on your dataset – what happens when you compress the records? It'd be interesting to figure out how things do with compression.

          Bryan Duxbury added a comment -

          Here's a new spec for the compact protocol that uses the 4 MSB of the type header to encode field id deltas instead of extra value information. It's a LOT simpler, and I think it probably gets nearly the same compaction, but I haven't coded it up yet, so I'm not sure.

          One notable downside of this approach is that all ints (other than collection sizes) become zigzag varints. This avoids the worst-case size explosions for negative numbers, but makes positive numbers take up more space, too. I think it's a decent tradeoff, though.

          Bryan Duxbury added a comment -

          I've been thinking some more about the idea of using the extra bits in the type header for field id deltas instead of as type modifier space, and it seems like a good idea.

          There's 4 bits in the type header that are free to modify, which means you can represent 0-15. When you have dense structs, or even marginally sparse structs, the delta encoding will save you 2 bytes per field, plus 1-2 bytes as overhead for the first field. When you have very sparse structs (i.e. set fields are > 15 field ids apart at all times), you will have at least one byte per field, and quickly end up with 2, since you're skipping so many ids you must have hundreds of fields.

          As a positive side effect of this approach, the type modifier wouldn't be used to help with encoding the value anymore, which would probably significantly reduce the complexity of the protocol. On the downside, we'll have to maintain a stack of the last field id for the current struct so that we can descend into nested structures, adding some complexity. Overall, I think it seems like it'll be a win.
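
          A sketch of the writing side of that idea (the helper layout and names are assumptions, not the eventual implementation): deltas of 1-15 ride in the otherwise-free nibble, anything else falls back to an explicit zigzag varint, and a stack preserves the last-written field id across nested structs:

          private final java.util.ArrayDeque<Short> lastField_ = new java.util.ArrayDeque<Short>();
          private short lastFieldId_ = 0;

          void structBegin() { lastField_.push(lastFieldId_); lastFieldId_ = 0; }
          void structEnd()   { lastFieldId_ = lastField_.pop(); }

          void writeFieldHeader(short fieldId, int compactType, java.io.ByteArrayOutputStream out) {
            int delta = fieldId - lastFieldId_;
            if (delta > 0 && delta <= 15) {
              out.write((delta << 4) | compactType);    // delta in the free nibble
            } else {
              out.write(compactType);                   // zero nibble: explicit id follows
              int z = (fieldId << 1) ^ (fieldId >> 31); // zigzag, then varint
              while ((z & ~0x7F) != 0) { out.write((z & 0x7F) | 0x80); z >>>= 7; }
              out.write(z);
            }
            lastFieldId_ = fieldId;
          }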

          Ben Maurer added a comment -

          Simplifying to just an integer type will require API changes because of this type of code:

          if (ftype == facebook::thrift::protocol::T_I32) {
            xfer += iprot->readI32(this->expires);
            isset_expires = true;
          } else {
            xfer += iprot->skip(ftype);
          }

          The change that would be needed is a new method in TProtocol:

          TWireField readWireField();

          class TWireField {
            TWireType type;
            short fieldid;
          };

          enum TWireType {
            INTEGER, STRING, LIST, MAP, STRUCT
          };

          What this API change means:

          1) If you implemented your own TProtocol, you'll need to support somebody calling any of the readI[8/16/32/64] methods after getting any of the integer field types.

          2) If you generated stubs with the current version of Thrift and are using a current protocol, and then you upgrade to a new version of Thrift (without regenerating stubs), your code will not break. TBinaryProtocol will still offer readField().

          3) In order to take advantage of the new protocol, a language binding would need to change their compiler a bit.

          I think this is a small price to pay for a clean file format that offers more chances for optimization.

          Bryan Duxbury added a comment -

          As discussed previously, using deltas for the field ids is really only a win if your structs are densely populated. Rapleaf, for one, has structs with many fields that are unset in every structure. The current approach guarantees 1 byte per field id as long as you have less than 127 field ids, which I think is probably good enough.

          Your idea of dropping some types from the wire is interesting, though. Maybe the protocol could just encode "integer" on the wire and always use varints. That would free up a bit in the type header, which would expand the space we could use for a lot of things... I'll think on that.

          I already proposed changing the proto interface once (THRIFT-167), and we decided that wasn't a good idea because it increased the complexity of code generation and made protocols more intricate by default rather than as an option.

          Ben Maurer added a comment -

          I think it might be a net win to take out some of the bigger uses of the field-type-and-value and instead reserve the upper 2-3 bits of the field-type for doing a compact field ID. The field IDs can be encoded as deltas, so you only have to use a larger-size field ID if you skip 3 or 7 items (depending on how many bits you take). I think that in the vast majority of use cases, this will save 1 byte per field, while the extra types that would have to be given up only save an extra byte some of the time.

          It's a shame that due to the API, there are redundant types (e.g., positive-i64 is a superset of positive-i16). Since the TProtocol is only called from the generated code, it seems like it might be worth breaking the API to have a cleaner (and potentially more compact) protocol.

          Bryan Duxbury added a comment -

          This version of the patch merges with the re-namespacing, and adds a ton of comments to the TCompactProtocol class, including a version of the protocol spec encoded in the class comment.

          Bryan Duxbury added a comment -

          Sounds like this patch has rotted. Here's an updated version that should apply cleanly.

          Bryan Duxbury added a comment -

          Here's a version with Ruby's compact format implementation working and spec'd.

          Bryan Duxbury added a comment -

          I've reversed the order of ids and types in structs in Java in this patch. The Ruby version included here still has the ids and types reversed. As it turns out, I accidentally deleted the spec I was working on, so let's just make sure this stuff is persisted somewhere.

          Bryan Duxbury added a comment -

          Here's a patch for TCompactProtocol in Java. It deviates slightly from the spec I attached previously, so I'll have to revamp it slightly and attach an updated version. It includes a test that fairly comprehensively exercises the protocol. This is probably still something of a draft, though. There are a fair number of TODOs and other crap in the code to clean up, so a better version will definitely follow.

          If anyone has time to take a look, dig into the code and let me know what you think.

          Chad Walters added a comment -

          Containers - I'm not sure that requiring all of the elements to be of a single type is that big of a deal. If the options are storing an additional byte per element to indicate type encoding vs. just choosing the best compromise type for all of them, I think the compromise is probably best. The only time it really becomes a problem is with ints, and that's easily resolvable. If the numbers are all positive and less than 64 bits long, then we can use positive varints. If they're all negative, we can use negative varints. If there's a mix and the values are no bigger than the threshold, we can use zigzag encoding. If none of those compromises yields the best size, then we can always go for fixed size.

          Bryan, I think it is difficult to implement this unless you are willing to buffer a lot of state in the protocol, since the type code gets written out before you have seen the values being written. This was one of the main reasons I was proposing the type modifier extensions to the IDL – passing the buck along to the application writer.

          Bryan Duxbury added a comment -

          I've created THRIFT-167 to discuss enhancing the Protocol interface to make coding this protocol easier.

          Chad Walters added a comment -

          Another option is to change the Protocol interface so that things like readField() which read types have an additional output parameter that returns protocol state, and all the read() methods take in that protocol state. The changes to the existing protocol implementations are minimal, and it's totally backwards compatible. The changes to the generator are likewise not too big. We would only need to make these changes as we add the new compact protocol to each language.

          Noble Paul added a comment - - edited

          As distasteful as it would be to me to have separate generated code paths for driving the traditional and compact protocols

          'If possible' we should use the same API. It will relieve us from touching the codegen part. But how do we know which fields are present when writeStructBegin() is invoked? As it is, the API is not a very convenient one even for the current protocol, because it does not use most of the methods.

          David Reiss added a comment -

          As distasteful as it would be to me to have separate generated code paths for driving the traditional and compact protocols, I think it would be possible if that is what is necessary to make the compact protocol work. My preference is strongly in favor of using the existing protocol API, though (possibly with some additions).

          Bryan Duxbury added a comment -

          There are two ways to go with this. The first way is to keep the interface the same and make the protocol itself stateful. In this scenario, when you call writeFieldBegin(), rather than immediately writing to the transport, the protocol will buffer the fact that you said it's going to be an int, and once you actually write said integer, it will then write the field information along with the modifier and value.

          The preferable approach would be to change the TProtocol interface to write fields along with their values all at once. This would make implementing the described protocol much simpler, but would obviously necessitate changing all the existing protocols. As such, it doesn't seem likely that people will accept that change, so I'm going to go with the stateful approach for now. If I'm wrong, and we would actually consider changing the protocol interface, I'd be glad to open a separate issue to that effect.
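
          To make the stateful approach concrete, here is a minimal sketch; the MOD_* constants and the writeVarint32()/zigzag() helpers are assumed names for illustration, not taken from any attached patch:

          private TField pendingField; // header buffered until the value arrives

          public void writeFieldBegin(TField field) throws TException {
            pendingField = field; // don't touch the transport yet
          }

          public void writeI32(int value) throws TException {
            boolean negative = value < 0;
            if (pendingField != null) {
              // the value is known now, so pick the densest encoding for the header
              byte modifier = negative ? MOD_ZIGZAG : MOD_VARINT; // assumed constants
              writeByte((byte) ((modifier << 4) | TType.I32));
              writeVarint32(pendingField.id);
              pendingField = null;
            }
            writeVarint32(negative ? zigzag(value) : value);
          }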

          Chad Walters added a comment -

          Yes, to some extent. But go back and look at your own comment here (https://issues.apache.org/jira/browse/THRIFT-110?focusedCommentId=12625891#action_12625891)... How are you proposing to extend the protocol interface to handle this?

          Bryan Duxbury added a comment -

          You're right, they're not part of the IDL, since that would actually be less compact in some circumstances.

          When you read a field, the first thing you do is read the field id, which is a varint. Then, you read the 1-byte type identifier. Mask off the 4 LSB, and there you have your base type. Depending on the type, the 4 MSB might have more information for you. Take the string case. Mask off the 4 MSB, shift it down 4 positions, and then check the value. If it's 0-14, it means what follows is a 0-14 byte string. If it's 15, it means that what follows is a varint-encoded size. You read the size as a varint, then read the appropriate number of string bytes.

          Does that example make sense?
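
          As a rough sketch of that read path, collapsed into one helper for illustration (readVarint32(), readBytes() and TYPE_STRING are assumed names, not the final API):

          TField readFieldHeader() throws TException {
            int fieldId = readVarint32();             // field id comes first, as a varint
            byte typeByte = readByte();               // then the 1-byte type identifier
            byte baseType = (byte) (typeByte & 0x0F); // 4 LSB: the base type
            int modifier = (typeByte >> 4) & 0x0F;    // 4 MSB: type-specific extra info
            if (baseType == TYPE_STRING) {
              int len = (modifier == 15) ? readVarint32() // 15: a varint-encoded size follows
                                         : modifier;      // 0-14: the size itself, inline
              byte[] utf8 = readBytes(len);                // then the string bytes
            }
            return new TField("", baseType, (short) fieldId);
          }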

          Chad Walters added a comment -

          OK, I'll stop worrying about negative field IDs.

          So can you walk me through how modifier bits get handled, especially during deserialization? In my original idea about the modifiers, they were actually part of the IDL. Now it seems that they are just internal data specific to the protocol implementation. Is that correct?

          Bryan Duxbury added a comment -

          Chad, I definitely see what you're saying, but if we can avoid going to a second byte for twice as long, I think it's a good thing. Especially when the tradeoff is giving better support to a use case we should be discouraging.

          Chad Walters added a comment -

          bq The biggest one in my current setup is 85 fields. I expect it to get much bigger with time.

          Are all of those required and present in every struct?

          Also, the impact is only for fields numbered between 64 and 128 – if you get a lot more than that, it becomes a lot less relevant.

          Bryan Duxbury added a comment -

          Here's another version. I think it makes better use of available bits in the modifier space, and as such will improve density for certain values.

          I've also removed the string table stuff; per Noble's suggestion, we don't need a separate string table. This model still doesn't account for user-specified extern strings, rather just automatically detected extern strings.

          Bryan Duxbury added a comment -

          bq Well, it only impacts structs with IDs greater than 63 (I think). Are there a lot of folks using structs that big?

          The biggest one in my current setup is 85 fields. I expect it to get much bigger with time.

          Chad Walters added a comment -

          bq The problem is that it would reduce the available number of compact field ids by 50%, which in my mind unduly punishes structs that are doing things right and specifying the field ids.

          Well, it only impacts structs with IDs greater than 63 (I think). Are there a lot of folks using structs that big?

          Noble Paul added a comment -

          bq Out of curiosity, do you have a sense of length and frequency of duplicated strings in your dataset?

          When we write out a result set of Lucene documents, in each row the names may get repeated while the values differ. Unlike an RDBMS, the columns are not fixed in a Lucene document (it can have arbitrary fields). It is not possible to prepopulate the string table because the number of documents may be huge and they are read from disk just in time.
          So the name is written as an EXTERN_STRING, which we know is very likely to repeat. Making every string an EXTERN_STRING defeats the purpose: we end up storing too many strings in memory, which is a memory bloat.

          bq I wonder if it would make sense to implement the core of this functionality in an abstract protocol and then fork it into DenseCompact and SparseCompact concrete protocols...

          I guess the proposal to make too many protocols will lead to more problems than we need. At this point we already have a working protocol, everyone is using it, and more or less everyone is fine with that. Let us look at the most compact solution and finalize it. The new protocol is not compatible with the old one anyway, so even the API does not have to be compatible. For the new protocol we can have new rules, new types (no negative ids, etc.), and a new compiler.

          Bryan Duxbury added a comment -

          @Chad - zigzag would of course work. The problem is that it would reduce the available number of compact field ids by 50%, which in my mind unduly punishes structs that are doing things right and specifying the field ids.

          @Noble:

          I understand your suggestion for string externalization. If we had the extern keyword in the IDL, then this would be trivial. However, at the moment, we do not, and I'm not willing to make that a prerequisite. The alternative is to consider ALL strings for externalization, but this is probably too coarse. The risk here would be that since every string is getting an ID, when the duplicate of that string comes up, the ID will be moderately large (i.e. greater than 4 bits) and we'll have to use more bytes than we'd like in order to represent the relationship. Out of curiosity, do you have a sense of the length and frequency of duplicated strings in your dataset?

          I wonder if it would make sense to implement the core of this functionality in an abstract protocol and then fork it into DenseCompact and SparseCompact concrete protocols. That would enable us to make optimizations like using the bit field instead of field ids. Though, I think that the bit field approach would necessitate changing the way that we generate our struct code, since it expects there to be a field id per field, rather than a single global bit field.

          Noble Paul added a comment -

          bq String externalization - the problem is how you design the scoping

          String externalization may not need a special string table. When we write down a string that is to be externalized, we store it in memory in a map (string -> index) with an index assigned to it. When we write subsequent strings, we look them up in the map. If a string is present, we write the type as EXTERN_STRING_IDX and write the index instead of the actual string.

          While reading, when an external string is read, it is stored into a vector, and the position of the string in the vector is its index. If the type is EXTERN_STRING_IDX, read the next int value and look up the string at that index in the vector.
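
          A minimal sketch of that scheme, assuming hypothetical EXTERN_STRING/EXTERN_STRING_IDX type codes, varint and string-body helpers, and the java.util collections:

          private final Map<String, Integer> written = new HashMap<String, Integer>();
          private final List<String> seen = new ArrayList<String>();

          void writeExternString(String s) throws TException {
            Integer idx = written.get(s);
            if (idx != null) {
              writeByte(EXTERN_STRING_IDX); // repeat: write the index, not the bytes
              writeVarint32(idx);
            } else {
              written.put(s, written.size());
              writeByte(EXTERN_STRING);     // first occurrence: write the bytes
              writeStringBody(s);
            }
          }

          String readExternString(byte type) throws TException {
            if (type == EXTERN_STRING_IDX) {
              return seen.get(readVarint32()); // look the string up by index
            }
            String s = readStringBody();
            seen.add(s);                       // remember it for later references
            return s;
          }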

          bq Bit field - as we've previously discussed in this issue, the bit field only gives you savings if you have dense structs and fields are stored ordered.

          This is the power we give to the user. If he knows how the fields are written, he can make the judgement and benefit from the perf improvement. I assume that every struct can have less than 2-3 bytes of overhead. If the user chooses ids like 1000, 200, 5000, then we end up using 2 bytes/field.
          So this can be documented, asking users to choose numbers which are small.

          David Reiss added a comment -

          Chad, that would probably work, but negative field ids are pretty bad from a standpoint of safety and compatibility, so I don't think we should expend implementation time or CPU time supporting them. I'm fine with making them very inefficient. If it makes it easier to implement the compact protocol faster in more languages, I'd also be fine with throwing an exception on negative field ids. No segfaults, though.

          Chad Walters added a comment -

          Couldn't you just use zig-zag encoding for the field IDs? This would allow negative field IDs to be handled just fine with minimal cost in most standard cases.
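
          For reference, a standard zig-zag sketch (the same mapping protocol buffers uses, where 0, -1, 1, -2, ... map to 0, 1, 2, 3, ...):

          static int zigzagEncode(int n) { return (n << 1) ^ (n >> 31); } // signed -> small unsigned
          static int zigzagDecode(int n) { return (n >>> 1) ^ -(n & 1); } // and back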

          Bryan Duxbury added a comment -

          Regarding negative field IDs - I agree, I don't think we need to be terribly worried about it. I'd be fine with just making it clear in the documentation. Note that we can't necessarily disallow them, since it would mean structs without specified field IDs couldn't be serialized by this protocol.

          Doubles - yeah, let's just not worry about compressing those at all at this point.

          Containers - I'm not sure that requiring all of the elements to be of a single type is that big of a deal. If the options are storing an additional byte per element to indicate type encoding vs just choosing the best compromise type for all of them, I think the compromise is probably best. The only time where it really becomes a problem is with ints, and that's easily resolvable. If the numbers are all positive and less than 64 bits long, then we can use positive varints. If they're all negative, we can use negative varints. If there's a mix and the values are no bigger than the threshold, we can use zigzag encoding. If none of those compromises yields the best size, then we can always go for fixed size.
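
          A sketch of that chooser, with hypothetical ENC_* constants and THRESHOLD bound; the protocol would scan the elements once before emitting the single element-type byte:

          static byte chooseListEncoding(long[] values) {
            boolean allPositive = true, allNegative = true, smallMix = true;
            for (long v : values) {
              if (v < 0) allPositive = false; else allNegative = false;
              if (v < -THRESHOLD || v > THRESHOLD) smallMix = false; // assumed bound
            }
            if (allPositive) return ENC_VARINT;     // positive varints
            if (allNegative) return ENC_NEG_VARINT; // negative varints
            if (smallMix)    return ENC_ZIGZAG;     // zigzag for a small mixed range
            return ENC_FIXED;                       // otherwise fall back to fixed size
          }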

          String externalization - the problem is how you design the scoping. When you're doing an RPC call, there's only one message, so it's obvious where to put the string table. When you are serializing a struct directly, it's impossible to discern that from when you're serializing a struct within an RPC call, so it's difficult to figure out where to put the string table. The last thing you want to do is have a string table per struct, unless you really have that many duplicated strings per struct. Additionally, this spec was designed with the intention that implementation would not require any additions to the Thrift IDL (such as the extern keyword). The string externalization described in this spec would essentially allow any string to be externalized if it was repeated. (As I think about it some more, there might be some ways to make the string table inline everywhere, alleviating this problem... more on this later.)

          Bit field - as we've previously discussed in this issue, the bit field only gives you savings if you have dense structs and fields are stored ordered. I for one do not have dense structs, so I would definitely be paying a premium.

          Noble Paul added a comment -

          bq String externalization is supported on RPC calls, but not direct serialization.

          Why not?
          Extern strings can be an explicit declaration by the user, because he knows what is likely to be repeated.

          bq Wherever possible, use the 4 MSB in the type field as a modifier to store type-specific info.

          If possible, try out the approach I mentioned of using a 5- and 3-bit combo. If we have 5 bits we can store values up to 31, as opposed to values up to 15 for 4 bits. We use it and have found it to be useful.

          bq Autogenerated negative field ids will be represented poorly as varints

          We must disallow negative field ids. There is no possible use for them.

          Writing out the id as an integer/varint is not very efficient as opposed to using a variable-length bitset to say which fields are present and which are not.

          bq Doubles can possibly be varint compressed as well... need to do more research

          The advantage would be minimal and it is better left out. Let us write out a double as a double.

          David Reiss added a comment -

          I don't think you should worry about optimizing for negative field ids. Their use is discouraged. (Negative data is different, of course.)

          I think the only reasonable variable-length encoding for floats would be to switch between a single and double precision format.

          I think one big issue that needs to be resolved with regards to encoding extra information in type identifiers is how to deal with the case of containers which only use one type identifier for the entire container.

          Bryan Duxbury added a comment -

          I've created a grammar-like spec for the compact format and attached it here. Some general notes on features:

          • Heavy use of variable length integers to keep size down.
          • Wherever possible, use the 4 MSB in the type field as a modifier to store type-specific info.
          • Booleans are encoded as constants.
          • String externalization is supported on RPC calls, but not direct serialization.

          Some known issues:

          • Autogenerated negative field ids will be represented poorly as varints. Need to think of a strategy that doesn't overly punish those who use specified field ids, but doesn't immediately use 3-5 bytes for autogenerated values.
          • Containers must contain values all encoded the same way, so the protocol will have to choose the most effective encoding for all the values.
          • Doubles can possibly be varint compressed as well... need to do more research.

          I'd much appreciate people weighing in on the ideas in this spec and the open issues I've listed above.
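
          Since varints carry most of the savings in this spec, here is the standard 7-bits-per-byte encoding for reference; the byte order shown (least-significant group first, as in protocol buffers) is illustrative:

          static void writeVarint32(java.io.OutputStream out, int n) throws java.io.IOException {
            while ((n & ~0x7F) != 0) {
              out.write((n & 0x7F) | 0x80); // 7 payload bits, continuation bit set
              n >>>= 7;
            }
            out.write(n);                   // final byte: continuation bit clear
          }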

          Noble Paul added a comment - - edited

          bq Exactly, you can use a single byte for lots of information:

          We use it in our implementation, and I did not specify it because it can get very complex.

          Here is how it works (a sketch follows the list):

          • the 5 lsb are used to encode types with no extra info, or the ones that cannot benefit from it: VOID, STOP, BOOLEAN_TRUE, BOOLEAN_FALSE, DOUBLE, SIGNED INT, SIGNED LONG, BYTE, etc.
            • That means, for those types the 3 msb will be set to '0'. Always read the 3 msb first, and if it is '0', read the 5 lsb.
          • The 3 msb will be used to represent types which can probably benefit from some extra info. That means we can have a max of 7 such types. But those types can use the 5 lsb (values 0-31) for extra info. Examples:
            • String/extern string: a lot of string lengths tend to be in the small range, say <31, so the length can be put in the extra info.
            • list/map/set: their lengths are also usually low, so lengths can be put into the extra info.
            • struct with name (THRIFT-122): these tend to be few.
            • positive int: these usually tend to be small, so the value can be put into the extra info.
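
          A sketch of that byte layout (the rich-type codes here are illustrative, not from any implementation):

          // 3 msb == 0: plain type in the 5 lsb, no extra info.
          // 3 msb != 0: one of up to 7 "rich" types; the 5 lsb carry extra info (0-31).
          static byte packRich(int richType, int extra) { // richType in 1..7, extra in 0..31
            return (byte) ((richType << 5) | (extra & 0x1F));
          }

          static int richTypeOf(byte b) { return (b >> 5) & 0x07; } // 0 means plain type
          static int extraOf(byte b)    { return b & 0x1F; }        // length or small value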
          Chad Walters added a comment -

          Yes, these are the kind of issues I was getting at. Hence the suggestion about the type modifiers in the IDL (THRIFT-121).

          Bryan Duxbury added a comment -

          This is true. I imagine it means that lists of ints will all have to be encoded the same way. We can still have the protocol determine which encoding is the most efficient for the array by looking at all the values, though, so we'll still probably get an efficiency boost.

          David Reiss added a comment -

          Bryan, good call. We would also have to deal with the list<int> case. Currently, we only include the type code once for the entire list. Other than that, the only obvious downside I see is that you'd have to have the if/elseif logic for choosing which format to use and a switch statement for reading the value.

          Bryan Duxbury added a comment -

          Chad, I think I see what you're hinting at. One problem with the way Protocols are currently structured is that when you're writing the field type, you don't know the value yet. That is, you'll call writeFieldBegin(boolean, some id) before you subsequently call writeBoolean(true). We can either change the protocol interface to take the field info and the value at the same time, or we can make our new protocol stateful, so it doesn't actually write field header info until it's got the values too. This could be a very complex way to go.

          Chad Walters added a comment -

          Hmm, I am not sure if this will work without greatly expanding the protocol interface. I think this is the point where some code would be useful.

          Johan Stuyts added a comment -

          Exactly, you can use a single byte for lots of information:

          • 0 = stop
          • 1 = boolean false
          • 2 = boolean true
          • 3 = int
          • 4 = vint >= 32 (followed by vint with value - 32)
          • 5 = zig-zag int (possibly added to/subtracted from just like vint)
          • 6 = string >= 64 bytes (followed by vint of length - 64 and the 64+ bytes)
          • ...
          • 64-127 = integers with values -32 to 31 (direct value)
          • 128-191 = longs with values -32 to 31 (direct value)
          • 192-255 = string lengths for UTF-8 strings with 0-31 bytes (followed by 0-31 bytes)

          As you can see you can get pretty dense (a decode sketch follows below), and I am sure other people can come up with even better ways to encode values (e.g. maybe inlining values -1 to 62 for integers is more effective).

          Not having to choose the encoding in IDL also ensures that changes to the range of valid values (e.g. allowing negative values for integers) will require no changes to the IDL and deployed systems.

          You might not be able to get as dense as with hints specified in the IDL, but I think you can come close enough to the optimum. And protocols do not have to contend for IDL keywords and type bits.
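
          A decode sketch of the table above (illustrative only; the ranges are exactly the ones listed):

          void decodeOne(int code) { // code is one unsigned byte, 0-255
            if (code >= 192)      { int strLen = code - 192; /* UTF-8 bytes follow */ }
            else if (code >= 128) { long v = code - 160;     /* long, -32..31 inline */ }
            else if (code >= 64)  { int v = code - 96;       /* int, -32..31 inline */ }
            else switch (code) {
              case 0: /* stop */          break;
              case 1: /* boolean false */ break;
              case 2: /* boolean true */  break;
              case 4: /* vint follows; add 32 to the decoded value */ break;
              // ... remaining type markers ...
            }
          }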

          Bryan Duxbury added a comment -

          I was just thinking the same thing a little while ago. I think that you're right, and the user doesn't need to specify. This is especially the case if we have extra bits in the field type that we can use as the "code" to indicate whether it's a regular int, vint, or zigzag.

          Johan Stuyts added a comment -

          Why would a more compact protocol require changes to the IDL? Couldn't the protocol itself decide if a more suitable representation can be written? Here is an example for writing integer fields:

          if (0 <= v && v < {limit that requires at least four bytes})
          {
            // Put code for vint on the stream
            // Put vint-encoded value on the stream
          }
          else if ({some negative number} <= v && v <= {some positive number})
          {
            // Put code for zig-zag on the stream
            // Put zig-zag-encoded value on the stream
          }
          else
          {
            // Put code for int on the stream
            // Put value on the stream
          }
          

          The values defined in TType.java are the ones structures must use to communicate with protocols, but there is no requirement for protocols to use them internally.

          Bryan Duxbury added a comment -

          bq The case where I could envision having one field with a large number set would be if you have a large enum. For example, a generic UserAction structure that contains hundreds of optional fields for different types of user actions. A log file with a lot of user actions would contain a ton of these structures, each with only one field set. Because there would be lots of different types of actions, the bitfield for such a structure would have to be large, but we would only use one tag with the current system. I'm not super-concerned with this use case, but I think it is worth considering. At Facebook, our user actions are individual methods in an umbrella service.

          This is what Rapleaf does as well.

          Noble Paul added a comment -

          bq The current system can skip unknown structs

          Point taken. So we may not need the name of the struct.

          David Reiss added a comment -

          The current system can skip unknown structs. Even though it doesn't know what is supposed to be there, it knows that it is a struct and can read the individual field tags and skip the appropriate type. Take a look at the skip protocol method in any of the mappings for details.

          The case where I could envision having one field with a large number set would be if you have a large enum. For example, a generic UserAction structure that contains hundreds of optional fields for different types of user actions. A log file with a lot of user actions would contain a ton of these structures, each with only one field set. Because there would be lots of different types of actions, the bitfield for such a structure would have to be large, but we would only use one tag with the current system. I'm not super-concerned with this use case, but I think it is worth considering. At Facebook, our user actions are individual methods in an umbrella service.

          Noble Paul added a comment -

          David - You are right.
          The type information is important, because if it is not there we do not know how many bytes to skip. But writing down the id in the field is less efficient than using a variable-length bitset; typically, structures have 3 or more fields, with indexes starting from 1.

          In the current system, if I add a new struct field to an existing object, it fails because it does not write down which struct it is, so it does not know how many bytes to skip.

          I recommend writing down the name of the struct as in the IDL (as an extern string, so it is written only once) for all struct objects, so that if the deserializer is already aware of the struct it knows how many bytes to skip, and if it does not know about it, it can throw an error saying "unknown struct".

          This approach can be useful for writing down heterogeneous collections.

          bq If you only have one field set but that field happens to be number 600.

          If you have only one field and the id is 600, the system does not break. But why would a user do it?

          David Reiss added a comment -

          The type information is necessary for two reasons. The first is that if someone changes the type of a field (which they shouldn't, but they might), we want to just skip that field rather than desyncing the entire stream. The second reason is more important. If one side of the connection sends a field that the other side doesn't understand, we need the type information embedded in the stream in order to know how to properly skip it.

          The most obvious downside of using a bitfield (aside from a more complicated implementation) is that it is worse for very sparse structures. If you only have one field set but that field happens to be number 600, you're going to be wasting a lot of space. It would definitely save space in (what we have found to be) the more common cases, though.

          Bryan Duxbury added a comment -

          This bitfield approach is very interesting. I don't think it's a separate issue, though. It is a feature of the new "compact" protocol this issue is discussing.

          Noble Paul added a comment -

          Another point of compression would be the way we write structs and fields:

           public void writeFieldBegin(TField field) throws TException {
              writeByte(field.type); // 1 byte for the field type
              writeI16(field.id);    // 2 bytes for the field id: 3 bytes of overhead per field
            }
          

          This is too much overhead.

          The best solution would be to write a bitset for the fields included in this struct (see the sketch below). The bitset will mark the fields that are present as '1' and fields which are absent as '0'. The type information is available at both ends of the pipe, so it is redundant. If the type/name of a field is changed in a newer version, the struct must assign it a different index. The bitset must be serialized like a variable-length integer, so the overhead is 1 byte per seven fields.
          The advantages are that:

          • we do not have a per-field overhead of 3 bytes
          • we can easily add or remove fields in different versions of the objects, and it will be backward compatible too

          I'll add this to the wiki and open a separate issue as well.
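
          A sketch of that variable-length presence bitset, 7 field bits per byte with the high bit as a continuation flag (a hypothetical helper, not from any attached patch):

          static byte[] writePresenceBits(boolean[] present) { // present[i] <=> field i+1 is set
            int nBytes = (present.length + 6) / 7;             // 1 byte per seven fields
            byte[] out = new byte[nBytes];
            for (int i = 0; i < present.length; i++) {
              if (present[i]) out[i / 7] |= (byte) (1 << (i % 7));
            }
            for (int b = 0; b < nBytes - 1; b++) {
              out[b] |= (byte) 0x80;                           // more bitset bytes follow
            }
            return out;
          }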

          Johan Oskarsson added a comment -

          I've started a small wiki page to track the different ideas and suggestions for a more compact protocol since the threads quickly become hard to follow. The page is fairly empty but I hope that others can help fill it, especially with more details on the different opinions.

          http://wiki.apache.org/thrift/New_compact_binary_protocol

          Pete Wyckoff added a comment -

          bq compatibility

          Not having maps is a big problem I wasn't aware of, but support and compatibility are 2 different things. We support both the JSON and Binary protocols, but they are incompatible.

          Noble Paul added a comment -

          bq protobuf's binary protocol is probably stable. I doubt it is being changed.

          Even so, we may have other requirements which protobuf may not support; for instance, protobuf does not have the map type that Thrift has. We can copy ideas from there, but we do not need to maintain compatibility. That could prove a real millstone around our neck.

          Pete Wyckoff added a comment -

          > We will have to keep a close track of how the other protocol is making use of certain 'bits' and we will have to resort to very clever tricks to work around.

          protobuf's binary protocol is probably stable. I doubt it is being changed. And the best way to make interoperability happen is always to use their libraries themselves, licensing issues aside. Interoperability gives protobuf's users backwards compatibility when using Thrift, a big thing towards adopting it, in addition to getting all the other languages. I.e., Thrift would be a proper superset.

          Noble Paul added a comment -

          bq Not directly related, and I don't know if this is possible, but it would be pretty interesting if thrift could "speak" protocol buffers - i.e., if we had TProtocolBuffersProtocol.

          When it comes to binary protocols, interoperability is not a very nice thing to have. We will have to keep close track of how the other protocol is making use of certain 'bits', and we will have to resort to very clever tricks to work around that. Thrift in itself can become a standard, and anyone who wants to use both Thrift and some other protocol (say protobuf) can have two separate output formats.

          Eric Anderson added a comment -

          bq You would also have to instantiate the structure by name which is impossible in C++ and expensive in Java.

          This is not completely true; in our use of thrift, we are only using the marshalling layer, not the service portion of it. We treat certain structures (those with names like xRequest or xResponse) as messages. We extended thrift so that we could specify a virtual base class for structures, and then it's easy to write something that will instantiate a structure by name and can pass it to something else. This approach doesn't work with base types or sets/lists/maps, but you could wrap those in structures. I can send in a patch, but it won't be as clean as the other patches I've submitted since I'm working on those first.

          Pete Wyckoff added a comment -

          Not directly related, and I don't know if this is possible, but it would be pretty interesting if thrift could "speak" protocol buffers - i.e., if we had a TProtocolBuffersProtocol. It would add a lot of interoperability and could help people who started using protocol buffers to switch over to thrift.

          Bryan Duxbury added a comment -

          It seems like there's a fair amount of stuff floating around in this issue that really doesn't belong to this issue. I see:

          • Support for heterogeneous lists
          • Support for type modifiers (extern, fixed, etc) in IDL, and at least noop implementations in existing protocols

          At least those items should be broken out into separate issues, and used as prerequisites for this one. I think it would make the discussion easier to follow.

          Noble Paul added a comment -

This type modifier mechanism could also be used to pass through the "extern" modifier on strings (and binary?):

The type modifier actually is a good idea. As you said, it can mean different things for different types: for the list type it can say 0 for homogeneous and 1 for heterogeneous, and for map it can say whether the key, the value, or both are heterogeneous.

0 = default (variable length encoding with non-negative preferred)

I guess the default should be fixed encoding, the same as the current binary protocol.

1 = zipper (variable length encoding that also works well for small negative values)

By "small negative values", are you referring to the protocol-buffers-style zig-zag encoding?

This type modifier mechanism could also be used to pass through the "extern" modifier on strings (and binary?):

0 = standard handling

OK, some more modifiers:
1 = the string is written here, but the reader should also store it in a string->index map
2 = the string should be read from that map; what follows is a zipper int giving the index of the stored string
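
For illustration, a minimal Java sketch of the write side of modifiers 1 and 2 above (hypothetical names; this is the idea only, not Solr's actual NamedListCodec): the first occurrence of a string is written inline and remembered, and later occurrences are replaced by an index.

ExternStringTable.java
import java.util.HashMap;
import java.util.Map;

// Sketch of the extern-string scheme: modifier 1 = string written inline and
// remembered; modifier 2 = only the index of a previously seen string.
final class ExternStringTable {
  private final Map<String, Integer> index = new HashMap<>();

  // Returns -1 if the string is new (caller writes it inline, modifier 1);
  // otherwise returns the index to write as a zipper int (modifier 2).
  int lookupOrAdd(String s) {
    Integer idx = index.get(s);
    if (idx != null) {
      return idx;
    }
    index.put(s, index.size());
    return -1;
  }
}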

          Noble Paul added a comment -

          Are you suggesting that the non-homogeneous collections would only contain base types?

In our case, yes; we only support basic types.
But I guess that is not a fair assumption to make on behalf of the rest of the community; I would assume it must accept all the basic types plus all the structs defined in the IDL.
We can gradually work towards it.

If non-homogenous collections are make-or-break for you ....

The types without wildcards must continue to work as they do now. The non-homogeneous types must be dealt with as a totally different type (maybe by flipping the top bits).

          Chad Walters added a comment -

This can decide whether Thrift is usable for us or not.

If non-homogenous collections are make-or-break for you, then let's figure out the full requirements and see whether it's doable or not. I'm open to adding support for this as long as we can support it in C++ as well, and as long as it is added in a way that doesn't impact the efficiency of the current homogeneous collections.

I know what the types are and the types are the ones which are already defined in the IDL

Are you suggesting that the non-homogeneous collections would only contain base types? If so, that is a big simplification. If not, that is where things get a little challenging, I think...

          Chad Walters added a comment - - edited

          There seem to be 3 general areas that we are touching on here:
          A. Variable-length encoding for integer types
          B. Indexed string lookup
          C. Non-homogeneous collections

          It seems like A is not that controversial, so here is a bit of a deep dive WRT implementation:

          So far we all seem to be in agreement on at least the following:
          1. Don't change the BinaryProtocol at all
          2. Extend the DenseProtocol somewhat to support various encoding schemes

          Let me point out a few important implementation concerns related to the above:

          1. In Thrift, all protocol implementations share a common interface that is used in the bindings. So while we don't want to change the actual wire format of the BinaryProtocol, some code changes to the BinaryProtocol (and any other protocol implementations) will be necessary. In particular, when we extend the IDL to include new types or type modifiers, we will need to make some code changes, even if the net result for the BinaryProtocol is to treat the new types like the old types or ignore the type modifiers.

2. WRT DenseProtocol implementation, nobody seems to have taken note of my specific suggestion on making the current variable-length encoding the default and then having modifiers to allow IDL writers to state that something should use "fixed" or "zipper" encoding. I'd like to push on this a little more. This mechanism would provide full backwards compatibility for any current users of the DenseProtocol (David says it is used internally at Facebook).

          There are two possible approaches we can take to extending the IDL and the protocol interface: add new types or add type modifiers.

I personally prefer the type modifier approach. Instead of adding a profusion of new types, we just add a couple of modifier keywords (tentatively "fixed" and "zipper"). We could use the top 2 bits of the type bytes for type modifiers. These two bits would be pulled out in the language bindings and passed as an additional parameter in each call to the protocol interface. For integers, the values would be:
          0 = default (variable length encoding with non-negative preferred)
          1 = zipper (variable length encoding that also works well for small negative values)
          2 = fixed
          3 = reserved for future use

          The BinaryProtocol would just ignore the type modifier parameter and work as is. The DenseProtocol would respect it and use the appropriate encoding. The performance cost of pulling out the top two bits and masking the type byte should be negligible.

          This type modifier mechanism could also be used to pass through the "extern" modifier on strings (and binary?):
          0 = standard handling
          1 = string is stored via index into externed buffer

          Again, the BinaryProtocol could ignore this and the DenseProtocol could make use of it.

          In practice the IDL would now look something like this:

          foo.thrift
          struct Foo {
          # use fixed encoding for hash values
           1: fixed i64 md5high,
           2: fixed i64 md5low,
          # default assumes small non-negative values predominate
           3: i64 count,
          # use zipper encoding when small values (negative and positive) predominate
           4: zipper i32 adjust,
          # indicate that the string value should be stored in an external 
           5: extern string typeName
          }
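
For illustration, a minimal Java sketch of packing the proposed 2-bit modifier into the top bits of the type byte and pulling it back out (class and constant layout are hypothetical):

TypeByte.java
// Sketch of the top-2-bits proposal: 6 low bits carry the type, 2 high bits
// carry the modifier (0 = default, 1 = zipper, 2 = fixed, 3 = reserved).
final class TypeByte {
  static byte pack(byte type, int modifier) {
    return (byte) ((modifier << 6) | (type & 0x3F));
  }

  static int modifierOf(byte packed) {
    return (packed >> 6) & 0x03; // the mask discards the sign extension
  }

  static byte typeOf(byte packed) {
    return (byte) (packed & 0x3F);
  }
}

The binding would call typeOf/modifierOf once per field and pass the modifier as the extra protocol parameter; the cost is a shift and a mask per field, consistent with the "negligible" claim above.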
          
          Noble Paul added a comment -

          Both are surmountable, but I believe that they add more complexity than value to Thrift, compared to the union-like-struct-based approach.

I know what the types are, and the types are the ones which are already defined in the IDL (one or many of them). So why should I explicitly create another union-like structure? I see that as work which should be done by the compiler. The union approach actually consumes more space than a type-byte+value encoding for the items.

          but I believe that they add more complexity than value to Thrift

This can decide whether Thrift is usable for us or not. All modern languages support this kind of collection, and if they cannot use it they will have to look for alternatives.

          David Reiss added a comment -

          That would require the ability to instantiate an object given the "short code" when reading. There are two difficulties there. The first is guaranteeing uniqueness. The second is looking up the code and choosing the class dynamically. Both are surmountable, but I believe that they add more complexity than value to Thrift, compared to the union-like-struct-based approach.

          Shalin Shekhar Mangar added a comment -

          We are not 'making' a list contain objects of different types. Most languages already 'allow' you to do that. Since Thrift supports a collection type, why support it only partially?

The user can only put types in the collection which are defined in the IDL or among the basic supported types. If it is not one of these, then we don't even know how to serialize it. We can just write the short code for that type when writing the collection element.

          dhruba borthakur added a comment -

Please excuse me for jumping in here... but the option of supporting a "union" sounds better than making a list contain objects of different types. This type of list will map easily into various languages, C++ and Java included.

          David Reiss added a comment -

          You would have to include enough information to instantiate the class, which, if I understand properly, is the full package and class name in Java (and Python). However, if you know beforehand what structures you might have, you can define a union-like wrapper struct and use a list of those. This is possible (and efficient) with the current protocol.
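
For illustration, a hand-rolled Java sketch of such a union-like wrapper; Foo and Bar are hypothetical placeholders standing in for generated Thrift classes. Each instance has exactly one field set, so a list of wrappers can mix both types while every element keeps a single, statically known type.

FooOrBar.java
// Hypothetical placeholder structs standing in for generated Thrift classes.
class Foo { int id; }
class Bar { String name; }

// Union-like wrapper: exactly one field is non-null per instance.
class FooOrBar {
  Foo foo;
  Bar bar;

  static FooOrBar of(Foo f) { FooOrBar w = new FooOrBar(); w.foo = f; return w; }
  static FooOrBar of(Bar b) { FooOrBar w = new FooOrBar(); w.bar = b; return w; }
}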

          Noble Paul added a comment -

          I'll reiterate that having a list<?> type that could contain two different structures would be prohibitively expensive

I understand that it may be more expensive. But how much more expensive? What are the implementation details? Do we really need to write a string for a type, or can we manage with a short code? What if the user needs that feature and is willing to pay a small price for it?

The wildcard type must not be the default, so that regular users do not pay a price.

          David Reiss added a comment -

          We've talked about a new and more compact protocol before, so I like the idea of not meddling with the existing TBinaryProtocol.

          I'll reiterate that having a list<?> type that could contain two different structures would be prohibitively expensive because you would have to include the fully qualified type name with every single element. You would also have to instantiate the structure by name which is impossible in C++ and expensive in Java. I would recommend defining a union-like struct that contains one field for each type that you would want in the list and creating a list of those (all but one field of each would be empty).

          Noble Paul added a comment -

          The BinaryProtocol would treat them the same as the regular type, so it would not need to handle variable encodings or introduce any backward compatibility issue. The newer compact protocols could choose to use these hints.

I would say, let us not meddle with the existing binary protocol, so that no existing users have a problem. Let us develop a new "DenseProtocol" which is a totally different format, where we will have the freedom to create a very efficient format.

          Perhaps 'unsigned' is the wrong word to use here since..

You are right, unsigned is misleading. We call it vint (variable-length int) and vlong (variable-length long) in Lucene/Solr, and I guess that is not quite a bad thing. If the generated code is for the old binary protocol, it can ignore the 'v' part and treat it like an ordinary integer/long; the new format can use a different encoding for these.

          We could look into extending the type system to allow for non-homogeneous collections types but I'd like to understand first how this plays itself out in detail for C++, where there is no base Object type

The fact that somebody is using a non-homogeneous collection means that he is using a language that supports it. Let us say there is wildcard support in the IDL like list<?> or map<?,?>; if the user tries to generate code for a language which does not have that support, it can easily throw an error and fail. We can explicitly document that in the IDL documentation. Most users actually do not generate code for all languages; they use at most 2-3 languages. Just because some language which he does not need doesn't support it shouldn't prevent him from using a powerful feature of his language.

          I am not sure of the best approach for the extern string issue but I will mention that cross-language string handling is complicated

What is represented in the binary form does not have to be the same as the in-memory representation of the language. Though Java has 2-byte characters in memory, we always use UTF-8 to serialize/deserialize strings (which is not Java's native format), and it has worked well.

          Are you sure that Bryan's enumeration idea wouldn't handle a large percentage of your use cases?

One thing I can say is that that solution does not work. We tried it initially but it did not quite work.

Extern string can be lowered in priority, but it would be a very nice addition. What I can say is that the implementation is easier than you imagine (easier than the other things; just look at the writeExternString() method in this class).
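
For illustration, a Java sketch of the kind of vint/vlong encoding described above: 7 payload bits per byte, with the high bit as a continuation flag, so small non-negative values take a single byte. (This sketch writes the low-order group first; byte order conventions differ between implementations.)

VarInt.java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Sketch of a variable-length int; no EOF handling, for brevity.
final class VarInt {
  static void write(OutputStream out, int value) throws IOException {
    while ((value & ~0x7F) != 0) {
      out.write((value & 0x7F) | 0x80); // continuation byte, high bit set
      value >>>= 7;
    }
    out.write(value); // final byte, high bit clear
  }

  static int read(InputStream in) throws IOException {
    int result = 0, shift = 0, b;
    do {
      b = in.read();
      result |= (b & 0x7F) << shift;
      shift += 7;
    } while ((b & 0x80) != 0);
    return result;
  }
}

For example, write(out, 5) emits one byte and write(out, 300) emits two (0xAC 0x02), where a fixed i32 would always take four.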

          Chad Walters added a comment -

          I for one am very pleased that the Solr team is interested in getting involved here – very happy to see another Apache project adopt Thrift. Here are a few thoughts:

          1. WRT David's concerns about adding 'unsigned' to the IDL, I see this and other modifiers as things that are hints to the protocol that can be ignored by the protocol if it chooses. The BinaryProtocol would treat them the same as the regular type, so it would not need to handle variable encodings or introduce any backward compatibility issue. The newer compact protocols could choose to use these hints.

          One way to implement this would be to use some of the upper bits of the type bytes to represent the modifiers and extend the protocol interface to accept these bits. Alternatively, we could add new types and expand the protocol interface to support these types – for the BinaryProtocol, these would just call the same routines as the unmodified types.

          2. Perhaps 'unsigned' is the wrong word to use here since it doesn't make the type truly unsigned and could cause confusion. What modifier names would be good to represent the various possibilities here? Perhaps we could assume non-negative values as the default and thus use the current variable length encoding scheme from the DenseProtocol by default – this would make this particular set of changes backward compatible for the DenseProtocol. Adding a modifier like 'fixed' would hint that variable-length encoding is not advised. Adding a different modifier like "zipper" (better name please!) would hint that the values are likely to be small integers but include negatives and thus the zipper encoding from protocol buffers would work well for the data.

          3.

bq. We must handle all use cases gracefully

          I disagree. Designing an IDL requires striking a balance between the richness of data structures that your IDL can support and the richness of languages that can support your IDL.

          I agree with David here – we want to support data types that can be expressed naturally in a lot of different languages which sometimes leads to a lowest-common denominator approach. True unsigned types were left out of Thrift intentionally, for example, because many languages (Java included) don't support them naturally. We could look into extending the type system to allow for non-homogeneous collections types but I'd like to understand first how this plays itself out in detail for C++, where there is no base Object type.

          4. I am not sure of the best approach for the extern string issue but I will mention that cross-language string handling is complicated due to Java's 2-byte strings (per the string vs binary handling issue we had to address) so extra attention probably needs to be paid here. Are you sure that Bryan's enumeration idea wouldn't handle a large percentage of your use cases?
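
For illustration, the zig-zag mapping behind the "zipper" idea, sketched in Java: it interleaves signed values so that numbers near zero, of either sign, become small non-negative integers (0 maps to 0, -1 to 1, 1 to 2, -2 to 3) that a varint then encodes compactly.

ZigZag.java
// Protobuf-style zig-zag transform; pair it with a varint for the bytes.
final class ZigZag {
  static int encode(int n) {
    return (n << 1) ^ (n >> 31); // arithmetic shift propagates the sign bit
  }

  static int decode(int z) {
    return (z >>> 1) ^ -(z & 1); // undoes the interleaving
  }
}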

          Shalin Shekhar Mangar added a comment -

We evaluated using protocol buffers to replace the custom binary serialization format in Solr. However, we stopped short because we already had a very optimized client for Java (even more optimized than protobuf, e.g. with extern strings and streaming support). The C++ client was not very relevant because not many Solr users use C++ for consuming it. Python was the only other language protobuf supported which was interesting to us. Thrift, on the other hand, has the advantage of supporting multiple languages which are very relevant to Solr, and besides, it's in incubation at Apache, so we are kind of biased towards it.

          Looking at this discussion, I can gather a few points:

          1. DenseProtocol can be enhanced with these suggestions provided people are willing to do it. Almost all suggestions seem to have been shot down. That cannot work if Thrift wants to get acceptance as a de-facto open source binary protocol.
2. We don't have to support it in all languages right from the start. We can start with C++ and Java – release it and then keep adding more languages to the fray (release early, release often). Let the early adopters use this new format and give us feedback. In these early stages, we don't even need to worry about back compatibility until we get to 1.0.

From Solr's side I can say that we have the flexibility of not using certain features if they are impossible for Thrift to support across all languages. The format would anyway be faster than the XML we were using previously. It would be great if we can provide the flexibility of an efficient binary protocol across multiple languages using Thrift rather than a custom format.

          I suggest we start with:

          1. A list of things that will be good to have in the format
2. What would easily be accommodated across all languages? Not all things will be equally efficient, which is OK.

          Btw is there an issue open on DenseProtocol?

          Ian Holsman added a comment -

Have you investigated using Google's protobuf as a format?

          Noble Paul added a comment - - edited

          I'm a little hesitant to do this. For one thing, it would mean that clients in languages that didn't support this encoding would be unable to communicate with servers that used it

We must ensure that all the languages we support can handle this before we roll it out, so that all clients for a given version of Thrift are able to consume it.

The mandate for a library like Thrift is to make transport efficient and fast. If we shy away from that, we can consider this project complete and move on.

          For example, we don't support inheritance of data structures or shared structure

The IDL does not support inheritance, so that is not really an API limitation. But maps and lists are supported by the IDL, hence the difference.

          the amount of extra metadata required to have a list that could contain two different types of structures would be huge.

We can always add new types for heterogeneous collections, so the user can make the choice and pay the price for them.

We do not have to assume that there is only one way of doing something. Probably there are easier ways to achieve this if we are willing to have a dialogue.

          In order to implement extern strings, I would create a separate list of strings, store indexes into the list in your main data structure, and send both across.

While it is possible to do so, it takes a lot of extra effort for the library user. It would be easy if I could use it straight off the shelf. In our case doing so is really expensive, because the strings can only be obtained by reading from an object which may be residing on disk, and which we wish to load only just before writing.

We are discussing this as if the window for discussion is closed and Thrift has finalized everything in incubation itself. All features do not need to be there all at once. But if we want to gain acceptance we must be willing to embrace ideas from others and to accommodate whatever is possible. Moreover, we must look at our competitors and incorporate good ideas (if possible).

          David Reiss added a comment -

The first step is to add 'unsigned'. Unsigned means it can only contain positive numbers, so we can use a variable-length format.

I'm a little hesitant to do this. For one thing, it would mean that clients in languages that didn't support this encoding would be unable to communicate with servers that used it. This would require keeping track of which features were used by which servers, eliminating the current nice situation that any Thrift client can talk to any Thrift server. Also, variable-length encoding and decoding add significant extra overhead in dynamic languages, so using the protocol accelerator modules would become pretty much mandatory.

The idea is that the protocol must not dictate how users use collection APIs.

You can use the APIs however you want within your program, but the data model supported by Thrift is necessarily going to be less expressive than whatever your language of choice uses. For example, we don't support inheritance of data structures or shared structure.

We must handle all use cases gracefully

          I disagree. Designing an IDL requires striking a balance between the richness of data structures that your IDL can support and the richness of languages that can support your IDL.

The other problem is with the list/map types. They are expected to be homogeneous, which is not acceptable (I guess). So type information needs to be encoded with each element.

There are a few problems with this. First, it won't work with C++. Second, it would result in a huge increase in serialized size if we allowed all collections to have variant contents. Third, the amount of extra metadata required to have a list that could contain two different types of structures would be huge. You would have to include the full name of the structure class. In Java, you would have to do a get-class-by-name to instantiate each element, which is super slow.

          The dense protocol is complete for (de)serializing standalone structures. It can't be used as the protocol for calls/replies, but that is not a big change.

          In order to implement extern strings, I would create a separate list of strings, store indexes into the list in your main data structure, and send both across.

          Noble Paul added a comment - - edited

          Should we be discussing creating a new protocol, or enhancing/formalizing DenseProtocol, or what?

The protocol must be enhanced significantly; we must extract maximum mileage out of every bit. We have a lot of other tools/libraries from which we can copy.

          Are the names really variable, or could you use enumerated "strings" in the name/value mappings?

They are not enumerated strings. At least we do not treat them that way.

The idea is that the protocol must not dictate how users use collection APIs. We must handle all use cases gracefully.

          Bryan Duxbury added a comment -

          +1 to adding "unsigned" types.

          It seems like using variable length ints could give us some reasonable gains in a few different places. For instance, enumerated types are i32s, but I'm guessing that whole number space isn't actually used very frequently. You could save 3 bytes in some cases. Any time we have list sizes (list, map, string, binary) we should be able to save some space. At least for Rapleaf, this seems like it'd be a big win in terms of reducing the amount of I/O we'll do.

          Should we be discussing creating a new protocol, or enhancing/formalizing DenseProtocol, or what?

Again from Solr: we write everything as name and value, where names may be small but repeated

          Are the names really variable, or could you use enumerated "strings" in the name/value mappings?

          Noble Paul added a comment -

          I am not sure this is going to change. In C++, lists are implemented as vectors of objects, not object references. Can you provide some concrete cases where this has been an absolute requirement?

The Solr response is composed fully of collection objects (no custom objects), so we mix and match various collection objects. Moreover, it is not a good idea for the protocol to limit what the API allows.

          Sounds interesting. This generally wouldn't be that interesting for small RPC messages but I can see other use cases where it would be.

Again from Solr: we write everything as name and value, where names may be small but repeated, so the response is string-intensive. Actually, good string serialization/deserialization performance is our main requirement.

Moreover, as protocol buffers is now out as open source, users will definitely compare the pros and cons.

          Chad Walters added a comment -

          I am glad that there is interest in this topic and I would be happy to talk further on how to make progress in this area.

Our own protocol is extremely compact. I am looking for a true cross-language library so that all our users can take advantage of that. Thrift has the potential, but it is very inefficient compared to protocol buffers.

          I would very much like to see Thrift become the defacto protocol system used by other Apache projects and beyond. Part of the current lack of compactness is due to Thrift's heritage as a fast RPC system but there are definite thoughts on several different protocol implementations that would address compactness.

The first step is to add 'unsigned'.

          I agree, although I'd point out that the DenseProtocol kind of assumes unsigned already.

The other problem is with the list/map types. They are expected to be homogeneous, which is not acceptable (I guess).

          I am not sure this is going to change. In C++, lists are implemented as vectors of objects, not object references. Can you provide some concrete cases where this has been an absolute requirement?

Writing a string with its length (i32) first consumes four bytes, which is inefficient for small strings.

Definitely. The DenseProtocol already does this, and any compact protocol would want to follow suit.

We have a special type called extern string. It stores the strings already written and assigns an index to each.

          Sounds interesting. This generally wouldn't be that interesting for small RPC messages but I can see other use cases where it would be.

          When I was talking about adding a "hash" hint, this was to indicate that an integer type was unlikely to benefit at all from a variable-length encoding – eg: a value coming from a portion of an MD5 or the like.

          Noble Paul added a comment -

          Take a look at the DenseProtocol

Our own protocol is extremely compact. I am looking for a true cross-language library so that all our users can take advantage of that. Thrift has the potential, but it is very inefficient compared to protocol buffers.

          I'd also like to see the IDL extended so that hints can be provided for compactness - such as "small" or "small unsigned" or "hash"

The first step is to add 'unsigned'. Unsigned means it can only contain positive numbers, so we can use a variable-length format.

          Protocol buffers use something called zigzag encoding for signed integers

The other problem is with the list/map types. They are expected to be homogeneous, which is not acceptable (I guess). So type information needs to be encoded with each element.

Writing a string with its length (i32) first consumes four bytes, which is inefficient for small strings.

Writing a hash itself is not very efficient for strings; a long may take 8 bytes. We have a special type called extern string. It stores the strings already written and assigns an index to each. Subsequent writes just write that index if the string has already been written.

          Chad Walters added a comment -

          Take a look at the DenseProtocol (http://svn.apache.org/viewvc/incubator/thrift/trunk/lib/cpp/src/protocol/TDenseProtocol.h?view=log). It's only really been worked on for C++ so far – adding a Java version would be good but you might want to check around to see whether the protocol is really complete at this point (dreiss?).

At one point we talked about a family of more compact protocols that allow for the type information to be passed out of band, resulting in even greater savings. So far it hasn't been a high priority, but it's definitely something I'd love to see.

          I'd also like to see the IDL extended so that hints can be provided for compactness – such as "small" or "small unsigned" or "hash" – which the protocol implementation is free to ignore if need be. FWIW, the DenseProtocol as it is currently implemented is very bad for negative numbers.


            People

            • Assignee: Bryan Duxbury
            • Reporter: Noble Paul
            • Votes: 0
            • Watchers: 8
