Thrift
  1. Thrift
  2. THRIFT-1920 Binary Type
  3. THRIFT-395

Python library + compiler does not support unicode strings

    Details

    • Patch Info:
      Patch Available

      Description

      Effectively, all strings in the python bindings are treated as binary strings – no encoding/decoding to UTF-8 is done. So if a unicode object is passed to a (regular, non-binary) string, an exception is raised.

        Activity

        Hide
        David Reiss added a comment -

        I think I thought there was something else left to resolve, but now I can't figure out what it was.

        Show
        David Reiss added a comment - I think I thought there was something else left to resolve, but now I can't figure out what it was.
        Hide
        Bryan Duxbury added a comment -

        Then we should probably close this issue.

        Show
        Bryan Duxbury added a comment - Then we should probably close this issue.
        Hide
        David Reiss added a comment -

        I already committed it a month ago. Check the "all" view.

        Show
        David Reiss added a comment - I already committed it a month ago. Check the "all" view.
        Hide
        Bryan Duxbury added a comment -

        +1 to Nathan's patch. If no one objects, I'll commit this later today.

        Show
        Bryan Duxbury added a comment - +1 to Nathan's patch. If no one objects, I'll commit this later today.
        Hide
        Michael Montano added a comment - - edited

        +1 to applying Nathan's patch

        Show
        Michael Montano added a comment - - edited +1 to applying Nathan's patch
        Hide
        David Reiss added a comment -

        This is fine with me. Thoughts?

        Show
        David Reiss added a comment - This is fine with me. Thoughts?
        Hide
        Nathan Marz added a comment -

        This patch builds upon Jonathan's patch to add the "utf8strings" option to the python generator. This causes thrift to encode/decode strings using utf8 in the generated code. Note that this patch does not modify the python lib at all, only the code generator.

        Show
        Nathan Marz added a comment - This patch builds upon Jonathan's patch to add the "utf8strings" option to the python generator. This causes thrift to encode/decode strings using utf8 in the generated code. Note that this patch does not modify the python lib at all, only the code generator.
        Hide
        David Reiss added a comment -

        Yeah, I was thinking the same thing. I think my patches handle the write path acceptably without the need for an option. An option for the read path should be pretty easy. It will take some more work to get it to work for the accelerator module, but Jonathan's patch doesn't touch that, so I'm guessing you guys aren't too concerned with it.

        Show
        David Reiss added a comment - Yeah, I was thinking the same thing. I think my patches handle the write path acceptably without the need for an option. An option for the read path should be pretty easy. It will take some more work to get it to work for the accelerator module, but Jonathan's patch doesn't touch that, so I'm guessing you guys aren't too concerned with it.
        Hide
        Nathan Marz added a comment -

        Here's a proposal to resolve the deadlock.

        I propose adding an option to the python generator that will force strings to be utf-8 encoded/decoded, ala Jonathan's patch. Without the option, python thrift will remain with the current behavior (so existing code will continue to function the same way), and the rest of us can use the option when we generate code to resolve our problems.

        How does this sound?

        Show
        Nathan Marz added a comment - Here's a proposal to resolve the deadlock. I propose adding an option to the python generator that will force strings to be utf-8 encoded/decoded, ala Jonathan's patch. Without the option, python thrift will remain with the current behavior (so existing code will continue to function the same way), and the rest of us can use the option when we generate code to resolve our problems. How does this sound?
        Hide
        Esteve Fernandez added a comment -

        +1 to applying. The current situation regarding unicode and Python is very frustratring.

        Show
        Esteve Fernandez added a comment - +1 to applying. The current situation regarding unicode and Python is very frustratring.
        Hide
        David Reiss added a comment -

        This is going to break existing code, so I vote -1.

        Show
        David Reiss added a comment - This is going to break existing code, so I vote -1.
        Hide
        Bryan Duxbury added a comment -

        +1 to applying.

        Show
        Bryan Duxbury added a comment - +1 to applying.
        Hide
        Michael Montano added a comment -

        i agree that it's best to do something about this issue rather than continuing to debate about it. +1 to applying jonathan's patch and opening a new issue.

        Show
        Michael Montano added a comment - i agree that it's best to do something about this issue rather than continuing to debate about it. +1 to applying jonathan's patch and opening a new issue.
        Hide
        Nathan Marz added a comment -

        I want to be able to use "string" fields without manually encoding/decoding, so both. Jonathan's patch would solve all the issues I'm facing currently, including reading Python serialized objects from Java and talking to Twisted servers from Python clients.

        Show
        Nathan Marz added a comment - I want to be able to use "string" fields without manually encoding/decoding, so both. Jonathan's patch would solve all the issues I'm facing currently, including reading Python serialized objects from Java and talking to Twisted servers from Python clients.
        Hide
        David Reiss added a comment -

        Can you clarify whether you are focused on Python reading unicode strings or Python writing unicode strings?

        Show
        David Reiss added a comment - Can you clarify whether you are focused on Python reading unicode strings or Python writing unicode strings?
        Hide
        Nathan Marz added a comment -

        We're constantly running into this issue at BackType. Python Thrift is just plain broken right now because of this issue, and lots of people are having problems. Since it sounds like we're not all going to agree on the "best" approach, I vote for applying Jonathan's patch and opening a new issue where we can debate "the right way". Even if Jonathan's patch isn't the "right way", it's a hell of a lot better than the current state of things.

        Show
        Nathan Marz added a comment - We're constantly running into this issue at BackType. Python Thrift is just plain broken right now because of this issue, and lots of people are having problems. Since it sounds like we're not all going to agree on the "best" approach, I vote for applying Jonathan's patch and opening a new issue where we can debate "the right way". Even if Jonathan's patch isn't the "right way", it's a hell of a lot better than the current state of things.
        Hide
        Bryan Duxbury added a comment -

        Should we wake this issue up again? I think the general debate was on whether our string types should be explicitly UTF-8 or not. I am for making Thrift string types UTF-8, enforcing this in languages where we can, and in languages we can't, making it clear that we've punted.

        We can put it to a vote on the list.

        Show
        Bryan Duxbury added a comment - Should we wake this issue up again? I think the general debate was on whether our string types should be explicitly UTF-8 or not. I am for making Thrift string types UTF-8, enforcing this in languages where we can, and in languages we can't, making it clear that we've punted. We can put it to a vote on the list.
        Hide
        Bryan Duxbury added a comment -

        My suspicion is that the python lib wrote a multibyte character string to the wire with a "character" instead of "byte" length header. Then, once the clientside read into the middle of that string, it got off sync. I could see this tripping up pretty much any client lib, including python.

        Show
        Bryan Duxbury added a comment - My suspicion is that the python lib wrote a multibyte character string to the wire with a "character" instead of "byte" length header. Then, once the clientside read into the middle of that string, it got off sync. I could see this tripping up pretty much any client lib, including python.
        Hide
        Nathan Marz added a comment -

        PHP read a massive length for a string on the wire, so it tried to read more bytes than were available. It would always time out when reading the 4 bytes for the size of the next frame (using framed transport). Don't really know the details of the internals well enough to give a more detailed answer than that.

        Show
        Nathan Marz added a comment - PHP read a massive length for a string on the wire, so it tried to read more bytes than were available. It would always time out when reading the 4 bytes for the size of the next frame (using framed transport). Don't really know the details of the internals well enough to give a more detailed answer than that.
        Hide
        David Reiss added a comment -

        How was PHP timing out if it was sending back a response?

        Show
        David Reiss added a comment - How was PHP timing out if it was sending back a response?
        Hide
        Nathan Marz added a comment -

        It was a python client. I think the response was just corrupted and every lib was just reacting differently. Unfortunately, I don't have a simple test case. Modifying my code to manually encode as UTF-8 did fix the issue.

        Show
        Nathan Marz added a comment - It was a python client. I think the response was just corrupted and every lib was just reacting differently. Unfortunately, I don't have a simple test case. Modifying my code to manually encode as UTF-8 did fix the issue.
        Hide
        David Reiss added a comment -

        Why was Python deserializing anything (other than the response) if it was the server? Also, missing fields shouldn't cause PHP to time out unless there is another bug there. Do you have a simple test case for this?

        Show
        David Reiss added a comment - Why was Python deserializing anything (other than the response) if it was the server? Also, missing fields shouldn't cause PHP to time out unless there is another bug there. Do you have a simple test case for this?
        Hide
        Nathan Marz added a comment -

        I just ran into this, and figuring out this was the issue was really convoluted. I had a python server returning thrift objects with unicode strings, and php was timing out, python was deserializing fine but missing fields, and php was timing out. Can we at least put a patch in that throws an error when a unicode string is detected so that this is easier to debug?

        Show
        Nathan Marz added a comment - I just ran into this, and figuring out this was the issue was really convoluted. I had a python server returning thrift objects with unicode strings, and php was timing out, python was deserializing fine but missing fields, and php was timing out. Can we at least put a patch in that throws an error when a unicode string is detected so that this is easier to debug?
        Hide
        David Reiss added a comment -

        > I was worried that educating potential consumers of the service about string encoding
        Unfortunately, if you are sending a Unicode string to a language where the string type does not use Unicode, your users must be educated about string encoding.

        If you are primarily concerned with the strings being sent from Python to Java, the patches I posted will cover it.

        Show
        David Reiss added a comment - > I was worried that educating potential consumers of the service about string encoding Unfortunately, if you are sending a Unicode string to a language where the string type does not use Unicode, your users must be educated about string encoding. If you are primarily concerned with the strings being sent from Python to Java, the patches I posted will cover it.
        Hide
        James Clarke added a comment -

        You can continue to use the string type. Just be sure that all of the str objects that you pass into Thrift are properly UTF-8-encoded.

        This is what I'm currently doing which is fine for me but may be more challenging for users. I was worried that educating potential consumers of the service about string encoding (and how to do it in their language, everything but Java and C#) would make the service appear less user friendly than it is.

        Show
        James Clarke added a comment - You can continue to use the string type. Just be sure that all of the str objects that you pass into Thrift are properly UTF-8-encoded. This is what I'm currently doing which is fine for me but may be more challenging for users. I was worried that educating potential consumers of the service about string encoding (and how to do it in their language, everything but Java and C#) would make the service appear less user friendly than it is.
        Hide
        David Reiss added a comment -

        Personally, I disagree, but I see how that view might make more sense to some.

        Show
        David Reiss added a comment - Personally, I disagree, but I see how that view might make more sense to some.
        Hide
        Jonathan Ellis added a comment -

        > You can continue to use the string type. Just be sure that all of the str objects that you pass into Thrift are properly UTF-8-encoded.

        which is to say, "you can use the string type, if you pretend it is binary on the python side."

        less confusing and error-prone to use real binary type.

        Show
        Jonathan Ellis added a comment - > You can continue to use the string type. Just be sure that all of the str objects that you pass into Thrift are properly UTF-8-encoded. which is to say, "you can use the string type, if you pretend it is binary on the python side." less confusing and error-prone to use real binary type.
        Hide
        David Reiss added a comment -

        You can continue to use the string type. Just be sure that all of the str objects that you pass into Thrift are properly UTF-8-encoded. If we apply the patches that I posted, you will be able to pass unicode objects into Thrift and have them automatically encoded as UTF-8. However, the changes required to make Thrift return unicode objects from its deserialization routines is more complex.

        Show
        David Reiss added a comment - You can continue to use the string type. Just be sure that all of the str objects that you pass into Thrift are properly UTF-8-encoded. If we apply the patches that I posted, you will be able to pass unicode objects into Thrift and have them automatically encoded as UTF-8. However, the changes required to make Thrift return unicode objects from its deserialization routines is more complex.
        Hide
        Jonathan Ellis added a comment -

        many months from now it's likely that thrift will have adopted and debugged the more complex solution started in THRIFT-414.

        until then you're screwed. use binary instead of string and encode/decode manually.

        Show
        Jonathan Ellis added a comment - many months from now it's likely that thrift will have adopted and debugged the more complex solution started in THRIFT-414 . until then you're screwed. use binary instead of string and encode/decode manually.
        Hide
        James Clarke added a comment -

        I just wanted to bump this issue since I'm running into it now (Java <-> Python).

        What is the current plan for resolving this issue and how can we move forward?

        Show
        James Clarke added a comment - I just wanted to bump this issue since I'm running into it now (Java <-> Python). What is the current plan for resolving this issue and how can we move forward?
        Hide
        David Reiss added a comment -

        I addressed Chad's comments on THRIFT-414 and got the patch to compile and the tests to pass. If the Java folks sign off on that, I'll commit it and 413, then it should be easier for us to move forward on this one. I think the hardest thing is going to be propagating the presence of the annotation into the extension module so that it knows to verify the encoding on output and decode strings on input.

        Show
        David Reiss added a comment - I addressed Chad's comments on THRIFT-414 and got the patch to compile and the tests to pass. If the Java folks sign off on that, I'll commit it and 413, then it should be easier for us to move forward on this one. I think the hardest thing is going to be propagating the presence of the annotation into the extension module so that it knows to verify the encoding on output and decode strings on input.
        Hide
        Terry Jones added a comment -

        > David Reiss writes:
        > As I recall, Jonathan was the only person who really seemed to
        > care about this issue, and he wasn't satisfied with my
        > suggestion, so I put it aside. Chad also requested some changes
        > to my diff for the JSON protocol. I'll try to reevaluate the
        > status some time soon, but I am away from a computer today.

        Jonathan - are you still in the loop on this one? What do you think?

        Given the fundamental differences in what a "string" is across different languages, there's unlikely to be a clean solution that suits everyone. Having a backwards-compatible compromise that works is much better than having nothing, though.

        Show
        Terry Jones added a comment - > David Reiss writes: > As I recall, Jonathan was the only person who really seemed to > care about this issue, and he wasn't satisfied with my > suggestion, so I put it aside. Chad also requested some changes > to my diff for the JSON protocol. I'll try to reevaluate the > status some time soon, but I am away from a computer today. Jonathan - are you still in the loop on this one? What do you think? Given the fundamental differences in what a "string" is across different languages, there's unlikely to be a clean solution that suits everyone. Having a backwards-compatible compromise that works is much better than having nothing, though.
        Hide
        Terry Jones added a comment -

        It seems that a decision/consensus was almost reached here, specifically David's suggestion at http://bit.ly/ofFr0

        Can we re-animate this issue and get it resolved? I somehow skipped this discussion when it was going on as I knew (or thought I knew) that strings were sent as UTF-8 and was mistakenly assuming that the Python support did the Right Thing and that if an app passed a Python unicode object in a call you'd get a Python unicode object out on the other end. Last night I found out to my great surprise that that's not the case.

        It would be really nice to have this resolved. Otherwise it's going to mean a bunch of crufty manual coding decoding. And it's made worse in our case as we have a dozen internal services that all speak to each other extensively using Thrift. So not only do we need to deal with outside clients being able to somehow pass unicode, we'd have to manually decode each arg in each method in each service, and then manually encode them again to call another Thrift method inside our own service. Either that or keep things as UTF-8 strings, which isn't an option.

        The patches are in, and backwards compatibility is not an issue with David's suggestion. Real users need it ASAP to avoid real pain What's still stopping this from being resolved/applied/committed?

        Terry

        Show
        Terry Jones added a comment - It seems that a decision/consensus was almost reached here, specifically David's suggestion at http://bit.ly/ofFr0 Can we re-animate this issue and get it resolved? I somehow skipped this discussion when it was going on as I knew (or thought I knew) that strings were sent as UTF-8 and was mistakenly assuming that the Python support did the Right Thing and that if an app passed a Python unicode object in a call you'd get a Python unicode object out on the other end. Last night I found out to my great surprise that that's not the case. It would be really nice to have this resolved. Otherwise it's going to mean a bunch of crufty manual coding decoding. And it's made worse in our case as we have a dozen internal services that all speak to each other extensively using Thrift. So not only do we need to deal with outside clients being able to somehow pass unicode, we'd have to manually decode each arg in each method in each service, and then manually encode them again to call another Thrift method inside our own service. Either that or keep things as UTF-8 strings, which isn't an option. The patches are in, and backwards compatibility is not an issue with David's suggestion. Real users need it ASAP to avoid real pain What's still stopping this from being resolved/applied/committed? Terry
        Hide
        Kevin Clark added a comment -

        I think it's clear that we've still got issues to resolve here. I'm pushing this to 0.2 so we don't hold up the release.

        Show
        Kevin Clark added a comment - I think it's clear that we've still got issues to resolve here. I'm pushing this to 0.2 so we don't hold up the release.
        Hide
        Eric Anderson added a comment -

        Is it worth adding a text and binary types (following SQL's convention)? You could then put a warning on strings that they are deprecated because of the cross-language typing issue, and have them accept input as either text or binary and have text and binary accept strings (with validation perhaps on text). This preserves backwards compatibility and provides a migration path.

        Show
        Eric Anderson added a comment - Is it worth adding a text and binary types (following SQL's convention)? You could then put a warning on strings that they are deprecated because of the cross-language typing issue, and have them accept input as either text or binary and have text and binary accept strings (with validation perhaps on text). This preserves backwards compatibility and provides a migration path.
        Hide
        David Reiss added a comment -

        @Jonathan: I think you are still assuming that the "string" is a "Unicode string" or a "string of Unicode code points". Just because something is a "string" doesn't mean it is Unicode.

        @Chad: Please feel free to ask questions on 413 and 414. They both have fairly simple patches posted, and shouldn't change any existing behavior.

        @Bryan: Even if we decide that that we want strings to be always Unicode, there are encodings other than UTF-8, and I don't see why we should prevent users from using annotations to specify an alternate encoding.

        Show
        David Reiss added a comment - @Jonathan: I think you are still assuming that the "string" is a "Unicode string" or a "string of Unicode code points". Just because something is a "string" doesn't mean it is Unicode. @Chad: Please feel free to ask questions on 413 and 414. They both have fairly simple patches posted, and shouldn't change any existing behavior. @Bryan: Even if we decide that that we want strings to be always Unicode, there are encodings other than UTF-8, and I don't see why we should prevent users from using annotations to specify an alternate encoding.
        Hide
        Bryan Duxbury added a comment -

        I think the point is that we want to strengthen the meaning of "string" wherever possible. Clearly it used to be used like arbitrary bytes, but since we have binary now, it seems to make sense that the key use case is for actual text. In some ways, I see specifying the encoding of strings as a necessary part of the protocol. After all, the protocol specifies the encoding of ints, doubles, maps, etc, right? Jonathan has consistently argued for us to have a standard.

        Right now, we have a de facto standard of "UTF8 if it's convenient, whatever else otherwise". This can obviously lead to problems in some situations. Yes, you can make the application be concerned with the encoding, but that seems like a workaround, and it will quickly become inconvenient if you have more than two languages involved.

        In general, I'm sort of against allowing "alternate encodings" (a la THRIFT-414), because it seems like overkill for the problem. Either you are dealing with strings that could contain special characters, in which case you're probably looking for Unicode support, or you basically don't care about encoding, in which case the base subset of ASCII is probably more than enough for you. I think it's tricky to add annotations for string encodings because the wire won't contain that information, and could lead to you being able to read but unable to decode a string sent to you.

        Show
        Bryan Duxbury added a comment - I think the point is that we want to strengthen the meaning of "string" wherever possible. Clearly it used to be used like arbitrary bytes, but since we have binary now, it seems to make sense that the key use case is for actual text. In some ways, I see specifying the encoding of strings as a necessary part of the protocol. After all, the protocol specifies the encoding of ints, doubles, maps, etc, right? Jonathan has consistently argued for us to have a standard. Right now, we have a de facto standard of "UTF8 if it's convenient, whatever else otherwise". This can obviously lead to problems in some situations. Yes, you can make the application be concerned with the encoding, but that seems like a workaround, and it will quickly become inconvenient if you have more than two languages involved. In general, I'm sort of against allowing "alternate encodings" (a la THRIFT-414 ), because it seems like overkill for the problem. Either you are dealing with strings that could contain special characters, in which case you're probably looking for Unicode support, or you basically don't care about encoding, in which case the base subset of ASCII is probably more than enough for you. I think it's tricky to add annotations for string encodings because the wire won't contain that information, and could lead to you being able to read but unable to decode a string sent to you.
        Hide
        Chad Walters added a comment -

        I think David's comments above are excellent suggestions that could satisfy almost everything for all parties. Yes it introduces some complexity but if we choose reasonable defaults we should hide a lot of that complexity from most users. I'd like to understand the implementation implications of THRIFT-413 and THRIFT-414 in more detail first though.

        Show
        Chad Walters added a comment - I think David's comments above are excellent suggestions that could satisfy almost everything for all parties. Yes it introduces some complexity but if we choose reasonable defaults we should hide a lot of that complexity from most users. I'd like to understand the implementation implications of THRIFT-413 and THRIFT-414 in more detail first though.
        Hide
        Jonathan Ellis added a comment -

        > If your api method takes an integer, but in your application the only valid values are even numbers, should we include that validation in Thrift as well?

        Kevin: that's an excellent analogy.

        If some languages defined an `even` type, should we expose that to thrift? I am arguing that there are two consistent alternatives:

        • expose an `even` type and make thrift responsible for raising an error if a client passes a non-even int (in languages where this is possible)
        • don't expose `even` at all and make everyone use int so it's explicit what the expectations are

        what we have now is, in effect, everyone using the closest "native" type they have to `even` which is in some cases not necessarily even at all, which leads to strange errors when sending one of those to a language that does have native `even`s.

        Show
        Jonathan Ellis added a comment - > If your api method takes an integer, but in your application the only valid values are even numbers, should we include that validation in Thrift as well? Kevin: that's an excellent analogy. If some languages defined an `even` type, should we expose that to thrift? I am arguing that there are two consistent alternatives: expose an `even` type and make thrift responsible for raising an error if a client passes a non-even int (in languages where this is possible) don't expose `even` at all and make everyone use int so it's explicit what the expectations are what we have now is, in effect, everyone using the closest "native" type they have to `even` which is in some cases not necessarily even at all, which leads to strange errors when sending one of those to a language that does have native `even`s.
        Hide
        David Reiss added a comment - - edited

        @Kevin: I think a brand new type is overkill. What would you think of...

        • Committing support for annotations on base types (THRIFT-413)?
        • Committing something to my patch for alternate encodings in Java (THRIFT-414).
        • Stating that strings should be UTF-8 by convention unless otherwise specified.
        • Defining a "unicode.strict" attribute that we could implement on a per-language basis as it becomes convenient. In Python, strs would be verified when writing and decode into unicodes when reading. In Ruby and PHP, we could validate the encoding on both sides (and throw an exception if validation fails). In C++, maybe we could use wstring and encode/decode with ICU if unicode.string is set to "omg yes really even C++ jerk!"
        Show
        David Reiss added a comment - - edited @Kevin: I think a brand new type is overkill. What would you think of... Committing support for annotations on base types ( THRIFT-413 )? Committing something to my patch for alternate encodings in Java ( THRIFT-414 ). Stating that strings should be UTF-8 by convention unless otherwise specified. Defining a "unicode.strict" attribute that we could implement on a per-language basis as it becomes convenient. In Python, strs would be verified when writing and decode into unicodes when reading. In Ruby and PHP, we could validate the encoding on both sides (and throw an exception if validation fails). In C++, maybe we could use wstring and encode/decode with ICU if unicode.string is set to "omg yes really even C++ jerk!"
        Hide
        Kevin Clark added a comment - - edited

        One of the Ruby guys here. waves

        @David: As things are now, no, I don't think Ruby should enforce string encoding. Right now the format the string is expected to be in should be published as part of api specs and handled application side. 'string' is a semantic label in our case, distinct from binary in that it is assumed to be characters, but doesn't define encoding. What I would be in favor of is a new utf8 type, which would define encoding. But without that, I don't think the restriction should be placed on string.

        @Jonathan: If your api method takes an integer, but in your application the only valid values are even numbers, should we include that validation in Thrift as well? Hostage taking seems a little extreme. I prefer to think of it as the boyscout helping the old lady across the street, but not making sure she has two legs. If it doesn't bother her, it doesn't bother me.

        Show
        Kevin Clark added a comment - - edited One of the Ruby guys here. waves @David: As things are now, no, I don't think Ruby should enforce string encoding. Right now the format the string is expected to be in should be published as part of api specs and handled application side. 'string' is a semantic label in our case, distinct from binary in that it is assumed to be characters, but doesn't define encoding. What I would be in favor of is a new utf8 type, which would define encoding. But without that, I don't think the restriction should be placed on string. @Jonathan: If your api method takes an integer, but in your application the only valid values are even numbers, should we include that validation in Thrift as well? Hostage taking seems a little extreme. I prefer to think of it as the boyscout helping the old lady across the street, but not making sure she has two legs. If it doesn't bother her, it doesn't bother me.
        Hide
        David Reiss added a comment -

        Actually, I have another question for you, Jonathan. How do you think Thrift should handle strings in C++?

        Show
        David Reiss added a comment - Actually, I have another question for you, Jonathan. How do you think Thrift should handle strings in C++?
        Hide
        David Reiss added a comment -

        Two of our active committers use both Java (a strings-are-Unicode language) and Ruby (a strings-are-bytes language). I'd be interested to hear their thoughts. Should Ruby verify that all strings are UTF-8 before writing them out?

        Show
        David Reiss added a comment - Two of our active committers use both Java (a strings-are-Unicode language) and Ruby (a strings-are-bytes language). I'd be interested to hear their thoughts. Should Ruby verify that all strings are UTF-8 before writing them out?
        Hide
        Jonathan Ellis added a comment -

        > Java is not "hostage". You document what you consider to be acceptable values and throw an exception when the client supplies something else.

        In other words, because thrift won't enforce any semantics at all for `string` beyond those implied by `binary`, I have to do the validation by hand.

        That sure sounds like being held hostage by the lowest common denominator to me.

        Is this horse dead yet?

        Show
        Jonathan Ellis added a comment - > Java is not "hostage". You document what you consider to be acceptable values and throw an exception when the client supplies something else. In other words, because thrift won't enforce any semantics at all for `string` beyond those implied by `binary`, I have to do the validation by hand. That sure sounds like being held hostage by the lowest common denominator to me. Is this horse dead yet?
        Hide
        David Reiss added a comment -

        I think you are assuming that the "string" is a "Unicode string" or a "string of Unicode code points". Just because something is a "string" doesn't mean it is Unicode.

        Calling it a string when it may or may not be is holding languages that do have real string support hostage to those that don't.

        It is a string, regardless of whether or not it is UTF-8 or Unicode. And Java is not "hostage". You document what you consider to be acceptable values and throw an exception when the client supplies something else.

        Show
        David Reiss added a comment - I think you are assuming that the "string" is a "Unicode string" or a "string of Unicode code points". Just because something is a "string" doesn't mean it is Unicode. Calling it a string when it may or may not be is holding languages that do have real string support hostage to those that don't. It is a string, regardless of whether or not it is UTF-8 or Unicode. And Java is not "hostage". You document what you consider to be acceptable values and throw an exception when the client supplies something else.
        Hide
        Jonathan Ellis added a comment -

        > As in all C++, Ruby, PHP, Perl, and Erlang programs, it is the simply application's responsibility to ensure that the string is properly UTF-8 encoded on writing and to interpret the string ast UTF-8 on reading.

        Then you are really passing binary data around and your IDL should reflect that. Calling it a string implies there are semantics beyond a bunch of bytes for which it's the application's responsibility to derive any meaning from.

        Calling it a string when it may or may not be is holding languages that do have real string support hostage to those that don't.

        Show
        Jonathan Ellis added a comment - > As in all C++, Ruby, PHP, Perl, and Erlang programs, it is the simply application's responsibility to ensure that the string is properly UTF-8 encoded on writing and to interpret the string ast UTF-8 on reading. Then you are really passing binary data around and your IDL should reflect that. Calling it a string implies there are semantics beyond a bunch of bytes for which it's the application's responsibility to derive any meaning from. Calling it a string when it may or may not be is holding languages that do have real string support hostage to those that don't.
        Hide
        David Reiss added a comment -

        Man, I sleep in one morning and miss the whole party.

        Since we've strayed a bit from the original topic, let me ask a quick question to make sure we're at least aware of the scope of the discussion: What concrete changes would you like to see in Thrift (in the area of encodings) other than the return type of readString in Python 2?

        Now, let me just respond to a bunch of stuff in chronological order...

        Can we split the difference and have some kind of configuration option to "enforce UTF-8" for Python (but make it off by default)?

        I'd be fine with this, though the change to the extension module is more complicated that the change to the pure-python stuff.

        what do you think of adding a new annotation (e.g. string.encoding) for specifying the actual string encoding?

        I'd also be fine with that. See THRIFT-414 for my planned approach.

        I'd deprecate str strings and, at some point in the future, support unicode strings only

        If you're talking about Python, I think we should definitely do this for Python 3, but never do it for Python 2. If you're talking about all languages, I think it is unrealistic because C++, PHP, Perl, and Erlang are not going to have robust native Unicode support any time in the foreseeable future.

        is utf8 strings the right design decision, absent backwards-compatibility concerns [...] I think some people are reluctant to admit 1. because they are afraid of 2.

        I think that it is not. Requiring UTF-8 might seem sensible in a mostly-English environment, but having support for UTF-16 or a Chinese-oriented encoding (for example) can be very useful. I'm fine saying that Thrift strings should be UTF-8 encoded unless otherwise specified (like, by an annotation), but enforcing it in environments that could benefit from a non-UTF-8 encoding is harmful.

        I think adding user-specified encodings adds more complexity than it's worth

        I think allowing the user to specify string encoding just adds complexity

        I disagree. I think that if we say that strings should default to UTF-8 unless otherwise annotated, it is not a big deal. I think that removing the ability to support other encodings is a big deal.

        made restrictions on the types for map keys

        I haven't ruled this out, if you want to talk about it. But it should be a separate issue. And if you are serious, we should do it before the release.

        made binary its own standalone type

        This is effectively the case already. The only possible problems arise when you change a field from string to binary without changing the field id (which is what Jonathan is suggesting, btw), and even then, I think only in the JSON protocol.

        If you are using the current code and sending binary data as a "string" then you are probably using Python on both client and server

        C++, Ruby, Perl, PHP, and Erlang also do this.

        if I am understanding correctly, Python3 is now in the same camp as Java and C# - is that correct?

        Exactly. The "str" type in Python 3 is effectively the same as the "unicode" object in Python 2. It is a string of Unicode code points that cannot be used in a context where bytes are expected.

        If so, maybe we want to treat Python3 as a different target language from Python2

        Definitely.

        I detect a little bit of pro-Python2 on David's part

        That is not my intention. I actually think the Java/Python3 data model makes more sense in most contexts. But I think that we should treat Python 2 as Python 2 (AFAIK, Thrift doesn't work in Python 3), which means that strings are strs. A few examples of this: repr returns a str. Exception messages are strs. "" is a str. Data read from files (even not opened in binary mode) are strs.

        Right now most thrift implementations cannot talk to my Java server and that is broken.

        We interoperate via Thrift across C++, Ruby, Java, Python2 and Erlang here and everything works just fine. We just make limited use of the 'string' type - and make sure that applications only send UTF-8 data via 'string'

        In other words, you are sending binary data that happens to be an encoded string and calling that a string, which it is not. It is binary data. That's working around one bug with another in my book.

        Chad is right. As in all C++, Ruby, PHP, Perl, and Erlang programs, it is the simply application's responsibility to ensure that the string is properly UTF-8 encoded on writing and to interpret the string ast UTF-8 on reading. I think you are assuming that the "string" is a "Unicode string" or a "string of Unicode code points". In Thrift, this is not the case. It is a string of bytes (that are presumably representing text), and it is up to the application to ensure that the bytes make sense. Now, if we want to establish a convention that the bytes should be a UTF-8-encoded Unicode string unless otherwise annotated, that's fine with me, but I think that mandating UTF-8 is a harmful restriction, mandating Unicode, while probably fine, is not without downsides, and forcing applications to use special types for strings is pretty much out of the question.

        In 2009 a language that doesn't support unicode is barely usable, and will almost certainly support unicode soon.

        AFAIK all the thrift languages do support unicode already but I could be wrong on one or two.

        There is a difference between supporting Unicode and having native-feeling support for unicode. If you mean native-feeling support, then most languages do not have it.

        • C++ has wstring, which can be used for Unicode strings, but they are rarely used and there is no support for encoding and decoding. The native-feeling way to write C++ is to use string-of-bytes std::string.
        • Ruby and PHP's string type is a string of bytes. They have special functions for treating them as pre-encoded Unicode strings. Believe it or not, it seems like PHP's support here might actually be better than Ruby's.
        • Erlang is completely Unicode-oblivious.

        The only reason that this discussion is coming up here is that Python is the only Thrift language (AFAIK) that is on the fence between strings as bytes and strings as code points.

        Show
        David Reiss added a comment - Man, I sleep in one morning and miss the whole party. Since we've strayed a bit from the original topic, let me ask a quick question to make sure we're at least aware of the scope of the discussion: What concrete changes would you like to see in Thrift (in the area of encodings) other than the return type of readString in Python 2? Now, let me just respond to a bunch of stuff in chronological order... Can we split the difference and have some kind of configuration option to "enforce UTF-8" for Python (but make it off by default)? I'd be fine with this, though the change to the extension module is more complicated that the change to the pure-python stuff. what do you think of adding a new annotation (e.g. string.encoding) for specifying the actual string encoding? I'd also be fine with that. See THRIFT-414 for my planned approach. I'd deprecate str strings and, at some point in the future, support unicode strings only If you're talking about Python, I think we should definitely do this for Python 3, but never do it for Python 2. If you're talking about all languages, I think it is unrealistic because C++, PHP, Perl, and Erlang are not going to have robust native Unicode support any time in the foreseeable future. is utf8 strings the right design decision, absent backwards-compatibility concerns [...] I think some people are reluctant to admit 1. because they are afraid of 2. I think that it is not. Requiring UTF-8 might seem sensible in a mostly-English environment, but having support for UTF-16 or a Chinese-oriented encoding (for example) can be very useful. I'm fine saying that Thrift strings should be UTF-8 encoded unless otherwise specified (like, by an annotation), but enforcing it in environments that could benefit from a non-UTF-8 encoding is harmful. I think adding user-specified encodings adds more complexity than it's worth I think allowing the user to specify string encoding just adds complexity I disagree. I think that if we say that strings should default to UTF-8 unless otherwise annotated, it is not a big deal. I think that removing the ability to support other encodings is a big deal. made restrictions on the types for map keys I haven't ruled this out, if you want to talk about it. But it should be a separate issue. And if you are serious, we should do it before the release. made binary its own standalone type This is effectively the case already. The only possible problems arise when you change a field from string to binary without changing the field id (which is what Jonathan is suggesting, btw), and even then, I think only in the JSON protocol. If you are using the current code and sending binary data as a "string" then you are probably using Python on both client and server C++, Ruby, Perl, PHP, and Erlang also do this. if I am understanding correctly, Python3 is now in the same camp as Java and C# - is that correct? Exactly. The "str" type in Python 3 is effectively the same as the "unicode" object in Python 2. It is a string of Unicode code points that cannot be used in a context where bytes are expected. If so, maybe we want to treat Python3 as a different target language from Python2 Definitely. I detect a little bit of pro-Python2 on David's part That is not my intention. I actually think the Java/Python3 data model makes more sense in most contexts. But I think that we should treat Python 2 as Python 2 (AFAIK, Thrift doesn't work in Python 3), which means that strings are strs. A few examples of this: repr returns a str. Exception messages are strs. "" is a str. Data read from files (even not opened in binary mode) are strs. Right now most thrift implementations cannot talk to my Java server and that is broken. We interoperate via Thrift across C++, Ruby, Java, Python2 and Erlang here and everything works just fine. We just make limited use of the 'string' type - and make sure that applications only send UTF-8 data via 'string' In other words, you are sending binary data that happens to be an encoded string and calling that a string, which it is not. It is binary data. That's working around one bug with another in my book. Chad is right. As in all C++, Ruby, PHP, Perl, and Erlang programs, it is the simply application's responsibility to ensure that the string is properly UTF-8 encoded on writing and to interpret the string ast UTF-8 on reading. I think you are assuming that the "string" is a "Unicode string" or a "string of Unicode code points". In Thrift, this is not the case. It is a string of bytes (that are presumably representing text), and it is up to the application to ensure that the bytes make sense. Now, if we want to establish a convention that the bytes should be a UTF-8-encoded Unicode string unless otherwise annotated, that's fine with me, but I think that mandating UTF-8 is a harmful restriction, mandating Unicode, while probably fine, is not without downsides, and forcing applications to use special types for strings is pretty much out of the question. In 2009 a language that doesn't support unicode is barely usable, and will almost certainly support unicode soon. AFAIK all the thrift languages do support unicode already but I could be wrong on one or two. There is a difference between supporting Unicode and having native-feeling support for unicode. If you mean native-feeling support, then most languages do not have it. C++ has wstring, which can be used for Unicode strings, but they are rarely used and there is no support for encoding and decoding. The native-feeling way to write C++ is to use string-of-bytes std::string. Ruby and PHP's string type is a string of bytes. They have special functions for treating them as pre-encoded Unicode strings. Believe it or not, it seems like PHP's support here might actually be better than Ruby's. Erlang is completely Unicode-oblivious. The only reason that this discussion is coming up here is that Python is the only Thrift language (AFAIK) that is on the fence between strings as bytes and strings as code points.
        Hide
        Chad Walters added a comment -

        In other words, you are sending binary data that happens to be an encoded string and calling that a string, which it is not. It is binary data. That's working around one bug with another in my book.

        Partly true. 'string' was a bit overloaded and was used for arbitrary binary data as well. Facebook originally was mostly concerned with C++ and PHP. The string encoding issue didn't really crop up until Java started getting some real usage (my understanding is that it had been implemented by FB but not heavily exercised), which came after the initial open-sourcing of the code. Whether you see this as a bug or not depends if you think calling something a 'string' means that it is a C++ std:string (which can certainly be arbitrary binary data) or a Java String (which has an encoding attached to it). My personal take on it is that the 'string' type is unfortunately a bit schizophrenic around this – in C++ it is std::string and in Java it is String. So if you want to talk to Java from C++ using 'string', you better submit to the strictures of Java String – but if you only care about C++ and languages with other encoding-agnostic string primitives, then you don't.

        In 2009 a language that doesn't support unicode is barely usable, and will almost certainly support unicode soon.

        Perhaps. I don't think C++ is going to change the semantics of std::string any time in the near future, however. I guess opinions will vary about whether C++ qualifies as "barely usable", "highly usable", or "eye-bleedingly unusable".

        Show
        Chad Walters added a comment - In other words, you are sending binary data that happens to be an encoded string and calling that a string, which it is not. It is binary data. That's working around one bug with another in my book. Partly true. 'string' was a bit overloaded and was used for arbitrary binary data as well. Facebook originally was mostly concerned with C++ and PHP. The string encoding issue didn't really crop up until Java started getting some real usage (my understanding is that it had been implemented by FB but not heavily exercised), which came after the initial open-sourcing of the code. Whether you see this as a bug or not depends if you think calling something a 'string' means that it is a C++ std:string (which can certainly be arbitrary binary data) or a Java String (which has an encoding attached to it). My personal take on it is that the 'string' type is unfortunately a bit schizophrenic around this – in C++ it is std::string and in Java it is String. So if you want to talk to Java from C++ using 'string', you better submit to the strictures of Java String – but if you only care about C++ and languages with other encoding-agnostic string primitives, then you don't. In 2009 a language that doesn't support unicode is barely usable, and will almost certainly support unicode soon. Perhaps. I don't think C++ is going to change the semantics of std::string any time in the near future, however. I guess opinions will vary about whether C++ qualifies as "barely usable", "highly usable", or "eye-bleedingly unusable".
        Hide
        Chad Walters added a comment -

        WRT backwards compatibility breaking changes, my sympathies lie on the side of biting off a set of compatibility-breaking changes before the first release. The biggest proponents of full backwards compatibility have probably been David and Ben. I don't think it is a coincidence that they are also the ones with the biggest existing investment in persisted Thrift data. I think we need to at least respect the fact that Facebook had made a big investment in Thrift prior to open-sourcing it and that we can't strand Facebook completely in decisions that we make to fix earlier warts.

        b. nobody is forcing you to upgrade. if svn 700000-whatever of thrift works for you, keep using it. if necessary, backporting fixes to a branch is not an unheard-of strategy either.

        I don't think it is reasonable to ask Facebook to do this.

        c. you have to weigh the pain against current users from changing behavior, to pain for future users from NOT changing. hopefully the current user base is a fraction of what it will be in a few years. unicode is not going to get less popular in that time.

        I agree completely with you on this one – I do wish the existing users would sign up for this even at the expense of some amount of pain now in the interest of the future of the project.

        d. if you can't change broken behavior before there is an official release, when CAN you change it?

        I also agree with you on this one.

        However, and despite my agreement, I don't see this changing unless David and Ben sign up to it.

        In the absence of that, I think adding a new type might be the best course of action.

        Show
        Chad Walters added a comment - WRT backwards compatibility breaking changes, my sympathies lie on the side of biting off a set of compatibility-breaking changes before the first release. The biggest proponents of full backwards compatibility have probably been David and Ben. I don't think it is a coincidence that they are also the ones with the biggest existing investment in persisted Thrift data. I think we need to at least respect the fact that Facebook had made a big investment in Thrift prior to open-sourcing it and that we can't strand Facebook completely in decisions that we make to fix earlier warts. b. nobody is forcing you to upgrade. if svn 700000-whatever of thrift works for you, keep using it. if necessary, backporting fixes to a branch is not an unheard-of strategy either. I don't think it is reasonable to ask Facebook to do this. c. you have to weigh the pain against current users from changing behavior, to pain for future users from NOT changing. hopefully the current user base is a fraction of what it will be in a few years. unicode is not going to get less popular in that time. I agree completely with you on this one – I do wish the existing users would sign up for this even at the expense of some amount of pain now in the interest of the future of the project. d. if you can't change broken behavior before there is an official release, when CAN you change it? I also agree with you on this one. However, and despite my agreement, I don't see this changing unless David and Ben sign up to it. In the absence of that, I think adding a new type might be the best course of action.
        Hide
        Jonathan Ellis added a comment -

        > Why is this? We interoperate via Thrift across C++, Ruby, Java, Python2 and Erlang here and everything works just fine. We just make limited use of the 'string' type - and make sure that applications only send UTF-8 data via 'string'.

        In other words, you are sending binary data that happens to be an encoded string and calling that a string, which it is not. It is binary data. That's working around one bug with another in my book.

        > From their perspective, they don't want to "dumb down" their type system

        In 2009 a language that doesn't support unicode is barely usable, and will almost certainly support unicode soon.

        AFAIK all the thrift languages do support unicode already but I could be wrong on one or two.

        Show
        Jonathan Ellis added a comment - > Why is this? We interoperate via Thrift across C++, Ruby, Java, Python2 and Erlang here and everything works just fine. We just make limited use of the 'string' type - and make sure that applications only send UTF-8 data via 'string'. In other words, you are sending binary data that happens to be an encoded string and calling that a string, which it is not. It is binary data. That's working around one bug with another in my book. > From their perspective, they don't want to "dumb down" their type system In 2009 a language that doesn't support unicode is barely usable, and will almost certainly support unicode soon. AFAIK all the thrift languages do support unicode already but I could be wrong on one or two.
        Hide
        Chad Walters added a comment -

        Jonathan, when I said there was not complete consistency, I was referring to consistently choosing between maximum interoperability across languages vs more flexibility within given languages to follow their idioms. The choice in various cases has been somewhat driven by pragmatics and somewhat by historical factors.

        WRT the treatment of strings, there is consistency of a kind here: in this instance, the choice was made for more flexibility within given languages (although I agree with your characterization that concerns about backwards-compatibility breakage also played a role in the choice). The current situation is that 'string' is free to contain arbitrary data in languages where that is supported; but clearly not so in language that enforce encoding on strings (as in Java and C#). If you want to interoperate with those languages, then make sure your application only passes UTF8 encoded data in 'string'.

        I think the issue is mostly that you don't like the answer you are getting and partly that the differences between Python2 and Python3 with regards to enforcing encoding (if I am understanding correctly, Python3 is now in the same camp as Java and C# – is that correct? If so, maybe we want to treat Python3 as a different target language from Python2, which might sidestep some of the issues here since I detect a little bit of pro-Python2 on David's part vs pro-Python3 on your part).

        Isn't that mostly a non-issue, though? If you are using the current code and sending binary data as a "string" then you are probably using Python on both client and server or things would already be broken.

        I am not sure it's a non-issue. You are mistaken about using only Python on both side (or the same language on both sides if that is what you meant). You can currently send binary data via 'string' comfortably across C+, Ruby, PHP, etc, as well. I may be wrong, but I think readBinary() and readString() can return different types in different languages (although I do see that in C+ they both return std::string, which at least knocks out part of my concern – are there other languages where the types might matter).

        >Right now most thrift implementations cannot talk to my Java server and that is broken.

        Why is this? We interoperate via Thrift across C++, Ruby, Java, Python2 and Erlang here and everything works just fine. We just make limited use of the 'string' type – and make sure that applications only send UTF-8 data via 'string'.

        Consider that not all Thrift shops use all languages. From their perspective, they don't want to "dumb down" their type system and flexibility because of some language that they don't care about interoperating with.

        That said (and as I said before) I am totally sympathetic to your concerns – my preference would be that we more consistently choose in favor of maximal interoperability. I am just pointing out that the state of things is not as bad as you seem to believe they are and that these choices have not been completely arbitrary or without merit.

        Show
        Chad Walters added a comment - Jonathan, when I said there was not complete consistency, I was referring to consistently choosing between maximum interoperability across languages vs more flexibility within given languages to follow their idioms. The choice in various cases has been somewhat driven by pragmatics and somewhat by historical factors. WRT the treatment of strings, there is consistency of a kind here: in this instance, the choice was made for more flexibility within given languages (although I agree with your characterization that concerns about backwards-compatibility breakage also played a role in the choice). The current situation is that 'string' is free to contain arbitrary data in languages where that is supported; but clearly not so in language that enforce encoding on strings (as in Java and C#). If you want to interoperate with those languages, then make sure your application only passes UTF8 encoded data in 'string'. I think the issue is mostly that you don't like the answer you are getting and partly that the differences between Python2 and Python3 with regards to enforcing encoding (if I am understanding correctly, Python3 is now in the same camp as Java and C# – is that correct? If so, maybe we want to treat Python3 as a different target language from Python2, which might sidestep some of the issues here since I detect a little bit of pro-Python2 on David's part vs pro-Python3 on your part). Isn't that mostly a non-issue, though? If you are using the current code and sending binary data as a "string" then you are probably using Python on both client and server or things would already be broken. I am not sure it's a non-issue. You are mistaken about using only Python on both side (or the same language on both sides if that is what you meant). You can currently send binary data via 'string' comfortably across C+ , Ruby, PHP, etc, as well. I may be wrong, but I think readBinary() and readString() can return different types in different languages (although I do see that in C + they both return std::string, which at least knocks out part of my concern – are there other languages where the types might matter). >Right now most thrift implementations cannot talk to my Java server and that is broken. Why is this? We interoperate via Thrift across C++, Ruby, Java, Python2 and Erlang here and everything works just fine. We just make limited use of the 'string' type – and make sure that applications only send UTF-8 data via 'string'. Consider that not all Thrift shops use all languages. From their perspective, they don't want to "dumb down" their type system and flexibility because of some language that they don't care about interoperating with. That said (and as I said before) I am totally sympathetic to your concerns – my preference would be that we more consistently choose in favor of maximal interoperability. I am just pointing out that the state of things is not as bad as you seem to believe they are and that these choices have not been completely arbitrary or without merit.
        Hide
        Carlos Valiente added a comment -

        Adding a new type for Unicode strings seems fine to me as well

        Would it make sense to call that new type utf8, to stress the fact that what gets sent through the wire are UTF-8 encoded strings?

        Show
        Carlos Valiente added a comment - Adding a new type for Unicode strings seems fine to me as well Would it make sense to call that new type utf8 , to stress the fact that what gets sent through the wire are UTF-8 encoded strings?
        Hide
        Jonathan Ellis added a comment -

        > Thrift has not always made a consistent choice on this.

        And that's the problem; as I said above, I can live with either choice as long as it's made consistently. Right now most thrift implementations cannot talk to my Java server and that is broken.

        > Adding a new type for Unicode strings seems fine to me as well

        Maybe that is the way to go then.

        Show
        Jonathan Ellis added a comment - > Thrift has not always made a consistent choice on this. And that's the problem; as I said above, I can live with either choice as long as it's made consistently. Right now most thrift implementations cannot talk to my Java server and that is broken. > Adding a new type for Unicode strings seems fine to me as well Maybe that is the way to go then.
        Hide
        Jonathan Ellis added a comment -

        > I am not sure that it's as easy as you suggest. I think this will break code since the return type of readBinary() is not the same as the return type of readString().

        Isn't that mostly a non-issue, though? If you are using the current code and sending binary data as a "string" then you are probably using Python on both client and server or things would already be broken.

        Show
        Jonathan Ellis added a comment - > I am not sure that it's as easy as you suggest. I think this will break code since the return type of readBinary() is not the same as the return type of readString(). Isn't that mostly a non-issue, though? If you are using the current code and sending binary data as a "string" then you are probably using Python on both client and server or things would already be broken.
        Hide
        Chad Walters added a comment -

        There is a tension in Thrift between allowing types to have their natural meaning in each language and having complete interoperability across the full suite of languages. This is just one example.

        Should I limit the expressiveness of Thrift structures I can use in one language (C++) because of the strictures of some other language amongst those supported by Thrift that I may or may not be using? Thrift has not always made a consistent choice on this.

        On the full interoperability side:
        – unsigned integers are not supported because a number of languages don't support them natively.

        On the side of "the IDL writer and/or the applications are responsible for guaranteeing interoperability":
        – the application is responsible for interoperability of strings between, say, Java and C++ – that is, if you want to interoperate, you need to make sure that C++ is sending only UTF8 encoded data in strings
        – the IDL writer is responsible for determining if map keys should only be primitive types (required by some languages – and the JSON protocol btw) or if they can be structures or containers as well (limiting interoperability)

        I personally have always leaned on the side of more interoperability and correcting some of the early architectural warts. If it had been up to me, we would have bitten off the backwards compatibility break a while back, changed string to only be UTF-8, made restrictions on the types for map keys, made binary its own standalone type, etc. However, I can understand the concerns of those with a big investment in persisted Thrift data who have pushed back against non-backwards compatible changes.

        I'll repeat my suggestion and expand on it a little further: Thrift could operate in 2 modes: "more flexibility" or "more compatibility". Under "more flexibility", it would operate more or less as things are today (eg: I wouldn't re-examine the signed vs unsigned decision – too many opportunities for foot shooting there). Under "more compatibility", strings would be required to be UTF-8, map keys would be required to be primitives, etc. I would expect that most people adopting Thrift now would select "more compatibility" but those with specific needs could use the "more flexibility" mode.

        Adding a new type for Unicode strings seems fine to me as well – in fact there are already suitable unused type constants in TProtocol.h (UTF8, along with UTF7 and UTF16) (see http://svn.apache.org/viewvc/incubator/thrift/trunk/lib/cpp/src/protocol/TProtocol.h?view=markup) (Side note: Why are those there? I always figured it was leftover cruft from something that was worked on at FB and then abandoned. Should they be cleaned up?). The only real issue I can see is that it entails changes to all the protocols and code generators across all the languages – it's just work but its not hard work.

        a. as I have said several times (to no contradictions), simply recompiling the IDL after changing to binary is a virtually painless way to get back the old behavior of treating all data as binary no matter what it was declared as

        I hate to say this since doing what you said would also smooth the way for promoting binary to a full fledged type but I am not sure that it's as easy as you suggest. I think this will break code since the return type of readBinary() is not the same as the return type of readString().

        Show
        Chad Walters added a comment - There is a tension in Thrift between allowing types to have their natural meaning in each language and having complete interoperability across the full suite of languages. This is just one example. Should I limit the expressiveness of Thrift structures I can use in one language (C++) because of the strictures of some other language amongst those supported by Thrift that I may or may not be using? Thrift has not always made a consistent choice on this. On the full interoperability side: – unsigned integers are not supported because a number of languages don't support them natively. On the side of "the IDL writer and/or the applications are responsible for guaranteeing interoperability": – the application is responsible for interoperability of strings between, say, Java and C++ – that is, if you want to interoperate, you need to make sure that C++ is sending only UTF8 encoded data in strings – the IDL writer is responsible for determining if map keys should only be primitive types (required by some languages – and the JSON protocol btw) or if they can be structures or containers as well (limiting interoperability) I personally have always leaned on the side of more interoperability and correcting some of the early architectural warts. If it had been up to me, we would have bitten off the backwards compatibility break a while back, changed string to only be UTF-8, made restrictions on the types for map keys, made binary its own standalone type, etc. However, I can understand the concerns of those with a big investment in persisted Thrift data who have pushed back against non-backwards compatible changes. I'll repeat my suggestion and expand on it a little further: Thrift could operate in 2 modes: "more flexibility" or "more compatibility". Under "more flexibility", it would operate more or less as things are today (eg: I wouldn't re-examine the signed vs unsigned decision – too many opportunities for foot shooting there). Under "more compatibility", strings would be required to be UTF-8, map keys would be required to be primitives, etc. I would expect that most people adopting Thrift now would select "more compatibility" but those with specific needs could use the "more flexibility" mode. Adding a new type for Unicode strings seems fine to me as well – in fact there are already suitable unused type constants in TProtocol.h (UTF8, along with UTF7 and UTF16) (see http://svn.apache.org/viewvc/incubator/thrift/trunk/lib/cpp/src/protocol/TProtocol.h?view=markup ) (Side note: Why are those there? I always figured it was leftover cruft from something that was worked on at FB and then abandoned. Should they be cleaned up?). The only real issue I can see is that it entails changes to all the protocols and code generators across all the languages – it's just work but its not hard work. a. as I have said several times (to no contradictions), simply recompiling the IDL after changing to binary is a virtually painless way to get back the old behavior of treating all data as binary no matter what it was declared as I hate to say this since doing what you said would also smooth the way for promoting binary to a full fledged type but I am not sure that it's as easy as you suggest. I think this will break code since the return type of readBinary() is not the same as the return type of readString().
        Hide
        Andrew McGeachie added a comment -

        Adding my (potentially naive) opinion to the mix:

        I think allowing the user to specify string encoding just adds complexity, and possible a bit of headache for the runtime library maintainers. I think we should either say "Strings should be UTF8 encoded, across the board", or create a new well defined unicode-string type specifier.

        Show
        Andrew McGeachie added a comment - Adding my (potentially naive) opinion to the mix: I think allowing the user to specify string encoding just adds complexity, and possible a bit of headache for the runtime library maintainers. I think we should either say "Strings should be UTF8 encoded, across the board", or create a new well defined unicode-string type specifier.
        Hide
        Jonathan Ellis added a comment -

        @Esteve:

        Adding a usting type with well-defined cross-platform semantics is a reasonable approach. I think adding user-specified encodings adds more complexity than it's worth, but I don't care that much.

        Show
        Jonathan Ellis added a comment - @Esteve: Adding a usting type with well-defined cross-platform semantics is a reasonable approach. I think adding user-specified encodings adds more complexity than it's worth, but I don't care that much.
        Hide
        Jonathan Ellis added a comment -

        @Chad:

        So there are two questions here:

        1. is utf8 strings the right design decision, absent backwards-compatibility concerns

        2. is it worth breaking back-compat for

        I think some people are reluctant to admit 1. because they are afraid of 2.

        I think the case for 2. can be made as follows:

        a. as I have said several times (to no contradictions), simply recompiling the IDL after changing to binary is a virtually painless way to get back the old behavior of treating all data as binary no matter what it was declared as

        b. nobody is forcing you to upgrade. if svn 700000-whatever of thrift works for you, keep using it. if necessary, backporting fixes to a branch is not an unheard-of strategy either.

        c. you have to weigh the pain against current users from changing behavior, to pain for future users from NOT changing. hopefully the current user base is a fraction of what it will be in a few years. unicode is not going to get less popular in that time.

        d. if you can't change broken behavior before there is an official release, when CAN you change it?

        Now, perhaps it's worth the time to briefly explain my use case.

        I work on the Cassandra distributed database, where we use thrift to let clients in any supported language talk to the Java server. Keys are `string`s so they absolutely have to be compatible cross-platform. Currently they are not. Telling users that "thrift doesn't really support unicode, so in Python you have to set this flag, and in other languages it doesn't work at all and it will be an uphill battle to get a patch accepted" is a non-starter. Cassandra has a high enough barrier to entry as it is, that adding to it unnecessarily is foolish.

        Without real unicode support we'll have to switch to binary to get behavior that is at least consistent cross platform.

        Show
        Jonathan Ellis added a comment - @Chad: So there are two questions here: 1. is utf8 strings the right design decision, absent backwards-compatibility concerns 2. is it worth breaking back-compat for I think some people are reluctant to admit 1. because they are afraid of 2. I think the case for 2. can be made as follows: a. as I have said several times (to no contradictions), simply recompiling the IDL after changing to binary is a virtually painless way to get back the old behavior of treating all data as binary no matter what it was declared as b. nobody is forcing you to upgrade. if svn 700000-whatever of thrift works for you, keep using it. if necessary, backporting fixes to a branch is not an unheard-of strategy either. c. you have to weigh the pain against current users from changing behavior, to pain for future users from NOT changing. hopefully the current user base is a fraction of what it will be in a few years. unicode is not going to get less popular in that time. d. if you can't change broken behavior before there is an official release, when CAN you change it? Now, perhaps it's worth the time to briefly explain my use case. I work on the Cassandra distributed database, where we use thrift to let clients in any supported language talk to the Java server. Keys are `string`s so they absolutely have to be compatible cross-platform. Currently they are not. Telling users that "thrift doesn't really support unicode, so in Python you have to set this flag, and in other languages it doesn't work at all and it will be an uphill battle to get a patch accepted" is a non-starter. Cassandra has a high enough barrier to entry as it is, that adding to it unnecessarily is foolish. Without real unicode support we'll have to switch to binary to get behavior that is at least consistent cross platform.
        Hide
        Esteve Fernandez added a comment -

        I'd rather use unicode string everywhere, but if we have to maintain backwards compatibility with legacy code, what do you think of adding a new annotation (e.g. string.encoding) for specifying the actual string encoding? Something like this:

        typedef string (string.encoding = "utf8") ustring

        and then use ustring instead of string. In any case, I'd deprecate str strings and, at some point in the future, support unicode strings only. In the case of Python, that would make it easier to support Python 3.0 (as it only supports unicode)

        I'm not sure if I'll be able to write a patch for this in the next couple of days, but will try if there's consensus on using annotations for this.

        Show
        Esteve Fernandez added a comment - I'd rather use unicode string everywhere, but if we have to maintain backwards compatibility with legacy code, what do you think of adding a new annotation (e.g. string.encoding) for specifying the actual string encoding? Something like this: typedef string (string.encoding = "utf8") ustring and then use ustring instead of string. In any case, I'd deprecate str strings and, at some point in the future, support unicode strings only. In the case of Python, that would make it easier to support Python 3.0 (as it only supports unicode) I'm not sure if I'll be able to write a patch for this in the next couple of days, but will try if there's consensus on using annotations for this.
        Hide
        Chad Walters added a comment -

        Jonathan, you have stumbled on an old ugly problem in Thrift. The 'string' type was originally the only way to pass arbitrary binary data around but this didn't actually work properly in Java because of its requirement that String's carry an encoding. The 'binary' subtype was introduced to fix this. There was no agreement that string should enforce UTF-8 encoding, even though this meant an inability to enforce interoperability with Java, probably driven in large part by pre-existing data at Facebook (and other places?) where strings were already for binary data in C++ (at the time, Java was somewhat of a second-class citizen for Thrift – IIRC Facebook's emphasis was on C++. Python, PHP). Somehow I imagine that the backwards compatibility issue is not going to be taken off the table.

        I may not fully understand the issues with Python so forgive me if this suggestion is naive: Can we split the difference and have some kind of configuration option to "enforce UTF-8" for Python (but make it off by default)?

        The policy would then be: use non-UTF8 encoding in strings if you wish, but realize that you will not interoperate correctly with Java and C# all the time or with Python when "enforce UTF-8" mode is on.

        Show
        Chad Walters added a comment - Jonathan, you have stumbled on an old ugly problem in Thrift. The 'string' type was originally the only way to pass arbitrary binary data around but this didn't actually work properly in Java because of its requirement that String's carry an encoding. The 'binary' subtype was introduced to fix this. There was no agreement that string should enforce UTF-8 encoding, even though this meant an inability to enforce interoperability with Java, probably driven in large part by pre-existing data at Facebook (and other places?) where strings were already for binary data in C++ (at the time, Java was somewhat of a second-class citizen for Thrift – IIRC Facebook's emphasis was on C++. Python, PHP). Somehow I imagine that the backwards compatibility issue is not going to be taken off the table. I may not fully understand the issues with Python so forgive me if this suggestion is naive: Can we split the difference and have some kind of configuration option to "enforce UTF-8" for Python (but make it off by default)? The policy would then be: use non-UTF8 encoding in strings if you wish, but realize that you will not interoperate correctly with Java and C# all the time or with Python when "enforce UTF-8" mode is on.
        Hide
        Jonathan Ellis added a comment -

        From IRC:

        ***johano votes for utf-8 strings and a binary type
        seliopou: ima go with the string for utf8 and binary for binary
        ahfeel: i agree with you guys, strings => utf8 and binary as a type
        bryanduxbury: re: utf8, we basically need to convince dreiss and other doubters that we should specify that encoding throughout

        Show
        Jonathan Ellis added a comment - From IRC: ***johano votes for utf-8 strings and a binary type seliopou: ima go with the string for utf8 and binary for binary ahfeel: i agree with you guys, strings => utf8 and binary as a type bryanduxbury: re: utf8, we basically need to convince dreiss and other doubters that we should specify that encoding throughout
        Hide
        Jonathan Ellis added a comment -

        > The consistency is that in every Thrift language, we use the native "string" type to represent the Thrift "string" type.

        Then you should be honest and just use binary everywhere, because native string types are not at all cross-platform.

        > We do not try to force Unicode semantics on languages where they are non-idiomatic.

        I've explained what modern Python idiom is: strings may be ascii `str` or any `unicode`. Binary data is also represented as `str` but that does not make it a "string."

        So I'm very skeptical of this appeal to idiom when the current behavior is NOT idimatic for Python any time since the unicode type was added. (2.0, october 2000.)

        > For what it's worth, protocol buffers use a blob type for strings in C++.

        See http://code.google.com/apis/protocolbuffers/docs/proto.html. "A string must always contain UTF-8 encoded or 7-bit ASCII text."

        > It gives application writers the option of putting unicode objects in their Thrift structures

        to be read out as str? Doing half of encode/decode is worse than not doing it at all.

        > We do: str

        You just admitted that when you write unicode it reads back as str.

        "if you have code that is using the Thrift string type when it should be binary, s/string/binary/ in your IDL is a virtually painless change to make."

        Assuming for the sake of argument that strings should be utf8 (which includes ascii!), do you agree with the above statement?

        Show
        Jonathan Ellis added a comment - > The consistency is that in every Thrift language, we use the native "string" type to represent the Thrift "string" type. Then you should be honest and just use binary everywhere, because native string types are not at all cross-platform. > We do not try to force Unicode semantics on languages where they are non-idiomatic. I've explained what modern Python idiom is: strings may be ascii `str` or any `unicode`. Binary data is also represented as `str` but that does not make it a "string." So I'm very skeptical of this appeal to idiom when the current behavior is NOT idimatic for Python any time since the unicode type was added. (2.0, october 2000.) > For what it's worth, protocol buffers use a blob type for strings in C++. See http://code.google.com/apis/protocolbuffers/docs/proto.html . "A string must always contain UTF-8 encoded or 7-bit ASCII text." > It gives application writers the option of putting unicode objects in their Thrift structures to be read out as str? Doing half of encode/decode is worse than not doing it at all. > We do: str You just admitted that when you write unicode it reads back as str. — "if you have code that is using the Thrift string type when it should be binary, s/string/binary/ in your IDL is a virtually painless change to make." Assuming for the sake of argument that strings should be utf8 (which includes ascii!), do you agree with the above statement?
        Hide
        David Reiss added a comment -

        But "there is string, and binary, and sometimes the former is utf8-encoded, but not always" is not.

        The consistency is that in every Thrift language, we use the native "string" type to represent the Thrift "string" type. We do not try to force Unicode semantics on languages where they are non-idiomatic.

        (For what it's worth, protocol buffers defines `string` and `bytes` types, corresponding to the behavior of `string` and `binary` in what we are calling the "java and C# way" here.)

        For what it's worth, protocol buffers use a blob type for strings in C++.

        your patch encodes on write but does not decode on read

        Yeah, that was the point. It gives application writers the option of putting unicode objects in their Thrift structures, but doesn't break compatibility with programs that use str objects and/or use alternate encodings for their strings.

        So even for python to python communication it is broken.

        Works fine for me.

        (Surely we at least agree that the server read should return the same kind of object that the client wrote, and vice versa.)

        We do: str

        Show
        David Reiss added a comment - But "there is string, and binary, and sometimes the former is utf8-encoded, but not always" is not. The consistency is that in every Thrift language, we use the native "string" type to represent the Thrift "string" type. We do not try to force Unicode semantics on languages where they are non-idiomatic. (For what it's worth, protocol buffers defines `string` and `bytes` types, corresponding to the behavior of `string` and `binary` in what we are calling the "java and C# way" here.) For what it's worth, protocol buffers use a blob type for strings in C++. your patch encodes on write but does not decode on read Yeah, that was the point. It gives application writers the option of putting unicode objects in their Thrift structures, but doesn't break compatibility with programs that use str objects and/or use alternate encodings for their strings. So even for python to python communication it is broken. Works fine for me. (Surely we at least agree that the server read should return the same kind of object that the client wrote, and vice versa.) We do: str
        Hide
        Jonathan Ellis added a comment -

        Also: your patch encodes on write but does not decode on read. So even for python to python communication it is broken. (Surely we at least agree that the server read should return the same kind of object that the client wrote, and vice versa.)

        Show
        Jonathan Ellis added a comment - Also: your patch encodes on write but does not decode on read. So even for python to python communication it is broken. (Surely we at least agree that the server read should return the same kind of object that the client wrote, and vice versa.)
        Hide
        Jonathan Ellis added a comment -

        You are right, I didn't know the history here. But citing the (outdated) whitepaper which you have already violated when convenient isn't a very convincing argument.

        Both the old way ("there is no string, only binary") or the Java and C# way (strings are utf-8; binary is byte[]) are self-consistent and make sense. But "there is string, and binary, and sometimes the former is utf8-encoded, but not always" is not.

        Personally I think the Java / C# way is better, since it solves a common problem across languages which is one of the reasons to bother using thrift. But if you want to argue the other way, fine, let's file bugs against Java and C# and remove the misleading type. (I would argue that `string` is the misleading one and `binary` is the proper name for its behavior.)

        (For what it's worth, protocol buffers defines `string` and `bytes` types, corresponding to the behavior of `string` and `binary` in what we are calling the "java and C# way" here.)

        Show
        Jonathan Ellis added a comment - You are right, I didn't know the history here. But citing the (outdated) whitepaper which you have already violated when convenient isn't a very convincing argument. Both the old way ("there is no string, only binary") or the Java and C# way (strings are utf-8; binary is byte[]) are self-consistent and make sense. But "there is string, and binary, and sometimes the former is utf8-encoded, but not always" is not. Personally I think the Java / C# way is better, since it solves a common problem across languages which is one of the reasons to bother using thrift. But if you want to argue the other way, fine, let's file bugs against Java and C# and remove the misleading type. (I would argue that `string` is the misleading one and `binary` is the proper name for its behavior.) (For what it's worth, protocol buffers defines `string` and `bytes` types, corresponding to the behavior of `string` and `binary` in what we are calling the "java and C# way" here.)
        Hide
        David Reiss added a comment -

        code that is incorrectly using the Thrift string type

        I think this illustrates your key misunderstanding. The original Thrift Whitepaper <http://incubator.apache.org/thrift/static/thrift-20070401.pdf> defines the string type as "An encoding-agnostic text or binary string". Java and C# use UTF-8 because no one has taken the time to allow them to work with non-UTF-8 text, and the binary type was created to allow Java to use a different in-memory representation for non-text values.

        completely fail to interoperate with other thrift implementations

        C++, Python 2, Ruby, Perl, PHP, and Erlang all delegate decoding of string fields to the application. Java and C# are the only ones that I know of that don't. Applications that manually encode to UTF-8 (like Alexander's) have no trouble interoperating with any other Thrift implementations, and those that use non-utf-8 (or non-Unicode!) strings can interoperate will all but Java and C#.

        Show
        David Reiss added a comment - code that is incorrectly using the Thrift string type I think this illustrates your key misunderstanding. The original Thrift Whitepaper < http://incubator.apache.org/thrift/static/thrift-20070401.pdf > defines the string type as "An encoding-agnostic text or binary string". Java and C# use UTF-8 because no one has taken the time to allow them to work with non-UTF-8 text, and the binary type was created to allow Java to use a different in-memory representation for non-text values. completely fail to interoperate with other thrift implementations C++, Python 2, Ruby, Perl, PHP, and Erlang all delegate decoding of string fields to the application. Java and C# are the only ones that I know of that don't. Applications that manually encode to UTF-8 (like Alexander's) have no trouble interoperating with any other Thrift implementations, and those that use non-utf-8 (or non-Unicode!) strings can interoperate will all but Java and C#.
        Hide
        Jonathan Ellis added a comment -

        Here's what I don't understand: what is the big deal with doing it right, and rejecting binary (non-ascii str) passed to writeString? If you are passing binary data you need to declare it as such or your code will completely fail to interoperate with other thrift implementations. Letting people do that is not doing them a favor. And if you have code that is incorrectly using the Thrift string type when it should be binary, s/string/binary/ in your IDL is a virtually painless change to make and everything will work again.

        Show
        Jonathan Ellis added a comment - Here's what I don't understand: what is the big deal with doing it right, and rejecting binary (non-ascii str) passed to writeString? If you are passing binary data you need to declare it as such or your code will completely fail to interoperate with other thrift implementations. Letting people do that is not doing them a favor. And if you have code that is incorrectly using the Thrift string type when it should be binary, s/string/binary/ in your IDL is a virtually painless change to make and everything will work again.
        Hide
        David Reiss added a comment -

        Hopefully this is a first step that we can all agree on. Unicode objects that the application puts into Thrift structures will be encode with UTF-8 on serialization. There should be no change in behavior for existing programs. The only weird thing is that TBinaryProtocol will still throw an exception if you put a unicode object in a binary field, but TBinaryProtocolAccelerated will encode it and not complain. I don't think that is a big deal, though.

        Show
        David Reiss added a comment - Hopefully this is a first step that we can all agree on. Unicode objects that the application puts into Thrift structures will be encode with UTF-8 on serialization. There should be no change in behavior for existing programs. The only weird thing is that TBinaryProtocol will still throw an exception if you put a unicode object in a binary field, but TBinaryProtocolAccelerated will encode it and not complain. I don't think that is a big deal, though.
        Hide
        Jonathan Ellis added a comment -

        To be more clear, in a unicode-aware python 2 program,

        a "string-like object" (that is, duck-typed string) may be either a `str` containing ascii bytes or a `unicode` object. a "binary object" should only be a `str`.

        this patch complies on both counts.

        (in python 3 a `str` is always unicode and a new `bytes` type is introduced for binary.)

        Show
        Jonathan Ellis added a comment - To be more clear, in a unicode-aware python 2 program, a "string-like object" (that is, duck-typed string) may be either a `str` containing ascii bytes or a `unicode` object. a "binary object" should only be a `str`. this patch complies on both counts. (in python 3 a `str` is always unicode and a new `bytes` type is introduced for binary.)
        Hide
        Jonathan Ellis added a comment -

        Wrong.

        Python 2 has a strong tradition of using the str type for ascii strings as well as blobs.

        That continues to work fine with this patch.

        Python 2 has always used the unicode type for unicode strings.

        Passing random binary stuff that may or may not be the result of encoding a unicode object to something expecting a unicode string (and i mean generically not specifically the unicode type) will crap out.

        Try it with sqlalchemy or mako or any modern unicode-supporting python 2 library.

        Show
        Jonathan Ellis added a comment - Wrong. Python 2 has a strong tradition of using the str type for ascii strings as well as blobs. That continues to work fine with this patch. Python 2 has always used the unicode type for unicode strings. Passing random binary stuff that may or may not be the result of encoding a unicode object to something expecting a unicode string (and i mean generically not specifically the unicode type) will crap out. Try it with sqlalchemy or mako or any modern unicode-supporting python 2 library.
        Hide
        David Reiss added a comment -

        There's really no two ways around it: the old behavior (treating all strings as binary) was a bug.

        This is simply not the case. Python 2 has a strong tradition of using the "str" type for strings, and the str type is a blob without any awareness of encodings. Python 3 has moved to a Java-like model of unicode strings, but the current behavior is the most Python2-esque way of behaving.

        Show
        David Reiss added a comment - There's really no two ways around it: the old behavior (treating all strings as binary) was a bug. This is simply not the case. Python 2 has a strong tradition of using the "str" type for strings, and the str type is a blob without any awareness of encodings. Python 3 has moved to a Java-like model of unicode strings, but the current behavior is the most Python2-esque way of behaving.
        Hide
        Jonathan Ellis added a comment -

        There's really no two ways around it: the old behavior (treating all strings as binary) was a bug.

        I think option (1) clearly violates the spirit of python ("in the face of ambiguity, refuse the tempation to guess") in a way that is guaranteed to cause problems. If a program relies on buggy behavior, let's fail fast rather than working "sometimes."

        As for option (2) I don't think I should be required to use a decorator to get correct behavior. I would suggest a decorator to get the buggy behavior but (a) that seems ... wrong, and (b) how hard is it to regenerate your api with s/string/binary/ anyway? You'll get the exact behavior as before. I still don't see how this is a big deal.

        Show
        Jonathan Ellis added a comment - There's really no two ways around it: the old behavior (treating all strings as binary) was a bug. I think option (1) clearly violates the spirit of python ("in the face of ambiguity, refuse the tempation to guess") in a way that is guaranteed to cause problems. If a program relies on buggy behavior, let's fail fast rather than working "sometimes." As for option (2) I don't think I should be required to use a decorator to get correct behavior. I would suggest a decorator to get the buggy behavior but (a) that seems ... wrong, and (b) how hard is it to regenerate your api with s/string/binary/ anyway? You'll get the exact behavior as before. I still don't see how this is a big deal.
        Hide
        David Reiss added a comment -

        I think that the attached patch is definitely how things should work in Python 3, and I definitely think we should try to keep consistency between the pure python implementation and the extension. For Python 2, I can think of two possible approaches.

        • Try to conform to the old behavior to the extent possible. Don't attempt to re-encode str objects when writing. Return raw str objects when reading.
        • Implement type annotations for base types (I'm about 75% of the way through this) and require an annotation to trigger the unicode behavior in Python 2.

        Unfortunately, the convention in Python 2 is that strings are blobs, not unicode.

        Thoughts?

        Show
        David Reiss added a comment - I think that the attached patch is definitely how things should work in Python 3, and I definitely think we should try to keep consistency between the pure python implementation and the extension. For Python 2, I can think of two possible approaches. Try to conform to the old behavior to the extent possible. Don't attempt to re-encode str objects when writing. Return raw str objects when reading. Implement type annotations for base types (I'm about 75% of the way through this) and require an annotation to trigger the unicode behavior in Python 2. Unfortunately, the convention in Python 2 is that strings are blobs, not unicode. Thoughts?
        Hide
        Jonathan Ellis added a comment -

        Thanks for catching that, Bryan. Revised v2 patch attached.

        Show
        Jonathan Ellis added a comment - Thanks for catching that, Bryan. Revised v2 patch attached.
        Hide
        Bryan Duxbury added a comment -

        Looking at the patch, I would add that writeString should just call down to writeBinary after doing the proper encoding, rather than duplicating the functionality (however slight) already in writeBinary. Same for the reading side.

        Show
        Bryan Duxbury added a comment - Looking at the patch, I would add that writeString should just call down to writeBinary after doing the proper encoding, rather than duplicating the functionality (however slight) already in writeBinary. Same for the reading side.
        Hide
        Bryan Duxbury added a comment -

        Not being a pythonista myself, I can't speak to the implementation particulars, but in terms of the correct behavior, it sounds like Jonathan is on the right track.

        If the Thrift IDL says a field is of type "string", then you must UTF-8 encode/decode it for it to be wire compatible with other language libraries. If the IDL says "binary", then you should write it through with no encoding. Doing anything else is a break with the Thrift specification for the binary protocol.

        If, prior to this patch, you were UTF-8 encoding and encoding strings yourself and storing them in string fields, then yes, you will have to change your code if you want the field to remain a string. You can of course change your IDL to a "binary" field and leave your code the way it is, if that would somehow make your life easier. However, as any client code doing this is actually working around a library bug, it seems to me like it's something worth fixing, and in any case, should be a simplification.

        Show
        Bryan Duxbury added a comment - Not being a pythonista myself, I can't speak to the implementation particulars, but in terms of the correct behavior, it sounds like Jonathan is on the right track. If the Thrift IDL says a field is of type "string", then you must UTF-8 encode/decode it for it to be wire compatible with other language libraries. If the IDL says "binary", then you should write it through with no encoding. Doing anything else is a break with the Thrift specification for the binary protocol. If, prior to this patch, you were UTF-8 encoding and encoding strings yourself and storing them in string fields, then yes, you will have to change your code if you want the field to remain a string. You can of course change your IDL to a "binary" field and leave your code the way it is, if that would somehow make your life easier. However, as any client code doing this is actually working around a library bug, it seems to me like it's something worth fixing, and in any case, should be a simplification.
        Hide
        Jonathan Ellis added a comment -

        I'm done repeating myself.

        Show
        Jonathan Ellis added a comment - I'm done repeating myself.
        Hide
        Alexander Shigin added a comment -

        I don't get it. I have to rewrite my code. How can it "improves compatibility"?

        Show
        Alexander Shigin added a comment - I don't get it. I have to rewrite my code. How can it "improves compatibility"?
        Hide
        Jonathan Ellis added a comment -

        First, what part of this fails to work when you re-generate your API with a `binary` type using the included patch? The code is exactly the same as used to be followed for all `string`.

        Second, it's disingenuous to claim that a patch that fixes a bug preventing python code from working with any other client/server "breaks compatibility." Rather it improves compatibility, at the cost of requiring a trivial change from people already working around the bug instead of fixing it.

        Show
        Jonathan Ellis added a comment - First, what part of this fails to work when you re-generate your API with a `binary` type using the included patch? The code is exactly the same as used to be followed for all `string`. Second, it's disingenuous to claim that a patch that fixes a bug preventing python code from working with any other client/server "breaks compatibility." Rather it improves compatibility, at the cost of requiring a trivial change from people already working around the bug instead of fixing it.
        Hide
        Alexander Shigin added a comment - - edited

        You misunderstand me.

        Here is a trivial snippet of my current code:

        result = service.query(request.encode('utf-8'))
        

        It can't work with the patch. "*-remote" helper can't work with the patch. The patch breaks compatibility.

        Show
        Alexander Shigin added a comment - - edited You misunderstand me. Here is a trivial snippet of my current code: result = service.query(request.encode('utf-8')) It can't work with the patch. "*-remote" helper can't work with the patch. The patch breaks compatibility.
        Hide
        Jonathan Ellis added a comment -

        You just have to use the `binary` type where it's required instead of relying on a bug.

        Show
        Jonathan Ellis added a comment - You just have to use the `binary` type where it's required instead of relying on a bug.
        Hide
        Alexander Shigin added a comment -

        I think that main reason for str.encode is hex, gzip and other pseudo-encoding.

        Your proposal is good (py3k, consistency with java and c# lib), but if you use thrift, python and thrift's string type you have to rewrite your application.

        Show
        Alexander Shigin added a comment - I think that main reason for str.encode is hex, gzip and other pseudo-encoding. Your proposal is good (py3k, consistency with java and c# lib), but if you use thrift, python and thrift's string type you have to rewrite your application.
        Hide
        Jonathan Ellis added a comment -

        There's nothing accidental about it. That is the main reason it makes sense for str to have an encode method, not just unicode.

        Show
        Jonathan Ellis added a comment - There's nothing accidental about it. That is the main reason it makes sense for str to have an encode method, not just unicode.
        Hide
        Alexander Shigin added a comment -

        > 2. pass a binary string containing ascii characters to a `string` field -> works

        By accident. It shouldn't work, but you are lucky.

        Show
        Alexander Shigin added a comment - > 2. pass a binary string containing ascii characters to a `string` field -> works By accident. It shouldn't work, but you are lucky.
        Hide
        Jonathan Ellis added a comment - - edited

        Again: unicode is not forced. Either a unicode object or a binary string containing ascii characters is legal to pass to a `string` type. This is standard for unicode-supporting apps in Python 2.x. (And is the whole point of using UTF-8 as the encoding target.)

        Show
        Jonathan Ellis added a comment - - edited Again: unicode is not forced. Either a unicode object or a binary string containing ascii characters is legal to pass to a `string` type. This is standard for unicode-supporting apps in Python 2.x. (And is the whole point of using UTF-8 as the encoding target.)
        Hide
        Alexander Shigin added a comment -

        If you want to force using unicode type in the Thrift, Service-remote generation should be patched.

        Show
        Alexander Shigin added a comment - If you want to force using unicode type in the Thrift, Service-remote generation should be patched.
        Hide
        Jonathan Ellis added a comment -

        In other words, my patch conforms to normal UTF-8-supporting behavior:

        1. pass a unicode object to a `string` field -> works
        2. pass a binary string containing ascii characters to a `string` field -> works
        3. pass binary data to a `binary` field -> works
        4. pass arbitrary binary data to a `string` field -> doesn't work

        Pre-patch, the python api would allow case four, but this was a bug, because any server conforming to the thrift wire protocol (i.e. anything but another buggy python server) would try to decode from utf-8 and get garbage. Switching from `string` to `binary` is the right fix for code in this situation.

        Show
        Jonathan Ellis added a comment - In other words, my patch conforms to normal UTF-8-supporting behavior: 1. pass a unicode object to a `string` field -> works 2. pass a binary string containing ascii characters to a `string` field -> works 3. pass binary data to a `binary` field -> works 4. pass arbitrary binary data to a `string` field -> doesn't work Pre-patch, the python api would allow case four, but this was a bug, because any server conforming to the thrift wire protocol (i.e. anything but another buggy python server) would try to decode from utf-8 and get garbage. Switching from `string` to `binary` is the right fix for code in this situation.
        Hide
        Jonathan Ellis added a comment -

        If you want to send pre-encoded data you should be using binary type, not string.

        Python makes this clear if you know what to look for – the result of encode() is a str [binary string].

        Show
        Jonathan Ellis added a comment - If you want to send pre-encoded data you should be using binary type, not string. Python makes this clear if you know what to look for – the result of encode() is a str [binary string] .
        Hide
        Alexander Shigin added a comment -

        Your patch also throws exception if I send str type.

        >>> unicode('абв', 'utf-8').encode('utf-8').encode('utf-8')
        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
        UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 0: ordinal not in range(128)
        
        Show
        Alexander Shigin added a comment - Your patch also throws exception if I send str type. >>> unicode('абв', 'utf-8').encode('utf-8').encode('utf-8') Traceback (most recent call last): File "<stdin>" , line 1, in <module> UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 0: ordinal not in range(128)
        Hide
        Jonathan Ellis added a comment -

        > Why do you think the input encoding would be utf-8?

        because that is the encoding used by other thrift bindings on the wire. Java:

        public void writeString(String str) throws TException {
        try

        { byte[] dat = str.getBytes("UTF-8"); writeI32(dat.length); trans_.write(dat, 0, dat.length); }

        catch (UnsupportedEncodingException uex)

        { throw new TException("JVM DOES NOT SUPPORT UTF-8"); }

        }

        Show
        Jonathan Ellis added a comment - > Why do you think the input encoding would be utf-8? because that is the encoding used by other thrift bindings on the wire. Java: public void writeString(String str) throws TException { try { byte[] dat = str.getBytes("UTF-8"); writeI32(dat.length); trans_.write(dat, 0, dat.length); } catch (UnsupportedEncodingException uex) { throw new TException("JVM DOES NOT SUPPORT UTF-8"); } }
        Hide
        Alexander Shigin added a comment -

        > That's a nonsensical statement. There is no encoding inherent to unicode objects.

        Here is a snippet from your patch:

           def readString(self):
             len = self.readI32()
             str = self.trans.readAll(len)
        -    return str
        

        Why do you think the input encoding would be utf-8?

        > but my understanding is that fastbinary already has some limitations,

        I know the only limitation is THRIFT-105. And the fastbinary wouldn't be used in this case. Your case is different, you should check if any field has string type to stop using fastbinary.

        Show
        Alexander Shigin added a comment - > That's a nonsensical statement. There is no encoding inherent to unicode objects. Here is a snippet from your patch: def readString(self): len = self.readI32() str = self.trans.readAll(len) - return str Why do you think the input encoding would be utf-8? > but my understanding is that fastbinary already has some limitations, I know the only limitation is THRIFT-105 . And the fastbinary wouldn't be used in this case. Your case is different, you should check if any field has string type to stop using fastbinary.
        Hide
        Jonathan Ellis added a comment -

        > It would break backward compatibility if encoding isn't utf-8.

        That's a nonsensical statement. There is no encoding inherent to unicode objects.

        > And there is no way to specify which encoding is used.

        That's because read/writeString always use utf-8, so any client can send to any server.

        > The next issue is fastbinary. You'll still get the exception if TBinaryProtocolAccelerated is used.

        I actually don't know if this bug is present in fastbinary to begin with, but my understanding is that fastbinary already has some limitations, so if it is, this can be added to the list.

        Show
        Jonathan Ellis added a comment - > It would break backward compatibility if encoding isn't utf-8. That's a nonsensical statement. There is no encoding inherent to unicode objects. > And there is no way to specify which encoding is used. That's because read/writeString always use utf-8, so any client can send to any server. > The next issue is fastbinary. You'll still get the exception if TBinaryProtocolAccelerated is used. I actually don't know if this bug is present in fastbinary to begin with, but my understanding is that fastbinary already has some limitations, so if it is, this can be added to the list.
        Hide
        Alexander Shigin added a comment -

        It would break backward compatibility if encoding isn't utf-8. And there is no way to specify which encoding is used.

        The next issue is fastbinary. You'll still get the exception if TBinaryProtocolAccelerated is used.

        Show
        Alexander Shigin added a comment - It would break backward compatibility if encoding isn't utf-8. And there is no way to specify which encoding is used. The next issue is fastbinary. You'll still get the exception if TBinaryProtocolAccelerated is used.
        Hide
        Jonathan Ellis added a comment -

        This patch adds unicode support to the python bindings. Binary strings continue to not be encoded/decoded (renamed to readBinary/writeBinary for consistency w/ other Protocol implementations) and new write/read String methods were added to support non-binary strings.

        There should be no backwards-compatibility problems. `binary` fields will continue to work as before. `string` fields with ascii data will also continue to work. The only difference is that you can now pass a unicode object to a `string` field w/o it breaking.

        Show
        Jonathan Ellis added a comment - This patch adds unicode support to the python bindings. Binary strings continue to not be encoded/decoded (renamed to readBinary/writeBinary for consistency w/ other Protocol implementations) and new write/read String methods were added to support non-binary strings. There should be no backwards-compatibility problems. `binary` fields will continue to work as before. `string` fields with ascii data will also continue to work. The only difference is that you can now pass a unicode object to a `string` field w/o it breaking.

          People

          • Assignee:
            Nathan Marz
            Reporter:
            Jonathan Ellis
          • Votes:
            4 Vote for this issue
            Watchers:
            9 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development