Man, I sleep in one morning and miss the whole party.
Since we've strayed a bit from the original topic, let me ask a quick question to make sure we're at least aware of the scope of the discussion: What concrete changes would you like to see in Thrift (in the area of encodings) other than the return type of readString in Python 2?
Now, let me just respond to a bunch of stuff in chronological order...
Can we split the difference and have some kind of configuration option to "enforce UTF-8" for Python (but make it off by default)?
I'd be fine with this, though the change to the extension module is more complicated than the change to the pure-python stuff.
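For the sake of discussion, here is a rough sketch of what an off-by-default check could look like in the pure-Python path. The function name read_string and the enforce_utf8 flag are hypothetical and only illustrate the idea; they are not existing Thrift APIs.

```python
def read_string(raw, enforce_utf8=False):
    """Return the bytes of a 'string' field, optionally checking that they are UTF-8.

    Hypothetical helper: 'raw' stands in for the bytes already read off the wire.
    """
    if enforce_utf8:
        # Validation only: raises UnicodeDecodeError if raw is not valid UTF-8.
        raw.decode("utf-8")
    # The bytes are returned unchanged whether or not the check ran.
    return raw
```

With the flag left off, arbitrary bytes pass through untouched, which matches today's behavior; turning it on only adds a validation step.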
what do you think of adding a new annotation (e.g. string.encoding) for specifying the actual string encoding?
I'd also be fine with that. See THRIFT-414 for my planned approach.
I'd deprecate str strings and, at some point in the future, support unicode strings only
If you're talking about Python, I think we should definitely do this for Python 3, but never do it for Python 2. If you're talking about all languages, I think it is unrealistic because C++, PHP, Perl, and Erlang are not going to have robust native Unicode support any time in the foreseeable future.
is utf8 strings the right design decision, absent backwards-compatibility concerns [...] I think some people are reluctant to admit 1. because they are afraid of 2.
I think that it is not. Requiring UTF-8 might seem sensible in a mostly-English environment, but having support for UTF-16 or a Chinese-oriented encoding (for example) can be very useful. I'm fine saying that Thrift strings should be UTF-8 encoded unless otherwise specified (like, by an annotation), but enforcing it in environments that could benefit from a non-UTF-8 encoding is harmful.
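As a small illustration of why a non-UTF-8 encoding can be attractive for CJK-heavy data, the following sketch compares encoded sizes using only the Python standard library. The sample text is my own; the code is written for Python 3.

```python
# Four Chinese characters, all in the Basic Multilingual Plane.
text = "\u4e2d\u6587\u7f16\u7801"

# UTF-8 needs three bytes for each of these characters.
utf8_size = len(text.encode("utf-8"))

# UTF-16 needs two bytes for each; "utf-16-le" avoids the byte-order mark.
utf16_size = len(text.encode("utf-16-le"))
```

Here utf8_size is 12 and utf16_size is 8, so the UTF-16 payload is a third smaller for this kind of text.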
I think adding user-specified encodings adds more complexity than it's worth
I think allowing the user to specify string encoding just adds complexity
I disagree. I think that if we say that strings should default to UTF-8 unless otherwise annotated, it is not a big deal. I think that removing the ability to support other encodings is a big deal.
made restrictions on the types for map keys
I haven't ruled this out, if you want to talk about it. But it should be a separate issue. And if you are serious, we should do it before the release.
made binary its own standalone type
This is effectively the case already. The only possible problems arise when you change a field from string to binary without changing the field id (which is what Jonathan is suggesting, btw), and even then, I think only in the JSON protocol.
If you are using the current code and sending binary data as a "string" then you are probably using Python on both client and server
C++, Ruby, Perl, PHP, and Erlang also do this.
if I am understanding correctly, Python3 is now in the same camp as Java and C# - is that correct?
Exactly. The "str" type in Python 3 is effectively the same as the "unicode" object in Python 2. It is a string of Unicode code points that cannot be used in a context where bytes are expected.
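To make that concrete, here is a minimal Python 3 sketch (plain standard library, nothing Thrift-specific) showing that str and bytes are distinct types that do not mix.

```python
# Python 3: str holds Unicode code points, bytes holds raw octets.
text = "na\u00efve"            # str with 5 code points
data = text.encode("utf-8")    # bytes with 6 octets; U+00EF takes two in UTF-8

def mixing_is_rejected():
    """Return True if concatenating bytes and str raises TypeError."""
    try:
        b"prefix:" + text
    except TypeError:
        return True
    return False

# Getting back to text requires an explicit decode.
round_trip = data.decode("utf-8")
```

In Python 2, the equivalent concatenation of a str and a unicode object would instead trigger an implicit ASCII conversion, which is the "on the fence" behavior this thread keeps running into.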
If so, maybe we want to treat Python3 as a different target language from Python2
I detect a little bit of pro-Python2 on David's part
That is not my intention. I actually think the Java/Python3 data model makes more sense in most contexts. But I think that we should treat Python 2 as Python 2 (AFAIK, Thrift doesn't work in Python 3), which means that strings are strs. A few examples of this: repr returns a str. Exception messages are strs. "" is a str. Data read from files (even those not opened in binary mode) are strs.
Right now most thrift implementations cannot talk to my Java server and that is broken.
We interoperate via Thrift across C++, Ruby, Java, Python2 and Erlang here and everything works just fine. We just make limited use of the 'string' type - and make sure that applications only send UTF-8 data via 'string'
In other words, you are sending binary data that happens to be an encoded string and calling that a string, which it is not. It is binary data. That's working around one bug with another in my book.
Chad is right. As in all C++, Ruby, PHP, Perl, and Erlang programs, it is simply the application's responsibility to ensure that the string is properly UTF-8 encoded on writing and to interpret the string as UTF-8 on reading. I think you are assuming that the "string" is a "Unicode string" or a "string of Unicode code points". In Thrift, this is not the case. It is a string of bytes (that are presumably representing text), and it is up to the application to ensure that the bytes make sense. Now, if we want to establish a convention that the bytes should be a UTF-8-encoded Unicode string unless otherwise annotated, that's fine with me, but I think that mandating UTF-8 is a harmful restriction; mandating Unicode, while probably fine, is not without downsides; and forcing applications to use special types for strings is pretty much out of the question.
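A sketch of what "the application's responsibility" means in practice, written for Python 3. The helper names to_wire and from_wire are hypothetical and are not part of the Thrift library; they only show the encode-on-write, decode-on-read convention.

```python
def to_wire(text, encoding="utf-8"):
    """Encode application text into the bytes carried by a Thrift 'string'."""
    return text.encode(encoding)

def from_wire(raw, encoding="utf-8"):
    """Interpret the bytes of a Thrift 'string' as text.

    Raises UnicodeDecodeError if the bytes are not valid in 'encoding'.
    """
    return raw.decode(encoding)

# Both sides follow the UTF-8 convention: the text survives the round trip.
payload = to_wire("caf\u00e9")
received = from_wire(payload)

# The writer used Latin-1 but the reader assumes UTF-8: the bytes no longer
# "make sense" to the reader, and decoding fails.
mismatched = to_wire("caf\u00e9", encoding="latin-1")
```

Nothing on the wire distinguishes payload from mismatched; only the agreed convention (or an annotation) tells the reader how to interpret the bytes.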
In 2009 a language that doesn't support unicode is barely usable, and will almost certainly support unicode soon.
AFAIK all the thrift languages do support unicode already but I could be wrong on one or two.
There is a difference between supporting Unicode and having native-feeling support for Unicode. If you mean native-feeling support, then most languages do not have it.
- C++ has wstring, which can be used for Unicode strings, but it is rarely used and there is no support for encoding and decoding. The native-feeling way to write C++ is to use string-of-bytes std::string.
- Ruby and PHP's string type is a string of bytes. They have special functions for treating them as pre-encoded Unicode strings. Believe it or not, it seems like PHP's support here might actually be better than Ruby's.
- Erlang is completely Unicode-oblivious.
The only reason that this discussion is coming up here is that Python is the only Thrift language (AFAIK) that is on the fence between strings as bytes and strings as code points.