Thrift / THRIFT-1920 Binary Type / THRIFT-765

Improved string encoding and decoding performance

    Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 0.2
    • Fix Version/s: 0.4
    • Component/s: Java - Library
    • Labels: None

      Description

      One of the most consistent time-consuming spots of Thrift serialization and deserialization is string encoding. For some inscrutable reason, String.getBytes("UTF-8") is slow.

      However, it's recently been brought to my attention that DataOutputStream's writeUTF method has a faster implementation of UTF-8 encoding and decoding. We should use this style of encoding.
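
      For illustration, here is a minimal sketch of that style of encoding (my own example, not the patch's code): walk the string once and write bytes straight into an array instead of going through a CharsetEncoder. Note that, like writeUTF, this per-char loop turns surrogate pairs into two 3-byte sequences (CESU-8) rather than one 4-byte UTF-8 sequence, which is exactly the problem discussed in the comments below.

        // Sketch of DataOutputStream#writeUTF-style encoding into a byte[].
        public static byte[] encodeUtf8(String s) {
          int len = s.length();
          byte[] buf = new byte[len * 3]; // 3 bytes per char is the worst case here
          int pos = 0;
          for (int i = 0; i < len; i++) {
            char c = s.charAt(i);
            if (c < 0x80) {                       // ASCII: 1 byte
              buf[pos++] = (byte) c;
            } else if (c < 0x800) {               // 2-byte sequence
              buf[pos++] = (byte) (0xC0 | (c >> 6));
              buf[pos++] = (byte) (0x80 | (c & 0x3F));
            } else {                              // 3-byte sequence
              buf[pos++] = (byte) (0xE0 | (c >> 12));
              buf[pos++] = (byte) (0x80 | ((c >> 6) & 0x3F));
              buf[pos++] = (byte) (0x80 | (c & 0x3F));
            }
          }
          return java.util.Arrays.copyOf(buf, pos);
        }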

      Attachments

      1. thrift-765.patch (9 kB, Bryan Duxbury)
      2. THRIFT-765.patch (16 kB, Robert Muir)
      3. thrift-765-redux.patch (11 kB, Bryan Duxbury)
      4. thrift-765-redux-v2.patch (12 kB, Bryan Duxbury)


          Activity

          Bryan Duxbury added a comment -

          This patch adds a new class, Utf8Helper, that implements the fast encoding and decoding methods. It also changes Binary and Compact protocols to use this method of encoding.

          I haven't benchmarked the overall performance improvements to the respective protocols, but I did perform a micro-benchmark of string encoding and decoding, which showed the new methods take about half as long. The benefit you see in your actual serialization/deserialization will depend on how many strings you use.
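
           (A rough way to reproduce that kind of micro-benchmark, as a hedged sketch; encodeUtf8 refers to the illustration after the description above, not to the patch's Utf8Helper.)

             // Crude timing loop; a real comparison would use warm-up runs and a
             // proper benchmark harness, but this is enough to see the rough ratio.
             public static void main(String[] args) throws Exception {
               String sample = "the quick brown fox jumps over the lazy dog";
               int iters = 1000000;

               long t0 = System.nanoTime();
               for (int i = 0; i < iters; i++) {
                 sample.getBytes("UTF-8");
               }
               long jdkNanos = System.nanoTime() - t0;

               long t1 = System.nanoTime();
               for (int i = 0; i < iters; i++) {
                 encodeUtf8(sample); // sketch from the description section
               }
               long customNanos = System.nanoTime() - t1;

               System.out.printf("JDK: %d ms, custom: %d ms%n",
                   jdkNanos / 1000000, customNanos / 1000000);
             }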

          Bryan Duxbury added a comment -

          I just committed this.

          David Reiss added a comment -

           Why is there a custom UTF-8 codec in the patch when your description referenced a Java library class? Did you copy the implementation? Are there license restrictions on it?

          Bryan Duxbury added a comment -

          We can't actually use the method that already exists on DataOutputStream, because it wants to wrap a stream object, and implies a Java-specific frame size that doesn't meet our needs.

          I did not copy the implementation. I adapted the technique found in DataOutputStream#writeUTF (and a similar version found in Kryo's StringSerializer) to work with Thrift. I could include the license notice from Kryo if you think that we could be in license trouble here.

          Doug Cutting added a comment -

           It doesn't look like this handles surrogate pairs correctly. Java strings are not arrays of Unicode code points, but are rather UTF-16 encoded. UTF-8 can have 4-byte sequences that must be converted to two Java characters.

          Bryan Duxbury added a comment -

          Could you provide some example strings?

          Ken Krugler added a comment -

          For example, code point U+20491 is two UTF-16 characters (\uD841 \uDC91) and thus two Java characters. The equivalent in UTF-8 is four bytes (0xF0 0xA0 0x92 0x91). Just to see how well Jira & your browser handle this, the actual character is 𠒑
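
           (Those numbers are easy to verify from Java itself; a small self-contained snippet, not from the patch:)

             public class SurrogateExample {
               public static void main(String[] args) throws Exception {
                 String s = "\uD841\uDC91";                           // U+20491 as a surrogate pair
                 System.out.println(s.length());                      // 2 UTF-16 chars
                 System.out.println(s.codePointCount(0, s.length())); // 1 code point
                 for (byte b : s.getBytes("UTF-8")) {
                   System.out.printf("0x%02X ", b);                   // 0xF0 0xA0 0x92 0x91
                 }
               }
             }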

          Bryan Duxbury added a comment -

           I added a test case that included the example character you just provided, and it encoded and decoded successfully. Seems like it works OK.

          Anyone have others they'd like to see in the tests?

          Doug Cutting added a comment -

           Mea culpa. It looks like it actually does handle surrogate pairs. (I didn't realize that they don't require special-case code for this conversion.) It doesn't detect invalid inputs, but that's probably okay. Never mind.

          Yonik Seeley added a comment -

          Doug, I think you were right the first time.
          I took a quick glance at the code, and it doesn't handle surrogate pairs. The resulting output will be CESU-8, not UTF-8.

          Taking Ken's example, that code point encoded in UTF-8 is 4 bytes, while the encoding this patch implements will yield 6 bytes of CESU-8. A round-trip test is not sufficient to test for correct UTF-8 (they will both decode to the same Java String).
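
           (To make that concrete, an illustrative snippet: the four UTF-8 bytes come from combining the pair into one code point first, while encoding each surrogate char on its own yields six bytes of CESU-8.)

             public class Cesu8VsUtf8 {
               public static void main(String[] args) {
                 int cp = Character.toCodePoint('\uD841', '\uDC91'); // U+20491

                 // Correct UTF-8: one 4-byte sequence for the whole code point.
                 byte[] utf8 = {
                     (byte) (0xF0 | (cp >> 18)),
                     (byte) (0x80 | ((cp >> 12) & 0x3F)),
                     (byte) (0x80 | ((cp >> 6) & 0x3F)),
                     (byte) (0x80 | (cp & 0x3F)),
                 };                                                  // F0 A0 92 91

                 // CESU-8: each surrogate char encoded as its own 3-byte sequence.
                 char[] pair = Character.toChars(cp);                // '\uD841', '\uDC91'
                 byte[] cesu8 = new byte[6];
                 int p = 0;
                 for (char c : pair) {
                   cesu8[p++] = (byte) (0xE0 | (c >> 12));
                   cesu8[p++] = (byte) (0x80 | ((c >> 6) & 0x3F));
                   cesu8[p++] = (byte) (0x80 | (c & 0x3F));
                 }                                                   // ED A1 81 ED B2 91

                 System.out.println(utf8.length + " vs " + cesu8.length); // 4 vs 6
               }
             }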

          Bryan Duxbury added a comment -

          It's not just a round-trip test. I tested that the strings serialized to the same bytes with either encoder.

          Doug Cutting added a comment -

           Milind claims in AVRO-532 that Hadoop's record I/O correctly handled surrogate pairs.

          https://svn.apache.org/repos/asf/hadoop/common/branches/pre-HADOOP-4687/src/core/org/apache/hadoop/record/Utils.java

           That code is quite different from the code in this patch. Notably, it calls String#codePointAt() and StringBuffer#appendCodePoint() to let Java handle surrogate pairs.
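
           (The shape of that code-point approach is roughly the following; my own sketch of the idea, not the Hadoop Utils.java code. The StringBuilder/StringBuffer append is also where the overhead mentioned further down in this thread comes from.)

             // Sketch: decode UTF-8 by computing whole code points and letting
             // StringBuilder#appendCodePoint emit one char or a surrogate pair.
             // Continuation-byte validation is omitted to keep the sketch short.
             public static String decodeViaCodePoints(byte[] in) {
               StringBuilder sb = new StringBuilder(in.length);
               int i = 0;
               while (i < in.length) {
                 int b = in[i] & 0xFF;
                 int cp;
                 if (b < 0x80) {                                 // 1-byte sequence
                   cp = b;
                   i += 1;
                 } else if (b < 0xE0) {                          // 2-byte sequence
                   cp = ((b & 0x1F) << 6) | (in[i + 1] & 0x3F);
                   i += 2;
                 } else if (b < 0xF0) {                          // 3-byte sequence
                   cp = ((b & 0x0F) << 12) | ((in[i + 1] & 0x3F) << 6)
                      | (in[i + 2] & 0x3F);
                   i += 3;
                 } else {                                        // 4-byte sequence
                   cp = ((b & 0x07) << 18) | ((in[i + 1] & 0x3F) << 12)
                      | ((in[i + 2] & 0x3F) << 6) | (in[i + 3] & 0x3F);
                   i += 4;
                 }
                 sb.appendCodePoint(cp);                         // 1 or 2 chars
               }
               return sb.toString();
             }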

          Yonik Seeley added a comment -

          It's not just a round-trip test. I tested that the strings serialized to the same bytes with either encoder.

          I don't see any code points outside the BMP in your test.
          Try something like "\uD801\uDC00", which is one character.

          Bryan Duxbury added a comment -

           Hm, it looks like I made a mistake in transcribing that earlier test case, and it does indeed fail to encode correctly.

          Bryan Duxbury added a comment -

          I had a look at Hadoop's RecordIO implementation, and this did the trick. After adapting it to Thrift, I found that it was pretty slow, so I monkeyed around a bit until I'd basically regained all of the encoding performance, though not very much of the decoding performance. It's still faster, but I just can't seem to shake the StringBuilder overhead even when doing nothing but ASCII strings.

           All the tests pass. What do you guys think?

          Doug Cutting added a comment -

          Does using a switch instead of an 'if else' in decode help any?

           Another thing to try might be to use a char[] buffer instead of StringBuffer and directly implement UTF-16 coding. Code points less than 0x10000 are encoded as one Java char. Code points above that are encoded in two chars, with the first containing 0xD800 plus the high-order ten bits of the code point minus 0x10000, and the second 0xDC00 plus the low-order ten bits.

          The one-, two- and three- byte cases of UTF-8 all produce characters less than 0x10000 and so will always generate a single char. And the four-byte case of UTF-8 will only and always generate a character greater than or equal to 0x10000. So you don't need to test that if you combine the UTF-16 encoding with the UTF-8 decoding.
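
           (A sketch of that combined decoding, as an illustration only; 2- and 3-byte sequences and all validation are omitted. The 4-byte UTF-8 case is written straight into a char[] as a surrogate pair, with no StringBuilder and no extra code-point test.)

             public static String decodeUtf8Sketch(byte[] in) {
               char[] out = new char[in.length]; // UTF-16 never needs more chars than UTF-8 bytes
               int o = 0;
               int i = 0;
               while (i < in.length) {
                 int b = in[i] & 0xFF;
                 if (b < 0x80) {                                 // 1-byte case: one char
                   out[o++] = (char) b;
                   i += 1;
                 } else if (b >= 0xF0) {                         // 4-byte case: surrogate pair
                   int cp = ((b & 0x07) << 18) | ((in[i + 1] & 0x3F) << 12)
                          | ((in[i + 2] & 0x3F) << 6) | (in[i + 3] & 0x3F);
                   int v = cp - 0x10000;
                   out[o++] = (char) (0xD800 + (v >> 10));       // high-order ten bits
                   out[o++] = (char) (0xDC00 + (v & 0x3FF));     // low-order ten bits
                   i += 4;
                 } else {
                   throw new UnsupportedOperationException("2/3-byte cases omitted in this sketch");
                 }
               }
               return new String(out, 0, o);
             }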

           Yonik Seeley added a comment -

           You can also check out Lucene's code for UTF8 encoding/decoding: http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/UnicodeUtil.java?revision=931668&view=markup

          Bryan Duxbury added a comment -

           OK, after looking over Lucene's implementation, I've managed to come up with an approach that reclaims the performance benefits of the original broken version.

          Protocols that use this could probably still benefit a little bit by reusing the byte and char buffers used during encoding and decoding, but I haven't thought all the way through the memory implications, so I'm leaving that out for now.

          If no one has any objections, I will commit this patch shortly.

          Bryan Duxbury added a comment -

          I just committed this (again).

          David Reiss added a comment -

          Any idea why this is faster than the built-in Java codec (which is probably implemented in C)? Reinventing the wheel here seems like it's just asking for problems. The surrogate pair bug is just the first example. This implementation also appears to allow overlong forms.

          Bryan Duxbury added a comment -

          There still seems to be a bug in the mix somewhere. I am working to chase it down, but I don't think this patch will make it in for 0.3.

          Bryan Duxbury added a comment -

           Firstly, as far as I can tell, the built-in Java version is not implemented in C. I don't know all of the reasons why it's faster, but a few of them are: avoiding byte array allocation/copies, a more specific implementation, and less modular code with fewer intermediate layers.

          It seems like many other projects that do UTF-8 encoding do so via a custom method, some better than others. I'll admit it's tricky, but the performance benefit is so huge that I'm willing to pursue it more deeply.

          Bryan Duxbury added a comment -

          I've spent a bunch of time playing around with this. I discovered that the "bugs" with the new encoder/decoder only occur when decoding invalid UTF-8 strings. In particular, the Java standard decoder seems to take a lot more care with validating every byte it examines before combining it into the codepoint and converting it to UTF-16.

           While the decoding function I have right now is still about 2x the speed of Java's, adding in all the checks required to get us to parity of functionality will almost certainly evaporate the performance benefits. Even worse, it seems like the performance benefits of encoding have somehow disappeared during the process of implementing surrogate pair support. (This one confuses me - my benchmark doesn't even cover a string that contains surrogate pairs. Maybe my original tests were flawed?)

          The bottom line is that it looks like this is a dead end, unless we are willing to sacrifice "correctness" when decoding invalid UTF-8 encoded strings. You could argue that if it's a bad encoding already, it might be best to detect that and throw rather than silently convert like Java does, but that's a debate for another time.

          The only way that reviving this issue would make sense is if we are willing to support a separate encoding mechanism for the purpose of avoiding buffer allocation and copies during write. At the moment, we're not equipped to benefit from that, so maybe I'll reevaluate later.

          David Dabbs added a comment -

           Lucene's codebase has moved on from the revision link referenced above. If I can get the time, I will compare to ICU4J, etc.
           This is a common interest - Thrift, Avro, and protobuf all have similar issues filed.

          Robert Muir added a comment -

          just found this issue randomly, late to the game.

           I don't think you should give up here; besides the performance problems, there are also bugs in Sun (and other JDK) UTF-8 implementations.

          Included in the patch is a commented-out test showing a bug in my
          Sun JDK's UTF-8 decode (1.6.0_19), just as an example.

          This patch includes bulk UTF-8 encode/decode with full error-checking.
          Ensuring correctness doesn't need to be slow:

           • For encode it's trivial: just surrogate validation.
           • For decode it's tricky, but the code can be simple with a DFA.

           I used the DFA approach described here: http://bjoern.hoehrmann.de/utf-8/decoder/dfa/
           In my opinion, it's the only way to go without a lot of complexity.

           I implemented it as char[] -> byte[] and back, and it throws a checked
           CharacterCodingException on invalid inputs.

           I didn't modify any Thrift code to actually use this, except the benchmark
           from the previous patch.
           Hopefully it's easy to adjust to your needs.
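
           (For the encode side, "just surrogate validation" amounts to something like the following; a hedged sketch of the idea, not the code in THRIFT-765.patch. Well-formed pairs become one 4-byte sequence and unpaired surrogates are rejected.)

             // Sketch: UTF-8 encode with surrogate validation. A well-formed
             // high/low pair is combined into one 4-byte sequence; an unpaired
             // surrogate throws a checked CharacterCodingException.
             public static byte[] encodeChecked(char[] in)
                 throws java.nio.charset.CharacterCodingException {
               byte[] out = new byte[in.length * 3]; // 3 bytes per char is the worst case
               int o = 0;
               for (int i = 0; i < in.length; i++) {
                 char c = in[i];
                 if (c < 0x80) {                               // ASCII: 1 byte
                   out[o++] = (byte) c;
                 } else if (c < 0x800) {                       // 2-byte sequence
                   out[o++] = (byte) (0xC0 | (c >> 6));
                   out[o++] = (byte) (0x80 | (c & 0x3F));
                 } else if (c < 0xD800 || c > 0xDFFF) {        // ordinary BMP char: 3 bytes
                   out[o++] = (byte) (0xE0 | (c >> 12));
                   out[o++] = (byte) (0x80 | ((c >> 6) & 0x3F));
                   out[o++] = (byte) (0x80 | (c & 0x3F));
                 } else if (Character.isHighSurrogate(c) && i + 1 < in.length
                            && Character.isLowSurrogate(in[i + 1])) {
                   int cp = Character.toCodePoint(c, in[++i]); // supplementary code point
                   out[o++] = (byte) (0xF0 | (cp >> 18));
                   out[o++] = (byte) (0x80 | ((cp >> 12) & 0x3F));
                   out[o++] = (byte) (0x80 | ((cp >> 6) & 0x3F));
                   out[o++] = (byte) (0x80 | (cp & 0x3F));
                 } else {
                   throw new java.nio.charset.CharacterCodingException(); // unpaired surrogate
                 }
               }
               return java.util.Arrays.copyOf(out, o);
             }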

          Bryan Duxbury added a comment -

          I just had a look at this patch and it seems really promising. The performance improvement is great (cuts about 30% from encoding and 60% from decoding) so I'm encouraged.

          I'm thinking that maybe we should give the Random instance in the test a consistent seed so that the test is deterministic. Do you think we should try some invalid strings as well? It should just lead to encoding exceptions, right?

          Robert Muir added a comment -

          Do you think we should try some invalid strings as well? It should just lead to encoding exceptions, right?

           The (commented-out) test generates purely random bytes for decode, and purely random chars for encode.

           You can safely enable the random-chars encode test if you think there are no encode bugs in your JDK (Sun's is OK).
           You cannot safely enable the random-bytes decode test, because of the Sun decode bug I mentioned (see testJDKCorrectness to test your JDK).

          I reported the bug to sun here: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6982052

           By the way, you can speed up decode further for the multibyte case:

                if (state != UTF8_ACCEPT) {
                  // non-starter, or following illegal byte sequence
                  if (state == UTF8_REJECT)               // remove this line
                    throw new CharacterCodingException(); // remove this line also
          

           This is because the DFA is designed such that, if the decoder enters the reject state, it will never leave.
           So this code is redundant; it will throw the exception at the end anyway (sorry for leaving the check in the code).


            People

            • Assignee: Bryan Duxbury
            • Reporter: Bryan Duxbury
            • Votes: 0
            • Watchers: 3
