AVRO-315

Performance improvements to BinaryDecoder

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.3.0
    • Component/s: java
    • Labels:
      None

      Description

      The attached patch improves the performance of BinaryDecoder.readLong(), readFloat(), and readDouble().

      The test patch adds a command-line program, Perf, in org.apache.avro.io (in the test part of the source tree) that measures the performance of readInt() (which calls readLong()), readFloat(), and readDouble(). On my machine, the patch improves performance by about 10% for readInt() and about 50% for readFloat() and readDouble().

      The idea is to unroll the loops in readLong(), readFloat(), and readDouble(). There is also a small change in doReadBytes() that checks for the most common condition before the less common ones.
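
      To illustrate, a minimal sketch of what an unrolled zig-zag varint read might look like (hypothetical code, not the patch itself; it assumes an InputStream field as in BinaryDecoder and relies only on java.io.InputStream.read(), which returns the next unsigned byte or -1 at end of stream):

        // Hypothetical unrolled variant of the readInt() varint loop (sketch only).
        // imports: java.io.EOFException, java.io.IOException, java.io.InputStream
        int readIntUnrolled(InputStream in) throws IOException {
          int b = in.read();
          if (b < 0) throw new EOFException();
          int n = b & 0x7f;
          if (b > 0x7f) {                  // continuation bit set: read byte 2
            b = in.read();
            n |= (b & 0x7f) << 7;
            if (b > 0x7f) {                // byte 3
              b = in.read();
              n |= (b & 0x7f) << 14;
              if (b > 0x7f) {              // byte 4
                b = in.read();
                n |= (b & 0x7f) << 21;
                if (b > 0x7f) {            // byte 5, the last needed for a 32-bit value
                  b = in.read();
                  n |= (b & 0x7f) << 28;
                }
              }
            }
          }
          if (b < 0) throw new EOFException();  // any of the reads above may have hit EOF
          return (n >>> 1) ^ -(n & 1);          // zig-zag decode
        }

      Each level of nesting uses a fixed shift amount, so there is no loop counter or shift variable to maintain.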

      Attachments

      1. AVRO-315.patch
        2 kB
        Doug Cutting
      2. AVRO-315.patch
        5 kB
        Thiruvalluvan M. G.
      3. AVRO-315-prototype.patch
        16 kB
        Scott Carey
      4. AVRO-315-test.patch
        7 kB
        Thiruvalluvan M. G.


          Activity

          Kevin Oliver added a comment -

          This should be nice. We've seen readLong() take up quite a bit of time in the profiler.

          Doug Cutting added a comment -

          I saw no improvement to readInt() on Ubuntu 9.10 using Java 1.6.0_15.

          I was able to further improve readFloat() and readDouble() by ~12% by pre-allocating the buffer.

          I attached a new version of the patch with that change.
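
          A minimal sketch of the pre-allocation idea (hypothetical names; the committed patch may differ): keep one scratch byte[] as a field instead of allocating a fresh array on every readFloat()/readDouble() call. Avro writes a double as the 8 little-endian bytes of its IEEE 754 bit pattern.

            // imports: java.io.EOFException, java.io.IOException, java.io.InputStream
            private final byte[] scratch = new byte[8];    // allocated once, reused on every call

            double readDoublePreallocated(InputStream in) throws IOException {
              int read = 0;
              while (read < 8) {                           // fill all 8 bytes of the double
                int n = in.read(scratch, read, 8 - read);
                if (n < 0) throw new EOFException();
                read += n;
              }
              long bits = 0;
              for (int i = 0; i < 8; i++) {                // assemble the little-endian bit pattern
                bits |= (scratch[i] & 0xffL) << (8 * i);
              }
              return Double.longBitsToDouble(bits);
            }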

          Scott Carey added a comment -

          I could not reproduce the performance improvement from unrolling the loop. I was able to modify the loop to remove a conditional, which had a small impact, though the result is ugly:

              int n = 0;
              int b;
              int shift = 0;
              do {
                b = in.read();
                n |= (b & 0x7f) << shift;
                shift += 7;
                // continue while the continuation bit (0x80) is set and b is not -1 (EOF)
              } while (((b ^ 0xFFFFFF00) & 0xFFFFFF80) == 0xFFFFFF80);

              if (-1 == b) {
                throw new EOFException();
              }
              return n;

          I have modified the test to pick a broader set of integers to serialize, increased its duration, and increased its warmup time. It is nearly stable now; the JVM is usually done compiling before the test starts. I run the Integer and Long tests twice to help gauge the stability.

          OS X Snow Leopard, 1.6.0_15, -server -XX:+UseCompressedOops.

          Reviewing the code, the primary problem looks to be the ubiquitous use of this slow function:

          InputStream.read();
          

          Unfortunately, this old stream API is slow. It is optimized for bulk transfer and performs poorly when reading small chunks one at a time. Ideally one should always use the bulk-copy operations on InputStream and its relatives where possible.

          I made a quick and dirty stab at removing that problem by adding a byte[] buffer in this class and buffering access to the input stream, allowing array access to retrieve bytes.

          The result is about 2x to 3x as fast!

          I'm attaching a patch that has a test class (modified from Thiru's) and a prototype "FastBinaryDecoder.java". There is room for improvement on top of this. It is a prototype to share; it needs some work to clean it up and merge it into BinaryDecoder. Plus, it is easier to test one implementation against the other when they are in different classes. The new test class has only one line to change to switch implementations.

          The idea is to use array accessors as often as possible, and reduce the number of times that bounds have to be checked. There are ways we could reduce this even further.
          The nio ByteBuffer API is smarter about this sort of thing than InputStream, but raw byte[] access tends to be the fastest, even if it means making an extra copy. An extra 2k copy is almost free (fits in L1 cache) – two or three orders of magnitude faster than a large in-process memcpy or a copy from out of process.
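
          To make the buffering idea concrete, a simplified sketch (hypothetical; the attached prototype is more complete) of serving single-byte reads out of a local array:

            // imports: java.io.EOFException, java.io.IOException, java.io.InputStream
            private final byte[] buf = new byte[2048];   // small buffer; a 2k refill copy is nearly free
            private int pos = 0;                         // next unread byte in buf
            private int limit = 0;                       // number of valid bytes in buf

            int readByteBuffered(InputStream in) throws IOException {
              if (pos >= limit) {                        // refill only when the buffer is drained
                limit = in.read(buf, 0, buf.length);     // one bulk read replaces many single-byte reads
                pos = 0;
                if (limit <= 0) throw new EOFException();
              }
              return buf[pos++] & 0xff;                  // plain array access, no stream call per byte
            }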

          Sample test output below:

          BinaryDecoder (modified, including Thiru's changes but without the loop unrolling):
          -----
          ReadInt({"type":"array","items":"int"}): 945 ms, 31.731626594910882 million numbers decoded /sec
          ReadLong({"type":"array","items":"long"}): 1434 ms, 20.92050209205021 million numbers decoded /sec
          ReadInt({"type":"array","items":"int"}): 831 ms, 36.08644869649733 million numbers decoded /sec
          ReadLong({"type":"array","items":"long"}): 1430 ms, 20.972303975370128 million numbers decoded /sec
          ReadFloat({"type":"array","items":"float"}): 636 ms, 47.09687041296106 million numbers decoded /sec
          ReadDouble({"type":"array","items":"double"}): 772 ms, 38.81877068716988 million numbers decoded /sec
          
          FastBinaryDecoder:
          --------
          ReadInt({"type":"array","items":"int"}): 298 ms, 100.49746243907342 million numbers decoded /sec
          ReadLong({"type":"array","items":"long"}): 793 ms, 37.79032416540069 million numbers decoded /sec
          ReadInt({"type":"array","items":"int"}): 291 ms, 102.76084126875385 million numbers decoded /sec
          ReadLong({"type":"array","items":"long"}): 806 ms, 37.2056106060794 million numbers decoded /sec
          ReadFloat({"type":"array","items":"float"}): 324 ms, 92.45904064499426 million numbers decoded /sec
          ReadDouble({"type":"array","items":"double"}): 345 ms, 86.7109663359125 million numbers decoded /sec
          Scott Carey added a comment -

          Patch includes test class and prototype FastBinaryDecoder class.

          Doug Cutting added a comment -

          Scott: This is great stuff!

          I note that doReadBytes(), once the buffer is exhausted, never calls in.read(bytes, start, length). Shouldn't we bypass the buffer in this case?

          This changes semantics a bit, since currently BinaryDecoder does not buffer. For example, DataFileReader seeks the underlying stream, so in that case we'll need to be able to reset the Decoder's buffer after a seek. There could also be places where an application reads alternately from the Decoder and its underlying InputStream, although, at a glance, I can't find any such place in the code currently.
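
          For illustration, a sketch of the bypass being suggested (hypothetical code, reusing the buf/pos/limit fields from the buffering sketch above): drain whatever is already buffered, then read any large remainder directly from the underlying stream.

            // imports: java.io.EOFException, java.io.IOException, java.io.InputStream
            void doReadBytesSketch(InputStream in, byte[] bytes, int start, int length)
                throws IOException {
              int fromBuffer = Math.min(length, limit - pos);
              System.arraycopy(buf, pos, bytes, start, fromBuffer);  // drain the internal buffer first
              pos += fromBuffer;
              int off = start + fromBuffer;
              int remaining = length - fromBuffer;
              while (remaining > 0) {                    // bypass the buffer for the large remainder
                int n = in.read(bytes, off, remaining);
                if (n < 0) throw new EOFException();
                off += n;
                remaining -= n;
              }
            }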

          Thiruvalluvan M. G. added a comment -

          As pointed out by Doug, this introduces a change in semantics.

          I think it's desirable to keep the property that the decoder does not take more bytes off the input stream than necessary. That property allows the decoder to work with input streams that are not rewindable.

          In any case, since the optimization for readLong() did not show much performance improvement and we may need more time to settle on the new proposal, I'm going ahead and committing the other optimization for readFloat() and readDouble(). I've created a new JIRA, AVRO-327, to take the readLong() improvement forward.

          Thiruvalluvan M. G. added a comment -

          Committed revision 899182. I've retained the improved tests by Scott. Thanks Scott.


            People

            • Assignee:
              Thiruvalluvan M. G.
              Reporter:
              Thiruvalluvan M. G.
            • Votes:
              0
              Watchers:
              1
