Avro
AVRO-557

Speed up one-time data decoding

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.3.2
    • Fix Version/s: 1.4.0
    • Component/s: java
    • Labels:
    • Hadoop Flags:
      Reviewed

      Description

      There are big gains to be had in performance when using a BinaryDecoder and a GenericDatumReader just one time. This is due to the relatively expensive parsing and initialization introduced in 1.3. A patch with example code and a Perf harness will follow.
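The speedup this issue targets hinges on the reuse pattern in the decoder factory API: passing a previously built instance back so its state is reinitialized rather than reconstructed. As a hedged illustration only, here is that pattern in plain Java; SketchDecoder and SketchDecoderFactory are hypothetical placeholders, not Avro classes.

```java
// Hypothetical sketch of the decoder-reuse pattern this issue optimizes for:
// a factory that reinitializes a previously built decoder instead of paying
// full construction cost on every call. Placeholder classes, not Avro's API.
class SketchDecoder {
    byte[] buf;
    int pos;

    // Cheap reinitialization: just repoint at the new input.
    void init(byte[] data) {
        this.buf = data;
        this.pos = 0;
    }
}

class SketchDecoderFactory {
    /** Returns 'reuse' re-pointed at 'data' if given, else a new instance. */
    static SketchDecoder createDecoder(byte[] data, SketchDecoder reuse) {
        SketchDecoder d = (reuse != null) ? reuse : new SketchDecoder();
        d.init(data);
        return d;
    }
}
```

One-time use pays the construction cost on every call; passing the old decoder back amortizes it, which is the gap the benchmarks below measure.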

      1. AVRO-557.patch
        18 kB
        Doug Cutting
      2. AVRO-557.patch
        11 kB
        Doug Cutting
      3. AVRO-557.patch
        6 kB
        Doug Cutting
      4. AVRO-557.patch
        5 kB
        Doug Cutting
      5. AVRO-557.patch
        31 kB
        Kevin Oliver
      6. AVRO-557-partial.patch
        10 kB
        Scott Carey

        Activity

        Doug Cutting added a comment -

        I just committed this. Thanks, Scott!

        Scott Carey added a comment -

        +1

        Performance results for all GenericReader tests using updated test suite.

        Before:

        GenericReaderTest: 4316 ms, 0.9652518597286222 million entries/sec. 37.46562600631532 million bytes/sec
        GenericReaderOneTimeUseReaderTest: 12438 ms, 0.33496174128331707 million entries/sec. 13.002931005679518 million bytes/sec
        GenericReaderOneTimeUseDecoderTest: 4541 ms, 0.9175101858715102 million entries/sec. 35.60243412710682 million bytes/sec
        GenericReaderOneTimeUseTest: 12994 ms, 0.3206316809649748 million entries/sec. 12.444687860652925 million bytes/sec
        GenericReaderTestWithDefaultTest: 18814 ms, 0.22144625531538778 million entries/sec. 8.594070597799654 million bytes/sec
        GenericReaderTestWithOutOfOrderTest: 13612 ms, 0.3060743396647221 million entries/sec. 11.87857138764328 million bytes/sec
        GenericReaderTestWithPromotionTest: 13968 ms, 0.29828822008881134 million entries/sec. 11.577040834808965 million bytes/sec

        After:

        GenericReaderTest: 1038 ms, 4.010383780204211 million entries/sec. 155.65178722683382 million bytes/sec
        GenericReaderOneTimeUseReaderTest: 1315 ms, 3.1672704992679512 million entries/sec. 122.93213306717061 million bytes/sec
        GenericReaderOneTimeUseDecoderTest: 1089 ms, 3.82440681077608 million entries/sec. 148.44554591766487 million bytes/sec
        GenericReaderOneTimeUseTest: 1278 ms, 3.258017977159077 million entries/sec. 126.44808401330889 million bytes/sec
        GenericReaderTestWithDefaultTest: 2281 ms, 1.826564690595329 million entries/sec. 70.89078839731388 million bytes/sec
        GenericReaderTestWithOutOfOrderTest: 1313 ms, 3.1731271581301055 million entries/sec. 123.14764465849947 million bytes/sec
        GenericReaderTestWithPromotionTest: 1627 ms, 2.559650956129309 million entries/sec. 99.33645049856828 million bytes/sec

        About a 5x to 9x performance improvement.

        A quick summary of changes to Perf.java: all GenericReader tests now inherit from one parent class. GenericReader tests write a series of objects; other tests write an array of Records. All GenericReader tests should be consistent now, and it should be much easier to add more variations later. The MB/sec measurement was broken before; it should be correct now.
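As a rough illustration of the refactor described above (one parent class owning the timing and throughput math so every test measures the same way), the following is a hypothetical sketch, not the actual Perf.java:

```java
// Minimal sketch (not the actual Perf.java) of the refactor described above:
// one abstract parent owns timing and throughput math, so every test
// reports entries/sec and bytes/sec consistently.
abstract class PerfTest {
    final String name;
    PerfTest(String name) { this.name = name; }

    /** Decode the payload once; returns the number of entries processed. */
    abstract long readOnce();

    String run(long bytesPerPass, int passes) {
        long start = System.nanoTime();
        long entries = 0;
        for (int i = 0; i < passes; i++) entries += readOnce();
        double sec = Math.max(1e-9, (System.nanoTime() - start) / 1e9);
        // Throughput computed from the bytes actually decoded: the kind of
        // consistent MB/sec measurement the refactor aimed for.
        double millionEntriesPerSec = entries / 1e6 / sec;
        double millionBytesPerSec = bytesPerPass * passes / 1e6 / sec;
        return name + ": " + millionEntriesPerSec + " million entries/sec. "
            + millionBytesPerSec + " million bytes/sec";
    }
}
```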

        Doug Cutting added a comment -

        > I'm afraid using ThreadLocals without any exposed lifecycle control only introduces longer-than-necessary garbage retention

        The ThreadLocal here contains only weak references, so no garbage should be retained longer than necessary. A 10x performance improvement is not something we should ignore lightly.
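The strategy Doug describes (a per-thread cache whose entries are held only weakly) can be sketched with JDK classes alone. ResolverCache below is a hypothetical stand-in, not Avro's implementation:

```java
import java.util.Map;
import java.util.WeakHashMap;

// Sketch of the caching strategy described above: each thread owns its own
// WeakHashMap, so no synchronization is needed, and keys are held only
// weakly, letting entries vanish once the key object becomes otherwise
// unreachable. Hypothetical stand-in, not Avro's code.
final class ResolverCache {
    private static final ThreadLocal<Map<Object, Object>> CACHE =
        ThreadLocal.withInitial(WeakHashMap::new);

    static Object get(Object key) { return CACHE.get().get(key); }

    static void put(Object key, Object value) { CACHE.get().put(key, value); }
}
```

The trade-off Holger raises still applies: the ThreadLocal itself lives as long as the thread does unless something calls remove() on it.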

        Holger Hoffstätte added a comment -

        (just a lurker, sorry to interrupt..)

        Doug, as much as I like performance (really!), I'm afraid using ThreadLocals without any exposed lifecycle control only introduces longer-than-necessary garbage retention, file/memory/thread leaks due to un-undeployable webapps, and all sorts of other headaches.

        Doug Cutting added a comment -

        Here's a new version that incorporates Scott's improved benchmarks. It also uses a ThreadLocal for the cache instead of synchronization, for better thread scalability.

        Before:

        GenericReaderTest: 9267 ms, 0.4496015202175007 million entries/sec.  17.44851002390717 million bytes/sec
        GenericReaderOneTimeUseReaderTest: 20250 ms, 0.2057450903019226 million entries/sec.  7.985813094572571 million bytes/sec
        GenericReaderOneTimeUseDecoderTest: 9712 ms, 0.4289618484945668 million entries/sec.  16.649207221055114 million bytes/sec
        GenericReaderOneTimeUseTest: 22072 ms, 0.18876675392721157 million entries/sec.  7.325522819121037 million bytes/sec
        

        After:

        GenericReaderTest: 1333 ms, 3.123353146179524 million entries/sec.  121.19457294433288 million bytes/sec
        GenericReaderOneTimeUseReaderTest: 2132 ms, 1.9538027524398502 million entries/sec.  75.83652675628883 million bytes/sec
        GenericReaderOneTimeUseDecoderTest: 1410 ms, 2.954731932450708 million entries/sec.  114.68139742190509 million bytes/sec
        GenericReaderOneTimeUseTest: 2064 ms, 2.0183186700653284 million entries/sec.  78.33658376373683 million bytes/sec
        
        Scott Carey added a comment -

        partial patch with updates to Perf.java for measuring performance impact.

        Scott Carey added a comment -

        The one that fixed it for me is not your second one, but the first one you posted.

        Don't trust the current performance numbers from the test. I'll upload a partial patch that only includes enhancements to Perf.java with better measurements right away.

        For me, using your first patch that only caches the resolver, I get:

        GenericReaderTest: 4413 ms, 0.9439455178163628 million entries/sec. 36.63489596549099 million bytes/sec
        GenericReaderOneTimeUseReaderTest: 4563 ms, 0.9130533872713805 million entries/sec. 35.44198664989503 million bytes/sec
        GenericReaderOneTimeUseDecoderTest: 4476 ms, 0.9307708856028631 million entries/sec. 36.128051198318644 million bytes/sec
        GenericReaderOneTimeUseTest: 4644 ms, 0.8970147804525915 million entries/sec. 34.812096233011566 million bytes/sec

        Doug Cutting added a comment -

        The google-collections based version causes test-tools to hang running the tethering test, a bit of multi-threaded Avro. There could well be bugs in the tethering code that the use of Google's concurrent hashmap somehow brings out, but I didn't take the time to debug that.

        Google collections 1.0 is also incompatible with checkstyle, which needs google-collections 0.9. So, until checkstyle 5.1 is available in a maven repo, if we use google collections, then we need to use 0.9.

        Here's a new version that doesn't rely on google collections for the above reasons. The tethering test passes and checkstyle works, but it's a bit slower:

        GenericReaderOneTimeUsageTest: 2264 ms, 1.839572825273033 million entries/sec. 0.008609545204085958 million bytes/sec

        I also tried using an equals-based hashmap, and things slowed to around 3800 milliseconds. Also, folks should not directly compare my timings with Kevin's: my laptop seems about 35% slower than whatever Kevin uses. If that's right, this latest version should be a bit faster than 1.2.

        Scott Carey added a comment -

        Actually that fixes it.

        Perf.java had some inconsistencies. I have corrected those and refactored the class a little. There is still some work to do to make it fully consistent, but with these changes the measurements for the GenericReader tests are accurate.

        I am working on a patch that includes those changes and measures the performance drop when Decoders and DatumReaders are not re-used.

        I am also considering removing the Google Collections change for now to simplify this.

        I think we may want to include that later, in the form of Google Guava since Google Collections has merged into that, but should we leave that for another JIRA?

        Doug Cutting added a comment -

        Okay, here's a winner. This one caches the entire ResolvingDecoder, not just its resolver.

        GenericReaderOneTimeUsageTest: 1793 ms, 2.3228573849749345 million entries/sec. 0.010871407417979413 million bytes/sec

        Kevin, can you confirm whether an identity-based cache of schemas works for your use case? If not, we could try this with an equals-hash and perhaps optimize Schema#hashCode().

        Doug Cutting added a comment -

        Here's a patch that caches resolvers in a weak, concurrent, identity hash map.

        Before:

        GenericReaderOneTimeUsageTest: 20208 ms, 0.2061803736078433 million entries/sec. 9.649627469945864E-4 million bytes/sec

        After:

        GenericReaderOneTimeUsageTest: 9513 ms, 0.4379442048677028 million entries/sec. 0.002049660865215458 million bytes/sec

        So caching resolvers gets some but not all of the speed back.

        Doug Cutting added a comment -

        > hashes on schemas are expensive. (I did separately push for schemas to be immutable so that their hashcodes could be memoized, but that hasn't been done yet).

        Schemas are mostly immutable. You can add properties, and there's setFields(), which is called once during the construction of a record. In the latter case, it's an error to use the schema in any way before setFields() has been called. In the former, we could reset the memoized hashCode whenever a property is added. So I think we could reasonably memoize Schema#hashCode(), and it could be a good performance win.
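The memoize-and-invalidate idea could look roughly like this. MemoizedSchema is a hypothetical stand-in for Avro's Schema, shown only to illustrate resetting the cached hash on mutation:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Avro's Schema, illustrating the memoization
// idea above: cache hashCode() and reset the cache whenever a mutating
// call (here, addProp) occurs.
final class MemoizedSchema {
    private final List<String> props = new ArrayList<>();
    private int hash = 0;  // 0 means "not yet computed"

    void addProp(String prop) {
        props.add(prop);
        hash = 0;  // invalidate the memoized value on mutation
    }

    @Override public int hashCode() {
        if (hash == 0) {
            int h = props.hashCode();
            hash = (h == 0) ? 1 : h;  // never cache the sentinel itself
        }
        return hash;
    }
}
```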

        But perhaps we shouldn't call hashCode() at all, and instead use an identity map. If we used Google collections we could have a weak, identity cache of the form Map<Schema,Map<Schema,Resolver>>. If someone adds a property to a schema after it's used, in a way that changes what the resolver does, they'd have trouble, but (a) they shouldn't do that; and (b) the resolver doesn't currently use any user-defined schema properties anyway, and I don't expect it ever will.
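A weak identity cache of the shape Map<Schema,Map<Schema,Resolver>> can be approximated with the JDK's WeakHashMap alone, since keys that don't override equals()/hashCode() are compared by identity. This is a sketch with empty placeholder classes, not the patch's actual code:

```java
import java.util.Collections;
import java.util.Map;
import java.util.WeakHashMap;

// Sketch of the Map<Schema, Map<Schema, Resolver>> cache above. Schema and
// Resolver are empty placeholders; because they don't override
// equals()/hashCode(), WeakHashMap compares them by identity, yielding a
// weak identity cache without any extra library.
final class Schema {}

final class Resolver {}

final class TwoLevelResolverCache {
    private final Map<Schema, Map<Schema, Resolver>> cache =
        Collections.synchronizedMap(new WeakHashMap<>());

    Resolver resolve(Schema writer, Schema reader) {
        Map<Schema, Resolver> byReader = cache.computeIfAbsent(
            writer, w -> Collections.synchronizedMap(new WeakHashMap<>()));
        return byReader.computeIfAbsent(reader, r -> new Resolver());
    }
}
```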

        Would identity work for your use cases, Kevin? Do you have a table of schemas somewhere, or are you parsing them anew each time?

        Scott Carey added a comment -

        DecoderFactory.createBinaryDecoder(byte[], int, int, BinaryDecoder)
        uses the given byte[] directly as its buffer, so it is the fastest option.

        DecoderFactory.createBinaryDecoder(InputStream, BinaryDecoder)
        uses the internal read buffer of size binaryDecoderBufferSize, filling it from the InputStream. If a user must interleave reads on the given InputStream between some other code and the BinaryDecoder, that other code must either read from the InputStream returned by BinaryDecoder.inputStream() to compensate for the read-ahead that the buffering causes, or the user must switch to the slower DirectBinaryDecoder, which takes only the absolute minimum number of bytes from the InputStream.

        Note that, even if configured to use the DirectBinaryDecoder, the factory will only do so when a factory method that reads from an InputStream is used. When reading from a byte[], it always uses the faster BinaryDecoder.
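The read-ahead effect described above is the same one any buffering wrapper exhibits; the JDK's BufferedInputStream demonstrates it directly (used here as a stand-in for a buffering decoder, not as Avro code):

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;

// Demonstrates the read-ahead effect described above: reading a single
// byte through a buffering wrapper pulls a whole buffer's worth from the
// underlying stream, so code reading the raw stream directly afterwards
// finds bytes already "stolen" by the buffer.
class ReadAheadDemo {
    static int remainingAfterOneRead(int totalBytes, int bufferSize) throws IOException {
        ByteArrayInputStream raw = new ByteArrayInputStream(new byte[totalBytes]);
        BufferedInputStream buffered = new BufferedInputStream(raw, bufferSize);
        buffered.read();         // logically consumes 1 byte...
        return raw.available();  // ...but bufferSize bytes left the raw stream
    }
}
```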

        Kevin Oliver added a comment -

        Yeah. I had noticed my mistaken naming a few hours later. Interesting.

        Where does the buffer size get used, though? I'm looking at DecoderFactory.createBinaryDecoder(byte[], int, int, BinaryDecoder), and it never passes the DecoderFactory's binaryDecoderBufferSize to the BinaryDecoder.

        Scott Carey added a comment -

        > But... Given what you said, shouldn't GenericReaderOneTimeUsageDirectDecoderNoResolverTest have the same performance as GenericReaderOneTimeUsage12Test? Mind you, I'm happy with this improvement, just trying to see what the difference is.

        The test above is not using a DirectBinaryDecoder, but a buffered one with a buffer size of 256 bytes. To test the DirectBinaryDecoder you would have to call configureDirectDecoder(true) on the factory.
        So the above is the new binary decoder (which avoids the slow InputStream API as much as possible, preferring array access), but with a tiny buffer so that lots of allocations won't hurt.

        However, the above test is creating the decoder on a byte[], which avoids the InputStream entirely. If your use case wraps an InputStream, it will be slower than the above because it will create that 256-byte buffer each time.

        Kevin Oliver added a comment -

        Nice. Big improvement.

          private static class GenericReaderOneTimeUsageDirectDecoderNoResolverTest extends GenericReaderOneTimeUsageTest {
            private final DecoderFactory factory;
            
            protected GenericReaderOneTimeUsageDirectDecoderNoResolverTest() throws IOException {
              super("GenericReaderOneTimeUsageDirectDecoderNoResolverTest");
              factory = new DecoderFactory().configureDecoderBufferSize(256);
            }
            @Override protected DatumReader<Object> getReader() {
              GenericDatumReaderWithOptionalResolver<Object> reader = new GenericDatumReaderWithOptionalResolver<Object>(writerSchema);
              reader.setUseResolvingDecoder(false);
              return reader;
            }
            @Override protected Decoder getDecoder() {
              return factory.createBinaryDecoder(data, null);
            }
          }
        

        GenericReaderOneTimeUsage12Test: 1899 ms, 2.1932087119840316 million entries/sec. 0.010264627357179553 million bytes/sec
        GenericReaderOneTimeUsage13Test: 13182 ms, 0.3160544030781795 million entries/sec. 0.0014791937741568462 million bytes/sec
        GenericReaderOneTimeUsageDirectDecoderNoResolverTest: 902 ms, 4.61638520363505 million entries/sec. 0.02160554697489103 million bytes/sec

        But... Given what you said, shouldn't GenericReaderOneTimeUsageDirectDecoderNoResolverTest have the same performance as GenericReaderOneTimeUsage12Test? Mind you, I'm happy with this improvement, just trying to see what the difference is.

        Ok, so it looks like one half of this issue is resolved; thanks Scott. On to GenericDatumReader. I'd prefer not to do caching; as you say, the hashes on schemas are expensive. (I did separately push for schemas to be immutable so that their hashcodes could be memoized, but that hasn't been done yet.)

        I dug into this change a bit, and it looks like AVRO-388 added ResolvingDecoders to GenericDatumReader. In version 1 of that patch, https://issues.apache.org/jira/secure/attachment/12431862/AVRO-388.patch, the resolver was optional (only used when the actual and expected schemas differed).

        Scott Carey added a comment -

        Interesting. DirectBinaryDecoder is the same as the 1.2 decoders, so I suspect that the schema parsing from new GenericDatumReader is the main culprit.

        What if you use your GenericDatumReaderWithOptionalResolver but with the 1.3 BinaryDecoder? Something like:

        private static class GenericReaderOneTimeUsage13Test extends GenericReaderOneTimeUsageTest {
            private final DecoderFactory factory;
            
            protected GenericReaderOneTimeUsage13Test() throws IOException {
              super("GenericReaderOneTimeUsage13Test");
              factory = new DecoderFactory().configureDecoderBufferSize(256);
            }
            @Override protected DatumReader<Object> getReader() {
              GenericDatumReaderWithOptionalResolver<Object> reader = new GenericDatumReaderWithOptionalResolver<Object>(writerSchema);
              reader.setUseResolvingDecoder(false);
              return reader;
            }
            @Override protected Decoder getDecoder() {
              return factory.createBinaryDecoder(data, null);
            }
          }
        

        There is room for improvement in how GenericDatumReader works with resolvers: we could cache them per (expected, actual) schema pair, for example, to avoid repeated parsing. But that won't be free either; equals() on a Schema is not trivial.

        Kevin Oliver added a comment -

        Thanks Scott.

        Unfortunately, I don't see any improvement with that. Here's a revised 1.3 test:

          private static class GenericReaderOneTimeUsage13Test extends GenericReaderOneTimeUsageTest {
            private final DecoderFactory factory;
            
            protected GenericReaderOneTimeUsage13Test() throws IOException {
              super("GenericReaderOneTimeUsage13Test");
              factory = new DecoderFactory().configureDecoderBufferSize(256);
            }
            @Override protected DatumReader<Object> getReader() {
              return new GenericDatumReader<Object>(writerSchema);
            }
            @Override protected Decoder getDecoder() {
              return factory.createBinaryDecoder(data, null);
            }
          }
        

        GenericReaderOneTimeUsage12Test: 2108 ms, 1.9756520386699568 million entries/sec. 0.00924642139783131 million bytes/sec
        GenericReaderOneTimeUsage13Test: 13197 ms, 0.31569739270247654 million entries/sec. 0.0014775228987635406 million bytes/sec

        I also tried having the factory configured using direct decoders and a small buffer and the results were about the same:
        GenericReaderOneTimeUsage13Test: 12738 ms, 0.32707876157061233 million entries/sec. 0.0015307898357438956 million bytes/sec

        As far as storing them in ThreadLocals, I have not tested that internally and I'm not in love with that approach. What about adding another option onto DecoderFactory for one-time use?

        Scott Carey added a comment -

Note that 'new BinaryDecoder()' is deprecated now as well. The factory allows you to control the performance behavior better.

        I suspect that if you changed the 1.3 'one time' test from:

          private static class GenericReaderOneTimeUsage13Test extends GenericReaderOneTimeUsageTest {
            protected GenericReaderOneTimeUsage13Test() throws IOException {
              super("GenericReaderOneTimeUsage13Test");
            }
            @Override protected DatumReader<Object> getReader() {
              return new GenericDatumReader<Object>(writerSchema);
            }
            @Override protected Decoder getDecoder() {
              return DecoderFactory.defaultFactory().createBinaryDecoder(data, null);
            }
          }
        

        to

          static DecoderFactory factory = new DecoderFactory();
          static {
            factory.configureDecoderBufferSize(256);
          }
          private static class GenericReaderOneTimeUsage13Test extends GenericReaderOneTimeUsageTest {
            protected GenericReaderOneTimeUsage13Test() throws IOException {
              super("GenericReaderOneTimeUsage13Test");
            }
            @Override protected DatumReader<Object> getReader() {
              return new GenericDatumReader<Object>(writerSchema);
            }
            @Override protected Decoder getDecoder() {
              return factory.createBinaryDecoder(data, null);
            }
          }
        

the performance would be much better.

        Scott Carey added a comment -

        2) BinaryDecoders now created a bunch of arrays and got more complicated, again significantly slowing down one time usage.

Either configure the factory to have a small default buffer size (256 bytes will be fine), or use the DirectBinaryDecoder, which is basically the 1.2 decoder. See
DecoderFactory and its configureDecoderBufferSize(), configureDirectDecoder() and createXXX factory methods.
You can configure your own factory instance; the default factory does not allow you to configure it.

        The BinaryDecoder is optimized for the case where you are keeping one around and re-using it. If possible, keep them around. You can use something like a ThreadLocal<BinaryDecoder> to make this easier. DirectBinaryDecoder will have smaller initialization costs, but will be about 1.5x to 5x slower when streaming large amounts of data.
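The ThreadLocal suggestion above can be sketched generically. The ExpensiveDecoder class below is a hypothetical stand-in for BinaryDecoder (the real reuse API is DecoderFactory's createBinaryDecoder(data, reuse)); the point is simply that each thread pays the construction cost once and every later call reuses the same instance.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class ThreadLocalReuseSketch {
    // Hypothetical stand-in for an expensive-to-construct decoder.
    static class ExpensiveDecoder {
        static final AtomicInteger constructions = new AtomicInteger();
        ExpensiveDecoder() {
            constructions.incrementAndGet(); // e.g. allocating internal buffers
        }
        int decode(int input) {
            return input; // placeholder for real decoding work
        }
    }

    // Each thread lazily builds one decoder and keeps reusing it.
    static final ThreadLocal<ExpensiveDecoder> DECODER =
        ThreadLocal.withInitial(ExpensiveDecoder::new);

    public static void main(String[] args) {
        for (int i = 0; i < 1000; i++) {
            DECODER.get().decode(i); // same instance on every iteration
        }
        // Only one decoder was ever constructed on this thread.
        System.out.println(ExpensiveDecoder.constructions.get());
    }
}
```

As Kevin notes below, this trades setup cost for the bookkeeping of thread-local state, which not everyone is comfortable keeping around.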

If you aren't reading from a stream, but are reading from a byte[], the new decoder will work without creating its own array as well if you use the createBinaryDecoder(byte[], ...) factory methods. You can reuse a decoder and reset its state from these as well.

        Kevin Oliver added a comment -

We do a decent amount of one-time usage of BinaryDecoders and GenericDatumReaders. When we upgraded to Avro 1.3 we saw a significant regression in decoding performance. A profiler showed the issue pretty quickly.

Basically, it boiled down to 2 issues:
        1) Having GenericDatumReaders always create the ResolvingDecoder is too expensive for one time usage.
        2) BinaryDecoders now created a bunch of arrays and got more complicated, again significantly slowing down one time usage.

        I'm attaching a patch that has a somewhat hacky workaround. I've resurrected the BinaryDecoder code from v1.2 (more or less). I've also created a GenericDatumReaderWithOptionalResolver class that basically forks GenericDatumReader to allow for reading directly from the supplied decoder.

        Running the newly added 'Perf -GoneTimeUse' you can see the stark difference:
        GenericReaderOneTimeUsage12Test: 2175 ms, 1.9147720770945649 million entries/sec. 0.008961491780473783 million bytes/sec
        GenericReaderOneTimeUsage13Test: 13152 ms, 0.3167766318368232 million entries/sec. 0.0014825739399539307 million bytes/sec

        I don't believe we should commit the patch as is. But I'd like some feedback on how to go from here to get this performance back.
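The one-time-use pattern described above can be sketched abstractly. The Setup class below is a hypothetical stand-in for the combined reader/decoder initialization (schema parsing, resolver construction, buffer allocation): constructing it fresh for every record pays the full setup cost each time, while hoisting it out of the loop pays it once.

```java
public class OneTimeUseSketch {
    // Counts how often the expensive setup runs.
    static int setupCalls = 0;

    // Hypothetical stand-in for reader/decoder initialization.
    static class Setup {
        Setup() { setupCalls++; }             // parsing, allocation, ...
        int read(int datum) { return datum; } // placeholder for decoding
    }

    public static void main(String[] args) {
        int n = 100;

        // One-time-use pattern: full setup cost on every record.
        for (int i = 0; i < n; i++) {
            new Setup().read(i);
        }
        int oneTimeCost = setupCalls;

        // Reuse pattern: setup cost paid once, amortized over all records.
        setupCalls = 0;
        Setup reused = new Setup();
        for (int i = 0; i < n; i++) {
            reused.read(i);
        }
        System.out.println(oneTimeCost + " vs " + setupCalls);
    }
}
```

The 1.3 changes made Setup's analogue more expensive, which only hurts callers following the first pattern; the rest of the thread is about making that pattern cheap again.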


  People

  • Assignee: Doug Cutting
  • Reporter: Kevin Oliver
  • Votes: 0
  • Watchers: 3
