Avro
  1. Avro
  2. AVRO-388

Using ResolvingDecoder in GenericDatumReader

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.3.0
    • Component/s: java
    • Labels:
      None

      Description

      This patch removes GenericDatumReader's own logic for schema resolution by relying on ResolvingDecoder. The code has become somewhat compact. The duplication logic we were concerned about in AVRO-383 is gone.

      There are three changes to the public interface of GenericDatumReader:

      • readInt() and readString() functions no longer provide the writer's schema. I think it is reasonable that code outside GenericDatumReader should not worry about schema resolution and thus don't have to know about writer's schema.
      • setSchema() function of DatumReader may throw IOException.

      None in Avro codebase is affected by these changes.

      In terms of performance:

      • When there is no resolution required, that is reader's and writer's schemas are identical, there is no change in performance. In fact, ResolvingEncoder is not used in this case. Use Perf -G to measuer.
      • If the resolution involves only promotions (from int to long) the new code is about 5% faster. User Perf -Gp
      • If the resolution involves supplying default values from the reader's schema because writer's schema does have a field, the node is faster about 15%. This improvement is probably due to AVRO-383. User Perf -Gd to measure this.
      • If the resolution involves field reordering (reader's and writer's schema have the same fields, but in different order) the new code is slower by about 20%. This is because, ResolvingDecoder provides information on field reordering through a vector of Schema.Field objects. The Field class has the field position but not the field name. But GenericDatumWriter's setField() function needs the field name. getFieldName() function introduced in this patch does a linear search to find the name from the reader's schema. This is inefficient. We can handle this problem in any of the following ways:
        1. Do nothing claiming that field reordering is rare and we can live with this inefficiency. I tried this and saw that the new code performs about 5-10% faster than the old.
        2. Change the signature of the setField() so that it does not require the field name. This is an incomaptible change to the GenericDatumReader's public interface. None of the code in the current Avro will need field name. But if somebody has extended GenericDatumReader, they will be affected.
        3. Have a new class FieldInfo which has the name and position of the field and make ResolvingDecoder return an array of FieldInfo instead of Field. It's not obvious, which package/class this new class would go to. If one wants a clean dependencies, probably it belongs to io package. I'm not confortable adding another class into the io package at the present. Keeping it as an inner class of ResolvingDecoder will create circular dependencies between the io.parsing package and ResolvingDecoder class.
        4. Instead of a new class FieldInfo above use Map.Entry<String, Field>.
        5. Add a new field called "name" to Schema.Field class. It will be somewhat redundant because the key of the fields map of RecordSchema is the field name. With this, the field name will be available both as the key and as a part of value of the map.

      My personal preference is 5 followed by 4.

      1. AVRO-388.patch
        22 kB
        Thiruvalluvan M. G.
      2. AVRO-388-new.patch
        23 kB
        Thiruvalluvan M. G.
      3. AVRO-388-test.patch
        15 kB
        Thiruvalluvan M. G.

        Activity

        Hide
        Scott Carey added a comment -

        It doesn't seem to be a leak, but it does seem to be a large chunk of state and garbage creation from the Generic tests.

        I modified the test so that it frees the Test instances as it runs them, which helps delay the OOME or avoid it. increasing the number of iterations does not cause an OOME so its not a leak, but the total memory required to run the test is a lot as there is much memory allocation and GC thrashing. The arrays for all tests are allocated up front. So the latter tests have more space, but these have GC issues.

        Some jmap -histo output during the Generic tests below:

         num     #instances         #bytes  class name
        ----------------------------------------------
           1:          1427       59024200  [B
           2:         27956        3624672  [Ljava.lang.Object;
           3:         82825        1987800  java.lang.Integer
           4:         82569        1981656  java.lang.Double
           5:         12289        1575376  <constMethodKlass>
           6:         12289        1480328  <methodKlass>
           7:          1026        1128752  <constantPoolKlass>
           8:         19468         956488  <symbolKlass>
           9:         27523         880736  org.apache.avro.generic.GenericData$Record
          10:          1026         744032  <instanceKlassKlass>
          11:           874         692128  <constantPoolCacheKlass>
          12:           994         629936  [I
          13:          2925         294576  [C
          14:          1161         213624  java.lang.Class
        
         num     #instances         #bytes  class name
        ----------------------------------------------
           1:          1427       59024200  [B
           2:        192562       15476304  [Ljava.lang.Object;
           3:        576644       13839456  java.lang.Integer
           4:        576388       13833312  java.lang.Double
           5:        192129        6148128  org.apache.avro.generic.GenericData$Record
           6:         12289        1575376  <constMethodKlass>
           7:         12289        1480328  <methodKlass>
           8:          1026        1128752  <constantPoolKlass>
           9:         19468         956488  <symbolKlass>
          10:          1026         744032  <instanceKlassKlass>
          11:           874         692128  <constantPoolCacheKlass>
          12:           994         636696  [I
        
         num     #instances         #bytes  class name
        ----------------------------------------------
           1:        473087       77890584  [B
           2:        236262       23995904  [Ljava.lang.Object;
           3:        707740       16985760  java.lang.Integer
           4:        707483       16979592  java.lang.Double
           5:        471654       15092928  org.apache.avro.util.Utf8
           6:        235828        7546496  org.apache.avro.generic.GenericData$Record
           7:         12289        1575376  <constMethodKlass>
           8:         12289        1480328  <methodKlass>
           9:          7547        1170144  [I
          10:          1026        1128752  <constantPoolKlass>
          11:         19468         956488  <symbolKlass>
          12:          1026         744032  <instanceKlassKlass>
          13:           874         692128  <constantPoolCacheKlass>
        
        

        It appears as though there is a very large volume of boxed Integer and Double objects, and in one test a lot of Utf8 and Byte[] creation. There is a large number of reflection related objects as well.

        Performance can improve if allocations of the above can be reduced in the Generic API case. This is a single threaded test, but with two GC threads active, the CPU usage on my machine during the Generic tests is 130% to 150%. This indicates that about 30% to 50% of the time is in GC . A multithreaded test would cause that ratio to go up to close to 66% GC time.

        Details on GC time calculation for 150% CPU usage on single threaded app:
        (1 * (app_time) + 2 * (gc_thread_time) ) = 150% ;
        only a gc thread can go above 100% CPU, so the min time for one gc_thread is 50%. This leaves 50% of one CPU to the app thread.
        By clock time 50% is the app thread, and 50% is both GC threads at the same time.
        By total CPU use, GC is 100% of a CPU, and the app is 50%, a 2:1 ratio.

        Show
        Scott Carey added a comment - It doesn't seem to be a leak, but it does seem to be a large chunk of state and garbage creation from the Generic tests. I modified the test so that it frees the Test instances as it runs them, which helps delay the OOME or avoid it. increasing the number of iterations does not cause an OOME so its not a leak, but the total memory required to run the test is a lot as there is much memory allocation and GC thrashing. The arrays for all tests are allocated up front. So the latter tests have more space, but these have GC issues. Some jmap -histo output during the Generic tests below: num #instances #bytes class name ---------------------------------------------- 1: 1427 59024200 [B 2: 27956 3624672 [Ljava.lang.Object; 3: 82825 1987800 java.lang.Integer 4: 82569 1981656 java.lang.Double 5: 12289 1575376 <constMethodKlass> 6: 12289 1480328 <methodKlass> 7: 1026 1128752 <constantPoolKlass> 8: 19468 956488 <symbolKlass> 9: 27523 880736 org.apache.avro.generic.GenericData$Record 10: 1026 744032 <instanceKlassKlass> 11: 874 692128 <constantPoolCacheKlass> 12: 994 629936 [I 13: 2925 294576 [C 14: 1161 213624 java.lang.Class num #instances #bytes class name ---------------------------------------------- 1: 1427 59024200 [B 2: 192562 15476304 [Ljava.lang.Object; 3: 576644 13839456 java.lang.Integer 4: 576388 13833312 java.lang.Double 5: 192129 6148128 org.apache.avro.generic.GenericData$Record 6: 12289 1575376 <constMethodKlass> 7: 12289 1480328 <methodKlass> 8: 1026 1128752 <constantPoolKlass> 9: 19468 956488 <symbolKlass> 10: 1026 744032 <instanceKlassKlass> 11: 874 692128 <constantPoolCacheKlass> 12: 994 636696 [I num #instances #bytes class name ---------------------------------------------- 1: 473087 77890584 [B 2: 236262 23995904 [Ljava.lang.Object; 3: 707740 16985760 java.lang.Integer 4: 707483 16979592 java.lang.Double 5: 471654 15092928 org.apache.avro.util.Utf8 6: 235828 7546496 org.apache.avro.generic.GenericData$Record 7: 12289 1575376 <constMethodKlass> 8: 12289 1480328 <methodKlass> 9: 7547 1170144 [I 10: 1026 1128752 <constantPoolKlass> 11: 19468 956488 <symbolKlass> 12: 1026 744032 <instanceKlassKlass> 13: 874 692128 <constantPoolCacheKlass> It appears as though there is a very large volume of boxed Integer and Double objects, and in one test a lot of Utf8 and Byte[] creation. There is a large number of reflection related objects as well. Performance can improve if allocations of the above can be reduced in the Generic API case. This is a single threaded test, but with two GC threads active, the CPU usage on my machine during the Generic tests is 130% to 150%. This indicates that about 30% to 50% of the time is in GC . A multithreaded test would cause that ratio to go up to close to 66% GC time. Details on GC time calculation for 150% CPU usage on single threaded app: (1 * (app_time) + 2 * (gc_thread_time) ) = 150% ; only a gc thread can go above 100% CPU, so the min time for one gc_thread is 50%. This leaves 50% of one CPU to the app thread. By clock time 50% is the app thread, and 50% is both GC threads at the same time. By total CPU use, GC is 100% of a CPU, and the app is 50%, a 2:1 ratio.
        Hide
        Scott Carey added a comment -

        There is a memory leak somewhere in the test, or in the code.

        If I run all the tests one after another, I get an OutOfMemoryError. If I set it to -Xmx256m, it finishes (barely) while the GC is going nuts (seen in jconsole).

        It might be the test. I'll report back if I can fix it by setting some stuff to null in the test.

        Show
        Scott Carey added a comment - There is a memory leak somewhere in the test, or in the code. If I run all the tests one after another, I get an OutOfMemoryError. If I set it to -Xmx256m, it finishes (barely) while the GC is going nuts (seen in jconsole). It might be the test. I'll report back if I can fix it by setting some stuff to null in the test.
        Hide
        Thiruvalluvan M. G. added a comment -

        Committed revision 905865 incorporating Doug's suggestions on initializing resolver and order of switch cases. Thanks Doug.

        Show
        Thiruvalluvan M. G. added a comment - Committed revision 905865 incorporating Doug's suggestions on initializing resolver and order of switch cases. Thanks Doug.
        Hide
        Doug Cutting added a comment -

        +1 Overall this looks good.

        A few minor changes:

        • The common call sequence, e.g., when opening a data file or in RPC, is that the DatumReader constructor is passed the expected schema and the actual schema is subsequently set by the container with #setSchema(). In this common case, setResolver() is called twice. This is only done once per file, but still might be a substantial, redundant, computation. I wonder if rather setting expected or actual should cause resolver to be set to null, then construct the resolver in #read(D, decoder) when it's null?
        • I'd prefer if the switch statement in #read(Object,Schema,Decoder) were ordered the same as the symbols in the Type enum. I've tried to everywhere use this same order, for consistency. So UNION should go between MAP and FIXED. Not a big deal.
        Show
        Doug Cutting added a comment - +1 Overall this looks good. A few minor changes: The common call sequence, e.g., when opening a data file or in RPC, is that the DatumReader constructor is passed the expected schema and the actual schema is subsequently set by the container with #setSchema(). In this common case, setResolver() is called twice. This is only done once per file, but still might be a substantial, redundant, computation. I wonder if rather setting expected or actual should cause resolver to be set to null, then construct the resolver in #read(D, decoder) when it's null? I'd prefer if the switch statement in #read(Object,Schema,Decoder) were ordered the same as the symbols in the Type enum. I've tried to everywhere use this same order, for consistency. So UNION should go between MAP and FIXED. Not a big deal.
        Hide
        Thiruvalluvan M. G. added a comment -

        The new patch has name field in Schema.Field class.

        With this, resolution involving promotion, resolution involving field-reordering, resolution involving default values are all faster by about 20%.

        What is surprising is that even when there is no resolution involved (reader's and writer's schemas are identical), using ResolvingDecoder is about 15% faster. This is probably because, we no longer walk over the field list of the record schema; instead we use the field array returned by readFieldOrder(). So in this patch I use ResolvingDecoder for all cases including where no resolution is required.

        Apply the test.patch and run "Perf -G", "Perf -Gd", "Perf -Go" and "Perf -Gp" to measure the performance without the patch. Then apply the new.patch and run the same tests to measure the performance after the patch.

        Show
        Thiruvalluvan M. G. added a comment - The new patch has name field in Schema.Field class. With this, resolution involving promotion, resolution involving field-reordering, resolution involving default values are all faster by about 20%. What is surprising is that even when there is no resolution involved (reader's and writer's schemas are identical), using ResolvingDecoder is about 15% faster. This is probably because, we no longer walk over the field list of the record schema; instead we use the field array returned by readFieldOrder(). So in this patch I use ResolvingDecoder for all cases including where no resolution is required. Apply the test.patch and run "Perf -G", "Perf -Gd", "Perf -Go" and "Perf -Gp" to measure the performance without the patch. Then apply the new.patch and run the same tests to measure the performance after the patch.
        Hide
        Doug Cutting added a comment -

        It's worth noting that, once this is committed, all Java DatumReader implementations will be parser-based. This is a bit frightening, since I think few beside Thiru understand fully how the this works. It does have documentation.

        http://hadoop.apache.org/avro/docs/current/api/java/org/apache/avro/io/parsing/doc-files/parsing.html

        But I have never made it through a detailed read of that document and only claim a partial understanding.

        The parser-based implementation has real advantages: it's faster and it simplifies things like validation, json encoders/decoders, etc. So, on the whole, I think this is a good change, but it does scare me a bit, since, without Thiru, we'd have a hard time maintaining it.

        Show
        Doug Cutting added a comment - It's worth noting that, once this is committed, all Java DatumReader implementations will be parser-based. This is a bit frightening, since I think few beside Thiru understand fully how the this works. It does have documentation. http://hadoop.apache.org/avro/docs/current/api/java/org/apache/avro/io/parsing/doc-files/parsing.html But I have never made it through a detailed read of that document and only claim a partial understanding. The parser-based implementation has real advantages: it's faster and it simplifies things like validation, json encoders/decoders, etc. So, on the whole, I think this is a good change, but it does scare me a bit, since, without Thiru, we'd have a hard time maintaining it.
        Hide
        Doug Cutting added a comment -

        > 5. Add a new field called "name" to Schema.Field class.

        This seems reasonable to me.

        Show
        Doug Cutting added a comment - > 5. Add a new field called "name" to Schema.Field class. This seems reasonable to me.

          People

          • Assignee:
            Thiruvalluvan M. G.
            Reporter:
            Thiruvalluvan M. G.
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development