Avro
  1. Avro
  2. AVRO-388

Using ResolvingDecoder in GenericDatumReader

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.3.0
    • Component/s: java
    • Labels:
      None

      Description

      This patch removes GenericDatumReader's own logic for schema resolution by relying on ResolvingDecoder. The code has become somewhat compact. The duplication logic we were concerned about in AVRO-383 is gone.

      There are three changes to the public interface of GenericDatumReader:

      • readInt() and readString() functions no longer provide the writer's schema. I think it is reasonable that code outside GenericDatumReader should not worry about schema resolution and thus don't have to know about writer's schema.
      • setSchema() function of DatumReader may throw IOException.

      None in Avro codebase is affected by these changes.

      In terms of performance:

      • When there is no resolution required, that is reader's and writer's schemas are identical, there is no change in performance. In fact, ResolvingEncoder is not used in this case. Use Perf -G to measuer.
      • If the resolution involves only promotions (from int to long) the new code is about 5% faster. User Perf -Gp
      • If the resolution involves supplying default values from the reader's schema because writer's schema does have a field, the node is faster about 15%. This improvement is probably due to AVRO-383. User Perf -Gd to measure this.
      • If the resolution involves field reordering (reader's and writer's schema have the same fields, but in different order) the new code is slower by about 20%. This is because, ResolvingDecoder provides information on field reordering through a vector of Schema.Field objects. The Field class has the field position but not the field name. But GenericDatumWriter's setField() function needs the field name. getFieldName() function introduced in this patch does a linear search to find the name from the reader's schema. This is inefficient. We can handle this problem in any of the following ways:
        1. Do nothing claiming that field reordering is rare and we can live with this inefficiency. I tried this and saw that the new code performs about 5-10% faster than the old.
        2. Change the signature of the setField() so that it does not require the field name. This is an incomaptible change to the GenericDatumReader's public interface. None of the code in the current Avro will need field name. But if somebody has extended GenericDatumReader, they will be affected.
        3. Have a new class FieldInfo which has the name and position of the field and make ResolvingDecoder return an array of FieldInfo instead of Field. It's not obvious, which package/class this new class would go to. If one wants a clean dependencies, probably it belongs to io package. I'm not confortable adding another class into the io package at the present. Keeping it as an inner class of ResolvingDecoder will create circular dependencies between the io.parsing package and ResolvingDecoder class.
        4. Instead of a new class FieldInfo above use Map.Entry<String, Field>.
        5. Add a new field called "name" to Schema.Field class. It will be somewhat redundant because the key of the fields map of RecordSchema is the field name. With this, the field name will be available both as the key and as a part of value of the map.

      My personal preference is 5 followed by 4.

      1. AVRO-388.patch
        22 kB
        Thiruvalluvan M. G.
      2. AVRO-388-new.patch
        23 kB
        Thiruvalluvan M. G.
      3. AVRO-388-test.patch
        15 kB
        Thiruvalluvan M. G.

        Activity

        Thiruvalluvan M. G. created issue -
        Thiruvalluvan M. G. made changes -
        Field Original Value New Value
        Summary Using ResolvingDecoder in GenericDatumWriter Using ResolvingDecoder in GenericDatumReader
        Description This patch removes GenericDatumWriter's own logic for schema resolution by relying on ResolvingDecoder. The code has become somewhat compact. The duplication logic we were concerned about in AVRO-383 is gone. In terms of performance:

           * When there is no resolution required, that is reader's and writer's schemas are identical, there is no change in performance. In fact, ResolvingEncoder is not used in this case. Use Perf -G to measuer.
           * If the resolution involves only promotions (from int to long) the new code is about 5% faster. User Perf -Gp
           * If the resolution involves supplying default values from the reader's schema because writer's schema does have a field, the node is faster about 15%. This improvement is probably due to AVRO-383. User Perf -Gd to measure this.
           * If the resolution involves field reordering (reader's and writer's schema have the same fields, but in different order) the new code is _slower_ by about 20%. This is because, ResolvingDecoder provides information on field reordering through a vector of Schema.Field objects. The Field class has the field position but not the field name. But GenericDatumWriter's setField() function needs the field name. getFieldName() function introduced in this patch does a linear search to find the name from the reader's schema. This is inefficient. We can handle this problem in any of the following ways:
              1. Do nothing claiming that field reordering is rare and we can live with this inefficiency. I tried this and saw that the new code performs about 5-10% faster than the old.
              2. Change the signature of the setField() so that it does not require the field name. This is an incomaptible change to the GenericDatumReader's public interface. None of the code in the current Avro will need field name. But if somebody has extended GenericDatumReader, they will be affected.
              3. Have a new class FieldInfo which has the name and position of the field and make ResolvingDecoder return an array of FieldInfo instead of Field. It's not obvious, which package/class this new class would go to. If one wants a clean dependencies, probably it belongs to io package. I'm not confortable adding another class into the io package at the present. Keeping it as an inner class of ResolvingDecoder will create circular dependencies between the io.parsing package and ResolvingDecoder class.
             4. Instead of a new class FieldInfo above use Map.Entry<String, Field>.
             5. Add a new field name to Schema.Field. It will be somewhat redundant because the key of the fields map of RecordSchema is the field name. With this the field name will be available both as the key and as a part of value of the map.

        My personal preference is 5 followed by 4.
        This patch removes GenericDatumReader's own logic for schema resolution by relying on ResolvingDecoder. The code has become somewhat compact. The duplication logic we were concerned about in AVRO-383 is gone. In terms of performance:

           * When there is no resolution required, that is reader's and writer's schemas are identical, there is no change in performance. In fact, ResolvingEncoder is not used in this case. Use Perf -G to measuer.
           * If the resolution involves only promotions (from int to long) the new code is about 5% faster. User Perf -Gp
           * If the resolution involves supplying default values from the reader's schema because writer's schema does have a field, the node is faster about 15%. This improvement is probably due to AVRO-383. User Perf -Gd to measure this.
           * If the resolution involves field reordering (reader's and writer's schema have the same fields, but in different order) the new code is _slower_ by about 20%. This is because, ResolvingDecoder provides information on field reordering through a vector of Schema.Field objects. The Field class has the field position but not the field name. But GenericDatumWriter's setField() function needs the field name. getFieldName() function introduced in this patch does a linear search to find the name from the reader's schema. This is inefficient. We can handle this problem in any of the following ways:
              1. Do nothing claiming that field reordering is rare and we can live with this inefficiency. I tried this and saw that the new code performs about 5-10% faster than the old.
              2. Change the signature of the setField() so that it does not require the field name. This is an incomaptible change to the GenericDatumReader's public interface. None of the code in the current Avro will need field name. But if somebody has extended GenericDatumReader, they will be affected.
              3. Have a new class FieldInfo which has the name and position of the field and make ResolvingDecoder return an array of FieldInfo instead of Field. It's not obvious, which package/class this new class would go to. If one wants a clean dependencies, probably it belongs to io package. I'm not confortable adding another class into the io package at the present. Keeping it as an inner class of ResolvingDecoder will create circular dependencies between the io.parsing package and ResolvingDecoder class.
             4. Instead of a new class FieldInfo above use Map.Entry<String, Field>.
             5. Add a new field name to Schema.Field. It will be somewhat redundant because the key of the fields map of RecordSchema is the field name. With this the field name will be available both as the key and as a part of value of the map.

        My personal preference is 5 followed by 4.
        Thiruvalluvan M. G. made changes -
        Description This patch removes GenericDatumReader's own logic for schema resolution by relying on ResolvingDecoder. The code has become somewhat compact. The duplication logic we were concerned about in AVRO-383 is gone. In terms of performance:

           * When there is no resolution required, that is reader's and writer's schemas are identical, there is no change in performance. In fact, ResolvingEncoder is not used in this case. Use Perf -G to measuer.
           * If the resolution involves only promotions (from int to long) the new code is about 5% faster. User Perf -Gp
           * If the resolution involves supplying default values from the reader's schema because writer's schema does have a field, the node is faster about 15%. This improvement is probably due to AVRO-383. User Perf -Gd to measure this.
           * If the resolution involves field reordering (reader's and writer's schema have the same fields, but in different order) the new code is _slower_ by about 20%. This is because, ResolvingDecoder provides information on field reordering through a vector of Schema.Field objects. The Field class has the field position but not the field name. But GenericDatumWriter's setField() function needs the field name. getFieldName() function introduced in this patch does a linear search to find the name from the reader's schema. This is inefficient. We can handle this problem in any of the following ways:
              1. Do nothing claiming that field reordering is rare and we can live with this inefficiency. I tried this and saw that the new code performs about 5-10% faster than the old.
              2. Change the signature of the setField() so that it does not require the field name. This is an incomaptible change to the GenericDatumReader's public interface. None of the code in the current Avro will need field name. But if somebody has extended GenericDatumReader, they will be affected.
              3. Have a new class FieldInfo which has the name and position of the field and make ResolvingDecoder return an array of FieldInfo instead of Field. It's not obvious, which package/class this new class would go to. If one wants a clean dependencies, probably it belongs to io package. I'm not confortable adding another class into the io package at the present. Keeping it as an inner class of ResolvingDecoder will create circular dependencies between the io.parsing package and ResolvingDecoder class.
             4. Instead of a new class FieldInfo above use Map.Entry<String, Field>.
             5. Add a new field name to Schema.Field. It will be somewhat redundant because the key of the fields map of RecordSchema is the field name. With this the field name will be available both as the key and as a part of value of the map.

        My personal preference is 5 followed by 4.
        This patch removes GenericDatumReader's own logic for schema resolution by relying on ResolvingDecoder. The code has become somewhat compact. The duplication logic we were concerned about in AVRO-383 is gone.

        There are three changes to the public interface of GenericDatumReader:
           * readInt() and readString() functions no longer provide the writer's schema. I think it is reasonable that code outside GenericDatumReader should not worry about schema resolution and thus don't have to know about writer's schema.
           * setSchema() function of DatumReader may throw IOException.

        None in Avro codebase is affected by these changes.

        In terms of performance:

           * When there is no resolution required, that is reader's and writer's schemas are identical, there is no change in performance. In fact, ResolvingEncoder is not used in this case. Use Perf -G to measuer.
           * If the resolution involves only promotions (from int to long) the new code is about 5% faster. User Perf -Gp
           * If the resolution involves supplying default values from the reader's schema because writer's schema does have a field, the node is faster about 15%. This improvement is probably due to AVRO-383. User Perf -Gd to measure this.
           * If the resolution involves field reordering (reader's and writer's schema have the same fields, but in different order) the new code is _slower_ by about 20%. This is because, ResolvingDecoder provides information on field reordering through a vector of Schema.Field objects. The Field class has the field position but not the field name. But GenericDatumWriter's setField() function needs the field name. getFieldName() function introduced in this patch does a linear search to find the name from the reader's schema. This is inefficient. We can handle this problem in any of the following ways:
              1. Do nothing claiming that field reordering is rare and we can live with this inefficiency. I tried this and saw that the new code performs about 5-10% faster than the old.
              2. Change the signature of the setField() so that it does not require the field name. This is an incomaptible change to the GenericDatumReader's public interface. None of the code in the current Avro will need field name. But if somebody has extended GenericDatumReader, they will be affected.
              3. Have a new class FieldInfo which has the name and position of the field and make ResolvingDecoder return an array of FieldInfo instead of Field. It's not obvious, which package/class this new class would go to. If one wants a clean dependencies, probably it belongs to io package. I'm not confortable adding another class into the io package at the present. Keeping it as an inner class of ResolvingDecoder will create circular dependencies between the io.parsing package and ResolvingDecoder class.
             4. Instead of a new class FieldInfo above use Map.Entry<String, Field>.
             5. Add a new field called "name" to Schema.Field class. It will be somewhat redundant because the key of the fields map of RecordSchema is the field name. With this, the field name will be available both as the key and as a part of value of the map.

             My personal preference is 5 followed by 4.
        Thiruvalluvan M. G. made changes -
        Attachment AVRO-388.patch [ 12431859 ]
        Attachment AVRO-388-test.patch [ 12431860 ]
        Thiruvalluvan M. G. made changes -
        Attachment AVRO-388.patch [ 12431862 ]
        Thiruvalluvan M. G. made changes -
        Attachment AVRO-388.patch [ 12431859 ]
        Thiruvalluvan M. G. made changes -
        Assignee Thiruvalluvan M. G. [ thiru_mg ]
        Thiruvalluvan M. G. made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Thiruvalluvan M. G. made changes -
        Attachment AVRO-388-new.patch [ 12434520 ]
        Thiruvalluvan M. G. made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Doug Cutting made changes -
        Fix Version/s 1.3.0 [ 12314318 ]
        Doug Cutting made changes -
        Status Resolved [ 5 ] Closed [ 6 ]

          People

          • Assignee:
            Thiruvalluvan M. G.
            Reporter:
            Thiruvalluvan M. G.
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development