Avro
  1. Avro
  2. AVRO-726

Make GenericDatumReader/GenericDatumWriter data member protected so that it can be used by the derived classes

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 1.5.0
    • Fix Version/s: 1.5.0
    • Component/s: java
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Currently, GenericDatumReader/GenericDatumWriter data members are private. Is it possible to make them protected so that we could extend those classes and create our own special DatumReader/Writer? The reason we want to do that is because we've created our own base SpecificRecordEx that implements SpecificRecord and added put/get for primitive types. We now want to extend the GenericDatumReader/GenericDatumWriter to use those primitive put/get functions to reduce box/unbox for better performance.
      Thanks,

      Xiaolu

      1. AVRO-726.patch
        3 kB
        Doug Cutting
      2. AVRO-726.1.patch
        2 kB
        Scott Carey

        Issue Links

          Activity

          Hide
          Scott Carey added a comment -

          We should definitely consider this so that such extensions are possible.

          As for the use case above, the latest version of Avro uses Velocity templates for SpecificRecord generation – so you could generate classes with all the getter/setters you want. One of the items I wanted to get to was to experiment with ways to remove the boxing/unboxing overhead for these objects. IndexedRecord neatly simplifies access to fields but has the boxing overhead.

          Are you willing to share or contribute what you have done so far?

          There are some approaches I have considered. I was going to take a stab at 'merging' the Reflect and Specific API by way of annotations to get rid of this overhead – essentially have the code gen create annotated classes that specified how the schema maps to the fields – and have reflect consume those. Later down the road, Reflect could use cgilib and/or asm to create serialization / deserialization classes on the fly that would be compiled to very very efficient code and be faster than the current Specific API.

          Lastly, note that autoboxing/unboxing in some cases has become 'free' on the latest Sun JVM with '-XX:+UseEscapeAnalysis' on. This will default to on in a later JVM release. If the object is boxed manually (using new Integer(), not Integer.valueOf()) and the object does not escape a small enough code block, the JVM will avoid object creation entirely. This can help some of the IndexedRecord API usage.

          Show
          Scott Carey added a comment - We should definitely consider this so that such extensions are possible. As for the use case above, the latest version of Avro uses Velocity templates for SpecificRecord generation – so you could generate classes with all the getter/setters you want. One of the items I wanted to get to was to experiment with ways to remove the boxing/unboxing overhead for these objects. IndexedRecord neatly simplifies access to fields but has the boxing overhead. Are you willing to share or contribute what you have done so far? There are some approaches I have considered. I was going to take a stab at 'merging' the Reflect and Specific API by way of annotations to get rid of this overhead – essentially have the code gen create annotated classes that specified how the schema maps to the fields – and have reflect consume those. Later down the road, Reflect could use cgilib and/or asm to create serialization / deserialization classes on the fly that would be compiled to very very efficient code and be faster than the current Specific API. Lastly, note that autoboxing/unboxing in some cases has become 'free' on the latest Sun JVM with '-XX:+UseEscapeAnalysis' on. This will default to on in a later JVM release. If the object is boxed manually (using new Integer(), not Integer.valueOf()) and the object does not escape a small enough code block, the JVM will avoid object creation entirely. This can help some of the IndexedRecord API usage.
          Hide
          Xiaolu Ye added a comment -

          Hi Scott,

          As said in the email, I'm more than happy to share what we've done. Here's the summary:

          1) Introduced SpecificRecordEx interface that extends SpecificRecord and added setter/getter of primitive types in it.
          2) All generated classes are derived from SpecificRecordEx with impl class has setter/getter of primitive types auto-generated based on the schema
          3) Auto generate interface class for record type with user friendly setter/getter of proper types and maintain inheritance structure based on supertypes definition in the schema
          4) Support callback dispatch by introducing Cbk interface in SpecificRecordEx with dispatch(Cbk) function. All derived interface has its own Cbk interface derived from SpecificRecordEx.Cbk, and its impl class implements dispatch(Cbk) to invoke the proper callback.
          5) Auto generate Builder class for each impl class

          If you think this is in the right direction of what you want, I'm more than happy to supply a patch. Also interested to get your input for any enhancement. Please let me know. I have taked out cyclical reference patch from my implemenation based on comments from AVRO-695 that Moustapha will change its impl.

          Thanks,

          Xiaolu

          Show
          Xiaolu Ye added a comment - Hi Scott, As said in the email, I'm more than happy to share what we've done. Here's the summary: 1) Introduced SpecificRecordEx interface that extends SpecificRecord and added setter/getter of primitive types in it. 2) All generated classes are derived from SpecificRecordEx with impl class has setter/getter of primitive types auto-generated based on the schema 3) Auto generate interface class for record type with user friendly setter/getter of proper types and maintain inheritance structure based on supertypes definition in the schema 4) Support callback dispatch by introducing Cbk interface in SpecificRecordEx with dispatch(Cbk) function. All derived interface has its own Cbk interface derived from SpecificRecordEx.Cbk, and its impl class implements dispatch(Cbk) to invoke the proper callback. 5) Auto generate Builder class for each impl class If you think this is in the right direction of what you want, I'm more than happy to supply a patch. Also interested to get your input for any enhancement. Please let me know. I have taked out cyclical reference patch from my implemenation based on comments from AVRO-695 that Moustapha will change its impl. Thanks, Xiaolu
          Hide
          Scott Carey added a comment -

          Hi Xiaolu,

          These features sound very interesting and useful. I would love to see a patch and figure out how to incorporate similar features. Even if we go in a slightly different direction, I think all your needs are things I would like to address:

          1/2) reduce cost of autoboxing and memory footprint of objects.
          3) Solutions for mapping inheritance to avro.
          4) Callbacks – I would need to see what you did here and some example use cases – are these for users or the internal framework?
          5) Generating code that instantiates objects with the builder pattern and avoids publicly exposed variables has been discussed many times here, I definitely want to have that built-in to Avro.

          Show
          Scott Carey added a comment - Hi Xiaolu, These features sound very interesting and useful. I would love to see a patch and figure out how to incorporate similar features. Even if we go in a slightly different direction, I think all your needs are things I would like to address: 1/2) reduce cost of autoboxing and memory footprint of objects. 3) Solutions for mapping inheritance to avro. 4) Callbacks – I would need to see what you did here and some example use cases – are these for users or the internal framework? 5) Generating code that instantiates objects with the builder pattern and avoids publicly exposed variables has been discussed many times here, I definitely want to have that built-in to Avro.
          Hide
          Xiaolu Ye added a comment -

          Hi Scott,

          Sorry, just saw your comment. I'll supply a patch shortly.

          Meanwhile, is it possible to make those data members protected in 1.5 anyway? Noticed as of now, it's not in the blocker list.

          Thanks,

          Xiaolu

          Show
          Xiaolu Ye added a comment - Hi Scott, Sorry, just saw your comment. I'll supply a patch shortly. Meanwhile, is it possible to make those data members protected in 1.5 anyway? Noticed as of now, it's not in the blocker list. Thanks, Xiaolu
          Hide
          Scott Carey added a comment -

          Yes, I'll look at getting just the basics: set those to protected, done for 1.5.0. I can't guarantee it but it is the first thing I'll look at after the API changing blockers.

          As for the patch, even if it only applies against 1.4.1, as long as it is posted here and has the granted Apache license we can review and start talking about incorporating it or using it as a base to build similar features that solve your use cases. You might want to start a new JIRA for that, and we'll leave this one for making the members protected only.

          Show
          Scott Carey added a comment - Yes, I'll look at getting just the basics: set those to protected, done for 1.5.0. I can't guarantee it but it is the first thing I'll look at after the API changing blockers. As for the patch, even if it only applies against 1.4.1, as long as it is posted here and has the granted Apache license we can review and start talking about incorporating it or using it as a base to build similar features that solve your use cases. You might want to start a new JIRA for that, and we'll leave this one for making the members protected only.
          Hide
          Xiaolu Ye added a comment -

          sounds good. i'll create a jira and summit the part as part of that.

          Show
          Xiaolu Ye added a comment - sounds good. i'll create a jira and summit the part as part of that.
          Hide
          Xiaolu Ye added a comment -

          Hi Scott,

          I've created jira AVRO-777 for the enhancement. Just want to mention that in addition to change the data member to protected, could you also look at whether some of the member functions could be made protected like npe() in GenericDatumWriter? I needed to use npe in my derived class and had to change GenericDatumWriter in my local copy.

          Thanks,

          Xiaolu

          Show
          Xiaolu Ye added a comment - Hi Scott, I've created jira AVRO-777 for the enhancement. Just want to mention that in addition to change the data member to protected, could you also look at whether some of the member functions could be made protected like npe() in GenericDatumWriter? I needed to use npe in my derived class and had to change GenericDatumWriter in my local copy. Thanks, Xiaolu
          Hide
          Scott Carey added a comment -

          Simple patch that exposes the data member variables to child classes. Also exposes a couple helper methods to child classes.

          Show
          Scott Carey added a comment - Simple patch that exposes the data member variables to child classes. Also exposes a couple helper methods to child classes.
          Hide
          Doug Cutting added a comment -

          Protected ites should have javadoc. Also, might we provide getters and setters instead of just making the fields protected?

          Show
          Doug Cutting added a comment - Protected ites should have javadoc. Also, might we provide getters and setters instead of just making the fields protected?
          Hide
          Xiaolu Ye added a comment -

          For us, making the fields protected is good enough.

          Btw, I noticed the fix version is now 1.5.1 for this issue. Is it because of the npe function I requested for GenericDatumWriter?

          Show
          Xiaolu Ye added a comment - For us, making the fields protected is good enough. Btw, I noticed the fix version is now 1.5.1 for this issue. Is it because of the npe function I requested for GenericDatumWriter?
          Hide
          Scott Carey added a comment -

          Its because we have some more discussion/documenation before there is consensus and this does not break API. We could get 1.5.1 out relatively fast if there is demand for it. Also, if there are delays in 1.5.0 this could still slip back in, but it is not blocking the release.

          Questions:

          • Would a protected setter/getter work for you? That would expose less of the internals than a protected member variable.
          • Is it OK if the field is final? One of the two was final before, and if they both can be, then only a getter is necessary.
          Show
          Scott Carey added a comment - Its because we have some more discussion/documenation before there is consensus and this does not break API. We could get 1.5.1 out relatively fast if there is demand for it. Also, if there are delays in 1.5.0 this could still slip back in, but it is not blocking the release. Questions: Would a protected setter/getter work for you? That would expose less of the internals than a protected member variable. Is it OK if the field is final? One of the two was final before, and if they both can be, then only a getter is necessary.
          Hide
          Xiaolu Ye added a comment -

          Both should be fine. Thanks.

          Just want to confirm, in 1.5.1, you would make GenericDatumWriter.npe function protected as well? In my derived class, I override writeRecord which throws npe as in the super class. Thanks!

          Show
          Xiaolu Ye added a comment - Both should be fine. Thanks. Just want to confirm, in 1.5.1, you would make GenericDatumWriter.npe function protected as well? In my derived class, I override writeRecord which throws npe as in the super class. Thanks!
          Hide
          Scott Carey added a comment -

          I've created jira AVRO-777 for the enhancement.

          In case anyone is looking for it, its AVRO-770.

          Show
          Scott Carey added a comment - I've created jira AVRO-777 for the enhancement. In case anyone is looking for it, its AVRO-770 .
          Hide
          Doug Cutting added a comment -

          Here's a version that, instead of changing the data field to be protected, adds the method:

          public GenericData getData();

          Show
          Doug Cutting added a comment - Here's a version that, instead of changing the data field to be protected, adds the method: public GenericData getData();
          Hide
          Scott Carey added a comment -

          +1
          Looks good, this version is simpler and making the data field final in both cases is an improvement.

          Show
          Scott Carey added a comment - +1 Looks good, this version is simpler and making the data field final in both cases is an improvement.
          Hide
          Doug Cutting added a comment -

          I committed this.

          Show
          Doug Cutting added a comment - I committed this.

            People

            • Assignee:
              Doug Cutting
              Reporter:
              Xiaolu Ye
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development