Avro
  1. Avro
  2. AVRO-839

Implement builder pattern in generated record classes that sets default values when omitted

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.6.0
    • Component/s: java
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      This is an idea for an improvement to the SpecificCompiler-generated record classes. There are two main issues to address:

      1. Default values specified in schemas are only used at read time, not when writing/serializing records. For example, a NullPointerException is thrown when attempting to write a record that has an uninitialized array or string type. I'm sure this was done for good reasons, like giving users maximum control and preventing unnecessary garbage collection, but I think it's also somewhat confusing and unintuitive for new users (myself included).
      2. Users have to create their own factory classes/methods for every record type, both to ensure that all non-primitive members are initialized and to facilitate the construction and initialization of record instances (i.e. constructing and setting values in a single statement).

      These issues have been discussed previously here:

      I'd like to propose a solution that is used by at least one other messaging framework. For each generated record class there will be a public static inner class called Builder. The Builder inner class has the same fields as the record class, as well as accessors and mutators for each of these fields. Whenever a mutator method is called, the Builder sets a boolean flag indicating that the field has been set. All mutators return a reference to 'this', so it's possible to chain a series of setter invocations, which makes it really easy to construct records in a single statement. The Builder also has a build() method which constructs a record instance using the values that were set in the Builder. When the build() method is invoked, if there are any fields that have not been set but have default values as defined in the schema, the Builder will set the values of these fields using their defaults.

      One nice thing about implementing the builder pattern in a static inner Builder class rather than in the record itself is that this enhancement will be completely backwards-compatible with existing code. The record class itself would not change, and the public fields would still be there, so existing code would still work. Users would have the option to use the Builder or continue constructing records manually. Eventually the public fields could be phased out, and the record would be made immutable. All changes would have to be done through the Builder.

      Here is an example of what this might look like:

      // Person.newBuilder() returns a new Person.Builder instance
      // All Person.Builder setters return 'this' allowing us to chain set calls together for convenience
      // Person.Builder.build() returns a Person instance after setting any uninitialized values that have defaults
      Person me = Person.newBuilder().setName("James").setCountry("US").setState("MA").build();
      
      // We still have direct access to Person's members, so the records are backwards-compatible
      me.state = "CA";
      
      // Person has accessor methods now so that the public fields can be phased out later
      System.out.println(me.getState());
      
      // No NPE here because the array<Person> field that stores this person's friends has been automatically 
      // initialized by the Builder to a new java.util.ArrayList<Person> due to a @java_class annotation in the IDL
      System.out.println(me.getFriends().size());
      

      What do people think about this approach? Any other ideas?

      1. AVRO-839.patch
        141 kB
        Doug Cutting
      2. AVRO-839.patch
        128 kB
        Doug Cutting
      3. AVRO-839.patch
        92 kB
        James Baldassari
      4. AVRO-839-v2.patch
        102 kB
        James Baldassari
      5. AVRO-839-v3.patch
        100 kB
        James Baldassari
      6. AVRO-839-v4.patch
        117 kB
        James Baldassari
      7. AVRO-839-v4.patch
        115 kB
        James Baldassari
      8. AVRO-839-v5.patch
        146 kB
        James Baldassari

        Issue Links

          Activity

          Hide
          Doug Cutting added a comment -

          +1 This sounds like great approach.

          We might deprecate the public fields to encourage folks to start using getters and setters instead? We should probably leave the public no-arg constructor though for use when reading data.

          Show
          Doug Cutting added a comment - +1 This sounds like great approach. We might deprecate the public fields to encourage folks to start using getters and setters instead? We should probably leave the public no-arg constructor though for use when reading data.
          Hide
          James Baldassari added a comment -

          Thanks, Doug. Agreed on deprecating the public fields and leaving the public no-arg constructor. Hopefully I'll have time to work on a patch soon.

          Show
          James Baldassari added a comment - Thanks, Doug. Agreed on deprecating the public fields and leaving the public no-arg constructor. Hopefully I'll have time to work on a patch soon.
          Hide
          Doug Cutting added a comment -

          This is also related to AVRO-784.

          Show
          Doug Cutting added a comment - This is also related to AVRO-784 .
          Hide
          James Baldassari added a comment -

          I finally had some time to work on this, and I have a patch ready now. The functionality is basically as described in this issue. There are a few implementation details to note:

          • Implemented both generic and specific Builders
          • Builders can also be initialized from other Builders or from record types. For example:
            • Person.newBuilder(anotherPersonBuilder)
            • Person.newBuilder(anotherPerson)
          • In addition to the accessor method in the record type I also added a mutator method. I'm not sure if this is best; I just thought it was more flexible. We could instead force all mutations to go through a Builder. Thoughts?
          • I introduced a new set of reserved words for error types which includes all the standard reserved words in addition to "message" and "cause". The reason these two new words need to be reserved (for error types only) is that otherwise the generated accessor methods would conflict with those in java.lang.Throwable.
          • I found that determining the default Java value for a field is fairly expensive because it requires serializing the JSON field and then deserializing it (is there a better way to do this?). For that reason I introduced a static cache of default values in RecordBuilderBase. We only determine the default value once per schema and field, at which point it is added to the static cache. After a default value has been added to the cache, whenever we need use it we first obtain a deep copy of it so that subsequent changes will not affect the values stored in the cache.
          • Even though the @java_class annotation is used in the IDL documentation, I couldn't find it anywhere in the Avro source. Maybe that should be a future enhancement instead?
          • Updated all unit tests so that they no longer directly access record fields. They now use the new accessor/mutator methods. As a result, a lot of the changes in this patch are just updates to unit tests.

          While I was working on the Velocity templates I also made the following changes:

          • Implemented changes requested in AVRO-846
          • Added constructor to generated Fixed instances that allows setting the byte[] in the constructor. This change makes it easier to used Fixed values with the builder interface.

          All existing unit tests pass, and I added unit tests to cover all new functionality.

          Show
          James Baldassari added a comment - I finally had some time to work on this, and I have a patch ready now. The functionality is basically as described in this issue. There are a few implementation details to note: Implemented both generic and specific Builders Builders can also be initialized from other Builders or from record types. For example: Person.newBuilder(anotherPersonBuilder) Person.newBuilder(anotherPerson) In addition to the accessor method in the record type I also added a mutator method. I'm not sure if this is best; I just thought it was more flexible. We could instead force all mutations to go through a Builder. Thoughts? I introduced a new set of reserved words for error types which includes all the standard reserved words in addition to "message" and "cause". The reason these two new words need to be reserved (for error types only) is that otherwise the generated accessor methods would conflict with those in java.lang.Throwable. I found that determining the default Java value for a field is fairly expensive because it requires serializing the JSON field and then deserializing it (is there a better way to do this?). For that reason I introduced a static cache of default values in RecordBuilderBase. We only determine the default value once per schema and field, at which point it is added to the static cache. After a default value has been added to the cache, whenever we need use it we first obtain a deep copy of it so that subsequent changes will not affect the values stored in the cache. Even though the @java_class annotation is used in the IDL documentation, I couldn't find it anywhere in the Avro source. Maybe that should be a future enhancement instead? Updated all unit tests so that they no longer directly access record fields. They now use the new accessor/mutator methods. As a result, a lot of the changes in this patch are just updates to unit tests. While I was working on the Velocity templates I also made the following changes: Implemented changes requested in AVRO-846 Added constructor to generated Fixed instances that allows setting the byte[] in the constructor. This change makes it easier to used Fixed values with the builder interface. All existing unit tests pass, and I added unit tests to cover all new functionality.
          Hide
          James Baldassari added a comment -

          Here's the patch. This applies cleanly to trunk as of r1142876.

          Show
          James Baldassari added a comment - Here's the patch. This applies cleanly to trunk as of r1142876.
          Hide
          James Baldassari added a comment -

          This patch includes changes requested in AVRO-846.

          Show
          James Baldassari added a comment - This patch includes changes requested in AVRO-846 .
          Hide
          Doug Cutting added a comment -

          Looks great!

          The deepCopy() method would better be on GenericData than on Schema. Schema doesn't have any logic specific to representations of the data, while GenericData is the base class for three representations (generic, specific, and reflect).

          I think there are a number of opportunities for optimization, but we can leave those for subsequent patches. What's important at this point is that the APIs look good and that it doesn't slow existing APIs. It doesn't much change existing APIs so I'm not worried there.

          Do RecordBuilderBase#has(int), get(int) and set(int) need to be public? Protected would be better, since these are unsafe methods, or at least a javadoc comment encouraging caution. Also, on a related note, it might be simpler and faster to have the builder to store things in an IndexedRecord instance, rather than in an Object[]. Then the generated specific builder methods could just do something like 'instance.foo = value'. This might make it harder to reuse builders, though...

          Show
          Doug Cutting added a comment - Looks great! The deepCopy() method would better be on GenericData than on Schema. Schema doesn't have any logic specific to representations of the data, while GenericData is the base class for three representations (generic, specific, and reflect). I think there are a number of opportunities for optimization, but we can leave those for subsequent patches. What's important at this point is that the APIs look good and that it doesn't slow existing APIs. It doesn't much change existing APIs so I'm not worried there. Do RecordBuilderBase#has(int), get(int) and set(int) need to be public? Protected would be better, since these are unsafe methods, or at least a javadoc comment encouraging caution. Also, on a related note, it might be simpler and faster to have the builder to store things in an IndexedRecord instance, rather than in an Object[]. Then the generated specific builder methods could just do something like 'instance.foo = value'. This might make it harder to reuse builders, though...
          Hide
          James Baldassari added a comment -

          Thanks for the comments, Doug.

          The deepCopy() method would better be on GenericData than on Schema.

          Sure, makes sense. I'll move it.

          I think there are a number of opportunities for optimization, but we can leave those for subsequent patches.

          Agreed. I did a quick performance test, and it looks like the builders are about an order of magnitude slower than manually creating a specific record instance. I'm not sure why, but I'll try profiling it. Maybe someone who is more familiar with Avro internals can suggest some improvements.

          Do RecordBuilderBase#has(int), get(int) and set(int) need to be public?

          Nope. I only made them public because get(int) and put(int, Object) are public in IndexedRecord and SpecificRecordBase. I'll change these to be protected in the builder interfaces/classes.

          it might be simpler and faster to have the builder to store things in an IndexedRecord instance, rather than in an Object[]. Then the generated specific builder methods could just do something like 'instance.foo = value'.

          That's an interesting idea. I'll try it out and see if it improves anything. However, the builders will still need to maintain that array of booleans for keeping track of which fields have been set (simply checking for null won't suffice because some types like unions can have null values). One other idea I had was to replace the Object[] with a GenericData.Record so that less code would be duplicated and the equals() method would work correctly.

          Show
          James Baldassari added a comment - Thanks for the comments, Doug. The deepCopy() method would better be on GenericData than on Schema. Sure, makes sense. I'll move it. I think there are a number of opportunities for optimization, but we can leave those for subsequent patches. Agreed. I did a quick performance test, and it looks like the builders are about an order of magnitude slower than manually creating a specific record instance. I'm not sure why, but I'll try profiling it. Maybe someone who is more familiar with Avro internals can suggest some improvements. Do RecordBuilderBase#has(int), get(int) and set(int) need to be public? Nope. I only made them public because get(int) and put(int, Object) are public in IndexedRecord and SpecificRecordBase. I'll change these to be protected in the builder interfaces/classes. it might be simpler and faster to have the builder to store things in an IndexedRecord instance, rather than in an Object[]. Then the generated specific builder methods could just do something like 'instance.foo = value'. That's an interesting idea. I'll try it out and see if it improves anything. However, the builders will still need to maintain that array of booleans for keeping track of which fields have been set (simply checking for null won't suffice because some types like unions can have null values). One other idea I had was to replace the Object[] with a GenericData.Record so that less code would be duplicated and the equals() method would work correctly.
          Hide
          James Baldassari added a comment -

          I have a new version of the patch ready. Here is a summary of the most significant changes from the first version:

          • deepCopy() moved from Schema to GenericData
          • RecordBuilderBase#has(int),get(int),set(int) made protected
          • Added clear[Field]() methods, which set the value of a field to null and clear the boolean indicating that the field has been set
          • I wasn't satisfied with the performance of the Builder implementation, so I did quite a bit of profiling and addressed the major bottlenecks. These were:
            • Schema.hashCode(), as noted in AVRO-853. I removed this bottleneck by changing the type of the default value cache from a Map<Schema,Map<Field,Object>> to a Map<String,Map<Integer,Object>>, where the String is the Schema's full name and the Integer is the field's position.
            • Deep copying Utf8 instances was inefficient because it converted the Utf8 to a String and then back to a Utf8. It turns out String.getBytes(Charset) is fairly expensive. I fixed this issue by adding a copy constructor to Utf8 that copies the bytes field. I also improved the performance of Utf8 slightly by initializing a Charset once as a static constant and using that throughout the class rather than the String "UTF-8".
            • Generated SpecificBuilder classes no longer rely on reflection to create new record instances. Instead, newRecord() is overridden in the generated Builder classes.

          I wrote some performance tests in TestSpecificBuilder that can be run to compare the performance of the Builder, with and without default values, to manually creating record instances. The Builder is about an order of magnitude slower than manually creating record instances, but I think it's acceptable. For example, building 1 billion records manually took about 8ms on my system vs. 217ms using the Builder. After profiling this version of the Builder I don't see many opportunities for further optimization.

          So I think the patch is in pretty good shape now, but let me know what you think.

          Show
          James Baldassari added a comment - I have a new version of the patch ready. Here is a summary of the most significant changes from the first version: deepCopy() moved from Schema to GenericData RecordBuilderBase#has(int),get(int),set(int) made protected Added clear [Field] () methods, which set the value of a field to null and clear the boolean indicating that the field has been set I wasn't satisfied with the performance of the Builder implementation, so I did quite a bit of profiling and addressed the major bottlenecks. These were: Schema.hashCode(), as noted in AVRO-853 . I removed this bottleneck by changing the type of the default value cache from a Map<Schema,Map<Field,Object>> to a Map<String,Map<Integer,Object>>, where the String is the Schema's full name and the Integer is the field's position. Deep copying Utf8 instances was inefficient because it converted the Utf8 to a String and then back to a Utf8. It turns out String.getBytes(Charset) is fairly expensive. I fixed this issue by adding a copy constructor to Utf8 that copies the bytes field. I also improved the performance of Utf8 slightly by initializing a Charset once as a static constant and using that throughout the class rather than the String "UTF-8". Generated SpecificBuilder classes no longer rely on reflection to create new record instances. Instead, newRecord() is overridden in the generated Builder classes. I wrote some performance tests in TestSpecificBuilder that can be run to compare the performance of the Builder, with and without default values, to manually creating record instances. The Builder is about an order of magnitude slower than manually creating record instances, but I think it's acceptable. For example, building 1 billion records manually took about 8ms on my system vs. 217ms using the Builder. After profiling this version of the Builder I don't see many opportunities for further optimization. So I think the patch is in pretty good shape now, but let me know what you think.
          Hide
          James Baldassari added a comment -

          Version 2 of the AVRO-839 patch.

          Show
          James Baldassari added a comment - Version 2 of the AVRO-839 patch.
          Hide
          Doug Cutting added a comment -

          Sorry I didn't notice this previously but I don't think RecordBuilder etc. should be in the top-level package. These might be better in generic. We probably really need another package named 'data' or somesuch that contains the base classes and interfaces that generic, specific and reflect implement and extend. We could add that package in this patch with these new classes, or put them in generic along with IndexedRecord, etc., but I'd rather not clutter the top-level package if possible.

          Also, the generated getters and setters should directly set the field rather than calling 'get(0)' or 'put(0, value)'. This will avoid the switch statement, speeding things.

          Show
          Doug Cutting added a comment - Sorry I didn't notice this previously but I don't think RecordBuilder etc. should be in the top-level package. These might be better in generic. We probably really need another package named 'data' or somesuch that contains the base classes and interfaces that generic, specific and reflect implement and extend. We could add that package in this patch with these new classes, or put them in generic along with IndexedRecord, etc., but I'd rather not clutter the top-level package if possible. Also, the generated getters and setters should directly set the field rather than calling 'get(0)' or 'put(0, value)'. This will avoid the switch statement, speeding things.
          Hide
          James Baldassari added a comment -

          Sorry for the long delay in working on this. I finally had a chance to get back to it this week, and I have another patch ready. The main changes are:

          • Moved Builder-related interfaces and base classes out of top-level package into org.apache.avro.data
          • Getters/setters get/set fields directly rather than calling get()/put()
          • In an effort to improve overall builder performance for specific/generated records, I moved most of the generic code out of RecordBuilderBase and pushed it down into the generated Builder code. I also did quite a bit of profiling in an effort to reduce the biggest bottlenecks.
            • There is now a lot more generated code, but it's faster.
            • Testing indicates that without using default values (i.e. all fields in the Builder are set), the Builder is about one order of magnitude slower than directly creating/initializing a record. However, a Builder instance can be reused, and if this is done rather than creating a new Builder each time there is almost no performance penalty for using the Builder.
            • If the Builder has to resolve a default value from the schema, it's obviously going to be a little slower, but it's still only one order of magnitude slower than directly creating records. In my testing the Builder was about 3-4 times slower when resolving default values, but I expect this will vary a lot depending on the schema.

          So let me know what you think. I'm happy to make more adjustments if necessary.

          Show
          James Baldassari added a comment - Sorry for the long delay in working on this. I finally had a chance to get back to it this week, and I have another patch ready. The main changes are: Moved Builder-related interfaces and base classes out of top-level package into org.apache.avro.data Getters/setters get/set fields directly rather than calling get()/put() In an effort to improve overall builder performance for specific/generated records, I moved most of the generic code out of RecordBuilderBase and pushed it down into the generated Builder code. I also did quite a bit of profiling in an effort to reduce the biggest bottlenecks. There is now a lot more generated code, but it's faster. Testing indicates that without using default values (i.e. all fields in the Builder are set), the Builder is about one order of magnitude slower than directly creating/initializing a record. However, a Builder instance can be reused, and if this is done rather than creating a new Builder each time there is almost no performance penalty for using the Builder. If the Builder has to resolve a default value from the schema, it's obviously going to be a little slower, but it's still only one order of magnitude slower than directly creating records. In my testing the Builder was about 3-4 times slower when resolving default values, but I expect this will vary a lot depending on the schema. So let me know what you think. I'm happy to make more adjustments if necessary.
          Hide
          James Baldassari added a comment -

          Here's the third iteration of the patch.

          Show
          James Baldassari added a comment - Here's the third iteration of the patch.
          Hide
          Doug Cutting added a comment -

          This is looking good. A few comments:

          • Why does SpecificExceptionBase include serialVersionUID?
          • The first line of new SpecificFixed constructor can just be 'this();', no?
          • GenericData#deepCopy() should probably not be static. We might instead refactor GenericDatumReader's newRecord() and createFixed() to GenericData so they can be used by deepCopy(). Currently I believe deepCopy is broken for SpecificFixed. It also adds a dependency by generic on specific, when otherwise specific currently layers on generic.
          • RecordBuilderBase#isValidValue's logic for checking for a NULL in a union is incorrect. Schema#getTypes() returns List<Schema>, not List<Schema.Type>. We should perhaps add a test of a union with null for this. Also, I don't think this method needs to be static.
          Show
          Doug Cutting added a comment - This is looking good. A few comments: Why does SpecificExceptionBase include serialVersionUID? The first line of new SpecificFixed constructor can just be 'this();', no? GenericData#deepCopy() should probably not be static. We might instead refactor GenericDatumReader's newRecord() and createFixed() to GenericData so they can be used by deepCopy(). Currently I believe deepCopy is broken for SpecificFixed. It also adds a dependency by generic on specific, when otherwise specific currently layers on generic. RecordBuilderBase#isValidValue's logic for checking for a NULL in a union is incorrect. Schema#getTypes() returns List<Schema>, not List<Schema.Type>. We should perhaps add a test of a union with null for this. Also, I don't think this method needs to be static.
          Hide
          James Baldassari added a comment -

          Sorry for the delay again. I'll try to find time to work on this more regularly. Thanks for the comments.

          Why does SpecificExceptionBase include serialVersionUID?

          Probably just to get Eclipse to stop complaining. SpecificExceptinoBase isA Throwable, which is serializable, so apparently the "right thing" to do is to declare the magic serialVersionUID field. However, it's not strictly required, so I'll remove it if you'd prefer.

          The first line of new SpecificFixed constructor can just be 'this();', no?

          Yes. I'll change it.

          GenericData#deepCopy() should probably not be static.

          OK, I didn't realize that GenericData was a singleton, so I'll make deepCopy() non-static and then invoke it via the singleton instance.

          We might instead refactor GenericDatumReader's newRecord() and createFixed() to GenericData so they can be used by deepCopy(). Currently I believe deepCopy is broken for SpecificFixed. It also adds a dependency by generic on specific, when otherwise specific currently layers on generic.

          Comparing GenericData#deepCopy() and GenericDatumReader#createFixed(), it looks to me like they're doing similar things for Fixed values. Could you elaborate on why you think it's broken in GenericData#deepCopy()? Even if it does work in deepCopy(), it would probably be a good idea to move createFixed() over to GenericData just to prevent code duplication.

          As for the new specific dependency in GenericData, I agree that it's not an ideal situation. This dependency is coming from the code that handles deep copies of record instances. This code could be changed to simply return new GenericData.Record(schema) for all records. It just seemed to me that performing a deep copy on a specific record should return the same specific record type rather than a GenericData.Record. What about adding a newInstance() method to IndexedRecord (or GenericContainer)? That way generic records could return GenericData.Record instances, and specific records could return instances of the generated type.

          RecordBuilderBase#isValidValue's logic for checking for a NULL in a union is incorrect. Schema#getTypes() returns List<Schema>, not List<Schema.Type>. We should perhaps add a test of a union with null for this. Also, I don't think this method needs to be static.

          Good catch! I think I've fixed this problem, but I will add a unit test for this method as you suggested.

          Show
          James Baldassari added a comment - Sorry for the delay again. I'll try to find time to work on this more regularly. Thanks for the comments. Why does SpecificExceptionBase include serialVersionUID? Probably just to get Eclipse to stop complaining. SpecificExceptinoBase isA Throwable, which is serializable, so apparently the "right thing" to do is to declare the magic serialVersionUID field. However, it's not strictly required, so I'll remove it if you'd prefer. The first line of new SpecificFixed constructor can just be 'this();', no? Yes. I'll change it. GenericData#deepCopy() should probably not be static. OK, I didn't realize that GenericData was a singleton, so I'll make deepCopy() non-static and then invoke it via the singleton instance. We might instead refactor GenericDatumReader's newRecord() and createFixed() to GenericData so they can be used by deepCopy(). Currently I believe deepCopy is broken for SpecificFixed. It also adds a dependency by generic on specific, when otherwise specific currently layers on generic. Comparing GenericData#deepCopy() and GenericDatumReader#createFixed(), it looks to me like they're doing similar things for Fixed values. Could you elaborate on why you think it's broken in GenericData#deepCopy()? Even if it does work in deepCopy(), it would probably be a good idea to move createFixed() over to GenericData just to prevent code duplication. As for the new specific dependency in GenericData, I agree that it's not an ideal situation. This dependency is coming from the code that handles deep copies of record instances. This code could be changed to simply return new GenericData.Record(schema) for all records. It just seemed to me that performing a deep copy on a specific record should return the same specific record type rather than a GenericData.Record. What about adding a newInstance() method to IndexedRecord (or GenericContainer)? That way generic records could return GenericData.Record instances, and specific records could return instances of the generated type. RecordBuilderBase#isValidValue's logic for checking for a NULL in a union is incorrect. Schema#getTypes() returns List<Schema>, not List<Schema.Type>. We should perhaps add a test of a union with null for this. Also, I don't think this method needs to be static. Good catch! I think I've fixed this problem, but I will add a unit test for this method as you suggested.
          Hide
          Doug Cutting added a comment -

          Re serialVersionUID: yes, please remove it. Perhaps you can disable this "feature" in Eclipse?

          Re deepCopy(): We should move newRecord() and createFixed() from GenericDatumReader & SpecificDatumReader (which overrides them) to GenericData and SpecificData. The GenericData#deepCopy() implementation can call these and will then work correctly for specific or generic without directly referencing specific. Does that make sense? We don't need to add a newInstance() method to IndexedRecord, but should instead use GenericData#newRecord().

          Show
          Doug Cutting added a comment - Re serialVersionUID: yes, please remove it. Perhaps you can disable this "feature" in Eclipse? Re deepCopy(): We should move newRecord() and createFixed() from GenericDatumReader & SpecificDatumReader (which overrides them) to GenericData and SpecificData. The GenericData#deepCopy() implementation can call these and will then work correctly for specific or generic without directly referencing specific. Does that make sense? We don't need to add a newInstance() method to IndexedRecord, but should instead use GenericData#newRecord().
          Hide
          James Baldassari added a comment -

          Makes sense, Doug. I've made the changes you suggested. I also had to make the generic/specific builders similar to the GenericDatumReader/SpecificDatumReader in that the base class has a GenericData and the specific subclass passes a SpecificData to the super constructor.

          I created some new unit tests for RecordBuilderBase#isValidValue(Field, Object). These can be found in org.apache.avro.data.RecordBuilderBaseTest in lang/java/avro. The one that addresses your specific concern about unions is RecordBuilderBaseTest#testIsValidValueWithUnion(). In order to make it easier to unit test the isValidValue() method I left it static, but if you'd prefer to make that method non-static I'll find another way to test it.

          I'll have a new patch ready shortly. Oh, and I just noticed AVRO-784 today. I haven't had a chance to look at the patch yet, but is there any overlap in functionality between 784 and 839 with the bean-style accessors/mutators? It sounded very similar from the description. If it turns out to be basically the same, then I can just wait for that to be committed and then merge in the changes.

          Show
          James Baldassari added a comment - Makes sense, Doug. I've made the changes you suggested. I also had to make the generic/specific builders similar to the GenericDatumReader/SpecificDatumReader in that the base class has a GenericData and the specific subclass passes a SpecificData to the super constructor. I created some new unit tests for RecordBuilderBase#isValidValue(Field, Object). These can be found in org.apache.avro.data.RecordBuilderBaseTest in lang/java/avro. The one that addresses your specific concern about unions is RecordBuilderBaseTest#testIsValidValueWithUnion(). In order to make it easier to unit test the isValidValue() method I left it static, but if you'd prefer to make that method non-static I'll find another way to test it. I'll have a new patch ready shortly. Oh, and I just noticed AVRO-784 today. I haven't had a chance to look at the patch yet, but is there any overlap in functionality between 784 and 839 with the bean-style accessors/mutators? It sounded very similar from the description. If it turns out to be basically the same, then I can just wait for that to be committed and then merge in the changes.
          Hide
          James Baldassari added a comment -

          Here's v4 of the patch.

          Show
          James Baldassari added a comment - Here's v4 of the patch.
          Hide
          Doug Cutting added a comment -

          This looks great.

          For some reason the changes to SpecificDatumReader didn't apply cleanly, but I was easily able to fix that up by hand.

          I also moved the SchemaConstructable interface from SpecificDatumReader to SpecificData.

          As for AVRO-784, I think this patch mostly subsumes it. There is one issue there that we perhaps ought to fix before we commit this, though. We should mangle things if there's a field named 'class' or 'schema', so that we don't conflict with the Class#getClass and GenericContainer#getSchema.

          Other than that, I think this is ready to commit.

          Show
          Doug Cutting added a comment - This looks great. For some reason the changes to SpecificDatumReader didn't apply cleanly, but I was easily able to fix that up by hand. I also moved the SchemaConstructable interface from SpecificDatumReader to SpecificData. As for AVRO-784 , I think this patch mostly subsumes it. There is one issue there that we perhaps ought to fix before we commit this, though. We should mangle things if there's a field named 'class' or 'schema', so that we don't conflict with the Class#getClass and GenericContainer#getSchema. Other than that, I think this is ready to commit.
          Hide
          James Baldassari added a comment -

          Great. I'll get caught up on the AVRO-784 discussion regarding the mangling of class and schema fields and submit an updated patch shortly.

          I'm not sure why the SpecificDatumReader stuff didn't apply cleanly. Maybe my working copy isn't up to date. I'll try to fix that as well.

          Show
          James Baldassari added a comment - Great. I'll get caught up on the AVRO-784 discussion regarding the mangling of class and schema fields and submit an updated patch shortly. I'm not sure why the SpecificDatumReader stuff didn't apply cleanly. Maybe my working copy isn't up to date. I'll try to fix that as well.
          Hide
          James Baldassari added a comment -

          I've made the following changes:

          • Merged recent commits to trunk so that the patch will apply cleanly (my working copy was slightly out of date)
          • Moved SchemaConstructable from SpecificDatumReader to SpecificData
          • Mangle accessor and mutator method names for 'class' and 'schema' fields
          Show
          James Baldassari added a comment - I've made the following changes: Merged recent commits to trunk so that the patch will apply cleanly (my working copy was slightly out of date) Moved SchemaConstructable from SpecificDatumReader to SpecificData Mangle accessor and mutator method names for 'class' and 'schema' fields
          Hide
          James Baldassari added a comment -

          Here's v4 of the patch.

          Show
          James Baldassari added a comment - Here's v4 of the patch.
          Hide
          James Baldassari added a comment -

          Reading through the comments from AVRO-784 in more detail I noticed this one:

          Also, if someone has a record with fields named "foo" and "Foo" then we'd be in trouble.

          That would be a problem with any approach that uses bean-style naming conventions to generate the accessor/mutator methods, and in fact it's a problem in this patch in its current form. Should we address this now? We could have SpecificCompiler detect these method name collisions and mangle them in some way. For example, the accessor methods for the fields "foo" and "Foo" might be named "getFoo$0()" and "getFoo$1()" respectively.

          Show
          James Baldassari added a comment - Reading through the comments from AVRO-784 in more detail I noticed this one: Also, if someone has a record with fields named "foo" and "Foo" then we'd be in trouble. That would be a problem with any approach that uses bean-style naming conventions to generate the accessor/mutator methods, and in fact it's a problem in this patch in its current form. Should we address this now? We could have SpecificCompiler detect these method name collisions and mangle them in some way. For example, the accessor methods for the fields "foo" and "Foo" might be named "getFoo$0()" and "getFoo$1()" respectively.
          Hide
          Doug Cutting added a comment -

          > Should we address this now?

          If you like. Or we could file it as a followup issue.

          I updated the patch to work after AVRO-877, which added some tests that check compiler output. I'll commit this version tomorrow unless someone objects.

          One other minor improvement that I'd like to see is the addition of a package.html file for org.apache.avro.data, to make the javadoc more easily understood. Longer term we might migrate some interfaces from generic to that package.

          Show
          Doug Cutting added a comment - > Should we address this now? If you like. Or we could file it as a followup issue. I updated the patch to work after AVRO-877 , which added some tests that check compiler output. I'll commit this version tomorrow unless someone objects. One other minor improvement that I'd like to see is the addition of a package.html file for org.apache.avro.data, to make the javadoc more easily understood. Longer term we might migrate some interfaces from generic to that package.
          Hide
          James Baldassari added a comment -

          If you like. Or we could file it as a followup issue.

          I implemented this. It wasn't too much work, and it's easier to just get it over with now than risk some potentially backwards-incompatible change after people start using the new builder stuff. I used the strategy described in my earlier comment: "foo" -> "getFoo$0()", "Foo" -> "getFoo$1()" where the "0" is always for the lowercase field name and the "1" is always for the uppercase field name. It's easy to change this if you think a different naming scheme would be better.

          I merged your recent trunk commits, so this patch should apply cleanly to trunk. After merging the updates from trunk I found that TestSpecificCompilerTool was failing because it was comparing the compiler output to a pre-builder generated class. I fixed this problem by copying the generated class that includes the builder modifications to src/test/compiler/output/.

          One other change I made was a minor fix to the generated mutator methods in record.vm. If the field happened to be named "value" the mutator would not work (because it was doing something like "field = value" instead of "this.field = value".

          One other minor improvement that I'd like to see is the addition of a package.html file for org.apache.avro.data

          Is this something you would like me to do?

          Show
          James Baldassari added a comment - If you like. Or we could file it as a followup issue. I implemented this. It wasn't too much work, and it's easier to just get it over with now than risk some potentially backwards-incompatible change after people start using the new builder stuff. I used the strategy described in my earlier comment: "foo" -> "getFoo$0()", "Foo" -> "getFoo$1()" where the "0" is always for the lowercase field name and the "1" is always for the uppercase field name. It's easy to change this if you think a different naming scheme would be better. I merged your recent trunk commits, so this patch should apply cleanly to trunk. After merging the updates from trunk I found that TestSpecificCompilerTool was failing because it was comparing the compiler output to a pre-builder generated class. I fixed this problem by copying the generated class that includes the builder modifications to src/test/compiler/output/. One other change I made was a minor fix to the generated mutator methods in record.vm. If the field happened to be named "value" the mutator would not work (because it was doing something like "field = value" instead of "this.field = value". One other minor improvement that I'd like to see is the addition of a package.html file for org.apache.avro.data Is this something you would like me to do?
          Hide
          James Baldassari added a comment -

          v5 of the patch attached.

          Show
          James Baldassari added a comment - v5 of the patch attached.
          Hide
          Doug Cutting added a comment -

          > After merging the updates from trunk I found that TestSpecificCompilerTool was failing because it was comparing the compiler output to a pre-builder generated class.

          Yes, I mentioned that yesterday and fixed it in the patch I uploaded.

          I added a package.html and committed this.

          Thanks, James! This is a great addition to Avro.

          Show
          Doug Cutting added a comment - > After merging the updates from trunk I found that TestSpecificCompilerTool was failing because it was comparing the compiler output to a pre-builder generated class. Yes, I mentioned that yesterday and fixed it in the patch I uploaded. I added a package.html and committed this. Thanks, James! This is a great addition to Avro.
          Hide
          James Baldassari added a comment -

          Thanks for all your help, Doug. BTW, you can also close out AVRO-846 too as the desired functionality has been implemented in this patch.

          Show
          James Baldassari added a comment - Thanks for all your help, Doug. BTW, you can also close out AVRO-846 too as the desired functionality has been implemented in this patch.
          Hide
          Scott Carey added a comment -

          This seems to have removed GenericDatumReader.newRecord().
          I noticed this ticket in Pig: https://issues.apache.org/jira/browse/PIG-2463

          Was this intended? The builder takes care of this feature, and also moves it out from the GenericDatumReader heirarchy which is useful to help limit what needs to be implemented to extend that.

          If this change is not intended, we may want to bring that back to avoid breaking users who upgrade to 1.6.x from 1.5.x.

          Show
          Scott Carey added a comment - This seems to have removed GenericDatumReader.newRecord(). I noticed this ticket in Pig: https://issues.apache.org/jira/browse/PIG-2463 Was this intended? The builder takes care of this feature, and also moves it out from the GenericDatumReader heirarchy which is useful to help limit what needs to be implemented to extend that. If this change is not intended, we may want to bring that back to avoid breaking users who upgrade to 1.6.x from 1.5.x.
          Hide
          James Baldassari added a comment -

          I believe the following methods were moved from GenericDatumReader to GenericData as part of Avro-839 to support the deepCopy() method (see comments from 11/Aug/11):

          • newRecord(Object, Schema)
          • createFixed(Object old, Schema schema)
          • createFixed(Object, byte[], Schema)

          To fix this issue we could put those methods back into GenericDatumReader and simply have them call through to the GenericData instance inside GenericDatumReader. What do you think? If that sounds ok I'll work on a patch.

          Show
          James Baldassari added a comment - I believe the following methods were moved from GenericDatumReader to GenericData as part of Avro-839 to support the deepCopy() method (see comments from 11/Aug/11): newRecord(Object, Schema) createFixed(Object old, Schema schema) createFixed(Object, byte[], Schema) To fix this issue we could put those methods back into GenericDatumReader and simply have them call through to the GenericData instance inside GenericDatumReader. What do you think? If that sounds ok I'll work on a patch.
          Hide
          Scott Carey added a comment -

          I'm not convinced we need to do this. But here is an idea: we could add the methods back as deprecated, with a default implementation and some javadoc so that other folks won't be confused by the change and code with or without an override on these will still work. Then in 1.7.x we can drop it and there will be a better paper trail for others to follow on where the new non-deprecated versions are.

          This way, projects like Pig could potentially upgrade to 1.6.next and compile without change.

          Alternatively, we do a better job documenting how to upgrade code to use 1.6.x. This means that code compiled against 1.5.x will fail with 1.6.x if it uses these methods or extends the class however.

          Any other opinions on what to do? Is the documentation in this ticket enough? Should we add these back in? Any other ideas?

          Show
          Scott Carey added a comment - I'm not convinced we need to do this. But here is an idea: we could add the methods back as deprecated, with a default implementation and some javadoc so that other folks won't be confused by the change and code with or without an override on these will still work. Then in 1.7.x we can drop it and there will be a better paper trail for others to follow on where the new non-deprecated versions are. This way, projects like Pig could potentially upgrade to 1.6.next and compile without change. Alternatively, we do a better job documenting how to upgrade code to use 1.6.x. This means that code compiled against 1.5.x will fail with 1.6.x if it uses these methods or extends the class however. Any other opinions on what to do? Is the documentation in this ticket enough? Should we add these back in? Any other ideas?
          Hide
          Doug Cutting added a comment -

          > we could add the methods back as deprecated, with a default implementation and some javadoc so that other folks won't be confused by the change and code with or without an override on these will still work. Then in 1.7.x we can drop it and there will be a better paper trail for others to follow on where the new non-deprecated versions are.

          +1 We should adopt this as a standard process for incompatible changes to public APIs.

          Show
          Doug Cutting added a comment - > we could add the methods back as deprecated, with a default implementation and some javadoc so that other folks won't be confused by the change and code with or without an override on these will still work. Then in 1.7.x we can drop it and there will be a better paper trail for others to follow on where the new non-deprecated versions are. +1 We should adopt this as a standard process for incompatible changes to public APIs.
          Hide
          James Baldassari added a comment -

          I created AVRO-993 to track the changes necessary to fix PIG-2463.

          Show
          James Baldassari added a comment - I created AVRO-993 to track the changes necessary to fix PIG-2463 .

            People

            • Assignee:
              James Baldassari
              Reporter:
              James Baldassari
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development