Avro
  1. Avro
  2. AVRO-803

Java generated Avro classes make using Avro painful and surprising

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 1.5.0
    • Fix Version/s: 1.6.0
    • Component/s: java
    • Labels:
      None
    • Environment:

      Any

      Description

      Currently the Avro generated Java classes expose CharSequence in their API. However, you cannot use any old CharSequence when interacting with them. In fact, you have to use the Utf8 class if you want to get consistent results. I think that Avro should work with any CharSequence if that is the API. Here is an example where this happens:

      https://github.com/spullara/avro-generated-code/blob/master/src/test/java/AnnoyingTest.java

      That prints out 'false' three times unexpectedly. If you can't get it to print 'true' three times then you should probably change it back to Utf8.

      1. Foo.java
        4 kB
        Doug Cutting
      2. AVRO-803.patch
        45 kB
        Doug Cutting
      3. AVRO-803.patch
        54 kB
        Doug Cutting

        Issue Links

          Activity

          Hide
          Scott Carey added a comment -

          A co-worker highlighted this and brought this to my attention very recently. CharSequence's contract explicity states that hashCode() and equals() are not consistent across implementations.

          http://download.oracle.com/javase/6/docs/api/java/lang/CharSequence.html
          And thus a CharSequence can't be used for a key in a map.

          In Java, we'll have to use a specific implementation of CharSequence inside any map keys, and can't generalize on CharSequence in a few places. The low level Encoder / Decoder can happily use CharSequence in many places, but generated classes and interfaces on the higher layer APIs have to be more strict.

          Show
          Scott Carey added a comment - A co-worker highlighted this and brought this to my attention very recently. CharSequence's contract explicity states that hashCode() and equals() are not consistent across implementations. http://download.oracle.com/javase/6/docs/api/java/lang/CharSequence.html And thus a CharSequence can't be used for a key in a map. In Java, we'll have to use a specific implementation of CharSequence inside any map keys, and can't generalize on CharSequence in a few places. The low level Encoder / Decoder can happily use CharSequence in many places, but generated classes and interfaces on the higher layer APIs have to be more strict.
          Hide
          Doug Cutting added a comment -

          I don't think this would be a compatible change, so should go into 1.6.

          Show
          Doug Cutting added a comment - I don't think this would be a compatible change, so should go into 1.6.
          Hide
          George Fletcher added a comment -

          We've run into this as well and for now, I wrote my own velocity template that makes the data elements private and generates getters/setters (similar to what's in 1.6.0). In these getters/setters I convert the CharSequence to a String and then do copies for arrays and maps to convert the Utf8 type to a String. This is an ugly work around and has a big performance impact. Are there any plans to fix this in 1.6.0?

          Show
          George Fletcher added a comment - We've run into this as well and for now, I wrote my own velocity template that makes the data elements private and generates getters/setters (similar to what's in 1.6.0). In these getters/setters I convert the CharSequence to a String and then do copies for arrays and maps to convert the Utf8 type to a String. This is an ugly work around and has a big performance impact. Are there any plans to fix this in 1.6.0?
          Hide
          Doug Cutting added a comment -

          What's a concrete proposal? Should we just switch generated code back to Utf8?

          Or we might:

          • Use Utf8 for field values, permitting efficient reuse in loops that read data;
          • Provide getter methods that return String. A Utf8 memoizes its string conversion, so repeated calls to the getter would only allocate a single String. Applications that wanted to avoid that could use the field directly.
          • Provide setter methods that accept CharSequence. This would check the runtime type and convert String to Utf8.
          • For lists and maps, use wrappers to adapt w/o copying the entire list.
          class Foo {
          
            public Utf8 x;
            public String getX() {
               return x == null ? null : x.toString();
            }
            public void setX(CharSequence x) {
               this.x = x instanceof Utf8 ? (Utf8)x : new Utf8(x.toString()); 
            }
          
            public List<Utf8> values;
            public List<String> getValues() {
              return new AbstractList<String> {
                public String get(int i) {
                   Utf8 value = values.get(i);
                   return value == null ? null : value.toString();
                }
                public int size() { return values.size(); }
              };
            public void setValues(final List<? extends CharSequence> values) {
              this.values = new AbstractList<Utf8> {
                public Utf8 get(int i) {
                   CharSequence value = values.get(i);
                   if (value instanceof Utf8)
                     return (Utf8)value;
                   else if (value == null)
                     return null;
                   else
                     return new Utf8(value.toString());
                }
                public int size() { return values.size(); }
              };
          }
          

          Thoughts?

          Show
          Doug Cutting added a comment - What's a concrete proposal? Should we just switch generated code back to Utf8? Or we might: Use Utf8 for field values, permitting efficient reuse in loops that read data; Provide getter methods that return String. A Utf8 memoizes its string conversion, so repeated calls to the getter would only allocate a single String. Applications that wanted to avoid that could use the field directly. Provide setter methods that accept CharSequence. This would check the runtime type and convert String to Utf8. For lists and maps, use wrappers to adapt w/o copying the entire list. class Foo { public Utf8 x; public String getX() { return x == null ? null : x.toString(); } public void setX(CharSequence x) { this .x = x instanceof Utf8 ? (Utf8)x : new Utf8(x.toString()); } public List<Utf8> values; public List< String > getValues() { return new AbstractList< String > { public String get( int i) { Utf8 value = values.get(i); return value == null ? null : value.toString(); } public int size() { return values.size(); } }; public void setValues( final List<? extends CharSequence> values) { this .values = new AbstractList<Utf8> { public Utf8 get( int i) { CharSequence value = values.get(i); if (value instanceof Utf8) return (Utf8)value; else if (value == null ) return null ; else return new Utf8(value.toString()); } public int size() { return values.size(); } }; } Thoughts?
          Hide
          Sam Pullara added a comment -

          This is what I did in my generated classes. I think it is a pretty good solution to the problem and gives developers the option. Will it also work as easily with Maps?

          Show
          Sam Pullara added a comment - This is what I did in my generated classes. I think it is a pretty good solution to the problem and gives developers the option. Will it also work as easily with Maps?
          Hide
          Doug Cutting added a comment -

          A couple more observations:

          • I'd rather keep the raw fields using CharSequence, so that we can avoid any conversions on write. The low-level output routines can handle either Utf8 or String, so we should pass along whichever the application prefers to create.
          • Map keys and values are not currently reused, so there's little point to using Utf8 there. So we might just always use String in maps, especially as keys to avoid comparison problems.
          • It is possible to wrap a map and convert keys and/or values.

          I've attached code that illustrates how this might look. This assumes that map keys are now always String but that values can be any CharSequence.

          Show
          Doug Cutting added a comment - A couple more observations: I'd rather keep the raw fields using CharSequence, so that we can avoid any conversions on write. The low-level output routines can handle either Utf8 or String, so we should pass along whichever the application prefers to create. Map keys and values are not currently reused, so there's little point to using Utf8 there. So we might just always use String in maps, especially as keys to avoid comparison problems. It is possible to wrap a map and convert keys and/or values. I've attached code that illustrates how this might look. This assumes that map keys are now always String but that values can be any CharSequence.
          Hide
          Scott Carey added a comment -

          I like the idea of forcing String for map keys by default. There is less value to using Utf8 there since Utf8 is mutable. Plus, Strings in the JVM are heavily optimized for use as keys in maps.

          I believe that users will sometimes want to force their value types to any of the three choices: String, Utf8, and CharSequence. We should have reasonable defaults, but make it easier for users to choose what they want for their use case. For many use cases, direct conversion to Strings is better. For others, using Utf8 to be as lazy as possible in the expensive Utf8 <-> Utf16 conversion is.

          Some extra laziness in Utf8 can help in a few places as well, by making the cost of creating a Utf8 from a String cheaper.
          Right now, Utf8 when called with a String parameter in the constructor is not lazy, and generates the utf8 byte[]. It could leave this null, and only lazily create the byte[] if needed, just like it lazily creates the String only if needed.

          Show
          Scott Carey added a comment - I like the idea of forcing String for map keys by default. There is less value to using Utf8 there since Utf8 is mutable. Plus, Strings in the JVM are heavily optimized for use as keys in maps. I believe that users will sometimes want to force their value types to any of the three choices: String, Utf8, and CharSequence. We should have reasonable defaults, but make it easier for users to choose what they want for their use case. For many use cases, direct conversion to Strings is better. For others, using Utf8 to be as lazy as possible in the expensive Utf8 <-> Utf16 conversion is. Some extra laziness in Utf8 can help in a few places as well, by making the cost of creating a Utf8 from a String cheaper. Right now, Utf8 when called with a String parameter in the constructor is not lazy, and generates the utf8 byte[]. It could leave this null, and only lazily create the byte[] if needed, just like it lazily creates the String only if needed.
          Hide
          Doug Cutting added a comment -

          Here's a new proposal:

          • add a new Decoder method, 'String readString()' implemented to avoid allocating new intermediate byte arrays for each call as is currently done when Utf8's are not reused.
          • change generated specific code to optionally use String everywhere instead of CharSequence. (We could also add an option to emit Utf8 everywhere.) When String is used we add a property to the string schemas in the generated code so they become {"type":"string", "java":"String"}

            .

          • GenericData#readString() would call the new Decoder method when "java":"String" is present in the String's schema.

          This is totally back-compatible.

          Show
          Doug Cutting added a comment - Here's a new proposal: add a new Decoder method, 'String readString()' implemented to avoid allocating new intermediate byte arrays for each call as is currently done when Utf8's are not reused. change generated specific code to optionally use String everywhere instead of CharSequence. (We could also add an option to emit Utf8 everywhere.) When String is used we add a property to the string schemas in the generated code so they become {"type":"string", "java":"String"} . GenericData#readString() would call the new Decoder method when "java":"String" is present in the String's schema. This is totally back-compatible.
          Hide
          Scott Carey added a comment -

          Sounds good to me.

          One thing though: "java":"String"

          I think we may want to use the avro reserved namespace on this property to avoid collisions. Perhaps "avro.java.stringImpl":"String" ?

          Show
          Scott Carey added a comment - Sounds good to me. One thing though: "java":"String" I think we may want to use the avro reserved namespace on this property to avoid collisions. Perhaps "avro.java.stringImpl":"String" ?
          Hide
          Doug Cutting added a comment -

          > Perhaps "avro.java.stringImpl":"String"?

          It should really be "avro.java.string.class.name":"java.lang.String", but that seems way too verbose. Maybe I went too far in my brevity, though. How about "avro.java.string":"String"?

          Show
          Doug Cutting added a comment - > Perhaps "avro.java.stringImpl":"String"? It should really be "avro.java.string.class.name":"java.lang.String", but that seems way too verbose. Maybe I went too far in my brevity, though. How about "avro.java.string":"String"?
          Hide
          Doug Cutting added a comment -

          Here's a patch that implements the above proposal.

          Applications need only add <stringType>String</stringType> to the configuration of avro-maven-plugin in their pom.xml to switch to using String in their generated code.

          Show
          Doug Cutting added a comment - Here's a patch that implements the above proposal. Applications need only add <stringType>String</stringType> to the configuration of avro-maven-plugin in their pom.xml to switch to using String in their generated code.
          Hide
          Doug Cutting added a comment -

          New version of patch with some cleanups. Also updates mapred, protobuf & thrift code to take advantage of these changes.

          Show
          Doug Cutting added a comment - New version of patch with some cleanups. Also updates mapred, protobuf & thrift code to take advantage of these changes.
          Hide
          Doug Cutting added a comment -

          I'll commit this soon unless there are objections.

          Show
          Doug Cutting added a comment - I'll commit this soon unless there are objections.
          Hide
          Scott Carey added a comment -

          We broke back-compatibility in Avro 1.5 – formerly we used Utf8 and made maps with Utf8 keys. I think we can break back compatibility in 1.6.0 again. And we probably should for Maps since it is not appropriate to ever have a map with a CharSequence key; hashCode and equals are not compatible between implementations.

          Show
          Scott Carey added a comment - We broke back-compatibility in Avro 1.5 – formerly we used Utf8 and made maps with Utf8 keys. I think we can break back compatibility in 1.6.0 again. And we probably should for Maps since it is not appropriate to ever have a map with a CharSequence key; hashCode and equals are not compatible between implementations.
          Hide
          Doug Cutting added a comment -

          I'd prefer not to break back-compatibility this time. It makes it impossible for folks to upgrade one project without making source code changes to other projects. If you specify <stringType>String</stringType> in your pom.xml then all your Map keys become java.lang.String.

          Show
          Doug Cutting added a comment - I'd prefer not to break back-compatibility this time. It makes it impossible for folks to upgrade one project without making source code changes to other projects. If you specify <stringType>String</stringType> in your pom.xml then all your Map keys become java.lang.String.
          Hide
          Doug Cutting added a comment -

          I committed this.

          Scott, I hope you're okay with the default. If you and/or others feel strongly that we should break compatibility for this, we might consider changing that default, but I wanted to get this committed before it became stale, as it touches a lot of files.

          Show
          Doug Cutting added a comment - I committed this. Scott, I hope you're okay with the default. If you and/or others feel strongly that we should break compatibility for this, we might consider changing that default, but I wanted to get this committed before it became stale, as it touches a lot of files.
          Hide
          Scott Carey added a comment -

          I'm fine with this. Any discussion about changing the default can wait and would likely require more user input now that we have many more users. Keeping it the way it is now, with a work-around and some documentation / examples for best practices is a lot better than where we are in 1.5.x.

          Show
          Scott Carey added a comment - I'm fine with this. Any discussion about changing the default can wait and would likely require more user input now that we have many more users. Keeping it the way it is now, with a work-around and some documentation / examples for best practices is a lot better than where we are in 1.5.x.

            People

            • Assignee:
              Joseph Adler
              Reporter:
              Sam Pullara
            • Votes:
              2 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development