Avro
  1. Avro
  2. AVRO-1368

String schema accepts any Java object, and converts it with toString()

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 1.7.5
    • Fix Version/s: 1.7.6
    • Component/s: java
    • Labels:
      None

      Activity

      Hide
      Christophe Taton added a comment -

      Here is a test that demonstrates the behavior:

          final Schema schema = Schema.create(Schema.Type.STRING);
          final ByteArrayOutputStream baos = new ByteArrayOutputStream();
          final Encoder encoder = EncoderFactory.get().directBinaryEncoder(baos, null);
          final DatumWriter<Object> writer = new SpecificDatumWriter<Object>(schema);
          writer.write(new Object(), encoder);
          final byte[] bytes = baos.toByteArray();
          System.out.println(new String(bytes));
      

      The java Object is silently accepted and converted to a string like "java.lang.Object@ee51b2c".

      Show
      Christophe Taton added a comment - Here is a test that demonstrates the behavior: final Schema schema = Schema.create(Schema.Type.STRING); final ByteArrayOutputStream baos = new ByteArrayOutputStream(); final Encoder encoder = EncoderFactory.get().directBinaryEncoder(baos, null ); final DatumWriter< Object > writer = new SpecificDatumWriter< Object >(schema); writer.write( new Object (), encoder); final byte [] bytes = baos.toByteArray(); System .out.println( new String (bytes)); The java Object is silently accepted and converted to a string like "java.lang.Object@ee51b2c".
      Hide
      Christophe Taton added a comment -

      This only affects the SpecificDatumWriter. With a GenericDatumWriter, I get the following error:

      java.lang.ClassCastException: java.lang.Object cannot be cast to java.lang.CharSequence
      at org.apache.avro.generic.GenericDatumWriter.writeString(GenericDatumWriter.java:213)
      at org.apache.avro.generic.GenericDatumWriter.writeString(GenericDatumWriter.java:208)
      at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:76)
      at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:58)

      Show
      Christophe Taton added a comment - This only affects the SpecificDatumWriter. With a GenericDatumWriter, I get the following error: java.lang.ClassCastException: java.lang.Object cannot be cast to java.lang.CharSequence at org.apache.avro.generic.GenericDatumWriter.writeString(GenericDatumWriter.java:213) at org.apache.avro.generic.GenericDatumWriter.writeString(GenericDatumWriter.java:208) at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:76) at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:58)
      Hide
      Alexandre Normand added a comment -

      It looks like we want SpecificDatumWriter#writeString() to check is the object is stringable before calling toString() on it.

      This assumes this is Stringable but your example shows that it's a bad assumption:

      @Override
        protected void writeString(Schema schema, Object datum, Encoder out)
          throws IOException {
          if (!(datum instanceof CharSequence))         // Stringable
            datum = datum.toString();                   // call toString()
          writeString(datum, out);
        }
      
      Show
      Alexandre Normand added a comment - It looks like we want SpecificDatumWriter#writeString() to check is the object is stringable before calling toString() on it. This assumes this is Stringable but your example shows that it's a bad assumption: @Override protected void writeString(Schema schema, Object datum, Encoder out) throws IOException { if (!(datum instanceof CharSequence)) // Stringable datum = datum.toString(); // call toString() writeString(datum, out); }
      Hide
      Christophe Taton added a comment -

      This seems to work, but requires an explicit cast from GenericData to SpecificData:

      @Override
      protected void writeString(Schema schema, Object datum, Encoder out)
          throws IOException {
        if (!(datum instanceof CharSequence) && ((SpecificData) getData()).isStringable(datum.getClass())) {
          datum = datum.toString();
        }
        writeString(datum, out);
      }
      

      GenericData doesn't have isStringable() : does this mean the generic API would not be able to recognize Stringable classes?

      Show
      Christophe Taton added a comment - This seems to work, but requires an explicit cast from GenericData to SpecificData: @Override protected void writeString(Schema schema, Object datum, Encoder out) throws IOException { if (!(datum instanceof CharSequence) && ((SpecificData) getData()).isStringable(datum.getClass())) { datum = datum.toString(); } writeString(datum, out); } GenericData doesn't have isStringable() : does this mean the generic API would not be able to recognize Stringable classes?
      Hide
      Alexandre Normand added a comment -

      Can you do this in SpecificDatumWriter#writeString() instead?

        @Override
        protected void writeString(Schema schema, Object datum, Encoder out)
          throws IOException {
          if (!(datum instanceof CharSequence) && SpecificData.get().isStringable(datum.getClass()))         // Stringable
            datum = datum.toString();                                                                        // call toString()
          writeString(datum, out);
        }
      
      Show
      Alexandre Normand added a comment - Can you do this in SpecificDatumWriter#writeString() instead? @Override protected void writeString(Schema schema, Object datum, Encoder out) throws IOException { if (!(datum instanceof CharSequence) && SpecificData.get().isStringable(datum.getClass())) // Stringable datum = datum.toString(); // call toString() writeString(datum, out); }
      Hide
      Doug Cutting added a comment -

      We shouldn't call SpecificData.get() when a SpecificData instance is available. We should probably deprecate that method and folks should just call 'new SpecificData()' instead for defaults. SpecificData, ReflectData, & GenericData used to be singletons, but now they have constructors for different classloaders, etc. so often the singleton is the wrong instance. Perhaps we should add a SpecificDatumWriter#getSpecificData() method?

      Show
      Doug Cutting added a comment - We shouldn't call SpecificData.get() when a SpecificData instance is available. We should probably deprecate that method and folks should just call 'new SpecificData()' instead for defaults. SpecificData, ReflectData, & GenericData used to be singletons, but now they have constructors for different classloaders, etc. so often the singleton is the wrong instance. Perhaps we should add a SpecificDatumWriter#getSpecificData() method?
      Hide
      Christophe Taton added a comment -

      Adding SpecificDatumWriter.getSpecificData() seems reasonable on one hand, but not necessary with what follows:
      GenericDatumWriter/GenericData totally ignores the Stringable concept. Perhaps, the stringable concept should be moved to GenericData instead?

      Last question: is there a reason for Avro integer to accept any Java Number, and silently convert/truncate them to Integer, while other Avro primitive types will require an exact class match (eg. Avro long will require Java Long, and not accept Java Integer).

      Show
      Christophe Taton added a comment - Adding SpecificDatumWriter.getSpecificData() seems reasonable on one hand, but not necessary with what follows: GenericDatumWriter/GenericData totally ignores the Stringable concept. Perhaps, the stringable concept should be moved to GenericData instead? Last question: is there a reason for Avro integer to accept any Java Number, and silently convert/truncate them to Integer, while other Avro primitive types will require an exact class match (eg. Avro long will require Java Long, and not accept Java Integer).
      Hide
      Doug Cutting added a comment -

      > Perhaps, the stringable concept should be moved to GenericData instead?

      Perhaps, but that would be much more incompatible. Adding stringables to specific permitted some things to work (e.g., BigInteger) that used to fail, which is not 100% compatible, but largely. With generic, it would cause schemas that were readable before to change the representation they're read into. So such a change, if desired, would need to go into 1.8.0.

      Is this desired? With generic, any schema can be read into a closed set of classes. With extensible stringables this would no longer be true. That makes it harder to write programs that can generically process data of any schema.

      > is there a reason for Avro integer to accept any Java Number, and silently convert/truncate them to Integer, while other Avro primitive types will require an exact class match (eg. Avro long will require Java Long, and not accept Java Integer).

      In reflect, an int schema is used to handle Java's Byte, Character and Short types too. Is that what you're referring to?

      Show
      Doug Cutting added a comment - > Perhaps, the stringable concept should be moved to GenericData instead? Perhaps, but that would be much more incompatible. Adding stringables to specific permitted some things to work (e.g., BigInteger) that used to fail, which is not 100% compatible, but largely. With generic, it would cause schemas that were readable before to change the representation they're read into. So such a change, if desired, would need to go into 1.8.0. Is this desired? With generic, any schema can be read into a closed set of classes. With extensible stringables this would no longer be true. That makes it harder to write programs that can generically process data of any schema. > is there a reason for Avro integer to accept any Java Number, and silently convert/truncate them to Integer, while other Avro primitive types will require an exact class match (eg. Avro long will require Java Long, and not accept Java Integer). In reflect, an int schema is used to handle Java's Byte, Character and Short types too. Is that what you're referring to?
      Hide
      Christophe Taton added a comment -

      I don't have enough context to make a reasonable guess here.
      From a generic API perspective, this becomes an opaque field in the form of
      a java String.

      I'm referring to this conversion:
      https://github.com/apache/avro/blob/trunk/lang/java/avro/src/main/java/org/apache/avro/generic/GenericDatumWriter.java#L78

      • where an Avro int is expected, I can write int, long, float, double, any
        subclass of Number, and it will be silently encoded as the truncated
        integer value.
      • where an Avro long is expected, I can only write a long/java.lang.Long
        and nothing else (int/java.lang.Integer) is rejected, for example.
      Show
      Christophe Taton added a comment - I don't have enough context to make a reasonable guess here. From a generic API perspective, this becomes an opaque field in the form of a java String. I'm referring to this conversion: https://github.com/apache/avro/blob/trunk/lang/java/avro/src/main/java/org/apache/avro/generic/GenericDatumWriter.java#L78 where an Avro int is expected, I can write int, long, float, double, any subclass of Number, and it will be silently encoded as the truncated integer value. where an Avro long is expected, I can only write a long/java.lang.Long and nothing else (int/java.lang.Integer) is rejected, for example.
      Hide
      Doug Cutting added a comment -

      I suspect that conversion was intended to convert Integer, Short and Character to int, and that also converting Float, Double etc. was unintended. Do you think all cases should convert any Number? That would be a compatible change, but might have performance implications. Or do you think this case should be narrowed to only handle Integer, Short and Character? That would be incompatible.

      Show
      Doug Cutting added a comment - I suspect that conversion was intended to convert Integer, Short and Character to int, and that also converting Float, Double etc. was unintended. Do you think all cases should convert any Number? That would be a compatible change, but might have performance implications. Or do you think this case should be narrowed to only handle Integer, Short and Character? That would be incompatible.
      Hide
      Christophe Taton added a comment -
      Show
      Christophe Taton added a comment - Here is a patch according to your suggestions. https://github.com/kryzthov/avro/commit/4cc52caaffbc8d648c9a37e03a780894783cd0b0
      Hide
      Christophe Taton added a comment -

      Doug, I realize my reply to you last question didn't go through.

      Very personally, I dislikes silent, implicit conversions, and especially lossy ones.
      If we had the choice, I would rather narrow these to only handle Integer, Short and Character.

      As you mention, this would be an incompatible API change. Is it eligible for the coming 1.8?

      That said, accepting an Integer where a long is expected would make sense.
      Similarly, accepting a Float where a double is expected would make sense too.
      I believe both could be implemented now in a compatible way.
      The question that remains is about performance.

      Show
      Christophe Taton added a comment - Doug, I realize my reply to you last question didn't go through. Very personally, I dislikes silent, implicit conversions, and especially lossy ones. If we had the choice, I would rather narrow these to only handle Integer, Short and Character. As you mention, this would be an incompatible API change. Is it eligible for the coming 1.8? That said, accepting an Integer where a long is expected would make sense. Similarly, accepting a Float where a double is expected would make sense too. I believe both could be implemented now in a compatible way. The question that remains is about performance.
      Hide
      ASF subversion and git services added a comment -

      Commit 1520141 from Doug Cutting in branch 'avro/trunk'
      [ https://svn.apache.org/r1520141 ]

      AVRO-1368. Fix SpecificDatumWriter to, when writing a string schema, not silently convert any object to a string. Contributed by Christophe Taton.

      Show
      ASF subversion and git services added a comment - Commit 1520141 from Doug Cutting in branch 'avro/trunk' [ https://svn.apache.org/r1520141 ] AVRO-1368 . Fix SpecificDatumWriter to, when writing a string schema, not silently convert any object to a string. Contributed by Christophe Taton.
      Hide
      Doug Cutting added a comment -

      I committed this. Thanks, Christophe!

      Show
      Doug Cutting added a comment - I committed this. Thanks, Christophe!
      Hide
      Hudson added a comment -

      SUCCESS: Integrated in AvroJava #395 (See https://builds.apache.org/job/AvroJava/395/)
      AVRO-1368. Fix SpecificDatumWriter to, when writing a string schema, not silently convert any object to a string. Contributed by Christophe Taton. (cutting: rev 1520141)

      • /avro/trunk/CHANGES.txt
      • /avro/trunk/lang/java/avro/src/main/java/org/apache/avro/specific/SpecificDatumWriter.java
      • /avro/trunk/lang/java/avro/src/test/java/org/apache/avro/specific/TestSpecificData.java
      Show
      Hudson added a comment - SUCCESS: Integrated in AvroJava #395 (See https://builds.apache.org/job/AvroJava/395/ ) AVRO-1368 . Fix SpecificDatumWriter to, when writing a string schema, not silently convert any object to a string. Contributed by Christophe Taton. (cutting: rev 1520141) /avro/trunk/CHANGES.txt /avro/trunk/lang/java/avro/src/main/java/org/apache/avro/specific/SpecificDatumWriter.java /avro/trunk/lang/java/avro/src/test/java/org/apache/avro/specific/TestSpecificData.java

        People

        • Assignee:
          Christophe Taton
          Reporter:
          Christophe Taton
        • Votes:
          0 Vote for this issue
          Watchers:
          6 Start watching this issue

          Dates

          • Created:
            Updated:
            Resolved:

            Development