Avro
  1. Avro
  2. AVRO-1533

permit promotions between string and bytes

    Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.7.7
    • Component/s: java
    • Labels:
      None

      Description

      Avro strings are a subset of bytes, so promoting from string to bytes is lossless and should be possible. Promotion from bytes to strings may cause problems, as not all byte strings are valid UTF8, but it also might be useful.

      1. AVRO-1533.patch
        6 kB
        Doug Cutting
      2. AVRO-1533.patch
        4 kB
        Doug Cutting

        Activity

        Hide
        Doug Cutting added a comment -

        Here's a patch that implements promotion from string to bytes.

        Show
        Doug Cutting added a comment - Here's a patch that implements promotion from string to bytes.
        Hide
        Doug Cutting added a comment -

        Does this seem useful to folks? Should we also support conversion from bytes to string, even though that could result in runtime errors due to invalid UTF8?

        Show
        Doug Cutting added a comment - Does this seem useful to folks? Should we also support conversion from bytes to string, even though that could result in runtime errors due to invalid UTF8?
        Hide
        Doug Cutting added a comment -

        Here's a version that permits promotions in both directions, from bytes to string and from string to bytes.

        Show
        Doug Cutting added a comment - Here's a version that permits promotions in both directions, from bytes to string and from string to bytes.
        Hide
        Doug Cutting added a comment -

        I'll commit this soon unless there are objections.

        Show
        Doug Cutting added a comment - I'll commit this soon unless there are objections.
        Hide
        ASF subversion and git services added a comment -

        Commit 1607197 from Doug Cutting in branch 'avro/trunk'
        [ https://svn.apache.org/r1607197 ]

        AVRO-1533. Java: In schema resolution, permit conversion between bytes and string.

        Show
        ASF subversion and git services added a comment - Commit 1607197 from Doug Cutting in branch 'avro/trunk' [ https://svn.apache.org/r1607197 ] AVRO-1533 . Java: In schema resolution, permit conversion between bytes and string.
        Hide
        Hudson added a comment -

        SUCCESS: Integrated in AvroJava #461 (See https://builds.apache.org/job/AvroJava/461/)
        AVRO-1533. Java: In schema resolution, permit conversion between bytes and string. (cutting: rev 1607197)

        • /avro/trunk/CHANGES.txt
        • /avro/trunk/doc/src/content/xdocs/spec.xml
        • /avro/trunk/lang/java/avro/src/main/java/org/apache/avro/io/ResolvingDecoder.java
        • /avro/trunk/lang/java/avro/src/main/java/org/apache/avro/io/parsing/ResolvingGrammarGenerator.java
        • /avro/trunk/lang/java/avro/src/test/java/org/apache/avro/io/TestResolvingIO.java
        Show
        Hudson added a comment - SUCCESS: Integrated in AvroJava #461 (See https://builds.apache.org/job/AvroJava/461/ ) AVRO-1533 . Java: In schema resolution, permit conversion between bytes and string. (cutting: rev 1607197) /avro/trunk/CHANGES.txt /avro/trunk/doc/src/content/xdocs/spec.xml /avro/trunk/lang/java/avro/src/main/java/org/apache/avro/io/ResolvingDecoder.java /avro/trunk/lang/java/avro/src/main/java/org/apache/avro/io/parsing/ResolvingGrammarGenerator.java /avro/trunk/lang/java/avro/src/test/java/org/apache/avro/io/TestResolvingIO.java
        Hide
        Martin Kleppmann added a comment -

        Only just saw this, sorry for the delay. I am wondering what this means for validation of schema compatibility, e.g. AVRO-1315. If a "bytes" field is changed to "string", and thus runtime errors are possible due to invalid UTF-8, should the schemas be considered compatible? If people are going to understand "validation succeeded" as "guaranteed no runtime errors", this is potentially an issue.

        Show
        Martin Kleppmann added a comment - Only just saw this, sorry for the delay. I am wondering what this means for validation of schema compatibility, e.g. AVRO-1315 . If a "bytes" field is changed to "string", and thus runtime errors are possible due to invalid UTF-8, should the schemas be considered compatible? If people are going to understand "validation succeeded" as "guaranteed no runtime errors", this is potentially an issue.
        Hide
        Doug Cutting added a comment -

        It won't generate runtime errors for invalid UTF-8, but instead replaces erroneous sequences with the character "�":

        http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#String(byte[],%20java.nio.charset.Charset)

        I think can be considered a compatible change, since it won't break existing applications. Today attempts to switch a field from bytes to string would fail. I suppose an application could currently rely on such failures, but I consider that unlikely enough that I'm willing to ignore it. Do others disagree?

        We could:

        1. revert this change entirely, declaring it incompatible
        2. revert just the change to the specification, so that Avro Java is more lenient in what conversions it permits than the specification (following Postel's law)
        3. file issues to update the AVRO-1315 schema validation to permit such conversions
        • also file issues for C, C++ and C# to update their schema resolution to support these conversions

        Thoughts?

        Show
        Doug Cutting added a comment - It won't generate runtime errors for invalid UTF-8, but instead replaces erroneous sequences with the character "�": http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#String(byte[],%20java.nio.charset.Charset ) I think can be considered a compatible change, since it won't break existing applications. Today attempts to switch a field from bytes to string would fail. I suppose an application could currently rely on such failures, but I consider that unlikely enough that I'm willing to ignore it. Do others disagree? We could: revert this change entirely, declaring it incompatible revert just the change to the specification, so that Avro Java is more lenient in what conversions it permits than the specification (following Postel's law) file issues to update the AVRO-1315 schema validation to permit such conversions also file issues for C, C++ and C# to update their schema resolution to support these conversions Thoughts?
        Hide
        graham sanderson added a comment -

        I had a quick look at the patch, and there are a few s.getBytes() & new String(byte[])... since the intention seems to be to assume that bytes are interchangeable with UTF-8 encoded strings, it should probably be explicit as it is in Utf8, no?

        Show
        graham sanderson added a comment - I had a quick look at the patch, and there are a few s.getBytes() & new String(byte[])... since the intention seems to be to assume that bytes are interchangeable with UTF-8 encoded strings, it should probably be explicit as it is in Utf8, no?
        Hide
        Doug Cutting added a comment -

        Graham, good catch. I did fix the new String(byte[]...) to specify the charset as UTF-8 but forgot to update the patch. The getBytes call is on a org.apache.avro.util.Utf8 instance, not a java.lang.String, so that's okay.

        The commit diff is at:

        https://svn.apache.org/viewvc/avro/trunk/lang/java/avro/src/main/java/org/apache/avro/io/ResolvingDecoder.java?r1=1607197&r2=1607196&pathrev=1607197

        I think we're okay.

        Show
        Doug Cutting added a comment - Graham, good catch. I did fix the new String(byte[]...) to specify the charset as UTF-8 but forgot to update the patch. The getBytes call is on a org.apache.avro.util.Utf8 instance, not a java.lang.String, so that's okay. The commit diff is at: https://svn.apache.org/viewvc/avro/trunk/lang/java/avro/src/main/java/org/apache/avro/io/ResolvingDecoder.java?r1=1607197&r2=1607196&pathrev=1607197 I think we're okay.
        Hide
        Martin Kleppmann added a comment -

        Doug Cutting: Ok, in that case I think this is fine. I'm in favour of option 3 (making schema validation and other language implementations' resolvers consistent with the updated spec).

        Perhaps we should also make explicit in the spec that the promotion from bytes to string is lossy if the byte sequence is not valid UTF-8. (There is already precedent for lossy promotion: e.g. converting long to float may also lose some precision. That isn't a problem as long as people understand what's happening.)

        Show
        Martin Kleppmann added a comment - Doug Cutting : Ok, in that case I think this is fine. I'm in favour of option 3 (making schema validation and other language implementations' resolvers consistent with the updated spec). Perhaps we should also make explicit in the spec that the promotion from bytes to string is lossy if the byte sequence is not valid UTF-8. (There is already precedent for lossy promotion: e.g. converting long to float may also lose some precision. That isn't a problem as long as people understand what's happening.)
        Hide
        Doug Cutting added a comment -

        I'll resolve this as fixed and we can file follow-on issues.

        Show
        Doug Cutting added a comment - I'll resolve this as fixed and we can file follow-on issues.

          People

          • Assignee:
            Doug Cutting
            Reporter:
            Doug Cutting
          • Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development