Avro
  1. Avro
  2. AVRO-249

in reflection, represent Java short as fixed data

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.3.0
    • Component/s: java
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Currently the Java reflect API treats shorts as ints. This fails however to naturally handle shorts in arrays and other containers. It would be better if we used a distinct type for reflected Java short values.

      1. AVRO-249.patch
        5 kB
        Doug Cutting
      2. AVRO-249.patch
        6 kB
        Doug Cutting
      3. AVRO-249.patch
        7 kB
        Doug Cutting

        Activity

        Hide
        Doug Cutting added a comment -

        With this patch, reflection maps short values to the schema:

        {"type": "fixed", "name": "java.lang.Short", "size": 2}

        .

        Tests are added for arrays of shorts.

        Show
        Doug Cutting added a comment - With this patch, reflection maps short values to the schema: {"type": "fixed", "name": "java.lang.Short", "size": 2} . Tests are added for arrays of shorts.
        Hide
        Scott Carey added a comment -

        BigEndian, it appears. Does that fact belong in documentation for the Reflect API?

        On reading, it will be a bit faster if you avoid the array allocation – at least until scalar replacement or stack allocation works well in the JVM.
        This will not show up in microbenchmarks, but can be rather significant in more complicated high throughput systems with enough other objects of 'medium' lifetimes. (The same is true of autoboxing and creating objects members instead of intrinsic members.)
        If you can avoid an allocation and not affect performance or functionality, its a good idea to do so. This could be done with two one byte reads or by adding a ByteBuffer - like "readShort()" method to Decoder.

        Just to confirm, the "fixed" type doesn't store a length byte, so there is no extra overhead.

        What about char? Its just an unsigned short.

        Show
        Scott Carey added a comment - BigEndian, it appears. Does that fact belong in documentation for the Reflect API? On reading, it will be a bit faster if you avoid the array allocation – at least until scalar replacement or stack allocation works well in the JVM. This will not show up in microbenchmarks, but can be rather significant in more complicated high throughput systems with enough other objects of 'medium' lifetimes. (The same is true of autoboxing and creating objects members instead of intrinsic members.) If you can avoid an allocation and not affect performance or functionality, its a good idea to do so. This could be done with two one byte reads or by adding a ByteBuffer - like "readShort()" method to Decoder. Just to confirm, the "fixed" type doesn't store a length byte, so there is no extra overhead. What about char? Its just an unsigned short.
        Hide
        Philip Zeyliger added a comment -

        Using "java.lang.Short" in a name seems suboptimal. Could you just call it "Short"? It seems like the schemas induced by reflection should ideally be language-neutral.

        It also seems pretty arbitrary that integers, longs, and floats are represented with zigzag varint encoding, but shorts are always two bytes.

        Have you considered using an approach similar to the annotation approach in AVRO-242 for this? Even for something like longs, you're sometimes going to want to hint about what encoding to use (fixed vs. varint).

        – Philip

        Show
        Philip Zeyliger added a comment - Using "java.lang.Short" in a name seems suboptimal. Could you just call it "Short"? It seems like the schemas induced by reflection should ideally be language-neutral. It also seems pretty arbitrary that integers, longs, and floats are represented with zigzag varint encoding, but shorts are always two bytes. Have you considered using an approach similar to the annotation approach in AVRO-242 for this? Even for something like longs, you're sometimes going to want to hint about what encoding to use (fixed vs. varint). – Philip
        Hide
        Doug Cutting added a comment -

        > Could you just call it "Short"?

        I'm hesitant to do this now. It seems tantamount to extending the specification with a new type. We could perhaps add it to a standard "lib" namespace or somesuch. But, until we've established that, an application that needs a two-byte quantity should define its schema to use an application-specific fixed type in its schemas.

        The use of java.lang.Short highlights a place where folks are using a java-specific type. A mechanism for mapping arbitrary application-specific Avro types to unboxed Java types is not straightforward today. I don't want to solve that problem today. Today I want to make applications that use shorts work using reflection. And I don't mind that appearing as a wart on that application until we solve these harder problems.

        > It also seems pretty arbitrary that integers, longs, and floats are represented with zigzag varint encoding, but shorts are always two bytes.

        Yes, it is arbitrary. We could instead model shorts as a one-field int record. That would not restrict its range, and might use more space in some cases. Six of one.

        > Have you considered using an approach similar to the annotation approach in AVRO-242 for this?

        Yes. What type would you annotate? Would you annotate each field that's a short? Would we implement mixin annotations?

        Show
        Doug Cutting added a comment - > Could you just call it "Short"? I'm hesitant to do this now. It seems tantamount to extending the specification with a new type. We could perhaps add it to a standard "lib" namespace or somesuch. But, until we've established that, an application that needs a two-byte quantity should define its schema to use an application-specific fixed type in its schemas. The use of java.lang.Short highlights a place where folks are using a java-specific type. A mechanism for mapping arbitrary application-specific Avro types to unboxed Java types is not straightforward today. I don't want to solve that problem today. Today I want to make applications that use shorts work using reflection. And I don't mind that appearing as a wart on that application until we solve these harder problems. > It also seems pretty arbitrary that integers, longs, and floats are represented with zigzag varint encoding, but shorts are always two bytes. Yes, it is arbitrary. We could instead model shorts as a one-field int record. That would not restrict its range, and might use more space in some cases. Six of one. > Have you considered using an approach similar to the annotation approach in AVRO-242 for this? Yes. What type would you annotate? Would you annotate each field that's a short? Would we implement mixin annotations?
        Hide
        Philip Zeyliger added a comment -

        It wouldn't surprise me if Avro evolved to have int8, int16, int32, int64 and fixed8, fixed,16, fixed32, fixed64 "types" (all implemented using varint and fixed, but with the APIs responsible for bounds checking) eventually. But that's neither here nor there.

        Not sure what you mean by mix-ins, but, yes, you could annotate the field in the class whose schema is being induced.

        I understand the expedience of modelling shorts this way. I worry that it will make the reflection API harder to evolve. It's saying that from now on, all Java shorts shall be 2-byte fixeds, and that's going to be the default forever more.

        I haven't thought about the evolution required of the reflect API. Is it possible to change, later, how the same Java class is serialized?

        Show
        Philip Zeyliger added a comment - It wouldn't surprise me if Avro evolved to have int8, int16, int32, int64 and fixed8, fixed,16, fixed32, fixed64 "types" (all implemented using varint and fixed, but with the APIs responsible for bounds checking) eventually. But that's neither here nor there. Not sure what you mean by mix-ins, but, yes, you could annotate the field in the class whose schema is being induced. I understand the expedience of modelling shorts this way. I worry that it will make the reflection API harder to evolve. It's saying that from now on, all Java shorts shall be 2-byte fixeds, and that's going to be the default forever more. I haven't thought about the evolution required of the reflect API. Is it possible to change, later, how the same Java class is serialized?
        Hide
        Doug Cutting added a comment -

        > BigEndian, it appears. Does that fact belong in documentation for the Reflect API?

        Yes. I added this to ReflectData, along with the documentation of other mappings.

        > On reading, it will be a bit faster if you avoid the array allocation

        At this point I am not concerned about performance here. The reflect API already suffers from known performance problems. The correct handling of shorts is required for reflection to support existing Hadoop RPC protocols, a primary target for the reflect API. If this does show up as a performance bottleneck in benchmarks then we should address it.

        > Just to confirm, the "fixed" type doesn't store a length byte, so there is no extra overhead.

        That's correct.

        > What about char? Its just an unsigned short.

        If we need, we could add support for that in a separate issue.

        Show
        Doug Cutting added a comment - > BigEndian, it appears. Does that fact belong in documentation for the Reflect API? Yes. I added this to ReflectData, along with the documentation of other mappings. > On reading, it will be a bit faster if you avoid the array allocation At this point I am not concerned about performance here. The reflect API already suffers from known performance problems. The correct handling of shorts is required for reflection to support existing Hadoop RPC protocols, a primary target for the reflect API. If this does show up as a performance bottleneck in benchmarks then we should address it. > Just to confirm, the "fixed" type doesn't store a length byte, so there is no extra overhead. That's correct. > What about char? Its just an unsigned short. If we need, we could add support for that in a separate issue.
        Hide
        Doug Cutting added a comment -

        Offline Philip suggested it might be better to map short to:

        {"type": "int", "java-class": "java.lang.Short"}

        I like this approach better. Here's a version of the patch that implements it.

        Show
        Doug Cutting added a comment - Offline Philip suggested it might be better to map short to: {"type": "int", "java-class": "java.lang.Short"} I like this approach better. Here's a version of the patch that implements it.
        Hide
        Philip Zeyliger added a comment -

        I like the new patch. +1.

        Show
        Philip Zeyliger added a comment - I like the new patch. +1.
        Hide
        Scott Carey added a comment -

        It also seems pretty arbitrary that integers, longs, and floats are represented with zigzag varint encoding, but shorts are always two bytes.

        Floats aren't encoded varint, are they? I can't see the advantage there, the high bits will be set too frequently.

        At this point I am not concerned about performance here.

        Thats fair.

        It wouldn't surprise me if Avro evolved to have int8, int16, int32, int64 and fixed8, fixed,16, fixed32, fixed64 "types"

        Maybe, but if this happens, it sounds like Avro 2.0 not Avro 1.x.
        Also,, there would be no benefit to varint of one byte, and for the int16 case there may be very little or no benefit. Its easy to speculate that a int32 is very often less than 2^20 in size. Its hard to speculate that shorts are mostly less than 2^6 and not frequently more than 2^13.

        Not sure what you mean by mix-ins, but, yes, you could annotate the field in the class whose schema is being induced.

        Basically if you don't want, or can't change class A, you can write MixIn class B that has annotations that "target" the methods and members of class A. See:
        http://wiki.fasterxml.com/JacksonMixInAnnotations
        The goal, is to allow annotating a class you can't change the source code for.

        Ok, if we're talking about the long term Reflect API, I will add this:

        I have been starting to dig in to using Avro myself, and thinking about schema evolution. I don't particularly like the Specific API and its code generation, I'd generally rather direct a schema at my own classes for most use cases. I don't want to use Reflection either, with its restrictions and performance pitfalls (my requirements differ from those Doug is working on for Hadoop RPC significantly).

        I think that these two APIs can be combined in one annotations based API. Sure, we can still have code generation from avro schemas with basic defaults to create classes, but that step can be optional, even for inter-language use cases.

        Imagine something like this.
        You have a pre-existing class, Stuff, and you want to define how it is serialized. You make an Avro schema for it, to share with other languages/tools. Now, you want to map the two together. Using Specific, you have to write wrapper code to read the Avro generated class into your current class (that has a little bit of logic in it, maybe a custom hashCode() and equals(), a few other constructors for test cases , and some setters and getters that aren't just "return foo" and "this.foo = foo". If this class is an already long lived class with lots of unit tests, there aren't a lot of nice ways to do this without refactoring more than just the class. More importantly, if you have 40 or so such classes —
        Reflect can somewhat get around this, but then if you want to share the data with other languages and tools you've just exposed your Java implementation of your object to the world... I'd rather not have a schema change just because I changed some internal data structure already encapsulated with getters/setters.

        Ideally, I would like to just annotate the class with something that says "this is serializeable with Avro with an avro type named org.something.X".
        Then map the getters/setters or the fields to avro fields, and build any custom logic there if needed to deal with versions. Being able to map to a constructor would be cool too (like Jackson), but less important at the start.
        We could even set it up to map projected schemas – "this class can be serialized as org.something.X, or the projection org.something.minimalX if method 'isMinimal()' returns true""
        This same mapping can be done with an annotation MixIn if the class can't be modified at this time.
        Now, when decoding anything where an avro tye of X is encountered, it just builds the object as instructed by the annotations. Of course, this can all be optimized early on at class loading time rather than with runtime reflection with something like ASM.

        It may even be possible to just 'borrow' Jackson's annotations entirely, and be nearly or completely compatible with those.

        The reason why I say that a 'complete' annotation style API can replace both reflection and specific, is that the rules for specific can be one set of defaults – what to construct when a type does not map to a known class, and the reflection default rules the other (how to serialize when a class is not annotated). The need to generate classes at compile time might go away (It can be defined when first encountered with ASM). The default behavior for both cases can be defined as some sort of Mix-In default: When reflecting, if you find a short, serialize it as an avro fixed 2 byte quantity. When generating an object from a type that is not declared, create org.apache.commons.MutableInteger for avro ints.

        I had intended to create a ticket for something like the above after learning more and exploring – I should have more time over the next couple months to do more than observe and comment.

        Show
        Scott Carey added a comment - It also seems pretty arbitrary that integers, longs, and floats are represented with zigzag varint encoding, but shorts are always two bytes. Floats aren't encoded varint, are they? I can't see the advantage there, the high bits will be set too frequently. At this point I am not concerned about performance here. Thats fair. It wouldn't surprise me if Avro evolved to have int8, int16, int32, int64 and fixed8, fixed,16, fixed32, fixed64 "types" Maybe, but if this happens, it sounds like Avro 2.0 not Avro 1.x. Also,, there would be no benefit to varint of one byte, and for the int16 case there may be very little or no benefit. Its easy to speculate that a int32 is very often less than 2^20 in size. Its hard to speculate that shorts are mostly less than 2^6 and not frequently more than 2^13. Not sure what you mean by mix-ins, but, yes, you could annotate the field in the class whose schema is being induced. Basically if you don't want, or can't change class A, you can write MixIn class B that has annotations that "target" the methods and members of class A. See: http://wiki.fasterxml.com/JacksonMixInAnnotations The goal, is to allow annotating a class you can't change the source code for. Ok, if we're talking about the long term Reflect API, I will add this: I have been starting to dig in to using Avro myself, and thinking about schema evolution. I don't particularly like the Specific API and its code generation, I'd generally rather direct a schema at my own classes for most use cases. I don't want to use Reflection either, with its restrictions and performance pitfalls (my requirements differ from those Doug is working on for Hadoop RPC significantly). I think that these two APIs can be combined in one annotations based API. Sure, we can still have code generation from avro schemas with basic defaults to create classes, but that step can be optional, even for inter-language use cases. Imagine something like this. You have a pre-existing class, Stuff, and you want to define how it is serialized. You make an Avro schema for it, to share with other languages/tools. Now, you want to map the two together. Using Specific, you have to write wrapper code to read the Avro generated class into your current class (that has a little bit of logic in it, maybe a custom hashCode() and equals(), a few other constructors for test cases , and some setters and getters that aren't just "return foo" and "this.foo = foo". If this class is an already long lived class with lots of unit tests, there aren't a lot of nice ways to do this without refactoring more than just the class. More importantly, if you have 40 or so such classes — Reflect can somewhat get around this, but then if you want to share the data with other languages and tools you've just exposed your Java implementation of your object to the world... I'd rather not have a schema change just because I changed some internal data structure already encapsulated with getters/setters. Ideally, I would like to just annotate the class with something that says "this is serializeable with Avro with an avro type named org.something.X". Then map the getters/setters or the fields to avro fields, and build any custom logic there if needed to deal with versions. Being able to map to a constructor would be cool too (like Jackson), but less important at the start. We could even set it up to map projected schemas – "this class can be serialized as org.something.X, or the projection org.something.minimalX if method 'isMinimal()' returns true"" This same mapping can be done with an annotation MixIn if the class can't be modified at this time. Now, when decoding anything where an avro tye of X is encountered, it just builds the object as instructed by the annotations. Of course, this can all be optimized early on at class loading time rather than with runtime reflection with something like ASM. It may even be possible to just 'borrow' Jackson's annotations entirely, and be nearly or completely compatible with those. The reason why I say that a 'complete' annotation style API can replace both reflection and specific, is that the rules for specific can be one set of defaults – what to construct when a type does not map to a known class, and the reflection default rules the other (how to serialize when a class is not annotated). The need to generate classes at compile time might go away (It can be defined when first encountered with ASM). The default behavior for both cases can be defined as some sort of Mix-In default: When reflecting, if you find a short, serialize it as an avro fixed 2 byte quantity. When generating an object from a type that is not declared, create org.apache.commons.MutableInteger for avro ints. I had intended to create a ticket for something like the above after learning more and exploring – I should have more time over the next couple months to do more than observe and comment.
        Hide
        Philip Zeyliger added a comment -

        Floats aren't encoded varint, are they? I can't see the advantage there, the high bits will be set too frequently.

        Yes, they're not encoded varint. Totally bogus statement on my part.

        [Thoughts on a unified API]

        Scott, I totally see where you're going. I think an "annotation-based" API would be a great addition. Even though it's possible to unify "specific" and "reflection" APIs under that one roof, I really like the simplicity of the "specific" API. This comes in part from some experience with Protocol Buffers. Yes, it sucks to wrap each of your data classes with another class, but, in practice, this is often easier to follow than stuff that does some magic behind your back.

        In terms of examples, ORMs solve the same sorts of problems. Hiberate and friends deserve a look too, though I agree Jackson is more similar.

        Show
        Philip Zeyliger added a comment - Floats aren't encoded varint, are they? I can't see the advantage there, the high bits will be set too frequently. Yes, they're not encoded varint. Totally bogus statement on my part. [Thoughts on a unified API] Scott, I totally see where you're going. I think an "annotation-based" API would be a great addition. Even though it's possible to unify "specific" and "reflection" APIs under that one roof, I really like the simplicity of the "specific" API. This comes in part from some experience with Protocol Buffers. Yes, it sucks to wrap each of your data classes with another class, but, in practice, this is often easier to follow than stuff that does some magic behind your back. In terms of examples, ORMs solve the same sorts of problems. Hiberate and friends deserve a look too, though I agree Jackson is more similar.
        Hide
        Doug Cutting added a comment -

        I just committed this.

        Show
        Doug Cutting added a comment - I just committed this.

          People

          • Assignee:
            Doug Cutting
            Reporter:
            Doug Cutting
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development