It also seems pretty arbitrary that integers, longs, and floats are represented with zigzag varint encoding, but shorts are always two bytes.
Floats aren't encoded as varints, are they? I can't see the advantage there; the high bits will be set too frequently.
At this point I am not concerned about performance here.
It wouldn't surprise me if Avro evolved to have int8, int16, int32, int64 and fixed8, fixed16, fixed32, fixed64 "types".
Maybe, but if this happens, it sounds like Avro 2.0 not Avro 1.x.
Also, there would be no benefit to a varint of one byte, and for the int16 case there may be very little or no benefit. It's easy to speculate that an int32 is very often less than 2^20 in size. It's hard to speculate that shorts are mostly less than 2^6 and not frequently more than 2^13.
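To make the size argument concrete, here is a minimal Java sketch that counts how many bytes a zigzag varint needs for a given value. It is not Avro's actual code; the class and method names (ZigZagSize, zigZag, encodedSize) are made up for illustration. It only assumes the usual scheme of 7 payload bits per byte plus a continuation bit.

```java
// Illustrative sketch only: names are hypothetical, not taken from Avro.
final class ZigZagSize {
    // Zigzag maps signed values to unsigned ones so small magnitudes stay small:
    // 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
    static long zigZag(long n) {
        return (n << 1) ^ (n >> 63);
    }

    // Each varint byte carries 7 payload bits; the top bit says "more bytes follow".
    static int encodedSize(long n) {
        long z = zigZag(n);
        int bytes = 1;
        while ((z & ~0x7FL) != 0) {
            z >>>= 7;
            bytes++;
        }
        return bytes;
    }

    public static void main(String[] args) {
        long[] samples = {0, 63, 64, -64, -65, 8191, 8192, Short.MAX_VALUE, Short.MIN_VALUE};
        for (long v : samples) {
            System.out.println(v + " -> " + encodedSize(v) + " byte(s)");
        }
    }
}
```

Under those assumptions, only values in [-64, 63] fit in one byte, values in [-8192, 8191] fit in two, and the rest of the short range needs three. So a varint short beats the fixed two bytes only below 2^6 in magnitude, and loses at 2^13 and above.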
Not sure what you mean by mix-ins, but, yes, you could annotate the field in the class whose schema is being induced.
Basically, if you don't want to, or can't, change class A, you can write a MixIn class B that has annotations that "target" the methods and members of class A. See:
The goal is to allow annotating a class whose source code you can't change.
Ok, if we're talking about the long term Reflect API, I will add this:
I have been starting to dig into using Avro myself, and thinking about schema evolution. I don't particularly like the Specific API and its code generation; I'd generally rather direct a schema at my own classes for most use cases. I don't want to use Reflection either, with its restrictions and performance pitfalls (my requirements differ significantly from those Doug is working on for Hadoop RPC).
I think that these two APIs can be combined in one annotation-based API. Sure, we can still have code generation from Avro schemas with basic defaults to create classes, but that step can be optional, even for inter-language use cases.
Imagine something like this.
You have a pre-existing class, Stuff, and you want to define how it is serialized. You make an Avro schema for it, to share with other languages/tools. Now, you want to map the two together. Using Specific, you have to write wrapper code to read the Avro generated class into your current class (which has a little bit of logic in it: maybe a custom hashCode() and equals(), a few other constructors for test cases, and some setters and getters that aren't just "return foo" and "this.foo = foo"). If this class is an already long-lived class with lots of unit tests, there aren't a lot of nice ways to do this without refactoring more than just the class. More importantly, if you have 40 or so such classes ...
Reflect can somewhat get around this, but then if you want to share the data with other languages and tools you've just exposed your Java implementation of your object to the world... I'd rather not have a schema change just because I changed some internal data structure already encapsulated with getters/setters.
Ideally, I would like to just annotate the class with something that says "this is serializable with Avro with an Avro type named org.something.X".
Then map the getters/setters or the fields to Avro fields, and build any custom logic there if needed to deal with versions. Being able to map to a constructor would be cool too (like Jackson), but less important at the start.
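A rough sketch of what that could look like, in plain Java so it runs without Avro on the classpath. The annotations @AvroType and @AvroField are hypothetical (nothing here is an existing Avro API), and toRecord just collects the annotated getters into a map to show that the mapping is driven by the annotations and not by the class's internal fields.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class AnnotationMappingSketch {
    // Hypothetical annotations, invented for this example.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface AvroType { String value(); }

    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    @interface AvroField { String value(); }

    // A pre-existing class with some logic of its own.
    @AvroType("org.something.X")
    static final class Stuff {
        private final List<String> tags = new ArrayList<>();
        private int count;

        @AvroField("count")
        public int getCount() { return count; }

        // Setter with logic, i.e. not just "this.count = count".
        public void setCount(int count) { this.count = Math.max(0, count); }

        // The internal List is not exposed; the mapped field is a plain string.
        @AvroField("tags")
        public String getTagsCsv() { return String.join(",", tags); }

        public void addTag(String tag) { tags.add(tag); }
    }

    // Returns the declared Avro type name, or null if the class is not annotated.
    static String avroTypeName(Class<?> c) {
        AvroType t = c.getAnnotation(AvroType.class);
        return t == null ? null : t.value();
    }

    // Collects every @AvroField getter into a name -> value map (sorted by name).
    static Map<String, Object> toRecord(Object o) {
        Map<String, Object> record = new TreeMap<>();
        try {
            for (Method m : o.getClass().getMethods()) {
                AvroField f = m.getAnnotation(AvroField.class);
                if (f != null) {
                    record.put(f.value(), m.invoke(o));
                }
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
        return record;
    }
}
```

Swapping the internal List for some other structure would not change the resulting record, which is the point: the schema follows the annotated accessors, not the implementation.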
We could even set it up to map projected schemas – "this class can be serialized as org.something.X, or the projection org.something.minimalX if method 'isMinimal()' returns true".
This same mapping can be done with an annotation MixIn if the class can't be modified at this time.
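For the mix-in case, here is a minimal sketch of the lookup idea in plain Java. Everything in it is hypothetical: the @AvroField annotation, the addMixIn registry, and the class names are invented for illustration and are not Jackson's or Avro's API. The only idea shown is that when a method on the target class has no annotation, the same-signature method on the registered mix-in is consulted.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

final class MixInSketch {
    // Hypothetical annotation, invented for this example.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    @interface AvroField { String value(); }

    // A class we cannot modify: it carries no annotations.
    static final class LegacyPoint {
        public int getX() { return 1; }
        public int getY() { return 2; }
    }

    // The mix-in is never instantiated; only its signatures and annotations matter.
    interface LegacyPointMixIn {
        @AvroField("x") int getX();
        @AvroField("y") int getY();
    }

    private static final Map<Class<?>, Class<?>> MIX_INS = new HashMap<>();

    static void addMixIn(Class<?> target, Class<?> mixIn) {
        MIX_INS.put(target, mixIn);
    }

    // Look on the target method first, then on the matching mix-in method.
    static AvroField findField(Method m) {
        AvroField direct = m.getAnnotation(AvroField.class);
        if (direct != null) {
            return direct;
        }
        Class<?> mixIn = MIX_INS.get(m.getDeclaringClass());
        if (mixIn == null) {
            return null;
        }
        try {
            return mixIn.getMethod(m.getName(), m.getParameterTypes())
                        .getAnnotation(AvroField.class);
        } catch (NoSuchMethodException e) {
            return null; // the mix-in does not mention this method
        }
    }

    // Avro field name -> Java method name, for every mapped method on the target.
    static Map<String, String> fieldMapping(Class<?> target) {
        Map<String, String> out = new TreeMap<>();
        for (Method m : target.getMethods()) {
            AvroField f = findField(m);
            if (f != null) {
                out.put(f.value(), m.getName());
            }
        }
        return out;
    }
}
```

Registering the pair with MixInSketch.addMixIn(LegacyPoint.class, LegacyPointMixIn.class) makes fieldMapping(LegacyPoint.class) return the x and y mappings without LegacyPoint itself being touched.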
Now, when decoding anything where an Avro type of X is encountered, it just builds the object as instructed by the annotations. Of course, this can all be optimized early, at class-loading time with something like ASM, rather than with runtime reflection.
It may even be possible to just 'borrow' Jackson's annotations entirely, and be nearly or completely compatible with those.
The reason why I say that a 'complete' annotation-style API can replace both Reflect and Specific is that the rules for Specific can be one set of defaults (what to construct when a type does not map to a known class), and the Reflect rules the other (how to serialize when a class is not annotated). The need to generate classes at compile time might go away (a class can be defined when first encountered, with ASM). The default behavior for both cases can be defined as some sort of Mix-In default: when reflecting, if you find a short, serialize it as an Avro fixed 2-byte quantity; when generating an object from a type that is not declared, create org.apache.commons.MutableInteger for Avro ints.
I had intended to create a ticket for something like the above after learning more and exploring – I should have more time over the next couple of months to do more than observe and comment.