Previous discussion on the mailing list led me down several paths.
What I have so far needs work in a couple of areas and more testing, but it is in a good state to share the core.
Essentially, I wrote a DatumReader / DatumWriter pair for Pig.
My first attempt was to mimic GenericDatumReader/Writer, but there were a lot of branches in the code that would have caused more work for me to reach good unit test coverage. Plus, I wanted to use this as a place to try out a DatumReader/Writer pair that did not require traversing the Avro Schema for every tuple.
In some sense this is a prototype for a performance improvement to GenericDatumReader/Writer, but Pig is simpler because there is no recursion or schema resolution. I will have to learn the Parser better in order to leverage it for something similar.
Schema Translation Design
The core classes where the tricky stuff happens are org.apache.avro.pig.PigDataAssembly and org.apache.avro.pig.SchemaTranslator.
PigDataAssembly takes an Avro schema that represents Pig data and builds a data structure that can read/write Tuples to Avro. It's really just a linked list of Action objects (private inner classes), one for each type that can exist in a Tuple. If this were C, they would be structs with a function pointer and some miscellaneous data; since it's Java, they're private inner classes that override a method from a supertype.
Once this data structure is created, it can efficiently read data from a Decoder or write to an Encoder without parsing the schema. However, this makes it stateful, and, like a ResolvingDecoder, the initialization is not cheap (though it is still fairly fast).
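To make the idea concrete, here is a minimal sketch of the action-chain pattern described above. The class and method names (ActionChainSketch, the toy Decoder interface, INT_ACTION, etc.) are invented for illustration; the real PigDataAssembly and the Avro Decoder API are richer than this.

```java
import java.util.ArrayList;
import java.util.List;

public class ActionChainSketch {
    // Stand-in for an Avro Decoder; only the calls this sketch needs.
    interface Decoder {
        int readInt();
        String readString();
    }

    // The "struct with a function pointer": one Action per field type,
    // expressed as a subtype overriding a single method.
    abstract static class Action {
        abstract Object read(Decoder in);
    }

    static final Action INT_ACTION = new Action() {
        Object read(Decoder in) { return in.readInt(); }
    };
    static final Action STRING_ACTION = new Action() {
        Object read(Decoder in) { return in.readString(); }
    };

    // Built once from the schema; no schema traversal per tuple afterwards.
    private final List<Action> actions;

    ActionChainSketch(List<Action> actions) { this.actions = actions; }

    List<Object> readTuple(Decoder in) {
        List<Object> tuple = new ArrayList<Object>(actions.size());
        for (Action a : actions) tuple.add(a.read(in));
        return tuple;
    }
}
```

The point is that the schema-dependent decisions are made once, when the chain is built, and every subsequent tuple is read by a straight walk down the list.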
SchemaTranslator is responsible for handling Pig schemas (ResourceSchema), converting them to Avro Schemas, and validating that an Avro schema actually represents a Pig Tuple created by the same class. The current restriction is that you can't read an arbitrary Avro record and make a Tuple out of it. The set of possible Avro schemas that could be coerced into a Tuple is much larger than what is supported, but I wanted to support that in a separate place. This class is strict and constrained to reading and writing from Pig, although data written by it can easily be read with GenericDatumReader.
One of the primary goals was to be able to persist the Pig schema so that when the data was read back by Pig, the loader would provide the same schema that was stored originally. To accomplish this, the schema translation builds a custom Avro schema from the Pig schema, preserving field names and types. The only exception is the value type of a Pig Map, which must be generic with no field names and is untyped from the point of view of a Pig schema. There is a "GENERIC_ELEMENT" Schema that handles this, along with corresponding Bag and Tuple types, to cover all possible map values.
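As a rough sketch of the field-by-field translation idea, assuming a simplified world where Pig and Avro types are just strings (the real SchemaTranslator works on Pig's ResourceSchema and Avro's Schema classes, and the mapping below is an invented subset for illustration):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TranslationSketch {
    // Illustrative mapping of a few Pig scalar types to Avro type names.
    static String avroType(String pigType) {
        if (pigType.equals("int"))       return "int";
        if (pigType.equals("long"))      return "long";
        if (pigType.equals("chararray")) return "string";
        if (pigType.equals("bytearray")) return "bytes";
        // A Pig map's value type is unknown to the Pig schema, so it
        // maps to a generic element that can hold any supported type.
        if (pigType.equals("map"))       return "map<GENERIC_ELEMENT>";
        throw new IllegalArgumentException("unsupported: " + pigType);
    }

    // Preserve field names and their order, mirroring the goal of
    // round-tripping the Pig schema through storage.
    static Map<String, String> translate(Map<String, String> pigFields) {
        Map<String, String> out = new LinkedHashMap<String, String>();
        for (Map.Entry<String, String> e : pigFields.entrySet())
            out.put(e.getKey(), avroType(e.getValue()));
        return out;
    }
}
```

The essential property is that the translation is lossless for names and types everywhere except map values, which collapse to the generic element.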
Integration with Hadoop / Pig
The next set of complexities are around Hadoop integration. Pig 0.7+ requires using the org.apache.hadoop.mapreduce API, not the legacy mapred API. So, I made org.apache.avro.mapreduce classes – AvroInputFormat, AvroOutputFormat, and AvroRecordReader modeled mostly after the mapred ones. However, these are limited to input and output – not meant to be used for intermediate data between Map and Reduce.
Pig currently has several hoops you have to jump through to get a StoreFunc and LoadFunc to work correctly. In particular, you can't easily read/write the job configuration, and you have only a couple of points of access for moving information from the front-end to the back-end.
I think the right thing to do is to remove the org.apache.avro.mapreduce classes I have here and simply embed them in the only classes that use them: PigAvroInputFormat and PigAvroOutputFormat. That would significantly clean things up and prevent users from wondering which API to use for non-Pig mapreduce stuff.
The unit tests I've written cover schema translation and serialization/deserialization well, but I have not tested all of the Pig/Hadoop wrappers.
Other Potential Avro Changes / Additions
DatumWriter/Reader are problematic here. Because the PigDataAssembly is stateful, it would be best to force a DatumWriter or DatumReader to take a Schema and an Encoder/Decoder in the constructor and disallow setting them later. In particular, changing the Schema is expensive.
The fact that we need to construct a DatumReader before creating a DataFileReader means that the schema has to be set later. Perhaps we can delay construction by passing around a factory rather than an already constructed object? Then creation of the DatumReader/Writer can wait until both the Schema and Decoder/Encoder are available.
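The factory idea might look something like the following. All of the types here (Schema, Decoder, DatumReader, DatumReaderFactory) are minimal stand-ins, not the real Avro classes; the shape of the interfaces is the point.

```java
public class ReaderFactorySketch {
    interface Schema {}
    interface Decoder {}
    interface DatumReader { Object read(); }

    // What would be passed around instead of a constructed DatumReader:
    // construction is deferred until both pieces are in hand.
    interface DatumReaderFactory {
        DatumReader create(Schema schema, Decoder in);
    }

    // A file reader would parse the schema from the file header, then
    // construct the reader exactly once, fully initialized, with no
    // setSchema()-after-the-fact step.
    static DatumReader open(DatumReaderFactory f, Schema fromHeader, Decoder in) {
        return f.create(fromHeader, in);
    }
}
```

This would let a stateful implementation like PigDataAssembly build its action chain once in the constructor and remain immutable afterwards.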
As a more general concept, this is moving Schema resolution and Object serialization out as separate concerns from the Encoder / Decoder. Right now, a DatumReader/Writer handles object serialization, but the Decoder/Encoder handles Schema resolution. From a performance and code consolidation perspective, these two actions should be moved closer together – they both require parsing Schemas and can be optimized together by doing:
[writer Schema] ->> [Shared Assembly Construction configured for a type -- Generic, Specific, Pig, Hive, etc] ->> SerializationAssembly
[reader Schema, actual Schema] ->> [Shared Assembly Construction for a type] ->> SerializationAssembly
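One possible way the shared-assembly idea could be expressed as interfaces (again, every name here is hypothetical; schemas are reduced to strings for brevity): the expensive schema work happens once in the builder, producing an assembly that reads and writes without touching the Schema again.

```java
public class AssemblySketch {
    interface Encoder {}
    interface Decoder {}

    // The end product of assembly construction: reads/writes datums of
    // some representation D with no further schema parsing.
    interface SerializationAssembly<D> {
        D read(Decoder in);
        void write(D datum, Encoder out);
    }

    // Per-representation builders (Generic, Specific, Pig, Hive, ...)
    // would implement this, sharing the schema-parsing machinery.
    interface AssemblyBuilder<D> {
        SerializationAssembly<D> forWriting(String writerSchema);
        SerializationAssembly<D> forReading(String readerSchema, String actualSchema);
    }
}
```

Schema resolution then lives in `forReading`, next to object serialization, rather than being split between the Decoder and the DatumReader.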
The shared assembly construction could leverage the Parser, but I'm not familiar enough with it to do that yet.