Issue Details (XML | Word | Printable)

Key: HADOOP-6120
Type: New Feature New Feature
Status: Resolved Resolved
Resolution: Fixed
Priority: Major Major
Assignee: Sharad Agarwal
Reporter: Sharad Agarwal
Votes: 0
Watchers: 14
Operations

If you were logged in you would be able to see more operations.
Hadoop Common

Add support for Avro types in hadoop

Created: 30/Jun/09 06:03 AM   Updated: 08/Oct/09 06:52 PM
Return to search
Component/s: None
Affects Version/s: None
Fix Version/s: 0.21.0

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works 6120_v1.patch 2009-06-30 07:08 AM Sharad Agarwal 4 kB
Text File Licensed for inclusion in ASF works 6120_v2.patch 2009-07-14 07:55 AM Sharad Agarwal 17 kB
Text File Licensed for inclusion in ASF works 6120_v3.patch 2009-07-15 05:31 AM Sharad Agarwal 18 kB
Text File Licensed for inclusion in ASF works 6120_v4.patch 2009-07-16 10:39 AM Sharad Agarwal 24 kB
Text File Licensed for inclusion in ASF works 6120_v5.patch 2009-07-17 09:33 AM Sharad Agarwal 28 kB
Text File Licensed for inclusion in ASF works 6120_v6.patch 2009-07-21 03:55 AM Sharad Agarwal 28 kB
Text File Licensed for inclusion in ASF works HADOOP-6120.patch 2009-07-16 05:01 PM Doug Cutting 24 kB
Issue Links:
Reference

Release Note: New Avro serialization in .../io/serializer/avro.
Resolution Date: 23/Jul/09 07:22 PM


 Description  « Hide
Support to serialize and deserialize Avro types in Hadoop.

 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Sharad Agarwal added a comment - 30/Jun/09 06:30 AM
The Serialization interface can be implemented for Avro types. Currently Avro has 3 types in java: specific, reflect and generic. I think the current Serialization interface is fine to implement 'specific' type as the schema information can be inferred from the class. The schema can't be inferred from the 'reflect' and 'generic' type classes. So the current Serialization interface doesn't seem to be sufficient to implement these. I think the Serialization interface need to be extended to pass on additional info to have implementation for 'generic' and 'reflect'.

Sharad Agarwal added a comment - 30/Jun/09 07:08 AM
An early patch having AvroSpecificSerialization. Test case not added yet. Include avro jar in classpath to build with this patch.
Avro release is round the corner. Once it is released we can checkin avro jar in lib folder till it is published to maven repos.

Philip Zeyliger added a comment - 30/Jun/09 06:53 PM
This is perhaps more of an AVRO question, but, out of curiosity, is there a way to avoid this cast? Could SpecificDatumReader.read() return SpecificRecord, or does the inheritance model forbid it?

return (SpecificRecord) reader.read(t, decoder);


Doug Cutting added a comment - 30/Jun/09 06:54 PM
This looks useful. A few comments:
  • Support for generic data will eventually be needed but will require changes to Avro's serializaton API that we should address in a separate issue.
  • I think we can use ReflectData#getSchema() to implement a reflect-based Serializer. The only tricky bit will be the 'accept()' implementation, which might use a configuration property and/or a tag interface (e.g., AvroReflectSerializeable).
  • The call to ReflectionUtils#newInstance() in deserialize is not required.
  • The schema might be created in the constructor rather than in the open() method.

Sharad Agarwal added a comment - 14/Jul/09 07:55 AM
This patch:
  • addresses Doug's concerns
  • moves the common code in abstract base class AvroSerialization
  • implements 'reflect' serialization based on Reflect#getSchema. The accept call is based on two mechanisms:
    tag interface - AvroReflectSerializable.
    package names. Applications can configure the package to be considered as reflect type. Applications then can use it with their existing classes without the need for implementing AvroReflectSerializable interface.
  • adds testcase

Doug Cutting added a comment - 14/Jul/09 05:11 PM
Looks good. Some comments:
  • in AvroSerialization, s/GenericDatum{Reader,Writer}/Datum{Reader,Writer}/? All implementations will probably always use a subclass of Generic, but it feels to me like this API should be more abstract, no?
  • in AvroReflectSerialization, 'List<String> packages' might be more efficient as Set<String>?
  • in AvroReflectSerialization, getPackages() could use Configuration#getStrings().
  • does Avro's reflect API actually require that fields are public? Specific always generates public fields, but I think reflect should be able to serialize and deserialize private fields. If not, maybe we should fix that in Avro?

Sharad Agarwal added a comment - 15/Jul/09 03:16 AM

All implementations will probably always use a subclass of Generic, but it feels to me like this API should be more abstract, no?

Absolutely. I realized that but missed to use it in the patch.

reflect should be able to serialize and deserialize private fields. If not, maybe we should fix that in Avro?

I have filed AVRO-78. At places the avro code uses Class#getFields() instead of Class#getDeclaredFields()


Sharad Agarwal added a comment - 15/Jul/09 05:31 AM
Incorporated review comments.

Hadoop QA added a comment - 15/Jul/09 08:26 AM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12413524/6120_v3.patch
against trunk revision 793987.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 9 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

-1 findbugs. The patch appears to cause Findbugs to fail.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

-1 core tests. The patch failed core unit tests.

-1 contrib tests. The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/573/testReport/
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/573/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/573/console

This message is automatically generated.


Sharad Agarwal added a comment - 15/Jul/09 08:47 AM
Hudson failed because patch needs avro jar.

All tests passed locally.
test-patch output:

-1 overall.
     [exec]
     [exec]     +1 @author.  The patch does not contain any @author tags.
     [exec]
     [exec]     +1 tests included.  The patch appears to include 9 new or modified tests.
     [exec]
     [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
     [exec]
     [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
     [exec]
     [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
     [exec]
     [exec]     -1 release audit.  The applied patch generated 265 release audit warnings (more than the trunk's current 264 warnings).

Release audit warnings are due to test file avroRecord.avsc not having license header.


Tom White added a comment - 15/Jul/09 01:21 PM
Looks good. Minor nits:
  • Move the testSerialization() method from TestWritableSerialization since it is now being used from the Avro tests. Something like SerializationTestUtil.
  • Add a package.html to the org.apache.hadoop.io.serializer.avro package.

Have you tried using either of these serializations with MapReduce? Something like TestJavaSerialization would be good (but this could be another issue).


Doug Cutting added a comment - 15/Jul/09 06:35 PM
Looks great!
  • shouldn't Serializer/Desrializer#close() call encoder/decoder.close()?
  • should AvroReflectSerialization#accept() be synchronized? I don't know if we generally require Serializers to be thread-safe, but it's probably best and should not have much performance impact.

Doug Cutting added a comment - 15/Jul/09 09:05 PM
> Hudson failed because patch needs avro jar.

The avro jar should soon appear in the Maven repo. Then we'll be able to just add it as a dependency in ivy.xml.


Sharad Agarwal added a comment - 16/Jul/09 10:35 AM

shouldn't Serializer/Desrializer#close() call encoder/decoder.close()?

I don't see encoder/decoder has close() method. Am I missing something?


Sharad Agarwal added a comment - 16/Jul/09 10:39 AM
Taken care of Tom and Doug's comments:
  1. Moved the testSerialization() method from TestWritableSerialization
  2. Added package.html to the org.apache.hadoop.io.serializer.avro
  3. Made AvroReflectSerialization#accept() as synchronized

Sharad Agarwal added a comment - 16/Jul/09 10:45 AM

Have you tried using either of these serializations with MapReduce? Something like TestJavaSerialization would be good (but this could be another issue).

Now after the project split this has to be another issue. smile


Doug Cutting added a comment - 16/Jul/09 04:59 PM
> I don't see encoder/decoder has close() method. Am I missing something?

No, it seems Avro is. I think the contract for serializers and deserializers is that they can close the underlying stream. So, until we add Encoder/Decoder#close() to Avro, we should explicitly save a pointer to the stream passed in and close it after flushing the encoder/decoder.

A few other comments.

  • AvroSerialization should be public, since public classes extend it.
  • Every public or protected method and field in public classes should have javadoc, unless its inherited.
  • We should add a bit more documentation at the package and class level about how this is to be used, e.g., explain that reflect serialization will work for any class that's either in the configured package list or that implements the tag interface, and that specific serialization will just work for generated classes.
  • Is there any reason not to add these serializers to io.serializations by default?

Doug Cutting added a comment - 16/Jul/09 05:01 PM
Attaching a patch that's identical to Sharad's last patch, except that Avro is now retrieved from the Maven repo. This should now hopefully pass Hudson tests.

Doug Cutting added a comment - 16/Jul/09 05:02 PM
Marking as "Patch Available" to see if Hudson can now find Avro.

Hadoop QA added a comment - 16/Jul/09 07:02 PM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12413707/HADOOP-6120.patch
against trunk revision 793987.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 15 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

-1 release audit. The applied patch generated 261 release audit warnings (more than the trunk's current 260 warnings).

+1 core tests. The patch passed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/576/testReport/
Release audit warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/576/artifact/trunk/current/releaseAuditDiffWarnings.txt
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/576/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/576/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/576/console

This message is automatically generated.


Doug Cutting added a comment - 16/Jul/09 07:25 PM
Everything passed but the release audit. Not bad.

The "release audit warnings" link above is broken, but the offending file is avroRecord.avsc. We should be able to put an Apache license at top of that file, since Jackson, the JSON parser that Avro uses, permits C-style comments.


Sharad Agarwal added a comment - 17/Jul/09 09:33 AM
Implemented AvroSerialization#close, added more javadoc, made AvroSerialization as public.

Is there any reason not to add these serializers to io.serializations by default?

I didn't add considering JavaSerialization also not there in default. But I think its ok to have these. Added both Avro and java serializations to core-default.xml and SerializationFactory.


Tom White added a comment - 17/Jul/09 09:52 AM
JavaSerialization was written more to demonstrate the generality of the abstraction (HADOOP-1986) than to be used in real-world applications, since its performance is considerably less than Writables (and presumably Avro). So I would be inclined not to add it to the default list, preferring users to explicitly enable it. Does this sound reasonable?

Hadoop QA added a comment - 17/Jul/09 11:01 AM
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12413792/6120_v5.patch
against trunk revision 795028.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 15 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

+1 core tests. The patch passed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/579/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/579/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/579/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/579/console

This message is automatically generated.


Doug Cutting added a comment - 17/Jul/09 08:10 PM
Sigh. I just reallized this is really fragile, since we don't permit one to specify a different schema to read things.

We need to get metadata from the serializer used to write data to the deserializer used to deserialize it. For even Writable, we should check that the data was indeed written with WritableSerializer, since folks can configure a given class to be serialized in different ways.

So we might add methods like:

  • Deserializer Serialization#getDeserializer(Map<String,String> meta);
  • Serializer Serialization#getSerializer(Map<String,String> meta);
  • Map<String,String> Serializer#getMeta();
    And alter, e.g. SequenceFile and other containers to use these.

Perhaps this belongs in another Jira, but without it we're missing out on a lot of Avro's functionality.


Sharad Agarwal added a comment - 20/Jul/09 11:39 AM

JavaSerialization was written more to demonstrate the generality of the abstraction

I am fine removing JavaSerialization from the defaults, if folks think it will discourage people to use java serialization.

folks can configure a given class to be serialized in different ways.

Do you mean that accept(Class<?> c) should select serialization not only based on class but also based on the meta data passed ? We need to change accept as well, no ?

Perhaps this belongs in another Jira, but without it we're missing out on a lot of Avro's functionality.

I think this can be discussed in another Jira, as the current implementation should be fine in case reader and writer are for same schema. We can start using this in mapreduce jobs.


Doug Cutting added a comment - 20/Jul/09 07:37 PM
If, e.g., a class both implements Writable and also either implements AvroReflectSerializeable or whose package is listed in the config, then which serialization is used is determined by the order of the serializers in the config, which could change. Similarly for the Serializeable interface and JavaSerialization. So its not safe to assume that the Class->Serialization map is fixed, and we should really be storing at least the serialization's name in container metadata, and probably also a version (e.g. serialVersionUID for JavaSerialization). And once we have a mechanism to support that, we can also store other metadata, like the schema, so that we can read older versions and generic data.

+1 for removing JavaSerialization from the defaults.

+1 for a separate Jira on serialization metadata. But it would be best to have such metadata by the 0.21 freeze, in two weeks.


Sharad Agarwal added a comment - 21/Jul/09 03:55 AM
Removed JavaSerialization from defaults

Hadoop QA added a comment - 21/Jul/09 04:18 AM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12414061/6120_v6.patch
against trunk revision 796052.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 15 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

-1 core tests. The patch failed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/582/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/582/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/582/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/582/console

This message is automatically generated.


Sharad Agarwal added a comment - 21/Jul/09 04:26 AM
Test failure is unrelated.

Tom White added a comment - 22/Jul/09 11:44 AM
Opened HADOOP-6165 for discussion about serialization metadata.

Doug Cutting added a comment - 23/Jul/09 07:22 PM
I just committed this. Thanks, Sharad!

Tsz Wo (Nicholas), SZE added a comment - 10/Aug/09 07:08 PM
After updated hadoop-core-0.21.0-dev.jar and hadoop-core-test-0.21.0-dev.jar on hdfs, some tests are failing with "java.lang.NoClassDefFoundError: org/apache/avro/io/DatumReader
". See HDFS-534.

Robert Chansler added a comment - 08/Oct/09 06:52 PM
Editorial pass over all release notes prior to publication of 0.21.