Issue Details (XML | Word | Printable)

Key: HADOOP-3413
Type: Bug Bug
Status: Closed Closed
Resolution: Fixed
Priority: Critical Critical
Assignee: Tom White
Reporter: Arun C Murthy
Votes: 0
Watchers: 2
Operations

If you were logged in you would be able to see more operations.
Hadoop Common

SequenceFile.Reader doesn't use the Serialization framework

Created: 19/May/08 08:33 AM   Updated: 22/Aug/08 07:50 PM
Return to search
Component/s: io
Affects Version/s: 0.17.0
Fix Version/s: 0.18.0

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works hadoop-3413-v2.patch 2008-06-02 04:39 PM Tom White 18 kB
Text File Licensed for inclusion in ASF works hadoop-3413-v3.patch 2008-06-02 07:55 PM Tom White 19 kB
Text File Licensed for inclusion in ASF works hadoop-3413-v4.patch 2008-06-06 10:53 AM Tom White 21 kB
Text File Licensed for inclusion in ASF works hadoop-3413.patch 2008-05-19 01:53 PM Tom White 9 kB

Hadoop Flags: Reviewed
Resolution Date: 06/Jun/08 10:24 PM


 Description  « Hide
Currently SequenceFile.Reader only works with Writables, since it doesn't use the new Serialization framework. This is a glaring considering that SequenceFile.Writer uses the Serializer and handles arbitrary types via the SerializationFactory.

 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Arun C Murthy added a comment - 19/May/08 08:33 AM
Is this a 0.17.0 blocker?

Arun C Murthy made changes - 19/May/08 08:42 AM
Field Original Value New Value
Component/s io [ 12310687 ]
Arun C Murthy added a comment - 19/May/08 09:32 AM
Uh, and SequenceFile{Input|Output}Format & SequenceFileRecord{Reader|Writer} still relies on Writables too...

Tom White added a comment - 19/May/08 12:37 PM
There is still clearly work to make the serialization framework fully supported in Hadoop, but I don't think the lack of support in SequenceFile.Reader is a blocker for 0.17.0.

The work done in HADOOP-1986 enabled serialization support in the MapReduce kernel, so you can use arbitrary types for keys and values. However, it is not yet possible to use arbitrary types for map inputs or reduce outputs out of the box, since the support from SequenceFile{Input|Output}Format and SequenceFileRecord{Reader|Writer} is still Writable-based as you point out. That said, it is possible to write your own InputFormat, OutputFormat, RecordReader, RecordWriter implementations to do this for you. For example, you can use SequenceFile.Writer#append(Object, Object) to write any objects to a sequence file (using a Serializer) and the SequenceFile.Reader#nextRaw methods to read bytes out to be manually deserialized using a Deserializer.

On a related note, unfortunately the RecordReader interface is incompatible with serialization frameworks that don't reuse objects - like Java Serialization. The problem is that

boolean next(K key, V value) throws IOException

has no way of passing keys and values that are deserialized from the stream back to the client of the RecordReader. This is not a problem for Writables and Thrift since the client passes in objects that are updated in-place. To fix this will require some surgery on the API.


Devaraj Das added a comment - 19/May/08 01:01 PM
I agree that this shouldn't be a blocker for 17 for the reasons Tom outlined.

Tom White added a comment - 19/May/08 01:53 PM
Here's a patch for extending SequenceFile.Reader, as well as making SequenceFile{Input|Output}Format and SequenceFileRecord{Reader|Writer} all use the Serialization framework. (I haven't tested it yet.)

Tom White made changes - 19/May/08 01:53 PM
Attachment hadoop-3413.patch [ 12382302 ]
Arun C Murthy added a comment - 19/May/08 05:26 PM - edited
Tom, makes sense - thanks for weighing in!

The minor jarring note is that a simple app which writes a SequenceFile with Java Serialization won't be able to read it back...
On the bright side most Map-Reduce applications are ok since SequenceFile{Input|Output}Format doesn't support non-Writables yet.


Tom White added a comment - 19/May/08 07:10 PM

a simple app which writes a SequenceFile with Java Serialization won't be able to read it back...

But it would have to write its own SequenceFileOutputFormat as things stand, so while possible, it's not likely to be a common occurrence.

Thanks for picking this up Arun - I hope we can get this fixed in 0.18. There are also few other places that can remove their dependency on Writables without much effort - e.g. SequenceFileInputFilter, and MultipleOutputFormat (and subclasses).


Enis Soztutar added a comment - 20/May/08 12:09 PM
Please do not forget MapFile and related stuff.

Tom White made changes - 20/May/08 07:55 PM
Assignee Arun C Murthy [ acmurthy ] Tom White [ tomwhite ]
Tom White added a comment - 02/Jun/08 04:39 PM
New patch to address SequenceFileInputFilter and MultipleOutputFormat. I'll open another issue for MapFile changes.

Tom White made changes - 02/Jun/08 04:39 PM
Attachment hadoop-3413-v2.patch [ 12383235 ]
Tom White made changes - 02/Jun/08 04:39 PM
Status Open [ 1 ] Patch Available [ 10002 ]
Hadoop QA added a comment - 02/Jun/08 06:49 PM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12383235/hadoop-3413-v2.patch
against trunk revision 661918.

+1 @author. The patch does not contain any @author tags.

-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no tests are needed for this patch.

-1 javadoc. The javadoc tool appears to have generated 1 warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

+1 core tests. The patch passed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2538/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2538/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2538/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2538/console

This message is automatically generated.


Tom White added a comment - 02/Jun/08 07:55 PM
New patch to fix javadoc warnings. Existing tests ensure this change doesn't break anything.

Tom White made changes - 02/Jun/08 07:55 PM
Attachment hadoop-3413-v3.patch [ 12383248 ]
Owen O'Malley added a comment - 04/Jun/08 06:31 AM
Doesn't this patch have the same problem that HADOOP-2997 had with the java serialization? It seems like the java serialization, which requires passing null the first time is pretty incompatible with a next method that looks like:
boolean next(K key, V value) throws IOException;

since clearly if you call it with next(null, null) you aren't going to be able to read your data. smile


Tom White added a comment - 04/Jun/08 09:20 AM
Owen,

I agree. This is a problem for the RecordReader interface, which will need work to change. However, I wasn't proposing we make such a radical change in this Jira, instead we are making it possible to use the serialization framework with SequenceFile.Reader, and removing unnecessary dependencies on Writable in some of the MapReduce layers (so you could at least use Thrift there). Note that the changes to SequenceFile.Reader are compatible with Java serialization, since instead of adding

public boolean next(Object key, Object val)

there's

public Object next(Object key)

and

public Object getCurrentValue(Object val)

Owen O'Malley added a comment - 04/Jun/08 06:10 PM
Ah, ok. So if I understand, this will let you read java.io sequence files, but not use it for map reduce input, right?

Owen O'Malley added a comment - 04/Jun/08 06:12 PM
I really think this should have unit tests, before it gets committed.

Owen O'Malley made changes - 04/Jun/08 06:12 PM
Status Patch Available [ 10002 ] Open [ 1 ]
Tom White added a comment - 06/Jun/08 10:53 AM
Updated patch to test the new public API of SequenceFile.reader.

Tom White made changes - 06/Jun/08 10:53 AM
Attachment hadoop-3413-v4.patch [ 12383547 ]
Tom White made changes - 06/Jun/08 10:54 AM
Status Open [ 1 ] Patch Available [ 10002 ]
Hadoop QA added a comment - 06/Jun/08 02:29 PM
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12383547/hadoop-3413-v4.patch
against trunk revision 663889.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 4 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

+1 core tests. The patch passed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2606/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2606/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2606/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2606/console

This message is automatically generated.


Repository Revision Date User Message
ASF #664159 Fri Jun 06 22:18:48 UTC 2008 omalley HADOOP-3413. Allow SequenceFile.Reader to use serialization framework. Contributed by Tom White.
Files Changed
MODIFY /hadoop/core/trunk/src/java/org/apache/hadoop/mapred/SequenceFileInputFormat.java
MODIFY /hadoop/core/trunk/src/java/org/apache/hadoop/io/SequenceFile.java
MODIFY /hadoop/core/trunk/CHANGES.txt
MODIFY /hadoop/core/trunk/src/java/org/apache/hadoop/mapred/SequenceFileOutputFormat.java
MODIFY /hadoop/core/trunk/src/java/org/apache/hadoop/mapred/lib/MultithreadedMapRunner.java
MODIFY /hadoop/core/trunk/src/java/org/apache/hadoop/mapred/lib/MultipleOutputFormat.java
MODIFY /hadoop/core/trunk/src/java/org/apache/hadoop/mapred/SequenceFileRecordReader.java
MODIFY /hadoop/core/trunk/src/java/org/apache/hadoop/mapred/lib/MultipleTextOutputFormat.java
MODIFY /hadoop/core/trunk/src/java/org/apache/hadoop/mapred/lib/MultipleSequenceFileOutputFormat.java
MODIFY /hadoop/core/trunk/src/java/org/apache/hadoop/mapred/SequenceFileInputFilter.java

Owen O'Malley added a comment - 06/Jun/08 10:24 PM
I just committed this, after fixing to reflect trunk changes. Thanks, Tom!

Owen O'Malley made changes - 06/Jun/08 10:24 PM
Resolution Fixed [ 1 ]
Status Patch Available [ 10002 ] Resolved [ 5 ]
Hadoop Flags [Reviewed]
Repository Revision Date User Message
ASF #665778 Mon Jun 09 17:07:37 UTC 2008 omalley HADOOP-3413. I missed the test file.
Files Changed
ADD /hadoop/core/trunk/src/test/org/apache/hadoop/io/TestSequenceFileSerialization.java

Repository Revision Date User Message
ASF #665791 Mon Jun 09 17:24:17 UTC 2008 omalley Adding missed test file for HADOOP-3413.
Files Changed
ADD /hadoop/core/branches/branch-0.18/src/test/org/apache/hadoop/io/TestSequenceFileSerialization.java (from /hadoop/core/trunk/src/test/org/apache/hadoop/io/TestSequenceFileSerialization.java)

Hudson added a comment - 16/Jun/08 10:38 PM

Nigel Daley made changes - 22/Aug/08 07:50 PM
Status Resolved [ 5 ] Closed [ 6 ]