Issue Details (XML | Word | Printable)

Key: HADOOP-3460
Type: New Feature New Feature
Status: Closed Closed
Resolution: Fixed
Priority: Minor Minor
Assignee: Koji Noguchi
Reporter: Koji Noguchi
Votes: 0
Watchers: 3
Operations

If you were logged in you would be able to see more operations.
Hadoop Common

SequenceFileAsBinaryOutputFormat

Created: 28/May/08 08:18 PM   Updated: 08/Jul/09 04:52 PM
Component/s: None
Affects Version/s: None
Fix Version/s: 0.18.0

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works HADOOP-3460-part1.patch 2008-06-03 11:34 PM Koji Noguchi 12 kB
Text File Licensed for inclusion in ASF works HADOOP-3460-part2.patch 2008-06-06 03:35 AM Koji Noguchi 16 kB
Text File Licensed for inclusion in ASF works HADOOP-3460-part3.patch 2008-06-06 05:21 PM Koji Noguchi 18 kB

Hadoop Flags: Reviewed
Release Note: Created SequenceFileAsBinaryOutputFormat to write raw bytes as keys and values to a SequenceFile.
Resolution Date: 06/Jun/08 06:19 PM


 Description  « Hide
Add an OutputFormat to write raw bytes as keys and values to a SequenceFile.

In C++-Pipes, we're using SequenceFileAsBinaryInputFormat to read Sequencefiles.
However, we current don't have a way to write a sequencefile efficiently without going through extra (de)serializations.

I'd like to store the correct classnames for key/values but use BytesWritable to write
(in order for the next java or pig code to be able to read this sequencefile).



 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Koji Noguchi added a comment - 03/Jun/08 11:34 PM
First try.

Chris Douglas added a comment - 04/Jun/08 08:22 PM
This looks great; just a few points:
  • Different properties for the output key/value classes aren't necessary; you can use the existing methods, like JobConf::getOutputKeyClass.
  • The generic signature on the RecordWriter can be <BytesWritable,BytesWritable> if the signature on SeqFileOF were correct:
    -public class SequenceFileOutputFormat
    -extends FileOutputFormat<WritableComparable, Writable> {
    +public class SequenceFileOutputFormat<K extends WritableComparable,
    +                                      V extends Writable>
    +    extends FileOutputFormat<K,V> {
    

    Permitting SeqFABOF:

    public class SequenceFileAsBinaryOutputFormat
        extends SequenceFileOutputFormat<BytesWritable,BytesWritable> {
    

    This generates a warning in MultipleSequenceFileOutputFormat, but it's spurious and can be suppressed.

  • Since record compression is not supported, it might be worthwhile to override OutputFormat::checkOutputSpecs and throw if it's attempted
  • This should be in o.a.h.mapred.lib rather than o.a.h.mapred
  • Keeping a WritableValueBytes instance around (and adding a reset method) might be useful, so a new one isn't created for each write.
  • The IllegalArgumentException in WritableValueBytes should probably be an UnsupportedOperationException
  • WritableValueBytes should be a static inner class
  • The indentation on the anonymous RecordWriter::close should be consistent with the standards

Koji Noguchi added a comment - 06/Jun/08 03:35 AM
Chris, thanks for the review.

* Different properties for the output key/value classes aren't necessary; you can use the existing methods, like JobConf::getOutputKeyClass.

Reason I did it this way is, I want to use this ouputformat with C++Pipes.

(1) c++-reducer ---> (2) Java-PipesReducer ---> (3) collector ---> (4) SequenceFile(AsBinary)...

And (2) mapred/pipes/PipesReducer calls job.getOutputKeyClass() and job.getOutputValueClass(),
but I want those outputs to be BytesWritable and not the key/value classes of the SequenceFile.

How about this. Just like mapoutputkeyclass uses outputkeyclass as the default class, we'll use
outputkeyclass if SequenceFileOutputKeyClass is not being defined in the config.

* The generic signature on the RecordWriter can be <BytesWritable,BytesWritable> if the signature on SeqFileOF were correct:

Done. Modified SequenceFile.java. Added @SuppressWarnings("unchecked") for MultipleSequenceFileOutputFormat.getBaseRecordWriter.

* Since record compression is not supported, it might be worthwhile to override OutputFormat::checkOutputSpecs and throw if it's attempted

Done. Test added.

* This should be in o.a.h.mapred.lib rather than o.a.h.mapred

Yes. Except that SequenceFileAsBinaryInputFormat is in o.a.h.mapred.
For now, I'll leave this to o.a.h.mapred and we can create a new Jira to move both of them to o.a.h.mapred.lib.

* Keeping a WritableValueBytes instance around (and adding a reset method) might be useful, so a new one isn't created for each write.

Done. (Not sure if I did it correctly.)

* The IllegalArgumentException in WritableValueBytes should probably be an UnsupportedOperationException

Done.

* WritableValueBytes should be a static inner class

Done.

* The indentation on the anonymous RecordWriter::close should be consistent with the standards

Done


Chris Douglas added a comment - 06/Jun/08 05:45 AM
Just a few quick changes and I think this is ready to commit:
  1. The testcase: doesn't need a main method, you might want to break up the check for forbidding record compression into a separate test, and the call to JobConf::setInputPath is generating a warning (replace with FileInputFormat::addInputPath)
  2. WritableValueBytes::writeCompressedBytes no longer throws IllegalArgumentException, so that can be removed from its signature
  3. SeqFABOF::checkOutputSpecs doesn't need to list InvalidJobConfException

Koji Noguchi added a comment - 06/Jun/08 05:21 PM

1. The testcase: doesn't need a main method, you might want to break up the check for forbidding record compression into a separate test,

Separted the test into three. testbinary, testSequenceOutputClassDefaultsToMapRedOutputClass, and testcheckOutputSpecsForbidRecordCompression.

Also, I had a bug in the testing such that checkOutputSpecs was throwing an exception because output path was not set and not because RECORD compression was being set.
Fixed it.

and the call to JobConf::setInputPath is generating a warning (replace with FileInputFormat::addInputPath)

Ah. I should have compiled with "-Djavac.args="-Xlint -Xmaxwarns 1000".
Done.

2. WritableValueBytes::writeCompressedBytes no longer throws IllegalArgumentException, so that can be removed from its signature

I left it in since the original SequenceFile.ValueBytes has a signature

 
    public void writeCompressedBytes(DataOutputStream outStream) 
      throws IllegalArgumentException, IOException;

Should I still take it out?

3. SeqFABOF::checkOutputSpecs doesn't need to list InvalidJobConfException

Done.


Chris Douglas added a comment - 06/Jun/08 06:19 PM
     [exec] +1 overall.  

     [exec]     +1 @author.  The patch does not contain any @author tags.

     [exec]     +1 tests included.  The patch appears to include 3 new or modified tests.

     [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.

     [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

     [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.

I just committed this. Thanks, Koji


Hadoop QA added a comment - 06/Jun/08 07:27 PM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12383575/HADOOP-3460-part3.patch
against trunk revision 664041.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 3 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

-1 core tests. The patch failed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2608/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2608/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2608/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2608/console

This message is automatically generated.