Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.18.0
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed
    • Created SequenceFileAsBinaryOutputFormat to write raw bytes as keys and values to a SequenceFile.

    Description

      Add an OutputFormat to write raw bytes as keys and values to a SequenceFile.

      In C++ Pipes, we're using SequenceFileAsBinaryInputFormat to read SequenceFiles.
      However, we currently don't have a way to write a SequenceFile efficiently without going through extra (de)serializations.

      I'd like to store the correct class names for keys/values but use BytesWritable to write
      (so that downstream Java or Pig code can still read this SequenceFile).
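      The extra (de)serialization being avoided can be sketched in plain Java (hypothetical `Record` stand-in; the real code path uses Hadoop's Writable machinery): serialize each record once to raw bytes, then hand those bytes straight to the file writer instead of deserializing and re-serializing.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical stand-in record type; not Hadoop's actual Writable classes.
class Record {
    final int id;
    Record(int id) { this.id = id; }
    void write(DataOutput out) throws IOException { out.writeInt(id); }
}

public class RawBytesSketch {
    // Serialize the record once; downstream stages pass these bytes
    // through unchanged instead of round-tripping through objects.
    static byte[] toRawBytes(Record r) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        r.write(new DataOutputStream(buf));
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] raw = toRawBytes(new Record(42));
        // One int serializes to 4 bytes; these could be written as-is.
        System.out.println(raw.length);
    }
}
```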

      Attachments

        1. HADOOP-3460-part3.patch
          18 kB
          Koji Noguchi
        2. HADOOP-3460-part2.patch
          16 kB
          Koji Noguchi
        3. HADOOP-3460-part1.patch
          12 kB
          Koji Noguchi

        Activity

          knoguchi Koji Noguchi added a comment -

          First try.

          cdouglas Christopher Douglas added a comment -

          This looks great; just a few points:

          • Different properties for the output key/value classes aren't necessary; you can use the existing methods, like JobConf::getOutputKeyClass.
          • The generic signature on the RecordWriter can be <BytesWritable,BytesWritable> if the signature on SeqFileOF were correct:
            -public class SequenceFileOutputFormat
            -extends FileOutputFormat<WritableComparable, Writable> {
            +public class SequenceFileOutputFormat<K extends WritableComparable,
            +                                      V extends Writable>
            +    extends FileOutputFormat<K,V> {
            

            Permitting SeqFABOF:

            public class SequenceFileAsBinaryOutputFormat
                extends SequenceFileOutputFormat<BytesWritable,BytesWritable> {
            

            This generates a warning in MultipleSequenceFileOutputFormat, but it's spurious and can be suppressed.

          • Since record compression is not supported, it might be worthwhile to override OutputFormat::checkOutputSpecs and throw if it's attempted
          • This should be in o.a.h.mapred.lib rather than o.a.h.mapred
          • Keeping a WritableValueBytes instance around (and adding a reset method) might be useful, so a new one isn't created for each write.
          • The IllegalArgumentException in WritableValueBytes should probably be an UnsupportedOperationException
          • WritableValueBytes should be a static inner class
          • The indentation on the anonymous RecordWriter::close should be consistent with the standards
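          The generics change suggested in the second bullet can be illustrated with stripped-down stand-ins (hypothetical minimal types, not the real Hadoop classes): once the base class is parameterized, the binary variant can pin both type arguments to BytesWritable.

```java
// Hypothetical stand-ins for the Hadoop types discussed above.
interface Writable {}
interface WritableComparable extends Writable {}
class BytesWritable implements WritableComparable {}

// Base output format parameterized over key/value types, as the
// review suggests for SequenceFileOutputFormat.
abstract class FileOutputFormat<K, V> {}

class SequenceFileOutputFormat<K extends WritableComparable, V extends Writable>
    extends FileOutputFormat<K, V> {}

// The binary variant can now fix both parameters to BytesWritable.
class SequenceFileAsBinaryOutputFormat
    extends SequenceFileOutputFormat<BytesWritable, BytesWritable> {}

public class GenericsSketch {
    public static void main(String[] args) {
        SequenceFileOutputFormat<BytesWritable, BytesWritable> of =
            new SequenceFileAsBinaryOutputFormat();
        // The subclass is still a FileOutputFormat through the chain.
        System.out.println(of instanceof FileOutputFormat);
    }
}
```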
          knoguchi Koji Noguchi added a comment -

          Chris, thanks for the review.

          * Different properties for the output key/value classes aren't necessary; you can use the existing methods, like JobConf::getOutputKeyClass.

          The reason I did it this way is that I want to use this output format with C++ Pipes.

          (1) c++-reducer ---> (2) Java-PipesReducer ---> (3) collector ---> (4) SequenceFile(AsBinary)...
          

          And (2) mapred/pipes/PipesReducer calls job.getOutputKeyClass() and job.getOutputValueClass(),
          but I want those outputs to be BytesWritable and not the key/value classes of the SequenceFile.

          How about this: just as the map output key class defaults to the output key class, we'll
          fall back to the output key class when SequenceFileOutputKeyClass is not defined in the config.
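          The proposed fallback can be sketched with java.util.Properties (the property names here are hypothetical placeholders, not the actual config keys):

```java
import java.util.Properties;

// Sketch of the proposed default: use the SequenceFile-specific key
// class if set, otherwise fall back to the job's output key class,
// mirroring how the map output key class defaults.
public class FallbackSketch {
    static String getSequenceFileOutputKeyClass(Properties conf) {
        return conf.getProperty(
            "mapred.seqbinary.output.key.class",          // specific setting
            conf.getProperty("mapred.output.key.class")); // fallback default
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty("mapred.output.key.class", "org.example.Text");
        // Specific key not set: falls back to the output key class.
        System.out.println(getSequenceFileOutputKeyClass(conf));
        conf.setProperty("mapred.seqbinary.output.key.class",
                         "org.apache.hadoop.io.BytesWritable");
        // Specific key set: it wins.
        System.out.println(getSequenceFileOutputKeyClass(conf));
    }
}
```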

          * The generic signature on the RecordWriter can be <BytesWritable,BytesWritable> if the signature on SeqFileOF were correct:

          Done. Modified SequenceFile.java. Added @SuppressWarnings("unchecked") for MultipleSequenceFileOutputFormat.getBaseRecordWriter.

          * Since record compression is not supported, it might be worthwhile to override OutputFormat::checkOutputSpecs and throw if it's attempted

          Done. Test added.
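          The check being described follows a simple validate-and-throw pattern; a minimal stand-alone sketch (hypothetical types, not the actual Hadoop JobConf API):

```java
import java.io.IOException;

// Hypothetical sketch: reject RECORD compression up front in
// checkOutputSpecs, before any task runs.
public class CheckSpecsSketch {
    enum CompressionType { NONE, RECORD, BLOCK }

    static void checkOutputSpecs(CompressionType type) throws IOException {
        if (type == CompressionType.RECORD) {
            throw new IOException(
                "RECORD compression is not supported for binary output");
        }
    }

    public static void main(String[] args) {
        try {
            checkOutputSpecs(CompressionType.RECORD);
            System.out.println("no exception");
        } catch (IOException e) {
            System.out.println("rejected RECORD compression");
        }
    }
}
```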

          * This should be in o.a.h.mapred.lib rather than o.a.h.mapred

          Yes, except that SequenceFileAsBinaryInputFormat is in o.a.h.mapred.
          For now, I'll leave this in o.a.h.mapred, and we can create a new Jira to move both of them to o.a.h.mapred.lib.

          * Keeping a WritableValueBytes instance around (and adding a reset method) might be useful, so a new one isn't created for each write.

          Done. (Not sure if I did it correctly.)
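          The reuse-with-reset pattern from the review can be sketched in isolation (hypothetical wrapper class, not the real WritableValueBytes):

```java
// Hypothetical sketch of keeping a single wrapper instance around and
// resetting it per record, so no new object is allocated for each write.
public class ReuseSketch {
    static class ValueBytesWrapper {
        private byte[] bytes;
        void reset(byte[] newBytes) { this.bytes = newBytes; }
        int length() { return bytes.length; }
    }

    public static void main(String[] args) {
        ValueBytesWrapper wrapper = new ValueBytesWrapper(); // allocated once
        int total = 0;
        for (int i = 1; i <= 3; i++) {
            wrapper.reset(new byte[i]); // point at the next record's bytes
            total += wrapper.length();
        }
        System.out.println(total); // 1 + 2 + 3
    }
}
```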

          * The IllegalArgumentException in WritableValueBytes should probably be an UnsupportedOperationException

          Done.

          * WritableValueBytes should be a static inner class

          Done.

          * The indentation on the anonymous RecordWriter::close should be consistent with the standards

          Done.


          cdouglas Christopher Douglas added a comment -

          Just a few quick changes and I think this is ready to commit:

          1. The testcase: doesn't need a main method, you might want to break up the check for forbidding record compression into a separate test, and the call to JobConf::setInputPath is generating a warning (replace with FileInputFormat::addInputPath)
          2. WritableValueBytes::writeCompressedBytes no longer throws IllegalArgumentException, so that can be removed from its signature
          3. SeqFABOF::checkOutputSpecs doesn't need to list InvalidJobConfException
          knoguchi Koji Noguchi added a comment -

          1. The testcase: doesn't need a main method, you might want to break up the check for forbidding record compression into a separate test,

          Separated the test into three: testbinary, testSequenceOutputClassDefaultsToMapRedOutputClass, and testcheckOutputSpecsForbidRecordCompression.

          Also, I had a bug in the test: checkOutputSpecs was throwing an exception because the output path was not set, not because RECORD compression was set.
          Fixed it.

          and the call to JobConf::setInputPath is generating a warning (replace with FileInputFormat::addInputPath)

          Ah. I should have compiled with -Djavac.args="-Xlint -Xmaxwarns 1000".
          Done.

          2. WritableValueBytes::writeCompressedBytes no longer throws IllegalArgumentException, so that can be removed from its signature

          I left it in, since the original SequenceFile.ValueBytes has the signature:

           
              public void writeCompressedBytes(DataOutputStream outStream) 
                throws IllegalArgumentException, IOException;
          

          Should I still take it out?
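          For context on the question: IllegalArgumentException is an unchecked exception, so Java lets an overriding or implementing method drop it from the throws clause either way. A minimal demonstration (stand-in interface mirroring the shape of the signature above):

```java
import java.io.DataOutputStream;
import java.io.IOException;

// Shows that an implementation may omit an unchecked exception from
// its throws clause: IllegalArgumentException is a RuntimeException,
// so declaring it is optional in both the interface and the override.
public class ThrowsSketch {
    interface ValueBytes {
        void writeCompressedBytes(DataOutputStream out)
            throws IllegalArgumentException, IOException;
    }

    static class WritableValueBytes implements ValueBytes {
        @Override
        public void writeCompressedBytes(DataOutputStream out)
            throws IOException { // unchecked exception omitted: still compiles
            throw new UnsupportedOperationException(
                "compressed bytes not supported");
        }
    }

    public static void main(String[] args) {
        try {
            new WritableValueBytes().writeCompressedBytes(null);
        } catch (UnsupportedOperationException e) {
            System.out.println("unsupported");
        } catch (IOException e) {
            System.out.println("io");
        }
    }
}
```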

          3. SeqFABOF::checkOutputSpecs doesn't need to list InvalidJobConfException

          Done.

          cdouglas Christopher Douglas added a comment -

               [exec] +1 overall.
          
               [exec]     +1 @author.  The patch does not contain any @author tags.
          
               [exec]     +1 tests included.  The patch appears to include 3 new or modified tests.
          
               [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
          
               [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
          
               [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
          

          I just committed this. Thanks, Koji

          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12383575/HADOOP-3460-part3.patch
          against trunk revision 664041.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2608/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2608/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2608/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2608/console

          This message is automatically generated.


          People

            Assignee: knoguchi Koji Noguchi
            Reporter: knoguchi Koji Noguchi
            Votes: 0
            Watchers: 3
