Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.18.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Created SequenceFileAsBinaryOutputFormat to write raw bytes as keys and values to a SequenceFile.

      Description

      Add an OutputFormat to write raw bytes as keys and values to a SequenceFile.

      In C++ Pipes, we're using SequenceFileAsBinaryInputFormat to read SequenceFiles.
      However, we currently don't have a way to write a SequenceFile efficiently without going through extra (de)serializations.

      I'd like to store the correct class names for keys/values but use BytesWritable for the writes
      (so that downstream Java or Pig code can still read this SequenceFile).
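The goal described above — record the logical key/value class names while writing the records themselves as opaque bytes — can be sketched without Hadoop at all. The following is a self-contained illustration of the idea using a hypothetical mini-format (not the real SequenceFile layout): a class-name header followed by length-prefixed raw records.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical sketch, NOT the SequenceFile on-disk format: the header
// records the declared key/value class names for downstream readers, while
// each record is written as raw length-prefixed bytes with no
// (de)serialization on the write path.
class RawRecordFileSketch {
  static byte[] write(String keyClass, String valueClass,
                      byte[][] keys, byte[][] values) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bytes);
    out.writeUTF(keyClass);    // logical key class, e.g. for Java/Pig readers
    out.writeUTF(valueClass);  // logical value class
    out.writeInt(keys.length);
    for (int i = 0; i < keys.length; i++) {
      out.writeInt(keys[i].length);    // length-prefixed raw key bytes
      out.write(keys[i]);
      out.writeInt(values[i].length);  // length-prefixed raw value bytes
      out.write(values[i]);
    }
    out.flush();
    return bytes.toByteArray();
  }

  static String readKeyClass(byte[] data) throws IOException {
    return new DataInputStream(new ByteArrayInputStream(data)).readUTF();
  }
}
```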

      1. HADOOP-3460-part1.patch
        12 kB
        Koji Noguchi
      2. HADOOP-3460-part2.patch
        16 kB
        Koji Noguchi
      3. HADOOP-3460-part3.patch
        18 kB
        Koji Noguchi

        Activity

        Koji Noguchi created issue -
        Koji Noguchi added a comment -

        First try.

        Koji Noguchi made changes -
        Field Original Value New Value
        Attachment HADOOP-3460-part1.patch [ 12383339 ]
        Chris Douglas added a comment -

        This looks great; just a few points:

        • Different properties for the output key/value classes aren't necessary; you can use the existing methods, like JobConf::getOutputKeyClass.
        • The generic signature on the RecordWriter can be <BytesWritable,BytesWritable> if the signature on SeqFileOF were correct:
          -public class SequenceFileOutputFormat
          -extends FileOutputFormat<WritableComparable, Writable> {
          +public class SequenceFileOutputFormat<K extends WritableComparable,
          +                                      V extends Writable>
          +    extends FileOutputFormat<K,V> {
          

          Permitting SeqFABOF:

          public class SequenceFileAsBinaryOutputFormat
              extends SequenceFileOutputFormat<BytesWritable,BytesWritable> {
          

          This generates a warning in MultipleSequenceFileOutputFormat, but it's spurious and can be suppressed.

        • Since record compression is not supported, it might be worthwhile to override OutputFormat::checkOutputSpecs and throw if it's attempted
        • This should be in o.a.h.mapred.lib rather than o.a.h.mapred
        • Keeping a WritableValueBytes instance around (and adding a reset method) might be useful, so a new one isn't created for each write.
        • The IllegalArgumentException in WritableValueBytes should probably be an UnsupportedOperationException
        • WritableValueBytes should be a static inner class
        • The indentation on the anonymous RecordWriter::close should be consistent with the standards
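The instance-reuse point in the list above (keep one WritableValueBytes around and add a reset method) is a standard allocation-avoidance pattern. A self-contained sketch with a hypothetical analogue class — the real WritableValueBytes implements SequenceFile.ValueBytes, and its exact API may differ:

```java
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical analogue of WritableValueBytes: the writer keeps one instance
// and re-points it at each value via reset(), so no per-record garbage is
// created. Record compression is unsupported, per the review discussion.
class ReusableValueBytes {
  private byte[] buf;
  private int len;

  void reset(byte[] value, int length) {  // re-point at the next value
    this.buf = value;
    this.len = length;
  }

  int getSize() {
    return len;
  }

  void writeUncompressedBytes(DataOutputStream out) throws IOException {
    out.write(buf, 0, len);
  }

  void writeCompressedBytes(DataOutputStream out) {
    throw new UnsupportedOperationException("record compression not supported");
  }
}
```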
        Chris Douglas made changes -
        Assignee Koji Noguchi [ knoguchi ]
        Koji Noguchi added a comment -

        Chris, thanks for the review.

        * Different properties for the output key/value classes aren't necessary; you can use the existing methods, like JobConf::getOutputKeyClass.

        The reason I did it this way is that I want to use this output format with C++ Pipes.

        (1) c++-reducer ---> (2) Java-PipesReducer ---> (3) collector ---> (4) SequenceFile(AsBinary)...
        

        And (2) mapred/pipes/PipesReducer calls job.getOutputKeyClass() and job.getOutputValueClass(),
        but I want those outputs to be BytesWritable and not the key/value classes of the SequenceFile.

        How about this: just as the map output key class uses the output key class as its default, we'll use the
        output key class if SequenceFileOutputKeyClass is not defined in the config.
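The fallback rule proposed here — use the job's output key class whenever a SequenceFile-specific class isn't configured, mirroring how the map output key class already defaults — reduces to a small lookup. A sketch with a plain map standing in for JobConf; the property names are illustrative, not necessarily the ones the patch uses:

```java
import java.util.Map;

// Sketch of the proposed defaulting rule. A Map stands in for JobConf, and
// the property names here are hypothetical stand-ins.
class OutputClassDefaults {
  static String getSequenceFileOutputKeyClass(Map<String, String> conf) {
    // Prefer the SequenceFile-specific setting; fall back to the job's
    // output key class when it isn't configured.
    String cls = conf.get("seqfile.output.key.class");
    return cls != null ? cls : conf.get("mapred.output.key.class");
  }
}
```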

        * The generic signature on the RecordWriter can be <BytesWritable,BytesWritable> if the signature on SeqFileOF were correct:

        Done. Modified SequenceFile.java. Added @SuppressWarnings("unchecked") for MultipleSequenceFileOutputFormat.getBaseRecordWriter.

        * Since record compression is not supported, it might be worthwhile to override OutputFormat::checkOutputSpecs and throw if it's attempted

        Done. Test added.

        * This should be in o.a.h.mapred.lib rather than o.a.h.mapred

        Yes, except that SequenceFileAsBinaryInputFormat is in o.a.h.mapred.
        For now, I'll leave this in o.a.h.mapred, and we can create a new Jira to move both of them to o.a.h.mapred.lib.

        * Keeping a WritableValueBytes instance around (and adding a reset method) might be useful, so a new one isn't created for each write.

        Done. (Not sure if I did it correctly.)

        * The IllegalArgumentException in WritableValueBytes should probably be an UnsupportedOperationException

        Done.

        * WritableValueBytes should be a static inner class

        Done.

        * The indentation on the anonymous RecordWriter::close should be consistent with the standards

        Done
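The checkOutputSpecs guard discussed in this exchange — reject RECORD compression up front rather than fail mid-job — comes down to a small validation. A sketch where the enum and exception type are stand-ins for Hadoop's SequenceFile.CompressionType and InvalidJobConfException:

```java
// Sketch of an up-front output-spec check: fail fast if RECORD compression
// is requested, since the binary output format supports only NONE and BLOCK.
// The enum and exception choice are illustrative stand-ins.
class OutputSpecCheck {
  enum CompressionType { NONE, RECORD, BLOCK }

  static void checkOutputSpecs(CompressionType type) {
    if (type == CompressionType.RECORD) {
      throw new IllegalArgumentException(
          "SequenceFileAsBinaryOutputFormat does not support RECORD compression");
    }
  }
}
```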

        Koji Noguchi made changes -
        Attachment HADOOP-3460-part2.patch [ 12383525 ]
        Chris Douglas added a comment -

        Just a few quick changes and I think this is ready to commit:

        1. The testcase: doesn't need a main method, you might want to break up the check for forbidding record compression into a separate test, and the call to JobConf::setInputPath is generating a warning (replace with FileInputFormat::addInputPath)
        2. WritableValueBytes::writeCompressedBytes no longer throws IllegalArgumentException, so that can be removed from its signature
        3. SeqFABOF::checkOutputSpecs doesn't need to list InvalidJobConfException
        Koji Noguchi added a comment -

        1. The testcase: doesn't need a main method, you might want to break up the check for forbidding record compression into a separate test,

        Separated the test into three: testbinary, testSequenceOutputClassDefaultsToMapRedOutputClass, and testcheckOutputSpecsForbidRecordCompression.

        Also, I had a bug in the test such that checkOutputSpecs was throwing an exception because the output path was not set, not because RECORD compression was set.
        Fixed it.

        and the call to JobConf::setInputPath is generating a warning (replace with FileInputFormat::addInputPath)

        Ah. I should have compiled with "-Djavac.args="-Xlint -Xmaxwarns 1000".
        Done.

        2. WritableValueBytes::writeCompressedBytes no longer throws IllegalArgumentException, so that can be removed from its signature

        I left it in since the original SequenceFile.ValueBytes has a signature

         
            public void writeCompressedBytes(DataOutputStream outStream) 
              throws IllegalArgumentException, IOException;
        

        Should I still take it out?

        3. SeqFABOF::checkOutputSpecs doesn't need to list InvalidJobConfException

        Done.

        Koji Noguchi made changes -
        Attachment HADOOP-3460-part3.patch [ 12383575 ]
        Chris Douglas made changes -
        Fix Version/s 0.18.0 [ 12312972 ]
        Status Open [ 1 ] Patch Available [ 10002 ]
        Hadoop Flags [Reviewed]
        Chris Douglas added a comment -
             [exec] +1 overall.  
        
             [exec]     +1 @author.  The patch does not contain any @author tags.
        
             [exec]     +1 tests included.  The patch appears to include 3 new or modified tests.
        
             [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
        
             [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
        
             [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
        

        I just committed this. Thanks, Koji

        Chris Douglas made changes -
        Resolution Fixed [ 1 ]
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12383575/HADOOP-3460-part3.patch
        against trunk revision 664041.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2608/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2608/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2608/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2608/console

        This message is automatically generated.

        Devaraj Das made changes -
        Release Note Adds an OutputFormat, SequenceFileAsBinaryOutputFormat, to write raw bytes as keys and values to a SequenceFile.
        Robert Chansler made changes -
        Release Note Adds an OutputFormat, SequenceFileAsBinaryOutputFormat, to write raw bytes as keys and values to a SequenceFile. Created SequenceFileAsBinaryOutputFormat to write raw bytes as keys and values to a SequenceFile.
        Nigel Daley made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Owen O'Malley made changes -
        Component/s mapred [ 12310690 ]
        Transitions:
        • Open → Patch Available: 8d 21h 43m, 1 execution; last by Chris Douglas, 06/Jun/08 19:02
        • Patch Available → Resolved: 16m 49s, 1 execution; last by Chris Douglas, 06/Jun/08 19:19
        • Resolved → Closed: 77d 1h 31m, 1 execution; last by Nigel Daley, 22/Aug/08 20:50

          People

          • Assignee:
            Koji Noguchi
            Reporter:
            Koji Noguchi
          • Votes:
            0
          • Watchers:
            3
