Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.18.0
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed
    • Created SequenceFileAsBinaryOutputFormat to write raw bytes as keys and values to a SequenceFile.

    Description

      Add an OutputFormat to write raw bytes as keys and values to a SequenceFile.

      In C++ Pipes, we're using SequenceFileAsBinaryInputFormat to read SequenceFiles.
      However, we currently don't have a way to write a SequenceFile efficiently without going through extra (de)serializations.

      I'd like to store the correct class names for keys/values but use BytesWritable to write
      (so that downstream Java or Pig code can still read this SequenceFile).
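      The extra (de)serialization being avoided can be sketched in plain Java (hypothetical `Record` stand-in; the real code path uses Hadoop's Writable machinery): serialize each record once to raw bytes, then hand those bytes straight to the file writer instead of deserializing and re-serializing.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical stand-in record type; not Hadoop's actual Writable classes.
class Record {
    final int id;
    Record(int id) { this.id = id; }
    void write(DataOutput out) throws IOException { out.writeInt(id); }
}

public class RawBytesSketch {
    // Serialize the record once; downstream stages pass these bytes
    // through unchanged instead of round-tripping through objects.
    static byte[] toRawBytes(Record r) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        r.write(new DataOutputStream(buf));
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] raw = toRawBytes(new Record(42));
        // One int serializes to 4 bytes; these could be written as-is.
        System.out.println(raw.length);
    }
}
```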

      Attachments

        1. HADOOP-3460-part3.patch
          18 kB
          Koji Noguchi
        2. HADOOP-3460-part2.patch
          16 kB
          Koji Noguchi
        3. HADOOP-3460-part1.patch
          12 kB
          Koji Noguchi

        Activity

          knoguchi Koji Noguchi added a comment -

          First try.

          cdouglas Christopher Douglas added a comment -

          This looks great; just a few points:

          • Different properties for the output key/value classes aren't necessary; you can use the existing methods, like JobConf::getOutputKeyClass.
          • The generic signature on the RecordWriter can be <BytesWritable,BytesWritable> if the signature on SeqFileOF were correct:
            -public class SequenceFileOutputFormat
            -extends FileOutputFormat<WritableComparable, Writable> {
            +public class SequenceFileOutputFormat<K extends WritableComparable,
            +                                      V extends Writable>
            +    extends FileOutputFormat<K,V> {
            

            Permitting SeqFABOF:

            public class SequenceFileAsBinaryOutputFormat
                extends SequenceFileOutputFormat<BytesWritable,BytesWritable> {
            

            This generates a warning in MultipleSequenceFileOutputFormat, but it's spurious and can be suppressed.

          • Since record compression is not supported, it might be worthwhile to override OutputFormat::checkOutputSpecs and throw if it's attempted
          • This should be in o.a.h.mapred.lib rather than o.a.h.mapred
          • Keeping a WritableValueBytes instance around (and adding a reset method) might be useful, so a new one isn't created for each write.
          • The IllegalArgumentException in WritableValueBytes should probably be an UnsupportedOperationException
          • WritableValueBytes should be a static inner class
          • The indentation on the anonymous RecordWriter::close should be consistent with the standards
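          The generics change suggested in the second bullet can be illustrated with stripped-down stand-ins (hypothetical minimal types, not the real Hadoop classes): once the base class is parameterized, the binary variant can pin both type arguments to BytesWritable.

```java
// Hypothetical stand-ins for the Hadoop types discussed above.
interface Writable {}
interface WritableComparable extends Writable {}
class BytesWritable implements WritableComparable {}

// Base output format parameterized over key/value types, as the
// review suggests for SequenceFileOutputFormat.
abstract class FileOutputFormat<K, V> {}

class SequenceFileOutputFormat<K extends WritableComparable, V extends Writable>
    extends FileOutputFormat<K, V> {}

// The binary variant can now fix both parameters to BytesWritable.
class SequenceFileAsBinaryOutputFormat
    extends SequenceFileOutputFormat<BytesWritable, BytesWritable> {}

public class GenericsSketch {
    public static void main(String[] args) {
        SequenceFileOutputFormat<BytesWritable, BytesWritable> of =
            new SequenceFileAsBinaryOutputFormat();
        // The subclass is still a FileOutputFormat through the chain.
        System.out.println(of instanceof FileOutputFormat);
    }
}
```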
          knoguchi Koji Noguchi added a comment -

          Chris, thanks for the review.

          * Different properties for the output key/value classes aren't necessary; you can use the existing methods, like JobConf::getOutputKeyClass.

          The reason I did it this way is that I want to use this output format with C++ Pipes.

          (1) c++-reducer ---> (2) Java-PipesReducer ---> (3) collector ---> (4) SequenceFile(AsBinary)...
          

          And (2) mapred/pipes/PipesReducer calls job.getOutputKeyClass() and job.getOutputValueClass(),
          but I want those outputs to be BytesWritable and not the key/value classes of the SequenceFile.

          How about this: just as the map output key class defaults to the output key class, we'll
          fall back to the output key class when SequenceFileOutputKeyClass is not defined in the config.
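          The proposed fallback can be sketched with java.util.Properties (the property names here are hypothetical placeholders, not the actual config keys):

```java
import java.util.Properties;

// Sketch of the proposed default: use the SequenceFile-specific key
// class if set, otherwise fall back to the job's output key class,
// mirroring how the map output key class defaults.
public class FallbackSketch {
    static String getSequenceFileOutputKeyClass(Properties conf) {
        return conf.getProperty(
            "mapred.seqbinary.output.key.class",          // specific setting
            conf.getProperty("mapred.output.key.class")); // fallback default
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty("mapred.output.key.class", "org.example.Text");
        // Specific key not set: falls back to the output key class.
        System.out.println(getSequenceFileOutputKeyClass(conf));
        conf.setProperty("mapred.seqbinary.output.key.class",
                         "org.apache.hadoop.io.BytesWritable");
        // Specific key set: it wins.
        System.out.println(getSequenceFileOutputKeyClass(conf));
    }
}
```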

          * The generic signature on the RecordWriter can be <BytesWritable,BytesWritable> if the signature on SeqFileOF were correct:

          Done. Modified SequenceFile.java. Added @SuppressWarnings("unchecked") for MultipleSequenceFileOutputFormat.getBaseRecordWriter.

          * Since record compression is not supported, it might be worthwhile to override OutputFormat::checkOutputSpecs and throw if it's attempted

          Done. Test added.
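          The check being described follows a simple validate-and-throw pattern; a minimal stand-alone sketch (hypothetical types, not the actual Hadoop JobConf API):

```java
import java.io.IOException;

// Hypothetical sketch: reject RECORD compression up front in
// checkOutputSpecs, before any task runs.
public class CheckSpecsSketch {
    enum CompressionType { NONE, RECORD, BLOCK }

    static void checkOutputSpecs(CompressionType type) throws IOException {
        if (type == CompressionType.RECORD) {
            throw new IOException(
                "RECORD compression is not supported for binary output");
        }
    }

    public static void main(String[] args) {
        try {
            checkOutputSpecs(CompressionType.RECORD);
            System.out.println("no exception");
        } catch (IOException e) {
            System.out.println("rejected RECORD compression");
        }
    }
}
```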

          * This should be in o.a.h.mapred.lib rather than o.a.h.mapred

          Yes, except that SequenceFileAsBinaryInputFormat is in o.a.h.mapred.
          For now, I'll leave this in o.a.h.mapred, and we can create a new Jira to move both of them to o.a.h.mapred.lib.

          * Keeping a WritableValueBytes instance around (and adding a reset method) might be useful, so a new one isn't created for each write.

          Done. (Not sure if I did it correctly.)
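          The reuse-with-reset pattern from the review can be sketched in isolation (hypothetical wrapper class, not the real WritableValueBytes):

```java
// Hypothetical sketch of keeping a single wrapper instance around and
// resetting it per record, so no new object is allocated for each write.
public class ReuseSketch {
    static class ValueBytesWrapper {
        private byte[] bytes;
        void reset(byte[] newBytes) { this.bytes = newBytes; }
        int length() { return bytes.length; }
    }

    public static void main(String[] args) {
        ValueBytesWrapper wrapper = new ValueBytesWrapper(); // allocated once
        int total = 0;
        for (int i = 1; i <= 3; i++) {
            wrapper.reset(new byte[i]); // point at the next record's bytes
            total += wrapper.length();
        }
        System.out.println(total); // 1 + 2 + 3
    }
}
```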

          * The IllegalArgumentException in WritableValueBytes should probably be an UnsupportedOperationException

          Done.

          * WritableValueBytes should be a static inner class

          Done.

          * The indentation on the anonymous RecordWriter::close should be consistent with the standards

          Done.


          cdouglas Christopher Douglas added a comment -

          Just a few quick changes and I think this is ready to commit:

          1. The testcase: doesn't need a main method, you might want to break up the check for forbidding record compression into a separate test, and the call to JobConf::setInputPath is generating a warning (replace with FileInputFormat::addInputPath)
          2. WritableValueBytes::writeCompressedBytes no longer throws IllegalArgumentException, so that can be removed from its signature
          3. SeqFABOF::checkOutputSpecs doesn't need to list InvalidJobConfException
          knoguchi Koji Noguchi added a comment -

          1. The testcase: doesn't need a main method, you might want to break up the check for forbidding record compression into a separate test,

          Separated the test into three: testbinary, testSequenceOutputClassDefaultsToMapRedOutputClass, and testcheckOutputSpecsForbidRecordCompression.

          Also, I had a bug in the test: checkOutputSpecs was throwing an exception because the output path was not set, not because RECORD compression was set.
          Fixed it.

          and the call to JobConf::setInputPath is generating a warning (replace with FileInputFormat::addInputPath)

          Ah. I should have compiled with -Djavac.args="-Xlint -Xmaxwarns 1000".
          Done.

          2. WritableValueBytes::writeCompressedBytes no longer throws IllegalArgumentException, so that can be removed from its signature

          I left it in, since the original SequenceFile.ValueBytes has the signature:

           
              public void writeCompressedBytes(DataOutputStream outStream) 
                throws IllegalArgumentException, IOException;
          

          Should I still take it out?
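          For context on the question: IllegalArgumentException is an unchecked exception, so Java lets an overriding or implementing method drop it from the throws clause either way. A minimal demonstration (stand-in interface mirroring the shape of the signature above):

```java
import java.io.DataOutputStream;
import java.io.IOException;

// Shows that an implementation may omit an unchecked exception from
// its throws clause: IllegalArgumentException is a RuntimeException,
// so declaring it is optional in both the interface and the override.
public class ThrowsSketch {
    interface ValueBytes {
        void writeCompressedBytes(DataOutputStream out)
            throws IllegalArgumentException, IOException;
    }

    static class WritableValueBytes implements ValueBytes {
        @Override
        public void writeCompressedBytes(DataOutputStream out)
            throws IOException { // unchecked exception omitted: still compiles
            throw new UnsupportedOperationException(
                "compressed bytes not supported");
        }
    }

    public static void main(String[] args) {
        try {
            new WritableValueBytes().writeCompressedBytes(null);
        } catch (UnsupportedOperationException e) {
            System.out.println("unsupported");
        } catch (IOException e) {
            System.out.println("io");
        }
    }
}
```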

          3. SeqFABOF::checkOutputSpecs doesn't need to list InvalidJobConfException

          Done.

          cdouglas Christopher Douglas added a comment -

               [exec] +1 overall.
          
               [exec]     +1 @author.  The patch does not contain any @author tags.
          
               [exec]     +1 tests included.  The patch appears to include 3 new or modified tests.
          
               [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
          
               [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
          
               [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
          

          I just committed this. Thanks, Koji

          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12383575/HADOOP-3460-part3.patch
          against trunk revision 664041.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2608/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2608/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2608/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2608/console

          This message is automatically generated.


          People

            Assignee: knoguchi Koji Noguchi
            Reporter: knoguchi Koji Noguchi
            Votes: 0
            Watchers: 3
