Pig
  1. Pig
  2. PIG-911

[Piggybank] SequenceFileLoader

    Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.5.0
    • Component/s: None
    • Labels:
      None

      Description

      The proposed piggybank contribution adds a SequenceFileLoader to the piggybank.

      1. pig_sequencefile.patch
        12 kB
        Dmitriy V. Ryaboy
      2. pig_911.2.patch
        14 kB
        Dmitriy V. Ryaboy

        Activity

        Hide
        Alan Gates added a comment -

        Committed patch. Thanks Dmitry.

        Show
        Alan Gates added a comment - Committed patch. Thanks Dmitry.
        Hide
        Alan Gates added a comment -

        I'm reviewing this patch

        Show
        Alan Gates added a comment - I'm reviewing this patch
        Hide
        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12416830/pig_911.2.patch
        against trunk revision 804406.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/168/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/168/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/168/console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12416830/pig_911.2.patch against trunk revision 804406. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/168/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/168/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/168/console This message is automatically generated.
        Hide
        Dmitriy V. Ryaboy added a comment -

        Concerning making this a StoreFunc, as well – the StoreFunc interface is not very friendly to this.
        All you get in the bind call is the output stream; for LoadFunc, you also get the name of the file (or, presumably, whatever it was the user passed in under the guise of a file name). This means that for the LoadFunc, I was able to use the passed in filename to back into a Path and a FileSystem. I can't do the same for StoreFunc, where the filename is not available – only the output stream is. That means I can't create the appropriate SequenceFile.Writer . Is there a way around this limitation that does not involve requiring special constructor parameters to be used?
        Is it possible to change the StoreFunc api to provide this information, or to make it available through some side channel (MapRedUtils or similar)?

        Show
        Dmitriy V. Ryaboy added a comment - Concerning making this a StoreFunc, as well – the StoreFunc interface is not very friendly to this. All you get in the bind call is the output stream; for LoadFunc, you also get the name of the file (or, presumably, whatever it was the user passed in under the guise of a file name). This means that for the LoadFunc, I was able to use the passed in filename to back into a Path and a FileSystem. I can't do the same for StoreFunc, where the filename is not available – only the output stream is. That means I can't create the appropriate SequenceFile.Writer . Is there a way around this limitation that does not involve requiring special constructor parameters to be used? Is it possible to change the StoreFunc api to provide this information, or to make it available through some side channel (MapRedUtils or similar)?
        Hide
        Dmitriy V. Ryaboy added a comment -

        Addressed Alan's comments.

        Show
        Dmitriy V. Ryaboy added a comment - Addressed Alan's comments.
        Hide
        Dmitriy V. Ryaboy added a comment -

        Alan,
        Thanks for the feedback.

        I'll add the try/catch

        In regards to the UTF8StorageConverter – I think I added that because before that the code broke if you didn't declare a schema at load time (so, a=load 'foo' using SequenceFileLoader() as (a,b) instead of a=load 'foo' using SequenceFileLoader() as (a:chararray, b:double)

        I'll figure out what exactly is going on with that and remove the UTF8StorageConverter

        Will add Store as time allows.

        Show
        Dmitriy V. Ryaboy added a comment - Alan, Thanks for the feedback. I'll add the try/catch In regards to the UTF8StorageConverter – I think I added that because before that the code broke if you didn't declare a schema at load time (so, a=load 'foo' using SequenceFileLoader() as (a,b) instead of a=load 'foo' using SequenceFileLoader() as (a:chararray, b:double) I'll figure out what exactly is going on with that and remove the UTF8StorageConverter Will add Store as time allows.
        Hide
        Alan Gates added a comment -

        Dmitry,

        First this is great. We've had requests to read Sequence files. Being able to write them also would be great.

        A few thoughts:

        1) This should not extend UTF8StorageConverter. This loader will be returning actual data types, not bytes that need to be interpreted. I would think instead that it should implement the bytesToX() methods itself and just throw an exception saying it didn't expect to do any conversion.

        2) The getSampledTuple looks fine if skip is handling getting the stream to the point that reading the next tuple is viable.

        3) In the bindTo call, where you obtain the key and value by reflection, should there be a try/catch block there in case the cast to Writable fails? In the same way, in describe schema you're asking how to suppress warnings from the cast in reader.getKeyClass(). But don't you want to check that what you got really is a writable, since there is no guarantee?

        Show
        Alan Gates added a comment - Dmitry, First this is great. We've had requests to read Sequence files. Being able to write them also would be great. A few thoughts: 1) This should not extend UTF8StorageConverter. This loader will be returning actual data types, not bytes that need to be interpreted. I would think instead that it should implement the bytesToX() methods itself and just throw an exception saying it didn't expect to do any conversion. 2) The getSampledTuple looks fine if skip is handling getting the stream to the point that reading the next tuple is viable. 3) In the bindTo call, where you obtain the key and value by reflection, should there be a try/catch block there in case the cast to Writable fails? In the same way, in describe schema you're asking how to suppress warnings from the cast in reader.getKeyClass(). But don't you want to check that what you got really is a writable, since there is no guarantee?
        Hide
        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12415673/pig_sequencefile.patch
        against trunk revision 801460.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/153/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/153/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/153/console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12415673/pig_sequencefile.patch against trunk revision 801460. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/153/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/153/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/153/console This message is automatically generated.
        Hide
        Dmitriy V. Ryaboy added a comment -

        The attached patch is an initial implementation of a loader for SequenceFiles.

        It works with keys and values of the following types:
        Text, IntWritable, LongWritable, FloatWritable, DoubleWritable, BooleanWritable, ByteWritable

        I would appreciate some comments on how to properly handle errors (casting errors, IO errors, etc).

        Show
        Dmitriy V. Ryaboy added a comment - The attached patch is an initial implementation of a loader for SequenceFiles. It works with keys and values of the following types: Text, IntWritable, LongWritable, FloatWritable, DoubleWritable, BooleanWritable, ByteWritable I would appreciate some comments on how to properly handle errors (casting errors, IO errors, etc).

          People

          • Assignee:
            Dmitriy V. Ryaboy
            Reporter:
            Dmitriy V. Ryaboy
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development