Pig
  1. Pig
  2. PIG-911

[Piggybank] SequenceFileLoader

    Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.5.0
    • Component/s: None
    • Labels:
      None

      Description

      The proposed piggybank contribution adds a SequenceFileLoader to the piggybank.

      1. pig_911.2.patch
        14 kB
        Dmitriy V. Ryaboy
      2. pig_sequencefile.patch
        12 kB
        Dmitriy V. Ryaboy

        Activity

        Transition Time In Source Status Execution Times Last Executer Last Execution Date
        Patch Available Patch Available Open Open
        12d 5h 16m 1 Dmitriy V. Ryaboy 18/Aug/09 04:00
        Open Open Patch Available Patch Available
        4m 54s 2 Dmitriy V. Ryaboy 18/Aug/09 04:01
        Patch Available Patch Available Resolved Resolved
        28d 20h 58m 1 Alan Gates 16/Sep/09 00:59
        Resolved Resolved Closed Closed
        189d 22h 15m 1 Alan Gates 24/Mar/10 22:14
        Alan Gates made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Alan Gates made changes -
        Assignee Dmitriy V. Ryaboy [ dvryaboy ]
        Alan Gates made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Fix Version/s 0.5.0 [ 12314213 ]
        Resolution Fixed [ 1 ]
        Hide
        Alan Gates added a comment -

        Committed patch. Thanks Dmitry.

        Show
        Alan Gates added a comment - Committed patch. Thanks Dmitry.
        Hide
        Alan Gates added a comment -

        I'm reviewing this patch

        Show
        Alan Gates added a comment - I'm reviewing this patch
        Hide
        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12416830/pig_911.2.patch
        against trunk revision 804406.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/168/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/168/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/168/console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12416830/pig_911.2.patch against trunk revision 804406. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/168/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/168/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/168/console This message is automatically generated.
        Hide
        Dmitriy V. Ryaboy added a comment -

        Concerning making this a StoreFunc, as well – the StoreFunc interface is not very friendly to this.
        All you get in the bind call is the output stream; for LoadFunc, you also get the name of the file (or, presumably, whatever it was the user passed in under the guise of a file name). This means that for the LoadFunc, I was able to use the passed in filename to back into a Path and a FileSystem. I can't do the same for StoreFunc, where the filename is not available – only the output stream is. That means I can't create the appropriate SequenceFile.Writer . Is there a way around this limitation that does not involve requiring special constructor parameters to be used?
        Is it possible to change the StoreFunc api to provide this information, or to make it available through some side channel (MapRedUtils or similar)?

        Show
        Dmitriy V. Ryaboy added a comment - Concerning making this a StoreFunc, as well – the StoreFunc interface is not very friendly to this. All you get in the bind call is the output stream; for LoadFunc, you also get the name of the file (or, presumably, whatever it was the user passed in under the guise of a file name). This means that for the LoadFunc, I was able to use the passed in filename to back into a Path and a FileSystem. I can't do the same for StoreFunc, where the filename is not available – only the output stream is. That means I can't create the appropriate SequenceFile.Writer . Is there a way around this limitation that does not involve requiring special constructor parameters to be used? Is it possible to change the StoreFunc api to provide this information, or to make it available through some side channel (MapRedUtils or similar)?
        Dmitriy V. Ryaboy made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Dmitriy V. Ryaboy made changes -
        Attachment pig_911.2.patch [ 12416830 ]
        Hide
        Dmitriy V. Ryaboy added a comment -

        Addressed Alan's comments.

        Show
        Dmitriy V. Ryaboy added a comment - Addressed Alan's comments.
        Dmitriy V. Ryaboy made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Hide
        Dmitriy V. Ryaboy added a comment -

        Alan,
        Thanks for the feedback.

        I'll add the try/catch

        In regards to the UTF8StorageConverter – I think I added that because before that the code broke if you didn't declare a schema at load time (so, a=load 'foo' using SequenceFileLoader() as (a,b) instead of a=load 'foo' using SequenceFileLoader() as (a:chararray, b:double)

        I'll figure out what exactly is going on with that and remove the UTF8StorageConverter

        Will add Store as time allows.

        Show
        Dmitriy V. Ryaboy added a comment - Alan, Thanks for the feedback. I'll add the try/catch In regards to the UTF8StorageConverter – I think I added that because before that the code broke if you didn't declare a schema at load time (so, a=load 'foo' using SequenceFileLoader() as (a,b) instead of a=load 'foo' using SequenceFileLoader() as (a:chararray, b:double) I'll figure out what exactly is going on with that and remove the UTF8StorageConverter Will add Store as time allows.
        Hide
        Alan Gates added a comment -

        Dmitry,

        First this is great. We've had requests to read Sequence files. Being able to write them also would be great.

        A few thoughts:

        1) This should not extend UTF8StorageConverter. This loader will be returning actual data types, not bytes that need to be interpreted. I would think instead that it should implement the bytesToX() methods itself and just throw an exception saying it didn't expect to do any conversion.

        2) The getSampledTuple looks fine if skip is handling getting the stream to the point that reading the next tuple is viable.

        3) In the bindTo call, where you obtain the key and value by reflection, should there be a try/catch block there in case the cast to Writable fails? In the same way, in describe schema you're asking how to suppress warnings from the cast in reader.getKeyClass(). But don't you want to check that what you got really is a writable, since there is no guarantee?

        Show
        Alan Gates added a comment - Dmitry, First this is great. We've had requests to read Sequence files. Being able to write them also would be great. A few thoughts: 1) This should not extend UTF8StorageConverter. This loader will be returning actual data types, not bytes that need to be interpreted. I would think instead that it should implement the bytesToX() methods itself and just throw an exception saying it didn't expect to do any conversion. 2) The getSampledTuple looks fine if skip is handling getting the stream to the point that reading the next tuple is viable. 3) In the bindTo call, where you obtain the key and value by reflection, should there be a try/catch block there in case the cast to Writable fails? In the same way, in describe schema you're asking how to suppress warnings from the cast in reader.getKeyClass(). But don't you want to check that what you got really is a writable, since there is no guarantee?
        Hide
        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12415673/pig_sequencefile.patch
        against trunk revision 801460.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/153/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/153/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/153/console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12415673/pig_sequencefile.patch against trunk revision 801460. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/153/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/153/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/153/console This message is automatically generated.
        Dmitriy V. Ryaboy made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Dmitriy V. Ryaboy made changes -
        Field Original Value New Value
        Attachment pig_sequencefile.patch [ 12415673 ]
        Hide
        Dmitriy V. Ryaboy added a comment -

        The attached patch is an initial implementation of a loader for SequenceFiles.

        It works with keys and values of the following types:
        Text, IntWritable, LongWritable, FloatWritable, DoubleWritable, BooleanWritable, ByteWritable

        I would appreciate some comments on how to properly handle errors (casting errors, IO errors, etc).

        Show
        Dmitriy V. Ryaboy added a comment - The attached patch is an initial implementation of a loader for SequenceFiles. It works with keys and values of the following types: Text, IntWritable, LongWritable, FloatWritable, DoubleWritable, BooleanWritable, ByteWritable I would appreciate some comments on how to properly handle errors (casting errors, IO errors, etc).
        Dmitriy V. Ryaboy created issue -

          People

          • Assignee:
            Dmitriy V. Ryaboy
            Reporter:
            Dmitriy V. Ryaboy
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development