Pig / PIG-2494

Improvement to SequenceFileLoader (NullWritable and Delimiter)

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 0.9.1
    • Fix Version/s: None
    • Component/s: piggybank
    • Labels:
    • Environment: All

    • Release Note:
      Should have no compatibility issues
    • Tags:
      SequenceFileLoader NullWritable Delimiter

      Description

      I wanted to add two features to SequenceFileLoader:
      1. A delimiter, so that the loader acts more like PigStorage: if the value is of type Text (chararray), it is split on the delimiter.
      2. The option of the key being a NullWritable. I wanted to be able to process my Hive files in both Hive and Pig, but because my Hive sequence files have a NullWritable key, I could not make this work with the current implementation of SequenceFileLoader.

      My change is attached to this issue.
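
The two behaviours described above can be sketched in plain Java. This is only an illustration under stated assumptions, not the attached change: the class and method names are hypothetical, a `null` key stands in for a NullWritable key, a `String` value stands in for a Text value, and dropping the key field entirely when it is absent is one possible choice (a loader could equally emit a null field instead).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the loader's record-to-fields step; not the attached patch.
final class SequenceRecordSketch {
    private SequenceRecordSketch() {}

    /**
     * Builds the list of fields for one key/value record.
     *
     * key:   null models a NullWritable key and contributes no field (assumption).
     * value: a String models a Text value and is split on the delimiter;
     *        any other non-null value is passed through as a single field.
     */
    static List<Object> toFields(Object key, Object value, char delimiter) {
        List<Object> fields = new ArrayList<>();
        if (key != null) {
            fields.add(key);
        }
        if (value instanceof String) {
            String text = (String) value;
            int start = 0;
            // Manual scan so that empty fields, including trailing ones, are kept.
            for (int i = 0; i < text.length(); i++) {
                if (text.charAt(i) == delimiter) {
                    fields.add(text.substring(start, i));
                    start = i + 1;
                }
            }
            fields.add(text.substring(start));
        } else if (value != null) {
            fields.add(value);
        }
        return fields;
    }
}
```

For example, `SequenceRecordSketch.toFields(null, "a,b,,", ',')` yields four fields (`a`, `b`, and two empty strings), while a non-text value is returned unsplit. For sequence files written with Hive's default text serialization, the delimiter would typically be `'\u0001'` (Ctrl-A) rather than a comma.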

        Activity

        Ted Malaska added a comment -

        Here is my implementation

        Joey Echeverria added a comment -

        Hi Ted,

        Thanks for the contribution! Would you mind formatting your submission as a patch? You can find instructions on how to generate the patch here:

        https://cwiki.apache.org/confluence/display/PIG/HowToContribute

        This will make it easier to review your changes.

        Dmitriy V. Ryaboy added a comment -

        Note that a far more powerful version of a Sequence File Loader is available in Elephant-Bird: https://github.com/kevinweil/elephant-bird

        This is a pretty small patch, though. Good one to practice patch submission on, if someone wanted to post it using the procedure Joey linked to above.

        Ted Malaska added a comment -

        Hey Dmitriy,

        I know it's been a long time, but I'm going to try to finish this issue now.

        I just reviewed the SequenceFileLoader code in elephant-bird, and it looks like the major piece to bring over is the idea of the converter and its ability to transform the raw data and provide a schema for the output format.

        This would add a lot of power to the existing implementation.

        I'll start on this tonight.

        Ted Malaska added a comment -

        So I have four options for how to address this issue.

        1. Update SequenceFileLoader so that it handles NullWritable keys and also handles delimiters like PigStorage.
        2. Everything in option 1, plus extend the sequence loader into a sequence storage so we can also use it to write data out as sequence files.
        3. Bring the elephant-bird implementation over to piggybank and add support for delimiters.
        4. Drop the whole delimiter thing, because we can use TOKENIZE.

        Let me know.


          People

          • Assignee: Unassigned
          • Reporter: Ted Malaska
          • Votes: 1
          • Watchers: 2
