Details

      Description

      One of the nice things about Pig and HBase is that they can be integrated, thanks to the recently committed patch (PIG-1250).

      The documentation has not been updated yet (what exists relates almost entirely to the patch itself). It would be nice to document this feature in detail on the Pig documentation page (e.g., here: http://pig.apache.org/docs/r0.9.1/func.html#load-store-functions).

      1. PIG-2341.5.patch
        9 kB
        Bill Graham
      2. PIG-2341.4.patch
        8 kB
        Jayesh Thakrar
      3. PIG-2341.3.patch
        9 kB
        Jayesh Thakrar
      4. PIG-2341.2.patch
        8 kB
        Bill Graham
      5. PIG-2341.patch
        5 kB
        Jayesh Thakrar

        Activity

        Bill Graham added a comment -

        Committed, thanks Jayesh! This documentation is way overdue, so huge props for jumping on it.

        Bill Graham added a comment -

        Thanks Jayesh for the merge! I think we're all set. Attaching patch 5 which contains some minor tweaks and two main changes:

        • Rebasing the patch onto the base of the Pig repo. You will generally want to submit patches so they can be applied from the base dir.
        • Rolling javadoc bug PIG-3092 into this one.
        Jayesh Thakrar added a comment -

        Merged my changes with Bill's patch.

        Jayesh Thakrar added a comment -

        Some more details added to the function documentation.

        Bill Graham added a comment -

        Attaching a second patch with my comments included. Added a section on using HBaseStorage for loading and added missing options. Will commit if no one has any comments.

        Bill Graham added a comment -

        Jayesh, this patch is great. Thanks for taking this on. Just a few comments:

        • The various options should be listed in the Terms table, instead of in usage.
        • Try to talk about what it does, as opposed to what it can do. For example, "HBaseStorage can store and load data from HBase" should be "HBaseStorage stores and loads data from HBase"
        • When describing the various params, describe what the param does, as opposed to how it does it. For example, "This specifies to the HBase scan method to read rows greater than minKeyVal" should be "Specifies only rows with a rowKey greater than minKeyVal are to be returned".
        • "and a wildcard as a suffix" should be "followed by an asterisk (*)". "using the column family name and a wildcard" should be "using the column family name and an asterisk (i.e., cf:*)"
        • "Columns from multiple column families are specified by seperating each column family and column qualifier pair by a single space." should be "Columns from multiple column families can be returned." No need to specify the space delimiter, since you already have.
        • Likewise, the last two sentences of Usage can be omitted, since you mention above that not all columns must be specified.
        • Should specify that loadKey is false by default, as well as how it inserts an extra field as the first element of the schema, before the columns specified.
        • There are a few more options to describe (see the constructor javadocs in the code on the trunk): delim, ignoreWhitespace, noWAL, minTimestamp, maxTimestamp, timestamp. Note that the "extreme caution" warning in the javadoc is mis-located. It should apply to the noWAL option.
        • We should add some discussion about STORE and how the first field needs to be the rowKey, as well as how maps and scalars are handled. See the Javadoc of the class for a description of this.

        Also, after you upload the next patch (typically named something like PIG-2341_2.patch) you'll want to set the "patch available" flag, which alerts folks that it's ready for review.
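
        For readers following the review points above, here is a minimal Pig Latin sketch of the behavior being documented: a column list that mixes named columns with a column-family asterisk (cf:*), loadKey adding the rowKey as the first field, and STORE treating the first field as the rowKey. The table names (SampleTable, SampleTableCopy), the column families (info, friends), and the key value row100 are hypothetical, and the options shown should be checked against the constructor Javadoc for the Pig version in use.

        ```pig
        -- Hypothetical table and column names, for illustration only.
        -- '-loadKey true' inserts the rowKey as the first field of each tuple
        -- (loadKey is false by default); '-gt row100' returns only rows whose
        -- rowKey is greater than row100.
        raw = LOAD 'hbase://SampleTable'
              USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
                  'info:first_name info:last_name friends:*',
                  '-loadKey true -gt row100')
              AS (id:bytearray, first_name:chararray, last_name:chararray, friends:map[]);

        -- On STORE, the first field (id) is used as the rowKey and is not
        -- listed in the column specification; the remaining fields map to the
        -- listed columns in order, with the map written to the friends family.
        STORE raw INTO 'hbase://SampleTableCopy'
              USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
                  'info:first_name info:last_name friends:*');
        ```

        The column-family asterisk returns a map keyed by column qualifier, which is why the schema declares that field as map[].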

        Jayesh Thakrar added a comment -

        I have attached the patch file for review. This is my first attempt to contribute to Apache, so I am not sure of the protocol...

        Bill Graham added a comment -

        Thanks Jayesh for volunteering! Having HBaseStorage documented on the site is way overdue.

        To document, you'll want to check out Pig from SVN (or git), edit src/docs/src/documentation/content/xdocs/func.xml, build locally to check the generated HTML, and then submit a patch.

        This wiki explains how to build the documentation:
        https://cwiki.apache.org/confluence/display/PIG/HowToDocument

        And this one is a more general doc to get set up:
        https://cwiki.apache.org/confluence/display/PIG/HowToContribute

        And of course ask questions here or on the list if you have any.

        Jayesh Thakrar added a comment -

        Hi,

        I am using HBaseStorage at work and am very happy with it. I would like to volunteer to take up this task. How can I go about doing it?

        I will greatly appreciate any pointers...

        Thanks,
        Jayesh

        Dmitriy V. Ryaboy added a comment -

        Good call.

        Docs on the API are pretty good, but we should put them into func.html as well.

        http://pig.apache.org/docs/r0.9.1/api/org/apache/pig/backend/hadoop/hbase/HBaseStorage.html

        (the class description is half the documentation; the constructor documents arguments HBaseStorage understands, and is therefore also quite important).


          People

          • Assignee:
            Jayesh Thakrar
            Reporter:
            Mikael Sitruk
          • Votes:
            0
            Watchers:
            4
