Pig
  1. Pig
  2. PIG-2114

Enhancements to PIG HBaseStorage Load & Store Func with extra scan configurations

    Details

    • Type: New Feature New Feature
    • Status: Open
    • Priority: Minor Minor
    • Resolution: Unresolved
    • Affects Version/s: 0.9.0
    • Fix Version/s: None
    • Component/s: impl
    • Labels:

      Description

      • Configure the rowkey prefixes filter for the scan (Hariprasad Kuppuswwamy)
      • Added ability to omit nulls when dealing with hbase storage (Greg Bowyer)
      • Added ability to specify Put timestamps while insertion (Hariprasad Kuppuswamy)

        Issue Links

          Activity

          Hari created issue -
          Hari made changes -
          Field Original Value New Value
          Attachment 0001-Enhancements-to-Pig-HBase-Load-Storage-function-as-l.patch [ 12481902 ]
          Hari made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Hide
          Olga Natkovich added a comment -

          Delaying till 10 since we are about to spin the release

          Show
          Olga Natkovich added a comment - Delaying till 10 since we are about to spin the release
          Olga Natkovich made changes -
          Fix Version/s 0.10 [ 12316246 ]
          Fix Version/s 0.9.0 [ 12315191 ]
          Hide
          Bill Graham added a comment -

          The HBaseStorage schema returned does not support multiple cell versions, so I don't think the scanMaxVersions flag does anything in this implementation. A unit test might prove me wrong though.

          There was discussion in PIG-1782 about creating an AdvancedHBaseStorage class with a more complex data structure that could support multiple cell versions.

          Show
          Bill Graham added a comment - The HBaseStorage schema returned does not support multiple cell versions, so I don't think the scanMaxVersions flag does anything in this implementation. A unit test might prove me wrong though. There was discussion in PIG-1782 about creating an AdvancedHBaseStorage class with a more complex data structure that could support multiple cell versions.
          Hide
          Dmitriy V. Ryaboy added a comment -

          Thanks for contributing, good stuff in there.

          Comments on the patch:

          • "none existent" should be "non-existent"
          • if you add the line "Object storeValue = t.get;", might as well reuse storeValue in other places in this function rather than keep calling t.get
          • the call setMaxVersions() is different from current behavior. We currently only get the latest version, and don't have provisions for dealing with multiple versions. As Bill pointed out, that also means scanMaxVersions probably doesn't work
          • this really needs tests for every new flag.
          Show
          Dmitriy V. Ryaboy added a comment - Thanks for contributing, good stuff in there. Comments on the patch: "none existent" should be "non-existent" if you add the line "Object storeValue = t.get ;", might as well reuse storeValue in other places in this function rather than keep calling t.get the call setMaxVersions() is different from current behavior. We currently only get the latest version, and don't have provisions for dealing with multiple versions. As Bill pointed out, that also means scanMaxVersions probably doesn't work this really needs tests for every new flag.
          Hide
          Dmitriy V. Ryaboy added a comment -

          canceling patch pending unit tests & other updates.

          Show
          Dmitriy V. Ryaboy added a comment - canceling patch pending unit tests & other updates.
          Dmitriy V. Ryaboy made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Hide
          Hari added a comment -

          Bill/Dmitriy
          Agree, I have a working version with unit test of the multiple version fetch.
          Will fork out a patch sometime this weekend.

          Will that be good ?

          Thanks

          Show
          Hari added a comment - Bill/Dmitriy Agree, I have a working version with unit test of the multiple version fetch. Will fork out a patch sometime this weekend. Will that be good ? Thanks
          Hide
          Dmitriy V. Ryaboy added a comment -

          That'd be great!
          Do note my point about your change to the default behavior w.r.t. multiple versions – it can't go in like that, default has to stay "get the latest".

          Show
          Dmitriy V. Ryaboy added a comment - That'd be great! Do note my point about your change to the default behavior w.r.t. multiple versions – it can't go in like that, default has to stay "get the latest".
          Hide
          Dmitriy V. Ryaboy added a comment -

          Assigning to Hariprasad so he can get proper JIRA credit .

          Show
          Dmitriy V. Ryaboy added a comment - Assigning to Hariprasad so he can get proper JIRA credit .
          Dmitriy V. Ryaboy made changes -
          Assignee Hariprasad Kuppuswamy [ hariprasad kuppuswamy ]
          Hide
          Hari added a comment -

          Okay, It took me little longer to squeeze in sometime for this.

          For the timestamp based row view, I have introduced new flag for the same & this patch also have a small unit test with a test pig script.
          Let me know if this is good.

          Thanks,

          Show
          Hari added a comment - Okay, It took me little longer to squeeze in sometime for this. For the timestamp based row view, I have introduced new flag for the same & this patch also have a small unit test with a test pig script. Let me know if this is good. Thanks,
          Hari made changes -
          Attachment 0001-Enhancements-to-Pig-HBase-Load-Storage-function-as-l.patch [ 12481902 ]
          Hari made changes -
          Attachment Enhancments-to-enable-timestampversion-based-row-scan.patch [ 12487022 ]
          Hide
          Hari added a comment -

          Also, this patch includes changes related to PIG-2115 as it was required for unit test case to run properly

          Show
          Hari added a comment - Also, this patch includes changes related to PIG-2115 as it was required for unit test case to run properly
          Hide
          Matt Henkel added a comment -

          Currently the "scanStartTimestamp" parameter description states that "Record version timestamps must be greater than this value", however it should be "greater than or equal to" as scan's minStamp is inclusive.

          Show
          Matt Henkel added a comment - Currently the "scanStartTimestamp" parameter description states that "Record version timestamps must be greater than this value", however it should be "greater than or equal to" as scan's minStamp is inclusive .
          Hari made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Hide
          Dmitriy V. Ryaboy added a comment -

          Apologies again for taking a while to review.

          Thanks, that looks like a fair bit of work.

          First, just a couple of procedural notes:
          1) make sure the new files don't have @author annotations and do have the apache headers
          2) there is already a TestHBaseStorage. Why add a new one in util?
          3) This is sort of a PITA request, especially as there are plenty of places in the codebase that don't adhere to this practice, but can you make sure to do things like put spaces after commas (as in Map<family,Map<qualifier,Map<timestamp,value>>>) and before opening parens (as in for(Map.Entry valueEntry: ...), wrap lines to a reasonable length, etc?

          My major concern with the patch is as follows.

          In getNext() you inserted a completely new flow that is used if timestamps are used. It bypasses all the existing logic for how results are created, and as far as I can see, does not respect things like projection pushdown. It also makes it so any future work on the hbase loader logic has to happen in two places. Let's not do that. Isn't loading a single-version row just a special case of loading multiple versions (with n = 1)? We should be able to do this in one go.

          There being so much stuff mixed in here, I propose we get the smaller stuff like PIG-2115 in. Some of the things you are doing here are also pretty non-controversial, like omitNulls and prefix filters, we can get those in pretty easily. Let's factor out the multiple versions changes and add them to PIG-1832, leaving this (blessedly unspecifically titled ) ticket to deal with the smaller stuff.

          Show
          Dmitriy V. Ryaboy added a comment - Apologies again for taking a while to review. Thanks, that looks like a fair bit of work. First, just a couple of procedural notes: 1) make sure the new files don't have @author annotations and do have the apache headers 2) there is already a TestHBaseStorage. Why add a new one in util? 3) This is sort of a PITA request, especially as there are plenty of places in the codebase that don't adhere to this practice, but can you make sure to do things like put spaces after commas (as in Map<family,Map<qualifier,Map<timestamp,value>>>) and before opening parens (as in for(Map.Entry valueEntry: ...), wrap lines to a reasonable length, etc? My major concern with the patch is as follows. In getNext() you inserted a completely new flow that is used if timestamps are used. It bypasses all the existing logic for how results are created, and as far as I can see, does not respect things like projection pushdown. It also makes it so any future work on the hbase loader logic has to happen in two places. Let's not do that. Isn't loading a single-version row just a special case of loading multiple versions (with n = 1)? We should be able to do this in one go. There being so much stuff mixed in here, I propose we get the smaller stuff like PIG-2115 in. Some of the things you are doing here are also pretty non-controversial, like omitNulls and prefix filters, we can get those in pretty easily. Let's factor out the multiple versions changes and add them to PIG-1832 , leaving this (blessedly unspecifically titled ) ticket to deal with the smaller stuff.
          Hide
          Alan Gates added a comment -

          Cancelling patch as there has been no response to Dmitry's concerns.

          Unlinking from 0.10.

          Show
          Alan Gates added a comment - Cancelling patch as there has been no response to Dmitry's concerns. Unlinking from 0.10.
          Alan Gates made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Fix Version/s 0.10 [ 12316246 ]
          Hide
          Hari added a comment -

          Apologies for the lack of response for so long !
          This somehow sneaked out of my backlogs and completely missed it.

          Will update a patch that addresses all Dmitriy's concern by this weekend.

          Show
          Hari added a comment - Apologies for the lack of response for so long ! This somehow sneaked out of my backlogs and completely missed it. Will update a patch that addresses all Dmitriy's concern by this weekend.
          Hari made changes -
          Description - Added capability to specify scan based on timestamps (Hariprasad Kuppuswwamy)
          - Ability to specify number of versions to be fetched with current scan (Hariprasad Kuppuswwamy)
          - Configure the rowkey prefixes filter for the scan (Hariprasad Kuppuswwamy)
          - Added ability to omit nulls when dealing with hbase storage (Greg Bowyer)
          - Added ability to specify Put timestamps while insertion (Hariprasad Kuppuswamy)
          - Configure the rowkey prefixes filter for the scan (Hariprasad Kuppuswwamy)
          - Added ability to omit nulls when dealing with hbase storage (Greg Bowyer)
          - Added ability to specify Put timestamps while insertion (Hariprasad Kuppuswamy)
          Hari made changes -
          Attachment Enhancments-to-enable-timestampversion-based-row-scan.patch [ 12487022 ]
          Hari made changes -
          Hide
          Hari added a comment -

          Dmitriy

          "There being so much stuff mixed in here, I propose we get the smaller stuff like PIG-2115 in. Some of the things you are doing here are also pretty non-controversial, like omitNulls and prefix filters, we can get those in pretty easily."

          Agreed. Modified the patch description and created a new patch as you requested.

          "Let's factor out the multiple versions changes and add them to PIG-1832, leaving this (blessedly unspecifically titled ) ticket to deal with the smaller stuff."

          Okay, I will update the patch with larger change there addressing all your concerns shortly.

          Hope this is good.

          Show
          Hari added a comment - Dmitriy "There being so much stuff mixed in here, I propose we get the smaller stuff like PIG-2115 in. Some of the things you are doing here are also pretty non-controversial, like omitNulls and prefix filters, we can get those in pretty easily." Agreed. Modified the patch description and created a new patch as you requested. "Let's factor out the multiple versions changes and add them to PIG-1832 , leaving this (blessedly unspecifically titled ) ticket to deal with the smaller stuff." Okay, I will update the patch with larger change there addressing all your concerns shortly. Hope this is good.
          Hide
          Hari added a comment -

          Dmitriy: Any updates ?

          Show
          Hari added a comment - Dmitriy: Any updates ?
          Hide
          Eric Yang added a comment -

          Puttimestamp option is a concern. The storefunc can only populate one timestamp per invocation. This can overwrite data with the same row key. Is this behavior acceptable? Could you provide a test case?

          Show
          Eric Yang added a comment - Puttimestamp option is a concern. The storefunc can only populate one timestamp per invocation. This can overwrite data with the same row key. Is this behavior acceptable? Could you provide a test case?
          Hide
          Bill Graham added a comment -

          Canceling patch due to unaddressed comments and missing unit tests.

          Show
          Bill Graham added a comment - Canceling patch due to unaddressed comments and missing unit tests.
          Bill Graham made changes -
          Patch Info Patch Available [ 10042 ]
          Guido Serra aka Zeph made changes -
          Link This issue is related too PIG-1832 [ PIG-1832 ]
          Guido Serra aka Zeph made changes -
          Link This issue is related to PIG-2886 [ PIG-2886 ]
          Gavin made changes -
          Link This issue is related to PIG-1832 [ PIG-1832 ]
          Gavin made changes -
          Link This issue is related to PIG-1832 [ PIG-1832 ]

            People

            • Assignee:
              Hari
              Reporter:
              Hari
            • Votes:
              3 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

              • Created:
                Updated:

                Development