Pig
  1. Pig
  2. PIG-1946

HBaseStorage constructor syntax is error prone

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.10.0
    • Component/s: None
    • Labels:
      None
    • Release Note:
      More friendly HBaseStorage constructor syntax.

      Description

      Using HBaseStorage like so seems like a reasonable thing to do, but it will yield unexpected results:

      STORE result INTO 'hbase://foo' USING
       org.apache.pig.backend.hadoop.hbase.HBaseStorage(
       'info:first_name, info:last_name');
      

      The problem us that a column named info:first_name, will be created, with the trailing comma included. I've had numerous developers get tripped up on this issue since everywhere else in Pig variables are separated by commas, so I propose we fix it.

      I propose we trim leading/trailing commas from column names, but I'm open to other ideas.

      Also should we accept column names that are comman-delimited without spaces?

      1. PIG-1946_1.patch
        6 kB
        Bill Graham
      2. PIG-1946_2.patch
        12 kB
        Bill Graham
      3. PIG-1946_3.patch
        14 kB
        Bill Graham

        Activity

        Hide
        Dmitriy V. Ryaboy added a comment -

        I meant <pre>--no-prefix</pre>.

        +1. Will commit to 0.10.

        Show
        Dmitriy V. Ryaboy added a comment - I meant <pre>--no-prefix</pre>. +1. Will commit to 0.10.
        Hide
        Bill Graham added a comment -

        All good suggestions, here's patch #3.

        The -no-patch option doesn't seem to be valid in git merge. I tried -no-prefix but the format still looks gitish. Any other suggestions for an SVN-friendly git patch? I've made manual changes in the past when applying git patches to SVN, but I'd love to find a way to not require that.

        Show
        Bill Graham added a comment - All good suggestions, here's patch #3. The - no-patch option doesn't seem to be valid in git merge. I tried -no-prefix but the format still looks gitish. Any other suggestions for an SVN-friendly git patch? I've made manual changes in the past when applying git patches to SVN, but I'd love to find a way to not require that.
        Hide
        Dmitriy V. Ryaboy added a comment -

        Looks good. The test passed. The purist in me wants to make you do "true".equalsIgnoreCase(value) instead of (value == null || !value.equalsIgnoreCase("true")), and to pull out the column parsing behavior into its own function.

        You generated the patch from git – that's not friendly to the automated patch machinery; you have to use

        git diff --no-patch

        to generate legit patches.

        If you have time to make the changes, that would be awesome. If not I will try to fit it in.

        Thanks for the work on this!

        Show
        Dmitriy V. Ryaboy added a comment - Looks good. The test passed. The purist in me wants to make you do "true".equalsIgnoreCase(value) instead of (value == null || !value.equalsIgnoreCase("true")), and to pull out the column parsing behavior into its own function. You generated the patch from git – that's not friendly to the automated patch machinery; you have to use git diff --no-patch to generate legit patches. If you have time to make the changes, that would be awesome. If not I will try to fit it in. Thanks for the work on this!
        Hide
        Bill Graham added a comment -

        Attaching a new patch with the changes suggested. I've created a new unit test for speed, since the param testing for this bug doesn't require the mini cluster to be run.

        Also, I named the delimiter field -delim instead of -d since the former is more consistent with existing param naming in {{HBaseStorage}]. Let me know if you disagree.

        FYI, this patch can also be found on git here: https://github.com/billonahill/pig/tree/pig_1946

        Show
        Bill Graham added a comment - Attaching a new patch with the changes suggested. I've created a new unit test for speed, since the param testing for this bug doesn't require the mini cluster to be run. Also, I named the delimiter field -delim instead of -d since the former is more consistent with existing param naming in {{HBaseStorage}]. Let me know if you disagree. FYI, this patch can also be found on git here: https://github.com/billonahill/pig/tree/pig_1946
        Hide
        Bill Graham added a comment -

        Thanks for changing that state. The patch was reviewed but now I'm working on a new one. I should have something to submit soon.

        Show
        Bill Graham added a comment - Thanks for changing that state. The patch was reviewed but now I'm working on a new one. I should have something to submit soon.
        Hide
        Thejas M Nair added a comment -

        +1 on that last suggestion, sounds good to me. I'll work on the patch.

        Removing the patch-available state as Bill is working on a new patch.
        patch-available state is being used for finding jira's that are ready for review.

        Show
        Thejas M Nair added a comment - +1 on that last suggestion, sounds good to me. I'll work on the patch. Removing the patch-available state as Bill is working on a new patch. patch-available state is being used for finding jira's that are ready for review.
        Hide
        Thejas M Nair added a comment -

        Removing the patch-available state as

        Show
        Thejas M Nair added a comment - Removing the patch-available state as
        Hide
        Bill Graham added a comment -

        +1 on that last suggestion, sounds good to me. I'll work on the patch.

        Show
        Bill Graham added a comment - +1 on that last suggestion, sounds good to me. I'll work on the patch.
        Hide
        Dmitriy V. Ryaboy added a comment -

        Ok so maybe the semantics should be something like -d specifies the delimiter, and the delimiter can always be surrounded by an arbitrary number of spaces (\s) unless -ignoreWhitespace is set to false.

        Show
        Dmitriy V. Ryaboy added a comment - Ok so maybe the semantics should be something like -d specifies the delimiter, and the delimiter can always be surrounded by an arbitrary number of spaces (\s) unless -ignoreWhitespace is set to false.
        Hide
        Bill Graham added a comment -

        The intent of this JIRA is to allow "foo bar", "foo, bar" and "foo,bar" all work in the constructor by default, since that's how the AS clause works and every other list of fields within Pig and, more importantly, that's how users are currently specifying the params in the constructor. Hence, this is the exact problem I think we should fix.

        Show
        Bill Graham added a comment - The intent of this JIRA is to allow "foo bar", "foo, bar" and "foo,bar" all work in the constructor by default, since that's how the AS clause works and every other list of fields within Pig and, more importantly, that's how users are currently specifying the params in the constructor. Hence, this is the exact problem I think we should fix.
        Hide
        Dmitriy V. Ryaboy added a comment -

        my suggestion would make "foo bar" and "foo, bar" both work by default, though "foo,bar" would not. Given the lax rules governing column naming in HBase I now think that's acceptable.

        I think it's confusing to have a -d parameter whose default value of "a comma followed by zero or more spaces, or 1 or more spaces" cannot be supplied as a value in the constructor.

        Show
        Dmitriy V. Ryaboy added a comment - my suggestion would make "foo bar" and "foo, bar" both work by default, though "foo,bar" would not. Given the lax rules governing column naming in HBase I now think that's acceptable. I think it's confusing to have a -d parameter whose default value of "a comma followed by zero or more spaces, or 1 or more spaces" cannot be supplied as a value in the constructor.
        Hide
        Bill Graham added a comment -

        @Dmitriy, the goal of this JIRA is to make the default delimiter any combination of comma and space (not just space), since that's the way Pig works everywhere else. We can achieve that and provide the ability to go outside of that as needed with just one param (-constructorDelimiter) that overrides this behavior with the field of the users choice. This would be option b.) above.

        Show
        Bill Graham added a comment - @Dmitriy, the goal of this JIRA is to make the default delimiter any combination of comma and space (not just space), since that's the way Pig works everywhere else. We can achieve that and provide the ability to go outside of that as needed with just one param (-constructorDelimiter) that overrides this behavior with the field of the users choice. This would be option b.) above.
        Hide
        Dmitriy V. Ryaboy added a comment -

        Spaces are legal:

        hbase(main):011:0> put 'dropme', 'id1', 'columns:foo bar', 'value'
        0 row(s) in 0.0060 seconds

        hbase(main):012:0> scan 'dropme'
        ROW COLUMN+CELL
        id1 column=columns:foo bar, timestamp=1307395400107, value=value

        So I think that means 2 flags are needed: -d (delimiter, space by default) and -ignoreTrailingCommas (true by default, turn off to force trailing commas to be treated as part of the column name)

        Show
        Dmitriy V. Ryaboy added a comment - Spaces are legal: hbase(main):011:0> put 'dropme', 'id1', 'columns:foo bar', 'value' 0 row(s) in 0.0060 seconds hbase(main):012:0> scan 'dropme' ROW COLUMN+CELL id1 column=columns:foo bar, timestamp=1307395400107, value=value So I think that means 2 flags are needed: -d (delimiter, space by default) and -ignoreTrailingCommas (true by default, turn off to force trailing commas to be treated as part of the column name)
        Hide
        Eric Yang added a comment -

        a) should be the sensible thing to do. b) is unlikely to mix well with rest of pig syntax anyways.

        Show
        Eric Yang added a comment - a) should be the sensible thing to do. b) is unlikely to mix well with rest of pig syntax anyways.
        Hide
        Bill Graham added a comment -

        I was thinking any combination of comma and space would always be used to delimit by default. If -constructorDelimiter is specified we can either say: a.) same logic but sub comma out for -constructorDelimiter; or b.) Only use -constructorDelimiter as the delimiter in which case spaces would be part of the column descriptor if found.

        I lean towards a.) but if allowing people to use spaces in column names is valid, then we should do b.).

        Thoughts?

        Show
        Bill Graham added a comment - I was thinking any combination of comma and space would always be used to delimit by default. If -constructorDelimiter is specified we can either say: a.) same logic but sub comma out for -constructorDelimiter; or b.) Only use -constructorDelimiter as the delimiter in which case spaces would be part of the column descriptor if found. I lean towards a.) but if allowing people to use spaces in column names is valid, then we should do b.). Thoughts?
        Hide
        Eric Yang added a comment -

        Take it one step further, what does -constructorDelimiter look like with space delimiter?

        Show
        Eric Yang added a comment - Take it one step further, what does -constructorDelimiter look like with space delimiter?
        Hide
        Bill Graham added a comment -

        How about we say that commas are the default delimiter (with or without spaces), but this can be overriden with the "-constructorDelimiter" param. That would allow those pesky comma users to specify some other character as their delimiter. Kinda like sed.

        Show
        Bill Graham added a comment - How about we say that commas are the default delimiter (with or without spaces), but this can be overriden with the "-constructorDelimiter" param. That would allow those pesky comma users to specify some other character as their delimiter. Kinda like sed.
        Hide
        Dmitriy V. Ryaboy added a comment -

        Coming back to this after a long break..
        We can do the standard backslash escaping (with
        meaning a real backslash) but that will look horrible in Pig code, where you have to escape each backslash for the parser first.

        Maybe an optional "-literalColumns" flag that makes us revert to space-delimited column names?

        I don't really see an elegant way out of this right now.

        Show
        Dmitriy V. Ryaboy added a comment - Coming back to this after a long break.. We can do the standard backslash escaping (with meaning a real backslash) but that will look horrible in Pig code, where you have to escape each backslash for the parser first. Maybe an optional "-literalColumns" flag that makes us revert to space-delimited column names? I don't really see an elegant way out of this right now.
        Hide
        Bill Graham added a comment -

        The column descriptors take anything I can throw at them:

        hbase(main):001:0>  create 't1', {NAME => 'f1', VERSIONS => 5}                     
        0 row(s) in 0.6400 seconds
        
        hbase(main):002:0> put 't1', 'r1', 'f1:!@#$%)(:+_-=\][{}|;:"><,./?`~', 'value'     
        0 row(s) in 0.0660 seconds
        

        I'm also able to create column families with both '/' and '\' in them. Any suggestions for a valid encoding scheme?

        Show
        Bill Graham added a comment - The column descriptors take anything I can throw at them: hbase(main):001:0> create 't1', {NAME => 'f1', VERSIONS => 5} 0 row(s) in 0.6400 seconds hbase(main):002:0> put 't1', 'r1', 'f1:!@#$%)(:+_-=\][{}|;:"><,./?`~', 'value' 0 row(s) in 0.0660 seconds I'm also able to create column families with both '/' and '\' in them. Any suggestions for a valid encoding scheme?
        Hide
        Dmitriy V. Ryaboy added a comment -

        This is for column families: http://hbase.apache.org/xref/org/apache/hadoop/hbase/HColumnDescriptor.html#278
        (no slashes, colons, or ISOControl chars, and no starting with "."). I believe columns are similar.

        Show
        Dmitriy V. Ryaboy added a comment - This is for column families: http://hbase.apache.org/xref/org/apache/hadoop/hbase/HColumnDescriptor.html#278 (no slashes, colons, or ISOControl chars, and no starting with "."). I believe columns are similar.
        Hide
        Bill Graham added a comment -

        No problem. Dmitriy, have you found the list of valid characters for column descriptors? I feel like I've seen it before somewhere but I can't find it. After a quick test, I can write column descriptor with any character I try so I'm at a loss for a valid escaping scheme.

        Show
        Bill Graham added a comment - No problem. Dmitriy, have you found the list of valid characters for column descriptors? I feel like I've seen it before somewhere but I can't find it. After a quick test, I can write column descriptor with any character I try so I'm at a loss for a valid escaping scheme.
        Hide
        Dmitriy V. Ryaboy added a comment -

        Bill, sorry it took me a while to get to this. Looks good, but I just confirmed that commas are valid in column names.. we should add escaping.

        Show
        Dmitriy V. Ryaboy added a comment - Bill, sorry it took me a while to get to this. Looks good, but I just confirmed that commas are valid in column names.. we should add escaping.
        Hide
        Bill Graham added a comment -

        This patch is ready for review. It allows commas and/or spaces to be used as column delimiters.

        Show
        Bill Graham added a comment - This patch is ready for review. It allows commas and/or spaces to be used as column delimiters.
        Hide
        Dmitriy V. Ryaboy added a comment -

        I think this is a good change, the comma thing trips even me up.
        People who put commas in column names get what they deserve . Though perhaps we can allow some escaping (sometimes column names are autogenerated).

        Show
        Dmitriy V. Ryaboy added a comment - I think this is a good change, the comma thing trips even me up. People who put commas in column names get what they deserve . Though perhaps we can allow some escaping (sometimes column names are autogenerated).
        Hide
        Bill Graham added a comment -

        Eric, the syntax you show is the current syntax - the tokens are split on space to get the columns. Hence the issue when there are also commas in the mix. In addition to supporting the current syntax of:

        'info:first_name info:last_name'
        

        I propose we also support these variations:

        'info:first_name info:last_name'
        'info:first_name,info:last_name'
        'info:first_name, info:last_name'
        

        Only downside is that we wouldn't be backward compatible with a user who has column names with commas in them. This seems to me like an odd thing to do, but who knows.

        Anyone think we need to support column names with commas in them?

        Show
        Bill Graham added a comment - Eric, the syntax you show is the current syntax - the tokens are split on space to get the columns. Hence the issue when there are also commas in the mix. In addition to supporting the current syntax of: 'info:first_name info:last_name' I propose we also support these variations: 'info:first_name info:last_name' 'info:first_name,info:last_name' 'info:first_name, info:last_name' Only downside is that we wouldn't be backward compatible with a user who has column names with commas in them. This seems to me like an odd thing to do, but who knows. Anyone think we need to support column names with commas in them?
        Hide
        Eric Yang added a comment -

        An alternative is to modify the syntax like:

        STORE result INTO 'hbase://foo' USING
         org.apache.pig.backend.hadoop.hbase.HBaseStorage(
         'info:first_name info:last_name');
        

        Eliminate comma from the syntax completely. It may have some readability issue with this approach.

        Having that said, the problem can be solved in more user friendly manner by improving the parser to filter prefix and suffix of comma, and comma only cases.

        Maybe this issue can be defined more accurately with: "Improve syntax parsing for HBaseStorage constructor".

        Show
        Eric Yang added a comment - An alternative is to modify the syntax like: STORE result INTO 'hbase://foo' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage( 'info:first_name info:last_name'); Eliminate comma from the syntax completely. It may have some readability issue with this approach. Having that said, the problem can be solved in more user friendly manner by improving the parser to filter prefix and suffix of comma, and comma only cases. Maybe this issue can be defined more accurately with: "Improve syntax parsing for HBaseStorage constructor".

          People

          • Assignee:
            Bill Graham
            Reporter:
            Bill Graham
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development