Pig
  1. Pig
  2. PIG-123

cannot escape single quotes in single quoted strings when using the eq or match operator

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.1.0
    • Component/s: None
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      a query like this :

      filter c by ( c.string eq 'hell\'s angels' OR c.string eq 'blue\'s clues')

      generates an error with the pig parser, because the parser does not allow single quotes (') within single quotes, even if they are escaped with a backslash.

        Activity

        Hide
        Benjamin Reed added a comment -

        Committed revision 637293.

        Show
        Benjamin Reed added a comment - Committed revision 637293.
        Hide
        Benjamin Reed added a comment -

        I'm checking this in now.

        Show
        Benjamin Reed added a comment - I'm checking this in now.
        Hide
        Pi Song added a comment -

        In sync with the latest trunk (11Mar2008)

        Unit tests passed

        This patch has been around for a while. Somebody please commit.

        Show
        Pi Song added a comment - In sync with the latest trunk (11Mar2008) Unit tests passed This patch has been around for a while. Somebody please commit.
        Hide
        Benjamin Francisoud added a comment -

        Setting "Patch Available" as request by Pi Song

        Show
        Benjamin Francisoud added a comment - Setting "Patch Available" as request by Pi Song
        Hide
        Pi Song added a comment -

        Surprisingly, there are so many of our tests that do end-to-end runs instead of a specific point test.

        In the long run, we will need to consolidate all these tests. I don't want to bloat this Jira. Let's discuss later.

        Show
        Pi Song added a comment - Surprisingly, there are so many of our tests that do end-to-end runs instead of a specific point test. In the long run, we will need to consolidate all these tests. I don't want to bloat this Jira. Let's discuss later.
        Hide
        Pi Song added a comment - - edited

        Here it is.

        The test is done at Parser level and subsequently Consts in LogicalOps are checked.
        This test took 0.283 seconds on my machine.

        In fact, I see this patch as an improvement rather than a bug fix.

        Please mark as patch available.

        Show
        Pi Song added a comment - - edited Here it is. The test is done at Parser level and subsequently Consts in LogicalOps are checked. This test took 0.283 seconds on my machine. In fact, I see this patch as an improvement rather than a bug fix. Please mark as patch available.
        Hide
        Olga Natkovich added a comment -

        +1 on code changes.

        I think tests should be true unit tests not end-to-end runs of pig scripts. We really want to minimize the time it takes to run unit tests and not to go to full tests when possible.

        Show
        Olga Natkovich added a comment - +1 on code changes. I think tests should be true unit tests not end-to-end runs of pig scripts. We really want to minimize the time it takes to run unit tests and not to go to full tests when possible.
        Hide
        Benjamin Reed added a comment -

        +1

        Show
        Benjamin Reed added a comment - +1
        Hide
        Benjamin Reed added a comment -

        You are correct. I was more worried about the strain on the memory manager, but as you point out this is done in the parser, not in the execution engine.

        Show
        Benjamin Reed added a comment - You are correct. I was more worried about the strain on the memory manager, but as you point out this is done in the parser, not in the execution engine.
        Hide
        Pi Song added a comment - - edited

        Ben,

        Pig is a processing-oriented system. You spend milliseconds to compile the query but spend hours to do processing work. The real point to be optimized should be in the execution engine where the code gets called over and over millions of time. This change in frontend encoding should have extremely small impact on the whole execution.

        In terms of performance analysis, this is an O ( n ) operation on a short in-memory string (normally no longer than 1K) which gets called only once in an execution.

        PS. If you do string scan first, it's still O ( n ) operation

        Show
        Pi Song added a comment - - edited Ben, Pig is a processing-oriented system. You spend milliseconds to compile the query but spend hours to do processing work. The real point to be optimized should be in the execution engine where the code gets called over and over millions of time. This change in frontend encoding should have extremely small impact on the whole execution. In terms of performance analysis, this is an O ( n ) operation on a short in-memory string (normally no longer than 1K) which gets called only once in an execution. PS. If you do string scan first, it's still O ( n ) operation
        Hide
        Benjamin Reed added a comment -

        Excellent work! Have you done a microbenchmark to see the overhead? I'm wondering if it would be worth it to scan the string first to see if there are any slashes to avoid processing (in the common case?).

        Show
        Benjamin Reed added a comment - Excellent work! Have you done a microbenchmark to see the overhead? I'm wondering if it would be worth it to scan the string first to see if there are any slashes to avoid processing (in the common case?).
        Hide
        Pi Song added a comment - - edited

        This patch includes implementation of the above spec.
        A new test case also included (TestPigScriptParser) which tests 3 special cases

        • single quote escape
        • \n escape
        • \uxxxx unicode escape

        All unit tests passed.

        Ready to be committed

        *Please mark this as patch available*

        Show
        Pi Song added a comment - - edited This patch includes implementation of the above spec. A new test case also included (TestPigScriptParser) which tests 3 special cases single quote escape \n escape \uxxxx unicode escape All unit tests passed. Ready to be committed * Please mark this as patch available *
        Hide
        Pi Song added a comment -

        This patch adds escape character to pig script and grunt.

        -  "\'","\r","\f","\t","\n","\b" are supported
        - "\\" will be interpreted to \
        - "\uxxxx" for unicode are supported
        - "\x" where x doesn't match anything above will be interpreted to x (skipping "\" )
        

        This patch is only for review. Unit test is coming after I sort out the my miniMR problem.

        Show
        Pi Song added a comment - This patch adds escape character to pig script and grunt. - "\'","\r","\f","\t","\n","\b" are supported - "\\" will be interpreted to \ - "\uxxxx" for unicode are supported - "\x" where x doesn't match anything above will be interpreted to x (skipping "\" ) This patch is only for review. Unit test is coming after I sort out the my miniMR problem.
        Hide
        Benjamin Reed added a comment -

        It would be nice to fix this properly and allow support for specifying unicode characters, \uXXXX , and supporting standard escapes, \r \n \t. As you mention, we have to introduce a semantic change, so we should do it right.

        BTW, make sure to test for '
        ' and '\\\''.

        Show
        Benjamin Reed added a comment - It would be nice to fix this properly and allow support for specifying unicode characters, \uXXXX , and supporting standard escapes, \r \n \t. As you mention, we have to introduce a semantic change, so we should do it right. BTW, make sure to test for ' ' and '\\\''.
        Hide
        Olga Natkovich added a comment -

        +1. Pi, your proposal makes sense to me.

        Show
        Olga Natkovich added a comment - +1. Pi, your proposal makes sense to me.
        Hide
        Pi Song added a comment -

        After having a look, currently there is no escape character implementation in our grammar.

        By changing the definition of quoted string in grammar file from

        TOKEN : { <QUOTEDSTRING : "'" (~["'"])* "'"> }
        

        to

        TOKEN : { <QUOTEDSTRING : "'" ( (~["'","\\"]) | ("\\" (["\\","'"] ) )* "'"> }
        

        and changing the state manipulation in Grunt not to switch out of IN_STRING after "\'" (backslash and a quote) is encountered, this should work.

        ***However*** the real problem of doing this is the semantic change. As we introduce an escape character, we no longer treat the character as itself. People using "\" in their code will get a weird error message.

        Here are what will be affected from the above grammar change (Similar to Java) :-

        "\\"  will be interpreted as "\"
        "\'"  will be interpreted as "'"
        "\"+(char) will no longer allowed. ***An error will be raised***.
        

        ***Another concern*** is I don't know how having a single quote in the expression evaluation engine (which sometimes uses encoded text notation internally) will be affected.

        Need to listen to others!!!

        Show
        Pi Song added a comment - After having a look, currently there is no escape character implementation in our grammar. By changing the definition of quoted string in grammar file from TOKEN : { <QUOTEDSTRING : "'" (~["'"])* "'"> } to TOKEN : { <QUOTEDSTRING : "'" ( (~["'","\\"]) | ("\\" (["\\","'"] ) )* "'"> } and changing the state manipulation in Grunt not to switch out of IN_STRING after "\'" (backslash and a quote) is encountered, this should work. *** However *** the real problem of doing this is the semantic change. As we introduce an escape character, we no longer treat the character as itself. People using "\" in their code will get a weird error message. Here are what will be affected from the above grammar change (Similar to Java) :- "\\" will be interpreted as "\" "\'" will be interpreted as "'" "\"+(char) will no longer allowed. ***An error will be raised***. *** Another concern *** is I don't know how having a single quote in the expression evaluation engine (which sometimes uses encoded text notation internally) will be affected. Need to listen to others!!!

          People

          • Assignee:
            Pi Song
            Reporter:
            andrei badulescu
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development