Pig
  1. Pig
  2. PIG-3414

QueryParserDriver.parseSchema(String) silently returns a wrong result when a comma is missing in the schema definition

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.12.0
    • Component/s: parser
    • Labels:
      None

      Description

      QueryParserDriver provides a convenient method to parse from string to LogicalSchema. But if a comma is missing between two fields in the schema definition, it silently returns a wrong result. For example,

      a:int b:long
      

      This string will be parsed up to "a:int", and "b:long" will be silently discarded. This should rather fail with a parser exception.

      1. PIG-3414.patch
        3 kB
        Cheolsoo Park
      2. PIG-3414-2.patch
        4 kB
        Cheolsoo Park
      3. PIG-3414-3.patch
        3 kB
        Cheolsoo Park
      4. PIG-3414-4.patch
        4 kB
        Cheolsoo Park
      5. PIG-3414-5.patch
        4 kB
        Cheolsoo Park
      6. PIG-3414-6.patch
        16 kB
        Cheolsoo Park

        Activity

        Hide
        Cheolsoo Park added a comment -

        Thank you Xuefu! Yes, all the unit tests pass now. Sorry for all the unexpected inconvenience.

        Committed to trunk.

        Show
        Cheolsoo Park added a comment - Thank you Xuefu! Yes, all the unit tests pass now. Sorry for all the unexpected inconvenience. Committed to trunk.
        Hide
        Xuefu Zhang added a comment -

        +1. I assume all test pass now.

        Show
        Xuefu Zhang added a comment - +1. I assume all test pass now.
        Hide
        Cheolsoo Park added a comment -

        All the unit tests are fixed. I uploaded the new patch to RB:
        https://reviews.apache.org/r/13551/

        Please review one more time. Thanks!

        Show
        Cheolsoo Park added a comment - All the unit tests are fixed. I uploaded the new patch to RB: https://reviews.apache.org/r/13551/ Please review one more time. Thanks!
        Hide
        Cheolsoo Park added a comment -

        I am discovering several buggy unit tests that were accidentally passing. Now I am adding this new check, they start failing.

        I will fix all the buggy unit tests in a new patch.

        Show
        Cheolsoo Park added a comment - I am discovering several buggy unit tests that were accidentally passing. Now I am adding this new check, they start failing. I will fix all the buggy unit tests in a new patch.
        Hide
        Cheolsoo Park added a comment -

        A new patch includes the following changes:

        • Moved my test case to TestSchema.java since that seems like a better place than TestQueryParser.java.
        • Fixed the bug in TestSchema.testGetStringFromSchema(), so now TestSchema passes.
        Show
        Cheolsoo Park added a comment - A new patch includes the following changes: Moved my test case to TestSchema.java since that seems like a better place than TestQueryParser.java. Fixed the bug in TestSchema.testGetStringFromSchema(), so now TestSchema passes.
        Hide
        Cheolsoo Park added a comment -

        I realized that I broke TestSchema so reverted my commit. In fact, it's the test case that has a bug:

        datetime(int,long,float,double,boolean,datetime) -- A comma is missing, so it should fail!
        

        I am going to post a new patch that fixes this test case.

        Show
        Cheolsoo Park added a comment - I realized that I broke TestSchema so reverted my commit. In fact, it's the test case that has a bug: datetime( int , long , float , double , boolean ,datetime) -- A comma is missing, so it should fail! I am going to post a new patch that fixes this test case.
        Hide
        Cheolsoo Park added a comment -

        Committed to trunk. Thank you Xuefu for the review!

        Show
        Cheolsoo Park added a comment - Committed to trunk. Thank you Xuefu for the review!
        Hide
        Xuefu Zhang added a comment -

        +1

        Show
        Xuefu Zhang added a comment - +1
        Hide
        Cheolsoo Park added a comment -

        Thank you for the suggestion! I renamed the rule to "schema" and moved it next to "query". So it should be clear that it's a top level rule.

        Show
        Cheolsoo Park added a comment - Thank you for the suggestion! I renamed the rule to "schema" and moved it next to "query". So it should be clear that it's a top level rule.
        Hide
        Xuefu Zhang added a comment -

        Cheolsoo Park Thanks for reworking on the patch. The patch looks very good, except for a minor point: instead of "standalone_field_def_list", would you consider naming it just as "schema" or some sort, paring with its counterpart "query"? It's now a top-level root and cannot be used in other grammar rules. This might make it clear to other developers.

        Show
        Xuefu Zhang added a comment - Cheolsoo Park Thanks for reworking on the patch. The patch looks very good, except for a minor point: instead of "standalone_field_def_list", would you consider naming it just as "schema" or some sort, paring with its counterpart "query"? It's now a top-level root and cannot be used in other grammar rules. This might make it clear to other developers.
        Hide
        Cheolsoo Park added a comment -

        OK, here is a new patch that fixes the grammar. I added a new rule as follows:

        standalone_field_def_list: field_def_list EOF;
        

        As can be seen, this rule will ensure that all the tokens are consumed. Let me know if you have a better suggestion.

        Show
        Cheolsoo Park added a comment - OK, here is a new patch that fixes the grammar. I added a new rule as follows: standalone_field_def_list: field_def_list EOF; As can be seen, this rule will ensure that all the tokens are consumed. Let me know if you have a better suggestion.
        Hide
        Cheolsoo Park added a comment -

        Xuefu Zhang, thank you for your comment sir!

        I agree with you. In fact, this issue doesn't exist when the schema definition is used inside the AS clause in a LOAD statement because the grammar looks for a right parenthesis and throws a parser exception if it doesn't find one. Unfortunately, I didn't have a good idea to detect this in the grammar itself when the schema definition is used alone.

        For the context, I have a load func that takes the schema string as an option, and I use this helper function to parse it. One user accidentally omitted a comma, and that resulted in some surprising side-effects in downstream ETL processes. So I put some quick fix to prevent such unfortunate event again.

        But yes, I agree that this is probably not the best approach. I will cancel the patch for now.

        Show
        Cheolsoo Park added a comment - Xuefu Zhang , thank you for your comment sir! I agree with you. In fact, this issue doesn't exist when the schema definition is used inside the AS clause in a LOAD statement because the grammar looks for a right parenthesis and throws a parser exception if it doesn't find one. Unfortunately, I didn't have a good idea to detect this in the grammar itself when the schema definition is used alone. For the context, I have a load func that takes the schema string as an option, and I use this helper function to parse it. One user accidentally omitted a comma, and that resulted in some surprising side-effects in downstream ETL processes. So I put some quick fix to prevent such unfortunate event again. But yes, I agree that this is probably not the best approach. I will cancel the patch for now.
        Hide
        Xuefu Zhang added a comment -

        I think using custom code to detect grammar error might not be the best approach. I can image that there can be more cases like this. Ideally, we should fix this by the grammar or the use of antlr.

        Show
        Xuefu Zhang added a comment - I think using custom code to detect grammar error might not be the best approach. I can image that there can be more cases like this. Ideally, we should fix this by the grammar or the use of antlr.
        Hide
        Cheolsoo Park added a comment -

        Attached is a patch that throws an exception when a comma is missing in the schema definition.

        I also added new test cases.

        Show
        Cheolsoo Park added a comment - Attached is a patch that throws an exception when a comma is missing in the schema definition. I also added new test cases.

          People

          • Assignee:
            Cheolsoo Park
            Reporter:
            Cheolsoo Park
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development