Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.4.0
    • Fix Version/s: 0.4.0
    • Component/s: Query Processor
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Hive.g now still uses "backtrack=true". "backtrack" not only slows down the parsing in case of error, it can also produce wrong syntax error messages (usually based on the last try of the backtracking).

      We should follow http://www.antlr.org/wiki/display/ANTLR3/How+to+remove+global+backtracking+from+your+grammar to remove the need of doing backtrack.

      1. HIVE-416.1.2.branch-0.3.patch
        7 kB
        Zheng Shao
      2. HIVE-416.1.2.patch
        7 kB
        Zheng Shao
      3. HIVE-416.1.1.patch
        7 kB
        Zheng Shao
      4. HIVE-416.1.patch
        4 kB
        Zheng Shao

        Issue Links

          Activity

          Hide
          Zheng Shao added a comment -

          HIVE-416 is a more thorough (and risky) change of Hive.g than HIVE-415, so it should go in after HIVE-415.

          Show
          Zheng Shao added a comment - HIVE-416 is a more thorough (and risky) change of Hive.g than HIVE-415 , so it should go in after HIVE-415 .
          Hide
          Zheng Shao added a comment -

          This patch changed Hive.g to disable backtrack.

          All ambiguity are resolved except an unresolvable one: "Identifier DOT Identifier". This one will require some work in the SemanticAnalyzer as well.

          We had to add a little restriction to the grammar:
          1. "SELECT TRANSFORM" will have to have brackets around all expression List after it. Originally the brackets are optional;
          2. "MAP" and "REDUCE" cannot have brackets around all expression list after it (just like SELECT). Originally we allow optional brackets.
          Both changes are required if we want to disable backtrack.

          Show
          Zheng Shao added a comment - This patch changed Hive.g to disable backtrack. All ambiguity are resolved except an unresolvable one: "Identifier DOT Identifier". This one will require some work in the SemanticAnalyzer as well. We had to add a little restriction to the grammar: 1. "SELECT TRANSFORM" will have to have brackets around all expression List after it. Originally the brackets are optional; 2. "MAP" and "REDUCE" cannot have brackets around all expression list after it (just like SELECT). Originally we allow optional brackets. Both changes are required if we want to disable backtrack.
          Hide
          Namit Jain added a comment -

          I had some questions.

          1. Wont it be better to factor out the common part in a seperate rule instead of providing the look-ahead.
          For eg:

          instead of:

          alterStatement
          options

          {k=5;}

          @init

          { msgs.push("alter statement"); }

          @after

          { msgs.pop(); }
          : alterStatementRename
          | alterStatementAddCol
          | alterStatementDropPartitions
          | alterStatementAddPartitions
          | alterStatementProperties
          | alterStatementSerdeProperties
          ;


          wont it be better to factor out < ALTER TABLE identifier> in a common and then have the remaining rules ?


          2. On the same lines, I could not understand the reason for brackets around SELECT TRANSFORM and
          no brackets around MAP/REDUCE.


          Instead of this:

          selectClause
          @init { msgs.push("select clause"); }
          @after { msgs.pop(); }

          :
          KW_SELECT (KW_ALL | dist=KW_DISTINCT)?
          selectList -> {$dist == null}? ^(TOK_SELECT selectList)
          -> ^(TOK_SELECTDI selectList)

          trfmClause ->^(TOK_SELECT ^(TOK_SELEXPR trfmClause) )
          ;

          if we factor out:

          KW_SELECT for the first part and the transform clause, brackets should become optional.

          Am I missing something here ?

          Show
          Namit Jain added a comment - I had some questions. 1. Wont it be better to factor out the common part in a seperate rule instead of providing the look-ahead. For eg: instead of: alterStatement options {k=5;} @init { msgs.push("alter statement"); } @after { msgs.pop(); } : alterStatementRename | alterStatementAddCol | alterStatementDropPartitions | alterStatementAddPartitions | alterStatementProperties | alterStatementSerdeProperties ; wont it be better to factor out < ALTER TABLE identifier> in a common and then have the remaining rules ? 2. On the same lines, I could not understand the reason for brackets around SELECT TRANSFORM and no brackets around MAP/REDUCE. Instead of this: selectClause @init { msgs.push("select clause"); } @after { msgs.pop(); } : KW_SELECT (KW_ALL | dist=KW_DISTINCT)? selectList -> {$dist == null}? ^(TOK_SELECT selectList) -> ^(TOK_SELECTDI selectList) trfmClause ->^(TOK_SELECT ^(TOK_SELEXPR trfmClause) ) ; if we factor out: KW_SELECT for the first part and the transform clause, brackets should become optional. Am I missing something here ?
          Hide
          Zheng Shao added a comment -

          1. I checked the generated code for

          {k=5;}

          , it's a nested if so there is no performance penalty. But I agree most grammars have k up to 3, and it should be easy to extract the common prefix, so I will do it.

          2. optional brackets won't be possible with a LL(k) parser with any k (without backtrack), because I can construct an arbitarily long string like "(((((((a+b..." and it's not possible to know whether the first "(" is the optional bracket or not.

          Most people who has been using "SELECT TRANSFORM" are adding the brackets, while those using "MAP/REDUCE" are probably not (think "MAP" / "REDUCE" similar to "SELECT"), that's why I made the choice like that. We can discuss more on this if needed.

          Show
          Zheng Shao added a comment - 1. I checked the generated code for {k=5;} , it's a nested if so there is no performance penalty. But I agree most grammars have k up to 3, and it should be easy to extract the common prefix, so I will do it. 2. optional brackets won't be possible with a LL(k) parser with any k (without backtrack), because I can construct an arbitarily long string like "(((((((a+b..." and it's not possible to know whether the first "(" is the optional bracket or not. Most people who has been using "SELECT TRANSFORM" are adding the brackets, while those using "MAP/REDUCE" are probably not (think "MAP" / "REDUCE" similar to "SELECT"), that's why I made the choice like that. We can discuss more on this if needed.
          Hide
          Ashish Thusoo added a comment -

          I think for Identifier DOT Identifier we should probably treat it as a lexical rule rather than a grammar rule. It will also make it much simpler to support optional aliasing with complex types.

          Right now

          select T.a.b FROM T

          and

          select a.b FROM T

          is very hard to handle in the SemanticAnalyzer as the grammar treats a as a table alias instead of a complex column name.

          About the comment on optional brackets - clearly these are optional in expressions. So how do we support those expression e.g (a+b) and a+b are both valid sql expressoins if we cannot support this without backtracking...

          Show
          Ashish Thusoo added a comment - I think for Identifier DOT Identifier we should probably treat it as a lexical rule rather than a grammar rule. It will also make it much simpler to support optional aliasing with complex types. Right now select T.a.b FROM T and select a.b FROM T is very hard to handle in the SemanticAnalyzer as the grammar treats a as a table alias instead of a complex column name. About the comment on optional brackets - clearly these are optional in expressions. So how do we support those expression e.g (a+b) and a+b are both valid sql expressoins if we cannot support this without backtracking...
          Hide
          Ashish Thusoo added a comment -

          Can we use semantic/syntactic predicates to support the optional brackets?

          Show
          Ashish Thusoo added a comment - Can we use semantic/syntactic predicates to support the optional brackets?
          Hide
          Zheng Shao added a comment -

          Extracted common prefix for ALTER TABLE, and Removed all "

          {k=5}

          ".

          Show
          Zheng Shao added a comment - Extracted common prefix for ALTER TABLE, and Removed all " {k=5} ".
          Hide
          Zheng Shao added a comment -

          > About the comment on optional brackets - clearly these are optional in expressions. So how do we support those expression e.g (a+b) and a+b are both valid sql expressoins if we cannot support this without backtracking...

          We do support "(a+b)" and "a+b".

          The problem is that there is no easy way of supporting both "((((a))), b)" and "((((a)))), b". No matter what is k, it's not possible to determine whether the first "(" is the optional bracket for the expression list, or just part of the first expression.

          I will need to go over the antlr book to know more about Semantic/Syntactic predicate to know whether that is possible.

          > Identifier DOT Identifier.
          Treating it as a lexical rule won't allow both T.a.b and a.b. I am making a first Identifier a TOK_TABLE_OR_COL. I will let SemanticAnalyzer to decide whether it is a table name or column name. Not sure that should go into the same transaction or not since it's a much bigger change.

          Show
          Zheng Shao added a comment - > About the comment on optional brackets - clearly these are optional in expressions. So how do we support those expression e.g (a+b) and a+b are both valid sql expressoins if we cannot support this without backtracking... We do support "(a+b)" and "a+b". The problem is that there is no easy way of supporting both "((((a))), b)" and "((((a)))), b". No matter what is k, it's not possible to determine whether the first "(" is the optional bracket for the expression list, or just part of the first expression. I will need to go over the antlr book to know more about Semantic/Syntactic predicate to know whether that is possible. > Identifier DOT Identifier. Treating it as a lexical rule won't allow both T.a.b and a.b. I am making a first Identifier a TOK_TABLE_OR_COL. I will let SemanticAnalyzer to decide whether it is a table name or column name. Not sure that should go into the same transaction or not since it's a much bigger change.
          Hide
          Raghotham Murthy added a comment -

          How about the following:

          Add the production:
          {{

          { expr -> expr ',' expr }

          }}
          With this production and expr -> '(' expr ')' we can support arbitrarily nested parentheses. The issue is that the new production will create a left-deep tree of comma-expressions. We could implement a method which takes such a tree and flatten out comma expressions into expression lists.

          Also, I remember Venky asking for arbitrarily nested parentheses around queries for his query authoring tool. We could do something similar and create comma-query-expressions.

          Show
          Raghotham Murthy added a comment - How about the following: Add the production: {{ { expr -> expr ',' expr } }} With this production and expr -> '(' expr ')' we can support arbitrarily nested parentheses. The issue is that the new production will create a left-deep tree of comma-expressions. We could implement a method which takes such a tree and flatten out comma expressions into expression lists. Also, I remember Venky asking for arbitrarily nested parentheses around queries for his query authoring tool. We could do something similar and create comma-query-expressions.
          Hide
          Zheng Shao added a comment -

          @Raghu: I just had a second thought on that approach. The new production you add is left-recursive and it's not permitted in LL(k), but it's possible to use precedence rules to fix that.
          However given all the change including the flattening, it seems to me that's too much work with very little benefit - who care about the optional brackets, for usage it's exactly the same.

          For Venky's case, it's a separate problem. Venky's case is more like supporting "a" and "((((a))))". We should be able to support it easily once we allow omitting the sub query alias.

          Show
          Zheng Shao added a comment - @Raghu: I just had a second thought on that approach. The new production you add is left-recursive and it's not permitted in LL(k), but it's possible to use precedence rules to fix that. However given all the change including the flattening, it seems to me that's too much work with very little benefit - who care about the optional brackets, for usage it's exactly the same. For Venky's case, it's a separate problem. Venky's case is more like supporting "a" and "((((a))))". We should be able to support it easily once we allow omitting the sub query alias.
          Hide
          Namit Jain added a comment - - edited

          The changes look good to me.
          +1

          I can test and commit them, but will wait for a couple of hours if anyone has any issues, since many people have been reviewing this code.
          I will wait till Friday evening (say around 5pm), if no-one has any objections, I will commit this (if the tests pass)

          Show
          Namit Jain added a comment - - edited The changes look good to me. +1 I can test and commit them, but will wait for a couple of hours if anyone has any issues, since many people have been reviewing this code. I will wait till Friday evening (say around 5pm), if no-one has any objections, I will commit this (if the tests pass)
          Hide
          Ashish Thusoo added a comment -

          A couple of minor comments:

          1. Do we still need k=5 in the alterStatement rule considering that the ties can be broken by k=3 lookahead in alterStatementSuffix.
          2. In the select list and select clause can factor out KW_SELECT as that is the common part in the two options in the selectClause rule.

          Show
          Ashish Thusoo added a comment - A couple of minor comments: 1. Do we still need k=5 in the alterStatement rule considering that the ties can be broken by k=3 lookahead in alterStatementSuffix. 2. In the select list and select clause can factor out KW_SELECT as that is the common part in the two options in the selectClause rule.
          Hide
          Zheng Shao added a comment -

          @Ashish:
          This patch removes the unnecessary "

          {k=5}

          ".
          We cannot extract the common "SELECT" out because we allow "MAP", "REDUCE", as well as "SELECT TRANSFORM" in the trfmClause.

          Show
          Zheng Shao added a comment - @Ashish: This patch removes the unnecessary " {k=5} ". We cannot extract the common "SELECT" out because we allow "MAP", "REDUCE", as well as "SELECT TRANSFORM" in the trfmClause.
          Hide
          Namit Jain added a comment -

          Committed. Thanks Zheng

          Show
          Namit Jain added a comment - Committed. Thanks Zheng
          Hide
          Zheng Shao added a comment -

          This patch is for branch-0.3. It seems without this, the TestNegativeCliDriver sometimes produce wrong error messages (undeterministically). That is probably caused by the backtracking which is fixed with this patch.

          Please commit this one to branch-0.3.

          Show
          Zheng Shao added a comment - This patch is for branch-0.3. It seems without this, the TestNegativeCliDriver sometimes produce wrong error messages (undeterministically). That is probably caused by the backtracking which is fixed with this patch. Please commit this one to branch-0.3.

            People

            • Assignee:
              Zheng Shao
              Reporter:
              Zheng Shao
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development