Hive / HIVE-23172

Quoted Backtick Columns Are Not Parsing Correctly

Details

    • Type: Improvement
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved

    Description

      I recently came across some odd behavior while examining failures of special_character_in_tabnames_2.q as part of HIVE-23150. I was surprised to see it fail because I couldn't see any reason why it should. It runs pretty standard SQL statements just like every other test, but this test behaves a little differently from most others, and that brought this issue to light.

      It turns out that the parsing of table names is pretty much wrong across the board.

      The statement that caught my attention was this:

      DROP TABLE IF EXISTS `s/c`;
      

      And here is the relevant grammar:

      fragment
      RegexComponent
          : 'a'..'z' | 'A'..'Z' | '0'..'9' | '_'
          | PLUS | STAR | QUESTION | MINUS | DOT
          | LPAREN | RPAREN | LSQUARE | RSQUARE | LCURLY | RCURLY
          | BITWISEXOR | BITWISEOR | DOLLAR | '!'
          ;
      
      Identifier
          :
          (Letter | Digit) (Letter | Digit | '_')*
          | {allowQuotedId()}? QuotedIdentifier  /* though at the language level we allow all Identifiers to be QuotedIdentifiers; 
                                                    at the API level only columns are allowed to be of this form */
          | '`' RegexComponent+ '`'
          ;
      
      fragment    
      QuotedIdentifier 
          :
          '`'  ( '``' | ~('`') )* '`' { setText(StringUtils.replace(getText().substring(1, getText().length() -1 ), "``", "`")); }
          ;
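
      To make the QuotedIdentifier action above concrete, here is a minimal Java sketch of what its setText(...) call appears to do: drop the outer backticks, then collapse each doubled backtick into a single one. The class and method names are hypothetical, and String.replace stands in for the StringUtils.replace call used in the grammar.

```java
// Hypothetical illustration only; not part of the Hive code base.
final class QuotedIdentifierSketch {

    // Mirrors the lexer action: remove the first and last character
    // (the surrounding backticks), then turn every "``" into "`".
    static String unquote(String tokenText) {
        String inner = tokenText.substring(1, tokenText.length() - 1);
        return inner.replace("``", "`");
    }

    public static void main(String[] args) {
        System.out.println(unquote("`s/c`"));   // prints s/c
        System.out.println(unquote("`a``b`"));  // prints a`b
    }
}
```

      This is why a name that goes through QuotedIdentifier reaches the rest of Hive without its backticks.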
      

      The mystery for me was that, for some reason, the string `s/c` was being stripped of its backticks. Every other test I investigated did not behave this way: the backticks were always preserved around the table name, and the main Hive Java code base would see the backticks and deal with them internally. For HIVE-23150, I introduced some sanity checks, and they were failing because they expected the backticks to be present.

      With the help of HIVE-23171 I finally figured it out. Pretty much every table name hits the RegexComponent alternative of the Identifier rule, and the backticks are carried forward. In `s/c`, however, the forward slash `/` is not allowed by RegexComponent, so the name hits the QuotedIdentifier rule instead, which trims the backticks.

      I validated this by disabling QuotedIdentifier. With it disabled, `s/c` fails with an error but `sc` parses successfully, because `sc` is matched via RegexComponent.

      So, if allowQuotedId is disabled, table names can only use the characters defined in the RegexComponent rule (anything else is an error), and the backticks are not stripped. If allowQuotedId is enabled, a table name containing a character not listed in RegexComponent is matched as a QuotedIdentifier and its backticks are stripped, while a name made up entirely of RegexComponent characters keeps its backticks.
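
      The cases described above can be modeled with a small Java sketch. It reproduces the outcomes reported in this issue and is not the actual ANTLR lexer logic. The class and method names are hypothetical, and the character class assumes that the PLUS, STAR, QUESTION, MINUS, DOT, bracket, BITWISEXOR, BITWISEOR and DOLLAR tokens map to their usual single characters (+ * ? - . ( ) [ ] { } ^ | $).

```java
import java.util.regex.Pattern;

// Hypothetical model of the behavior reported in this issue.
final class BacktickLexSketch {

    // One or more characters allowed by the RegexComponent fragment
    // (token-to-character mapping is an assumption, see the lead-in).
    private static final Pattern REGEX_COMPONENTS =
        Pattern.compile("[a-zA-Z0-9_+*?\\-.()\\[\\]{}^|$!]+");

    // Returns the token text a backtick-quoted name would end up with.
    static String lex(String quoted, boolean allowQuotedId) {
        String body = quoted.substring(1, quoted.length() - 1);
        if (REGEX_COMPONENTS.matcher(body).matches()) {
            // '`' RegexComponent+ '`' alternative: backticks are kept.
            return quoted;
        }
        if (allowQuotedId) {
            // QuotedIdentifier alternative: backticks are stripped
            // and doubled backticks are unescaped.
            return body.replace("``", "`");
        }
        // Neither alternative applies: the name is rejected.
        throw new IllegalArgumentException("cannot lex " + quoted);
    }

    public static void main(String[] args) {
        System.out.println(lex("`sc`", true));   // backticks kept
        System.out.println(lex("`s/c`", true));  // backticks stripped
        System.out.println(lex("`sc`", false));  // backticks kept
    }
}
```

      Calling lex("`s/c`", false) throws in this model, matching the error seen when QuotedIdentifier was disabled.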

            People

              Assignee: David Mollitor (belugabehr)
              Reporter: David Mollitor (belugabehr)