Supporting Quoted Identifiers in Column Names

1 Current Behavior

1.1 Definition of an Identifier in Hive Lexer

An Identifier is a sequence of alphaNumeric and under_score characters or a sequence that is surrounded by back-tick characters (`). In the case of back-tick, certain Regex symbols are allowed in addition to alphaNumeric and under_score characters. So the rules are:

Identifier
    :
    (Letter | Digit) (Letter | Digit | '_')*
    | '`' RegexComponent+ '`'
    ;

fragment
RegexComponent
    : 'a'..'z' | 'A'..'Z' | '0'..'9' | '_'
    | PLUS | STAR | QUESTION | MINUS | DOT
    | LPAREN | RPAREN | LSQUARE | RSQUARE | LCURLY | RCURLY
    | BITWISEXOR | BITWISEOR | DOLLAR | '/'
    ;

More on the implication of the RegexComponent rule when we get to the Select Expression behavior.

1.2 What is an Identifier in Hive

An Identifier can be a:

  • tableName
  • columnName
  • select expr alias
  • lateral view aliases
  • database name
  • view name
  • subquery alias
  • function name
  • ptf argument identifier
  • index name
  • property name for: db,tbl,partition…
  • fileFormat
  • role name
  • privilege name
  • principal name
  • macro name
  • hint name
  • window name

1.3 Definition of a String Token in Hive Lexer

A String in Hive can be delimited by either a single quote(') or double quote(") characters. The rules are:

StringLiteral
    :
    ( '\'' ( ~('\''|'\\') | ('\\' .) )* '\''
    | '\"' ( ~('\"'|'\\') | ('\\' .) )* '\"'
    )+
    ;

The implication of this is that we cannot use double quote delimited sequences as Identifiers.

  • we cannot remove the second Rule: will break existing SQLs.
  • allowing a double quoted String to be either a String or Identifier is not feasible ( at least I cannot figure this out; maybe an antlr expert can pipe in). Identifiers and Strings can appear in the same context; the Parser cannot provide sufficient disambiguating context to the Lexer. For e.g. 'select "abc" from R1'; "abc" can be String or an Identifier.

Whereas the SQL spec:

  • defines Strings as character sequences delimited by quotes.
  • defines Quoted Identifiers as character sequences delimited by double quotes.

As far as I can tell: most Databases use separate delimiters for Strings and Identifiers. For e.g.:

  • Oracle uses single quote for Strings, double quote for Quoted Identifiers.
  • SQL Server uses single quote for Strings, brackets([]) for Quoted Identifiers.

1.4 The validateName check in MetaStore Utils

The Name validation check in the Metastore layer today disallows table/partition/column names to contain anything other than alphaNumeric and under_score. This check conflicts with what is in the Hive Lexer. So the following:

create table test( `x+y` string);

Gives a not very helpful SemanticException error.

1.5 FileSystem escaping of names

The Hive Warehouse layer escapes/unescapes Partition Column Names and Partition Values. It escapes a character using the form '%{hexvalue}'. The characters it escapes are captured in a static in the FileUtils class.

/**
 * ASCII 01-1F are HTTP control characters that need to be escaped.
 * \u000A and \u000D are \n and \r, respectively.
 */
char[] clist = new char[] {'\u0001', '\u0002', '\u0003', '\u0004',
    '\u0005', '\u0006', '\u0007', '\u0008', '\u0009', '\n', '\u000B',
    '\u000C', '\r', '\u000E', '\u000F', '\u0010', '\u0011', '\u0012',
    '\u0013', '\u0014', '\u0015', '\u0016', '\u0017', '\u0018', '\u0019',
    '\u001A', '\u001B', '\u001C', '\u001D', '\u001E', '\u001F',
    '"', '#', '%', '\'', '*', '/', ':', '=', '?', '\\', '\u007F', '{',
    '[', ']', '^'};

for (char c : clist) {
  charToEscape.set(c);
}

if(Shell.WINDOWS){
  //On windows, following chars need to be escaped as well
  char [] winClist = {' ', '<','>','|'};
  for (char c : winClist) {
    charToEscape.set(c);
  }
}

1.6 Identifier behavior for everything but Select Expressions

In the following assume that we have relaxed the MetaStore validation check. Assume that the Metastore allows any Identifiers that are valid in the grammar. For these egs, I made the validate function just return true.

All these tests, work. This is far from comprehensive; the goal is to show you that beyond the Rules in the Lexer and the MetaStore validation check the rest of Hive doesn't seem to have an issue with special characters in Identifiers. (Please suggest more tests here)

create table test(`x+y` String, `a?b` String);
show tables;
describe table;
load data local inpath '.../data/files/kv.txt' 
overwrite into table test;
select * from test where `x+y` < '1';
select count(*) from test group by `x+y`;
select rank() over (partition by `x+y` order by `a?b`) from test;

-- partitioned table
create table partition_date_1 (key string, value string) 
partitioned by (`dt+x` date, region int);

insert overwrite table partition_date_1 
partition(`dt+x`='2000-01-01', region=1)
select * from src tablesample (10 rows);

create table partition_1(key string, value string) 
partitioned by (`dt*1` string);

insert overwrite table partition_1 
partition(`dt*1`='2000/01/01')
select * from src tablesample (10 rows);

-- bucketed table
create table bucket_test(`key?1` string, value string) 
clustered by (`key?1`) into 5 buckets;

insert overwrite table bucket_test select * from src cluster by key;

-- from clause
create table `test+1`(key String, value String);
insert overwrite table `test+1` select * from test;
select * from `test+1`;
select `test+1`.* from `test+1`;
select `test+1`.key from `test+1`;
select `test+1`.key from `test+1` where `test+1`.key < '2';

1.7 Identifier behavior for Select Expressions

The reason for the Regex rule in Hive Lexer is because when a Select Expression that is an Identifier is encountered: the Identifier text is checked to see if it contains Regular expression symbols; if it does the Identifier text is used to find columns that match the Regex. So for e.g.:

-- on src(key string, value string)
select `.*e.*` from src;
-- returns both key and value columns.

-- on test(`x+y` string, `a?b` string)
select `a?b` from test;
-- we get a SemanticException that states there are no columns matching the expression.

-- the matching is triggered only for Identifier; so on the test table this works correctly
select concat(`a?b`, '') from test;

As the last e.g. states, the matching is triggered for Select Expressions that are just Identifiers.

2 Proposal

2.1 Assumptions

  • The use of the Regex form in a Select List is rare. So it is ok to overload the meaning of back tick delimited character sequences.
  • The use of non alpahNumeric (+ under_score) Identifiers is mostly used when naming Columns. The use of Quoted Identifiers for other things(tables/databases/roles…) is much less common. The reason to bring this up is because introducing Identifiers for all entities is a major testing undertaking. So we propose to gradually introduce support for Quoted Identifiers.

2.2 Changes

  1. Introduce a new flag hive.support.quoted.identifiers This is a comma separated list of values. Initially it will only understand the terms: none or column. Over time we will add support for other entities: table, database, role, privilege etc.
  2. The existing behavior of the flag is that of 'none'.
  3. We will add a new rule to Hive Lexer, which defines a Quoted Identifier to mean a sequence of characters delimited by back-ticks. Back-ticks can be escaped by double back-ticks(``).
  4. This rule will be enabled (using Antlr's Semantic predicate mechanism) only if the flag is set. At the language level, if the flag is enabled Identifiers will be allowed to have any characters within backticks.
    • We will not introduce a new Token Type for Quoted Identifiers. The unescaping of Quoted Identifiers is done in the Lexer. Introducing a new Token Type will have a major impact on the logic in Semantic Analyzer genPlan.
  5. At the Metadata level, column names will now be allowed to have any characters.
    • Why don't we control this behavior with the flag setting? Metadata layer objects can be shared by multiple Sessions; controlling this behavior based on the invoking session would require the Session context to be passed into every API entry point. This would be a major change. Such a change is not warranted here: allowing column names to have any character in the Metadata layer doesn't introduce inconsistencies with current behavior.

2.3 Questions/Feedback

(thanks to Julian Hyde, Alan Gates)

2.3.1 Please specify whether identifiers are case-sensitive

Identifiers will continue to be case insensitive.

2.3.2 Can quoted identifiers contain back-ticks? If so, how are they escaped?

Escaping via double back-ticks

2.3.3 What character set can be used for identifiers? Unicode names?

Unicode; relying to Antlr using Unicode as the Character set.

2.3.4 Consider adding a property to disallow double-quoted character literals. They are non-standard.

Not in this version.

2.3.5 There is a thrift API through which users could create a table name, HiveMetaStoreClient.

(Thanks to Alan Gates and Ashutosh in coming up with this solution)

At the Metadata level we will now allow column names to contain any characters. So there should be no impact on the Metadata layer)

2.3.6 Worth reviewing the JDBC metadata calls

[ see http://docs.oracle.com/javase/7/docs/api/java/sql/DatabaseMetaData.html ], supportsMixedCaseIdentifiers, supportsMixedCaseQuotedIdentifiers, storesLowerCaseQuotedIdentifiers, etc. and figure out what they would return. You will need to change getIdentifierQuoteString.

Need to follow up on this.

2.4 Rejected/Alternate options

  1. Introduce Quoted Identifiers as a first class concept: this is probably the right way to do this. Introduce a 'Quoted Identifier' token in the Grammar. We also need to introduce the concept of a Name class that captures if it is quoted. Both these changes have a major impact on the amount of code change:
    • throughout Semantic Analysis we need to handle the new 'Quoted Identifier' token that can appear anywhere an Identifier token can.
    • currently entity names are just strings; introducing a Name class in the Metadata layer is an even bigger change.
  2. Use double quotes(") as the Identifier delimiter: as we mentioned before, because we already use double quotes as String delimiters this is possible without a regression. (Unless someone can figure out disambiguation rules for Identifiers and Strings).

Author: Harish Butani

Created: 2013-12-11 Wed 15:45

Emacs 24.3.1 (Org mode 8.0.3)

Validate XHTML 1.0