Uploaded image for project: 'Calcite'
  1. Calcite
  2. CALCITE-528

Creating output row type of a Join does not obey case-sensitivity flags

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.9.1-incubating
    • Fix Version/s: 1.9.0
    • Component/s: None
    • Labels:
      None

      Description

      In JoinRelBase.createJoinType() which creates a row type of the output row, a HashSet of String is used to keep track of unique field names. The field names 'column1' and 'Column1' will both be stored. This creates a problem for systems which are treating identifiers as case-insensitive (such as Drill) which rely on a Project below a Join to create unique names if the join columns are the same name (regardless of case).

      Ideally, the comparison for this should be done based on the criteria specified in the Lex settings when instantiating the SqlParser.ParserConfigImpl. So, if the parser was created with MYSQL Lex settings (see Lex.java), it should be obeyed by the JoinRelBase.createJoinType().

        Activity

        Hide
        julianhyde Julian Hyde added a comment -

        Has the validator decided that "column1" and "Column1" are the same field? If so, they should end up being the same field after sql-to-rel translation. It does not matter whether that field is called "column1" or "Column1" or "foobar$56" – it's just for debugging purposes and you shouldn't be relying on it being anything in particular. Calcite happens to use case-sensitive matching for RelNode field names but it should not matter, because users can't see RelNode fields.

        The case-sensitivity flags only apply to SQL parsing/validation. It sounds like there is either a bug in sql-to-rel conversion or no bug at all.

        Show
        julianhyde Julian Hyde added a comment - Has the validator decided that "column1" and "Column1" are the same field? If so, they should end up being the same field after sql-to-rel translation. It does not matter whether that field is called "column1" or "Column1" or "foobar$56" – it's just for debugging purposes and you shouldn't be relying on it being anything in particular. Calcite happens to use case-sensitive matching for RelNode field names but it should not matter, because users can't see RelNode fields. The case-sensitivity flags only apply to SQL parsing/validation. It sounds like there is either a bug in sql-to-rel conversion or no bug at all.
        Hide
        jni Jinfeng Ni added a comment - - edited

        Hi Julian,

        Seems to me that Calcite base code does not see this case-sensitive issue simply because Calcite connection config uses LEX.ORACLE by default. (

        CalciteConnectionProperty.java 
        
         /** Lexical policy. */
          LEX("lex", Type.ENUM, Lex.ORACLE, false),
        
        Lex.java
        
         /** Lexical policy similar to Oracle. The case of identifiers enclosed in
           * double-quotes is preserved; unquoted identifiers are converted to
           * upper-case; after which, identifiers are matched case-sensitively. */
          ORACLE(Quoting.DOUBLE_QUOTE, Casing.TO_UPPER, Casing.UNCHANGED, true),
        
          /** Lexical policy similar to MySQL. (To be precise: MySQL on Windows;
           * MySQL on Linux uses case-sensitive matching, like the Linux file system.)
           * The case of identifiers is preserved whether or not they quoted;
           * after which, identifiers are matched case-insensitively.
           * Back-ticks allow identifiers to contain non-alphanumeric characters. */
          MYSQL(Quoting.BACK_TICK, Casing.UNCHANGED, Casing.UNCHANGED, false),
        
        

        In SqlToRelConverter, caseSensitive is hard-coded to "true". This works for for LEX.ORACLE, since all the unquoted identifier have been converted to upper-case. However, it will not work for LEX.MYSQL or other LEX which uses "UNCHANGED" casing.

        SqlToRElConverter.java:3323
        
              final boolean caseSensitive = true; // name already fully-qualified
              e = rexBuilder.makeFieldAccess(e, name, caseSensitive);
        

        Therefore, I think we need pass the SqlParser's LEX case-sensitive to validator, and probably SqlRelConverter. That way, it will work for either LEX.ORACLE or other LEX configuration.

        Also, I see probably almost all the queries in JdbcTest.java uses quoted column names. If I change column name to unquoted, then Calcite will complain "Column 'EMPID' not found in any table" error, for the following simple query:

        select empid from "hr"."emps"
        

        That probably happens because CatalogReader use case-sensitive=true, while the unquoted identifier have been converted to upper-case.

        Any suggestion? Thanks!

        Show
        jni Jinfeng Ni added a comment - - edited Hi Julian, Seems to me that Calcite base code does not see this case-sensitive issue simply because Calcite connection config uses LEX.ORACLE by default. ( CalciteConnectionProperty.java /** Lexical policy. */ LEX( "lex" , Type.ENUM, Lex.ORACLE, false ), Lex.java /** Lexical policy similar to Oracle. The case of identifiers enclosed in * double -quotes is preserved; unquoted identifiers are converted to * upper- case ; after which, identifiers are matched case -sensitively. */ ORACLE(Quoting.DOUBLE_QUOTE, Casing.TO_UPPER, Casing.UNCHANGED, true ), /** Lexical policy similar to MySQL. (To be precise: MySQL on Windows; * MySQL on Linux uses case -sensitive matching, like the Linux file system.) * The case of identifiers is preserved whether or not they quoted; * after which, identifiers are matched case -insensitively. * Back-ticks allow identifiers to contain non-alphanumeric characters. */ MYSQL(Quoting.BACK_TICK, Casing.UNCHANGED, Casing.UNCHANGED, false ), In SqlToRelConverter, caseSensitive is hard-coded to "true". This works for for LEX.ORACLE, since all the unquoted identifier have been converted to upper-case. However, it will not work for LEX.MYSQL or other LEX which uses "UNCHANGED" casing. SqlToRElConverter.java:3323 final boolean caseSensitive = true ; // name already fully-qualified e = rexBuilder.makeFieldAccess(e, name, caseSensitive); Therefore, I think we need pass the SqlParser's LEX case-sensitive to validator, and probably SqlRelConverter. That way, it will work for either LEX.ORACLE or other LEX configuration. Also, I see probably almost all the queries in JdbcTest.java uses quoted column names. If I change column name to unquoted, then Calcite will complain "Column 'EMPID' not found in any table" error, for the following simple query: select empid from "hr" . "emps" That probably happens because CatalogReader use case-sensitive=true, while the unquoted identifier have been converted to upper-case. Any suggestion? Thanks!
        Hide
        julianhyde Julian Hyde added a comment -

        I agree, we probably miss a few bugs in case-insensitive validation because we use Lex.ORACLE. We would miss some other bugs if the house style were something else. Feel free to add tests for different case-sensitivity settings.

        But it's good that there is a "house style" if a test is checking something unrelated to parsing/validation, e.g. making sure that the SUBSTRING function returns the right result. And for better or worse, the house style is Oracle.

        That line in SqlToRelConverter is correct. As I said earlier, in RelNode/RexNode world, field names are case sensitive. I put the constant on a separate line because I wanted to be explicit.

        Show
        julianhyde Julian Hyde added a comment - I agree, we probably miss a few bugs in case-insensitive validation because we use Lex.ORACLE. We would miss some other bugs if the house style were something else. Feel free to add tests for different case-sensitivity settings. But it's good that there is a "house style" if a test is checking something unrelated to parsing/validation, e.g. making sure that the SUBSTRING function returns the right result. And for better or worse, the house style is Oracle. That line in SqlToRelConverter is correct. As I said earlier, in RelNode/RexNode world, field names are case sensitive. I put the constant on a separate line because I wanted to be explicit.
        Hide
        jni Jinfeng Ni added a comment -

        If we always use case sensitive in RelNode/RexNode world, regardless whatever Lex we uses, then all unquoted SqlIdentifier have to be casted to either UPPER-CASE or LOWER-case, when we do full-qualified. Otherwise, how can two identifier with different casing match ?

           SELECT COL1 from ( select col1 from T1);
        

        Here, the field name in row type of subquery will be "col1", while the qualified SqlIdentifer is still "COL1" with current Calcite. That will hit "Column 'COL1' does not exit" error.

        The current code simply would not work for all case-insensitivte validation when we use Lex other than Lex.ORACLE, since the field name or fully-qualified SqlIdentifer are created without knowledge of case-sensitivy.

        Show
        jni Jinfeng Ni added a comment - If we always use case sensitive in RelNode/RexNode world, regardless whatever Lex we uses, then all unquoted SqlIdentifier have to be casted to either UPPER-CASE or LOWER-case, when we do full-qualified. Otherwise, how can two identifier with different casing match ? SELECT COL1 from ( select col1 from T1); Here, the field name in row type of subquery will be "col1", while the qualified SqlIdentifer is still "COL1" with current Calcite. That will hit "Column 'COL1' does not exit" error. The current code simply would not work for all case-insensitivte validation when we use Lex other than Lex.ORACLE, since the field name or fully-qualified SqlIdentifer are created without knowledge of case-sensitivy.
        Hide
        jni Jinfeng Ni added a comment -

        SqlParser is taking Lex's case sensitivity. However, SqlValidatorScope does not. It's fullQualify(SqlIdentifer identifier) will simply keep the original casing; it's essentially UNCHANGED. For Lex other than Lex.ORACLE, since both SqlParser and SqlValidatorScope will keep the casing UNCHANGED, while SqlToRelConverter unfortunately enforce case-sensitive matching, that will make the code completely ignore the Lex's case insensitive policy, and hence hit the error of "column not found".

        There are 2 options:
        1) SqlValidator convert all SqlIdentifer to UPPER-CASE, if Lex's caseSensitive = false, even when Lex.unquotedCasing = "UNCHANGED". Then, the downstream operators could enforce case-sensitive matching, just like what SqlToRelConverter is doing today. Also, the field names in row type should be converted to UPPER-CASE as well.

        2) SqlValidator keep the current behavior, but all the consequent matching should know whether it's case-sensitive or case-in-sensitive. (I could see several places need change, in particular SqlToRelConverter).

        Suggestion?

        Show
        jni Jinfeng Ni added a comment - SqlParser is taking Lex's case sensitivity. However, SqlValidatorScope does not. It's fullQualify(SqlIdentifer identifier) will simply keep the original casing; it's essentially UNCHANGED. For Lex other than Lex.ORACLE, since both SqlParser and SqlValidatorScope will keep the casing UNCHANGED, while SqlToRelConverter unfortunately enforce case-sensitive matching, that will make the code completely ignore the Lex's case insensitive policy, and hence hit the error of "column not found". There are 2 options: 1) SqlValidator convert all SqlIdentifer to UPPER-CASE, if Lex's caseSensitive = false, even when Lex.unquotedCasing = "UNCHANGED". Then, the downstream operators could enforce case-sensitive matching, just like what SqlToRelConverter is doing today. Also, the field names in row type should be converted to UPPER-CASE as well. 2) SqlValidator keep the current behavior, but all the consequent matching should know whether it's case-sensitive or case-in-sensitive. (I could see several places need change, in particular SqlToRelConverter). Suggestion?
        Hide
        jnadeau Jacques Nadeau added a comment -

        I think there are a couple disconnects here. Calcite's default behavior is to be an ordinal based system. However, it can work as either an Ordinal or a Name-based system. In fact, Drill has been relying on its name-based capabilities for more than a year.

        I also think we're mixing up case sensitivity and case-preserving. For example, Drill is case insensitive but case preserving.

        Julian Hyde: I'm confused on your statement on the "no bug or sql to rel". It sounds like you're saying that Calcite shouldn't be doing column renames in the Rel classes. Wherever Calcite decides that t1.ColumnA and t2.columnA are the same or different should be case aware. Where do you think this should be done?

        Show
        jnadeau Jacques Nadeau added a comment - I think there are a couple disconnects here. Calcite's default behavior is to be an ordinal based system. However, it can work as either an Ordinal or a Name-based system. In fact, Drill has been relying on its name-based capabilities for more than a year. I also think we're mixing up case sensitivity and case-preserving. For example, Drill is case insensitive but case preserving. Julian Hyde : I'm confused on your statement on the "no bug or sql to rel". It sounds like you're saying that Calcite shouldn't be doing column renames in the Rel classes. Wherever Calcite decides that t1.ColumnA and t2.columnA are the same or different should be case aware. Where do you think this should be done?
        Hide
        julianhyde Julian Hyde added a comment -

        Wherever Calcite decides that t1.ColumnA and t2.columnA are the same or different should be case aware

        I disagree. All of the lexing and case-sensitivity options concern SQL parsing and validation. They have nothing to do with RelNode field names. RelNode field names happen to be based on SQL field names but we do not guarantee it.

        This case is framed as a bug whereas it seems that you have been misunderstanding the spec. If we can agree on that, maybe we can reboot this discussion.

        Show
        julianhyde Julian Hyde added a comment - Wherever Calcite decides that t1.ColumnA and t2.columnA are the same or different should be case aware I disagree. All of the lexing and case-sensitivity options concern SQL parsing and validation. They have nothing to do with RelNode field names. RelNode field names happen to be based on SQL field names but we do not guarantee it. This case is framed as a bug whereas it seems that you have been misunderstanding the spec. If we can agree on that, maybe we can reboot this discussion.
        Hide
        jnadeau Jacques Nadeau added a comment -

        Julian Hyde, picking this (old) thread up. We had a good conversation about this shortly after this discussion (14 months ago) and I forgot to report back to it here.

        My understanding is that field names in the rel tree are not meaningful. Therefore, there should be no expectation of field names having any meaning. My question is: given that, why would the join relnode even rename columns or interact with them at all? As I understand it, consumers of Calcite must interact with the parser to determine the output field names and shouldn't interrogate the relnodes at all for this purpose. I think the initial disconnect above is probably in part due to the fact that relnodes even have field names and that they interact with them. Shouldn't relnodes be solely focused on ordinals? Thoughts?

        Show
        jnadeau Jacques Nadeau added a comment - Julian Hyde , picking this (old) thread up. We had a good conversation about this shortly after this discussion (14 months ago) and I forgot to report back to it here. My understanding is that field names in the rel tree are not meaningful. Therefore, there should be no expectation of field names having any meaning. My question is: given that, why would the join relnode even rename columns or interact with them at all? As I understand it, consumers of Calcite must interact with the parser to determine the output field names and shouldn't interrogate the relnodes at all for this purpose. I think the initial disconnect above is probably in part due to the fact that relnodes even have field names and that they interact with them. Shouldn't relnodes be solely focused on ordinals? Thoughts?
        Hide
        julianhyde Julian Hyde added a comment -

        Shouldn't relnodes be solely focused on ordinals?

        Basically, yes. Rules should just use ordinals.

        The row-type is a struct type, and therefore we have to give the fields names, and those names have to be unique. I suppose that we could make sure that the field names are always $0, $1 etc. But it's traditional, and useful, to try to preserve meaningful field names for the benefit of a human debugging a plan. (To do this, we sometimes have to assign field names like $f0 (where we can't derive a field name from source), and we sometimes need to add suffixes to ensure that field names are unique.)

        I like that tradition, and don't feel a strong need to change it. But developers need to remember that the field names cannot be relied upon.

        Show
        julianhyde Julian Hyde added a comment - Shouldn't relnodes be solely focused on ordinals? Basically, yes. Rules should just use ordinals. The row-type is a struct type, and therefore we have to give the fields names, and those names have to be unique. I suppose that we could make sure that the field names are always $0, $1 etc. But it's traditional, and useful, to try to preserve meaningful field names for the benefit of a human debugging a plan. (To do this, we sometimes have to assign field names like $f0 (where we can't derive a field name from source), and we sometimes need to add suffixes to ensure that field names are unique.) I like that tradition, and don't feel a strong need to change it. But developers need to remember that the field names cannot be relied upon.
        Hide
        julianhyde Julian Hyde added a comment -

        One more thing. In the RelBuilder, you CAN rely on field names. And RelBuilder even has a concept of table aliases (assigned using RelBuilder.as(String), and used in methods such as RelBuilder.field(String, String)).

        If you add a trivial project, RelBuilder might choose not to add a Project node but you can still refer to the field names.

        Show
        julianhyde Julian Hyde added a comment - One more thing. In the RelBuilder, you CAN rely on field names. And RelBuilder even has a concept of table aliases (assigned using RelBuilder.as(String) , and used in methods such as RelBuilder.field(String, String) ). If you add a trivial project, RelBuilder might choose not to add a Project node but you can still refer to the field names.
        Hide
        jnadeau Jacques Nadeau added a comment -

        Makes sense.

        It seems like if struct requires unique columns, the uniqueness requirement should be based on the case handling policy. Otherwise, it seems like we shouldn't enforce a uniqueness requirement as the requirement is incorrect/misleading depending on the case handling.

        Show
        jnadeau Jacques Nadeau added a comment - Makes sense. It seems like if struct requires unique columns, the uniqueness requirement should be based on the case handling policy. Otherwise, it seems like we shouldn't enforce a uniqueness requirement as the requirement is incorrect/misleading depending on the case handling.
        Hide
        julianhyde Julian Hyde added a comment -

        Jacques Nadeau, The natural thing is for the case-sensitivity policy to apply only to SQL and validation, not at the RelNode/RexNode level.

        But it seems that Drill needs otherwise, so I suppose we can accommodate it. We could, say, put a "uniquifier" into the type factory.

        Show
        julianhyde Julian Hyde added a comment - Jacques Nadeau , The natural thing is for the case-sensitivity policy to apply only to SQL and validation, not at the RelNode/RexNode level. But it seems that Drill needs otherwise, so I suppose we can accommodate it. We could, say, put a "uniquifier" into the type factory.
        Show
        minjikim MinJi Kim added a comment - https://github.com/apache/calcite/pull/245
        Show
        julianhyde Julian Hyde added a comment - Fixed in http://git-wip-us.apache.org/repos/asf/calcite/commit/acd27fde . Jacques Nadeau , Jinfeng Ni , MinJi Kim , Thanks for the PR!
        Hide
        jcamachorodriguez Jesus Camacho Rodriguez added a comment -

        Resolved in release 1.9.0 (2016-09-22)

        Show
        jcamachorodriguez Jesus Camacho Rodriguez added a comment - Resolved in release 1.9.0 (2016-09-22)

          People

          • Assignee:
            julianhyde Julian Hyde
            Reporter:
            amansinha100 Aman Sinha
          • Votes:
            0 Vote for this issue
            Watchers:
            7 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development