Uploaded image for project: 'Calcite'
  1. Calcite
  2. CALCITE-1998

Hive - Version specific handling for NULLS FIRST/ NULLS LAST

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.15.0
    • Fix Version/s: 1.15.0
    • Component/s: core
    • Labels:
      None

      Description

      This JIRA is for PR https://github.com/apache/calcite/pull/545

      We are making the HiveSqlDialect version aware for the NULLS FIRST and NULLS LAST feature. In https://issues.apache.org/jira/browse/HIVE-12994 , the authors clarified that the default NullCollation for Hive is "NullDirection.LOW". Currently, the DEFAULT set in Calcite for Hive is NullCollation.HIGH

      In this PR, we are making 2 changes.

      1. Change the default NullCollation from HIGH to LOW
      2. Add NullCollation emulation in the HiveSqlDialect when the version of the dialect is less that 2.1.0 or when the version is blank.

      We're also adding a new Dialect "BigQuerySqlDialect" with the "quoteString" as "`" based on https://cloud.google.com/bigquery/docs/reference/standard-sql/lexical

        Activity

        Hide
        julianhyde Julian Hyde added a comment -

        Resolved in release 1.15.0 (2017-12-11).

        Show
        julianhyde Julian Hyde added a comment - Resolved in release 1.15.0 (2017-12-11).
        Hide
        julianhyde Julian Hyde added a comment - - edited

        I was initially worried that we needed to generate "order by isnull(x), x" for MySQL, but I tried "order by x is null, x" and it worked fine, so I think we're OK.

        Show
        julianhyde Julian Hyde added a comment - - edited I was initially worried that we needed to generate "order by isnull(x), x" for MySQL, but I tried "order by x is null, x" and it worked fine, so I think we're OK.
        Hide
        nerrve Abbas Gadhia added a comment - - edited

        Hi Julian Hyde, I just noticed that MysqlSqlDialect still uses "x is null" instead of isnull(x). Did you intend to leave it that way or was it an oversight?

        Show
        nerrve Abbas Gadhia added a comment - - edited Hi Julian Hyde , I just noticed that MysqlSqlDialect still uses "x is null" instead of isnull(x). Did you intend to leave it that way or was it an oversight?
        Hide
        julianhyde Julian Hyde added a comment -
        Show
        julianhyde Julian Hyde added a comment - Fixed in http://git-wip-us.apache.org/repos/asf/calcite/commit/45425103 ; thanks for the PR, Abbas Gadhia !
        Hide
        julianhyde Julian Hyde added a comment -

        You don't need to do anything. I am still testing. I will get back to it in a day or two. Sorry for the delay.

        Show
        julianhyde Julian Hyde added a comment - You don't need to do anything. I am still testing. I will get back to it in a day or two. Sorry for the delay.
        Hide
        nerrve Abbas Gadhia added a comment -

        Julian Hyde Do you need me to work on anything else on this issue before it is merged into master?

        Show
        nerrve Abbas Gadhia added a comment - Julian Hyde Do you need me to work on anything else on this issue before it is merged into master?
        Hide
        julianhyde Julian Hyde added a comment -

        Reviewing https://github.com/apache/calcite/pull/545/commits/2262b29d85628cd1778b1d2203efba56c8cbd2c7.

        Thanks for adding the optimization for NULLS FIRST/NULLS LAST.

        You changed MySQL dialect to generate "x IS NULL" rather than "ISNULL(x)". No tests failed due to that change, but I think that's because we're not pushing ORDER BY down correctly. So, I made the MySQL dialect generate ISNULL as before.

        I tidied up the tests you added (there was quite a lot of redundancy), restored the SqlDialect.databaseVersion field, and renamed DatabaseProduct.BIGQUERY to BIG_QUERY. With that, it looks good; after tests pass I will commit.

        Show
        julianhyde Julian Hyde added a comment - Reviewing https://github.com/apache/calcite/pull/545/commits/2262b29d85628cd1778b1d2203efba56c8cbd2c7 . Thanks for adding the optimization for NULLS FIRST/NULLS LAST. You changed MySQL dialect to generate "x IS NULL" rather than "ISNULL(x)". No tests failed due to that change, but I think that's because we're not pushing ORDER BY down correctly. So, I made the MySQL dialect generate ISNULL as before. I tidied up the tests you added (there was quite a lot of redundancy), restored the SqlDialect.databaseVersion field, and renamed DatabaseProduct.BIGQUERY to BIG_QUERY. With that, it looks good; after tests pass I will commit.
        Hide
        nerrve Abbas Gadhia added a comment -

        Julian Hyde Yes. Will send across the recommended changes by Monday. Would that be ok?

        Show
        nerrve Abbas Gadhia added a comment - Julian Hyde Yes. Will send across the recommended changes by Monday. Would that be ok?
        Hide
        julianhyde Julian Hyde added a comment -

        Abbas Gadhia, Do you think you'll have chance to work on this soon?

        Show
        julianhyde Julian Hyde added a comment - Abbas Gadhia , Do you think you'll have chance to work on this soon?
        Hide
        christian.beikov Christian Beikov added a comment -

        Looks nice, I added some comments to the PR. When that's done, I'm fine with merging this.

        Show
        christian.beikov Christian Beikov added a comment - Looks nice, I added some comments to the PR. When that's done, I'm fine with merging this.
        Hide
        nerrve Abbas Gadhia added a comment - - edited

        I've made the optimization Julian recommended i.e the one that does not add the emulation of "is null" when it is not needed.
        I've also changed the emulation to use the standard IS NULL function instead of the Hive and MySQL specific ISNULL() functions.
        Also, i've started to use the major and minor versions of the DatabaseMetadata in the Dialect and have removed the version "string". This means that the dependency on "maven-artifiact" has also gone away.
        Also fixed the oversight inside BigSqlQueryDialect

        Show
        nerrve Abbas Gadhia added a comment - - edited I've made the optimization Julian recommended i.e the one that does not add the emulation of "is null" when it is not needed. I've also changed the emulation to use the standard IS NULL function instead of the Hive and MySQL specific ISNULL() functions. Also, i've started to use the major and minor versions of the DatabaseMetadata in the Dialect and have removed the version "string". This means that the dependency on "maven-artifiact" has also gone away. Also fixed the oversight inside BigSqlQueryDialect
        Hide
        julianhyde Julian Hyde added a comment - - edited

        An observation about the SqlDialect.emulateNullDirection(SqlNode node, boolean nullsFirst) method (not this particular change).

        To implement "select ... from t order by x nulls first", I notice that we would generate "select ... from t order by isnull(x), x" when "select ... from t order by x" would have the same effect and be more efficient (because Hive sorts nulls low). (For "select ... from t order by x desc nulls first", we can't improve on "select ... from t order by isnull(x), x desc".)

        To make that optimization, emulateNullDirection would need to be told whether it is an ascending or descending sort, i.e. would need an extra parameter, boolean desc.

        Show
        julianhyde Julian Hyde added a comment - - edited An observation about the SqlDialect.emulateNullDirection(SqlNode node, boolean nullsFirst) method (not this particular change). To implement "select ... from t order by x nulls first", I notice that we would generate "select ... from t order by isnull(x), x" when "select ... from t order by x" would have the same effect and be more efficient (because Hive sorts nulls low). (For "select ... from t order by x desc nulls first", we can't improve on "select ... from t order by isnull(x), x desc".) To make that optimization, emulateNullDirection would need to be told whether it is an ascending or descending sort, i.e. would need an extra parameter, boolean desc .
        Hide
        julianhyde Julian Hyde added a comment -

        You don't need to parse version strings. JDBC makes major and minor versions available (as int values).

        BigSqlQueryDialect.DEFAULT creates a Hive dialect by mistake.

        I don't feel strongly whether Hive dialect should look at the version in the constructor or in the emulateNullDirection method. But don't add any fields to SqlDialect, only to the Hive sub-class.

        Show
        julianhyde Julian Hyde added a comment - You don't need to parse version strings. JDBC makes major and minor versions available (as int values). BigSqlQueryDialect.DEFAULT creates a Hive dialect by mistake. I don't feel strongly whether Hive dialect should look at the version in the constructor or in the emulateNullDirection method. But don't add any fields to SqlDialect, only to the Hive sub-class.
        Hide
        nerrve Abbas Gadhia added a comment -
        I don't think it's a good idea to put the product version in the dialect
        

        I agree. I'll make the change.

        I'm also not sure about the dependency, the version checking can be easily implemented within the code base
        

        If you look at this stackoverflow question, the different answers there show that there's a bit of complexity into this that I would rather not reinvent. I know that Hadoop chose to copy paste this exact class i.e

        ComparableVersion
        

        into their code base to avoid the "maven-artifact" dependency. We can do the same

        Show
        nerrve Abbas Gadhia added a comment - I don't think it's a good idea to put the product version in the dialect I agree. I'll make the change. I'm also not sure about the dependency, the version checking can be easily implemented within the code base If you look at this stackoverflow question, the different answers there show that there's a bit of complexity into this that I would rather not reinvent. I know that Hadoop chose to copy paste this exact class i.e ComparableVersion into their code base to avoid the "maven-artifact" dependency. We can do the same
        Hide
        christian.beikov Christian Beikov added a comment -

        I don't think it's a good idea to put the product version in the dialect. Just let the constructor of the dialect determine capabilities based on version rather then recheck the capabilities based on version every time. I'm also not sure about the dependency, the version checking can be easily implemented within the code base.

        Show
        christian.beikov Christian Beikov added a comment - I don't think it's a good idea to put the product version in the dialect. Just let the constructor of the dialect determine capabilities based on version rather then recheck the capabilities based on version every time. I'm also not sure about the dependency, the version checking can be easily implemented within the code base.

          People

          • Assignee:
            julianhyde Julian Hyde
            Reporter:
            nerrve Abbas Gadhia
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development